Putnam-like dataset summary: LLMs as mathematical competition contestants
In this paper we summarize the results of the Putnam-like benchmark published by Google DeepMind. The dataset consists of 96 original problems in the spirit of the Putnam Competition and 576 LLM-generated solutions. We analyze model performance on these problems to assess how well current LLMs solve mathematical contest problems. We find that top models, particularly Gemini 2.5 Pro, achieve high scores, demonstrating strong mathematical reasoning capabilities, although their performance is lower on problems from the 2024 Putnam competition. The analysis highlights distinct behavioral patterns among models, including bimodal scoring distributions and difficulty providing fully rigorous justifications.
💡 Research Summary
The paper provides a comprehensive analysis of the “Putnam‑like” benchmark released by Google DeepMind, which consists of 96 original Putnam‑style problems and 576 solutions generated by six state‑of‑the‑art large language models (LLMs): Gemini‑2.5‑flash‑04‑17, Gemini‑2.5‑pro‑03‑25, OpenAI’s o3‑mini‑high, OpenAI’s o4‑mini‑high, DeepSeek’s r1, and Anthropic’s Claude sonnet‑3.7. Each problem is categorized by difficulty level (1–6) and mathematical domain (linear algebra, abstract algebra, analysis/inequalities, discrete mathematics, probability, number theory, and polynomials). Human experts graded every solution on a 0‑10 rubric that rewards partial progress, and an automatic Gemini‑based grader also produced scores; the authors treat the human grades as the gold standard.
Overall, the grade distribution is heavily skewed toward high scores: 46 % of all solutions receive a perfect 10, while only 15 % receive a zero. The presence of many intermediate scores (3–7) reflects the fine‑grained rubric, which differs from the binary “final‑answer” benchmarks commonly used in prior work.
Model‑wise performance shows a clear hierarchy. Gemini‑2.5‑pro‑03‑25 attains the highest average (8.7/10) and frequently reaches the maximum, even on the hardest level‑5 and level‑6 problems, indicating a strong ability to produce fully rigorous proofs. Gemini‑2.5‑flash‑04‑17 also performs well (average 7.6) but its answers tend to be verbose, containing many dead‑ends alongside correct reasoning. The OpenAI models achieve moderate scores (average 5.6–6.0). They often generate promising ideas and partial arguments but fail to complete them into fully correct proofs; o4‑mini‑high shows a modest improvement over its predecessor o3‑mini‑high, with fewer partially correct solutions and a slightly higher median. DeepSeek’s r1 is the weakest (average 4.5), typically delivering only outlines without detailed calculations or logical justification. Claude sonnet‑3.7 scores around 6.0, producing well‑written prose but occasionally omitting critical steps or making conceptual slips.
Difficulty‑level analysis reveals that levels 1–2 are solved correctly in over 75 % of cases (≥7 points), while levels 5–6 are substantially harder for all models. Interestingly, level 4 problems receive higher average scores than level 3, suggesting a mismatch between human‑perceived difficulty and the challenges faced by LLMs. Domain‑wise, linear algebra is the easiest (average ≈9), whereas analysis/inequalities and polynomials are the toughest (averages below 7), likely because they demand precise analytical reasoning and avoidance of unjustified numerical approximations.
To gauge the benchmark’s realism, the authors also evaluated the same six models on authentic 2024 Putnam problems using the identical grading rubric. All models scored markedly lower on the real contest (e.g., Gemini‑2.5‑flash‑04‑17 drops from 7.6 to 4.3), confirming that the Putnam‑like dataset is somewhat easier than the actual competition.
Statistical tests support these observations: a two-sample Kolmogorov–Smirnov test finds the score distributions of o4‑mini‑high and sonnet‑3.7 statistically indistinguishable (p = 0.998), while a t‑test confirms a significant difference between level 3 and level 4 scores (p = 0.02).
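The two comparisons described here can be sketched with SciPy's standard two-sample tests. The score arrays below are made-up illustrations on the paper's 0–10 rubric scale, not the benchmark's actual grades:

```python
# Illustrative sketch only: the score lists are invented, not the paper's data.
from scipy import stats

# Hypothetical rubric scores (0-10) for two models being compared.
o4_mini_scores = [10, 7, 3, 10, 0, 6, 8, 10, 5, 2]
sonnet_scores = [10, 6, 4, 10, 0, 7, 8, 10, 5, 3]

# Two-sample Kolmogorov-Smirnov test: compares the full empirical score
# distributions; a high p-value means the samples are indistinguishable.
ks_stat, ks_p = stats.ks_2samp(o4_mini_scores, sonnet_scores)

# Hypothetical per-problem scores at two difficulty levels.
level3_scores = [5, 6, 4, 7, 3, 6, 5]
level4_scores = [8, 7, 9, 6, 8, 7, 9]

# Independent two-sample t-test: compares the mean scores of the two levels;
# a low p-value indicates a significant difference in means.
t_stat, t_p = stats.ttest_ind(level3_scores, level4_scores)

print(f"KS p-value: {ks_p:.3f}")      # high => distributions look alike
print(f"t-test p-value: {t_p:.3f}")   # low  => means differ significantly
```

With real per-solution grades in place of these toy lists, the same two calls reproduce the kind of analysis the authors report.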
The authors conclude that the Gemini models exhibit the most mature mathematical reasoning among current LLMs, delivering detailed, often fully rigorous solutions. However, even the best models still make classical mistakes, occasionally rely on prohibited numerical approximations, and sometimes fail to produce a complete formal proof. The study highlights the need for future work on aligning automatic grading with human expert judgment, integrating LLM‑generated proofs into formal verification systems (e.g., Lean), and expanding benchmarks to cover geometry and combinatorics, which are absent from the current dataset. Overall, the Putnam‑like benchmark serves as a valuable intermediate step toward assessing LLMs’ capabilities in contest‑style mathematical problem solving, but it also underscores the gap that remains before LLMs can reliably match top human contestants on the full spectrum of Putnam challenges.