PhysUniBench: A Multi-Modal Physics Reasoning Benchmark at Undergraduate Level

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Physics problem-solving is a challenging domain for AI models, requiring the integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Existing evaluations fail to capture the full breadth and complexity of undergraduate physics, a level that provides a rigorous yet standardized testbed for assessing multi-step physical reasoning. To this end, we present PhysUniBench, a large-scale multimodal benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs) on undergraduate-level physics problems. PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark includes both open-ended and multiple-choice questions, systematically curated and difficulty-rated through an iterative process. Its construction involved a rigorous multi-stage pipeline: multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced five-level difficulty grading system. Through extensive experiments, we observe that current models face substantial challenges in physics reasoning: GPT-5 achieves only 51.6% accuracy on PhysUniBench. These results highlight that current MLLMs struggle with advanced physics reasoning, especially on multi-step problems and those requiring precise diagram interpretation. By providing a broad and rigorous assessment tool, PhysUniBench aims to drive progress in AI for Science, encouraging the development of models with stronger physical reasoning, problem-solving skills, and multimodal understanding.


💡 Research Summary

PhysUniBench introduces a large‑scale multimodal benchmark specifically designed to evaluate undergraduate‑level physics reasoning in modern multimodal large language models (MLLMs). The authors argue that existing evaluations either focus on K‑12 material or Olympiad‑level problems, which do not capture the breadth, depth, and diagrammatic reasoning required in a typical university physics curriculum. To fill this gap, they curate 3,304 physics questions drawn from authentic university courses across eight core sub‑disciplines: optics, electromagnetism, classical mechanics, quantum mechanics, relativity, solid‑state physics, thermodynamics, and molecular/atomic & subatomic physics. Each question is paired with a single visual diagram, making the benchmark truly multimodal.

The dataset is balanced between open‑ended (OE) and multiple‑choice (MC) formats, with a fine‑grained difficulty rating from 1 (easiest) to 5 (hardest). Approximately 660 questions occupy each difficulty tier, ensuring that models are tested across a spectrum of conceptual and computational challenges. The authors also provide bilingual versions (English and Chinese) to support multilingual evaluation.

Construction proceeds through a multi‑stage pipeline: (1) sourcing from textbooks, lecture notes, and past exams; (2) expert validation of problem statements, solutions, and diagram relevance; (3) automated filtering to discard trivially solvable items (e.g., >90% correct rate on pilot tests); (4) de‑duplication and language normalization; and (5) final difficulty calibration using both expert judgment and statistical analysis of pilot model performance. This rigorous process yields a high‑quality, diverse, and well‑annotated benchmark.
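Step (3), the automated filtering of trivially solvable items, could be sketched as follows. This is a minimal illustration, not the authors' code: the item/roll-out data structures are assumptions, and only the ">90% correct rate" threshold comes from the text.

```python
def filter_trivial(items, pilot_results, threshold=0.9):
    """Drop items that pilot models solve too easily.

    `items` is a list of dicts with an "id" field; `pilot_results` maps
    item id -> list of booleans, one per pilot-model roll-out
    (hypothetical structure). Items above `threshold` solve rate are
    discarded; items with no pilot data are conservatively kept.
    """
    kept = []
    for item in items:
        outcomes = pilot_results.get(item["id"], [])
        rate = sum(outcomes) / len(outcomes) if outcomes else 0.0
        if rate <= threshold:
            kept.append(item)
    return kept

# Toy demo with hypothetical item ids and roll-out outcomes:
items = [{"id": "a"}, {"id": "b"}]
pilot = {"a": [True] * 10,    # solved 10/10 times -> trivially easy, dropped
         "b": [True, False]}  # solved 1/2 times  -> kept
survivors = filter_trivial(items, pilot)
```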

In the experimental section, seven state‑of‑the‑art MLLMs, including GPT‑4o, Qwen2.5‑VL, and Intern‑S1, are evaluated on PhysUniBench. The best-performing model, GPT‑5, attains an overall accuracy of 51.6%, with 59.7% on open‑ended items. Performance varies sharply across sub‑domains: classical mechanics and optics see accuracies below 30%, while quantum mechanics and relativity hover around 40%. The higher difficulty levels (4 and 5) cause a steep drop to 20–35% accuracy. Error analysis reveals three dominant failure modes: (a) mishandling of physical units and constants; (b) incorrect interpretation of diagrammatic cues (e.g., force directions, field lines); and (c) breakdowns in multi‑step reasoning where intermediate symbolic steps are omitted or mis‑ordered.
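Per-sub-domain and per-difficulty accuracies like those above can be aggregated with a simple harness. The sketch below is an assumption about the evaluation, not the paper's released code; the result-record schema (`subdomain`, `difficulty`, `correct`) is hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(results, key):
    """Compute accuracy per group (e.g. sub-domain or difficulty level).

    `results` is a list of dicts, each with the grouping field `key`
    and a boolean "correct" flag (hypothetical schema).
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for r in results:
        bucket = totals[r[key]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {g: c / n for g, (c, n) in totals.items()}

# Toy records, purely illustrative:
results = [
    {"subdomain": "optics", "difficulty": 4, "correct": True},
    {"subdomain": "optics", "difficulty": 2, "correct": False},
    {"subdomain": "quantum", "difficulty": 5, "correct": True},
]
acc = accuracy_by_group(results, "subdomain")
```

The same function applied with `key="difficulty"` gives the per-level breakdown.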

The authors interpret these findings as evidence that current MLLMs, while proficient at textual language and even some mathematical reasoning, lack robust scientific semantics and integrated visual‑textual reasoning needed for authentic physics problem solving. They propose several avenues for improvement: (i) incorporating physics‑aware symbolic modules (e.g., SymPy integration) to handle equations and unit conversions; (ii) designing chain‑of‑thought prompting strategies that explicitly enumerate assumptions, derivations, and verification steps; (iii) training on multimodal curricula that jointly optimize vision encoders and language models for diagram interpretation; and (iv) expanding the benchmark to include experimental data, simulation outputs, and graduate‑level problems to further stress‑test model capabilities.
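The unit-aware symbolic handling suggested in point (i) can be illustrated with a minimal dimension-tracking sketch. This is a pure-Python stand-in for a SymPy-style module, not anything from the paper; the `Quantity` class is hypothetical.

```python
class Quantity:
    """A value with physical dimensions, e.g. {"m": 1, "s": -1} for velocity."""

    def __init__(self, value, dims):
        self.value = value
        self.dims = dims

    def __mul__(self, other):
        # Multiplying quantities adds dimension exponents.
        dims = dict(self.dims)
        for d, p in other.dims.items():
            dims[d] = dims.get(d, 0) + p
        return Quantity(self.value * other.value,
                        {d: p for d, p in dims.items() if p != 0})

    def __add__(self, other):
        # Adding quantities with mismatched dimensions is a physics error.
        if self.dims != other.dims:
            raise ValueError("dimension mismatch")
        return Quantity(self.value + other.value, self.dims)

v = Quantity(3.0, {"m": 1, "s": -1})  # velocity, 3 m/s
t = Quantity(2.0, {"s": 1})           # time, 2 s
d = v * t                             # distance: dimensions reduce to {"m": 1}
```

A module like this would let a model-side checker reject derivations where, say, a time is added to a distance, catching failure mode (a) before the final answer.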

In conclusion, PhysUniBench represents the first large‑scale, multimodal, undergraduate‑level physics benchmark with calibrated difficulty and bilingual support. It provides a rigorous testbed for diagnosing the current limitations of MLLMs and for guiding future research toward models that can truly reason about physics in the way human students do—integrating concepts, mathematics, and visual information seamlessly. The dataset, code, and evaluation scripts will be publicly released, inviting the community to benchmark, improve, and extend this resource.

