Estimating problem difficulty without ground truth using Large Language Model comparisons
📝 Abstract
Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e. problems currently unsolvable by humans and LLMs, because they are not scalable, are time-consuming, and depend on ground truth. Therefore, we propose a new method for estimating problem difficulty, LLM compare, which addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed from the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes (construction, scale, and dependence), identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic, and independent of ground truth information. Second, we show that LLM compare demonstrates strong alignment with human annotations: Pearson $r \geq 0.80$ for $n = 1876$. Third, we show that LLM compare is robust to hallucinations, with less than $6\%$ degradation in Pearson correlation under $10\%$ noise injection. Our work represents a significant step towards replacing time-consuming human annotation in synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.
📄 Content
Scaling the finetuning of large language models (LLMs) marks a step change in their capabilities, as it consistently raises overall performance on established benchmarks [4,16,21,27,35,39]. Current state-of-the-art models have already saturated long-standing benchmarks like MATH [20] and GSM8K [26]. Newer benchmarks like Omni-Math [13] and GPQA Diamond [31] are more challenging, yet are expected to reach saturation soon and already raise concerns about data leakage. Recently introduced datasets such as FrontierMath [14] and Humanity’s Last Exam [30] attempt to outpace model capabilities through expert-authored questions, but unfortunately their manual collection process is not scalable. The limited number of high-quality and sufficiently difficult question-response pairs thus necessitates the creation of synthetic, LLM-generated data [32,36]. Here, the ultimate goal is to create synthetic out-of-distribution data, i.e. problems that are currently unsolvable by both humans and LLMs. In this setting, a key challenge is finding a method for estimating problem difficulty.
The concept of problem difficulty has been approached and measured in various ways. Traditionally, problem difficulty was measured in a human-centric way. In educational assessment, Classical Test Theory (CTT) and Item Response Theory (IRT) define difficulty as a statistical property of test items [17,5], typically represented by student performance scores. Bloom’s Taxonomy and NASA’s Task Load Index provide difficulty assessment frameworks to categorize tasks according to their cognitive workload level [1,18,19]. Other approaches include manual calibration through expert judgment [7] or pre-testing [24], and comparative judgment techniques (pairwise item comparisons), which often yield more reliable difficulty rankings than independent item ratings [3]. However, as these methods fundamentally depend on human responses, they do not generalize to newly generated, out-of-distribution problems.
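To make the human-centric baseline concrete, the Classical Test Theory notion of item difficulty mentioned above is simply the proportion of respondents who answer an item correctly. The sketch below is an illustrative implementation of that statistic (function and variable names are our own, not from the paper); it also makes clear why such measures break down for out-of-distribution problems: without student responses, there is nothing to average.

```python
def ctt_difficulty(responses):
    """Classical Test Theory item difficulty: the proportion of
    respondents answering each item correctly (higher = easier,
    so difficulty rankings invert this value).

    responses: list of per-student lists, responses[s][i] in {0, 1}
    for student s on item i.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(r[i] for r in responses) / n_students
            for i in range(n_items)]
```

For example, three students answering two items as `[[1, 0], [1, 1], [1, 0]]` give item scores of 1.0 and 1/3: the first item is easy, the second hard.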
In mathematics, economics, and computer science, researchers have proposed formal measures to quantify how challenging a problem is to solve [8,10,29], yet these measures often have a very narrow scope (e.g. the number of counter-examples necessary to disprove a theorem) or are incomputable.
The ability of LLMs to assess the inherent difficulty of problems has received little attention, as they often display poor calibration, reporting high confidence while achieving relatively low accuracy [30,38]. Nevertheless, LLM verifiers are already incorporated into the reinforcement learning pipeline for judging final responses or solution processes [16,33,40], and there is also evidence that they demonstrate promising self-evaluation skills in open-ended sampling tasks [23]. Furthermore, LLMs have been successfully used to assess problem difficulty in various scenarios, e.g. aggregated LLM performance scores [12], supervised learning from question text [5], and in-context difficulty learning [13]. Unfortunately, these LLM-based approaches depend on specific models, context information, and external calibration, limiting their applicability.
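The aggregated performance-score approach cited above can be sketched in a few lines: difficulty is taken as one minus the mean solve rate over a pool of models. This is our own illustrative rendering, not the cited implementations, and it highlights the dependence on ground truth: each attempt must be graded against a reference answer before any rate can be computed.

```python
def performance_difficulty(solve_results):
    """Performance-based difficulty estimate for a single problem:
    one minus the mean per-model solve rate. Requires graded
    (ground-truth-dependent) attempt outcomes.

    solve_results: dict mapping model name -> list of 0/1 outcomes
    over repeated attempts at the same problem.
    """
    rates = [sum(outcomes) / len(outcomes)
             for outcomes in solve_results.values()]
    return 1.0 - sum(rates) / len(rates)
```

A problem no model ever solves scores 1.0, which is exactly the regime where this measure stops discriminating: all out-of-distribution problems collapse to the same maximal difficulty.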
Overview In this paper, we present LLM compare, a new method for estimating problem difficulty that does not rely on ground truth data, such as performance scores or reference answers. In this method, an LLM is repeatedly asked to compare two problems, and difficulty scores are then computed from the outcomes using the Bradley-Terry (BT) model [6]. LLMs have already shown potential in pairwise preference aggregation [9,11,37], and BT-based scoring has proven reliable for aggregating LLM judgments [15,34,41]. To validate our method, we first propose a new taxonomy that positions existing difficulty measures along three orthogonal planes, related to their construction, scale, and dependence. While prior surveys presented taxonomies of pipelines for supervised, text-based difficulty prediction [2,5], our framework classifies the inherent properties of difficulty measures. Second, we compare our method to four types of existing difficulty measures (human labels, human performance, LLM labels, and LLM performance) across three datasets: JEE Advanced Maths 2024, the Cambridge MCQ Reading Dataset, and the subset of algebra questions from Omni-Math. We execute the LLM comparisons with both OpenAI o3 and Gemini 2.5 Pro to assess model dependence.
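The scoring step described above can be sketched as follows. Given a list of pairwise outcomes (the problem judged harder "wins" each comparison), Bradley-Terry strengths can be fitted with the classic minorization-maximization updates of Hunter (2004). This is a minimal illustration under our own naming, not the paper's implementation, and it makes no LLM calls; it only shows how comparison outcomes turn into a continuous difficulty score.

```python
from collections import defaultdict

def bradley_terry(comparisons, n_items, iters=500, tol=1e-9):
    """Fit Bradley-Terry strengths from pairwise outcomes using
    MM updates: p_i <- W_i / sum_j n_ij / (p_i + p_j).

    comparisons: list of (winner, loser) index pairs, where the
    winner is the problem judged harder in that comparison.
    Returns strengths normalized to sum to 1 (higher = harder).
    """
    wins = defaultdict(int)   # wins[i]: comparisons won by item i
    pair = defaultdict(int)   # pair[(i, j)]: comparisons between i and j
    for w, l in comparisons:
        wins[w] += 1
        pair[(min(w, l), max(w, l))] += 1

    p = [1.0] * n_items
    for _ in range(iters):
        new = []
        for i in range(n_items):
            denom = 0.0
            for (a, b), n_ab in pair.items():
                if i == a or i == b:
                    j = b if i == a else a
                    denom += n_ab / (p[i] + p[j])
            new.append(wins[i] / denom if denom > 0 else p[i])
        total = sum(new)
        new = [x / total for x in new]
        if max(abs(x - y) for x, y in zip(new, p)) < tol:
            p = new
            break
        p = new
    return p
```

Each MM update is guaranteed to increase the Bradley-Terry likelihood, so the iteration converges to the maximum-likelihood strengths whenever the comparison graph is connected (every problem can be reached from every other through a chain of comparisons).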
Conclusion LLM compare is a fine-grained, reliable, and broadly applicable alternative to existing measures of difficulty, resolving their key limitations. We show that LLM compare naturally occupies two empty quadrants in the space of difficulty measures: it is the first measure to be both continuous and dynamic, as well as model-agnostic and independent of ground truth information. This makes it the only measure currently suitable for scoring synthetic out-of-distribution problems. Furthermore, we demonstrate that our method correlates positively with all existing types of difficulty measures.
This content is AI-processed based on arXiv data.