Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

Robust benchmarks are essential for accurately reflecting the generalization capabilities of large language models (LLMs). Existing benchmarks that curate questions at the question level suffer from three limitations: vulnerability to data contamination, restriction to single-concept assessment, and reliance on costly domain-expert annotation. We propose Encyclo-K, a statement-based benchmark that extracts standalone knowledge statements from authoritative textbooks and dynamically composes them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space resists memorization while maintaining stable model rankings across question sets; each question aggregates 8-10 statements for comprehensive knowledge assessment; and annotators need only verify formatting compliance, without requiring domain expertise. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges: even OpenAI-GPT-5.1 achieves only 62.07% accuracy, and performance spans a clear gradient across both reasoning models (16.04%-62.07%) and chat models (9.71%-50.40%). Encyclo-K thus achieves robust LLM evaluation through contamination-resistant dynamic generation and comprehensive multi-statement assessment.
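
The test-time composition step is straightforward to illustrate. Below is a minimal sketch, assuming a pre-extracted pool of standalone statements; the function name `compose_question`, the placeholder pool, and the prompt template are illustrative assumptions, not Encyclo-K's actual interface.

```python
import random

def compose_question(pool, k_min=8, k_max=10, seed=None):
    """Compose one evaluation question at test time by randomly
    sampling k_min..k_max statements from the extracted pool."""
    rng = random.Random(seed)  # seedable, so a question set is reproducible
    k = rng.randint(k_min, k_max)
    if k > len(pool):
        raise ValueError("statement pool is smaller than the sample size")
    statements = rng.sample(pool, k)  # sample without replacement
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(statements, 1))
    prompt = (
        "Consider the following knowledge statements. "
        "Identify which of them are correct.\n" + numbered
    )
    return prompt, statements

if __name__ == "__main__":
    # Hypothetical placeholder pool; in practice these are standalone
    # statements extracted from authoritative textbooks.
    pool = [f"Statement {i}: <verified textbook fact>." for i in range(1, 101)]
    prompt, _ = compose_question(pool, seed=42)
    print(prompt)
```

Because each question is a fresh draw of 8-10 statements, the number of distinct questions grows combinatorially with the pool size (choosing 9 out of 1,000 statements already yields on the order of 10^21 combinations), which is why memorizing individual questions is ineffective while model rankings remain stable across draws.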

