Evaluating LLMs' Divergent Thinking Capabilities for Scientific Idea Generation with Minimal Context
While Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design (e.g., accurately extracting key findings from papers or generating coherent experimental procedures), existing evaluation benchmarks primarily assess performance using rich contextual inputs. We introduce LiveIdeaBench, a comprehensive benchmark that evaluates LLMs’ scientific idea generation by assessing divergent thinking capabilities using single-keyword prompts. Drawing from Guilford’s creativity theory, our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Through extensive experimentation with over 40 leading models across 1,180 keywords spanning 22 scientific domains, we reveal that the scientific idea generation capabilities measured by our benchmark are poorly predicted by standard metrics of general intelligence. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores. These findings highlight the need for specialized evaluation benchmarks for scientific idea generation and suggest that enhancing these idea generation capabilities in LLMs may require different training strategies than those used for improving general problem-solving abilities, potentially enabling a wider range of AI tools tailored for different stages of the scientific process.
💡 Research Summary
The paper introduces LiveIdeaBench, a novel benchmark designed to evaluate large language models (LLMs) on their ability to generate scientific ideas from minimal context, specifically using single‑keyword prompts. While existing benchmarks assess LLM performance on tasks that rely on rich inputs such as abstracts or full papers, LiveIdeaBench isolates the divergent‑thinking component of scientific creativity by providing only a keyword and asking the model to produce multiple novel research ideas.
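To illustrate the minimal-context protocol described above, the sketch below shows how a single-keyword prompt might be assembled. The template wording and the `build_prompt` helper are hypothetical illustrations, not the benchmark's actual prompt; only the keyword itself carries any domain context.

```python
def build_prompt(keyword: str, n_ideas: int = 5) -> str:
    """Assemble a minimal-context prompt from one scientific keyword.

    The wording here is a hypothetical stand-in for the benchmark's
    real prompt; the key property is that the keyword is the only
    domain-specific input the model receives.
    """
    return (
        "You are a scientist brainstorming new research directions.\n"
        f"Keyword: {keyword}\n"
        f"Propose {n_ideas} distinct, novel research ideas related to "
        "this keyword. Each idea should be original yet feasible."
    )

prompt = build_prompt("CRISPR", n_ideas=5)
```

A generation run would iterate this over all 1,180 keywords, collecting each model's ideas for downstream judging.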
The benchmark is grounded in Guilford’s creativity theory and measures five dimensions: originality, feasibility, fluency, flexibility, and clarity. Generation (the “Idea LLM” stage) is performed on 1,180 carefully curated scientific keywords spanning 22 domains (e.g., quantum entanglement, CRISPR, carbon nanotubes). For each keyword, each model produces 5–10 distinct ideas. Evaluation (the “Judge LLM” stage) employs a dynamic panel of the top‑10 state‑of‑the‑art models, which automatically score each idea on originality, feasibility, and clarity. Fluency is quantified by analyzing the semantic and lexical diversity among the ideas generated for the same keyword, while flexibility is derived as the 30th percentile of the averaged scores across the other four dimensions. This design yields a multidimensional profile for every model, allowing fine‑grained comparison beyond a single aggregate score.
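The flexibility derivation above can be sketched in a few lines. The data layout, function names, and toy score values below are assumptions for illustration; only the rule itself (flexibility as the 30th percentile of per-keyword averages over the other four dimensions) comes from the paper's description.

```python
import math

def percentile(values, p):
    """Linear-interpolation percentile, with p in [0, 100]."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return s[int(k)]
    return s[lo] * (hi - k) + s[hi] * (k - lo)

def flexibility(per_keyword_scores):
    """Flexibility = 30th percentile of the per-keyword averages over
    the other four dimensions (originality, feasibility, fluency,
    clarity). The dict layout is a hypothetical illustration."""
    dims = ("originality", "feasibility", "fluency", "clarity")
    averages = [
        sum(scores[d] for d in dims) / len(dims)
        for scores in per_keyword_scores.values()
    ]
    return percentile(averages, 30)

# Toy scores for five keywords (values invented for illustration).
scores = {
    "quantum entanglement": {"originality": 8.0, "feasibility": 6.0, "fluency": 7.0, "clarity": 9.0},
    "CRISPR":               {"originality": 6.0, "feasibility": 8.0, "fluency": 6.0, "clarity": 8.0},
    "carbon nanotubes":     {"originality": 9.0, "feasibility": 5.0, "fluency": 8.0, "clarity": 8.0},
    "dark matter":          {"originality": 5.0, "feasibility": 6.0, "fluency": 6.0, "clarity": 7.0},
    "protein folding":      {"originality": 8.0, "feasibility": 8.0, "fluency": 8.0, "clarity": 8.0},
}
```

Taking a low percentile rather than the mean penalizes models whose creativity is uneven across keywords, which matches the intuition that flexibility rewards consistent breadth rather than a few strong domains.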
The authors evaluated over 40 LLMs—including open‑source and proprietary systems—across the full keyword set. Key findings include: (1) Standard general‑intelligence benchmarks (e.g., MMLU, BIG‑Bench) correlate weakly with the creativity scores, confirming that intelligence and divergent thinking are largely independent in LLMs, mirroring classic human psychology results. (2) Smaller models such as QwQ‑32B‑preview achieve originality and feasibility scores comparable to much larger models like claude‑3.7‑sonnet:thinking, suggesting that model size alone does not dictate creative capability and that training objectives matter. (3) Most models excel in fluency (they can generate many ideas) but display substantial variance in flexibility and originality, indicating a tendency toward “creative homogeneity” where ideas are numerous but not sufficiently diverse. (4) Automatic judges align well with human expert ratings for clarity, but tend to over‑estimate feasibility when domain‑specific knowledge is lacking, highlighting the need for hybrid human‑LLM evaluation pipelines for certain dimensions.
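Finding (1) above rests on a rank-correlation analysis between general-intelligence benchmark scores and creativity scores. The self-contained Spearman implementation below sketches how such a check could be run; the actual per-model score vectors are not reproduced here.

```python
def rank(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Assumes at least two distinct values in each input (otherwise the
    denominator is zero).
    """
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A rho near zero across the model population would support the paper's claim that general intelligence and divergent thinking are largely independent in LLMs.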
Beyond the empirical results, LiveIdeaBench itself is presented as an extensible infrastructure. The judging panel is dynamic—researchers can replace or retrain judge models as newer, less biased evaluators become available. A public leaderboard tracks model performance across all five dimensions in real time, facilitating transparent benchmarking and rapid iteration on new model releases.
The authors argue that improving scientific idea‑generation in LLMs likely requires training strategies distinct from those that boost general problem‑solving abilities. Potential avenues include: (i) prompting techniques that explicitly encourage divergent thinking, (ii) integration of domain‑specific knowledge graphs to enrich the semantic space from which ideas are drawn, and (iii) reinforcement‑learning‑from‑human‑feedback (RLHF) rewards that prioritize diversity, novelty, and practical relevance rather than just correctness.
In conclusion, LiveIdeaBench fills a critical gap in AI evaluation by providing a systematic, scalable, and theory‑driven method to assess the creative, divergent‑thinking component of scientific discovery. The benchmark’s findings underscore that LLMs’ capacity for scientific innovation does not automatically scale with their general intelligence scores, and that dedicated research on creativity‑focused training and evaluation is essential for building AI tools that can meaningfully contribute to the early stages of the scientific process.