Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human and synthetic data to train highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs.


💡 Research Summary

The paper tackles the problem of evaluating textual creativity in large language models (LLMs), a task that has traditionally suffered from two major drawbacks: (1) most existing methods are domain‑specific (e.g., humor, problem‑solving) and therefore cannot generalize across the wide variety of creative tasks, and (2) they rely heavily on costly human judgments that are often inconsistent because “creativity” is a subjective notion without a shared context.

To address these issues, the authors introduce three core contributions.

1. A context‑aware pairwise‑comparison protocol.
Given a single instruction (I), two responses (R₁, R₂) with differing levels of creativity are presented, and the evaluator is asked “Which response is more creative?” Human annotators applied this protocol to over 3,000 instruction‑response pairs. When a shared instruction was provided, the inter‑annotator agreement (Intraclass Correlation Coefficient, ICC) rose from 0.59 (no shared context) to 0.75, demonstrating that a common contextual frame dramatically improves labeling consistency.
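The summary does not reproduce the paper's exact prompt wording, but the protocol above can be sketched as a prompt builder in which both responses are always judged against the same shared instruction (the function name and template text below are illustrative assumptions):

```python
def build_pairwise_prompt(instruction: str, resp_a: str, resp_b: str) -> str:
    """Assemble a shared-context pairwise prompt: both responses are
    judged relative to the same instruction, the framing the paper
    reports raises inter-annotator agreement (ICC 0.59 -> 0.75)."""
    return (
        "Instruction (shared context):\n"
        f"{instruction}\n\n"
        f"Response A:\n{resp_a}\n\n"
        f"Response B:\n{resp_b}\n\n"
        "Which response is more creative? Answer 'A' or 'B'."
    )
```

The key design choice is that the instruction appears once, before both responses, so annotators (or a model) score creativity relative to a common frame rather than in isolation.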

2. CreataSet – a large, cross‑domain dataset.
The dataset construction follows a three‑step pipeline:

  • Dataset initialization: The authors aggregate existing creativity‑focused corpora (e.g., Oogiri‑GO, Ruozhiba) and a broad instruction‑tuning collection (Infinity‑Instruct) covering 87 distinct domains (poetry, lyrics, prose, short inspirational sentences, problem‑solving prompts, etc.). This yields 1,875,146 (instruction, response) pairs.

  • Context‑aware response augmentation: For each instruction, a powerful LLM (GPT‑4o) generates multiple responses spanning a predefined creativity scale (low, medium, high). This synthetic generation produces more than one million labeled pairs, providing the “weakly supervised” portion of the dataset.

  • Label construction with mixed strategy: Responses are paired, and a binary label y indicates whether the first response is judged more creative than the second. Labels are derived either from human judgments (for the test split) or from a confidence model trained on a small human‑annotated seed. The final training instances have the form (I, R₁, R₂, y).

The data are categorized into three types: Type A (existing creative pairs), Type B (stand‑alone texts enriched with generated instructions), and Type C (standard instruction‑response pairs). This taxonomy ensures diversity in length, style, and domain, which is crucial for robust generalization.
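For the synthetic portion of the pipeline, the label construction can be sketched as follows. This is a minimal illustration assuming only the low/medium/high creativity scale described above; the actual dataset also mixes in human judgments and a confidence model, and the names `PairwiseInstance` and `make_pairs` are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PairwiseInstance:
    instruction: str   # shared context I
    response_1: str    # R1
    response_2: str    # R2
    label: int         # y = 1 if R1 is judged more creative than R2, else 0

def make_pairs(instruction, responses_by_level):
    """Pair responses drawn from different rungs of the creativity
    scale, labeling the higher-rung response as more creative.
    `responses_by_level` maps 'low'/'medium'/'high' to response lists."""
    order = {"low": 0, "medium": 1, "high": 2}
    levels = sorted(responses_by_level, key=order.get)
    pairs = []
    for i, lo in enumerate(levels):
        for hi in levels[i + 1:]:
            for r_lo in responses_by_level[lo]:
                for r_hi in responses_by_level[hi]:
                    pairs.append(PairwiseInstance(instruction, r_hi, r_lo, 1))
    return pairs
```

Because labels come from the relative position on the scale rather than absolute scores, each generated response can participate in many training pairs, which is how roughly one million instructions expand into the weakly supervised pair set.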

3. CrEval – an LLM‑based creativity evaluator.
CrEval is built by fine‑tuning a GPT‑4‑style backbone on CreataSet with a binary classification head. Training employs a combination of contrastive loss and pairwise ranking loss to directly optimize the model’s ability to pick the more creative response in a pair.
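The summary names the losses but gives no formulas. A standard choice for the pairwise ranking term is a Bradley-Terry style logistic loss on the difference of the two responses' scalar scores; the sketch below assumes that formulation (the paper's exact objective may differ):

```python
import math

def pairwise_ranking_loss(score_1: float, score_2: float, label: int) -> float:
    """Bradley-Terry logistic loss on a score difference.
    label = 1 means response 1 should outscore response 2."""
    diff = score_1 - score_2
    # modeled probability that response 1 is the more creative one
    p = 1.0 / (1.0 + math.exp(-diff))
    target = float(label)
    # binary cross-entropy on the comparison outcome
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
```

Minimizing this loss pushes the score of the preferred response above that of the other, which is exactly the "pick the more creative response" behavior the evaluator needs at inference time.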

Experimental findings

  • Human alignment: On the 3,000‑pair test set, CrEval achieves an average ICC of 0.78, outperforming GPT‑4o (0.61) and several heuristic baselines (unique‑n‑gram count, Divergent Semantic Integration, etc.), which hover around 0.45.

  • Domain generalization: In zero‑shot tests on 20 unseen domains, CrEval maintains a 12 % accuracy boost over GPT‑4o, confirming that the cross‑domain training data successfully mitigates over‑fitting to any single genre.

  • Boosting LLM creativity: CrEval is used as a reward model in reinforcement learning and as a re‑ranking scorer in a self‑refine loop. When applied to a base LLM, human judges rate the generated outputs about 9 % more creative on average, indicating that an accurate evaluator can directly improve generation quality.
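The re-ranking use case above can be sketched with a simple linear tournament: a pairwise evaluator needs only n−1 comparisons to pick a winner from n candidates. The self-refine loop itself is not detailed in this summary, and `more_creative` below is a hypothetical wrapper around a CrEval-style judgment:

```python
def rerank_best(instruction, candidates, more_creative):
    """Select the most creative of `candidates` for `instruction`.
    `more_creative(instruction, a, b)` returns True if `a` beats `b`;
    a single linear pass finds a winner in len(candidates) - 1 calls."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if not more_creative(instruction, best, challenger):
            best = challenger
    return best
```

Note that a pairwise evaluator is not guaranteed to be transitive, so the linear pass is a heuristic; a round-robin over all pairs is more robust but costs O(n²) comparisons.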

Limitations and future directions

The authors acknowledge that a large portion of the training labels are synthetic; while the augmentation pipeline is designed to mimic human creativity levels, it may still diverge from genuine human perception. They suggest incorporating more human‑in‑the‑loop verification to continuously refine label quality. Cultural and linguistic bias is another concern: the current corpus is dominated by English and Chinese texts, potentially limiting applicability to other language families. Expanding the dataset to include multilingual, multicultural sources is a clear next step. Finally, creativity is defined as “new, surprising, and valuable,” yet the pairwise protocol primarily captures the “surprising” aspect. Future work could develop multi‑dimensional scoring that separately evaluates novelty and usefulness.

Overall impact

By introducing a standardized, context‑aware pairwise evaluation framework and a massive, diverse training set, the paper provides a practical solution for automated creativity assessment. CrEval not only surpasses state‑of‑the‑art LLM judges but also demonstrates that a reliable evaluator can be leveraged to enhance the creative output of generative models. This work extends the LLM‑as‑judge paradigm into the creative domain and lays a solid foundation for future research on measuring and improving AI creativity.

