Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI.
💡 Research Summary
This paper introduces the Conceptual Cultural Index (CCI), a novel metric designed to quantify cultural specificity at the sentence level, addressing a critical gap in evaluating Large Language Models (LLMs) in multicultural contexts. As LLMs are increasingly deployed globally, assessing whether their outputs are culturally appropriate or specific is essential for fairness and reliability. Existing benchmarks often rely on QA formats with overall accuracy, failing to capture the continuous spectrum of cultural knowledge, from universally general to highly culture-specific. CCI provides an operationalizable and interpretable solution to this problem.
The core methodology of CCI is based on estimating relative generality across cultures. For a given sentence, a target culture, and a set of comparison cultures, an LLM is prompted to estimate how common or familiar the sentence is within each culture on a scale from 0.0 to 1.0. These per-culture generality scores are obtained via a single query prompting for JSON-formatted output. The CCI score is then defined as the difference between the generality score in the target culture and the average generality score across all other cultures in the comparison set (CCI = p_target - mean(p_others)). This score ranges from -1 to 1, where values near 1 indicate high specificity to the target culture, values near 0 indicate cross-cultural generality, and values near -1 indicate specificity to non-target cultures. A key feature of CCI is its controllability: users can define the comparison set of cultures (e.g., G20 countries, neighboring nations) to operationally control the scope of “culture” being evaluated, moving beyond vague definitions.
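The score definition above can be sketched in a few lines of Python. This is an illustrative reimplementation of the formula CCI = p_target − mean(p_others), not the authors' code; the per-culture generality scores would in practice come from an LLM prompted to return JSON, and the values below are assumptions for demonstration.

```python
def cci(generality: dict[str, float], target: str) -> float:
    """CCI = generality in the target culture minus the mean
    generality across all other cultures in the comparison set.
    Ranges from -1 (specific to non-target cultures) to 1
    (specific to the target culture); ~0 means cross-culturally similar."""
    others = [score for culture, score in generality.items() if culture != target]
    return generality[target] - sum(others) / len(others)

# Hypothetical per-culture generality scores for a Japan-specific sentence,
# as an LLM might return them in JSON (illustrative values only):
scores = {"Japan": 0.9, "USA": 0.1, "Germany": 0.2, "Brazil": 0.1}
print(round(cci(scores, "Japan"), 3))  # → 0.767, i.e. highly Japan-specific
```

Choosing which cultures appear in the `scores` dictionary is exactly the controllability the paper describes: swapping in neighboring cultures (e.g., China, South Korea) changes the comparison set and thus the resulting score.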
The authors rigorously validate CCI using Japan as the target culture. They construct an evaluation dataset of 200 Japanese culture-specific sentences and 200 general sentences, generated by GPT-5 and manually reviewed. They compare CCI against a baseline method where an LLM directly scores cultural specificity on a 0-1 scale. Experiments are conducted across five LLMs, including multilingual models (Llama 3.1, Qwen 2.5, gpt-oss) and Japanese-specialized models (Llama 3.1 Swallow, llm-jp 3.1).
The results demonstrate several key findings. First, CCI achieves superior or comparable separability between culture-specific and general sentences compared to the direct scoring baseline, as measured by the Area Under the ROC Curve (AUC). Specialized models and models with strong reasoning capabilities (like gpt-oss) perform particularly well with CCI. Second, the experiments confirm CCI's controllability. When the comparison set includes neighboring cultures (China, South Korea), the CCI scores for Japanese sentences sharing practices with those neighbors decrease appropriately, showing that CCI can mitigate overestimation of specificity for regionally common elements. Third, the paper presents a practical application: stratifying existing Japanese commonsense benchmarks (JCommonsenseQA and JCommonsenseMorality) by assigning CCI scores to each test item. This analysis reveals a clear trend: model accuracy (for models like Qwen2.5 and Llama 3.1) tends to decrease for items with higher CCI scores, i.e., greater cultural specificity. Such stratification, invisible from overall accuracy alone, enables culture-aware error analysis and highlights the relative strength of the Japanese-specialized model (llm-jp) on high-CCI items.
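The AUC used for the separability comparison has a simple probabilistic reading: the chance that a randomly chosen culture-specific sentence receives a higher score than a randomly chosen general one. A minimal sketch (again not the authors' code, with made-up scores) of that computation:

```python
def auc(pos: list[float], neg: list[float]) -> float:
    """Rank-based AUC: the probability that a score drawn from `pos`
    exceeds one drawn from `neg`, with ties counted as half a win.
    Equivalent to the Mann-Whitney U statistic normalized to [0, 1]."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical CCI scores for the two sentence groups (illustrative):
specific = [0.8, 0.6, 0.7, 0.9]  # culture-specific sentences
general = [0.1, 0.0, 0.2, 0.3]   # cross-culturally general sentences
print(auc(specific, general))  # → 1.0, i.e. perfect separation
```

In the paper's setting, `pos` would hold the CCI (or baseline) scores of the 200 culture-specific sentences and `neg` those of the 200 general ones; an AUC of 0.5 would mean the metric cannot tell the two groups apart.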
In conclusion, the Conceptual Cultural Index offers a robust, interpretable, and flexible framework for measuring sentence-level cultural specificity. It transforms an abstract concept into an operational metric, enabling finer-grained evaluation of LLMs, curation of culturally stratified datasets, and deeper analysis of where models succeed or fail based on cultural context. The authors have made their code publicly available, facilitating further research and application across diverse cultural domains.