SDUs DAISY: A Benchmark for Danish Culture

We introduce Daisy, a new benchmark for Danish culture via cultural heritage, based on the curated topics of the 2006 Danish Culture Canon. For each artifact in the culture canon, we query the corresponding Wikipedia page and have a language model generate random questions. This sampling strategy yields a mix of central and peripheral questions for each work, probing not only mainstream knowledge but also the in-depth cornerstones of Danish cultural heritage as defined by the Canon committees. Each question-answer pair is human-approved or corrected, and the final dataset consists of 741 closed-ended question-answer pairs covering topics ranging from archaeological findings dated to 1300 BCE and 18th-century poems and musical pieces to contemporary pop music and Danish design and architecture.


💡 Research Summary

The paper introduces “SDU’s Daisy,” a novel benchmark designed to evaluate large language models (LLMs) on Danish cultural knowledge. The authors identify a gap in current AI evaluation: while LLMs have demonstrated impressive capabilities on English‑centric benchmarks, their understanding of non‑English, especially low‑resource, cultural content remains under‑explored. Danish, with roughly six million speakers, is classified as a low‑to‑medium‑resource language, and no systematic benchmark exists for assessing Danish cultural heritage in AI systems.

To fill this void, the authors base their dataset on the official Danish Culture Canon (Kulturkanon), a government‑endorsed collection of 108 works spanning eight domains—architecture, visual arts, design & crafts, film, literature, children’s literature, music, and performing arts. Each domain contributes 12 curated works selected by expert committees (the music canon comprises 24, split between classical and popular music), covering a temporal range from archaeological findings dated to 1300 BCE up to contemporary pop music and design. Because the Canon is an authoritative, pedagogically integrated reference, it offers a reliable foundation for a culturally grounded benchmark.

The construction pipeline proceeds in three stages. First, the Wikipedia page for each Canon work is automatically retrieved, providing a variable‑length textual description. Second, the authors employ the Gemma‑3 27B model (4‑bit quantized) with a custom prompt that asks the model to generate five “random” questions from the page content. The prompt explicitly requests a mix of central (core factual) and peripheral (deep, nuanced) questions, aiming to test not only mainstream knowledge but also the “corner‑stones” that define Danish heritage. Third, multiple human annotators review each generated question‑answer pair for validity, clarity, and cultural relevance, correcting or discarding problematic items. After this curation, the final dataset comprises 741 closed‑ended QA pairs (the children’s literature sub‑canon was omitted due to data issues). Human validation also involved rewriting some questions for conciseness and correcting answer errors.
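The second and third stages above can be sketched as follows. This is an illustrative stand‑in, not the authors’ code: the actual prompt wording appears in the paper’s appendix, the `Q:`/`A:` output format is an assumption, and the model call itself (to Gemma‑3 27B) is left out, with only the prompt construction and response parsing shown.

```python
import re

# Hypothetical prompt template -- the real prompt used with Gemma-3 27B is
# given in the paper's appendix; this wording is an illustrative assumption.
PROMPT_TEMPLATE = (
    "Below is the Wikipedia article for a work from the Danish Culture Canon.\n"
    "Generate 5 random question-answer pairs, mixing central (core factual)\n"
    "and peripheral (deep, nuanced) questions. Format each pair as:\n"
    "Q: <question>\n"
    "A: <answer>\n\n"
    "{article}"
)


def build_prompt(article_text: str) -> str:
    """Fill the generation prompt with one Canon work's Wikipedia text."""
    return PROMPT_TEMPLATE.format(article=article_text)


def parse_qa_pairs(model_output: str) -> list[tuple[str, str]]:
    """Extract (question, answer) tuples from a 'Q: ... A: ...' response.

    Assumes the model followed the requested format; malformed pairs are
    simply skipped, which is where the human review stage would step in.
    """
    pattern = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\nQ:|\Z)", re.DOTALL)
    return [(q.strip(), a.strip()) for q, a in pattern.findall(model_output)]
```

Pairs that fail to parse or contain errors would then flow into the human annotation stage for correction or removal.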

For evaluation, the authors adopt a simple “answer‑only” prompting scheme: models receive the question and are instructed to output just the answer, with units (e.g., meters, kilograms) specified in the prompt template (provided in the appendix). Predicted answers are normalized by lower‑casing, removing punctuation, articles, and extra whitespace. Performance metrics include word‑level F1 (precision/recall) and BLEU (using NLTK’s sentence_bleu). This open‑set normalization tolerates minor lexical variations while still rewarding exact factual matches.
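Under these normalization rules, a minimal stdlib‑only scoring sketch might look like the following. The article word list and whitespace tokenization are assumptions (the paper describes the steps but not its exact word list), and BLEU, which the authors compute with NLTK’s `sentence_bleu`, is omitted here to keep the snippet dependency‑free.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lower-case, strip punctuation and articles, collapse whitespace.

    The article set below (English plus common Danish articles) is an
    assumption for illustration; the paper does not list the exact words.
    """
    s = answer.lower()
    s = s.translate(str.maketrans("", "", string.punctuation))
    s = re.sub(r"\b(a|an|the|en|et|den|det|de)\b", " ", s)
    return " ".join(s.split())


def word_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a normalized prediction and reference."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    ref_counts: dict[str, int] = {}
    for w in ref:
        ref_counts[w] = ref_counts.get(w, 0) + 1
    common = 0
    for w in pred:
        if ref_counts.get(w, 0) > 0:  # count each reference word once
            common += 1
            ref_counts[w] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With this scheme, “The Little Mermaid!” and “little mermaid” normalize to the same string and score F1 = 1.0, while a wrong entity scores 0.0, matching the intent of tolerating minor lexical variation but rewarding exact factual matches.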

Baseline experiments involve several state‑of‑the‑art multilingual LLMs: Meta’s Llama‑3.3‑70B‑Instruct, OpenAI’s GPT‑OSS‑20B and 120B, Mistral‑Small‑24B‑Instruct‑2503, and Google’s Gemma‑3‑27B‑IT. Results (BLEU/F1) are uniformly low: Llama‑3.3‑70B achieves the highest BLEU of 0.166 and F1 of 0.268, followed by GPT‑OSS‑120B (BLEU 0.126, F1 0.211). Gemma‑3‑27B, despite being praised for Danish generation quality, scores only 0.123 BLEU and 0.193 F1, indicating a substantial gap between linguistic fluency and factual cultural knowledge. Qualitative inspection shows most models can produce plausible answer formats, but often miss the correct entity or provide overly generic responses. In some cases (e.g., GPT‑OSS‑20B) the model requires extensive reasoning traces (>2000 tokens) before yielding an answer, leading the authors to discard such outputs.

The discussion interprets these findings. First, even though Danish Wikipedia likely appears in the pre‑training corpora, the signal for highly specific cultural facts is weak relative to the massive multilingual data, resulting in poor recall of Canon details. Second, alignment and safety fine‑tuning may bias models toward cautious, generalized replies rather than confident enumeration of closed‑list facts. Consequently, the observed failures reflect not merely data scarcity but also how cultural specificity is weighted and retrieved throughout the training and alignment pipeline.

The paper situates Daisy among existing Danish NLP resources. Benchmarks like ScandEval, DaCy, and DaNLP focus on linguistic tasks (NER, sentiment, syntactic parsing) or translate English benchmarks, lacking cultural grounding. A recent crowdsourced effort (DaKultur) collects culturally relevant prompts but suffers from ambiguous definitions of “cultural relevance,” mixing pop‑culture trivia with practical or stereotypical queries. Daisy, by contrast, offers a rigorously curated, officially sanctioned set of cultural artifacts, providing an ecologically valid, reproducible testbed for future LLM development.

In conclusion, the authors deliver the first systematic benchmark for Danish cultural knowledge, release the dataset, generation pipeline, and evaluation scripts publicly, and highlight the need for culturally aware model training and alignment strategies, especially for low‑resource languages. Daisy can serve both the AI research community—by exposing cultural blind spots in multilingual models—and the digital humanities community—by enabling quantitative analysis of cultural representation in AI systems.
