Oogiri-Master: Benchmarking Humor Understanding via Oogiri

Reading time: 5 minutes
...

📝 Original Info

  • Title: Oogiri-Master: Benchmarking Humor Understanding via Oogiri
  • ArXiv ID: 2512.21494
  • Date: 2025-12-25
  • Authors: Soichiro Murakami (CyberAgent), Hidetaka Kamigaito (CyberAgent, Nara Institute of Science and Technology), Hiroya Takamura (Institute of Science Tokyo), Manabu Okumura (Institute of Science Tokyo)

📝 Abstract

Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: What makes such responses funny to humans? Previous work has offered only limited reliable means to answer this question. Existing datasets contain few candidate responses per prompt, expose popularity signals during ratings, and lack objective and comparable metrics for funniness. Thus, we introduce Oogiri-Master and Oogiri-Corpus, a benchmark and a dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others' ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. Subsequently, we benchmark a range of LLMs and human baselines in Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting improves model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.
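To make the rating protocol concrete, the sketch below illustrates how independent judge scores for a prompt's candidate responses could be aggregated into a single funniness score per response. This is a minimal illustration, not code from the paper: the example responses (other than the one from Figure 1), the 1-5 rating scale, and the choice of the median as the aggregate are all assumptions.

```python
import statistics

# Hypothetical ratings for one Oogiri prompt: each candidate response is
# scored independently by several judges on an assumed 1-5 funniness scale
# (judge lists are truncated for brevity; the paper uses ~100 judges and
# ~100 candidate responses per prompt).
ratings = {
    "It works on my machine.": [4, 5, 3, 5, 4],
    "Fixed stuff. Probably.": [2, 3, 2, 1, 3],
    "Revert the revert of the revert.": [4, 4, 5, 3, 4],
}

def aggregate_funniness(judge_scores):
    """Collapse independent judge ratings into one funniness score.

    The median is one robust choice (less sensitive to outlier judges);
    the paper's actual aggregation scheme may differ.
    """
    return statistics.median(judge_scores)

scores = {resp: aggregate_funniness(r) for resp, r in ratings.items()}
for resp in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[resp]:.1f}  {resp}")
```

Because every judge rates without seeing others' votes, a per-response aggregate of this kind is less exposed to the popularity bias the authors describe for upvote-based data.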

💡 Deep Analysis

Figure 1: Oogiri prompt–response example.

📄 Full Content

Oogiri-Master: Benchmarking Humor Understanding via Oogiri

Soichiro Murakami¹, Hidetaka Kamigaito¹,², Hiroya Takamura³, Manabu Okumura³
¹CyberAgent, ²Nara Institute of Science and Technology, ³Institute of Science Tokyo
murakami_soichiro@cyberagent.co.jp, kamigaito.h@is.naist.jp, {takamura,oku}@pi.titech.ac.jp

Keywords: Humor, Oogiri, Large Language Models, Benchmarking, Linguistic Analysis

1. Introduction

Endowing large language models (LLMs) with human-like creative thinking capabilities is a major challenge that extends beyond problem-solving abilities. Humor understanding is one such key capability. Understanding and generating humor as humans do requires more than pattern matching; it necessitates creative reasoning that incorporates context and cultural nuances to produce witty and unexpected responses (Loakman et al., 2025).

This study addresses humor as an instance of creative thinking in LLMs by focusing on the specific case of Oogiri (大喜利). Oogiri is a Japanese creative response game that involves improvising humorous responses to a given prompt, as shown in Figure 1, making it an ideal testbed for creativity and wit. This raises the central question: What exactly makes Oogiri responses funny to humans? The starting point of our study is to answer this question. Few studies have aimed to capture the human perception of funniness using objective metrics and to analyze its components quantitatively. This absence poses a significant barrier to the evaluation of humor understanding in LLMs.

Figure 1: Oogiri prompt–response example. Prompt: "Worst commit message ever." Response: "It works on my machine."

We address two key challenges in evaluating the humor understanding of LLMs. First, the constituent elements of a funny response remain insufficiently understood. Humor is a subjective construct arising from a complex interplay of factors such as the violation of expectations and resonance. However, no objective, quantitative metric exists for measuring funniness itself. Consequently, we lack a principled basis for explaining why an Oogiri-style response is funny, which hinders the systematic improvement of LLM humor understanding.

The second challenge is the low reliability of existing datasets for such analysis. For example, the Oogiri-GO dataset (Zhong et al., 2024) was collected from Bokete (https://bokete.jp/), a caption-contest platform on which users upvote funny responses to prompts. Although this social-voting signal is useful at scale, it introduces two methodological limitations. First, the fairness of the evaluation process is not guaranteed: making the popularity of each response visible to other raters may introduce popularity bias and compromise objectivity. Second, the dataset exhibits structural bias: with only approximately eight candidate responses per prompt on average, raters are likely to select a relatively better option rather than an intrinsically humorous one.

Therefore, in this study, we propose Oogiri-Master, a benchmark that evaluates the humor understanding of LLMs using the Oogiri task. Specifically, we address the two challenges outlined above by constructing a novel dataset and conducting a quantitative analysis of the funniness components, with which we assess the current capabilities and pave the way for improvements. First, we construct Oogiri-Corpus, a dataset that ensures reliability and objectivity. On average, ea

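Building on the setup described above, one straightforward way to benchmark an LLM on data of this kind is to have it score the same candidate responses for a prompt and compare its ordering with the aggregated human judgments. The sketch below is an assumption-laden illustration: the response names and model scores are invented, and pairwise ranking agreement is used as a simple comparison metric rather than whatever metric the paper actually reports.

```python
from itertools import combinations

# Hypothetical aggregated human funniness scores and LLM-assigned scores
# for the candidate responses of a single prompt (names are illustrative).
human_scores = {"resp_a": 4.0, "resp_b": 2.0, "resp_c": 3.5}
model_scores = {"resp_a": 0.9, "resp_b": 0.4, "resp_c": 0.3}

def pairwise_agreement(human, model):
    """Fraction of response pairs that the model orders the same way as the
    aggregated human judgments (tied pairs are skipped). This is one simple
    way to compare rankings; the benchmark's actual metric may differ."""
    agree, total = 0, 0
    for a, b in combinations(human, 2):
        h_diff = human[a] - human[b]
        m_diff = model[a] - model[b]
        if h_diff == 0 or m_diff == 0:
            continue  # skip ties, which carry no ordering information
        total += 1
        if (h_diff > 0) == (m_diff > 0):
            agree += 1
    return agree / total if total else float("nan")

print(f"pairwise agreement: {pairwise_agreement(human_scores, model_scores):.2f}")
```

A rank-correlation statistic such as Spearman's rho could be substituted for pairwise agreement; the essential idea is simply to compare the model's ordering of the ~100 candidates against the human one.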

Reference

This content is AI-processed based on open access ArXiv data.
