Reading time: 25 minutes

Original Info

  • Title:
  • ArXiv ID: 2512.21494
  • Date:
  • Authors: Unknown

Abstract

Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: What makes such responses funny to humans? Previous work has offered only limited reliable means to answer this question. Existing datasets contain few candidate responses per prompt, expose popularity signals during ratings, and lack objective and comparable metrics for funniness. Thus, we introduce Oogiri-Master and Oogiri-Corpus, which are a benchmark and dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others' ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. Subsequently, we benchmark a range of LLMs and human baselines in Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting improves the model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.

Full Content

Endowing large language models (LLMs) with human-like creative thinking capabilities is a major challenge that extends beyond problem-solving abilities. Humor understanding is one such key capability. Understanding and generating humor as humans do requires more than pattern matching; it necessitates creative reasoning that incorporates context and cultural nuances to produce witty and unexpected responses (Loakman et al., 2025). This study addresses humor as an instance of creative thinking in LLMs by focusing on the specific case of Oogiri (大喜利). Oogiri is a Japanese creative response game that involves improvising humorous responses to a given prompt, as shown in Figure 1, making it an ideal testbed for creativity and wit. This raises the central question: What exactly makes Oogiri responses funny to humans?

The starting point of our study is to answer this question. Few studies have aimed to capture the human perception of funniness using objective metrics and to analyze its components quantitatively. This absence poses a significant barrier to the evaluation of humor understanding in LLMs.

We address two key challenges in evaluating the humor understanding of LLMs. First, the constituent elements of a funny response remain insufficiently understood. Humor is a subjective construct arising from a complex interplay of factors such as the violation of expectations and resonance. However, an objective, quantitative metric does not exist for measuring funniness itself.

Figure 1: Oogiri prompt-response example. Prompt: “Worst commit message ever.” Response: “It works on my machine.”

Consequently, we lack a principled basis for explaining why an Oogiri-style response is funny, which hinders the systematic improvement of LLM humor understanding. The second challenge is the low reliability of existing datasets for such analysis. For example, the Oogiri-GO dataset (Zhong et al., 2024) was collected from Bokete,1 a caption-contest platform on which users upvote funny responses to prompts. Although this social-voting signal is useful for collecting data at scale, it introduces two methodological limitations. First, the fairness of the evaluation process is not guaranteed: making the popularity of each response visible to other raters may introduce popularity bias and compromise objectivity. Second, the dataset exhibits structural bias. With only approximately eight candidate responses per prompt on average, raters are likely to select a relatively better option rather than an intrinsically humorous one.

Therefore, in this study, we propose Oogiri-Master, a benchmark that evaluates the humor understanding of LLMs using the Oogiri task. Specifically, we address the two challenges outlined above by constructing a novel dataset and conducting a quantitative analysis of the funniness components, with which we assess the current capabilities and pave the way for improvements. First, we construct Oogiri-Corpus, a dataset that ensures reliability and objectivity.2 On average, each prompt is paired with approximately 100 diverse candidate responses that are rated for funniness by approximately 100 human judges in an independent setting in which they cannot see others’ ratings. This design mitigates the issues of fairness and data bias observed in existing datasets. Second, using this dataset, we quantitatively analyze the linguistic features that constitute funniness. We identify common lexical and structural patterns in high-rated responses, transforming the ambiguous notion of funniness into measurable, objective metrics. This enables explanations of why a response is funny based on data-driven evidence, rather than subjective intuition. Finally, we present the multifaceted benchmark results on Oogiri-Master. We benchmark humans and various LLMs to clarify the current state of the art in the humor understanding of LLMs.

The contributions of this study can be summarized as follows:3 First, we construct and release a large-scale, reliable dataset, Oogiri-Corpus, which serves as a novel foundation for evaluating humor understanding in LLMs. Second, through a quantitative analysis of this dataset, we identify the constituent components of funniness, demonstrating that features such as response length, perspective shift, and ambiguity are strongly correlated with high-rated responses. Third, we propose a novel benchmark, Oogiri-Master, and experimentally demonstrate that (1) state-of-the-art LLMs such as GPT-5 approach human performance; (2) our analytical insights into the constituent components of humor contribute to performance improvements in humor judgment; (3) instructing LLMs to leverage these insights only when uncertain improves their performance further; and (4) continued pretraining on a target-language corpus enhances the humor understanding abilities of LLMs.

Background on Computational Humor Computational humor is a relatively new area, and humor understanding/generation remains a challenging problem in natural language processing (Loakman et al., 2025). One obstacle is defining “humor” appropriately. Consequently, many studies have narrowed the scope to specific forms (e.g., puns, Oogiri, satire) to make the problem tractable (Amin and Burghardt, 2020). Among these, pun generation has a particularly long history and is a central task (Ritchie, 2005; Yu et al., 2018; Luo et al., 2019).

We target Oogiri as our testbed for humor understanding. Oogiri is a creative response game in which one provides a witty response to a prompt. Although the most common setup is a text-to-text format in which a textual prompt is paired with a textual response, multimodal variants exist (e.g., image-to-text one-liners; image&text-to-text fill-in-the-blank) (Zhong et al., 2024). These formats resemble memes (Sharma et al., 2023; Nguyen and Ng, 2024); we regard memes as a multimodal variant of Oogiri. However, we focus on text-to-text Oogiri for two reasons. First, abundant web resources exist. Oogiri is widely popular in TV programs and social media, and large platforms such as Bokete and Oogiri Sogo host substantial data. Because analyzing humor components requires diverse and numerous samples, Oogiri is suitable from a data perspective. Second, the text-to-text format is unimodal, making semantic understanding more straightforward than with multimodal variants.

Although progress has been hampered by limited datasets, interest has recently increased with the advent of LLMs and the concomitant need for evaluation resources. Oogiri-specific datasets remain relatively scarce; adjacent resources include English caption datasets collected from the New Yorker Caption Contest (Hessel et al., 2023) and various meme datasets (Liu et al., 2022; Hwang and Shwartz, 2023; Hossain et al., 2022). Oogiri-GO, which was built using Bokete and social media, is a representative Oogiri dataset. However, it faces two issues: (1) fairness concerns: voting interfaces display the popularity of each response to other raters, inviting conformity and potentially compromising objectivity; and (2) structural bias: many prompts have few candidate responses (approximately eight on average), so raters may select responses that are merely “less bad” rather than intrinsically funny. In this study, we construct a novel Oogiri dataset, Oogiri-Corpus, which addresses these issues and serves as a more reliable foundation for evaluating LLM humor understanding.

Although studies have been conducted on generation, understanding, and explanation in computational humor (Amin and Burghardt, 2020; Loakman et al., 2025), quantitative analyses of the constituent components of “funniness” remain underexplored. To fill this gap, using Oogiri-Corpus, we analyze how diverse linguistic features, such as perspective shift, ambiguity, harmlessness, surprisal, sentence length, and part-of-speech (POS) ratios, relate to humor, with the aim of identifying objective, quantitative indicators. Furthermore, through our benchmark experiments, we outline how these insights can improve LLM humor understanding.

Motivated by the second challenge mentioned in §1, we present Oogiri-Corpus and provide details on its construction process and descriptive statistics. We collected data from a public Japanese Oogiri competition platform, Oogiri Sogo.4 On this platform, each prompt proceeds through an answer phase, a voting phase, and a final leaderboard announcement. During the answer phase, users submit responses within a fixed time window (e.g., 12 h). This phase then transitions to the voting phase, in which users vote for the responses that they find funny among all submissions. Unlike other platforms (e.g., Bokete), vote counts are not displayed during the voting phase, which helps to mitigate popularity bias and supports fairer evaluation. Finally, the platform announces a leaderboard based on the total votes.

Dataset construction comprised two steps: web crawling5 and quality filtering. First, we collected 2,165 prompts from the platform. 6 Each prompt is associated with many responses, and each response has a vote count indicating its perceived funniness. We applied vote-based filtering to ensure reliability: we excluded prompts for which the total number of votes was fewer than 100. This threshold reduces the variance owing to rater subjectivity and chance when the vote pool is small. In total, 908 prompts remained. We refer to this 908-prompt dataset as Oogiri-Corpus, and used it for the subsequent analyses and benchmark construction.

Oogiri-Corpus consists of prompts, responses, and vote counts. Across the 908 prompts, each prompt has approximately 96 responses and 172 votes on average. The total number of prompt-response pairs is 82,536. This is approximately seven times larger than Oogiri-GO (Zhong et al., 2024) and, to the best of our knowledge, is the largest Japanese Oogiri dataset to date.7 Moreover, because each prompt is paired with many candidate responses, raters can select responses that are intrinsically funny rather than merely “less bad” within a limited pool. Dataset statistics are presented in Table 1.

We address the first challenge mentioned in §1: elucidating the components that constitute a “funny response.” “Funniness” is subjective and complex; for example, it involves expectation violations and relatability. However, a generally accepted quantitative metric remains lacking. Accordingly, our analysis aims to explain why an Oogiri response is funny based on a variety of quantitative linguistic features. Through this analysis, we seek to identify objective and quantitative indicators for understanding humor and to pave the way for improving the ability of LLMs to understand humor.

We quantitatively examined the linguistic features that constitute “humor,” using Oogiri-Corpus as the foundation. Although the dataset links an average of 96 responses to each prompt, we did not use all responses for the analysis. This is because many responses have zero votes, creating a pronounced imbalance between high-rated responses with many votes and low-rated responses with no votes, which makes the analysis challenging. Accordingly, we first narrowed down the responses under analysis and balanced the high- and low-rated responses. Specifically, for each prompt, we defined the top three responses by vote count as “high-rated responses” and the bottom three as “low-rated responses.” On average, high-rated responses received approximately 8.5 votes, whereas all low-rated responses had zero votes. Given that they received no votes, we consider the low-rated responses reasonable representatives of “unfunny responses.” This yielded 5,448 responses for the analysis (908 prompts × 6 responses).
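
As a concrete illustration of this grouping step, the following is a minimal pandas sketch. The column names (prompt_id, response, votes) and the file name are assumptions for illustration, not the released corpus format.

```python
import pandas as pd

# Minimal sketch of the top-3 / bottom-3 grouping described above.
df = pd.read_csv("oogiri_corpus.csv")  # hypothetical file layout

def top_bottom(group: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Keep the top-k ("high") and bottom-k ("low") responses by vote count."""
    ranked = group.sort_values("votes", ascending=False)
    return pd.concat([
        ranked.head(k).assign(label="high"),
        ranked.tail(k).assign(label="low"),
    ])

analysis_df = (
    df.groupby("prompt_id", group_keys=False)
      .apply(top_bottom)
      .reset_index(drop=True)
)
# With 908 prompts and 3 + 3 responses each, this yields 5,448 rows.
print(analysis_df["label"].value_counts())
```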

We examined the relationships between linguistic features and response humor. Specifically, for each response, we quantitatively measured a range of linguistic features and analyzed the relationship of these feature values to response humor (i.e., differences between the high- and low-rated groups). We defined and quantified various aspects of linguistic features by borrowing ideas from theories of humor, such as incongruity theory (Morreall, 2024). These include basic linguistic features, such as sentence length, as well as higher-order features, such as resolution of incongruity (see details in §4.3). We considered that, when a feature exhibits a significantly higher or lower value in high-rated responses, it may constitute a component of humor.

We reported these relationships using an independent two-sample Student’s t-test (two-sided, assuming equal variances) (Fisher, 1925) and Cohen’s d (Cohen, 1988). The t-test assesses whether there is a statistically significant difference between two group means. Because the t-tests are sensitive to large sample sizes, we also reported Cohen’s d, an effect-size measure. Cohen’s d is the difference between the two group means divided by a pooled standard deviation and is used to evaluate the magnitude of the effect. Larger values indicate more substantively meaningful group differences. The formula for Cohen’s d is as follows:

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},$$

where $\bar{X}_i$, $s_i$, and $n_i$ are the mean, standard deviation, and sample size of each group, and $s_p$ is the pooled standard deviation. The conventional benchmarks interpret d = 0.2, 0.5, and 0.8 as small, medium, and large effects, respectively.
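
For concreteness, here is a short Python sketch of the same test and effect size using scipy and numpy; the two arrays are dummy placeholder values, not measurements from Oogiri-Corpus.

```python
import numpy as np
from scipy import stats

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d with a pooled standard deviation (equal-variance form)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))

# Dummy feature values (e.g., response length) for the two groups.
high_rated = np.array([12.0, 9.0, 15.0, 11.0, 8.0, 10.0])
low_rated = np.array([18.0, 22.0, 17.0, 25.0, 20.0, 19.0])

# Student's t-test (two-sided, equal variances), as in the analysis above.
t_stat, p_value = stats.ttest_ind(high_rated, low_rated, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, d = {cohens_d(high_rated, low_rated):.3f}")
```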

To capture humor from multiple perspectives, we defined four groups of features listed in Table 2 and measured them quantitatively. Inspired by the theories of humor (Morreall, 2024) and prior research on humor and other creative domains (Zhong et al., 2024; Murakami et al., 2025), we selected these features as plausible constituents of humor.

We defined basic linguistic features that comprise (i) response-independent measures and (ii) prompt-response relative measures. The former are based solely on the response, whereas the latter are based on the relationship between the prompt and response. The response-independent measures include sentence length based on character count, the number of unique characters, ratios of character types (e.g., hiragana and katakana in Japanese), and POS ratios (e.g., nouns, verbs, and symbols). We used a Japanese morphological analyzer, MeCab (Kudo et al., 2004), to perform tokenization and POS tagging. The prompt-response relative measures include length ratios of prompt-response pairs based on character count, lexical novelty ratios, and relative changes in character-type ratios. We defined the lexical novelty ratio as the proportion of words in the response that do not appear in the prompt, and the relative change in character-type ratios as the difference in the ratios of character types between the prompt and response.
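
To make two of these measures concrete, here is a minimal sketch of the character-type ratios and the lexical novelty ratio. It uses fugashi (a MeCab wrapper) for tokenization; the normalization and tokenization choices are assumptions rather than the authors' exact implementation.

```python
import re
from fugashi import Tagger  # MeCab wrapper; pip install fugashi unidic-lite

tagger = Tagger()

def char_type_ratios(text: str) -> dict:
    """Fractions of hiragana / katakana / kanji characters in the text."""
    n = max(len(text), 1)
    return {
        "hiragana": len(re.findall(r"[\u3041-\u309F]", text)) / n,
        "katakana": len(re.findall(r"[\u30A0-\u30FF]", text)) / n,
        "kanji": len(re.findall(r"[\u4E00-\u9FFF]", text)) / n,
    }

def lexical_novelty(prompt: str, response: str) -> float:
    """Proportion of response tokens that do not appear in the prompt."""
    prompt_tokens = {w.surface for w in tagger(prompt)}
    response_tokens = [w.surface for w in tagger(response)]
    if not response_tokens:
        return 0.0
    return sum(t not in prompt_tokens for t in response_tokens) / len(response_tokens)

prompt, response = "今までで最悪のコミットメッセージは?", "俺のマシンでは動く"
print(len(response), char_type_ratios(response), lexical_novelty(prompt, response))
```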

Inspired by incongruity theory (McDonald, 2013), we introduced semantic features that capture how a response deviates from the expectations set by the prompt. Incongruity theory states that humor arises when expectations are violated (Morreall, 2024). In a prompt-response setting, this corresponds to semantic divergence or explicit contradiction between the two texts. To capture this relationship, we used two signals: (i) semantic distance and (ii) textual entailment. Semantic distance is measured as one minus the cosine similarity between the prompt and response embeddings. Textual entailment is measured using natural language inference (NLI) probabilities, namely entailment, neutral, and contradiction, predicted using an NLI model. We used text-embedding-3-large (OpenAI, 2025) to obtain the text embeddings and mDeBERTa-v3-base (He et al., 2021) fine-tuned on the XNLI (Conneau et al., 2018) and multilingual-NLI-26lang-2mil7 datasets (Laurer et al., 2022) to obtain the NLI probabilities.8 We assumed that higher semantic distance or explicit contradiction indicates higher unexpectedness. We then quantitatively tested whether leveraging contradictions increases the degree of humor.
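
A minimal sketch of these two signals, assuming the openai and transformers Python packages and the models cited above; batching and error handling are omitted.

```python
import numpy as np
from openai import OpenAI
from transformers import pipeline

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
nli = pipeline(
    "text-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7",
    top_k=None,  # return all three labels: entailment / neutral / contradiction
)

def semantic_distance(prompt: str, response: str) -> float:
    """1 - cosine similarity between prompt and response embeddings."""
    emb = client.embeddings.create(model="text-embedding-3-large",
                                   input=[prompt, response])
    a = np.array(emb.data[0].embedding)
    b = np.array(emb.data[1].embedding)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nli_probs(prompt: str, response: str) -> dict:
    """Entailment / neutral / contradiction probabilities, prompt as premise."""
    scores = nli({"text": prompt, "text_pair": response})
    return {s["label"]: s["score"] for s in scores}
```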

In addition to the aforementioned features grounded in incongruity theory, we introduced two metrics by borrowing ideas from information theory: surprisal (Shannon, 1948) and normalized pointwise mutual information (nPMI) (Fano, 1961). Surprisal is the length-normalized negative log-probability under a language model; higher values indicate less predictable responses. nPMI quantifies the association between a prompt and its response; lower values imply co-occurrence that is close to chance. Both metrics also capture deviation from expectation in incongruity theory: surprisal reflects unpredictability of the prompt-response pair or response text itself, whereas nPMI captures unexpectedness in the prompt-response relationship. We computed these using GPT-2.9
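
The sketch below shows one plausible instantiation of these two quantities with the GPT-2 variant named in the footnotes; the exact surprisal and nPMI formulations used in the paper may differ, so treat the formulas here as assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "rinna/japanese-gpt2-medium"  # GPT-2 variant cited in the footnotes
tok = AutoTokenizer.from_pretrained(name, use_fast=False)
lm = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def seq_logprob(text: str) -> tuple[float, int]:
    """Total log-probability of `text` under the LM, plus its token count."""
    ids = tok(text, return_tensors="pt").input_ids
    logits = lm(ids).logits[:, :-1]                      # predictions for tokens 2..L
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item(), ids.shape[1]

def surprisal(text: str) -> float:
    """Length-normalized negative log-probability (higher = less predictable)."""
    lp, n_tokens = seq_logprob(text)
    return -lp / max(n_tokens, 1)

def npmi(prompt: str, response: str) -> float:
    """nPMI = [log p(x,y) - log p(x) - log p(y)] / (-log p(x,y)),
    approximating p(x,y) by the probability of the concatenated text."""
    log_pxy, _ = seq_logprob(prompt + response)
    log_px, _ = seq_logprob(prompt)
    log_py, _ = seq_logprob(response)
    return (log_pxy - log_px - log_py) / (-log_pxy)
```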

LLM-Scored Higher-Order Features We used an LLM to measure eight higher-order linguistic features. By “higher-order,” we mean features that extend beyond surface cues (e.g., length) and probabilistic or embedding-based signals (e.g., surprisal). Therefore, we used an LLM to score each prompt-response pair on a 1-5 scale across the eight aspects listed in Table 2: (1) Ambiguity exploitation, (2) Associative distance, (3) Benign violation, (4) Coherence, (5) Expectedness: the ease of predicting the response, (6) Incongruity resolution: the natural resolution of an initial mismatch by a coherent reinterpretation, grounded in incongruity-resolution theory (Ritchie, 2009), (7) Metaphor use: the presence of metaphorical expression in the response, and (8) Perspective shift: a meaningful change in viewpoint or framing that enables a punchline. In all cases, higher scores indicate more of the stated property. We defined clear evaluation criteria for each aspect and incorporated them into the prompt.10 Because of API cost considerations, we sampled 2,000 prompt-response pairs, with 1,000 pairs each randomly selected from the high- and low-rated groups, and conducted batched evaluations for each pair using GPT-5.
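
A minimal sketch of this scoring step via the OpenAI chat API. The condensed instruction text is illustrative only (the paper's full prompt, whose output-format portion is reproduced near the end of this page, is in its Appendix A), and the model name should be replaced with whatever scoring model is available; the paper also used batched rather than single calls.

```python
import json
from openai import OpenAI

client = OpenAI()
ASPECTS = [
    "ambiguity_exploitation", "associative_distance", "benign_violation",
    "coherence", "expectedness", "incongruity_resolution",
    "metaphor_use", "perspective_shift",
]

# Condensed stand-in for the paper's scoring prompt; per-aspect criteria omitted.
TEMPLATE = """Rate the following Oogiri response on a 1-5 integer scale for each aspect:
{aspects}.
Return JSON with a short "reasoning" string and one integer per aspect.

Prompt: {prompt}
Response: {response}"""

def score_pair(prompt: str, response: str, model: str = "gpt-5") -> dict:
    """Score one prompt-response pair on the eight higher-order aspects."""
    message = TEMPLATE.format(aspects=", ".join(ASPECTS),
                              prompt=prompt, response=response)
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(out.choices[0].message.content)
```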

We report on the relationships between each linguistic feature and response humor. Table 2 presents the mean of each feature for the high- and low-rated groups, the p-value of the t-test, and Cohen’s d. Our analysis yielded the following findings:

High-Rated Responses Tend to Be Shorter Length-related features such as response length and the prompt-response length ratio were significantly lower in the high-rated group than in the low-rated group, with small effect sizes. This suggests that brevity contributes to humor.

Interestingly, the high-rated group also showed significantly lower values for the unique character count (unique chars) and the rate at which vocabulary that is not in the prompt appears in the response (lexical novelty), with small effect sizes. This indicates that, relative to the low-rated group, high-rated responses had a lower tendency to use new vocabulary and may benefit from selecting appropriate words without straying far from the topic of the prompt.

Ambiguity exploitation, associative distance, benign violation, incongruity resolution, metaphor use, and perspective shift were significantly higher in the high-rated group, with small-to-medium effect sizes. Among these, perspective shift and ambiguity showed relatively larger effects, indicating particular importance for humor. Incongruity resolution, grounded in incongruity-resolution theory (Ritchie, 2009), also showed a relatively large effect size, suggesting its contribution to humor.

Other Features Have Limited Impact Semantic distance, textual entailment, surprisal, nPMI, and other linguistic features (e.g., POS ratios) showed statistically significant differences, but the effect sizes were below small, suggesting limited contributions to humor. Notably, textual entailment and surprisal capture aspects similar to the higher-order coherence and expectedness features, yet their effect sizes were also below small, consistently suggesting their limited role in constituting humor.

We propose a novel benchmark, Oogiri-Master. The aim of this benchmark is to measure the ability of an LLM to understand and judge “humor” in Oogiri from different perspectives. Specifically, we propose five tasks that can be broadly grouped into two categories: four relative-judgment tasks using multiple-choice question answering (MCQA) and one absolute-judgment task using binary classification. Standardized prompt templates and strict evaluation criteria were used to ensure reproducibility and comparability. In the experiments, we tested the insights from our analysis in §4 by incorporating multiple linguistic features into the prompt templates and examining their effect on LLM performance (§5.3). Our goal is to clarify the current state of LLM humor understanding and outline a path for further improvement.

Relative Judgment Tasks In the MCQA setting, the model selects the most humorous response to a given prompt from several candidate responses. We defined four types of tasks: two binary-choice tasks, a three-choice task, and a four-choice task. In all tasks, the high-rated response for each prompt served as the positive example, and the negatives were constructed differently for each task. For the two binary-choice tasks, we constructed negatives in two ways: (i) we paired the positive with one low-rated response from the same prompt (Binary-same), and (ii) we paired the positive with one high-rated response for a different prompt (Binary-diff). The latter evaluates whether the model can judge funniness as a response to the given prompt, rather than merely ranking responses within the same prompt, following Hessel et al. (2023). For the three- and four-choice tasks, we used one low-rated same-prompt response and one or two high-rated different-prompt responses as negatives, respectively.
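
As an illustration, here is a sketch of how such items could be assembled from the corpus. The `corpus` structure (prompt mapped to lists of high- and low-rated responses) is a hypothetical simplification, and the real benchmark samples 100 prompts independently per task rather than reusing one set of prompts for all four.

```python
import random

def build_mcqa_items(corpus: dict, n_items: int = 100, seed: int = 0) -> dict:
    """Assemble the four relative-judgment tasks.

    `corpus` maps each prompt to {"high": [...], "low": [...]} response lists
    (a hypothetical structure used only for this sketch).
    """
    rng = random.Random(seed)
    prompts = list(corpus)
    items = {"binary_same": [], "binary_diff": [], "three_choice": [], "four_choice": []}

    def high_from_other_prompts(prompt: str, k: int) -> list:
        others = rng.sample([q for q in prompts if q != prompt], k)
        return [rng.choice(corpus[q]["high"]) for q in others]

    for p in rng.sample(prompts, n_items):
        positive = rng.choice(corpus[p]["high"])          # high-rated, same prompt
        same_prompt_low = rng.choice(corpus[p]["low"])    # low-rated, same prompt
        items["binary_same"].append((p, positive, [same_prompt_low]))
        items["binary_diff"].append((p, positive, high_from_other_prompts(p, 1)))
        items["three_choice"].append((p, positive, [same_prompt_low] + high_from_other_prompts(p, 1)))
        items["four_choice"].append((p, positive, [same_prompt_low] + high_from_other_prompts(p, 2)))
    return items
```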

Absolute Judgment Task In the binary classification setting, the model decides whether a response to a prompt is “funny” or “not funny.” For each prompt, we used the high-rated response as the positive and the low-rated response as the negative, measuring the ability of the model to evaluate funniness in absolute terms. Figure 2 shows an example of the absolute-judgment prompt.

Oogiri-Master is built on Oogiri-Corpus. For the MCQA setting, we sampled 100 prompts per task from Oogiri-Corpus, and selected positives and negatives according to each task design, yielding 400 items across the four tasks. For binary classification, we sampled 100 prompts from Oogiri-Corpus, pairing one high-rated response and one low-rated response per prompt for 200 items. In total, Oogiri-Master comprised 600 items.11

We evaluated a range of LLMs listed in Table 3, from proprietary models (e.g., GPT-5) to open-source models (e.g., DeepSeek-R1), on the five tasks in Oogiri-Master. We report accuracy as the evaluation metric.

For API-based models, we averaged the results over three trials. During inference, we set the temperature parameter to zero for all models. We compared two prompting strategies when instructing the LLMs to solve each task: (1) a baseline prompt that simply instructs the model to select an option, as shown in Figure 2; and (2) an insight-augmented prompt that incorporates features computed from the given prompt-response pairs based on the findings of our data analysis. To keep the prompts concise, we included only a small set of features selected with reference to the observed effect sizes in Table 2. Specifically, we used five basic features: length, unique character count, prompt-response length ratio, symbol ratio, and katakana ratio; and six LLM-scored features: ambiguity exploitation, associative distance, benign violation, incongruity resolution, metaphor use, and perspective shift. The basic features were precomputed and inserted directly into the prompt. LLM-scored features followed a two-step procedure: first, for each prompt-response pair, the target LLM computed scores for each aspect (e.g., metaphor use); second, these scores were included as context when instructing the model to select an option for each task.
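
A sketch of how such an insight-augmented prompt could be assembled. The feature values are assumed to be produced beforehand by the feature computations sketched earlier, and the instruction wording below is illustrative, not the paper's template.

```python
BASIC_KEYS = ["length", "unique_chars", "length_ratio", "symbol_ratio", "katakana_ratio"]
LLM_KEYS = ["ambiguity_exploitation", "associative_distance", "benign_violation",
            "incongruity_resolution", "metaphor_use", "perspective_shift"]

def feature_block(basic: dict, scored: dict) -> str:
    """Render precomputed basic features and LLM-scored aspect scores as text."""
    lines = [f"- {key}: {basic[key]}" for key in BASIC_KEYS]
    lines += [f"- {key}: {scored[key]}" for key in LLM_KEYS]
    return "\n".join(lines)

def insight_augmented_prompt(prompt: str, candidates: list, features: list) -> str:
    """Build the selection prompt; `features` holds one (basic, scored) pair
    of dicts per candidate, computed in the earlier step."""
    parts = [f"Oogiri prompt: {prompt}",
             "Select the funniest response, using the features below as supporting evidence."]
    for i, (candidate, (basic, scored)) in enumerate(zip(candidates, features), start=1):
        parts.append(f"Option {i}: {candidate}\nFeatures:\n{feature_block(basic, scored)}")
    parts.append("Answer with the option number only.")
    return "\n\n".join(parts)
```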

To validate the human performance on this benchmark, we recruited crowdworkers from the crowdsourcing platform12 and asked them to solve each item using the same baseline prompt that was shown to the LLMs. Each item was answered by 21 workers, and the final labels were determined by majority vote. We included attention checks with unambiguous answers and aggregated the results only for the 21 workers who passed the checks for each item.
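
A minimal sketch of this aggregation, assuming per-item dictionaries of worker answers and attention-check outcomes (hypothetical data structures, not the crowdsourcing platform's export format).

```python
from collections import Counter

def aggregate_item(worker_answers: dict, passed_check: dict) -> str:
    """Majority vote over the workers who passed the attention checks for this item.

    worker_answers: worker_id -> chosen option; passed_check: worker_id -> bool.
    """
    valid = [answer for wid, answer in worker_answers.items() if passed_check.get(wid, False)]
    if not valid:
        raise ValueError("No workers passed the attention checks for this item.")
    return Counter(valid).most_common(1)[0][0]

# Example: three of four workers pass the check; the majority label is "A".
answers = {"w1": "A", "w2": "A", "w3": "B", "w4": "B"}
checks = {"w1": True, "w2": True, "w3": True, "w4": False}
print(aggregate_item(answers, checks))  # -> "A"
```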

Table 3 lists the benchmark results. We compared two prompting strategies: a baseline prompt and an insight-augmented prompt.

Baseline Prompt When averaging the accuracy across the five tasks, Claude-Opus-4 performed the best (68.7%), followed by GPT-5 (67.6%) and Gemini-2.5-Pro (53.4%). Open LLMs lagged behind these proprietary LLMs; even the strongest, LLM-jp-3.1-13b-ja, reached only 49.8%. Additionally, with the same instructions as those provided to the LLMs, the 21 crowdworkers achieved 68.7%, which is comparable to that of Claude-Opus-4. One possible reason that the human performance was relatively low compared with our expectations is the demographic mismatch between crowdworkers and users of the Oogiri platform.13 Humor is subjective, and differences in age and interests can yield different judgments of funniness. Future studies will include analyses that account for annotator attributes and evaluations using more diverse raters.

Insight-Augmented Prompt With feature incorporation, four models, namely GPT-5, Gemini-2.5-Pro, DeepSeek-R1, and DeepSeek-R1-ja, improved their average accuracy across the five tasks. Notably, GPT-5 increased from 67.6% to 70.7% (+3.1 points), surpassing both human performance and Claude-Opus-4 in the baseline setting. This supports the effectiveness of incorporating linguistic features that reflect the components of humor for improving Oogiri understanding. However, three models, namely Claude-Opus-4, gpt-oss-20b, and LLM-jp-3.1-13b-ja, showed degraded performance. One possible factor is differences in reasoning ability. Compared with the baseline, the insight-augmented prompt was longer and more complex because of the added features and instructions. Stronger reasoners (e.g., GPT-5) could correctly interpret these complex prompts and benefit from them, whereas weaker models (e.g., LLM-jp-3.1-13b-ja) tended to misinterpret them and over-rely on feature magnitudes. For example, given the insight that funnier responses tend to be shorter, weaker models over-selected very short responses. This suggests that, when reasoning ability is limited, instructing models to consider features can introduce overfitting problems and reduce performance.

Effect of Continued Pretraining We compared the two models in Table 3, namely DeepSeek-R1 and DeepSeek-R1-ja, which share the same architecture and parameter count; the only difference is the pretraining data. DeepSeek-R1-ja continues pretraining DeepSeek-R1 on a Japanese corpus.14 DeepSeek-R1-ja improved the average accuracy across the five tasks from 41.3% to 44.6% in the baseline setting (+3.3 points) and from 41.4% to 46.0% in the insight-augmented setting (+4.6 points). As our benchmark is based on Japanese Oogiri, these results suggest that continued pretraining on a Japanese corpus is effective in improving Oogiri understanding. Although prior work has shown benefits for Japanese cultural and knowledge understanding (Tsutsumi and Jinnai, 2025), our findings indicate that such continued pretraining also aids the more advanced language understanding required for Japanese Oogiri.

Ablation Study of Feature Groups Table 4 presents the average accuracy over the five tasks for GPT-5 and Gemini-2.5-Pro under four settings: introducing only basic linguistic features, introducing only LLM-scored higher-order features, introducing both, and using the baseline with no features. In all cases, incorporating features into the prompt improved the average accuracy over the baseline. For GPT-5, using both feature groups yielded the best results. For Gemini-2.5-Pro, introducing only basic linguistic features (e.g., length and character-type ratios) performed the best. Notably, when introducing only basic linguistic features, both Gemini-2.5-Pro and GPT-5 improved more than when introducing higher-order features alone (+3.7 and +2.2 points, respectively). Response length was already identified in our analysis as a constituent component of humor, and the benchmark results empirically confirm that such simple heuristics can be effective criteria for evaluating funniness. These findings suggest that exploring a broad range of linguistic features is a promising direction for further enhancing the humor understanding of LLMs.

Effect of Instruction Style We also examined the effect of the instruction style on performance when incorporating features into prompts, that is, how we should tell the model to use the features. We considered two styles: (1) instructing the model to use the features when judging funniness, and (2) instructing the model to consult the features only when uncertain. In our preliminary experiments, we first attempted style (1) and observed an over-reliance on feature magnitudes, which motivated the proposal of style (2). Table 5 shows the average accuracy of GPT-5 over the five tasks for the no-feature baseline and the two instruction styles; the “Uncertain” column corresponds to style (2). In both styles, incorporating features improved over the baseline; notably, style (2) yielded the highest performance, improving the average accuracy by 3.1 points over the baseline. This indicates that asking the model to consider features only when uncertain helps to prevent over-dependence on feature magnitudes and enables more appropriate use of the features. The results highlight instruction design as an important lever for improving the humor understanding of LLMs, and the value of exploring more effective instruction styles in future studies.

We presented a systematic study of humor on Oogiri-Corpus and introduced Oogiri-Master, a benchmark covering relative and absolute judgments. Our analysis showed that multiple linguistic features, such as length and ambiguity, correlate with high-rated responses. In the benchmark experiments, we showed that incorporating these features into prompts improves model performance. Furthermore, we demonstrated that continued pretraining on a Japanese corpus further boosts accuracy and that instructing models to consider features only when uncertain mitigates over-reliance on heuristics. Future work includes exploring other effective linguistic features and refining prompt design, scaling human evaluations with annotator attributes, and extending the method to other languages and multimodal settings.

Data Collection and Licensing Oogiri-Corpus was constructed by collecting data from the public Japanese Oogiri competition platform, Oogiri Sogo. We confirm that the site explicitly permits web crawling, ensuring the legitimacy of the data collection process in §3. To promote transparency and facilitate further research, Oogiri-Corpus and Oogiri-Master will be made available under the CC BY-NC-SA 4.0 license.

We recruited crowdworkers for the human baseline evaluation in §5. We used Yahoo! Crowdsourcing as the crowdsourcing platform. In accordance with the platform’s regulations, the compensation was set at 10 yen per 20 tasks. Workers were informed that the annotated results would be used for research purposes. In addition, we acknowledge that a potential demographic mismatch exists between the crowdworkers and Oogiri-platform users, as discussed in §5.3.2, suggesting that further analysis accounting for annotator attributes is necessary to improve evaluation reliability.

Limited to Japanese Oogiri Our analysis and benchmark are based on Japanese Oogiri data. Some humor depends on culture-specific knowledge (e.g., a response such as “Mount Fuji” may be funny to Japanese users because it evokes familiar shared knowledge), and similar effects may not hold in other languages or cultural contexts. Moreover, our feature analysis included Japanese-specific elements (e.g., character-type ratios), which may not be directly transferred. Future work should include collecting and analyzing Oogiri-like data in other languages and cultures to better understand the cross-lingual and cross-cultural variations in humor.

Benchmark Scope Limited to Oogiri Understanding We proposed a benchmark focused on understanding “funniness” in Oogiri: four MCQA subtasks and one binary classification task. However, humor understanding is related to other capabilities such as generation and explanation (Loakman et al., 2025). Although these are beyond the scope of this study, extending the benchmark to evaluate generation and explanation is an important direction for future research.

Limited to Text-to-Text Oogiri As discussed in Related Work (§2), Oogiri can be framed as text-to-text, image-to-text, or image&text-to-text (Zhong et al., 2024). We focused on the text-to-text format for two reasons: (1) as a first step toward measuring LLM humor understanding, a unimodal text-only setup reduces complexity relative to multimodal settings, and (2) text-to-text Oogiri data are more abundant on the web, facilitating robust dataset construction and generalizable analysis.

An important next step is to extend the dataset to multimodal variants and study humor understanding involving visual information.

Output Requirements:

- All scores must be integers (1-5).

- In the reasoning field, summarize the concise basis for each score in 1-3 sentences.

- Return in JSON format.

{
  "reasoning": "Reason for the scores",
  "ambiguity_exploitation": number,
  "associative_distance": number,
  "benign_violation": number,
  "coherence": number,
  "expectedness": number,
  "incongruity_resolution": number,
  "metaphor_use": number,
  "perspective_shift": number
}

Table 2: Comparison of linguistic features between high- and low-rated responses. * indicates statistical significance (p < 0.05). Bold values of Cohen’s d indicate a small or medium effect size (|d| ≥ 0.2). † indicates features employed in the benchmark experiments (§5).

2 We distinguish the dataset, Oogiri-Corpus, which underpins our analyses, from the benchmark, Oogiri-Master, which builds on it to evaluate LLMs.

3 The dataset and the benchmark will be provided under the CC BY-NC-SA 4.0 license.

4 https://chinsukoustudy.com/

5 The site explicitly permits web crawling.

6 Prompt IDs 87-2254 were available when accessed.

7 Compared with 11,842 Japanese Oogiri instances in a text-to-text setting.

8 https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7

9 https://huggingface.co/rinna/japanese-gpt2-medium

10 The full prompt is provided in Appendix A.

11 To prevent data contamination, we sampled different data points from the analysis dataset in §4.

12 https://crowdsourcing.yahoo.co.jp/

13 Because neither the crowdsourcing service nor the Oogiri platform discloses detailed user attributes, we could not perform a precise comparison; however, some differences in user populations are plausible.

14 https://huggingface.co/cyberagent/DeepSeek-R1-Distill-Qwen-14B-Japanese
