On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original paper on arXiv.

Humor is a pervasive yet intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs); it is typically framed as funny caption generation for images, which requires visual understanding, humor reasoning, creative imagination, and more. Existing LLM-based approaches rely on reasoning chains or self-improvement, and suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism grounded in a fundamental humor theory, the General Theory of Verbal Humor (GTVH). To produce funny, script-opposing captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) a conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) a retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) a caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmark datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.


💡 Research Summary

The paper tackles the challenging problem of generating humorous captions for images, a task that requires not only visual understanding but also sophisticated humor reasoning, creative imagination, and linguistic expression. Existing approaches that rely on generic prompting, multi‑hop reasoning, or task‑specific fine‑tuning tend to produce captions that capture surface‑level wordplay but lack deep logical humor and originality. To overcome these limitations, the authors introduce HOMER (Humor‑theory‑driven Multi‑role LLM Collaboration Framework augmented with humor Retrieval), a novel pipeline that grounds caption generation in the General Theory of Verbal Humor (GTVH).

GTVH posits five interrelated knowledge resources that together constitute a joke: script opposition, situation, target, narrative strategy, and language. The authors argue that script opposition—conflict between two semantic frames—is the core driver of humor, especially in visual contexts where an image can simultaneously evoke two contradictory expectations. HOMER operationalizes this theory through three cooperating large language model (LLM) agents:
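The five GTVH knowledge resources can be pictured as a simple record attached to each joke. The field names and example values below are illustrative sketches, not taken from the paper's implementation:

```python
from dataclasses import dataclass


@dataclass
class JokeKnowledge:
    """The five GTVH knowledge resources behind a joke (illustrative sketch)."""
    script_opposition: str   # the two conflicting semantic frames
    situation: str           # setting, characters, and actions
    target: str              # the entity the joke is about
    narrative_strategy: str  # e.g., exaggeration, reversal
    language: str            # the surface wording of the joke


# Hypothetical example in the spirit of the paper's coffee-cup cartoon
joke = JokeKnowledge(
    script_opposition="normal coffee cups vs. oversized coffee cups",
    situation="an office worker ordering at a cafe counter",
    target="coffee cup",
    narrative_strategy="exaggeration",
    language="I'll have the venti. No, the one you can swim in.",
)
```

HOMER's three roles can be read as progressively filling in these fields: the extractor supplies the opposition and situation, the imaginator the target and its associations, and the generator the narrative strategy and final language.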

  1. Conflicting‑Script Extractor – Given an image, this module first uses a vision encoder (CLIP) to produce a detailed description of the scene (location, characters, actions, etc.). It then prompts an LLM with a carefully crafted template that asks it to identify script oppositions (e.g., “normal coffee cups vs. oversized coffee cups”) and to produce a concise situation description. The output, a pair (C, D) of script oppositions and situation narrative, serves as the logical backbone for the rest of the pipeline.
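A minimal sketch of how such an extraction prompt might be assembled from the scene description; the template wording and function name are assumptions, not the paper's actual prompts:

```python
def build_extractor_prompt(scene_description: str) -> str:
    """Assemble a prompt asking an LLM for script oppositions (C) and a
    situation narrative (D). The template text is a hypothetical
    reconstruction, not the paper's prompt.
    """
    return (
        "You are given a detailed description of a cartoon scene:\n"
        f"{scene_description}\n\n"
        "1. Identify the script opposition: two contradictory expectations "
        "the scene evokes (format: 'A vs. B').\n"
        "2. Write a one-sentence situation description covering the "
        "location, characters, and actions."
    )


prompt = build_extractor_prompt(
    "An office worker orders at a cafe; the barista hands over a cup "
    "the size of a bathtub."
)
```

The resulting string would be sent to the LLM, whose structured reply is parsed into the (C, D) pair used downstream.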

  2. Hierarchical Imaginator – From the extracted oppositions and situation, the imaginator selects candidate humor targets (entities that embody the conflict). It expands each target through two complementary imagination processes:

    • Deep‑pattern imagination: An LLM‑driven free‑association function f_chain recursively generates a chain of related concepts, starting from the target. Empirically, the chain length averages four steps (e.g., coffee → milk → cow). This yields a backbone “imagination tree” for each target.
    • Broad‑pattern imagination: The system queries a large joke database (assembled from 12 public joke corpora, totaling ~1.2 M jokes) using a query embedding that combines the target, situation, and script information. Top‑K most similar jokes are retrieved, tokenized, and added as leaf nodes to the imagination tree.

    To prune irrelevant or weakly humorous expansions, the authors define a humor‑relevance score H(e, ε) = H_rel + H_freq + H_div. H_rel combines WordNet‑based Wu‑Palmer semantic similarity with a Jaccard‑based conceptual opposition measure, thereby capturing both relevance and surprise. H_freq reflects how frequently a token appears in jokes, and H_div rewards part‑of‑speech diversity. Only the highest‑ranked leaf nodes (determined by a threshold δ) are retained, producing a refined imagination tree T_im.
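A toy version of this pruning score, with plain stand-ins for the paper's components: Jaccard overlap as a proxy for conceptual opposition (the paper additionally uses WordNet Wu-Palmer similarity for H_rel), a capped corpus-frequency term for H_freq, and a part-of-speech diversity bonus for H_div. All weights, helper names, and normalizations here are assumptions:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def humor_score(candidate: str, target: str,
                joke_freq: dict, pos_tags: set) -> float:
    """Toy H(e) = H_rel + H_freq + H_div used to rank leaf nodes.

    H_rel here is only a Jaccard stand-in; the paper combines Wu-Palmer
    semantic similarity with a Jaccard-based opposition measure.
    """
    cand_tokens = set(candidate.lower().split())
    tgt_tokens = set(target.lower().split())
    h_rel = 1.0 - jaccard(cand_tokens, tgt_tokens)      # reward surprise
    h_freq = min(joke_freq.get(candidate.lower(), 0) / 100.0, 1.0)
    h_div = len(pos_tags) / 4.0                         # POS variety bonus
    return h_rel + h_freq + h_div


def prune(leaves, target, joke_freq, pos_of, delta=1.0):
    """Keep only leaf nodes whose humor score clears the threshold delta."""
    return [e for e in leaves
            if humor_score(e, target, joke_freq, pos_of.get(e, set())) >= delta]
```

For instance, with `joke_freq = {"cow": 200}`, the expansion "cow" scores high against the target "coffee cup" (no token overlap, so maximal surprise, plus a saturated frequency term), while "cup" shares a token with the target and is pruned at a moderate threshold.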

  3. Caption Generator – The final module assembles a composite prompt Φ that incorporates the situation description D, script oppositions C, the refined imagination trees T_im, and a selected narrative strategy Ω (e.g., exaggeration, reversal). Feeding Φ to an LLM (GPT‑4o in the experiments) yields multiple candidate captions. These are scored for humor, relevance, and diversity; the top‑scoring caption is output as the final humorous description.
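The assembly of the composite prompt Φ might look roughly like the sketch below; the structure, section labels, and parameter names are hypothetical reconstructions, not the paper's templates:

```python
def build_caption_prompt(situation: str, oppositions: list,
                         imagination_trees: list, strategy: str,
                         n_candidates: int = 5) -> str:
    """Compose the composite prompt Phi from the situation D, script
    oppositions C, refined imagination trees T_im, and narrative
    strategy Omega. Wording is a hypothetical sketch.
    """
    trees = "\n".join(f"- {t}" for t in imagination_trees)
    opps = "; ".join(oppositions)
    return (
        f"Situation (D): {situation}\n"
        f"Script oppositions (C): {opps}\n"
        f"Imagination trees (T_im):\n{trees}\n"
        f"Narrative strategy (Omega): {strategy}\n\n"
        f"Write {n_candidates} short, funny captions that exploit the "
        "script opposition."
    )


phi = build_caption_prompt(
    situation="an office worker ordering at a cafe counter",
    oppositions=["normal coffee cups vs. oversized coffee cups"],
    imagination_trees=["coffee -> milk -> cow"],
    strategy="exaggeration",
)
```

Feeding `phi` to the LLM yields the candidate captions that are then scored and filtered for the final output.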

The authors evaluate HOMER on two New Yorker cartoon benchmarks, measuring both automatic metrics (ROUGE‑L, BLEU‑4, METEOR) and a human‑rated humor score (1–5). HOMER achieves ROUGE‑L 0.42, BLEU‑4 0.31, and an average humor score of 4.2, outperforming strong baselines such as GPT‑4o, CLoT (a multi‑hop reasoning model), Flamingo‑V2, and Kosmos‑2 by roughly 7 % relative improvement. Ablation studies demonstrate that removing any of the three core components—script opposition extraction, hierarchical imagination, or joke‑based pruning—significantly degrades performance, confirming their complementary contributions.
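ROUGE-L, one of the automatic metrics above, compares a candidate caption to a reference via their longest common subsequence. A minimal F-measure implementation, simplified to beta = 1 (the official metric weights recall more heavily):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 (beta=1 simplification of the recall-weighted original)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

An identical candidate and reference score 1.0; captions with no shared tokens score 0.0, so the reported 0.42 indicates substantial but far from verbatim overlap with the reference captions.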

Key contributions of the work include:

  • Theory‑driven design: First integration of GTVH into a practical multimodal LLM pipeline, providing explicit interpretability of why a caption is humorous.
  • Multi‑role collaboration: A clear division of labor among LLM agents that mirrors the cognitive stages of humor creation (recognition, imagination, articulation).
  • Imagination trees with humor‑relevance pruning: A novel mechanism that balances creative breadth (via joke retrieval) with relevance and surprise (via semantic‑opposition scoring).

The paper also discusses limitations. The joke database may carry cultural and linguistic biases, potentially restricting the model’s humor to the domains represented in the source corpora. The reliance on multiple LLM calls incurs high computational cost, which could hinder real‑time deployment. Fixed depths for imagination chains may limit scalability to more complex scenes. Future directions suggested include expanding the joke repository to multilingual sources, incorporating user feedback for adaptive narrative strategies, and fine‑tuning vision‑language models to internalize script opposition detection directly.

In summary, HOMER demonstrates that grounding multimodal humor generation in a well‑established humor theory, combined with structured imagination and retrieval, can substantially improve both the creativity and interpretability of AI‑generated jokes. This work opens a promising avenue for theory‑guided, multi‑agent LLM systems in other creative domains such as storytelling, advertising, and educational content creation.

