CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity


How should a system handle uncertainty when comparing narratives? We present CascadeMind, a hybrid neuro-symbolic system for SemEval-2026 Task 4 (Narrative Story Similarity) built around a core finding: an LLM’s internal vote distribution is a reliable proxy for task difficulty, and confidence-aware routing outperforms uniform treatment of all cases. Our cascade samples eight parallel votes from Gemini 2.5 Flash, applying a supermajority threshold to resolve confident cases immediately (74% of instances at 85% development accuracy). Uncertain cases escalate to additional voting rounds (21%), and only perfect ties (5%) are deferred to a symbolic ensemble of five narrative signals grounded in classical narrative theory. The resulting difficulty gradient (85% → 67% → 61% by pathway) confirms that vote consensus tracks genuine ambiguity. In official Track A evaluation, CascadeMind placed 11th of 47 teams with 72.75% test accuracy (Hatzel et al., 2026), outperforming several systems built on larger and more expensive models. Gains are driven primarily by routing strategy rather than symbolic reasoning, suggesting that for narrative similarity, knowing when you don’t know matters more than adding auxiliary representations.


💡 Research Summary

This paper introduces CascadeMind, a hybrid neuro‑symbolic system designed for SemEval‑2026 Task 4, which requires determining which of two candidate stories is more narratively similar to a given anchor story. The authors’ central insight is that the distribution of votes generated by a large language model (LLM) can serve as a reliable proxy for task difficulty. By exploiting this proxy, they construct a cascade that treats confident cases differently from ambiguous ones, thereby improving overall performance while keeping API usage modest.

The cascade consists of four stages. In Stage 1, the system prompts Gemini 2.5 Flash to produce eight independent binary decisions (A or B) for each triplet. Stage 2 evaluates the vote consensus: if at least seven of the eight votes agree (a super-majority threshold of 87.5%), the decision is accepted immediately. This super-majority path covers roughly 74% of the data and yields 85% accuracy on the development set.
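A minimal Python sketch of this routing step (the function name and return convention are illustrative, not from the paper; only the eight-vote sample and the 7-of-8 threshold come from the source):

```python
from collections import Counter

# 7-of-8 agreement (87.5%) accepts the label immediately; anything less escalates.
SUPERMAJORITY = 7

def route_initial(votes):
    """Return ('accept', label) on a supermajority, else ('escalate', None).

    `votes` is a list of eight 'A'/'B' decisions sampled from the LLM.
    """
    assert len(votes) == 8
    label, count = Counter(votes).most_common(1)[0]
    if count >= SUPERMAJORITY:
        return ("accept", label)
    return ("escalate", None)

# A 7-1 split resolves immediately; a 5-3 split is routed to Stage 3.
print(route_initial(["A"] * 7 + ["B"]))      # ('accept', 'A')
print(route_initial(["A"] * 5 + ["B"] * 3))  # ('escalate', None)
```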

When the initial eight votes are not decisive (splits of 6-2, 5-3, or 4-4), Stage 3 escalates the instance by issuing three additional API calls, each again generating eight votes, for a total of 32 votes. The majority of these 32 votes determines the final label. This escalation path accounts for about 21% of cases and achieves 67% accuracy; across all instances, the average number of API calls is 1.78.
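The 1.78 average follows directly from the pathway proportions, assuming escalated and tied instances each incur the initial call plus the three escalation calls (the symbolic tie-breaker itself makes no API calls):

```python
# Back-of-envelope check of the reported 1.78 average API calls per instance.
# Supermajority cases (74%) use one call; escalated cases (21%) and perfect
# ties (5%) each use the initial call plus three escalation calls.
p_super, p_escalate, p_tie = 0.74, 0.21, 0.05
avg_calls = p_super * 1 + (p_escalate + p_tie) * 4
print(round(avg_calls, 2))  # 1.78
```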

The remaining 5% of instances result in a perfect tie after escalation (16-16). These are handed to Stage 4, a symbolic tie-breaker that combines five narrative-theoretic similarity signals: (1) TF-IDF lexical similarity (weight 0.49), (2) story-grammar similarity based on Propp and Todorov phases (weight 0.40), (3) semantic similarity from sentence-transformer embeddings (weight 0.08), (4) tension-curve correlation derived from sentiment and subjectivity (weight 0.02), and (5) event-chain similarity using the longest common subsequence of action verbs (weight 0.01). The weights were optimized via differential evolution on a synthetic training set of 1,900 triplets, achieving 99.5% accuracy on a held-out synthetic validation split. On real development ties, however, the symbolic module reaches only 61% accuracy: above chance, but far from the synthetic figure, suggesting it is useful as a last-resort tie-breaker on maximally ambiguous cases rather than as a general classifier.
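The tie-breaker's decision rule can be sketched as a weighted sum over the five signals. Only the weights come from the paper; the function name and the assumption that each signal is a precomputed anchor-candidate similarity in [0, 1] are illustrative:

```python
# Stage 4 sketch: weighted combination of the five narrative signals.
# Weights are the differential-evolution-fitted values reported in the paper.
WEIGHTS = {
    "tfidf":       0.49,  # TF-IDF lexical similarity
    "grammar":     0.40,  # Propp/Todorov story-grammar phase similarity
    "semantic":    0.08,  # sentence-transformer embedding similarity
    "tension":     0.02,  # tension-curve (sentiment/subjectivity) correlation
    "event_chain": 0.01,  # LCS over action-verb chains
}

def break_tie(signals_a, signals_b):
    """Pick the candidate whose weighted similarity to the anchor is higher.

    `signals_a`/`signals_b` map signal names to anchor-candidate similarity
    scores in [0, 1] for candidates A and B respectively.
    """
    score_a = sum(WEIGHTS[k] * signals_a[k] for k in WEIGHTS)
    score_b = sum(WEIGHTS[k] * signals_b[k] for k in WEIGHTS)
    return "A" if score_a >= score_b else "B"
```

The heavy skew toward the lexical and story-grammar signals (0.89 of the total mass) foreshadows the domain-shift problem discussed below: the optimizer rewarded cues that dominate the synthetic data.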

Empirically, CascadeMind placed 11th out of 47 teams on the official test set with 72.75% accuracy. Detailed analysis shows that most of the performance gain stems from the confidence-aware routing rather than the symbolic component. The super-majority path alone provides a strong baseline; escalation adds modest improvements; the symbolic tie-breaker contributes only when the neural votes are maximally ambiguous.

The authors discuss several implications. First, vote consensus is an effective uncertainty estimator, aligning with the selective-prediction literature that allows models to abstain or defer when confidence is low. Second, the cascade reduces computational cost because the expensive symbolic processing is invoked for only a tiny fraction of inputs. Third, the discrepancy between synthetic training performance (near-perfect) and real-world tie-breaker performance highlights a domain shift: the synthetic data over-emphasizes lexical and structural cues, whereas real test items require richer world knowledge and more flexible event representations. The minimal weight for the event-chain signal (1%) suggests that the current verb-matching approach is too brittle; future work could explore fuzzy matching or semantic role labeling to capture plot similarity more robustly.

In conclusion, CascadeMind demonstrates that “knowing when you don’t know” can be more valuable than simply adding more sophisticated representations. By leveraging LLM vote distributions for uncertainty estimation and applying a tiered routing strategy, the system achieves competitive results with modest resources. The symbolic ensemble, while currently a secondary contributor, offers a promising avenue for further enhancement if richer narrative knowledge and more expressive event modeling are incorporated.

