A Generative Model for Joint Multiple Intent Detection and Slot Filling


In task-oriented dialogue systems, spoken language understanding (SLU) is a critical component consisting of two sub-tasks: intent detection and slot filling. Most existing methods focus on single-intent SLU, where each utterance carries exactly one intent. In real-world scenarios, however, users often express multiple intents in a single utterance, which poses a challenge for existing dialogue systems and datasets. In this paper, we propose a generative framework that addresses multiple intent detection and slot filling simultaneously. In particular, an attention-over-attention decoder handles the variable number of intents and the interference between the two sub-tasks by incorporating an inductive bias into the multi-task learning process. In addition, we construct two new multi-intent SLU datasets from single-intent utterances by exploiting the next sentence prediction (NSP) head of BERT. Experimental results demonstrate that our attention-over-attention generative model achieves state-of-the-art performance on two public datasets, MixATIS and MixSNIPS, as well as on our constructed datasets.


💡 Research Summary

Spoken Language Understanding (SLU) is a cornerstone of task‑oriented dialogue systems, traditionally split into Intent Detection (classifying the user’s goal) and Slot Filling (labeling token spans with semantic roles). While most prior work assumes a single intent per utterance, real‑world interactions frequently contain multiple intents, e.g., “Add this song to my playlist and play it”. Existing multi‑intent datasets such as MixATIS and MixSNIPS are constructed by randomly concatenating single‑intent sentences, resulting in unnatural examples, and current models either rely on handcrafted interaction modules or train from scratch, failing to fully exploit large pre‑trained language models.

The paper introduces GEMIS (Generative Model for Multi‑Intent SLU), a unified generative framework that recasts joint intent detection and slot filling as a single sequence‑to‑sequence (seq2seq) task. The input is the raw utterance; the target is a structured token sequence that first lists all intents (enclosed in special tokens) followed by slot triplets encoded as <start‑position> <end‑position> <slot‑type>. By using a pre‑trained BART encoder‑decoder, GEMIS inherits the rich contextual knowledge of large language models without adding task‑specific randomly initialized parameters.
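The described target format can be sketched as a simple linearization step. The exact special tokens and positional encoding are not given in this summary, so the names below (`<intent>`, `</intent>`) are illustrative assumptions:

```python
def linearize(intents, slots):
    """Build a seq2seq target string for multi-intent SLU.

    Hypothetical rendering of the paper's target format: all intents first,
    wrapped in assumed special tokens, followed by one
    (start-position, end-position, slot-type) triplet per slot span.
    """
    tokens = ["<intent>"] + list(intents) + ["</intent>"]
    for start, end, slot_type in slots:
        tokens += [str(start), str(end), slot_type]
    return " ".join(tokens)


# "add this song to my playlist" with one intent and one slot span
target = linearize(["AddToPlaylist"], [(2, 2, "music_item")])
# → "<intent> AddToPlaylist </intent> 2 2 music_item"
```

Because intents are emitted before slot triplets, the decoder has already generated every intent token by the time slot generation begins, which is what makes the intent-conditioned attention described next possible.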

A key innovation is the Attention‑over‑Attention (AoA) decoder layer. In a standard Transformer decoder, cross‑attention attends to encoder outputs. GEMIS replaces this with a two‑step attention: (1) conventional cross‑attention to capture token‑level context, and (2) a second attention that explicitly focuses on the already generated intent tokens. The two attention maps are combined via a weighted sum, allowing the intent representation to directly guide the generation of slot tokens. This design naturally handles a variable number of intents and mitigates interference between the two sub‑tasks.
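The two-step combination can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: single-head dot-product attention and a fixed scalar mixing weight `alpha` are simplifying assumptions standing in for the learned weighted sum:

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def aoa_attention(query, enc_states, intent_states, alpha=0.5):
    """Toy attention-over-attention step (assumed single-head, scalar mix).

    query:         (T, d) decoder states generating slot tokens
    enc_states:    (S, d) encoder outputs (token-level context)
    intent_states: (I, d) representations of already-generated intent tokens
    """
    # (1) conventional cross-attention over all encoder states
    ctx = softmax(query @ enc_states.T) @ enc_states
    # (2) second attention restricted to the generated intent tokens,
    #     letting intent information directly guide slot generation
    intent_ctx = softmax(query @ intent_states.T) @ intent_states
    # weighted sum of the two attention results
    return alpha * ctx + (1 - alpha) * intent_ctx
```

Because the second attention map is computed only over intent tokens, its size adapts automatically to however many intents the decoder has emitted, which is how the design accommodates a variable intent count.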

To address the scarcity of high‑quality multi‑intent data, the authors leverage BERT’s Next Sentence Prediction (NSP) head. They compute the NSP probability for every pair of single‑intent sentences; only pairs with high continuity scores are concatenated, yielding two new datasets: MultiATIS (≈20 k samples) and MultiSNIPS (≈50 k samples). Compared with the random concatenation approach, these datasets preserve natural discourse flow and better reflect real user behavior.
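The filtering logic amounts to scoring every ordered sentence pair and keeping only high-continuity concatenations. The sketch below abstracts the scorer into a callable; in practice `nsp_score` would wrap BERT's NSP head (e.g. `BertForNextSentencePrediction` in Hugging Face Transformers), and the 0.9 threshold is an assumed value, not one reported in the paper:

```python
from itertools import permutations


def build_multi_intent_pairs(sentences, nsp_score, threshold=0.9):
    """Concatenate single-intent sentences with high NSP continuity.

    `nsp_score(a, b)` is assumed to return the probability that `b`
    naturally follows `a`; only pairs above `threshold` are kept, so the
    resulting utterances preserve discourse flow better than random
    concatenation.
    """
    pairs = []
    for a, b in permutations(sentences, 2):
        if nsp_score(a, b) >= threshold:
            pairs.append(a + " " + b)
    return pairs
```

Ordering matters here: `permutations` scores both (a, b) and (b, a), since a pair may read naturally in one direction only.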

Experiments are conducted on the public MixATIS/MixSNIPS benchmarks and the newly built MultiATIS/MultiSNIPS. Baselines include Slot‑Gated models, Co‑Interactive Transformers, and graph‑neural‑network approaches. Evaluation metrics cover Intent Accuracy, Slot F1, and Joint Accuracy (both intents and slots correct). GEMIS consistently outperforms all baselines, achieving 2–5 percentage‑point gains in Joint Accuracy on the original mixes and even larger improvements (up to 10 pp) on utterances containing three or more intents. Ablation studies reveal that removing AoA degrades performance by ~3 pp, and omitting the pointer network harms slot boundary detection, confirming the contribution of each component. Moreover, when the same GEMIS architecture is trained on the original Mix datasets versus the NSP‑filtered Multi datasets, the latter always yields higher scores, underscoring the importance of realistic data.
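The strictest of these metrics, Joint Accuracy, credits an utterance only when both the predicted intent set and every slot label are exactly right. A minimal sketch (representing each example as an `(intents, slot_labels)` pair, an assumed encoding):

```python
def joint_accuracy(preds, golds):
    """Fraction of utterances where intents AND slots are all correct.

    Each element of `preds`/`golds` is an (intents, slot_labels) pair;
    intents are compared as sets (order-insensitive), slot label
    sequences must match exactly.
    """
    correct = sum(
        1
        for (p_intents, p_slots), (g_intents, g_slots) in zip(preds, golds)
        if set(p_intents) == set(g_intents) and p_slots == g_slots
    )
    return correct / len(golds)
```

Because a single wrong slot tag or missed intent zeroes out the whole example, joint accuracy is typically far below intent accuracy or slot F1, which is why the reported 2–5 pp gains on it are meaningful.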

The paper’s contributions are threefold: (1) a generative seq2seq framework that unifies intent detection and slot filling, (2) the AoA decoder, which injects intent information into slot generation without extra task‑specific modules, and (3) an automated, NSP‑driven pipeline for constructing high‑quality multi‑intent SLU corpora. Limitations include the error propagation inherent to autoregressive generation, which beam search only partially mitigates, and the fact that NSP‑filtered data, while more natural than random concatenations, still differ from authentic multi‑intent user logs. Future work may explore reinforcement‑learning‑based decoding, domain adaptation with real dialogue logs, and extending the approach to multilingual or multimodal settings.

In summary, GEMIS demonstrates that a carefully designed generative model, combined with a principled data construction strategy, can substantially advance multi‑intent SLU performance, especially as the number of intents per utterance grows, paving the way for more flexible and robust conversational agents.

