Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization


This paper tackles the reduction of redundant repeating generation that is often observed in RNN-based encoder-decoder models. Our basic idea is to jointly estimate the upper-bound frequency of each target vocabulary word in the encoder and control the output words based on that estimation in the decoder. Our method shows significant improvement over a strong RNN-based encoder-decoder baseline and achieves the best results on an abstractive summarization benchmark.


💡 Research Summary

The paper addresses a pervasive problem in neural abstractive summarization systems that use recurrent neural network (RNN) encoder‑decoder architectures: the tendency to generate redundant, repeated phrases or words. While similar issues have been studied in neural machine translation (NMT) under the “coverage” paradigm, the authors argue that coverage is ill‑suited for summarization because summaries are highly compressed, often constrained to a strict length budget (e.g., 75 bytes in DUC‑2004). Consequently, a different mechanism is needed to prevent the decoder from wasting precious output slots on repetitions.

Core Idea
The authors propose a Word‑Frequency‑Estimation (WFE) sub‑model that predicts, during encoding, an upper bound on how many times each target‑vocabulary token may appear in the final summary. This upper bound is then incorporated as a prior in the decoder’s scoring function, effectively cutting off any word whose predicted quota has been exhausted. The WFE consists of two parallel streams:

  1. Frequency stream (ˆr) – a ReLU‑activated linear projection of the encoder’s hidden states, yielding a non‑negative real‑valued vector that estimates the maximum count for each word.
  2. Occurrence gate (ˆg) – a sigmoid‑activated linear projection that predicts whether a word should appear at all (values near 0 suppress the word, values near 1 allow it).

The final estimate is the element‑wise product ˆa = ˆr ⊙ ˆg. Because ˆg is binary‑like, it can be interpreted as a “fertility” gate similar to those used in coverage models, while ˆr focuses purely on the magnitude of the count.
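The two streams above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact parameterization: the pooling of encoder states into a single vector, the weight names (`W_r`, `b_r`, `W_g`, `b_g`), and the dimensions are all assumptions made for clarity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_word_frequency(h_enc, W_r, b_r, W_g, b_g):
    """WFE sketch: per-word upper-bound counts over the target vocabulary.

    h_enc : pooled encoder representation, shape (d,)  [assumed pooling]
    W_r, W_g : projection matrices, shape (V, d); b_r, b_g : biases, shape (V,)
    Returns a_hat, shape (V,).
    """
    r_hat = relu(W_r @ h_enc + b_r)      # frequency stream: non-negative count estimates
    g_hat = sigmoid(W_g @ h_enc + b_g)   # occurrence gate: soft 0/1 per word
    return r_hat * g_hat                 # element-wise product: a_hat = r_hat ⊙ g_hat
```

Because the gate saturates toward 0 or 1, multiplying it into the frequency stream lets the model suppress a word entirely without forcing the frequency stream itself to learn zeros.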

Integration with Decoding
During beam‑search decoding, the standard log‑likelihood term (cumulative log‑probability plus log‑softmax of the current output logits) is augmented with an additional term ˜a_j = log(ClipReLU1(˜r_j) ⊙ ˆg). Here ˜r_j is a dynamically updated copy of ˆr that is decremented by the one‑hot vector of the word generated at the previous step. The ClipReLU1 function caps values to the interval [0, 1], so once a word's remaining quota ˜r_j drops to zero its log term diverges to −∞ and the word is effectively cut off from further generation.
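The per-step scoring can be sketched as follows. The function names, the `eps` floor that stands in for −∞, and the quota-decrement helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def clip_relu1(x):
    # ClipReLU1: cap values to the interval [0, 1], i.e. max(0, min(1, x))
    return np.clip(x, 0.0, 1.0)

def wfe_log_prior(r_tilde, g_hat, eps=1e-10):
    # a_tilde_j = log(ClipReLU1(r_tilde_j) ⊙ g_hat); eps keeps log finite
    # when a word's remaining quota is exhausted (value near log(eps))
    return np.log(clip_relu1(r_tilde) * g_hat + eps)

def decode_step_score(log_softmax, r_tilde, g_hat):
    # score added to the cumulative log-likelihood for each beam hypothesis
    return log_softmax + wfe_log_prior(r_tilde, g_hat)

def consume_word(r_tilde, word_id):
    # subtract the one-hot vector of the word emitted at the previous step
    r_new = r_tilde.copy()
    r_new[word_id] -= 1.0
    return r_new
```

A word whose quota is intact (˜r_j ≥ 1) receives no penalty, since ClipReLU1 maps it to 1 and log 1 = 0; a word whose quota is spent receives a penalty large enough that beam search will not select it again.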

