Shortcut Sequence Tagging
Deep stacked RNNs are usually hard to train. Adding shortcut connections across different layers is a common way to ease the training of stacked networks. However, extra shortcuts make the recurrent step more complicated. To simplify the stacked architecture, we propose a framework called the shortcut block, which is a marriage of the gating mechanism and shortcuts, while discarding the self-connected part of the LSTM cell. We present extensive empirical experiments showing that this design makes training easy and improves generalization. We propose various shortcut block topologies and compositions to explore their effectiveness. Based on this architecture, we obtain a 6% relative improvement over the state of the art on the CCGbank supertagging dataset. We also obtain comparable results on the POS tagging task.
💡 Research Summary
The paper addresses the well‑known difficulty of training deep stacked recurrent neural networks (RNNs), especially stacked Long Short‑Term Memory (LSTM) networks, which suffer from vanishing or exploding gradients as the number of vertical layers grows. To alleviate this problem, the authors propose a novel architectural unit called the “shortcut block.” The key idea is to remove the self‑connected memory cell (the cₜ₋₁ → cₜ recurrence) that is present in standard LSTM cells and replace it with direct skip connections from lower layers to higher layers. These skip connections are gated, meaning that a gate gₜˡ decides whether the information hₜˡ⁻¹ from a lower layer should be incorporated into the current layer’s computation.
Formally, each shortcut block computes:

mₜˡ = iₜˡ ⊙ sₜˡ + gₜˡ ⊙ hₜˡ⁻¹
hₜˡ = oₜˡ ⊙ tanh(mₜˡ) + gₜˡ ⊙ hₜˡ⁻¹

where iₜˡ, oₜˡ, and sₜˡ are the usual input gate, output gate, and candidate activation of an LSTM, while gₜˡ is a newly introduced shortcut gate. The gate can be deterministic (e.g., a sigmoid of a linear projection) or stochastic (a Bernoulli variable whose probability may be learned). By discarding the self‑connection, the model no longer needs to store a separate cell state, reducing memory consumption and simplifying back‑propagation. The gated skip path allows gradients to bypass intermediate layers, mitigating the depth‑related gradient attenuation.
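One time step of a shortcut block could be sketched as below. The two combination equations follow the summary directly; how the gates are parameterized (a single fused projection over the layer input, the same layer's previous hidden state, and the lower layer's state) is an assumption, not the authors' published code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shortcut_block_step(x_t, h_prev_time, h_lower, W, U, V, b, d):
    """One time step of a shortcut block (a sketch, not the paper's code).

    x_t         -- layer input at time t
    h_prev_time -- this layer's hidden state at time t-1 (temporal recurrence)
    h_lower     -- h_t^{l-1}, the lower layer's state feeding the gated skip
    W, U, V, b  -- fused projections/bias for the i, o, g, s pre-activations
    d           -- hidden size
    """
    # Fused projection: 4*d pre-activations, one slice per gate (assumed layout).
    z = W @ x_t + U @ h_prev_time + V @ h_lower + b
    i = sigmoid(z[0 * d:1 * d])   # input gate
    o = sigmoid(z[1 * d:2 * d])   # output gate
    g = sigmoid(z[2 * d:3 * d])   # shortcut gate (deterministic sigmoid variant)
    s = np.tanh(z[3 * d:4 * d])   # candidate activation

    # No self-connected cell state: the gated skip from the lower layer
    # replaces the c_{t-1} -> c_t recurrence of a standard LSTM.
    m = i * s + g * h_lower
    h = o * np.tanh(m) + g * h_lower
    return h
```

Note that the gradient path through `g * h_lower` reaches the lower layer without passing through any squashing nonlinearity, which is the mechanism credited with easing deep training.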
The authors explore several topologies for arranging shortcut blocks within a stacked architecture:
- Type 1 – the output of the first hidden layer is directly connected to all higher layers.
- Type 2 and Type 3 – blocks with skip span 1 and 2, respectively, forming a chain of gated shortcuts.
- Type 4 and Type 5 – nested shortcut blocks that combine multiple spans.
Through extensive experiments, they find that Type 3 (span 2) combined with deterministic sigmoid gates yields the best performance, while stochastic gates tend to be less stable during training.
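For the chain topologies (Type 2 and Type 3), the wiring can be illustrated with a small helper; this assumes "span" means how many layers below the current one the shortcut source sits, with layer 0 standing in for the network input when the span would reach below the stack:

```python
def skip_sources(num_layers, span):
    """Map each layer l (1-based) to the lower layer whose hidden state
    feeds its shortcut gate. 'span' is assumed to be the distance the
    shortcut jumps; layer 0 denotes the network input."""
    return {l: max(l - span, 0) for l in range(1, num_layers + 1)}

# Type 2 (span 1) vs. Type 3 (span 2) on a 5-layer stack:
print(skip_sources(5, 1))  # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
print(skip_sources(5, 2))  # {1: 0, 2: 0, 3: 1, 4: 2, 5: 3}
```

Type 1 (first layer fanned out to all higher layers) and the nested Types 4 and 5 combine multiple such mappings and are not captured by this single-span sketch.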
For input representation, the model concatenates three sources of information: pre‑trained 100‑dimensional GloVe word embeddings, 5‑dimensional character embeddings (derived from the first and last five characters of each word), and 5‑dimensional capitalization embeddings. A context window of size three (the target token plus one token on each side) is applied, resulting in an input vector of dimension 465. This vector is fed into a stack of bidirectional LSTM layers, each also of size 465, ensuring that the dimensionality of the hidden states matches that of the input.
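The stated dimension of 465 is consistent with 155 features per token (100 word + 10×5 character + 5 capitalization) times a window of three; the reading that the character features cover ten characters (first five plus last five) at 5 dimensions each is an assumption inferred from that arithmetic:

```python
WORD_DIM = 100             # pre-trained GloVe word embedding
N_CHARS, CHAR_DIM = 10, 5  # first five + last five characters, 5 dims each (assumed split)
CAP_DIM = 5                # capitalization embedding
WINDOW = 3                 # target token plus one neighbor on each side

token_dim = WORD_DIM + N_CHARS * CHAR_DIM + CAP_DIM
input_dim = WINDOW * token_dim
print(token_dim, input_dim)  # 155 465
```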
Training details include Gaussian initialization for non‑recurrent weights (scaled by 1/√fan‑in) and orthogonal initialization for recurrent weight matrices, following best practices for stable gradient flow. The model is evaluated on two sequence‑tagging tasks:
- CCG Supertagging – using the CCGbank dataset with 1,285 supertags. The standard split (sections 02‑21 for training, 00 for development, 23 for testing) is employed, with all digits mapped to ‘9’ and words lower‑cased. The shortcut‑block model achieves a relative error reduction of about 6% compared with the previous state of the art, establishing a new benchmark on this dataset.
- POS Tagging – using the Penn Treebank. The proposed architecture attains performance comparable to the best existing LSTM‑based systems, demonstrating that the shortcut block does not sacrifice accuracy on more conventional tagging tasks.
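The initialization recipe from the training details (Gaussian scaled by 1/√fan‑in for non‑recurrent weights, orthogonal matrices for recurrent weights) can be sketched as follows; the QR-based construction is a standard way to draw a random orthogonal matrix and is assumed here, since the summary does not specify the exact procedure:

```python
import numpy as np

def gaussian_init(fan_in, fan_out, seed=0):
    """Non-recurrent weights: zero-mean Gaussian with std 1/sqrt(fan_in)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

def orthogonal_init(n, seed=0):
    """Recurrent weights: orthogonal matrix via QR decomposition of a
    square Gaussian matrix (standard recipe, assumed)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    # Multiply each column by the sign of the matching diagonal of R
    # so the factorization is deterministic; orthogonality is preserved.
    return q * np.sign(np.diag(r))
```

Orthogonal recurrent matrices keep the spectrum of the temporal Jacobian near 1 at initialization, which complements the gated skip path in keeping gradients from vanishing.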
The paper concludes that replacing the internal self‑connection of LSTM cells with gated skip connections simplifies the architecture, reduces memory requirements, and most importantly, facilitates the training of much deeper stacked recurrent networks. The systematic exploration of different shortcut topologies and gate designs provides practical guidance for future work on deep RNNs in various sequence modeling domains.