Encoding-based Memory Modules for Recurrent Neural Networks
Introduction

TODO

Related Work

TODO

Model

In the following discussion, we omit biases from the equations to improve the clarity of the presentation.

Linear Memory Networks (LMN)

A Linear Memory Network (LMN) is a variant of the Elman RNN with a linear memory update. The model equations are:

\begin{eqnarray*}
 \vh^t &=& \sigma(\mW_{xh} \vx^t + \mW_{mh} \vm^{t-1}) \\
 \vm^t &=& \mW_{hm} \vh^t + \mW_{mm} \vm^{t-1}
\end{eqnarray*}

The LMN is equivalent to an Elman RNN in terms of expressiveness, but it does not suffer from the vanishing gradient problem whenever $`W_{mm}`$ is an orthogonal matrix, a property that can be easily enforced (((CITE))). This is a desirable property for a model that must capture long-term dependencies.
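The single-step update above can be sketched as follows. This is a minimal illustration, not the reference implementation; the function name and the orthogonal initialization via QR decomposition are assumptions made here for concreteness.

```python
import numpy as np

def lmn_step(x, m_prev, W_xh, W_mh, W_hm, W_mm):
    """One LMN step: nonlinear hidden state, linear memory update.

    Biases are omitted, matching the equations in the text."""
    h = np.tanh(W_xh @ x + W_mh @ m_prev)   # functional component
    m = W_hm @ h + W_mm @ m_prev            # linear memory component
    return h, m

# An orthogonal recurrent memory matrix preserves the norm of the
# memory state, avoiding vanishing gradients along the linear path.
rng = np.random.default_rng(0)
N_x, N_h, N_m = 3, 5, 4
W_mm, _ = np.linalg.qr(rng.standard_normal((N_m, N_m)))  # orthogonal Q factor
W_xh = rng.standard_normal((N_h, N_x)) * 0.1
W_mh = rng.standard_normal((N_h, N_m)) * 0.1
W_hm = rng.standard_normal((N_m, N_h)) * 0.1

m = np.zeros(N_m)
for t in range(10):
    h, m = lmn_step(rng.standard_normal(N_x), m, W_xh, W_mh, W_hm, W_mm)
```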

Multiscale LMN

Time series can be subsampled to obtain shorter sequences. Long-term dependencies become shorter in the subsampled sequence and are therefore easier to learn. The memory $`m^t \in \mathbb{R}^{g N_m}`$ of a Multiscale LMN (MS-LMN) is organized into $`g`$ blocks of size $`N_m`$. Each module updates its memory with a decreasing frequency, obtained by subsampling the hidden state sequence $`h^1, \ldots, h^t`$ with an exponential sampling rate $`T_i = 2^i`$ for block $`i = 0, \ldots, g-1`$. The slower modules are connected to the faster ones, but not vice versa. This constraint ensures that the slow modules only see the subsampled data. The update for each module $`k = 0, \ldots, g-1`$ is:

\begin{eqnarray*}
 \vh^t_k &=& \sigma(\mW^k_{xh} \vx^t + \sum_{j=k}^{g-1} \mW_{m_j h} \vm^{t-1}_j) \\
 \vm^t_k &=& \begin{cases} 
 \mW^k_{hm} \vh^t_k + \mW^k_{mm} \vm^{t-1}_k & t \bmod 2^k = 0 \\
 \vm^{t-1}_k & \text{otherwise}
 \end{cases}
\end{eqnarray*}
Figure 1: Multiscale LMN architecture. Dashed edges represent connections that are active only when the corresponding output nodes are active ($`t \bmod T_i = 0`$).
Figure 2: Structure of the block matrix representing the recurrent weights for $`i_t = 1`$. Darker blocks represent the active modules ($`t \bmod T_i = 0`$), while lighter blocks represent the disabled modules ($`t \bmod T_i \neq 0`$).

A schematic representation of the architecture is shown in Figure 1. At each timestep the fastest module receives the whole memory $`\vm^t`$, while each memory block $`\vm_k^t`$ is updated with a different frequency and receives input only from the slower modules.

Since the modules are sorted by increasing update period $`1, 2, 4, \ldots`$, at each timestep we only update the memory blocks with a contiguous set of indices from $`0`$ to $`i_{t} = \min \{i \mid t \bmod 2^i = 0 \wedge t \bmod 2^{i+1} \neq 0\}`$.

Using this property the memory update for all the blocks can be efficiently implemented with two matrix multiplications, as shown in Figure 2.
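The scheme above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; the function names and matrix layout (active blocks stacked first) are assumptions.

```python
import numpy as np

def active_blocks(t, g):
    """Return i_t, the index of the slowest module updated at step t.

    Modules 0..i_t are exactly those with t mod 2^k == 0."""
    i = 0
    while i + 1 < g and t % (2 ** (i + 1)) == 0:
        i += 1
    return i

def multiscale_memory_update(t, h, m, W_hm, W_mm, N_m, g):
    """Update only the active memory blocks with two matrix products.

    W_hm has shape (g*N_m, N_h); W_mm is the block-diagonal recurrent
    matrix of shape (g*N_m, g*N_m)."""
    n = (active_blocks(t, g) + 1) * N_m  # rows covered by the active blocks
    m_new = m.copy()
    # two matmuls restricted to the contiguous active sub-matrices
    m_new[:n] = W_hm[:n] @ h + W_mm[:n, :n] @ m[:n]
    return m_new

# Example: g = 3 modules of size N_m = 2, hidden size N_h = 4.
rng = np.random.default_rng(0)
g, N_m, N_h = 3, 2, 4
W_hm = rng.standard_normal((g * N_m, N_h))
W_mm = np.kron(np.eye(g), rng.standard_normal((N_m, N_m)))  # block-diagonal
h = rng.standard_normal(N_h)
m = rng.standard_normal(g * N_m)
m2 = multiscale_memory_update(2, h, m, W_hm, W_mm, N_m, g)  # t = 2: blocks 0, 1 active
```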

Incremental Training

In this section we propose an incremental training scheme that gradually expands the network with additional memory blocks trained to capture progressively longer dependencies.

The algorithm starts with an empty model. Each module is trained separately and then added to the network. The addition of a new module consists of three steps:

learn dependencies of length $`2^i`$
An LMN model is trained on sequences sampled every $`2^i`$ timesteps, where $`i`$ is the index of the current module.

initialize new block
A new module is added to the model, initialized with the weights of the newly trained LMN.

finetune the entire model
After the initialization of the new module, the entire model is finetuned with BPTT.

Using this approach the model can gradually learn longer dependencies. Each new module is first trained separately to capture the long dependencies at its timescale. After this preliminary step the module is combined with the previous network and the resulting architecture is finetuned. After the finetuning phase a new module can be trained, until the network comprises $`g`$ modules.
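The three steps above can be sketched as the following training loop. This is only a structural skeleton under stated assumptions: `train_module` and `finetune` are hypothetical placeholders standing in for the actual BPTT-based routines, which are not specified here.

```python
import numpy as np

def incremental_training(seq, g, train_module, finetune):
    """Skeleton of the incremental schedule: one new module per stage.

    `train_module` and `finetune` are placeholders for the actual
    BPTT-based training routines."""
    model = []  # list of trained memory modules
    for i in range(g):
        # 1) learn dependencies of length 2^i on the subsampled sequence
        subsampled = seq[:: 2 ** i]
        new_module = train_module(subsampled, i)
        # 2) initialize the new block with the freshly trained weights
        model.append(new_module)
        # 3) finetune the whole network with BPTT on the full sequence
        model = finetune(model, seq)
    return model

# Trivial stubs to illustrate the data flow of the schedule.
trained = incremental_training(
    np.arange(16), g=3,
    train_module=lambda s, i: {"scale": 2 ** i, "len": len(s)},
    finetune=lambda model, s: model,
)
```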


Experimental Results

In this section we show the experimental results. The MS-LMN is compared against the CW-RNN and LSTM. Each configuration tested comprises a single hidden layer. All the models are optimized using Adam (((CITE))). Additional details about the experimental setup can be found in the supplementary material.

Sequence Generation

The first experiment is a synthetic task on audio sequences. We extracted a sequence of 300 timesteps by sampling an audio signal at 44.1 kHz, starting at a random position of the file. The sequence elements are normalized to lie in the range $`[-1, +1]`$, and the resulting sequence is used as the output target. The model receives no input and must output at each timestep the corresponding element of the target sequence. For each architecture, several configurations are trained with a different number of parameters in $`\{ 100, 250, 500, 1000 \}`$ by adjusting the number of hidden units. The models are trained with a normalized MSE loss. The CW-RNN and MS-LMN use 9 modules with exponential clock rates $`\{ 1, \ldots, 2^8 \}`$.
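The exact normalization of the MSE loss is not spelled out above; a common choice, assumed here for illustration, is to divide the MSE by the variance of the target sequence.

```python
import numpy as np

def normalized_mse(pred, target):
    """MSE divided by the target variance (one common definition;
    the exact normalization used in the experiment is an assumption)."""
    return np.mean((pred - target) ** 2) / np.var(target)

# Stand-in for the task setup: a 300-step target in [-1, 1],
# which the model must reproduce at each timestep with no input.
rng = np.random.default_rng(0)
target = np.tanh(rng.standard_normal(300))  # placeholder for the audio sample
pred = np.zeros_like(target)                # a trivial constant predictor
loss = normalized_mse(pred, target)
```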

Figure 4 shows the results of the experiment. The MS-LMN reaches an error two orders of magnitude smaller than the competing models, improving even on the result of the CW-RNN, which uses a similar mechanism with exponential clock rates.

Figure 4: Results on the Sequence Generation task for different models. Dashed blue lines represent the target sequence, while solid green lines show the model predictions.

TIMIT Spoken Word Classification

The second experiment is a classification task on audio sequences. The dataset is extracted from TIMIT (((CITE))) following the experimental setup in (((CITE))). From the raw sentences of the original dataset we extracted audio sequences corresponding to 25 different words. The words are chosen to share a similar suffix, so that the model must learn long-term dependencies to correctly classify each sample. For every word there are 7 examples from different speakers, partitioned into 5 samples for training and validation and 2 samples for test, for a total of 175 sequences. Since the work of (((CITE))) does not provide the training/validation splits, we used a different split. We noticed that, due to the small size of the dataset, the performance can vary across splits, and we were not able to reproduce their results. We provide our training/validation split in the supplementary material to allow other researchers to compare against our results.

Conclusion

TODO