Encoding-based Memory Modules for Recurrent Neural Networks
Introduction
TODO
Related Work
TODO
Model
In the following discussion, we omit biases from the equations to improve the clarity of the presentation.
Linear Memory Networks (LMN)
A Linear Memory Network (LMN) is a variant of the Elman RNN with a linear memory update. The model equations are:
\begin{eqnarray*}
\vh^t &=& \sigma(\mW_{xh} \vx^t + \mW_{mh} \vm^{t-1}) \\
\vm^t &=& \mW_{hm} \vh^t + \mW_{mm} \vm^{t-1}
\end{eqnarray*}
where $`\vh^t`$ is the hidden state, $`\vm^t`$ is the memory state, and $`\sigma`$ is a nonlinear activation function.
The LMN is equivalent to an Elman RNN in terms of the functions it can compute, but it does not suffer from the vanishing gradient when $`\mW_{mm}`$ is an orthogonal matrix, a property that can be easily enforced (((CITE))). This is desirable for a model that must capture long-term dependencies.
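To make the update concrete, the following is a minimal NumPy sketch of a single LMN step with an orthogonal memory matrix. This is an illustration, not the authors' implementation; the choice of `tanh` as the activation and the QR-based orthogonal initialization are assumptions.

```python
import numpy as np

def orthogonal(n, rng):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix
    (an assumed initialization; any orthogonal parametrization works)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Fix the column signs so the factorization is unique.
    return q * np.sign(np.diag(r))

def lmn_step(x, m, W_xh, W_mh, W_hm, W_mm):
    """One LMN step: nonlinear hidden update, purely linear memory update."""
    h = np.tanh(W_xh @ x + W_mh @ m)   # hidden state (tanh assumed)
    m_new = W_hm @ h + W_mm @ m        # linear memory update
    return h, m_new
```

Because the memory update is linear, gradients through the memory are multiplied only by $`\mW_{mm}`$, whose orthogonality keeps their norm constant.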
Multiscale LMN
Time series can be subsampled to obtain shorter sequences. Long-term dependencies become shorter in the subsampled sequence and are therefore easier to learn. The memory $`m^t \in \mathbb{R}^{g N_m}`$ of a Multiscale LMN (MS-LMN) is organized into $`g`$ blocks of size $`N_m`$. Each module updates its memory with a decreasing frequency, subsampling the hidden state sequence $`h^1, ..., h^t`$ with an exponential sampling period $`T_i = 2^i`$ for block $`i = 0, ..., g-1`$. The slower modules are connected to the faster ones, but not vice-versa. This constraint ensures that the slow modules only see the subsampled data. The update for each module $`k = 0, ..., g-1`$ is:
\begin{eqnarray*}
\vh^t_k &=& \sigma(\mW^k_{xh} \vx^t + \sum_{j=k}^{g-1} \mW_{m_j h} \vm^{t-1}_j) \\
\vm^t_k &=& \begin{cases}
\mW^k_{hm} \vh^t_k + \mW^k_{mm} \vm^{t-1}_k & t \bmod 2^k = 0 \\
\vm^{t-1}_k & \text{otherwise}
\end{cases}
\end{eqnarray*}
A schematic representation of the architecture is shown in Figure 1. At each timestep, $`\vh^t`$ receives the whole memory $`\vm^t`$, while the memory blocks $`\vm_i^t`$ are updated with different frequencies and are connected only to slower modules.
Since the modules are sorted by increasing update period $`1, 2, 4, ...`$, at each timestep we only update the memory blocks with a contiguous set of indices from $`0`$ to $`i_{t} = \min \{i\ |\ t \bmod 2^i = 0 \wedge t \bmod 2^{i+1} \neq 0\}`$.
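Note that $`i_t`$ is simply the exponent of the largest power of two dividing $`t`$, which can be computed in constant time with a standard bit trick; a small sketch (the function name is ours):

```python
def largest_updated_block(t: int) -> int:
    """Index i_t of the slowest block updated at timestep t, i.e. the
    exponent of the largest power of 2 dividing t. Uses the identity
    t & -t == (lowest set bit of t) for positive integers."""
    assert t > 0
    return (t & -t).bit_length() - 1
```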
Using this property, the memory update for all the blocks can be efficiently implemented with two matrix multiplications, as shown in Figure 2.
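The idea can be sketched in NumPy as follows. This is an illustration under our own assumptions, not the authors' code: the per-module weights $`\mW^k_{hm}`$ and $`\mW^k_{mm}`$ are assumed to be stacked into block-diagonal matrices, and the per-module hidden states concatenated into a single vector, so that updating blocks $`0, ..., i_t`$ amounts to two matrix multiplications over a row slice.

```python
import numpy as np

def ms_memory_update(t, h, m, W_hm, W_mm, Nm):
    """Update only the first i_t + 1 memory blocks at timestep t.

    h:    concatenated per-module hidden states, shape (g * Nm,)
    m:    memory, shape (g * Nm,), g blocks of size Nm
    W_hm: block-diagonal stack of the W^k_{hm}, shape (g * Nm, g * Nm)
    W_mm: block-diagonal stack of the W^k_{mm}, shape (g * Nm, g * Nm)
    """
    i_t = (t & -t).bit_length() - 1   # slowest block updated at time t
    n = (i_t + 1) * Nm                # rows covering blocks 0 .. i_t
    m_new = m.copy()                  # slower blocks keep their old state
    # Two matrix multiplications restricted to the updated blocks.
    m_new[:n] = W_hm[:n] @ h + W_mm[:n] @ m
    return m_new
```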
Incremental Training
In this section we propose an incremental training scheme that gradually expands the network with additional memory blocks trained to capture long-term dependencies.
The algorithm starts with an empty model. Each module is trained separately and then added to the network. The addition of a new module consists of three steps:
learn dependencies of length $`2^i`$
An LMN model is trained on sequences sampled every $`2^i`$ timesteps,
where $`i`$ is the index of the current module.
initialize new block
A new module is added to the model with the weights of the
newly trained LMN.
finetune the entire model
After the initialization of the new module, the entire model is finetuned
with BPTT.
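The three steps above can be sketched as the following training loop. The helper functions are hypothetical placeholders for the corresponding training procedures, not part of the paper's notation:

```python
def incremental_training(train_module, merge, finetune, g):
    """Sketch of the incremental training loop. Hypothetical helpers:
    train_module(i) trains a standalone LMN on sequences subsampled
    every 2**i timesteps; merge(model, module) adds the module as a new
    memory block; finetune(model) runs BPTT on the full architecture."""
    model = None                        # start from an empty model
    for i in range(g):
        module = train_module(i)        # step 1: learn 2**i-length dependencies
        model = merge(model, module)    # step 2: initialize the new block
        model = finetune(model)         # step 3: finetune end-to-end with BPTT
    return model
```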
Using this approach, the model gradually learns longer dependencies. Each new module is first trained separately to capture the long-term dependencies at its timescale; it is then combined with the previous network and the resulting architecture is finetuned. This process is repeated until the network comprises $`g`$ modules.
Experimental Results
In this section we present the experimental results. The MS-LMN is compared against the CW-RNN and the LSTM. Each tested configuration comprises a single hidden layer. All models are optimized with Adam (((CITE))). Additional details on the experimental setup can be found in the supplementary material.
Sequence Generation
The first experiment is a synthetic task on audio sequences. We extracted a sequence of 300 timesteps by sampling an audio signal at 44.1 kHz, starting at a random position of the file. The sequence elements are normalized to lie in the range $`[-1, +1]`$ and the resulting sequence is used as the output target. The model is provided with no input and must output at each timestep the corresponding element of the target sequence. For each architecture, several configurations are trained with different numbers of parameters in $`\{ 100, 250, 500, 1000 \}`$ by adjusting the number of hidden units. The models are trained with a normalized MSE loss. The CW-RNN and MS-LMN use 9 modules with exponential clock rates $`\{ 1, \ldots , 2^8 \}`$.
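The normalization to $`[-1, +1]`$ is a standard linear rescaling; a minimal sketch (assuming min-max rescaling, which the paper does not specify):

```python
import numpy as np

def to_unit_range(x):
    """Rescale a sequence linearly so its minimum maps to -1 and its
    maximum maps to +1 (assumes the sequence is not constant)."""
    lo, hi = x.min(), x.max()
    return 2.0 * (x - lo) / (hi - lo) - 1.0
```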
Figure 4 shows the results of the experiment. The MS-LMN reaches an error two orders of magnitude smaller than the competing models, improving even over the CW-RNN, which uses a similar mechanism with exponential clock rates.
TIMIT Spoken Word Classification
The second experiment is a classification task on audio sequences. The dataset is extracted from TIMIT (((CITE))) following the experimental setup of (((CITE))). From the raw sentences of the original dataset we extracted audio sequences corresponding to 25 different words. The words are chosen to share a similar suffix, so the model must learn long-term dependencies to correctly classify each sample. For every word there are 7 examples from different speakers, partitioned into 5 samples for training and validation and 2 samples for test, for a total of 175 sequences. Since (((CITE))) does not provide the training/validation split, we used our own. We noticed that, due to the small size of the dataset, the performance can vary considerably across splits, and we were not able to reproduce their results. We provide our training/validation split in the supplementary material to allow other researchers to compare against our results.
Conclusion
TODO