Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences


Authors: Sakyasingha Dasgupta, Takayuki Yoshizumi, Takayuki Osogami

Sakyasingha Dasgupta, IBM Research - Tokyo, Email: sdasgup@jp.ibm.com
Takayuki Yoshizumi, IBM Research - Tokyo, Email: yszm@jp.ibm.com
Takayuki Osogami, IBM Research - Tokyo, Email: osogami@jp.ibm.com

Abstract—We introduce Delay Pruning, a simple yet powerful technique to regularize dynamic Boltzmann machines (DyBM). The recently introduced DyBM provides a particularly structured Boltzmann machine, as a generative model of a multi-dimensional time-series. This Boltzmann machine can have infinitely many layers of units but allows exact inference and learning based on its biologically motivated structure. DyBM uses the idea of conduction delays in the form of fixed-length first-in first-out (FIFO) queues, with a neuron connected to another via this FIFO queue, and spikes from a pre-synaptic neuron travel along the queue to the post-synaptic neuron with a constant period of delay. Here, we present Delay Pruning as a mechanism to prune the lengths of the FIFO queues (making them zero) by setting some delay lengths to one with a fixed probability, and finally selecting the best-performing model with fixed delays. The uniqueness of structure and a non-sampling-based learning rule in DyBM make the application of previously proposed regularization techniques like Dropout or DropConnect difficult, leading to poor generalization. First, we evaluate the performance of Delay Pruning by letting DyBM learn a multidimensional temporal sequence generated by a Markov chain. Finally, we show the effectiveness of Delay Pruning in learning high-dimensional sequences using the moving MNIST dataset, and compare it with the Dropout and DropConnect methods.

I. INTRODUCTION

Deep neural networks [1], [2] have been successfully applied for learning in a large number of image recognition and other machine learning tasks.
However, neural network (NN) based models are typically well suited to scenarios with large amounts of available labelled data. By increasing the network complexity (in terms of size or number of layers), one can achieve impressive levels of performance. A caveat is that this can lead to gross over-fitting or generalization issues when training data is limited. As a result, a wide range of techniques for regularizing NNs have been developed, such as adding an L2 penalty term, Bayesian methods [3], and adding noise to the training data [4].

More recently, with a focus on NNs with a deep architecture, the Dropout [5] and DropConnect [6] techniques have been proposed as ways to prevent over-fitting by randomly omitting some of the feature detectors on each training sample. Specifically, Dropout involves randomly deleting some of the activations (units) in each layer during a forward pass and then back-propagating the error only through the remaining units. DropConnect generalizes this by randomly omitting weights rather than the activations (units). Both techniques have been shown to significantly improve performance on standard fully-connected deep neural network architectures.

In this work, we propose a novel regularization technique called Delay Pruning, designed for a recently introduced generative model called the dynamic Boltzmann machine (DyBM) [7]. Unlike the conventional Boltzmann machine (BM) [8], which is trained with a collection of static patterns, DyBM is designed for unsupervised learning of temporal pattern sequences. DyBM is motivated by postulates and observations from biological neural networks, allowing exact inference and learning of weights based on the timing of spikes (spike-timing dependent plasticity, STDP). Unlike the restricted Boltzmann machine (RBM) [9], DyBM has no specific hidden units, and the network can be unfolded through time, allowing infinitely many layers [10].
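To make the distinction between the two baselines concrete, both can be viewed as random masking applied during training. The following is a minimal NumPy sketch under our own illustrative shapes and keep probabilities; it is not the implementation used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W, p=0.5):
    """Dropout sketch: zero out unit activations with probability p
    during training; survivors are rescaled by 1/(1-p) (inverted dropout)."""
    h = W @ x
    mask = rng.random(h.shape) >= p        # keep each unit with prob 1-p
    return (h * mask) / (1.0 - p)

def dropconnect_forward(x, W, p=0.5):
    """DropConnect sketch: zero out individual weights with probability p,
    generalizing Dropout from units to connections."""
    mask = rng.random(W.shape) >= p        # keep each weight with prob 1-p
    return ((W * mask) @ x) / (1.0 - p)

x = rng.random(8)
W = rng.standard_normal((4, 8))
print(dropout_forward(x, W).shape)         # (4,)
print(dropconnect_forward(x, W).shape)     # (4,)
```

At test time both methods use the full network; the rescaling by 1/(1-p) keeps the expected pre-activation unchanged between training and inference.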
Furthermore, DyBM can be viewed as a fully-connected recurrent neural network with memory units and with conduction delays between units implemented in the form of fixed-length first-in first-out (FIFO) queues. A spike originating at a pre-synaptic neuron (unit) travels along this FIFO queue and reaches the post-synaptic neuron after a fixed delay. The length of a FIFO queue is equal to the corresponding delay value minus one. Due to this completely novel architecture of DyBM, applying existing regularization methods is difficult or does not lead to better generalization performance. The Delay Pruning technique proposed here therefore provides a method for regularized training of NNs with FIFO queues. Specifically, during training, it truncates the lengths of randomly selected FIFO queues to zero. We evaluate the performance of Delay Pruning on a stochastic multi-dimensional time series and then compare it with Dropout and DropConnect for unsupervised learning on the high-dimensional moving MNIST dataset. In the next sections, we first give a brief overview of DyBM and its learning rule, followed by the Delay Pruning algorithm, experimental results, and conclusion.

II. DYNAMIC BOLTZMANN MACHINE

A. Overview

In this paper, we use DyBM [7] for unsupervised learning of temporal sequences and show better generalization performance using our Delay Pruning algorithm. Unlike standard Boltzmann machines, DyBM can be trained with a time-series of patterns. Specifically, the DyBM gives the conditional probability of the next values (patterns) of a time-series given its historical values. This conditional probability can depend on the whole history of the time-series, and the DyBM can thus be used iteratively as a generative model of a time-series.
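The pruning step itself can be sketched as follows. This is a minimal illustration assuming a matrix of integer conduction delays d_ij and a uniform pruning probability (the names `delay_prune` and `p_prune` are our own); the full procedure described in the paper additionally selects the best-performing model with fixed delays afterwards:

```python
import numpy as np

rng = np.random.default_rng(42)

def delay_prune(delays, p_prune=0.3):
    """Delay Pruning sketch: with probability p_prune, set a connection's
    conduction delay d_ij to 1, which truncates its FIFO queue
    (of length d_ij - 1) to zero."""
    pruned = delays.copy()
    mask = rng.random(delays.shape) < p_prune
    pruned[mask] = 1                       # delay 1 -> FIFO queue of length 0
    return pruned

delays = rng.integers(2, 9, size=(5, 5))   # illustrative delays d_ij >= 2
pruned = delay_prune(delays)
queue_lengths = pruned - 1                 # FIFO queue length = d_ij - 1
print(queue_lengths.min(), queue_lengths.max())
```

Note that a pruned connection is not deleted: with delay 1 the pre-synaptic value reaches the post-synaptic neuron without queueing, which is what distinguishes this scheme from simply dropping weights.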
DyBM can be defined from a BM having multiple layers of units, where one layer represents the most recent values of a time-series, and the remaining layers represent its historical values. The most recent values are conditionally independent of each other given the historical values. The DyBM is equivalent to such a BM having an infinite number of layers, so that the most recent values can depend on the whole history of the time series. We train the DyBM in such a way that the likelihood of a given time-series is maximized with respect to the conditional distribution of the next values given the historical values.

Similar to a BM, a DyBM consists of a network of artificial neurons. Each neuron takes a binary value, 0 or 1, following a probability distribution that depends on the parameters of the DyBM. Unlike the BM, the values of the DyBM can change over time in a way that depends on its previous values. That is, the DyBM stochastically generates a multi-dimensional series of binary values.

Learning in conventional BMs is based on a Hebbian formulation, but is often approximated with a sampling-based strategy like contrastive divergence. In this formulation the concept of time is largely missing. In DyBM, as in biological networks, learning depends on the timing of spikes. This is called spike-timing dependent plasticity, or STDP [11], which states that a synapse is strengthened if the spike of a pre-synaptic neuron precedes the spike of a post-synaptic neuron (long term potentiation, LTP), and weakened if the temporal order is reversed (long term depression, LTD). DyBM uses an exact online learning rule that has the properties of LTP and LTD.

Fig. 1.
A DyBM consists of a network of neurons and memory units. A pre-synaptic neuron is connected to a post-synaptic neuron via a FIFO queue. The spike from the pre-synaptic neuron reaches the post-synaptic neuron after a constant conduction delay. Each neuron has a memory unit for storing neural eligibility traces. A synaptic eligibility trace is associated with the synapse between a pre-synaptic neuron and a post-synaptic neuron, and summarizes the spikes that have arrived at the synapse via the FIFO queue.

The learning rule of DyBM exhibits some of the key properties of STDP due to its structure consisting of conduction delays and memory units, which are illustrated in Figure 1. A neuron is connected to another in such a way that a spike from a pre-synaptic neuron, i, travels along an axon and reaches a post-synaptic neuron, j, via a synapse after a delay consisting of a constant period, d_{i,j}. In the DyBM, a FIFO queue causes this conduction delay. The FIFO queue stores the values of the pre-synaptic neuron for the last d_{i,j} − 1 units of time. Each stored value is pushed one position toward the head of the queue when the time is incremented by one unit. The value of the pre-synaptic neuron is thus given to the post-synaptic neuron after the conduction delay. Moreover, the DyBM aggregates information about the spikes in the past into neural eligibility traces and synaptic eligibility traces, which are stored in the memory units. Each neuron is associated with a learnable parameter called a bias. The strength of the synapse between a pre-synaptic neuron and a post-synaptic neuron is represented by learnable parameters called weights, which are further divided into LTP and LTD components.

B. Definition

The DyBM shown in Figure 2 (b) can be shown to be equivalent to a BM having infinitely many layers of units [10]. Similar to the RBM (Figure 2 (a)), the DyBM has no weight between the units in the right-most layer of Figure 2 (b).
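To make the conduction-delay mechanism described above concrete, a single synapse with its FIFO queue and synaptic eligibility trace can be sketched as below. The class name, the decay rate `lam`, and the exact trace-update form are our own illustrative choices, not the paper's parameterization:

```python
from collections import deque

class Synapse:
    """Sketch of one DyBM synapse: a FIFO queue of length d_ij - 1 plus a
    synaptic eligibility trace summarizing spikes that left the queue."""

    def __init__(self, delay, lam=0.5):
        self.queue = deque([0] * (delay - 1), maxlen=delay - 1)
        self.trace = 0.0     # synaptic eligibility trace
        self.lam = lam       # illustrative decay rate

    def step(self, spike):
        """Push the pre-synaptic value into the tail; the value at the head
        reaches the post-synaptic neuron and feeds the eligibility trace."""
        arrived = self.queue[0] if self.queue else spike  # delay 1: no queue
        if self.queue:
            self.queue.append(spike)      # maxlen deque evicts the head
        self.trace = self.lam * (self.trace + arrived)
        return arrived

syn = Synapse(delay=3)
outputs = [syn.step(s) for s in [1, 0, 1, 0, 0]]
print(outputs)  # [0, 0, 1, 0, 1] -- spikes re-emerge after traversing the queue
```

In this sketch a spike traverses the queue in d_{i,j} − 1 time steps, and each arrival is folded into the geometrically decaying trace, mirroring how the DyBM's memory units summarize past spikes.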
Unlike the RBM [9], each layer of the DyBM has a common number, N, of units, and the bias and the weight in the DyBM can be shared among different units in a particular manner. Formally, the DyBM-T is a BM having T layers from −T + 1 to 0, where T is a positive integer or infinity. Let x ≡ ( x[t] ) −T