Spike Timing Dependent Competitive Learning in Recurrent Self Organizing Pulsed Neural Networks Case Study: Phoneme and Word Recognition

Synaptic plasticity is a central aspect of neural network dynamics: it refers to physiological modifications of the synapse whose consequence is a change in the value of the synaptic weight. Information encoding is based on the precise timing of single spike events, that is, on the relative timing of pre- and post-synaptic spikes, on local synapse competition within a single neuron, and on global competition via lateral connections. To classify temporal sequences, this paper shows how to use a local Hebbian learning rule, spike-timing-dependent plasticity, for unsupervised competitive learning while preserving the self-organizing properties of maps of spiking neurons. We present three variants of the self-organizing map (SOM) with a spike-timing-dependent Hebbian learning rule: the Leaky Integrator Neurons (LIN) model, the Spiking_SOM model, and the recurrent Spiking_SOM (RSSOM) model. The case study for the proposed SOM variants is phoneme classification and word recognition in continuous, speaker-independent speech.


💡 Research Summary

This paper presents a novel framework that integrates spike‑timing‑dependent plasticity (STDP) with competitive self‑organizing maps (SOM) to address the problem of temporal‑sequence classification, specifically phoneme and word recognition in continuous, speaker‑independent speech. Three variants of spiking SOM are introduced: (1) Leaky Integrator Neurons (LIN), which incorporate leaky‑integrator dynamics to smooth incoming spike trains; (2) Spiking‑SOM, a purely event‑driven map whose neurons fire only when their membrane potential exceeds a threshold; and (3) Recurrent Spiking‑SOM (RSSOM), which adds recurrent connections that feed the previous activation state back into the current processing step, thereby preserving temporal context.
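The leaky-integrator-with-threshold dynamics shared by these variants can be illustrated with a minimal sketch. The time constant, threshold, and reset-to-zero behaviour below are generic illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

def leaky_integrator_step(v, spikes_in, weights, tau=20.0, dt=1.0, threshold=1.0):
    """One update of a leaky-integrator neuron's membrane potential.

    v         -- current membrane potential
    spikes_in -- binary input spike vector at this time step
    weights   -- synaptic weight vector (same length as spikes_in)
    Returns (new_potential, fired).
    """
    decay = np.exp(-dt / tau)                    # exponential leak toward rest (0)
    v = decay * v + np.dot(weights, spikes_in)   # leak, then integrate weighted input
    if v >= threshold:                           # fire and reset on threshold crossing
        return 0.0, True
    return v, False
```

Between spikes the potential decays exponentially, so recent inputs dominate; the threshold makes the neuron event-driven as described for the Spiking‑SOM variant.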

The authors begin by reviewing the biological basis of STDP, where the relative timing of pre‑ and post‑synaptic spikes determines whether a synapse is potentiated or depressed. This rule is mathematically expressed as Δw = A₊ exp(−Δt/τ₊) for causal spikes (Δt > 0) and Δw = −A₋ exp(Δt/τ₋) for anti‑causal spikes (Δt < 0). By embedding this local Hebbian update into the SOM learning loop, the competition among neurons is driven not by Euclidean distance alone but by precise temporal correlations between input spike patterns and each neuron’s weight vector.
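The STDP window above translates directly into a pairwise update function. The amplitudes A± and time constants τ± below are illustrative values, not the paper's fitted parameters:

```python
import math

def stdp_delta_w(dt, a_plus=0.1, a_minus=0.12, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair, with dt = t_post - t_pre (ms).

    Implements Δw = A₊ exp(−Δt/τ₊) for Δt > 0 (causal: potentiation)
    and      Δw = −A₋ exp(Δt/τ₋)  for Δt < 0 (anti-causal: depression).
    """
    if dt > 0:
        return a_plus * math.exp(-dt / tau_plus)    # pre before post: LTP
    if dt < 0:
        return -a_minus * math.exp(dt / tau_minus)  # post before pre: LTD
    return 0.0  # simultaneous spikes: no change in this sketch
```

Note that A₋ is often chosen slightly larger than A₊ so that uncorrelated spike pairs produce net depression, keeping weights bounded.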

Competition is realized through a two‑stage winner‑takes‑all mechanism. When a spike train is presented, each neuron computes a similarity measure (often a dot product between the input spike vector and its synaptic weights). The neuron with the highest similarity becomes the winner and emits an inhibitory signal to its lateral neighbors, suppressing their activity. This global lateral inhibition enforces topological ordering on the map while allowing distinct clusters to emerge.
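The two-stage competition can be sketched as follows. Uniform inhibition of all non-winners is a simplification here; the paper's lateral connections would typically weight inhibition by map distance to preserve topology:

```python
import numpy as np

def winner_takes_all(x, W, inhibition=0.5):
    """Two-stage competition: similarity, then lateral inhibition.

    x -- input spike vector, shape (n_inputs,)
    W -- synaptic weight matrix, shape (n_neurons, n_inputs)
    Returns (winner_index, activations after inhibition).
    """
    act = W @ x                            # stage 1: dot-product similarity per neuron
    winner = int(np.argmax(act))           # highest-similarity neuron wins
    suppressed = act - inhibition * act[winner]  # stage 2: winner inhibits neighbors
    suppressed[winner] = act[winner]       # the winner keeps its own activity
    return winner, np.maximum(suppressed, 0.0)   # rectify suppressed activities
```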

For speech processing, raw audio is first segmented into short frames (e.g., 25 ms frames with a 10 ms shift) and transformed into spectral features such as Mel‑frequency cepstral coefficients (MFCCs). These continuous features are then encoded into spike trains using either threshold‑based encoding or a leaky‑integrate‑and‑fire (LIF) conversion. The resulting spike sequences are fed into the three SOM variants. During unsupervised training, STDP updates adjust each synapse in proportion to the timing difference between incoming spikes and the neuron's own firing times, while the winner‑takes‑all rule continuously reshapes the map's topology. Over many epochs, similar phonetic patterns gravitate toward neighboring neurons, whereas dissimilar patterns occupy distant regions.
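The exact encoding scheme is not fully specified in the summary; a common latency-coding sketch, in which each feature coefficient emits one spike and stronger values fire earlier, could look like this (the 25 ms window is an assumption matching the frame length above):

```python
import numpy as np

def latency_encode(features, t_max=25.0):
    """Convert one frame of continuous features into spike times (latency coding).

    Features are min-max normalized per frame; larger values spike earlier.
    Returns spike times in [0, t_max] ms, one per coefficient.
    """
    f = np.asarray(features, dtype=float)
    span = f.max() - f.min()
    if span == 0:                       # flat frame: all spikes at the latest time
        return np.full_like(f, t_max)
    norm = (f - f.min()) / span         # scale frame to [0, 1]
    return (1.0 - norm) * t_max         # strongest coefficient spikes at t = 0
```

A per-frame vector of MFCCs fed through this function yields the kind of precisely timed spike pattern that the STDP rule can then compare against each neuron's weights.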

Experimental evaluation uses the TIMIT corpus, a standard benchmark for phoneme‑level speech recognition, and a separate continuous‑speech dataset for word‑level testing. All three models are trained for an equal number of epochs (typically 100) under identical preprocessing and encoding conditions. Performance metrics include phoneme classification accuracy, confusion matrices, convergence speed, and computational overhead. Results show that LIN provides robustness to noise due to its averaging dynamics but suffers from reduced temporal resolution. Pure Spiking‑SOM achieves high resolution for isolated frames but struggles with long‑range dependencies, leading to lower word‑level accuracy. RSSOM, by virtue of its recurrent feedback, captures temporal context across frames, yielding a 3–5 % absolute improvement in both phoneme and word recognition rates compared with non‑recurrent baselines. In speaker‑independent tests, RSSOM attains approximately 87 % phoneme accuracy, surpassing traditional SOM approaches that hover around 82 %.

The paper’s contributions are threefold: (1) a biologically plausible integration of STDP into competitive SOM learning, (2) the design of a recurrent spiking SOM that preserves temporal history, and (3) a thorough empirical validation on realistic speech tasks. Limitations are acknowledged: RSSOM’s recurrent weight matrix increases memory consumption, and the STDP update must be applied to every synapse at each time step, raising computational cost. Moreover, performance is sensitive to the choice of spike‑encoding thresholds, suggesting a need for adaptive encoding strategies.

Future work is outlined in three directions. First, implementation on neuromorphic hardware (e.g., Intel Loihi, IBM TrueNorth) could exploit the event‑driven nature of the models to achieve low‑power, real‑time operation. Second, extending the framework to multimodal inputs (e.g., audiovisual speech) could improve robustness in noisy environments. Third, coupling the unsupervised STDP‑SOM with reinforcement learning or reward‑modulated plasticity may enable task‑specific fine‑tuning without explicit labels. By bridging synaptic plasticity, competitive dynamics, and recurrent processing, the authors advance the state of spiking neural networks toward practical, biologically inspired speech recognition systems.