Deep neural network architectures designed for application domains other than sound, especially image recognition, may not optimally harness the time-frequency representation when adapted to the sound recognition problem. In this work, we explore the ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN) for multi-dimensional temporal signal recognition. The CLNN considers the inter-frame relationship, and the MCLNN enforces a systematic sparseness over the network's links, enabling the network to learn in frequency bands rather than individual bins and making it frequency-shift invariant, mimicking a filterbank. The mask also allows the network to consider several feature combinations concurrently, a process usually handcrafted through exhaustive manual search. We applied the MCLNN to the environmental sound recognition problem using the ESC-10 and ESC-50 datasets. The MCLNN achieved performance competitive with state-of-the-art Convolutional Neural Networks, using 12% of the parameters and without augmentation.
Sound recognition is a wide research field that combines two broad areas of research: signal processing and pattern recognition. One of the earliest attempts at sound recognition, specifically speech, was the work of Davis et al. [1] in 1952, in which they devised an analog circuit for spoken digit recognition. Over the years, the methods have evolved to address not just speech, but music and environmental sound recognition as well. This interest has been reinforced by the widespread adoption of related applications, e.g. music sharing platforms or automatic environmental sound recognition for surveillance [2,3], especially when low lighting conditions hinder the ability of the video channel to capture useful information.
Handcrafting the features extracted from a signal, whether an image or a sound, has been widely investigated. These efforts aim to provide distinctive features that enhance the recognition accuracy of the pattern recognition model. Recent attempts using deep neural networks have achieved breakthrough results in image recognition [4]. These deep models abstract the features of a raw input signal over multiple neural network layers. The extracted features are then classified using a conventional classifier such as Random Forest [5] or Support Vector Machines (SVM) [6].
An attempt to use deep neural network architectures for automatic feature extraction from sound is the work of Hamel et al. [7]. They stacked three Restricted Boltzmann Machines (RBM) [8] to form a Deep Belief Net (DBN) [9] and used the DBN to extract features from music clips, which were then classified using an SVM. They showed that the abstract representations captured by the RBMs at each layer enhance the classification compared to using the raw time-frequency representation.
Deep architectures of Convolutional Neural Networks (CNN) [10] have achieved remarkable results in image recognition [4] and have also been adapted to the sound recognition problem. For example, a CNN was used in [11] for phoneme recognition in speech, where the CNN extracted the features and the state transitions were modeled using a Hidden Markov Model (HMM) [12].
Handcrafted features for sound are still superior in most contexts to features learned by neural networks, in contrast to the success of neural feature extraction for images, but the accuracy gap is narrowing. The motivation behind using neural networks is to eliminate the effort invested in handcrafting the most efficient features for a sound signal.
Several neural network architectures have been proposed for the sound recognition problem, but they are usually adapted to sound only after gaining wide success in other applications, especially image recognition. Such adaptations may not harness the properties of sound in a time-frequency representation. For example, an RBM treats the frames of a temporal signal as static, isolated frames, ignoring the inter-frame relation. The CNN depends on weight sharing, which does not preserve the spatial locality of the learned features.
We discuss in this work the ConditionaL Neural Network (CLNN), which is designed for multidimensional temporal signals. The Masked ConditionaL Neural Network (MCLNN) extends the CLNN by embedding a filterbank-like behavior within the network through a systematic sparseness enforced over the network's weights. The filterbank-like pattern allows the network to exploit the advantages of filterbanks used in signal analysis, such as frequency shift-invariance. Additionally, the masking operation automatically explores a range of feature combinations concurrently, analogous to the manual feature selection usually performed before classification.
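To make the masking idea concrete, the following is a minimal sketch, not the authors' implementation, of a masked dense layer: a binary mask with a band-like pattern is applied element-wise to the weight matrix so that each hidden unit connects only to a contiguous group of frequency bins, mimicking a filterbank. The function names, the bandwidth and overlap parameters, the tanh nonlinearity, and the layer sizes are illustrative assumptions.

import numpy as np

def band_mask(n_in, n_out, bandwidth, overlap):
    # Binary mask: each column (hidden unit) covers a contiguous band of
    # `bandwidth` input bins; consecutive bands are shifted by
    # (bandwidth - overlap) bins, wrapping around in this toy example.
    mask = np.zeros((n_in, n_out))
    shift = bandwidth - overlap
    for j in range(n_out):
        start = (j * shift) % n_in
        rows = [(start + k) % n_in for k in range(bandwidth)]
        mask[rows, j] = 1.0
    return mask

def masked_dense(x, W, b, mask):
    # The element-wise product W * mask enforces the systematic sparseness
    # before the usual affine transform and nonlinearity.
    return np.tanh(x @ (W * mask) + b)

# Toy usage: one spectral frame of 60 bins mapped to 30 hidden units.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 60))
W = 0.1 * rng.standard_normal((60, 30))
b = np.zeros(30)
mask = band_mask(60, 30, bandwidth=10, overlap=5)
h = masked_dense(x, W, b, mask)
print(h.shape)  # (1, 30)

Because the mask, not the learned weights, fixes which bins each hidden unit may see, the band structure is preserved throughout training while the weights within each band remain free to adapt.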
The models discussed in this work were considered in [13] for music genre classification, with an emphasis on the influence of the data split (training, validation, and testing sets) on the accuracies reported in the literature. In this work, we evaluate the applicability of the models to sounds of a different nature, i.e., environmental sounds.
The Restricted Boltzmann Machine (RBM) [8] is a generative model that undergoes unsupervised training. The RBM is formed of two layers of neurons, a visible layer and a hidden layer, connected by bidirectional links, with no connections between neurons of the same layer. An RBM is trained using contrastive divergence [14], which aims to minimize the error between an input feature vector presented at the visible layer and its reconstruction generated by the network.
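As an illustration of the reconstruction-driven training just described, the following is a compact sketch of one contrastive-divergence (CD-1) update for a binary RBM. It follows the standard textbook formulation rather than any particular implementation from the cited works; the layer sizes, batch size, and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    # One CD-1 update: v0 -> h0 -> v1 -> h1, nudging the weights so the
    # reconstruction moves closer to the input.
    p_h0 = sigmoid(v0 @ W + b_hid)                    # positive phase
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_vis)                  # reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_hid)                  # negative phase
    # Updates from the difference between data and model correlations.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    # Reconstruction error, the quantity the training drives down.
    return np.mean((v0 - p_v1) ** 2)

# Toy usage: 64 visible units, 32 hidden units, a batch of 8 binary frames.
W = 0.01 * rng.standard_normal((64, 32))
b_vis, b_hid = np.zeros(64), np.zeros(32)
v0 = (rng.random((8, 64)) > 0.5).astype(float)
print(cd1_step(v0, W, b_vis, b_hid))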
We noted earlier that one of the drawbacks of applying an RBM to a temporal signal is that it ignores the temporal dependencies between the signal's frames.