Electronic Letters on Computer Vision and Image Analysis 16(2):17-20, 2017

Learning audio and image representations with bio-inspired trainable feature extractors

Nicola Strisciuglio
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, Netherlands
This work was carried out at the University of Groningen (Netherlands) and at the University of Salerno (Italy).

Received 24th Jul 2017; accepted 24th Nov 2017

Abstract

Recent advancements in pattern recognition and signal processing concern the automatic learning of data representations from labeled training samples. Typical approaches are based on deep learning and convolutional neural networks, which require large amounts of labeled training samples. In this work, we propose novel feature extractors that can be used to learn the representation of single prototype samples in an automatic configuration process. We employ the proposed feature extractors in applications of audio and image processing, and show their effectiveness on benchmark data sets.

1 Introduction

From a very young age, we can quickly learn new concepts and distinguish between different kinds of objects or sounds. If we see a single object or hear a particular sound, we are then able to recognize that sample, or even different versions of it, in other scenarios. For example, one who sees an iron chair and associates the object with the general concept of "chairs" will also be able to detect and recognize wooden or wicker chairs. Similarly, when we hear the sound of a particular event, such as a scream, we are then able to recognize other kinds of scream that occur in different environments. We continuously learn representations of the real world, which we then use to understand new and changing environments.
In the field of pattern recognition, traditional methods are typically based on representations of the real world that require a careful design of a suitable feature set (i.e. a data representation), which involves considerable domain knowledge and effort by experts. Recently, approaches for the automated learning of representations from training data were introduced. Representation learning aims at avoiding the engineering of hand-crafted features and providing automatically learned features suitable for the recognition tasks. Nowadays, widely popular approaches for representation learning are based on deep learning techniques and convolutional neural networks (CNNs). These techniques are very powerful, but are computationally expensive and require large amounts of labeled training data to learn effective models for the applications at hand.

In this paper we report the main achievements included in the doctoral thesis titled 'Bio-inspired algorithms for pattern recognition in audio and image processing', in which we proposed novel approaches for representation learning for audio and image signals [5].

Correspondence to: Nicola Strisciuglio <n.strisciuglio@rug.nl>
Recommended for acceptance by Anjan Dutta and Carles Sánchez
http://dx.doi.org/10.5565/rev/elcvia.1128
ELCVIA ISSN: 1577-5097
Published by Computer Vision Center / Universitat Autònoma de Barcelona, Barcelona, Spain

Figure 1: Overview of a pattern recognition system. The input data are pre-processed and then features are computed to extract important properties from such data. The features to be computed can be determined by an engineering process or can be learned from the data (representation learning).
Feature selection procedures can be employed to determine a subset of discriminant features that are then used to train a classifier, which determines a model of the training data. Such a model is then used in the operating phase of the system, when a classifier takes decisions on the input data.

2 Motivation and contribution

Motivated by the fact that we can learn effective representations of a new category of objects or sounds from a single example and successively generalize to a wide range of real-world samples, we studied the possibility of learning data representations from small amounts of training samples. We investigated the design of feature extractors that can be automatically trained by showing single prototype samples, and employed them in pattern recognition systems to solve practical problems.

We proposed representation learning techniques for audio and image processing based on novel trainable feature extractors. The design and implementation of the proposed feature extractors are inspired by some functions of the human auditory and visual systems. The structure of the proposed feature extractors is learned from training samples in an automatic configuration step, rather than fixed a priori in the implementation [5]. We employed the newly designed methodologies in systems for audio event detection and classification in noisy environments and for the delineation of blood vessels in retinal fundus images.
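As a toy illustration of the feature-selection step sketched above, the snippet below ranks discrete feature columns by their mutual information with the class labels and keeps the top-scoring ones. The function names and the simple univariate ranking are illustrative assumptions; the selection mechanisms actually proposed in the thesis [7, 9] are more elaborate.

```python
# Toy sketch of information-theoretic feature selection.
# All names (mutual_information, select_features) are illustrative,
# not the actual procedure of [7, 9].
import math
from collections import Counter

def mutual_information(feature, labels):
    """I(X; Y) in bits for two discrete sequences of equal length."""
    n = len(labels)
    pxy = Counter(zip(feature, labels))
    px, py = Counter(feature), Counter(labels)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with counts normalized by n
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def select_features(features, labels, k):
    """Rank feature columns by mutual information with the labels
    and return the indices of the top k (a univariate filter)."""
    ranked = sorted(range(len(features)),
                    key=lambda i: mutual_information(features[i], labels),
                    reverse=True)
    return ranked[:k]
```

A feature column that perfectly predicts a binary label carries 1 bit of information, while an independent column carries 0 bits, so the ranking puts discriminant features first.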
The contributions of this work can be listed as:
a) novel bio-inspired trainable feature extractors for representation learning of audio and image signals, respectively called COPE and B-COSFIRE;
b) a system for audio event detection based on COPE feature extractors;
c) the release of two data sets of audio events of interest mixed with various background sounds and with different signal-to-noise ratios (SNR);
d) a method for the delineation of elongated and curvilinear patterns in images based on B-COSFIRE filters;
e) feature selection mechanisms based on information theory and machine learning approaches.

3 Methods

We introduced a novel approach for representation learning, based on trainable feature extractors. We extended the traditional scheme of pattern recognition systems with feature learning algorithms (dashed box at the top of Figure 1), which construct a suitable representation of training data by automatically configuring a set of feature extractors.

We proposed trainable COPE (Combination of Peaks of Energy) feature extractors for sound analysis, which can be trained to detect any sound pattern of interest. In an automatic configuration process performed on a single prototype sound pattern, the structure of a COPE feature extractor is learned by modeling the constellation of peak points in a time-frequency representation of the input sound [10]. In the application phase, a COPE feature has a high value when computed on the same sound used for configuration, but also on similar versions of it, or versions corrupted by noise or distortion. This accounts for the generalization capabilities and robustness of detection of the patterns of interest. The response of a COPE feature extractor is computed as the combination of the weighted scores of its constituent constellation of energy peaks.
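To make the idea concrete, the following minimal sketch mimics a COPE-like extractor on a toy time-frequency matrix: configuration records the constellation of local energy peaks relative to the strongest peak of a prototype, and application combines per-peak scores computed around the expected peak positions. The function names, the Gaussian tolerance to peak displacement, and the averaged-sum combination are simplifying assumptions for illustration, not the actual formulation of [10].

```python
# Minimal COPE-like sketch on a toy spectrogram (list of time frames,
# each a list of per-band energies). Hypothetical names and combination.
import math

def find_peaks(spectrogram, threshold=0.0):
    """Local maxima (t, f, energy) of a time-frequency matrix."""
    peaks = []
    T, F = len(spectrogram), len(spectrogram[0])
    for t in range(1, T - 1):
        for f in range(1, F - 1):
            e = spectrogram[t][f]
            neigh = [spectrogram[t + dt][f + df]
                     for dt in (-1, 0, 1) for df in (-1, 0, 1)
                     if (dt, df) != (0, 0)]
            if e > threshold and all(e > v for v in neigh):
                peaks.append((t, f, e))
    return peaks

def configure(prototype_spectrogram):
    """Learn the constellation: peak offsets relative to the
    strongest peak of a single prototype sound."""
    peaks = find_peaks(prototype_spectrogram)
    t0, f0, _ = max(peaks, key=lambda p: p[2])
    return [(t - t0, f - f0, e) for t, f, e in peaks]

def cope_response(spectrogram, model, t0, f0, sigma=1.0):
    """Each configured peak contributes the best energy found near its
    expected position, weighted by a Gaussian of the positional error;
    the scores are averaged (an assumed combination function)."""
    T, F = len(spectrogram), len(spectrogram[0])
    score = 0.0
    for dt, df, _ in model:
        best = 0.0
        for jt in (-1, 0, 1):
            for jf in (-1, 0, 1):
                t, f = t0 + dt + jt, f0 + df + jf
                if 0 <= t < T and 0 <= f < F:
                    w = math.exp(-(jt * jt + jf * jf) / (2 * sigma ** 2))
                    best = max(best, w * spectrogram[t][f])
        score += best
    return score / len(model)
```

With this tolerance to small peak displacements, the response stays high for slightly shifted or mildly corrupted versions of the prototype, which is the source of the generalization behavior described above.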
For further details we refer the reader to [10]. For the design of COPE feature extractors, we were inspired by some functions of the cochlear membrane and the inner hair cells in the auditory system, which convert sound pressure waves into neural stimuli on the auditory nerve. We employed COPE feature extractors together with a multi-class Support Vector Machine (SVM) classifier to perform audio event detection and classification, also in cases where sounds have null or negative SNR.

We proposed B-COSFIRE (which stands for Bar-selective Combination of Shifted Filter Responses) filters for the detection of elongated and curvilinear patterns in images, and applied them to the delineation of blood vessels in retinal images [1, 8]. The B-COSFIRE filters are trainable, that is, their structure is automatically configured from prototype elongated patterns. The design of the B-COSFIRE filters is inspired by the functions of some neurons, called simple cells, in area V1 of the visual system, which fire when presented with line or contour stimuli. A B-COSFIRE filter achieves orientation selectivity by computing the weighted geometric mean of the output of a pool of Difference-of-Gaussians (DoG) filters, whose supports are aligned in a collinear manner. Rotation invariance is efficiently obtained by appropriate shiftings of the DoG filter responses. For further details we refer the reader to [1].

After configuring a large bank of B-COSFIRE filters selective for vessels (i.e. lines) and vessel-endings (i.e. line-endings) of various thickness (i.e. scale), we proposed to use several approaches based on information theory and machine learning to select an optimal subset of B-COSFIRE filters for the vessel delineation task [7, 9]. We indicate this procedure with the dashed box named 'feature learning' in Figure 1.
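The orientation-selective combination of DoG responses described above can be illustrated with a small sketch: a B-COSFIRE-like response at a pixel is taken as the weighted geometric mean of center-surround DoG responses at collinearly aligned positions. Here the alignment is fixed to vertical, the kernels are computed naively per pixel, and all names and parameter values are illustrative assumptions; see [1] for the actual filter, including the configuration on a prototype and the shift-based rotation invariance.

```python
# Illustrative B-COSFIRE-like sketch (vertical alignment only).
# Hypothetical names and parameters; not the implementation of [1].
import math

def dog_response(img, x, y, sigma=1.0):
    """Center-surround Difference-of-Gaussians response at (x, y):
    narrow Gaussian mean minus broad (2*sigma) Gaussian mean over a
    local window, rectified to be non-negative."""
    h, w = len(img), len(img[0])
    c = s = cn = sn = 0.0
    r = int(3 * 2 * sigma)  # window radius covering the outer Gaussian
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                d2 = dx * dx + dy * dy
                gc = math.exp(-d2 / (2 * sigma ** 2))
                gs = math.exp(-d2 / (2 * (2 * sigma) ** 2))
                c += gc * img[yy][xx]; cn += gc
                s += gs * img[yy][xx]; sn += gs
    return max(0.0, c / cn - s / sn)

def bcosfire_response(img, x, y, n=2, spacing=2, sigma=1.0):
    """Weighted geometric mean of DoG responses at 2n+1 points aligned
    vertically through (x, y): an AND-type combination that is high
    only if all aligned points lie on a line-like structure."""
    prod, wsum = 1.0, 0.0
    for i in range(-n, n + 1):
        w = math.exp(-i * i / (2.0 * n * n))  # weight decays with distance
        resp = dog_response(img, x, y + i * spacing, sigma)
        prod *= (resp + 1e-12) ** w           # epsilon avoids zero products
        wsum += w
    return prod ** (1.0 / wsum)
```

The geometric mean is the key design choice: unlike a sum, it suppresses the response when even one aligned DoG support misses the line, which makes the filter selective for elongated structures rather than isolated blobs.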
We consider the selected filters as feature extractors to construct a pixel-wise feature vector, which we use in combination with an SVM classifier to classify the pixels in the test image as vessel or non-vessel pixels.

4 Experiments and Results

We released two data sets for the benchmarking of audio event detection and classification methods, namely the MIVIA audio events [3] and the MIVIA road events [2] data sets. We reported baseline results (recognition rates of 86.7% and 82% on the two data sets) by using a real-time method for event detection that is based on an adaptation of the bag-of-features classification scheme to noisy audio streams [3, 2]. The results that we achieved by using COPE feature extractors show a considerable improvement with respect to those of the bag-of-features approach. We obtained a recognition rate of 95.38% on the MIVIA audio events data set and of 94% (with a standard deviation on cross-validation experiments equal to 4.32) on the MIVIA road events data set. We performed Student's t-tests and observed a statistically significant improvement of the recognition rate with respect to the baseline performance on both data sets.

We evaluated the performance of the proposed B-COSFIRE filters on four data sets of retinal fundus images for the benchmarking of blood vessel segmentation algorithms, namely the DRIVE, STARE, CHASE_DB1 and HRF data sets. The results that we achieved (DRIVE: Se = 0.7655, Sp = 0.9704; STARE: Se = 0.7716, Sp = 0.9701; CHASE_DB1: Se = 0.7585, Sp = 0.9587; HRF: Se = 0.7511, Sp = 0.9745) are higher than those reported by many state-of-the-art methods based on filtering approaches.
The filter selection procedure based on supervised learning that we proposed in [9] contributes a statistically significant increase in performance, with results that are higher than or comparable to those of other methods based on machine learning techniques.

We extended the application range of the B-COSFIRE filters to aerial images for the delineation of roads and rivers, to natural and textured images [4], and to pavement and road surface images for the detection of cracks and damages [6]. The results that we achieved are better than or comparable to those achieved by existing methods, which are usually designed to solve specific problems. The proposed B-COSFIRE filters demonstrated to be effective in various applications and with different types of images (retinal fundus photography, aerial photography, laser scans) for the delineation of elongated and curvilinear patterns.

We studied the computational requirements of the proposed algorithms in order to evaluate their applicability in real-world applications and the fulfillment of the real-time constraints given by the considered problems. The MATLAB implementations of the proposed algorithms are publicly released for research purposes*.

5 Conclusions

In this work, we proposed novel trainable feature extractors and employed them in applications of sound and image processing. The trainable character of the proposed feature extractors lies in the fact that their structure is learned directly from training data in an automatic configuration process, rather than fixed in the implementation. This provides flexibility and adaptability of the proposed methods to different applications.
The experimental results that we achieved, compared to those of other existing approaches, demonstrated the effectiveness of the proposed methods in various applications. This work contributes to the development of techniques for representation learning in audio and image processing, suitable for domains where large amounts of labeled training data are not available.

References

[1] Azzopardi, G., Strisciuglio, N., Vento, M., Petkov, N.: Trainable COSFIRE filters for vessel delineation with application to retinal images. Medical Image Analysis 19(1), 46-57 (2015)
[2] Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M.: Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Trans. Intell. Transp. Syst. 17(1), 279-288 (2016)
[3] Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M.: Reliable detection of audio events in highly noisy environments. Pattern Recogn. Lett. 65, 22-28 (2015)
[4] Strisciuglio, N., Petkov, N.: Delineation of line patterns in images using B-COSFIRE filters. In: 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI), pp. 1-6 (July 2017)
[5] Strisciuglio, N.: Bio-inspired algorithms for pattern recognition in audio and image processing. University of Groningen (2016), http://www.cs.rug.nl/~nick/strisciuglio phd.pdf
[6] Strisciuglio, N., Azzopardi, G., Petkov, N.: Detection of curved lines with B-COSFIRE filters: A case study on crack delineation. In: CAIP 2017, pp. 108-120 (2017)
[7] Strisciuglio, N., Azzopardi, G., Vento, M., Petkov, N.: Multiscale blood vessel delineation using B-COSFIRE filters. In: CAIP, LNCS, vol. 9257, pp. 300-312 (2015)
[8] Strisciuglio, N., Azzopardi, G., Vento, M., Petkov, N.: Unsupervised delineation of the vessel tree in retinal fundus images. In: VIPIMAGE, pp. 149-155 (2015)
[9] Strisciuglio, N., Azzopardi, G., Vento, M., Petkov, N.: Supervised vessel delineation in retinal fundus images with the automatic selection of B-COSFIRE filters. Mach. Vis. Appl. pp. 1-13 (2016)
[10] Strisciuglio, N., Vento, M., Petkov, N.: Bio-inspired filters for audio analysis. In: BrainComp 2015, Revised Selected Papers. pp. 101-115 (2016)

* The code is available from the GitLab repositories at http://gitlab.com/nicstrisc