On Time-frequency Scattering and Computer Music
Time-frequency scattering is a mathematical transformation of sound waves. Its core purpose is to mimic the way the human auditory system extracts information from its environment. In the context of improving the artificial intelligence of sounds, i…
Authors: Vincent Lostanlen
Vincent Lostanlen, New York University

First version: October 2018. Latest revision: May 2019.

... qu'il disperse le son dans une pluie aride ... ("... so that it scatters the sound in an arid rain ...") — Stéphane Mallarmé

The quest for an adequate representation of auditory textures lies at the foundation of computer music research. Indeed, none of its analog predecessors ever managed a practical compromise between two concurrent needs in sound design: first, to faithfully reproduce any pre-existing texture; and secondly, to offer enough flexibility for sculpting novel textures from scratch. For example, Schaeffer's musique concrète offered a precise typology of musical objects, yet constrained the composer to a figurativistic raw material [1]. On the other hand, Stockhausen's Elektronische Musik, as it arranges simple noises and tones through time, may have uncovered new avenues in musical abstraction, yet at the cost of a narrow, distinctively "robotic" timbral palette [2]. In the history of music technology, such an opposition between specificity and expressivity is reflected in the respective developments of granular synthesis and additive synthesis: one is universal but computationally intractable, the other is terse but somewhat clunky. With the democratization of analog-to-digital audio conversion, both aforementioned schools of thought came to decline, and new tools for sound manipulation in the time-frequency domain, such as the phase vocoder, gained momentum among contemporary music composers. However, the progressive digitization of the music studio brought little progress to the long-lasting problem of audio texture synthesis and manipulation.
The science of auditory neurophysiology paved the way towards a computational framework for audio texture modeling that could reconcile the specificity of musique concrète with the expressivity of Elektronische Musik. In 1996, Nina Kowalski and her colleagues employed an array of silicon electrodes to measure the cortical responses of a ferret to computer-generated ripple stimuli, exhibiting modulations both in time and frequency [3]. Pairwise correlations between stimuli and responses led to an exhaustive mapping of the primary auditory cortex of mammals, which associates each neuron to a spectrotemporal receptive field (STRF) — that is, the time-frequency representation pattern eliciting maximal excitation of this neuron. What Kowalski et al. concluded is that our brain integrates the acoustic spectrum through time in terms of its spectrotemporal modulations at various scales (pitch intervals) and rates (pulse tempi). Neither exclusively rhythmic (temporal) nor exclusively harmonic (frequential), our brain is indeed a joint, rhythmico-harmonico-melodic processor that encodes sound into a multifaceted sensation.

Despite marking a watershed in our understanding of music perception, this finding long remained outside the technological landscape of computer music designers, because the biologically inspired STRF representation was not an invertible procedure. Although the STRF made it possible to map sounds to specific areas of the auditory cortex, the dual problem of sonifying the neuroelectrical activations of these areas had remained largely unexplored. In addition, since the STRF had been obtained empirically from ferret neuronal action potentials, the resulting representation could not be interpreted post hoc in terms of continuous perceptual parameters, such as pitch or tempo.
Simply put, the STRF is more concrete than musique concrète itself — in lieu of eardrum vibrations, what it contains is a heatmap of primary auditory cortex activity — but it lacks the mathematical concision of an Elektronische Musik score that would allow for any compositional intervention on the world of natural sounds.

From 2013 to 2016, I was a graduate student at École normale supérieure, striving to develop new convolutional operators in the time-frequency domain for modeling musical timbre [4]. With my coworker Joakim Andén and my advisor Stéphane Mallat, I contributed to an STRF-based computational model for audio texture synthesis, under the name of time-frequency scattering. Time-frequency scattering was meant as the successor to "time scattering", as formulated by Mallat himself in 2012. The name was coined as a nod to the world of quantum mechanics: from the reddish shade of a sunset to the glistening of a pearl, the umbrella term of scattering encompasses many different microscopic phenomena. The commonality between these phenomena is that they all involve a radiation of some kind as well as a maze of nonuniformities. Let g be a Gaussian bell curve. In the context of scattering transforms, the radiation is a sound pressure wave U_0(t) while the maze consists of Morlet wavelets

\[ \psi_\gamma(t) = 2^\gamma\, g(2^\gamma t)\, \exp(2\pi\mathrm{i}\, 2^\gamma t) - \hat{g}(2^\gamma) \tag{1} \]

tuned at resolutions 2^γ, as well as modulus nonlinearities. Before time-frequency scattering was formalized, Mallat had defined the time scattering transform as a cascade of purely temporal wavelet modulus operators:

\[ U_{m+1}(t, \gamma_1 \ldots \gamma_{m+1}) = \left| U_m \overset{t}{\ast} \psi_{\gamma_{m+1}} \right|(t) = \left| \int_{-\infty}^{+\infty} U_m(\tau, \gamma_1 \ldots \gamma_m)\, \psi_{\gamma_{m+1}}(t - \tau)\, \mathrm{d}\tau \right| \tag{2} \]

and then generalized his theory to all real-valued functions of finite energy defined over the irreducible representations of a given compact Lie group [5].
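To make Equations 1 and 2 concrete, here is a minimal numerical sketch in Python with NumPy. It is an illustrative toy, not the actual scattering.m implementation: the Gaussian bandwidth, the signal length, and the list of resolutions are hypothetical choices, and the zero-mean correction subtracts the sample mean rather than the exact ĝ(2^γ) term of Equation 1.

```python
import numpy as np

def morlet(gamma, n=2048, bandwidth=0.1):
    """Toy Morlet wavelet at resolution 2**gamma (cf. Eq. 1): a Gaussian
    envelope times a complex exponential, corrected to have near-zero mean."""
    t = np.arange(-n // 2, n // 2) / n                 # time axis on [-1/2, 1/2)
    scale = 2.0 ** gamma
    envelope = np.exp(-0.5 * (scale * t / bandwidth) ** 2)
    carrier = np.exp(2j * np.pi * scale * t)
    psi = scale * envelope * carrier
    return psi - psi.mean()   # cheap stand-in for the -g_hat(2**gamma) correction

def wavelet_modulus_layer(U, gammas):
    """One layer of the time scattering cascade (cf. Eq. 2): convolve U with
    each wavelet, then take the complex modulus."""
    return {gamma: np.abs(np.convolve(U, morlet(gamma, len(U)), mode="same"))
            for gamma in gammas}
```

Iterating `wavelet_modulus_layer` on each output, indexed by the growing path (γ_1, …, γ_m), reproduces the cascade U_{m+1} of Equation 2.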
Shortly thereafter, my coworker Irène Waldspurger proved that scattering transforms, despite the loss of phase incurred by the complex moduli, are invertible with continuous inverse [6]. She resorted to advanced methods in topology and complex analysis (namely the Riesz-Fréchet-Kolmogorov theorem and meromorphic extensions, among others) to come up with this astonishing result: on the condition that the chosen wavelets form a "tight" frame of the functional space at hand, and towards the limit of infinite depth m → ∞, the time variable can ultimately be removed from the equation, because the oscillatory nature of sound vibrations in U_0(t) gets fully characterized by its interference pattern through the scattering network. Going back to the metaphor of Mie scattering in quantum mechanics, it is as though Mallat and Waldspurger had unearthed some kind of all-witnessing crystal, whose eternal glisten were a petrified testimony of every light it had seen before.

Waldspurger's invertibility theorem spurred my interest in improving the state of the art in audio texture synthesis. Nevertheless, one important drawback of the scattering transform — in its original, purely temporal definition — is that it includes neither the notion of relativity of pitch nor that of relativity of tempo. Instead, each wavelet modulus layer decomposes all paths p = (γ_1 … γ_m) asynchronously. It was after personal communications with Shihab Shamma that we realized the crucial importance of accounting for joint modulations in time and frequency (t and λ_1 = 2^{γ_1}); or, said in algebraic terms, for elastic displacements over the affine Weyl-Heisenberg group on L²(ℝ). Consequently, we proceeded to generalize the one-dimensional Morlet wavelet in Equation 1 by a tensor product over multiple variables (v_1 …
v_R), yielding time-frequency scattering wavelets of the form

\[ \Psi_\lambda(v_1 \ldots v_R) = \bigotimes_{r=1}^{R} 2^{\gamma::v_r}\, g_r\!\left(2^{\gamma::v_r} v_r\right) \left[ \exp\!\left(2\pi\mathrm{i}\, 2^{\gamma::v_r} (\theta::v_r)\, v_r\right) - \hat{g}_r\!\left(2^{\gamma::v_r}\right) \right] \tag{3} \]

wherein the multiindex λ encapsulates log-wavelengths γ::v_r ∈ ℝ and particle spins θ ∈ 𝕋, and the infix operator :: denotes list construction ("cons") in the ML family of programming languages. The conceptual jump from purely temporal scattering to time-frequency scattering eventually turned out to be fruitful, but difficult: because wavelengths γ_m at one layer of the network (e.g. pitch γ_1 or tempo γ_2) may take over the roles of spatial variables v_r in a deeper network, keeping track of all cross-dependencies between variables called for a more systematic resort to recursion in our numerical applications.

Andén and myself studied the above definition in complementary ways. He used the principle of stationary phase to confirm that time-frequency scattering characterizes the chirp rates of ripple stimuli, analogously to the STRF in the primary auditory cortex. He also designed a multiresolution analysis scheme for time-frequency scattering, in the fashion of Mallat's discrete wavelet transform algorithm and Simoncelli's steerable pyramid. This scheme made it possible to interpret the time-frequency scattering transform as the response of a deep convolutional neural network whose depth grows logarithmically with receptive field size. On my part, I wrote down the production rules of the following context-sensitive grammar, so that the language of admissible paths in a time-frequency scattering network could be described exhaustively by a nondeterministic Turing machine with linearly bounded tape memory:

S → t
S → t, (γ_1, X*)?
γ_m, X → γ_m, γ_{m+1}, X*
γ_m, X → γ_m, Yⁿ, γ_1::γ_m, θ_1::γ_m, γ_{m+1}, Xⁿ, X?    (n ≥ 0)
γ_m, Y, γ_k → γ_m, γ_{k+1}::γ_m, θ_{k+1}::γ_m, γ_k
Yⁿ, Y, γ_k::γ_m → Yⁿ, γ_{k+1}::γ_m, θ_{k+1}::γ_m, γ_k::γ_m.

Once the recursive grammar above was in place, I was able to reason at compile time on the computation graph of time-frequency scattering architectures, and cast Waldspurger's advances in phase retrieval from time scattering coefficients into a multivariable framework. Upon advice from Joan Bruna, I opted for synthesizing sound by stochastic gradient descent: starting from a random initial guess — usually, Brownian motion noise — this procedure adds a corrective term to the signal at every iteration, so that its time-frequency scattering coefficients match those of a predefined textural target. Incidentally, it is also by means of stochastic gradient descent that most of the algorithms that are known today, albeit somewhat improperly, as artificial intelligence learn to perform tasks of computer vision, automatic speech recognition, and language translation. Because time-frequency scattering networks, just like deep convolutional neural networks, consist of differentiable layers, the corrective term in stochastic gradient descent can be computed by a method of Lagrange multipliers named backpropagation. There is, however, one distinction between the two iterative procedures: whereas in deep learning, gradient backpropagation causes an infinitesimal update of synaptic weights in order to bring the predicted output closer to the ground truth, here the synaptic weights are kept fixed, in the form of wavelet impulse response coefficients; it is the raw waveform itself that gets updated towards a local minimum of the Euclidean error functional

\[ \mathcal{E} = \big\| (\mathcal{E}_m)_m \big\|_2, \quad \text{with} \quad \mathcal{E}_m^2 = \idotsint_{\Lambda_1 \times \cdots \times \Lambda_m} \left( \int_{-\infty}^{+\infty} U_m(\tau, \lambda_1 \ldots \lambda_m)^2 \, \mathrm{d}\tau - \int_{-\infty}^{+\infty} U_m^{\infty}(\tau, \lambda_1 \ldots \lambda_m)^2 \, \mathrm{d}\tau \right) \mathrm{d}^m \lambda. \tag{4} \]
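The synthesis-by-gradient-descent loop described above can be sketched as follows. This is a deliberately simplified stand-in in which the power spectrum plays the role of the scattering coefficients of Equation 4 (the actual procedure backpropagates through the full scattering network); the closed-form gradient below is what backpropagation would compute for this toy objective, and the step size and iteration count are hypothetical.

```python
import numpy as np

def spectrum_error(x, T2):
    """Euclidean error between power spectra: a toy stand-in for Eq. 4, with
    the power spectrum playing the role of the scattering coefficients."""
    return np.sum((np.abs(np.fft.fft(x)) ** 2 - T2) ** 2)

def spectrum_grad(x, T2):
    """Closed-form gradient of spectrum_error with respect to the waveform x,
    i.e. what backpropagation would compute through this analysis operator."""
    X = np.fft.fft(x)
    d = np.abs(X) ** 2 - T2
    return 4.0 * np.real(np.fft.fft(d * np.conj(X)))

def synthesize(target, n_iter=100, lr=1e-9, seed=0):
    """Descend from a Brownian-motion initial guess toward a waveform whose
    power spectrum matches that of `target` (hypothetical step size)."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.standard_normal(len(target)))   # Brownian-motion noise
    x = (x - x.mean()) / x.std()
    T2 = np.abs(np.fft.fft(target)) ** 2
    for _ in range(n_iter):
        x = x - lr * spectrum_grad(x, T2)             # corrective term
    return x
```

The error decreases monotonically as long as the step size stays well below the curvature of the objective; in the full scattering setting, the gradient of each wavelet modulus layer is chained by backpropagation instead of being written in closed form.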
Aside from this technical distinction, audio texture synthesis from scattering coefficients is quite comparable to the training of a deep neural network. In both cases, the system produces uninformative outcomes at the start; then, after being exposed to some real-world data, it adjusts its own predictions by trial and error, until converging to a highly articulate statistical fit. For Joakim Andén and myself, refactoring the source code of the software library for scattering transforms so that it could allow for multivariable architectures and gradient backpropagation was a steady effort of almost two years, with many emotional ups and downs — as often in scientific research. By the end of 2015, we had a working implementation¹ and presented it at the IEEE conference on Machine Learning for Signal Processing (MLSP) in Boston [7]. Our paper boiled down to three claims: first, time-frequency scattering is more mathematically interpretable than other auditory representations, be they engineered or learned; secondly, on some tasks for which the availability of annotated data is limited (e.g. musical instrument recognition), it actually outperforms deep learning classifiers; and thirdly, it makes it possible to reconstruct chirps in audio textures, such as bird vocalizations, with satisfying perceptual similarity to the target. Yet the section on signal re-synthesis was purely meant as an illustration of the capabilities and limitations of time-frequency scattering, as compared to other auditory representations. Never in the research agenda of my PhD did I anticipate that time-frequency scattering could one day prove to be useful to contemporary music creation.

Florian Hecker wrote to me for the first time in the spring of 2016.
He had heard of time-frequency scattering through our mutual colleague Bob Sturm, and wanted to use it as software for texture-related sound synthesis with wavelet features. When we first ran time-frequency scattering on his piece Modulator (2014), I was pleased to find that it performed about as well in terms of perceptual similarity, while converging over 50 times faster. Indeed, contrary to other STRF-inspired software, the time-frequency scattering library used a multiresolution pyramid to spare unnecessary computations in the lower frequencies; moreover, the wavelet factorization in Equation 3 made it possible to vectorize array operations and to rely on fast Fourier transforms (FFT) to speed up convolutions. These technical improvements, although leaving the gist of the algorithm essentially unchanged, noticeably streamlined the compositional workflow by allowing rapid prototyping of ideas. Because running one iteration of stochastic gradient descent now lasted about as long as the target sound clip, it became possible to listen to synthetic texture samples in real time, while time-frequency scattering was progressively converging towards a local optimum of Equation 4.

¹ github.com/lostanlen/scattering.m

I opened this essay by depicting a schematic, and perhaps outdated, dichotomy between musique concrète and Elektronische Musik. I argued that both of these paradigms were following the same artistic research program — that is, to liberate the Western canon from a thousand-year tradition of solmization that gives hegemonic power to the concept of the musical note — yet by clashing ways. What musique concrète gained in terms of timbral sophistication, it lacked in terms of stylistic power.
Conversely, Elektronische Musik achieved a maximal level of creative control, yet was restricted by a rudimentary collection of building blocks: pure tones. This dilemma, as composer Jean-Claude Risset often said, was a direct consequence of the use of analog audio technologies. Now, in the age of digital information, the tradeoff between specificity and expressivity seems to have progressively softened, if not gone obsolete altogether.

In a piece such as Florian Hecker's FAVN (2016), both traditions are kept alive in a perpetual jeu de miroirs which dynamically alternates between the concrète paradigm (i.e. to compute time-frequency scattering coefficients from the reconstructed waveform at iteration n) and the Elektronische paradigm (i.e. to synthesize a waveform at iteration (n + 1) from the numerical parameters obtained through gradient backpropagation at iteration n). Then, once such a playful interaction is in place, the decision of printing out the values of time-frequency scattering coefficients, originating from an analysis of the three movements of FAVN, figurates the ad infinitum limit of both paradigms.

Between the analysis and re-synthesis steps occurs a stage of abstraction: that of sorting all time-frequency scattering paths by the relative amount of energy that they carry. Measuring energy in a given scattering path λ is made possible by the Littlewood-Paley condition

\[ \forall\, \omega::v_r, \quad 1 - \varepsilon \;\leq\; \big| \hat{\varphi}(\omega::v_r) \big|^2 + \frac{1}{2} \sum_{\gamma::v_r} \big| \hat{\psi}_{\gamma::v_r}(\omega::v_r) \big|^2 \;\leq\; 1, \tag{5} \]

which states that, for every variable v_r, the filterbank of wavelets ψ_{γ::v_r} and its corresponding scaling function φ unitarily cover the Fourier domain. This double inequality implies that the amount of energy in a scattering representation is the same at every layer — and, therefore, equal to the energy of the original waveform U_0(t).
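The Littlewood-Paley condition of Equation 5 can be checked numerically. The sketch below assumes a toy filterbank whose squared magnitudes are Gaussian bumps in log-frequency (the actual wavelets of Equation 3 differ), normalizes the upper frame bound to 1, and measures the resulting ε over an interior band of the Fourier domain; the edges of the band would be covered by the scaling function φ.

```python
import numpy as np

def littlewood_paley_sum(log2_omega, gammas, sigma=0.7):
    """(1/2) * sum over gamma of |psi_hat_gamma(omega)|^2 for a toy filterbank
    of Gaussian bumps in log2-frequency; Eq. (5) asks this sum, completed by
    the scaling function, to stay pinched between 1 - epsilon and 1."""
    S = np.zeros_like(log2_omega, dtype=float)
    for gamma in gammas:
        S += 0.5 * np.exp(-((log2_omega - gamma) ** 2) / sigma ** 2)
    return S

u = np.linspace(-5.0, -2.0, 1001)             # interior band of log2(omega)
S = littlewood_paley_sum(u, gammas=range(-8, 1))
S = S / S.max()                               # normalize upper frame bound to 1
epsilon = 1.0 - S.min()                       # measured tightness of the frame
```

With this (hypothetical) spacing and bandwidth, ε comes out near 0.03, i.e. the toy filterbank is a nearly tight frame; shrinking the spacing between adjacent γ drives ε further down.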
Therefore, in the context of time-frequency scattering, and for any value of the path λ = (γ_1, γ_2, γ_1::γ_1), the energy of U_m(t, λ) relative to that of U_0(t) is a dimensionless quantity between zero and one. Multiplying this quantity by 10⁶ converts it into a number of parts per million (ppm). This number is the leftmost column in the table. The second column denotes acoustic frequency in Hertz (Hz), corresponding to the temporal log-frequency variable γ_1 in the first layer of the scattering network. The third column denotes temporal modulation frequency, also known as rate, in Hertz (Hz), corresponding to the temporal log-frequency variable γ_2 in the second layer. It should be remarked that the acoustic frequency belongs to the audible range (20 Hz to 20 kHz), but that the temporal modulation frequency can be as low as 1 Hz, and as high as 1 kHz under the condition γ_1 < γ_2. Lastly, the fourth column denotes frequential modulation frequency, also known as scale, in cycles per octave (c/o), corresponding to the variable γ_1::γ_1 in the second layer.

With the mapping between time-frequency scattering paths p = (γ_1, γ_2, γ_1::γ_1) and average energies in parts per million that is presented herein, there is enough information to replicate the auditory percepts of FAVN, even in the absence of a waveform-domain record of the piece. The numerical tables appearing in these pages epitomize one founding myth of computer music: that of a mental quest for "the" sound. At the limit of technical feasibility, signal reconstruction is perfect and all phase incoherences have disappeared: the outcome is an exact, Elektronische rendition of the original concrète material. In other words, the procedure has gone full circle from Elektronische to concrète and back, without alteration.
Nevertheless, owing to stochastic effects in the sampling of Brownian motion and the finiteness of computational resources, the sonified piece can only be a close approximation of its textual-numerical prototype. In the to and fro of cognitive modeling and acoustic adjustment, the music of signals and the music of symbols chase each other like a cadenced farandole. Quite paradoxically, the impact of mathematical quantization gradually becomes less noticeable as it becomes more accurate.

Here I do not mean to say, in what would be a paraphrase of Leibniz, that "Music is a hidden arithmetic exercise of the soul, which does not know that it is counting". I do not, either, mean that the numeric tables that are printed herein could aspire to be a proxy for the auditory experience: on the contrary, I firmly believe that music is meant to be heard, and that no other medium can replace it, or even refer to it in any formal "word-object" correspondence system. Thirdly, I do not think of music as a language in the same sense as our other forms of communication, be they spoken, written, or signed; and therefore certainly not of this publication as an ersatz of a post-serialist musical score. Rather, and despite the utter ineffability of music, it is possible to shed light upon our shared faculty of recursion, supplemented by perceptual quantization and tabular organization, of which musical notation is a mere by-product.
Far from any neo-numerological considerations, what is, in my mind, the intimate raison d'être of this publication is that it helps us listeners understand two compositional prospects, and wraps them into one: the will to expand the scope of the potentially audible, by seeking more and more complexity in the parametrization of sound synthesis; and the desire to delve deeper into what has been heard, by shifting the auditory focus onto previously unnoticed details. Music is, therefore, a twofold ritual of anticipation. Like the composer, it is in the liminality of finite speeds that the faun shall dwell and thrive.

Acknowledgment

This work is supported by the ERC InvariantClass 320959. The author wishes to thank Théis Bazin, Graham Dove, Jan Ferreira, and Lorenzo Senni for helpful discussions.

References

[1] P. Schaeffer, Treatise on Musical Objects: An Essay across Disciplines. University of California Press, 2017.
[2] J. Harvey, The Music of Stockhausen: An Introduction. University of California Press, 1975.
[3] N. Kowalski, D. Depireux, and S. Shamma, "Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra," Journal of Neurophysiology, vol. 76, no. 5, 1996.
[4] V. Lostanlen, Convolutional Operators in the Time-frequency Domain. PhD thesis, École normale supérieure, 2017.
[5] S. Mallat, "Group invariant scattering," Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
[6] I. Waldspurger, Wavelet Transform Modulus: Phase Retrieval and Scattering. PhD thesis, École normale supérieure, 2015.
[7] J. Andén, V. Lostanlen, and S. Mallat, "Joint time-frequency scattering for audio classification," in Proc. IEEE MLSP, 2015.