Cover Song Synthesis by Analogy


Authors: Christopher J. Tralie

Duke University, Department of Mathematics
ctralie@alumni.princeton.edu

Abstract

In this work, we pose and address the following "cover song analogies" problem: given a song A by artist 1 and a cover song A' of this song by artist 2, and given a different song B by artist 1, synthesize a song B' which is a cover of B in the style of artist 2. Normally, such a polyphonic style transfer problem would be quite challenging, but we show how the cover song example constrains the problem, making it easier to solve. First, we extract the longest common beat-synchronous subsequence between A and A', and we time stretch the corresponding beat intervals in A' so that they align with A. We then derive a version of joint 2D convolutional NMF, which we apply to the constant-Q spectrograms of the synchronized segments to learn a translation dictionary of sound templates from A to A'. Finally, we apply the learned templates as filters to the song B, and we mash up the translated filtered components into the synthesized song B' using audio mosaicing. We showcase our algorithm on several examples, including a synthesized cover version of Michael Jackson's "Bad" by Alien Ant Farm, learned from the latter's "Smooth Criminal" cover.

1 Introduction

Figure 1: A demonstration of the "shape analogies" technique in [22]. In the language of our work, the statue head with a neutral expression (A') is a "cover" of the low resolution mesh A with a neutral expression, and this is used to synthesize the surprised statue "cover" face B' from a low resolution surprised face mesh B.

The rock group Alien Ant Farm has a famous cover of Michael Jackson's "Smooth Criminal" which is faithful to but stylistically unique from the original song. However, to our knowledge, they never released a cover of any other Michael Jackson songs.
What if we instead wanted to know how they would have covered Michael Jackson's "Bad"? That is, we seek a song which is identifiable as MJ's "Bad," but which also sounds as if it's in Alien Ant Farm's style, including timbral characteristics, relative tempo, and instrument types. In general, multimedia style transfer is a challenging task in computer aided creativity applications. When an example of the stylistic transformation is available, as in the "Smooth Criminal" example above, this problem can be phrased in the language of analogies: given an object A and a differently stylized version of this object A', and given an object B in the style of A, synthesize an object B' which has the properties of B but the style of A'. One of the earliest works using this vocabulary is the "image analogies" work [12], which showed it was possible to transfer both linear filters (e.g. blurring, embossing) and nonlinear filters (e.g. watercolors) in the same simple framework. More recent work with convolutional networks has shown even better results for images [10]. There has also been some work on "shape analogies" for 3D meshes [22], in which nonrigid deformations between triangle meshes A and A' are used to induce a corresponding deformation B' from an object B, which can be used for motion transfer (Figure 1).

Figure 2: An ideal cartoon example of our joint 2DNMF and filtering process, where M = 20, N = 60, T = 10, F = 10, and K = 3. In this example, vertical time-frequency blocks are "covered" by horizontal blocks, diagonal lines with negative slopes are covered by diagonal lines with positive slopes, and squares are covered by circles. When presented with a new song B, our goal is to synthesize a song B' whose CQT is shown in the lower right green box.
In the audio realm, most style transfer works are based on mashing up sounds from examples in a target style using "audio mosaicing," usually after manually specifying some desired path through a space of sound grains [16]. A more automated audio mosaicing technique, known as "audio analogies" [19], uses correspondences between a MIDI score and audio to drive concatenative synthesis, which leads to impressive results on monophonic audio, such as stylized synthesis of jazz recordings of a trumpet. More recently, this has evolved into the audio to musical audio setting with audio "musaicing," in which the timbre of an audio source is transferred onto a target by means of a modified NMF algorithm [7], such as bees buzzing The Beatles' "Let It Be." A slightly closer step to the polyphonic (multi source) musical audio to musical audio case has been shown to work for drum audio cross-synthesis with the aid of a musical score [5], and some very recent initial work has extended this to the general musical audio to musical audio case [9] using 2D nonnegative matrix factorization, though this problem still remains open. Finally, there is some recent work on converting polyphonic audio of guitar songs to musical scores of varying difficulties so users can play their own covers [1].

In this work, we constrain the polyphonic musical audio to musical audio style transfer problem by using cover song pairs, or songs which are the same but in a different style, and which act as our ground truth A and A' examples in the analogies framework. Since small scale automatic cover song identification has matured in recent years [17, 18, 4, 23], we can accurately synchronize A and A' (Section 2.1), even if they are in very different styles.
Once they are synchronized, the problem becomes more straightforward, as we can blindly factorize A and A' into different instruments which are in correspondence, turning the problem into a series of monophonic style transfer problems. To do this, we perform NMF2D factorization of A and A' (Section 2.2). We then filter B by the learned NMF templates and mash up audio grains to create B', using the aforementioned "musaicing" techniques [7] (Section 2.3). We demonstrate our techniques on snippets of A, A', and B which are about 20 seconds long, and we show qualitatively how the instruments of A' transfer onto the music of B in the final result B' (Section 3).

2 Algorithm Details

In this section, we describe the steps of our algorithm in more detail.¹

2.1 Cover Song Alignment And Synchronization

Figure 3: An example feature fused cross-similarity matrix D for the first 80 beats of Michael Jackson's "Smooth Criminal," compared to the cover version by Alien Ant Farm. We threshold the matrix and extract 20 seconds of the longest common subsequence, as measured by Smith Waterman. The alignment path is shown in red.

As with the original image analogies algorithm [12], we find it helpful if A and A' are in direct correspondence at the sample level. Since cover song pairs generally differ in tempo, we need to align them first. To accomplish this, we draw upon the state of the art "early fusion" cover song alignment technique presented by the authors of [23]. Briefly, we extract beat onsets for A and A' using either a simple dynamic programming beat tracker [8] or slower but more accurate RNN + Bayesian beat trackers [13], depending on the complexity of the audio. We then compute beat-synchronous sliding window HPCP and MFCC features, and we fuse them using similarity network fusion [25, 26].
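As a toy illustration of the beat-level comparison used in this section, the sketch below builds a cross-similarity matrix D between two beat-synchronous feature sequences. Cosine similarity over random placeholder features is an assumption here, standing in for the paper's fused HPCP/MFCC features and similarity network fusion; only the shape of the computation is meant to match.

```python
import numpy as np

def cross_similarity(FA, FB):
    """Cosine cross-similarity: D[i, j] is large when beat i of song A
    sounds like beat j of song A'.  Rows of FA/FB are per-beat features."""
    a = FA / (np.linalg.norm(FA, axis=1, keepdims=True) + 1e-12)
    b = FB / (np.linalg.norm(FB, axis=1, keepdims=True) + 1e-12)
    return a @ b.T

# Random stand-ins for fused beat-synchronous features (M = 80, N = 90 beats)
rng = np.random.default_rng(0)
D = cross_similarity(rng.random((80, 32)), rng.random((90, 32)))
print(D.shape)  # (80, 90)
```

With nonnegative features, every entry of D lies in [0, 1], so D can be thresholded directly as described next.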
The result is an M × N cross-similarity matrix D, where M is the number of beats in A and N is the number of beats in A', and D_ij is directly proportional to the similarity between beat i of A and beat j of A'. Please refer to [23] for more details. Once we have the matrix D, we can extract an alignment between A and A' by performing Smith Waterman [20] on a binary thresholded version of D, as in [23]. We make one crucial modification, however. To allow for more permissive alignments with missing beats for identification purposes, the original cover songs algorithm creates a binary thresholded version of D using 10% mutual binary nearest neighbors. In this application, on the other hand, we seek shorter snippets from each song which are as well aligned as possible. Therefore, we create a stricter binary thresholded version B, where B_ij = 1 only if D_ij is in the top 3√(MN) distances over all MN distances in D. This means that many rows of B will be all zeros, but we will hone in on the best matching segments. Figure 3 shows such a thresholding of the cross-similarity matrix for two versions of the song "Smooth Criminal," which is an example we will use throughout this section.

Once B has been computed, we compute an X-length alignment path P by back-tracing through the Smith Waterman alignment matrix, as shown in Figure 3. Let the beat onset times for A in the path P be t_1, t_2, ..., t_X and the beat times for A' be s_1, s_2, ..., s_X. We use the rubberband library [3] to time stretch A' beat by beat, so that interval [s_i, s_{i+1}] is stretched by a factor (t_{i+1} − t_i)/(s_{i+1} − s_i). The result is a snippet of A' which is the same length as the corresponding snippet in A. Henceforth, we will abuse notation and refer to these snippets as A and A'. We also extract a smaller snippet from B of the same length for reasons of memory efficiency, which we will henceforth refer to as B.

¹ Note that all audio is mono and sampled at 22050 Hz.

Figure 4: Joint 2DNMF on the magnitude CQTs of "Smooth Criminal" by Michael Jackson and Alien Ant Farm, with F = 14 (7 halfsteps at 24 bins per octave), T = 20 (130 ms in time), and K = 2 components. In this case, W_1^1 and W_2^1 hold percussive components, and W_1^2 and W_2^2 hold guitar components.

2.2 NMF2D for Joint Blind Factorization / Filtering

Once we have synchronized the snippets A and A', we blindly factorize and filter them into K corresponding tracks A_1, A_2, ..., A_K and A'_1, A'_2, ..., A'_K. The goal is for each track to contain a different instrument. For instance, A_1 may contain an acoustic guitar, which is covered by an electric guitar contained in A'_1, and A_2 may contain drums which are covered by a drum machine in A'_2. This makes it possible to reduce the synthesis problem to K independent monophonic analogy problems in Section 2.3. To accomplish this, the main tool we use is 2D convolutional nonnegative matrix factorization (2DNMF) [15]. We apply this algorithm to the magnitudes of the constant-Q transforms (CQTs) [2] C_A, C_{A'}, and C_B of A, A', and B, respectively. Intuitively, we seek K time-frequency templates in A and A' which represent small, characteristic snippets of the K instruments we believe to be present in the audio. We then approximate the magnitude constant-Q transforms of A and A' by convolving these templates in both time and frequency. The constant-Q transform is necessary in this framework, since a pitch shift can be approximated by a linear shift of all CQT bins by the same amount. Frequency shifts can then be represented as convolutions in the vertical direction, which is not possible with the ordinary STFT.
Though it is more complicated, 2DNMF is a more compact representation than 1D convolutional NMF (in time only), in which it is necessary to store a different template for each pitch shift of each instrument. We note that pitch shifts in real instruments are more complicated than shifting all frequency bins by the same perceptual amount [14], but the basic version of 2DNMF is fine for our purposes. More concretely, define a K-component, F-frequency, T-time lag 2D convolutional NMF decomposition for a matrix X ∈ R^{M×N} as follows:

    X ≈ Λ_{W,H} = Σ_{τ=1}^{T} Σ_{φ=1}^{F} (↓φ W^τ)(→τ H^φ)    (1)

where W^τ ∈ R^{M×K} and H^φ ∈ R^{K×N} store a τ-shifted time template and φ-frequency shifted coefficients, respectively. By ↓φ A, we mean down shift the rows of A by φ, so that row i of ↓φ A is row i − φ of A, and the first φ rows of ↓φ A are all zeros. Similarly, →τ A right-shifts the columns of A by τ, so that column j of →τ A is column j − τ of A, and the first τ columns of →τ A are all zeros; ↑φ A and ←τ A denote the corresponding up and left shifts, also padding with zeros.

In our problem, we define an extension of 2DNMF to jointly factorize the two songs A and A', which each have M CQT coefficients per frame. In particular, given matrices C_A, C_{A'} ∈ C^{M×N_1} representing the complex CQT frames in each song over time, we seek W_1^τ, W_2^τ ∈ R^{M×K} and H_1^φ ∈ R^{K×N_1} that minimize the sum of the Kullback-Leibler divergences between the magnitude CQT coefficients and the convolutions:

    D(|C_A| || Λ_{W_1,H_1}) + D(|C_{A'}| || Λ_{W_2,H_1})    (2)

where the Kullback-Leibler divergence D(X || Y) is defined as

    D(X || Y) = Σ_i Σ_j ( X_ij log(X_ij / Y_ij) − X_ij + Y_ij )    (3)

That is, we share H_1 between the factorizations of |C_A| and |C_{A'}| so that we can discover shared structure between the covers.
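The shift operators in Equation 1 pad with zeros rather than wrapping around. A minimal NumPy illustration of ↓φ and →τ on a small matrix (shift amounts indexed as in the definitions above):

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)

def down_shift(A, phi):
    """(down-arrow phi) A: row i of the result is row i - phi of A;
    the first phi rows are zero."""
    out = np.zeros_like(A)
    out[phi:, :] = A[:A.shape[0] - phi, :]
    return out

def right_shift(A, tau):
    """(right-arrow tau) A: column j of the result is column j - tau of A;
    the first tau columns are zero."""
    out = np.zeros_like(A)
    out[:, tau:] = A[:, :A.shape[1] - tau]
    return out

print(down_shift(A, 1))
# [[0 0 0]
#  [1 2 3]
#  [4 5 6]]
print(right_shift(A, 1))
# [[0 1 2]
#  [0 4 5]
#  [0 7 8]]
```

The up and left shifts used in the update rules below are the mirror images, padding with zeros at the bottom and right, respectively.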
Following similar computations to those of ordinary 2DNMF [15], it can be shown that Equation 2 is non-increasing under the alternating update rules:

    W_1^τ ← W_1^τ ⊗ [ Σ_{φ=1}^{F} ↑φ( |C_A| / Λ_{W_1,H_1} ) (→τ H_1^φ)^T ] / [ Σ_{φ=1}^{F} 1 (→τ H_1^φ)^T ]    (4)

    W_2^τ ← W_2^τ ⊗ [ Σ_{φ=1}^{F} ↑φ( |C_{A'}| / Λ_{W_2,H_1} ) (→τ H_1^φ)^T ] / [ Σ_{φ=1}^{F} 1 (→τ H_1^φ)^T ]    (5)

    H_1^φ ← H_1^φ ⊗ [ Σ_{τ=1}^{T} (↓φ W_1^τ)^T ←τ( |C_A| / Λ_{W_1,H_1} ) + (↓φ W_2^τ)^T ←τ( |C_{A'}| / Λ_{W_2,H_1} ) ] / [ Σ_{τ=1}^{T} (↓φ W_1^τ)^T ←τ 1 + (↓φ W_2^τ)^T ←τ 1 ]    (6)

where 1 is a matrix of all 1s of appropriate dimension, and the multiplications and divisions between matrices are element-wise.

We need an invertible CQT to go back to audio templates, so we use the non-stationary Gabor Transform (NSGT) implementation of the CQT [24] to compute C_A, C_{A'}, and C_B. We use 24 bins per octave between 50 Hz and 11.7 kHz, for a total of 189 CQT bins. We also use F = 14 in most of our examples, allowing 7 halfstep shifts, and we use T = 20 on temporally downsampled CQTs to cover a timespan of 130 milliseconds. Finally, we iterate through Equations 4, 5, and 6 in sequence 300 times.

Note that a naive implementation of the above equations can be very computationally intensive. To ameliorate this, we implemented GPU versions of Equations 1, 4, 5, and 6. Equation 1 in particular is well suited for a parallel implementation, as the shifted convolutional blocks overlap each other heavily and can be carefully offloaded into shared memory to exploit this.² In practice, we witnessed a 30x speedup of our GPU implementation over our CPU implementation for 20 second audio clips for A, A', and B. Figure 2 shows a synthetic example with an exact solution, and Figure 4 shows a local minimum which is the result of running Equations 4, 5, and 6 on real audio from the "Smooth Criminal" example.
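A minimal CPU sketch of the joint factorization, assuming toy random matrices in place of the magnitude CQTs; the zero-based shift indexing and the small eps smoothing terms are implementation choices, not part of the paper's derivation. Each pass applies Equations 4, 5, and 6 in sequence, with H shared between the two songs:

```python
import numpy as np

def shift(A, n, axis):
    """Shift A by n along the given axis, zero-padding (n may be negative)."""
    if n == 0:
        return A.copy()
    out = np.zeros_like(A)
    src = [slice(None)] * A.ndim
    dst = [slice(None)] * A.ndim
    if n > 0:
        src[axis], dst[axis] = slice(0, A.shape[axis] - n), slice(n, None)
    else:
        src[axis], dst[axis] = slice(-n, None), slice(0, A.shape[axis] + n)
    out[tuple(dst)] = A[tuple(src)]
    return out

def conv_model(W, H):
    """Lambda_{W,H} of Equation 1; W: (T, M, K), H: (F, K, N)."""
    T, F = W.shape[0], H.shape[0]
    return sum(shift(W[t], f, 0) @ shift(H[f], t, 1)
               for t in range(T) for f in range(F))

def kl_div(X, Y, eps=1e-9):
    """Kullback-Leibler divergence of Equation 3."""
    return np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y)

def joint_update(CA, CA2, W1, W2, H, eps=1e-9):
    """One pass of the multiplicative updates (Equations 4-6)."""
    T, F = W1.shape[0], H.shape[0]
    ones = np.ones_like(CA)
    for W, C in ((W1, CA), (W2, CA2)):          # Equations 4 and 5
        ratio = C / (conv_model(W, H) + eps)
        for t in range(T):
            num = sum(shift(ratio, -f, 0) @ shift(H[f], t, 1).T for f in range(F))
            den = sum(ones @ shift(H[f], t, 1).T for f in range(F)) + eps
            W[t] *= num / den
    r1 = CA / (conv_model(W1, H) + eps)          # Equation 6 (H is shared)
    r2 = CA2 / (conv_model(W2, H) + eps)
    for f in range(F):
        num = sum(shift(W1[t], f, 0).T @ shift(r1, -t, 1) +
                  shift(W2[t], f, 0).T @ shift(r2, -t, 1) for t in range(T))
        den = sum(shift(W1[t], f, 0).T @ shift(ones, -t, 1) +
                  shift(W2[t], f, 0).T @ shift(ones, -t, 1) for t in range(T)) + eps
        H[f] *= num / den
    return W1, W2, H

# Toy magnitudes standing in for |C_A| and |C_A'|
rng = np.random.default_rng(0)
CA, CA2 = rng.random((20, 30)) + 0.1, rng.random((20, 30)) + 0.1
W1, W2 = rng.random((3, 20, 2)), rng.random((3, 20, 2))  # T=3, M=20, K=2
H = rng.random((2, 2, 30))                               # F=2, K=2, N=30
cost0 = kl_div(CA, conv_model(W1, H)) + kl_div(CA2, conv_model(W2, H))
for _ in range(30):
    W1, W2, H = joint_update(CA, CA2, W1, W2, H)
cost1 = kl_div(CA, conv_model(W1, H)) + kl_div(CA2, conv_model(W2, H))
print(cost1 < cost0)  # the joint objective (Equation 2) decreases
```

Because the updates are multiplicative, nonnegativity of W_1, W_2, and H is preserved automatically from a nonnegative initialization.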
It is evident from H_1 that the first component is percussive (activations at regular intervals in H_1^1, and no pitch shifts), while the second component corresponds to the guitar melody (H_1^2 appears like a "musical score" of sorts). Furthermore, W_1^1 and W_2^1 have broadband frequency content consistent with percussive events, while W_1^2 and W_2^2 have visible harmonics consistent with vibrating strings. Note that we generally use more than K = 2 components, which allows finer granularity than harmonic/percussive, but even for K = 2 in this example, we observe qualitatively better separation than off-the-shelf harmonic/percussive separation algorithms [6].

Once we have W_1, W_2, and H_1, we can recover the audio templates A_1, A_2, ..., A_K and A'_1, A'_2, ..., A'_K by using the components of W_1 and W_2 as filters. First, define Λ_{W,H,k} as

    Λ_{W,H,k} = Σ_{τ=1}^{T} Σ_{φ=1}^{F} (↓φ W_k^τ)(→τ H_k^φ)    (7)

where W_k^τ is the kth column of W^τ and H_k^φ is the kth row of H^φ. Now, define the filtered CQTs by using soft masks derived from W_1, W_2, and H_1:

    C_{A_k} = C_A ⊗ ( Λ_{W_1,H_1,k}^p / Σ_{m=1}^{K} Λ_{W_1,H_1,m}^p )    (8)

    C_{A'_k} = C_{A'} ⊗ ( Λ_{W_2,H_1,k}^p / Σ_{m=1}^{K} Λ_{W_2,H_1,m}^p )    (9)

where p is some positive integer applied element-wise (we choose p = 2), and the above multiplications and divisions are also applied element-wise. It is now possible to invert the CQTs to recover the audio templates, using the inverse NSGT [24]. Thus, if the separation was good, we are left with K independent monophonic pairs of sounds between A and A'.

² In the interest of space, we omit more details of our GPU-based NMFD in this paper, but a documented implementation can be found at https://github.com/ctralie/CoverSongSynthesis/, and we plan to release more details in a companion paper later.
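The soft masking of Equations 8 and 9 can be sketched in a few lines, here on a toy complex CQT and random per-component model magnitudes standing in for the Λ_{W,H,k}. Because the K masks sum to one at every time-frequency bin, the filtered CQTs sum back to the original:

```python
import numpy as np

def soft_mask_filter(C, Lambdas, p=2, eps=1e-12):
    """Split a complex CQT C into K filtered CQTs with the soft masks of
    Equations 8-9, given per-component model magnitudes Lambdas[k]."""
    Lp = np.stack(Lambdas) ** p                      # (K, M, N)
    masks = Lp / (Lp.sum(axis=0, keepdims=True) + eps)
    return [C * m for m in masks]

# Toy complex CQT and K = 3 nonnegative component models
rng = np.random.default_rng(0)
C = rng.random((189, 100)) * np.exp(2j * np.pi * rng.random((189, 100)))
Lambdas = [rng.random((189, 100)) for _ in range(3)]
tracks = soft_mask_filter(C, Lambdas)
print(np.allclose(sum(tracks), C))  # True: the masks sum to one
```

In the paper's pipeline each element of `tracks` would then be inverted with the inverse NSGT to obtain one monophonic audio track.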
In addition to inverting the sounds after these masks are applied, we can also listen to the components of W_1 and W_2 themselves to gain insight into how they are behaving as filters. Since W_1 and W_2 are magnitude only, we apply the Griffin-Lim algorithm [11] to perform phase retrieval, and then we invert them as before to obtain 130 millisecond sounds for each k.

2.3 Musaicing And Mixing

We now describe how to use the corresponding audio templates we learned in Section 2.2 to perform style transfer on a new piece of audio, B.

2.3.1 Separating Tracks in B

First, we compute the CQT of B, C_B ∈ C^{M×N_2}. We then represent its magnitude using W_1 as a basis, so that we filter B into the same set of instruments into which A was separated. That is, we solve for H_2 so that |C_B| ≈ Λ_{W_1,H_2}. This can be performed with ordinary 2DNMF, holding W_1 fixed; that is, repeating the following update until convergence:

    H_2^φ ← H_2^φ ⊗ [ Σ_{τ=1}^{T} (↓φ W_1^τ)^T ←τ( |C_B| / Λ_{W_1,H_2} ) ] / [ Σ_{τ=1}^{T} (↓φ W_1^τ)^T ←τ 1 ]    (10)

As with A and A', we can now filter B into a set of audio tracks B_1, B_2, ..., B_K by first computing filtered CQTs as follows

    C_{B_k} = C_B ⊗ ( Λ_{W_1,H_2,k}^p / Σ_{m=1}^{K} Λ_{W_1,H_2,m}^p )    (11)

and then inverting them.

2.3.2 Constructing B' Track by Track

At this point, we could use H_2 and let our cover song CQT magnitudes be |C_{B'}| = Λ_{W_2,H_2}, followed by Griffin-Lim to recover the phase. However, we have found that the resulting sounds are too "blurry," as they lack all but re-arranged low rank detail from A'. Instead, we choose to draw sound grains from the inverted, filtered tracks from A', which contain all of the detail of the original song. For this, we first reconstruct each track of B using audio grains from the corresponding track in A, and then we swap in the corresponding grains from A' to synthesize each track of B'.
To accomplish this, we apply the audio musaicing technique of Driedger [7] to each track in B, using source audio A.

For computational and memory reasons, we now abandon the CQT and switch to the STFT with hop size h = 256 and window size w = 2048. More specifically, let N_1 be the number of STFT frames in A and A' and let N_2 be the number of STFT frames in B. For each A_k, we create an STFT matrix S_{A_k} which holds the STFT of A_k concatenated to pitch shifted versions of A_k, so that pitches beyond what were in A can be represented, but by the same instruments. We use ±6 halfsteps, so S_{A_k} ∈ C^{w×13N_1}. We do the same for A'_k to create S_{A'_k} ∈ C^{w×13N_1}. Finally, we compute the STFT of B_k without any shifts: S_{B_k} ∈ C^{w×N_2}. Now, we apply Driedger's technique, using |S_{A_k}| as a spectral dictionary to reconstruct |S_{B_k}| (S_{A_k} is analogous to the buzzing bees in [7]). That is, we seek an H_k ∈ R^{13N_1×N_2} so that

    |S_{B_k}| ≈ |S_{A_k}| H_k    (12)

For completeness, we briefly re-state the iterations in Driedger's technique [7]. For L iterations total, at the ℓth iteration, compute the following four updates in order. First, we restrict the number of repeated activations by filtering in a maximum horizontal neighborhood in time:

    R_km^(ℓ) = { H_km^(ℓ)                    if H_km^(ℓ) = μ_km^(r,ℓ)
               { H_km^(ℓ) (1 − (ℓ+1)/L)     otherwise    (13)

where μ_km^(r,ℓ) holds the maximum in a neighborhood H_{k, m−r:m+r} for some parameter r (we choose r = 3 in our examples). Next, we restrict the number of simultaneous activations by shrinking all of the values in each column that are less than the top p values in that column:

    P_km^(ℓ) = { R_km^(ℓ)                    if R_km^(ℓ) ≥ M_m^(p,ℓ)
               { R_km^(ℓ) (1 − (ℓ+1)/L)     otherwise    (14)

where M^(p,ℓ) is a row vector holding the pth largest value of each column of R^(ℓ) (we choose p = 10 in our examples).
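These two restriction updates can be sketched directly in NumPy, on a toy activation matrix standing in for H_k; the helper and matrix sizes here are illustrative assumptions:

```python
import numpy as np

def shift_cols(A, n):
    """Shift the columns of A by n, zero-padding at the edges."""
    if n == 0:
        return A.copy()
    out = np.zeros_like(A)
    if n > 0:
        out[:, n:] = A[:, :-n]
    else:
        out[:, :n] = A[:, -n:]
    return out

def restrict_activations(H, l, L, r=3, p=10):
    """The restriction updates of Equations 13 and 14 on an
    activation matrix H, at iteration l of L total."""
    decay = 1.0 - (l + 1) / L
    # (13) keep entries that are the maximum in a horizontal window of radius r
    mu = np.max([shift_cols(H, s) for s in range(-r, r + 1)], axis=0)
    R = np.where(H == mu, H, H * decay)
    # (14) per column, keep only values at or above the p-th largest
    cutoff = np.sort(R, axis=0)[-p, :]
    return np.where(R >= cutoff, R, R * decay)

rng = np.random.default_rng(0)
H = rng.random((40, 25))          # toy stand-in for H_k
P = restrict_activations(H, l=0, L=100)
print(P.shape)  # (40, 25)
```

Every entry of the result is either kept or shrunk, so the output never exceeds the input, and the shrinkage becomes total (decay of zero) at the final iteration.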
After this, we promote time-continuous structures by convolving along all of the diagonals of P:

    C_km^(ℓ) = Σ_{i=−c}^{c} P_{(k+i),(m+i)}^(ℓ)    (15)

We choose c = 3 in our examples. Finally, we perform the ordinary KL-based NMF update:

    H^(ℓ+1) ← H^(ℓ) ⊗ [ |S_{A_k}|^T ( |S_{B_k}| / (|S_{A_k}| C^(ℓ)) ) ] / [ |S_{A_k}|^T 1 ]    (16)

We perform 100 such iterations (L = 100). Once we have our final H, we can use it to create B'_k as follows:

    S_{B'_k} = S_{A'_k} H    (17)

In other words, we use the activations learned to reconstruct S_{B_k} from S_{A_k}, but we instead apply these activations to the dictionary S_{A'_k}. This is the key step in the style transfer, and it is done for the same reason that H_1 is shared between A and A'. Figure 5 shows an example for the guitar track on Michael Jackson's "Bad," translating to Alien Ant Farm using A and A' templates from Michael Jackson's and Alien Ant Farm's versions of "Smooth Criminal," respectively. While it is visually apparent that |S_{A_k}| H ≈ |S_{B_k}|, it is also apparent that S_{B'_k} = S_{A'_k} H is similar to S_{B_k}, except it has more energy in the higher frequencies. This is consistent with the fact that Alien Ant Farm uses a more distorted electric guitar, which has more broadband energy. To create our final B', we simply sum the STFTs S_{B'_k} over all k, and we perform an inverse STFT to go back to audio.

Figure 5: Driedger's technique [7] using audio grains from pitch shifted versions of A_k (the kth track in Michael Jackson's "Smooth Criminal") to create B_k (the kth track in Michael Jackson's "Bad"), and using the activations to create B'_k (the kth synthesized track in Alien Ant Farm's "Bad").

2.4 A Note About Tempos

The algorithm we have described so far assumes that the tempos t_A, t_{A'}, and t_B of A, A', and B, respectively, are similar. This is certainly not true in more interesting covers.
Section 2.1 took care of the disparity between A and A' during the synchronization. However, we also need to perform a tempo scaling on B by a factor of t_A/t_B before running our algorithm. Once we have computed B', whose tempo is initially t_A, we scale its tempo back by (t_B/t_A) · (t_{A'}/t_A). For instance, suppose that t_A = 60, t_{A'} = 80, and t_B = 120. Then the final tempo of B' will be 60 × (120/60) × (80/60) = 160 bpm.

3 Experimental Examples

We now qualitatively explore our technique on several examples.³ In all of our examples, we use K = 3 sources. This is a hyperparameter that should be chosen with care in general, since we would like each component to correspond to a single instrument or group of related instruments. However, our initial examples are simple enough that we expect basically a harmonic source, a percussive source, and one other source or sub-separation between harmonic and percussive.

First, we follow through on the example we have been exploring, and we synthesize Alien Ant Farm's "Bad" from Michael Jackson's "Bad" (B), using "Smooth Criminal" as an example. The translation of the guitar from synth to electric is clearly audible in the final result. Furthermore, a track which was exclusively drums in A included some extra screams in A' that Alien Ant Farm performed as embellishments. These embellishments transferred over to B' in the "Bad" example, further reinforcing Alien Ant Farm's style. Note that these screams would not have been preserved had we simply inverted the CQT Λ_{W_2,H_2}, but they are present in one of the filtered tracks and available as audio grains during musaicing for that track. In addition to the "Bad" example, we also synthesize Alien Ant Farm's version of "Wanna Be Startin Something," using "Smooth Criminal" as an example for A and A' once again.

³ Please listen to our results at http://www.covers1000.net/analogies.html
In this example, Alien Ant Farm's screams occur consistently with the fast drum beat every measure. Finally, we explore an example with a more extreme tempo shift between A and A' (t_{A'} < t_A). We use Marilyn Manson's cover of "Sweet Dreams" by the Eurythmics to synthesize a Marilyn Manson cover of "Who's That Girl" by the Eurythmics. We found that in this particular example, we obtained better results when we performed 2DNMF on |C_A| by itself first, and then performed the optimizations in Equation 5 and Equation 6, holding W_1 fixed. This is a minor tweak that can be left to the discretion of the user at run time.

Overall, our technique works well for instrument translation in these three examples. However, we did notice that the vocals did not carry over at all in any of them. This is to be expected, since singing voice separation often assumes that the instruments are low rank and the voice is high rank [21], and our filters and final mosaicing are both derived from low rank NMF models.

4 Discussion / Future Directions

In this work, we demonstrated a proof of concept, fully automated end to end system which can synthesize a cover song snippet of a song B given an example cover pair A and A', where A is by the same artist as B, and B' should sound like B but in the style of A'. We showed some promising initial results on a few examples, which is, to our knowledge, one of the first steps in the challenging direction of automatic polyphonic audio musaicing. Our technique does have some limitations, however. Since we use W_1 for both A and B, we are limited to examples in which A and B have similar instruments. It would be interesting to explore how far one could push this technique with different instruments between A and B, which happens quite often even within a corpus by the same artist.
We have also noticed that in addition to singing voice, other "high rank" instruments, such as the fiddle, cannot be properly translated. We believe that translating such instruments and voices would be an interesting and challenging future direction of research, and it would likely need a completely different approach from the one we presented here. Finally, out of the three main steps of our pipeline, synchronization (Section 2.1), blind joint factorization/source separation (Section 2.2), and filtering/musaicing (Section 2.3), the weakest step by far is the blind source separation. The single channel source separation problem is still far from solved in general even without the complication of cover songs, so that will likely remain the weakest step for some time. If one has access to the unmixed studio tracks for A, A', and B, though, that step can be entirely circumvented; the algorithm would remain the same, and one would expect higher quality results. Unfortunately, such tracks are difficult to find in general for those who do not work in a music studio, which is why blind source separation also remains an important problem in its own right.

5 Acknowledgements

Christopher Tralie was partially supported by an NSF big data grant DKA-1447491 and an NSF Research Training Grant NSF-DMS-1045133. We also thank Brian McFee for helpful discussions about invertible CQTs.

References

[1] Shunya Ariga, Satoru Fukayama, and Masataka Goto. Song2Guitar: A difficulty-aware arrangement system for generating guitar solo covers from polyphonic audio of popular music. In 18th International Society for Music Information Retrieval (ISMIR), 2017.

[2] Judith C Brown. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1):425–434, 1991.

[3] C Cannam. Rubber Band library. Software released under GNU General Public License (version 1.8.1), 2012.
[4] Ning Chen, Wei Li, and Haidong Xiao. Fusing similarity functions for cover song identification. Multimedia Tools and Applications, pages 1–24, 2017.

[5] Christian Dittmar and Meinard Müller. Reverse engineering the Amen break – score-informed separation and restoration applied to drum recordings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1531–1543, 2016.

[6] Jonathan Driedger, Meinard Müller, and Sascha Disch. Extending harmonic-percussive separation of audio signals. In ISMIR, pages 611–616, 2014.

[7] Jonathan Driedger, Thomas Prätzlich, and Meinard Müller. Let It Bee – towards NMF-inspired audio mosaicing. In ISMIR, pages 350–356, 2015.

[8] Daniel PW Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60, 2007.

[9] Hadrien Foroughmand and Geoffroy Peeters. Multi-source musaicing using non-negative matrix factor 2-D deconvolution. In 18th International Society for Music Information Retrieval (ISMIR), Late Breaking Session, 2017.

[10] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

[11] Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.

[12] Aaron Hertzmann, Charles E Jacobs, Nuria Oliver, Brian Curless, and David H Salesin. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 327–340. ACM, 2001.

[13] Florian Krebs, Sebastian Böck, and Gerhard Widmer. An efficient state-space model for joint tempo and meter tracking. In ISMIR, pages 72–78, 2015.

[14] Tomohiko Nakamura and Hirokazu Kameoka. Shifted and convolutive source-filter non-negative matrix factorization for monaural audio source separation.
In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 489–493. IEEE, 2016.

[15] Mikkel N Schmidt and Morten Mørup. Nonnegative matrix factor 2-D deconvolution for blind single channel source separation. In International Conference on Independent Component Analysis and Signal Separation, pages 700–707. Springer, 2006.

[16] Diemo Schwarz, Roland Cahen, and Sam Britton. Principles and applications of interactive corpus-based concatenative synthesis. In Journées d'Informatique Musicale (JIM), pages 1–1, 2008.

[17] Joan Serra, Xavier Serra, and Ralph G Andrzejak. Cross recurrence quantification for cover song identification. New Journal of Physics, 11(9):093017, 2009.

[18] Diego F Silva, Chin-Chia M Yeh, Gustavo Enrique de Almeida Prado Alves Batista, Eamonn Keogh, et al. SiMPle: assessing music similarity using subsequences joins. In International Society for Music Information Retrieval Conference, XVII. International Society for Music Information Retrieval-ISMIR, 2016.

[19] Ian Simon, Sumit Basu, David Salesin, and Maneesh Agrawala. Audio analogies: Creating new music from an existing performance by concatenative synthesis. In ICMC. Citeseer, 2005.

[20] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[21] Pablo Sprechmann, Alexander M Bronstein, and Guillermo Sapiro. Real-time online singing voice separation from monaural recordings using robust low-rank modeling. In ISMIR, pages 67–72, 2012.

[22] Robert W Sumner and Jovan Popović. Deformation transfer for triangle meshes. In ACM Transactions on Graphics (TOG), volume 23, pages 399–405. ACM, 2004.

[23] Christopher J Tralie. Early MFCC and HPCP fusion for robust cover song identification. In 18th International Society for Music Information Retrieval (ISMIR), 2017.
[24] Gino Angelo Velasco, Nicki Holighaus, Monika Dörfler, and Thomas Grill. Constructing an invertible constant-Q transform with non-stationary Gabor frames. Proceedings of DAFX11, Paris, pages 93–99, 2011.

[25] Bo Wang, Jiayan Jiang, Wei Wang, Zhi-Hua Zhou, and Zhuowen Tu. Unsupervised metric fusion by cross diffusion. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2997–3004. IEEE, 2012.

[26] Bo Wang, Aziz M Mezlini, Feyyaz Demir, Marc Fiume, Zhuowen Tu, Michael Brudno, Benjamin Haibe-Kains, and Anna Goldenberg. Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11(3):333–337, 2014.
