An Overview of Lead and Accompaniment Separation in Music



Zafar Rafii, Member, IEEE, Antoine Liutkus, Member, IEEE, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, Student Member, IEEE, Derry FitzGerald, and Bryan Pardo, Member, IEEE

Z. Rafii is with Gracenote, Emeryville, CA, USA (zafar.rafii@nielsen.com). A. Liutkus and F.-R. Stöter are with Inria and LIRMM, University of Montpellier, France (firstname.lastname@inria.fr). S.I. Mimilakis is with Fraunhofer IDMT, Ilmenau, Germany (mis@idmt.fraunhofer.de). D. FitzGerald is with Cork School of Music, Cork Institute of Technology, Cork, Ireland (Derry.Fitzgerald@cit.ie). B. Pardo is with Northwestern University, Evanston, IL, USA (pardo@northwestern.edu). This work was partly supported by the research programme KAMoulox (ANR-15-CE38-0003-01) funded by ANR, the French State agency for research. S.I. Mimilakis is supported by the European Union's H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement no. 642685 (MacSeNet).

Abstract—Popular music is often composed of an accompaniment and a lead component, the latter typically consisting of vocals. Filtering such mixtures to extract one or both components has many applications, such as automatic karaoke and remixing. This particular case of source separation yields very specific challenges and opportunities, including the particular complexity of musical structures, but also relevant prior knowledge coming from acoustics, musicology, or sound engineering. Due to both its importance in applications and its challenging difficulty, lead and accompaniment separation has been a popular topic in signal processing for decades. In this article, we provide a comprehensive review of this research topic, organizing the different approaches according to whether they are model-based or data-centered. For model-based methods, we organize them according to whether they concentrate on the lead signal, the accompaniment, or both. For data-centered approaches, we discuss the particular difficulty of obtaining data for learning lead separation systems, and then review recent approaches, notably those based on deep learning. Finally, we discuss the delicate problem of evaluating the quality of music separation through adequate metrics and present the results of the largest evaluation, to date, of lead and accompaniment separation systems. In conjunction with the above, a comprehensive list of references is provided, along with relevant pointers to available implementations and repositories.

Index Terms—Source separation, music, accompaniment, lead, overview.

I. INTRODUCTION

MUSIC is a major form of artistic expression and plays a central role in the entertainment industry. While digitization and the Internet led to a revolution in the way music reaches its audience [1], [2], there is still much room to improve on how one interacts with musical content, beyond simply controlling the master volume and equalization. The ability to interact with the individual audio objects (e.g., the lead vocals) in a music recording would enable diverse applications such as music upmixing and remixing, automatic karaoke, object-wise equalization, etc.

Most publicly available music recordings (e.g., CDs, YouTube, iTunes, Spotify) are distributed as mono or stereo mixtures with multiple sound objects sharing a track.
Therefore, manipulation of individual sound objects requires separation of the stereo audio mixture into several tracks, one for each different sound source. This process is called audio source separation, and this overview paper is concerned with an important particular case: isolating the lead source — typically, the vocals — from the musical accompaniment (all the rest of the signal).

As a general problem in applied mathematics, source separation has enjoyed tremendous research activity for roughly 50 years and has applications in various fields such as bioinformatics, telecommunications, and audio. Early research focused on so-called blind source separation, which typically builds on very weak assumptions about the signals that comprise the mixture in conjunction with very strong assumptions on the way they are mixed. The reader is referred to [3], [4] for a comprehensive review on blind source separation. Typical blind algorithms, e.g., independent component analysis (ICA) [5], [6], depend on assumptions such as: source signals are independent, there are more mixture channels than there are signals, and mixtures are well modeled as a linear combination of signals. While such assumptions are appropriate for some signals like electroencephalograms, they are often violated in audio.

Much research in audio-specific source separation [7], [8] has been motivated by the speech enhancement problem [9], which aims to recover clean speech from noisy recordings and can be seen as a particular instance of source separation. In this respect, many algorithms assume the audio background can be modeled as stationary. However, musical sources are characterized by a very rich, non-stationary spectro-temporal structure, which prohibits the use of such methods. Musical sounds often exhibit highly synchronous evolution over both time and frequency, making overlap in both time and frequency very common. Furthermore, a typical commercial music mixture violates all the classical assumptions of ICA. Instruments are correlated (e.g., a chorus of singers), there are more instruments than channels in the mixture, and there are non-linearities in the mixing process (e.g., dynamic range compression). All this has required the development of music-specific algorithms, exploiting available prior information about source structure or mixing parameters [10], [11].

This article provides an overview of nearly 50 years of research on lead and accompaniment separation in music. Due to space constraints and the large variability of the paradigms involved, we cannot delve into a detailed mathematical description of each method. Instead, we will convey core ideas and methodologies, grouping approaches according to common features. As with any attempt to impose an a posteriori taxonomy on such a large body of research, the resulting classification is arguable. However, we believe it is useful as a roadmap of the relevant literature. Our objective is not to advocate one methodology over another. While the most recent methods — in particular those based on deep learning — currently show the best performance, we believe that ideas underlying earlier methods may also be inspiring and stimulate new research. This point of view leads us to focus more on the strengths of the methods rather than on their weaknesses.

The rest of the article is organized as follows.
In Section II, we present the basic concepts needed to understand the discussion. We then present sections on model-based methods that exploit specific knowledge about the lead and/or the accompaniment signals in music to achieve separation. We show in Section III how one body of research is focused on modeling the lead signal as harmonic, exploiting this central assumption for separation. Then, Section IV describes many methods achieving separation using a model that takes the musical accompaniment as redundant. In Section V, we show how these two ideas were combined in other studies to achieve separation. Then, we present data-driven approaches in Section VI, which exploit large databases of audio examples where both the isolated lead and accompaniment signals are available. This enables the use of machine learning methods to learn how to separate. In Section VII, we show how the widespread availability of stereo signals may be leveraged to design algorithms that assume center-panned vocals, but also to improve the separation of most methods. Finally, Section VIII is concerned with the problem of how to evaluate the quality of the separation, and provides the results for the largest evaluation campaign to date on this topic.

II. FUNDAMENTAL CONCEPTS

We now very briefly describe the basic ideas required to understand this paper, classified into three main categories: signal processing, audio modeling, and probability theory. The interested reader is strongly encouraged to delve into the many online courses or textbooks available for a more detailed presentation of these topics, such as [12], [13] for signal processing, [9] for speech modeling, and [14], [15] for probability theory.

A. Signal processing

Sound is a series of pressure waves in the air. It is recorded as a waveform, a time series of measurements of the displacement of the microphone diaphragm in response to these pressure waves. Sound is reproduced if a loudspeaker diaphragm is moved according to the recorded waveform. Multichannel signals simply consist of several waveforms, captured by more than one microphone. Typically, music signals are stereophonic, containing two waveforms.

Microphone displacement is typically measured at a fixed sampling frequency. In music processing, it is common to have sampling frequencies of 44.1 kHz (the sampling frequency of a compact disc) or 48 kHz, which are higher than the typical sampling rates of 16 kHz or 8 kHz used for speech in telephony. This is because musical signals contain much higher frequency content than speech and the goal is aesthetic beauty in addition to basic intelligibility.

A time-frequency (TF) representation of sound is a matrix that encodes the time-varying spectrum of the waveform. Its entries are called TF bins and encode the spectrum of the waveform for all time frames and frequency channels. The most commonly used TF representation is the short-time Fourier transform (STFT) [16], which has complex entries: the angle accounts for the phase, i.e., the actual shift of the corresponding sinusoid at that TF bin, and the magnitude accounts for the amplitude of that sinusoid in the signal. The magnitude (or power) of the STFT is called the spectrogram. When the mixture is multichannel, the TF representation of each channel is computed, leading to a three-dimensional array: frequency, time, and channel.
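As a concrete illustration of these representations, the following minimal sketch computes the STFT of a stereo recording and its magnitude spectrogram with SciPy. The file name and STFT parameters are arbitrary choices for the example, not those of any particular method, and the ordering of the array axes is a matter of convention.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

# Load a stereo mixture (hypothetical file name); x has shape (n_samples, 2).
sr, x = wavfile.read("mixture.wav")
x = x.astype(np.float64) / np.abs(x).max()   # normalize to [-1, 1]
x = x.T                                      # (2, n_samples): channel first

# STFT of each channel: 2048-sample frames (about 46 ms at 44.1 kHz), 75% overlap.
f, t, X = stft(x, fs=sr, nperseg=2048, noverlap=1536)
# X has shape (2, n_frequencies, n_frames): channel x frequency x time.

V = np.abs(X)      # magnitude spectrogram
P = np.angle(X)    # phase, needed later to go back to the time domain
print(V.shape, f[-1], t[-1])   # array shape, Nyquist frequency, duration in seconds
```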
A TF representation is typically used as a first step in processing the audio because sources tend to overlap less in the TF representation than in the waveform [17]. This makes it easier to select portions of a mixture that correspond to only a single source. An STFT is typically used because it can be inverted back to the original waveform. Therefore, modifications made to the STFT can be used to create a modified waveform.

Generally, a linear mixing process is considered, i.e., the mixture signal is equal to the sum of the source signals. Since the Fourier transform is a linear operation, this equality holds for the STFT. While that is not the case for the magnitude (or power) of the STFT, it is commonly assumed that the spectrograms of the sources sum to the spectrogram of the mixture.

In many methods, the separated sources are obtained by filtering the mixture. This can be understood as performing some equalization on the mixture, where each frequency is attenuated or kept intact. Since both the lead and the accompaniment signals change over time, the filter also changes. This is typically done using a TF mask, which, in its simplest form, is defined as a gain between 0 and 1 to apply to each element of the TF representation of the mixture (e.g., an STFT) in order to estimate the desired signal. Loosely speaking, it can be understood as an equalizer whose settings change every few milliseconds. After multiplication of the mixture by a mask, the separated signal is recovered through an inverse TF transform. In the multichannel setting, more sophisticated filters may be designed that incorporate some delay and combine different channels; this is usually called beamforming. In the frequency domain, this often amounts to multiplying the mixture TF representation by complex matrices instead of just scalars between 0 and 1.

In practice, masks can be designed to filter the mixture in several ways. One may estimate the spectrogram for a single source or component, e.g., the accompaniment, and subtract it from the mixture spectrogram, e.g., in order to estimate the lead [18]. Another way would be to estimate separate spectrograms for both lead and accompaniment and combine them to yield a mask. For instance, a TF mask for the lead can be taken as the proportion of the lead spectrogram over the sum of both spectrograms, at each TF bin. Such filters are often called Wiener filters [19] or ratio masks. How they are calculated may involve some additional techniques like exponentiation and may be understood according to assumptions regarding the underlying statistics of the sources. For recent work in this area, and many useful pointers in designing such masks, the reader is referred to [20].
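To make the ratio-mask idea concrete, here is a minimal sketch of such a filter, in a hypothetical helper that assumes a complex mixture STFT X_mix and two magnitude spectrogram estimates V_lead and V_acc are already available (how to obtain them is precisely the topic of the rest of this article); the STFT parameters must match those used in the analysis.

```python
import numpy as np
from scipy.signal import istft

def ratio_mask_separation(X_mix, V_lead, V_acc, sr, eps=1e-10):
    """Apply Wiener-like ratio masks to a mixture STFT X_mix.

    V_lead and V_acc are non-negative magnitude spectrogram estimates of
    the lead and the accompaniment, with the same shape as X_mix.
    """
    # Soft mask for the lead: its share of the energy at each TF bin.
    mask_lead = V_lead / (V_lead + V_acc + eps)
    mask_acc = 1.0 - mask_lead

    # Filter the mixture and go back to the time domain (same STFT settings as the analysis).
    _, lead = istft(mask_lead * X_mix, fs=sr, nperseg=2048, noverlap=1536)
    _, acc = istft(mask_acc * X_mix, fs=sr, nperseg=2048, noverlap=1536)
    return lead, acc
```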
B. Audio and speech modeling

It is typical in audio processing to describe audio waveforms as belonging to one of two different categories: sinusoidal signals — or pure tones — and noise. Actually, both are just the two extremes in a continuum of varying predictability: on the one hand, the shape of a sinusoidal wave in the future can reliably be guessed from previous samples. On the other hand, white noise is defined as an unpredictable signal and its spectrogram has constant energy everywhere. Different noise profiles may then be obtained by attenuating the energy of some frequency regions. This in turn induces some predictability in the signal, and in the extreme case where all the energy content is concentrated in one frequency, a pure tone is obtained.

A waveform may always be modeled as some filter applied to some excitation signal. Usually, the filter is assumed to vary smoothly across frequencies, hence modifying only what is called the spectral envelope of the signal, while the excitation signal comprises the rest. This is the basis for the source-filter model [21], which is of great importance in speech modeling, and thus also in vocal separation. For speech, the filter is created by the shape of the vocal tract. The excitation signal is made of the glottal pulses generated by the vibration of the vocal folds, which results in voiced speech sounds made of time-varying harmonic/sinusoidal components. The excitation signal can also be the air flow passing through some constriction of the vocal tract, which results in unvoiced, noise-like speech sounds. In this context, vowels are said to be voiced and tend to feature many sinusoids, while some phonemes such as fricatives are unvoiced and noisier.

A classical tool for dissociating the envelope from the excitation is the cepstrum [22]. It has applications for estimating the fundamental frequency [23], [24], for deriving the Mel-frequency cepstral coefficients (MFCC) [25], or for filtering signals through a so-called liftering operation [26] that enables modifications of either the excitation or the envelope parts through the source-filter paradigm.

An advantage of the source-filter model is indeed that one can dissociate the pitched content of the signal, embodied by the position of its harmonics, from its TF envelope, which describes where the energy of the sound lies. In the case of vocals, it yields the ability to distinguish between the actual note being sung (pitch content) and the phoneme being uttered (mouth and vocal tract configuration), respectively. One key feature of vocals is that they typically exhibit great variability in fundamental frequency over time. They can also exhibit larger vibratos (fundamental frequency modulations) and tremolos (amplitude modulations) in comparison to other instruments, as seen in the top spectrogram in Figure 1.

Fig. 1: Examples of spectrograms from an excerpt of the track “The Wrong'Uns - Rothko” from the MUSDB18 dataset. The two sources to be separated are depicted in (a) Lead/Vocals and (b) Accompaniment, and their mixture in (c) Mixture. The vocals (a) are mostly harmonic and often well described by a source-filter model in which an excitation signal is filtered according to the vocal tract configuration. The accompaniment signal (b) features more diversity, but usually does not feature as much vibrato as the vocals, and most importantly is seen to be denser and also more redundant. All spectrograms have log-compressed amplitudes as well as a log-scaled frequency axis.

A particularity of musical signals is that they typically consist of sequences of pitched notes.
A sound gives the perception of having a pitch if the majority of the energy in the audio signal is at frequencies located at integer multiples of some fundamental frequency. These integer multiples are called harmonics. When the fundamental frequency changes, the frequencies of these harmonics also change, yielding the typical comb spectrograms of harmonic signals, as depicted in the top spectrogram in Figure 1. Another noteworthy feature of sung melodies over simple speech is that their fundamental frequencies are, in general, located at precise frequency values corresponding to the musical key of the song. These very peculiar features are often exploited in separation methods. For simplicity reasons, we use the terms pitch and fundamental frequency interchangeably throughout the paper.

C. Probability theory

Probability theory [14], [27] is an important framework for designing many data analysis and processing methods. Many of the methods described in this article use it, and it is far beyond the scope of this paper to present it rigorously. For our purpose, it will suffice to say that the observations consist of the mixture signals. On the other hand, the parameters are any relevant features of the source signals (such as pitch or time-varying envelope) or of how the signals are mixed (e.g., the panning position). These parameters can be used to derive estimates of the target lead and accompaniment signals.

We understand a probabilistic model as a function of both the observations and the parameters: it describes how likely the observations are, given the parameters. For instance, a flat spectrum is likely under the noise model, and a mixture of comb spectrograms is likely under a harmonic model with the appropriate pitch parameters for the sources. When the observations are given, variation in the model depends only on the parameters. For some parameter value, it tells how likely the observations are. Under a harmonic model for instance, pitch may be estimated by finding the pitch parameter that makes the observed waveform as likely as possible. Alternatively, we may want to choose between several possible models, such as voiced or unvoiced. In such cases, model selection methods are available, such as the Bayesian information criterion (BIC) [28].

Given these basic ideas, we briefly mention two models that are of particular importance. Firstly, the hidden Markov model (HMM) [15], [29] is relevant for time-varying observations. It basically defines several states, each one related to a specific model and with some probabilities for transitions between them. For instance, we could define as many states as possible notes played by the lead guitar, each one associated with a typical spectrum. The Viterbi algorithm is a dynamic programming method which estimates the most likely sequence of states given a sequence of observations [30]. Secondly, the Gaussian mixture model (GMM) [31] is a way to approximate any distribution as a weighted sum of Gaussians. It is widely used in clustering, because it works well with the celebrated Expectation-Maximization (EM) algorithm [32] to assign one particular cluster to each data point, while automatically estimating the clusters' parameters. As we will see later, many methods work by assigning each TF bin to a given source in a similar way.
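Because the Viterbi algorithm reappears repeatedly in the methods reviewed below (for melody and pitch tracking in particular), a minimal sketch may be helpful. The toy transition and emission probabilities here are arbitrary illustrations, not taken from any cited system.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence of an HMM.

    log_init:  (S,)   log prior over states
    log_trans: (S, S) log transition probabilities, rows index the source state
    log_obs:   (T, S) log likelihood of each observation under each state
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]           # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)       # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # (from, to)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    # Backtrack the best path.
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Toy example: two states ("voiced", "unvoiced") over five frames.
path = viterbi(np.log([0.5, 0.5]),
               np.log([[0.9, 0.1], [0.2, 0.8]]),
               np.log([[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]))
print(path)
```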
III. MODELING THE LEAD SIGNAL: HARMONICITY

As mentioned in Section II-B, one particularity of vocals is their production by the vibration of the vocal folds, further filtered by the vocal tract. As a consequence, sung melodies are mostly harmonic, as depicted in Figure 1, and therefore have a fundamental frequency. If one can track the pitch of the vocals, one can then estimate the energy at the harmonics of the fundamental frequency and reconstruct the voice. This is the basis of the oldest methods we are aware of (as well as some more recent methods) for separating the lead signal from a musical mixture. Such methods are summarized in Figure 2.

In a first step, the objective is to get estimates of the time-varying fundamental frequency of the lead at each time frame. A second step in this respect is then to track this fundamental frequency over time, in other words, to find the best sequence of estimates, in order to identify the melody line. This can be done either by a suitable pitch detection method, or by exploiting the availability of the score. Such algorithms typically assume that the lead corresponds to the harmonic signal with the strongest amplitude. For a review on the particular topic of melody extraction, the reader is referred to [33]. From this starting point, we can distinguish between two kinds of approaches, depending on how they exploit the pitch information.

Fig. 2: The approaches based on a harmonic assumption for vocals. In a first analysis step, the fundamental frequency of the lead signal is extracted (possibly informed by a score or MIDI data). From it, a separation is obtained either by re-synthesis of the lead with a sinusoidal model (Section III-A), or by comb-filtering the mixture (Section III-B).

A. Analysis-synthesis approaches

The first option to obtain the separated lead signal is to resynthesize it using a sinusoidal model. A sinusoidal model decomposes the sound into a set of sine waves of varying frequency and amplitude. If one knows the fundamental frequency of a pitched sound (like a singing voice), as well as the spectral envelope of the recording, then one can reconstruct the sound by making a set of sine waves whose frequencies are those of the harmonics of the fundamental frequency, and whose amplitudes are estimated from the spectral envelope of the audio. While the spectral envelope of the recording is generally not exactly the same as the spectral envelope of the target source, it can be a reasonable approximation, especially assuming that different sources do not overlap too much with each other in the TF representation of the mixture. This idea allows for time-domain processing and was used in the earliest methods we are aware of.
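As a rough illustration of such analysis-synthesis separation, the sketch below resynthesizes a harmonic lead from a given per-frame fundamental-frequency trajectory, sampling the harmonic amplitudes directly from the mixture spectrogram. It is a deliberate simplification under strong assumptions (known f0, purely voiced lead), not the implementation of any specific cited method.

```python
import numpy as np

def resynthesize_lead(V_mix, f0, sr, hop, n_harmonics=20):
    """Sinusoidal resynthesis of a harmonic lead.

    V_mix: (n_bins, n_frames) magnitude spectrogram of the mixture
    f0:    (n_frames,) fundamental frequency in Hz (0 where the lead is silent)
    hop:   hop size in samples between consecutive STFT frames
    """
    n_bins, n_frames = V_mix.shape
    n_fft = 2 * (n_bins - 1)
    out = np.zeros(n_frames * hop)

    for h in range(1, n_harmonics + 1):
        freq = h * f0                                       # (n_frames,) harmonic frequency
        bins = np.clip(np.round(freq * n_fft / sr).astype(int), 0, n_bins - 1)
        amp = V_mix[bins, np.arange(n_frames)]              # amplitude read off the mixture
        amp = amp * (f0 > 0) * (freq < sr / 2)              # silence unvoiced frames, drop aliased harmonics
        # Hold frame-rate values for hop samples each, then integrate frequency to get phase.
        freq_s = np.repeat(freq, hop)
        amp_s = np.repeat(amp, hop)
        phase = 2 * np.pi * np.cumsum(freq_s) / sr
        out += amp_s * np.sin(phase)

    return out / max(np.abs(out).max(), 1e-9)
```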
In 1973, Miller proposed in [34] to use the homomorphic vocoder [35] to separate the excitation function and the impulse response of the vocal tract. Further refinements included segmenting parts of the signal as voiced, unvoiced, or silence using a heuristic program and manual interaction. Finally, cepstral liftering [26] was exploited to compensate for the noise or accompaniment.

Similarly, Maher used an analysis-synthesis approach in [36], assuming the mixtures are composed of only two harmonic sources. In his case, pitch detection was performed on the STFT and included heuristics to account for possibly colliding harmonics. He finally resynthesized each musical voice with a sinusoidal model.

Wang proposed instantaneous and frequency-warped techniques for signal parameterization and source separation, with application to voice separation in music [37], [38]. He introduced a frequency-locked loop algorithm which uses multiple harmonically constrained trackers. He computed the estimated fundamental frequency from a maximum-likelihood weighting of the tracking estimates. He was then able to estimate harmonic signals such as voices from complex mixtures.

Meron and Hirose proposed to separate singing voice and piano accompaniment [39]. In their case, prior knowledge consisting of musical scores was considered. Sinusoidal modeling as described in [40] was used.

Ben-Shalom and Dubnov proposed to filter an instrument or a singing voice out in such a way [41]. They first used a score alignment algorithm [42], assuming a known score. Then, they used the estimated pitch information to design a filter based on a harmonic model [43] and performed the filtering using the linearly constrained minimum variance approach [44]. They additionally used a heuristic to deal with the unvoiced parts of the singing voice.

Zhang and Zhang proposed an approach based on harmonic structure modeling [45], [46]. They first extracted harmonic structures for the singing voice and the background music signals using a sinusoidal model [43], by extending the pitch estimation algorithm in [47]. Then, they used the clustering algorithm in [48] to learn harmonic structure models for the background music signals. Finally, they extracted the harmonic structures for all the instruments to reconstruct the background music signals and subtract them from the mixture, leaving only the singing voice signal.

More recently, Fujihara et al. proposed an accompaniment reduction method for singer identification [49], [50]. After fundamental frequency estimation using [51], they extracted the harmonic structure of the melody, i.e., the power and phase of the sinusoidal components at the fundamental frequency and its harmonics. Finally, they resynthesized the audio signal of the melody using the sinusoidal model in [52].

Similarly, Mesaros et al. proposed a vocal separation method to help with singer identification [53]. They first applied a melody transcription system [54] which estimates the melody line with the corresponding MIDI note numbers. Then, they performed sinusoidal resynthesis, estimating amplitudes and phases from the polyphonic signal.

In a similar manner, Duan et al. proposed to separate harmonic sources, including singing voices, by using harmonic structure models [55]. They first defined an average harmonic structure model for an instrument. Then, they learned a model for each source by detecting the spectral peaks using a cross-correlation method [56] and quadratic interpolation [57]. Then, they extracted the harmonic structures using BIC and a clustering algorithm [48]. Finally, they separated the sources by re-estimating the fundamental frequencies, re-extracting the harmonics, and reconstructing the signals using a phase generation method [58].

Lagrange et al. proposed to formulate lead separation as a graph partitioning problem [59], [60].
They first identified peaks in the spectrogram and grouped the peaks into clusters by using a similarity measure which accounts for harmonically related peaks, and the normalized cut criterion [61], which is used for segmenting graphs in computer vision. They finally selected the cluster of peaks corresponding to the predominant harmonic source and resynthesized it using a bank of sinusoidal oscillators.

Ryynänen et al. proposed to separate accompaniment from polyphonic music using melody transcription for karaoke applications [62]. They first transcribed the melody into a MIDI note sequence and a fundamental frequency trajectory, using the method in [63], an improved version of the earlier method [54]. Then, they used sinusoidal modeling to estimate, resynthesize, and remove the lead vocals from the musical mixture, using the quadratic polynomial-phase model in [64].

B. Comb-filtering approaches

Using sinusoidal synthesis to generate the lead signal often suffers from a metallic sound quality, which is mostly due to discrepancies between the estimated excitation signal of the lead and the ground truth. To address this issue, an alternative approach is to exploit harmonicity in another way, by filtering out everything from the mixture that is not located close to the detected harmonics.

Li and Wang proposed to use a vocal/non-vocal classifier and a predominant pitch detection algorithm [65], [66]. They first detected the singing voice by using a spectral change detector [67] to partition the mixture into homogeneous portions, and GMMs on MFCCs to classify the portions as vocal or non-vocal. Then, they used the predominant pitch detection algorithm in [68] to detect the pitch contours from the vocal portions, extending the multi-pitch tracking algorithm in [69]. Finally, they extracted the singing voice by decomposing the vocal portions into TF units and labeling them as singing or accompaniment dominant, extending the speech separation algorithm in [70].

Han and Raphael proposed an approach for de-soloing a recording of a soloist with an accompaniment, given a musical score and its time alignment with the recording [71]. They derived a mask [72] to remove the solo part after using an EM algorithm to estimate its melody, exploiting the score as side information.

Hsu et al. proposed an approach which also identifies and separates the unvoiced singing voice [73], [74]. Instead of processing in the STFT domain, they used the perceptually motivated gammatone filter bank as in [66], [70]. They first detected accompaniment, unvoiced, and voiced segments using an HMM and identified voice-dominant TF units in the voiced frames by using the singing voice separation method in [66], with the predominant pitch detection algorithm in [75]. Unvoiced-dominant TF units were identified using a GMM classifier with MFCC features learned from training data. Finally, filtering was achieved with spectral subtraction [76].

Raphael and Han then proposed a classifier-based approach to separate a soloist from accompanying instruments using a time-aligned symbolic musical score [77]. They built a tree-structured classifier [78] learned from labeled training data to classify TF points in the STFT as belonging to the solo or the accompaniment. They additionally constrained their classifier to estimate masks having a connected structure.
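Common to the comb-filtering methods of this subsection is the construction of a TF mask that keeps only the energy close to the harmonics of the detected pitch. The sketch below is a deliberately simplified illustration of that idea (binary mask, fixed tolerance in Hz, hypothetical variable names), not the filter design of any particular reference.

```python
import numpy as np

def harmonic_mask(n_bins, f0, sr, n_fft, n_harmonics=30, half_width_hz=40.0):
    """Binary TF mask keeping bins near the harmonics of a given f0 track.

    f0: (n_frames,) fundamental frequency in Hz (0 for unvoiced or silent frames)
    Returns a (n_bins, n_frames) mask.
    """
    freqs = np.arange(n_bins) * sr / n_fft                 # center frequency of each bin
    mask = np.zeros((n_bins, len(f0)))
    for t, f in enumerate(f0):
        if f <= 0:
            continue                                       # leave unvoiced frames to the accompaniment
        harmonics = f * np.arange(1, n_harmonics + 1)
        harmonics = harmonics[harmonics < sr / 2]
        # Distance of every frequency bin to its closest harmonic.
        dist = np.abs(freqs[:, None] - harmonics[None, :]).min(axis=1)
        mask[:, t] = dist <= half_width_hz
    return mask

# Usage with hypothetical shapes: lead_stft = harmonic_mask(X.shape[0], f0, sr, 2048) * X
```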
Cano et al. proposed various approaches for solo and accompaniment separation. In [79], they separated saxophone melodies from mixtures with piano and/or orchestra by using a melody line detection algorithm, incorporating information about typical saxophone melody lines. In [80]–[82], they proposed to use the pitch detection algorithm in [83]. Then, they refined the fundamental frequency and the harmonics, and created a binary mask for the solo and the accompaniment. They finally used a post-processing stage to refine the separation. In [84], they included a noise spectrum in the harmonic refinement stage to also capture noise-like sounds in vocals. In [85], they additionally included common amplitude modulation characteristics in the separation scheme.

Bosch et al. proposed to separate the lead instrument using a musical score [86]. After a preliminary alignment of the score to the mixture, they estimated a score confidence measure to deal with local misalignments and used it to guide the predominant pitch tracking. Finally, they performed low-latency separation based on the method in [87], by combining harmonic masks derived from the estimated pitch and additionally exploiting stereo information, as presented later in Section VII.

Vaneph et al. proposed a framework for vocal isolation to help spectral editing [88]. They first used a voice activity detection process based on a deep learning technique [89]. Then, they used pitch tracking to detect the melodic line of the vocals and used it to separate the vocals and the background, allowing a user to provide manual annotations when necessary.

C. Shortcomings

As can be seen, explicitly assuming that the lead signal is harmonic has led to an important body of research. While the aforementioned methods show excellent performance when their assumptions are valid, their performance can drop significantly in adverse, but common, situations.

Firstly, vocals are not always purely harmonic, as they contain unvoiced phonemes that are not harmonic. As seen above, some methods already handle this situation. However, vocals can also be whispered or saturated, both of which are difficult to handle with a harmonic model.

Secondly, methods based on the harmonic model depend on the quality of the pitch detection method. If the pitch detector switches from following the pitch of the lead (e.g., the voice) to another instrument, the wrong sound will be isolated from the mix. Often, pitch detectors assume the lead signal is the loudest harmonic sound in the mix. Unfortunately, this is not always the case. Another instrument may be louder, or the lead may be silent for a passage. The tendency to follow the pitch of the wrong instrument can be mitigated by applying constraints on the pitch range to estimate and by using a perceptually relevant weighting filter before performing pitch tracking. Of course, these approaches do not help when the lead signal is silent.

IV. MODELING THE ACCOMPANIMENT: REDUNDANCY

In the previous section, we presented methods whose main focus was the modeling of a harmonic lead melody. Most of these studies did not make modeling the accompaniment a core focus. On the contrary, it was often dealt with as adverse noise to which the harmonic processing method should be robust. In this section, we present another line of research which concentrates on modeling the accompaniment, under the assumption that it is somehow more redundant than the lead signal.
This assumption stems from the fact that musical accompaniments are often highly structured, with elements being repeated many times. Such repetitions can occur at the note level, in terms of rhythmic structure, or even from a harmonic point of view: instrumental notes are often constrained to have their pitch lie in a small set of frequencies. Therefore, modeling and removing the redundant elements of the signal are assumed to result in removal of the accompaniment. In this paper, we identify three families of methods that exploit the redundancy of the accompaniment for separation.

A. Grouping low-rank components

The first set of approaches we consider is the identification of redundancy in the accompaniment through the assumption that its spectrogram may be well represented by only a few components. Techniques exploiting this idea then focus on algebraic methods that decompose the mixture spectrogram into the product of a few template spectra activated over time. One way to do so is via non-negative matrix factorization (NMF) [90], [91], which incorporates non-negativity constraints. In Figure 3, we picture methods exploiting such techniques. After factorization, we obtain several spectra, along with their activations over time. A subsequent step is the clustering of these spectra (and activations) into the lead or the accompaniment. Separation is finally performed by deriving Wiener filters to estimate the lead and the accompaniment from the mixture. For related applications of NMF in music analysis, the reader is referred to [92]–[94].

Fig. 3: The approaches based on a low-rank assumption. Non-negative matrix factorization (NMF) is used to identify components from the mixture, which are subsequently clustered into lead or accompaniment. Additional constraints may be incorporated.

Vembu and Baumann proposed to use NMF (and also ICA [95]) to separate vocals from mixtures [96]. They first discriminated between vocal and non-vocal sections in a mixture by using different combinations of features, such as MFCCs [25], perceptual linear predictive (PLP) coefficients [97], and log frequency power coefficients (LFPC) [98], and by training two classifiers, namely neural networks and support vector machines (SVM). They then applied redundancy reduction techniques on the TF representation of the mixture to separate the sources [99], by using NMF (or ICA). The components were then grouped into vocal and non-vocal by reusing a vocal/non-vocal classifier with MFCC, LFPC, and PLP coefficients.

Chanrungutai and Ratanamahatana proposed to use NMF with automatic component selection [100], [101]. They first decomposed the mixture spectrogram using NMF with a fixed number of basis components. They then removed the components with brief rhythmic and long-lasting continuous events, assuming that they correspond to instrumental sounds. They finally used the remaining components to reconstruct the singing voice, after refining them using a high-pass filter.
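The factorize-then-cluster pipeline of Figure 3 can be sketched as follows, using scikit-learn's NMF on a magnitude spectrogram. The number of components and the set of "vocal" component indices are placeholders that a real system would have to determine automatically; deciding how to assign components to the lead or the accompaniment is exactly what the methods in this subsection differ on.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_separation(V_mix, X_mix, vocal_components, n_components=30, eps=1e-10):
    """NMF-based separation of a mixture spectrogram.

    V_mix: (n_bins, n_frames) magnitude spectrogram, X_mix: complex STFT of the mixture.
    vocal_components: indices of the NMF components assigned to the lead.
    """
    model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = model.fit_transform(V_mix)      # (n_bins, n_components) template spectra
    H = model.components_               # (n_components, n_frames) activations

    idx = np.zeros(n_components, dtype=bool)
    idx[vocal_components] = True
    V_lead = W[:, idx] @ H[idx]         # lead spectrogram model
    V_acc = W[:, ~idx] @ H[~idx]        # accompaniment spectrogram model

    # Wiener-like masks applied to the complex mixture STFT.
    lead_stft = V_lead / (V_lead + V_acc + eps) * X_mix
    acc_stft = X_mix - lead_stft
    return lead_stft, acc_stft
```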
Marxer and Janer proposed an approach based on a Tikhonov regularization [102] as an alternative to NMF for singing voice separation [103]. Their method sacrificed the non-negativity constraints of NMF in exchange for a computationally less expensive solution to the spectrum decomposition, making it more interesting in low-latency scenarios.

Yang et al. proposed a Bayesian NMF approach [104], [105]. Following the approaches in [106] and [107], they used a Poisson distribution for the likelihood function and exponential distributions for the model parameters in the NMF algorithm, and derived a variational Bayesian EM algorithm [32] to solve the NMF problem. They also adaptively determined the number of bases from the mixture. They finally grouped the bases into singing voice and background music by using a k-means clustering algorithm [108] or an NMF-based clustering algorithm.

In a different manner, Smaragdis and Mysore proposed a user-guided approach for removing sounds from mixtures by humming the target sound to be removed, for example a vocal track [109]. They modeled the mixture using probabilistic latent component analysis (PLCA) [110], another equivalent formulation of NMF. One key feature of exploiting user input was to facilitate the grouping of components into vocals and accompaniment, as humming helped to identify some of the parameters for modeling the vocals.

Nakamura and Kameoka proposed an Lp-norm NMF [111], with p controlling the sparsity of the error. They developed an algorithm for solving this NMF problem based on the auxiliary function principle [112], [113]. Setting an adequate number of bases and taking p small enough allowed them to estimate the accompaniment as the low-rank decomposition and the singing voice as the error of the approximation, respectively. Note that, in this case, the singing voice was not explicitly modeled as a sparse component but rather corresponded to the error, which happened to be constrained as sparse. The next subsection deals with approaches that explicitly model the vocals as the sparse component.

B. Low-rank accompaniment, sparse vocals

Fig. 4: The approaches based on a low-rank accompaniment, sparse vocals assumption. As opposed to methods based on NMF, methods based on robust principal component analysis (RPCA) assume the lead signal has a sparse and non-structured spectrogram.

The methods presented in the previous section first compute a decomposition of the mixture into many components that are sorted a posteriori as accompaniment or lead. As can be seen, this means they make a low-rank assumption for the accompaniment, but typically also for the vocals. However, as can for instance be seen in Figure 1, the spectrogram of the vocals exhibits much more freedom than that of the accompaniment, and experience shows it is not adequately described by a small number of spectral bases. For this reason, another track of research, depicted in Figure 4, focused on using a low-rank assumption on the accompaniment only, while assuming the vocals are sparse and not structured. This loose assumption means that only a few coefficients of their spectrogram should have significant magnitude, and that they should not feature significant redundancy. These ideas are in line with robust principal component analysis (RPCA) [114], which is the mathematical tool used by this body of methods, initiated by Huang et al. for singing voice separation [115]. It decomposes a matrix into the sum of a sparse component and a low-rank component.
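A minimal sketch of this decomposition applied to a magnitude spectrogram is given below. It implements a standard principal component pursuit via an inexact augmented Lagrangian iteration, which is one common way to solve the RPCA problem, not necessarily the exact solver used in the cited works.

```python
import numpy as np

def rpca(M, lam=None, tol=1e-6, max_iter=200):
    """Decompose M into L (low rank) + S (sparse) by principal component pursuit."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm_M = np.linalg.norm(M, "fro")
    mu = 1.25 / np.linalg.norm(M, 2)           # spectral norm of M
    Y = np.zeros_like(M)                        # Lagrange multipliers
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0)) @ Vt
        # Sparse update: element-wise soft thresholding.
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)
        # Dual update.
        Z = M - L - S
        Y = Y + mu * Z
        mu = mu * 1.5
        if np.linalg.norm(Z, "fro") / norm_M < tol:
            break
    return L, S

# Applied to a mixture spectrogram V (bins x frames), L is taken as the accompaniment
# model and S as the vocal model, from which soft masks can be derived as in Section II-A.
```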
Sprechmann et al. proposed an approach based on RPCA for online singing voice separation [116]. They used ideas from convex optimization [117], [118] and multi-layer neural networks [119]. They presented two extensions of RPCA and robust NMF models [120]. They then used these extensions in a multi-layer neural network framework which, after an initial training stage, allows online source separation.

Jeong and Lee proposed two extensions of the RPCA model to improve the estimation of vocals and accompaniment from the sparse and low-rank components [121]. Their first extension included the Schatten-p and ℓp norms as generalized nuclear norm optimizations [122]. They also suggested a pre-processing stage based on logarithmic scaling of the mixture TF representation to enhance the RPCA.

Yang also proposed an approach based on RPCA with dictionary learning for recovering low-rank components [123]. He introduced a multiple low-rank representation following the observation that elements of the singing voice can also be recovered by the low-rank component. He first incorporated online dictionary learning methods [124] in his methodology to obtain prior information about the structure of the sources and then incorporated them into the RPCA model.

Chan and Yang then extended RPCA to the complex and quaternionic cases, with application to singing voice separation [125]. They extended the principal component pursuit (PCP) [114] for solving the RPCA problem by presenting complex and quaternionic proximity operators for the ℓ1 and trace-norm regularizations, to account for the missing phase information.

C. Repetitions within the accompaniment

While the rationale behind low-rank methods for lead-accompaniment separation is to exploit the idea that the musical background should be redundant, adopting a low-rank model is not the only way to do it. An alternative way to proceed is to exploit the musical structure of songs, to find repetitions that can be utilized to perform separation. Just like in RPCA-based methods, the accompaniment is then assumed to be the only source for which repetitions will be found. The unique feature of the methods described here is that they combine music structure analysis [126]–[128] with particular ways to exploit the identification of repeated parts of the accompaniment.

Rafii et al. proposed the REpeating Pattern Extraction Technique (REPET) to separate the accompaniment by assuming it is repeating [129]–[131], which is often the case in popular music. This approach, which is representative of this line of research, is depicted in Figure 5. First, a repeating period is extracted by a music information retrieval system, such as a beat spectrum [132] in this case. Then, this extracted information is used to estimate the spectrogram of the accompaniment through an averaging of the identified repetitions. From this, a filter is derived.

Fig. 5: The approaches based on a repetition assumption for the accompaniment. In a first analysis step, repetitions are identified. Then, they are used to build an estimate of the accompaniment spectrogram and proceed to separation.
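A condensed sketch of this repetition-averaging idea is shown below, assuming the repeating period (in frames) has already been estimated, e.g., from a beat spectrum. Following the REPET family of methods, the repeating model is taken as an element-wise median over repetitions, bounded by the mixture, and used to derive a soft mask for the accompaniment; details differ across the variants cited below.

```python
import numpy as np

def repetition_mask(V_mix, period, eps=1e-10):
    """Soft accompaniment mask from a repeating-period assumption.

    V_mix:  (n_bins, n_frames) magnitude spectrogram of the mixture
    period: repeating period in frames
    """
    n_bins, n_frames = V_mix.shape
    n_seg = int(np.ceil(n_frames / period))

    # Pad so the spectrogram splits into an integer number of segments.
    pad = n_seg * period - n_frames
    V = np.pad(V_mix, ((0, 0), (0, pad)), mode="edge")
    segments = V.reshape(n_bins, n_seg, period)

    # Element-wise median over repetitions = repeating (accompaniment) model.
    model = np.median(segments, axis=1)                 # (n_bins, period)
    model = np.tile(model, n_seg)[:, :n_frames]

    # The repeating model cannot exceed the mixture energy.
    model = np.minimum(model, V_mix)
    return model / (V_mix + eps)                        # soft mask for the accompaniment
```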
Seetharaman et al. [133] leveraged the two-dimensional Fourier transform (2DFT) of the spectrogram to create an algorithm very similar to REPET. The properties of the 2DFT let them separate the periodic background from the non-periodic vocal melody by deleting peaks in the 2DFT. This eliminated the need to create an explicit model of the periodic audio and to find the period of repetition, both of which are required in REPET.

Liutkus et al. adapted the REPET approach in [129], [130] to handle repeating structures varying over time by modeling the repeating patterns only locally [131], [134]. They first identified a repeating period for every time frame by computing a beat spectrogram as in [132]. Then they estimated the spectrogram of the accompaniment by averaging the time frames in the mixture spectrogram at their local period rate, for every TF bin. From this, they finally extracted the repeating structure by deriving a TF mask.

Rafii et al. further extended the REPET approaches in [129], [130] and [134] to handle repeating structures that are not periodic. To do this, they proposed the REPET-SIM method in [131], [135] to identify repeating frames for every time frame by computing a self-similarity matrix, as in [136]. Then, they estimated the accompaniment spectrogram at every TF bin by averaging the neighbors identified thanks to that similarity matrix. An extension for real-time processing was presented in [137] and a version exploiting user interaction was proposed in [138]. A method close to REPET-SIM was also proposed by FitzGerald in [139].

Liutkus et al. proposed Kernel Additive Modeling (KAM) [140], [141] as a framework which generalizes the REPET approaches in [129]–[131], [134], [135]. They assumed that a source at a TF location can be modeled using its values at other locations through a specified kernel which can account for features such as periodicity, self-similarity, stability over time or frequency, etc. This notably enabled modeling of the accompaniment using more than one repeating pattern. Liutkus et al. also proposed a light version using a fast compression algorithm to make the approach more scalable [142]. The approach was also used for interference reduction in music recordings [143], [144].

With the same idea of exploiting intra-song redundancies for singing voice separation, but through a very different methodology, Moussallam et al. assumed in [145] that all the sources can be decomposed sparsely in the same dictionary and used a matching pursuit greedy algorithm [146] to solve the problem. They integrated the separation process in the algorithm by modifying the atom selection criterion and adding a decision to assign a chosen atom to the repeated source or to the lead signal.

Deif et al. proposed to use multiple median filters to separate vocals from music recordings [147]. They augmented the approach in [148] with diagonal median filters to improve the separation of the vocal component. They also investigated different filter lengths to further improve the separation.

Lee et al. also proposed to use the KAM approach [149]–[152]. They applied the β-order minimum mean square error (MMSE) estimation [153] to the back-fitting algorithm in KAM to improve the separation.
They adaptively calculated a perceptual weighting factor α and a singular value decomposition (SVD)-based factorized spectral amplitude exponent β for each kernel component.

D. Shortcomings

While methods focusing on harmonic models for the lead often fall short in their expressive power for the accompaniment, the methods we reviewed in this section are often observed to suffer from exactly the converse weakness, namely that they do not provide an adequate model for the lead signal. Hence, the separated vocals will often feature interference from unpredictable parts of the accompaniment, such as percussion or effects which occur infrequently.

Furthermore, even if the musical accompaniment exhibits more redundancy, the vocal part will also be redundant to some extent, which is poorly handled by these methods. When the lead signal is not vocals but is played by some lead instrument, its redundancy is even more pronounced, because the notes it plays lie in a reduced set of fundamental frequencies. Consequently, such methods would include the redundant parts of the lead within the accompaniment estimate, for example, a steady humming by a vocalist.

V. JOINT MODELS FOR LEAD AND ACCOMPANIMENT

In the previous sections, we reviewed two important bodies of literature, focused on modeling either the lead or the accompaniment parts of music recordings, respectively. While each approach showed its own advantages, it also featured its own drawbacks. For this reason, some researchers devised methods combining ideas for modeling both the lead and the accompaniment sources, thus benefiting from both approaches. We now review this line of research.

A. Using music structure analysis to drive learning

The first idea we find in the literature is to augment methods for accompaniment modeling with the prior identification of sections where the vocals are present or absent. In the case of the low-rank models discussed in Sections IV-A and IV-B, such a strategy indeed dramatically improves performance.

Raj et al. proposed an approach in [154] that is based on the PLCA formulation of NMF [155] and extends their prior work [156]. The parameters for the frequency distribution of the background music are estimated from the background-music-only segments, and the rest of the parameters from the singing-voice-plus-background-music segments, assuming a priori identified vocal regions.

Han and Chen also proposed a similar approach for melody extraction based on PLCA [157], which includes a further estimate of the melody from the vocals signal by an autocorrelation technique similar to [158].

Gómez et al. proposed to separate the singing voice from the guitar accompaniment in flamenco music to help with melody transcription [159]. They first manually segmented the mixture into vocal and non-vocal regions. They then learned percussive and harmonic bases from the non-vocal regions by using an unsupervised NMF percussive/harmonic separation approach [93], [160]. The vocal spectrogram was estimated by keeping the learned percussive and harmonic bases fixed.

Papadopoulos and Ellis proposed a signal-adaptive formulation of RPCA which incorporates music content information to guide the recovery of the sparse and low-rank components [161]. Prior musical knowledge, such as the predominant melody, is used to regularize the selection of active coefficients during the optimization procedure.

In a similar manner, Chan et al.
proposed to use RPCA with vocal activity information [162]. They modified the RPCA algorithm to constrain parts of the input spectrogram to be non-sparse, to account for the non-vocal parts of the singing voice.

A related method was proposed by Jeong and Lee in [163], using RPCA with a weighted ℓ1-norm. They replaced the uniform weighting between the low-rank and sparse components in the RPCA algorithm by an adaptive weighting based on the variance ratio between the singing voice and the accompaniment. One key element of the method is to incorporate vocal activation information in the weighting.

B. Factorization with a known melody

While using only the knowledge of vocal activity as described above already yields an increase of performance over methods operating blindly, many authors went further and also incorporated the fact that vocals often have a strong melody line. Some redundant model is then assumed for the accompaniment, while also enforcing a harmonic model for the vocals.

An early method to achieve this is depicted in Figure 6 and was proposed by Virtanen et al. in [164]. They estimated the pitch of the vocals in the mixture by using a melody transcription algorithm [63] and derived a binary TF mask to identify where vocals are not present. They then applied NMF on the remaining non-vocal segments to learn a model for the background.

Fig. 6: Factorization informed with the melody. First, melody extraction is performed on the mixture. Then, this information is used to drive the estimation of the accompaniment: TF bins pertaining to the lead should not be taken into account for estimating the accompaniment model.

Wang and Ou also proposed an approach which combines melody extraction and NMF-based soft masking [165]. They identified accompaniment, unvoiced, and voiced segments in the mixture using an HMM with MFCCs and GMMs. They then estimated the pitch of the vocals from the voiced segments using the method in [166] and an HMM with the Viterbi algorithm as in [167]. They finally applied a soft mask to separate the voice and the accompaniment.

Rafii et al. investigated the combination of an approach for modeling the background and an approach for modeling the melody [168]. They modeled the background by deriving a rhythmic mask using the REPET-SIM algorithm [135] and the melody by deriving a harmonic mask using a pitch-based algorithm [169]. They proposed a parallel and a sequential combination of those algorithms.

Venkataramani et al. proposed an approach combining sinusoidal modeling and matrix decomposition, which incorporates prior knowledge about singer and phoneme identity [170]. They applied a predominant pitch algorithm on annotated sung regions [171] and performed harmonic sinusoidal modeling [172]. Then, they estimated the spectral envelope of the vocal component from the spectral envelope of the mixture using a phoneme dictionary. After that, a spectral envelope dictionary representing sung vowels from song segments of a given singer was learned using an extension of NMF [173], [174]. They finally estimated a soft mask using the singer-vowel dictionary to refine and extract the vocal component.

Ikemiya et al. proposed to combine RPCA with pitch estimation [175], [176].
They derived a mask using RPCA [115] to separate the mixture spectrogram into singing voice and accompaniment components. They then estimated the fundamental frequency contour from the singing voice component based on [177] and derived a harmonic mask. They integrated the two masks and resynthesized the singing voice and accompaniment signals. Dobashi et al. then proposed to use that singing voice separation approach in a music performance assistance system [178].

Hu and Liu proposed to combine approaches based on matrix decomposition and pitch information for singer identification [179]. They used non-negative matrix partial co-factorization [173], [180], which integrates prior knowledge about the singing voice and the accompaniment, to separate the mixture into singing voice and accompaniment portions. They then identified the singing pitch from the singing voice portions using [181] and derived a harmonic mask as in [182], and finally reconstructed the singing voice using a missing feature method [183]. They also proposed to add temporal and sparsity criteria to their algorithm [184].

That methodology was also adopted by Zhang et al. in [185], who followed the framework of the pitch-based approach in [66], performing singing voice detection using an HMM classifier, singing pitch detection using the algorithm in [186], and singing voice separation using a binary mask. Additionally, they augmented that approach by analyzing the latent components of the TF matrix using NMF in order to refine the singing voice and accompaniment. Zhu et al. [187] proposed an approach which is also representative of this body of literature, with the pitch detection algorithm being the one in [181] and binary TF masks used for separation after NMF.

C. Joint factorization and melody estimation

The methods presented above put together the ideas of modeling the lead (typically the vocals) as featuring a melodic harmonic line and the accompaniment as redundant. As such, they already exhibit significant improvement over approaches applying only one of these ideas, as presented in Sections III and IV, respectively. However, these methods are still restricted in the sense that the analysis performed on each side cannot help improve the other one. In other words, the estimation of the models for the lead and the accompaniment is done sequentially. Another idea is to proceed jointly.

Fig. 7: Joint estimation of the lead and the accompaniment, the former as a source-filter model (melody line, harmonic templates, and envelopes) and the latter as an NMF model.

A seminal work in this respect was done by Durrieu et al. using a source-filter and NMF model [188]–[190], depicted in Figure 7. Its core idea is to decompose the mixture spectrogram as the sum of two terms. The first term accounts for the lead and is inspired by the source-filter model described in Section II: it is the element-wise product of an excitation spectrogram with a filter spectrogram. The former can be understood as harmonic combs activated by the melodic line, while the latter modulates the envelope and is assumed low-rank because few phonemes are used. The second term accounts for the accompaniment and is modeled with a standard NMF.
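In symbols, and with notation chosen here purely for illustration (the cited papers each use their own), this decomposition of the mixture spectrogram V can be sketched as

    V ≈ (W_F H_F) ⊙ (W_E H_E) + W_A H_A,

where ⊙ denotes the element-wise product, W_E contains fixed harmonic comb spectra (one per candidate pitch) and H_E their activations (the melody line), W_F H_F is a low-rank filter (spectral envelope) part, and W_A H_A is the standard NMF model of the accompaniment; all factors are non-negative and estimated jointly from the mixture.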
In [188]–[190], they modeled the lead by using a GMM-based model [191] and a glottal source model [192], and the accompaniment by using an instantaneous mixture model [193] leading to an NMF problem [94]. They jointly estimated the parameters of their models by maximum likelihood, using an iterative algorithm inspired by [194] with the multiplicative update rules developed in [91]. They also extracted the melody by using an algorithm comparable to the Viterbi algorithm, before re-estimating the parameters and finally performing source separation using Wiener filters [195]. In [196], they proposed to adapt their model for user-guided source separation.

The joint modeling of the lead and accompaniment parts of a music signal was also considered by Fuentes et al. in [197], who introduced the idea of using a log-frequency TF representation called the constant-Q transform (CQT) [198]–[200]. The advantage of such a representation is that a change in pitch corresponds to a simple translation in the TF plane, instead of a scaling as in the STFT. This idea was used along with a user interface to guide the decomposition, in line with what was done in [196].

Joder and Schuller [202] used the source-filter NMF model of [201], additionally exploiting MIDI scores. They synchronized the MIDI scores to the audio using the alignment algorithm in [203]. They proposed to exploit the score information through two types of constraints applied in the model. In a first approach, they only made use of the information regarding whether the leading voice is present or not in each frame. In a second approach, they took advantage of both the time and pitch information of the aligned score.

Zhao et al. proposed a score-informed leading voice separation system with a weighting scheme [204]. They extended the system in [202], which is based on the source-filter NMF model in [201], by using a Laplacian or Gaussian-based mask on the NMF activation matrix to enhance the likelihood of the score-informed pitch candidates.

Jointly estimating the accompaniment and the lead allowed for some research on correctly estimating the unvoiced parts of the lead, which is the main issue with purely harmonic models, as highlighted in Section III-C. In [201], [205], Durrieu et al. extended their model to account for the unvoiced parts by adding white noise components to the voice model.

In the same direction, Janer and Marxer proposed to separate unvoiced fricative consonants using a semi-supervised NMF [206]. They extended the source-filter NMF model in [201] with a low-latency method using timbre classification to estimate the predominant pitch [87]. They approximated the fricative consonants as an additive wideband component, training a model of NMF bases. They also used the transient quality to differentiate between fricatives and drums, after extracting transient time points using the method in [207].

Similarly, Marxer and Janer then proposed to separately model the breathiness of the singing voice [208]. They estimated the breathiness component by approximating the voice spectrum as a filtered composition of a glottal excitation and a wideband component. They modeled the magnitude of the voice spectrum using the model in [209] and the envelope of the voice excitation using the model in [192]. They estimated the pitch using the method in [87]. This was all integrated into the source-filter NMF model.
The body of research initiated by Durrieu et al. in [188] relies on algebraic models that are more sophisticated than a single matrix product and are instead inspired by musicological knowledge. Ozerov et al. formalized this idea through a general framework and showed its application to singing voice separation [210]–[212]. Finally, Hennequin and Rigaud augmented their model to account for long-term reverberation, with application to singing voice separation [213]. They extended the model in [214], which allows extraction of the reverberation of a specific source along with its dry signal, and combined it with the source-filter NMF model in [189].

D. Different constraints for different sources

Algebraic methods that decompose the mixture spectrogram as the sum of the lead and accompaniment spectrograms are based on the minimization of a cost or loss function which measures the error between the approximation and the observation. While the methods presented above did propose more sophisticated models with parameters explicitly pertaining to the lead or the accompaniment, another popular option in the dedicated literature is to modify the cost function of an existing optimization algorithm (e.g., RPCA), so that each of the resulting components preferentially accounts for one source or another.

This approach can be exemplified by the harmonic-percussive source separation (HPSS) method, presented in [160], [215], [216]. It consists in filtering a mixture spectrogram so that its horizontal lines go into a so-called harmonic source, while its vertical lines go into a percussive source. Separation is then done with TF masking. Of course, such a method is not adequate for lead and accompaniment separation per se, because all the harmonic content of the accompaniment is classified as harmonic. However, it shows that nonparametric approaches are also an option, provided the cost function itself is well chosen for each source.

This idea was followed by Yang in [217], who proposed an approach based on RPCA with the incorporation of harmonicity priors and a back-end drum removal procedure to improve the decomposition. He added a regularization term in the algorithm to account for harmonic sounds in the low-rank component and used an NMF-based model trained for drum separation [211] to eliminate percussive sounds in the sparse component.

Jeong and Lee proposed to separate a vocal signal from a music signal [218], extending the HPSS approach in [160], [215]. Assuming that the spectrogram of the signal can be represented as the sum of harmonic, percussive, and vocal components, they derived an objective function which enforces the temporal and spectral continuity of the harmonic and percussive components, respectively, similarly to [160], but also the sparsity of the vocal component. Assuming non-negativity of the components, they then derived iterative update rules to minimize the objective function. Ochiai et al. extended this work in [219], notably by imposing harmonic constraints for the lead.

Watanabe et al. extended RPCA for singing voice separation [220]. They added a harmonicity constraint in the objective function to account for harmonic structures, such as those in vocal signals, and regularization terms to enforce the non-negativity of the solution.
They used the generalized forward-backward splitting algorithm [221] to solve the optimization problem. They also applied post-processing to remove the low frequencies in the vocal spectrogram and built a TF mask to remove time frames with low energy.

Going beyond smoothness and harmonicity, Hayashi et al. proposed an NMF with a constraint that helps separate periodic components, such as a repeating accompaniment [222]. They defined a periodicity constraint which they incorporated into the objective function of the NMF algorithm to enforce the periodicity of the bases.

E. Cascaded and iterated methods

In their effort to propose separation methods for the lead and accompaniment in music, some authors discovered that very different methods often have complementary strengths, which motivated their combination. In practice, there are several ways to follow this line of research.

One potential route to achieve better separation is to cascade several methods. This is what FitzGerald and Gainza proposed in [216] with multiple median filters [148]. They used a median-filter-based HPSS approach at different frequency resolutions to separate a mixture into harmonic, percussive, and vocal components. They also investigated the use of the STFT or the CQT as the TF representation and proposed a post-processing step to improve the separation with tensor factorization techniques [223] and non-negative partial co-factorization [180].

The two-stage HPSS system proposed by Tachibana et al. in [224] proceeds the same way. It is an extension of the melody extraction approach in [225] and was applied to karaoke in [226]. It consists in using the optimization-based HPSS algorithm from [160], [215], [227], [228] at different frequency resolutions to separate the mixture into harmonic, percussive, and vocal components.

HPSS was not the only separation module considered as the building block of combined lead and accompaniment separation approaches. Deif et al. also proposed a multi-stage NMF-based algorithm [229], based on the approach in [230]. They used a local spectral discontinuity measure to refine the non-pitched components obtained from the factorization of the long-window spectrogram, and a local temporal discontinuity measure to refine the non-percussive components obtained from the factorization of the short-window spectrogram.

Fig. 8: Cascading source separation methods. The results from method A are improved by applying methods B and C on its output, which are specialized in reducing interferences from undesired sources in each signal.

Finally, this cascading concept was considered again by Driedger and Müller in [231], who introduced a processing pipeline for the outputs of different methods [115], [164], [232], [233] to obtain an improved separation quality. Their core idea is depicted in Figure 8 and combines the outputs of different methods in a specific order to improve separation.
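As an illustration of the median-filtering HPSS building block used in several of the cascades above [148], [216], the following sketch computes soft harmonic and percussive masks from a magnitude spectrogram. The kernel length is an arbitrary choice here, and the cascades above additionally run such a decomposition at several STFT resolutions.

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_masks(mag, kernel=17):
    """Soft harmonic/percussive masks from a magnitude spectrogram (freq x time)."""
    harm = median_filter(mag, size=(1, kernel))   # median across time enhances horizontal lines
    perc = median_filter(mag, size=(kernel, 1))   # median across frequency enhances vertical lines
    eps = 1e-12
    mask_h = harm / (harm + perc + eps)           # Wiener-like soft masks
    return mask_h, 1.0 - mask_h

# Toy usage on random data; in practice, mag is the magnitude of the mixture STFT.
mag = np.abs(np.random.default_rng(1).standard_normal((513, 300)))
mask_harmonic, mask_percussive = hpss_masks(mag)
```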
Another approach for improving the quality of separation when several separation procedures are available is not to restrict the number of passes from one method to another, but rather to iterate them many times until satisfactory results are obtained. This is what Hsu et al. proposed in [234], extending the algorithm in [235]. They first estimated the pitch range of the singing voice by using the HPSS method in [160], [225]. They separated the voice given the estimated pitch using a binary mask obtained by training a multilayer perceptron [236], and re-estimated the pitch given the separated voice. Voice separation and pitch estimation are then iterated until convergence.

As another iterative method, Zhu et al. proposed a multi-stage NMF [230], using harmonic and percussive separation at different frequency resolutions similarly to [225] and [216]. The main originality of their contribution was to iterate the refinements instead of applying them only once.

An issue with such iterated methods lies in deciding whether convergence has been reached, and it is not clear whether the quality of the separated signals will necessarily improve. For this reason, Bryan and Mysore proposed a user-guided approach based on PLCA, which can be applied to the separation of the vocals [237]–[239]. They allowed a user to make annotations on the spectrogram of a mixture, incorporated the feedback as constraints in a PLCA model [110], [156], and used a posterior regularization technique [240] to refine the estimates, repeating the process until the user is satisfied with the results. This is similar to the way Ozerov et al. proposed to take user input into account in [241].

Fig. 9: Fusion of separation methods. The output of many separation methods is fed into a fusion system that combines them to produce a single estimate.

A principled way to aggregate the results of many source separation systems into one single estimate that is consistently better than all of them was presented by Jaureguiberry et al. in their fusion framework, depicted in Figure 9. It takes advantage of multiple existing approaches, and its application to singing voice separation was demonstrated in [242]–[244]. They investigated fusion methods based on non-linear optimization, Bayesian model averaging [245], and deep neural networks (DNN).

As another attempt to design an efficient fusion method, McVicar et al. proposed in [246] to combine the outputs of RPCA [115], HPSS [216], Gabor-filtered spectrograms [247], REPET [130], and an approach based on deep learning [248]. To do this, they used different classification techniques to build the aggregated TF mask, such as a logistic regression model or a conditional random field (CRF) trained using the method in [249] with time and/or frequency dependencies.

Manilow et al. trained a neural network to predict the quality of source separation for three source separation algorithms, each leveraging a different cue: repetition, spatialization, and harmonicity/pitch proximity [250]. The method estimates the separation quality of the lead vocals for each algorithm, using only the original audio mixture and the separated source output. These estimates were used to guide switching between algorithms over time.

F. Source-dependent representations

In the previous section, we noted that some authors considered iterating separation at different frequency resolutions, i.e., using different TF representations [216], [224], [229]. This can be seen as a combination of different methods. However, it can also be seen from another perspective as based on picking specific representations.
Wolf et al. proposed an approach using rigid motion segmentation, with application to singing voice separation [251], [252]. They introduced harmonic template models with amplitude and pitch modulations defined by a velocity vector. They applied a wavelet transform [253] to the harmonic template models to build an audio image where the amplitude and pitch dynamics can be separated through the velocity vector. They then derived a velocity equation, similar to the optical flow velocity equation used in images [254], to segment velocity components. Finally, they identified the harmonic templates which model the different sources in the mixture and separated them by approximating the velocity field over the corresponding harmonic template models.

Yen et al. proposed an approach using spectro-temporal modulation features [255], [256]. They decomposed a mixture using a two-stage auditory model which consists of a cochlear module [257] and a cortical module [258]. They then extracted spectro-temporal modulation features from the TF units, clustered the TF units into harmonic, percussive, and vocal components using the EM algorithm, and resynthesized the estimated signals.

Chan and Yang proposed an approach using an informed group sparse representation [259]. They introduced a representation built using a learned dictionary based on a chord sequence, which exhibits group sparsity [260] and which can incorporate melody annotations. They derived a formulation of the problem in a manner similar to RPCA and solved it using the alternating direction method of multipliers [261]. They also showed a relation between their representation and the low-rank representation in [123], [262].

G. Shortcomings

The large body of literature reviewed in the preceding sections concentrates on choosing adequate models for the lead and accompaniment parts of music signals in order to devise effective signal processing methods for separation. From a higher perspective, their common feature is to guide the separation process in a model-based way: first, the scientist has some idea regarding the characteristics of the lead signal and/or the accompaniment, and then an algorithm is designed to exploit this knowledge for separation.

Model-based methods for lead and accompaniment separation face the common risk that their core assumptions will be violated for the signal under study. For instance, the lead to be separated may not be harmonic but consist of saturated vocals, or the accompaniment may not be repetitive or redundant, but rather always changing. In such cases, model-based methods are prone to large errors and poor performance.

VI. DATA-DRIVEN APPROACHES

A way to address the potential caveats of model-based separation behaving badly when its assumptions are violated is to avoid making assumptions altogether, and rather to let the model be learned from a large and representative database of examples. This line of research leads to data-driven methods, in which researchers aim to directly estimate a mapping between the mixture and either the TF mask for separating the sources, or their spectrograms to be used for designing a filter. As may be foreseen, this strategy based on machine learning comes with several challenges of its own. First, it requires considerable amounts of data.
Second, it typically requires a high-capacity learner (many tunable parameters) that can be prone to over-fitting the training data and may therefore not generalize well to the audio it faces when deployed.

A. Datasets

Building a good data-driven method for source separation relies heavily on a training dataset from which to learn the separation model. In our case, this not only means obtaining a set of musical songs, but also their constitutive accompaniment and lead sources, summing up to the mixtures. For professionally produced or recorded music, the separated sources are often either unavailable or private. Indeed, they are considered amongst the most precious assets of right holders, and it is very difficult to find isolated vocals and accompaniment of professional bands that are freely available for the research community to work on without copyright infringement.

Another difficulty arises when considering that the different sources in a musical piece share a common orchestration and are not superimposed in a random way, which prohibits simply summing isolated random notes from instrumental databases to produce mixtures. This contrasts with the speech community, which routinely generates mixtures by summing noise data [263] and clean speech [264].

Furthermore, the temporal structures in music signals typically spread over long periods of time and can be exploited to achieve better separation. Additionally, short excerpts often do not comprise parts where the lead signal is absent, although a method should learn to deal with that situation. This all suggests that including full songs in the training data is preferable over short excerpts.

Finally, professional recordings typically undergo sophisticated sound processing where panning, reverberation, and other sound effects are applied to each source separately, and also to the mixture. To date, simulated datasets have poorly mimicked these effects [265]. Many separation methods make assumptions about the mixing model of the sources, e.g., assuming it is linear (i.e., that it does not comprise effects such as dynamic range compression). It is quite common that methods giving extremely good performance on linear mixtures completely break down when processing published musical recordings. Training and test data should thus feature realistic audio engineering to be useful for actual applications.

In this context, the development of datasets for lead and accompaniment separation was a long process. In early times, it was common for researchers to test their methods on private data. To the best of our knowledge, the first attempt at releasing a public dataset for evaluating vocals and accompaniment separation was the Music Audio Signal Separation (MASS) dataset [266]. It strongly boosted research in the area, even though it only featured 2.5 minutes of data. The breakthrough was made possible by artists who made their mixed-down audio, as well as its constitutive stems (unmixed tracks), available under open licenses such as Creative Commons, or who authorized scientists to use their material for research.

The MASS dataset then formed the core content of the early Signal Separation Evaluation Campaigns (SiSEC) [267], which evaluate the quality of various music separation methods [268]–[272]. SiSEC always had a strong focus on vocals and accompaniment separation.
For a long time, vocal separation methods were computationally very demanding, and it was already considered extremely challenging to separate excerpts of only a few seconds.

In the following years, new datasets were proposed that improved over the MASS dataset in many directions. We briefly describe the most important ones, summarized in Table I.

TABLE I: Summary of datasets available for lead and accompaniment separation. Tracks without vocals were omitted in the statistics.

Dataset   | Year | Reference(s) | URL                                                           | Tracks | Track duration (s) | Full / stereo?
MASS      | 2008 | [266]        | http://www.mtg.upf.edu/download/datasets/mass                 | 9      | 16 ± 7             | no / yes
MIR-1K    | 2010 | [74]         | https://sites.google.com/site/unvoicedsoundseparation/mir-1k  | 1,000  | 8 ± 8              | no / no
QUASI     | 2011 | [270], [273] | http://www.tsi.telecom-paristech.fr/aao/en/2012/03/12/quasi/  | 5      | 206 ± 21           | yes / yes
ccMixter  | 2014 | [141]        | http://www.loria.fr/~aliutkus/kam/                            | 50     | 231 ± 77           | yes / yes
MedleyDB  | 2014 | [274]        | http://medleydb.weebly.com/                                   | 63     | 206 ± 121          | yes / yes
iKala     | 2015 | [162]        | http://mac.citi.sinica.edu.tw/ikala/                          | 206    | 30                 | no / no
DSD100    | 2015 | [271]        | sisec17.audiolabs-erlangen.de                                 | 100    | 251 ± 60           | yes / yes
MUSDB18   | 2017 | [275]        | https://sigsep.github.io/musdb                                | 150    | 236 ± 95           | yes / yes

• The QUASI dataset was proposed to study the impact of different mixing scenarios on the separation quality. It consists of the same tracks as in the MASS dataset, but kept at full length and mixed by professional sound engineers.
• The MIR-1K and iKala datasets were the first attempts to scale vocal separation up. They feature a higher number of samples than the previously available datasets. However, they consist of mono signals of very short and amateur karaoke recordings.
• The ccMixter dataset was the first to feature many full-length stereo tracks. Each one comes with a vocals and an accompaniment source. Although it is stereo, it often suffers from simplistic mixing of the sources, making it unrealistic in some aspects.
• MedleyDB was developed as a dataset serving many purposes in music information retrieval. It consists of more than 100 full-length recordings, with all their constitutive sources. It was the first dataset to provide such a large amount of data for audio separation research (more than 7 hours). Among all the material in that dataset, 63 tracks feature singing voice.
• DSD100 was presented for SiSEC 2016. It features 100 full-length tracks originating from the 'Mixing Secret' Free Multitrack Download Library (http://www.cambridge-mt.com/ms-mtk.htm) of the Cambridge Music Technology, which is freely usable for research and educational purposes.

Finally, we present here the MUSDB18 dataset, putting together tracks from MedleyDB, DSD100, and other new musical material. It features 150 full-length tracks and has been constructed by the authors of this paper so as to address all the limitations identified above:
• It only features full-length tracks, so that the handling of long-term musical structures, and of silent regions in the lead/vocal signal, can be evaluated.
• It only features stereo signals which were mixed using professional digital audio workstations. This results in quality stereo mixes which are representative of real application scenarios.
• As with DSD100, a design choice of MUSDB18 was to split the signals into 4 predefined categories: bass, drums, vocals, and other. This contrasts with the enhanced granularity of MedleyDB, which offers more source types, but it strongly promotes automation of the algorithms.
• Many musical genres are represented in MUSDB18, for example jazz, electro, metal, etc.
• It is split into a development dataset (100 tracks, 6.5 h) and a test dataset (50 tracks, 3.5 h), for the design of data-driven separation methods.

All details about this freely available dataset and its accompanying software tools may be found on its dedicated website (https://sigsep.github.io/musdb). In any case, it can be seen that datasets of sufficient duration to build data-driven separation methods were only created recently.
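To make the role of such datasets concrete, the following sketch shows how stems are typically consumed for training: the isolated sources of a track are loaded and summed into the mixture under a linear mixing assumption. The directory layout and file names are hypothetical; actual datasets such as MUSDB18 ship encoded stems together with their own loading tools (see the website above) rather than plain WAV files.

```python
import soundfile as sf

def load_track(track_dir):
    """Load the stems of one track and rebuild the mixture under a linear mixing assumption."""
    vocals, sr = sf.read(f"{track_dir}/vocals.wav")          # lead target
    accomp, _ = sf.read(f"{track_dir}/accompaniment.wav")    # all remaining sources
    mixture = vocals + accomp                                 # only valid if the dataset is mixed linearly
    return mixture, vocals, accomp, sr

# Hypothetical usage:
# mixture, vocals, accomp, sr = load_track("some_dataset/train/track_001")
```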
B. Algebraic approaches

A natural way to exploit a training database is to learn some parts of the model so as to guide the estimation process towards better solutions. Work on this topic may be traced back to the suggestion of Ozerov et al. in [276] to learn spectral template models from a database of isolated sources, and then to adapt this dictionary of templates on the mixture using the method in [277]. The exploitation of training data was formalized by Smaragdis et al. in [110] in the context of source separation within the supervised and semi-supervised PLCA framework. The core idea of this probabilistic formulation, equivalent to NMF, is to learn some spectral bases from the training set, which are then kept fixed at separation time.

In the same line, Ozerov et al. proposed an approach using Bayesian models [191]. They first segmented a song into vocal and non-vocal parts using GMMs with MFCCs. Then, they adapted a general music model on the non-vocal parts of a particular song by using the maximum a posteriori (MAP) adaptation approach in [278].

Ozerov et al. later proposed a framework for source separation which generalizes several approaches given prior information about the problem, and showed its application to singing voice separation [210]–[212]. They chose the local Gaussian model [279] as the core of the framework and allowed prior knowledge about each source and its mixing characteristics to be expressed through user-specified constraints. Estimation was performed through a generalized EM algorithm [32].

Rafii et al. proposed in [280] to address the main drawback of the repetition-based methods described in Section IV-C, which is the weakness of their model for the vocals. For this purpose, they combined the REPET-SIM model [135] for the accompaniment with an NMF-based model for the singing voice, learned from a voice dataset.

As yet another example of using training data for NMF, Boulanger-Lewandowski et al. proposed in [281] to exploit long-term temporal dependencies in NMF, embodied using recurrent neural networks (RNN) [236]. They incorporated RNN regularization into the NMF framework to temporally constrain the activity matrix during the decomposition, which can be seen as a generalization of the non-negative HMM in [282]. Furthermore, they used supervised and semi-supervised NMF algorithms on isolated sources to train the models, as in [110].
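The following sketch illustrates the supervised flavor of this idea in the spirit of [110]: spectral bases are learned with NMF on isolated vocals and accompaniment, kept fixed at separation time, and only the activations are estimated on the mixture before building a soft mask. The spectrograms, ranks, and iteration count below are placeholders chosen for illustration.

```python
import numpy as np

def kl_nmf(V, K, n_iter=200, W=None, seed=0):
    """NMF with the (generalized) KL divergence; if W is given it is kept fixed (supervised case)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    eps = 1e-12
    fixed_W = W is not None
    if W is None:
        W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)      # multiplicative update for the activations
        if not fixed_W:
            WH = W @ H + eps
            W *= ((V / WH) @ H.T) / (ones @ H.T + eps)  # multiplicative update for the bases
    return W, H

# Training stage: learn bases on isolated sources (spectrograms are random placeholders here).
rng = np.random.default_rng(1)
V_vocals_train = rng.random((513, 400))
V_accomp_train = rng.random((513, 400))
W_v, _ = kl_nmf(V_vocals_train, K=30)
W_a, _ = kl_nmf(V_accomp_train, K=60)

# Separation stage: fix the concatenated bases and only estimate activations on the mixture.
V_mix = rng.random((513, 200))
W = np.concatenate([W_v, W_a], axis=1)
_, H = kl_nmf(V_mix, K=W.shape[1], W=W)
V_vocals_hat = W_v @ H[:30]
V_accomp_hat = W_a @ H[30:]
mask_vocals = V_vocals_hat / (V_vocals_hat + V_accomp_hat + 1e-12)   # soft mask applied to the mixture
```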
C. Deep neural networks

Taking advantage of the recent availability of sufficiently large databases of isolated vocals along with their accompaniment, several researchers investigated the use of machine learning methods to directly estimate a mapping between the mixture and the sources. Although end-to-end systems inputting and outputting waveforms have already been proposed in the speech community [283], they are not yet available for music source separation. This may be due to the relatively small size of music separation databases, at most 10 h today. Instead, most systems feature pre- and post-processing steps that consist in computing classical TF representations and building TF masks, respectively. Although such end-to-end systems will inevitably be proposed in the near future, the common structure of deep learning methods for lead and accompaniment separation currently corresponds to the one depicted in Figure 10. From a general perspective, most current methods mainly differ in the structure picked for the network, as well as in the way it is learned.

Fig. 10: General architecture for methods exploiting deep learning. The network inputs the mixture and outputs either the source spectrograms or a TF mask. Methods usually differ in their choice of network architecture and in the way it is learned from the training data.

Providing a thorough introduction to deep neural networks is beyond the scope of this paper. For our purpose, it suffices to mention that they consist of a cascade of several, possibly non-linear, transformations of the input, which are learned during a training stage. They were shown to effectively learn representations and mappings, provided enough data is available for estimating their parameters [284]–[286]. Different neural network architectures may be combined or cascaded, and many architectures have been proposed in the past, such as feedforward fully-connected neural networks (FNN), convolutional neural networks (CNN), or RNNs and variants such as the long short-term memory (LSTM) and the gated recurrent unit (GRU). Training of such functions is achieved by stochastic gradient descent [287] and associated algorithms, such as backpropagation [288] or backpropagation through time [236] in the case of RNNs.

To the best of our knowledge, Huang et al. were the first to propose deep neural networks, RNNs in this case [289], [290], for singing voice separation in [248], [291]. They adapted their framework from [292] to model all sources simultaneously through masking. The input and target functions were the mixture magnitude and a joint representation of the individual sources. The objective was to estimate jointly either singing voice and accompaniment music, or speech and background noise, from the corresponding mixtures.

Modeling the temporal structures of both the lead and the accompaniment is a considerable challenge, even when using DNN methods. As an alternative to the RNN approach proposed by Huang et al. in [248], Uhlich et al. proposed the use of FNNs [293] whose input consists of supervectors of a few consecutive frames of the mixture spectrogram. Later, in [294], the same authors considered the use of bi-directional LSTMs for the same task.

In an effort to make the resulting system less computationally demanding at separation time while still incorporating dynamic modeling of audio, Simpson et al. proposed in [295] to predict binary TF masks using deep CNNs, which typically require fewer parameters than FNNs.
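The following PyTorch sketch illustrates the generic architecture of Figure 10 rather than any specific published network: a small bidirectional LSTM maps mixture magnitude frames to a soft TF mask and is trained so that the masked mixture approximates the target vocal magnitude. The dimensions, learning rate, and toy tensors are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Maps mixture magnitude frames to a soft TF mask for the vocals."""
    def __init__(self, n_bins=513, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_bins, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_bins)

    def forward(self, mix_mag):                # mix_mag: (batch, frames, bins)
        h, _ = self.rnn(mix_mag)
        return torch.sigmoid(self.out(h))      # soft mask in [0, 1]

model = MaskNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step on toy tensors; real training loops over spectrogram excerpts of a dataset.
mix_mag = torch.rand(8, 100, 513)              # |STFT| of the mixture
voc_mag = torch.rand(8, 100, 513)              # |STFT| of the ground-truth vocals
mask = model(mix_mag)
loss = loss_fn(mask * mix_mag, voc_mag)        # the masked mixture should match the target
optimizer.zero_grad()
loss.backward()
optimizer.step()
# At test time: vocals_stft = mask * mixture_stft, accompaniment_stft = (1 - mask) * mixture_stft.
```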
Similarly, Schlueter proposed a method trained to detect singing voice using CNNs [296]. In that case, the trained network was used to compute saliency maps from which TF masks can be computed for singing voice separation. Chandna et al. also considered CNNs for lead separation in [297], with a particular focus on low latency.

The classical FNN, LSTM, and CNN structures above served as baselines over which others tried to improve. As a first example, Mimilakis et al. proposed to use a hybrid structure of FNNs with skip connections to separate the lead instrument for the purpose of remixing jazz recordings [298]. Such skip connections allow the input spectrogram to be propagated to intermediate representations within the network, which mask it similarly to the operation of TF masks. As advocated, this encourages the network to approximate a TF masking process. Extensions to temporal data for singing voice separation were presented in [299], [300]. Similarly, Jansson et al. proposed to propagate the spectral information computed by convolutional layers to intermediate representations [301]. This propagation aggregates intermediate outputs into the succeeding layer(s), and the output of the last layer is responsible for masking the input mixture spectrogram. In the same vein, Takahashi et al. proposed to use skip connections via element-wise addition through representations computed by CNNs [302].

Apart from the structure of the network, the way it is trained, including how the targets are computed, has a tremendous impact on performance. As we saw, most methods operate by defining TF masks or estimating magnitude spectrograms. However, other methods were proposed based on deep clustering [303], [304], where TF mask estimation is seen as a clustering problem. Luo et al. investigated both approaches in [305] by proposing deep bidirectional LSTM networks capable of outputting either TF masks or features to be used as in deep clustering. Kim and Smaragdis proposed in [306] another way to learn the model, in a denoising auto-encoding fashion [307], again using short segments of the mixture spectrogram as input to the network, as in [293].

As the best network structure may vary from one track to another, some authors considered a fusion of methods, in a manner similar to the method [242] presented above. Grais et al. [308], [309] proposed to aggregate the results of an ensemble of feedforward DNNs to predict TF masks for separation. An improvement was presented in [310], [311], where the inputs to the fusion network were separated signals instead of TF masks, aiming at enhancing the reconstruction of the separated sources.

As can be seen, the use of deep learning methods for the design of lead and accompaniment separation has already stimulated a lot of research, although it is still in its infancy. Interestingly, we also note that using audio- and music-specific knowledge appears to be fundamental in designing effective systems. As an example, the contribution of Nie et al. in [312] was to include the construction of the TF mask as an extra non-linearity in a recurrent network. This is an exemplar of how signal processing elements, such as filtering through masking, can be incorporated as building blocks of the machine learning method.

The network structure is not the only component that can benefit from audio knowledge for better separation.
The design of appropriate features is another. While we saw that supervectors of spectrogram patches offer the ability to effectively model time-context information in FNNs [293], Sebastian and Murthy [313] proposed the use of the modified group delay feature representation [314] in their deep RNN architecture. They applied their approach to both singing voice and vocal-violin separation.

Finally, as with other methods, DNN-based separation techniques can also be combined with others to yield improved performance. As an example, Fan et al. proposed to use DNNs to separate the singing voice while also exploiting vocal pitch estimation [315]. They first extracted the singing voice using feedforward DNNs with sigmoid activation functions. They then estimated the vocal pitch from the extracted singing voice using dynamic programming.

D. Shortcomings

Data-driven methods are nowadays the topic of important research efforts, particularly those based on DNNs. This is notably due to their impressive performance in terms of separation quality, as can be noticed, for instance, in Section VIII below. However, they also come with some limitations.

First, we highlighted that lead and accompaniment separation in music faces the very specific problem of scarce data. Since it is very hard to gather large amounts of training data for this application, it is hard to fully exploit learning methods that require large training sets. This raises very specific challenges in terms of machine learning.

Second, the lack of interpretability of the model parameters is often mentioned as a significant shortcoming when it comes to applications. Indeed, music engineering systems are characterized by the strong importance of human-computer interaction, because they are used in an artistic context that may come with specific needs or desired results. As of today, it is unclear how to provide user interaction for controlling the millions of parameters of DNN-based systems.

VII. INCLUDING MULTICHANNEL INFORMATION

In describing the above methods, we have not discussed the fact that music signals are typically stereophonic. On the contrary, the bulk of the methods we discussed focused on designing good spectrogram models for the purpose of filtering mixtures that may be monophonic. Such a strategy is called single-channel source separation and is usually presented as more challenging than multichannel source separation, because only the TF structure may then be used to discriminate the accompaniment from the lead. In stereo recordings, a further so-called spatial dimension is introduced, sometimes referred to as pan, which corresponds to the perceived position of a source in the stereo field. Devising methods that exploit this spatial diversity for source separation has also been the topic of an important body of research, which we review now.

A. Extracting the lead based on panning

In the case of popular music signals, a fact of paramount practical importance is that the lead signal, such as the vocals, is very often mixed in the center, which means that its energy is approximately the same in the left and right channels. On the contrary, other instruments are often mixed at positions to the left or right of the stereo field.
Fig. 11: Separation of the lead based on panning information. A stereo cue called panning allows the design of a TF mask.

The general structure of methods extracting the lead based on stereo cues is displayed in Figure 11. It was introduced by Avendano, who proposed to separate sources in stereo mixtures by using a panning index [316]. He derived a two-dimensional map by comparing the left and right channels in the TF domain to identify the different sources based on their panning position [317]. The same methodology was considered by Barry et al. in [318] in the Azimuth Discrimination and Resynthesis (ADRess) approach, with panning indexes computed with differences instead of ratios.

Vinyes et al. also proposed to unmix commercially produced music recordings thanks to stereo cues [319]. They designed an interface similar to [318] where a user can set some parameters to generate different TF filters in real time. They showed applications for extracting various instruments, including vocals.

Cobos and López proposed to separate sources in stereo mixtures by using TF masking and multilevel thresholding [320]. They based their approach on the Degenerate Unmixing Estimation Technique (DUET) [321]. They first derived histograms by measuring the amplitude relationship between TF points in the left and right channels. Then, they obtained several thresholds using the multilevel extension of Otsu's method [322]. Finally, TF points were assigned to their related sources to produce TF masks.

Sofianos et al. proposed to separate the singing voice from a stereo mixture using ICA [323]–[325]. They assumed that most commercial songs have the vocals panned to the center and that the vocals dominate the other sources in amplitude. In [323], they proposed to combine a modified version of ADRess with ICA to filter out the other instruments. In [324], they proposed a modified version without ADRess.

Kim et al. proposed to separate a centered singing voice in stereo music by exploiting binaural cues, such as the inter-channel level and inter-channel phase differences [326]. To this end, they build the pan-based TF mask through an EM algorithm, exploiting a GMM model on these cues.
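The following sketch illustrates the panning-based masking idea of Figure 11 with a simple inter-channel similarity measure. It is close in spirit to, but not a reimplementation of, the methods above: TF bins whose energy is nearly identical in the left and right channels are attributed to the centered lead, and the threshold is an arbitrary illustrative value.

```python
import numpy as np

def center_mask(stft_left, stft_right, threshold=0.9):
    """Binary mask selecting TF bins whose energy is (nearly) equal in both channels."""
    l, r = np.abs(stft_left), np.abs(stft_right)
    similarity = 2.0 * l * r / (l ** 2 + r ** 2 + 1e-12)   # 1 for perfectly centered bins, < 1 otherwise
    return similarity > threshold

# Toy usage on random complex spectrograms standing in for the stereo mixture STFT.
rng = np.random.default_rng(0)
L = rng.standard_normal((513, 200)) + 1j * rng.standard_normal((513, 200))
R = rng.standard_normal((513, 200)) + 1j * rng.standard_normal((513, 200))
mask = center_mask(L, R)
# lead_left = mask * L; lead_right = mask * R; the complementary mask gives the accompaniment.
```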
B. Augmenting models with stereo

As with using only a harmonic model for the lead signal, using stereo cues in isolation is not always sufficient for good separation, as there can often be multiple sources at the same spatial location. Combining stereo cues with other methods improves performance in these cases.

Cobos and López proposed to extract the singing voice by combining panning information and pitch tracking [327]. They first obtained an estimate of the lead with a pan-based method such as [316], and then refined the singing voice by using a binary TF mask based on a comb-filtering method, as in Section III-B. The same combination was proposed by Marxer et al. in [87] in a low-latency context, with different methods used for the binaural cue and pitch tracking blocks.

FitzGerald proposed to combine approaches based on repetition and panning to extract stereo vocals [328]. He first used his nearest neighbors median filtering algorithm [139] to separate vocals and accompaniment from a stereo mixture. He then used the ADRess algorithm [318] and a high-pass filter to refine the vocals and improve the accompaniment. In a somewhat different manner, FitzGerald and Jaiswal also proposed to combine approaches based on repetition and panning to improve stereo accompaniment recovery [329]. They presented an audio inpainting scheme [330] based on the nearest neighbors median filtering algorithm [139] to recover the TF regions of the accompaniment that had been assigned to the vocals by a source separation algorithm based on panning information.

In a more theoretically grounded manner, several methods based on a probabilistic model were generalized to the multichannel case. For instance, Durrieu et al. extended their source-filter model in [201], [205] to handle stereo signals, by incorporating the panning coefficients as model parameters to be estimated.

Ozerov and Févotte proposed a multichannel NMF framework with application to source separation, including vocals and music [331], [332]. They adopted a statistical model where each source is represented as a sum of Gaussian components [193], and where maximum likelihood estimation of the parameters is equivalent to NMF with the Itakura-Saito divergence [94]. They proposed two methods for estimating the parameters of their model: one that maximizes the likelihood of the multichannel data using EM, and one that maximizes the sum of the likelihoods of all channels using a multiplicative update algorithm inspired by NMF [90].

Ozerov et al. then proposed a multichannel non-negative tensor factorization (NTF) model with application to user-guided source separation [333]. They modeled the sources jointly by a 3-valence tensor (time/frequency/source) as in [334], which extends the multichannel NMF model in [332]. They used a generalized EM algorithm based on multiplicative updates [335] to minimize the objective function. They incorporated information about the temporal segmentation of the tracks and the number of components per track. Ozerov et al. later proposed weighted variants of NMF and NTF with application to user-guided source separation, including the separation of vocals and music [241], [336].

Sawada et al. also proposed multichannel extensions of NMF, tested for separating stereo mixtures of multiple sources, including vocals and accompaniment [337]–[339]. They first defined multichannel extensions of the cost function, namely the Euclidean distance and the Itakura-Saito divergence, and derived multiplicative update rules accordingly. They then proposed two techniques for clustering the bases, one built into the NMF model and one performing sequential pair-wise merges.

Finally, multichannel information was also used with DNN models. Nugraha et al. addressed the problem of multichannel source separation for speech enhancement [340], [341] and music separation [342], [343]. In this framework, DNNs are still used for modeling the spectrograms, while more classical EM algorithms [344], [345] are used for estimating the spatial parameters.

C. Shortcomings

When compared to simply processing the different channels independently, incorporating spatial information in the separation method often comes at the cost of additional computational complexity. The resulting methods are usually more demanding in terms of computing power, because they involve the design of beamforming filters and the inversion of covariance matrices.
While this is not really an issue for stereophonic music, it may become prohibitive in configurations with higher numbers of channels.

VIII. EVALUATION

A. Background

The problem of evaluating the quality of audio signals is a research topic of its own, which is deeply connected to psychoacoustics [346] and has many applications in engineering, because it provides an objective function to optimize when designing processing methods. While the mean squared error (MSE) is often used for mathematical convenience whenever an error is to be computed, it is a well-established fact that the MSE is not representative of audio perception [347], [348]. For example, inaudible phase shifts would dramatically increase the MSE.

Moreover, it should be acknowledged that the concept of quality is rather application-dependent. In the case of signal separation or enhancement, the processing is often only one part of a whole architecture, and a relevant methodology for evaluation is to study the positive or negative impact of this module on the overall performance of the system, rather than to consider it independently from the rest. For example, when embedded in an automatic speech recognition (ASR) system, the performance of speech denoising can be assessed by checking whether it decreases the word error rate [349].

When it comes to music processing, and more particularly to lead and accompaniment separation, the evaluation of separation quality has traditionally been inspired by work in the audio coding community [347], [350], in the sense that it aims at comparing ground truth vocals and accompaniment with their estimates, just like audio coding compares the original with the compressed signal.

B. Metrics

As noted previously, MSE-based error measures are not perceptually relevant. For this reason, a natural approach is to have humans do the comparison. The gold standard for human perceptual studies is the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) methodology, commonly used for evaluating audio coding [350].

However, it quickly became clear that the evaluation of separation quality cannot easily be reduced to a single number, even when achieved through actual perceptual campaigns; quality rather depends on the application considered. For instance, karaoke and vocal extraction come with opposing trade-offs between isolation and distortion. For this reason, it has been standard practice to provide different and complementary metrics for evaluating separation, measuring the amount of distortion, artifacts, and interference in the results.

While human-based perceptual evaluation is definitely the best way to assess separation quality [351], [352], having computable objective metrics is desirable for several reasons. First, it allows researchers to evaluate performance without setting up costly and lengthy perceptual evaluation campaigns. Second, it permits large-scale training for the fine-tuning of parameters. In this respect, the Blind Source Separation Evaluation (BSS Eval) toolbox [353], [354] provides quality metrics in decibels that account for distortion (SDR), artifacts (SAR), and interference (SIR). Since it was made available quite early and provides somewhat reasonable correlation with human perception in certain cases [355], [356], it is still widely used to this day.
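For reference, the BSS Eval metrics can be computed with freely available Python implementations; the sketch below uses the mir_eval package (assumed installed), whose bss_eval_sources routine returns SDR, SIR, and SAR together with the best permutation of the estimates. The toy signals only illustrate the call.

```python
import numpy as np
import mir_eval

# Toy signals standing in for ground-truth sources and their (imperfect) estimates.
rng = np.random.default_rng(0)
vocals_true = rng.standard_normal(44100)
accomp_true = rng.standard_normal(44100)
vocals_est = vocals_true + 0.1 * rng.standard_normal(44100)
accomp_est = accomp_true + 0.1 * rng.standard_normal(44100)

references = np.stack([vocals_true, accomp_true])   # shape: (n_sources, n_samples)
estimates = np.stack([vocals_est, accomp_est])

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(references, estimates)
print("SDR (dB):", sdr, "SIR (dB):", sir, "SAR (dB):", sar)
```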
Even though BSS Eval was long considered sufficient for evaluation purposes, it is based on squared error criteria. Following early work in the area [357], the Perceptual Evaluation of Audio Source Separation (PEASS) toolkit [358]–[360] was introduced as a way to predict perceptual ratings. While the methodology is very relevant, PEASS was not widely adopted in practice. We believe this is for two reasons. First, the proposed implementation is quite computationally demanding. Second, the perceptual scores it was designed with are more related to speech separation than to music.

Improving perceptual evaluation often requires a large number of experiments, which is both costly and requires many expert listeners. One way to increase the number of participants is to conduct web-based experiments. In [361], the authors report that they were able to gather 530 participants in only 8.2 hours and obtained perceptual evaluation scores comparable to those estimated in a controlled lab environment.

Finally, we highlight here that the development of new perceptually relevant objective metrics for singing voice separation evaluation remains an open issue [362]. It is also a highly crucial one for future research in the domain.

C. Performance (SiSEC 2016)

In this section, we discuss the performance of 23 source separation methods evaluated on DSD100, as part of the task of separating professionally-produced music recordings at SiSEC 2016. The methods are listed in Table II, along with the acronyms we use for them, their main references, a very brief summary, and a link to the section where they are described in the text. To date, this stands as the largest evaluation campaign ever conducted on lead and accompaniment separation. The results we discuss here are a more detailed report for SiSEC 2016 [272], presented in line with the taxonomy proposed in this paper.

TABLE II: Methods evaluated during SiSEC 2016.

Acronym | Ref.  | Summary                                   | Section
HUA     | [115] | RPCA, standard version                    | IV-B
RAF1    | [130] | REPET, standard version                   | IV-C
RAF2    | [134] | REPET with time-varying period            | IV-C
RAF3    | [135] | REPET with similarity matrix              | IV-C
KAM1-2  | [142] | KAM with different configurations         | IV-C
CHA     | [162] | RPCA with vocal activation information    | V-A
JEO1-2  | [163] | l1-RPCA with vocal activation information | V-A
DUR     | [201] | Source-filter NMF                         | V-C
OZE     | [212] | Structured NMF with learned dictionaries  | VI-B
KON     | [291] | RNN                                       | VI-C
GRA2-3  | [308] | DNN ensemble                              | VI-C
STO1-2  | [363] | FNN on common fate TF representation      | VI-C
UHL1    | [293] | FNN with context                          | VI-C
NUG1-4  | [343] | FNN with multichannel information         | VII
UHL2-3  | [294] | LSTM with multichannel information        | VII
IBM     |       | ideal binary mask                         |

The objective scores for these methods were obtained using BSS Eval and are given in Figure 12. For more details about the results and for listening to the estimates, we refer the reader to the dedicated interactive website (http://www.sisec17.audiolabs-erlangen.de).

As we first notice in Figure 12, the HUA method, corresponding to the standard RPCA discussed in Section IV-B, showed rather disappointing performance in this evaluation. After inspection of the results, it appears that processing full-length tracks is the issue there: at such scales, the vocals also exhibit redundancy, which is captured by the low-rank model associated with the accompaniment.
On the other hand, the RAF1-3 and KAM1-2 methods, which exploit redundancy through repetitions as presented in Section IV-C, behave much better on full-length tracks: even if somewhat redundant, vocals are rarely as repetitive as the accompaniment. When those methods are evaluated on datasets with very short excerpts (e.g., MIR-1K), such severe practical drawbacks are not apparent.

Likewise, the DUR method, which jointly models the vocals as harmonic and the accompaniment as redundant as discussed in Section V-C, shows rather disappointing performance, considering that it was long the state of the art in earlier SiSECs [270]. After inspection, we may propose two reasons for this performance drop. First, using full-length excerpts clearly revealed a shortcoming of the approach: it poorly handles silences in the lead, which were rare in the short excerpts tested so far. Second, using a much larger evaluation set revealed that vocals are not necessarily well modeled by a harmonic source-filter model; breathy or saturated voices appear to greatly challenge such a model.

Fig. 12: BSS Eval scores for the vocals and accompaniment estimates for SiSEC 2016 on the DSD100 dataset. Results are shown for the test set only. Scores are grouped as in Table II according to the section where the methods are described in the text, indicated below each group.

While processing full-length tracks comes as a challenge, it can also be an opportunity. It is indeed worth noticing that whenever RPCA is helped through vocal activity detection, its performance is significantly boosted, as highlighted by the relatively good results obtained by CHA and JEO1-2.

As discussed in Section VI, the availability of learning data made it possible to build data-driven approaches, like the NMF-based OZE method, which is available through the Flexible Audio Source Separation Toolbox (FASST) [211], [212]. Although it was long the state of the art, it has recently been strongly outperformed by other data-driven approaches, namely DNNs. One first reason clearly appears to be the superior expressive power of DNNs over NMF, but a second reason could simply be that OZE should be retrained with the same large amount of data.

As mentioned above, a striking fact in Figure 12 is that the overall performance of data-driven DNN methods is the highest. This shows that exploiting learning data greatly helps separation, compared to relying only on a priori assumptions such as harmonicity or redundancy. Additionally, dynamic models such as CNNs or LSTMs appear more adapted to music than FNNs. These good performances in audio source separation are in line with the recent success of DNNs in fields as varied as computer vision, speech recognition, and natural language processing [285].
However, the picture is more subtle than simply black-box DNN systems beating all other approaches. For instance, exploiting multichannel probabilistic models, as discussed in Section VII, leads to the NUG and UHL2-3 methods, which significantly outperform the DNN methods ignoring stereo information. In the same vein, we expect other specific assumptions and musicological ideas to be exploited for further improving the quality of the separation.

One particular feature of this evaluation is that it also shows obvious weaknesses in the objective metrics. For instance, the GRA method behaves significantly worse than any other method; however, when listening to the separated signals, this does not seem deserved. All in all, designing new and convenient metrics that better match perception and that are specifically built for music on large datasets clearly appears as a desirable milestone.

In any case, the performance achieved by a totally informed filtering method such as the IBM is significantly higher than that of any submitted method in this evaluation. This means that lead and accompaniment separation has room for much improvement, and that the topic is bound to witness many more breakthroughs. This is even more true considering that the IBM is not the best upper bound for separation performance: other filtering methods such as the ideal ratio mask [20] or the multichannel Wiener filter [344] may be considered as references.

Regardless of the above, we would also like to highlight that good algorithms and models can suffer from slight errors in their low-level audio processing routines. Such routines include the STFT representation, the overlap-add procedure, energy normalization, and so on. Considerable improvements may also be obtained by using simple tricks and, depending on the method, large impacts on the results can come from changing only low-level parameters. These include the overlap ratio of the STFT, specific ways to regularize matrix inverses in multichannel models, etc. Further tricks, such as exponentiating the TF mask by some positive value, can often boost performance significantly more than using a more sophisticated model. However, such tricks are often lost when publishing research focused on the higher-level algorithms. We believe this is an important reason why sharing source code is highly desirable in this particular application. Some online repositories containing implementations of lead and accompaniment separation methods should be mentioned, such as nussl (https://github.com/interactiveaudiolab/nussl) and untwist [364]. In the companion webpage of this paper (https://sigsep.github.io), we list many different online resources, such as datasets, implementations, and tools, that we hope will be useful to the practitioner, and provide some useful pointers to the interested reader.

D. Discussion

Finally, we summarize the core advantages and disadvantages of each of the five groups of methods we identified.

Methods based on the harmonicity assumption for the lead are focused on sinusoidal modeling. They enjoy a very strong interpretability and allow the direct incorporation of any prior knowledge concerning pitch. Their fundamental weakness lies in the fact that many singing voice signals are not harmonic, e.g., when breathy or distorted.
Modeling the accompaniment as redundant allows exploiting long-term dependencies in music signals and may benefit from high-level information such as tempo or score. The most important drawback of these methods is that they fall short in terms of voice models: the lead signal itself is often redundant to some extent and is thus partly incorporated into the estimated accompaniment.

Systems jointly modeling the lead as harmonic and the accompaniment as redundant benefit from both assumptions. They were long the state of the art and enjoy a good interpretability, which makes them good candidates for interactive separation methods. However, their core shortcoming is a high sensitivity to violations of their assumptions, which often proves to be the case in practice. Such situations usually require fine-tuning, which prevents their use as black-box systems for a broad audience.

Data-driven methods involve machine learning to directly learn a mapping between the mixture and the constitutive sources. This strategy recently brought a breakthrough compared to everything done before. Their most important disadvantages are their lack of interpretability, which makes it challenging to design good user interactions, and their strong dependency on the size of the training data.

Finally, multichannel methods leverage stereophonic information to strongly improve performance. Interestingly, this can usually be combined with better spectrogram models such as DNNs to further improve quality. The price to pay for this boost in performance is an additional computational cost, which may be prohibitive for recordings with more than two channels.

IX. CONCLUSION

In this paper, we thoroughly discussed the problem of separating lead and accompaniment signals in music recordings. We gave a comprehensive overview of the research undertaken on this topic over the last 50 years, classifying the different approaches according to their main features and assumptions. In doing so, we showed how one very large body of research can be described as model-based. In this context, it was evident from the literature that the two most important assumptions behind these models are that the lead instrument is harmonic, while the accompaniment is redundant. As we demonstrated, a very large number of model-based lead-accompaniment separation methods can be seen as using one or both of these assumptions. However, music encompasses signals of an extraordinary diversity, and no rigid assumption holds well for all of them. For this reason, while there are often music pieces on which each method performs well, there will also be some on which it fails. As a result, data-driven methods were proposed as an attempt to introduce more flexibility, at the cost of requiring representative training data. In the context of this paper, we proposed the largest freely available dataset for music separation, comprising close to 10 hours of data, which is 240 times larger than the first public dataset released 10 years ago.

At present, we see a huge focus on research utilizing recent machine learning breakthroughs for the design of singing voice separation methods. This came with an associated boost in performance, as measured by objective metrics. However, we have also discussed the strengths and shortcomings of existing evaluations and metrics.
In this respect, it is important to note that the songs used for evaluation are but a minuscule fraction of all recorded music, and that separating music signals remains the processing of an artistic means of expression. As such, it is impossible to escape the need for human perceptual evaluations, or at least for adequate models of them.

After reviewing the large existing body of literature, we may conclude by saying that lead and accompaniment separation in music is a problem at the crossroads of many different paradigms and methods. Researchers from very different backgrounds, such as physics, signal processing, or computer engineering, have tackled it, and it exists both as an area for strong theoretical research and as a challenging real-world engineering problem. Its strong connections with the arts and digital humanities have proved attractive to many researchers.

Finally, as we showed, there is still much room for improvement in lead and accompaniment separation, and we believe that new and exciting research will bring further breakthroughs in this field. While DNN methods represent the latest big step forward and significantly outperform previous research, we believe that future improvements can come from any direction, including those discussed in this paper. Still, we expect future improvements to come first from improved machine learning methodologies that can cope with reduced training sets, from improved modeling of the specific properties of musical signals, and from the development of better signal representations.

REFERENCES

[1] R. Kalakota and M. Robinson, e-Business 2.0: Roadmap for Success. Addison-Wesley Professional, 2000.
[2] C. K. Lam and B. C. Tan, "The Internet is changing the music industry," Communications of the ACM, vol. 44, no. 8, pp. 62–68, 2001.
[3] P. Comon and C. Jutten, Handbook of Blind Source Separation. Academic Press, 2010.
[4] G. R. Naik and W. Wang, Blind Source Separation. Springer-Verlag Berlin Heidelberg, 2014.
[5] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, May 1999.
[6] A. Hyvärinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Networks, vol. 13, no. 4-5, pp. 411–430, Jun. 2000.
[7] S. Makino, T.-W. Lee, and H. Sawada, Blind Speech Separation. Springer Netherlands, 2007.
[8] E. Vincent, T. Virtanen, and S. Gannot, Audio Source Separation and Speech Enhancement. Wiley, 2018.
[9] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 1990.
[10] A. Liutkus, J.-L. Durrieu, L. Daudet, and G. Richard, "An overview of informed audio source separation," in 14th International Workshop on Image Analysis for Multimedia Interactive Services, Paris, France, Jul. 2013.
[11] E. Vincent, N. Bertin, R. Gribonval, and F. Bimbot, "From blind to guided audio source separation: How models and side information can improve the separation of sound," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 107–115, May 2014.
[12] U. Zölzer, DAFX - Digital Audio Effects. Wiley, 2011.
[13] M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 2015.
[14] E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[15] O. Cappé, E. Moulines, and T.
Ryden, Inference in Hidden Markov Models (Springer Series in Statistics) . Secaucus, NJ, USA: Springer- V erlag New Y ork, Inc., 2005. [16] R. J. McAulay and T . F . Quatieri, “Speech analysis/synthesis based on a sinusoidal representation, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 34, no. 4, pp. 744–754, Aug. 1986. [17] S. Rickard and O. Y ilmaz, “On the approximate w-disjoint orthogonal- ity of speech, ” in IEEE International Conference on Acoustics, Speech, and Signal Pr ocessing , Orlando, Florida, USA, May 2002. [18] S. Boll, “Suppression of acoustic noise in speech using spectral subtrac- tion, ” IEEE T ransactions on acoustics, speech, and signal pr ocessing , vol. 27, no. 2, pp. 113–120, 1979. [19] N. Wiener , “Extrapolation, interpolation, and smoothing of stationary time series, ” 1975. [20] A. Liutkus and R. Badeau, “Generalized Wiener filtering with fractional power spectrograms, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Brisbane, QLD, Australia, Apr . 2015. [21] G. Fant, Acoustic Theory of Speech Pr oduction . W alter de Gruyter , 1970. [22] B. P . Bogert, M. J. R. Healy , and J. W . T ukey , “The quefrency alanysis of time series for echoes: Cepstrum pseudo-autocovariance, cross- cepstrum, and saphe cracking, ” Pr oceedings of a symposium on time series analysis , pp. 209–243, 1963. [23] A. M. Noll, “Short-time spectrum and “cepstrum” techniques for vocal- pitch detection, ” Journal of the Acoustical Society of America , vol. 36, no. 2, pp. 296–302, 1964. [24] ——, “Cepstrum pitch determination, ” J ournal of the Acoustical Soci- ety of America , vol. 41, no. 2, pp. 293–309, 1967. [25] S. B. Davis and P . Mermelstein, “Comparison of parametric repre- sentations for monosyllabic word recognition in continuously spoken sentences, ” IEEE T ransactions on A udio, Speech, and Language Pro- cessing , vol. 28, no. 4, pp. 357–366, Aug. 1980. [26] A. V . Oppenheim, “Speech analysis-synthesis system based on ho- momorphic filtering, ” Journal of the Acoustical Society of America , vol. 45, no. 2, pp. 458–465, 1969. [27] R. Durrett, Pr obability: theory and examples . Cambridge university press, 2010. [28] G. Schwarz, “Estimating the dimension of a model, ” Annals of Statis- tics , vol. 6, no. 2, pp. 461–464, Mar. 1978. [29] L. R. Rabiner, “ A tutorial on hidden Markov models and selected applications in speech recognition, ” Pr oceedings of the IEEE , vol. 77, no. 2, pp. 257–286, Feb. 1989. [30] A. J. V iterbi, “ A personal history of the Viterbi algorithm, ” IEEE Signal Pr ocessing Magazine , vol. 23, no. 4, pp. 120–142, 2006. [31] C. Bishop, Neural networks for pattern recognition . Clarendon Press, 1996. [32] A. P . Dempster , N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm, ” Journal of the Royal Statistical Society , v ol. 39, no. 1, pp. 1–38, 1977. [33] J. Salamon, E. G ´ omez, D. Ellis, and G. Richard, “Melody extraction from polyphonic music signals: Approaches, applications and chal- lenges, ” IEEE Signal Pr ocessing Magazine , vol. 31, 2014. [34] N. J. Miller, “Removal of noise from a voice signal by synthesis, ” Utah Univ ersity , T ech. Rep., 1973. [35] A. V . Oppenheim and R. W . Schafer , “Homomorphic analysis of speech, ” IEEE Tr ansactions on Audio and Electr oacoustics , vol. 16, no. 2, pp. 221–226, Jun. 1968. [36] R. C. Maher, “ An approach for the separation of voices in composite musical signals, ” Ph.D. 
dissertation, Uni versity of Illinois at Urbana- Champaign, 1989. [37] A. L. W ang, “Instantaneous and frequency-warped techniques for auditory source separation, ” Ph.D. dissertation, Stanford Univ ersity , 1994. [38] ——, “Instantaneous and frequency-warped techniques for source sep- aration and signal parametrization, ” in IEEE W orkshop on Applications of Signal Pr ocessing to Audio and Acoustics , New Paltz, New Y ork, USA, Oct. 1995. [39] Y . Meron and K. Hirose, “Separation of singing and piano sounds, ” in 5th International Conference on Spoken Language Processing , Sydney , Australia, Nov . 1998. [40] T . F . Quatieri, “Shape in variant time-scale and pitch modification of speech, ” IEEE T ransactions on Signal Pr ocessing , vol. 40, no. 3, pp. 497–510, Mar . 1992. [41] A. Ben-Shalom and S. Dubno v , “Optimal filtering of an instrument sound in a mixed recording given approximate pitch prior, ” in Inter- national Computer Music Conference , Miami, FL, USA, Nov . 2004. [42] S. Shalev-Shwartz, S. Dubnov , N. Friedman, and Y . Singer , “Robust temporal and spectral modeling for query by melody , ” in 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrie val , T ampere, Finland, Aug. 2002. [43] X. Serra, “Musical sound modeling with sinusoids plus noise, ” in Musical Signal Pr ocessing . Swets & Zeitlinger , 1997, pp. 91–122. [44] B. V . V een and K. M. Buckley , “Beamforming techniques for spatial filtering, ” in The Digital Signal Processing Handbook . CRC Press, 1997, pp. 1–22. [45] Y .-G. Zhang and C.-S. Zhang, “Separation of voice and music by har- monic structure stability analysis, ” in IEEE International Conference on Multimedia and Expo , Amsterdam, Netherlands, Jul. 2005. [46] ——, “Separation of music signals by harmonic structure modeling, ” in Advances in Neural Information Processing Systems 18 . MIT Press, 2006, pp. 1617–1624. Rafii et al.: An Overvie w of Lead and Accompaniment Separation in Music 23 [47] E. T erhardt, “Calculating virtual pitch, ” Hearing Resear ch , vol. 1, no. 2, pp. 155–182, Mar . 1979. [48] Y .-G. Zhang, C.-S. Zhang, and S. W ang, “Clustering in knowledge embedded space, ” in Machine Learning: ECML 2003 . Springer Berlin Heidelberg, 2003, pp. 480–491. [49] H. Fujihara, T . Kitahara, M. Goto, K. K omatani, T . Ogata, and H. G. Okuno, “Singer identification based on accompaniment sound reduction and reliable frame selection, ” in 6th International Conference on Music Information Retrieval , London, UK, Sep. 2005. [50] H. Fujihara, M. Goto, T . Kitahara, and H. G. Okuno, “ A modeling of singing voice robust to accompaniment sounds and its application to singer identification and vocal-timbre-similarity-based music infor- mation retriev al, ” IEEE T ransactions on Audio, Speech, and Language Pr ocessing , v ol. 18, no. 3, pp. 638–648, Mar . 2010. [51] M. Goto, “ A real-time music-scene-description system: Predominant- F0 estimation for detecting melody and bass lines in real-world audio signals, ” Speech Communication , vol. 43, no. 4, pp. 311–329, Sep. 2004. [52] J. A. Moorer, “Signal processing aspects of computer music: A surve y , ” Pr oceedings of the IEEE , vol. 65, no. 8, pp. 1108–1137, Aug. 2005. [53] A. Mesaros, T . V irtanen, and A. Klapuri, “Singer identification in polyphonic music using vocal separation and pattern recognition meth- ods, ” in 7th International Conference on Music Information Retrieval , V ictoria, BC, Canada, Oct. 2007. [54] M. Ryyn ¨ anen and A. 
Klapuri, “Transcription of the singing melody in polyphonic music, ” in 7th International Confer ence on Music Information Retrieval , V ictoria, BC, Canada, Oct. 2006. [55] Z. Duan, Y .-F . Zhang, C.-S. Zhang, and Z. Shi, “Unsupervised single- channel music source separation by average harmonic structure model- ing, ” IEEE T ransactions on A udio, Speech, and Language Processing , vol. 16, no. 4, pp. 766–778, May 2008. [56] X. Rodet, “Musical sound signal analysis/synthesis: Sinu- soidal+residual and elementary wav eform models, ” in IEEE T ime-Fr equency and T ime-Scale W orkshop , Cov entry , UK, Aug. 1997. [57] J. O. Smith and X. Serra, “P ARSHL: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation, ” in International Computer Music Confer ence , Urbana, IL, USA, Aug. 1987. [58] M. Slane y , D. Naar, and R. F . L yon, “ Auditory model in version for sound separation, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Adelaide, SA, Australia, Apr . 1994. [59] M. Lagrange and G. Tzanetakis, “Sound source tracking and formation using normalized cuts, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Honolulu, HI, USA, Apr . 2007. [60] M. Lagrange, L. G. Martins, J. Murdoch, and G. Tzanetakis, “Normal- ized cuts for predominant melodic source separation, ” IEEE T ransac- tions on Audio, Speech, and Language Pr ocessing , v ol. 16, no. 2, pp. 278–290, Feb . 2008. [61] J. Shi and J. Malik, “Normalized cuts and image segmentation, ” IEEE T ransactions on P attern Analysis and Machine Intellig ence , v ol. 22, no. 8, pp. 888–905, Aug. 2000. [62] M. Ryyn ¨ anen, T . V irtanen, J. Paulus, and A. Klapuri, “ Accompani- ment separation and karaoke application based on automatic melody transcription, ” in IEEE International Confer ence on Multimedia and Expo , Hannov er , German y , Aug. 2008. [63] M. Ryyn ¨ anen and A. Klapuri, “ Automatic transcription of melody , bass line, and chords in polyphonic music, ” Computer Music Journal , vol. 32, no. 3, pp. 72–86, Sep. 2008. [64] Y . Ding and X. Qian, “Processing of musical tones using a combined quadratic polynomial-phase sinusoid and residual (QUASAR) signal model, ” Journal of the Audio Engineering Society , vol. 45, no. 7/8, pp. 571–584, Jul. 1997. [65] Y . Li and D. W ang, “Singing voice separation from monaural record- ings, ” in 7th International Conference on Music Information Retrieval , 2006. [66] ——, “Separation of singing voice from music accompaniment for monaural recordings, ” IEEE T ransactions on Audio, Speech, and Lan- guage Processing , vol. 15, no. 4, pp. 1475–1487, May 2007. [67] C. Duxbury , J. P . Bello, M. Davies, and M. Sandler , “Complex domain onset detection for musical signals, ” in 6th International Confer ence on Digital A udio Effects , London, UK, Sep. 2003. [68] Y . Li and D. W ang, “Detecting pitch of singing voice in polyphonic audio, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , Philadelphia, P A, USA, Mar . 2005. [69] M. W u, D. W ang, and G. J. Brown, “ A multipitch tracking algorithm for noisy speech, ” IEEE T ransactions on Audio, Speech, and Language Pr ocessing , v ol. 11, no. 3, pp. 229–241, May 2003. [70] G. Hu and D. W ang, “Monaural speech se gregation based on pitch tracking and amplitude modulation, ” IEEE T ransactions on Neural Networks , vol. 15, no. 5, pp. 1135–1150, Sep. 2002. [71] Y . Han and C. 
Raphael, “Desoloing monaural audio using mixture mod- els, ” in 7th International Confer ence on Music Information Retrieval , V ictoria, BC, Canada, Oct. 2007. [72] S. T . Roweis, “One microphone source separation, ” in Advances in Neural Information Pr ocessing Systems 13 . MIT Press, 2001, pp. 793–799. [73] C.-L. Hsu, J.-S. R. Jang, and T .-L. Tsai, “Separation of singing v oice from music accompaniment with unv oiced sounds reconstruction for monaural recordings, ” in AES 125th Con vention , San Francisco, CA, USA, Oct. 2008. [74] C.-L. Hsu and J.-S. R. Jang, “On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 18, no. 2, pp. 310–319, Feb. 2010. [75] K. Dressler, “Sinusoidal e xtraction using an efficient implementation of a multi-resolution FFT, ” in 9th International Conference on Digital Audio Effects , Montreal, QC, Canada, Sep. 2006. [76] P . Scalart and J. V . Filho, “Speech enhancement based on a priori signal to noise estimation, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Atlanta, GA, USA, May 1996. [77] C. Raphael and Y . Han, “ A classifier -based approach to score-guided music audio source separation, ” Computer Music Journal , vol. 32, no. 1, pp. 51–59, 2008. [78] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Re gr ession Tr ees . Chapman and Hall/CRC, 1984. [79] E. Cano and C. Cheng, “Melody line detection and source separation in classical saxophone recordings, ” in 12th International Confer ence on Digital A udio Effects , Como, Italy , Sep. 2009. [80] S. Grollmisch, E. Cano, and C. Dittmar , “Songs2See: Learn to play by playing, ” in AES 41st Conference: Audio for Games , Feb. 2011, pp. P2–3. [81] C. Dittmar, E. Cano, J. Abeßer, and S. Grollmisch, “Music information retriev al meets music education, ” in Multimodal Music Pr ocessing . Dagstuhl Publishing, 2012, pp. 95–120. [82] E. Cano, C. Dittmar , and G. Schuller, “Efficient implementation of a system for solo and accompaniment separation in polyphonic music, ” in 20th Eur opean Signal Pr ocessing Confer ence , Bucharest, Romania, Aug. 2012. [83] K. Dressler , “Pitch estimation by the pair-wise e v aluation of spectral peaks, ” in 42nd AES Confer ence on Semantic A udio , Ilmenau, Ger - many , Jul. 2011. [84] E. Cano, C. Dittmar, and G. Schuller , “Re-thinking sound separation: Prior information and additivity constraints in separation algorithms, ” in 16th International Conference on Digital Audio Effects , Maynooth, Ireland, Sep. 2013. [85] E. Cano, G. Schuller , and C. Dittmar, “Pitch-informed solo and accom- paniment separation towards its use in music education applications, ” EURASIP Journal on Advances in Signal Pr ocessing , vol. 2014, no. 23, Sep. 2014. [86] J. J. Bosch, K. Kondo, R. Marx er , and J. Janer , “Score-informed and timbre independent lead instrument separation in real-world scenarios, ” in 20th Eur opean Signal Pr ocessing Confer ence , Bucharest, Romania, Aug. 2012. [87] R. Marxer, J. Janer, and J. Bonada, “Low-latenc y instrument separation in polyphonic audio using timbre models, ” in 10th International Confer ence on Latent V ariable Analysis and Signal Separation , T el A viv , Israel, Mar . 2012. [88] A. V aneph, E. McNeil, and F . Rigaud, “ An automated source separation technology and its practical applications, ” in 140th AES Convention , Paris, France, May 2016. [89] S. Leglai ve, R. Hennequin, and R. 
Badeau, “Singing voice detection with deep recurrent neural netw orks, ” in IEEE International Confer ence on Acoustics, Speech and Signal Processing , Brisbane, QLD, Australia, Apr . 2015. [90] D. D. Lee and H. S. Seung, “Learning the parts of objects by non- negati ve matrix factorization, ” Nature , vol. 401, pp. 788–791, Oct. 1999. [91] ——, “ Algorithms for non-neg ativ e matrix factorization, ” in Advances in Neural Information Pr ocessing Systems 13 . MIT Press, 2001, pp. 556–562. Rafii et al.: An Overvie w of Lead and Accompaniment Separation in Music 24 [92] P . Smaragdis and J. C. Brown, “Non-neg ativ e matrix factorization for polyphonic music transcription, ” in IEEE W orkshop on Applications of Signal Pr ocessing to Audio and Acoustics , New Paltz, New Y ork, USA, Oct. 2003. [93] T . V irtanen, “Monaural sound source separation by nonnegativ e matrix factorization with temporal continuity and sparseness criteria, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 15, no. 3, pp. 1066–1074, Mar . 2007. [94] C. F ´ evotte, “Nonnegative matrix factorization with the Itakura-Saito div ergence: With application to music analysis, ” Neural Computation , vol. 21, no. 3, pp. 793–830, Mar . 2009. [95] P . Common, “Independent component analysis, a new concept?” Signal Pr ocessing , v ol. 36, no. 3, pp. 287–314, Apr . 1994. [96] S. V embu and S. Baumann, “Separation of vocals from polyphonic au- dio recordings, ” in 6th International Conference on Music Information Retrieval , London, UK, sep 2005. [97] H. Hermansky , “Perceptual linear predictive (PLP) analysis of speech, ” Journal of the Acoustical Society of America , vol. 87, no. 4, pp. 1738– 1752, Apr . 1990. [98] T . L. Nwe and Y . W ang, “ Automatic detection of vocal segments in popular songs, ” in 5th International Conference for Music Information Retrieval , Barcelona, Spain, Oct. 2004. [99] M. A. Casey and A. W estner , “Separation of mixed audio sources by independent subspace analysis, ” in International Computer Music Confer ence , Berlin, German y , Sep. 2000. [100] A. Chanrungutai and C. A. Ratanamahatana, “Singing voice separation for mono-channel music using non-negati v e matrix factorization, ” in International Confer ence on Advanced T echnolo gies for Communica- tions , Hanoi, V ietnam, Oct. 2008. [101] ——, “Singing voice separation in mono-channel music, ” in Interna- tional Symposium on Communications and Information T echnolo gies , Lao, China, Oct. 2008. [102] A. N. T ikhonov , “Solution of incorrectly formulated problems and the regularization method, ” Soviet Mathematics , vol. 4, pp. 1035–1038, 1963. [103] R. Marxer and J. Janer, “ A Tikhonov regularization method for spec- trum decomposition in low latency audio source separation, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Kyoto, Japan, Mar . 2012. [104] P .-K. Y ang, C.-C. Hsu, and J.-T . Chien, “Bayesian singing-voice sepa- ration, ” in 15th International Society for Music Information Retrieval Confer ence , T aipei, T aiwan, Oct. 2014. [105] J.-T . Chien and P .-K. Y ang, “Bayesian factorization and learning for monaural source separation, ” IEEE/A CM T ransactions on Audio, Speech, and Language Pr ocessing , vol. 24, no. 1, pp. 185–195, Jan. 2015. [106] A. T . Cemgil, “Bayesian inference for nonnegati ve matrix factorisation models, ” Computational Intelligence and Neuroscience , vol. 2009, no. 4, pp. 1–17, Jan. 2009. [107] M. N. Schmidt, O. Winther , and L. K. 
Hansen, “Bayesian non-negativ e matrix factorization, ” in 8th International Confer ence on Independent Component Analysis and Signal Separation , Paraty , Brazil, Mar . 2009. [108] M. Spiertz and V . Gnann, “Source-filter based clustering for monaural blind source separation, ” in 12th International Conference on Digital Audio Effects , Como, Italy , Sep. 2009. [109] P . Smaragdis and G. J. Mysore, “Separation by “humming”: User- guided sound extraction from monophonic mixtures, ” in IEEE W ork- shop on Applications of Signal Pr ocessing to Audio and Acoustics , New Paltz, New Y ork, USA, Oct. 2009. [110] P . Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi- supervised separation of sounds from single-channel mixtures, ” in 7th International Confer ence on Independent Component Analysis and Signal Separation , London, UK, Sep. 2007. [111] T . Nakamuray and H. Kameoka, “ L p -norm non-negati ve matrix f ac- torization and its application to singing voice enhancement, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Brisbane, QLD, Australia, Apr. 2015. [112] J. M. Ortega and W . C. Rheinboldt, Iterative solution of nonlinear equations in se veral variables . Academic Press, 1970. [113] H. Kameoka, M. Goto, and S. Sagayama, “Selective amplifier of periodic and non-periodic components in concurrent audio signals with spectral control env elopes, ” Information Processing Society of Japan, T ech. Rep., 2006. [114] E. J. Cand ` es, X. Li, Y . Ma, and J. Wright, “Robust principal component analysis?” J ournal of the ACM , v ol. 58, no. 3, pp. 1–37, May 2011. [115] P .-S. Huang, S. D. Chen, P . Smaragdis, and M. Hasegaw a-Johnson, “Singing-voice separation from monaural recordings using robust principal component analysis, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , K yoto, Japan, Mar. 2012. [116] P . Sprechmann, A. Bronstein, and G. Sapiro, “Real-time online singing voice separation from monaural recordings using robust low-rank mod- eling, ” in 13th International Society for Music Information Retrieval Confer ence , Porto, Portugal, Oct. 2012. [117] B. Recht, M. Fazel, and P . A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, ” SIAM Revie w , v ol. 52, no. 3, pp. 471–501, Aug. 2010. [118] B. Recht and C. R ´ e, “Parallel stochastic gradient algorithms for large- scale matrix completion, ” Mathematical Pro gramming Computation , vol. 5, no. 2, pp. 201–226, Jun. 2013. [119] K. Gregor and Y . LeCun, “Learning fast approximations of sparse coding, ” in 27th International Conference on Machine Learning , Haifa, Israel, Jun. 2010. [120] L. Zhang, Z. Chen, M. Zheng, and X. He, “Robust non-negati ve matrix factorization, ” F r ontiers of Electrical Electronic Engineering China , vol. 6, no. 2, pp. 192–200, Jun. 2011. [121] I.-Y . Jeong and K. Lee, “V ocal separation using extended robust principal component analysis with Schatten P / L p -norm and scale compression, ” in International W orkshop on Machine Learning for Signal Pr ocessing , Reims, France, Nov . 2014. [122] F . Nie, H. W ang, and H. Huang, “Joint Schatten p -norm and l p -norm robust matrix completion for missing value recovery , ” Knowledge and Information Systems , v ol. 42, no. 3, pp. 525–544, Mar . 2015. [123] Y .-H. 
Y ang, “Low-rank representation of both singing voice and music accompaniment via learned dictionaries, ” in 14th International Society for Music Information Retrieval confer ence , Curitiba, PR, Brazil, Nov . 2013. [124] J. Mairal, F . Bach, J. Ponce, and G. Sapiro, “Online dictionary learning for sparse coding, ” in 26th Annual International Conference on Machine Learning , Montreal, QC, Canada, Jun. 2009. [125] T .-S. T . Chan and Y .-H. Y ang, “Comple x and quaternionic principal component pursuit and its application to audio separation, ” IEEE Signal Pr ocessing Letters , vol. 23, no. 2, pp. 287–291, Feb . 2016. [126] G. Peeters, “Deriving musical structures from signal analysis for music audio summary generation: ”sequence” and ”state” approach, ” in Inter- national Symposium on Computer Music Multidisciplinary Researc h , Montpellier , France, May 2003. [127] R. B. Dannenberg and M. Goto, “Music structure analysis from acoustic signals, ” in Handbook of Signal Processing in Acoustics . Springer Ne w Y ork, 2008, pp. 305–331. [128] J. P aulus, M. M ¨ uller , and A. Klapuri, “ Audio-based music structure analysis, ” in 11th International Society for Music Information Retrieval Confer ence , Utrecht, Netherlands, Aug. 2010. [129] Z. Rafii and B. Pardo, “ A simple music/voice separation system based on the extraction of the repeating musical structure, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Prague, Czech Republic, May 2011. [130] ——, “REpeating Pattern Extraction T echnique (REPET): A simple method for music/voice separation, ” IEEE T ransactions on Audio, Speech, and Language Pr ocessing , vol. 21, no. 1, pp. 73–84, Jan. 2013. [131] Z. Rafii, A. Liutkus, and B. Pardo, “REPET for background/foreground separation in audio, ” in Blind Sour ce Separation . Springer Berlin Heidelberg, 2014, pp. 395–411. [132] J. F oote and S. Uchihashi, “The beat spectrum: A new approach to rhythm analysis, ” in IEEE International Conference on Multimedia and Expo , T okyo, Japan, Aug. 2001. [133] P . Seetharaman, F . Pishdadian, and B. Pardo, “Music/v oice separation using the 2d Fourier transform, ” in IEEE W orkshop on Applications of Signal Pr ocessing to Audio and Acoustics , Ne w Paltz, New Y ork, Oct. 2017. [134] A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, “ Adaptiv e filtering for music/voice separation exploiting the repeating musical structure, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Kyoto, Japan, Mar . 2012. [135] Z. Rafii and B. Pardo, “Music/v oice separation using the similarity matrix, ” in 13th International Society for Music Information Retrieval Confer ence , Porto, Portugal, Oct. 2012. [136] J. Foote, “V isualizing music and audio using self-similarity , ” in 7th ACM International Confer ence on Multimedia , Orlando, FL, USA, Oct. 1999. [137] Z. Rafii and B. Pardo, “Online REPET -SIM for real-time speech enhancement, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , V ancouver , BC, Canada, May 2013. Rafii et al.: An Overvie w of Lead and Accompaniment Separation in Music 25 [138] Z. Rafii, A. Liutkus, and B. P ardo, “ A simple user interface system for recovering patterns repeating in time and frequency in mixtures of sounds, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , Brisbane, QLD, Australia, Apr . 2015. [139] D. 
FitzGerald, “V ocal separation using nearest neighbours and median filtering, ” in 23r d IET Irish Signals and Systems Confer ence , Maynooth, Ireland, Jun. 2012. [140] A. Liutkus, Z. Rafii, B. Pardo, D. FitzGerald, and L. Daudet, “Kernel spectrogram models for source separation, ” in 4th Joint W orkshop on Hands-free Speech Communication Micr ophone Arrays , V illers-les- Nancy , France, May 2014. [141] A. Liutkus, D. FitzGerald, Z. Rafii, B. Pardo, and L. Daudet, “Kernel additiv e models for source separation, ” IEEE T ransactions on Signal Pr ocessing , v ol. 62, no. 16, pp. 4298–4310, Aug. 2014. [142] A. Liutkus, D. FitzGerald, and Z. Rafii, “Scalable audio separation with light kernel additiv e modelling, ” in IEEE International Confer ence on Acoustics, Speec h and Signal Processing , Brisbane, QLD, Australia, Apr . 2015. [143] T . Pr ¨ atzlich, R. Bittner , A. Liutkus, and M. M ¨ uller , “Kernel additive modeling for interference reduction in multi-channel music recordings, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , Brisbane, QLD, Australia, Aug. 2015. [144] D. F . Y ela, S. Ewert, D. FitzGerald, and M. Sandler, “Interference reduction in music recordings combining kernel additive modelling and non-negati ve matrix f actorization, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , New Orleans, LA, USA, Mar . 2017. [145] M. Moussallam, G. Richard, and L. Daudet, “ Audio source separation informed by redundancy with greedy multiscale decompositions, ” in 20th European Signal Pr ocessing Conference , Bucharest, Romania, Aug. 2012. [146] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries, ” IEEE Tr ansactions on Signal Processing , vol. 41, no. 12, pp. 3397–3415, Dec. 1993. [147] H. Deif, D. FitzGerald, W . W ang, and L. Gan, “Separation of vocals from monaural music recordings using diagonal median filters and practical time-frequenc y parameters, ” in IEEE International Symposium on Signal Processing and Information T echnology , Abu Dhabi, United Arab Emirates, Dec. 2015. [148] D. FitzGerald and M. Gainza, “Single channel vocal separation using median filtering and factorisation techniques, ” ISAST T ransactions on Electr onic and Signal Pr ocessing , vol. 4, no. 1, pp. 62–73, Jan. 2010. [149] J.-Y . Lee and H.-G. Kim, “Music and voice separation using log- spectral amplitude estimator based on kernel spectrogram models backfitting, ” Journal of the Acoustical Society of Kor ea , vol. 34, no. 3, pp. 227–233, 2015. [150] J.-Y . Lee, H.-S. Cho, and H.-G. Kim, “V ocal separation from monaural music using adaptive auditory filtering based on k ernel back-fitting, ” in Interspeec h , Dresden, Germany , Sep. 2015. [151] H.-S. Cho, J.-Y . Lee, and H.-G. Kim, “Singing voice separation from monaural music based on kernel back-fitting using beta-order spectral amplitude estimation, ” in 16th International Society for Music Information Retrieval Conference , M ´ alaga, Spain, Oct. 2015. [152] H.-G. Kim and J. Y . Kim, “Music/voice separation based on kernel back-fitting using weighted β -order MMSE estimation, ” ETRI Journal , vol. 38, no. 3, pp. 510–517, Jun. 2016. [153] E. Plourde and B. Champagne, “ Auditory-based spectral amplitude estimators for speech enhancement, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 16, no. 8, pp. 1614–1623, Nov . 2008. [154] B. Raj, P . Smaragdis, M. Shashanka, and R. 
Singh, “Separating a fore- ground singer from background music, ” in International Symposium on F rontier s of Resear c h on Speech and Music , 2007. [155] P . Smaragdis and B. Raj, “Shift-in variant probabilistic latent component analysis, ” MERL, T ech. Rep., 2006. [156] B. Raj and P . Smaragdis, “Latent variable decomposition of spectro- grams for single channel speaker separation, ” in IEEE W orkshop on Applications of Signal Pr ocessing to Audio and Acoustics , New Paltz, New Y ork, USA, Oct. 2005. [157] J. Han and C.-W . Chen, “Impro ving melody extraction using proba- bilistic latent component analysis, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , Prague, Czech Republic, May 2011. [158] P . Boersma, “ Accurate short-term analysis of the fundamental fre- quency and the harmonics-to-noise ratio of a sampled sound, ” in IF A Pr oceedings 17 , 1993. [159] E. G ´ omez, F . J. C. nadas Quesada, J. Salamon, J. Bonada, P . V . Candea, and P . C. nas Molero, “Predominant fundamental frequency estimation vs singing voice separation for the automatic transcription of accompanied flamenco singing, ” in 13th International Society for Music Information Retrie val Confer ence , Porto, Portugal, Aug. 2012. [160] N. Ono, K. Miyamoto, J. L. Roux, H. Kameoka, and S. Sagayama, “Separation of a monaural audio signal into harmonic/percussiv e components by complementary diffusion on spectrogram, ” in 16th Eur opean Signal Processing Confer ence , Lausanne, Switzerland, Aug. 2008. [161] H. Papadopoulos and D. P . Ellis, “Music-content-adapti ve robust prin- cipal component analysis for a semantically consistent separation of foreground and background in music audio signals, ” in 17th Interna- tional Confer ence on Digital Audio Effects , Erlangen, Germany , Sep. 2014. [162] T .-S. Chan, T .-C. Y eh, Z.-C. Fan, H.-W . Chen, L. Su, Y .-H. Y ang, and R. Jang, “V ocal activity informed singing voice separation with the iKala dataset, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Brisbane, QLD, Australia, Apr . 2015. [163] I.-Y . Jeong and K. Lee, “Singing voice separation using RPCA with weighted l 1 -norm, ” in 13th International Conference on Latent V ari- able Analysis and Signal Separation , Grenoble, France, Feb. 2017. [164] T . V irtanen, A. Mesaros, and M. Ryyn ¨ anen, “Combining pitch-based inference and non-negativ e spectrogram factorization in separating vo- cals from polyphonic music, ” in ISCA T utorial and Research W orkshop on Statistical and P er ceptual Audition , Brisbane, Australia, Sep. 2008. [165] Y . W ang and Z. Ou, “Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Prague, Czech Republic, May 2011. [166] A. Klapuri, “Multiple fundamental frequency estimation by summing harmonic amplitudes, ” in 7th International Confer ence on Music In- formation Retrieval , V ictoria, BC, Canada, Oct. 2006. [167] C.-L. Hsu, L.-Y . Chen, J.-S. R. Jang, and H.-J. Li, “Singing pitch ex- traction from monaural polyphonic songs by contextual audio modeling and singing harmonic enhancement, ” in 10th International Society for Music Information Retrie val Confer ence , K yoto, Japan, Oct. 2009. [168] Z. Rafii, Z. Duan, and B. 
Pardo, “Combining rhythm-based and pitch- based methods for background and melody separation, ” IEEE/ACM T ransactions on Audio, Speech, and Language Processing , vol. 22, no. 12, pp. 1884–1893, Sep. 2014. [169] Z. Duan and B. Pardo, “Multiple fundamental frequency estimation by modeling spectral peaks and non-peak re gions, ” IEEE T ransactions on Audio, Speech, and Language Pr ocessing , v ol. 18, no. 8, pp. 2121– 2133, Nov . 2010. [170] S. V enkataramani, N. Nayak, P . Rao, and R. V elmurugan, “V ocal sep- aration using singer-vo wel priors obtained from polyphonic audio, ” in 15th International Society for Music Information Retrieval Conference , T aipei, T aiwan, Oct. 2014. [171] V . Rao and P . Rao, “V ocal melody extraction in the presence of pitched accompaniment in polyphonic music, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 18, no. 8, pp. 2145–2154, Nov . 2010. [172] V . Rao, C. Gupta, and P . Rao, “Context-a ware features for singing voice detection in polyphonic music, ” in International W orkshop on Adaptive Multimedia Retrieval , Barcelona, Spain, Jul. 2011. [173] M. Kim, J. Y oo, K. Kang, and S. Choi, “Nonnegativ e matrix partial co- factorization for spectral and temporal drum source separation, ” IEEE Journal of Selected T opics in Signal Pr ocessing , vol. 5, no. 6, pp. 1192–1204, Oct. 2011. [174] L. Zhang, Z. Chen, M. Zheng, and X. He, “Nonnegativ e matrix and tensor factorizations: An algorithmic perspective, ” IEEE Signal Pr ocessing Magazine , vol. 31, no. 3, pp. 54–65, May 2014. [175] Y . Ikemiya, K. Y oshii, and K. Itoyama, “Singing voice analysis and editing based on mutually dependent F0 estimation and source separation, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Brisbane, QLD, Australia, Apr . 2015. [176] Y . Ikemiya, K. Itoyama, and K. Y oshii, “Singing voice separation and vocal F0 estimation based on mutual combination of robust principal component analysis and subharmonic summation, ” IEEE/A CM T rans- actions on A udio, Speech, and Languag e Pr ocessing , vol. 24, no. 11, pp. 2084–2095, No v . 2016. [177] D. J. Hermes, “Measurement of pitch by subharmonic summation, ” Journal of the Acoustical Society of America , vol. 83, no. 1, pp. 257– 264, Jan. 1988. [178] A. Dobashi, Y . Ikemiya, K. Itoyama, and K. Y oshii, “ A music per- formance assistance system based on vocal, harmonic, and percussi ve Rafii et al.: An Overvie w of Lead and Accompaniment Separation in Music 26 source separation and content visualization for music audio signals, ” in 12th Sound and Music Computing Confer ence , Maynooth, Ireland, Jul. 2015. [179] Y . Hu and G. Liu, “Separation of singing voice using nonnegativ e matrix partial co-factorization for singer identification, ” IEEE T rans- actions on A udio, Speech, and Language Processing , vol. 23, no. 4, pp. 643–653, Apr . 2015. [180] J. Y oo, M. Kim, K. Kang, and S. Choi, “Nonne gati ve matrix partial co-factorization for drum source separation, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , 2010. [181] P . Boersma, “PRAA T, a system for doing phonetics by computer, ” Glot International , vol. 5, no. 9/10, pp. 341–347, Dec. 2001. [182] Y . Li, J. W oodruff, and D. W ang, “Monaural musical sound separation based on pitch and common amplitude modulation, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 17, no. 7, pp. 1361– 1371, Sep. 2009. [183] B. Raj, M. L. Seltzer, and R. M. 
Stern, “Reconstruction of missing fea- tures for rob ust speech recognition, ” Speech Communication , v ol. 43, no. 4, pp. 275–296, Sep. 2004. [184] Y . Hu and G. Liu, “Monaural singing voice separation by non-negativ e matrix partial co-factorization with temporal continuity and sparsity criteria, ” in 12th International Conference on Intelligent Computing , Lanzhou, China, Aug. 2016. [185] X. Zhang, W . Li, and B. Zhu, “Latent time-frequency component analysis: A nov el pitch-based approach for singing voice separation, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , Brisbane, QLD, Australia, Apr . 2015. [186] A. de Cheveign ´ e and H. Ka wahara, “YIN, a fundamental frequency estimator for speech and music, ” Journal of the Acoustical Society of America , vol. 111, no. 4, pp. 1917–1930, Apr. 2002. [187] B. Zhu, W . Li, and L. Li, “T owards solving the bottleneck of pitch- based singing voice separation, ” in 23rd A CM International Conference on Multimedia , Brisbane, QLD, Australia, Oct. 2015. [188] J.-L. Durrieu, G. Richard, and B. David, “Singer melody extraction in polyphonic signals using source separation methods, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Las V egas, NV , USA, Apr . 2008. [189] ——, “ An iterative approach to monaural musical mixture de-soloing, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , T aipei, T aiwan, Apr . 2009. [190] J.-L. Durrieu, G. Richard, B. David, and C. F ´ evotte, “Source/filter model for unsupervised main melody extraction from polyphonic audio signals, ” IEEE T ransactions on Audio, Speec h, and Language Pr ocessing , v ol. 18, no. 3, pp. 564–575, Mar . 2010. [191] A. Ozerov , P . Philippe, F . Bimbot, and R. Gribonv al, “ Adaptation of Bayesian models for single-channel source separation and its applica- tion to v oice/music separation in popular songs, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 15, no. 5, pp. 1564– 1578, Jul. 2007. [192] D. H. Klatt and L. C. Klatt, “ Analysis, synthesis, and perception of voice quality variations among female and male talkers, ” J ournal of the Acoustical Society of America , vol. 87, no. 2, pp. 820–857, Feb . 1990. [193] L. Benaroya, L. Mcdonagh, F . Bimbot, and R. Gribon v al, “Non negati ve sparse representation for Wiener based source separation with a single sensor , ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Hong Kong, China, Apr . 2003. [194] I. S. Dhillon and S. Sra, “Generalized nonnegativ e matrix approxima- tions with Bregman diver gences, ” in Advances in Neural Information Pr ocessing Systems 18 . MIT Press, 2005, pp. 283–290. [195] L. Benaroya, F . Bimbot, and R. Gribonv al, “ Audio source separation with a single sensor, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 14, no. 1, pp. 191–199, Jan. 2006. [196] J.-L. Durrieu and J.-P . Thiran, “Musical audio source separation based on user-selected F0 track, ” in 10th International Conference on Latent V ariable Analysis and Signal Separation , T el A vi v , Israel, Mar . 2012. [197] B. Fuentes, R. Badeau, and G. Richard, “Blind harmonic adaptive decomposition applied to supervised source separation, ” in Signal Pr ocessing Confer ence (EUSIPCO), 2012 Pr oceedings of the 20th Eur opean . IEEE, 2012, pp. 2654–2658. [198] J. C. Brown, “Calculation of a constant Q spectral transform, ” Journal of the Acoustical Society of America , vol. 89, no. 1, pp. 
425–434, Jan. 1991. [199] J. C. Brown and M. S. Puckette, “ An efficient algorithm for the calculation of a constant Q transform, ” Journal of the Acoustical Society of America , vol. 92, no. 5, pp. 2698–2701, Nov . 1992. [200] C. Sch ¨ orkhuber and A. Klapuri, “Constant-Q transform toolbox, ” in 7th Sound and Music Computing Conference , Barcelona, Spain, Jul. 2010. [201] J.-L. Durrieu, B. David, and G. Richard, “ A musically motivated mid- lev el representation for pitch estimation and musical audio source separation, ” IEEE Journal of Selected T opics in Signal Pr ocessing , vol. 5, no. 6, pp. 1180–1191, Oct. 2011. [202] C. Joder and B. Schuller , “Score-informed leading voice separation from monaural audio, ” in 13th International Society for Music Infor- mation Retrieval Conference , Porto, Portugal, Oct. 2012. [203] C. Joder, S. Essid, and G. Richard, “ A conditional random field framew ork for robust and scalable audio-to-score matching, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 19, no. 8, pp. 2385–2397, Nov . 2011. [204] R. Zhao, S. Lee, D.-Y . Huang, and M. Dong, “Soft constrained leading voice separation with music score guidance, ” in 9th International Symposium on Chinese Spoken Language , Singapore, Singapore, Sep. 2014. [205] J.-L. Durrieu, A. Ozerov , C. F ´ evotte, G. Richard, and B. David, “Main instrument separation from stereophonic audio signals using a source/filter model, ” in 17th Eur opean Signal Pr ocessing Conference , Glasgow , UK, Aug. 2009. [206] J. Janer and R. Marxer, “Separation of unv oiced fricatives in singing voice mixtures with semi-supervised NMF, ” in 16th International Confer ence on Digital Audio Effects , Maynooth, Ireland, Sep. 2013. [207] J. Janer, R. Marxer, and K. Arimoto, “Combining a harmonic-based NMF decomposition with transient analysis for instantaneous per- cussion separation, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , K yoto, Japan, Mar . 2012. [208] R. Marx er and J. Janer, “Modelling and separation of singing v oice breathiness in polyphonic mixtures, ” in 16th International Conference on Digital A udio Effects , Maynooth, Ireland, Sep. 2013. [209] G. Degottex, A. Roebel, and X. Rodet, “Pitch transposition and breath- iness modification using a glottal source model and its adapted vocal- tract filter , ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Prague, Czech Republic, May 2011. [210] A. Ozerov , E. V incent, and F . Bimbot, “ A general modular framew ork for audio source separation, ” in 9th International Conference on Latent V ariable Analysis and Signal Separation , St. Malo, France, Sep. 2010. [211] ——, “ A general flexible framework for the handling of prior informa- tion in audio source separation, ” IEEE T ransactions on Audio, Speech, and Language Processing , vol. 20, no. 4, pp. 1118–1133, May 2012. [212] Y . Sala ¨ un, E. V incent, N. Bertin, N. Souvira ` a-Labastie, X. Jau- reguiberry , D. T . T ran, and F . Bimbot, “The flexible audio source separation toolbox version 2.0, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Florence, Italy , May 2014. [213] R. Hennequin and F . Rigaud, “Long-term reverberation modeling for under-determined audio source separation with application to vocal melody extraction, ” in 17th International Society for Music Information Retrieval Conference , Ne w Y ork City , NY , USA, Aug. 2016. [214] R. Singh, B. Raj, and P . 
Smaragdis, “Latent-v ariable decomposition based dereverberation of monaural and multi-channel signals, ” in IEEE International Conference on Acoustics, Speech and Signal Pr ocessing , Dallas, TX, USA, Mar. 2010. [215] N. Ono, K. Miyamoto, H. Kameoka, and S. Sagayama, “ A real-time equalizer of harmonic and percussive components in music signals, ” in 9th International Conference on Music Information Retrieval , Philadel- phia, P A, USA, Sep. 2008. [216] D. FitzGerald, “Harmonic/percussive separation using median filter- ing, ” in 13th International Confer ence on Digital Audio Effects , Graz, Austria, Sep. 2010. [217] Y .-H. Y ang, “On sparse and low-rank matrix decomposition for singing voice separation, ” in 20th ACM International Conference on Multime- dia , Nara, Japan, Oct. 2012. [218] I.-Y . Jeong and K. Lee, “V ocal separation from monaural music using temporal/spectral continuity and sparsity constraints, ” IEEE Signal Pr ocessing Letters , vol. 21, no. 10, pp. 1197–1200, Jun. 2014. [219] E. Ochiai, T . Fujisawa, and M. Ikehara, “V ocal separation by con- strained non-negati v e matrix factorization, ” in Asia-P acific Signal and Information Pr ocessing Association Annual Summit and Conference , Hong K ong, China, Dec. 2015. [220] T . W atanabe, T . Fujisawa, and M. Ikehara, “V ocal separation using improved robust principal component analysis and post-processing, ” in IEEE 59th International Midwest Symposium on Circuits and Systems , Abu Dhabi, United Arab Emirates, Oct. 2016. Rafii et al.: An Overvie w of Lead and Accompaniment Separation in Music 27 [221] H. Raguet, J. Fadili, , and G. Peyr ´ e, “ A generalized forward-backward splitting, ” SIAM Journal on Imaging Sciences , vol. 6, no. 3, pp. 1199– 1226, Jul. 2013. [222] A. Hayashi, H. Kameoka, T . Matsubayashi, and H. Sawada, “Non- negati ve periodic component analysis for music source separation, ” in Asia-P acific Signal and Information Pr ocessing Association Annual Summit and Confer ence , Jeju, South Korea, Dec. 2016. [223] D. FitzGerald, M. Cranitch, and E. Coyle, “Using tensor factorisation models to separate drums from polyphonic music, ” in 12th Interna- tional Confer ence on Digital Audio Effects , Como, Italy , Sep. 2009. [224] H. T achibana, N. Ono, and S. Sagayama, “Singing voice enhancement in monaural music signals based on two-stage harmonic/percussive sound separation on multiple resolution spectrograms, ” IEEE/ACM T ransactions on Audio, Speech and Language Pr ocessing , vol. 22, no. 1, pp. 228–237, Jan. 2014. [225] H. T achibana, T . Ono, N. Ono, and S. Sagayama, “Melody line estimation in homophonic music audio signals based on temporal- variability of melodic source, ” in IEEE International Confer ence on Acoustics, Speech and Signal Pr ocessing , Dallas, TX, USA, Mar. 2010. [226] H. T achibana, N. Ono, and S. Sagayama, “ A real-time audio-to- audio karaoke generation system for monaural recordings based on singing voice suppression and key con version techniques, ” Journal of Information Pr ocessing , vol. 24, no. 3, pp. 470–482, May 2016. [227] N. Ono, K. Miyamoto, H. Kameoka, J. L. Roux, Y . Uchiyama, E. Tsunoo, T . Nishimoto, and S. Sagayama, “Harmonic and percussive sound separation and its application to MIR-related tasks, ” in Advances in Music Information Retrieval . Springer Berlin Heidelberg, 2010, pp. 213–236. [228] H. T achibana, H. Kameoka, N. Ono, and S. 
Zafar Rafii received a PhD in Electrical Engineering and Computer Science from Northwestern University in 2014, and an MS in Electrical Engineering from both the Ecole Nationale Supérieure de l’Electronique et de ses Applications in France and the Illinois Institute of Technology in the US in 2006. He is currently a senior research engineer at Gracenote in the US. He also worked as a research engineer at Audionamix in France. His research interests are centered on audio analysis, somewhere between signal processing, machine learning, and cognitive science, with a predilection for source separation and audio identification in music.

Antoine Liutkus received the State Engineering degree from Télécom ParisTech, France, in 2005, and the M.Sc. degree in acoustics, computer science, and signal processing applied to music (ATIAM) from the Université Pierre et Marie Curie (Paris VI), Paris, in 2005.
He worked as a research engineer on source separation at Audionamix from 2007 to 2010 and obtained his PhD in electrical engineering at Télécom ParisTech in 2012. He is currently a researcher at Inria, France. His research interests include audio source separation and machine learning.

Fabian-Robert Stöter received the diploma degree in electrical engineering in 2012 from the Leibniz Universität Hannover and worked towards his Ph.D. degree in audio signal processing in the research group of B. Edler at the International Audio Laboratories Erlangen, Germany. He is currently a researcher at Inria, France. His research interests include supervised and unsupervised methods for audio source separation and signal analysis of highly overlapped sources.

Stylianos Ioannis Mimilakis received a Master of Science degree in Sound & Music Computing from Pompeu Fabra University and a Bachelor of Engineering in Sound & Music Instruments Technologies from the Higher Technological Education Institute of Ionian Islands. He is currently pursuing his Ph.D. in signal processing for music source separation, under the MacSeNet project at Fraunhofer IDMT. His research interests include inverse problems in audio signal processing and synthesis, singing voice separation, and deep learning.

Derry FitzGerald (PhD, M.A., B.Eng.) is a Research Fellow in the Cork School of Music at Cork Institute of Technology. He was a Stokes Lecturer in sound source separation algorithms at the Audio Research Group in DIT from 2008 to 2013. Previous to this, he worked as a post-doctoral researcher in the Dept. of Electronic Engineering at Cork Institute of Technology, having previously completed a Ph.D. and an M.A. at Dublin Institute of Technology. He has also worked as a Chemical Engineer in the pharmaceutical industry for some years. In the field of music and audio, he has also worked as a sound engineer and has written scores for theatre. He has utilised his sound source separation technologies to create the first ever officially released stereo mixes of several songs for the Beach Boys, including Good Vibrations and I Get Around. His research interests are in the areas of sound source separation and tensor factorizations.

Bryan Pardo, head of the Northwestern University Interactive Audio Lab, is an associate professor in the Northwestern University Department of Electrical Engineering and Computer Science. Prof. Pardo received an M.Mus. in Jazz Studies in 2001 and a Ph.D. in Computer Science in 2005, both from the University of Michigan. He has authored over 80 peer-reviewed publications. He has developed speech analysis software for the Speech and Hearing department of the Ohio State University, statistical software for SPSS, and worked as a machine learning researcher for General Dynamics. While finishing his doctorate, he taught in the Music Department of Madonna University. When he is not programming, writing, or teaching, he performs throughout the United States on saxophone and clarinet at venues such as Albion College, the Chicago Cultural Center, the Detroit Concert of Colors, Bloomington Indiana's Lotus Festival, and Tucson's Rialto Theatre.
