The Skipping Behavior of Users of Music Streaming Services and its Relation to Musical Structure

Nicola Montecchio, Pierre Roy, François Pachet
Spotify

Abstract

The behavior of users of music streaming services is investigated from the point of view of the temporal dimension of individual songs; specifically, the main object of the analysis is the point in time within a song at which users stop listening and start streaming another song (“skip”). The main contribution of this study is the ascertainment of a correlation between the distribution in time of skipping events and the musical structure of songs. It is also shown that such distribution is not only specific to the individual songs, but also independent of the cohort of users and, under stationary conditions, of the date of observation. Finally, user behavioral data is used to train a predictor of the musical structure of a song solely from its acoustic content; it is shown that the use of such data, available in large quantities to music streaming services, yields significant improvements in accuracy over the customary fashion of training this class of algorithms, in which only smaller amounts of hand-labeled data are available.

1 Introduction

Since the advent of online music streaming services, people have been able to easily access millions of songs on demand. As a consequence of this abundance, novel listening behaviors have emerged: departing from the passive, concentrated listening practice that is typical of media such as LPs, today people tend to listen to music in a much more frantic way than before. The central object of interest of this paper is the behavior of users regarding “skipping”: the act of interrupting a song in order to listen to the next song in the music queue (the queue possibly being the song’s album, a playlist in which the song figured, or a sequence of songs proposed by the recommendation engine of the streaming service).
It is our opinion that skipping is a crucial feature in understanding modern listening behaviors. For the first time in the history of musicology, researchers can systematically collect and analyze massive amounts of data about music listening behavior. Streaming services are only one of many possible contexts in which music is consumed; as such, one must be aware that any observed behavior is not necessarily generalizable to other situations (e.g., it is safe to assume that skipping behavior in the context of listening to LPs is radically different from the online music streaming case described in this paper); nonetheless, online streaming services are nowadays the preferred music consumption mechanism. Statistical analysis of user skipping behavior in time indeed yields fascinating information about how people listen and react to music.

We start by investigating the specificity of skipping behavior with respect to songs, in particular with respect to its consistency in the case of data collection on different dates and in different regions. We then identify a connection, which to the best of our knowledge has not been observed before, between skip behavior and musical structure (the segmentation of a musical work into musically relevant sections, such as “intro”, “verse”, “chorus”): listeners are more likely to skip a song right after a change between musical sections. Subsequently, we turn our attention to the task of predicting musical structure from acoustic content (i.e., from the raw audio waveform of a recording), a well known problem in the research community, commonly referred to as structural segmentation: we propose a novel approach that makes use of skipping behavior in order to train a prediction algorithm in a semi-supervised way, by exploiting data that can be extracted automatically in large quantities and is directly related to the perception of music by users.
We finish by discussing how future lines of inquiry that stem from this study, aimed at a better understanding of user reaction, could potentially lead to the development of compositional tools aimed at improving the reception of music by audiences.

[Figure 1: Skip profiles of individual songs and aggregate behavior. (a) An individual song: skipping likelihood versus time. (b) Aggregated streams of all songs: skipping likelihood versus time, relative to song duration.]

2 General Statistical Aspects of User Behavior

The principal object of study of this paper is the distribution (histogram) of the points in time at which users stop listening to a track, which will be referred to as the track’s “Skip Profile”; a typical example is depicted in Figure 1a. Visual inspection of Skip Profiles suggests a possible interpretation as the superposition of a general trend (pictured in Figure 1b, obtained through the aggregation of streaming data over the entire catalog) and a residual signal in which peaks concentrate at specific points in time.

In this Section, we investigate the collective behavior of users on the platform, as well as the specificity of Skip Profiles to songs and their consistency in time and across geographical regions.

2.1 Previous Work

Previous research has addressed the issue of modeling music listening; in particular, many works investigated why certain songs become more popular than others. [SDW06] shows that the non-uniformity of subjective preference may be largely explained by so-called cumulative advantage, therefore rendering a priori prediction of popularity rather pointless: music hits are inherently unpredictable, due to social pressure. Nonetheless, many works addressing “Hit Song Science” have been published (e.g.
[HMS14]), attempting to predict the popularity of a new song based on features automatically extracted from its acoustic content; criticism of this line of work includes [PR08]. All these studies, however, consider songs in their entirety, and do not consider the response of listeners along the temporal dimension.

A different line of research literature is concerned with the temporal aspect of user responses to musical stimuli. The object of the multiple experiments presented in [BKR+18] is to “identify the amount of time necessary to make accurate aesthetic judgments”; in this study the authors argue that such time is around 750 ms. Most skipping activity in the context of a commercial music streaming service also occurs in the very first few seconds of listening, thereby suggesting a similar ability of users to very quickly express (negative) aesthetic judgments: preliminary analysis of Skip Profiles [Lamb], averaged over millions of listeners and billions of plays obtained from the commercial streaming service Spotify, identifies a “steep drop off in listeners in the early part of a song, when most listeners are deciding whether or not to skip the song”. It must be pointed out that the context of [Lamb] (and of this paper), that is, the behavior of generic users in unspecified listening contexts, is radically different from the carefully designed and controlled experimental conditions of [BKR+18]; nonetheless, the findings of both are in agreement.

Related work done on large scale music listening behavior data includes the analysis of scrubbing behavior [Lama], that is, the practice of moving the cursor within the song in order to search for, and listen to, specific parts. The author showed how such data can be used to identify segments of particular interest in songs: instrumental solos, particularly dramatic moments, and, within the genre of electronic dance music, the “drop”.
[Figure 2: Collective skipping behavior of users. (a) Skipping likelihood at the beginning of a song: skip likelihood versus time [s]. (b) Probability in time of a song being streamed: probability of streaming versus time, relative to song duration.]

2.2 Average Behavior

Skipping is an overwhelmingly common behavior of users of streaming services: a quarter of all streamed songs are skipped within the first five seconds, and only roughly half of all songs are listened to in their entirety [Lamb]. That analysis, which dates back to 2014, was reproduced using song streams sampled in August 2018 from Spotify, and its results were confirmed.

As can be observed in Figures 1b (in relative time) and 2a (in absolute time), most skips indeed occur at the very beginning of songs; there is also a clear tendency to skip the ending of songs, which often contains several seconds of silence or long fadeouts. Figure 2b shows the (inverted) cumulative distribution of skipping with respect to (normalized) time (i.e., the probability in time that a song is still playing).

2.3 Specificity and Consistency

Figure 3a shows sections of the Skip Profiles of two songs, for which the data was collected on different days; this particular example shows profiles that are unique to their respective songs and consistent across collection dates. This observation prompts the question of whether this is a general property that holds over a larger collection of songs: can a song be identified from its Skip Profile? Subsequently, the evolution of Skip Profiles is analyzed considering data collected over an extended period of consecutive days and from different geographical regions.

Dataset. The dataset used in this Section is composed of 100 popular songs, released in April and May 2018.
As of June 1st, 2018, 12 of those songs were among the top 100 most popular songs (in terms of global number of streams), and 40 of them among the top 1000. The songs were selected by an expert musician, with the aim of maximizing variety among genres and avoiding multiple songs from the same artists. Over 3 billion skipping events have been collected from Spotify, spanning a period of three months across all countries in which the streaming service operates; most streaming activity (around 30%) originates from the United States, followed by Great Britain, Mexico, and Germany.

[Figure 3: Consistency of Skip Profiles in time. (a) Sections of Skip Profiles collected on different days: skipping likelihood versus time for Track A, Track A (different day), and Track B. (b) Distributions of same- and different-song distances: likelihood versus distance, for distances to the same tracks (different days) and to other tracks.]

2.3.1 Specificity of Skip Profiles

In order to study the specificity of the shape of Skip Profiles with respect to songs, the dataset was processed by considering the first two minutes of each profile (to account for the variability in length of songs), and normalizing each resulting profile fragment independently (to account for the different popularity of the songs); moreover, given that most of the skipping activity occurs in the very first instants, the initial 5 seconds of data are discarded as well. Finally, fragments are smoothed by median-filtering, to de-noise profiles derived from smaller amounts of available streaming data.

The Euclidean distance between the vector representations of Skip Profile fragments then provides a straightforward way to measure specificity: profiles should ideally be closer to other profiles associated with the same song (their streaming data being collected on different days) than to profiles
associated with different songs. Figure 3b shows that this is indeed the case, by picturing the distributions of same-song and different-song distances.

This specificity can be further quantified by framing the problem as an Information Retrieval evaluation task [SMR08]. Given a query, namely a Skip Profile for a random (track, date) pair, the dataset is sorted by Euclidean distance from the query. Retrieved profiles are considered relevant if associated with the query track on different dates, non-relevant otherwise. Evaluation can then be carried out using common measures, such as Mean Average Precision (MAP; see footnote 1). A random baseline for the experiment on this dataset (obtained by returning a randomly shuffled list of results) yields MAP = 0.014; in contrast, the methodology discussed above yields MAP = 0.886, confirming the hypothesis of specificity of Skip Profiles.

2.3.2 Evolution of Skip Profiles in Time

Let us consider a signal consisting of the differences (in terms of Euclidean distance) between Skip Profiles collected on subsequent days; several possibilities arise: a single Skip Profile can be selected as reference (that corresponding to the first day being the obvious choice), or the day-to-day difference between consecutive days can be examined. In this section, the analysis is carried out on one month of data collected following each song’s release date; the release dates on the streaming service correspond to those of their general availability.

Empirical analysis of the difference signal between subsequent days showed no remarkable anomaly, i.e., there appears to be no point in time in which user behavior suddenly changes. On the other hand, the evolution of the difference signal with respect to the release date shows more variability, and a determining factor seems to be the trajectory of the number of streams; Figure 4 exemplifies the two most common cases.
A steady streaming behavior (Figure 4a) is associated with a relatively constant distribution of Skip Profiles over the different days (one can also notice how the weekly listening patterns are reflected in the evolution of the Skip Profile). On the other hand, the songs exhibiting significant changes in Skip Profile over different days tend to be those for which the amount of streams changes significantly: Figure 4b shows a declining amount of streams per day, but the behavior for rising amounts of streams is similar. Averaging over the dataset, one can better sense the overall evolution: Figure 4c shows, as one could reasonably expect, that in the first two weeks the behavior undergoes the most changes, after which Skip Profiles are fairly stable.

[Figure 4, panels (a) and (b): difference (day to day, and since first day) and number of plays per day versus number of days since release. (a) A song with a constant number of streams. (b) A song with a declining number of streams.]

Footnote 1: Precision is defined as the fraction of relevant documents among the retrieved ones. Average Precision is the average of the Precision values obtained considering the subsequences of the retrieved documents (sorted by relevance), up to each relevant document in the collection. Mean Average Precision is the mean, over different queries, of the Average Precision value. The range for all these measures is [0, 1].
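As an illustration of the retrieval evaluation used in Section 2.3.1 and defined in footnote 1, the following is a minimal sketch of Mean Average Precision over Skip Profile vectors. It is not the authors' implementation: the function names and the toy data are hypothetical, and every profile is used once as a query, whereas the paper describes queries drawn from random (track, date) pairs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length profile vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_precision(ranked_relevance):
    """Average Precision of a ranked list of booleans (True = relevant).

    Precision is evaluated at the rank of each relevant item and averaged
    over the relevant items found in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(profiles):
    """MAP over a list of (track_id, vector) pairs.

    Assumption: each profile serves once as the query; the remaining
    profiles are ranked by increasing distance, and a retrieved profile is
    relevant when it belongs to the same track as the query.
    """
    ap_values = []
    for qi, (q_track, q_vec) in enumerate(profiles):
        ranked = sorted(
            ((euclidean(q_vec, vec), track == q_track)
             for i, (track, vec) in enumerate(profiles) if i != qi),
            key=lambda pair: pair[0],
        )
        ap_values.append(average_precision([rel for _, rel in ranked]))
    return sum(ap_values) / len(ap_values)


# Toy data (hypothetical): two tracks, each observed on two different days.
toy_profiles = [
    ("A", [0.0, 0.0]), ("A", [0.1, 0.0]),
    ("B", [1.0, 1.0]), ("B", [1.0, 1.1]),
]
toy_map = mean_average_precision(toy_profiles)
```

In this toy case every query retrieves its same-track profile first, so the score reaches the top of the [0, 1] range; a shuffled ranking would play the role of the random baseline mentioned in the text.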
[Figure 4: Difference between Skip Profiles collected on different days. (c) Change, with respect to the release date, over the dataset: difference from reference skip profile versus days since reference skip profile.]

2.3.3 Consistency of Profiles across Regions

The geographical location of users presents an additional way of subdividing user behavior data. As was the case before, the object of study is the consistency of Skip Profiles, this time across a different data partitioning scheme.

Streaming data was collected for the same set of songs as in the experiments presented above, and subdivided by country. In order to avoid being influenced by the day of release, the data was collected in August 2018, to make sure that user behavior for those songs is stationary and that no effects caused by proximity to the release date can be observed; Skip Profiles composed of fewer than 100,000 streams were discarded.

As anticipated, empirical observation of Skip Profiles collected across different countries did not suggest any difference among partitions. To validate this hypothesis, we repeated the experiments of Section 2.3.1. The resulting distribution of same- and different-song distances is similar to the one originating from the time-based partitioning; a MAP score of 0.939, in contrast to a baseline of 0.017, was obtained, thus confirming the hypothesis of consistency of profiles across data collection regions.

3 User Behavior and Musical Structure

Individual Skip Profiles exhibit deviations from their aggregated trend (Figure 1b) in a way that is closely related to musical structure. Figure 5a shows a song’s Skip Profile, overlaid with musical structural boundaries. It is visually evident that section boundaries are commonly followed by surges in the likelihood of skipping.

This Section investigates the correlation between user skipping behavior and musical structure.
To the best of the authors’ knowledge, such correlation has not been observed before. It is first shown that the location of musically relevant boundaries can be predicted directly from skip behavioral data. The accuracy of the prediction is then evaluated against a collection of existing algorithms, developed in the field of Music Information Retrieval, that predict musical section boundaries from the content of recordings (i.e., the acoustic waveforms). Skipping behavior data is finally exploited to automatically generate large quantities of training data (otherwise very cumbersome and costly to create manually) for content-based Machine Learning algorithms, and the accuracy of algorithms trained using such data is evaluated against the customary way of training, as well as multiple other baselines.

3.1 Related Work

Content-based musical structure segmentation has been a central research subject in the field of Music Information Retrieval for many years. A comprehensive overview of the topic [PMK10] defines the main goal of a structural segmentation algorithm: “to divide an audio recording into temporal segments corresponding to musical parts and to group these segments into musically meaningful categories”. The input of such algorithms is the audio content of a music recording (i.e., a digital encoding of the acoustic waveform); the output is a partitioning of a song into (usually) non-overlapping windows. Some algorithms associate a label with each section, which can be descriptive (such as “bridge” and “chorus”) or merely identify repeated sections (e.g., “A”, “B” and so on).
As discussed in the above-mentioned paper, most algorithms can be categorized into three conceptual approaches, based on repetition, novelty and homogeneity, that is, the identification of, respectively, recurring patterns, transitions between contrasting parts, and contiguous sections that are consistent with respect to some musical property; algorithms typically make use of features that attempt to take into account musical characteristics such as melody, harmony, rhythm, and timbre. A more recent overview of the State of the Art in the field can be found in [Nie15].

Research on the topic has long been hindered by the lack of sizeable amounts of expert annotations (the manual segmentations and labelings of recordings done by musically-competent individuals), because of the time-consuming nature of the annotation work, coupled with the legal restrictions involved in sharing copyrighted recordings. Initiatives such as the SALAMI dataset [SBF+11] attempt to fill this need by providing a relatively large (several hundreds) source of annotations for recordings, many of which are in the public domain.

[Figure 5, panel (a): Skip Profile and annotated musical structural boundaries: skipping likelihood versus time. Annotation labels, as extracted: INTRO (male voice 1) (beat + male voice 2) 0:15 CHORUS (male voice 1) 0:42 (male voice 2) 0:55 VERSE 2 (male voice 1) 1:10 (beat) 1:23 (male voice 1) 1:49 CHORUS 2:18 (male voice 2) 2:31 VERSE 3 (male voice 3) 2:45 (beat) 2:58 (male voice 2 & 3) 3:12 BRIDGE (male voice 1 + beat) 3:27 CHORUS (male voice 2) 3:53 ENDING (male spoken voices) 4:14.]

An alternative approach to deal with the aforementioned issues is the release of the algorithms themselves as open source software: MSAF [NB16] is the leading effort in that regard within the
Music Information Retrieval community, and couples that goal with an evaluation framework; the latter is in turn based on MIReval [RMH+14], a reference implementation of a large set of common music-specific IR evaluation metrics. The remainder of this paper makes use of the following algorithm implementations borrowed from MSAF: cnmf [NJ13], foote [Foo00], olda [ME14b], scluster [ME14a], sf [SMGA14]. Furthermore, another reference algorithm is given by The Echo Nest Analyzer, based on [Jeh05], whose results can be accessed through Spotify’s public API.

Newer approaches, based on more recent Machine Learning techniques, include [USG14] and [GS15], which achieve State of the Art accuracy in prediction by making use of a Convolutional Neural Network architecture.

[Figure 5: Skipping behavior and musical structure. (b) De-trended Skip Profile and musical structural boundaries: de-trended skip likelihood versus time. (c) Likelihood of structural boundaries, predicted from the Skip Profile: structural boundaries likelihood versus time. (d) Likelihood of structural boundaries, predicted from the acoustic content: structural boundaries likelihood versus time.]

3.2 Correlation of Skip Profiles and Musical Structure

In order to emphasize the regions of a Skip Profile that depart significantly from its overall course, a de-trending algorithm is applied, the results of which are depicted in Figure 5b. The core of the procedure is a deliberately simple heuristic (a combination of median and low-pass filtering) aimed at isolating the general trend of the profile; this trend is subsequently subtracted from the original signal and the residual signal is rectified.
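The de-trending heuristic can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the text does not specify the filter widths, the type of low-pass filter, or the kind of rectification, so the window sizes, the moving-average filter, and the half-wave rectification used here are hypothetical choices.

```python
from statistics import median


def median_filter(signal, width):
    """Sliding-window median; windows are truncated at the signal edges."""
    half = width // 2
    n = len(signal)
    return [median(signal[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def moving_average(signal, width):
    """Simple low-pass filter: sliding-window mean, truncated at the edges."""
    half = width // 2
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = signal[max(0, i - half):min(n, i + half + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def detrend(profile, median_width=31, lowpass_width=31):
    """Estimate the trend, subtract it, and rectify the residual.

    Assumptions (not from the paper): widths are expressed in histogram
    bins, and rectification keeps only positive deviations from the trend.
    """
    trend = moving_average(median_filter(profile, median_width), lowpass_width)
    return [max(0.0, value - level) for value, level in zip(profile, trend)]


# Toy profile (hypothetical): a flat trend with one surge in skips.
toy_profile = [1.0] * 50
toy_profile[25] = 5.0
toy_residual = detrend(toy_profile)
```

On the toy profile the median filter ignores the isolated surge, so the estimated trend stays flat and the residual is zero everywhere except at the surge, which is the kind of peak that Figure 5b emphasizes.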
This graphical representation makes it even clearer that surges in skips regularly follow section boundaries, after a short delay. Such delay (estimated to be around 3.5 s) can be interpreted as the sum of two components: the time it takes users to realize they no longer want to listen to the current song (a realization triggered by the crossing of a section boundary) and the time spent interacting with the reproduction device (tapping or clicking on the “next” button).

To prove the relation between user behavior and musical structure we show that the latter (specifically, the location of section boundaries) can be predicted from the former (a Skip Profile). To this end, a compact Feed-Forward Neural Network (fewer than 50k parameters) is trained on short segments (30 s) of Skip Profiles; the objective of the network is to predict whether the central location of each particular segment in the original signal falls close enough (within 1 s) to a section boundary. It is sufficient to annotate only a few dozen songs to obtain satisfactory performance, and the results of such procedure can be observed in Figure 5c, clearly showing the strong relation between musical structure and user behavior (a more rigorous quantitative evaluation is carried out in Section 3.4 below).

3.3 Training of Acoustic Content-based Structure Prediction Models using Behavioral Data

An existing algorithm from the Music Information Retrieval literature is exploited to demonstrate the effectiveness of using Skip Profiles for training content-based section prediction algorithms.

The algorithm, detailed in [USG14], is based on a straightforward, well understood Convolutional Neural Network architecture.
The architecture was originally designed in [SB14] for the task of onset detection (the task of detecting the instants at which musical events, such as individual notes or chords, occur); for structural prediction it is unchanged, except for the longer input ranges considered. The model is essentially a binary classifier that predicts the likelihood of the presence of a section boundary in the center of an audio excerpt. The network comprises two convolutional layers, each followed by a max-pooling layer, feeding one dense layer that is finally projected into a scalar output.

The input of the network is a segment of an audio recording, for which a Mel-Spectrogram is extracted; the latter is a common transformation in the music signal processing literature [Mül15], consisting of a spectrogram (a time-frequency representation obtained by concatenating the individual magnitudes of the Fourier transforms of short, overlapping excerpts of the signal) whose frequency bands are then warped according to a perceptual (“Mel”) scale. The output is a scalar, whose value depends on the distance of the closest section boundary from the center of the audio input. In [USG14] a strategy known as target smearing is employed, which accounts for the inaccuracy of ground truth boundary annotations: during the training phase, only the excerpts centered on a frame that is sufficiently close to a section boundary are presented to the network as positive examples, and their target value is the weight of a Gaussian kernel, evaluated at the distance in time between the center of the excerpt and the closest section boundary.

Fine-grained annotation of individual songs, in terms of the locations of structural boundaries, is used for training this class of algorithms: such annotation is carried out manually in a very laborious and time consuming way.
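The target smearing strategy described above can be illustrated with a short sketch. It is a hedged reading of the description rather than the implementation of [USG14]: the kernel width `sigma` and the cut-off `max_distance` are hypothetical parameters whose values are not given in the text, and excerpts beyond the cut-off are simply assigned a target of zero here.

```python
import math


def smeared_target(center_time, boundaries, sigma=1.0, max_distance=1.5):
    """Training target for an excerpt centered at `center_time` (seconds).

    The target is the weight of an unnormalized Gaussian kernel evaluated at
    the distance to the closest annotated boundary. Assumptions (not from
    the paper): `sigma` and `max_distance` are illustrative values, and
    excerpts farther than `max_distance` from every boundary get target 0.
    """
    if not boundaries:
        return 0.0
    distance = min(abs(center_time - b) for b in boundaries)
    if distance > max_distance:
        return 0.0
    return math.exp(-0.5 * (distance / sigma) ** 2)


def smeared_targets(duration, boundaries, hop=0.5, **kwargs):
    """Targets for excerpt centers placed every `hop` seconds in a recording."""
    n_frames = int(duration / hop) + 1
    return [smeared_target(i * hop, boundaries, **kwargs)
            for i in range(n_frames)]


# Toy annotation (hypothetical): boundaries at 10 s and 20 s in a 30 s clip.
toy_targets = smeared_targets(30.0, [10.0, 20.0])
```

The effect is that an excerpt centered exactly on an annotated boundary receives the full target value, while excerpts slightly off the annotation still count as weaker positives, which tolerates small timing errors in the ground truth.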
In order to exploit the correlation between Skip Profile and structure, the model described in Section 3.2 is used to generate large quantities of training data. From the model predictions, only regions of very high and very low boundary likelihood are retained; empirical observation suggests that false positive samples are generated more frequently than false negative ones, as can be seen in Figure 5c.

The trained network can be used to create a prediction by repeatedly applying it to adjacent, overlapping segments of a recording. The output is a vector whose length is proportional to the length of the input recording. In order to extract a discrete set of time instants (the estimated structural boundaries) from it, a peak-picking procedure is used: a point in the likelihood vector is considered a peak if it is a local maximum and is far enough (a few seconds) from other peaks; a threshold is set to half the value of the likelihood of the third-highest peak. An example prediction is pictured in Figure 5d, along with the estimated boundaries marked with circles.

3.4 Evaluation

The evaluation of segmentation algorithms is customarily formulated in terms of (approximate) overlap between two sets of instants in time: the actual timings of section boundaries (as annotated by an expert curator) and those predicted by an algorithm. Each prediction is considered a hit if it falls within a certain range from any reference boundary timing, a miss otherwise. It is common in the literature to consider two such interval sizes, of 0.5 s and 3 s, and to use F-score (see footnote 2) as the preferred evaluation measure. The evaluation of segmentation algorithms forms the subject of [NFJB14], in which it is argued that an appropriate weighting of the F-Score factors (as opposed to the default unit weighting) is a measure that better corresponds to human perception; the results presented below conform to such weighting scheme.
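The peak-picking step and the hit-rate evaluation can be sketched together as follows. This is a simplified illustration, not the evaluation code used for the reported results: `min_separation`, `hop_seconds`, and `beta` are hypothetical parameters, the specific weighting of [NFJB14] is not reproduced, and hits are counted without one-to-one matching between predictions and references, which reference implementations may enforce.

```python
def pick_peaks(likelihood, hop_seconds, min_separation=6.0):
    """Extract boundary times (seconds) from a boundary-likelihood vector.

    A point is kept if it is an interior local maximum and at least
    `min_separation` seconds away from every higher peak already kept.
    Peaks below half the value of the third-highest kept peak are discarded.
    Assumption: with fewer than three peaks, the lowest one sets the threshold.
    """
    n = len(likelihood)
    candidates = [i for i in range(1, n - 1)
                  if likelihood[i] > likelihood[i - 1]
                  and likelihood[i] >= likelihood[i + 1]]
    candidates.sort(key=lambda i: likelihood[i], reverse=True)
    min_gap = min_separation / hop_seconds
    kept = []  # indices in decreasing order of likelihood
    for i in candidates:
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    if not kept:
        return []
    threshold = 0.5 * likelihood[kept[min(2, len(kept) - 1)]]
    return sorted(i * hop_seconds for i in kept if likelihood[i] >= threshold)


def hit_rate_f(reference, estimated, window=0.5, beta=1.0):
    """Weighted F-score of boundary hits within `window` seconds.

    `beta` weights recall relative to precision (beta = 1 gives the plain
    harmonic mean); its value here is illustrative only.
    """
    if not reference or not estimated:
        return 0.0
    precision = sum(any(abs(e - r) <= window for r in reference)
                    for e in estimated) / len(estimated)
    recall = sum(any(abs(r - e) <= window for e in estimated)
                 for r in reference) / len(reference)
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy likelihood vector (hypothetical), one value per second.
toy_likelihood = [0.0] * 40
for index, value in [(5, 0.9), (15, 0.8), (17, 0.7), (25, 0.6), (35, 0.2)]:
    toy_likelihood[index] = value
toy_boundaries = pick_peaks(toy_likelihood, hop_seconds=1.0)
toy_score = hit_rate_f([5.2, 15.0, 30.0], toy_boundaries, window=0.5)
```

In the toy example the peak at 17 s is suppressed by its higher neighbor at 15 s, and the weak peak at 35 s falls below the threshold derived from the third-highest peak.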
Datasets. The experiments were carried out making use of several datasets. No song belongs to multiple sets.

• SALAMI [SBF+11]: already mentioned in Section 3.1, it comprises the annotations for 1164 songs; however, the recordings for only a subset of these songs are publicly available, yielding 376 usable recording/annotation pairs. In case of multiple annotations per song, only the first one was considered. These songs are not commercial recordings, hence no associated user behavior data is available.

• TOP100: the dataset consists of the one hundred most popular songs (by number of streams, globally) on Spotify as of April 1, 2018. The structure of each song was manually annotated by a single annotator (a professional musician). The Skip Profiles for these songs, derived from roughly 1 billion streams, were collected over the month of May 2018.

• SP20k: a dataset of the Skip Profiles for roughly 20k songs; only songs with at least 100k streams over a period of 3 months are retained, totalling about 81 billion streams.

In the TOP100 dataset, two types of boundaries were annotated: “structural”, which only includes proper structural boundaries (such as “Intro”, “Chorus”, “Verse”), and “extended”, which includes additional significant events happening in the song (e.g., the entrance of a different singer within a musical section). In Figure 5a the two types of boundaries can be observed (solid line for structural boundaries, dashed line for non-structural, extended boundaries). Non-structural extended boundaries often occur half-way through a musical section: a typical case is a verse constituted by the repetition of a musical phrase, in which the second occurrence is characterized by the presence of additional (usually percussive) backing instruments.

Results. Table 1 reports the results of the evaluation, in terms of Weighted F-Score for the Hit-Rate metric, of several algorithms on the TOP100 dataset.
Footnote 2: F-score is defined as the harmonic mean of Precision and Recall; the latter is defined as the fraction of relevant instances that have been retrieved over the total amount of relevant instances.

Algorithm                              Structural boundaries    Extended boundaries
(“hit” window size)                    0.5 s      3 s           0.5 s      3 s
baseline                               0.053      0.273         0.079      0.372
foote [Foo00]                          0.085      0.415         0.127      0.511
olda [ME14b]                           0.158      0.504         0.217      0.609
scluster [ME14a]                       0.126      0.354         0.170      0.474
sf [SMGA14]                            0.106      0.425         0.165      0.502
ten [Jeh05]                            0.152      0.536         0.190      0.607
skipprofile-nn [Section 3.2]           0.245      0.630         0.278      0.636
audio-nn-salami [USG14]                0.226      0.464         0.287      0.522
audio-nn-skip-profiles [Section 3.3]   0.259      0.560         0.307      0.638
audio-nn-finetune [Section 3.3]        0.311      0.575         0.373      0.658

Table 1: Hit Rate (weighted F-Score) for several algorithms, on the TOP100 dataset.

The baseline entry refers to a trivial estimator, which always predicts boundaries at fixed regular intervals; the particular spacing (12 s) has been determined through a grid-search procedure, in order to obtain the highest possible F-Score for the baseline. This particular baseline methodology mirrors that of [USG14], in which the reported values are however significantly higher (0.13 and 0.33, for window sizes of 0.5 s and 3 s, respectively); the large difference is attributable to the different evaluation dataset used, and to the weighting scheme applied to the F-Score metric.

Next, the foote, olda, scluster, and sf entries refer to algorithms that have an open-source implementation provided by MSAF (see footnote 3), and ten refers to a commercial algorithm (The Echo Nest, now part of Spotify). In all of these instances, the default parameters (if any) were utilized, and no training or optimization procedure was performed.
The entry skipprofile-nn refers to the predictor described in Section 3.2 (due to the need to use the TOP100 dataset for this task, the reported values are computed through 5-fold cross-validation) and validates the fundamental thesis proposed in this paper, namely, that musical structure and user behavior are correlated.

The entry audio-nn-salami represents the model and training procedure described in [USG14]. The original paper presents several variants of the model, the largest of which was adopted. Attempts were made to re-implement the described model as closely as possible. Two factors complicated this: the smaller training set available to us (this algorithm is trained using the publicly available, 376-song subset of SALAMI as detailed above, as opposed to a dataset of 1220 annotated recordings available to the authors of the original paper), and possible misinterpretations of unclear passages in the description. Several experiments were therefore performed to find the best settings among variations of the original model and learning strategies.

The entry audio-nn-skip-profiles represents the same model as audio-nn-salami, trained using data derived from user behavior (the SP20k dataset) as described in Section 3.3. Exactly the same parametrization and learning strategy as before were used when performing this experiment, in order to restrict the difference in outcome to the data source. The reported results show that it is indeed possible to achieve state-of-the-art performance using large quantities of unannotated training data.

The final entry, audio-nn-finetune, obtains the best performance by combining both sources of training data. The audio-nn-skip-profiles model, discussed above, is taken as the starting point; subsequently, it is fine-tuned by training the last layer using the SALAMI dataset.
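The idea of fine-tuning only the last layer can be sketched on a toy model. The code below is an illustration under strong assumptions, not the network used in the study: a single frozen tanh layer stands in for the pretrained feature extractor, a logistic output layer stands in for the last layer, and all names and hyperparameters are hypothetical.

```python
import math


def features(frozen_weights, x):
    """Frozen feature extractor: one tanh layer whose weights are never updated."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in frozen_weights]


def predict(frozen_weights, last_layer, x):
    """Boundary probability: logistic output layer on top of the frozen features."""
    h = features(frozen_weights, x) + [1.0]  # trailing 1.0 is the bias input
    z = sum(w * v for w, v in zip(last_layer, h))
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(frozen_weights, last_layer, data):
    """Mean binary cross-entropy over (input, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(frozen_weights, last_layer, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def fine_tune_last_layer(frozen_weights, last_layer, data, lr=0.5, epochs=200):
    """Batch gradient descent on the output layer only.

    `frozen_weights` is read but never modified; a new list of
    last-layer weights is returned.
    """
    weights = list(last_layer)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in data:
            h = features(frozen_weights, x) + [1.0]
            error = predict(frozen_weights, weights, x) - y  # d(log loss)/dz
            for k, v in enumerate(h):
                grad[k] += error * v / len(data)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights
```

In this sketch the frozen layer plays the role of the representation learned from Skip Profiles, and the small labeled set passed as `data` plays the role of the hand-annotated examples; only the output weights change during fine-tuning.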
Fine-tuning in this way allows the model to exploit both Skip Profiles and manually assembled sources: the former, a source of unreliable data available in large quantities, is used to learn a robust feature representation, while the latter is used to efficiently exploit the high-precision nature of a small, hand-curated dataset.

Footnote 3: https://github.com/urinieto/msaf

4 Discussion

The original goal of the study was to leverage massive amounts of fine-grained information collected by streaming services, in order to better understand how songs are received by their audience. A better understanding could in principle be used to inform the design of novel compositional tools that take into account a model of the listener grounded in actual music consumption data.

Through the investigation of large-scale user behavior, a previously unknown correlation between skipping and musical structure has emerged. A joint analysis of the distribution of musical sections and the overall trend of skip profiles, analyzed across a large catalog (as opposed to with respect to individual songs), is left to future research. It is the authors’ opinion that such analysis has the potential to pave the way for the design of tools that can leverage and anticipate user response in order to provide useful guidance to creators.

An additional future research direction is the modeling of the response of individual users to the songs to which they listen. The effect that musical features (such as genre, mood, or instrumentation, to name a few) have on the reception by users is a well-studied problem in the literature and in industry, and is closely tied to the field of Recommender Systems.
The authors are, however, not aware of existing work that attempts to jointly exploit content-based Music Information Retrieval methods and user modeling with fine-grained temporal user information to influence recommendation, and believe that this study can provide a starting point for such future research directions.

Acknowledgements

The authors would like to acknowledge the contribution of Fernando Diaz for the valuable insights during the initial phases of this work.

References

[BKR+18] Amy M Belfi, Anna Kasdan, Jess Rowland, Edward A Vessel, G Gabrielle Starr, and David Poeppel. Rapid timing of musical aesthetic judgments. Journal of Experimental Psychology: General, 2018.
[Foo00] Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on, volume 1, pages 452–455. IEEE, 2000.
[GS15] Thomas Grill and Jan Schlüter. Music boundary detection using neural networks on combined features and two-level annotations. In ISMIR, pages 531–537, 2015.
[HMS14] Dorien Herremans, David Martens, and Kenneth Sörensen. Dance hit song prediction. Journal of New Music Research, 43(3):291–302, 2014.
[Jeh05] Tristan Jehan. Creating music by listening. PhD thesis, Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2005.
[Lama] Paul Lamere. The drop machine. https://musicmachinery.com/2015/06/16/the-drop-machine. Accessed: 2018-07-15.
[Lamb] Paul Lamere. The skip. https://musicmachinery.com/2014/05/02/the-skip/. Accessed: 2018-07-15.
[ME14a] Brian McFee and Dan Ellis. Analyzing song structure with spectral clustering. In ISMIR, pages 405–410, 2014.
[ME14b] Brian McFee and Daniel PW Ellis. Learning to segment songs with ordinal linear discriminant analysis. Self, 275:330, 2014.
[Mül15] Meinard Müller. Fundamentals of music processing: Audio, analysis, algorithms, applications. Springer, 2015.
[NB16] Oriol Nieto and Juan Pablo Bello. Systematic exploration of computational music structure research. In ISMIR, pages 547–553, 2016.
[NFJB14] Oriol Nieto, Morwaread M Farbood, Tristan Jehan, and Juan Pablo Bello. Perceptual analysis of the f-measure for evaluating section boundaries in music. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), pages 265–270, 2014.
[Nie15] Oriol Nieto. Discovering structure in music: Automatic approaches and perceptual evaluations. PhD thesis, New York University, 2015.
[NJ13] Oriol Nieto and Tristan Jehan. Convex non-negative matrix factorization for automatic music structure identification. In ICASSP, pages 236–240, 2013.
[PMK10] Jouni Paulus, Meinard Müller, and Anssi Klapuri. Audio-based music structure analysis. In 11th International Society for Music Information Retrieval Conference, pages 625–636. ISMIR, 2010.
[PR08] François Pachet and Pierre Roy. Hit song science is not yet a science. In Juan Pablo Bello, Elaine Chew, and Douglas Turnbull, editors, ISMIR 2008, 9th International Conference on Music Information Retrieval, Drexel University, Philadelphia, PA, USA, September 14-18, 2008, pages 355–360, 2008.
[RMH+14] Colin Raffel, Brian McFee, Eric J Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel PW Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR. Citeseer, 2014.
[SB14] Jan Schlüter and Sebastian Böck. Improved musical onset detection with convolutional neural networks. In ICASSP, pages 6979–6983, 2014.
[SBF+11] Jordan Bennett Louis Smith, John Ashley Burgoyne, Ichiro Fujinaga, David De Roure, and J Stephen Downie. Design and creation of a large-scale database of structural annotations. In ISMIR, volume 11, pages 555–560. Miami, FL, 2011.
[SDW06] Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762):854–856, 2006.
[SMGA14] Joan Serra, Meinard Müller, Peter Grosche, and Josep Ll Arcos. Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Transactions on Multimedia, 16(5):1229–1240, 2014.
[SMR08] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. Introduction to information retrieval, volume 39. Cambridge University Press, 2008.
[USG14] Karen Ullrich, Jan Schlüter, and Thomas Grill. Boundary detection in music structure analysis using convolutional neural networks. In ISMIR, pages 417–422, 2014.
