REVISITING THE PROBLEM OF AUDIO-BASED HIT SONG PREDICTION USING CONVOLUTIONAL NEURAL NETWORKS

Li-Chia Yang*, Szu-Yu Chou*, Jen-Yu Liu*, Yi-Hsuan Yang*, Yi-An Chen†

*Research Center for Information Technology Innovation, Academia Sinica, Taiwan
†Machine Learning Research Team, KKBOX Inc., Taiwan

ABSTRACT

Being able to predict whether a song can be a hit has important applications in the music industry. Although it is true that the popularity of a song can be greatly affected by external factors such as social and commercial influences, to what degree audio features computed from musical signals (which we regard as internal factors) can predict song popularity is an interesting research question on its own. Motivated by the recent success of deep learning techniques, we attempt to extend previous work on hit song prediction by jointly learning the audio features and prediction models using deep learning. Specifically, we experiment with a convolutional neural network model that takes the primitive mel-spectrogram as the input for feature learning, a more advanced JYnet model that uses an external song dataset for supervised pre-training and auto-tagging, and the combination of these two models. We also consider the inception model to characterize audio information at different scales. Our experiments suggest that deep structures are indeed more accurate than shallow structures in predicting the popularity of either Chinese or Western Pop songs in Taiwan. We also use the tags predicted by JYnet to gain insights into the results of the different models.

Index Terms — Hit song prediction, deep learning, convolutional neural network, music tags, cultural factors

1. INTRODUCTION

The popularity of a song can be measured a posteriori according to statistics such as the number of digital downloads, playcounts, or listeners, or whether the song has been listed in the Billboard Chart once or multiple times.
However, for music producers and artists, it would be more interesting if song popularity could be predicted a priori, before the song is actually released. For music streaming service providers, an automatic function to identify emerging trends or to discover potentially interesting but not-yet-popular artists is desirable to address the so-called "long tail" of music listening [1]. In academia, researchers are also interested in understanding the factors that make a song popular [2, 3]. This can be formulated as a pattern recognition problem, where the task is to generalize the observed association between song popularity measurements and feature representations characterizing the songs in the training data to unseen songs [4]. (This work was partially supported by the Ministry of Science and Technology of Taiwan under Contracts 104-2221-E-001-029-MY3 and 105-2221-E-001-019-MY2.)

Our literature survey shows that this automatic hit song prediction task has been approached using mainly two different information sources: 1) internal factors directly relating to the content of the songs, including different aspects of audio properties, song lyrics, and the artists; 2) external factors encompassing social and commercial influences (e.g. concurrent social events, promotions, or album cover design).

The majority of previous work on the internal factors of song popularity is concerned with the audio properties of music. The early work of Dhanaraj and Logan [4] used support vector machines to classify whether a song will appear in music charts, based on latent topic features computed from audio Mel-frequency cepstral coefficients (MFCCs) and song lyrics. Following this work, Pachet et al. [5] employed a large number of audio features commonly used in music information retrieval (MIR) research and concluded that the features they used are not informative enough to predict hits, claiming that hit song science is not yet a science. Ni et al.
[6] took a more optimistic stand, showing that certain audio features such as tempo, duration, loudness, and harmonic simplicity correlate well with the evolution of musical trends. However, their work analyzes the evolution of hit songs [7–9] rather than discriminating hits from non-hits. Fan et al. [10] performed audio-based hit song prediction for music charts in mainland China and the UK, and found that Chinese hit song prediction is more accurate than the UK version. Purely lyric-based hit song prediction has been relatively unexplored, except for the work presented by Singhi and Brown [11]. On the other hand, regarding external factors, Salganik et al. [12] showed that the song itself plays a relatively minor role compared to social influences in deciding whether a song can be a hit. Zangerle et al. [13] used Twitter posts to predict future charts and found that Twitter posts are helpful when the music charts of the recent past are available.

To our best knowledge, despite their recent success in various pattern recognition problems, deep learning techniques have not been employed for hit song prediction. In particular, in speech and music signal processing, convolutional neural network (CNN) models have exhibited remarkable strength in learning task-specific audio features directly from data, outperforming models based on hand-crafted audio features in many prediction tasks [14–16].

We are therefore motivated to extend previous work on audio-based hit song prediction by using state-of-the-art CNN-based models, using either the primitive, low-level mel-spectrogram directly as the input for feature learning, or a more advanced setting [17] that exploits an external music auto-tagging dataset [18] for extracting high-level audio features. Moreover, instead of using music charts, we use a collection of user listening data from KKBOX Inc., a leading music streaming service provider in East Asia.
We formulate hit song prediction as a regression problem and test how well we can predict the popularity of Chinese and Western Pop music among Taiwanese KKBOX users, whose mother tongue is Mandarin. Therefore, in addition to testing whether deep models outperform shallow models in hit song prediction, we also investigate how the cultural origin of songs affects the performance of different CNN models.

2. DATASET

Because we are interested in discriminating hits from non-hits, we find it informative to use the playcounts a song receives over a period of time from streaming services to define song popularity, and to formulate a regression problem to predict song popularity. In collaboration with KKBOX Inc., we obtained a subset of user listening records contributed by Taiwanese listeners over a time horizon of one year, from Oct. 2012 to Sep. 2013, involving the playcounts of close to 30K users for around 125K songs. Based on the language metadata provided by KKBOX, we compile a Mandarin subset featuring Chinese Pop songs and a Western subset comprising songs sung mainly in English. There are more songs in the Western subset, but the Mandarin songs receive more playcounts on average, as Mandarin is the mother tongue of Taiwanese listeners.

The following steps are taken to gain insights into the data and for data pre-processing. First, as the songs in our dataset were released at different times, we need to check whether we have to compensate for this bias, for intuitively songs released earlier can solicit more playcounts. We plot in Fig. 1 the average playcounts of songs released in different time periods, where Q1 denotes the first three months starting from Oct. 2012, −Q1 the most recent three months before Oct. 2012, and so on. The y-axis is in log scale, but the actual values are obscured due to a confidentiality agreement with KKBOX.
From the dashed lines we see that the average playcounts from different time periods fall within a moderate range in the log scale for both subsets, obviating the need to compensate for the time bias with further operations.

Fig. 1. The average playcounts (in log scale) of songs released in different time periods.

Fig. 2. The distribution of hit scores (see Section 2 for definition) in the (left) whole and (right) test sets.

Second, we define the hit score of a song as the product of its playcount in log scale and the number of users (also in log scale) who have listened to the song. We opt not to use the playcounts alone to measure song popularity, because the playcount of a song may be contributed by only a very small number of users. Third, to make our experimental results on the two subsets comparable, we sample the same amount of 10K songs for both subsets in our experiment. These songs are those with the highest playcounts within each subset. It can be seen from Fig. 2 that the distributions of the hit scores of the sampled songs are similar. The solid lines in Fig. 1 show that, after this sampling, the time bias among the sampled songs remains moderate. Finally, we randomly split the songs into 8K, 1K, and 1K songs as the training, validation, and test data for each of the subsets. Although it may be more interesting to split the songs according to their release dates so as to 'learn from the past and predict the future,' we leave this as future work. Our focus here is to study whether deep models perform better than shallow models in audio-based hit song prediction. The scale and the time span of the dataset are deemed appropriate for this study. Unlike previous work on musical trend analysis that may involve more than ten years' worth of data (e.g.
[6], [19]), for the purpose of our work we want to avoid changes in public music taste, and it is therefore better to use listening records collected within a year.

3. METHODS

We formulate hit song prediction as a regression problem and train either shallow or deep neural network models to predict the hit scores. Given the audio representation x_n of each song n in the training set, the objective is to optimize the parameters Θ of our model f(·) by minimizing the squared error between the ground truth y_n and our estimate, expressed as

    min_Θ Σ_n ‖ y_n − f_Θ(x_n) ‖²₂ .

As described below, a total of six methods are considered. All of them are implemented using the lasagne library [20], and the model settings, such as the learning rate update strategy, the dropout rate, and the numbers of feature maps per layer, are empirically tuned using the validation set.

3.1. Method 1 (m1): LR

As the simplest method, we compute 128-bin log-scaled mel-spectrograms [21] from the audio signals and take the mean and standard deviation over time, leading to a 256-dim feature vector per song. The feature vectors are used as the input to a single-layer shallow neural network model, which is effectively a linear regression (LR) model. The mel-spectrograms are computed by short-time Fourier transform with 4,096-sample, half-overlapping Hanning windows, from the middle 60-second segment of each song, which is sampled at 22 kHz. In lasagne, we can implement the LR model by a dense layer.

3.2. Method 2 (m2): CNN

Going deeper, we use the mel-spectrogram directly as the input to a CNN model; the input is a 128-by-646 matrix, as there are 646 frames per song. Our CNN model consists of two early convolutional layers, with 128-by-4 and 1-by-4 convolutional kernels respectively, and three late convolutional layers, which all have 1-by-1 convolutional kernels.
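To make these kernel shapes concrete, the following single-channel numpy sketch traces how a mel-spectrogram flows through the stack. It is an illustration only: it uses one feature map per layer, random weights, and no nonlinearities, whereas the actual model has multiple feature maps per layer and tuned settings.

```python
import numpy as np

def conv2d_valid(x, k):
    """Minimal single-channel 'valid' 2-D cross-correlation (no stride, no padding)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
mel = rng.standard_normal((128, 646))                 # 128 mel bins x 646 frames

h = conv2d_valid(mel, rng.standard_normal((128, 4)))  # early layer 1: 128-by-4 kernel
assert h.shape == (1, 643)                            # frequency axis collapsed to 1
h = conv2d_valid(h, rng.standard_normal((1, 4)))      # early layer 2: 1-by-4 kernel
assert h.shape == (1, 640)
for _ in range(3):                                    # three late 1-by-1 layers keep the shape
    h = conv2d_valid(h, rng.standard_normal((1, 1)))
assert h.shape == (1, 640)

score = h.mean()  # pooling over time yields one scalar hit-score estimate per song
```

Note that the first 128-by-4 kernel spans the whole frequency axis, so after the first layer the representation is purely temporal.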
Unlike usual CNN models, we do not use fully connected layers in the latter half of our model, for such fully convolutional models have been shown to be more effective for music [14, 17, 22].

3.3. Method 3 (m3): inception CNN

The idea of inception was introduced in GoogLeNet for visual problems [23]. It uses multi-scale kernels to learn features. We make an audio version of it by adding two more parallel early convolutional layers with different sizes, 132-by-8 and 140-by-16, as illustrated in the bottom-right corner of Fig. 3. To combine the outputs of these three kernels by concatenation, the input mel-spectrogram needs to be zero-padded.

3.4. Method 4 (m4): JYnet (a CNN model) + LR

While generic audio features such as the mel-spectrogram may be too primitive to predict hits, we employ a state-of-the-art music auto-tagging system referred to as JYnet [17] to compute high-level tag-based features. JYnet is another CNN model that also takes the 128-bin log-scaled mel-spectrograms as the input, but the model is trained to make tag predictions using the MagnaTagATune dataset [18]. The output is the activation scores of 50 music tags, including genres, instruments, and other performance-related tags such as male vocal, female vocal, fast, and slow. From the output of JYnet (i.e. 50-dim tag-based features), we learn another LR model for predicting hit scores, as illustrated in the bottom-left corner of Fig. 3.

Fig. 3. Architecture of the investigated CNN models.

3.5. Methods 5 and 6 (m5 & m6): Joint Training

We also try to combine (m4) with (m2) or (m3) to exploit the information in both the mel-spectrograms and the tags, leading to (m5) and (m6). Instead of simply combining the results of the two models f_Θ1(·) and f_Θ2(·), we add another layer on top of them for joint training, as illustrated in Fig. 3.
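Concretely, the added layer blends the two predictions with a learnable scalar weight. A minimal numpy sketch with hypothetical values (the real f_Θ1 and f_Θ2 are the CNN models above, and w is optimized by gradient descent together with their parameters):

```python
import numpy as np

# Hypothetical per-song hit-score estimates from the two sub-models:
f1 = np.array([3.2, 5.1, 1.8])   # audio-based model, f_theta1
f2 = np.array([2.9, 4.4, 2.5])   # tag-based model, f_theta2
y  = np.array([3.0, 5.0, 2.0])   # ground-truth hit scores
w  = 0.6                         # relative weight, learned jointly with both models

y_hat = w * f1 + (1 - w) * f2    # combined prediction
loss  = np.sum((y - y_hat) ** 2) # squared-error objective
```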
The learning objective becomes:

    min_{w,Θ1,Θ2} Σ_n ‖ y_n − w f_Θ1(x_n) − (1 − w) f_Θ2(x_n) ‖²₂ ,    (1)

where w determines their relative weight. In this way, we can optimize the parameters of both models jointly. However, when method 4 is used in joint training, we only update the parameters of its LR part, as JYnet is treated as an external, pre-trained model in our implementation.

4. EXPERIMENTAL RESULTS

We train and evaluate the two data subsets separately.

Table 1. Accuracy of Hit Song Prediction

                            Mandarin subset                      Western subset
Method                      recall   nDCG    Kendall  Spearman   recall   nDCG    Kendall  Spearman
(m1) audio+LR               0.1900   0.1997  0.1679   0.2480     0.1400   0.1271  0.0674   0.1002
(m2) audio+CNN              0.2300   0.2334  0.1806   0.2678     0.1300   0.1294  0.1031   0.1564
(m3) audio+inception CNN    0.2500   0.2369  0.2286   0.3374     0.1800   0.1989  0.1093   0.1636
(m4) tag+LR                 0.2400   0.2372  0.1671   0.2473     0.2000   0.1774  0.0918   0.1372
(m5) (m2)+(m4)              0.2500   0.2558  0.2018   0.2971     0.1800   0.1791  0.1300   0.1941
(m6) (m3)+(m4)              0.3000   0.2927  0.2665   0.3894     0.2100   0.2413  0.1341   0.1996

For evaluation, the following four metrics are considered:

• Recall@100: Treating the 100 songs (i.e. 10%) with the highest hit scores among the 1,000 test songs as the hit songs, we rank all the test songs in descending order of the predicted hit scores and count the number of hit songs that occur in the top 100 of the resulting ranking.

• nDCG@100: normalized discounted cumulative gain (nDCG) is another popular measure used in ranking problems [24]. It is computed in a way similar to Recall@100, but the positions of the recalled hit songs in the ranking list are taken into account.

• Kendall's τ: we directly compare the ground-truth and predicted rankings of the test songs in hit scores (without defining which songs are hit songs) and compute a value based on the number of correctly and incorrectly ranked pairs [25].
• Spearman's ρ: the rank correlation coefficient (considering the relative rankings but not the actual hit scores) between the ground-truth and predicted rankings.

The results are shown in Table 1, obtained by averaging the results of 10 repetitions of each method. The following observations can be made. First, by comparing the results of (m1), (m2), and (m3), we see that better results in most of the four metrics are obtained by using deeper and more complicated models, for both subsets. This suggests the effectiveness of deep structures for this task. Furthermore, by comparing the results of the two subsets, we see that audio-based hit song prediction is easier for the Mandarin subset, confirming the findings of Fan et al. [10].

Second, as both (m1) and (m4) use LR for prediction, by comparing their results we see that the tag-based method (m4) outperforms the simple audio-based method (m1) in all four metrics for the Western subset, demonstrating the effectiveness of the JYnet tags. This is, however, not the case for the Mandarin subset in terms of Kendall's τ and Spearman's ρ.

Third, from the results of (m5) and (m6), we see that the joint learning structure can further improve the results for both subsets. The best result in all metrics is obtained by (m6).

To gain insights, we employ JYnet to assign genre labels to all the test songs and examine the distribution of genres among the top-50 hit songs determined by either the automatic models or the ground truth. For each song, we pick the genre label with the strongest activation as predicted by JYnet. The resulting genre distributions are shown in Fig. 4. We see from the result of the ground truth that the Western hits have more diverse genres. The predominance of 'Pop' songs in the Mandarin subset might explain why 1) hit song prediction in this subset is easier and 2) (m4) alone cannot improve τ and ρ.
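The four evaluation metrics above can be sketched in numpy as follows. This is a sketch under stated assumptions: the paper does not specify its exact nDCG gain/discount convention, so we assume binary relevance with the common 1/log2(rank + 1) discount, and the rank-correlation helpers assume no tied scores.

```python
import numpy as np

def recall_at_k(y_true, y_pred, k=100):
    """Fraction of the true top-k songs that appear in the predicted top k."""
    hits = set(np.argsort(-y_true)[:k])
    retrieved = set(np.argsort(-y_pred)[:k])
    return len(hits & retrieved) / k

def ndcg_at_k(y_true, y_pred, k=100):
    """nDCG@k with binary relevance (1 if a song is a true top-k hit),
    assuming the common 1/log2(rank + 1) discount."""
    hits = set(np.argsort(-y_true)[:k])
    ranking = np.argsort(-y_pred)[:k]
    dcg = sum(1.0 / np.log2(r + 2) for r, s in enumerate(ranking) if s in hits)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(k))
    return dcg / idcg

def kendall_tau(y_true, y_pred):
    """Kendall's tau: (concordant - discordant pairs) / total pairs (O(n^2) sketch)."""
    n = len(y_true)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(y_true[i] - y_true[j]) * np.sign(y_pred[i] - y_pred[j])
    return s / (n * (n - 1) / 2)

def spearman_rho(y_true, y_pred):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    r1 = np.argsort(np.argsort(-y_true))
    r2 = np.argsort(np.argsort(-y_pred))
    return np.corrcoef(r1, r2)[0, 1]
```

A perfect prediction yields 1.0 on all four metrics, and a fully reversed ranking yields τ = ρ = −1; production code would typically use a library implementation (e.g. scipy.stats) for the correlation coefficients.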
Moreover, for the Western subset, we see that the genre distribution of (m4) is more diverse than that of (m3), despite the fact that (m3) achieves slightly higher nDCG and Spearman's ρ. This might imply that the ability to match the genre distribution of the ground truth is another important performance indicator.

Fig. 4. The predominant tags (predicted by JYnet) for the top-50 hit songs determined by different methods for the (top) Mandarin and (bottom) Western subsets. From left to right: (a) the tag-based model (m4), (b) the audio-based model (m3), (c) the hybrid model (m6), and (d) the ground truth.

5. CONCLUSION

In this paper, we have introduced state-of-the-art deep learning techniques to the audio-based hit song prediction problem. Instead of aiming to classify hits from non-hits, we formulate it as a regression problem. Evaluations on the listening data of Taiwanese users of the streaming company KKBOX confirm the superiority of deep structures over shallow structures in predicting song popularity. Deep structures are particularly important for Western songs, as simple shallow models may not capture the rich acoustic and genre diversity exhibited in Western hits. For future work, we hope to understand what our neural network models actually learn, to compare against more existing methods (preferably using the same datasets), and to investigate whether our models can predict future charts or emerging trends.

6. REFERENCES

[1] H. Silk, R. Santos-Rodriguez, C. Mesnage, T. De Bie, and M. McVicar, "Data science for the detection of emerging music styles," EPSRC, pp. 4–6, 2014.
[2] S. McClary, Studying Popular Music, vol. 10, 1991.
[3] P. D. Lopes, "Innovation and diversity in the popular music industry, 1969 to 1990," American Sociological Review, vol. 57, no. 1, p. 56, 1992.
[4] R. Dhanaraj and B. Logan, "Automatic prediction of hit songs," in Proceedings of the International Society for Music Information Retrieval Conference, pp. 11–15, 2005.
[5] F. Pachet and P. Roy, "Hit song science is not yet a science," in Proceedings of the International Society for Music Information Retrieval Conference, pp. 355–360, 2008.
[6] Y. Ni and R. Santos-Rodriguez, "Hit song science once again a science," International Workshop on Machine Learning and Music, pp. 2–3, 2011.
[7] R. M. MacCallum, M. Mauch, A. Burt, and A. M. Leroi, "Evolution of music by public choice," Proceedings of the National Academy of Sciences, vol. 109, no. 30, pp. 12081–12086, 2012.
[8] M. Mauch, R. M. MacCallum, M. Levy, and A. M. Leroi, "The evolution of popular music: USA 1960–2010," Royal Society Open Science, vol. 2, no. 5, p. 150081, 2015.
[9] J. Serrà, Á. Corral, M. Boguñá, M. Haro, and J. L. Arcos, "Measuring the evolution of contemporary western popular music," Scientific Reports, vol. 2, pp. 1–6, 2012.
[10] J. Fan and M. Casey, "Study of Chinese and UK hit songs prediction," 10th International Symposium on Computer Music Multidisciplinary Research (CMMR), pp. 640–652, 2013.
[11] A. Singhi and D. G. Brown, "Hit song detection using lyric features alone," in Proceedings of the International Society for Music Information Retrieval Conference, 2014.
[12] M. J. Salganik, P. S. Dodds, and D. J. Watts, "Experimental study of inequality and cultural market," Science, vol. 311, no. 5762, pp. 854–856, 2006.
[13] E. Zangerle, M. Pichl, B. Hupfauf, and G. Specht, "Can microblogs predict music charts? An analysis of the relationship between #nowplaying tweets and music charts," in Proceedings of the International Society for Music Information Retrieval Conference, 2016.
[14] K. Choi, G. Fazekas, and M. Sandler, "Automatic tagging using deep convolutional neural networks," in Proceedings of the International Society for Music Information Retrieval Conference, 2016.
[15] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1533–1545, 2014.
[16] S. Dieleman and B. Schrauwen, "End-to-end learning for music audio," in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6964–6968, 2014.
[17] J.-Y. Liu and Y.-H. Yang, "Event localization in music auto-tagging," in Proceedings of the 24th ACM International Conference on Multimedia, 2016.
[18] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, "Evaluation of algorithms using games: the case of music tagging," in Proceedings of the International Society for Music Information Retrieval Conference, 2009.
[19] S. Kinoshita, T. Ogawa, and M. Haseyama, "Popular music estimation based on topic model using time information and audio features," IEEE 3rd Global Conference on Consumer Electronics (GCCE), pp. 102–103, 2014.
[20] "lasagne," [online] https://lasagne.readthedocs.org/en/latest/.
[21] S. Dieleman and B. Schrauwen, "Multiscale approaches to music audio feature learning," in Proceedings of the International Society for Music Information Retrieval Conference, pp. 116–121, 2013.
[22] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," Computer Vision and Pattern Recognition (CVPR), 2015.
[23] C. Szegedy, W. Liu, Y. Jia, and P. Sermanet, "Going deeper with convolutions," arXiv preprint arXiv:1409.4842, 2014.
[24] Y. Wang, L. Wang, Y. Li, D. He, T.-Y. Liu, and W. Chen, "A theoretical analysis of NDCG ranking measures," in Proceedings of the Annual Conference on Learning Theory, pp. 1–30, 2013.
[25] M. Kendall and J. D. Gibbons, Rank Correlation Methods, vol. 3, Oxford University Press, 1990.