SeER: An Explainable Deep Learning MIDI-based Hybrid Song Recommender System
Authors: Khalil Damak, Olfa Nasraoui
Knowledge Discovery and Web Mining Lab, CECS Department, University of Louisville, Louisville, USA
khalil.damak@louisville.edu, olfa.nasraoui@louisville.edu

ABSTRACT
State of the art music recommender systems mainly rely on either matrix factorization-based collaborative filtering approaches or deep learning architectures. Deep learning models usually use metadata for content-based filtering or predict the next user interaction by learning from temporal sequences of user actions. Despite advances in deep learning for song recommendation, none has taken advantage of the sequential nature of songs by learning sequence models that are based on content. Aside from prediction accuracy, other aspects are also important, such as explainability and solving the cold start problem. In this work, we propose a hybrid deep learning model, called "SeER", that uses collaborative filtering (CF) and deep learning sequence models on the MIDI content of songs for recommendation in order to provide more accurate personalized recommendations, solve the item cold start problem, and generate a relevant explanation for a song recommendation. Our evaluation experiments show promising results compared to state of the art baseline and hybrid song recommender systems in terms of ranking evaluation. Moreover, based on proposed tests for offline validation, we show that our personalized explanations capture properties that are in accordance with the user's preferences.
KEYWORDS
hybrid recommender system, deep learning, recurrent neural networks, matrix factorization, music recommender system, explainability, user cold start problem, explainable AI

1 INTRODUCTION
Recommendation is becoming a prevalent component of our daily lives that has attracted increasing interest from the Machine Learning research community in recent years. Music is among the fields in which recommendation is most decisive. Music streaming platforms are indeed numerous: Spotify [38], Pandora [31], YouTube Music [48] and many others. However, what makes the success of a platform is its capacity to predict which song the user wants to listen to at the moment, given their previous interactions. The most accurate recommender systems rely on complex black box machine learning models that do not explain why they output the predicted recommendation. In fact, one main challenge is designing a recommender system that mitigates the trade-off between explainability and prediction accuracy [3]. Today, the most widely used techniques in music recommendation are matrix factorization (MF)-based collaborative filtering approaches [27] and deep learning architectures [49]. MF is based on similarities between users and items in a latent space obtained by factorizing the rating matrix into user and item latent factor matrices [19]. State of the art deep learning recommender systems follow mainly two approaches. The first relies on content-based filtering [42] using metadata to recommend items. The second uses sequence models [26][8][16] to predict the next interaction (played song) given the previous interactions [14][39][47]. Despite the advances in deep learning for song recommendation, and despite the sequential nature of songs that makes them naturally suited to sequence models, no work has used sequence models with the content of songs for recommendation.
Aside from accuracy and explainability, the cold start problem is a significant issue for collaborative filtering recommender systems [2]. In fact, most recommender systems need an initial history of interactions (ratings, clicks, plays, etc.) to recommend items. In music streaming platforms, new users and songs are constantly added, making this issue crucial to solve. In this work, we take advantage of the sequential nature of songs, the prediction power of MF and the superior capabilities of deep learning sequence models to achieve the following objectives:
• Propose a method to transform the Musical Instrument Digital Interface (MIDI) format [1] of songs into multidimensional time series that can be used as input to deep learning sequence models while keeping a large amount of information about the song;
• Integrate content-based filtering using deep learning sequence models into collaborative filtering MF to build a novel hybrid model that provides accurate predictions compared to baseline recommender systems, solves the item cold start problem, and provides explanations for the recommendations; and
• Propose a new type of explanation for song recommendation that consists of presenting to the user a short personalized MIDI segment of the song that characterizes the portion that the user is predicted to like the most.

2 RELATED WORK
Various recommender systems rely on sequence models. However, not all of them use them for recommendation with user preferences. In fact, some are session-based CF models [14][39][47] that predict the next interaction in a sequence of interactions regardless of the user's personal preferences. Other methods introduce content to session-based recommendation [15][37] and prove that side information enhances the recommendation quality [49]. Other recommender systems using sequence models take into consideration

Table 1: Notation used in Section 3.
Symbol/Notation   Definition
U                 User embedding (latent) matrix
U_u               User u's latent vector
S                 Song lookup matrix
S_s               Song s's flattened array
x_s               Song s's array (multidimensional time series)
x_s,t             Song s's array at time step t
x_s,k             Array of segment k from song s
x_s,exp_u         Explainability segment array of song recommendation s to user u
R                 Training data
r_us              Actual rating of user u for song s
r̂_us              Predicted rating of user u for song s
r̂^k_us            Predicted rating of user u for segment k of song s
h^<m>,s_t         Hidden state of sequence model layer m at time step t on input song s
T                 Normalized number of time steps
|.|               Cardinality
·                 Dot product
∇_w J             Gradient of J with respect to w

user identification [46][45]. These engines model temporal dependencies for both users and movies [46][45] and generate reviews [45]. The main objective of these models is to predict ratings of users to items using seasonal evolutions of items and user preferences, in addition to user and item latent vectors. Alternate models aim to generate review tips [25], predict the returning time of users and predict items [17], or produce next item recommendations for a user by proposing a novel Gated Recurrent Unit [8] (GRU) structure [10]. Finally, some recommender systems also use sequence models as a feature representation learning tool [49]. [5] creates a latent representation of items and uses it as input to a CF model with a user embedding to predict ratings. On the other hand, song recommendation has received contributions from few hybrid models, which often diverge in terms of input data and features created. In fact, music items can be represented by features derived from audio signals, social tags or web content [40]. Among the most noticeable hybrid song recommender systems, [44] learns latent factors of users and items using matrix factorization and sums their product with the product obtained with created user and song features.
[6] combines non-negative MF and graph regularization to predict the inclusion of a song in a playlist. [30] learns artist embeddings from biographies and track embeddings from audio spectrograms, then aggregates and multiplies them by user latent factors obtained by weighted MF to predict ratings. [41] trains a Convolutional Neural Network [22] on spectrograms of song samples to predict latent features obtained with an MF approach for songs with no ratings. Finally, [4] positions the users in a mood space, given their favorite artists, and recommends new artists using similarity measures.

3 METHODS
In this section, we start by describing the data that we used along with its preparation procedure. Then, we present our model, called "Sequence-based Explainable Recommender system" (SeER). Finally, we describe our explainability process, called "Segment Forward Propagation". To ease the reading of the remainder of this article, Table 1 lists the notation used in this section.

Figure 1: Our dataset resulting from the intersection between "The Lakh MIDI Dataset v0.1" and "The Echo Nest Taste Profile Subset".

3.1 Data Preparation
We needed a dataset that includes both user-to-item interactions and song content data. Thus, we used two datasets from the Million Song Dataset (MSD) [7]. The Echo Nest Taste Profile Subset [28] includes 48,373,586 play counts of 1,019,318 users on 384,546 songs, collected from The Echo Nest's undisclosed partners. The Lakh MIDI Dataset includes 45,129 unique MIDI files matched to MSD songs [33][34]. We combined both datasets by taking the intersection in terms of songs, as presented in Fig. 1. Then, we followed the same methodology used in [13] to reduce the sparsity of the data: we filtered out users that interacted with fewer than 20 unique songs. We obtained a dataset with 32,180 users, 6,442 songs with available MIDI files, and 941,044 play counts.
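The intersection-and-filter step can be sketched with pandas on a toy interaction table. All IDs below are illustrative, and the 2-song threshold stands in for the paper's 20-song cutoff; the real MSD subsets ship in their own file formats.

```python
import pandas as pd

# Toy interaction data standing in for the Echo Nest Taste Profile subset.
plays = pd.DataFrame({
    "user_id":    ["u1"] * 3 + ["u2"] * 2 + ["u3"],
    "song_id":    ["s1", "s2", "s3", "s1", "s4", "s5"],
    "play_count": [3, 1, 7, 2, 1, 40],
})
midi_song_ids = {"s1", "s2", "s3", "s4"}  # songs with a matched MIDI file
MIN_UNIQUE_SONGS = 2                      # the paper uses 20

# Step 1: keep only songs in the intersection with the Lakh MIDI matches (Fig. 1).
plays = plays[plays.song_id.isin(midi_song_ids)]

# Step 2: drop users that interacted with too few unique songs.
uniq = plays.groupby("user_id")["song_id"].nunique()
plays = plays[plays.user_id.isin(uniq[uniq >= MIN_UNIQUE_SONGS].index)]

# Sparsity of the resulting user-song interaction matrix.
sparsity = 1 - len(plays) / (plays.user_id.nunique() * plays.song_id.nunique())
```

On the full dataset, the same two steps yield the 32,180 users and 6,442 songs reported above.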
Our dataset has a sparsity of 99.54%. We preprocessed our dataset by first mapping the play counts to ratings in order to remove outliers. To justify this step, we show the distribution and statistics of the play counts in Fig. 2. The play counts follow a power law distribution with a median of 1. Also, some users listened to the same song hundreds or thousands of times; the maximum play count is 3,532. These high play counts are outliers that may bias the training of the model. While it is true that the more a song is listened to by a user, the more likely the user likes it, whether a user listens to a song 10 or 3,000 times, it is clear that they like it; both cases should be treated the same. Therefore, we used the statistics of the play counts to map them to ratings, as shown in Fig. 3. Next, we created the input to train sequence models by transforming each MIDI file into a multidimensional time series. MIDI files are polyphonic digital instrumental audio files that are usually used to create music. They consist of event messages that are consecutive in time [1]. Each message includes a type (such as a note), notation (the note played), time (the time it is played) and velocity (how rapidly and forcefully it is played) [43]. These events are distributed over 16 available channels of information, which are independent paths over which messages travel [1]. Each channel can be programmed to play one instrument, so a MIDI file can play up to 16 instruments simultaneously. We first used "MIDICSV" [43] to translate the MIDI files into sheets of event messages. We only considered the "Note on C" events to focus on the sequences of notes played throughout time. Thus, we extracted the notes that are played within the 16 channels with their velocities.
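One plausible reading of this extraction step can be sketched as follows. The rows below mimic MIDICSV output (track, MIDI time, type, channel, note, velocity); the exact record layout MIDICSV produces may differ, and the column convention (a note/velocity pair per channel) is our reconstruction of the 32-feature layout described next.

```python
import numpy as np

# Hypothetical MIDICSV-style rows: (track, midi_time, type, channel, note, velocity).
events = [
    (1, 0,   "Note_on_c", 0, 60, 90),
    (1, 0,   "Note_on_c", 9, 36, 70),
    (1, 480, "Note_on_c", 0, 64, 85),
]

def events_to_series(events):
    """One row per 'Note on C' event; 32 columns = (note, velocity) x 16 channels."""
    rows = []
    for _, _, etype, channel, note, velocity in events:
        if etype != "Note_on_c":
            continue  # ignore all other event types
        row = np.zeros(32)
        row[2 * channel] = note
        row[2 * channel + 1] = velocity
        rows.append(row)
    return np.array(rows)

def normalize_length(x, T=2600):
    """Truncate or zero-pad to T time steps so songs can be mini-batched."""
    if len(x) >= T:
        return x[:T]
    return np.vstack([x, np.zeros((T - len(x), x.shape[1]))])

x = events_to_series(events)  # shape: (number of Note on C events, 32)
```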
As a result, each transformed multidimensional time series consists of a certain number of rows, representing the number of "Note on C" events, and 32 features, representing the notes and velocities played within the 16 channels. The transformation process is summarized in Fig. 4. We then normalized the number of time steps to the median number of time steps of the songs in our dataset (2,600) to be able to train with mini-batches [24]. At least 50% of the songs kept all their notes and 75% of the songs kept at least half of their notes. Finally, in order to avoid duplicates of the same song in the input and ensure memory efficiency, we created a song lookup matrix by flattening each multidimensional time series into one row of the matrix.

Figure 2: Play count statistics: (a) statistics of the play counts and (b) density plot of the play counts (play counts < 1000 shown for better visualization).

Figure 3: Play count normalization into 5-star ratings.

Figure 4: MIDI to multidimensional time series transformation process.

3.2 SeER
Three main observations motivated the design of our model. First, the sequential nature of songs, particularly as represented by MIDI files, can be best modeled using sequence models. Second, the hidden state (output) of a sequence model is both learnable and of chosen size, both being basic properties of an embedding matrix; thus, we opted to assimilate it to a user embedding. Third, sequence models can propagate instances with varying numbers of time steps. This inspired us to explain the recommendations using song segments.

Figure 5: Structure of SeER: For every training tuple (user, song, rating), the model extracts the corresponding user latent vector and flat song array from the user latent matrix and the song lookup matrix respectively. The song array is reshaped to its original 2-dimensional format and input to a sequence model.
The resulting song hidden state vector is multiplied with the user latent vector to predict the rating.

These motivations led us to design our model, called "SeER": a sequence-based explainable hybrid song recommender system with the structure presented in Fig. 5. SeER takes as input the song lookup matrix and a user embedding matrix. For each user, song, and rating triplet (u, s, r_us) in the training data R, we extract the corresponding latent factor vector U_u of the user and the flattened song array S_s. The latter process is illustrated in Fig. 5 with multiplications of the user embedding and song lookup matrices with one-hot vectors of u and s respectively. The song array is next reshaped into its original two-dimensional shape (2,600 time steps by 32 features). The resulting array x_s is input to a sequence model and, finally, the hidden state of the last layer m at the last time step (T = 2,600), h^<m>,s_T, is multiplied with the user latent vector U_u to predict the rating of the user for the song: r̂_us = U_u · h^<m>,s_T. To be consistent, we chose the size of the sequence model's hidden state to be the same as the number of user latent features; this enables computing the scalar product of the two vectors to yield a predicted rating. The model is trained using the Mean Squared Error (MSE) [23] as a loss function by comparing the actual rating r_us to the predicted rating r̂_us. Thus, our objective function is:

J = (1/|R|) Σ_{(u,s,r_us) ∈ R} (r̂_us − r_us)² = (1/|R|) Σ_{(u,s,r_us) ∈ R} (U_u · h^<m>,s_T − r_us)²   (1)

Note that in Fig. 5, the cell states can be ignored when using Recurrent Neural Networks [26] (RNNs) or GRUs. The training process of SeER for each epoch is described in Alg. 1.

3.3 Recommendation and Segment Forward Propagation Explainability
The recommendation process consists of feeding user and unrated song inputs to the SeER model in Fig.
5, which results in a predicted rating for each input song. The highest predicted ratings yield a list of recommended songs. After generating a song recommendation s for a user u, we explain it by presenting a 10-second MIDI

Algorithm 1 SeER training algorithm (for each epoch) with step size α, using mini-batch gradient descent for simplicity
1: procedure SeER_training_epoch(song lookup matrix S, user latent factor matrix U, set of mini-batches B, learning rate α, number of sequence model layers m, number of time steps T)
2:   for b in B do
3:     for (u, s, r_us) in b do
4:       U_u ← One_hot(u) · U   ▷ extract latent vector of u
5:       S_s ← One_hot(s) · S   ▷ extract flat song array of s
6:       x_s ← Reshape(S_s)     ▷ reshape S_s to (T × 32)
7:       h^<m>,s_T ← Sequence_model(x_s)
8:       r̂_us ← U_u · h^<m>,s_T  ▷ predict rating of u for s
9:     end for
10:    J = (1/|b|) Σ_{(u,s,r_us) ∈ b} (U_u · h^<m>,s_T − r_us)²   ▷ compute prediction loss
11:    w ← w − α · ∇_w J   ▷ back propagate (w refers to the parameters of U and the sequence model)
12:   end for
13: end procedure

Figure 6: Segment Forward Propagation Explainability.

segment x_s,exp_u of the song that tries to capture the most important portion of the recommended song for the input user. First, we sample segments of the MIDI file using a 10-second sliding window with a one-second stride. This means that the first segment is the first 10 seconds of the audio, the second segment is from second 2 to second 11, and so forth, until we reach the end of the song. To do this, we start by creating absolute time segments that we match to MIDI times in the song to determine the range of time steps of each segment. In fact, the time in a MIDI file is in pulses and can be converted to absolute time such that

time[µs] = (MIDI_time[pulses] / Division[pulses per quarter note]) × Tempo[µs per quarter note].
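The pulse-to-time conversion can be sketched as a small helper. Note that this sketch assumes a single fixed tempo; real MIDI files may contain tempo changes mid-song, in which case the conversion must be applied piecewise.

```python
def pulses_to_microseconds(midi_time, division, tempo):
    """Convert a MIDI time in pulses to absolute time in microseconds:
    time[us] = pulses / (pulses per quarter note) * (us per quarter note)."""
    return midi_time / division * tempo

# Example: 960 pulses at division=480 and tempo=500_000 us/quarter note
# (i.e. 120 BPM) correspond to exactly one second.
t = pulses_to_microseconds(960, 480, 500_000)  # 1_000_000.0 us
```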
The Division is the number of pulses per quarter note and the Tempo is a measure of speed [43]. Then, we create a multidimensional time series x_s,k for each segment k by truncating the time series of the recommended song x_s. Finally, we feed each segment's time series x_s,k along with the user latent vector U_u as input to the SeER model to estimate a rating r̂^k_us of that user for the segment. The segment that obtains the highest predicted rating r̂^k_us is presented to the user as an explanation for the song recommendation. We call this explainability process "Segment Forward Propagation Explainability" because it relies on forward propagation of segments to explain the prediction. The explanation process is presented in Fig. 6 and summarized in Alg. 2.

Algorithm 2 Segment Forward Propagation Explainability
1: procedure Segment_Forward_Propagation(recommended song s, length of s in seconds L, song array x_s, user latent vector U_u, number of time steps T, trained model SeER)
2:   abs_time_x_s ← [MIDI_time(x_s,t) / Division(x_s,t) · Tempo(x_s,t) | t = 1..T]   ▷ match time steps to absolute times in x_s
3:   abs_time_seg ← [(i, i + 9) | i = 1..L − 9]   ▷ create absolute time segments
4:   song_segments ← [x_s,k = x_s[i : j] | (abs_time_x_s[i], abs_time_x_s[j]) in abs_time_seg]   ▷ create 10-second segments of x_s
5:   seg_ratings ← [r̂^k_us = SeER(x_s,k, U_u) | x_s,k in song_segments]   ▷ predict ratings for each segment
6:   x_s,exp_u ← song_segments[argmax_k(seg_ratings)]   ▷ determine explainability segment
7: end procedure

Table 2: Top 5 normalized ratings of User 1000 in the dataset.
Artist name | Title | Genres | Rating
Dido | White Flag | Pop, Hip-hop | 5
Travie McCoy | Billionaire [ft. Bruno Mars] | Pop | 5
Dido | Thank You | Pop | 4
Alicia Keys ft.
Adam Levine | Wild Horses | Neo-soul | 4
Michael Bublé | Put Your Head On My Shoulder | Easy listening | 2

Table 3: Example of the top 5 recommendations predicted by SeER for user 1000 (partially listed in Table 2). The explanations are represented by the start and end times of the 10-second samples in µs.
Artist name | Title | Genres | Predicted rating | Explanation
Andreas Johnson | Glorious | Alternative/Indie, Pop | 5.360812 | (130074061.0, 139999986.0)
The Knack | My Sharona | Rock | 5.346163 | (11172411.0, 20937925.8)
Cat Stevens | Trouble | Singer-songwriter | 5.330237 | (24230213.1, 33972849.8)
CoCo Lee | Before I Fall In Love | Contemporary R&B | 5.314626 | (126034512.0, 135942920.0)
Red Hot Chili Peppers | Blood Sugar Sex Magik | Alternative/Indie | 5.290801 | (248107860.0, 257837580.0)

In order to illustrate the SeER recommendation and explainability processes, we compute the top 5 recommendations for User number 1000, whose top 5 rated songs are listed in Table 2. The top 5 recommendations with explanations for this user are shown in Table 3. The explanations are presented with the start and end times of the 10-second samples in µs. We provide a link to a video demo¹ where these explainability segments can be heard.

4 OFFLINE EXPERIMENTAL EVALUATION
In this section, we describe an offline evaluation pipeline aiming to assess the recommendation performance and capabilities of our model.

¹ https://drive.google.com/file/d/1imlh4nPFhXetE1jCzPkGPm8XRxUxcRJR/view?usp=sharing

Table 4: Hyperparameter tuning results: MAP@10 on the test data after 20 epochs. Best results (in bold) obtained, first, with 150 latent features and, then, with GRU.
Hyperparameter | Value | MAP@10
# of latent features | 50 | 0.1236
# of latent features | 100 | 0.1424
# of latent features | 150 | 0.1433
# of latent features | 200 | 0.1425
Sequence model type | LSTM | 0.1433
Sequence model type | RNN | 0.0973
Sequence model type | GRU | 0.1437

4.1 Experimental Setting
We used the same 80/20% train/test split for all experiments in order to be consistent when comparing two models or when reproducing an experiment.
Due to computational and time constraints, we trained all models for 20 epochs and evaluated the results in terms of recommendation ranking using Mean Average Precision at cutoff K (MAP@K). Furthermore, in order to assess statistical significance when comparing two models, we replicated each experiment 5 times and applied statistical tests.

4.2 Hyperparameter Tuning
We fixed the number of sequence model layers to 1 and the batch size to 500 because of memory constraints. Also, we relied on the Adaptive Moment Estimation (Adam) [18] optimizer because it yields relatively fast convergence and adapts the learning rate for each parameter [13]. Finally, we tuned the number of latent features from 50 to 200 in increments of 50, and the sequence model type by trying RNN, GRU and Long Short-Term Memory [16] (LSTM) networks. We relied on a greedy approach that consists of varying the hyperparameters one by one, independently of each other. We started by initializing the sequence model type to LSTM and tuned the number of latent features. Then, we varied the sequence model type. The results are presented in Table 4. We obtained the best performance with 150 latent features and GRU.

4.3 Research Questions
To evaluate the prediction ability of our model, we made both wide and narrow comparisons. For the wide comparison, we matched our model against baseline recommender systems regardless of their types and data nature. The narrow comparison, on the other hand, consists of comparing our model to its closest competitors, which are state of the art hybrid song recommender systems. This leads us to formulate our first two research questions: RQ1: How does our model compare to baseline recommender systems? and RQ2: How does our model compare to state of the art (SOTA) hybrid song recommender systems?
Also, SeER can be seen as an updated version of MF in which the item embedding matrix is replaced with the output of a sequence model that takes as input our preprocessed song content data. Thus, we assess the importance of the way we use the content data by comparison to MF in the third research question: RQ3: What is the importance of our use of the content data? Finally, we assess whether our explanations share similar characteristics.

Table 5: Comparison of SeER with baseline models: MAP@10 results after 20 epochs for 5 replicates.
Replicate | SeER | MF | NeuMF | ItemPop
1 | 0.1436 | 0.1289 | 0.1314 | 0.0778
2 | 0.1481 | 0.1292 | 0.1303 | 0.0778
3 | 0.1399 | 0.1285 | 0.1366 | 0.0778
4 | 0.1453 | 0.1266 | 0.1376 | 0.0778
5 | 0.1414 | 0.1288 | 0.1378 | 0.0778
Average | 0.1437 | 0.1284 | 0.1347 | 0.0778

The intuition behind this is that shared characteristics may be interpreted as user preferences captured and incorporated in the explanations. This would indicate that the explanations represent the preferences of the user and are not an artificial product of the model. This translates into RQ4: Do the personalized explanations share similar characteristics?

4.4 RQ1: How does our model compare to baseline recommender systems?
The baseline recommender systems we used for comparison are:
• Matrix Factorization [27]: One of the most used collaborative filtering techniques and the basis of a large number of recommender systems, including ours. We used the same number of latent factors as our model, which is 150.
• NeuMF [13]: State of the art collaborative filtering technique that combines Generalized Matrix Factorization [13] (GMF) and a Multi-Layer Perceptron [9] (MLP). We replaced its output layer with a dot product and used MSE as a loss function because we are working with ratings. We used three hidden layers for the MLP and 150 latent features for all embedding matrices.
• ItemPop [35]: Most popular item recommendation. Used to benchmark the recommendation performance.
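The MAP@K metric used in these comparisons can be sketched as follows. This is a minimal implementation; conventions for the normalizing denominator of AP@K vary slightly across libraries, and the one below (min(|relevant|, K)) is one common choice.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """AP@k for one user: mean of precision@i over the ranks i that hit a relevant item."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # precision at rank i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recs, all_rels, k=10):
    """MAP@k: mean of AP@k over all users."""
    aps = [average_precision_at_k(all_recs[u], all_rels[u], k) for u in all_rels]
    return sum(aps) / len(aps)
```

For example, recommending ["a", "b", "c"] to a user whose relevant set is {"a", "c"} gives hits at ranks 1 and 3, so AP@10 = (1/1 + 2/3) / 2 = 5/6.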
We present the results obtained with each model in Table 5. Our model yields an average MAP@10 of 0.1437, which is higher than all the other methods. It also has the benefit of being explainable. Furthermore, we validated our results with ANOVA [11] and Tukey [12] tests. All the p-values were lower than 0.01, meaning that our model performs significantly better than all the other models. Note that comparing our model to MF can be considered an ablation study that aims to prove the importance of the sequence model layers and the use of the content data: replacing the sequence model layers with an embedding layer reduces our model to MF.

4.5 RQ2: How does our model compare to SOTA hybrid song recommender systems?
The most related hybrid song recommender system we found is [30]. It applies MF-, Convolutional Neural Network [21] (CNN)- and MLP-based [9] models to play counts, audio spectrograms and artist biographies to generate recommendations. The dataset used is a subset of the MSD that overlaps with ours; it includes around 1M users and 328,821 songs. We compared our model directly to the results in [30], using the same evaluation process that they used. Although comparing two models on overlapping datasets is unconventional, the results can give us an idea of the ranges in which the ranking performances of the two models lie. The best performing configuration, MM-LF-LIN [30], presents a MAP@500 of 0.0036, which is significantly lower (ANOVA p-value < 0.01) than our average performance of 0.1438, as presented in Table 6.

Table 6: Comparison of SeER with MM-LF-LIN [30] on an overlapping dataset. Our model's performance is assessed with MAP@500 after 20 epochs with 5 replicates.
Replicate | SeER | MM-LF-LIN
1 | 0.1438 | 0.0036
2 | 0.1483 | -
3 | 0.1400 | -
4 | 0.1455 | -
5 | 0.1415 | -
Average | 0.1438 | 0.0036

4.6 RQ3: What is the importance of our use of the sequential content data?
We can assess the importance of using the sequential song content data as follows:
• First, as shown in RQ1, the content data helped improve the recommendation performance, since our model performs significantly better than pure rating-based MF.
• Second, the content data allowed us to solve the item cold start problem, because the item representation comes from both the song's MIDI content data and the user ratings. Thus, items with no ratings can be recommended by relying solely on their content.
• Finally, the sequential nature of our content data, in addition to the structure of our model, allowed us to generate 10-second instrumental explanations, making the recommendations more transparent. The explanations are evaluated in RQ4 below.

4.7 RQ4: Do the personalized explanations share similar characteristics that capture user preferences?
In order to validate the 10-second segment explanations offline, we tried to determine, for every user, whether their personalized explanations share common characteristics. Explanations that share common properties are likely to be generated based on captured hidden preferences of the user. Hence, these explanations may represent the most important sections of the recommended songs; in that case, the explanations are not mere artefacts. To study this property, we propose two approaches based on analysis of the song content similarities and of the tags, respectively.

4.7.1 Content-based validation. Given that we have the content of the explanations, we first relied on similarity measures to show that they share similar characteristics. We randomly selected 100 users

Table 7: Significance testing with 95% confidence of the difference between the Avg. DTW between explanations and the Avg. DTW between random segments: The explanations are significantly close to each other compared to the random segments. This means that the explanations capture and share common characteristics that are likely to represent the user's preferences.
Avg. DTW between explanations (DTWe) | Avg. DTW between random segments (DTWr) | 95% CI of the difference (DTWe - DTWr) | Adjusted p-value
22,844.1 | 24,820.9 | (182, 3,771) | 0.031

as our test sample. For every test user, we use our model to generate the top 5 recommendations with explanations and compute the average Dynamic Time Warping (DTW) [36] distance between the explanations (DTWe). DTW is a powerful distance measure between multidimensional time series that do not necessarily have the same size. To compute the average DTW distance between two lists of multidimensional time series, we compute the DTW distance matrix between them and take the average of all the values in the matrix. In the case of DTW distances between explanations, both lists are the same and include the song arrays of the 5 generated explanations. As a comparison, for every user we selected a random 10-second segment from every recommended song and computed the average DTW distance between these 5 segments (DTWr). We compare average DTW distances between 10-second segments, rather than between the whole recommended songs, to avoid any bias coming from the different song lengths. Finally, we considered the problem as a Randomized Complete Block Design (RCBD) [29] and applied a Tukey test [12] for pairwise comparison. The null hypothesis is that the averages over all users of the average DTW distances between the explanations and between the random segments are equal. For simplicity, we call these two quantities "Avg. DTW between explanations" (DTWe) and "Avg. DTW between random segments" (DTWr). We show these average values with the 95% Confidence Interval (CI) of the difference (DTWe - DTWr) and the statistical test results in Table 7. We notice that the Avg. DTW between explanations is significantly smaller than the Avg. DTW between random segments (p-value < 0.05 and 0 is not in the Confidence Interval).
This means that for each user, we can assert with 95% confidence that the explanations are significantly closer to each other than the random segments from the recommendations are. Thus, our generated 10-second segment explanations share common characteristics, which are likely to represent the preferences of the user.

4.7.2 Tag-based validation. Tags can capture an item's properties. In the case of songs, they can include genres, the era, the name of the artist or subjective emotional descriptions. We used the tags from the "Last.fm" dataset [7] provided with the MSD. These tags are available for almost every song in the MSD and amount to 522,366 tags [7]. In our dataset, we selected the songs that intersect with the "Last.fm" dataset and filtered the tags so as to keep only those occurring in at least 100 songs, in order to remove noisy tags. We obtained 4,659 songs with 227 tags. From the users that interacted with these songs, we kept the ones that have at least 10 liked songs; here we made the assumption that a rating strictly higher than 3 means that the user likes the song. Next, we randomly selected 100 users as our test sample. For every user, we determined the top 1, 2 and 3 preferred tags, based on the tags of their liked songs, and generated the top 5 recommendations with explanations. Our objective is to determine how well the personalized recommendations and explanations match the preferred tags of every user. Thus, we need to determine the tags of both the recommendations and the explanations, which are not necessarily in the tags dataset. To cope with this issue, we trained a multi-label classification model on our tags dataset to predict the tags of the recommendations and explanations.
The classification model is essentially a sequence model layer with 20% dropout, followed by Multilayer Perceptron (MLP) [32] layers with ReLU activation functions and an output layer with 227 nodes, corresponding to the 227 classes, each with a Sigmoid activation function. The model is trained to optimize the Binary Cross-entropy loss to predict the probability of each tag individually in every node [20]. To tune our model's hyperparameters, we started with an LSTM layer followed by the output layer. We tuned the size of the hidden state from 100 to 500 with an increment of 100. Then, we tuned the number of MLP hidden layers from 1 to 5. We chose the number of nodes in the hidden layers to be the optimal size of the hidden state, which is 300. Finally, we tuned the sequence model type of the first layer by additionally trying RNN and GRU. The best model has one LSTM layer with a hidden state size of 300, followed by 4 MLP layers of the same size and, finally, the output layer. We reached a performance of 93.4% accuracy and top-1, top-2 and top-3 categorical accuracies of 51.8%, 61.9% and 67.7%, respectively, with 5-fold cross-validation. We used top-k categorical accuracy [20] because we are interested in correctly predicting the existing tags in a sparse target. We used our trained classifier to predict the tags of all the recommendations and explanations for all the users. Then, we calculated the Average Percentage Matching of the recommendations and explanations with the top 1, 2 and 3 user preferred tags. We define the Percentage Matching of a list of songs S with the top k preferred tags T_k(u) of a user u ∈ U as the percentage of songs from S that include at least one of the top k preferred tags T_k(u), as follows:

%Matching(S, T_k(u)) = (100 / |S|) · |{ s ∈ S : Tags(s) ∩ T_k(u) ≠ ∅ }|    (2)

where Tags(s) is the set of tags of the song s.
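Equation (2) maps directly to code. A minimal sketch, with a plain tag lookup standing in for the classifier's predicted tags (the names and toy data are assumptions, not the paper's implementation); the toy values mirror the Table 9 pattern of one "pop"-only song among five:

```python
def percentage_matching(songs, top_k_tags, tags_of):
    """Eq. (2): percentage of songs in `songs` whose tag set intersects
    the user's top-k preferred tags. `tags_of` maps song -> set of tags."""
    if not songs:
        return 0.0
    hits = sum(1 for s in songs if tags_of[s] & set(top_k_tags))
    return 100.0 * hits / len(songs)

# Toy check: 5 recommendations, user's top-1 tag is "rock".
tags_of = {"r1": {"pop"}, "r2": {"pop", "rock"}, "r3": {"pop", "rock"},
           "r4": {"pop", "rock"}, "r5": {"pop", "rock"}}
recs = ["r1", "r2", "r3", "r4", "r5"]
print(percentage_matching(recs, ["rock"], tags_of))  # 4 of 5 songs match -> 80.0
```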
In our case, the set of tags of a recommendation or an explanation is predicted using the multi-label classification model. The Average Percentage Matching is the average of the Percentage Matching over all the test users:

Avg%Matching(S, U, k) = (1 / |U|) · Σ_{u=1}^{|U|} %Matching(S(u), T_k(u))    (3)

where S(u) in our case is either the set of recommendations or explanations of user u. We varied k and considered every problem as a RCBD [29]. We applied Tukey tests [12] for pairwise comparison. The null hypothesis for every test is that the average percentage matchings of the recommendations and of the explanations with the top k preferred tags (Avg%Matching(rec., U, k) and Avg%Matching(exp., U, k), respectively) are equal. We show the two average percentage matching values with the corresponding 90% CIs of the differences (Avg%Matching(rec., U, k) - Avg%Matching(exp., U, k)) and the adjusted p-values of the Tukey tests in Table 8.

Table 8: Significance testing with 90% confidence of the difference between the Avg % Matching of recommendations and explanations with user top k preferred tags.

k | Avg%Matching(rec., U, k) | Avg%Matching(exp., U, k) | 90% CI of the difference | Adjusted p-value
1 | 84.24% | 84.85% | (-0.01181, -0.00031) | 0.083
2 | 90.71% | 90.91% | (-0.00537, 0.00133) | 0.320
3 | 94.75% | 94.95% | (-0.00537, 0.00133) | 0.320

We notice that, for all k, the explanations match the preferred tags of the users more than the recommendations do. The difference is significant for k = 1 (the CI of the difference does not include 0 and the p-value < 0.1). However, starting from k = 2, the difference becomes insignificant as both the recommendations and explanations start matching the top k preferred tags comparably well, but still with a slight advantage for the explanations. This means that the explanations share similar properties which are more in accordance with the preferred tags of the users than even the overall recommendations. Assuming that the tags represent the genres, if the user's preferred genre is, for instance, "Rock" and a "Pop" song gets recommended, the explanation of that song is likely to be a "Rock" segment of the song. We show an illustrative example of a user from our test sample in Table 9.

Table 9: Example of a test user (#26647) where the explanations match the favorite tags more than the recommendations: The first recommended song is a "pop" song. However, the explainability segment is both "pop" and "rock", which matches the favorite tags of the user better than the recommendation itself.

Recommendation | Recommendation tags | Explanation tags
1 | pop | pop, rock
2 | pop, rock | pop, rock
3 | pop, rock | pop, rock
4 | pop, rock | pop, rock
5 | pop, rock | pop, rock

User top 3 tags (sorted): rock, pop, favorites

k | 1 | 2 | 3
%Matching(rec., T_k(u)) | 80% | 100% | 100%
%Matching(exp., T_k(u)) | 100% | 100% | 100%

5 CONCLUSION

We proposed a hybrid song recommender system that uses both ratings and song content to generate personalized recommendations accompanied by short MIDI segments as explanations. We made recommendation more transparent while relying on powerful deep learning models. Our experiments demonstrated that our architecture performs significantly better than both baseline and state-of-the-art hybrid song recommender systems. Moreover, we validated the effectiveness of the way we integrate the content data and solved the item cold start problem, which is a notorious limitation of Collaborative Filtering techniques.
Finally, we validated our explainability approach by showing that the personalized explanations are able to capture properties that are in accordance with the preferences of the user. Our approach has limitations, such as the slow training time and the user cold start problem. In the future, we plan to extend our methods to more complex and challenging modalities such as images and videos.

REFERENCES
[1] [n. d.]. Introduction to MIDI and Computer Music: The MIDI Standard. http://www.indiana.edu/~emusic/361/midi.htm. Accessed: 2019-03-11.
[2] Behnoush Abdollahi. 2017. Accurate and justifiable: new algorithms for explainable recommendations. Ph.D. Dissertation. University of Louisville.
[3] Behnoush Abdollahi and Olfa Nasraoui. 2017. Using Explainability for Constrained Matrix Factorization. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys '17). ACM, New York, NY, USA, 79–83. https://doi.org/10.1145/3109859.3109913
[4] Ivana Andjelkovic, Denis Parra, and John O'Donovan. 2018. Moodplay: Interactive Music Recommendation based on Artists' Mood Similarity. International Journal of Human-Computer Studies (2018). https://doi.org/10.1016/j.ijhcs.2018.04.004
[5] Trapit Bansal, David Belanger, and Andrew McCallum. 2016. Ask the GRU: Multi-Task Learning for Deep Text Recommendations. https://doi.org/10.1145/2959100.2959180 arXiv:1609.02116.
[6] Kirell Benzi, Vassilis Kalofolias, Xavier Bresson, and Pierre Vandergheynst. 2016. Song recommendation with non-negative matrix factorization and graph total variation. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), 2439–2443.
[7] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. 2011. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011).
[8] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 1724–1734. https://doi.org/10.3115/v1/D14-1179
[9] Yann Le Cun. 1988. A Theoretical Framework for Back-Propagation.
[10] Tim Donkers, Benedikt Loepp, and Jürgen Ziegler. 2017. Sequential User-based Recurrent Neural Network Recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys '17). ACM, New York, NY, USA, 152–160. https://doi.org/10.1145/3109859.3109877
[11] R.A. Fisher. 1925. Statistical methods for research workers. Edinburgh: Oliver & Boyd.
[12] Winston Haynes. 2013. Tukey's Test. Springer New York, New York, NY, 2303–2304. https://doi.org/10.1007/978-1-4419-9863-7_1212
[13] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 173–182. https://doi.org/10.1145/3038912.3052569
[14] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[15] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016. Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, New York, NY, USA, 241–248. https://doi.org/10.1145/2959100.2959167
[16] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[17] How Jing and Alexander J. Smola. 2017. Neural Survival Recommender. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17). ACM, New York, NY, USA, 515–524. https://doi.org/10.1145/3018661.3018719
[18] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). arXiv:1412.6980 http://arxiv.org/abs/1412.6980
[19] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (Aug. 2009), 30–37. https://doi.org/10.1109/MC.2009.263
[20] Maksim Lapin, Matthias Hein, and Bernt Schiele. 2017. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 7 (2017), 1533–1554.
[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE. 2278–2324.
[22] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. 1999. Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision. Springer, 319–345.
[23] E.L. Lehmann and G. Casella. 1998. Theory of Point Estimation. Springer Verlag.
[24] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. 2014. Efficient Mini-batch Training for Stochastic Optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 661–670. https://doi.org/10.1145/2623330.2623612
[25] Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural Rating Regression with Abstractive Tips Generation for Recommendation. CoRR abs/1708.00154 (2017). arXiv:1708.00154
[26] Zachary Chase Lipton. 2015. A Critical Review of Recurrent Neural Networks for Sequence Learning. CoRR abs/1506.00019 (2015).
[27] R. Mehta and K. Rana. 2017. A review on matrix factorization techniques in recommender systems. In 2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA). 269–274. https://doi.org/10.1109/CSCITA.2017.8066567
[28] The Echo Nest. [n. d.]. The Echo Nest Taste Profile Subset. https://labrosa.ee.columbia.edu/millionsong/tasteprofile.
[29] Donald M. Olsson. 1978. Randomized Complete Block Design. Journal of Quality Technology 10, 1 (1978), 40–41.
[30] Sergio Oramas, Oriol Nieto, Mohamed Sordo, and Xavier Serra. 2017. A Deep Multimodal Approach for Cold-start Music Recommendation. CoRR abs/1706.09739 (2017). arXiv:1706.09739
[31] Pandora. [n. d.]. Pandora. https://www.pandora.com/.
[32] Marius-Constantin Popescu, Valentina E. Balas, Liliana Perescu-Popescu, and Nikos Mastorakis. 2009. Multilayer Perceptron and Neural Networks. WSEAS Trans. Cir. and Sys. 8, 7 (July 2009), 579–588. http://dl.acm.org/citation.cfm?id=1639537.1639542
[33] Colin Raffel. [n. d.]. The Lakh MIDI Dataset v0.1. https://colinraffel.com/projects/lmd/.
[34] Colin Raffel. 2016. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. Dissertation. Columbia University.
[35] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI '09). AUAI Press, Arlington, Virginia, United States, 452–461. http://dl.acm.org/citation.cfm?id=1795114.1795167
[36] Stan Salvador and Philip Chan. 2007. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11, 5 (2007), 561–580.
[37] Elena Smirnova and Flavian Vasile. 2017. Contextual Sequence Modeling for Recommendation with Recurrent Neural Networks. CoRR abs/1706.07684 (2017). arXiv:1706.07684 http://arxiv.org/abs/1706.07684
[38] Spotify. [n. d.]. Spotify. https://www.spotify.com/us/.
[39] Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved Recurrent Neural Networks for Session-based Recommendations. CoRR abs/1606.08117 (2016). arXiv:1606.08117 http://arxiv.org/abs/1606.08117
[40] Andreu Vall and Gerhard Widmer. 2018. Machine Learning Approaches to Hybrid Music Recommender Systems. CoRR abs/1807.05858 (2018). http://arxiv.org/abs/1807.05858
[41] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in Neural Information Processing Systems. 2643–2651.
[42] Robin van Meteren. 2000. Using Content-Based Filtering for Recommendation.
[43] John Walker. 2004. MIDICSV. https://colinraffel.com/projects/lmd/.
[44] Xinxi Wang and Ye Wang. 2014. Improving Content-based and Hybrid Music Recommendation Using Deep Learning. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 627–636. https://doi.org/10.1145/2647868.2654940
[45] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, and Alexander J. Smola. 2017. Joint Training of Ratings and Reviews with Recurrent Recommender Networks (ICLR 2017).
[46] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. 2017. Recurrent Recommender Networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17). ACM, New York, NY, USA, 495–503. https://doi.org/10.1145/3018661.3018689
[47] Sai Wu, Weichao Ren, Chengchao Yu, Gang Chen, Dongxiang Zhang, and Jingbo Zhu. 2016. Personal recommendation using deep recurrent neural networks in NetEase. 2016 IEEE 32nd International Conference on Data Engineering (ICDE) (2016), 1218–1229.
[48] YouTube. [n. d.]. YouTube Music. https://music.youtube.com/.
[49] Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. CoRR abs/1707.07435 (2017). arXiv:1707.07435 http://arxiv.org/abs/1707.07435