(A) Data in the Life: Authorship Attribution of Lennon-McCartney Songs

Mark E. Glickman*
Department of Statistics, Harvard University
glickman@fas.harvard.edu

Jason I. Brown
Department of Mathematics and Statistics, Dalhousie University
jason.brown@dal.ca

Ryan B. Song
School of Engineering and Applied Sciences, Harvard University
ryan.b.song@gmail.com

* Address for correspondence: Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, MA 02138, USA. E-mail address: glickman@fas.harvard.edu. Jason Brown is supported by NSERC grant RGPIN 170450-2013. The authors would like to thank Xiao-Li Meng, Michael Jordan, David C. Hoaglin, and the three anonymous referees for their helpful comments.

Abstract

The songwriting duo of John Lennon and Paul McCartney, the two founding members of the Beatles, composed some of the most popular and memorable songs of the last century. Despite having authored songs under the joint credit agreement of Lennon-McCartney, it is well-documented that most of their songs or portions of songs were primarily written by exactly one of the two. Furthermore, the authorship of some Lennon-McCartney songs is in dispute, with the recollections of authorship based on previous interviews with Lennon and McCartney in conflict. For Lennon-McCartney songs of known and unknown authorship written and recorded over the period 1962-66, we extracted musical features from each song or song portion. These features consist of the occurrence of melodic notes, chords, melodic note pairs, chord change pairs, and four-note melody contours. We developed a prediction model based on variable screening followed by logistic regression with elastic net regularization. Out-of-sample classification accuracy for songs with known authorship was 76%, with a c-statistic from an ROC analysis of 83.7%.
We applied our model to the prediction of songs and song portions with unknown or disputed authorship.

Key words: authorship, elastic net, logistic regression, music, regularization, stylometry, variable screening

1 Introduction

The Beatles are arguably one of the most influential music groups of all time, having sold over 600 million albums worldwide. Beyond the initial mania that accompanied their introduction to the UK and Europe in 1962-63, and subsequently to the United States in early 1964, the Beatles continue to have a lasting musical and cultural influence. The group has been the focus of academic research to an extent that rivals most classical composers. Heuger (2018) has been maintaining a bibliography that contains over 500 entries devoted to academic research on the Beatles. Some recent examples of scientific study of Beatles music include Cathé (2016), who applied harmonic vectors theory to Beatles songs; Wagner (2003), who analyzed the presence of blues motifs in Beatles music; and Brown (2004), who used Fourier analysis to determine the true arrangement and instrumentation of the opening chord of “A Hard Day’s Night.”

The songwriting duo of John Lennon and Paul McCartney took the writing credits for most recorded Beatles songs. The two agreed prior to the Beatles’ formation that all songs written by the two of them, either together or individually, would be credited to the partnership “Lennon-McCartney.” After the Beatles broke up in 1970 and the Lennon-McCartney partnership dissolved, Lennon and McCartney attempted to clarify their contributions to their jointly credited songs. Most often, individual songs were acknowledged to be written entirely by either McCartney or Lennon, though in some cases one would write most of a song and the other would contribute small portions or sections of the song.
Compton (1988) provided a fairly complete accounting of the actual authorships of Lennon-McCartney songs to the extent they are known through interviews with each of Lennon and McCartney. According to this listing, several songs are of disputed musical authorship. Some examples include the entire songs “Misery,” “Do You Want to Know a Secret?,” “Wait,” and “In My Life.” Womack (2007) provided an interesting account of the discrepancy in Lennon and McCartney’s recollection of the authorship of “In My Life” in particular: Lennon wrote the lyrics, McCartney asserted that he wrote all of the music, and Lennon claimed that McCartney’s only contribution was helping with the middle eight melody. Given that further direct questioning about these songs is unlikely to reveal their true author, it is an open question whether musical features of Lennon-McCartney songs could provide statistical evidence of whether Lennon or McCartney wrote a given song.

The idea of using statistical models to predict authorship has been around for over half a century. In one of the first successful attempts at modeling word frequencies, Mosteller and Wallace (1963, 1984) used Bayesian classification models to infer that James Madison wrote all of the 12 disputed Federalist papers. Other works related to authorship attribution include Efron and Thisted (1976) and Thisted and Efron (1987), who address questions related to Shakespeare’s writing, and Airoldi, Anderson, Fienberg, and Skinner (2006), who examine authorship attribution of Ronald Reagan’s radio addresses. Typical text analysis relies on constructing word histograms, and then modeling authorship as a function of word frequencies. Basic background on the analysis and modeling of word frequencies can be found in Manning and Schütze (1999), and these models applied to text authorship attribution can be found in Clement and Sharp (2003) and Malyutov (2005).
This paper is concerned with using harmonic and melodic information from the corpus of Lennon-McCartney songs from the first part of the Beatles’ career to infer authorship of songs by John Lennon and Paul McCartney. It is not unreasonable to assume that Lennon and McCartney songs are distinguishable through musical features. For example, both McCormick (1998) and Hartzog (2016) observed that Lennon songs have melodies that tend not to vary substantially in pitch (illustrative examples include “I Am the Walrus” and “Across the Universe”), whereas McCartney songs tend to have melodies with larger pitch changes (e.g., “Hey Jude” and “Oh Darling”). However, such anecdotal observations may not sufficiently characterize distinctions between Lennon and McCartney; a more scientific approach is necessary. Our analyses attempt to capture distinguishing musical features through a statistical approach.

Previous work applying quantitative methods to distinguish Lennon and McCartney songs is limited. Whissell (1996) performed a stylometric analysis of Beatles songs based on lyric content via text analyses to characterize the emotional differences between Lennon and McCartney over time. An unpublished paper by McDougal (2013) performed a traditional text analysis using word count methods to compare Lennon and McCartney’s lyric usage, and supplemented the text analysis with auditory-derived information from the program The Echo Nest (the.echonest.com).

More generally, a variety of statistical methods for inferring authorship from musical information have been published. Cilibrasi, Vitányi, and De Wolf (2004) and Naccache, Borgi, and Ghédira (2008) used Musical Instrument Digital Interface (MIDI) encoding of songs, which contains information on the pitch values, intervals, note durations, and instruments to perform distance-based clustering.
Dubnov, Assayag, Lartillot, and Bejerano (2003) developed methods to segment music using incremental parsing applied to MIDI files in order to learn stylistic aspects of music representation. Conklin (2006) also introduced representing melody as a sequence of segments, and modeled musical style through this representation. A different approach was taken by George and Shamir (2014), who converted song data into two-dimensional spectrograms, and used these representations as a means to cluster songs.

Our approach to musical authorship attribution is most closely related to methods applied to genome expression studies and other areas in which the number of predictors is considerably larger than the sample size. In a musical context, we reduce each song to a vector of binary variables indicating the occurrences of specified local musical features. We derive the features based on the entire set of chords that can be played (harmonic content) and the entire set of notes that can be sung by the lead singer (melodic content). From the point of view of melodic sequences of notes or harmonic sequences of chords behaving like text in a document, individual notes and individual chords can be understood as 1-gram representations. The occurrence of individual chords and individual notes form an essential part of a reduction in a song’s musical content. To increase the richness of the representation, we also consider 2-gram representations of chord and melodic sequences. That is, we record the occurrence of pairs of consecutive notes and pairs of consecutive chords as individual binary variables. Rather than considering larger n-gram sequences (with n > 2) as a unit of analysis, we extract local contour information of melodic sequences indicating the local shape of the melody line to be a fifth set of variables to represent local features within a song.
Using occurrences of pitches in the sung melodies, chords, pitch transitions, harmonic transitions, and contour information of Lennon-McCartney songs with known authorship permits modeling of song authorship as a function of musical content.

We developed our modeling approach as a two-step algorithm. First, we kept only musical features that have a sufficiently strong bivariate association with authorship, an application of sure independence screening (Fan, 2007; Fan & Lv, 2008). With the features that remained, we then modeled the authorship attribution as a logistic regression, but estimated the model parameters using elastic net regularization (Friedman, Hastie, & Tibshirani, 2010; Zou & Hastie, 2005), an approach that flexibly constrains the average log-likelihood by a convex combination of a ridge penalty (Le Cessie & Van Houwelingen, 1992) and a lasso penalty (Tibshirani, 1996, 2011). Many other approaches to regularization are possible. For example, Kempfert and Wong (2018), who predict the authorship of Haydn versus Mozart string quartets based on musical features, select their model through subset selection on the Bayesian information criterion (BIC) statistic.

This paper proceeds as follows. We describe the background of the song data collection and formation in Section 2. This is followed in Section 3 by the development of a model for authorship attribution based on a variable screening procedure followed by elastic net logistic regression. The application of the modeling approach is described in Section 4 where we summarize the fit of the model to the corpus of Lennon-McCartney songs of known authorship, and apply the model results for predicting songs of disputed authorship. The paper concludes in Section 5 with a discussion of the utility of our approach to wider musical settings.
We provide relevant background on musical notes, scales, note intervals, chords, and song structure in Appendix A.

2 Song Data

The data used in our analyses consist of melodic and harmonic information based on Lennon-McCartney songs that were written between 1962 and 1966. This period of Beatles music is during the years they toured and occurred before the band’s activities centered on studio productions when their songwriting approaches likely changed significantly. The songs we included in our analyses were from the original UK-released albums Please Please Me, With the Beatles, A Hard Day’s Night, Beatles for Sale, Help!, Rubber Soul, and Revolver, as well as all the singles from the same era that were not present in any of these albums. The essential reference for both the melodic and harmonic content of the songs was Fujita, Hagino, Kubo, and Sato (1993), although the Isophonics online database of chords for The Beatles songs (http://isophonics.net/content/reference-annotations) provided additional points of reference for each song.

The authorship of each Lennon-McCartney song, or whether the authorship credit was in dispute, has been documented in Compton (1988), though for some songs we have found other documentation of song authorship. Aside from recording whether entire songs were written by Lennon versus McCartney, Compton also notes that in many cases songs had multiple sections with possibly different authors. For example, the song “We Can Work It Out” is credited to McCartney as the author, though the bridge section starting with the lyric “Life is very short...” was written by Lennon. In our analyses, we treat these sections as two different units of analysis with different authors.
Furthermore, several songs that were acknowledged as full collaborations between Lennon and McCartney were excluded from the corpus of known authorship from which we develop our prediction algorithm. The song “The Word” is such an example of a full collaboration. It is plausible that some of the disputed songs were actually collaborations, but the current information about the songs did not permit these joint attributions. The total number of Lennon-McCartney songs or portions of songs with an undisputed individual author (Lennon or McCartney) was 70. Eight songs or portions of songs in this period were of disputed authorship.

Our process was to manually code each song’s harmonic (chord) and melodic progressions. The song content that serves as the input to our modeling strategy is a set of representations of simple melodic and harmonic patterns within each song in the form of category indicators. That is, we let each song be represented by a vector of binary variables within the song, where each variable is the presence/absence of a musical feature that could occur in the song. We describe these representations in more detail below. The process to obtain these category indicators involves converting each song’s melodic and harmonic content into a usable form. Melody lines were partitioned into phrases which were typically book-ended by rests (silence).

An alternative approach would have been to model counts of musical features within songs, which is much more in line with authorship attribution analysis for text documents. A crucial difficulty with this approach is how to address repeated phrases (e.g., verses, choruses) within a song. As an extreme example that is not part of our sample, consider the later-Beatles period McCartney-written song “Hey Jude.” The “na na na” fadeout, which lasts roughly four minutes on the recording, is repeated 19 times (Everett, 1999).
Keeping these repeated occurrences would likely over-represent the musical ideas suggested by the phrase. We explored models in which feature counts were incorporated, including versions where the counts were capped at an upper limit (i.e., winsorizing the larger counts), and versions involving the transformation of counts to the log scale, but these approaches resulted in worse predictability than our final model. The use of whether a musical feature was present in a song produced better discriminatory power in authorship predictions.

The key of every song was standardized relative to the tonic for songs in a major key, and to the relative major (up a minor third) for songs in a minor key. If a key change occurred in the middle of the song, the harmonic and melodic information from that point onward would be standardized to the modulated key.

We constructed five different sets of musical features within each song as follows based on processed melodic and harmonic data for the collection of songs. The first set of features was chord types. Seven diatonic chords, that is, I, ii, iii, IV, V, vi, vii, which are conventionally the building blocks for most popular Western music, were their own categories. The true diatonic chord on the seventh note of the scale is a diminished chord, which was only used once, in “You Won’t See Me,” while the minor vii was used more often. We therefore took the liberty of using the minor vii instead as our “diatonic” chord on the seventh. Because diminished and augmented chords were used rarely in general, we collapsed all occurrences of non-diatonic major chords along with augmented chords into a single category, and non-diatonic minor chords along with diminished chords into a single category. This resulted in a total of 9 categories.
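As an illustration of the key standardization and the nine chord categories just described, the following Python sketch maps a chord to its category. It is not the authors' code (the coding was done manually), and several details are assumptions made for the example: pitch classes are numbered 0 to 11 with C = 0, chord quality is given as a plain string, the minor vii is taken to be rooted 11 semitones above the tonic, and the function and category names are hypothetical.

```python
# Root (in semitones above the major-key tonic) and triad quality of the
# seven chords treated as "diatonic" in the text. The minor vii replaces
# the true diminished chord on the seventh scale degree.
# ASSUMPTION: the minor vii is rooted 11 semitones above the tonic.
DIATONIC = {
    (0, "major"): "I",
    (2, "minor"): "ii",
    (4, "minor"): "iii",
    (5, "major"): "IV",
    (7, "major"): "V",
    (9, "minor"): "vi",
    (11, "minor"): "vii",
}


def chord_category(root_pc, quality, tonic_pc, minor_key=False):
    """Return one of the 9 chord categories (hypothetical helper).

    root_pc and tonic_pc are pitch classes 0-11 (C = 0); quality is one of
    "major", "minor", "augmented", "diminished".
    """
    # A minor-key song is standardized to its relative major, whose tonic
    # lies a minor third (3 semitones) above the minor tonic.
    reference = (tonic_pc + 3) % 12 if minor_key else tonic_pc
    degree = (root_pc - reference) % 12
    if (degree, quality) in DIATONIC:
        return DIATONIC[(degree, quality)]
    # Non-diatonic major chords are pooled with augmented chords, and
    # non-diatonic minor chords with diminished chords.
    if quality in ("major", "augmented"):
        return "other major/augmented"
    if quality in ("minor", "diminished"):
        return "other minor/diminished"
    raise ValueError(f"unknown chord quality: {quality!r}")
```

For instance, under these assumptions a G major chord in a song in C major falls in category V, while a D major chord in the same song falls in the pooled non-diatonic major/augmented category.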
We explored other category divisions, including fewer instances of collapsed categories, but the sparsity of the data across the non-diatonic, augmented, and diminished chords resulted in less reliable predictability. Additionally, we decided to group all seventh and extended chords (e.g., ninth chord, eleventh chord) with their unaltered triad counterparts.

The second set of features consisted of melodic notes. The octave in which a melodic note was sung was ignored in the construction, so that the number of note categories totaled 12 (the number of pitch classes on the chromatic scale).

The third set of features comprised chord transitions, that is, pairs of consecutive chords. As with individual chord categories, considering all combinations of chord transitions would have resulted in an unnecessarily large number of sparsely counted categories. We collapsed the chord categories as follows. Each transition among the tonic, sub-dominant (major fourth), and dominant (major fifth) was its own category. Every other transition from a diatonic chord to another diatonic chord, regardless of the order of the two chords, was its own category. For example, transitions from ii to V were grouped with transitions from V to ii. Transitions that involved the tonic and any non-diatonic chord were grouped into one category, and transitions that involved the dominant and any non-diatonic chord were also all grouped into one category. Chord transitions starting with any non-diatonic chord and ending with a diatonic chord (other than the tonic or dominant) formed their own category, as did chord transitions ending with any non-diatonic chord and starting with a diatonic chord (other than the tonic or dominant). Finally, all chord transitions between two non-diatonic chords fell under one category.
The chord transition categories totaled 24 with these raw category collapsings. Empty categories from the canon of songs were ignored.

The fourth set of features involved melodic note transitions as pairs of notes. In contrast to the single melodic note categories, we considered the octave of the second note in the pair. Thus, each melodic note in a pair could be in a three-octave range. In addition, we treated the start and end rest of a phrase as a note in constructing note transition categories. Thus a single note at the start or at the end of a phrase was each treated as a note transition. Each start of a phrase on any diatonic note was its own category, and each end of a phrase on any diatonic note was its own category. Every transition from or to the tonic by a note on the diatonic scale was its own category. Any transition from a pitch on the diatonic or pentatonic scale (which includes the flat 3 and flat 7) to another pitch on the diatonic or pentatonic scale, including the same pitch, was its own category, regardless of octave. Upward movements by 2, 3, 4, or 5 notes on the diatonic scale were individual categories, and the corresponding downward movements were their own categories.

We performed a greater amount of collapsing of categories of melodic transitions when at least one note in the transition was not on the diatonic scale. All transitions between the two same non-diatonic notes (excluding the flat 3 and flat 7) were collapsed into the same category. All melodic phrases starting on a non-diatonic note were collapsed into the same category, and all melodic phrases ending on a non-diatonic note were collapsed into the same category. A semitone upward or downward movement from a diatonic note to a non-diatonic note formed two distinct categories, as did a semitone upward or downward movement from a non-diatonic note to a diatonic note.
All upward movements of at least two semitones involving a non-diatonic note were collapsed into the same category, and all downward movements of at least two semitones were collapsed into the same category. The total number of nonempty categories of melodic transitions under this collapsing scheme was 65.

It is worth noting that we had also considered an alternative set of melodic transition variables. These were based to a large extent on grouping upward and downward movements by the size of the interval, but without regard to the musical function of the transition. We feel that the main groupings described above are arguably more musically justifiable because they are more directly connected to the pitches within transition pairs rather than pitch distances.

The last set of features captured local contours in the melodic line of a song. Every consecutive 4-note subset within a melodic phrase (between its start and end) was partitioned into one of 27 different categories according to the direction of each consecutive pair of notes. For each of the three pairs of consecutive notes in a 4-note melodic sequence, the transitions could be up, down or same if the melodic notes moved up, down, or stayed the same. Because each consecutive pair across the 4-note sequences allowed three possibilities, the representation consisted of 3 × 3 × 3 = 27 categories. Longer contours (consecutive note subsets of 5 or more notes) would provide greater contour detail, but the number of implied categories would create difficulties in model fitting especially with the relatively low number of songs to use for model-building. The contour representation is an attempt to characterize local features in the melodic line beyond 2-gram representations but without the same level of detail.

The five sets of musical features together total 137 binary variables for each song.
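The contour representation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' procedure: it assumes each phrase is supplied as a list of numeric pitches (for example MIDI note numbers), and the function names are hypothetical. The dictionary of feature-set sizes simply restates the counts given in the text.

```python
from itertools import product

# Sizes of the five feature sets as stated in the text.
FEATURE_SET_SIZES = {
    "chords": 9,
    "melodic notes": 12,
    "chord transitions": 24,
    "note transitions": 65,
    "contours": 27,
}

# All 3 x 3 x 3 = 27 possible contour categories.
ALL_CONTOURS = list(product(("up", "down", "same"), repeat=3))


def step_direction(first, second):
    """Direction of movement between two consecutive pitches."""
    if second > first:
        return "up"
    if second < first:
        return "down"
    return "same"


def phrase_contours(pitches):
    """Set of contour categories found in one melodic phrase.

    ASSUMPTION: `pitches` is a list of numeric pitches such as MIDI numbers.
    Every window of 4 consecutive notes yields one (d1, d2, d3) contour.
    """
    found = set()
    for start in range(len(pitches) - 3):
        window = pitches[start:start + 4]
        found.add(tuple(step_direction(window[k], window[k + 1])
                        for k in range(3)))
    return found


def contour_indicators(phrases):
    """Binary presence/absence indicator for each of the 27 contours."""
    present = set()
    for phrase in phrases:
        present |= phrase_contours(phrase)
    return {contour: int(contour in present) for contour in ALL_CONTOURS}
```

A phrase with fewer than four notes contributes no contour, and summing the five feature-set sizes reproduces the 137 binary variables per song.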
Our modeling approach, which relies mainly on cross-validating regularized logistic regression, can result in prediction instability when a feature is shared by very few or very many songs. We therefore removed 16 features in which five or fewer songs contained the feature, or where 66 or more songs (out of 70) contained the feature. The features shared by 66 or more songs included the tonic chord; melodic notes that included the tonic, second, third and fifth; and the 4-note contour (up, down, down). The features shared by five or fewer songs consisted of the minor seventh chord, chord transition from iii to V, upward and downward melodic transitions by 5 notes on the diatonic scale, repeated flat 3 notes, other repeated non-diatonic notes, upward melodic transition from flat 7 to flat 3, melodic transition between flat 3 and fifth, and melodic transition from flat 7 to fourth. With these exclusions, our analyses used a total of 121 musical features.

We display the most common musical characteristics by category, after the exclusions, in Table 1. Major 4th and major 5th chords are the most common among the 70 songs (after the tonic), and the melodic notes of a 4th and 6th are also common. These notes and chords are understood to be the building blocks of popular Western music. The chord transition from major 5th to tonic is also a common chord change in popular music, is well-represented in early Lennon-McCartney songs, and is often utilized as a harmonic phrase resolution. The most common melody note transitions stay on the diatonic scale, which again is in keeping with Western songwriting. Finally, the two contours listed in Table 1 are both simple shapes in the melodic line.
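The exclusion rule for very rare and very common features can be sketched as below. This is a minimal Python illustration, assuming the data are held as a list of songs, each a list of 0/1 indicators; the function name and argument names are hypothetical, and the cutoffs default to the values stated in the text (five or fewer, 66 or more).

```python
def drop_extreme_features(X, names, low=5, high=66):
    """Drop features present in `low` or fewer songs or `high` or more songs.

    X is a list of songs, each a list of 0/1 feature indicators, and
    `names` labels the columns. Returns the reduced matrix and names.
    """
    # Number of songs containing each feature (column sums).
    counts = [sum(column) for column in zip(*X)]
    keep = [j for j, count in enumerate(counts) if low < count < high]
    reduced = [[row[j] for j in keep] for row in X]
    return reduced, [names[j] for j in keep]
```

Applied to the 137 coded features, a rule of this form is what leaves the 121 features used in the analyses (137 − 16 = 121).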
Representation      Features
------------------  ------------------------------------------------------------
Chords              Major 4th (64), Major 5th (63)
Melodic notes       4th (62), 6th (63)
Chord transitions   Major 5th to Tonic (61)
Note transitions    Downward transition of one note on the diatonic scale (62),
                    Upward transition of one note on the diatonic scale (60)
Contours            (down, down, down) (61), (down, down, up) (62)

Table 1. Musical features among the 121 that occurred in 60 or more of the 70 songs with known authorship, after eliminating features occurring in 65 or more songs. Numbers in parentheses indicate the number of songs with the listed feature.

3 A model for songwriter attribution

Our approach to modeling authorship involved a two-step process. First, we selected a subset of the 121 musical features that each had a sufficiently strong bivariate association with authorship. Second, conditional on the selected features, we modeled authorship using logistic regression regularized via elastic net penalization (Zou & Hastie, 2005) with tuning parameters optimized by cross-validation. The latter process was implemented in the R package glmnet (Friedman et al., 2010). We describe each step in more detail below.

For song $i$, $i = 1, \ldots, n$, where $n$ is the number of songs with known authorship in the training data, let

$$
y_i =
\begin{cases}
0 & \text{if song } i \text{ was written by John Lennon} \\
1 & \text{if song } i \text{ was written by Paul McCartney.}
\end{cases}
\tag{1}
$$

We let $y = (y_1, \ldots, y_n)'$ denote the vector of binary authorship indicators. For $j = 1, \ldots, J$, where $J$ is the total number of dichotomized musical features, let for each $i = 1, \ldots, n$,

$$
x_{ij} =
\begin{cases}
0 & \text{if feature } j \text{ is not observed in song } i \\
1 & \text{if feature } j \text{ is observed in song } i.
\end{cases}
\tag{2}
$$

We let $X$ denote the $n \times J$ matrix with elements $x_{ij}$, and let $X_j$ denote the $j$-th column of $X$.

The first step of our procedure is to determine a subset of the index set $\{1, 2, \ldots, J\}$ in which $X_j$ is sufficiently associated with authorship.
This can be accomplished by computing odds ratios of the $j$-th binary feature with authorship and retaining features with an odds ratio (or its reciprocal) above a specified threshold. Equivalently, the selection can be performed by retaining features in which tests for significant odds ratios have $p$-values below a specified level. This pre-processing of features, known as sure independence screening (SIS), has been developed and explored by Fan (2007), Fan and Lv (2008), and Fan and Song (2010). SIS is more typically employed in settings with a massive number of predictors, but in our setting provides a crude but effective way of reducing the number of features in our final model. Our final model evaluations exhibit better out-of-sample accuracy including SIS as a pre-processing step to modeling than omitting this step, as we describe in Section 4.

To implement SIS in our setting, we computed a $p$-value of a Pearson chi-squared test for each $j = 1, \ldots, J$, for the significance of the odds ratio in a 2 × 2 contingency table constructed from $y$ and $X_j$. When the elements of any of the contingency tables has low counts, the odds ratio estimate is unstable. The reference distribution for such settings is poorly approximated by a chi-squared distribution, so we instead simulated test statistics 10,000 times from the null distribution according to Hope (1968) to obtain more reliable $p$-values. This procedure is implemented in the chisq.test function in base R. The $p$-value for each test was then compared to a pre-specified significance level to determine inclusion for modeling. See below for a detailed discussion about the specified significance level.

Suppose as a result of the variable screening we retained $K$ variables, renumbered $1, \ldots, K$.
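The screening step above can be sketched in Python as follows. The paper used the chisq.test function in R; this sketch is a stand-in, not that implementation. It assumes that shuffling the authorship labels is an acceptable way to draw 2 × 2 tables with fixed margins under the null hypothesis, and it uses the Monte Carlo p-value (1 + number of simulated statistics at least as large as the observed one) / (B + 1). All function names, the seed handling, and the small comparison tolerance are choices made for the example.

```python
import random


def chi2_stat(x, y):
    """Pearson chi-squared statistic (no continuity correction) for the
    2 x 2 table of two binary sequences x and y."""
    n = len(x)
    obs = [[0, 0], [0, 0]]
    for xi, yi in zip(x, y):
        obs[xi][yi] += 1
    row = [sum(obs[0]), sum(obs[1])]
    col = [obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]]
    stat = 0.0
    for r in (0, 1):
        for c in (0, 1):
            expected = row[r] * col[c] / n
            if expected > 0:  # skip cells in a degenerate (empty) margin
                stat += (obs[r][c] - expected) ** 2 / expected
    return stat


def simulated_p_value(x, y, B=10000, seed=0):
    """Monte Carlo p-value from B label shuffles (margins stay fixed)."""
    rng = random.Random(seed)
    observed = chi2_stat(x, y)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(B):
        rng.shuffle(shuffled)
        # Small relative tolerance guards against floating-point ties.
        if chi2_stat(x, shuffled) >= observed * (1 - 1e-7):
            at_least_as_large += 1
    return (1 + at_least_as_large) / (B + 1)


def screen_features(X, y, threshold, B=10000, seed=0):
    """Indices of columns of X whose p-value is at or below `threshold`.
    A threshold of 1.0 therefore retains every feature."""
    columns = list(zip(*X))
    return [j for j, column in enumerate(columns)
            if simulated_p_value(column, y, B, seed) <= threshold]
```

With 10,000 simulations the smallest attainable p-value is 1/10,001, so the resolution is more than adequate for the screening thresholds considered in the paper.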
The second step of the procedure involves a logistic regression model of the form

$$
p_i = \Pr(y_i = 1 \mid x_i, \beta_0, \beta) = \frac{1}{1 + \exp\left(-(\beta_0 + x_i'\beta)\right)}
\tag{3}
$$

where $x_i = (x_{i1}, \ldots, x_{iK})'$, and with model parameters $\beta_0$ and $\beta = (\beta_1, \ldots, \beta_K)'$. Given the possibly large number of musical features compared to the number of songs in our data set, we fit our logistic regression model through elastic net regularization. Letting

$$
\ell(\beta_0, \beta \mid y, X^*) = \sum_{i=1}^{n} \left( y_i \log p_i + (1 - y_i) \log(1 - p_i) \right)
\tag{4}
$$

be the log-likelihood of the model parameters, where $X^*$ is the $n \times K$ matrix of $x_{ij}$ retained from variable screening, elastic net regularization seeks to find estimates of $\beta_0$ and $\beta$, conditional on $\alpha$ and $\lambda$, that minimize

$$
f_{EN}(\beta_0, \beta \mid y, X^*, \alpha, \lambda) = -\frac{1}{n}\,\ell(\beta_0, \beta \mid y, X^*) + \lambda \left( (1 - \alpha)\frac{\|\beta\|_2^2}{2} + \alpha \|\beta\|_1 \right)
\tag{5}
$$

where $\|\beta\|_2^2 = \sum_{j=1}^{K} \beta_j^2$ and $\|\beta\|_1 = \sum_{j=1}^{K} |\beta_j|$, and $\lambda \ge 0$ and $0 \le \alpha \le 1$ are tuning parameters. When $\alpha = 0$, regularization is of the form of a ridge ($L_2$) penalty, and when $\alpha = 1$ the logistic regression is fit with a Lasso ($L_1$) penalty.

Optimization of the elastic net logistic regression parameters proceeds as follows. We consider the equally-spaced grid of values for $\alpha$ in $\{0.0, 0.1, \ldots, 1.0\}$. For each candidate value of $\alpha$, we consider 100 candidate values of $\lambda$. The choice of these candidate values is described in Friedman et al. (2010). For these 11 × 100 = 1100 candidate pairs $(\alpha, \lambda)$, we perform 5-fold cross-validation using the negative log-likelihoods evaluated at the withheld fold. Each fold is constructed by sampling songs stratified by author so that approximately 20% of Lennon and 20% of McCartney songs are contained in each fold. This approach preserves the balance in authorship within fold relative to the overall sample.
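To make equations (3) through (5) and the stratified fold construction concrete, here is a small Python sketch that evaluates the elastic net objective for given parameters and assigns songs to folds. It only evaluates the objective; it does not perform the minimization, which the paper carried out with the R package glmnet. The function names, the round-robin fold assignment, and the seed are assumptions made for illustration.

```python
import math
import random

# The 11 equally spaced candidate values of alpha: 0.0, 0.1, ..., 1.0.
CANDIDATE_ALPHAS = [k / 10 for k in range(11)]


def predicted_prob(beta0, beta, x):
    """Equation (3): probability that McCartney wrote the song."""
    eta = beta0 + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(beta0, beta, X, y):
    """Equation (4): Bernoulli log-likelihood over all songs."""
    total = 0.0
    for x, yi in zip(X, y):
        p = predicted_prob(beta0, beta, x)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def elastic_net_objective(beta0, beta, X, y, alpha, lam):
    """Equation (5): average negative log-likelihood plus the penalty."""
    n = len(y)
    ridge = sum(b * b for b in beta) / 2.0   # ||beta||_2^2 / 2
    lasso = sum(abs(b) for b in beta)        # ||beta||_1
    penalty = lam * ((1 - alpha) * ridge + alpha * lasso)
    return -log_likelihood(beta0, beta, X, y) / n + penalty


def stratified_folds(y, n_folds=5, seed=0):
    """Assign each song a fold label so that every fold holds roughly
    1/n_folds of each author's songs (hypothetical round-robin scheme)."""
    rng = random.Random(seed)
    fold = [None] * len(y)
    for author in (0, 1):
        indices = [i for i, yi in enumerate(y) if yi == author]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            fold[i] = position % n_folds
    return fold
```

Note that the intercept is not penalized, matching equation (5), and that with all coefficients set to zero the objective equals log 2 regardless of the tuning parameters.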
We choose the minimizing pair of $\alpha$ and $\lambda$, and then minimize the target function in (5) over the coefficients $\beta_0$ and $\beta$. Zou and Hastie (2005) argued for considering the selection of $\lambda$ based on a 1 standard error rule commonly used in regularization procedures, but we found in our application that choosing the minimum value resulted in better predictability.

A natural extension to regularized logistic regression is to include interactions among the predictors. Among the difficulties of including all interaction terms in a regularized regression is that the likely higher degree of sparsity among the interactions compared with the individual features makes it difficult to identify the important interactions. Furthermore, high correlations among the variables can negatively impact selection. Work aimed at discovering important interactions in a more principled manner has been explored. Ruczinski, Kooperberg, and LeBlanc (2003, 2004) developed logic regression, a procedure that finds Boolean combinations of binary predictors in an approach similar to Bayesian CART (Chipman, George, & McCulloch, 1998). Logic regression prevents overfitting through the reduction of model complexity in growing the number of Boolean combinations that are formed. Procedures such as those by Bien, Taylor, and Tibshirani (2013) and Lim and Hastie (2015) involve building interactions only when the main effect terms are selected, and this is carried out by taking advantage of the group-lasso (Yuan & Lin, 2006). We explored these extensions to our approach, based on having already eliminated the rarely-occurring or frequently-occurring musical features, but found that out-of-sample predictability was worse than using only the additive effects of our features. An argument could be made that including interactions would better account for sets of musical features that are highly correlated.
However, the extra flexibility associated with including interactions results in greater variance in the predictions that degrades our model's performance.

Rather than specifying a single significance level threshold for variable screening followed by regularized logistic regression, our selection procedure considered five different significance level thresholds: 1.0 (no variable screening), 0.75, 0.50, 0.25, and 0.10. We discuss in Section 5 the rationale for only four additional thresholds. We performed leave-one-out cross-validation in the following manner to choose the best threshold. Let $X_{(i)}$ and $y_{(i)}$ denote the predictor matrix and response vector with observation $i$ deleted. First, for a fixed threshold $t \in \{1.0, 0.75, 0.50, 0.25, 0.10\}$, we performed variable screening on $X_{(i)}$ followed by fitting elastic net logistic regression of $y_{(i)}$ based on the retained features (with 5-fold cross-validation within the $n - 1$ songs to obtain the elastic net parameter estimates). The out-of-sample predicted probability $\hat{p}_i^{(t)}$ for observation $i$ and threshold $t$ is then computed given $x_i$ from the fitted logistic regression. The negative log-likelihood for threshold $t$ is computed as

$$LL(t) = -\sum_{i=1}^{n} \left( y_i \log \hat{p}_i^{(t)} + (1 - y_i) \log(1 - \hat{p}_i^{(t)}) \right). \tag{6}$$

The threshold $t = t_{opt}$ with the minimum $LL(t)$ is the one chosen by this procedure. Once $t_{opt}$ is determined, variable screening is performed using this threshold based on all $n$ observations followed by performing regularized logistic regression on the remaining features.

4 Model implementation and results

We applied our approach to authorship attribution developed in Section 3 to the corpus of 70 Lennon-McCartney songs based on the musical features described in Section 2. We first describe model summaries applied to the 70 Lennon-McCartney songs in the training data.
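As a brief aside on the threshold selection rule in (6), the sketch below computes $LL(t)$ for each candidate threshold and picks the minimizer. It assumes the leave-one-out probabilities have already been produced by the screening and elastic net steps; the labels (coded here as 1 for McCartney and 0 for Lennon, an assumption) and the probabilities are invented for illustration, and the function names are hypothetical.

```python
import math

def negative_log_likelihood(y, p_hat):
    """LL(t) from equation (6), given leave-one-out probabilities for one threshold."""
    return -sum(
        yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
        for yi, pi in zip(y, p_hat)
    )

def choose_threshold(y, loo_probabilities):
    """Return (t_opt, scores): the threshold minimizing LL(t), and LL(t) for every t."""
    scores = {t: negative_log_likelihood(y, p) for t, p in loo_probabilities.items()}
    t_opt = min(scores, key=scores.get)
    return t_opt, scores

# Invented labels and leave-one-out probabilities, keyed by screening threshold.
y = [1, 0, 1, 0]
loo_probabilities = {
    1.00: [0.60, 0.45, 0.55, 0.40],
    0.25: [0.80, 0.20, 0.70, 0.30],
    0.10: [0.50, 0.50, 0.50, 0.50],
}
t_opt, scores = choose_threshold(y, loo_probabilities)
```

In this toy input the 0.25 threshold assigns the most probability to the observed labels, so it has the smallest negative log-likelihood and is selected.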
These summaries are based on a leave-one-out predictive analysis. We then fit our model to the full 70 songs, and use the results to make predictions on the songs and song portions that are of disputed authorship or are known to be collaborative.

4.1 Predictive validity and leave-one-out model summaries

A common approach to predictive validity in machine learning is to divide a data set into modeling, validation, and calibration subsets. Typically a model is constructed and validated iteratively on the first two subsets of the sample, and predictive properties of the approach are summarized on the withheld calibration set. See Draper (2013) for a good overview of this approach, which the author terms "calibration cross-validation." Given the small number of observations (songs) in our sample, our predictive accuracy would suffer by withholding a substantial calibration set, so instead we summarized our algorithm's quality of calibration through leave-one-out cross-validation. Specifically, we withheld one song at a time, and with the remaining 69 songs we performed the procedure described in Section 3. That is, with 69 songs at a time, we first optimized the choice of the p-value threshold for SIS through leave-one-out cross-validation (with a 68-versus-1 split to compute the out-of-sample negative log-likelihood), then with the variables selected based on the optimized p-value threshold we fit a logistic regression via elastic net regularization on the 69 songs (using 5-fold cross-validation to estimate the tuning parameters). Finally, based on the logistic regression fit, the probability estimate of the withheld song was computed. This process was performed for all 70 songs to obtain out-of-sample predictions for each song with known authorship.

Figure 1 displays histograms of the out-of-sample probabilities that McCartney wrote each of the 70 songs or song portions with known authorship.
The songs or fragments were divided into the 39 that Lennon wrote, and the 31 that McCartney wrote. Generally, the higher probability estimates tend to correspond to McCartney-authored songs, and lower probabilities correspond to Lennon songs. Using 0.5 as a threshold for classification, the model correctly classifies 76.9% of Lennon songs, and 74.2% of McCartney songs, with an overall correct classification rate of 75.7%. We display the leave-one-out probability predictions for the 39 songs known to be written by Lennon in Table 2, and for the 31 songs known to be written by McCartney in Table 3.

[Figure 1: panels labeled Lennon and McCartney; axis: probability of McCartney authorship.]
Figure 1. Back-to-back histograms of the out-of-sample prediction probabilities of songs of known authorship. Bars to the left represent 39 songs or song portions known to be written by Lennon, and bars to the right represent 31 songs or song portions known to be written by McCartney.

In addition to the simple classification results, we performed a receiver operating characteristic curve (ROC) analysis on the out-of-sample probability predictions for the 70 songs and fragments. The results of the analysis, which was performed using the pROC library in R (Robin et al., 2011), are summarized in Figure 2. The c-statistic (or area under the ROC curve, AUC) is 0.837, which indicates a strong level of predictive discrimination.

[Figure 2: axes: specificity and sensitivity; AUC = 0.837.]
Figure 2. ROC plot for out-of-sample song probability predictions based on 70 songs or song fragments with known authorship.

For each of the 70 applications of optimized variable screening followed by regularized logistic regression based on 69 songs at a time, we recorded the optimal variable screening p-value threshold. We discovered that among the p-value thresholds in the candidate set, the significance level of 0.25 was selected in 69 of the 70 applications of variable screening, and the significance level of 1.0 was selected for one application (corresponding to leaving out the song "You Won't See Me" by McCartney). Figure 3 shows boxplots across the 70 analyses of the leave-one-out predictive negative log-likelihoods for each p-value threshold. As seen in the figure, the negative log-likelihoods achieve their lowest values when discarding features that have a p-value for an odds ratio larger than 0.25. The second-best choice among these candidate thresholds was not to remove any variable prior to elastic net. Removing variables based on a threshold of 0.10 resulted in noticeably worse performance than any of the other choices.

[Figure 3: axes: threshold (p1.0, p0.75, p0.50, p0.25, p0.10) and leave-one-out negative log-likelihood.]
Figure 3. Boxplots of the leave-one-out negative log-likelihoods for each choice of p-value threshold (1.0, 0.75, 0.5, 0.25, 0.10). Each box consists of the distribution of negative log-likelihoods across the 70 leave-one-out analyses.

4.2 Probability predictions for disputed and collaborative songs

We applied our algorithm from Section 3 to the full set of 70 songs. The resulting logistic regression model was then used to make predictions on disputed songs and song portions, and on songs known to be collaborations between Lennon and McCartney. The optimal significance level threshold for the variable screening was 0.25 based on leave-one-out cross-validation. Conditional on selecting variables using the 0.25 p-value threshold, the tuning parameters in elastic net logistic regression were optimized at $\alpha = 0.3$ and $\lambda = 0.0359$. Thus, the final logistic regression model for predictions involved a weighted combination of $L_1$ and $L_2$ penalties, weighted more heavily towards the ridge penalty.
Of the 40 features that were selected through sure independence screening, 29 were non-zero in the final model as a result of elastic net logistic regression. The full set of 29 variables is listed in Table 4.

Distinguishing song features of Lennon and McCartney authorship can be learned from the coefficient estimates of the logistic regression. Positive coefficients are indicative of features more associated with McCartney's songs, and negative coefficients are indicative of features more associated with Lennon's songs. These results offer interesting interpretations of musical features that distinguish McCartney and Lennon songs. One clear theme that emerges is that McCartney tended to use more non-standard musical motifs than Lennon. For example, the harmonic transitions I → vi and vi → I are moves that are natural and reasonably direct in popular music, and Lennon used these chord changes much more frequently than McCartney (coefficient of −0.315). These chord changes also create an ambiguity about whether the song is in the major or relative minor key. Lennon songs like "It's Only Love" start with two sets of alternations between I and vi. In contrast, the chord change between ii and IV (coefficient of 1.428) is less standard and is used more frequently by McCartney, offering a different "flavor" to the often used sub-dominant; it appears, for example, in McCartney's "I'm Looking Through You."

Another example is that Lennon's melodic note changes tended to remain much more often on the notes of the diatonic scale, whereas McCartney tended to use melodic note transitions that moved off the diatonic scale. This is exhibited in the negative coefficients for note transitions moving up or down one note on the diatonic scale, and the positive coefficient (1.135) for upward note transitions in which one note was not on the diatonic scale.
Lennon also more often started melodic phrases at the 3rd or ended phrases at a 5th, both of which are notes on the diatonic scale. In contrast, McCartney more often used a flat 3, and transitions from the flat 3 to the tonic in his sung melodies, both of which are notes often associated with a blues scale and not the diatonic scale. This observation is at odds with the often-held notion that Lennon composed songs in a more traditional "rock-and-roll" style. In general, these results suggest that the greater complexity in McCartney's music is a distinguishing feature exhibited by the coefficients in Table 4 that are positive.

In addition to the coefficients, we report a measure of variable importance in the third column of Table 4. Our measure has close connections to an early approach developed in the context of random forests (Breiman, 2001). In particular, the importance of a variable can be assessed by randomly permuting its values across observations, and then computing an overall measure of model performance. The lower the performance measure after permuting the variable, the more important the variable. For our approach, randomly permuting the values of a musical feature across songs is effectively equivalent to having the feature removed because sure independence screening should eliminate the feature in the first step of our prediction algorithm. Thus, our variable importance measure was computed as follows. First, we removed the musical feature whose importance we wanted to assess. We then applied our out-of-sample procedure from Section 4.1 and computed 70 leave-one-out predicted probabilities. We performed an ROC analysis on these probabilities and the known authorship of the 70 songs and summarized the c-statistics in the third column of Table 4. Lower values of the c-statistic indicate greater variable importance.
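For readers unfamiliar with the c-statistic used in this importance measure, it can be read as the proportion of (McCartney song, Lennon song) pairs in which the McCartney song receives the higher predicted probability, with ties counted as one half. The sketch below is a plain pairwise computation and is not the pROC routine used for the reported results; the labels (1 for McCartney, 0 for Lennon, an assumed coding) and the two probability vectors are invented to mimic a fit with and without one feature.

```python
def c_statistic(y, p_hat):
    """Area under the ROC curve via pairwise comparisons (ties count one half)."""
    positives = [p for yi, p in zip(y, p_hat) if yi == 1]
    negatives = [p for yi, p in zip(y, p_hat) if yi == 0]
    wins = 0.0
    for pp in positives:
        for pn in negatives:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented leave-one-out probabilities for six songs.
y = [1, 1, 1, 0, 0, 0]
with_feature = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
without_feature = [0.6, 0.5, 0.4, 0.6, 0.5, 0.2]

auc_full = c_statistic(y, with_feature)        # all features retained
auc_reduced = c_statistic(y, without_feature)  # one feature removed
```

In this toy example the c-statistic drops once the feature is removed, which under the measure described above would mark the feature as important.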
The c-statistic without eliminating any features is 0.837, but some of the values in Table 4 can be higher given the random assignments in the stratified cross-validation procedure. Generally, higher absolute values of coefficient estimates correspond to lower c-statistics. Musical features with the lowest c-statistics, all less than 0.80, include the McCartney features (1) the occurrence of the 4th note on the diatonic scale, (2) the chord transition between ii and IV, (3) the note transition downward from the flat 3rd to the tonic, and (4) the note transition downward a half step from a non-diatonic note to a diatonic note. The only feature with a Lennon leaning and having a c-statistic less than 0.80 is the note transition up a half step from a non-diatonic note to a diatonic note. Compared with the McCartney feature of a downward half-step move, upward half-step moves may correspond to particular note transitions that are distinct from the downward moves.

We applied the fit of our model to make predictions for eight songs or song portions with disputed authorship, and for 11 known to be collaborations. The prediction probabilities were derived by applying the fitted logistic regression to the songs of unknown and collaborative authorship. We accompanied the probability predictions with approximate 95% confidence intervals calculated in the following manner. For each song of disputed or collaborative authorship, we computed 70 probability predictions based on leaving out each one of the 70 songs in our training sample. An approximate 95% confidence interval is constructed from the 2.5%-ile and 97.5%-ile of the 70 probability predictions for each song. It is worth noting that these intervals are conservative because one fewer song is used than with the corresponding point prediction. The probability predictions and corresponding confidence intervals are displayed in Tables 5 and 6.
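The percentile interval described above can be sketched as follows. The text does not say how percentiles between order statistics were interpolated, so linear interpolation is an assumption here, as are the function names; the 70 predictions are an invented, evenly spaced stand-in for the leave-one-out predictions of a single disputed song.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def loo_interval(predictions, level=95.0):
    """Approximate interval from the tail percentiles of leave-one-out predictions."""
    tail = (100.0 - level) / 2.0
    return percentile(predictions, tail), percentile(predictions, 100.0 - tail)

# Invented stand-in for the 70 leave-one-out predictions for one song.
predictions = [0.30 + 0.002 * k for k in range(70)]
low, high = loo_interval(predictions)
```

With only 70 values, the 2.5th and 97.5th percentiles fall between the second and third smallest and the second and third largest predictions, so the interval endpoints are driven by a handful of leave-one-out fits.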
We also display the distributions over the 70 predicted probabilities for each disputed song as density estimates in Figure 4. For the songs and fragments of disputed authorship, all of the probabilities are lower than 0.5, suggesting that each individually had a higher probability of being written by Lennon. The 95% confidence intervals are mostly less than 0.5, though "Wait" and the bridge of "In My Life" have confidence intervals that cross 0.5. The density plots in Figure 4 demonstrate the substantial uncertainty in the probability prediction for the bridge of "In My Life" and to a lesser extent for "Wait."

In most instances, the conclusions based on our model seem to match up with the suspected authorship, as discussed by Compton (1988). According to Compton, the song "Ask Me Why," which Lennon sang, was likely written by Lennon. Similarly, "Do You Want to Know a Secret?" was one that Lennon recalled having written and then given to George Harrison to sing. In "A Hard Day's Night," the verse and chorus are known to have been written by Lennon (Rybaczewski, 2018; Wiener, 1986), but McCartney seemed to remember having collaborated, perhaps with the bridge, which he sang. While McCartney wrote most and possibly all of "Michelle," Lennon claimed in some interviews that he came up with the bridge on his own, but in other interviews asserted that the bridge was a collaboration with McCartney (Compton, 1988). "Wait" is also suspected to have been written by Lennon according to Compton (1988), though in Miles (1998) McCartney remembers the song as mostly his. Lennon wrote "What Goes On" several years prior to the formation of the Beatles, and it is disputed whether McCartney (and Ringo Starr) helped write the bridge section when recording the song with the Beatles. We discuss "In My Life" in more detail below.
For the songs during the study period that were understood to be collaborative, it is unclear to what extent Lennon and McCartney shared songwriting efforts. Our model's probability predictions can be viewed as demonstrating similarities with the patterns inferred in songs and fragments with undisputed authorship. However, it is worth noting that our model was developed on a set of songs and song portions that were of single authorship, and that applying our model to songs of collaborative authorship may result in predictions that are not as trustworthy.

As with the information in Table 5, most of the collaborative songs in Table 6 were inferred to be mostly matching the style of Lennon. Four songs were exceptions, inferred to be written more in McCartney's style, and two of these are worth noting. The songs "Baby's in Black" and "The Word," according to Compton (1988), were both entirely collaborative, with Lennon having claimed that "The Word" was mostly his work. It is curious, in particular, that "The Word" is inferred with near certainty of being McCartney-authored. One feature of the song is the predominance of the flat third. This McCartney-like motif may be responsible for the high probability that the song is inferred to be written by McCartney. The other two songs, "From Me to You" and "She Loves You," were also more likely to be McCartney-authored. Compton (1988) reported that the former was claimed to be entirely collaborative, and that the latter was initiated by McCartney even though the song was written collaboratively.

Two of the collaborations are worthy of comment. While Lennon and McCartney co-wrote "She Loves You," Lennon remembered that "it was Paul's idea" (Compton, 1988), and the probability indicates that the song is weighted towards McCartney.
On the other hand, our model's probability prediction for "I Want to Hold Your Hand," which was written "eyeball to eyeball" (Compton, 1988), is that the song is much more characteristic of Lennon's style. Indeed, in one of the Jann Wenner interviews (Wenner, 2009), Lennon opined about the beauty of the song's melody, and picked out that song along with his song "Help!" as the two Beatles' songs he might have wanted to re-record. However, perhaps the song might have been special to him as it had much more of his imprint.

Of all Lennon-McCartney songs, "In My Life" has probably garnered the greatest amount of speculation about its true author. Rolling Stone magazine considered it to be the 23rd greatest song of all time (Rolling Stone, 2011). Our model produces a probability of 18.9% that McCartney wrote the verse, and a 43.5% probability that McCartney wrote the bridge, with a large amount of uncertainty about the latter. Because it is known that Lennon wrote the lyrics, it would not be surprising that he also wrote the music. Lennon claimed (Compton, 1988) that McCartney helped with the bridge, but that was the extent of his contribution. Breaking apart the song into the verse and the bridge separately, it is apparent that the verse is much more consistent stylistically with Lennon's songwriting. Thus, a conclusion by our model is that the verse is consistent with Lennon's songwriting style, but the bridge less so. The bridge's probability of McCartney authorship being closer to 0.5 may be indicative of the collaborative nature, as suggested by Lennon, of this part of the song.

5 Discussion

The approach to authorship attribution for Lennon-McCartney songs we developed in this paper has connections to methods used in attribution analysis of text documents.
One important difference is that typical text analysis models rely on the relative frequencies of occurrence of words or word combinations. In a musical context, where repeats of musical features are intrinsic to a song's construction, the relative frequencies of the occurrence of the musical "words" may obscure their importance in characterizing an author's composition style. Another difference from typical text analysis problems is that songs include more than just one text stream. For our work, we specifically included songs' melodic note sequence and chord sequence as two streams in parallel. Our particular choice in the representation and analysis of Lennon-McCartney songs of the early Beatles period seemed to be sufficient in recovering a song's author with greater than 75% accuracy, and with a high level of discrimination (c-statistic of 0.837 from the ROC analysis).

Our model predictions, particularly for the songs with disputed authorship, seem to be supported generally by the stories that accompany the songs' origins. While it is tempting to interpret the results of our model as revelations of a song's true author, other interpretations are just as compelling. For example, a disputed song such as "In My Life," which according to our model has a high probability of the verse and bridge each being written by Lennon, may in fact have been written by McCartney who stated he composed the song in the style of Smokey Robinson and the Miracles (Turner, 1999), but actually wrote in the style of Lennon, whether consciously or unconsciously. Songs with high probabilities of being written by Lennon or McCartney are mainly indications that the songs have musical features that are consistent with the Lennon or McCartney songs used in the development of our model.
To this end, one use of our model is to investigate whether certain sections of disputed or collaborative songs are suspected of being more consistent with particular composition styles. For example, the song of disputed authorship "Wait," for which our model estimates a probability of 0.391 of McCartney authorship, is sung in harmony by Lennon and McCartney throughout the song except in the bridge section where McCartney sings alone. It is natural to ask whether that section may be more in the style of McCartney who may have had a freer hand in writing that portion of the song. Indeed, our model applied to just the bridge section results in a 0.646 probability of McCartney authorship, suggesting that the bridge is more in the style of McCartney than Lennon.

In typical text analyses, the choice of "stop" words, i.e., the ones used in analyses to distinguish authorship style, is often made subjectively or at least by convention. The analogous decision in a musical context is arguably much more difficult, as the complexity of choices is far greater. In our work, we needed to make many subjective decisions that influenced the construction of musical features. Such decisions included what constituted the beginning and ending of melodic phrases, whether a key change (modulation) should reset the tonic of the song, whether ad-libbed vocals should be considered part of the melodic line, how to include dual melody lines that were sung in harmony, and so on. Our guiding principle was to make choices that could be viewed as the most conservative in the sense of having the least impact on the information in the data. For example, we omitted melodic information from ad-libbed vocals, and made phrasings of melodic lines as long as possible, as shorter lines introduced extra "rests" as part of the melodic transitions.
Also, when it was not clear in cases of dual melody lines which was the main melody, we included both melody lines.

It is worth noting that the model developed here was not our first attempt. We explored variations of the presented approach before arriving at our final model, including versions that permitted interactions, alternative variable selection procedures such as recursive feature elimination and stepwise variable selection, models for the musical features as a function of authorship that were inverted using Bayes rule, random forests, as well as several others. A danger in exploring too many models, especially with our small sample size and without a true test/holdout set, is the potential to overfit. This concern may not be apparent in the presentation of our analytic summaries, which was the culmination of a series of model investigations. The concern of overfitting limited some of our explorations. For example, after having modest success using elastic net logistic regression without any variable pre-processing, we inserted variable screening parameterized by a p-value threshold based only on four threshold values. Using a greater range of thresholds, especially after having learned that elastic net alone was a promising approach, and that we were tuning the model parameters based on the same leave-one-out validation data, would have had the potential to produce overfitted predictions. We suspect that our final model, however, does not suffer from overfitting concerns in any appreciable way. First, the approach we present is actually fairly simple: the removal of musical features based on bivariate relationships with the response followed by regularized logistic regression. More complex procedures might raise questions about their generalizability.
Second, we were cautious about optimizing the prediction algorithm and calibrating the predictability using out-of-sample criteria. For example, probability predictions involved leaving out data (one song at a time) to optimize the p-value threshold for variable screening, followed by leaving out portions of data (20% of the data that remained) to optimize the elastic net tuning parameters; and this entire procedure was performed leaving out one song at a time when making predictions for the songs of known authorship. This cascading application of cross-validation mitigates some of the natural concerns about possible overfitting.

Our particular modeling approach does permit extensions to address wider sets of songwriter attribution applications. Our model assumes only two authors, but this is easily extended to multiple songwriters in larger applications by modeling authorship in a multinomial logit model, for example. Another extension of our model can address changes in an author's style over time. Our application to Lennon-McCartney songs focused on a time period where the songwriters' musical styles were not changing in profound ways. To include larger spans of time where a songwriter's style may be changing, one possibility is to assume a stochastic process on the musical feature effects for each author, such as through an autoregressive process. Such an approach acknowledges that an author's style is likely to evolve gradually over time and with an uncertain trajectory. This approach would be straightforward to implement in a Bayesian setting, though implementing such a model in conjunction with variable screening would involve methodological challenges.

Several other limitations are worth mentioning. Our approach assumes that each song or (more relevantly) song portion contains sufficiently rich detail to capture musical information for distinguishing authorship.
Shorter song fragments would have a scarcity of features, and probability predictions are expected to be less reliable. Furthermore, if the goal of this work was to make the most accurate predictions of a song's author, then our approach could clearly be improved by incorporating readily available additional information. Lyric content, information on a song's structure, use of rhythm, song tempo, time signature, and the identity of a song's actual singer or singers are all likely to be highly predictive and distinguishing of a song's authorship. Our decision to ignore this extra information is consistent with the larger goal of being able to establish the stylistic fingerprint of a songwriter based solely on a corpus of songs' musical content, using Lennon-McCartney songs as a sandbox for understanding the potential for this approach.

Ultimately, the reduction of a songwriter's musical content into low-dimensional representations, such as a vector of musical feature effects, is the first step towards establishing musical signatures for songwriters that can be used for further analysis. For example, with many songwriters' styles characterized in a reduced form, it is possible to establish influence networks to learn about the diffusion of the creative process in popular music. With recent improvements in technology to convert audio information into formats amenable to the type of analysis we developed in this paper (Casey et al., 2008; Fu, Lu, Ting, & Zhang, 2011), larger-scale analyses of songwriters' styles are a potential area of exploration.

A Musical Background

A justification for the musical features chosen requires an understanding of Western popular music. Middle C, often denoted as C4, has frequency 261.6 Hz, and the well known equally-tempered 12-tone chromatic scale starting on note C4 is the sequence of notes

C4, C#4, D4, D#4, E4, F4, F#4, G4, G#4, A4, A#4, B4

where each successive note is derived from the previous one by multiplying the frequency by $2^{1/12}$. In the above sequence, the letters preceding the "4" (i.e., C, C#, D, D#, E, F, F#, G, G#, A, A#, B) are the pitches, and the number 4 refers to the octave of the note. The continuation of the sequence above is the same set of pitches, but at the next higher octave, that is, C5, C#5, D5, and so on. The 12 notes can also be visualized in a piano diagram in Figure 5.

For the current discussion, we can represent a note as $Z[i, j]$, where $i \geq 1$ indexes the pitch of the note and $j \geq 1$ indexes the octave of the note. We set $Z[1, 4]$ = C4, and all other notes are relative to this anchoring choice. Given the circular ordering of pitches in the chromatic scale, $Z[i + 12, j] = Z[i, j + 1]$ for all $i$ and $j$. Thus, a specific note has multiple representations using this notation. By convention, the octave of a note is the value $j$ in which the representation $Z[i, j]$ has $i \leq 12$. The notes $Z[i, j]$ and $Z[i + 1, j]$ are said to be a semitone apart, while the notes $Z[i, j]$ and $Z[i + 2, j]$ are said to be a whole tone apart. Notes $Z[i, j], Z[i, j + 1], Z[i, j + 2], \ldots$, are said to be in the same pitch class. Thus, D3, D4, D5, and so on, are in the same pitch class, but reside in different octaves. It is worth noting that while the sharp symbol # denotes raising a note a semitone, one can also use the flat suffix ♭ to lower a note a semitone.

One can translate or transpose the chromatic scale to start on any note given its circular structure, and to the human ear all such chromatic scales played in sequence sound essentially the same.
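The $Z[i, j]$ notation and the equal-temperament rule can be made concrete with a short sketch. The helper names below are hypothetical and introduced only for illustration; the code assumes sharps-only pitch names and uses the value of 261.6 Hz quoted above for C4.

```python
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MIDDLE_C_HZ = 261.6  # frequency of C4 as quoted in the text

def canonical(i, j):
    """Reduce Z[i, j] to the equivalent representation with 1 <= i <= 12."""
    return (i - 1) % 12 + 1, j + (i - 1) // 12

def name(i, j):
    """Conventional name of Z[i, j], for example 'C4'."""
    i, j = canonical(i, j)
    return f"{PITCHES[i - 1]}{j}"

def frequency(i, j):
    """Equal-tempered frequency: each semitone multiplies the frequency by 2**(1/12)."""
    semitones_from_c4 = (i - 1) + 12 * (j - 4)
    return MIDDLE_C_HZ * 2 ** (semitones_from_c4 / 12)

anchor = name(1, 4)                   # Z[1, 4], the anchoring choice
wrapped = name(13, 4)                 # Z[13, 4] names the same note as Z[1, 5]
octave_ratio = frequency(1, 5) / frequency(1, 4)
```

Because twelve semitone steps multiply the frequency by $(2^{1/12})^{12} = 2$, moving up one octave doubles the frequency, which is what `octave_ratio` checks numerically.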
A chromatic scale can start on any note $Z[i, j]$ and consists of the 12 notes $(Z[i, j], Z[i + 1, j], \ldots, Z[i + 11, j])$.

The basis of Western music is the diatonic scale, which, starting on a given note $Z[i, j]$, called the tonic of the scale, consists of the subsequence of seven notes from the chromatic scale

$$(Z[i, j], Z[i + 2, j], Z[i + 4, j], Z[i + 5, j], Z[i + 7, j], Z[i + 9, j], Z[i + 11, j]).$$

For example, beginning on an A at any octave, the diatonic scale with tonic A is (A, B, C#, D, E, F#, G#). Chromatic notes that are not part of the diatonic scale are called non-diatonic. Thus the non-diatonic notes with respect to the diatonic scale starting on A are A#, C, D#, F, and G. The diatonic scale permeates much of Western music, and most popular songs (or portions of songs) can be analyzed to be based on a diatonic scale starting at a specific note belonging to one of the 12 pitch classes; the lowest note of the diatonic scale is called the major key, or just the key, of the song, and the note itself is the tonic of that key.

Songs are often to be found in a "minor" key, based on a minor scale. For our purposes, we associate, as is often done, the minor key with the major key three semitones up, as they share the same seven notes. This particular definition of a minor key is often called the natural minor, and is the relative minor of the associated major key. For example, the key of A minor (as a natural minor) consists of the notes (A, B, C, D, E, F, G), which are the same as those in the major key of C (C, D, E, F, G, A, B), so that A minor is the relative minor associated with C major. Because the major key and relative minor share the same notes on the diatonic scale, in our work we classify songs as being in the major key as a proxy for the diatonic notes.
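The construction of the diatonic scale, its non-diatonic complement, and the relative minor can be checked with a small sketch that works on pitch classes only (octaves are ignored). The function names are hypothetical and sharps-only spellings are assumed.

```python
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
DIATONIC_STEPS = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets from the tonic

def diatonic_scale(tonic):
    """Seven pitch classes of the diatonic (major) scale on the given tonic."""
    start = PITCHES.index(tonic)
    return [PITCHES[(start + step) % 12] for step in DIATONIC_STEPS]

def non_diatonic(tonic):
    """Chromatic pitch classes that are not in the diatonic scale on the tonic."""
    scale = set(diatonic_scale(tonic))
    return [p for p in PITCHES if p not in scale]

def relative_minor(major_tonic):
    """Tonic of the natural minor key three semitones below the major key."""
    return PITCHES[(PITCHES.index(major_tonic) - 3) % 12]

a_major = diatonic_scale("A")
a_outside = non_diatonic("A")
c_relative = relative_minor("C")
```

Running this for tonic A reproduces the scale (A, B, C#, D, E, F#, G#) and the five non-diatonic notes given in the text, and the relative minor of C comes out as A.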
With a given key of a song, non-diatonic notes are usually specified by their relation to the tonic. So, for example, in the key of C, the flat third and flat seventh are E♭ and B♭ (and they could, equivalently, be called the raised second and raised sixth, as well). In fact, in pop/rock music, the flat third and flat seventh play a large role, as they appear in the five-note pentatonic (or the blues) scale, which consists of the notes $(Z[i, j], Z[i + 3, j], Z[i + 5, j], Z[i + 7, j], Z[i + 10, j])$, where $Z[i, j]$ is the tonic of the pentatonic scale. Thus, the pentatonic scale starting on tonic C is (C, E♭, F, G, B♭).

A note transition or an interval is a pair of notes, where the size of the interval depends on the number of semitones between them. Some sample intervals include:

• unison is between two identical notes (e.g., C4 → C4).
• a major second consists of two notes where the second is two semitones (whole tone) up from the first (e.g., C4 → D4, F4 → G4).
• a major third consists of two notes where the second is four semitones (two whole tones) up from the first (e.g., C4 → E4, F4 → A4).
• a perfect fourth consists of two notes where the second is five semitones up from the first (e.g., D4 → G4).
• a perfect fifth consists of two notes where the second is seven semitones up from the first (e.g., A4 → E5).
• a major sixth consists of two notes where the second is nine semitones up from the first (e.g., D4 → B4).
• a major seventh consists of two notes where the second is 11 semitones up from the first (e.g., F4 → E5).
• an octave consists of two notes where the second is 12 semitones up from the first (e.g., C4 → C5).

The minor second, third, sixth, and seventh intervals arise by lowering the second note of the corresponding major interval by a semitone. For example, C → E♭ is a minor third.
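The interval definitions and the pentatonic construction above can be sketched in a few lines. This is only an illustration under simplifying assumptions: note strings use a single-digit octave, flats are written with a lowercase "b", only ascending intervals of up to an octave are named, and an interval is identified purely by its semitone count (so enharmonic spellings are not distinguished). All names in the code are hypothetical.

```python
SHARP_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
FLAT_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

# Interval names keyed by the number of semitones from the first note up to the second.
INTERVAL_NAMES = {
    0: "unison",
    1: "minor second",
    2: "major second",
    3: "minor third",
    4: "major third",
    5: "perfect fourth",
    7: "perfect fifth",
    8: "minor sixth",
    9: "major sixth",
    10: "minor seventh",
    11: "major seventh",
    12: "octave",
}

# Semitone offsets of the pentatonic (blues) scale: tonic, flat 3rd, 4th, 5th, flat 7th.
PENTATONIC_STEPS = (0, 3, 5, 7, 10)


def pitch_index(pitch):
    """Position of a pitch class within the octave (C = 0), sharp or flat spelling."""
    names = SHARP_NAMES if pitch in SHARP_NAMES else FLAT_NAMES
    return names.index(pitch)


def absolute_index(note):
    """Semitone position of a note such as 'C4' or 'Eb4' (single-digit octave)."""
    pitch, octave = note[:-1], int(note[-1])
    return 12 * octave + pitch_index(pitch)


def interval_name(first, second):
    """Name of the interval from first up to second, by semitone count."""
    steps = absolute_index(second) - absolute_index(first)
    return INTERVAL_NAMES.get(steps, f"{steps} semitones")


def pentatonic_scale(tonic):
    """Five pitch classes of the pentatonic scale on the tonic, spelled with flats."""
    start = pitch_index(tonic)
    return [FLAT_NAMES[(start + step) % 12] for step in PENTATONIC_STEPS]


if __name__ == "__main__":
    for pair in [("C4", "D4"), ("D4", "G4"), ("A4", "E5"), ("F4", "E5"), ("C4", "Eb4")]:
        print(pair, interval_name(*pair))
    print(pentatonic_scale("C"))
```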
For intervals of a fourth and fifth, the term diminished applies when the top note of the corresponding interval is lowered by a semitone, and the term augmented applies when the top note is raised by a semitone. As an example, the interval C → G# is an augmented fifth in the key of C. In our choice of note transitions within pop songs, the diatonic notes (always relative to the key) have prime importance, with special emphasis on diatonic transitions to and from the tonic, transitions between small steps on the diatonic scale (which are fairly common in melody writing), and transitions along the pentatonic/blues scale.

Chords, for our purposes, consist of three notes played simultaneously (called a triad), and form the basis of most of the harmony in pop songs. The two most common types of chords are major chords and minor chords. A major chord is formed, using $Z[i, j]$ as the root of the chord, as $(Z[i, j], Z[i + 4, j], Z[i + 7, j])$. A minor chord, in contrast, is formed as $(Z[i, j], Z[i + 3, j], Z[i + 7, j])$. Less common are diminished chords, formed as $(Z[i, j], Z[i + 3, j], Z[i + 6, j])$, and augmented chords, formed as $(Z[i, j], Z[i + 4, j], Z[i + 8, j])$. Building chords from the diatonic scale consists of taking a starting note within the scale and successively layering on two extra notes above it, skipping a note each time. For example, in the key of C, the diatonic chords are:

• C major, the I major chord (the tonic), consisting of notes C, E, and G.
• D minor, the ii minor chord, consisting of notes D, F, and A.
• E minor, the iii minor chord, consisting of notes E, G, and B.
• F major, the IV major chord (the subdominant), consisting of notes F, A, and C.
• G major, the V major chord (the dominant), consisting of notes G, B, and D.
• A minor, the vi minor chord, consisting of notes A, C, and E.
• B diminished, the vii° diminished chord, consisting of notes B, D, and F.

All of these diatonic chords are “native” to the scale in which they reside; all other chords, with respect to the scale, are deemed to be non-diatonic chords. The diatonic chords are the most common ones in popular songs, although non-diatonic chords are often added for variety and to create emotional tension. In particular, in rock-and-roll music, the major chords on the flat third and the flat seventh (and sometimes the flat sixth) play a significant role.

In pop/rock music, the diatonic chords are all prevalent, especially the tonic (I), subdominant (IV), and dominant (V) chords, with the exception of the diminished chord on the seventh note of the diatonic scale (vii°), which is rarely used. The minor chord on the seventh note occurs more often, and is sometimes considered a replacement as one of the diatonic chords.

Transitions between chords are a cornerstone of pop/rock music. Chord progressions are sequences of chords that often repeat throughout a song. Transitions between diatonic chords form the bulk of the chord transitions. Less common (though not infrequent when grouped together) are transitions between non-diatonic chords and the tonic (I) or dominant (V).

Entire songs can be viewed in their most basic form as the superposition of chord progressions along with melodic lines. Songs are divided into sections within which chord progressions and melodies are identical or nearly identical. Two of the main sections that appear in most pop/rock songs are the verse and the chorus. The verses within a song typically have identical musical content, but usually contain different lyrics. The chorus of a song typically has greater musical and emotional intensity than the verse, and contains identical lyrics across repeats within the song.
It is common for songs to have a third musical section inserted between an occurrence of the chorus and a subsequent verse, called the bridge section. This section musically functions as a connector between the chorus and verse, and may even undergo a modulation, that is, resetting the song to a different key, if only temporarily. Other types of sections may appear in typical pop/rock music (e.g., intro, pre-chorus, outro), but the verse, chorus, and bridge are nearly universal components of a song.

More details about the basics of melodic and harmonic structure of popular music can be found in Benward (2014) and Middleton (1990).

References

Airoldi, E. M., Anderson, A. G., Fienberg, S. E., & Skinner, K. K. (2006). Who wrote Ronald Reagan’s radio addresses? Bayesian Analysis, 1(2), 289–319.

Benward, B. (2014). Music in theory and practice, volume 1. McGraw-Hill Higher Education.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Brown, J. I. (2004). Mathematics, physics and A Hard Day’s Night. CMS Notes, 36(6), 4–8.

Casey, M. A., Veltkamp, R., Goto, M., Leman, M., Rhodes, C., & Slaney, M. (2008). Content-based music information retrieval: Current directions and future challenges. Proceedings of the IEEE, 96(4), 668–696.

Cathé, P. (2016). La nostalgie chez les Beatles: vers une application de la théorie des vecteurs harmoniques à la musique pop? Volume!, 12(1), 181–191.

Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935–948.

Cilibrasi, R., Vitányi, P., & De Wolf, R. (2004). Algorithmic clustering of music based on string compression. Computer Music Journal, 28(4), 49–67.

Clement, R., & Sharp, D. (2003). N-gram and Bayesian classification of documents for topic and authorship. Literary and Linguistic Computing, 18(4), 423–447.

Compton, T. (1988). McCartney or Lennon?: Beatles myths and the composing of the Lennon-McCartney songs. The Journal of Popular Culture, 22(2), 99–131.

Conklin, D. (2006). Melodic analysis with segment classes. Machine Learning, 65(2), 349–360.

Draper, D. (2013). Bayesian model specification: Heuristics and examples. In P. Damien, P. Dellaportas, N. G. Polson, & D. A. Stephens (Eds.), Bayesian theory and applications (pp. 409–431). New York: Oxford University Press.

Dubnov, S., Assayag, G., Lartillot, O., & Bejerano, G. (2003). Using machine-learning methods for musical style modeling. Computer, 36(10), 73–80.

Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3), 435–447.

Everett, W. (1999). The Beatles as musicians: Revolver through the anthology. Oxford University Press, USA.

Fan, J. (2007). Variable screening in high-dimensional feature space. In Proceedings of the 4th international congress of chinese mathematicians (Vol. 2, pp. 735–747).

Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(5), 849–911.

Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38(6), 3567–3604.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1). Retrieved from http://www.jstatsoft.org/v33/i01/

Fu, Z., Lu, G., Ting, K. M., & Zhang, D. (2011). A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2), 303–319.

Fujita, T., Hagino, Y., Kubo, H., & Sato, G. (1993). The Beatles: Complete Scores. Hal Leonard Publishing Corporation.

George, J., & Shamir, L. (2014). Computer analysis of similarities between albums in popular music. Pattern Recognition Letters, 45, 78–84.

Hartzog, B. (2016, March). The Beatles’ songwriting. Retrieved from http://www.brianhartzog.com/beatles/beatles-songwriting.htm (Accessed 07-June-2017)

Heuger, M. (2018). Beabliography: Mostly academic writings about the Beatles. Retrieved from http://www.icce.rug.nl/~soundscapes/BEAB/index.shtml (Accessed 11-July-2018)

Hope, A. C. (1968). A simplified Monte Carlo significance test procedure. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 30(3), 582–598.

Kempfert, K. C., & Wong, S. W. (2018). Where does Haydn end and Mozart begin? Composer classification of string quartets. arXiv preprint arXiv:1809.05075.

Le Cessie, S., & Van Houwelingen, J. C. (1992). Ridge estimators in logistic regression. Applied Statistics, 41(1), 191–201.

Lim, M., & Hastie, T. (2015). Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics, 24(3), 627–654.

Malyutov, M. B. (2005). Authorship attribution of texts: a review. Electronic Notes in Discrete Mathematics, 21, 353–357.

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT Press.

McCormick, N. (1998, January 10). Must it be Lennon or McCartney? Retrieved from http://www.telegraph.co.uk/culture/4711552/Must-it-be-Lennon-or-McCartney.html (Accessed 07-June-2017)

McDougal, C. (2013, August). Multi-dimensional computer-driven quantitative analysis of the music and lyrics of the Beatles (Technical report). Northeastern University. Retrieved from https://cedricmcdougal.com/4/papers/beatles.pdf

Middleton, R. (1990). Studying Popular Music. McGraw-Hill Education (UK).

Miles, B. (1998). Paul McCartney: Many Years from Now. Macmillan.

Mosteller, F., & Wallace, D. L. (1963). Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed federalist papers. Journal of the American Statistical Association, 58(302), 275–309.

Mosteller, F., & Wallace, D. L. (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Springer.

Naccache, M., Borgi, A., & Ghédira, K. (2008). A learning-based model for musical data representation using histograms. In International symposium on computer music modeling and retrieval (pp. 207–215).

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.

Rolling Stone. (2011, April). The Beatles, In My Life. Retrieved from https://www.rollingstone.com/music/music-lists/500-greatest-songs-of-all-time-151127/the-beatles-in-my-life-57758/ (Accessed 19-August-2018)

Ruczinski, I., Kooperberg, C., & LeBlanc, M. (2003). Logic regression. Journal of Computational and Graphical Statistics, 12(3), 475–511.

Ruczinski, I., Kooperberg, C., & LeBlanc, M. L. (2004). Exploring interactions in high-dimensional genomic data: an overview of logic regression, with applications. Journal of Multivariate Analysis, 90(1), 178–195.

Rybaczewski, D. (2018). A Hard Day’s Night. Retrieved from http://www.beatlesebooks.com/hard-days-night (Accessed 19-August-2018)

Thisted, R., & Efron, B. (1987). Did Shakespeare write a newly-discovered poem? Biometrika, 74(3), 445–455.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 58(1), 267–288.

Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(3), 273–282.

Turner, S. (1999). A Hard Day’s Write: The stories behind every Beatles song. Carlton, Dubai.

Wagner, N. (2003). “Domestication” of blue notes in the Beatles’ songs. Music Theory Spectrum, 25(2), 353–365.

Wenner, J. (2009). John Lennon Remembers - Jann Wenner Interview Part 5. Retrieved from http://tittenhurstlennon.blogspot.com/2009/07/jann-wenner-interview-part-5.html (Accessed 14-January-2019)

Whissell, C. (1996). Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. Computers and the Humanities, 30(3), 257–265.

Wiener, A. J. (1986). The Beatles: A Recording History. McFarland & Co Inc Pub.

Womack, K. (2007). Authorship and the Beatles. College Literature, 34(3), 161–182.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1), 49–67.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2), 301–320.

| Lennon-authored Song | McCartney Probability |
|---|---|
| All I’ve Got To Do | 0.008 |
| Doctor Robert | 0.012 |
| I’m Happy Just to Dance With You | 0.038 |
| No Reply | 0.041 |
| Girl | 0.047 |
| I’ll Be Back | 0.048 |
| I’m Only Sleeping | 0.049 |
| There’s a Place | 0.064 |
| I’ll Cry Instead | 0.065 |
| When I Get Home | 0.066 |
| And Your Bird Can Sing | 0.067 |
| Help! | 0.071 |
| We Can Work It Out (Bridge) | 0.071 |
| You’re Going to Lose that Girl | 0.076 |
| I’m a Loser | 0.100 |
| Run For Your Life | 0.109 |
| It’s Only Love | 0.111 |
| This Boy | 0.128 |
| I Call Your Name | 0.148 |
| It Won’t Be Long | 0.178 |
| Please Please Me | 0.185 |
| You Can’t Do That | 0.231 |
| Ticket to Ride | 0.244 |
| A Hard Day’s Night (Verse/Chorus) | 0.279 |
| Day Tripper | 0.294 |
| I Don’t Want to Spoil the Party | 0.332 |
| Tomorrow Never Knows | 0.378 |
| Not a Second Time | 0.390 |
| Tell Me Why | 0.438 |
| Nowhere Man | 0.445 |
| You’ve Got to Hide Your Love Away | 0.524 |
| If I Fell | 0.574 |
| Any Time At All | 0.588 |
| I Feel Fine | 0.598 |
| I Should Have Known Better | 0.615 |
| Norwegian Wood (Verse/Chorus) | 0.666 |
| Yes It Is | 0.802 |
| She Said She Said | 0.836 |
| What Goes On (Verse/Chorus) | 0.944 |

Table 2. Songs or song fragments known to be written by John Lennon, rank ordered according to the out-of-sample probability (second column) that is attributed to Paul McCartney.

| McCartney-authored Song | McCartney Probability |
|---|---|
| You Won’t See Me | 0.069 |
| And I Love Her (Verse/Chorus) | 0.105 |
| For No One | 0.184 |
| Here There and Everywhere | 0.202 |
| PS I Love You | 0.282 |
| I’ll Follow the Sun | 0.284 |
| Can’t Buy Me Love | 0.440 |
| Got to Get You Into My Life | 0.448 |
| Eight Days a Week | 0.528 |
| Eleanor Rigby | 0.570 |
| I’m Down | 0.606 |
| Hold Me Tight | 0.606 |
| She’s a Woman | 0.660 |
| I’ve Just Seen a Face | 0.668 |
| Tell Me What You See | 0.668 |
| What You’re Doing | 0.679 |
| Drive My Car | 0.688 |
| Yesterday | 0.689 |
| The Night Before | 0.715 |
| All My Loving | 0.719 |
| Yellow Submarine | 0.734 |
| Every Little Thing | 0.806 |
| We Can Work It Out (Verse/Chorus) | 0.866 |
| Michelle (Verse/Chorus) | 0.912 |
| Things We Said Today | 0.938 |
| Good Day Sunshine | 0.953 |
| I’m Looking Through You | 0.957 |
| Another Girl | 0.964 |
| I Saw Her Standing There | 0.979 |
| I Wanna Be Your Man | 0.986 |
| Love Me Do | 0.989 |

Table 3. Songs or song fragments known to be written by Paul McCartney, rank ordered according to the out-of-sample probability (second column) that is attributed to Paul McCartney.
| Feature | Coefficient | c-statistic |
|---|---|---|
| Intercept | -0.796 | — |
| Chord: V | 1.096 | 0.806 |
| Chord: iii | -0.350 | 0.842 |
| Note: Flat 2 | -0.874 | 0.817 |
| Note: Flat 3 | 0.603 | 0.828 |
| Note: 4th | 1.347 | 0.788 |
| Note: 6th | 0.046 | 0.825 |
| Chord transition: between I and vi | -0.315 | 0.823 |
| Chord transition: between ii and iii | -0.255 | 0.846 |
| Chord transition: between ii and IV | 1.428 | 0.795 |
| Chord transition: between ii and V | -0.291 | 0.830 |
| Chord transition: non-diatonic to diatonic | -0.096 | 0.833 |
| Melodic transition: down from 4th to flat 3rd | 0.481 | 0.849 |
| Melodic transition: down from flat 3rd to tonic | 1.206 | 0.778 |
| Melodic transition: down 1 note on diatonic scale, not incl. 1 or 4 → 5 / 5 → 4 | -0.348 | 0.824 |
| Melodic transition: down 1 half step from non-diatonic to diatonic | 1.030 | 0.797 |
| Melodic transition: phrase end on 5th | -0.633 | 0.808 |
| Melodic transition: pair of notes on the 6th | -0.218 | 0.825 |
| Melodic transition: up 1 note on diatonic scale, not incl. 1 or 4 → 5 / 5 → 4 | -0.576 | 0.821 |
| Melodic transition: up 1 half step from non-diatonic to diatonic | -1.232 | 0.798 |
| Melodic transition: up from tonic to flat 3rd | 0.376 | 0.833 |
| Melodic transition: from 3rd to tonic | 0.284 | 0.829 |
| Melodic transition: from 4th to 5th | -0.653 | 0.816 |
| Melodic transition: up from or to a non-diatonic note | 1.135 | 0.806 |
| Contour: (Up, Up, Down) | -0.098 | 0.841 |
| Contour: (Down, Down, Same) | 0.535 | 0.824 |
| Contour: (Up, Same, Same) | -0.098 | 0.835 |
| Contour: (Down, Up, Same) | -0.938 | 0.825 |
| Contour: (Same, Down, Up) | -0.501 | 0.812 |
| Contour: (Up, Down, Up) | -0.555 | 0.826 |

Table 4. Coefficient estimates in the final logistic regression in the second column, and ROC analysis c-statistics in the third column. The c-statistics are computed from the 70 leave-one-out probabilities with the variable removed from the prediction algorithm; thus smaller c-statistics indicate greater variable importance.
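To show how the coefficients in Table 4 would turn into a probability, here is a minimal sketch of the logistic transformation $p = 1 / (1 + e^{-\eta})$ with $\eta$ equal to the intercept plus the sum of coefficient times feature value. It rests on several assumptions that this section does not confirm: that the modeled response is McCartney authorship (as the probability columns of the other tables suggest), that features absent from the input contribute zero, and that the feature values in the example are purely hypothetical and not taken from any song. The function name `mccartney_probability` and the ASCII key spellings are likewise illustrative.

```python
import math

# Coefficient estimates copied from Table 4 (final logistic regression).
INTERCEPT = -0.796
COEFFICIENTS = {
    "Chord: V": 1.096,
    "Chord: iii": -0.350,
    "Note: Flat 2": -0.874,
    "Note: Flat 3": 0.603,
    "Note: 4th": 1.347,
    "Note: 6th": 0.046,
    "Chord transition: between I and vi": -0.315,
    "Chord transition: between ii and iii": -0.255,
    "Chord transition: between ii and IV": 1.428,
    "Chord transition: between ii and V": -0.291,
    "Chord transition: non-diatonic to diatonic": -0.096,
    "Melodic transition: down from 4th to flat 3rd": 0.481,
    "Melodic transition: down from flat 3rd to tonic": 1.206,
    "Melodic transition: down 1 note on diatonic scale, not incl. 1 or 4->5 / 5->4": -0.348,
    "Melodic transition: down 1 half step from non-diatonic to diatonic": 1.030,
    "Melodic transition: phrase end on 5th": -0.633,
    "Melodic transition: pair of notes on the 6th": -0.218,
    "Melodic transition: up 1 note on diatonic scale, not incl. 1 or 4->5 / 5->4": -0.576,
    "Melodic transition: up 1 half step from non-diatonic to diatonic": -1.232,
    "Melodic transition: up from tonic to flat 3rd": 0.376,
    "Melodic transition: from 3rd to tonic": 0.284,
    "Melodic transition: from 4th to 5th": -0.653,
    "Melodic transition: up from or to a non-diatonic note": 1.135,
    "Contour: (Up, Up, Down)": -0.098,
    "Contour: (Down, Down, Same)": 0.535,
    "Contour: (Up, Same, Same)": -0.098,
    "Contour: (Down, Up, Same)": -0.938,
    "Contour: (Same, Down, Up)": -0.501,
    "Contour: (Up, Down, Up)": -0.555,
}


def mccartney_probability(features):
    """Logistic probability from a mapping of feature name -> feature value.

    Features missing from the mapping are treated as 0 (an assumption).
    """
    unknown = set(features) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    linear = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    # Hypothetical feature values, for illustration only.
    example = {"Chord: V": 1, "Note: 4th": 1, "Contour: (Down, Up, Same)": 1}
    print(round(mccartney_probability(example), 3))
```

Positive coefficients push the probability up and negative ones pull it down, so under these assumptions the sign of each entry in Table 4 indicates which direction a feature moves the prediction.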
| Song | McCartney Probability | 95% confidence interval |
|---|---|---|
| Ask Me Why | 0.057 | (0.018, 0.080) |
| Do You Want to Know a Secret | 0.080 | (0.033, 0.097) |
| A Hard Day’s Night (Bridge) | 0.069 | (0.016, 0.135) |
| Michelle (Bridge) | 0.199 | (0.109, 0.300) |
| Wait | 0.391 | (0.275, 0.540) |
| What Goes On (Bridge) | 0.235 | (0.088, 0.255) |
| In My Life (Verse) | 0.189 | (0.079, 0.307) |
| In My Life (Bridge) | 0.435 | (0.270, 0.692) |

Table 5. Estimated probabilities of being attributable to McCartney for eight songs or song fragments of disputed or unknown authorship, with 95% confidence intervals based on a leave-one-out analysis.

| Song | McCartney Probability | 95% confidence interval |
|---|---|---|
| Misery | 0.310 | (0.245, 0.451) |
| And I Love Her (Bridge) | 0.263 | (0.110, 0.315) |
| Norwegian Wood (Bridge) | 0.330 | (0.135, 0.408) |
| Little Child | 0.337 | (0.175, 0.417) |
| Baby’s in Black | 0.920 | (0.822, 0.977) |
| The Word | 0.976 | (0.899, 0.994) |
| From Me To You | 0.606 | (0.510, 0.721) |
| Thank You Girl | 0.106 | (0.036, 0.202) |
| She Loves You | 0.616 | (0.515, 0.733) |
| I’ll Get You | 0.062 | (0.016, 0.107) |
| I Want to Hold Your Hand | 0.115 | (0.065, 0.182) |

Table 6. Estimated probabilities of being attributable to McCartney for 11 collaborative songs or song fragments, with 95% confidence intervals based on a leave-one-out analysis.
[Figure 4 image not reproduced. Eight density panels, one per song: Ask Me Why; Do You Want to Know a Secret; A Hard Day’s Night (Bridge); Michelle (Bridge); Wait; What Goes On (Bridge); In My Life (Verse); In My Life (Bridge). Horizontal axis: Probability of McCartney Authorship (0.0 to 1.0); vertical axis: Density.]

Figure 4. Density plots of the leave-one-out probability predictions for the eight songs of disputed authorship.

[Figure 5 image not reproduced. Piano keyboard diagram with keys labeled C, D, E, F, G, A, B and C#, D#, F#, G#, A#.]

Figure 5. Chromatic scale notes appearing on a piano diagram.
