word2vec Skip-Gram with Negative Sampling is a Weighted Logistic PCA

Andrew J. Landgraf¹ and Jeremy Bellay²
¹ Battelle Health and Analytics, Columbus, OH, landgraf@battelle.org
² Battelle Cyber Innovations, Columbus, OH, bellayj@battelle.org

Abstract

We show that the skip-gram formulation of word2vec trained with negative sampling is equivalent to a weighted logistic PCA. This connection allows us to better understand the objective, compare it to other word embedding methods, and extend it to higher dimensional models.

Background

Mikolov et al. (2013) introduced the skip-gram formulation for neural word embeddings, wherein one tries to predict the context of a given word. Their negative-sampling algorithm improved the computational feasibility of training the embeddings. Due to their state-of-the-art performance on a number of tasks, there has been much research aimed at better understanding it. Goldberg and Levy (2014) showed that the skip-gram with negative-sampling algorithm (SGNS) maximizes a different likelihood than the skip-gram formulation poses and further showed how it is implicitly related to pointwise mutual information (Levy and Goldberg, 2014). We show that SGNS is a weighted logistic PCA, which is a special case of exponential family PCA for the binomial likelihood. Cotterell et al. (2017) showed that the skip-gram formulation can be viewed as exponential family PCA with a multinomial likelihood, but they did not make the connection between the negative-sampling algorithm and the binomial likelihood. Li et al. (2015) showed that SGNS is an explicit matrix factorization related to representation learning, but the matrix factorization objective they found was complicated and they did not find the connection to the binomial distribution or exponential family PCA.
Weighted Logistic PCA

Exponential family principal component analysis is an extension of principal component analysis (PCA) to data coming from exponential family distributions. Letting $Y = [y_{ij}]$ be a data matrix, it assumes that $y_{ij}$, $i = 1, \dots, n$, $j = 1, \dots, d$, are generated from an exponential family distribution with corresponding natural parameters $\theta_{ij}$. Exponential family PCA decomposes $\Theta = [\theta_{ij}] = AB^T$, where $A \in \mathbb{R}^{n \times f}$, $B \in \mathbb{R}^{d \times f}$, and $f < \min(n, d)$. This implies that $\theta_{ij} = a_i^T b_j$, where $a_i \in \mathbb{R}^f$ is the $i$th row of $A$ and $b_j \in \mathbb{R}^f$ is the $j$th row of $B$.

When the exponential family distribution is Gaussian, this reduces to standard PCA. When it is Bernoulli ($y_{ij} \in \{0, 1\}$, $\Pr(y_{ij} = 1) = p_{ij}$), this is typically called logistic PCA, and the log likelihood being maximized is
\[
\sum_{i,j} \left[ y_{ij} \theta_{ij} - \log(1 + \exp(\theta_{ij})) \right],
\]
where $\theta_{ij} = \log \frac{p_{ij}}{1 - p_{ij}}$ is the log odds and is approximated by the lower dimensional $a_i^T b_j$.

Just as in logistic regression, when there is more than one independent and identically distributed trial for a given $(i, j)$ combination, the distribution becomes binomial. If there are $y_{ij}$ successes out of $n_{ij}$ opportunities, then the log likelihood is
\[
\sum_{i,j} n_{ij} \left( \hat{p}_{ij} \theta_{ij} - \log(1 + \exp(\theta_{ij})) \right),
\]
where $\hat{p}_{ij} = y_{ij} / n_{ij}$ is the proportion of successes. This can be viewed as a weighted logistic PCA with responses $\hat{p}_{ij}$ and weights $n_{ij}$.

Skip-Gram with Negative Sampling

SGNS compares the observed word-context pairs with randomly-generated non-observed pairs and maximizes the probability of the actual word-context pairs, while minimizing the probability of the negative pairs.
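As a concrete sketch of the binomial log likelihood above (the function and variable names are ours, not from the paper), the objective can be evaluated directly from the factor matrices:

```python
import numpy as np

def weighted_logistic_pca_loglik(A, B, p_hat, n):
    """Binomial (weighted logistic PCA) log likelihood:
    sum_ij n_ij * (p_hat_ij * theta_ij - log(1 + exp(theta_ij))),
    with theta_ij = a_i^T b_j from the low-rank factorization."""
    Theta = A @ B.T
    return np.sum(n * (p_hat * Theta - np.log1p(np.exp(Theta))))
```

With $A = 0$, every natural parameter is zero, so each of the $n \times d$ cells contributes $-n_{ij} \log 2$ regardless of the response.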
Let $n_{w,c}$ be the number of times word $w$ is in the context of word $c$, $n_w$ and $n_c$ be the number of times word $w$ and context $c$ appear, $|D|$ be the number of word-context pairs in the corpus, $P_D(w) = \frac{n_w}{|D|}$, $P_D(c) = \frac{n_c}{|D|}$ (Mikolov et al. (2013) define $P_D(c) \propto n_c^{0.75}$, but without loss of generality, we use the simpler definition in this paper), and $P_D(w, c) = \frac{n_{w,c}}{|D|}$ be the distributions of the words, contexts, and word-context pairs, respectively, and $k$ be the number of negative samples.

Letting $\sigma(x) = \frac{1}{1 + e^{-x}}$, Levy and Goldberg (2014) showed that SGNS maximizes
\[
\sum_w \sum_c n_{w,c} \left( \log \sigma(v_w^T v_c) + k\, \mathbb{E}_{c' \sim P_D}\left[\log \sigma(-v_w^T v_{c'})\right] \right),
\]
where $v_w$ and $v_c$ are the $f$-dimensional vectors for word $w$ and context $c$, respectively.

The SGNS objective can be rewritten
\begin{align*}
\ell &= \sum_w \sum_c n_{w,c} \left( \log \sigma(v_w^T v_c) + k\, \mathbb{E}_{c' \sim P_D}\left[\log \sigma(-v_w^T v_{c'})\right] \right) \\
&= \sum_w \left\{ \left[ \sum_c n_{w,c} \log \sigma(v_w^T v_c) \right] + \left[ \sum_c n_{w,c}\, k\, \mathbb{E}_{c' \sim P_D}\left[\log \sigma(-v_w^T v_{c'})\right] \right] \right\} \\
&= \sum_w \left\{ \left[ \sum_c n_{w,c} \log \sigma(v_w^T v_c) \right] + k\, n_w\, \mathbb{E}_{c' \sim P_D}\left[\log \sigma(-v_w^T v_{c'})\right] \right\} \\
&= \sum_w \left\{ \left[ \sum_c n_{w,c} \log \sigma(v_w^T v_c) \right] + \left[ k\, n_w \sum_{c'} P_D(c') \log \sigma(-v_w^T v_{c'}) \right] \right\} \\
&= \sum_w \sum_c \left[ n_{w,c} \log \frac{\exp(v_w^T v_c)}{1 + \exp(v_w^T v_c)} + k\, n_w\, P_D(c) \log \frac{1}{1 + \exp(v_w^T v_c)} \right] \\
&= \sum_w \sum_c \left[ n_{w,c} (v_w^T v_c) - \left(n_{w,c} + k\, n_w\, P_D(c)\right) \log\left(1 + \exp(v_w^T v_c)\right) \right] \\
&= \sum_w \sum_c \left(n_{w,c} + k\, n_w\, P_D(c)\right) \left[ \frac{n_{w,c}}{n_{w,c} + k\, n_w\, P_D(c)} (v_w^T v_c) - \log\left(1 + \exp(v_w^T v_c)\right) \right].
\end{align*}
Define the proportion
\[
x_{w,c} = \frac{n_{w,c}}{n_{w,c} + k\, n_w\, P_D(c)} = \frac{P_D(w, c)}{P_D(w, c) + k\, P_D(w)\, P_D(c)}.
\]
Then SGNS maximizes
\[
\sum_w \sum_c \left(n_{w,c} + k\, n_w\, P_D(c)\right) \left[ x_{w,c} (v_w^T v_c) - \log\left(1 + \exp(v_w^T v_c)\right) \right],
\]
which is logistic PCA with responses $x_{w,c}$ and weights $n_{w,c} + k\, n_w\, P_D(c)$.
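The responses and weights can be computed directly from a word-context count matrix. A minimal sketch (names are ours) that also confirms the two equivalent forms of $x_{w,c}$:

```python
import numpy as np

def sgns_responses_and_weights(N, k):
    """Given co-occurrence counts N (rows: words, cols: contexts) and
    k negative samples, return the logistic PCA responses x_{w,c} and
    weights n_{w,c} + k * n_w * P_D(c)."""
    D = N.sum()
    n_w = N.sum(axis=1, keepdims=True)   # marginal word counts
    n_c = N.sum(axis=0, keepdims=True)   # marginal context counts
    neg = k * n_w * (n_c / D)            # expected negative-sample counts
    return N / (N + neg), N + neg
```

Dividing numerator and denominator by $|D|$ recovers the second form, $x_{w,c} = P_D(w,c) / (P_D(w,c) + k\, P_D(w)\, P_D(c))$, and any non-observed pair ($n_{w,c} = 0$) gets response 0.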
Multiplying by the constant $1/|D|$, the objective becomes
\[
\sum_w \sum_c \left(P_D(w, c) + k\, P_D(w)\, P_D(c)\right) \left[ x_{w,c} (v_w^T v_c) - \log\left(1 + \exp(v_w^T v_c)\right) \right],
\]
which gives the weights a slightly easier interpretation.

Implications

Interpretation

Interpreting the objective, weights will be larger for word-context pairs with a higher number of occurrences, as well as for words and contexts with higher numbers of marginal occurrences. The response $x_{w,c}$ is 0 for all non-observed pairs and will be closer to 1 if the number of word-context pair occurrences is large compared to the marginal word and context occurrences. The number of negative samples per word, $k$, has the effect of regularizing the proportions down from 1. Larger $k$'s will also diminish the effect of the word-context pairs in the weights.

Comparison to Other Results

We can easily derive the main result from Levy and Goldberg (2014), the implicit factorization of the pointwise mutual information (PMI), under this interpretation. For each combination of $w$ and $c$, there are $n_{w,c}$ positive examples and $k\, n_w\, P_D(c)$ negative examples. The maximum likelihood estimate of the probability is $x_{w,c}$. The log odds of $x_{w,c}$ is
\[
\log \frac{x_{w,c}}{1 - x_{w,c}} = \log \frac{n_{w,c} |D|}{n_w n_c} - \log k = \mathrm{PMI}(w, c) - \log k,
\]
which is the same result as in Levy and Goldberg (2014).

Comparison to Other Methods

Weighted logistic PCA has been used in collaborative filtering of implicit feedback data by Spotify (Johnson, 2014), where it was referred to as logistic matrix factorization. Johnson (2014) was a modification of a previous method which performed matrix factorization with a weighted least squares objective (Hu et al., 2008). Johnson (2014) reported that weighted logistic PCA had similar accuracy to Hu et al. (2008)'s weighted least squares method, but could achieve it with a smaller latent dimension.
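The PMI identity can be checked numerically on toy counts (a sketch; the counts are made up for illustration):

```python
import numpy as np

# Hypothetical toy co-occurrence counts (rows: words, columns: contexts)
N = np.array([[2.0, 1.0],
              [1.0, 4.0]])
k = 5                                  # negative samples per word
D = N.sum()
n_w = N.sum(axis=1, keepdims=True)     # marginal word counts
n_c = N.sum(axis=0, keepdims=True)     # marginal context counts

x = N / (N + k * n_w * n_c / D)        # x_{w,c} as derived above
log_odds = np.log(x / (1 - x))
pmi = np.log(N * D / (n_w * n_c))      # pointwise mutual information

print(np.allclose(log_odds, pmi - np.log(k)))  # → True
```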
With that in mind, we can consider an alternative weighted least squares version of SGNS (SGNS-LS),
\[
\sum_w \sum_c \left(P_D(w, c) + k\, P_D(w)\, P_D(c)\right) \left( x_{w,c} - v_w^T v_c \right)^2.
\]
Possible advantages include improved computational efficiency and a further comparison with GloVe (Pennington et al., 2014), which also uses a weighted least squares objective. Ignoring the word and context bias terms, GloVe's objective is
\[
\sum_w \sum_c f(n_{w,c}) \left( \log n_{w,c} - v_w^T v_c \right)^2,
\]
where $f(n_{w,c})$ is a weighting function, which equals 0 when $n_{w,c}$ is 0, effectively removing the non-observed word-context pairs.

Comparing the two objectives, they both have weights increasing as a function of $n_{w,c}$, but SGNS-LS's weights are dependent on the number of marginal occurrences of the words and contexts. Both methods transform the number of word-context occurrences, SGNS-LS converting it to a proportion and GloVe taking the log. We believe the weighting scheme for SGNS-LS has a conceptual advantage over that of GloVe. For example, let $n_{i,j} = n_{k,l} = 1$ with $n_i \gg n_k$ and $n_j \gg n_l$. GloVe treats them both the same, but SGNS-LS will have $x_{i,j} < x_{k,l}$ and will give more weight to $x_{i,j}$, because $n_{i,j}$ being small is much more unlikely due to random chance than $n_{k,l}$ being small.

Training

The connection of SGNS to weighted logistic PCA allows us to conceive of other methods to train the word and context vectors. For example, once the sparse word-context matrix has been created, one can either use the MapReduce framework of Johnson (2014) or GloVe's approach: sample elements of the matrix and perform stochastic gradient descent with AdaGrad (and similarly for SGNS-LS, with different gradients). GloVe only samples non-zero elements of the matrix, whereas SGNS(-LS) must sample all elements, because the non-occurrence is important for SGNS.
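A minimal sketch of the sampling-based training just described, applied to the SGNS-LS objective with AdaGrad (function names, hyperparameters, and the element-sampling scheme are our own illustrative choices, not the paper's):

```python
import numpy as np

def sgns_ls_objective(W, C, x, weights):
    """Weighted least squares objective: sum_wc weight_wc * (x_wc - v_w^T v_c)^2."""
    return np.sum(weights * (x - W @ C.T) ** 2)

def train_sgns_ls(x, weights, f=2, steps=5000, lr=0.1, seed=0):
    """SGD with AdaGrad over uniformly sampled (word, context) matrix elements.
    Unlike GloVe, all elements are sampled, including the zero counts."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W = 0.1 * rng.standard_normal((n, f))
    C = 0.1 * rng.standard_normal((d, f))
    gW = np.full((n, f), 1e-8)           # accumulated squared gradients
    gC = np.full((d, f), 1e-8)
    for _ in range(steps):
        i, j = rng.integers(n), rng.integers(d)
        err = x[i, j] - W[i] @ C[j]
        gw = -2 * weights[i, j] * err * C[j]   # gradient w.r.t. word vector
        gc = -2 * weights[i, j] * err * W[i]   # gradient w.r.t. context vector
        gW[i] += gw ** 2
        gC[j] += gc ** 2
        W[i] -= lr * gw / np.sqrt(gW[i])
        C[j] -= lr * gc / np.sqrt(gC[j])
    return W, C
```

On a low-rank target the fitted factors should reduce the objective well below the trivial all-zero factorization.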
Extension

Finally, with this connection to logistic PCA, SGNS can be extended to include other factors in a higher order tensor factorization, analogous to the extension for skip-gram described in Cotterell et al. (2017). Of particular interest is training document vectors along with the word and context vectors.

References

Cotterell, R., A. Poliak, B. Van Durme, and J. Eisner (2017). Explaining and generalizing skip-gram through exponential family principal component analysis. EACL 2017, 175.

Goldberg, Y. and O. Levy (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Hu, Y., Y. Koren, and C. Volinsky (2008). Collaborative filtering for implicit feedback datasets. In Eighth IEEE International Conference on Data Mining (ICDM), 2008, pp. 263–272. IEEE.

Johnson, C. C. (2014). Logistic matrix factorization for implicit feedback data. Advances in Neural Information Processing Systems 27.

Levy, O. and Y. Goldberg (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pp. 2177–2185.

Li, Y., L. Xu, F. Tian, L. Jiang, X. Zhong, and E. Chen (2015). Word embedding revisited: A new representation learning and explicit matrix factorization perspective. In IJCAI, pp. 3650–3656.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.

Pennington, J., R. Socher, and C. D. Manning (2014). GloVe: Global vectors for word representation. In EMNLP, Volume 14, pp. 1532–1543.