word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method


Authors: Yoav Goldberg, Omer Levy

Yoav Goldberg and Omer Levy ({yoav.goldberg,omerlevy}@gmail.com)
February 14, 2014

The word2vec software of Tomas Mikolov and colleagues (https://code.google.com/p/word2vec/) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers [1, 2]. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations.

This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean [2].

1 The skip-gram model

The departure point of the paper is the skip-gram model. In this model we are given a corpus of words w and their contexts c. We consider the conditional probabilities p(c|w), and given a corpus Text, the goal is to set the parameters θ of p(c|w; θ) so as to maximize the corpus probability:

\[ \arg\max_\theta \prod_{w \in \mathit{Text}} \left[ \prod_{c \in C(w)} p(c \mid w; \theta) \right] \tag{1} \]

In this equation, C(w) is the set of contexts of word w. Alternatively:

\[ \arg\max_\theta \prod_{(w,c) \in D} p(c \mid w; \theta) \tag{2} \]

Here, D is the set of all word-context pairs we extract from the text.
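To make objectives (1) and (2) concrete, here is a minimal Python sketch. The corpus, the context sets, and the conditional probabilities are all made-up toy values; it simply collects the pair set D and evaluates the log of the product in (2) under a hypothetical fixed distribution p(c|w).

```python
import math

# Toy corpus: each word w paired with its set of contexts C(w).
# Flattening these gives the pair set D of equation (2).
corpus = [("quick", ["the", "brown"]), ("brown", ["quick", "fox"])]
D = [(w, c) for w, contexts in corpus for c in contexts]

# A hypothetical model p(c | w; theta), here just a fixed lookup table
# (in the real model these values come from the parameterization below).
p = {("quick", "the"): 0.2, ("quick", "brown"): 0.4,
     ("brown", "quick"): 0.3, ("brown", "fox"): 0.25}

# Objective (2): maximize the product of p(c | w) over all pairs in D;
# equivalently, maximize the sum of the logs.
log_likelihood = sum(math.log(p[(w, c)]) for (w, c) in D)
print(round(log_likelihood, 4))
```

A parameter setting that raises any p(c|w) for a pair in D raises this log-likelihood; the learning problem is to do this through the vector parameters θ introduced next.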
1.1 Parameterization of the skip-gram model

One approach for parameterizing the skip-gram model follows the neural-network language-models literature, and models the conditional probability p(c|w; θ) using soft-max:

\[ p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}} \tag{3} \]

where v_c and v_w ∈ R^d are vector representations for c and w respectively, and C is the set of all available contexts. The parameters θ are v_{c_i}, v_{w_i} for w ∈ V, c ∈ C, i ∈ 1, ..., d (a total of |C| × |V| × d parameters). We would like to set the parameters such that the product (2) is maximized.

Now is a good time to take the log and switch from a product to a sum:

\[ \arg\max_\theta \sum_{(w,c) \in D} \log p(c \mid w) = \sum_{(w,c) \in D} \left( \log e^{v_c \cdot v_w} - \log \sum_{c'} e^{v_{c'} \cdot v_w} \right) \tag{4} \]

An assumption underlying the embedding process is the following:

Assumption: maximizing objective (4) will result in good embeddings v_w for all w ∈ V, in the sense that similar words will have similar vectors. It is not clear to us at this point why this assumption holds.

While objective (4) can be computed, it is computationally expensive to do so, because the term p(c|w; θ) is very expensive to compute due to the summation \(\sum_{c' \in C} e^{v_{c'} \cdot v_w}\) over all the contexts c' (there can be hundreds of thousands of them). One way of making the computation more tractable is to replace the softmax with a hierarchical softmax. We will not elaborate on this direction.

2 Negative Sampling

Mikolov et al. [2] present the negative-sampling approach as a more efficient way of deriving word embeddings. While negative sampling is based on the skip-gram model, it is in fact optimizing a different objective. What follows is the derivation of the negative-sampling objective.

Consider a pair (w, c) of word and context. Did this pair come from the training data?
Let's denote by p(D = 1 | w, c) the probability that (w, c) came from the corpus data. Correspondingly, p(D = 0 | w, c) = 1 − p(D = 1 | w, c) will be the probability that (w, c) did not come from the corpus data. As before, assume there are parameters θ controlling the distribution: p(D = 1 | w, c; θ).

(A note on vocabularies: throughout this note, we assume that the words and the contexts come from distinct vocabularies, so that, for example, the vector associated with the word "dog" will be different from the vector associated with the context "dog". This assumption follows the literature, where it is not motivated. One motivation for making it is the following: consider the case where both the word "dog" and the context "dog" share the same vector v. Words hardly ever appear in the contexts of themselves, so the model should assign a low probability to p(dog | dog), which entails assigning a low value to v · v, which is impossible.)

Our goal is now to find parameters that maximize the probability that all of the observations indeed came from the data:

\[
\begin{aligned}
\arg\max_\theta \prod_{(w,c) \in D} p(D = 1 \mid w, c; \theta)
&= \arg\max_\theta \log \prod_{(w,c) \in D} p(D = 1 \mid w, c; \theta) \\
&= \arg\max_\theta \sum_{(w,c) \in D} \log p(D = 1 \mid w, c; \theta)
\end{aligned}
\]

The quantity p(D = 1 | w, c; θ) can be defined using the sigmoid function:

\[ p(D = 1 \mid w, c; \theta) = \frac{1}{1 + e^{-v_c \cdot v_w}} \]

leading to the objective:

\[ \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}} \]

This objective has a trivial solution if we set θ such that p(D = 1 | w, c; θ) = 1 for every pair (w, c). This can be easily achieved by setting θ such that v_c = v_w and v_c · v_w = K for all v_c, v_w, where K is a large enough number (practically, we get a probability of 1 as soon as K ≈ 40). We need a mechanism that prevents all the vectors from having the same value, by disallowing some (w, c) combinations.
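The saturation claim above is easy to check numerically. The sketch below evaluates p(D = 1 | w, c) = σ(v_c · v_w) for a few dot-product values (the values themselves are arbitrary illustrations):

```python
import math

def sigmoid(x):
    """p(D = 1 | w, c) for a pair whose vectors have dot product x = v_c . v_w."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))    # 0.5: the model is indifferent about the pair
print(sigmoid(40.0))   # prints 1.0: with v_c . v_w = K ~ 40 the probability
                       # saturates to 1 within float precision
print(sigmoid(-40.0))  # vanishingly small: a pair the model rules out
```

So if every dot product can be driven to ~40, every pair gets probability 1 and the objective is trivially maximized, which is exactly why the negative examples introduced next are needed.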
One way to do so is to present the model with some (w, c) pairs for which p(D = 1 | w, c; θ) must be low, i.e. pairs which are not in the data. This is achieved by generating a set D′ of random (w, c) pairs, assuming they are all incorrect (the name "negative sampling" stems from the set D′ of randomly sampled negative examples). The optimization objective now becomes:

\[
\begin{aligned}
& \arg\max_\theta \prod_{(w,c) \in D} p(D = 1 \mid w, c; \theta) \prod_{(w,c) \in D'} p(D = 0 \mid w, c; \theta) \\
={}& \arg\max_\theta \prod_{(w,c) \in D} p(D = 1 \mid w, c; \theta) \prod_{(w,c) \in D'} \bigl(1 - p(D = 1 \mid w, c; \theta)\bigr) \\
={}& \arg\max_\theta \sum_{(w,c) \in D} \log p(D = 1 \mid w, c; \theta) + \sum_{(w,c) \in D'} \log \bigl(1 - p(D = 1 \mid w, c; \theta)\bigr) \\
={}& \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}} + \sum_{(w,c) \in D'} \log \left(1 - \frac{1}{1 + e^{-v_c \cdot v_w}}\right) \\
={}& \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}} + \sum_{(w,c) \in D'} \log \frac{1}{1 + e^{v_c \cdot v_w}}
\end{aligned}
\]

If we let σ(x) = 1/(1 + e^{−x}) we get:

\[
\begin{aligned}
& \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}} + \sum_{(w,c) \in D'} \log \frac{1}{1 + e^{v_c \cdot v_w}} \\
={}& \arg\max_\theta \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)
\end{aligned}
\]

which is almost equation (4) in Mikolov et al. [2]. The difference from Mikolov et al. is that here we present the objective for the entire corpus D ∪ D′, while they present it for one example (w, c) ∈ D and k examples (w, c_j) ∈ D′, following a particular way of constructing D′.

Specifically, with negative sampling of k, Mikolov et al.'s constructed D′ is k times larger than D: for each (w, c) ∈ D we construct k samples (w, c_1), ..., (w, c_k), where each c_j is drawn according to its unigram distribution raised to the 3/4 power. This is equivalent to drawing the samples (w, c) in D′ from the distribution

\[ (w, c) \sim p_{\mathit{words}}(w)\,\frac{p_{\mathit{contexts}}(c)^{3/4}}{Z}, \]

where p_words(w) and p_contexts(c) are the unigram distributions of words and contexts respectively, and Z is a normalization constant.
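The noise distribution just described can be sketched in a few lines of Python. The toy sentence and the use of `random.choices` for drawing are illustrative choices, not the word2vec implementation:

```python
import random
from collections import Counter

random.seed(0)

text = "the dog barks at the cat and the dog runs".split()

# Unigram counts; the raw unigram distribution is count(c) / |Text|.
counts = Counter(text)
total = sum(counts.values())

# Noise distribution for negatives: unigram raised to the 3/4 power,
# renormalized by Z. The exponent flattens the distribution, so very
# frequent words are sampled less often than their raw frequency suggests.
weights = {c: (n / total) ** 0.75 for c, n in counts.items()}
Z = sum(weights.values())
noise = {c: w / Z for c, w in weights.items()}

def sample_negatives(k):
    """For one observed pair (w, c), draw k negative contexts c_1..c_k."""
    return random.choices(list(noise), weights=list(noise.values()), k=k)

print(sample_negatives(5))
```

Note that the most frequent word ("the") ends up with a smaller probability under the 3/4-smoothed distribution than under the raw unigram distribution, which is the point of the smoothing.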
In the work of Mikolov et al. each context is a word (and all words appear as contexts), and so p_contexts(x) = p_words(x) = count(x)/|Text|.

2.1 Remarks

• Unlike the skip-gram model described above, the formulation in this section does not model p(c|w) but instead models a quantity related to the joint distribution of w and c.

• If we fix the word representations and learn only the context representations, or fix the context representations and learn only the word representations, the model reduces to logistic regression, and is convex. However, in this model the word and context representations are learned jointly, making the model non-convex.

3 Context definitions

This section lists some peculiarities of the contexts used in the word2vec software, as reflected in the code. Generally speaking, for a sentence of n words w_1, ..., w_n, the contexts of a word w_i come from a window of size k around the word: C(w_i) = {w_{i−k}, ..., w_{i−1}, w_{i+1}, ..., w_{i+k}}, where k is a parameter. However, there are two subtleties:

Dynamic window size: the window size that is being used is dynamic; the parameter k denotes the maximal window size. For each word in the corpus, a window size k′ is sampled uniformly from 1, ..., k.

Effect of subsampling and rare-word pruning: word2vec has two additional parameters for discarding some of the input words: words appearing less than min-count times are not considered as either words or contexts, and in addition frequent words (as defined by the sample parameter) are down-sampled. Importantly, these words are removed from the text before generating the contexts. This has the effect of increasing the effective window size for certain words. According to Mikolov et al. [2], sub-sampling of frequent words improves the quality of the resulting embedding on some benchmarks.
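The two subtleties above can be sketched as follows. The sentence and the choice of which word is down-sampled are made up for illustration; real word2vec decides the latter stochastically from the sample parameter:

```python
import random

random.seed(1)

def contexts(words, i, k):
    """C(w_i) with a dynamic window: k is the *maximal* window size;
    the actual size k' is sampled uniformly from 1..k per focus word."""
    k_prime = random.randint(1, k)
    return words[max(0, i - k_prime):i] + words[i + 1:i + 1 + k_prime]

sentence = "the quick brown fox jumps over the lazy dog".split()

# Subsampling and min-count pruning remove words from the text *before*
# contexts are generated, so the surviving words move closer together
# and the effective window of the remaining words grows.
pruned = [w for w in sentence if w != "the"]  # pretend "the" was down-sampled

i = pruned.index("fox")
print(contexts(pruned, i, k=2))
```

With k = 2 the focus word "fox" can now reach "over" and "lazy", which were three and four positions away in the original sentence.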
The original motivation for sub-sampling was that frequent words are less informative. Here we see another explanation for its effectiveness: the effective window size grows, including context words which are both content-full and linearly far away from the focus word, thus making the similarities more topical.

4 Why does this produce good word representations?

Good question. We don't really know.

The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity v_w · v_c for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other (note also that contexts sharing many words will also be similar to each other). This is, however, very hand-wavy. Can we make this intuition more precise? We'd really like to see something more formal.

References

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013.
