Word Embeddings via Tensor Factorization
Eric Bailey, Charles Meyer, and Shuchin Aeron
Tufts University, Medford, MA 02155

Abstract

Many state-of-the-art word embedding techniques involve factorization of a co-occurrence based matrix. We aim to extend this approach by studying word embedding techniques that involve factorization of co-occurrence based tensors (N-way arrays). We present two new word embedding techniques based on tensor factorization and show that they outperform common methods on several semantic NLP tasks when given the same data. To train one of the embeddings, we present a new joint tensor factorization problem and an approach for solving it. Furthermore, we modify the performance metrics for the Outlier Detection task (Camacho-Collados and Navigli 2016) to measure the quality of higher-order relationships that a word embedding captures. Our tensor-based methods significantly outperform existing methods at this task when using our new metric. Finally, we demonstrate that vectors in our embeddings can be composed multiplicatively to create different vector representations for each meaning of a polysemous word. We show that this property stems from the higher-order information that the vectors contain, and thus is unique to our tensor-based embeddings.

Introduction

Word embeddings have been used to improve the performance of many NLP tasks, including language modelling (Bengio et al. 2003), machine translation (Bahdanau, Cho, and Bengio 2014), and sentiment analysis (Kim 2014). The broad applicability of word embeddings to NLP implies that improvements to their quality will likely have widespread benefits for the field.

The word embedding problem is to learn a mapping $\eta : V \to \mathbb{R}^k$ (with $k \approx$ 100-300 in most applications) that encodes meaningful semantic and/or syntactic information. For instance, in many word embeddings, $\eta(\text{car}) \approx \eta(\text{truck})$, since the words are semantically similar.

More complex relationships than similarity can also be encoded in word embeddings. For example, we can answer analogy queries of the form a : b :: c : ? using simple arithmetic in many state-of-the-art embeddings (Mikolov et al. 2013). The answer to bed : sleep :: chair : x is given by the word whose vector representation is closest to $\eta(\text{sleep}) - \eta(\text{bed}) + \eta(\text{chair})$ ($\approx \eta(\text{sit})$). Other embeddings may encode such information in a nonlinear way (Jastrzebski, Lesniak, and Czarnecki 2017).

(Mikolov et al. 2013) demonstrates the additive compositionality of their word2vec vectors: one can sum vectors produced by their embedding to compute vectors for certain phrases rather than just vectors for words. Later in this paper, we will show that our embeddings naturally give rise to a form of multiplicative compositionality that has not yet been explored in the literature.

Almost all recent word embeddings rely on the distributional hypothesis (Harris 1954), which states that a word's meaning can be inferred from the words that tend to surround it. To utilize the distributional hypothesis, many embeddings are given by a low-rank factor of a matrix derived from co-occurrences in a large unsupervised corpus; see (Pennington, Socher, and Manning 2014; Murphy, Talukdar, and Mitchell 2012; Levy and Goldberg 2014) and (Salle, Villavicencio, and Idiart 2016). Approaches that rely on matrix factorization only utilize pairwise co-occurrence information in the corpus. We aim to extend this approach by creating word embeddings given by factors of tensors containing higher-order co-occurrence data.
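To make the analogy arithmetic from the Introduction concrete, here is a minimal illustrative sketch (not code from the paper). It assumes `emb` is a Python dict mapping each word to its NumPy embedding vector, and answers a : b :: c : ? by a cosine nearest-neighbor search around $\eta(b) - \eta(a) + \eta(c)$.

```python
import numpy as np

def analogy(emb, a, b, c, exclude_queries=True):
    """Answer a : b :: c : ? by returning the word whose vector is closest
    (by cosine similarity) to emb[b] - emb[a] + emb[c]."""
    words = list(emb.keys())
    mat = np.stack([emb[w] for w in words])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    query = emb[b] - emb[a] + emb[c]
    query = query / np.linalg.norm(query)
    scores = mat @ query                       # cosine similarity to every word
    for idx in np.argsort(-scores):            # best match first
        w = words[idx]
        if not (exclude_queries and w in (a, b, c)):
            return w

# e.g. analogy(emb, "bed", "sleep", "chair") is expected to return "sit"
# for a well-trained embedding.
```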
Related work

Some common word embeddings related to co-occurrence based matrix factorization include GloVe (Pennington, Socher, and Manning 2014), word2vec (Levy and Goldberg 2014), LexVec (Salle, Villavicencio, and Idiart 2016), and NNSE (Murphy, Talukdar, and Mitchell 2012). In contrast, our work studies word embeddings given by factorization of tensors. An overview of tensor factorization methods is given in (Kolda and Bader 2009).

Our work uses factorization of symmetric nonnegative tensors, which has been studied in the past (Wang and Qi 2007; Comon et al. 2008). More generally, tensor factorization has been applied to NLP in (Van de Cruys, Poibeau, and Korhonen 2013), and nonnegative tensor factorization in (Van de Cruys 2009). Recently, factorization of symmetric tensors has been used to create a generic word embedding (Sharan and Valiant 2017), but the idea was not explored extensively. Our work studies this idea in much greater detail, fully demonstrating the viability of tensor factorization as a technique for training word embeddings.

Composition of word vectors to create novel representations has been studied in depth, including additive, multiplicative, and tensor-based methods (Mitchell and Lapata 2010; Blacoe and Lapata 2012). Typically, composition is used to create vectors that represent phrases or sentences. Our work, instead, shows that pairs of word vectors can be composed multiplicatively to create different vector representations for the various meanings of a single polysemous word.

Mathematical preliminaries

Notation. Throughout this paper we write scalars in lowercase italics ($\alpha$), vectors in lowercase bold ($\mathbf{v}$), matrices in uppercase bold ($\mathbf{M}$), and tensors (of order $N > 2$) in Euler script ($\mathcal{X}$), as is standard in the literature.

Pointwise Mutual Information. Pointwise mutual information (PMI) is a useful property in NLP that quantifies the likelihood that two words co-occur (Levy and Goldberg 2014). It is defined as

$$PMI(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)},$$

where $p(x, y)$ is the probability that $x$ and $y$ occur together in a given fixed-length context window in the corpus, irrespective of order.

It is often useful to consider the positive PMI (PPMI), defined as

$$PPMI(x, y) := \max(0, PMI(x, y)),$$

since negative PMI values have little grounded interpretation (Bullinaria and Levy 2007; Levy and Goldberg 2014; Van de Cruys 2009).

Given an indexed vocabulary $V = \{w_1, \ldots, w_{|V|}\}$, one can construct a $|V| \times |V|$ PPMI matrix $\mathbf{M}$ with $m_{ij} = PPMI(w_i, w_j)$. Many existing word embedding techniques involve factorizing this PPMI matrix (Levy and Goldberg 2014; Murphy, Talukdar, and Mitchell 2012; Salle, Villavicencio, and Idiart 2016).

PMI can be generalized to $N$ variables. While there are many ways to do so (Van de Cruys 2011), in this paper we use the form

$$PMI(x_1, \ldots, x_N) = \log \frac{p(x_1, \ldots, x_N)}{p(x_1) \cdots p(x_N)},$$

where $p(x_1, \ldots, x_N)$ is the probability that all of $x_1, \ldots, x_N$ occur together in a given fixed-length context window in the corpus, irrespective of their order.

In this paper we study 3-way PPMI tensors $\mathcal{M}$, where $m_{ijk} = PPMI(w_i, w_j, w_k)$, as this is the natural higher-order generalization of the PPMI matrix. We leave the study of creating word embeddings with $N$-dimensional PPMI tensors ($N > 3$) to future work.
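As a concrete illustration of this construction, the sketch below estimates 3-way PPMI values from a tokenized corpus. It is not the authors' implementation: the window handling, the probability estimates, and the sparse dict-of-triples storage are assumptions made for illustration only.

```python
import itertools
import math
from collections import Counter

def ppmi_tensor(sentences, vocab, window=5):
    """Sketch: estimate 3-way PPMI values for word triples that co-occur
    within a fixed-length context window. `sentences` is an iterable of
    token lists; `vocab` maps word -> index. Returns a sparse dict
    {(i, j, k): PPMI} over sorted index triples (the tensor is
    supersymmetric, so one canonical entry per unordered triple suffices)."""
    unigram = Counter()      # per-window word counts
    triple = Counter()       # unordered co-occurrence triple counts
    n_windows = 0
    for sent in sentences:
        ids = [vocab[w] for w in sent if w in vocab]
        for start in range(len(ids)):
            ctx = ids[start:start + window]
            if len(ctx) < 3:
                continue
            n_windows += 1
            unigram.update(ctx)
            for tri in itertools.combinations(ctx, 3):
                triple[tuple(sorted(tri))] += 1
    total_uni = sum(unigram.values())
    tensor = {}
    for (i, j, k), c in triple.items():
        # PMI(x1, x2, x3) = log p(x1, x2, x3) / (p(x1) p(x2) p(x3))
        p_joint = c / n_windows
        p_indep = (unigram[i] / total_uni) * (unigram[j] / total_uni) * (unigram[k] / total_uni)
        pmi = math.log(p_joint / p_indep)
        if pmi > 0:          # keep only positive PMI entries (PPMI)
            tensor[(i, j, k)] = pmi
    return tensor
```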
Tensor factorization

Just as the rank-R matrix decomposition is defined to be the product of two factor matrices ($\mathbf{M} \approx \mathbf{U}\mathbf{V}^{\top}$), the canonical rank-R tensor decomposition for a third-order tensor is defined to be the product of three factor matrices (Kolda and Bader 2009):

$$\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{u}_r \otimes \mathbf{v}_r \otimes \mathbf{w}_r =: [\![\mathbf{U}, \mathbf{V}, \mathbf{W}]\!], \qquad (1)$$

where $\otimes$ is the outer product: $(\mathbf{a} \otimes \mathbf{b} \otimes \mathbf{c})_{ijk} = a_i b_j c_k$. This is also commonly referred to as the rank-R CP Decomposition. Elementwise, this is written as

$$x_{ijk} \approx \sum_{r=1}^{R} u_{ir} v_{jr} w_{kr} = \langle \mathbf{u}_i * \mathbf{v}_j, \mathbf{w}_k \rangle,$$

where $*$ is elementwise vector multiplication and $\mathbf{u}_i$ is the $i$-th row of $\mathbf{U}$. In our later section on multiplicative compositionality, we will see that this formulation gives rise to a meaningful interpretation of the elementwise product between vectors in our word embeddings.

Symmetric CP Decomposition. In this paper, we consider the symmetric CP decomposition of nonnegative tensors (Lim 2005; Kolda and Bader 2009). Since our N-way PPMI is nonnegative and invariant under permutation, the PPMI tensor $\mathcal{M}$ is nonnegative and supersymmetric, i.e. $m_{ijk} = m_{\sigma(i)\sigma(j)\sigma(k)} \ge 0$ for any permutation $\sigma \in S_3$. In the symmetric CP decomposition, instead of factorizing $\mathcal{M} \approx [\![\mathbf{U}, \mathbf{V}, \mathbf{W}]\!]$, we factorize $\mathcal{M}$ as the triple product of a single factor matrix $\mathbf{U} \in \mathbb{R}^{|V| \times R}$ such that

$$\mathcal{M} \approx [\![\mathbf{U}, \mathbf{U}, \mathbf{U}]\!].$$

In this formulation, we take $\mathbf{U}$ to be the word embedding, so the vector for $w_i$ is the $i$-th row of $\mathbf{U}$, similar to the formulations in (Levy and Goldberg 2014; Murphy, Talukdar, and Mitchell 2012; Pennington, Socher, and Manning 2014).

It is known that the optimal rank-k CP decomposition exists for symmetric nonnegative tensors such as the PPMI tensor (Lim 2005). However, finding such a decomposition is NP-hard in general (Håstad 1990), so we must consider approximate methods. In this work, we only consider the symmetric CP decomposition, leaving the study of other tensor decompositions (such as the Tensor Train or HOSVD (Oseledets 2011; Kolda and Bader 2009)) to future work.

Methodologies

Computing the Symmetric CP Decomposition

The $\Theta(|V|^3)$ size of the third-order PPMI tensor presents a number of computational challenges. In practice, $|V|$ can vary from $10^4$ to $10^6$, so even at the small end the naive dense representation requires at least $4 \times 10{,}000^3$ bytes = 4 TB of floats. Even the sparse representation of the tensor takes up such a large fraction of memory that standard algorithms such as successive rank-1 approximation (Wang and Qi 2007; Mu, Hsu, and Goldfarb 2015) and alternating least squares (Kolda and Bader 2009) are infeasible for our uses. Thus, in this paper we consider a stochastic online formulation similar to that of (Maehara, Hayashi, and Kawarabayashi 2016).

We optimize the CP decomposition in an online fashion, using small random subsets $\mathcal{M}_t$ of the nonzero tensor entries to update the decomposition at time $t$. In this minibatch setting, we optimize the decomposition based on the current minibatch and the previous decomposition at time $t-1$. To update $\mathbf{U}$ (and thus the symmetric decomposition), we first define a decomposition loss $L(\mathcal{M}_t, \mathbf{U})$ and minimize this loss with respect to $\mathbf{U}$ using Adam (Kingma and Ba 2014). At each time $t$, we take $\mathcal{M}_t$ to be all co-occurrence triples (weighted by PPMI) in a fixed number of sentences (around 1,000) from the corpus. We continue training until we have depleted the entire corpus.
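The following is a minimal sketch of one such online update. The paper does not publish its factorization code, so PyTorch, the batch format, and the hyperparameters shown here are our own assumptions; the squared-error objective is the CP-S decomposition loss given in the next section.

```python
import torch

def cps_minibatch_step(U, batch, optimizer):
    """One stochastic update of the symmetric CP factor U (|V| x R).
    `batch` is a list of (i, j, k, ppmi) tuples drawn from the nonzero
    entries of the PPMI tensor (plus some zero-PPMI "negative samples").
    Minimizes the squared reconstruction error with Adam."""
    idx = torch.tensor([[i, j, k] for i, j, k, _ in batch], dtype=torch.long)
    target = torch.tensor([v for *_, v in batch], dtype=torch.float32)
    # predicted entry: <u_i * u_j, u_k> = sum_r U[i,r] U[j,r] U[k,r]
    pred = (U[idx[:, 0]] * U[idx[:, 1]] * U[idx[:, 2]]).sum(dim=1)
    loss = ((target - pred) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch (R = 300, hypothetical vocabulary size and learning rate):
# U = torch.nn.Parameter(0.1 * torch.randn(vocab_size, 300))
# optimizer = torch.optim.Adam([U], lr=1e-3)
# for batch in minibatches_of_ppmi_triples:   # ~1,000 sentences per batch
#     cps_minibatch_step(U, batch, optimizer)
```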
For $\mathcal{M}_t$ to accurately model $\mathcal{M}$, we also include a certain proportion of elements with zero PPMI (or "negative samples") in $\mathcal{M}_t$, similar to (Salle, Villavicencio, and Idiart 2016). We use an empirically found proportion of negative samples for training, and leave discovery of the optimal negative sample proportion to future work.

Word Embedding Proposals

CP-S. The first embedding we propose is based on the symmetric CP decomposition of the PPMI tensor $\mathcal{M}$ discussed in the mathematical preliminaries section. The optimal setting for the word embedding $\mathbf{W}$ is

$$\mathbf{W} := \operatorname*{argmin}_{\mathbf{U}} \; \| \mathcal{M} - [\![\mathbf{U}, \mathbf{U}, \mathbf{U}]\!] \|_F.$$

Since we cannot feasibly compute this exactly, we minimize the loss function defined as the squared error between the values in $\mathcal{M}_t$ and their predicted values,

$$L(\mathcal{M}_t, \mathbf{U}) = \sum_{m^t_{ijk} \in \mathcal{M}_t} \Big( m^t_{ijk} - \sum_{r=1}^{R} u_{ir} u_{jr} u_{kr} \Big)^2,$$

using the techniques discussed in the previous section.

JCP-S. A potential problem with CP-S is that it is only trained on third-order information. To rectify this issue, we propose a novel joint tensor factorization problem we call Joint Symmetric Rank-R CP Decomposition. In this problem, the input is the fixed rank $R$ and a list of supersymmetric tensors $\mathcal{M}_n$ of different orders but whose axis lengths all equal $|V|$. Each tensor $\mathcal{M}_n$ is to be factorized via rank-R symmetric CP decomposition using a single $|V| \times R$ factor matrix $\mathbf{U}$.

To produce a solution, we first define the loss at time $t$ to be the sum of the reconstruction losses of each different tensor,

$$L_{\text{joint}}\big((\mathcal{M}_t)_{n=2}^{N}, \mathbf{U}\big) = \sum_{n=2}^{N} L(\mathcal{M}^t_n, \mathbf{U}),$$

where $\mathcal{M}_n$ is an $n$-dimensional supersymmetric PPMI tensor. We then minimize this loss with respect to $\mathbf{U}$. Since we are using at most third-order tensors in this work, we assign our word embedding $\mathbf{W}$ to be

$$\mathbf{W} := \operatorname*{argmin}_{\mathbf{U}} \; L_{\text{joint}}(\mathcal{M}_2, \mathcal{M}_3, \mathbf{U}) = \operatorname*{argmin}_{\mathbf{U}} \; L(\mathcal{M}_2, \mathbf{U}) + L(\mathcal{M}_3, \mathbf{U}).$$

This problem is a specific instance of Coupled Tensor Decomposition, which has been studied in the past (Acar, Kolda, and Dunlavy 2011; Naskovska and Haardt 2016). In that problem, the goal is to factorize multiple tensors using at least one factor matrix in common. A similar formulation to ours can be found in (Comon, Qi, and Usevich 2015), which studies blind source separation using the algebraic geometric aspects of jointly factorizing numerous supersymmetric tensors (to unknown rank). In contrast to our work, they outline some generic rank properties of such a decomposition rather than attacking the problem numerically. Also, in our formulation the rank is fixed and an approximate solution must be found. Exploring the connection between the theoretical aspects of joint decomposition and the quality of word embeddings would be an interesting avenue for future work. To the best of our knowledge, this is the first study of Joint Symmetric Rank-R CP Decomposition.

Shifted PMI

In the same way that (Levy and Goldberg 2014) considers factorization of positive shifted PMI matrices, we consider factorization of positive shifted PMI tensors $\mathcal{M}$, where $m_{ijk} = \max(PMI(w_i, w_j, w_k) - \alpha, 0)$ for some constant shift $\alpha$. We empirically found that different levels of shifting resulted in different qualities of word embeddings: the best shift we found for CP-S was $\alpha \approx 3$, whereas any nonzero shift for JCP-S resulted in a worse embedding across the board. When we discuss evaluation, we report the results given by factorization of the PPMI tensors shifted by the best value we found for each specific embedding.
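To make the proposals above concrete, here is a minimal sketch of the JCP-S joint loss on one minibatch, again assuming PyTorch and a simple (index, PPMI-value) batch format. How the matrix and tensor entries are sampled, and how a shift $\alpha$ would be applied when the PPMI values are computed, are assumptions left to the data pipeline.

```python
import torch

def jcps_joint_loss(U, pairs, triples):
    """Joint Symmetric Rank-R CP loss: L(M2, U) + L(M3, U).
    `pairs`   : (i, j, ppmi)    entries of the PPMI matrix M2
    `triples` : (i, j, k, ppmi) entries of the PPMI tensor M3
    The same factor matrix U (|V| x R) reconstructs both objects."""
    pi = torch.tensor([[i, j] for i, j, _ in pairs], dtype=torch.long)
    pv = torch.tensor([v for *_, v in pairs], dtype=torch.float32)
    ti = torch.tensor([[i, j, k] for i, j, k, _ in triples], dtype=torch.long)
    tv = torch.tensor([v for *_, v in triples], dtype=torch.float32)
    pred2 = (U[pi[:, 0]] * U[pi[:, 1]]).sum(dim=1)                 # <u_i, u_j>
    pred3 = (U[ti[:, 0]] * U[ti[:, 1]] * U[ti[:, 2]]).sum(dim=1)   # <u_i * u_j, u_k>
    return ((pv - pred2) ** 2).sum() + ((tv - pred3) ** 2).sum()
```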
Computational notes

When going from two dimensions to three, it is worth discussing the computational issues that come with such an increase in problem size. It should be noted, however, that the creation of pre-trained embeddings can be seen as a pre-processing step for many future NLP tasks, so if the training can be completed once, the embedding can be reused thereafter without taking training time into account. Even so, we found that training our embeddings was not considerably slower than training order-2 equivalents such as SGNS. Explicitly, on our GPU, training CBOW vectors (using the experimental settings found below) took 3,568 seconds, whereas training CP-S and JCP-S took 6,786 and 8,686 seconds respectively.

Evaluation

In this section we present a quantitative evaluation comparing our embeddings to an informationless embedding and two strong baselines. Our baselines are:

1. Random (random vectors with i.i.d. entries normally distributed with mean 0 and variance 1/2), for comparison against a model with no meaningful information encoded.
2. word2vec (CBOW) (Mikolov et al. 2013), for comparison against the most commonly used embedding method, as well as against a technique related to PPMI matrix factorization (Levy and Goldberg 2014).
3. NNSE [1] (Murphy, Talukdar, and Mitchell 2012), for comparison against a technique that relies on an explicit PPMI matrix factorization.

[1] The input to NNSE is an $m \times n$ matrix, where there are $m$ words and $n$ co-occurrence patterns. In our experiments, we set $m = n = |V|$ and set the co-occurrence information to be the number of times $w_i$ appears within a window of 5 words of $w_j$. As stated in the original paper, the matrix entries are weighted by PPMI.

For a fair comparison, we trained each model on the same corpus of 10 million sentences gathered from Wikipedia. We removed stopwords and words appearing fewer than 2,000 times (130 million tokens total) to reduce noise and uninformative words. Our word2vec and NNSE baselines were trained using the recommended hyperparameters from their original publications, and all optimizers used their default settings. Hyperparameters are always consistent across evaluations.

Because of the dataset size, the results shown should be considered a proof of concept rather than an objective comparison to state-of-the-art pre-trained embeddings. Due to the natural computational challenges arising from working with tensors, we leave creation of a full-scale, production-ready embedding based on tensor factorization to future work.

As is common in the literature (Mikolov et al. 2013; Murphy, Talukdar, and Mitchell 2012), we use 300-dimensional vectors for our embeddings, and all word vectors are normalized to unit length prior to evaluation.

Quantitative tasks

Outlier Detection. The Outlier Detection task (Camacho-Collados and Navigli 2016) is to determine which word in a list $L$ of $n + 1$ words is unrelated to the other $n$, which were chosen to be related. For each $w \in L$, one can compute its compactness score $c(w)$, which is the compactness of $L \setminus \{w\}$; $c(w)$ is explicitly computed as the mean similarity of all word pairs $(w_i, w_j)$ with $w_i, w_j \in L \setminus \{w\}$. The predicted outlier is $\operatorname{argmax}_{w \in L} c(w)$, as the $n$ related words should form a compact cluster with high mean similarity.

We use the WikiSem500 dataset (Blair, Merhav, and Barry 2016), which includes sets of $n = 8$ words per group gathered based on semantic similarity. Thus, performance on this task is correlated with the amount of semantic information encoded in a word embedding.
Performance on this dataset was shown to be well-correlated with performance at the common NLP task of sentiment analysis (Blair, Merhav, and Barry 2016).

The two metrics associated with this task are accuracy and Outlier Position Percentage (OPP). Accuracy is the fraction of cases in which the true outlier correctly had the highest compactness score. OPP measures how close the true outlier was to having the highest compactness score, rewarding embeddings more for placing the outlier in 2nd place rather than nth when the words are sorted by their compactness score $c(w)$.

3-way Outlier Detection. As our tensor-based embeddings encode higher-order relationships between words, we introduce a new way to compute $c(w)$ based on groups of 3 words rather than pairs of words. We define the compactness score for a word $w$ to be

$$c(w) = \sum_{v_{i_1} \neq v_w} \; \sum_{v_{i_2} \neq v_w, v_{i_1}} \; \sum_{v_{i_3} \neq v_w, v_{i_1}, v_{i_2}} \operatorname{sim}(v_{i_1}, v_{i_2}, v_{i_3}),$$

where $\operatorname{sim}(\cdot)$ denotes similarity between a group of 3 vectors, defined as

$$\operatorname{sim}(v_1, v_2, v_3) = \frac{1}{3} \sum_{i=1}^{3} \Big\| v_i - \frac{1}{3} \sum_{j=1}^{3} v_j \Big\|_2^{-1}.$$

We call this evaluation method OD3. The purpose of OD3 is to evaluate the extent to which an embedding captures 3rd-order relationships between words. As we will see in the results of our quantitative experiments, our tensor methods outperform the baselines on OD3, which validates our approach. This approach can easily be generalized to ODN (N > 3), but again we leave the study of higher-order relationships to future work.

Simple supervised tasks. (Jastrzebski, Lesniak, and Czarnecki 2017) points out that the primary application of word embeddings is transfer learning to NLP tasks. They argue that to evaluate an embedding's ability to transfer information to a relevant task, one must measure how accessible its information is to actual downstream tasks. To do so, one should measure the performance of simple supervised tasks as the training set size increases, as is commonly done in transfer learning evaluation (Jastrzebski, Lesniak, and Czarnecki 2017). If an algorithm using a word embedding performs well with just a small amount of training data, then the information encoded in the embedding is easily accessible.

The simple supervised downstream tasks we use to evaluate the embeddings are as follows:

1. Supervised analogy recovery. We consider the task of solving queries of the form a : b :: c : ? using a simple neural network, as suggested in (Jastrzebski, Lesniak, and Czarnecki 2017). The analogy dataset we use is from the Google analogy testbed (Mikolov et al. 2013).
2. Sentiment analysis. We also consider sentiment analysis as described by (Schnabel et al. 2015). We use the suggested Large Movie Review dataset (Maas et al. 2011), containing 50,000 movie reviews.

All code is implemented using scikit-learn or TensorFlow and uses the suggested train/test split.

Word similarity. To standardize our evaluation methodology, we also evaluate the embeddings using word similarity on the common MEN and MTurk datasets (Bruni, Tran, and Baroni 2014; Radinsky et al. 2011). For an overview of word similarity evaluation, see (Schnabel et al. 2015).
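Returning to the Outlier Detection task, the sketch below computes both the pairwise (OD2) compactness score and our OD3 variant for a candidate word, assuming unit-normalized NumPy vectors. It iterates over unordered pairs and triples, which differs from the ordered sums above only by a constant factor and therefore leaves the argmax unchanged; the inverse-distance similarity follows the sim(·) formula as reconstructed above.

```python
import itertools
import numpy as np

def pairwise_compactness(vectors, w):
    """OD2 compactness c(w): mean cosine similarity over all pairs drawn
    from the cluster with word index w removed (vectors are unit length)."""
    rest = [v for i, v in enumerate(vectors) if i != w]
    return float(np.mean([u @ v for u, v in itertools.combinations(rest, 2)]))

def triple_compactness(vectors, w):
    """OD3 compactness c(w): total 3-way similarity over all triples drawn
    from the cluster with w removed, where sim(v1, v2, v3) is the mean
    inverse distance of each vector to the triple's centroid."""
    rest = [v for i, v in enumerate(vectors) if i != w]
    total = 0.0
    for v1, v2, v3 in itertools.combinations(rest, 3):
        centroid = (v1 + v2 + v3) / 3.0
        total += np.mean([1.0 / np.linalg.norm(v - centroid) for v in (v1, v2, v3)])
    return total

def predict_outlier(vectors, score=triple_compactness):
    """The predicted outlier is the index with the highest compactness score."""
    return int(np.argmax([score(vectors, w) for w in range(len(vectors))]))
```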
Quantitative results

Outlier Detection results. The results are shown in Table 1.

Table 1: Outlier Detection scores across all embeddings

| Method | OD2 OPP | OD2 acc | OD3 OPP | OD3 acc |
|--------|---------|---------|---------|---------|
| Random | 0.6123  | 0.2765  | 0.5345  | 0.1950  |
| CBOW   | 0.6542  | 0.3731  | 0.6162  | 0.3034  |
| NNSE   | 0.6998  | 0.4288  | 0.6292  | 0.3190  |
| CP-S   | 0.7078  | 0.4370  | 0.6741  | 0.3597  |
| JCP-S  | 0.7017  | 0.4242  | 0.6666  | 0.3201  |

The first thing to note is that CP-S outperforms the other methods across each Outlier Detection metric. Since the WikiSem500 dataset is semantically focused, performance at this task demonstrates the quality of semantic information encoded in our embeddings. On OD2, the baselines are more competitive with our CP decomposition based models, but when OD3 is considered our methods clearly excel. Since the tensor-based methods are trained directly on third-order information and perform much better at OD3, we see that OD3 scores reflect the amount of third-order information in a word embedding. This is a validation of OD3, as our 3rd-order embeddings would naturally outperform 2nd-order embeddings at a task that requires third-order information. Still, the superiority of our tensor-based embeddings at OD2 demonstrates the quality of the semantic information they encode.

Supervised analogy results. The results are shown in Figure 1 (analogy task performance vs. % training data). At the supervised semantic analogy task, CP-S vastly outperforms the baselines at all levels of training data, further signifying the amount of semantic information encoded by this embedding technique. Also, when only 10% of the training data is presented, our tensor methods are the only ones that attain nonzero performance: even in such a limited data setting, using CP-S's vectors results in nearly 40% accuracy. This phenomenon is also observed in the syntactic analogy tasks: our embeddings consistently outperform the others until 100% of the training data is presented. These two observations demonstrate the accessibility of the information encoded in our word embeddings. We can thus conclude that the relational information encoded in the tensor-based embeddings is more easily accessible than that of CBOW and NNSE. Thus, our methods would likely be better suited for transfer learning to actual NLP tasks, particularly those in data-sparse settings.

Sentiment analysis results. The results are shown in Figure 2 (sentiment analysis task performance vs. % training data). In this task, JCP-S is the dominant method across all levels of training data, but the difference is more obvious when training data is limited. This again indicates that for this specific task the information encoded by our tensor-based methods is more readily available than that of the baselines. It is thus evident that exploiting both second- and third-order co-occurrence data leads to higher quality semantic information being encoded in the embedding. At this point it is not clear why JCP-S so vastly outperforms CP-S at this task, but its superiority to the other strong baselines demonstrates the quality of information encoded by JCP-S. This discrepancy also illustrates the fact that there is no single "best word embedding" (Jastrzebski, Lesniak, and Czarnecki 2017): different embeddings encode different types of information, and thus should be used where they shine rather than for every NLP task.

Word Similarity results.
We show the results in Table 2.

Table 2: Word Similarity Scores (Spearman's ρ)

| Method | MEN    | MTurk  |
|--------|--------|--------|
| Random | -0.028 | -0.150 |
| CBOW   | 0.601  | 0.498  |
| NNSE   | 0.717  | 0.686  |
| CP-S   | 0.630  | 0.631  |
| JCP-S  | 0.621  | 0.669  |

As we can see, our embeddings very clearly outperform the random embedding at this task. They even outperform CBOW on both of these datasets. It is worth including these results, as the word similarity task is a very common way of evaluating embedding quality in the literature. However, due to the many intrinsic problems with evaluating word embeddings using word similarity (Faruqui et al. 2016), we do not discuss this further.

Multiplicative Compositionality

We find that, even though they are not explicitly trained to do so, our tensor-based embeddings capture polysemy information naturally through multiplicative compositionality. We demonstrate this property qualitatively and provide proper motivation for it, leaving automated utilization to future work.

In our tensor-based embeddings, we found that one can create a vector that represents a word $w$ in the context of another word $w'$ by taking the elementwise product $\mathbf{v}_w * \mathbf{v}_{w'}$. We call $\mathbf{v}_w * \mathbf{v}_{w'}$ a "meaning vector" for the polysemous word $w$. For example, consider the word star, which can denote a lead performer or a celestial body. We can create a vector for star in the "lead performer" sense by taking the elementwise product $\mathbf{v}_{star} * \mathbf{v}_{actor}$. This produces a vector that lies near vectors for words related to lead performers and far from those related to star's other senses.

To motivate why this works, recall that the values in a third-order PPMI tensor $\mathcal{M}$ are given by

$$m_{ijk} = PPMI(w_i, w_j, w_k) \approx \sum_{r=1}^{R} v_{ir} v_{jr} v_{kr} = \langle \mathbf{v}_i * \mathbf{v}_j, \mathbf{v}_k \rangle,$$

where $\mathbf{v}_i$ is the word vector for $w_i$. If words $w_i, w_j, w_k$ have a high PPMI, then $\langle \mathbf{v}_i * \mathbf{v}_j, \mathbf{v}_k \rangle$ will also be high, meaning $\mathbf{v}_i * \mathbf{v}_j$ will be close to $\mathbf{v}_k$ in the vector space by cosine similarity. For example, even though galaxy is likely to appear in the context of the word star in the "celestial body" sense, $\langle \mathbf{v}_{star} * \mathbf{v}_{actor}, \mathbf{v}_{galaxy} \rangle \approx PPMI(star, actor, galaxy)$ is low, whereas $\langle \mathbf{v}_{star} * \mathbf{v}_{actor}, \mathbf{v}_{drama} \rangle \approx PPMI(star, actor, drama)$ is high. Thus, $\mathbf{v}_{star} * \mathbf{v}_{actor}$ represents the meaning of star in the "lead performer" sense.

In Table 3 we present the nearest neighbors of multiplicatively and additively composed vectors for a variety of polysemous words. As we can see, the nearest neighbors of the composed vectors for our tensor methods are semantically related to the intended sense both for multiplicative and additive composition. In contrast, for CBOW, only additive composition yields vectors whose nearest neighbors are semantically related to the intended sense. Thus, our embeddings can produce complementary sets of polysemous word representations that are qualitatively valid, whereas CBOW (seemingly) only guarantees meaningful additive compositionality. We leave automated usage of this property to future work.

Table 3: Nearest neighbors (in cosine similarity) to elementwise products of word vectors

| Composition   | Nearest neighbors (CP-S)         | Nearest neighbors (JCP-S)    | Nearest neighbors (CBOW)   |
|---------------|----------------------------------|------------------------------|----------------------------|
| star * actor  | oscar, award-winning, supporting | roles, drama, musical        | DNA, younger, tip          |
| star + actor  | stars, movie, actress            | actress, trek, picture       | actress, comedian, starred |
| star * planet | planets, constellation, trek     | galaxy, earth, minor         | fingers, layer, arm        |
| star + planet | sun, earth, galaxy               | galaxy, dwarf, constellation | galaxy, planets, earth     |
| tank * fuel   | liquid, injection, tanks         | vehicles, motors, vehicle    | armored, tanks, armoured   |
| tank + fuel   | tanks, engines, injection        | vehicles, tanks, powered     | tanks, engine, diesel      |
| tank * weapon | gun, ammunition, tanks           | brigade, cavalry, battalion  | persian, age, rapid        |
| tank + weapon | tanks, armor, rifle              | tanks, battery, batteries    | tanks, cannon, armored     |
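A small sketch of the composition just described, assuming `emb` is a dict mapping words to the trained (CP-S or JCP-S) NumPy vectors; the helper names and the example query are illustrative only.

```python
import numpy as np

def meaning_vector(emb, word, context_word):
    """Meaning vector for `word` in the sense selected by `context_word`:
    the elementwise (Hadamard) product of the two word vectors."""
    return emb[word] * emb[context_word]

def nearest_neighbors(emb, query, k=3, exclude=()):
    """Top-k words whose unit-normalized vectors have the highest cosine
    similarity with `query`."""
    words = [w for w in emb if w not in exclude]
    mat = np.stack([emb[w] for w in words])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    scores = mat @ (query / np.linalg.norm(query))
    return [words[i] for i in np.argsort(-scores)[:k]]

# e.g. nearest_neighbors(emb, meaning_vector(emb, "star", "actor"),
#                        exclude=("star", "actor"))
# is expected to return performer-related words such as "oscar" for CP-S.
```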
Conclusion

Our key contributions are as follows:

1. Two novel tensor factorization based word embeddings. We presented CP-S and JCP-S, which are word embedding techniques based on symmetric CP decomposition. We experimentally demonstrated that these embeddings outperform existing matrix-based techniques on a number of downstream semantic tasks when trained on the same data.
2. A novel joint symmetric tensor factorization problem. We introduced and utilized Joint Symmetric Rank-R CP Decomposition to train JCP-S. In this problem, multiple supersymmetric tensors must be decomposed using a single rank-R factor matrix. This technique allows for utilization of both second- and third-order co-occurrence information in word embedding training.
3. A new embedding evaluation metric to measure the amount of third-order information. We produce a 3-way analogue of Outlier Detection (Camacho-Collados and Navigli 2016) that we call OD3. This metric evaluates the degree to which third-order information is captured by a given word embedding. We demonstrated this by showing that our tensor-based techniques, which naturally encode third-order information, perform better at OD3 than existing second-order models.
4. Word vector multiplicative compositionality for polysemous word representation. We showed that our word vectors can be meaningfully composed multiplicatively to create a "meaning vector" for each different sense of a polysemous word. This property is a consequence of the higher-order information used to train our embeddings, and was empirically shown to be unique to our tensor-based embeddings.

Tensor factorization appears to be a highly applicable and effective tool for learning word embeddings, with many areas of potential future work. Leveraging higher-order data in training word embeddings is useful for encoding new types of information and semantic relationships compared to models that are trained using only pairwise data. This indicates that such techniques will prove useful for training word embeddings to be used in downstream NLP tasks.

References

[Acar, Kolda, and Dunlavy 2011] Acar, E.; Kolda, T. G.; and Dunlavy, D. M. 2011. All-at-once optimization for coupled matrix and tensor factorizations. CoRR abs/1105.3422.
[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
[Bengio et al. 2003] Bengio, Y.; Ducharme, R.; Vincent, P.; and Janvin, C. 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3:1137-1155.
[Blacoe and Lapata 2012] Blacoe, W., and Lapata, M. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, 546-556. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Blair, Merhav, and Barry 2016] Blair, P.; Merhav, Y.; and Barry, J. 2016. Automated generation of multilingual clusters for the evaluation of distributed representations. CoRR abs/1611.01547.
[Bruni, Tran, and Baroni 2014] Bruni, E.; Tran, N. K.; and Baroni, M. 2014. Multimodal distributional semantics. J. Artif. Int. Res. 49(1):1-47.
[Bullinaria and Levy 2007] Bullinaria, J. A., and Levy, J. P. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods 39(3):510-526.
[Camacho-Collados and Navigli 2016] Camacho-Collados, J., and Navigli, R. 2016. Find the word that does not belong: A framework for an intrinsic evaluation of word vector representations. In ACL Workshop on Evaluating Vector Space Representations for NLP, 43-50. Association for Computational Linguistics.
[Comon et al. 2008] Comon, P.; Golub, G.; Lim, L.-H.; and Mourrain, B. 2008. Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis and Applications 30(3):1254-1279.
[Comon, Qi, and Usevich 2015] Comon, P.; Qi, Y.; and Usevich, K. 2015. A polynomial formulation for joint decomposition of symmetric tensors of different orders. In LVA-ICA 2015, volume 9237 of Lecture Notes in Computer Science. Liberec, Czech Republic: Springer. Special session on tensors. hal-01168992.
[Faruqui et al. 2016] Faruqui, M.; Tsvetkov, Y.; Rastogi, P.; and Dyer, C. 2016. Problems with evaluation of word embeddings using word similarity tasks. CoRR abs/1605.02276.
[Harris 1954] Harris, Z. 1954. Distributional structure. Word 10(23):146-162.
[Håstad 1990] Håstad, J. 1990. Tensor rank is NP-complete. Journal of Algorithms 11(4):644-654.
[Jastrzebski, Lesniak, and Czarnecki 2017] Jastrzebski, S.; Lesniak, D.; and Czarnecki, W. M. 2017. How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks.
[Kim 2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. CoRR abs/1408.5882.
[Kingma and Ba 2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
[Kolda and Bader 2009] Kolda, T. G., and Bader, B. W. 2009. Tensor decompositions and applications. SIAM Review 51(3):455-500.
[Levy and Goldberg 2014] Levy, O., and Goldberg, Y. 2014. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, 2177-2185. Cambridge, MA, USA: MIT Press.
[Lim 2005] Lim, L.-H. 2005. Optimal solutions to non-negative PARAFAC/multilinear NMF always exist.
[Maas et al. 2011] Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, 142-150. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Maehara, Hayashi, and Kawarabayashi 2016] Maehara, T.; Hayashi, K.; and Kawarabayashi, K.-i. 2016. Expected tensor decomposition with stochastic gradient descent. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, 1919-1925. AAAI Press.
[Mikolov et al. 2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. CoRR abs/1310.4546.
[Mitchell and Lapata 2010] Mitchell, J., and Lapata, M. 2010. Composition in distributional models of semantics. Cognitive Science 34(8):1388-1429.
[Mu, Hsu, and Goldfarb 2015] Mu, C.; Hsu, D.; and Goldfarb, D. 2015. Successive rank-one approximations for nearly orthogonally decomposable symmetric tensors. SIAM Journal on Matrix Analysis and Applications 36(4):1638-1659.
[Murphy, Talukdar, and Mitchell 2012] Murphy, B.; Talukdar, P. P.; and Mitchell, T. M. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Kay, M., and Boitet, C., eds., COLING, 1933-1950. Indian Institute of Technology Bombay.
[Naskovska and Haardt 2016] Naskovska, K., and Haardt, M. 2016. Extension of the semi-algebraic framework for approximate CP decompositions via simultaneous matrix diagonalization to the efficient calculation of coupled CP decompositions. In 50th Asilomar Conference on Signals, Systems and Computers, ACSSC 2016, Pacific Grove, CA, USA, November 6-9, 2016, 1728-1732.
[Oseledets 2011] Oseledets, I. V. 2011. Tensor-train decomposition. SIAM J. Sci. Comput. 33(5):2295-2317.
[Pennington, Socher, and Manning 2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
[Radinsky et al. 2011] Radinsky, K.; Agichtein, E.; Gabrilovich, E.; and Markovitch, S. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, 337-346. New York, NY, USA: ACM.
[Salle, Villavicencio, and Idiart 2016] Salle, A.; Villavicencio, A.; and Idiart, M. 2016. Matrix factorization using window sampling and negative sampling for improved word representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers.
[Schnabel et al. 2015] Schnabel, T.; Labutov, I.; Mimno, D. M.; and Joachims, T. 2015. Evaluation methods for unsupervised word embeddings. In Màrquez, L.; Callison-Burch, C.; Su, J.; Pighin, D.; and Marton, Y., eds., EMNLP, 298-307. The Association for Computational Linguistics.
[Sharan and Valiant 2017] Sharan, V., and Valiant, G. 2017. Orthogonalized ALS: A theoretically principled tensor decomposition algorithm for practical use. arXiv preprint arXiv:1703.01804.
[Van de Cruys, Poibeau, and Korhonen 2013] Van de Cruys, T.; Poibeau, T.; and Korhonen, A. 2013. A tensor-based factorization model of semantic compositionality. In Proc. of NAACL-HLT, 1142-1151.
[Van de Cruys 2009] Van de Cruys, T. 2009. A non-negative tensor factorization model for selectional preference induction. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, GEMS '09, 83-90. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Van de Cruys 2011] Van de Cruys, T. 2011. Two multivariate generalizations of pointwise mutual information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, DiSCo '11, 16-20. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Wang and Qi 2007] Wang, Y., and Qi, L. 2007. On the successive supersymmetric rank-1 decomposition of higher-order supersymmetric tensors. Numerical Lin. Alg. with Applic. 14(6):503-519.