Affinity Weighted Embedding


Authors: Jason Weston (Google Inc., New York, NY, USA; jweston@google.com), Ron Weiss (Google Inc., New York, NY, USA; ronw@google.com), Hector Yee (Google Inc., San Bruno, CA, USA; hyee@google.com)

Abstract

Supervised (linear) embedding models like Wsabie [5] and PSI [1] have proven successful at ranking, recommendation and annotation tasks. However, despite being scalable to large datasets, they do not take full advantage of the extra data due to their linear nature, and typically underfit. We propose a new class of models which aims to provide improved performance while retaining many of the benefits of the existing class of embedding models. Our new approach works by iteratively learning a linear embedding model where the next iteration's features and labels are reweighted as a function of the previous iteration. We describe several variants of the family, and give some initial results.

1 (Supervised) Linear Embedding Models

Standard linear embedding models are of the form:

    f(x, y) = x^T U^T V y = \sum_{ij} x_i U_i^T V_j y_j,

where x are the input features and y is a possible label (in the annotation case), document (in the information retrieval case) or item (in the recommendation case). These models are used in both supervised and unsupervised settings. In the supervised ranking case, they have proved successful in many of the tasks described above; e.g. the Wsabie algorithm [5, 4, 6], which approximately optimizes precision at the top of the ranked list, has proven useful for annotation and recommendation. These methods scale well to large data and are simple to implement and use. However, as they contain no nonlinearities (other than in the feature representation in x and y), they can be limited in their ability to fit large complex datasets, and in our experience typically underfit.
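As an illustration, the bilinear score above is just a dot product between the embedded input Ux and the embedded label Vy. The following sketch uses made-up dimensions and random parameters (it is not tied to any of the cited systems):

```python
import numpy as np

# Illustrative sketch with invented dimensions: scoring with a linear
# embedding model f(x, y) = x^T U^T V y.
d_x, d_y, d_emb = 1000, 500, 100   # input dim, label dim, embedding dim
rng = np.random.default_rng(0)
U = rng.normal(size=(d_emb, d_x))  # maps input features into the embedding
V = rng.normal(size=(d_emb, d_y))  # maps label features into the embedding

def score(x, y):
    """f(x, y) = x^T U^T V y, i.e. the dot product <Ux, Vy>."""
    return (U @ x) @ (V @ y)

x = rng.normal(size=d_x)  # input feature vector
y = rng.normal(size=d_y)  # candidate label/item feature vector
s = score(x, y)           # equals sum_ij x_i (U_i^T V_j) y_j
```

In practice, ranking amounts to embedding x once and scoring every candidate y by a dot product, which is part of why these models scale to large label sets.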
2 Affinity Weighted Embedding Models

In this work we propose the following generalized embedding model:

    f(x, y) = \sum_{ij} G_{ij}(x, y) x_i U_i^T V_j y_j,

where G is a function, built from a previous learning step, that measures the affinity between two points. Given a pair x, y and feature indices i and j, G returns a scalar; large values of the scalar indicate a high degree of match. Different methods of learning (or choosing) G lead to different variants of our proposed approach:

- G_{ij}(x, y) = G(x, y). In this case each feature index pair i, j returns the same scalar, so the model reduces to: f(x, y) = G(x, y) x^T U^T V y.

- G_{ij}(x, y) = G_{ij}. In this case the returned scalar for i, j is the same independent of the input vector x and label y, i.e. it is a reweighting of the feature pairs. This gives the model: f(x, y) = \sum_{ij} G_{ij} x_i U_i^T V_j y_j. This is likely only useful in large sparse feature spaces, e.g. if G_{ij} represents the weight of a word pair in an information retrieval task or an item pair in a recommendation task. Further, it is possible that G_{ij} could take a particular form, e.g. it is represented as a low-rank matrix G_{ij} = g_i^T g_j. In that case we have the model f(x, y) = \sum_{ij} g_i^T g_j x_i U_i^T V_j y_j.

While it may be possible to learn the parameters of G jointly with U and V, here we advocate an iterative approach:

1. Train a standard embedding model: f(x, y) = x^T U^T V y.
2. Build G using the representation learnt in (1).
3. Train a weighted model: f(x, y) = \sum_{ij} G_{ij}(x, y) x_i \bar{U}_i^T \bar{V}_j y_j.
4. Possibly repeat the procedure further: build \bar{G} from (3). (So far we have not tried this.)

Note that the training algorithm used for (3) is the same as for (1); we only change the model.
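To make the three steps concrete, here is a hypothetical NumPy sketch of the G_{ij}(x, y) = G(x, y) variant, using a clipped Gaussian nearest-neighbor affinity in the step-1 embedding space as G (the choice the note goes on to use). All data, dimensions, and the value of lambda_x are invented, labels are treated as one-hot, a brute-force neighbor search stands in for the authors' MapReduce computation, and the actual embedding training loops are omitted:

```python
import numpy as np

# Hypothetical sketch of the iterative procedure; not the authors' code.
rng = np.random.default_rng(0)
n_train, d_x, d_emb, n_labels, n_nbrs = 300, 40, 16, 25, 20
X = rng.normal(size=(n_train, d_x))               # training inputs
labels = rng.integers(0, n_labels, size=n_train)  # one label per example

# Step 1: a first-stage embedding U (stand-in for an actual Wsabie run).
U = rng.normal(size=(d_emb, d_x)) / np.sqrt(d_x)
E = X @ U.T                                       # training set mapped by U

# Step 2: build G from the step-1 representation. With lambda_y large the
# label term is 1 iff y_i = y, so each retained neighbor simply votes for
# its own label with weight exp(-lambda_x ||Ux - Ux_i||^2).
def label_weights(x, lam_x=0.5):                  # lam_x is an assumption
    d2 = np.sum((E - x @ U.T) ** 2, axis=1)       # squared embedding dists
    nbrs = np.argsort(d2)[:n_nbrs]                # clip: keep top-n only
    w = np.zeros(n_labels)
    for i in nbrs:
        w[labels[i]] += np.exp(-lam_x * d2[i])
    return w                                      # zero for all other labels

# Step 3: a reweighted model with fresh parameters U_bar, V_bar (their
# training is omitted; only the scoring form is shown). With one-hot
# labels, V_bar y selects column y of V_bar.
U_bar = rng.normal(size=(d_emb, d_x))
V_bar = rng.normal(size=(d_emb, n_labels))
def f(x, y):
    return label_weights(x)[y] * (U_bar @ x) @ V_bar[:, y]
```

Since G only reweights examples here, step 3 can reuse whatever training algorithm produced step 1 unchanged, as the note points out.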
In the following, we will focus on the G_{ij}(x, y) = G(x, y) case (where we only weight examples, not features) and a particular choice of G:^1

    G(x, y) = \sum_{i=1}^{m} \exp(-\lambda_x ||Ux - Ux_i||^2) \exp(-\lambda_y ||y - y_i||^2)    (1)

where the x_i and y_i are the vectors from the training set. G is built using the embedding U learnt in step (1), and is then used to build a new embedding model in step (3). Due to the iterative nature of the steps we can compute G for all examples in parallel using a MapReduce framework, and store the training set necessary for step (3), thus making learning straightforward.

To decrease storage, instead of computing a smooth G as above we can clip (sparsify) G by taking only the top n nearest neighbors to Ux and setting the rest to 0. Further, we take \lambda_y suitably large so that \exp(-\lambda_y ||y - y_i||^2) gives 1 for y_i = y and 0 otherwise.^2 In summary, then, for each training example we simply have to find the n nearest neighboring examples in the embedding space (n = 20 in our experiments) and then reweight their labels using eq. (1). (All other labels would then receive a weight of zero, although one could also add a constant bias to guarantee that those labels can receive non-zero final scores.)

^1 Although perhaps G(x, y) = \sum_{i=1}^{m} \exp(-\lambda_x ||Ux - Ux_i||^2) \exp(-\lambda_y ||Vy - Vy_i||^2) would be more natural. Further, we could also consider G_orig(x, y) = \sum_{i=1}^{m} \exp(-\lambda_x ||x - x_i||^2) \exp(-\lambda_y ||y - y_i||^2), which does not make use of the embedding from step (1) at all. This would likely perform poorly when the input features are too sparse, which is precisely the point of improving the representation by learning it with U and V.
^2 This is useful in the label annotation or item ranking settings, but would not be a good idea in an information retrieval setting.

3 Experiments

So far, we have conducted two preliminary experiments, on Magnatagatune (annotating music with text tags) and ImageNet (annotating images with labels). Wsabie has been applied to both tasks previously [4, 5].

On Magnatagatune we used MFCC features for both Wsabie and our method, similar to those used in [4]. For both models we used an embedding dimension of 100. Our method improved over Wsabie marginally, as shown in Table 1. We speculate that this improvement is small due to the small size of the dataset (only 16,000 training examples, 104 input dimensions for the MFCCs and 160 unique tags). We believe our method will be more useful on larger tasks.

Table 1: Magnatagatune Results

  Algorithm                            Prec@1   Prec@3
  k-Nearest Neighbor                   39.4%    28.6%
  k-Nearest Neighbor (Wsabie space)    45.2%    31.9%
  Wsabie                               48.7%    37.5%
  Affinity Weighted Embedding          52.7%    39.2%

On the ImageNet task (Fall 2011: 10M examples, 474 KPCA features and 21k classes) the improvement over Wsabie is much larger, as shown in Table 2. We used similar KPCA features as in [5] for both Wsabie and our method, and an embedding dimension of 128 for both. We also compare to nearest neighbor in the embedding space. For our method, we used the max instead of the sum in eq. (1) as it gave better results. Our method is competitive with the convolutional neural network model of [2] (note that this is on a different train/test split). However, we believe the method of [3] would likely perform better still if applied in the same setting.

Table 2: ImageNet Results (Fall 2011, 21k labels)

  Algorithm                            Prec@1
  Wsabie (KPCA features)               9.2%
  k-Nearest Neighbor (Wsabie space)    13.7%
  Affinity Weighted Embedding          16.4%
  Convolutional Net [2]                15.6% (NOTE: on a different train/test split)

4 Conclusions

In conclusion, by incorporating a learnt reweighting function G into supervised linear embedding we can increase the capacity of the model, leading to improved results.
One issue, however, is that the cost of reducing underfitting by using G is that it increases both the storage and computational requirements of the model. One avenue we have begun exploring in that regard is the use of approximate methods to compute G.

References

[1] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, C. Cortes, and M. Mohri. Polynomial semantic indexing. In NIPS, 2009.
[2] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, A. Senior, P. Tucker, K. Yang, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25, pages 1232-1240, 2012.
[3] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106-1114, 2012.
[4] J. Weston, S. Bengio, and P. Hamel. Large-scale music annotation and retrieval: Learning to rank in joint semantic spaces. Journal of New Music Research, 2012.
[5] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Intl. Joint Conf. on Artificial Intelligence (IJCAI), pages 2764-2770, 2011.
[6] J. Weston, C. Wang, R. Weiss, and A. Berenzweig. Latent collaborative retrieval. In ICML, 2012.
