Low-dimensional Data Embedding via Robust Ranking
Ehsan Amid, University of California, Santa Cruz, CA 95064 (eamid@ucsc.edu)
Nikos Vlassis, Adobe Research, San Jose, CA 95113 (vlassis@adobe.com)
Manfred K. Warmuth, University of California, Santa Cruz, CA 95064 (manfred@ucsc.edu)

ABSTRACT

We describe a new method called t-ETE for finding a low-dimensional embedding of a set of objects in Euclidean space. We formulate the embedding problem as a joint ranking problem over a set of triplets, where each triplet captures the relative similarities between three objects in the set. By exploiting recent advances in robust ranking, t-ETE produces high-quality embeddings even in the presence of a significant amount of noise, and better preserves local scale than known methods such as t-STE and t-SNE. In particular, our method produces significantly better results than t-SNE on signature datasets while also being faster to compute.

KEYWORDS

Ranking, Triplet Embedding, Robust Losses, t-Exponential Distribution, Dimensionality Reduction, t-SNE

1 INTRODUCTION

Learning a metric embedding for a set of objects based on relative similarities is a central problem in human computation and crowdsourcing, with applications in a variety of fields such as recommender systems and psychological questionnaires. The relative similarities are usually provided in the form of triplets, where a triplet (i, j, k) expresses that "object i is more similar to object j than to object k"; the underlying similarity function may be unknown or not even quantified. The first object i is referred to as the query object, and objects j and k are the test objects. The triplets are typically gathered from human evaluators via a data-collecting mechanism such as Amazon Mechanical Turk (https://www.mturk.com). Constraints of this type have also been used as side information in semi-supervised metric learning [4, 8] and clustering [2].

Given a set of relative similarity comparisons on a set of objects, the goal of triplet embedding is to find a representation of the objects in some metric space such that the constraints induced by the triplets are satisfied as much as possible. In other words, the embedding should reflect the underlying similarity function from which the constraints were generated. Earlier methods for triplet embedding include Generalized Non-metric Multidimensional Scaling (GNMDS) [1], Crowd Kernel Learning (CKL) [14], and Stochastic Triplet Embedding (STE) together with its extension, t-distributed STE (t-STE) [15].

One major drawback of the previous methods for triplet embedding is that their performance can drop significantly when even a small amount of noise is introduced in the data. The noise may arise for different reasons. For instance, each human evaluator may use a different similarity function when comparing objects [3]; as a result, there may exist conflicting triplets with reversed test objects. Another type of noise is due to an insufficient number of degrees of freedom when mapping an intrinsically (and possibly hidden) high-dimensional representation to a lower-dimensional embedding. A simple example is mapping uniformly distributed points on a two-dimensional circle to a one-dimensional line: regardless of the embedding, the end points of the line will always violate some similarity constraints.

In this paper, we cast the triplet embedding problem as a joint ranking problem.
In any embedding, for each object i, the remaining objects are naturally ranked by their "distance" to i. The triplet (i, j, k) expresses that object j should be ranked higher than object k in the ranking of i. Therefore, triplet embedding can be viewed as mapping the objects into a Euclidean space so that the joint rankings belonging to all query objects are as consistent with the triplets as possible. To find the embedding, we define a loss for each triplet and minimize the sum of losses over all triplets. Initially, our triplet loss is unbounded. However, to make our method robust to noise, we apply a novel robust transformation (based on the generalized log function) that caps the triplet loss by a constant. Our new method, t-Exponential Triplet Embedding (t-ETE), inherits the heavy-tail properties of t-STE in producing high-quality embeddings, while being significantly more robust to noise than any other method. (The acronym t-STE is based on the Student-t distribution, where "t" is part of the name of the distribution; our method, t-ETE, is based on the t-exponential family, where t is a parameter of the model.)

Figure 1 illustrates embeddings of a subset of 6000 data points from the MNIST dataset using t-STE and our proposed method. The triplets are synthetically generated: for each point, one test object is sampled from its 20 nearest neighbors and the other from the points located far away (100 triplets per point). The two embeddings are very similar when there is no noise in the triplets (Figures 1(a) and 1(b)). However, after 'reversing' 20% of the triplets, t-STE fails to produce a meaningful embedding (Figure 1(c)) while t-ETE is almost unaffected by the noise (Figure 1(d)).

[Figure 1: Experiments on the MNIST dataset: noise-free triplets using (a) t-STE and (b) t-ETE, and triplets with 20% noise using (c) t-STE and (d) the proposed t-ETE.]

We also apply our t-ETE method to dimensionality reduction and develop a new technique that samples a subset of triplets in the high-dimensional space and finds the low-dimensional representation that satisfies the corresponding ranking. We quantify the importance of each triplet by a non-negative weight. We show that even a small, carefully chosen subset of triplets captures sufficient information about the local as well as the global structure of the data to produce high-quality embeddings. Our proposed method outperforms the commonly used t-SNE [9] for dimensionality reduction in many cases while having a much lower complexity.

2 TRIPLET EMBEDDING VIA RANKING

In this section we formally define the triplet embedding problem. Let I = {1, 2, ..., N} denote a set of objects, and suppose that the feature (metric) representation of these objects is unknown. However, some information about the relative similarities of these objects is available in the form of triplets. A triplet (i, j, k) is an ordered tuple that represents a constraint on the relative similarities of the objects i, j, and k, of the type "object i is more similar to object j than to object k". Let T = {(i, j, k)} denote the set of triplets available for the set of objects I.
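As a concrete illustration of how such a triplet set T can be constructed, the following is a minimal sketch of the synthetic protocol used for the Figure 1 experiment (nearest-neighbor test objects, far-away outliers, and a fraction of 'reversed' triplets as noise). The function name, the defaults, and the use of scikit-learn's NearestNeighbors are our own illustration, not part of the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def synthetic_triplets(X, n_neighbors=20, per_point=100, noise=0.0, rng=None):
    """Generate triplets as in the Figure 1 experiment: for each query i,
    draw j from i's n_neighbors nearest neighbors and k from the remaining
    (far-away) points, then 'reverse' a fraction of triplets as noise."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self-neighbor
    triplets = []
    for i in range(n):
        far = np.setdiff1d(np.arange(n), np.append(idx[i], i))
        for _ in range(per_point):
            triplets.append((i, rng.choice(idx[i]), rng.choice(far)))
    triplets = np.asarray(triplets)
    # Inject noise by swapping the test objects j and k of a random subset.
    flip = rng.random(len(triplets)) < noise
    js, ks = triplets[flip, 1].copy(), triplets[flip, 2].copy()
    triplets[flip, 1], triplets[flip, 2] = ks, js
    return triplets
```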
Given the set of triplets T, the triplet embedding problem amounts to finding a metric representation of the objects, Y = {y_1, y_2, ..., y_N}, such that the similarity constraints imposed by the triplets are satisfied as much as possible under a given distance function in the embedding. For instance, in the case of Euclidean distance, we want

    (i, j, k) \;\Rightarrow\; \|y_i - y_j\| < \|y_i - y_k\|, \quad \text{w.h.p.}    (1)

The reason we may not require all constraints to be satisfied in the embedding is that there may exist inconsistent and/or conflicting constraints among the triplets. This is a very common phenomenon when the triplets are collected from human evaluators via crowdsourcing [3, 16].

We can consider the triplet embedding problem as a ranking problem imposed by the set of constraints T. More specifically, each triplet (i, j, k) can be seen as a partial ranking result: for a query over i, we are given two results, namely j and k, and the triplet specifies that "the result j should have a relatively higher rank than k". In this setting, only the order of closeness of the test objects to the query object determines the ranking.

Let ℓ_ijk(Y) ∈ [0, ∞) be a non-negative loss associated with the triplet constraint (i, j, k). To reflect the ranking constraint, the loss ℓ_ijk(Y) should be a monotonically increasing function of the pairwise distance ‖y_i − y_j‖ and a monotonically decreasing function of ‖y_i − y_k‖. These properties ensure that ℓ_ijk(Y) → 0 whenever ‖y_i − y_j‖ → 0 and ‖y_i − y_k‖ → ∞. We can now define the triplet embedding problem as minimizing the sum of the ranking losses of the triplets in T, that is,

    \min_Y \mathcal{L}_T, \qquad \mathcal{L}_T = \sum_{(i,j,k) \in T} \ell_{ijk}(Y).    (2)

In this formulation, the individual loss of each triplet is unbounded. This means that when a subset of the constraints is corrupted by noise, the loss of even a single inconsistent triplet may dominate the total objective (2) and result in poor performance. To avoid this effect, we introduce a new robust transformation that caps the individual loss of each triplet from above by a constant. As we will see, the capping helps discount the noisy triplets and produces high-quality embeddings, even in the presence of a significant amount of noise.

3 ROBUST LOSS TRANSFORMATIONS

We first introduce the generalized log_t and exp_t functions, which generalize the standard log and exp functions, respectively. The generalized log_t function with temperature parameter 0 < t < 2 is defined as [10, 12]

    \log_t(x) = \begin{cases} \log(x) & \text{if } t = 1 \\ (x^{1-t} - 1)/(1 - t) & \text{otherwise.} \end{cases}    (3)

Note that log_t is concave and non-decreasing, and recovers the log function in the limit t → 1. The exp_t function is defined as the inverse of the log_t function:

    \exp_t(x) = \begin{cases} \exp(x) & \text{if } t = 1 \\ [1 + (1-t)\,x]_+^{1/(1-t)} & \text{otherwise,} \end{cases}    (4)

where [·]_+ = max(0, ·). Similarly, the standard exp is recovered in the limit t → 1. Figures 2(a) and 2(b) illustrate the exp_t and log_t functions for several values of t. One major difference from the standard exp and log functions is that the familiar distributive properties do not hold in general: exp_t(ab) ≠ exp_t(a) exp_t(b) and log_t(ab) ≠ log_t(a) + log_t(b). An important property of exp_t is that, for 1 < t < 2, it decays to zero more slowly than exp. This motivates defining heavy-tailed distributions using the exp_t function.
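As a quick illustration, here is a minimal NumPy sketch of log_t and exp_t as defined in (3) and (4); the function names are ours.

```python
import numpy as np

def log_t(x, t):
    """Generalized logarithm (3); recovers np.log(x) in the limit t -> 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Generalized exponential (4), the inverse of log_t; [.]_+ = max(0, .).
    For 1 < t < 2 and x <= 0 (the regime used below) the bracket stays >= 1,
    so no clipping occurs and the result is strictly positive."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

# Heavy tail for 1 < t < 2: exp_t(-25, 1.5) ~ 5.5e-3 while np.exp(-25) ~ 1.4e-11.
```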
The t-exponential family of distributions is defined as a generalization of the exponential family, obtained by using the exp_t function in place of the standard exp function [11, 13].

Our main focus is the capping property of the log_t function: for values x > 1, the log_t function with t > 1 grows more slowly than the log function and reaches the constant value 1/(t − 1) in the limit x → ∞.

[Figure 2: Generalized exp and log functions: (a) the exp_t function and (b) the log_t function for several values of 0 < t < 2. For t = 1, the two functions reduce to the standard exp and log functions, respectively.]

This idea can be used to define the following robust transformation of a non-negative unbounded loss ℓ:

    \rho_t(\ell) = \log_t(1 + \ell), \quad 1 < t < 2.    (5)

Note that ρ_t(0) = 0, as desired. Moreover, the derivative of the transformed loss satisfies ρ′_t(ℓ) → 0 as ℓ → ∞, and the transformed loss itself converges to a constant: ρ_t(ℓ) → 1/(t − 1) as ℓ → ∞. We will use this transformation to develop a robust ranking approach to triplet embedding in the presence of noise in the set of constraints. Finally, note that setting t = 1 yields the transformation

    \rho_1(\ell) = \log(1 + \ell),    (6)

which has been used for robust binary ranking in [17]. Although ρ_1(ℓ) grows more slowly than ℓ, we still have ρ_1(ℓ) → ∞ as ℓ → ∞; in other words, the transformed loss is not capped from above. We will show that this transformation is not sufficient for robustness to noise.

4 t-EXPONENTIAL TRIPLET EMBEDDING

Building on our discussion of the heavy-tailed properties of the generalized exp function (4), we define the ratio

    \ell^{(t')}_{ijk}(Y) = \frac{\exp_{t'}(-\|y_i - y_k\|^2)}{\exp_{t'}(-\|y_i - y_j\|^2)},    (7)

with 1 < t′ < 2, as the ranking loss associated with the triplet (i, j, k). This loss is non-negative and satisfies the properties of a valid ranking loss discussed earlier. Due to the heavy tail of the exp_{t′} function with 1 < t′ < 2, the loss (7) encourages a relatively higher satisfaction of the ranking than, e.g., the standard exp function. Defining the loss of each triplet (i, j, k) ∈ T as the ranking loss (7), we formulate the objective of the triplet embedding problem as minimizing the sum of the robustly transformed individual losses, that is,

    \min_Y C_T, \qquad C_T = \sum_{(i,j,k) \in T} \log_t\big(1 + \ell^{(t')}_{ijk}(Y)\big),    (8)

in which 1 < t < 2. We call our method t-Exponential Triplet Embedding (t-ETE, for short). Note that the loss of each triplet in the summation is now capped from above by 1/(t − 1).
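A minimal sketch of the capped objective (8), reusing the log_t/exp_t helpers from the previous sketch; the function names and the default temperatures are our own choices.

```python
import numpy as np  # assumes log_t and exp_t from the earlier sketch are in scope

def triplet_loss(Y, i, j, k, t_prime):
    """Ranking loss (7); small when y_j is much closer to y_i than y_k is.
    For 1 < t_prime < 2 the denominator is strictly positive, so the ratio
    is always well defined."""
    d2_ij = np.sum((Y[i] - Y[j]) ** 2)
    d2_ik = np.sum((Y[i] - Y[k]) ** 2)
    return exp_t(-d2_ik, t_prime) / exp_t(-d2_ij, t_prime)

def t_ete_objective(Y, triplets, t=1.7, t_prime=1.7):
    """Capped objective (8); each summand is bounded above by 1/(t - 1)."""
    return sum(log_t(1.0 + triplet_loss(Y, i, j, k, t_prime), t)
               for i, j, k in triplets)
```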
The gradient of the objective (8) with respect to the positions of the objects Y,

    \nabla C_T = \sum_{(i,j,k) \in T} \frac{1}{\big(1 + \ell^{(t')}_{ijk}(Y)\big)^t} \, \nabla \ell^{(t')}_{ijk}(Y),    (9)

includes forgetting factors 1/(1 + ℓ^{(t′)}_ijk(Y))^t that damp the effect of the triplets that are highly unsatisfied.

5 CONNECTION TO PREVIOUS METHODS

By setting t = 1, we can use the property log(a) = −log(1/a) of the standard log function (note that log_t(a) ≠ −log_t(1/a) in general) to write the objective (8) as the following equivalent maximization problem:

    \max_Y \sum_{(i,j,k) \in T} \log p^{(t')}_{ijk},    (10)

where

    p^{(t')}_{ijk} = \frac{\exp_{t'}(-\|y_i - y_j\|^2)}{\exp_{t'}(-\|y_i - y_j\|^2) + \exp_{t'}(-\|y_i - y_k\|^2)}    (11)

is defined as the probability that the triplet (i, j, k) is satisfied. Setting t′ = 1 and t′ = 2 recovers the STE and t-STE (with α = 1) formulations, respectively; the Student-t distribution with α degrees of freedom can be written as a t-exponential distribution with −(α + 1)/2 = 1/(1 − t) (see [5]). STE and t-STE aim to maximize the joint probability that the triplets T are satisfied in the embedding Y. Their poor performance in the presence of noise can be explained by the fact that the log-satisfaction probabilities of the triplets are not capped (see (6)); in this case, the probabilities would need to be capped from below. Consequently, the low satisfaction probability of even a single noisy triplet can dominate the objective (10) and result in poor performance.
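For comparison, a sketch of the satisfaction probability (11); the helper name is ours, and exp_t from the earlier sketch is assumed to be in scope.

```python
import numpy as np  # assumes exp_t from the earlier sketch is in scope

def triplet_probability(Y, i, j, k, t_prime):
    """Satisfaction probability (11) of triplet (i, j, k) in embedding Y."""
    e_j = exp_t(-np.sum((Y[i] - Y[j]) ** 2), t_prime)
    e_k = exp_t(-np.sum((Y[i] - Y[k]) ** 2), t_prime)
    return e_j / (e_j + e_k)

# t_prime = 1 gives STE's Gaussian kernel exp(-d^2); t_prime = 2 gives t-STE's
# Student-t kernel 1/(1 + d^2) with alpha = 1. Note 1 + triplet_loss = 1/p_ijk,
# so for t = 1 minimizing (8) coincides with maximizing (10). (With t_prime = 1,
# exp underflows for large distances; the heavy-tailed regime avoids this.)
```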
6 APPLICATIONS TO DIMENSIONALITY REDUCTION

Now consider the case where a high-dimensional representation X = {x_i}_{i=1}^n is provided for a set of n objects. Having the t-ETE method at hand, one may ask the following question: "given the high-dimensional representation X of the objects, is it possible to find a lower-dimensional representation Y for these objects by satisfying a set of ranking constraints (i.e., triplets) formed from their relative similarities in the representation X?"

Note that the total number of triplets that can be formed on a set of n objects is O(n³), and trying to satisfy all possible triplets is computationally expensive. However, we argue that most of these triplets are redundant and carry the same information about the relative similarities of the objects. For instance, consider two triplets (i, j, k) and (i, j, k′) in which i and k are located far apart while k and k′ are neighbors of each other. Given (i, j, k), the triplet (i, j, k′) provides no extra information about the placement of i and j, as long as k and k′ remain close together in the embedding; in other words, i views k and k′ as almost the same object.

For each object i, the nearby objects, having relatively short distances to i, specify its local structure, whereas the far-away objects determine its global placement in the space. Therefore, for each query object i, we would like to consider triplets that (with high probability) preserve both the local and the global structure of the data. Following this reasoning, we emphasize local information by explicitly choosing the first test object among the nearest neighbors of the query object i; the global information is then preserved by a small number of objects sampled uniformly from those located farther away. This leads to the following procedure for sampling a set of informative triplets (a code sketch is given at the end of this section): for each object i, choose the first test object from the set of m nearest neighbors of i, and then sample the outlier object uniformly from the objects located farther from i than the first test object. This is equivalent to sampling a triplet uniformly at random, conditioned on the first test object being among the m nearest neighbors of i. We use an equal number of nearest neighbors and outliers for each point, which results in nm² triplets in total.

[Figure 4: Embedding of the Food dataset using (a) t-STE and (b) t-ETE (t = 2). No clear separation between clusters appears in (a), while in (b) three different clusters of food are evident: "Vegetables and Meals" (top), "Ice Creams and Desserts" (bottom left), and "Breads and Cookies" (bottom right).]

The original t-ETE formulation aims to satisfy every triplet to the same extent. This is reasonable when no side information about the strength of each constraint is available. Given the high-dimensional representation X, however, this assumption may be inaccurate: the ratio of the pairwise similarities specified by each triplet may vary significantly across triplets. To account for this variation, we introduce a weight for each triplet that reflects the extent to which the triplet should be satisfied. More formally, let ω_ijk ≥ 0 denote the weight associated with the triplet (i, j, k), and let W = {ω_ijk} denote the set of all triplet weights. Weighted t-ETE is then formulated as minimizing the sum of the weighted capped losses of the triplets, that is,

    \min_Y C_{W,T}, \qquad C_{W,T} = \sum_{(i,j,k) \in T} \omega_{ijk} \log_t\big(1 + \ell^{(t')}_{ijk}(Y)\big).    (12)

The t-ETE method is the special case of this weighted formulation in which all triplets have unit weights.
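Here is the promised sketch of the informative-triplet sampling step. Algorithm 1 below prints one sampled outlier per neighbor; to match the nm² triplet count stated in the text, this sketch assumes m outliers are drawn per neighbor. The dense distance matrix, function name, and defaults are our own simplifications.

```python
import numpy as np

def sample_informative_triplets(X, m=20, rng=None):
    """Sample n*m^2 informative triplets: for each object i, the first test
    object j ranges over i's m nearest neighbors, and for each j we draw m
    outliers k uniformly from {k : ||x_i - x_k|| > ||x_i - x_j||}."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Dense distances for clarity only; approximate m-NN search [7] avoids this.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn_idx = np.argsort(D, axis=1)[:, 1:m + 1]  # m nearest neighbors, self excluded
    triplets = []
    for i in range(n):
        for j in nn_idx[i]:
            farther = np.flatnonzero(D[i] > D[i, j])          # candidates beyond j
            ks = rng.choice(farther, size=m, replace=False)   # assumes n >> 2m
            triplets.append(np.column_stack(
                [np.full(m, i), np.full(m, j), ks]))
    return np.concatenate(triplets)
```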
Finally, to assign weights to the sampled triplets, we note that the loss ratio (7) is inversely proportional to how well a triplet is satisfied in the embedding. This suggests using the inverse loss ratios of the triplets in the high-dimensional space as the weights associated with the triplets. More formally, we set

    \omega_{ijk} = \frac{\exp(-\|x_i - x_j\|^2 / \sigma^2_{ij})}{\exp(-\|x_i - x_k\|^2 / \sigma^2_{ik})},    (13)

where σ²_ij = σ_i σ_j is a constant scaling factor for the pair (i, j). We set σ_i to the distance of i to its 10th nearest neighbor. This choice of scaling adaptively handles both the dense and the sparse regions of the data distribution. We use the standard exp function rather than exp_{t′} with 1 < t′ < 2 in order to place more emphasis on the distances of the objects in the high-dimensional space. In practice, dividing each weight by the maximum weight in W and then adding a constant positive bias γ > 0 to all weights improves the results. The pseudo-code for the full procedure is shown in Algorithm 1.

Algorithm 1: Weighted t-ETE Dimensionality Reduction
  Input: high-dimensional data X = {x_1, x_2, ..., x_n}, temperatures t and t′, embedding dimension d
  Output: Y = {y_1, y_2, ..., y_n}, where y_i ∈ R^d
  T ← {}, W ← {}
  for i = 1 to n do
    for j ∈ {m nearest neighbors of i} do
      sample k uniformly from {k : ‖x_i − x_k‖ > ‖x_i − x_j‖}
      compute the weight ω_ijk using (13)
      T ← T ∪ {(i, j, k)}; W ← W ∪ {ω_ijk}
    end for
  end for
  for all ω ∈ W: ω ← ω / max(W) + γ
  initialize Y to n points in R^d sampled from N(0, 10^{-3} I_{d×d})
  for r = 1 to #iterations do
    compute the gradient ∇C_{W,T} of (12)
    update Y ← Y − η ∇C_{W,T}
  end for

Note that sampling and weighting the triplets via (13) and computing the gradient of the loss require pairwise distances between only O(nm²) pairs of objects in the high-dimensional space (for instance, using efficient m-nearest-neighbor search methods such as [7]) or in the low-dimensional embedding. In many cases m² ≪ n, which yields a large computational advantage over the O(n²) complexity of t-SNE.
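Putting the pieces together, the sketch below walks through Algorithm 1 end to end. It reuses exp_t and sample_informative_triplets from the earlier sketches; the analytic per-triplet gradient is our own chain-rule derivation through (7) (using d exp_t(x)/dx = exp_t(x)^t), and the learning rate eta, iteration count, and dense distance matrix are simplifying assumptions rather than the paper's settings.

```python
import numpy as np

def weighted_t_ete(X, d=2, m=20, t=2.0, t_prime=2.0, gamma=0.01,
                   eta=1.0, n_iter=500, rng=None):
    """Sketch of Algorithm 1 (Weighted t-ETE dimensionality reduction)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    trip = sample_informative_triplets(X, m=m, rng=rng)
    i, j, k = trip[:, 0], trip[:, 1], trip[:, 2]

    # Triplet weights (13): sigma_ij^2 = sigma_i * sigma_j, with sigma_i the
    # distance from x_i to its 10th nearest neighbor. Dense D for clarity only.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.sort(D, axis=1)[:, 10]
    w = np.exp(-D[i, j] ** 2 / (sigma[i] * sigma[j])) / \
        np.exp(-D[i, k] ** 2 / (sigma[i] * sigma[k]))
    w = w / w.max() + gamma                            # normalize, add bias

    Y = rng.normal(scale=np.sqrt(1e-3), size=(n, d))   # N(0, 1e-3 I) init
    for _ in range(n_iter):
        yij, yik = Y[i] - Y[j], Y[i] - Y[k]
        u = exp_t(-np.sum(yik ** 2, axis=1), t_prime)  # numerator of (7)
        v = exp_t(-np.sum(yij ** 2, axis=1), t_prime)  # denominator of (7)
        damp = w * (1.0 + u / v) ** (-t)               # weighted forgetting factor, cf. (9)
        # Chain rule through (7): d ell/d d_ij^2 = u v^(t'-2), d ell/d d_ik^2 = -u^t'/v.
        g_ij = (damp * u * v ** (t_prime - 2.0))[:, None] * (2.0 * yij)
        g_ik = (damp * -(u ** t_prime) / v)[:, None] * (2.0 * yik)
        grad = np.zeros_like(Y)
        np.add.at(grad, i, g_ij + g_ik)
        np.add.at(grad, j, -g_ij)
        np.add.at(grad, k, -g_ik)
        Y -= eta * grad                                # plain gradient descent step
    return Y
```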
7 EXPERIMENTS

In this section we conduct experiments to evaluate the performance of t-ETE for triplet embedding, as well as the application of Weighted t-ETE to non-linear dimensionality reduction. In the first set of experiments, we compare t-ETE to the following triplet embedding methods: 1) GNMDS, 2) CKL, 3) STE, and 4) t-STE. We evaluate the generalization performance of the different methods by means of satisfying unseen triplets and the nearest-neighbor error, as well as their robustness to constraint noise. We also provide visualization results on two real-world datasets. Next, we apply the Weighted t-ETE method to non-linear dimensionality reduction and compare the results to the t-SNE method. The code for the (Weighted) t-ETE method, as well as for all the experiments, will be made publicly available upon acceptance.

7.1 Generalization and Nearest-Neighbor Error

We first evaluate the performance of the different methods in terms of generalization to unseen triplets as well as preservation of nearest-neighbor similarity. For this part of the experiments, we consider the MNIST Digits (1000 subsamples; http://yann.lecun.com/exdb/mnist/) and MIT Scenes (800 subsamples; http://people.csail.mit.edu/torralba/code/spatialenvelope/) datasets. The synthetic triplets are generated as described earlier (100 triplets per point). To evaluate generalization, we perform 10-fold cross-validation and report the fraction of held-out triplets that are unsatisfied as a function of the number of dimensions; this quantity indicates how well each method learns the underlying structure of the data. Additionally, we calculate the nearest-neighbor error as a function of the number of dimensions; this error measures how well the embedding captures the pairwise similarity of the objects based on relative comparisons. The results are shown in Figures 3(a)-3(b). As can be seen, t-ETE performs as well as, or better than, the best competing method on both generalization and nearest-neighbor error. This indicates that t-ETE successfully captures the underlying structure of the data and scales properly with the number of dimensions.

[Figure 3: Generalization and nearest-neighbor performance on MNIST (top row) and MIT Scenes (bottom row): (a) generalization error, (b) nearest-neighbor error, (c) generalization accuracy in the presence of noise, and (d) nearest-neighbor accuracy in the presence of noise. In all experiments we use t = t′. For the generalization and nearest-neighbor error experiments we start with t = 2 and use a smaller t as the number of dimensions increases (more degrees of freedom); for the noise experiments we set t = 1.7. Figures best viewed in color.]

7.2 Robustness to Noise

Next, we evaluate the robustness of the different methods to triplet noise.
To evaluate the performance, we generate for each dataset a separate test set with the same number of triplets as the training set. For each noise level, we randomly subsample a subset of the training triplets and reverse the order of their test objects. After computing the embedding, we evaluate the performance on the test set and report the fraction of test triplets that are satisfied, as well as the nearest-neighbor accuracy. The results are shown in Figures 3(c)-3(d). As can be seen, the performance of all the other methods starts to drop as soon as even a small amount of noise is added to the data. In contrast, t-ETE is so robust to triplet noise that its performance is almost unaffected for up to 15% noise. This verifies that t-ETE can be effectively applied to real-world datasets in which a large portion of the triplets may be corrupted by noise.

7.3 Visualization Results

We provide visualization results on the Food [16] and Music [6] datasets. Figures 4(a) and 4(b) illustrate the results on the Food dataset using t-STE and t-ETE (t = 2), respectively; the same initialization of the data points is used for both methods. As can be seen, no clear clusters are evident using t-STE, whereas t-ETE reveals three main clusters in the data: "Vegetables and Meals" (top), "Ice Creams and Desserts" (bottom left), and "Breads and Cookies" (bottom right).

The visualization of the Music dataset using t-ETE (t = 2) is shown in Figure 5; the result can be compared with the one obtained using t-STE (available at homepage.tudelft.nl/19j49/ste). The distribution of the artists and the neighborhood structure are similar for both methods, but more meaningful in some regions using t-ETE. This may be due to noise in the triplets, which were collected from human evaluators. Additionally, t-ETE yields a nearest-neighbor error of 0.52 on the data points, compared to 0.63 with t-STE.

[Figure 5: Results of the t-ETE algorithm (t = 2) on the Music dataset, with artists colored by genre (rock, metal, pop, dance, hiphop, jazz, country, reggae, other); compare with the corresponding t-STE result in [15]. The neighborhood structure is more meaningful in some regions than with t-STE.]

7.4 Dimensionality Reduction Results

We apply the weighted triplet embedding method to find a 2-dimensional visualization of the following datasets: 1) Wine (UCI repository), 2) Sphere (1000 uniform samples from the surface of a three-dimensional sphere; research.cs.aalto.fi/pml/software/dredviz/), 3) Swiss Roll (3000 subsamples; web.mit.edu/cocosci/isomap/datasets.html), 4) Faces (400 synthetic faces with different pose and lighting; web.mit.edu/cocosci/isomap/datasets.html), 5) COIL-20 (www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php), 6) MNIST (10,000 subsamples), and 7) USPS (11,000 images of handwritten digits; www.cs.nyu.edu/~roweis/data.html). We compare our results with those obtained using the t-SNE method. In all experiments we use m = 20 for our method (m = 10 for COIL-20) and bias γ = 0.01.

[Figure 6: Dimensionality reduction results using t-SNE (top) and Weighted t-ETE (bottom) on: (a) Wine, (b) Sphere, (c) Swiss Roll, (d) Faces, (e) COIL-20, (f) MNIST, (g) USPS, and (h) Letters datasets. We use t = t′ = 2 in all experiments. Figures best viewed in color.]

The results are shown in Figure 6. As can be seen, our method successfully preserves the underlying structure of the data and produces high-quality embeddings on all datasets, whether they have an underlying low-dimensional manifold (e.g., Swiss Roll) or clusters of points (e.g., USPS). On the other hand, in most cases t-SNE over-emphasizes the separation of the points and therefore tears up the manifold.
The same effect occurs for clusters, e.g., in the USPS dataset, where the t-SNE embedding forms multiple separated sub-clusters (for instance, the clusters of '3's, '7's, and '8's are each divided into several smaller sub-clusters). Our objective function also enjoys better convergence properties and converges to a good solution using simple gradient descent. This eliminates the need for more complex optimization tricks, such as momentum and early exaggeration, that are used in t-SNE.

8 CONCLUSION

We introduced a ranking approach for embedding a set of objects in a low-dimensional space, given a set of relative similarity constraints in the form of triplets. We showed that our method, t-ETE, is robust to high levels of noise in the triplets. We generalized our method to a weighted version that incorporates the importance of each triplet, and applied this weighted triplet embedding method to develop a new dimensionality reduction technique, which outperforms the commonly used t-SNE method in many cases while having a lower complexity and better convergence behavior.

REFERENCES

[1] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. 2007. Generalized Non-metric Multidimensional Scaling. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics. San Juan, Puerto Rico.
[2] Ehsan Amid, Aristides Gionis, and Antti Ukkonen. 2015. A kernel-learning approach to semi-supervised clustering with relative distance comparisons. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 219-234.
[3] Ehsan Amid and Antti Ukkonen. 2015. Multiview Triplet Embedding: Learning Attributes in Multiple Maps. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 1472-1480. http://jmlr.org/proceedings/papers/v37/amid15.pdf
[4] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. 2007. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning. ACM, 209-216.
[5] Nan Ding and S. V. N. Vishwanathan. 2010. t-Logistic Regression. In Proceedings of the 23rd International Conference on Neural Information Processing Systems (NIPS'10). Cambridge, MA, USA, 514-522.
[6] Daniel P. W. Ellis, Brian Whitman, Adam Berenzweig, and Steve Lawrence. 2002. The Quest for Ground Truth in Musical Artist Similarity. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR '02). Paris, France, 170-177.
[7] Ville Hyvönen, Teemu Pitkänen, Sotiris Tasoulis, Elias Jääsaari, Risto Tuomainen, Liang Wang, Jukka Corander, and Teemu Roos. 2015. Fast k-nn search. arXiv preprint arXiv:1509.06957 (2015).
[8] Eric Yi Liu, Zhishan Guo, Xiang Zhang, Vladimir Jojic, and Wei Wang. 2012. Metric learning from relative comparisons by minimizing squared residual. In 2012 IEEE 12th International Conference on Data Mining. IEEE, 978-983.
[9] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov 2008), 2579-2605.
[10] Jan Naudts. 2002. Deformed exponentials and logarithms in generalized thermostatistics. Physica A 316 (2002), 323-334. http://arxiv.org/pdf/cond-mat/0203489
[11] Jan Naudts. 2004. Estimators, escort probabilities, and phi-exponential families in statistical physics. Journal of Inequalities in Pure and Applied Mathematics 5, 4 (2004), 102.
[12] Jan Naudts. 2004. Generalized thermostatistics based on deformed exponential and logarithmic functions. Physica A 340 (2004), 32-40.
[13] Timothy Sears. 2010. Generalized Maximum Entropy, Convexity and Machine Learning. Ph.D. Dissertation. The Australian National University.
[14] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam T. Kalai. 2011. Adaptively Learning the Crowd Kernel. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
[15] L. van der Maaten and K. Weinberger. 2012. Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing. 1-6. DOI: http://dx.doi.org/10.1109/MLSP.2012.6349720
[16] Michael Wilber, Sam Kwak, and Serge Belongie. 2014. Cost-Effective HITs for Relative Similarity Comparisons. In Human Computation and Crowdsourcing (HCOMP). Pittsburgh.
[17] Hyokun Yun, Parameswaran Raman, and S. V. N. Vishwanathan. 2014. Ranking via Robust Binary Classification. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14). Cambridge, MA, USA, 2582-2590. http://dl.acm.org/citation.cfm?id=2969033.2969115