Unsupervised Ranking of Multi-Attribute Objects Based on Principal Curves


Authors: Chun-Guo Li, Xing Mei, Bao-Gang Hu

Chun-Guo Li, Xing Mei, and Bao-Gang Hu, Senior Member, IEEE

Abstract—Unsupervised ranking faces one critical challenge in evaluation applications, that is, no ground truth is available. While PageRank and its variants offer a good solution in related subjects, they are applicable only for ranking from link-structure data. In this work, we focus on unsupervised ranking from multi-attribute data, which is also common in evaluation tasks. To overcome the challenge, we propose five essential meta-rules for the design and assessment of unsupervised ranking approaches: scale and translation invariance, strict monotonicity, linear/nonlinear capacities, smoothness, and explicitness of parameter size. These meta-rules are regarded as high-level knowledge for unsupervised ranking tasks. Inspired by the works in [8] and [14], we propose a ranking principal curve (RPC) model, which learns a one-dimensional manifold function to perform unsupervised ranking tasks on multi-attribute observations. Furthermore, the RPC is modeled as a cubic Bézier curve with control points restricted to the interior of a hypercube, thereby complying with all five meta-rules to infer a reasonable ranking list. With the control points as the model parameters, one is able to understand the learned manifold and to interpret the ranking list semantically. Numerical experiments with the presented RPC model are conducted on two open datasets from different ranking applications. In comparison with state-of-the-art approaches, the new model produces more reasonable ranking lists.

Index Terms—Unsupervised ranking, multi-attribute, strict monotonicity, smoothness, data skeleton, principal curves, Bézier curves.
1 INTRODUCTION

From the viewpoint of machine learning, ranking can be performed in either a supervised or an unsupervised way, as shown in the hierarchical structure in Fig. 1. While supervised ranking [1] is able to evaluate ranking performance against the given ground truth, unsupervised ranking is more challenging because no ground-truth label is available. Modelers or users then encounter a more difficult issue: "How can we ensure that the ranking list from unsupervised ranking is reasonable or proper?"

From the viewpoint of the given data types, ranking approaches can be further divided into two categories: ranking based on link structure and ranking based on multi-attribute data. PageRank [2] is one of the representative unsupervised approaches for ranking items that have a linking network (e.g., websites). But PageRank and its variants do not work for ranking candidates that have no links. In this paper, we focus on unsupervised ranking approaches for a set of objects with multi-attribute numerical observations.

To rank multi-attribute objects, a weighted summation of attributes is widely used to provide a scalar score for each object. But different weight assignments give different ranking lists, so the ranking results are not convincing enough. The first principal component analysis (PCA) provides a weight-learning approach [5], by which the score for each object is determined by its principal component on the skeleton of the data distribution.

• C.-G. Li, X. Mei and B.-G. Hu are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 ZhongGuanCun East Road, Beijing 100190, P.R. China. Email: cgli@nlpr.ia.ac.cn, hubg@nlpr.ia.ac.cn
• C.-G. Li is also with the Faculty of Mathematics and Computer Science, Hebei University, 180 Wusi East Road, Baoding, Hebei 071002, P.R. China.
However, it encounters problems when the data distribution is nonlinearly shaped. Although kernel PCA [5] has been proposed to attack this problem, the mapping to the kernel space is not order-preserving, which is the basic requirement for a ranking function. Neither dimension reduction methods [6] nor vector quantization [9] can assign scores to multi-attribute observations.

As the nonlinear extension of the first PCA, principal curves can be used to perform a ranking task [8], [10]. A principal curve provides an ordering of data points through the ordering of their projected points along the curve (illustrated by Fig. 2), and can thus be regarded as the "ranking skeleton". However, not all principal curve models are capable of performing a ranking task. A polyline approximation of a principal curve [11] fails to provide a consistent ranking rule due to non-smoothness at the connecting points, and it fails to guarantee order preservation. Order preservation cannot be guaranteed either by a general principal curve model (e.g., [19]) that is not designed specially for ranking tasks. The problem can be tackled by the constraint of strict monotonicity, one of the constraints we present for ranking functions in this paper. Example 1 shows that strict monotonicity is a necessary condition for a ranking function, yet it was neglected by all other investigations.

Fig. 1. Hierarchical diagram of ranking approaches. RPC is an unsupervised ranking approach based on multi-attribute observations of objects.

Fig. 2. Examples of a monotonicity property for ranking with principal curves. (a) Polyline approximation (non-strict monotonicity). (b) A general principal curve (non-monotonicity).

Example 1. Suppose we want to evaluate the life qualities of countries with a principal curve based on two attributes: LEB¹ and GDP². Each country is a data point in the two-dimensional plane of LEB and GDP.
If the principal curve is approximated by a polyline as in Fig. 2(a), the piece along the horizontal line is not strictly monotone. It gives the same ranking solution for x_1 = (58, 1.4) and x_2 = (58, 16.2), but x_2 should be ranked higher than x_1. For a general principal curve like the one in Fig. 2(b), which is not monotone, two pairs of points are ordered unreasonably. The pair x_3 = (74, 40.2) and x_4 = (82, 40.2) are put in the same place of the ranking list, since they are projected to the same point, where the tangent line to the curve is vertical. But x_4 should be ranked higher for its higher LEB than x_3. Another pair, x_5 = (75, 62.5) and x_6 = (81, 64.8), are also put in the same place, but apparently x_6 should be ranked higher than x_5. With strict monotonicity, these points would be put in the order they deserve.

Following the principle of "let the data speak for themselves" [12], this work attacks the problems of unsupervised ranking of multi-attribute objects with principal curves. First, ranking performance is taken into account in the design of ranking functions. It is known that knowledge of a given task can always improve learning performance [13]. The reason why PageRank produces a commonly acceptable search result for a query lies in the fact that the PageRank algorithm is designed by integrating knowledge about backlinks [2]. For multi-attribute objects with no linking networks, knowledge about ranking functions can likewise be taken into account to make ranking functions produce reasonable ranking lists. In this work, we present five essential meta-rules for ranking rules (Fig. 3). These meta-rules are capable of assessing the reasonability of ranking lists for unsupervised ranking.

1. Life Expectancy at Birth, years.
2. Gross Domestic Product per capita by Purchasing Power Parities, K$/person.

Fig. 3. Motivation of the RPC model for unsupervised ranking.
Second, principal curves should be modeled so as to be able to serve as ranking functions. As noted in [8], ranking with a principal curve is performed on the learned skeleton of the data distribution. But not all principal curve models are capable of producing reasonable ranking lists when no ranking knowledge is embedded into the principal curve model. Motivated by [14], the principal curve can be parametrically designed as a cubic Bézier curve. We will show in Section 4 that the parameterized principal curve satisfies all five meta-rules under constraints on its control points, and that its existence and the convergence of its learning algorithm are proved theoretically. Therefore, the parameterized principal curve is capable of producing a reasonable ranking list.

The following points highlight the main contributions of this paper:
• We propose five meta-rules for unsupervised ranking, which serve as high-level guidance in the design and assessment of unsupervised ranking approaches for multi-attribute objects. We justify that the five meta-rules are essential in applications, although, unfortunately, some or all of them were overlooked by most ranking approaches.
• A ranking principal curve (RPC) model is presented for unsupervised ranking from multi-attribute numerical observations of objects, in contrast to PageRank, which ranks from link structure [2]. The presented model satisfies all five meta-rules for ranking tasks, while other existing approaches [8] overlooked them.
• We develop the RPC learning algorithm and theoretically prove the existence of an RPC and the convergence of the learning algorithm for given multi-attribute objects. With the RPC learning algorithm, reasonable ranking lists for openly accessible data illustrate the good performance of the proposed unsupervised ranking approach.

1.1 Related Works

Domain knowledge can be integrated into learning models to improve learning performance.
By coupling domain knowledge as prior information with network constructions, Hu et al. [13] and Daniels et al. [15] improved the prediction accuracy of neural networks. Recently, monotonicity was taken into consideration as a constraint by Kotłowski et al. [16] to improve ordinal classification performance. For unsupervised ranking, the domain knowledge of monotonicity can also be taken into account, and it is capable of assessing the ranking performance, other than the evaluation of side-effects [17].

Ranking on manifolds has provided a new ranking framework [3], [4], [8], [18], which is different from general ranking functions such as ranking aggregation [7]. As one-dimensional manifolds, principal curves are able to perform unsupervised ranking tasks from multi-attribute numerical observations of objects [8]. But not all principal curve models can serve as ranking functions. For example, Elmap can portray the contour of a molecular surface well [19] but would bring about a biased ranking list because it does not guarantee order preservation [8]. What is more, Elmap is hardly interpretable, since the parameter size of its principal curves is not explicitly known.

A Bézier curve is a parametric one-dimensional curve widely used in fitting [20]. Hu et al. [14] proved that in two-dimensional space a cubic Bézier curve is strictly monotone with end points in opposite corners and control points in the interior of the square box, as shown in Fig. 4. To avoid confusion, in this paper end points refer to the points on both ends of the control polyline (also the end points of the curve), and control points refer to the other vertices of the control polyline.

1.2 Paper Organization

The rest of this paper is organized as follows. The background of this paper is formalized in the next section. In Section 3, five meta-rules are elaborated for ranking functions.
In Section 4, a ranking model, namely the ranking principal curve (RPC) model, is defined and formulated with a cubic Bézier curve, which is proved to follow all five meta-rules for ranking functions. The RPC learning algorithm is designed to learn the control points of the cubic Bézier curve in Section 5. To illustrate the effective performance of the proposed RPC model, applications on real-world datasets are carried out in Section 6, prior to the summary of this paper in Section 7.

Fig. 4. For an increasing monotone function, there are four basic nonlinear shapes [14] of cubic Bézier curves (in blue) which mimic the shapes of their control polylines (in red). Curve shapes are determined by the locations of the control points.

2 BACKGROUNDS

Consider ranking a set of n objects A = {a_1, a_2, ..., a_n} according to d real-valued attributes (or indicators, features) V = {v_1, v_2, ..., v_d}. The numerical observations of one object a ∈ A on all the attributes comprise an item, denoted as a vector x in the d-dimensional space $\mathbb{R}^d$. Ranking the objects in A is equivalent to ranking the data points X = {x_1, x_2, ..., x_n}. That is, the ordering $a_{i_1} \preceq a_{i_2} \preceq \cdots \preceq a_{i_n}$ can be achieved by discovering the ordering $x_{i_1} \preceq x_{i_2} \preceq \cdots \preceq x_{i_n}$, where {i_1, i_2, ..., i_n} is a permutation of {1, 2, ..., n} and $x_i \preceq x_j$ means that x_i precedes x_j. As there is no label to help with ranking, this is an unsupervised ranking problem from multi-attribute data.

Mathematically, the ranking task is to provide a list of totally ordered points. A total order is a special partial order which requires comparability in addition to the reflexivity, antisymmetry and transitivity required of a partial order [21].
Let x and y be a pair of points in X. For ranking, if x and y are different, they have the ordinal relation of either $x \preceq y$ or $y \preceq x$. If $x \preceq y$ and $y \preceq x$, then y = x, which means that x and y are the same point. Recalling that a partial order is associated with a proper cone and that $\mathbb{R}^d_+$ is a self-dual proper cone [21],

$$\mathbb{R}^d_+ = \{\rho : \rho^T x \ge 0,\ \forall x \in \mathbb{R}^d_+\},$$

the order for ranking tasks on $\mathbb{R}^d$ is defined in this paper to be

$$x \preceq y \iff \big(\delta_1(y_1 - x_1), \delta_2(y_2 - x_2), \cdots, \delta_d(y_d - x_d)\big)^T \in \mathbb{R}^d_+ \tag{1}$$

where $x = (x_1, x_2, \cdots, x_d)^T$, $y = (y_1, y_2, \cdots, y_d)^T$, and

$$\delta_j = \begin{cases} 1, & j \in E \\ -1, & j \in F. \end{cases} \tag{2}$$

It is easy to verify that the order defined by Eq. (1) is a total order with the properties of comparability, reflexivity, antisymmetry and transitivity. In Eq. (2), E and F are two subsets of {1, 2, ..., d} such that E ∪ F = {1, 2, ..., d} and E ∩ F = ∅. Let

$$\alpha = (\delta_1, \delta_2, \cdots, \delta_d)^T. \tag{3}$$

α is unique for a given ranking task and varies from task to task. For a given ranking task with defined α, x precedes y when x_j < y_j for j ∈ E and x_j > y_j for j ∈ F.

As $\mathbb{R}$ is totally ordered, we prefer to grade each point with a real value to help with ranking. Assume $\varphi: \mathbb{R}^d \mapsto \mathbb{R}$ is the ranking function that assigns x a score providing the ordering of x. ϕ is required to be order-preserving so that ϕ(x) has the same ordering in $\mathbb{R}$ as x has in $\mathbb{R}^d$. In order theory, an order-preserving function is also called isotone or monotone [22].

Definition 1 ([22]). A function $\varphi: \mathbb{R}^d \mapsto \mathbb{R}$ is called monotone (or, alternatively, order-preserving) if

$$x \preceq y \implies \varphi(x) \le \varphi(y) \tag{4}$$

and strictly monotone if

$$x \preceq y,\ x \ne y \implies \varphi(x) < \varphi(y). \tag{5}$$

Order preservation is the basic requirement for a ranking function. For a partially ordered set, ϕ should assign x a score no greater than the score of y if $x \preceq y$.
Moreover, if x ≠ y also holds, the score assigned to x must be smaller than the score assigned to y. As A is totally ordered and different points should be assigned different scores, the ranking function is required to be strictly monotone, as stated by Eq. (5). Otherwise, the ranking rule would be meaningless because it breaks the ordering in the original data space $\mathbb{R}^d$.

Example 2. In addition to the two indicators in Example 1, another two indicators are taken to evaluate the life qualities of countries: IMR³ and Tuberculosis⁴. It is easily seen that the life quality of a country is higher if it has a higher LEB and GDP and a lower IMR and Tuberculosis. Let the numerical observations on four countries be x_I = (2.1, 62.7, 75, 59), x_M = (11.3, 75.5, 12, 30), x_G = (32.1, 79.2, 6, 4), and x_N = (47.6, 80.1, 3, 3), respectively. By Eq. (1), they have the ordering $x_I \preceq x_M \preceq x_G \preceq x_N$ with α = (1, 1, −1, −1)^T. In this case, E = {1, 2} and F = {3, 4}. Let ϕ(x_I) = 0.407, ϕ(x_M) = 0.593, ϕ(x_G) = 0.785 and ϕ(x_N) = 0.891. Then ϕ is a strictly monotone mapping which strictly preserves the ordering in $\mathbb{R}^4$.

3. Infant Mortality Rate per 1,000 born.
4. New cases of infectious Tuberculosis per 100,000 of population.

Recall that a differentiable function $f: \mathbb{R} \mapsto \mathbb{R}$ is nondecreasing if and only if f′(x) ≥ 0 for all x ∈ dom f, and increasing if f′(x) > 0 for all x ∈ dom f (but the converse is not true) [23]. These conditions readily extend to the monotonicity of Definition 1 with respect to the order defined by Eq. (1).

Theorem 1 ([21]). Let $\varphi: \mathbb{R}^d \mapsto \mathbb{R}$ be differentiable. ϕ is monotone if and only if

$$\nabla\varphi(x) \succeq 0 \tag{6}$$

where 0 is the zero vector. ϕ is strictly monotone if

$$\nabla\varphi(x) \succ 0. \tag{7}$$

Theorem 1 provides first-order conditions for monotonicity. Note that '≻' denotes a strict partial order [21].
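The order of Eq. (1) reduces to a componentwise sign test on α ⊙ (y − x). A minimal Python sketch (the function name `precedes` is ours, purely illustrative) reproduces the ordering of Example 2:

```python
import numpy as np

def precedes(x, y, alpha):
    """Test x precedes y under Eq. (1): delta_j * (y_j - x_j) >= 0 for
    every attribute j, where alpha = (delta_1, ..., delta_d) holds +1
    for the 'higher is better' attributes (j in E) and -1 for the
    'lower is better' attributes (j in F)."""
    x, y, alpha = (np.asarray(v, dtype=float) for v in (x, y, alpha))
    return bool(np.all(alpha * (y - x) >= 0))

# The four countries of Example 2, alpha = (1, 1, -1, -1), E = {1, 2}, F = {3, 4}:
alpha = [1, 1, -1, -1]
x_I = [2.1, 62.7, 75, 59]
x_M = [11.3, 75.5, 12, 30]
x_G = [32.1, 79.2, 6, 4]
x_N = [47.6, 80.1, 3, 3]
# precedes(x_I, x_M, alpha), precedes(x_M, x_G, alpha) and
# precedes(x_G, x_N, alpha) all hold, giving the chain of Example 2.
```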
Let

$$\nabla\varphi(x) = \left(\frac{\partial\varphi}{\partial x_1}, \frac{\partial\varphi}{\partial x_2}, \cdots, \frac{\partial\varphi}{\partial x_d}\right)^T. \tag{8}$$

$\nabla\varphi(x) \succ 0$ means that $\partial\varphi/\partial x_j > 0$ for j ∈ E and $\partial\varphi/\partial x_j < 0$ for j ∈ F; in particular, no component of $\nabla\varphi(x)$ equals zero. By the strict-monotonicity case of Theorem 1, $\nabla\varphi(x) \succ 0$ implies not only that ϕ is strictly monotone from $\mathbb{R}^d$ to $\mathbb{R}$, but also that the value s = ϕ(x) is increasing with respect to x_j (j ∈ E) and decreasing with respect to x_j (j ∈ F). Vice versa, if $\partial\varphi/\partial x_j$ is greater than zero for j ∈ E and smaller than zero for j ∈ F, then $\nabla\varphi(x) \succ 0$ holds and implies that ϕ is a strictly monotone mapping. Lemma 1 follows immediately.

Lemma 1. s = ϕ(x) is strictly monotone if and only if s is strictly monotone along each x_i with the other coordinates x_j (j ≠ i) fixed.

Furthermore, a strictly monotone mapping induces a one-to-one correspondence: for a value s ∈ rang ϕ there is exactly one point x ∈ dom ϕ such that ϕ(x) = s. If this point is denoted by x = f(s), then $f: \mathbb{R} \mapsto \mathbb{R}^d$ is called the inverse mapping of ϕ and inherits the strict monotonicity of its origin ϕ.

Theorem 2. Assume $\nabla\varphi(x) \succ 0$. There exists an inverse mapping f: rang ϕ → dom ϕ such that $\nabla f(s) \succ 0$ holds for all s ∈ rang ϕ, that is, for all s_1, s_2 ∈ rang ϕ,

$$s_1 < s_2 \implies f(s_1) \preceq f(s_2),\ f(s_1) \ne f(s_2). \tag{9}$$

The proof of Theorem 2 can be found in Appendix B. The theorem also holds in the other direction: assuming $f: \mathbb{R} \mapsto \mathbb{R}^d$, if $\nabla f(s) \succ 0$, there exists an inverse mapping ϕ: rang f → dom f and $\nabla\varphi(x) \succ 0$ holds for all x ∈ rang f. Because of the one-to-one correspondence, f and ϕ share the same geometric properties, such as scale and translation invariance, smoothness and strict monotonicity [23].

3 META-RULES

As a ranking function $\varphi: \mathbb{R}^d \mapsto \mathbb{R}$, ϕ(x) outputs a real value s = ϕ(x) as the ranking score for a given point x.
The ranking list of objects is then provided by sorting their ranking scores in ascending/descending order. Since unsupervised ranking has no label information to verify the ranking list, we restrict ranking functions with five essential features to guarantee that a reasonable ranking list is provided. These features are capable of serving as high-level guidance for modeling ranking functions. They are also capable of serving as high-level assessments of unsupervised ranking performance, different from assessments of supervised ranking performance, which rely on the quality of ranking labels. Any function from $\mathbb{R}^d$ to $\mathbb{R}$ with all five features can serve as a ranking function and provide a reasonable ranking list. These features are rules for ranking rules, namely meta-rules.

3.1 Scale and Translation Invariance

Definition 2 ([24]). A ranking rule is invariant to scale and translation if for $x \preceq y$

$$\varphi(x) \le \varphi(y) \iff \varphi(L(x)) \le \varphi(L(y)) \tag{10}$$

where L(·) performs scale and translation.

Numerical observations on different indicators are taken on different dimensions of quantity. In Example 1, GDP is measured in thousands of dollars while LEB ranges from 40 to 90 years; they are not on the same dimensions of quantity. As a general data preprocessing technique, scale and translation can bring them into the same range (e.g., [0, 1]) while preserving their original ordering. If L is a linear transformation on $\mathbb{R}^d$, we have $x \preceq y \iff L(x) \preceq L(y)$ for $x, y \in \mathbb{R}^d$ [24]. Therefore, a ranking function ϕ(x) should produce the same ranking list before and after scaling and translating.

3.2 Strict Monotonicity

Definition 3 ([22]). ϕ(x) is strictly monotone if ϕ(x_i) < ϕ(x_j) for $x_i \preceq x_j$ and x_i ≠ x_j (i ≠ j).

Strict monotonicity in Definition 1 is specified here as one of the meta-rules for ranking.
For the ordinal classification problem, monotonicity is a general constraint, since two different objects may be classified into the same class [16]. But the ranking problem discussed in this paper requires strict monotonicity, since different objects should have different scores for ranking: ϕ(x_i) = ϕ(x_j) holds if and only if x_i = x_j (i ≠ j). In Example 1, $x_1 \preceq x_2$ and x_1 ≠ x_2 indicate that a higher score should be assigned to x_2 than to x_1, and likewise for x_3 and x_4. Therefore, the ranking function ϕ(x) is required to be a strictly monotone mapping; otherwise, the ranking list would not be convincing. ϕ in Example 2 is a case in point.

3.3 Linear/Nonlinear Capacities

Definition 4. ϕ(x) has the capacities of linearity and nonlinearity if ϕ(x) is able to depict both linear and nonlinear relationships.

Taking the ranking task in Example 1 for illustration, one has no knowledge about the relationship between LEB and the score. The score might be either a linear or a nonlinear function of LEB, and the case is similar for the relationship between GDP and the score. Therefore, t = ϕ(x) should embody both linear and nonlinear relationships between t and x_j. For the ranking task in Example 1, the ranking function ϕ should be a linear function of LEB for fixed GDP if LEB is linear with t. Meanwhile, ϕ should also be a nonlinear function of GDP for fixed LEB if GDP is nonlinear with t.

3.4 Smoothness

Definition 5 ([23]). ϕ(x) is smooth if ϕ(x) is C^h (h ≥ 1).

In mathematical analysis, a function is called smooth if it has derivatives of all orders [23]. Here a ranking function ϕ(x) is only required to be of class C^h with h ≥ 1; that is, ϕ(x) is continuous and has the first-order derivative ∇ϕ(x).
The first-order derivative ∇ϕ(x) guarantees that ϕ(x) exerts a consistent ranking rule on all objects and that the ranking rule does not change abruptly for some object. Taking the polyline in Fig. 2 for illustration, it is of class C^0 but not of class C^1, because it is continuous but not differentiable at the connecting vertex of the two lines. This would lead to an unreasonable ranking for those points projected onto the vertex.

3.5 Explicitness of Parameter Size

Definition 6. ϕ(x) has the property of explicitness if ϕ(x) has a known parameter size for a fair comparison among ranking models.

Hu et al. [13] considered nonparametric approaches a class of "black-box" approaches, since they cannot be interpreted by our intuition. As a ranking function, ϕ(x) should be semantically interpretable so that ϕ(x) has systematic meaning. For example, ϕ(x) = θ^T x gives an explicitly linear expression with parameter size d, the dimension of the parameter θ. It can be interpreted as saying that the score of x is linear in x and that the parameter θ is the allocation proportion vector of the indicators for ranking. Moreover, if there is another ranking model with the same characteristics, ϕ(x, θ) would be more applicable if it has a smaller parameter size.

The five meta-rules above are the guidance for designing a reasonable and practical ranking function. To perform a ranking task, a ranking function should satisfy all five meta-rules to produce a convincing ranking list. Any ranking function that breaks any of them would produce a biased and unreasonable ranking list. In this sense, they can be regarded as high-level assessments of unsupervised ranking performance.

Fig. 5. Schematic plots of ranking skeletons (heavy solid lines or curves in red). Circle points: observations of countries on two indicators, LEB and GDP.
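The strict-monotonicity meta-rule can be spot-checked numerically through the first-order condition of Theorem 1: every partial derivative of ϕ must carry the sign prescribed by α. A finite-difference sketch (the checker and the toy score function are ours, not part of the RPC model):

```python
import numpy as np

def gradient_sign_ok(phi, x, alpha, h=1e-6):
    """First-order check of Eq. (7) at a single point x: the j-th
    partial derivative, estimated by central differences, must be
    positive for j in E (alpha_j = +1) and negative for j in F
    (alpha_j = -1)."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    grad = np.empty_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        grad[j] = (phi(x + e) - phi(x - e)) / (2.0 * h)
    return bool(np.all(alpha * grad > 0.0))

# A toy score on data scaled into [0,1]^2 with alpha = (1, -1):
# increasing in the first attribute, decreasing in the second.
phi = lambda v: v[0] ** 3 + v[0] - np.exp(v[1])
```

Such a pointwise check is of course only a necessary sanity test sampled at chosen points, not a proof of monotonicity over the whole domain.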
4 RANKING PRINCIPAL CURVES

In this section, we propose a ranking principal curve (RPC) model to perform an unsupervised ranking task with a principal curve that has all five meta-rules. The RPC is parametrically designed as a cubic Bézier curve with control points restricted to the interior of a hypercube.

4.1 RPC Model

The simplest ranking rule is the first PCA, which summarizes the data in d-dimensional space with the largest principal component line [25]. The first PCA seeks the direction w that explains the maximal variance of the data cloud. Then x is orthogonally projected by w^T x onto the line passing through the mean µ. This line can be regarded as the ranking skeleton. Projected points take an ordering along the ranking skeleton, which is just the ordering of their first principal components computed by w^T x. Let s = w^T x; an ordering of s gives the ordering of x. As a ranking function, the first PCA is smooth, explicitly expressed, and invariant to scale and translation. It works well when the data distribute as a slender ellipse. However, the first PCA can hardly depict the skeleton of data distributions like crescents (Fig. 5(a)), so the produced ranking list is not convincing. What is more, the first PCA might be non-strictly monotone when the direction w is parallel to one coordinate axis, so that it cannot discriminate points like x_1 and x_2 in Example 1: they are projected to the same point if the first PCA lies along the direction parallel to the horizontal line. The problems referred to above hinder the first PCA from extensive application in comprehensive evaluation.

Recalling that principal curves are nonlinear extensions of the first PCA [10], we try to summarize multiple indicators in the data space with a principal curve (Appendix A gives a brief review of principal curves).
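The first-PCA ranking rule just described can be sketched as follows; the sign-fixing heuristic (flip w so that scores tend to increase with the attributes) is our own illustrative choice, not part of the paper:

```python
import numpy as np

def first_pca_scores(X):
    """Score each row of X (n x d) by its first principal component
    s = w^T (x - mu): the projection onto the direction of maximal
    variance, i.e. the linear 'ranking skeleton'."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    # The top right-singular vector of the centered data is the first PC.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    w = Vt[0]
    if w.sum() < 0:  # heuristic sign fix: larger attributes -> larger score
        w = -w
    return (X - mu) @ w

# Four points roughly along a rising diagonal:
X = np.array([[0.1, 0.15], [0.3, 0.35], [0.6, 0.55], [0.9, 0.95]])
scores = first_pca_scores(X)   # increasing along the skeleton
```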
Assuming f(s, θ) (θ ∈ Θ) is the principal curve of a given data cloud, it provides an ordering of the projected points on the principal curve, in a way similar to the first PCA. Intuitively, the principal curve is a good choice for performing ranking tasks. On the one hand, unsupervised ranking can only rely on the numerical observations of the ranking candidates on the given attributes. For a dataset with a linking network, PageRank can calculate a score for each point from backlinks [2]. When there are no links between points, a score can still be calculated according to the ranking skeleton instead of the link structure. On the other hand, the principal curve reconstructs x according to x = f(s, θ) + ε, instead of x = µ + s w + ε as for the first PCA.

To perform ranking tasks, a ranking function assigns a score s to x by s = ϕ(x, θ). In practice, noise is inevitable due to measurement errors and the influence of indicators excluded from V. Thus the latent score should be produced after removing the noise from x, that is, s = ϕ(x − ε, θ). As a ranking function, ϕ is assumed to be strictly monotone. Thus, data points and scores are in one-to-one correspondence and there exists an inverse function f of ϕ such that

$$x = f(s, \theta) + \varepsilon \tag{11}$$

which is the very principal curve model [10]. The inverse function can be taken as the generating function that produces the numerical observations from the score s, which can be regarded as pre-existing.

As stated in Section 3, there are five meta-rules for a function ϕ(x, θ) to serve as a ranking rule. As ϕ(x, θ) is required to be strictly monotone, there exists an inverse function f(s, θ) which is also strictly monotone by Theorem 2. Correspondingly, ϕ and its inverse f share the other properties of scale and translation invariance, smoothness, capacities of linearity and nonlinearity, and explicitness of parameter size.
A principal curve should also follow all five meta-rules to serve as a ranking function. However, polyline approximations of a principal curve may violate smoothness and strict monotonicity (e.g., Fig. 5(b)), and a smooth principal curve may still violate strict monotonicity (e.g., Fig. 5(c)). Both would produce unreasonable ranking solutions, as illustrated in Example 1. Within the framework of Fig. 3, all five meta-rules can be modeled as constraints on the ranking function. Since a principal curve is defined to be smooth and invariant to scale and translation [10], the constraint of strict monotonicity makes it capable of performing ranking tasks (e.g., Fig. 5(d)). Naturally, the principal curve should also have a known parameter size for reasons of interpretability. We present Definition 7 for unsupervised ranking with a principal curve.

Definition 7. A curve f(s, θ) in d-dimensional space is called a ranking principal curve (RPC) if f(s, θ) is a strictly monotone principal curve of the given data cloud and it is explicitly expressed with known parameters θ of limited size.

4.2 RPC Formulation with Bézier Curves

To perform a ranking task, a principal curve model should follow all five meta-rules (Section 3), which can also be defined similarly for f. However, not all principal curve models can perform ranking tasks. The models in [10], [26], [27], [29] lack explicitness and cannot make a monotone mapping on $\mathbb{R}^d$ (Fig. 5(c)). Polyline approximations [11], [19], [28] miss the requirements of smoothness and strict monotonicity (Fig. 5(b)). A new principal curve model is needed that performs ranking while following all five meta-rules.
In this paper, an RPC is parametrically modeled with a Bézier curve

$$f(s) = \sum_{r=0}^{k} B_r^k(s)\, p_r, \quad s \in [0, 1] \tag{12}$$

which is formulated in terms of Bernstein polynomials [31]

$$B_r^k(s) = \binom{k}{r} (1 - s)^{k - r} s^r, \tag{13}$$

$$\binom{k}{r} = \frac{k!}{r!(k - r)!}. \tag{14}$$

In Eq. (12), $p_r \in \mathbb{R}^d$ are the control and end points of the Bézier curve, which take the place of the function parameters θ in Eq. (11). In particular, when k = 3, Eq. (12) has the matrix form

$$f(s) = PMz \tag{15}$$

where $P = (p_0, p_1, p_2, p_3)$,

$$M = \begin{pmatrix} 1 & -3 & 3 & -1 \\ 0 & 3 & -6 & 3 \\ 0 & 0 & 3 & -3 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad z = \begin{pmatrix} 1 \\ s \\ s^2 \\ s^3 \end{pmatrix}.$$

For k > 3, the model becomes more complex and brings about an overfitting problem. For k < 3, the model is too simple to represent all possible monotone curves. Hence k = 3 is the most suitable degree for the ranking task.

A cubic Bézier curve with constraints on its control points can be proved to have all five meta-rules. First of all, the formulation Eq. (12) is a nonlinear interpolation of the control points and end points in terms of Bernstein polynomials [31]. These points are the determining parameters, of total size d × 4. Different locations of these points produce different shapes of nonlinear curves besides straight lines [14]. Scale and translation of a Bézier curve are applied to these points without changing the ranking score, which is contained in z: since the Bernstein polynomials sum to one,

$$\Lambda f(s) + \beta = \Lambda PMz + \beta = (\Lambda P + \beta \mathbf{1}^T) Mz \tag{16}$$

where Λ is a diagonal matrix of scaling factors for the dimensions, β is the translation vector, and $\mathbf{1} \in \mathbb{R}^4$ is the all-ones vector. This property allows us to put all data into [0, 1]^d in order to facilitate ranking. What is more, the derivative of f(s) is a lower-order Bézier curve

$$\frac{df(s)}{ds} = k \sum_{j=0}^{k-1} B_j^{k-1}(s)\, (p_{j+1} - p_j) \tag{17}$$

which involves only the end points and control points. Its derivatives of all orders exist for all s ∈ [0, 1], and thus Eq. (12) is smooth.
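Eq. (15) can be evaluated directly in a few lines. The end points below follow p_0 = ½(1 − α), p_3 = ½(1 + α) for α = (1, 1); the interior control points are arbitrary admissible values of ours, not from the paper:

```python
import numpy as np

# Cubic Bezier basis matrix M of Eq. (15).
M = np.array([[1., -3.,  3., -1.],
              [0.,  3., -6.,  3.],
              [0.,  0.,  3., -3.],
              [0.,  0.,  0.,  1.]])

def bezier(P, s):
    """Evaluate f(s) = P M z with z = (1, s, s^2, s^3)^T; returns a
    d x len(s) array of curve points."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    Z = np.vander(s, 4, increasing=True).T  # rows: 1, s, s^2, s^3
    return P @ M @ Z

# End points at opposite corners of [0,1]^2 (alpha = (1, 1)) and
# control points p1, p2 in the open box (0, 1)^2:
P = np.array([[0.0, 0.2, 0.7, 1.0],
              [0.0, 0.5, 0.9, 1.0]])
curve = bezier(P, np.linspace(0.0, 1.0, 11))
```

Since every coordinate of the control polyline is increasing here, the sampled curve increases strictly in each coordinate, which is exactly the behavior the monotonicity constraints are meant to enforce.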
Last but not least, it has been proved that a cubic Bézier curve can realize the four basic types of strict monotonicity in two-dimensional space [14]. Let the end points after scaling and translation be denoted by p_0 = (1/2)(1 − α) and p_3 = (1/2)(1 + α). The control points p_1 and p_2 determine the nonlinearity of the cubic Bézier curve (Fig. 4). In two-dimensional space, f(s) is proved to be increasing along each coordinate if the control points are restricted to the interior of the hypercube [0, 1]^d [14]. Thus, a proposition can be deduced from Lemma 1.

Proposition 1. f(s) is strictly monotone for s ∈ [0, 1] with p_0 = (1/2)(1 − α), p_3 = (1/2)(1 + α), and p_1, p_2 ∈ (0, 1)^d.

Most importantly, for a group of numerical observations there always exists an RPC parameterized by a cubic Bézier curve that is strictly monotone. Such existence has not been established for many other principal curve models [10], [19], [28].

Theorem 3. Assume that x is the numerical observation of a ranking candidate and that E‖x‖² < ∞. Then there exists P* ∈ [0, 1]^{d×4} such that f*(s) = P*Mz is strictly monotone and

    J(P^*) = \inf_{P} J(P), \quad J(P) = E\left[ \inf_s \| x - PMz \|^2 \right].    (18)

The proof of Theorem 3 can be found in Appendix C.

5 RPC Learning Algorithm

To perform unsupervised ranking from the numerical observations of the ranking candidates X = (x_1, x_2, ..., x_n), we first learn the control points of the curve in Eq. (12). The optimal points achieve the infimum of the estimate of J(P) in Eq. (18). By the principal curve definition of Hastie et al. [10], the RPC is the curve that minimizes the summed residual ε. Therefore, the ranking task is formulated as the nonlinear optimization problem

    \min J(P, s) = \sum_{i=1}^{n} \| x_i - PMz_i \|^2    (19)

s.t.
    \left( \frac{\partial PMz}{\partial s} \right)^T (x_i - PMz) \Bigg|_{s = s_i} = 0,    (20)

    s = (s_1, s_2, \cdots, s_n), \quad z = (1, s, s^2, s^3)^T, \quad P \in [0, 1]^{d \times 4}, \quad s_i \in [0, 1], \quad i = 1, 2, \cdots, n,

where Eq. (20) determines s_i, so that f(s_i) is the point on the curve with the minimum residual for reconstructing x_i. A local minimizer (P*, s*) can be obtained by alternating minimization:

    P^{(t+1)} = \arg\min_P \sum_{i=1}^{n} \| x_i - PMz_i^{(t)} \|^2,    (21)

    \left( \frac{\partial P^{(t+1)}Mz}{\partial s} \right)^T \left( x_i - P^{(t+1)}Mz \right) \Bigg|_{s = s_i^{(t+1)}} = 0,    (22)

where t denotes the t-th iteration.

The optimal solution of Eq. (21) has an explicit expression. Associate X with

    Z = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ s_1 & s_2 & \cdots & s_n \\ s_1^2 & s_2^2 & \cdots & s_n^2 \\ s_1^3 & s_2^3 & \cdots & s_n^3 \end{pmatrix} = (z_1, z_2, \cdots, z_n),    (23)

so that Eq. (19) can be rewritten in matrix form as

    J(P, s) = \| X - PMZ \|_F^2 = \mathrm{tr}(X^T X) - 2\,\mathrm{tr}(PMZX^T) + \mathrm{tr}(PMZZ^T M^T P^T).    (24)

Setting the derivative of J with respect to P to zero,

    \frac{\partial J}{\partial P} = 2 \left( P(MZ)(MZ)^T - X(MZ)^T \right) = 0,    (25)

and recalling that A^+ = A^T (AA^T)^+ [35], we obtain an explicit expression for the minimizer of Eq. (19):

    P = X(MZ)^T \left( (MZ)(MZ)^T \right)^+ = X(MZ)^+,    (26)

where (·)^+ denotes the pseudo-inverse. Based on the t-th iterate Z^{(t)}, the optimal solution is obtained by substituting Z^{(t)} into Eq. (26), which gives P^{(t+1)} = X(MZ^{(t)})^+. However, (MZ^{(t)})^+ is computationally expensive in numerical experiments, and X is often ill-conditioned (it has a high condition number), so that a very small change in Z^{(t)} can produce a tremendous change in P^{(t+1)}. Since Z^{(t)} is not the optimal solution of Eq. (19) but an intermediate result of the iteration, P^{(t+1)} could thereby drift far from the optimal solution.
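The closed-form update of Eq. (26) can be exercised on synthetic data: when observations are generated exactly by a known control point matrix and the scores are correct, the update recovers that matrix. A sketch assuming NumPy; the ground-truth P and the data here are illustrative, not from the paper:

```python
import numpy as np

M = np.array([[1, -3,  3, -1],
              [0,  3, -6,  3],
              [0,  0,  3, -3],
              [0,  0,  0,  1]], dtype=float)

def update_control_points(X, s):
    """Closed-form minimizer of ||X - P M Z||_F^2 over P (Eq. 26)."""
    Z = np.vstack([np.ones_like(s), s, s ** 2, s ** 3])   # Eq. (23)
    return X @ np.linalg.pinv(M @ Z)

# Synthetic check: data generated exactly by a known P is recovered
# by the update when the scores s are the true ones.
rng = np.random.default_rng(0)
P_true = rng.uniform(0, 1, size=(2, 4))
s = np.linspace(0, 1, 50)
Z = np.vstack([np.ones_like(s), s, s ** 2, s ** 3])
X = P_true @ M @ Z                                        # d x n observations

P_hat = update_control_points(X, s)
assert np.allclose(P_hat, P_true, atol=1e-6)
```

Since MZ here has full row rank (M is invertible and Z is a Vandermonde-type matrix with distinct scores), (MZ)(MZ)^+ is the 4 × 4 identity and the recovery is exact up to floating-point error.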
To settle this problem, we employ the Richardson iteration [37] with a preconditioner D, a diagonal matrix whose diagonal elements are the L2 norms of the columns of (MZ^{(t)})(MZ^{(t)})^T. Then P^{(t+1)} is updated according to

    P^{(t+1)} = P^{(t)} - \gamma^{(t)} \left( P^{(t)}(MZ^{(t)})(MZ^{(t)})^T - X(MZ^{(t)})^T \right) D^{-1},    (27)

where γ^{(t)} is a scalar chosen so that the sequence P^{(t)} converges. In practice, we set

    \gamma^{(t)} = \frac{2}{\lambda_{\min}^{(t)} + \lambda_{\max}^{(t)}},    (28)

where λ_min^{(t)} and λ_max^{(t)} are the minimum and maximum eigenvalues of (MZ^{(t)})(MZ^{(t)})^T, respectively [38].

After obtaining P^{(t+1)}, the score vector s^{(t+1)} is calculated as the solution to Eq. (22). Eq. (22) is a quintic polynomial equation, which rarely has explicitly expressible roots. In [20], s_i for x_i was approximated by the Gradient and Gauss-Newton methods, respectively; the Jenkins-Traub method [32] has also been considered to find the roots of the polynomial equation directly. As Eq. (20) is designed to find the minimum distance from the point x_i to the curve, we adopt Golden Section Search (GSS) [33] to find a local approximate solution to Eq. (22).

Algorithm 1 Algorithm to learn an RPC.
Input: X: data matrix; ξ: a small positive value.
Output: P*: control points of the learned Bézier curve; s*: the score vector of objects in the set.
1: Normalize X into [0, 1]^d;
2: Initialize P^(0);
3: while △J > ξ do
4:   Adopt GSS to find the approximate solution s^(t);
5:   Compute P^(t+1) using the preconditioner;
6:   if △J < 0 then
7:     break;
8:   end if
9: end while

Algorithm 1 summarizes the alternating optimization procedure. Before performing the ranking task, the numerical observations of the objects should be normalized into [0, 1]^d by

    \hat{x} = \frac{x - x_{\min}}{x_{\max} - x_{\min}},    (29)

where x̂ is the normalized vector of x, and x_min and x_max are the minimum and maximum vectors, respectively.
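The GSS step used for Eq. (22) can be sketched directly: for a given curve, it minimizes the squared residual ‖x − f(s)‖² over s ∈ [0, 1]. The projection target below is an illustrative curve of ours, not a learned RPC:

```python
import numpy as np

GOLDEN = (np.sqrt(5) - 1) / 2  # golden ratio conjugate, ~0.618

def golden_section_search(g, lo=0.0, hi=1.0, tol=1e-8):
    """Minimize a unimodal function g on [lo, hi] by golden-section search."""
    a, b = lo, hi
    c, d = b - GOLDEN * (b - a), a + GOLDEN * (b - a)
    while b - a > tol:
        if g(c) < g(d):
            b, d = d, c
            c = b - GOLDEN * (b - a)
        else:
            a, c = c, d
            d = a + GOLDEN * (b - a)
    return 0.5 * (a + b)

# Project a point onto the illustrative curve f(s) = (s, s^2):
# find s minimizing ||x - f(s)||^2, i.e. the condition in Eq. (20).
x = np.array([0.5, 0.25])             # this point lies on the curve at s = 0.5
f = lambda s: np.array([s, s ** 2])
s_star = golden_section_search(lambda s: np.sum((x - f(s)) ** 2))
assert abs(s_star - 0.5) < 1e-4
```

GSS needs only function evaluations, which is why it sidesteps the quintic root-finding that an exact solution of Eq. (22) would require.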
The ranking scores are unchanged by this normalization, because scaling and translation act only on the control and end points (Eq. (16)) without changing the interpolation values. In Step 2, we initialize the end points as p_0 = (1/2)(1 − α) and p_3 = (1/2)(1 + α), and randomly select samples as control points. During the learning procedure, P^(t) is learned automatically, making the Bézier curve an RPC in the numerical experiments. In Step 6, △J < 0 occurs when J begins to increase; in this case, the algorithm stops updating (P^(t), s^(t)) and returns a local minimum of J. Proposition 2 guarantees the convergence of the sequence found by the RPC learning algorithm (the proof can be found in Appendix D). Therefore, the RPC learning algorithm finds a converging sequence (P^(t), s^(t)) that achieves the infimum in Eq. (18).

Proposition 2. If P^(t) → P* as t → ∞, then J(P^(t), s^(t)) is a decaying sequence which converges to J(P*, s*) as t → ∞.

Algorithm 1 converges in a limited number of steps. In each step, P is updated at size 4 × d, and the scores of the points are calculated at size n. When the iteration stops, the ranking scores are produced along with P. In summary, the computational complexity of the RPC unsupervised ranking model is O(4d + n).

Fig. 6. A, B and C are three objects to rank. s_1, s_2 and s_3 are the scores given by the S-shaped RPC (in green) in the figure. A different observation of A (denoted by A′) would give a different RPC (in pink) and thus a different ordering of the objects.

Compared to the ranking rule of weighted summation, ranking with the RPC model costs a little more. However, weighted summation needs weight assignments from a domain expert, which makes it more subjective because the weights vary from expert to expert. The RPC model needs no expert to assign weight proportions to indicators; its learning procedure does the whole work of ranking.
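The alternating structure of Algorithm 1 can be sketched end-to-end on synthetic data. This is a simplified sketch under stated assumptions: it uses the exact pseudo-inverse update of Eq. (26) rather than the preconditioned Richardson step of Eq. (27), a grid search in place of GSS, and it omits the box constraints on the control points; all variable names are ours. The monotone decrease of J (Proposition 2) can still be observed:

```python
import numpy as np

M = np.array([[1, -3,  3, -1],
              [0,  3, -6,  3],
              [0,  0,  3, -3],
              [0,  0,  0,  1]], dtype=float)

def curve(P, s):
    Z = np.vstack([np.ones_like(s), s, s ** 2, s ** 3])
    return P @ M @ Z

def project_scores(P, X, grid):
    # Coarse stand-in for the GSS step: for every observation, pick the
    # grid score with the smallest reconstruction residual (Eqs. 20/22).
    F = curve(P, grid)                                        # d x |grid|
    d2 = ((X.T[:, None, :] - F.T[None, :, :]) ** 2).sum(axis=2)
    return grid[d2.argmin(axis=1)]

def learn_rpc(X, n_iter=15, grid=np.linspace(0, 1, 201)):
    s = X[0].copy()               # simple initial scores (not the paper's init)
    history = []
    for _ in range(n_iter):
        Z = np.vstack([np.ones_like(s), s, s ** 2, s ** 3])
        P = X @ np.linalg.pinv(M @ Z)                         # Eq. (26)
        s = project_scores(P, X, grid)
        history.append(((X - curve(P, s)) ** 2).sum())        # J of Eq. (19)
    return P, s, history

# Noisy samples from a monotone curve, normalized into [0,1]^2 via Eq. (29).
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 1, 60))
X = np.vstack([t, t ** 2]) + 0.01 * rng.normal(size=(2, 60))
X = (X - X.min(1, keepdims=True)) / (X.max(1, keepdims=True) - X.min(1, keepdims=True))

P, s, J = learn_rpc(X)
# The summed residual J is non-increasing across iterations (Proposition 2).
assert all(J[i + 1] <= J[i] + 1e-9 for i in range(len(J) - 1))
```

The non-increase is structural: the P-update exactly minimizes J for fixed scores, and the score update can only reduce each residual over the grid, mirroring the two inequalities (D-1) and (D-2) used in the proof of Proposition 2.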
The RPC learning algorithm learns a ranking function in a completely different way from traditional methods. On the one hand, the ranking function is constrained by the five meta-rules for ranking rules. Integrating the meta-rules with the ranking function makes the ranking rule more consistent with human knowledge about ranking problems; as high-level knowledge, these meta-rules are also capable of evaluating ranking performance. On the other hand, ranking is carried out following the principle of unsupervised ranking: "let the data speak for themselves". In unsupervised ranking, no ranking labels are available to guide the system in learning a ranking function. As a matter of fact, the structure of the dataset contains the ordinal information between objects. If all the determining factors of the ordinal relations are included, the RPC can thread through all the objects successively. In practice, the most influential indicators are selected to estimate the order of objects, but the remaining factors still affect the numerical observations. Since we know nothing about those remaining factors, we do best to minimize their effect, which we formulate as the error ε. Therefore, minimizing the error is adopted as the learning objective when no ranking labels are available.

6 Experiments

6.1 Comparisons with Ranking Aggregation

For a ranking task, some researchers prefer to aggregate many different ranking lists of the same set of objects in order to obtain a "better" order. For example, median rank aggregation [34] aggregates different orderings into a median rank with

    \kappa(i) = \frac{\sum_{j=1}^{m} \tau_j(i)}{m}, \quad i = 1, 2, \cdots, n,    (30)

where τ_j(i) is the position of object i in ranking list τ_j, τ_j is a permutation of {1, 2, ..., n}, and κ is the ordering of median rank aggregation. However, ranking aggregation approaches fail to satisfy strict monotonicity and smoothness.
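Eq. (30) is straightforward to compute. A minimal sketch using the positions of A, B and C along x1 and x2 from Table 1(a), which also exhibits the tie that aggregation cannot break:

```python
# Median rank aggregation (Eq. 30): kappa(i) is the average position of
# object i over the m ranking lists.
def median_rank(rank_lists):
    m = len(rank_lists)
    objects = rank_lists[0].keys()
    return {i: sum(tau[i] for tau in rank_lists) / m for i in objects}

# Positions of A, B, C along x1 and x2, as in Table 1(a).
tau_x1 = {"A": 2, "B": 1, "C": 3}
tau_x2 = {"A": 1, "B": 2, "C": 3}
kappa = median_rank([tau_x1, tau_x2])

# A and B tie at 1.5, so aggregation alone cannot order them.
assert kappa["A"] == kappa["B"] == 1.5 and kappa["C"] == 3
```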
Therefore, the resulting ranking list is not very convincing. What is more, aggregation merely combines the orderings and ignores the information delivered by the numerical observations.

In contrast, the RPC is modeled to follow all five meta-rules and thus infers a reasonable ranking list. Moreover, the RPC can detect the ordinal information embedded in the numerical observations, as illustrated in Fig. 6. Consider ranking three objects A, B and C in the two-dimensional space of Fig. 6, with numerical observations on x1 and x2 as shown in Table 1(a). The objects can be ordered along x1 and x2 respectively. Median rank aggregation [34] produces an ordering that cannot distinguish A and B, since they occupy tied places in the ranking list. In contrast, the RPC model produces the order ABC, in which A and B are distinguishable, since the RPC ranks objects on their original observation data. If one of the objects receives a different observation, a different RPC would produce a different ranking list, while RankAgg remains the same. In Table 1(b), a different observation of object A is obtained, denoted A′. A different RPC is learned (the pink curve in Fig. 6) and gives the order BA′C (the last column of Table 1(b)), which differs from the order in Table 1(a). In summary, the RPC is able to capture the ordinal information contained not only among the ranking candidates but also in each individual observation.

6.2 Applications

Unsupervised ranking of multi-attribute observations of objects has a wide range of applications, the most significant being the ranking of countries, journals and universities. Taking journal ranking for illustration, there have been many indices for ranking journals, such as the impact factor (IF) [39] and the Eigenfactor [40]. Different indices reflect different aspects of journals and provide different ranking lists for journals.
Thus, how to evaluate journals in a comprehensive way becomes a tough problem. The RPC model is proposed as a new framework to attack this problem, providing an ordering along the "ranking skeleton" of the data distribution. In this paper, we perform ranking tasks with RPCs to produce comprehensive evaluations on open-access datasets of countries and journals, using the open-source software Scilab (version 5.4.1) on an Ubuntu 12.04 system with 4 GB memory. Due to space limitations, we list only parts of the ranking lists (the full lists will be available when the paper is published).

TABLE 1
The RPC model can detect ordinal information contained in the numerical observations in Fig. 6.

(a) A group of observations and the ranking lists given by different rules

Object | x1 Value | x1 Order | x2 Value | x2 Order | RankAgg | RPC Score | RPC Order
A      | 0.3      | 2        | 0.25     | 1        | 1.5     | 0.2329    | 1
B      | 0.25     | 1        | 0.55     | 2        | 1.5     | 0.3304    | 2
C      | 0.7      | 3        | 0.7      | 3        | 3       | 0.7300    | 3

(b) Another group of observations and the ranking lists given by different rules

Object | x1 Value | x1 Order | x2 Value | x2 Order | RankAgg | RPC Score | RPC Order
A′     | 0.35     | 2        | 0.4      | 1        | 1.5     | 0.3708    | 2
B      | 0.25     | 1        | 0.55     | 2        | 1.5     | 0.3431    | 1
C      | 0.7      | 3        | 0.7      | 3        | 3       | 0.7318    | 3

Note: Different observations of the objects produce different ranking lists. In (a), objects A, B and C can be ordered by their values on x1 and x2 respectively. Ranking aggregation (RankAgg) then produces a comprehensive ordering by Eq. (30), but fails to distinguish A and B even though their observations are distinguishable, while the RPC can distinguish them. The RPC can also detect minor ordinal differences between objects: in (b), A has a different observation from (a), denoted A′; the ranking list stays the same for RankAgg, while the RPC provides a different ordering.

TABLE 2
Part of the ranking list for life qualities of countries.
Country        | GDP¹     | LEB²   | IMR³ | Tuberculosis⁴ | Elmap [8] Score | Elmap Order | RPC Score | RPC Order
Luxembourg     | 70014    | 79.56  | 6    | 4             | 0.892           | 1           | 1.0000    | 1
Norway         | 47551    | 80.29  | 3    | 3             | 0.647           | 2           | 0.8720    | 2
Kuwait         | 44947    | 77.258 | 11   | 10            | 0.608           | 3           | 0.8483    | 3
Singapore      | 41479    | 79.627 | 12   | 2             | 0.578           | 4           | 0.8305    | 4
United States  | 41674    | 77.93  | 2    | 7             | 0.575           | 5           | 0.8275    | 5
...
Moldova        | 2362     | 67.923 | 63   | 17            | 0.002           | 97          | 0.5139    | 96
Vanuatu        | 3477     | 69.257 | 37   | 31            | 0.011           | 96          | 0.5135    | 97
Suriname       | 7234     | 68.425 | 53   | 30            | 0.011           | 95          | 0.5133    | 98
Morocco        | 3547     | 70.443 | 44   | 36            | 0.002           | 98          | 0.5106    | 99
Iraq           | 3200     | 68.495 | 25   | 37            | -0.002          | 100         | 0.5032    | 100
...
South Africa   | 8477     | 51.803 | 349  | 55            | -0.652          | 167         | 0.0786    | 167
Sierra Leone   | 790      | 46.365 | 219  | 160           | -0.664          | 169         | 0.0541    | 168
Djibouti       | 1964     | 54.456 | 330  | 88            | -0.655          | 168         | 0.0524    | 169
Zimbabwe       | 538      | 41.681 | 311  | 68            | -0.680          | 170         | 0.0462    | 170
Swaziland      | 4384     | 44.99  | 422  | 110           | -0.876          | 171         | 0         | 171
p_0⁵           | 44713    | 81.218 | 2    | 0             | -               | -           | -         | -
p_1            | 330      | 80.4   | 2    | 0             | -               | -           | -         | -
p_2            | 330      | 59.7   | 33   | 43            | -               | -           | -         | -
p_3            | 1581.824 | 41.68  | 290  | 151           | -               | -           | -         | -

1 Gross Domestic Product per capita by Purchasing Power Parities, $ per person; 2 Life Expectancy at Birth, years; 3 Infant Mortality Rate (per 1000 born); 4 Infectious Tuberculosis, new cases per 100,000 of population, estimated; 5 p_j (j = 0, 1, 2, 3) are the control and end points of the RPC.

6.2.1 Results on Life Qualities of Countries

Gorban et al. [8] ranked 171 countries by the life qualities of their people, with data drawn from GAPMINDER⁵ based on the four indicators of Example 2. For comparison, we use the same four GAPMINDER indicators as [8]. The RPC learned by Algorithm 1 is shown in a two-dimensional visualization in Fig. 7, and part of the ranking list is given in Table 2.

5. http://www.gapminder.org/

As Fig. 7 shows, the RPC portrays the data distribution trends with different shapes, both linear and nonlinear. For this task, α = [1, 1, −1, −1]^T, just as in Example 2.
α also reveals the relationships between the indicators used for ranking. GDP moves in the same direction as LEB, but in the opposite direction to IMR and Tuberculosis. In the beginning, a small increase in GDP brings about a tremendous increase in LEB and tremendous decreases in IMR and Tuberculosis. Once GDP exceeds $14300 per person (0.2 as a normalized value in Fig. 7), increasing GDP results in only a small increase in LEB, and likewise only small decreases in IMR and Tuberculosis. As a matter of fact, it is hard to improve LEB, IMR and Tuberculosis further when they are close to the limits of human evolution.

Fig. 7. Two-dimensional display of the data points and the RPC for life qualities of countries. Green points are numerical observations and red curves are two-dimensional projections of the RPC.

In Table 2, the control points provided by the RPC learning algorithm (Algorithm 1) are listed at the bottom; the points p_j are given in the original data space. Although the number of control points is set to two, in addition to the two end points, the number actually needed for each indicator is adapted automatically by learning. From Table 2, p_0 and p_1 overlap for IMR and Tuberculosis, which means that three points are enough for a Bézier curve to depict the skeletons of IMR and Tuberculosis. The two-dimensional visualizations in Fig. 7 tally with this statement.

Gorban et al. [8] provided centered scores for countries, similar to the first principal component. But the zero score is assigned to no country, so no country serves as the ranking reference.
In addition, rankers have trouble understanding the ranking principle because the parameter size is unknown; the resulting ranking list is therefore hard to interpret. Compared with Elmap [8], the presented RPC model follows all five meta-rules. With these meta-rules as constraints, it achieves a better fitting performance in terms of mean square error (90% vs. 86% explained variance). It produces scores in [0, 1], where 0 and 1 are the worst and best references respectively; Luxembourg, with the best life quality, provides a development direction for the countries below it. Additionally, the RPC model is interpretable and easy to apply in practice, since just four points determine the ranking list.

6.2.2 Results on Journal Ranking

We also apply the RPC model to rank journals, with data accessible from the Web of Knowledge⁶, which is affiliated with Thomson Reuters. Thomson Reuters annually publishes the Journal Citation Reports (JCR), which provide information about academic journals in the sciences and social sciences. JCR2012 reports citation information with the indicators Impact Factor, 5-Year Impact Factor, Immediacy Index, Eigenfactor Score, and Article Influence Score. After the journals with missing data are removed from the data table (58 out of 451), the RPC model is used to provide a comprehensive ranking list of the journals in the computer science categories: artificial intelligence, cybernetics, information systems, interdisciplinary applications, software engineering, and theory and methods.

Table 3 shows the ranking list of journals produced by the RPC model based on JCR2012, and a two-dimensional visualization of the RPC is shown in Fig. 8. For this ranking task, a journal ranks higher with a higher value of each indicator, that is, α = [1, 1, 1, 1]^T. Among all the indicators, the 5-Year Impact Factor shows an almost linear relationship with the others.
The Eigenfactor, however, presents no clear relationship with the others, which means it is calculated in a very different way: the Eigenfactor works like PageRank [2], while the other indicators are frequency counts. From Table 3, IEEE Transactions on Knowledge and Data Engineering (TKDE) is ranked higher than IEEE Transactions on Systems, Man, and Cybernetics-Part A (SMCA), although SMCA has a higher IF (2.183) than TKDE (1.892). The lower Influence Score of SMCA (0.767 vs. 1.129 for TKDE) brings it down the ranking list; TKDE therefore obtains a higher comprehensive evaluation score and wins a higher place. This shows that a single indicator does not tell the whole story of a ranking list: the RPC produces a ranking list for journals by taking account of several indicators covering different aspects.

6. http://wokinfo.com/

TABLE 3
Part of the ranking list for JCR2012 journals of the computer sciences.

Title                | IF (Order)  | 5-Year IF (Order) | Immediacy Index (Order) | Eigenfactor (Order) | Influence Score (Order) | RPC Score (Order)
IEEE T PATTERN ANAL  | 4.795 (7)   | 6.144 (5)         | 0.625 (26)              | 0.05237 (3)         | 3.235 (6)               | 1.0000 (1)
ENTERP INF SYST UK   | 9.256 (1)   | 4.771 (10)        | 2.682 (2)               | 0.00173 (230)       | 0.907 (86)              | 0.9505 (2)
J STAT SOFTW         | 4.910 (4)   | 5.907 (6)         | 0.753 (18)              | 0.01744 (20)        | 3.314 (4)               | 0.9162 (3)
MIS QUART            | 4.659 (8)   | 7.474 (2)         | 0.705 (21)              | 0.01036 (49)        | 3.077 (7)               | 0.9105 (4)
ACM COMPUT SURV      | 3.543 (21)  | 7.854 (1)         | 0.421 (56)              | 0.00640 (80)        | 4.097 (1)               | 0.9092 (5)
...
DECIS SUPPORT SYST   | 2.201 (51)  | 3.037 (43)        | 0.196 (169)             | 0.00994 (52)        | 0.864 (93)              | 0.4701 (65)
COMPUT STAT DATA AN  | 1.304 (156) | 1.449 (180)       | 0.415 (61)              | 0.02601 (11)        | 0.918 (83)              | 0.4665 (66)
IEEE T KNOWL DATA EN | 1.892 (82)  | 2.426 (72)        | 0.217 (152)             | 0.01256 (37)        | 1.129 (55)              | 0.4616 (67)
MACH LEARN           | 1.467 (133) | 2.143 (96)        | 0.373 (70)              | 0.00638 (81)        | 1.528 (20)              | 0.4490 (68)
IEEE T SYST MAN CY A | 2.183 (53)  | 2.44 (68)         | 0.465 (46)              | 0.00728 (69)        | 0.767 (111)             | 0.4466 (69)
...

Fig. 8. Two-dimensional display of the data points and the RPC for JCR2012. Green points are numerical observations and red curves are two-dimensional projections of the RPC. (IF: Impact Factor, 5IF: 5-Year IF, ImmInd: Immediacy Index, IS: Influence Score)

7 Conclusions

Ranking and its tools have had, and will continue to have, an increasing impact on human behavior, either positively or negatively. However, ranking activities still face many challenges that have greatly restrained the rational design and use of ranking tools. In practice, ranking is generally an unsupervised task, which encounters the critical challenge that there is no ground truth against which to evaluate the provided lists. PageRank [2] is an effective unsupervised ranking model for candidates with a link structure, but it does not work for numerical observations on multiple attributes of objects.

It is well known that domain knowledge can always improve data mining performance. We try to attack unsupervised ranking problems with domain knowledge about ranking. Motivated by [13], [16], five meta-rules are presented as ranking knowledge and regarded as constraints on ranking models: scale and translation invariance, strict monotonicity, linear/nonlinear capacities, smoothness, and explicitness of parameter size. They are also capable of assessing the ranking performance of different models.
Inspired by [8], [14], we propose a ranking model based on a principal curve that is parametrically formulated as a cubic Bézier curve whose control points are restricted to the interior of the hypercube [0, 1]^d. The control points are learned from the data distribution without human intervention. Applications to the life qualities of countries and to computer science journals show that the proposed RPC model produces reasonable ranking lists. From an application viewpoint, there are many indicators for ranking objects; the RPC can also be used for feature selection, which is one part of our future work.

Appendix A: Principal Curves

Given a dataset X = (x_1, x_2, ..., x_n), x_i ∈ R^d, a principal curve summarizes the data with a smooth curve instead of the straight line of the first principal component:

    x = f(s) + \varepsilon,    (A-1)

where f(s) = (f_1(s), f_2(s), ..., f_d(s)) ∈ R^d and s ∈ R. The principal curve f was originally defined by Hastie and Stuetzle [10] as a smooth (C^∞), unit-speed (‖f′‖ = 1) one-dimensional manifold in R^d satisfying the self-consistency condition f(s) = E(x | s_f(x) = s), where s = s_f(x) ∈ R is the largest value for which f(s) has the minimum distance from x. Mathematically, s_f(x) is formulated as [10]

    s_f(x) = \sup \left\{ s : \| x - f(s) \| = \inf_{\tau} \| x - f(\tau) \| \right\}.    (A-2)

In other words, a curve f : R → R^d is called a principal curve if it minimizes the expected squared distance between x and f, denoted by [11]

    J(f) = E\left[ \inf_s \| x - f(s) \|^2 \right] = E \| x - f(s_f(x)) \|^2.    (A-3)

As a one-dimensional principal manifold, the principal curve has wide applications (e.g., [36]) due to its simplicity. Following Hastie and Stuetzle [10], researchers have since proposed a variety of principal curve definitions and learning algorithms for different tasks [11], [19], [26], [29].
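The expected squared distance of Eq. (A-3) can be estimated empirically by replacing the expectation with a sample mean and the infimum over s with a dense grid search. A sketch on synthetic data; the parabola, the comparison line (a stand-in for a first-principal-component fit), and all names here are illustrative choices of ours:

```python
import numpy as np

# Empirical version of Eq. (A-3): J(f) ~ (1/n) sum_i min_s ||x_i - f(s)||^2,
# with the minimum over s approximated on a dense grid (a stand-in for s_f(x)).
def empirical_J(f, X, grid=np.linspace(0, 1, 501)):
    F = np.array([f(s) for s in grid])                        # |grid| x d
    d2 = ((X[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

# Data scattered around the parabola (t, t^2); illustrative, not from the paper.
rng = np.random.default_rng(2)
t = rng.uniform(0, 1, 200)
X = np.column_stack([t, t ** 2]) + 0.02 * rng.normal(size=(200, 2))

curve_J = empirical_J(lambda s: np.array([s, s ** 2]), X)     # curve along the data
line_J = empirical_J(lambda s: np.array([s, s]), X)           # a straight line

# The curve that follows the data skeleton leaves a smaller expected
# squared distance than the straight line.
assert curve_J < line_J
```

This is exactly the sense in which a principal curve generalizes the first principal component: it minimizes the same distance criterion over smooth curves rather than over straight lines.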
Most of them, however, first approximate the principal curve with a polyline [11] and then smooth it to meet the smoothness requirement [10]. The resulting expression of the principal curve is therefore not explicit, yielding a 'black box' that is hard to interpret. Other definitions of principal curves [27], [30] employ a Gaussian mixture model to formulate the principal curve generally, which introduces model bias and makes interpretation even harder. When a principal curve is used to perform a ranking task, it should be modeled as a 'white box' whose ranking lists can be well interpreted.

Appendix B: Proof of Theorem 2

If ∇φ(x) ≻ 0, φ is strictly monotone by Theorem 1. Since the ranking candidates are totally ordered, there is a one-to-one correspondence between the ranking items in R^d and rang φ. Otherwise, s = φ(x_0) and s = φ(x_0 + △x) would both hold for some x_0 ∈ dom φ; in that case, ∇φ(x)|_{x = x_0} = 0, contradicting the assumption that ∇φ(x) ≻ 0 holds for all x ∈ dom φ. By Lemma 1 and the one-to-one correspondence, there exists an inverse mapping f : rang φ → dom φ such that x = f(s). By strict monotonicity (Eq. (1)) and the one-to-one correspondence, we have

    x_1 \preceq x_2, \; x_1 \neq x_2 \iff s_1 < s_2.    (B-1)

Thus, ∇f(s) ≻ 0 holds for s ∈ rang φ. □

Appendix C: Proof of RPC Existence (Theorem 3)

Proof. Let U = [0, 1] and let C(U) denote the set of all continuous functions f : U → [0, 1]^d ⊆ R^d embracing all possible observations of x. The uniform metric is defined as

    D(f, g) = \sup_{0 \le s \le 1} \| f(s) - g(s) \|, \quad \forall f, g \in C(U).    (C-1)

It is easy to see that (C(U), D) is a complete metric space [21]. Let Γ = { f(s) : f(s) = PMz, P ∈ Θ } ⊆ C(U), where Θ ⊆ [0, 1]^{d×4} is the convex hull of the observations x.
With the Frobenius norm, Θ is a sequentially compact set, so that for any given sequence in Θ there exists a subsequence P^(t) converging to some P* ∈ [0, 1]^{d×4} [21] with

    \| P^{(t)} - P^* \|_F \to 0.    (C-2)

Let p_0 = (1/2)(1 − α) and p_3 = (1/2)(1 + α). Then we have a sequence f^(t)(s) converging uniformly to f*(s):

    D(f^{(t)}(s), f^*(s)) = \sup_{0 \le s \le 1} \| f^{(t)}(s) - f^*(s) \|    (C-3)
    \le \sup_{0 \le s \le 1} \| P^{(t)} - P^* \|_F \| Mz \|    (C-4)
    = \| P^{(t)} - P^* \|_F \to 0,    (C-5)

where ‖Mz‖ = 1. By Proposition 1, f^(t)(s) is a sequence of strictly monotone curves converging to f*(s). Assuming the converging sequence f^(t)(s) gives J(P^(t)) ≥ J(P*) for fixed x ∈ R^d, we have

    J(P^{(t)}) - J(P^*) = \| x - f^{(t)}(s) \|^2 - \| x - f^*(s) \|^2    (C-6)
    \le \left( \| x - f^{(t)}(s) \| + \| x - f^*(s) \| \right) \| f^{(t)}(s) - f^*(s) \|    (C-7)
    \to 0,    (C-8)

and therefore

    E\left[ J(P^{(t)}) - J(P^*) \right] \to 0.    (C-9)

This completes the proof. □

Appendix D: Proof of Convergence (Proposition 2)

Proof: First of all, the sequence P^(t) generated by the Richardson method has been proved to converge [37]. Assume P^(t) → P*, and let s^(t) and s* be the corresponding score vectors calculated by Eq. (22). Note that the term P^(t+1) − P^(t) is in a descent direction of J in Eq. (27), so

    J(P^{(t)}, s^{(t)}) \ge J(P^{(t+1)}, s^{(t)}).    (D-1)

Then, with the control points P^(t+1), the scores s^(t+1) minimize the summed orthogonal distance:

    J(P^{(t+1)}, s^{(t)}) \ge J(P^{(t+1)}, s^{(t+1)}).    (D-2)

Thus,

    J(P^{(t)}, s^{(t)}) \ge J(P^{(t+1)}, s^{(t+1)}).    (D-3)

Finally, by Theorem 3 the sequence {J(P^(t), s^(t))} converges to its infimum J(P*, s*) as t → ∞. □

Acknowledgement

The authors appreciate very much the advice from the machine learning crew at NLPR. This work is supported in part by NSFC (No. 61273196) for C.-G. Li and B.-G.
Hu, and NSFC (No. 61271430 and No. 61332017) for X. Mei.

References

[1] H. Li, "A Short Introduction to Learning to Rank", IEICE Trans. Inf. Syst., vol. E94-D, no. 10, pp. 1-9, 2011.
[2] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Computer Networks, vol. 30, no. 1-7, pp. 107-117, 1998.
[3] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, "Ranking on Data Manifolds", Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, eds., MIT Press, 2004.
[4] B. Xu, J. Bu, C. Chen, D. Cai, X. He, W. Liu, and J. Luo, "Efficient Manifold Ranking for Image Retrieval", Proc. 34th Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 525-534, 2011.
[5] C. Bishop, Pattern Recognition and Machine Learning, New York: Springer, 2006.
[6] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection", J. Mach. Learn. Res., vol. 3, pp. 1157-1182, 2003.
[7] A. Klementiev, D. Roth, K. Small, and I. Titov, "Unsupervised Rank Aggregation with Domain-Specific Expertise", Proc. 20th Int'l Joint Conf. Artificial Intell., pp. 1101-1106, 2009.
[8] A.Y. Zinovyev and A.N. Gorban, "Nonlinear Quality of Life Index", http://arxiv.org/abs/1008.4063, 2010.
[9] A. Vasuki, "A Review of Vector Quantization Techniques", IEEE Potentials, vol. 25, no. 4, pp. 39-47, 2006.
[10] T. Hastie and W. Stuetzle, "Principal Curves", J. Amer. Stat. Assoc., vol. 84, no. 406, pp. 502-516, 1989.
[11] B. Kégl, A. Krzyżak, T. Linder, and K. Zeger, "Learning and Design of Principal Curves", IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 3, pp. 281-297, 2000.
[12] P. Gould, "Letting the Data Speak for Themselves", Assoc. Amer. Geog. USA, vol. 71, no. 2, 1981.
[13] B.-G. Hu, H.B. Qu, Y. Wang, and S.H.
Yang, "A Generalized-Constraint Neural Network Model: Associating Partially Known Relationships for Nonlinear Regression", Inf. Sci., vol. 179, pp. 1929-1943, 2009.
[14] B.-G. Hu, G.K.I. Mann, and R.G. Gosine, "Control Curve Design for Nonlinear (or Fuzzy) Proportional Actions Using Spline-Based Functions", Automatica, vol. 34, no. 9, pp. 1125-1133, 1998.
[15] H. Daniels and M. Velikova, "Monotone and Partially Monotone Neural Networks", IEEE Trans. Neural Networks, vol. 21, no. 6, pp. 906-917, 2010.
[16] W. Kotłowski and R. Słowiński, "On Nonparametric Ordinal Classification with Monotonicity Constraints", IEEE Trans. Knowl. Data Engineering, vol. 25, no. 11, pp. 2576-2589, 2013.
[17] Y. Zhang, W. Zhang, J. Pei, X. Lin, Q. Lin, and A. Li, "Consensus-Based Ranking of Multivalued Objects: A Generalized Borda Count Approach", IEEE Trans. Knowl. Data Engineering, vol. 26, no. 1, pp. 83-96, 2014.
[18] X.Q. Cheng, P. Du, J.F. Guo, X.F. Zhu, and Y.X. Chen, "Ranking on Data Manifold with Sink Points", IEEE Trans. Knowl. Data Engineering, vol. 25, no. 1, pp. 177-191, 2013.
[19] A.N. Gorban and A.Y. Zinovyev, "Chapter 2: Principal Graphs and Manifolds", Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, E.S. Olivas, J.D.M. Guerrero, M.M. Sober, J.R.M. Benedito, A.J.S. López, eds., New York: Inf. Sci. Ref., vol. 1, pp. 28-59, 2010.
[20] T.A. Pastva, "Bézier Curve Fitting", master's thesis, Naval Postgraduate School, 1998.
[21] S. Boyd and L. Vandenberghe, Convex Optimization, New York: Camb. Univ. Press, 2004.
[22] H.A. Priestley, "Chapter 2: Ordered Sets and Complete Lattices - a Primer for Computer Science", Algebraic and Coalgebraic Methods in the Mathematics of Program Construction, R. Backhouse, R. Crole, J. Gibbons, eds., LNCS 2297, pp. 21-78, 2002.
[23] P.M. Fitzpatrick, Advanced Calculus, CA: Thomson Brooks/Cole, 2006.
[24] A. Cambini, D.T.
Luc, and L . M arte in . “Order-Preserving T rans- formations and Applications”, J. Optimization Theory Applicatio ns , vol. 118, no. 2, pp. 275-293, 2003. [25] T .W . Anderson, An Introduction to Multivariat e S tatistical Analysis , New Jersey: John W iley & Sons, Inc. 2003. [26] J.D. Banfield and A. E. Raftery , “Ice Floe Iden tification in Satellite Images Using Mathematical Morphology and Clustering about Pincipal Curves”, J. Amer . Stat. Assoc. , vol. 87, no. 417, pp. 7-16, 1992. [27] P . Delicado, “Another look at principal curves and surfaces”, J. Multivaria te Anal. , vol. 77, no. 1, pp. 84-116, 2001. [28] K. Chang and J. Ghosh, “A Unified Model for Probabili stic Principal Surf aces”, IEEE T rans. Pattern Anal. Mach. Intell. , vol. 23, no. 1, pp. 22-41, 2001. [29] J. Einbeck, G. T utz, and L. E vers, “Local Principal Curves”, Stat. and Comput. , vol. 15, no. 4, pp. 301-313, 2005 [30] R. T ibshirani, “Principal Curves Rev isited”, Stat. and Comput. , vol. 2, no. 4, pp. 183-190, 1992. [31] G. Farin, Curves and Surfaces for Computer Aided Geometric D esign (4th Edition) , California: Acad. Press, Inc., 1997. [32] M. A. Jen k ins and J. F . T raub, “A Three-Stage Algorithm for Real Polynomials Using Quadratic Iteration”, SIAM J. Numer . Anal. , vol. 7, no. 44, pp. 545566, 1970. [33] M.S. Bazaraa, H.D. Sherali, and C.M . Shetty , Nonlinear Program- ming: Theory and Algorithms , New Jersey , Hoboken : John W iley & Sons, Inc., 2006. [34] C. Dwork, R. Kumar , M. Naor , and D. Sivakumar , “Rank Aggre- gation Methods for t he W eb ”, P roc. 10th Int’l Co nf. World Wide Web , pp. 613-622, 2001. [35] R.A. Roger and C.R. Johnson, Matrix Analysis , New Y ork: Camb. Univ . Press, 1985. [36] J.P . Zhang, X.D. W ang, U.Kruger , F .Y . W ang, “Principal Curve Al- gorithms for Partitioning High-Dimensional Data Spaces”, IEE E T rans. Neural Networks , vol. 22, no. 3, pp. 367-380, 2011. [37] L.F . 
Richardson, “The approximate arithmetical solution b y finite differ ences of physical problems involving differential equations, with an applic ation to the str esses in a masonry dam”, Phil os. T rans. Roy . Soc. London Ser . A , vol. 210, pp. 307-357, 1910. [38] G. H. Golub and C. F . v an Loan, Matrix Computati ons, 3rd ed. , Baltimore, MD: Johns Hopkins, 1996. [39] E. Garfield, “The History and Meaning of th e Jour nal Impact Factor ”, J. Amer . Med. Ass oc. , vol. 295, no. 1, p p . 90-93, 2006. [40] C.T . Bergstrom, J.D. W e st, M.A. W iseman,“T he E igenfactor Met- rics”, J. of Neuroscience , vol. 28, no. 45, pp. 11433-11434, 2008.
