An Introduction to Artificial Prediction Markets for Classification
Authors: Adrian Barbu, Nathan Lay
Journal of Machine Learning Research (2012)

Adrian Barbu (abarbu@fsu.edu)
Department of Statistics, Florida State University, Tallahassee, FL 32306, USA

Nathan Lay (nlay@fsu.edu)
Department of Scientific Computing, Florida State University, Tallahassee, FL 32306, USA

Abstract

Prediction markets are used in real life to predict outcomes of interest such as presidential elections. This paper presents a mathematical theory of artificial prediction markets for supervised learning of conditional probability estimators. The artificial prediction market is a novel method for fusing the prediction information of features or trained classifiers, where the fusion result is the contract price on the possible outcomes. The market can be trained online by updating the participants' budgets using training examples. Inspired by the real prediction markets, the equations that govern the market are derived from simple and reasonable assumptions. Efficient numerical algorithms are presented for solving these equations. The obtained artificial prediction market is shown to be a maximum likelihood estimator. It generalizes linear aggregation, existent in boosting and random forest, as well as logistic regression and some kernel methods. Furthermore, the market mechanism allows the aggregation of specialized classifiers that participate only on specific instances. Experimental comparisons show that the artificial prediction markets often outperform random forest and implicit online learning on synthetic data and real UCI datasets. Moreover, an extensive evaluation for pelvic and abdominal lymph node detection in CT data shows that the prediction market improves adaboost's detection rate from 79.6% to 81.2% at 3 false positives/volume.
Keywords: online learning, ensemble methods, supervised learning, random forest, implicit online learning.

1. Introduction

Prediction markets, also known as information markets, are forums that trade contracts that yield payments dependent on the outcome of future events of interest. They have been used in the US Department of Defense (Polk et al., 2003), in health care (Polgreen et al., 2006), to predict presidential elections (Wolfers and Zitzewitz, 2004), and in large corporations to make informed decisions (Cowgill et al., 2008). The prices of the contracts traded in these markets are good approximations of the probability of the outcome of interest (Manski, 2006; Gjerstad and Hall, 2005). Prediction markets are capable of fusing the information that the market participants possess through the contract price. For more details, see Arrow et al. (2008).

In this paper we introduce a mathematical theory for simulating prediction markets numerically for the purpose of supervised learning of probability estimators. We derive the mathematical equations that govern the market and show how they can be solved numerically, or in some cases even analytically. An important part of the prediction market is the contract price, which will be shown to be an estimator of the class-conditional probability given the evidence presented through a feature vector $x$. It is the result of the fusion of the information possessed by the market participants.

The obtained artificial prediction market turns out to have good modeling power. It will be shown in Section 3.1 that it generalizes linear aggregation of classifiers, the basis of boosting (Friedman et al., 2000; Schapire, 2003) and random forest (Breiman, 2001).
It turns out that to obtain linear aggregation, each market participant purchases contracts for the class it predicts, regardless of the market price for that contract. Furthermore, Sections 3.2 and 3.3 present special betting functions that make the prediction market equivalent to a logistic regression and a kernel-based classifier, respectively.

We introduce a new type of classifier that is specialized in modeling certain regions of the feature space. Such classifiers have good accuracy in their region of specialization and are not used in predicting outcomes for observations outside this region. This means that for each observation, a different subset of classifiers will be aggregated to obtain the estimated probability, making the whole approach a sort of ad-hoc aggregation. This is in contrast to the general trend in boosting, where the same classifiers are aggregated for all observations. We give examples of generic specialized classifiers as the leaves of random trees from a random forest.

Experimental validation on thousands of synthetic datasets with Bayes errors ranging from 0 (very easy) to 0.5 (very difficult), as well as on real UCI data, shows that the prediction market using the specialized classifiers outperforms the random forest in prediction and in estimating the true underlying probability. Moreover, we present experimental comparisons on many UCI datasets of the artificial prediction market with the recently introduced implicit online learning (Kulis and Bartlett, 2010) and observe that the market significantly outperforms implicit online learning on some of the datasets and is never outperformed by it.
2. The Artificial Prediction Market for Classification

This work simulates the Iowa electronic market (Wolfers and Zitzewitz, 2004), a real prediction market that can be found online at http://www.biz.uiowa.edu/iem/.

2.1 The Iowa Electronic Market

The Iowa electronic market (Wolfers and Zitzewitz, 2004) is a forum where contracts for future outcomes of interest (e.g. presidential elections) are traded. Contracts are sold for each of the possible outcomes of the event of interest. The contract price fluctuates based on supply and demand. In the Iowa electronic market, a winning contract (that predicted the correct outcome) pays $1 after the outcome is known. Therefore, the contract price will always be between 0 and 1. Our market will simulate this behavior, with contracts for all the possible outcomes, paying 1 if that outcome is realized.

2.2 Setup of the Artificial Prediction Market

If the possible classes (outcomes) are $1, ..., K$, we assume there exist contracts for each class, whose prices form a $K$-dimensional vector $c = (c_1, ..., c_K) \in \Delta \subset [0,1]^K$, where $\Delta$ is the probability simplex $\Delta = \{c \in [0,1]^K, \sum_{k=1}^K c_k = 1\}$. Let $\Omega \subset \mathbb{R}^F$ be the instance or feature space containing all the available information that can be used in making outcome predictions $p(Y = k|x)$, $x \in \Omega$.

The market consists of a number of market participants $(\beta_m, \phi_m(x, c))$, $m = 1, ..., M$. A market participant is a pair $(\beta, \phi(x, c))$ of a budget $\beta$ and a betting function $\phi(x, c): \Omega \times \Delta \to [0,1]^K$, $\phi(x, c) = (\phi^1(x, c), ..., \phi^K(x, c))$. The budget $\beta$ represents the weight or importance of the participant in the market. The betting function tells what percentage of its budget the participant will allocate to purchase contracts for each class, based on the instance $x \in \Omega$ and the market price $c$.
As the market price $c$ is not known in advance, the betting function describes what the participant plans to do for each possible price $c$. The betting functions could be based on trained classifiers $h(x): \Omega \to \Delta$, $h(x) = (h_1(x), ..., h_K(x))$, $\sum_{k=1}^K h_k(x) = 1$, but they can also be related to the feature space in other ways. We will show that logistic regression and kernel methods can also be represented using the artificial prediction market and specific types of betting functions. In order to bet at most the budget $\beta$, the betting functions must satisfy $\sum_{k=1}^K \phi^k(x, c) \le 1$.

[Figure 1: Betting function examples: a) Constant, b) Linear, c) Aggressive, d) Logistic. Shown are $\phi^1(x, 1-c)$ (red), $\phi^2(x, c)$ (blue), and the total amount bet $\phi^1(x, 1-c) + \phi^2(x, c)$ (black dotted). For a) through c), the classifier probability is $h_2(x) = 0.2$.]

Examples of betting functions include the following, also shown in Figure 1:

• Constant betting functions $\phi^k(x, c) = \phi^k(x)$, for example based on trained classifiers, $\phi^k(x, c) = \eta h_k(x)$, where $\eta \in (0, 1]$ is constant.
• Linear betting functions
\[
\phi^k(x, c) = (1 - c_k) h_k(x) \tag{1}
\]

• Aggressive betting functions
\[
\phi^k(x, c) = h_k(x) \begin{cases} 1 & \text{if } c_k \le h_k(x) \\ 0 & \text{if } c_k > h_k(x) + \epsilon \\ \frac{h_k(x) + \epsilon - c_k}{\epsilon} & \text{otherwise} \end{cases} \tag{2}
\]

• Logistic betting functions:
\[
\phi_m^1(x, 1-c) = (1-c)\left(x_m^+ - \ln(1-c)/B\right), \quad \phi_m^2(x, c) = c\left(-x_m^- - \ln c/B\right)
\]
where $x^+ = xI(x > 0)$, $x^- = xI(x < 0)$ and $B = \sum_m \beta_m$.

The betting functions play a similar role to the potential functions from maximum entropy models (Berger et al., 1996; Ratnaparkhi et al., 1996; Zhu et al., 1998), in that they make a conversion from the feature output (or classifier output for some markets) to a common unit of measure (energy for the maximum entropy models and money for the market).

The contract price does not fluctuate in our setup; instead it is governed by Equation (4). This equation guarantees that at this price, the total amount obtained from selling contracts to the participants is equal to the total amount won by the winning contracts, independent of the outcome.

[Figure 2: Online learning and aggregation using the artificial prediction market. Given a feature vector $x$, a set of market participants — each consisting of a classifier $h_m(x)$, a betting function and a budget $\beta_m$ — establishes the market equilibrium price $c$, which is an estimator of $P(Y = k|x)$. The equilibrium price is governed by the Price Equations (4). Online training on an example $(x, y)$ is achieved through Budget Update$(x, y, c)$, shown with gray arrows.]
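As an illustrative sketch (our code, not the authors'), the four betting function families above can be written for the two-class case as follows; `eta`, `eps` and `B` are the free parameters named in the text, and class indices 0/1 stand for classes 1/2:

```python
import numpy as np

# Illustrative sketches of the four betting function families (two-class case).
# h = (h_1(x), h_2(x)) are a trained classifier's class probabilities and
# c = (c_1, c_2) is the market price vector.

def constant_bet(h, c, eta=0.5):
    """Constant betting: ignores the price entirely."""
    return eta * np.asarray(h)

def linear_bet(h, c):
    """Linear betting, eq. (1): bet less on a class as its price rises."""
    return (1.0 - np.asarray(c)) * np.asarray(h)

def aggressive_bet(h, c, eps=0.05):
    """Aggressive betting, eq. (2): bet h_k while c_k <= h_k, then ramp
    linearly down to 0 over a price interval of width eps."""
    h, c = np.asarray(h), np.asarray(c)
    return h * np.clip((h + eps - c) / eps, 0.0, 1.0)

def logistic_bet(x_m, c, B=1.0):
    """Logistic betting for a participant with feature value x_m;
    c is the (scalar) price of class 2."""
    xp, xn = max(x_m, 0.0), min(x_m, 0.0)
    phi1 = (1.0 - c) * (xp - np.log(1.0 - c) / B)
    phi2 = c * (-xn - np.log(c) / B)
    return np.array([phi1, phi2])
```

At price $c_k = h_k(x)$ the aggressive function still bets the full $h_k(x)$ on each class, while for $c_k > h_k(x) + \epsilon$ it bets nothing, matching the three cases of eq. (2).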
2.3 Training the Artificial Prediction Market

Training the market involves initializing all participants with the same budget $\beta_0$ and presenting to the market a set of training examples $(x_i, y_i)$, $i = 1, ..., N$. For each example $(x_i, y_i)$, the participants purchase contracts for the different classes based on the market price $c$ (which is not known yet), and their budgets $\beta_m$ are updated based on the contracts purchased and the true outcome $y_i$. After all training examples have been presented, the participants will have budgets that depend on how well they predicted the correct class $y$ for each training example $x$. This procedure is illustrated in Figure 2.

Algorithm 1 Budget Update$(x, y, c)$
  Input: Training example $(x, y)$, price $c$
  for $m = 1$ to $M$ do
    Update participant $m$'s budget as
\[
\beta_m \leftarrow \beta_m - \sum_{k=1}^K \beta_m \phi_m^k(x, c) + \frac{\beta_m \phi_m^y(x, c)}{c_y} \tag{3}
\]
  end for

Algorithm 2 Prediction Market Training
  Input: Training examples $(x_i, y_i)$, $i = 1, ..., N$
  Initialize all budgets $\beta_m = \beta_0$, $m = 1, ..., M$.
  for each training example $(x_i, y_i)$ do
    Compute the equilibrium price $c_i$ using Eq. (4)
    Run Budget Update$(x_i, y_i, c_i)$
  end for

The budget update procedure subtracts from the budget of each participant the amounts it bets for each class, then rewards each participant based on how many contracts it purchased for the correct class. Participant $m$ purchased $\beta_m \phi_m^k(x, c)$ worth of contracts for class $k$, at price $c_k$; thus the number of contracts purchased for class $k$ is $\beta_m \phi_m^k(x, c)/c_k$. In total, participant $m$'s budget is decreased by the amount $\sum_{k=1}^K \beta_m \phi_m^k(x, c)$ invested in contracts. Since participant $m$ bought $\beta_m \phi_m^y(x, c)/c_y$ contracts for the correct class $y$, it is rewarded the amount $\beta_m \phi_m^y(x, c)/c_y$.
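As a minimal sketch (with illustrative names, not the paper's code), Algorithm 1 for a constant-betting market can be written, and budget conservation checked, as follows:

```python
import numpy as np

# Minimal sketch of Algorithm 1 (Budget Update) for a constant-betting market.
# beta: (M,) budgets; phi: (M, K) bet fractions; c: (K,) equilibrium price.

def budget_update(beta, phi, c, y):
    """Eq. (3): subtract all amounts bet, then pay back the winnings.

    Participant m spends beta_m * phi_m^k on class k, i.e. buys
    beta_m * phi_m^k / c_k contracts, each paying 1 if class y occurs,
    so it is rewarded beta_m * phi_m^y / c_y."""
    spent = beta * phi.sum(axis=1)       # sum_k beta_m phi_m^k(x, c)
    won = beta * phi[:, y] / c[y]        # beta_m phi_m^y(x, c) / c_y
    return beta - spent + won

# For constant betting, the price of eq. (8) makes the total budget
# invariant under the update, whichever class y turns out to be true:
beta = np.array([1.0, 2.0, 0.5])
phi = np.array([[0.3, 0.2], [0.1, 0.6], [0.4, 0.4]])
c = beta @ phi / (beta @ phi).sum()      # eq. (8)
new_beta = budget_update(beta, phi, c, y=0)
```

Running the update with either outcome $y$ leaves the budget sum unchanged, which is exactly the content of Theorem 1 below.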
2.4 The Market Price Equations

Since we are simulating a real market, we assume that the total amount of money collectively owned by the participants is conserved after each training example is presented. Thus the sum of all participants' budgets $\sum_{m=1}^M \beta_m$ should always be $M\beta_0$, the amount given at the beginning. Since any of the outcomes is theoretically possible for each instance, we have the following constraint:

Assumption 1 The total budget $\sum_{m=1}^M \beta_m$ must be conserved independent of the outcome $y$.

This condition transforms into a set of equations that constrain the market price, which we call the price equations. The market price $c$ also obeys $\sum_{k=1}^K c_k = 1$. Let $B(x, c) = \sum_{m=1}^M \sum_{k=1}^K \beta_m \phi_m^k(x, c)$ be the total bet for observation $x$ at price $c$. We have

Theorem 1 (Price Equations) The total budget $\sum_{m=1}^M \beta_m$ is conserved after the Budget Update$(x, y, c)$, independent of the outcome $y$, if and only if $c_k > 0$, $k = 1, ..., K$ and
\[
\sum_{m=1}^M \beta_m \phi_m^k(x, c) = c_k B(x, c), \quad \forall k = 1, ..., K \tag{4}
\]

The proof is given in the Appendix.

2.5 Price Uniqueness

The price equations, together with the equation $\sum_{k=1}^K c_k = 1$, are enough to uniquely determine the market price $c$ under mild assumptions on the betting functions $\phi^k(x, c)$. Observe that if $c_k = 0$ for some $k$, then the contract costs 0 and pays 1, so there is everything to win. In this case, one should have $\phi^k(x, c) > 0$. This suggests a class of betting functions $\phi^k(x, c_k)$, depending only on the price $c_k$, that are continuous and monotonically non-increasing in $c_k$. If all $\phi_m^k(x, c_k)$, $m = 1, ..., M$ are continuous and monotonically non-increasing in $c_k$ with $\phi_m^k(x, 0) > 0$, then $f_k(c_k) = \frac{1}{c_k}\sum_{m=1}^M \beta_m \phi_m^k(x, c_k)$ is continuous and strictly decreasing in $c_k$ as long as $f_k(c_k) > 0$.
To obtain conditions for price uniqueness, we use the following functions
\[
f_k(c_k) = \frac{1}{c_k} \sum_{m=1}^M \beta_m \phi_m^k(x, c_k), \quad k = 1, ..., K \tag{5}
\]

Remark 2 If all $f_k(c_k)$ are continuous and strictly decreasing in $c_k$ as long as $f_k(c_k) > 0$, then for every $n > 0$, $n \ge n_k = f_k(1)$, there is a unique $c_k = c_k(n)$ that satisfies $f_k(c_k) = n$.

The proof is given in the Appendix. To guarantee price uniqueness, we need at least one market participant to satisfy the following

Assumption 2 The total bet of participant $(\beta_m, \phi_m(x, c))$ is positive inside the simplex $\Delta$, i.e.
\[
\sum_{j=1}^K \phi_m^j(x, c_j) > 0, \quad \forall c \in (0,1)^K, \ \sum_{j=1}^K c_j = 1. \tag{6}
\]

Then we have the following result, also proved in the Appendix.

Theorem 3 Assume all betting functions $\phi_m^k(x, c_k)$, $m = 1, ..., M$, $k = 1, ..., K$ are continuous, with $\phi^k(x, 0) > 0$, and $\phi_m^k(x, c)/c$ is strictly decreasing in $c$ as long as $\phi_m^k(x, c) > 0$. If the betting function $\phi_m(x, c)$ of at least one participant with $\beta_m > 0$ satisfies Assumption 2, then for the Budget Update$(x, y, c)$ there is a unique price $c = (c_1, ..., c_K) \in (0,1)^K \cap \Delta$ such that the total budget $\sum_{m=1}^M \beta_m$ is conserved.

Observe that all four betting functions defined in Section 2.2 (constant, linear, aggressive and logistic) satisfy the conditions of Theorem 3, so there is a unique price that conserves the budget.

2.6 Solving the Market Price Equations

In practice, a double bisection algorithm could be used to find the equilibrium price, computing each $c_k(n)$ by the bisection method and employing another bisection algorithm to find the $n$ such that the price condition $\sum_{k=1}^K c_k(n) = 1$ holds.
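The double bisection can be sketched as follows (a simplified illustration with our own names, assuming betting functions that depend only on $c_k$, as in Section 2.5; the outer search over $n$ is confined to $(0, \sum_m \beta_m]$, using the upper bound derived below):

```python
# Sketch of the double bisection. The inner bisection inverts each strictly
# decreasing function f_k of eq. (5) at level n; the outer bisection adjusts n
# until the resulting prices sum to 1. phis[m][k] is participant m's betting
# function for class k, taken to depend only on the scalar price c_k.

def invert_fk(beta, phis, k, n, iters=60):
    """Find c in (0, 1] with f_k(c) = n, f_k strictly decreasing in c."""
    fk = lambda c: sum(b * p[k](c) for b, p in zip(beta, phis)) / c
    lo, hi = 1e-12, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fk(mid) > n:
            lo = mid            # f_k still above n: the root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equilibrium_price(beta, phis, K, iters=60):
    """Outer bisection on n: sum_k c_k(n) decreases as n grows."""
    n_lo, n_hi = 1e-9, sum(beta)
    for _ in range(iters):
        n = 0.5 * (n_lo + n_hi)
        s = sum(invert_fk(beta, phis, k, n) for k in range(K))
        if s > 1.0:
            n_lo = n            # prices too large: raise n
        else:
            n_hi = n
    n = 0.5 * (n_lo + n_hi)
    return [invert_fk(beta, phis, k, n) for k in range(K)]

# Example: two-class linear betting phi_m^k(c_k) = (1 - c_k) h_m^k
h = [[0.7, 0.3], [0.2, 0.8]]
beta = [1.0, 3.0]
phis = [[(lambda c, v=v: (1.0 - c) * v) for v in row] for row in h]
c = equilibrium_price(beta, phis, K=2)
```

The returned price can be checked directly against the price equations (4): the money bet on each class equals $c_k$ times the total bet.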
Observe that the $n$ satisfying $\sum_{k=1}^K c_k(n) = 1$ can be bounded from above by
\[
n = n \sum_{k=1}^K c_k(n) = \sum_{k=1}^K c_k(n) f_k(c_k(n)) = \sum_{k=1}^K \sum_{m=1}^M \beta_m \phi_m^k(x, c) \le \sum_{m=1}^M \beta_m
\]
because for each $m$, $\sum_{k=1}^K \phi_m^k(x, c) \le 1$.

A potentially faster alternative to the double bisection method is the Mann iteration (Mann, 1953), described in Algorithm 3. The price equations can be viewed as a fixed point equation $F(c) = c$, where $F(c) = \frac{1}{n}(f_1(c), ..., f_K(c))$ with $f_k(c) = \sum_{m=1}^M \beta_m \phi_m^k(x, c_k)$. The Mann iteration is a fixed point algorithm, which makes weighted update steps
\[
c_{t+1} = \left(1 - \frac{1}{t}\right) c_t + \frac{1}{t} F(c_t)
\]
The Mann iteration is guaranteed to converge for contractions or pseudo-contractions. However, we observed experimentally that it usually converges in only a few (up to 10) steps, making it about 100–1000 times faster than the double bisection algorithm. If, after a small number of steps, the Mann iteration has not converged, the double bisection algorithm is used on that instance to compute the equilibrium price. However, this happens on less than 0.1% of the instances.

Algorithm 3 Market Price by Mann Iteration
  Initialize $i = 1$, $c_k = \frac{1}{K}$, $k = 1, ..., K$
  repeat
    $f_k = \sum_m \beta_m \phi_m^k(x, c)$
    $n = \sum_k f_k$
    if $n \ne 0$ then
      $f_k \leftarrow f_k/n$
      $r_k = f_k - c_k$
      $c_k \leftarrow \frac{(i-1) c_k + f_k}{i}$
    end if
    $i \leftarrow i + 1$
  until $\sum_k |r_k| \le \epsilon$ or $n = 0$ or $i > i_{max}$

2.7 Two-class Formulation

For the two-class problem, i.e. $K = 2$, the budget equation can be simplified by writing $c = (1 - c, c)$, obtaining the two-class market price equation
\[
(1 - c) \sum_{m=1}^M \beta_m \phi_m^2(x, c) - c \sum_{m=1}^M \beta_m \phi_m^1(x, 1-c) = 0 \tag{7}
\]
This can be solved numerically directly in $c$ using the bisection method.
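Both solvers from this section can be sketched for the two-class case (our illustrative code): the Mann iteration of Algorithm 3 and the direct bisection of eq. (7). On a linear-betting example they find the same price:

```python
import numpy as np

# Sketches of the two price solvers for the two-class case with linear betting
# phi_m^k(x, c) = (1 - c_k) h_m^k(x). Names are illustrative, not the paper's.

h = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])   # three participants
beta = np.array([1.0, 1.0, 2.0])

def phi(c):                        # (M, 2) matrix of bets at price c = (c1, c2)
    return (1.0 - c) * h

def mann_price(K=2, eps=1e-12, i_max=1000):
    """Algorithm 3: averaged fixed-point iteration on the price equations."""
    c = np.full(K, 1.0 / K)
    r = np.ones(K)
    i = 1
    while np.abs(r).sum() > eps and i <= i_max:
        f = beta @ phi(c)          # f_k = sum_m beta_m phi_m^k(x, c)
        n = f.sum()
        if n == 0:
            break
        f /= n                     # normalize candidate prices to sum to 1
        r = f - c
        c = ((i - 1) * c + f) / i  # Mann step: c <- (1 - 1/i) c + (1/i) F(c)
        i += 1
    return c

def bisection_price(iters=80):
    """Solve the two-class price equation (7) for c = c_2 by bisection."""
    def g(c):                      # left-hand side of eq. (7)
        cc = np.array([1.0 - c, c])
        p = phi(cc)
        return (1 - c) * (beta @ p[:, 1]) - c * (beta @ p[:, 0])
    lo, hi = 1e-12, 1.0 - 1e-12    # g(lo) > 0 > g(hi) for these bets
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c_mann = mann_price()
c2 = bisection_price()
```

On this example the Mann iteration settles within a handful of steps, consistent with the behavior reported above.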
Again, the solution is unique if $\phi_m^k(x, c_k)$, $m = 1, ..., M$, $k = 1, 2$ are continuous, monotonically non-increasing and obey condition (6). Moreover, the solution is guaranteed to exist if there exist $m, m'$ with $\beta_m > 0$, $\beta_{m'} > 0$ such that $\phi_m^2(x, 0) > 0$, $\phi_{m'}^1(x, 1) > 0$.

3. Relation to Existing Supervised Learning Methods

There is a large degree of flexibility in choosing the betting functions $\phi_m(x, c)$. Different betting functions give different ways to fuse the market participants. In what follows we prove that, by choosing specific betting functions, the artificial prediction market behaves like a linear aggregator or logistic regressor, or that it can be used as a kernel-based classifier.

3.1 Constant Betting and Linear Aggregation

For markets with constant betting functions, $\phi_m^k(x, c) = \phi_m^k(x)$, the market price has a simple analytic formula, proved in the Appendix.

Theorem 4 (Constant Betting) If all betting functions are constant, $\phi_m^k(x, c) = \phi_m^k(x)$, then the equilibrium price is
\[
c = \frac{\sum_{m=1}^M \beta_m \phi_m(x)}{\sum_{m=1}^M \sum_{k=1}^K \beta_m \phi_m^k(x)} \tag{8}
\]
Furthermore, if the betting functions are based on classifiers, $\phi_m^k(x, c) = \eta h_m^k(x)$, then the equilibrium price is obtained by linear aggregation
\[
c = \frac{\sum_{m=1}^M \beta_m h_m(x)}{\sum_{m=1}^M \beta_m} = \sum_m \alpha_m h_m(x) \tag{9}
\]

This way the artificial prediction market can model linear aggregation of classifiers. Methods such as Adaboost (Freund and Schapire, 1996; Friedman et al., 2000; Schapire, 2003) and Random Forest (Breiman, 2001) also aggregate their constituents using linear aggregation. However, there is more to Adaboost and Random Forest than linear aggregation, since it is very important how to construct the constituents that are aggregated.
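Equation (9) is easy to verify numerically. In this illustrative snippet (names ours), the constant-betting price with $\phi = \eta h$ reduces to the budget-weighted average of the classifier outputs, independent of $\eta$:

```python
import numpy as np

# Numerical check of Theorem 4: with constant classifier-based betting
# phi_m^k = eta * h_m^k, the equilibrium price (8) is the linear aggregation (9).

h = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])   # M=3 classifiers, K=2
beta = np.array([0.5, 1.5, 1.0])
eta = 0.25

phi = eta * h                                        # constant betting
c = beta @ phi / (beta @ phi).sum()                  # eq. (8)
c_linear = beta @ h / beta.sum()                     # eq. (9)
```

The constant $\eta$ cancels in the normalization, which is why it does not appear in eq. (9).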
In particular, the random forest (Breiman, 2001) can be viewed as an artificial prediction market with constant betting (linear aggregation) where all participants are random trees with the same budget $\beta_m = 1$, $m = 1, ..., M$.

We also obtain an analytic form of the budget update:
\[
\beta_m \leftarrow \beta_m - \beta_m \sum_{k=1}^K \phi_m^k(x) + \beta_m \phi_m^y(x) \frac{\sum_{j=1}^M \sum_{k=1}^K \beta_j \phi_j^k(x)}{\sum_{j=1}^M \beta_j \phi_j^y(x)}
\]
which for classifier-based betting functions $\phi_m^k(x, c) = \eta h_m^k(x)$ becomes:
\[
\beta_m \leftarrow \beta_m (1 - \eta) + \eta \beta_m h_m^y(x) \frac{\sum_{j=1}^M \beta_j}{\sum_{j=1}^M \beta_j h_j^y(x)}
\]
This is a novel online update rule for linear aggregation.

3.2 Prediction Markets for Logistic Regression

A variant of logistic regression can also be modeled using prediction markets, with the following betting functions
\[
\phi_m^1(x, 1-c) = (1-c)\left(x_m^+ - \frac{1}{B}\ln(1-c)\right), \quad \phi_m^2(x, c) = c\left(-x_m^- - \frac{1}{B}\ln c\right)
\]
where $x^+ = xI(x > 0)$, $x^- = xI(x < 0)$ and $B = \sum_m \beta_m$. The two-class equation (7) becomes $\sum_{m=1}^M \beta_m c(1-c)(x_m - \ln(1-c)/B + \ln c/B) = 0$, so $\ln\frac{1-c}{c} = \sum_{m=1}^M \beta_m x_m$, which gives the logistic regression model
\[
\hat{p}(Y = 1|x) = c = \frac{1}{1 + \exp(\sum_{m=1}^M \beta_m x_m)}
\]
The budget update equation
\[
\beta_m \leftarrow \beta_m - \eta \beta_m \left[(1-c) x_m^+ + c x_m^- - H(c)/B\right] + \eta \beta_m u_y(c)
\]
is obtained, where $u_1(c) = x_m^+ - \ln(1-c)/B$, $u_2(c) = -x_m^- - \ln(c)/B$. Writing $x_\beta = \sum_{m=1}^M \beta_m x_m$, the budget update can be rearranged to
\[
\beta_m \leftarrow \beta_m - \eta \beta_m \left(x_m - \frac{x_\beta}{B}\right)\left(y - \frac{1}{1 + \exp(x_\beta)}\right). \tag{10}
\]
This equation resembles the standard per-observation update equation for online logistic regression:
\[
\beta_m \leftarrow \beta_m - \eta x_m \left(y - \frac{1}{1 + \exp(x_\beta)}\right), \tag{11}
\]
with two differences: the term $x_\beta/B$ ensures the budgets always sum to $B$, while the factor $\beta_m$ makes sure that $\beta_m \ge 0$. The update from eq. (10), like eq.
(11), tries to increase $|x_\beta|$, but does so subject to the constraints $\beta_m \ge 0$, $m = 1, ..., M$ and $\sum_{m=1}^M \beta_m = B$. Observe also that multiplying $\beta$ by a constant does not change the decision line of the logistic regression.

3.3 Relation to Kernel Methods

Here we construct a market participant from each training example $(x_n, y_n)$, $n = 1, ..., N$; thus the number of participants $M$ is the number $N$ of training examples. We construct a participant from training example $(x_m, y_m)$ by defining the following betting functions in terms of $u_m(x) = \frac{x_m^T x}{\|x_m\| \|x\|}$:
\[
\phi_m^{y_m}(x) = u_m(x)^+ = \begin{cases} u_m(x) & \text{if } u_m(x) \ge 0 \\ 0 & \text{else} \end{cases}, \quad \phi_m^{3-y_m}(x) = -u_m(x)^- = \begin{cases} 0 & \text{if } u_m(x) \ge 0 \\ -u_m(x) & \text{else} \end{cases} \tag{12}
\]
Observe that these betting functions do not depend on the contract price $c$, so this is a constant market, but not one based on classifiers. The two-class price equation gives
\[
c = \frac{\sum_m \beta_m \phi_m^2(x)}{\sum_m \beta_m (\phi_m^1(x) + \phi_m^2(x))} = \frac{\sum_m \beta_m [(y_m - 1) u_m(x) - u_m(x)^-]}{\sum_m \beta_m |u_m(x)|}
\]
since it can be verified that $\phi_m^2(x) = (y_m - 1) u_m(x) - u_m(x)^-$ and $\phi_m^1(x) + \phi_m^2(x) = |u_m(x)|$.

The decision rule $c > 0.5$ becomes $\sum_m \beta_m \phi_m^2(x) > \sum_m \beta_m \phi_m^1(x)$, or $\sum_m \beta_m (\phi_m^2(x) - \phi_m^1(x)) > 0$. Since $\phi_m^2(x) - \phi_m^1(x) = (2y_m - 3) u_m(x) = (2y_m - 3) \frac{x_m^T x}{\|x_m\| \|x\|}$ (since in our setup $y_m \in \{1, 2\}$), we obtain the SVM type of decision rule, with $\alpha_m = \beta_m/\|x_m\|$:
\[
h(x) = \mathrm{sgn}\left(\sum_{m=1}^M \alpha_m (2y_m - 3)\, x_m^T x\right)
\]
The budget update becomes in this case:
\[
\beta_m \leftarrow \beta_m - \eta \beta_m |u_m(x)| + \eta \beta_m \frac{\phi_m^y(x)}{c_y}
\]
The same reasoning carries over for $u_m(x) = K(x_m, x)$ with the RBF kernel $K(x_m, x) = \exp(-\|x_m - x\|^2/\sigma^2)$. Figure 3, left, shows an example of the decision boundary of a market trained online with an RBF kernel with $\sigma = 0.2$ on 1000 examples uniformly sampled in the $[-1, 1]^2$ interval. Figure 3, right, shows the estimated probability $\hat{p}(y = 1|x)$.

[Figure 3: Left: 1000 training examples and the learned decision boundary for an RBF kernel-based market from eq. (12) with $\sigma = 0.1$. Right: estimated probability function.]

This example shows that the artificial prediction market is an online method with enough modeling power to represent complex decision boundaries, such as those given by RBF kernels, through the betting functions of the participants. It will be shown in Theorem 5 that the constant market maximizes the likelihood, so it is not clear yet what can be done to obtain a small number of support vectors as in the online kernel-based methods (Bordes et al., 2005; Cauwenberghs and Poggio, 2001; Kivinen et al., 2004).

4. Prediction Markets and Maximum Likelihood

This section discusses what type of optimization is performed during the budget update from eq. (3). Specifically, we prove that the artificial prediction markets perform maximum likelihood learning of the parameters by a version of gradient ascent.

Consider the reparametrization $\gamma = (\gamma_1, ..., \gamma_M) = (\sqrt{\beta_1}, ..., \sqrt{\beta_M})$. The market price $c(x) = (c_1(x), ..., c_K(x))$ is an estimate of the class probability $p(y = k|x)$ for each instance $x \in \Omega$. Thus, for a set of training observations $(x_i, y_i)$, $i = 1, ..., N$, since $\hat{p}(y = y_i|x_i) = c_{y_i}(x_i)$, the (normalized) log-likelihood function is
\[
L(\gamma) = \frac{1}{N} \sum_{i=1}^N \ln \hat{p}(y = y_i|x_i) = \frac{1}{N} \sum_{i=1}^N \ln c_{y_i}(x_i) \tag{13}
\]
We will again use the total amount bet $B(x, c) = \sum_{m=1}^M \sum_{k=1}^K \beta_m \phi_m^k(x, c)$ for observation $x$ at market price $c$.
We will first focus on the constant market $\phi_m^k(x, c) = \phi_m^k(x)$, in which case $B(x, c) = B(x) = \sum_{m=1}^M \sum_{k=1}^K \beta_m \phi_m^k(x)$. We introduce a batch update on all the training examples $(x_i, y_i)$, $i = 1, ..., N$:
\[
\beta_m \leftarrow \beta_m + \beta_m \frac{\eta}{N} \sum_{i=1}^N \frac{1}{B(x_i)} \left( \frac{\phi_m^{y_i}(x_i)}{c_{y_i}(x_i)} - \sum_{k=1}^K \phi_m^k(x_i) \right). \tag{14}
\]
Equation (14) can be viewed as presenting all observations $(x_i, y_i)$ to the market simultaneously instead of sequentially. The following statement is proved in the Appendix.

Theorem 5 (ML for the constant market) The update (14) for the constant market maximizes the likelihood (13) by gradient ascent on $\gamma$, subject to the constraint $\sum_{m=1}^M \gamma_m^2 = 1$. The incremental update
\[
\beta_m \leftarrow \beta_m + \frac{\beta_m \eta}{B(x_i)} \left( \frac{\phi_m^{y_i}(x_i)}{c_{y_i}(x_i)} - \sum_{k=1}^K \phi_m^k(x_i) \right) \tag{15}
\]
maximizes the likelihood (13) by constrained stochastic gradient ascent.

In the general case of non-constant betting functions, the log-likelihood is
\[
L(\gamma) = \sum_{i=1}^N \log c_{y_i}(x_i) = \sum_{i=1}^N \log \sum_{m=1}^M \gamma_m^2 \phi_m^{y_i}(x_i, c(x_i)) - \sum_{i=1}^N \log \sum_{k=1}^K \sum_{m=1}^M \gamma_m^2 \phi_m^k(x_i, c(x_i)) \tag{16}
\]
If we ignore the dependence of $\phi_m^k(x_i, c(x_i))$ on $\gamma$ in (16) and approximate the gradient as
\[
\frac{\partial L(\gamma)}{\partial \gamma_j} \approx \sum_{i=1}^N \left( \frac{\gamma_j \phi_j^{y_i}(x_i, c(x_i))}{\sum_{m=1}^M \gamma_m^2 \phi_m^{y_i}(x_i, c(x_i))} - \frac{\gamma_j \sum_{k=1}^K \phi_j^k(x_i, c(x_i))}{\sum_{k=1}^K \sum_{m=1}^M \gamma_m^2 \phi_m^k(x_i, c(x_i))} \right)
\]
then the proof of Theorem 5 follows through and we obtain the following market update
\[
\beta_m \leftarrow \beta_m + \frac{\beta_m \eta}{B(x, c)} \left[ \frac{\phi_m^y(x, c)}{c_y} - \sum_{k=1}^K \phi_m^k(x, c) \right], \quad m = 1, ..., M \tag{17}
\]
This way we obtain only an approximate statement in the general case.

Remark 6 (Maximum Likelihood) The prediction market update (17) finds an approximate maximum of the likelihood (13), subject to the constraint $\sum_{m=1}^M \gamma_m^2 = 1$, by an approximate constrained stochastic gradient ascent.
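For the constant market, the incremental update (15) can be sketched and seen to behave as a constrained likelihood ascent on a toy problem (our illustrative setup: two constant classifier-based participants, one well calibrated and one not):

```python
import numpy as np

# Sketch of the incremental ML update (15) for a constant market with
# classifier-based betting phi_m = h_m(x). On a toy stream where classifier 0
# is well calibrated and classifier 1 is not, the update shifts budget toward
# classifier 0 and increases the log-likelihood (13), while conserving the
# total budget.

def price(beta, phi):
    return beta @ phi / (beta @ phi).sum()           # eq. (8)

def ml_update(beta, phi, y, eta):
    c = price(beta, phi)
    B = (beta * phi.sum(axis=1)).sum()               # total bet B(x)
    grad = phi[:, y] / c[y] - phi.sum(axis=1)        # bracket of eq. (15)
    return beta + beta * eta / B * grad

# Toy stream: the true class is always 0; h0 says 0.9, h1 says only 0.4.
phi = np.array([[0.9, 0.1], [0.4, 0.6]])
beta = np.array([1.0, 1.0])

def loglik(beta):
    return np.log(price(beta, phi)[0])

ll_before = loglik(beta)
for _ in range(200):
    beta = ml_update(beta, phi, y=0, eta=0.1)
ll_after = loglik(beta)
```

As the budget flows from participant 1 to participant 0, the market price for the true class approaches 0.9 and the log-likelihood rises accordingly; the budget sum stays fixed, which is the constraint $\sum_m \gamma_m^2 = \text{const}$ in Theorem 5.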
Observe that the updates from (15) and (17) differ from the update (3) by using an adaptive step size $\eta/B(x, c)$ instead of the fixed step size 1.

It is easy to check that maximizing the likelihood is equivalent to minimizing an approximation of the expected KL divergence to the true distribution
\[
E_\Omega[KL(p(y|x), c_y(x))] = \int_\Omega p(x) \int_Y p(y|x) \log \frac{p(y|x)}{c_y(x)} \, dy \, dx
\]
obtained using the training set as Monte Carlo samples from $p(x, y)$.

In many cases the number of negative examples is much larger than the number of positive examples, and it is desired to maximize a weighted log-likelihood
\[
L(\gamma) = \frac{1}{N} \sum_{i=1}^N w(x_i) \ln c_{y_i}(x_i)
\]
This can be achieved (exactly for constant betting and approximately in general) using the weighted update rule
\[
\beta_m \leftarrow \beta_m + \eta w(x) \frac{\beta_m}{B(x, c)} \left[ \frac{\phi_m^y(x, c)}{c_y} - \sum_{k=1}^K \phi_m^k(x, c) \right], \quad m = 1, ..., M \tag{18}
\]
The parameter $\eta$ and the number of training epochs can be used to control how close the budgets $\beta$ are to the ML optimum, and this way avoid overfitting the training data.

An important issue for the real prediction markets is the efficient market hypothesis, which states that the market price fuses in an optimal way the information available to the market participants (Fama, 1970; Basu, 1977; Malkiel, 2003). From Theorem 5 we can draw the following conclusions for the artificial prediction market with constant betting:

1. In general, an untrained market (in which the budgets have not been updated based on training data) will not satisfy the efficient market hypothesis.

2. The market trained with a large amount of representative training data and small $\eta$ satisfies the efficient market hypothesis.

5. Specialized Classifiers

The prediction market is capable of fusing the information available to the market participants, which can be trained classifiers.
These classifiers are usually suboptimal, due to computational or complexity constraints, to the way they are trained, or other reasons. In boosting, all selected classifiers are aggregated for each instance $x \in \Omega$. This can be detrimental, since some classifiers could perform poorly on subregions of the instance space $\Omega$, degrading the performance of the boosted classifier.

In many situations there exist simple rules that hold on subsets of $\Omega$ but not on the entire $\Omega$. Classifiers trained on such subsets $D_i \subset \Omega$ would have small misclassification error on $D_i$ but unpredictable behavior outside of $D_i$. The artificial prediction market can aggregate such classifiers, transformed into participants that don't bet anything outside of their domain of expertise $D_i \subset \Omega$. This way, for different instances $x \in \Omega$, different subsets of participants will contribute to the resulting probability estimate. We call these specialized classifiers, since they only give their opinion, through betting, on observations that fall inside their domain of specialization. Thus a specialized classifier with a domain $D$ would have a betting function of the form:
\[
\phi^k(x, c) = \begin{cases} \varphi^k(x, c) & \text{if } x \in D \\ 0 & \text{else} \end{cases} \tag{19}
\]
This idea is illustrated on the following simple 2D example of a triangular region, shown in Figure 4, with positive examples inside the triangle and negatives outside. An accurate classifier for that region can be constructed using six market participants, one for each half-plane determined by each side of the triangle.

[Figure 4: A perfect classifier can be constructed for the triangular region from a market of six specialized classifiers that only bid on a half-plane determined by one side of the triangle. Three of these specialized classifiers have 100% accuracy while the other three have low accuracy.]
Nevertheless, the market is capable of obtaining 100% overall accuracy.

Three of these classifiers correspond to the three half-planes that lie outside the triangle. These participants have 100% accuracy in predicting the observations, all negatives, that fall in their half-planes, and they do not bet anything outside of their half-planes. The other three classifiers are not very good and will have smaller budgets. On an observation that lies outside the triangle, one or two of the high-budget classifiers will bet a large amount on the correct prediction and will drive the output probability. When an observation falls inside the triangle, only the small-budget classifiers will participate, but they will be in agreement and still output the correct probability. Evaluating this market on 1000 positives and 1000 negatives showed that it obtained a prediction accuracy of 100%.

There are many ways to construct specialized classifiers, depending on the problem setup. In natural language processing, for example, a specialized classifier could be based on grammar rules, which work very well in many cases, but not always. We propose two generic sets of specialized classifiers. The first set consists of the leaves of the random trees of a random forest, while the second set consists of the leaves of the decision trees trained by adaboost. Each leaf $f$ is a rule that defines a domain $D_f = \{\mathbf{x} \in \Omega,\ f(\mathbf{x}) = 1\}$ of the instances that obey that rule. The betting function of this specialized classifier is given in eq. (19), where $\varphi^k_f(\mathbf{x}, c)$ is based on the associated classifier $h^k_f(\mathbf{x}) = n^k_f/n_f$, obtaining constant, linear and aggressive versions. Here $n^k_f$ is the number of training instances of class $k$ that obey rule $f$, and $n_f = \sum_k n^k_f$. By the way the random trees are trained, usually $n_f = n^k_f$ for some $k$.
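The triangle market above can be simulated in a few lines. This sketch is ours, not the paper's implementation: we give the three "outside" half-plane participants full confidence in the negative class, which plays the role that their large trained budgets play in the text, and let the three weak "inside" participants lean positive at an arbitrary 0.6:

```python
import numpy as np

# Triangle with vertices (0,0), (1,0), (0,1); the inside is the positive class.
# g(p) >= 0 defines the inner half-plane of each side.
sides = [lambda p: p[0], lambda p: p[1], lambda p: 1 - p[0] - p[1]]

def active_bets(p):
    """Six specialized participants (eq. (19)): for each side, the inner
    participant bets weakly positive, the outer one surely negative;
    exactly one of each pair is active for any point p."""
    return [np.array([0.4, 0.6]) if g(p) >= 0 else np.array([1.0, 0.0])
            for g in sides]

def market_price(p):
    # Equal budgets and constant betting: average the active participants.
    return np.mean(active_bets(p), axis=0)

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 2, size=(2000, 2))
labels = [x >= 0 and y >= 0 and x + y <= 1 for x, y in pts]
preds = [market_price(p)[1] > 0.5 for p in pts]
acc = float(np.mean([a == b for a, b in zip(labels, preds)]))
print(acc)  # 1.0
```

With equal budgets the outer participants' certainty substitutes for training; in the paper the same effect is obtained by the market shrinking the budgets of the three weak participants.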
In Friedman and Popescu (2008) these rules were combined using a linear aggregation method similar to boosting. One could also use other nodes of the random tree, not necessarily the leaves, for the same purpose.

It can be verified using eq. (8) that constant specialized betting is the linear aggregation of the participants that are currently betting. This is different from the linear aggregation of all the classifiers.

6. Related Work

This work borrows prediction market ideas from Economics and brings them to Machine Learning for supervised aggregation of classifiers, or of features in general.

Related work in Economics. Recent work in Economics (Manski, 2006; Perols et al., 2009; Plott et al., 2003) investigates the information fusion of prediction markets. However, none of these works aims at using prediction markets as a tool for learning class probability estimators in a supervised manner.

Some works (Perols et al., 2009; Plott et al., 2003) focus on parimutuel betting mechanisms for combining classifiers. In parimutuel betting, contracts are sold for all possible outcomes (classes) and the entire budget (minus fees) is divided between the participants that purchased contracts for the winning outcome. Parimutuel betting has a different way of fusing information than the Iowa prediction market.

The information based decision fusion (Perols et al., 2009) is a first version of an artificial prediction market. It aggregates classifiers through the parimutuel betting mechanism, using a loop that updates the odds for each outcome and takes updated bets until convergence. This ensures a stronger information fusion than without updating the odds.

Our work is different in many ways. First, our work uses the Iowa electronic market instead of parimutuel betting with odds-updating.
Using the Iowa model allowed us to obtain a closed-form equation for the market price in some important cases. It also allowed us to relate the market to some existing learning methods. Second, our work presents a multi-class formulation of the prediction markets, as opposed to the two-class approach presented in (Perols et al., 2009). Third, the analytical market price formulation allowed us to prove that the constant market performs maximum likelihood learning. Finally, our work evaluates the prediction market not only in terms of classification accuracy but also in the accuracy of predicting the exact class conditional probability given the evidence.

Related work in Machine Learning. Implicit online learning (Kulis and Bartlett, 2010) presents a generic online learning method that balances a "conservativeness" term that discourages large changes in the model against a "correctness" term that tries to adapt to the new observation. Instead of using a linear approximation as other online methods do, this approach solves an implicit equation to find the new model. In this regard, the prediction market also solves an implicit equation at each step to find the new model, but it does not balance two criteria like the implicit online learning method. Instead, it performs maximum likelihood estimation, which is consistent and asymptotically optimal. In experiments, we observed that the prediction market obtains significantly smaller misclassification errors on many datasets compared to implicit online learning.

Specialization can be viewed as a type of reject rule (Chow, 1970; Tortorella, 2004). However, instead of having a reject rule for the aggregated classifier, each market participant has its own reject rule to decide on which observations to contribute to the aggregation.
ROC-based reject rules (Tortorella, 2004) could be found for each market participant and used for defining its domain of specialization. Moreover, the market can give an overall reject rule on hopeless instances that fall outside the specialization domain of all participants. No participant will bet on such an instance, and this can be detected as an overall rejection of that instance. If the overall reject option is not desired, one could avoid having instances on which no classifier bets by including in the market a set of participants that are all the leaves of a number of random trees. This way, by the design of the random trees, it is guaranteed that each instance will fall into at least one leaf, i.e. participant, hence the instance will not be rejected.

A simplified specialization approach is taken in delegated classifiers (Ferri et al., 2004). A first classifier decides on the relatively easy instances and delegates the more difficult examples to a second classifier. This approach can be seen as a market with two non-overlapping participants, where the specialization domain of the second participant is defined by the first participant. The market takes a more generic approach, where each classifier decides independently on which instances to bet.

The same type of leaves of random trees (i.e. rules) was used by Friedman and Popescu (2008) for linear aggregation. However, our work presents a more generic aggregation method through the prediction market, with linear aggregation as a particular case, and we view the rules as one sort of specialized classifiers that only bid in a subdomain of the feature space.
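The overall reject rule described above is simple to detect in code: an instance on which no participant bets is rejected. A minimal sketch, where our assumed representation of a participant is a (domain predicate, probability vector) pair and budgets are equal:

```python
import numpy as np

def market_output(participants, x):
    """Specialized participants (eq. (19)) abstain outside their domains;
    an instance on which nobody bets is an overall reject (returns None)."""
    active = [probs for domain, probs in participants if domain(x)]
    if not active:
        return None  # overall reject: x is outside every domain
    # equal budgets + constant betting -> average the active estimates
    return np.mean(active, axis=0)

participants = [
    (lambda x: x < 0, np.array([0.8, 0.2])),   # expert on negatives
    (lambda x: x > 1, np.array([0.1, 0.9])),   # expert on positives
]
print(market_output(participants, -2.0))  # [0.8 0.2]
print(market_output(participants, 0.5))   # None: no participant bets
```

Adding random-tree leaves as participants, as suggested above, guarantees `active` is never empty.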
Our earlier work (Lay and Barbu, 2010) focused only on the aggregation of classifiers and did not discuss the connection between the artificial prediction markets and logistic regression, kernel methods and maximum likelihood learning. Moreover, it did not include an experimental comparison with implicit online learning and adaboost.

Two other prediction market mechanisms have recently been proposed in the literature. The first one (Chen and Vaughan, 2010; Chen et al., 2011) has the participants entering the market sequentially. Each participant is paid by an entity called the market maker according to a predefined scoring rule. The second prediction market mechanism is the machine learning market (Storkey, 2011; Storkey et al., 2012), which deals with all participants simultaneously. Each market participant purchases contracts for the possible outcomes to maximize its own utility function. The equilibrium price of the contracts is computed by an optimization procedure. Different utility functions result in different forms of the equilibrium price, such as the mean, median, or geometric mean of the participants' beliefs.

7. Experimental Validation

In this section we present experimental comparisons of the performance of different artificial prediction markets with random forest, adaboost and implicit online learning (Kulis and Bartlett, 2010).

Four artificial prediction markets are evaluated in this section. These markets have the same classifiers, namely the leaves of the trained random trees, but differ either in the betting functions or in the way the budgets are trained, as follows:

1. The first market has constant betting and equal budgets for all participants. We proved in Section 3.1 that this is a random forest (Breiman, 2001).

2.
The second market has constant betting based on specialized classifiers (the leaves of the random trees), with the budgets initialized with the same values as in market 1 above, but trained using the update equation (15). Thus after training it will differ from market 1.

3. The third market has linear betting functions (1), for which the market price can be computed analytically only for binary classification. The market is initialized with equal budgets and trained using eq. (17).

4. The fourth market has aggressive betting (2) with $\epsilon = 0.01$ and the market price computed using the Mann iteration of Algorithm 3. The market is initialized with equal budgets and trained using eq. (17). The value $\epsilon = 0.01$ was chosen for simplicity; a better choice would be to obtain it by cross-validation.

For each dataset, 50 random trees are trained on bootstrap samples of the training data. These trained random trees are used to construct the random forest and the other three markets described above. This way, only the aggregation capabilities of the different markets are compared. The budgets in markets 2-4 described above are trained on the same training data using the update equation (17), which simplifies to (15) for the constant market. A C++ implementation of these markets can be found at the following address: http://stat.fsu.edu/~abarbu/Research/PredMarket.zip

7.1 Case Study

We first investigate the behavior of three markets on a dataset in terms of training and test error as well as the loss function. For that, we chose the satimage dataset from the UCI repository (Blake and Merz, 1998), since it has a supplied test set. The satimage dataset has a training set of size 4435 and a test set of size 2000. The markets investigated are the constant market with both incremental and batch updates, given in eq.
(15) and (14) respectively, and the linear and aggressive markets with incremental updates given in (17). Observe that the $\eta$ in eq. (15) is not divided by $N$ (the number of observations), while the $\eta$ in (14) is divided by $N$. Thus, to obtain the same behavior, the $\eta$ in (15) should be the $\eta$ from (14) divided by $N$. We used $\eta = 100/N$ for the incremental update and $\eta = 100$ for the batch update unless otherwise specified.

In Figure 5 are plotted the misclassification errors on the training and test sets and the negative log-likelihood function vs. the number of training epochs, averaged over 10 runs. From Figure 5 one can see that the incremental and batch updates perform similarly in terms of the likelihood function, training and test errors. However, the incremental update is preferred since it requires less memory and can handle an arbitrarily large amount of training data. The aggressive and constant markets achieve similar values of the negative

Figure 5: Experiments on the satimage dataset for the incremental and batch market updates. Left: the training error vs. number of epochs. Middle: the test error vs. number of epochs. Right: the negative log-likelihood function vs. number of training epochs.
The learning rates are $\eta = 100/N$ for the incremental update and $\eta = 100$ for the batch update unless otherwise specified.

log-likelihood and similar training errors, but the aggressive market seems to overfit more, since its test error is larger than that of the constant incremental market ($p$-value $< 0.05$). The linear market has worse values of the log-likelihood, training and test errors ($p$-value $< 0.05$).

7.2 Evaluation of the Probability Estimation and Classification Accuracy on Synthetic Data

We perform a series of experiments on synthetic datasets to evaluate the market's ability to predict class conditional probabilities $P(Y|\mathbf{x})$. The experiments are performed on 5000 binary datasets with 50 levels of Bayes error
$$E = \int \min\{p(\mathbf{x}, Y=0), p(\mathbf{x}, Y=1)\}\, d\mathbf{x},$$
ranging from 0.01 to 0.5 in equal increments. For each dataset, the two classes have equal frequency. Both $p(\mathbf{x}|Y=k),\ k = 0, 1$ are normal distributions $N(\mu_k, \sigma^2 I)$, with $\mu_0 = 0$, $\sigma^2 = 1$ and $\mu_1$ chosen in some random direction at such a distance as to obtain the desired Bayes error. For each of the 50 Bayes error levels, 100 datasets of size 200 were generated, using the bisection method to find an appropriate $\mu_1$ in a random direction. Training of the participant budgets is done with $\eta = 0.1$.

For each observation $\mathbf{x}$, the class conditional probability can be computed analytically using Bayes' rule
$$p^*(Y=1|\mathbf{x}) = \frac{p(\mathbf{x}|Y=1)\, p(Y=1)}{p(\mathbf{x}, Y=0) + p(\mathbf{x}, Y=1)}.$$
An estimate $\hat{p}(y=1|\mathbf{x})$ obtained with one of the markets is compared to the true probability $p^*(Y=1|\mathbf{x})$ using the $L_2$ norm
$$E(\hat{p}, p^*) = \int \left(\hat{p}(y=1|\mathbf{x}) - p^*(y=1|\mathbf{x})\right)^2 p(\mathbf{x})\, d\mathbf{x},$$
where $p(\mathbf{x}) = p(\mathbf{x}, Y=0) + p(\mathbf{x}, Y=1)$.
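The synthetic setup above can be reproduced with a short script. For two equal-prior isotropic Gaussians with $\sigma = 1$, the Bayes error has the closed form $E = \Phi(-d/2)$ with $d = \|\mu_1 - \mu_0\|$, which the bisection exploits; the function names are ours:

```python
import math
import numpy as np

def bayes_error(d):
    """Bayes error of two equal-prior N(mu, I) classes whose means are a
    distance d apart: Phi(-d/2)."""
    return 0.5 * (1.0 + math.erf(-d / 2.0 / math.sqrt(2.0)))

def distance_for_error(target, lo=1e-6, hi=20.0, tol=1e-12):
    """Bisection, as in the setup above, for the mean distance giving a
    desired Bayes error (the error decreases as d grows)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if bayes_error(mid) > target else (lo, mid)
    return 0.5 * (lo + hi)

def make_dataset(target_error, n=200, dim=100, seed=0):
    """Sample n points from the two-Gaussian mixture with the target
    Bayes error; mu_0 = 0 and mu_1 lies in a random direction."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(dim)
    mu1 = u / np.linalg.norm(u) * distance_for_error(target_error)
    y = rng.integers(0, 2, n)
    X = rng.standard_normal((n, dim)) + np.outer(y, mu1)
    return X, y, mu1

def true_posterior(X, mu1):
    """p*(Y=1|x) from Bayes' rule for the two Gaussians above."""
    return 1.0 / (1.0 + np.exp(-(X @ mu1 - 0.5 * mu1 @ mu1)))

X, y, mu1 = make_dataset(0.1)
print(round(bayes_error(np.linalg.norm(mu1)), 6))  # 0.1
```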
Figure 6: Left: class probability estimation error vs. problem difficulty for 5000 100-dimensional problems. Right: probability estimation errors relative to random forest. The aggressive and linear betting are shown with box plots.

Figure 7: Left: misclassification error minus Bayes error vs. problem difficulty for 5000 100-dimensional problems. Right: misclassification errors relative to random forest. The aggressive betting is shown with box plots.

In practice, this error is approximated using a sample of size 1000. The errors of the probability estimates obtained by the four markets are shown in Figure 6 for a 100-dimensional problem setup. Also shown, on the right, are the errors relative to the random forest, obtained by dividing each error by the corresponding random forest error. As one can see, the aggressive and constant betting markets obtain significantly better ($p$-value $< 0.01$) probability estimators than the random forest for Bayes errors up to 0.28. On the other hand, the linear betting market obtains probability estimators significantly better ($p$-value $< 0.01$) than the random forest for Bayes errors from 0.34 to 0.5.
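The sample-based approximation of the probability estimation error mentioned above amounts to a plain Monte Carlo average; a minimal sketch, illustrated with stand-in posteriors of our own choosing:

```python
import numpy as np

def l2_estimation_error(p_hat, p_star, X):
    """Monte Carlo approximation of E(p_hat, p_star): average the squared
    difference of the two posteriors over a sample X drawn from p(x)."""
    return float(np.mean((p_hat(X) - p_star(X)) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
p_star = lambda X: 1 / (1 + np.exp(-X[:, 0]))      # stand-in true posterior
p_hat = lambda X: np.clip(p_star(X) + 0.05, 0, 1)  # estimator off by <= 0.05
err = l2_estimation_error(p_hat, p_star, X)
print(err)  # about 0.0025
```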
We also evaluated the misclassification errors of the four markets in predicting the correct class, on the same 5000 datasets. The differences between these misclassification errors and the Bayes error are shown in Figure 7, left. The differences between these misclassification errors and the random forest error are shown in Figure 7, right. We see that all markets with trained participants predict significantly better ($p$-value $< 0.01$) than random forest for Bayes errors up to 0.3, and behave similarly to random forest on the remaining datasets.

7.3 Comparison with Random Forest on UCI Datasets

In this section we conduct an evaluation on 31 datasets from the UCI machine learning repository (Blake and Merz, 1998). The optimal number of training epochs and $\eta$ are meta-parameters that need to be chosen appropriately for each dataset. We observed experimentally that $\eta$ can take any value up to a maximum that depends on the dataset. In these experiments we took $\eta = 10/N_{train}$. The best number of epochs was chosen by ten-fold cross-validation.

In order to compare with the results in (Breiman, 2001), the training and test sets were randomly subsampled from the available data, with 90% for training and 10% for testing. The exceptions are the satimage, zipcode, hill-valley and poker datasets, with test sets of size 2000, 2007, 606 and $10^6$ respectively. All results were averaged over 100 runs.

We present two random forest results. The column named RFB contains the random forest results from (Breiman, 2001), where each tree node is split based on a random feature. In the column named RF we present the results of our own RF implementation, with splits based on random features. The leaf nodes of the random trees from our RF implementation are used as specialized participants for all the markets evaluated.
The CB, LB and AB columns are the performances of the constant, linear and respectively aggressive markets on these datasets. Significant mean differences ($\alpha < 0.01$) from RFB are shown with $+$, $-$ for when RFB is respectively worse or better. Significant paired $t$-tests (Demšar, 2006) ($\alpha < 0.01$) comparing the markets with our RF implementation are shown with •, † for when RF is respectively worse or better.

The constant, linear and aggressive markets significantly outperformed our RF implementation on 22, 19 and respectively 22 datasets out of the 31 evaluated. They were not significantly outperformed by our RF implementation on any of the 31 datasets. Compared to the RF results from Breiman (2001) (RFB), CB, LB and AB significantly outperformed RFB on 6, 5 and 6 datasets respectively, and were not significantly outperformed on any dataset.

7.4 Comparison with Implicit Online Learning on UCI Datasets

We implemented the implicit online learning (Kulis and Bartlett, 2010) algorithm for classification with linear aggregation. The objective of implicit online learning is to minimize the loss $\ell(\beta)$ in a conservative way. The conservativeness of the update is determined by a Bregman divergence
$$D(\beta, \beta_t) = \phi(\beta) - \phi(\beta_t) - \langle \nabla\phi(\beta_t), \beta - \beta_t \rangle,$$
where $\phi(\beta)$ are real-valued strictly convex functions. Rather than minimizing the loss function itself, the function
$$f_t(\beta) = D(\beta, \beta_t) + \eta_t \ell(\beta)$$

Table 1: The misclassification errors for 31 datasets from the UC Irvine Repository, shown in percent (%). The markets evaluated are our implementation of random forest (RF), and markets with Constant (CB), Linear (LB) and respectively Aggressive (AB) betting. RFB contains the random forest results from (Breiman, 2001).
Data                 N_train  N_test  F    K    RFB    RF      CB        LB        AB
breast-cancer        683      -       9    2    2.7    2.5     2.4       2.4       2.4
sonar                208      -       60   2    18.0   16.6    14.1 •+   14.2 •+   14.1 •+
vowel                990      -       10   11   3.3    2.9     2.6 •+    2.7 +     2.6 •+
ecoli                336      -       7    8    13.0   12.9    12.9      12.8      12.9
german               1000     -       24   2    26.2   25.5    24.9 •+   25.1      24.9 •+
glass                214      -       9    6    21.2   23.5    22.2 •    22.4      22.2 •
image                2310     -       19   7    2.7    2.7     2.5 •     2.5 •     2.5 •
ionosphere           351      -       34   2    7.5    7.4     6.7 •     6.9 •     6.7 •
letter-recognition   20000    -       16   26   4.7    4.2 +   4.2 •+    4.2 •+    4.2 •+
liver-disorders      345      -       6    2    24.7   26.5    26.3      26.2      26.2
pima-diabetes        768      -       8    2    24.3   24.1    23.8      23.7      23.8
satimage             4435     2000    36   6    10.5   10.1 +  10.0 •+   10.1 •+   10.0 •+
vehicle              846      -       18   4    26.4   26.3    26.1      26.2      26.1
voting-records       232      -       16   2    4.6    5.3     4.2 •     4.2 •     4.2 •
zipcode              7291     2007    256  10   7.8    7.7     7.6 •+    7.7 •+    7.6 •+
abalone              4177     -       8    3    -      45.5    45.4      45.4      45.4
balance-scale        625      -       4    3    -      15.4    15.4      15.4      15.4
car                  1728     -       6    4    -      2.8     2.0 •     2.2 •     2.0 •
connect-4            67557    -       42   3    -      19.6    19.3 •    19.4 •    19.5 •
cylinder-bands       277      -       33   2    -      22.7    20.9 •    21.1 •    20.9 •
hill-valley          606      606     100  2    -      46.9    45.8 •    46.3 •    45.8 •
isolet               1559     -       617  26   -      17.0    15.7 •    15.8 •    15.7 •
king-rook-vs-king    28056    -       6    18   -      15.6    15.4 •    15.4 •    15.4 •
king-rk-vs-k-pawn    3196     -       36   2    -      2.0     1.5 •     1.6 •     1.5 •
madelon              2000     -       500  2    -      46.1    45.2 •    45.3 •    45.2 •
magic                19020    -       10   2    -      12.0    11.9 •    11.9 •    11.9 •
musk                 6598     -       166  2    -      3.7     3.5 •     3.6 •     3.5 •
poker                25010    10^6    10   10   -      43.2    43.1 •    43.1 •    43.1 •
SAheart              462      -       9    2    -      30.8    30.8      30.7      30.8
splice-junction      3190     -       59   3    -      18.9    17.7 •    18.2 •    17.7 •
yeast                1484     -       8    10   -      38.3    38.1      38.0      38.1

is minimized instead. Here $\eta_t$ is the learning rate. The Bregman divergence ensures that the optimal $\beta$ is not too far from $\beta_t$.
The algorithm for implicit online learning is as follows:
$$\tilde{\beta}_{t+1} = \arg\min_{\beta \in \mathbb{R}^M} f_t(\beta), \qquad \beta_{t+1} = \arg\min_{\beta \in S} D(\beta, \tilde{\beta}_{t+1}).$$
The first step solves the unconstrained version of the problem, while the second step finds the nearest feasible solution to the unconstrained minimizer, subject to the Bregman divergence.

For our problem we use $\ell(\beta) = -\log(c_y(\beta))$, where $c_y(\beta)$ is the constant market equilibrium price for ground truth label $y$. We chose the squared Euclidean distance $D(\beta, \beta_t) = \|\beta - \beta_t\|_2^2$ as our Bregman divergence and learning rate $\eta_t = 1/\sqrt{t}$. To ensure that $c = \sum_{m=1}^M h_m \beta_m = H\beta$ is a valid probability vector, the feasible solution set is $S = \{\beta \in [0,1]^M : \sum_{m=1}^M \beta_m = 1\}$. This gives the following update scheme
$$\tilde{\beta}_{t+1} = \beta_t + \eta_t \frac{1}{p} (H^y)^T, \qquad \beta_{t+1} = \arg\min_{\beta \in S} \left\{\|\beta - \tilde{\beta}_{t+1}\|_2^2\right\},$$
where $H^y = (h^y_1, h^y_2, \ldots, h^y_M)$ is the vector of classifier outputs for the true label $y$, $q = H^y \beta_t$, $r = H^y (H^y)^T$ and $p = \frac{1}{2}\left(q + \sqrt{q^2 + 4\eta_t r}\right)$.

The results presented in Table 2 are obtained by 10-fold cross-validation. The cross-validation errors were averaged over 10 different permutations of the data in the cross-validation folds. The results in the CB online and implicit online columns are obtained in one epoch. The results in the CB offline and implicit offline columns are obtained in an offline fashion, using an appropriate number of epochs (up to 10) chosen to obtain the smallest cross-validated error on a random permutation of the data different from the 10 permutations used to obtain the results.

The comparisons are done with paired $t$-tests and shown with ∗ and ‡ when the constant betting market is significantly ($\alpha < 0.01$) better or worse than the corresponding implicit online learning. We also performed a comparison with our RF implementation, and significant differences are shown with • and †.
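The update scheme above can be sketched directly in code. The simplex projection is the standard sort-based Euclidean projection; the paper does not specify which projection algorithm it uses, so that choice, like the variable names, is ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto S = {b in [0,1]^M : sum(b) = 1}
    (the standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def implicit_update(beta, Hy, eta):
    """One implicit online step for the loss -log(c_y(beta)) with
    D(b, b_t) = ||b - b_t||^2, following the closed form in the text:
    q = Hy.beta_t, r = Hy.Hy, p = (q + sqrt(q^2 + 4*eta*r)) / 2."""
    q = Hy @ beta
    r = Hy @ Hy
    p = 0.5 * (q + np.sqrt(q * q + 4.0 * eta * r))
    return project_simplex(beta + (eta / p) * Hy)

beta = np.ones(4) / 4                # four participants, uniform budgets
Hy = np.array([0.9, 0.6, 0.4, 0.1])  # their prices for the true class
beta = implicit_update(beta, Hy, eta=1.0)
print(beta)  # budget shifts toward the best-calibrated participant
```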
Compared to RF, implicit online learning won 5-0, CB online won 9-1 and CB offline won 12-0. Compared to implicit online, which performed identically to implicit offline, both CB online and CB offline won 9-0.

Table 2: Comparison with implicit online learning and random forest using 10-fold cross-validation.

Dataset               N_train  N_test  F    K    RF     Implicit  CB        Implicit  CB
                                                        Online    Online    Offline   Offline
breast-cancer         683      -       9    2    3.1    3.1       3         3.1       3
sonar                 208      -       60   2    15.1   15.2      15.3      15.1      14.6
vowel                 990      -       10   11   3.2    3.2       3.2       3.2       2.9 •∗
ecoli                 336      -       7    8    13.7   13.7      13.6      13.7      13.6
german                1000     -       24   2    23.6   23.5      23.5      23.5      23.4
glass                 214      -       9    6    21.4   21.4      21.3      21.4      21
image                 2310     -       19   7    1.9    1.9       1.9       1.9       1.8 •
ionosphere            351      -       34   2    6.4    6.5       6.5       6.5       6.5
letter-recognition    20000    -       16   26   3.3    3.3       3.3 •∗    3.3       3.3
liver-disorders       345      -       6    2    26.4   26.4      26.4      26.4      26.4
pima-diabetes         768      -       8    2    23.2   23.2      23.2      23.2      23.2
satimage              4435     2000    36   6    8.8    8.8       8.8       8.8       8.7 •
vehicle               846      -       18   4    24.8   24.7      24.9      24.7      24.9
voting-records        232      -       16   2    3.5    3.5       3.5       3.5       3.5
zipcode               7291     2007    256  10   6.1    6.1       6.2       6.1       6.2
abalone               4177     -       8    3    45.5   45.5      45.6 †    45.5      45.5
balance-scale         625      -       4    3    17.7   17.7      17.7      17.7      17.7
car                   1728     -       6    4    2.3    2.3       1.8 •∗    2.3       1.1 •∗
connect-4             67557    -       42   3    19.9   19.9 •    19.5 •∗   19.9 •    18.2 •∗
cylinder-bands        277      -       33   2    21.4   21.3      21.2      21.3      20.8 •
hill-valley           606      606     100  2    43.8   43.7      43.7      43.7      43.7
isolet                1559     -       617  26   6.9    6.9       6.9       6.9       6.9
king-rk-vs-king       28056    -       6    18   21.6   21.6 •    19.6 •∗   21.5 •    15.7 •∗
king-rk-vs-k-pawn     3196     -       36   2    1      1         0.7 •∗    1         0.5 •∗
magic                 19020    -       10   2    11.9   11.9 •    11.8 •∗   11.9 •    11.7 •∗
madelon               2000     -       500  2    26.8   26.5 •    25.6 •∗   26.4 •    21.6 •∗
musk                  6598     -       166  2    1.7    1.7 •     1.6 •∗    1.7 •     1 •∗
splice-junction-gene  3190     -       59   3    4.3    4.3       4.2 •∗    4.3       4.1 •∗
SAheart               462      -       9    2    31.5   31.5      31.6      31.5      31.6
yeast                 1484     -       8    10   37.3   37.3      37.3      37.3      37.3

7.5 Comparison with Adaboost for Lymph Node Detection

Finally, we compared the linear aggregation capability of the artificial prediction market with adaboost on a lymph node detection problem. The system is set up as described in Barbu et al. (2012): namely, a set of lymph node candidate positions $(x, y, z)$ is obtained using a trained detector. Each candidate is segmented using gradient descent optimization and about 17000 features are extracted from the segmentation result. Using these features, adaboost constructed 32 weak classifiers. Each weak classifier is associated with one feature, splits the feature range into 64 bins and returns a predefined value (1 or $-1$) for each bin.

Thus, one can consider that there are $M = 32 \times 64 = 2048$ specialized participants, each betting for one class (1 or $-1$) on any observation that falls in its domain. The participants are given budgets $\beta_{ij},\ i = 1, \ldots, 32,\ j = 1, \ldots, 64$, where $i$ is the feature index and $j$ is the bin index. The participant budgets $\beta_{ij},\ j = 1, \ldots, 64$ corresponding to the same feature $i$ are initialized with the same value $\beta_i$, namely the adaboost coefficient. For each bin, the returned class 1 or $-1$ is the outcome on which the participant will bet its budget.

The constant betting market of the 2048 participants is initialized with these budgets and trained with the same training examples that were used to train the adaboost classifier. The obtained constant market probability for an observation $\mathbf{x} = (x_1, \ldots, x_{32})$ is based on the bin indexes $\mathbf{b} = (b_1(x_1), \ldots, b_{32}(x_{32}))$:
$$p(y=1|\mathbf{b}) = \frac{\sum_{i=1}^{32} \beta_{i,b_i} h_i(b_i)}{\sum_{i=1}^{32} \beta_{i,b_i}} \qquad (20)$$

An important issue is that the number $N_{pos}$ of positive examples is much smaller than the number $N_{neg}$ of negatives. Similar to adaboost, the sum of the weights of the positive examples should be the same as the sum of the weights of the negatives.
To accomplish this in the market, we use the weighted update rule of Eq. (18), with $w_{pos} = \frac{1}{N_{pos}}$ for each positive example and $w_{neg} = \frac{1}{N_{neg}}$ for each negative.

Figure 8: Left: detection rate at 3 FP/volume vs. number of training epochs for a lymph node detection problem. Right: ROC curves for adaboost and the constant betting market with the 2048 adaboost weak classifier bins as participants. The results are obtained with six-fold cross-validation.

The adaboost classifier and the constant market were evaluated on a lymph node detection application with a dataset containing 54 CT scans of the pelvic and abdominal region, with a total of 569 lymph nodes, using six-fold cross-validation. The evaluation criterion is the same for all methods, as specified in Barbu et al. (2012). A lymph node detection is considered correct if its center is inside a manual solid lymph node segmentation, and incorrect if it is not inside any lymph node segmentation (solid or non-solid).

In Figure 8, left, are shown the training and testing detection rates at 3 false positives per volume (a clinically acceptable false positive rate) vs. the number of training epochs. We see that the detection rate increases to about 81% for epochs 6 to 16 and then gradually decreases. In Figure 8, right, are shown the training and test ROC curves of adaboost and of the constant market trained with 7 epochs. In this case the detection rate at 3 false positives per volume improved from 79.6% for adaboost to 81.2% for the constant market.
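The bin-participant price of eq. (20) above can be sketched as follows. Here we read eq. (20) as the budget-weighted fraction of the 32 active participants that vote for class $+1$; the array layout and names are our assumptions:

```python
import numpy as np

def bin_market_probability(x_bins, beta, h):
    """Constant-betting price from per-feature bin participants (eq. (20)).
    x_bins[i] = b_i(x_i) is the bin index of feature i for observation x;
    beta[i, j] is the budget and h[i, j] in {+1, -1} the vote of bin j."""
    i = np.arange(len(x_bins))
    b = beta[i, x_bins]     # budgets of the active participants (one per feature)
    votes = h[i, x_bins]    # their votes for this observation
    return (b * (votes == 1)).sum() / b.sum()

# toy example: 4 features, 3 bins each, equal budgets
beta = np.ones((4, 3))
h = np.array([[ 1, -1,  1],
              [ 1,  1, -1],
              [-1,  1, -1],
              [-1, -1,  1]])
p = bin_market_probability(np.array([0, 2, 0, 0]), beta, h)
print(p)  # 0.25: one of four equal-budget active participants votes +1
```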
The $p$-value for this difference is 0.0276, based on a paired $t$-test.

8. Conclusion and Future Work

This paper presents a theory of artificial prediction markets for the purpose of supervised learning of class conditional probability estimators. The artificial prediction market is a novel online learning algorithm that can be easily implemented for two-class and multi-class applications. Linear aggregation, logistic regression, as well as certain kernel methods can be viewed as particular instances of the artificial prediction markets. Inspired from real life, specialized classifiers that only bet on subsets of the instance space $\Omega$ were introduced. Experimental comparisons on real and synthetic data show that the prediction market usually outperforms random forest, adaboost and implicit online learning in prediction accuracy.

The artificial prediction market shows the following promising features:

1. It can be updated online with minimal computational cost when a new observation $(\mathbf{x}, y)$ is presented.

2. It has a simple form of the update iteration that can be easily implemented.

3. For multi-class classification it can fuse information from all types of binary or multi-class classifiers: e.g. trained one-vs-all, many-vs-many, multi-class decision trees, etc.

4. It can obtain meaningful probability estimates when only a subset of the market participants is involved for a particular instance $\mathbf{x} \in X$. This feature is useful for learning on manifolds (Belkin and Niyogi, 2004; Elgammal and Lee, 2004; Saul and Roweis, 2003), where the location on the manifold decides which market participants should be involved. For example, in face detection, different face part classifiers (eyes, mouth, ears, nose, hair, etc.) can be involved in the market, depending on the orientation of the head hypothesis being evaluated.

5.
Because of their betting functions, the specialized market participants can decide for which instances they bet and how much. This is another way to combine classifiers, different from the boosting approach where all classifiers participate in estimating the class probability for each observation.

We are currently extending the artificial prediction market framework to regression and density estimation. These extensions involve contracts for uncountably many outcomes, but the update and the market price equations extend naturally.

Future work includes finding explicit bounds for the generalization error based on the number of training examples. Another item of future work is finding other generic types of specialized participants that are not leaves of random or adaboost trees. For example, by clustering the instances x ∈ Ω, one could find regions of the instance space Ω where simple classifiers (e.g., logistic regression, or betting for a single class) can be used as specialized market participants for that region.

Acknowledgments

The authors wish to thank Jan Hendrik Schmidt from Innovation Park GmbH for stirring in us the excitement for prediction markets. The authors acknowledge partial support from an FSU startup grant and ONR N00014-09-1-0664.

References

K. J. Arrow, R. Forsythe, M. Gorham, R. Hahn, R. Hanson, J. O. Ledyard, S. Levmore, R. Litan, P. Milgrom, and F. D. Nelson. The promise of prediction markets. Science, 320(5878):877, 2008.

A. Barbu, M. Suehling, X. Xu, D. Liu, S. Zhou, and D. Comaniciu. Automatic detection and segmentation of lymph nodes from CT data. IEEE Trans. on Medical Imaging, 31(2):240–250, 2012.

S. Basu. Investment performance of common stocks in relation to their price-earnings ratios: A test of the efficient market hypothesis. The Journal of Finance, 32(3):663–682, 1977.

M.
Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1):209–239, 2004.

A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

C. Blake and C. J. Merz. UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1998.

A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. The Journal of Machine Learning Research, 6:1579–1619, 2005.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In NIPS, page 409, 2001.

Y. Chen and J. W. Vaughan. A new understanding of prediction markets via no-regret learning. In Proceedings of the 11th ACM Conference on Electronic Commerce, pages 189–198. ACM, 2010.

Y. Chen, J. Abernethy, and J. W. Vaughan. An optimization-based framework for automated market-making. Proceedings of the EC, 11:5–9, 2011.

C. Chow. On optimum recognition error and reject tradeoff. IEEE Trans. on Information Theory, 16(1):41–46, 1970.

B. Cowgill, J. Wolfers, and E. Zitzewitz. Using prediction markets to track information flows: Evidence from Google. Dartmouth College, 2008.

J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.

A. Elgammal and C. S. Lee. Inferring 3D body pose from silhouettes using activity manifold learning. In CVPR, 2004.

E. F. Fama. Efficient capital markets: A review of theory and empirical work. Journal of Finance, pages 383–417, 1970.

C. Ferri, P. Flach, and J. Hernández-Orallo.
Delegating classifiers. In International Conference in Machine Learning, 2004.

Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference in Machine Learning, pages 148–156, 1996.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.

J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. Ann. Appl. Stat., 2(3):916–954, 2008.

S. Gjerstad and M. C. Hall. Risk aversion, beliefs, and prediction market equilibrium. Economic Science Laboratory, University of Arizona, 2005.

J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Trans. on Signal Processing, 52:2165–2176, 2004.

B. Kulis and P. L. Bartlett. Implicit online learning. In International Conference in Machine Learning, 2010.

N. Lay and A. Barbu. Supervised aggregation of classifiers using artificial prediction markets. In International Conference in Machine Learning, 2010.

B. G. Malkiel. The efficient market hypothesis and its critics. The Journal of Economic Perspectives, 17(1):59–82, 2003.

W. Robert Mann. Mean value methods in iteration. Proc. Amer. Math. Soc., 4:506–510, 1953.

C. F. Manski. Interpreting the predictions of prediction markets. Economics Letters, 91(3):425–429, 2006.

J. Perols, K. Chari, and M. Agrawal. Information market-based decision fusion. Management Science, 55(5):827–842, 2009.

C. R. Plott, J. Wit, and W. C. Yang. Parimutuel betting markets as information aggregation devices: Experimental results. Economic Theory, 22(2):311–351, 2003.

P. M. Polgreen, F. D. Nelson, and G. R. Neumann. Use of prediction markets to forecast infectious disease activity. Clinical Infectious Diseases, 44(2):272–279, 2006.

C. Polk, R. Hanson, J. Ledyard, and T. Ishikida.
The policy analysis market: an electronic commerce application of a combinatorial information market. In ACM Conf. on Electronic Commerce, pages 272–273, 2003.

A. Ratnaparkhi et al. A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 1, pages 133–142, 1996.

L. K. Saul and S. T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. The Journal of Machine Learning Research, 4:119–155, 2003.

R. E. Schapire. The boosting approach to machine learning: An overview. Lect. Notes in Statistics, pages 149–172, 2003.

A. Storkey. Machine learning markets. AISTATS, 2011.

A. Storkey, J. Millin, and K. Geras. Isoelastic agents and wealth updates in machine learning markets. ICML, 2012.

F. Tortorella. Reducing the classification cost of support vector classifiers through an ROC-based reject rule. Pattern Analysis & Applications, 7(2):128–143, 2004.

J. Wolfers and E. Zitzewitz. Prediction markets. Journal of Economic Perspectives, pages 107–126, 2004.

S. C. Zhu, Y. Wu, and D. Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.

Appendix: Proofs

Proof [of Theorem 1] From eq. (3), the total budget $\sum_{m=1}^M \beta_m$ is conserved if and only if

$$\sum_{m=1}^M \sum_{k=1}^K \beta_m \phi^k_m(x, c) = \sum_{m=1}^M \beta_m \phi^y_m(x, c) / c_y \qquad (21)$$

Denoting $n = \sum_{m=1}^M \sum_{k=1}^K \beta_m \phi^k_m(x, c)$, and since the above equation must hold for all y, we obtain that eq. (4) is a necessary condition, and also c_k ≠ 0, k = 1, ..., K, which means c_k > 0, k = 1, ..., K. Reciprocally, if c_k > 0 and eq. (4) holds for all k, dividing by c_k we obtain eq. (21).
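As a numerical illustration of Theorem 1, the following sketch assumes constant betting, φ^k_m(x, c) = η h^k_m(x), with made-up budgets, betting fraction η and classifier outputs h^k_m. At the equilibrium price given by eq. (4), the bet-and-payoff step implied by eq. (21), in which each participant loses its bets and receives its bet on the true class y divided by the price c_y, conserves the total budget:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 5, 3                       # participants, classes
beta = rng.uniform(1.0, 2.0, M)   # budgets (hypothetical values)
h = rng.dirichlet(np.ones(K), M)  # participant class probability estimates
eta = 0.1                         # constant betting fraction (assumed)

phi = eta * h                     # constant betting functions phi_m^k
n = (beta[:, None] * phi).sum()   # total amount bet across all classes
c = (beta @ phi) / n              # equilibrium price from eq. (4)

# bet-and-payoff step for a training label y: each participant loses all
# its bets and wins back its bet on class y divided by the price c_y
y = 1
new_beta = beta - (beta[:, None] * phi).sum(axis=1) + beta * phi[:, y] / c[y]

assert np.isclose(c.sum(), 1.0)                # prices sum to 1
assert np.isclose(new_beta.sum(), beta.sum())  # Theorem 1: budget conserved
```

Budgets shift from participants that bet on the wrong classes toward those that bet on the observed class, while the total stays fixed.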
Proof [of Remark 2] Since the total budget is conserved and is positive, there exists a β_m > 0, therefore $\sum_{m=1}^M \beta_m \phi^k_m(x, 0) > 0$, which implies $\lim_{c_k \to 0} f_k(c_k) = \infty$. From the fact that f_k(c_k) is continuous and strictly decreasing, with $\lim_{c_k \to 0} f_k(c_k) = \infty$ and $\lim_{c_k \to 1} f_k(c_k) = 0$, it follows that for every n > 0 there exists a unique c_k that satisfies f_k(c_k) = n.

Proof [of Theorem 3] From Remark 2 we get that for every n ≥ n_k, n > 0, there is a unique c_k(n) such that f_k(c_k(n)) = n. Moreover, following the proof of Remark 2, we see that c_k(n) is continuous and strictly decreasing on (n_k, ∞), with $\lim_{n \to \infty} c_k(n) = 0$.

If max_k n_k > 0, take n* = max_k n_k. There exists k ∈ {1, ..., K} such that n_k = n*, so c_k(n*) = 1, therefore $\sum_{j=1}^K c_j(n^*) \geq 1$.

If max_k n_k = 0, then n_k = 0, k = 1, ..., K, which means φ^k_m(x, 1) = 0, k = 1, ..., K for all m with β_m > 0. Let $a^k_m = \min\{c \mid \phi^k_m(x, c) = 0\}$. We have a^k_m > 0 for all k since φ^k_m(x, 0) > 0. Thus $\lim_{n \to 0^+} c_k(n) = \max_m a^k_m \geq a^k_1$, where we assumed that φ_1(x, c) satisfies Assumption 2. But from Assumption 2 there exists k such that a^k_1 = 1. Thus $\lim_{n \to 0^+} \sum_{k=1}^K c_k(n) \geq \sum_{k=1}^K a^k_1 > 1$, so there exists n* such that $\sum_{k=1}^K c_k(n^*) \geq 1$.

Either way, since $\sum_{k=1}^K c_k(n)$ is continuous and strictly decreasing, and since $\sum_{k=1}^K c_k(n^*) \geq 1$ and $\lim_{n \to \infty} \sum_{k=1}^K c_k(n) = 0$, there exists a unique n > 0 such that $\sum_{k=1}^K c_k(n) = 1$. For this n, from Theorem 1 it follows that the total budget is conserved for the price c = (c_1(n), ..., c_K(n)). Uniqueness follows from the uniqueness of c_k(n) and the uniqueness of n.

Proof [of Theorem 4] The price equations (4) become

$$\sum_{m=1}^M \beta_m \phi^k_m(x) = c_k \sum_{k=1}^K \sum_{m=1}^M \beta_m \phi^k_m(x), \quad \forall k = 1, ..., K,$$

which give the result from eq.
(8). If φ^k_m(x) = η h^k_m(x), using $\sum_{k=1}^K h^k_m(x) = 1$, the denominator of eq. (8) becomes

$$\sum_{k=1}^K \sum_{m=1}^M \beta_m \phi^k_m(x) = \eta \sum_{m=1}^M \beta_m \sum_{k=1}^K h^k_m(x) = \eta \sum_{m=1}^M \beta_m,$$

so

$$c_k = \frac{\eta \sum_{m=1}^M \beta_m h^k_m(x)}{\eta \sum_{m=1}^M \beta_m} = \sum_m \alpha_m h^k_m(x), \quad \forall k = 1, ..., K.$$

Proof [of Theorem 5] For the current parameters $\gamma = (\gamma_1, ..., \gamma_M) = (\sqrt{\beta_1}, ..., \sqrt{\beta_M})$ and an observation (x_i, y_i), we have the market price for label y_i:

$$c_{y_i}(x_i) = \sum_{m=1}^M \gamma_m^2 \phi^{y_i}_m(x_i) \Big/ \Big( \sum_{m=1}^M \sum_{k=1}^K \gamma_m^2 \phi^k_m(x_i) \Big) \qquad (22)$$

So the log-likelihood is

$$L(\gamma) = \frac{1}{N} \sum_{i=1}^N \log c_{y_i}(x_i) = \frac{1}{N} \sum_{i=1}^N \log \sum_{m=1}^M \gamma_m^2 \phi^{y_i}_m(x_i) - \frac{1}{N} \sum_{i=1}^N \log \sum_{m=1}^M \sum_{k=1}^K \gamma_m^2 \phi^k_m(x_i) \qquad (23)$$

We obtain the gradient components:

$$\frac{\partial L(\gamma)}{\partial \gamma_j} = \frac{1}{N} \sum_{i=1}^N \left( \frac{\gamma_j \phi^{y_i}_j(x_i)}{\sum_{m=1}^M \gamma_m^2 \phi^{y_i}_m(x_i)} - \frac{\gamma_j \sum_{k=1}^K \phi^k_j(x_i)}{\sum_{m=1}^M \sum_{k=1}^K \gamma_m^2 \phi^k_m(x_i)} \right) \qquad (24)$$

Then from (22), denoting $B(x_i) = \sum_{m=1}^M \sum_{k=1}^K \gamma_m^2 \phi^k_m(x_i)$, we have $\sum_{m=1}^M \gamma_m^2 \phi^{y_i}_m(x_i) = B(x_i) c_{y_i}(x_i)$. Hence (24) becomes

$$\frac{\partial L(\gamma)}{\partial \gamma_j} = \frac{\gamma_j}{N} \sum_{i=1}^N \frac{1}{B(x_i)} \left( \frac{\phi^{y_i}_j(x_i)}{c_{y_i}(x_i)} - \sum_{k=1}^K \phi^k_j(x_i) \right).$$

Write

$$u_j = \frac{1}{N} \sum_{i=1}^N \frac{1}{B(x_i)} \left( \frac{\phi^{y_i}_j(x_i)}{c_{y_i}(x_i)} - \sum_{k=1}^K \phi^k_j(x_i) \right),$$

then $\partial L(\gamma)/\partial \gamma_j = \gamma_j u_j$. The batch update (14) is $\beta_j \leftarrow \beta_j + \eta \beta_j u_j$. By taking the square root we get the update in γ:

$$\gamma_j \leftarrow \gamma_j \sqrt{1 + \eta u_j} = \gamma_j + \gamma_j(\sqrt{1 + \eta u_j} - 1) = \gamma_j + \frac{\gamma_j \eta u_j}{\sqrt{1 + \eta u_j} + 1} = \gamma_j'.$$

We can write the Taylor expansion

$$L(\gamma') = L(\gamma) + (\gamma' - \gamma)^T \nabla L(\gamma) + \frac{1}{2} (\gamma' - \gamma)^T H(L)(\zeta)(\gamma' - \gamma),$$

so

$$L(\gamma') = L(\gamma) + \sum_{j=1}^M \gamma_j u_j \frac{\eta \gamma_j u_j}{\sqrt{1 + \eta u_j} + 1} + \eta^2 A(\eta) = L(\gamma) + \eta \sum_{j=1}^M \frac{\gamma_j^2 u_j^2}{\sqrt{1 + \eta u_j} + 1} + \eta^2 A(\eta),$$

where |A(η)| is bounded in a neighborhood of 0. Now assume that ∇L(γ) ≠ 0, thus γ_j u_j ≠ 0 for some j.
Then $\sum_{j=1}^M \gamma_j^2 u_j^2 / (\sqrt{1 + \eta u_j} + 1) > 0$, hence L(γ') > L(γ) for any η small enough. Thus as long as ∇L(γ) ≠ 0, the batch update (14) with any η sufficiently small will increase the likelihood function. The batch update (14) can be split into N per-observation updates of the form (15).
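Theorem 4's claim that the constant market price coincides with linear aggregation can be checked numerically. This is a small sketch with made-up budgets and classifier outputs, assuming constant betting φ^k_m(x) = η h^k_m(x):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 4, 3
beta = rng.uniform(0.5, 3.0, M)   # participant budgets (hypothetical)
h = rng.dirichlet(np.ones(K), M)  # classifier probability estimates h_m^k(x)
eta = 0.05                        # constant betting fraction (assumed)

phi = eta * h
# market price from the price equations (4)
c = (beta @ phi) / (beta[:, None] * phi).sum()
# linear aggregation with normalized weights alpha_m = beta_m / sum(beta)
alpha = beta / beta.sum()
c_linear = alpha @ h

assert np.allclose(c, c_linear)  # Theorem 4: market price = linear aggregation
```

Note that η cancels in the ratio, which is why the constant market price does not depend on the betting fraction.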