Entropy, Perception, and Relativity


Authors: Stefan Jaeger

LAMP-TR-131  CAR-TR-1012  CS-TR-4799  UMIACS-TR-2006-20
April 2006

Stefan Jaeger
Language and Media Processing Laboratory
Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742-3275
jaeger@umiacs.umd.edu

Abstract

In this paper, I expand Shannon's definition of entropy into a new form of entropy that allows integration of information from different random events. Shannon's notion of entropy is a special case of my more general definition of entropy. I define probability using a so-called performance function, which is de facto an exponential distribution. Assuming that my general notion of entropy reflects the true uncertainty about a probabilistic event, I understand that our perceived uncertainty differs. I claim that our perception is the result of two opposing forces similar to the two famous antagonists in Chinese philosophy: Yin and Yang. Based on this idea, I show that our perceived uncertainty matches the true uncertainty in points determined by the golden ratio. I demonstrate that the well-known sigmoid function, which we typically employ in artificial neural networks as a non-linear threshold function, describes the actual performance. Furthermore, I provide a motivation for the time dilation in Einstein's Special Relativity, basically claiming that although time dilation conforms with our perception, it does not correspond to reality. At the end of the paper, I show how to apply this theoretical framework to practical applications. I present recognition rates for a pattern recognition problem, and also propose a network architecture that can take advantage of general entropy to solve complex decision problems.

Keywords: Information Theory, Entropy, Sensor Fusion, Machine Learning, Perception, Special Relativity.
The support of this research by the Department of Defense under contract MDA-9040-2C-0406 is gratefully acknowledged.

1 Introduction

Uncertainty is our constant companion in everyday decision making. Being able to deal with uncertainty is thus an essential requirement for intelligent behavior in real-world environments. Naturally, knowing the exact amount of uncertainty involved in a particular decision is very useful information to have. Mathematically, the classic way of measuring the uncertainty of a random event is to compute its information based on the definition of entropy introduced by Shannon [15]. In this paper, however, I introduce a new, general form of entropy that is motivated by my earlier work on classifier combination [3, 4, 6]. The idea of classifier combination, or sensor fusion in general, is to combine the outcomes of different sub-optimal processes into one integrated result. Ideally, the integrated process performs better in the given application domain than each individual process alone. In order to integrate different processes into a single process, computers need to deal with the uncertainties involved in the outcomes of each individual process. For classifier combination, several combination schemes have already been suggested. The current state of the art, however, has not given its final verdict on this issue yet. In my earlier work, I proposed an information-theoretical approach to this problem. The main idea of this approach is to normalize confidence values in such a way that their nominal values match their conveyed information, which I measure on a training set in the application domain. The overall combined confidence for each class is then simply the sum of the normalized confidence values of each individual classifier.
In this paper, I am going to elaborate on my earlier ideas by looking at them from the general entropy's point of view. I structured the paper as follows: Following this introduction, Section 2 repeats the definition of entropy as introduced by Shannon, and compares it to my new and more general definition. Section 3 provides a short introduction to my earlier work on informational confidence and repeats the main postulates and their immediate consequences. Section 4 describes how I understand confidence as the result of an interplay of two opposing forces. In Section 5, this insight will show the sigmoid function of classic backpropagation networks in a different light, namely as a kind of mediator between these two forces. A closer inspection in Section 6 reveals that the net effect of both opposing forces equals one single force in points defined by the golden ratio. In Section 7, I relate the introduced forces to the well-known forces of Yin and Yang in Chinese philosophy. In particular, I show how we can derive the typical Yin-Yang symbol from the assumptions made. In Section 8, I explore common grounds of the general framework presented here and Einstein's Special Relativity. I provide an interesting motivation for the time dilation in Einstein's Special Relativity. Section 9 is then going to show how we can learn informational confidence values, illustrating the learning process with a practical example of handwritten Japanese character recognition. This section also proposes a network architecture for learning based on the ideas introduced in the previous sections. Finally, a summary with the main results concludes the paper.

2 Entropy

Entropy is a measure of the uncertainty in a random event or signal. Alternatively, we can understand entropy as the amount of information conveyed by the random event or carried by the signal.
Entropy is a general concept that has applications in statistical mechanics, thermodynamics, and of course information theory in computer science. The latter will be the focus of my attention in the following. At the end of my paper, I will present an interesting connection with Einstein's Special Relativity and physics, though. Claude E. Shannon introduced entropy as a measure for randomness in his seminal 1948 paper "A Mathematical Theory of Communication." For a discrete random event with n possible outcomes, Shannon defines the entropy H as the sum of the expected information K_i for each outcome i:

    H = \sum_{i=1}^{n} K_i    (1)

Shannon uses the negative logarithm to compute information itself. In this way, he can simply add the information of two independent outcomes to get the combined information of both. Accordingly, each K_i in (1) reads as follows:

    K_i = -p(i) \ln(p(i)),    (2)

with p(i) denoting the probability of the i-th outcome. The entropy reaches a maximum when all p(i) are equal, which indicates maximum uncertainty. On the other hand, the entropy is minimal, i.e. zero, if exactly one p(i) is 1 and all other outcomes have a probability of zero. I am now introducing the following more general variant that I will be using instead of (2) to compute the entropy H:

    K_i = -p(K_i) \ln(p(K_i))    (3)

In this new form, the expected information for each outcome appears on both sides of the equation, effectively making (3) a fixed point equation. Instead of using the probability p(i) of an outcome, I am now using the probability of the outcome's specific information. I also do not require the sum of all probabilities p(K_i) to be one.
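The two definitions can be contrasted numerically. The sketch below (not from the paper) computes Shannon entropy via (1)-(2) and solves the fixed point equation (3) by simple iteration; the choice of performance function p(K) = 1 - e^{-K} (the exponential form of Section 3 with E = 1) is an assumption made here purely for illustration.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = sum_i K_i with K_i = -p(i) * ln p(i), equations (1)-(2)."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def general_summand(perf, k0=0.5, iters=200):
    """Solve the fixed point K = -p(K) * ln p(K) of equation (3) by simple
    iteration, where perf maps an information value K to a probability p(K)."""
    k = k0
    for _ in range(iters):
        p = perf(k)
        k = -p * math.log(p)
    return k

# Uniform event with 4 outcomes: H = ln 4.
h = shannon_entropy([0.25] * 4)

# Fixed point of (3) for the illustrative performance p(K) = 1 - exp(-K).
k_star = general_summand(lambda k: 1.0 - math.exp(-k))
```

For this particular performance function the iteration converges quickly; at the fixed point, K and -p(K) ln p(K) agree to numerical precision, which is exactly condition (3).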
A straightforward comparison of (2) and (3) shows that Shannon's definition of entropy and its more general variant are the same if each outcome satisfies the following equation:

    p(i) = p(K_i)    (4)

In other words, both definitions of entropy are the same when the probability of each outcome matches the probability of its information, which we can consider to be a fixed point. The next section gives a motivation for the general entropy formula using pattern recognition, and in particular classifier combination, as a practical example.

3 Informational Confidence

Pattern recognition is a research field in computer science dealing with the automatic classification of pattern samples into different classes. Depending on the application domain, typical classes are, e.g., characters, gestures, traffic signs, faces, etc. For a given unknown test pattern, most classifiers return both the actual classification result in the form of a ranked list of class labels, and corresponding values indicating the confidence of the classifier in each class label. I will be using the term "confidence value" for these values throughout the paper, but I should mention that other researchers may prefer different terms, such as "score" or "likelihood." In practical classifier systems, confidence values are usually only rough approximations of their mathematically correct values. In particular, they very often do not meet the requirements of probabilities. While this usually does not hamper the operation of a single classifier, which only depends on the relative proportion of confidence values, it causes problems in multiple classifier systems, which need the proper values for combination purposes. Post-processing steps, such as linguistic context analysis for character recognition, can also benefit from more accurate confidence values.
Combination of different classifiers in a multiple classifier system has turned out to be a powerful tool for reducing the uncertainty involved in a classification problem [8]. Researchers have shown in numerous experiments that the performance of the combined classifiers can exceed the performance of each single classifier. Nevertheless, researchers are still undecided about how best to integrate the confidence values of each individual classifier into one single confidence. In earlier work, I proposed so-called informational confidence values as a way to combine multiple confidence values [3, 4, 6]. The idea of informational confidence values is to introduce a standard of comparison allowing fair comparison and easy integration of confidence values generated by different classifiers. The definition of informational confidence values relies on two central postulates:

1. Confidence is information
2. Information depends on performance

The first postulate states that each confidence value conveys information, and it consequently requires that the nominal value of each confidence value should equal the information conveyed. The second postulate then logically continues by requiring that the amount of information conveyed should depend on the performance of the confidence value in the application domain. From both postulates taken together, it follows that confidence depends on performance via information. To formalize these requirements, let me assume that each classifier C can output confidence values from a set of confidence values K_C, with

    K_C = {K_0, K_1, ..., K_i, ..., K_N}    (5)

Let me further assume that K_N indicates the highest confidence classifier C can output.
The following fixed point equation then defines a linear relationship between confidence and information, with the latter depending on the performance complement of each confidence value:

    K_i = E * I(p(K_i)) + C    (6)

We see that the confidence values K_i appear on both sides of Equation (6), essentially making it a fixed point equation with the so-called informational confidence values as fixed points. Using the performance complement ensures that higher confidence values with better performance convey more information than lower confidence values when we apply Claude Shannon's logarithmic notion of information [15]. According to Shannon, the information of a probabilistic event is the negative logarithm of its probability. More information on Shannon's work and the implications of his strikingly simple definition of information can be found in [13, 14, 16]. By setting constant C to zero, inserting the negative logarithm as information function I, and using 1 - p(K_i) as performance complement, I simplify Equation (6) to the following definition of informational confidence:

    K_i = -E \ln(1 - p(K_i))    (7)

The still unknown parameters necessary to compute informational confidence values according to (7) are E and p(K_i). A straightforward transformation of (7) sheds more light on these two parameters:

    K_i = -E \ln(1 - p(K_i))
    ⟺ e^{-K_i/E} = 1 - p(K_i)
    ⟺ p(K_i) = 1 - e^{-K_i/E}    (8)

The result shows that the performance function p(K_i) describes an exponential distribution with expectation value E.
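Equations (7) and (8) are inverses of each other, which a short sketch (mine, not the paper's code) can make concrete:

```python
import math

def informational_confidence(p, E):
    """K = -E * ln(1 - p), equation (7): maps a measured performance
    p = p(K) in [0, 1) to its informational confidence value."""
    return -E * math.log(1.0 - p)

def performance(K, E):
    """Inverse mapping p(K) = 1 - exp(-K / E), equation (8)."""
    return 1.0 - math.exp(-K / E)
```

For example, with E = 2 a performance of 0.75 maps to K = -2 ln(0.25), and applying `performance` to that K recovers 0.75, confirming the round trip between (7) and (8).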
This follows from the general definition of an exponential density function e_λ(x) with parameter λ:

    e_λ(x) = λ e^{-λx} for x ≥ 0, and e_λ(x) = 0 for x < 0, with λ > 0    (9)

For each λ, the enclosed area of the density function equals 1:

    ∫_{-∞}^{∞} e_λ(x) dx = ∫_{0}^{∞} λ e^{-λx} dx = 1   for all λ > 0    (10)

Figure 1 shows three different exponential densities differing in their parameter λ, with λ = 100, λ = 20, and λ = 10, respectively. The parameter λ has a direct influence on the steepness of the exponential density function: the higher λ, the steeper the density function. The corresponding distribution E_λ(k), which describes the probability that the random variable assumes values lower than or equal to a given value k, computes as follows:

    E_λ(k) = ∫_{-∞}^{k} e_λ(x) dx = ∫_{0}^{k} λ e^{-λx} dx = [-e^{-λx}]_{0}^{k} = 1 - e^{-λk}    (11)

[Figure 1: Exponential density for λ = 100, λ = 20, and λ = 10.]

[Figure 2: Exponential distribution for λ = 100, λ = 20, and λ = 10.]

Figure 2 shows the distributions for the three different densities depicted in Figure 1, with λ = 100, λ = 20, and λ = 10. The parameter λ again influences the steepness: a larger λ entails a steeper distribution. For each parameter λ, the distribution function converges on 1 with increasing confidence. Another important feature is the relationship between parameter λ and the expectation value E of the exponentially distributed random variable. Both are in inverse proportion to each other, with E = 1/λ. Accordingly, the expectation values corresponding to the exponential densities in Figure 1, and distributions in Figure 2, are E = 1/100, E = 1/20, and E = 1/10, respectively.
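As a quick numerical sanity check (my own sketch, not part of the paper), integrating the density (9) from 0 to k should reproduce the closed-form distribution (11):

```python
import math

def exp_density(x, lam):
    """Exponential density e_lambda(x) of equation (9)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exp_distribution(k, lam):
    """Closed-form distribution E_lambda(k) = 1 - exp(-lam * k), equation (11)."""
    return 1.0 - math.exp(-lam * k) if k >= 0 else 0.0

# Trapezoid-rule integral of the density over [0, 1] for lambda = 10.
lam, n = 10.0, 10_000
h = 1.0 / n
area = sum((exp_density(i * h, lam) + exp_density((i + 1) * h, lam)) * h / 2
           for i in range(n))
```

The numerically integrated area agrees with `exp_distribution(1.0, lam)` to well within the trapezoid-rule error, illustrating the derivation step in (11).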
When we compare the performance specification in (8) with the exponential distribution in (11), we see that the only difference lies in the exponent of the exponential function. In fact, performance function and exponential distribution become identical for λ = 1/E. This result shows that the performance function p(K_i) describes the distribution of exponentially distributed confidence values with expectation E. We can therefore consider confidence as an exponentially distributed random variable with parameter λ = 1/E. The performance theorem summarizes this important result:

Performance Theorem: A classifier C with performance p(K) provides informational confidence K = -E \ln(1 - p(K)) if, and only if, p(K) is an exponential distribution with expectation E.

The performance theorem explains the meaning and implications of the parameters E and p(K). For classifiers violating the performance theorem, the equation stated in the performance theorem allows us to compute the proper informational confidence values as long as we know the specific values of E and p(K). Section 9 will later show how we can estimate these parameters on a given evaluation set. In the next section, I take the idea of informational confidence a step further and introduce a second type of confidence called counter-confidence, which describes the confidence of the classifier in the falseness of its output. The subsequent sections then elaborate on this concept and present new theoretical results and discuss their implications.

4 Opposing Forces

I am assuming that decision making is based on two opposing forces, one supporting a certain outcome and one arguing against it. In particular, I am going to propose a formalization of both forces, which I name Force A and Force B for the time being, based on the fixed point equation of the performance theorem.
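Since the performance theorem models confidence as exponentially distributed with expectation E, one plausible estimator for E on an evaluation set is simply the sample mean (the maximum-likelihood estimate for an exponential distribution). This is my own illustrative sketch; the paper's actual estimation procedure is only described later, in Section 9.

```python
import random

def estimate_expectation(confidences):
    """Sample mean of observed confidence values. If the confidences are
    exponentially distributed (performance theorem), this is the
    maximum-likelihood estimate of the expectation E = 1/lambda."""
    return sum(confidences) / len(confidences)

# Demo on synthetic exponential data with lambda = 20, i.e. E = 1/20 = 0.05.
random.seed(0)
samples = [random.expovariate(20.0) for _ in range(100_000)]
E_hat = estimate_expectation(samples)
```

On the synthetic sample, `E_hat` lands close to the true expectation 0.05, matching the E = 1/λ relationship stated above.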
In fact, I postulate that Force A is already defined by this equation. Force B only differs in its interpretation of performance.

4.1 Force A

The first force, Force A, describes the confidence in a particular decision. Accordingly, I use the fixed point equation of informational confidence values as the definition of Force A:

    K = -E \ln(1 - p(K))    (12)

The left-hand side of this equation denotes the magnitude of Force A. It is the product of information in the Shannon sense and an expectation value in the statistical sense. As shown above, the performance function p(K) follows immediately as p(K) = 1 - e^{-K/E}. If the performance in the logarithmic expression on the right-hand side of (12) is 1, and the expectation E is positive, then Force A becomes infinite. On the other hand, if the performance is zero, then the logarithm becomes zero and there is no Force A at all.

4.2 Force B

The second force, Force B, is defined similarly but performs complementary to Force A. Force B describes information that depends directly on the performance and not on the performance complement. Accordingly, the following modified fixed point equation describes Force B:

    K = -E \ln(p(K))    (13)

The difference to Force A lies in the interpretation of the performance function p(K), which follows again from a straightforward transformation:

    K = -E \ln(p(K)) ⟺ p(K) = e^{-K/E}    (14)

We see that the performance function of Force B is similar to the performance of Force A. However, it looks at the problem from a different side. Instead of describing the area delimited by K under the exponential density curve, it describes the remaining area that is not delimited. Parameter E is again a statistical expectation value. Unlike Force A, Force B becomes infinite for a performance equal to zero and positive expectation.
It becomes zero whenever the performance is perfect, i.e. p(K) = 1. While Force A defines informational confidence values, Force B can be considered as defining informational counter-confidence values.

4.3 Interplay of Forces

Having defined both Force A and Force B, I postulate that all decision processes are the result of the interplay between these two forces. What we can actually experience when making decisions is the dominance one of these forces has achieved over its counterpart. Mathematically, I understand this dominance as the net effect of both forces and thus use the difference between the defining equations in (12) and (13) to describe it:

    K = -E \ln((1 - p(K)) / p(K))    (15)

This equation is a fixed point equation itself. It describes the net force, which is the result of both forces acting simultaneously. Naturally, the net force becomes zero when Force A equals Force B. This is the case when either the expectation value is zero or the performance p(K) is 0.5. The net force becomes either infinity or minus infinity when one force dominates completely over its counterpart. In particular, the net force becomes infinity when Force A dominates with p(K) = 1 and minus infinity when Force B dominates with p(K) = 0. The following two sections are going to present two more interesting theoretical results, which are a direct consequence of the net force defined by (15), namely the sigmoid function and the golden ratio. Section 7 will later relate Force A and Force B to the well-known antagonistic forces in Chinese philosophy: Yin and Yang.

5 Sigmoid Function

A closer look at the net force defined in (15) reveals that the performance function is indeed a well-known function. A straightforward derivation leads to the following result:

    K = -E \ln((1 - p(K)) / p(K))
    ⟺ e^{-K/E} = 1/p(K) - 1
    ⟺ p(K) = 1 / (1 + e^{-K/E})    (16)

It shows that the performance function is actually identical to the type of sigmoid function that classical feedforward network architectures very often use as a threshold function. The traditional explanation for the use of this particular threshold function has always lain in its features of non-linearity and simplicity. Non-linearity increases the expressiveness of a neural network, allowing decision boundaries in feature space that a simple linear network would not be able to model. A neural network with only linear output functions would simply collapse into a single linear function, which cannot model complex decision boundaries. The other advantage of the sigmoid function in (16) is the simplicity of its derivative, which facilitates the backpropagation of errors during the training of neural networks. While these are surely important points, it now seems that the deeper meaning of the sigmoid function has more of an information-theoretical nature, as motivated above.

[Figure 3: Sigmoid function 1/(1 + e^{-K/E}) for E = 1, E = 1/2, E = 1/3, and E = 2.]

Figure 3 shows the sigmoid function in (16) for four different parameters E, namely E = 1, E = 1/2, E = 1/3, and E = 2. As its name already suggests, the sigmoid function has an S-shape. It converges on 0 towards negative infinity and on 1 towards infinity. The parameter E controls the steepness of the sigmoid function. For smaller values of E, the sigmoid function becomes steeper and approaches 0 or 1 faster on both ends. Independent of E, the sigmoid function is always 0.5 for K = 0.

6 The Golden Ratio

I now assume that the performance of a given confidence value K always matches the expectation exactly, i.e., in other words, E = p(K).
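The relationship between (15) and (16) is an exact inversion, which this short sketch (mine, for illustration) verifies: feeding a performance value through the net force and then through the sigmoid recovers the original value.

```python
import math

def net_force(p, E):
    """Net force K = -E * ln((1 - p) / p), equation (15)."""
    return -E * math.log((1.0 - p) / p)

def sigmoid(K, E):
    """Performance recovered from K: p = 1 / (1 + exp(-K / E)), equation (16)."""
    return 1.0 / (1.0 + math.exp(-K / E))

# Round trip for an illustrative E = 2 and a few performances.
checks = [sigmoid(net_force(p, 2.0), 2.0) for p in (0.1, 0.5, 0.9)]
```

Note also that `sigmoid(0, E)` is 0.5 for any E, matching the observation above that the net force vanishes exactly at p(K) = 0.5.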
Note that this corresponds to the form of the summands of the general entropy in Section 2. The net force equation in (15) will then read as follows:

    K = -p(K) \ln((1 - p(K)) / p(K))    (17)

[Figure 4: Net force y(p) = -p \ln((1-p)/p) and its mirrored variant z(p) = -(1-p) \ln((1-p)/p).]

Figure 4 depicts the net force in (17) graphically for performance values p(K) ranging from 0 to 1. As we can see in Figure 4, the net force becomes zero for p(K) = 0 and p(K) = 0.5. For performances higher than 0.5 and approaching 1, the net force diverges to infinity. Figure 4 also shows a mirrored variant of the net force, namely

    K = -(1 - p(K)) \ln((1 - p(K)) / p(K))    (18)

This equation is a direct result of (17) after changing the sign and replacing the performance p(K) with its complement 1 - p(K). The net force and its mirrored variant both meet at p(K) = 0.5. We can actually consider p(K) = 0.5 as a transition point where the net force transforms into its mirrored variant. After the transition, we are still looking at the same problem. However, our point of view has changed and is now reflected by the mirrored net force. This will become important later in Section 7, where we relate these forces to Yin and Yang. For the time being, let us concentrate on the net force in (17). The net force and the counter-confidence (Force B) in (13), with E = p(K), become equal when the performance p(K) satisfies the following relationship:

    p(K) = (1 - p(K)) / p(K)    (19)
    ⟺ p(K)^2 + p(K) - 1 = 0
    ⟺ p(K) = -1/2 ± \sqrt{1/4 + 1}
    ⟺ p(K) = (\sqrt{5} - 1)/2  ∨  p(K) = (-1 - \sqrt{5})/2
    ⟺ p(K) ≈ 0.618  ∨  p(K) ≈ -1.618    (20)

[Figure 5: The golden ratio: a + b is to a as a is to b.]
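The quadratic in (19)-(20) can be solved directly; this sketch (mine) confirms that its roots are the two values quoted above, and that the positive root is the reciprocal (equivalently, the decrement by one) of the golden ratio φ discussed next.

```python
import math

# Roots of p^2 + p - 1 = 0, the condition (19) under which the net force (17)
# equals the counter-confidence (13) with E = p(K).
p_pos = (-1.0 + math.sqrt(5.0)) / 2.0   # approx.  0.618
p_neg = (-1.0 - math.sqrt(5.0)) / 2.0   # approx. -1.618

# Golden ratio phi = (1 + sqrt 5) / 2; note p_pos = phi - 1 = 1/phi
# and p_neg = -phi, i.e. the roots are the negated golden-ratio values.
phi = (1.0 + math.sqrt(5.0)) / 2.0
```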
This transformation shows that counter-confidence and net force are the same for a performance of about 0.618, when just considering the positive performance value. Interestingly, this transformation also shows that the two possible values satisfying (19), namely ≈ -1.618 and ≈ 0.618, are precisely the negated values of the so-called golden ratio. Force B thus equals the compound effect of Force A and Force B for performances defined by the golden ratio. With a detailed introduction to the golden ratio being out of scope, I provide only some background information about the golden ratio, or golden mean as it is also called [10, 2]. The golden ratio is an irrational number, or rather two numbers, describing the proportion of two quantities. Expressed in words, two quantities are in the golden ratio to each other if the whole is to the larger part as the larger part is to the smaller part. The whole in this case is simply the sum of both parts. Figure 5 shows an example of a line divided into two segments that are in the golden ratio to each other. Historically, the golden ratio was already studied by ancient mathematicians. It plays an important role in different fields like geometry, biology, physics, and others. Many artists and designers deliberately or unconsciously make use of it because it seems that artwork based on the golden ratio has an esthetic appeal, and features some kind of natural symmetry. Despite the fact that the golden mean is of paramount importance to so many fields, I think it is fair to say that we still do not have a full, or rather correct, understanding of its true meaning in science. Mathematically, the golden mean can be derived from the following equation, which expresses the colloquial description given above in mathematical terms:
    (a + b) / a = a / b    (21)

Accordingly, the golden mean, which is typically denoted by the Greek letter φ, is then given by the ratio of a and b, i.e. φ = a/b. Using the relationship in (21), the golden ratio φ can be resolved into two possible values:

    φ = (1 + \sqrt{5})/2  ∨  φ = (1 - \sqrt{5})/2    (22)
    ⟹ φ ≈ 1.618  ∨  φ ≈ -0.618    (23)

Usually, the positive value (≈ 1.618) is identified with φ. Note that these values are the same as in (20), except that their signs are reversed. The reader interested in a thorough analysis of the golden mean can find more information and many practical examples in the references [10, 2].

7 Yin and Yang

I will now relate the above theoretical results to one of the oldest philosophical world views, namely the principle of Yin and Yang. In particular, I dare to advance the hypothesis that both Force A and Force B, which I defined respectively in (12) and (13) using fixed point equations, correspond to the two opposing forces Yin and Yang when we assume that expectation always equals performance, i.e. E = p(K). If this can indeed be confirmed by further observations, this ancient philosophical concept could play an important role in computer science. In fact, I will provide further evidence of this claim and also show how we can use the concept of Yin and Yang for machine learning. Let me begin with a short summary of the Yin/Yang concept in Chinese philosophy.

7.1 Philosophy

The concept of Yin and Yang is deeply rooted in Chinese philosophy [23]. Its origin dates back at least 2500 years, probably much earlier, playing a crucial role in the oldest Chinese philosophical texts. Chinese philosophy has attached great importance to Yin/Yang ever since. Today, the idea of Yin/Yang pervades fields as different as religion, sports, medicine, politics, and many more.
The fact that the Korean national flag sports a Yin/Yang symbol illustrates the emphasis laid on this concept in Asian countries. Yin and Yang stand for two principles that are opposites of each other, and which are constantly trying to gain the upper hand over each other. However, neither one will ever succeed in doing so, though one principle may temporarily dominate the other one. Both principles cannot exist without each other. It is rather the constant struggle between both principles that defines our world and produces the rhythm of life. According to Chinese philosophy, Yin and Yang are the foundation of our entire universe. They flow through, and thus affect, every being. Typical examples of Yin/Yang opposites are night/day, cold/hot, rest/activity, etc. Chinese philosophy does not confine itself to a mere description of Yin and Yang. It also provides guidelines on how to live in accordance with Yin and Yang. The central statement is that Yin and Yang need to be in harmony. Any imbalance of an economical, biological, physical, or chemical system can be directly attributed to a distorted equilibrium between Yin and Yang. For instance, an illness accompanied by fever is the result of Yang being too strong and dominating Yin. On the other hand, dominance of Yin could result, for instance, in a body shivering with cold. The optimal state every being, or system, should strive for is therefore the state of equilibrium between Yin and Yang. It is this state of equilibrium between Yin and Yang that Chinese philosophy considers the most powerful and stable state a system can assume. Yin and Yang can be further subdivided into Yin and Yang.
For instance, "cold" can be further divided into "cool" or "chilly," and "hot" into "warm" or "boiling." Yin and Yang already carry the seed of their opposites: a dominating Yin becomes susceptible to Yang and will eventually turn into its opposite. On the other hand, a dominating Yang gives rise to Yin and will thus turn into Yin over time. This defines the perennial alternating cycle of Yin and Yang dominance. Only the equilibrium between Yin and Yang is able to overcome this cycle.

[Figure 6: Yin and Yang.]

[Figure 7: Logarithmic spiral.]

7.2 Logarithmic Spirals

Figure 6 depicts the well-known black and white symbol of Yin and Yang. The dots of different color in the area delimited by each force symbolize the fact that each force bears the seed of its counterpart within itself. According to the principle of Yin and Yang outlined above, neither Yin nor Yang can be observed directly. Both Yin and Yang are intertwined forces always occurring in pairs, rather than being isolated forces independent from each other. In Chinese philosophy, Yin and Yang assume the form of spirals. I will now show that the net force in (17) is a spiral too. In order to do so, I will first introduce the general definition of the logarithmic spiral before I then illustrate the similarity to the famous Yin/Yang symbol. A logarithmic spiral is a special type of spiral curve, which plays an important role in nature. It occurs in all different kinds of objects and processes, such as mollusk shells, hurricanes, galaxies, and many more [1]. In polar coordinates (r, θ), the general definition of a logarithmic spiral is

    r = a e^{bθ}    (24)

Parameter a is a scale factor determining the size of the spiral, while parameter b controls the direction and tightness of the wrapping.
For a logarithmic spiral, the distances between the turnings increase. This distinguishes the logarithmic spiral from the Archimedean spiral, which features constant distances between turnings. Figure 7 depicts a typical example of a logarithmic spiral. Resolving (24) for θ leads to the following general form of logarithmic spirals:

θ = (1/b) ln(r/a)    (25)

Figure 8: Yin-Yang spirals, plotting the pairs (y, d) and (−y, −d) with y = −p ln(p/(1−p)) and d = (1−p) e^{−y/p}.

In order to show that the net force in (17) defines a logarithmic spiral, and for the sake of easier illustration, I investigate the negative version of the net force in (17) and look at the polar coordinates (r, θ) it defines, namely:

θ = −p(K) ln( p(K) / (1−p(K)) )   and   r = (1−p(K)) e^{−θ/p(K)}    (26)

A comparison of (26) with the general form of logarithmic spirals in (25) shows that the net force does indeed describe a spiral. Both (25) and (26) match when we set the parameters a and b to the following values:

a = 1 − p(K)   and   b = −1/p(K)    (27)

In particular, we can check that a and b are identical when p(K) equals the golden ratio. If we let p(K) run from 0 to 1, and mirror the resulting spiral along both axes, similar to Figure 4, we receive two spirals. Figure 8 shows both spirals plotted in a Cartesian coordinate system. Both spirals are, of course, symmetrical, and their turnings approach the unit circle. A comparison of the Yin/Yang symbol of Figure 6 with the spirals in Figure 8 shows the strong similarities between both figures. A simple mirror operation transforms the spirals in Figure 8 into the Yin/Yang symbol. The addition of a time dimension to Figure 8 generates a three-dimensional object.
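The parameter match in (27) can be verified numerically. The sketch below uses illustrative performance values, not values from the experiments: it evaluates the polar coordinates in (26), checks them against the general form (25), and confirms that a and b coincide at the golden ratio (1+√5)/2.

```python
import math

def net_force_polar(p):
    """Polar coordinates (theta, r) of the net force per (26)."""
    theta = -p * math.log(p / (1.0 - p))
    r = (1.0 - p) * math.exp(-theta / p)
    return theta, r

for p in (0.2, 0.5, 0.8):
    theta, r = net_force_polar(p)
    a, b = 1.0 - p, -1.0 / p          # parameter choice (27)
    # General spiral form (25): theta = (1/b) * ln(r/a)
    assert abs(theta - math.log(r / a) / b) < 1e-12

# The parameters a and b coincide at p = (1 + sqrt(5)) / 2,
# since phi^2 = phi + 1 implies 1 - phi = -1/phi.
phi = (1.0 + math.sqrt(5.0)) / 2.0
assert abs((1.0 - phi) - (-1.0 / phi)) < 1e-12
```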
It resembles a funnel or trumpet that has a wide circular opening on the upper end and narrows towards the origin. Figure 9 depicts this "informational universe," which follows directly from the two-dimensional graphic in Figure 8 when I use the performance values as time coordinates for the third axis. Note that the use of performance as time is reasonable because the exponential distribution is typically used to model dynamic time processes, and its expectation value is thus typically associated with time. This will also be an important point in the next section.

Figure 9: Informational universe.

8 Relativity

This section discusses the net force in a wider context and from a physical point of view. I begin by revisiting the net force as introduced in (15):

K = −E ln( (1−p(K)) / p(K) )    (28)

The net force describes the net effect of the two forces defined in (12) and (13), respectively. As I showed above, each force entails its own interpretation of the performance function p(K). However, the net effect of both forces in (28), which computes simply as the difference between both forces, provides no information about the interpretation of p(K). Both interpretations, i.e. the exponential distribution or its complement, are valid performances. In fact, the interpretation we use depends on our viewpoint and just changes the sign of the net force in (28). The previous result in (16) shows that the sigmoid function provides the correct performance values once we have chosen our point of view. Accordingly, the performance will lie between 0 and 0.5 for a negative net force and between 0.5 and 1 for a positive net force. The fact that there is no objectively correct viewpoint strongly resembles the principle of relativity, which plays a major role in physics.
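The connection to the sigmoid function can be made concrete: solving (28) for p(K) gives the logistic form directly. A minimal sketch, with E = 1 as an illustrative choice:

```python
import math

def net_force(p, E=1.0):
    """Net force per (28): K = -E * ln((1 - p) / p)."""
    return -E * math.log((1.0 - p) / p)

def performance(K, E=1.0):
    """Inverting (28) yields the sigmoid: p = 1 / (1 + e^(-K/E))."""
    return 1.0 / (1.0 + math.exp(-K / E))

# Round trip: the sigmoid recovers the performance from the net force.
for p in (0.1, 0.5, 0.9):
    assert abs(performance(net_force(p)) - p) < 1e-12

# Negative net force -> performance below 0.5; positive -> above 0.5.
assert performance(-2.0) < 0.5 < performance(2.0)
```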
Motivated by the general entropy introduced at the beginning of this paper, I will now derive another interesting result relating to relativity. As introduced in Section 2, the general entropy is based on summands having the following form:

K_i = −p(K_i) ln( p(K_i) )    (29)

We can easily see that each summand matches the definition of Force B introduced in (13) when the expectation equals the performance. For this reason, I consider Force B, or rather general entropy, to be the more fundamental of the two forces Force A and Force B. Actually, I understand that the difference between Force A and Force B, i.e. the net force, describes merely our perception, while the general entropy defines the true uncertainty. The sigmoid function will thus provide the real performance values, allowing us to compute the actual general entropy. Spinning this thought further, I understand that we perceive reality in points defined by the golden ratio. Our perception will be different from reality except for performance values equal to the golden ratio. Let me present an interesting physical application of this idea: In physics, a typical performance function could be the velocity v of an object in relation to light speed c. This value should always lie within the range from 0 to 1 because the current state of the art assumes that no object can move faster than the speed of light. If we insert this relative speed into (19), which describes the relationship defining the golden ratio, we obtain the following result:

p(K) = (1−p(K)) / p(K)   ⟹   √(1 − p(K)²) = √(p(K))   ⟹   √(1 − (v/c)²) = √(p(K))    (30)

The expression on the left-hand side is the well-known Lorentz factor, or rather the inverse Lorentz factor, which plays a crucial part in Einstein's special relativity.
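The chain in (30) can be checked numerically at the fixed point of p = (1 − p)/p, which is p = (√5 − 1)/2 ≈ 0.618; there, 1 − p² = p, so both square roots agree:

```python
import math

# Fixed point of p = (1 - p) / p, i.e. the golden-ratio relation (19).
p = (math.sqrt(5.0) - 1.0) / 2.0
assert abs(p - (1.0 - p) / p) < 1e-12

# At this point 1 - p^2 = p, hence sqrt(1 - p^2) = sqrt(p), as in (30).
assert abs(math.sqrt(1.0 - p * p) - math.sqrt(p)) < 1e-12

# Interpreting p as v/c, the left-hand side is the inverse Lorentz factor.
```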
The Lorentz factor describes how mass, length, and time change for an object or system whose velocity approaches light speed. For a moving object, an observer will measure a shorter length, more mass, and a shorter time lapse between two events. These effects become more pronounced as the moving object approaches the speed of light. Depending on the relative speed to light, the Lorentz factor basically describes the ratio between the quantity measured for the observer and the quantity measured for the moving system. For instance, if t is the time measured locally by the observer, then the corresponding time t′ measured for the moving system computes as follows:

t′ = √(1 − v²/c²) · t    (31)

We can see that t′ converges to zero for increasing speed, i.e. we can measure no time lapse for a system moving at light speed. Similar relationships hold for length and mass. However, time dilation is especially interesting because the exponential distribution is very often used to model the time between statistical events that happen at a constant average rate, such as radioactive decay or the time until the next system failure, as already mentioned in the previous section. The expectation value of the exponential distribution is then indeed time, namely the expected time until the next event. In this context, an expectation value in the form of the Lorentz factor makes perfect sense. Actually, time dilation then follows from the relationship in (30). However, according to the perceptual model introduced above, I understand that time dilation is merely our perception and does not reflect reality. The true performance follows when we use our observed performance as input to the sigmoid function, which then provides the actual performance.
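For reference, (31) is straightforward to evaluate. The sketch below uses illustrative speeds expressed as fractions of c:

```python
# Time dilation per (31): t' = sqrt(1 - v^2/c^2) * t,
# with the speed given as the fraction v/c. Values are illustrative.
def dilated_time(t, v_over_c):
    return (1.0 - v_over_c ** 2) ** 0.5 * t

# At v/c = 0.8 the dilated time is 0.6 * t.
assert abs(dilated_time(1.0, 0.8) - 0.6) < 1e-12

# t' shrinks toward zero as the speed approaches c.
assert dilated_time(1.0, 0.999) < dilated_time(1.0, 0.5)
```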
For instance, an expectation value corresponding to a Lorentz factor with a relative speed (performance) of 0.5 leads to an observed performance of 1/√2 according to (30). Insertion of this observed performance into the sigmoid function leads to the following result:

1 / (1 + 1/√2) ≈ 0.586    (32)

Note that this result is slightly larger than 0.5. This concludes my theoretical foray into the field of physics. We know from practical experiments that the observation of a physical experiment can actually change its outcome. The classic example for this fact is the famous double-slit experiment [21]. For this reason, some physicists have already suggested that they might have to include human perception in their models in order to develop a more complete and thus more powerful theory that can describe these effects. It remains to be seen to what extent the proposed perceptive model turns out to be useful in this respect.

9 Informational Intelligence

In this section, I am going to apply the concept of informational confidence to a practical problem. In order to do so, I divide this section into three subsections: In the first subsection, I show how to learn informational confidence values by estimating the necessary parameters on an evaluation set. In the second subsection, I present practical recognition rates of a multiple classifier system for handwritten Japanese character recognition. In the third subsection, I propose a new framework for machine learning in the form of a network architecture that implements the ideas introduced above, in particular general entropy. I therefore use the term "informational intelligence" as the title for this section in order to convey the broader meaning of informational confidence.

9.1 Informational Confidence Learning

In most practical cases, classifiers do not provide informational confidence values.
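The arithmetic of (32) can be reproduced in a few lines:

```python
import math

# Worked example (32): for performance p = 0.5, the observed performance
# per (30) is sqrt(p) = 1/sqrt(2); inserting it into the sigmoid form
# used in (32) gives a value slightly above 0.5.
observed = math.sqrt(0.5)            # 1/sqrt(2), about 0.707
perceived = 1.0 / (1.0 + observed)   # about 0.586
assert abs(perceived - 0.586) < 1e-3
assert perceived > 0.5
```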
Their confidence values typically violate the fixed point equation in the performance theorem, indicating a distorted equilibrium between information and confidence. Classifier combination therefore calls for a second training process in addition to the classifier-specific training methods teaching each classifier the decision boundaries of each class. Accordingly, I consider learning of informational confidence values to be a 3-step process: In the first step, I train a classifier with its specific training method and training set. In the second step, I estimate the performance for each confidence value on an evaluation set. Finally, I compute new informational confidence values by inserting the performance estimates into the fixed point equation of the performance theorem. The newly computed informational confidence values are stored in a look-up table and will replace the original raw confidence values in all future classifications. The fixed point equation of the performance theorem then formulates as follows:

K_i^new = −Ê · ln( 1 − p̂(K_i^old) )    (33)

where p̂(K_i^old) is the performance estimate of each raw confidence value K_i^old, Ê is the expectation estimate, and K_i^new is the new informational confidence value subsequently replacing K_i^old. In the following, I show how I compute the estimates Ê and p̂(K_i^old) on the evaluation set [3, 4, 6].

9.1.1 Expectation Estimate Ê

For the practical experiments in the next subsection, the classifier's global recognition rate R on the evaluation set will serve as the expectation estimate Ê. I additionally normalize the recognition rate R according to the overall information I(C) provided by classifier C.
Following the computation of information for confidence values, I(1 − p(K)), I estimate I(C) using the performance complement [3, 4]:

Î(C) = I(1 − R) = −ln(1 − R)    (34)

Based on the estimate Î(C), Ê computes as the Î(C)-th root of R, i.e. Ê = R^{1/Î(C)}, which maps the global recognition rate R to its normalized rate for a one-bit classifier. The fixed point equation in the performance theorem now formulates as follows:

K_i^new = −R^{1/Î(C)} · ln( 1 − p̂(K_i^old) )    (35)

This leaves us with the performance estimate as the only missing parameter to compute informational confidence values.

9.1.2 Performance Estimate p̂(K_i^old)

Motivated by the performance theorem, which states that the performance function follows an exponential distribution, I propose an estimate that expresses performance as a percentage of the maximum performance possible. Accordingly, my relative performance estimate describes the different areas delimited by the confidence values under their common density function. Mathematically, the performance estimate is based on accumulated partial frequencies defined by the following formula [17, 18]:

p̂(K_i^old) = ( Σ_{k=0}^{i} n_correct(K_k^old) ) / N    (36)

In this equation, N is the number of patterns contained in the evaluation set. The helper function n_correct(K_k^old) returns the number of patterns correctly classified with confidence K_k^old. The use of monotonically increasing frequencies guarantees that the estimated informational confidence values will not affect the order of the original raw confidence values:

K_i^old ≤ K_j^old ⟹ K_i^new ≤ K_j^new    (37)

For this reason, the performance estimate in (36) ensures that informational confidence values have no effect on the recognition rate of a single classifier, except for ties introduced by mapping two different confidence values to the same informational confidence value.
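A minimal sketch of the accumulated partial frequencies in (36), with made-up counts rather than data from the experiments:

```python
# Sketch of the performance estimate (36): accumulated partial frequencies
# over confidence values sorted in ascending order. The counts below are
# illustrative, not measured values.
def performance_estimates(correct_counts, total_patterns):
    """p_hat[i] = sum(correct_counts[0..i]) / N, per equation (36)."""
    estimates, running = [], 0
    for count in correct_counts:
        running += count
        estimates.append(running / total_patterns)
    return estimates

p_hat = performance_estimates([10, 25, 40, 15], 100)
assert p_hat == [0.10, 0.35, 0.75, 0.90]

# Accumulation makes the estimates monotonically non-decreasing,
# which preserves the order of the raw confidence values, cf. (37).
assert all(a <= b for a, b in zip(p_hat, p_hat[1:]))
```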
Ties can happen when two neighboring confidence values show the same performance and become indistinguishable due to insufficient evaluation data. In most applications, this should be no problem, though. Typically, the effect of informational confidence values shows only when we combine several classifiers into a multiple classifier system, with all classifiers learning their individual informational confidence values, unless we compute class-specific informational confidence values. Estimates based on accumulated partial frequencies act like a filter in that they do not consider single confidence values but a whole range of values. They average the estimation error over all confidence values in a confidence interval. This diminishes the negative effect of inaccurate measurements of the estimate p̂(K_i^old) in application domains with insufficient or erroneous evaluation data. Furthermore, estimation of informational confidence values can be considered a warping process aligning the progression of confidence values with the progression of performance. For experiments with other possible performance estimates, readers are referred to the references [3, 4, 6]. After normalization of the performance estimate p̂(K_i^old) to a one-bit classifier, as I already did for the expectation estimate, the final version of the fixed point equation in the performance theorem reads as follows:

K_i^new = −R^{1/Î(C)} · ln( 1 − p̂(K_i^old)^{1/Î(C)} )    (38)

Note that the newly computed informational confidence values K_i^new are an attractor of this fixed point equation. In other words, the fixed point will be reached after exactly one iteration of the training procedure, or rather estimation process. All additional iterations will produce exactly the same confidence values; i.e., K_i^new = K_i^old.
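Putting (34) and (38) together, an end-to-end computation of informational confidence values might look as follows. The recognition rate and performance estimates are illustrative values, not the measured ones:

```python
import math

# Sketch of the final update (38): I_hat = -ln(1 - R) from (34), and both
# R and the performance estimates normalized to a one-bit classifier via
# the I_hat-th root. Inputs below are illustrative placeholders.
def informational_confidence(p_hat, R):
    I_hat = -math.log(1.0 - R)                 # information estimate (34)
    E_hat = R ** (1.0 / I_hat)                 # normalized expectation
    return [-E_hat * math.log(1.0 - p ** (1.0 / I_hat)) for p in p_hat]

new_values = informational_confidence([0.10, 0.35, 0.75, 0.90], R=0.8994)

# The mapping is monotonic in p_hat, so the raw confidence order is
# preserved, in line with (37).
assert all(a <= b for a, b in zip(new_values, new_values[1:]))
```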
9.2 Practical Experiments

In this mainly theoretical paper, I confine myself to practical experiments for a multiple classifier system developed to recognize handwritten Japanese characters. Readers will find more information in the references, including other experiments with informational confidence values for document processing applications [3, 4, 6]. Handwriting recognition is a very promising application field for classifier combination. Multiple classifier systems have therefore a long tradition in handwriting recognition [22, 20]. In particular, the duality of handwriting recognition, with its two branches off-line recognition and on-line recognition, makes it suitable for multiple classifier systems. While off-line classifiers process static images of handwritten words, on-line classifiers operate on the dynamic data and expect point sequences over time as input signals. Compared to the time-independent off-line representations used by off-line classifiers, on-line classifiers suffer from stroke-order and stroke-number variations inherent in human handwriting and thus in on-line data. On the other hand, on-line classifiers are able to exploit the dynamic information and can very often discriminate between classes with higher accuracy. Off-line and on-line classifiers thus complement each other, and their combination can overcome the problem of stroke-order and stroke-number variations. This is especially important in Japanese and Chinese character recognition because the average number of strokes per character, and thus the number of variations, is much higher than in the Latin alphabet [5, 9].

Japanese   offline   online   AND     OR
1-best     89.94     81.04    75.41   95.56
2-best     94.54     85.64    82.62   97.55
3-best     95.75     87.30    84.99   98.06

Table 1: Single n-best rates for handwritten Japanese character recognition.
For my experiments, I use a multiple classifier system comprising two classifiers for on-line handwritten Japanese characters. Both classifiers are nearest neighbor classifiers. One of these two classifiers, however, transforms the captured on-line data into an off-line pictorial representation before applying the actual classification engine. This transformation happens in a pre-processing step and connects neighboring on-line points using a sophisticated painting method [19, 7]. We can therefore consider this classifier to be an off-line classifier. As mentioned above, learning of informational confidence values is a three-step process: First, each classifier is trained with its standard training method on a given training set. Then, I compute the performance of each confidence value for each classifier on an evaluation set, using the performance estimate in (36). In the last step, I estimate the informational confidence values based on the estimate given in (38). These estimates will then replace the original confidence values in all future classifications of each classifier. In my experiments, each classifier was initially trained on a training set containing more than one million handwritten Japanese characters. The test and evaluation set contains 54,775 handwritten characters. From this set, I take about two thirds of the samples to estimate the performances of confidence values and one third to compute the final recognition performance of the estimated informational confidence values. For more information about the classifiers and data sets used, I refer readers to the references [7, 11, 12]. Table 1 lists the individual recognition rates for the off-line and on-line classifier. It shows the probabilities that the correct class label is among the n-best alternatives having the highest confidence, with n = 1, 2, 3.
The off-line recognition rates are much higher than the corresponding on-line rates. Clearly, stroke-order and stroke-number variations are largely responsible for this performance difference. They considerably complicate the classification task for the on-line classifier. The last two columns of Table 1 show the percentage of test patterns for which the correct class label occurs either twice (AND) or at least once (OR) in the n-best lists of both classifiers. The relatively large gap between the off-line recognition rates and the numbers in the OR-column suggests that on-line information is indeed complementary and useful for classifier combination. Table 2 shows the recognition rates for combined off-line/on-line recognition, using sum-rule, max-rule, and product-rule as combination schemes. Sum-rule adds the confidence values provided by each classifier for the same class, while product-rule multiplies the confidence values. Max-rule simply takes the maximum confidence without any further operation. The class with the maximum overall confidence will then be chosen as the most likely class for the given test pattern. Note that sum-rule is the mathematically appropriate combination scheme for integration of information from different sources [15]. In addition, sum-rule is robust against noise, as was shown in [8].

Japanese (89.94)   Raw Confidence   Inf. Confidence
Sum-rule           93.25            93.78
Max-rule           91.30            91.14
Product-rule       92.98            65.16

Table 2: Combined recognition rates for handwritten Japanese character recognition.

The upper left cell of Table 2 lists again the best single recognition rate from Table 1, achieved by the off-line recognizer.
The second column contains the combined recognition rates for the raw confidence values as provided directly by the classifiers, while the third column lists the recognition rates for informational confidence values computed according to (38). Compared to the individual rates, the combined recognition rates in Table 2 are clear improvements. The sum-rule on raw confidence values already accounts for an improvement of almost 3.5%. The best combined recognition rate achieved with normalized informational confidence is 93.78%. It outperforms the off-line classifier, which is the best individual classifier, by almost 4.0%. Sum-rule performs better than max-rule and product-rule, a fact in accordance with the results in [8].

9.3 Neural Network Architecture

At the end of this paper, I am going to show how the results introduced above can be combined to form a network architecture for complex decision problems. The architecture I propose is similar to the well-known feedforward type of artificial neural networks in that a neuron first integrates its inputs and then applies a sigmoid function to compute the final output, which it propagates to the synapses of other neurons. The main motivation for the sigmoid function, however, derives from an information-theoretical background, as discussed in Section 5. Figure 10 shows the basic unit of the proposed "information network:" a neuron and its synapses. The basic idea is that each synapse computes one summand of the general entropy defined in (3) of Section 2. The main body of the neuron first integrates all these summands, computing the general entropy according to (1) and (3). The sigmoid function then computes the actual performance based on the general entropy. Finally, the neuron forwards the newly computed performance to other neurons, which in turn repeat the same process.
In this way, complex decisions become aggregates of simpler decisions.

Figure 10: Information network: a neuron whose synapses compute the summands −p(K_ij) ln(p(K_ij)) of their input performances p(K_ij), a summation stage integrating these summands, a sigmoid stage producing the output performance p(K_i), and a feedback input.

Similar to the training process in feedforward networks, the backpropagation of feedback trains the network in Figure 10. Instead of the gradient descent in parameter space that is typically implemented in feedforward networks, backpropagation for the network in Figure 10 means basically propagating the performance back so that each neuron can adjust its output. The performance can be directly inserted as part of the sigmoid function in (16). For instance, insertion of the performance values defined in (36) leads to the following expression for the output values, after additionally normalizing each performance value to one bit:

1 / ( 1 + p̂(K_i^old)^{1/Î(C)} )    (39)

In my experiments, a simple summation of the information provided by each output value, or rather classifier, for each class provides a recognition rate of 93.92% for the handwritten character recognition problem. This is better than the best recognition rate in Table 2. I hope to be able to support the proposed network architecture with additional experiments in other application domains, and by implementing a full-fledged network and not just a single layer.

10 Summary

I introduced a new form of entropy that can be considered an extension of the classic entropy introduced by Shannon. Each summand of this entropy is a fixed point equation in which the so-called performance function takes over the part of the probability.
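As a rough illustration of the neuron just described, the sketch below sums general-entropy summands and maps the result through a logistic sigmoid; the logistic form is assumed here for simplicity, whereas (39) uses the normalized variant with the Î(C)-th root.

```python
import math

# Sketch of one "information network" neuron: each synapse contributes
# the general-entropy summand -p * ln(p) of its input performance, the
# neuron sums the summands, and a sigmoid maps the entropy to the output
# performance. The logistic sigmoid is an illustrative assumption.
def neuron_output(input_performances):
    entropy = sum(-p * math.log(p) for p in input_performances if p > 0.0)
    return 1.0 / (1.0 + math.exp(-entropy))

out = neuron_output([0.2, 0.7, 0.9])
# The entropy sum is non-negative, so the output lands in [0.5, 1).
assert 0.5 <= out < 1.0
```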
However, the performance function plays several roles in my approach: It describes the distribution of an exponentially distributed random variable, and is also an expectation value in the statistical sense. Furthermore, with the exponential distribution typically used to describe statistical time processes, there is also a point in favor of it being time. The performance theorem in the first part of the paper summarizes these relationships and provides guidelines for learning informational confidence values for classifier combination. In my first practical results published in [3, 4, 6], I improved the recognition rates for several multiple classifier systems. In the present paper, I confined myself to the recognition rates for handwritten Japanese character recognition and concentrated on theoretical issues. I showed how to produce a symbol similar to the famous Yin/Yang symbol by depicting the net confidence as a spiral. The net confidence is the difference between the confidence and counter-confidence, with the latter being based on the performance complement. My understanding is that our perception is always the composite of Yin and Yang and does not reflect reality, except when the performance function equals the golden ratio. I thus assign an information-theoretical meaning to the golden ratio. Moreover, I understand that the sigmoid function provides the actual performance value that we cannot observe directly. Under these observations and assumptions, I can explain the time dilation of Einstein's Special Relativity. However, it follows that time dilation is mere perception and does not correspond to reality. At the end of the paper, I proposed a network architecture for complex decisions, which takes advantage of the general entropy concept.
I hope that the usefulness of this architecture can be confirmed by future experiments in different application fields.

Acknowledgment

I would like to thank Ondrej Velek, Akihito Kitadai, and Masaki Nakagawa for providing data for the practical experiments.

References

[1] T.A. Cook. The Curves of Life. Dover Publications, 1979.
[2] H.E. Huntley. The Divine Proportion. Dover Publications, 1970.
[3] S. Jaeger. Informational Classifier Fusion. In Proc. of the 17th Int. Conf. on Pattern Recognition, pages 216–219, Cambridge, UK, 2004.
[4] S. Jaeger. Using Informational Confidence Values for Classifier Combination: An Experiment with Combined On-Line/Off-Line Japanese Character Recognition. In Proc. of the 9th Int. Workshop on Frontiers in Handwriting Recognition, pages 87–92, Tokyo, Japan, 2004.
[5] S. Jaeger, C.-L. Liu, and M. Nakagawa. The State of the Art in Japanese Online Handwriting Recognition Compared to Techniques in Western Handwriting Recognition. International Journal on Document Analysis and Recognition, 6(2):75–88, 2003.
[6] S. Jaeger, H. Ma, and D. Doermann. Identifying Script on Word-Level with Informational Confidence. In Int. Conf. on Document Analysis and Recognition (ICDAR), pages 416–420, Seoul, Korea, 2005.
[7] S. Jaeger and M. Nakagawa. Two On-Line Japanese Character Databases in Unipen Format. In 6th International Conference on Document Analysis and Recognition (ICDAR), pages 566–570, Seattle, 2001.
[8] J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
[9] C.-L. Liu, S. Jaeger, and M. Nakagawa. Online Recognition of Chinese Characters: The State-of-the-Art. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 26(2):198–213, 2004.
[10] M. Livio. The Golden Ratio.
Random House, Inc., 2002.
[11] M. Nakagawa, K. Akiyama, L.V. Tu, A. Homma, and T. Higashiyama. Robust and Highly Customizable Recognition of On-Line Handwritten Japanese Characters. In Proc. of the 13th International Conference on Pattern Recognition, volume III, pages 269–273, Vienna, Austria, 1996.
[12] M. Nakagawa, T. Higashiyama, Y. Yamanaka, S. Sawada, L. Higashigawa, and K. Akiyama. On-Line Handwritten Character Pattern Database Sampled in a Sequence of Sentences without Any Writing Instructions. In Fourth International Conference on Document Analysis and Recognition (ICDAR), pages 376–381, Ulm, Germany, 1997.
[13] J.R. Pierce. An Introduction to Information Theory: Symbols, Signals, and Noise. Dover Publications, Inc., New York, 1980.
[14] W. Sacco, W. Copes, C. Sloyer, and R. Stark. Information Theory: Saving Bits. Janson Publications, Inc., Dedham, MA, 1988.
[15] C.E. Shannon. A Mathematical Theory of Communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
[16] N.J.A. Sloane and A.D. Wyner. Claude Elwood Shannon: Collected Papers. IEEE Press, Piscataway, NJ, 1993.
[17] O. Velek, S. Jaeger, and M. Nakagawa. A New Warping Technique for Normalizing Likelihood of Multiple Classifiers and its Effectiveness in Combined On-Line/Off-Line Japanese Character Recognition. In 8th International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 177–182, Niagara-on-the-Lake, Canada, 2002.
[18] O. Velek, S. Jaeger, and M. Nakagawa. Accumulated-Recognition-Rate Normalization for Combining Multiple On/Off-line Japanese Character Classifiers Tested on a Large Database. In 4th International Workshop on Multiple Classifier Systems (MCS), pages 196–205, Guildford, UK, 2003. Lecture Notes in Computer Science, Springer-Verlag.
[19] O. Velek, C.-L. Liu, S. Jaeger, and M. Nakagawa.
An Improved Approach to Generating Realistic Kanji Character Images from On-Line Characters and its Benefit to Off-Line Recognition Performance. In 16th International Conference on Pattern Recognition (ICPR), volume 1, pages 588–591, Quebec, 2002.
[20] W. Wang, A. Brakensiek, and G. Rigoll. Combination of Multiple Classifiers for Handwritten Word Recognition. In Proc. of the 8th International Workshop on Frontiers in Handwriting Recognition (IWFHR-8), pages 117–122, Niagara-on-the-Lake, Canada, 2002.
[21] Wikipedia. Double-slit experiment, 2006. http://www.wikipedia.org.
[22] L. Xu, A. Krzyzak, and C.Y. Suen. Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition. IEEE Trans. on Systems, Man, and Cybernetics, 22(3):418–435, 1992.
[23] Wikipedia. Yin and yang. http://www.wikipedia.org.
