Entropy, Perception, and Relativity


Authors: Stefan Jaeger

LAMP-TR-131  CAR-TR-1012  CS-TR-4799  UMIACS-TR-2006-20
April 2006

Stefan Jaeger
Language and Media Processing Laboratory
Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742-3275
jaeger@umiacs.umd.edu

Abstract

In this paper, I expand Shannon's definition of entropy into a new form of entropy that allows integration of information from different random events. Shannon's notion of entropy is a special case of my more general definition of entropy. I define probability using a so-called performance function, which is de facto an exponential distribution. Assuming that my general notion of entropy reflects the true uncertainty about a probabilistic event, I understand that our perceived uncertainty differs. I claim that our perception is the result of two opposing forces similar to the two famous antagonists in Chinese philosophy: Yin and Yang. Based on this idea, I show that our perceived uncertainty matches the true uncertainty in points determined by the golden ratio. I demonstrate that the well-known sigmoid function, which we typically employ in artificial neural networks as a non-linear threshold function, describes the actual performance. Furthermore, I provide a motivation for the time dilation in Einstein's Special Relativity, basically claiming that although time dilation conforms with our perception, it does not correspond to reality. At the end of the paper, I show how to apply this theoretical framework to practical applications. I present recognition rates for a pattern recognition problem, and also propose a network architecture that can take advantage of general entropy to solve complex decision problems.

Keywords: Information Theory, Entropy, Sensor Fusion, Machine Learning, Perception, Special Relativity.
The support of this research by the Department of Defense under contract MDA-9040-2C-0406 is gratefully acknowledged.

1 Introduction

Uncertainty is our constant companion in everyday decision making. Being able to deal with uncertainty is thus an essential requirement for intelligent behavior in real-world environments. Naturally, knowing the exact amount of uncertainty involved in a particular decision is very useful information to have. Mathematically, the classic way of measuring the uncertainty of a random event is to compute its information based on the definition of entropy introduced by Shannon [15]. In this paper, however, I introduce a new, general form of entropy that is motivated by my earlier work on classifier combination [3, 4, 6]. The idea of classifier combination, or sensor fusion in general, is to combine the outcomes of different sub-optimal processes into one integrated result. Ideally, the integrated process performs better in the given application domain than each individual process alone. In order to integrate different processes into a single process, computers need to deal with the uncertainties involved in the outcomes of each individual process. For classifier combination, several combination schemes have already been suggested. The current state of the art, however, has not given its final verdict on this issue yet. In my earlier work, I proposed an information-theoretical approach to this problem. The main idea of this approach is to normalize confidence values in such a way that their nominal values match their conveyed information, which I measure on a training set in the application domain. The overall combined confidence for each class is then simply the sum of the normalized confidence values of each individual classifier.
In this paper, I am going to elaborate on my earlier ideas by looking at them from the general entropy's point of view. I structured the paper as follows: Following this introduction, Section 2 repeats the definition of entropy as introduced by Shannon, and compares it to my new and more general definition. Section 3 provides a short introduction to my earlier work on informational confidence and repeats the main postulates and their immediate consequences. Section 4 describes how I understand confidence as the result of an interplay of two opposing forces. In Section 5, this insight will show the sigmoid function of classic backpropagation networks in a different light, namely as a kind of mediator between these two forces. A closer inspection in Section 6 reveals that the net effect of both opposing forces equals one single force in points defined by the golden ratio. In Section 7, I relate the introduced forces to the well-known forces of Yin and Yang in Chinese philosophy. In particular, I show how we can derive the typical Yin-Yang symbol from the assumptions made. In Section 8, I explore common grounds of the general framework presented here and Einstein's Special Relativity. I provide an interesting motivation for the time dilation in Einstein's Special Relativity. Section 9 is then going to show how we can learn informational confidence values, illustrating the learning process with a practical example of handwritten Japanese character recognition. This section also proposes a network architecture for learning based on the ideas introduced in the previous sections. Finally, a summary with the main results concludes the paper.

2 Entropy

Entropy is a measure of the uncertainty in a random event or signal. Alternatively, we can understand entropy as the amount of information conveyed by the random event or carried by the signal.
Entropy is a general concept that has applications in statistical mechanics, thermodynamics, and of course information theory in computer science. The latter will be the focus of my attention in the following. At the end of my paper, I will present an interesting connection with Einstein's Special Relativity and physics, though. Claude E. Shannon introduced entropy as a measure for randomness in his seminal 1948 paper "A Mathematical Theory of Communication." For a discrete random event with n possible outcomes, Shannon defines the entropy H as the sum of the expected information K_i for each outcome i:

    H = \sum_{i=1}^{n} K_i    (1)

Shannon uses the negative logarithm to compute information itself. In this way, he can simply add the information of two independent outcomes to get the combined information of both. Accordingly, each K_i in (1) reads as follows:

    K_i = -p(i) \ln(p(i)),    (2)

with p(i) denoting the probability of the i-th outcome. The entropy reaches a maximum when all p(i) are equal, which indicates maximum uncertainty. On the other hand, the entropy is minimal, i.e. zero, if exactly one p(i) is 1 and all other outcomes have a probability of zero. I am now introducing the following more general variant that I will be using instead of (2) to compute the entropy H:

    K_i = -p(K_i) \ln(p(K_i))    (3)

In this new form, the expected information for each outcome appears on both sides of the equation, effectively making (3) a fixed point equation. Instead of using the probability p(i) of an outcome, I am now using the probability of the outcome's specific information. I also do not require the sum of all probabilities p(K_i) to be one.
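The two definitions can be contrasted numerically. The sketch below (not from the paper) computes Shannon entropy via (1)-(2) and solves the fixed point equation (3) by simple iteration; the choice of performance function p(K) = 1 - e^{-K} (the exponential form of Section 3 with E = 1) is an assumption made here purely for illustration.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = sum_i K_i with K_i = -p(i) * ln p(i), equations (1)-(2)."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def general_summand(perf, k0=0.5, iters=200):
    """Solve the fixed point K = -p(K) * ln p(K) of equation (3) by simple
    iteration, where perf maps an information value K to a probability p(K)."""
    k = k0
    for _ in range(iters):
        p = perf(k)
        k = -p * math.log(p)
    return k

# Uniform event with 4 outcomes: H = ln 4.
h = shannon_entropy([0.25] * 4)

# Fixed point of (3) for the illustrative performance p(K) = 1 - exp(-K).
k_star = general_summand(lambda k: 1.0 - math.exp(-k))
```

For this particular performance function the iteration converges quickly; at the fixed point, K and -p(K) ln p(K) agree to numerical precision, which is exactly condition (3).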
A straightforward comparison of (2) and (3) shows that Shannon's definition of entropy and its more general variant are the same if each outcome satisfies the following equation:

    p(i) = p(K_i)    (4)

In other words, both definitions of entropy are the same when the probability of each outcome matches the probability of its information, which we can consider to be a fixed point. The next section gives a motivation for the general entropy formula using pattern recognition, and in particular classifier combination, as a practical example.

3 Informational Confidence

Pattern recognition is a research field in computer science dealing with the automatic classification of pattern samples into different classes. Depending on the application domain, typical classes are, e.g., characters, gestures, traffic signs, faces, etc. For a given unknown test pattern, most classifiers return both the actual classification result in the form of a ranked list of class labels, and corresponding values indicating the confidence of the classifier in each class label. I will be using the term "confidence value" for these values throughout the paper, but I should mention that other researchers may prefer different terms, such as "score" or "likelihood." In practical classifier systems, confidence values are usually only rough approximations of their mathematically correct values. In particular, they very often do not meet the requirements of probabilities. While this usually does not hamper the operation of a single classifier, which only depends on the relative proportion of confidence values, it causes problems in multiple classifier systems, which need the proper values for combination purposes. Post-processing steps, such as linguistic context analysis for character recognition, can also benefit from more accurate confidence values.
Combination of different classifiers in a multiple classifier system has turned out to be a powerful tool for reducing the uncertainty involved in a classification problem [8]. Researchers have shown in numerous experiments that the performance of the combined classifiers can exceed the performance of each single classifier. Nevertheless, researchers are still undecided about how best to integrate the confidence values of each individual classifier into one single confidence. In earlier work, I proposed so-called informational confidence values as a way to combine multiple confidence values [3, 4, 6]. The idea of informational confidence values is to introduce a standard of comparison allowing fair comparison and easy integration of confidence values generated by different classifiers. The definition of informational confidence values relies on two central postulates:

1. Confidence is information
2. Information depends on performance

The first postulate states that each confidence value conveys information, and it consequently requires that the nominal value of each confidence value should equal the information conveyed. The second postulate then logically continues by requiring that the amount of information conveyed should depend on the performance of the confidence value in the application domain. From both postulates taken together, it follows that confidence depends on performance via information. To formalize these requirements, let me assume that each classifier C can output confidence values from a set of confidence values K_C, with

    K_C = {K_0, K_1, ..., K_i, ..., K_N}    (5)

Let me further assume that K_N indicates the highest confidence classifier C can output.
The following fixed point equation then defines a linear relationship between confidence and information, with the latter depending on the performance complement of each confidence value:

    K_i = E * I(p(K_i)) + C    (6)

We see that the confidence values K_i appear on both sides of Equation (6), essentially making it a fixed point equation with the so-called informational confidence values as fixed points. Using the performance complement ensures that higher confidence values with better performance convey more information than lower confidence values when we apply Claude Shannon's logarithmic notion of information [15]. According to Shannon, the information of a probabilistic event is the negative logarithm of its probability. More information on Shannon's work and the implications of his strikingly simple definition of information can be found in [13, 14, 16]. By setting constant C to zero, inserting the negative logarithm as information function I, and using 1 - p(K_i) as performance complement, I simplify Equation (6) to the following definition of informational confidence:

    K_i = -E \ln(1 - p(K_i))    (7)

The still unknown parameters necessary to compute informational confidence values according to (7) are E and p(K_i). A straightforward transformation of (7) sheds more light on these two parameters:

    K_i = -E \ln(1 - p(K_i))
    ⟺ e^{-K_i/E} = 1 - p(K_i)
    ⟺ p(K_i) = 1 - e^{-K_i/E}    (8)

The result shows that the performance function p(K_i) describes an exponential distribution with expectation value E.
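Equations (7) and (8) are inverses of each other, which a short sketch (mine, not the paper's code) can make concrete:

```python
import math

def informational_confidence(p, E):
    """K = -E * ln(1 - p), equation (7): maps a measured performance
    p = p(K) in [0, 1) to its informational confidence value."""
    return -E * math.log(1.0 - p)

def performance(K, E):
    """Inverse mapping p(K) = 1 - exp(-K / E), equation (8)."""
    return 1.0 - math.exp(-K / E)
```

For example, with E = 2 a performance of 0.75 maps to K = -2 ln(0.25), and applying `performance` to that K recovers 0.75, confirming the round trip between (7) and (8).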
This follows from the general definition of an exponential density function e_λ(x) with parameter λ:

    e_λ(x) = λ e^{-λx} for x ≥ 0, and e_λ(x) = 0 for x < 0, with λ > 0    (9)

For each λ, the enclosed area of the density function equals 1:

    ∫_{-∞}^{∞} e_λ(x) dx = ∫_{0}^{∞} λ e^{-λx} dx = 1   for all λ > 0    (10)

Figure 1 shows three different exponential densities differing in their parameter λ, with λ = 100, λ = 20, and λ = 10, respectively. The parameter λ has a direct influence on the steepness of the exponential density function: the higher λ, the steeper the density function. The corresponding distribution E_λ(k), which describes the probability that the random variable assumes values lower than or equal to a given value k, computes as follows:

    E_λ(k) = ∫_{-∞}^{k} e_λ(x) dx = ∫_{0}^{k} λ e^{-λx} dx = [-e^{-λx}]_{0}^{k} = 1 - e^{-λk}    (11)

[Figure 1: Exponential density for λ = 100, λ = 20, and λ = 10.]

[Figure 2: Exponential distribution for λ = 100, λ = 20, and λ = 10.]

Figure 2 shows the distributions for the three different densities depicted in Figure 1, with λ = 100, λ = 20, and λ = 10. The parameter λ again influences the steepness: a larger λ entails a steeper distribution. For each parameter λ, the distribution function converges on 1 with increasing confidence. Another important feature is the relationship between parameter λ and the expectation value E of the exponentially distributed random variable. Both are in inverse proportion to each other, with E = 1/λ. Accordingly, the expectation values corresponding to the exponential densities in Figure 1, and distributions in Figure 2, are E = 1/100, E = 1/20, and E = 1/10, respectively.
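As a quick numerical sanity check (my own sketch, not part of the paper), integrating the density (9) from 0 to k should reproduce the closed-form distribution (11):

```python
import math

def exp_density(x, lam):
    """Exponential density e_lambda(x) of equation (9)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exp_distribution(k, lam):
    """Closed-form distribution E_lambda(k) = 1 - exp(-lam * k), equation (11)."""
    return 1.0 - math.exp(-lam * k) if k >= 0 else 0.0

# Trapezoid-rule integral of the density over [0, 1] for lambda = 10.
lam, n = 10.0, 10_000
h = 1.0 / n
area = sum((exp_density(i * h, lam) + exp_density((i + 1) * h, lam)) * h / 2
           for i in range(n))
```

The numerically integrated area agrees with `exp_distribution(1.0, lam)` to well within the trapezoid-rule error, illustrating the derivation step in (11).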
When we compare the performance specification in (8) with the exponential distribution in (11), we see that the only difference lies in the exponent of the exponential function. In fact, performance function and exponential distribution become identical for λ = 1/E. This result shows that the performance function p(K_i) describes the distribution of exponentially distributed confidence values with expectation E. We can therefore consider confidence as an exponentially distributed random variable with parameter λ = 1/E. The performance theorem summarizes this important result:

Performance Theorem: A classifier C with performance p(K) provides informational confidence K = -E \ln(1 - p(K)) if, and only if, p(K) is an exponential distribution with expectation E.

The performance theorem explains the meaning and implications of the parameters E and p(K). For classifiers violating the performance theorem, the equation stated in the performance theorem allows us to compute the proper informational confidence values as long as we know the specific values of E and p(K). Section 9 will later show how we can estimate these parameters on a given evaluation set. In the next section, I take the idea of informational confidence a step further and introduce a second type of confidence called counter-confidence, which describes the confidence of the classifier in the falseness of its output. The subsequent sections then elaborate on this concept and present new theoretical results and discuss their implications.

4 Opposing Forces

I am assuming that decision making is based on two opposing forces, one supporting a certain outcome and one arguing against it. In particular, I am going to propose a formalization of both forces, which I name Force A and Force B for the time being, based on the fixed point equation of the performance theorem.
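Since the performance theorem models confidence as exponentially distributed with expectation E, one plausible estimator for E on an evaluation set is simply the sample mean (the maximum-likelihood estimate for an exponential distribution). This is my own illustrative sketch; the paper's actual estimation procedure is only described later, in Section 9.

```python
import random

def estimate_expectation(confidences):
    """Sample mean of observed confidence values. If the confidences are
    exponentially distributed (performance theorem), this is the
    maximum-likelihood estimate of the expectation E = 1/lambda."""
    return sum(confidences) / len(confidences)

# Demo on synthetic exponential data with lambda = 20, i.e. E = 1/20 = 0.05.
random.seed(0)
samples = [random.expovariate(20.0) for _ in range(100_000)]
E_hat = estimate_expectation(samples)
```

On the synthetic sample, `E_hat` lands close to the true expectation 0.05, matching the E = 1/λ relationship stated above.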
In fact, I postulate that Force A is already defined by this equation. Force B only differs in its interpretation of performance.

4.1 Force A

The first force, Force A, describes the confidence in a particular decision. Accordingly, I use the fixed point equation of informational confidence values as the definition of Force A:

    K = -E \ln(1 - p(K))    (12)

The left-hand side of this equation denotes the magnitude of Force A. It is the product of information in the Shannon sense and an expectation value in the statistical sense. As shown above, the performance function p(K) follows immediately as p(K) = 1 - e^{-K/E}. If the performance in the logarithmic expression on the right-hand side of (12) is 1, and the expectation E is positive, then Force A becomes infinite. On the other hand, if the performance is zero, then the logarithm becomes zero and there is no Force A at all.

4.2 Force B

The second force, Force B, is defined similarly but performs complementary to Force A. Force B describes information that depends directly on the performance and not on the performance complement. Accordingly, the following modified fixed point equation describes Force B:

    K = -E \ln(p(K))    (13)

The difference to Force A lies in the interpretation of the performance function p(K), which follows again from a straightforward transformation:

    K = -E \ln(p(K)) ⟺ p(K) = e^{-K/E}    (14)

We see that the performance function of Force B is similar to the performance of Force A. However, it looks at the problem from a different side. Instead of describing the area delimited by K under the exponential density curve, it describes the remaining area that is not delimited. Parameter E is again a statistical expectation value. Unlike Force A, Force B becomes infinite for a performance equal to zero and positive expectation.
It becomes zero whenever the performance is perfect, i.e. p(K) = 1. While Force A defines informational confidence values, Force B can be considered as defining informational counter-confidence values.

4.3 Interplay of Forces

Having defined both Force A and Force B, I postulate that all decision processes are the result of the interplay between these two forces. What we can actually experience when making decisions is the dominance one of these forces has achieved over its counterpart. Mathematically, I understand this dominance as the net effect of both forces and thus use the difference between the defining equations in (12) and (13) to describe it:

    K = -E \ln((1 - p(K)) / p(K))    (15)

This equation is a fixed point equation itself. It describes the net force, which is the result of both forces acting simultaneously. Naturally, the net force becomes zero when Force A equals Force B. This is the case when either the expectation value is zero or the performance p(K) is 0.5. The net force becomes either infinity or minus infinity when one force dominates completely over its counterpart. In particular, the net force becomes infinity when Force A dominates with p(K) = 1 and minus infinity when Force B dominates with p(K) = 0. The following two sections are going to present two more interesting theoretical results, which are a direct consequence of the net force defined by (15), namely the sigmoid function and the golden ratio. Section 7 will later relate Force A and Force B to the well-known antagonistic forces in Chinese philosophy: Yin and Yang.

5 Sigmoid Function

A closer look at the net force defined in (15) reveals that the performance function is indeed a well-known function. A straightforward derivation leads to the following result:

    K = -E \ln((1 - p(K)) / p(K))
    ⟺ e^{-K/E} = 1/p(K) - 1
    ⟺ p(K) = 1 / (1 + e^{-K/E})    (16)

It shows that the performance function is actually identical to the type of sigmoid function that classical feedforward network architectures very often use as a threshold function. The traditional explanation for the use of this particular threshold function has always lain in its features of non-linearity and simplicity. Non-linearity increases the expressiveness of a neural network, allowing decision boundaries in feature space that a simple linear network would not be able to model. A neural network with only linear output functions would simply collapse into a single linear function, which cannot model complex decision boundaries. The other advantage of the sigmoid function in (16) is the simplicity of its derivative, which facilitates the backpropagation of errors during the training of neural networks. While these are surely important points, it now seems that the deeper meaning of the sigmoid function has more of an information-theoretical nature, as motivated above.

[Figure 3: Sigmoid function 1/(1 + e^{-K/E}) for E = 1, E = 1/2, E = 1/3, and E = 2.]

Figure 3 shows the sigmoid function in (16) for four different parameters E, namely E = 1, E = 1/2, E = 1/3, and E = 2. As its name already suggests, the sigmoid function has an S-shape. It converges on 0 towards negative infinity and on 1 towards infinity. The parameter E controls the steepness of the sigmoid function. For smaller values of E, the sigmoid function becomes steeper and approaches 0 or 1 faster on both ends. Independent of E, the sigmoid function is always 0.5 for K = 0.

6 The Golden Ratio

I now assume that the performance of a given confidence value K always matches the expectation exactly, i.e., in other words, E = p(K).
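The relationship between (15) and (16) is an exact inversion, which this short sketch (mine, for illustration) verifies: feeding a performance value through the net force and then through the sigmoid recovers the original value.

```python
import math

def net_force(p, E):
    """Net force K = -E * ln((1 - p) / p), equation (15)."""
    return -E * math.log((1.0 - p) / p)

def sigmoid(K, E):
    """Performance recovered from K: p = 1 / (1 + exp(-K / E)), equation (16)."""
    return 1.0 / (1.0 + math.exp(-K / E))

# Round trip for an illustrative E = 2 and a few performances.
checks = [sigmoid(net_force(p, 2.0), 2.0) for p in (0.1, 0.5, 0.9)]
```

Note also that `sigmoid(0, E)` is 0.5 for any E, matching the observation above that the net force vanishes exactly at p(K) = 0.5.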
Note that this corresponds to the form of the summands of the general entropy in Section 2. The net force equation in (15) will then read as follows:

    K = -p(K) \ln((1 - p(K)) / p(K))    (17)

[Figure 4: Net force y(p) = -p \ln((1-p)/p) and its mirrored variant z(p) = -(1-p) \ln((1-p)/p).]

Figure 4 depicts the net force in (17) graphically for performance values p(K) ranging from 0 to 1. As we can see in Figure 4, the net force becomes zero for p(K) = 0 and p(K) = 0.5. For performances higher than 0.5 and approaching 1, the net force diverges to infinity. Figure 4 also shows a mirrored variant of the net force, namely

    K = -(1 - p(K)) \ln((1 - p(K)) / p(K))    (18)

This equation is a direct result of (17) after changing the sign and replacing the performance p(K) with its complement 1 - p(K). The net force and its mirrored variant both meet at p(K) = 0.5. We can actually consider p(K) = 0.5 as a transition point where the net force transforms into its mirrored variant. After the transition, we are still looking at the same problem. However, our point of view has changed and is now reflected by the mirrored net force. This will become important later in Section 7, where we relate these forces to Yin and Yang. For the time being, let us concentrate on the net force in (17). The net force and the counter-confidence (Force B) in (13), with E = p(K), become equal when the performance p(K) satisfies the following relationship:

    p(K) = (1 - p(K)) / p(K)    (19)
    ⟺ p(K)^2 + p(K) - 1 = 0
    ⟺ p(K) = -1/2 ± \sqrt{1/4 + 1}
    ⟺ p(K) = (\sqrt{5} - 1)/2  ∨  p(K) = (-1 - \sqrt{5})/2
    ⟺ p(K) ≈ 0.618  ∨  p(K) ≈ -1.618    (20)

[Figure 5: The golden ratio: a + b is to a as a is to b.]
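The quadratic in (19)-(20) can be solved directly; this sketch (mine) confirms that its roots are the two values quoted above, and that the positive root is the reciprocal (equivalently, the decrement by one) of the golden ratio φ discussed next.

```python
import math

# Roots of p^2 + p - 1 = 0, the condition (19) under which the net force (17)
# equals the counter-confidence (13) with E = p(K).
p_pos = (-1.0 + math.sqrt(5.0)) / 2.0   # approx.  0.618
p_neg = (-1.0 - math.sqrt(5.0)) / 2.0   # approx. -1.618

# Golden ratio phi = (1 + sqrt 5) / 2; note p_pos = phi - 1 = 1/phi
# and p_neg = -phi, i.e. the roots are the negated golden-ratio values.
phi = (1.0 + math.sqrt(5.0)) / 2.0
```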
This transformation shows that counter-confidence and net force are the same for a performance of about 0.618, when just considering the positive performance value. Interestingly, this transformation also shows that the two possible values satisfying (19), namely ≈ -1.618 and ≈ 0.618, are precisely the negated values of the so-called golden ratio. Force B thus equals the compound effect of Force A and Force B for performances defined by the golden ratio. With a detailed introduction to the golden ratio being out of scope, I provide only some background information about the golden ratio, or golden mean as it is also called [10, 2]. The golden ratio is an irrational number, or rather two numbers, describing the proportion of two quantities. Expressed in words, two quantities are in the golden ratio to each other if the whole is to the larger part as the larger part is to the smaller part. The whole in this case is simply the sum of both parts. Figure 5 shows an example of a line divided into two segments that are in the golden ratio to each other. Historically, the golden ratio was already studied by ancient mathematicians. It plays an important role in different fields like geometry, biology, physics, and others. Many artists and designers deliberately or unconsciously make use of it because it seems that artwork based on the golden ratio has an esthetic appeal, and features some kind of natural symmetry. Despite the fact that the golden mean is of paramount importance to so many fields, I think it is fair to say that we still do not have a full, or rather correct, understanding of its true meaning in science. Mathematically, the golden mean can be derived from the following equation, which expresses the colloquial description given above in mathematical terms:
    (a + b) / a = a / b    (21)

Accordingly, the golden mean, which is typically denoted by the Greek letter φ, is then given by the ratio of a and b, i.e. φ = a/b. Using the relationship in (21), the golden ratio φ can be resolved into two possible values:

    φ = (1 + \sqrt{5})/2  ∨  φ = (1 - \sqrt{5})/2    (22)
    ⟹ φ ≈ 1.618  ∨  φ ≈ -0.618    (23)

Usually, the positive value (≈ 1.618) is identified with φ. Note that these values are the same as in (20), except that their signs are reversed. The reader interested in a thorough analysis of the golden mean can find more information and many practical examples in the references [10, 2].

7 Yin and Yang

I will now relate the above theoretical results to one of the oldest philosophical world views, namely the principle of Yin and Yang. In particular, I dare to advance the hypothesis that both Force A and Force B, which I defined respectively in (12) and (13) using fixed point equations, correspond to the two opposing forces Yin and Yang when we assume that expectation always equals performance, i.e. E = p(K). If this can indeed be confirmed by further observations, this ancient philosophical concept could play an important role in computer science. In fact, I will provide further evidence of this claim and also show how we can use the concept of Yin and Yang for machine learning. Let me begin with a short summary of the Yin/Yang concept in Chinese philosophy.

7.1 Philosophy

The concept of Yin and Yang is deeply rooted in Chinese philosophy [23]. Its origin dates back at least 2500 years, probably much earlier, playing a crucial role in the oldest Chinese philosophical texts. Chinese philosophy has attached great importance to Yin/Yang ever since. Today, the idea of Yin/Yang pervades fields as different as religion, sports, medicine, politics, and many more.
The fact that the Korean national flag sports a Yin/Yang symbol illustrates the emphasis laid on this concept in Asian countries. Yin and Yang stand for two principles that are opposites of each other, and which are constantly trying to gain the upper hand over each other. However, neither one will ever succeed in doing so, though one principle may temporarily dominate the other one. Both principles cannot exist without each other. It is rather the constant struggle between both principles that defines our world and produces the rhythm of life. According to Chinese philosophy, Yin and Yang are the foundation of our entire universe. They flow through, and thus affect, every being. Typical examples of Yin/Yang opposites are night/day, cold/hot, rest/activity, etc. Chinese philosophy does not confine itself to a mere description of Yin and Yang. It also provides guidelines on how to live in accordance with Yin and Yang. The central statement is that Yin and Yang need to be in harmony. Any imbalance of an economical, biological, physical, or chemical system can be directly attributed to a distorted equilibrium between Yin and Yang. For instance, an illness accompanied by fever is the result of Yang being too strong and dominating Yin. On the other hand, dominance of Yin could result, for instance, in a body shivering with cold. The optimal state every being, or system, should strive for is therefore the state of equilibrium between Yin and Yang. It is this state of equilibrium between Yin and Yang that Chinese philosophy considers the most powerful and stable state a system can assume. Yin and Yang can be further subdivided into Yin and Yang.
For instance, "cold" can be further divided into "cool" or "chilly," and "hot" into "warm" or "boiling." Yin and Yang already carry the seed of their opposites: a dominating Yin becomes susceptible to Yang and will eventually turn into its opposite. On the other hand, a dominating Yang gives rise to Yin and will thus turn into Yin over time. This defines the perennial alternating cycle of Yin and Yang dominance. Only the equilibrium between Yin and Yang is able to overcome this cycle.

[Figure 6: Yin and Yang.]

[Figure 7: Logarithmic spiral.]

7.2 Logarithmic Spirals

Figure 6 depicts the well-known black and white symbol of Yin and Yang. The dots of different color in the area delimited by each force symbolize the fact that each force bears the seed of its counterpart within itself. According to the principle of Yin and Yang outlined above, neither Yin nor Yang can be observed directly. Both Yin and Yang are intertwined forces always occurring in pairs, rather than being isolated forces independent from each other. In Chinese philosophy, Yin and Yang assume the form of spirals. I will now show that the net force in (17) is a spiral too. In order to do so, I will first introduce the general definition of the logarithmic spiral before I then illustrate the similarity to the famous Yin/Yang symbol. A logarithmic spiral is a special type of spiral curve, which plays an important role in nature. It occurs in all different kinds of objects and processes, such as mollusk shells, hurricanes, galaxies, and many more [1]. In polar coordinates (r, θ), the general definition of a logarithmic spiral is

    r = a e^{bθ}    (24)

Parameter a is a scale factor determining the size of the spiral, while parameter b controls the direction and tightness of the wrapping.
For a logarithmic spiral, the distances between the turnings increase. This distinguishes the logarithmic spiral from the Archimedean spiral, which features constant distances between turnings. Figure 7 depicts a typical example of a logarithmic spiral. Resolving (24) for θ leads to the following general form of logarithmic spirals:

θ = (1/b) ln(r/a)    (25)

Figure 8: Yin-Yang spirals, plotting the pairs (y, d) and (−y, −d) with y = −p ln(p/(1−p)) and d = (1−p) e^{−y/p}.

In order to show that the net force in (17) defines a logarithmic spiral, and for the sake of easier illustration, I investigate the negative version of the net force in (17) and look at the polar coordinates (r, θ) it defines, namely:

θ = −p(K) ln( p(K) / (1−p(K)) )   and   r = (1−p(K)) e^{−θ/p(K)}    (26)

A comparison of (26) with the general form of logarithmic spirals in (25) shows that the net force does indeed describe a spiral. Both (25) and (26) match when we set the parameters a and b to the following values:

a = 1 − p(K)   and   b = −1/p(K)    (27)

In particular, we can check that a and b are identical when p(K) equals the golden ratio. If we let p(K) run from 0 to 1, and mirror the resulting spiral along both axes, similar to Figure 4, we receive two spirals. Figure 8 shows both spirals plotted in a Cartesian coordinate system. Both spirals are, of course, symmetrical, and their turnings approach the unit circle. A comparison of the Yin/Yang symbol of Figure 6 with the spirals in Figure 8 shows the strong similarities between both figures. A simple mirror operation transforms the spirals in Figure 8 into the Yin/Yang symbol. The addition of a time dimension to Figure 8 generates a three-dimensional object.
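The parameter match in (27) can be verified numerically. The sketch below uses illustrative performance values, not values from the experiments: it evaluates the polar coordinates in (26), checks them against the general form (25), and confirms that a and b coincide at the golden ratio (1+√5)/2.

```python
import math

def net_force_polar(p):
    """Polar coordinates (theta, r) of the net force per (26)."""
    theta = -p * math.log(p / (1.0 - p))
    r = (1.0 - p) * math.exp(-theta / p)
    return theta, r

for p in (0.2, 0.5, 0.8):
    theta, r = net_force_polar(p)
    a, b = 1.0 - p, -1.0 / p          # parameter choice (27)
    # General spiral form (25): theta = (1/b) * ln(r/a)
    assert abs(theta - math.log(r / a) / b) < 1e-12

# The parameters a and b coincide at p = (1 + sqrt(5)) / 2,
# since phi^2 = phi + 1 implies 1 - phi = -1/phi.
phi = (1.0 + math.sqrt(5.0)) / 2.0
assert abs((1.0 - phi) - (-1.0 / phi)) < 1e-12
```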
It resembles a funnel or trumpet that has a wide circular opening on the upper end and narrows towards the origin. Figure 9 depicts this "informational universe," which follows directly from the two-dimensional graphic in Figure 8 when I use the performance values as time coordinates for the third axis. Note that the use of performance as time is reasonable because the exponential distribution is typically used to model dynamic time processes, and its expectation value is thus typically associated with time. This will also be an important point in the next section.

Figure 9: Informational universe.

8 Relativity

This section discusses the net force in a wider context and from a physical point of view. I begin by revisiting the net force as introduced in (15):

K = −E ln( (1−p(K)) / p(K) )    (28)

The net force describes the net effect of the two forces defined in (12) and (13), respectively. As I showed above, each force entails its own interpretation of the performance function p(K). However, the net effect of both forces in (28), which computes simply as the difference between both forces, provides no information about the interpretation of p(K). Both interpretations, i.e. the exponential distribution or its complement, are valid performances. In fact, the interpretation we use depends on our viewpoint and just changes the sign of the net force in (28). The previous result in (16) shows that the sigmoid function provides the correct performance values once we have chosen our point of view. Accordingly, the performance will lie between 0 and 0.5 for a negative net force and between 0.5 and 1 for a positive net force. The fact that there is no objectively correct viewpoint strongly resembles the principle of relativity, which plays a major role in physics.
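The connection to the sigmoid function can be made concrete: solving (28) for p(K) gives the logistic form directly. A minimal sketch, with E = 1 as an illustrative choice:

```python
import math

def net_force(p, E=1.0):
    """Net force per (28): K = -E * ln((1 - p) / p)."""
    return -E * math.log((1.0 - p) / p)

def performance(K, E=1.0):
    """Inverting (28) yields the sigmoid: p = 1 / (1 + e^(-K/E))."""
    return 1.0 / (1.0 + math.exp(-K / E))

# Round trip: the sigmoid recovers the performance from the net force.
for p in (0.1, 0.5, 0.9):
    assert abs(performance(net_force(p)) - p) < 1e-12

# Negative net force -> performance below 0.5; positive -> above 0.5.
assert performance(-2.0) < 0.5 < performance(2.0)
```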
Motivated by the general entropy introduced at the beginning of this paper, I will now derive another interesting result relating to relativity. As introduced in Section 2, the general entropy is based on summands having the following form:

K_i = −p(K_i) ln( p(K_i) )    (29)

We can easily see that each summand matches the definition of Force B introduced in (13) when the expectation equals the performance. For this reason, I consider Force B, or rather general entropy, to be the more fundamental of the two forces Force A and Force B. Actually, I understand that the difference between Force A and Force B, i.e. the net force, describes merely our perception, while the general entropy defines the true uncertainty. The sigmoid function will thus provide the real performance values, allowing us to compute the actual general entropy. Spinning this thought further, I understand that we perceive reality in points defined by the golden ratio. Our perception will be different from reality except for performance values equal to the golden ratio. Let me present an interesting physical application of this idea: In physics, a typical performance function could be the velocity v of an object in relation to light speed c. This value should always lie within the range from 0 to 1 because the current state of the art assumes that no object can move faster than the speed of light. If we insert this relative speed into (19), which describes the relationship defining the golden ratio, we obtain the following result:

p(K) = (1−p(K)) / p(K)   ⟹   √(1 − p(K)²) = √(p(K))   ⟹   √(1 − (v/c)²) = √(p(K))    (30)

The expression on the left-hand side is the well-known Lorentz factor, or rather the inverse Lorentz factor, which plays a crucial part in Einstein's special relativity.
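The chain in (30) can be checked numerically at the fixed point of p = (1 − p)/p, which is p = (√5 − 1)/2 ≈ 0.618; there, 1 − p² = p, so both square roots agree:

```python
import math

# Fixed point of p = (1 - p) / p, i.e. the golden-ratio relation (19).
p = (math.sqrt(5.0) - 1.0) / 2.0
assert abs(p - (1.0 - p) / p) < 1e-12

# At this point 1 - p^2 = p, hence sqrt(1 - p^2) = sqrt(p), as in (30).
assert abs(math.sqrt(1.0 - p * p) - math.sqrt(p)) < 1e-12

# Interpreting p as v/c, the left-hand side is the inverse Lorentz factor.
```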
The Lorentz factor describes how mass, length, and time change for an object or system whose velocity approaches light speed. For a moving object, an observer will measure a shorter length, more mass, and a shorter time lapse between two events. These effects become more pronounced as the moving object approaches the speed of light. Depending on the relative speed to light, the Lorentz factor basically describes the ratio between the quantity measured for the observer and the quantity measured for the moving system. For instance, if t is the time measured locally by the observer, then the corresponding time t′ measured for the moving system computes as follows:

t′ = √(1 − v²/c²) · t    (31)

We can see that t′ converges to zero for increasing speed, i.e. we can measure no time lapse for a system moving at light speed. Similar relationships hold for length and mass. However, time dilation is especially interesting because the exponential distribution is very often used to model the time between statistical events that happen at a constant average rate, such as radioactive decay or the time until the next system failure, as already mentioned in the previous section. The expectation value of the exponential distribution is then indeed time, namely the expected time until the next event. In this context, an expectation value in the form of the Lorentz factor makes perfect sense. Actually, time dilation then follows from the relationship in (30). However, according to the perceptual model introduced above, I understand that time dilation is merely our perception and does not reflect reality. The true performance follows when we use our observed performance as input to the sigmoid function, which then provides the actual performance.
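For reference, (31) is straightforward to evaluate. The sketch below uses illustrative speeds expressed as fractions of c:

```python
# Time dilation per (31): t' = sqrt(1 - v^2/c^2) * t,
# with the speed given as the fraction v/c. Values are illustrative.
def dilated_time(t, v_over_c):
    return (1.0 - v_over_c ** 2) ** 0.5 * t

# At v/c = 0.8 the dilated time is 0.6 * t.
assert abs(dilated_time(1.0, 0.8) - 0.6) < 1e-12

# t' shrinks toward zero as the speed approaches c.
assert dilated_time(1.0, 0.999) < dilated_time(1.0, 0.5)
```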
For instance, an expectation value corresponding to a Lorentz factor with a relative speed (performance) of 0.5 leads to an observed performance of 1/√2 according to (30). Insertion of this observed performance into the sigmoid function leads to the following result:

1 / (1 + 1/√2) ≈ 0.586    (32)

Note that this result is slightly larger than 0.5. This concludes my theoretical foray into the field of physics. We know from practical experiments that the observation of a physical experiment can actually change its outcome. The classic example for this fact is the famous double-slit experiment [21]. For this reason, some physicists have already suggested that they might have to include human perception in their models in order to develop a more complete and thus more powerful theory that can describe these effects. It remains to be seen to what extent the proposed perceptive model turns out to be useful in this respect.

9 Informational Intelligence

In this section, I am going to apply the concept of informational confidence to a practical problem. In order to do so, I divide this section into three subsections: In the first subsection, I show how to learn informational confidence values by estimating the necessary parameters on an evaluation set. In the second subsection, I present practical recognition rates of a multiple classifier system for handwritten Japanese character recognition. In the third subsection, I propose a new framework for machine learning in the form of a network architecture that implements the ideas introduced above, in particular general entropy. I therefore use the term "informational intelligence" as the title for this section in order to convey the broader meaning of informational confidence.

9.1 Informational Confidence Learning

In most practical cases, classifiers do not provide informational confidence values.
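The arithmetic of (32) can be reproduced in a few lines:

```python
import math

# Worked example (32): for performance p = 0.5, the observed performance
# per (30) is sqrt(p) = 1/sqrt(2); inserting it into the sigmoid form
# used in (32) gives a value slightly above 0.5.
observed = math.sqrt(0.5)            # 1/sqrt(2), about 0.707
perceived = 1.0 / (1.0 + observed)   # about 0.586
assert abs(perceived - 0.586) < 1e-3
assert perceived > 0.5
```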
Their confidence values typically violate the fixed point equation in the performance theorem, indicating a distorted equilibrium between information and confidence. Classifier combination therefore calls for a second training process in addition to the classifier-specific training methods teaching each classifier the decision boundaries of each class. Accordingly, I consider learning of informational confidence values to be a 3-step process: In the first step, I train a classifier with its specific training method and training set. In the second step, I estimate the performance for each confidence value on an evaluation set. Finally, I compute new informational confidence values by inserting the performance estimates into the fixed point equation of the performance theorem. The newly computed informational confidence values are stored in a look-up table and will replace the original raw confidence values in all future classifications. The fixed point equation of the performance theorem then formulates as follows:

K_i^new = −Ê · ln( 1 − p̂(K_i^old) )    (33)

where p̂(K_i^old) is the performance estimate of each raw confidence value K_i^old, Ê is the expectation estimate, and K_i^new is the new informational confidence value subsequently replacing K_i^old. In the following, I show how I compute the estimates Ê and p̂(K_i^old) on the evaluation set [3, 4, 6].

9.1.1 Expectation Estimate Ê

For the practical experiments in the next subsection, the classifier's global recognition rate R on the evaluation set will serve as the expectation estimate Ê. I additionally normalize the recognition rate R according to the overall information I(C) provided by classifier C.
Following the computation of information for confidence values, I(1 − p(K)), I estimate I(C) using the performance complement [3, 4]:

Î(C) = I(1 − R) = −ln(1 − R)    (34)

Based on the estimate Î(C), Ê computes as the Î(C)-th root of R, i.e. Ê = R^{1/Î(C)}, which maps the global recognition rate R to its normalized rate for a one-bit classifier. The fixed point equation in the performance theorem now formulates as follows:

K_i^new = −R^{1/Î(C)} · ln( 1 − p̂(K_i^old) )    (35)

This leaves us with the performance estimate as the only missing parameter to compute informational confidence values.

9.1.2 Performance Estimate p̂(K_i^old)

Motivated by the performance theorem, which states that the performance function follows an exponential distribution, I propose an estimate that expresses performance as a percentage of the maximum performance possible. Accordingly, my relative performance estimate describes the different areas delimited by the confidence values under their common density function. Mathematically, the performance estimate is based on accumulated partial frequencies defined by the following formula [17, 18]:

p̂(K_i^old) = ( Σ_{k=0}^{i} n_correct(K_k^old) ) / N    (36)

In this equation, N is the number of patterns contained in the evaluation set. The helper function n_correct(K_k^old) returns the number of patterns correctly classified with confidence K_k^old. The use of monotonically increasing frequencies guarantees that the estimated informational confidence values will not affect the order of the original raw confidence values:

K_i^old ≤ K_j^old ⟹ K_i^new ≤ K_j^new    (37)

For this reason, the performance estimate in (36) ensures that informational confidence values have no effect on the recognition rate of a single classifier, except for ties introduced by mapping two different confidence values to the same informational confidence value.
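A minimal sketch of the accumulated partial frequencies in (36), with made-up counts rather than data from the experiments:

```python
# Sketch of the performance estimate (36): accumulated partial frequencies
# over confidence values sorted in ascending order. The counts below are
# illustrative, not measured values.
def performance_estimates(correct_counts, total_patterns):
    """p_hat[i] = sum(correct_counts[0..i]) / N, per equation (36)."""
    estimates, running = [], 0
    for count in correct_counts:
        running += count
        estimates.append(running / total_patterns)
    return estimates

p_hat = performance_estimates([10, 25, 40, 15], 100)
assert p_hat == [0.10, 0.35, 0.75, 0.90]

# Accumulation makes the estimates monotonically non-decreasing,
# which preserves the order of the raw confidence values, cf. (37).
assert all(a <= b for a, b in zip(p_hat, p_hat[1:]))
```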
Ties can happen when two neighboring confidence values show the same performance and become indistinguishable due to insufficient evaluation data. In most applications, this should be no problem, though. Typically, the effect of informational confidence values shows only when we combine several classifiers into a multiple classifier system, with all classifiers learning their individual informational confidence values, unless we compute class-specific informational confidence values. Estimates based on accumulated partial frequencies act like a filter in that they do not consider single confidence values but a whole range of values. They average the estimation error over all confidence values in a confidence interval. This diminishes the negative effect of inaccurate measurements of the estimate p̂(K_i^old) in application domains with insufficient or erroneous evaluation data. Furthermore, estimation of informational confidence values can be considered a warping process aligning the progression of confidence values with the progression of performance. For experiments with other possible performance estimates, readers are referred to the references [3, 4, 6]. After normalization of the performance estimate p̂(K_i^old) to a one-bit classifier, as I already did for the expectation estimate, the final version of the fixed point equation in the performance theorem reads as follows:

K_i^new = −R^{1/Î(C)} · ln( 1 − p̂(K_i^old)^{1/Î(C)} )    (38)

Note that the newly computed informational confidence values K_i^new are an attractor of this fixed point equation. In other words, the fixed point will be reached after exactly one iteration of the training procedure, or rather estimation process. All additional iterations will produce exactly the same confidence values; i.e., K_i^new = K_i^old.
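Putting (34) and (38) together, an end-to-end computation of informational confidence values might look as follows. The recognition rate and performance estimates are illustrative values, not the measured ones:

```python
import math

# Sketch of the final update (38): I_hat = -ln(1 - R) from (34), and both
# R and the performance estimates normalized to a one-bit classifier via
# the I_hat-th root. Inputs below are illustrative placeholders.
def informational_confidence(p_hat, R):
    I_hat = -math.log(1.0 - R)                 # information estimate (34)
    E_hat = R ** (1.0 / I_hat)                 # normalized expectation
    return [-E_hat * math.log(1.0 - p ** (1.0 / I_hat)) for p in p_hat]

new_values = informational_confidence([0.10, 0.35, 0.75, 0.90], R=0.8994)

# The mapping is monotonic in p_hat, so the raw confidence order is
# preserved, in line with (37).
assert all(a <= b for a, b in zip(new_values, new_values[1:]))
```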
9.2 Practical Experiments

In this mainly theoretical paper, I confine myself to practical experiments for a multiple classifier system developed to recognize handwritten Japanese characters. Readers will find more information in the references, including other experiments with informational confidence values for document processing applications [3, 4, 6]. Handwriting recognition is a very promising application field for classifier combination. Multiple classifier systems have therefore a long tradition in handwriting recognition [22, 20]. In particular, the duality of handwriting recognition, with its two branches off-line recognition and on-line recognition, makes it suitable for multiple classifier systems. While off-line classifiers process static images of handwritten words, on-line classifiers operate on the dynamic data and expect point sequences over time as input signals. Compared to the time-independent off-line representations used by off-line classifiers, on-line classifiers suffer from stroke-order and stroke-number variations inherent in human handwriting and thus in on-line data. On the other hand, on-line classifiers are able to exploit the dynamic information and can very often discriminate between classes with higher accuracy. Off-line and on-line classifiers thus complement each other, and their combination can overcome the problem of stroke-order and stroke-number variations. This is especially important in Japanese and Chinese character recognition because the average number of strokes per character, and thus the number of variations, is much higher than in the Latin alphabet [5, 9].

Japanese   offline   online   AND     OR
1-best     89.94     81.04    75.41   95.56
2-best     94.54     85.64    82.62   97.55
3-best     95.75     87.30    84.99   98.06

Table 1: Single n-best rates for handwritten Japanese character recognition.
For my experiments, I use a multiple classifier system comprising two classifiers for on-line handwritten Japanese characters. Both classifiers are nearest neighbor classifiers. One of these two classifiers, however, transforms the captured on-line data into an off-line pictorial representation before applying the actual classification engine. This transformation happens in a pre-processing step and connects neighboring on-line points using a sophisticated painting method [19, 7]. We can therefore consider this classifier to be an off-line classifier. As mentioned above, learning of informational confidence values is a three-step process: First, each classifier is trained with its standard training method on a given training set. Then, I compute the performance of each confidence value for each classifier on an evaluation set, using the performance estimate in (36). In the last step, I estimate the informational confidence values based on the estimate given in (38). These estimates will then replace the original confidence values in all future classifications of each classifier. In my experiments, each classifier was initially trained on a training set containing more than one million handwritten Japanese characters. The test and evaluation set contains 54,775 handwritten characters. From this set, I take about two thirds of the samples to estimate the performances of confidence values and one third to compute the final recognition performance of the estimated informational confidence values. For more information about the classifiers and data sets used, I refer readers to the references [7, 11, 12]. Table 1 lists the individual recognition rates for the off-line and on-line classifier. It shows the probabilities that the correct class label is among the n-best alternatives having the highest confidence, with n = 1, 2, 3.
The off-line recognition rates are much higher than the corresponding on-line rates. Clearly, stroke-order and stroke-number variations are largely responsible for this performance difference. They considerably complicate the classification task for the on-line classifier. The last two columns of Table 1 show the percentage of test patterns for which the correct class label occurs either twice (AND) or at least once (OR) in the n-best lists of both classifiers. The relatively large gap between the off-line recognition rates and the numbers in the OR-column suggests that on-line information is indeed complementary and useful for classifier combination. Table 2 shows the recognition rates for combined off-line/on-line recognition, using sum-rule, max-rule, and product-rule as combination schemes. Sum-rule adds the confidence values provided by each classifier for the same class, while product-rule multiplies the confidence values. Max-rule simply takes the maximum confidence without any further operation. The class with the maximum overall confidence will then be chosen as the most likely class for the given test pattern. Note that sum-rule is the mathematically appropriate combination scheme for integration of information from different sources [15]. In addition, sum-rule is robust against noise, as was shown in [8].

Japanese (89.94)   Raw Confidence   Inf. Confidence
Sum-rule           93.25            93.78
Max-rule           91.30            91.14
Product-rule       92.98            65.16

Table 2: Combined recognition rates for handwritten Japanese character recognition.

The upper left cell of Table 2 lists again the best single recognition rate from Table 1, achieved by the off-line recognizer.
The second column contains the combined recognition rates for the raw confidence values as provided directly by the classifiers, while the third column lists the recognition rates for informational confidence values computed according to (38). Compared to the individual rates, the combined recognition rates in Table 2 are clear improvements. The sum-rule on raw confidence values already accounts for an improvement of almost 3.5%. The best combined recognition rate achieved with normalized informational confidence is 93.78%. It outperforms the off-line classifier, which is the best individual classifier, by almost 4.0%. Sum-rule performs better than max-rule and product-rule, a fact in accordance with the results in [8].

9.3 Neural Network Architecture

At the end of this paper, I am going to show how the results introduced above can be combined to form a network architecture for complex decision problems. The architecture I propose is similar to the well-known feedforward type of artificial neural networks in that a neuron first integrates its inputs and then applies a sigmoid function to compute the final output, which it propagates to the synapses of other neurons. The main motivation for the sigmoid function, however, derives from an information-theoretical background, as discussed in Section 5. Figure 10 shows the basic unit of the proposed "information network:" a neuron and its synapses. The basic idea is that each synapse computes one summand of the general entropy defined in (3) of Section 2. The main body of the neuron first integrates all these summands, computing the general entropy according to (1) and (3). The sigmoid function then computes the actual performance based on the general entropy. Finally, the neuron forwards the newly computed performance to other neurons, which in turn repeat the same process.
In this way, complex decisions become aggregates of simpler decisions.

Figure 10: Information network: a neuron whose synapses compute the summands −p(K_ij) ln(p(K_ij)) of their input performances p(K_ij), a summation stage integrating these summands, a sigmoid stage producing the output performance p(K_i), and a feedback input.

Similar to the training process in feedforward networks, the backpropagation of feedback trains the network in Figure 10. Instead of the gradient descent in parameter space that is typically implemented in feedforward networks, backpropagation for the network in Figure 10 means basically propagating the performance back so that each neuron can adjust its output. The performance can be directly inserted as part of the sigmoid function in (16). For instance, insertion of the performance values defined in (36) leads to the following expression for the output values, after additionally normalizing each performance value to one bit:

1 / ( 1 + p̂(K_i^old)^{1/Î(C)} )    (39)

In my experiments, a simple summation of the information provided by each output value, or rather classifier, for each class provides a recognition rate of 93.92% for the handwritten character recognition problem. This is better than the best recognition rate in Table 2. I hope to be able to support the proposed network architecture with additional experiments in other application domains, and by implementing a full-fledged network and not just a single layer.

10 Summary

I introduced a new form of entropy that can be considered an extension of the classic entropy introduced by Shannon. Each summand of this entropy is a fixed point equation in which the so-called performance function takes over the part of the probability.
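As a rough illustration of the neuron just described, the sketch below sums general-entropy summands and maps the result through a logistic sigmoid; the logistic form is assumed here for simplicity, whereas (39) uses the normalized variant with the Î(C)-th root.

```python
import math

# Sketch of one "information network" neuron: each synapse contributes
# the general-entropy summand -p * ln(p) of its input performance, the
# neuron sums the summands, and a sigmoid maps the entropy to the output
# performance. The logistic sigmoid is an illustrative assumption.
def neuron_output(input_performances):
    entropy = sum(-p * math.log(p) for p in input_performances if p > 0.0)
    return 1.0 / (1.0 + math.exp(-entropy))

out = neuron_output([0.2, 0.7, 0.9])
# The entropy sum is non-negative, so the output lands in [0.5, 1).
assert 0.5 <= out < 1.0
```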
However, the performance function plays several roles in my approach: It describes the distribution of an exponentially distributed random variable, and is also an expectation value in the statistical sense. Furthermore, with the exponential distribution typically used to describe statistical time processes, there is also a point in favor of it being time. The performance theorem in the first part of the paper summarizes these relationships and provides guidelines for learning informational confidence values for classifier combination. In my first practical results published in [3, 4, 6], I improved the recognition rates for several multiple classifier systems. In the present paper, I confined myself to the recognition rates for handwritten Japanese character recognition and concentrated on theoretical issues. I showed how to produce a symbol similar to the famous Yin/Yang symbol by depicting the net confidence as a spiral. The net confidence is the difference between the confidence and counter-confidence, with the latter being based on the performance complement. My understanding is that our perception is always the composite of Yin and Yang and does not reflect reality, except when the performance function equals the golden ratio. I thus assign an information-theoretical meaning to the golden ratio. Moreover, I understand that the sigmoid function provides the actual performance value that we cannot observe directly. Under these observations and assumptions, I can explain the time dilation of Einstein's Special Relativity. However, it follows that time dilation is mere perception and does not correspond to reality. At the end of the paper, I proposed a network architecture for complex decisions, which takes advantage of the general entropy concept.
I hope that the usefulness of this architecture can be confirmed by future experiments in different application fields.

Acknowledgment

I would like to thank Ondrej Velek, Akihito Kitadai, and Masaki Nakagawa for providing data for the practical experiments.

References

[1] T.A. Cook. The Curves of Life. Dover Publications, 1979.
[2] H.E. Huntley. The Divine Proportion. Dover Publications, 1970.
[3] S. Jaeger. Informational Classifier Fusion. In Proc. of the 17th Int. Conf. on Pattern Recognition, pages 216–219, Cambridge, UK, 2004.
[4] S. Jaeger. Using Informational Confidence Values for Classifier Combination: An Experiment with Combined On-Line/Off-Line Japanese Character Recognition. In Proc. of the 9th Int. Workshop on Frontiers in Handwriting Recognition, pages 87–92, Tokyo, Japan, 2004.
[5] S. Jaeger, C.-L. Liu, and M. Nakagawa. The State of the Art in Japanese Online Handwriting Recognition Compared to Techniques in Western Handwriting Recognition. International Journal on Document Analysis and Recognition, 6(2):75–88, 2003.
[6] S. Jaeger, H. Ma, and D. Doermann. Identifying Script on Word-Level with Informational Confidence. In Int. Conf. on Document Analysis and Recognition (ICDAR), pages 416–420, Seoul, Korea, 2005.
[7] S. Jaeger and M. Nakagawa. Two On-Line Japanese Character Databases in Unipen Format. In 6th International Conference on Document Analysis and Recognition (ICDAR), pages 566–570, Seattle, 2001.
[8] J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
[9] C.-L. Liu, S. Jaeger, and M. Nakagawa. Online Recognition of Chinese Characters: The State-of-the-Art. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 26(2):198–213, 2004.
[10] M. Livio. The Golden Ratio.
Random House, Inc., 2002.
[11] M. Nakagawa, K. Akiyama, L.V. Tu, A. Homma, and T. Higashiyama. Robust and Highly Customizable Recognition of On-Line Handwritten Japanese Characters. In Proc. of the 13th International Conference on Pattern Recognition, volume III, pages 269–273, Vienna, Austria, 1996.
[12] M. Nakagawa, T. Higashiyama, Y. Yamanaka, S. Sawada, L. Higashigawa, and K. Akiyama. On-Line Handwritten Character Pattern Database Sampled in a Sequence of Sentences without Any Writing Instructions. In Fourth International Conference on Document Analysis and Recognition (ICDAR), pages 376–381, Ulm, Germany, 1997.
[13] J.R. Pierce. An Introduction to Information Theory: Symbols, Signals, and Noise. Dover Publications, Inc., New York, 1980.
[14] W. Sacco, W. Copes, C. Sloyer, and R. Stark. Information Theory: Saving Bits. Janson Publications, Inc., Dedham, MA, 1988.
[15] C.E. Shannon. A Mathematical Theory of Communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
[16] N.J.A. Sloane and A.D. Wyner. Claude Elwood Shannon: Collected Papers. IEEE Press, Piscataway, NJ, 1993.
[17] O. Velek, S. Jaeger, and M. Nakagawa. A New Warping Technique for Normalizing Likelihood of Multiple Classifiers and its Effectiveness in Combined On-Line/Off-Line Japanese Character Recognition. In 8th International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 177–182, Niagara-on-the-Lake, Canada, 2002.
[18] O. Velek, S. Jaeger, and M. Nakagawa. Accumulated-Recognition-Rate Normalization for Combining Multiple On/Off-line Japanese Character Classifiers Tested on a Large Database. In 4th International Workshop on Multiple Classifier Systems (MCS), pages 196–205, Guildford, UK, 2003. Lecture Notes in Computer Science, Springer-Verlag.
[19] O. Velek, C.-L. Liu, S. Jaeger, and M. Nakagawa.
An Improved Approach to Generating Realistic Kanji Character Images from On-Line Characters and its Benefit to Off-Line Recognition Performance. In 16th International Conference on Pattern Recognition (ICPR), volume 1, pages 588–591, Quebec, 2002.
[20] W. Wang, A. Brakensiek, and G. Rigoll. Combination of Multiple Classifiers for Handwritten Word Recognition. In Proc. of the 8th International Workshop on Frontiers in Handwriting Recognition (IWFHR-8), pages 117–122, Niagara-on-the-Lake, Canada, 2002.
[21] Wikipedia. Double-slit experiment, 2006. http://www.wikipedia.org.
[22] L. Xu, A. Krzyzak, and C.Y. Suen. Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition. IEEE Trans. on Systems, Man, and Cybernetics, 22(3):418–435, 1992.
[23] Wikipedia. Yin and yang. http://www.wikipedia.org.
