Information Width
Joel Ratsaby
Electrical and Electronics Engineering Department
Ariel University Center of Samaria, Israel
ratsaby@ariel.ac.il

November 16, 2021

Abstract

Kolmogorov argued that the concept of information exists also in problems with no underlying stochastic model (as in Shannon's representation of information), for instance, the information contained in an algorithm or in the genome. He introduced a combinatorial notion of entropy and of the information $I(x : y)$ conveyed by a binary string $x$ about the unknown value of a variable $y$. The current paper poses the following questions: what is the relationship between the information conveyed by $x$ about $y$ and the description complexity of $x$? Is there a notion of cost of information? Are there limits on how efficiently $x$ conveys information? To answer these questions, Kolmogorov's definition is extended and a new concept termed information width, which is similar to $n$-widths in approximation theory, is introduced. Information from any input source, e.g., sample-based, general side-information or a hybrid of both, can be evaluated by a single common formula. An application to the space of binary functions is considered.

Keywords: Binary functions, Combinatorics, $n$-widths, VC-dimension

1 Introduction

Kolmogorov [13] sought a measure of information of 'finite objects'. He considered three approaches, the so-called combinatorial, probabilistic and algorithmic. The probabilistic approach corresponds to the well-established definition of the Shannon entropy, which applies to stochastic settings where an 'object' is represented by a random variable. In this setting, the entropy of an object and the information conveyed by one object about another are well defined. Here it is necessary to view an object (or a finite binary string) as a realization of a stochastic process. While this has often been used, for instance, to measure the information of English texts [7, 14] by assuming some finite-order Markov process, it is not obvious that such modeling of finite objects provides a natural and universal representation of information, as Kolmogorov states in [13]:

"What real meaning is there, for example, in asking how much information is contained in (the book) 'War and Peace'? Is it reasonable to ... postulate some probability distribution for this set? Or, on the other hand, must we assume that the individual scenes in this book form a random sequence with stochastic relations that damp out quite rapidly over a distance of several pages?"

These questions led Kolmogorov to introduce an alternative, non-probabilistic and algorithmic notion of the information contained in a finite binary string. He defined it as the length of the minimal-size program that can compute the string. This was later developed into the field of Kolmogorov complexity [15].

In the combinatorial approach, Kolmogorov investigated another non-stochastic measure of information for an object $y$. Here $y$ is taken to be any element in a finite space $Y$ of objects. In [13] he defines the 'entropy' of $Y$ as $H(Y) = \log |Y|$, where $|Y|$ denotes the cardinality of $Y$ and all logarithms henceforth are taken with respect to base 2. As he writes, if the value of $Y$ is known to be $Y = \{y\}$ then this much entropy is 'eliminated' by providing $\log |Y|$ bits of 'information'.
Let $R = X \times Y$ be a general finite domain and consider a set
\[ A \subseteq R \tag{1} \]
that consists of all 'allowed' values of pairs $(x, y) \in R$. The entropy of $Y$ is defined as $H(Y) = \log |\Pi_Y(A)|$, where
\[ \Pi_Y(A) \equiv \{ y \in Y : (x, y) \in A \text{ for some } x \in X \} \]
denotes the projection of $A$ on $Y$. Consider the restriction of $A$ on $Y$ based on $x$, defined as
\[ Y_x = \{ y \in Y : (x, y) \in A \}, \qquad x \in \Pi_X(A); \tag{2} \]
then the conditional combinatorial entropy of $Y$ given $x$ is defined as
\[ H(Y \mid x) = \log |Y_x|. \tag{3} \]
Kolmogorov defines the information conveyed by $x$ about $Y$ as the quantity
\[ I(x : Y) = H(Y) - H(Y \mid x). \tag{4} \]
Alternatively, we may view $I(x : Y)$ as the information that a set $Y_x$ conveys about another set $Y$ satisfying $Y_x \subseteq Y$. In this case we let the domain be $R = \Pi_Y(A) \times \Pi_Y(A)$, let $A_x \subseteq R$ be the set of permissible pairs $A_x = \{ (y, y') : y \in \Pi_Y(A),\, y' \in Y_x \}$, and the information is defined as
\[ I(Y_x : Y) = \log |\Pi_Y(A)|^2 - \log \bigl( |Y_x| \, |\Pi_Y(A)| \bigr). \tag{5} \]
We will refer to this representation as Kolmogorov's information between sets. Clearly, $I(Y_x : Y) = I(x : Y)$.
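To make the bookkeeping concrete, the following is a minimal sketch of the quantities (1)-(4) on a toy domain. The sets X, Y and A below are illustrative assumptions, not part of the paper's formal development.

```python
# A minimal sketch of Kolmogorov's combinatorial information (1)-(4) on a
# toy domain; the sets X, Y, A are illustrative assumptions.
from math import log2

X = {'x1', 'x2'}
Y = {1, 2, 3, 4}
A = {('x1', 1), ('x1', 2), ('x2', 3)}      # allowed pairs, A subset of X x Y

proj_Y = {y for (_, y) in A}               # Pi_Y(A), the projection of A on Y
H_Y = log2(len(proj_Y))                    # H(Y) = log |Pi_Y(A)|

def info(x):
    Y_x = {y for (xx, y) in A if xx == x}  # the restriction (2)
    return H_Y - log2(len(Y_x))            # I(x : Y) = H(Y) - H(Y|x), eq. (4)

print(info('x1'))   # log 3 - log 2, about 0.585 bits
print(info('x2'))   # log 3 - log 1, about 1.585 bits
```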
In many applications, knowing an input $x$ only conveys partial information about an unknown value $y \in Y$. For instance, in problems which involve the analysis of algorithms on discrete classes of structures, such as sets of binary vectors or functions on a finite domain, an algorithmic search is made for some optimal element in this set based only on partial information. One such paradigm is the area of statistical pattern recognition [2, 26], where an unknown target, i.e., a pattern classifier, is sought based on the information contained in a finite sample and some side-information. This information is implicit in the particular set of classifiers that form the possible hypotheses.

For example, let $n$ be a positive integer and consider the domain $[n] = \{1, \ldots, n\}$. Let $F = \{0, 1\}^{[n]}$ be the set of all binary functions $f : [n] \to \{0, 1\}$. The power set $P(F)$ represents the family of all sets $G \subseteq F$. Repeating this, we have $P(P(F))$ as the collection of all properties of sets $G$, i.e., a property is a set whose elements are subsets $G$ of $F$. We denote by $M$ a property of a set $G$ and write $G \models M$. Suppose that we seek to know an unknown target function $t \in F$. Any partial information about $t$, which may be expressed by $t \in G \models M$, can effectively reduce the search space. It has been a long-standing problem to try to quantify the value of general side-information for learning (see [24] and references therein). We assert that Kolmogorov's combinatorial framework may serve as a basis. We let $x$ index possible properties $M$ of subsets $G \subseteq F$ and let the object $y$ represent the unknown target $t$, which may be any element of $F$. Side-information is then represented by knowing certain properties of sets that contain the target. The input $x$ conveys that $t$ is in some subset $G$ that has a certain property $M_x$. In principle, Kolmogorov's quantity $I(x : Y)$ should serve as the value of the information in $x$ about the unknown value of $y$. However, its current form (4) is not general enough, since it requires that the target $y$ be restricted to a fixed set $Y_x$ on knowledge of $x$. To see this, suppose $t$ is in a set that satisfies property $M_x$, and consider the collection $\{G_z\}_{z \in Z_x}$ of all subsets $G_z \subseteq F$ that have this property. Clearly, $t \in \bigcup_{z \in Z_x} G_z$, hence we may first consider $Y_x = \bigcup_{z \in Z_x} G_z$, but some useful information implicit in this collection is then ignored, as we now show. Consider two properties $M_0$ and $M_1$ with corresponding index sets $Z_{x_0}$ and $Z_{x_1}$ such that $\bigcup_{z \in Z_{x_0}} G_z = \bigcup_{z \in Z_{x_1}} G_z \equiv F_0 \subseteq F$. Suppose that most of the sets $G_z$, $z \in Z_{x_0}$, are small, while the sets $G_z$, $z \in Z_{x_1}$, are large. Clearly, property $M_0$ is more informative than $M_1$, since starting from the knowledge that $t$ is in a set that satisfies $M_0$, it should take (on average) less additional information (once the particular set $G$ becomes known) to completely specify $t$. If, as above, we let $A_{x_0} = \bigcup_{z \in Z_{x_0}} G_z$ and $A_{x_1} = \bigcup_{z \in Z_{x_1}} G_z$, then we have $I(x_0 : Y) = I(x_1 : Y)$, which wrongly implies that both properties are equally informative. Knowing $M_0$ provides implicit information associated with the collection of possible sets $G_z$, $z \in Z_{x_0}$. This implicit structural information cannot be represented in (4).

2 Overview

In [23] we began to consider an extension of Kolmogorov's combinatorial information that can be applied to more general settings. The current paper builds upon this and continues to explore the 'objectification' of information, viewing it as a 'static' relationship between sets of objects, in contrast to the standard Shannon representation. As it is based on basic set-theoretic principles, no assumption is necessary concerning the underlying space of objects other than its finiteness. It is thus more fundamental than the standard probability-based representation used in information theory. It is also more general than Bayesian approaches, for instance in statistical pattern recognition, which assume that a target $y$ is randomly drawn from $Y$ according to a prior probability distribution.

The two main contributions of the paper are, first, the introduction of a set-theoretic framework of information and its efficiency, and second, the application of this framework to classes of binary functions. Specifically, in Section 6 we define a quantity called the information width (Definition 6), which measures the information conveyed about an unknown target $y$ by a maximally-informative input of a fixed description complexity $l$ (this notion is defined in Definition 4). The first result, Theorem 1, computes this width, and it is consequently used as a reference point for comparing the information value of different inputs. This is done via the measures of cost and efficiency of information defined in Definitions 5 and 7. The width serves as a universal reference against which any type of information input may be computed and compared; for instance, the information of a finite data sample used in a learning problem can be compared to other kinds of side-information.

In Section 7 we apply the framework to the space of binary functions on a finite domain. We consider information which is conveyed via properties of classes of binary functions, specifically, those that relate to the complexity of learning such functions. The properties are stated in terms of combinatorial quantities such as the Vapnik-Chervonenkis (VC) dimension. Our interest in investigating the information conveyed by such properties stems from the large body of work on learning binary function classes (see for instance [4, 17, 19]).
This is part of the area called statistical learning theory, which deals with computing the complexity of learning over hypothesis classes, e.g., neural networks, using various algorithms, each with its particular type of side-information (sometimes referred to as the 'inductive bias', see [18]). For instance, it is known [2] that learning an unknown target function in a class whose VC-dimension (defined in Definition 8) is no greater than $d$ requires a training sample of size linear in $d$. The knowledge that the target is in a class which has this property conveys the useful information that there exists an algorithm which learns any target in this class to within arbitrarily low error based on a finite sample (in general this is impossible if the VC-dimension is infinite). But how valuable is this information? Can it be quantified and compared to other types of side-information? The theory developed here answers this and treats the problem of computing the value of information in a uniform manner for any source.

The generality of this approach rests on its set-theoretic basis. Here information, or entropy, is defined in terms of the number of bits that it takes to index objects in general sets. This is conveniently applicable to settings such as those in learning theory, where the underlying structures consist of classes of objects or inference models (hypotheses) such as binary functions (classifiers), decision trees, Boolean formulae, or neural networks. Another applicable area (unrelated to learning theory) is computational biology. Here a sequence of nucleotides or amino acids makes up a DNA, RNA or protein molecule, and one is interested in the amount of functional information needed to specify sequences with internal order or structure. Functional information is not a property of any one molecule, but of the ensemble of all possible sequences, ranked by activity. As an example, [25] considers a pile of DNA, RNA or protein molecules of all possible sequences, sorted by activity with the most active at the top. More information is required to specify molecules that carry out difficult tasks, such as high-affinity binding or the rapid catalysis of chemical reactions with high energy barriers, than is needed to specify weak binders or slow catalysts. But, as stated in [25], precisely how much more functional information is required to specify a given increase in activity is unknown. Our model may be applicable here if we let all sequences of a fixed level $\alpha$ of activity be in the same class, one that satisfies a property $M_\alpha$. Its description complexity (introduced later) may represent the amount of functional information.

In Section 7 we consider an application of this model and state Theorems 2 through 5, which estimate the information value, the cost and the description complexity associated with different properties. This allows us to compute their information efficiency and compare their relative values. For instance, Theorem 5 considers a hybrid property which consists of two sources of information: a sample of size $m$ and side-information about the VC-dimension $d$ of the class that contains the target. By computing the efficiency with respect to $m$ and $d$ we determine how effective each of these sources is. In Section 8 we compare the information efficiency of several additional properties of this kind.
3 Combinatorial formulation of information

In this section we extend the information measure (4) to one that applies to a more general setting (as discussed in Section 1), where knowledge of $x$ may still leave some vagueness about the possible value of $y$. As in [13], we seek a non-stochastic representation of the information conveyed by $x$ about $y$. Henceforth let $Y$ be a general finite domain, let $Z = P(Y)$ and $X = P(Z)$, where as before for a set $E$ we denote by $P(E)$ its power set. Here $Z$ represents the set of indices $z$ of all possible sets $Y_z \subseteq Y$, and $X$ is the set of indices $x$ of all possible collections, i.e., subsets $Z_x \subseteq Z$ of indices of sets $Y_z$, $z \in Z$. We say that a set $Y_z \subseteq Y$ has a property $M_x$ if $z \in Z_x$, where $Z_x$ is the collection uniquely corresponding to $M_x$.

The previous representation based on (1) is subsumed by this representation, since instead of $X$ we have $Z$, and the sets $Y_x$, $x \in \Pi_X(A)$, defined in (2) are now represented by the sets $Y_z$, $z \in Z_x$. Since $X$ indexes all possible properties of sets in $Y$, for any $A$ as in (1) there exists an $x \in X$ in the new representation such that $\Pi_X(A)$ in Kolmogorov's representation is equivalent to $Z_x$ in the new representation. Therefore, what was previously represented by the sets $\{ Y_x : x \in \Pi_X(A) \}$ is now the collection of sets $\{ Y_z : z \in Z_x \}$. In this new representation a given input $x$ can point to multiple subsets $Y_z$ of $Y$, $z \in Z_x$, and hence applies to the more general settings discussed in the previous section.

We will view the information conveyed by $x$ about an unknown object $y$ through two perspectives: the first is held by the side that provides the information and the second by the side which acquires it. From the side of the provider, we denote by the subset $\mathbf{y} \subseteq Y$ a set of target values $y$, for instance, solutions to a problem, any one of which the provider may wish to inform the acquirer about. In general the provider provides partial information about $y$ via an object $x \in X$, which is used as a means of representing this information and as input to the acquirer. From the acquirer's side, initially (before seeing $x$) the set of possible targets is the whole target domain $Y$, since he does not 'know' the subset $\mathbf{y}$. After seeing the input $x$ he has a collection of sets $Y_z$, $z \in Z_x$, one of which is ensured (by the provider) to intersect the subset $\mathbf{y}$. In this case we say that $x$ is informative about $\mathbf{y}$ (Definition 1). Kolmogorov's representation fits the acquirer's perspective, where the unknown subset $\mathbf{y}$ is just the whole target domain $Y$ (known by default) and therefore $x$ is the only variable in the information formula (4). In all subsequent definitions that involve $\mathbf{y}$ we may switch between the two perspectives simply by replacing $\mathbf{y}$ with $Y$.

Definition 1. Let $\mathbf{y} \subseteq Y$ be fixed. An input object $x \in X$ is called informative for $\mathbf{y}$, denoted $x \vdash \mathbf{y}$, if there exists a $z \in Z_x$ with a corresponding set $Y_z$ such that $\mathbf{y} \cap Y_z \neq \emptyset$.

The following is our definition of the combinatorial value of information.

Definition 2. Let $\mathbf{y} \subseteq Y$ and consider any $x \in X$ such that $x \vdash \mathbf{y}$. Define
\[ I(z : \mathbf{y}) = \log |\mathbf{y} \cup Y_z|^2 - \log \bigl( |\mathbf{y}| \, |Y_z| \bigr). \]
Then the information conveyed by $x$ about the unknown value $y \in \mathbf{y}$ is defined as
\[ I(x : \mathbf{y}) \equiv \frac{1}{|Z_x|} \sum_{z \in Z_x} I(z : \mathbf{y}). \]
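As a sanity check on Definition 2, the following sketch evaluates both the provider's and the acquirer's information on a small, assumed collection; the sets Y, Zx and yy below are illustrative.

```python
# A sketch of Definition 2 on a toy space (all cardinalities illustrative).
from math import log2

Y  = set(range(8))                  # the target domain
Zx = [{0, 1}, {1, 2, 3}, {0, 4}]    # the collection {Y_z : z in Z_x}
yy = {0, 1, 2}                      # the provider's subset y of target values

def I_z(Yz, y):
    """I(z : y) = log |y u Y_z|^2 - log(|y| |Y_z|)."""
    return log2(len(y | Yz) ** 2) - log2(len(y) * len(Yz))

I_provider = sum(I_z(Yz, yy) for Yz in Zx) / len(Zx)   # I(x : y)
I_acquirer = sum(I_z(Yz, Y) for Yz in Zx) / len(Zx)    # I(x : Y), eq. (6)
```

With $\mathbf{y} = Y$ the union $\mathbf{y} \cup Y_z$ collapses to $Y$ and the formula reduces to $\log|Y| - \log|Y_z|$ averaged over $Z_x$, which is the acquirer's form (6) below.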
Remark 1. For a fixed $x$, the information value $I(x : \mathbf{y})$ is in general dependent on $\mathbf{y}$, since different $\mathbf{y}$ for which $x$ is informative ($x \vdash \mathbf{y}$) may have different values of $I(z : \mathbf{y})$, $z \in Z_x$. The information value is a non-negative real number measured in bits. Henceforth, it will be convenient to assume that $I(x : \mathbf{y}) = 0$ whenever $x$ is not informative for $\mathbf{y}$.

Remark 2. We will refer to $I(x : \mathbf{y})$ as the provider's (or provided) information about the unknown target $y$ given $x$. The acquirer's (or acquired) information is defined based on the special case where $\mathbf{y} = Y$. Here we have $|\mathbf{y} \cup Y_z| = |Y \cup Y_z| = |Y|$ and the information value becomes
\[ I(x : Y) = \frac{1}{|Z_x|} \sum_{z \in Z_x} \bigl[ 2 \log |Y| - \log |Y| - \log |Y_z| \bigr] = \log |Y| - \frac{1}{|Z_x|} \sum_{z \in Z_x} \log |Y_z|. \tag{6} \]

Definition 2 is consistent with (4) in that the representation of uncertainty is done, as in [13], in a set-theoretic manner, since all expressions in (6) involve set quantities such as cardinalities and restrictions of sets. The expression (4) is a special case of (6) with $Z_x$ a singleton set and $\mathbf{y} = Y$. In defining $I(z : \mathbf{y})$ we have implicitly extended Kolmogorov's information $I(Y_x : Y)$ between sets $Y_x$ and $Y$ satisfying $Y_x \subseteq Y$ (see (5)) to the more general case where one of the two sets is not necessarily contained in the other and neither one equals the whole space $Y$; i.e., $I(z : \mathbf{y}) \equiv I(Y_z : \mathbf{y})$ is the information between the sets $Y_z$ and $\mathbf{y}$, where $Y_z$ is not necessarily contained in $\mathbf{y}$. Here we take the underlying two-dimensional domain as $R = (\mathbf{y} \cup Y_z) \times (\mathbf{y} \cup Y_z)$ and the set of permissible pairs as $A_{z,\mathbf{y}} = \{ (y, y') : y \in \mathbf{y},\, y' \in Y_z \} \subset R$.

We may view the relationship between the provider and the acquirer as a transformation between sets,
\[ \mathbf{y} \;\to\; \{ Y_z : z \in Z_x \} \;\to\; Y, \]
where the provider, knowing the set $\mathbf{y}$, chooses some $x$ with which he represents the unknown value of $y$; for him, the amount of information remaining about $y$ as conveyed by $x$ is $I(x : \mathbf{y})$ bits. The acquirer, starting from knowing only $Y$, uses $x$, or equivalently the corresponding collection of sets $\{ Y_z : z \in Z_x \}$, as an intermediate 'medium' to acquire $I(x : Y)$ bits of information about the unknown value of $y$. Note that, in general, the provider's information may be smaller than, equal to or larger than the acquired information. For instance, fix an $x$; then directly from Definition 2, for any $z \in Z_x$ we can compare $I(z : \mathbf{y})$ with $I(z : Y)$ and see that if $|\mathbf{y} \cup Y_z|$ is closer to $|Y|$ (or to $|\mathbf{y}|$) than to $|\mathbf{y}|$ (or to $|Y|$), then $I(z : \mathbf{y}) > I(z : Y)$ (or $I(z : Y) > I(z : \mathbf{y})$), respectively. Thus, taking the average over all $z \in Z_x$, it is possible in general to have $I(x : \mathbf{y})$ smaller than, larger than or equal to $I(x : Y)$.

In this paper we will primarily use the acquirer's perspective and will thus refer to the sum in (6) as the conditional combinatorial entropy, defined next.

Definition 3. Let
\[ H(Y \mid x) \equiv \frac{1}{|Z_x|} \sum_{z \in Z_x} \log |Y_z| \tag{7} \]
be the conditional entropy of $Y$ given $x$.

It will be convenient to express the conditional entropy as
\[ H(Y \mid x) = \sum_{k \geq 2} \omega_x(k) \log k, \qquad \text{with } \omega_x(k) = \frac{|\{ z \in Z_x : |Y_z| = k \}|}{|Z_x|}. \tag{8} \]
We will refer to the quantity $\omega_x(k)$ as the conditional density function of $k$. The factor $\log k$ comes from $\log |Y_z|$, which from (3) is the combinatorial conditional entropy $H(Y \mid z)$.
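The two expressions (7) and (8) for the conditional entropy can be checked against each other numerically; a small sketch (over an arbitrary collection of non-empty sets) follows. Including $k = 1$ in the density sum is harmless since $\log 1 = 0$.

```python
# H(Y|x) computed two ways, per (7) and (8); agreement is a sanity check.
# Assumes Zx is a list of non-empty sets Y_z (illustrative).
from collections import Counter
from math import log2

def H_cond_direct(Zx):
    return sum(log2(len(Yz)) for Yz in Zx) / len(Zx)             # eq. (7)

def H_cond_density(Zx):
    omega = Counter(len(Yz) for Yz in Zx)                        # |Y_z| histogram
    return sum(c / len(Zx) * log2(k) for k, c in omega.items())  # eq. (8)
```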
4 Description complexity

We have so far defined the notion of information $I(x : \mathbf{y})$ about the unknown value $y$ conveyed by $x$. Let us now define the description complexity of $x$.

Definition 4. The description complexity of $x$, denoted $\ell(x)$, is defined as
\[ \ell(x) \equiv \log \frac{|Z|}{|Z_x|}. \tag{9} \]

Remark 3. The description complexity $\ell(x)$ is a positive real number measured in bits. It takes a fractional value if the cardinality of $Z_x$ is greater than half that of $Z$.

Definition 4 is motivated as follows. From Section 3, the input $x$ conveys a certain property common to every set $Y_z$, $z \in Z_x \subseteq Z$, such that the unknown value $y$ is an element of at least one such set $Y_z$. Without the knowledge of $x$, these indices $z$ are only known to be elements of $Z$, in which case it takes $\log |Z|$ bits to describe any $z$, or equivalently, any $Y_z$. If $x$ is given, then the length of the binary string that describes a $z$ in $Z_x$ is only $\log |Z_x|$. The set $Z_x$ can therefore be described by a string of length $\log |Z| - \log |Z_x|$, which is precisely the right side of (9). Alternatively, $\ell(x)$ is the information $I(x : Z)$ gained about the unknown value $z$ given $x$ (since $x$ points to a single set $Z_x$, this information follows directly from Kolmogorov's formula (4)). As $|Z_x|$ decreases, there are fewer possible sets $Y_z$ that satisfy the property described by $x$, and the description complexity $\ell(x)$ increases. In this case, $x$ conveys a more 'special' property of the possible sets $Y_z$, and the 'price' of describing such a property increases. The following is a useful result.

Lemma 1. Denote by $Z_x^c = Z \setminus Z_x$ the complement of the set $Z_x$ and let $x^c$ denote the input corresponding to $Z_x^c$. Then
\[ \ell(x^c) = -\log \bigl( 1 - 2^{-\ell(x)} \bigr). \]

Proof: Denote by $p = |Z_x| / |Z|$. Then by the definition of the description complexity we have $\ell(x^c) = -\log(1 - p)$. Clearly, $2^{-\log\frac{1}{1-p}} = 1 - 2^{-\log\frac{1}{p}}$, from which the result follows.

Remark 4. Since the proportion of elements $z \in Z$ which are in $Z_x$, plus the proportion of those in $Z_x^c$, is fixed and equals 1, we have $2^{-\ell(x)} + 2^{-\ell(x^c)} = 1$. If the description complexities $\ell(x)$ and $\ell(x^c)$ change (for instance with respect to an increase in $|Z|$), then they change in opposite directions. However, the cardinalities of the corresponding sets $Z_x$ and $Z_x^c$ may both increase by (9), for instance if $|Z|$ grows at a rate faster than the rate of change of either $\ell(x)$ or $\ell(x^c)$.

A question to raise at this point is whether the following simple relationship between $\ell(x)$ and the entropy $H(Y \mid x)$ holds:
\[ \ell(x) + H(Y \mid x) \overset{?}{=} H(Y). \tag{10} \]
This is equivalent to asking whether
\[ \ell(x) \overset{?}{=} I(x : Y), \tag{11} \]
or in words: does the price of describing an input $x$ equal the information gained by knowing it? As we show next, the answer depends on certain characteristics of the set $Z_x$. When (4) does not apply but (6) does, then in general the relation does not hold.
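A quick numeric check of Definition 4, Lemma 1 and Remark 4, under assumed (illustrative) cardinalities:

```python
# Numeric check of Definition 4, Lemma 1 and Remark 4 for assumed |Z|, |Z_x|.
from math import log2

Z_size, Zx_size = 1024, 48
ell   = log2(Z_size / Zx_size)                    # l(x), eq. (9)
ell_c = log2(Z_size / (Z_size - Zx_size))         # l(x^c), computed directly

assert abs(ell_c + log2(1 - 2 ** -ell)) < 1e-12   # Lemma 1
assert abs(2 ** -ell + 2 ** -ell_c - 1) < 1e-12   # Remark 4
```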
5 Scenario examples

In all the following scenarios we take the acquirer's perspective, i.e., with no input given, the unknown $y$ is only known to be in $Y$. As the first scenario, we start with the simplest uniform setting, defined as follows.

Scenario S1: As in (2)-(4), an input $x$ amounts to a single set $Y_x$. The set $Z_x$ is the singleton $\{Y_x\}$, so $|Z_x| = 1$, and instead of $Z$ we have $\Pi_X(A)$. We impose the following conditions: for all $x, x' \in X$, $Y_x \cap Y_{x'} = \emptyset$ and $|Y_x| = |Y_{x'}|$. With $Y = \bigcup_{x \in X} Y_x$, it then follows that for any $x$, $|Y_x| = |Y| / |X|$. From (9) it follows that the description complexity of any $x$ is $\ell(x) = \log \frac{|X|}{1} = \log |X|$, and the entropy is $H(Y \mid x) = \log |Y_x| = \log \frac{|Y|}{|X|}$. We therefore have
\[ \ell(x) + H(Y \mid x) = \log |X| + \log |Y| - \log |X| = \log |Y|. \]
Since the right side equals $H(Y)$, (10) holds. Next consider another scenario:

Scenario S2: An input $x$ gives a single set $Y_x$, but now for any two distinct $x, x'$ we only force the condition that $|Y_x| = |Y_{x'}|$; i.e., the intersection $Y_x \cap Y_{x'}$ may be non-empty. The description complexity $\ell(x)$ is the same as in the previous scenario, and for any $x, x' \in X$ the entropy is the same, $H(Y \mid x) = H(Y \mid x')$, with a value of $\log \frac{\alpha |Y|}{|X|}$ for some $\alpha \geq 1$. So
\[ \ell(x) + H(Y \mid x) = \log |X| + \log \frac{\alpha |Y|}{|X|} = \log (\alpha |Y|) \geq \log |Y|. \]
Hence the left side of (10) is greater than or equal to the right side. By (11), this means that the 'price', i.e., the description complexity per bit of information, may be larger than 1. Let us introduce at this point the following combinatorial quantity.

Definition 5. The cost of the information $I(x : \mathbf{y})$, denoted $\kappa_{\mathbf{y}}(x)$, is defined as
\[ \kappa_{\mathbf{y}}(x) = \frac{\ell(x)}{I(x : \mathbf{y})} \]
and represents the number of description bits of $x$ per bit of information about the unknown value of $y$ as conveyed by $x$.

Thus, letting $\mathbf{y} = Y$ and considering the two previous scenarios where Kolmogorov's definition (4) applies, the cost of information equals 1 or is at least 1, respectively.
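A numeric check of this bookkeeping, under assumed cardinalities: a disjoint equal partition (S1) yields cost exactly 1, while overlap ($\alpha > 1$, S2) pushes it above 1.

```python
# Numeric check of Scenarios S1 and S2 (illustrative cardinalities).
from math import log2

Y_size, X_size, alpha = 64, 8, 2          # alpha >= 1 is the S2 overlap factor
ell = log2(X_size)                        # l(x) = log |X| in both scenarios

H1 = log2(Y_size / X_size)                # S1: H(Y|x) = log(|Y|/|X|)
H2 = log2(alpha * Y_size / X_size)        # S2: H(Y|x) = log(alpha |Y|/|X|)

print(ell / (log2(Y_size) - H1))          # S1 cost: 1.0
print(ell / (log2(Y_size) - H2))          # S2 cost: 1.5 > 1
```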
As the next scenario, let us consider the following:

Scenario S3: We follow the setting of Definition 2, where an input $x$ means that the unknown value of $y$ is contained in at least one set $Y_z$, $z \in Z_x$, hence $|Z_x| \geq 1$. Suppose that $\frac{|Y|}{|X|} \equiv a$ for some integer $a \geq 1$, and assume that for all $x \in X$, $H(Y \mid x) = \log a$. (The sets $Y_z$, $z \in Z_x$, may still differ in size and overlap.) Thus we have
\[ \ell(x) + H(Y \mid x) = \log \frac{|Z|}{|Z_x|} + \log \frac{|Y|}{|X|}. \tag{12} \]
Suppose that $\frac{|Z|}{|X|} \equiv b$ for some integer $b \geq 1$, with $Z_x \cap Z_{x'} = \emptyset$ and $|Z_x| = |Z_{x'}|$ for any $x, x' \in X$. Since $Z = \bigcup_{x \in X} Z_x$, we have $|Z_x| = b$ for all $x \in X$. The right side of (12) then equals $\log |Y|$ and (10) is satisfied. If for some $x, x'$ we have $|Z_x| < b$ and $|Z_{x'}| > b$ (with both entropies still at $\log a$), then the left side of (10) is greater than or less than $H(Y)$, respectively. Hence it is possible in this scenario for the cost $\kappa_{\mathbf{y}}(x)$ to be greater or less than 1.

To understand why for some inputs $x$ the cost may be strictly smaller than 1, observe that under the current scenario the actual set $Y_z$ which contains the unknown $y$ remains unknown even after producing the description $x$. Thus in this case the left side of (10) represents the total description complexity of the unknown value of $y$ (on average over all possible sets $Y_z$), given that the only fact known about $Y_z$ is that its index $z$ is an element of $Z_x$. In contrast, scenario S1 has the total description complexity of the unknown $y$ on the left side of (10), which also includes the description of the specific $Y_x$ that contains $y$ (hence it may be longer). Scenario S3 is an example, as mentioned in Section 1, of knowing a property which still leaves the acquirer with several sets that contain the unknown $y$. In Section 7 we will consider several specific properties of this kind. Let us now continue and introduce additional concepts as part of the framework.

6 Information width and efficiency

With the definitions of Section 3 in place, we now have a quantitative measure of the information (and cost) conveyed by an input $x$ about an unknown value $y$. This $y$ is contained in some set that satisfies a certain property, and the set itself may remain unknown. In subsequent sections we consider several examples of inputs $x$ for which these measures are computed and compared. Amongst the different ways of conveying information about an unknown value $y$, it is natural to ask at this point whether there exists a notion of maximal information. This is formalized by the following definition, which resembles the $n$-widths used in functional approximation theory [21].

Definition 6. Let
\[ I_p^*(l) \equiv \max_{\substack{x \in X \\ \ell(x) = l}} \; \min_{\substack{\mathbf{y} \subseteq Y \\ x \vdash \mathbf{y}}} I(x : \mathbf{y}) \tag{13} \]
be the $l$th information width.

Remark 5. The above definition is stated from the provider's point of view. He is free to choose a fixed 'medium', i.e., a structure $Z_x$ of sets (but limited in its description complexity to $l$), in order to provide information at some later time about any set $\mathbf{y} \subseteq Y$ of objects to the acquirer. For that he considers all possible inputs $x$ of description complexity $l$ and measures the information each will provide for the hardest target subset $\mathbf{y}$. We refer to the above as the provider's information width. If we set $\mathbf{y} = Y$ then we obtain the acquirer's width of information, denoted $I_a^*(l) \equiv I^*(l)$, which takes the simpler form
\[ I^*(l) = \max_{\substack{x \in X \\ \ell(x) = l}} I(x : Y). \]

The next result computes the value of $I^*(l)$.

Theorem 1. Denote by $\mathbb{N}$ the positive integers. Let $1 \leq l \leq \log |Z|$ and define
\[ r(l) \equiv \min \left\{ a \in \mathbb{N} : \sum_{i=1}^{a} \binom{|Y|}{i} \geq |Z| \, 2^{-l} \right\}. \]
Then we have
\[ I^*(l) = \log |Y| - \frac{2^l}{|Z|} \left[ \sum_{k=2}^{r(l)-1} \binom{|Y|}{k} \log k + \left( |Z| \, 2^{-l} - \sum_{i=1}^{r(l)-1} \binom{|Y|}{i} \right) \log r(l) \right]. \tag{14} \]

Proof: Consider a particular input $x^* \in X$ with description complexity $\ell(x^*) = l$ and with a corresponding $Z_{x^*}$ that contains the indices $z$ of as many distinct non-empty sets $Y_z$ of the lowest possible cardinality. By (9) it follows that $Z_{x^*}$ satisfies $|Z_{x^*}| = |Z| \, 2^{-l}$ and contains all $z$ such that $1 \leq |Y_z| \leq r(l) - 1$, in addition to $|Z| \, 2^{-l} - \sum_{i=1}^{r(l)-1} \binom{|Y|}{i}$ elements $z$ for which $|Y_z| = r(l)$. We therefore have $I(x^* : Y)$ equal to the right side of (14). Any other $x$ with $\ell(x) = l$ must have $H(Y \mid x) \geq H(Y \mid x^*)$, since it is formed by replacing one of the sets $Y_z$ above with a larger set $Y_{z'}$. Hence for such $x$, $I(x : Y) \leq I(x^* : Y)$, and therefore $I^*(l) = I(x^* : Y)$.
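Theorem 1 is directly computable. The following sketch implements $r(l)$ and (14); it assumes an integer-valued $l$ and exact binomial arithmetic, and the function name info_width is ours, for illustration.

```python
# A sketch of Theorem 1: r(l) and the acquirer's width I*(l).
from math import comb, log2

def info_width(Y_size, Z_size, l):
    m = Z_size * 2.0 ** (-l)                   # |Z_{x*}| = |Z| 2^{-l}
    r, acc = 0, 0                              # r(l): smallest a with
    while acc < m:                             #   sum_{i<=a} C(|Y|,i) >= m
        r += 1
        acc += comb(Y_size, r)
    head = sum(comb(Y_size, k) * log2(k) for k in range(2, r))
    tail = (m - sum(comb(Y_size, i) for i in range(1, r))) * log2(r)
    return log2(Y_size) - (head + tail) / m    # eq. (14)
```

For instance, info_width(32, 2**32, 32) returns 5.0, i.e. $\log|Y|$: at maximal description complexity, $Z_{x^*}$ consists only of singleton sets and all the entropy of $Y$ is eliminated.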
The notion of width is more general than that defined above. For instance, in functional approximation theory the so-called $n$-widths are used to measure the approximation error of some rich general class of functions, e.g., a Sobolev class, by the closest element of a manifold of simpler function classes. For instance, the Kolmogorov width $K_n(F)$ of a class $F$ of functions (see [21]) is defined as
\[ K_n(F) = \inf_{F_n \subset F} \; \sup_{f \in F} \; \inf_{g \in F_n} \| f - g \| \]
where $F_n$ varies over all linear subspaces of $F$ of dimensionality $n$. Thus, from this more general set perspective, it is perhaps not surprising that such a basic quantity as width also has an information-theoretic interpretation, as we have shown in (13).

The work of [16] considers the VC-width of a finite-dimensional set $F$, defined as
\[ \rho_n^{VC}(F) \equiv \inf_{H_n} \sup_{f \in F} \mathrm{dist}(f, H_n) \]
where $F \subset \mathbb{R}^m$ is a target set, $H_n$ runs over the class $\mathcal{H}_n$ of all sets $H_n \subset \mathbb{R}^m$ of VC-dimension $VC(H_n) = n$ (see Definition 8), and $\mathrm{dist}(f, H_n) \equiv \inf_{h \in H_n} \mathrm{dist}(f, h)$, where $\mathrm{dist}(\cdot,\cdot)$ denotes the distance between an element $f \in F$ and $h \in H_n$ based on the $l_q^m$-norm, $1 \leq q \leq \infty$. We can make the following analogy with the information width of (13): $f$ corresponds to $y$, $F$ to $Y$, $h$ to $z$, $n$ corresponds to $l$, $H_n$ to $x$ (or equivalently to $Z_x$), the condition $VC(H_n) = n$ corresponds to the condition of having a description complexity $\ell(x) = l$, the class $\mathcal{H}_n$ corresponds to the set $\{ x \in X : \ell(x) = l \}$, $\mathrm{dist}(f, h)$ corresponds to $I(z : \mathbf{y})$, $\mathrm{dist}(f, H_n) = \inf_{h \in H_n} \mathrm{dist}(f, h)$ corresponds to $I(x : \mathbf{y}) = (1/|Z_x|) \sum_{z \in Z_x} I(z : \mathbf{y})$, $\sup_{f \in F}$ corresponds to $\min_{\mathbf{y} \subseteq Y}$, and $\inf_{H_n : VC(H_n) = n}$ corresponds to $\max_{x : \ell(x) = l}$.

The notion of information efficiency introduced below is based on the acquirer's information width $I^*(l)$.

Definition 7. Denote by
\[ \kappa^*(x) \equiv \frac{\ell(x)}{I^*(\ell(x))} \]
the per-bit cost of the maximal information conveyed about an unknown target $y$ in $Y$, considering all possible inputs of the same description complexity as $x$. Consider an input $x \in X$ informative for $Y$. Then the efficiency of $x$ for $Y$ is defined by
\[ \eta_Y(x) \equiv \frac{\kappa^*(x)}{\kappa_Y(x)} \]
where the cost is defined in Definition 5.

Remark 6. By the definitions of $\kappa^*(x)$ and $\kappa_{\mathbf{y}}(x)$ it follows that
\[ \eta_Y(x) = \frac{I(x : Y)}{I^*(\ell(x))}. \tag{15} \]
While we will not use it here, the provider's efficiency can be defined in a similar way.

Let us consider an example where the above definitions may be applied. Let $n$ be a positive integer and denote by $[n] = \{1, \ldots, n\}$. Let the target space be $Y = \{0, 1\}^{[n]}$, which consists of all binary functions $g : [n] \to \{0, 1\}$. Let $Z = P(Y)$ be the set of indices $z$ of all possible classes $Y_z \subseteq Y$ of binary functions $g$ on $[n]$ (as before, for any set $E$ we denote by $P(E)$ its power set). Let $X = P(Z)$ consist of all possible (property) sets $Z_x \subseteq Z$. Thus here every possible class of binary functions on $[n]$, and every possible property of a class, is represented. Figure 1(a) shows $I^*(l)$ and Figure 1(b) displays the cost $\kappa^*(l)$ for this example with $n = 5, 6, 7$. From these graphs we see that the width $I^*(l)$ grows at a sublinear rate with respect to $l$, since the cost strictly increases.
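The flavor of Figure 1 can be reproduced with the hypothetical info_width sketch above, tabulating the width and per-bit cost for a few values of $l$ (Python integers keep the large binomials exact):

```python
# Width I*(l) and per-bit cost kappa*(l) = l / I*(l) for Y = {0,1}^[n],
# Z = P(Y); reuses the illustrative info_width sketch above.
for n in (5, 6, 7):
    Y_size, Z_size = 2 ** n, 2 ** (2 ** n)
    for l in (1, 2 ** (n - 1), 2 ** n):
        w = info_width(Y_size, Z_size, l)
        print(n, l, round(w, 3), round(l / w, 3))
```

In the next section, we apply the theory introduced in the previous sections to the space of binary functions.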
7 Binary function classes

Let $F = \{0, 1\}^{[n]}$ and write $P(F)$ for the power set, which consists of all subsets $G \subseteq F$. Let $G \models M$ represent the statement "$G$ satisfies property $M$". In order to apply the above framework, we let $y$ represent an unknown target $t \in F$ and $x$ a description object, e.g., a binary string, that describes the possible properties $M$ of sets $G \subseteq F$ which may contain $t$. Denote by $x_M$ the object that describes property $M$. Our aim is to compute the value of information $I(x_M : F)$, the description complexity $\ell(x_M)$, the cost $\kappa_F(x_M)$ and the efficiency $\eta_F(x)$ for various inputs $x_M$. Note that the set $Z_x$ used in the previous sections is now a collection of classes $G$, i.e., elements of $P(F)$, which satisfy a property $M$. We will sometimes refer to this collection by $\mathcal{M}$ and write $|\mathcal{M}|$ for its cardinality (which is analogous to $|Z_x|$ in the notation of the preceding sections).

Before we proceed, let us recall a few basic definitions from set theory. For any fixed subset $E \subseteq [n]$ of cardinality $d$ and any $f \in F$, denote by $f_{|E} \in \{0, 1\}^d$ the restriction of $f$ to $E$. For a set $G \subseteq F$ of functions, the set $\mathrm{tr}_G(E) = \{ f_{|E} : f \in G \}$ is called the trace of $G$ on $E$. The trace is a basic and useful measure of the combinatorial richness of a binary function class and is related to its density (see Chapter 17 in [5]). It has also been shown to relate to various fundamental results in different fields, e.g., statistical learning theory [26], combinatorial geometry [20], graph theory [3, 10] and the theory of empirical processes [22]. It is a member of a more general class of properties that are expressed in terms of certain allowed or forbidden restrictions [1]. In this paper we focus on properties based on the trace of a class, expressed in terms of a positive integer parameter $d$ in the following general form:
\[ d = \max \{ |E| : E \subseteq [n],\ \text{condition on } \mathrm{tr}_G(E) \text{ holds} \}. \]
The first definition taking such a form is the so-called Vapnik-Chervonenkis dimension [27].

Definition 8. The Vapnik-Chervonenkis dimension of a set $G \subseteq F$, denoted $VC(G)$, is defined as
\[ VC(G) \equiv \max \{ |E| : E \subseteq [n],\ |\mathrm{tr}_G(E)| = 2^{|E|} \}. \]

The next definition considers the other extreme for the size of the trace.

Definition 9. Let $L(G)$ be defined as
\[ L(G) \equiv \max \{ |E| : E \subseteq [n],\ |\mathrm{tr}_G(E)| = 1 \}. \]

For any $G \subseteq F$ define the following three properties:
\[ L_d \equiv \text{`}L(G) \geq d\text{'}, \qquad V_d \equiv \text{`}VC(G) < d\text{'}, \qquad V_d^c \equiv \text{`}VC(G) \geq d\text{'}. \]
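For tiny $n$, the quantities in Definitions 8 and 9 can be computed by exhaustive search. The following brute-force sketch (exponential-time, purely for illustration; all names are ours) makes the properties concrete.

```python
# Brute-force sketch of the trace-based quantities of Definitions 8 and 9.
from itertools import combinations, product

def trace(G, E):
    """tr_G(E): the set of restrictions f|_E; functions given as 0/1 tuples."""
    return {tuple(f[i] for i in E) for f in G}

def vc_dim(G, n):
    """VC(G): largest |E| whose trace is all of {0,1}^|E| (Definition 8)."""
    return max((d for d in range(1, n + 1)
                if any(len(trace(G, E)) == 2 ** d
                       for E in combinations(range(n), d))), default=0)

def L_dim(G, n):
    """L(G): largest |E| on which all functions of G agree (Definition 9)."""
    return max((d for d in range(1, n + 1)
                if any(len(trace(G, E)) == 1
                       for E in combinations(range(n), d))), default=0)

n = 3
G = [f for f in product((0, 1), repeat=n) if f[0] == 0]  # all f with f(1) = 0
print(vc_dim(G, n), L_dim(G, n))   # 2 1: G shatters {2,3} and agrees on {1}
```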
We now apply the framework to these and other related properties (for clarity, we defer some of the proofs to Section 10.2). Henceforth, for two sequences $a_n, b_n$, we write $a_n \approx b_n$ to denote that $\lim_{n \to \infty} a_n / b_n = 1$, and $a_n \ll b_n$ denotes $\lim_{n \to \infty} a_n / b_n = 0$. Denote the standard normal probability density and cumulative distribution functions by $\varphi(x) = (1/\sqrt{2\pi}) \exp(-x^2/2)$ and $\Phi(x) = \int_{-\infty}^{x} \varphi(z)\,dz$, respectively. The main results are stated as Theorems 2 through 5.

Theorem 2. Let $t$ be an unknown element of $F$. Then the value of the information in knowing that $t \in G$, where $G \models L_d$, is
\[ I(x_{L_d} : F) = \log |F| - \sum_{k \geq 2} \omega_{x_{L_d}}(k) \log k \approx n - \frac{\Phi(-a) \log \frac{2^n}{1 + 2^d} + 2^{-(n-d)/2}\,\varphi(a) + O(2^{-(n-d)})}{1 - \left( \frac{2^d}{1 + 2^d} \right)^{2^n}} \]
where $a = 2(1 + 2^d)\,2^{-(n+d)/2} - 2^{(n-d)/2}$, and the description complexity of $x_{L_d}$ is
\[ \ell(x_{L_d}) \approx \frac{2^n\,2^d}{1 + 2^d} - d - c \log n \]
for some $1 \leq c \leq d$, as $n$ increases.

Remark 7. For large $n$, we have the following estimates:
\[ I(x_{L_d} : F) \simeq n - \log \frac{2^n}{1 + 2^d} \simeq d \qquad \text{and} \qquad \ell(x_{L_d}) \simeq 2^n - d. \]
The cost is estimated by
\[ \kappa_F(x_{L_d}) \simeq \frac{2^n - d}{d}. \]

The next result is for property $V_d^c$.

Theorem 3. Let $t$ be an unknown element of $F$. Denote by $a = (2^n - 2^{d+1})\,2^{-n/2}$. Then the value of the information in knowing that $t \in G$, $G \models V_d^c$, is
\[ I(x_{V_d^c} : F) = \log |F| - \sum_{k \geq 2} \omega_{x_{V_d^c}}(k) \log k \approx n - \frac{(n-1)\,2^n\,\Phi(a) + 2^{n/2}\,\varphi(a)\left( 1 + \frac{a^2}{(n-1)\,2^n} \right)}{2^n\,\Phi(a) + 2^{n/2}\,\varphi(a)} \]
with increasing $n$. Assume that $d = d_n > \log n$; then the description complexity of $x_{V_d^c}$ satisfies
\[ \ell(x_{V_d^c}) \approx d(2^d + 1) + \log d - \log \bigl( 2^n\,\Phi(a) + 2^{n/2}\,\varphi(a) \bigr) - \log n + 1. \]

Remark 8. For large $n$, the information value is approximately $I(x_{V_d^c} : F) \simeq 1$ and
\[ \ell(x_{V_d^c}) \simeq d 2^d - n - \log \frac{n}{d}, \]
thus $\kappa_F(x_{V_d^c}) \simeq \ell(x_{V_d^c})$. We note that the description length increases with respect to $d$, implying that the proportion of classes with a VC-dimension larger than $d$ decreases with $d$. With respect to $n$ it behaves oppositely.

The property of having an (upper) bounded VC-dimension (or trace) has been widely studied in numerous fields (see the earlier discussion). For instance, in statistical learning theory [6, 26] the important property of convergence of the empirical averages to the means occurs uniformly over all elements of an infinite class provided that it satisfies this property. It is thus interesting to study the property $V_d$ defined above even for a finite class of binary functions.

Theorem 4. Let $t$ be an unknown element of $F$. The value of the information in knowing that $t \in G$, $G \models V_d$, is
\[ I(x_{V_d} : F) \approx 1 - o(2^{-n/2}) \]
with $n$ and $d = d_n$ increasing such that $n < d_n 2^{d_n}$. The description complexity of $x_{V_d}$ is
\[ \ell(x_{V_d}) = -\log \bigl( 1 - 2^{-\ell(x_{V_d^c})} \bigr) \]
where $\ell(x_{V_d^c})$ is as in Theorem 3.

Remark 9. Both the description complexity and the cost of information are approximated as
\[ \kappa_F(x_{V_d}) \simeq \ell(x_{V_d}) \simeq -\log \left( 1 - 2^{-\left( d 2^d - n - \log \frac{n}{d} \right)} \right). \]
Relating to Remark 4, while $\ell(x_{V_d})$ increases with respect to $n$, and hence the proportion of classes with the property $V_d$ decreases as $2^{-\ell(x_{V_d})}$, the actual number of binary function classes that have this property (i.e., the cardinality of the corresponding set $Z_x$) increases with $n$, since
\[ |Z_{x_{V_d}}| = |Z|\,2^{-\ell(x_{V_d})} = 2^{2^n} \left( 1 - 2^{-\left( d 2^d - n - \log \frac{n}{d} \right)} \right). \]
The number of classes that have the complement property $V_d^c$ also clearly increases, since $\ell(x_{V_d^c})$ decreases with $n$. We note that the description length decreases with respect to $d$, implying that the proportion of classes with a VC-dimension no larger than $d$ increases with $d$.

As another related case, consider an input $x$ which, in addition to conveying that $t \in G$ with $VC(G) < d$, also provides a labeled sample $S_m = \{ (\xi_i, \zeta_i) \}_{i=1}^m$, $\xi_i \in [n]$, $\zeta_i = t(\xi_i)$, $1 \leq i \leq m$. This means that for all $f \in G$, $f(\xi_i) = \zeta_i$, $1 \leq i \leq m$. We express this by stating that $G$ satisfies the property
\[ V_d(S_m) \equiv \text{`}VC(G) < d,\ G_{|\xi} = \zeta\text{'} \]
where $G_{|\xi}$ denotes the set of restrictions $\{ f_{|\xi} : f \in G \}$, with $f_{|\xi} = [f(\xi_1), \ldots, f(\xi_m)]$ and $\zeta = [\zeta_1, \ldots, \zeta_m]$. The following result states the value of information and the cost for property $V_d(S_m)$.

Theorem 5. Let $t$ be an unknown element of $F$ and $S_m = \{ (\xi_i, t(\xi_i)) \}_{i=1}^m$ a sample. Then the value of the information in knowing that $t \in G$, where $G \models V_d(S_m)$, is
\[ I(x_{V_d(S_m)} : F) \approx m - o(2^{-(n-m)/2}) \]
with $n$ and $d = d_n$ increasing such that $n < d_n 2^{d_n}$. The description complexity of $x_{V_d(S_m)}$ is
\[ \ell(x_{V_d(S_m)}) \approx 2^n \bigl( 1 + \log(1 - p) \bigr) + \frac{(n - m)\,d\,2^d\,(1 + 2^d)}{\Phi(a)\,2^{n-m} + \varphi(a)\,2^{(n-m)/2} + (1 - p)^{2^n}} \]
where $p = 2^{-m} / (2^{-m} + 1)$, $a = (2^n p - 2^d)/\sigma$ and $\sigma = \sqrt{2^n p (1 - p)}$.

Remark 10. The description complexity is estimated by
\[ \ell(x_{V_d(S_m)}) \simeq 2^n \left( 1 + \frac{n - m}{d 2^d (1 + 2^d) + m} + \log(1 - p) \right) \]
and the cost of information is
\[ \kappa_F(x_{V_d(S_m)}) \simeq \frac{\ell(x_{V_d(S_m)})}{m}. \]
Remark 11. The dependence of the description complexity on $d$ disappears rapidly with increasing $d$, and the effect of $m$ remains minor, which effectively makes $\ell(x_{V_d(S_m)})$ almost take the maximal possible value of $2^n$. Thus the proportion of classes which satisfy property $V_d(S_m)$ is very small.

7.1 Balanced properties

Theorems 3 and 4 pertain to property $V_d^c$ and its complement $V_d$. It is interesting that in both cases the information value is approximately equal to 1. If we denote by $P^*_{n,k}$ a uniform probability distribution over the space of classes $G \subset F$ conditioned on $|G| = k$ (this will be defined later in a more precise context in (20)), then, as is shown later, $P^*_{n,k}(V_d)$ and $P^*_{n,k}(V_d^c)$ vary approximately linearly with respect to $k$. Thus in both cases the conditional density (8) is dominated by the value $k = 2^{n-1}$, and hence both have approximately the same conditional entropies (7) and information values. Let us define the following:

Definition 10. A property $M$ is called balanced if $I(x_M : F) = I(x_{M^c} : F)$.

We may characterize some sufficient conditions for $M$ to be balanced. First, as in the case of property $V_d$, and more generally for any property $M$, a sufficient condition for this to hold is to have a density (and that of its complement $M^c$) dominated by some cardinality value $k^*$. Representing $\omega_{x_M}(k)$ by a posterior probability function $P_n(k \mid \mathcal{M})$, for instance as in (30) for $M = L_d$, makes the conditional entropies $H(F \mid x_M)$ and $H(F \mid x_{M^c})$ approximately the same. A stricter sufficient condition is to have $\omega_{x_M}(k) = \omega_{x_{M^c}}(k)$ for every $k$. This implies the condition $P_n(k \mid \mathcal{M}) = P_n(k \mid \mathcal{M}^c)$, which using Bayes' rule gives
\[ \frac{P(\mathcal{M}^c \mid k)}{P(\mathcal{M} \mid k)} = \frac{P(\mathcal{M}^c)}{P(\mathcal{M})}, \qquad \text{for all } 2 \leq k \leq 2^n. \]
In words, this condition says that the bias favoring a class $G$ as satisfying property $M$ versus $M^c$ (i.e., the ratio of their probabilities) should be constant with respect to the cardinality $k$ of $G$. Any such property is therefore characterized by features of a class $G$ that are invariant to its size; i.e., if the size of $G$ is provided in advance, then no information is gained about whether $G$ satisfies $M$ or its complement $M^c$.

In contrast, property $L_d$ is an example of a very unbalanced property. It is an example of a general property whose posterior function decreases fast with respect to $k$, as we now consider.

Example: Let $Q$ be a property with a distribution $P^*_{n,k}(Q) = c\alpha^k$, $0 < \alpha < 1$, $c > 0$. In a similar way as Theorem 2 is proved, we obtain that the information value of this property tends to
\[ I(x_Q : F) \approx n - \frac{\Phi(-a) \log \frac{2^n \alpha}{1 + \alpha} + \varphi(a)/\sqrt{\alpha 2^n} + O(1/(\alpha 2^n))}{1 - (1 + \alpha)^{-2^n}} \]
with increasing $n$, where $a = (2 - 2^n p)/\sqrt{2^n p (1 - p)}$ and $p = \alpha/(1 + \alpha)$. This is approximated as
\[ I(x_Q : F) \simeq n - \left( n + \log \frac{\alpha}{1 + \alpha} \right) = \log \left( 1 + \frac{1}{\alpha} \right). \]
For instance, suppose $P^*_{n,k}(Q)$ is an exponential probability function; then taking $\alpha = 1/e$ gives an information value of
\[ I(x_Q : F) \simeq \log(1 + e) \simeq 1.89. \]
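This last value is easy to verify numerically:

```python
# Arithmetic check: alpha = 1/e gives log2(1 + 1/alpha) = log2(1 + e) bits.
from math import e, log2
print(log2(1 + e))   # 1.894..., i.e. approximately 1.89 as stated
```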
For the complement $Q^c$, if we approximate $P^*_{n,k}(Q^c) = 1 - c\alpha^k \simeq 1$ and the conditional entropy (7) as
\[ \frac{\sum_{k \geq 2} P^*_{n,k}(Q^c)\,P_n(k) \log k}{\sum_{j \geq 1} P^*_{n,j}(Q^c)\,P_n(j)} \simeq \sum_{k \geq 2} P_n(k) \log k \approx \log(2^{n-1}) = n - 1, \]
where $P_n(k)$ is the binomial probability distribution with parameters $2^n$ and $1/2$, then the information value is approximated by
\[ I(x_{Q^c} : F) \simeq n - (n - 1) = 1. \]
By taking $\alpha$ to be even smaller, we obtain a property $Q$ which has a very different information value compared to $Q^c$.

8 Comparison

We now compare the information values and the efficiencies for the various inputs $x$ considered in the previous section. In this comparison we also include the following simple property: let $G \in P(\{0,1\}^{[n]})$ be any class of functions and denote by the identity property $M(G)$ of $G$ the 'property which is satisfied only by $G$'. We immediately have
\[ I(x_{M(G)} : F) = n - \log |G| \tag{16} \]
and $\ell(x_{M(G)}) = 2^n - \log 1 = 2^n$, since the cardinality $|\mathcal{M}(G)| = 1$. The cost in this case is
\[ \kappa_F(x_{M(G)}) = \frac{2^n}{n - \log |G|}. \]
Note that $x$ conveys that $t$ is in a specific class $G$, hence the entropy and information values are according to Kolmogorov's definitions (3) and (4). The efficiency in this case is simple to compute using (15): we have $I^*(\ell(x)) = I^*(2^n)$, and the sums in (14) vanish since $r(2^n) = 1$; thus $I^*(2^n) = n$ and $\eta_F(x) = (n - \log |G|)/n$.

Let us first compare the information value and the efficiency of three subcases of this identity property with the following three class cardinalities: $|G| = \sqrt{n}$, $|G| = n$ and $|G| = 2^{n - \sqrt{n}}$. Figure 2 displays the information value and Figure 3 shows the efficiency for these subcases. As seen, the information value increases as the cardinality of $G$ decreases, which follows from (16). The efficiency $\eta_F(x)$ for these three subcases may be obtained exactly and equals (in the same order as above) $1 - (\log n)/(2n)$, $1 - (\log n)/n$ and $1/\sqrt{n}$. Thus a property with a single element $G$ may have an efficiency which increases or decreases depending on the rate of growth of the cardinality of $G$ with respect to $n$.

Let us compare the efficiency for the inputs $x_{L_d}$, $x_{V_d^c}$, $x_{V_d}$ and $x_{V_d(S_m)}$. As an example, suppose that the VC-dimension parameter $d$ grows as $d(n) = \sqrt{n}$. As can be seen from Figure 4, property $V_d$ is the most efficient of the three, staying above the 80% level. Letting the sample size increase at the rate $m(n) = n^a$, then from Figure 5 the efficiency of $V_d(S_m)$ increases with respect to $a$ but remains smaller than the efficiency of property $V_d$. Letting the VC-dimension increase as $d(n) = n^b$, Figure 6 displays the efficiency of $V_d(S_m)$ as a function of $b$ for several values of $a = 0.1, 0.2, \ldots, 0.4$, where $n$ is fixed at 10. As seen, the efficiency increases approximately linearly with $a$ and non-linearly with respect to $b$, with a saturation at approximately $b = 0.2$.

9 Conclusions

The information width introduced here is a fundamental concept based on which a combinatorial interpretation of information is defined and used as the basis for the concept of efficiency of information. We defined the width from two perspectives, that of the provider and that of the acquirer of information, and used it as a reference point according to which the efficiency of any input information can be evaluated.
As an application we considered the space of binary function classes on a finite domain and computed the efficiency of the information conveyed by various class properties. The main point that arises from these results is that side-information of different types can be quantified, computed and compared in this common framework, which is more general than the standard framework used in the theory of information transmission.

As further work, it will be interesting to compute the efficiency of information in other applications, for instance, pertaining to properties of classes of Boolean functions $f : \{0,1\}^n \to \{0,1\}$ (for which there are many applications; see for instance [8]). It will also be interesting to examine standard search algorithms, for instance those used in machine learning over a finite search space (or hypothesis space), and compute their information efficiency, i.e., accounting for all side-information available to an algorithm (including data) and computing for it the acquired information value and efficiency. In our treatment of this subject we did not touch the issue of how the information is used. For instance, a learning algorithm uses side-information and training data to learn a pattern classifier which has minimal prediction (generalization) error. A search algorithm in the area of information retrieval uses an input query to return an answer set that overlaps as many of the relevant objects as possible while containing as few non-relevant objects as possible. In each such application the information acquirer, e.g., an algorithm, has an associated performance criterion, e.g., prediction error, percentage recall or precision, according to which it is evaluated. What is the relationship between information and performance? Does performance depend on efficiency or only on the amount of provided information? What are the consequences of using input information of low efficiency? For the current work, we leave these questions open.

The remaining parts of the paper consist of the technical work used to obtain the previous results.

10 Technical results

In this section we provide the proofs of Theorems 2 to 5. Our approach is to estimate the number of sets $G \subseteq F$ that satisfy a property $M$. Using techniques from [28], we employ a probabilistic method by which a random class is generated and the probability that it satisfies $M$ is computed. As we use the uniform probability distribution on the elements of the power set $P(F)$, probabilities yield cardinalities of the corresponding sets. The computation of $\omega_x(k)$, and hence of (6), follows directly. It is worth noting that, as in [12], the notion of probability is used here only to simplify some of the counting arguments; thus, unlike in Shannon's information, it plays no role in the actual definition of information. Before proceeding with the proofs, in the next section we describe the probability model for generating a random class.

10.1 Random class generation

In this subsection we describe the underlying probabilistic processes with which a random class is generated. We use the so-called binomial model to generate a random class of binary functions (this model has been used extensively in the area of random graphs [11]). In this model, the random class $\mathcal{F}$ is constructed through $2^n$ independent coin tosses, one for each function in $F$, with a probability of success (i.e., of selecting a function into $\mathcal{F}$) equal to $p$.
The probability distribution $P_{n,p}$ is formally defined on $P(F)$ as follows: given the parameters $n$ and $0 \leq p \leq 1$, for any $G \in P(F)$,
\[ P_{n,p}(\mathcal{F} = G) = p^{|G|} (1 - p)^{2^n - |G|}. \]
In our application we choose $p = 1/2$ and denote the probability distribution by $P_n \equiv P_{n,\frac{1}{2}}$. It is clear that for any element $G \in P(F)$, the probability that the random class $\mathcal{F}$ equals $G$ is
\[ \alpha_n \equiv P_n(\mathcal{F} = G) = \frac{1}{2^{2^n}} \tag{17} \]
and the probability of $\mathcal{F}$ having cardinality $k$ is
\[ P_n(|\mathcal{F}| = k) = \binom{2^n}{k} \alpha_n, \qquad 1 \leq k \leq 2^n. \tag{18} \]
The following fact easily follows from the definition of conditional probability: for any set $B \subseteq P(F)$,
\[ P_n(\mathcal{F} \in B \mid |\mathcal{F}| = k) = \frac{\sum_{G \in B} \alpha_n}{\binom{2^n}{k} \alpha_n} = \frac{|B|}{\binom{2^n}{k}}. \tag{19} \]
Denote by $\mathcal{F}(k) = \{ G \in P(F) : |G| = k \}$ the collection of binary-function classes of cardinality $k$, $1 \leq k \leq 2^n$. Consider the uniform probability distribution on $\mathcal{F}(k)$, defined as follows: given the parameters $n$ and $1 \leq k \leq 2^n$, for any $G \in P(F)$,
\[ P^*_{n,k}(G) = \frac{1}{\binom{2^n}{k}}, \quad \text{if } G \in \mathcal{F}(k), \tag{20} \]
and $P^*_{n,k}(G) = 0$ otherwise. Hence from (19) and (20) it follows that for any $B \subseteq P(F)$,
\[ P_n(\mathcal{F} \in B \mid |\mathcal{F}| = k) = P^*_{n,k}(\mathcal{F} \in B). \tag{21} \]
It will be convenient to use another probability distribution, which estimates $P^*_{n,k}$ and is defined as follows. Construct a random $n \times k$ binary matrix by fair-coin tosses, with the $nk$ elements taking values 0 or 1 independently with probability $1/2$. Denoting by $Q^*_{n,k}$ the probability measure corresponding to this process, for any matrix $U \in \mathcal{U}_{n \times k}(\{0,1\})$,
\[ Q^*_{n,k}(U) = \frac{1}{2^{nk}}. \]
Clearly, the columns of a binary matrix $U$ are vectors of length $n$, i.e., binary functions on $[n]$. Hence the set $\mathrm{cols}(U)$ of columns of $U$ represents a class of binary functions. It contains $k$ elements if and only if the columns of $U$ are distinct, and fewer than $k$ elements if two columns are the same. Call a binary matrix simple if all of its columns are distinct ([1]), and denote by $S$ the event of obtaining a simple matrix. We claim that the conditional distribution of the set of columns of a random binary matrix, knowing that the matrix is simple, is the uniform probability distribution $P^*_{n,k}$. To see this, observe that the probability that the columns of a random binary matrix are distinct is
\[ Q^*_{n,k}(S) = \frac{2^n (2^n - 1) \cdots (2^n - k + 1)}{2^{nk}}. \tag{22} \]
For any fixed class $G \in P(F)$ of $k$ binary functions there are $k!$ corresponding simple matrices in $\mathcal{U}_{n \times k}(\{0,1\})$. Therefore, given a simple matrix $S$, the probability that $\mathrm{cols}(S)$ equals a class $G$ is
\[ Q^*_{n,k}(G \mid S) = \frac{k!}{2^{nk}} \cdot \frac{2^{nk}}{2^n (2^n - 1) \cdots (2^n - k + 1)} = \frac{1}{\binom{2^n}{k}} = P^*_{n,k}(G). \tag{23} \]
Using the distribution $Q^*_{n,k}$ enables simpler computations of the asymptotic probabilities of several types of events that are associated with the properties of Theorems 2 through 5. We henceforth resort to the following process for generating a random class $G$: for every $1 \leq k \leq 2^n$ we repeatedly and independently draw matrices of size $n \times k$ using $Q^*_{n,k}$ until we get a simple matrix $M_{n \times k}$. Then we randomly draw a value for $k$ according to the distribution (18) and choose the formerly generated simple matrix corresponding to this chosen $k$. Since this is a simple matrix, by (23) it is clear that this choice yields a random class $G$ which is distributed uniformly in $\mathcal{F}(k)$ according to $P^*_{n,k}$. This is stated formally in Lemma 3 below, but first we have an auxiliary lemma which shows that the above process converges.
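Both samplers are easy to realize. The sketch below implements the binomial model $P_n$ (with $p = 1/2$) and the rejection process behind $Q^*_{n,k}$ (redraw until a simple matrix appears); the function names are ours, for illustration.

```python
# Sketch of the two generators of Section 10.1.
import random
from itertools import product

def random_class(n, p=0.5, rng=random):
    """Binomial model: each of the 2^n functions enters the class
    independently with probability p (uniform on P(F) when p = 1/2)."""
    return [f for f in product((0, 1), repeat=n) if rng.random() < p]

def random_simple_class(n, k, rng=random):
    """Redraw n x k fair-coin matrices until all columns are distinct
    (a 'simple' matrix); by (23) the column set is then uniform on F(k)."""
    while True:
        cols = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(k)]
        if len(set(cols)) == k:
            return set(cols)
```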
Lemma 2. Let $n = 1, 2, \ldots$ and consider the process of drawing sequences $s_m^{(k)} = \{ M_{k,n}^{(i)} \}_{i=1}^m$, $1 \leq k \leq 2^n$, all of length $m$, where the $k$th sequence consists of matrices $M_{k,n}^{(i)} \in \mathcal{U}_{n \times k}(\{0,1\})$ which are randomly and independently drawn according to the probability distribution $Q^*_{n,k}$. Then the probability that after $m = n e^{2^n}$ trials there exists a $k$ such that no simple matrix appears in $s_m^{(k)}$ converges to zero with increasing $n$.

Proof: Let $a(n,k) = Q^*_{n,k}(S)$ be the probability of getting a simple matrix $M_{n,k} \in \mathcal{U}_{n \times k}$. Then the probability that there exists some $k$ such that $s_m^{(k)}$ consists only of non-simple matrices is
\[ q(n,m) \leq \sum_{k=1}^{2^n} (1 - a(n,k))^m \leq \sum_{k=1}^{2^n} e^{-m\,a(n,k)}. \tag{24} \]
From (22) we have
\[ \ln a(n,k) = \sum_{i=0}^{k-1} \ln(2^n - i) - nk \ln 2 = \sum_{j=2^n - (k-1)}^{2^n} \ln j \;-\; nk \ln 2. \tag{25} \]
Since $\ln x$ is an increasing function of $x$, for any pair of positive integers $2 \leq a \leq b$ we have
\[ \sum_{j=a}^{b} \ln j \geq \int_{a-1}^{b} \ln x \, dx = b(\ln b - 1) - (a - 1)(\ln(a - 1) - 1). \]
Hence
\[ \ln a(n,k) \geq 2^n (n \ln 2 - 1) - (2^n - k)(\ln(2^n - k) - 1) - nk \ln 2 \equiv b(n,k) \]
and the right side of (24) is now bounded as follows:
\[ \sum_{k=1}^{2^n} e^{-m\,a(n,k)} \leq \sum_{k=1}^{2^n} e^{-m e^{b(n,k)}}. \tag{26} \]
From a simple check of the derivative of $b(n,k)$ with respect to $k$, it follows that $b(n,k)$ is a decreasing function of $k$ on $1 \leq k \leq 2^n$. Replacing each term in the sum on the right side of (26) by the last term gives the bound
\[ q(n,m) \leq e^{n \ln 2 - m e^{-2^n}}. \tag{27} \]
The exponent is negative provided $m > n \ln(2)\, e^{2^n}$. Choosing $m = n e^{2^n}$ guarantees that $q(n,m) \to 0$ with increasing $n$.

The following result states that the measure $Q^*_{n,k}$ may replace $P^*_{n,k}$ uniformly over $1 \leq k \leq 2^n$.

Lemma 3. Let $B \subseteq P(F)$. Then
\[ \max_{1 \leq k \leq 2^n} \left| P^*_{n,k}(B) - Q^*_{n,k}(B) \right| \to 0 \]
as $n$ tends to infinity.

Proof: From (23) we have
\[ P^*_{n,k}(B) = Q^*_{n,k}(B \mid S) = \frac{Q^*_{n,k}(B \cap S)}{Q^*_{n,k}(S)}. \]
Then
\[ \max_k \left| P^*_{n,k}(B) - Q^*_{n,k}(B) \right| = \max_k \left| \frac{Q^*_{n,k}(B \cap S)}{Q^*_{n,k}(S)} - Q^*_{n,k}(B) \right| \leq \max_k \frac{1}{Q^*_{n,k}(S)} \; \max_k \left| Q^*_{n,k}(B \cap S) - Q^*_{n,k}(B)\, Q^*_{n,k}(S) \right| \]
\[ \leq \max_k \frac{1}{Q^*_{n,k}(S)} \left( \max_k \left| Q^*_{n,k}(B \cap S) - Q^*_{n,k}(B) \right| + \max_k Q^*_{n,k}(B)\, \max_k \bigl( 1 - Q^*_{n,k}(S) \bigr) \right). \tag{28} \]
From Lemma 2 it follows that
\[ \max_k \bigl( 1 - Q^*_{n,k}(S) \bigr) \to 0, \qquad \max_k \left| \frac{1}{Q^*_{n,k}(S)} \right| \to 1 \tag{29} \]
with increasing $n$. For any $1 \leq k \leq 2^n$,
\[ Q^*_{n,k}(B) + Q^*_{n,k}(S) - 1 \;\leq\; Q^*_{n,k}(B \cap S) \;\leq\; Q^*_{n,k}(B), \]
and by Lemma 2, $Q^*_{n,k}(S)$ tends to 1 uniformly over $1 \leq k \leq 2^n$ with increasing $n$. Hence
\[ \max_k \left| Q^*_{n,k}(B \cap S) - Q^*_{n,k}(B) \right| \to 0, \]
which together with (28) and (29) implies the statement of the lemma.

We now proceed to the proofs of the theorems of Section 7.

10.2 Proofs

Note that for any property $M$, the quantity $\omega_x(k)$ in (6) is the ratio of the number of classes $G \in \mathcal{F}(k)$ that satisfy $M$ to the total number of classes that satisfy $M$. It is therefore equal to $P_n(|\mathcal{F}| = k \mid \mathcal{F} \models M)$. Our approach starts by computing the probability $P_n(\mathcal{F} \models M \mid |\mathcal{F}| = k)$, from which $P_n(|\mathcal{F}| = k \mid \mathcal{F} \models M)$, and then $\omega_x(k)$, are obtained.

10.2.1 Proof of Theorem 2

We start with an auxiliary lemma which states that the probability $P_n(\mathcal{F} \models L_d \mid |\mathcal{F}| = k)$ possesses a zero-one behavior.
10.2.1 Proof of Theorem 2

We start with an auxiliary lemma which states that the probability $P_n(\mathcal{F} \models L_d \mid |\mathcal{F}| = k)$ possesses a zero-one behavior.

Lemma 4. Let $\mathcal{F}$ be a class of cardinality $k_n$ randomly drawn according to the uniform probability distribution $P^*_{n,k_n}$ on $F(k_n)$. Then as $n$ increases, the probability $P^*_{n,k_n}(\mathcal{F} \models L_d)$ that $\mathcal{F}$ satisfies property $L_d$ tends to 0 or 1 according to whether $k_n \gg \log(2n/d)$ or $k_n = 1 + \kappa_n$ with $\kappa_n \ll (\log n)/d$, respectively.

Proof: For brevity we sometimes write $k$ for $k_n$. Using Lemma 3 it suffices to show that $Q^*_{n,k}(\mathcal{F} \models L_d)$ tends to 0 or 1 under the stated conditions. For any set $S \subseteq [n]$, $|S| = d$, and any fixed $v \in \{0,1\}^d$, under the probability distribution $Q^*_{n,k}$ the event $E_v$ that every function $f \in \mathcal{F}$ satisfies $f|_S = v$ has probability $(1/2)^{kd}$. Denote by $E_S$ the event that all functions in the random class $\mathcal{F}$ have the same restriction on $S$. There are $2^d$ possible restrictions on $S$ and the events $E_v$, $v \in \{0,1\}^d$, are disjoint. Hence $Q^*_{n,k}(E_S) = 2^d (1/2)^{kd} = 2^{-(k-1)d}$. The event that $\mathcal{F}$ has property $L_d$, i.e., that $L(\mathcal{F}) \ge d$, equals the union of $E_S$ over all $S \subseteq [n]$ of cardinality $d$. Thus we have

$$Q^*_{n,k}(\mathcal{F} \models L_d) = Q^*_{n,k}\Bigl(\bigcup_{S \subseteq [n]:\, |S| = d} E_S\Bigr) \ \le\ \binom{n}{d}\, Q^*_{n,k}(E_{[d]}) = 2^{-(k-1)d}\, \frac{n^d (1 - o(1))^d}{d!}.$$

For $k = k_n \gg \log(2n/d)$ the right side tends to zero, which proves the first statement.

Let $S_i = \{id + 1, id + 2, \ldots, d(i+1)\} \subseteq [n]$, $0 \le i \le m-1$, where $m = \lfloor n/d \rfloor$, be mutually disjoint sets. The event that $L_d$ fails equals $\bigcap_{S:\,|S| = d} \overline{E_S}$, which is contained in $\bigcap_{i=0}^{m-1} \overline{E_{S_i}}$. Since the index sets $S_i$ are disjoint, the events $E_{S_i}$ are independent and each has probability $Q^*_{n,k}(E_{[d]}) = 2^{-(k-1)d}$. Hence the probability that $L_d$ fails is at most

$$\bigl(1 - 2^{-(k-1)d}\bigr)^{\lfloor n/d \rfloor} \ \le\ \exp\bigl(-\lfloor n/d \rfloor\, 2^{-(k-1)d}\bigr),$$

which tends to zero when $k = k_n = 1 + \kappa_n$ with $\kappa_n \ll (\log n)/d$. The second statement is proved.

Remark 12. While from this result it is clear that the critical value of $k$ for the conditional probability $P_n(L_d \mid k)$ to tend to 1 is $O(\log n)$, as will be shown below, when considering the conditional probability $P_n(k \mid L_d)$ the most probable value of $k$ is much higher, at $O(2^{n-d})$.
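The zero-one behavior of Lemma 4 is easy to observe numerically. The Monte Carlo sketch below (our illustration, with illustrative parameter choices) exploits the fact that $L_d$ holds exactly when at least $d$ coordinates have all functions in agreement, and estimates $Q^*_{n,k}(\mathcal{F} \models L_d)$ for small and large $k$.

```python
import random

def satisfies_L_d(cols, d):
    """Property L_d: at least d coordinates on which every function
    (column) in the class agrees, i.e. L(F) >= d."""
    n = len(cols[0])
    agree = sum(len({f[i] for f in cols}) == 1 for i in range(n))
    return agree >= d

def estimate(n, k, d, trials=2000, seed=0):
    """Monte Carlo estimate of Q*_{n,k}(F ⊨ L_d); column collisions are
    ignored, which Lemma 3 justifies for large n."""
    rng = random.Random(seed)
    hits = sum(
        satisfies_L_d([tuple(rng.randint(0, 1) for _ in range(n))
                       for _ in range(k)], d)
        for _ in range(trials))
    return hits / trials

n, d = 64, 2
for k in (2, 4, 16):    # small k: estimate near 1; k >> log(2n/d): near 0
    print(k, estimate(n, k, d))
```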
We continue now with the proof of Theorem 2. For any probability measure $P$ on $\mathcal{P}(F)$, write $P(k \mid L_d) = P(|\mathcal{F}| = k \mid \mathcal{F} \models L_d)$. By the premise of Theorem 2, the input $x$ describes the target $t$ as an element of a class that satisfies property $L_d$. In this case the quantity $\omega_x(k)$ is the ratio of the number of classes of cardinality $k$ that satisfy $L_d$ to the total number of classes that satisfy $L_d$. Since by (17) the probability distribution $P_n$ is uniform over the space $\mathcal{P}(F)$, whose size is $2^{2^n}$, then

$$\omega_x(k) = \frac{P_n(k, L_d)\, 2^{2^n}}{P_n(L_d)\, 2^{2^n}} = P_n(k \mid L_d). \qquad (30)$$

We have

$$P_n(k \mid L_d) = \frac{P_n(L_d \mid k)\, P_n(k)}{\sum_{j=1}^{2^n} P_n(L_d \mid j)\, P_n(j)}.$$

By (21), it follows that the sum in (6) equals

$$\sum_{k=2}^{2^n} \omega_x(k)\log k = \sum_{k=2}^{2^n} \frac{P^*_{n,k}(L_d)\, P_n(k)}{\sum_{j=1}^{2^n} P^*_{n,j}(L_d)\, P_n(j)}\, \log k. \qquad (31)$$

Let $N = 2^n$. By Lemma 3 and the proof of Lemma 4, as $n$ (hence $N$) increases,

$$P^*_{n,k}(L_d) \approx Q^*_{n,k}(L_d) = \frac{1}{2^{d(k-1)}}\, A(N,d) \qquad (32)$$

where $A(N,d)$ satisfies $(\log N)/d \le A(N,d) \le (\log N)^d / d!$. Let $p = 1/(1 + 2^d)$; then using (32) the ratio in (31) is

$$\frac{\sum_{k=2}^{N} \binom{N}{k} p^k (1-p)^{N-k} \log k}{\sum_{j=1}^{N} \binom{N}{j} p^j (1-p)^{N-j}}. \qquad (33)$$

Substituting for $N$ and $p$, the denominator equals

$$1 - (1-p)^N = 1 - \Bigl(1 - \frac{1}{1+2^d}\Bigr)^{2^n} = 1 - \Bigl(\frac{2^d}{1+2^d}\Bigr)^{2^n}. \qquad (34)$$

By the DeMoivre-Laplace limit theorem [9], the binomial distribution $P_{N,p}(k)$ with parameters $N$ and $p$ satisfies

$$P_{N,p}(k) \approx \frac{1}{\sigma}\,\phi\Bigl(\frac{k-\mu}{\sigma}\Bigr), \qquad N \to \infty,$$

where $\phi(x) = (1/\sqrt{2\pi})\exp(-x^2/2)$ is the standard normal probability density function and $\mu = Np$, $\sigma = \sqrt{Np(1-p)}$. The sum in the numerator of (33) may be approximated by an integral:

$$\frac{1}{\sigma}\int_2^{\infty} \phi\Bigl(\frac{x-\mu}{\sigma}\Bigr)\log x\, dx = \int_{(2-\mu)/\sigma}^{\infty} \phi(x)\log(\sigma x + \mu)\, dx.$$

The log factor equals $\log\mu + \log(1 + x\sigma/\mu) = \log\mu + x\sigma/\mu + O(x^2(\sigma/\mu)^2)$. Denoting $a = (2-\mu)/\sigma$, the right side above equals

$$\Phi(-a)\log\mu + \frac{\sigma}{\mu}\,\phi(a) + O\Bigl(\bigl(\tfrac{\sigma}{\mu}\bigr)^2\Bigr),$$

where $\Phi(x)$ is the standard normal cumulative distribution function. Substituting for $\mu$, $\sigma$, $p$ and $N$, and combining with (34), (31) is asymptotically equal to

$$\sum_{k=2}^{2^n} \omega_x(k)\log k \ \approx\ \frac{\Phi(-a)\log\bigl(\tfrac{2^n}{1+2^d}\bigr) + 2^{-(n-d)/2}\phi(a) + O(2^{-(n-d)})}{1 - \bigl(\tfrac{2^d}{1+2^d}\bigr)^{2^n}} \qquad (35)$$

where $a = 2(1+2^d)2^{-(n+d)/2} - 2^{(n-d)/2}$. In Theorem 2, the set $Y$ is the class $F$ (see (6)), hence $\log|F| = n$ and

$$I(x : F) = n - \sum_{k \ge 2} \omega_x(k)\log k.$$

Combining with (35), the first statement of Theorem 2 follows.

We now compute the description complexity $\ell(x_{L_d})$. Since in this setting $Y$ is $F$ and $Z$ is $\mathcal{P}(F)$, by (9) the description complexity is $\ell(x_{L_d}) = 2^n - \log|L_d|$. Since the probability distribution $P_n$ is uniform on $\mathcal{P}(F)$, the cardinality of $L_d$ equals $|L_d| = 2^{2^n} P_n(L_d)$. It follows that $\ell(x_{L_d}) = -\log P_n(L_d)$; hence it suffices to compute $P_n(L_d)$. Letting $N = 2^n$, we have

$$P_n(L_d) = \sum_{k=1}^{N} P_n(L_d \mid k)\, P_n(k) = \sum_{k=1}^{N} P^*_{n,k}(L_d)\, P_n(k).$$

Using (32) and letting $p = 1/(1+2^d)$, this becomes

$$\sum_{k=1}^{N} \frac{1}{2^{d(k-1)}}\, A(N,d)\, \frac{1}{2^N}\binom{N}{k} = \frac{1}{(1-p)^N}\cdot\frac{1}{2^{N-d}}\, A(N,d)\,\bigl(1 - (1-p)^N\bigr).$$

Letting $q = (1-p)^N$, it follows that

$$-\log P_n(L_d) = \log\frac{2^{N-d}}{A(N,d)} + \log\frac{q}{1-q} = N - d - \log A(N,d) + q + O(q^2) + \log q = N\bigl(1 - p - O(p^2)\bigr) - d - c\log\log N + o(1)$$

where $1 \le c \le d$. Substituting for $N$ gives the result.
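The DeMoivre-Laplace step above can be sanity-checked numerically for moderate $n$. The sketch below (ours; parameter values are illustrative only) evaluates the exact binomial-weighted ratio (33) and compares it with the leading term $\Phi(-a)\log(2^n/(1+2^d))$ of (35), divided by the denominator (34).

```python
from math import lgamma, log, log2, exp, sqrt, erf

def binom_logpmf(N, k, p):
    """Log of the Binomial(N, p) probability mass at k."""
    return (lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
            + k * log(p) + (N - k) * log(1 - p))

def exact_sum(n, d):
    """The ratio (33) computed by direct summation."""
    N, p = 2 ** n, 1.0 / (1 + 2 ** d)
    num = sum(exp(binom_logpmf(N, k, p)) * log2(k) for k in range(2, N + 1))
    return num / (1.0 - (1 - p) ** N)          # denominator (34)

def leading_term(n, d):
    """Leading term of the asymptotic form (35)."""
    N, p = 2 ** n, 1.0 / (1 + 2 ** d)
    mu, sigma = N * p, sqrt(N * p * (1 - p))
    a = (2 - mu) / sigma
    Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    return Phi(-a) * log2(mu) / (1.0 - (1 - p) ** N)

for n, d in [(10, 2), (14, 3)]:
    print(n, d, exact_sum(n, d), leading_term(n, d))   # the two agree closely
```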
10.2.2 Proof of Theorem 3

We start with an auxiliary lemma that states a threshold value for the cardinality of a random element of $F(k)$ that satisfies property $V^c_d$.

Lemma 5. For any integer $d > 0$ let $k$ be an integer satisfying $k \ge 2^d$. Let $\mathcal{F}$ be a class of cardinality $k$ randomly drawn according to the uniform probability distribution $P^*_{n,k}$ on $F(k)$. Then $\lim_{n\to\infty} P^*_{n,k}(V^c_d) = 1$.

Remark 13. When $k_n < 2^d$ there does not exist an $E \subseteq [n]$ with $|\mathrm{tr}_E(\mathcal{F})| = 2^d$, hence $P^*_{n,k_n}(V^c_d) = 0$. For $k_n \ge 2^d$, $P^*_{n,k_n}(V^c_d)$ tends to 1. Hence, for a random class $\mathcal{F}$ to have property $V^c_d$, the critical value of its cardinality is $2^d$.

We proceed now with the proof of Lemma 5.

Proof: It suffices to prove the result for $k = 2^d$, since $P^*_{n,k}(\mathcal{F} \models V^c_d) \ge P^*_{n,2^d}(\mathcal{F} \models V^c_d)$. As in the proof of Lemma 4, we represent $P^*_{n,2^d}$ by $Q^*_{n,2^d}$ using (23); with Lemma 3 it then suffices to show that $Q^*_{n,2^d}(\mathcal{F} \models V^c_d)$ tends to 1. Denote by $U_d$ the 'complete' matrix with $d$ rows and $2^d$ columns formed by all $2^d$ binary vectors of length $d$, ordered, for instance, lexicographically. The event "$\mathcal{F} \models V^c_d$" occurs if there exists a subset $S = \{i_1, \ldots, i_d\} \subseteq [n]$ such that the submatrix whose rows are indexed by $S$ and whose columns are indexed by $[2^d]$ equals $U_d$. Let $S_i = \{id+1, id+2, \ldots, d(i+1)\}$, $0 \le i \le m-1$, be the sets defined in the proof of Lemma 4, and consider the $m$ corresponding events, the $i$-th being that the submatrix whose rows are indexed by $S_i$ equals $U_d$. Since the sets $S_i$, $0 \le i \le m-1$, are disjoint, these events are independent and each has probability $2^{-d2^d}$. Hence the probability that at least one of them occurs is

$$1 - \bigl(1 - 2^{-d2^d}\bigr)^{\lfloor n/d \rfloor},$$

which tends to 1 as $n$ increases.

We continue with the proof of Theorem 3. As in the proof of Theorem 2, since by (17) the probability distribution $P_n$ is uniform over $\mathcal{P}(F)$,

$$\omega_x(k) = \frac{P_n(k, V^c_d)\,2^{2^n}}{P_n(V^c_d)\,2^{2^n}} = P_n(k \mid V^c_d) = \frac{P^*_{n,k}(V^c_d)\, P_n(k)}{\sum_{j=1}^{2^n} P^*_{n,j}(V^c_d)\, P_n(j)}, \qquad 1 \le k \le 2^n.$$

Considering Remark 13, in this case the sum in (6) is

$$\sum_{k=2^d}^{2^n} \frac{P^*_{n,k}(V^c_d)\, P_n(k)\,\log k}{\sum_{j=2^d}^{2^n} P^*_{n,j}(V^c_d)\, P_n(j)}. \qquad (36)$$

We now obtain its asymptotic value as $n$ increases. From the proof of Lemma 5 it follows that for all $k \ge 2^d$,

$$P^*_{n,k}(V^c_d) \approx 1 - (1-\beta)^{rk}, \qquad \beta = 2^{-d2^d}, \quad r = \frac{n}{d2^d}.$$

Since $\beta$ is an exponentially small positive real, we approximate $(1-\beta)^{rk}$ by $1 - rk\beta$ (by assumption $n < d2^d$, hence this remains positive for all $1 \le k \le 2^n$). Therefore we take

$$P^*_{n,k}(V^c_d) \approx rk\beta \qquad (37)$$

and (36) is approximated by

$$\frac{\sum_{k=2^d}^{2^n} k\, P_n(k)\log k}{\sum_{j=2^d}^{2^n} j\, P_n(j)}. \qquad (38)$$

As before, to simplify notation let $N = 2^n$ and let $P_N(k)$ be the binomial distribution with parameters $N$ and $p = 1/2$. With $\mu = N/2$ and $\sigma = \sqrt{N}/2$, the DeMoivre-Laplace limit theorem gives

$$P_N(k) \approx \frac{1}{\sigma}\,\phi\Bigl(\frac{k-\mu}{\sigma}\Bigr), \qquad N \to \infty.$$

Thus (38) is approximated by the ratio of two integrals:

$$\frac{\int_{(2^d-\mu)/\sigma}^{\infty} \phi(x)(\sigma x + \mu)\log(\sigma x + \mu)\, dx}{\int_{(2^d-\mu)/\sigma}^{\infty} \phi(x)(\sigma x + \mu)\, dx}. \qquad (39)$$

The log factor equals $\log\mu + \log(1 + x\sigma/\mu) = \log\mu + x\sigma/\mu + O(x^2(\sigma/\mu)^2)$. Denote

$$a = \frac{\mu - 2^d}{\sigma}; \qquad (40)$$

then the numerator is approximated by

$$\Phi(a)\,\mu\log\mu + \sigma(1 + \log\mu)\,\phi(a) + O\Bigl(\frac{\sigma^3}{\mu^2}\,a^2\phi(a)\Bigr) = \log(N/2)\,\Phi(a)\,\frac{N}{2} + \Bigl(1 + \frac{a^2}{N}\Bigr)\log(N/2)\,\phi(a)\,\frac{\sqrt N}{2}.$$

Similarly, the denominator of (39) is approximated by $\Phi(a)N/2 + \phi(a)\sqrt N/2$. The ratio, and hence (36), tends to

$$\frac{\log(N/2)\,\Phi(a)\,\frac{N}{2} + \bigl(1 + \frac{a^2}{N}\bigr)\log(N/2)\,\phi(a)\,\frac{\sqrt N}{2}}{\Phi(a)\,\frac{N}{2} + \phi(a)\,\frac{\sqrt N}{2}}.$$

Substituting back for $a$, the above tends to $\log(N/2) = \log N - 1$. With $N = 2^n$ and (6), the first statement of the theorem follows.

We now compute the description complexity $\ell(x_{V^c_d})$. Following the steps of the second part of the proof of Theorem 2 (Section 10.2.1), we have $\ell(x_{V^c_d}) = -\log P_n(V^c_d)$. Using (37), the probability is approximated by

$$P_n(V^c_d) \approx r\beta \sum_{k=2^d}^{N} k\, P_N(k),$$

and, as before, this is approximated by $r\beta\,(\Phi(a)N + \phi(a)\sqrt N)/2$. Thus, substituting for $r$ and $\beta$, we have

$$-\log P_n(V^c_d) \approx d(2^d + 1) + \log d - \log\bigl(\Phi(a)N + \phi(a)\sqrt N\bigr) - \log\log N + 1.$$

Substituting for $N$ yields the result.
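The disjoint-block device in the proof of Lemma 5 can also be checked by simulation. The sketch below (our illustration; $d = 1$ keeps the complete matrix $U_d$ small) estimates the probability that some block of $d$ consecutive rows, restricted to the first $2^d$ columns, equals $U_d$, and compares it with $1 - (1 - 2^{-d2^d})^{\lfloor n/d\rfloor}$.

```python
import random
from itertools import product

def contains_complete_block(M, d):
    """True if some disjoint block of d consecutive rows of M, restricted
    to the first 2^d columns, equals the complete matrix U_d (whose columns
    are all 2^d binary vectors of length d, in lexicographic order)."""
    U_d = list(product((0, 1), repeat=d))
    m = len(M) // d
    for i in range(m):
        rows = M[i * d:(i + 1) * d]
        if [tuple(r[j] for r in rows) for j in range(2 ** d)] == U_d:
            return True
    return False

def estimate(n, d, trials=5000, seed=1):
    """Monte Carlo estimate of the disjoint-block event under Q*_{n,k}."""
    rng = random.Random(seed)
    k = 2 ** d                      # k = 2^d suffices, as in the proof
    hits = sum(
        contains_complete_block(
            [[rng.randint(0, 1) for _ in range(k)] for _ in range(n)], d)
        for _ in range(trials))
    return hits / trials

d = 1
for n in (8, 32, 128):
    print(n, estimate(n, d), 1 - (1 - 2 ** (-d * 2 ** d)) ** (n // d))
```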
10.2.3 Proof of Theorem 4

The proof is almost identical to that of Theorem 3. From (37) we have

$$P^*_{n,k}(V_d) = 1 - P^*_{n,k}(V^c_d) = \begin{cases} 1, & 1 \le k < 2^d, \\ \approx 1 - r\beta k, & 2^d \le k \le 2^n, \end{cases}$$

hence

$$\sum_{k=2}^{2^n} \frac{P^*_{n,k}(V_d)\, P_n(k)\log k}{\sum_{j=1}^{2^n} P^*_{n,j}(V_d)\, P_n(j)} \ \approx\ \frac{\sum_{k=2}^{2^n} P_n(k)\log k \ -\ r\beta\sum_{k=2^d}^{2^n} k\,P_n(k)\log k}{\sum_{j=1}^{2^n} P_n(j) \ -\ r\beta\sum_{j=2^d}^{2^n} j\,P_n(j)}. \qquad (41)$$

Let $a$ be as in (40) and set $b = (\mu - 2)/\sigma$ and $s = \Phi(a)N/2 + \phi(a)\sqrt N/2$. Then the numerator tends to $\log(N/2)(\Phi(b) - rs\beta) + \phi(b)/\sqrt N$ and the denominator tends to $1 - (1/2)^N - rs\beta$. Then (41) tends to

$$\log(N/2)\,\frac{\Phi(b) - rs\beta}{1 - (1/2)^N - rs\beta} + \frac{\phi(b)}{\sqrt N\,\bigl(1 - (1/2)^N - rs\beta\bigr)} \ \approx\ \log(N/2) + \frac{\phi(b)}{\sqrt N\,(1 - rs\beta)}.$$

Substituting for $r$, $\beta$ and $N$ yields the statement of the theorem.

10.2.4 Proof of Theorem 5

The probability that a random class of cardinality $k$ satisfies the property $V_d(S_m)$ is

$$P_n(\mathcal{F} \models V_d(S_m) \mid |\mathcal{F}| = k) = P_n(\mathcal{F} \models V_d,\ \mathcal{F}|_\xi = \zeta \mid |\mathcal{F}| = k) = P_n(\mathcal{F} \models V_d \mid \mathcal{F}|_\xi = \zeta,\ |\mathcal{F}| = k)\; P_n(\mathcal{F}|_\xi = \zeta \mid |\mathcal{F}| = k). \qquad (42)$$

The factor on the right of (42) is the probability of the event $E_\zeta$ that a random class $\mathcal{F}$ of size $k$ has all of its elements with the same restriction $\zeta$ on the sample $\xi$. As in the proof of Lemma 4, it suffices to use the probability distribution $Q^*_{n,k}$, in which case $Q^*_{n,k}(E_\zeta) = \gamma^k$ where $\gamma \equiv (1/2)^m$. The left factor of (42) is the probability that a random class $\mathcal{F}$ of cardinality $k$ which satisfies $E_\zeta$ also satisfies property $V_d$. This is the same as the probability that a random class on $[n]\setminus S_m$ satisfies property $V_d$, namely $P^*_{n-m,k}(V_d)$, which equals 1 for $k < 2^d$; using (37), for $k \ge 2^d$ it is approximated by

$$1 - P^*_{n-m,k}(V^c_d) \approx 1 - rk\beta, \qquad r = \frac{n-m}{d2^d}, \quad \beta = 2^{-d2^d}.$$

Hence the conditional entropy becomes

$$\sum_{k=2}^{2^n} \frac{\gamma^k P^*_{n-m,k}(V_d)\, P_n(k)\log k}{\sum_{j=1}^{2^n}\gamma^j P^*_{n-m,j}(V_d)\, P_n(j)} \ \approx\ \frac{\sum_{k=2}^{2^n}\gamma^k P_n(k)\log k \ -\ r\beta\sum_{k=2^d}^{2^n}\gamma^k k\, P_n(k)\log k}{\sum_{j=1}^{2^n}\gamma^j P_n(j) \ -\ r\beta\sum_{j=2^d}^{2^n}\gamma^j j\, P_n(j)}. \qquad (43)$$

Let $p = \gamma/(1+\gamma)$, $N = 2^n$, and denote by $P_{N,p}(k)$ the binomial distribution with parameters $N$ and $p$. Then (43) becomes

$$\frac{\sum_{k=2}^{N} P_{N,p}(k)\log k \ -\ r\beta\sum_{k=2^d}^{N} k\, P_{N,p}(k)\log k}{\sum_{j=1}^{N} P_{N,p}(j) \ -\ r\beta\sum_{j=2^d}^{N} j\, P_{N,p}(j)}. \qquad (44)$$

With $\mu = Np$ and $\sigma = \sqrt{Np(1-p)}$, let $a = (\mu - 2^d)/\sigma$, $b = (\mu - 2)/\sigma$ and $s = \Phi(a)Np + \phi(a)\sqrt{Np(1-p)}$. Then the numerator tends to

$$\log(Np)\bigl(\Phi(b) - rs\beta\bigr) + \frac{\phi(b)}{\sqrt N}\sqrt{\frac{1-p}{p}}$$

and the denominator tends to $1 - (2^m/(2^m+1))^N - rs\beta$. Therefore (44) tends to

$$\log(Np) + \frac{\phi(b)}{\sqrt N\,(1 - o(1) - rs\beta)}\sqrt{\frac{1-p}{p}}.$$

Substituting for $r$, $s$, $\beta$ and $N$ yields the first statement of the theorem.

Next we obtain the description complexity. We have $\ell(x_{V_d(S_m)}) = -\log P_n(V_d(S_m))$. The probability $P_n(V_d(S_m))$ is the denominator of (43), which equals the denominator of (44) multiplied by a factor of $(2(1-p))^{-N}$; hence from the above

$$-\log P_n(V_d(S_m)) \approx -\log\Bigl(1 - \Bigl(\frac{2^m}{1+2^m}\Bigr)^N - r\beta s\Bigr) + N + N\log(1-p).$$

Let $q = \bigl(\frac{2^m}{1+2^m}\bigr)^N + r\beta s$; then we have the estimate

$$\ell(x_{V_d(S_m)}) \approx \log\frac{1}{1-q} + N\bigl(1 + \log(1-p)\bigr),$$

from which the second statement of the theorem follows.
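As a closing numerical check, the ratio (44) can be evaluated directly for moderate parameters; the first statement of Theorem 5 predicts a value close to $\log(Np)$. The sketch below is ours; the parameter values are illustrative and chosen so that $n - m > 0$ and $n < d2^d$.

```python
from math import lgamma, log, log2, exp

def logpmf(N, k, p):
    """Log of the Binomial(N, p) probability mass at k."""
    return (lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
            + k * log(p) + (N - k) * log(1 - p))

def ratio_44(n, m, d):
    """Direct evaluation of (44) with gamma = 2^-m, p = gamma/(1+gamma),
    r = (n-m)/(d 2^d) and beta = 2^(-d 2^d)."""
    N = 2 ** n
    gamma = 2.0 ** (-m)
    p = gamma / (1 + gamma)
    r, beta = (n - m) / (d * 2 ** d), 2.0 ** (-d * 2 ** d)
    pmf = [exp(logpmf(N, k, p)) for k in range(N + 1)]
    num = (sum(pmf[k] * log2(k) for k in range(2, N + 1))
           - r * beta * sum(k * pmf[k] * log2(k) for k in range(2 ** d, N + 1)))
    den = (sum(pmf[j] for j in range(1, N + 1))
           - r * beta * sum(j * pmf[j] for j in range(2 ** d, N + 1)))
    return num / den

n, m, d = 14, 2, 3
print(ratio_44(n, m, d), log2(2 ** n / (1 + 2 ** m)))  # both near log(Np)
```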
11 Acknowledgements

The author thanks Dr. Vitaly Maiorov of the Department of Mathematics at the Technion for useful remarks.

Bibliography

[1] R. Anstee, B. Fleming, Z. Füredi, and A. Sali. Color critical hypergraphs and forbidden configurations. In Proc. EuroComb'2005, pages 117–122. DMTCS, 2005.

[2] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[3] M. Anthony, G. Brightwell, and C. Cooper. The Vapnik-Chervonenkis dimension of a random graph. Discrete Mathematics, 138(1-3):43–56, 1995.

[4] A. Blum. Learning boolean functions in an infinite attribute space. Mach. Learn., 9(4):373–386, 1992.

[5] B. Bollobás. Combinatorics: Set Systems, Hypergraphs, Families of Vectors, and Combinatorial Probability. Cambridge University Press, 1986.

[6] S. Boucheron, O. Bousquet, and G. Lugosi. Introduction to Statistical Learning Theory. In O. Bousquet, U. von Luxburg, and G. Rätsch (editors), pages 169–207. Springer, 2004.

[7] T. Cover and R. King. A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24(4):413–421, 1978.

[8] Y. Crama and P. Hammer. Boolean Functions: Theory, Algorithms and Applications. Cambridge University Press, to appear. http://www.rogp.hec.ulg.ac.be/Crama/Publications/BookPage.html

[9] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, New York, second edition, 1957.

[10] D. Haussler and E. Welzl. Epsilon-nets and simplex range queries. Discrete Computational Geometry, 2:127–151, 1987.

[11] S. Janson, T. Łuczak, and A. Ruciński. Random Graphs. Wiley, New York, 2000.

[12] A. N. Kolmogorov. On tables of random numbers. Sankhyā, The Indian J. Stat., A(25):369–376, 1963.

[13] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1:1–17, 1965.

[14] I. Kontoyiannis. The complexity and entropy of literary styles. Technical Report 97, NSF, October 1997.

[15] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 1997.

[16] V. Maiorov and J. Ratsaby. The degree of approximation of sets in Euclidean space using sets with bounded Vapnik-Chervonenkis dimension. Discrete Applied Mathematics, 86(1):81–93, 1998.

[17] Y. Mansour. Learning Boolean Functions via the Fourier Transform.

[18] T. Mitchell. Machine Learning. McGraw Hill, 1997.

[19] B. K. Natarajan. On learning boolean functions. In STOC '87: Proceedings of the Nineteenth Annual ACM Conference on Theory of Computing, pages 296–304, New York, NY, USA, 1987. ACM Press.

[20] J. Pach and P. K. Agarwal. Combinatorial Geometry. Wiley-Interscience Series, 1995.

[21] A. Pinkus. n-Widths in Approximation Theory. Springer-Verlag, 1985.

[22] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.

[23] J. Ratsaby. On the combinatorial representation of information. In Danny Z. Chen and D. T. Lee, editors, The Twelfth Annual International Computing and Combinatorics Conference (COCOON'06), volume LNCS 4112, pages 479–488. Springer-Verlag, 2006.

[24] J. Ratsaby and V. Maiorov. On the value of partial information for learning by examples. Journal of Complexity, 13:509–544, 1998.

[25] J. W. Szostak. Functional information: Molecular messages. Nature, 423:689, 2003.

[26] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[27] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.
[28] B. Ycart and J. Ratsaby. VC and related dimensions of random function classes. Disc. Math. and Theor. Comp. Sci., 2008. To appear.

Figures

[The plots themselves are not recoverable from the extraction; only the captions are reproduced below.]

Figure 1: (a) $I^*(\ell)$ and (b) $\kappa^*(\ell)$, plotted against $\ell$ for $n = 5, 6, 7$.

Figure 2: Information $I(x_{M(G)} : F)$ versus $n$ for (a) $|G| = \sqrt n$, (b) $|G| = n$ and (c) $|G| = 2^n - \sqrt n$.

Figure 3: Efficiency $\eta_F(x_{M(G)})$ versus $n$ for (a) $|G| = \sqrt n$, (b) $|G| = n$ and (c) $|G| = 2^n - \sqrt n$.

Figure 4: Efficiency $\eta_F(x)$ versus $n$ for (a) $x_{L_d}$, (b) $x_{V^c_d}$ and (c) $x_{V_d}$, with $d = \sqrt n$.

Figure 5: Efficiency $\eta_F(x_{V_d(S_m)})$ versus $n$, with $m = n^a$ for $a = 0.01, 0.1, 0.5, 0.95$ and $d = \sqrt n$.

Figure 6: Efficiency $\eta_F(x_{V_d(S_m)})$ versus $b$, with $n = 10$, $m(n) = n^a$ and $d(n) = n^b$.