There are (at least) three approaches to quantifying information. The first, algorithmic information or Kolmogorov complexity, takes events as strings and, given a universal Turing machine, quantifies the information content of a string as the length of the shortest program producing it. The second, Shannon information, takes events as belonging to ensembles and quantifies the information resulting from observing the given event in terms of the number of alternate events that have been ruled out. The third, statistical learning theory, has introduced measures of capacity that control (in part) the expected risk of classifiers. These capacities quantify the expectations regarding future data that learning algorithms embed into classifiers. This note describes a new method of quantifying information, effective information, that links algorithmic information to Shannon information, and also links both to capacities arising in statistical learning theory. After introducing the measure, we show that it provides a non-universal analog of Kolmogorov complexity. We then apply it to derive basic capacities in statistical learning theory: empirical VC-entropy and empirical Rademacher complexity. A nice byproduct of our approach is an interpretation of the explanatory power of a learning algorithm in terms of the number of hypotheses it falsifies, counted in two different ways for the two capacities. We also discuss how effective information relates to information gain, Shannon and mutual information.
The probability that system $m$ outputs $y \in Y$ given input $x \in X$ is encoded in the Markov matrix $p_m(y \mid x)$.
The effective information generated when system $m$ outputs $y$ is computed as follows. First, let the potential repertoire $p_{\mathrm{unif}}(X)$ be the input set equipped with the uniform distribution. Next, compute the actual repertoire via Bayes' rule
$$p_m(x \mid y) := \frac{p_m(y \mid do(x)) \cdot p_{\mathrm{unif}}(x)}{p_m(y)}, \tag{1}$$
where $p_m(y) = \sum_x p_m(y \mid do(x)) \cdot p_{\mathrm{unif}}(x)$ and $do(\cdot)$ refers to Pearl's interventional calculus [7].
Effective information is the Kullback-Leibler divergence between the two repertoires
$$ei(m, y) := D\big[\, p_m(X \mid y) \,\big\|\, p_{\mathrm{unif}}(X) \,\big]. \tag{2}$$
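To make the computation concrete, here is a minimal Python sketch (not from the original note) that evaluates Eqs. (1) and (2) for a system specified as a Markov matrix; the function name and example matrix are illustrative assumptions.

```python
import numpy as np

def effective_information(p_y_given_dox, y):
    """Effective information ei(m, y) for a system given as a Markov matrix.

    p_y_given_dox[x, j] = p_m(y = j | do(x)); rows index interventions on
    inputs, columns index outputs. The potential repertoire is uniform on X.
    """
    n_inputs = p_y_given_dox.shape[0]
    p_unif = np.full(n_inputs, 1.0 / n_inputs)          # potential repertoire
    p_y = np.dot(p_unif, p_y_given_dox[:, y])           # p_m(y), defined after Eq. (1)
    actual = p_y_given_dox[:, y] * p_unif / p_y         # actual repertoire, Eq. (1)
    support = actual > 0
    # Kullback-Leibler divergence D[actual || uniform], Eq. (2)
    return np.sum(actual[support] * np.log2(actual[support] / p_unif[support]))

# Illustrative 4-input, 2-output system (rows: inputs, columns: outputs).
p = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])
print(effective_information(p, y=0))   # bits of effective information for output 0
```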
For a deterministic function $f : X \to Y$, the actual repertoire and effective information are
$$p_f(x \mid y) = \begin{cases} \dfrac{1}{|f^{-1}(y)|} & \text{if } f(x) = y, \\ 0 & \text{otherwise,} \end{cases} \qquad ei(f, y) = \log_2 \frac{|X|}{|f^{-1}(y)|}. \tag{3}$$
The support of the actual repertoire is the pre-image $f^{-1}(y)$. Elements in the pre-image all have the same probability since they cannot be distinguished by the function $f$. Effective information quantifies the size of the pre-image relative to the input set: the smaller ("sharper") the pre-image, the higher the $ei$.
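For the deterministic case, a short sketch (assuming a hypothetical map $f(x) = x \bmod 4$ on $X = \{0, \ldots, 7\}$, chosen only for illustration) shows Eq. (3) at work.

```python
import numpy as np

def ei_deterministic(f, X, y):
    """Effective information for a deterministic map f on a finite set X.

    By Eq. (3) the actual repertoire is uniform on the pre-image f^{-1}(y),
    so ei(f, y) = log2(|X| / |f^{-1}(y)|).
    """
    preimage = [x for x in X if f(x) == y]
    return np.log2(len(X) / len(preimage))

# Hypothetical example: X = {0, ..., 7} and f(x) = x mod 4.
X = list(range(8))
f = lambda x: x % 4
print(ei_deterministic(f, X, y=1))   # pre-image {1, 5}, so log2(8/2) = 2 bits
```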
We show that effective information is a non-universal analog of Kolmogorov complexity. Given a universal Turing machine $T$, the (unnormalized) Solomonoff prior probability of string $s$ is
$$p_T(s) := \sum_i 2^{-len(i)}, \tag{4}$$
where the sum is over strings $i$ that cause $T$ to output $s$ as a prefix, such that no proper prefix of $i$ outputs $s$, and $len(i)$ is the length of $i$. Kolmogorov complexity is $K(s) := -\log_2 p_T(s)$. Kolmogorov complexity is more commonly defined as the length of the shortest program on a universal prefix machine that produces $s$; the two definitions coincide up to an additive constant by Levin's Coding Theorem [1].
Replace the universal Turing machine $T$ with a deterministic system $f : X \to Y$. All inputs have $len(x) = \log_2 |X|$ in the optimal code for the uniform distribution on $X$. Define the effective probability of $y$ as
$$p_f(y) := \sum_{x \in f^{-1}(y)} 2^{-len(x)} = \frac{|f^{-1}(y)|}{|X|}. \tag{5}$$
Note that $p_f(y)$ is a special case of $p_m(y)$, as defined after Eq. (1). The effective distribution is thus a non-universal analog of the Solomonoff prior, since it is computed by replacing the universal Turing machine $T$ in Eq. (4) with the deterministic physical system $f : X \to Y$.
In the deterministic case, effective information turns out to be $ei(f, y) = -\log_2 p_f(y)$, analogously to Kolmogorov complexity. Effective information is non-universal, but computable, since it depends on the choice of $f$.
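As a quick consistency check (continuing the hypothetical example above with $|X| = 8$ and $|f^{-1}(y)| = 2$):
$$p_f(y) = \frac{|f^{-1}(y)|}{|X|} = \frac{2}{8}, \qquad -\log_2 p_f(y) = 2 = \log_2 \frac{|X|}{|f^{-1}(y)|},$$
which agrees with Eq. (3).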
This section uses a particular deterministic function, the learning algorithm $L_{F,D}$, to connect effective information and the effective distribution to statistical learning theory.
Given a finite set $X$, let the hypothesis space $\Sigma_X = \{\sigma : X \to \{\pm 1\}\}$ contain all labelings of elements of $X$. Now, given a set of functions $F \subset \Sigma_X$ and unlabeled data $D \in X^l$, define the learning algorithm
$$L_{F,D} : \{\pm 1\}^l \to [0, 1] : \sigma \mapsto \min_{f \in F} \frac{1}{l} \sum_{i=1}^{l} \mathbb{I}\big[f(d_i) \neq \sigma_i\big]. \tag{6}$$
The learning algorithm takes a labeling of the data as input and outputs the empirical risk of the function that best fits the data. We drop the subscripts and simply write $L$ below.
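A minimal Python sketch of such a learning algorithm, assuming a hypothetical class of threshold functions on $X = \{0, \ldots, 5\}$ and four unlabeled data points (these choices are illustrative, not from the note):

```python
def learning_algorithm(F, D, sigma):
    """L_{F,D}: given labels sigma for the unlabeled data D, return the empirical
    risk of the best-fitting function in F (a minimal sketch of Eq. (6))."""
    l = len(D)
    return min(sum(f(d) != s for d, s in zip(D, sigma)) / l for f in F)

# Hypothetical setup: X = {0, ..., 5}, F = threshold classifiers, l = 4 data points.
F = [lambda x, t=t: 1 if x >= t else -1 for t in range(7)]   # seven thresholds
D = [0, 2, 3, 5]                                             # unlabeled data
print(learning_algorithm(F, D, (-1, -1, 1, 1)))              # 0.0: realized by t = 3
print(learning_algorithm(F, D, (1, -1, 1, -1)))              # 0.5: no threshold fits
```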
Define empirical VC-entropy (as in [3]) as $V(F, D) := \log_2 |q_D(F)|$, where $q_D : F \to \mathbb{R}^l : f \mapsto \big(f(d_1), \ldots, f(d_l)\big)$. Also define empirical Rademacher complexity as
$$\hat{R}(F, D) := \mathbb{E}_{\sigma \sim \mathrm{unif}\{\pm 1\}^l}\Big[\, \sup_{f \in F} \frac{1}{l} \sum_{i=1}^{l} \sigma_i f(d_i) \,\Big].$$
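The following sketch computes both capacities by brute force for the small hypothetical threshold class used above; the exact expectation over all $2^l$ sign vectors is feasible only because $l$ is tiny.

```python
import itertools
import numpy as np

def empirical_vc_entropy(F, D):
    """V(F, D): log2 of the number of distinct labelings of D realized by F."""
    labelings = {tuple(f(d) for d in D) for f in F}
    return np.log2(len(labelings))

def empirical_rademacher(F, D):
    """Empirical Rademacher complexity: exact expectation over all 2^l sign
    vectors of sup_f (1/l) sum_i sigma_i f(d_i); feasible only for small l."""
    l = len(D)
    total = 0.0
    for sigma in itertools.product([-1, 1], repeat=l):
        total += max(sum(s * f(d) for s, d in zip(sigma, D)) / l for f in F)
    return total / 2 ** l

# Same hypothetical threshold class and data as above.
F = [lambda x, t=t: 1 if x >= t else -1 for t in range(7)]
D = [0, 2, 3, 5]
print(empirical_vc_entropy(F, D))    # log2(5): five labelings are realizable
print(empirical_rademacher(F, D))
```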
These capacities can be used to bound the expected risk of classifiers; see [8, 9] for details. The following propositions are proved in [5].

Proposition 1 (effective information “is” empirical VC-entropy).
$$ei(L, 0) = -\log_2 p_L(0) = l - V(F, D).$$

Proposition 2 (expectation over $p_L(\epsilon)$ “is” empirical Rademacher complexity).
$$\mathbb{E}\big[\epsilon \mid p_L\big] = \tfrac{1}{2} - \tfrac{1}{2}\hat{R}(F, D).$$
Thus, replacing the universal Turing machine with the learning algorithm $L_{F,D}$, we obtain that our analog of Kolmogorov complexity, the effective information of output $\epsilon = 0$, is essentially empirical VC-entropy. Moreover, the expectation of the analog of the Solomonoff distribution is essentially Rademacher complexity.
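Propositions 1 and 2 can be checked numerically on the same hypothetical example: enumerate all $2^l$ labelings, form the effective distribution $p_L$, and compare both sides. This sketch assumes the illustrative threshold class introduced earlier.

```python
import itertools
import numpy as np

# Numerical check of Propositions 1 and 2 on the hypothetical threshold class.
F = [lambda x, t=t: 1 if x >= t else -1 for t in range(7)]
D = [0, 2, 3, 5]
l = len(D)

# Effective distribution p_L: outputs of L under uniformly sampled labelings.
outputs = [min(sum(f(d) != s for d, s in zip(D, sigma)) / l for f in F)
           for sigma in itertools.product([-1, 1], repeat=l)]

p_L_zero = outputs.count(0.0) / 2 ** l
V = np.log2(len({tuple(f(d) for d in D) for f in F}))          # empirical VC-entropy
R = sum(max(sum(s * f(d) for s, d in zip(sigma, D)) / l for f in F)
        for sigma in itertools.product([-1, 1], repeat=l)) / 2 ** l

print(-np.log2(p_L_zero), l - V)        # Proposition 1: the two values coincide
print(np.mean(outputs), 0.5 - R / 2)    # Proposition 2: the two values coincide
```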
The two quantities $ei(L, 0)$ and $\mathbb{E}[\epsilon \mid p_L]$ are measures of explanatory power: as they increase, expected future performance improves. By Eq. (3), the effective information generated by $L$ is
$$ei(L, 0) = \log_2 \#\big\{\sigma \in \{\pm 1\}^l\big\} - \log_2 \#\big\{\sigma \in \{\pm 1\}^l : \sigma \text{ is not falsified by } F \text{ on } D\big\}, \tag{7}$$
where hypotheses are counted after taking logarithms. Effective information, which relates to VC-entropy, counts the number of hypotheses the learning algorithm falsifies when it fits the labels perfectly, without taking into account how often they are wrong. Similarly (see [5] for details), the expectation is
$$\mathbb{E}\big[\epsilon \mid p_L\big] = \frac{1}{2^l} \sum_{\sigma \in \{\pm 1\}^l} \min_{f \in F} \frac{\#\{i : f(d_i) \neq \sigma_i\}}{l}. \tag{8}$$
Expected $\epsilon$, which relates to Rademacher complexity, looks at the average behavior of the learning algorithm, averaging over the fractions of hypotheses falsified, weighted by how much of the data they are falsified on.
The bounds proved in [3,8,9], which control the expected future performance of the classifier minimizing empirical risk, can therefore be rephrased in terms of the number of hypotheses falsified by the learning algorithm, Eqs. (7) and (8), suggesting a possible route towards rigorously grounding the role of falsification in science [6].
We relate effective information to Shannon and mutual information.
Suppose we have a model $m$ that generates data $d \in D$ with probability $p_m(d \mid h)$ given hypothesis $h \in H$. For a prior distribution $p(H)$ on hypotheses, the information gained by observing $d$ is
$$D\big[\, p_m(H \mid d) \,\big\|\, p(H) \,\big]. \tag{9}$$
Kullback-Leibler divergence $D[p \,\|\, q]$ can be interpreted as the number of Y
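A minimal sketch of Eq. (9) for a finite hypothesis set, with a hypothetical two-hypothesis, three-outcome model (the numbers and function name are illustrative assumptions):

```python
import numpy as np

def information_gain(likelihood, prior, d):
    """Information gained about H by observing datum d, as in Eq. (9):
    D[ p_m(H | d) || p(H) ] for a finite hypothesis set (a minimal sketch)."""
    posterior = likelihood[:, d] * prior
    posterior = posterior / posterior.sum()
    return np.sum(posterior * np.log2(posterior / prior))

# Illustrative model: two hypotheses, three possible data points.
likelihood = np.array([[0.7, 0.2, 0.1],    # p_m(d | h1)
                       [0.1, 0.3, 0.6]])   # p_m(d | h2)
prior = np.array([0.5, 0.5])
print(information_gain(likelihood, prior, d=0))   # bits gained by observing d = 0
```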