No Free Lunch versus Occam's Razor in Supervised Learning
The No Free Lunch theorems are often used to argue that domain-specific knowledge is required to design successful algorithms. We use algorithmic information theory to argue the case for a universal bias allowing an algorithm to succeed in all interesting problem domains. Additionally, we give a new algorithm for off-line classification, inspired by Solomonoff induction, with good performance on all structured problems under reasonable assumptions. This includes a proof of the efficacy of the well-known heuristic of randomly selecting training data in the hope of reducing misclassification rates.
💡 Research Summary
The paper revisits the No‑Free‑Lunch (NFL) theorems, which state that when averaged uniformly over all possible problems of a given type, no learning algorithm can outperform random guessing. The authors argue that the uniform‑distribution assumption is unrealistic for real‑world tasks, because most practical problems exhibit structure that can be described by short programs. By invoking algorithmic information theory—specifically Kolmogorov complexity and the Solomonoff universal prior (M)—they introduce a non‑uniform, “universal” bias that favours simpler, more compressible hypotheses, embodying the philosophical principle of Occam’s razor in a formal way.
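The intuition that "structured means compressible" can be seen with a toy sketch. The snippet below uses compressed length (via `zlib`) as a crude, computable stand-in for the incomputable Kolmogorov complexity C(x); the specific strings and the use of `zlib` are illustrative choices, not anything from the paper.

```python
import zlib
import random

# Compressed length as a rough, computable proxy for C(x):
# a highly regular string admits a short description, while a
# patternless string does not.
structured = b"01" * 500                               # 1000 chars of pure pattern
random.seed(0)
noise = bytes(random.choice(b"01") for _ in range(1000))  # 1000 patternless chars

c_structured = len(zlib.compress(structured))
c_noise = len(zlib.compress(noise))
print(c_structured, c_noise)  # the regular string compresses far better
```

Under an Occam-style prior that weights a hypothesis roughly by 2^(−description length), the structured string would receive exponentially more prior mass than the random one.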
After establishing the standard NFL result (Theorem 5) and its reliance on a uniform prior over the hypothesis space, the paper defines Kolmogorov complexity C(x), monotone complexity Kₘ, and the Solomonoff prior M(x) = Σ_{p:U(p)=x*}2^{−ℓ(p)}. Because M is a semi‑measure, they use its normalised version Mₙₒᵣₘ as a proper probability distribution. Under this prior, they prove a “free lunch” exists: Proposition 1 shows that for binary classification with input space X = {0,1}ⁿ and a training set of size 2^{n−k}, there exists an algorithm whose expected misclassification rate (with respect to Mₙₒᵣₘ) is strictly less than ½ for sufficiently large n. The proof splits the hypothesis space into those that agree with the training data on a constant label and those that do not, then applies Lemma 10 to bound the loss.
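The flavour of Proposition 1 can be reproduced numerically on a tiny domain. The sketch below is not the paper's proof: it replaces Mₙₒᵣₘ with a simplicity-biased prior built from a hypothetical complexity proxy (the number of label changes along the truth table), and checks that the Bayes-optimal classifier's expected misclassification rate on unseen inputs falls strictly below the ½ that the uniform-prior NFL argument would force.

```python
from itertools import product

n = 3
X = list(product([0, 1], repeat=n))    # input space {0,1}^n, here 8 points
m = len(X) // 2                        # train on half the domain
funcs = list(product([0, 1], repeat=len(X)))  # all 256 labelings of X

def proxy_complexity(f):
    # Hypothetical stand-in for Kolmogorov complexity: one plus the
    # number of label changes along the truth table (fewer = simpler).
    return 1 + sum(f[i] != f[i + 1] for i in range(len(f) - 1))

# Simplicity-biased prior ~ 2^{-complexity}, normalised.
weights = [2.0 ** -proxy_complexity(f) for f in funcs]
Z = sum(weights)

expected_error = 0.0
for w, target in zip(weights, funcs):          # target drawn from the prior
    # Posterior: labelings consistent with the observed training labels.
    post = [(v, g) for v, g in zip(weights, funcs) if g[:m] == target[:m]]
    pz = sum(v for v, _ in post)
    for i in range(m, len(X)):                 # predict each unseen input
        p1 = sum(v for v, g in post if g[i] == 1) / pz
        pred = 1 if p1 > 0.5 else 0
        expected_error += (w / Z) * (pred != target[i]) / (len(X) - m)

print(expected_error)  # strictly below the 0.5 of random guessing
```

The key point mirrors the proposition: averaging over targets drawn from a non-uniform, simplicity-favouring prior (rather than uniformly over all labelings) is exactly what breaks the NFL averaging argument.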
The central technical contribution is a complexity‑based offline classification algorithm, denoted A*. Given training data (Xₘ, f|_{Xₘ}), A* searches for the simplest total function f̃ : X → Y that is consistent with the observed labels, i.e., it minimises K_M(f̃; X) subject to f̃(x_i) = f(x_i) for all x_i ∈ Xₘ. The algorithm then predicts the label of any unseen instance x by outputting f̃(x). This embodies a formal version of "choose the simplest hypothesis consistent with the data," and the authors prove that if the true target function has low Kolmogorov complexity (i.e., is structured), the expected error of A* is low. Moreover, they provide a theoretical justification for the common heuristic of selecting training examples uniformly at random: random sampling increases the probability that the simplest consistent hypothesis will be discovered, thereby reducing expected loss.
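Since K_M is incomputable, any executable version of this rule must substitute a computable complexity measure. The sketch below is a hypothetical miniature of the A*-style rule on a tiny domain: enumerate all total labelings consistent with the training data, keep one of minimal complexity under the same toy proxy (label changes in the truth table), and predict with it. The target, training indices, and proxy are all illustrative assumptions.

```python
from itertools import product

n = 3
X = list(product([0, 1], repeat=n))    # input space {0,1}^n

def proxy_complexity(f):
    # Toy computable stand-in for K_M (which is incomputable).
    return 1 + sum(f[i] != f[i + 1] for i in range(len(f) - 1))

def simplest_consistent(train_idx, labels):
    # Among all total labelings agreeing with the training labels,
    # return one of minimal proxy complexity ("choose the simplest
    # hypothesis consistent with the data").
    candidates = (f for f in product([0, 1], repeat=len(X))
                  if all(f[i] == y for i, y in zip(train_idx, labels)))
    return min(candidates, key=proxy_complexity)

target = [x[0] for x in X]             # a simple (structured) target: first bit
train_idx = [0, 2, 4, 6]               # a spread-out training sample
f_hat = simplest_consistent(train_idx, [target[i] for i in train_idx])

test_idx = [1, 3, 5, 7]
accuracy = sum(f_hat[i] == target[i] for i in test_idx) / len(test_idx)
print(accuracy)
```

The spread-out training sample is what pins down the simple hypothesis here, loosely echoing the paper's point about random sampling: training points scattered across the domain make it likelier that only the truly simple consistent labelings survive.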
The authors acknowledge several limitations. Both the Solomonoff prior and Kolmogorov complexity are incomputable; consequently, A* can only be approximated in practice. The paper restricts its analysis to offline classification, leaving the more intricate online or active‑learning settings for future work. It also notes that while the non‑uniform prior yields a free lunch, the magnitude of the advantage depends on how well the prior matches the true distribution of real problems, a question that requires empirical investigation.
In summary, the paper reconciles the apparent conflict between NFL theorems and Occam’s razor by showing that when a universal, structure‑favoring prior is adopted, a learning algorithm can indeed achieve better-than‑random performance across all “interesting” (i.e., compressible) problem domains. The proposed A* algorithm provides a concrete, theoretically grounded method for exploiting this bias in offline classification, and the work highlights the importance of incorporating algorithmic‑information‑theoretic priors into the design of general‑purpose learning systems.