No Free Lunch Theorems
Nikhil R
August 28, 2025
No Free Lunch (NFL) theorems are a class of results in optimization and machine learning which assert that, averaged over all possible problems, every learner performs equally well in expectation. This article provides an introduction to NFL theorems in the context of machine learning.
An Elementary Model
Consider the data generating process specified by the tuple $(\mu, \mathcal{A}, \sigma, p)$, where $\mu$ is a probability measure over some set of objects $\mathcal{O}$, each with an associated label in $\{0, 1\}$; $\mathcal{A}$ is the set of all possible object attributes; $\sigma : \mathcal{O} \to \mathcal{A}$ is an observation rule; and $p(y \mid a)$ is the conditional probability of label $y$ given attributes $a$. The classification task is to learn an approximation of the conditional probability distribution given some training data $D$ (a finite sequence of attribute-label pairs). A learner $L$ is an algorithm that consumes $D$ and outputs a predictor with no knowledge of $\mu$ or $p$. Generalization accuracy is the probability of predicting the correct label for unobserved attributes.
A Conservation Law
A useful notion of generalization performance would be to measure generalization accuracy relative to a random guesser. A random guesser would always achieve a generalization accuracy of $\tfrac{1}{2}$. Consequently, let Generalization Performance = Generalization Accuracy $- \tfrac{1}{2}$.
Let $\mathcal{D}$ be the set of all possible training data, and let $\mathrm{GP}(L, D, p)$ denote the generalization performance of a learner $L$ on objects $o$ such that $\sigma(o) \notin \sigma(D)$, given training data $D$. Since $\mathcal{A}$ is finite, every $p$ can be identified by an $m$-tuple $(p_1, \ldots, p_m)$, where $p_i = p(1 \mid a_i)$ and $m = |\mathcal{A}|$.
To evaluate the learner, we can consider an expression of the form:
$$\sum_{D \in \mathcal{D}} \int_{[0,1]^m} \mathrm{GP}(L, D, p) \, dp.$$
Now consider just the expression $\int_{[0,1]^m} \mathrm{GP}(L, D, p) \, dp$. For a fixed $D$, we can construct $p'$ by setting $p'_i = 1 - p_i$ if $a_i \notin \sigma(D)$, and $p'_i = p_i$ otherwise. Thus, for a given $D$ we have a measure-preserving permutation $p \mapsto p'$ of $[0,1]^m$. Observe that $\mathrm{GP}(L, D, p') = -\mathrm{GP}(L, D, p)$. Changing variables, we have $\int \mathrm{GP}(L, D, p) \, dp = -\int \mathrm{GP}(L, D, p) \, dp$. To see why the sign flips, let $g_i$ be the probability that $L(D)$ predicts 1 for an unobserved attribute $a_i$; note that $g_i$ depends only on $D$. Then, on $a_i$:
$$g_i p_i + (1 - g_i)(1 - p_i) - \tfrac{1}{2} = (2 g_i - 1)\left(p_i - \tfrac{1}{2}\right),$$
which negates under $p_i \mapsto 1 - p_i$.
Thus,
$$\int_{[0,1]^m} \mathrm{GP}(L, D, p) \, dp = 0.$$
We have effectively shown that,
$$\sum_{D \in \mathcal{D}} \int_{[0,1]^m} \mathrm{GP}(L, D, p) \, dp = 0.$$
In other words, "Generalization Performance is conserved over all learning situations"! Consequently,
- If there are problems where a learner significantly outperforms another, there must exist problems where it underperforms by the same amount in total.
- Every optimizer is as good or as bad as every other optimizer.
- We cannot, a priori, expect learners to do better than random chance.
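The conservation argument can be checked by brute force on a toy instance: enumerate every labelling of a six-element domain, train a fixed learner on half of it, and sum its off-training-set performance over all labellings. The majority-vote learner below is a hypothetical stand-in; any learner yields the same total.

```python
from itertools import product

DOMAIN = range(6)                      # six objects, attributes fully observable
TRAIN, TEST = [0, 1, 2], [3, 4, 5]     # learner sees TRAIN, is scored on TEST

def majority_learner(train_labels):
    # hypothetical learner: predict the training set's majority label everywhere
    guess = 1 if 2 * sum(train_labels) >= len(train_labels) else 0
    return lambda x: guess

# accumulate 2*(#correct) - |TEST|, i.e. generalization performance
# relative to random guessing, scaled by 2*|TEST| to stay in integers
total = 0
for f in product([0, 1], repeat=len(DOMAIN)):      # all 2^6 labelling functions
    h = majority_learner([f[i] for i in TRAIN])
    correct = sum(h(x) == f[x] for x in TEST)
    total += 2 * correct - len(TEST)

print(total)  # 0: off-training-set performance sums to zero
```

Flipping the labels of the test points pairs every labelling with one on which the learner does exactly as badly as it did well, which is why the total vanishes.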
Related Notions
Ray Solomonoff (1960) introduced the idea of inductive inference via a computational formalization of pure Bayesianism. Let $T$ be a theory, $T'$ be any alternative theory, and $E$ be observed data. Conditional probability dictates,
$$P(T \mid E) = \frac{P(E \mid T) \, P(T)}{\sum_{T'} P(E \mid T') \, P(T')}.$$
Then future data $E'$ can be predicted as:
$$P(E' \mid E) = \sum_{T} P(E' \mid T) \, P(T \mid E).$$
Clearly, this requires a prior probability over all theories. One such prior could be $P(T) = 2^{-K(T)}$, where $K(T)$ is the prefix-free Kolmogorov Complexity of $T$ (normalizable by Kraft's Inequality).
Generally, the Kolmogorov complexity $K(x)$ of an object $x$ is defined as the length of the shortest valid computer program (in a specific language) that produces $x$ as the output.
Theoretically, it is meaningful to speak of Kolmogorov complexity only in the context of Turing-complete languages, since for any two Turing-complete languages $L$ and $M$ we have $K_L(x) \le K_M(x) + c_{L,M}$. Evidently, given any language $L$, we can write an interpreter in $L$ for language $M$ and embed in it the shortest program for $x$ in language $M$, and so obtain a program in $L$ with a constant overhead $c_{L,M}$ independent of $x$.
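Kolmogorov complexity itself is uncomputable, but any compressor gives a computable upper bound on it, up to the constant overhead of the decompressor. A small standard-library illustration:

```python
import random
import zlib

def complexity_upper_bound(data: bytes) -> int:
    # the compressed length is a computable *upper bound* on Kolmogorov
    # complexity (plus the constant cost of shipping a zlib decompressor)
    return len(zlib.compress(data, 9))

structured = b"01" * 500               # short program: "repeat '01' 500 times"
random.seed(0)
incompressible = bytes(random.randrange(256) for _ in range(1000))

# the structured string compresses far below its raw 1000 bytes;
# the random string does not compress at all
print(complexity_upper_bound(structured) < complexity_upper_bound(incompressible))  # True
```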
The aforementioned universal prior favours low-complexity theories or algorithms. In this manner, a universal learner employing Solomonoff's Induction formalizes the following philosophical tools:
- Occam's Razor: The simplest explanation is the most plausible.
- Epicurus' Principle: If multiple theories explain the observations, retain them all.
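The update rule with the prior $P(T) = 2^{-K(T)}$ can be sketched in miniature by replacing the (uncomputable) space of all programs with a tiny hand-picked hypothesis class; the theories and their description lengths below are assumptions for illustration only.

```python
from fractions import Fraction

# toy stand-in for Solomonoff induction: each "theory" is a deterministic
# bit generator with an illustrative description length (in bits)
THEORIES = {
    "zeros": (lambda n: 0,           2),
    "ones":  (lambda n: 1,           2),
    "alt01": (lambda n: n % 2,       3),
    "alt10": (lambda n: (n + 1) % 2, 3),
}

def predict_next(observed):
    # posterior-weighted probability that the next bit is 1; assumes at
    # least one theory is consistent with the observations
    post = {}
    for name, (gen, length) in THEORIES.items():
        prior = Fraction(1, 2 ** length)                       # P(T) = 2^{-K(T)}
        consistent = all(gen(i) == b for i, b in enumerate(observed))
        post[name] = prior * (1 if consistent else 0)          # P(E|T) P(T)
    z = sum(post.values())
    n = len(observed)
    return sum(w * THEORIES[name][0](n) for name, w in post.items()) / z

print(predict_next([1, 0, 1, 0]))  # only "alt10" survives -> 1
```

With no observations, the prediction is a simplicity-weighted vote over all four theories (Epicurus); as data arrives, inconsistent theories are eliminated and shorter survivors dominate (Occam).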
Solomonoff Induction is said to be complete if the cumulative error made by its predictions is upper-bounded by the Kolmogorov complexity of the data-generating process.
An adversarial argument shows that completeness and computability are mutually exclusive.
Kolmogorov-Style NFL
Goldblum et al. (2024) discuss a Kolmogorov-style formulation of the No Free Lunch Theorem. Let $D$ be a dataset of $n$ examples sampled uniformly over both objects ($x$) and labels ($y \in \{1, \ldots, k\}$). Then, with probability at least $1 - \delta$, for every classifier that uses a conditional distribution $p(y \mid x)$, the empirical cross-entropy $\mathrm{CE}(p) = \sum_{i=1}^{n} -\log_2 p(y_i \mid x_i)$ is bounded below as:
$$\mathrm{CE}(p) \ge n \log_2 k - K(p) - \log_2 \frac{1}{\delta} - c,$$
where $c$ is a constant depending on the choice of language for $p$. This is an intuitive result: in particular, it implies that when the data is incompressible and the dataset is large enough, no model can represent a classifier with appreciably lower cross-entropy than that attained by random guessing.
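As a rough numeric illustration, suppose the bound takes the per-example form $\mathrm{CE}/n \ge \log_2 k - (K(p) + \log_2(1/\delta) + c)/n$; the sizes plugged in below are placeholders, not figures from the paper. A model whose description length is tiny relative to $n$ cannot beat the $\log_2 k$ bits per example of random guessing by much on uniform data.

```python
import math

def ce_lower_bound_per_example(n, k, K_p_bits, delta, c=0):
    # per-example cross-entropy lower bound (in bits), assuming the form
    # CE/n >= log2(k) - (K(p) + log2(1/delta) + c) / n
    return math.log2(k) - (K_p_bits + math.log2(1 / delta) + c) / n

# e.g. 50k uniform 10-class labels vs. a model describable in 1000 bits:
# the bound sits within ~0.02 bits of the random-guessing rate log2(10)
print(ce_lower_bound_per_example(n=50_000, k=10, K_p_bits=1_000, delta=0.01))
```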
PAC Learnability and NFL
Valiant (1984) introduced Probably-Approximately-Correct (PAC) learning as a formal model for machine learning tasks, capturing learning as a search through a set of hypotheses that, with high probability, finds a hypothesis within a bounded error.
A hypothesis class $\mathcal{H}$ is agnostic PAC-learnable with respect to a set $Z$ and a (measurable) loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_{+}$ if there exists a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm which: for every $\epsilon, \delta \in (0,1)$, and probability distribution $\mathcal{D}$ over $Z$, when running the algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns $h \in \mathcal{H}$ such that, with probability at least $1 - \delta$,
$$L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon,$$
where $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$.
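For a finite hypothesis class, uniform convergence gives the sample-complexity upper bound $m_{\mathcal{H}}(\epsilon, \delta) \le \lceil \log(2|\mathcal{H}|/\delta) / (2\epsilon^2) \rceil$ (Shalev-Shwartz and Ben-David, 2022); a minimal calculator:

```python
import math

def sample_complexity(H_size, eps, delta):
    # agnostic PAC sample-complexity upper bound for a finite hypothesis
    # class: m >= ceil( log(2|H|/delta) / (2 eps^2) )
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

# halving the accuracy parameter eps quadruples the examples needed,
# while |H| and delta enter only logarithmically
print(sample_complexity(1000, 0.05, 0.01))
```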
Suppose $A$ is a learning algorithm for binary classification with respect to the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be a training set size such that $m \le |\mathcal{X}|/2$. Then there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ and a labelling function $f : \mathcal{X} \to \{0, 1\}$ such that $L_{\mathcal{D}}(f) = 0$ and
$$\mathbb{P}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}(A(S)) \ge \tfrac{1}{8} \right] \ge \tfrac{1}{7}.$$
It can be shown that for a set $C \subseteq \mathcal{X}$ such that $|C| = 2m$, any learner that observes only $m$ examples can say nothing about the remaining items in $C$. Considering all $2^{2m}$ functions from $C$ to $\{0, 1\}$, "complement-pairing" (pairing each $f$ with the function that flips its labels on the unseen points) shows that:
$$\max_{f} \; \mathbb{E}_{S \sim \mathcal{D}_f^m}\!\left[ L_{\mathcal{D}_f}(A(S)) \right] \ge \tfrac{1}{4},$$
where $\mathcal{D}_f$ is the uniform distribution over $\{(x, f(x)) : x \in C\}$.
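The complement-pairing bound can be verified exhaustively for a small $C$: with $|C| = 2m$ and a learner that sees only $m$ points, the error averaged over all $2^{2m}$ labelling functions is exactly $1/4$, so some labelling must force at least that much. The memorizing learner below is a hypothetical choice; the average comes out the same for any learner.

```python
from itertools import product

m = 2
C = list(range(2 * m))                 # |C| = 2m points
S = C[:m]                              # the learner observes only these m points

def memorizing_learner(train):
    # hypothetical learner: memorize training labels, predict 0 elsewhere
    table = dict(train)
    return lambda x: table.get(x, 0)

# average the 0-1 error on C over all 2^(2m) labelling functions f: C -> {0,1}
total_err, n_funcs = 0.0, 0
for f in product([0, 1], repeat=len(C)):
    h = memorizing_learner([(x, f[x]) for x in S])
    total_err += sum(h(x) != f[x] for x in C) / len(C)
    n_funcs += 1

avg = total_err / n_funcs
print(avg)  # 0.25: half the points are unseen, and each is wrong half the time
```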
For a hypothesis class $\mathcal{H}$ of binary labelling functions over $\mathcal{X}$, the VC-Dimension is defined as the cardinality of the largest set $C \subseteq \mathcal{X}$ such that the restriction $\mathcal{H}_C = \{ h|_C : h \in \mathcal{H} \}$ contains all functions from $C$ to $\{0, 1\}$ (such a $C$ is said to be shattered by $\mathcal{H}$).
The Fundamental Theorem of Statistical Learning states that VC-Dimension characterizes PAC-learnability: a hypothesis class is PAC-learnable if and only if it has finite VC-Dimension.
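For small finite classes, VC-Dimension can be computed by brute force directly from the shattering definition; thresholds and intervals below are the standard textbook examples, with VC-Dimensions 1 and 2.

```python
from itertools import combinations

def shatters(hypotheses, points):
    # True iff the class realizes every labelling of `points`
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    # brute-force VC-Dimension over a finite domain: the largest subset
    # size at which some subset is shattered
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, c) for c in combinations(domain, k)):
            d = k
    return d

domain = range(10)
thresholds = [lambda x, a=a: int(x >= a) for a in range(11)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(10) for b in range(a, 10)]

print(vc_dimension(thresholds, domain))  # 1: cannot realize (1, 0) on x1 < x2
print(vc_dimension(intervals, domain))   # 2: cannot realize (1, 0, 1)
```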
Ultimately:
- PAC Learnability is not inconsistent with No Free Lunch Theorems.
- NFLs consider hypothesis classes of infinite VC-Dimension.
- PAC-Learning succeeds by incorporating an Inductive Bias.
Simplicity Bias
Part of the success of deep neural networks can be attributed to an inherent simplicity bias. Recent research empirically confirms:
- GPT-2 (pre-trained and randomly initialized) models produce low-complexity bitstrings with higher probability (Goldblum et al., 2024).
- Deep networks are inductively biased to find lower "effective-rank" embeddings (Huh et al., 2022).
- Although deep networks, which are often over-parametrized, can memorize noisy data, they tend to learn simple patterns first (fewer critical samples) (Arpit et al., 2017).
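One standard notion of effective rank (due to Roy and Vetterli) is the exponential of the Shannon entropy of the normalized singular values; unlike ordinary rank, it degrades smoothly as the spectrum concentrates. A minimal sketch on random matrices, assuming NumPy is available:

```python
import numpy as np

def effective_rank(M):
    # entropy-based effective rank: exp of the Shannon entropy of the
    # normalized singular-value distribution
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 100))  # rank <= 3
full_rank = rng.normal(size=(100, 100))

print(effective_rank(low_rank) < effective_rank(full_rank))  # True
```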
References
Schaffer, C. (1994) 'A conservation law for generalization performance', Machine Learning Proceedings 1994, pp. 259–265. doi:10.1016/b978-1-55860-335-6.50039-8.
Shalev-Shwartz, S. and Ben-David, S. (2022) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Valiant, L.G. (1984) 'A theory of the learnable', Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing — STOC '84, pp. 436–445. doi:10.1145/800057.808710.
Wolpert, D.H. and Macready, W.G. (1997) 'No free lunch theorems for optimization', IEEE Transactions on Evolutionary Computation, 1(1), pp. 67–82. doi:10.1109/4235.585893.