No Free Lunch Theorems
Nikhil R
August 28, 2025
No Free Lunch (NFL) theorems are a class of results in optimization and machine learning which assert that, averaged over all possible problems, every learner performs equally well in expectation. This article provides an introduction to NFL theorems in the context of machine learning.
An Elementary Model
Consider the data generating process specified by the tuple $(\mu, \mathcal{A}, \sigma, p)$, where $\mu$ is a probability measure over some set of objects $\mathcal{O}$, each with an associated label in $\{0, 1\}$; $\mathcal{A}$ is the set of all possible object attributes; $\sigma : \mathcal{O} \to \mathcal{A}$ is an observation rule; and $p(y \mid a)$ is the conditional probability of label $y$ given attributes $a$. The classification task is to learn an approximation of the conditional probability distribution given some training data $D$ (a finite sequence of attribute-label pairs). A learner $L$ is an algorithm that consumes $D$ and outputs a predictor with no knowledge of $\mu$ or $p$. Generalization accuracy is the probability of predicting the correct label for unobserved attributes.
A Conservation Law
A useful notion of generalization performance would be to measure generalization accuracy relative to a random guesser. A random guesser would always achieve a generalization accuracy of $\tfrac{1}{2}$. Consequently, let Generalization Performance = Generalization Accuracy $- \tfrac{1}{2}$.
Let $\mathcal{D}$ be the set of all possible training data, and let $\mathrm{GP}(L, D, p)$ denote the generalization performance of a learner $L$ on objects $o$ such that $\sigma(o) \notin \sigma(D)$, given training data $D$. Since $\mathcal{A}$ is finite, every $p$ can be identified by an $m$-tuple $(p_1, \ldots, p_m)$, where $p_i = p(1 \mid a_i)$ and $m = |\mathcal{A}|$.
To evaluate the learner, we can consider an expression of the form:
$$\sum_{D \in \mathcal{D}} \int_{[0,1]^m} \mathrm{GP}(L, D, p) \, dp.$$
Now consider just the expression $\int_{[0,1]^m} \mathrm{GP}(L, D, p) \, dp$. For a fixed $D$, we can construct $p'$ by setting $p'_i = 1 - p_i$ if $a_i \notin \sigma(D)$, and $p'_i = p_i$ otherwise. Thus, for a given $D$ we have a measure-preserving permutation $p \mapsto p'$ of $[0,1]^m$. Observe that $\mathrm{GP}(L, D, p') = -\mathrm{GP}(L, D, p)$. Changing variables, we have $\int \mathrm{GP}(L, D, p) \, dp = -\int \mathrm{GP}(L, D, p) \, dp$. To see why the sign flips, let $g_i$ be the probability that $L(D)$ predicts 1 for an unobserved attribute $a_i$; note that $g_i$ depends only on $D$. Then, on $a_i$:
$$g_i p_i + (1 - g_i)(1 - p_i) - \tfrac{1}{2} = (2 g_i - 1)\left(p_i - \tfrac{1}{2}\right),$$
which negates under $p_i \mapsto 1 - p_i$.
Thus,
$$\int_{[0,1]^m} \mathrm{GP}(L, D, p) \, dp = 0.$$
We have effectively shown that,
$$\sum_{D \in \mathcal{D}} \int_{[0,1]^m} \mathrm{GP}(L, D, p) \, dp = 0.$$
In other words, "Generalization Performance is conserved over all learning situations"! Consequently,
- If there are problems where a learner significantly outperforms another, there must exist problems where it underperforms by the same amount in total.
- Every optimizer is as good or as bad as every other optimizer.
- We cannot, a priori, expect learners to do better than random chance.
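The conservation argument can be checked by brute force on a toy instance: enumerate every labelling of a six-element domain, train a fixed learner on half of it, and sum its off-training-set performance over all labellings. The majority-vote learner below is a hypothetical stand-in; any learner yields the same total.

```python
from itertools import product

DOMAIN = range(6)                      # six objects, attributes fully observable
TRAIN, TEST = [0, 1, 2], [3, 4, 5]     # learner sees TRAIN, is scored on TEST

def majority_learner(train_labels):
    # hypothetical learner: predict the training set's majority label everywhere
    guess = 1 if 2 * sum(train_labels) >= len(train_labels) else 0
    return lambda x: guess

# accumulate 2*(#correct) - |TEST|, i.e. generalization performance
# relative to random guessing, scaled by 2*|TEST| to stay in integers
total = 0
for f in product([0, 1], repeat=len(DOMAIN)):      # all 2^6 labelling functions
    h = majority_learner([f[i] for i in TRAIN])
    correct = sum(h(x) == f[x] for x in TEST)
    total += 2 * correct - len(TEST)

print(total)  # 0: off-training-set performance sums to zero
```

Flipping the labels of the test points pairs every labelling with one on which the learner does exactly as badly as it did well, which is why the total vanishes.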
Related Notions
Ray Solomonoff (1960) introduced the idea of inductive inference via a computational formalization of pure Bayesianism. Let $T$ be a theory, $T'$ be any alternative theory, and $E$ be observed data. Conditional probability dictates,
$$P(T \mid E) = \frac{P(E \mid T) \, P(T)}{\sum_{T'} P(E \mid T') \, P(T')}.$$
Then future data $E'$ can be predicted as:
$$P(E' \mid E) = \sum_{T} P(E' \mid T) \, P(T \mid E).$$
Clearly, this requires a prior probability over all theories. One such prior could be $P(T) = 2^{-K(T)}$, where $K(T)$ is the prefix-free Kolmogorov Complexity of $T$ (normalizable by Kraft's Inequality).
Generally, the Kolmogorov complexity $K(x)$ of an object $x$ is defined as the length of the shortest valid computer program (in a specific language) that produces $x$ as the output.
Theoretically, it is meaningful to speak of Kolmogorov complexity only in the context of Turing-complete languages, since for any two Turing-complete languages $L$ and $M$ we have $K_L(x) \le K_M(x) + c_{L,M}$. Evidently, given any language $L$, we can write an interpreter in $L$ for language $M$ and embed in it the shortest program for $x$ in language $M$, and so obtain a program in $L$ with a constant overhead $c_{L,M}$ independent of $x$.
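Kolmogorov complexity itself is uncomputable, but any compressor gives a computable upper bound on it, up to the constant overhead of the decompressor. A small standard-library illustration:

```python
import random
import zlib

def complexity_upper_bound(data: bytes) -> int:
    # the compressed length is a computable *upper bound* on Kolmogorov
    # complexity (plus the constant cost of shipping a zlib decompressor)
    return len(zlib.compress(data, 9))

structured = b"01" * 500               # short program: "repeat '01' 500 times"
random.seed(0)
incompressible = bytes(random.randrange(256) for _ in range(1000))

# the structured string compresses far below its raw 1000 bytes;
# the random string does not compress at all
print(complexity_upper_bound(structured) < complexity_upper_bound(incompressible))  # True
```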
The aforementioned universal prior favours low-complexity theories or algorithms. In this manner, a universal learner employing Solomonoff's Induction formalizes the following philosophical tools:
- Occam's Razor: The simplest explanation is the most plausible.
- Epicurus' Principle: If multiple theories explain the observations, retain them all.
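The update rule with the prior $P(T) = 2^{-K(T)}$ can be sketched in miniature by replacing the (uncomputable) space of all programs with a tiny hand-picked hypothesis class; the theories and their description lengths below are assumptions for illustration only.

```python
from fractions import Fraction

# toy stand-in for Solomonoff induction: each "theory" is a deterministic
# bit generator with an illustrative description length (in bits)
THEORIES = {
    "zeros": (lambda n: 0,           2),
    "ones":  (lambda n: 1,           2),
    "alt01": (lambda n: n % 2,       3),
    "alt10": (lambda n: (n + 1) % 2, 3),
}

def predict_next(observed):
    # posterior-weighted probability that the next bit is 1; assumes at
    # least one theory is consistent with the observations
    post = {}
    for name, (gen, length) in THEORIES.items():
        prior = Fraction(1, 2 ** length)                       # P(T) = 2^{-K(T)}
        consistent = all(gen(i) == b for i, b in enumerate(observed))
        post[name] = prior * (1 if consistent else 0)          # P(E|T) P(T)
    z = sum(post.values())
    n = len(observed)
    return sum(w * THEORIES[name][0](n) for name, w in post.items()) / z

print(predict_next([1, 0, 1, 0]))  # only "alt10" survives -> 1
```

With no observations, the prediction is a simplicity-weighted vote over all four theories (Epicurus); as data arrives, inconsistent theories are eliminated and shorter survivors dominate (Occam).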
Solomonoff Induction is said to be complete if the cumulative error made by its predictions is upper-bounded by the Kolmogorov complexity of the data-generating process.
An adversarial argument shows that completeness and computability are mutually exclusive.
Kolmogorov-Style NFL
Goldblum et al. (2024) discuss a Kolmogorov-style formulation of the No Free Lunch Theorem. Let $D$ be a dataset of $n$ examples sampled uniformly over both objects ($x$) and labels ($y \in \{1, \ldots, k\}$). Then, with probability at least $1 - \delta$, for every classifier that uses a conditional distribution $p(y \mid x)$, the empirical cross-entropy $\mathrm{CE}(p) = \sum_{i=1}^{n} -\log_2 p(y_i \mid x_i)$ is bounded below as:
$$\mathrm{CE}(p) \ge n \log_2 k - K(p) - \log_2 \frac{1}{\delta} - c,$$
where $c$ is a constant depending on the choice of language for $p$. This is an intuitive result: in particular, it implies that when the data is incompressible and the dataset is large enough, no model can represent a classifier with appreciably lower cross-entropy than that attained by random guessing.
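As a rough numeric illustration, suppose the bound takes the per-example form $\mathrm{CE}/n \ge \log_2 k - (K(p) + \log_2(1/\delta) + c)/n$; the sizes plugged in below are placeholders, not figures from the paper. A model whose description length is tiny relative to $n$ cannot beat the $\log_2 k$ bits per example of random guessing by much on uniform data.

```python
import math

def ce_lower_bound_per_example(n, k, K_p_bits, delta, c=0):
    # per-example cross-entropy lower bound (in bits), assuming the form
    # CE/n >= log2(k) - (K(p) + log2(1/delta) + c) / n
    return math.log2(k) - (K_p_bits + math.log2(1 / delta) + c) / n

# e.g. 50k uniform 10-class labels vs. a model describable in 1000 bits:
# the bound sits within ~0.02 bits of the random-guessing rate log2(10)
print(ce_lower_bound_per_example(n=50_000, k=10, K_p_bits=1_000, delta=0.01))
```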
PAC Learnability and NFL
Valiant (1984) introduced Probably-Approximately-Correct (PAC) learning as a formal model for machine learning tasks, capturing learning as a search through a set of hypotheses that, with high probability, finds a hypothesis within a bounded error.
A hypothesis class $\mathcal{H}$ is agnostic PAC-learnable with respect to a set $Z$ and a (measurable) loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_{+}$ if there exists a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm which: for every $\epsilon, \delta \in (0,1)$, and probability distribution $\mathcal{D}$ over $Z$, when running the algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns $h \in \mathcal{H}$ such that, with probability at least $1 - \delta$,
$$L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon,$$
where $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$.
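For a finite hypothesis class, uniform convergence gives the sample-complexity upper bound $m_{\mathcal{H}}(\epsilon, \delta) \le \lceil \log(2|\mathcal{H}|/\delta) / (2\epsilon^2) \rceil$ (Shalev-Shwartz and Ben-David, 2022); a minimal calculator:

```python
import math

def sample_complexity(H_size, eps, delta):
    # agnostic PAC sample-complexity upper bound for a finite hypothesis
    # class: m >= ceil( log(2|H|/delta) / (2 eps^2) )
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

# halving the accuracy parameter eps quadruples the examples needed,
# while |H| and delta enter only logarithmically
print(sample_complexity(1000, 0.05, 0.01))
```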
Suppose $A$ is a learning algorithm for binary classification with respect to the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be a training set size such that $m \le |\mathcal{X}|/2$. Then there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ and a labelling function $f : \mathcal{X} \to \{0, 1\}$ such that $L_{\mathcal{D}}(f) = 0$ and
$$\mathbb{P}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}(A(S)) \ge \tfrac{1}{8} \right] \ge \tfrac{1}{7}.$$
It can be shown that for a set $C \subseteq \mathcal{X}$ such that $|C| = 2m$, any learner that observes only $m$ examples can say nothing about the remaining items in $C$. Considering all $2^{2m}$ functions from $C$ to $\{0, 1\}$, "complement-pairing" (pairing each $f$ with the function that flips its labels on the unseen points) shows that:
$$\max_{f} \; \mathbb{E}_{S \sim \mathcal{D}_f^m}\!\left[ L_{\mathcal{D}_f}(A(S)) \right] \ge \tfrac{1}{4},$$
where $\mathcal{D}_f$ is the uniform distribution over $\{(x, f(x)) : x \in C\}$.
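The complement-pairing bound can be verified exhaustively for a small $C$: with $|C| = 2m$ and a learner that sees only $m$ points, the error averaged over all $2^{2m}$ labelling functions is exactly $1/4$, so some labelling must force at least that much. The memorizing learner below is a hypothetical choice; the average comes out the same for any learner.

```python
from itertools import product

m = 2
C = list(range(2 * m))                 # |C| = 2m points
S = C[:m]                              # the learner observes only these m points

def memorizing_learner(train):
    # hypothetical learner: memorize training labels, predict 0 elsewhere
    table = dict(train)
    return lambda x: table.get(x, 0)

# average the 0-1 error on C over all 2^(2m) labelling functions f: C -> {0,1}
total_err, n_funcs = 0.0, 0
for f in product([0, 1], repeat=len(C)):
    h = memorizing_learner([(x, f[x]) for x in S])
    total_err += sum(h(x) != f[x] for x in C) / len(C)
    n_funcs += 1

avg = total_err / n_funcs
print(avg)  # 0.25: half the points are unseen, and each is wrong half the time
```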
For a hypothesis class $\mathcal{H}$ of binary labelling functions over $\mathcal{X}$, the VC-Dimension is defined as the cardinality of the largest set $C \subseteq \mathcal{X}$ such that the restriction $\mathcal{H}_C = \{ h|_C : h \in \mathcal{H} \}$ contains all functions from $C$ to $\{0, 1\}$ (such a $C$ is said to be shattered by $\mathcal{H}$).
The Fundamental Theorem of Statistical Learning states that VC-Dimension characterizes PAC-learnability: a hypothesis class is PAC-learnable if and only if it has finite VC-Dimension.
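For small finite classes, VC-Dimension can be computed by brute force directly from the shattering definition; thresholds and intervals below are the standard textbook examples, with VC-Dimensions 1 and 2.

```python
from itertools import combinations

def shatters(hypotheses, points):
    # True iff the class realizes every labelling of `points`
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    # brute-force VC-Dimension over a finite domain: the largest subset
    # size at which some subset is shattered
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, c) for c in combinations(domain, k)):
            d = k
    return d

domain = range(10)
thresholds = [lambda x, a=a: int(x >= a) for a in range(11)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(10) for b in range(a, 10)]

print(vc_dimension(thresholds, domain))  # 1: cannot realize (1, 0) on x1 < x2
print(vc_dimension(intervals, domain))   # 2: cannot realize (1, 0, 1)
```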
Ultimately:
- PAC Learnability is not inconsistent with No Free Lunch Theorems.
- NFLs consider hypothesis classes of infinite VC-Dimension.
- PAC-Learning succeeds by incorporating an Inductive Bias.
Simplicity Bias
Part of the success of deep neural networks can be attributed to an inherent simplicity bias. Recent research empirically confirms:
- GPT-2 (pre-trained and randomly initialized) models produce low-complexity bitstrings with higher probability (Goldblum et al., 2024).
- Deep networks are inductively biased to find lower "effective-rank" embeddings (Huh et al., 2022).
- Although deep networks, which are often over-parametrized, can memorize noisy data, they tend to learn simple patterns first (fewer critical samples) (Arpit et al., 2017).
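One standard notion of effective rank (due to Roy and Vetterli) is the exponential of the Shannon entropy of the normalized singular values; unlike ordinary rank, it degrades smoothly as the spectrum concentrates. A minimal sketch on random matrices, assuming NumPy is available:

```python
import numpy as np

def effective_rank(M):
    # entropy-based effective rank: exp of the Shannon entropy of the
    # normalized singular-value distribution
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 100))  # rank <= 3
full_rank = rng.normal(size=(100, 100))

print(effective_rank(low_rank) < effective_rank(full_rank))  # True
```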
References
Schaffer, C. (1994) 'A conservation law for generalization performance', Machine Learning Proceedings 1994, pp. 259–265. doi:10.1016/b978-1-55860-335-6.50039-8.
Shalev-Shwartz, S. and Ben-David, S. (2022) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Valiant, L.G. (1984) 'A theory of the learnable', Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing — STOC '84, pp. 436–445. doi:10.1145/800057.808710.
Wolpert, D.H. and Macready, W.G. (1997) 'No free lunch theorems for optimization', IEEE Transactions on Evolutionary Computation, 1(1), pp. 67–82. doi:10.1109/4235.585893.