Proper Learnability and the Role of Unlabeled Data
Proper learning refers to the setting in which learners must emit predictors in the underlying hypothesis class $H$, and often leads to learners with simple algorithmic forms (e.g. empirical risk minimization (ERM), structural risk minimization (SRM)). The limitation of proper learning, however, is that there exist problems which can only be learned improperly, e.g. in multiclass classification. Thus, we ask: Under what assumptions on the hypothesis class or the information provided to the learner is a problem properly learnable? We first demonstrate that when the unlabeled data distribution is given, there always exists an optimal proper learner governed by distributional regularization, a randomized generalization of regularization. We refer to this setting as the distribution-fixed PAC model, and continue to evaluate the learner on its worst-case performance over all distributions. Our result holds for all metric loss functions and any finite learning problem (with no dependence on its size). Further, we demonstrate that sample complexities in the distribution-fixed PAC model can shrink by only a logarithmic factor from the classic PAC model, strongly refuting the role of unlabeled data in PAC learning (from a worst-case perspective). We complement this with impossibility results which obstruct any characterization of proper learnability in the realizable PAC model. First, we observe that there are problems whose proper learnability is logically undecidable, i.e., independent of the ZFC axioms. We then show that proper learnability is not a monotone property of the underlying hypothesis class, and that it is not a local property (in a precise sense). Our impossibility results all hold even for the fundamental setting of multiclass classification, and go through a reduction of EMX learning (Ben-David et al., 2019) to proper classification which may be of independent interest.
💡 Research Summary
This paper provides a deep investigation into the conditions under which supervised learning problems are “properly learnable,” meaning that the learner is constrained to always output a predictor from the pre-specified hypothesis class H. Proper learning is algorithmically desirable, often leading to simple methods like Empirical Risk Minimization (ERM), but its limitation is that some problems (e.g., in multiclass classification) are only learnable by “improper” learners that can go outside H.
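As a point of reference, ERM itself is easy to state as a proper learner. The following is a minimal, self-contained sketch (toy Python of my own, with hypotheses represented as plain callables; none of the names come from the paper):

```python
def erm_learner(hypothesis_class, sample, loss):
    """Return the hypothesis in H with minimal empirical loss on the sample.

    Toy sketch: `hypothesis_class` is a list of callables h(x) -> label,
    `sample` is a list of (x, y) pairs, and `loss(y_pred, y_true)` is a
    pointwise loss. The output always comes from `hypothesis_class`, so
    this learner is proper by construction.
    """
    def empirical_risk(h):
        return sum(loss(h(x), y) for x, y in sample) / len(sample)
    return min(hypothesis_class, key=empirical_risk)

# Toy usage: threshold classifiers on the line, 0/1 loss.
H = [lambda x, t=t: int(x >= t) for t in (0.25, 0.5, 0.75)]
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
zero_one = lambda yhat, y: int(yhat != y)
h_hat = erm_learner(H, S, zero_one)  # the t = 0.5 threshold fits S perfectly
```

An improper learner, by contrast, would be free to return a predictor outside the list `H` entirely, which is exactly the extra power the paper shows is sometimes necessary.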
The work is structured around two learning scenarios: the first yields a positive result, while the second yields a series of profound negative results.
In the first scenario, the authors consider a Distribution-Fixed PAC model, where the learner is granted exact knowledge of the marginal distribution D over unlabeled data in addition to the labeled training sample. They show that for any finite learning problem (with no dependence on its size) and any bounded metric loss function, there always exists an optimal proper learner in this setting. This optimal learner is governed by a framework called distributional regularization, a randomized generalization of classical regularization that assigns scores to probability distributions over H (i.e., to randomized hypotheses). A key and perhaps surprising finding is that this extra knowledge of D does not substantially improve the worst-case sample complexity: it can shrink by at most a logarithmic factor compared to the standard PAC model. This strongly refutes the idea that unlabeled data, from a pure worst-case perspective, provides a fundamental boost to PAC learnability; instead, it merely simplifies the form of the optimal algorithm to a proper one.
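The structural idea of scoring probability distributions over H (rather than individual hypotheses) can be illustrated with a deliberately simplified toy. The entropy penalty and the grid search below are stand-ins of my own choosing; the paper's actual distributional regularizer depends on the known marginal D and is not reproduced here. The sketch only shows the shape of the approach: the learner selects a randomized hypothesis, i.e., a distribution over H, by minimizing a regularized score.

```python
import math

def distributional_score(weights, sample, H, loss, lam=0.1):
    """Score a distribution over H (a randomized hypothesis): the expected
    empirical risk of the mixture minus an entropy bonus.
    NOTE: the entropy term is an illustrative placeholder, NOT the paper's
    actual distributional regularizer (which uses the marginal D)."""
    risk = sum(
        w * sum(loss(h(x), y) for x, y in sample) / len(sample)
        for w, h in zip(weights, H)
    )
    entropy = -sum(w * math.log(w) for w in weights if w > 0)
    return risk - lam * entropy  # lower is better

def best_mixture(sample, H, loss, grid=101):
    """Grid-search the simplex over two hypotheses for the best (p, 1 - p)."""
    best_p = min(
        (i / (grid - 1) for i in range(grid)),
        key=lambda p: distributional_score((p, 1 - p), sample, H, loss),
    )
    return (best_p, 1 - best_p)

# Toy usage: two threshold classifiers; the first fits the sample perfectly,
# so the selected distribution concentrates most of its mass on it.
H = [lambda x: int(x >= 0.5), lambda x: int(x >= 0.9)]
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
zero_one = lambda yhat, y: int(yhat != y)
weights = best_mixture(S, H, zero_one)
```

The output of such a learner is a distribution over H, so any hypothesis it samples is still a member of H — which is the sense in which distributional regularization remains proper.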
The second part of the paper returns to the classic (realizable) PAC model and asks whether proper learnability can be cleanly characterized, for instance by a combinatorial dimension analogous to the VC or DS dimensions. Here, the authors establish a series of powerful impossibility results that obstruct any such simple characterization:
- Undecidability: They show that for some hypothesis classes H, the question “Is H properly learnable?” is logically undecidable—independent of the standard ZFC axioms of set theory. This is proven via a novel reduction from the undecidable EMX learning problem (Ben-David et al., 2019) to proper multiclass classification.
- Non-Monotonicity: Proper learnability is not a monotone property. A hypothesis class H being properly learnable does not imply that its subsets or supersets are also properly learnable.
- Non-Locality: Proper learnability is not a local property. There exist classes H and H’ that agree on all finite restrictions (i.e., H|S = H’|S for every finite set S of domain points), yet one is properly learnable and the other is not. This means the property cannot be determined by looking at finite fragments of the class.
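The finite restrictions appearing in the non-locality result can be made concrete. Below is a small helper (my own illustrative sketch, with hypotheses as plain Python callables) computing H|S, the set of label patterns that hypotheses in H realize on a finite set S of domain points:

```python
def restriction(H, S):
    """H|S: the set of label patterns ("behaviors") that hypotheses in H
    realize on the finite tuple S of domain points."""
    return {tuple(h(x) for x in S) for h in H}

# Toy usage: three threshold classifiers restricted to two domain points.
H_thresh = [lambda x, t=t: int(x >= t) for t in (0.2, 0.5, 0.8)]
S = (0.3, 0.6)
patterns = restriction(H_thresh, S)  # {(1, 1), (0, 1), (0, 0)}
```

Non-locality says that two classes can produce identical pattern sets on every such finite S yet differ in proper learnability, so no procedure that inspects only these finite restrictions can decide the property.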
These impossibility results demonstrate that proper learnability is inherently more complex and structurally different from improper learnability. They rule out characterization by any monotone or locally-checkable dimension. The paper further discusses how these barriers, particularly non-monotonicity, also obstruct natural relaxations of the distribution-fixed model, such as models where the learner must learn the marginal distribution D in a class-conditional way.
In summary, the paper reveals a dichotomy: while complete knowledge of the data marginal guarantees the existence of a simple optimal proper learner (though with limited worst-case sample efficiency gains), understanding proper learnability in the standard PAC model is fraught with fundamental logical and combinatorial barriers, necessitating a new paradigm beyond traditional learning-theoretic dimensions.