In this post we will talk about PAC Learning and VC Dimension, explaining what they are and why they are useful in Machine Learning.
Disclaimer: the following notes were written following the slides provided by the professor Restelli at Polytechnic of Milan and the book ‘Pattern Recognition and Machine Learning'.
PAC-Learning and VC-Dimension
PAC-Learning
In Probably Approximately Correct Learning, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the “probably” part), the selected function will have low generalization error (the “approximately correct” part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.
Let’s focus on classification. Given:
- set of instances $X$
- set of hypotheses $H$ (finite)
- set of possible target concepts $C$. Each concept $c$ corresponds to a boolean function $c:\boldsymbol{X} \rightarrow {0,1}$, which can be viewed as belonging to a certain class or not.
- training instances generated by a fixed, unknown probability distribution $\mathcal{P}$ over $X$.
The learner observes a sequence $\mathcal{D}$ of training samples $\langle x, c(x) \rangle$, for some target concept $c \in C$ and it must output a hypothesis $h$ estimating $c$. $h$ is evaluated by its performance on subsequent instances drawn according to $\mathcal{P}$
$L_{true} = Pr_{x\in\mathcal{P}}[c(x) <> h(x)]$
Informally, the true error of $h$ is just the error rate we expect when applying $h$ to future instances drawn according to the probability distribution $\mathcal{P}$.
Our aim is to characterize classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and a reasonable amount of computation.
We might try to characterize the number of training examples needed to learn a hypothesis $h$ for which $L_{train}(h) = 0$. Unfortunately, it turns out this is futile in the setting we are considering, for two reasons.
- First, unless we provide training examples corresponding to every possible instance in $X$ (an unrealistic assumption), there may be multiple hypotheses consistent with the provided training examples, and the learner cannot be certain to pick the one corresponding to the target concept.
- Second, given that the training examples are drawn randomly, there will always be some nonzero probability that the training examples encountered by the learner will be misleading. (For example, although we might frequently see skiers of different heights, on any given day there is some small chance that all observed training examples will happen to be 2 meters tall).
To accommodate these two difficulties, we weaken our demands on the learner in two ways.
- First, we will not require that the learner output a zero error hypothesis: we will require only that its error be bounded by some constant $\epsilon$, that can be made arbitrarily small.
- Second, we will not require that the learner succeed for every sequence of randomly drawn training examples: we will require only that its probability of failure be bounded by some constant $\delta$, that can be made arbitrarily small.
In short, we require only that the learner probably learn a hypothesis that is approximately correct, hence the term probably approximately correct learning, or PAC learning for short.
Version spaces
The version space $VS_{H,\mathcal{D}}$ is the subset of hypothesis in $H$ consistent with training data $\mathcal{D} (L_{train}=0)$. How likely is the learner to pick a bad hypothesis?
Theorem
*If the hypothesis space $H$ is finite and $\mathcal{D}$ is a sequence of $N \ge 1$ independent random examples of some target concept $c$, then for any $0 \le \epsilon \le 1$, the probability that $VS_{H,\mathcal{D}}$ contains a hypothesis error greater than $\epsilon$ is less than $|H|e^{-\epsilon N}$:*
$$Pr(\exists h \in H : L_{train} = 0 \land L_{true} \ge \epsilon) \le |H|e^{-\epsilon N}$$
We want this probability to be at most $\delta$
$$ |H|e^{-\epsilon N} \le \delta $$, where $\delta$ represents the confidence.
We can compute $N$ based on the other variables
$$ N \ge \frac{1}{\epsilon}(ln|H| + ln(\frac{1}{\delta})) $$ or equivalently $$ \epsilon \ge \frac{1}{N}(ln|H| + ln(\frac{1}{\delta})) $$
Let’s make a practical example to better understand the basic idea (credit to link): we want to learn the concept “medium-built person” from examples. We are given the height and weight of $N$ individuals, the training set. We are told for each $[height,weight]$ pair if it is or not of medium built. We would like to learn this concept, i.e. produce an algorithm that in the future answers correctly if a pair $[height,weight]$ represents/not a medium-built person. We are interested in knowing which value of $N$ to use if we want to learn this concept well.
Our concern is to characterize what we mean by well or good when evaluating learned concepts. By imposing the above bound, we are able to say if a learned concept is good. Different degrees of “goodness” will correspond to different values of epsilon and delta.
Note however that if we consider $M$ boolean features, there are $|C| = 2^M$ distinct concepts and hence $|H| = {2^2}^M$. This means that even if we have a logarithmic dependency on $|H|$, it is still exponential w.r.t. $M$.
Consider a class $C$ of possible target concepts defined over a set of instances $X$ and a learner $L$ using hypothesis space $H$.
Definition
$C$ is PAC-learnable if there exists an algorithm $L$ such that for every $c \in C$, for any distribution $\mathcal{P}$, for any $\epsilon$ such that $0 \le \epsilon < \frac{1}{2}$ and $\delta$ such that $0 \le \delta < \frac{1}{2}$, with probability at least $1-\delta$, outputs an hypothesis $h \in H$ such that $L_{true}(h) \le \epsilon$ using a number of samples that is polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.
Definition
$C$ is efficiently PAC-learnable by learner $L$ using $H$ if and only if for every $c \in C$, for any distribution $\mathcal{P}$, for any $\epsilon$ such that $0 \le \epsilon < \frac{1}{2}$ and $\delta$ such that $0 \le \delta < \frac{1}{2}$, with probability at least $1-\delta$, $L$ outputs an hypothesis $h \in H$ such that $L_{true}(h) \le \epsilon$ in time that is polynomial in $\frac{1}{\epsilon}, \frac{1}{\delta}, M$ and $size(c)$.
In the above definition, we want $\epsilon$ to be less than $\frac{1}{2}$ . Well, this is pretty natural as we won’t be interested in the algorithms which have more than fifty percent chances of giving an inaccurate answer. Also, observe the strict inequality i.e. $\epsilon < \frac{1}{2}$ . This is because $\epsilon = \frac{1}{2}$ would be useless as it would be no better than generating the output with the toss of a coin.
Agnostic learning
Now we know how many training examples suffice to ensure (with probability $(1-\delta)$) that every hypothesis in $H$ having zero training error will have a true error of at most $\epsilon$. Unfortunately, if $H$ does not contain the target concept $c$, then a zero-error hypothesis cannot always be found. In this case, the most we might ask of our learner is to output the hypothesis from $H$ that has the minimum error over the training examples. A learner that makes no assumption that the target concept is representable by $H$ and that simply finds the hypothesis with minimum training error, is often called an agnostic learner, because it makes no prior commitment about whether or not $C \subseteq H$.
We have to find a new bound that takes in consideration this new scenario:
$L_{true}(h) \le L_{train}(h) + \epsilon$
We will use the Hoeffding bound:
$$ Pr(E[\bar{X}] - \bar{X} > \epsilon) \le e^{-2N\epsilon^2} $$
$$ Pr(\exists h \in H : L_{true}(h) - L_{train}(h) > \epsilon) \le |H|e^{-2N\epsilon^2} $$
Leading to
$$ N \ge \frac{1}{2\epsilon^2}(ln|H| + ln(\frac{1}{\delta})) $$
The bound can be used as a model selection method like cross-validation, with the benefit of using all the data but the drawback of having a pessimistic bound and that can be applied only if $|H|$ is finite.
VC-Dimension
In a continuous hypothesis space, in which $|H| = \infty$, it is not possible to use the conclusions derived from PAC-learning. Here we consider a second measure of the complexity of $H$, called the Vapnik-Chervonenkis dimension of $H$, or $VC(H)$ for short.
Definition
A dichotomy of a set $S$ is a partition of $S$ into two disjoint subsets.
Definition
A set of instances $S$ is shattered by hypothesis space $H$ if and only if for every dichotomy of $S$ there exists some hypothesis in $H$ consistent with this dichotomy.
Note that if a set of instances is not shattered by a hypothesis space, then there must be some concept (dichotomy) that can be defined over the instances, but that cannot be represented by the hypothesis space. The ability of H to shatter a set of instances is thus a measure of its capacity to represent target concepts defined over these instances.
What if $H$ cannot shatter $X$, but can shatter some large subset $S$ of $X$? Intuitively, it seems reasonable to say that the larger the subset of $X$ that can be shattered, the more expressive H. The VC dimension of $H$ is precisely this measure:
Definition
The VC dimension, $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H) \equiv \infty$.
Based on VC-dimension the number of randomly drawn examples guaranteeing an error of at most $\epsilon$ with probability at least $(1 − \delta)$ is:
$N \ge \frac{1}{\epsilon}(4 log_2(\frac{2}{\delta}) + 8VC(H)log_2(\frac{13}{\epsilon}))$
Note that just as in the bound from $N \ge \frac{1}{\epsilon}(ln|H| + ln(\frac{1}{\delta}))$, the number of required training examples $N$ grows logarithmically in $\frac{1}{\delta}$. It now grows log times linear in $\frac{1}{\epsilon}$, rather than linearly. Significantly, the $ln|H|$ term in the earlier bound has now been replaced by the alternative measure of hypothesis space complexity, $VC(H)$ (recall $VC(H) \le log_2|H|$).
Theorem
The VC dimension of a hypothesis space $|H| < \infty$ is bounded from above: $VC(H) \le log_2|H|$.
Theorem
Concept class $C$ with $VC(C) = \infty$ is not PAC-learnable.