# Why is Learning Possible?

*by dokuDoku · Generalization 🔍*

## Statement of Purpose

The goal of what we are going to show today is:

<center> Given a dataset $D$ on which we evaluate a hypothesis $h$ (from the set of hypotheses $H$, with $h \in H$) approximating the ground-truth function $f$, can we say anything about how well $h$ approximates $f$ outside of $D$? </center>

In other words, does the training error $E_{in}$ of our machine learning algorithm tell us anything about our test-time error $E_{out}$?

## Toy Example

<img src="/media/images/20260321_195106_grok-image-68b06fe5-510f-4005-ac95-cf4776a64709.jpg" width="300" alt="toy example">

<center> $\textbf{Figure 1}$ Jar of red and blue marbles </center>

Consider the jar of red and blue marbles in **Figure 1**. When we draw a marble, there is a probability $\mu$ of picking a red one and a probability $1-\mu$ of picking a blue one. Assume that the value of $\mu$ is unknown to us and that we pick $N$ marbles with replacement, observing a frequency $\nu$ of red marbles. What does $\nu$ tell us about $\mu$?

To motivate that there is *something* here, suppose we know that $\mu = 0.9$: what is the probability that a sample of $N=10$ marbles has $\nu \leq 0.1$? The number of red marbles $k$ follows a binomial distribution with the general term

$$ \binom{n}{k} p^k (1-p)^{n-k} $$

Since $\nu \leq 0.1$ with $N = 10$ draws means at most one red marble, we sum the terms for $k = 0$ and $k = 1$, with $n = N = 10$ and $p = \mu = 0.9$:

$$ \mathbb{P}(\nu \leq 0.1) = \binom{10}{0} 0.9^0 \cdot (1-0.9)^{10} + \binom{10}{1} 0.9^1 \cdot (1-0.9)^9 $$

$$ \mathbb{P}(\nu \leq 0.1) = 1 \cdot 1 \cdot 0.1^{10} + 10 \cdot 0.9 \cdot 0.1^9 = 10^{-10} + 9 \cdot 10^{-9} \approx 9.1 \times 10^{-9} $$

As we can see, $\mathbb{P}(\nu \leq 0.1)$ is a very small number, so it is quite *unlikely* that $\nu \leq 0.1$ given $\mu = 0.9$. Great! There is probably some relationship between $\mu$ and $\nu$.
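The binomial calculation above can be checked numerically. Here is a minimal sketch using only the Python standard library (the function names are my own, purely for illustration):

```python
import math
import random

def prob_at_most_k_red(mu, n, k_max):
    """Exact binomial probability of drawing at most k_max red marbles out of n."""
    return sum(math.comb(n, k) * mu**k * (1 - mu)**(n - k) for k in range(k_max + 1))

# Exact probability that nu <= 0.1, i.e. at most 1 red marble in N = 10 draws.
exact = prob_at_most_k_red(mu=0.9, n=10, k_max=1)
print(f"exact:     {exact:.3e}")  # 1e-10 + 9e-9, about 9.1e-9

# Monte Carlo estimate of the same event: draw 10 marbles, count reds.
random.seed(0)
trials = 100_000
hits = sum(
    sum(random.random() < 0.9 for _ in range(10)) <= 1
    for _ in range(trials)
)
# The event is so rare (~9e-9) that the estimate is almost surely 0.0 here.
print(f"simulated: {hits / trials:.3e}")
```

The simulation makes the point viscerally: even a hundred thousand repeated samples almost never produce $\nu \leq 0.1$ when $\mu = 0.9$.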
### Hoeffding Inequality

The Hoeffding Inequality states that for any sample size $N$,

$$ \mathbb{P}[|\nu - \mu| > \epsilon] \leq 2e^{-2\epsilon^2 N} \quad \text{for any } \epsilon > 0 $$

As the sample size $N$ grows, it becomes exponentially unlikely that $\nu$ will deviate from $\mu$ by more than our tolerance $\epsilon$. **Note** that the only random quantity in the Hoeffding Inequality is $\nu$; $\mu$ is constant but simply unknown. We do not know how many red and blue marbles are in the jar, but their ratio is fixed. We will not prove the Hoeffding Inequality here, but there are many resources online.

Also notice that although the event $|\nu - \mu| > \epsilon$ involves $\mu$, the bound $2e^{-2\epsilon^2 N}$ does not depend on $\mu$; only the tolerance $\epsilon$ and the sample size $N$ affect it. If we choose $\epsilon$ to be very small, so that $\nu$ is a good approximation of $\mu$, we need a large $N$ to keep the bound meaningful. Although we cannot recover the exact value of $\mu$ this way, we can use the observed $\nu$ to infer a high-probability range for $\mu$.

For completeness, let's plug in the numbers from our previous example and check that the Hoeffding Inequality bounds our result of $10^{-10} + 9 \cdot 10^{-9}$:

- $\mu = 0.9$
- $\nu = 0.1$
- $\epsilon = |\nu - \mu| = |0.1-0.9| = 0.8$
- $N = 10$

Giving

$$ \mathbb{P}[|\nu - \mu| > \epsilon] \leq 2e^{-2\epsilon^2 N} \rightarrow \mathbb{P}[|\nu - \mu| > 0.8] \leq 2e^{-2 \cdot 0.8^2 \cdot 10} \approx 5.52 \times 10^{-6} $$

We can see that $10^{-10} + 9 \cdot 10^{-9} \leq 5.52 \times 10^{-6}$, so the Hoeffding Inequality holds!

## Relation to hypothesis $h$ estimating $f$

How does the jar problem relate to our learning problem, where we want to see how closely a hypothesis $h \in H$ can approximate the ground-truth function $f$? We want to see how performance on the dataset $D = \{x_1, \dots, x_N\}$ will translate to out-of-sample performance.
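Before mapping jars onto hypotheses, the inequality itself can be checked numerically: for a known $\mu$ we can compute the exact deviation probability from the binomial distribution and compare it against the Hoeffding bound. A minimal sketch (function names are my own):

```python
import math

def hoeffding_bound(eps, n):
    """Right-hand side of the Hoeffding Inequality: 2 * exp(-2 * eps^2 * N)."""
    return 2 * math.exp(-2 * eps**2 * n)

def exact_deviation_prob(mu, n, eps):
    """Exact P(|nu - mu| > eps), where nu = (# red)/n and reds have probability mu."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - mu) > eps:
            total += math.comb(n, k) * mu**k * (1 - mu)**(n - k)
    return total

mu, eps = 0.9, 0.1
for n in (10, 100, 1000):
    p, bound = exact_deviation_prob(mu, n, eps), hoeffding_bound(eps, n)
    # The bound holds for every N, and both quantities shrink as N grows.
    print(f"N={n:4d}  exact={p:.3e}  bound={bound:.3e}")
```

Running this shows the exact deviation probability sitting comfortably below the bound at every sample size, with both decaying rapidly in $N$.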
Note that the data points $x_1, \dots, x_N$ are drawn in an independent and identically distributed (iid) fashion from some underlying distribution. Take a single hypothesis $h \in H$ and compare it to $f$ on each point $x_i \in D$: if $h(x_i) = f(x_i)$ we label the point a blue marble, and if $h(x_i) \neq f(x_i)$ a red marble. Thus we get that

- $\nu$ is the observed frequency of $h(x) \neq f(x)$ on the dataset $D$
- $\mu$ is the true probability of $h(x) \neq f(x)$ for a random draw from the data distribution

We can relabel $\nu$ as the in-sample error $E_{in}$ and $\mu$ as the out-of-sample error $E_{out}$:

- $E_{in}(h) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}[h(x_i) \neq f(x_i)]$
- $E_{out}(h) = \mathbb{P}_x[h(x) \neq f(x)]$, where $x$ is drawn from the underlying data distribution

The Hoeffding Inequality can then be restated as

$$ \mathbb{P}[|\nu - \mu| > \epsilon] \leq 2e^{-2\epsilon^2 N} \rightarrow \mathbb{P}[|E_{in} - E_{out}| > \epsilon] \leq 2e^{-2\epsilon^2 N} $$

This is exactly what we wanted: for a given hypothesis $h$, it tells us what $E_{in}$ can say about $E_{out}$. To see why this matters, consider a thought experiment in which $E_{in}$ gives you no information about $E_{out}$. Then every hypothesis $h$ would be an equally good choice for minimizing $E_{out}$, and the training error would give us no insight at all.

### Choosing a Hypothesis

We now want to extend this to choosing the hypothesis $g \in H$ with the lowest in-sample error out of the set of $M$ hypotheses $H = \{h_1, h_2, \dots, h_M\}$. This means we now have $M$ jars, one for each $h_i \in H$ with $i \in \{1, \dots, M\}$, as seen in **Figure 2**.
<img src="/media/images/20260321_204945_grok-image-a18e9020-d85d-4ccf-93b4-791c37159889.jpg" width="500" alt="M jars, one per hypothesis">

<center> $\textbf{Figure 2}$ $M$ jars, one for each hypothesis $h$ </center>

The Hoeffding Inequality applies to each jar individually, but the situation becomes more complicated when we consider all jars simultaneously, since

$$ \mathbb{P}[|E_{in}(h_i) - E_{out}(h_i)| > \epsilon] \leq 2e^{-2\epsilon^2 N} $$

assumes that the hypothesis $h_i$ is fixed *before* the dataset is generated. However, we now want to ask: given that we select our best hypothesis $g \in H$ using the data, can we still bound the probability

$$ \mathbb{P}[|E_{in}(g) - E_{out}(g)| > \epsilon] \ ? $$

The Hoeffding Inequality does not hold out of the box here, since the selection introduces bias and $g$ is not fixed. Note, however, that since $g \in H$, we can relate this probability to the other hypotheses: $g$ must be one of them, so

$$ |E_{in}(g) - E_{out}(g)| > \epsilon \rightarrow |E_{in}(h_1) - E_{out}(h_1)| > \epsilon \ \text{or} \ |E_{in}(h_2) - E_{out}(h_2)| > \epsilon \ \text{or} \ \dots \ \text{or} \ |E_{in}(h_M) - E_{out}(h_M)| > \epsilon $$

We also know from probability that *if $B_1 \rightarrow B_2$, then $\mathbb{P}(B_1) \leq \mathbb{P}(B_2)$*: if event $B_1$ implies $B_2$, then whenever $B_1$ occurs $B_2$ must also occur, so $B_1$ is contained in $B_2$ and its probability can be no larger. We also know that *if $B_1, B_2, \dots, B_M$ are events, then $\mathbb{P}(B_1 \ \text{or} \ B_2 \ \text{or} \ \dots \ \text{or} \ B_M) \leq \mathbb{P}(B_1) + \mathbb{P}(B_2) + \dots + \mathbb{P}(B_M)$*, i.e. the union bound.

Applying these two facts to the implication above, we get

$$ \mathbb{P}[|E_{in}(g) - E_{out}(g)| > \epsilon] \leq \sum_{i=1}^M \mathbb{P}[|E_{in}(h_i) - E_{out}(h_i)| > \epsilon] $$

Now apply the Hoeffding Inequality to each $h_i$:

$$ \mathbb{P}[|E_{in}(g) - E_{out}(g)| > \epsilon] \leq \sum_{i=1}^M \mathbb{P}[|E_{in}(h_i) - E_{out}(h_i)| > \epsilon] \leq 2Me^{-2\epsilon^2 N} $$

Thus, for the selected $g \in H$, we have proved a high-probability uniform convergence bound over the finite hypothesis set $H$. Note that the same dataset of size $N$ was used for all hypotheses, and since the Hoeffding bound does not depend on the hypothesis, we can apply it to every term in the summation.
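The selection bias that forces the factor of $M$ can be seen in a simulation of the $M$-jars picture. A minimal sketch (the coin-flip setup and all names are my own, not from the post): every hypothesis has true error $E_{out} = 0.5$, like guessing random coin-flip labels, and we always pick the one with the smallest $E_{in}$.

```python
import math
import random

# Each of the M "jars" yields E_in = (# mistakes)/N with mistakes ~ Bernoulli(0.5);
# the chosen hypothesis g is the one with the lowest observed E_in.
def min_e_in(m, n, rng):
    """Lowest in-sample error among m hypotheses evaluated on n points each."""
    return min(
        sum(rng.random() < 0.5 for _ in range(n)) / n
        for _ in range(m)
    )

rng = random.Random(0)
n, eps, trials = 20, 0.25, 200

for m in (1, 1000):
    # Empirical P(|E_in(g) - E_out(g)| > eps) over repeated experiments.
    p_dev = sum(abs(min_e_in(m, n, rng) - 0.5) > eps for _ in range(trials)) / trials
    single = 2 * math.exp(-2 * eps**2 * n)  # Hoeffding bound for ONE fixed h
    union = min(1.0, m * single)            # union bound over the M jars
    print(f"M={m:4d}  P(deviation)={p_dev:.2f}  single-h bound={single:.3f}  union bound={union:.3f}")

# With M = 1 the deviation is rare and the single-hypothesis bound holds.
# With M = 1000, cherry-picking the smallest E_in makes large deviations the
# norm, so only the (here vacuous) union bound 2M e^{-2 eps^2 N} survives.
```

This is exactly why $g$ being chosen *after* seeing the data matters: each individual jar obeys Hoeffding, but the minimum over many jars does not, and the price of selection is the factor of $M$ in the bound.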