# VC Dimension via Pascal's Triangle

by dokuDoku

## What is the VC Dimension? $d_{VC}$

Let's assume that our problem is attempting to predict a dataset with outputs labeled $\{-1,1\}$. The function we use to predict is a hypothesis $h$ from the set of hypothesis functions $H$ ($h \in H$), chosen using some training dataset $D$. The *Vapnik-Chervonenkis (VC)* dimension of a hypothesis set $H$, written $d_{VC}(H)$ or simply $d_{VC}$ when the context is clear, is the largest value of $N$ for which every possible assignment of labels $\{-1,1\}$ to $N$ points can be perfectly classified by some hypothesis $h \in H$. In other words, given $N$ points, each of the $2^N$ possible $\{-1,1\}$ colorings of these points can be realized by some $h \in H$.

<center> <video controls preload="metadata" poster="/media/images/20260329_054417_poster.jpg" style="max-width: 80%;"> <source src="/media/videos/20260329_054417_vc_dim_visualization.mp4"> </video> </center> <center> $\textbf{Video 1:}$ VC Dimension of a Simple Linear Model, showing $d_{VC} = 3$ </center>

As we can see in **Video 1**, the linear model has $d_{VC} = 3$, since it can realize all $2^3 = 8$ possible colorings of three points but *cannot* realize all $2^4 = 16$ colorings of four points on a 2D plane. Thus it can produce every possible labeling of 3 points, but not every possible labeling of 4 points in $\mathbb{R}^2$. **NOTE**: for the linear model mentioned here, we will always assume we are working in $\mathbb{R}^2$.

In general, the VC Dimension is a great way to compare the capacity of hypothesis classes $H$ to classify $N$ points: the *higher* the VC Dimension, the *more* points the class can fully classify (higher capacity); the *lower* the VC Dimension, the *fewer* points it can fully classify.

### How does this Relate to [Feasibility of Machine Learning](https://www.noteblogdoku.com/blog/why-is-learning-possible)?
In general, models with higher VC Dimensions are more expressive and run a higher risk of overfitting the data; models with lower VC Dimensions are less expressive and run a lower risk of overfitting.

<img src="/media/images/20260327_224924_vc_dim_model.png" alt="vc dim model" style="max-width: 80%;"> <center> $\textbf{Figure 1:}$ VC Dimension of Models Demonstrating Capacity to Overfit on a Noisy Dataset (of 20 points) for Linear and 20-Degree Polynomial Hypothesis Sets </center>

As we can see in **Figure 1**, a hypothesis does not fit the underlying function; it fits the observed data points. From the previous example we might think that the higher the VC Dimension the better, since we can represent more complex underlying functions. Although this is true, it also means that we can fit noise exactly. **Figure 1** shows that on our dataset of 20 points, the linear model, whose VC Dimension is only 3, roughly classifies most of the dataset correctly and, more importantly, will generalize well to new data, meaning the in-sample error $E_{in}$ is likely to be close to the out-of-sample error $E_{out}$. Contrastingly, the 20-degree polynomial model, which can fit every point perfectly, is unlikely to generalize well outside of the training data. We can see a large region at the top classified as blue simply to fit one datapoint. As such, it is easy to see that $E_{out} >> E_{in}$ is likely for the 20-degree polynomial.
This begs the question: *Are the generalization and learning capabilities of a hypothesis set tied to its VC Dimension $d_{VC}(H)$?*

Recall from the previous blog post [Why is Learning Possible?](https://www.noteblogdoku.com/blog/why-is-learning-possible) that for a given hypothesis set $H$, the hypothesis $g$ that classifies the most training samples correctly (lowest in-sample error $E_{in}$) has the following bound between $E_{in}$ and $E_{out}$:

$$\mathbb{P}[|E(g)_{in} - E(g)_{out}| > \epsilon] \leq 2Me^{-2\epsilon^2N}$$

- $E(g)_{in}$: in-sample error for the best hypothesis $g \in H$
- $E(g)_{out}$: out-of-sample error for the best hypothesis $g \in H$
- $M$: total number of hypotheses in $H$, i.e. $M = |H|$
- $N$: total number of training samples labeled $\{-1,1\}$
- $\epsilon$: tolerance for the inequality

**Our goal is to replace the quantity $M$ with one related to the VC Dimension of hypothesis space $H$, $d_{VC}(H)$.** Intuitively, $d_{VC}(H)$ tells us the capacity of a hypothesis set to fit all possible $\{-1,1\}$ colorings of $N$ points. As we can see from **Figure 1** and **Video 1**, the higher the VC Dimension of $H$, the higher the number of points $N$ for which it can realize all possible $\{-1,1\}$ colorings. We will see that we can actually get a tighter, polynomial bound for $|E(g)_{in} - E(g)_{out}|$ using the VC Dimension, instead of the Union Bound we used previously (which treats all $h \in H$ as disjoint, non-overlapping events). Additionally, $|E(g)_{in} - E(g)_{out}|$ tells us how well a model $g$ from hypothesis class $H$ generalizes, mathematically solidifying the intuition presented in **Figure 1** and **Video 1**.

### New Definitions

**Dichotomy**: A dichotomy is a division of something into two distinct and mutually exclusive parts. In our examples, the fact that every point must be labeled $-1$ or $1$ (colored blue or red) and not both makes each labeling a dichotomy. Basically, these are classification problems with yes/no answers.
**Growth Function for hypothesis set $H$, $m_H(N)$**: The growth function for hypothesis set $H$ is defined by

$$ m_H(N) = \max_{x_1,\dots,x_N \in X} |H(x_1,\dots,x_N)|$$

which is the maximum number of distinct dichotomies that can be generated by $H$ on any $N$ points. To compute $m_H(N)$, consider all possible choices of $N$ points $x_1,\dots,x_N$ and pick the one that gives us the most dichotomies. For the examples above in **Video 1** and **Figure 1**, the linear model can generate up to 8 dichotomies for 3 points, i.e.

$$ m_{linear}(3) = 8$$

Note also that in **Video 1**, it was shown that for 4 points the linear model can only generate 14 out of 16 possible dichotomies, i.e.

$$ m_{linear}(4) = 14 $$

Finally, note that $m_H(N)$ has an upper bound of $2^N$: for any given $N$ points, each point can be colored blue or red, so there are $2^N$ colorings in total. In other words,

$$ m_H(N) \leq 2^N$$

**Shatter**: $H$ *shatters* $N$ points when it is capable of generating all possible dichotomies on them, i.e. when

$$m_H(N) = 2^N$$

Since $m_{linear}(3) = 8$, the linear model class shatters 3 points.

**$m_H(N)$ and $d_{VC}(H)$**: $m_H(N)$ is the maximum number of dichotomies $H$ can generate, i.e. how many different blue/red colorings of $N$ points it can realize. $d_{VC}(H)$ is the largest value of $N$ for which hypothesis set $H$ can generate every possible blue/red coloring (every dichotomy). Thus $d_{VC}(H)$ is the maximum $N$ for which $m_H(N) = 2^N$. So for the linear model, since $m_{linear}(3) = 8$ and $m_{linear}(4) < 16$, $d_{VC}(linear) = 3$.

**Break Point**: If no dataset of size $k$ can be shattered by $H$, then $k$ is said to be a break point for $H$. In other words, the smallest break point is the smallest $N$ such that $m_H(N) < 2^N$, i.e. $k = d_{VC}(H) + 1$. For the linear model, since $m_{linear}(3) = 8$ and $m_{linear}(4) < 16$, the break point is $k = 4$. Equivalently, since $d_{VC}(linear) = 3$, the break point is $k = 4$.
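The numbers $m_{linear}(3) = 8$ and $m_{linear}(4) = 14$ can be reproduced by brute force. The following is a small sketch (the helper names `separable` and `num_dichotomies` are mine, not from the post) that enumerates all $2^N$ colorings and uses the perceptron as the linear-separability test; the perceptron terminates iff a labeling is separable, so capping the number of passes turns it into a practical check at this tiny scale:

```python
from itertools import product

def separable(points, labels, max_passes=500):
    """Perceptron test: converges quickly iff (points, labels) is linearly
    separable; the pass cap makes non-separable labelings return False."""
    w0 = w1 = b = 0.0
    for _ in range(max_passes):
        updated = False
        for (x, y), s in zip(points, labels):
            if s * (w0 * x + w1 * y + b) <= 0:    # misclassified or on boundary
                w0 += s * x; w1 += s * y; b += s
                updated = True
        if not updated:
            return True
    return False

def num_dichotomies(points):
    """Count the blue/red colorings a 2D linear model can realize."""
    return sum(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

tri = [(0, 0), (2, 0), (0, 2)]           # 3 points in general position
sq  = [(0, 0), (1, 0), (0, 1), (1, 1)]   # 4 points in convex position
print(num_dichotomies(tri))  # 8  = 2^3: three points are shattered
print(num_dichotomies(sq))   # 14 < 2^4: the two XOR colorings fail
```

The two failing colorings of the square are exactly the XOR patterns, matching **Video 1**: the linear model shatters 3 points but not 4, so $d_{VC} = 3$ and the break point is $k = 4$.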
## Maximum Number of Dichotomies on $N$ points with Break Point $k$

To prove that there is a polynomial bound for $|E(g)_{in} - E(g)_{out}|$, we are going to count the maximum number of dichotomies on $N$ points given some break point $k$, without assuming any particular form of $H$. This bound will then apply to any $H$.

$\textbf{B(N,k)}$ *By definition, $B(N,k)$ is the maximum number of dichotomies on $N$ points such that no subset of size $k$ of the $N$ points can be shattered by these dichotomies.* In other words, given $N$ points and break point $k$, $B(N,k)$ is the maximum number of ways we can color the $N$ points blue or red. Since $B(N,k)$ is a maximum over all such sets of dichotomies, it is an upper bound for any $m_H(N)$ where $H$ has break point $k$:

$$ m_H(N) \leq B(N,k)\quad k\ break\ point\ for\ H$$

We will see that computing $B(N,k)$ is simply an application of *Pascal's Triangle and the Binomial Theorem*.

### Case where $k > N$

Assume for this case that $k > N$: the break point is larger than the number of points, so the no-shatter constraint never applies (there is no subset of size $k$ at all). Since the number of dichotomies on $N$ points is bounded by $2^N$, and every dichotomy is allowed, we have for all such $N$:

$$ m_H(N) = B(N,k) = 2^N$$

First we will show intuitively why this comes from the Binomial Theorem and Pascal's Triangle, then we will prove it rigorously with induction.
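Because $B(N,k)$ is defined purely combinatorially, for tiny $N$ we can compute it straight from the definition by exhaustive search, before deriving any formula. This is a brute-force sketch (the names `shattered` and `brute_B` are mine); the search is over all sets of dichotomies, so it only scales to about $N \leq 4$:

```python
from itertools import product, combinations

def shattered(dichos, idx):
    """Do these dichotomies, restricted to the points in idx, realize all colorings?"""
    return len({tuple(d[i] for i in idx) for d in dichos}) == 2 ** len(idx)

def brute_B(N, k):
    """Max number of dichotomies on N points with no k-point subset shattered."""
    all_d = list(product([-1, 1], repeat=N))
    best = 0
    for mask in range(1 << len(all_d)):           # every set of dichotomies
        chosen = [d for j, d in enumerate(all_d) if mask >> j & 1]
        if len(chosen) <= best:
            continue
        if not any(shattered(chosen, idx) for idx in combinations(range(N), k)):
            best = len(chosen)
    return best

print(brute_B(3, 2))  # 4
print(brute_B(3, 3))  # 7 = 2^3 - 1: the full set may not be shattered
print(brute_B(4, 3))  # 11
```

These values (4, 7, 11) are worth remembering: they match the row sums of Pascal's Triangle that the rest of this section derives.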
Consider that for the $N$th row of Pascal's Triangle we have the following from the Binomial Theorem (see **Figure 2**):

$$ 2^N = \sum_{i=0}^N \binom{N}{i} = \binom{N}{0} + \binom{N}{1} + \dots + \binom{N}{N} $$

Since we can color each point red or blue, the expression $\binom{N}{i}$ is asking: *how many ways are there to choose $i$ blue points from $N$ total points?* Thus the expression $\sum_{i=0}^N \binom{N}{i}$ is asking: *given $N$ points, how many total ways are there to choose $i$ blue points, over all $i \in [0,N]$?* As we know, there are $2^N$ total ways to color $N$ points, so it is unsurprising that by the Binomial Theorem $2^N = \sum_{i=0}^N \binom{N}{i}$.

<img src="/media/images/20260328_064123_Screenshot_2026-03-27_at_11.36.03_PM.png" alt="vc dim model" style="max-width: 40%;"> <center> $\textbf{Figure 2:}$ First 6 Rows of Pascal's Triangle showing the Binomial Theorem </center>

#### Proof by Induction

For the case where $k > N$, we are going to prove

$$B(N,k) = \sum_{i=0}^N \binom{N}{i} = 2^N$$

**Base Case** $N=1$: When $N=1$ there is only 1 point, which can be colored either blue or red, giving two possibilities (with $k \geq 2$, a single point forms no subset of size $k$, so nothing is forbidden). Our proposition agrees with this:

$$B(1,k) = \sum_{i=0}^1 \binom{1}{i} = \binom{1}{0} + \binom{1}{1} = 1 + 1 = 2$$

Base Case proven!

**Inductive Case** $B(N,k) = 2 \cdot B(N-1,k)$: Suppose we have $N-1$ points, giving $2^{N-1}$ total dichotomies. If we add another point, each of these $2^{N-1}$ dichotomies splits into two (the new point can be blue or red), giving us $2 \cdot 2^{N-1}$ dichotomies, none of which is forbidden since no subset of size $k > N$ exists. As we know, $2^N = 2 \cdot 2^{N-1}$. Thus we have proved the inductive case by showing that

$$B(N,k) = 2^N = 2\cdot 2^{N-1} = 2 \cdot B(N-1,k)$$

Since both the Base and Inductive Cases hold, we have proved that $B(N,k) = 2^N = \sum_{i=0}^N \binom{N}{i}$, i.e. the sum of the $N$th row of Pascal's Triangle, for $B(N,k)$ with $k > N$.
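The inductive step can be mirrored literally in code: starting from the single empty dichotomy, extending every dichotomy on $N-1$ points with the two possible colors of a new point doubles the count at each row. A tiny sketch (the name `extend` is mine):

```python
def extend(dichos):
    """Inductive step for k > N: each dichotomy on N-1 points splits into two."""
    return [d + (s,) for d in dichos for s in (-1, 1)]

rows = [[()]]                 # row 0: the single empty dichotomy
for _ in range(5):
    rows.append(extend(rows[-1]))

print([len(r) for r in rows])  # [1, 2, 4, 8, 16, 32] -> row N holds 2^N
```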
### Case where $k \leq N$

For $B(N,k)$ with $k > N$, we showed the answer is simply the sum of the $N$th row of Pascal's Triangle. Now we are going to show that $B(N,k)$ with $k \leq N$ equals the sum of the first $k$ elements (indices $[0,k-1]$) of the $N$th row of Pascal's Triangle, i.e.

$$ B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i} < 2^N$$

For the case $B(5,3)$, see **Figure 2**, which shows in white the first $k=3$ entries of row $N=5$ together with their ancestors, which are also the first $k=3$ elements of their respective rows.

<img src="/media/images/20260328_081735_binomial_tree.png" alt="vc dim model" style="max-width: 50%;"> <center> $\textbf{Figure 2:}$ Pascal's Triangle for $B(N,k)$ with Break Point $k=3$. White shows which elements connect to the first $k=3$ elements on row $N=5$ </center>

First we will show intuitively why this is the case and then give a rigorous proof by induction. Suppose I have $N$ points but, given the limitations of my hypothesis set $H$, I can only realize all possible dichotomies on subsets of at most $k-1$ of them. An example of this can be seen in **Video 1**, where the linear model on $N=4$ points can only produce 14 out of 16 possible dichotomies (blue/red colorings). As before, $\binom{N}{i}$ asks: given $N$ total points, how many ways are there to choose $i$ of them to be blue? Now we have a *restriction* $i < k$, since we can only represent all dichotomies on subsets of at most $k-1$ of the $N$ points (considering all such subsets). Thus, under our model restriction, we can consider choosing $0$ blue points, $1$ blue point, ...
$k-1$ blue points, giving us

$$ B(N,k) = \binom{N}{0} + \binom{N}{1} + \dots + \binom{N}{k-1} = \sum_{i=0}^{k-1} \binom{N}{i}$$

#### Proof by Induction

We are going to show that for $B(N,k)$ with $k \leq N$, $B(N,k)$ equals the sum of the first $k$ elements of the $N$th row of Pascal's Triangle:

$$ B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i}$$

**Base Case**: $k=1$. A break point of $k=1$ means the model class $H$ cannot even produce both colorings of any single point, so only one dichotomy is possible: $B(N,1) = 1$. This matches our hypothesis:

$$B(N,1) = \sum_{i=0}^0 \binom{N}{i} = \binom{N}{0} = 1$$

Base case proven!

**Inductive Case**: $B(N,k) = B(N-1,k) + B(N-1,k-1)$. For the inductive case we have two cases (**Figure 3**):

**Case 1**: $B(N-1,k)$ - add the $N$th point *outside* of the size-$k$ subset

**Case 2**: $B(N-1,k-1)$ - add the $N$th point *to* the size-$(k-1)$ subset, making it a size-$k$ subset

<img src="/media/images/20260328_203002_vc_dim_inductive_case.png" style="max-width: 90%;"> <center> $\textbf{Figure 3}$: Two Cases where we add element $x_n$ either inside or outside of the size-$k$ subset </center>

Since the new element is added either *outside* or *inside* the subset of $k$ elements, these two cases cover all possibilities for adding a new point to $N-1$ existing points. For induction, we will show $B(N,k) = B(N-1,k) + B(N-1,k-1)$: the sum of the first $k$ elements of the $N$th row of Pascal's Triangle equals the sum of the first $k$ elements plus the first $k-1$ elements of the previous row $N-1$. Recall the example from **Figure 2** showing how $B(5,3)$ depends only on the first three elements of row $N=4$.
Recall *Pascal's Identity*:

$$ \binom{N}{i} = \binom{N-1}{i} + \binom{N-1}{i-1} $$

Applying Pascal's Identity to $B(N,k)$ gives us

$$ B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i} = \sum_{i=0}^{k-1}\left( \binom{N-1}{i} + \binom{N-1}{i-1} \right) = \sum_{i=0}^{k-1} \binom{N-1}{i} + \sum_{i=0}^{k-1} \binom{N-1}{i-1}$$

By definition:

$$ B(N-1,k) = \sum_{i=0}^{k-1} \binom{N-1}{i}$$

and

$$ B(N-1,k-1) = \sum_{i=0}^{k-2} \binom{N-1}{i} = \sum_{i=1}^{k-1} \binom{N-1}{i-1} = \sum_{i=0}^{k-1} \binom{N-1}{i-1}$$

where the last step uses $\binom{N-1}{-1} = 0$, since you cannot choose a negative number of objects. Thus we can see that

$$ B(N,k) = \sum_{i=0}^{k-1} \binom{N-1}{i} + \sum_{i=0}^{k-1} \binom{N-1}{i-1} = B(N-1,k) + B(N-1,k-1)$$

proving the inductive case!

### Final Result

Combining our two proofs (and using the convention $\binom{N}{i} = 0$ for $i > N$), we have for any break point $k$ that

$$B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i}$$

Note that when $k \leq N$ we get the strict inequality

$$B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i} < 2^N$$

In fact, since each $\binom{N}{i}$ is a polynomial in $N$ of degree $i$, $B(N,k)$ is a *polynomial* in $N$ of degree $k-1$. Recall that we previously stated

$$m_H(N) \leq B(N,k) \quad k\ break\ point\ for\ H$$

giving us, for hypothesis sets $H$ with break point $k \leq N$,

$$m_H(N) \leq B(N,k) < 2^N$$

and $B(N,k)$ gives a polynomial bound. Additionally, since the smallest break point is $k = d_{VC}+1$, we can restate this in terms of the VC Dimension:

$$m_H(N) \leq \sum_{i=0}^{d_{VC}} \binom{N}{i}$$

Thus the VC Dimension is the degree of the polynomial bound on $m_H(N)$, and this is the best we can do with this type of reasoning, since no break point smaller than $k = d_{VC}+1$ exists.
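The final result is easy to sanity-check numerically. This sketch (function names are mine) implements $B(N,k)$ from the base cases and the recursion $B(N,k) = B(N-1,k) + B(N-1,k-1)$, then compares it against the closed form $\sum_{i=0}^{k-1}\binom{N}{i}$:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(n, k):
    """B(N,k) from the boundary conditions and the recursion in the text."""
    if k == 1:
        return 1            # break point 1: only a single dichotomy survives
    if n == 1:
        return 2            # one point, k >= 2: both colorings are allowed
    return B(n - 1, k) + B(n - 1, k - 1)

def closed_form(n, k):
    # sum of the first k entries of row N; comb(n, i) = 0 when i > n
    return sum(comb(n, i) for i in range(k))

# the recursion and the closed form agree for every (N, k), including k > N
assert all(B(n, k) == closed_form(n, k)
           for n in range(1, 12) for k in range(1, n + 3))
print(B(5, 3))  # 16 = 1 + 5 + 10, the white entries of row N=5 in Figure 2
```

Note that `math.comb` already returns 0 for $i > N$, so the same one-line closed form covers both the $k \leq N$ and $k > N$ cases.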
## Updating Generalization Bound

Now that we can bound the growth function $m_H(N)$ in terms of the VC Dimension, we can replace the number of hypotheses $M$ in the generalization bound with the growth function $m_H(N)$, showing how the VC Dimension behaves in the generalization equation.

$$\mathbb{P}[|E(g)_{in} - E(g)_{out}| > \epsilon] \leq 2Me^{-2\epsilon^2N}$$

First, let's rephrase this by picking a tolerance level $\delta$ and asserting with probability at least $1-\delta$ that

$$ E(g)_{out} \leq E(g)_{in} + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$$

Now directly replace $M$, the total number of hypotheses in $H$, with the growth function $m_H(N)$, which for a hypothesis set with break point $k \leq N$ has a polynomial bound rather than the exponential ($2^N$) one:

$$\boxed{E(g)_{out} \leq E(g)_{in} + \sqrt{\frac{1}{2N}\ln\frac{2m_H(N)}{\delta}}}$$

$$m_H(N) \leq \sum_{i=0}^{d_{VC}} \binom{N}{i} $$

Thus we get that

- The **lower the VC Dimension**, the **closer** $E(g)_{out}$ and $E(g)_{in}$ are
- The **higher the VC Dimension**, the **farther apart** $E(g)_{out}$ and $E(g)_{in}$ can be

This is directly reflected in **Figure 1**: the selected hypothesis $g$ of the Linear Model with $d_{VC} = 3$ generalizes well, while the degree-20 Polynomial Model, which fits the training data perfectly, has poor out-of-sample performance.