# Bias Variance Tradeoff

*by dokuDoku · Generalization 🔍*

## What is the Bias Variance Tradeoff?

As previously discussed in the [VC Dimension](https://www.noteblogdoku.com/blog/vc-dimension-via-pascal-s-triangle) blog post, one way to approach generalization is to analyze the complexity of a hypothesis class $H$ of models, which captures the trade-off between *approximating the ground truth $f$ with a function $g$ on the training data and generalizing to new data*. The big-picture result was that:

- The lower the VC Dimension, the tighter the bound between $E_{out}(g)$ and $E_{in}(g)$
- The higher the VC Dimension, the looser the bound between $E_{out}(g)$ and $E_{in}(g)$

Equivalently, in broad strokes:

- **The higher the model complexity of $g$, the more prone the model is to overfitting the training set**

Note that VC Dimension analysis is commonly introduced for **binary outputs** only (classification of points with labels $\{-1,1\}$), and extending it to non-binary outputs takes significant work. That work does exist, but it is more interesting to additionally study model generalization through the Bias-Variance decomposition. **Bias-Variance** analysis is well suited to *squared loss*. Let's start with some definitions:

- **Bias**:
  - The squared difference between the average model's prediction and the true value
  - How wrong the model is on average
  - Grows as the model gets simpler (underfitting)

$$ \text{bias}^2(x) = (\bar{g}(x) - f(x))^2$$

- **Variance**:
  - The expected squared difference between a model trained on a sampled dataset $D$ and the average model
  - $D \sim P^N$ denotes sampling a dataset of size $N$ from the data distribution
  - How any one trained model instance differs from the average trained model
  - Grows with model complexity (overfitting)
  - Mitigated with regularization or more data

$$ \text{var}(x) = \mathbb{E}_{D \sim P^N}[(g^D(x) - \bar{g}(x))^2]$$

See the **title image** for a very intuitive description of Bias and Variance!
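Both quantities can be estimated numerically by repeatedly sampling datasets and refitting. Below is a minimal sketch; the ground truth `f`, the dataset size, and the quadratic model class are hypothetical choices for illustration, not the exact setup behind the figures:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical ground truth (a stand-in for the post's 5th-degree polynomial)."""
    return x**5 - 2 * x**3 + x

N, TRIALS, DEGREE = 20, 500, 2        # dataset size, number of sampled datasets, model class
x_grid = np.linspace(-1, 1, 200)      # evaluation points approximating the expectation over x
preds = np.empty((TRIALS, x_grid.size))

for t in range(TRIALS):
    # Sample a dataset D ~ P^N and train g^D on it (least-squares polynomial fit)
    x_train = rng.uniform(-1, 1, N)
    coeffs = np.polyfit(x_train, f(x_train), DEGREE)
    preds[t] = np.polyval(coeffs, x_grid)

g_bar = preds.mean(axis=0)                       # average model  g_bar(x)
bias_sq = (g_bar - f(x_grid)) ** 2               # (g_bar(x) - f(x))^2       at each x
variance = ((preds - g_bar) ** 2).mean(axis=0)   # E_D[(g^D(x) - g_bar(x))^2] at each x
print(f"mean bias^2   = {bias_sq.mean():.4f}")
print(f"mean variance = {variance.mean():.4f}")
```

Averaging `preds` over the trial axis is the Monte Carlo stand-in for $\mathbb{E}_{D \sim P^N}$, so `g_bar` plays the role of $\bar{g}(x)$ in the definitions above.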
Note that since we typically do not know the underlying function we are approximating (if we did, why are we learning?), this is a *mental model* for which types of model to choose based on observed behavior. To spoil the end result (see **Figure 1**): the out-of-sample error of the trained model $g^D$ (not considering noise), expected over all datasets of size $N$ sampled as $D \sim P^N$, is equal to:

$$\mathbb{E}_{D \sim P^N}\left[E_{out}(g^D)\right] = \underbrace{\mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[(g^D(x) - \bar{g}(x))^2\right]\right]}_{\text{variance}} + \underbrace{\mathbb{E}_x\left[(\bar{g}(x) - f(x))^2\right]}_{\text{bias}^2}$$

**Note** that we should read $\mathbb{E}_{D \sim P^N}\left[E_{out}(g^D)\right]$ as the expectation of the out-of-sample error of the model $g$ trained on dataset $D$; more explicitly:

$$\mathbb{E}_{D \sim P^N}\left[E_{out}(g^D)\right] = \sum_D \mathbb{P}(D) \cdot E_{out}(g^D)$$

So if we keep training models of the same type as $g$ on data sampled from $P^N$, then $\mathbb{E}_{D \sim P^N}\left[E_{out}(g^D)\right]$ is the expected performance.

<img src="/media/images/20260330_095525_bias_var_viz.png" alt="bias var viz" style="max-width: 90%; height: auto;">

<center> $\textbf{Figure 1:}$ Bias-Variance Trade-Off shown with Three Different Model Classes with Ground Truth 5th-Degree Polynomial Function, $\textbf{300}$ Sampled Datasets </center>

<center> Notice that Bias gets worse for both the $\textit{under- and over-parameterized}$ models, since the variance starts to affect the measured bias for the Degree 10 Model.
</center>

<img src="/media/images/20260330_195924_bias_var_viz.png" alt="bias var viz" style="max-width: 90%; height: auto;">

<center> $\textbf{Figure 2:}$ Bias-Variance Trade-Off shown with Three Different Model Classes with Ground Truth 5th-Degree Polynomial Function, $\textbf{3000}$ Sampled Datasets </center>

<center> Notice that for the Degree 10 polynomial, the sampled points are typically matched exactly, but high variance occurs due to over-parameterization </center>

## Derivation

We are going to start with the squared error and show how it decomposes into a Bias-Variance trade-off. Definitions:

- $g^D$ is the model trained on a dataset $D$ of $N$ points sampled from the data distribution
- $H$ is the hypothesis space
  - $g \in H$, where $H$ determines the type of model $g$ is
  - Ex. $H$ could be the set of 2D linear regression models, so $g^D$ would be a linear regression model trained on $D$
- $P^N$ is the distribution from which we sample datasets of $N$ points
- $\bar{g}(x) := \mathbb{E}_{D \sim P^N}[g^D(x)]$ is the average of all possible trained models $g^D$ (from the same hypothesis set)
- $f(x)$ is the ground truth function we are attempting to mimic
- $\mathbb{E}_x$ is the expectation with respect to the input $x$
  - $X$ is the input space such that $x \in X$

The out-of-sample squared error of the model $g$ trained on a sampled dataset $D$ is:

$$E_{out}(g^D) = \mathbb{E}_x[(g^D(x) - f(x))^2]$$

We want to consider all such trained functions $g^D$ over their sampled datasets $D \sim P^N$ and find the expected out-of-sample error with respect to the training-data distribution:

$$ \mathbb{E}_{D \sim P^N}[E_{out}(g^D)] = \mathbb{E}_{D \sim P^N}[\mathbb{E}_x[(g^D(x) - f(x))^2]]$$

Great! Let's start.
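As a concrete anchor before the algebra, the left-hand side $\mathbb{E}_{D \sim P^N}[E_{out}(g^D)]$ can be estimated directly by brute force: train a model on each freshly sampled dataset, measure its squared error over the input space, and average over datasets. A sketch with a hypothetical ground truth and model class (chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Hypothetical ground truth (a stand-in for the post's 5th-degree polynomial)."""
    return x**5 - 2 * x**3 + x

N, TRIALS, DEGREE = 20, 500, 2
x_grid = np.linspace(-1, 1, 200)   # grid approximating the expectation over x

# Monte Carlo estimate of E_{D ~ P^N}[ E_out(g^D) ]:
# for each sampled dataset D, fit g^D and compute E_x[(g^D(x) - f(x))^2], then average over D.
errors = []
for t in range(TRIALS):
    x_train = rng.uniform(-1, 1, N)
    coeffs = np.polyfit(x_train, f(x_train), DEGREE)
    e_out = np.mean((np.polyval(coeffs, x_grid) - f(x_grid)) ** 2)
    errors.append(e_out)

expected_e_out = float(np.mean(errors))
print(f"E_D[E_out(g^D)] ≈ {expected_e_out:.4f}")
```

The derivation that follows shows this single number splits exactly into a variance term and a squared-bias term.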
By linearity of expectation (swapping the order of the two expectations):

$$ \mathbb{E}_{D \sim P^N}[E_{out}(g^D)] = \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[(g^D(x) - f(x))^2]]$$

$$ = \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2 - 2g^D(x)f(x) + f^2(x)]]$$

$$ = \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]]$$

Now let's add and subtract $\bar{g}(x)^2$, where $\bar{g}(x) = \mathbb{E}_{D \sim P^N}[g^D(x)]$. That way we can complete the square and obtain our variance term.

$$ = \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2 + \bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]\right]$$

Let's split this into two parts, deal with them separately, then combine the results.

$$ \text{Part A:}\ \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2]$$

$$ \text{Part B:}\ \mathbb{E}_x[\bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]]$$

#### Part A:

It's easiest to start from the result we want (the variance) and show that it is equivalent to

$$\mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2]$$

Desired result (the variance):

$$ \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[(g^D(x) - \bar{g}(x))^2\right]\right]$$

Start by expanding the square:

$$ \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[g^D(x)^2 - 2g^D(x)\bar{g}(x) + \bar{g}(x)^2\right]\right]$$

$$ \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - 2\mathbb{E}_{D \sim P^N}[g^D(x)\bar{g}(x)] + \mathbb{E}_{D \sim P^N}[\bar{g}(x)^2]]$$

Using the fact that $\bar{g}(x)$ is constant with respect to the expectation over $D$ (so $\mathbb{E}_{D \sim P^N}[g^D(x)\bar{g}(x)] = \bar{g}(x)^2$), we get

$$ \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - 2\bar{g}(x)^2 + \bar{g}(x)^2]$$

$$ \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2]$$

which is equal to our original expression!
Thus

$$\mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2] = \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[(g^D(x) - \bar{g}(x))^2\right]\right]$$

#### Part B:

We are going to show that

$$\mathbb{E}_x[\bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]]$$

is equal to the expected squared bias term

$$ \mathbb{E}_x[(\bar{g}(x) - f(x))^2] $$

Let's begin:

$$\mathbb{E}_x[\bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]]$$

Since $f(x)$ does not depend on $D \sim P^N$, it can be pulled out of the expectations over $D$: $\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] = f(x)\,\bar{g}(x)$ (using $\bar{g}(x) := \mathbb{E}_{D \sim P^N}[g^D(x)]$) and $\mathbb{E}_{D \sim P^N}[f^2(x)] = f^2(x)$, leaving

$$ \mathbb{E}_x[\bar{g}(x)^2 - 2\bar{g}(x)f(x) + f^2(x)] $$

Recognizing the quadratic expansion, this simplifies to the desired expression:

$$ \mathbb{E}_x[(\bar{g}(x) - f(x))^2] $$

Thus, we have shown that

$$\mathbb{E}_x[\bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]] = \mathbb{E}_x[(\bar{g}(x) - f(x))^2]$$

#### Putting it Together

Given our derivations in **Part A** and **Part B**, we get:

$$ \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2 + \bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]\right] = \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[(g^D(x) - \bar{g}(x))^2\right]\right] + \mathbb{E}_x[(\bar{g}(x) - f(x))^2]$$

which is our bias-variance decomposition of the squared loss! Restating this from the original form:

$$ \mathbb{E}_{D \sim P^N}[E_{out}(g^D)] = \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[(g^D(x) - f(x))^2]] $$

$$ \downarrow$$

$$ \boxed{ \mathbb{E}_{D \sim P^N}\Bigl[E_{\text{out}}(g^D)\Bigr] = \underbrace{\mathbb{E}_x\Bigl[\mathbb{E}_{D \sim P^N}\bigl[(g^D(x) - \bar{g}(x))^2\bigr]\Bigr]}_{\text{variance}} + \underbrace{\mathbb{E}_x\Bigl[(\bar{g}(x) - f(x))^2\Bigr]}_{\text{bias}^2} } $$
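The boxed identity is easy to sanity-check numerically: estimate the left-hand side and the two right-hand terms from the same collection of trained models and confirm they agree. A sketch with a hypothetical ground truth and hypothesis classes (the figures' exact setup is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    """Hypothetical ground truth (a stand-in for the post's 5th-degree polynomial)."""
    return x**5 - 2 * x**3 + x

N, TRIALS = 40, 2000
x_grid = np.linspace(-1, 1, 200)

for degree in (1, 5, 10):                  # under-, well-, and over-parameterized classes
    preds = np.empty((TRIALS, x_grid.size))
    for t in range(TRIALS):
        x_train = rng.uniform(-1, 1, N)
        preds[t] = np.polyval(np.polyfit(x_train, f(x_train), degree), x_grid)

    lhs = np.mean((preds - f(x_grid)) ** 2)          # E_D[E_out(g^D)]
    g_bar = preds.mean(axis=0)                       # g_bar(x)
    variance = np.mean((preds - g_bar) ** 2)         # E_x[E_D[(g^D(x) - g_bar(x))^2]]
    bias_sq = np.mean((g_bar - f(x_grid)) ** 2)      # E_x[(g_bar(x) - f(x))^2]

    # The decomposition is an exact identity (the cross term vanishes by construction
    # when g_bar is the sample mean), so the split should match to float precision.
    assert np.isclose(lhs, variance + bias_sq)
    print(f"degree {degree:2d}: E_out={lhs:.4f}  var={variance:.4f}  bias^2={bias_sq:.4f}")
```

Because `g_bar` is the sample mean of `preds`, the cross term $2(\bar{g}(x) - f(x))\,\mathbb{E}_D[g^D(x) - \bar{g}(x)]$ is exactly zero in the empirical estimate as well, which is why the assert holds to floating-point precision rather than only approximately.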