# Bias Variance Tradeoff

*by dokuDoku · Generalization 🔍*

## What is the Bias Variance Tradeoff?

As previously discussed in the [VC Dimension](https://www.noteblogdoku.com/blog/vc-dimension-via-pascal-s-triangle) blog post, one way to approach generalization is to analyze the complexity of a hypothesis class $H$ of models, which captures the trade-off between *approximating the ground truth $f$ with a function $g$ on the training data and generalizing to new data*. The big-picture result was that:

- The lower the VC Dimension, the tighter the bound between $E_{out}(g)$ and $E_{in}(g)$
- The higher the VC Dimension, the looser the bound between $E_{out}(g)$ and $E_{in}(g)$

Equivalently, in broad strokes:

- **The higher the model complexity of $g$, the more prone the model is to overfitting the training set**

Note that VC Dimension analysis is commonly introduced for **binary outputs** only (classification of points with labels $\{-1,1\}$), and extending it to non-binary outputs takes significant work. That work does exist, but it is more interesting to additionally study model generalization through the Bias-Variance decomposition. **Bias-Variance** analysis is well suited to *squared loss*. Let's start with some definitions:

- **Bias**:
  - The squared difference between the average model's prediction and the true value
  - How wrong the model is on average
  - Grows as the model gets simpler (underfitting)

$$ \text{bias}^2(x) = (\bar{g}(x) - f(x))^2$$

- **Variance**:
  - The expected squared difference between a model trained on a sampled dataset $D$ and the average model
  - $D \sim P^N$ denotes sampling a dataset of size $N$ from the data distribution
  - How any one trained model instance differs from the average trained model
  - Grows with model complexity (overfitting)
  - Mitigated with regularization or more data

$$ \text{var}(x) = \mathbb{E}_{D \sim P^N}[(g^D(x) - \bar{g}(x))^2]$$

See the **title image** for a very intuitive description of Bias and Variance!
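Both quantities can be estimated numerically by repeatedly sampling datasets and refitting. Below is a minimal sketch; the ground truth `f`, the dataset size, and the quadratic model class are hypothetical choices for illustration, not the exact setup behind the figures:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical ground truth (a stand-in for the post's 5th-degree polynomial)."""
    return x**5 - 2 * x**3 + x

N, TRIALS, DEGREE = 20, 500, 2        # dataset size, number of sampled datasets, model class
x_grid = np.linspace(-1, 1, 200)      # evaluation points approximating the expectation over x
preds = np.empty((TRIALS, x_grid.size))

for t in range(TRIALS):
    # Sample a dataset D ~ P^N and train g^D on it (least-squares polynomial fit)
    x_train = rng.uniform(-1, 1, N)
    coeffs = np.polyfit(x_train, f(x_train), DEGREE)
    preds[t] = np.polyval(coeffs, x_grid)

g_bar = preds.mean(axis=0)                       # average model  g_bar(x)
bias_sq = (g_bar - f(x_grid)) ** 2               # (g_bar(x) - f(x))^2       at each x
variance = ((preds - g_bar) ** 2).mean(axis=0)   # E_D[(g^D(x) - g_bar(x))^2] at each x
print(f"mean bias^2   = {bias_sq.mean():.4f}")
print(f"mean variance = {variance.mean():.4f}")
```

Averaging `preds` over the trial axis is the Monte Carlo stand-in for $\mathbb{E}_{D \sim P^N}$, so `g_bar` plays the role of $\bar{g}(x)$ in the definitions above.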
Note that since we typically do not know the underlying function we are approximating (if we did, why are we learning?), this is a *mental model* for which types of model to choose based on observed behavior. To spoil the end result (see **Figure 1**): the out-of-sample error of the trained model $g^D$ (not considering noise), expected over all datasets of size $N$ sampled as $D \sim P^N$, is equal to:

$$\mathbb{E}_{D \sim P^N}\left[E_{out}(g^D)\right] = \underbrace{\mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[(g^D(x) - \bar{g}(x))^2\right]\right]}_{\text{variance}} + \underbrace{\mathbb{E}_x\left[(\bar{g}(x) - f(x))^2\right]}_{\text{bias}^2}$$

**Note** that we should read $\mathbb{E}_{D \sim P^N}\left[E_{out}(g^D)\right]$ as the expectation of the out-of-sample error of the model $g$ trained on dataset $D$; more explicitly:

$$\mathbb{E}_{D \sim P^N}\left[E_{out}(g^D)\right] = \sum_D \mathbb{P}(D) \cdot E_{out}(g^D)$$

So if we keep training models of the same type as $g$ on data sampled from $P^N$, then $\mathbb{E}_{D \sim P^N}\left[E_{out}(g^D)\right]$ is the expected performance.

<img src="/media/images/20260330_095525_bias_var_viz.png" alt="bias var viz" style="max-width: 90%; height: auto;">

<center> $\textbf{Figure 1:}$ Bias-Variance Trade-Off shown with Three Different Model Classes with Ground Truth 5th-Degree Polynomial Function, $\textbf{300}$ Sampled Datasets </center>

<center> Notice that Bias gets worse for both the $\textit{under- and over-parameterized}$ models, since the variance starts to affect the measured bias for the Degree 10 Model.
</center>

<img src="/media/images/20260330_195924_bias_var_viz.png" alt="bias var viz" style="max-width: 90%; height: auto;">

<center> $\textbf{Figure 2:}$ Bias-Variance Trade-Off shown with Three Different Model Classes with Ground Truth 5th-Degree Polynomial Function, $\textbf{3000}$ Sampled Datasets </center>

<center> Notice that for the Degree 10 polynomial, the sampled points are typically matched exactly, but high variance occurs due to over-parameterization </center>

## Derivation

We are going to start with the squared error and show how it decomposes into a Bias-Variance trade-off. Definitions:

- $g^D$ is the model trained on a dataset $D$ of $N$ points sampled from the data distribution
- $H$ is the hypothesis space
  - $g \in H$, where $H$ determines the type of model $g$ is
  - Ex. $H$ could be the set of 2D linear regression models, so $g^D$ would be a linear regression model trained on $D$
- $P^N$ is the distribution from which we sample datasets of $N$ points
- $\bar{g}(x) := \mathbb{E}_{D \sim P^N}[g^D(x)]$ is the average of all possible trained models $g^D$ (from the same hypothesis set)
- $f(x)$ is the ground truth function we are attempting to mimic
- $\mathbb{E}_x$ is the expectation with respect to the input $x$
  - $X$ is the input space such that $x \in X$

The out-of-sample squared error of the model $g$ trained on a sampled dataset $D$ is:

$$E_{out}(g^D) = \mathbb{E}_x[(g^D(x) - f(x))^2]$$

We want to consider all such trained functions $g^D$ over their sampled datasets $D \sim P^N$ and find the expected out-of-sample error with respect to the training-data distribution:

$$ \mathbb{E}_{D \sim P^N}[E_{out}(g^D)] = \mathbb{E}_{D \sim P^N}[\mathbb{E}_x[(g^D(x) - f(x))^2]]$$

Great! Let's start.
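As a concrete anchor before the algebra, the left-hand side $\mathbb{E}_{D \sim P^N}[E_{out}(g^D)]$ can be estimated directly by brute force: train a model on each freshly sampled dataset, measure its squared error over the input space, and average over datasets. A sketch with a hypothetical ground truth and model class (chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Hypothetical ground truth (a stand-in for the post's 5th-degree polynomial)."""
    return x**5 - 2 * x**3 + x

N, TRIALS, DEGREE = 20, 500, 2
x_grid = np.linspace(-1, 1, 200)   # grid approximating the expectation over x

# Monte Carlo estimate of E_{D ~ P^N}[ E_out(g^D) ]:
# for each sampled dataset D, fit g^D and compute E_x[(g^D(x) - f(x))^2], then average over D.
errors = []
for t in range(TRIALS):
    x_train = rng.uniform(-1, 1, N)
    coeffs = np.polyfit(x_train, f(x_train), DEGREE)
    e_out = np.mean((np.polyval(coeffs, x_grid) - f(x_grid)) ** 2)
    errors.append(e_out)

expected_e_out = float(np.mean(errors))
print(f"E_D[E_out(g^D)] ≈ {expected_e_out:.4f}")
```

The derivation that follows shows this single number splits exactly into a variance term and a squared-bias term.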
By linearity of expectation (swapping the order of the two expectations):

$$ \mathbb{E}_{D \sim P^N}[E_{out}(g^D)] = \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[(g^D(x) - f(x))^2]]$$

$$ = \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2 - 2g^D(x)f(x) + f^2(x)]]$$

$$ = \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]]$$

Now let's add and subtract $\bar{g}(x)^2$, where $\bar{g}(x) = \mathbb{E}_{D \sim P^N}[g^D(x)]$. That way we can complete the square and obtain our variance term.

$$ = \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2 + \bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]\right]$$

Let's split this into two parts, deal with them separately, then combine the results.

$$ \text{Part A:}\ \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2]$$

$$ \text{Part B:}\ \mathbb{E}_x[\bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]]$$

#### Part A:

It's easiest to start from the result we want (the variance) and show that it is equivalent to

$$\mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2]$$

Desired result (the variance):

$$ \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[(g^D(x) - \bar{g}(x))^2\right]\right]$$

Start by expanding the square:

$$ \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[g^D(x)^2 - 2g^D(x)\bar{g}(x) + \bar{g}(x)^2\right]\right]$$

$$ \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - 2\mathbb{E}_{D \sim P^N}[g^D(x)\bar{g}(x)] + \mathbb{E}_{D \sim P^N}[\bar{g}(x)^2]]$$

Using the fact that $\bar{g}(x)$ is constant with respect to the expectation over $D$ (so $\mathbb{E}_{D \sim P^N}[g^D(x)\bar{g}(x)] = \bar{g}(x)^2$), we get

$$ \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - 2\bar{g}(x)^2 + \bar{g}(x)^2]$$

$$ \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2]$$

which is equal to our original expression!
Thus

$$\mathbb{E}_x[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2] = \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[(g^D(x) - \bar{g}(x))^2\right]\right]$$

#### Part B:

We are going to show that

$$\mathbb{E}_x[\bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]]$$

is equal to the expected squared bias term

$$ \mathbb{E}_x[(\bar{g}(x) - f(x))^2] $$

Let's begin:

$$\mathbb{E}_x[\bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]]$$

Since $f(x)$ does not depend on $D \sim P^N$, it can be pulled out of the expectations over $D$: $\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] = f(x)\,\bar{g}(x)$ (using $\bar{g}(x) := \mathbb{E}_{D \sim P^N}[g^D(x)]$) and $\mathbb{E}_{D \sim P^N}[f^2(x)] = f^2(x)$, leaving

$$ \mathbb{E}_x[\bar{g}(x)^2 - 2\bar{g}(x)f(x) + f^2(x)] $$

Recognizing the quadratic expansion, this simplifies to the desired expression:

$$ \mathbb{E}_x[(\bar{g}(x) - f(x))^2] $$

Thus, we have shown that

$$\mathbb{E}_x[\bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]] = \mathbb{E}_x[(\bar{g}(x) - f(x))^2]$$

#### Putting it Together

Given our derivations in **Part A** and **Part B**, we get:

$$ \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}[g^D(x)^2] - \bar{g}(x)^2 + \bar{g}(x)^2 - 2\mathbb{E}_{D \sim P^N}[g^D(x)f(x)] + \mathbb{E}_{D \sim P^N}[f^2(x)]\right] = \mathbb{E}_x\left[\mathbb{E}_{D \sim P^N}\left[(g^D(x) - \bar{g}(x))^2\right]\right] + \mathbb{E}_x[(\bar{g}(x) - f(x))^2]$$

which is our bias-variance decomposition of the squared loss! Restating this from the original form:

$$ \mathbb{E}_{D \sim P^N}[E_{out}(g^D)] = \mathbb{E}_x[\mathbb{E}_{D \sim P^N}[(g^D(x) - f(x))^2]] $$

$$ \downarrow$$

$$ \boxed{ \mathbb{E}_{D \sim P^N}\Bigl[E_{\text{out}}(g^D)\Bigr] = \underbrace{\mathbb{E}_x\Bigl[\mathbb{E}_{D \sim P^N}\bigl[(g^D(x) - \bar{g}(x))^2\bigr]\Bigr]}_{\text{variance}} + \underbrace{\mathbb{E}_x\Bigl[(\bar{g}(x) - f(x))^2\Bigr]}_{\text{bias}^2} } $$
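The boxed identity is easy to sanity-check numerically: estimate the left-hand side and the two right-hand terms from the same collection of trained models and confirm they agree. A sketch with a hypothetical ground truth and hypothesis classes (the figures' exact setup is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    """Hypothetical ground truth (a stand-in for the post's 5th-degree polynomial)."""
    return x**5 - 2 * x**3 + x

N, TRIALS = 40, 2000
x_grid = np.linspace(-1, 1, 200)

for degree in (1, 5, 10):                  # under-, well-, and over-parameterized classes
    preds = np.empty((TRIALS, x_grid.size))
    for t in range(TRIALS):
        x_train = rng.uniform(-1, 1, N)
        preds[t] = np.polyval(np.polyfit(x_train, f(x_train), degree), x_grid)

    lhs = np.mean((preds - f(x_grid)) ** 2)          # E_D[E_out(g^D)]
    g_bar = preds.mean(axis=0)                       # g_bar(x)
    variance = np.mean((preds - g_bar) ** 2)         # E_x[E_D[(g^D(x) - g_bar(x))^2]]
    bias_sq = np.mean((g_bar - f(x_grid)) ** 2)      # E_x[(g_bar(x) - f(x))^2]

    # The decomposition is an exact identity (the cross term vanishes by construction
    # when g_bar is the sample mean), so the split should match to float precision.
    assert np.isclose(lhs, variance + bias_sq)
    print(f"degree {degree:2d}: E_out={lhs:.4f}  var={variance:.4f}  bias^2={bias_sq:.4f}")
```

Because `g_bar` is the sample mean of `preds`, the cross term $2(\bar{g}(x) - f(x))\,\mathbb{E}_D[g^D(x) - \bar{g}(x)]$ is exactly zero in the empirical estimate as well, which is why the assert holds to floating-point precision rather than only approximately.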