Bias Variance of L2 Ridge Regression

by dokuDoku Generalization Statistics


## Prerequisites

I recommend reading the following posts before this one:

* [Bias Variance Tradeoff](https://www.noteblogdoku.com/blog/bias-variance-tradeoff)
* [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss)
* [Linear Regression from Projections onto Subspaces](https://www.noteblogdoku.com/blog/linear-regression-solution-from-projections-onto-subspaces)
* [Decomposition of L2 Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression)
* [L2 Ridge Regression from Lagrange Multipliers](https://www.noteblogdoku.com/blog/l2-ridge-regression-from-lagrange-multipliers) (Optional)

The general strategy for this blog post is as follows:

1. Review Bias-Variance Decomposition in general for squared loss models
2. Bias-Variance Decomposition for L2 Linear Regression with model $(y - x^T w)^2$
3. Bias-Variance Decomposition for L2 Ridge Regression with model $(y - x^T w)^2 + \lambda ||w||^2$

## Bias-Variance Decomposition

Recall from [Bias Variance Tradeoff](https://www.noteblogdoku.com/blog/bias-variance-tradeoff) that for a squared loss model:

$$
L(\theta) = (\hat{y}(x) - y)^2
$$

Where

* $L(\theta) \in \mathbb{R}$ is the scalar loss  
* $\hat{y}(x) \in \mathbb{R}$ is the predicted output  
* $y \in \mathbb{R}$ is the observed target

#### Data Generating Process

Assume the true data is generated as:

$$
y = f(x) + \epsilon
$$

where:

* $f(x)$ is the true (deterministic) function  
* $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is noise

#### Expected Out-of-Sample Error

For a model trained on dataset $D$, define:

$$
E_{out}(\hat{y}^D) = \mathbb{E}_x \left[(\hat{y}^D(x) - y)^2\right]
$$

We now take expectation over all possible datasets $D \sim P^N$:

$$
\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right]
=
\mathbb{E}_x \left[
\mathbb{E}_D \left[(\hat{y}^D(x) - y)^2\right]
\right]
$$

#### Bias-Variance Decomposition

The error decomposes as:

$$
\mathbb{E}_D \left[(\hat{y}^D(x) - y)^2\right]
=
\underbrace{
\mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right]
}_{\text{variance}}
+
\underbrace{
(\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2
}_{\text{squared bias}}
+
\underbrace{
\sigma^2
}_{\text{noise}}
$$

Putting it all together:

$$
\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right]
=
\mathbb{E}_x \left[
\underbrace{
\mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right]
}_{\text{variance}}
+
\underbrace{
(\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2
}_{\text{squared bias}}
+
\underbrace{
\sigma^2
}_{\text{irreducible noise}}
\right]
$$

#### Notation

- $\mathbb{E}_D$ denotes expectation over datasets $D \sim P^N$  
- $\hat{y}^D(x)$ is the model trained on dataset $D$  
- $\sigma^2$ is the variance of the noise
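
The decomposition above can be checked numerically at a single test input. The following is a minimal Monte Carlo sketch, not part of the original derivation: the true function `f`, the noise level, the dataset size and the test point `x0` are all assumptions chosen for illustration, and the model is a straight line fit with `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return x ** 2

sigma = 0.5       # noise standard deviation
n_train = 10      # points per training set D
n_trials = 5000   # number of independent datasets D
x0 = 0.5          # fixed test input

preds = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh dataset D and fit a straight line to it.
    x = rng.uniform(-1.0, 1.0, size=n_train)
    y = f(x) + sigma * rng.normal(size=n_train)
    slope, intercept = np.polyfit(x, y, deg=1)
    preds[t] = slope * x0 + intercept

# Fresh noisy targets at x0, independent of every training set.
y0 = f(x0) + sigma * rng.normal(size=n_trials)

total = np.mean((preds - y0) ** 2)            # left-hand side
variance = np.var(preds)                      # spread of predictions over D
bias_sq = (np.mean(preds) - f(x0)) ** 2       # squared bias at x0
noise = sigma ** 2                            # irreducible noise

print(f"total error              : {total:.4f}")
print(f"variance + bias^2 + noise: {variance + bias_sq + noise:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `n_trials` grows.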

## Bias-Variance Decomposition for L2 Loss

Recall that for linear regression we consider a **single input-output pair**:

$$ L(w) = (y - x^T w)^2 $$

* $L(w) \in \mathbb{R}$ scalar loss
* $x \in \mathbb{R}^d$ input data point
* $w \in \mathbb{R}^d$ weight vector
* $y \in \mathbb{R}$ scalar ground truth
* $\hat{y}(x) = x^T w \in \mathbb{R}$ model prediction

Thus, the expected out-of-sample error is:

$$
E_{out}(\hat{y}^D) = \mathbb{E}_x[(y - x^T w)^2]
$$

### Ground Truth Model

We assume:

$$
y = x^T w^* + \epsilon = f(x) + \epsilon
$$

where:

* $w^* \in \mathbb{R}^d$ is the true parameter vector
* $\epsilon \sim \mathcal{N}(0, \sigma^2)$

### Calculating the Bias-Variance

Using:

* $\hat{y}^D(x) = x^T w$

we plug into the decomposition:

$$
\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right]
=
\mathbb{E}_x \left[
\mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right]
+
(\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2
+
\sigma^2
\right]
$$
Variance: 
- $\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w - \mathbb{E}_{D}[x^T w])^2] \right]$

Bias Squared:
- $\mathbb{E}_x[(\mathbb{E}_{D}[x^T w] - (x^Tw^*))^2]$

Noise:
- $\sigma^2$

### Find $\mathbb{E}_D[x^Tw]$

To make our calculations easier, let's find $\mathbb{E}_D[x^Tw]$, i.e. the expected output from our ordinary least squares model. We will first find $\mathbb{E}_D[w]$ then $\mathbb{E}_D[x^Tw]$.

Recall from [Linear Regression Solution](https://www.noteblogdoku.com/blog/linear-regression-solution-from-projections-onto-subspaces) we found for ordinary least squares the optimal $w$ is
$$ L(w) = ||y - Xw||^2 \rightarrow w = (X^TX)^{-1}X^Ty $$
where $X \in \mathbb{R}^{n \times d}$ and each row of $X$ is a data point (the $x^T_i$'s stacked vertically).

Given that $y = Xw^* + \epsilon$, where now $\epsilon \sim N(0, \sigma^2I)$ (since now $y \in \mathbb{R}^n$ for $X \in \mathbb{R}^{n \times d}$), we have

$$w = (X^TX)^{-1}X^T(Xw^* + \epsilon)$$
Great, now let's take the expectation of $w$ and see what happens:
$$ \mathbb{E}_D[w] = \mathbb{E}_D[(X^TX)^{-1}X^T(Xw^* + \epsilon)]$$
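$$ = \mathbb{E}_D[(X^TX)^{-1}X^TXw^*] + \mathbb{E}_D [(X^TX)^{-1}X^T\epsilon]$$

Note that $\epsilon \sim N(0, \sigma^2I)$ is independent of $X$, so $\mathbb{E}_D[(X^TX)^{-1}X^T\epsilon] = \mathbb{E}_X[(X^TX)^{-1}X^T]\,\mathbb{E}_\epsilon[\epsilon] = 0$, giving
$$ = \mathbb{E}_D[(X^TX)^{-1}X^TXw^*] + 0$$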

We typically assume $X$ is composed of linearly independent column (feature) vectors, which means that $X^TX$ is positive definite and hence invertible. We can then simplify:
$$(X^TX)^{-1}X^TX = I $$

Simplifying our statement to
$$ = \mathbb{E}_D[(X^TX)^{-1}X^TXw^*] = \mathbb{E}_D[Iw^*] = \mathbb{E}_D[w^*] = w^*$$

For our result: 
$$ \mathbb{E}_D[w] = w^*$$

For a test input $x$ that is independent of the data set $D$, we can pull $x^T$ out of the expectation:
$$ \mathbb{E}_D[x^Tw] = x^T \mathbb{E}_D[w] = x^Tw^*$$
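
The result $\mathbb{E}_D[w] = w^*$ can be sanity-checked by simulation. This is a small sketch under assumed settings (Gaussian inputs, a made-up `w_star`, and specific values for `n`, `d` and `sigma`), averaging the ordinary least squares solution over many independently drawn datasets.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 50, 3
sigma = 0.5
w_star = np.array([1.0, -2.0, 0.5])   # hypothetical true weights
n_trials = 2000

w_hats = np.empty((n_trials, d))
for t in range(n_trials):
    X = rng.normal(size=(n, d))                   # new inputs for each dataset D
    y = X @ w_star + sigma * rng.normal(size=n)   # y = X w* + eps
    # OLS solution w = (X^T X)^{-1} X^T y, computed with a linear solve.
    w_hats[t] = np.linalg.solve(X.T @ X, X.T @ y)

w_mean = w_hats.mean(axis=0)
print("average w over datasets:", np.round(w_mean, 3))
print("true w*                :", w_star)
```

Individual fits scatter around `w_star`, but their average should land very close to it, which is what unbiasedness means here.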

### Plugging into Bias Variance

Great, now let's plug this into the Bias Variance equation, recall from earlier:
$$
\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right]
=
\mathbb{E}_x \left[
\mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right]
+
(\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2
+
\sigma^2
\right]
$$

Variance: 
- $\mathbb{E}_x[\mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right]] \rightarrow 
\mathbb{E}_x[\mathbb{E}_D[(x^Tw - \mathbb{E}_D[x^Tw])^2]] = 
\mathbb{E}_x[\mathbb{E}_D[(x^Tw - x^Tw^*)^2]]
$

Bias Squared:
- $\mathbb{E}_x[(\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2] \rightarrow
\mathbb{E}_x[(x^Tw^* - x^Tw^*)^2] = 0
$
- **Unbiased Estimator**

Noise: 
- $\sigma^2$

Putting it all together we get
$$\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = 
\mathbb{E}_x[\mathbb{E}_D[(x^Tw - x^Tw^*)^2]] + 0 + \sigma^2 
$$
$$ = \mathbb{E}_x[\mathbb{E}_D[(x^Tw - x^Tw^*)^2]] + \sigma^2
= \mathbb{E}_x[\mathbb{E}_D[(w^Txx^Tw - 2w^Txx^Tw^* + w^{*T}xx^Tw^*)]] + \sigma^2
$$

Recognize the quadratic form where 
$$ (w^Txx^Tw - 2w^Txx^Tw^* + w^{*T}xx^Tw^*) $$
$$ = (w-w^*)^T(xx^T)(w-w^*)$$
where $xx^T \in \mathbb{R}^{d \times d}$ is a rank-1 matrix.

A more useful form is:
$$ (w-w^*)^T(xx^T)(w-w^*) = ((w-w^*)^Tx)(x^T(w-w^*))$$

Note that $(x^T(w-w^*)) \in \mathbb{R}$, so $(x^T(w-w^*))^T = ((w-w^*)^Tx) = (x^T(w-w^*))$ giving us
$$((w-w^*)^Tx)(x^T(w-w^*)) = (x^T(w-w^*))^2$$

Resulting in:
$$\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right]
= \mathbb{E}_x[\mathbb{E}_D[(x^T(w-w^*))^2]] + \sigma^2
$$
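
As a worked example, the inner variance term has a closed form if we add the assumption that the inputs $X$ are held fixed, so that $\mathbb{E}_D$ averages over the noise $\epsilon$ only. Starting from $w = (X^TX)^{-1}X^T(Xw^* + \epsilon)$:

$$ w - w^* = (X^TX)^{-1}X^T\epsilon $$
$$ \mathbb{E}_\epsilon[(w-w^*)(w-w^*)^T] = (X^TX)^{-1}X^T\,\mathbb{E}_\epsilon[\epsilon\epsilon^T]\,X(X^TX)^{-1} = \sigma^2 (X^TX)^{-1} $$
$$ \mathbb{E}_\epsilon[(x^T(w-w^*))^2] = x^T\,\mathbb{E}_\epsilon[(w-w^*)(w-w^*)^T]\,x = \sigma^2\, x^T (X^TX)^{-1} x $$

Under this assumption the variance at a test point $x$ grows with the noise level $\sigma^2$ and is largest along directions where $X^TX$ has small eigenvalues.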

## Bias Variance Decomposition for L2 Ridge Regression

Recall that for L2 Ridge Regression we consider, for a single input-output pair:
$$ L(w_\lambda) = (y - x^Tw_\lambda)^2 + \lambda w_\lambda^Tw_\lambda $$

* $L(w_\lambda) \in \mathbb{R}$ scalar loss
* $x \in \mathbb{R}^d$ input data point
* $w_\lambda \in \mathbb{R}^d$ weight vector
* $y \in \mathbb{R}$ scalar ground truth
* $\hat{y}_\lambda(x) = x^T w_\lambda \in \mathbb{R}$ model prediction
    - Note that this is the same model as in L2 loss; we are only constraining the *values* of the weights
* $w_\lambda^Tw_\lambda \in \mathbb{R}$ regularization penalty on the size of the weights
* $\lambda \in \mathbb{R}$, $\lambda \geq 0$ regularization scaling factor
    - Note $\lambda = 0$ gives us the previously derived L2 Loss

Thus, the expected out-of-sample error is:
$$
E_{out}(\hat{y}_\lambda^D) = \mathbb{E}_x[(y - x^T w_\lambda)^2]
$$
where $w_\lambda$ is fit with the regularization penalty $\lambda w_\lambda^Tw_\lambda$.

Note, we are assuming the same ground truth model as L2 Loss: $y = x^Tw^* + \epsilon = f(x) + \epsilon$

### Calculating the Bias-Variance

Using:

* $\hat{y}_\lambda^D(x) = x^T w_\lambda$

we plug into the decomposition:

$$
\mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right]
=
\mathbb{E}_x \left[
\mathbb{E}_D \left[(\hat{y}_\lambda^D(x) - \mathbb{E}_D[\hat{y}_\lambda^D(x)])^2\right]
+
(\mathbb{E}_D[\hat{y}_\lambda^D(x)] - f(x))^2
+
\sigma^2
\right]
$$
Variance: 
- $\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - \mathbb{E}_{D}[x^T w_\lambda])^2] \right]$

Bias Squared:
- $\mathbb{E}_x[(\mathbb{E}_{D}[x^T w_\lambda] - (x^Tw^*))^2]$

Noise:
- $\sigma^2$

### Find $\mathbb{E}_D[x^Tw_\lambda]$

To make our calculations easier, let's find $\mathbb{E}_D[x^Tw_\lambda]$ i.e. the expected output from our L2 Ridge Regression model. We'll first find $\mathbb{E}_D[w_\lambda]$ then $\mathbb{E}_D[x^Tw_\lambda]$

Recall from [L2 Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression) we found the optimal $w_\lambda$

$$ L(w) = ||y -Xw||^2  + \lambda ||w||^2 \rightarrow w_\lambda = (X^TX + \lambda I)^{-1}X^Ty$$
where $X \in \mathbb{R}^{n \times d}$.

Given that $y = Xw^* + \epsilon$, $\epsilon \sim N(0, \sigma^2I)$, we have 
$$ w_\lambda = (X^TX + \lambda I)^{-1}X^T(Xw^* + \epsilon) $$

Take the expectation of $w_\lambda$ with respect to $D$:
$$ \mathbb{E}_D[w_\lambda] = \mathbb{E}_D[(X^TX + \lambda I)^{-1}X^T(Xw^* + \epsilon)]$$
$$ = \mathbb{E}_D[(X^TX + \lambda I)^{-1}X^TXw^*  + (X^TX + \lambda I)^{-1}X^T \epsilon]$$

Since $\epsilon$ is independent of $X$ and $\mathbb{E}_D[\epsilon] = 0$, the second term vanishes, thus
$$ = \mathbb{E}_D[(X^TX + \lambda I)^{-1}X^TXw^*]$$

Assume that $X$ is composed of linearly independent columns, so that $X^TX$ is positive definite and invertible. Since $X^TX$ is also symmetric, it has an **eigendecomposition** with orthogonal eigenvector matrix $Q$ and diagonal eigenvalue matrix $\Lambda$:
$$ X^TX = Q \Lambda Q^T$$
Substituting this into the expectation gives
$$ \mathbb{E}_D[w_\lambda] = \mathbb{E}_D[(Q\Lambda Q^T + \lambda I)^{-1} Q\Lambda Q^Tw^*]$$

Recognize that $Q\Lambda Q^T + \lambda I = Q(\Lambda + \lambda I) Q^T$, since $QQ^T = I$:
$$ = \mathbb{E}_D[(Q(\Lambda + \lambda I)Q^T)^{-1} Q\Lambda Q^Tw^*]
= \mathbb{E}_D[Q(\Lambda + \lambda I)^{-1}Q^TQ \Lambda Q^T w^*]
$$
$$ = \mathbb{E}_D[Q(\Lambda + \lambda I)^{-1}\Lambda Q^T w^*]$$

Note that this is just $w^*$ with its eigencomponents scaled by the diagonal matrix (there are $d$ eigenvalues, since $X^TX \in \mathbb{R}^{d \times d}$):
$$ (\Lambda + \lambda I)^{-1}\Lambda = 
\begin{bmatrix}
\frac{\lambda_1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\
0 & \frac{\lambda_2}{\lambda_2 + \lambda} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \lambda}
\end{bmatrix}
$$

So as $\lambda$ increases, the eigencomponents with the smallest eigenvalues are shrunk toward zero the fastest, and in the limit all components go to zero. Please see [Decomposition of L2 Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression) for a more in-depth discussion.
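
To make the scaling concrete, here is a tiny sketch that evaluates the factors $\frac{\lambda_i}{\lambda_i + \lambda}$ for a few regularization strengths. The eigenvalues and the helper name `shrinkage_factors` are made up for illustration.

```python
import numpy as np

def shrinkage_factors(eigvals, lam):
    """Return lambda_i / (lambda_i + lam) for each eigenvalue of X^T X."""
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals / (eigvals + lam)

eigvals = [100.0, 1.0, 0.01]   # hypothetical eigenvalues, largest to smallest
for lam in [0.0, 1.0, 100.0]:
    # One row per regularization strength, one column per eigencomponent.
    print(lam, np.round(shrinkage_factors(eigvals, lam), 4))
```

With these numbers, at $\lambda = 1$ the component with eigenvalue $0.01$ is already scaled to roughly 1% of its original size, while the component with eigenvalue $100$ is almost untouched.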

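We can restate $\mathbb{E}_D[Q(\Lambda + \lambda I)^{-1}\Lambda Q^T w^*]$ to give us more intuition. Note that $Q$ and $\Lambda$ come from $X^TX$ and so depend on the dataset; to pull $S_\lambda$ out of the expectation we treat the inputs $X$ as fixed, so that $\mathbb{E}_D$ averages over the noise $\epsilon$ only:
$$ \mathbb{E}_D[w_\lambda] = \mathbb{E}_D[Q\ diag(\frac{\lambda_i}{\lambda_i + \lambda}) Q^T w^*] = \mathbb{E}_D[S_\lambda w^*] = S_\lambda w^*$$
$$ S_\lambda = Q\ diag(\frac{\lambda_i}{\lambda_i + \lambda}) Q^T $$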

Great, we have immediately following from this that
$$ \mathbb{E}_D[x^T w_\lambda] = x^TS_\lambda w^*$$
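
The identity $\mathbb{E}_D[w_\lambda] = S_\lambda w^*$ can be checked numerically. The sketch below assumes a fixed design matrix `X` (so the average is over noise draws only) and uses made-up values for `w_star`, `sigma` and `lam`. It also confirms that $S_\lambda$ built from the eigendecomposition equals $(X^TX + \lambda I)^{-1}X^TX$.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 40, 4
sigma, lam = 0.5, 5.0
w_star = np.array([2.0, -1.0, 0.5, 3.0])   # hypothetical true weights
X = rng.normal(size=(n, d))                 # design held fixed across datasets

# S_lambda from the eigendecomposition X^T X = Q diag(evals) Q^T.
evals, Q = np.linalg.eigh(X.T @ X)
S_lam = Q @ np.diag(evals / (evals + lam)) @ Q.T

# The same matrix written as (X^T X + lam I)^{-1} X^T X.
S_direct = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ X)

# Average the ridge solution over many noise draws (one column per dataset).
n_trials = 20000
Y = (X @ w_star)[:, None] + sigma * rng.normal(size=(n, n_trials))
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # shape (d, n_trials)
w_mean = W.mean(axis=1)

print("empirical E[w_lambda]:", np.round(w_mean, 3))
print("S_lambda w*          :", np.round(S_lam @ w_star, 3))
print("true w*              :", w_star)
```

The empirical mean should match `S_lam @ w_star` rather than `w_star` itself, which is the bias that ridge regression introduces.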

### Plugging into Bias Variance

Great, now let's plug this into the Bias Variance Equation, recall
$$
\mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right]
=
\mathbb{E}_x \left[
\mathbb{E}_D \left[(\hat{y}_\lambda^D(x) - \mathbb{E}_D[\hat{y}_\lambda^D(x)])^2\right]
+
(\mathbb{E}_D[\hat{y}_\lambda^D(x)] - f(x))^2
+
\sigma^2
\right]
$$
Variance: 
- $\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - \mathbb{E}_{D}[x^T w_\lambda])^2] \right] \rightarrow
\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - x^TS_\lambda w^*)^2] \right] 
$

Bias Squared:
- $\mathbb{E}_x[(\mathbb{E}_{D}[x^T w_\lambda] - (x^Tw^*))^2] \rightarrow 
\mathbb{E}_x[(x^T S_\lambda w^* - (x^Tw^*))^2]
$

Noise:
- $\sigma^2$

Putting it all together:
$$ 
\mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right]
=
\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - x^TS_\lambda w^*)^2] \right] +
\mathbb{E}_x[(x^T S_\lambda w^* - x^Tw^*)^2] + \sigma^2
$$

### Simplify
Let's simplify to the same form as L2 Loss. When we set $\lambda = 0$, the L2 Ridge Regression expression should reduce to the L2 Loss expression.

$$ 
\mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right]
=
\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - x^TS_\lambda w^*)^2] \right] +
\mathbb{E}_x[(x^T S_\lambda w^* - x^Tw^*)^2] + \sigma^2
$$

$$
= \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T w_\lambda - x^TS_\lambda w^*)^2] 
+ (x^T S_\lambda w^* - x^Tw^*)^2
\right] + \sigma^2
$$

$$
= \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - S_\lambda w^*))^2] 
+ (x^T (S_\lambda w^* - w^*))^2
\right] + \sigma^2
$$

We now check that this is equal to the L2 equivalent when $\lambda = 0$. Recall the two expressions:
$$\text{L2 Expected Loss: } \mathbb{E}_D \left[E_{out}(\hat{y}^D)\right]
= \mathbb{E}_x[\mathbb{E}_D[(x^T(w-w^*))^2]] + \sigma^2
$$

$$
\text{L2 Ridge Regression Expected Loss: }
\mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - S_\lambda w^*))^2] 
+ (x^T (S_\lambda w^* - w^*))^2
\right] + \sigma^2
$$

For $\lambda = 0$, the diagonal factor of $S_\lambda = Q\ diag(\frac{\lambda_i}{\lambda_i + \lambda}) Q^T$ becomes the identity:
$$ (\Lambda + \lambda I)^{-1}\Lambda = 
\begin{bmatrix}
\frac{\lambda_1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\
0 & \frac{\lambda_2}{\lambda_2 + \lambda} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \lambda}
\end{bmatrix} \rightarrow 
\begin{bmatrix}
\frac{\lambda_1}{\lambda_1} & 0 & \cdots & 0 \\
0 & \frac{\lambda_2}{\lambda_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d}
\end{bmatrix} = I_d
$$
so that $S_0 = Q I_d Q^T = I_d$.

Simplifying our expression to: 
$$
\text{L2 Ridge Regression Expected Loss: }
\mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = 
\mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - I_d w^*))^2] 
+ (x^T (I_d w^* - w^*))^2
\right] + \sigma^2
$$
$$ = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - w^*))^2] 
+ (x^T (0))^2
\right] + \sigma^2
$$
$$ = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - w^*))^2] 
\right] + \sigma^2
$$

This is exactly the L2 Loss result, since at $\lambda = 0$ the ridge solution $w_\lambda$ is the ordinary least squares solution $w$.
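
The $\lambda = 0$ case is easy to confirm numerically. In this sketch, `ridge_fit` is a hypothetical helper implementing the closed-form ridge solution, and the data are synthetic. It compares the ridge solution at $\lambda = 0$ against NumPy's least squares solver.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)   # synthetic data

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
w_ridge0 = ridge_fit(X, y, lam=0.0)             # ridge with no penalty
w_ridge10 = ridge_fit(X, y, lam=10.0)           # ridge with a real penalty

print("max |w_ridge(0)  - w_ols| =", np.max(np.abs(w_ridge0 - w_ols)))
print("max |w_ridge(10) - w_ols| =", np.max(np.abs(w_ridge10 - w_ols)))
```

The first difference should be at the level of floating point round-off, while the second is clearly nonzero.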

## Comparing L2 Loss & L2 Ridge Regression Bias Variance Decompositions

Comparing these side by side we can see that the regularization term $\lambda$ introduces Bias and lowers Variance:
- L2 Loss: 
$$
\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right]
=
\underbrace{\mathbb{E}_x\left[\mathbb{E}_D\left[(x^T(w - w^*))^2\right]\right]}_{\text{variance}}
+
\underbrace{\sigma^2}_{\text{noise}}
$$
- L2 Ridge Regression: 
$$
\mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right]
=
\underbrace{\mathbb{E}_x\left[ \mathbb{E}_{D}\left[(x^T (w_\lambda - S_\lambda w^*))^2\right] \right]}_{\text{variance}}
+
\underbrace{\mathbb{E}_x\left[(x^T (w^* - S_\lambda w^*))^2\right]}_{\text{bias}^2}
+
\underbrace{\sigma^2}_{\text{noise}}
$$

As we can see L2 Ridge Regression has an added bias term over L2 Loss, which goes to 0 as $\lambda \rightarrow 0$. As $\lambda$ increases this bias term will increase too.

However, the payoff comes from the variance: increasing $\lambda$ lowers it. This is a big win when our model class has a high variance, because the right value of $\lambda$ removes more variance than the squared bias it induces, lowering the expected error overall.

As $\lambda \rightarrow \infty$, Variance $\rightarrow 0$, since $w_\lambda \rightarrow 0$. To see this recall the ridge solution: 
$$ w_\lambda = (X^TX + \lambda I)^{-1} X^Ty $$
As $\lambda \rightarrow \infty$ we get that:
$$(X^TX + \lambda I)^{-1} = (Q \Lambda Q^T + \lambda I)^{-1} = (Q (\Lambda + \lambda I) Q^T)^{-1}$$
$$ = Q\ diag(\frac{1}{\lambda_i + \lambda}) Q^T$$
We can see that for every $i$, $\frac{1}{\lambda_i + \lambda} \rightarrow 0$ as $\lambda \rightarrow \infty$, thus $w_\lambda \rightarrow 0$.
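
A quick numerical illustration of this limit, on synthetic data with assumed sizes and noise level: the norm of the ridge solution shrinks steadily as $\lambda$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 5
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + 0.1 * rng.normal(size=n)   # synthetic data, w* = all ones

lams = [0.0, 1.0, 1e2, 1e4, 1e8]
norms = []
for lam in lams:
    # Ridge solution w_lambda = (X^T X + lam I)^{-1} X^T y.
    w_lam = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    norms.append(np.linalg.norm(w_lam))
    print(f"lambda = {lam:>9.0e}   ||w_lambda|| = {norms[-1]:.2e}")
```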

Note that $S_\lambda = Q\ diag(\frac{\lambda_i}{\lambda_i + \lambda}) Q^T \rightarrow 0$ as $\lambda \rightarrow \infty$ too, since its diagonal factor vanishes:
$$ (\Lambda + \lambda I)^{-1}\Lambda = 
\begin{bmatrix}
\frac{\lambda_1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\
0 & \frac{\lambda_2}{\lambda_2 + \lambda} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \lambda}
\end{bmatrix} \rightarrow 
\begin{bmatrix}
\frac{\lambda_1}{\lambda_1 + \infty} & 0 & \cdots & 0 \\
0 & \frac{\lambda_2}{\lambda_2 + \infty} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \infty}
\end{bmatrix} = 0_d
$$

Thus the entire Variance term goes to 0 since we get:
$$ (x^T (w_\lambda - S_\lambda w^*))^2 \rightarrow (x^T (0- 0 w^*))^2 = 0$$

As **Figure 1** shows, when $\lambda$ starts out at 0 we have a high variance fit with *no bias*. As $\lambda$ increases, the expected out-of-sample error decreases until an optimal value of $\lambda$, beyond which the added bias outweighs the reduction in variance.

<img src="/media/images/20260428_092559_04_24_2026_16_26_04_ridge_bias_variance.gif" alt="04 24 2026 16 26 04 ridge bias variance" style="max-width: 100%; height: auto;">
<center>
$\textbf{Figure 1}$ Bias Variance Trade off for RBF Ridge Model.

This model was chosen because it is $\textbf{highly overparameterized}$, giving it a $\textbf{high variance}$ and making it a great candidate for L2 Ridge Regression regularization.

When $\lambda$ is near 0 the variance is quite high, but it is reduced dramatically as $\lambda$ increases. As expected, the bias increases with $\lambda$.
</center>
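
The same qualitative trade-off can be sketched for a plain linear ridge model (not the RBF model of Figure 1). The code below relies on two added assumptions: the design `X` is fixed, and test inputs satisfy $\mathbb{E}_x[xx^T] = I$. Under those assumptions the squared bias is $\sum_i (\frac{\lambda}{\lambda_i + \lambda})^2 c_i^2$ with $c = Q^Tw^*$, and the variance is $\sigma^2 \sum_i \frac{\lambda_i}{(\lambda_i + \lambda)^2}$. All sizes and the weights `w_star` are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 20                 # few samples per parameter, so variance is high
sigma = 1.0
w_star = rng.normal(size=d)   # hypothetical true weights
X = rng.normal(size=(n, d))   # fixed design

evals, Q = np.linalg.eigh(X.T @ X)
c = Q.T @ w_star              # components of w* in the eigenbasis of X^T X

# Sweep lambda from 0 (plain least squares) up to heavy regularization.
lams = np.concatenate([[0.0], np.logspace(-3, 3, 25)])
bias_sq = np.array([np.sum((lam / (evals + lam)) ** 2 * c ** 2) for lam in lams])
variance = np.array([sigma ** 2 * np.sum(evals / (evals + lam) ** 2) for lam in lams])
total = bias_sq + variance + sigma ** 2

best = np.argmin(total)
print(f"lambda = 0: bias^2 = {bias_sq[0]:.3f}, variance = {variance[0]:.3f}, "
      f"expected error = {total[0]:.3f}")
print(f"best lambda = {lams[best]:.3g}: bias^2 = {bias_sq[best]:.3f}, "
      f"variance = {variance[best]:.3f}, expected error = {total[best]:.3f}")
```

In this setup the squared bias rises monotonically with $\lambda$ while the variance falls, and the minimum of their sum sits at some $\lambda > 0$.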