Bias Variance of L2 Ridge Regression

by dokuDoku

Generalization Statistics

## Prerequisites

I would recommend reading the following posts before this one:

* [Bias Variance Tradeoff](https://www.noteblogdoku.com/blog/bias-variance-tradeoff)
* [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss)
* [Linear Regression from Projections onto Subspaces](https://www.noteblogdoku.com/blog/linear-regression-solution-from-projections-onto-subspaces)
* [Decomposition of L2 Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression)
* [L2 Ridge Regression from Lagrange Multipliers](https://www.noteblogdoku.com/blog/l2-ridge-regression-from-lagrange-multipliers) (Optional)

The general strategy for this blog post is as follows:

1. Review the Bias-Variance Decomposition in general for squared loss models
2. Bias-Variance Decomposition for L2 Linear Regression with loss $(y - x^T w)^2$
3. Bias-Variance Decomposition for L2 Ridge Regression with loss $(y - x^T w)^2 + \lambda ||w||^2$

## Bias-Variance Decomposition

Recall from [Bias Variance Tradeoff](https://www.noteblogdoku.com/blog/bias-variance-tradeoff) that for a squared loss model:

$$ L(\theta) = (\hat{y}(x) - y)^2 $$

Where

* $L(\theta) \in \mathbb{R}$ is the scalar loss
* $\hat{y}(x) \in \mathbb{R}$ is the predicted output
* $y \in \mathbb{R}$ is the observed target

#### Data Generating Process

Assume the true data is generated as:

$$ y = f(x) + \epsilon $$

where:

* $f(x)$ is the true (deterministic) function
* $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is noise

#### Expected Out-of-Sample Error

For a model trained on dataset $D$, define:

$$ E_{out}(\hat{y}^D) = \mathbb{E}_x \left[(\hat{y}^D(x) - y)^2\right] $$

Here $\mathbb{E}_x$ is shorthand for the expectation over a fresh test pair $(x, y)$, including the noise $\epsilon$ in $y$.

We now take the expectation over all possible datasets $D \sim P^N$:

$$ \mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = \mathbb{E}_x \left[ \mathbb{E}_D \left[(\hat{y}^D(x) - y)^2\right] \right] $$

#### Bias-Variance Decomposition

The error decomposes
as (for a fixed $x$, with the expectation also averaging over the test noise $\epsilon$):

$$ \mathbb{E}_D \left[(\hat{y}^D(x) - y)^2\right] = \underbrace{ \mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right] }_{\text{variance}} + \underbrace{ (\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2 }_{\text{squared bias}} + \underbrace{ \sigma^2 }_{\text{noise}} $$

Putting it all together:

$$ \mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = \mathbb{E}_x \left[ \underbrace{ \mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right] }_{\text{variance}} + \underbrace{ (\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2 }_{\text{squared bias}} + \underbrace{ \sigma^2 }_{\text{irreducible noise}} \right] $$

#### Notation

- $\mathbb{E}_D$ denotes expectation over datasets $D \sim P^N$
- $\hat{y}^D(x)$ is the prediction of the model trained on dataset $D$
- $\sigma^2$ is the variance of the noise

## Bias-Variance Decomposition for L2 Loss

Recall that for linear regression we consider a **single input-output pair**:

$$ L(w) = (y - x^T w)^2 $$

* $L(w) \in \mathbb{R}$ scalar loss
* $x \in \mathbb{R}^d$ input data point
* $w \in \mathbb{R}^d$ weight vector
* $y \in \mathbb{R}$ scalar ground truth
* $\hat{y}(x) = x^T w \in \mathbb{R}$ model prediction

Thus, the expected out-of-sample error is:

$$ E_{out}(\hat{y}^D) = \mathbb{E}_x[(y - x^T w)^2] $$

### Ground Truth Model

We assume:

$$ y = x^T w^* + \epsilon = f(x) + \epsilon $$

where:

* $w^* \in \mathbb{R}^d$ is the true parameter vector
* $\epsilon \sim \mathcal{N}(0, \sigma^2)$

### Calculating the Bias-Variance

Using:

* $\hat{y}^D(x) = x^T w$

we plug into the decomposition:

$$ \mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = \mathbb{E}_x \left[ \mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right] + (\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2 + \sigma^2 \right] $$

Variance:
- $\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w - \mathbb{E}_{D}[x^T w])^2] \right]$

Bias Squared:
- $\mathbb{E}_x[(\mathbb{E}_{D}[x^T w] - (x^Tw^*))^2]$

Noise:
- $\sigma^2$

### Find $\mathbb{E}_D[x^Tw]$

To make our calculations easier, let's find
$\mathbb{E}_D[x^Tw]$, i.e. the expected output from our ordinary least squares model. We will first find $\mathbb{E}_D[w]$, then $\mathbb{E}_D[x^Tw]$.

Recall from [Linear Regression Solution](https://www.noteblogdoku.com/blog/linear-regression-solution-from-projections-onto-subspaces) that for ordinary least squares the optimal $w$ is

$$ L(w) = ||y - Xw||^2 \rightarrow w = (X^TX)^{-1}X^Ty $$

where $X \in \mathbb{R}^{n \times d}$ and each row of $X$ is a data point (stacking all the $x^T_i$'s vertically).

Given that $y = Xw^* + \epsilon$, where now $\epsilon \sim \mathcal{N}(0, \sigma^2I)$ (since now $y \in \mathbb{R}^n$ for $X \in \mathbb{R}^{n \times d}$), we have

$$w = (X^TX)^{-1}X^T(Xw^* + \epsilon)$$

Great, now let's take the expectation of $w$ and see what happens:

$$ \mathbb{E}_D[w] = \mathbb{E}_D[(X^TX)^{-1}X^T(Xw^* + \epsilon)]$$

$$ = \mathbb{E}_D[(X^TX)^{-1}X^TXw^*] + \mathbb{E}_D [(X^TX)^{-1}X^T\epsilon]$$

Note that since $\epsilon \sim \mathcal{N}(0, \sigma^2I)$ is independent of $X$, $\mathbb{E}_D[\epsilon] = 0$ and the second term vanishes, giving

$$ = \mathbb{E}_D[(X^TX)^{-1}X^TXw^*] + 0$$

We typically assume $X$ is composed of linearly independent column (feature) vectors, which means that $X^TX$ is positive definite (and therefore invertible).
We can then simplify:

$$(X^TX)^{-1}X^TX = I $$

Simplifying our statement to

$$ = \mathbb{E}_D[(X^TX)^{-1}X^TXw^*] = \mathbb{E}_D[Iw^*] = \mathbb{E}_D[w^*] = w^*$$

For our result:

$$ \mathbb{E}_D[w] = w^*$$

For an individual test point $x$ (drawn independently of $D$), we have that

$$ \mathbb{E}_D[x^Tw] = x^T \mathbb{E}_D[w] = x^Tw^*$$

### Plugging into Bias Variance

Great, now let's plug this into the Bias Variance equation. Recall from earlier:

$$ \mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = \mathbb{E}_x \left[ \mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right] + (\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2 + \sigma^2 \right] $$

Variance:
- $\mathbb{E}_x[\mathbb{E}_D \left[(\hat{y}^D(x) - \mathbb{E}_D[\hat{y}^D(x)])^2\right]] \rightarrow \mathbb{E}_x[\mathbb{E}_D[(x^Tw - \mathbb{E}_D[x^Tw])^2]] = \mathbb{E}_x[\mathbb{E}_D[(x^Tw - x^Tw^*)^2]] $

Bias Squared:
- $\mathbb{E}_x[(\mathbb{E}_D[\hat{y}^D(x)] - f(x))^2] \rightarrow \mathbb{E}_x[(x^Tw^* - x^Tw^*)^2] = 0 $
    - **Unbiased Estimator**

Noise:
- $\sigma^2$

Putting it all together we get

$$\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = \mathbb{E}_x[\mathbb{E}_D[(x^Tw - x^Tw^*)^2]] + 0 + \sigma^2 $$

$$ = \mathbb{E}_x[\mathbb{E}_D[(x^Tw - x^Tw^*)^2]] + \sigma^2 = \mathbb{E}_x[\mathbb{E}_D[(w^Txx^Tw - 2w^Txx^Tw^* + w^{*T}xx^Tw^*)]] + \sigma^2 $$

Recognize the quadratic form

$$ (w^Txx^Tw - 2w^Txx^Tw^* + w^{*T}xx^Tw^*) $$

$$ = (w-w^*)^T(xx^T)(w-w^*)$$

where $xx^T \in \mathbb{R}^{d \times d}$ is a rank-1 matrix.
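The unbiasedness result and the quadratic-form identity above can be checked numerically. The following is a minimal sketch, assuming a fixed design matrix `X` and hypothetical values for `n`, `d`, `sigma`, and `w_star` (none of these come from the post). It averages the OLS estimate over many noise draws and compares it with $w^*$, then evaluates the quadratic form for one estimate and one test point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5                 # hypothetical sizes and noise level
X = rng.normal(size=(n, d))              # fixed design matrix (full column rank)
w_star = np.array([1.0, -2.0, 0.5])      # hypothetical true weights

# Draw many training targets y = X w* + eps that share the same X
trials = 5000
noise = sigma * rng.normal(size=(trials, n))
Y = X @ w_star + noise                   # shape (trials, n)

# Solve the normal equations X^T X w = X^T y for every dataset at once
W_hat = np.linalg.solve(X.T @ X, X.T @ Y.T).T   # shape (trials, d)
w_mean = W_hat.mean(axis=0)              # Monte Carlo estimate of E_D[w]

# Quadratic-form identity for one estimate and one test point x
x = rng.normal(size=d)
diff = W_hat[0] - w_star
quad_form = diff @ np.outer(x, x) @ diff     # (w - w*)^T (x x^T) (w - w*)
squared_projection = (x @ diff) ** 2         # (x^T (w - w*))^2

print("mean OLS estimate:", np.round(w_mean, 3))
print("quadratic form vs squared projection:", quad_form, squared_projection)
```

With enough trials, `w_mean` should sit close to `w_star`, and the two scalar expressions agree up to floating point error.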
A more useful form is:

$$ (w-w^*)^T(xx^T)(w-w^*) = ((w-w^*)^Tx)(x^T(w-w^*))$$

Note that $(x^T(w-w^*)) \in \mathbb{R}$, so $(x^T(w-w^*))^T = ((w-w^*)^Tx) = (x^T(w-w^*))$, giving us

$$((w-w^*)^Tx)(x^T(w-w^*)) = (x^T(w-w^*))^2$$

Resulting in:

$$\mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = \mathbb{E}_x[\mathbb{E}_D[(x^T(w-w^*))^2]] + \sigma^2 $$

## Bias Variance Decomposition for L2 Ridge Regression

Recall that for L2 Ridge Regression we consider, for a single input-output pair:

$$ L(w_\lambda) = (y - x^Tw_\lambda)^2 + \lambda w_\lambda^Tw_\lambda $$

* $L(w_\lambda) \in \mathbb{R}$ scalar loss
* $x \in \mathbb{R}^d$ input data point
* $w_\lambda \in \mathbb{R}^d$ weight vector
* $y \in \mathbb{R}$ scalar ground truth
* $\hat{y}_\lambda(x) = x^T w_\lambda \in \mathbb{R}$ model prediction
    - Note that this is the same model as in L2 loss; we are only constraining the *values* of the weights
* $w_\lambda^Tw_\lambda \in \mathbb{R}$ regularization penalty on the size of the weights
* $\lambda \in \mathbb{R}$, $\lambda \geq 0$ regularization scaling factor
    - Note $\lambda = 0$ gives us the previously derived L2 Loss

Thus, the expected out-of-sample error is:

$$ E_{out}(\hat{y}_\lambda^D) = \mathbb{E}_x[(y - x^T w_\lambda)^2] \text{ with regularization penalty: } \lambda w_\lambda^Tw_\lambda $$

Note that we are assuming the same ground truth model as for L2 Loss: $y = x^Tw^* + \epsilon = f(x) + \epsilon$

### Calculating the Bias-Variance

Using:

* $\hat{y}_\lambda^D(x) = x^T w_\lambda$

we plug into the decomposition:

$$ \mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = \mathbb{E}_x \left[ \mathbb{E}_D \left[(\hat{y}_\lambda^D(x) - \mathbb{E}_D[\hat{y}_\lambda^D(x)])^2\right] + (\mathbb{E}_D[\hat{y}_\lambda^D(x)] - f(x))^2 + \sigma^2 \right] $$

Variance:
- $\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - \mathbb{E}_{D}[x^T w_\lambda])^2] \right]$

Bias Squared:
- $\mathbb{E}_x[(\mathbb{E}_{D}[x^T w_\lambda] - (x^Tw^*))^2]$

Noise:
- $\sigma^2$

### Find $\mathbb{E}_D[x^Tw_\lambda]$

To make our calculations easier, let's find $\mathbb{E}_D[x^Tw_\lambda]$, i.e.
the expected output from our L2 Ridge Regression model. We'll first find $\mathbb{E}_D[w_\lambda]$, then $\mathbb{E}_D[x^Tw_\lambda]$.

Recall from [L2 Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression) that the optimal $w_\lambda$ is

$$ L(w_\lambda) = ||y -Xw_\lambda||^2 + \lambda ||w_\lambda||^2 \rightarrow w_\lambda = (X^TX + \lambda I)^{-1}X^Ty$$

where $X \in \mathbb{R}^{n \times d}$.

Given that $y = Xw^* + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2I)$, we have

$$ w_\lambda = (X^TX + \lambda I)^{-1}X^T(Xw^* + \epsilon) $$

Take the expectation of $w_\lambda$ with respect to $D$:

$$ \mathbb{E}_D[w_\lambda] = \mathbb{E}_D[(X^TX + \lambda I)^{-1}X^T(Xw^* + \epsilon)]$$

$$ = \mathbb{E}_D[(X^TX + \lambda I)^{-1}X^TXw^* + (X^TX + \lambda I)^{-1}X^T \epsilon]$$

Note that $\mathbb{E}_D[\epsilon] = 0$ (with $\epsilon$ independent of $X$), thus

$$ = \mathbb{E}_D[(X^TX + \lambda I)^{-1}X^TXw^*]$$

Assume that $X$ is composed of linearly independent columns, so that $X^TX$ is positive definite, invertible, and has a valid **eigendecomposition** with orthogonal eigenvector matrix $Q$ and diagonal eigenvalue matrix $\Lambda$:

$$ X^TX = Q \Lambda Q^T$$

$$ = \mathbb{E}_D[(Q\Lambda Q^T + \lambda I)^{-1} Q\Lambda Q^Tw^*]$$

Recognize that $Q\Lambda Q^T + \lambda I = Q(\Lambda + \lambda I) Q^T$, so

$$ = \mathbb{E}_D[(Q(\Lambda + \lambda I)Q^T)^{-1} Q\Lambda Q^Tw^*] = \mathbb{E}_D[Q(\Lambda + \lambda I)^{-1}Q^TQ \Lambda Q^T w^*] $$

$$ = \mathbb{E}_D[Q(\Lambda + \lambda I)^{-1}\Lambda Q^T w^*]$$

Note that this is just $w^*$ with its eigencomponents scaled by:

$$ (\Lambda + \lambda I)^{-1}\Lambda = \begin{bmatrix} \frac{\lambda_1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\ 0 & \frac{\lambda_2}{\lambda_2 + \lambda} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \lambda} \end{bmatrix} $$

So as $\lambda$ increases, the eigencomponents with the smallest eigenvalues $\lambda_i$ are shrunk toward zero most rapidly, and eventually all components go to zero in the limit.
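To make the shrinkage factors concrete, here is a minimal sketch that builds the shrinkage matrix $Q(\Lambda + \lambda I)^{-1}\Lambda Q^T$ from an eigendecomposition and compares it with the direct expression $(X^TX + \lambda I)^{-1}X^TX$. The sizes, the value of `lam`, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 40, 4, 2.0               # hypothetical sizes and ridge strength
X = rng.normal(size=(n, d))
A = X.T @ X                          # symmetric positive definite for this X

# Eigendecomposition A = Q diag(eigvals) Q^T; eigh returns ascending eigenvalues
eigvals, Q = np.linalg.eigh(A)

# Per-direction shrinkage factors lambda_i / (lambda_i + lambda)
shrink = eigvals / (eigvals + lam)

# Shrinkage matrix built two ways
S_eig = Q @ np.diag(shrink) @ Q.T
S_direct = np.linalg.solve(A + lam * np.eye(d), A)

print("eigenvalues:      ", np.round(eigvals, 2))
print("shrinkage factors:", np.round(shrink, 3))
```

Every factor lies strictly between 0 and 1, and the directions with the smallest eigenvalues get the smallest factors, which is the behaviour described above.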
Please see [Decomposition of L2 Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression) for a more in-depth discussion.

We can restate $\mathbb{E}_D[Q(\Lambda + \lambda I)^{-1}\Lambda Q^T w^*]$ to give us more intuition. Here we treat the design matrix $X$ as fixed, so that $\mathbb{E}_D$ averages over the training noise $\epsilon$ only; otherwise $S_\lambda$ depends on $X$ and would have to stay inside the expectation.

$$ \mathbb{E}_D[w_\lambda] = \mathbb{E}_D\left[Q \operatorname{diag}\left(\frac{\lambda_i}{\lambda_i + \lambda}\right) Q^T w^*\right] = \mathbb{E}_D[S_\lambda w^*] = S_\lambda w^*$$

$$ S_\lambda = Q \operatorname{diag}\left(\frac{\lambda_i}{\lambda_i + \lambda}\right) Q^T $$

It immediately follows that

$$ \mathbb{E}_D[x^T w_\lambda] = x^TS_\lambda w^*$$

### Plugging into Bias Variance

Great, now let's plug this into the Bias Variance equation. Recall:

$$ \mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = \mathbb{E}_x \left[ \mathbb{E}_D \left[(\hat{y}_\lambda^D(x) - \mathbb{E}_D[\hat{y}_\lambda^D(x)])^2\right] + (\mathbb{E}_D[\hat{y}_\lambda^D(x)] - f(x))^2 + \sigma^2 \right] $$

Variance:
- $\mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - \mathbb{E}_{D}[x^T w_\lambda])^2] \right] \rightarrow \mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - x^TS_\lambda w^*)^2] \right] $

Bias Squared:
- $\mathbb{E}_x[(\mathbb{E}_{D}[x^T w_\lambda] - (x^Tw^*))^2] \rightarrow \mathbb{E}_x[(x^T S_\lambda w^* - (x^Tw^*))^2] $

Noise:
- $\sigma^2$

Putting it all together:

$$ \mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = \mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - x^TS_\lambda w^*)^2] \right] + \mathbb{E}_x[(x^T S_\lambda w^* - x^Tw^*)^2] + \sigma^2 $$

### Simplify

Let's simplify to the same form as L2 Loss. Setting $\lambda = 0$ in the L2 Ridge Regression result should then give a statement equivalent to the L2 Loss result.
$$ \mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = \mathbb{E}_x\left[\mathbb{E}_{D}[(x^T w_\lambda - x^TS_\lambda w^*)^2] \right] + \mathbb{E}_x[(x^T S_\lambda w^* - x^Tw^*)^2] + \sigma^2 $$

$$ = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T w_\lambda - x^TS_\lambda w^*)^2] + (x^T S_\lambda w^* - x^Tw^*)^2 \right] + \sigma^2 $$

$$ = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - S_\lambda w^*))^2] + (x^T (S_\lambda w^* - w^*))^2 \right] + \sigma^2 $$

As we will now see, this is equal to the L2 equivalent when $\lambda = 0$.

$$\text{L2 Expected Loss: } \mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = \mathbb{E}_x[\mathbb{E}_D[(x^T(w-w^*))^2]] + \sigma^2 $$

$$ \text{L2 Ridge Regression Expected Loss: } \mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - S_\lambda w^*))^2] + (x^T (S_\lambda w^* - w^*))^2 \right] + \sigma^2 $$

For $\lambda = 0$, we have that:

$$ S_\lambda = Q(\Lambda + \lambda I)^{-1}\Lambda Q^T = Q \begin{bmatrix} \frac{\lambda_1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\ 0 & \frac{\lambda_2}{\lambda_2 + \lambda} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \lambda} \end{bmatrix} Q^T \rightarrow Q \begin{bmatrix} \frac{\lambda_1}{\lambda_1} & 0 & \cdots & 0 \\ 0 & \frac{\lambda_2}{\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d} \end{bmatrix} Q^T = Q I_d Q^T = I_d $$

Simplifying our expression to:

$$ \text{L2 Ridge Regression Expected Loss: } \mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - I_d w^*))^2] + (x^T (I_d w^* - w^*))^2 \right] + \sigma^2 $$

$$ = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - w^*))^2] + (x^T (0))^2 \right] + \sigma^2 $$

$$ = \mathbb{E}_x\left[ \mathbb{E}_{D}[(x^T (w_\lambda - w^*))^2] \right] + \sigma^2 $$

Which is exactly the same as the L2 Loss, since $w_\lambda = w$ when $\lambda = 0$.
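The $\lambda = 0$ reduction can also be confirmed numerically. This is a small sketch under assumed data; `ridge_fit` and `shrinkage_matrix` are hypothetical helper names, not functions from the post or from any library. It checks that the ridge solution at $\lambda = 0$ matches the least squares solution and that the shrinkage matrix becomes the identity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3                                   # hypothetical sizes
X = rng.normal(size=(n, d))
w_star = np.array([0.5, 1.5, -1.0])            # hypothetical true weights
y = X @ w_star + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def shrinkage_matrix(X, lam):
    """S_lambda = (X^T X + lam I)^{-1} X^T X."""
    A = X.T @ X
    return np.linalg.solve(A + lam * np.eye(A.shape[0]), A)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares
w_ridge0 = ridge_fit(X, y, 0.0)                # ridge with lambda = 0
S0 = shrinkage_matrix(X, 0.0)                  # should be the identity

print("max |w_ridge0 - w_ols|:", np.max(np.abs(w_ridge0 - w_ols)))
print("max |S0 - I|:          ", np.max(np.abs(S0 - np.eye(d))))
```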
## Comparing L2 Loss & L2 Ridge Regression Bias Variance Decompositions

Comparing these side by side, we can see that the regularization term $\lambda$ introduces Bias and lowers Variance:

- L2 Loss:
$$ \mathbb{E}_D \left[E_{out}(\hat{y}^D)\right] = \underbrace{\mathbb{E}_x\left[\mathbb{E}_D\left[(x^T(w - w^*))^2\right]\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}} $$
- L2 Ridge Regression:
$$ \mathbb{E}_D \left[E_{out}(\hat{y}_\lambda^D)\right] = \underbrace{\mathbb{E}_x\left[ \mathbb{E}_{D}\left[(x^T (w_\lambda - S_\lambda w^*))^2\right] \right]}_{\text{variance}} + \underbrace{\mathbb{E}_x\left[(x^T (w^* - S_\lambda w^*))^2\right]}_{\text{bias}^2} + \underbrace{\sigma^2}_{\text{noise}} $$

L2 Ridge Regression has an added bias term over L2 Loss, which goes to 0 as $\lambda \rightarrow 0$. As $\lambda$ increases, this bias term increases too. The payoff comes from choosing $\lambda$ so that the reduction in variance exceeds the squared bias we introduce, which matters most when the model class has high variance.

As $\lambda \rightarrow \infty$, Variance $\rightarrow 0$, since $w_\lambda \rightarrow 0$. To see this, recall the ridge solution:

$$ w_\lambda = (X^TX + \lambda I)^{-1} X^Ty $$

For the inverse we have:

$$(X^TX + \lambda I)^{-1} = (Q \Lambda Q^T + \lambda I)^{-1} = (Q (\Lambda + \lambda I) Q^T)^{-1}$$

$$ = Q \operatorname{diag}\left(\frac{1}{\lambda_i + \lambda}\right) Q^T$$

For every $i$, $\frac{1}{\lambda_i + \lambda} \rightarrow 0$ as $\lambda \rightarrow \infty$, thus $w_\lambda \rightarrow 0$.
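The limit $w_\lambda \rightarrow 0$ is easy to observe directly. The sketch below, with assumed data and an assumed grid of $\lambda$ values, prints the norm of the ridge solution as $\lambda$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 3                                        # hypothetical sizes
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

lams = [0.0, 1.0, 1e2, 1e4, 1e6]                    # assumed grid of ridge strengths
norms = []
for lam in lams:
    # Closed-form ridge solution for this lambda
    w_lam = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    norms.append(np.linalg.norm(w_lam))
    print(f"lambda={lam:g}  ||w_lambda||={norms[-1]:.6f}")
```

The norm shrinks at every step of the grid and is close to zero for the largest $\lambda$.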
Note that $S_\lambda \rightarrow 0$ as $\lambda \rightarrow \infty$ too:

$$ S_\lambda = Q(\Lambda + \lambda I)^{-1}\Lambda Q^T = Q \begin{bmatrix} \frac{\lambda_1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\ 0 & \frac{\lambda_2}{\lambda_2 + \lambda} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \lambda} \end{bmatrix} Q^T \xrightarrow{\lambda \rightarrow \infty} Q \, 0_d \, Q^T = 0_d $$

since each diagonal entry $\frac{\lambda_i}{\lambda_i + \lambda} \rightarrow 0$. Thus the entire Variance term goes to 0, since we get:

$$ (x^T (w_\lambda - S_\lambda w^*))^2 \rightarrow (x^T (0- 0 w^*))^2 = 0$$

As **Figure 1** shows, at $\lambda = 0$ the model has high variance and *no bias*. As $\lambda$ grows, the expected out-of-sample error decreases until it reaches an optimal value of $\lambda$, beyond which the growing bias outweighs the reduction in variance.

<img src="/media/images/20260428_092559_04_24_2026_16_26_04_ridge_bias_variance.gif" alt="04 24 2026 16 26 04 ridge bias variance" style="max-width: 100%; height: auto;">

<center> $\textbf{Figure 1}$ Bias Variance Tradeoff for RBF Ridge Model. This model was chosen because the system is $\textbf{highly overparameterized}$, giving a $\textbf{high variance}$ and making it a great candidate for L2 Ridge Regression regularization. When $\lambda$ starts near 0 the variance is quite high, but it is reduced dramatically as $\lambda$ increases. Bias increases with $\lambda$, as expected.
</center>
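The tradeoff can be sketched numerically without rerunning the experiment behind Figure 1. The example below does not use the RBF model from the figure. It assumes a plain linear model with a fixed design matrix, test points $x \sim \mathcal{N}(0, I)$, and hypothetical values for `n`, `d`, `sigma`, and `w_star`. Under those assumptions $\mathbb{E}_x[(x^Tv)^2] = ||v||^2$ and $\mathbb{E}_x[x^TCx] = \operatorname{tr}(C)$, so both terms have closed forms.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 25, 10, 1.0            # hypothetical sizes and noise level
X = rng.normal(size=(n, d))          # fixed design matrix
w_star = rng.normal(size=d)          # hypothetical true weights
A = X.T @ X

def bias2_and_variance(lam):
    """Closed-form squared bias and variance, averaged over x ~ N(0, I)."""
    M = np.linalg.solve(A + lam * np.eye(d), X.T)   # w_lambda = M @ y
    S = M @ X                                       # shrinkage matrix S_lambda
    bias2 = np.sum(((S - np.eye(d)) @ w_star) ** 2) # ||(S_lambda - I) w*||^2
    variance = sigma**2 * np.trace(M @ M.T)         # tr(Cov(w_lambda))
    return bias2, variance

lams = np.logspace(-3, 3, 13)                       # assumed grid of lambda values
results = np.array([bias2_and_variance(lam) for lam in lams])
bias2, variance = results[:, 0], results[:, 1]
expected_error = bias2 + variance + sigma**2
best_lam = lams[np.argmin(expected_error)]

for lam, b, v, e in zip(lams, bias2, variance, expected_error):
    print(f"lambda={lam:9.3f}  bias^2={b:8.4f}  variance={v:8.4f}  error={e:8.4f}")
print("lambda with lowest expected error on this grid:", best_lam)
```

In this setup the squared bias rises and the variance falls as $\lambda$ moves up the grid, which is the same qualitative picture as in the figure.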