Decomposition of L2 Ridge Regression

by dokuDoku Generalization Statistics

## Decomposing L2 Ridge Loss

The L2 Ridge Regression Loss is 
$$ L(w) = ||y - Xw||^2 + \lambda ||w||^2 $$ 
- $w \in \mathbb{R}^d$ weight vector 
- $X \in \mathbb{R}^{n \times d}$ input matrix
    - Each row is a datapoint with $d$ features; there are $n$ datapoints
- $y \in \mathbb{R}^n$ output vector 
    - Each $y_i$ is the scalar ground-truth target for the datapoint $x_i$
- $\lambda \in \mathbb{R}$, $\lambda \geq 0$ regularization parameter controlling how strongly $||w||^2$ is penalized
- $L(w) \in \mathbb{R}$ loss: the squared error between our predictions $Xw$ and the ground truth $y$, plus the regularizing term

Note that this loss is written for the whole dataset: each row $x_i$ of $X$ yields a scalar prediction $x_i^Tw$, which is compared against the scalar ground truth $y_i$.
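
As a minimal sketch of the definition above, the loss can be evaluated directly with NumPy. The function name `ridge_loss` and the toy data are hypothetical and chosen only for illustration; here $w$ fits the data exactly, so only the penalty term contributes.

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Ridge loss ||y - Xw||^2 + lam * ||w||^2."""
    residual = y - X @ w
    return residual @ residual + lam * (w @ w)

# Toy data (hypothetical): n = 4 datapoints, d = 2 features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 2.0])  # X @ w equals y, so the residual is zero

# Only the penalty remains: 0.5 * (1^2 + 2^2) = 2.5
loss_value = ridge_loss(w, X, y, lam=0.5)
print(loss_value)
```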

## What is the optimal value of $w$?

Let's take the gradient of $L(w)$ to find the optimal value of $w$, which we will then compare to the one from the unregularized L2 Loss:

$$ ||y - Xw||^2 \rightarrow w_{opt} = (X^TX)^{-1}X^Ty $$
As we derived in [Linear Regression Solution from Projections](https://www.noteblogdoku.com/blog/linear-regression-solution-from-projections-onto-subspaces).

Let's take the gradient
$$ L(w) = ||y - Xw||^2 + \lambda ||w||^2 = (y-Xw)^T(y-Xw) + \lambda w^Tw $$
$$ L(w) = y^Ty -2y^TXw + w^TX^TXw + \lambda w^Tw$$
$$ dL = 0 - 2y^TXdw + 2w^TX^TXdw + 2\lambda w^Tdw$$
Note that $dL = \langle \nabla_w L, dw \rangle = (\nabla_w L)^T dw$, so reading off the row vector that multiplies $dw$ gives
$$ \nabla_w L(w)^T = -2y^TX + 2w^TX^TX + 2\lambda w^T$$

The L2 Ridge Loss is convex: the L2 loss is convex, $\lambda ||w||^2$ is convex for $\lambda \geq 0$, and a sum of convex functions is convex. Any point where the gradient vanishes is therefore a global minimum, so we set the gradient to 0 to get the minimizing $w$
$$\nabla_w L(w)^T = 0 = -2y^TX + 2w^TX^TX + 2\lambda w^T$$
$$ 0 = -y^TX + w^TX^TX + \lambda w^T$$
Take the transpose of both sides
$$ 0 = -X^Ty + X^TXw + \lambda Iw$$
$$ X^Ty = (X^TX + \lambda I) w \rightarrow (X^TX + \lambda I)^{-1}X^Ty = w$$

Denoting the optimal $w$ for L2 Ridge Regression as $w_{ridge}$, we have 
$$ w_{ridge} = (X^TX + \lambda I)^{-1}X^Ty$$
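
A short numerical check of this result, as a sketch assuming NumPy and hypothetical random data: solve the linear system for $w_{ridge}$ and confirm that the gradient of the ridge loss vanishes there. The helper names `ridge_solution` and `ridge_gradient` are illustrative only.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y without forming an explicit inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_gradient(w, X, y, lam):
    """Gradient of ||y - Xw||^2 + lam ||w||^2, written as a column vector."""
    return -2 * X.T @ y + 2 * X.T @ X @ w + 2 * lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # hypothetical data: n = 50, d = 3
y = rng.normal(size=50)
lam = 0.7

w_ridge = ridge_solution(X, y, lam)
grad_norm = np.linalg.norm(ridge_gradient(w_ridge, X, y, lam))
print(grad_norm)  # numerically zero, up to floating point error
```

Using `np.linalg.solve` rather than an explicit inverse is a common choice for numerical stability; the math is the same.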

Comparing this to $w_{opt}$ from the regular L2 Loss, the two are equal when $\lambda = 0$. Putting them side by side makes this clear:
$$ w_{ridge} = (X^TX + \lambda I)^{-1}X^Ty \quad w_{opt} = (X^TX)^{-1}X^Ty$$

Note that we can typically assume that the columns of $X$ are linearly independent. Each of the $d$ columns is a feature, and since we choose the features we can avoid redundant ones. Under this assumption $X^TX$ is symmetric positive definite, so it is invertible and all of its eigenvalues are strictly positive.
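
The assumption and the $\lambda = 0$ case can be checked numerically. This is a sketch with hypothetical random Gaussian data, whose columns are linearly independent with probability 1; it uses `np.linalg.lstsq` as an independent reference for $w_{opt}$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))  # hypothetical data: n = 30, d = 4
y = rng.normal(size=30)
d = X.shape[1]

# Linearly independent columns <=> rank d <=> X^T X positive definite
has_full_column_rank = np.linalg.matrix_rank(X) == d
gram_eigenvalues = np.linalg.eigvalsh(X.T @ X)  # real, returned in ascending order

# Ordinary least squares solution vs. the ridge formula with lambda = 0
w_opt = np.linalg.lstsq(X, y, rcond=None)[0]
w_ridge_zero = np.linalg.solve(X.T @ X + 0.0 * np.eye(d), X.T @ y)

print(has_full_column_rank, gram_eigenvalues.min() > 0)
print(np.allclose(w_opt, w_ridge_zero))
```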

From this we can do the eigenvalue decomposition of $X^TX = Q\Lambda Q^T$ where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_1, \dots, \lambda_d$ and $Q$ is an orthogonal matrix whose columns are the corresponding eigenvectors.

We can see that 
$$ w_{ridge} = (Q\Lambda Q^T + \lambda I)^{-1}X^Ty \quad w_{opt} = (Q\Lambda Q^T)^{-1}X^Ty$$
with 
$$ w_{ridge} = (Q\Lambda Q^T + \lambda I)^{-1}X^Ty = (Q(\Lambda + \lambda I)Q^T)^{-1}X^Ty $$
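$$ w_{ridge} = Q(\Lambda + \lambda I)^{-1} Q^T X^Ty$$
Here we used $\lambda I = Q(\lambda I)Q^T$ and $Q^{-1} = Q^T$, since $Q$ is orthogonal. The middle factor is diagonal: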

$$ (\Lambda + \lambda I)^{-1}
=
\begin{bmatrix}
\frac{1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\
0 & \frac{1}{\lambda_2 + \lambda} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\lambda_d + \lambda}
\end{bmatrix}$$

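Thus $X^Ty$ is expressed in the eigenvector basis of $X^TX$, and its component along the $i$-th eigenvector is scaled by $\frac{1}{\lambda_i + \lambda}$ instead of $\frac{1}{\lambda_i}$. The larger $\lambda$ is, the smaller every one of these factors becomes, and so the smaller $w_{ridge}$ is along every eigenvector direction.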
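
The eigendecomposition route can be compared against a direct solve. This is a minimal sketch assuming NumPy and hypothetical random data; `np.linalg.eigh` is used because $X^TX$ is symmetric, and it returns the eigenvectors as the columns of `Q`.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))  # hypothetical data: n = 40, d = 3
y = rng.normal(size=40)
lam = 2.0

# X^T X = Q diag(eigvals) Q^T, with orthonormal eigenvectors as columns of Q
eigvals, Q = np.linalg.eigh(X.T @ X)

# w_ridge = Q (Lambda + lam I)^{-1} Q^T X^T y:
# rotate X^T y into the eigenbasis, divide each component by (lambda_i + lam), rotate back
w_eig = Q @ ((Q.T @ (X.T @ y)) / (eigvals + lam))

# Reference: solve (X^T X + lam I) w = X^T y directly
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(w_eig, w_direct))
```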

## Taylor Approximation

For a much more **obvious** picture of how $\lambda$ affects the optimal weight $w_{ridge}$, we are going to use a 2nd Degree Taylor Approximation of the original L2 Loss around its minimizer $w_{opt}$, where $\nabla_w L2 = 0$

Let's start with 
$$ L2(w) = ||y - Xw||^2 $$
Taking the Taylor Series approximation around $w_{opt}$, the minimizer of the L2 Loss, the approximation $\hat{L2}(w)$ is
$$ \hat{L2}(w) = L2(w_{opt}) + \nabla_w L2(w_{opt})^T(w - w_{opt}) + \frac{1}{2}(w - w_{opt})^T H (w - w_{opt})$$
where $H = 2X^TX$ is the Hessian of $L2$, which is positive definite when the columns of $X$ are linearly independent. Since $L2$ is quadratic in $w$, this expansion is in fact exact.

Since $w_{opt}$ minimizes $L2$, we have $\nabla_w L2(w_{opt}) = 0$, simplifying this to
$$ \hat{L2}(w) = L2(w_{opt}) + \frac{1}{2}(w - w_{opt})^T H (w - w_{opt})$$

At a minimum the Hessian must be at least positive semi-definite; here it is positive definite, as noted above. Let's take the gradient of this approximation, using the fact that $H$ is symmetric
$$ \nabla_w \left( L2(w_{opt}) + \frac{1}{2}(w - w_{opt})^T H (w - w_{opt}) \right)$$
$$ \nabla_w \hat{L2} = H (w - w_{opt}) $$
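
Because the L2 loss is quadratic in $w$, the second-order expansion around $w_{opt}$ should reproduce it exactly, not just approximately. The sketch below checks this on hypothetical random data, taking the Hessian to be $H = 2X^TX$ (obtained by differentiating $||y - Xw||^2$ twice).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 3))  # hypothetical data: n = 25, d = 3
y = rng.normal(size=25)

def l2_loss(w):
    """Unregularized loss ||y - Xw||^2."""
    r = y - X @ w
    return r @ r

w_opt = np.linalg.lstsq(X, y, rcond=None)[0]
H = 2 * X.T @ X  # Hessian of the L2 loss

w = rng.normal(size=3)  # an arbitrary test point
delta = w - w_opt

# Second-order expansion around w_opt (the gradient term vanishes there)
taylor_value = l2_loss(w_opt) + 0.5 * delta @ H @ delta
taylor_grad = H @ delta

# Exact gradient of ||y - Xw||^2 for comparison
exact_grad = 2 * X.T @ (X @ w - y)

print(np.isclose(taylor_value, l2_loss(w)), np.allclose(taylor_grad, exact_grad))
```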

To study the effect of weight decay, we are going to add the weight decay gradient, since Ridge Regression is:
$$ L(w) = ||y - Xw||^2 + \lambda ||w||^2 = \text{L2 Loss} + \lambda \text{ Weight Decay}$$
$$ \nabla_w L(w) = \nabla_w \text{L2 Loss} + \lambda \nabla_w \text{ Weight Decay}$$
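For ease of use we write the penalty as $\frac{\lambda}{2} w^Tw$ so that the factor of 2 cancels. This only rescales $\lambda$ by a constant and does not change the analysis.
$$ \nabla_w \frac{\lambda}{2} w^Tw = \lambda w$$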

Thus we can see that the gradient of the approximation for $L$, evaluated at $w_{ridge}$, is
$$ \nabla_w \hat{L}(w_{ridge}) = \nabla_w \hat{L2} + \lambda \nabla_w \text{ Weight Decay}$$
$$ \nabla_w \hat{L}(w_{ridge}) = H (w_{ridge} - w_{opt}) + \lambda I w_{ridge}$$

Setting to 0 to find the optimal $w_{ridge}$
$$ H (w_{ridge} - w_{opt}) + \lambda I w_{ridge} = 0$$
$$ H w_{ridge} - H w_{opt} + \lambda I w_{ridge} = 0$$
$$ (H + \lambda I) w_{ridge} = H w_{opt} $$
$$ w_{ridge} = (H + \lambda I)^{-1} H w_{opt}$$
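
This expression can be tied back to the closed form from the first section. With $H = 2X^TX$ and the $\frac{1}{2}$ convention on the penalty, $(H + \lambda I)^{-1}Hw_{opt}$ should match the closed-form ridge solution with regularization strength $\lambda / 2$ in the original loss. The sketch below, on hypothetical random data, checks that correspondence.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 40, 3
X = rng.normal(size=(n, d))  # hypothetical random data
y = rng.normal(size=n)
lam = 1.5

w_opt = np.linalg.solve(X.T @ X, X.T @ y)
H = 2 * X.T @ X  # Hessian of ||y - Xw||^2

# Taylor-based expression: (H + lam I)^{-1} H w_opt
w_taylor = np.linalg.solve(H + lam * np.eye(d), H @ w_opt)

# Closed form minimizer of ||y - Xw||^2 + (lam / 2) ||w||^2
w_closed = np.linalg.solve(X.T @ X + (lam / 2) * np.eye(d), X.T @ y)

print(np.allclose(w_taylor, w_closed))
```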

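Since $H$ is symmetric positive definite, we can do an eigenvalue decomposition $H = Q \Lambda Q^T$, with orthogonal $Q$ and strictly positive eigenvalues $\lambda_1, \dots, \lambda_d$ on the diagonal of $\Lambda$: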

$$ w_{ridge} = (Q \Lambda Q^T + \lambda I)^{-1} Q \Lambda Q^T  w_{opt}$$
$$ w_{ridge} = (Q (\Lambda + \lambda I) Q^T)^{-1} Q \Lambda Q^T w_{opt}$$
$$ w_{ridge} = Q (\Lambda + \lambda I)^{-1} Q^T Q \Lambda Q^T w_{opt}$$
$$ w_{ridge} = Q (\Lambda + \lambda I)^{-1} \Lambda Q^T w_{opt}$$

Great! Now we can see that $w_{ridge}$ is $w_{opt}$ with each of its components along the eigenvectors of $H$ rescaled by a factor that depends on $\lambda$. Upon further analysis we see 
$$(\Lambda + \lambda I)^{-1} \Lambda = \begin{bmatrix}
\frac{\lambda_1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\
0 & \frac{\lambda_2}{\lambda_2 + \lambda} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \lambda}
\end{bmatrix}
$$

In general, the component of $w_{opt}$ along the $i$-th eigenvector of $H$ is scaled by the factor:
$$ \frac{\lambda_i}{\lambda_i + \lambda}$$

Writing this out in full:
$$ w_{ridge} = Q \begin{bmatrix}
\frac{\lambda_1}{\lambda_1 + \lambda} & 0 & \cdots & 0 \\
0 & \frac{\lambda_2}{\lambda_2 + \lambda} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_d}{\lambda_d + \lambda}
\end{bmatrix} Q^T w_{opt}
$$

As $\lambda \rightarrow 0$ we get $w_{ridge} \rightarrow w_{opt}$, as expected. 
As $\lambda \rightarrow \infty$ we get $w_{ridge} \rightarrow 0$, since all of the scaling factors approach 0. 
In general, as $\lambda$ increases each factor $\frac{\lambda_i}{\lambda_i + \lambda}$ decreases, with directions that have smaller $\lambda_i$ being disproportionately affected.
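
To make the scaling concrete, the factors $\frac{\lambda_i}{\lambda_i + \lambda}$ can be tabulated for a few values of $\lambda$. The eigenvalues below are hypothetical, chosen to span two orders of magnitude, and `shrinkage_factors` is an illustrative helper name.

```python
import numpy as np

def shrinkage_factors(eigvals, lam):
    """Per-eigendirection scaling lambda_i / (lambda_i + lam)."""
    return eigvals / (eigvals + lam)

eigvals = np.array([10.0, 1.0, 0.1])  # hypothetical eigenvalues of H

for lam in [0.0, 0.1, 1.0, 10.0, 1000.0]:
    print(lam, shrinkage_factors(eigvals, lam))

# At lam = 1 the factors are 10/11, 1/2 and 1/11:
# the direction with the smallest eigenvalue has already lost most of its weight.
factors_at_one = shrinkage_factors(eigvals, 1.0)
```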

<center>
$\textbf{Figure 1}$ How $w_{ridge}$ varies from $w_{OLS}$ (L2 Ordinary Least Squares solution) as $\lambda$ Varies

The scaling factors of larger eigenvalues $\lambda_i$ decrease more slowly as $\lambda$ increases. 
It can be seen that the components along the eigenvectors with the largest eigenvalues are maintained for longer. 
The fitted line $y = mx + b$ approaches $y = 0x + 0$ as $\lambda$ increases, since this forces $w_1, w_2 \rightarrow 0$

</center>

As we can see in **Figure 1**, as $\lambda$ increases the weights $w_1,w_2 \rightarrow 0$, as enforced by the weight penalty. The lower left diagram shows that the components along the most significant eigenvectors (largest $\lambda_i$) are maintained for the longest, while the components along the least significant eigenvectors approach 0 much more quickly, which can be seen in the upper left diagram.

Of note, the upper right and lower left diagrams are really the same diagram with different labels, where we see that the $w_{ridge}$ solution is the sum of its components along the eigenvectors.
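
A rough numerical stand-in for this kind of regularization path is sketched below. It is not the code behind Figure 1. The two-feature data is hypothetical and built with correlated columns so that the two eigenvalues of $X^TX$ differ strongly, and it uses the closed form from the first section, rewritten as a rescaling of $w_{opt}$ in the eigenbasis.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical 2-feature data with correlated columns, so the eigenvalues differ a lot.
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = 1.5 * x1 - 0.5 * x2 + 0.1 * rng.normal(size=100)

eigvals, Q = np.linalg.eigh(X.T @ X)  # ascending eigenvalues, eigenvectors in columns
w_opt = np.linalg.solve(X.T @ X, X.T @ y)
coeffs_opt = Q.T @ w_opt              # coordinates of w_opt in the eigenbasis

# w_ridge(lam) = Q diag(lambda_i / (lambda_i + lam)) Q^T w_opt for a grid of lam
lambdas = np.logspace(-3, 4, 8)
path = np.array([Q @ (eigvals / (eigvals + lam) * coeffs_opt) for lam in lambdas])
norms = np.linalg.norm(path, axis=1)

for lam, w, norm in zip(lambdas, path, norms):
    print(f"lambda={lam:g}  w_ridge={w}  norm={norm:.4f}")
```

Printing the path shows $||w_{ridge}||$ shrinking as $\lambda$ grows, with the component along the small-eigenvalue direction disappearing first.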