# L1 Lasso vs L2 Ridge Regression

by dokuDoku | Generalization, Statistics

## Prerequisites:

Please read the following blog posts on the properties of L1 Lasso and L2 Ridge before reading this one:

- [L1 Lasso Regularization & Sparsity](https://www.noteblogdoku.com/blog/l1-lasso-regularization-sparsity)
- [Decomposition of L2 Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression)
- [L2 Ridge Regression from Lagrange Multipliers](https://www.noteblogdoku.com/blog/l2-ridge-regression-from-lagrange-multipliers)
- [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss)

## L1 Lasso vs L2 Ridge Regression Regularization

$$ \text{L1 Lasso: } L(w) = ||y - Xw||_2^2 + \lambda ||w||_1$$

$$ \text{L2 Ridge: } L(w) = ||y - Xw||_2^2 + \lambda ||w||^2_2$$

The two objectives differ only in the penalty term. Viewed as constraints, L2 Ridge restricts the weights to a Euclidean ball, while L1 Lasso restricts them to the L1 diamond "ball", as seen in the **Title Image GIF**. As we increase the regularization parameter $\lambda$, the weights become more constrained, because the radius of each constraint region (the L1 diamond and the L2 circle) shrinks.

**Figure 1** plots the weight paths for both regularizers from the **Title Image**. L1 Lasso induces sparse weights, eventually leaving only $w_2 \neq 0$, whereas L2 Ridge does not generally induce sparsity. The main difference is that Lasso can drive coefficients exactly to zero through coordinate-wise soft-thresholding, while Ridge lacks this thresholding behavior and typically shrinks coefficients smoothly toward zero.
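To make the thresholding-versus-shrinking contrast concrete, here is a minimal sketch for the special case of an orthonormal design ($X^TX = I$), which is an assumption made only for this illustration. In that case both objectives above decouple per coordinate around the OLS weights $z = X^Ty$: minimizing $(w - z)^2 + \lambda |w|$ gives soft-thresholding at $\lambda / 2$, and minimizing $(w - z)^2 + \lambda w^2$ gives $z / (1 + \lambda)$. The function names and the example weights are hypothetical.

```python
import numpy as np

def lasso_orthonormal(z, lam):
    """Per-coordinate minimizer of (w - z)^2 + lam * |w|.

    Assumes an orthonormal design (X^T X = I), so z is the OLS solution.
    The result is soft-thresholding with threshold lam / 2.
    """
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def ridge_orthonormal(z, lam):
    """Per-coordinate minimizer of (w - z)^2 + lam * w^2 (same assumption)."""
    return z / (1.0 + lam)

# Hypothetical OLS weights, chosen only for illustration.
z = np.array([3.0, -1.0, 0.4])

for lam in (0.0, 1.0, 4.0):
    print(f"lambda={lam}: lasso={lasso_orthonormal(z, lam)}, "
          f"ridge={ridge_orthonormal(z, lam)}")
```

Under these assumptions the Lasso weights hit exactly zero once $\lambda / 2$ exceeds their OLS magnitude, while the Ridge weights are only rescaled and stay nonzero for any finite $\lambda$.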
<img src="/media/images/20260501_010003_04_30_2026_17_59_34_lasso_ridge_weight_paths.png" alt="04 30 2026 17 59 34 lasso ridge weight paths" style="max-width: min(800px, 100%); height: auto;">

<center> $\textbf{Figure 1}$ Weight values for $w_1$, $w_2$ for L1 Lasso & L2 Ridge with respect to $\lambda$ </center>

### Benefits of L1 Lasso for High Dimensional Weights

One of the great benefits of L1 Lasso over L2 Ridge Regression is the ability to make a reasonable approximation using only a subset of the weights, as seen in **Figure 2**.

<img src="/media/images/20260501_073937_05_01_2026_00_36_10_lasso_ridge_dataset_overview.png" alt="05 01 2026 00 36 10 lasso ridge dataset overview" style="max-width: min(1400px, 100%); height: auto;">

<img src="/media/images/20260501_021407_04_30_2026_19_12_36_lasso_ridge_high_dim_paths.png" alt="04 30 2026 19 12 36 lasso ridge high dim paths" style="max-width: min(1400px, 100%); height: auto;">

<center> $\textbf{Figure 2}$: Effects of L1 Lasso vs L2 Ridge Regression on weight sparsity with respect to $\lambda$. The top segment of the figure shows the Linear Regression model used, with the weights from the Ordinary Least Squares fit, $w \in \mathbb{R}^{10}$. The bottom segment shows how L1 Lasso & L2 Ridge Regression each regularize the weights. In particular, we are interested in the number of $\textit{Active Parameters}$ and the $\textit{Retained Signal}$ with respect to $\lambda$. </center>

<img src="/media/images/20260501_173747_05_01_2026_10_35_39_lasso_dataset_boomerang.gif" alt="05 01 2026 10 35 39 lasso dataset boomerang" style="max-width: min(1400px, 100%); height: auto;">

<img src="/media/images/20260501_174035_05_01_2026_10_35_39_ridge_dataset_boomerang.gif" alt="05 01 2026 10 35 39 ridge dataset boomerang" style="max-width: 100%; height: auto;">

<center> $\textbf{Figure 3}$ Visualization showing how the L1 Lasso and L2 Ridge regularized regression estimates change with respect to $\lambda$. In this example the least significant weights in L1 Lasso go to zero first, whereas in L2 Ridge all weights shrink toward zero by factors of the form $\frac{\lambda_i}{\lambda_i + \lambda}$ and reach zero only in the limit; see [Decomposition Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression) </center>

This is really powerful, since we can get close to the ordinary least squares (unregularized) solution without having to use all of the parameters. This is key when we have too many parameters, or when we can only afford to measure a subset of them. As we can see in **Figure 2** & **Figure 3**, as we increase $\lambda$ L1 Lasso lets us consider fewer parameters, whereas L2 Ridge Regression leaves many almost-zero weights. Although many Ridge parameters are effectively 0, L1 Lasso sets them exactly to 0, giving an effective sparse approximation to the OLS solution.

#### What is Signal Retained?

As shown in **Figure 2** & **Figure 3**, L1 Lasso has a roughly linear decrease in signal retained, while L2 Ridge has a more curved shape. *Signal Retained* is meant to capture how much of the true predictive information in the features remains after regularization shrinks the model's coefficients, measured here as:
$$ \text{Signal Retained}(\lambda) = \frac{||\mathbf{w}(\lambda)||_2}{||\mathbf{w}^{\text{OLS}}||_2} \times 100 $$

I.e. Signal Retained is just the L2 norm of the regularized weight vector relative to that of the unregularized (OLS) weight vector, expressed as a percentage. This is not a perfect comparison by any means; however, since Lasso and Ridge Regression both pull the weights towards the origin, it is a good indicator of how much $w$ has been shrunk with respect to $\lambda$. As we can see, for small $\lambda$ Lasso retains more signal in our example, but as $\lambda$ grows, Ridge retains more signal, since the Ridge weights only approach 0 in the limit rather than decreasing to 0 linearly, as shown in **Figure 2**.
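As a rough illustration of the two quantities tracked above, Active Parameters and Signal Retained, the sketch below fits both objectives on a small synthetic dataset. It is not the code behind the figures. The data, the random seed, the `tol` cutoff for counting a weight as active, and all function names are assumptions made for this example. Ridge uses its closed form, and Lasso uses a plain coordinate descent for $||y - Xw||_2^2 + \lambda ||w||_1$.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - Xw||^2 + lam * ||w||_2^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def lasso_fit(X, y, lam, n_iters=500):
    """Coordinate descent for ||y - Xw||^2 + lam * ||w||_1."""
    d = X.shape[1]
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)  # X_j^T X_j for each feature j
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            # Soft-threshold at lam / 2, then rescale by X_j^T X_j.
            w[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return w

def signal_retained(w, w_ols):
    """Percentage of the OLS weight norm kept after regularization."""
    return 100.0 * np.linalg.norm(w) / np.linalg.norm(w_ols)

def active_parameters(w, tol=1e-8):
    """Number of weights with magnitude above tol (tol is an assumption)."""
    return int(np.sum(np.abs(w) > tol))

# Hypothetical dataset with 10 weights, for illustration only.
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.array([5.0, -4.0, 3.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.05, 0.0])
y = X @ w_true + 0.5 * rng.normal(size=n)
w_ols = ridge_fit(X, y, 0.0)  # lam = 0 reduces to ordinary least squares

for lam in (0.0, 10.0, 100.0, 1000.0):
    w_l1 = lasso_fit(X, y, lam)
    w_l2 = ridge_fit(X, y, lam)
    print(f"lambda={lam:7.1f}  "
          f"lasso: active={active_parameters(w_l1):2d}, "
          f"signal={signal_retained(w_l1, w_ols):6.1f}%  "
          f"ridge: active={active_parameters(w_l2):2d}, "
          f"signal={signal_retained(w_l2, w_ols):6.1f}%")
```

The Ridge weights are counted as active whenever they exceed `tol`, so small but nonzero weights still count. That is the distinction the figures draw between "effectively 0" and "exactly 0".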