L1 Lasso Regularization & Sparsity

by dokuDoku Generalization Statistics

🔍

## What is L1 Lasso Regularization?

L1 Lasso Regression is defined by the loss function: 
$$ L(w) = ||y - Xw||_2^2 + \lambda ||w||_1 $$
- $L(w) \in \mathbb{R}$ which is the scalar loss 
- $y \in \mathbb{R}^{m}$ the ground truth output
    - $m$ outputs each of dimension 1 i.e. $\mathbb{R}$
- $w \in \mathbb{R}^n$ the weight parameters 
- $X \in \mathbb{R}^{m \times n}$ data entries
    - $m$ dataset entries, each of dimension $n$

Most importantly, $||w||_1$ is defined as the "L1 Norm" that's the sum of the absolute values:
$$ ||w||_1 = \sum_{i=1}^n |w_i|$$

<img src="/media/images/20260430_043851_Screenshot_2026-04-29_at_9.35.06_PM.png" alt="Screenshot 2026 04 29 at 9.35.06 PM" style="max-width: min(1200px, 95%); height: auto;">
<center>
$\textbf{Figure 1}$: L1 Norm with 1D and 2D Level Sets.

Additionally, 3D Octahedron showing the $\textit{L1 Ball}$ in 3D space

Note the corners of the L1 Balls in 1D, 2D, and 3D, where the derivative does not exist and we must use $\textit{subgradients}$.

Optimization tends to give these corners as solutions, making coefficients zero.
</center>

<img src="/media/images/20260430_051128_04_29_2026_22_10_58_lasso_diamond_classic.png" alt="04 29 2026 22 10 58 lasso diamond classic" style="max-width: min(600px, 90%); height: auto;">
<center>
$\textbf{Figure 2}$: L1 Norm "Circle" in 2D Space with $w_1$, $w_2$ axes with radius $\lambda$

Top-down view of Center Figure "2D" in $\textbf{Figure 1}$ showing the specific Level Set.

Red diamond in $\textbf{Figure 2}$ correlates to red diamond in $\textbf{Figure 1}$
</center>

In 1 Dimension this is simply the absolute value function $y = |x|$, as seen in **Figure 1**. In two dimensions $y = |x_1| + |x_2|$ becomes a diamond such that the diamond represents a "level set" of constant $y$, which can be seen by the red diamond in **Figure 1**'s center figure & **Figure 2**. This can also be thought as *the L1 Circle* since this is equivalent to a 2D circle in L2 with $y = ||x||_2 = \sqrt{x_1^2 + x_2^2}$.

In three dimensions we get a *diamond polytope* as can be seen in the **Title Image** on the left side and **Figure 1**'s right figure. Along this diamond polytope we have a constant value of $y$ such that $y = ||w||_1 = \sum_{i=1}^3 |x_i|$. This is also called the *L1 Ball* since in the L1 space, these are all points with distance 1 from the origin. Note that this is similar to the sphere in L2 (typically euclidean space).

This penalized objective is closely related to the constrained Lasso problem, where $w$ is restricted to lie inside an L1 ball $||w||_1 \leq t$. The parameter $\lambda$ controls how strongly large L1 norms are penalized. Larger $\lambda$ corresponds to a smaller effective constraint radius. As can be seen in the **Title Image GIF**, when $\lambda$ is large enough certain weights $w_i$ zero out, giving us a **sparse regularization** of our function, which is fantastic for reducing the number of active weights needed. See **Appendix** for more details.

Sparsity arises because the corners of the L1 ball align with coordinate axes. Optimization tends to hit these corners driving the coefficients to become exactly zero.

## Subgradients

To get the optimal $w$ of the stated Lagrangian, we need to find the gradient of $L(w)$ where
$$ L(w) = ||y - Xw||_2^2 + \lambda ||w||_1 $$
$$ L(w) = ||y - Xw||_2^2 + \lambda \sum_{i=1}^n |w_i| $$

However, as we know the function $|w_i|$ is non-differentiable since we reach a non-differentiable corner at $w_i = 0$. We can see this in **Figure 1** where at every corner of our $||w||_1$ balls/spheres we have at least one $w_i = 0$.

This motivates the use of **subgradients** since this mitigates this specific issue, giving us a gradient we can work with and optimize.

#### Definition Subgradient:
A subgradient is a generalization of the gradient for functions that may not be differentiable, especially convex functions (which $||w||_1$ is).

Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a convex function. A vector $g \in \mathbb{R}^n$ is a **subgradient** of $f$ at point $x$ if it satisfies the subgradient inequality $\forall y$:
$$ f(y) \geq f(x) + \langle g, y-x \rangle $$

The set of all subgradients at x is the **subdifferential** and is denoted: 
$$ \partial f(x) = \{g \in \mathbb{R}^n | f(y) \geq f(x) + \langle g, y-x \rangle \forall y \}$$

An intuitive way to look at this is that for convex functions we have that the first order approximation is *always* an under approximation of the convex function by definition:  
$$ f(y) \geq f(x) + \nabla f(x)^T(y-x) = f(x) + \langle \nabla f(x), y-x \rangle$$

Is *always* an under approximation of the function $f$, as seen in **Figure 3**

<img src="/media/images/20260430_055834_convex_underestimator.png" alt="convex underestimator" style="max-width: min(650px, 90%); height: auto;">
<center>
$\textbf{Figure 3}$: Approximations to Convex Functions are always Under-estimators. Convex epigraph shown in blue. 
</center>

<img src="/media/images/20260430_055843_04_29_2026_22_52_59_subgradient_abs.png" alt="04 29 2026 22 52 59 subgradient abs" style="max-width: min(1200px, 95%); height: auto;">
<center>
$\textbf{Figure 4}$: Subgradients for Absolute Value function $y = |x|$. Demonstrates generalization of $\textbf{Figure 3}$ where for $x=0$ we have several valid lines with slopes $g \in [-1,1]$ s.t. they are also under-estimators of $|x|$.

Note for the sub-differential on the right, one way to interpret this is that at $x=0$ $\partial x$ takes on values in $[-1,1]$. 
</center>

As we can see in **Figure 4**, we have that for certain points in convex functions, such as when we have piecewise defined functions (consider that $|x|$ is piecewise defined with pieces $-x$, $x$), that we can have multiple under-estimator functions with intercepts on the function and a range of slope values.

The subdifferential of $f(x) = |x|$ at $x=0$ is
$$ \partial |x| = Sign(x) $$
$$ \operatorname{Sign}(x) =
\begin{cases}
\{1\} & \text{if } x > 0 \\
[-1, 1] & \text{if } x = 0 \\
\{-1\} & \text{if } x < 0
\end{cases}
$$
which directly follows from **Figure 4**.

## Gradient of L1 Lasso Regression

Great, now let's find the gradient of $L(w)$: 
$$ L(w) = ||y - Xw||_2^2 + \lambda \sum_{i=1}^n |w_i| $$
$$ L(w) = (y - Xw)^T(y - Xw) + \lambda \sum_{i=1}^n |w_i| $$
To make taking the gradient easier, we're going to multiply the non-regularizing term (the L2 loss term) by $\frac{1}{2}$. We define the squared-error term with a factor of $\frac{1}{2}$ for algebraic convenience. This convention doesn't change the family of Lasso solutions, since changing the scaling of the squared error term can be absorbed into a rescaling of $\lambda$. 
$$ L(w) = \frac{1}{2}(y - Xw)^T(y - Xw) + \lambda \sum_{i=1}^n |w_i| $$
$$ L(w) = \frac{1}{2}y^Ty - y^TXw + \frac{1}{2}w^TX^TXw  + \lambda \sum_{i=1}^n |w_i| $$

Take the differential of both sides
$$ dL(w) = -y^TXdw + w^TX^TXdw + \lambda\ Sign(w^T)dw$$

Note that $dL = \langle dw, \nabla_w L(w) \rangle$, so we get that 
$$ \nabla_w L(w) = -y^TX + w^TX^TX + \lambda\ Sign(w^T)$$

Note that this is in numerator form for the gradient, however to make the next calculations easier, we're going to take the transpose of $\nabla_w L(w)$, $\nabla_w L(w)^T$, which is the same vector transposed: 
$$ \nabla_w L(w)^T = -X^Ty + X^TXw + \lambda\ Sign(w)$$
$$ \nabla_w L(w)^T = X^T( Xw -y) + \lambda\ Sign(w)$$

Let's suppose that we only care about the $i$th index of the output: $ \nabla_w L(w)^T_i$, we would express that as: 
$$ \nabla_w L(w)^T_i = x^T_i( Xw -y) + \lambda\ Sign(w)_i$$

Let's solve for $Sign(w)$ since we will do casework based on this to determine behavior at the optimal when $\nabla_w L(w)^T_i = 0$

$$ 0 = x^T_i( Xw -y) + \lambda\ Sign(w)_i$$
$$ -\lambda\ Sign(w)_i = x^T_i( Xw -y) $$
$$ Sign(w)_i = \frac{x^T_i(y - Xw)}{\lambda} $$

### Casework for values of $w_i$

Let's step through some casework for $w_i$ corresponding to the $Sign(w)$ function:

#### Case 1: $w_i < 0$
$$ Sign(w)_i = \frac{x^T_i(y - Xw)}{\lambda} $$
Note that $Sign(w)_i = -1$ when $w_i < 0$, thus we get:

$$ -1 = \frac{x^T_i(y - Xw)}{\lambda} $$
With a set $\lambda$ this is:

$$ -\lambda = x^T_i(y - Xw) $$

#### Case 2: $w_i > 0$
Notice that this is basically Case 1, but with the opposite sign, giving us:
$$ \lambda = x^T_i(y - Xw) $$

#### Case 3: $w_i = 0$
This is the special case, here we will see that since $Sign(w)_i \in [-1,1]$ is a range that there are several values where $w_i$ reaches 0 encouraging sparsity where if the predicted and ground truth values are in a certain range, we set $w_i$ to zero:

$$ Sign(w)_i = \frac{x^T_i(y - Xw)}{\lambda} $$
$$ \frac{x^T_i(y - Xw)}{\lambda} \in [-1,1]$$

Note that we use $\in$ here and not $=$ since $Sign(w)_i$ for $w_i = 0$ is a set. 
Simplifying we get:

$$ x^T_i(y - Xw) \in [-\lambda,\lambda]$$
$$ x^T_i y - x^T_i Xw \in [-\lambda, \lambda]$$

$$ |x^T_i y - x^T_i Xw| \leq \lambda$$

We can state all three cases simply as:

$$
\boxed{
\left.
\begin{array}{ll}
\text{Case 1: } x_i^T(y-Xw)=-\lambda & \text{if } w_i<0 \\
\text{Case 2: } x_i^T(y-Xw)=\lambda & \text{if } w_i>0 \\
\text{Case 3: } x_i^T(y-Xw)\in[-\lambda,\lambda] & \text{if } w_i=0
\end{array}
\right.
}
$$

What this is saying is that

*If feature $x_i$'s correlation with the residual $y -Xw$ is within $\lambda$ then Lasso can set $w_i = 0$*.

This encourages sparsity such that depending on our hyperparameter $\lambda$ we can choose to ignore certain parameters $w_i$. Looking at our **Title Image GIF** we can see that as the L1 Ball decreases in radius that the optimal point on the ball hits one of the corners, indicating that the corresponding $w_i$ is set to 0. See **Figure 5** for an image describing this for the 2D case.

<img src="/media/images/20260430_223704_04_30_2026_15_35_29_lasso_kkt_conditions.png" alt="04 30 2026 15 35 29 lasso kkt conditions" style="max-width: min(600px, 100%); height: auto;">
<center>
$\textbf{Figure 5}$: Conditions for $w_i$ from Casework clearly shown on the L1 Circle in 2D.

Notice that the Corners have at least 1 parameter fulfilling Case 3, where since the absolute value of the residual is less or equal to $\lambda$ the corresponding $w_i = 0$
</center>

**Figure 5** shows the case for 2D, but as we can extrapolate for higher dimensions, the corners also fulfill these conditions. As the dimensions get higher though, we can see that corners represent having multiple $w_i = 0$ and that edges have $w_i = 0$. Specifically, for the 3D case, each vertex has two $w_i = 0$ and each edge has one $w_i = 0$ as seen in **Figure 6**. This trend encourages more sparsity in higher dimensions.

<img src="/media/images/20260430_230654_04_30_2026_16_04_34_lasso_kkt_conditions.png" alt="04 30 2026 16 04 34 lasso kkt conditions" style="max-width: min(1000px, 95%); height: auto;">
<center>
$\textbf{Figure 6}$: Conditions on 3D L1 Ball showing how sparsity induced on edges and vertices in higher dimensions. 
</center>

---
## Appendix

### Equivalence between the Lagrangian & Optimization with Constraint Problems

The *general constrained optimzation problem* is stated as
$$
\min_w f(w)
\quad \text{s.t.} \quad g(w) \le t.
$$

This means that we want to minimize the objective $f(w)$ while only considering points $w$ whose constraint value $g(w)$ is at most $t$.

We can rewrite the constraint as

$$
g(w)-t \le 0.
$$

The Lagrangian is then

$$
\mathcal{L}(w,\lambda)
=
f(w)+\lambda(g(w)-t),
$$

where $\lambda \ge 0$ is the Lagrange multiplier.

For a fixed value of $\lambda$, we have

$$
\mathcal{L}(w,\lambda)
=
f(w)+\lambda g(w)-\lambda t.
$$

Since $-\lambda t$ does not depend on $w$, it does not affect the value of $w$ that minimizes the objective. Therefore, minimizing the Lagrangian with respect to $w$ is equivalent to minimizing

$$
f(w)+\lambda g(w).
$$

Thus, the constrained problem

$$
\min_w f(w)
\quad \text{s.t.} \quad g(w)\le t
$$

corresponds to the penalized problem

$$
\min_w f(w)+\lambda g(w).
$$

In the Lasso case, we have

$$
f(w)=\frac12\|y-Xw\|_2^2
$$

and

$$
g(w)=\|w\|_1.
$$

Therefore, the constrained Lasso problem is

$$
\min_w \frac12\|y-Xw\|_2^2
\quad \text{s.t.} \quad
\|w\|_1 \le t.
$$

The corresponding Lagrangian is

$$
\mathcal{L}(w,\lambda)
=
\frac12\|y-Xw\|_2^2
+
\lambda(\|w\|_1-t).
$$

Expanding gives

$$
\mathcal{L}(w,\lambda)
=
\frac12\|y-Xw\|_2^2
+
\lambda\|w\|_1
-
\lambda t.
$$

Since $-\lambda t$ is constant with respect to $w$, the minimizer is the same as the minimizer of

$$
\frac12\|y-Xw\|_2^2
+
\lambda\|w\|_1.
$$

Therefore, the constrained Lasso problem

$$
\min_w \frac12\|y-Xw\|_2^2
\quad \text{s.t.} \quad
\|w\|_1 \le t
$$

is equivalent to the penalized Lasso problem

$$
\min_w
\frac12\|y-Xw\|_2^2
+
\lambda\|w\|_1
$$

for an appropriate relationship between $t$ and $\lambda$.

Intuitively, the constrained version says that $w$ must lie inside an L1 ball of radius $t$, while the penalized version says that large values of $\|w\|_1$ are punished. A smaller $t$ corresponds to a stronger constraint, while a larger $\lambda$ corresponds to a stronger penalty.