Properties of L2 Loss by dokuDoku Math 🔍

## What is L2 Loss?

L2 Loss is defined as the sum of squared errors for a given pair of vectors; it can also be taken as the average squared error over a set of vectors. Specifically, for the single-vector case, it is the sum of squared differences between all elements of the predicted and ground truth vectors. For ground truth vector $y \in \mathbb{R}^m$ and prediction $\hat{y} \in \mathbb{R}^m$, the L2 Loss, or sum of squared differences, is $$ \sum_{i} (y_i - \hat{y}_i)^2 = ||y - \hat{y}||_2^2 = (y-\hat{y})^T(y-\hat{y})$$ This commonly used function in machine learning and engineering has some cool properties, which we will be discussing:
- Convexity
- Positive Semi-Definiteness
- Hessian Positive Semi-Definiteness
- Hilbert Space Relation

## L2 Loss & Convexity

<img src="/media/images/20260416_005948_Screenshot_2026-04-15_at_5.58.46_PM.png" alt="Screenshot 2026 04 15 at 5.58.46 PM" style="max-width: min(600px,80%); height: auto;"> <center> $\textbf{Figure 1:}$ L2 Loss Landscape for Summation of Squared Errors </center>

As we can see, the L2 Loss is the parabola $e^2 = ||y - \hat{y}||^2_2 = \sum_{i} (y_i - \hat{y}_i)^2$. Intuitively, the L2 Loss is a *convex function* since it's quadratic with positive curvature. Equivalently, it satisfies Jensen's Inequality, which states that the function of the average is less than or equal to the average of the function, where our function is $e^2$. The characterization of convexity via Jensen's Inequality is: <center> A function $f: C \rightarrow \mathbb{R}$, where $C$ is a convex set, is convex if and only if Jensen's Inequality holds for all finite convex combinations, i.e. for all random variables taking values in $C$. 
</center> $$\text{Jensen's Inequality:}\quad f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$

For $0 \leq \theta \leq 1$ and $x,y \in \operatorname{dom}(f)$, Jensen's Inequality can be stated as a comparison between the function of the weighted average of two points and the weighted average of the function evaluated at those two points: $$ f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y)$$

As we can see in **Figure 2** below, the chord between two points, with $\theta f(x) + (1-\theta)f(y)$ lying on that chord, sits above $f(\theta x + (1-\theta)y)$, illustrating that the L2 Loss function is convex. This is fantastic, because it means that for certain classes of problems, when we compose the input with the L2 Loss we get a convex model with a *global minimum* that we can optimize for. Mathematically, we can show it's convex, and we have done so in the **Appendix** (to keep the post readable).

<img src="/media/images/20260417_000056_Screenshot_2026-04-16_at_5.00.14_PM.png" alt="Screenshot 2026 04 16 at 5.00.14 PM" style="max-width: min(600px, 70%); height: auto;"> <center> $\textbf{Figure 2}$ Jensen's Inequality for L2 Loss with $0 \leq \theta \leq 1$ </center>

### What Models are Convex?

Some problems are convex optimization problems, such as compositions with affine mappings: if we have a convex function $f$ and compose it with an affine function $Ax+b$, then $g(x) = f(Ax+b)$ is also convex. In fact, this composition is commonly known as **linear regression**, and you should read [Fitting a Simple Linear Model Using Gradient Descent](https://www.noteblogdoku.com/blog/fitting-a-simple-linear-model-using-gradient-descent) to learn more about how gradient descent can fit these kinds of models. For this specific model, whenever we find a local minimum (reduce the error $e$ as much as possible), we have found the global minimum. See the **Appendix** for this derivation. 
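Before writing the problem out explicitly, the convexity inequality above can be checked numerically. Below is a minimal sketch using NumPy; the vectors and the `l2_loss` helper are arbitrary illustrations, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_loss(y, y_hat):
    # Sum of squared differences: ||y - y_hat||_2^2
    d = y - y_hat
    return float(d @ d)

y = rng.normal(size=5)   # ground truth (arbitrary example)
y1 = rng.normal(size=5)  # two candidate predictions
y2 = rng.normal(size=5)

# Jensen's inequality for the convex function L(y_hat) = ||y - y_hat||^2:
# L(theta*y1 + (1-theta)*y2) <= theta*L(y1) + (1-theta)*L(y2)
for theta in np.linspace(0.0, 1.0, 11):
    lhs = l2_loss(y, theta * y1 + (1 - theta) * y2)
    rhs = theta * l2_loss(y, y1) + (1 - theta) * l2_loss(y, y2)
    assert lhs <= rhs + 1e-12  # small tolerance for floating point
```

No matter which random vectors are drawn, the assertion holds, which is exactly what the chord picture in **Figure 2** depicts.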
To state explicitly, the **linear regression** problem is as follows: $$ \hat{y} = Ax + b$$ $$ ||y - \hat{y}||_2^2 = ||y - Ax - b||_2^2 = (y - Ax-b)^T (y-Ax-b)$$

It is natural to ask why neural networks are not convex functions. Consider a simple two-layer network (without biases): $$ \hat{y}(A_1, A_2) = ReLU(A_2 \cdot ReLU(A_1x))$$ where we are optimizing over both $A_1$ and $A_2$. Naturally, we have to ask: *is this two-layer network convex with respect to the parameters $A_1$ and $A_2$?* To analyze this we look at $A_1$ and $A_2$ jointly, and we will find that the function, viewed as a joint function of $A_1$ *and* $A_2$, is **not** convex. To start, note that the $ReLU$ function is defined as $$ \text{ReLU}(x) = \begin{cases} x & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} $$

We now show by counterexample that the function is not convex jointly in $A_1$ and $A_2$. To simplify the argument, consider the scalar case where $A_1 = a \in \mathbb{R}$, $A_2 = b \in \mathbb{R}$, and fix an input $c>0$. On the region $a\ge 0$ and $b\ge 0$, both ReLU activations are active, so $$\hat y(a,b)=\operatorname{ReLU}\bigl(b\,\operatorname{ReLU}(ac)\bigr)=abc$$ Thus we study the function $f(a,b)=abc$ with $c>0$ fixed. Take two points in parameter space, where each point is a pair $(a,b)$ serving as the input to $f$: $$u=(0,2), \qquad v=(2,0)$$ Their midpoint is $\frac{u+v}{2}=(1,1)$. Evaluating $f$ at these points gives $$f(u)=f(0,2) = 0\cdot2\cdot c=0, \qquad f(v)=f(2,0) = 2 \cdot 0 \cdot c=0$$ while $$f\!\left(\frac{u+v}{2}\right)=f(1,1)= 1\cdot 1 \cdot c = c>0$$ If $f$ were convex, Jensen's inequality would imply $$f\!\left(\frac{u+v}{2}\right)\le \frac12 f(u)+\frac12 f(v)=0$$ But this would require $c\le 0$, contradicting $c>0$. Therefore $f(a,b)=abc$ is not convex jointly in $(a,b)$, and hence the original two-layer ReLU network is not jointly convex in its parameters. 
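The counterexample above is easy to verify by computation. Here is a minimal sketch in plain Python; the value $c = 3.0$ is an arbitrary choice of positive input:

```python
def f(a, b, c=3.0):
    # Scalar two-layer ReLU network from the text: f(a, b) = ReLU(b * ReLU(a * c))
    # On the region a >= 0, b >= 0 with c > 0 this equals a * b * c.
    relu = lambda z: max(z, 0.0)
    return relu(b * relu(a * c))

u = (0.0, 2.0)
v = (2.0, 0.0)
mid = ((u[0] + v[0]) / 2, (u[1] + v[1]) / 2)  # midpoint (1, 1)

avg = 0.5 * f(*u) + 0.5 * f(*v)  # average of endpoint values: 0
val = f(*mid)                    # midpoint value: c > 0

# Convexity would require val <= avg; instead the midpoint sits strictly above.
assert val > avg
```

The midpoint of the parameter segment evaluates to $c$, strictly above the chord value of $0$, so no convexity certificate is possible for this network.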
## L2 Loss as a Positive Semi-Definite Matrix

As you may have noticed, the L2 Loss is parabola shaped and has positive curvature, which strongly suggests that it is a quadratic form with a positive semi-definite matrix. Recall that for the general quadratic form, $A$ is positive semi-definite under the following condition: $$ f(x) = x^TAx \geq 0$$ We can see that since $||y - \hat{y}||_2^2 \geq 0$ (by the norm property that all norms are greater than or equal to zero) this could be the case. Explicitly, with matrix $A=I$ and $x = y-\hat{y}$, we can see the following quadratic form: $$ ||y - \hat{y}||_2^2 = (y-\hat{y})^T I (y-\hat{y}) \geq 0$$ Thus we have shown that the L2 Loss is a quadratic function with $A=I$ ($I$ by definition is positive semi-definite). This also relates to the fact that the function has a global minimum. Additionally, taking the L2 loss in a non-Euclidean space gives us the **Squared Mahalanobis Distance**, where $I$ is replaced with the inverse covariance matrix $\Sigma^{-1}$, which is positive definite (for invertible $\Sigma$), preserving the norm properties. This is commonly stated as $$ (x-\mu)^T \Sigma^{-1} (x-\mu)$$ which, as you can see, has the same form as $||y - \hat{y}||_2^2 = (y-\hat{y})^T I (y-\hat{y}) \geq 0$, but with distance defined by $\Sigma^{-1}$ rather than the standard Euclidean $I$.

## L2 Loss has Positive Definite Hessian

Note that since the L2 Loss is always greater than or equal to 0 and is convex, this strongly suggests that its Hessian is positive semi-definite; computing it directly will show that it is in fact $2I$, which is positive definite. Let's take the second derivative of the L2 Loss: $$ ||y - \hat{y}||_2^2 = (y-\hat{y})^T(y-\hat{y})$$ Take the differential with respect to $\hat{y}$, since it is produced by our model: $$ dL = -d\hat{y}^T(y-\hat{y}) - (y-\hat{y})^Td\hat{y}$$ Recognize that both terms are scalars in $\mathbb{R}$, so each is equal to its transpose. 
Giving us $$ dL = -2(y-\hat{y})^Td\hat{y}$$ Note the matrix calculus identity (numerator layout) $$ dL = \nabla_{\hat{y}}L \; d\hat{y}$$ Giving us $$ \nabla_{\hat{y}}L = -2(y-\hat{y})^T$$ Great, now let's take the second derivative of this. Note that $\nabla_{\hat{y}}L \in \mathbb{R}^{1 \times n}$ where $y \in \mathbb{R}^n$ (numerator layout for derivatives); we can see this since $(y - \hat{y})^T \in \mathbb{R}^{1 \times n}$. $$ d \left(\nabla_{\hat{y}}L \right) = d(-2(y-\hat{y})^T) = -2\ d\left(y - \hat{y} \right)^T = 2\, d \hat{y}^T $$ Now use that the Hessian is the matrix $H$ defined such that $$d(\nabla_{\hat{y}}L)^T = H\, d\hat{y}$$ Taking the transpose of the result above: $$\left(2\, d \hat{y}^T\right)^T = 2\, d \hat{y} $$ so for every increment $d\hat{y}$ we have $$ H\, d\hat{y} = 2\,d\hat{y}$$ The only matrix $H$ that can satisfy this for all $d\hat{y} \in \mathbb{R}^n$ is $$H = 2I$$ As we can see, the Hessian is $2I$, which is positive definite, therefore the function is strictly convex! Note, you can show this using vector calculus too; I just find matrix calculus easier since I don't have to track a lot of indices. Also, conceptually in this case it mirrors single-variable calculus, helping intuition. Of note, we can also do this in three lines: $$L(\hat{y}) = ||\hat{y} - y||^2$$ $$ \nabla L = 2(\hat{y} - y)^T$$ $$ \nabla^2 L = 2I$$

## Hilbert Space Compatibility

A *Hilbert Space* is defined as a complete vector space equipped with an inner product. The key properties include:
- Vector space structure (add and scale vectors)
- Inner product $\langle x,y \rangle$
- Induced norm $||x|| = \sqrt{\langle x,x \rangle}$
- Completeness: limits of Cauchy sequences stay inside the space

The connection between L2 loss and Hilbert Spaces is that L2 loss is squared distance in a Hilbert Space. 
Precisely, $$ ||y-\hat{y}||^2 = \langle y - \hat{y}, y - \hat{y} \rangle $$ We write $|| \cdot ||$ for the norm induced by the Hilbert space inner product; in $\mathbb{R}^n$ this coincides with $|| \cdot ||_2$. Thus minimizing the L2 loss $||y-\hat{y}||^2$ corresponds to finding the point $\hat{y}$ closest to $y$. Additionally, because the norm comes from an inner product, for closed subspaces of a Hilbert space the projection minimizing the L2 loss exists and is unique. For example, with linear regression we solve $$ \min_{\hat{y} \in span(A)} \|y - \hat{y}\|^2 $$ which gives the projection of $y$ onto the column space of $A$ for $\hat{y} = Ax$ (note that in this formulation the bias $b$ is built into $A$, with the corresponding element of $x$ equal to $1$). The error term satisfies $$ y - \hat{y} \perp span(A)$$ This orthogonality condition exists because we are in a Hilbert Space, since Hilbert Spaces have:
- Inner products $\langle x,y \rangle$
- The Projection Theorem: for any point $y$ and closed subspace $S$ there exists a unique $\hat{y} \in S$ s.t. 
$y - \hat{y} \perp S$

Additionally, to minimize the distance using gradient descent, we can see that with respect to the inner product the gradient of $||y-\hat{y}||^2$ is well defined: $$ ||y-\hat{y}||^2 = \langle y - \hat{y}, y - \hat{y} \rangle $$ Take a small step $\hat{y} + \epsilon t$, where $t$ is an arbitrary direction in the subspace along which we want to travel, giving us $$ \frac{d}{d \epsilon} || y - (\hat{y} + \epsilon t)||^2 = \frac{d}{d \epsilon} \langle y - (\hat{y} + \epsilon t), y - (\hat{y} + \epsilon t) \rangle = \frac{d}{d \epsilon} \langle (y - \hat{y}) - \epsilon t, (y - \hat{y}) - \epsilon t \rangle$$ Using the result from the **Appendix** we get $$ \frac{d}{d \epsilon} \left( \langle y - \hat{y}, y - \hat{y} \rangle - 2\epsilon \langle y - \hat{y}, t \rangle + \epsilon ^2 \langle t,t \rangle \right) = - 2 \langle y - \hat{y}, t \rangle + 2 \epsilon \langle t,t \rangle $$ Evaluating at $\epsilon = 0$: $$\frac{d}{d\epsilon}\Big|_{\epsilon=0} = -2 \langle y - \hat{y},\, t \rangle $$ Since this is valid for every direction $t$ in the subspace, the directional derivative of the L2 objective is defined everywhere, meaning we always have a gradient we can use to descend toward the optimal point. 
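Both the Projection Theorem and the directional derivative formula above can be checked numerically in $\mathbb{R}^n$. Below is a minimal sketch using NumPy's least-squares solver; the matrix sizes and random seed are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))  # design matrix (arbitrary example)
y = rng.normal(size=8)

# Least-squares solution: x minimizing ||y - A x||^2
x, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ x                # projection of y onto col(A)
residual = y - y_hat

# Projection Theorem: the residual is orthogonal to every column of A,
# hence to all of span(A).
assert np.allclose(A.T @ residual, 0.0)

# Directional derivative at y_hat along any t in col(A): -2<y - y_hat, t>,
# which vanishes at the minimizer.
t = A @ rng.normal(size=3)   # arbitrary direction inside the subspace
assert abs(-2.0 * (residual @ t)) < 1e-8
```

At the minimizer the directional derivative $-2\langle y - \hat{y}, t\rangle$ is zero in every feasible direction, which is exactly the orthogonality condition $y - \hat{y} \perp span(A)$.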
---

## Appendix

### Convexity using Jensen's Inequality

Using Jensen's Inequality, we will show convexity: $$L(\theta \hat{y}_1 + (1-\theta)\hat{y}_2) \leq \theta L(\hat{y}_1) + (1-\theta)L(\hat{y}_2) $$ Recall that $$ L(\hat{y}) = ||y - \hat{y}||_2^2 $$ We begin with the left-hand side: $$ L(\theta \hat{y}_1 + (1-\theta)\hat{y}_2) = \left\| y - \left(\theta \hat{y}_1 + (1-\theta)\hat{y}_2 \right) \right\|_2^2 $$ Rewriting (using $y = \theta y + (1-\theta)y$): $$ = \left\| \theta(y - \hat{y}_1) + (1-\theta)(y - \hat{y}_2) \right\|_2^2 $$ Expanding the squared norm: $$ = \theta^2 ||y - \hat{y}_1||_2^2 + (1-\theta)^2 ||y - \hat{y}_2||_2^2 + 2\theta(1-\theta)(y - \hat{y}_1) \cdot (y - \hat{y}_2) $$ Now compute the right-hand side: $$ \theta L(\hat{y}_1) + (1-\theta)L(\hat{y}_2) = \theta ||y - \hat{y}_1||_2^2 + (1-\theta)||y - \hat{y}_2||_2^2 $$ Now subtract the left-hand side from the right-hand side: $$ \theta L(\hat{y}_1) + (1-\theta)L(\hat{y}_2) - L(\theta \hat{y}_1 + (1-\theta)\hat{y}_2) $$ Using $\theta - \theta^2 = \theta(1-\theta)$ and $(1-\theta) - (1-\theta)^2 = \theta(1-\theta)$, this collects to $$ = \theta(1-\theta)\left( ||y - \hat{y}_1||_2^2 + ||y - \hat{y}_2||_2^2 - 2 (y - \hat{y}_1)\cdot (y - \hat{y}_2) \right) $$ Using the identity: $$ ||a||^2 + ||b||^2 - 2a \cdot b = ||a - b||^2 $$ we get: $$ = \theta(1-\theta) ||(y - \hat{y}_1) - (y - \hat{y}_2)||_2^2 = \theta(1-\theta)||\hat{y}_1 - \hat{y}_2||_2^2 $$ Since norms are nonnegative and $0 \leq \theta \leq 1$, we have: $$ \theta(1-\theta)||\hat{y}_1 - \hat{y}_2||_2^2 \geq 0 $$ Thus: $$ L(\theta \hat{y}_1 + (1-\theta)\hat{y}_2) \leq \theta L(\hat{y}_1) + (1-\theta)L(\hat{y}_2) $$ Therefore, the L2 Loss is convex. 
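The exact algebraic identity derived above, that the convexity gap equals $\theta(1-\theta)||\hat{y}_1 - \hat{y}_2||_2^2$, can be confirmed numerically. A minimal sketch with NumPy and arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
y  = rng.normal(size=4)   # ground truth (arbitrary example)
y1 = rng.normal(size=4)   # two candidate predictions
y2 = rng.normal(size=4)
theta = 0.3               # arbitrary weight in [0, 1]

def L(y_hat):
    # L2 loss: ||y - y_hat||_2^2
    d = y - y_hat
    return float(d @ d)

# Convexity gap: RHS of Jensen minus LHS
gap = theta * L(y1) + (1 - theta) * L(y2) - L(theta * y1 + (1 - theta) * y2)

# The appendix derivation says this gap equals theta*(1-theta)*||y1 - y2||^2
expected = theta * (1 - theta) * float((y1 - y2) @ (y1 - y2))
assert np.isclose(gap, expected)
assert gap >= 0.0  # hence the Jensen inequality holds
```

The gap matches the closed form to floating-point precision for any choice of vectors and any $\theta \in [0, 1]$.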
### Expansion of $\langle (y - \hat{y}) - \epsilon t, (y - \hat{y}) - \epsilon t \rangle$

Using the bilinearity property of inner products, $\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v,w \rangle$, together with symmetry, we get: $$ \langle (y - \hat{y}) - \epsilon t, (y - \hat{y}) - \epsilon t \rangle = \langle y - \hat{y}, y - \hat{y} \rangle - \epsilon \langle y - \hat{y}, t \rangle - \epsilon \langle t, y - \hat{y} \rangle + \epsilon^2 \langle t,t \rangle $$ $$ = \langle y - \hat{y}, y - \hat{y} \rangle - 2\epsilon \langle y - \hat{y}, t \rangle + \epsilon ^2 \langle t,t \rangle $$