# Linear Regression Solution from Projections onto Subspaces

*by dokuDoku Math*

## What is Linear Regression for a Single Vector?

Suppose that we have the following linear equation with $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$:

$$ Ax = b $$

Furthermore, suppose that $b$ is the ground-truth object that we are attempting to predict from input $x$ using parameters $A$. The goal, then, is to make the prediction $Ax$ as close as possible to $b$. Formally, our error $e \in \mathbb{R}^m$ is defined as:

$$ e = b - Ax $$

We can equivalently minimize the easier squared L2 loss (see [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss) for cool insights), which has the same minimizer as $e$:

$$ ||b - Ax||^2 = (b-Ax)^T(b-Ax) = e^Te $$

$$ = ||e||^2 = \sum_i e_i^2 $$

Since we know that this L2 loss is a convex function (see [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss)), there is only one global minimum, and the smallest possible value of $||e||^2$ is achieved when $e = 0$. This makes sense intuitively, since $e = 0$ means your parameters $A$ give a perfect prediction.

### Phrasing as a Projection

Let's start with the one-dimensional case where, for $Ax = b$, we have $A, b \in \mathbb{R}^n$ and $x \in \mathbb{R}$. Since $A \in \mathbb{R}^n$ is a single vector whose span is a line, not a 2D or higher-dimensional subspace, we're going to rephrase $A$ as $a$.

<img src="/media/images/20260418_013015_Screenshot_2026-04-17_at_6.29.53_PM.png" alt="Screenshot 2026 04 17 at 6.29.53 PM" style="max-width: min(600px, 80%); height: auto;">

<center> $\textbf{Figure 1}$ 1D Projection of vector $b$ onto $a$ </center>

We can see in **Figure 1** the projection $p$ of $b$ onto the subspace (line) defined by $a$. Note that the projection $p$ can be stated as $ax$, where $x$ is a scaling parameter for the vector $a$ in this case.
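As a quick numeric sanity check of the definitions so far, here is a minimal sketch (assuming NumPy; the matrix $A$, parameters $x$, and target $b$ are made-up example values, not from the text) confirming that the three expressions for the squared L2 loss agree:

```python
import numpy as np

# Made-up example values: A is 3x2, x a parameter guess, b the target.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([0.5, -0.25])
b = np.array([1.0, 2.0, 3.0])

e = b - A @ x  # error vector e = b - Ax

# The three expressions for the squared L2 loss agree:
loss_norm = np.linalg.norm(b - A @ x) ** 2   # ||b - Ax||^2
loss_inner = e @ e                           # e^T e
loss_sum = np.sum(e ** 2)                    # sum_i e_i^2
assert np.isclose(loss_norm, loss_inner) and np.isclose(loss_inner, loss_sum)
```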
Furthermore, $e$ directly corresponds to our error, since

$$ e = b - p = b - ax $$

which is the vector version of our error. The equivalent L2 loss is also minimized when $e = 0$, i.e.

$$ e^Te = (b-p)^T(b-p) = (b-ax)^T(b-ax) $$

We can state this problem as:

<center> Given the vector $b$, what is the optimal approximation $p$ in the column space of $A$ such that $e = b - p$ is minimized? </center>

We can derive this using linear algebra and calculus.

#### Linear Algebra

Suppose that we want to find the optimal $p$ such that $e = b - p$ is minimized. We claim that the error $e$ is minimized when this vector is perpendicular to the column space defined by $A$. Consider, for the sake of contradiction, that this were false.

<img src="/media/images/20260418_023154_Screenshot_2026-04-17_at_7.30.52_PM.png" alt="Screenshot 2026 04 17 at 7.30.52 PM" style="max-width: min(600px, 80%); height: auto;">

<center> $\textbf{Figure 2}$ Example for why $e \perp a$ Optimal with $e'$ Counter Example </center>

Suppose, as seen in **Figure 2**, that we have $e'$ such that $e'$ is not perpendicular to $a$, but instead makes angle $\theta$ with it. Using the Law of Sines, we can relate the length of $e'$ to $\theta$:

$$ \frac{\sin \theta}{||b||} = \frac{\sin \phi}{||e'||} $$

Since we know that $||b||$ and $\phi$ are both constant, we can see that $||e'||$ is a function of $\theta$. Isolating $\sin \theta$ to one side we get

$$ \sin \theta = \frac{||b||\ \sin \phi}{||e'||} $$

We can see that the maximum value of $\sin \theta$ gives the smallest value of $||e'||$. Since $0 \leq \sin \theta \leq 1$ and $0 \leq \theta \leq \pi$, the value of $\theta$ that maximizes $\sin \theta$ (and thus minimizes $||e'||$) is $\pi/2$. Thus we can see that the smallest error we can get between the vector $b$ and the line spanned by $a$ is achieved when $e = b - p \perp a$.

#### Finding the optimal projection $p = ax$

Great, now let's find $p = ax$, the optimal projection for error $e$. Let's start with the fact that $e \perp a$.
This gives us

$$ e \cdot a = 0 \rightarrow (b-p) \cdot a = 0 \rightarrow a^T(b-p) = 0 $$

Solving for $p$:

$$ a^Tb - a^Tp = 0 \rightarrow a^Tb = a^Tp $$

Great, let's take it a step further and solve for $x$, since $p = ax$:

$$ a^Tb = a^Tp \rightarrow a^Tb = a^Tax $$

Since $a^Ta \in \mathbb{R}$, we can divide by it, giving us

$$ \frac{a^Tb}{a^Ta} = x $$

Note that in the denominator we must have $a^Ta$ and not $aa^T$: because $x \in \mathbb{R}$, we need the scalar $a^Ta \in \mathbb{R}$, not the matrix $aa^T \in \mathbb{R}^{n \times n}$.

Thus we have found the optimal projection $p$ that minimizes error $e$:

$$ p = \frac{a^Tb}{a^Ta} \cdot a $$

Notice that this is the vector $a$ times a scaling factor. We will see, in the case where $A$ is a matrix, that the right-hand side ($A^TAx$) effectively scales each column vector of the matrix $A^TA$ by a factor and linearly combines them to give a result in its range.

#### Calculus

Suppose that instead we used calculus, rather than linear algebra, to find the optimal $p$; we should get the same answer. We know that $p = ax$ and that $e = b - p = b - ax$. We want to minimize the error $e$. Note that the minimizer of $e$ is also the minimizer of $||e||^2 = e^Te$, so we will work with $e^Te$, since it's easier.
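Before carrying out the calculus derivation, the closed-form projection just derived is easy to check numerically. A minimal sketch, assuming NumPy and using arbitrary made-up vectors:

```python
import numpy as np

# Arbitrary made-up vectors; any nonzero a works.
a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 1.0, 1.0])

x = (a @ b) / (a @ a)   # optimal scaling factor x = a^T b / a^T a
p = x * a               # projection p = ax of b onto the line spanned by a
e = b - p               # residual error e = b - p

# At the optimum the error is perpendicular to a:
assert np.isclose(a @ e, 0.0)
```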
We get:

$$ e^Te = (b-ax)^T(b-ax) $$

Since $e^Te$ is a function of $x$ with $b, a$ constant, we have

$$ f(x) = (b-ax)^T(b-ax) $$

Take the differential to get

$$ df = d(b-ax)^T(b-ax) + (b-ax)^T d(b-ax) $$

$$ = (-a\,dx)^T(b-ax) + (b-ax)^T(-a\,dx) $$

Recognize that both terms are in $\mathbb{R}$ and use the fact that for some arbitrary $z \in \mathbb{R}$ we have $z^T = z$, giving us

$$ df = -2(b-ax)^Ta\, dx $$

Recognizing that $df = \langle f'(x), dx \rangle$, we see that

$$ f'(x) = -2(b-ax)^Ta = -2a^T(b-ax) $$

We know that this function is convex from [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss), thus finding where $f'(x) = 0$ is equivalent to finding the minimum:

$$ f'(x) = 0 = -2(b-ax)^Ta $$

Solving for $x$:

$$ 0 = -2a^T(b-ax) \rightarrow 0 = -2a^Tb + 2a^Tax \rightarrow a^Tb = a^Tax $$

$$ x = \frac{a^Tb}{a^Ta} $$

Which matches what we have from the linear algebra solution! We have solved the same problem two ways, very neat!

### Linear Regression with $A \in \mathbb{R}^{m \times n}$

Since we have solved the linear regression case where $A \in \mathbb{R}^n$, let's solve the case where $A \in \mathbb{R}^{m \times n}$ has linearly independent columns and we want the projection $p = Ax$ that minimizes the error $e = b - p = b - Ax$. We will again show that we can solve this both using linear algebra and calculus.

<img src="/media/images/20260418_033620_Screenshot_2026-04-17_at_8.35.24_PM.png" alt="Screenshot 2026 04 17 at 8.35.24 PM" style="max-width: min(600px, 80%); height: auto;">

<center> $\textbf{Figure 3}$ Projection of $b$ onto the xy plane where presumably $A = I_2$. Note that $e \perp a_i, \forall a_i \in A$ </center>

#### Linear Algebra

Note that in the previous section, as shown by **Figure 2**, we saw that for any individual column vector $a_i$ of the matrix $A \in \mathbb{R}^{m \times n}$ defining its column space, the optimal $e$ is perpendicular to $a_i$.
This applies $\forall a_i \in A$, so $e$ is perpendicular to the column space of $A$ ($e \perp A$), giving us

$$ \langle e, a_1 \rangle = 0,\ \langle e, a_2 \rangle = 0,\ \dots,\ \langle e, a_n \rangle = 0 $$

Since each $a_i \perp e$, we have $A^Te = 0$, because the entries of $A^Te$ are exactly the dot products of each $a_i$ with $e$. Thus we get

$$ A^Te = A^T(b-p) = 0 $$

Solving for $p$:

$$ A^Tb - A^Tp = 0 \rightarrow A^Tb = A^Tp $$

Since $p$ is in the column space of $A$, we have $p = Ax$, giving

$$ A^Tb = A^TAx $$

**Note**: this is often called the *normal equation*. Using the fact that $A$ has linearly independent columns, we know that $A^TA$ is invertible and positive definite. Thus we get

$$ A^Tb = A^TAx \rightarrow (A^TA)^{-1}A^Tb = x $$

Some interesting tidbits to note:

- Looking at the form of the equation $A^Tb = A^TAx$: it says that if we map both $b$ and our projection $p = Ax$ into the row space of $A$ (by applying $A^T$), the results are equal.
- A solution to the normal equation **always** exists. This is because in $A^Tb = A^TAx$, both sides of the equation lie in the column space of $A^T$, since both sides are images under $A^T$.
- The solution to the normal equation is **unique**. Because all of the columns of $A$ are linearly independent, $A^TA$ is positive definite and therefore invertible. This means that no non-trivial (non-zero) solution to $A^TAx = 0$ exists (the only $x$ satisfying it is $x = 0$). From this we can see that there is only one linear combination of the columns of $A^TA$ per output in $range(A^TA)$. Thus the solution $x$ to the normal equation is unique, since $A^TA$ has full rank (columns linearly independent).
- From the expression $(A^TA)^{-1}A^Tb = x$ we get that the (Moore-Penrose) pseudoinverse of a non-invertible $A$ is $(A^TA)^{-1}A^T$ when $A$ has full column rank ($A$'s columns linearly independent, $rank(A) = n$ for $A \in \mathbb{R}^{m \times n}$).
- If $A$ is invertible, then this expression is equal to $A^{-1}$: $(A^TA)^{-1}A^T = A^{-1}A^{-T}A^T = A^{-1}I = A^{-1}$.

#### Calculus

Now let's find the same solution using calculus. We want to find the minimal $e$, so we are going to minimize $||e||^2$ as we did before, which makes finding the minimal $e$ a bit easier. Since we know that $e = b - p \rightarrow e = b - Ax$, we get

$$ e^Te = (b-Ax)^T(b-Ax) $$

Take the differential to get, for $f(x) = (b-Ax)^T(b-Ax)$,

$$ df = (-A\,dx)^T(b-Ax) + (b-Ax)^T(-A\,dx) $$

Recognize both terms are in $\mathbb{R}$, thus they are equal to their transposes:

$$ df = -2(b-Ax)^TA\, dx $$

Recognize $df = \langle f'(x), dx \rangle$, giving

$$ f'(x) = -2A^T(b-Ax) $$

Since (as stated before) we know that $e^Te$ is convex, setting $f'(x) = 0$ finds the minimizing $x$:

$$ f'(x) = 0 = -2A^T(b-Ax) $$

$$ -2A^Tb + 2A^TAx = 0 \rightarrow A^TAx = A^Tb $$

$$ (A^TA)^{-1}A^Tb = x $$

Which matches our solution from the linear algebra case!
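To close the loop, the normal-equation solution can be cross-checked against NumPy's built-in least-squares and pseudoinverse routines. A sketch with made-up random data; note it uses `np.linalg.solve` on $A^TAx = A^Tb$ rather than forming $(A^TA)^{-1}$ explicitly, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))   # tall matrix; columns independent with probability 1
b = rng.normal(size=6)

# Normal-equation solution: solve A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against library least-squares and the pseudoinverse
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
x_pinv = np.linalg.pinv(A) @ b
assert np.allclose(x_normal, x_lstsq)
assert np.allclose(x_normal, x_pinv)

# The residual e = b - Ax is orthogonal to every column of A
e = b - A @ x_normal
assert np.allclose(A.T @ e, 0.0)
```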