# Linear Regression Solution from Projections onto Subspaces

*by dokuDoku Math*

## What is Linear Regression for a Single Vector?

Suppose that we have the following linear equation with $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$:

$$ Ax = b $$

Furthermore, suppose that $b$ is the ground-truth object that we are attempting to predict from input $x$ using parameters $A$. The goal, then, is to make the prediction $Ax$ as close as possible to $b$. Formally, our error $e \in \mathbb{R}^m$ is defined as:

$$ e = b - Ax $$

We can equivalently minimize the easier squared L2 loss (see [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss) for cool insights), which has the same minimizer as $e$:

$$ ||b - Ax||^2 = (b-Ax)^T(b-Ax) = e^Te $$

$$ = ||e||^2 = \sum_i e_i^2 $$

Since we know that this L2 loss is a convex function (see [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss)), there is only one global minimum, and the smallest possible value of $||e||^2$ is achieved when $e = 0$. This makes sense intuitively, since $e = 0$ means your parameters $A$ give a perfect prediction.

### Phrasing as a Projection

Let's start with the one-dimensional case where, for $Ax = b$, we have $A, b \in \mathbb{R}^n$ and $x \in \mathbb{R}$. Since $A \in \mathbb{R}^n$ is a single vector whose span is a line, not a 2D or higher-dimensional subspace, we're going to rephrase $A$ as $a$.

<img src="/media/images/20260418_013015_Screenshot_2026-04-17_at_6.29.53_PM.png" alt="Screenshot 2026 04 17 at 6.29.53 PM" style="max-width: min(600px, 80%); height: auto;">

<center> $\textbf{Figure 1}$ 1D Projection of vector $b$ onto $a$ </center>

We can see in **Figure 1** the projection $p$ of $b$ onto the subspace (line) defined by $a$. Note that the projection $p$ can be stated as $ax$, where $x$ is a scaling parameter for the vector $a$ in this case.
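As a quick numeric sanity check of the definitions so far, here is a minimal sketch (assuming NumPy; the matrix $A$, parameters $x$, and target $b$ are made-up example values, not from the text) confirming that the three expressions for the squared L2 loss agree:

```python
import numpy as np

# Made-up example values: A is 3x2, x a parameter guess, b the target.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([0.5, -0.25])
b = np.array([1.0, 2.0, 3.0])

e = b - A @ x  # error vector e = b - Ax

# The three expressions for the squared L2 loss agree:
loss_norm = np.linalg.norm(b - A @ x) ** 2   # ||b - Ax||^2
loss_inner = e @ e                           # e^T e
loss_sum = np.sum(e ** 2)                    # sum_i e_i^2
assert np.isclose(loss_norm, loss_inner) and np.isclose(loss_inner, loss_sum)
```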
Furthermore, $e$ directly corresponds to our error, since

$$ e = b - p = b - ax $$

which is the vector version of our error. The equivalent L2 loss is also minimized when $e = 0$, i.e.

$$ e^Te = (b-p)^T(b-p) = (b-ax)^T(b-ax) $$

We can state this problem as:

<center> Given the vector $b$, what is the optimal approximation $p$ in the column space of $A$ such that $e = b - p$ is minimized? </center>

We can derive this using linear algebra and calculus.

#### Linear Algebra

Suppose that we want to find the optimal $p$ such that $e = b - p$ is minimized. We claim that the error $e$ is minimized when this vector is perpendicular to the column space defined by $A$. Consider, for the sake of contradiction, that this were false.

<img src="/media/images/20260418_023154_Screenshot_2026-04-17_at_7.30.52_PM.png" alt="Screenshot 2026 04 17 at 7.30.52 PM" style="max-width: min(600px, 80%); height: auto;">

<center> $\textbf{Figure 2}$ Example for why $e \perp a$ Optimal with $e'$ Counter Example </center>

Suppose, as seen in **Figure 2**, that we have $e'$ such that $e'$ is not perpendicular to $a$, but instead makes angle $\theta$ with it. Using the Law of Sines, we can relate the length of $e'$ to $\theta$:

$$ \frac{\sin \theta}{||b||} = \frac{\sin \phi}{||e'||} $$

Since we know that $||b||$ and $\phi$ are both constant, we can see that $||e'||$ is a function of $\theta$. Isolating $\sin \theta$ to one side we get

$$ \sin \theta = \frac{||b||\ \sin \phi}{||e'||} $$

We can see that the maximum value of $\sin \theta$ gives the smallest value of $||e'||$. Since $0 \leq \sin \theta \leq 1$ and $0 \leq \theta \leq \pi$, the value of $\theta$ that maximizes $\sin \theta$ (and thus minimizes $||e'||$) is $\pi/2$. Thus we can see that the smallest error we can get between the vector $b$ and the line spanned by $a$ is achieved when $e = b - p \perp a$.

#### Finding the optimal projection $p = ax$

Great, now let's find $p = ax$, the optimal projection for error $e$. Let's start with the fact that $e \perp a$.
This gives us

$$ e \cdot a = 0 \rightarrow (b-p) \cdot a = 0 \rightarrow a^T(b-p) = 0 $$

Solving for $p$:

$$ a^Tb - a^Tp = 0 \rightarrow a^Tb = a^Tp $$

Great, let's take it a step further and solve for $x$, since $p = ax$:

$$ a^Tb = a^Tp \rightarrow a^Tb = a^Tax $$

Since $a^Ta \in \mathbb{R}$, we can divide by it, giving us

$$ \frac{a^Tb}{a^Ta} = x $$

Note that in the denominator we must have $a^Ta$ and not $aa^T$: because $x \in \mathbb{R}$, we need the scalar $a^Ta \in \mathbb{R}$, not the matrix $aa^T \in \mathbb{R}^{n \times n}$.

Thus we have found the optimal projection $p$ that minimizes error $e$:

$$ p = \frac{a^Tb}{a^Ta} \cdot a $$

Notice that this is the vector $a$ times a scaling factor. We will see, in the case where $A$ is a matrix, that the right-hand side ($A^TAx$) effectively scales each column vector of the matrix $A^TA$ by a factor and linearly combines them to give a result in its range.

#### Calculus

Suppose that instead we used calculus, rather than linear algebra, to find the optimal $p$; we should get the same answer. We know that $p = ax$ and that $e = b - p = b - ax$. We want to minimize the error $e$. Note that the minimizer of $e$ is also the minimizer of $||e||^2 = e^Te$, so we will work with $e^Te$, since it's easier.
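Before carrying out the calculus derivation, the closed-form projection just derived is easy to check numerically. A minimal sketch, assuming NumPy and using arbitrary made-up vectors:

```python
import numpy as np

# Arbitrary made-up vectors; any nonzero a works.
a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 1.0, 1.0])

x = (a @ b) / (a @ a)   # optimal scaling factor x = a^T b / a^T a
p = x * a               # projection p = ax of b onto the line spanned by a
e = b - p               # residual error e = b - p

# At the optimum the error is perpendicular to a:
assert np.isclose(a @ e, 0.0)
```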
We get:

$$ e^Te = (b-ax)^T(b-ax) $$

Since $e^Te$ is a function of $x$ with $b, a$ constant, we have

$$ f(x) = (b-ax)^T(b-ax) $$

Take the differential to get

$$ df = d(b-ax)^T(b-ax) + (b-ax)^T d(b-ax) $$

$$ = (-a\,dx)^T(b-ax) + (b-ax)^T(-a\,dx) $$

Recognize that both terms are in $\mathbb{R}$ and use the fact that for some arbitrary $z \in \mathbb{R}$ we have $z^T = z$, giving us

$$ df = -2(b-ax)^Ta\, dx $$

Recognizing that $df = \langle f'(x), dx \rangle$, we see that

$$ f'(x) = -2(b-ax)^Ta = -2a^T(b-ax) $$

We know that this function is convex from [Properties of L2 Loss](https://www.noteblogdoku.com/blog/properties-l2-loss), thus finding where $f'(x) = 0$ is equivalent to finding the minimum:

$$ f'(x) = 0 = -2(b-ax)^Ta $$

Solving for $x$:

$$ 0 = -2a^T(b-ax) \rightarrow 0 = -2a^Tb + 2a^Tax \rightarrow a^Tb = a^Tax $$

$$ x = \frac{a^Tb}{a^Ta} $$

Which matches what we have from the linear algebra solution! We have solved the same problem two ways, very neat!

### Linear Regression with $A \in \mathbb{R}^{m \times n}$

Since we have solved the linear regression case where $A \in \mathbb{R}^n$, let's solve the case where $A \in \mathbb{R}^{m \times n}$ has linearly independent columns and we want the projection $p = Ax$ that minimizes the error $e = b - p = b - Ax$. We will again show that we can solve this both using linear algebra and calculus.

<img src="/media/images/20260418_033620_Screenshot_2026-04-17_at_8.35.24_PM.png" alt="Screenshot 2026 04 17 at 8.35.24 PM" style="max-width: min(600px, 80%); height: auto;">

<center> $\textbf{Figure 3}$ Projection of $b$ onto the xy plane where presumably $A = I_2$. Note that $e \perp a_i, \forall a_i \in A$ </center>

#### Linear Algebra

Note that in the previous section, as shown by **Figure 2**, we saw that for any individual column vector $a_i$ of the matrix $A \in \mathbb{R}^{m \times n}$ defining its column space, the optimal $e$ is perpendicular to $a_i$.
This applies $\forall a_i \in A$, so $e$ is perpendicular to the column space of $A$ ($e \perp A$), giving us

$$ \langle e, a_1 \rangle = 0,\ \langle e, a_2 \rangle = 0,\ \dots,\ \langle e, a_n \rangle = 0 $$

Since each $a_i \perp e$, we have $A^Te = 0$, because the entries of $A^Te$ are exactly the dot products of each $a_i$ with $e$. Thus we get

$$ A^Te = A^T(b-p) = 0 $$

Solving for $p$:

$$ A^Tb - A^Tp = 0 \rightarrow A^Tb = A^Tp $$

Since $p$ is in the column space of $A$, we have $p = Ax$, giving

$$ A^Tb = A^TAx $$

**Note**: this is often called the *normal equation*. Using the fact that $A$ has linearly independent columns, we know that $A^TA$ is invertible and positive definite. Thus we get

$$ A^Tb = A^TAx \rightarrow (A^TA)^{-1}A^Tb = x $$

Some interesting tidbits to note:

- Looking at the form of the equation $A^Tb = A^TAx$: it says that if we map both $b$ and our projection $p = Ax$ into the row space of $A$ (by applying $A^T$), the results are equal.
- A solution to the normal equation **always** exists. This is because in $A^Tb = A^TAx$, both sides of the equation lie in the column space of $A^T$, since both sides are images under $A^T$.
- The solution to the normal equation is **unique**. Because all of the columns of $A$ are linearly independent, $A^TA$ is positive definite and therefore invertible. This means that no non-trivial (non-zero) solution to $A^TAx = 0$ exists (the only $x$ satisfying it is $x = 0$). From this we can see that there is only one linear combination of the columns of $A^TA$ per output in $range(A^TA)$. Thus the solution $x$ to the normal equation is unique, since $A^TA$ has full rank (columns linearly independent).
- From the expression $(A^TA)^{-1}A^Tb = x$ we get that the (Moore-Penrose) pseudoinverse of a non-invertible $A$ is $(A^TA)^{-1}A^T$ when $A$ has full column rank ($A$'s columns linearly independent, $rank(A) = n$ for $A \in \mathbb{R}^{m \times n}$).
- If $A$ is invertible, then this expression is equal to $A^{-1}$: $(A^TA)^{-1}A^T = A^{-1}A^{-T}A^T = A^{-1}I = A^{-1}$.

#### Calculus

Now let's find the same solution using calculus. We want to find the minimal $e$, so we are going to minimize $||e||^2$ as we did before, which makes finding the minimal $e$ a bit easier. Since we know that $e = b - p \rightarrow e = b - Ax$, we get

$$ e^Te = (b-Ax)^T(b-Ax) $$

Take the differential to get, for $f(x) = (b-Ax)^T(b-Ax)$,

$$ df = (-A\,dx)^T(b-Ax) + (b-Ax)^T(-A\,dx) $$

Recognize both terms are in $\mathbb{R}$, thus they are equal to their transposes:

$$ df = -2(b-Ax)^TA\, dx $$

Recognize $df = \langle f'(x), dx \rangle$, giving

$$ f'(x) = -2A^T(b-Ax) $$

Since (as stated before) we know that $e^Te$ is convex, setting $f'(x) = 0$ finds the minimizing $x$:

$$ f'(x) = 0 = -2A^T(b-Ax) $$

$$ -2A^Tb + 2A^TAx = 0 \rightarrow A^TAx = A^Tb $$

$$ (A^TA)^{-1}A^Tb = x $$

Which matches our solution from the linear algebra case!
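To close the loop, the normal-equation solution can be cross-checked against NumPy's built-in least-squares and pseudoinverse routines. A sketch with made-up random data; note it uses `np.linalg.solve` on $A^TAx = A^Tb$ rather than forming $(A^TA)^{-1}$ explicitly, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))   # tall matrix; columns independent with probability 1
b = rng.normal(size=6)

# Normal-equation solution: solve A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against library least-squares and the pseudoinverse
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
x_pinv = np.linalg.pinv(A) @ b
assert np.allclose(x_normal, x_lstsq)
assert np.allclose(x_normal, x_pinv)

# The residual e = b - Ax is orthogonal to every column of A
e = b - A @ x_normal
assert np.allclose(A.T @ e, 0.0)
```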