# Fitting a Simple Linear Model using Gradient Descent

*by dokuDoku, Machine Learning Foundations*

In [linear_eq_wi_grads.py](https://github.com/DoKu88/simpleNeuralNet/blob/main/linear_eq_wi_grads.py), from the attached GitHub repository, we fit a simple linear model:

$$ Eq\ 1.\ y = Mx + b$$

such that $y,b \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, $M \in \mathbb{R}^{m \times n}$, with squared loss function:

$$ Eq\ 2.\ L(y) = \sum_i (g_i - y_i)^2$$

such that $g,y \in \mathbb{R}^m$. The goal is to minimize $L(y)$ given some ground-truth output vector $g$.

This problem is a simple multi-dimensional linear equation; in fact, Equation 1 is the matrix form of linear regression. It is well known that least squares has a closed-form solution, $\hat{\beta} = (X^TX)^{-1}X^Ty$, where $X$ is the design matrix of inputs with a column of ones absorbing the bias $b$. However, we are optimizing $L(y)$ with gradient descent today, since this linear equation is a building block for neural networks. Specifically, a multi-layer perceptron (MLP) typically stacks linear layers of the form $y = Mx + b$ with ReLU non-linearities $f(x_i) = \max(0,x_i)\ \forall i$. Thus we are going to introduce the backpropagation / gradient descent algorithm for the simple linear model with the squared loss function, to show how this essential building block can be used in a neural network.

## Chain Rule

As you may remember from calculus, the chain rule enables us to find the gradients of composed functions. In our example, the composed functions are $L(y) = \sum_i (g_i - y_i)^2$ with $y = Mx + b$. To train the model based on the loss function, we need the gradients of the loss $L$ with respect to matrix $M$ and bias $b$. We take the gradients with respect to our parameters because we want to adjust the parameters to minimize the loss.
The gradients tell us in which direction to update each parameter to reduce the loss. Note that the gradient is a vector giving the slope of the loss surface at a point, and it points in the direction of steepest ascent. Since we are seeking a minimum, not a maximum, we will multiply the gradient by $-1$, as we will see later. To find $\frac{\partial L}{\partial b}$, we are going to use the Chain Rule to decompose our composed function. As stated by the Chain Rule: $$ \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial b}$$ Similarly, to find $\frac{\partial L}{\partial M}$: $$ \frac{\partial L}{\partial M} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial M}$$ Thus we must first find $\frac{\partial L}{\partial y}$, given that $L(y) = \sum_i (g_i - y_i)^2$.

### Find $\frac{\partial L}{\partial y}$

We can do this component-wise: $$L(y_i) = (g_i - y_i)^2$$ giving $$\frac{\partial L}{\partial y_i} = 2(g_i - y_i) \cdot -1 = -2(g_i - y_i)$$ Because $L(y) \in \mathbb{R}$ and $y \in \mathbb{R}^m$, we know that $\frac{\partial L}{\partial y} = [\frac{\partial L}{\partial y_1}, \frac{\partial L}{\partial y_2}, \ldots, \frac{\partial L}{\partial y_m}] \in \mathbb{R}^{1 \times m}$. Since $\frac{\partial L}{\partial y_i} = -2(g_i - y_i)$, it is easy to see that $$\frac{\partial L}{\partial y} = [-2(g_1 - y_1), -2(g_2 - y_2), \ldots, -2(g_m - y_m)] = -2(g-y)^T$$

### Find $\frac{\partial y}{\partial b}$

We similarly need to find $\frac{\partial y}{\partial b}$ to ultimately reach our goal, $\frac{\partial L}{\partial b}$. Let's decompose this into partial derivatives, given that $y,b \in \mathbb{R}^m$. Taking the derivative of the linear equation, Eq 1, $y = Mx + b$, with respect to $b$ gives us: $$ \frac{\partial y}{\partial b} = 0 + I_m$$ We can see from the single variable version of this problem s.t.
$y,\mu,x,b \in \mathbb{R}$ that $y = \mu x + b$ gives $\frac{\partial y}{\partial b} = 0 + 1 = 1$, which gives some intuition for how we derive $I_m$. Intuition is not enough! We must derive this rigorously! We will break $\frac{\partial y}{\partial b}$ into casework, using the cases:

1. $\frac{\partial y_i}{\partial b_j}$ where $i=j$
2. $\frac{\partial y_i}{\partial b_j}$ where $i \neq j$

#### Case 1: $\frac{\partial y_i}{\partial b_j}$ where $i=j$

Given $y = Mx + b$, the $Mx$ term does not depend on $b$, so it does not contribute to $\frac{\partial y}{\partial b}$, and by extension $\frac{\partial y_i}{\partial b_j}$. Thus, for this derivative, we really have $y = b$, i.e. $y_i = b_i\ \forall i$. Hence for $\frac{\partial y_i}{\partial b_j}$ where $i=j$, we have: $\frac{\partial y_i}{\partial b_i} = \frac{\partial b_i}{\partial b_i} = 1$. Thus we have shown that $\frac{\partial y_i}{\partial b_j} = 1$ where $i=j$.

#### Case 2: $\frac{\partial y_i}{\partial b_j}$ where $i \neq j$

As in Case 1, we effectively have $y = b$, i.e. $y_i = b_i\ \forall i$. Hence for $\frac{\partial y_i}{\partial b_j}$ where $i \neq j$, we have: $\frac{\partial y_i}{\partial b_j} = \frac{\partial b_i}{\partial b_j} = 0$.
Thus we have shown that $\frac{\partial y_i}{\partial b_j} = 0$ where $i \neq j$. Given these two cases, let's fill out the Jacobian of $\frac{\partial y}{\partial b}$: $$ \frac{\partial y}{\partial b} = \begin{bmatrix} \frac{\partial y_1}{\partial b_1} & \frac{\partial y_1}{\partial b_2} & \cdots & \frac{\partial y_1}{\partial b_m} \\ \frac{\partial y_2}{\partial b_1} & \frac{\partial y_2}{\partial b_2} & \cdots & \frac{\partial y_2}{\partial b_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial b_1} & \frac{\partial y_m}{\partial b_2} & \cdots & \frac{\partial y_m}{\partial b_m} \end{bmatrix} $$ Filling in $\frac{\partial y_i}{\partial b_j}$ using the casework, we get: $$ \frac{\partial y}{\partial b} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = I_m $$

### Find $\frac{\partial L}{\partial b}$

To find $\frac{\partial L}{\partial b}$ we invoke the chain rule, $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial b}$. First let's check the matrix shapes: $\frac{\partial L}{\partial b} \in \mathbb{R}^{1 \times m}$, $\frac{\partial L}{\partial y} \in \mathbb{R}^{1 \times m}$, and $\frac{\partial y}{\partial b} \in \mathbb{R}^{m \times m}$. The shapes for our matrix multiplication line up! To calculate $\frac{\partial L}{\partial b}$ we simply substitute in the values we found: $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial b} = -2(g-y)^T \cdot I_m = -2(g-y)^T$. Thus we have shown that $$ \frac{\partial L}{\partial b} = -2(g-y)^T$$

### Find $\frac{\partial y}{\partial M}$

The **important** part of this derivation is $\frac{\partial y}{\partial M}$, since it lets us update the weights of the matrix, which characterize our linear model and, by extension, the multi-layer perceptrons (MLPs) that are foundational for machine learning.
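To make the result concrete, here is a quick numerical check of $\frac{\partial L}{\partial b} = -2(g-y)$ against centered finite differences. This is a sketch, not the repo's code; the values of `M`, `x`, `b`, and `g` below are invented for illustration:

```python
def forward(M, x, b):
    # y = Mx + b, computed row by row
    return [sum(M[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(b))]

def loss(y, g):
    # L(y) = sum_i (g_i - y_i)^2
    return sum((gi - yi) ** 2 for gi, yi in zip(g, y))

M = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # m = 3, n = 2
x = [0.5, -1.0]
b = [0.1, 0.2, 0.3]
g = [1.0, 0.0, -1.0]                       # ground truth

y = forward(M, x, b)
analytic = [-2 * (gi - yi) for gi, yi in zip(g, y)]   # dL/db = -2(g - y)

# centered finite differences on each b_i
eps = 1e-6
numeric = []
for i in range(len(b)):
    b_hi = b[:]; b_hi[i] += eps
    b_lo = b[:]; b_lo[i] -= eps
    numeric.append((loss(forward(M, x, b_hi), g)
                    - loss(forward(M, x, b_lo), g)) / (2 * eps))

assert all(abs(a - n) < 1e-4 for a, n in zip(analytic, numeric))
```

The assertion passes: the analytic gradient matches the finite-difference estimate in every component.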
It has been noted that since $y \in \mathbb{R}^m$ and $M \in \mathbb{R}^{m \times n}$, $\frac{\partial y}{\partial M}$ is a three-dimensional tensor of shape $\mathbb{R}^{m \times m \times n}$. Despite this, we will find that it reduces nicely to a matrix of size $\mathbb{R}^{m \times n}$ once the chain rule is invoked. Computing this derivative of a vector with respect to a matrix in full requires more derivation and matrix calculus, which *we will do* in a *subsequent post*. For today, let's derive it using some casework. Firstly, note that $$ y = \begin{bmatrix} M_{1,1} & M_{1,2} & \cdots & M_{1,n} \\ M_{2,1} & M_{2,2} & \cdots & M_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ M_{m,1} & M_{m,2} & \cdots & M_{m,n} \\ \end{bmatrix} \times \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n}\\ \end{bmatrix} = \begin{bmatrix} M_{1,:}\, x \\ M_{2,:}\, x \\ \vdots \\ M_{m,:}\, x \\ \end{bmatrix} $$ clearly giving us that $$ y_i = M_{i,:}\, x$$ From this equation we can clearly see from vector calculus that $\frac{\partial y_i}{\partial M_{i,:}} = x^T$, which also aligns with our matrix shapes, since $y_i \in \mathbb{R}$, $M_{i,:} \in \mathbb{R}^{1 \times n}$, and $x^T \in \mathbb{R}^{1 \times n}$ (as $x \in \mathbb{R}^{n}$). Additionally, since $y_i$ does not involve any row other than $M_{i,:}$, it is trivial to see that $\frac{\partial y_i}{\partial M_{j,:}} = 0$ for $i \neq j$.
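A small sketch of this casework: bumping a single entry $M_{i,j}$ moves only $y_i$, and by exactly $x_j$ per unit, which is the entrywise statement of $\frac{\partial y_i}{\partial M_{i,:}} = x^T$ and $\frac{\partial y_i}{\partial M_{j,:}} = 0$. The example values are invented:

```python
def forward(M, x):
    # y = Mx, row by row (the bias is omitted; it plays no role here)
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

M = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.0]

y0 = forward(M, x)
M[0][1] += 1.0                  # bump the single entry M_{1,2} by exactly 1
y1 = forward(M, x)

deltas = [a - b for a, b in zip(y1, y0)]
print(deltas)                   # -> [-1.0, 0.0]: only y_1 moves, and by x_2
```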
**Note**: for another, deeper derivation of this, check out my newer blog post: [The Core of Backprop: Deriving $\frac{\partial L}{\partial M}$ to update matrix $M$](https://www.noteblogdoku.com/blog/core-of-backprop-deriving-dl-dm)

### Find $\frac{\partial L}{\partial M}$

To find $\frac{\partial L}{\partial M}$ we are going to use the chain rule: $$ \frac{\partial L}{\partial M} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial M}$$ Specifically, we can make our derivation **easier** by restating the chain rule row by row: $$ \frac{\partial L}{\partial M_{i,:}} = \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial M_{i,:}}$$ Note that **technically** we would write it as $\frac{\partial L}{\partial M_{i,:}} = \sum_j \frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial M_{i,:}}$; however, we derived that $\frac{\partial y_j}{\partial M_{i,:}} = 0$ for $i \neq j$, so only the $j = i$ term survives. Given this, we can find the derivative of the loss $L$ with respect to a **row** of matrix $M$, which greatly simplifies our derivation. Now all we have to do is plug in our results to get the answer: $\frac{\partial L}{\partial M_{i,:}} = \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial M_{i,:}} = -2(g_i - y_i) \cdot x^T$. Since $\frac{\partial L}{\partial M_{i,:}} = -2(g_i - y_i) \cdot x^T\ \forall i \in [1,m]$, stacking these rows gives us the final result $$ \frac{\partial L}{\partial M} = -2(g - y) \cdot x^T $$ As we can see, $-2(g - y) \in \mathbb{R}^{m \times 1}$ and $x^T \in \mathbb{R}^{1 \times n}$, giving us the **outer product** $-2(g-y) \cdot x^T \in \mathbb{R}^{m \times n}$, which has the same dimensions as $M$. Note that $-2(g - y)$ is simply $\frac{\partial L}{\partial y}$ transposed.
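As a sanity check on the outer-product formula, the sketch below builds $\frac{\partial L}{\partial M} = -2(g-y)\,x^T$ entry by entry and compares one entry against a finite-difference estimate. All values are invented for illustration:

```python
def forward(M, x, b):
    # y = Mx + b
    return [sum(M[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(b))]

def loss(y, g):
    # L(y) = sum_i (g_i - y_i)^2
    return sum((gi - yi) ** 2 for gi, yi in zip(g, y))

M = [[1.0, -1.0], [2.0, 0.5]]
x = [0.3, 0.7]
b = [0.0, 0.1]
g = [1.0, -1.0]

y = forward(M, x, b)
# dL/dM as the outer product -2(g - y) x^T, built entry by entry
grad = [[-2 * (g[i] - y[i]) * x[j] for j in range(len(x))]
        for i in range(len(M))]

# finite-difference check of the single entry M_{2,1}
eps = 1e-6
i, j = 1, 0
M[i][j] += eps
up = loss(forward(M, x, b), g)
M[i][j] -= 2 * eps
down = loss(forward(M, x, b), g)
M[i][j] += eps                      # restore the original entry
numeric = (up - down) / (2 * eps)

assert abs(grad[i][j] - numeric) < 1e-4
```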
## Gradient Descent Update

Now for the easy part: since we have $\frac{\partial L}{\partial M}$ and $\frac{\partial L}{\partial b}$, we can update our parameters $M$ and $b$ respectively. To do this we simply apply the update rule: $$ value' = value - \alpha \cdot \nabla_{value} L$$ where $\alpha$ is the **learning rate** hyperparameter. For our parameters $M$ and $b$ (reshaping the row gradient $-2(g-y)^T$ into $b$'s column shape), the updated parameters $M'$ and $b'$ are: $$ b' = b - \alpha \cdot \frac{\partial L}{\partial b} = b - \alpha \cdot -2(g-y) = b + 2\alpha (g-y) $$ $$ M' = M - \alpha \cdot \frac{\partial L}{\partial M} = M - \alpha (-2(g - y) \cdot x^T) = M + 2\alpha ((g - y) \cdot x^T)$$
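Putting the two update rules together, here is a minimal training-loop sketch in the spirit of linear_eq_wi_grads.py; the data is invented and the repo's actual script may differ:

```python
def forward(M, x, b):
    # y = Mx + b
    return [sum(M[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(b))]

def loss(y, g):
    # L(y) = sum_i (g_i - y_i)^2
    return sum((gi - yi) ** 2 for gi, yi in zip(g, y))

x = [1.0, 2.0]
g = [3.0, -1.0]                     # ground-truth output to fit
M = [[0.0, 0.0], [0.0, 0.0]]        # parameters initialized to zero
b = [0.0, 0.0]
alpha = 0.05                        # learning rate

for step in range(500):
    y = forward(M, x, b)
    err = [gi - yi for gi, yi in zip(g, y)]        # (g - y)
    # M' = M + 2*alpha*(g - y) x^T  and  b' = b + 2*alpha*(g - y)
    for i in range(len(M)):
        for j in range(len(x)):
            M[i][j] += 2 * alpha * err[i] * x[j]
        b[i] += 2 * alpha * err[i]

print(loss(forward(M, x, b), g))    # effectively zero after training
```

With a single data point the loss is a convex quadratic in the parameters, so for a small enough $\alpha$ the residual $(g - y)$ shrinks by a constant factor every step and the loss converges to zero.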