The Core of Backprop 2: Deriving $\frac{\partial L}{\partial M}$ using Matrix Calculus by dokuDoku Machine Learning Foundations 🔍 This is a sequel to [The Core of Backprop: Deriving $\frac{\partial L}{\partial M}$](https://www.noteblogdoku.com/blog/core-of-backprop-deriving-dl-dm); this time the derivation is shorter and simpler, using **Matrix Calculus** instead of working component by component. For context on why this is important, please read the [Neural Network Implementation from Scratch blog post](https://www.noteblogdoku.com/blog/neural-network-implementation-from-scratch) and the [Fitting a Simple Linear Model using Gradient Descent blog post](https://www.noteblogdoku.com/blog/fitting-a-simple-linear-model-using-gradient-descent). The focus of this blog is to derive $\frac{\partial L}{\partial M}$ from the equations: $$ y = Mx$$ $$ L(y) = \sum_i (g_i - y_i)^2 = (g-y)^T(g-y) $$ with vector and matrix shapes $y, g \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, $M \in \mathbb{R}^{m \times n}$, $L \in \mathbb{R}$. Note that the linear layer of a neural network/multilayer perceptron (MLP) is typically $y = Mx + b$, where $b \in \mathbb{R}^m$ is a bias term. However, when taking the derivative with respect to the matrix $M$, the bias term is unimportant, so we will be ignoring it today. Additionally, $g$ stands for ground truth and $y$ is the value our simple model predicts. A neural network typically interleaves these linear layers with simple nonlinearities such as ReLU, so taking the derivative of $y = Mx$ with respect to $M$ is important. The goal is to derive $\frac{\partial L}{\partial M}$ so that we can update the weight matrix $M$ to minimize $L(Mx)$. **Note:** this derivation uses *numerator* notation for derivatives, which makes the chain rule more intuitive than *denominator* notation.
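As a concrete anchor for the shapes above, here is a minimal numerical sketch in NumPy. The sizes $m = 3$, $n = 4$ and the random data are made up for illustration; it just sets up $y = Mx$ and checks that the two forms of the loss agree.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4                      # made-up sizes for illustration
M = rng.standard_normal((m, n))  # weight matrix, R^{m x n}
x = rng.standard_normal(n)       # input, R^n (held fixed)
g = rng.standard_normal(m)       # ground truth, R^m

y = M @ x                        # prediction, R^m
L = (g - y) @ (g - y)            # (g - y)^T (g - y)

# Sum-of-squares form and inner-product form of the loss agree.
assert np.isclose(L, np.sum((g - y) ** 2))
```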
## Differential of $L$, $dL$ To start, let's take the differential of $L$: $$ L = (g-y)^T(g-y) \rightarrow dL = -(g-y)^Tdy - dy^T(g-y) $$ Of note, $L, dL \in \mathbb{R}$ and $y, dy \in \mathbb{R}^m$; i.e., each differential has the same shape as the quantity it is the differential of. Hence $(g-y)^T dy$, with $(g-y)^T \in \mathbb{R}^{1 \times m}$ and $dy \in \mathbb{R}^{m \times 1}$, gives $$(g-y)^T dy \in \mathbb{R}$$ Similarly, $dy^T (g-y)$, with $dy^T \in \mathbb{R}^{1 \times m}$ and $(g-y) \in \mathbb{R}^{m \times 1}$, gives $$dy^T (g-y) \in \mathbb{R}$$ Since $dy^T (g-y)$ is a scalar in $\mathbb{R}$, it equals its own transpose, $dy^T (g-y) = (dy^T (g-y))^T = (g-y)^T dy$, giving $$ dL = -(g-y)^Tdy - dy^T(g-y) = -2(g-y)^Tdy$$ Since the derivative is a linear operator, for a generic function $f$ we have $df = f'(x)[dx]$ (the brackets $[\,]$ emphasize that $f'(x)$ acts as an operator on $dx$; nice but not necessary), so we can restate this as $$ dL = -2(g-y)^T[dy] $$ ## Differential of $y$, $dy$ Given $y = Mx$ with $y \in \mathbb{R}^m$, $x \in \mathbb{R}^n$ and $M \in \mathbb{R}^{m \times n}$, let's use the product rule to get: $$ y = Mx \rightarrow dy = (dM)x + M\,dx$$ Note that since $x$ is constant, $dx = 0$. This is because in our machine learning algorithm the **input data $x$** stays constant with respect to our network; the parameters/weights $M$ are what we are optimizing and changing. Hence: $$ dy = (dM)x + M \cdot 0 \rightarrow dy = (dM)x$$ Note that $y, dy \in \mathbb{R}^m$, $M, dM \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$; hence $(dM)x$ is $\mathbb{R}^{m \times n} \cdot \mathbb{R}^n \rightarrow \mathbb{R}^m$, which is a quick shape check. ### What do I do with $dy = (dM)x$? We would like to transform this into the Jacobian form $dy = \frac{\partial y}{\partial M}[dM]$, which makes the update to $M$ explicit.
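Before moving on, the two differentials just derived can be sanity-checked numerically. A minimal sketch with made-up sizes: perturb $M$ by a tiny $dM$ and compare $dy = (dM)x$ and $dL \approx -2(g-y)^T dy$ against direct recomputation. Note that $dy = (dM)x$ holds exactly here because $y$ is linear in $M$, while the $dL$ check is first-order (the quadratic remainder is $O(\|dM\|^2)$).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4                      # made-up sizes for illustration
M = rng.standard_normal((m, n))
x = rng.standard_normal(n)
g = rng.standard_normal(m)

def loss(M):
    y = M @ x
    return (g - y) @ (g - y)

y = M @ x
eps = 1e-6
dM = eps * rng.standard_normal((m, n))  # tiny perturbation of the weights

dy_pred = dM @ x                    # dy = (dM) x
dL_pred = -2.0 * (g - y) @ dy_pred  # dL = -2 (g - y)^T dy

dy_true = (M + dM) @ x - y          # exact change in y
dL_true = loss(M + dM) - loss(M)    # exact change in L

assert np.allclose(dy_pred, dy_true)            # exact (y is linear in M)
assert np.isclose(dL_pred, dL_true, rtol=1e-4)  # first-order in dM
```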
Given that we have $dy = (dM)x$, which in the abstract form is $dy = \frac{\partial y}{\partial M} [dM]$, we want to find $\frac{\partial y}{\partial M}$, which will tell us how to update the weights of $M$. However, this is a multidimensional tensor, since $dy \in \mathbb{R}^m$ and $dM \in \mathbb{R}^{m \times n}$ give $\frac{\partial y}{\partial M} \in \mathbb{R}^{m \times n \times m}$. To go from the linear-operator form of matrix calculus to the more conventional Jacobian form, we vectorize $M$; in numerator layout this will eventually give an update of shape $\mathbb{R}^{n \times m}$, which we will have to transpose to match the shape of $M$. Note: we will be referencing some proofs from the **Appendix**. First, let's vectorize both sides to get: $$ dy = (dM)x \rightarrow d\,vec(y) = vec(dMx) $$ Of note, $vec(dy) = d\,vec(y)$ because both $vec(\cdot)$ and $d$ (the differential) are linear operators. Use **Proof 1**, $vec(CA^T) = (A \otimes I) vec(C)$, to get $$ vec(dMx) = (x^T \otimes I_m) vec(dM) $$ with $I_m \in \mathbb{R}^{m \times m}$, since $I$ matches the number of rows of $dM \in \mathbb{R}^{m \times n}$. Note that since $y, dy \in \mathbb{R}^m$ they are already in vector form, thus $vec(y) = y$ (likewise for $dy$ and $x$). In vector form we now have: $$ dy = vec(dMx) = (x^T \otimes I_m) vec(dM)$$ Given this equation, recall that the general form for finding the gradient of some function $f$ is $$df = f'(x)[dx] = \frac{\partial f}{\partial x} [dx]$$ I.e., $\frac{\partial f}{\partial x}$ is the quantity whose inner product with $dx$ equals $df$, i.e.
$$ \langle \frac{\partial f}{\partial x}, dx \rangle = df$$ By this definition, $$ dy = vec(dMx) = (x^T \otimes I_m) [vec(dM)]$$ tells us that $$ \frac{\partial y}{\partial vec(M)} = x^T \otimes I_m $$ Shape check: $dy \in \mathbb{R}^m$ and $vec(dM) \in \mathbb{R}^{mn}$ give $\frac{\partial y}{\partial vec(M)} \in \mathbb{R}^{m \times mn}$, which matches $x^T \otimes I_m$, where every element $x_i$, $i \in [1,n]$, multiplies $I_m \in \mathbb{R}^{m \times m}$, giving $x^T \otimes I_m \in \mathbb{R}^{m \times mn}$. ## Chain Rule: Substitute $dy$ into the $dL$ Equation Recall from earlier $$ dL = -2(g-y)^T[dy] $$ and $$ dy = (dM)x \rightarrow vec(dy) = vec(dMx) = (x^T \otimes I_m) [vec(dM)]$$ Recall $vec(dy) = dy$, and substitute $dy$ into $dL = -2(g-y)^T[dy]$, giving $$ dL = -2(g-y)^T vec(dMx) = -2(g-y)^T (x^T \otimes I_m) [vec(dM)]$$ Note the general form $$df = f'(x)[dx] = \frac{\partial f}{\partial x} [dx]$$ and that $$ \langle \frac{\partial f}{\partial x}, dx \rangle = df$$ This gives us $$\frac{\partial L}{\partial vec(M)} = -2(g-y)^T (x^T \otimes I_m)$$ Checking the shapes, $-2(g-y)^T \in \mathbb{R}^{1\times m}$ and $x^T \otimes I_m \in \mathbb{R}^{m \times mn}$ give $$\frac{\partial L}{\partial vec(M)} = -2(g-y)^T (x^T \otimes I_m) \in \mathbb{R}^{1 \times mn}$$ To get $\frac{\partial L}{\partial M}$ from $\frac{\partial L}{\partial vec(M)}$, devectorize both sides: $$ devec(\frac{\partial L}{\partial vec(M)}) = \frac{\partial L}{\partial M} = devec(-2(g-y)^T (x^T \otimes I_m))$$ Using **Proof 2** we have (for arbitrary vectors $v \in \mathbb{R}^m$, $x \in \mathbb{R}^n$) $devec(v^T (x^T \otimes I_m)) = xv^T$; applied here, $$ devec(-2(g-y)^T (x^T \otimes I_m)) = -2x(g-y)^T $$ giving $$\frac{\partial L}{\partial M} = -2x(g-y)^T \in \mathbb{R}^{n \times m}$$ Under *numerator* layout Jacobian notation, $\frac{\partial L}{\partial M}$ appears as an $n \times m$ object. The gradient used in optimization, with the same shape as $M$, is its transpose in *denominator* layout. The quantity we will use to update $M$ in gradient descent is: $$ (-2x(g-y)^T)^T = -2(g-y)x^T \in \mathbb{R}^{m \times n}$$ This is **exactly** the update rule from [The Core of Backprop: Deriving $\frac{\partial L}{\partial M}$](https://www.noteblogdoku.com/blog/core-of-backprop-deriving-dl-dm). However, *this time* it was completed using **Matrix Calculus**, resulting in a shorter, cleaner derivation. <!-- ## 3 Line Derivation Using Trace! --> --- ## Appendix ### Proof 1 $vec(CA^T) = (A \otimes I_m) vec(C)$ s.t. $A \in \mathbb{R}^{p \times n}$, $C \in \mathbb{R}^{m \times n}$, giving $CA^T \in \mathbb{R}^{m \times p}$. Let's derive this proof using some matrix algebra! Let's look at the columns of $CA^T$. The first column of $CA^T$ is a linear combination of the columns of $C$ whose coefficients are given by the first column of $A^T$ (the first row of $A$): $$ \text{column 1 of } CA^T = \sum_j a_{1j} c_j$$ where $a_{1j}$ is the $j$th element of the first row of $A$ and $c_j$ is the $j$th column vector of $C$. This gives us $$ \operatorname{vec}(CA^T) = \begin{pmatrix} \sum_j a_{1j} c_j \\ \sum_j a_{2j} c_j \\ \vdots \end{pmatrix} = \underbrace{ \begin{pmatrix} a_{11} I & a_{12} I & \cdots \\ a_{21} I & a_{22} I & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} }_{A \otimes I} \underbrace{ \begin{pmatrix} c_1 \\ c_2 \\ \vdots \end{pmatrix} }_{\operatorname{vec}(C)} $$ Thus we have derived $$ vec(CA^T) = (A \otimes I_m) vec(C) $$ ### Proof 2 $devec(v^T (x^T \otimes I_m)) = xv^T$ s.t. $x \in \mathbb{R}^n$, $v \in \mathbb{R}^m$, and $xv^T \in \mathbb{R}^{n \times m}$. Let's derive this proof using some matrix algebra!
To restate: $x \in \mathbb{R}^n$, $v \in \mathbb{R}^m$, making $(x^T \otimes I_m) \in \mathbb{R}^{m \times mn}$ and our desired result $xv^T \in \mathbb{R}^{n \times m}$. To start, by the definition of the Kronecker product, $$ v^T(x^T \otimes I_m) = v^T [x_1I_m, x_2I_m, \ldots, x_nI_m]$$ Note that in this step the shape $(x^T \otimes I_m) \in \mathbb{R}^{m \times mn}$ becomes trivial to see: $n$ blocks of size $m \times m$ laid side by side. Written out in full: $$ v^T(x^T \otimes I_m) = v^T \left\lbrack \begin{array}{ccc|ccc|c|ccc} x_1 & \cdots & 0 & x_2 & \cdots & 0 & & x_n & \cdots & 0 \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots & \cdots & \vdots & \ddots & \vdots \\ 0 & \cdots & x_1 & 0 & \cdots & x_2 & & 0 & \cdots & x_n \end{array} \right\rbrack $$ where the matrix has $m$ rows and $mn$ columns. Multiplying a matrix on the left by a row vector, as in $v^T A$, produces a linear combination of the rows of $A$, with row $i$ scaled by $v_i$; equivalently, each block $x_j I_m$ is sent to $x_j v^T$. This gives $$ v^T(x^T \otimes I_m) = [x_1v^T, x_2v^T, ..., x_nv^T] \in \mathbb{R}^{1 \times mn} $$ Expanding elementwise, $$ [x_1v^T, x_2v^T, ..., x_nv^T] = [v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n] \in \mathbb{R}^{1 \times mn}$$ Observe the pattern for the column-major devectorization (the inverse operation of vectorization): every column $i$ of the devectorized matrix is the $i$th consecutive run of $m$ elements. To take the column-major devectorization, we transpose this $\mathbb{R}^{1 \times mn}$ row vector to make it $\mathbb{R}^{mn}$, show the resultant $\mathbb{R}^{m \times n}$ matrix, then transpose to make it $\mathbb{R}^{n \times m}$, the devectorization of $v^T(x^T \otimes I_m)$. $$ [v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n] \in \mathbb{R}^{1 \times mn} \rightarrow $$ $$[v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n]^T \in \mathbb{R}^{mn}$$ $$ devec([v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n]^T) = \begin{bmatrix} v_1 x_1 & v_1 x_2 & \cdots & v_1 x_n \\ v_2 x_1 & v_2 x_2 & \cdots & v_2 x_n \\ \vdots & \vdots & \ddots & \vdots \\ v_m x_1 & v_m x_2 & \cdots & v_m x_n \end{bmatrix} = vx^T $$ Finally, to get back $devec([v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n])$, transpose the result $vx^T$, giving $$ devec([v_1x_1, v_2x_1, ...
v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n]) = (vx^T)^T = xv^T \in \mathbb{R}^{n \times m}$$ Thus we have shown $$devec(v^T (x^T \otimes I_m)) = xv^T$$
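Both appendix results, and the final gradient, can be verified numerically. A minimal NumPy sketch with arbitrary sizes: it checks the vectorization identity from Proof 1 applied to $vec(dM\,x) = (x^T \otimes I_m)\,vec(dM)$ (using column-major vec, as in the derivation), and checks the gradient-descent update $-2(g-y)x^T$ element by element against central finite differences of the loss.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4                      # arbitrary sizes for illustration
M = rng.standard_normal((m, n))
x = rng.standard_normal(n)
g = rng.standard_normal(m)
dM = rng.standard_normal((m, n))

# Proof 1 identity, with column-major (Fortran-order) vectorization:
vec = lambda A: A.reshape(-1, order="F")
lhs = dM @ x                                     # vec(dM x) is just dM x here
rhs = np.kron(x[None, :], np.eye(m)) @ vec(dM)   # (x^T kron I_m) vec(dM)
assert np.allclose(lhs, rhs)

# Gradient with the same shape as M (denominator layout): -2 (g - y) x^T.
y = M @ x
grad = -2.0 * np.outer(g - y, x)

def loss(M):
    r = g - M @ x
    return r @ r

# Central finite differences over each entry of M.
eps = 1e-6
fd = np.zeros_like(M)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(M)
        E[i, j] = eps
        fd[i, j] = (loss(M + E) - loss(M - E)) / (2 * eps)

assert np.allclose(grad, fd, rtol=1e-4)
```

Because the loss is quadratic in $M$, the central difference recovers the gradient up to floating-point noise, so the element-wise match confirms $\frac{\partial L}{\partial M}^T = -2(g-y)x^T$.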