# The Core of Backprop: Deriving $\frac{\partial L}{\partial M}$ to Update Matrix $M$

by dokuDoku · Machine Learning Foundations

For context on why this is important, please read the [Neural Network Implementation from Scratch blog post](https://www.noteblogdoku.com/blog/neural-network-implementation-from-scratch) and the [Fitting a Simple Linear Model using Gradient Descent blog post](https://www.noteblogdoku.com/blog/fitting-a-simple-linear-model-using-gradient-descent) (which has another derivation of this).

The focus of this blog post is to derive $\frac{\partial L}{\partial M}$ from the equations:

$$y = Mx$$

$$ L(y) = \sum_i (g_i - y_i)^2 $$

with vector and matrix shapes $y, g \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, $M \in \mathbb{R}^{m \times n}$, $L \in \mathbb{R}$.

Note that the linear layer of a neural network/Multi-Layer Perceptron (MLP) is typically $y = Mx + b$, where $b \in \mathbb{R}^m$ is a bias term. However, the bias term does not affect the derivative with respect to $M$, so we will ignore it here.

Given $y = Mx$ and $L(y) = \sum_i (g_i - y_i)^2$, the goal is to derive $\frac{\partial L}{\partial M}$ so that we can update the weight matrix $M$ to minimize $L(Mx)$. To do this we use calculus to find the derivative, then update the parameters of $M$ to reduce the error.

**Note:** this derivation uses *numerator* layout for derivatives, and to make the update shapes match, we will take the transpose of the final result. This makes the chain rule easier to apply than *denominator* layout.
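To fix concrete shapes before deriving anything, here is a small NumPy sketch of the forward pass and loss. The sizes $m = 3$, $n = 4$ and the random data are assumed purely for illustration; nothing in the post prescribes them.

```python
import numpy as np

# Toy shapes (assumed for illustration): m outputs, n inputs.
m, n = 3, 4
rng = np.random.default_rng(0)

M = rng.normal(size=(m, n))   # weight matrix, R^{m x n}
x = rng.normal(size=n)        # input vector, R^n
g = rng.normal(size=m)        # target vector, R^m

y = M @ x                     # forward pass: y = Mx, in R^m
L = np.sum((g - y) ** 2)      # squared-error loss, a scalar
```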
In numerator layout, the shape of a derivative is the shape of the numerator times the transposed shape of the denominator.

We will use the chain rule to write:

$$\frac{\partial L}{\partial M} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial M} $$

### Find $\frac{\partial L}{\partial y}$

This is fairly straightforward. We know that

$$ \frac{\partial L}{\partial y} = \left[\frac{\partial L}{\partial y_1}, \frac{\partial L}{\partial y_2}, \ldots, \frac{\partial L}{\partial y_m}\right] \in \mathbb{R}^{1\times m}$$

For each component, $L$ contributes the term $(g_i - y_i)^2$, so $\frac{\partial L}{\partial y_i} = -2(g_i - y_i)$. Collecting these across $\frac{\partial L}{\partial y}$ we get

$$ \frac{\partial L}{\partial y} = -2 \cdot [(g_1 - y_1), (g_2 - y_2), \ldots, (g_m - y_m)] = -2(g - y)^T$$

### Find $\frac{\partial y}{\partial M}$

For this, we introduce the concept of **vectorizing** a matrix. The vectorization $\mathrm{vec}\, A \in \mathbb{R}^{mn}$ of any $m \times n$ matrix $A \in \mathbb{R}^{m \times n}$ is defined by **stacking the columns** of $A$, from left to right, into a column vector $\mathrm{vec}\, A$. That is, if we denote the $n$ columns of $A$ by the $m$-component vectors $a_1, a_2, \ldots, a_n \in \mathbb{R}^m$, then

$$ \mathrm{vec}\, A = \mathrm{vec}(A_{:,1}, A_{:,2}, \ldots, A_{:,n}) = \mathrm{vec}(a_1, a_2, \ldots, a_n) = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \in \mathbb{R}^{mn}$$

The product $y = Mx$ can then be restated as:

$$ y = Mx \leftrightarrow \mathrm{vec}(y) = (x^T \otimes I_m)\, \mathrm{vec}(M)$$

This makes it much easier to work with $M$, and we can invert the operation to go from $\mathrm{vec}\, M$ back to $M$, i.e. from $\mathbb{R}^{mn}$ to $\mathbb{R}^{m \times n}$.

Let's first consider $\frac{\partial y_i}{\partial M}$, which we will compute via $\frac{\partial y_i}{\partial \mathrm{vec}(M)}$. Since $\mathrm{vec}(M) \in \mathbb{R}^{mn}$, the gradient $\frac{\partial y_i}{\partial \mathrm{vec}(M)} \in \mathbb{R}^{1 \times mn}$.
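Both claims in this section can be checked numerically. The sketch below (NumPy, with assumed toy shapes and a fixed seed) verifies the vectorization identity $\mathrm{vec}(y) = (x^T \otimes I_m)\,\mathrm{vec}(M)$ and compares $\frac{\partial L}{\partial y_i} = -2(g_i - y_i)$ against central finite differences:

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(1)
M = rng.normal(size=(m, n))
x = rng.normal(size=n)
g = rng.normal(size=m)

# vec(M): stack the columns of M (column-major / Fortran order).
vecM = M.flatten(order="F")

# Identity: y = Mx  <=>  vec(y) = (x^T kron I_m) vec(M)
y_direct = M @ x
y_kron = np.kron(x[None, :], np.eye(m)) @ vecM
assert np.allclose(y_direct, y_kron)

# dL/dy_i = -2 (g_i - y_i); check against central finite differences.
y = M @ x
analytic = -2.0 * (g - y)
eps = 1e-6
numeric = np.empty(m)
for i in range(m):
    y_plus, y_minus = y.copy(), y.copy()
    y_plus[i] += eps
    y_minus[i] -= eps
    numeric[i] = (np.sum((g - y_plus) ** 2) - np.sum((g - y_minus) ** 2)) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-5)
```

Note the `order="F"` flatten: NumPy defaults to row-major order, but $\mathrm{vec}$ as defined here stacks columns.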
$$ \frac{\partial y_i}{\partial \mathrm{vec}(M)} = \left[ \frac{\partial y_i}{\partial M_{1,1}}, \ldots, \frac{\partial y_i}{\partial M_{m,1}}, \ldots, \frac{\partial y_i}{\partial M_{1,n}}, \ldots, \frac{\partial y_i}{\partial M_{m,n}} \right] \in \mathbb{R}^{1 \times mn} $$

Note that

$$ y = \begin{bmatrix} M_{1,1} & M_{1,2} & \cdots & M_{1,n} \\ M_{2,1} & M_{2,2} & \cdots & M_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ M_{m,1} & M_{m,2} & \cdots & M_{m,n} \\ \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n}\\ \end{bmatrix} = \begin{bmatrix} M_{1,:} \cdot x \\ M_{2,:} \cdot x \\ \vdots \\ M_{m,:} \cdot x \\ \end{bmatrix} $$

giving

$$ y_i = M_{i,:} \cdot x = \sum_{j=1}^{n} M_{i,j} x_j $$

Equivalently, using the columns of $M$: let $m_j = M_{:,j} \in \mathbb{R}^m$ be the $j$th column of $M$, so that

$$ y_i = \sum_{j=1}^n x_j(e_i^T m_j) $$

where $e_i \in \mathbb{R}^m$ is the $i$th standard basis vector (1 at position $i$, 0 elsewhere). This column form is useful when thinking about the vectorization of $M$, since $\mathrm{vec}(M)$ stacks the columns $m_1, m_2, \ldots, m_n$.

From this, the entries of $\frac{\partial y_i}{\partial \mathrm{vec}(M)}$ are nonzero only for entries of row $i$ of $M$:

$$\frac{\partial y_i}{\partial M_{i,j}} = x_j \qquad \frac{\partial y_i}{\partial M_{k,j}} = 0 \text{ for } k \neq i$$

If we devectorize $\frac{\partial y_i}{\partial \mathrm{vec}(M)}$ to $\frac{\partial y_i}{\partial M}$, i.e. $\frac{\partial y_i}{\partial M} = \mathrm{vec}^{-1}(\frac{\partial y_i}{\partial \mathrm{vec}(M)})$, we get $\frac{\partial y_i}{\partial M} \in \mathbb{R}^{n \times m}$.
Given what we just stated, we get the following result:

$$ \frac{\partial y_i}{\partial M} = \begin{bmatrix} \frac{\partial y_i}{\partial M_{1,1}} & \frac{\partial y_i}{\partial M_{2,1}} & \cdots & \frac{\partial y_i}{\partial M_{m,1}} \\ \frac{\partial y_i}{\partial M_{1,2}} & \frac{\partial y_i}{\partial M_{2,2}} & \cdots & \frac{\partial y_i}{\partial M_{m,2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_i}{\partial M_{1,n}} & \frac{\partial y_i}{\partial M_{2,n}} & \cdots & \frac{\partial y_i}{\partial M_{m,n}} \\ \end{bmatrix} = \begin{bmatrix} 0 & \frac{\partial y_i}{\partial M_{i,1}} = x_1 & \cdots & 0 \\ 0 & \frac{\partial y_i}{\partial M_{i,2}} = x_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \frac{\partial y_i}{\partial M_{i,n}} = x_n & \cdots & 0 \\ \end{bmatrix} = \begin{pmatrix} \vert & \vert & \ & \vert \\ 0 & x & \cdots & 0\\ \vert & \vert & \ & \vert \end{pmatrix} \in \mathbb{R}^{n \times m} $$

That is, $\frac{\partial y_i}{\partial M}$ is all zeros except for its $i$th column, which equals $x$.

Since we really want $\frac{\partial y}{\partial M}$, we can stack the per-component results $\frac{\partial y_i}{\partial M}$ to get the following:

$$ \frac{\partial y}{\partial M} = \left[\frac{\partial y_1}{\partial M}, \frac{\partial y_2}{\partial M}, \ldots, \frac{\partial y_m}{\partial M}\right]^T \in \mathbb{R}^{m \times n \times m}$$

Alternatively:

$$ \frac{\partial y}{\partial \mathrm{vec}(M)} = \left[\frac{\partial y_1}{\partial \mathrm{vec}(M)}, \frac{\partial y_2}{\partial \mathrm{vec}(M)}, \ldots, \frac{\partial y_m}{\partial \mathrm{vec}(M)}\right]^T \in \mathbb{R}^{m \times mn}$$

### Find $\frac{\partial L}{\partial M}$

Now that we know $\frac{\partial L}{\partial y}$ and $\frac{\partial y}{\partial M}$, we can calculate $\frac{\partial L}{\partial M} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial M}$.
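As a sanity check, the sketch below builds the claimed $n \times m$ Jacobian $\frac{\partial y_i}{\partial M}$ (zeros everywhere except column $i$, which holds $x$) and compares it against a finite-difference estimate. The shapes, seed, and the choice $i = 1$ are assumed for illustration:

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(2)
M = rng.normal(size=(m, n))
x = rng.normal(size=n)

i = 1  # pick one output component y_i

# Numerator-layout Jacobian dy_i/dM in R^{n x m}: all zeros except
# column i, which holds the input vector x.
dyi_dM = np.zeros((n, m))
dyi_dM[:, i] = x

# Finite-difference check: perturb each entry M_{k,j} and watch y_i.
eps = 1e-6
numeric = np.zeros((n, m))
for k in range(m):
    for j in range(n):
        M_pert = M.copy()
        M_pert[k, j] += eps
        # entry (j, k) in numerator layout corresponds to M_{k,j}
        numeric[j, k] = ((M_pert @ x)[i] - (M @ x)[i]) / eps
assert np.allclose(dyi_dM, numeric, atol=1e-4)
```

Because $y_i$ is linear in $M$, the finite-difference estimate agrees with the analytic Jacobian up to floating-point rounding.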
To start with shapes: since we are using the *numerator* convention (the shape of a derivative is the shape of the numerator times the transposed shape of the denominator), the result will be $\frac{\partial L}{\partial M} \in \mathbb{R}^{n \times m}$, and we will simply take the transpose at the end to update matrix $M$ directly. Using the chain rule, the shapes work out as:

$$\frac{\partial L}{\partial M} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial M} \rightarrow \mathbb{R}^{1\times m} \times \mathbb{R}^{m \times n \times m} = \mathbb{R}^{1 \times n \times m} = \mathbb{R}^{n \times m}$$

so the shapes are consistent. Writing out the product:

$$ \frac{\partial L}{\partial M} = \left[\frac{\partial L}{\partial y_1}, \frac{\partial L}{\partial y_2}, \ldots, \frac{\partial L}{\partial y_m}\right]\left[\frac{\partial y_1}{\partial M}, \frac{\partial y_2}{\partial M}, \ldots, \frac{\partial y_m}{\partial M}\right]^T = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial M} \in \mathbb{R}^{n \times m}$$

This sums each $\frac{\partial y_i}{\partial M} \in \mathbb{R}^{n \times m}$ scaled by the scalar $\frac{\partial L}{\partial y_i} \in \mathbb{R}$, giving an $n \times m$ matrix. **Note** that, as stated earlier, $\frac{\partial y_i}{\partial M}$ is an $n \times m$ matrix of zeros *except* for the $i$th column, which equals the vector $x$.
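The sum form above can be verified directly: the sketch below accumulates each scaled $\frac{\partial y_i}{\partial M}$ and checks that the result equals the outer product $x \cdot \frac{\partial L}{\partial y}$ (toy shapes and seed assumed):

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(3)
M = rng.normal(size=(m, n))
x = rng.normal(size=n)
g = rng.normal(size=m)

y = M @ x
dLdy = -2.0 * (g - y)            # dL/dy_i collected as a length-m vector

# Sum form: each dy_i/dM is zeros except column i = x, scaled by dL/dy_i.
grad_sum = np.zeros((n, m))
for i in range(m):
    dyi_dM = np.zeros((n, m))
    dyi_dM[:, i] = x
    grad_sum += dLdy[i] * dyi_dM

# Closed form: the outer product x (dL/dy), also n x m.
grad_outer = np.outer(x, dLdy)
assert np.allclose(grad_sum, grad_outer)
```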
Thus we can see that

$$ \frac{\partial L}{\partial M} = \begin{pmatrix} \vert & \vert & \ & \vert \\ \frac{\partial L}{\partial y_1} x & \frac{\partial L}{\partial y_2} x & \cdots & \frac{\partial L}{\partial y_m} x\\ \vert & \vert & \ & \vert \end{pmatrix} = x \cdot \frac{\partial L}{\partial y} $$

which is the outer product of $x \in \mathbb{R}^{n \times 1}$ with $\frac{\partial L}{\partial y} \in \mathbb{R}^{1 \times m}$, giving $\frac{\partial L}{\partial M} \in \mathbb{R}^{n \times m}$, which is exactly what we wanted!

### Updating $M$

To update the matrix $M$ using $\frac{\partial L}{\partial M}$, we need to take its transpose. That is, to update $M$ using gradient descent

$$ parameter_{new} = parameter_{old} - \alpha \times \frac{\partial\ Loss}{\partial\ parameter}$$

where $\alpha$ is the **learning rate** hyperparameter, we do

$$ M_{new} = M - \alpha \left(\frac{\partial L}{\partial M}\right)^T = M - \alpha \left(x \cdot \frac{\partial L}{\partial y}\right)^T = M - \alpha \left(\frac{\partial L}{\partial y}\right)^T x^T $$

This matches the standard rule for updating the weights of a linear layer: take the upstream signal, in this case $\frac{\partial L}{\partial y}$, and form its outer product with the input vector $x$ (or its average over your batch).

This is a rigorous explanation of how to derive $\frac{\partial L}{\partial M}$ given $y = Mx$ and $L(y) = \sum_i (g_i - y_i)^2$. Thanks to my friend Anant for whiteboarding this solution with me.
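Putting it all together, here is a minimal gradient-descent sketch using the update $M \leftarrow M - \alpha \left(\frac{\partial L}{\partial y}\right)^T x^T$. The learning rate, step count, and toy data are assumptions for the demo, not prescriptions:

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(4)
M = rng.normal(size=(m, n))
x = rng.normal(size=n)
g = rng.normal(size=m)
alpha = 0.01  # learning rate (assumed value for the demo)

def loss(M):
    y = M @ x
    return np.sum((g - y) ** 2)

L0 = loss(M)
for _ in range(200):
    y = M @ x
    dLdy = -2.0 * (g - y)          # upstream signal, length m
    # (dL/dM)^T = (dL/dy)^T x^T: an m x n matrix matching M's shape.
    M = M - alpha * np.outer(dLdy, x)
assert loss(M) < L0
```

The loss shrinks geometrically here because the residual obeys $r \leftarrow (1 - 2\alpha \lVert x \rVert^2)\, r$, so any $\alpha < 1 / \lVert x \rVert^2$ converges for a single sample.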