The Core of Backprop 2: Deriving $\frac{\partial L}{\partial M}$ using Matrix Calculus by dokuDoku Machine Learning Foundations 🔍 This is a sequel to [The Core of Backprop: Deriving $\frac{\partial L}{\partial M}$](https://www.noteblogdoku.com/blog/core-of-backprop-deriving-dl-dm); this time the derivation is shorter and simpler, using **Matrix Calculus** instead of working component by component. For context on why this is important, please read the [Neural Network Implementation from Scratch blog post](https://www.noteblogdoku.com/blog/neural-network-implementation-from-scratch) and the [Fitting a Simple Linear Model using Gradient Descent blog post](https://www.noteblogdoku.com/blog/fitting-a-simple-linear-model-using-gradient-descent). The focus of this blog is to derive $\frac{\partial L}{\partial M}$ from the equations: $$ y = Mx$$ $$ L(y) = \sum_i (g_i - y_i)^2 = (g-y)^T(g-y) $$ with vector and matrix shapes $y, g \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, $M \in \mathbb{R}^{m \times n}$, $L \in \mathbb{R}$. Note that the linear layer of a neural network/multilayer perceptron (MLP) is typically $y = Mx + b$, where $b \in \mathbb{R}^m$ is a bias term. However, when taking the derivative with respect to the matrix $M$, the bias term is unimportant, so we will be ignoring it today. Additionally, $g$ stands for ground truth and $y$ is the value our simple model predicts. A neural network typically interleaves these linear layers with simple nonlinearities such as ReLU, so taking the derivative of $y = Mx$ with respect to $M$ is important. The goal is to derive $\frac{\partial L}{\partial M}$ so that we can update the weight matrix $M$ to minimize $L(Mx)$. **Note:** this derivation uses *numerator* notation for derivatives, which makes the chain rule more intuitive than *denominator* notation.
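As a concrete anchor for the shapes above, here is a minimal numerical sketch in NumPy. The sizes $m = 3$, $n = 4$ and the random data are made up for illustration; it just sets up $y = Mx$ and checks that the two forms of the loss agree.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4                      # made-up sizes for illustration
M = rng.standard_normal((m, n))  # weight matrix, R^{m x n}
x = rng.standard_normal(n)       # input, R^n (held fixed)
g = rng.standard_normal(m)       # ground truth, R^m

y = M @ x                        # prediction, R^m
L = (g - y) @ (g - y)            # (g - y)^T (g - y)

# Sum-of-squares form and inner-product form of the loss agree.
assert np.isclose(L, np.sum((g - y) ** 2))
```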
## Differential of $L$, $dL$ To start, let's take the differential of $L$: $$ L = (g-y)^T(g-y) \rightarrow dL = -(g-y)^Tdy - dy^T(g-y) $$ Of note, $L, dL \in \mathbb{R}$ and $y, dy \in \mathbb{R}^m$; i.e., each differential has the same shape as the quantity it is the differential of. Hence $(g-y)^T dy$, with $(g-y)^T \in \mathbb{R}^{1 \times m}$ and $dy \in \mathbb{R}^{m \times 1}$, gives $$(g-y)^T dy \in \mathbb{R}$$ Similarly, $dy^T (g-y)$, with $dy^T \in \mathbb{R}^{1 \times m}$ and $(g-y) \in \mathbb{R}^{m \times 1}$, gives $$dy^T (g-y) \in \mathbb{R}$$ Since $dy^T (g-y)$ is a scalar in $\mathbb{R}$, it equals its own transpose, $dy^T (g-y) = (dy^T (g-y))^T = (g-y)^T dy$, giving $$ dL = -(g-y)^Tdy - dy^T(g-y) = -2(g-y)^Tdy$$ Since the derivative is a linear operator, for a generic function $f$ we have $df = f'(x)[dx]$ (the brackets $[\,]$ emphasize that $f'(x)$ acts as an operator on $dx$; nice but not necessary), so we can restate this as $$ dL = -2(g-y)^T[dy] $$ ## Differential of $y$, $dy$ Given $y = Mx$ with $y \in \mathbb{R}^m$, $x \in \mathbb{R}^n$ and $M \in \mathbb{R}^{m \times n}$, let's use the product rule to get: $$ y = Mx \rightarrow dy = (dM)x + M\,dx$$ Note that since $x$ is constant, $dx = 0$. This is because in our machine learning algorithm the **input data $x$** stays constant with respect to our network; the parameters/weights $M$ are what we are optimizing and changing. Hence: $$ dy = (dM)x + M \cdot 0 \rightarrow dy = (dM)x$$ Note that $y, dy \in \mathbb{R}^m$, $M, dM \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$; hence $(dM)x$ is $\mathbb{R}^{m \times n} \cdot \mathbb{R}^n \rightarrow \mathbb{R}^m$, which is a quick shape check. ### What do I do with $dy = (dM)x$? We would like to transform this into the Jacobian form $dy = \frac{\partial y}{\partial M}[dM]$, which makes the update to $M$ explicit.
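Before moving on, the two differentials just derived can be sanity-checked numerically. A minimal sketch with made-up sizes: perturb $M$ by a tiny $dM$ and compare $dy = (dM)x$ and $dL \approx -2(g-y)^T dy$ against direct recomputation. Note that $dy = (dM)x$ holds exactly here because $y$ is linear in $M$, while the $dL$ check is first-order (the quadratic remainder is $O(\|dM\|^2)$).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4                      # made-up sizes for illustration
M = rng.standard_normal((m, n))
x = rng.standard_normal(n)
g = rng.standard_normal(m)

def loss(M):
    y = M @ x
    return (g - y) @ (g - y)

y = M @ x
eps = 1e-6
dM = eps * rng.standard_normal((m, n))  # tiny perturbation of the weights

dy_pred = dM @ x                    # dy = (dM) x
dL_pred = -2.0 * (g - y) @ dy_pred  # dL = -2 (g - y)^T dy

dy_true = (M + dM) @ x - y          # exact change in y
dL_true = loss(M + dM) - loss(M)    # exact change in L

assert np.allclose(dy_pred, dy_true)            # exact (y is linear in M)
assert np.isclose(dL_pred, dL_true, rtol=1e-4)  # first-order in dM
```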
Given that we have $dy = (dM)x$, which in the abstract form is $dy = \frac{\partial y}{\partial M} [dM]$, we want to find $\frac{\partial y}{\partial M}$, which will tell us how to update the weights of $M$. However, this is a multidimensional tensor, since $dy \in \mathbb{R}^m$ and $dM \in \mathbb{R}^{m \times n}$ give $\frac{\partial y}{\partial M} \in \mathbb{R}^{m \times n \times m}$. To go from the linear-operator form of matrix calculus to the more conventional Jacobian form, we vectorize $M$; in numerator layout this will eventually give an update of shape $\mathbb{R}^{n \times m}$, which we will have to transpose to match the shape of $M$. Note: we will be referencing some proofs from the **Appendix**. First, let's vectorize both sides to get: $$ dy = (dM)x \rightarrow d\,vec(y) = vec(dMx) $$ Of note, $vec(dy) = d\,vec(y)$ because both $vec(\cdot)$ and $d$ (the differential) are linear operators. Use **Proof 1**, $vec(CA^T) = (A \otimes I) vec(C)$, to get $$ vec(dMx) = (x^T \otimes I_m) vec(dM) $$ with $I_m \in \mathbb{R}^{m \times m}$, since $I$ matches the number of rows of $dM \in \mathbb{R}^{m \times n}$. Note that since $y, dy \in \mathbb{R}^m$ they are already in vector form, thus $vec(y) = y$ (likewise for $dy$ and $x$). In vector form we now have: $$ dy = vec(dMx) = (x^T \otimes I_m) vec(dM)$$ Given this equation, recall that the general form for finding the gradient of some function $f$ is $$df = f'(x)[dx] = \frac{\partial f}{\partial x} [dx]$$ I.e., $\frac{\partial f}{\partial x}$ is the quantity whose inner product with $dx$ equals $df$, i.e.
$$ \langle \frac{\partial f}{\partial x}, dx \rangle = df$$ By this definition, $$ dy = vec(dMx) = (x^T \otimes I_m) [vec(dM)]$$ tells us that $$ \frac{\partial y}{\partial vec(M)} = x^T \otimes I_m $$ Shape check: $dy \in \mathbb{R}^m$ and $vec(dM) \in \mathbb{R}^{mn}$ give $\frac{\partial y}{\partial vec(M)} \in \mathbb{R}^{m \times mn}$, which matches $x^T \otimes I_m$, where every element $x_i$, $i \in [1,n]$, multiplies $I_m \in \mathbb{R}^{m \times m}$, giving $x^T \otimes I_m \in \mathbb{R}^{m \times mn}$. ## Chain Rule: Substitute $dy$ into the $dL$ Equation Recall from earlier $$ dL = -2(g-y)^T[dy] $$ and $$ dy = (dM)x \rightarrow vec(dy) = vec(dMx) = (x^T \otimes I_m) [vec(dM)]$$ Recall $vec(dy) = dy$, and substitute $dy$ into $dL = -2(g-y)^T[dy]$, giving $$ dL = -2(g-y)^T vec(dMx) = -2(g-y)^T (x^T \otimes I_m) [vec(dM)]$$ Note the general form $$df = f'(x)[dx] = \frac{\partial f}{\partial x} [dx]$$ and that $$ \langle \frac{\partial f}{\partial x}, dx \rangle = df$$ This gives us $$\frac{\partial L}{\partial vec(M)} = -2(g-y)^T (x^T \otimes I_m)$$ Checking the shapes, $-2(g-y)^T \in \mathbb{R}^{1\times m}$ and $x^T \otimes I_m \in \mathbb{R}^{m \times mn}$ give $$\frac{\partial L}{\partial vec(M)} = -2(g-y)^T (x^T \otimes I_m) \in \mathbb{R}^{1 \times mn}$$ To get $\frac{\partial L}{\partial M}$ from $\frac{\partial L}{\partial vec(M)}$, devectorize both sides: $$ devec(\frac{\partial L}{\partial vec(M)}) = \frac{\partial L}{\partial M} = devec(-2(g-y)^T (x^T \otimes I_m))$$ Using **Proof 2** we have (for arbitrary vectors $v \in \mathbb{R}^m$, $x \in \mathbb{R}^n$) $devec(v^T (x^T \otimes I_m)) = xv^T$; applied here, $$ devec(-2(g-y)^T (x^T \otimes I_m)) = -2x(g-y)^T $$ giving $$\frac{\partial L}{\partial M} = -2x(g-y)^T \in \mathbb{R}^{n \times m}$$ Under *numerator* layout Jacobian notation, $\frac{\partial L}{\partial M}$ appears as an $n \times m$ object. The gradient used in optimization, with the same shape as $M$, is its transpose in *denominator* layout. The quantity we will use to update $M$ in gradient descent is: $$ (-2x(g-y)^T)^T = -2(g-y)x^T \in \mathbb{R}^{m \times n}$$ This is **exactly** the update rule from [The Core of Backprop: Deriving $\frac{\partial L}{\partial M}$](https://www.noteblogdoku.com/blog/core-of-backprop-deriving-dl-dm). However, *this time* it was completed using **Matrix Calculus**, resulting in a shorter, cleaner derivation. <!-- ## 3 Line Derivation Using Trace! --> --- ## Appendix ### Proof 1 $vec(CA^T) = (A \otimes I_m) vec(C)$ s.t. $A \in \mathbb{R}^{p \times n}$, $C \in \mathbb{R}^{m \times n}$, giving $CA^T \in \mathbb{R}^{m \times p}$. Let's derive this proof using some matrix algebra! Let's look at the columns of $CA^T$. The first column of $CA^T$ is a linear combination of the columns of $C$ whose coefficients are given by the first column of $A^T$ (the first row of $A$): $$ \text{column 1 of } CA^T = \sum_j a_{1j} c_j$$ where $a_{1j}$ is the $j$th element of the first row of $A$ and $c_j$ is the $j$th column vector of $C$. This gives us $$ \operatorname{vec}(CA^T) = \begin{pmatrix} \sum_j a_{1j} c_j \\ \sum_j a_{2j} c_j \\ \vdots \end{pmatrix} = \underbrace{ \begin{pmatrix} a_{11} I & a_{12} I & \cdots \\ a_{21} I & a_{22} I & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} }_{A \otimes I} \underbrace{ \begin{pmatrix} c_1 \\ c_2 \\ \vdots \end{pmatrix} }_{\operatorname{vec}(C)} $$ Thus we have derived $$ vec(CA^T) = (A \otimes I_m) vec(C) $$ ### Proof 2 $devec(v^T (x^T \otimes I_m)) = xv^T$ s.t. $x \in \mathbb{R}^n$, $v \in \mathbb{R}^m$, and $xv^T \in \mathbb{R}^{n \times m}$. Let's derive this proof using some matrix algebra!
To restate: $x \in \mathbb{R}^n$, $v \in \mathbb{R}^m$, making $(x^T \otimes I_m) \in \mathbb{R}^{m \times mn}$ and our desired result $xv^T \in \mathbb{R}^{n \times m}$. To start, by the definition of the Kronecker product, $$ v^T(x^T \otimes I_m) = v^T [x_1I_m, x_2I_m, \ldots, x_nI_m]$$ Note that in this step the shape $(x^T \otimes I_m) \in \mathbb{R}^{m \times mn}$ becomes trivial to see: $n$ blocks of size $m \times m$ laid side by side. Written out in full: $$ v^T(x^T \otimes I_m) = v^T \left\lbrack \begin{array}{ccc|ccc|c|ccc} x_1 & \cdots & 0 & x_2 & \cdots & 0 & & x_n & \cdots & 0 \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots & \cdots & \vdots & \ddots & \vdots \\ 0 & \cdots & x_1 & 0 & \cdots & x_2 & & 0 & \cdots & x_n \end{array} \right\rbrack $$ where the matrix has $m$ rows and $mn$ columns. Multiplying a matrix on the left by a row vector, as in $v^T A$, produces a linear combination of the rows of $A$, with row $i$ scaled by $v_i$; equivalently, each block $x_j I_m$ is sent to $x_j v^T$. This gives $$ v^T(x^T \otimes I_m) = [x_1v^T, x_2v^T, ..., x_nv^T] \in \mathbb{R}^{1 \times mn} $$ Expanding elementwise, $$ [x_1v^T, x_2v^T, ..., x_nv^T] = [v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n] \in \mathbb{R}^{1 \times mn}$$ Observe the pattern for the column-major devectorization (the inverse operation of vectorization): every column $i$ of the devectorized matrix is the $i$th consecutive run of $m$ elements. To take the column-major devectorization, we transpose this $\mathbb{R}^{1 \times mn}$ row vector to make it $\mathbb{R}^{mn}$, show the resultant $\mathbb{R}^{m \times n}$ matrix, then transpose to make it $\mathbb{R}^{n \times m}$, the devectorization of $v^T(x^T \otimes I_m)$. $$ [v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n] \in \mathbb{R}^{1 \times mn} \rightarrow $$ $$[v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n]^T \in \mathbb{R}^{mn}$$ $$ devec([v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n]^T) = \begin{bmatrix} v_1 x_1 & v_1 x_2 & \cdots & v_1 x_n \\ v_2 x_1 & v_2 x_2 & \cdots & v_2 x_n \\ \vdots & \vdots & \ddots & \vdots \\ v_m x_1 & v_m x_2 & \cdots & v_m x_n \end{bmatrix} = vx^T $$ Finally, to get back $devec([v_1x_1, v_2x_1, ... v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n])$, transpose the result $vx^T$, giving $$ devec([v_1x_1, v_2x_1, ...
v_1x_2, v_2x_2,...,v_1x_n,v_2x_n...v_mx_n]) = (vx^T)^T = xv^T \in \mathbb{R}^{n \times m}$$ Thus we have shown $$devec(v^T (x^T \otimes I_m)) = xv^T$$
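Both appendix results, and the final gradient, can be verified numerically. A minimal NumPy sketch with arbitrary sizes: it checks the vectorization identity from Proof 1 applied to $vec(dM\,x) = (x^T \otimes I_m)\,vec(dM)$ (using column-major vec, as in the derivation), and checks the gradient-descent update $-2(g-y)x^T$ element by element against central finite differences of the loss.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4                      # arbitrary sizes for illustration
M = rng.standard_normal((m, n))
x = rng.standard_normal(n)
g = rng.standard_normal(m)
dM = rng.standard_normal((m, n))

# Proof 1 identity, with column-major (Fortran-order) vectorization:
vec = lambda A: A.reshape(-1, order="F")
lhs = dM @ x                                     # vec(dM x) is just dM x here
rhs = np.kron(x[None, :], np.eye(m)) @ vec(dM)   # (x^T kron I_m) vec(dM)
assert np.allclose(lhs, rhs)

# Gradient with the same shape as M (denominator layout): -2 (g - y) x^T.
y = M @ x
grad = -2.0 * np.outer(g - y, x)

def loss(M):
    r = g - M @ x
    return r @ r

# Central finite differences over each entry of M.
eps = 1e-6
fd = np.zeros_like(M)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(M)
        E[i, j] = eps
        fd[i, j] = (loss(M + E) - loss(M - E)) / (2 * eps)

assert np.allclose(grad, fd, rtol=1e-4)
```

Because the loss is quadratic in $M$, the central difference recovers the gradient up to floating-point noise, so the element-wise match confirms $\frac{\partial L}{\partial M}^T = -2(g-y)x^T$.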