# Optimizing 2D Convolutions

by dokuDoku

## Recap from [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices)

To give a short recap from the previous blog post introducing 2D Convolutions, [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices):

Given the **full** convolution, where only **one element** of overlap is required, we have

- $z \in \mathbb{R}^{m \times n}$ the input image
- $vec(z) \in \mathbb{R}^{mn}$ the vectorized input image
- $k \in \mathbb{R}^{r \times s}$ the kernel
- $o \in \mathbb{R}^{(m+r-1)\times (n+s-1)}$ the output image
- $vec(o) \in \mathbb{R}^{(m+r-1)(n+s-1)}$ the vectorized output image
- $f(k) \in \mathbb{R}^{(m+r-1)(n+s-1) \times mn}$ the Toeplitz block matrix of kernel $k$

**NOTE** All equations are 0-indexed and all derivatives use the **numerator** convention!

This is the definition of the 2-dimensional convolution of $z$ and $k$

$$ o[p,q] = (z * k)[p,q] = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} z[i,j] \, k[p-i, q-j] $$

Additionally, it was shown that when $o$ is vectorized, this 2D convolution can be represented as the multiplication between the Toeplitz block matrix of kernel $k$, $f(k)$, and the vectorized input image $vec(z)$

$$ vec(o) = f(k) vec(z)$$

## Force output $o$ to have same size as input $z$

To make it easier to optimize our kernel $k$ with respect to a scalar loss function, let's force the output $o$ of the convolution to match the input size of $z$, i.e. $o,z \in \mathbb{R}^{m \times n}$.
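As a quick numerical sketch of this goal (assuming NumPy and SciPy are available), `scipy.signal.convolve2d` exposes both the full convolution and a `mode="same"` output that keeps the input's shape:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 7))  # input image, m x n
k = rng.standard_normal((3, 3))  # odd-sized kernel, r x s

o_full = convolve2d(z, k, mode="full")  # (m+r-1) x (n+s-1) = (7, 9)
o_same = convolve2d(z, k, mode="same")  # m x n = (5, 7), centered crop of o_full

print(o_full.shape, o_same.shape)
```

For an odd-sized kernel, `o_same` is exactly the centered $m \times n$ block of `o_full`, which is the behavior we are about to derive via padding.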
Given our result doing this for 1D convolutions in [Optimizing 1D Convolutions Implementation](https://www.noteblogdoku.com/blog/optimizing-1d-convolutions-implementation), we have, assuming the kernel must **completely** overlap with the input (plus padding $P$), that for

- $y \in \mathbb{R}^{n}$ as the output
- $z \in \mathbb{R}^n$ as the input
- $k \in \mathbb{R}^m$ as the kernel
- $P$ as the padding, equal to $\frac{m-1}{2}$ using the *Padding Equation* for *odd*-sized kernels only
- Padding Equation: $N_{out} = N_{in} + 2P - (K-1)$, with $N_{out}$ output size, $N_{in}$ input size, $P$ padding added to each end, $K$ kernel size

$$ y(x) = (z * k)(x) = \sum_{b=1}^{m} k(b)z(x-b+P) $$

*Note*: in this blog post $y(x)$ is **0-indexed**; in [Optimizing 1D Convolutions Implementation](https://www.noteblogdoku.com/blog/optimizing-1d-convolutions-implementation) the derivation was 1-indexed, and the kernel was $w$, *not* $k$.

Extending this to our problem setup in 2D, we will need two paddings, one for each dimension, designated $P_1$ and $P_2$, and they are straightforward to calculate.

Given

$$ N_{out} = N_{in} + 2P - (K-1)$$

- $N_{out}$ output size
- $N_{in}$ input size
- $P$ padding added to each end
- $K$ kernel size

For $P_1$ with $N_{out1} = N_{in1} = m$, $K_1 = r$

$$ N_{out} = N_{in} + 2P - (K-1) \rightarrow m = m + 2P_1 - (r-1) \rightarrow P_1 = \frac{r-1}{2}$$

Similarly for $P_2$ with $N_{out2} = N_{in2} = n$, $K_2 = s$

$$ N_{out} = N_{in} + 2P - (K-1) \rightarrow n = n + 2P_2 - (s-1) \rightarrow P_2 = \frac{s-1}{2}$$

Giving us the *new* 2D convolution formula with

- $o,z \in \mathbb{R}^{m \times n}$
- $k \in \mathbb{R}^{r \times s}$
- $P_1 = \frac{r-1}{2}$
- $P_2 = \frac{s-1}{2}$

$$ o[p,q] = (z * k)[p,q] = \sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2] $$

where, for arbitrary $u,v$, $z[u,v] = 0$ when $u \notin [0, m-1]$ or $v \notin [0,n-1]$.

This is fantastic, we now have a convolution equation that we can optimize such that the input
and output shapes are the *same*. Additionally, since we are directly summing over the indices of the kernel $k$, this makes it easy to optimize the kernel entries.

## Optimization Statement Setup

Given

- $z \in \mathbb{R}^{m \times n}$ as the input image
- $o \in \mathbb{R}^{m \times n}$ as the predicted output image
- $k \in \mathbb{R}^{r \times s}$ as the kernel
- $P_1 = \frac{r-1}{2}$ as the padding for the first dimension
- $P_2 = \frac{s-1}{2}$ as the padding for the second dimension
- $L \in \mathbb{R}$ as the scalar loss from squared loss
- $g \in \mathbb{R}^{m \times n}$ as the ground truth image

with squared loss and the 2D convolution defined as

$$ L(o) = ||g-o||_F^2 = \sum_{p=0}^{m-1} \sum_{q=0}^{n-1} (g_{p,q} - o_{p,q})^2 $$

$$ o[p,q] = (z * k)[p,q] = \sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2] $$

We are first going to solve $\frac{\partial L}{\partial k[i,j]}$ for $i \in [0, r-1]$ and $j \in [0, s-1]$, i.e. component by component. Then we are going to show how to directly get the entire gradient $\frac{\partial L}{\partial k}$ by showing the gradient is itself a convolution!

As a **bonus**, we are going to show how to optimize the previously derived equation $vec(o) = f(k) vec(z) \rightarrow vec(o) = f(z) vec(k)$, where $vec(o) = f(k) vec(z)$ is originally from [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices) and $f(k) \in \mathbb{R}^{(m+r-1)(n+s-1) \times mn}$ is the Toeplitz block matrix of kernel $k$. Note that $vec(o) = f(k) vec(z) = f(z) vec(k)$ using the property that **convolutions are equivalent to polynomial multiplication** & thus **commutative**, as shown in the [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions).

## Component-wise $\frac{\partial L}{\partial k[i,j]}$ Derivation

To start, let's find $dL/do$ per component of $o$, i.e.
$\frac{\partial L}{\partial o[p,q]}$; we have

$$ L(o) = \sum_{p=0}^{m-1} \sum_{q=0}^{n-1} (g_{p,q} - o_{p,q})^2 \rightarrow \frac{\partial L}{\partial o[p,q]} = -2(g[p,q] - o[p,q]) $$

Now let's find $\frac{\partial o[p,q]}{\partial k[i,j]}$. We have that

$$ o[p,q] = (z * k)[p,q] = \sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2] $$

Thus for a specific $k[i,j]$ we have that

$$ do[p,q] = d((z * k)[p,q]) = d\left(\sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2]\right)$$

Since we only want to find $\frac{\partial o[p,q]}{\partial k[i,j]}$, this means that for all $k[a,b]$ with $(a,b) \neq (i,j)$, $dk[a,b] = 0$. Thus our equation reduces to

$$ do[p,q] = d\left(\sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2]\right) \rightarrow z[p-i+P_1,q-j+P_2]dk[i,j]$$

giving

$$ \frac{\partial o[p,q]}{\partial k[i,j]} = z[p-i+P_1,q-j+P_2] \in \mathbb{R}$$

Invoking the multivariate chain rule we have

$$ \frac{dL}{dk[i,j]} = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do[p,q]} \frac{do[p,q]}{dk[i,j]} $$

becoming

$$ \frac{dL}{dk[i,j]} = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} -2(g[p,q]-o[p,q]) \times z[p-i+P_1,q-j+P_2] \in \mathbb{R}$$

In numerator form this would be arranged as the matrix gradient

$$ \frac{dL}{dk} = \begin{bmatrix} \frac{dL}{dk[0,0]} & \frac{dL}{dk[1,0]} & \cdots & \frac{dL}{dk[r-1,0]}\\ \frac{dL}{dk[0,1]} & \frac{dL}{dk[1,1]} & \cdots & \frac{dL}{dk[r-1,1]}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{dL}{dk[0,s-1]} & \frac{dL}{dk[1,s-1]} & \cdots & \frac{dL}{dk[r-1,s-1]} \end{bmatrix} \in \mathbb{R}^{s\times r} $$

## $\frac{\partial L}{\partial k}$ as a Convolution Derivation

Using the previously stated multivariate chain rule

$$ \frac{dL}{dk[i,j]} = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do[p,q]} \frac{do[p,q]}{dk[i,j]} $$

which, using the fact that our derivatives are in numerator format, becomes

$$ \frac{dL}{dk}[j,i] = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do}[q,p] \frac{do[p,q]}{dk[i,j]} $$

Note that since we previously derived that

$$ \frac{\partial
o[p,q]}{\partial k[i,j]} = z[p-i+P_1,q-j+P_2] $$

we substitute it into our equation, giving

$$ \frac{dL}{dk}[j,i] = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do}[q,p] \frac{do[p,q]}{dk[i,j]} \rightarrow \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do}[q,p] z[p-i+P_1,q-j+P_2]$$

Take the transpose of $\frac{dL}{do}[q,p]$ to get $(\frac{dL}{do}[q,p])^T = (\frac{dL}{do})^T[p,q]$ and $(\frac{dL}{dk}[j,i])^T = (\frac{dL}{dk})^T[i,j]$, giving

$$ (\frac{dL}{dk})^T[i,j] = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} (\frac{dL}{do})^T[p,q] z[p-i+P_1,q-j+P_2]$$

Notice how this looks similar to the form of the **2D Convolution** as we previously defined it

$$ (z * k)[p,q] = \sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2] $$

Seeing this, we can say that with $(\frac{dL}{do})^T$ playing the role of the kernel $k$ that

$$ (\frac{dL}{dk})^T[i,j] = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} (\frac{dL}{do})^T[p,q] z[p-i+P_1,q-j+P_2] = ((\frac{dL}{do})^T * z)[i,j]$$

Simplifying to

$$ (\frac{dL}{dk})^T = ((\frac{dL}{do})^T * z) \rightarrow \frac{dL}{dk} = ((\frac{dL}{do})^T * z)^T$$

In the context of a neural network, our update $\frac{dL}{dk} \in \mathbb{R}^{s \times r}$ is the transpose of the gradient signal from the previous layer, $(\frac{dL}{do})^T$, **convolved** with our input image $z$.

## Matrix Calculus Derivation $vec(o) = f(k)vec(z)$ for Full Convolution (Input & Output Different Sizes)

As a refresher from [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices), we have the **full convolution** (note that previously we were only considering the case where the *input* $z$ and *output* $o$ have the same size; this is the case where there is **at least** one element of overlap between kernel $k$ and input $z$).
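Before moving to the matrix-calculus view, the component-wise gradient formula derived above can be sanity-checked numerically. This is a sketch assuming NumPy/SciPy, with `scipy.signal.convolve2d(..., mode="same")` standing in for the same-size convolution (for odd-sized kernels it matches the centered, zero-padded convolution used here):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
m, n, r, s = 4, 5, 3, 3
P1, P2 = (r - 1) // 2, (s - 1) // 2
z = rng.standard_normal((m, n))   # input image
g = rng.standard_normal((m, n))   # ground truth image
k = rng.standard_normal((r, s))   # kernel

def loss(k):
    o = convolve2d(z, k, mode="same")  # same-size convolution o = z * k
    return np.sum((g - o) ** 2)

# dL/dk[i,j] = sum_{p,q} -2 (g - o)[p,q] * z[p-i+P1, q-j+P2]
o = convolve2d(z, k, mode="same")
e = -2.0 * (g - o)
grad = np.zeros((r, s))
for i in range(r):
    for j in range(s):
        for p in range(m):
            for q in range(n):
                u, v = p - i + P1, q - j + P2
                if 0 <= u < m and 0 <= v < n:  # z is zero outside its bounds
                    grad[i, j] += e[p, q] * z[u, v]

# central finite-difference check, one kernel entry at a time
eps = 1e-6
fd = np.zeros((r, s))
for i in range(r):
    for j in range(s):
        kp, km = k.copy(), k.copy()
        kp[i, j] += eps
        km[i, j] -= eps
        fd[i, j] = (loss(kp) - loss(km)) / (2 * eps)

print(np.max(np.abs(grad - fd)))  # agrees up to finite-difference error
```

Note that `grad` here is stored entry-wise with $\frac{dL}{dk[i,j]}$ at `grad[i, j]` (an $r \times s$ array), i.e. the transpose of the $s \times r$ numerator-convention gradient matrix.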
The reason why we break convention here is to avoid re-deriving the Toeplitz block matrix $f(k)$; *however*, if we were to, we would simply take the middle sub-matrix of $mn$ rows so that the output shape of $vec(o) = f(k)vec(z)$ matches the input shape of $vec(z) \in \mathbb{R}^{mn}$.

- $z \in \mathbb{R}^{m \times n}$ the input image
- $vec(z) \in \mathbb{R}^{mn}$ the vectorized input image
- $k \in \mathbb{R}^{r \times s}$ the kernel
- $vec(k) \in \mathbb{R}^{rs}$ the vectorized kernel
- $o \in \mathbb{R}^{(m+r-1)\times (n+s-1)}$ the output image
- $vec(o) \in \mathbb{R}^{(m+r-1)(n+s-1)}$ the vectorized output image
- $f(k) \in \mathbb{R}^{(m+r-1)(n+s-1) \times mn}$ the Toeplitz block matrix of kernel $k$
- $L \in \mathbb{R}$ as the scalar loss from squared loss
- $g \in \mathbb{R}^{(m+r-1)\times (n+s-1)}$ as the ground truth image

To start we have

$$ L = ||g - o||_F^2 = vec(g-o)^T vec(g-o)$$

$$ dL = d(vec(g-o)^T)\, vec(g-o) + vec(g-o)^T d(vec(g-o)) $$

and, since $g$ is constant, $d(vec(g-o)) = -d(vec(o))$, giving

$$ dL = -dvec(o)^T vec(g-o) - vec(g-o)^T d(vec(o)) $$

Note that $dvec(o)^T vec(g-o) \in \mathbb{R}$, thus $(dvec(o)^T vec(g-o))^T = dvec(o)^T vec(g-o)$

$$ dL = - vec(g-o)^T d(vec(o)) - vec(g-o)^T d(vec(o)) = -2 vec(g-o)^T d(vec(o))$$

Now we are going to use the *commutative property* of polynomial multiplication; recall from [1D Convolutions](https://www.noteblogdoku.com/blog/1d-convolutions) that convolutions *are* polynomial multiplication, hence

$$ o[p,q] = (k * z)[p,q] = (z * k)[p,q] $$

Thus from our derivation in [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices) we would have that

$$vec(o) = f(k)vec(z) = f(z) vec(k) $$

where we now take the Toeplitz block matrix of the image $z$ and not the kernel $k$.

Great, now let's find $d(vec(o))$

$$ d(vec(o)) = d(f(z)vec(k)) = d(f(z)) vec(k) + f(z) d(vec(k))$$

Note that since $z$ is the input, the differential of $z$ is $dz = 0$.
Since $f(z)$ is a linear operator on $z$, we have that $d(f(z)) = 0$ too. Thus

$$ d(vec(o)) = f(z) d(vec(k))$$

Knowing that for a generic function $h(x)$

$$ dh(x) = \frac{\partial h}{\partial x} dx \quad \langle \frac{\partial h}{\partial x}, dx \rangle = dh $$

we get that

$$ \frac{\partial vec(o)}{\partial vec(k)} = f(z) \in \mathbb{R}^{(m+r-1)(n+s-1) \times (rs)}$$

for the **full convolution**, where the kernel and image have **at least** one cell of overlap and the size of output $o$ is not equal to that of input $z$.

To finish up

$$ dL = -2 vec(g-o)^T d(vec(o))$$

$$ d(vec(o)) = f(z) d(vec(k))$$

Substituting & using the chain rule we get

$$ dL = -2 vec(g-o)^T f(z) d(vec(k))$$

Now, again using that for a generic function $h(x)$

$$ dh(x) = \frac{\partial h}{\partial x} dx \quad \langle \frac{\partial h}{\partial x}, dx \rangle = dh $$

we get that

$$ \frac{\partial L}{\partial vec(k)} = -2 vec(g-o)^T f(z) \in \mathbb{R}^{1\times (rs)}$$

To do a shape check, note

$$ vec(g-o)^T \in \mathbb{R}^{1\times (m+r-1)(n+s-1)} \quad f(z) \in \mathbb{R}^{(m+r-1)(n+s-1) \times (rs)}$$

resulting in

$$ \frac{\partial L}{\partial vec(k)} = -2 vec(g-o)^T f(z) \in \mathbb{R}^{1 \times (rs)}$$

which we would then reshape

$$\frac{\partial L}{\partial vec(k)} \in \mathbb{R}^{1 \times (rs)} \rightarrow \frac{\partial L}{\partial k} \in \mathbb{R}^{s \times r}$$

using the *numerator* convention.
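To make the bonus result concrete, here is a small sketch that builds $f(z)$ explicitly for the full convolution and evaluates $\frac{\partial L}{\partial vec(k)} = -2\, vec(g-o)^T f(z)$. It assumes NumPy/SciPy and row-major vectorization for $vec(\cdot)$, a convention this post does not pin down:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(2)
m, n, r, s = 3, 4, 2, 3
M, N = m + r - 1, n + s - 1        # full-convolution output size
z = rng.standard_normal((m, n))    # input image
k = rng.standard_normal((r, s))    # kernel
g = rng.standard_normal((M, N))    # ground truth image

# Build f(z) with vec(o) = f(z) vec(k): row (u,v) of the output,
# column (a,b) of the kernel, entry z[u-a, v-b] (zero outside bounds).
fz = np.zeros((M * N, r * s))
for u in range(M):
    for v in range(N):
        for a in range(r):
            for b in range(s):
                i, j = u - a, v - b
                if 0 <= i < m and 0 <= j < n:
                    fz[u * N + v, a * s + b] = z[i, j]

# vec(o) = f(z) vec(k) reproduces scipy's full convolution
o = convolve2d(z, k, mode="full")
assert np.allclose(fz @ k.ravel(), o.ravel())

# dL/dvec(k) = -2 vec(g - o)^T f(z), a 1 x (rs) row vector
grad_row = -2.0 * (g - o).ravel() @ fz
print(grad_row.shape)  # (6,)
```

Reshaping `grad_row` to `(r, s)` gives the entry-wise kernel gradient; the numerator-convention $s \times r$ matrix above is its transpose.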