# Optimizing 2D Convolutions

by dokuDoku

## Recap from [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices)

To give a short recap from the previous blog post introducing 2D Convolutions, [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices):

Given the **full** convolution, where only **one element** of overlap is required, we have

- $z \in \mathbb{R}^{m \times n}$ the input image
- $vec(z) \in \mathbb{R}^{mn}$ the vectorized input image
- $k \in \mathbb{R}^{r \times s}$ the kernel
- $o \in \mathbb{R}^{(m+r-1)\times (n+s-1)}$ the output image
- $vec(o) \in \mathbb{R}^{(m+r-1)(n+s-1)}$ the vectorized output image
- $f(k) \in \mathbb{R}^{(m+r-1)(n+s-1) \times mn}$ the Toeplitz block matrix of kernel $k$

**NOTE** All equations are 0-indexed and all derivatives use the **numerator** convention!

This is the definition of the 2-dimensional convolution of $z$ and $k$

$$ o[p,q] = (z * k)[p,q] = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} z[i,j] \, k[p-i, q-j] $$

Additionally, it was shown that when $o$ is vectorized, this 2D convolution can be represented as the multiplication between the Toeplitz block matrix of kernel $k$, $f(k)$, and the vectorized input image $vec(z)$

$$ vec(o) = f(k) vec(z)$$

## Force output $o$ to have same size as input $z$

To make it easier to optimize our kernel $k$ with respect to a scalar loss function, let's force the output $o$ of the convolution to match the input size of $z$, i.e. $o,z \in \mathbb{R}^{m \times n}$.
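As a quick numerical sketch of this goal (assuming NumPy and SciPy are available), `scipy.signal.convolve2d` exposes both the full convolution and a `mode="same"` output that keeps the input's shape:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 7))  # input image, m x n
k = rng.standard_normal((3, 3))  # odd-sized kernel, r x s

o_full = convolve2d(z, k, mode="full")  # (m+r-1) x (n+s-1) = (7, 9)
o_same = convolve2d(z, k, mode="same")  # m x n = (5, 7), centered crop of o_full

print(o_full.shape, o_same.shape)
```

For an odd-sized kernel, `o_same` is exactly the centered $m \times n$ block of `o_full`, which is the behavior we are about to derive via padding.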
Given our result doing this for 1D convolutions in [Optimizing 1D Convolutions Implementation](https://www.noteblogdoku.com/blog/optimizing-1d-convolutions-implementation), we have, assuming the kernel must **completely** overlap with the input (plus padding $P$), that for

- $y \in \mathbb{R}^{n}$ as the output
- $z \in \mathbb{R}^n$ as the input
- $k \in \mathbb{R}^m$ as the kernel
- $P$ as the padding, equal to $\frac{m-1}{2}$ using the *Padding Equation* for *odd*-sized kernels only
- Padding Equation: $N_{out} = N_{in} + 2P - (K-1)$, with $N_{out}$ output size, $N_{in}$ input size, $P$ padding added to each end, $K$ kernel size

$$ y(x) = (z * k)(x) = \sum_{b=1}^{m} k(b)z(x-b+P) $$

*Note*: in this blog post $y(x)$ is **0-indexed**; in [Optimizing 1D Convolutions Implementation](https://www.noteblogdoku.com/blog/optimizing-1d-convolutions-implementation) the derivation was 1-indexed, and the kernel was $w$, *not* $k$.

Extending this to our problem setup in 2D, we will need two paddings, one for each dimension, designated $P_1$ and $P_2$, and they are straightforward to calculate.

Given

$$ N_{out} = N_{in} + 2P - (K-1)$$

- $N_{out}$ output size
- $N_{in}$ input size
- $P$ padding added to each end
- $K$ kernel size

For $P_1$ with $N_{out1} = N_{in1} = m$, $K_1 = r$

$$ N_{out} = N_{in} + 2P - (K-1) \rightarrow m = m + 2P_1 - (r-1) \rightarrow P_1 = \frac{r-1}{2}$$

Similarly for $P_2$ with $N_{out2} = N_{in2} = n$, $K_2 = s$

$$ N_{out} = N_{in} + 2P - (K-1) \rightarrow n = n + 2P_2 - (s-1) \rightarrow P_2 = \frac{s-1}{2}$$

Giving us the *new* 2D convolution formula with

- $o,z \in \mathbb{R}^{m \times n}$
- $k \in \mathbb{R}^{r \times s}$
- $P_1 = \frac{r-1}{2}$
- $P_2 = \frac{s-1}{2}$

$$ o[p,q] = (z * k)[p,q] = \sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2] $$

where, for arbitrary $u,v$, $z[u,v] = 0$ when $u \notin [0, m-1]$ or $v \notin [0,n-1]$.

This is fantastic, we now have a convolution equation that we can optimize such that the input
and output shapes are the *same*. Additionally, since we are directly summing over the indices of the kernel $k$, this makes it easy to optimize the kernel entries.

## Optimization Statement Setup

Given

- $z \in \mathbb{R}^{m \times n}$ as the input image
- $o \in \mathbb{R}^{m \times n}$ as the predicted output image
- $k \in \mathbb{R}^{r \times s}$ as the kernel
- $P_1 = \frac{r-1}{2}$ as the padding for the first dimension
- $P_2 = \frac{s-1}{2}$ as the padding for the second dimension
- $L \in \mathbb{R}$ as the scalar loss from squared loss
- $g \in \mathbb{R}^{m \times n}$ as the ground truth image

with squared loss and the 2D convolution defined as

$$ L(o) = ||g-o||_F^2 = \sum_{p=0}^{m-1} \sum_{q=0}^{n-1} (g_{p,q} - o_{p,q})^2 $$

$$ o[p,q] = (z * k)[p,q] = \sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2] $$

We are first going to solve $\frac{\partial L}{\partial k[i,j]}$ for $i \in [0, r-1]$ and $j \in [0, s-1]$, i.e. component by component. Then we are going to show how to directly get the entire gradient $\frac{\partial L}{\partial k}$ by showing the gradient is itself a convolution!

As a **bonus**, we are going to show how to optimize the previously derived equation $vec(o) = f(k) vec(z) \rightarrow vec(o) = f(z) vec(k)$, where $vec(o) = f(k) vec(z)$ is originally from [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices) and $f(k) \in \mathbb{R}^{(m+r-1)(n+s-1) \times mn}$ is the Toeplitz block matrix of kernel $k$. Note that $vec(o) = f(k) vec(z) = f(z) vec(k)$ using the property that **convolutions are equivalent to polynomial multiplication** & thus **commutative**, as shown in the [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions).

## Component-wise $\frac{\partial L}{\partial k[i,j]}$ Derivation

To start, let's find $dL/do$ per component of $o$, i.e.
$\frac{\partial L}{\partial o[p,q]}$; we have

$$ L(o) = \sum_{p=0}^{m-1} \sum_{q=0}^{n-1} (g_{p,q} - o_{p,q})^2 \rightarrow \frac{\partial L}{\partial o[p,q]} = -2(g[p,q] - o[p,q]) $$

Now let's find $\frac{\partial o[p,q]}{\partial k[i,j]}$. We have that

$$ o[p,q] = (z * k)[p,q] = \sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2] $$

Thus for a specific $k[i,j]$ we have that

$$ do[p,q] = d((z * k)[p,q]) = d\left(\sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2]\right)$$

Since we only want to find $\frac{\partial o[p,q]}{\partial k[i,j]}$, this means that for all $k[a,b]$ with $(a,b) \neq (i,j)$, $dk[a,b] = 0$. Thus our equation reduces to

$$ do[p,q] = d\left(\sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2]\right) \rightarrow z[p-i+P_1,q-j+P_2]dk[i,j]$$

giving

$$ \frac{\partial o[p,q]}{\partial k[i,j]} = z[p-i+P_1,q-j+P_2] \in \mathbb{R}$$

Invoking the multivariate chain rule we have

$$ \frac{dL}{dk[i,j]} = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do[p,q]} \frac{do[p,q]}{dk[i,j]} $$

becoming

$$ \frac{dL}{dk[i,j]} = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} -2(g[p,q]-o[p,q]) \times z[p-i+P_1,q-j+P_2] \in \mathbb{R}$$

In numerator form this would be arranged as the matrix gradient

$$ \frac{dL}{dk} = \begin{bmatrix} \frac{dL}{dk[0,0]} & \frac{dL}{dk[1,0]} & \cdots & \frac{dL}{dk[r-1,0]}\\ \frac{dL}{dk[0,1]} & \frac{dL}{dk[1,1]} & \cdots & \frac{dL}{dk[r-1,1]}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{dL}{dk[0,s-1]} & \frac{dL}{dk[1,s-1]} & \cdots & \frac{dL}{dk[r-1,s-1]} \end{bmatrix} \in \mathbb{R}^{s\times r} $$

## $\frac{\partial L}{\partial k}$ as a Convolution Derivation

Using the previously stated multivariate chain rule

$$ \frac{dL}{dk[i,j]} = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do[p,q]} \frac{do[p,q]}{dk[i,j]} $$

which, using the fact that our derivatives are in numerator format, becomes

$$ \frac{dL}{dk}[j,i] = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do}[q,p] \frac{do[p,q]}{dk[i,j]} $$

Note that since we previously derived that

$$ \frac{\partial
o[p,q]}{\partial k[i,j]} = z[p-i+P_1,q-j+P_2] $$

we substitute it into our equation, giving

$$ \frac{dL}{dk}[j,i] = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do}[q,p] \frac{do[p,q]}{dk[i,j]} \rightarrow \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{dL}{do}[q,p] z[p-i+P_1,q-j+P_2]$$

Take the transpose of $\frac{dL}{do}[q,p]$ to get $(\frac{dL}{do}[q,p])^T = (\frac{dL}{do})^T[p,q]$ and $(\frac{dL}{dk}[j,i])^T = (\frac{dL}{dk})^T[i,j]$, giving

$$ (\frac{dL}{dk})^T[i,j] = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} (\frac{dL}{do})^T[p,q] z[p-i+P_1,q-j+P_2]$$

Notice how this looks similar to the form of the **2D Convolution** as we previously defined it

$$ (z * k)[p,q] = \sum_{a=0}^{r-1}\sum_{b=0}^{s-1} k[a,b]z[p-a+P_1,q-b+P_2] $$

Seeing this, we can say that with $(\frac{dL}{do})^T$ playing the role of the kernel $k$ that

$$ (\frac{dL}{dk})^T[i,j] = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} (\frac{dL}{do})^T[p,q] z[p-i+P_1,q-j+P_2] = ((\frac{dL}{do})^T * z)[i,j]$$

Simplifying to

$$ (\frac{dL}{dk})^T = ((\frac{dL}{do})^T * z) \rightarrow \frac{dL}{dk} = ((\frac{dL}{do})^T * z)^T$$

In the context of a neural network, our update $\frac{dL}{dk} \in \mathbb{R}^{s \times r}$ is the transpose of the gradient signal from the previous layer, $(\frac{dL}{do})^T$, **convolved** with our input image $z$.

## Matrix Calculus Derivation $vec(o) = f(k)vec(z)$ for Full Convolution (Input & Output Different Sizes)

As a refresher from [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices), we have the **full convolution** (note that previously we were only considering the case where the *input* $z$ and *output* $o$ have the same size; this is the case where there is **at least** one element of overlap between kernel $k$ and input $z$).
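Before moving to the matrix-calculus view, the component-wise gradient formula derived above can be sanity-checked numerically. This is a sketch assuming NumPy/SciPy, with `scipy.signal.convolve2d(..., mode="same")` standing in for the same-size convolution (for odd-sized kernels it matches the centered, zero-padded convolution used here):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
m, n, r, s = 4, 5, 3, 3
P1, P2 = (r - 1) // 2, (s - 1) // 2
z = rng.standard_normal((m, n))   # input image
g = rng.standard_normal((m, n))   # ground truth image
k = rng.standard_normal((r, s))   # kernel

def loss(k):
    o = convolve2d(z, k, mode="same")  # same-size convolution o = z * k
    return np.sum((g - o) ** 2)

# dL/dk[i,j] = sum_{p,q} -2 (g - o)[p,q] * z[p-i+P1, q-j+P2]
o = convolve2d(z, k, mode="same")
e = -2.0 * (g - o)
grad = np.zeros((r, s))
for i in range(r):
    for j in range(s):
        for p in range(m):
            for q in range(n):
                u, v = p - i + P1, q - j + P2
                if 0 <= u < m and 0 <= v < n:  # z is zero outside its bounds
                    grad[i, j] += e[p, q] * z[u, v]

# central finite-difference check, one kernel entry at a time
eps = 1e-6
fd = np.zeros((r, s))
for i in range(r):
    for j in range(s):
        kp, km = k.copy(), k.copy()
        kp[i, j] += eps
        km[i, j] -= eps
        fd[i, j] = (loss(kp) - loss(km)) / (2 * eps)

print(np.max(np.abs(grad - fd)))  # agrees up to finite-difference error
```

Note that `grad` here is stored entry-wise with $\frac{dL}{dk[i,j]}$ at `grad[i, j]` (an $r \times s$ array), i.e. the transpose of the $s \times r$ numerator-convention gradient matrix.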
The reason why we break convention here is to avoid re-deriving the Toeplitz block matrix $f(k)$; *however*, if we were to, we would simply take the middle sub-matrix of $mn$ rows so that the output shape of $vec(o) = f(k)vec(z)$ matches the input shape of $vec(z) \in \mathbb{R}^{mn}$.

- $z \in \mathbb{R}^{m \times n}$ the input image
- $vec(z) \in \mathbb{R}^{mn}$ the vectorized input image
- $k \in \mathbb{R}^{r \times s}$ the kernel
- $vec(k) \in \mathbb{R}^{rs}$ the vectorized kernel
- $o \in \mathbb{R}^{(m+r-1)\times (n+s-1)}$ the output image
- $vec(o) \in \mathbb{R}^{(m+r-1)(n+s-1)}$ the vectorized output image
- $f(k) \in \mathbb{R}^{(m+r-1)(n+s-1) \times mn}$ the Toeplitz block matrix of kernel $k$
- $L \in \mathbb{R}$ as the scalar loss from squared loss
- $g \in \mathbb{R}^{(m+r-1)\times (n+s-1)}$ as the ground truth image

To start we have

$$ L = ||g - o||_F^2 = vec(g-o)^T vec(g-o)$$

$$ dL = d(vec(g-o)^T)\, vec(g-o) + vec(g-o)^T d(vec(g-o)) $$

and, since $g$ is constant, $d(vec(g-o)) = -d(vec(o))$, giving

$$ dL = -dvec(o)^T vec(g-o) - vec(g-o)^T d(vec(o)) $$

Note that $dvec(o)^T vec(g-o) \in \mathbb{R}$, thus $(dvec(o)^T vec(g-o))^T = dvec(o)^T vec(g-o)$

$$ dL = - vec(g-o)^T d(vec(o)) - vec(g-o)^T d(vec(o)) = -2 vec(g-o)^T d(vec(o))$$

Now we are going to use the *commutative property* of polynomial multiplication; recall from [1D Convolutions](https://www.noteblogdoku.com/blog/1d-convolutions) that convolutions *are* polynomial multiplication, hence

$$ o[p,q] = (k * z)[p,q] = (z * k)[p,q] $$

Thus from our derivation in [2D Convolutions Framed as Toeplitz Matrices](https://www.noteblogdoku.com/blog/2d-convolutions-framed-as-toeplitz-matrices) we would have that

$$vec(o) = f(k)vec(z) = f(z) vec(k) $$

where we now take the Toeplitz block matrix of the image $z$ and not the kernel $k$.

Great, now let's find $d(vec(o))$

$$ d(vec(o)) = d(f(z)vec(k)) = d(f(z)) vec(k) + f(z) d(vec(k))$$

Note that since $z$ is the input, the differential of $z$ is $dz = 0$.
Since $f(z)$ is a linear operator on $z$, we have that $d(f(z)) = 0$ too. Thus

$$ d(vec(o)) = f(z) d(vec(k))$$

Knowing that for a generic function $h(x)$

$$ dh(x) = \frac{\partial h}{\partial x} dx \quad \langle \frac{\partial h}{\partial x}, dx \rangle = dh $$

we get that

$$ \frac{\partial vec(o)}{\partial vec(k)} = f(z) \in \mathbb{R}^{(m+r-1)(n+s-1) \times (rs)}$$

for the **full convolution**, where the kernel and image have **at least** one cell of overlap and the size of output $o$ is not equal to that of input $z$.

To finish up

$$ dL = -2 vec(g-o)^T d(vec(o))$$

$$ d(vec(o)) = f(z) d(vec(k))$$

Substituting & using the chain rule we get

$$ dL = -2 vec(g-o)^T f(z) d(vec(k))$$

Now, again using that for a generic function $h(x)$

$$ dh(x) = \frac{\partial h}{\partial x} dx \quad \langle \frac{\partial h}{\partial x}, dx \rangle = dh $$

we get that

$$ \frac{\partial L}{\partial vec(k)} = -2 vec(g-o)^T f(z) \in \mathbb{R}^{1\times (rs)}$$

To do a shape check, note

$$ vec(g-o)^T \in \mathbb{R}^{1\times (m+r-1)(n+s-1)} \quad f(z) \in \mathbb{R}^{(m+r-1)(n+s-1) \times (rs)}$$

resulting in

$$ \frac{\partial L}{\partial vec(k)} = -2 vec(g-o)^T f(z) \in \mathbb{R}^{1 \times (rs)}$$

which we would then reshape

$$\frac{\partial L}{\partial vec(k)} \in \mathbb{R}^{1 \times (rs)} \rightarrow \frac{\partial L}{\partial k} \in \mathbb{R}^{s \times r}$$

using the *numerator* convention.
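To make the bonus result concrete, here is a small sketch that builds $f(z)$ explicitly for the full convolution and evaluates $\frac{\partial L}{\partial vec(k)} = -2\, vec(g-o)^T f(z)$. It assumes NumPy/SciPy and row-major vectorization for $vec(\cdot)$, a convention this post does not pin down:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(2)
m, n, r, s = 3, 4, 2, 3
M, N = m + r - 1, n + s - 1        # full-convolution output size
z = rng.standard_normal((m, n))    # input image
k = rng.standard_normal((r, s))    # kernel
g = rng.standard_normal((M, N))    # ground truth image

# Build f(z) with vec(o) = f(z) vec(k): row (u,v) of the output,
# column (a,b) of the kernel, entry z[u-a, v-b] (zero outside bounds).
fz = np.zeros((M * N, r * s))
for u in range(M):
    for v in range(N):
        for a in range(r):
            for b in range(s):
                i, j = u - a, v - b
                if 0 <= i < m and 0 <= j < n:
                    fz[u * N + v, a * s + b] = z[i, j]

# vec(o) = f(z) vec(k) reproduces scipy's full convolution
o = convolve2d(z, k, mode="full")
assert np.allclose(fz @ k.ravel(), o.ravel())

# dL/dvec(k) = -2 vec(g - o)^T f(z), a 1 x (rs) row vector
grad_row = -2.0 * (g - o).ravel() @ fz
print(grad_row.shape)  # (6,)
```

Reshaping `grad_row` to `(r, s)` gives the entry-wise kernel gradient; the numerator-convention $s \times r$ matrix above is its transpose.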