# Optimizing 2D Convolutions: Implementation

by dokuDoku

Code: `DoKu88/simpleNeuralNet/blob/main/2D_convolution.py`

🎥 Video

The video above shows the results of the code implementation of the math below. Three 2D kernels of size 3x3 are trained, with ReLU non-linearities between the layers, to denoise the input image so that it matches the ground-truth image as closely as possible. As the video shows, the convolutional network gets better over time at handling the salt-and-pepper noise. The code is linked at the top of this post; it is written to follow this post, not for optimal runtime.

## Recap from [Optimizing 2D Convolutions](https://www.noteblogdoku.com/blog/optimizing-2d-convolutions)

Given an input image $z \in \mathbb{R}^{m \times n}$, a kernel $k \in \mathbb{R}^{r \times s}$, and a same-padded output $o \in \mathbb{R}^{m \times n}$, the squared-error loss against a clean target $g$ is:

$$L = \|g - o\|_F^2$$

The derivative of the loss with respect to each output pixel:

$$\frac{\partial L}{\partial o[p,q]} = -2(g[p,q] - o[p,q])$$

The output's dependence on each kernel weight follows directly from the convolution definition:

$$\frac{\partial o[p,q]}{\partial k[i,j]} = z[p - i + P_1,\ q - j + P_2]$$

Applying the multivariate chain rule, the gradient for each kernel element sums over all output positions:

$$\frac{dL}{dk[i,j]} = \sum_{p=0}^{m-1}\sum_{q=0}^{n-1} \frac{\partial L}{\partial o[p,q]} \cdot z[p - i + P_1,\ q - j + P_2]$$

The key result from that post is that this entire gradient matrix is itself a convolution:

$$\frac{dL}{dk} = \left(\frac{dL}{do}^T * z\right)^T$$

The upstream gradient convolved with the input gives the kernel gradient directly. The backward pass and forward pass are **both 2D convolutions**.
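The chain-rule gradient above can be sanity-checked numerically. Below is a minimal sketch, not taken from the repo: a naive same-padded convolution built from loops, the analytic $\frac{dL}{dk}$ from the summation formula, and a finite-difference comparison. The function and variable names (`conv2d_same`, `upstream`, etc.) are illustrative assumptions.

```python
import numpy as np

def conv2d_same(z, k):
    # naive same-padded true convolution: correlate with the doubly-flipped kernel
    kf = np.flip(k)
    r, s = k.shape
    zp = np.pad(z, ((r // 2, r // 2), (s // 2, s // 2)))
    m, n = z.shape
    out = np.zeros((m, n))
    for p in range(m):
        for q in range(n):
            out[p, q] = np.sum(zp[p:p + r, q:q + s] * kf)
    return out

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 8))   # toy "input image"
g = rng.standard_normal((8, 8))   # toy "clean target"
k = rng.standard_normal((3, 3))

# analytic gradient from the chain-rule sum: each dk[i, j] is the upstream
# gradient multiplied elementwise with a shifted window of the padded input
o = conv2d_same(z, k)
upstream = -2.0 * (g - o)         # dL/do
zp = np.pad(z, 1)
dk = np.zeros_like(k)
for i in range(3):
    for j in range(3):
        dk[i, j] = np.sum(upstream * zp[2 - i:10 - i, 2 - j:10 - j])

# finite-difference check of dL/dk
eps, num = 1e-6, np.zeros_like(k)
for i in range(3):
    for j in range(3):
        kp, km = k.copy(), k.copy()
        kp[i, j] += eps
        km[i, j] -= eps
        num[i, j] = (np.sum((g - conv2d_same(z, kp)) ** 2)
                     - np.sum((g - conv2d_same(z, km)) ** 2)) / (2 * eps)

print(np.allclose(dk, num, atol=1e-4))  # True: analytic gradient matches
```

The same check works for the cross-correlation variant used in the implementation; only the direction of the window shift changes.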
**Note** that this implementation actually uses **cross-correlation**, not **convolution**: we do not flip the kernel along both of its axes, so each kernel entry is simply multiplied by the input pixel at the same relative position. Since the kernel weights are learned, this effectively does not matter, because the network can simply *learn* the flip. In short, convolution is correlation with a flipped kernel.

---

## im2col: Unrolling Input Patches into Columns

Implementing the formula above with nested Python loops over $(p, q, i, j)$ is correct but slow. The im2col trick reformulates it as a single matrix multiply.

For every output position $(p, q)$, we extract the kernel-sized patch from the padded input and lay it flat as a row (a column vector in the math, but a row in Python). For a $32 \times 32$ output and a $3 \times 3$ kernel, this produces a `cols` matrix of shape $(1024, 9)$. The forward pass becomes:

```python
cols, out_shape = self._im2col(z_input, kernel.shape, padding)
kernel_flat = kernel.reshape(-1)       # (9,)
out_flat = cols @ kernel_flat          # (1024,)
out = out_flat.reshape(out_shape)      # (32, 32)
```

Observe what the kernel gradient formula becomes in this representation. With the upstream $\frac{dL}{do}$ flattened to shape $(1024,)$:

$$\frac{dL}{dk_\text{flat}} = \texttt{cols}^T \cdot \left(\frac{dL}{do}\right)_\text{flat}$$

This is the convolution from the recap above rewritten as a matrix-vector product. In code:

```python
upstream_flat = upstream.reshape(-1)
dK_flat = cols.T @ upstream_flat       # (9,)
self.dLdK = dK_flat.reshape(k_h, k_w)  # (3, 3)
```

The `cols` matrix is **computed once** in the forward pass and cached. The backward pass reuses it with a single NumPy call instead of a slow Python loop. Because im2col reformulates the convolution as a *matrix multiply*, NumPy can run it using optimized linear algebra libraries (BLAS).
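The snippets above call a helper `self._im2col(...)` without listing it. Here is one hypothetical standalone version consistent with the shapes used above; the signature and names are assumptions, not the repo's actual code.

```python
import numpy as np

def im2col(z, kernel_shape, padding):
    """Hypothetical stand-in for the _im2col helper used above.

    Returns (cols, out_shape), with one flattened kernel-sized
    patch of the padded input per row of cols.
    """
    k_h, k_w = kernel_shape
    p_h, p_w = padding
    zp = np.pad(z, ((p_h, p_h), (p_w, p_w)))
    out_h = zp.shape[0] - k_h + 1
    out_w = zp.shape[1] - k_w + 1
    cols = np.empty((out_h * out_w, k_h * k_w))
    for p in range(out_h):
        for q in range(out_w):
            cols[p * out_w + q] = zp[p:p + k_h, q:q + k_w].ravel()
    return cols, (out_h, out_w)

# usage: same-padded 3x3 box filter on a 4x4 image
z = np.arange(16.0).reshape(4, 4)
k = np.full((3, 3), 1.0 / 9.0)
cols, out_shape = im2col(z, k.shape, (1, 1))
print(cols.shape)                                # (16, 9)
out = (cols @ k.reshape(-1)).reshape(out_shape)
print(out.shape)                                 # (4, 4)
```

Note that `cols @ kernel_flat` computes a correlation, with no kernel flip, which matches the cross-correlation convention stated in the note above.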
For the gradient with respect to the input $z$, needed to propagate error back to earlier layers, we use the full convolution with the **flipped kernel**, then crop the padding. Note that because the forward pass does **cross-correlation** rather than **convolution**, we must flip the kernel here: in our derived convolution equations the kernel is already inherently flipped, so for the update rule to apply we flip it explicitly during backpropagation.

```python
k_flipped = np.flip(self.kernel)
full_cols, full_shape = self._im2col(upstream, k_flipped.shape, (k_h - 1, k_w - 1))
full = (full_cols @ k_flipped.reshape(-1)).reshape(full_shape)
dz = full[pad_h:H_p - pad_h, pad_w:W_p - pad_w]  # crop back to input size
```

---

## Results

Training on 1,000 synthetic $32 \times 32$ grayscale images (sinusoidal backgrounds with random blobs) at 10% salt-and-pepper noise, over 500 epochs with learning rate $10^{-3}$, the three-layer network learns to suppress noise. As visible in the video above, the denoised output progressively recovers the structure of the clean image as the kernels evolve from their random Xavier initialization into spatially structured smoothing filters.
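The 10% salt-and-pepper corruption described above can be reproduced in a few lines of NumPy. A minimal sketch, assuming pixel values in $[0, 1]$; the repo's actual noise routine may differ.

```python
import numpy as np

def salt_and_pepper(img, amount=0.10, rng=None):
    """Set a random `amount` fraction of pixels to 0 (pepper) or 1 (salt)."""
    rng = np.random.default_rng(rng)
    noisy = img.copy()
    mask = rng.random(img.shape) < amount   # which pixels get corrupted
    salt = rng.random(img.shape) < 0.5      # coin flip: salt vs. pepper
    noisy[mask & salt] = 1.0
    noisy[mask & ~salt] = 0.0
    return noisy

clean = np.full((32, 32), 0.5)
noisy = salt_and_pepper(clean, amount=0.10, rng=0)
frac = np.mean(noisy != 0.5)
print(round(frac, 2))   # roughly 0.10 of pixels flipped
```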