# Optimizing 1D Convolutions Implementation

*by doku · Machine Learning Foundations · Convolutions*

See the above video showing the results of the [code implementation](https://github.com/DoKu88/simpleNeuralNet/blob/main/1D_convolution.py) of the math below! A 1D kernel of length 11 is trained to denoise the input signal so that it matches the ground truth signal as closely as possible. As we can see, the convolution gets better over time at dealing with the salt-and-pepper noise, and the kernel converges towards a Gaussian-like shape.

The [code implementation](https://github.com/DoKu88/simpleNeuralNet/blob/main/1D_convolution.py) is included in the link at the top of this post. The code is written to follow this blog post, not for optimal runtime.

## Recap from [Optimizing 1D Convolutions](https://www.noteblogdoku.com/blog/optimizing-1d-convolutions)

In the previous post, [Optimizing 1D Convolutions](https://www.noteblogdoku.com/blog/optimizing-1d-convolutions), it was shown how to optimize a 1D convolution (where the input and output vector sizes are **the same**) such that

$$ L(y) = (g-y)^T(g-y) $$

$$ y = (z*w) \rightarrow y = T'(w)z $$

$$ vec(T'(w)) = BSw $$

With variables

- $z \in \mathbb{R}^n$ as the input
- $w \in \mathbb{R}^m$ as the kernel
- $y \in \mathbb{R}^{n}$ as the output
- $g \in \mathbb{R}^n$ as the ground truth
- $T'(w) \in \mathbb{R}^{n \times n}$ as the Toeplitz matrix of kernel $w$
- $B \in \mathbb{R}^{n^2 \times (2n-1)}$ as the vectorized basis of square Toeplitz matrices
- $S \in \mathbb{R}^{(2n-1) \times m}$ as the transformation matrix from the vector space of $w$ to $k$ s.t.
$vec(T'(k)) = Bk$
- $L \in \mathbb{R}$ as the scalar loss

**NOTE:** all vectors are **1-indexed**.

Giving the final result:

$$\frac{\partial L}{\partial w} = -2(g-y)^T (z^T \otimes I_n)BS \in \mathbb{R}^{1 \times m}$$

It was a fantastic derivation: we can optimize the kernel $w$ using the Toeplitz matrix interpretation of 1D convolutions, $T'(w)z = y$. However, it does not appear to be fun to code in Python, where we ultimately want to denoise our input signal $z$ with kernel $w$, producing output $y$ to compare against ground truth $g$.

## A Simpler Derivation of $\frac{\partial L}{\partial w}$

Let's do a simpler derivation of this! Recall from the [Optimizing 1D Convolutions blog post](https://www.noteblogdoku.com/blog/optimizing-1d-convolutions) that we found the padding $P$ equal to $\frac{m-1}{2}$, so that the output $y \in \mathbb{R}^n$ matches the size of our input $z \in \mathbb{R}^n$.

### Re-indexing the Convolution Summation

Formally, the convolution is

$$ y(x) = (z * w)(x) = \sum_{a=-P}^{P} w(a+P+1)z(x-a) $$

Let's make this easier to implement and reason about by making the indexes of the sum range over $[1,m]$. Note that the index of $w$ in this summation already ranges over $[1,m]$, since

when $a = -P$, the index of $w$ is $a+P+1 = -P + P + 1 = 1$

when $a = P$, the index of $w$ is $a+P+1 = P + P + 1 = \frac{m-1}{2} + \frac{m-1}{2} + 1 = m$

Great, let's define $b$ to replace our bounds on $a$ such that

$$ b = a + P + 1 = a + 1 + \frac{m-1}{2} $$

where $b \in [1,m]$ (so it's easy to index into our kernel). Since $z$ has index $x-a$, a simple substitution gives:

$$ x - a = x - (b-P-1) = x - b + P + 1 $$

Thus the convolution summation becomes

$$ y(x) = (z * w)(x) = \sum_{b=1}^{m} w(b)z(x-b+P+1) $$

## Componentwise Derivatives

For this, we are going to simply use the following formulation of the chain rule.
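Before taking any derivatives, the re-indexed summation above can be sanity-checked numerically. A minimal sketch (the signal `z`, kernel `w`, and their lengths are illustrative, not values from the post), converting the 1-indexed formula $y(x) = \sum_{b=1}^{m} w(b)z(x-b+P+1)$ to 0-indexed Python with zero padding outside the signal:

```python
import numpy as np

def conv1d_same(z, w):
    """Same-size 1D convolution: y(x) = sum_b w(b) z(x - b + P + 1),
    1-indexed in the text; 0-indexed here, so the z index becomes x - b + P."""
    n, m = len(z), len(w)
    P = (m - 1) // 2              # padding, m assumed odd
    y = np.zeros(n)
    for x in range(n):
        for b in range(m):
            idx = x - b + P       # 0-indexed form of x - b + P + 1
            if 0 <= idx < n:      # zero-pad outside the signal
                y[x] += w[b] * z[idx]
    return y

rng = np.random.default_rng(0)
z = rng.standard_normal(50)       # toy input signal
w = rng.standard_normal(11)       # kernel of length 11, as in the video
assert np.allclose(conv1d_same(z, w), np.convolve(z, w, mode='same'))
```

The assertion confirms the re-indexed summation agrees with NumPy's `mode='same'` convolution for an odd-length kernel.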
For some arbitrary functions

$$ f(x):\mathbb{R}^n \rightarrow \mathbb{R} $$

$$ g(t):\mathbb{R}^m \rightarrow \mathbb{R}^n $$

such that

$$ h(t) = f(g(t)) \rightarrow \frac{\partial h}{\partial t_j} = \sum_i \frac{\partial f}{\partial g_i}\frac{\partial g_i}{\partial t_j} $$

For us we have

$$ L(y) = \sum_{i=1}^{n} (g_i - y_i)^2 $$

$$ y(x) = (z * w)(x) = \sum_{b=1}^{m} w(b)z(x-b+P+1) $$

with

- $L(y) \in \mathbb{R}$ the scalar loss
- $g \in \mathbb{R}^n$ the ground truth signal
- $y \in \mathbb{R}^n$ the output of the convolution, our processed signal
- $z \in \mathbb{R}^n$ the input signal with noise
- $w \in \mathbb{R}^m$ the kernel we are optimizing

Taking componentwise derivatives (where $i$, $j$ are arbitrary components) we get

$$ dL(y_i) = -2(g_i - y_i)\, dy_i \in \mathbb{R} $$

$$ dy_i(w_j) = z(i-j+P+1)\, dw_j \in \mathbb{R} \quad P = \frac{m-1}{2} $$

Thus, using the chain rule,

$$ \frac{\partial L}{\partial w_j} = \sum_{i=1}^{n} \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial w_j} $$

$$ \frac{\partial L}{\partial w_j} = \sum_{i=1}^{n} -2(g_i - y_i) \times z(i-j+P+1) $$

Note that although this is in *numerator form*, since $\frac{\partial L}{\partial w_j} \in \mathbb{R}\ \forall j \in [1,m]$ we have $\frac{\partial L}{\partial w_j} = (\frac{\partial L}{\partial w_j})^T$, so componentwise, *numerator form* equals *denominator form*. Our gradient vector in numerator form is

$$ \frac{\partial L}{\partial w} = [\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, ..., \frac{\partial L}{\partial w_m}] \in \mathbb{R}^{1 \times m} $$

Thus our gradient descent update rule will use the *denominator form* $(\frac{\partial L}{\partial w})^T \in \mathbb{R}^m$, so that with learning rate $\alpha$

$$ w_{new} = w_{old} - \alpha \left(\frac{\partial L}{\partial w}\right)^T $$
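The componentwise gradient and the update rule above can be put together into a small gradient-descent loop. This is a minimal sketch, not the post's actual `1D_convolution.py`; the toy signal, noise level, step count, and learning rate $\alpha$ are illustrative choices:

```python
import numpy as np

def grad_w(z, w, g):
    """Gradient dL/dw_j = sum_{i=1}^n -2 (g_i - y_i) z(i - j + P + 1),
    written 0-indexed with zero padding outside the signal."""
    n, m = len(z), len(w)
    P = (m - 1) // 2                      # padding, m assumed odd
    y = np.convolve(z, w, mode='same')    # forward pass: y = z * w
    r = g - y                             # residual (g_i - y_i)
    grad = np.zeros(m)
    for j in range(m):
        for i in range(n):
            idx = i - j + P               # 0-indexed form of i - j + P + 1
            if 0 <= idx < n:
                grad[j] += -2.0 * r[i] * z[idx]
    return grad

# Toy denoising run: signal, noise level, steps, and alpha are all
# illustrative, not values taken from the post.
rng = np.random.default_rng(1)
t = np.linspace(0, 4 * np.pi, 200)
g = np.sin(t)                                  # ground-truth signal
z = g + 0.3 * rng.standard_normal(t.size)      # noisy input signal
w = np.zeros(11)
w[5] = 1.0                                     # start from the identity kernel
alpha = 5e-5
for _ in range(800):
    w -= alpha * grad_w(z, w, g)               # w_new = w_old - alpha * (dL/dw)^T
```

With settings in this range the loss drops well below its starting value, and the learned kernel tends toward a low-pass, roughly bell-shaped profile, mirroring the Gaussian-like kernel seen in the video.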