# Optimizing 1D Convolutions

by dokuDoku

To see the definition of a 1D Convolution and how a 1D Convolution can be represented as a Toeplitz Matrix multiplication with an input vector, please read [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions).

### Goal

Our goal today is to derive a way to train our kernel using gradient descent to filter noise from an input signal. More specifically, given an input signal $z \in \mathbb{R}^n$ with noise, train a kernel $w \in \mathbb{R}^m$ (with $m < n$) that convolves with the input $z$ to produce an output signal $y \in \mathbb{R}^n$ matching a ground truth signal $g \in \mathbb{R}^n$ under the squared loss:

$$ L(y) = (g-y)^T(g-y) $$

$$ y = (z*w) $$

To do this we will need to restrict our convolution $(z*w)$ to only output vectors $y$ of the same size as input $z$. Ultimately what we strive to derive is the gradient used to update kernel $w$

$$\frac{\partial L}{\partial w} \in \mathbb{R}^{1 \times m}$$

in the *numerator* convention, meaning that to update $w$ in our gradient descent step we take the transpose of $\frac{\partial L}{\partial w}$ with learning rate $\alpha$:

$$ w_{new} = w_{old} - \alpha \left(\frac{\partial L}{\partial w}\right)^T$$

## What is a 1-Dimensional Discrete Convolution with Equal Input & Output Sizes

A **full** 1D Discrete Convolution is a 1D convolution in which *only one* element of the input and kernel vectors has to overlap, giving an output that is *larger* than the input.
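The output-size claim for a full convolution is easy to check numerically. This is a minimal NumPy sketch with arbitrary illustrative values (not taken from the post):

```python
import numpy as np

# A full 1D convolution keeps every position where the input and kernel
# overlap in at least one element, so the output has length m + n - 1.
n, m = 8, 3
z = np.arange(1.0, n + 1)            # input of length n (illustrative values)
w = np.array([0.25, 0.5, 0.25])      # kernel of length m (illustrative values)

y_full = np.convolve(z, w, mode="full")
assert y_full.size == n + m - 1      # = 10
```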
The definition of a *full* 1D Convolution with:

- $z \in \mathbb{R}^n$ as the input
- $w \in \mathbb{R}^m$ as the kernel
- $y \in \mathbb{R}^{m+n-1}$ as the output
- $T(w) \in \mathbb{R}^{(m+n-1) \times n}$ as the Toeplitz matrix of kernel $w$

is

$$ y(x) = (z * w)(x) = \sum_{a=-\infty}^{\infty} z(a)w(x-a) $$

Equivalently, it was shown in [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions) that

$$ y = T(w)z$$

**NOTE**: All vectors and matrices are **1-indexed**, meaning the first element is always indicated by **1**, not 0.

For this problem statement we want to restrict the 1D Convolution such that the input and output sizes are *the same*. This gives us several advantages. Firstly, it forces our Toeplitz matrix $T'(w)$ to be square with $T'(w) \in \mathbb{R}^{n \times n}$, making the later math easier. Additionally, for our chosen problem it makes it much easier to evaluate our results since the input, output, and ground truth vectors are all the same size: $g,y,z \in \mathbb{R}^n$.

To do this, we need to calculate the **Padding** we are going to use with the convolution. The padding is the number of zeros we prepend and append to the input vector $z$, affecting the output size of $y = (z*w)$. Previously, padding was not a concern since our convolution definition $y(x) = (z * w)(x) = \sum_{a=-\infty}^{\infty} z(a)w(x-a)$ sums from $-\infty$ to $\infty$.

From the **Appendix** we have the Padding Equation giving the convolution output size $N_{out}$ in terms of the input size $N_{in}$, padding $P$, and kernel size $K$:

$$ N_{out} = N_{in} + 2P - (K-1) $$

Solving for $P$ such that $N_{out} = N_{in} = n$ with $K = m$, we get

$$ n = n + 2P - (m-1) \rightarrow P = \frac{m-1}{2} $$

Note that this is typically why kernels are chosen to have an *odd* size rather than an *even* size: it makes the padding $P$ an integer.
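The padding formula $P = \frac{m-1}{2}$ can be sanity-checked numerically. A minimal sketch, assuming illustrative sizes $n=10$, $m=5$ and random signals (none of these values come from the post):

```python
import numpy as np

# Check P = (m-1)/2: pad the input with P zeros on each side, then take the
# fully-overlapping ("valid") convolution; the output length equals n.
n, m = 10, 5
P = (m - 1) // 2                      # = 2; an integer because m is odd
rng = np.random.default_rng(0)
z = rng.standard_normal(n)            # input signal (illustrative)
w = rng.standard_normal(m)            # kernel (illustrative)

z_padded = np.pad(z, P)               # P zeros prepended and appended
y = np.convolve(z_padded, w, mode="valid")
assert y.shape == (n,)                # N_out = (n + 2P) - (m - 1) = n
# NumPy's mode="same" performs exactly this centered construction for odd kernels:
assert np.allclose(y, np.convolve(z, w, mode="same"))
```

For an even kernel size the padding cannot be split symmetrically into integers, which is the integer-padding issue noted above.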
Knowing this, let's rewrite the Discrete 1D Convolution summation definition, restricting our output $y$ to have size $y \in \mathbb{R}^n$:

$$ y(x) = (z * w)(x) = \sum_{a=-P}^{P} w(a+P+1)z(x-a) $$

Of note, since $P = \frac{m-1}{2}$, the range $[-P,P]$ has length $m$ (including $0$ in the middle), which is exactly the length of our kernel $w \in \mathbb{R}^m$, so every value of the kernel appears in this summation. *However*, for $x \in [1,n]$ the index $x-a$ can fall outside $[1,n]$, in which case we set $z(x-a) = 0$.

To see exactly how many $0$'s will be padded to $z$, simply plug the bounds $a = \pm P$ into the endpoints of $z$, which are $1$ and $n$:

- Left Padding: $x=1$, $a=P$ gives $z(1-P)$ with $1-P = 1-\frac{m-1}{2}$, so $P$ zeros are prepended before index $1$
- Right Padding: $x=n$, $a=-P$ gives $z(n+P)$ with $n+P = n+\frac{m-1}{2}$, so $P$ zeros are appended after index $n$

Left Padding + Right Padding gives $2P = m-1$ extra $0$'s padded to the input (the kernel size minus one). For example, for a kernel $w$ of size 3 there would be two $0$'s padded to input $z$: one at the beginning and one at the end of the input sequence.

Given this definition, we define the Toeplitz Matrix $T'(w) \in \mathbb{R}^{n \times n}$ as the Toeplitz matrix equivalent to our convolution $y(x) = (z * w)(x) = \sum_{a=-P}^{P} w(a+P+1)z(x-a)$ with $x \in [1,n]$, so that the input and output sizes of the 1D Convolution match. Note that this is mathematically the same as the Toeplitz Matrix we defined in [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions) with *fewer* rows, specifically only the middle $n$ rows, **not** all $m+n-1$ rows.

## The Square Toeplitz Matrix Representation in the Square Toeplitz Basis

To make it easier to derive the gradient of $T'(w)$ with respect to $w$, we are going to represent the vectorized square Toeplitz matrix $vec(T'(w))$ as a matrix transformation applied to $w$.
This is because when we use matrix calculus to find $dT'(w)$, it is much easier if we can represent this function as a matrix transformation. Hence, we invoke the famous linear algebra theorem:

*<center>Every linear transformation between finite dimensional vector spaces can be represented by a matrix, with respect to a chosen basis.</center>*

### What is the Basis of Square Toeplitz Matrices?

The definition of a Toeplitz Matrix is that each diagonal of the matrix has the same value, i.e. for a given row $i$ and column $j$, all entries with $i-j = \alpha$ (for arbitrary $\alpha$) share the same value $\beta$. This is commonly stated as: for a Toeplitz matrix $T'$,

$$ T'_{ij} = \beta \quad \text{whenever } i-j=\alpha$$

$$ T' = \begin{bmatrix} t_0 & t_{-1} & t_{-2} & t_{-3} \\ t_1 & t_0 & t_{-1} & t_{-2} \\ t_2 & t_1 & t_0 & t_{-1} \\ t_3 & t_2 & t_1 & t_0 \end{bmatrix} $$

Since each diagonal is indexed by the quantity $i-j$, a Toeplitz matrix can equivalently be written in the form

$$ T'_{ij} = t_{i-j}$$

Suppose that we choose the most obvious basis of matrices such that

$$T'(k) = \sum_r k_r A_r$$

$$ (A_r)_{ij} = \begin{cases} 1 & \text{if } i-j = r \\ 0 & \text{otherwise} \end{cases} $$

Thus, when we represent a Toeplitz matrix in the basis $\{A_r\}$, the coefficient $k_r$ is exactly the value placed on the diagonal with offset $r$, so that

$$ T'(k)_{ij} = k_{i-j} $$

Some examples of $A_r$ are

$$ A_0 = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ \end{bmatrix} = I_n \quad A_1 = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{bmatrix} \quad A_{-1} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ 0 & 0 & 0 & \ddots & \vdots \\ \vdots & \vdots & \vdots & \ddots & 1 \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix} $$

(Note the orientation: $A_1$ places ones where $i-j=1$, i.e. *below* the main diagonal, and $A_{-1}$ where $i-j=-1$, i.e. *above* it.)

Note that since $T'(k) \in \mathbb{R}^{n \times n}$, the entries in the first column and the first row define the entire matrix $T'(k)$, since each diagonal holds a single value. There are $n$ column entries and $n$ row entries, with one shared at $T'(k)_{1,1}$, so there are $2n-1$ such values, and by extension $r \in [-(n-1),(n-1)]$ and $k \in \mathbb{R}^{2n-1}$. To make it obvious, this matrix shows how all of the $k_r$ with $r \in [-(n-1),(n-1)]$ are represented in our basis:

$$ T'(k)= \sum_{r=-(n-1)}^{n-1} k_r A_r = \begin{bmatrix} k_0 & k_{-1} & k_{-2} & \cdots & k_{-(n-1)} \\ k_1 & k_0 & k_{-1} & \cdots & k_{-(n-2)} \\ k_2 & k_1 & k_0 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & k_{-1} \\ k_{n-1} & k_{n-2} & \cdots & k_1 & k_0 \end{bmatrix}, \qquad k \in \mathbb{R}^{2n-1},\quad r\in\{-(n-1),\dots,n-1\}. $$

To represent $T'(k)$ in its current form we could stack the $2n-1$ matrices $A_r$ into a tensor of size $\mathbb{R}^{(2n-1)\times n \times n}$. However, when we later take the differential of $T'(k)$, that would be messy. For this reason, let's introduce a trick to represent this tensor as a matrix by *vectorizing* (column major) each $A_r$ for $r \in [-(n-1),(n-1)]$, placing the $vec(A_r)$ as columns of a matrix $B$, and then multiplying by $k$ to get $vec(T'(k))$, i.e.
$$ \operatorname{vec}(T'(k)) = \overbrace{ \left( \begin{array}{cccccc} | & | & & | & & | \\ \operatorname{vec}(A_{-(n-1)}) & \operatorname{vec}(A_{-(n-2)}) & \cdots & \operatorname{vec}(A_{0}) & \cdots & \operatorname{vec}(A_{n-1}) \\ | & | & & | & & | \end{array} \right) }^{2n-1} \left.\vphantom{ \begin{array}{c} \\ \\ \\ \end{array}} \right\} n^2 \begin{bmatrix} k_{-(n-1)} \\ k_{-(n-2)} \\ \vdots \\ k_{0} \\ \vdots \\ k_{n-1} \end{bmatrix} = Bk $$

We can see this is equivalent, since $vec(T'(k)) = Bk$ takes the weighted sum of the basis vectors in $B$, i.e. $vec(T'(k)) = \sum_r k_r \, vec(A_r)$. Of note, $B \in \mathbb{R}^{n^2 \times (2n-1)}$ and $vec(T'(k)) \in \mathbb{R}^{n^2}$. Thus we have shown that we can represent the vectorized form of the square Toeplitz matrix $T'(k) \in \mathbb{R}^{n \times n}$ as a matrix multiplication of the vectorized basis $B \in \mathbb{R}^{n^2 \times (2n-1)}$ with the kernel $k \in \mathbb{R}^{2n-1}$, giving $vec(T'(k)) \in \mathbb{R}^{n^2}$:

$$ vec(T'(k)) = Bk$$

### This is great, however my original kernel is $w \in \mathbb{R}^m$, not this newly defined kernel $k \in \mathbb{R}^{2n-1}$!

Great question! To represent $w \in \mathbb{R}^m$ in the same vector space as $k \in \mathbb{R}^{2n-1}$, we are going to define a *Linear Transformation Matrix* $S \in \mathbb{R}^{(2n-1) \times m}$ with $k = Sw$, so that we obtain the following result:

$$ vec(T'(w)) = BSw$$

Since $k = Sw$ and $vec(T'(k)) = Bk$, it follows that $vec(T'(w)) = BSw$.
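To make the basis concrete, here is a small numerical sketch that builds $B$ column by column and checks $vec(T'(k)) = Bk$. The size $n$, the random $k$, and the helper `A(r, n)` are illustrative choices, not anything defined in the post:

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
k = rng.standard_normal(2 * n - 1)       # coefficients k_r, r = -(n-1)..(n-1)

def A(r, n):
    """Basis matrix with ones on the diagonal i - j = r (0-indexed here)."""
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i - j == r:
                M[i, j] = 1.0
    return M

offsets = range(-(n - 1), n)             # r = -(n-1), ..., n-1
# Columns of B are the column-major vectorizations vec(A_r), in offset order
B = np.column_stack([A(r, n).flatten(order="F") for r in offsets])
assert B.shape == (n * n, 2 * n - 1)

# T'(k) assembled directly from its diagonals: T'(k)_{ij} = k_{i-j}
T = np.array([[k[(i - j) + (n - 1)] for j in range(n)] for i in range(n)])
assert np.allclose(B @ k, T.flatten(order="F"))    # vec(T'(k)) = B k
```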
**What does the original $T'(w)$ such that $y = T'(w)z$ look like?**

For an odd-size kernel $w$, consider its Toeplitz matrix $T'(w)$ as defined in the section *What is a 1-Dimensional Discrete Convolution with Equal Input & Output Sizes* for $y = T'(w)z$. The coefficient multiplying $z_j$ in output $y_i$ is obtained by setting

$$j = i - a \quad \rightarrow \quad a = i - j $$

So the matrix entry is

$$ T'(w)_{ij} = \begin{cases} w_{P+i-j+1}, & |i-j|\le P \\ 0, & |i-j|>P \end{cases} $$

which is the general formula; hence the general square Toeplitz matrix is

$$ T'(w)= \begin{bmatrix} w_{P+1} & w_{P} & w_{P-1} & \cdots & 0 & 0 \\ w_{P+2} & w_{P+1} & w_{P} & \ddots & \vdots & \vdots \\ w_{P+3} & w_{P+2} & w_{P+1} & \ddots & 0 & 0 \\ \vdots & \ddots & \ddots & \ddots & w_{P} & w_{P-1} \\ 0 & \cdots & w_{P+3} & w_{P+2} & w_{P+1} & w_{P} \\ 0 & \cdots & 0 & w_{P+3} & w_{P+2} & w_{P+1} \end{bmatrix} $$

where only the diagonals $i - j \in [-P,P]$ are nonzero.

Concretely, for an example $w = [w_1,w_2,w_3]^T \in \mathbb{R}^3$ with $m=3$, we have padding $P = \frac{m-1}{2} = \frac{3-1}{2} = 1$, giving

$$ T'(w)_{ij} = \begin{cases} w_{2+i-j}, & |i-j|\le 1 \\ 0, & |i-j|>1 \end{cases} $$

and

$$ T'(w)= \begin{bmatrix} w_2 & w_1 & 0 & \cdots & 0 \\ w_3 & w_2 & w_1 & \ddots & \vdots \\ 0 & w_3 & w_2 & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & w_1 \\ 0 & \cdots & 0 & w_3 & w_2 \end{bmatrix} \in \mathbb{R}^{n \times n}. $$

**How do we map $w \rightarrow k$? What is the matrix $S \in \mathbb{R}^{(2n-1)\times m}$?**

The vector $k \in \mathbb{R}^{2n-1}$ represents all diagonals of an $n \times n$ Toeplitz matrix, while $w \in \mathbb{R}^m$ only occupies the central $m$ diagonals of the convolution's Toeplitz matrix. Therefore, in $k = Sw$, $S$ embeds $w$ in the larger diagonal vector $k$ and fills the rest with zeros.
From the convolution definition we derived

$$ T'(w)_{ij} = \begin{cases} w_{P+i-j+1}, & |i-j|\le P \\ 0, & |i-j|>P \end{cases} $$

where

$$ P = \frac{m-1}{2}$$

The Toeplitz matrix only has non-zero diagonals when

$$ r = i-j \in [-P,P]$$

Hence, the diagonal vector $k$ must satisfy

$$ k_{r} = \begin{cases} w_{r+P+1}, & r \in [-P,P] \\ 0, & \text{else} \end{cases} $$

so that the kernel fills the central diagonals of the Toeplitz Matrix. To write this as a linear transformation, define $S \in \mathbb{R}^{(2n-1)\times m}$ such that $k = Sw$. Each row of $S$ selects the appropriate element of $w$ if the diagonal corresponds to the kernel, and is zero otherwise. The entries $k_r$ with $r \in [-P,P]$ equal the entries of $w$: $k_{-P} = w_1$, $k_{-P+1} = w_2$, ..., $k_{P} = w_m$. This gives us

$$ S= \begin{bmatrix} 0_{(n-P-1)\times m} \\ I_m \\ 0_{(n-P-1)\times m} \end{bmatrix} \in \mathbb{R}^{(2n-1)\times m}. $$

therefore

$$ k = Sw = \begin{bmatrix} 0_{(n-P-1)\times m} \\ I_m \\ 0_{(n-P-1)\times m} \end{bmatrix} \begin{bmatrix} w_1\\ w_2\\ \vdots\\ w_m \end{bmatrix} = \begin{bmatrix} 0\\ \vdots\\ 0\\ w_1\\ w_2\\ \vdots\\ w_m\\ 0\\ \vdots\\ 0 \end{bmatrix}. $$

## Finding $\frac{\partial L}{\partial w}$

We want to find $\frac{\partial L}{\partial w}$ from the equations

$$ L(y) = (g-y)^T(g-y) $$
$$ y = T'(w)z $$
$$ vec(T'(w)) = BSw $$

with variables

- $z \in \mathbb{R}^n$ as the input
- $w \in \mathbb{R}^m$ as the kernel
- $y \in \mathbb{R}^{n}$ as the output
- $g \in \mathbb{R}^n$ as the ground truth
- $T'(w) \in \mathbb{R}^{n \times n}$ as the Toeplitz matrix of kernel $w$
- $B \in \mathbb{R}^{n^2 \times (2n-1)}$ as the vectorized basis of square Toeplitz matrices
- $S \in \mathbb{R}^{(2n-1) \times m}$ as the transformation matrix from the vector space of $w$ to that of $k$, s.t. $vec(T'(k)) = Bk$
- $L \in \mathbb{R}$ as the scalar loss

Ok, this is the simple part given our setup. Let's do matrix calculus!
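Before the calculus, a quick numerical sanity check of the setup may help. This sketch (illustrative sizes and random signals, not values from the post) builds $B$ and $S$, forms $T'(w)$ from $vec(T'(w)) = BSw$, and confirms that $T'(w)z$ matches the same-size convolution:

```python
import numpy as np

n, m = 6, 3
P = (m - 1) // 2
rng = np.random.default_rng(0)
z = rng.standard_normal(n)                     # input signal (illustrative)
w = rng.standard_normal(m)                     # kernel (illustrative)

# Columns of B are vec(A_r); np.eye(n, k=-r) has ones exactly where i - j = r
offsets = range(-(n - 1), n)
B = np.column_stack([np.eye(n, k=-r).flatten(order="F") for r in offsets])

# S embeds w into the central diagonals: k_r = w_{r+P+1} for r in [-P, P]
S = np.zeros((2 * n - 1, m))
S[n - P - 1 : n + P, :] = np.eye(m)
k = S @ w                                      # zeros except the middle m entries

# Un-vectorize BSw (column major) back into the square Toeplitz matrix T'(w)
T = (B @ k).reshape(n, n, order="F")
assert np.allclose(T @ z, np.convolve(z, w, mode="same"))   # y = T'(w) z
```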
The approach here is to take the differential of each equation and then combine the results using the *chain rule*.

### Differential of $L$, $dL$

To start, let's take the differential of $L$:

$$ L = (g-y)^T(g-y) \rightarrow dL = -(g-y)^Tdy - dy^T(g-y) $$

Since $-(g-y)^Tdy$ and $-dy^T(g-y)$ are scalars in $\mathbb{R}$, each equals its own transpose, giving us:

$$ dL = -2(g-y)^Tdy $$

### Differential of $y$, $dy$

$$y = T'(w)z $$
$$ dy = d(T'(w))z + T'(w)dz$$

Note that since $z$ is the input signal, we do not update $z$ and hence $dz = 0$, thus

$$ dy = d(T'(w))z $$

### Differential of $T'(w)$

Taking the differential of $T'(w)$ with respect to $w$ directly would give us the unwieldy tensor

$$ \frac{\partial T'(w)}{\partial w} \in \mathbb{R}^{n \times n \times m}$$

Instead, in the previous section **The Square Toeplitz Matrix Representation in the Square Toeplitz Basis** we showed

$$ vec(T'(w)) = BSw $$

Hence, we will find $d\,vec(T'(w))$!

$$ d\,vec(T'(w)) = d(BSw) = (dB)Sw + B(dS)w + BS(dw) $$

Because the matrices $B$ and $S$ are constant, $dB = 0$ and $dS = 0$, hence

$$ d\,vec(T'(w)) = BS\,dw $$

### Chain Rule

Fantastic, now all we need to do is apply the chain rule.

$$ dL = -2(g-y)^Tdy $$

Since we now work with $vec(T'(w))$, we take the differential of $vec(y)$, giving us

$$ d(vec(y)) = vec(d(T'(w))z)$$

Let's use the well known Kronecker Product identity $(A \otimes I)\,vec(C) = vec(CA^T)$, where $I$ has the same number of rows as $C$, giving us

$$ d(vec(y)) = (z^T \otimes I_n)vec(d(T'(w))) $$

Since $y \in \mathbb{R}^n$ is already a vector, $vec(y) = y$, so:

$$ d(vec(y)) = dy = (z^T \otimes I_n)vec(d(T'(w))) $$

Fantastic, let's substitute $vec(d(T'(w))) = BS\,dw$ in:

$$ dy = (z^T \otimes I_n)BS\,dw$$

Now let's substitute $dy$ into the equation for $dL$:

$$ dL = -2(g-y)^T(z^T \otimes I_n)BS\,dw$$

Since for a general function $f(x)$ we have

$$df = \frac{\partial f}{\partial x} dx, \quad \langle \frac{\partial f}{\partial x}, dx \rangle = df$$

i.e.
$\frac{\partial f}{\partial x}$ is the quantity that, when the inner product is taken with $dx$, equals $df$. From $dL = -2(g-y)^T(z^T \otimes I_n)BS\,dw$ we can see that

$$ dL = \langle -2(g-y)^T(z^T \otimes I_n)BS, dw \rangle \rightarrow \frac{\partial L}{\partial w} = -2(g-y)^T(z^T \otimes I_n)BS$$

Since we are in the numerator convention,

$$\frac{\partial L}{\partial w} = -2(g-y)^T(z^T \otimes I_n)BS \in \mathbb{R}^{1 \times m}$$

Let's verify the dimensions. For the equation above we have:

$$-2(g-y)^T \in \mathbb{R}^{1 \times n} \quad (z^T \otimes I_n) \in \mathbb{R}^{n \times n^2} \quad BS \in \mathbb{R}^{n^2 \times m}$$

Combining terms:

$$-2(g-y)^T (z^T \otimes I_n) \in \mathbb{R}^{1 \times n^2}$$

thus

$$\frac{\partial L}{\partial w} = -2(g-y)^T (z^T \otimes I_n)BS \in \mathbb{R}^{1 \times m}$$

## Gradient Descent Update Term

To optimize the kernel $w \in \mathbb{R}^m$ so that the squared loss $L$ is minimized and the output vector $y \in \mathbb{R}^n$ matches $g \in \mathbb{R}^n$, we use the gradient descent update rule with learning rate $\alpha$:

$$ w_{new} = w_{old} - \alpha \left(\frac{\partial L}{\partial w}\right)^T$$

such that

$$ w_{new} = w_{old} - \alpha \left(-2(g-y)^T (z^T \otimes I_n)BS\right)^T$$

---

## Appendix

### Padding Equation

$N_{out} = N_{in} + 2P - (K-1)$ with $N_{out}$ the output size, $N_{in}$ the input size, $P$ the padding size, and $K$ the kernel size. For this derivation, let's assume that $K$ is an odd number, making any desired padding $P \in \mathbb{Z}$ (as mentioned in the blog post above). We are here to show that the stated *Padding Equation* is indeed correct.

Note that if we have $P = 0$ the equation becomes

$$N_{out} = N_{in} - (K-1)$$

Since this gives $N_{out} \leq N_{in}$, it is **not** the same as the definition of a *full 1D Convolution* as stated in [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions), where $N_{out} = N_{in} + K - 1$ because only one element of overlap is needed for a valid convolution.
Instead, this equation is set up so that when $P=0$ the kernel overlaps **completely** with the input vector, with no spare space. Hence, the output size of the convolution is the number of positions at which the kernel fits entirely inside the input. This is easier to see if we rephrase the equation as

$$ N_{out} = N_{in} - (K-1) \rightarrow N_{out} = N_{in} - K + 1$$

Seeing that it is of this form, remember our basic counting principles: the number of integers in a range $[i,n]$ (for arbitrary $i$ & $n$) is $n-i+1$. Counting from 1, a kernel of size $K$ starting at index $p$ covers indices $[p, p+K-1]$, which lie entirely inside the input exactly when $p+K-1 \le N_{in}$, i.e. $p \le N_{in}-K+1$. So the valid starting positions form the range $[1, N_{in}-K+1]$, which contains exactly $N_{in}-K+1$ integers: the number of times a kernel of size $K$ fits completely inside an input of size $N_{in}$. Thus we've shown that $N_{out} = N_{in} - (K-1)$ is correct.

Now to show that

$$N_{out} = N_{in} + 2P - (K-1)$$

is correct, remember that the padding is the number of $0$'s added to each side. Thus if one zero is added to each side of the input sequence, $N_{in}$ effectively becomes $N_{in} + 2(1)$, and for arbitrary $P$ it becomes $N_{in} \rightarrow N_{in} + 2P$. Thus we can see that

$$N_{out} = N_{in} + 2P - (K-1)$$

is valid when our **kernel must fully overlap with the input sequence with padding**.
More specifically, given an input signal $z \in \mathbb{R}^n$ with noise, train a kernel $w \in \mathbb{R}^m$ (with $m < n$) that convolves with the input $z$ to produce output signal $y \in \mathbb{R}^n$ to match ground truth signal $g \in \mathbb{R}^n$ with squared loss such that: $$ L(y) = (g-y)^T(g-y) $$ $$ y = (z*w) $$ To do this we will need to restrict our convolution $(z*w)$ to only output vectors $y$ of the same size as input $z$. Ultimately what we strive to derive is the gradient to update kernel $w$ $$\frac{\partial L}{\partial w} \in \mathbb{R}^{1 \times m}$$ in the *numerator* convention, meaning to update $w$ in our gradient descent update the transpose of $\frac{\partial L}{\partial w}$ will have to be taken with learning rate $\alpha$ such that $$ w_{new} = w_{old} - \alpha (\frac{\partial L}{\partial w})^T$$ ## What is a 1-Dimensional Discrete Convolution with Equal Input & Output Sizes A **full** 1D Discrete Convolution is defined as a 1D convolution where *only one* element from the input and kernel vectors have to overlap, giving an output size that is *larger* than the input. The definition of a *full* 1D Convolution with: - $z \in \mathbb{R}^n$ as the input - $w \in \mathbb{R}^m$ as the kernel - $y \in \mathbb{R}^{m+n-1}$ as the output - $T(w) \in \mathbb{R}^{(m+n-1) \times n}$ as the Toeplitz matrix of kernel $w$ $$ y(x) = (z * w)(x) = \sum_{a=-\infty}^{\infty} z(a)w(x-a) $$ Equivalently, it was shown in [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions) that $$ y = T(w)z$$ **NOTE** All vectors and matrices are **1 Indexed** meaning the first element is always indicated by **1** not 0. For this problem statement we want to restrict the 1D Convolution such that both the input and output sizes are *the same*. This gives us several advantages. Firstly, it forces our Toeplitz matrix $T'(w)$ to be square with $T'(w) \in \mathbb{R}^{n \times n}$, making later math easier. 
Additionally, for our chosen problem it makes it much easier to evaluate our results since our input, output, and ground truth vectors are all the same size $g,y,z \in \mathbb{R}^n$. To do this, we need to calculate the **Padding** we are going to use with the convolution. The padding is defined as the the number of zeros we prepend and append to the input vector $z$ affecting the output size of $y = (z*w)$. Previously, padding was not a concern since in our convolution definition $ y(x) = (z * w)(x) = \sum_{a=-\infty}^{\infty} z(a)w(x-a) $ we sum from $-\infty$ to $\infty$. From the **Appendix** we have the Padding Equation calculating the size of the convolution output $N_{out}$ given the convolution input $N_{in}$, and kernel size $K$. $$ N_{out} = N_{in} + 2P - (K-1) $$ Find $P$ such that $N_{out} = N_{in} = n$, $K = m$, we get $$ n = n + 2P - (m-1) \rightarrow \frac{m-1}{2} = P $$ Note that this is typically why kernels are chosen to have an *odd* size and not an *even* size so that padding is an integer number in integer space. Knowing this, let's rewrite the Discrete 1D Convolution summation definition restricting our output $y$ to have size $y \in \mathbb{R}^n$ $$ y(x) = (z * w)(x) = \sum_{a=-P}^{P} w(a+P+1)z(x-a) $$ Of note, since $P = \frac{m-1}{2}$, the range $[-P,P]$ has length $m$ (include 0 in middle), which is exactly the length of our kernel $w \in \mathbb{R}^m$, so in this summation all values of the kernel are defined. *However* for $z(x-a)$ with $x-a \notin [1,n]$, $z(x-a) = 0$ with $x \in [1,n]$. 
To see exactly how many $0$'s will be padded to $z$ simply plug in the bounds $-P,P$ at the endpoints of $z$ which are $1$ and $n$ - Left Padding: $x=1$, $a=P$ $z(x-P) \rightarrow 1-P = 1-\frac{m-1}{2}$ so $P$ prepended before index $1$ - Right Padding: $x=n$ $a=-P$ $z(n-(-P)) \rightarrow n+P = n+\frac{m-1}{2}$ so $P$ appended $0$'s after index $n$ Left Padding + Right Padding gives there to be $2P = m-1$ extra $0$'s padded to input (kernel size $w \in \mathbb{R}^m$ minus one). For example, if for a kernel $w$ with size 3, this would mean that there would be two 0's padded to input $z$, one to the beginning and one to the end of the input sequence $z$. Given this definition, we are going to define the Toeplitz Matrix $T'(w) \in \mathbb{R}^{n \times n}$ such that it's the Toeplitz matrix that is equivalent to our convolution $ y(x) = (z * w)(x) = \sum_{a=-P}^{P} w(a+P+1)z(x-a) $ with $x \in [1,n]$ so that the input and output sizes of the 1D Convolution match. Note that this is mathematically the same as the Toeplitz Matrix we defined in [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions) with *less* rows, specifically only the middle $n$ rows **not** all $m+n-1$ rows. ## The Square Toeplitz Matrix Representation in the Square Toeplitz Basis To make it easier to derive the gradient of $T'(w)$ with respect to $w$, we are going to represent the vectorized square Toeplitz matrix $vec(T'(w))$ as a matrix transformation applied to $w$. This is because when we use matrix calculus to solve $dT'(w)$, it would be much easier if we could represent this function as a matrix transformation. Hence, we invoke the famous linear algebra theorem *<center>Every linear transformation between finite dimensional vector spaces can be respresented by a matrix, with respect to a chosen basis.</center>* ### What is the Basis of Square Toeplitz Matrices? The definition of a Toeplitz Matrix is such that each diagonal of the matrix has the same value, i.e. 
for a given row $i$ and column $j$ that all entries such that $i-j = \alpha$, where $\alpha$ is arbitrary will have the same value $\beta$. Commonly stated that for a Toeplitz matrix $T'$: $$ T'_{ij} = \beta \quad \text{whenever } i-j=\alpha$$ $$ T' = \begin{bmatrix} t_0 & t_{-1} & t_{-2} & t_{-3} \\ t_1 & t_0 & t_{-1} & t_{-2} \\ t_2 & t_1 & t_0 & t_{-1} \\ t_3 & t_2 & t_1 & t_0 \end{bmatrix} $$ Since each diagonal is indexed by the quantity \(i-j\), a Toeplitz matrix can equivalently be written in the form $$ T'_{ij} = t_{i-j}$$ Thus, when we represent a Toeplitz matrix in the basis \(\{A_r\}\), the coefficient \(k_r\) is exactly the value placed on the diagonal with offset \(r\), so that $$ T'(k)_{ij} = k_{i-j} $$ Suppose that we choose the most obvious basis of matrices such that $$T'(k) = \sum_r k_r A_r$$ $$ (A_r)_{ij} = \begin{cases} 1 & \text{if } i-j = r \\ 0 & \text{otherwise} \end{cases} $$ Some examples of $A_r$ are $$ A_0 = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ \end{bmatrix} = I_n\ A_1 = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ 0 & 0 & 0 & \ddots & \vdots \\ \vdots & \vdots & \vdots & \ddots & 1 \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix} A_{-1} = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{bmatrix} $$ Note that since $T'(k) \in \mathbb{R}^{n \times n}$ we can say that the entries on the first column and the first row define the entire $T'(k)$ matrix since the diagonals must be the same value. Thus since there are $n$ column entries and $n$ row entries, with $1$ shared at $T'(k)_{1,1}$ we can say that there are $2n-1$ such values and that by extension $r \in [-(n-1),(n-1)]$ and $k \in \mathbb{R}^{2n-1}$. 
To make it obvious, see this matrix showing how all of our $k_r \in [-(n-1),(n-1)]$ are represented in our basis $$ T'(k)= \sum_{r=-(n-1)}^{n-1} k_r A_r = \begin{bmatrix} k_0 & k_{-1} & k_{-2} & \cdots & k_{-(n-1)} \\ k_1 & k_0 & k_{-1} & \cdots & k_{-(n-2)} \\ k_2 & k_1 & k_0 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & k_{-1} \\ k_{n-1} & k_{n-2} & \cdots & k_1 & k_0 \end{bmatrix}, \qquad k \in \mathbb{R}^{2n-1},\quad r\in\{-(n-1),\dots,n-1\}. $$ To represent $T'(k)$ in its current form we could transform $T'(k) = \sum_r k_r A_r$ into a tensor of $2n-1$ $A_r$'s stacked on top of each other giving us a tensor of size $\mathbb{R}^{(2n-1)\times n \times n}$. That being said when we later need to take the differential of $T'(k)$ that will be messy. For this reason, let's introduce a trick to represent this tensor as a matrix by *vectorizing* (column major) $A_r \forall r \in [-(n-1),(n-1)]$, placing the $vec(A_r)$ into a matrix $B$ and then multiplying by $k$ to get $vec(T'(k))$ i.e. $$ \operatorname{vec}(T'(k)) = \overbrace{ \left( \begin{array}{cccccc} | & | & & | & & | \\ \operatorname{vec}(A_{-(n-1)}) & \operatorname{vec}(A_{-(n-2)}) & \cdots & \operatorname{vec}(A_{0}) & \cdots & \operatorname{vec}(A_{n-1}) \\ | & | & & | & & | \end{array} \right) }^{2n-1} \left.\vphantom{ \begin{array}{c} \\ \\ \\ \end{array}} \right\} n^2 \begin{bmatrix} k_{-(n-1)} \\ k_{-(n-2)} \\ \vdots \\ k_{0} \\ \vdots \\ k_{n-1} \end{bmatrix} = Bk $$ We can see this is equivalent, since $vec(T'(k)) = Bk$ is taking the weighted sum of all the basis vectors in $B$ or $vec(T'(k)) = \sum_r vec(A_r) \times k_r$. Of note, $B \in \mathbb{R}^{n^2 \times (2n-1)}$ and $vec(T'(k)) \in \mathbb{R}^{n^2}$. 
Thus we have shown that we can represent the vectorized form of the square Toeplitz $T'(k) \in \mathbb{R}^{n \times n}$ as a matrix multiplication with vectorized basis $B \in \mathbb{R}^{n^2 \times (2n-1)}$ and kernel $k \in \mathbb{R}^{2n-1}$ resulting in $vec(T'(k)) \in \mathbb{R}^{n^2}$ $$ vec(T'(k)) = Bk$$ ### This is great, however my original kernel $w \in \mathbb{R}^m$ and this newly defined kernel $k \in \mathbb{R}^{2n-1}$ Great question! To represent $w \in \mathbb{R}^m$ in the same vector space as $k \in \mathbb{R}^{2n-1}$, we are going to define a *Linear Transformation Matrix* $S \in \mathbb{R}^{(2n-1) \times m}$ such that we will obtain the following result $$ vec(T'(w)) = BSw$$ As stated, $k = Sw$ and $vec(T'(k)) = Bk$ therefore $vec(T'(w)) = BSw$. **What does the original $T'(w)$ such that $y = T'(w)z$ look like?** We can see that for an odd size kernel $w$ that its Toeplitz matrix $T'(w)$ (in general, as defined in section *What is a 1-Dimensional Discrete Convolution with Equal Input & Output Sizes* for $y = T'(w)z$ The coefficient multiplying $z_j$ in output $y_i$ is obtained by setting $$j = i - a \quad \rightarrow \quad a = i - j $$ So the matrix entry is $$ T'(w)_{ij} = \begin{cases} w_{P+i-j+1}, & |i-j|\le P \\ 0, & |i-j|>P \end{cases} $$ Which is the general formula, hence the general square Toeplitz matrix is $$ T'(w)= \begin{bmatrix} w_{P+1} & w_{P} & w_{P-1} & \cdots & 0 & 0 \\ w_{P+2} & w_{P+1} & w_{P} & \ddots & \vdots & \vdots \\ w_{P+3} & w_{P+2} & w_{P+1} & \ddots & 0 & 0 \\ \vdots & \ddots & \ddots & \ddots & w_{P} & w_{P-1} \\ 0 & \cdots & w_{P+3} & w_{P+2} & w_{P+1} & w_{P} \\ 0 & \cdots & 0 & w_{P+3} & w_{P+2} & w_{P+1} \end{bmatrix} $$ where only the diagonals $i - j \in [-P,P]$ are nonzero. 
Concretely for an example $w \in \mathbb{R}^3$ with $w = [w_1,w_2,w_3]^T$ & $m=3$ we have padding $P = \frac{m-1}{2} \rightarrow \frac{3-1}{2}=1$ Giving $$ T'(w)_{ij} = \begin{cases} w_{1+i-j}, & |i-j|\le 1 \\ 0, & |i-j|>1 \end{cases} $$ and $$ T'(w)= \begin{bmatrix} w_2 & w_1 & 0 & \cdots & 0 \\ w_3 & w_2 & w_1 & \ddots & \vdots \\ 0 & w_3 & w_2 & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & w_1 \\ 0 & \cdots & 0 & w_3 & w_2 \end{bmatrix} \in \mathbb{R}^{n \times n}. $$ **How do we map $w \rightarrow k$? What is matrix $S \in \mathbb{R}^{(2n-1)\times m}$?** The vector $k \in \mathbb{R}^{2n-1}$ represents all diagonals of an $n \times n$ Toeplitz matrix and $w \in \mathbb{R}^m$ only occupies the central $m$ diagonals due to the convolution kernel. Therefore from $k = Sw$, $S$ embeds $w$ in the larger diagonal vector $k$ and fills the rest with zeros. From the convolution definition derived $$ T'(w)_{ij} = \begin{cases} w_{P+i-j+1}, & |i-j|\le P \\ 0, & |i-j|>P \end{cases} $$ where $$ P = \frac{m-1}{2}$$ The Toeplitz matrix only has non-zero diagonals when $$ r = i-j \in [-P,P]$$ Hence, the diagonal vector $k$ must satisfy $$ k_{r} = \begin{cases} w_{r+P+1}, & r \in [-P,P] \\ 0, & else \end{cases} $$ so that the kernel fills the central diagonals of the Toeplitz Matrix. To write this as a linear transformation define $S \in \mathbb{R}^{(2n-1)\times m}$ such that $k = Sw$. Each row of $S$ selects the appropriate element of $w$ if the diagonal corresponds to the kernel, and zero otherwise. We have that the entries of $k \in [-P,P]$ are equivalent to $w$ such that $k_{-P} = w_1$, $k_{-P+1} = w_2$ ..., $k_{P} = w_m$ This gives us $S \in \mathbb{R}^{(2n-1)\times m}$ $$ S= \begin{bmatrix} 0_{(n-P-1)\times m} \\ I_m \\ 0_{(n-P-1)\times m} \end{bmatrix} \in \mathbb{R}^{(2n-1)\times m}. 
$$ therefore $$ k = Sw = \begin{bmatrix} 0_{(n-P-1)\times m} \\ I_m \\ 0_{(n-P-1)\times m} \end{bmatrix} \begin{bmatrix} w_1\\ w_2\\ \vdots\\ w_m \end{bmatrix} = \begin{bmatrix} 0\\ \vdots\\ 0\\ w_1\\ w_2\\ \vdots\\ w_m\\ 0\\ \vdots\\ 0 \end{bmatrix}. $$ ## Finding $\frac{\partial L}{\partial w}$ We want to find $\frac{\partial L}{\partial w}$ from the equations $$ L(y) = (g-y)^T(g-y) $$ $$ y = T'(w)z $$ $$ vec(T'(w)) = BSw $$ with variables - $z \in \mathbb{R}^n$ as the input - $w \in \mathbb{R}^m$ as the kernel - $y \in \mathbb{R}^{n}$ as the output - $g \in \mathbb{R}^n$ as the ground truth - $T'(w) \in \mathbb{R}^{n \times n}$ as the Toeplitz matrix of kernel $w$ - $B \in \mathbb{R}^{n^2 \times (2n-1)}$ as the Vectorized Basis of Square Toeplitz Matrices - $S \in \mathbb{R}^{(2n-1) \times m}$ as the Transformation Matrix from the vector space of $w$ to that of $k$ s.t. $vec(T'(k)) = Bk$ - $L \in \mathbb{R}$ as the scalar loss Ok, this is the simple part given our setup: let's do matrix calculus! The approach here is to take the differential of each equation and then combine the results together using the *chain rule*. ### Differential of $L$, $dL$ To start, let's take the differential of $L$ $$ L = (g-y)^T(g-y) \rightarrow dL = -(g-y)^Tdy - dy^T(g-y) $$ Since $-(g-y)^Tdy$ and $-dy^T(g-y)$ are scalars, each is equal to its own transpose (and hence to the other), giving us: $$ dL = -2(g-y)^Tdy $$ ### Differential of $y$, $dy$ $$y = T'(w)z $$ $$ dy = d(T'(w))z + T'(w)dz$$ Note that since $z$ is the input signal, we cannot update $z$ and hence $dz = 0$, thus $$ dy = d(T'(w))z $$ ### Differential of $T'(w)$ Note that differentiating $T'(w)$ with respect to $w$ directly would give us the unwieldy tensor result $$ \frac{\partial T'(w)}{\partial w} \in \mathbb{R}^{n \times n \times m}$$ As we did in the previous section, **The Square Toeplitz Matrix Representation in the Square Toeplitz Basis**, we showed $$ vec(T'(w)) = BSw $$ Hence, we will find $d\ vec(T'(w))$! 
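Before doing so, the factorization $vec(T'(w)) = BSw$ itself can be checked numerically. The sketch below is ours, not from the post: the helper names `toeplitz_same`, `toeplitz_basis`, and `embedding` are invented for illustration, and it assumes column-major vectorization (NumPy's `order="F"`), the convention under which the Kronecker identity used in the chain-rule step holds. It builds $B$ and $S$ explicitly and compares $BSw$ against $vec(T'(w))$:

```python
import numpy as np

def toeplitz_same(w, n):
    """n x n Toeplitz matrix T'(w) for an odd-length kernel w."""
    m = len(w)
    P = (m - 1) // 2
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= P:
                T[i, j] = w[P + i - j]
    return T

def toeplitz_basis(n):
    """B in R^{n^2 x (2n-1)}: column r + (n-1) is the column-major vec of
    the n x n matrix with ones on the diagonal i - j = r, zeros elsewhere."""
    B = np.zeros((n * n, 2 * n - 1))
    for j in range(n):
        for i in range(n):
            B[j * n + i, (i - j) + (n - 1)] = 1.0
    return B

def embedding(n, m):
    """S in R^{(2n-1) x m}: zero block, I_m, zero block, so k = S w places
    w on the central m diagonals."""
    P = (m - 1) // 2
    return np.vstack([
        np.zeros((n - P - 1, m)),
        np.eye(m),
        np.zeros((n - P - 1, m)),
    ])

n, m = 6, 3
rng = np.random.default_rng(1)
w = rng.standard_normal(m)

B, S = toeplitz_basis(n), embedding(n, m)
# vec(T'(w)) = B S w, with vec taken column-major (order="F")
assert np.allclose(B @ S @ w, toeplitz_same(w, n).flatten(order="F"))
```

Note the ordering of the $2n-1$ columns of $B$ (diagonal offsets from $-(n-1)$ up to $n-1$) must match the placement of $I_m$ inside $S$, otherwise the factorization silently breaks.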
$$ d\ vec(T'(w)) = d(BSw) = (dB)Sw + B(dS)w + BS(dw) $$ Because the matrices $B$ and $S$ are constant, we have $dB = dS = 0$; hence $$ d\ vec(T'(w)) = BSdw $$ ### Chain Rule Fantastic, now all we need to do is the chain rule. $$ dL = -2(g-y)^Tdy $$ Note that since we now work with $vec(T'(w))$, we must take the differential of $vec(y)$, giving us $$ d(vec(y)) = vec(d(T'(w))z)$$ Let's use the well-known Kronecker product identity $(A \otimes I)\,vec(C) = vec(CA^T)$, where $I$ has the same number of rows as $C$, giving us $$ d(vec(y)) = (z^T \otimes I_n)vec(d(T'(w))) $$ Note that since $y \in \mathbb{R}^n$ is already a vector, $vec(y) = y$, so we have: $$ d(vec(y)) = dy = (z^T \otimes I_n)vec(d(T'(w))) $$ Fantastic, let's substitute $vec(d(T'(w)))$ in $$ dy = (z^T \otimes I_n)BSdw$$ Now let's substitute $dy$ into the equation for $dL$ $$ dL = -2(g-y)^T(z^T \otimes I_n)BSdw$$ Since for a general function $f(x)$ we have $$df = \frac{\partial f}{\partial x} dx \quad \langle \frac{\partial f}{\partial x}, dx \rangle = df$$ i.e. 
$\frac{\partial f}{\partial x}$ is the quantity that, when the inner product is taken with $dx$, equals $df$. From $ dL = -2(g-y)^T(z^T \otimes I_n)BSdw$ we have $$ dL = \langle -2(g-y)^T(z^T \otimes I_n)BS, dw \rangle \rightarrow \frac{\partial L}{\partial w} = -2(g-y)^T(z^T \otimes I_n)BS$$ Since we are in the numerator convention $$\frac{\partial L}{\partial w} = -2(g-y)^T(z^T \otimes I_n)BS \in \mathbb{R}^{1 \times m}$$ Let's verify the dimensions: for the equation above we have $$-2(g-y)^T \in \mathbb{R}^{1 \times n} \quad (z^T \otimes I_n) \in \mathbb{R}^{n \times n^2} \quad BS \in \mathbb{R}^{n^2 \times m}$$ Further combining terms: $$-2(g-y)^T (z^T \otimes I_n) \in \mathbb{R}^{1 \times n^2}$$ thus $$\frac{\partial L}{\partial w} = -2(g-y)^T (z^T \otimes I_n)BS \in \mathbb{R}^{1 \times m}$$ ## Gradient Descent Update Term To optimize kernel $w \in \mathbb{R}^m$ such that the squared loss $L$ is minimized and the output vector $y \in \mathbb{R}^n$ matches $g \in \mathbb{R}^n$, we use the gradient descent update rule with learning rate $\alpha$ $$ w_{new} = w_{old} - \alpha (\frac{\partial L}{\partial w})^T$$ That is, $$ w_{new} = w_{old} - \alpha (-2(g-y)^T (z^T \otimes I_n)BS)^T$$ --- ## Appendix ### Padding Equation $N_{out} = N_{in} + 2P - (K-1)$ with $N_{out}$ output size, $N_{in}$ input size, $P$ padding size, $K$ kernel size. For this derivation, let's assume that $K$ is an odd number, making any desired padding $P \in \mathbb{Z}$ (as mentioned above). We are here to show that the stated *Padding Equation* is indeed correct. Note that if we have $P = 0$ the equation becomes $$N_{out} = N_{in} - (K-1)$$ thus, since $N_{out} \leq N_{in}$, this is **not** the same as the definition of a *full 1D Convolution* as stated in [1D Convolutions Blog Post](https://www.noteblogdoku.com/blog/1d-convolutions), where $N_{out} = N_{in} + K -1$ because only one element of overlap is needed for a valid convolution. 
Instead, this equation is set up such that when $P=0$ the kernel must overlap **completely** with the input vector, with no spare space. Hence, the output size of the convolution is the number of positions at which the kernel fits entirely inside the input. This is easier to see if we rephrase the equation as $$ N_{out} = N_{in} - (K-1) \rightarrow N_{out} = N_{in} - K + 1$$ Seeing that it is of the form $N_{out} = N_{in} - K + 1$, remember our basic counting principles: the number of integers in a range $[i,n]$ (for arbitrary integers $i$ & $n$) is $n - i + 1$. Hence, assuming we are 1 indexed, $N_{in} - K + 1$ counts the integers in the range $[1, N_{in} - K + 1]$, which is exactly the number of starting positions at which a kernel of size $K$ fits completely inside the input, since $N_{in} - K + 1$ is the last position of the input sequence of size $N_{in}$ at which the kernel still has complete overlap (a kernel starting there ends exactly at position $N_{in}$). Thus we've shown that $N_{out} = N_{in} - (K-1)$ is correct. Now to show that $$N_{out} = N_{in} + 2P - (K-1)$$ is correct, remember that the definition of padding is the number of 0's added to each side. Thus if one zero is added to each side of the input sequence, $N_{in}$ effectively becomes $N_{in} + 2(1)$, and for arbitrary $P$ it becomes $N_{in} \rightarrow N_{in} + 2P$. Thus we can see that $$N_{out} = N_{in} + 2P - (K-1)$$ is valid when our **kernel must fully overlap with the padded input sequence**.
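As a closing illustration, the whole pipeline, building $T'(w)$ through $BSw$, computing the loss, and applying the gradient descent update, can be sketched end to end. This is a minimal sketch, not reference code from the post: the signal, noise level, learning rate, and step count are arbitrary choices, and the helper names (`toeplitz_basis`, `embedding`, `forward`) are ours. The gradient line is the literal $-2(g-y)^T(z^T \otimes I_n)BS$ from the derivation above:

```python
import numpy as np

def toeplitz_basis(n):
    """B in R^{n^2 x (2n-1)}: column r + (n-1) is the column-major vec of
    the n x n matrix with ones on the diagonal i - j = r."""
    B = np.zeros((n * n, 2 * n - 1))
    for j in range(n):
        for i in range(n):
            B[j * n + i, (i - j) + (n - 1)] = 1.0
    return B

def embedding(n, m):
    """S in R^{(2n-1) x m}: embeds w into the central m diagonals (k = S w)."""
    P = (m - 1) // 2
    return np.vstack([np.zeros((n - P - 1, m)), np.eye(m),
                      np.zeros((n - P - 1, m))])

n, m, alpha, steps = 64, 5, 1e-3, 300
rng = np.random.default_rng(42)

t = np.linspace(0.0, 2.0 * np.pi, n)
g = np.sin(t)                             # ground-truth signal
z = g + 0.3 * rng.standard_normal(n)      # noisy input signal
w = 0.1 * rng.standard_normal(m)          # kernel to be trained

BS = toeplitz_basis(n) @ embedding(n, m)  # (n^2, m)
ZI = np.kron(z[None, :], np.eye(n))       # (z^T ⊗ I_n), shape (n, n^2)

def forward(w):
    T = (BS @ w).reshape(n, n, order="F")  # un-vectorize T'(w)
    return T @ z                           # y = T'(w) z

loss_init = float((g - forward(w)) @ (g - forward(w)))

for _ in range(steps):
    y = forward(w)
    grad = -2.0 * (g - y) @ ZI @ BS        # dL/dw (numerator convention)
    w = w - alpha * grad                   # transpose is a no-op for 1-D arrays

loss_final = float((g - forward(w)) @ (g - forward(w)))
assert loss_final < loss_init              # training reduces the loss
```

In practice one would never materialize $(z^T \otimes I_n)$ or $B$: each entry of the gradient is a correlation of the residual with a shifted copy of $z$, computable in $O(nm)$. The explicit matrices are kept here only to mirror the derivation.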