Neural Network Implementation From Scratch by dokuDoku (Machine Learning Foundations)

# Building a Neural Net from Scratch with Node-Based Autograd

I built an educational, working neural network (a multilayer perceptron, MLP) from scratch using a node-based architecture (a doubly linked list) that makes the math of backpropagation explicit and composable. The code is [here in neural_net_wi_autograd.py](https://github.com/DoKu88/simpleNeuralNet/blob/main/neural_net_wi_autograd.py); see also the *newer* [The Core of Backprop: Deriving $\frac{\partial L}{\partial M}$ to update matrix $M$ blog post](https://www.noteblogdoku.com/blog/core-of-backprop-deriving-dl-dm). By the end, you'll see exactly how gradients flow through a network using the chain rule.

*Note* that the connections between the nodes of the linked list are represented by elements in a Python list. This could have been done with pointers, but since the network is small, a list is illustrative.

*All training results for the homemade neural network can be seen in the PDF included at the bottom of this blog post.*

### Core Idea: Every Layer is a Node

The entire network is built from the `Node` abstract class, including linear transformations, activations, and the loss function. The interface implemented is:

```python
class Node(ABC):
    def forward(self, x: np.ndarray) -> np.ndarray: ...
    def backward(self, upstream: np.ndarray) -> np.ndarray: ...
    def update_weights(self, lr: float) -> None: ...
```

You can compose any sequence of nodes into a network and train it with the same three lines: forward, backward, update.

### Tracking Inputs for Backprop

A key insight in backprop is that computing gradients requires tracking the inputs that were seen during the forward pass (see backpropagation below and [Fitting Simple Linear Model](https://www.noteblogdoku.com/blog/fitting-a-simple-linear-model-using-gradient-descent)).
Nodes handle this automatically via `_record_input`:

```python
def _record_input(self, x: np.ndarray) -> None:
    self._inputs.append(x.copy())
```

Each call to `forward` appends the input to a list. During `backward`, the node retrieves the most recent input with `self._inputs[-1]`. This means the node is self-contained: it doesn't need to be handed its input again at backprop time, because it already stored it. The network doesn't need to manage any of this externally. This simple implementation was chosen because this is an educational network, not an optimized one.

### Chain Rule, Made Explicit

Backpropagation is fundamentally about computing gradients efficiently by reusing intermediate results and applying the **chain rule** in reverse through our linked list. At a high level:

- During the **forward pass**, we compute the output (and loss) by passing data through each node.
- During the **backward pass**, we start from the loss and work backwards, calculating how much each parameter contributed to the error.

The chain rule tells us that the gradient of the loss with respect to a parameter deep in the network is the product of all the local gradients along the path from that parameter to the loss.

For an **explicit derivation of the gradients of the simple linear layers of the network**, please read my other blog post [Fitting a Simple Linear Model using Gradient Descent](https://www.noteblogdoku.com/blog/fitting-a-simple-linear-model-using-gradient-descent). It fully derives $\frac{\partial L}{\partial b}$ and $\frac{\partial L}{\partial M}$ for linear layers $y = Mx + b$ with squared loss $L(y) = \sum_i (y_i - g_i)^2$, where $y, g, b \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, and $M \in \mathbb{R}^{m \times n}$.
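Those closed-form gradients are easy to sanity-check numerically. The following sketch (mine, not part of the repository code) compares the outer-product formula $\frac{\partial L}{\partial M} = 2(y - g)\,x^T$ against central finite differences; all names here are illustrative:

```python
import numpy as np

# Sanity-check dL/dM = 2(y - g) x^T for y = Mx + b, L = sum((y - g)^2),
# by comparing the closed form against central finite differences.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 2))
b = rng.standard_normal(3)
x = rng.standard_normal(2)
g = rng.standard_normal(3)

def loss(M_candidate):
    y = M_candidate @ x + b
    return float(np.sum((y - g) ** 2))

# Closed-form gradient from the derivation: an outer product
y = M @ x + b
analytic = np.outer(2.0 * (y - g), x)

# Finite-difference gradient, one entry of M at a time
eps = 1e-6
numeric = np.zeros_like(M)
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        Mp, Mm = M.copy(), M.copy()
        Mp[i, j] += eps
        Mm[i, j] -= eps
        numeric[i, j] = (loss(Mp) - loss(Mm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # essentially zero
```

Because the loss is quadratic in $M$, the central difference here is exact up to floating-point rounding, so the two gradients agree to many decimal places.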
As will be seen there, the gradient of the linear layer works out to $\frac{\partial L}{\partial M} = 2(y - g)\,x^T$ and $\frac{\partial L}{\partial b} = 2(y - g)$.

#### Simple Numerical Example

Consider a tiny network with just two nodes: a `LinearLayer`

$$ y = Wx + b $$

followed by a `ReLU`

$$ z = \max(0, y) $$

The loss is

$$ L = \sum_i (z_i - t_i)^2 $$

where $t$ is the target. The vector shapes are the following: $y, b, t, z \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, and $W \in \mathbb{R}^{m \times n}$.

Suppose $x = [2,1]^T$, $W = \begin{bmatrix} 1 & 2 \\ -1 & 3 \end{bmatrix}$, $b = [0.5,1]^T$, $t = [3,2]^T$.

- Forward:
  - $y = \begin{bmatrix} 1 & 2 \\ -1 & 3 \end{bmatrix} [2,1]^T + [0.5,1]^T = [4.5,2]^T$
  - $z = \max([0,0]^T, [4.5,2]^T) = [4.5,2]^T$
  - $L = (4.5-3)^2 + (2-2)^2 = 2.25$
- Backward:
  - Start with: $\frac{\partial L}{\partial z} = 2(z - t) = 2([4.5,2]^T - [3,2]^T) = 2[1.5,0]^T = [3,0]^T$
  - ReLU: $\frac{\partial z}{\partial y} = \mathbb{1}_{y > 0} = [1,1]^T$, so $\frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \odot \mathbb{1}_{y > 0} = [3,0]^T$
  - Linear: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} x^T = [3,0]^T [2,1] = \begin{bmatrix} 6 & 3 \\ 0 & 0 \end{bmatrix}$
  - Bias: $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial b} = \frac{\partial L}{\partial y} I_2 = [3,0]^T$

This shows the chain rule in action: each local gradient multiplies with the upstream gradient to produce the full gradient.
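The arithmetic above can be reproduced in a few lines of NumPy (a standalone check, not taken from the original script):

```python
import numpy as np

# Reproduce the worked example numerically.
x = np.array([2.0, 1.0])
W = np.array([[1.0, 2.0], [-1.0, 3.0]])
b = np.array([0.5, 1.0])
t = np.array([3.0, 2.0])

# Forward pass
y = W @ x + b                    # [4.5, 2.0]
z = np.maximum(0.0, y)           # ReLU
L = float(np.sum((z - t) ** 2))  # 2.25

# Backward pass, applying the chain rule node by node
dLdz = 2.0 * (z - t)             # [3.0, 0.0]
dLdy = dLdz * (y > 0)            # ReLU mask
dLdW = np.outer(dLdy, x)         # [[6, 3], [0, 0]]
dLdb = dLdy.copy()               # [3.0, 0.0]
```

Running this confirms every gradient from the hand derivation.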
Note: for the **full derivation** see the [Fitting Simple Linear Model blog post](https://www.noteblogdoku.com/blog/fitting-a-simple-linear-model-using-gradient-descent).

In our code, the training loop implements this:

**Forward:**

```python
h = x
for node in layers:
    h = node.forward(h)
loss = loss_node.forward(h, y)
```

**Backward:**

```python
upstream = loss_node.backward(1.0)
for node in reversed(layers):
    upstream = node.backward(upstream)
```

Each `backward` call receives $\frac{\partial L}{\partial \text{output}}$ and returns $\frac{\partial L}{\partial \text{input}}$. The return value becomes the upstream gradient for the previous node. The chain rule is literally a loop in reverse.

#### Inside LinearLayer.backward

```python
def backward(self, upstream: np.ndarray) -> np.ndarray:
    # upstream = dL/dy, shape (out_features,)
    x = self._inputs[-1]               # input recorded during the forward pass
    self.dLdb = upstream.copy()        # dy/db = 1, so dL/db = upstream
    self.dLdW = np.outer(upstream, x)  # dL/dW_ij = upstream[i] * x[j]
    dx = self.W.T @ upstream           # dL/dx = W^T @ upstream
    return dx
```

Because $y_i = \sum_j W_{ij} x_j + b_i$, we have $\frac{\partial y_i}{\partial W_{ij}} = x_j$. Multiplying by the upstream $\frac{\partial L}{\partial y_i}$ gives the outer product for $\frac{\partial L}{\partial W}$. The $dx$ term continues the chain backward.

This was a short highlight of a longer derivation that can be seen in the [Fitting a Simple Linear Model blog post](https://www.noteblogdoku.com/blog/fitting-a-simple-linear-model-using-gradient-descent) and the *newer* [The Core of Backprop: Deriving $\frac{\partial L}{\partial M}$ to update matrix $M$ blog post](https://www.noteblogdoku.com/blog/core-of-backprop-deriving-dl-dm).

### Xavier Initialization

Weight initialization matters tremendously for learning. If weights start too large, gradients explode. Too small, and they vanish before reaching early layers.
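The effect is easy to see numerically. This small experiment (my own sketch, not part of the original script; the widths and layer count are arbitrary) pushes a batch through a stack of purely linear layers under two initialization scales and watches the activation variance:

```python
import numpy as np

# Compare activation variance after a stack of linear layers
# (nonlinearities omitted for simplicity) under two init scales.
def variance_after_layers(scale_fn, n_layers=10, width=64, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((256, width))  # batch of unit-variance inputs
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * scale_fn(width, width)
        h = h @ W.T
    return float(h.var())

# Xavier: scale = sqrt(2 / (fan_in + fan_out))
xavier = variance_after_layers(lambda fi, fo: np.sqrt(2.0 / (fi + fo)))
# A fixed, too-large scale for comparison
too_big = variance_after_layers(lambda fi, fo: 0.5)

print(xavier, too_big)
```

With the Xavier scale the variance stays on the order of 1 across all ten layers; with the fixed larger scale it grows by a factor of roughly `width * 0.25` per layer and explodes.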
`LinearLayer` uses Xavier (Glorot) initialization:

```python
scale = np.sqrt(2.0 / (in_features + out_features))
self.W = rng.standard_normal((out_features, in_features)) * scale
```

The scale factor

$$ \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}} $$

keeps the variance of activations roughly constant across layers at initialization, which is critical for deeper networks like the three-hidden-layer model here.

### Two Kinds of Nodes: Linear and Activation

The network splits into two distinct node types:

- **LinearLayer** has learnable parameters (`W`, `b`). Its `backward` computes parameter gradients and propagates to the previous layer. It overrides `update_weights` for gradient descent.
- **Activation** nodes (ReLU, Sigmoid, etc.) have no parameters. They introduce non-linearity. Their `backward` applies the local derivative element-wise.

ReLU:

$$ y = \max(0, x), \qquad \frac{\partial y}{\partial x} = \begin{cases} 1 & x > 0 \\ 0 & \text{otherwise} \end{cases} $$

```python
def _f_prime(self, x, y):
    return (x > 0).astype(float)
```

Sigmoid:

$$ y(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{\partial y}{\partial x} = y(1 - y) $$

```python
def _f_prime(self, x, y):
    return y * (1.0 - y)
```

Separating these concerns keeps the code clean and makes `update_weights` a safe no-op on activations.

### Abstract Classes Enable Future Extension

The `Node` ABC defines a clear contract: implement `forward`, `backward`, and `equation`. Adding new layers is trivial: just implement the math; the training loop stays unchanged. `SquaredLoss` is a concrete example:

```python
def _loss_value(self, pred, target):
    return float(np.sum((target - pred) ** 2))

def _loss_gradient(self, pred, target):
    return -2.0 * (target - pred)
```

### Training

This trains on scikit-learn's `make_moons` dataset: two interlocking half-circles that demand genuine non-linearity.
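For reference, `make_moons` places points on two interlocking half-circles and adds Gaussian noise. A NumPy-only stand-in (an approximation of scikit-learn's generator; the sample count, noise level, and function name here are my own choices) looks like:

```python
import numpy as np

def two_moons(n_per_class=200, noise=0.2, seed=0):
    """NumPy-only approximation of sklearn.datasets.make_moons."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, np.pi, size=n_per_class)
    # Upper moon: half-circle centered at the origin
    upper = np.column_stack([np.cos(theta), np.sin(theta)])
    # Lower moon: shifted, flipped half-circle that interlocks with the first
    lower = np.column_stack([1.0 - np.cos(theta), 0.5 - np.sin(theta)])
    X = np.vstack([upper, lower])
    X += rng.normal(scale=noise, size=X.shape)
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

X, y = two_moons()
print(X.shape, y.shape)  # (400, 2) (400,)
```

Because the two classes wrap around each other, no single straight line separates them, which is exactly why the network needs its ReLU layers.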
Architecture: Input(2) → Linear(2→4) → ReLU → Linear(4→4) → ReLU → Linear(4→4) → ReLU → Linear(4→1) → SquaredLoss

Building deeper or wider networks is just adding nodes:

```python
nodes = [linear1, relu1, linear2, relu2, linear3, relu3, linear4, loss_fn]
```

This was designed to make it easy to add different layers within the network. The network is modeled as a **doubly linked list**, with forward connections for forward propagation and back connections for back propagation. Note that although we use a list to denote the connections for the linked list, we could have just as easily used pointers. However, given the size of the network, keeping the nodes in a list gives us visibility and simplicity.

### Results

See the PDF at the bottom of this blog post for the results of our training.

Page 1 shows the dataset: the classic two interlocking half-circles. This was chosen for testing because the neural network has to be sufficiently parameterized and *non-linear* to demonstrate learning.

On the second page you can see the trained model's predictions on the train and validation sets. As expected, the model learns the training data, and visually we can see that it classifies most points correctly. There is one notable example missed at approximately $[x_1, x_2] = [-0.5, 0.6]$, but this is quite a tricky example, and overall the model does well. For validation (data unseen during training) the model also does quite well and predicts most of the entries correctly. Near the boundaries, we can see that the confidence of the model weakens, especially for the hard examples, which is to be expected.

Finally, on the last page, we can see the training and validation loss curves. As we can see, the model learns using squared loss and improves performance.
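Putting the pieces together, here is a condensed, self-contained sketch of the whole pipeline. It mirrors the post's node design (recorded inputs, Xavier init, a reversed chain-rule loop) but is not the actual script: the dataset is a synthetic stand-in, and the widths, learning rate, and epoch count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """Minimal linear node: Xavier init, records inputs on forward."""
    def __init__(self, n_in, n_out):
        scale = np.sqrt(2.0 / (n_in + n_out))
        self.W = rng.standard_normal((n_out, n_in)) * scale
        self.b = np.zeros(n_out)
        self._inputs = []
    def forward(self, x):
        self._inputs.append(x.copy())
        return self.W @ x + self.b
    def backward(self, upstream):
        x = self._inputs.pop()
        self.dLdW = np.outer(upstream, x)
        self.dLdb = upstream.copy()
        return self.W.T @ upstream
    def update_weights(self, lr):
        self.W -= lr * self.dLdW
        self.b -= lr * self.dLdb

class ReLU:
    def __init__(self):
        self._inputs = []
    def forward(self, x):
        self._inputs.append(x.copy())
        return np.maximum(0.0, x)
    def backward(self, upstream):
        return upstream * (self._inputs.pop() > 0)
    def update_weights(self, lr):
        pass  # no parameters: safe no-op

# Toy non-linear dataset (a stand-in, not make_moons):
# label 1 outside a ring, 0 inside
X = rng.uniform(-1, 1, size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 0.6).astype(float)

layers = [Linear(2, 8), ReLU(), Linear(8, 8), ReLU(), Linear(8, 1)]

def run_epoch(lr):
    total = 0.0
    for xi, ti in zip(X, y):
        h = xi
        for node in layers:                 # forward
            h = node.forward(h)
        total += float((h[0] - ti) ** 2)
        upstream = np.array([2.0 * (h[0] - ti)])  # dL/dpred for squared loss
        for node in reversed(layers):       # backward
            upstream = node.backward(upstream)
        for node in layers:                 # update
            node.update_weights(lr)
    return total / len(X)

first = run_epoch(0.05)
for _ in range(30):
    last = run_epoch(0.05)
print(first, last)  # loss drops as training proceeds
```

The training loop is the same three steps as in the post (forward, backward, update), and swapping in deeper architectures is just editing the `layers` list.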
This particular run had:

- Mean train loss: 0.005922
- Mean validation loss: 0.008348

The link to this Python script is at the top of this post, but you can also find it on my GitHub at [neural_net_wi_autograd.py](https://github.com/DoKu88/simpleNeuralNet/blob/main/neural_net_wi_autograd.py).

### Takeaway

This implementation is an explanatory example of how neural networks work. It is nice to implement this and gain intuition for how gradient descent works instead of relying on a PyTorch abstraction (PyTorch is great, by the way). With a clean node abstraction, explicit input tracking, and a reversed loop that applies the chain rule, you get a composable, extensible network that trains on real data.