## Vector Normalization

by dokuDoku

In this post, I use “normalization” broadly to refer to techniques that control the scale, statistics, or geometry of vectors in machine learning systems. Some of these methods normalize activations during training, such as BatchNorm, LayerNorm, and RMSNorm. Others normalize vectors directly using a chosen norm, such as L1 or L2 normalization. Dynamic Tanh (DyT) is slightly different: it is not a normalization method in the strict sense, but it is often discussed as a normalization-free alternative because it controls activation scale through a learned bounded transformation.

Rather than treating all of these as the same operation, we will compare them by asking:

- What quantity is being controlled?
- Across which dimensions is it computed?
- Does the method use mean subtraction, variance scaling, norm scaling, or a learned nonlinear transformation?
- Where is it usually used in a model?

Before diving into each method, it helps to define a few geometric properties that normalization often affects:

- **Magnitude**: the length of a vector
  - Affects the size of dot products and activation scales
- **Direction**: the angle or orientation of a vector
  - Often used to compare embeddings, especially through cosine similarity
- **Offset**: a shift in the vector’s coordinates, often related to nonzero means or additive bias terms
  - Can shift activations away from the origin and affect downstream layers

<iframe src="/blog/vector-normalization/applet/6" width="95%" height="520" style="border:none;display:block;" sandbox="allow-scripts allow-same-origin"></iframe>

<center>
$\textbf{Interactive Python For Norms!}$ (use in light mode) Interact with the different norms and their effects on random vectors. Interactive version of local Matplotlib GIFs I generated.
</center>

Depending on the method, normalization can make vectors share certain properties, such as comparable length, mean, variance, or root-mean-square scale. Normalization techniques are often used to:

- Make gradients more consistent across dimensions and layers
- Make neural networks easier to optimize
- Reduce sensitivity to weight or input initialization
- Help residual and attention mechanisms behave more predictably

The recurring theme is that neural networks are easier to reason about when the vectors moving through them do not grow, shrink, or drift unpredictably.

## Normalization Techniques

We are going to discuss the following normalization techniques:

- Layer Norm
- RMS Norm
- DyT (Dynamic Tanh)
- Batch Norm
- L1 Norm
- L2 Norm

**Note** that we will not include the learned scaling and offset terms: assume unit scale and an offset of 0 throughout. This simplifies the analysis.

### Layer Norm

<img src="/media/images/20260505_062105_LayerNorm.gif" alt="LayerNorm" style="max-width: min(800px,100%); height: auto;">

<center>
$\textbf{Figure 1}$: Layer Normalization animation showing vectors being constrained to a sphere of constant radius $\sqrt{d}$ inside the hyperplane defined by $\sum_i x_i = 0$
</center>

The Layer Norm is defined by:

$$ LN(x) = \frac{x - \mu(x) \mathbb{1}_d}{\sigma(x) + \epsilon} $$

- $x \in \mathbb{R}^d$, a $d$-dimensional vector
- $\mu(x) = \frac{1}{d} \sum_{i=1}^d x_i \in \mathbb{R}$
  - Average of the vector's components
- $\sigma(x) = \sqrt{\frac{1}{d} \sum_{i=1}^d (x_i - \mu(x))^2} \in \mathbb{R}$
  - Standard deviation of the vector's components
- $\mathbb{1}_d \in \mathbb{R}^d$, the $d$-dimensional vector whose entries are all 1
  - $[1,...,1]^T$, with $d$ ones
- $\epsilon \in \mathbb{R}$, a small positive constant that prevents division by zero when $\sigma(x) = 0$
- $LN(x) \in \mathbb{R}^d$

Layer Norm uses only the entries of the vector being normalized: we normalize *each vector in $X$ relative to that vector's
entries*. Using column-major notation (every column is a vector), for $n$ vectors of dimension $d$ we get:

Superscripts: Vector Identifier, Subscript: Element Identifier

$$LN(X) = \begin{bmatrix} \vert & \vert & & \vert \\ LN(x^1) & LN(x^2) & \cdots & LN(x^n) \\ \vert & \vert & & \vert \end{bmatrix} \in \mathbb{R}^{d \times n} $$

#### Hyperplane Defined by Subtracting $\mu$

Subtracting a vector's average $\mu(x^i)$ from each of its entries yields a centered vector whose entries satisfy the linear equation

$$\sum_{j=1}^d \left( x^i_j - \mu(x^i) \right) = 0$$

To see this, recall that $\mu(x^i) = \frac{1}{d} \sum_{k=1}^d x^i_k$. Thus

$$\sum_{j=1}^d \left( x^i_j - \mu(x^i) \right) = \sum_{j=1}^d \left( x^i_j - \frac{1}{d} \sum_{k=1}^d x^i_k \right)$$

$$ = \left( \sum_{j=1}^d x^i_j \right) - d \cdot \frac{1}{d} \sum_{k=1}^d x^i_k = \left( \sum_{j=1}^d x^i_j \right) - \sum_{k=1}^d x^i_k = 0$$

Dropping the vector index, $\sum_{j=1}^d (x_j - \mu(x)) = 0$. Writing the centered vector as $z = x - \mu(x) \mathbb{1}_d$, this says $\sum_j z_j = 0$, which defines a hyperplane in $\mathbb{R}^d$, since a hyperplane is the solution set of a non-trivial linear equation:

$$ a_1 z_1 + a_2 z_2 + \cdots + a_d z_d = b$$

Here $a_1 = \cdots = a_d = 1$ and $b = 0$. As we can see in **Figure 1**, all vectors collapse onto this hyperplane, on which each vector's entries sum to 0.

#### Radius of $\sqrt{d}$

Note that in **Figure 1** all normalized points lie on a circle, meaning that they have the *same magnitude*. This occurs because we divide $z$ by $\sigma(x)$. Consider the following equation (ignoring $\epsilon \approx 0$, since it is irrelevant for our analysis):

$$ LN(x) = \frac{x - \mu(x) \mathbb{1}_d}{\sigma(x)} = \frac{z}{\sigma(x)} $$

Now we'll see that the length of $LN(x)$ equals $\sqrt{d}$, that is, $||LN(x)|| = \sqrt{d}$.
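
Both geometric claims about Layer Norm (the components sum to zero, and the norm is $\sqrt{d}$) can be sanity-checked numerically. The following is a minimal sketch in plain Python. The function name `layer_norm`, the random test vector, and the choice `eps=0.0` (matching the analysis, which ignores $\epsilon$) are my own assumptions for illustration, not any library's API.

```python
import math
import random

def layer_norm(x, eps=0.0):
    """LayerNorm without learned scale/offset: (x - mean) / (std + eps)."""
    d = len(x)
    mu = sum(x) / d
    # Population standard deviation (divide by d, not d - 1), as in the definition.
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)
    # With eps=0.0 this fails for a constant vector (sigma == 0); pass eps > 0 in that case.
    return [(xi - mu) / (sigma + eps) for xi in x]

random.seed(0)
d = 8
x = [random.gauss(3.0, 2.0) for _ in range(d)]  # arbitrary test vector with nonzero mean
y = layer_norm(x)

component_sum = sum(y)                       # ~0: y lies on the hyperplane sum(z) = 0
norm = math.sqrt(sum(yi ** 2 for yi in y))   # ~sqrt(d): y lies on a sphere of that radius

print(f"sum of components: {component_sum:.2e}")
print(f"norm: {norm:.6f}   sqrt(d): {math.sqrt(d):.6f}")
```

With a nonzero `eps` the norm comes out slightly below $\sqrt{d}$, which is why the analysis sets it aside.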
We know that

$$ ||LN(x)||^2 = \left|\left| \frac{x - \mu(x) \mathbb{1}_d}{\sigma(x)} \right|\right|^2$$

$$ = \sum_{i=1}^d \left( \frac{x_i - \mu}{\sigma} \right)^2 = \frac{1}{\sigma^2} \sum_{i=1}^d (x_i - \mu)^2$$

$$ ||LN(x)||^2 = \frac{1}{\sigma^2} ||x - \mu \mathbb{1}_d||^2 $$

If we can relate $\sigma$ and $||x - \mu \mathbb{1}_d||$, then we can find $||LN(x)||$. With this in mind:

$$ ||x - \mu \mathbb{1}_d||^2 = \sum_{i=1}^d (x_i - \mu)^2 $$

$$ Var(x) = \sigma^2 = \mathbb{E}\left[ (x - \mathbb{E}[x])^2 \right] = \frac{1}{d} \sum_{i=1}^d (x_i - \mu)^2$$

where the expectation is taken over the components of $x$. Comparing the two expressions gives

$$ ||x - \mu \mathbb{1}_d||^2 = d \sigma^2$$

Great, now we can simplify our equation for $||LN(x)||^2$:

$$ ||LN(x)||^2 = \frac{d \sigma^2}{\sigma^2} = d $$

$$ \boxed{ ||LN(x)|| = \sqrt{d}}$$

### RMS Norm

<img src="/media/images/20260505_213216_RMSNorm.gif" alt="RMSNorm" style="max-width:min(800px,100%); height: auto;">

<center>
$\textbf{Figure 2}$: RMS (Root Mean Squared) Normalization animation showing vectors being constrained to the surface of a sphere in $\mathbb{R}^3$ with radius $\sqrt{d}$
</center>

The RMS Norm is defined by:

$$ RMSNorm(x) = \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^d x^2_i}} $$

- $x \in \mathbb{R}^d$, a $d$-dimensional vector
- $RMSNorm(x) \in \mathbb{R}^d$

As with Layer Norm, we normalize *each vector in $X$ relative to that vector's own entries*. Using column-major notation, for $n$ vectors of dimension $d$ we get:

$$RMSNorm(X) = \begin{bmatrix} \vert & \vert & & \vert \\ RMSNorm(x^1) & RMSNorm(x^2) & \cdots & RMSNorm(x^n) \\ \vert & \vert & & \vert \end{bmatrix} \in \mathbb{R}^{d \times n} $$

#### Radius of $\sqrt{d}$

Notice in **Figure 2** that all normalized vectors are required to lie on the surface of a sphere of radius $\sqrt{d}$.
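
The sphere claim for RMS Norm is easy to check on a small vector. This is a sketch in plain Python under my own naming (`rms_norm` and the example vector are hypothetical and not taken from any library). It also prints the mean of the output to show that, unlike with Layer Norm, nothing forces it to zero.

```python
import math

def rms_norm(x):
    """RMSNorm without a learned gain: x / sqrt(mean(x_i^2))."""
    d = len(x)
    rms = math.sqrt(sum(xi ** 2 for xi in x) / d)
    # A zero vector has rms == 0 and would raise ZeroDivisionError here.
    return [xi / rms for xi in x]

x = [3.0, -1.0, 4.0, 1.0]   # d = 4, so the expected radius is sqrt(4) = 2
y = rms_norm(x)

norm = math.sqrt(sum(yi ** 2 for yi in y))
mean_after = sum(y) / len(y)

print(f"norm: {norm:.6f}   sqrt(d): {math.sqrt(len(x)):.6f}")
print(f"mean after RMSNorm: {mean_after:.4f}")   # not zero: there is no centering step
```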
The radius of $\sqrt{d}$ follows from:

$$ ||RMSNorm(x)||^2 = \left|\left| \frac{x}{\sqrt{\frac{1}{d} \sum_{j=1}^d x^2_j}} \right|\right|^2 $$

$$ = \sum_{i=1}^d \left( \frac{x_i}{\sqrt{\frac{1}{d} \sum_{j=1}^d x^2_j}} \right)^2 = \frac{\sum_{i=1}^d x_i^2}{\frac{1}{d} \sum_{j=1}^d x_j^2} = d$$

$$ ||RMSNorm(x)||^2 = d \rightarrow \boxed{||RMSNorm(x)|| = \sqrt{d}}$$

Great! When $\mu(x) = 0$, the root mean square and the standard deviation $\sigma$ coincide, since $\sigma^2 = \frac{1}{d} \sum_i (x_i - \mu)^2 = \frac{1}{d} \sum_i x_i^2$. So on zero-mean vectors RMS Norm is the same as Layer Norm. In general, RMS Norm is Layer Norm without the mean subtraction, which means the vectors are not projected onto the hyperplane defined by the vector components summing to zero. The benefit of this norm is that it is simpler and faster than LayerNorm, since mean subtraction is not needed.

### Dynamic Tanh (DyT)

<img src="/media/images/20260505_220336_DyT.gif" alt="DyT" style="max-width: min(800px,100%); height: auto;">

<center>
$\textbf{Figure 3}$: DyT (Dynamic Tanh) "Normalization" animation showing vectors being constrained to be inside the cube with bounds $[-1,1]$ for each dimension
</center>

The Dynamic Tanh is an element-wise replacement for norm layers (mostly LayerNorm) from the 2025 paper [Transformers without Normalization](https://arxiv.org/pdf/2503.10622), whose authors include Yann LeCun. In the simplified form used here, it is defined by:

$$ \mathrm{DyT}(x) = \tanh(x), \qquad \mathrm{DyT}(x)_i = \tanh(x_i) $$

- $x \in \mathbb{R}^d$, a $d$-dimensional vector
- $\tanh(x) \in \mathbb{R}^d$
  - Elementwise application of the tanh function, s.t. every element $x_i, \forall i \in [1,d]$ has tanh applied separately

<img src="/media/images/20260505_222122_Screenshot_2026-05-05_at_3.20.17_PM.png" alt="Screenshot 2026 05 05 at 3.20.17 PM" style="max-width: min(600px, 90%); height: auto;">

<center>
$\textbf{Figure 4}$: Hyperbolic Tangent Graph $\tanh(x)$
</center>

As we can see in **Figure 4**, the range of $\tanh(x)$ is $-1 < \tanh(x) < 1$, which constrains all input vectors to the cube $[-1,1]^d$, shown for $\mathbb{R}^3$ in **Figure 3**.
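
To make the cube constraint concrete, here is a small sketch in plain Python. The function name `dyt` and the sample inputs are my own. The optional `alpha`, `gamma`, and `beta` arguments correspond to the learnable form stated later in this section, but here they are plain scalars with identity defaults, which is a simplification (in the paper the scale and offset are per-channel vectors).

```python
import math

def dyt(x, alpha=1.0, gamma=1.0, beta=0.0):
    """Elementwise gamma * tanh(alpha * x_i) + beta; the defaults reduce to plain tanh."""
    return [gamma * math.tanh(alpha * xi) + beta for xi in x]

x = [-50.0, -2.0, -0.1, 0.0, 0.1, 2.0, 50.0]
y = dyt(x)

for xi, yi in zip(x, y):
    print(f"{xi:7.1f} -> {yi:+.4f}")

# Small inputs pass through almost unchanged (tanh is close to the identity near 0),
# while large inputs are squashed toward the faces of the cube. In floating point,
# very large inputs saturate to exactly +1.0 or -1.0.
```

Note that, unlike Layer Norm or RMS Norm, each output element depends only on its own input element: no statistic of the vector is computed.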
The main benefits of using Dynamic Tanh in transformer-based models (instead of the commonly used LayerNorm or RMSNorm) are that it often matches or exceeds their performance with little or no hyperparameter tuning, is much simpler to implement, and enables stable training of normalization-free Transformers. Any speedups are mostly hardware and optimization dependent. It offers an alternative mainly to the normalizations we have seen so far, which **normalize with respect to the elements in the vector**.

Note that Dynamic Tanh is typically formulated with learned parameters that adjust the graph in **Figure 4**; we omitted them above for simplicity. Briefly, with these parameters we have:

$$ \boxed{DyT(x) = \gamma \tanh(\alpha x) + \beta}$$

which provides scaling ($\gamma$), offset ($\beta$), and slope ($\alpha$) adjustments for this function, effectively helping to learn what the "normalization" should be. Each vector is still transformed individually, and all vectors use the same parameter values. Unsurprisingly, it does not offer an alternative to *Batch Norm*, which normalizes a feature with respect to other vectors in the batch.

### Batch Norm

<img src="/media/images/20260505_222845_BatchNorm.gif" alt="BatchNorm" style="max-width: min(800px,100%); height: auto;">

<center>
$\textbf{Figure 5}$: Batch Normalization showing vectors normalized per feature wrt other vectors in the batch (Image). Each feature is standardized to mean 0 and standard deviation 1 (wrt vectors in the batch). Cube represents $\pm 1 \sigma$ (standard deviation) for the features.
</center>

Batch Norm, first introduced in the paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167), is ubiquitous within machine learning.
$$ BN(x) = \frac{x - \mu(X)}{\sqrt{\sigma^2(X) + \epsilon}} $$

Superscripts: Vector Identifier, Subscript: Element Identifier

- $x \in \mathbb{R}^d$, one of the $n$ vectors $x^1, \dots, x^n$ of dimension $d$ in the batch $X$
- $\mu(X) \in \mathbb{R}^d$, the average vector, computed per feature across the batch
  - $\mu(X) = \frac{1}{n} \sum_{i=1}^n x^i$
- $\sigma^2(X) = \frac{1}{n} \sum_{i=1}^n (x^i - \mu(X))^2 \in \mathbb{R}^d$, the per-feature variance
  - The square, the square root, and the division in $BN(x)$ are all applied elementwise

What this does is that, with respect to each dimension (also commonly called a *feature*), we normalize the vectors to have mean $0$ and standard deviation $1$ across the batch. See **Figure 6**, which shows that for a batch of 60 vectors with 8 dimensions, after batch normalization each feature's distribution has mean 0 and standard deviation 1.

<img src="/media/images/20260506_022124_Screenshot_2026-05-05_at_7.20.52_PM.png" alt="Screenshot 2026 05 05 at 7.20.52 PM" style="max-width: min(650px,90%); height: auto;">

<center>
$\textbf{Figure 6}$: Distribution of dimensions before and after BatchNorm is applied to the entire batch of 60 vectors, each with 8 dimensions. Note that the mean and variance of each dimension's distribution are standardized
</center>

**NOTE** as stated in the paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167), we need to add learned parameters s.t. the network can rescale and shift the normalized values:

$$ \boxed{ BN_{\gamma, \beta}(x) = \gamma BN(x) + \beta}$$

However, as previously mentioned, we are ignoring this for simplicity.

Batch Norm is greatly useful for machine learning, since it standardizes the inputs to each layer. As mentioned in the paper, *data whitening* has been found to be effective for statistically cleaning the input to a neural network.
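
The per-feature standardization described above can be sketched in a few lines of plain Python. The function name `batch_norm`, the synthetic data, and `eps=1e-5` are assumptions of mine for illustration. The sketch uses batch statistics only, with no learned $\gamma$, $\beta$ and no running averages for inference, and it stores vectors as rows of a list instead of as columns. The batch size of 60 and dimension of 8 mirror the setup of **Figure 6**.

```python
import math
import random

def batch_norm(X, eps=1e-5):
    """X is a list of n vectors of length d. Standardize each feature across the batch."""
    n, d = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / n for j in range(d)]                    # per-feature mean
    var = [sum((x[j] - mu[j]) ** 2 for x in X) / n for j in range(d)]    # per-feature variance
    return [[(x[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(d)] for x in X]

random.seed(0)
n, d = 60, 8
# Synthetic batch: feature j gets its own mean (5 * j) and spread (1 + j).
X = [[random.gauss(5.0 * j, 1.0 + j) for j in range(d)] for _ in range(n)]
Y = batch_norm(X)

means, stds = [], []
for j in range(d):
    col = [y[j] for y in Y]
    m = sum(col) / n
    s = math.sqrt(sum((c - m) ** 2 for c in col) / n)
    means.append(m)
    stds.append(s)
    print(f"feature {j}: mean={m:+.3f}  std={s:.3f}")
```

The statistics here are taken down the batch for one feature at a time, which is the opposite axis from Layer Norm and RMS Norm.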
However, neural networks are compositions of functions, represented by:

$$ l = F_2(F_1(u,\Theta_1), \Theta_2)$$

- $F_1,F_2$ are functions
- $\Theta_1, \Theta_2$ are parameters for their respective functions
- $u \in \mathbb{R}^{n \times d}$ is our input batch
- $l \in \mathbb{R}$ is our scalar loss

We can see that if we can standardize the input into $F_2$ from the output of $F_1$, it makes it *much easier* for the network to learn. If the distribution of the inputs to these nested layers shifts during training, termed **Internal Covariate Shift**, learning becomes harder (a lower learning rate is needed and the distribution is unpredictable). It also encourages *cascading failures*: for instance, if $F_1$ outputs values that are out of distribution, then $F_2$ will not know what to do with this data, and this compounds as the network increases in depth. This technique is ubiquitous today since it addresses this common problem.

As we can see from **Figure 5**, the strength of this method is not in making the vectors live on a fixed surface, like the spheres of LayerNorm or RMS Norm. Instead, as shown in **Figure 6**, it standardizes the distribution of the data during training, making learning for each layer less dependent on the previous layer's extreme values.

Of note, although the original paper motivates BatchNorm through internal covariate shift, later analyses suggest its benefits may also come from smoothing the optimization landscape and improving gradient behavior.

### L1 Normalization

<img src="/media/images/20260506_023512_L1.gif" alt="L1" style="max-width: min(800px, 100%); height: auto;">

<center>
$\textbf{Figure 7}$: L1 Normalization of Vectors s.t.
each vector is forced onto the surface of the L1 Ball
</center>

The L1 normalization for vectors forces all of the vectors to live on the surface of the L1 "Ball" (a polytope):

$$ L1(x) = \frac{x}{||x||_1} = \frac{x}{\sum_{i=1}^d |x_i|}$$

- $x \in \mathbb{R}^d$, a $d$-dimensional vector
- $\sum_{i=1}^d |x_i| \in \mathbb{R}$, the length of $x$ in the L1 norm
- $L1(x) \in \mathbb{R}^d$, the normalized vector

Since we are dividing each vector $x$ by its L1 length $||x||_1$, each vector now lives on the surface of the L1 Ball, which we can see in **Figure 7**. This normalization is used when you want the sum of absolute values to equal 1, for example when turning a vector of nonnegative weights into a probability distribution.

L1 normalization should not be confused with L1 regularization, as used in Lasso. L1 regularization promotes sparsity by adding a penalty to the loss, whereas L1 normalization rescales an existing vector to have unit L1 norm.

For more on the L1 norm in regularization, see the previous blog posts:

- [L1 Lasso vs L2 Ridge Regression](https://www.noteblogdoku.com/blog/l1-lasso-vs-l2-ridge-regression)
- [L1 Lasso Regularization & Sparsity](https://www.noteblogdoku.com/blog/l1-lasso-regularization-sparsity)

### L2 Normalization

<img src="/media/images/20260506_025552_L2.gif" alt="L2" style="max-width: min(800px, 100%); height: auto;">

<center>
$\textbf{Figure 8}$: L2 Normalization of Vectors s.t. each vector is forced onto the surface of the L2 Ball. Each vector has length 1 in Euclidean space.
</center>

The L2 Normalization for vectors forces all the vectors to live on the surface of the L2 Ball (the unit sphere), by setting each vector's length equal to 1:

$$ L2(x) = \frac{x}{||x||_2} = \frac{x}{\sqrt{\sum_{i=1}^d x^2_i}} $$

- $x \in \mathbb{R}^d$, a $d$-dimensional vector
- $\sqrt{\sum_{i=1}^d x_i^2} \in \mathbb{R}$, the length of $x$ in L2 (Euclidean) space
- $L2(x) \in \mathbb{R}^d$, the normalized vector

This is one of the most common unit-length normalizations, which turns vectors into essentially "direction only" vectors. It's popularly used for cosine similarity and embedding normalization. The L2 norm also appears in Ridge regression, but there it is a regularization penalty on the weights, not a normalization of vectors.

For discussion of how the L2 norm is used for regularization, see the following blog posts:

- [L1 Lasso vs L2 Ridge Regression](https://www.noteblogdoku.com/blog/l1-lasso-vs-l2-ridge-regression)
- [Bias Variance of L2 Ridge Regression](https://www.noteblogdoku.com/blog/bias-variance-of-l2-ridge-regression)
- [Decomposition of L2 Ridge Regression](https://www.noteblogdoku.com/blog/decomposition-of-l2-ridge-regression)
- [L2 Ridge Regression from Lagrange Multipliers](https://www.noteblogdoku.com/blog/l2-ridge-regression-from-lagrange-multipliers)
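
As a closing illustration of the two norm-based methods, here is a minimal sketch in plain Python. The helper names `l1_normalize`, `l2_normalize`, and `dot`, and the example vectors, are hypothetical choices of mine. It also shows why L2 normalization pairs naturally with cosine similarity: once both vectors have unit length, their dot product is the cosine of the angle between them.

```python
import math

def l1_normalize(x):
    """Divide by the sum of absolute values, so the result has unit L1 norm."""
    s = sum(abs(xi) for xi in x)
    return [xi / s for xi in x]   # a zero vector would raise ZeroDivisionError

def l2_normalize(x):
    """Divide by the Euclidean length, so the result has unit L2 norm."""
    s = math.sqrt(sum(xi ** 2 for xi in x))
    return [xi / s for xi in x]   # a zero vector would raise ZeroDivisionError

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

a = [3.0, -4.0]
b = [1.0, 1.0]

a_l1 = l1_normalize(a)   # [3/7, -4/7]: absolute values sum to 1
a_l2 = l2_normalize(a)   # [0.6, -0.8]: Euclidean length is 1

# Cosine similarity as a dot product of L2-normalized vectors.
cos_sim = dot(l2_normalize(a), l2_normalize(b))

print("L1-normalized:", a_l1, " sum of |.|:", sum(abs(v) for v in a_l1))
print("L2-normalized:", a_l2, " length:", math.sqrt(dot(a_l2, a_l2)))
print("cosine similarity of a and b:", cos_sim)
```

Both functions keep the direction of `a` and change only its magnitude. They differ in which unit "ball" the result lands on.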