# KL-Divergence and Cross Entropy

*by doku, Doku Information Theory 🔍*

## Where have I seen you before?

**KL-Divergence** and **Cross Entropy** are two fundamental quantities in machine learning, appearing whenever models output probabilities or compare distributions. Cross Entropy is commonly used as the classification loss function, and in any maximum likelihood estimation with a categorical or softmax output. KL-Divergence is common whenever we compare distributions directly, such as in PPO, knowledge distillation, Variational Autoencoders (VAEs), and measuring dataset distribution shift. KL-Divergence is also commonly used for regularizing updates to your distribution with respect to a reference distribution.

## What are KL-Divergence and Cross Entropy?

KL-Divergence and Cross Entropy are defined in reference to Entropy and Self-Information, described in the [Self-Information & Entropy blog](https://www.noteblogdoku.com/blog/self-information-entropy). A quick recap: suppose we have a discrete set of possible outcomes $$X = \{x_1, x_2, \ldots, x_n\}$$ with a probability distribution $P$ such that $$\sum_{i=1}^n P(x_i) = 1$$

- **Self Information:** The surprise of observing event $x_i$ under distribution $P$ $$ I_P(x_i) = -log(P(x_i)) $$
  - Equivalently, the approximate depth of $x_i$ in the optimal Huffman Tree Encoding
  - Rare events give more information
  - Common events give less information
- **Entropy:** The expected Self Information of events drawn from $P$ $$H(P) = -\sum_i P(x_i)log P(x_i) = \sum_i P(x_i) I_P(x_i) = \mathbb{E}_{x \sim P} [I_P(x)]$$
  - Equivalently, the expected depth of elements in the Huffman Tree built from $P$
  - A measure of the uncertainty of the distribution

Now let's define Cross Entropy and KL-Divergence, which compare two distributions:

- **Ground Truth Distribution** $P$ that we desire to mimic
- **Hypothesis Distribution** $Q$ that we have generated
- **Cross Entropy:** The expected Information drawn from a Hypothesis Distribution $Q$ when sampled using $P$ $$ H(P,Q)
= -\sum_i P(x_i)log Q(x_i) = \sum_i P(x_i) I_Q(x_i) = \mathbb{E}_{x \sim P} [I_Q(x)]$$
  - Equivalently, the expected depth in the Huffman Tree made for $Q$ when sampled using $P$
  - The penalty for using hypothesis distribution $Q$ when ground truth $P$ generates outcomes
- **KL-Divergence:** The expected difference of information from distributions $P$ & $Q$ when sampling using $P$ $$D_{KL}(P||Q) = \sum_i P(x_i)(log P(x_i) - log Q(x_i)) = \sum_i P(x_i) (I_Q(x_i) - I_P(x_i)) = \mathbb{E}_{x \sim P} [I_Q(x) - I_P(x)]$$ $$D_{KL}(P||Q) = \sum_i P(x_i) (I_Q(x_i) - I_P(x_i)) = \sum_i P(x_i) I_Q(x_i) - \sum_i P(x_i) I_P(x_i) = H(P,Q) - H(P)$$
  - Equivalently, the expected difference in depth between the Huffman Tree made for $Q$ and the one made for $P$, when both are sampled using $P$
  - Known as **Relative Entropy** because $D_{KL}(P||Q) = H(P,Q) - H(P)$
  - A measure of discrepancy between distributions, **not** a distance or inner product, since $D_{KL}(P||Q) \neq D_{KL}(Q||P)$

## Cross Entropy: $H(P,Q) = \sum_i P(x_i) I_Q(x_i)$

Cross Entropy is the information we get from hypothesis distribution $Q$ when we sample using distribution $P$. Note that when $P=Q$, *Cross Entropy is just the Entropy*, i.e. $H(P,P) = H(P)$.

<center> <video controls style="max-width: 80%;"> <source src="/media/videos/20260315_224049_03_15_2026_15_40_15_legal_akita_cross_entropy.mp4"> </video> </center>

<center> $\textbf{Video 1:}$ Slideshow Empirically Calculating the Cross Entropy $H(P,Q)$ by Sampling Distribution $Q$ according to Distribution $P$ </center>

As we can see from **Video 1**, we can measure the amount of information we get on average when we sample hypothesis $Q$ according to ground truth $P$. This measures how far off our hypothesis distribution is from our ground truth distribution. We can prove that the Cross Entropy $H(P,Q)$ is minimized when $Q=P$ by using the method of Lagrange multipliers.
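Before the proof, the definitions above are easy to check numerically. A minimal sketch, assuming small illustrative distributions $P$ and $Q$ of my own choosing (not from the post), verifying the identity $D_{KL}(P||Q) = H(P,Q) - H(P)$ and estimating $H(P,Q)$ empirically by sampling from $P$, in the spirit of Video 1:

```python
# Minimal sketch: H(P), H(P,Q), and D_KL(P||Q) for small discrete
# distributions, using the natural log throughout for simplicity.
import math
import random

def entropy(p):
    # H(P) = -sum_i P(x_i) * log P(x_i)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum_i P(x_i) * log Q(x_i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(x_i) * (log P(x_i) - log Q(x_i))
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]   # ground truth (illustrative)
Q = [0.4, 0.4, 0.2]     # hypothesis (illustrative)

# Relative-entropy identity: D_KL(P||Q) = H(P,Q) - H(P)
assert abs(kl_divergence(P, Q) - (cross_entropy(P, Q) - entropy(P))) < 1e-12

# Empirical cross entropy: sample x ~ P, average the surprise -log Q(x)
random.seed(0)
samples = random.choices(range(len(P)), weights=P, k=100_000)
empirical = sum(-math.log(Q[i]) for i in samples) / len(samples)
print(cross_entropy(P, Q), empirical)  # the two estimates should be close
```

The Monte Carlo average is exactly the "sample from $P$, pay $Q$'s surprise" procedure the videos illustrate.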
We have that for a general objective $f(Q): \mathbb{R}^n \rightarrow \mathbb{R}$, constraint function $g(Q): \mathbb{R}^n \rightarrow \mathbb{R}$, constraint value $c \in \mathbb{R}$, and Lagrange multiplier $\lambda$, optimizing the following function $h(Q) : \mathbb{R}^n \rightarrow \mathbb{R}$ optimizes $f(Q)$ subject to the constraint $g(Q) = c$: $$ h(Q) = f(Q) - \lambda (g(Q) -c)$$

The setup for our optimization problem, finding the minimum cross entropy $H(P,Q)$ subject to $\sum_{i=1}^n Q(x_i) = 1$, is:

- $f(Q)$ is $H(P,Q) = -\sum_i P(x_i)log Q(x_i)$
- $g(Q)$ is $\sum_{i=1}^n Q(x_i)$
- $c$ constraint is $1$

Giving us that $h(Q)$, the function we are finding the minimum of, is: $$ h(Q) = -\sum_i P(x_i)log Q(x_i) - \lambda (\sum_{i=1}^n Q(x_i) -1)$$

Great, let's find $\frac{\partial h}{\partial Q(x_i)}$ for one particular $Q(x_i)$, see the pattern, and apply it to each $\frac{\partial h}{\partial Q(x_i)}\ \forall i$: $$ \frac{\partial h}{\partial Q(x_i)} = \frac{-P(x_i)}{Q(x_i)} - \lambda$$

Set $\frac{\partial h}{\partial Q(x_i)}$ to $0$ to minimize this function.
As seen in the **Appendix**, since the Hessian of $H(P,Q)$ is positive semi-definite with respect to $Q$, and since our constraint $\sum_{i=1}^n Q(x_i) = 1$ is linear, any extremum we find is the minimum. $$ \frac{\partial h}{\partial Q(x_i)} = 0 = \frac{-P(x_i)}{Q(x_i)} - \lambda$$ $$ \lambda = \frac{-P(x_i)}{Q(x_i)} \rightarrow \lambda Q(x_i) = -P(x_i) \rightarrow Q(x_i) = \frac{-1}{\lambda} P(x_i)$$

Great, since we have that $\forall i\ Q(x_i) = \frac{-1}{\lambda} P(x_i)$ and we know that $\sum_i Q(x_i) = \sum_i P(x_i) = 1$, let's solve for $\lambda$: $$ \sum_i Q(x_i) = \frac{-1}{\lambda} \sum_i P(x_i) \rightarrow 1 = \frac{-1}{\lambda}$$ $$ \lambda = -1$$

Thus we can see that $$Q(x_i) = \frac{-1}{\lambda} P(x_i) \rightarrow Q(x_i) = \frac{-1}{-1} P(x_i) \rightarrow Q(x_i) = P(x_i)\ \forall x_i$$

Hence we have shown that the Cross Entropy is minimized when the distribution $Q$ equals $P$, as can be seen in **Video 1**. Note that the partial derivative of $H(P,Q)$ from our derivation is $$ \frac{\partial H(P,Q)}{\partial Q(x_i)} = \frac{-P(x_i)}{Q(x_i)}$$

Additionally, this motivates cross entropy as the penalty for using hypothesis distribution $Q$ when ground truth $P$ generates outcomes. For prefix codes, this means that the Cross Entropy is minimized when the optimal prefix encoding is used, as seen in **Video 2**.
<center> <video controls style="max-width: 100%;"> <source src="/media/videos/20260316_011042_03_15_2026_18_10_18_pumped_weasel_cross_entropy_huffman.mp4"> </video> </center>

<center> $\textbf{Video 2:}$ Slideshow Empirically Calculating the Cross Entropy $H(P,Q)$ by Sampling Distribution $Q$'s Huffman Tree according to Distribution $P$ </center>

As we can see in **Video 2**, for the equivalent representation of $Q$ as its Huffman Tree (see the [Self-Information & Entropy blog](https://www.noteblogdoku.com/blog/self-information-entropy)), the expected depth in our hypothesis $Q$'s Huffman Tree is minimized when we use the optimal prefix tree for probability distribution $P$. This is an equivalent interpretation to that shown in **Video 1**, where we show Cross Entropy with respect to two probability mass functions.

## KL-Divergence: $D_{KL}(P||Q) = H(P,Q) - H(P) = \sum_i P(x_i) (I_Q(x_i) - I_P(x_i)) $

KL-Divergence is the expected difference of information from sampling distributions Hypothesis $Q$ and Ground Truth $P$, using Ground Truth $P$. This gives us a single number that states how far our Hypothesis distribution $Q$ is from Ground Truth $P$. We can see from the phrasing $D_{KL}(P||Q) = H(P,Q) - H(P)$ that it's simply the difference between the Cross Entropy of $P$ and $Q$ and the Entropy of $P$. As we previously proved, the optimal Cross Entropy is achieved when $P=Q$, i.e. $H(P,P) = H(P)$.
Thus KL-Divergence gives us a better metric than Cross Entropy to measure

<center> *What's the difference in Entropy between our Hypothesis distribution $Q$ and Ground Truth $P$ when sampled with respect to $P$?* </center>

or, more explicitly,

<center> *What's the expectation (with respect to $P$) of the difference of Information between Hypothesis distribution $Q$ and Ground Truth $P$?* </center>

It is easy to see that when $Q = P$ the KL-Divergence is equal to 0; *in contrast*, when $Q=P$ the Cross Entropy is $H(P,P) = H(P) \geq 0$, which is generally nonzero. This makes KL-Divergence a more descriptive difference metric. $$ Q=P \rightarrow D_{KL}(P||Q) = D_{KL}(P||P) = H(P,P) - H(P)$$

Given that $H(P,P) = H(P)$ we have $$ D_{KL}(P||P) = H(P) - H(P) = 0$$

### KL-Divergence and Cross Entropy Relationship

It is interesting that when we minimize the Cross Entropy, the optimal Hypothesis distribution $Q$ is equal to $P$, and that the KL-Divergence is also optimal (equal to 0) when $Q=P$. This suggests a connection between the two measures, and indeed there is one: the derivative of KL-Divergence is **equal** to the derivative of Cross Entropy. We can see this intuitively: since $$ D_{KL}(P||Q) = H(P,Q) - H(P)$$ if we optimize $D_{KL}$ with respect to $Q$, then $H(P)$, a constant with respect to $Q$, has a derivative of 0. Thus only $H(P,Q)$ (the Cross Entropy of $P,Q$) matters for the derivative of $D_{KL}$, and the derivative of $D_{KL}$ is equal to the derivative of $H(P,Q)$ with respect to $Q$.
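This gradient equality is easy to sanity-check with finite differences. A minimal sketch, assuming small illustrative distributions $P$ and $Q$ of my own choosing (not from the post):

```python
# Finite-difference check that d/dQ_i of D_KL(P||Q) and of H(P,Q)
# coincide, both equaling -P_i/Q_i. Illustrative values only.
import math

P = [0.6, 0.3, 0.1]
Q = [0.5, 0.3, 0.2]
eps = 1e-6

def cross_entropy(q):
    # H(P, Q) = -sum_i P_i log Q_i
    return -sum(pi * math.log(qi) for pi, qi in zip(P, q))

def kl(q):
    # D_KL(P || Q) = sum_i P_i (log P_i - log Q_i)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(P, q))

for i in range(len(Q)):
    bumped = list(Q)
    bumped[i] += eps
    dH = (cross_entropy(bumped) - cross_entropy(Q)) / eps   # forward difference
    dKL = (kl(bumped) - kl(Q)) / eps
    analytic = -P[i] / Q[i]
    print(i, round(dH, 4), round(dKL, 4), round(analytic, 4))
    assert abs(dH - dKL) < 1e-6       # the two gradients agree
    assert abs(dH - analytic) < 1e-4  # and both match -P_i/Q_i
```

The $H(P)$ term only shifts the loss by a constant, so both finite differences land on the same number.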
We will now show this formally, using the same setup as in the *Cross Entropy* section, hence we will be much more brief: take the derivative of $D_{KL}$ with respect to a $Q(x_i)$, find the pattern, apply it to the entire gradient, and use the Lagrange multiplier technique to deal with the constraint $\sum_i Q(x_i) = 1$:

- $f(Q)$ is $D_{KL}(P||Q) = \sum_i P(x_i)(log P(x_i) - log Q(x_i))$
- $g(Q)$ is $\sum_{i=1}^n Q(x_i)$
- $c$ constraint is $1$

Giving us that $h(Q)$, the function we are finding the minimum of, is: $$ h(Q) = \sum_i P(x_i)(log P(x_i) - log Q(x_i)) - \lambda (\sum_{i=1}^n Q(x_i) -1)$$

Great, let's find $\frac{\partial h}{\partial Q(x_i)}$ for one particular $Q(x_i)$, see the pattern, and apply it to each $\frac{\partial h}{\partial Q(x_i)}\ \forall i$: $$ h(Q) = \sum_i P(x_i)log P(x_i) - \sum_i P(x_i)log Q(x_i) - \lambda (\sum_{i=1}^n Q(x_i) -1)$$ $$ \frac{\partial h}{\partial Q(x_i)} = 0 - \frac{P(x_i)}{Q(x_i)} - \lambda$$

Set $\frac{\partial h}{\partial Q(x_i)}$ to $0$ to minimize this function. As seen in the **Appendix**, since the Hessian of $H(P,Q)$ (and hence of $D_{KL}(P||Q)$) is positive semi-definite with respect to $Q$, and since our constraint $\sum_{i=1}^n Q(x_i) = 1$ is linear, any extremum we find is the minimum. $$ \frac{\partial h}{\partial Q(x_i)} = 0 = \frac{-P(x_i)}{Q(x_i)} - \lambda$$ $$ \lambda = \frac{-P(x_i)}{Q(x_i)} \rightarrow \lambda Q(x_i) = -P(x_i) \rightarrow Q(x_i) = \frac{-1}{\lambda} P(x_i)$$

Since $\forall i\ Q(x_i) = \frac{-1}{\lambda} P(x_i)$ and we know that $\sum_i Q(x_i) = \sum_i P(x_i) = 1$, let's solve for $\lambda$: $$ \sum_i Q(x_i) = \frac{-1}{\lambda} \sum_i P(x_i) \rightarrow 1 = \frac{-1}{\lambda}$$ $$ \lambda = -1$$

Thus we can see that $$Q(x_i) = \frac{-1}{\lambda} P(x_i) \rightarrow Q(x_i) = \frac{-1}{-1} P(x_i) \rightarrow Q(x_i) = P(x_i)\ \forall x_i$$

Thus $D_{KL}(P||Q)$ is minimized when $P=Q$.
Notice that from above, the gradient in general is $$ \frac{\partial D_{KL}(P||Q)}{\partial Q(x_i)} = \frac{-P(x_i)}{Q(x_i)} $$ giving us $$\nabla_Q D_{KL}(P||Q) = [\frac{-P(x_1)}{Q(x_1)}, \frac{-P(x_2)}{Q(x_2)}, ... \frac{-P(x_n)}{Q(x_n)}] \in \mathbb{R}^{1\times n}$$ and $$\nabla_Q H(P,Q) = [\frac{-P(x_1)}{Q(x_1)}, \frac{-P(x_2)}{Q(x_2)}, ... \frac{-P(x_n)}{Q(x_n)}] \in \mathbb{R}^{1\times n}$$

Thus: $$\nabla_Q D_{KL}(P||Q) = \nabla_Q H(P,Q)$$ stating that minimizing Cross Entropy $H(P,Q)$ is equivalent to minimizing the KL-Divergence $D_{KL}(P||Q)$, which shows up in machine learning all the time. See the **Appendix** for a *bonus proof*, where we find $\frac{\partial D_{KL}(P||Q)}{\partial Q}$ in continuous space!

## $D_{KL}(P||Q)$ versus $D_{KL}(Q||P)$

What's the difference between $D_{KL}(P||Q)$ and $D_{KL}(Q||P)$, given that $P$ is our Ground Truth distribution and $Q$ is our Hypothesis distribution? Why not just optimize whichever is easier and get the same result, since $D_{KL}(P||Q) = D_{KL}(Q||P) = 0$ when $P=Q$?

Great question, glad you asked. In practice it is rare that we can achieve $Q=P$, especially since $P$ can typically only be sampled from and known empirically. If $P$ cannot be stated analytically, then claiming $Q=P$ is vague, since you can only guarantee $Q$ matches what you *know* and have *sampled* from $P$. Ok, but suppose we are in a scenario where we do know the analytical form of $P$; the problem now is that our Hypothesis $Q$ cannot exactly model Ground Truth $P$.
Great question. As we can see from the equation for $D_{KL}(P||Q)$ $$D_{KL}(P||Q) = \sum_i \boldsymbol{P(x_i)} (I_Q(x_i) - I_P(x_i))$$ we are sampling from the Ground Truth distribution $P$, so we will get $I_Q(x_i)$ sampled according to distribution $P$. We will therefore need to optimize our representation of $Q$ to *on average* match the information from $P$, $I_P(x_i)\ \forall x_i$, minimizing the difference between the information obtained from $Q$ and $P$: $I_Q(x_i) - I_P(x_i)$.

Let's suppose we optimize the opposite direction, $D_{KL}(Q||P)$ $$D_{KL}(Q||P) = \sum_i \boldsymbol{Q(x_i)} (I_P(x_i) - I_Q(x_i))$$ We are now sampling from our Hypothesis distribution $Q$ (**bolded** for emphasis), so we will get $I_Q(x_i)$ sampled according to distribution $Q$. Thus we can now adjust our Hypothesis distribution $Q$ to

- Choose *which* $x_i$ to place its probability mass on, i.e. where it samples
- Minimize the difference of information of $Q$ and $P$ at *$Q$'s selected samples*: $I_P(x_i) - I_Q(x_i) \rightarrow 0$

Interesting, so $D_{KL}(Q||P)$ could theoretically just choose a fantastic region of $x_i$ to sample from and optimize its information $I_Q(x_i)$ where **it** chooses to sample, **not** where the Ground Truth $P$ **will sample** from. Let's illustrate this using a bimodal distribution for Ground Truth $P$ and a unimodal Gaussian distribution for our Hypothesis $Q$. We are guessing that for $D_{KL}(P||Q)$ we optimize over all of $P$, but for $D_{KL}(Q||P)$ we optimize $Q$ over only one of the modes of $P$, as verified by **Video 3**.
<center> <video controls style="max-width: 100%;"> <source src="/media/videos/20260316_024647_03_15_2026_19_44_03_bold_mule_kl_divergence.mp4"> </video> </center>

<center> $\textbf{Video 3:}$ Optimizing $\textbf{Under-Parameterized}$ Hypothesis Distribution $Q$ with Loss Function $D_{KL}(P||Q)$ versus $D_{KL}(Q||P)$<br> Hypothesis $Q$ is a $\textbf{Uni-Modal}$ Gaussian Distribution parameterized by mean $\mu$ and standard deviation $\sigma$<br> Ground Truth $P$ is a $\textbf{Bimodal}$ Gaussian mixture with two equal height peaks at $x=0.25$ and $x=0.75$ </center>

A **nat** is a unit of information similar to a **bit** but with base $e$ instead of base $2$, typically used for continuous distributions: $$ I(x) = -log_e(P(x)), \qquad P(x) = e^{-1} \rightarrow I(x) = -log_e(e^{-1}) = 1\ \text{nat}$$

Note that we derive the gradient update equations for Gaussian distribution $Q$ under $D_{KL}(P||Q)$ and $D_{KL}(Q||P)$ in the **Appendix**. This example is important because we can think of the bimodal distribution as representing two categories. When we optimize our under-parameterized hypothesis $Q$ for $D_{KL}(P||Q)$, we optimize over the **entire** ground truth distribution $P$; when we optimize $Q$ for $D_{KL}(Q||P)$, we only optimize for **one** category represented by $P$. This has profound implications for our neural networks, which are **usually under-parameterized** for real world, complex distributions.

---

## Appendix

### Proof that the Hessian of $H(P,Q) = -\sum_i P(x_i)log Q(x_i)$ wrt $Q$ is Positive Semi-Definite

Let's show that the Hessian of $H(P,Q)$ is Positive Semi-Definite.
Take the first order derivative: $$ \frac{\partial H}{\partial Q(x_i)} = -\frac{P(x_i)}{Q(x_i)} $$

Now take the second derivatives, with respect to $Q(x_i)$ itself and with respect to $Q(x_j)$ where $j \neq i$: $$ \frac{\partial^2 H}{\partial Q(x_i) \partial Q(x_i)} = \frac{\partial}{\partial Q(x_i)} \left(-\frac{P(x_i)}{Q(x_i)}\right) = \frac{P(x_i)}{Q(x_i)^2}$$ $$ \frac{\partial^2 H}{\partial Q(x_i) \partial Q(x_j)} = \frac{\partial}{\partial Q(x_j)} \left(-\frac{P(x_i)}{Q(x_i)}\right) = 0$$

Giving us the Hessian, a diagonal matrix with entries $\frac{P(x_i)}{Q(x_i)^2}\ \forall i$: $$ \nabla^2_Q H(P,Q) = \begin{bmatrix} \frac{P(x_1)}{Q(x_1)^2} & 0 & \cdots & 0\\ 0 & \frac{P(x_2)}{Q(x_2)^2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \frac{P(x_n)}{Q(x_n)^2} \end{bmatrix} $$

Since this is a diagonal matrix, its entries are its eigenvalues, and every entry satisfies $\frac{P(x_i)}{Q(x_i)^2} \geq 0$ since $P(x_i), Q(x_i)$ are probabilities. Hence all eigenvalues of $\nabla^2_Q H(P,Q)$ are $\geq 0$, giving us that this is a Positive Semi-Definite matrix. Thus for the Cross Entropy function $H(P,Q)$, any extremum found will be a minimum.
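This claim can also be checked numerically. A sketch, with hypothetical $P$ and $Q$ of my own choosing, that estimates the Hessian of $H(P,Q)$ with finite differences and confirms it is diagonal with nonnegative entries:

```python
# Estimate the Hessian of H(P,Q) = -sum_i P_i log Q_i with respect to Q via
# finite differences, and check it is diagonal with entries P_i / Q_i^2 >= 0.
# (Illustrative P and Q, not from the post.)
import math

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.35, 0.25]
h = 1e-4

def H(q):
    # cross entropy H(P, Q)
    return -sum(pi * math.log(qi) for pi, qi in zip(P, q))

def bump(q, i, d):
    out = list(q)
    out[i] += d
    return out

n = len(Q)
hess = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        # four-point central difference for d^2 H / (dQ_i dQ_j)
        hess[i][j] = (H(bump(bump(Q, i, h), j, h)) - H(bump(bump(Q, i, h), j, -h))
                      - H(bump(bump(Q, i, -h), j, h)) + H(bump(bump(Q, i, -h), j, -h))) / (4 * h * h)

for i in range(n):
    for j in range(n):
        expected = P[i] / Q[i] ** 2 if i == j else 0.0
        assert abs(hess[i][j] - expected) < 1e-2
    assert hess[i][i] >= 0  # diagonal entries (the eigenvalues) are nonnegative
print([[round(v, 3) for v in row] for row in hess])
```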
### Derivative of Gaussian Distribution with respect to $\mu$ and $\sigma$

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $$

$$ \frac{\partial f(x)}{\partial \mu} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \cdot \frac{\partial}{\partial \mu}\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$ $$ \frac{\partial f(x)}{\partial \mu} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \cdot \frac{-2(x-\mu)}{2\sigma^2}\cdot (-1) $$ $$ \frac{\partial f(x)}{\partial \mu} = \frac{x-\mu}{\sqrt{2\pi\sigma^2}\,\sigma^2} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

$$\frac{\partial f(x)}{\partial \sigma} = f(x)\left(-\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \left( -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3} \right) $$

### Derivative $\frac{\partial D_{KL}(P||Q)}{\partial Q}$ in Continuous Space

The **Bonus Proof** that was advertised! The continuous version of $D_{KL}(P||Q)$ is defined as $$ D_{KL}(P||Q) = \int P(x)(logP(x) - logQ(x)) dx $$ and we want to take the derivative with respect to $Q$. We get the following: $$ \frac{\partial D_{KL}(P||Q)}{\partial Q} = \frac{\partial}{\partial Q}\int P(x)(logP(x) - logQ(x)) dx$$ $$ \frac{\partial D_{KL}(P||Q)}{\partial Q} = \frac{\partial}{\partial Q}\left[\int P(x)logP(x)dx - \int P(x) logQ(x) dx\right]$$

We can see that the first term satisfies $\frac{\partial}{\partial Q}\int P(x)logP(x)dx = 0$, since $Q(x)$ is not present, making it a constant.
Hence $$ \frac{\partial D_{KL}(P||Q)}{\partial Q} = \frac{\partial}{\partial Q} \left(- \int P(x) logQ(x) dx\right)$$

Now we need to consider the following calculus identity utilizing differentials: $$ df = f(x+dx) -f(x) = f'(x)dx \rightarrow f(x+dx) = f(x) + f'(x)dx$$

Fantastic, but I have a function $Q(x)$ instead of a variable $x$, now what? It turns out this also applies to functionals, denoted by $F$: $$ dF = F(Q+dQ) -F(Q) = F'(Q)dQ \rightarrow F(Q+dQ) = F(Q) + F'(Q)dQ$$

Taking $F = log$ we get: $$ log(Q + dQ) = log(Q) + \frac{dQ}{Q}$$

Thus we are going to do the following: $$ - \int P(x) logQ(x) dx \rightarrow - \int P(x) log(Q(x) + dQ) dx$$

Invoking the statement $F(Q+dQ) = F(Q) + F'(Q)dQ$: $$- \int P(x) log(Q(x) + dQ) dx = - \int P(x) logQ(x) dx - \int P(x) \frac{dQ(x)}{Q(x)} dx$$

Multiplying the entire equation by $-1$ we get $$\int P(x) log(Q(x) + dQ) dx = \int P(x) logQ(x) dx + \int P(x) \frac{dQ(x)}{Q(x)} dx$$

Ok, now recall from the [Matrix Calculus Backprop Post](https://www.noteblogdoku.com/blog/the-core-of-backprop-2-frac-partial-l-partial-m-using-matrix-calculus) that from matrix calculus we have $$dF = F'(Q)dQ = \frac{\partial F}{\partial Q}dQ \rightarrow \left\langle \frac{\partial F}{\partial Q}, dQ \right\rangle = dF(Q)$$

This is great when we are dealing with matrices and vectors, but what about our situation? How am I supposed to take an inner product? Well, luckily, in continuous space the inner product between $\frac{\partial F}{\partial Q}$ and $dQ$ is defined as an integral, i.e. $$ \langle \frac{\partial F}{\partial Q}, dQ \rangle \rightarrow \int \frac{\partial F}{\partial Q}\, dQ(x)\, dx$$

The reason is that, given two "vectors" with infinitely many elements, we take the integral so that we multiply corresponding elements together and then add them all up.
Fantastic. Since $$\int P(x) log(Q(x) + dQ) dx = \int P(x) logQ(x) dx + \int \frac{P(x)}{Q(x)} dQ(x) dx$$ mimics the form $$F(Q+dQ) = F(Q) + F'(Q)dQ$$ suppose that $$F(Q) = \int P(x) logQ(x) dx$$

Then invoking $F(Q+dQ) = F(Q) + F'(Q)dQ$ on the equation above, we get $$ F'(Q)dQ = \int \frac{P(x)}{Q(x)} dQ(x) dx$$

Utilizing that $$\int \frac{\partial F}{\partial Q}\, dQ(x)\, dx = dF$$ we get that $$\int \frac{P(x)}{Q(x)} dQ(x) dx = d \int P(x) logQ(x) dx$$ giving us that $$\frac{dF}{d Q(x)} = \frac{P(x)}{Q(x)}$$

Note that since $F(Q) = \int P(x) logQ(x) dx$, we **actually** wanted $-F(Q) = -\int P(x) logQ(x) dx$, which means that $$\frac{-dF}{d Q(x)} = \frac{-P(x)}{Q(x)}$$

**NOTE** that $-F(Q) = -\int P(x) logQ(x) dx$ is equal to $D_{KL}(P||Q)$ without the term $\boldsymbol{\int P(x)logP(x)dx}$ (bolded) in $\boldsymbol{\int P(x)logP(x)dx} - \int P(x) logQ(x) dx$, which is a constant when we take the derivative wrt $Q(x)$, giving us that $$ \frac{-dF}{d Q(x)} = \frac{\partial D_{KL}(P||Q)}{\partial Q(x)} = \frac{-P(x)}{Q(x)}$$

Great, we found our desired solution! Notice how this exactly mimics our answer from the discrete case we did previously: $$\frac{\partial D_{KL}(P||Q)}{\partial Q} = [\frac{-P(x_1)}{Q(x_1)}, \frac{-P(x_2)}{Q(x_2)}, ... \frac{-P(x_n)}{Q(x_n)}]$$

### Derive Optimization for Gaussian Distribution $Q$ with $D_{KL}(P||Q)$ Objective Function

Use the Chain Rule, where $Q = Gaussian(\mu,\sigma)$: $$ \frac{\partial D_{KL}(P||Q)}{\partial \mu} = \frac{\partial D_{KL}(P||Q)}{\partial{Q}}\frac{\partial Q}{\partial \mu} \qquad \frac{\partial D_{KL}(P||Q)}{\partial \sigma} = \frac{\partial D_{KL}(P||Q)}{\partial{Q}}\frac{\partial Q}{\partial \sigma}$$ with $$\frac{\partial D_{KL}(P||Q)}{\partial Q(\mu,\sigma)} = \frac{-P}{Q(\mu,\sigma)}$$ and $\frac{\partial Q}{\partial \mu}$, $\frac{\partial Q}{\partial \sigma}$ given by the Gaussian derivatives above.

### Derivative $\frac{\partial D_{KL}(Q||P)}{\partial Q}$ in Continuous Space

For the continuous version of $D_{KL}(Q||P)$, defined as $$ D_{KL}(Q||P) = \int Q(x)(logQ(x) - logP(x)) dx $$ we want to take the derivative with respect to $Q$.
We get the following: $$ \frac{\partial D_{KL}(Q||P)}{\partial Q} = \frac{\partial}{\partial Q}\int Q(x)(logQ(x) - logP(x)) dx$$ $$ \frac{\partial D_{KL}(Q||P)}{\partial Q} = \frac{\partial}{\partial Q}\left[\int Q(x)logQ(x) dx - \int Q(x)logP(x) dx\right] $$

We are going to utilize the method of the *Calculus of Variations* here, with the following perturbation: $$Q(x) \rightarrow Q(x) + \epsilon \eta(x)$$

From regular calculus we have $df = f(x + dx) - f(x) = f'(x)dx$; however, now we're perturbing the **function** $Q(x)$ with another small function, and we are basically treating $Q(x)$ as $x$ in our regular calculus example. Thus we now get $$D_{KL}(Q + \epsilon \eta||P) = \int (Q(x) + \epsilon \eta(x))log(Q(x)+\epsilon \eta(x)) dx - \int (Q(x) + \epsilon \eta(x)) logP(x) dx $$

Differentiating with respect to $\epsilon$ at $\epsilon = 0$: $$ \frac{d}{d\epsilon} D_{KL}(Q_\epsilon \| P)\Big|_{\epsilon=0} = \int \frac{d}{d\epsilon} \left[ (Q + \epsilon \eta)\big(\log(Q + \epsilon \eta) - \log P\big) \right]_{\epsilon=0} \, dx $$

Apply the product rule: let $$f(q) = q(log q - log P)$$ $$\frac{\partial f}{\partial q} = (log q - log P) + q \cdot \frac{1}{q} = log q - log P + 1$$ so $$ \frac{d}{d\epsilon} D_{KL}(Q_\epsilon \| P)\Big|_{\epsilon=0} = \int \eta(x) (logQ(x) - logP(x) +1) dx$$

By the definition of the functional derivative, $$ \frac{d}{d\epsilon} D_{KL}(Q_\epsilon \| P)\Big|_{\epsilon=0} = \int \eta(x) \frac{\delta D_{KL}(Q||P)}{\delta Q(x)} dx$$

Thus we have that $$ \frac{\partial D_{KL}(Q||P)}{\partial Q(x)} = log Q(x) - log P(x) + 1$$

Note that since we are simply finding the derivative, we don't need the Lagrange multiplier for the constraint $\int Q(x)dx = 1$.
Cross Entropy is commonly used as the classification loss function, and any maximum likelihood estimation with categorical or softmax output. KL-Divergence is common whenever we compare distributions directly such as in PPO, knowledge distillation, Variational Autoencoders (VAEs), and measuring dataset distribution shift. KL-Divergence is commonly used for regularizing updates to your distribution with respect to a reference distribution. ## What are KL-Divergence and Cross Entropy? KL-Divergence and Cross Entropy are defined in reference to Entropy and Self-Information, described in [Self-Information & Entropy blog](https://www.noteblogdoku.com/blog/self-information-entropy). A quick recap, suppose we have a discrete set of possible outcomes $$X = {x_1, x_2, ...x_n}$$ with a probability distribution $P$ such that $$\sum_{i=1}^n P(x_i) = 1$$ - **Self Information:** The surprise of observing event $x_i$ under distribution $P$ $$ I_P(x_i) = -log(P(x_i)) $$ - Equivalently, the approximate depth of $x_i$ in optimal Huffman Tree Encoding - Rare events give more information - Common events give less information - **Entropy:** The expected Self Information of events drawn from $P$ $$H(P) = -\sum_i P(x_i)log P(x_i) = \sum_i P(x_i) I_P(x_i) = \mathbb{E}_{x \sim P} [I_P(x)]$$ - Equivalently, the expected depth of elements in Huffman Tree built from $P$ - Measure of uncertainty of distribution Now let's define Cross Entropy and KL-Divergence, which compares two distributions we have: - **Ground Truth Distribution** $P$ that we desire to mimic - **Hypothesis Distribution** $Q$ that we have generated - **Cross Entropy:** The expected Information drawn from a Hypothesis Distribution $Q$ when sampled using $P$ $$ H(P,Q) = -\sum_i P(x_i)log Q(x_i) = \sum_i P(x_i) I_Q(x_i) = \mathbb{E}_{x \sim P} [I_Q(x)]$$ - Equivalently, expected height of Huffman Tree made for $Q$ when sampled using $P$ - Penalty for using hypothesis distribution $Q$ when ground truth $P$ generates outcomes - 
**KL-Divergence:** The expectation of difference of information from distributions $P$ & $Q$ when sampling using $P$ $$D_{KL}(P||Q) = \sum_i P(x_i)(log P(x_i) - log Q(x_i)) = \sum_i P(x_i) (I_Q(x_i) - I_P(x_i)) = \mathbb{E}_{x \sim P} [I_Q(x) - I_P(x)]$$ $$D_{KL}(P||Q) = \sum_i P(x_i) (I_Q(x_i) - I_P(x_i)) = \sum_i P(x_i) I_Q(x_i) - \sum_i P(x_i) I_P(x_i)) = H(P,Q) - H(P)$$ - Equivalently, the expected difference in height of Huffman Tree made for $Q$ versus $P$ when both sampled using $P$ - Known as **Relative Entropy** because $D_{KL}(P||Q) = H(P,Q) - H(P)$ - Measure of discrepancy between distributions, **not** distance or inner product since $D_{KL}(P||Q) \neq D_{KL}(Q||P)$ ## Cross Entropy: $H(P,Q) = \sum_i P(x_i) I_Q(x_i)$ Cross Entropy is the information we get from hypothesis distribution $Q$ when we sample using distribution $P$. Note that when $P=Q$ *Cross Entropy is just the Entropy*, i.e. $H(P,P) = H(P)$. <center> <video controls style="max-width: 80%;"> <source src="/media/videos/20260315_224049_03_15_2026_15_40_15_legal_akita_cross_entropy.mp4"> </video> </center> <center> $\textbf{Video 1:}$ Slideshow showing Empirically Calculating Cross Entropy of $H(P,Q)$ by Sampling Distribution $Q$ according to Distribution $P$ </center> As we can see from **Video 1**, we can measure the amount of information we get on average when we sample hypothesis $Q$ according to ground truth $P$. This measures how off our hypothesis distribution is from our ground truth distribution. We can prove that the Cross Entropy $H(P,Q)$ is minimized when $Q=P$ by using the Lagrangian Multipliers Calculus method. We have that for our general function $f(Q): \mathbb{R}^n \rightarrow \mathbb{R}$, constraint function $g(Q): \mathbb{R}^n \rightarrow \mathbb{R}$, constraint $c \in \mathbb{R}$ and our Lagrange Multiplier $\lambda$ that by optimizing the following function $h(Q) : \mathbb{R}^n \rightarrow \mathbb{R}$ we are optimizing $f(Q)$ with respect to $g(Q) = c$. 
$$ h(x) = f(x) - \lambda (g(x) -c)$$ The setup for our optimization problem to find the minimum cross entropy $H(P,Q)$ such that $\sum_i Q(x_i) = 1$ for all events $x_i \in [1,n]$ is - $f(Q)$ is $H(P,Q) = -\sum_i P(x_i)log Q(x_i)$ - $g(Q)$ is $\sum_{i=1}^n Q(x_i)$ - $c$ constraint is $1$ Giving us that $h(Q)$ the function we are finding the min of is: $$ h(Q) = -\sum_i P(x_i)log Q(x_i) - \lambda (\sum_{i=1}^n Q(x_i) -1)$$ Great, let's find $\frac{\partial h}{\partial Q(x_i)}$ for one particular $Q(x_i)$, see the pattern and apply it to each $\frac{\partial h}{\partial Q(x_i)}\ \forall i$ $$ \frac{\partial h}{\partial Q(x_i)} = \frac{-P(x_i)}{Q(x_i)} - \lambda$$ Set $\frac{\partial h}{\partial Q(x_i)}$ to $0$ to minimize this function. As seen in **Appendix**, since we know that the Hessian of $H(P,Q)$ is positive semi-definite with respect to $Q$, and that since our constraint $\sum_{i=1}^n Q(x_i)$ is linear, finding the extrema is finding the minimum $$ \frac{\partial h}{\partial Q(x_i)} = 0 = \frac{-P(x_i)}{Q(x_i)} - \lambda$$ $$ \lambda = \frac{-P(x_i)}{Q(x_i)} \rightarrow \lambda Q(x_i) = -P(x_i) \rightarrow Q(x_i) = \frac{-1}{\lambda} P(x_i)$$ Great, since we have that $\forall i\ Q(x_i) = \frac{-1}{\lambda} P(x_i)$ and we know that $\sum_i Q(x_i) = \sum_i P(x_i) = 1$, let's solve for $\lambda$ $$ \sum_i Q(x_i) = \frac{-1}{\lambda} \sum_i P(x_i) \rightarrow 1 = \frac{-1}{\lambda}$$ $$ \lambda = -1$$ Thus we can see that $$Q(x_i) = \frac{-1}{\lambda} P(x_i) \rightarrow Q(x_i) = \frac{-1}{-1} P(x_i) \rightarrow Q(x_i) = P(x_i)\ \forall x_i$$ Hence we have shown that when the distribution of $Q$ equals $P$ the Cross Entropy is minimized, as can be seen in **Video 1**. Note that we have that the partial derivative of $H(P,Q)$ from our derivation is $$ \frac{\partial H(P,Q)}{\partial Q(x_i)} = \frac{-P(x_i)}{Q(x_i)}$$ Additionally, this motivates cross entropy as the penalty for using hypothesis distribution $Q$ when ground truth $P$ generates outcomes. 
For prefix codes, this means that the Cross Entropy is minimized when the optimal prefix encoding is used, as seen in **Video 2**. <center> <video controls style="max-width: 100%;"> <source src="/media/videos/20260316_011042_03_15_2026_18_10_18_pumped_weasel_cross_entropy_huffman.mp4"> </video> </center> <center> $\textbf{Video 2}$ Slideshow Empirically Calculating Cross Entropy of $H(P,Q)$ by Sampling Distribution $Q$'s Huffman Tree according to Distribution $P$ </center> As we can see in **Video 2**, we have that for the equivalent representation of $Q$ as its Huffman Tree (see [Self-Information & Entropy blog](https://www.noteblogdoku.com/blog/self-information-entropy)) that the depth of our hypothesis $Q$ Huffman Tree is minimized when we use the optimal suffix tree for probability distribution $P$. This is an equivalent interpretation as that shown in **Video 1** where we are showing Cross Entropy with respect to two probability mass functions. ## KL-Divergence: $D_{KL}(P||Q) = H(P,Q) - H(P) = \sum_i P(x_i) (I_Q(x_i) - I_P(x_i)) $ KL-Divergence is the expectation of the difference of information from sampling distributions Hypothesis $Q$ and Ground Truth $P$ using Ground Truth $P$. This gives us a single number that states how far our Hypothesis distribution $Q$ is from Ground Truth $P$. We can see that from the phrasing $D_{KL}(P||Q) = H(P,Q) - H(P)$ That it's simply the difference between the Cross Entropy of $P$ and $Q$ versus the Entropy of $P$. As we previously proved, the optimal Cross Entropy is the when $P=Q$ i.e. $H(P,P) = H(P)$. 
Thus KL-Divergence gives us a better metric than Cross Entropy to measure

<center>
*What's the difference in Entropy between our Hypothesis distribution $Q$ and Ground Truth $P$ when sampled with respect to $P$?*
</center>

or, more explicitly,

<center>
*What's the expectation (with respect to $P$) of the difference of Information between Hypothesis distribution $Q$ and Ground Truth $P$?*
</center>

It is easy to see that when $Q = P$ the KL-Divergence is equal to $0$, *in contrast* to the Cross Entropy, which at $Q=P$ still equals $H(P) \geq 0$, making KL-Divergence a more descriptive difference metric.

$$ Q=P \rightarrow D_{KL}(P||Q) = D_{KL}(P||P) = H(P,P) - H(P)$$

Given that $H(P,P) = H(P)$ we have

$$ D_{KL}(P||P) = H(P) - H(P) = 0$$

### KL-Divergence and Cross Entropy Relationship

It is interesting that when we minimize the Cross Entropy, we find that the optimal Hypothesis distribution $Q$ is equal to $P$, and that the KL-Divergence is also optimal (equal to $0$) when $Q=P$. This suggests that there is a connection between these two measures, and indeed there is: the derivative of KL-Divergence with respect to $Q$ is **equal** to the derivative of Cross Entropy. We can see this intuitively because, since

$$ D_{KL}(P||Q) = H(P,Q) - H(P)$$

if we optimize $D_{KL}$ with respect to $Q$ then $H(P)$, a constant with respect to $Q$, has a derivative of $0$. Thus only $H(P,Q)$ (the Cross Entropy of $P,Q$) matters, and the derivative of $D_{KL}$ with respect to $Q$ is equal to the derivative of $H(P,Q)$ with respect to $Q$.
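We can check this claim numerically with central finite differences (a sketch using arbitrary example distributions): perturbing one coordinate $Q(x_i)$ changes $H(P,Q)$ and $D_{KL}(P||Q)$ by the same amount, and both partials match $-P(x_i)/Q(x_i)$:

```python
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.3, 0.2, 0.1])
eps = 1e-6

for i in range(len(q)):
    # Central finite difference in coordinate Q(x_i), ignoring the
    # sum-to-one constraint (we only want the raw partial derivatives).
    dq = np.zeros_like(q)
    dq[i] = eps
    grad_ce = (cross_entropy(p, q + dq) - cross_entropy(p, q - dq)) / (2 * eps)
    grad_kl = (kl_divergence(p, q + dq) - kl_divergence(p, q - dq)) / (2 * eps)
    assert np.isclose(grad_ce, grad_kl)                   # identical gradients...
    assert np.isclose(grad_ce, -p[i] / q[i], atol=1e-4)   # ...both equal -P/Q
```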
We will now show this formally, using the same setup as in the *Cross Entropy* section, so we will keep it brief: take the derivative of $D_{KL}$ with respect to one $Q(x_i)$, find the pattern and apply it to the entire gradient, and use the Lagrange Multiplier technique to deal with the constraint $\sum_i Q(x_i) = 1$:

- $f(Q)$ is $D_{KL}(P||Q) = \sum_i P(x_i)(log P(x_i) - log Q(x_i))$
- $g(Q)$ is $\sum_{i=1}^n Q(x_i)$
- $c$ constraint is $1$

Giving us that $h(Q)$, the function we are minimizing, is:

$$ h(Q) = \sum_i P(x_i)(log P(x_i) - log Q(x_i)) - \lambda (\sum_{i=1}^n Q(x_i) -1)$$

Let's find $\frac{\partial h}{\partial Q(x_i)}$ for one particular $Q(x_i)$, see the pattern, and apply it to each $\frac{\partial h}{\partial Q(x_i)}\ \forall i$

$$ h(Q) = \sum_i P(x_i)log P(x_i) - \sum_i P(x_i)log Q(x_i) - \lambda (\sum_{i=1}^n Q(x_i) -1)$$

$$ \frac{\partial h}{\partial Q(x_i)} = 0 - \frac{P(x_i)}{Q(x_i)} - \lambda$$

Set $\frac{\partial h}{\partial Q(x_i)}$ to $0$. As seen in the **Appendix**, since the Hessian of $H(P,Q)$ (and hence of $D_{KL}(P||Q)$, which differs only by the constant $H(P)$) is positive semi-definite with respect to $Q$, and our constraint $\sum_{i=1}^n Q(x_i) = 1$ is linear, any extremum we find is a minimum.

$$ \frac{\partial h}{\partial Q(x_i)} = 0 = \frac{-P(x_i)}{Q(x_i)} - \lambda$$

$$ \lambda = \frac{-P(x_i)}{Q(x_i)} \rightarrow \lambda Q(x_i) = -P(x_i) \rightarrow Q(x_i) = \frac{-1}{\lambda} P(x_i)$$

Since we have $\forall i\ Q(x_i) = \frac{-1}{\lambda} P(x_i)$ and we know that $\sum_i Q(x_i) = \sum_i P(x_i) = 1$, let's solve for $\lambda$

$$ \sum_i Q(x_i) = \frac{-1}{\lambda} \sum_i P(x_i) \rightarrow 1 = \frac{-1}{\lambda}$$

$$ \lambda = -1$$

Thus we can see that

$$Q(x_i) = \frac{-1}{\lambda} P(x_i) \rightarrow Q(x_i) = \frac{-1}{-1} P(x_i) \rightarrow Q(x_i) = P(x_i)\ \forall x_i$$

Thus $D_{KL}(P||Q)$ is minimized when $P=Q$.
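The same conclusion falls out of a direct numerical optimization (a sketch, with an arbitrary example $P$, using `scipy.optimize.minimize`; the equality constraint plays the role of the Lagrange multiplier term):

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.1, 0.2, 0.3, 0.4])  # arbitrary ground truth P

def kl(q):
    """D_KL(P||Q) for a candidate Q on 4 outcomes."""
    return np.sum(p * (np.log(p) - np.log(q)))

# Minimize D_KL(P||Q) subject to sum_i Q(x_i) = 1 and Q(x_i) > 0.
result = minimize(
    kl,
    x0=np.full(4, 0.25),  # start at the uniform distribution
    constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}],
    bounds=[(1e-9, 1.0)] * 4,
    method="SLSQP",
    options={"ftol": 1e-12},
)

assert np.allclose(result.x, p, atol=1e-4)  # the minimizer is Q = P
```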
Notice that from above we have that the gradient in general is

$$ \frac{\partial D_{KL}(P||Q)}{\partial Q(x_i)} = \frac{-P(x_i)}{Q(x_i)} $$

giving us

$$\nabla_Q D_{KL}(P||Q) = [\frac{-P(x_1)}{Q(x_1)}, \frac{-P(x_2)}{Q(x_2)}, ... \frac{-P(x_n)}{Q(x_n)}] \in \mathbb{R}^{1\times n}$$

and

$$\nabla_Q H(P,Q) = [\frac{-P(x_1)}{Q(x_1)}, \frac{-P(x_2)}{Q(x_2)}, ... \frac{-P(x_n)}{Q(x_n)}] \in \mathbb{R}^{1\times n}$$

Thus giving us:

$$\nabla_Q D_{KL}(P||Q) = \nabla_Q H(P,Q)$$

meaning that minimizing the Cross Entropy $H(P,Q)$ is equivalent to minimizing the KL-Divergence $D_{KL}(P||Q)$, which shows up in machine learning all the time. See the **Appendix** for a *bonus proof*, where we find $\frac{\partial D_{KL}(P||Q)}{\partial Q}$ in continuous space!

## $D_{KL}(P||Q)$ versus $D_{KL}(Q||P)$

What's the difference between the KL-Divergence of $Q$ from $P$, $D_{KL}(P||Q)$, and the KL-Divergence of $P$ from $Q$, $D_{KL}(Q||P)$, given that $P$ is our Ground Truth distribution and $Q$ is our Hypothesis distribution? Why not just optimize whichever is easier and get the same result, since $D_{KL}(P||P) = D_{KL}(Q||Q) = 0$, i.e. both are $0$ when $P=Q$?

Great question, glad you asked. In practice it is rare that we can achieve $P=Q$, especially since $P$ can typically only be sampled from and known empirically. So if $P$ cannot be stated analytically, then claiming $Q=P$ is vague, since you can only guarantee $Q$ is equal to what you *know* and have *sampled* from $P$.

Ok, suppose instead that I do know the analytical form of $P$, but now the problem is that I cannot exactly model Ground Truth $P$ with my Hypothesis $Q$.
Great question! As we can see from the equation for $D_{KL}(P||Q)$

$$D_{KL}(P||Q) = \sum_i \boldsymbol{P(x_i)} (I_Q(x_i) - I_P(x_i))$$

we are sampling from the Ground Truth distribution $P$, so we will get $I_Q(x_i)$ sampled according to distribution $P$. We will therefore need to optimize our representation of $Q$ to *on average* match the information from $P$, $I_P(x_i)\ \forall x_i$, minimizing the difference between the information obtained from $Q$ and $P$: $I_Q(x_i) - I_P(x_i)$.

Let's suppose we optimize the opposite, $D_{KL}(Q||P)$

$$D_{KL}(Q||P) = \sum_i \boldsymbol{Q(x_i)} (I_P(x_i) - I_Q(x_i))$$

We are now sampling from our Hypothesis distribution $Q$ (**bolded** for emphasis), so we will get $I_Q(x_i)$ sampled according to distribution $Q$. Thus we can now adjust our Hypothesis distribution $Q$ to

- Choose *which* $x_i$ to sample, i.e. where to place its probability mass
- Minimize the difference of information of $Q$ and $P$ at *$Q$'s selected samples*: $I_P(x_i) - I_Q(x_i) \rightarrow 0$

Interesting, so $D_{KL}(Q||P)$ could theoretically just choose a few favorable $x_i$ to sample and optimize its information $I_Q(x_i)$ where *it* chooses to sample, **NOT** where the Ground Truth $P$ **will sample** from.

Let's illustrate this using a bi-modal distribution for Ground Truth $P$ and a uni-modal Gaussian distribution for our Hypothesis $Q$. We expect that for $D_{KL}(P||Q)$ we optimize $Q$ over all of $P$, but for $D_{KL}(Q||P)$ we optimize $Q$ over only one of the modes of $P$, as verified by **Video 3**.
<center>
<video controls style="max-width: 100%;">
<source src="/media/videos/20260316_024647_03_15_2026_19_44_03_bold_mule_kl_divergence.mp4">
</video>
</center>
<center>
$\textbf{Video 3:}$ Optimizing $\textbf{Under-Parameterized}$ Hypothesis Distribution $Q$ with Loss Function $D_{KL}(P||Q)$ versus $D_{KL}(Q||P)$<br>
Hypothesis $Q$ is a $\textbf{Uni-Modal}$ Gaussian Distribution parameterized by mean $\mu$ and standard deviation $\sigma$<br>
Ground Truth $P$ is a $\textbf{Bimodal}$ Gaussian mixture with two equal-height peaks at $x=0.25$ and $x=0.75$
</center>

A **nat** is a unit of information similar to a **bit**, but with base $e$ instead of base $2$, typically used for continuous distributions. For example, an event with probability $\frac{1}{e}$ carries exactly one nat:

$$ I(x) = -log_e(P(x)) \quad\rightarrow\quad -log_e\left(\tfrac{1}{e}\right) = 1$$

Note that we derive the gradient update equations for Gaussian distribution $Q$ for $D_{KL}(P||Q)$ and $D_{KL}(Q||P)$ in the **Appendix**.

This example is important because we can think of the bimodal distribution as representing two categories. When we optimize our under-parameterized Hypothesis $Q$ for $D_{KL}(P||Q)$, we optimize over the **entire** Ground Truth distribution $P$, whereas when we optimize $Q$ for $D_{KL}(Q||P)$, we only optimize for **one** of the categories represented by $P$. This has profound implications for our neural networks, which are **usually under-parameterized** for real world, complex distributions.

---

## Appendix

### Proof that the Hessian of $H(P,Q) = -\sum_i P(x_i)log Q(x_i)$ wrt $Q$ is Positive Semi-Definite

Let's show that the Hessian of $H(P,Q)$ is Positive Semi-Definite.
Take the first order derivative:

$$ \frac{\partial H}{\partial Q(x_i)} = -\frac{P(x_i)}{Q(x_i)} $$

Now take the second derivatives, with respect to $Q(x_i)$ itself and with respect to $Q(x_j)$ where $j \neq i$

$$ \frac{\partial^2 H}{\partial Q(x_i) \partial Q(x_i)} = \frac{\partial}{\partial Q(x_i)} -\frac{P(x_i)}{Q(x_i)} = \frac{P(x_i)}{Q(x_i)^2}$$

$$ \frac{\partial^2 H}{\partial Q(x_i) \partial Q(x_j)} = \frac{\partial}{\partial Q(x_j)} -\frac{P(x_i)}{Q(x_i)} = 0$$

Giving us a diagonal Hessian with $\frac{P(x_i)}{Q(x_i)^2}\ \forall i$ on the diagonal

$$ \nabla^2_Q H(P,Q) = \begin{bmatrix} \frac{P(x_1)}{Q(x_1)^2} & 0 & \cdots & 0\\ 0 & \frac{P(x_2)}{Q(x_2)^2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \frac{P(x_n)}{Q(x_n)^2} \end{bmatrix} $$

Since this is a diagonal matrix, its diagonal entries are its eigenvalues, and every entry $\frac{P(x_i)}{Q(x_i)^2} \geq 0$ since $P(x_i), Q(x_i)$ are probabilities. Hence all eigenvalues of $\nabla^2_Q H(P,Q)$ are $\geq 0$, giving us that this is a Positive Semi-Definite matrix. Thus for the Cross Entropy function $H(P,Q)$ any extremum found will be a minimum.
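As a quick check (a sketch with arbitrary example $P$ and $Q$), the finite-difference Hessian of $H(P,Q)$ in $Q$ comes out diagonal with entries $P(x_i)/Q(x_i)^2$, and all its eigenvalues are non-negative:

```python
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.3, 0.2, 0.1])
n, eps = len(q), 1e-4

# Finite-difference Hessian of H(P, Q) with respect to Q.
hess = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = np.eye(n)[i] * eps
        ej = np.eye(n)[j] * eps
        hess[i, j] = (
            cross_entropy(p, q + ei + ej) - cross_entropy(p, q + ei - ej)
            - cross_entropy(p, q - ei + ej) + cross_entropy(p, q - ei - ej)
        ) / (4 * eps**2)

assert np.allclose(hess, np.diag(p / q**2), atol=1e-3)  # diagonal, P_i/Q_i^2
assert np.all(np.linalg.eigvalsh(hess) >= 0)            # positive semi-definite
```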
### Derivative of Gaussian Distribution with respect to $\mu$ and $\sigma$

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $$

$$ \frac{\partial f(x)}{\partial \mu} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \cdot \frac{\partial}{\partial \mu}\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

$$ \frac{\partial f(x)}{\partial \mu} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \cdot \frac{-2(x-\mu)}{2\sigma^2}\cdot (-1) $$

$$ \frac{\partial f(x)}{\partial \mu} = \frac{x-\mu}{\sqrt{2\pi\sigma^2}\,\sigma^2} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

For $\sigma$, differentiating both the $\frac{1}{\sqrt{2\pi\sigma^2}}$ prefactor and the exponent, then collecting terms, gives

$$\frac{\partial f(x)}{\partial \sigma} = f(x)\left(-\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}\right)$$

$$ = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \left( -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3} \right) $$

### Derivative $\frac{\partial D_{KL}(P||Q)}{\partial Q}$ in Continuous Space

**Bonus Proof** that was advertised! For the continuous version of $D_{KL}(P||Q)$, defined as

$$ D_{KL}(P||Q) = \int P(x)(logP(x) - logQ(x)) dx $$

we want to take the derivative with respect to $Q$. We get the following:

$$ \frac{\partial D_{KL}(P||Q)}{\partial Q} = \frac{\partial}{\partial Q}\int P(x)(logP(x) - logQ(x)) dx$$

$$ \frac{\partial D_{KL}(P||Q)}{\partial Q} = \frac{\partial}{\partial Q}\left(\int P(x)logP(x)dx - \int P(x) logQ(x) dx\right)$$

We can see the first term $\frac{\partial}{\partial Q}\int P(x)logP(x)dx = 0$ since $Q(x)$ is not present, making it a constant.
Hence

$$ \frac{\partial D_{KL}(P||Q)}{\partial Q} = \frac{\partial}{\partial Q} \left(- \int P(x) logQ(x) dx\right)$$

Now we need to consider the following calculus identity utilizing differentials

$$ df = f(x+dx) - f(x) = f'(x)dx \rightarrow f(x+dx) = f(x) + f'(x)dx$$

Fantastic, but I have a function $Q(x)$ instead of a variable $x$, now what? It turns out this also applies to functionals, denoted by $F$.

$$ dF = F(Q+dQ) - F(Q) = F'(Q)dQ \rightarrow F(Q+dQ) = F(Q) + F'(Q)dQ$$

Great, taking $F = log(\cdot)$ we get:

$$ log(Q + dQ) = log(Q) + \frac{dQ}{Q}$$

Thus we are going to do the following:

$$ - \int P(x) logQ(x) dx \rightarrow - \int P(x) log(Q(x) + dQ) dx$$

Invoking the statement $F(Q+dQ) = F(Q) + F'(Q)dQ$:

$$- \int P(x) log(Q(x) + dQ) dx = - \int P(x) logQ(x) dx - \int P(x) \frac{dQ(x)}{Q(x)} dx$$

Multiplying the entire equation by $-1$ we get

$$\int P(x) log(Q(x) + dQ) dx = \int P(x) logQ(x) dx + \int P(x) \frac{dQ(x)}{Q(x)} dx$$

Ok, now remember from the [Matrix Calculus Backprop Post](https://www.noteblogdoku.com/blog/the-core-of-backprop-2-frac-partial-l-partial-m-using-matrix-calculus) that we have the following from matrix calculus

$$dF = F'(Q)dQ = \frac{\partial F}{\partial Q}dQ \rightarrow \left\langle \frac{\partial F}{\partial Q}, dQ \right\rangle = dF(Q)$$

This is great when we are dealing with matrices and vectors, but what about our situation? How am I supposed to take an inner product? Well luckily, in continuous space, the inner product between $\frac{\partial F}{\partial Q}$ and $dQ$ is defined as an integral, i.e.

$$ \left\langle \frac{\partial F}{\partial Q}, dQ \right\rangle \rightarrow \int \frac{\partial F}{\partial Q(x)} dQ(x) dx$$

The reason is that we now have two "vectors" with infinitely many elements: instead of multiplying the elements pairwise and summing, we multiply the functions pointwise and integrate.
Fantastic, since we know that

$$\int P(x) log(Q(x) + dQ) dx = \int P(x) logQ(x) dx + \int \frac{P(x)}{Q(x)} dQ(x) dx$$

mimics the form

$$F(Q+dQ) = F(Q) + F'(Q)dQ$$

suppose that

$$F(Q) = \int P(x) logQ(x) dx$$

Then, matching terms between the two expansions, we get

$$ F'(Q)dQ = \int \frac{P(x)}{Q(x)} dQ(x) dx$$

Utilizing that

$$\int \frac{\partial F}{\partial Q(x)} dQ(x) dx = dF(Q)$$

we get that

$$\int \frac{P(x)}{Q(x)} dQ(x) dx = d\ \int P(x) logQ(x) dx$$

giving us that

$$\frac{dF}{d Q(x)} = \frac{P(x)}{Q(x)}$$

Since $F(Q) = \int P(x) logQ(x) dx$ and we **actually** wanted $-F(Q) = -\int P(x) logQ(x) dx$, this means that

$$\frac{-dF}{d Q(x)} = \frac{-P(x)}{Q(x)}$$

**NOTE** that $-F(Q) = -\int P(x) logQ(x) dx$ is equal to $D_{KL}(P||Q) = \boldsymbol{\int P(x)logP(x)dx} - \int P(x) logQ(x) dx$ up to the constant term (bolded), which vanishes when we take the derivative wrt $Q(x)$, giving us

$$ \frac{-dF}{d Q(x)} = \frac{\partial D_{KL}(P||Q)}{\partial Q(x)} = \frac{-P(x)}{Q(x)}$$

Great, we found our desired solution! Notice how this exactly mimics our answer from the discrete case we did previously:

$$\frac{\partial D_{KL}(P||Q)}{\partial Q} = [\frac{-P(x_1)}{Q(x_1)}, \frac{-P(x_2)}{Q(x_2)}, ... \frac{-P(x_n)}{Q(x_n)}]$$

### Derive Optimization for Gaussian Distribution $Q$ with $D_{KL}(P||Q)$ objective function

Use the Chain Rule, where $Q$ is a Gaussian parameterized by $\mu$ and $\sigma$, and $\frac{\partial Q}{\partial \mu}$, $\frac{\partial Q}{\partial \sigma}$ are the Gaussian derivatives derived above:

$$ \frac{\partial D_{KL}(P||Q)}{\partial \mu} = \frac{\partial D_{KL}(P||Q)}{\partial{Q}}\frac{\partial Q}{\partial \mu}$$

$$\frac{\partial D_{KL}(P||Q)}{\partial Q(\mu,\sigma)} = \frac{-P}{Q(\mu,\sigma)}$$

### Derivative $\frac{\partial D_{KL}(Q||P)}{\partial Q}$ in Continuous Space

Suppose that for the continuous version of $D_{KL}(Q||P)$, defined as

$$ D_{KL}(Q||P) = \int Q(x)(logQ(x) - logP(x)) dx $$

we want to take the derivative with respect to $Q$.
We get the following:

$$ \frac{\partial D_{KL}(Q||P)}{\partial Q} = \frac{\partial}{\partial Q}\int Q(x)(logQ(x) - logP(x)) dx$$

$$ \frac{\partial D_{KL}(Q||P)}{\partial Q} = \frac{\partial}{\partial Q}\left(\int Q(x)logQ(x) dx - \int Q(x)logP(x) dx\right)$$

We are going to utilize the *Calculus of Variations* here, with the following perturbation:

$$Q(x) \rightarrow Q(x) + \epsilon \eta(x)$$

From regular calculus we have $df = f(x + dx) - f(x) = f'(x)dx$; however, we are now perturbing the **function** $Q(x)$ with another small function, essentially treating $Q(x)$ as $x$ in our regular calculus example. Thus, writing $Q_\epsilon = Q + \epsilon \eta$, we now get

$$D_{KL}(Q + \epsilon \eta||P) = \int (Q(x) + \epsilon \eta(x))log(Q(x)+\epsilon \eta(x)) dx - \int (Q(x) + \epsilon \eta(x)) logP(x) dx $$

Differentiating with respect to $\epsilon$ at $\epsilon = 0$:

$$ \frac{d}{d\epsilon} D_{KL}(Q_\epsilon \| P)\Big|_{\epsilon=0} = \int \frac{d}{d\epsilon} \left[ (Q + \epsilon \eta)\big(\log(Q + \epsilon \eta) - \log P\big) \right]_{\epsilon=0} \, dx $$

Applying the product rule, let

$$f(q) = q(log q - log P)$$

$$\frac{\partial f}{\partial q} = (log q - log P) + q \cdot \frac{1}{q} = log q - log P + 1$$

so

$$ \frac{d}{d\epsilon} D_{KL}(Q_\epsilon \| P)\Big|_{\epsilon=0} = \int \eta(x) (logQ(x) - logP(x) + 1) dx$$

By the definition of the functional derivative,

$$ \frac{d}{d\epsilon} D_{KL}(Q_\epsilon \| P)\Big|_{\epsilon=0} = \int \eta(x) \frac{\delta D_{KL}(Q||P)}{\delta Q(x)} dx$$

Thus we have that

$$ \frac{\partial D_{KL}(Q||P)}{\partial Q(x)} = log Q(x) - log P(x) + 1$$

Note that since we are simply finding the derivative, we don't need the Lagrange Multiplier for the constraint $\int Q(x)dx = 1$.
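We can sanity-check this result on a grid (a sketch: discretize $[0,1]$, treat integrals as Riemann sums, and compare a finite-difference functional derivative against $log Q(x) - log P(x) + 1$; the densities below are arbitrary examples):

```python
import numpy as np

# Discretize [0, 1]; densities live on grid points, integrals become Riemann sums.
n = 100
x = (np.arange(n) + 0.5) / n
dx = 1.0 / n

# Arbitrary smooth positive densities P and Q, normalized on the grid.
p = 1.0 + 0.5 * np.sin(2 * np.pi * x); p /= p.sum() * dx
q = 1.0 + 0.5 * np.cos(2 * np.pi * x); q /= q.sum() * dx

def reverse_kl(q):
    """Discretized D_KL(Q||P) = integral of Q (log Q - log P) dx."""
    return np.sum(q * (np.log(q) - np.log(p))) * dx

eps = 1e-6
for i in range(0, n, 10):
    dq = np.zeros(n)
    dq[i] = eps
    # Functional derivative = (partial derivative wrt the grid value) / dx,
    # since perturbing one grid value perturbs the integral with weight dx.
    fd = (reverse_kl(q + dq) - reverse_kl(q - dq)) / (2 * eps) / dx
    assert np.isclose(fd, np.log(q[i]) - np.log(p[i]) + 1.0, atol=1e-6)
```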