The Euler–Lagrange Equation, Derived by dokuDoku Math

Please read [Fun with Functionals](https://www.noteblogdoku.com/blog/fun-with-functionals) for a general introduction to functionals, and [Derivatives of Functionals](https://www.noteblogdoku.com/blog/derivatives-of-functions-vs-derivatives-of-functionals) for how to perturb a function (in function space) that is fed into a functional. Both concepts are used here.

## Euler–Lagrange Derivation

The Euler–Lagrange equation is used to find the extremal functions at which a functional is stationary. Concretely, suppose we have an input function $y = f(x)$ to the functional

$$ I[y] = \int_{x_1}^{x_2} F(x,y,y')\, dx$$

We want to find the function $y$ for which $I[y]$ is stationary. Note that $y(x)$ is called an *extremal*: just as in single-variable calculus an *extremum* is a max or min *point*, an *extremal* is a max or min *function*. Problems of this form are common enough that the expression $F(x,y,y')$ is called the *Lagrangian*.

### Derivation

$$ I[y] = \int_{x_1}^{x_2} F(x,y,y')\, dx$$

Suppose that $y(x)$ makes $I[y]$ stationary ($y$ is an extremal function) and satisfies the boundary conditions $y(x_1) = y_1$, $y(x_2) = y_2$. We introduce a function $\eta(x)$ with $\eta(x_1) = \eta(x_2) = 0$: an arbitrary direction in function space along which to perturb $y(x)$, with magnitude $\epsilon \in \mathbb{R}$. We also assume all functions have continuous second derivatives. The Euler–Lagrange equation is derived specifically for the fixed-endpoint problem of the calculus of variations, and it is exactly this context that makes the condition $\eta(x_1) = \eta(x_2) = 0$ required. We define $y_{\epsilon}$ as the perturbation of $y$ by $\epsilon \eta$

$$y_{\epsilon} = y + \epsilon \eta$$

Since $\eta(x_1) = \eta(x_2) = 0$, $y_{\epsilon}$ satisfies the same boundary conditions as $y$: $y_{\epsilon}(x_1) = y(x_1)$ and $y_{\epsilon}(x_2) = y(x_2)$.
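The perturbation setup can be sketched numerically. A minimal example (the concrete choices $y(x) = x^2$ and $\eta(x) = \sin(\pi x)$ are illustrative assumptions, not from the post), showing that adding $\epsilon \eta$ leaves the boundary values untouched:

```python
import numpy as np

# Perturb y(x) by eps * eta(x), where eta vanishes at both endpoints.
# y(x) = x^2 and eta(x) = sin(pi * x) are assumed example functions.
x1, x2 = 0.0, 1.0
x = np.linspace(x1, x2, 1001)

y = x**2                      # candidate function y(x)
eta = np.sin(np.pi * x)       # eta(x1) = eta(x2) = 0
eps = 0.1
y_eps = y + eps * eta         # perturbed function y_eps = y + eps * eta

# The perturbation preserves the boundary conditions.
print(np.isclose(y_eps[0], y[0]), np.isclose(y_eps[-1], y[-1]))  # True True
```

Any admissible variation built this way stays inside the fixed-endpoint problem, which is what lets the boundary term vanish later in the derivation.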
Note that since $y(x)$ is defined to be an extremal, the functional $I[y + \epsilon \eta]$ is stationary at $\epsilon = 0$. Thus,

$$\frac{d}{d\epsilon} I[y + \epsilon \eta]|_{\epsilon=0} = 0$$

Equivalently, if $I$ is differentiable and $y$ is a stationary point, then for any admissible variation $\eta$,

$$I[y + \epsilon \eta] = I[y] + o(\epsilon) \quad \text{as } \epsilon \to 0$$

Starting with the stationarity condition, we have

$$\frac{d}{d\epsilon} I[y + \epsilon \eta]|_{\epsilon=0} = \frac{d}{d\epsilon}|_{\epsilon=0} \int_{x_1}^{x_2} F(x,y_{\epsilon},y'_{\epsilon})dx = 0$$

Since $F$ has continuous derivatives, we may differentiate under the integral sign:

$$ 0 = \int_{x_1}^{x_2} \frac{\partial}{\partial \epsilon}|_{\epsilon=0} F(x,y_{\epsilon},y'_{\epsilon})dx $$

Invoking the chain rule, we get

$$ 0 = \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y_{\epsilon}} \frac{\partial y_{\epsilon}}{\partial \epsilon} + \frac{\partial F}{\partial y'_{\epsilon}}\frac{\partial y'_{\epsilon}}{\partial \epsilon} \right]_{\epsilon=0} dx $$

Recall $y_{\epsilon}(x) = y(x) + \epsilon \eta(x)$, giving us

$$\frac{\partial y_{\epsilon}}{\partial \epsilon} = \eta(x)$$

and $y'_{\epsilon}(x) = y'(x) + \epsilon \eta'(x)$, giving us

$$\frac{\partial y'_{\epsilon}}{\partial \epsilon} = \eta'(x)$$

Thus our previous statement simplifies to

$$ \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y_{\epsilon}} \eta(x) + \frac{\partial F}{\partial y'_{\epsilon}} \eta'(x) \right]_{\epsilon =0}dx = \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y_{\epsilon}} \eta(x) \right]_{\epsilon = 0} dx + \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y'_{\epsilon}} \eta'(x) \right]_{\epsilon =0}dx $$

Now we are going to simplify $\int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y'_{\epsilon}} \eta'(x) \right]_{\epsilon =0}dx$ with integration by parts to get rid of the $\eta'(x)$.
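The stationarity condition $\frac{d}{d\epsilon} I[y + \epsilon \eta]|_{\epsilon=0} = 0$ we started from can be sanity-checked with finite differences. A sketch under assumed choices (the Lagrangian $F = (y')^2$, whose extremals satisfy $y'' = 0$, and the variation $\eta = \sin(\pi x)$ are mine, not the post's):

```python
import numpy as np

# For F = (y')^2 the extremals are straight lines, so y(x) = x should
# make I[y + eps * eta] stationary at eps = 0.
x = np.linspace(0.0, 1.0, 2001)
y = x                          # a straight line: extremal of F = (y')^2
eta = np.sin(np.pi * x)        # admissible variation, eta(0) = eta(1) = 0

def trapezoid(f, x):
    """Trapezoidal rule for the integral of samples f over grid x."""
    return np.sum((f[1:] + f[:-1]) / 2 * np.diff(x))

def I(eps):
    """I[y + eps * eta] with F(x, y, y') = (y')^2."""
    dy = np.gradient(y + eps * eta, x)
    return trapezoid(dy**2, x)

# Central difference approximates dI/d(eps) at eps = 0.
h = 1e-5
dI = (I(h) - I(-h)) / (2 * h)
print(abs(dI) < 1e-6)          # True: I is stationary at eps = 0
```

Swapping in a non-extremal $y$ (say $y = x^2$ with the same endpoints) makes the same finite difference clearly nonzero, which is exactly what the derivation predicts.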
Using the general formula

$$ \int u\, dv = uv - \int v\, du $$

we set:

- $u = \frac{\partial F}{\partial y'_{\epsilon}}$
- $du = \frac{d}{dx}\frac{\partial F}{\partial y'_{\epsilon}} dx$
- $dv = \eta'(x)dx$
- $v = \eta(x)$

Thus we get

$$ \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y'_{\epsilon}} \eta'(x) \right]_{\epsilon =0}dx = \left[ \frac{\partial F}{\partial y'_{\epsilon}}\eta(x)\Big|_{x_1}^{x_2} - \int_{x_1}^{x_2} \frac{d}{dx} \frac{\partial F}{\partial y'_{\epsilon}}\eta(x)dx \right]_{\epsilon=0}$$

Since $\eta(x_1) = \eta(x_2) = 0$, the boundary term vanishes:

$$ \frac{\partial F}{\partial y'_{\epsilon}}\eta(x)\Big|_{x_1}^{x_2} = 0$$

Thus

$$ \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y'_{\epsilon}} \eta'(x) \right]_{\epsilon =0}dx = - \int_{x_1}^{x_2} \left[ \frac{d}{dx} \frac{\partial F}{\partial y'_{\epsilon}}\eta(x) \right]_{\epsilon=0} dx$$

Circling back to the previous equation, we have

$$ \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y_{\epsilon}} \eta(x) \right]_{\epsilon = 0} dx + \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y'_{\epsilon}} \eta'(x) \right]_{\epsilon =0}dx $$

$$ = \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y_{\epsilon}} \eta(x) \right]_{\epsilon = 0} dx - \int_{x_1}^{x_2} \left[ \frac{d}{dx} \frac{\partial F}{\partial y'_{\epsilon}}\eta(x) \right]_{\epsilon=0} dx $$

$$ = \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y_{\epsilon}} - \frac{d}{dx}\frac{\partial F}{\partial y'_{\epsilon}}\right]_{\epsilon=0} \eta(x)dx = 0$$

Evaluating at $\epsilon = 0$ gives $y_{\epsilon} = y$ and $y'_{\epsilon} = y'$, so this simplifies to

$$ \int_{x_1}^{x_2} \left[ \frac{\partial F}{\partial y} - \frac{d}{dx}\frac{\partial F}{\partial y'}\right] \eta(x)dx = 0$$

Now we use the Fundamental Lemma of the Calculus of Variations: if $\int_{x_1}^{x_2} g(x)\eta(x)dx = 0$ for every admissible $\eta$, then $g(x) = 0$.
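The integration-by-parts step can also be verified numerically: with $\eta$ vanishing at both endpoints, $\int u\,\eta'\,dx$ should equal $-\int u'\,\eta\,dx$. A sketch with assumed stand-in functions (not from the post):

```python
import numpy as np

# With eta(x1) = eta(x2) = 0 the boundary term uv| vanishes, so
# integral(u * eta') = -integral(u' * eta).
x = np.linspace(0.0, 1.0, 20001)
u = x**2                          # stand-in for dF/dy'
eta = np.sin(np.pi * x)           # vanishes at both endpoints

def trapezoid(f, x):
    """Trapezoidal rule for the integral of samples f over grid x."""
    return np.sum((f[1:] + f[:-1]) / 2 * np.diff(x))

lhs = trapezoid(u * np.gradient(eta, x), x)     # integral of u * eta'
rhs = -trapezoid(np.gradient(u, x) * eta, x)    # -integral of u' * eta
print(np.isclose(lhs, rhs, atol=1e-6))          # True
```

If $\eta$ did not vanish at the endpoints, the two sides would differ by exactly the boundary term $u\eta|_{x_1}^{x_2}$, which is why the fixed-endpoint condition matters.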
To make this integral vanish for every arbitrary $\eta(x)$, the bracketed term must be identically zero, giving us

$$ \frac{\partial F}{\partial y}-\frac{d}{dx}\frac{\partial F}{\partial y'} = 0$$

We have derived the Euler–Lagrange equation, which characterizes the stationary functions of the functional $I[y]$ via a necessary condition that any extremal $y$ must satisfy.

## Minimize the Arc Length Functional

Let's do a classic example that directly shows how minimizing a functional yields the minimal path. Suppose the arc length functional is defined by

$$ S[y] = \int_{x_1}^{x_2} \sqrt{1 + (y')^2}dx, \quad F(y,y') = \sqrt{1 + (y')^2}$$

Using the Euler–Lagrange result

$$ \frac{\partial F}{\partial y} - \frac{d}{dx}\frac{\partial F}{\partial y'} = 0$$

we know that $\frac{\partial F}{\partial y} = 0$ since $F$ has no $y$ term, so we need $\frac{d}{dx}\frac{\partial F}{\partial y'} = 0$.

$$ \frac{\partial F}{\partial y'} = \frac{1}{2} (1+(y')^2)^{-\frac{1}{2}} \cdot 2y' = \frac{y'}{\sqrt{1+(y')^2}}$$

For $\frac{d}{dx}\frac{\partial F}{\partial y'}$ to be zero, $\frac{\partial F}{\partial y'}$ must equal some constant $C$. Hence we get

$$\frac{y'}{\sqrt{1+(y')^2}} = C$$

Solving for $y'$ we get the following. Note that we're going to denote constants by $C_i$, where $i$ is just an arbitrary numbering, since it's easier to write $C_i$ than to track how the previous constants combine (frankly, we don't care what the constant equals anyway).

$$\frac{y'}{\sqrt{1+(y')^2}} = C$$

$$y' = C \sqrt{1+(y')^2} \rightarrow (y')^2 = C_1(1+(y')^2), \quad C_1 = C^2$$

$$(y')^2 - C_1(y')^2 = C_1 \rightarrow C_2(y')^2 = C_1, \quad C_2 = 1 - C_1$$

$$(y')^2 = C_3 \rightarrow y' = \sqrt{C_3} = C_4$$

Thus the slope $y' = \frac{dy}{dx}$ is equal to some constant.
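As a numerical sanity check of this conclusion, the straight line between two fixed endpoints should have the smallest discretized arc length among curves perturbed away from it. A sketch with assumed endpoints $y(0) = 0$, $y(1) = 2$ (my own choices for illustration):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10001)

def trapezoid(f, x):
    """Trapezoidal rule for the integral of samples f over grid x."""
    return np.sum((f[1:] + f[:-1]) / 2 * np.diff(x))

def arc_length(y):
    """S[y] = integral of sqrt(1 + (y')^2) dx."""
    return trapezoid(np.sqrt(1 + np.gradient(y, x)**2), x)

line = 2 * x                       # straight line between the endpoints
bump = np.sin(np.pi * x)           # perturbation vanishing at endpoints

lengths = [arc_length(line + eps * bump) for eps in (0.0, 0.2, 0.5)]
print(np.isclose(lengths[0], np.sqrt(5), atol=1e-4))  # length of the line
print(lengths[0] < lengths[1] < lengths[2])           # line is shortest
```

The unperturbed length matches the exact value $\sqrt{1^2 + 2^2} = \sqrt{5}$, and every perturbation with the same endpoints only makes the curve longer.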
This means that the extremal function $y$ is a straight line: recall the basic equation $y = mx + b$, where $m$ is a constant slope, which is exactly what we have here.

## KL-Divergence

In a previous blog post, [KL-Divergence and Cross Entropy](https://www.noteblogdoku.com/blog/kl-divergence-and-cross-entropy), we had a pretty lengthy derivation of $\frac{\delta D_{KL}(P||Q)}{\delta Q}$, where $P$ is our ground-truth distribution and $Q$ is our predicted distribution, with

$$ D_{KL}(P||Q) = \int P(x)(\log P(x) - \log Q(x))dx$$

Let's take another look at this using some of the tooling we developed here. Note that for the KL-Divergence we have the constraints

$$\int Q(x)dx = 1, \quad Q(x) \geq 0\ \forall x$$

To find the $Q$ that minimizes $D_{KL}(P||Q)$ (assume we know this has a minimum, not a maximum), we are solving

$$\min_Q D_{KL}(P||Q) \quad \text{subject to } \int Q(x)dx = 1$$

or, with Lagrange multipliers, stating this as the functional $I[Q]$

$$I[Q] = D_{KL}(P||Q) + \lambda \left(\int Q(x)dx -1\right)$$

$$ = \int P(x)(\log P(x) - \log Q(x))dx + \lambda \left(\int Q(x)dx -1\right)$$

$$ = \int \left( P(x)(\log P(x) - \log Q(x)) + \lambda Q(x) \right) dx -\lambda$$

$$I[Q] = \int \left( P(x) \log\frac{P(x)}{Q(x)} + \lambda Q(x) \right) dx -\lambda$$

Great, now we can see this is in the form $I[y] = \int_{x_1}^{x_2} F(x,y,y') dx$, with

$$ F(x,Q) = P(x) \log\left(\frac{P(x)}{Q(x)}\right) + \lambda Q(x)$$

From the Euler–Lagrange equation $\frac{\partial F}{\partial y}-\frac{d}{dx}\frac{\partial F}{\partial y'} = 0$ we have

$$ \frac{\partial F}{\partial Q} - \frac{d}{dx} \frac{\partial F}{\partial Q'} = 0$$

Since $F$ does not depend on $Q'$, we have

$$ \frac{\partial F}{\partial Q'} = 0$$

$$ \frac{\partial F}{\partial Q} = \frac{\partial}{\partial Q} \left( P(x) \log\left(\frac{P(x)}{Q(x)}\right) + \lambda Q(x) \right)$$

$$ \frac{\partial F}{\partial Q} = \frac{\partial}{\partial Q} \left( P(x)\log P(x) - P(x)\log Q(x) + \lambda Q(x) \right)$$

$$\frac{\partial F}{\partial Q} = \frac{\partial}{\partial Q} \left(- P(x)\log Q(x) + \lambda Q(x)
\right)$$

$$\frac{\partial F}{\partial Q} = - \frac{P(x)}{Q(x)} + \lambda$$

The stationary condition becomes

$$- \frac{P(x)}{Q(x)} + \lambda = 0 \rightarrow Q(x) = \frac{P(x)}{\lambda}$$

To find $\lambda$, use the fact that $\int P(x)dx = \int Q(x)dx = 1$ and integrate both sides of $Q(x) = \frac{P(x)}{\lambda}$

$$ \int Q(x) dx = \int \frac{P(x)}{\lambda} dx $$

$$ 1 = \frac{1}{\lambda} \cdot 1 \rightarrow \lambda = 1$$

Thus the stationary condition is $Q(x) = P(x)$, which makes sense: to minimize $D_{KL}(P||Q)$ you would want $P = Q$.

For the unconstrained KL functional $J[Q]$, since the integrand depends on $Q$ but not $Q'$, the Euler–Lagrange operator reduces to $\frac{\partial F}{\partial Q}$, which in this case equals the functional derivative. Note, however, that $\frac{\partial F}{\partial Q} = \frac{-P(x)}{Q(x)} + \lambda$ is the functional derivative of the *constrained* functional. We can see this since

$$ J[Q] = D_{KL}(P||Q) = \int P(x) \log\left(\frac{P(x)}{Q(x)}\right) dx$$

Since this integrand also depends on $Q$ but not $Q'$, the functional derivative of the unconstrained functional is

$$ \frac{\delta J}{\delta Q} = \frac{\partial}{\partial Q}\left(P(x) \log\frac{P(x)}{Q(x)}\right) = \frac{-P(x)}{Q(x)}$$

In other words, the $\lambda$ term does not change the gradient of $D_{KL}$ itself; it changes the gradient of the Lagrangian functional used for constrained optimization.
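The stationary condition $Q = P$ is easy to sanity-check in the discrete case, where the KL integral becomes a sum. A sketch with assumed example distributions (my own choices for illustration):

```python
import numpy as np

def kl(p, q):
    """D_KL(P||Q) = sum of p * (log p - log q) for discrete distributions."""
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.5, 0.3, 0.2])   # ground-truth distribution P
q = np.array([0.4, 0.4, 0.2])   # a different valid distribution Q

print(np.isclose(kl(p, p), 0.0))   # True: minimum attained at Q = P
print(kl(p, q) > 0)                # True: Gibbs' inequality
```

Any valid $Q \neq P$ gives a strictly positive divergence, consistent with the Euler–Lagrange analysis picking out $Q(x) = P(x)$ as the minimizer.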