# Maximum Likelihood Estimation, Cross-Entropy, and Softmax

*by dokuDoku | Information Theory*

We are going to first introduce *Maximum Likelihood Estimation*, then show through derivations that maximizing the log likelihood of a *Logistic Regression* model, for both binary and multi-class labels, is equivalent to minimizing the *Cross Entropy Loss* of that model. This motivates the usage of Cross Entropy, as previously discussed in the [KL-Divergence and Cross Entropy blog post](https://www.noteblogdoku.com/blog/kl-divergence-and-cross-entropy), for classification-style problems. Thanks to my friend Anant for whiteboarding the Binary Classification case with me.

## Maximum Likelihood Estimation

Consider the set $X = \{x_1,...,x_n\}$ of $n$ examples drawn independently from the data-generating distribution $p_{data}(x)$. Let our model of this dataset, $p_{model}(x;\theta)$, be a parametric family of probability distributions indexed by $\theta$ (think of $\theta$ as specifying both the class of model and its parameters). To be clear, $p_{model}(x;\theta)$ maps any $x$ to a real number estimating the ground truth probability $p_{data}(x)$. The *Maximum Likelihood Estimator* for $\theta$ is then defined as:

$$ \theta_{ML} = argmax_{\theta} p_{model}(X;\theta) = argmax_{\theta} \Pi_{i=1}^n p_{model}(x_i;\theta)$$

- We assume the data points $x_i$ are independent, so the probability of the dataset is the product of the individual probabilities
- The performance of each model is measured by the product of its output probabilities
- This asks how well the model mimics the ground truth, i.e. is $p_{model} \approx p_{data}$?
- We find which $\theta$ gives us the best model

To simplify the calculation and avoid underflow issues, especially as we multiply many small probabilities together, we typically take the log, giving us the *Log Likelihood*.
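The underflow issue is easy to see numerically. Below is a minimal sketch (numpy, made-up Bernoulli data) that grids over a single parameter $\theta$: the raw product of 5000 probabilities underflows to zero, while the summed log-likelihood recovers a good estimate of the true parameter.

```python
import numpy as np

# 5000 coin flips from a Bernoulli(0.7) "data generating distribution"
rng = np.random.default_rng(0)
x = rng.random(5000) < 0.7

# Candidate parameters theta for p_model(x; theta)
thetas = np.linspace(0.01, 0.99, 99)

# Raw likelihood: the product of 5000 probabilities underflows to 0.0
raw = np.array([np.prod(np.where(x, t, 1 - t)) for t in thetas])

# Log-likelihood: the sum of logs stays in a safe numeric range,
# and the argmax is unchanged since log is monotonic
ll = np.array([np.sum(np.log(np.where(x, t, 1 - t))) for t in thetas])
theta_ml = thetas[np.argmax(ll)]   # close to the true parameter 0.7
```

Every entry of `raw` is exactly zero in float64, so the raw product cannot distinguish any candidate $\theta$ from another; the log-likelihood has no such problem.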
The *Maximum Log Likelihood Estimator* is defined as

$$ \theta_{ML} = argmax_{\theta} p_{model}(X;\theta) = argmax_{\theta} \sum_{i=1}^n log(p_{model}(x_i;\theta))$$

- Key log rule used: $log(ab) = log(a) + log(b)$
- Log is a monotonically increasing function, thus $x_1 > x_2 \Rightarrow log(x_1) > log(x_2)$, so the argmax is unchanged

Additionally, since the argmax does not change under scaling by a positive constant, we can scale by $\frac{1}{n}$ to express the log-likelihood as an expectation over the empirical distribution of the training data

$$ \theta_{ML} = argmax_{\theta} \frac{1}{n}\sum_{i=1}^n log(p_{model}(x_i;\theta)) = argmax_{\theta} \mathbb{E}_{x \sim p_{data}} log(p_{model}(x;\theta))$$

Of note, we can interpret maximum log-likelihood estimation as minimizing the dissimilarity between the ground truth distribution $p_{data}$ and the model's distribution $p_{model}$. Recall from the [KL-Divergence and Cross Entropy blog post](https://www.noteblogdoku.com/blog/kl-divergence-and-cross-entropy) the definition of KL-Divergence

$$D_{KL}(P||Q) = \mathbb{E}_{x\sim P} \left[log(P(x)) - log(Q(x))\right]$$

If we want to minimize the dissimilarity between the ground truth distribution and the model's, we minimize

$$D_{KL}(p_{data}||p_{model}) = \mathbb{E}_{x \sim p_{data}} \left[log(p_{data}(x)) - log (p_{model}(x)) \right]$$

As we also previously showed, whether we minimize $D_{KL}(p_{data}||p_{model})$ by setting its gradient to zero or by gradient descent, its gradient equals the cross entropy's gradient, $\nabla_{p_{model}} D_{KL}(p_{data}||p_{model}) = \nabla_{p_{model}} H(p_{data}, p_{model})$, since only the $log(p_{model}(x))$ term of the KL Divergence is a function of $p_{model}$.
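A small numerical check of this relationship, $D_{KL}(P||Q) = H(P,Q) - H(P)$ (numpy sketch; the two toy distributions are made up):

```python
import numpy as np

# Toy discrete distributions over 4 outcomes
p_data = np.array([0.1, 0.2, 0.3, 0.4])       # ground truth P
p_model = np.array([0.25, 0.25, 0.25, 0.25])  # model Q

kl = np.sum(p_data * (np.log(p_data) - np.log(p_model)))
cross_entropy = -np.sum(p_data * np.log(p_model))
entropy = -np.sum(p_data * np.log(p_data))

# D_KL(P||Q) = H(P, Q) - H(P); H(P) does not depend on the model,
# so minimizing KL over the model means minimizing cross entropy
assert np.isclose(kl, cross_entropy - entropy)
```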
Note that we can also have a *conditional log-likelihood*, where we simply condition some output $y_i$ on input $x_i$ to get

$$ \theta_{ML} = argmax_{\theta} \Pi_i P(y_i|x_i;\theta)$$

and, with an independent and identically distributed assumption, taking logs gives

$$ \theta_{ML} = argmax_{\theta} \sum_{i=1}^n log(P(y_i|x_i;\theta))$$

## Logistic Regression as Probability Model; Binary Classification

We state the following definitions for our problem setup:

- Binary output labels: $y \in \{0,1\}$
- Output of linear layer: $z_i = w^Tx_i + b$
  - $z_i \in \mathbb{R}$
  - $w,x_i \in \mathbb{R}^d$
  - $b \in \mathbb{R}$
- $\sigma(z)$ is the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$
- $y_{i,pred} = \sigma(z_i)$, i.e. this is our predicted $y_i$ value
- Probability $y=1$ is $p(y_i=1|x_i;w,b) = \sigma(z_i)$
- Probability $y=0$ is $p(y_i=0|x_i;w,b) = 1 - \sigma(z_i)$

Here we use the *sigmoid* as our classifier function since it bounds our output between 0 and 1, and we will see it has nice properties. Additionally, note that $w$ is a vector: we are mainly considering the last layer of the neural net, and this matters only so that the output is a scalar, since we know that $w^Tx_i \in \mathbb{R}$. Compacting both cases $y=0$ and $y=1$ into one expression we get:

$$p(y_i|x_i;w,b) = \sigma(z_i)^{y_i} (1-\sigma(z_i))^{1-y_i}$$

Using our definition for the likelihood of the dataset, we get:

$$L(w,b) = \Pi_{i=1}^n p(y_i|x_i;w,b) = \Pi_{i=1}^n \sigma(z_i)^{y_i} (1-\sigma(z_i))^{1-y_i} $$

Now let's take the log of this to give us the log-likelihood

$$log L(w,b) = \sum_{i=1}^n log(p(y_i|x_i;w,b)) = \sum_{i=1}^n \left[log(\sigma(z_i)^{y_i} (1-\sigma(z_i))^{1-y_i}) \right]$$

$$log L(w,b) = \sum_{i=1}^n \left[ y_i log(\sigma(z_i)) + (1 - y_i) log (1-\sigma(z_i))\right]$$

Note that we can either maximize the log likelihood or minimize the negative log likelihood.
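The equivalence between the compact product form and the expanded log-likelihood can be checked numerically. A minimal numpy sketch (the batch, weights, and bias are all made-up toy values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy batch: 8 examples, 3 features (all numbers illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=8)   # binary labels y_i in {0, 1}
w = rng.normal(size=3)
b = 0.1

p = sigmoid(X @ w + b)           # y_pred = sigma(z_i)

# Negative log-likelihood in the expanded form from the text
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same number via the compact form p(y|x) = p^y (1-p)^(1-y)
likelihood = np.prod(p ** y * (1 - p) ** (1 - y))
assert np.isclose(nll, -np.log(likelihood))
```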
Thus

$$ -log L(w,b) = - \sum_{i=1}^n \left[ y_i log(\sigma(z_i)) + (1 - y_i) log (1-\sigma(z_i))\right]$$

Additionally, averaging over all $n$ training points (which scales the objective by a positive constant, so the argmin is unchanged) we get

$$ -\frac{1}{n} log L(w,b) = - \frac{1}{n} \sum_{i=1}^n \left[ y_i log(\sigma(z_i)) + (1 - y_i) log (1-\sigma(z_i)) \right]$$

$$ -\frac{1}{n} log L(w,b) = - \frac{1}{n} \sum_{i=1}^n \left[ y_i log(y_{i,pred}) + (1 - y_i) log (1-y_{i,pred}) \right]$$

Note that if we take the definition of Cross Entropy and restrict the sum to the two possible outcomes $\{0,1\}$, we get the same thing

$$ H(P,Q) = -\sum_x P(x) log(Q(x)) \rightarrow -\left[P(0)log(Q(0)) + P(1)log(Q(1))\right] $$

where $P(1) = y_i$, $P(0) = 1-y_i$, $Q(1) = y_{i,pred}$, and $Q(0) = 1-y_{i,pred}$; these identifications hold whether $y_i$ is 1 or 0. Thus we have shown that maximizing the likelihood of our Binary Logistic Regression model is equivalent to minimizing the Binary Cross Entropy Loss.

## Logistic Regression as Probability Model; Multi-Class Classification

We state the following definitions for our problem setup:

- Class output labels: $y \in \{1,...,K\}$
- Output of linear layer: $z_i = wx_i + b \in \mathbb{R}^K$
  - $z_i \in \mathbb{R}^K$, where $z_{ik}$ is the $k$th index/class of output $i$ (below we drop the example index and write $z_k$ for brevity)
  - $w\in \mathbb{R}^{K \times d}$
  - $x_i \in \mathbb{R}^d$
  - $b \in \mathbb{R}^K$
- $\hat{y}_{ik} = softmax(z_k)$, where the softmax function is $softmax(z_k) = \frac{e^{z_k}}{\sum_{j=1}^Ke^{z_j}}$
  - $softmax(z_k)$ is the probability of selecting class $k$
  - $p(y=k|x) = softmax(z_k)$, where $z_k$ is the $k$th element of $z$
  - $\hat{y}_{ik}$ makes it more explicit that it's the $k$th element of the $i$th predicted $y$ output
- $y_i \in \mathbb{R}^K$ is a one-hot vector, where $y_{ik} = 1$ means class $k$ is the correct class
  - Note that if $y_{ik}=1$ then $y_{ij} = 0$ for all $j \neq k$

Now consider multi-class classification using *softmax* instead of the *sigmoid* function.
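To ground the setup above, here is a minimal numpy sketch (the sizes $K=3$, $d=4$ and all values are made up) showing that the softmax of the linear-layer output is a valid probability distribution over the $K$ classes:

```python
import numpy as np

# Illustrative sizes: K = 3 classes, d = 4 features
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))    # w in R^{K x d}
b = rng.normal(size=3)         # b in R^K
x_i = rng.normal(size=4)       # x_i in R^d

z_i = w @ x_i + b              # z_i in R^K

y_hat = np.exp(z_i) / np.exp(z_i).sum()   # softmax over the K entries

# Each entry lies in (0, 1) and the entries sum to 1,
# so y_hat can be read as a probability distribution over classes
assert np.all((0 < y_hat) & (y_hat < 1))
assert np.isclose(y_hat.sum(), 1.0)
```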
Softmax is preferable to sigmoid here since it maps the $K$-dimensional input $z$ to a $K$-dimensional vector whose entries sum to 1, so each element can be interpreted as a probability. Think of it as a "soft" maximum function: it's in the name. Additionally, this function is used due to its properties, as we'll see. The likelihood of this model is:

$$ L(w,b) = \Pi_{i=1}^n \Pi_{k=1}^K p(y_i=k|x_i)^{y_{ik}}$$

which is the probability that we predict the correct class for each of our inputs $x_i$. We raise $p(y_i=k|x_i)$ to the power $y_{ik}$ so that only the correct class contributes: any term with $y_{ik} = 0$ equals 1. So for one-hot labels, $\Pi_{k=1}^K p(y_i=k|x_i)^{y_{ik}}$ just picks out the predicted probability assigned to the true class. Let's make this explicit by substituting in our definition of softmax

$$ L(w,b) = \Pi_{i=1}^n \Pi_{k=1}^K \left( \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \right)^{y_{ik}} $$

Motivated by the previous section, we take the negative log-likelihood

$$ -log(L(w,b)) = - \sum_{i=1}^n \sum_{k=1}^K log\left( \left( \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \right)^{y_{ik}} \right) $$

$$ -log(L(w,b)) = - \sum_{i=1}^n \sum_{k=1}^K y_{ik}\ log \left( \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \right) = - \sum_{i=1}^n \sum_{k=1}^K y_{ik} \left(log(e^{z_k}) - log\left(\sum_{j=1}^K e^{z_j}\right) \right)$$

Since, by our definition of softmax, the predicted value for class $k$ is

$$ \hat{y}_{ik} = softmax(z_k) = \frac{e^{z_k}}{\sum_{j=1}^Ke^{z_j}}$$

our expression can be written compactly as

$$ -log(L(w,b)) = - \sum_{i=1}^n \sum_{k=1}^K y_{ik}\ log(softmax(z_k)) = - \sum_{i=1}^n \sum_{k=1}^K y_{ik}\ log(\hat{y}_{ik})$$

Looking at the definition of Cross Entropy

$$H(P,Q) = -\sum_{k=1}^K P(x_k)log(Q(x_k))$$

we see that, with $P$ being the ground truth $y_i$ and $Q$ being the prediction $\hat{y}_i$, these
are the same expressions over all $n$ data points:

$$H(y_i, \hat{y}_i) = - \sum_{k=1}^K y_{ik}\ log(\hat{y}_{ik})$$

with the total over all $n$ data points being

$$ \sum_{i=1}^n H(y_i, \hat{y}_i) = - \sum_{i=1}^n \sum_{k=1}^K y_{ik}\ log(\hat{y}_{ik}) = -log(L(w,b))$$

Thus we have shown that Cross Entropy Loss arises naturally from the classification task of logistic regression. Note that logistic regression is a *1 layer neural network* with either a sigmoid or softmax function. Since, via the chain rule and backpropagation, each layer only cares about the gradient passed back from the subsequent layer, this example shows why Cross Entropy Loss is commonly justified as a loss function for the last layer of a neural network: *it is equivalent* to minimizing the negative log likelihood of predicting the correct category. This also ties in with the fact that we are inherently trying to match the ground truth distribution in this task.
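As a quick numerical check of the softmax derivation (a minimal numpy sketch for a single example; the logits and label are made up), the one-hot cross entropy $-\sum_k y_{ik}\,log(\hat{y}_{ik})$ reduces to $log\left(\sum_j e^{z_j}\right) - z_k$ for the true class $k$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability; result unchanged
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])   # logits for K = 3 classes (illustrative)
y = np.array([1.0, 0.0, 0.0])    # one-hot label: class k = 0 is correct

y_hat = softmax(z)
nll = -np.sum(y * np.log(y_hat))          # -sum_k y_k log(y_hat_k)

# For a one-hot label this is just -log(softmax(z_k)) for the true class k,
# i.e. log(sum_j e^{z_j}) - z_k
k = int(np.argmax(y))
alt = np.log(np.sum(np.exp(z))) - z[k]
assert np.isclose(nll, alt)
```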