Self-Information & Entropy by dokuDoku Information Theory 🔍

## How much information does one observation give you?

A key question we want to ask is: given a distribution over a set of items, how many questions do we need to ask on average to identify an element of that set? Another way to phrase this is: how much information do I obtain from observing an element of that set, and how does that information vary depending on which item I observe? Some examples for *How much information does one observation give you?* include
- Fair coin flip: 1 yes/no question is needed
- English letters: *e* is frequent while *z* is rare, so observing a *z* gives us more information
- The observation "It snowed in Los Angeles" is more informative than the observation "It snowed in Alaska"

Note that we will later extend this to ask what *minimal representation* is needed to encode information, taking into account that if on average we see *more frequent items*, we should encode them with *shorter representations*, saving space and bandwidth. For instance, it is very common in image processing to encode a row or a block of pixels that all share one color by denoting just how many pixels have that color and the color itself (run-length encoding). Some formats (including optional Windows Bitmap (BMP) variants) support this compressed representation. Suppose that we have an uncompressed row: ```[5, 5, 5, 5, 5, 2, 2, 2, 7, 7, 7, 7, 7, 7, 7, 7]``` with 5 pixels of color #5, 3 pixels of color #2, and 8 pixels of color #7. We can encode this intuitively in the format ```(number of pixels, color)```, giving us ```[(5,5), (3,2), (8,7)]``` and greatly reducing our representation.

## Self-Information from Axioms

We are going to define the self-information of event $x$, $I(x)$, such that it fulfills the following four axioms, and then prove that our definition does.
1. Information is non-negative
2.
Less probable events carry more information
   - By extension, events that are *guaranteed* should have *no* information content, i.e. they are non-informative
3. Independent events should have additive information: $I(x,y) = I(x) + I(y)$
4. The information function $I(x)$ should be continuous in the probability $P(x)$

Now we are going to prove that our hypothesis for the self-information function, $$ I(x) = -log(P(x)) \quad P(x) \in (0,1]$$ satisfies these axioms, where $P(x)$ is the probability of event $x$ occurring. First, let's establish that $-log(P(x))$ is continuous on the range $P(x) \in (0,1]$. We could prove this rigorously, but in the interest of the reader's time we give intuition instead: the exponential function is continuous, and the logarithm is simply its inverse $$ log_a(b) = c \leftrightarrow a^c = b$$ Additionally, observe **Figure 1**, which plots $-log(x)$ in Desmos and shows that for $x \in (0,1]$, $-log(x)$ is a continuous, monotonically decreasing function. Formally, this can be proven using properties of the exponential, since the exponential is a *continuous, monotonically increasing* function. **Note** Figure 1 uses $-log(x)$ and not $-log(P(x))$ since it's easier to plot online; read it as $-log(P(x))$.

<center> <img src="/media/images/20260313_231828_Screenshot_2026-03-13_at_4.02.06_PM.png" width="600"> $\textbf{Figure 1:}$ Empirically Showing $-log_2(x)$ Continuous & Monotonically Decreasing </center>

Additionally, $I(x) = -log(P(x))$ can be seen in **Figure 1** to satisfy:
1. Information is non-negative
2. Less probable events carry more information

Some demonstrative examples include
- Event with $P(x) = 1 \rightarrow I(x) = -log(1) = 0$ i.e.
no information
- Event with $P(x) \rightarrow 0 \rightarrow I(x) = -log(P(x)) \rightarrow \infty$, i.e. information grows without bound as an event becomes arbitrarily rare
- Event with $P(x) = \frac{1}{2} \rightarrow I(x) = -log_2(\frac{1}{2}) = log_2(2) = 1$, i.e. an event with probability $\frac{1}{2}$ carries 1 unit of information

To show axiom *3. Independent events should have additive information*, $I(x,y) = I(x) + I(y)$, suppose that we have two independent events $x$ and $y$ and that both occur. The information from these should add. We can see this from $P(x,y)$, defined for independent events as $P(x,y) = P(x)P(y)$: $$-log(P(x,y)) = -log(P(x)P(y)) = -log(P(x)) -log(P(y)) = I(x) + I(y) = I(x,y)$$

### How do we Quantify Information? What's a Bit?

A bit is the unit of information obtained when using a base-2 logarithm. Thus $$I(x) = -log_2(P(x))$$ measures information in **bits**. For $P(x) = \frac{1}{2}$ we get $I(x) = -log_2(\frac{1}{2}) = log_2(2) = 1$, which corresponds to **one fair binary decision**. Note that if we instead defined it with a base-$e$ (natural) logarithm, the unit of information would be the **nat**. However, we will stick with bits, since they are ubiquitous and easy to work with. Colloquially, a bit is a single digit that can be either a 0 or a 1. Given two values we want to represent, each occurring with equal probability $P(x) = \frac{1}{2}$, a base-2 system can represent the outcome with 1 digit, meaning we can represent it with **one bit** of information. For example, if there are 4 equally likely outcomes, each with $P(x) = \frac{1}{4}$, identifying which occurred requires $$ I(x) = -log_2(\frac{1}{4}) = log_2(4) = 2\ bits$$ that is, two binary decisions (yes/no questions) to determine.
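These bit counts are easy to sanity-check numerically. A minimal Python sketch (the helper name `self_information` is our own):

```python
import math

def self_information(p: float) -> float:
    """Self-information I(x) = -log2(P(x)), measured in bits."""
    return -math.log2(p)

print(self_information(0.5))    # fair coin flip -> 1.0 bit
print(self_information(0.25))   # one of 4 equally likely outcomes -> 2.0 bits
```

Note that `self_information(0.5 * 0.25)` returns `3.0`, matching the additivity axiom for independent events.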
We can model each binary decision as a 0 or a 1, forming the **binary representation** of the numbers 0 through 3:
- 0 = Binary 00
- 1 = Binary 01
- 2 = Binary 10
- 3 = Binary 11

**Fun Aside!** Suppose that we want to see how much information we can represent in our base-10 number system! Instead of counting how many yes/no binary questions it takes to identify one of $n$ equally likely outcomes, imagine we can ask a question and get one of ten different answers. Using a base-10 definition of information: $$ I(x) = -log_{10}(\frac{1}{n}) = log_{10}(n)$$ which is how many digits it takes to represent a number in our standard system. Let's say we want to represent the number 15, i.e. $n=15$; we should get that it takes two digits: $$ I(x) = -log_{10}(\frac{1}{15}) = log_{10}(15) \approx 1.176 $$ Since we can only have an integer number of digits, we round $1.176$ up to $2$, which matches the two digits needed to write 15! Pretty neat.

## Entropy as Expected Self-Information

We defined self-information $I(x) = -log(P(x))$ as the amount of information gained when observing an event with probability $P(x)$ from a set. Now we define the quantity **Entropy**, the expected amount of information gained when observing an event from that same set. Formally, entropy $H(x)$, the expected amount of information from observing a random event $x$ from our set, is $$ H(x) = \mathbb{E}_{x \sim P}[I(x)] = \mathbb{E}_{x \sim P}[-log(P(x))] = \sum_{x_i} -log(P(x_i))P(x_i) = -\sum_{x_i} P(x_i)log(P(x_i))$$ In **Figure 2** we plot the entropy $H(x)$ of a binary random variable against its probability $P(x)$ of showing one outcome (say heads in a coin toss); we can see that entropy is *maximized* when $P(x) = \frac{1}{2}$.
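The expectation above is straightforward to compute numerically. A small sketch (the function name `entropy` is ours) illustrates the binary case:

```python
import math

def entropy(ps):
    """H(x) = -sum_i P(x_i) * log2(P(x_i)); zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: less than 1 bit
print(entropy([0.25] * 4))   # 4 equally likely outcomes: 2.0 bits
```

The biased coin gives strictly less entropy than the fair one, in line with the maximum at $P(x) = \frac{1}{2}$.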
<center> <img src="/media/images/20260313_215150_Screenshot_2026-03-13_at_2.51.37_PM.png" width="600"> $\textbf{Figure 2:}$ Entropy $H(x)$ of a Binary Random Variable $x$, plotted against $P(x)$ ranging from 0 to 1 </center>

### Derivation: Uniform Distributions have Maximum Entropy

In general, the entropy $H(x)$ of a random variable is maximized when said variable has a uniform distribution. We can show this by maximizing $H(x)$ subject to the constraint $\sum_{x_i} P(x_i) = 1$, since all probabilities in the set must sum to 1. Using the Lagrange multiplier technique from multivariable calculus, for a function $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$, constraint function $g(x): \mathbb{R}^n \rightarrow \mathbb{R}$, constraint value $c \in \mathbb{R}$, and Lagrange multiplier $\lambda$, optimizing the function $h(x): \mathbb{R}^n \rightarrow \mathbb{R}$ $$ h(x) = f(x) - \lambda (g(x) -c)$$ optimizes $f(x)$ subject to $g(x) = c$. For our problem we simplify the nomenclature so that $P(x_i) = p_i\ \forall i \in [1,n]$ (assume a discrete distribution with $n$ events). The setup for our optimization problem, finding the maximum entropy $H(x)$ such that $\sum_i p_i = 1$, is:
- $f(x)$ is $H(x) = -\sum_{i=1}^n p_i log(p_i)$
- $g(x)$ is $\sum_{i=1}^n p_i$
- the constraint $c$ is $1$

This gives us $h(x)$, the function we are maximizing: $$ h(x) = -\sum_{i=1}^n p_i log(p_i) - \lambda(1 - \sum_{i=1}^n p_i)$$ Let's find $\frac{\partial h}{\partial p_i}$ for one particular $p_i$, see the pattern, and apply it to every $\frac{\partial h}{\partial p_i}\ \forall i$: $$ \frac{\partial h}{\partial p_i} = -(log(p_i)+1) - \lambda(-1)$$ $$ \frac{\partial h}{\partial p_i} = -(log(p_i)+1) + \lambda$$ Now let's set $\frac{\partial h}{\partial p_i} = 0$ to find the extrema.
Note that we know this extremum is a maximum since the Hessian of the entropy $H(x)$ is negative definite, see the **Appendix**. $$ 0 = -(log(p_i)+1) + \lambda \rightarrow log(p_i) = \lambda - 1$$ Using the log identity $log_a(b) = c \leftrightarrow a^c = b$: $$p_i = e^{\lambda - 1}$$ So we have found that $$ \frac{\partial h}{\partial p_i} = 0 \rightarrow p_i = e^{\lambda - 1}$$ Note that this value does not depend on $i$. Hence $p_i$ is the same constant $e^{\lambda - 1}$ for all $i$, and since the probabilities must sum to one, $p_i = \frac{1}{n}$. By definition, a discrete distribution in which all probabilities are equal and sum to one is a **uniform distribution**. Thus, we have shown that the discrete distribution maximizing entropy $H(x)$ is the uniform distribution. In contrast, the discrete distribution that minimizes entropy $H(x)$ is one in which a single event is guaranteed to occur with probability $1$.

## Entropy for Data Compression

Now let's shift from the question *How much information is in one observation?* to the related question *How few bits do we need to represent observations?*. Recall our earlier examples: how many uniformly distributed numbers we can represent with a given number of bits, and the Windows Bitmap (BMP) example compressing image information. Continuing this line of thought, how do we on average send the shortest code to represent information (a variable-length code)? To solve the problem of *when does one codeword end and another begin?* we will use **prefix-free codes**. A code is **prefix-free** if no codeword is a prefix of any other codeword. An example is:
- {0, 10, 110, 111}: no codeword is a prefix of another, and the codewords have variable lengths

This greatly simplifies our analysis and derivation.
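To make the prefix-free property concrete, here is a small sketch (the helper names are ours) that checks the example code above and decodes a bit string unambiguously, emitting a symbol each time the accumulated bits match a codeword:

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

def decode(bits, code_to_symbol):
    """Walk the bit string, emitting a symbol whenever a full codeword matches."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in code_to_symbol:   # prefix-freeness makes this unambiguous
            out.append(code_to_symbol[buf])
            buf = ""
    return out

codebook = {"0": "a", "10": "b", "110": "c", "111": "d"}
print(is_prefix_free(codebook))       # True
print(decode("010110111", codebook))  # ['a', 'b', 'c', 'd']
```

Because no codeword is a prefix of another, the decoder never has to look ahead or backtrack; the codeword boundaries recover themselves.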
Additionally, we can visualize any binary prefix-free code as a binary tree, where the leaves are codewords and the internal nodes are decisions. As a result, the length of a codeword is the depth of its leaf in the tree. For our derivation we will use binary prefix-free codes, since we have already motivated the use of $log_2()$. Please see the **Appendix** for a proof of **Kraft's Inequality** and an explanation of why prefix codes can *always* achieve the optimal data compression of any character encoding.

### Constructing Huffman Trees

To construct a Huffman Tree, we are going to
1. Describe the algorithm
2. Describe the runtime
3. Prove correctness

**The Algorithm** To start with some definitions: this is a **prefix-free code**, so we are constructing a tree in which each leaf node holds one codeword. We are minimizing the objective $$ B(T) = \sum_{i=1}^n P(x_i) \cdot depth(x_i)$$ where $P(x_i)$ is the probability of event $x_i$ occurring (of which there are $n$) and $depth(x_i)$ is the depth of element $x_i$ in the tree. The objective $B(T)$ states that we want to minimize the expected depth of a leaf in the tree, i.e. the expected codeword length. Additionally:
- $X$ will be the set of $n$ elements $\{x_1, x_2, ... 
x_n\}$
- The queue $Q$ will be a min-binary heap keyed on $P(x_i)$, composed of elements $x_i$
- $Extract\text{-}Min(Q)$ removes and returns the minimum element of the queue $Q$
- $Insert(Q,z)$ inserts element $z$ into queue $Q$
- $Heapify(X) \rightarrow Q$ turns the set of elements $X$ into the min-binary heap $Q$
- $NewNode()$ creates a new node with left and right children ($.left$, $.right$ respectively), and $P(NewNode())$ can be assigned

**Algorithm:**
```
Q = Heapify(X)
for i=1 to n-1:
    z = NewNode()
    z.left = x = Extract-Min(Q)
    z.right = y = Extract-Min(Q)
    P(z) = P(x) + P(y)
    Insert(Q, z)
return Extract-Min(Q)  # return the root node of the tree formed
```

**Runtime:** The $Heapify(X)$ operation takes $O(n)$ time with $n$ elements in $X$. Our for loop executes $n-1$ times, with each $Extract\text{-}Min(Q)$ operation taking $O(log(n))$ time, for a total of $O(n\ log(n))$ time. Thus the total runtime is $$O(n) + O(n\ log(n)) \rightarrow O(n\ log(n))$$

**Correctness:** As we can see from the algorithm, this is a **greedy** algorithm, so to prove correctness we will use the greedy choice property and optimal substructure. We will prove two lemmas which, combined, prove the algorithm correct:
1. Greedy Choice Property: choosing the two lowest-frequency elements to add to the tree first is always optimal
2. Optimal Substructure Property: merging the two lowest-frequency elements and solving for an optimal tree on the remaining elements is also optimal for the original problem

Please see the **Appendix** for both proofs, as there is not enough space here for both. Given that both lemmas hold, the algorithm is correct: we first choose the lowest-frequency elements to add to the tree under construction (using the min-heap), and when we add them, we insert their merged frequency as node $z$.
What this does is build the tree so that the *lowest-frequency* elements end up with the *largest depth*, iteratively, by always choosing the minimum-frequency elements to sit at the bottom of the current subtree (as children of $z$). This minimizes our objective $B(T) = \sum_{i=1}^n P(x_i) \cdot depth(x_i)$. Intuitively, if we were to (wrongly) place the *highest-frequency* items at the *greatest depth* and the *lowest-frequency* items at the *least depth*, this would increase $B(T)$ rather than minimize it.

### Depth of Element $x_i$ in Huffman Tree and Self-Information $I(x_i)$

We have now proved that the Huffman algorithm produces the tree that minimizes the expected code length $$B(T) = \sum_{i=1}^n P(x_i) \cdot \text{depth}(x_i)$$

### Entropy $H(x)$ of Huffman Trees

A natural question is: **how close is each leaf's depth to the self-information** $I(x_i) = -\log_2 P(x_i)$? **Important clarification:** It is **not** true in general that $\lfloor I(x_i) \rfloor \leq \text{depth}(x_i) \leq \lceil I(x_i) \rceil$ holds for **every** symbol in a Huffman tree. (In some distributions, especially when one symbol is very probable, rare symbols can receive shorter codes than this range would suggest.) However, we can prove something more useful: **on average**, the weighted code lengths (depths) are within 1 bit of the entropy $H(x)$. This is the key insight that directly connects the Huffman tree cost $B(T)$ to the entropy $H(x)$.

**Proof** (using an auxiliary Shannon code) First, consider a simple auxiliary prefix-free code called the **Shannon code**. We define its codeword depths as $$ depth_S(x_i) = \lceil I(x_i) \rceil = \lceil -\log_2 P(x_i) \rceil $$
1. By definition of the ceiling function, we have $I(x_i) \leq depth_S(x_i) < I(x_i) + 1$ (each Shannon codeword is at most 1 bit longer than the ideal self-information depth).
2.
These depths satisfy Kraft's inequality (see Appendix): $$ \sum_{i=1}^n 2^{-depth_S(x_i)} \leq \sum_{i=1}^n 2^{-I(x_i)} = \sum_{i=1}^n P(x_i) = 1 $$ Therefore a valid binary prefix-free code with exactly these depths exists.
3. The expected depth of this Shannon code is $$ Depth_S = \sum_{i=1}^n P(x_i) \cdot depth_S(x_i) < \sum_{i=1}^n P(x_i) \cdot (I(x_i) + 1) = H(x) + 1 $$

Now recall two facts we already established:
- The Huffman code achieves the **optimal** (shortest possible) expected depth among **all** prefix-free codes (see the Greedy Choice Property and Optimal Substructure Property lemmas plus the correctness proof).
- Therefore $B(T)_{\text{Huffman}} \leq Depth_S$.

Combining everything we get: $$ B(T)_{\text{Huffman}} \leq Depth_S < H(x) + 1 $$ We also know from the noiseless coding theorem (please read separately) and Kraft's inequality that **any** prefix-free code satisfies $$ B(T) \geq H(x) $$ Putting both bounds together gives the result: $$ H(x) \leq B(T)_{\text{Huffman}} < H(x) + 1 $$ The average number of bits per symbol in a Huffman encoding is **within 1 bit** of the theoretical lower bound given by the entropy $H(x)$. In other words, the depths produced by the Huffman algorithm are **on average** an excellent approximation of the self-information $I(x_i)$ of each leaf. This is why Huffman coding compresses well: frequent symbols get short codes (close to their low $I(x_i)$), rare symbols get longer codes (close to their high $I(x_i)$), and the average code length stays very close to the information-theoretic optimum.
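The bound $H(x) \leq B(T) < H(x) + 1$ can be checked numerically. The sketch below builds Huffman code lengths with Python's `heapq`, a compact variant of the algorithm above that tracks leaf depths instead of building explicit nodes (all names are ours):

```python
import heapq
import math

def huffman_lengths(probs):
    """Return the Huffman codeword length (leaf depth) for each probability."""
    # Heap entries: (subtree probability, unique tiebreak id, symbol indices in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    depths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two lowest-probability subtrees
        p2, i2, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # every merged symbol moves one level deeper
            depths[s] += 1
        heapq.heappush(heap, (p1 + p2, i2, s1 + s2))
    return depths

probs = [0.5, 0.25, 0.15, 0.10]
depths = huffman_lengths(probs)
B = sum(p * d for p, d in zip(probs, depths))      # expected code length
H = -sum(p * math.log2(p) for p in probs)          # entropy
print(depths)      # [1, 2, 3, 3]
print(H <= B < H + 1)
```

For this distribution $B(T) = 1.75$ bits against an entropy of roughly $1.74$ bits, and the resulting depths satisfy Kraft's inequality with equality.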
### Prefix Encoding and Distribution Sampling Equivalence

Self-information has two equivalent meanings: $$ I(x_i) = -\log_2 P(x_i) $$
- Information gained when observing $x_i$
- Approximate depth (number of yes/no questions) needed to reach $x_i$ in an optimal prefix tree

Entropy has two equivalent meanings: $$H(x) = -\sum P(x_i) \log_2 P(x_i) $$
- Expected information from one observation
- Expected depth (average bits) in an optimal prefix code

Prefix-tree traversal and sampling from the PMF are two views of the same thing.
- **Sampling**: Start at the root; at each node go left/right with probability proportional to the subtree's probability mass → reach a leaf = sampled symbol
- **Decoding**: Start at the root; follow bits (0/1) → reach a leaf = decoded symbol

The Huffman tree encodes **both** because it builds the optimal paths using the same probabilities. The tree is a single structure that:
- Compresses data near-optimally
- Lets you efficiently sample from the distribution

The probability mass function tells you **how often** each symbol appears; the Huffman tree tells you **how to reach** each symbol (by bits or by weighted choices). As we can see, they are two representations of the exact same information.

<center> <video controls style="max-width: 80%;"> <source src="/media/videos/20260314_221441_03_14_2026_15_12_32_crack_snail_huffman.mp4"> </video> </center> <center> $\textbf{Video 1:}$ Demonstration of the Equivalence of Entropy $H(x)$ for Huffman Codes and Probability Mass Functions </center>

---

## Appendix

### Entropy $H(x)$ has a Negative Definite Hessian Matrix

The entropy function $$ H(p) = -\sum_{i=1}^n p_i \log p_i $$ is defined on the interior of the probability simplex ($\sum p_i = 1$, $p_i > 0$).
**Hessian computation** The partial derivatives are: $$ \frac{\partial H}{\partial p_k} = -\log p_k - 1 $$ The second partial derivatives give the Hessian matrix: $$ \frac{\partial^2 H}{\partial p_j \partial p_k} = \begin{cases} -\dfrac{1}{p_k} & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases} $$ So the Hessian is diagonal: $$ \operatorname{Hess}(H) = \operatorname{diag}\left( -\frac{1}{p_1},\ -\frac{1}{p_2},\ \dots,\ -\frac{1}{p_n} \right) $$

**Negative definiteness** For any $p_i > 0$ (interior of the simplex), each diagonal entry satisfies $$ -\frac{1}{p_k} < 0. $$ For any nonzero vector $v \in \mathbb{R}^n$, the quadratic form is $$ v^T \operatorname{Hess}(H) \, v = -\sum_{k=1}^n \frac{v_k^2}{p_k} < 0, $$ since each term $\frac{v_k^2}{p_k} \geq 0$ and at least one is strictly positive ($v \neq 0$). Therefore the Hessian is **strictly negative definite** on the relative interior of the probability simplex.

### Kraft Inequality Derivation

Kraft's inequality states: for any prefix-free binary code with codeword lengths $l_1, l_2, \dots, l_n$, $$ \sum_{i=1}^n 2^{-l_i} \leq 1 $$ Any set of codeword lengths that satisfies the prefix condition must obey Kraft's inequality, and conversely, any set of positive integer lengths $l_i$ that satisfies the inequality can be realized by some prefix-free binary code.

#### Proof of the necessity direction (prefix-free $\rightarrow$ Kraft inequality)

Consider a binary tree where each codeword corresponds to a leaf at depth $l_i$. Because the code is prefix-free, no codeword is a prefix of another, so all codewords are leaves and no leaf is an ancestor of another leaf. Imagine we assign to each leaf a "weight" of $2^{-l_i}$, where $l_i$ is its depth. Now consider all nodes at every level of the tree:
- At depth 0 (root): weight = 1
- At depth 1: two children, each can contribute at most $1/2$
- At depth 2: four possible nodes, each at most $1/4$
- And so on.
Because the code is prefix-free, the subtrees below different codeword leaves are disjoint, and the entire set of leaves "covers" some portion of the possible binary strings without overlap. The total weight contributed by all leaves is therefore at most the weight of the root: $$ \sum_{i=1}^n 2^{-l_i} \leq 1 $$ This is because each internal node's weight is split equally between its two children (each gets half), and the leaves collect all the weight that reaches them, but the total weight never exceeds 1 at any level, and the prefix-free property ensures no overlapping subtrees. Thus, any prefix-free code must satisfy Kraft's inequality.

#### Proof of the sufficiency direction (Kraft inequality $\rightarrow$ exists a prefix-free code)

Suppose we are given positive integers $l_1 \leq l_2 \leq \dots \leq l_n$ such that $$ \sum_{i=1}^n 2^{-l_i} \leq 1 $$ We can construct a prefix-free code with exactly these lengths using a greedy tree-building procedure (similar in spirit to Huffman, but simpler):
1. Sort the lengths in non-decreasing order.
2. Build a binary tree level by level.
3. At each step, assign the next codeword to the first available leaf position (in lexicographic order) at the required depth.
4. Because the Kraft sum is $\leq 1$, there will always be enough unused nodes at each depth to place the next codeword without violating the prefix condition.

A more formal constructive proof uses induction on $n$:
- Base case ($n=1$): $2^{-l_1} \leq 1$ always holds for $l_1 \geq 0$. Assign the all-zero string of length $l_1$.
- Inductive step: Assume the claim holds for $n-1$ symbols. Take the longest length $l_n$. There must be at least one binary string of length $l_n$ that does not have any of the other codewords as a prefix (otherwise the Kraft sum would exceed 1). Assign this string as the codeword for the last symbol. The remaining $n-1$ codewords form a prefix-free code on a subtree, and by induction they can be assigned validly.
Thus, the inequality is also sufficient: if the Kraft sum is $\leq 1$, then a prefix-free binary code with those exact lengths exists.

## Huffman Tree Construction Proofs

### Proof 1. Greedy Choice Property: Choosing the lowest-frequency elements to add to the tree is always optimal

**Lemma:** Let $X$ be a set of $n$ elements such that each element $x_i$ has probability $P(x_i)$ of occurring, with $\sum_{i=1}^n P(x_i) = 1$. Let $x,y$ be the two elements of $X$ having the lowest probabilities. Then there exists an optimal prefix code for $X$ in which the codewords for $x$ and $y$ have the same length and differ only in the last bit. Note that the cost function of the tree $T$ is defined as $$ B(T) = \sum_{i=1}^n P(x_i) \cdot depth(x_i)$$

**Proof:** The objective is to take a tree $T$ representing an arbitrary optimal prefix code and modify it into a tree representing another optimal prefix code in which $x,y$ are sibling leaves of maximum depth. If such a tree can be constructed, then the codewords for $x$ and $y$ will have the same length and differ only in the last bit. Let $a,b$ be two elements of $X$ that are sibling leaves of maximum depth in $T$. Assume that $P(a) \leq P(b)$ and $P(x) \leq P(y)$. Since $P(x)$ and $P(y)$ are the two lowest leaf probabilities and $P(a), P(b)$ are arbitrary, we have $P(x) \leq P(a)$ and $P(y) \leq P(b)$. Note that we could have $P(x) = P(a)$ or $P(y) = P(b)$, in which case the corresponding exchange below is trivially cost-neutral. Exchange the positions in $T$ of $a$ and $x$ to produce tree $T'$, and then exchange the positions of $b$ and $y$ to produce tree $T''$, so that $x,y$ are sibling leaves of maximum depth.
The difference in cost between $T$ and $T'$ (where $depth_T(x_i)$ is the depth of element $x_i$ in tree $T$) is: $$B(T) - B(T') = \sum_i^n P(x_i)\cdot depth_T(x_i) - \sum_i^n P(x_i)\cdot depth_{T'}(x_i)$$ $$B(T) - B(T') = P(x) \cdot depth_T(x) + P(a)\cdot depth_T(a) - P(x) \cdot depth_{T'}(x) - P(a) \cdot depth_{T'}(a)$$ $$B(T) - B(T') = P(x) \cdot depth_T(x) + P(a)\cdot depth_T(a) - P(x) \cdot depth_T(a) - P(a) \cdot depth_T(x)$$ $$B(T) - B(T') = (P(a)-P(x))\cdot (depth_T(a) - depth_T(x)) \geq 0$$ since both $P(a) - P(x)$ and $depth_T(a) - depth_T(x)$ are non-negative: $P(a) - P(x)$ because $x$ is a minimum-probability leaf, and $depth_T(a) - depth_T(x)$ because $a$ is a leaf of maximum depth in $T$. By a similar argument, exchanging $y$ and $b$ does not increase the cost, so $B(T') - B(T'')$ is non-negative too. Thus $B(T'') \leq B(T)$, and since $T$ is optimal we also have $B(T) \leq B(T'')$, giving $B(T) = B(T'')$. Thus $T''$ is an optimal tree in which $x,y$ are sibling leaves of maximum depth, proving the lemma.

### Proof 2. Optimal Substructure Property: Merging the two lowest-frequency elements and solving for an optimal tree on the remaining elements is also optimal for the original problem

**Lemma:** Let $X$ be a set of $n$ elements such that each element $x_i$ has probability $P(x_i)$ of occurring, with $\sum_{i=1}^n P(x_i) = 1$. Let $x,y$ be the two elements of $X$ having the lowest probabilities. Let $X'$ be the set $X$ with the elements $x,y$ removed and a new element $z$ added, so that $$X' = X - \{x,y\} \cup \{z\}$$ Define the probabilities of the elements of $X'$ to be the same as for $X$, except with $P(z) = P(x)+P(y)$. Let $T'$ be any tree representing an optimal prefix code for the set $X'$. Then the tree $T$, obtained from $T'$ by replacing the leaf node for $z$ with an internal node having $x,y$ as children, represents an optimal prefix code for $X$.
Note that the cost function of the tree $T$ is defined as $$ B(T) = \sum_{i=1}^n P(x_i) \cdot depth(x_i)$$

**Proof:** First we express the cost $B(T)$ of tree $T$ in terms of the cost $B(T')$ of tree $T'$. For each element $x_i \in X - \{x,y\}$, we have $depth_T(x_i) = depth_{T'}(x_i)$, and thus $P(x_i) \cdot depth_T(x_i) = P(x_i) \cdot depth_{T'}(x_i)$. Because $depth_T(x) = depth_T(y) = depth_{T'}(z)+1$, we have $$ P(x)\cdot depth_T(x) + P(y) \cdot depth_T(y) = (P(x) + P(y)) \cdot (depth_{T'}(z)+1)$$ $$ P(x)\cdot depth_T(x) + P(y) \cdot depth_T(y) = P(z) \cdot depth_{T'}(z) + (P(x) + P(y))$$ Thus we can see that $$B(T) = B(T') + P(x) + P(y)$$ i.e. $$B(T') = B(T) - P(x) - P(y)$$ Now we prove the lemma by contradiction. Suppose that $T$ does not represent an optimal prefix code for the set $X$. Then there exists an optimal tree $T''$ such that $B(T'') < B(T)$. By the previous lemma (*Proof 1. Greedy Choice Property*), we may assume $T''$ has $x,y$ as siblings. Let $T'''$ be the tree $T''$ with the common parent of $x,y$ replaced by a leaf $z$ with probability $P(z) = P(x) + P(y)$. Then $$B(T''') = B(T'') - P(x) - P(y)$$ $$B(T''') < B(T) - P(x) - P(y) = B(T')$$ contradicting the assumption that $T'$ is an optimal prefix code for $X'$. Thus $T$ must be an optimal prefix code for the set $X$.
Some examples for *How much information does one observation give you?* include - Fair coin flip: 1 yes/no question is needed - English Letters: frequent *e* versus a rare *z*, means that if we observe a *z* we gain more information - The "It snowed in Los Angeles" observation is more informative than the "It snowed in Alaska" observation Note that we will extend this later to what is the *minimal representation* needed to encode information, taking into account that if on average we see *more frequent items*, we should encode that with *shorter representations* saving space and bandwidth. For instance, it is very common in image processing to encode a row or a block of pixels, with all the same color, by just denoting how many pixels have that color and the color. Some formats (including optional Windows Bitmap (BMP) variants) support this compressed representation. Suppose that we have an uncompressed row: ```[5, 5, 5, 5, 5, 2, 2, 2, 7, 7, 7, 7, 7, 7, 7, 7]``` with 5 pixels of color #5, 3 pixels of color #2, and 8 pixels of color #7. We can encode this intuitively in the format ```(number of pixels, color)``` giving us ```[(5,5), (3,2), (8,7)]``` greatly reducing our representation. ## Self-Information from Axioms We are going to define self-information of event $x$, $I(x)$ such that it fulfills the following four axioms, and then prove that the definition fulfills these four axioms. 1. Information is non-negative 2. Less probable events carry more information - By extension, events that are *guaranteed* should have *no* information content, i.e. non-informative 3. Independent events should have additive information $I(x,y) = I(x) + I(y)$ 4. The information function $I(x)$ should be continuous in the probability $P(x)$. Great, now we are going to prove that our hypothesis for the self-information function is correct, which is: $$ I(x) = -log(P(x)) \quad P(x) \in (0,1]$$ since $P(x)$ is defined as the probability of event $x$ occurring. 
First let's start by stating the $-log(P(x))$ is continuous in the range $P(x) \in (0,1]$. We can prove this rigorously formally, but it may not be in the reader's best interest with respect to time. However, for some intuition, we know that the exponential function is continuous, and that the logarithmic function is simply the inverse of the exponential $$ log_a(b) = c \leftrightarrow a^c = b$$ Additionally, observe **Figure 1** showing the function $-log(x)$ plotted in Desmos showing how when $x \in (0,1]$ $-log(x)$ is a continuous, monotonically decreasing function. Formally, this can be proven using properties of the exponential since the exponential (with positive exponents) is a *continuous, monotonically increasing* function. **Note** Figure 1 uses $-log(x)$ and not $-log(P(x))$ since it's easier to plot online, however assume that this is *actually* $-log(P(x))$. <!--  <center> $\textbf{Figure 1:}$ Empirically Showing $-log(x)$ Continuous & Monotonically Decreasing <\center> --> <center> <img src="/media/images/20260313_231828_Screenshot_2026-03-13_at_4.02.06_PM.png" width="600"> $\textbf{Figure 1:}$ Empirically Showing $-log_2(x)$ Continuous & Monotonically Decreasing </center> Great additionally, the $I(x) = -log(P(x))$ is shown in **Figure 1** to satisfy: 1. Information is non-negative 2. Less probable events carry more information Some demonstrative examples include - Event with $P(x) = 1 \rightarrow I(x) = -log(1) = 0$ i.e. no information - Event with $P(x) = 0 \rightarrow I(x) = -log(0) = \infty$ i.e. infinite information from an impossibly rare event - Event with $P(x) = \frac{1}{2} \rightarrow I(x) = -log_2(\frac{1}{2}) = log_2(2) = 1$ information with probability $\frac{1}{2}$ has information of 1. To show that *3. Independent events should have additive information.* $I(x,y) = I(x) + I(y)$ suppose that we have two independent events $x$ and $y$ and that they both occur. The information from these should add. 
We can see that from $P(x,y)$ defined as $P(x,y) = P(x)P(y)$ $$-log(P(x,y)) = -log(P(x)P(y)) = -log(P(x)) -log(P(y)) = I(x) + I(y) = I(x,y)$$

### How do we Quantify Information? What's a Bit?

A bit is the unit of information obtained when using a base-2 logarithm. Thus $$I(x) = -log_2(P(x))$$ measures information in **bits**. Then we have that for $P(x) = \frac{1}{2}$, $I(x) = -log_2(\frac{1}{2}) = log_2(2) = 1$, which corresponds to **one fair binary decision**. Note that if we defined it with a base $e$ logarithm (natural log), the unit of information would be the **nat**. However, we will stick with bits, since they are ubiquitous and easy to work with. A bit, defined colloquially, is a single digit that can be represented by a 0 or a 1. As we can see, given two values we want to represent, each with an equal probability $P(x) = \frac{1}{2}$ of occurring, in a base-2 system we can represent this with 1 digit, meaning that we can represent it with **one bit** of information. For example, if there are 4 equally likely outcomes, each with $P(x) = \frac{1}{4}$, identifying which occurred requires $$ I(x) = -log_2(\frac{1}{4}) = log_2(4) = 2\ bits$$ i.e. two binary decisions (yes/no questions). We can model each binary decision as a 0 or a 1, giving us the **binary representations** of the numbers 0 through 3:

- 0 = Binary 00
- 1 = Binary 01
- 2 = Binary 10
- 3 = Binary 11

**Fun Aside!** As a fun aside, suppose that we want to see how much information we can represent in our base-10 number system! Instead of imagining how many yes/no binary questions it takes to identify one of $n$ equally likely outcomes, imagine we can ask a question and get one of ten different answers. Thus, using a base-10 definition of information: $$ I(x) = -log_{10}(\frac{1}{n}) = log_{10}(n)$$ which, rounded up, is how many digits it takes to represent a number in our standard system.
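These bit and digit counts can be checked numerically with a short sketch (the helper names are mine, for illustration):

```python
import math

def self_information(p, base=2):
    """I(x) = -log(P(x)) for P(x) in (0, 1]; measured in bits when base=2."""
    if not 0 < p <= 1:
        raise ValueError("P(x) must be in (0, 1]")
    return -math.log(p, base)

def questions_needed(n_outcomes):
    """Whole yes/no questions needed to single out one of n equally likely outcomes."""
    return math.ceil(math.log2(n_outcomes))

print(self_information(0.5))      # 1.0 -> one fair binary decision
print(self_information(0.25))     # 2.0
print(questions_needed(4))        # 2 -> binary codes 00, 01, 10, 11
print(math.ceil(math.log10(50)))  # 2 ten-way questions (decimal digits) for 50 outcomes
```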
Great, let's assume that we want to represent the first 15 numbers, i.e. $n=15$; we should get that it takes two digits $$ I(x) = -log_{10}(\frac{1}{15}) = log_{10}(15) \approx 1.176 $$ Since we can only use whole digits, we round $1.176$ up to $2$, which matches the two digits we need to represent 15! Pretty neat.

## Entropy as Expected Self-Information

We defined self-information $I(x) = -log(P(x))$ as the amount of information gained when observing an event with probability $P(x)$ from a set. Now we are going to define the quantity **Entropy**, which is the expected amount of information gained when observing an event from the same set. Formally, entropy $H(x)$ is the expected amount of information from observing a random event $x$ from our set $$ H(x) = \mathbb{E}_{x \sim P}[I(x)] = \mathbb{E}_{x \sim P}[-log(P(x))] = \sum_{x_i} -log(P(x_i))P(x_i) = -\sum_{x_i} P(x_i)log(P(x_i))$$ As we can see from **Figure 2**, where we plot the entropy $H(x)$ of a binary random variable against its probability $P(x)$ of showing one outcome (say heads in a coin toss), entropy is *maximized* when $P(x) = \frac{1}{2}$.

<center> <img src="/media/images/20260313_215150_Screenshot_2026-03-13_at_2.51.37_PM.png" width="600"> $\textbf{Figure 2:}$ Entropy of a Binary Random Variable $x$, $H(x)$ with $P(x)$ ranging from 0 to 1 plotted versus $H(x)$ </center>

### Derivation: Uniform Distributions have Maximum Entropy

In general, the entropy $H(x)$ of a random variable is maximized when the variable has a uniform distribution. We can show this by maximizing $H(x)$ subject to the constraint that $\sum_{x_i} P(x_i) = 1$, since all probabilities in the set must add up to 1.
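Before deriving this, a quick numeric sanity check of the claim (the helper name is mine):

```python
import math

def entropy(probs):
    """H(x) = -sum p * log2(p) over a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform))  # 2.0 -> maximal for 4 outcomes (log2 of 4)
print(entropy(skewed))   # about 1.357, strictly less than the uniform case
```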
Using the Lagrange Multipliers technique from Multi-Variable Calculus, we have that for our objective function $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$, constraint function $g(x): \mathbb{R}^n \rightarrow \mathbb{R}$, constraint value $c \in \mathbb{R}$, and Lagrange Multiplier $\lambda$, optimizing the following function $h(x) : \mathbb{R}^n \rightarrow \mathbb{R}$ optimizes $f(x)$ subject to $g(x) = c$. $$ h(x) = f(x) - \lambda (g(x) -c)$$ Note that for our problem we simplify the nomenclature such that $P(x_i) = p_i\ \forall i \in [1,n]$ (assume that we have a discrete distribution with $n$ events). The setup for our optimization problem, finding the maximum entropy $H(x)$ such that $\sum_i p_i = 1$, is

- $f(x)$ is $H(x) = -\sum_{i=1}^n p_i log(p_i)$
- $g(x)$ is $\sum_{i=1}^n p_i$
- $c$ constraint is $1$

Giving us that $h(x)$, the function we are maximizing, is: $$ h(x) = -\sum_{i=1}^n p_i log(p_i) - \lambda(\sum_{i=1}^n p_i - 1)$$ Great, let's find $\frac{\partial h}{\partial p_i}$ for one particular $p_i$; the same pattern applies to each $\frac{\partial h}{\partial p_i}\ \forall i$ $$ \frac{\partial h}{\partial p_i} = -(log(p_i)+1) - \lambda \cdot 1$$ $$ \frac{\partial h}{\partial p_i} = -(log(p_i)+1) - \lambda$$ Great, now let's set $\frac{\partial h}{\partial p_i} = 0$ to find the extremum. Note that we know this is a maximum since the Hessian of the entropy $H(x)$ is Negative Definite, see **Appendix**. $$ 0 = -(log(p_i)+1) - \lambda \rightarrow log(p_i) = -(\lambda +1)$$ Using the log identity $log_a(b) = c \leftrightarrow a^c = b$ (with the natural log here) $$p_i = e^{-(\lambda + 1)}$$ Great, we found that $$ \frac{\partial h}{\partial p_i} = 0 \rightarrow p_i = e^{-(\lambda + 1)}$$ i.e. the maximizing $p_i = e^{-(\lambda + 1)}$ for every partial derivative. Note that this value does not depend on $i$.
Hence $p_i$ is constant and equal to $e^{-(\lambda + 1)}$ for all $i$. Thus, by definition, if all the probabilities in our discrete distribution are the same and required to sum to one, the distribution is a **uniform distribution**. Thus, we have shown that the discrete distribution that maximizes entropy $H(x)$ is the uniform distribution. Note that, in contrast, the discrete distribution that minimizes entropy $H(x)$ is one where a single event is guaranteed to occur with probability $1$.

## Entropy for Data Compression

Now let's shift from the question *How much information is in one observation?* to the related question *How few bits do we need to represent observations?*. Recall from earlier our examples of how many uniformly distributed numbers we can represent with bits, and the Windows Bitmap (BMP) example compressing image information. Continuing this thought process, how do we on average send the shortest code to represent information (a variable-length code)? To solve the problem of *when does one word end and another begin?* we will be using **prefix-free codes**. A code is **prefix-free** if no codeword is a prefix of any other codeword. An example is:

- {0, 10, 110, 111}: no codeword is a prefix of another, and they are of variable length

This greatly simplifies our analysis and derivation. Additionally, we can visualize any binary prefix-free code as a binary tree, where the leaves are codewords and the internal nodes are decisions. As a result, the length of a codeword is the depth of its leaf in the tree. For the derivation we will use binary prefix-free codes, since we have already motivated the use of $log_2()$ in our derivations. Please see the **Appendix** for a proof of **Kraft's Inequality** & an explanation of why prefix codes can *always* achieve the optimal data compression of any character encoding.

### Constructing Huffman Trees

To construct a Huffman Tree, we are going to

1. Describe the algorithm
2. Describe the runtime
3.
Prove correctness

**The Algorithm** To start with some definitions: this is for a **prefix-free code**, so we are constructing a tree where each leaf node holds one codeword. We are minimizing the following objective $$ B(T) = \sum_{i=1}^n P(x_i) \cdot depth(x_i)$$ where $P(x_i)$ is the probability of event $x_i$ occurring, of which there are $n$, and $depth(x_i)$ is the depth of element $x_i$ in the tree. The objective $B(T)$ states that we want to minimize the expected depth of a leaf in the tree, i.e. the expected codeword length. Additionally:

- $X$ will be the set of $n$ elements $\{x_1, x_2, ... x_n\}$
- The queue $Q$ will be a min-binary heap keyed on $P(x_i)$, where $Q$ is composed of elements $x_i$
- $Extract-Min(Q)$ removes and returns the min element from the queue $Q$
- $Insert(Q,z)$ inserts element $z$ into queue $Q$
- $Heapify(X) \rightarrow Q$ turns the set of elements $X$ into min-binary heap $Q$
- $NewNode()$ creates a new node with left and right children ($.left$, $.right$ respectively), and $P(NewNode())$ can be assigned

**Algorithm:**

```
Q = Heapify(X)
for i=1 to n-1:
    z = NewNode()
    z.left = x = Extract-Min(Q)
    z.right = y = Extract-Min(Q)
    P(z) = P(x) + P(y)
    Insert(Q, z)
return Extract-Min(Q) # return the root node of the tree formed
```

**Runtime:** The $Heapify(X)$ operation takes $O(n)$ time with $n$ elements in $X$. Our for loop executes $n-1$ times, with each $Extract-Min(Q)$ and $Insert(Q,z)$ operation taking $O(log(n))$ time, for a total of $O(n\ log(n))$ time. Thus the total runtime is $$O(n) + O(n\ log(n)) \rightarrow O(n\ log(n))$$ **Correctness:** As we can see from the algorithm, this is a **greedy** algorithm, so to prove correctness we will use the greedy choice property & optimal substructure. To do this we will prove two lemmas, which when combined prove the algorithm correct:

1. Greedy Choice Property: choosing the two lowest-frequency elements to add to the tree is always optimal
2.
Optimal Substructure Property: merging the two lowest-frequency elements and solving for an optimal tree over the remaining elements is also optimal for the original problem

Please see the **Appendix** for both proofs, as there is not enough space here for both. However, given that we have proven both of these, we can see that the algorithm is correct: we first choose the lowest-frequency elements to add to the tree we are constructing (using the min-heap), and then, when we add them, we merge their frequencies into node $z$. What this does is build the tree such that the *lowest frequency* elements have the *largest depth*, by iteratively choosing the minimum-frequency elements to sit at the bottom of the current subtree (as children of $z$). This minimizes our objective $B(T) = \sum_{i=1}^n P(x_i) \cdot depth(x_i)$. Intuitively, if we were to (wrongly) place the *highest frequency* items at the *greatest depth* and the *lowest frequency* items at the *least depth*, this would increase $B(T)$.

### Depth of Element $x_i$ in Huffman Tree and Self-Information $I(x_i)$

We have now proved that the Huffman algorithm produces the tree that minimizes the expected code length $$B(T) = \sum_{i=1}^n P(x_i) \cdot \text{depth}(x_i)$$

### Entropy $H(x)$ of Huffman Trees

A natural question is: **how close is each leaf's depth to the self-information** $I(x_i) = -\log_2 P(x_i)$? **Important clarification:** It is **not** true in general that $\lfloor I(x_i) \rfloor \leq \text{depth}(x_i) \leq \lceil I(x_i) \rceil$ holds for **every** symbol in a Huffman tree. (In some distributions, especially when one symbol is very probable, rare symbols can receive shorter codes than this range would suggest.) However, we can prove something more useful: **on average**, the weighted code lengths (depths) are within 1 bit of the entropy $H(x)$.
This is the key insight that directly connects the Huffman tree cost $B(T)$ to the entropy $H(x)$. **Proof** (Using Auxiliary Shannon code) First, consider a simple auxiliary prefix-free code called the **Shannon code**. We define its codeword depths as $$ depth_S(x_i) = \lceil I(x_i) \rceil = \lceil -\log_2 P(x_i) \rceil $$ 1. By definition of the ceiling function, we have $I(x_i) \leq depth_S(x_i) < I(x_i) + 1$ (each Shannon codeword is at most 1 bit longer than the ideal self-information depth). 2. These depths satisfy Kraft’s inequality (see Appendix): $$ \sum_{i=1}^n 2^{-depth_S(x_i)} \leq \sum_{i=1}^n 2^{-I(x_i)} = \sum_{i=1}^n P(x_i) = 1 $$ Therefore a valid binary prefix-free code with exactly these depths exists. 3. The expected depth of this Shannon code is $$ Depth_S = \sum_{i=1}^n P(x_i) \cdot depth_S(x_i) < \sum_{i=1}^n P(x_i) \cdot (I(x_i) + 1) = H(x) + 1 $$ Now recall two facts we already established: - The Huffman code achieves the **optimal** (shortest possible) expected depth among **all** prefix-free codes (see the Greedy Choice Property and Optimal Substructure Property lemmas + correctness proof). - Therefore $B(T)_{\text{Huffman}} \leq Depth_S$. Combining everything we get: $$ B(T)_{\text{Huffman}} \leq Depth_S < H(x) + 1 $$ We also know from the noiseless coding theorem (please read separately) and Kraft’s inequality that **any** prefix-free code satisfies $$ B(T) \geq H(x) $$ Putting both bounds together gives the result: $$ H(x) \leq B(T)_{\text{Huffman}} < H(x) + 1 $$ The average number of bits per symbol in a Huffman encoding is **within 1 bit** of the theoretical lower bound given by the entropy $H(x)$. In other words, the depths produced by the Huffman algorithm are **on average** an excellent approximation of the self-information $I(x_i)$ of each leaf. 
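The bound can be checked empirically with a small heapq-based sketch of the greedy procedure described earlier (a toy implementation under my own naming, on an assumed example distribution):

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Build a Huffman tree over a {symbol: probability} dict and
    return the depth (code length) of each symbol's leaf."""
    # Heap entries: (probability, unique tiebreak, {symbol: depth so far})
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)  # two least-probable subtrees
        p2, _, d2 = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # assumed example
depths = huffman_code_lengths(probs)
H = -sum(p * math.log2(p) for p in probs.values())     # entropy H(x)
B = sum(probs[s] * depths[s] for s in probs)           # Huffman cost B(T)
print(depths)   # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
assert H <= B < H + 1
```

For dyadic probabilities (powers of $\frac{1}{2}$) like these, $B(T)$ equals $H(x)$ exactly; in general it lands somewhere in the $[H(x), H(x)+1)$ band.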
This is why Huffman coding compresses well: frequent symbols get short codes (close to their low $I(x_i)$), rare symbols get longer codes (close to their high $I(x_i)$), and the average code length stays very close to the information-theoretic optimum given by the entropy.

### Prefix Encoding and Distribution Sampling Equivalence

Self-information has two equivalent meanings: $$ I(x_i) = -\log_2 P(x_i) $$

- Information gained when observing $x_i$
- Approximate depth (yes/no questions) needed to reach $x_i$ in an optimal prefix tree

Entropy has two equivalent meanings: $$H(x) = -\sum P(x_i) \log_2 P(x_i) $$

- Expected information from one observation
- Expected depth (average bits) in an optimal prefix code

Prefix-tree traversal and sampling from the PMF are two views of the same thing.

- **Sampling**: Start at the root; at each node go left/right with probability equal to the subtree's mass → reaching a leaf = sampling a symbol
- **Decoding**: Start at the root; follow bits (0/1) → reaching a leaf = decoding a symbol

The Huffman tree encodes **both** because it builds the optimal paths using the same probabilities. The tree is a single structure that:

- Compresses data near-optimally
- Lets you efficiently sample from the distribution

The probability mass function tells you **how often** each symbol appears; the Huffman tree tells you **how to reach** each symbol (by bits or by weighted choices). As we can see, they are two representations of the exact same information.

<center> <video controls style="max-width: 80%;"> <source src="/media/videos/20260314_221441_03_14_2026_15_12_32_crack_snail_huffman.mp4"> </video> </center> <center> $\textbf{Video 1:}$ Demonstration of the Equivalence of Entropy $H(x)$ for Huffman Codes and Probability Mass Functions </center>

---

## Appendix

### Entropy $H(x)$ has a Negative Definite Hessian Matrix

The entropy function $$ H(p) = -\sum_{i=1}^n p_i \log p_i $$ is defined on the interior of the probability simplex ($\sum p_i = 1$, $p_i > 0$).
**Hessian computation** The partial derivatives are: $$ \frac{\partial H}{\partial p_k} = -\log p_k - 1 $$ The second partial derivatives give the Hessian matrix: $$ \frac{\partial^2 H}{\partial p_j \partial p_k} = \begin{cases} -\dfrac{1}{p_k} & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases} $$ So the Hessian is diagonal: $$ \operatorname{Hess}(H) = \operatorname{diag}\left( -\frac{1}{p_1},\ -\frac{1}{p_2},\ \dots,\ -\frac{1}{p_n} \right) $$ **Negative definiteness** For any $p_i > 0$ (interior of the simplex), each diagonal entry satisfies $$ -\frac{1}{p_k} < 0. $$ For any nonzero vector $v \in \mathbb{R}^n$, the quadratic form is $$ v^T \operatorname{Hess}(H) \, v = -\sum_{k=1}^n \frac{v_k^2}{p_k} < 0, $$ since each term $\frac{v_k^2}{p_k} \geq 0$ and at least one is strictly positive ($v \neq 0$). Therefore the Hessian is **strictly negative definite** on the relative interior of the probability simplex. ### Kraft Inequality Derivation Kraft's inequality states: For any prefix-free binary code with codeword lengths $l_1, l_2, \dots, l_n$ $$ \sum_{i=1}^n 2^{-l_i} \leq 1 $$ Any set of codeword lengths that satisfies the prefix condition must obey Kraft's inequality, and conversely, any set of positive integer lengths $l_i$ that satisfies the inequality can be realized by some prefix-free binary code. #### Proof of the necessity direction (prefix-free $\rightarrow$ Kraft inequality) Consider a full binary tree where each codeword corresponds to a leaf at depth $l_i$. Because the code is prefix-free, no codeword is a prefix of another → all codewords are leaves, and no leaf is an ancestor of another leaf. Imagine we assign to each leaf a "weight" of $2^{-l_i}$, where $l_i$ is its depth. Now consider all nodes at every level of the tree: - At depth 0 (root): weight = 1 - At depth 1: two children, each can contribute at most $1/2$ - At depth 2: four possible nodes, each at most $1/4$ - And so on. 
Because the code is prefix-free, the subtrees below different codeword leaves are disjoint, and the entire set of leaves "covers" some portion of the possible binary strings without overlap. The total weight contributed by all leaves is therefore at most the weight of the root: $$ \sum_{i=1}^n 2^{-l_i} \leq 1 $$ This is because each internal node's weight is split equally between its two children (each gets half), and the leaves collect all the weight that reaches them; the total weight never exceeds 1 at any level, and the prefix-free property ensures no overlapping subtrees. Thus, any prefix-free code must satisfy Kraft's inequality.

#### Proof of the sufficiency direction (Kraft inequality ⇒ exists a prefix-free code)

Suppose we are given positive integers $l_1 \leq l_2 \leq \dots \leq l_n$ such that $$ \sum_{i=1}^n 2^{-l_i} \leq 1 $$ We can construct a prefix-free code with exactly these lengths using a greedy tree-building procedure (similar in spirit to Huffman, but simpler):

1. Sort the lengths in non-decreasing order.
2. Build a binary tree level by level.
3. At each step, assign the next codeword to the first available leaf position (in lexicographic order) at the required depth.
4. Because the Kraft sum ≤ 1, there will always be enough unused nodes at each depth to place the next codeword without violating the prefix condition.

A more formal constructive proof uses induction on $n$:

- Base case ($n=1$): $2^{-l_1} \leq 1$ is always true for $l_1 \geq 0$. Assign the all-zero string of length $l_1$.
- Inductive step: Assume the claim holds for $n-1$ symbols. Take the longest length $l_n$. There must be at least one binary string of length $l_n$ that does not have any of the shorter codewords as a prefix (because otherwise the Kraft sum would exceed 1). Assign this string as the codeword for the last symbol. The remaining $n-1$ codewords form a prefix-free code on a subtree, and by induction they can be assigned validly.
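The greedy assignment above can be sketched with the standard canonical-code construction (a hypothetical helper; the proof itself does not depend on this code):

```python
def kraft_sum(lengths):
    """Sum of 2^(-l) over the codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Canonical construction: if the Kraft sum is <= 1, greedily assign
    the lexicographically next available codeword at each length."""
    assert kraft_sum(lengths) <= 1, "Kraft inequality violated"
    code, codes, prev_len = 0, [], 0
    for l in sorted(lengths):
        code <<= (l - prev_len)              # move down to depth l
        codes.append(format(code, f"0{l}b")) # l-bit binary string
        code += 1                            # next sibling at this depth
        prev_len = l
    return codes

print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```

Each codeword is the previous one plus one, shifted down to the required depth; the Kraft condition guarantees we never run out of unused strings.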
Thus, the inequality is also sufficient: if the Kraft sum ≤ 1, then a prefix-free binary code with those exact lengths exists.

## Huffman Tree Construction Proofs

### Proof 1. Greedy Choice Property: Choosing the two lowest-frequency elements is always optimal

**Lemma:** Let $X$ be a set of $n$ elements such that each element $x_i$ has probability $P(x_i)$ of occurring, and assume that $\sum_{i=1}^n P(x_i) = 1$. Let $x,y$ be the two elements in $X$ having the lowest probabilities. Then there exists an optimal prefix code for $X$ in which the codewords for $x$ and $y$ have the same length and differ only in the last bit. Note that the cost function of the tree $T$ is defined as $$ B(T) = \sum_{i=1}^n P(x_i) \cdot depth(x_i)$$ **Proof:** The objective is to take the tree $T$ representing an arbitrary optimal prefix code and modify it into a tree representing another optimal prefix code, such that $x,y$ are sibling leaves of maximum depth in the new tree. If such a tree can be constructed, then the codewords for $x$ and $y$ will have the same length and differ only in the last bit. Let $a,b$ be two elements of $X$ that are sibling leaves of maximum depth in $T$. Assume that $P(a) \leq P(b)$ and $P(x) \leq P(y)$. Since $P(x)$ and $P(y)$ are the lowest leaf probabilities, and $P(a), P(b)$ are two arbitrary probabilities, we have $P(x) \leq P(a)$ and $P(y) \leq P(b)$. Note that we could have $P(x) = P(a)$ or $P(y) = P(b)$; however, that would trivially make the lemma true, hence assume $P(x) \neq P(b)$. Exchange the positions in $T$ of $a$ and $x$ to produce tree $T'$, and then exchange the positions of $b$ and $y$ to produce tree $T''$, such that $x,y$ are sibling leaves of maximum depth.
The difference in cost between $T$ and $T'$ ($depth_T(x_i)$ is the depth of element $x_i$ in tree $T$) is: $$B(T) - B(T') = \sum_i^n P(x_i)\cdot depth_T(x_i) - \sum_i^n P(x_i)\cdot depth_{T'}(x_i)$$ $$B(T) - B(T') = P(x) \cdot depth_T(x) + P(a)\cdot depth_T(a)- P(x) \cdot depth_{T'}(x) - P(a) \cdot depth_{T'}(a)$$ $$B(T) - B(T') = P(x) \cdot depth_T(x) + P(a)\cdot depth_T(a) - P(x) \cdot depth_T(a) - P(a) \cdot depth_T(x)$$ $$B(T) - B(T') = (P(a)-P(x))\cdot (depth_T(a) - depth_T(x)) \geq 0$$ since both $P(a) - P(x)$ and $depth_T(a) - depth_T(x)$ are non-negative: $P(a) - P(x)$ is non-negative since $x$ is a minimum-probability leaf, and $depth_T(a) - depth_T(x)$ is non-negative since $a$ is a leaf of maximum depth in $T$. By a similar argument, exchanging $y$ and $b$ does not increase the cost, so $B(T') - B(T'')$ is non-negative too. Thus we have $B(T'') \leq B(T)$, and since $T$ is optimal we also have $B(T) \leq B(T'')$, giving us $B(T) = B(T'')$. Thus $T''$ is an optimal tree in which $x,y$ are sibling leaves of maximum depth, proving the lemma.

### Proof 2. Optimal Substructure Property: Merging the two lowest-frequency elements and solving for an optimal tree over the remaining elements is also optimal for the original problem

**Lemma:** Let $X$ be a set of $n$ elements such that each element $x_i$ has probability $P(x_i)$ of occurring, and assume that $\sum_{i=1}^n P(x_i) = 1$. Let $x,y$ be the two elements in $X$ having the lowest probabilities. Let $X'$ be the set $X$ with the elements $x,y$ removed and a new element $z$ added, so that $$X' = X - \{x,y\} \cup \{z\}$$ Define the probabilities of the elements of set $X'$ to be the same as for $X$, except with $P(z) = P(x)+P(y)$. Let $T'$ be any tree representing an optimal prefix code for the set $X'$. Then the tree $T$, obtained from $T'$ by replacing the leaf node for $z$ with an internal node having $x,y$ as children, represents an optimal prefix code for $X$.
Note that the cost function of the tree $T$ is defined as $$ B(T) = \sum_{i=1}^n P(x_i) \cdot depth(x_i)$$ **Proof:** First we show how to express the cost $B(T)$ of tree $T$ in terms of the cost $B(T')$ of tree $T'$. For each element $x_i \in X - \{x,y\}$, we have $depth_T(x_i) = depth_{T'}(x_i)$, and thus $P(x_i) \cdot depth_T(x_i) = P(x_i) \cdot depth_{T'}(x_i)$. Because $depth_T(x) = depth_T(y) = depth_{T'}(z)+1$, we have $$ P(x)\cdot depth_T(x) + P(y) \cdot depth_T(y) = (P(x) + P(y)) \cdot (depth_{T'}(z)+1)$$ $$ P(x)\cdot depth_T(x) + P(y) \cdot depth_T(y) = P(z) \cdot depth_{T'}(z) + (P(x) + P(y))$$ Thus we can see that $$B(T) = B(T') + P(x) + P(y)$$ i.e. $$B(T') = B(T) - P(x) - P(y)$$ Now we prove the lemma by contradiction. Suppose that $T$ does not represent an optimal prefix code for set $X$. Then there exists an optimal tree $T''$ such that $B(T'') < B(T)$. By the previous lemma (*Proof 1. Greedy Choice Property*), we may assume $T''$ has $x,y$ as siblings. Let $T'''$ be the tree $T''$ with the common parent of $x,y$ replaced by a leaf $z$ with probability $P(z) = P(x) + P(y)$. Then $$B(T''') = B(T'') - P(x) - P(y)$$ $$B(T''') < B(T) - P(x) - P(y) = B(T')$$ giving us a *contradiction*, since $T'$ was assumed to be an optimal prefix code for $X'$. Thus $T$ must be an optimal prefix code for set $X$.