Anatomy of Discrete Entropy
Getting a start by studying Shannon’s famous formula
Let’s begin with perhaps the simplest version of entropy: the entropy of a finite discrete random variable, i.e. a measurable map $X:\Omega \rightarrow E$ where $\Omega = (\Omega, \Sigma_\Omega, \mathbb{P})$ is a probability space, $E = (E, \Sigma_E)$ is a measurable space, and the support of $X$, $\text{supp}(X) \subseteq E$, is finite.
This is essentially the same as Shannon’s original definition [Sha48], where $X$ represented a random symbol, i.e. a random variable $X$ taking values in a finite alphabet of symbols $x$, each emitted from a communication source with probability $\mathbb{P}_X(\{x\}) \in [0, 1]$, where $\mathbb{P}_X := \mathbb{P} \circ X^{-1}$ is the pushforward of $\mathbb{P}$ onto $E$.
The Formula
\[\begin{equation} H(X) = \sum_{x \in \text{supp}(X)} \mathbb{P}_X(\{x\}) \log\left(\frac{1}{\mathbb{P}_X(\{x\})}\right)\label{eq:entropy-classic} \end{equation}\]

We’ve written this formula a bit differently from how it’s usually written, with all its parts clearly visible, so we can study it carefully and methodically. Now, as we study it, let’s pretend to look with fresh eyes, trying to let go of all we have learned, as best we can 1.
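To fix ideas before we look afresh, here’s a minimal numerical sketch in Python (the helper name and the toy distributions are our own, purely illustrative) that evaluates \eqref{eq:entropy-classic} directly:

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy per the formula above: sum of p * log(1/p)."""
    assert abs(sum(probs) - 1.0) < 1e-9, "atomic masses must sum to 1"
    # Points outside the support (p = 0) contribute nothing by definition,
    # since the sum ranges over supp(X) only.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))  # ~0.469 bits: a biased coin is less surprising
```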
Fresh-Eyed Observations
First, notice that the left-hand side notation indicates that $H$ is a function of a random variable, $H = H(X)$ 2. On the right-hand side, we have a composition of functions, which can be decomposed in various ways.
The innermost functions on the right-hand side appear to depend on the support values $x \in \text{supp}(X)$. However, they also appear to depend on the atomic events $\{x\} \in \Sigma_E$ 3. When the support is finite, there’s a bijection between the support values and the atomic events, so these two views are equivalent, but for more general random variables, this might not be the case!
Notice that the sum is discrete because the support of $X$ is finite, which naturally suggests a passage to integrals when $X$ is continuous 4.
Now, the function $S_X(\,\cdot\,) = \log(\frac{1}{\mathbb{P}_X(\,\cdot\,)})$ looks interesting. \[\begin{equation} S_X:\Sigma_E \xrightarrow{\mathbb{P}_X} [0, 1] \xrightarrow{\frac{1}{(\,\cdot\,)}} [1, \infty] \xrightarrow{\log(\,\cdot\,)} [0, \infty] \label{eq:surprisal-rv} \end{equation}\]
What is that thing? It must be a surprise! ; ) 5 Naively, the sum looks like an average of $S_X$, viewed as a function of $X$ (thinking of the law of the unconscious statistician). Is $S_X$ itself a random variable? 6
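To make both readings concrete, here’s a small Python sketch (the toy law and names are ours): it builds $S_X$ as the literal three-map pipeline from \eqref{eq:surprisal-rv}, restricted to singleton events, and then checks by simulation that the sample mean of $S_X \circ X$ approaches the sum in \eqref{eq:entropy-classic}.

```python
import math
import random

# A toy law P_X on E = {"a", "b", "c"}, given by its atomic masses.
P_X = {"a": 0.5, "b": 0.25, "c": 0.25}

# The surprisal pipeline, restricted to singleton events {x}:
# Sigma_E --P_X--> [0,1] --1/(.)--> [1,inf] --log--> [0,inf]
def S_X(x):
    return math.log2(1.0 / P_X[x])  # surprisal of the event {x}, in bits

# H(X) as the weighted sum of surprisals over the support: 1.5 bits here.
H = sum(p * math.log2(1.0 / p) for p in P_X.values())

# Reading the sum as an expectation: the sample mean of the derived
# random variable S_X(X) converges to H(X) (LOTUS + law of large numbers).
random.seed(0)
xs = random.choices(list(P_X), weights=list(P_X.values()), k=100_000)
print(H, sum(S_X(x) for x in xs) / len(xs))  # 1.5  ~1.5
```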
Nothing appears to stop us from writing $p_i = \mathbb{P}_X(\{x_i\})$ in \eqref{eq:entropy-classic} and instead viewing $H$ as a function $H = H(P)$ where $P = (p_1,\dots)$ is the probability vector with entries $p_i$. This opens up the possibility of viewing $H$ as a function of probability vectors, or probability measures 7.
In fact, it’s not hard to see that if $X, Y:\Omega \rightarrow E$ are two discrete random variables with probability vectors $P = (p_1, \dots), Q = (q_1, \dots)$ that differ only by a permutation of coordinates, then $H(X) = H(Y)$. In the finite discrete case, this is the same as saying that the “laws” of $X, Y$, i.e. the pushforward distributions $\mathbb{P}_X, \mathbb{P}_Y$, assign the same atomic masses, counted with multiplicity 8. This strengthens the case for thinking of $H(P)$ as a function of a probability distribution.
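A quick numerical check of this permutation invariance, reusing the hypothetical `entropy` helper sketched earlier:

```python
import math

# Permuting the probability vector reshuffles which symbol carries which
# mass, but leaves the multiset of atomic masses, and hence H, unchanged.
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.125, 0.5, 0.25, 0.125]  # a permutation of P
print(math.isclose(entropy(P), entropy(Q)))  # True
```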
These observations have raised some natural questions that we will use to guide our own unique journey through the landscape of modern information theory and its applications. These guides will be our stars, and the connections we draw between them, our constellations.
Drawing Constellations
Constellations are made up of objects of natural beauty, over which humans lay meaning, organization, and structure. Looking up at the constellations we know in our night sky, and thinking of the archetypal characters the ancients hung on them, it’s not hard to see that this was a creative act. There are so many figures that can be drawn across the night sky, so many images to project onto them, and there is no right or true way to draw them.
Yet once drawn, they are hard to unsee, and they are alive. Once drawn, they served as references and guides to our ancestors of many cultures, who used them to navigate, sometimes through seemingly impossible journeys. Imagine what it must have been like to set out on a journey across an unknown ocean, with only starlight shining to guide the way.
Before we can begin to draw a constellation, the first star must be found. What star shall we look to?
The First Star
If entropy is fundamental to the concept of information we’ve inherited from Shannon, then probability is just as (if not more) fundamental 9. Does it all begin with probability then?
Our working hypothesis is that the new move was Shannon’s conceptual step of interpreting information as surprise, together with the way he chose to quantify it: the surprisal function $S$ 10. This humble function, in its various forms, quantitatively realizes information as surprise, and so it is fundamental to information theory as it’s currently understood.
We haven’t made the case for this here — making it properly requires studying surprisal on its own terms, which is where we’re headed next. But we can already see from the anatomy above that $S$ is doing interesting structural work inside the entropy formula, and that it raises questions (Is it a random variable? Why does its expected value deserve a name?) that the formula alone doesn’t answer.
So we’ll study surprisal in detail beginning at the level of events, and consider how it gets along with the structure of the $\sigma$-algebra of events. We’ll play around with notions of conditional and relative surprisal of events, and see where it leads us. We’ll look at surprisal as a function $S$, and try to understand why entropy is thought of as expected surprise, and what that means.
For us, this is the first star. It all starts with surprise!
References
[Sha48] Shannon, Claude E., "A Mathematical Theory of Communication", Bell System Technical Journal, 1948.
[Har28] Hartley, Ralph V. L., "Transmission of Information", Bell System Technical Journal, 1928.
[Goo85] Good, I. J., "Weight of Evidence: A Brief Survey", in Bayesian Statistics 2, Bernardo, J. M., DeGroot, M. H., Lindley, D. V., and Smith, A. F. M., Eds., North-Holland, 1985.
[Sam51] Samson, Edward W., "Fundamental Natural Concepts of Information Theory", 1951.
[Tri61] Tribus, Myron, "Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, with Engineering Applications", D. Van Nostrand, 1961.
We note that whatever observations we make are highly dependent on the form this formula was written in (which was chosen deliberately). ↩︎
Indeed, that’s how one is often introduced to the definition of Shannon entropy. ↩︎
In this case, the atomic events are just the singletons, but for a general probability measure, this is not the case. Atomic events can be characterized measure-theoretically as sets of “minimal measure”, i.e. measurable sets $A \subseteq E$ with $\mu(A) > 0$ such that, for any measurable subset $B \subseteq A$, either $\mu(B) = \mu(A)$ or $\mu(B) = 0$. They can also be characterized lattice-algebraically as minimal nonzero elements in the partial order determined by set inclusion, i.e. elements $A \in \Sigma_E$ with $A \neq \bot$ such that if $B < A$ then $B = \bot$. Crucially, in general the lattice-algebraic definition (for sets of positive measure) implies, but is not implied by, the measure-theoretic one – the latter allows for proper measurable subsets $\emptyset \neq B \subsetneq A$ as long as they have measure zero. For the discrete case, however, they are equivalent. ↩︎
Yes, it’s hard to unsee what we’ve already seen. ↩︎
Cringey pun fully intended! ↩︎
We’re often told entropy is expected surprise, but we’re also conditioned (pun not intended) to think of expectations as only making sense with respect to some random variable. We do note that in general measure theory one can define the expected value of a function $f:E\rightarrow \mathbb{R}$ with respect to some measure $\mu$ on $E$, $\mathbb{E}[f] = \int f\,d\mu$, as long as the integral makes sense. Such functions are often called “observables” and show up in physics a lot. For example, $E$ might be the classical phase space, and $\mu$ might be the Liouville measure. This notion of the expected value of an observable gets even more general in the theory of $C^\ast/W^\ast$ algebras, which leads into deeper waters we may later revisit. ↩︎
Note that $H(P)$ is invariant under permutations of the indices $i$. ↩︎
By pushforward we mean the pushforward measure $\mathbb{P} \mapsto \mathbb{P}\circ X^{-1} := \mathbb{P}_X$. This pushing forward from the measure $\mathbb{P}$ on $\Omega$ to the measure $\mathbb{P}_X$ on $E$ is induced by the $\sigma$-algebra (lattice) morphism, itself induced by the inverse image of $X$ as a measurable map: \[\begin{equation} \mathbb{P}_X: \Sigma_E \xrightarrow{X^{-1}} \Sigma_\Omega \xrightarrow{\mathbb{P}} [0, 1] \end{equation}\] ↩︎
By atoms we mean minimal elements of the $\sigma$-algebra as a lattice. Then “equal on atoms” $\mathbb{P}_X \sim \mathbb{P}_Y$ means that the atomic masses assigned by $\mathbb{P}_X$ and $\mathbb{P}_Y$ are the same, counted with multiplicity. Equivalently, the multisets of values taken by the restrictions of these maps to the set of atoms $\mathcal{At}_E \subseteq \Sigma_E$ are equal. In the finite discrete case $\lvert E \rvert < \aleph_0$, the atoms generate $\Sigma_E$, and a probability measure is determined by its values on atoms together with finite additivity. This totally fails in the uncountable case $\lvert E \rvert > \aleph_0$, for example when $E = (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ has the Borel $\sigma$-algebra: there the atoms are the singletons, and a continuous distribution assigns each of them measure zero, so its values on atoms determine nothing. Further note that “equality on atoms” $\mathbb{P}_X \sim \mathbb{P}_Y$ is sufficient, but not necessary, for equality of the entropies $H(X) = H(Y)$ – distinct multisets of atomic masses can yield the same numerical entropy. ↩︎
One can argue that entropy predates Shannon’s communication-theoretic framing of information (indeed, it emerged in statistical mechanics), and probability clearly predates both, though the latter framing (and subsequent developments) really abstracted and clarified it. ↩︎
The quantity $\log(1/p)$ appeared before Shannon in various guises – in the aggregate $\sum p_i \log(1/p_i)$ of Boltzmann-Gibbs statistical mechanics, in Hartley’s $\log M$ for equiprobable alternatives [Har28], and in Turing and Good’s “bans” – log-probability units for Bayesian weight of evidence in Enigma codebreaking at Bletchley Park, c. 1940 (see [Goo85]). But none of these isolated $\log(1/p)$ as the information content of a single event with non-uniform probability and built a theory around that interpretation – that was Shannon’s contribution in 1948 [Sha48]. Shannon himself did not name the quantity; the term “surprisal” was introduced later by Samson [Sam51] and popularized in physics by Tribus [Tri61]. ↩︎