Shannon entropy: how to measure information mathematically
A deep dive into self-information and information entropy with a few thought experiments.
Hello fellow machine learners,
Last week, we (finally) wrapped up our discussion of the Support Vector Machine. I hope you enjoyed reading those articles as much as I enjoyed writing them!
This week, we will discuss the topic of information entropy. I remember learning about this topic at university. I also remember not actually understanding it.
The explanations we were given in lectures felt quite confusing to me, in part because we were trying to ascribe mathematical formulas to rather abstract concepts like information and uncertainty.
This article is my best attempt at explaining it, and will lay the foundation for an ML algorithm that we'll discuss next week.
These concepts were first mathematically formalised by Claude Elwood Shannon in his revolutionary 1948 paper titled "A Mathematical Theory of Communication". This is the main reason why Shannon is regarded as the founding father of information theory. As well as this, Anthropic's Claude AI model was named after Shannon. So basically, he's a pretty big deal.
With all that said, let's get stuck in!¹
How do we "measure" information?
Suppose you have a die that you know is heavily biased toward landing on the number 6. If you rolled the die and had to predict what number it would land on, what would you predict? Well, 6 of course. And in doing so, you'd probably be right most of the time. You already know the bias of the die, and so you won't be surprised by the outcomes of its rolls.
What if, instead, the die is fair, i.e. each number is equally likely to be rolled? Well, you could guess any number and, chances are, you would be wrong more times than you'd be right. After all, for each roll of the die you would only have a one-in-six chance of guessing the number correctly. That is, most of the time, you will be surprised when seeing the outcome of its rolls.
Thus, we can consider the roll of the second die to provide us with more information than the roll of the first die. We could think of information as follows:
The amount of information an event provides can be quantified by how surprising the event's occurrences are. The more surprising an outcome, the more information you gain from seeing the outcome.
Suppose that an event occurs with probability p. If we wanted to mathematically define the information gained from observing the outcome that occurs with probability p, how would we do it? Whatever function we plan on using, we would probably want the following conditions to hold:
1. The function should be continuous. The value of information gained (or lost) should change smoothly as the value of p changes.
2. If p=1, then our information function should output the value 0. If an event occurs 100% of the time, we do not gain any information at all from observing its outcome, i.e. no surprise.
3. The total information gained from two independent events should equal the sum of the information gained from each individual event.
4. The more unlikely an event is, the more surprising its outcomes are. So the function should be large when p is small, and small when p is large.
These criteria are satisfied precisely by the so-called self-information function

\(I(p) = -\log_2(p).\)
We will explain the use of the base-2 logarithm specifically later. For now, we can check that this function satisfies our requirements:
Continuity: the logarithm is a continuous function.²
Zero information for a deterministic outcome: when p=1, the logarithm takes the value 0 as desired.
Joint events: suppose that we have two independent events which have probabilities p_1 and p_2 of occurring respectively. Then we have by the additive property of the logarithm that
\[\begin{aligned} I(p_1p_2) & = -\log_2(p_1p_2) \\ & = -(\log_2(p_1) + \log_2(p_2)) \\ & = -\log_2(p_1) - \log_2(p_2) \\ & = I(p_1) + I(p_2). \end{aligned}\]
Here is what the function looks like, demonstrating that condition 4 is indeed satisfied:

[Figure: the graph of \(I(p) = -\log_2(p)\) on \(0 < p \le 1\), which grows without bound as p approaches 0 and decreases to 0 at p = 1.]

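If you'd like to poke at this yourself, here is a minimal sketch in plain Python (the function name self_information is my own choice, and the probabilities are just illustrative) that evaluates the self-information function and numerically checks conditions 2, 3 and 4:

```python
import math


def self_information(p: float) -> float:
    """Self-information, in bits, of an outcome that occurs with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)


# Condition 2: a guaranteed outcome carries no information.
assert self_information(1.0) == 0.0

# Condition 3: information from two independent events adds up.
p1, p2 = 0.5, 0.25
assert math.isclose(self_information(p1 * p2),
                    self_information(p1) + self_information(p2))

# Condition 4: the rarer the outcome, the more surprising (informative) it is.
for p in (0.9, 0.5, 1 / 6, 0.01):
    print(f"p = {p:.3f} -> I(p) = {self_information(p):.3f} bits")
```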
Information entropy
Claude Shannon formalised the idea of information entropy mathematically. Given a set of possible events whose probabilities of occurrence are given by the distribution P=(p_1, p_2, …, p_n), the information entropy of the set of events is the expected value of the self-information across the distribution. The formula is given by

\(H(P) = \sum_{i=1}^{n} p_i I(p_i) = -\sum_{i=1}^{n} p_i \log_2(p_i).\)
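To make this concrete, here is a short Python sketch (standard library only; the helper name entropy and the example distributions are mine) that computes the entropy of the two dice from the earlier thought experiment:

```python
import math


def entropy(probabilities) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing (convention: 0 * log 0 = 0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_die = [1 / 6] * 6
biased_die = [0.9, 0.02, 0.02, 0.02, 0.02, 0.02]

print(entropy(fair_die))    # log2(6) ≈ 2.585 bits: every roll is surprising
print(entropy(biased_die))  # ≈ 0.70 bits: we are rarely surprised
```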
Suppose that the random variable X follows a Bernoulli distribution with parameter p. That is, it takes the value 1 with probability p, and takes the value 0 with probability 1-p. The entropy of the corresponding distribution is

\(H(p) = -p\log_2(p) - (1-p)\log_2(1-p).\)
So the information entropy is maximised when p=0.5, which corresponds to the fair coin toss example. In this case, the information entropy value is

\(H(0.5) = -\tfrac{1}{2}\log_2\!\left(\tfrac{1}{2}\right) - \tfrac{1}{2}\log_2\!\left(\tfrac{1}{2}\right) = 1.\)
Conversely, the minimum value that entropy can take is 0. In the Bernoulli distribution case, this happens when either p=1 or p=0 (using the standard convention that \(0 \log_2 0 = 0\)).
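Here is a quick way to see this numerically: a small Python sketch (again, the function name bernoulli_entropy is my own) that sweeps p from 0 to 1:

```python
import math


def bernoulli_entropy(p: float) -> float:
    """Entropy, in bits, of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0  # a guaranteed outcome has zero entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:.2f} -> H = {bernoulli_entropy(p):.3f} bits")
# The values rise from 0, peak at exactly 1 bit at p = 0.5, then fall back to 0.
```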
Why base 2?
Suppose that I am in a room with a coin and a light switch, and my friend in a separate room wants to know whether the coin lands on heads or tails when I toss it, without leaving his room. I am not allowed to communicate with him in any way except through the audible sound of the light switch flipping. However, we are allowed to discuss a strategy before we each go into our separate rooms.
The question is as follows: what is the most efficient way I can communicate to my friend whether the coin landed on heads or tails?
One way to do this is with Morse code. I could toss the coin and then flip the light switch to spell out either "heads" or "tails" in Morse code. This would certainly inform my friend of the outcome of the coin toss…
…but can we be even more efficient?
We could agree on the following before breaking off: if the coin lands on heads, I flip the switch just once; otherwise, I don't flip the switch at all. Then my friend can enter his room and just wait for, say, 60 seconds before knowing what has happened.
In making the single decision of whether or not to flip the light switch, I have transmitted one bit of information to my friend. A bit is the smallest measurable unit of information, and it can take one of two values: 1 or 0. Here, these values correspond to either flipping the light switch or leaving it be.
This is far more efficient than using Morse code to spell out letters of the alphabet. In fact, this one-bit approach is the most information-efficient way in which we could've communicated the outcome of the coin toss. Put differently, I need at least one bit of information to communicate the coin toss outcome to my friend.
Simple communication tasks like this can be encoded using binary. The binary system is a number system that uses only 0s and 1s, rather than the ten digits 0-9 of the decimal system that we use more commonly. Each new column in a binary number represents the next power of 2, so our regular numbers like 3, 12, 1028496839, etc. all have binary representations. For example, 12 in binary is 1100, i.e. 8 + 4.
So to work out how many bits are needed to store a value, we need to know the highest power of 2 we'll need, and this is provided precisely by the base-2 logarithm. Don't forget that log_a(b) means "what power do I need to raise the value a to in order to get the value b?".
Thus, the base-2 logarithm is used in the self-information and information entropy formulas because it tells us how many bits are needed to store the desired information, and bits are the standard unit of measurement for information. Different bases tell you how many units are needed in a different number system. For example, the natural logarithm, which uses the number e as its base, measures information in units of "nats" (short for natural units of information).
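As a rough illustration, here is a small Python sketch (the outcome counts 2, 6 and 26 are my own examples for the coin, the die and the alphabet; 1028496839 is the number from above) connecting the base-2 logarithm to bit counts, and showing the bits-to-nats conversion:

```python
import math

# How many bits do we need to distinguish between n equally likely values?
for n in (2, 6, 26, 1028496839):
    print(f"{n} outcomes -> log2(n) = {math.log2(n):.3f}, "
          f"so {math.ceil(math.log2(n))} bits suffice")

# The same quantity in different units: nats just swap in the natural logarithm.
p = 0.5
print(-math.log2(p), "bits")  # 1.0 bits
print(-math.log(p), "nats")   # ≈ 0.693 nats (1 bit = ln 2 nats)
```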
Packing it all up
This concludes another MLAU article! I did my best to explain this rather abstract topic. I hope things make sense, but if not, then please do let me know in the comments. Feel free to roast me by DM too. I'm not fussy.
As always, here is a roundup of what we covered:
The information gained from seeing the outcome of an event can be quantified by measuring how surprising the event's outcome is. The self-information function provides the perfect mathematical formulation for this.
Information entropy is given by the expected value of information provided by the outcomes of a probability distribution.
Information entropy is maximised for discrete uniform distributions, and is minimised when the outcome of an event is guaranteed. The more skewed a distribution is, the less surprising its outcomes will be on average.
The base-2 logarithm tells us how many bits are needed to store the desired input information. Different bases measure the information storage size in the corresponding number system. Pretty cool, huh?
Training complete!
I hope you enjoyed reading as much as I enjoyed writing!
Do leave a comment if you're unsure about anything, if you think I've made a mistake somewhere, or if you have a suggestion for what we should learn about next.
Until next Sunday,
Ameer
PS… like what you read? If so, feel free to subscribe so that you're notified about future newsletter releases:
Sources
Claude Shannon's original paper, which introduced this idea of information theory: https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
"Information Theory, Inference and Learning Algorithms", by David MacKay: https://www.inference.org.uk/mackay/itila/book.html
An excellent online blog post that helped me better understand information theory: https://mbernste.github.io/posts/entropy/
¹ If anyone knows of a better segue phrase, I'm all ears! Maybe something more specific to the context of ML?
² Real Analysis lecture notes from my alma mater, if you really want to get into the analysis weeds: https://warwick.ac.uk/fac/sci/maths/people/staff/keith_ball/anal_ii_notes_2025.pdf