Information Entropy

Let us begin by clarifying the basic idea. Entropy can usually be understood intuitively as a measure of how disordered a system is, while information entropy measures the uncertainty in the output of a random information source. The greater the entropy, the harder it is to predict what the source will produce before we observe it.

But what exactly do we mean by uncertainty? Suppose I give you an eight-bit binary number that is always the same value, say $01101100$ . Its output is completely certain, so its information entropy is $0$ . If it can also produce another value, such as $01111000$ , with some probability, then the uncertainty increases. Information entropy is a way of quantifying precisely this uncertainty.

In lossless compression, entropy gives the theoretical lower bound on the average code length. In other words, if the entropy of an information source is $H(X)$ , then in the long run no lossless encoding can have an average code length below that value. With a suitable coding method, however, we can approach this lower bound over sufficiently long sequences.

Put simply, you can think of it as a theoretical target for various techniques.

Following the standard procedure, here is the formula first:

$H(X) = -\sum_{i=1}^np(x_{i})\log_2 p(x_{i})$

Perhaps because I learned this on my own, I never had a particularly clear understanding of it. I read quite a lot and still did not really get it. Recently, though, I came across the topic again and it suddenly clicked. I feel that many explanations begin from a rather unhelpful place, although perhaps I was simply focusing on the wrong thing. Either way, enough rambling. Let us get back to the point.

We certainly should not begin with the formula. Deriving a formula from the properties of that same formula would be rather pointless, so forget the formula for now.

Consider this binary string I typed at random:

01101100

There are eight binary digits. If each one can be either $0$ or $1$ , then there are $2^8$ possible strings. In other words, those eight positions can represent that many possible states. Nothing controversial so far.

1	1010000000000010

Now consider sixteen bits. There are $2^{16}$ possibilities. Writing the number of states as a power of two makes it easy to see that it grows exponentially. We also have $\log_{2} 2^n = n$ , where $n$ is the number of bits. After taking the logarithm, the result $n$ is exactly the number of bits needed to distinguish among those $2^n$ possible states. This brings the representation of all those possibilities back onto a linear scale.

Following the same line of thought, let us keep looking for a linear relationship and begin with the simplest case: a uniform distribution. Once again, taking a negative logarithm gives

$-\log_2{\frac{1}{2^n}} = n$

This is the amount of information carried by the occurrence of that result. Because the distribution is uniform here, the value is $n$ , meaning that we have actually received $n$ bits of information.

Now consider the opposite case. Suppose the source can output only one particular $n$ -bit number, so that its probability is $1$ . The negative logarithm is then $-\log_{2}1 = 0$ . We receive no information because we already knew what the result would be.

The amount of information obtained by taking the negative logarithm of a probability is called self-information:

$I(x)=-\log_2 p(x)$

Measuring self-information alone is clearly not enough, because it tells us only how much information is conveyed by one particular outcome. For an entire information system, an extremely unlikely outcome may have very large self-information even though the system’s overall entropy is small.

We therefore need to update how we think about the whole system. Before observing it, we know that the system may occupy one of several discrete states and that each state has a corresponding probability. For a binary system, for example, all we know in advance is that it outputs $0$ with probability $P(0)$ and $1$ with probability $P(1)$ . Only after observing it do we know whether the result is $0$ or $1$ . A less probable result is harder to predict, so observing it gives us more self-information.

Once this is clear, the meaning of entropy follows naturally. Return to the original formula and you can see that entropy is simply the expected value of self-information:

$H(X) = -\sum_{i=1}^np(x_{i})\log_2 p(x_{i})=E[I(X)]$