A Note About Negative Information

Todd M. Gureckis · 7 minutes to read

Every once in a while I encounter questions that make me feel like I don't understand anything at all.

Mike Frank recently asked on Twitter:

Hey stats/ML folks: say I am estimating the parameters of a properly specified model from a sequence of examples using bayesian inference. I calculate information gain (IG) on the posterior over params after each example. Is IG guaranteed to decrease monotonically?

I have studied models of this type for several years now, so I should know the answer. (I should have learned by now that confidence is a warning sign that I should shut up.)

I suppose the answer to Mike's question is really the answer to the question "can information gain ever be negative?" You can find the answer to this in several places online, and it says "No." The most common explanation is that $I(Y;X) = H(Y) - H(Y|X)$, which is asserted to always be non-negative since $H(Y) \ge H(Y|X)$.

That seems intuitive: if knowing X gives you no information about variable Y, then the information should be zero, and if X gives any information at all, then it should be positive. That's why I replied on Twitter, making a fool of myself by saying that information gain has to be non-negative, even contradicting my own papers.

For example, in Coenen, Nelson & Gureckis (2019) we have a nice table illustrating different changes to a posterior distribution and how different measures of "information change" score them:

The one negative case for information gain (the IG column) is where the posterior becomes wider after learning something (also true for probability gain, the PG column). However, this clearly implies the existence of negative information gain! And it also makes sense... if you learn that the distribution of a variable has higher variance, then you have more entropy than before! Maybe my paper is wrong?

I'd be inclined to chalk this up as another example of why it is good to distrust things you read online. However, information gain is related to the concept of mutual information (MI). MI is a measure of the "amount of information" obtained about one random variable by observing another random variable. In fact, the linked Wikipedia article describes them as the same thing. And they are!

The mutual information between variables X and Y, written I(X;Y), can be expressed as

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

Furthermore, Wikipedia confidently states that using Jensen's inequality you can show that $I(X;Y) \ge 0$. Shit. Now I'm worried: if mutual information and information gain are the same thing, and Jensen's inequality proves that mutual information is non-negative, does my paper have a problem?

Then Fred Callaway (smart gent) came along, running simulations in Julia and such showing that negative information gain can occur when you start with an incorrect prior for the distribution of data you encounter. Good, that's what our little illustrations implied too, but then why doesn't the math and "Jensen's inequality" apply here?

At this point I started to feel like I was in a hall of mirrors, so I figured I should write a note to sort it out. There are several intuitions here that each seem kind of right but are mathematically in conflict. I love this type of situation because I'm pretty sure I can trust the math to sort things out for me if I can just figure it out. So what's the story?

As best I can tell, the issue is just inconsistent terminology across different fields as well as the difference between "subjective" and "empirical" information gain. I want to lay it out here in case anyone else ends up confused.

Act 1: Information theory on measurements

The entropy of a random variable X is defined as

$$H(X) = -\sum_x p(x) \log p(x)$$

The conditional entropy is the uncertainty left in a random variable X after knowing another random variable Y. First, we need the entropy of X given a specific value of Y:

$$H(X|Y=y) = -\sum_x p(x|y) \log p(x|y)$$

then we average over the different realized values of y:

$$H(X|Y) = \sum_y p(y)\, H(X|Y=y)$$
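To make these definitions concrete, here is a minimal sketch in Python (using numpy; the joint distribution over X and Y is made up purely for illustration) that computes H(X), H(X|Y=y), and H(X|Y):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A made-up joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])

p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

# H(X|Y) = sum_y p(y) * H(X|Y=y)
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(len(p_y)))

print("H(X)   =", entropy(p_x))    # about 0.97 bits
print("H(X|Y) =", H_X_given_Y)     # about 0.71 bits; conditioning reduces entropy on average
```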

Next we need the concept of relative entropy, or Kullback-Leibler (KL) divergence, which measures how far a distribution p(x) is from another distribution q(x):

$$D(p\,||\,q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
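A minimal sketch of this in Python (numpy again; the two distributions are made up for illustration). Note that $D(p\,||\,q) \ne D(q\,||\,p)$ in general, which is one reason KL divergence is not a true distance:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.3, 0.3])
print(kl_divergence(p, q))  # ~0.48 bits
print(kl_divergence(q, p))  # ~0.55 bits: a different number, so not symmetric
print(kl_divergence(p, p))  # 0.0: no divergence from itself
```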

Finally, mutual information is relative entropy applied to the case of two random variables X and Y with joint distribution p(x,y) and marginal distributions p(x) and p(y). In other words, the mutual information I(X;Y) is

$$I(X;Y) = D\big(p(x,y)\,||\,p(x)\,p(y)\big) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
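Continuing the sketches above (same made-up joint distribution and the same entropy and kl_divergence helpers), we can compute the mutual information directly as the KL divergence between p(x,y) and p(x)p(y), and check that it matches H(X) − H(X|Y):

```python
import numpy as np

# Mutual information as relative entropy between the joint distribution
# and the product of the marginals (flattened into vectors).
mi = kl_divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel())

print("I(X;Y)        =", mi)                          # ~0.26 bits
print("H(X) - H(X|Y) =", entropy(p_x) - H_X_given_Y)  # same number, as derived next
```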

There is one property of mutual information that we need to explore which we can get to with a little algebra:

$$
\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
       &= \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)} \\
       &= -\sum_{x,y} p(x,y) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y) \\
       &= -\sum_{x} p(x) \log p(x) - \Big(-\sum_{x,y} p(x,y) \log p(x|y)\Big) \\
       &= H(X) - H(X|Y)
\end{aligned}
$$

Ok, we made it! So from our definition of mutual information, we arrive at this final expression, which is a difference of entropy values. This is known as information gain: it is how much the entropy of variable X is reduced by knowing variable Y.

Under this treatment, mutual information (and therefore information gain) is always non-negative (via Jensen's Inequality, page 6 has a proof). End of story.
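For completeness, here is a compact version of that standard argument, sketched from memory (so treat it as a reminder rather than a careful proof). Because log is concave, Jensen's inequality gives

$$
-I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x)p(y)}{p(x,y)} \;\le\; \log \sum_{x,y} p(x,y)\, \frac{p(x)p(y)}{p(x,y)} \;=\; \log \sum_{x,y} p(x)p(y) \;\le\; \log 1 = 0,
$$

where the sums run over pairs with $p(x,y) > 0$, so $I(X;Y) \ge 0$, with equality exactly when $p(x,y) = p(x)p(y)$ everywhere (i.e., when X and Y are independent).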

Act 2: Reasoning about belief changes

Ok so what about cognitive science? What do we mean here? Generally, in cognitive science, we are talking about Bayesian updating. For example, Mike's original question is about Bayesian inference and posteriors. This situation is more like the one in the figure from my paper where there is a probability distribution over some parameter (let's call it θ). We are then talking about two distributions, the prior p(θ) and the posterior p(θ|d) which are related in a specific way (Bayes' rule):

$$p(\theta|d) = \frac{p(d|\theta)\,p(\theta)}{p(d)}$$
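As a running example for this section, here is a tiny discrete Bayesian update in Python (numpy; the grid of θ values, the prior, and the Bernoulli likelihood are all made up for illustration):

```python
import numpy as np

# Hypothetical setup: theta is the bias of a coin, restricted to three values,
# and the prior is fairly confident the coin rarely comes up heads.
theta = np.array([0.1, 0.5, 0.9])
prior = np.array([0.8, 0.1, 0.1])

def posterior(prior, theta, d):
    """p(theta | d) for a single Bernoulli observation d in {0, 1}."""
    likelihood = theta if d == 1 else 1.0 - theta
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

print(posterior(prior, theta, d=1))  # a "surprising" heads spreads belief out across theta values
```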

Lindley (1956) wrote an influential paper "On a measure of the information provided by an experiment" which has informed a lot of what people think about when they consider changes in uncertainty and information with Bayesian learning. This was a statistics paper applied to understanding the informativeness of hypothetical experiments. The idea was to start with a prior, say p(θ) and then you could perform some measurement in the world which might change your posterior belief. The question in this work is to quantify how much information different experiments might provide, mostly as a way to plan a research project.

Following the notation above, but using Lindley's general setup, we start with the prior uncertainty in some hypothesis or parameter θ, defined as

$$g_0 = H(\theta) = -\sum_\theta p(\theta) \log p(\theta)$$

(Note: we include the negative sign here, which Lindley explains he drops in his analysis.) Then, if we perform some experiment and observe data D=d, we have our posterior uncertainty:

$$g_1(d) = H(\theta|D=d) = -\sum_\theta p(\theta|D=d) \log p(\theta|D=d)$$

Lindley then talks about the information gain (IG) provided by this experiment E, given an observed outcome d, as:

$$g(E, p(\theta), d) = IG(\theta; D=d) = g_0 - g_1(d) = H(\theta) - H(\theta|D=d)$$

However, for planning purposes we are often not interested in the information provided by one specific outcome. Instead we are interested in the average information provided by an experiment:

$$
\begin{aligned}
g(E, p(\theta)) = EIG(\theta; D) &= \mathbb{E}_d\big[g_0 - g_1(d)\big] \\
 &= \sum_d p(d)\, IG(\theta; D=d) \\
 &= \sum_d p(d)\, H(\theta) - \sum_d p(d)\, H(\theta|D=d) \\
 &= H(\theta) - \sum_d p(d)\, H(\theta|D=d) \\
 &= H(\theta) - H(\theta|D)
\end{aligned}
$$
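Using the little coin example from above (same made-up prior and likelihood), we can compute both quantities: the information gain for a specific observed outcome, which can go negative, and the expected information gain averaged over the prior predictive p(d), which comes out positive:

```python
def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain(prior, theta, d):
    """IG(theta; D=d) = H(theta) - H(theta | D=d), in bits."""
    return entropy(prior) - entropy(posterior(prior, theta, d))

# Information gain for each specific outcome:
# d=0 (expected tails) is positive; d=1 (surprising heads) is negative.
for d in (0, 1):
    print(f"IG for d={d}: {info_gain(prior, theta, d):+.3f} bits")

# "Subjective" expected information gain: average over the prior
# predictive p(d) = sum_theta p(d | theta) p(theta).
p_heads = np.sum(theta * prior)
p_d = {0: 1.0 - p_heads, 1: p_heads}
eig = sum(p_d[d] * info_gain(prior, theta, d) for d in (0, 1))
print(f"Expected IG: {eig:+.3f} bits")  # positive, consistent with Lindley's Theorem 1
```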

Note then that the expected information gain provided by an experiment is now defined analogously to what was called both information gain and mutual information in the section above.

I believe this is where the terminology difference arises. Information gain in the statistics/cognitive science literature refers to the information gained from specific outcomes of a hypothetical experiment, whereas in the information theory literature (talking about measurement) information gain is more about the mutual information between random variables.

"Subjective" versus "Empirical" information

There is a separate issue with "empirical" expected information gain, distinct from this "subjective" calculation. Empirically, negative information gain can easily occur, and can occur even on average, when the prior and the data are a poor match. The guarantee of non-negative expected information gain applies only to the subjective analysis. The difference is that the p(d) you use in the "subjective" calculation is derived from your prior itself (it is the prior predictive distribution) rather than determined independently by the world.

Anyway, in this context things start to make more sense. For example, Lindley (1956) notes that $g_0 - g_1(d)$, the information provided by the outcome of one particular experiment, does not have to be non-negative. To quote Lindley, "This can happen when a "surprising" value ... occurs; granted the correctness of the experimental technique, the "surprise" may result in our being less sure about θ than before the experiment" (page 991). So this is negative information gain, but with the caveat of this "special" meaning of information gain.

Another interesting point is that Lindley provides a proof (page 990, Theorem 1) that $g(E, p(\theta)) \ge 0$. He writes "provided the density of d varies with θ, any experiment is informative, on the average" (page 991). The proof is a bit opaque to me (it relies heavily on a result from an even older paper). However, the subjective interpretation of expected information gain provides at least an intuition: surprising events (those largely inconsistent with your prior) are by definition unlikely under your prior and so are down-weighted significantly. Jonathan Nelson alerted me to a paper by Crupi et al. (2018) which explores this issue in much more depth.
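One way to see why the theorem should hold, at least as a sketch (assuming the p(d) used in the average is the prior predictive $p(d) = \sum_\theta p(d|\theta)\,p(\theta)$): the subjective expected information gain is exactly a mutual information computed under the joint distribution built from your own prior and likelihood,

$$
g(E, p(\theta)) = H(\theta) - H(\theta|D) = I(\theta; D) = D\big(p(\theta, d)\,||\,p(\theta)\,p(d)\big) \ge 0,
$$

with $p(\theta, d) = p(d|\theta)\,p(\theta)$, so the non-negativity result from Act 1 applies directly. The guarantee evaporates as soon as the data are averaged under some distribution other than this prior predictive.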

These types of proofs, though, apply only to the "subjective" analysis. Empirically, it is possible to obtain negative information gain (in the Bayesian/cognitive sense) and even negative information gain on average. Fred's simulations set up a case where the prior is so far off that every data point experienced is at first "surprising," and this remains true even averaging over many random samples of data points.
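Here is a minimal version of that kind of simulation, continuing the coin example (I am assuming the true bias of the coin is 0.9 while the prior still puts most of its mass on 0.1): averaging the information gain over data drawn from the true distribution, rather than from the prior predictive, comes out negative.

```python
rng = np.random.default_rng(0)

true_theta = 0.9   # the world, which the prior considers very unlikely
n_samples = 10_000

# "Empirical" average IG: average over data drawn from the true distribution,
# not from the prior predictive implied by the (badly misspecified) prior.
draws = rng.random(n_samples) < true_theta
empirical_avg_ig = np.mean([info_gain(prior, theta, int(d)) for d in draws])
print(f"Empirical average IG: {empirical_avg_ig:+.3f} bits")  # negative: most outcomes are "surprising"
```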

The intuition for why mutual information is non-negative stems from something different: a second random variable can either tell you nothing about a first random variable or tell you something.

To sum up

In sum, the information gain talked about in much of cognitive science has an uncanny resemblance to the "core" concept of information gain/mutual information from information theory, but the terminology is not 1-to-1. This can lead to confusion when you try to rely on standard machine learning or information theory textbook definitions to answer questions of the type Mike raised. Perhaps the biggest difference has to do with the notion of subjective expected information gain, which looks like mutual information but should be interpreted slightly differently. In addition, the difference between the "subjective" and "empirical" analyses is important. If you are basing your expectation on your current knowledge, then certain mathematical guarantees hold that might not be true in practice.

Hopefully, this is helpful to avoid getting yourself confused like me.

Thanks to Fred Callaway, Zhiwei Li, and Jonathan Nelson for helpful discussions as I worked through this.

References

  1. Coenen, A., Nelson, J. D., & Gureckis, T. M. (2019). Asking the right questions about the psychology of human inquiry: Nine open challenges. Psychonomic Bulletin & Review, 1548–1587. https://doi.org/10.3758/s13423-018-1470-5
  2. Lindley, D. V. (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4), 986–1005.
  3. Crupi, V., Nelson, J. D., Meder, B., Cevolani, G., & Tentori, K. (2018). Generalized Information Theory Meets Human Cognition: Introducing a Unified Framework to Model Uncertainty and Information Search. Cognitive Science, 42, 1410–1456. https://doi.org/10.1111/cogs.12613
· informationtheory, technote