A Note About Negative Information
Todd M. Gureckis · 7 minutes to read

Every once in a while I encounter questions that make me feel like I don't understand anything at all.
Mike Frank recently asked on Twitter:
Hey stats/ML folks: say I am estimating the parameters of a properly specified model from a sequence of examples using bayesian inference. I calculate information gain (IG) on the posterior over params after each example. Is IG guaranteed to decrease monotonically?
I have studied models of this type for several years now, so I should know the answer. (I should have learned by now that confidence is a warning sign that I should shut up.)
I suppose the answer to Mike's question comes down to asking "can information gain ever be negative?" The answer you find in several places online is "No." The most common explanation is that, on average, learning the value of one variable can never increase your uncertainty about another. That seems intuitive: if knowing one variable somehow left you more uncertain about a second, you could always just ignore it.
For example, in Coenen, Nelson & Gureckis (2019) we have a nice table illustrating different changes to a posterior distribution and how different measures of "information change" score them:
The one negative case for information gain (the IG column) is where the posterior becomes wider after learning something (also true for probability gain, the PG column). However, this clearly implies the existence of negative information gain! And it also makes sense... if learning something leaves you with a wider distribution over a variable, then you have more entropy than before! Maybe my paper is wrong?
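To make this concrete, here is a minimal sketch (mine, not from the paper; the prior and likelihood numbers are invented for illustration). A two-hypothesis Bayesian learner with a confident prior observes a datum that favors the other hypothesis, and the posterior entropy ends up higher than the prior entropy, i.e. negative information gain:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Confident prior over two hypotheses {A, B} (values made up for illustration)
prior = [0.99, 0.01]
# Likelihood of the observed datum under each hypothesis: it favors B
likelihood = [0.1, 0.9]

# Bayes' rule: posterior is proportional to prior times likelihood
unnorm = [p * l for p, l in zip(prior, likelihood)]
z = sum(unnorm)
posterior = [u / z for u in unnorm]

info_gain = entropy(prior) - entropy(posterior)
print(f"prior entropy     = {entropy(prior):.3f} bits")
print(f"posterior entropy = {entropy(posterior):.3f} bits")
print(f"information gain  = {info_gain:.3f} bits")  # negative: the posterior is wider
```

The surprising datum pushes the posterior toward the middle, widening it, so the entropy difference comes out negative.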
I'd be inclined to chalk this up to another example where it is good to distrust things you read online. However, information gain is related to the concept of mutual information or MI. MI is a measure of the "amount of information" about one random variable given by another random variable. In fact, the linked Wikipedia article describes them as the same thing. And they are!
The mutual information between variables $X$ and $Y$ is defined as $I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$.
Furthermore, Wikipedia confidently states that using Jensen's inequality you can show that $I(X;Y) \geq 0$, with equality exactly when $X$ and $Y$ are independent.
Then Fred Callaway (smart gent) came along, running simulations in Julia showing that negative information gain can occur when you start with an incorrect prior for the distribution of data you encounter. Good, that's what our little illustrations implied too, but then why don't the math and Jensen's inequality apply here?
I started to feel I was in a hall of mirrors at this point, so I figured I should make a note to sort it out. There are several intuitions here that each seem kind of right but are mathematically in conflict. I love this type of situation because I'm pretty sure I can trust the math to sort things out for me if I can just figure it out. So what's the story?
As best I can tell, the issue is just inconsistent terminology across different fields as well as the difference between "subjective" and "empirical" information gain. I want to lay it out here in case anyone else ends up confused.
Act 1: Information theory on measurements
The entropy of a random variable $X$ with distribution $p(x)$ is $H(X) = -\sum_x p(x)\log p(x)$.
The conditional entropy is the uncertainty left in a random variable $Y$ after knowing the random variable $X$. For a particular value $X = x$ it is $H(Y \mid X = x) = -\sum_y p(y \mid x)\log p(y \mid x)$, and then we marginalize over different realized values of $X$: $H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x)$.
Next we talk about the concept of relative entropy or Kullback-Leibler divergence, a measure of how far one distribution $q$ is from another distribution $p$ (often loosely described as a distance, though it is not symmetric): $D_{KL}(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$.
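One standard property worth flagging now, since everything below leans on it: $D_{KL}(p \,\|\, q)$ is never negative and is zero only when $p = q$ (Gibbs' inequality). A quick numerical sanity check (a sketch, not a proof):

```python
import math
import random

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n, rng):
    """A random discrete distribution over n outcomes."""
    w = [rng.random() for _ in range(n)]
    z = sum(w)
    return [x / z for x in w]

rng = random.Random(0)
divs = [kl(random_dist(4, rng), random_dist(4, rng)) for _ in range(10_000)]
print(f"min D(p||q) over 10,000 random pairs: {min(divs):.6f}")  # never below zero
print(f"D(p||p) = {kl([0.25] * 4, [0.25] * 4):.6f}")             # zero when p == q
```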
Finally, mutual information is relative entropy applied to the case of two random variables: it is the divergence between the joint distribution and the product of the marginals, $I(X;Y) = D_{KL}\big(p(x,y) \,\|\, p(x)p(y)\big)$.
There is one property of mutual information that we need to explore, which we can get to with a little algebra:

$$I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = \sum_{x,y} p(x,y)\log\frac{p(y \mid x)}{p(y)} = -\sum_y p(y)\log p(y) + \sum_{x,y} p(x,y)\log p(y \mid x) = H(Y) - H(Y \mid X)$$

Ok, we made it! So from our definition of mutual information, we get to this final equation, which is a difference of entropy values. This is known as information gain, and it is how much entropy about variable $Y$ is reduced by knowing $X$.
Under this treatment, mutual information (and therefore information gain) is always non-negative (via Jensen's inequality; the proof appears in any information theory text). End of story.
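We can also check this numerically as a sanity test (again a sketch, not a proof): draw random joint distributions and confirm that $I(X;Y) = H(Y) - H(Y \mid X)$ never dips below zero.

```python
import math
import random

def H(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

rng = random.Random(0)
min_mi = float("inf")
for _ in range(5_000):
    # Random 3x3 joint distribution p(x, y)
    w = [[rng.random() for _ in range(3)] for _ in range(3)]
    z = sum(sum(row) for row in w)
    pxy = [[v / z for v in row] for row in w]
    px = [sum(row) for row in pxy]
    py = [sum(pxy[i][j] for i in range(3)) for j in range(3)]
    # I(X;Y) = H(Y) - H(Y|X), where H(Y|X) = sum_x p(x) H(Y | X = x)
    h_y_given_x = sum(px[i] * H([pxy[i][j] / px[i] for j in range(3)]) for i in range(3))
    min_mi = min(min_mi, H(py) - h_y_given_x)

print(f"smallest I(X;Y) over 5,000 random joints: {min_mi:.6f}")  # stays >= 0
```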
Act 2: Reasoning about belief changes
Ok so what about cognitive science? What do we mean here? Generally, in cognitive science, we are talking about Bayesian updating. For example, Mike's original question is about Bayesian inference and posteriors. This situation is more like the one in the figure from my paper: there is a probability distribution over some parameter (let's call it $\theta$), and we update that distribution to a posterior after observing data.
Lindley (1956) wrote an influential paper, "On a measure of the information provided by an experiment," which has informed a lot of what people think about when they consider changes in uncertainty and information with Bayesian learning. This was a statistics paper applied to understanding the informativeness of hypothetical experiments. The idea was to start with a prior, say $p(\theta)$, consider the possible outcomes of an experiment, and ask how much the posterior over $\theta$ would change from the prior.
Following the notation above, but using the general setup of Lindley, we first start with the concept of prior uncertainty in some hypothesis or parameter $\theta$: $H(\theta) = -\sum_\theta p(\theta)\log p(\theta)$. (Note: we include the negative sign here, which Lindley explains is dropped in his analysis.) Then if we perform some experiment and observe data $x$, the remaining posterior uncertainty is $H(\theta \mid x) = -\sum_\theta p(\theta \mid x)\log p(\theta \mid x)$.
Lindley then talks about the information ($I$) provided by an experiment with a particular observed outcome $x$ as the change in uncertainty: $I(x) = H(\theta) - H(\theta \mid x)$.
However, we are often not interested in the information in a specific experiment (at least not for planning purposes). Instead we are interested in the average information provided by an experiment, taken over the outcomes we expect to see: $E[I] = \sum_x p(x)\,[H(\theta) - H(\theta \mid x)]$, where $p(x) = \sum_\theta p(\theta)\,p(x \mid \theta)$ is the prior predictive distribution.
Note then that the expected information gain provided an experiment is now defined analogously to what was called both information gain and mutual information in the section above.
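A small sketch makes the distinction vivid (the numbers are invented for illustration): a single binary-outcome experiment where one particular outcome yields negative information gain, yet the expectation over outcomes, taken under the learner's own prior predictive $p(x)$, is positive.

```python
import math

def H(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def posterior(prior, like_x):
    """Bayes' rule for one observed outcome."""
    unnorm = [p * l for p, l in zip(prior, like_x)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

prior = [0.9, 0.1]                      # p(theta) over two hypotheses (made up)
like = {0: [0.8, 0.2], 1: [0.2, 0.8]}   # p(x | theta) for a binary outcome x

# Prior predictive: p(x) = sum_theta p(theta) p(x | theta)
px = {x: sum(p * l for p, l in zip(prior, like[x])) for x in (0, 1)}

# Information gain of each specific outcome, and its expectation
ig = {x: H(prior) - H(posterior(prior, like[x])) for x in (0, 1)}
expected_ig = sum(px[x] * ig[x] for x in (0, 1))

print(f"IG(x=0) = {ig[0]:+.3f} bits")        # positive: confirms the favored hypothesis
print(f"IG(x=1) = {ig[1]:+.3f} bits")        # negative: the posterior gets wider
print(f"E[IG]   = {expected_ig:+.3f} bits")  # positive on average
```

The surprising outcome ($x=1$) produces negative information gain, but it is rare under the learner's own predictive distribution, so the expectation stays positive.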
I believe this is where the terminology difference arises. Information gain in the statistics/cognitive science literature is referring to the information gained by specific outcomes of a hypothetical experiment whereas in the information theory literature (talking about measurement) information gain is more about the mutual information between random variables.
"Subjective" versus "Empirical" information
There is a separate issue with "empirical" expected information gain, distinct from this "subjective" calculation. Empirically, negative information gain can easily occur, and it can even be negative on average when the prior and the data are a poor match. The guarantee of positive expected information gain applies only to the subjective analysis. The difference is in the expectation: the subjective calculation averages outcomes under the learner's own prior predictive distribution $p(x)$, whereas the empirical calculation averages over data drawn from the true generating distribution, which may be quite different.
Anyway, in this context things start to make more sense. For example, Lindley (1956) himself notes that the information provided by a particular experimental outcome can be negative, even though the average over outcomes cannot be.
Another interesting point is that Lindley provides a proof (page 990, Theorem 1) that the expected information provided by an experiment is non-negative, and is zero only when the experiment is uninformative about $\theta$.
These proofs, though, apply only to the "subjective" analysis. Empirically it is possible to obtain negative information gain (under the Bayesian/cognitive definition), and even negative information gain on average. Fred's simulations set up a case where the prior is so far off that every data point is at first "surprising," and this remains true even when averaging over many random samples of data points.
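Here is a sketch of that situation (a discretized version of the idea; the specific numbers are made up, and this is my reconstruction rather than Fred's actual Julia code). The learner's prior is concentrated on a low Bernoulli rate while the true rate is high. Averaging the first observation's information gain under the learner's own predictive distribution gives a positive number, but averaging under the true data distribution gives a negative one:

```python
import math

def H(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def posterior(prior, like_x):
    """Bayes' rule for one observed outcome."""
    unnorm = [p * l for p, l in zip(prior, like_x)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

thetas = [0.1, 0.5, 0.9]       # candidate Bernoulli rates
prior = [0.98, 0.01, 0.01]     # badly mismatched prior (made up)
true_theta = 0.9               # the world actually favors x = 1

# Likelihood of each outcome x in {0, 1} under each candidate rate
like = {1: thetas, 0: [1 - t for t in thetas]}
ig = {x: H(prior) - H(posterior(prior, like[x])) for x in (0, 1)}

# Subjective expectation: average under the learner's prior predictive p(x)
px = {x: sum(p * l for p, l in zip(prior, like[x])) for x in (0, 1)}
subjective = sum(px[x] * ig[x] for x in (0, 1))

# Empirical expectation: average under the true data distribution
empirical = true_theta * ig[1] + (1 - true_theta) * ig[0]

print(f"subjective E[IG] = {subjective:+.3f} bits")  # positive (Lindley's guarantee)
print(f"empirical  E[IG] = {empirical:+.3f} bits")   # negative: the data keep surprising us
```

Same model, same updates; the only thing that changed is which distribution the average is taken over.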
The intuition for why mutual information is non-negative stems from something different: a second random variable can either tell you nothing or tell you something about a first random variable, but it can never, on average, tell you less than nothing.
To sum up
In sum, the information gain talked about in much of cognitive science has an uncanny resemblance to the "core" concept of information gain/mutual information from information theory, but the terminology is not 1-to-1. This can lead to confusion when you try to rely on standard machine learning or information theory textbook definitions to answer questions of the type Mike raised. Perhaps the biggest source of confusion is the notion of subjective expected information gain, which looks like mutual information but should be interpreted slightly differently. In addition, the difference between "subjective" and "empirical" analysis is important: if you are basing your expectation on your current knowledge, then certain mathematical guarantees hold that may not be true in practice.
Hopefully this helps you avoid getting as confused as I did.
Thanks to Fred Callaway, Zhiwei Li, and Jonathan Nelson for helpful discussions as I worked through this.
References
- Coenen, A., Nelson, J. D., & Gureckis, T. M. (2019). Asking the right questions about the psychology of human inquiry: Nine open challenges. Psychonomic Bulletin & Review, 1548–1587. https://doi.org/10.3758/s13423-018-1470-5
- Lindley, D. V. (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4), 986–1005.
- Crupi, V., Nelson, J. D., Meder, B., Cevolani, G., & Tentori, K. (2018). Generalized Information Theory Meets Human Cognition: Introducing a Unified Framework to Model Uncertainty and Information Search. Cognitive Science, 42, 1410–1456. https://doi.org/10.1111/cogs.12613