The principle of maximum entropy asserts that when trying to determine an unknown probability distribution (for example, the distribution of possible results that occur when you toss a possibly unfair die), you should pick the distribution with maximum entropy consistent with your knowledge.
The goal of this post is to derive the principle of maximum entropy in the special case of probability distributions over finite sets from
- Bayes’ theorem and
- the principle of indifference: assign probability $\frac{1}{n}$ to each of $n$ possible outcomes if you have no additional knowledge. (The slogan in statistical mechanics is “all microstates are equally likely.”)
We’ll do this by deriving an arguably more fundamental principle of maximum relative entropy using only Bayes’ theorem.
A better way to state Bayes’ theorem
Suppose you have a set of hypotheses $\{ H_i \}$ about something, exactly one of which can be true, and some prior probabilities $P(H_i)$ that these hypotheses are true (which therefore sum to $1$). Then you see some evidence $E$. (Here is a simultaneous definition of both hypotheses and evidence: hypotheses are things that assert how likely or unlikely evidence is. That is, what it means for $E$ to give evidence about some hypotheses $H_i$ is that there ought to be some conditional probabilities $P(E \mid H_i)$, the likelihoods, describing how likely it is that you see evidence $E$ conditional on hypothesis $H_i$.)
Bayes’ theorem in this setting is then usually stated as follows: you should now have updated posterior probabilities $P(H_i \mid E)$ that your hypotheses are true conditional on your evidence, and they should be given by

$$P(H_i \mid E) = \frac{P(E \mid H_i) P(H_i)}{P(E)}.$$
That is, each prior probability $P(H_i)$ gets multiplied by $\frac{P(E \mid H_i)}{P(E)}$, which describes how much more likely $H_i$ thinks the evidence $E$ is than before. You might be concerned that $P(E)$ requires the introduction of extra information, but in fact it must be given by

$$P(E) = \sum_i P(E \mid H_i) P(H_i)$$

by conditioning on each $H_i$ in turn, so it’s already determined by the priors and the likelihoods. (This is if the $H_i$ are parameterized by a discrete parameter $i$; in general this sum should be replaced by an integral.)
In practice this statement of Bayes’ theorem seems to be annoyingly easy to forget, at least for me. Here is a better statement. The idea is to think of $P(E)$ as just a normalization constant. Hence the revised statement is

$$P(H_i \mid E) \propto P(E \mid H_i) P(H_i).$$
That is, the posterior probability is proportional to the prior probability times the likelihood, where the proportionality constant is uniquely determined by the requirement that the posterior probabilities $P(H_i \mid E)$ sum to $1$.
Intuitively: after seeing some evidence, your confidence in a hypothesis gets multiplied by how well the hypothesis predicted the evidence, then normalized. Now you can take your posteriors to be your new priors in preparation for seeing some more evidence. This is a Bayesian update.
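To make the update rule concrete, here is a minimal Python sketch of a single update; the coin-bias hypotheses and the uniform prior are made-up values for illustration, not anything from the discussion above.

```python
# A minimal sketch of a Bayesian update: posterior ∝ prior × likelihood, then normalize.
# The coin-bias hypotheses and the uniform prior below are made up for illustration.

def bayes_update(priors, likelihoods):
    """Multiply each prior by the likelihood of the evidence under that hypothesis,
    then rescale so the results sum to 1."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypotheses: the coin lands heads with probability 0.3, 0.5, or 0.8.
biases = [0.3, 0.5, 0.8]
priors = [1 / 3, 1 / 3, 1 / 3]          # principle of indifference
likelihood_heads = biases               # P(heads | bias b) is just b

posteriors = bayes_update(priors, likelihood_heads)
print(posteriors)                       # the bias-0.8 hypothesis gains the most credence
```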
Aside: measures up to scale and improper priors
This statement of Bayes’ theorem suggests a slight reformulation of what we mean by a probability measure: a probability measure is the same thing as a measure with nonzero total measure, up to scaling by positive reals. One reason to like this description is that it naturally incorporates improper priors, which correspond to prior probabilities with possibly infinite total measure, up to scaling by positive reals. The point is that after a Bayesian update an improper prior may become proper again. For example, there’s an improper prior assigning measure $1$ to every positive integer $n$, which allows us to talk about hypotheses indexed by the positive integers and with a prior which makes all of them equally likely.
Improper priors may seem obviously bad because they don’t assign probabilities to things: in order to assign a probability you need to normalize by the total measure, which is infinite. However, with an improper prior it is still meaningful to make comparisons between probabilities: you can still meaningfully say that $P(H_i)$ is larger than $P(H_j)$, or exactly $c$ times $P(H_j)$, since this comparison is invariant under scaling by positive reals.
There’s a somewhat philosophical argument that when performing Bayesian reasoning, only comparisons between probabilities are meaningful anyway: in order to know the probability, in the absolute sense, of something, you need to be absolutely sure you’ve written down every possible hypothesis (in order to ensure that exactly one of them is true). If you leave out the true hypothesis, then you might end up being more and more sure of an arbitrarily bad hypothesis because the true hypothesis wasn’t included in your calculations. In other words, computing the normalization constant $P(E) = \sum_i P(E \mid H_i) P(H_i)$ in the usual statement of Bayes’ theorem is “global” in that it requires information about all of the hypotheses $H_i$, but computing $P(E \mid H_i) P(H_i)$ is “local” in that it only involves one $H_i$ at a time.
(And it’s not enough just to have a few hypotheses and then a catch-all hypothesis called “everything else,” because “everything else” is not a hypothesis in the sense that it does not assign likelihoods $P(E \mid \text{everything else})$ to evidence. A hypothesis has to make predictions.)
The setup
Back to maximum entropy. Imagine that you are repeatedly rolling an $n$-sided die, and you don’t know what the various weights on the die are: that is, you don’t know the true probabilities that the $i^{th}$ face of the die will come up.
However, you have some hypotheses about these probabilities. Your hypotheses $H_{\theta}$ are parameterized by a parameter $\theta$, which for the sake of concreteness we’ll take to be a real number or a tuple of real numbers, but which could in principle be anything. Hypothesis $H_{\theta}$ assigns probability $p_i(\theta)$ to the $i^{th}$ face coming up. You also have some prior over your hypotheses, which we’ll write as a probability density function $f(\theta)$. Hence $f(\theta) \ge 0$ and

$$\int f(\theta) \, d\theta = 1$$

while the $p_i(\theta)$ are normalized so that

$$p_i(\theta) \ge 0, \quad \sum_{i=1}^n p_i(\theta) = 1.$$
Example. If $n = 2$, we might imagine that we’re flipping a coin, with $p_1(\theta)$ the probability that we flip tails and $p_2(\theta)$ the probability that we flip heads. Our hypotheses might take the form $p_1(\theta) = 1 - \theta, p_2(\theta) = \theta$ where $\theta \in [0, 1]$, and our prior might be the uniform prior: each $\theta$ is equally likely. Hence our probability density is $f(\theta) = 1$.
Now suppose you roll the die $N$ times. What happens to your beliefs under Bayesian updating in the limit $N \to \infty$?
The principle of maximum relative entropy
Suppose you see the $i^{th}$ face come up $N q_i$ times (so the $q_i$ are nonnegative and $\sum_{i=1}^n q_i = 1$; they describe the observed relative frequencies of the various faces coming up, and altogether describe the empirical probability distribution $q = (q_1, \dots, q_n)$). Hypothesis $H_{\theta}$ predicts that this happens with probability

$$P(q \mid H_{\theta}) = \binom{N}{N q_1, N q_2, \dots, N q_n} \prod_{i=1}^n p_i(\theta)^{N q_i}.$$
Let’s see how this function behaves as $N \to \infty$. Taking the log, and using Stirling’s approximation in the form

$$\log N! = N \log N - N + O(\log N)$$

we get

$$\log \binom{N}{N q_1, \dots, N q_n} = N \log N - N - \sum_{i=1}^n \left( N q_i \log (N q_i) - N q_i \right) + O(\log N).$$
Various terms cancel here due to the fact that $\sum_{i=1}^n q_i = 1$. At the end of the day we get

$$\log \binom{N}{N q_1, \dots, N q_n} = - N \sum_{i=1}^n q_i \log q_i + O(\log N).$$
This is the first appearance of the function $H(q) = - \sum_{i=1}^n q_i \log q_i$, the entropy of $q$, regarded as a probability distribution over faces. This is perhaps the most concrete and least mysterious way of introducing entropy: it’s a concise way of summarizing the asymptotic behavior of the multinomial coefficient as $N \to \infty$. Already we see that the entropy being larger corresponds to the counts being more likely, in a very precise quantitative sense.
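Here is a quick numerical sanity check of this asymptotic (a sketch; the particular distribution $q$ and the values of $N$ are arbitrary): the exact logarithm of the multinomial coefficient differs from $N H(q)$ by an error that grows only logarithmically in $N$.

```python
# Compare the exact log of the multinomial coefficient with the entropy approximation N*H(q).
# The choice of q and the values of N are arbitrary illustrations.
from math import lgamma, log

def log_multinomial(counts):
    """log of N! / (a_1! ... a_n!), computed via log-gamma."""
    n_total = sum(counts)
    return lgamma(n_total + 1) - sum(lgamma(a + 1) for a in counts)

def entropy(q):
    return -sum(qi * log(qi) for qi in q if qi > 0)

q = [0.5, 0.3, 0.2]
for N in [10**2, 10**3, 10**4]:
    counts = [round(N * qi) for qi in q]
    exact = log_multinomial(counts)
    approx = N * entropy(q)
    print(N, exact, approx, exact - approx)   # the gap grows like O(log N)
```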
But there’s a second term in the likelihood, so let’s compute the logarithm of that too. This gives

$$\log \prod_{i=1}^n p_i(\theta)^{N q_i} = N \sum_{i=1}^n q_i \log p_i(\theta).$$
Thus the logarithm of the likelihood, or log-likelihood, is

$$\log P(q \mid H_{\theta}) = N \left( - \sum_{i=1}^n q_i \log q_i + \sum_{i=1}^n q_i \log p_i(\theta) \right) + O(\log N) = - N \sum_{i=1}^n q_i \log \frac{q_i}{p_i(\theta)} + O(\log N).$$
Now the function $- \sum_{i=1}^n q_i \log \frac{q_i}{p_i(\theta)}$ that appears is the negative of the Kullback-Leibler divergence $D_{KL}(q \parallel p(\theta))$. We’ll call it the relative entropy (although this term is sometimes used for the KL divergence, not its negative) and denote it somewhat arbitrarily by $H(q, p(\theta))$.
Altogether, the posterior density is now proportional to

$$e^{N H(q, p(\theta))} f(\theta)$$

(up to factors independent of $\theta$, which are absorbed into the normalization).
From here it’s not hard to see that, as $N \to \infty$, the posterior density becomes overwhelmingly concentrated at the hypotheses $H_{\theta}$ that maximize the relative entropy $H(q, p(\theta))$, subject to the constraint that the prior density $f(\theta)$ is positive. This is because the posterior density everywhere else is exponentially smaller in comparison, and as long as the prior density is positive, its exact value doesn’t matter, since a fixed positive factor is negligible next to the exponential factor $e^{N H(q, p(\theta))}$.
This calculation suggests that we can interpret the relative entropy $H(q, p(\theta))$ as a measure of how well the hypothesis $H_{\theta}$ fits the evidence $q$: the larger this number is, the better the fit. (A more common way to describe relative entropy is as a measure of how well a hypothesis fits the “truth.” Here our model for being told that $q$ is the “truth” is seeing it as the empirical distribution asymptotically as $N \to \infty$.)
Let’s wrap that conclusion up into a theorem.
Theorem: With hypotheses as above, as $N \to \infty$, Bayesian updates converge towards believing the hypothesis $H_{\theta}$ that maximizes the relative entropy $H(q, p(\theta))$, subject to the constraint that the prior density $f(\theta)$ is positive.
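Here is a small simulation sketching the theorem in the coin-flipping case; the grid of hypotheses, the “true” bias of $0.7$, and the sample sizes are arbitrary choices made for illustration.

```python
# Sketch: the posterior over a grid of coin biases concentrates as N grows.
# The grid, the true bias 0.7, and the values of N are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
thetas = np.linspace(0.01, 0.99, 99)            # hypotheses H_theta
prior = np.full_like(thetas, 1 / len(thetas))   # uniform prior density on the grid

true_bias = 0.7
for N in [10, 100, 1000]:
    heads = rng.binomial(N, true_bias)
    # log-likelihood of the observed counts under each hypothesis
    log_like = heads * np.log(thetas) + (N - heads) * np.log(1 - thetas)
    log_post = np.log(prior) + log_like
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # the posterior mass concentrates near the true bias as N grows
    print(N, thetas[post.argmax()], post.max())
```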
Now, suppose the true probabilities are $r_i$. Then as $N \to \infty$ we expect, by the law of large numbers, that the observed frequencies $q_i$ approach the true probabilities $r_i$. If the true probabilities are among our hypotheses $p(\theta)$, we would hope, and it seems intuitively clear, that we’ll converge towards believing the true hypothesis. This requires showing the following.
Theorem: The relative entropy $H(q, p)$ is nonpositive, and for fixed $q$, it takes its maximum value $0$ iff $p = q$.
Proof. This is more or less a computation with Lagrange multipliers. For fixed $q$, we want to maximize

$$H(q, p) = - \sum_{i=1}^n q_i \log \frac{q_i}{p_i}$$

subject to the constraint $\sum_{i=1}^n p_i = 1$. This constraint means that at a critical point of $H(q, p)$ (whether a maximum, a minimum, or a saddle point), all of the partial derivatives $\frac{\partial}{\partial p_i} H(q, p)$ should be equal. (Intuitively, we have a “budget” of probability to spend to increase $H(q, p)$, and as we spend more probability on one $p_i$ we necessarily must spend less probability on the others. The critical points are then the points where we can’t do any better by shifting our probability budget, meaning that the marginal value of each probability increase is equally good.)
We compute that

$$\frac{\partial}{\partial p_i} H(q, p) = \frac{q_i}{p_i}$$

so setting all partial derivatives equal we conclude that $p_i$ must be proportional to $q_i$, and the additional constraint $\sum_{i=1}^n p_i = 1$ gives $p_i = q_i$ for all $i$.
At this critical point $H(q, p)$ takes value $0$. Now we need to show that this critical point is a maximum and not a minimum. Since it’s the unique critical point, it suffices to show that it’s a local maximum. So, consider a point in a small neighborhood of this critical point, where $p_i = q_i + \varepsilon_i$ and $\sum_{i=1}^n \varepsilon_i = 0$. To second order, we have

$$\log \frac{q_i}{q_i + \varepsilon_i} = - \log \left( 1 + \frac{\varepsilon_i}{q_i} \right) \approx - \frac{\varepsilon_i}{q_i} + \frac{\varepsilon_i^2}{2 q_i^2}$$

and hence

$$H(q, p) \approx \sum_{i=1}^n \varepsilon_i - \sum_{i=1}^n \frac{\varepsilon_i^2}{2 q_i}.$$

The linear term vanishes (since $\sum_i \varepsilon_i = 0$), and the quadratic term is negative definite as desired. (Strictly speaking we need to require that the $q_i$ are all positive, but if any of them happen to be zero then the corresponding value of $i$ can be safely ignored anyway, since it won’t figure in any of our computations.)
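As a numerical sanity check of the theorem (a sketch; the example distributions are arbitrary):

```python
# Check numerically that the relative entropy H(q, p) = -sum q_i log(q_i / p_i)
# is nonpositive and vanishes exactly when p = q. The distributions are arbitrary examples.
from math import log

def relative_entropy(q, p):
    return -sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.3, 0.2]
print(relative_entropy(q, q))                  # 0.0 (the maximum)
print(relative_entropy(q, [0.4, 0.4, 0.2]))    # negative
print(relative_entropy(q, [1/3, 1/3, 1/3]))    # negative
```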
Corollary: With hypotheses as above, as $N \to \infty$, if the true hypothesis is among the hypotheses with positive prior density, Bayesian updates converge towards believing it. False hypotheses $H_{\theta}$ are disbelieved at an exponential rate, with base the exponential $e^{H(q, p(\theta))}$ of the relative entropy (which is less than $1$, since the relative entropy of a false hypothesis is negative).
In other words, as $N \to \infty$, the Bayesian definition of probability converges to the frequentist definition of probability.
Example. Let’s return to the example, where we’re flipping a coin with an unknown bias $\theta$, so that $\theta$ and $1 - \theta$ are the probabilities of flipping heads and tails respectively given bias $\theta$, and our prior is uniform. Suppose that after $N$ trials we observe $h$ heads and $t$ tails, where $h + t = N$. Then

$$f(\theta \mid \text{evidence}) \propto \binom{N}{h} \theta^h (1 - \theta)^t.$$

(We can drop the binomial coefficient because it’s the same for all values of $\theta$ and so can be absorbed into our proportionality constant. We introduced it into the above computation because it becomes important later.)
This computation can be used to deduce the rule of succession, which asserts that at this point you should assign probability $\frac{h+1}{N+2}$ to heads coming up on the next coin flip. Note that as $N \to \infty$ this converges to $\frac{h}{N}$.
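Here is a quick numerical check of the rule of succession (a sketch; the counts $h = 7$, $t = 3$ are made up): integrating $\theta$ against the posterior density $\theta^h (1 - \theta)^t$ gives $\frac{h+1}{N+2}$.

```python
# Sketch: verify the rule of succession by integrating the posterior density
# proportional to theta^h * (1 - theta)^t against theta.  The counts h, t are made up.
import numpy as np

h, t = 7, 3                       # heads and tails observed; N = h + t
theta = np.linspace(0, 1, 100001)
posterior = theta**h * (1 - theta)**t
posterior /= posterior.sum()      # normalize on the grid (simple Riemann sum)

prob_next_heads = (theta * posterior).sum()
print(prob_next_heads)            # ≈ (h + 1) / (h + t + 2) = 8/12 ≈ 0.6667
```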
The posterior density can be written as

$$\theta^h (1 - \theta)^t = e^{N \left( \frac{h}{N} \log \theta + \frac{t}{N} \log (1 - \theta) \right)} \propto e^{N H(q, p(\theta))}$$

where $q = \left( \frac{t}{N}, \frac{h}{N} \right)$ is the empirical distribution, which takes its maximum value when $\theta = \frac{h}{N}$ by our results above, although in this case one-variable calculus suffices to prove this. Near this maximum value, Taylor expanding around $\theta = \frac{h}{N}$ shows that, for values of $\theta$ sufficiently close to $\frac{h}{N}$, the posterior density is approximately a Gaussian centered at $\frac{h}{N}$ with standard deviation on the order of $\frac{1}{\sqrt{N}}$. Hence in order to be confident that the true bias lies in an interval of size $\varepsilon$ with high probability we need to look at on the order of $\frac{1}{\varepsilon^2}$ coin flips.
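A sketch of the $\frac{1}{\sqrt{N}}$ scaling of the posterior width (the heads fraction of $0.6$ and the sample sizes are arbitrary choices):

```python
# Sketch: the posterior over the bias has standard deviation on the order of 1/sqrt(N).
# The fixed heads fraction 0.6 and the values of N are arbitrary.
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 99999)
for N in [10, 100, 1000, 10000]:
    h = int(0.6 * N)
    log_post = h * np.log(theta) + (N - h) * np.log(1 - theta)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    mean = (theta * post).sum()
    std = np.sqrt(((theta - mean) ** 2 * post).sum())
    print(N, std, std * np.sqrt(N))   # std * sqrt(N) approaches a constant
```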
Maximum entropy
We were supposed to get a criterion in terms of maximizing entropy, not relative entropy. What happened to that?
Now instead of knowing the relative frequencies $q_i$, let’s assume that we only know that they satisfy some conditions. For example, for any function $X : \{ 1, 2, \dots, n \} \to V$, where $V$ is a finite-dimensional real vector space, we might know the expected value

$$\mathbb{E}_q(X) = \sum_{i=1}^n q_i X(i)$$

of $X$ with respect to the empirical probability distribution $q$. In statistical mechanics a typical and important example is that we might know the average energy. We also might be observing a random walk, and while we don’t know how many steps the random walker took in a given direction (perhaps because they’re moving too fast for us to see), we might know where they ended up after $N$ steps, which tells us the average of all the steps the walker took.
(Strictly speaking, if we’re talking about the empirical distribution, where in particular each $q_i$ is necessarily rational, it’s too much to ask that any particular condition be exactly satisfied. We’d be happy to see that it’s asymptotically satisfied as $N \to \infty$, from which we’re concluding that our conditions are exactly satisfied for the true probabilities, or something like that. It seems there’s something subtle going on here and I am going to completely ignore it.)
Knowing the expected values of some functions is equivalent to knowing that the empirical distribution $q$ lies in some affine subspace of the probability simplex

$$\Delta^{n-1} = \left\{ (q_1, \dots, q_n) : q_i \ge 0, \sum_{i=1}^n q_i = 1 \right\}.$$
However, more complicated constraints are possible. For example, suppose that $n = n_1 n_2$ and that we’re really rolling two independent dice with $n_1$ and $n_2$ sides, respectively, so that we can relabel the possible outcomes with pairs

$$(j, k), \quad 1 \le j \le n_1, \quad 1 \le k \le n_2.$$
Observing this is true means that, at least asymptotically as $N \to \infty$, we observe that we can write $q_{(j, k)} = q_j^{(1)} q_k^{(2)}$, where

$$q_j^{(1)} = \sum_{k=1}^{n_2} q_{(j, k)}$$

is the empirical probability that the first die comes up $j$, and similarly

$$q_k^{(2)} = \sum_{j=1}^{n_1} q_{(j, k)}$$

is the empirical probability that the second die comes up $k$. This is a nonlinear constraint: in fact it describes a collection of quadratic equations that the variables $q_{(j, k)}$ must satisfy. Imposing these equations turns out to be equivalent to imposing the simpler homogeneous quadratic equations

$$q_{(j, k)} q_{(j', k')} = q_{(j, k')} q_{(j', k)}$$

which we might recognize as the equations cutting out the image of the Segre embedding

$$\mathbb{P}^{n_1 - 1} \times \mathbb{P}^{n_2 - 1} \to \mathbb{P}^{n_1 n_2 - 1}.$$
The idea is to think of the probability simplex $\Delta^{n-1}$ as sitting inside the real projective space $\mathbb{P}^{n-1}$; then the restriction of the Segre embedding to probability simplices produces a map

$$\Delta^{n_1 - 1} \times \Delta^{n_2 - 1} \to \Delta^{n_1 n_2 - 1}$$

describing how a probability distribution over the first die and a probability distribution over the second die gives rise to a joint probability distribution over both of them. More complicated variations of this example are considered in algebraic statistics.
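Here is a small sketch checking that product (independent) joint distributions satisfy these quadrics while a generic joint distribution does not; the example distributions are made up for illustration.

```python
# Sketch: product distributions satisfy the Segre quadrics q[j,k]*q[j2,k2] == q[j,k2]*q[j2,k]
# (equivalently, the matrix of joint probabilities has rank 1); the example numbers are made up.
import numpy as np
from itertools import combinations

def max_quadric_violation(q):
    """Largest |q[j,k]*q[j2,k2] - q[j,k2]*q[j2,k]| over all pairs of rows and columns."""
    rows, cols = q.shape
    worst = 0.0
    for j, j2 in combinations(range(rows), 2):
        for k, k2 in combinations(range(cols), 2):
            worst = max(worst, abs(q[j, k] * q[j2, k2] - q[j, k2] * q[j2, k]))
    return worst

die1 = np.array([0.2, 0.3, 0.5])
die2 = np.array([0.6, 0.4])
independent = np.outer(die1, die2)          # joint distribution of two independent dice

correlated = np.array([[0.30, 0.05],        # a joint distribution with correlation
                       [0.05, 0.25],
                       [0.05, 0.30]])

print(max_quadric_violation(independent))   # ~0: all quadrics satisfied
print(max_quadric_violation(correlated))    # > 0: some quadric fails
```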
In any case, the game is that instead of knowing the empirical distribution $q$ we now only know some conditions it satisfies. Write the set of all distributions satisfying these conditions as $E$. What happens as $N \to \infty$? Hypothesis $H_{\theta}$ still predicts that empirical distribution $q$ occurs with probability

$$P(q \mid H_{\theta}) = \binom{N}{N q_1, \dots, N q_n} \prod_{i=1}^n p_i(\theta)^{N q_i}$$

and hence it predicts that we observe that our conditions are satisfied with probability

$$P(q \in E \mid H_{\theta}) = \sum_{q \in E} P(q \mid H_{\theta})$$

(where $q \in E$ is shorthand for the event that the empirical distribution lies in $E$). Using our previous approximations, we can rewrite this as

$$P(q \in E \mid H_{\theta}) \approx \sum_{q \in E} e^{N H(q, p(\theta))}$$

which gives posterior densities

$$f(\theta \mid q \in E) \propto \left( \sum_{q \in E} e^{N H(q, p(\theta))} \right) f(\theta).$$
As before, we find that the posterior densities are overwhelmingly concentrated at the hypotheses $H_{\theta}$ that maximize the relative entropy $H(q, p(\theta))$ as $N \to \infty$ (again subject to the constraint that $f(\theta) > 0$), but where $q$ is now allowed to run over all of $E$.
If our prior assigns nonzero density to every possible probability distribution in the probability simplex (for example, we could take $\theta$ to be parameterized by the points of the probability simplex, $p(\theta)$ to be the probability distribution corresponding to the point $\theta$, and $f(\theta)$ to be a constant, suitably normalized), then we know that the relative entropy takes its maximum value $0$ when its two arguments are equal, so we can restrict our attention to the case that $p(\theta) = q$ above, and we find that, asymptotically as $N \to \infty$, the posterior density is proportional to the prior density as long as $p(\theta)$ satisfies the conditions (that is, lies in $E$), and is $0$ otherwise.
This is unsurprising: we assumed that all we were told about the empirical distribution is that it satisfied some conditions, so the only change we make to our prior is that we condition on that.
We still haven’t gotten a characterization in terms of entropy, as opposed to relative entropy. This is where we are going to invoke the principle of indifference, which in this situation asserts that the prior we should have is the one concentrated entirely at the hypothesis

$$p_i = \frac{1}{n}, \quad i = 1, 2, \dots, n$$

that the die rolls are being generated uniformly at random. Note that this means the posterior is also concentrated entirely at this hypothesis!
We now predict that the empirical probability distribution $q$ occurs with probability distributed according to the multinomial distribution, namely

$$P(q) = \binom{N}{N q_1, \dots, N q_n} \frac{1}{n^N} \approx \frac{e^{N H(q)}}{n^N}$$

where $\frac{1}{n^N}$ is now a normalization constant and can be ignored, and $H(q)$ is the entropy, rather than the relative entropy. This comes from substituting the uniform distribution for $p$ into the relative entropy $H(q, p)$, which gives $H(q) - \log n$.
We now want to ask a slightly different question than before. Before we were asking what our beliefs were about the underlying “true” probabilities generating the die rolls. Now we’ve already fixed those beliefs, and we’re instead going to ask what our beliefs are about the empirical distribution $q$, which we now no longer know, conditioned on the fact that $q \in E$. By Bayes’ theorem, this is

$$P(q \mid q \in E) \propto P(q \in E \mid q) P(q)$$

where $P(q \in E \mid q)$ is either $1$ if $q \in E$ or $0$ if $q \notin E$, and $P(q) \approx \frac{e^{N H(q)}}{n^N}$ as before. Overall, we conclude the following.
Theorem: Starting from the indifference prior, as $N \to \infty$ Bayesian updates converge towards believing that the empirical distribution $q$ is the maximum entropy distribution in $E$.
In some sense this is not at all a deep statement: it’s just the observation that entropy describes the asymptotics of the multinomial distribution, together with conditioning on $q \in E$. Although it is somewhat interesting that conditioning on $q \in E$, in this setup, is done by seeing that $q$ appears to asymptotically lie in $E$ as $N \to \infty$.
Edit: This is essentially the Wallis derivation, but with a much larger emphasis placed on the choice of prior.
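Here is a brute-force sketch of the theorem for a small case: a $3$-sided fair die conditioned on a fixed empirical average of $X(i) = i$ (the values of $N$, $X$, and the target average are arbitrary). Among the empirical distributions satisfying the constraint, the most probable one is essentially the maximum entropy distribution in $E$.

```python
# Sketch: for a 3-sided fair die, condition the empirical distribution on a fixed average of
# X(i) = i and check that the most probable empirical distribution is (close to) the maximum
# entropy distribution satisfying the constraint.  N, X, and the target average are made up.
from math import lgamma, log

def log_prob_uniform_die(counts):
    """log probability of observing these counts when each face has probability 1/3."""
    N = sum(counts)
    log_coeff = lgamma(N + 1) - sum(lgamma(a + 1) for a in counts)
    return log_coeff - N * log(3)

def entropy(q):
    return -sum(qi * log(qi) for qi in q if qi > 0)

N = 300
X = [1, 2, 3]
target_sum = 720          # i.e. we condition on the average of X being 720 / 300 = 2.4

best = None
for a1 in range(N + 1):
    for a2 in range(N + 1 - a1):
        a3 = N - a1 - a2
        if a1 * X[0] + a2 * X[1] + a3 * X[2] == target_sum:   # empirical distribution lies in E
            lp = log_prob_uniform_die((a1, a2, a3))
            if best is None or lp > best[0]:
                best = (lp, (a1, a2, a3))

q_best = [a / N for a in best[1]]
print(q_best, entropy(q_best))
# The most probable conditioned counts are close to the maximum entropy distribution in E,
# which has the exponential form q_i ∝ exp(-λ·X(i)) derived in the example below.
```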
Example. Suppose $E$ consists of all probability distributions $q$ such that the expected value

$$\mathbb{E}_q(X) = \sum_{i=1}^n q_i X(i)$$

of some random variable $X$ (possibly vector-valued) is fixed; call this fixed value $\bar{X}$. Then we want to maximize $H(q)$ subject to this constraint and the constraint $\sum_{i=1}^n q_i = 1$. This is again a Lagrange multiplier problem. We’ll introduce a vector-valued Lagrange multiplier $\lambda$, as well as a scalar Lagrange multiplier $\mu - 1$ for the constraint $\sum_i q_i = 1$ that will later disappear from the calculation. (The $-1$ will slightly simplify the calculation.)
Then the method of Lagrange multipliers says that any maximum must be a critical point of the function

$$L(q) = H(q) - \lambda \cdot \mathbb{E}_q(X) - (\mu - 1) \sum_{i=1}^n q_i$$

for some value of $\lambda$ and $\mu$. (Here we are hiding the dependence of $L$ on $\lambda$ and $\mu$.) Using the fact that

$$\frac{\partial}{\partial q_i} H(q) = - \log q_i - 1$$

we compute that

$$\frac{\partial L}{\partial q_i} = - \log q_i - 1 - \lambda \cdot X(i) - \mu + 1 = - \log q_i - \lambda \cdot X(i) - \mu$$

and setting these partial derivatives equal to $0$ gives

$$q_i = \frac{e^{- \lambda \cdot X(i)}}{Z}$$

where $Z = e^{\mu}$. But since the $q_i$ must sum to $1$, $Z$ is a normalization constant determined by this condition, and in fact must be the partition function

$$Z = \sum_{i=1}^n e^{- \lambda \cdot X(i)}.$$
From here we can compute that the expected value of $X$ is

$$\mathbb{E}_q(X) = \frac{1}{Z} \sum_{i=1}^n X(i) e^{- \lambda \cdot X(i)} = - \frac{\partial}{\partial \lambda} \log Z$$

and the entropy is

$$H(q) = - \sum_{i=1}^n q_i \log q_i = \lambda \cdot \mathbb{E}_q(X) + \log Z.$$
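Here is a numerical sketch of this example in the case where $X$ is scalar: we solve for the Lagrange multiplier $\lambda$ by bisection so that $q_i = e^{-\lambda X(i)} / Z$ has a prescribed expected value, and check the identity $H(q) = \lambda \cdot \mathbb{E}_q(X) + \log Z$. The values of $X$ and the target expectation are made up for illustration.

```python
# Sketch: compute the maximum entropy distribution q_i = exp(-lam * X[i]) / Z with a
# prescribed expected value of X, by solving for the scalar multiplier lam with bisection.
# The values X and the target mean are illustrative.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
target = 1.8          # desired E_q[X]; must lie strictly between min(X) and max(X)

def mean_at(lam):
    w = np.exp(-lam * X)
    q = w / w.sum()
    return q @ X

# E_q[X] is decreasing in lam, so bisect on an interval that brackets the target.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_at(mid) > target:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)

w = np.exp(-lam * X)
Z = w.sum()
q = w / Z
entropy = -(q * np.log(q)).sum()
print(lam, q, q @ X)                          # E_q[X] ≈ 1.8
print(entropy, lam * (q @ X) + np.log(Z))     # checks H(q) = lam * E_q[X] + log Z
```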
In statistical mechanics, a “die” is a statistical-mechanical system, and $X$ is a vector of variables such as energy and particle number describing that system. $\lambda$, the Lagrange multiplier, is a vector of conjugate variables such as (inverse) temperature and chemical potential. The probability distribution we’ve just described is the canonical ensemble if $X$ consists only of energy and the grand canonical ensemble if $X$ consists of energy and particle numbers.
The uniform prior we assumed using the principle of indifference, possibly after conditioning on a fixed value of the energy (rather than a fixed expected value), is the microcanonical ensemble. The assumption that this is a reasonable prior is called the fundamental postulate of statistical mechanics. As Terence Tao explains here, in a suitable finite toy model involving Markov chains at equilibrium it can be proven rigorously, but in more complicated settings the fundamental postulate is harder to justify, and of course in some settings it will just be wrong. In these settings we can instead use the principle of maximum relative entropy.