From Physics Today, Jan 1992.

Philip Anderson: Nobel Prize in Physics 1977

Philip W. Anderson

In 1759 or so the Reverend Thomas Bayes first wrote down the "chain rule" for probability theory. (The date is not known; the paper was published posthumously by a "good friend" in 1763.) Bayes seems to have had no idea that his simple formula might have far-reaching consequences, but thanks to the efforts of Harold Jeffreys, earlier in this century, and many others since, "Bayesian statistics" is now taught to statistics students in advanced courses. Unfortunately, however, it is not taught to nutritionists or even to experimental physicists.

These statistics are the correct way to do inductive reasoning from necessarily imperfect experimental data. What Bayesianism does is to focus one's attention on the question one wants to ask of the data: It says, in effect, How do these data affect my previous knowledge of the situation? It's sometimes called "maximum likelihood" thinking, but the essence of it is to clearly identify the possible answers, assign reasonable a priori probabilities to them and then ask which answers have been made more likely by the data. It's particularly useful in testing simple "null" answers.

Consider, for instance, the question of looking for a needle in a haystack. Actually, of course, there are no needles in most haystacks, so the question doesn't come up unless I happen to suppose that at some particular source of hay there are a lot of absentminded seamstresses or bucolic crack addicts. So I might look for needles to find out if a particular set of haystacks came from that bad source and therefore shouldn't command a high price.

Let us set it up: There are two sources of hay, one with no needles at all and one with up to 9 needles per stack. Let's assign precisely probability 1/2, for the sake of argument, to the case where I'm buying from the needle-free source. (This represents the "null hypothesis" in this example.) If I'm dealing with the potentially needly hay, let's assume that p = (1/2)(1/10) for 0, 1, . . . , 9 needles in any one stack.

I search for needles in one stack, and find none. What do I now know? I know that this outcome had p = 1/2 for needle-free hay, p = 1/20 for needly hay; hence the probability of this outcome is 10 times as great if the hay is needle free. The new "a posteriori" probability of the null hypothesis is therefore 10/11 = (1/2)/(1/2 + 1/20) rather than 1/2. Clearly I should buy this hay if that is a good enough bet.
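The arithmetic above can be checked with a few lines of Python. This is only a sketch of the haystack example: the function and constant names are invented for illustration, and the extension to several empty stacks assumes that stacks from the same source are statistically independent, which the text does not state.

```python
from fractions import Fraction


def posterior_null(prior_null, like_null, like_alt):
    """Bayes's rule for two hypotheses: posterior probability of the null.

    like_null and like_alt are the probabilities of the observed outcome
    under the null and under the alternative, respectively.
    """
    prior_alt = 1 - prior_null
    weight_null = prior_null * like_null
    weight_alt = prior_alt * like_alt
    return weight_null / (weight_null + weight_alt)


# Numbers taken from the haystack example in the text.
PRIOR_CLEAN = Fraction(1, 2)          # a priori probability of needle-free hay
P_EMPTY_IF_CLEAN = Fraction(1)        # clean hay never contains a needle
P_EMPTY_IF_NEEDLY = Fraction(1, 10)   # 0, 1, ..., 9 needles equally likely

one_stack = posterior_null(PRIOR_CLEAN, P_EMPTY_IF_CLEAN, P_EMPTY_IF_NEEDLY)
print(one_stack)  # 10/11


def posterior_after_empty_stacks(n_stacks):
    """Posterior of the null after n_stacks empty stacks.

    Assumption (not in the text): stacks are independent given the source,
    so the likelihoods simply multiply.
    """
    return posterior_null(
        PRIOR_CLEAN,
        P_EMPTY_IF_CLEAN ** n_stacks,
        P_EMPTY_IF_NEEDLY ** n_stacks,
    )


for n in range(4):
    print(n, posterior_after_empty_stacks(n))
```

Exact fractions are used so that the 10/11 of the text appears literally rather than as a rounded decimal.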

Now suppose I was an ordinary statistician: I would simply say my expected number of needles per stack is now down to 0 ± 2.5, and to get to 90% certainty I must search at least ten more haystacks, which is ten times as boring.

Thus it's very important to focus on the question I want to ask--namely, whether I have reason to believe that there is any effect at all. In physical experiments one is often measuring something like a particle mass or a Hall effect, where we know there is some finite answer; we just don't know how big. In this case the Bayesian approach is the same as conventional rules, since we have no viable null hypothesis. But there are many very interesting measurements where we don't know whether the effect we're testing exists, and where the real question is whether or not the null hypothesis--read "simplest theory"--is right. Then Bayes can make a very large difference.

Let us take the "fifth force." If we assume from the outset that there is a fifth force and we need only measure its magnitude, we are assigning the bin with zero range and zero magnitude an infinitesimal probability to begin with. Actually, we should be assigning this bin, which is the null hypothesis we want to test, some finite a priori probability--like 1/2--and sharing out the remaining 1/2 among all the other strengths and ranges. We then ask the question, Does a given set of statistical measurements increase or decrease this share of the probability? It turns out that when one adopts this point of view, it often takes a much larger deviation of the result from zero to begin to decrease the null hypothesis's share than it would in the conventional approach. The formulas are complicated, but there are a couple of rules of thumb that give some idea of the necessary factor. For a large number N of statistically independent measurements, the probability of the null hypothesis must increase by a factor of something like N^{1/2}. (For a rough idea of where this factor comes from, it is the inverse of the probability of an unbiased random walk's ending up at the starting point.) For a multiparameter fit with p parameters, this becomes N^{p/2}. From the Bayesian point of view, it's not clear that even the very first reexamination of Roland von Eötvös's results actually supported the fifth force, and it's very likely that none of the "positive" results were outside the appropriate error limits.
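One way to see the N^{1/2} rule of thumb at work is a textbook toy model, which is not necessarily the calculation the author has in mind. Assume N independent Gaussian measurements with known scatter sigma, a sharp null hypothesis (true mean exactly zero) and, as the alternative, a Gaussian prior of width tau on the true mean. The symbols sigma, tau and z, and both function names, are assumptions introduced for this sketch. In that model the ratio of the probability of the data under the null to that under the alternative is sqrt(1 + r) exp(-(z^2/2) r/(1 + r)) with r = N tau^2/sigma^2, where z is the conventional "number of sigma" of the sample mean.

```python
import math


def bayes_factor_null(z, n, tau_over_sigma=1.0):
    """P(data | null) / P(data | alternative) in the assumed Gaussian model.

    z              -- sample mean divided by its standard error (sigma / sqrt(n))
    n              -- number of independent measurements
    tau_over_sigma -- width of the assumed Gaussian prior on the effect,
                      in units of the single-measurement scatter (hypothetical)
    """
    r = n * tau_over_sigma ** 2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z * z * r / (1.0 + r))


def return_probability(n_steps):
    """Exact probability that an unbiased +/-1 walk is back at its start
    after n_steps steps (n_steps must be even)."""
    return math.comb(n_steps, n_steps // 2) / 2 ** n_steps


# A fixed "2-sigma" deviation, examined with more and more data.
for n in (10, 100, 10_000, 1_000_000):
    print(n, bayes_factor_null(2.0, n))

# The inverse return probability of the random walk grows like sqrt(N).
for n in (100, 10_000):
    print(n, 1.0 / return_probability(n), math.sqrt(math.pi * n / 2))
```

With equal prior odds, a value above 1 means the data have raised the null hypothesis's share of the probability. In this sketch the same 2-sigma deviation lowers that share for N = 10 but raises it for N = 10,000, and at fixed z the factor grows in proportion to N^{1/2} once N is large.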

Another way of putting it is that the proponent of the more complicated theory with extra unknown parameters is free to fix those parameters according to the facts to maximize the a posteriori fit, while the null hypothesis is fixed independent of the data. The Bayesian method enforces Occam's razor by penalizing this introduction of extra parameters; it even tells you when you have added one too many parameters by making your posterior probability worse, not better. (A fixed theory with no new parameters, of course, does not have to pay any penalty.) It also turns out to be independent of tricks like data batching and stopping when you're ahead--but not of discarding "bad runs."

All of this as folklore is not news to most experienced experimentalists. Any good experimentalist will doubt a surprising result with less than 5-sigma "significance," for instance. Nevertheless, the common saying that "with three parameters I can fit an elephant" takes on a new and ominous meaning in this light of Bayesian statistics. How can we ever find and prove any new effect? Again, I think physicists' intuition operates very well: We tend to convert, as soon as possible, our unknown effect into a new or second sharp "null hypothesis." We might propose, for instance, that there is a 17-keV neutrino with some small amplitude, and test that idea and an alternative, null hypothesis on the same footing. If our further data then (hypothetically) indicate a different mass with a different signature, we don't take that as evidence for our new hypothesis, which is now a sharp one, but as a destruction of it. Perhaps a good rule of thumb might be that an effect cannot be taken seriously until it can be used as a null hypothesis in a test of this sort. Of course, statistics can never tell you what causes anything; they are not a defense against insufficiently lateral thinking (such as neglecting to ask whether both or neither of one's hypotheses is true), systematic error, or having found some effect you are not looking for--any or none of which can be operative in a case such as the 17-keV neutrino.

Still, one sees the phrase "significant at the 0.05 or at the 0.01% level" misused all over physics, astrophysics, materials science, chemistry and, worst of all, nutrition and medicine. When you read in your daily paper that pistachio fudge has been shown to have a significantly favorable effect on sufferers from piles, nine times out of ten a Bayesian would say that the experiment significantly reduces the likelihood of there being any effect of fudge on piles. While we physicists have no hope of reforming the public's fascination with meaningless nutritional pronunciamentos, we can be careful with our uses of the word "significance," and we can test our own parameter values realistically.