Intransitive dice VI: sketch proof of the main conjecture for the balanced-sequences model
I have now completed a draft of a write-up of a proof of the following statement. Recall that a random $n$-sided die (in the balanced-sequences model) is a sequence of length $n$ of integers between 1 and $n$ that add up to $n(n+1)/2$, chosen uniformly from all such sequences. A die $A$ beats a die $B$ if the number of pairs $(i,j)$ such that $a_i>b_j$ exceeds the number of pairs $(i,j)$ such that $a_i<b_j$. If the two numbers are the same, we say that $A$ ties with $B$.
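For readers who like to experiment, here is a minimal Python sketch of the model and of the beats relation. It is not part of the write-up, and the helper names such as `random_balanced_die` are just for illustration; it samples balanced dice by straightforward rejection, which is fine for small $n$.

```python
# Minimal sketch of the balanced-sequences model; helper names are illustrative.
import random

def random_balanced_die(n):
    """Uniform sequence in {1,...,n}^n with sum n(n+1)/2, by rejection sampling."""
    target = n * (n + 1) // 2
    while True:
        die = [random.randint(1, n) for _ in range(n)]
        if sum(die) == target:
            return die

def score(A, B):
    """(# pairs with a_i > b_j) minus (# pairs with a_i < b_j)."""
    return sum((a > b) - (a < b) for a in A for b in B)

if __name__ == "__main__":
    n = 30
    A, B = random_balanced_die(n), random_balanced_die(n)
    s = score(A, B)
    print("A beats B" if s > 0 else ("B beats A" if s < 0 else "A ties with B"))
```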
Theorem. Let $A$, $B$ and $C$ be random $n$-sided dice. Then the probability that $A$ beats $C$ given that $A$ beats $B$ and $B$ beats $C$ is $\frac 12+o(1)$.
In this post I want to give a fairly detailed sketch of the proof, which will I hope make it clearer what is going on in the write-up.
The first step is to show that the theorem is equivalent to the following statement.
Theorem. Let $A$ be a random $n$-sided die. Then with probability $1-o(1)$, the proportion of $n$-sided dice that $A$ beats is $\frac 12+o(1)$.
We had two proofs of this statement in earlier posts and comments on this blog. In the write-up I have used a very nice short proof supplied by Luke Pebody. There is no need to repeat it here, since there isn’t much to say that will make it any easier to understand than it already is. I will, however, mention once again an example that illustrates quite well what this statement does and doesn’t say. The example is of a tournament (that is, a complete graph in which every edge is given a direction) where every vertex beats half the other vertices (meaning that half the edges at the vertex go in and half go out) but the tournament does not look at all random. One just takes an odd integer $n$ and puts arrows out from $x$ to $x+j$ mod $n$ for every $1\leq j\leq(n-1)/2$, and arrows into $x$ from $x+j$ for every $(n+1)/2\leq j\leq n-1$. It is not hard to check that the probability that there is an arrow from $x$ to $z$ given that there are arrows from $x$ to $y$ and from $y$ to $z$ is approximately 1/2, and this turns out to be a general phenomenon.
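Here is a quick numerical check of that example (again just an illustration, with hypothetical helper names): it builds the rotational tournament on $\mathbb{Z}_n$ described above and estimates the conditional probability by sampling.

```python
# The rotational tournament on Z_n (n odd): x -> x+j mod n for 1 <= j <= (n-1)/2.
import random

def arrow(x, y, n):
    """True if the edge between distinct x and y points from x to y."""
    return 1 <= (y - x) % n <= (n - 1) // 2

def conditional_probability(n, trials=100000):
    paths = closed = 0
    for _ in range(trials):
        x, y, z = random.sample(range(n), 3)
        if arrow(x, y, n) and arrow(y, z, n):
            paths += 1
            closed += arrow(x, z, n)
    return closed / paths

if __name__ == "__main__":
    print("P(x -> z | x -> y and y -> z) is approximately", conditional_probability(1001))
```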
So how do we prove that almost all $n$-sided dice beat approximately half the other $n$-sided dice?
The first step is to recast the problem as one about sums of independent random variables. Let $[n]$ stand for $\{1,2,\dots,n\}$ as usual. Given a sequence $A=(a_1,\dots,a_n)\in[n]^n$ we define a function $f_A:[n]\to\mathbb{R}$ by setting $f_A(j)$ to be the number of $i$ such that $a_i<j$ plus half the number of $i$ such that $a_i=j$. We also define $g_A(j)$ to be $f_A(j)-(j-\frac 12)$. It is not hard to verify that $A$ beats $B$ if $\sum_jg_A(b_j)<0$, ties with $B$ if $\sum_jg_A(b_j)=0$, and loses to $B$ if $\sum_jg_A(b_j)>0$.
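A short check that the criterion just stated agrees with the original definition (illustrative code only): for balanced $B$ one can check that the score $\#\{(i,j):a_i>b_j\}-\#\{(i,j):a_i<b_j\}$ is in fact exactly $-2\sum_jg_A(b_j)$.

```python
# Check (illustrative) that A beats B exactly when sum_j g_A(b_j) < 0.
import random

def random_balanced_die(n):
    target = n * (n + 1) // 2
    while True:
        die = [random.randint(1, n) for _ in range(n)]
        if sum(die) == target:
            return die

def g(A, j):
    """g_A(j) = #{i : a_i < j} + (1/2)#{i : a_i = j} - (j - 1/2)."""
    return sum(a < j for a in A) + 0.5 * sum(a == j for a in A) - (j - 0.5)

def score(A, B):
    return sum((a > b) - (a < b) for a in A for b in B)

if __name__ == "__main__":
    n = 20
    for _ in range(200):
        A, B = random_balanced_die(n), random_balanced_die(n)
        s = sum(g(A, b) for b in B)
        assert score(A, B) == -2 * s          # exact identity for balanced B
    print("criterion agrees with the direct count on 200 random pairs")
```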
So our question now becomes the following. Suppose we choose a random sequence $B=(b_1,\dots,b_n)\in[n]^n$ with the property that $\sum_jb_j=n(n+1)/2$. What is the probability that $\sum_jg_A(b_j)<0$? (Of course, the answer depends on $A$, and most of the work of the proof comes in showing that a “typical” $A$ has properties that ensure that the probability is about 1/2.)
It is convenient to rephrase the problem slightly, replacing each $b_j$ by $b_j-(n+1)/2$. We can then ask it as follows. Suppose we choose a sequence $(b_1,\dots,b_n)$ of $n$ elements of the set $\{-(n-1)/2,\dots,(n-1)/2\}$, where the terms of the sequence are independent and uniformly distributed. For each $j$ let $X_j=(g_A(b_j),b_j)$ (with $g_A$ translated in the same way). What is the probability that $\sum_jg_A(b_j)<0$ given that $\sum_jb_j=0$?
This is a question about the distribution of $\sum_jX_j$, where the $X_j$ are i.i.d. random variables taking values in $\frac 12\mathbb{Z}\times\mathbb{Z}$ (at least if $n$ is odd — a small modification is needed if $n$ is even). Everything we know about probability would lead us to expect that this distribution is approximately Gaussian, and since it has mean $(0,0)$, it ought to be the case that if we sum up the probabilities that $\sum_jX_j=(x,0)$ over positive $x$, we should get roughly the same as if we sum them up over negative $x$. Also, it is highly plausible that the probability of getting $(0,0)$ will be a lot smaller than either of these two sums.
So there we have a heuristic argument for why the second theorem, and hence the first, ought to be true.
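The heuristic is easy to test numerically for small $n$: fix a typical die $A$, sample sequences $B$, keep the balanced ones, and see what fraction satisfy $\sum_jg_A(b_j)<0$. (Illustrative code; the parameters are arbitrary.)

```python
# Monte Carlo version of the heuristic: for a fixed typical die A, estimate the
# proportion of balanced dice B with sum_j g_A(b_j) < 0 (i.e. the dice A beats).
import random

def random_balanced_die(n):
    target = n * (n + 1) // 2
    while True:
        die = [random.randint(1, n) for _ in range(n)]
        if sum(die) == target:
            return die

def g_table(A, n):
    return [sum(a < j for a in A) + 0.5 * sum(a == j for a in A) - (j - 0.5)
            for j in range(1, n + 1)]

if __name__ == "__main__":
    n, trials = 30, 2000
    gA = g_table(random_balanced_die(n), n)
    wins = ties = 0
    for _ in range(trials):
        B = random_balanced_die(n)
        s = sum(gA[b - 1] for b in B)
        wins += s < 0
        ties += s == 0
    print("A beats about %.3f of the sampled dice, ties with %.3f"
          % (wins / trials, ties / trials))
```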
There are several theorems in the literature that initially seemed as though they should be helpful. And indeed they were helpful, but we were unable to apply them directly, and had instead to develop our own modifications of their proofs.
The obvious theorem to mention is the central limit theorem. But this is not strong enough for two reasons. The first is that it tells you about the probability that a sum of random variables will lie in some rectangular region of $\mathbb{R}^2$ of size comparable to the standard deviation. It will not tell you the probability of belonging to some subset of the $y$-axis (even for discrete random variables). Another problem is that the central limit theorem on its own does not give information about the rate of convergence to a Gaussian, whereas here we need such information.
The second problem is dealt with for many applications by the Berry-Esseen theorem, but not the first.
The first problem is dealt with for many applications by local central limit theorems, about which Terence Tao has blogged in the past. These tell you not just about the probability of landing in a region, but about the probability of actually equalling some given value, with estimates that are precise enough to give, in many situations, the kind of information that we seek here.
What we did not find, however, was precisely the theorem we were looking for: a statement that would be local and 2-dimensional and would give information about the rate of convergence that was sufficiently strong that we would be able to obtain good enough convergence after only $n$ steps. (I use the word “step” here because we can think of a sum of $n$ independent copies of a 2D random variable as an $n$-step random walk.) It was not even clear in advance what such a theorem should say, since we did not know what properties we would be able to prove about the random variables $X_j$ when $A$ was “typical”. That is, we knew that not every $A$ worked, so the structure of the proof (probably) had to be as follows.
1. Prove that $A$ has certain properties with probability $1-o(1)$.
2. Using these properties, deduce that the sum $\sum_jX_j$ converges very well after $n$ steps to a Gaussian.
3. Conclude that the heuristic argument is indeed correct.
The key properties that $A$ needed to have were the following two. First, there needed to be a bound on the higher moments of $g_A$. This we achieved in a slightly wasteful way — but the cost was a log factor that we could afford — by arguing that with high probability no value of $g_A$ has magnitude greater than $C\sqrt{n\log n}$. To prove this the steps were as follows.
- Let $B=(b_1,\dots,b_n)$ be a random element of $[n]^n$ (with no constraint on the sum). Then the probability that there exists $j$ with $|g_B(j)|\geq C\sqrt{n\log n}$ is at most $n^{-K}$ (for some $K$ such as 10, if $C$ is chosen large enough).
- The probability that $b_1+\dots+b_n=n(n+1)/2$ is at least $cn^{-3/2}$ for some absolute constant $c>0$.
- It follows that if $A$ is a random $n$-sided die, then with probability at least $1-c^{-1}n^{3/2-K}$ we have $|g_A(j)|\leq C\sqrt{n\log n}$ for every $j$.
 
The proofs of the first two statements are standard probabilistic estimates about sums of independent random variables.
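Here is an illustrative empirical look at the first property (constants and parameters chosen arbitrarily): for random balanced dice the maximum of $|g_A|$ does indeed stay within a small multiple of $\sqrt{n\log n}$.

```python
# Empirical check of max_j |g_A(j)| against sqrt(n log n) for random dice.
import math, random

def random_balanced_die(n):
    target = n * (n + 1) // 2
    while True:
        die = [random.randint(1, n) for _ in range(n)]
        if sum(die) == target:
            return die

def max_abs_g(A, n):
    equal = [0] * (n + 2)
    for a in A:
        equal[a] += 1
    below, worst = 0, 0.0
    for j in range(1, n + 1):
        f = below + 0.5 * equal[j]            # f_A(j)
        worst = max(worst, abs(f - (j - 0.5)))
        below += equal[j]                     # now counts #{i : a_i < j+1}
    return worst

if __name__ == "__main__":
    n = 100
    ratios = [max_abs_g(random_balanced_die(n), n) / math.sqrt(n * math.log(n))
              for _ in range(50)]
    print("max_j |g_A(j)| / sqrt(n log n) over 50 dice: worst =", max(ratios))
```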
The second property that $A$ needed to have is more difficult to obtain. There is a standard Fourier-analytic approach to proving central limit theorems, and in order to get good convergence it turns out that what one wants is for a certain Fourier transform to be sufficiently well bounded away from 1. More precisely, we define the characteristic function of the random variable $X_j$ to be

$$\hat g_A(\alpha,\beta)=\mathbb{E}\,e\bigl(\alpha g_A(b_j)+\beta b_j\bigr),$$

where $e(x)$ is shorthand for $e^{2\pi ix}$, the expectation is over the uniformly random $b_j$, and $\alpha$ and $\beta$ range over $[-1/2,1/2]$.
I’ll come later to why it is good for $|\hat g_A(\alpha,\beta)|$ not to be too close to 1. But for now I want to concentrate on how one proves a statement like this, since that is perhaps the least standard part of the argument.
To get an idea, let us first think what it would take for $|\hat g_A(\alpha,\beta)|$ to be very close to 1. This condition basically tells us that $\alpha g_A(j)+\beta j$ is highly concentrated mod 1: indeed, if $\alpha g_A(j)+\beta j$ is highly concentrated, then $e(\alpha g_A(j)+\beta j)$ takes approximately the same value almost all the time, so the average is roughly equal to that value, which has modulus 1; conversely, if $\alpha g_A(j)+\beta j$ is not highly concentrated mod 1, then there is plenty of cancellation between the different values of $e(\alpha g_A(j)+\beta j)$ and the result is that the average has modulus appreciably smaller than 1.
So the task is to prove that the values of $\alpha g_A(j)+\beta j$ are reasonably well spread about mod 1. Note that this is more or less saying that the values of $g_A$ are reasonably spread about, since the linear term $\beta j$ turns out not to cause trouble (it is annihilated by the difference argument below).
The way we prove this is roughly as follows. Let $\alpha\geq n^{-1/2}$, let $d$ be of order of magnitude $\alpha^{-2}$, and consider the values of $g_A$ at the four points $x,x+d,x+2d,x+3d$. Then a typical order of magnitude of the third difference $g_A(x)-3g_A(x+d)+3g_A(x+2d)-g_A(x+3d)$ is around $\sqrt d$, and one can prove without too much trouble (here the Berry-Esseen theorem was helpful to keep the proof short) that the probability that

$$|g_A(x)-3g_A(x+d)+3g_A(x+2d)-g_A(x+3d)|\geq c\sqrt d$$

is at least $c$, for some positive absolute constant $c$. It follows by Markov’s inequality that with positive probability one has the above inequality for many values of $x$.
That’s not quite good enough, since we want a probability that’s very close to 1. This we obtain by chopping up $[n]$ into intervals of length about $4d$ and applying the above argument in each interval. (While writing this I’m coming to think that I could just as easily have gone for progressions of length 3, not that it matters much.) Then in each interval there is a reasonable probability of getting the above inequality to hold many times, from which one can prove that with very high probability it holds many times.
But since $d$ is of order $\alpha^{-2}$, $\alpha\sqrt d$ is of order 1, which gives that the values $e(\alpha g_A(x)+\beta x)$ are far from constant whenever the above inequality holds. So by averaging we end up with a good upper bound for $|\hat g_A(\alpha,\beta)|$.
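To see the third differences in action, here is an illustrative computation for a single random die (the values of $n$ and $d$ are arbitrary); the typical size is indeed of the order of $\sqrt d$.

```python
# Third differences of g_A at separation d: their typical size is of order sqrt(d).
import math, random

def random_balanced_die(n):
    target = n * (n + 1) // 2
    while True:
        die = [random.randint(1, n) for _ in range(n)]
        if sum(die) == target:
            return die

def g_values(A, n):
    return [sum(a < j for a in A) + 0.5 * sum(a == j for a in A) - (j - 0.5)
            for j in range(1, n + 1)]

if __name__ == "__main__":
    n, d = 200, 16
    g = g_values(random_balanced_die(n), n)
    diffs = sorted(abs(g[x] - 3 * g[x + d] + 3 * g[x + 2 * d] - g[x + 3 * d])
                   for x in range(n - 3 * d))
    print("median third difference = %.2f, sqrt(d) = %.2f"
          % (diffs[len(diffs) // 2], math.sqrt(d)))
```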
The alert reader will have noticed that if $\alpha<n^{-1/2}$, then the above argument doesn’t work, because we can’t choose $d$ to be bigger than $n$. In that case, however, we just do the best we can: we choose $d$ to be of order $n/\log n$, the logarithmic factor being there because we need to operate in many different intervals in order to get the probability to be high. We will get many quadruples where

$$|g_A(x)-3g_A(x+d)+3g_A(x+2d)-g_A(x+3d)|\geq c\sqrt d,$$

and this translates into a lower bound for $1-|\hat g_A(\alpha,\beta)|$ of order $\alpha^2n/\log n$, basically because $1-\cos(2\pi\theta)$ has order $\theta^2$ for small $\theta$. This is a good bound for us as long as we can use it to prove that $|\hat g_A(\alpha,\beta)|^n$ is bounded above by a large negative power of $n$. For that we need $\alpha^2n/\log n$ to be at least $C\log n/n$ (since $(1-C\log n/n)^n$ is about $n^{-C}$), so we are in good shape provided that $\alpha\geq C\log n/n$.
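As a quick sanity check on the two approximations used in that calculation (the numbers below are arbitrary):

```python
# 1 - cos(2*pi*theta) is about 2*pi^2*theta^2 for small theta, and
# (1 - C*log(n)/n)^n is about n^(-C).
import math

theta = 0.01
print(1 - math.cos(2 * math.pi * theta), 2 * math.pi ** 2 * theta ** 2)

n, C = 10 ** 6, 5
print((1 - C * math.log(n) / n) ** n, n ** (-C))
```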
The alert reader will also have noticed that the probabilities for different intervals are not independent: for example, if some $f_A(x)$ is equal to $n$, then beyond that $g_A$ depends linearly on $x$. However, except when $x$ is very large, this is extremely unlikely, and it is basically the only thing that can go wrong. To make this rigorous we formulated a concentration inequality that states, roughly speaking, that if you have a bunch of $m$ events, and almost always (that is, always, unless some very unlikely event occurs) the probability that the $i$th event holds given that all the previous events hold is at least $c$, then the probability that fewer than $cm/2$ of the events hold is exponentially small in $m$. The proof of the concentration inequality is a standard exponential-moment argument, with a small extra step to show that the low-probability events don’t mess things up too much.
Incidentally, the idea of splitting up the interval in this way came from an answer by Serguei Popov to a MathOverflow question I asked, when I got slightly stuck trying to prove a lower bound for the second moment of $g_A$. I eventually didn’t use that bound, but the interval-splitting idea helped for the bound for the Fourier coefficient as well.
So in this way we prove that $|\hat g_A(\alpha,\beta)|^n$ is very small if $|\alpha|\geq C\log n/n$. A simpler argument of a similar flavour shows that $|\hat g_A(\alpha,\beta)|^n$ is also very small if $|\alpha|$ is smaller than this and $|\beta|\geq C\log n/n^{3/2}$.
Now let us return to the question of why we might like $|\hat g_A(\alpha,\beta)|$ to be small. It follows from the inversion and convolution formulae in Fourier analysis. The convolution formula tells us that the characteristic function of the sum of the $X_j$ (which are independent and each have characteristic function $\hat g_A$) is $\hat g_A^n$. And then the inversion formula tells us that

$$\mathbb{P}\Bigl[\sum_jX_j=(x,y)\Bigr]=\iint\hat g_A(\alpha,\beta)^n\,e(-\alpha x-\beta y)\,d\alpha\,d\beta,$$

where the integral is over the range of $\alpha$ and $\beta$ described earlier. What we have proved can be used to show that the contribution to the integral on the right-hand side from those pairs $(\alpha,\beta)$ that lie outside a small rectangle (of width $1/n$ in the $\alpha$ direction and $n^{-3/2}$ in the $\beta$ direction, up to log factors) is negligible.
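The smallness outside the rectangle is easy to see numerically for a single die (illustrative code; the sample points are chosen by hand, two inside the box and two outside):

```python
# |g_hat_A(alpha, beta)|^n at a few points: close to 1 inside the small box,
# negligible outside it.
import cmath, math, random

def random_balanced_die(n):
    target = n * (n + 1) // 2
    while True:
        die = [random.randint(1, n) for _ in range(n)]
        if sum(die) == target:
            return die

def g_values(A, n):
    return [sum(a < j for a in A) + 0.5 * sum(a == j for a in A) - (j - 0.5)
            for j in range(1, n + 1)]

def char_fn(gA, alpha, beta):
    n = len(gA)
    centred = [j + 1 - (n + 1) / 2 for j in range(n)]
    return sum(cmath.exp(2j * math.pi * (alpha * gA[j] + beta * centred[j]))
               for j in range(n)) / n

if __name__ == "__main__":
    n = 100
    gA = g_values(random_balanced_die(n), n)
    points = [(0.2 / n, 0.0), (0.0, 0.2 / n ** 1.5),          # inside the box
              (5 * math.log(n) / n, 0.0), (0.25, 0.1)]        # outside the box
    for alpha, beta in points:
        print("alpha=%.5f beta=%.5f  |g_hat_A|^n = %.3g"
              % (alpha, beta, abs(char_fn(gA, alpha, beta)) ** n))
```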
All the above is true provided the random $n$-sided die $A$ satisfies two properties (the bound on the maximum of $|g_A|$ and the bound on $|\hat g_A(\alpha,\beta)|$ outside the small rectangle), which it does with probability $1-o(1)$.
We now take a die $A$ with these properties and turn our attention to what happens inside this box. First, it is a standard fact about characteristic functions that their derivatives tell us about moments. Indeed, writing $(U,V)$ for one of the $X_j$ (which has mean $(0,0)$),

$$\frac{\partial^{r+s}}{\partial\alpha^r\,\partial\beta^s}\,\mathbb{E}\,e(\alpha U+\beta V)=(2\pi i)^{r+s}\,\mathbb{E}\bigl[U^rV^s\,e(\alpha U+\beta V)\bigr],$$

and when $(\alpha,\beta)=(0,0)$ this is $(2\pi i)^{r+s}\,\mathbb{E}[U^rV^s]$. It therefore follows from the two-dimensional version of Taylor’s theorem that

$$\hat g_A(\alpha,\beta)=1-2\pi^2\,\mathbb{E}\bigl[(\alpha U+\beta V)^2\bigr]$$

plus a remainder term $\eta$ that can be bounded above by a constant times $\mathbb{E}\,|\alpha U+\beta V|^3$.
Writing $q(\alpha,\beta)$ for $\mathbb{E}\bigl[(\alpha U+\beta V)^2\bigr]$ we have that $q$ is a positive semidefinite quadratic form in $\alpha$ and $\beta$. (In fact, it turns out to be positive definite.) Provided $\eta$ is small enough, replacing it by zero does not have much effect on $\hat g_A(\alpha,\beta)^n$, and provided $q(\alpha,\beta)$ is small enough, $(1-2\pi^2q(\alpha,\beta))^n$ is well approximated by $\exp(-2\pi^2nq(\alpha,\beta))$.
It turns out, crucially, that the approximations just described are valid in a box that is much bigger than the box inside which $\hat g_A(\alpha,\beta)^n$ has a chance of not being small. That implies that the Gaussian decays quickly (and is why we know that $q$ is positive definite).
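Here is an illustrative numerical comparison of $\hat g_A(\alpha,\beta)^n$ with $\exp(-2\pi^2nq(\alpha,\beta))$ at a few points inside the box (parameters arbitrary); the two agree closely.

```python
# Compare |g_hat_A(alpha, beta)^n| with exp(-2*pi^2*n*q(alpha, beta)) inside the
# box, where q(alpha, beta) is the average of (alpha*g_A(b) + beta*b)^2 (b centred).
import cmath, math, random

def random_balanced_die(n):
    target = n * (n + 1) // 2
    while True:
        die = [random.randint(1, n) for _ in range(n)]
        if sum(die) == target:
            return die

def g_values(A, n):
    return [sum(a < j for a in A) + 0.5 * sum(a == j for a in A) - (j - 0.5)
            for j in range(1, n + 1)]

if __name__ == "__main__":
    n = 100
    gA = g_values(random_balanced_die(n), n)
    centred = [j + 1 - (n + 1) / 2 for j in range(n)]
    for alpha, beta in [(0.2 / n, 0.0), (0.0, 0.3 / n ** 1.5), (0.3 / n, 0.2 / n ** 1.5)]:
        phi = sum(cmath.exp(2j * math.pi * (alpha * gA[j] + beta * centred[j]))
                  for j in range(n)) / n
        q = sum((alpha * gA[j] + beta * centred[j]) ** 2 for j in range(n)) / n
        print("|g_hat_A^n| = %.4f   exp(-2 pi^2 n q) = %.4f"
              % (abs(phi) ** n, math.exp(-2 * math.pi ** 2 * n * q)))
```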
There is a bit of back-of-envelope calculation needed to check this, but the upshot is that the probability that $\sum_jX_j=(x,y)$ is very well approximated, at least when $x$ and $y$ aren’t too big, by a formula of the form

$$\iint\exp\bigl(-2\pi^2nq(\alpha,\beta)\bigr)\,e(-\alpha x-\beta y)\,d\alpha\,d\beta.$$
But this is the formula for the Fourier transform of a Gaussian (at least if we let $\alpha$ and $\beta$ range over all of $\mathbb{R}$, which makes very little difference to the integral because the Gaussian decays so quickly), so it is the restriction to $\frac 12\mathbb{Z}\times\mathbb{Z}$ of a Gaussian, just as we wanted.
When we sum over infinitely many values of $x$ and $y$, uniform estimates are not good enough, but we can deal with that very directly by using simple measure concentration estimates to prove that the probability that $\sum_jX_j=(x,y)$ is very small outside a not too large box.
That completes the sketch of the main ideas that go into showing that the heuristic argument is indeed correct.
Any comments about the current draft would be very welcome, and if anyone feels like working on it directly rather than through me, that is certainly a possibility — just let me know. I will try to post soon on the following questions, since it would be very nice to be able to add answers to them.
1. Is the more general quasirandomness conjecture false, as the experimental evidence suggests? (It is equivalent to the statement that if $A$ and $B$ are two random $n$-sided dice, then with probability $1-o(1)$, the four possibilities for whether another die beats $A$ and whether it beats $B$ each have probability $\frac 14+o(1)$.)
2. What happens in the multiset model? Can the above method of proof be adapted to this case?
3. The experimental evidence suggests that transitivity almost always occurs if we pick purely random sequences from $[n]^n$. Can we prove this rigorously? (I think I basically have a proof of this, by showing that whether or not $A$ beats $B$ almost always depends on whether $A$ has a bigger sum than $B$. I’ll try to find time reasonably soon to add this to the draft.)
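For what it is worth, question 3 is also easy to explore numerically (illustrative code): for independent uniform sequences, the winner and the sequence with the larger sum agree in nearly every sampled pair.

```python
# For purely random sequences from [n]^n, "A beats B" almost always agrees
# with "sum(A) > sum(B)".
import random

def score(A, B):
    return sum((a > b) - (a < b) for a in A for b in B)

if __name__ == "__main__":
    n, trials = 100, 300
    agree = decided = 0
    for _ in range(trials):
        A = [random.randint(1, n) for _ in range(n)]
        B = [random.randint(1, n) for _ in range(n)]
        s = score(A, B)
        if s != 0 and sum(A) != sum(B):
            decided += 1
            agree += (s > 0) == (sum(A) > sum(B))
    print("agreement on %d of %d decided pairs" % (agree, decided))
```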
Of course, other suggestions for follow-up questions will be very welcome, as will ideas about the first two questions above.