In the last post, we considered a Gaussian process $\{X_t\}_{t \in T}$ and were trying to find upper bounds on the quantity $\mathbb{E} \sup_{t \in T} X_t$. We saw that one could hope to improve over the union bound by clustering the points and then taking mini union bounds in each cluster.
To specify a clustering, we'll take a sequence of progressively finer approximations to our set $T$. First, recall that we fixed $t_0 \in T$, and we have used the observation that $\mathbb{E} \sup_{t \in T} X_t = \mathbb{E} \sup_{t \in T} (X_t - X_{t_0})$.
Now, assume that $T$ is finite. Write $T_0 = \{t_0\}$, and consider a sequence of subsets $\{T_n\}_{n \ge 0}$ such that $T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots \subseteq T$. We will assume that for some large enough $m$, we have $T_n = T$ for $n \ge m$. For every $n \ge 0$, let $\pi_n : T \to T_n$ denote a "closest point map" which sends $t \in T$ to the closest point in $T_n$.
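In a concrete metric-space setting, the closest point map is just a nearest-neighbor lookup. A minimal sketch (the points, the Euclidean metric, and the names below are illustrative assumptions, not from the post; for a Gaussian process one would use the canonical metric $d(s,t) = \sqrt{\mathbb{E}(X_s - X_t)^2}$):

```python
import math

def d(s, t):
    """Illustrative metric: Euclidean distance between points given as tuples."""
    return math.dist(s, t)

def closest_point_map(T_n):
    """Return pi_n: the map sending a point t to its nearest point in T_n."""
    return lambda t: min(T_n, key=lambda s: d(s, t))

# A tiny example: T_1 consists of two cluster centers.
pi_1 = closest_point_map([(0.0, 0.0), (10.0, 0.0)])
assert pi_1((1.0, 2.0)) == (0.0, 0.0)
assert pi_1((9.0, -1.0)) == (10.0, 0.0)
```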
The main point is that we can now write, for any $t \in T$,

\[ X_t - X_{t_0} = \sum_{n \ge 1} \left( X_{\pi_n(t)} - X_{\pi_{n-1}(t)} \right). \qquad (1) \]

(The sum telescopes because $\pi_0(t) = t_0$ and $\pi_n(t) = t$ once $T_n = T$.) This decomposition is where the term "chaining" arises, and now the idea is to bound the probability that $X_t - X_{t_0}$ is large in terms of the segments in the chain.
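The identity (1) is nothing more than a telescoping sum. A tiny numerical check, with made-up values standing in for $X_{\pi_0(t)}, \ldots, X_{\pi_m(t)}$:

```python
# Illustrative values of X_{pi_0(t)}, X_{pi_1(t)}, ..., X_{pi_m(t)} = X_t
# (recall pi_0(t) = t_0, and pi_n(t) = t once T_n = T).
chain = [0.0, 1.3, 0.9, 2.1]

# The chain increments X_{pi_n(t)} - X_{pi_{n-1}(t)} appearing in (1).
increments = [b - a for a, b in zip(chain, chain[1:])]

# They sum to X_t - X_{t_0}.
assert abs(sum(increments) - (chain[-1] - chain[0])) < 1e-9
```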
What should $T_n$ look like?
One question that arises is how we should think about choosing the approximations $T_n$. We are trading off two measures of quality: the denser $T_n$ is in the set $T$ (or, more precisely, in the set $T_{n+1}$), the smaller the variances of the segments $X_{\pi_{n+1}(t)} - X_{\pi_n(t)}$ will be. On the other hand, the larger $T_n$ is, the more segments we'll have to take a union bound over.
So far, we haven't used any property of our random variables except for the fact that they are centered. To make a more informed decision about how to choose the sets $\{T_n\}$, let's recall the classical Gaussian concentration bound:

\[ \mathbb{P}\left( X_s - X_t > \lambda \right) \le \exp\left( - \frac{\lambda^2}{2\, d(s,t)^2} \right). \qquad (2) \]

This should look familiar: $X_s - X_t$ is a mean-zero Gaussian with variance $d(s,t)^2 = \mathbb{E}\,(X_s - X_t)^2$.
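A quick sanity check of (2) by simulation (the variance and threshold below are arbitrary illustrative choices):

```python
import math
import random

random.seed(0)

sigma = 2.0   # plays the role of d(s, t)
lam = 3.0     # the threshold lambda in (2)
N = 200_000

# Empirical tail probability of a mean-zero Gaussian with variance sigma^2.
hits = sum(random.gauss(0.0, sigma) > lam for _ in range(N))
empirical = hits / N

# The right-hand side of (2).
bound = math.exp(-lam**2 / (2 * sigma**2))

assert empirical <= bound
```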
Now, a first instinct might be to choose the sets $T_n$ to be progressively denser in $T$. In this case, a natural choice would be to insist on something like $T_n$ being a $2^{-n}$-net in $T$. If one continues down this path in the right way, a similar theory would develop. We're going to take a different route and consider the other side of the tradeoff.
Instead of insisting that $T_n$ has a certain level of accuracy, we'll insist that $T_n$ is at most a certain size. Should we require $|T_n| \le 2^n$ or $|T_n| \le n^2$, or use some other function? To figure out the right bound, we look at (2). Suppose that $g_1, g_2, \ldots, g_m$ are i.i.d. standard Gaussian random variables. In that case, applying (2) and a union bound, we see that to achieve

\[ \mathbb{P}\left( \exists\, i \le m : g_i > B \right) \le m \cdot \mathbb{P}(g_1 > B) \le \tfrac{1}{2}, \]

we need to select $B \asymp \sqrt{\log m}$. If we look instead at $m^2$ points instead of $m$ points, the bound grows to $\sqrt{\log m^2} = \sqrt{2} \cdot \sqrt{\log m}$. Thus we can generally square the number of points before the union bound has to pay a constant factor increase. This suggests that the right scaling is something like $|T_{n+1}| = |T_n|^2$. So we'll require that $|T_n| \le 2^{2^n}$ for all $n \ge 1$.
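This square-for-free phenomenon is easy to observe numerically. A small Monte Carlo sketch (the sizes and trial counts are arbitrary choices for illustration):

```python
import random

random.seed(1)

def avg_max(m, trials=100):
    """Monte Carlo estimate of E[max of m i.i.d. standard normals]."""
    return sum(max(random.gauss(0.0, 1.0) for _ in range(m))
               for _ in range(trials)) / trials

m = 200
a = avg_max(m)       # grows like sqrt(2 log m)
b = avg_max(m * m)   # squaring the number of points...

# ...multiplies the maximum by roughly sqrt(2) (up to lower-order terms),
# rather than doubling it.
ratio = b / a
```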
The generic chaining
This leads us to the generic chaining bound, due to Fernique (though the formulation we state here is from Talagrand).
Theorem 2 Let $\{X_t\}_{t \in T}$ be a Gaussian process, and let $T_0 \subseteq T_1 \subseteq \cdots \subseteq T$ be a sequence of subsets such that $|T_0| = 1$ and $|T_n| \le 2^{2^n}$ for $n \ge 1$. Then,

\[ \mathbb{E} \sup_{t \in T} X_t \le O(1)\, \sup_{t \in T} \sum_{n \ge 0} 2^{n/2}\, d(t, T_n). \qquad (3) \]
Proof: As before, let $\pi_n : T \to T_n$ denote the closest point map, and recall that $\pi_0(t) = t_0$ for all $t \in T$. Using (2), for any $n \ge 1$, $t \in T$, and $u > 0$, we have

\[ \mathbb{P}\left( |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u\, 2^{n/2}\, d(\pi_n(t), \pi_{n-1}(t)) \right) \le 2 \exp\left( - \frac{u^2}{2}\, 2^n \right). \]
Now, the number of pairs $(\pi_n(t), \pi_{n-1}(t))$ can be bounded by $|T_n| \cdot |T_{n-1}| \le 2^{2^n} \cdot 2^{2^{n-1}} \le 2^{2^{n+1}}$, so we have

\[ \mathbb{P}\left( \exists\, t \in T : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u\, 2^{n/2}\, d(\pi_n(t), \pi_{n-1}(t)) \right) \le 2^{2^{n+1}} \cdot 2 \exp\left( - \frac{u^2}{2}\, 2^n \right). \qquad (4) \]
If we define the event

\[ \mathcal{E}_u = \left\{ \forall\, n \ge 1,\ \forall\, t \in T : \ |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| \le u\, 2^{n/2}\, d(\pi_n(t), \pi_{n-1}(t)) \right\}, \]

then summing (4) over $n \ge 1$ yields

\[ \mathbb{P}\left( \overline{\mathcal{E}_u} \right) \le \sum_{n \ge 1} 2^{2^{n+1}} \cdot 2 \exp\left( - \frac{u^2}{2}\, 2^n \right) \le O(1)\, e^{-u^2} \qquad (5) \]

for $u \ge 4$, since we get geometrically decreasing summands.
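To spell out the geometric decay (one way to check it, with the constants chosen loosely): since $u \ge 4$ gives $u^2 \ge 16$, we can write $\frac{u^2}{2}\, 2^n = u^2 + u^2 (2^{n-1} - 1) \ge u^2 + 8(2^n - 2)$, and therefore

\[ 2^{2^{n+1}} \cdot 2 \exp\left( - \frac{u^2}{2}\, 2^n \right) \le 2\, e^{-u^2} \exp\left( (2 \ln 2 - 8)\, 2^n + 16 \right), \]

whose exponent decreases geometrically in $n$; summing over $n \ge 1$ gives the $O(1)\, e^{-u^2}$ bound.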
Note that if $\mathcal{E}_u$ occurs, then by (1), $\sup_{t \in T} (X_t - X_{t_0}) \le u \sup_{t \in T} \sum_{n \ge 1} 2^{n/2}\, d(\pi_n(t), \pi_{n-1}(t))$. Thus (5) implies that for $u \ge 4$,

\[ \mathbb{P}\left( \sup_{t \in T} (X_t - X_{t_0}) > u \sup_{t \in T} \sum_{n \ge 1} 2^{n/2}\, d(\pi_n(t), \pi_{n-1}(t)) \right) \le O(1)\, e^{-u^2}. \qquad (6) \]

Finally, by the triangle inequality,

\[ d(\pi_n(t), \pi_{n-1}(t)) \le d(t, \pi_n(t)) + d(t, \pi_{n-1}(t)) = d(t, T_n) + d(t, T_{n-1}) \le 2\, d(t, T_{n-1}). \]

Plugging this into (6) and integrating the tail bound over $u$ recovers (3).
Theorem 2 gives us a fairly natural way to upper bound the expected supremum using a hierarchical clustering of $T$. Rather amazingly, as we'll see in the next post, this upper bound is tight. Talagrand's majorizing measure theorem states that if we take the best choice of the sets $\{T_n\}$ in Theorem 2, then the upper bound in (3) is within a constant factor of $\mathbb{E} \sup_{t \in T} X_t$.
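To make the bound concrete, here is a small numerical sketch. Everything in it is an illustrative assumption, not from the post: we take the toy process $X_t = \langle g, t \rangle$ for $t$ ranging over eight points on the unit circle in $\mathbb{R}^2$ (so $d(s,t)$ is Euclidean distance), pick an admissible sequence $T_n$ by hand, and compare the chaining functional from (3) against a Monte Carlo estimate of $\mathbb{E} \sup_t X_t$.

```python
import math
import random

random.seed(2)

def d(s, t):
    """Canonical metric for X_t = <g, t>: the Euclidean distance."""
    return math.dist(s, t)

# Eight points on the unit circle in R^2 (an illustrative index set T).
T = [(math.cos(2 * math.pi * j / 8), math.sin(2 * math.pi * j / 8))
     for j in range(8)]

# A hand-chosen admissible sequence: |T_0| = 1 and |T_n| <= 2^(2^n),
# with T_n = T for n >= 2 (sizes 1, 4, 8).
T_sets = [T[:1], T[:4], T]

# The chaining functional from (3): sup_t sum_{n>=0} 2^(n/2) * d(t, T_n).
chaining_bound = max(
    sum(2 ** (n / 2) * min(d(t, s) for s in T_n)
        for n, T_n in enumerate(T_sets))
    for t in T
)

# Monte Carlo estimate of E sup_t X_t, with g a standard Gaussian vector.
trials = 20000
total = 0.0
for _ in range(trials):
    g = (random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))
    total += max(g[0] * x + g[1] * y for (x, y) in T)
emp = total / trials

# The expected supremum is indeed dominated by the chaining functional.
assert emp <= chaining_bound
```

Here the functional is loose by a small constant factor, which is all Theorem 2 promises; the majorizing measure theorem says no admissible sequence can do better than a constant factor.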