Majorizing measures: Gaussian tools

In order to prove that the chaining argument is tight, we will need some additional properties of Gaussian processes. For the chaining upper bound, we used a series of union bounds specified by a tree structure. As a first step in producing a good lower bound, we will look at a way in which the union bound is tight.

Theorem 1 (Sudakov inequality) For some constant $C > 0$, the following holds. Let ${\{X_t\}_{t \in T}}$ be a Gaussian process such that for every distinct ${s,t \in T}$, we have ${d(s,t) \geq \alpha}$. Then,

$\displaystyle \mathop{\mathbb E} \sup_{t \in T} X_t \geq C \alpha \sqrt{\log |T|}.$

The claim is an elementary calculation for a sequence of i.i.d. ${N(0,1)}$ random variables ${g_1, g_2, \ldots, g_n}$ (i.e. ${\mathop{\mathbb E} \sup_i g_i \geq C\sqrt{\log n}}$). We will reduce the general case to this one using Slepian’s comparison lemma.

Lemma 2 (Slepian’s Lemma) Let ${\{X_t\}_{t \in T}}$ and ${\{Y_t\}_{t \in T}}$ be two Gaussian processes such that for all ${s,t \in T}$,

$\displaystyle \mathop{\mathbb E} \,|X_s - X_t|^2 \geq \mathop{\mathbb E} \,|Y_s - Y_t|^2. \ \ \ \ \ (1)$

Then ${\mathop{\mathbb E} \sup_{t \in T} X_t \geq \mathop{\mathbb E} \sup_{t \in T} Y_t}$.

There is a fairly elementary proof of Slepian’s Lemma (see, e.g. the Ledoux-Talagrand book), if one is satisfied with the weaker conclusion ${2\, \mathop{\mathbb E}\,|X_s-X_t|^2 \geq \mathop{\mathbb E}\,|Y_s-Y_t|^2}$, which suffices for our purposes.

To see that Lemma 2 yields Theorem 1, take a family ${\{X_t\}_{t \in T}}$ with ${d(s,t) \geq \alpha}$ for all ${s \neq t \in T}$ and consider the associated variables ${Y_t = \frac{\alpha}{\sqrt{2}} g_t}$ where ${\{g_t\}_{t \in T}}$ is a family of i.i.d. ${N(0,1)}$ random variables. It is straightforward to verify that (1) holds, hence by the lemma, ${\mathop{\mathbb E} \sup_{t \in T} X_t \geq \frac{\alpha}{\sqrt{2}} \mathop{\mathbb E} \sup_{t \in T} g_t}$, and the result follows from the i.i.d. case.

The Sudakov inequality gives us “one level” of a lower bound; the following strengthening will allow us to use it recursively. If we have a Gaussian process ${\{X_t\}_{t \in T}}$ and ${A \subseteq T}$, we will use the notation

$\displaystyle g(A) = \mathop{\mathbb E} \sup_{t \in A} X_t.$

For ${t \in T}$ and ${R \geq 0}$, we also use the notation

$\displaystyle B(t,R) = \{ s \in T : d(s,t) \leq R \}.$

Here is the main theorem of this post; its statement is all we will require for our proof of the majorizing measures theorem:

Theorem 3 For some constants $C > 0$ and ${r > 1}$, the following holds. Suppose ${\{X_t\}_{t \in T}}$ is a Gaussian process, and let ${t_1, t_2, \ldots, t_m \in T}$ be such that ${d(t_i,t_j) \geq \alpha}$ for ${i \neq j}$. Then,

$\displaystyle g(T) \geq C \alpha \sqrt{\log m} + \min_{i=1,2,\ldots,m} g(B(t_i, \alpha/r)).$

The proof of the preceding theorem relies on the a strong concentration property for Gaussian processes. First, we recall the classical isoperimetric inequality for Gaussian space (see, for instance, (2.9) here).
We remind the reader that for a function ${F : \mathbb R^n \rightarrow \mathbb R}$,

$\displaystyle \|F\|_{\mathrm{Lip}} = \sup_{x \neq y \in \mathbb R^n} \frac{|F(x)-F(y)|}{\|x-y\|}.$

Theorem 4 Let ${F : \mathbb R^n \rightarrow \mathbb R}$, and let ${\mu = \int F \,d\gamma_n}$, where ${\gamma_n}$ is the standard ${n}$-dimensional Gaussian measure. Then,

$\displaystyle \gamma_n\left(x \in \mathbb R^n : |F(x)-\mu| > \lambda\right) \leq 2\,\exp\left(\frac{-\lambda^2}{2 \|F\|_{\mathrm{Lip}}}\right). \ \ \ \ \ (2)$

Using this, we can prove the following remarkable fact.

Theorem 5 Let ${\{X_t\}_{t \in T}}$ be a Gaussian process, then

$\displaystyle \mathbb P\left(\left|\sup_{t \in T} X_t - \mathop{\mathbb E} \sup_{t \in T} X_t\right| > \lambda\right) \leq 2\,\exp\left(\frac{-\lambda^2}{2 \sup_{t \in T} \mathop{\mathbb E}(X_t^2)}\right). \ \ \ \ \ (3)$

A notable aspect of this statement is that only the maximum variance affects the concentration, not the number of random variables. We now prove Theorem 5 using Theorem 4.

Proof: We will prove it in the case ${|T|=n}$, but of course our bound is independent of ${n}$. The idea is that given a Gaussian process ${\{X_1, X_2, \ldots, X_n\}}$, we can write

$\displaystyle X_i = a_{i1} \,g_1 + a_{i2}\, g_2 + \cdots + a_{in}\, g_n,$

for ${i=1,2,\ldots, n}$, where ${\{g_i\}_{i=1}^n}$ are standard i.i.d. normals, and the matrix ${A=(a_{i,j})}$ is a matrix of real coefficients. In this case, if ${g = (g_1, g_2, \ldots, g_n)}$ is a standard ${n}$-dimensional Gaussian, then the vector ${Ag}$ is distributed as ${(X_1, X_2, \ldots, X_n)}$.

If we put ${F(x)=\max \{ (Ax)_i : i=1,\ldots,n\}}$, then Theorem 4 yields (3) as long as ${\|F\|_{\mathrm{Lip}} \leq \max_i \sqrt{\mathop{\mathbb E}(X_i^2)}}$. It is easy to see that

$\displaystyle \|F\|_{\mathrm{Lip}} = \|A\|_{2 \rightarrow \infty} = \sup_{\|x\|_2 = 1} \|A x\|_{\infty}.$

But ${\|A\|_{2 \rightarrow \infty}}$ is just the maximum ${\ell_2}$ norm of any row of ${A}$, and the ${\ell_2}$ norm of row ${i}$ is

$\displaystyle \sqrt{\sum_{j=1}^n a_{ij}^2} = \sqrt{\mathop{\mathbb E}(X_i^2)}.$

$\Box$

Using this theorem, we are ready to prove Theorem 3. I will only give a sketch here, but filling in the details is not too difficult.

Assume that the conditions of Theorem 3 hold. Pick an arbitrary ${t_0 \in T}$, and recall that we can write

$\displaystyle g(T) = \mathop{\mathbb E} \sup_{t \in T} X_t = \mathop{\mathbb E} \sup_{t \in T} (X_t - X_{t_0})$

since our gaussians are centered.

Now, by Theorem 1,

$\displaystyle \mathop{\mathbb E} \max_{i=1,\ldots,m} \left(X_{t_i} - X_{t_0}\right) \geq C \alpha \sqrt{\log m}.$

Suppose that ${t_1}$ achieves this. By definition,

$\displaystyle g(B(t_1, \alpha/r)) = \mathop{\mathbb E} \sup_{t \in B(t_1, \alpha/r)} \left(X_t - X_{t_1}\right),$

so we could hope that for some ${t \in B(t_1,\alpha/r)}$, we simultaneously have ${X_t - X_{t_1} \geq g(B(t_1,\alpha/r))}$, yielding

$\displaystyle X_t - X_{t_0} = (X_t - X_{t_1}) + (X_{t_1} - X_{t_0}) \geq C\alpha \sqrt{\log m} + g(B(t_1, \alpha/r)). \ \ \ \ \ (4)$

The problem, of course, is that the events we are discussing are not independent.

This is where Theorem 5 comes in. For any ${i}$, all the variances of the variables ${\{X_t - X_{t_i} : t \in B(t_i,\alpha/r)\}}$ are bounded by ${d(t,t_i)^2 \leq (\alpha/r)^2}$. This implies that we can choose a constant ${c_0 > 0}$ such that

$\displaystyle \mathbb P\left(\left|\sup_{t \in B(t_i,\alpha/r)} X_t - g(B(t_i, \alpha/r))\right| > c_0 (\alpha/r) \sqrt{\log m}\right) \leq m^{-2}.$

So, in fact, we can expect that none of the ${m}$ random variables ${\sup_{t \in B(t_i,\alpha/r)} X_t}$ will deviate from its expected value by more than ${c_0 (\alpha/r) \sqrt{\log m}}$. Which means we can (morally) replace (4) by

$\displaystyle \begin{array}{rl} X_t - X_{t_0} &= (X_t - X_{t_1}) + (X_{t_1} - X_{t_0}) \\ &\geq C\alpha \sqrt{\log m} + g(B(t_1, \alpha/r)) - c_0 (\alpha/r) \sqrt{\log m}. \end{array}$

But now by choosing ${r = 2 C c_0}$, the error term is absorbed.

The generic chaining

In the last post, we considered a Gaussian process ${\{X_t\}_{t \in T}}$ and were trying to find upper bounds on the quantity ${\mathop{\mathbb E}\sup_{t \in T} X_t}$. We saw that one could hope to improve over the union bound by clustering the points and then taking mini union bounds in each cluster.

Hierarchical clustering

To specify a clustering, we’ll take a sequence of progressively finer approximations to our set ${T}$. First, recall that we fixed ${t_0 \in T}$, and we have used the observation that ${\mathop{\mathbb E}\sup_{t \in T} X_t = \mathop{\mathbb E}\sup_{t \in T} (X_t-X_{t_0})}$.

Now, assume that ${T}$ is finite. Write ${T_0 = \{t_0\}}$, and consider a sequence of subsets ${\{T_n\}}$ such that ${T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots \subseteq T}$. We will assume that for some large enough ${m}$, we have ${T_n = T}$ for ${n \geq m}$. For every ${n \geq 0}$, let ${\pi_n : T \rightarrow T_n}$ denote a “closest point map” which sends $t \in T$ to the closest point in $T_n$.

The main point is that we can now write, for any ${t \in T}$,

$\displaystyle X_t - X_{t_0} = \sum_{n \geq 1} X_{\pi_n(t)} - X_{\pi_{n-1}(t)}. \ \ \ \ \ (1)$

This decomposition is where the term “chaining” arises, and now the idea is to bound the probability that ${X_t - X_{t_0}}$ is large in terms of the segments in the chain.

What should ${T_n}$ look like?

One question that arises is how we should think about choosing the approximations ${T_n}$. We are trading off two measures of quality: The denser ${T_n}$ is in the set ${T}$ (or, more precisely, in the set ${T_{n-1}}$) the smaller the variances of the segments ${X_{\pi_n(t)}-X_{\pi_{n-1}(t)}}$ will be. On the other hand, the larger ${T_n}$ is, the more segments we’ll have to take a union bound over.

So far, we haven’t used any property of our random variables except for the fact that they are centered. To make a more informed decision about how to choose the sets ${\{T_n\}}$, let’s recall the classical Gaussian concentration bound.

Lemma 1 For every ${s,t \in T}$ and ${\lambda > 0}$,

$\displaystyle \mathop{\mathbb P}(X_s - X_t > \lambda) \leq \exp\left(-\frac{\lambda^2}{2\, d(s,t)^2}\right). \ \ \ \ \ (2)$

This should look familiar: ${X_s-X_t}$ is a mean-zero Gaussian with variance ${d(s,t)^2}$.

Now, a first instinct might be to choose the sets ${T_n}$ to be progressively denser in ${T}$. In this case, a natural choice would be to insist on something like ${T_n}$ being a ${2^{-n}}$-net in ${T}$. If one continues down this path in the right way, a similar theory would develop. We’re going to take a different route and consider the other side of the tradeoff.

Instead of insisting that ${T_n}$ has a certain level of accuracy, we’ll insist that ${T_n}$ is at most a certain size. Should we require ${|T_n| \leq n}$ or ${|T_n| \leq 2^n}$, or use some other function? To figure out the right bound, we look at (2). Suppose that ${g_1, g_2, \ldots, g_m}$ are i.i.d. ${N(0,1)}$ random variables. In that case, applying (2) and a union bound, we see that to achieve

$\displaystyle \mathop{\mathbb P}(\exists i : g_i > B) \leq m \mathop{\mathbb P}(g_1 > B) < 1,$

we need to select ${B \asymp \sqrt{\log m}}$. If we look instead at ${m^2}$ points instead of ${m}$ points, the bound grows to ${\sqrt{2 \log m}}$. Thus we can generally square the number of points before the union bound has to pay a constant factor increase. This suggests that the right scaling is something like ${|T_{n+1}| = |T_n|^2}$. So we’ll require that ${|T_n| \leq 2^{2^n}}$ for all ${n \geq 1}$.

The generic chaining

This leads us to the generic chaining bound, due to Fernique (though the formulation we state here is from Talagrand).

Theorem 2 Let ${\{X_t\}_{t \in T}}$ be a Gaussian process, and let ${T_0 \subseteq T_1 \subseteq \cdots \subseteq T}$ be a sequence of subsets such that ${|T_0|=1}$ and ${|T_n| \leq 2^{2^{n}}}$ for ${n \geq 1}$. Then,

$\displaystyle \mathop{\mathbb E}\sup_{t \in T} X_t \leq O(1) \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} d(t, T_n). \ \ \ \ \ (3)$

Proof: As before, let ${\pi_n : T \rightarrow T_n}$ denote the closest point map and let ${T_0 = \{t_0\}}$. Using (2), for any ${n \geq 1}$, ${t \in T}$, and ${u > 0}$, we have

$\displaystyle \mathop{\mathbb P}\left(|X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right) \leq \exp\left(-\frac{u^2}{2} 2^n\right).$

Now, the number of pairs ${(\pi_n(t),\pi_{n-1}(t))}$ can be bounded by ${|T_n| \cdot |T_{n-1}| \leq 2^{2^{n+1}}}$, so we have

$\displaystyle \mathop{\mathbb P}\left(\exists t : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right) \leq 2^{2^{n+1}} \exp\left(-\frac{u^2}{2} 2^n\right). \ \ \ \ \ (4)$

If we define the event

$\displaystyle \Omega_u = \left\{ \forall n \geq 1, t \in T : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| \leq u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right\},$

then summing (4) yields,

$\displaystyle \mathop{\mathbb P}(\overline{\Omega_u}) \leq \sum_{n \geq 1} 2^{2^{n+1}} \exp\left(-\frac{u^2}{2} 2^n\right) \leq O(1)\, e^{-u^2} \ \ \ \ \ (5)$

for ${u \geq 4}$, since we get geometrically decreasing summands.

Write

$\displaystyle S = \sup_{t \in T} \sum_{n \geq 1} 2^{n/2} d(\pi_n(t), \pi_{n-1}(t)).$

Note that if ${\Omega_u}$ occurs, then ${\sup_{t \in T} (X_t - X_{t_0}) \leq uS}$. Thus (5) implies that for $u \geq 4$,

$\displaystyle \mathop{\mathbb P}(\sup_{t \in T} X_t - X_{t_0} > uS) \leq O(1) \, e^{-u^2},$

which implies that

$\displaystyle \mathop{\mathbb E} \sup_{t \in T} X_t \leq O(S) \leq O(1) \sup_{t \in T} \sum_{n \geq 1} 2^{n/2} d(\pi_n(t), \pi_{n-1}(t)). \ \ \ \ \ (6)$

Finally, by the triangle inequality,

$\displaystyle d(\pi_n(t), \pi_{n-1}(t)) \leq d(t, T_n) + d(t, T_{n-1}) \leq 2\,d(t,T_{n-1}).$

Plugging this into (6) recovers (3). $\Box$

Theorem 1.2 gives us a fairly natural way to upper bound the expected supremum using a hierarchical clustering of ${T}$. Rather amazingly, as we’ll see in the next post, this upper bound is tight. Talagrand’s majorizing measure theorem states that if we take the best choice of ${\{T_n\}}$ in Theorem 1.2, then the upper bound in (3) is within a constant factor of ${\mathop{\mathbb E} \sup_{t \in T} X_t}$.