Majorizing measures: Gaussian tools

In order to prove that the chaining argument is tight, we will need some additional properties of Gaussian processes. For the chaining upper bound, we used a series of union bounds specified by a tree structure. As a first step in producing a good lower bound, we will look at a way in which the union bound is tight.

Theorem 1 (Sudakov inequality) For some constant C > 0, the following holds. Let {\{X_t\}_{t \in T}} be a Gaussian process such that for every distinct {s,t \in T}, we have {d(s,t) \geq \alpha}. Then,

\displaystyle  \mathop{\mathbb E} \sup_{t \in T} X_t \geq C \alpha \sqrt{\log |T|}.

The claim is an elementary calculation for a sequence of i.i.d. {N(0,1)} random variables {g_1, g_2, \ldots, g_n} (i.e. {\mathop{\mathbb E} \sup_i g_i \geq C\sqrt{\log n}}). We will reduce the general case to this one using Slepian’s comparison lemma.
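For completeness, here is one way the elementary calculation can go (only a sketch, with {c > 0} a small universal constant). With {\lambda = \sqrt{\log n}}, the standard Gaussian tail lower bound {\mathbb P(g_1 > \lambda) \geq c\, e^{-\lambda^2/2}/\lambda = c/(\lambda \sqrt{n})} and independence give

\displaystyle \mathbb P\left(\max_{i \leq n} g_i \leq \lambda\right) = \left(1 - \mathbb P(g_1 > \lambda)\right)^n \leq \exp\left(-\frac{c \sqrt{n}}{\lambda}\right),

and since {\max_i g_i \geq g_1} pointwise, {\mathop{\mathbb E} \max_i g_i \geq \lambda\, \mathbb P(\max_i g_i > \lambda) - \mathop{\mathbb E}|g_1| \geq C \sqrt{\log n}} once {n} is large enough (smaller {n} can be handled by shrinking {C}).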

Lemma 2 (Slepian’s Lemma) Let {\{X_t\}_{t \in T}} and {\{Y_t\}_{t \in T}} be two Gaussian processes such that for all {s,t \in T},

\displaystyle   \mathop{\mathbb E} \,|X_s - X_t|^2 \geq \mathop{\mathbb E} \,|Y_s - Y_t|^2. \ \ \ \ \ (1)

Then {\mathop{\mathbb E} \sup_{t \in T} X_t \geq \mathop{\mathbb E} \sup_{t \in T} Y_t}.

There is a fairly elementary proof of Slepian’s Lemma (see, e.g. the Ledoux-Talagrand book), if one is satisfied with the weaker conclusion {2\, \mathop{\mathbb E} \sup_{t \in T} X_t \geq \mathop{\mathbb E} \sup_{t \in T} Y_t}, which suffices for our purposes.

To see that Lemma 2 yields Theorem 1, take a family {\{X_t\}_{t \in T}} with {d(s,t) \geq \alpha} for all {s \neq t \in T} and consider the associated variables {Y_t = \frac{\alpha}{\sqrt{2}} g_t} where {\{g_t\}_{t \in T}} is a family of i.i.d. {N(0,1)} random variables. It is straightforward to verify that (1) holds, hence by the lemma, {\mathop{\mathbb E} \sup_{t \in T} X_t \geq \frac{\alpha}{\sqrt{2}} \mathop{\mathbb E} \sup_{t \in T} g_t}, and the result follows from the i.i.d. case.
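To verify (1) here: since {d(s,t)^2 = \mathop{\mathbb E}\,|X_s - X_t|^2} and the {g_t} are independent standard normals, for all distinct {s,t \in T} we have

\displaystyle \mathop{\mathbb E}\,|Y_s - Y_t|^2 = \frac{\alpha^2}{2}\, \mathop{\mathbb E}\,|g_s - g_t|^2 = \alpha^2 \leq d(s,t)^2 = \mathop{\mathbb E}\,|X_s - X_t|^2.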

The Sudakov inequality gives us “one level” of a lower bound; the following strengthening will allow us to use it recursively. If we have a Gaussian process {\{X_t\}_{t \in T}} and {A \subseteq T}, we will use the notation

\displaystyle  g(A) = \mathop{\mathbb E} \sup_{t \in A} X_t.

For {t \in T} and {R \geq 0}, we also use the notation

\displaystyle  B(t,R) = \{ s \in T : d(s,t) \leq R \}.

Here is the main theorem of this post; its statement is all we will require for our proof of the majorizing measures theorem:

Theorem 3 For some constants C > 0 and {r > 1}, the following holds. Suppose {\{X_t\}_{t \in T}} is a Gaussian process, and let {t_1, t_2, \ldots, t_m \in T} be such that {d(t_i,t_j) \geq \alpha} for {i \neq j}. Then,

\displaystyle  g(T) \geq C \alpha \sqrt{\log m} + \min_{i=1,2,\ldots,m} g(B(t_i, \alpha/r)).

The proof of the preceding theorem relies on a strong concentration property for Gaussian processes. First, we recall the classical isoperimetric inequality for Gaussian space (see, for instance, (2.9) here).
We remind the reader that for a function {F : \mathbb R^n \rightarrow \mathbb R},

\displaystyle \|F\|_{\mathrm{Lip}} = \sup_{x \neq y \in \mathbb R^n} \frac{|F(x)-F(y)|}{\|x-y\|}.

Theorem 4 Let {F : \mathbb R^n \rightarrow \mathbb R} be a Lipschitz function, and let {\mu = \int F \,d\gamma_n}, where {\gamma_n} is the standard {n}-dimensional Gaussian measure. Then,

\displaystyle   \gamma_n\left(x \in \mathbb R^n : |F(x)-\mu| > \lambda\right) \leq 2\,\exp\left(\frac{-\lambda^2}{2 \|F\|_{\mathrm{Lip}}^2}\right). \ \ \ \ \ (2)

Using this, we can prove the following remarkable fact.

Theorem 5 Let {\{X_t\}_{t \in T}} be a Gaussian process. Then,

\displaystyle   \mathbb P\left(\left|\sup_{t \in T} X_t - \mathop{\mathbb E} \sup_{t \in T} X_t\right| > \lambda\right) \leq 2\,\exp\left(\frac{-\lambda^2}{2 \sup_{t \in T} \mathop{\mathbb E}(X_t^2)}\right). \ \ \ \ \ (3)
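For example, for {n} i.i.d. standard normals {g_1, \ldots, g_n} (where {\sup_t \mathop{\mathbb E}(X_t^2) = 1}), the bound reads

\displaystyle \mathbb P\left(\left|\max_{i \leq n} g_i - \mathop{\mathbb E} \max_{i \leq n} g_i\right| > \lambda\right) \leq 2\, e^{-\lambda^2/2},

with no dependence on {n}, even though {\mathop{\mathbb E} \max_{i \leq n} g_i} grows like {\sqrt{2 \log n}}.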

A notable aspect of this statement is that only the maximum variance affects the concentration, not the number of random variables. We now prove Theorem 5 using Theorem 4.

Proof: We will prove it in the case {|T|=n}, but of course our bound is independent of {n}. The idea is that given a Gaussian process {\{X_1, X_2, \ldots, X_n\}}, we can write

\displaystyle  X_i = a_{i1} \,g_1 + a_{i2}\, g_2 + \cdots + a_{in}\, g_n,

for {i=1,2,\ldots, n}, where {\{g_i\}_{i=1}^n} are i.i.d. standard normals and {A=(a_{ij})} is a matrix of real coefficients. In this case, if {g = (g_1, g_2, \ldots, g_n)} is a standard {n}-dimensional Gaussian, then the vector {Ag} is distributed as {(X_1, X_2, \ldots, X_n)}.
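In other words, {A} is any matrix satisfying {AA^T = \mathrm{Cov}(X_1,\ldots,X_n)} (for example, a Cholesky factor of the covariance matrix), since

\displaystyle \mathop{\mathbb E}\, X_i X_j = \sum_{k=1}^n a_{ik}\, a_{jk} = (AA^T)_{ij}.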

If we put {F(x)=\max \{ (Ax)_i : i=1,\ldots,n\}}, then Theorem 4 yields (3) as long as {\|F\|_{\mathrm{Lip}} \leq \max_i \sqrt{\mathop{\mathbb E}(X_i^2)}}. It is easy to see that

\displaystyle  \|F\|_{\mathrm{Lip}} = \|A\|_{2 \rightarrow \infty} = \sup_{\|x\|_2 = 1} \|A x\|_{\infty}.
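To check the bound we need (only {\|F\|_{\mathrm{Lip}} \leq \|A\|_{2 \rightarrow \infty}} is actually used below), note that {|\max_i u_i - \max_i v_i| \leq \|u-v\|_\infty} for any vectors {u,v}, and hence for all {x,y \in \mathbb R^n},

\displaystyle |F(x)-F(y)| \leq \|A(x-y)\|_{\infty} \leq \|A\|_{2 \rightarrow \infty}\, \|x-y\|_2.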

But {\|A\|_{2 \rightarrow \infty}} is just the maximum {\ell_2} norm of any row of {A}, and the {\ell_2} norm of row {i} is

\displaystyle \sqrt{\sum_{j=1}^n a_{ij}^2} = \sqrt{\mathop{\mathbb E}(X_i^2)}.

\Box

Using this theorem, we are ready to prove Theorem 3. I will only give a sketch here, but filling in the details is not too difficult.

Assume that the conditions of Theorem 3 hold. Pick an arbitrary {t_0 \in T}, and recall that we can write

\displaystyle g(T) = \mathop{\mathbb E} \sup_{t \in T} X_t = \mathop{\mathbb E} \sup_{t \in T} (X_t - X_{t_0})

since our Gaussians are centered.

Now, by Theorem 1,

\displaystyle \mathop{\mathbb E} \max_{i=1,\ldots,m} \left(X_{t_i} - X_{t_0}\right) \geq C \alpha \sqrt{\log m}.

Suppose that {t_1} achieves this. By definition,

\displaystyle g(B(t_1, \alpha/r)) = \mathop{\mathbb E} \sup_{t \in B(t_1, \alpha/r)} \left(X_t - X_{t_1}\right),

so we could hope that for some {t \in B(t_1,\alpha/r)}, we simultaneously have {X_t - X_{t_1} \geq g(B(t_1,\alpha/r))}, yielding

\displaystyle  X_t - X_{t_0} = (X_t - X_{t_1}) + (X_{t_1} - X_{t_0}) \geq C\alpha \sqrt{\log m} + g(B(t_1, \alpha/r)). \ \ \ \ \ (4)

The problem, of course, is that the events we are discussing are not independent.

This is where Theorem 5 comes in. For any {i} and any {t \in B(t_i,\alpha/r)}, the variance of {X_t - X_{t_i}} is {d(t,t_i)^2 \leq (\alpha/r)^2}. Applying Theorem 5 to the process {\{X_t - X_{t_i} : t \in B(t_i,\alpha/r)\}}, we can therefore choose a constant {c_0 > 0} such that

\displaystyle  \mathbb P\left(\left|\sup_{t \in B(t_i,\alpha/r)} (X_t - X_{t_i}) - g(B(t_i, \alpha/r))\right| > c_0 (\alpha/r) \sqrt{\log m}\right) \leq m^{-2}.
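Taking a union bound over the {m} centers,

\displaystyle \mathbb P\left(\exists\, i \leq m : \left|\sup_{t \in B(t_i,\alpha/r)} (X_t - X_{t_i}) - g(B(t_i, \alpha/r))\right| > c_0 (\alpha/r) \sqrt{\log m}\right) \leq m \cdot m^{-2} = \frac{1}{m}.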

So, except with probability at most {1/m}, none of these suprema deviates from its expected value by more than {c_0 (\alpha/r) \sqrt{\log m}}, which means we can (morally) replace (4) by

\displaystyle  \begin{array}{rl} X_t - X_{t_0} &= (X_t - X_{t_1}) + (X_{t_1} - X_{t_0}) \\ &\geq C\alpha \sqrt{\log m} + g(B(t_1, \alpha/r)) - c_0 (\alpha/r) \sqrt{\log m}. \end{array}

But now by choosing {r = 2c_0/C}, we have {c_0 (\alpha/r) \sqrt{\log m} = \frac{C}{2}\, \alpha \sqrt{\log m}}, so the error term is absorbed into the first term, and we obtain Theorem 3 with the constant {C} replaced by {C/2}.
