Aram Harrow, Quantum Information, and Dvoretzky’s theorem

I’m delighted to announce that Aram Harrow will be joining us at UW as a visiting professor for the coming academic year. Aram is one of the young leaders in quantum information and computation. During a recent visit, Aram started to explain to me how ideas from asymptotic geometric analysis have started to become important in quantum information.

The capacity of quantum channels

A mixed quantum state on {\mathbb C^d} is represented by a density matrix {\rho}, which is {d \times d} Hermitian, positive semi-definite matrix with {\mathrm{tr}(\rho) = 1}. The von Neumann entropy of {\rho} is given by

\displaystyle  S(\rho) = - \mathrm{tr}(\rho \log \rho).

If {\mathcal M_d} is the space of {d \times d} matrices with complex entries, then a quantum channel is a mapping

\displaystyle  \Phi : \mathcal M_d \rightarrow \mathcal M_k

for some {d,k \geq 1}, which sends PSD matrices to PSD matrices and preserves the trace. Finally, the minimal output entropy of a quantum channel is

\displaystyle  S_{\min}(\Phi) = \min_{\rho} S(\Phi(\rho)),

where {\rho} ranges over all mixed states on {\mathbb C^d}. Since the entropy is a concave function, this minimum is achieved at a pure state.

Work of Shor showed that the following conjecture is equivalent to a number of long-standing problems in quantum information theory.

Additivity conjecture for minimal output von Neumann entropy: Is is true that, for every two quantum channels {\Phi_1} and {\Phi_2}, we have:

\displaystyle  S_{\min}(\Phi_1 \otimes \Phi_2) = S_{\min}(\Phi_1) + S_{\min}(\Phi_2).

Observe that we always have {\leq} above, by considering the product state {\rho_1 \otimes \rho_2} where {\rho_1} and {\rho_2} are minimizers for {\Phi_1} and {\Phi_2}, respectively. The conjecture is about whether we can get a smaller output entropy by entangling the the inputs to {\Phi_1 \otimes \Phi_2}.

Hastings’ counterexample and Dvoretzky’s theorem

Hastings proved that the conjecture is false: By appropriately choosing randomly constructed quantum channels, it is possible to achieve

\displaystyle  S_{\min}(\Phi_1 \otimes \Phi_2) < S_{\min}(\Phi_1) + S_{\min}(\Phi_2).

Remarkably, it has recently been shown by Aubrun, Szarek, and Werner that this counterexample follows from an appropriate version of Dvoretzky’s theorem on almost-Euclidean subspaces of normed spaces (for a refresher, see this earlier post on pseudorandom subspaces).

Let {\mathcal M_{k,d}} be the space of {k \times d} matrices with complex entries. We will consider two norms on matrices {M \in \mathcal M_{k,d}}: The Hilbert-Schmidt norm {\|M\|_2 = \sqrt{\mathrm{tr} |M|^2} = \sqrt{\sum_{i,j} M_{ij}^2}}, and the Schatten {4}-norm {\|M\|_4 = \left(\mathrm{tr} |M|^4\right)^{1/4}}, where {|M| = \sqrt{M^* M}}. This latter quantity is precisely the {\ell^4} norm of the singular values of {M}. Dvoretzky’s theorem tells us (qualitatively) that an appropriate random subspace of the normed space {(\mathcal M_{k,d}, \|\cdot\|_4)} will be nearly Euclidean.

More precisely, for every {k \geq 1}, a uniformly random subspace {\mathcal W \subseteq \mathcal M_{k,O(k^2)}} of dimension {\Omega(k^2)} will, with high probability, satisfy: For every {M \in \mathcal W},

\displaystyle  k^{-1/4} \|M\|_2 \leq \|M\|_4 \leq \left(1+\frac{O(1)}{k}\right) k^{-1/4} \|M\|_2,

i.e. it will be {\left(1+O(1/k)\right)}-Euclidean. This strong quantitative version follows along the lines of enhancements to Milman’s proof of Dvoretzky’s theorem.

Oscillations and chaining

The proof of the preceding quantitative bound is based on a version of Dvoretzky’s theorem for Lipschitz functions, which I will state here in the real case for simplicity. We will use {S^{n-1} \subseteq \mathbb R^n} for the {(n-1)}-dimensional unit sphere, and then for a function {f : S^{n-1} \rightarrow \mathbb R} and a subset {A \subseteq S^{n-1}}, we write

\displaystyle  \mathrm{osc}(f,A) = \sup_{x \in A} |f(x)-\mu(f)|,

where {\mu(f)} is the average value of {f} on {S^{n-1}}.

Theorem 1 If {f : S^{n-1} \rightarrow \mathbb R} is a 1-Lipschitz function, then for every {\varepsilon > 0}, if {E \subseteq \mathbb R^n} is a random subspace of dimension {\lfloor \varepsilon^2 n\rfloor}, then with high probability,

\displaystyle  \mathrm{osc}(f,E \cap S^{n-1}) \leq O(\varepsilon). \ \ \ \ \ (1)

A standard way to prove this lemma would be to use the fact that every {1}-Lipschitz function is highly concentrated on {S^{n-1}} (Levy’s lemma) and then to take a union bound over an {\varepsilon}-net on {S^{n-1}} whose size is bounded by {(1+2/\varepsilon)^n}. Unfortunately, this leads to an extra {\log(1/\varepsilon)} term on the right-hand side of (1) which, remarkably, would lead to a version of Dvoretzky’s theorem that is too weak to imply Hastings’ counterexample to the additivity conjecture.

Gordon originally proved Theorem 1 using comparison results for Gaussian processes. (In fact, we will see a main technical step called Slepian’s Lemma in our next post on majorizing measures.) Alternately, Schechtman showed that one can use concentration of measure and a more sophisticated chaining argument, of the type discussed in our previous post.

The Gödel Prize, TSP, and volume growth

Recently, Sanjeev Arora and Joe Mitchell won the Gödel prize for their work on Euclidean TSP. They show that given n points in \mathbb R^2, and a parameter \varepsilon > 0, it is possible to compute, in polynomial time, a traveling salesman tour of the input whose length is at most a factor 1+\varepsilon more than the length of the optimum tour. (This is called a polynomial-time approximation scheme, or PTAS.)

Later, Arora extended this to work in \mathbb R^k for every fixed k \geq 2. What properties of \mathbb R^k are really needed to get such an algorithm?

Certainly a key property is that the volume of a ball of radius r in \mathbb R^k only grows like O(r^k). This ensures that one can choose an \epsilon-net of size at most O(1/\epsilon)^k in a ball of radius {O(1)}, which is essential for using dynamic programming. In my opinion, this leads to the most fascinating problem left open in this area:

Is bounded volume growth the only property needed to get a PTAS?

This would imply that the use of Euclidean geometry in Arora’s algorithm is non-essential. We can state the question formally as follows. Let (X,d) be a metric space, and let \lambda(X,d) be the smallest number \lambda such that for every r > 0,every ball of radius r in X can be covered by \lambda balls of radius r/2. Is there a PTAS for TSP in X? (In other words, the running time should be bounded by a polynomial in n =|X| whose degree depends only on \lambda(X,d).)

The problem is open, though there are some partial results by Talwar.

Random tree embeddings

Random embeddings

When it is impossible to achieve a low-distortion embedding of some space {X} into another space {Y}, we can consider more lenient kinds of mappings which are still suitable for many applications. For example, consider the unweighted {n}-cycle {C_n}. It is known that any embedding of {C_n} into a tree metric incurs distortion {\Omega(n)}. On the other hand, if we delete a uniformly random edge of {C_n}, this leaves us with a random tree (actually a path) {C'_n} such that for any {x,y \in C_n}, we have

\displaystyle  d_{C'_n}(x,y) \geq d_{C_n}(x,y) \geq \frac12\, \mathop{\mathbb E}[d_{C'_n}(x,y)].

In other words, in expectation the distortion is only 2.

Let {(X,d)} be a finite metric space, and let {\mathcal F} be a family of finite metric spaces. A stochastic embedding from {X} into {\mathcal F} is a random pair {(F,Y)} where {Y \in \mathcal F} and {F : X \rightarrow Y} is a non-contractive mapping, i.e. such that {d_Y(F(x),F(y)) \geq d_X(x,y)} for all {x,y \in X}. The distortion of {F} is defined by

\displaystyle  \max_{x,y \in X} \frac{\mathop{\mathbb E} \left[d_Y(F(x),F(y))\right]}{d_X(x,y)}\,.

We will now argue that every finite metric space admits a surprisingly good stochastic embedding into tree metrics. The next result is from Fakcharoenphol, Rao, and Talwar, following work by Bartal and Alon, Karp, Peleg, and West.

Theorem 1 Every {n}-point metric space {(X,d)} admits a stochastic embedding into the family of tree metrics, with distortion {O(\log n)}.

We will need the random partitioning theorem we proved last time:

Theorem 2 For every {\Delta > 0}, there is a {\Delta}-bounded random partition {\mathcal P} of {X} which satisfies, for every {x \in X} and {0 < r \leq \Delta/8},

\displaystyle   \mathop{\mathbb P}\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \frac{8r}{\Delta} H\left(|B(x,\Delta/8)|,|B(x,\Delta)|\right). \ \ \ \ \ (1)

From partitions to trees. Before proving the theorem, we discuss a general way of constructing a tree metric from a sequence of partitions. Assume (by perhaps scaling the metric first) that {1 < d(x,y) \leq 2^M} for all {x,y \in X} with {x \neq y}. Let {P_0, P_1, \ldots, P_M} be partitions of {X}, where {P_k} is {2^k}-bounded. We will assume that {P_M = \{X\}} and {P_0} is a partition of {X} into singleton sets.



Now we inductively construct a tree metric {T = T(P_0, P_1, \ldots, P_M)} as follows. The nodes of the tree will be of the form {(k,S)} for {k \in \{0,1,\ldots,M\}} and {S \subseteq X}. The root is {(M,X)}. In general, if the tree has a node of the form {(j,S)} for {j > 0}, then {(j,S)} will have children

\displaystyle \{ (j-1, S \cap T) : T \in P_{j-1} \textrm{ and } S \cap T \neq \emptyset\}.

The length of an edge of the form {\{(j,S), (j-1,S')\}} is {2^{j}}. This specifies the entire tree {T}. We also specify a map {F : X \rightarrow T} by {F(x) = (0,\{x\})}. We leave the following claim to the reader.

Claim 1 For every {x,y \in X},

\displaystyle  d_{T}(F(x),F(y)) = 2 \sum_{k=1}^{j+1} 2^k = 2^{j+3}-4.

where {j \in \{0,1,\ldots,M\}} is the largest index with {P_j(x) \neq P_j(y)}.

Note, in particular, that {F : X \rightarrow T} is non-contracting because if {2^j < d(x,y) \leq 2^{j+1}} for some {j \geq 0}, then {P_j(x) \neq P_j(y)} since {P_j} is {2^{j}}-bounded, implying that {d_{T}(F(x),F(y)) \geq 2^{j+3}-4 \geq 2^{j+1}}.

Now we are ready to prove Theorem 1.

Proof: Again, assume that {1 < d(x,y) \leq 2^M} for all {x,y \in X}. For {0 < k < M}, let {\mathcal P_k} be the {2^k}-bounded random partition guaranteed by Theorem 2. Let {\mathcal P_0} be a partition into singletons, and let {\mathcal P_M = \{X\}}. Finally, let {\mathcal T=\mathcal T(\mathcal P_0, \ldots, \mathcal P_M)} be the tree constructed above, and let {F : X \rightarrow \mathcal T} be the corresponding (random) non-contractive mapping.

Now, fix {x,y \in X} and an integer {h} such that {2^h < d(x,y) \leq 2^{h+1}}. Using Claim 1 and (1), we have,

\displaystyle  \mathop{\mathbb E}\left[d_{\mathcal T}(F(x),F(y))\right] \leq \sum_{j=0}^M \mathop{\mathbb P}[\mathcal P_j(x) \neq \mathcal P_j(y)] \cdot 2^{j+3}

    \displaystyle  \leq O(d(x,y)) + \sum_{j=h+3}^M \frac{8\,d(x,y)}{2^j} 2^{j+3} H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

    \displaystyle  \leq O(d(x,y)) + \sum_{j=h+3}^M \frac{8\,d(x,y)}{2^j} 2^{j+3} H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

   \displaystyle  \leq O(d(x,y)) + O(1) \,d(x,y) \sum_{j=h+3}^M H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

   \displaystyle  \leq O(H(1,n))\,d(x,y)

   \displaystyle  \leq O(\log n)\,d(x,y),

where in the penultimate line, we have evaluated three disjoint telescoping sums: For any numbers {1 \leq a_1 \leq a_2 \leq \cdots \leq a_k},

\displaystyle H(a_1,a_2)+H(a_2,a_3)+\cdots+H(a_{k-1},a_k)=H(a_1,a_k).


Since every tree embeds isometrically into {\ell_1}, this offers an alternate proof of Bourgain’s theorem when the target space is {\ell_1}. Since we know that expander graphs require {\Omega(\log n)} distortion into {{\ell}_1}, this also shows that Theorem 1 is asymptotically tight.

Random partitions of metric spaces

In addition to the current sequence on Talagrand’s majorizing measures theory, I’m also going to be putting up a series of lecture notes about embeddings of finite metric spaces. The first few will concern random partitions of metric spaces, and their many applications.

Random partitions

Often when one is confronted with the problem of analyzing some problem on a metric space {(X,d)}, a natural way to proceed is by divide and conquer: Break the space into small pieces, do something (perhaps recursively) on each piece, and then glue these local solutions together to achieve a global result. This is certainly an important theme in areas like differential geometry, harmonic analysis, and computational geometry.

In the metric setting, “small” often means pieces of small diameter. Of course we could just break the space into singletons {X = \{x_1\} \cup \{x_2\} \cup \cdots}, but this ignores the second important aspect in a “divide and conquer” approach: what goes on at the boundary. If we are going to combine our local solutions together effectively, we want the “boundary” between distinct pieces of the partition to be small. Formalizing this idea in various ways leads to a number of interesting ideas which I’ll explore in a sequence of posts.

Example: The unit square

Consider the unit square {[0,1]^2} in the plane, equipped with the {\ell_1} metric. If we break the square into pieces of side length {\delta/2}, then every piece has diameter at most {\delta}, and a good fraction of the space is far from the boundary of the partition. In fact, it is easy to see that e.g. 60% of the measure is at least distance {\delta/20} from the boundary (the red dotted line).


But there is a problem with using this as a motivating example. First, observe that we had both a metric structure (the {\ell_1} distance) and a measure (in this case, the uniform measure on {[0,1]^2}). In many potential applications, there is not necessarily a natural measure to work with (e.g. for finite metric spaces, where counting the number of points is often a poor way of measuring the “size” of various pieces).

To escape this apparent conundrum, note that the uniform measure is really just a distraction: The same result holds for any measure on {[0,1]^2}. This is a simple application of the probabilistic method: If we uniformly shift the {\delta/2}-grid at random, then for any point {x \in [0,1]^2}, we have

\displaystyle {\mathbb P}(x \textrm{ is } \delta/20\textrm{-far from the boundary}) \geq 0.6.

Thus, by averaging, for any measure {\mu} on {[0,1]^2} there exists a partition where 60% of the {\mu}-measure is {\delta/20}-far from the boundary. This leads us to the idea that, in many situations, instead of talking about measures, it is better to talk about distributions over partitions. In this case, we want the “boundary” to be small on “average.”

Lipschitz random partitions

We will work primarily with finite metric spaces to avoid having to deal with continuous probability spaces; much of the theory carries over to general metric spaces without significant effort (but the development becomes less clean, as this requires certain measurability assumptions in the theorem statements).

Let {(X,d)} be a finite metric space. If {P} is a partition of {X}, and {x \in X}, we will write {P(x)} for the unique set in {P} containing {x}. We say that {P} is {\Delta}-bounded if {\mathrm{diam}(S) \leq \Delta} for all {S \in P}. We will also say that a random partition {\mathcal P} is {{\Delta}}-bounded if it is supported only on {\Delta}-bounded partitions of {X}.

Let’s now look at one way to formalize the idea that the “boundary” of a random partition {\mathcal P} is small.

A random partition {\mathcal P} of {X} is {{L}}-Lipschitz if, for every {x,y \in X},

\displaystyle \mathbb P\left(\mathcal P(x) \neq \mathcal P(y)\right) \leq L \cdot d(x,y).

Intuitively, the boundary is small if nearby points tend to end up in the same piece of the partition. There is a tradeoff between a random partition {\mathcal P} being {\Delta}-bounded and {{L}}-Lipschitz. As {\Delta} increases, we expect that we can make {{L}}, and hence the “boundary effect,” smaller and smaller. The following theorem can be derived from work of Leighton and Rao, or Linial and Saks. The form stated below comes from work of Bartal.

Theorem 1

If {(X,d)} is an {n}-point metric space and {n \geq 2}, then for every {\Delta > 0}, there is an {\frac{O(\log n)}{\Delta}}-Lipschitz, {\Delta}-bounded random partition {\mathcal P} of {X}.

We will prove a more general fact that will be essential later. For positive numbers {a,b}, define {H(a,b) = \sum_{i=a+1}^b \frac{1}{i}} (note that {H(a,a)=0}). The next theorem and proof are from Calinescu, Karloff, and Rabani.

Theorem 2

For every {\Delta > 0}, there is a {\Delta}-bounded random partition {\mathcal P} of {X} which satisfies, for every {x \in X} and {0 < r \leq \Delta/8},

\displaystyle {\mathbb P}\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \frac{8r}{\Delta} H\left(|B(x,\Delta/8)|,|B(x,\Delta)|\right). \ \ \ \ \ (1)

Observe that Theorem 1 follows from Theorem 2 by setting {r = d(x,y)}, and noting that {\mathcal P(x) \neq \mathcal P(y)} implies {B(x,r) \nsubseteq \mathcal P(x)}. We also use {H(1,n) = O(\log n)} for {n \geq 2}.

Suppose that {|X|=n}. Let {\alpha \in [\frac14, \frac12]} be chosen uniformly at random, and let {\pi} be a uniformly random bijection {\pi : \{1,2,\ldots,n\} \rightarrow X}. We will think of {\pi} as giving a random ordering of {X}. We define our random partition {\mathcal P} as follows.

For {i \geq 1}, define

\displaystyle C_i = B(\pi(i), \alpha \Delta) \setminus \bigcup_{j=1}^{i-1} C_j.

It is straightforward that {\mathcal P = C_1 \cup C_2 \cup \cdots \cup C_n} is a partition of X, with perhaps many of the sets {C_i} being empty. Furthermore, by construction, {\mathcal P} is {\Delta}-bounded. Thus we are left to verify (1).

To this end, fix {x \in X} and {r > 0}, and enumerate the points of {X = \{y_1, y_2, \ldots, y_n\}} so that {d(x,y_1) \leq d(x,y_2) \leq \cdots \leq d(x,y_n)}. Let {B = B(x,r)}. We will say that a point {y_i} sees {B} if {d(y_i, x) \leq \alpha \Delta + r}, and we will say that {y_i} cuts {B} if {\alpha \Delta - r\leq d(y_i, x) \leq \alpha \Delta + r}.
(In the picture below, y_1 and y_2 see B, while y_3 does not. Only y_2 cuts B.)

Observe that for any {y \in X},

\displaystyle \mathbb P\left(y \textrm{ cuts } B\right) = \mathbb P\left(\alpha \Delta \in [d(x,y)-r,d(x,y)+r]\right) \leq \frac{2r}{\Delta/4} = \frac{8r}{\Delta}.\ \ \ \ \ (2)

Using this language, we can now reveal our analysis strategy: Let {y_{\min}} be the minimal element (according to the ordering {\pi}) which sees {B}. Then {B(x,r) \nsubseteq \mathcal P(x)} can only occur if {y_{\min}} also cuts {B}. The point is that the fate of {B(x,r)} is not decided until some point sees {B}. Then, if {y_{\min}} does not cut {B}, we have {B(x,r) \subseteq C_{\pi^{-1}(y_{\min})}}, hence {B(x,r) \subseteq \mathcal P(x)}.

Thus we can write,

\displaystyle \mathbb P\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \mathbb P\left(y_{\min} \textrm{ cuts } B \right) \leq \sum_{i=1}^n \mathbb P\left(y_{\min}=y_i \textrm{ and } y_i \textrm{ cuts } B\right).\ \ \ \ \ (3)

To analyze the latter sum, first note that if {y \notin B(x,\Delta/2+r)}, then {y} can never see {B} since {\alpha \Delta \leq \Delta/2} always. On the other hand, if {y \in B(x,\Delta/4-r)} then {y} always sees {B}, but can never cut {B} since {\alpha \Delta \geq \Delta/4} always.
Recalling that {r \leq \Delta/8} by assumption, and setting {a=|B(x,\Delta/8)|} and let {b=|B(x,\Delta)|}, we can use (3) to write

\displaystyle \mathbb P\left(B(x,r) \nsubseteq \mathcal P(x)\right)\leq \sum_{i=a+1}^b \mathbb P\left(y_{\min}=y_i \textrm{ and } y_i \textrm{ cuts }B\right). \ \ \ \ \ (4)

Now we come to the heart of the analysis: For any {i \geq 1},

\displaystyle \mathbb P\left(y_{\min} = y_i\,|\, y_i \textrm{ cuts } B\right) \leq \frac{1}{i}.\ \ \ \ \ (5)

The idea is that if {y_i} cuts {B}, then {\alpha \Delta \geq d(x,y_i)-r}. Thus if any {y_j} for {j < i} comes before {y_i} in the {\pi}-ordering, then {y_j} also sees {B} since {d(x,y_j) \leq d(x,y_i)}, hence {y_i \neq y_{\min}}. But the probability that {y_i} is the {\pi}-minimal element of {\{y_1, \ldots, y_i\}} is precisely {\frac{1}{i}}, proving (5).

To finish the proof, we combine (2), (4), and (5), yielding

\displaystyle \mathbb P\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \sum_{i=a+1}^b \mathbb P\left(y_i \textrm{ cuts } B\right) \cdot \mathbb P\left(y_i = y_{\min}\,|\, y_i \textrm{ cuts } B\right) \leq  \frac{8r}{\Delta} \sum_{i=a+1}^b \frac{1}{i}\,.


Tightness for expanders. Before ending this section, let us mention that Theorem 1 is optimal up to a constant factor. Let {\{G_k\}} be a family of degree-3 expander graphs, equipped with their shortest-path metric {d_k}. Assume that {|E(S, \bar S)| \geq c|S|} for some {c > 0} and all {|S| \leq \frac12 |G_k|}. Let {\Delta_k = \frac{\log |G_k|}{10}}, and suppose that {G_k} admits a {\Delta_k}-bounded {{L}}-Lipschitz random partition. By averaging, this means there is a fixed {\Delta_k}-bounded partition {P} of {G_k} which cuts at most an {{L}}-fraction of edges. But every {S \in P} has {\mathrm{diam}(S) \leq \Delta_k}, which implies {|S| < n/2}. Hence the partition must cut an {\Omega(1)} fraction of edges, implying {L = \Omega(1)}.

In the next post, we’ll see how these random partitions allow us to reduce may questions on finite metric spaces to questions on trees.

The generic chaining

In the last post, we considered a Gaussian process {\{X_t\}_{t \in T}} and were trying to find upper bounds on the quantity {\mathop{\mathbb E}\sup_{t \in T} X_t}. We saw that one could hope to improve over the union bound by clustering the points and then taking mini union bounds in each cluster.

Hierarchical clustering

To specify a clustering, we’ll take a sequence of progressively finer approximations to our set {T}. First, recall that we fixed {t_0 \in T}, and we have used the observation that {\mathop{\mathbb E}\sup_{t \in T} X_t = \mathop{\mathbb E}\sup_{t \in T} (X_t-X_{t_0})}.

Now, assume that {T} is finite. Write {T_0 = \{t_0\}}, and consider a sequence of subsets {\{T_n\}} such that {T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots \subseteq T}. We will assume that for some large enough {m}, we have {T_n = T} for {n \geq m}. For every {n \geq 0}, let {\pi_n : T \rightarrow T_n} denote a “closest point map” which sends t \in T to the closest point in T_n.

The main point is that we can now write, for any {t \in T},

\displaystyle   X_t - X_{t_0} = \sum_{n \geq 1} X_{\pi_n(t)} - X_{\pi_{n-1}(t)}. \ \ \ \ \ (1)

This decomposition is where the term “chaining” arises, and now the idea is to bound the probability that {X_t - X_{t_0}} is large in terms of the segments in the chain.

What should {T_n} look like?

One question that arises is how we should think about choosing the approximations {T_n}. We are trading off two measures of quality: The denser {T_n} is in the set {T} (or, more precisely, in the set {T_{n-1}}) the smaller the variances of the segments {X_{\pi_n(t)}-X_{\pi_{n-1}(t)}} will be. On the other hand, the larger {T_n} is, the more segments we’ll have to take a union bound over.

So far, we haven’t used any property of our random variables except for the fact that they are centered. To make a more informed decision about how to choose the sets {\{T_n\}}, let’s recall the classical Gaussian concentration bound.

Lemma 1 For every {s,t \in T} and {\lambda > 0},

\displaystyle   \mathop{\mathbb P}(X_s - X_t > \lambda) \leq \exp\left(-\frac{\lambda^2}{2\, d(s,t)^2}\right). \ \ \ \ \ (2)

This should look familiar: {X_s-X_t} is a mean-zero Gaussian with variance {d(s,t)^2}.

Now, a first instinct might be to choose the sets {T_n} to be progressively denser in {T}. In this case, a natural choice would be to insist on something like {T_n} being a {2^{-n}}-net in {T}. If one continues down this path in the right way, a similar theory would develop. We’re going to take a different route and consider the other side of the tradeoff.

Instead of insisting that {T_n} has a certain level of accuracy, we’ll insist that {T_n} is at most a certain size. Should we require {|T_n| \leq n} or {|T_n| \leq 2^n}, or use some other function? To figure out the right bound, we look at (2). Suppose that {g_1, g_2, \ldots, g_m} are i.i.d. {N(0,1)} random variables. In that case, applying (2) and a union bound, we see that to achieve

\displaystyle  \mathop{\mathbb P}(\exists i : g_i > B) \leq m \mathop{\mathbb P}(g_1 > B) < 1,

we need to select {B \asymp \sqrt{\log m}}. If we look instead at {m^2} points instead of {m} points, the bound grows to {\sqrt{2 \log m}}. Thus we can generally square the number of points before the union bound has to pay a constant factor increase. This suggests that the right scaling is something like {|T_{n+1}| = |T_n|^2}. So we’ll require that {|T_n| \leq 2^{2^n}} for all {n \geq 1}.

The generic chaining

This leads us to the generic chaining bound, due to Fernique (though the formulation we state here is from Talagrand).

Theorem 2 Let {\{X_t\}_{t \in T}} be a Gaussian process, and let {T_0 \subseteq T_1 \subseteq \cdots \subseteq T} be a sequence of subsets such that {|T_0|=1} and {|T_n| \leq 2^{2^{n}}} for {n \geq 1}. Then,

\displaystyle   \mathop{\mathbb E}\sup_{t \in T} X_t \leq O(1) \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} d(t, T_n). \ \ \ \ \ (3)

Proof: As before, let {\pi_n : T \rightarrow T_n} denote the closest point map and let {T_0 = \{t_0\}}. Using (2), for any {n \geq 1}, {t \in T}, and {u > 0}, we have

\displaystyle  \mathop{\mathbb P}\left(|X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right) \leq \exp\left(-\frac{u^2}{2} 2^n\right).

Now, the number of pairs {(\pi_n(t),\pi_{n-1}(t))} can be bounded by {|T_n| \cdot |T_{n-1}| \leq 2^{2^{n+1}}}, so we have

\displaystyle   \mathop{\mathbb P}\left(\exists t : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right) \leq 2^{2^{n+1}} \exp\left(-\frac{u^2}{2} 2^n\right). \ \ \ \ \ (4)

If we define the event

\displaystyle  \Omega_u = \left\{ \forall n \geq 1, t \in T : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| \leq u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right\},

then summing (4) yields,

\displaystyle   \mathop{\mathbb P}(\overline{\Omega_u}) \leq \sum_{n \geq 1} 2^{2^{n+1}} \exp\left(-\frac{u^2}{2} 2^n\right) \leq O(1)\, e^{-u^2} \ \ \ \ \ (5)

for {u \geq 4}, since we get geometrically decreasing summands.


\displaystyle  S = \sup_{t \in T} \sum_{n \geq 1} 2^{n/2} d(\pi_n(t), \pi_{n-1}(t)).

Note that if {\Omega_u} occurs, then {\sup_{t \in T} (X_t - X_{t_0}) \leq uS}. Thus (5) implies that for u \geq 4,

\displaystyle  \mathop{\mathbb P}(\sup_{t \in T} X_t - X_{t_0} > uS) \leq O(1) \, e^{-u^2},

which implies that

\displaystyle   \mathop{\mathbb E} \sup_{t \in T} X_t \leq O(S) \leq O(1) \sup_{t \in T} \sum_{n \geq 1} 2^{n/2} d(\pi_n(t), \pi_{n-1}(t)). \ \ \ \ \ (6)

Finally, by the triangle inequality,

\displaystyle d(\pi_n(t), \pi_{n-1}(t)) \leq d(t, T_n) + d(t, T_{n-1}) \leq 2\,d(t,T_{n-1}).

Plugging this into (6) recovers (3). \Box

Theorem 1.2 gives us a fairly natural way to upper bound the expected supremum using a hierarchical clustering of {T}. Rather amazingly, as we’ll see in the next post, this upper bound is tight. Talagrand’s majorizing measure theorem states that if we take the best choice of {\{T_n\}} in Theorem 1.2, then the upper bound in (3) is within a constant factor of {\mathop{\mathbb E} \sup_{t \in T} X_t}.

Majorizing measures

At STOC 2010 last week, Talagrand gave a presentation on some of his favorite open problems, which included a quick review of Gaussian processes and the majorizing measures theory. In joint work with Jian Ding and Yuval Peres, we recently showed how the cover time of graphs can be characterized by majorizing measures.

While I’ll eventually try to give an overview of this connection, I first wanted to discuss how majorizing measures are used to control Gaussian processes. In the next few posts, I’ll attempt to give an idea of how this works. I have very little new to offer over what Talagrand has already written; in particular, I will be borrowing quite heavily from Talagrand’s book (which you might read instead).

Gaussian processes

Consider a Gaussian process {\{X_t\}_{t \in T}} for some index set {T}. This is a collection of jointly Gaussian random variables, meaning that every finite linear combination of the variables has a Gaussian distribution. We will additionally assume that the process is centered, i.e. {\mathbb E(X_t) = 0} for all {t \in T}.

It is well-known that such a process is completely characterized by the covariances {\{\mathop{\mathbb E}(X_s X_t)\}_{s,t \in T}}. For {s,t \in T}, consider the canonical distance,

\displaystyle  d(s,t) = \sqrt{\mathbb E\,|X_s-X_t|^2},

which forms a metric on {T}. (Strictly speaking, this is only a pseudometric since possibly {d(s,t)=0} even though {X_s} and {X_t} are distinct random variables, but we’ll ignore this.) Since the process is centered, it is completely specified by the distance {d(s,t)}, up to translation by a Gaussian (e.g. the process {\{X_t + X_{t_0}\}_{t \in T}} will induce the same distance for any {t_0 \in T}).

A concrete perspective

If the index set {T} is countable, one can describe every such process in the following way. Let {\{g_i\}_{i=1}^{\infty}} be a sequence of i.i.d. standard Gaussians, let {T \subseteq \ell^2}, and put

\displaystyle  X_t = \sum_{i \geq 1} g_i t_i.

In this case, it is easy to check that {d(s,t) = \|s-t\|_2} for {s,t \in T}. (That this construction is universal follows from the fact that every two separable Hilbert spaces are isomorphic.)

Random projections

If {T} is finite, then we can think of {T \subseteq \mathbb R^n} for some {n \in \mathbb N}. In this case, if {g} is a standard {n}-dimensional Gaussian, then

\displaystyle  X_t = \langle g, t \rangle,

and we can envision the process as the projection of {T} onto a uniformly random direction.

Studying the maxima

We will be concerned primarily with the value,

\displaystyle \mathbb E \sup_{t \in T} X_t.

(I.e. the expected value of the extremal node circled above.) One may assume that {T} is finite without losing any essential ingredient of the theory, in which case the supremum can be replaced by a maximum. Note that we are studying the tails of the process. Dealing with these extremal values is what makes understanding the above quantity somewhat difficult.

As some motivation for the classical study of this quantity, one has the following.

Theorem 1 For a separable Gaussian process {\{X_t\}_{t \in T}}, the following two assertions are equivalent.

  1. The map {t \mapsto X_t(\omega)} is uniformly continuous (as a map from {(T,d)} to {\mathbb R)} with probability one.
  2. As {\varepsilon \rightarrow 0},

    \displaystyle \mathop{\mathbb E}\sup_{d(s,t) \leq \varepsilon} |X_s-X_t| \rightarrow 0.

However, from our viewpoint, the quantitative study of {\mathbb E\sup_{t \in T} X_t} in terms of the geometry of {(T,d)} will play the fundamental role.

Bounding the sup

We will concentrate first on finding good upper bounds for {\mathop{\mathbb E}\sup_{t \in T} X_t}. Toward this end, fix some {t_0 \in T}, and observe that

\displaystyle  \mathop{\mathbb E}\sup_{t \in T} X_t = \mathop{\mathbb E}\sup_{t \in T} (X_t - X_{t_0}).

Since {\sup_{t \in T} (X_t - X_{t_0})} is a non-negative random variable, we can write

\displaystyle  \mathop{\mathbb E} \sup_{t \in T} (X_t - X_{t_0}) = \int_{0}^{\infty} \mathop{\mathbb P}\left(\sup_{t \in T} X_t - X_{t_0} > u\right)\,du,

and concentrate on finding upper bounds on the latter probabilities.

Improving the union bound

As a first step, we might write

\displaystyle  \mathop{\mathbb P}\left(\sup_{t \in T} X_t - X_{t_0} > u\right) \leq \sum_{t \in T} \mathop{\mathbb P}\left(X_t - X_{t_0} > u\right).

While this bound is decent if the variables {\{X_t-X_{t_0}\}_{t \in T}} are somewhat independent, it is rather abysmal if the variables are clustered.

Since the variables in, e.g. {S_1}, are highly correlated (in the “geometric” language, they tend to project close together on a randomly chosen direction), the union bound is overkill. It is natural to choose representatives {t_1 \in S_1} and {t_2 \in S_2}. We can first bound {X_{t_1}-X_{t_0}} and {X_{t_2}-X_{t_0}}, and then bound the intra-cluster values {\{X_t - X_{t_1}\}_{t \in S_1}} and {\{X_t - X_{t_2}\}_{t \in S_2}}. This should yield better bounds as the diameter of {S_1} and {S_2} are hopefully significantly smaller than the diameter of {T}.

Formally, we have

\displaystyle  \mathop{\mathbb P}\left(\sup_{t \in T} X_t - X_{t_0} > u\right) \leq

\displaystyle  \mathop{\mathbb P}(X_{t_1} - X_{t_0} > u/2) + \sum_{t \in S_1} \mathop{\mathbb P}(X_t - X_{t_1} > u/2)

\displaystyle  + \mathop{\mathbb P}(X_{t_2} - X_{t_0} > u/2) + \sum_{t \in S_2} \mathop{\mathbb P}(X_t - X_{t_2} > u/2).

Of course, there is no reason to stop at one level of clustering, and there is no reason that we should split the contribution {u = u/2 + u/2} evenly. In the next post, we’ll see the “generic chaining” method which generalizes and formalizes our intuition about improving the union bound. In particular, we’ll show that every hierarchical clustering of our points offers some upper bound on \mathbb E \sup_{t \in T} X_t.