The need for non-linear mappings

In the last post, I recalled the problem of dimension reduction for finite subsets of L_1.  I thought I should mention the main obstacle to reducing the dimension below O(n) for n-point subsets:  It can’t be done with linear mappings.

All the general results mentioned in that post use a linear mapping.  In fact, they are all of the form:

  1. Change of density, i.e. preprocess the points/subspace so that no point has too much weight on any one coordinate.
  2. Choose a subset of the coordinates, possibly multiplying the chosen coordinates by non-negative weights.  (Note that the Newman-Rabinovich result, based on Batson, Spielman, and Srivastava, is deterministic, while in the other bounds, the sampling is random.)

(There is also a known non-linear dimension reduction, but it applies only to special subsets of L_1, like the Brinkman-Charikar point set.)

The next theorem shows that linear dimension reduction mappings cannot do better than O(n) dimensions.

Theorem: For every 1 \leq p \leq \infty, there are arbitrarily large n-point subsets of L_p on which any linear embedding into L_2 incurs distortion at least \left(\frac{n-1}{2}\right)^{|1/p-1/2|}.

Since the identity map from \ell_1^d to \ell_2^d has distortion \sqrt{d}, this theorem immediately implies that there are n-point subsets on which any linear embedding into \ell_1^d with O(1) distortion requires d = \Omega(n): composing such an embedding with the identity map gives a linear embedding into L_2 with distortion O(\sqrt{d}), which by the p=1 case of the theorem must be at least \sqrt{(n-1)/2}.  The p=1 case of the preceding theorem was proved by Charikar and Sahai.  A simpler proof, which extends to all p \geq 1, is given in Lemma 3.1 of a paper by myself, Mendel, and Naor.
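The \sqrt{n} distortion of the identity map is easy to see numerically: the ratio \|x\|_1/\|x\|_2 is 1 on standard basis vectors and \sqrt{n} on the all-ones vector, and every vector lies between these extremes. A quick pure-Python sanity check (the dimension n = 16 is an arbitrary choice of mine):

```python
import math
import random

n = 16
random.seed(0)

def l1(x):
    return sum(abs(t) for t in x)

def l2(x):
    return math.sqrt(sum(t * t for t in x))

# Extreme vectors for the ratio ||x||_1 / ||x||_2:
e1 = [1.0] + [0.0] * (n - 1)   # standard basis vector: ratio 1
ones = [1.0] * n               # all-ones vector: ratio sqrt(n)
print(l1(e1) / l2(e1))                                   # 1.0
print(abs(l1(ones) / l2(ones) - math.sqrt(n)) < 1e-12)   # True

# By Cauchy-Schwarz, ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2 for every x,
# so the identity map from l_1^n to l_2^n has distortion exactly sqrt(n).
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(n)]
    r = l1(x) / l2(x)
    assert 1 - 1e-12 <= r <= math.sqrt(n) + 1e-12
print("ok")
```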

Open problem: Dimension reduction in L_1

Since I’ve been interacting a lot with the theory group at MSR Redmond (see the UW-MSR Theory Center), I’ve been asked occasionally to propose problems in the geometry of finite metric spaces that might be amenable to probabilistic tools. Here’s a fundamental problem that’s wide open. Let {c_D(n)} be the smallest number such that every {n}-point subset of {L_1} embeds into {\ell_1^{c_D(n)}} with distortion at most {D}. Here’s what’s known.

  1. Talagrand (following work of Bourgain-Lindenstrauss-Milman and Schechtman) proved that for every {\varepsilon > 0}, every {n}-dimensional subspace of {L_1} admits a {(1+\varepsilon)}-distortion embedding into {\ell_1^{d}} with {d = O((n \log n)/\varepsilon^2)}. In particular, this gives

    \displaystyle c_{1+\varepsilon}(n) = O((n \log n)/\varepsilon^2).

  2. Brinkman and Charikar showed that {c_D(n) \geq \Omega(n^{1/D^2})} for {D \geq 1}. A significantly simpler proof was later given by Assaf Naor and myself. (With Brinkman and Karagiozova, we have also shown that this bound is tight for the Brinkman-Charikar examples and their generalizations.)
  3. Recently, Newman and Rabinovich showed that one can take {c_{1+\varepsilon}(n) = O(n/\varepsilon^2)} for any {\varepsilon > 0}. Their paper relies heavily on the beautiful spectral sparsification method of Batson, Spielman, and Srivastava. In fact, it is shown that one can use only {O(n/\varepsilon^2)} weighted cuts (see the paper for details). This also hints at a limitation of their technique, since it is easy to see that the metric on {\{1,2,\ldots,n\} \subseteq \mathbb R} requires {\Omega(n)} cuts for a constant distortion embedding (and obviously only one dimension).

The open problem is to get better bounds. For instance, we only know that

\displaystyle  \Omega(n^{1/100}) \leq c_{10}(n) \leq O(n).

There is evidence that n^{\Theta(1/D^2)} might be the right order of magnitude.  In the large distortion regime, when D = \Omega(\sqrt{\log n} \log \log n), results of Arora, myself, and Naor show that c_D(n) = O(\log n).

Hypercontractivity and its Applications

My student Punya Biswal just completed this great survey on hypercontractivity and its applications in computer science. A PDF version and accompanying slides are available from his home page.

Hypercontractive inequalities are a useful tool in dealing with extremal questions in the geometry of high-dimensional discrete and continuous spaces. In this survey we trace a few connections between different manifestations of hypercontractivity, and also present some relatively recent applications of these techniques in computer science.

1. Preliminaries and notation

Fourier analysis on the hypercube. We define the inner product {\langle f,g \rangle = \mathop{\mathbb E}_{x}f(x)g(x)} on functions {f,g \colon \{-1,1\}^{n} \rightarrow {\mathbb R}}, where the expectation is taken over the uniform probability measure on {\{-1,1\}^n}. The multilinear polynomials {\chi_{S}(x)=\prod_{i\in S}x_{i}} (where {S} ranges over subsets of {[n]}) form an orthonormal basis under this inner product; they are called the Fourier basis. Thus, for any function {f \colon \{-1,1\}^{n}\rightarrow{\mathbb R}}, we have {f = \sum_{S\subseteq[n]}\hat{f}(S)\chi_{S}}, where the Fourier coefficients {\hat{f}(S)=\langle f,\chi_{S}\rangle} obey Parseval’s relation {\sum_{S\subseteq[n]}\hat{f}(S)^{2}=\langle f,f\rangle=\mathop{\mathbb E}_{x}f(x)^{2}}. It is easy to verify that {\mathop{\mathbb E}_{x}f(x)=\hat{f}(\emptyset)} and {\textsf{Var}_{x}f(x)=\sum_{S\neq\emptyset}\hat{f}(S)^{2}}.
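These identities can be verified exactly on a small cube. The sketch below (pure Python; the helper names and the choice of majority as test function are mine, not from the survey) computes all Fourier coefficients of a function on {\{-1,1\}^3} and checks Parseval:

```python
from itertools import product

n = 3
cube = list(product([-1, 1], repeat=n))  # the hypercube {-1,1}^n

def f(x):
    # An arbitrary test function: majority on 3 bits.
    return 1.0 if sum(x) > 0 else -1.0

def chi(S, x):
    # Fourier character chi_S(x) = prod_{i in S} x_i.
    out = 1
    for i in S:
        out *= x[i]
    return out

subsets = [tuple(i for i in range(n) if (m >> i) & 1) for m in range(2 ** n)]

# Fourier coefficient: \hat f(S) = E_x[f(x) chi_S(x)] under the uniform measure.
fhat = {S: sum(f(x) * chi(S, x) for x in cube) / len(cube) for S in subsets}

# Parseval: sum_S \hat f(S)^2 = E_x[f(x)^2].
lhs = sum(c ** 2 for c in fhat.values())
rhs = sum(f(x) ** 2 for x in cube) / len(cube)
print(abs(lhs - rhs) < 1e-12)  # True

# E_x f(x) = \hat f(empty set); majority is odd, so its mean is 0.
print(abs(fhat[()]) < 1e-12)   # True
```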

Norms. For {1\leq p<\infty}, define the {\ell_{p}} norm {\|f\|_{p}=(\mathop{\mathbb E}_{x}|f(x)|^{p})^{1/p}}. These norms are monotone in {p}: for every function {f}, {p\geq q} implies {\|f\|_{p}\geq\|f\|_{q}}. For a linear operator {M} carrying functions {f \colon \{-1,1\}^{n}\rightarrow{\mathbb R}} to functions {Mf=g \colon \{-1,1\}^{n}\rightarrow{\mathbb R}}, we define the {p}-to-{q} operator norm {\|M\|_{p\rightarrow q}=\sup_{f}\|Mf\|_{q}/\|f\|_{p}}. {M} is said to be a contraction from {\ell_{p}} to {\ell_{q}} when {\|M\|_{p\rightarrow q}\leq1}. Because of the monotonicity of norms, a contraction from {\ell_{p}} to {\ell_{p}} is automatically a contraction from {\ell_{p}} to {\ell_{q}} for any {q<p}. When {q>p} and {\|M\|_{p\rightarrow q}\leq1}, then {M} is said to be hypercontractive.
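The monotonicity of these norms (a consequence of Jensen's inequality, since the underlying measure is a probability measure) is easy to confirm numerically. A minimal sketch, with arbitrary random function values of my own choosing:

```python
import random

random.seed(1)
N = 8  # size of the uniform probability space, e.g. {-1,1}^3
vals = [random.uniform(-2, 2) for _ in range(N)]  # values of a function f

def norm(p):
    # ||f||_p = (E |f|^p)^(1/p) under the uniform probability measure.
    return (sum(abs(v) ** p for v in vals) / N) ** (1.0 / p)

# Monotonicity: p >= q implies ||f||_p >= ||f||_q.
ps = [1, 1.5, 2, 3, 4, 8]
norms = [norm(p) for p in ps]
print(all(a <= b + 1e-12 for a, b in zip(norms, norms[1:])))  # True
```

Note that monotonicity depends on the normalization by the expectation; for counting-measure norms the inequality would reverse.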

Convolution operators. Letting {xy} represent the coordinatewise product of {x, y \in \{-1,1\}^n}, we define the convolution {(f*g)(x)=\mathop{\mathbb E}_{y}f(y)g(xy)} of two functions {f,g \colon \{-1,1\}^{n}\rightarrow{\mathbb R}}, and note that it is a linear operator {f\mapsto f*g} for every fixed {g}. Convolution is commutative and associative, and the Fourier coefficients of a convolution satisfy the useful property {\widehat{f*g}=\hat{f}\hat{g}}. We shall be particularly interested in the convolution properties of the following functions:

  • The Dirac delta {\delta \colon \{-1,1\}^{n}\rightarrow{\mathbb R}}, given by {\delta(1,\dotsc,1)=2^{n}} and {\delta(x)=0} otherwise (the factor {2^n} compensates for the expectation in the definition of convolution). It is the identity for convolution and has {\hat{\delta}(S)=1} for all {S\subseteq[n]}.
  • The edge functions {h_{i} \colon \{-1,1\}^{n}\rightarrow{\mathbb R}} given by

    \displaystyle h_{i}(x)= \begin{cases} \phantom{-}2^{n-1} & x=(1,\dotsc,1)\\ -2^{n-1} & x_{i}=-1,\ x_{[n]\setminus\{i\}}=(1,\dotsc,1)\\ \phantom{-}0 & \text{otherwise.} \end{cases}

    {\hat{h}_{i}(S)} is {1} or {0} according as {S} does or does not contain {i}. For any function {f \colon \{-1,1\}^{n}\rightarrow{\mathbb R}}, {(f*h_{i})(x)=(f(x)-f(y))/2}, where {y} is obtained from {x} by flipping just the {i}th bit. Convolution with {h_{i}} acts as an orthogonal projection (as we can easily see in the Fourier domain), so for any functions {f,g \colon \{-1,1\}^{n}\rightarrow{\mathbb R}}, we have {\langle f*h_{i},g\rangle=\langle f,h_{i}*g\rangle=\langle f*h_{i},g*h_{i}\rangle}.

  • The Bonami-Gross-Beckner noise functions {\textsf{BG}_{\rho} \colon \{-1,1\}^{n}\rightarrow{\mathbb R}} for {0\leq\rho\leq1}, where {\widehat{\textsf{BG}}_{\rho}(S)=\rho^{|S|}} and we define {0^{0}=1}. These operators form a semigroup, because {\textsf{BG}_{\sigma}*\textsf{BG}_{\rho}=\textsf{BG}_{\sigma\rho}} and {\textsf{BG}_{1}=\delta}. Note that {\textsf{BG}_{\rho}(x)=\sum_{S}\rho^{|S|}\chi_{S}(x)=\prod_{i}(1+\rho x_{i})}. We define the noise operator {T_{\rho}} acting on functions on the discrete cube by {T_{\rho}f=\textsf{BG}_{\rho}*f}. In combinatorial terms, {(T_{\rho}f)(x)} is the expected value of {f(y)}, where {y} is obtained from {x} by independently flipping each bit of {x} with probability {(1-\rho)/2}.

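All of these properties can be checked numerically on a small cube. The sketch below (pure Python; helper names like `conv` and `edge` are my own, and the functions are normalized as in the comments so that the stated Fourier coefficients come out) verifies {\widehat{f*g}=\hat{f}\hat{g}}, that {\delta} is the convolution identity, the difference formula for {f*h_i}, and the semigroup law:

```python
from itertools import product

n = 3
cube = list(product([-1, 1], repeat=n))
ones = tuple([1] * n)
subsets = [tuple(i for i in range(n) if (m >> i) & 1) for m in range(2 ** n)]

def chi(S, x):
    out = 1
    for i in S:
        out *= x[i]
    return out

def fhat(f, S):
    # \hat f(S) = E_x[f(x) chi_S(x)] under the uniform measure.
    return sum(f(x) * chi(S, x) for x in cube) / len(cube)

def conv(f, g):
    # (f*g)(x) = E_y[f(y) g(xy)], where xy is the coordinatewise product.
    def h(x):
        return sum(f(y) * g(tuple(a * b for a, b in zip(x, y)))
                   for y in cube) / len(cube)
    return h

# Dirac delta, with mass 2^n at (1,...,1) so that \hat\delta(S) = 1 for all S
# and expectation-convolution with it is the identity.
delta = lambda x: float(2 ** n) if x == ones else 0.0

def edge(i):
    # Edge function h_i, scaled to +-2^(n-1) so that \hat h_i(S) = 1 iff i in S.
    def h(x):
        if all(x[j] == 1 for j in range(n) if j != i):
            return (2 ** (n - 1)) * x[i]
        return 0.0
    return h

def bg(rho):
    # BG_rho(x) = prod_i (1 + rho * x_i).
    def h(x):
        out = 1.0
        for xi in x:
            out *= 1 + rho * xi
        return out
    return h

f = lambda x: x[0] + 2 * x[1] * x[2] - 3 * x[0] * x[2]
g = lambda x: 1 + x[0] * x[1]

# Fourier coefficients of a convolution multiply: \widehat{f*g} = \hat f \hat g.
fg = conv(f, g)
ok_fourier = all(abs(fhat(fg, S) - fhat(f, S) * fhat(g, S)) < 1e-9
                 for S in subsets)

# delta is the identity for convolution.
fd = conv(f, delta)
ok_delta = all(abs(fd(x) - f(x)) < 1e-9 for x in cube)

# (f * h_0)(x) = (f(x) - f(y))/2, where y is x with bit 0 flipped.
fh = conv(f, edge(0))
flip0 = lambda x: (-x[0],) + x[1:]
ok_edge = all(abs(fh(x) - (f(x) - f(flip0(x))) / 2) < 1e-9 for x in cube)

# Semigroup law: BG_sigma * BG_rho = BG_{sigma * rho}.
ok_semi = all(abs(conv(bg(0.5), bg(0.4))(x) - bg(0.2)(x)) < 1e-9 for x in cube)

print(ok_fourier, ok_delta, ok_edge, ok_semi)  # True True True True
```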
Lemma 1 {\frac{d}{d\rho}\textsf{BG}_{\rho}=\frac{1}{\rho}\,\textsf{BG}_{\rho}*\sum_{i\in[n]} h_{i}}.

Proof: This is easy in the Fourier basis:

\displaystyle \frac{d}{d\rho}\widehat{\textsf{BG}}_{\rho}(S) = \left(\rho^{|S|}\right)' = |S|\rho^{|S|-1} = \sum_{i\in[n]}\hat{h}_{i}(S)\,\frac{\widehat{\textsf{BG}}_{\rho}(S)}{\rho}. \Box
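The lemma can also be confirmed numerically, by comparing a finite-difference derivative of {\textsf{BG}_\rho} with the right-hand side. A pure-Python sketch (helper names are mine; {h_i} is normalized so that {\hat h_i(S)=1} iff {i\in S}):

```python
from itertools import product

n = 3
cube = list(product([-1, 1], repeat=n))

def conv(f, g):
    # (f*g)(x) = E_y[f(y) g(xy)], xy the coordinatewise product.
    def h(x):
        return sum(f(y) * g(tuple(a * b for a, b in zip(x, y)))
                   for y in cube) / len(cube)
    return h

def bg(rho):
    # BG_rho(x) = prod_i (1 + rho * x_i).
    def h(x):
        out = 1.0
        for xi in x:
            out *= 1 + rho * xi
        return out
    return h

def sum_edges(x):
    # sum_i h_i(x), with h_i scaled so that \hat h_i(S) = 1 iff i in S.
    return sum((2 ** (n - 1)) * x[i] for i in range(n)
               if all(x[j] == 1 for j in range(n) if j != i))

rho, eps = 0.5, 1e-6
# Central finite difference approximating (d/drho) BG_rho(x).
deriv = lambda x: (bg(rho + eps)(x) - bg(rho - eps)(x)) / (2 * eps)
rhs = conv(bg(rho), sum_edges)  # BG_rho * sum_i h_i
ok = all(abs(deriv(x) - rhs(x) / rho) < 1e-6 for x in cube)
print(ok)  # True
```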
