PSD lifting and Unique Games integrality gaps

By now, it is known that integrality gaps for the standard Unique Games SDP (see the paper of Khot and Vishnoi or Section 5.2 of this post) can be used to obtain integrality gaps for many other optimization problems, and often for very strong SDPs coming from various methods of SDP tightening; see, for instance, the paper of Raghavendra and Steurer.

Problematically, the Khot-Vishnoi gap is rather inefficient: To achieve the optimal gap for Unique Games with alphabet size {L}, one needs an instance of size {\exp(\Omega(L))}. As far as I know, there is no obstacle to achieving a gap instance where the number of variables is only {\mathrm{poly}(L)}.

The Walsh-Hadamard code

The Khot-Vishnoi construction is based on the Hadamard code. (See Section 5.2 here for a complete description.) If we use {L^2(\{-1,1\}^k)} to denote the Hilbert space of real-valued functions {f : \{-1,1\}^k \rightarrow \mathbb R}, then the Walsh-Hadamard basis of {L^2(\{-1,1\}^k)} is the set of functions of the form

\displaystyle  u_S(x) = \prod_{i \in S} x_i,

where {S \subseteq \{1,2,\ldots,k\}}.

Of course, for two such sets {S \neq T}, we have the orthogonality relations,

\displaystyle  \langle u_S, u_T \rangle = 0.

In their construction, the variables are essentially all functions of the form {f : \{-1,1\}^k \rightarrow \{-1,1\}}, of which there are {2^{2^k}}, while there are only {2^k} basis elements {\{u_S\}_{S \subseteq [k]}} which act as the alphabet for the underlying Unique Games instance. This is what leads to the exponential relationship between the number of variables and the label size.
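The orthogonality relations and the counting above are easy to check by brute force for small {k}. The following is a minimal Python sketch, not part of the Khot-Vishnoi construction itself; it assumes the inner product is normalized by {2^{-k}} (the uniform measure on the cube) and uses 0-based coordinates, and all function names are illustrative.

```python
from itertools import combinations, product

def walsh_character(S, x):
    """u_S(x) = product of x_i over i in S (indices are 0-based here)."""
    value = 1
    for i in S:
        value *= x[i]
    return value

def inner_product(S, T, k):
    """Assumed normalization: <u_S, u_T> = 2^{-k} * sum over x of u_S(x) u_T(x)."""
    cube = list(product([-1, 1], repeat=k))
    total = sum(walsh_character(S, x) * walsh_character(T, x) for x in cube)
    return total / len(cube)

def all_subsets(k):
    """All subsets of {0, ..., k-1}, each as a sorted tuple."""
    return [S for r in range(k + 1) for S in combinations(range(k), r)]

k = 3
subsets = all_subsets(k)

# The Gram matrix of the characters should be the identity.
gram_is_identity = all(
    inner_product(S, T, k) == (1.0 if S == T else 0.0)
    for S in subsets
    for T in subsets
)

num_labels = len(subsets)        # 2^k basis elements (the alphabet)
num_variables = 2 ** (2 ** k)    # all functions {-1,1}^k -> {-1,1}
```

Already for {k=3} there are 8 labels but 256 Boolean functions, which is the exponential blowup described above.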

A PSD lifting question

In an effort to improve this dependence, one could start with a much larger set of nearly orthogonal vectors, and then somehow lift them to a higher-dimensional space where they would become orthogonal. In order for the value of the SDP not to blow up, it would be necessary that this map has some kind of Lipschitz property. We are thus led to the following (possibly naïve) question.

Let {C(d,\varepsilon)} be the smallest number such that the following holds. (Here, {S^{d-1} \subseteq \mathbb R^d} denotes the {(d-1)}-dimensional unit sphere and {S(L^2)} denotes the unit sphere of {L^2}.)

There exists a map {F : S^{d-1} \rightarrow S(L^2)} such that {\|F\|_{\mathrm{Lip}} \leq C(d,\varepsilon)} and whenever {u,v \in S^{d-1}} satisfy {|\langle u,v\rangle| \leq \varepsilon}, we have {\langle F(u), F(v)\rangle = 0}.

(Recall that {\|F\|_{\mathrm{Lip}} = \sup_{x \neq y \in S^{d-1}} \|F(x)-F(y)\|/\|x-y\|}.)

One can show that

\displaystyle C(d,\varepsilon) \lesssim \frac{\sqrt{d}}{1-\varepsilon}

by randomly partitioning {S^{d-1}} so that all vectors satisfying {|\langle u,v\rangle| \leq \varepsilon} end up in different sets of the partition, and then mapping the points of each set to that set's own vector, with the vectors for distinct sets mutually orthogonal.

My question is simply: Is a better dependence on {d} possible? Can one rule out that {C(d,\varepsilon)} could be independent of {d}? Note that any solution which randomly maps points to orthogonal vectors must incur such a blowup (this is essentially rounding the SDP to an integral solution).

Bloomington summer school recap

A couple of months ago, at Indiana University, David Fisher, Nets Katz, and I organized a summer school on Analysis and geometry in the theory of computation. This school is one in a series organized by David and funded by NSF grant DMS-0643546 (see, e.g., last year’s school). What follows is a brief synopsis of what the school covered. All the lectures were given by the participants, and there are links to their lecture notes below. This is essentially an extended version of an introductory document I wrote for the participants, who were a mix of mathematicians and theoretical computer scientists.

Approximation Algorithms

In the following discussion, we will use the word efficient to describe an algorithm that runs in time polynomial in the size of its input. For a graph {G=(V,E)}, we use {\textsf{MC}(G)} to denote the “MAX-CUT value,” i.e. the quantity

\displaystyle \max_{S \subseteq V} \frac{|E(S, \bar S)|}{|E|},

where {E(S, \bar S)} denotes the set of edges between {S} and its complement. It is well known that computing {\textsf{MC}(G)} is {\mathsf{NP}}-hard, and thus assuming {\mathsf{P} \neq \mathsf{NP}}, there is no efficient algorithm that, given {G}, outputs {\textsf{MC}(G)}.
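To make the definition concrete, here is a small brute-force sketch in Python that evaluates the MAX-CUT value by trying every subset {S}. It is only an illustration for tiny graphs (it takes time exponential in {|V|}, consistent with the hardness just mentioned); the function name and the edge-list representation are assumptions of the sketch.

```python
from itertools import product

def max_cut_value(vertices, edges):
    """Return max over S of |E(S, S-bar)| / |E| by exhaustive search."""
    best = 0
    for sides in product([0, 1], repeat=len(vertices)):
        side_of = dict(zip(vertices, sides))
        # An edge is cut when its endpoints land on different sides.
        cut = sum(1 for u, v in edges if side_of[u] != side_of[v])
        best = max(best, cut)
    return best / len(edges)

triangle = max_cut_value([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
four_cycle = max_cut_value([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])
five_cycle = max_cut_value(list(range(5)), [(i, (i + 1) % 5) for i in range(5)])
```

Bipartite graphs such as the 4-cycle have value 1, while odd cycles always lose at least one edge.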

Given this state of affairs, it is natural to ask how well we can approximate the value {\mathsf{MC}(G)} with an efficient algorithm. For an algorithm {\mathcal A}, we use {\mathcal A(G)} to denote its output when run on the graph {G}. If {\mathcal A} satisfies {\mathcal A(G) \leq \mathsf{MC}(G)} for all {G}, we define its approximation ratio as

\displaystyle \alpha(\mathcal A) = \sup \left\{ \alpha : \mathcal A(G) \geq \alpha \cdot \mathsf{MC}(G) \textrm{ for all graphs } G\right\}.

Clearly {\alpha(\mathcal A) \in [0,1]}. Now we are interested in the best approximation ratio achievable by an efficient algorithm {\mathcal A}, i.e. the quantity

\displaystyle \textrm{approx}(\mathsf{MC}) = \sup \left\{ \alpha(\mathcal A) : \mathcal A \textrm{ is efficient} \right\}

It should be clear that similar questions arise for all sorts of other values which are NP-hard to compute (e.g. the chromatic number of a graph, the length of its shortest tour, or the length of its longest simple path). An algorithm of Goemans and Williamson (based on a form of convex optimization known as semi-definite programming) shows that

\displaystyle \mathrm{approx}(\mathsf{MC}) \geq \alpha_{\mathrm{GW}} = \frac{2}{\pi} \min_{0 < \theta < \pi} \frac{\theta}{1-\cos\theta} = 0.878\ldots

On the other hand, Håstad proved that, as a consequence of the PCP Theorem, it is NP-hard to obtain an approximation ratio better than {16/17}, i.e. if {\mathsf{P} \neq \mathsf{NP}}, then

\displaystyle \mathrm{approx}(\mathsf{MC}) \leq \frac{16}{17} = 0.941\ldots
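The two constants above can be checked numerically. The sketch below evaluates the Goemans-Williamson expression on a fine grid over {(0,\pi)} and compares the minimum with {16/17}; the grid search and its resolution are arbitrary choices made for illustration, not part of the original analysis.

```python
import math

def gw_objective(theta):
    """(2/pi) * theta / (1 - cos(theta)) for theta in (0, pi)."""
    return (2.0 / math.pi) * theta / (1.0 - math.cos(theta))

# Grid search over the open interval (0, pi); the grid size is arbitrary.
GRID = 200_000
thetas = [math.pi * i / GRID for i in range(1, GRID)]

theta_star = min(thetas, key=gw_objective)   # minimizing angle
alpha_gw = gw_objective(theta_star)          # roughly 0.8786
hastad_bound = 16.0 / 17.0                   # roughly 0.9412
```

The minimum is attained at an angle of roughly 2.33 radians, and the gap between the two bounds is the range in which the true value of {\mathrm{approx}(\mathsf{MC})} must lie if {\mathsf{P} \neq \mathsf{NP}}.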

How does one prove such a theorem? Well, the {\mathsf{NP}}-hardness of MAX-CUT is based on constructing graphs where every optimal solution has a particular structure (which eventually encodes the solution to another NP-hard problem like SATISFIABILITY). Similarly, the NP-hardness of obtaining even “near-optimal” solutions is proved, in part, by constructing graphs where every solution whose value is close to optimal has some very specific structure (e.g. is close, in some stronger sense, to an optimal solution).

In this way, one of the main steps in proving the inapproximability of {\mathsf{NP}}-hard problems involves constructing objects which have such a “rigidity” property. This summer school is about how one can use the rigidity of analytic and geometric objects to obtain combinatorial objects with the same property. In fact, assuming something called the “Unique Games Conjecture” (which we will see later), the approximability of many constraint satisfaction problems can be tied directly to the existence of certain geometric configurations.

The Lectures

The first series of lectures will concern the Sparsest Cut problem in graphs and its relationship to bi-lipschitz {L_1} embeddings of finite metric spaces. In particular, we will look at rigidity properties of  “nice” subsets of the Heisenberg group, and how these can be used to prove limitations on a semi-definite programming approach to Sparsest Cut. In the second series, we will see how—assuming the Unique Games Conjecture (UGC)—proving lower bounds on certain simple semi-definite programs actually proves lower bounds against all efficient algorithms. This will entail, among other things, an analytic view of {\{0,1\}}-valued functions, primarily through harmonic analysis.

Sparsest Cut and {L_1} embeddings

The Sparsest Cut problem is classically described as follows. We have a graph {G=(V,E)} and two functions {C : V \times V \rightarrow \mathbb R_+} and {D : V \times V \rightarrow \mathbb R_+}, with {\mathrm{supp}(C) \subseteq E}. The goal is to compute

\displaystyle   \Phi^*(G;C,D) = \min_{S \subseteq V} \frac{C(S, \bar S)}{D(S, \bar S)},

where we use {C(A,B) = \sum_{a \in A, b\in B} C(a,b)} and {D(A,B) = \sum_{a \in A, b \in B} D(a,b)}. The problem has a number of important applications in computer science.
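For intuition, the quantity {\Phi^*(G;C,D)} can be computed by exhaustive search on very small instances. The following Python sketch does this; it assumes {C} and {D} are symmetric and stored as dictionaries on ordered pairs, skips cuts with {D(S,\bar S)=0}, and uses illustrative names throughout.

```python
import math
from itertools import product

def symmetric(weights):
    """Extend weights given on pairs (u, v) to both orientations."""
    out = {}
    for (u, v), w in weights.items():
        out[(u, v)] = w
        out[(v, u)] = w
    return out

def cut_weight(W, S, vertices):
    """W(S, S-bar) = sum of W(a, b) over a in S and b outside S."""
    return sum(W.get((a, b), 0.0) for a in S for b in vertices if b not in S)

def sparsest_cut(vertices, C, D):
    """Return (min ratio, minimizing set) over all S with D(S, S-bar) > 0."""
    best, best_set = math.inf, None
    for mask in product([False, True], repeat=len(vertices)):
        S = {v for v, inside in zip(vertices, mask) if inside}
        demand = cut_weight(D, S, vertices)
        if demand == 0:
            continue  # also skips the empty set and the full vertex set
        ratio = cut_weight(C, S, vertices) / demand
        if ratio < best:
            best, best_set = ratio, S
    return best, best_set

# Path 0 - 1 - 2 - 3 with unit capacities and unit demand between all pairs.
V = [0, 1, 2, 3]
C = symmetric({(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0})
D = symmetric({(u, v): 1.0 for u in V for v in V if u < v})
phi, cut = sparsest_cut(V, C, D)
```

On this path the best cut splits the path in the middle: one edge is cut while four demand pairs are separated.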

Computing {\Phi^*(G;C,D)} is NP-hard, but again we can ask for approximation algorithms. The best-known approach is based on computing the value of the Goemans-Linial semi-definite program, {\mathsf{sdp}(G;C,D)}, which is

\displaystyle \min \left\{ \frac{\sum_{u,v} C(u,v) \|x_u-x_v\|_2^2}{\sum_{u,v} D(u,v) \|x_u-x_v\|_2^2}: \{x_u\}_{u \in V} \subseteq \mathbb R^V\textrm{ and }\|x_u-x_v\|^2 \leq \|x_u-x_w\|^2 + \|x_w-x_v\|^2 \textrm{ for all }  u,v,w \in V \right\}.

This value can be computed by a semi-definite program (SDP), as we will see. It is an easy exercise to check that {\mathsf{sdp}(G;C,D) \leq \Phi^*(G;C,D)}, and we can ask for the smallest {\alpha = \alpha(n)} such that for all {n}-node graphs {G} and all functions {C,D}, we have

\displaystyle \Phi^*(G;C,D) \leq \alpha(n) \cdot \mathsf{sdp}(G;C,D).

(E.g. it is now known that {(\log n)^{2^{-1000}} \leq \alpha(n) \leq O(\sqrt{\log n} \log \log n)}, with the upper bound proved here, and the lower bound proved here.)

By some duality arguments, one can characterize {\alpha(n)} in a different way. For a metric space {(X,d)}, write {c_1(X,d)} for the infimal constant {B} such that there exists a mapping {f : X \rightarrow L_1} satisfying, for all {x,y \in X},

\displaystyle   \|f(x)-f(y)\|_1 \leq d(x,y) \leq B \|f(x)-f(y)\|_1.

It turns out that

\displaystyle \alpha(n) = \sup \left\{ c_1(X,d) : |X|=n \textrm{ and } (X, \sqrt{d})\textrm{ embeds isometrically in } L_2\right\} (1)

This shows that determining the power of the preceding SDP is intimately connected to understanding bi-lipschitz embeddings into {L_1}. This is what we will study in the first 6 lectures.
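To make the bi-lipschitz condition concrete, here is a hedged Python sketch that computes the constant {B} achieved by one particular map {f} from a finite metric space into {\mathbb R^m} with the {\ell_1} norm. Since {c_1(X,d)} is an infimum over all maps, the number returned is only an upper bound on {c_1(X,d)}; the function names and the two example maps are assumptions chosen for illustration.

```python
import math
from itertools import combinations

def l1_distance(p, q):
    """The l_1 distance between two points of R^m given as tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l1_distortion(points, d, f):
    """Smallest B such that a rescaling of f satisfies
    |f(x) - f(y)|_1 <= d(x, y) <= B * |f(x) - f(y)|_1 for all pairs."""
    expansion = 0.0    # max of |f(x) - f(y)|_1 / d(x, y)
    contraction = 0.0  # max of d(x, y) / |f(x) - f(y)|_1
    for x, y in combinations(points, 2):
        image = l1_distance(f[x], f[y])
        if image == 0:
            return math.inf  # f collapses two distinct points
        expansion = max(expansion, image / d(x, y))
        contraction = max(contraction, d(x, y) / image)
    return expansion * contraction

# Shortest-path metric on the 4-cycle.
points = [0, 1, 2, 3]
def cycle_metric(i, j):
    return min(abs(i - j), 4 - abs(i - j))

square = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}  # corners of a square
line = {i: (i,) for i in points}                       # unroll onto a line

square_distortion = l1_distortion(points, cycle_metric, square)
line_distortion = l1_distortion(points, cycle_metric, line)
```

The square map is an isometry into {\ell_1}, while unrolling the cycle onto a line stretches the pair {\{0,3\}} by a factor of 3.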

  1. (Arnaud de Mesmay) In the first lecture, we will be introduced to the basic geometry of the 3-dimensional Heisenberg group {\mathbb H^3}, and how differentiation plays a role in proving lower bounds on bi-lipschitz distortion. In particular, we will see Pansu’s approach for finite-dimensional targets and a generalization to spaces with the RNP, and also why a straightforward generalization would fail for {L_1}.
  2. (Mohammad Moharrami) Next, we will see how a differentiation approach to {L_1} embeddings might work in a toy setting that uses only finite graphs. The study of “monotone subsets” (which is elementary here) also arises in the work of Cheeger and Kleiner in lectures 4 and 5.  (See also this post.)
  3. (Sean Li) Here, we will see that there is an equivalent metric {d} on the Heisenberg group for which {(\mathbb H^3, \sqrt{d})} embeds isometrically into {L_2}. This is one half of proving lower bounds on {\alpha(n)} using (1).
  4. (Jeehyeon Seo and John Mackay) In Lectures 4-5, we’ll look at the approach of Cheeger and Kleiner for proving that {\mathbb H^3} does not bi-lipschitz embed into {L_1}.  (Note that these authors previously offered a different approach to non-embeddability, though the one presented in these lectures is somewhat simpler.)
  5. (Florent Baudier) Finally, in Lecture 6, we see some embedding theorems for finite metric spaces that allow us to prove upper bounds on {\alpha(n)}.

The UGC, semi-definite programs, and constraint satisfaction

In the second series of lectures, we’ll see how rigidity of geometric objects can possibly say something, not just about a single algorithm (like a semi-definite program), but about all efficient algorithms for solving a particular problem.

  1. (An-Sheng Jhang) First, we’ll review basic Fourier analysis on the discrete cube, and how this leads to some global rigidity theorems for cuts. These tools will be essential later.  (See also these lecture notes from Ryan O’Donnell.)
  2. (Igor Gorodezky) Next, we’ll see a semi-definite program (SDP) for the MAX-CUT problem, and a tight analysis of its approximation ratio (which turns out to be the {0.878\ldots} value we saw earlier).
  3. (Sam Daitch) In the third lecture, we’ll see the definition of the Unique Games Conjecture, and how it can be used (in an ad-hoc manner, for now) to transform our SDP analysis into a proof that the SDP-based algorithm is optimal (among all efficient algorithms) under some complexity-theoretic assumptions.
  4. (Deanna Needell) A key technical component of the preceding lecture is something called the Majority is Stablest Theorem that relates sufficiently nice functions on the discrete cube to functions on Gaussian space.
  5. (Sushant Sachdeva) In the final lecture, we’ll see Raghavendra’s work which shows that, for a certain broad class of NP-hard constraint satisfaction problems, assuming the UGC, the best-possible algorithm is the “canonical” semi-definite program. In other words, the approximation ratio for these problems is completely determined by the existence (or lack thereof) of certain vector configurations in {\mathbb R^n}.  (See also this post.)

Open question: PSD flows

This post is about a beautiful twist on flows that arises when studying (the dual of) the Sparsest Cut SDP. These objects, which I’m going to call “PSD flows,” are rather poorly understood, and there are some very accessible open problems surrounding them. Let’s begin with the definition of a normal flow:

Let G=(V,E) be a finite, undirected graph, and for every pair u,v \in V, let \mathcal P_{uv} be the set of all paths between u and v in G.  Let \mathcal P = \bigcup_{u,v \in V} \mathcal P_{uv}.  A flow in G is simply a mapping F : \mathcal P \to \mathbb R_{\geq 0}.  We define, for every edge (u,v) \in E, the congestion on (u,v) as

\displaystyle C_F(u,v) = \sum_{p \in \mathcal P: (u,v) \in p} F(p)

which is the total amount of flow going through (u,v).  Finally, for every u,v \in V, we define

\displaystyle F[u,v] = \sum_{p \in \mathcal P_{uv}} F(p)

as the total amount of flow sent from u to v.
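These two quantities are simple to compute once a flow is written down explicitly. The sketch below represents a flow as a Python dictionary from paths (tuples of vertices) to nonnegative amounts; that representation, the helper names, and the triangle example are assumptions made for illustration.

```python
def path_edges(path):
    """Undirected edges used by a path given as a tuple of vertices."""
    return [frozenset(edge) for edge in zip(path, path[1:])]

def congestion(flow, u, v):
    """C_F(u,v): total flow over all paths that use the edge (u,v)."""
    edge = frozenset((u, v))
    return sum(amount for path, amount in flow.items() if edge in path_edges(path))

def flow_between(flow, u, v):
    """F[u,v]: total flow on paths whose endpoints are u and v."""
    return sum(
        amount for path, amount in flow.items()
        if {path[0], path[-1]} == {u, v}
    )

# A flow on the triangle with vertices a, b, c.
flow = {
    ("a", "b"): 0.5,
    ("a", "c", "b"): 0.25,   # a second a-b path, routed through c
    ("a", "c"): 0.5,
    ("b", "c"): 0.5,
}

sent_ab = flow_between(flow, "a", "b")   # 0.5 + 0.25
load_ab = congestion(flow, "a", "b")     # only the direct path uses (a,b)
load_ac = congestion(flow, "a", "c")     # direct path plus the detour
```

Note that the detour path contributes to F[a,b] but to the congestion of the edges (a,c) and (c,b), which is exactly how demand and congestion get coupled.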

Now, in the standard (all-pairs) maximum concurrent flow problem, the goal is to find a flow F which simultaneously sends D units of flow from every vertex to every other, while not putting more than one unit of flow through any edge, i.e.

\displaystyle \mathsf{mcf}(G) = \textrm{maximize } \left\{ \vphantom{\bigoplus} D : \forall u,v, F[u,v] \geq D \textrm{ and } \forall (u,v) \in E, C_F(u,v) \leq 1\right\}.

In order to define a PSD flow, it helps to write this in a slightly different way.  If we define the symmetric matrix

A_{u,v} = F[u,v] - D + {\bf 1}_{\{(u,v) \in E\}} - C_F(u,v)

then we have

Claim 1: \mathsf{mcf}(G) = \max \{ D : A_{u,v} \geq 0 \textrm{ for all } u,v \in V \}.

To see that this is true, we can take a matrix with A_{u,v} \geq 0 for all u,v \in V and fix it one entry at a time so that F[u,v] \geq D and C_F(u,v) \leq 1, without decreasing the total demand satisfied by the flow.

For instance, if (u,v) \in E and C_F(u,v) > 1+\varepsilon, then it must be that F[u,v] > D+\varepsilon, so we can reroute \varepsilon units of flow going through the edge (u,v) to go along one of the extraneous flow paths which gives the excess F[u,v] > D + \varepsilon.  Similar arguments hold for the other cases (Exercise!).
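The following Python sketch builds the matrix A for a small example and shows the situation discussed above: an overcongested edge can still have A_{u,v} \geq 0 as long as the same pair receives excess flow. The dictionary representation of flows, the helper names, and the convention that the congestion term is 0 for non-edges are assumptions of the sketch.

```python
def endpoints_match(path, u, v):
    return {path[0], path[-1]} == {u, v}

def uses_edge(path, u, v):
    return any({a, b} == {u, v} for a, b in zip(path, path[1:]))

def error_matrix(vertices, edges, flow, D):
    """A_{u,v} = F[u,v] - D + 1_{(u,v) in E} - C_F(u,v) for u != v."""
    edge_set = {frozenset(e) for e in edges}
    A = {}
    for u in vertices:
        for v in vertices:
            if u == v:
                continue
            sent = sum(x for p, x in flow.items() if endpoints_match(p, u, v))
            is_edge = frozenset((u, v)) in edge_set
            # Assumed convention: no congestion term on non-edges.
            load = sum(x for p, x in flow.items() if uses_edge(p, u, v)) if is_edge else 0.0
            A[(u, v)] = sent - D + (1.0 if is_edge else 0.0) - load
    return A

# Path graph a - b - c.
V = ["a", "b", "c"]
E = [("a", "b"), ("b", "c")]

# A feasible flow: every pair gets 0.5 units and both edges carry exactly 1.
feasible = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "b", "c"): 0.5}
A_feasible = error_matrix(V, E, feasible, D=0.5)

# Push an extra 0.5 units directly from a to b: the edge (a,b) now carries
# 1.5 > 1, but F[a,b] = 1.0 exceeds D by the same amount.
overloaded = {("a", "b"): 1.0, ("b", "c"): 0.5, ("a", "b", "c"): 0.5}
A_overloaded = error_matrix(V, E, overloaded, D=0.5)
```

In the second flow every entry of A is still nonnegative even though the flow itself is infeasible; removing the extra 0.5 units on the direct path repairs it without lowering D.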

PSD flows

So those are normal flows.  To define a PSD flow, we define for any symmetric matrix A, the Laplacian of A, which has diagonal entries L(A)_{i,i} = \sum_{j \neq i} A_{i,j} and off-diagonal entries L(A)_{i,j} = - A_{i,j}.  It is easy to check that

\displaystyle \langle x, L(A)\, x \rangle = \sum_{i < j} A_{i,j} (x_i-x_j)^2.

Hence if A_{u,v} \geq 0 for all u,v \in V, then certainly L(A) \succeq 0 (i.e. L(A) is positive semi-definite).  The PSD flow problem is precisely

\displaystyle \max \{ D : L(A) \succeq 0 \}

where A is defined as above.  Of course, now we are allowing A to have negative entries, which makes this optimization trickier to understand.  We allow the flow to undersatisfy some demand, and to overcongest some edges, but now the “error” matrix has to induce a PSD Laplacian.
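To see that negative entries really can be tolerated, here is a small Python sketch. It builds L(A) for a 3 \times 3 symmetric matrix with one negative entry, checks the quadratic form identity with each unordered pair \{i,j\} counted once, and then samples random vectors as a sanity check that the form stays nonnegative. The example matrix and the sampling test are illustrative assumptions; random sampling is evidence of positive semi-definiteness, not a proof.

```python
import random

def laplacian(A):
    """L(A): diagonal entries are row sums over j != i, off-diagonal are -A[i][j]."""
    n = len(A)
    L = [[-A[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(A[i][j] for j in range(n) if j != i)
    return L

def quadratic_form(M, x):
    """<x, M x> computed directly from the matrix entries."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

def pairwise_form(A, x):
    """Sum of A[i][j] * (x_i - x_j)^2 over unordered pairs i < j."""
    n = len(x)
    return sum(A[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(i + 1, n))

# Symmetric A with a negative entry between vertices 0 and 2.
A = [
    [0.0, 1.0, -0.25],
    [1.0, 0.0, 1.0],
    [-0.25, 1.0, 0.0],
]
L = laplacian(A)

x = [3.0, -1.0, 2.0]
direct = quadratic_form(L, x)
via_pairs = pairwise_form(A, x)

# Sanity check on random vectors (fixed seed for reproducibility).
rng = random.Random(0)
samples = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(1000)]
min_sampled = min(quadratic_form(L, y) for y in samples)
```

Here the negative weight on the pair \{0,2\} is compensated by the two positive weights along the path 0, 1, 2, since (x_0-x_2)^2 \leq 2(x_0-x_1)^2 + 2(x_1-x_2)^2.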

Scaling down the capacities

Now, consider some \delta \in [0,1], and write

\displaystyle A_{u,v}^{(\delta)} = F[u,v] - D + \delta \cdot {\bf 1}_{(u,v) \in E} - C_F(u,v).

Requiring A_{u,v}^{(\delta)} \geq 0 for every u,v \in V simply induces a standard flow problem where each edge now has capacity \delta.  In the case of normal flows, because we can decouple the demand/congestion constraints as in Claim 1, we can easily relate \max \{ D : A_{u,v}^{(\delta)} \geq 0\,\forall u,v \in V\} to \max \{ D : A_{u,v} \geq 0\,\forall u,v \in V\} (the first is exactly \delta times the second, because we can just scale a normal flow down by \delta and now it satisfies the reduced edge capacities).

Question: Can we relate \max \{ D : L(A^{(\delta)}) \succeq 0 \} and \max \{ D : L(A) \succeq 0 \}?  More specifically, do they differ by some multiplicative constant depending only on \delta?

This is a basic question whose answer is actually of fundamental importance in understanding the Sparsest Cut SDP.  I asked this question in its primal form almost 4 years ago (see question 3.2 here).

Note that the answer is affirmative if we can decouple the demand/congestion constraints in the case of PSD flows.  In other words, let X_{u,v} = F[u,v] - D and let Y_{u,v} = {\bf 1}_{\{(u,v) \in E\}} - C_F(u,v).

Question: Can we relate \max \{ D : L(A) \succeq 0 \} to \max \{ D : L(X) \succeq 0 \textrm{ and } L(Y) \succeq 0 \}?

In the next post, I’ll discuss consequences of this question for constructing integrality gaps for the Sparsest Cut SDP.