**1. Construction of Föllmer’s drift**

In a previous post, we saw how an entropy-optimal drift process could be used to prove the Brascamp-Lieb inequalities. Our main tool was a result of Föllmer that we now recall and justify. Afterward, we will use it to prove the Gaussian log-Sobolev inequality.

Consider with , where is the standard Gaussian measure on . Let denote an -dimensional Brownian motion with . We consider all processes of the form

where is a progressively measurable drift and such that has law .

Theorem 1 (Föllmer). It holds that

where the minima are over all processes of the form (1).

*Proof:* In the preceding post (Lemma 2), we have already seen that for any drift of the form (1), it holds that

thus we need only exhibit a drift achieving equality.

We define

where is the Brownian semigroup defined by
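For intuition, the Brownian semigroup acts by averaging a function against a Gaussian of variance equal to the elapsed time. A minimal Monte Carlo sketch, assuming the standard definition P_t f(x) = E[f(x + B_t)] (function name and parameters are illustrative):

```python
import math
import random

def brownian_semigroup(f, x, t, n=200_000, seed=0):
    """Monte Carlo estimate of P_t f(x) = E[f(x + B_t)] = E[f(x + sqrt(t) Z)]."""
    rng = random.Random(seed)
    s = math.sqrt(t)
    return sum(f(x + s * rng.gauss(0.0, 1.0)) for _ in range(n)) / n

# Check against the closed form: applying P_t to y -> e^{ay} gives e^{ax + a^2 t / 2}.
a, x, t = 1.0, 0.3, 0.5
est = brownian_semigroup(lambda y: math.exp(a * y), x, t)
exact = math.exp(a * x + a * a * t / 2)
print(est, exact)  # agree to within Monte Carlo error
```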

As we saw in the previous post (Lemma 2), the chain rule yields

We are left to show that has law and .

We will prove the first fact using Girsanov’s theorem to argue about the change of measure between and . As in the previous post, we will argue somewhat informally using the heuristic that the law of is a Gaussian random variable in with covariance . Itô’s formula makes precise the sense in which this heuristic is justified (see our use of the formula below).

The following lemma says that, given any sample path of our process up to time , the probability that Brownian motion (without drift) would have “done the same thing” is .

Remark 1. I chose to present various steps in the next proof at varying levels of formality. The arguments have the same structure as corresponding formal proofs, but I thought (perhaps naïvely) that this would be instructive.

Lemma 2. Let denote the law of . If we define then under the measure given by

the process has the same law as .

*Proof:* We argue by analogy with the discrete proof. First, let us define the infinitesimal “transition kernel” of Brownian motion using our heuristic that has covariance :

We can also compute the (time-inhomogeneous) transition kernel of :

Here we are using that and is deterministic conditioned on the past, thus the law of is a normal with mean and covariance .

To avoid confusion of derivatives, let’s use for the density of and for the density of Brownian motion (recall that these are densities on paths). Now let us relate the density to the density . We use here the notations to denote a (non-random) sample path of :

where the last line uses .

Now by “heuristic” induction, we can assume , yielding

In the last line, we used the fact that is the infinitesimal transition kernel for Brownian motion.

From Lemma 2, it will follow that has the law where is the law of . In particular, has the law which was our first goal.

Given our preceding less formal arguments, let us use a proper stochastic calculus argument to establish (3). To do that we need a way to calculate

Notice that this involves both time and space derivatives.

**Itô’s lemma.** Suppose we have a sufficiently smooth function (continuously differentiable in time and twice continuously differentiable in space) that we write as where is a space variable and is a time variable. We can expand via its Taylor series:

Normally we could eliminate the terms , etc. since they are lower order as . But recall that for Brownian motion we have the heuristic . Thus we cannot eliminate the second-order space derivative if we plan to plug in (or , a process driven by Brownian motion). Itô’s lemma says that this consideration alone gives us the correct result:

This generalizes in a straightforward way to the higher dimensional setting .
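The heuristic behind Itô's correction term can be seen numerically: along a discretized Brownian path, the sum of squared increments (the quadratic variation) concentrates around the elapsed time, while the first-order variation diverges as the mesh shrinks. A quick sketch (step count and seed are arbitrary):

```python
import math
import random

rng = random.Random(1)
T, n = 1.0, 100_000
dt = T / n
increments = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]

quad_var = sum(db * db for db in increments)   # quadratic variation, concentrates near T
total_var = sum(abs(db) for db in increments)  # first variation, grows like sqrt(n)
print(quad_var, total_var)
```

This is exactly why the second-order space derivative survives while the genuinely lower-order terms vanish.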

With Itô’s lemma in hand, let us continue to calculate the derivative

For the time derivative (the first term), we have employed the heat equation

where is the Laplacian on .

Note that the heat equation was already contained in our “infinitesimal density” in the proof of Lemma 2, or in the representation , and Itô’s lemma was also contained in our heuristic that has covariance .

Using Itô’s formula again yields

giving our desired conclusion (3).

Our final task is to establish optimality: . We apply the formula (3):

where we used . Combined with (2), this completes the proof of the theorem.

**2. The Gaussian log-Sobolev inequality**

Consider again a measurable with . Let us define . Then the classical log-Sobolev inequality in Gaussian space asserts that

First, we discuss the correct way to interpret this. Define the Ornstein-Uhlenbeck semi-group by its action

This is the natural stationary diffusion process on Gaussian space. For every measurable , we have

or equivalently
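Assuming the standard Mehler form of the Ornstein-Uhlenbeck semigroup, U_t f(x) = E[f(e^{-t} x + sqrt(1 - e^{-2t}) Z)] with Z a standard Gaussian, here is a small numerical illustration of the convergence to the Gaussian average (names and parameters are illustrative):

```python
import math
import random

def ou_semigroup(f, x, t, n=100_000, seed=2):
    """Mehler formula: U_t f(x) = E[f(e^{-t} x + sqrt(1 - e^{-2t}) Z)]."""
    rng = random.Random(seed)
    a = math.exp(-t)
    b = math.sqrt(1.0 - a * a)
    return sum(f(a * x + b * rng.gauss(0.0, 1.0)) for _ in range(n)) / n

# For f(x) = x^2 the closed form is U_t f(x) = e^{-2t} x^2 + (1 - e^{-2t}):
# it interpolates between f itself at t = 0 and the Gaussian average E f = 1.
x = 2.0
est = ou_semigroup(lambda y: y * y, x, 1.0)
exact = math.exp(-2.0) * x * x + (1.0 - math.exp(-2.0))
print(est, exact)  # agree up to Monte Carlo error; both tend to 1 as t grows
```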

The log-Sobolev inequality yields quantitative convergence in the relative entropy distance as follows: Define the *Fisher information*

One can check that

thus the Fisher information describes the instantaneous decay of the relative entropy of under diffusion.

So we can rewrite the log-Sobolev inequality as:

This expresses the intuitive fact that when the relative entropy is large, its rate of decay toward equilibrium is faster.
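In symbols, writing $H(f) = \int f \log f \, d\gamma_n$ for the relative entropy and $I(f)$ for the Fisher information, and assuming the standard normalization of the Gaussian log-Sobolev inequality, the inequality and the exponential decay it implies read:

```latex
H(f) \;\le\; \tfrac{1}{2}\, I(f),
\qquad
\frac{d}{dt}\, H(U_t f) \;=\; -\, I(U_t f) \;\le\; -2\, H(U_t f)
\;\Longrightarrow\;
H(U_t f) \;\le\; e^{-2t}\, H(f).
```

The implication is just Gronwall's inequality applied to the differential inequality in the middle.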

**Martingale property of the optimal drift.** Now for the proof of (5). Let be the entropy-optimal process with . We need one more fact about : The optimal drift is a martingale, i.e. for .

Let’s give two arguments to support this.

**Argument one: Brownian bridges.** First, note that by the chain rule for relative entropy, we have:

But from optimality, we know that the latter expectation is zero. Therefore -almost surely, we have

This implies that if we condition on the endpoint , then is a Brownian bridge (i.e., a Brownian motion conditioned to start at and end at ).

This implies that , as one can check that a Brownian bridge with endpoint is described by the drift process , and

That seemed complicated. There is a simpler way to see this: Given and any bridge from to , every “permutation” of the infinitesimal steps in has the same law (by commutativity, they all land at ). Thus the marginal law of at every point should be the same. In particular,

**Argument two: Change of measure.** There is a more succinct (though perhaps more opaque) way to see that is a martingale. Note that the process is a Doob martingale. But we have and we also know that is precisely the change of measure that makes into Brownian motion.

**Proof of the log-Sobolev inequality.** In any case, now we are ready for the proof of (5). It also comes straight from Lehec’s paper. Since is a martingale, we have . So by Theorem 1:

The latter quantity is . In the last equality, we used the fact that is precisely the change of measure that turns into Brownian motion.


Note that establishing a *node-capacitated* version of the Okamura-Seymour theorem was an open question of Chekuri and Kawarabayashi. Resolving it positively is somewhat more difficult.

Theorem 1. Given a weighted planar graph and a face of , there is a non-expansive embedding such that is an isometry.

By rational approximation and subdivision of edges, we may assume that is unweighted. The following proof is constructive, and provides an explicit sequence of cuts on whose characteristic functions form the coordinates of the embedding . Each such cut is obtained by applying the following lemma. Note that for a subset of vertices, we use the notation for the graph obtained from by contracting the edges across it.
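To make the cut construction concrete: each cut contributes one coordinate of the embedding (the indicator of one side), and the $\ell_1$ distance between images counts the cuts separating a pair of vertices. A toy sketch on the 4-cycle, where two cuts already give an isometry (the graph and cuts are illustrative choices, not taken from the proof):

```python
from collections import deque

def graph_dist(adj, s):
    """BFS distances from s in an unweighted graph."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# The 4-cycle 0-1-2-3-0: two cuts realize the whole shortest-path metric.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
cuts = [{0, 1}, {1, 2}]                       # each cut S gives a coordinate v -> 1_S(v)
embed = lambda v: [1 if v in S else 0 for S in cuts]

for u in adj:
    for v, d in graph_dist(adj, u).items():
        l1 = sum(abs(a - b) for a, b in zip(embed(u), embed(v)))
        assert l1 == d                        # on C_4 the cut embedding is an isometry
print("isometric embedding of C_4 via 2 cuts")
```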

Lemma 2 (Face-preserving cut lemma). Let be a -connected bipartite planar graph and be the boundary of a face of . There exists a cut such that

*Proof:* Fix a plane embedding of that makes the boundary of the outer face. Since is -connected, is a cycle, and since is bipartite and has no parallel edges, .

Consider an arbitrary pair of distinct vertices on . There is a unique path from to that runs along counterclockwise; call this path . Consider a path and a vertex . We say that *lies below* if, in the plane embedding of , lies in the closed subset of the plane bounded by and .

(Note that the direction of is significant in the definition of “lying below,” i.e., belowness with respect to a path in is not the same as belowness with respect to the reverse of the same path in .)

We say that *lies strictly below* if lies below and . We use this notion of “lying below” to define a partial order on the paths in : for we say that is *lower* than if every vertex in lies below .

We now fix the pair and a path so that the following properties hold:

- .
- If are distinct vertices with preceding and , then and .
- If is lower than and , then .

Note that a suitable pair exists because . Finally, we define the cut as follows: does not lie strictly below .

For the rest of this section, we fix the pair , the path and the cut as defined in the above proof.

Lemma 1. Let and be distinct vertices on and be such that . Then .

*Proof:* If the lemma holds trivially, so assume . Also assume without loss of generality that precedes in the path . The conditions on imply that all vertices in lie strictly below . Therefore, the path is lower than and distinct from .

By property (3), we have , which implies . Since is bipartite, the cycle formed by and must have even length; therefore, .

*Proof:* If there is nothing to prove. If not, we can write

for some where, for all , and . Let and denote the endpoints of with preceding in the path . Further, define and .

By Lemma 1, we have for . Since is a shortest path, we have for . Therefore

The latter quantity is precisely which completes the proof since .

*Proof of Claim 1:* Let be arbitrary and distinct. It is clear that , so it suffices to prove the opposite inequality. We begin by observing that

Let be a path in that achieves the minimum in the above expression. First, suppose . Then we must have . Now, , which implies and we are done.

Next, suppose . Then, there exists at least one vertex in that lies on . Let be the first such vertex and the last (according to the ordering in ) and assume that precedes in the path . Let . Note that may be trivial, because we may have . Now, , whence

This gives

where the first line follows from Eq. (1) and the definition of and the third line is obtained by applying Lemma 2 to the path . If at least one of lies in , then and we are done.

Therefore, suppose . Let . For a path and vertices on , let us use as shorthand for . By property (1), we have and since is bipartite, this means . By property (2), we have and . Using these facts, we now derive

Using this in (**) above and noting that , we get . This completes the proof.

*Proof of Theorem 1:* Assume that is -connected. We may also assume that is bipartite. To see why, note that subdividing every edge of by introducing one new vertex per edge leaves the metric essentially unchanged except for a scaling factor of .

We shall now prove the stronger statement that for every face of there exists a sequence of cuts of such that for all on , we have and that for , . We prove this by induction on .

The result is trivial in the degenerate case when is a single edge. For any larger and any cut , the graph is either a single edge or is -connected. Furthermore, contracting a cut preserves the parities of the lengths of all closed walks; therefore is also bipartite.

Apply the face-preserving cut lemma (Lemma 1) to obtain a cut . By the above observations, we can apply the induction hypothesis to to obtain cuts of corresponding to the image of in . Each cut induces a cut of . Clearly for any . Finally, for any , we have

where the first equality follows from the property of and the second follows from the induction hypothesis. This proves the theorem.


As I have mentioned before, one of my favorite questions is whether the shortest-path metric on a planar graph embeds into with distortion. This is equivalent to such graphs having an -approximate multi-flow/min-cut theorem. We know that the distortion has to be at least 2. By a simple discretization and compactness argument, this is equivalent to the question of whether every simply-connected surface admits a bi-Lipschitz embedding into .

In a paper of Tasos Sidiropoulos, it is proved that every simply-connected surface of *non-positive curvature* admits a bi-Lipschitz embedding into . A followup work of Chalopin, Chepoi, and Naves shows that actually such a surface admits an *isometric* embedding into . In this post, we present a simple proof of this result that was observed in conversations with Tasos a few years ago—it follows rather quickly from the most classical theorem in this setting, the Okamura-Seymour theorem.

Suppose that is a geodesic metric space (i.e. the distance between any pair of points is realized by a geodesic whose length is ). One says that has *non-positive curvature* (in the sense of Busemann) if for any pair of geodesics and , the map given by

is convex.

Theorem 1. Suppose that is homeomorphic to and is endowed with a geodesic metric such that has non-positive curvature. Then embeds isometrically in .

We will access the non-positive curvature property through the following fact. We refer to the book *Metric spaces of non-positive curvature*.

Lemma 2. Every geodesic in can be extended to a bi-infinite geodesic line.

*Proof of Theorem 1:* By a standard compactness argument, it suffices to construct an isometric embedding for a finite subset . Let denote the convex hull of . (A set is *convex* if for every we have for every geodesic connecting to .)

It is an exercise to show that the boundary of is composed of a finite number of geodesics between points . For every pair , let denote a geodesic line containing and which exists by Lemma 2. Consider the collection of sets , and let denote the set of intersection points between geodesics in . Since is a collection of geodesics, and all geodesics intersect at most once (an easy consequence of non-positive curvature), the set is finite.

Consider finally the set . The geodesics in naturally endow with the structure of a planar graph , where two vertices are adjacent if they lie on a subset of some geodesic in and the portion between and does not contain any other points of . Note that is the outer face of in the natural drawing, where is the boundary of (the union of the geodesics in ).

We can put a path metric on this graph by defining the length of an edge as . Let denote the induced shortest-path metric on the resulting graph. By construction, we have the following two properties.

- If or for some , then .
- For every , there is a shortest path between two vertices in such that .

Both properties follow from our construction using the lines .

Now let us state the geometric (dual) version of the Okamura-Seymour theorem.

Theorem 3 (Okamura-Seymour dual version). For every planar graph and face , there is a -Lipschitz mapping such that is an isometry.

Let us apply this theorem to our graph and face . Consider and . By property (1) above, we have . Since , from Theorem 3, we have . But property (2) above says that and lie on a – shortest-path in . Since is -Lipschitz, we conclude that it maps the whole path isometrically, thus , showing that is an isometry, and completing the proof.

*path space*. After finding an appropriate entropy-maximizer, the Brascamp-Lieb inequality will admit a gorgeous one-line proof. Our argument is taken from the beautiful paper of Lehec.

For simplicity, we start first with an entropy optimization on a discrete path space. Then we move on to Brownian motion.

**1.1. Entropy optimality on discrete path spaces**

Consider a finite state space and a transition kernel . Also fix some time .

Let denote the space of all paths . There is a natural measure on coming from the transition kernel:

Now suppose we are given a starting point , and a target distribution specified by a function scaled so that . If we let denote the law of , then this simply says that is a density with respect to . One should think about as the natural law at time (given ), and describes a perturbation of this law.

Let us finally define the set of all measures on that start at and end at , i.e. those measures satisfying

and for every ,

Now we can consider the entropy optimization problem:

One should verify that, like many times before, we are minimizing the relative entropy over a polytope.

One can think of the optimization as simply computing the most likely way for a mass of particles sitting at to end up in the distribution at time .

The optimal solution exists and is unique. Moreover, we can describe it explicitly: is given by a time-inhomogeneous Markov chain. For , this chain has transition kernel

where is the *heat semigroup* of our chain , i.e.

Let denote the time-inhomogeneous chain with transition kernels and and let denote the law of the random path . We will now verify that is the optimal solution to (1).
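The whole construction fits in a few lines of code: form the tilted kernels q_t(x, y) = p(x, y) P_{T-t-1} f(y) / P_{T-t} f(x) (written here explicitly, as an assumption matching the stripped display), run the chain from the starting point, and watch the endpoint law hit the target. The two-state chain and target below are arbitrary illustrative choices:

```python
# Two-state toy chain: verify that the tilted, time-inhomogeneous kernels
#   q_t(x, y) = p(x, y) * P_{T-t-1} f(y) / P_{T-t} f(x)
# transport a point mass at x0 to the target law nu in T steps.
p = [[0.5, 0.5], [0.3, 0.7]]   # base transition kernel (illustrative)
T, x0 = 4, 0
nu = [0.9, 0.1]                 # target law at time T

def apply_kernel(k, g):
    """(P g)(x) = sum_y k[x][y] g(y)."""
    return [sum(k[x][y] * g[y] for y in range(2)) for x in range(2)]

# mu = law of the unperturbed chain at time T, started at x0.
mu = [1.0 if x == x0 else 0.0 for x in range(2)]
for _ in range(T):
    mu = [sum(mu[x] * p[x][y] for x in range(2)) for y in range(2)]
f = [nu[y] / mu[y] for y in range(2)]   # density of nu with respect to mu

# Pf[s] = P_s f, computed by iterating the kernel.
Pf = [f]
for _ in range(T):
    Pf.append(apply_kernel(p, Pf[-1]))

# Propagate the point mass at x0 under the tilted kernels.
law = [1.0 if x == x0 else 0.0 for x in range(2)]
for t in range(T):
    q = [[p[x][y] * Pf[T - t - 1][y] / Pf[T - t][x] for y in range(2)]
         for x in range(2)]
    law = [sum(law[x] * q[x][y] for x in range(2)) for y in range(2)]

print(law)  # matches nu up to floating-point error
```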

We first need to confirm that , i.e. that has law . To this end, we will verify inductively that has law . For , this follows by definition. For the inductive step:

We have confirmed that . Let us now verify its optimality by writing

where the final equality uses the fact we just proved: has law . Continuing, we have

where the final inequality uses the definition of in (2). The latter quantity is precisely by the chain rule for relative entropy.

**Exercise:** One should check that if and are two time-inhomogeneous Markov chains on with respective transition kernels and then indeed the chain rule for relative entropy yields

We conclude that

and from this one immediately concludes that . Indeed, for any measure , we must have . This follows because is the law of the endpoint of a path drawn from and is the law of the endpoint of a path drawn from . The relative entropy between the endpoints is certainly less than along the entire path. (This intuitive fact can again be proved via the chain rule for relative entropy by conditioning on the endpoint of the path.)

**1.2. The Brownian version**

Let us now do the same thing for processes driven by Brownian motion in . Let be a Brownian motion with . Let be the standard Gaussian measure and recall that has law .

We recall that if we have two measures and on such that is absolutely continuous with respect to , we define the *relative entropy*

Our “path space” will consist of drift processes of the form

where denotes the drift. We require that is progressively measurable, i.e. that the value of the drift at time depends only on the history of the process up to time , and that . Note that we can write such a process in differential notation as

with .

Fix a smooth density with . In analogy with the discrete setting, let us use to denote the set of processes that can be realized in the form (4) and such that and has law .

Let us also use the shorthand to represent the entire path of the process. Again, we will consider the entropy optimization problem:

As in the discrete setting, this problem has a unique optimal solution (in the sense of stochastic processes). Here is the main result.

Theorem 1 (Föllmer). If is the optimal solution to (5), then

Just as for the discrete case, one should think of this as asserting that the optimal process only uses as much entropy as is needed for the difference in laws at the endpoint. The RHS should be thought of as an integral over the expected relative entropy generated at time (just as in the chain rule expression (3)).

The reason for the quadratic term is the usual relative entropy approximation for infinitesimal perturbations. For instance, consider the relative entropy between a binary random variable with expected value and a binary random variable with expected value :
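The display here presumably records the standard second-order approximation: for a small perturbation of a Bernoulli parameter, D(Bern(p + h) || Bern(p)) ≈ h²/(2p(1−p)) as h → 0, which is where the quadratic drift cost comes from. A quick numerical check (the value of p is arbitrary):

```python
import math

def kl_bern(q, p):
    """Relative entropy D(Bern(q) || Bern(p))."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

# Second-order behavior: D(Bern(p+h) || Bern(p)) ~ h^2 / (2 p (1-p)) as h -> 0,
# so an infinitesimal perturbation costs relative entropy quadratic in its size.
p = 0.3
for h in (1e-1, 1e-2, 1e-3):
    print(kl_bern(p + h, p) / (h * h / (2 * p * (1 - p))))  # ratio tends to 1
```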

I am going to delay the proof of Theorem 1 to the next post because doing it in an elementary way will require some discussion of Itô calculus. For now, let us prove the following.

Lemma 2. For any process given by a drift , it holds that

*Proof:* The proof will be somewhat informal. It can be done easily using Girsanov’s theorem, but we try to keep the presentation here elementary and in correspondence with the discrete version above.

Let us first use the chain rule for relative entropy to calculate

Note that has the law of a standard -dimensional Gaussian with covariance .

If is an -dimensional Gaussian with covariance and , then

Therefore:

where the latter expectation is understood to be conditioned on the past up to time .

In particular, plugging this into (6), we have

**1.3. Brascamp-Lieb**

The proof is taken directly from Lehec. We will use the entropic formulation of Brascamp-Lieb due to Carlen and Cordero-Erausquin.

Let be a Euclidean space with subspaces . Let denote the orthogonal projection onto . Now suppose that for positive numbers , we have

By (8), we have for all :

The latter equality uses the fact that each is an orthogonal projection.

Let denote a standard Gaussian on , and let denote a standard Gaussian on for each .

Theorem 3 (Carlen & Cordero-Erausquin version of Brascamp-Lieb). For any random vector , it holds that

*Proof:* Let with denote the entropy-optimal drift process such that has the law of . Then by Theorem 1,

where the latter inequality uses Lemma 2 and the fact that has law .


I wanted to post here a draft of the lecture notes. These extend and complete the series of posts here on non-negative and psd rank and lifts of polytopes. They also incorporate many corrections, and have exercises of varying levels of difficulty. The bibliographic references are sparse at the moment because I am posting them from somewhere in the Adriatic (where wifi is also sparse).


Theorem 1. For every and , there exists a constant such that the following holds. For every ,

In this post, we will see how John’s theorem can be used to transform a psd factorization into one of a nicer analytic form. Using this, we will be able to construct a convex body that contains an approximation to every non-negative matrix of small psd rank.

**1.1. Finite-dimensional operator norms**

Let denote a finite-dimensional Euclidean space over equipped with inner product and norm . For a linear operator , we define the operator, trace, and Frobenius norms by

Let denote the set of self-adjoint linear operators on . Note that for , the preceding three norms are precisely the , , and norms of the eigenvalues of . For , we use to denote that is positive semi-definite and for . We use for the set of density operators: Those with and .

One should recall that is an inner product on the space of linear operators, and we have the operator analogs of the Hölder inequalities: and .
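These trace-inner-product Hölder inequalities (e.g. bounding the inner product by the operator norm of one factor times the trace norm of the other) are easy to spot-check on small matrices. A self-contained sketch for the symmetric 2×2 case, with the matrices chosen arbitrarily:

```python
import math

def eig_sym2(A):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = A
    m, r = (a + d) / 2, math.hypot((a - d) / 2, b)
    return m - r, m + r

def op_norm(A):    return max(abs(l) for l in eig_sym2(A))   # l_inf of the spectrum
def trace_norm(A): return sum(abs(l) for l in eig_sym2(A))   # l_1 of the spectrum
def inner(A, B):   # <A, B> = Tr(A B) for symmetric A, B
    return sum(A[i][j] * B[j][i] for i in range(2) for j in range(2))

# Spot-check the operator Holder inequality |<A, B>| <= ||A||_op ||B||_tr.
A = [[2.0, -1.0], [-1.0, 0.5]]
B = [[1.0, 3.0], [3.0, -2.0]]
assert abs(inner(A, B)) <= op_norm(A) * trace_norm(B) + 1e-12
assert abs(inner(A, B)) <= op_norm(B) * trace_norm(A) + 1e-12
print(abs(inner(A, B)), op_norm(A) * trace_norm(B))
```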

**1.2. Rescaling the psd factorization**

As in the case of non-negative rank, consider finite sets and and a matrix . For the purposes of proving a lower bound on the psd rank of some matrix, we would like to have a nice analytic description.

To that end, suppose we have a rank- psd factorization

where and . The following result of Briët, Dadush and Pokutta (2013) gives us a way to “scale” the factorization so that it becomes nicer analytically. (The improved bound stated here is from an article of Fawzi, Gouveia, Parrilo, Robinson, and Thomas, and we follow their proof.)

Lemma 2. Every with admits a factorization where and and, moreover,

where .

*Proof:* Start with a rank- psd factorization . Observe that there is a degree of freedom here, because for any invertible operator , we get another psd factorization .

Let and . Set . We may assume that and both span (else we can obtain a lower-rank psd factorization). Both sets are bounded by finiteness of and .

Let and note that is centrally symmetric and contains the origin. Now John’s theorem tells us there exists a linear operator such that

where denotes the unit ball in the Euclidean norm. Let us now set and .

**Eigenvalues of :** Let be an eigenvector of normalized so the corresponding eigenvalue is . Then , implying that (here we use that for any ). Since , (2) implies that . We conclude that every eigenvalue of is at most .

**Eigenvalues of :** Let be an eigenvector of normalized so that the corresponding eigenvalue is . Then as before, we have and this implies . Now, on the one hand we have

Finally, observe that for any and , we have

By convexity, this implies that for all , bounding the right-hand side of (4) by . Combining this with (3) yields . We conclude that all the eigenvalues of are at most .

**1.3. Convex proxy for psd rank**

Again, in analogy with the non-negative rank setting, we can define an “analytic psd rank” parameter for matrices :

Note that we have implicitly equipped and with the uniform measure. The main point here is that can be arbitrary. One can verify that is convex.

And there is a corresponding approximation lemma. We use and .

Lemma 3. For every non-negative matrix and every , there is a matrix such that and

Using Lemma 2 in a straightforward way, it is not particularly difficult to construct the approximator . The condition poses a slight difficulty that requires adding a small multiple of the identity to the LHS of the factorization (to avoid a poor condition number), but this has a correspondingly small effect on the approximation quality. Putting “Alice” into “isotropic position” is not essential, but it makes the next part of the approach (quantum entropy optimization) somewhat simpler because one is always measuring relative entropy to the maximally mixed state.


The approximation statement is made in the context of general probability measures on a finite set (though it should extend at least to the compact case with no issues). The algebraic structure only comes into play when the spectral covering statements are deduced (easily) from the general approximation theorem. The proofs are also done in the general setting of finite abelian groups.

Comments are encouraged, especially about references I may have missed.


For many cut problems, semi-definite programs (SDPs) are able to achieve better approximation ratios than LPs. The most famous example is the Goemans-Williamson -approximation for MAX-CUT. The techniques of the previous posts (see the full paper for details) are able to show that no polynomial-size LP can achieve better than factor .

**1.1. Spectrahedral lifts**

The feasible regions of LPs are polyhedra. Up to linear isomorphism, every polyhedron can be represented as where is the positive orthant and is an affine subspace.

In this context, it makes sense to study any cones that can be optimized over efficiently. A prominent example is the positive semi-definite cone. Let us define as the set of real, symmetric matrices with non-negative eigenvalues. A *spectrahedron* is the intersection with an affine subspace . The value is referred to as the *dimension* of the spectrahedron.

In analogy with the parameter we defined for polyhedral lifts, let us define for a polytope to be the minimal dimension of a spectrahedron that linearly projects to . It is an exercise to show that for every polytope . In other words, spectrahedral lifts are at least as powerful as polyhedral lifts in this model.

In fact, they are strictly more powerful. Certainly there are many examples of this in the setting of approximation (like the Goemans-Williamson SDP mentioned earlier), but there are also recent gaps between and for polytopes; see the work of Fawzi, Saunderson, and Parrilo.

Nevertheless, we are recently capable of proving strong lower bounds on the dimension of such lifts. Let us consider the cut polytope as in previous posts.

Theorem 1 (L-Raghavendra-Steurer 2015). There is a constant such that for every , one has .

Our goal in this post and the next is to explain the proof of this theorem and how *quantum entropy maximization* plays a key role.

**1.2. PSD rank and factorizations**

Just as in the setting of polyhedra, there is a notion of “factorization through a cone” that characterizes the parameter . Let be a non-negative matrix. One defines the *psd rank* of as the quantity
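For reference, the standard definition of psd rank (which is presumably the one intended by the display) is

```latex
\operatorname{rank}_{\mathrm{psd}}(M)
\;=\; \min\Big\{ r \;:\; M(x,y) = \operatorname{Tr}(A_x B_y)
\ \text{ for some } A_x, B_y \in \mathcal{S}_+^{r},\ \ x \in X,\ y \in Y \Big\},
```

where $\mathcal{S}_+^{r}$ denotes the cone of $r \times r$ real symmetric positive semi-definite matrices.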

The following theorem was independently proved by Fiorini, Massar, Pokutta, Tiwary, and de Wolf and Gouveia, Parrilo, and Thomas. The proof is a direct analog of Yannakakis’ proof for non-negative rank.

Theorem 2. For every polytope , it holds that for any slack matrix of .

Recall the class of non-negative quadratic multi-linear functions that are positive on and the matrix given by

We saw previously that is a submatrix of some slack matrix of . Thus our goal is to prove a lower bound on .

**1.3. Sum-of-squares certificates**

Just as in the setting of non-negative matrix factorization, we can think of a low psd rank factorization of as a small set of “axioms” that can prove the non-negativity of every function in . But now our proof system is considerably more powerful.

For a subspace of functions , let us define the cone

This is the cone of squares of functions in . We will think of as a set of axioms of size that is able to assert non-negativity of every by writing

for some .

Fix a subspace and let . Fix also a basis for .

Define by setting . Note that is PSD for every because where .

We can write every as . Defining by , we see that

Now every can be written as for some and . Therefore if we define (which is a positive sum of PSD matrices), we arrive at the representation

In conclusion, if , then .

By a “purification” argument, one can also conclude .

**1.4. The canonical axioms**

And just as -juntas were the canonical axioms for our NMF proof system, there is a similar canonical family in the SDP setting: Let be the subspace of all degree- multi-linear polynomials on . We have

(One could debate whether the definition of sum-of-squares degree should have or . The most convincing arguments suggest that we should use membership in the cone of squares over so that the sos-degree will be at least the real-degree of the function.)

On the other hand, our choice has the following nice property.

Lemma 3. For every , we have .

*Proof:* If is a non-negative -junta, then is also a non-negative -junta. It is elementary to see that every -junta is a polynomial of degree at most , thus is the square of a polynomial of degree at most .

**1.5. The canonical tests**

As with junta-degree, there is a simple characterization of sos-degree in terms of separating functionals. Say that a functional is *degree- pseudo-positive* if

whenever satisfies (and by here, we mean degree as a multi-linear polynomial on ).

Again, since is a closed convex set, there is precisely one way to show non-membership there. The following characterization is elementary.

Lemma 4. For every , it holds that if and only if there is a degree- pseudo-positive functional such that .

**1.6. The connection to psd rank**

Following the analogy with non-negative rank, we have two objectives left: (1) to exhibit a function with large, and (2) to give a connection between the sum-of-squares degree of and the psd rank of an associated matrix.

Notice that the function we used for junta-degree has , making it a poor candidate. Fortunately, Grigoriev has shown that the *knapsack polynomial* has large sos-degree.

Theorem 5. For every odd , the function

has .

Observe that this is non-negative over (because is odd), but it is manifestly *not* non-negative on .
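For the record, Grigoriev's knapsack polynomial is usually written (for odd $m$, as a polynomial on $\{0,1\}^m$) as

```latex
f(x) \;=\; \Big( \sum_{i=1}^{m} x_i - \frac{m}{2} \Big)^{2} - \frac{1}{4}.
```

On the hypercube $\sum_i x_i$ is an integer, so for odd $m$ it differs from $m/2$ by at least $1/2$ and $f \ge 0$; on $\mathbb{R}^m$, any point with $\sum_i x_i = m/2$ gives $f(x) = -1/4 < 0$.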

Finally, we recall the submatrices of defined as follows. Fix some integer and a function . Then is given by

In the next post, we discuss the proof of the following theorem.

Theorem 6 (L-Raghavendra-Steurer 2015). For every and , there exists a constant such that the following holds. For every ,

where .

Note that the upper bound is from (1) and the non-trivial content is contained in the lower bound. As before, in conjunction with Theorem 5, this shows that cannot be bounded by any polynomial in and thus the same holds for .


Theorem 1. For every and , there is a constant such that for all ,

**1.1. Convex relaxations of non-negative rank**

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets and and a matrix with .

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both and with the uniform measure. Let denote the set of probability densities on . Now define

Here are the columns of and are the rows of . Note that now is unconstrained.

Observe that is a convex function. To see this, given a pair and , write

witnessing the fact that .
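The convexity computation can be checked numerically: given two matrices in factored form, the average is factored by concatenating the component functions and halving the coefficients. A minimal numpy sketch (the dimensions are illustrative, not those of the post):

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, r = 6, 5, 3

# Two matrices given in factored form M = sum_i p_i q_i^T
P1, Q1 = rng.random((X, r)), rng.random((r, Y))
P2, Q2 = rng.random((X, r)), rng.random((r, Y))
M1, M2 = P1 @ Q1, P2 @ Q2

# The average (M1 + M2)/2 is factored by concatenating the components
# and halving the coefficients -- exactly the computation witnessing
# convexity in the text.
P = np.hstack([P1, P2]) / 2
Q = np.vstack([Q1, Q2])
assert np.allclose(P @ Q, (M1 + M2) / 2)
```

The same bookkeeping works for any convex combination, not just the midpoint.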

** 1.2. A truncation argument **

So the set is convex, but it’s not yet clear how this relates to . We will see now that low non-negative rank matrices are close to matrices with small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms and .

Lemma 2. For every non-negative with and every , there is a matrix such that and

*Proof:* Suppose that with , and let us interpret this factorization in the form

(where are the columns of and are the rows of ). By rescaling the columns of and the rows of , respectively, we may assume that for every .

Let denote the “bad set” of indices (we will choose momentarily). Observe that if , then

from the representation (1) and the fact that all summands are positive.

Define the matrix . It follows that

Each of the latter terms is at most and , thus

Next, observe that

implying that and thus .

Setting yields the statement of the lemma.
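As a rough illustration of the truncation argument, here is a numpy sketch under assumed conventions: the rows of the right factor are rescaled to densities (the rescaling step of the proof), and an arbitrary threshold `tau` stands in for the one chosen in the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
X, Y, r = 8, 8, 6
A = rng.random((X, r))
B = rng.random((r, Y))
M = A @ B  # a non-negative factorization M = sum_i a_i b_i^T

# Rescale so each row of B averages to 1 (a density under the uniform
# measure on Y), moving the scale into the columns of A.
scale = B.mean(axis=1)
B = B / scale[:, None]
A = A * scale[None, :]
assert np.allclose(A @ B, M)

tau = 2.0  # truncation threshold (arbitrary choice for this sketch)
bad = np.max(B, axis=1) > tau  # "bad" indices: components with large sup-norm
keep = ~bad

# Discarding the bad components can only decrease entries, since every
# summand is non-negative; this is the entrywise bound M' <= M.
M_trunc = A[:, keep] @ B[keep, :]
assert np.all(M_trunc <= M + 1e-12)
```

The remaining content of the lemma, bounding how much mass the discarded components carry, depends on the stripped norm bounds and is not reproduced here.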

Generally, the ratio will be small compared to (e.g., polynomial in vs. super-polynomial in ). Thus from now on, we will actually prove a lower bound on . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

** 1.3. The test functionals **

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix , we will now assume that . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix . We will consider the uniform measures on and . We use and to denote averaging with respect to these measures.

Let . From the last post, we know there exists a -locally positive functional such that , and for every -junta .

For with , let us denote . These functionals prove lower bounds on the junta-degree of restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family as our test functions.

** 1.4. Entropy maximization **

Use to write

where and we have and for all , and finally .

First, as we observed last time, if each were a -junta, we would have a contradiction:

because is -locally positive and the function is a -junta.

So now the key step: Use entropy maximization to approximate by a junta! In future posts, we will need to consider the entire package of functions simultaneously. But for the present lower bound, it suffices to consider each separately.

Consider the following optimization over variables :

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1. There exists a function satisfying all the preceding constraints and of the form such that

where is some constant depending only on .

Note that depends only on , and thus only depends on as well. Now each only depends on variables (those in and ), meaning that our approximator is an -junta for

Oops. That doesn’t seem very good. The calculation in (3) needs that is a -junta, and certainly (since is a function on ). Nevertheless, note that the approximator is a *non-trivial junta*. For instance, if , then it is an -junta, recalling that is a constant (depending on ).
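Before moving on, the max-entropy step itself can be sketched concretely. The entropy-optimal density matching a family of test values has exponential-family form $\exp(\sum_j \lambda_j f_j)/Z$, and the multipliers can be found by gradient ascent on the (concave) dual, as in the claim. Here is a toy version with brute-force enumeration; the test functionals are random stand-ins for the actual ones, so everything below is illustrative.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 6
pts = np.array(list(product([0, 1], repeat=n)), dtype=float)
N = len(pts)

# A target density q (w.r.t. the uniform measure) and a few stand-in
# test functionals, playing the role of the constraints in (4).
q = rng.random(N)
q /= q.mean()
F = rng.standard_normal((3, N))   # rows are test functions
targets = F @ q / N               # the values q achieves on the tests

# The max-entropy density matching the tests is p = exp(lam . F)/Z;
# we fit lam by gradient ascent on the dual (moment matching).
lam = np.zeros(3)
for _ in range(2000):
    w = np.exp(lam @ F)
    p = w / w.mean()              # current density
    grad = targets - F @ p / N    # moment mismatch
    lam += 0.1 * grad

assert np.allclose(F @ p / N, targets, atol=1e-3)
```

The key structural point mirrors the claim: however complicated $q$ is, the optimizer $p$ is determined by as many parameters as there are tests.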

** 1.5. Random restriction saves the day **

Let’s try to apply the logic of (3) to the approximators anyway. Fix some and let be the set of coordinates on which depends. Then:

Note that the map is a junta on . Thus if , then the contribution from this term is non-negative since is -locally positive. But is fixed and is growing, thus is quite rare! Formally,

In the last estimate, we have used a simple union bound and .
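Assuming the stripped estimate is the standard one, that a uniformly random $s$-subset $S$ of $[n]$ meets a fixed set $J$ with probability at most $s|J|/n$ by a union bound, it is easy to sanity-check by simulation:

```python
import random

random.seed(3)
n, s, j = 200, 10, 5   # ground set size, |S|, |J|
J = set(range(j))
trials = 20000
hits = sum(bool(J & set(random.sample(range(n), s))) for _ in range(trials))

# Union bound: Pr[S meets J] <= s * |J| / n.
bound = s * j / n
assert hits / trials <= bound
```

When $s|J| \ll n$, as in the random-restriction step, the intersection event is indeed rare.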

Putting everything together and summing over , we conclude that

Note that by choosing only moderately large, we will make this error term very small.

Moreover, since each is a feasible point of the optimization (4), we have

Almost there: Now observe that

Plugging this into the preceding line yields

Recalling now (2), we have derived a contradiction to if we can choose the right-hand side to be bigger than (which is a negative constant depending only on ). Setting , we consult (5) to see that

for some other constant depending only on . We thus arrive at a contradiction if , recalling that depend only on . This completes the argument.


Theorem 1 (Fiorini, Massar, Pokutta, Tiwary, de Wolf 2012). There is a constant such that for every , .

We will present a somewhat weaker lower bound using entropy maximization, following our joint works with Chan, Raghavendra, and Steurer (2013) and with Raghavendra and Steurer (2015). This method is currently only capable of proving that , but it has the advantage of being generalizable—it extends well to the setting of approximate lifts and spectrahedral lifts (we'll come to the latter in a few posts).

** 1.1. The entropy maximization framework **

To use entropy optimality, we proceed as follows. For the sake of contradiction, we assume that is small.

First, we will identify the space of potential lifts of small -value with the elements of a convex set of probability measures. (This is where the connection to *non-negative* matrix factorization (NMF) will come into play.) Then we will choose a family of “tests” intended to capture the difficult aspects of being a valid lift of . This step is important as having more freedom (corresponding to weaker tests) will allow the entropy maximization to do more “simplification.” The idea is that the family of tests should still be sufficiently powerful to prove a lower bound on the entropy-optimal hypothesis.

Finally, we will maximize the entropy of our lift over all elements of our convex set, subject to performing well on the tests. Our hope is that the resulting lift is simple enough that we can prove it couldn’t possibly pass all the tests, leading to a contradiction.

In order to find the right set of tests, we will identify a family of *canonical (approximate) lifts.* This is a family of polytopes so that and which we expect to give the “best approximation” to among all lifts with similar -value. We can identify this family precisely because these will be lifts that obey the natural symmetries of the cut polytope (observe that the symmetric group acts on in the natural way).

** 1.2. NMF and positivity certificates **

Recall the matrix given by , where is the set of all quadratic multi-linear functions that are non-negative on . In the previous post, we argued that

for some functions and . (Here we are using a factorization where and .)

Thus the low-rank factorization gives us a “proof system” for . Every can be written as a conic combination of the functions , thereby certifying its positivity (since the ‘s are positive functions).
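This “proof system” reading of NMF is easy to demonstrate: in any entrywise non-negative factorization, each row of the matrix is a conic combination of the rows of the right factor, which certifies its non-negativity. A small numpy sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
F, X, r = 7, 9, 4          # number of functions, number of points, rank
A = rng.random((F, r))     # conic coefficients, one row per function f
B = rng.random((r, X))     # the shared non-negative "axioms" b_1, ..., b_r
M = A @ B                  # M[f, x] = f(x)

# Each function (row of M) is a conic combination of the non-negative
# functions b_i, so it is itself non-negative everywhere.
f = 0
reconstructed = sum(A[f, i] * B[i] for i in range(r))
assert np.allclose(reconstructed, M[f])
assert np.all(M >= 0)
```

Small non-negative rank thus means a small shared family of axioms certifies every function at once.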

If we think about natural families of “axioms,” then since is invariant under the natural action of , we might expect that our family should share this invariance. Once we entertain this expectation, there are natural small families of axioms to consider: the space of non-negative -juntas for some .

A -junta is a function whose value only depends on of its input coordinates. For a subset with and an element , let denote the function given by if and only if .

We let . Observe that . Let us also define as the set of all conic combinations of functions in . It is easy to see that contains precisely the conic combinations of non-negative -juntas.
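The last observation can be checked directly: a non-negative $k$-junta $f$ reading only the coordinates in $S$ equals the conic combination $\sum_z f(z)\, q_{S,z}$ over the indicator juntas. A brute-force sketch (the choices of $n$, $k$, and $S$ are illustrative):

```python
import numpy as np
from itertools import product

n, k = 5, 2
S = (0, 3)                      # the k coordinates the junta may read
rng = np.random.default_rng(5)
vals = rng.random(2 ** k)       # non-negative values of f on {0,1}^S

pts = list(product([0, 1], repeat=n))

def junta(x):
    z = tuple(x[i] for i in S)
    return vals[z[0] * 2 + z[1]]

def indicator(x, z):
    # q_{S,z}(x) = 1 iff x restricted to S equals z
    return 1.0 if tuple(x[i] for i in S) == z else 0.0

# f = sum_z f(z) * q_{S,z}: a conic combination of indicator juntas.
for x in pts:
    combo = sum(vals[z[0] * 2 + z[1]] * indicator(x, z)
                for z in product([0, 1], repeat=k))
    assert abs(combo - junta(x)) < 1e-12
```

So the indicator juntas generate the cone, as claimed.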

If it were true that for some , we could immediately conclude that by writing in the form (1) where now ranges over the elements of and gives the corresponding non-negative coefficients that follow from .

** 1.3. No -junta factorization for **

We will now see that juntas cannot provide a small set of axioms for .

Toward the proof, let’s introduce a few definitions. First, for , define the *junta degree of * to be

Since every is an -junta, we have .

Now because is a cone, there is a universal way of proving that . Say that a functional is *-locally positive* if for all and , we have

These are precisely the linear functionals separating a function from : We have if and only if there is a -locally positive functional such that . Now we are ready to prove Theorem 2.

*Proof:* We will prove this using an appropriate -locally positive functional. Define

where denotes the Hamming weight of .

First, observe that

Now recall the function from the statement of the theorem and observe that by opening up the square, we have

Finally, consider some with and . If , then

If , then the sum is 0. If , then the sum is non-negative because in that case is only supported on non-negative values of . We conclude that is -locally positive for . Combined with (2), this yields the statement of the theorem.

** 1.4. From juntas to general factorizations **

So far we have seen that we cannot achieve a low non-negative rank factorization of using -juntas for .

Brief aside: If one translates this back into the setting of lifts, it says that the -round Sherali-Adams lift of the polytope does not capture for .

In the next post, we will use entropy maximization to show that a non-negative factorization of would lead to a -junta factorization with small (which we just saw is impossible).

For now, let us state the theorem we will prove. We first define a submatrix of . Fix some integer and a function . Now define the matrix given by

The matrix is indexed by subsets with and elements . Here, represents the (ordered) restriction of to the coordinates in .
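A sketch of the submatrix construction, under the assumed indexing $M(S, x) = f(x|_S)$ with $|S| = m$ (the stand-in $f$ below is just for shape; the actual function comes from Theorem 2):

```python
from itertools import combinations, product

n, m = 4, 2

def f(z):
    # stand-in non-negative function on {0,1}^m
    return (m - 2 * sum(z)) ** 2

subsets = list(combinations(range(n), m))
points = list(product([0, 1], repeat=n))

# M[S][x] = f(x restricted to S), following the (assumed) indexing:
# rows are m-subsets S of [n], columns are points x in {0,1}^n.
M = [[f(tuple(x[i] for i in S)) for x in points] for S in subsets]
assert len(M) == len(subsets) and len(M[0]) == 2 ** n
```

When $f$ is non-negative on the cube, every entry of the resulting matrix is non-negative, which is what makes its non-negative rank meaningful.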

Theorem 3 (Chan-L-Raghavendra-Steurer 2013). For every and , there is a constant such that for all ,

Note that if then is a submatrix of . Since Theorem 2 furnishes a sequence of quadratic multi-linear functions with , the preceding theorem tells us that cannot be bounded by any polynomial in . A more technical version of the theorem is able to achieve a lower bound of (see Section 7 here).
