I wanted to post here a draft of the lecture notes. These extend and complete the series of posts here on non-negative and psd rank and lifts of polytopes. They also incorporate many corrections, and have exercises of varying levels of difficulty. The bibliographic references are sparse at the moment because I am posting them from somewhere in the Adriatic (where wifi is also sparse).

]]>

Theorem 1. For every and , there exists a constant such that the following holds. For every ,

In this post, we will see how John’s theorem can be used to transform a psd factorization into one of a nicer analytic form. Using this, we will be able to construct a convex body that contains an approximation to every non-negative matrix of small psd rank.

** 1.1. Finite-dimensional operator norms **

Let denote a finite-dimensional Euclidean space over equipped with inner product and norm . For a linear operator , we define the operator, trace, and Frobenius norms by

Let denote the set of self-adjoint linear operators on . Note that for , the preceding three norms are precisely the , , and norms of the eigenvalues of . For , we use to denote that is positive semi-definite and for . We use for the set of density operators: Those with and .

One should recall that is an inner product on the space of linear operators, and we have the operator analogs of the Hölder inequalities: and .

** 1.2. Rescaling the psd factorization **

As in the case of non-negative rank, consider finite sets and and a matrix . For the purposes of proving a lower bound on the psd rank of some matrix, we would like to have a nice analytic description.

To that end, suppose we have a rank- psd factorization

where and . The following result of Briët, Dadush and Pokutta (2013) gives us a way to “scale” the factorization so that it becomes nicer analytically. (The improved bound stated here is from an article of Fawzi, Gouveia, Parrilo, Robinson, and Thomas, and we follow their proof.)

Lemma 2. Every with admits a factorization where and and, moreover,

where .

*Proof:* Start with a rank- psd factorization . Observe that there is a degree of freedom here, because for any invertible operator , we get another psd factorization .
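Here is a quick numerical sanity check of this degree of freedom. It is only a minimal sketch: it assumes the factorization has the trace form Tr(P Q) and that the transformed pair is (J P J^T, J^{-T} Q J^{-1}). The names `P`, `Q`, `J` and the 2×2 values are illustrative choices, not taken from the argument above.

```python
from fractions import Fraction

def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def inverse_2x2(J):
    """Exact inverse of an invertible 2x2 matrix."""
    (a, b), (c, d) = J
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical PSD pair and an invertible (non-orthogonal) J.
P = [[2, 1], [1, 1]]
Q = [[1, 0], [0, 3]]
J = [[1, 2], [0, 1]]
J_inv = inverse_2x2(J)

P_new = matmul(matmul(J, P), transpose(J))          # J P J^T, still PSD
Q_new = matmul(matmul(transpose(J_inv), Q), J_inv)  # J^{-T} Q J^{-1}, still PSD

before = trace(matmul(P, Q))
after = trace(matmul(P_new, Q_new))
```

The two traces agree because Tr(J P J^T J^{-T} Q J^{-1}) = Tr(J P Q J^{-1}) = Tr(P Q), so the matrix being factored is unchanged while the individual factors are reshaped.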

Let and . Set . We may assume that and both span (else we can obtain a lower-rank psd factorization). Both sets are bounded by finiteness of and .

Let and note that is centrally symmetric and contains the origin. Now John’s theorem tells us there exists a linear operator such that

where denotes the unit ball in the Euclidean norm. Let us now set and .

**Eigenvalues of :** Let be an eigenvector of normalized so the corresponding eigenvalue is . Then , implying that (here we use that for any ). Since , (2) implies that . We conclude that every eigenvalue of is at most .

**Eigenvalues of :** Let be an eigenvector of normalized so that the corresponding eigenvalue is . Then as before, we have and this implies . Now, on the one hand we have

Finally, observe that for any and , we have

By convexity, this implies that for all , bounding the right-hand side of (4) by . Combining this with (3) yields . We conclude that all the eigenvalues of are at most .

** 1.3. Convex proxy for psd rank **

Again, in analogy with the non-negative rank setting, we can define an “analytic psd rank” parameter for matrices :

Note that we have implicitly equipped and with the uniform measure. The main point here is that can be arbitrary. One can verify that is convex.

And there is a corresponding approximation lemma. We use and .

Lemma 3. For every non-negative matrix and every , there is a matrix such that and

Using Lemma 2 in a straightforward way, it is not particularly difficult to construct the approximator . The condition poses a slight difficulty that requires adding a small multiple of the identity to the LHS of the factorization (to avoid a poor condition number), but this has a correspondingly small effect on the approximation quality. Putting “Alice” into “isotropic position” is not essential, but it makes the next part of the approach (quantum entropy optimization) somewhat simpler because one is always measuring relative entropy to the maximally mixed state.

]]>

The approximation statement is made in the context of general probability measures on a finite set (though it should extend at least to the compact case with no issues). The algebraic structure only comes into play when the spectral covering statements are deduced (easily) from the general approximation theorem. The proofs are also done in the general setting of finite abelian groups.

Comments are encouraged, especially about references I may have missed.

]]>

For many cut problems, semi-definite programs (SDPs) are able to achieve better approximation ratios than LPs. The most famous example is the Goemans-Williamson $0.878$-approximation for MAX-CUT. The techniques of the previous posts (see the full paper for details) are able to show that no polynomial-size LP can achieve better than factor $1/2$.

** 1.1. Spectrahedral lifts **

The feasible regions of LPs are polyhedra. Up to linear isomorphism, every polyhedron can be represented as where is the positive orthant and is an affine subspace.

In this context, it makes sense to study any cones that can be optimized over efficiently. A prominent example is the positive semi-definite cone. Let us define as the set of real, symmetric matrices with non-negative eigenvalues. A *spectrahedron* is the intersection with an affine subspace . The value is referred to as the *dimension* of the spectrahedron.

In analogy with the parameter we defined for polyhedral lifts, let us define for a polytope to be the minimal dimension of a spectrahedron that linearly projects to . It is an exercise to show that for every polytope . In other words, spectrahedral lifts are at least as powerful as polyhedral lifts in this model.

In fact, they are strictly more powerful. Certainly there are many examples of this in the setting of approximation (like the Goemans-Williamson SDP mentioned earlier), but there are also recent gaps between and for polytopes; see the work of Fawzi, Saunderson, and Parrilo.

Nevertheless, it has recently become possible to prove strong lower bounds on the dimension of such lifts. Let us consider the cut polytope as in previous posts.

Theorem 1 (L-Raghavendra-Steurer 2015). There is a constant such that for every , one has .

Our goal in this post and the next is to explain the proof of this theorem and how *quantum entropy maximization* plays a key role.

** 1.2. PSD rank and factorizations **

Just as in the setting of polyhedra, there is a notion of “factorization through a cone” that characterizes the parameter . Let be a non-negative matrix. One defines the *psd rank* of as the quantity

The following theorem was proved independently by Fiorini, Massar, Pokutta, Tiwary, and de Wolf, and by Gouveia, Parrilo, and Thomas. The proof is a direct analog of Yannakakis’ proof for non-negative rank.

Theorem 2. For every polytope , it holds that for any slack matrix of .

Recall the class of non-negative quadratic multi-linear functions that are positive on and the matrix given by

We saw previously that is a submatrix of some slack matrix of . Thus our goal is to prove a lower bound on .

** 1.3. Sum-of-squares certificates **

Just as in the setting of non-negative matrix factorization, we can think of a low psd rank factorization of as a small set of “axioms” that can prove the non-negativity of every function in . But now our proof system is considerably more powerful.

For a subspace of functions , let us define the cone

This is the cone of squares of functions in . We will think of as a set of axioms of size that is able to assert non-negativity of every by writing

for some .

Fix a subspace and let . Fix also a basis for .

Define by setting . Note that is PSD for every because where .

We can write every as . Defining by , we see that

Now every can be written as for some and . Therefore if we define (which is a positive sum of PSD matrices), we arrive at the representation

In conclusion, if , then .

By a “purification” argument, one can also conclude .

** 1.4. The canonical axioms **

And just as -juntas were the canonical axioms for our NMF proof system, there is a similar canonical family in the SDP setting: Let be the subspace of all degree- multi-linear polynomials on . We have

(One could debate whether the definition of sum-of-squares degree should have or . The most convincing arguments suggest that we should use membership in the cone of squares over so that the sos-degree will be at least the real-degree of the function.)

On the other hand, our choice has the following nice property.

Lemma 3. For every , we have .

*Proof:* If is a non-negative -junta, then is also a non-negative -junta. It is elementary to see that every -junta is a polynomial of degree at most , thus is the square of a polynomial of degree at most .

** 1.5. The canonical tests **

As with junta-degree, there is a simple characterization of sos-degree in terms of separating functionals. Say that a functional is *degree- pseudo-positive* if

whenever satisfies (and by here, we mean degree as a multi-linear polynomial on ).

Again, since is a closed convex set, there is precisely one way to show non-membership there. The following characterization is elementary.

Lemma 4. For every , it holds that if and only if there is a degree- pseudo-positive functional such that .

** 1.6. The connection to psd rank **

Following the analogy with non-negative rank, we have two objectives left: (1) to exhibit a function with large, and (2) to give a connection between the sum-of-squares degree of and the psd rank of an associated matrix.

Notice that the function we used for junta-degree has , making it a poor candidate. Fortunately, Grigoriev has shown that the *knapsack polynomial* has large sos-degree.

Theorem 5. For every odd , the function

has .

Observe that this is non-negative over (because is odd), but it is manifestly *not* non-negative on .
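The two observations are easy to confirm by brute force. The sketch below assumes the knapsack polynomial takes its usual form, $f(x) = (\sum_i x_i - m/2)^2 - 1/4$ on $m$ variables with $m$ odd; the function name `knapsack` and the choice $m = 5$ are illustrative.

```python
from fractions import Fraction
from itertools import product

def knapsack(x):
    """Assumed form of the knapsack polynomial: (sum(x) - m/2)^2 - 1/4."""
    m = len(x)
    return (sum(x) - Fraction(m, 2)) ** 2 - Fraction(1, 4)

m = 5  # must be odd

# On the discrete cube the sum is an integer, so |sum - m/2| >= 1/2.
min_on_cube = min(knapsack(x) for x in product((0, 1), repeat=m))

# At the center of the cube the sum equals m/2, so the value is -1/4.
value_at_center = knapsack([Fraction(1, 2)] * m)
```

Exact rational arithmetic is used so that the boundary case (value exactly zero on the cube) is not blurred by floating-point error.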

Finally, we recall the submatrices of defined as follows. Fix some integer and a function . Then is given by

In the next post, we discuss the proof of the following theorem.

Theorem 6 (L-Raghavendra-Steurer 2015). For every and , there exists a constant such that the following holds. For every ,

where .

Note that the upper bound is from (1) and the non-trivial content is contained in the lower bound. As before, in conjunction with Theorem 5, this shows that cannot be bounded by any polynomial in and thus the same holds for .

]]>

Theorem 1. For every and , there is a constant such that for all ,

** 1.1. Convex relaxations of non-negative rank **

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets and and a matrix with .

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both and with the uniform measure. Let denote the set of probability densities on . Now define

Here are the columns of and are the rows of . Note that now is unconstrained.

Observe that is a convex function. To see this, given a pair and , write

witnessing the fact that .

** 1.2. A truncation argument **

So the set is convex, but it’s not yet clear how this relates to . We will see now that low non-negative rank matrices are close to matrices with small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms and .

Lemma 2. For every non-negative with and every , there is a matrix such that and

*Proof:* Suppose that with , and let us interpret this factorization in the form

(where are the columns of and are the rows of ). By rescaling the columns of and the rows of , respectively, we may assume that for every .

Let denote the “bad set” of indices (we will choose momentarily). Observe that if , then

from the representation (1) and the fact that all summands are positive.

Define the matrix . It follows that

Each of the latter terms is at most and , thus

Next, observe that

implying that and thus .

Setting yields the statement of the lemma.

Generally, the ratio will be small compared to (e.g., polynomial in vs. super-polynomial in ). Thus from now on, we will actually prove a lower bound on . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

** 1.3. The test functionals **

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix , we will now assume that . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix . We will consider the uniform measures on and . We use and to denote averaging with respect to these measures.

Let . From the last post, we know there exists a -locally positive functional such that , and for every -junta .

For with , let us denote . These functionals prove lower bounds on the junta-degree of restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family as our test functions.

** 1.4. Entropy maximization **

Use to write

where and we have and for all , and finally .

First, as we observed last time, if each were a -junta, we would have a contradiction:

because since is -locally positive and the function is a -junta.

So now the key step: Use entropy maximization to approximate by a junta! In future posts, we will need to consider the entire package of functions simultaneously. But for the present lower bound, it suffices to consider each separately.

Consider the following optimization over variables :

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1. There exists a function satisfying all the preceding constraints and of the form such that

where is some constant depending only on .

Note that depends only on , and thus only depends on as well. Now each only depends on variables (those in and ), meaning that our approximator is an -junta for

Oops. That doesn’t seem very good. The calculation in (3) needs that is a -junta, and certainly (since is a function on ). Nevertheless, note that the approximator is a *non-trivial junta*. For instance, if , then it is an -junta, recalling that is a constant (depending on ).

** 1.5. Random restriction saves the day **

Let’s try to apply the logic of (3) to the approximators anyway. Fix some and let be the set of coordinates on which depends. Then:

Note that the map is a junta on . Thus if , then the contribution from this term is non-negative since is -locally positive. But is fixed and is growing, thus is quite rare! Formally,

In the last estimate, we have used a simple union bound and .

Putting everything together and summing over , we conclude that

Note that by choosing only moderately large, we will make this error term very small.

Moreover, since each is a feasible point of the optimization (4), we have

Almost there: Now observe that

Plugging this into the preceding line yields

Recalling now (2), we have derived a contradiction to if we can choose the right-hand side to be bigger than (which is a negative constant depending only on ). Setting , we consult (5) to see that

for some other constant depending only on . We thus arrive at a contradiction if , recalling that depend only on . This completes the argument.

]]>

Theorem 1 (Fiorini, Massar, Pokutta, Tiwary, de Wolf 2012). There is a constant such that for every , .

We will present a somewhat weaker lower bound using entropy maximization, following our joint works with Chan, Raghavendra, and Steurer (2013) and with Raghavendra and Steurer (2015). This method is currently only capable of proving that , but it has the advantage of being generalizable: it extends well to the setting of approximate lifts and spectrahedral lifts (we’ll come to the latter in a few posts).

** 1.1. The entropy maximization framework **

To use entropy optimality, we proceed as follows. For the sake of contradiction, we assume that is small.

First, we will identify the space of potential lifts of small -value with the elements of a convex set of probability measures. (This is where the connection to *non-negative* matrix factorization (NMF) will come into play.) Then we will choose a family of “tests” intended to capture the difficult aspects of being a valid lift of . This step is important as having more freedom (corresponding to weaker tests) will allow the entropy maximization to do more “simplification.” The idea is that the family of tests should still be sufficiently powerful to prove a lower bound on the entropy-optimal hypothesis.

Finally, we will maximize the entropy of our lift over all elements of our convex set, subject to performing well on the tests. Our hope is that the resulting lift is simple enough that we can prove it couldn’t possibly pass all the tests, leading to a contradiction.

In order to find the right set of tests, we will identify a family of *canonical (approximate) lifts.* This is a family of polytopes so that and which we expect to give the “best approximation” to among all lifts with similar -value. We can identify this family precisely because these will be lifts that obey the natural symmetries of the cut polytope (observe that the symmetric group acts on in the natural way).

** 1.2. NMF and positivity certificates **

Recall the matrix given by , where is the set of all quadratic multi-linear functions that are non-negative on . In the previous post, we argued that .

for some functions and . (Here we are using a factorization where and .)

Thus the low-rank factorization gives us a “proof system” for . Every can be written as a conic combination of the functions , thereby certifying its positivity (since the ‘s are positive functions).

If we think about natural families of “axioms,” then since is invariant under the natural action of , we might expect that our family should share this invariance. Once we entertain this expectation, there are natural small families of axioms to consider: The space of non-negative -juntas for some .

A -junta is a function whose value only depends on of its input coordinates. For a subset with and an element , let denote the function given by if and only if .

We let . Observe that . Let us also define as the set of all conic combinations of functions in . It is easy to see that contains precisely the conic combinations of non-negative -juntas.
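To make the last claim concrete, here is a minimal sketch on the cube $\{0,1\}^4$ with $k = 2$. It assumes the basic functions are the indicators "the restriction of $x$ to $S$ equals $z$"; the helper names `indicator` and `embed`, and the example junta `f`, are hypothetical.

```python
from itertools import combinations, product

def indicator(S, z):
    """Indicator junta: 1 if x restricted to the coordinates S equals z, else 0."""
    return lambda x: 1 if tuple(x[i] for i in S) == z else 0

def embed(S, z, n):
    """A point of {0,1}^n that agrees with z on S (zeros elsewhere)."""
    x = [0] * n
    for i, bit in zip(S, z):
        x[i] = bit
    return tuple(x)

n, k = 4, 2
basis = [(S, z) for S in combinations(range(n), k)
         for z in product((0, 1), repeat=k)]

# A non-negative 2-junta: it only looks at coordinates 1 and 3.
S0 = (1, 3)
def f(x):
    return 3 * x[1] + 2 * (1 - x[3])

# Its coefficients in the indicator family are just its values, hence non-negative.
coeffs = {z: f(embed(S0, z, n)) for z in product((0, 1), repeat=k)}

def conic_combination(x):
    return sum(c * indicator(S0, z)(x) for z, c in coeffs.items())

agrees = all(f(x) == conic_combination(x) for x in product((0, 1), repeat=n))
```

The point of the design is that the coefficients are read off directly from the function values, so non-negativity of the junta is exactly non-negativity of the conic coefficients.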

If it were true that for some , we could immediately conclude that by writing in the form (1) where now ranges over the elements of and gives the corresponding non-negative coefficients that follow from .

** 1.3. No -junta factorization for **

We will now see that juntas cannot provide a small set of axioms for .

Toward the proof, let’s introduce a few definitions. First, for , define the *junta degree of * to be

Since every is an -junta, we have .

Now because is a cone, there is a universal way of proving that . Say that a functional is *-locally positive* if for all and , we have

These are precisely the linear functionals separating a function from : We have if and only if there is a -locally positive functional such that . Now we are ready to prove Theorem 2.

*Proof:* We will prove this using an appropriate -locally positive functional. Define

where denotes the hamming weight of .

First, observe that

Now recall the function from the statement of the theorem and observe that by opening up the square, we have

Finally, consider some with and . If , then

If , then the sum is 0. If , then the sum is non-negative because in that case is only supported on non-negative values of . We conclude that is -locally positive for . Combined with (2), this yields the statement of the theorem.

** 1.4. From juntas to general factorizations **

So far we have seen that we cannot achieve a low non-negative rank factorization of using -juntas for .

Brief aside: If one translates this back into the setting of lifts, it says that the -round Sherali-Adams lift of the polytope does not capture for .

In the next post, we will use entropy maximization to show that a non-negative factorization of would lead to a -junta factorization with small (which we just saw is impossible).

For now, let us state the theorem we will prove. We first define a submatrix of . Fix some integer and a function . Now define the matrix given by

The matrix is indexed by subsets with and elements . Here, represents the (ordered) restriction of to the coordinates in .

Theorem 3 (Chan-L-Raghavendra-Steurer 2013). For every and , there is a constant such that for all ,

Note that if then is a submatrix of . Since Theorem 2 furnishes a sequence of quadratic multi-linear functions with , the preceding theorem tells us that cannot be bounded by any polynomial in . A more technical version of the theorem is able to achieve a lower bound of (see Section 7 here).

]]>

** 1.1. Polytopes and inequalities **

A *-dimensional convex polytope* is the convex hull of a finite set of points in . Equivalently, it is a compact set defined by a family of linear inequalities

for some matrix .

Let us give a measure of complexity for : Define to be the smallest number such that for some , we have

In other words, this is the minimum number of *inequalities* needed to describe . If is full-dimensional, then this is precisely the number of *facets* of (a facet is a maximal proper face of ).

Thinking of as a measure of complexity makes sense from the point of view of optimization: Interior point methods can efficiently optimize linear functions over (to arbitrary accuracy) in time that is polynomial in .

** 1.2. Lifts of polytopes **

Many simple polytopes require a large number of inequalities to describe. For instance, the *cross-polytope*

has . On the other hand, is the *projection* of the polytope

onto the coordinates, and manifestly, . Thus is the (linear) shadow of a much simpler polytope in a higher dimension.
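The following sketch illustrates the projection claim on a small example. It assumes the cross-polytope is the unit $\ell_1$ ball, cut out by the $2^n$ sign inequalities, and it uses one standard lift with auxiliary variables $y$ satisfying $\sum_i y_i = 1$, $y_i \geq 0$ and $-y_i \leq x_i \leq y_i$. The function names are illustrative, and the lift written here is an assumption rather than a quotation of the polytope above.

```python
from fractions import Fraction
from itertools import product

def in_cross_polytope(x):
    """Check all 2^n inequalities s_1 x_1 + ... + s_n x_n <= 1 with s_i = +-1."""
    return all(sum(s * xi for s, xi in zip(signs, x)) <= 1
               for signs in product((-1, 1), repeat=len(x)))

def in_lift(x, y):
    """Check the lifted description, which has only linearly many constraints."""
    return (sum(y) == 1
            and all(yi >= 0 for yi in y)
            and all(-yi <= xi <= yi for xi, yi in zip(x, y)))

def lift(x):
    """Given x with ||x||_1 <= 1, produce y with (x, y) in the lift."""
    n = len(x)
    slack = Fraction(1 - sum(abs(xi) for xi in x))
    return [abs(xi) + slack / n for xi in x]

x_inside = [Fraction(1, 2), Fraction(-1, 4), Fraction(0)]
y_inside = lift(x_inside)
x_outside = [Fraction(1), Fraction(1, 2), Fraction(0)]
```

Conversely, any $(x, y)$ in the lift satisfies $\sum_i |x_i| \leq \sum_i y_i = 1$, so the projection onto $x$ never leaves the cross-polytope.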

[image credit: Fiorini, Rothvoss, and Tiwary]

A polytope is called a *lift* of the polytope if is the image of under a linear projection. Again, from an optimization standpoint, lifts are important: If we can optimize linear functionals over , then we can optimize linear functionals over . Define now to be the minimal value of over all lifts of . (The value is sometimes called the *(linear) extension complexity* of .)

** 1.3. The permutahedron **

Here is a somewhat more interesting family of examples where lifts reduce complexity. The *permutahedron* is the convex hull of the vectors where . It is known that .

Let denote the convex hull of the permutation matrices. The Birkhoff-von Neumann theorem tells us that is precisely the set of doubly stochastic matrices, thus (corresponding to the non-negativity constraints on each entry).

Observe that is the linear image of under the map , i.e. we multiply a matrix on the right by the column vector . Thus is a lift of , and we conclude that .
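As a small illustration, the sketch below takes $n = 3$ and checks that multiplying each permutation matrix by the column vector $(1, 2, 3)$ produces exactly the permutations of $(1, 2, 3)$, that is, the vertices of the permutahedron. The helper names are illustrative, and the indexing convention for permutation matrices is an arbitrary choice.

```python
from fractions import Fraction
from itertools import permutations

n = 3
target = list(range(1, n + 1))  # the column vector (1, 2, ..., n)

def perm_matrix(pi):
    """Permutation matrix with a 1 in position (i, pi[i])."""
    return [[1 if pi[i] == j else 0 for j in range(n)] for i in range(n)]

def apply(A, v):
    """Matrix-vector product A v."""
    return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

# Images of the vertices of the Birkhoff polytope under A -> A (1, ..., n)^T.
images = {tuple(apply(perm_matrix(pi), target)) for pi in permutations(range(n))}
vertices = set(permutations(target))

# A doubly stochastic matrix that is not a vertex maps into the interior.
uniform = [[Fraction(1, n)] * n for _ in range(n)]
center = apply(uniform, target)
```

Since the map is linear and sends vertices onto vertices, it sends the convex hull of the permutation matrices onto the convex hull of the permuted vectors.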

** 1.4. The cut polytope **

If $\mathsf{P} \neq \mathsf{NP}$, there are certain combinatorial polytopes we should not be able to optimize over efficiently. A central example is the *cut polytope:* is the convex hull of all matrices of the form for some subset . Here, denotes the characteristic function of .

Note that the MAX-CUT problem on a graph can be encoded in the following way: Let if and otherwise. Then the value of the maximum cut in is precisely the maximum of for . Accordingly, we should expect that cannot be bounded by any polynomial in (lest we violate a basic tenet of complexity theory).
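Here is a brute-force sketch of this encoding on a small graph. It assumes a vertex of the cut polytope has entry $(i,j)$ equal to $1$ exactly when $S$ separates $i$ from $j$, and it stores each edge once (with $i < j$) in the weight matrix so that no edge is double counted; both conventions, and the function names, are assumptions made for the example.

```python
from itertools import combinations

def cut_matrix(S, n):
    """Vertex of the cut polytope: entry (i, j) is 1 iff exactly one of i, j lies in S."""
    return [[1 if (i in S) != (j in S) else 0 for j in range(n)] for i in range(n)]

def max_cut(n, edges):
    """Maximize the linear functional <W, A> over all cut matrices A."""
    W = [[0] * n for _ in range(n)]
    for i, j in edges:
        W[min(i, j)][max(i, j)] = 1  # each edge stored once, above the diagonal
    best = 0
    for size in range(n + 1):
        for S in combinations(range(n), size):
            A = cut_matrix(set(S), n)
            value = sum(W[i][j] * A[i][j] for i in range(n) for j in range(n))
            best = max(best, value)
    return best

five_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
triangle = [(0, 1), (1, 2), (0, 2)]
```

A linear functional attains its maximum over a polytope at a vertex, which is why enumerating the $2^n$ cut matrices suffices here; the enumeration is of course exponential and only meant for tiny instances.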

** 1.5. Non-negative matrix factorization **

The key to understanding comes from Yannakakis’ factorization theorem.

Consider a polytope and let us write in two ways: As a convex hull of vertices

and as an intersection of half-spaces: For some and ,

Given this pair of representations, we can define the corresponding *slack matrix* of by

Here, denote the rows of .
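For a concrete instance, the sketch below computes a slack matrix of the unit square. It assumes the convention that the entry for inequality $\langle a, x \rangle \leq b$ and vertex $v$ is $b - \langle a, v \rangle$; the variable names and the choice of polytope are illustrative.

```python
# Vertices of the unit square.
vertices = [(0, 0), (1, 0), (0, 1), (1, 1)]

# Each pair (a, b) encodes the inequality <a, x> <= b.
inequalities = [
    ((-1, 0), 0),  # x >= 0
    ((1, 0), 1),   # x <= 1
    ((0, -1), 0),  # y >= 0
    ((0, 1), 1),   # y <= 1
]

def slack_matrix(inequalities, vertices):
    """Rows are indexed by inequalities, columns by vertices."""
    return [[b - sum(ai * vi for ai, vi in zip(a, v)) for v in vertices]
            for a, b in inequalities]

S = slack_matrix(inequalities, vertices)
is_nonnegative = all(entry >= 0 for row in S for entry in row)
```

Every entry is non-negative because each vertex satisfies each inequality, and an entry is zero exactly when the vertex lies on the corresponding facet.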

We need one more definition. In what follows, we will use . If we have a non-negative matrix , then a *rank- non-negative factorization of * is a factorization where and . We then define the *non-negative rank of ,* written , to be the smallest such that admits a rank- non-negative factorization.

Theorem 1 (Yannakakis). For every polytope , it holds that for any slack matrix of .

The key fact underlying this theorem is Farkas’ Lemma. It asserts that if , then every valid linear inequality over can be written as a non-negative combination of the defining inequalities .

There is an interesting connection here to proof systems. The theorem says that we can interpret as the minimum number of axioms so that every valid linear inequality for can be proved using a conic (i.e., non-negative) combination of the axioms.

** 1.6. Slack matrices and the correlation polytope **

Thus to prove a lower bound on , it suffices to find a valid set of linear inequalities for and prove a lower bound on the non-negative rank of the corresponding slack matrix.

Toward this end, consider the correlation polytope given by

It is an exercise to see that and are linearly isomorphic.

Now we identify a particularly interesting family of valid linear inequalities for . (In fact, it turns out that this will also be an exhaustive list.) A *quadratic multi-linear function* on is a function of the form

for some real numbers and .

Suppose is a quadratic multi-linear function that is also non-negative on . Then “” can be encoded as a valid linear inequality on . To see this, write where . (Note that is intended to be the standard inner product on .)
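The encoding rests on the fact that $x_i^2 = x_i$ on the discrete cube, so the linear terms can be placed on the diagonal of the coefficient matrix. The sketch below checks this on $\{0,1\}^3$; the coefficients `a0` and `A` are arbitrary illustrative values, and placing the quadratic coefficients above the diagonal is just one possible convention.

```python
from itertools import product

n = 3
a0 = 1
# Diagonal entries hold the linear coefficients, entries above the
# diagonal hold the quadratic ones (hypothetical example values).
A = [[2, -1, 0],
     [0, -3, 4],
     [0, 0, 1]]

def f(x):
    """Quadratic multi-linear function a0 + sum_i a_ii x_i + sum_{i<j} a_ij x_i x_j."""
    linear = sum(A[i][i] * x[i] for i in range(n))
    quadratic = sum(A[i][j] * x[i] * x[j]
                    for i in range(n) for j in range(i + 1, n))
    return a0 + linear + quadratic

def affine_in_outer_product(x):
    """a0 + <A, x x^T>, an affine function of the matrix x x^T."""
    return a0 + sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

agree = all(f(x) == affine_in_outer_product(x) for x in product((0, 1), repeat=n))
```

Because the second expression is affine in the matrix $x x^T$, a non-negativity statement about the function on the cube becomes a valid linear inequality for the convex hull of those matrices.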

Now let be the set of all quadratic multi-linear functions that are non-negative on , and consider the matrix (represented here as a function) given by

Then from the above discussion, is a valid sub-matrix of some slack matrix of . To summarize, we have the following theorem.

Theorem 2. For all , it holds that .

It is actually the case that . The next post will focus on providing a lower bound on .

]]>
*determinantal measure* (see, for instance, Terry Tao’s post on determinantal processes and Russ Lyons’ ICM survey). I think this is an especially fertile setting for entropy maximization, but this will be the only post in this vein for now; I hope to return to the topic later.

Our goal is to prove the following theorem of Forster.

Theorem 1 (Forster). Suppose that are unit vectors such that every subset of vectors is linearly independent. Then there exists a linear mapping such that

This result is surprising at first glance. If we simply wanted to map the vectors to isotropic position, we could use the matrix . But Forster’s theorem asks that the unit vectors

are in isotropic position. This seems to be a much trickier task.

Forster used this as a step in proving lower bounds on the *sign rank* of certain matrices. Forster’s proof is based on an iterative argument combined with a compactness assertion.

There is another approach based on convex programming arising in the work of Barthe on a reverse Brascamp-Lieb inequality. The relation to Forster’s theorem was observed in work of Hardt and Moitra; it is essentially the dual program to the one we construct below.

** 1.1. Some facts about determinants **

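We first recall a few preliminary facts about determinants. For any $x_1, \ldots, x_n \in \mathbb{R}^k$, we have the Cauchy-Binet formula

$$\det\left(\sum_{i=1}^n x_i x_i^T\right) = \sum_{S \subseteq [n] : |S| = k} \det\left(\sum_{i \in S} x_i x_i^T\right).$$

We also have a rank-one update formula for the determinant: If a matrix $A$ is invertible, then

$$\det\left(A + u u^T\right) = \det(A) \left(1 + u^T A^{-1} u\right).$$

Finally, for $k$ vectors $x_1, \ldots, x_k \in \mathbb{R}^k$ and nonnegative coefficients $c_1, \ldots, c_k \geq 0$, we have

$$\det\left(\sum_{i=1}^k c_i x_i x_i^T\right) = \left(\prod_{i=1}^k c_i\right) \det\left(\sum_{i=1}^k x_i x_i^T\right).$$

This follows because replacing $x_i$ by $\sqrt{c_i}\, x_i$ corresponds to multiplying the $i$th row and column of $X^T X$ by $\sqrt{c_i}$, where $X$ is the matrix that has the vectors $x_1, \ldots, x_k$ as columns, and $\det(X^T X) = \det(X X^T) = \det\left(\sum_{i=1}^k x_i x_i^T\right)$.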
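The first two facts can be sanity-checked on small integer examples. The sketch below assumes the Cauchy-Binet formula in the form $\det(\sum_i x_i x_i^T) = \sum_{|S| = k} \det(\sum_{i \in S} x_i x_i^T)$ and the rank-one update in the form $\det(A + uu^T) = \det(A)(1 + u^T A^{-1} u)$; the vectors and matrices are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import combinations

def det(M):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def gram_sum(vectors, k):
    """Return sum_i x_i x_i^T as a k x k matrix."""
    return [[sum(x[a] * x[b] for x in vectors) for b in range(k)] for a in range(k)]

# Cauchy-Binet with n = 4 vectors in dimension k = 2.
k = 2
xs = [[1, 0], [1, 1], [0, 2], [3, 1]]
cb_lhs = det(gram_sum(xs, k))
cb_rhs = sum(det(gram_sum([xs[i] for i in S], k))
             for S in combinations(range(len(xs)), k))

# Rank-one update with a diagonal A, so that u^T A^{-1} u = sum_i u_i^2 / A_ii.
A = [[2, 0], [0, 1]]
u = [1, 1]
A_plus = [[A[i][j] + u[i] * u[j] for j in range(2)] for i in range(2)]
quad_form = sum(Fraction(u[i] ** 2, A[i][i]) for i in range(2))
update_lhs = det(A_plus)
update_rhs = det(A) * (1 + quad_form)
```

Each term on the right-hand side of Cauchy-Binet is the squared determinant of a $k \times k$ matrix of selected vectors, which is why all the terms are non-negative.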

** 1.2. A determinantal measure **

To prove Theorem 1, we will first define a probability measure on , i.e., on the -subsets of by setting:

The Cauchy-Binet formula is precisely the statement that , i.e. the collection forms a probability distribution on . How can we capture the fact that some vectors satisfy using only the values ?

Using the rank-one update formula, for an invertible matrix , we have . Thus is the identity matrix if and only if for every ,

Note also that using Cauchy-Binet,

In particular, if , then for every , we have

Of course, our vectors likely don’t satisfy this condition (otherwise we would be done). So we will use the max-entropy philosophy to find the “simplest” perturbation of the values that does satisfy it. The optimal solution will yield a matrix satisfying (1).

** 1.3. Entropy maximization **

Consider the following convex program with variables .

In other words, we look for a distribution on that has minimum entropy relative to , and such that all the “one-dimensional marginals” are equal (recall (2)). Remarkably, the optimum will be a determinantal measure as well.

Note that the uniform distribution on subsets of size is a feasible point and the objective is finite precisely because for every . The latter fact follows from our assumption that every subset of vectors is linearly independent.

** 1.4. Analyzing the optimizer **

By setting the gradient of the Lagrangian to zero, we see that the optimal solution has the form

for some dual variables . Note that the dual variables are unconstrained because they come from equality constraints.

Let us write . We use , , and to denote the values at the optimal solution. Using again the rank-one update formula for the determinant,

But just as in (2), we can also use Cauchy-Binet to calculate the derivative (from the second expression in (3)):

where we have used the fact that if (and otherwise equals ). We conclude that

Now we can finish the proof: Let . Then:

]]>

In our last post regarding Chang’s Lemma, let us visit a version due to Thomas Bloom. We will offer a new proof using entropy maximization. In particular, we will again use only boundedness of the Fourier characters.

There are two new (and elementary) techniques here: (1) using a trade-off between entropy maximization and accuracy and (2) truncating the Taylor expansion of .

We use the notation from our previous post: for some prime and is the uniform measure on . For and , we define . We also use to denote the set of all densities with respect to .

Theorem 1 (Bloom). There is a constant such that for every and every density , there is a subset such that and is contained in some subspace of dimension at most

Note that we only bound the dimension of a subset of the large spectrum, but the bound on the dimension improves by a factor of . Bloom uses this as the key step in his proof of what (at the time of writing) constitutes the best asymptotic bounds in Roth’s theorem on three-term arithmetic progressions:

Theorem 2. If a subset contains no non-trivial three-term arithmetic progressions, then

This represents a modest improvement over the breakthrough of Sanders achieving , but the proof is somewhat different.

** 1.1. A stronger version **

In fact, we will prove a stronger theorem.

Theorem 3. For every and every density , there is a random subset such that almost surely and for every , it holds that

This clearly yields Theorem 1 by averaging.

** 1.2. The same polytope **

To prove Theorem 3, we use the same polytope we saw before. Recall the class of test functionals

We defined by

Let us consider a slightly different convex optimization:

Here, is a constant that we will set soon. On the other hand, is now intended as an *additional variable* over which to optimize. We allow the optimization to trade off the entropy term and the accuracy . The constant represents how much we value one vs. the other.

Notice that, since , this convex program satisfies Slater’s condition (there is a feasible point in the relative interior), meaning that strong duality holds (see Section 5.2.3 of Boyd and Vandenberghe’s *Convex Optimization*).

** 1.3. The optimal solution **

As in our first post on this topic, we can set the gradient of the Lagrangian equal to zero to obtain the form of the optimal solution: For some dual variables ,

Furthermore, corresponding to our new variable , there is a new constraint on the dual variables:

Observe now that if we put then we can bound (the error in the optimal solution): Since is a feasible solution with , we have

which implies that since .

To summarize: By setting appropriately, we obtain of the form (2) and such that

Note that one can arrive at the same conclusion using the algorithm from our previous post: The version unconcerned with sparsity finds a feasible point after time . Setting yields the same result without using duality.

** 1.4. A Taylor expansion **

Let us slightly rewrite by multiplying the numerator and denominator by . This yields:

The point of this transformation is that now the exponent is a sum of positive terms (using ), and furthermore by (3), the exponent is always bounded by

Let us now Taylor expand $e^x = \sum_{j \geq 0} \frac{x^j}{j!}$. Applying this to the numerator, we arrive at an expression

where , , and each is a density. Here, ranges over all finite sequences of elements from and

where we use to denote the length of the sequence .

** 1.5. The random subset **

We now define a random function by taking with probability .

Consider some . Since , we know that . Thus

But we also have . This implies that .

Equivalently, for any , it holds that . We would be done with the proof of Theorem 3 if we also knew that were supported on functions for which because . This is not necessarily true, but we can simply truncate the Taylor expansion to ensure it.

** 1.6. Truncation **

Let denote the Taylor expansion of to degree . Since the exponent in is always bounded by (recall (4)), we have

By standard estimates, we can choose to make the latter quantity at most .

Since , a union bound combined with our previous argument immediately implies that for , we have

This completes the proof of Theorem 3.
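The "standard estimates" used in the truncation step can be made concrete with a small numerical sketch. The names `tail_bound` and `truncation_degree`, and the parameters `bound` (an upper bound on the exponent) and `delta` (the target error), are illustrative assumptions and do not come from the argument above. The sketch bounds the Taylor tail of $e^x$ by a geometric series and searches for a degree that is sufficient, not necessarily the smallest one.

```python
import math

def tail_bound(bound, degree):
    """Upper bound on sum_{j > degree} bound**j / j!.

    Valid for bound >= 0 and degree + 2 > bound: consecutive terms of the
    series shrink by a factor of at most bound / (degree + 2), so the tail
    is dominated by a geometric series.
    """
    first = 1.0
    for j in range(1, degree + 2):
        first *= bound / j              # builds bound**(degree+1) / (degree+1)!
    ratio = bound / (degree + 2)
    return first / (1.0 - ratio)

def truncation_degree(bound, delta):
    """A degree k (not necessarily minimal) with Taylor tail at most delta."""
    k = max(0, math.ceil(bound))        # guarantees degree + 2 > bound
    while tail_bound(bound, k) > delta:
        k += 1
    return k

# Hypothetical usage: exponent bounded by 3, target error 1e-6.
k = truncation_degree(3.0, 1e-6)
exact_tail = math.exp(3.0) - sum(3.0 ** j / math.factorial(j) for j in range(k + 1))
```

Because the factorial in the denominator eventually dominates, the returned degree grows only slowly as `delta` shrinks, which is what makes the truncation cheap.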

** 1.7. Prologue: A structure theorem **

Generalizing the preceding argument a bit, one can prove the following.

Let be a finite abelian group and use to denote the dual group. Let denote the uniform measure on . For every , let denote the corresponding character. Let us define a *degree- Riesz product* to be a function of the form

for some and and .

Theorem 4. For every , the following holds. For every with , there exists a with such that and is a convex combination of degree- Riesz products where

** 1.8. A prologue’s prologue **

To indicate the lack of algebraic structure required for the preceding statement, we can set things up in somewhat greater generality.

For simplicity, let be a finite set equipped with a probability measure . Recall that is the Hilbert space of real-valued functions on equipped with the inner product . Let be a set of functionals with the property that for .

Define a *degree- -Riesz product* as a function of the form

for some functions . Define also the (semi-)norm .

Theorem 5. For every , the following holds. For every with , there exists a with such that and is a convex combination of degree- -Riesz products where

]]>

Let be a finite set equipped with a measure . We use to denote the space of real-valued functions equipped with the inner product . Let denote the convex polytope of probability densities with respect to :

** 1.1. A ground truth and a family of tests **

Fix now a density and a family that one can think of as “tests” (or properties of a function we might care about). Given an error parameter , we define a convex polytope as follows:

This is the set of functions that have “performance” similar to that of on all the tests in . Note that the tests are one-sided; if we wanted two-sided bounds for some , we could just add to .

** 1.2. (Projected) coordinate descent in the dual **

We now describe a simple algorithm to find a point . It will be precisely analogous to the algorithm we saw in the previous post for the natural polytope coming from Chang’s Lemma. We define a family of functions indexed by a continuous time parameter:

Here is some test we have yet to specify. Intuitively, it will be a test that is not performing well on. The idea is that at time , we exponentially average some of into . The exponential ensures that is non-negative, and our normalization ensures that . Observe that is the constant function.

For two densities , we define the relative entropy

Our goal is to analyze the potential function . This functional measures how far the “ground truth” is from our hypothesis .

Note that , using the notation from the previous post. Furthermore, since the relative entropy between two probability measures is always non-negative, for all .

Now a simple calculation gives

In other words, as long as the constraint is violated by , the potential is decreasing by at least !

Since the potential starts at and must always be non-negative, if we choose at every time a violated test , then after time , it must be that for some .
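The continuous-time procedure above can be imitated by a discrete multiplicative update. What follows is a minimal sketch under stated assumptions, not the algorithm exactly as defined in this post: the ground set is finite with the uniform measure, every test takes values in [-1, 1], the polytope is taken to consist of densities g with ⟨φ, g⟩ ≥ ⟨φ, f⟩ − ε for every test φ, and the step size is fixed at ε. The names `inner`, `relative_entropy` and `dual_descent` are hypothetical. Under these assumptions, Hoeffding's lemma shows that each step lowers the relative entropy potential by more than ε²/2, so the loop stops after fewer than 2·D(f‖1)/ε² steps.

```python
import math

def inner(a, b):
    """<a, b> = E_mu[a * b] for the uniform measure mu on a finite set."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def relative_entropy(f, g):
    """D(f || g) = E_mu[f log(f / g)], with the convention 0 log 0 = 0."""
    return sum(x * math.log(x / y) for x, y in zip(f, g) if x > 0) / len(f)

def dual_descent(f, tests, eps, max_steps=100_000):
    """Fold violated tests into the exponent until none is violated by > eps.

    Returns (g, used): g is a density and used lists the chosen test indices.
    """
    n = len(f)
    exponent = [0.0] * n                  # running sum of eps * (chosen tests)
    g = [1.0] * n                         # start at the constant density
    used = []
    for _ in range(max_steps):
        # Look for a test on which g underperforms f by more than eps.
        violated = next((i for i, phi in enumerate(tests)
                         if inner(phi, f) - inner(phi, g) > eps), None)
        if violated is None:
            return g, used
        used.append(violated)
        exponent = [s + eps * p for s, p in zip(exponent, tests[violated])]
        weights = [math.exp(s) for s in exponent]
        z = sum(weights) / n              # normalize so that E_mu[g] = 1
        g = [w / z for w in weights]
    raise RuntimeError("no feasible point found within max_steps")

# Hypothetical example: density of a 4-point set in Z_16, cosine tests.
n = 16
f = [4.0 if x < 4 else 0.0 for x in range(n)]
tests = []
for k in range(1, n):
    c = [math.cos(2 * math.pi * k * x / n) for x in range(n)]
    tests += [c, [-v for v in c]]         # two-sided: include phi and -phi
g, used = dual_descent(f, tests, eps=0.1)
```

The fixed step size is a design choice made for simplicity; it trades the clean 1/ε bound of the continuous-time analysis for a 1/ε² bound on the number of discrete steps.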

** 1.4. A sparse solution **

But our goal in Chang’s Lemma was to find a “sparse” . In other words, we want to be built out of only a few constraints. To accomplish this, we should have switch between different tests as little as possible.

So when we find a violated test , let’s keep until . How fast can the quantity drop from larger than to zero?

Another simple calculation (using ) yields

This quantity is at least .

In order to calculate how much the potential drops while focused on a single constraint, we can make the pessimistic assumption that . (This is formally justified by Grönwall’s inequality.) In this case (recalling (1)), the potential drop is at least

Since the potential can drop by at most overall, this bounds the number of steps of our algorithm, yielding the following result.

Theorem 1. For every , there exists a function such that for some and for some value

In order to prove Lemma 3 from the preceding post, simply use the fact that our tests were characters (and their negations), and these satisfy .
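The idea of holding on to one test for as long as possible can also be sketched in discrete form. This is an illustration under assumptions rather than the procedure analyzed above: the measure is uniform on a finite set, tests take values in [-1, 1], the step size is ε/2, and a held test is released once its gap ⟨φ, f − g⟩ falls to ε/2 (instead of zero, as in the continuous-time description) so that every step provably lowers the potential by more than ε²/8. The name `sparse_dual_descent` and its return values are hypothetical.

```python
import math

def inner(a, b):
    """<a, b> = E_mu[a * b] for the uniform measure mu on a finite set."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def gap(phi, f, g):
    """How much g underperforms f on the test phi."""
    return inner(phi, f) - inner(phi, g)

def sparse_dual_descent(f, tests, eps, max_steps=1_000_000):
    """Return (g, phases, steps); phases has one entry per switch of test."""
    n = len(f)
    step = eps / 2                        # assumed step size
    exponent = [0.0] * n
    g = [1.0] * n
    phases, steps, held = [], 0, None
    for _ in range(max_steps):
        # Release the held test once its gap has dropped to eps / 2.
        if held is not None and gap(tests[held], f, g) <= eps / 2:
            held = None
        if held is None:
            held = next((i for i, phi in enumerate(tests)
                         if gap(phi, f, g) > eps), None)
            if held is None:
                return g, phases, steps   # no test is violated by more than eps
            phases.append(held)
        exponent = [s + step * p for s, p in zip(exponent, tests[held])]
        weights = [math.exp(s) for s in exponent]
        z = sum(weights) / n              # normalize so that E_mu[g] = 1
        g = [w / z for w in weights]
        steps += 1
    raise RuntimeError("no feasible point found within max_steps")

# Hypothetical example: density of a 4-point set in Z_16, cosine tests.
n = 16
f = [4.0 if x < 4 else 0.0 for x in range(n)]
tests = []
for k in range(1, n):
    c = [math.cos(2 * math.pi * k * x / n) for x in range(n)]
    tests += [c, [-v for v in c]]
g, phases, steps = sparse_dual_descent(f, tests, eps=0.1)
```

The length of `phases` counts how many times the exponent switches to a different test, which is the discrete stand-in for the sparsity that the theorem controls.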

** 1.5. Bibliographic notes **

Mohit Singh has pointed out to me the similarity of this approach to the Frank-Wolfe algorithm. The use of the potential is a common feature of algorithms that go by the names “boosting” or “multiplicative weights update” (see, e.g., this survey).

]]>