Using the notation from the last post, our goal is now to prove the following theorem.

Theorem 1. For every and , there is a constant such that for all ,

**1.1. Convex relaxations of non-negative rank**

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets and and a matrix with .

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both and with the uniform measure. Let denote the set of probability densities on . Now define

Here are the columns of and are the rows of . Note that now is unconstrained.

Observe that is a convex function. To see this, given a pair and , write

witnessing the fact that .
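The convexity computation above is the standard block-concatenation witness. Here is a minimal Python sketch of it, with small random matrices standing in for the two factorizations (all sizes and the mixing weight are invented for illustration):

```python
import random

random.seed(0)

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

m, n, r = 4, 5, 3
# Two arbitrary non-negative factorizations M1 = A1 B1 and M2 = A2 B2.
A1 = [[random.random() for _ in range(r)] for _ in range(m)]
B1 = [[random.random() for _ in range(n)] for _ in range(r)]
A2 = [[random.random() for _ in range(r)] for _ in range(m)]
B2 = [[random.random() for _ in range(n)] for _ in range(r)]
M1, M2 = matmul(A1, B1), matmul(A2, B2)

lam = 0.3
# Witness for lam*M1 + (1-lam)*M2: scale the A-blocks by lam and 1-lam,
# concatenate them side by side, and stack the B-rows. The B_i's are untouched.
A = [[lam * a for a in rowA1] + [(1 - lam) * a for a in rowA2]
     for rowA1, rowA2 in zip(A1, A2)]
B = B1 + B2

target = [[lam * u + (1 - lam) * v for u, v in zip(row1, row2)]
          for row1, row2 in zip(M1, M2)]
err = max(abs(u - v) for row1, row2 in zip(matmul(A, B), target)
          for u, v in zip(row1, row2))
assert err < 1e-12
```

Note that the witness has inner dimension 2r, which is exactly why the convex hull of low non-negative rank matrices is everything; the point of the analytic quantity is that it, unlike the rank, is controlled under this combination because the scaling lands entirely on the A-blocks.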

**1.2. A truncation argument**

So the set is convex, but it’s not yet clear how this relates to . We will see now that low non-negative rank matrices are close to matrices with small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms and .

Lemma 2. For every non-negative with and every , there is a matrix such that and

*Proof:* Suppose that with , and let us interpret this factorization in the form

(where are the columns of and are the rows of ). By rescaling the columns of and the rows of , respectively, we may assume that for every .

Let denote the “bad set” of indices (we will choose momentarily). Observe that if , then

from the representation (1) and the fact that all summands are positive.

Define the matrix . It follows that

Each of the latter terms is at most and , thus

Next, observe that

implying that and thus .

Setting yields the statement of the lemma.
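The truncation step can be illustrated numerically. The following sketch (all sizes and the threshold are arbitrary choices, not the lemma's parameters) normalizes each row B_i in the average L1-norm, drops the terms whose B_i has large sup-norm, and checks the two facts the proof uses: the truncated matrix is dominated entrywise by M, and on every discarded index the column A_i is uniformly small.

```python
import random

random.seed(1)

m, n, r = 6, 8, 5
# A random non-negative factorization (sizes are arbitrary).
A = [[random.random() for _ in range(r)] for _ in range(m)]   # columns A_i
B = [[random.random() for _ in range(n)] for _ in range(r)]   # rows B_i

# Normalize each B_i to average 1 (uniform measure), moving the scale into A.
for i in range(r):
    s = sum(B[i]) / n
    B[i] = [b / s for b in B[i]]
    for x in range(m):
        A[x][i] *= s

M = [[sum(A[x][i] * B[i][y] for i in range(r)) for y in range(n)] for x in range(m)]
M_inf = max(max(row) for row in M)

tau = 1.7   # truncation threshold on the sup-norm of B_i (a free parameter here)
keep = [i for i in range(r) if max(B[i]) <= tau]
Mt = [[sum(A[x][i] * B[i][y] for i in keep) for y in range(n)] for x in range(m)]

# Non-negativity gives entrywise domination 0 <= Mt <= M ...
assert all(0 <= Mt[x][y] <= M[x][y] + 1e-12 for x in range(m) for y in range(n))
# ... and if B_i is big somewhere, A_i is small everywhere:
# A[x][i] * max(B_i) <= M_inf, so every discarded column of A is < M_inf / tau.
assert all(max(A[x][i] for x in range(m)) <= M_inf / max(B[i]) + 1e-9
           for i in range(r) if i not in keep)
```

The second assertion is the essential use of non-negativity mentioned later in the comments: a single large value of B_i forces the entire column A_i to be small.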

Generally, the ratio will be small compared to (e.g., polynomial in vs. super-polynomial in ). Thus from now on, we will actually prove a lower bound on . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

**1.3. The test functionals**

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix , we will now assume that . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix . We will consider the uniform measures on and . We use and to denote averaging with respect to these measures.

Let . From the last post, we know there exists a -locally positive functional such that , and for every -junta .

For with , let us denote . These functionals prove lower bounds on the junta-degree of restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family as our test functions.

**1.4. Entropy maximization**

Use to write

where and we have and for all , and finally .

First, as we observed last time, if each were a -junta, we would have a contradiction:

because since is -locally positive and the function is a -junta.

So now the key step: Use entropy maximization to approximate by a junta! In future posts, we will need to consider the entire package of functions simultaneously. But for the present lower bound, it suffices to consider each separately.

Consider the following optimization over variables :

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1. There exists a function satisfying all the preceding constraints and of the form such that

where is some constant depending only on .

Note that depends only on , and thus only depends on as well. Now each only depends on variables (those in and ), meaning that our approximator is an -junta for

Oops. That doesn’t seem very good. The calculation in (3) needs that is a -junta, and certainly (since is a function on ). Nevertheless, note that the approximator is a *non-trivial junta*. For instance, if , then it is an -junta, recalling that is a constant (depending on ).

**1.5. Random restriction saves the day**

Let’s try to apply the logic of (3) to the approximators anyway. Fix some and let be the set of coordinates on which depends. Then:

Note that the map is a junta on . Thus if , then the contribution from this term is non-negative since is -locally positive. But is fixed and is growing, thus is quite rare! Formally,

In the last estimate, we have used a simple union bound and .
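The union-bound estimate can be sanity-checked by simulation. In the hypothetical setup below (every size is invented for illustration), T plays the role of the junta coordinates of some approximator and R is a random restriction of density rho; the chance that at least s of the k junta coordinates survive is at most C(k, s) * rho^s:

```python
import math
import random

random.seed(2)

# Hypothetical sizes: ground set [n], junta coordinates T (|T| = k),
# and a random restriction R of density rho.
n, k, rho = 200, 40, 0.05
T = set(range(k))          # wlog the first k coordinates
s = 8                      # threshold for "too many junta coordinates survive"

def survival_prob(trials=20000):
    """Monte Carlo estimate of P[|T ∩ R| >= s] over random rho*n-subsets R."""
    hits = 0
    for _ in range(trials):
        R = set(random.sample(range(n), int(rho * n)))
        if len(T & R) >= s:
            hits += 1
    return hits / trials

# Union bound over the s-subsets of T: each lands entirely inside R with
# probability at most rho^s (sampling without replacement only helps).
union_bound = math.comb(k, s) * rho ** s

est = survival_prob()
assert est <= union_bound
```

The estimate here is crude, but it shows the mechanism: the expected overlap is k*rho, so asking for s much larger than that is exponentially rare in s.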

Putting everything together and summing over , we conclude that

Note that by choosing only moderately large, we will make this error term very small.

Moreover, since each is a feasible point of the optimization (4), we have

Almost there: Now observe that

Plugging this into the preceding line yields

Recalling now (2), we have derived a contradiction to if we can choose the right-hand side to be bigger than (which is a negative constant depending only on ). Setting , we consult (5) to see that

for some other constant depending only on . We thus arrive at a contradiction if , recalling that depend only on . This completes the argument.

I’m still digesting this post, but it’s really nice! The entropy minimization argument seems to be used just to make a claim of the form that if the only thing you care about is to preserve correlations between a bounded function B and all (bounded) m-juntas you may as well assume the function B itself only depends on “few” coordinates – not much more than m times log(||B||_\infty). This is somewhat intuitive since you only want to preserve correlations up to an additive error, and a bounded function could possibly correlate highly with many juntas on disjoint coordinates.

A couple comments:

– In terms of the definition of \alpha_+, I can see loosely the connection with a factorization norm, but I couldn’t formulate it precisely as one either. In that respect it might be more natural to have \alpha be defined as the infimum, over all decompositions, of the product of the two quantities that it is currently assumed to bound (the largest row norm of B and the sum of the column norms of A). In fact, in the proof of Lemma 2 you end up with a bound that is not uniform: you have a much better bound on the A part (rM) than on the B part (rM/delta).

– A couple lines before 1.2, “write write”.

– In lemma 2 it might be helpful to clarify that the 1- and infinity-norms over M are element-wise (rather than the Schatten norms), as this confused me at first. Also, this is the first time so many norms show up, and it’s not clear at first why e.g. the distance between M and \tilde{M} are measured in 1-norm, but the bound on \alpha_+ involves the infinity-norm. Is there some good intuition for this choice of normalizations? The different norms show up repeatedly in the argument and I had trouble tracking down which bounds were critical and which were more loose. (Feel free to ignore this question, it’s a bit unfair…)

– Right after the proof of Lemma 2, did you mean the inverse ratio, ||M||_1/||M||_\infty? The one you wrote is <<1, no?

Hi Thomas, thanks for the comments!

I fixed the typos and added an explanation of the norms. It should also address your last comment; all the norms are supposed to be averages, so |M|_\infty/|M|_1 makes sense.

Your two more substantial comments are related. I agree that should involve a product of the norm of B and the norm of A. I went back and forth on the correct way to write this. Writing it as a product introduces yet another rescaling in the max-entropy argument. As you say, it creates an unnatural imbalance in the truncation argument.

On the other hand, the imbalance would still persist: the eventual dependence on is linear, while the dependence on comes in only through the entropy and is logarithmic.

I am going to pause on writing it correctly until I write the quantum version. There things are more subtle (e.g., there is no canonical decomposition into B_i’s so they cannot be individually normalized in the L^1 norm), so hopefully the correct normalization will be more apparent.

As for the way error is measured: We are trying to separate from the convex set of matrices with small value. So we are going to use a linear functional, which in this case is

The functional is bounded in norm (simply because is fixed), so we need our approximator to be good in the norm. The idea is that the error from approximation should be smaller than the “fatness” of the separating functional (which in the post is related to the value ).

Since we care about error, the correct way to write the bound in Lemma 2 is , i.e. with relative error. If you do that, you see that actually the bound on should be written as . This is great, as now the statement of the lemma is scale invariant.

Finally, the reason that the ratio has to arise somewhere lies at the heart of the truncation argument. Specifically, if B_i is big somewhere, then A_i must be small everywhere, as long as M is uniformly bounded. This allows us to get rid of the large B_i’s, and it is an essential use of the non-negativity of the factorization.

In any case, you make good points, and I struggled to address them without complicating the (very simple) truncation argument any further. I wonder even if introducing is a good idea. Maybe Lemma 2 should just say what kind of approximator exists and dispense with . I wanted to have it there for analogy, since in the quantum setting the analogous “factorization (not a) norm” is more important and subtle.

Yes, this is exactly right.

And, in fact, when you really have this explicit structure, there are other ways of arriving at this approximation. We could build a junta as follows. Suppose B is a distribution on {0,1}^n with very high entropy, e.g. n-t (so t = relative entropy). Find a junta approximation as follows: Initially, the junta set J is empty. Now if B has a large low-degree Fourier coefficient on a set S that is not completely contained in J, then add all the coordinates in S to J. Repeat until no such coefficients remain.

Since B has n-t bits of entropy, this cannot go on for too long. Every time we find a new set S on which B is biased, our estimate of its entropy decreases. At the end, we have constructed an approximation B’ that is a J-junta. It’s an approximation in the sense that B-B’ has no large low-degree Fourier coefficients.

Now do the random restriction to a subset R. As in the current argument, B’|_R will likely have only few junta coordinates. At the same time, the error B-B’|_R will shrink dramatically because a high-degree Fourier coefficient will likely not survive the restriction (all its coordinates will have to fall in R).
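Here is a small Python sketch of this greedy peeling, with brute-force Fourier coefficients on a tiny cube (the dimension n, degree bound d, and threshold theta are all invented). It checks the promised guarantee: after conditioning on the coordinates in J, the residual has no large low-degree Fourier coefficient.

```python
import itertools
import random

random.seed(3)

n, d, theta = 6, 2, 0.1   # hypothetical: cube dimension, "low" degree, threshold
cube = list(itertools.product((0, 1), repeat=n))

def fourier(f, S):
    """Fourier coefficient of f on the set S (brute force over {0,1}^n)."""
    return sum(f[i] * (-1) ** sum(x[j] for j in S) for i, x in enumerate(cube)) / 2 ** n

def low_degree_sets():
    return [S for k in range(1, d + 1) for S in itertools.combinations(range(n), k)]

# A random bounded function standing in for the distribution B.
B = [random.uniform(0.0, 2.0) for _ in range(2 ** n)]

# Greedy peeling: while some large low-degree coefficient sticks out of J, absorb it.
J = set()
while True:
    big = [S for S in low_degree_sets()
           if not set(S) <= J and abs(fourier(B, S)) >= theta]
    if not big:
        break
    J |= set(big[0])

# The J-junta approximation B': condition on the coordinates in J.
Bp = []
for x in cube:
    block = [B[i] for i, y in enumerate(cube) if all(y[j] == x[j] for j in J)]
    Bp.append(sum(block) / len(block))

# The residual B - B' has no large low-degree Fourier coefficient: coefficients
# inside J are cancelled exactly, and the rest are < theta by the loop exit.
resid = [a - b for a, b in zip(B, Bp)]
assert all(abs(fourier(resid, S)) < theta for S in low_degree_sets())
```

Each loop iteration adds at least one new coordinate to J, so the peeling terminates after at most n rounds; the entropy argument in the comment above gives the much better bound on |J|.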

This is the argument from our FOCS 2013 paper.

The disadvantage is that (a) it’s more complicated, and (b) it requires looking very carefully at the kind of structure you want the approximator to have. (Notice, for instance, that the entropy-minimization argument does not need to refer to Fourier coefficients, and can potentially work just as well over other domains like the set of traveling salesman tours in the complete graph without getting into representation theory of the symmetric group!)

David and Prasad and I spent well over a year trying to figure out the analogous structural characterization for the SDP setting. In some sense, we were forced to discover the (quantum) entropy-minimization approach because we couldn’t get a handle on the right structure. We had very complicated arguments for special cases.

In the LP setting, one has r functions B_1, …, B_r. In the SDP setting, one has an r-dimensional subspace V of functions and you have to control p^2 for every simultaneously. How do you approximate a subspace by a simpler one? And how does random restriction make the subspace nicer? (It’s a tricky question, and choosing a basis for V is clearly a horrible idea, since it breaks invariance, and it doesn’t seem to work.)

The entropy-minimization philosophy is: Pick your tests. Now if:

(i) The tests are simple.

(ii) The tests are smooth (their L^\infty norm is small).

(iii) You allow some error \eps.

Then the entropy-optimal approximator will also be “simple” (as it is a simple combination of simple things).
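This recipe can be made concrete on a toy domain. The sketch below (the tests, the target density, and the step size are all invented for illustration) runs gradient ascent on the convex dual of the max-entropy program; the iterate necessarily has the Gibbs form exp(lam . q)/Z, i.e. it is a simple combination of the tests:

```python
import math
import random

random.seed(4)

N = 32
# Two hypothetical bounded "tests" on a size-N domain, plus a target
# density f whose test correlations the approximator must preserve.
q1 = [math.sin(i) for i in range(N)]
q2 = [random.choice((-1.0, 1.0)) for _ in range(N)]
f = [random.uniform(0.2, 1.8) for _ in range(N)]
f = [x * N / sum(f) for x in f]                      # normalize: average of f is 1

c1 = sum(f[i] * q1[i] for i in range(N)) / N         # correlations to preserve
c2 = sum(f[i] * q2[i] for i in range(N)) / N

# Gradient ascent on the dual of the max-entropy program. The iterate is
# always of the form exp(lam . q)/Z -- a combination of the tests alone.
lam = [0.0, 0.0]
for _ in range(5000):
    w = [math.exp(lam[0] * q1[i] + lam[1] * q2[i]) for i in range(N)]
    Z = sum(w) / N
    p = [x / Z for x in w]                           # candidate density, average 1
    g1 = sum(p[i] * q1[i] for i in range(N)) / N
    g2 = sum(p[i] * q2[i] for i in range(N)) / N
    lam[0] += 0.1 * (c1 - g1)
    lam[1] += 0.1 * (c2 - g2)

assert abs(g1 - c1) < 1e-4 and abs(g2 - c2) < 1e-4
```

The structure of the output is dictated entirely by the tests q1, q2, never by the target f, which is the point of the philosophy above.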

It’s nice because the tests (which often come up front for a natural reason) dictate the structure. As we’ll see soon, this also works in the SDP/quantum setting.

Thanks for the clarifications, they’re helpful! I’m wondering if indeed it wouldn’t be easier to present things in a different order, with Lemma 2 coming at the end:

– You start with any low-rank factorization of M, with the normalization ||B_i||_1=1 wlog. (From which the bound ||A_i||_\infty\leq ||M||_\infty follows immediately and can be used as needed).

– Then you observe that the entropy minimization argument, together with the random restriction step, gives you an upper bound on the degree of g that depends on the entropy Ent(B_i). It is clear why this comes in: it is the basis for the entropy argument.

– Finally you need to deal with the fact that Ent(B_i) could be large. For this it is natural to treat high-entropy rows as being rare and try to argue that, for the purposes of the overall argument, they cannot do too much harm. Your explanations on getting an L1-approximant then make it more or less clear how Lemma 2 should come in, and you end up removing large-L_infty columns and using duality to bound the effect on M in L1.

Ok, I’m not suggesting this is how it should be done; it’s mostly for my own benefit. You’re probably right that the “quantum” setting will help clarify, by seeing the argument once more in a setting where perhaps fewer types of approximations are possible. It’s like trying to understand duality by working with L2, it’s easy to get it the wrong way round :-)

(Minor typo: you might be missing an A_i(S) on the rhs of the first centered equation in section 1.5)