Entropy optimality: Non-negative rank lower bounds

Using the notation from the last post, our goal is now to prove the following theorem.

Theorem 1 For every {m \geq 1} and {g : \{0,1\}^m \rightarrow \mathbb R_+} , there is a constant {C=C(g)} such that for all {n \geq 2m} ,

\displaystyle  \mathrm{rank}_+(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{\mathrm{deg}_J(g)-1}\,.

1.1. Convex relaxations of non-negative rank

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets {X} and {Y} and a matrix {M : X \times Y \rightarrow \mathbb R_+} with {r=\mathrm{rank}_+(M)} .

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of {\{ N \in \mathbb R_+^{X \times Y} : \mathrm{rank}_+(N) = 1 \}} is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both {X} and {Y} with the uniform measure. Let {\mathcal Q = \{ b : Y \rightarrow \mathbb R_+ \mid \|b\|_1 = 1\}} denote the set of probability densities on {Y} . Now define

\displaystyle  \alpha_+(N) = \min \Big\{ \alpha : \exists A \in \mathbb R_+^{X \times k}, B \in \mathbb R_+^{k \times Y} \textrm{ with } N=AB, \textrm{ and}

\displaystyle  \qquad\qquad\qquad\qquad\qquad\{B_1, \ldots, B_k\} \subseteq \mathcal Q, \textrm{ and }

\displaystyle  \hphantom{\{B_1, \ldots, B_k\} \subseteq \mathcal Q,} \max_{i \in [k]} \|B_i\|_{\infty} \leq \alpha, \sum_{i=1}^k \|A^{(i)}\|_{\infty} \leq \alpha\Big\}.

Here {\{A^{(i)}\}} are the columns of {A} and {\{B_i\}} are the rows of {B} . Note that now {k} is unconstrained.

Observe that {\alpha_+} is a convex function. To see this, given a pair {N=AB} and {N'=A'B'} , write

\displaystyle  \frac{N+N'}{2} = \left(\begin{array}{cc} \frac12 A & \frac12 A' \end{array}\right) \left(\begin{array}{l} \vphantom{\bigoplus} B \\ \vphantom{\bigoplus} B' \end{array}\right)\,,

witnessing the fact that {\alpha_+(\frac12(N+N')) \leq \frac12 \alpha_+(N) + \frac12 \alpha_+(N')} .

1.2. A truncation argument

So the set {\{ N : \alpha_+(N) \leq c \}} is convex, but it’s not yet clear how this relates to {\mathrm{rank}_+} . We will see now that low non-negative rank matrices are close to matrices with {\alpha_+} small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms {\|M\|_1 = \mathbb E_{x,y} |M_{x,y}|} and {\|M\|_{\infty} = \max_{x,y} |M_{x,y}|} .

Lemma 2 For every non-negative {M \in \mathbb R_+^{X \times Y}} with {\mathrm{rank}_+(M) \leq r} and every {\delta \in (0,1)} , there is a matrix {\tilde M \in \mathbb R_+^{X \times Y}} such that

\displaystyle  \|M-\tilde M\|_1 \leq \delta


\displaystyle  \alpha_+(\tilde M) \leq \frac{r}{\delta} \|M\|_{\infty}\,.

Proof: Suppose that {M=AB} with {A \in \mathbb R_+^{X \times r}, B \in \mathbb R_+^{r \times Y}} , and let us interpret this factorization in the form

\displaystyle  M(x,y) = \sum_{i=1}^r A_i(x) B_i(y) \ \ \ \ \ (1)

(where {\{A_i\}} are the columns of {A} and {\{B_i\}} are the rows of {B} ). By rescaling the columns of {A} and the rows of {B} , respectively, we may assume that {\mathop{\mathbb E}_{\mu}[B_i]=1} for every {i \in [r]} .

Let {\Lambda = \{ i : \|B_i\|_{\infty} > \tau \}} denote the “bad set” of indices (we will choose {\tau} momentarily). Observe that if {i \in \Lambda} , then

\displaystyle  \|A_i\|_{\infty} \leq \frac{\|M\|_{\infty}}{\tau}\,,

from the representation (1) and the fact that all summands are positive.

Define the matrix {\tilde M(x,y) = \sum_{i \notin \Lambda} A_i(x) B_i(y)} . It follows that

\displaystyle  \|M-\tilde M\|_1 = \mathop{\mathbb E}_{x,y} \left[|M(x,y)-\tilde M(x,y)|\right] = \sum_{i \in \Lambda} \mathop{\mathbb E}_{x,y} [A_i(x) B_i(y)]\,.

Each of the latter terms is at most {\|A_i\|_{\infty} \|B_i\|_1 \leq \frac{\|M\|_{\infty}}{\tau}} and {|\Lambda| \leq r} , thus

\displaystyle  \|M-\tilde M\|_1 \leq r \frac{\|M\|_{\infty}}{\tau}\,.

Next, observe that

\displaystyle  \mathop{\mathbb E}_y [M(x,y)] = \sum_{i=1}^r A_i(x) \|B_i\|_1 = \sum_{i=1}^r A_i(x)\,,

implying that {\|A_i\|_{\infty} \leq \|M\|_{\infty}} and thus {\sum_{i=1}^r \|A_i\|_{\infty} \leq r \|M\|_{\infty}} .

Setting {\tau = r \|M\|_{\infty}/\delta} yields the statement of the lemma. \Box

Generally, the ratio {\frac{\|M\|_{\infty}}{\|M\|_1}} will be small compared to {r} (e.g., polynomial in {n} vs. super-polynomial in {n} ). Thus from now on, we will actually prove a lower bound on {\alpha_+(M)} . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

1.3. The test functionals

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix {M_n^g} , we will now assume that {\alpha_+(M_n^g) \leq \alpha} . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix {M_n^g} . We will consider the uniform measures on {{[n] \choose m}} and {\{0,1\}^n} . We use {\mathop{\mathbb E}_S} and {\mathop{\mathbb E}_x} to denote averaging with respect to these measures.

Let {d=\deg_J(g)-1} . From the last post, we know there exists a {d} -locally positive functional {\varphi : \{0,1\}^n \rightarrow \mathbb R} such that {\beta := \mathop{\mathbb E}_x \varphi(x) g(x) < 0} , and {\mathop{\mathbb E}_x \varphi(x) q(x) \geq 0} for every {d} -junta {q} .

For {S \subseteq [n]} with {|S|=m} , let us denote {\varphi_S(x) = \varphi(x|_S)} . These functionals prove lower bounds on the junta-degree of {g} restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family {\{\varphi_S\}} as our test functions.

1.4. Entropy maximization

Use {\alpha_+(M_n^g) \leq \alpha} to write

\displaystyle  M_n^g(S,x) = \sum_{i=1}^k A_i(S) B_i(x)\,,

where {A_i, B_i \geq 0} and we have {\|B_i\|_1=1} and {\|B_i\|_{\infty} \leq \alpha} for all {i \in [k]} , and finally {\sum_{i=1}^k \|A_i\|_{\infty} \leq \alpha} .

First, as we observed last time, if each {B_i} were a {d} -junta, we would have a contradiction:

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) M_n^g(S,x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^m} \varphi(y) g(y) = \beta < 0\,, \ \ \ \ \ (2)

and yet

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) \sum_{i=1}^k A_i(S) B_i(x)]\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] \geq 0 \ \ \ \ \ (3)

because {\mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right] \geq 0} since {\varphi} is {d} -locally positive and the function {y \mapsto \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right]} is a {d} -junta.

So now the key step: Use entropy maximization to approximate {B_i} by a junta! In future posts, we will need to consider the entire package {\{B_1, \ldots, B_k\}} of functions simultaneously. But for the present lower bound, it suffices to consider each {B_i} separately.

Consider the following optimization over variables {\{\tilde B_i(x) : x \in \{0,1\}^n\}} :

\displaystyle  \mathrm{minimize }\qquad \mathrm{Ent}(\tilde B_i) = \mathbb E [\tilde B_i \log \tilde B_i]

\displaystyle  \qquad\textrm{ subject to } \qquad \qquad\mathbb E[\tilde B_i]=1, \quad \tilde B_i(x) \geq 0 \quad \forall x \in \{0,1\}^n

\displaystyle  \hphantom{\bigoplus} \qquad \qquad \qquad \qquad \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right] \leq \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] + \varepsilon \qquad \forall |S|=m\,. \ \ \ \ \ (4)

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1 There exists a function {\tilde B_i : \{0,1\}^n \rightarrow \mathbb R_+^n} satisfying all the preceding constraints and of the form

\displaystyle  \tilde B_i = \frac{\exp\left(\sum_{j=1}^{M} \lambda_j \varphi_{S_j}\right)}{\mathop{\mathbb E} \exp\left(\sum_{i=1}^{M} \lambda_j \varphi_{S_j}\right)}

such that

\displaystyle  M \leq C(g) \frac{\log \alpha}{\varepsilon^2}\,,

where {C(g)} is some constant depending only on {g} .

Note that {\varphi} depends only on {g} , and thus {\|\varphi_S\|_{\infty}} only depends on {g} as well. Now each {\varphi_{S_j}} only depends on {m} variables (those in {S_j} and {|S_j|=m} ), meaning that our approximator {\tilde B_i} is an {h} -junta for

\displaystyle  h \leq m \cdot C(g) \frac{\log \alpha}{\varepsilon^2}\,. \ \ \ \ \ (5)

Oops. That doesn’t seem very good. The calculation in (3) needs that {\tilde B_i} is a {d} -junta, and certainly {d < m} (since {g} is a function on {\{0,1\}^m} ). Nevertheless, note that the approximator is a non-trivial junta. For instance, if {\alpha \leq 2^{\sqrt{n}}} , then it is an O(\sqrt{n}) -junta, recalling that m is a constant (depending on {g} ).

1.5. Random restriction saves the day

Let’s try to apply the logic of (3) to the {\tilde B_i} approximators anyway. Fix some {i \in [k]} and let {J_i} be the set of coordinates on which {\tilde B_i} depends. Then:

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] = \mathop{\mathbb E}_S\,A_i(S) \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[B_i(x) \,\big|\, x|_S=y\right]

Note that the map {y \mapsto \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[B_i(x) \,\big|\, x|_S=y\right]} is a junta on {J_i \cap S} . Thus if {|J_i \cap S| \leq d} , then the contribution from this term is non-negative since {\varphi} is {d} -locally positive. But {|S|=m} is fixed and {n} is growing, thus {|J_i \cap S| > d} is quite rare! Formally,

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq - \|A_i\|_{\infty}\, \mathbb P_S[|J_i \cap S| > d] \geq - \|A_i\|_{\infty} \frac{h^d (2m)^d}{n^d} \,.

In the last estimate, we have used a simple union bound and {n \geq 2m} .

Putting everything together and summing over {i \in [k]} , we conclude that

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq -\alpha \frac{h^d (2m)^d}{n^d}\,.

Note that by choosing {n} only moderately large, we will make this error term very small.

Moreover, since each {\tilde B_i} is a feasible point of the optimization (4), we have

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right]

\displaystyle  \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \left(-\varepsilon + \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right]\right)

\displaystyle  \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq -\varepsilon \sum_{i=1}^k \|A_i\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.

Almost there: Now observe that

\displaystyle  \|g\|_1 = \mathop{\mathbb E}_{S,x} [M_n^g(S,x)] = \sum_{i=1}^k \|A_i\|_1 \|B_i\|_1 = \sum_{i=1}^k \|A_i\|_1\,.

Plugging this into the preceding line yields

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] \geq -\varepsilon \|g\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.

Recalling now (2), we have derived a contradiction to {\alpha_+(M) \leq \alpha} if we can choose the right-hand side to be bigger than {\beta} (which is a negative constant depending only on {g} ). Setting {\varepsilon = -\beta/(2 \|g\|_1)} , we consult (5) to see that

\displaystyle  h \leq C'(g) m \log \alpha

for some other constant {C'(g)} depending only on {g} . We thus arrive at a contradiction if {\alpha = o((n/\log n)^d)} , recalling that {m, d} depend only on {g} . This completes the argument.

4 thoughts on “Entropy optimality: Non-negative rank lower bounds

  1. I’m still digesting this post, but it’s really nice! The entropy minimization argument seems to be used just to make a claim of the form that if the only thing you care about is to preserve correlations between a bounded function B and all (bounded) m-juntas you may as well assume the function B itself only depends on “few” coordinates – not much more than m times log(||B||_\infty). This is somewhat intuitive since you only want to preserve correlations up to an additive error, and a bounded function could possibly correlate highly with many juntas on disjoint coordinates.

    A couple comments:

    – In terms of the definition of \alpha_+, I can see loosely the connection with a factorization norm but I couldn’t formulate it precisely as one either. In that respect it might be more natural to have \alpha be defined as the infimum, over all decompositions, of the product of the two quantities that it is currently assumed to bound (the largest row norm of B and the sum of the column norms of A). In fact in the proof of Lemma 2 you end up with a bound that is not uniform, you have a much better bond on the A part (rM) than on the B part (rM/delta).

    – A couple lines before 1.2, “write write”.

    – In lemma 2 it might be helpful to clarify that the 1- and infinity-norms over M are element-wise (rather than the Schatten norms), as this confused me at first. Also, this is the first time so many norms show up, and it’s not clear at first why e.g. the distance between M and \tilde{M} are measured in 1-norm, but the bound on \alpha_+ involves the infinity-norm. Is there some good intuition for this choice of normalizations? The different norms show up repeatedly in the argument and I had trouble tracking down which bounds were critical and which were more loose. (Feel free to ignore this question, it’s a bit unfair…)

    – Right after the proof of Lemma 2, did you mean the inverse ratio, ||M||_1/||M||_\infty? The one you wrote is <<1, no?

  2. Hi Thomas, thanks for the comments!

    I fixed the typos and added an explanation of the norms. It should also address your last comment; all the norms are supposed to be averages, so |M|_\infty/|M|_1 makes sense.

    Your two more substantial comments are related. I agree that \alpha_+ should involve a product of the \ell_1(\ell_{\infty}) norm of B and the \ell_{\infty}(\ell_{\infty}) norm of A. I went back and forth on the correct way to write this. Writing as a product introduced yet another rescaling in the max-entropy argument. As you say, it creates an unnatural imbalance in the truncation argument.

    On the other hand, the imbalance would still persist: The eventual dependence on \|A_i\|_{\infty} is linear, while the dependence on \|B_i\|_{\infty} only comes in only through the entropy and is logarithmic.

    I am going to pause on writing it correctly until I write the quantum version. There things are more subtle (e.g., there is no canonical decomposition into B_i’s so they cannot be individually normalized in the L^1 norm), so hopefully the correct normalization will be more apparent.

    As for the way error is measured: We are trying to separate M from the convex set of matrices with small \alpha_+ value. So we are going to use a linear functional which, in this case is

    (S,x) \mapsto \varphi_S(x)

    The functional is bounded in L^{\infty} norm (simply because \varphi is fixed), so we need our approximator \tilde M to be good in the L^1 norm. The idea is that the error from approximation should be smaller than the “fatness” of the separating functional (which in the post is related to the value \beta ).

    Since we care about L^1 error, the correct way to write the bound in Lemma 2 is \|M-\tilde M\|_1 \leq \delta \|M\|_1 , i.e. with relative error. If you do that, you see that actually the \|M\|_{\infty} bound on \alpha_+ should be written as \|M\|_{\infty}/\|M\|_1 . This is great, as now the statement of the lemma is scale invariant.

    Finally, the reason that the ratio \|M\|_{\infty}/\|M\|_1 has to arise somewhere is in the heart of the truncation argument. Specifically, if B_i is big somewhere, then A_i must be small everywhere as long as M is uniformly bounded. This allows us to get rid of the large B_i’s and is an essential use of the non-negativity of the factorization.

    In any case, you make good points, and I struggled to address them without complicating the (very simple) truncation argument any further. I wonder even if introducing \alpha_+ is a good idea. Maybe Lemma 2 should just say what kind of approximator exists and dispense with \alpha_+ . I wanted to have it there for analogy since In the quantum setting, the analogous “factorization (not a) norm” is more important and subtle.

  3. I’m still digesting this post, but it’s really nice! The entropy minimization argument seems to be used just to make a claim of the form that if the only thing you care about is to preserve correlations between a bounded function B and all (bounded) m-juntas you may as well assume the function B itself only depends on “few” coordinates – not much more than m times log(||B||_\infty). This is somewhat intuitive since you only want to preserve correlations up to an additive error, and a bounded function could possibly correlate highly with many juntas on disjoint coordinates.

    Yes, this is exactly right.

    And, in fact, when you really have this explicit structure, there are other ways at arriving at this approximation. We could build a junta as follows. Suppose B is a distribution on {0,1}^n with very high entropy, e.g. n-t (so t = relative entropy). Find a junta approximation as follows: Initially, the junta set J is empty. Now if B has a large low-degree Fourier coefficient S that is not completely contained in J, then add all the coordinates in S to J. Repeat until no such coefficients remain.

    Since B has n-t bits of entropy, this cannot go on for too long. Every time we find a new set S on which B is biased, our estimate of its entropy decreases. At the end, we have constructed an approximation B’ that is a J-junta. It’s an approximation in the sense that B-B’ has no large low-degree Fourier coefficients.

    Now do the random restriction to a subset R. As in the current argument, B’|_R will likely have only few junta coordinates. At the same time, the error B-B’|_R will shrink dramatically because a high-degree Fourier coefficient will likely not survive the restriction (all its coordinates will have to fall in R).

    This is the argument from our FOCS 2013 paper.

    The disadvantage is that (a) it’s more complicated, and (b) it requires looking very carefully at the kind of structure you want the approximator to have. (Notice, for instance, that the entropy-minimization argument does not need to refer to Fourier coefficients, and can potentially work just as well over other domains like the set of traveling salesman tours in the complete graph without getting into representation theory of the symmetric group!)

    David and Prasad and I spent well over a year trying to figure out the analogous structural characterization for the SDP setting. In some sense, we were forced to discover the (quantum) entropy-minimization approach because we couldn’t get a handle on the right structure. We had very complicated arguments for special cases.

    In the LP setting, one has r functions B_1, …, B_r. In the SDP setting, one has an r-dimensional subspace V of functions and you have to control p^2 for every p \in V  simultaneously. How do you approximate a subspace by a simpler one? and how does random restriction make the subspace nicer? (It’s a tricky question, and choosing a basis for V is clearly a horrible idea–since it breaks invariance–and it doesn’t seem to work.)

    The entropy-minimization philosophy is: Pick your tests. Now if:

    (i) The tests are simple.
    (ii) The tests are smooth (the L^\infty norm is small)
    (iii) You allow some error \eps

    Then the entropy-optimal approximator will also be “simple” (as it is a simple combination of simple things).

    It’s nice because the tests (which often come up front for a natural reason) dictate the structure. As we’ll see soon, this also works in the SDP/quantum setting.

  4. Thanks for the clarifications, they’re helpful! I’m wondering if indeed it wouldn’t be easier to present things in a different order, with Lemma 2 coming at the end:
    – You start with any low-rank factorization of M, with the normalization ||B_i||_1=1 wlog. (From which the bound ||A_i||_\infty\leq ||M||_\infty follows immediately and can be used as needed).
    – Then you observe that the entropy minimization argument, together with the random restriction step, gives you an upper bound on the degree of g that depends on the entropy Ent(B_i). It is clear why this comes in: it is the basis for the entropy argument.
    – Finally you need to deal with the fact that Ent(B_i) could be large. For this it is natural to treat high-entropy rows as being rare and try to argue that, for the purposes of the overall argument, they cannot do too much harm. Your explanations on getting a L1-approximant then make it more or less clear how Lemma 2 should come in, and you end up removing large-L_infty columns and using duality to bound the effect on M in L1.

    Ok, I’m not suggesting this is how it should be done; it’s mostly for my own benefit. You’re probably right that the “quantum” setting will help clarify, by seeing the argument once more in a setting where perhaps fewer types of approximations are possible. It’s like trying to understand duality by working with L2, it’s easy to get it the wrong way round :-)

    (Minor typo: you might be missing an A_i(S) on the rhs of the first centered equation in section 1.5)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s