Entropy optimality: Non-negative rank lower bounds

Using the notation from the last post, our goal is now to prove the following theorem.

Theorem 1 For every {m \geq 1} and {g : \{0,1\}^m \rightarrow \mathbb R_+} , there is a constant {C=C(g)} such that for all {n \geq 2m} ,

\displaystyle  \mathrm{rank}_+(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{\mathrm{deg}_J(g)-1}\,.

1.1. Convex relaxations of non-negative rank

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets {X} and {Y} and a matrix {M : X \times Y \rightarrow \mathbb R_+} with {r=\mathrm{rank}_+(M)} .

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of {\{ N \in \mathbb R_+^{X \times Y} : \mathrm{rank}_+(N) = 1 \}} is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both {X} and {Y} with the uniform measure. Let {\mathcal Q = \{ b : Y \rightarrow \mathbb R_+ \mid \|b\|_1 = 1\}} denote the set of probability densities on {Y} . Now define

\displaystyle  \alpha_+(N) = \min \Big\{ \alpha : \exists A \in \mathbb R_+^{X \times k}, B \in \mathbb R_+^{k \times Y} \textrm{ with } N=AB, \textrm{ and}

\displaystyle  \qquad\qquad\qquad\qquad\qquad\{B_1, \ldots, B_k\} \subseteq \mathcal Q, \textrm{ and }

\displaystyle  \hphantom{\{B_1, \ldots, B_k\} \subseteq \mathcal Q,} \max_{i \in [k]} \|B_i\|_{\infty} \leq \alpha, \sum_{i=1}^k \|A_i\|_{\infty} \leq \alpha\Big\}.

Here {\{A_i\}} are the columns of {A} and {\{B_i\}} are the rows of {B} . Note that now {k} is unconstrained.
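To make the definition concrete, here is a minimal numerical sketch (Python with numpy; the helper name alpha_of_factorization is just for illustration) of the budget certified by a single factorization {N=AB} ; the quantity {\alpha_+(N)} is then the infimum of this budget over all non-negative factorizations, with {k} unconstrained.

```python
import numpy as np

def alpha_of_factorization(A, B):
    # A: |X| x k, B: k x |Y|, both entrywise non-negative, with N = A @ B.
    # Rescale so that each row of B is a probability density with respect
    # to the uniform measure on Y (i.e. has mean 1), pushing the scaling
    # factors into the corresponding columns of A.
    c = B.mean(axis=1)
    assert np.all(c > 0), "drop identically zero rows of B first"
    B = B / c[:, None]
    A = A * c[None, :]
    # The budget certified by this particular factorization:
    return max(B.max(),                  # max_i ||B_i||_inf
               A.max(axis=0).sum())      # sum_i ||A_i||_inf
```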

Observe that {\alpha_+} is a convex function. To see this, given a pair {N=AB} and {N'=A'B'} , write

\displaystyle  \frac{N+N'}{2} = \left(\begin{array}{cc} \frac12 A & \frac12 A' \end{array}\right) \left(\begin{array}{l} \vphantom{\bigoplus} B \\ \vphantom{\bigoplus} B' \end{array}\right)\,,

witnessing the fact that {\alpha_+(\frac12(N+N')) \leq \frac12 \alpha_+(N) + \frac12 \alpha_+(N')} .

1.2. A truncation argument

So the set {\{ N : \alpha_+(N) \leq c \}} is convex, but it’s not yet clear how this relates to {\mathrm{rank}_+} . We will see now that low non-negative rank matrices are close to matrices with {\alpha_+} small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms {\|M\|_1 = \mathbb E_{x,y} |M_{x,y}|} and {\|M\|_{\infty} = \max_{x,y} |M_{x,y}|} .

Lemma 2 For every non-negative {M \in \mathbb R_+^{X \times Y}} with {\mathrm{rank}_+(M) \leq r} and every {\delta \in (0,1)} , there is a matrix {\tilde M \in \mathbb R_+^{X \times Y}} such that

\displaystyle  \|M-\tilde M\|_1 \leq \delta

and

\displaystyle  \alpha_+(\tilde M) \leq \frac{r}{\delta} \|M\|_{\infty}\,.

Proof: Suppose that {M=AB} with {A \in \mathbb R_+^{X \times r}, B \in \mathbb R_+^{r \times Y}} , and let us interpret this factorization in the form

\displaystyle  M(x,y) = \sum_{i=1}^r A_i(x) B_i(y) \ \ \ \ \ (1)

(where {\{A_i\}} are the columns of {A} and {\{B_i\}} are the rows of {B} ). By rescaling the columns of {A} and the rows of {B} , respectively (and discarding any terms with {B_i \equiv 0} ), we may assume that {\mathop{\mathbb E}_{y}[B_i]=1} , i.e. {\|B_i\|_1=1} , for every {i \in [r]} .

Let {\Lambda = \{ i : \|B_i\|_{\infty} > \tau \}} denote the “bad set” of indices (we will choose {\tau} momentarily). Observe that if {i \in \Lambda} , then

\displaystyle  \|A_i\|_{\infty} \leq \frac{\|M\|_{\infty}}{\tau}\,,

from the representation (1) and the fact that all summands are non-negative.

Define the matrix {\tilde M(x,y) = \sum_{i \notin \Lambda} A_i(x) B_i(y)} . It follows that

\displaystyle  \|M-\tilde M\|_1 = \mathop{\mathbb E}_{x,y} \left[|M(x,y)-\tilde M(x,y)|\right] = \sum_{i \in \Lambda} \mathop{\mathbb E}_{x,y} [A_i(x) B_i(y)]\,.

Each of the latter terms is at most {\|A_i\|_{\infty} \|B_i\|_1 \leq \frac{\|M\|_{\infty}}{\tau}} and {|\Lambda| \leq r} , thus

\displaystyle  \|M-\tilde M\|_1 \leq r \frac{\|M\|_{\infty}}{\tau}\,.

Next, observe that

\displaystyle  \mathop{\mathbb E}_y [M(x,y)] = \sum_{i=1}^r A_i(x) \|B_i\|_1 = \sum_{i=1}^r A_i(x)\,,

and since the left-hand side is at most {\|M\|_{\infty}} for every {x} , it follows that {\|A_i\|_{\infty} \leq \|M\|_{\infty}} and thus {\sum_{i=1}^r \|A_i\|_{\infty} \leq r \|M\|_{\infty}} .

Setting {\tau = r \|M\|_{\infty}/\delta} yields the statement of the lemma. \Box
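The truncation in the proof is easy to carry out explicitly. Here is a hedged numpy sketch (the helper name truncate is hypothetical): after the rescaling step, it simply drops the rows of {B} whose sup-norm exceeds {\tau = r\|M\|_{\infty}/\delta} .

```python
import numpy as np

def truncate(A, B, delta):
    # Sketch of the proof of Lemma 2: given a factorization M = A @ B with
    # A, B entrywise non-negative and inner dimension r, return a
    # factorization of a matrix M~ with ||M - M~||_1 <= delta and
    # alpha_+(M~) <= r * ||M||_inf / delta.
    r = B.shape[0]
    M_inf = (A @ B).max()
    # Rescale so every (nonzero) row of B is a density (has mean 1).
    c = B.mean(axis=1)
    A, B, c = A[:, c > 0], B[c > 0], c[c > 0]
    B = B / c[:, None]
    A = A * c[None, :]
    # Discard the "bad" indices i with ||B_i||_inf > tau.
    tau = r * M_inf / delta
    keep = B.max(axis=1) <= tau
    return A[:, keep], B[keep]
```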

Generally, the ratio {\frac{\|M\|_{\infty}}{\|M\|_1}} will be small compared to {r} (e.g., polynomial in {n} vs. super-polynomial in {n} ). Thus from now on, we will actually prove a lower bound on {\alpha_+(M)} . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

1.3. The test functionals

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix {M_n^g} , we will now assume that {\alpha_+(M_n^g) \leq \alpha} . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix {M_n^g} . We will consider the uniform measures on {{[n] \choose m}} and {\{0,1\}^n} . We use {\mathop{\mathbb E}_S} and {\mathop{\mathbb E}_x} to denote averaging with respect to these measures.

Let {d=\deg_J(g)-1} . From the last post, we know there exists a {d} -locally positive functional {\varphi : \{0,1\}^m \rightarrow \mathbb R} such that {\beta := \mathop{\mathbb E}_x \varphi(x) g(x) < 0} , and {\mathop{\mathbb E}_x \varphi(x) q(x) \geq 0} for every non-negative {d} -junta {q} . By rescaling {\varphi} , we may also assume that {\|\varphi\|_{\infty} \leq 1} (this changes {\beta} , but it remains a negative constant depending only on {g} ).

For {S \subseteq [n]} with {|S|=m} , let us denote {\varphi_S(x) = \varphi(x|_S)} . These functionals prove lower bounds on the junta-degree of {g} restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family {\{\varphi_S\}} as our test functions.
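For concreteness, here is a small Python sketch (the helper names are hypothetical, and {\varphi} is stored as a dictionary keyed by {m} -bit tuples) of the restricted functionals {\varphi_S} and of a finite check of {d} -local positivity: it suffices to test the indicator juntas {y \mapsto \mathbf{1}_{\{y|_T = z\}}} with {|T| = d} , since every non-negative {d} -junta is a non-negative combination of these.

```python
from itertools import combinations, product

def phi_S(phi, S, x):
    # phi_S(x) = phi(x|_S): phi is a dict keyed by m-bit tuples,
    # S is a sorted tuple of m coordinates, and x is an n-bit tuple.
    return phi[tuple(x[j] for j in S)]

def is_d_locally_positive(phi, m, d):
    # E_y[phi(y) q(y)] >= 0 for every non-negative d-junta q iff the sum of
    # phi over every subcube {y : y|_T = z} with |T| = d is non-negative.
    cube = list(product((0, 1), repeat=m))
    for T in combinations(range(m), d):
        for z in product((0, 1), repeat=d):
            if sum(phi[y] for y in cube
                   if all(y[T[j]] == z[j] for j in range(d))) < 0:
                return False
    return True
```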

1.4. Entropy maximization

Use {\alpha_+(M_n^g) \leq \alpha} to write

\displaystyle  M_n^g(S,x) = \sum_{i=1}^k A_i(S) B_i(x)\,,

where {A_i, B_i \geq 0} and we have {\|B_i\|_1=1} and {\|B_i\|_{\infty} \leq \alpha} for all {i \in [k]} , and finally {\sum_{i=1}^k \|A_i\|_{\infty} \leq \alpha} .

First, as we observed last time, if each {B_i} were a {d} -junta, we would have a contradiction:

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) M_n^g(S,x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^m} \varphi(y) g(y) = \beta < 0\,, \ \ \ \ \ (2)

and yet

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) \sum_{i=1}^k A_i(S) B_i(x)\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] \geq 0 \ \ \ \ \ (3)

because {\mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right] \geq 0} since {\varphi} is {d} -locally positive and the function {y \mapsto \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right]} is a {d} -junta.
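As a sanity check of (2) for tiny parameters, one can verify the identity numerically; the sketch below assumes (recalling the last post) that {M_n^g(S,x) = g(x|_S)} , with {g} and {\varphi} stored as dictionaries keyed by {m} -bit tuples.

```python
from itertools import combinations, product
from statistics import mean

def check_identity_2(g, phi, m, n):
    # Verify (2) for small m, n, assuming M_n^g(S, x) = g(x|_S):
    # the average of phi_S(x) * g(x|_S) over uniform S and x equals
    # the average of phi(y) * g(y) over y in {0,1}^m.
    lhs = mean(phi[tuple(x[j] for j in S)] * g[tuple(x[j] for j in S)]
               for S in combinations(range(n), m)
               for x in product((0, 1), repeat=n))
    rhs = mean(phi[y] * g[y] for y in product((0, 1), repeat=m))
    return abs(lhs - rhs) < 1e-9
```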

So now the key step: Use entropy maximization to approximate {B_i} by a junta! In future posts, we will need to consider the entire package {\{B_1, \ldots, B_k\}} of functions simultaneously. But for the present lower bound, it suffices to consider each {B_i} separately.

Consider the following optimization over variables {\{\tilde B_i(x) : x \in \{0,1\}^n\}} :

\displaystyle  \mathrm{minimize }\qquad \mathrm{Ent}(\tilde B_i) = \mathbb E [\tilde B_i \log \tilde B_i]

\displaystyle  \qquad\textrm{ subject to } \qquad \qquad\mathbb E[\tilde B_i]=1, \quad \tilde B_i(x) \geq 0 \quad \forall x \in \{0,1\}^n

\displaystyle  \hphantom{\bigoplus} \qquad \qquad \qquad \qquad \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right] \leq \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] + \varepsilon \qquad \forall |S|=m\,. \ \ \ \ \ (4)

The next claim follows immediately from Theorem 1 of the earlier post on solving the max-entropy optimization by sub-gradient descent.

Claim 1 There exists a function {\tilde B_i : \{0,1\}^n \rightarrow \mathbb R_+} satisfying all the preceding constraints and of the form

\displaystyle  \tilde B_i = \frac{\exp\left(\sum_{j=1}^{M} \lambda_j \varphi_{S_j}\right)}{\mathop{\mathbb E} \exp\left(\sum_{j=1}^{M} \lambda_j \varphi_{S_j}\right)}

such that

\displaystyle  M \leq C(g) \frac{\log \alpha}{\varepsilon^2}\,,

where {C(g)} is some constant depending only on {g} .
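Here is a hedged Python sketch of the sub-gradient strategy behind Claim 1. The step size and stopping rule are placeholders, and the quantitative bound on {M} is the content of the referenced theorem, not of this code; each iteration adds one term {\lambda_j \varphi_{S_j}} to the exponent, so the number of iterations plays the role of {M} .

```python
import numpy as np

def junta_approximator(B, phis, eps, eta, max_iter=10_000):
    # Maintain a density B_tilde = exp(sum_j lambda_j * phi_{S_j}) / E[exp(...)].
    # Whenever some test functional phi_S sees B_tilde exceed B by more than
    # eps, take a step of size eta against that functional.
    # B and each phis[S] are numpy arrays indexed by the points of {0,1}^n.
    targets = {S: (phi * B).mean() for S, phi in phis.items()}
    exponent = np.zeros_like(B, dtype=float)
    for _ in range(max_iter):
        B_tilde = np.exp(exponent)
        B_tilde /= B_tilde.mean()                # renormalize to a density
        violated = [S for S, phi in phis.items()
                    if (phi * B_tilde).mean() > targets[S] + eps]
        if not violated:
            return B_tilde                       # feasible for (4)
        exponent -= eta * phis[violated[0]]      # sub-gradient step
    return None                                  # no convergence within max_iter
```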

Note that {\varphi} depends only on {g} , and thus {\|\varphi_S\|_{\infty}} only depends on {g} as well. Now each {\varphi_{S_j}} depends only on the {m} variables in {S_j} (recall {|S_j|=m} ), meaning that our approximator {\tilde B_i} is an {h} -junta for

\displaystyle  h \leq m \cdot C(g) \frac{\log \alpha}{\varepsilon^2}\,. \ \ \ \ \ (5)

Oops. That doesn’t seem very good. The calculation in (3) needs that {\tilde B_i} is a {d} -junta, and certainly {d < m} (since {g} is a function on {\{0,1\}^m} ). Nevertheless, note that the approximator is a non-trivial junta. For instance, if {\alpha \leq 2^{\sqrt{n}}} , then it is an {O(\sqrt{n})} -junta, recalling that {m} is a constant (depending on {g} ).

1.5. Random restriction saves the day

Let’s try to apply the logic of (3) to the {\tilde B_i} approximators anyway. Fix some {i \in [k]} and let {J_i} be the set of coordinates on which {\tilde B_i} depends. Then:

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] = \mathop{\mathbb E}_S\,A_i(S) \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[\tilde B_i(x) \,\big|\, x|_S=y\right]\,.

Note that the map {y \mapsto \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[\tilde B_i(x) \,\big|\, x|_S=y\right]} is a non-negative junta on {J_i \cap S} . Thus if {|J_i \cap S| \leq d} , then the contribution from this term is non-negative since {\varphi} is {d} -locally positive. But {|S|=m} is fixed and {n} is growing, thus {|J_i \cap S| > d} is quite rare! Formally,

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq - \|A_i\|_{\infty}\, \mathbb P_S[|J_i \cap S| > d] \geq - \|A_i\|_{\infty} \frac{h^d (2m)^d}{n^d} \,.

In the last estimate, we have used a simple union bound, the normalization {\|\varphi\|_{\infty} \leq 1} , and {n \geq 2m} .
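For a feel for the numbers, here is a quick Python check of the exact hypergeometric tail against the bound {h^d (2m)^d / n^d} (the parameter values below are arbitrary).

```python
from math import comb

def restriction_tail(n, m, h, d):
    # Exact P_S[|J ∩ S| > d] for a uniformly random m-subset S of [n],
    # when |J| = h (a hypergeometric tail).
    return sum(comb(h, t) * comb(n - h, m - t)
               for t in range(d + 1, m + 1)) / comb(n, m)

# For example, with n = 10_000, m = 6, h = 60, d = 2:
#   restriction_tail(10_000, 6, 60, 2)   is about 4.1e-6,
# while the bound h**d * (2*m)**d / n**d is about 5.2e-3.
```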

Putting everything together and summing over {i \in [k]} , we conclude that

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq -\alpha \frac{h^d (2m)^d}{n^d}\,.

Note that by choosing {n} only moderately large, we will make this error term very small.

Moreover, since each {\tilde B_i} is a feasible point of the optimization (4), we have

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right]

\displaystyle  \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \left(-\varepsilon + \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right]\right)

\displaystyle  \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq -\varepsilon \sum_{i=1}^k \|A_i\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.

Almost there: Now observe that

\displaystyle  \|g\|_1 = \mathop{\mathbb E}_{S,x} [M_n^g(S,x)] = \sum_{i=1}^k \|A_i\|_1 \|B_i\|_1 = \sum_{i=1}^k \|A_i\|_1\,.

Plugging this into the preceding line yields

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] \geq -\varepsilon \|g\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.

Recalling now (2), we have derived a contradiction to {\alpha_+(M_n^g) \leq \alpha} if we can choose the right-hand side to be bigger than {\beta} (which is a negative constant depending only on {g} ). Setting {\varepsilon = -\beta/(2 \|g\|_1)} , we consult (5) to see that

\displaystyle  h \leq C'(g) m \log \alpha

for some other constant {C'(g)} depending only on {g} . We thus arrive at a contradiction if {\alpha = o((n/\log n)^d)} , recalling that {m, d} depend only on {g} . This completes the argument.
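To spell out the arithmetic of the last step: with this choice of {\varepsilon} , comparing with (2) yields a contradiction as soon as

\displaystyle  \alpha\, \frac{h^d (2m)^d}{n^d} \leq \alpha \left(\frac{2 C'(g)\, m^2 \log \alpha}{n}\right)^{d} < \frac{|\beta|}{2}\,,

where the first inequality uses {h \leq C'(g)\, m \log \alpha} . If {\alpha \leq c(g) (n/\log n)^d} with {c(g) \leq 1} , then {\log \alpha \leq d \log n} and the middle expression is at most {c(g) \left(2 C'(g)\, m^2 d\right)^d} , so a sufficiently small constant {c(g)} depending only on {g} yields the required contradiction.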