Entropy optimality: Non-negative rank lower bounds

Using the notation from the last post, our goal is now to prove the following theorem.

Theorem 1 For every {m \geq 1} and {g : \{0,1\}^m \rightarrow \mathbb R_+} , there is a constant {C=C(g)} such that for all {n \geq 2m} ,

\displaystyle  \mathrm{rank}_+(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{\mathrm{deg}_J(g)-1}\,.

1.1. Convex relaxations of non-negative rank

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets {X} and {Y} and a matrix {M : X \times Y \rightarrow \mathbb R_+} with {r=\mathrm{rank}_+(M)} .

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of {\{ N \in \mathbb R_+^{X \times Y} : \mathrm{rank}_+(N) = 1 \}} is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both {X} and {Y} with the uniform measure. Let {\mathcal Q = \{ b : Y \rightarrow \mathbb R_+ \mid \|b\|_1 = 1\}} denote the set of probability densities on {Y} . Now define

\displaystyle  \alpha_+(N) = \min \Big\{ \alpha : \exists A \in \mathbb R_+^{X \times k}, B \in \mathbb R_+^{k \times Y} \textrm{ with } N=AB, \textrm{ and}

\displaystyle  \qquad\qquad\qquad\qquad\qquad\{B_1, \ldots, B_k\} \subseteq \mathcal Q, \textrm{ and }

\displaystyle  \hphantom{\{B_1, \ldots, B_k\} \subseteq \mathcal Q,} \max_{i \in [k]} \|B_i\|_{\infty} \leq \alpha, \sum_{i=1}^k \|A^{(i)}\|_{\infty} \leq \alpha\Big\}.

Here {\{A^{(i)}\}} are the columns of {A} and {\{B_i\}} are the rows of {B} . Note that now {k} is unconstrained.

Observe that {\alpha_+} is a convex function. To see this, given a pair {N=AB} and {N'=A'B'} , write

\displaystyle  \frac{N+N'}{2} = \left(\begin{array}{cc} \frac12 A & \frac12 A' \end{array}\right) \left(\begin{array}{l} \vphantom{\bigoplus} B \\ \vphantom{\bigoplus} B' \end{array}\right)\,,

witnessing the fact that {\alpha_+(\frac12(N+N')) \leq \frac12 \alpha_+(N) + \frac12 \alpha_+(N')} .

1.2. A truncation argument

So the set {\{ N : \alpha_+(N) \leq c \}} is convex, but it’s not yet clear how this relates to {\mathrm{rank}_+} . We will see now that low non-negative rank matrices are close to matrices with {\alpha_+} small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms {\|M\|_1 = \mathbb E_{x,y} |M_{x,y}|} and {\|M\|_{\infty} = \max_{x,y} |M_{x,y}|} .

Lemma 2 For every non-negative {M \in \mathbb R_+^{X \times Y}} with {\mathrm{rank}_+(M) \leq r} and every {\delta \in (0,1)} , there is a matrix {\tilde M \in \mathbb R_+^{X \times Y}} such that

\displaystyle  \|M-\tilde M\|_1 \leq \delta


\displaystyle  \alpha_+(\tilde M) \leq \frac{r}{\delta} \|M\|_{\infty}\,.

Proof: Suppose that {M=AB} with {A \in \mathbb R_+^{X \times r}, B \in \mathbb R_+^{r \times Y}} , and let us interpret this factorization in the form

\displaystyle  M(x,y) = \sum_{i=1}^r A_i(x) B_i(y) \ \ \ \ \ (1)

(where {\{A_i\}} are the columns of {A} and {\{B_i\}} are the rows of {B} ). By rescaling the columns of {A} and the rows of {B} , respectively, we may assume that {\mathop{\mathbb E}_{\mu}[B_i]=1} for every {i \in [r]} .

Let {\Lambda = \{ i : \|B_i\|_{\infty} > \tau \}} denote the “bad set” of indices (we will choose {\tau} momentarily). Observe that if {i \in \Lambda} , then

\displaystyle  \|A_i\|_{\infty} \leq \frac{\|M\|_{\infty}}{\tau}\,,

from the representation (1) and the fact that all summands are positive.

Define the matrix {\tilde M(x,y) = \sum_{i \notin \Lambda} A_i(x) B_i(y)} . It follows that

\displaystyle  \|M-\tilde M\|_1 = \mathop{\mathbb E}_{x,y} \left[|M(x,y)-\tilde M(x,y)|\right] = \sum_{i \in \Lambda} \mathop{\mathbb E}_{x,y} [A_i(x) B_i(y)]\,.

Each of the latter terms is at most {\|A_i\|_{\infty} \|B_i\|_1 \leq \frac{\|M\|_{\infty}}{\tau}} and {|\Lambda| \leq r} , thus

\displaystyle  \|M-\tilde M\|_1 \leq r \frac{\|M\|_{\infty}}{\tau}\,.

Next, observe that

\displaystyle  \mathop{\mathbb E}_y [M(x,y)] = \sum_{i=1}^r A_i(x) \|B_i\|_1 = \sum_{i=1}^r A_i(x)\,,

implying that {\|A_i\|_{\infty} \leq \|M\|_{\infty}} and thus {\sum_{i=1}^r \|A_i\|_{\infty} \leq r \|M\|_{\infty}} .

Setting {\tau = r \|M\|_{\infty}/\delta} yields the statement of the lemma. \Box

Generally, the ratio {\frac{\|M\|_{\infty}}{\|M\|_1}} will be small compared to {r} (e.g., polynomial in {n} vs. super-polynomial in {n} ). Thus from now on, we will actually prove a lower bound on {\alpha_+(M)} . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

1.3. The test functionals

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix {M_n^g} , we will now assume that {\alpha_+(M_n^g) \leq \alpha} . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix {M_n^g} . We will consider the uniform measures on {{[n] \choose m}} and {\{0,1\}^n} . We use {\mathop{\mathbb E}_S} and {\mathop{\mathbb E}_x} to denote averaging with respect to these measures.

Let {d=\deg_J(g)-1} . From the last post, we know there exists a {d} -locally positive functional {\varphi : \{0,1\}^n \rightarrow \mathbb R} such that {\beta := \mathop{\mathbb E}_x \varphi(x) g(x) < 0} , and {\mathop{\mathbb E}_x \varphi(x) q(x) \geq 0} for every {d} -junta {q} .

For {S \subseteq [n]} with {|S|=m} , let us denote {\varphi_S(x) = \varphi(x|_S)} . These functionals prove lower bounds on the junta-degree of {g} restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family {\{\varphi_S\}} as our test functions.

1.4. Entropy maximization

Use {\alpha_+(M_n^g) \leq \alpha} to write

\displaystyle  M_n^g(S,x) = \sum_{i=1}^k A_i(S) B_i(x)\,,

where {A_i, B_i \geq 0} and we have {\|B_i\|_1=1} and {\|B_i\|_{\infty} \leq \alpha} for all {i \in [k]} , and finally {\sum_{i=1}^k \|A_i\|_{\infty} \leq \alpha} .

First, as we observed last time, if each {B_i} were a {d} -junta, we would have a contradiction:

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) M_n^g(S,x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^m} \varphi(y) g(y) = \beta < 0\,, \ \ \ \ \ (2)

and yet

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) \sum_{i=1}^k A_i(S) B_i(x)]\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] \geq 0 \ \ \ \ \ (3)

because {\mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right] \geq 0} since {\varphi} is {d} -locally positive and the function {y \mapsto \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right]} is a {d} -junta.

So now the key step: Use entropy maximization to approximate {B_i} by a junta! In future posts, we will need to consider the entire package {\{B_1, \ldots, B_k\}} of functions simultaneously. But for the present lower bound, it suffices to consider each {B_i} separately.

Consider the following optimization over variables {\{\tilde B_i(x) : x \in \{0,1\}^n\}} :

\displaystyle  \mathrm{minimize }\qquad \mathrm{Ent}(\tilde B_i) = \mathbb E [\tilde B_i \log \tilde B_i]

\displaystyle  \qquad\textrm{ subject to } \qquad \qquad\mathbb E[\tilde B_i]=1, \quad \tilde B_i(x) \geq 0 \quad \forall x \in \{0,1\}^n

\displaystyle  \hphantom{\bigoplus} \qquad \qquad \qquad \qquad \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right] \leq \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] + \varepsilon \qquad \forall |S|=m\,. \ \ \ \ \ (4)

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1 There exists a function {\tilde B_i : \{0,1\}^n \rightarrow \mathbb R_+^n} satisfying all the preceding constraints and of the form

\displaystyle  \tilde B_i = \frac{\exp\left(\sum_{j=1}^{M} \lambda_j \varphi_{S_j}\right)}{\mathop{\mathbb E} \exp\left(\sum_{i=1}^{M} \lambda_j \varphi_{S_j}\right)}

such that

\displaystyle  M \leq C(g) \frac{\log \alpha}{\varepsilon^2}\,,

where {C(g)} is some constant depending only on {g} .

Note that {\varphi} depends only on {g} , and thus {\|\varphi_S\|_{\infty}} only depends on {g} as well. Now each {\varphi_{S_j}} only depends on {m} variables (those in {S_j} and {|S_j|=m} ), meaning that our approximator {\tilde B_i} is an {h} -junta for

\displaystyle  h \leq m \cdot C(g) \frac{\log \alpha}{\varepsilon^2}\,. \ \ \ \ \ (5)

Oops. That doesn’t seem very good. The calculation in (3) needs that {\tilde B_i} is a {d} -junta, and certainly {d < m} (since {g} is a function on {\{0,1\}^m} ). Nevertheless, note that the approximator is a non-trivial junta. For instance, if {\alpha \leq 2^{\sqrt{n}}} , then it is an O(\sqrt{n}) -junta, recalling that m is a constant (depending on {g} ).

1.5. Random restriction saves the day

Let’s try to apply the logic of (3) to the {\tilde B_i} approximators anyway. Fix some {i \in [k]} and let {J_i} be the set of coordinates on which {\tilde B_i} depends. Then:

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] = \mathop{\mathbb E}_S\,A_i(S) \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[B_i(x) \,\big|\, x|_S=y\right]

Note that the map {y \mapsto \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[B_i(x) \,\big|\, x|_S=y\right]} is a junta on {J_i \cap S} . Thus if {|J_i \cap S| \leq d} , then the contribution from this term is non-negative since {\varphi} is {d} -locally positive. But {|S|=m} is fixed and {n} is growing, thus {|J_i \cap S| > d} is quite rare! Formally,

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq - \|A_i\|_{\infty}\, \mathbb P_S[|J_i \cap S| > d] \geq - \|A_i\|_{\infty} \frac{h^d (2m)^d}{n^d} \,.

In the last estimate, we have used a simple union bound and {n \geq 2m} .

Putting everything together and summing over {i \in [k]} , we conclude that

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq -\alpha \frac{h^d (2m)^d}{n^d}\,.

Note that by choosing {n} only moderately large, we will make this error term very small.

Moreover, since each {\tilde B_i} is a feasible point of the optimization (4), we have

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right]

\displaystyle  \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \left(-\varepsilon + \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right]\right)

\displaystyle  \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq -\varepsilon \sum_{i=1}^k \|A_i\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.

Almost there: Now observe that

\displaystyle  \|g\|_1 = \mathop{\mathbb E}_{S,x} [M_n^g(S,x)] = \sum_{i=1}^k \|A_i\|_1 \|B_i\|_1 = \sum_{i=1}^k \|A_i\|_1\,.

Plugging this into the preceding line yields

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] \geq -\varepsilon \|g\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.

Recalling now (2), we have derived a contradiction to {\alpha_+(M) \leq \alpha} if we can choose the right-hand side to be bigger than {\beta} (which is a negative constant depending only on {g} ). Setting {\varepsilon = -\beta/(2 \|g\|_1)} , we consult (5) to see that

\displaystyle  h \leq C'(g) m \log \alpha

for some other constant {C'(g)} depending only on {g} . We thus arrive at a contradiction if {\alpha = o((n/\log n)^d)} , recalling that {m, d} depend only on {g} . This completes the argument.

Primer: Lifts of polytopes and non-negative matrix factorization

In preparation for the next post on entropy optimality, we need a little background on lifts of polytopes and non-negative matrix factorization.

1.1. Polytopes and inequalities

A {d} -dimensional convex polytope {P \subseteq \mathbb R^d} is the convex hull of a finite set of points in {\mathbb R^d} . Equivalently, it is a compact set defined by a family of linear inequalities

\displaystyle  P = \{ x \in \mathbb R^d : Ax \leq b \}

for some matrix {A \in \mathbb R^{m \times d}} .

Let us give a measure of complexity for {P} : Define {\gamma(P)} to be the smallest number {m} such that for some {C \in \mathbb R^{s \times d}, y \in \mathbb R^s, A \in \mathbb R^{m \times d}, b \in \mathbb R^m} , we have

\displaystyle  P = \{ x \in \mathbb R^d : Cx=y \textrm{ and } Ax \leq b\}\,.

In other words, this is the minimum number of inequalities needed to describe {P} . If {P} is full-dimensional, then this is precisely the number of facets of {P} (a facet is a maximal proper face of {P} ).

Thinking of {\gamma(P)} as a measure of complexity makes sense from the point of view of optimization: Interior point methods can efficiently optimize linear functions over {P} (to arbitrary accuracy) in time that is polynomial in {\gamma(P)} .

1.2. Lifts of polytopes

Many simple polytopes require a large number of inequalities to describe. For instance, the cross-polytope

\displaystyle  C_d = \{ x \in \mathbb R^d : \|x\|_1 \leq 1 \} = \{ x \in \mathbb R^d : \pm x_1 \pm x_2 \cdots \pm x_d \leq 1 \}

has {\gamma(C_d)=2^d} . On the other hand, {C_d} is the projection of the polytope

\displaystyle  Q_d = \left\{ (x,y) \in \mathbb R^{2d} : \sum_{i=1}^n y_i = 1, \,\, - y_i \leq x_i \leq y_i\,\forall i\right\}

onto the {x} coordinates, and manifestly, {\gamma(Q_d) \leq 2d} . Thus {C_d} is the (linear) shadow of a much simpler polytope in a higher dimension.

[image credit: Fiorini, Rothvoss, and Tiwary]

A polytope {Q} is called a lift of the polytope {P} if {P} is the image of {Q} under a linear projection. Again, from an optimization stand point, lifts are important: If we can optimize linear functionals over {Q} , then we can optimize linear functionals over {P} . Define now {\bar \gamma(P)} to be the minimal value of {\gamma(Q)} over all lifts {Q} of {P} . (The value {\bar \gamma(P)} is sometimes called the (linear) extension complexity of {P} .)

1.3. The permutahedron

Here is a somewhat more interesting family of examples where lifts reduce complexity. The permutahedron {\Pi_n \subseteq \mathbb R^n} is the convex hull of the vectors {(i_1, i_2, \ldots, i_n)} where {\{i_1, \ldots, i_n\} = \{1,\ldots,n \}} . It is known that {\gamma(\Pi_n)=2^n-2} .

Let {B_n \subseteq \mathbb R^{n^2}} denote the convex hull of the {n \times n} permutation matrices. The Birkhoff-von Neumann theorem tells us that {B_n} is precisely the set of doubly stochastic matrices, thus {\gamma(B_n)=n^2} (corresponding to the non-negativity constraints on each entry).

Observe that {\Pi_n} is the linear image of {B_n} under the map {A \mapsto A (1, 2, \ldots, n)^T} , i.e. we multiply a matrix {A \in B_n} on the right by the column vector {(1, 2, \ldots, n)} . Thus {B_n} is a lift of {\Pi_n} , and we conclude that {\bar \gamma(\Pi_n) \leq n^2 \ll \gamma(\Pi_n)} .

1.4. The cut polytope

If {P \neq NP} , there are certain combinatorial polytopes we should not be able to optimize over efficiently. A central example is the cut polytope: {\mathrm{CUT}_n \subseteq \mathbb R^{n^2}} is the convex hull of all matrices of the form {(A_S)_{ij} = |\mathbf{1}_S(i)-\mathbf{1}_S(j)|} for some subset {S \subseteq \{1,\ldots,n\}} . Here, {\mathbf{1}_S} denotes the characteristic function of {S} .

Note that the MAX-CUT problem on a graph {G=(V,E)} can be encoded in the following way: Let {W_{ij} = 1} if {\{i,j\} \in E} and {W_{ij}=0} otherwise. Then the value of the maximum cut in {G} is precisely the maximum of {\langle W, A\rangle} for {A \in \mathrm{CUT}_n} . Accordingly, we should expect that {\bar \gamma(\mathrm{CUT}_n)} cannot be bounded by any polynomial in {n} (lest we violate a basic tenet of complexity theory).

1.5. Non-negative matrix factorization

The key to understanding {\bar \gamma(\mathrm{CUT}_n)} comes from Yannakakis’ factorization theorem.

Consider a polytope {P \subseteq \mathbb R^d} and let us write in two ways: As a convex hull of vertices

\displaystyle  P = \mathrm{conv}\left(x_1, x_2, \ldots, x_n\right)\,,

and as an intersection of half-spaces: For some {A \in \mathbb R^{m \times d}} and {b \in \mathbb R^m} ,

\displaystyle  P = \left\{ x \in \mathbb R^d : Ax \leq b \right\}\,.

Given this pair of representations, we can define the corresponding slack matrix of {P} by

\displaystyle  S_{ij} = b_i - \langle A_i, x_j \rangle \qquad i \in \{1,2,\ldots,m\}, j \in \{1,2,\ldots,n\}\,.

Here, {A_1, \ldots, A_m} denote the rows of {A} .

We need one more definition. In what follows, we will use {\mathbb R_+ = [0,\infty)} . If we have a non-negative matrix {M \in \mathbb R_+^{m \times n}} , then a rank-{r} non-negative factorization of {M} is a factorization {M = AB} where {A \in \mathbb R_+^{m \times r}} and {B \in \mathbb R_+^{r \times n}} . We then define the non-negative rank of {M} , written {\mathrm{rank}_+(M)} , to be the smallest {r} such that {M} admits a rank-{r} non-negative factorization.

Theorem 1 (Yannakakis) For every polytope {P} , it holds that {\bar \gamma(P) = \mathrm{rank}_+(S)} for any slack matrix {S} of {P} .

The key fact underlying this theorem is Farkas’ Lemma. It asserts that if {P = \{ x \in \mathbb R^d : Ax \leq b \}} , then every valid linear inequality over {P} can be written as a non-negative combination of the defining inequalities {\langle A_i, x \rangle \leq b_i} .

There is an interesting connection here to proof systems. The theorem says that we can interpret {\bar \gamma(P)} as the minimum number of axioms so that every valid linear inequality for {P} can be proved using a conic (i.e., non-negative) combination of the axioms.

1.6. Slack matrices and the correlation polytope

Thus to prove a lower bound on {\bar \gamma(\mathrm{CUT}_n)} , it suffices to find a valid set of linear inequalities for {\mathrm{CUT}_n} and prove a lower bound on the non-negative rank of the corresponding slack matrix.

Toward this end, consider the correlation polytope {\mathrm{CORR}_n \subseteq \mathbb R^{n^2}} given by

\displaystyle  \mathrm{CORR}_n = \mathrm{conv}\left(\left\{x x^T : x \in \{0,1\}^n \right\}\right)\,.

It is an exercise to see that {\mathrm{CUT}_{n+1}} and {\mathrm{CORR}_n} are linearly isomorphic.

Now we identify a particularly interesting family of valid linear inequalities for {\mathrm{CORR}_n} . (In fact, it turns out that this will also be an exhaustive list.) A quadratic multi-linear function on {\mathbb R^n} is a function {f : \mathbb R^n \rightarrow \mathbb R} of the form

\displaystyle  f(x) = a_0 + \sum_i a_{ii} x_i + \sum_{i < j} a_{ij} x_i x_j\,,

for some real numbers {a_0} and {\{a_{ij}\}} .

Suppose {f} is a quadratic multi-linear function that is also non-negative on {\{0,1\}^n} . Then “{f(x) \geq 0 \,\,\forall x \in \{0,1\}^n} ” can be encoded as a valid linear inequality on {\mathrm{CORR}_n} . To see this, write {f(x)=\langle A,xx^T\rangle + a_0} where {A=(a_{ij})} . (Note that {\langle \cdot,\cdot\rangle} is intended to be the standard inner product on {\mathbb R^{n^2}} .)

Now let {\textrm{QML}^+_n} be the set of all quadratic multi-linear functions that are non-negative on {\{0,1\}^n} , and consider the matrix (represented here as a function) {\mathcal M_n : \textrm{QML}^+_n \times \{0,1\}^n \rightarrow \mathbb R_+} given by

\displaystyle  \mathcal M_n(f,x) = f(x)\,.

Then from the above discussion, {\mathcal M_n} is a valid sub-matrix of some slack matrix of {\mathrm{CORR}_n} . To summarize, we have the following theorem.

Theorem 2 For all {n \geq 1} , it holds that {\bar \gamma(\mathrm{CUT}_{n+1}) \geq \mathrm{rank}_+(\mathcal M_n)} .

It is actually the case that {\bar \gamma(\mathrm{CUT}_{n+1}) = \mathrm{rank}_+(\mathcal M_n)} . The next post will focus on providing a lower bound on {\mathrm{rank}_+(\mathcal M_n)} .