Entropy optimality: Analytic psd rank and John’s theorem

Recall that our goal is to sketch a proof of the following theorem, where the notation is from the last post. I will assume a knowledge of the three posts on polyhedral lifts and non-negative rank, as our argument will proceed by analogy.

Theorem 1 For every {m \geq 1} and {g : \{0,1\}^m \rightarrow \mathbb R_+} , there exists a constant {C(g)} such that the following holds. For every {n \geq 2m} ,

\displaystyle  1+n^{d/2} \geq \mathrm{rank}_{\mathsf{psd}}(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{(d-1)/2}\,. \ \ \ \ \ (1)

where {d = \deg_{\mathsf{sos}}(g).}

In this post, we will see how John’s theorem can be used to transform a psd factorization into one of a nicer analytic form. Using this, we will be able to construct a convex body that contains an approximation to every non-negative matrix of small psd rank.

1.1. Finite-dimensional operator norms

Let {H} denote a finite-dimensional Euclidean space over {\mathbb R} equipped with inner product {\langle \cdot,\cdot\rangle} and norm {|\cdot|} . For a linear operator {A : H \rightarrow H} , we define the operator, trace, and Frobenius norms by

\displaystyle  \|A\| = \max_{x \neq 0} \frac{|Ax|}{x},\quad \|A\|_* = \mathrm{Tr}(\sqrt{A^T A}),\quad \|A\|_F = \sqrt{\mathrm{Tr}(A^T A)}\,.

Let {\mathcal M(H)} denote the set of self-adjoint linear operators on {H} . Note that for {A \in \mathcal M(H)} , the preceding three norms are precisely the {\ell_{\infty}} , {\ell_1} , and {\ell_2} norms of the eigenvalues of {A} . For {A,B \in \mathcal M(H)} , we use {A \succeq 0} to denote that {A} is positive semi-definite and {A \succeq B} for {A-B \succeq 0} . We use {\mathcal D(H) \subseteq \mathcal M(H)} for the set of density operators: Those {A \in \mathcal M(H)} with {A \succeq 0} and {\mathrm{Tr}(A)=1} .

One should recall that {\mathrm{Tr}(A^T B)} is an inner product on the space of linear operators, and we have the operator analogs of the Hölder inequalities: {\mathrm{Tr}(A^T B) \leq \|A\| \cdot \|B\|_*} and {\mathrm{Tr}(A^T B) \leq \|A\|_F \|B\|_F} .

1.2. Rescaling the psd factorization

As in the case of non-negative rank, consider finite sets {X} and {Y} and a matrix {M : X \times Y \rightarrow \mathbb R_+} . For the purposes of proving a lower bound on the psd rank of some matrix, we would like to have a nice analytic description.

To that end, suppose we have a rank-{r} psd factorization

\displaystyle  M(x,y) = \mathrm{Tr}(A(x) B(y))

where {A : X \rightarrow \mathcal S_+^r} and {B : Y \rightarrow \mathcal S_+^r} . The following result of Briët, Dadush and Pokutta (2013) gives us a way to “scale” the factorization so that it becomes nicer analytically. (The improved bound stated here is from an article of Fawzi, Gouveia, Parrilo, Robinson, and Thomas, and we follow their proof.)

Lemma 2 Every {M} with {\mathrm{rank}_{\mathsf{psd}}(M) \leq r} admits a factorization {M(x,y)=\mathrm{Tr}(P(x) Q(y))} where {P : X \rightarrow \mathcal S_+^r} and {Q : Y \rightarrow \mathcal S_+^r} and, moreover,

\displaystyle  \max \{ \|P(x)\| \cdot \|Q(y)\| : x \in X, y \in Y \} \leq r \|M\|_{\infty}\,,

where {\|M\|_{\infty} = \max_{x \in X, y \in Y} M(x,y)} .

Proof: Start with a rank-{r} psd factorization {M(x,y) = \mathrm{Tr}(A(x) B(y))} . Observe that there is a degree of freedom here, because for any invertible operator {J} , we get another psd factorization {M(x,y) = \mathrm{Tr}\left(\left(J A(x) J^T\right) \cdot \left((J^{-1})^T B(y) J^{-1}\right)\right)} .

Let {U = \{ u \in \mathbb R^r : \exists x \in X\,\, A(x) \succeq uu^T \}} and {V = \{ v \in \mathbb R^r : \exists y \in X\,\, B(y) \succeq vv^T \}} . Set {\Delta = \|M\|_{\infty}} . We may assume that {U} and {V} both span {\mathbb R^r} (else we can obtain a lower-rank psd factorization). Both sets are bounded by finiteness of {X} and {Y} .

Let {C=\mathrm{conv}(U)} and note that {C} is centrally symmetric and contains the origin. Now John’s theorem tells us there exists a linear operator {J : \mathbb R^r \rightarrow \mathbb R^r} such that

\displaystyle  B_{\ell_2} \subseteq J C \subseteq \sqrt{r} B_{\ell_2}\,, \ \ \ \ \ (2)

where {B_{\ell_2}} denotes the unit ball in the Euclidean norm. Let us now set {P(x) = J A(x) J^T} and {Q(y) = (J^{-1})^T B(y) J^{-1}} .

Eigenvalues of {P(x)} : Let {w} be an eigenvector of {P(x)} normalized so the corresponding eigenvalue is {\|w\|_2^2} . Then {P(x) \succeq w w^T} , implying that {J^{-1} w \in U} (here we use that {A \succeq 0 \implies S A S^T \succeq 0} for any {S} ). Since {w = J(J^{-1} w)} , (2) implies that {\|w\|_2 \leq \sqrt{r}} . We conclude that every eigenvalue of {P(x)} is at most {r} .

Eigenvalues of {Q(y)} : Let {w} be an eigenvector of {Q(y)} normalized so that the corresponding eigenvalue is {\|w\|_2^2} . Then as before, we have {Q(y) \succeq ww^T} and this implies {J^T w \in V} . Now, on the one hand we have

\displaystyle  \max_{z \in JC}\, \langle z,w\rangle \geq \|w\|_2 \ \ \ \ \ (3)

since {JC \supseteq B_{\ell_2}} .

On the other hand:

\displaystyle  \max_{z \in JC}\, \langle z,w\rangle^2 = \max_{z \in C}\, \langle Jz, w\rangle^2 = \max_{z \in C}\, \langle z, J^T w\rangle^2\,. \ \ \ \ \ (4)

Finally, observe that for any {u \in U} and {v \in V} , we have

\displaystyle  \langle u,v\rangle^2 =\langle uu^T, vv^T\rangle \leq \max_{x \in X, y \in Y} \langle A(x), B(y)\rangle \leq \Delta\,.

By convexity, this implies that {\max_{z \in C}\, \langle z,v\rangle^2 \leq \Delta} for all {v \in V} , bounding the right-hand side of (4) by {\Delta} . Combining this with (3) yields {\|w\|_2^2 \leq \Delta} . We conclude that all the eigenvalues of {Q(y)} are at most {\Delta} . \Box

1.3. Convex proxy for psd rank

Again, in analogy with the non-negative rank setting, we can define an “analytic psd rank” parameter for matrices {N : X \times Y \rightarrow \mathbb R_+} :

\displaystyle  \alpha_{\mathsf{psd}}(N) = \min \Big\{ \alpha \mid \exists A : X \rightarrow \mathcal S_+^k, B : Y \rightarrow \mathcal S_+^k\,,

\displaystyle  \hphantom{xx} \mathop{\mathbb E}_{x \in X}[A(x)]=I,

\displaystyle  \hphantom{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx} \|B(y)\| \leq \frac{\alpha}{k}\, \mathop{\mathbb E}_{y \in Y}[\mathrm{Tr}(B(y))] \quad \forall y \in Y

\displaystyle  \hphantom{\qquad\qquad} \|A(x)\| \leq \alpha \quad \forall x \in X\Big\}\,.

Note that we have implicit equipped {X} and {Y} with the uniform measure. The main point here is that {k} can be arbitrary. One can verify that {\alpha_{\mathsf{psd}}} is convex.

And there is a corresponding approximation lemma. We use {\|N\|_{\infty}=\max_{x,y} |N(x,y)|} and {\|N\|_1 = \mathop{\mathbb E}_{x,y} |N(x,y)|} .

Lemma 3 For every non-negative matrix {M : X \times Y \rightarrow \mathbb R_+} and every {\eta \in (0,1]} , there is a matrix {N} such that {\|M-N\|_{\infty} \leq \eta \|M\|_{\infty}} and

\displaystyle  \alpha_{\mathsf{psd}}(N) \leq O(\mathrm{rank}_{\mathsf{psd}}(M)) \frac{1}{\eta} \frac{\|M\|_{\infty}}{\|M\|_1}\,.

Using Lemma 2 in a straightforward way, it is not particularly difficult to construct the approximator {N} . The condition {\mathop{\mathbb E}_x [A(x)] = I} poses a slight difficulty that requires adding a small multiple of the identity to the LHS of the factorization (to avoid a poor condition number), but this has a correspondingly small effect on the approximation quality. Putting “Alice” into “isotropic position” is not essential, but it makes the next part of the approach (quantum entropy optimization) somewhat simpler because one is always measuring relative entropy to the maximally mixed state.

Entropy optimality: Quantum lifts of polytopes

In these previous posts, we explored whether the cut polytope can be expressed as the linear projection of a polytope with a small number of facets (i.e., whether it has a small linear programming extended formulation).

For many cut problems, semi-definite programs (SDPs) are able to achieve better approximation ratios than LPs. The most famous example is the Goemans-Williamson {0.878} -approximation for MAX-CUT. The techniques of the previous posts (see the full paper for details) are able to show that no polynomial-size LP can achieve better than factor {1/2} .

1.1. Spectrahedral lifts

The feasible regions of LPs are polyhedra. Up to linear isomorphism, every polyhedron {P} can be represented as {P = \mathbb R_+^n \cap V} where {\mathbb R_+^n} is the positive orthant and {V \subseteq \mathbb R^n} is an affine subspace.

In this context, it makes sense to study any cones that can be optimized over efficiently. A prominent example is the positive semi-definite cone. Let us define {\mathcal S_+^n \subseteq \mathbb R^{n^2}} as the set of {n \times n} real, symmetric matrices with non-negative eigenvalues. A spectrahedron is the intersection {\mathcal S_+^n \cap V} with an affine subspace {V} . The value {n} is referred to as the dimension of the spectrahedron.

In analogy with the {\gamma} parameter we defined for polyhedral lifts, let us define {\bar \gamma_{\mathsf{sdp}}(P)} for a polytope {P} to be the minimal dimension of a spectrahedron that linearly projects to {P} . It is an exercise to show that {\bar \gamma_{\mathsf{sdp}}(P) \leq \bar \gamma(P)} for every polytope {P} . In other words, spectahedral lifts are at least as powerful as polyhedral lifts in this model.

In fact, they are strictly more powerful. Certainly there are many examples of this in the setting of approximation (like the Goemans-Williamson SDP mentioned earlier), but there are also recent gaps between {\bar \gamma} and {\bar \gamma_{\mathrm{sdp}}} for polytopes; see the work of Fawzi, Saunderson, and Parrilo.

Nevertheless, we are recently capable of proving strong lower bounds on the dimension of such lifts. Let us consider the cut polytope {\mathrm{CUT}_n} as in previous posts.

Theorem 1 (L-Raghavendra-Steurer 2015) There is a constant {c > 0} such that for every {n \geq 1} , one has {\bar \gamma_{\mathsf{sdp}}(\mathrm{CUT}_n) \geq e^{c n^{2/13}}} .

Our goal in this post and the next is to explain the proof of this theorem and how quantum entropy maximization plays a key role.

1.2. PSD rank and factorizations

Just as in the setting of polyhedra, there is a notion of “factorization through a cone” that characterizes the parameter {\bar \gamma_{\mathsf{sdp}}(P)} . Let {M \in \mathbb R^{m \times n}_+} be a non-negative matrix. One defines the psd rank of {M} as the quantity

\displaystyle  \mathrm{rank}_{\mathsf{psd}}(M) = \min \left\{ r : M_{ij} = \mathrm{Tr}(A_i B_j) \textrm{ for some } A_1, \ldots, A_m, B_1, \ldots, B_n \in \mathcal S_+^r\right\}\,.

The following theorem was independently proved by Fiorini, Massar, Pokutta, Tiwary, and de Wolf and Gouveia, Parrilo, and Thomas. The proof is a direct analog of Yannakakis’ proof for non-negative rank.

Theorem 2 For every polytope {P} , it holds that {\bar \gamma_{\mathsf{sdp}}(P) = \mathrm{rank}_{\mathsf{psd}}(M)} for any slack matrix {M} of {P} .

Recall the class {\mathrm{QML}_n^+} of non-negative quadratic multi-linear functions that are positive on {\{0,1\}^n} and the matrix {\mathcal M_n : \mathrm{QML}_n^+ \times \{0,1\}^n \rightarrow \mathbb R_+} given by

\displaystyle  \mathcal M_n(f,x) = f(x)\,.

We saw previously that {\mathcal M_n} is a submatrix of some slack matrix of {\mathrm{CUT}_n} . Thus our goal is to prove a lower bound on {\mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)} .

1.3. Sum-of-squares certificates

Just as in the setting of non-negative matrix factorization, we can think of a low psd rank factorization of {\mathcal M_n} as a small set of “axioms” that can prove the non-negativity of every function in {\mathrm{QML}_n^+} . But now our proof system is considerably more powerful.

For a subspace of functions {\mathcal U \subseteq L^2(\{0,1\}^n)} , let us define the cone

\displaystyle  \mathsf{sos}(\mathcal U) = \mathrm{cone}\left(q^2 : q \in \mathcal U\right)\,.

This is the cone of squares of functions in {\mathcal U} . We will think of {\mathcal U} as a set of axioms of size {\mathrm{dim}(\mathcal U)} that is able to assert non-negativity of every {f \in \mathrm{sos}(\mathcal U)} by writing

\displaystyle  f = \sum_{i=1}^k q_i^2

for some {q_1, \ldots, q_k \in \mathsf{sos}(\mathcal U)} .

Fix a subspace {\mathcal U} and let {r = \dim(\mathcal U)} . Fix also a basis {q_1, \ldots, q_r : \{0,1\}^n \rightarrow \mathbb R} for {\mathcal U} .

Define {B : \{0,1\}^n \rightarrow \mathcal S_+^r} by setting {B(x)_{ij} = q_i(x) q_j(x)} . Note that {B(x)} is PSD for every {x} because {B(x) = \vec q(x) \vec q(x)^T} where {\vec q(x)=(q_1(x), \ldots, q_r(x))} .

We can write every {p \in \mathcal U} as {p = \sum_{i=1}^r \lambda_i q_i} . Defining {A(p^2) \in \mathcal S_+^r} by {A(p^2)_{ij} = \lambda_i \lambda_j} , we see that

\displaystyle  \mathrm{Tr}(A(p^2) Q(x)) = \sum_{i,j} \lambda_i \lambda_j q_i(x) q_j(x) = p(x)^2\,.

Now every {f \in \mathsf{sos}(\mathcal U)} can be written as {\sum_{i=1}^k c_i p_i^2} for some {k \geq 0} and {\{c_i \geq 0\}} . Therefore if we define {A(f) = \sum_{i=1}^k c_i \Lambda(p_i^2)} (which is a positive sum of PSD matrices), we arrive at the representation

\displaystyle  f(x) = \mathrm{Tr}(A(f) B(x))\,.

In conclusion, if {\mathrm{QML}_+^n \subseteq \mathsf{sos}(\mathcal U)} , then {\mathrm{rank}_{\mathsf{psd}}(\mathcal M_n) \leq \dim(\mathsf{sos}(\mathcal U))} .

By a “purification” argument, one can also conclude {\dim(\mathsf{sos}(\mathcal U)) \leq \mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)^2} .

1.4. The canonical axioms

And just as {d} -juntas were the canonical axioms for our NMF proof system, there is a similar canonical family in the SDP setting: Let {\mathcal Q_d} be the subspace of all degree-{d} multi-linear polynomials on {\mathbb R^n} . We have

\displaystyle  \dim(\mathcal Q_d) \leq \sum_{k=0}^d {n \choose k} \leq 1+n^d\,. \ \ \ \ \ (1)

For a function {f : \{0,1\}^n \rightarrow \mathbb R_+} , one defines

\displaystyle  \deg_{\mathsf{sos}}(f) = \min \{d : f \in \mathsf{sos}(\mathcal Q_{d}) \}\,.

(One could debate whether the definition of sum-of-squares degree should have {d/2} or {d} . The most convincing arguments suggest that we should use membership in the cone of squares over {\mathcal Q_{\lfloor d/2\rfloor}} so that the sos-degree will be at least the real-degree of the function.)

On the other hand, our choice has the following nice property.

Lemma 3 For every {f : \{0,1\}^n \rightarrow \mathbb R_+} , we have {\deg_{\mathrm{sos}}(f) \leq \deg_J(f)} .

Proof: If {q} is a non-negative {d} -junta, then {\sqrt{q}} is also a non-negative {d} -junta. It is elementary to see that every {d} -junta is polynomial of degree at most {d} , thus {q} is the square of a polynomial of degree at most {d} . \Box

1.5. The canonical tests

As with junta-degree, there is a simple characterization of sos-degree in terms of separating functionals. Say that a functional {\varphi : \{0,1\}^n \rightarrow \mathbb R} is degree-{d} pseudo-positive if

\displaystyle  \langle \varphi, q^2 \rangle = \mathop{\mathbb E}_{x \in \{0,1\}^n} \varphi(x) q(x)^2 \geq 0

whenever {q : \{0,1\}^n \rightarrow \mathbb R} satisfies {\deg(q) \leq d} (and by {\deg} here, we mean degree as a multi-linear polynomial on {\{0,1\}^n} ).

Again, since {\mathsf{sos}(\mathcal Q_d)} is a closed convex set, there is precisely one way to show non-membership there. The following characterization is elementary.

Lemma 4 For every {f : \{0,1\}^n \rightarrow \mathbb R_+} , it holds that {\deg_{\mathsf{sos}}(f) > d} if and only if there is a degree-{d} pseudo-positive functional {\varphi : \{0,1\}^n \rightarrow \mathbb R} such that {\langle \varphi,f\rangle < 0} .

1.6. The connection to psd rank

Following the analogy with non-negative rank, we have two objectives left: (1) to exhibit a function {f \in \mathrm{QML}_n^+} with {\deg_{\mathsf{sos}}(f)} large, and (ii) to give a connection between the sum-of-squares degree of {f} and the psd rank of an associated matrix.

Notice that the function {f(x)=(1-\sum_{i=1}^n x_i)^2} we used for junta-degree has {\deg_{\mathsf{sos}}(f)=1} , making it a poor candidate. Fortunately, Grigoriev has shown that the knapsack polynomial has large sos-degree.

Theorem 5 For every odd {m \geq 1} , the function

\displaystyle  f(x) = \left(\frac{m}{2} - \sum_{i=1}^n x_i\right)^2 - \frac14

has {\deg_{\mathsf{sos}}(f) \geq \lceil m/2\rceil} .

Observe that this {f} is non-negative over {\{0,1\}^m} (because {m} is odd), but it is manifestly not non-negative on {\mathbb R^m} .

Finally, we recall the submatrices of {\mathcal M_n} defined as follows. Fix some integer {m \geq 1} and a function {g : \{0,1\}^m \rightarrow \mathbb R_+} . Then {M_n^g : {[n] \choose m} \times \{0,1\}^n \rightarrow \mathbb R_+} is given by

\displaystyle  M_n^g(S,x) = g(x|_S)\,.

In the next post, we discuss the proof of the following theorem.

Theorem 6 (L-Raghavendra-Steurer 2015) For every {m \geq 1} and {g : \{0,1\}^m \rightarrow \mathbb R_+} , there exists a constant {C(g)} such that the following holds. For every {n \geq 2m} ,

\displaystyle  1+n^{d} \geq \mathrm{rank}_{\mathsf{psd}}(M_n^g) \geq C(g) \left(\frac{n}{\log n}\right)^{(d-1)/2}\,,

where d=\deg_{\mathsf{sos}}(g) .

Note that the upper bound is from (1) and the non-trivial content is contained in the lower bound. As before, in conjunction with Theorem 5, this shows that \mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)  cannot be bounded by any polynomial in n and thus the same holds for \bar \gamma_{\mathsf{sdp}}(\mathrm{CUT}_n) .