Entropy optimality: Forster’s isotropy

In this post, we will see an example of entropy optimality applied to a determinantal measure (see, for instance, Terry Tao’s post on determinantal processes and Russ Lyons’ ICM survey). I think this is an especially fertile setting for entropy maximization, but this will be the only post in this vein for now; I hope to return to the topic later.

Our goal is to prove the following theorem of Forster.

Theorem 1 (Forster) Suppose that {x_1, x_2, \ldots, x_n \in \mathbb R^k} are unit vectors such that every subset of {k} vectors is linearly independent. Then there exists a linear mapping {A : \mathbb R^k \rightarrow \mathbb R^k} such that

\displaystyle  \sum_{i=1}^n \frac{(A x_i) (A x_i)^T}{\|A x_i\|^2} = \frac{n}{k} I\,. \ \ \ \ \ (1)

This result is surprising at first glance. If we simply wanted to map the vectors {\{x_1, \dots x_n\}} to isotropic position, we could use the matrix {A = (\sum_{i=1}^n x_i x_i^T)^{-1/2}} . But Forster’s theorem asks that the unit vectors

\displaystyle  \left\{ \frac{Ax_1}{\|A x_1\|}, \ldots, \frac{A x_n}{\|A x_n\|} \right\}

are in isotropic position. This seems to be a much trickier task.

Forster used this as a step in proving lower bounds on the sign rank of certain matrices. Forster’s proof is based on an iterative argument combined with a compactness assertion.

There is another approach based on convex programming arising in the work of Barthe on a reverse Brascamp-Lieb inequality. The relation to Forster’s theorem was observed in work of Hardt and Moitra; it is essentially the dual program to the one we construct below.

1.1. Some facts about determinants

We first recall a few preliminary facts about determinants. For any {x_1, \ldots, x_n \in \mathbb R^k} , we have the Cauchy-Binet formula

\displaystyle  \det\left(\sum_{i=1}^n x_i x_i^T\right) = \sum_{|S|=k} \det\left(\sum_{i \in S} x_i x_i^T\right)\,.

We also have a rank-one update formula for the determinant: If a matrix {A} is invertible, then

\displaystyle  \det\left(A+u u^T\right) = \det(A) \left(1+u^T A^{-1} u\right)\,.

Finally, for {k} vectors {x_1, x_2, \ldots, x_k \in \mathbb R^k} and nonnegative coefficients {c_1, c_2, \ldots, c_k \geq 0} , we have

\displaystyle  \det\left(\sum_{i=1}^k c_i \,x_i x_i^T\right) = \left(\prod_{i=1}^k c_i\right) \det\left(\sum_{i=1}^k x_i x_i^T\right)\,.

This follows because replacing {x_i} by {c_i x_i} corresponds to multiplying the {i} th row and column of {XX^T} by {\sqrt{c_i}} , where {X} is the matrix that has the vectors {x_1, \ldots, x_k} as columns.

1.2. A determinantal measure

To prove Theorem 1, we will first define a probability measure on {{[n] \choose k}} , i.e., on the {k} -subsets of {\{1,2,\ldots,n\}} by setting:

\displaystyle  D_S = \frac{\det\left(\sum_{i \in S} x_i x_i^T\right)}{\det\left(\sum_{i=1}^n x_i x_i^T\right)}\,.

The Cauchy-Binet formula is precisely the statement that {\sum_{|S|=k} D_S=1} , i.e. the collection {\{D_S\}} forms a probability distribution on {{[n] \choose k}} . How can we capture the fact that some vectors {x_1, \ldots, x_n} satisfy {\sum_{i=1}^n x_i x_i^T = \frac{n}{k} I} using only the values {D_S} ?

Using the rank-one update formula, for an invertible {k \times k} matrix {B} , we have {\frac{\partial}{\partial t}\big|_{t=0} \log \det(B+t u u^T) = \langle u, B^{-1} u\rangle} . Thus {B} is the {k \times k} identity matrix if and only if for every {u \in \mathbb R^k} ,

\displaystyle \frac{\partial}{\partial t}\left|_{t=0} \log \det(B+t uu^T)=\|u\|^2\,.\right.

Note also that using Cauchy-Binet,

\displaystyle  \frac{\partial}{\partial t}\left|_{t=0}\vphantom{\bigoplus}\right. \log \det\left(\sum_{j=1}^n x_j x_j^T + t x_i x_i^T\right)\qquad\qquad\qquad

\displaystyle = \frac{\sum_{S : i \in S} \frac{\partial}{\partial t}\big|_{t=0} \det\left(\sum_{j \in S} x_j x_j^T + t x_i x_i^T\right)} {\det\left(\sum_{i=1}^n x_i x_i^T\right)}

\displaystyle  = \frac{\sum_{S : i \in S} \frac{\partial}{\partial t}\big|_{t=0} (1+t)\det\left(\sum_{j \in S} x_j x_j^T\right)} {\det\left(\sum_{i=1}^n x_i x_i^T\right)} = \sum_{S : i \in S} D_S\,.

In particular, if {\sum_{i=1}^n x_i x_i^T = \frac{n}{k} I} , then for every {i=1,2,\ldots,n} , we have

\displaystyle  \frac{k}{n} = \frac{\partial}{\partial t}\big|_{t=0} \log \det\left(\sum_{i=1}^n x_i x_i^T +t x_i x_i^T\right) = \sum_{S : i \in S} D_S\,. \ \ \ \ \ (2)

Of course, our vectors {x_1, \ldots, x_n} likely don’t satisfy this condition (otherwise we would be done). So we will use the max-entropy philosophy to find the “simplest” perturbation of the {D_S} values that does satisfy it. The optimal solution will yield a matrix {A} satisfying (1).

1.3. Entropy maximization

Consider the following convex program with variables {\{p_S : S \subseteq [n], |S|=k\}} .

\displaystyle  \textrm{minimize} \qquad \sum_{|S|=k} p_S \log \frac{p_S}{D_S}

\displaystyle  \textrm{subject to} \qquad \sum_{S : i \in S} p_S = \frac{k}{n} \qquad \forall i=1,2,\ldots,n

\displaystyle  \sum_{|S|=k} p_S = 1

\displaystyle  p_S \geq 0 \qquad \forall |S|=k\,.

In other words, we look for a distributon on {[n] \choose k} that has minimum entropy relative to {D_S} , and such that all the “one-dimensional marginals” are equal (recall (2)). Remarkably, the optimum {p^*_S} will be a determinantal measure as well.

Note that the uniform distribution on subsets of size {k} is a feasible point and the objective is finite precisely because {D_S \neq 0} for every {S} . The latter fact follows from our assumption that every subset of {k} vectors is linearly independent.

1.4. Analyzing the optimizer

By setting the gradient of the Lagrangian to zero, we see that the optimal solution has the form

\displaystyle  p_S(\lambda_1, \ldots, \lambda_n) = \frac{\exp\left(\sum_{i \in S} \lambda_i \right) D_S}{\sum_{|S|=k} D_S \exp\left(\sum_{i \in S} \lambda_i\right)} = \frac{\exp\left(\sum_{i \in S} \lambda_i \right) D_S}{\det\left(\sum_{i=1}^n e^{\lambda_i} x_i x_i^T\right)} = \frac{\det\left(\sum_{i \in S} e^{\lambda_i} x_i x_i^T\right)}{\det\left(\sum_{i=1}^n e^{\lambda_i} x_i x_i^T\right)} \ \ \ \ \ (3)

for some dual variables {(\lambda_1, \ldots, \lambda_n)} . Note that the {\{\lambda_i\}} dual variables are unconstrained because they come from equality constraints.

Let us write {U = \sum_{i=1}^n e^{\lambda_i} x_i x_i^T} . We use {p_S^*} , {U_*} , and {\{\lambda_i^*\}} to denote the values at the optimal solution. Using again the rank-one update formula for the determinant,

\displaystyle  \frac{\partial}{\partial \lambda_i} \log \det(U) = e^{\lambda_i} \langle x_i, U^{-1} x_i\rangle\,.

But just as in (2), we can also use Cauchy-Binet to calculate the derivative (from the second expression in (3)):

\displaystyle  \frac{\partial}{\partial \lambda_i} \log \det(U) = \sum_{S : i \in S} p_S\,,

where we have used the fact that {\frac{\partial}{\partial \lambda_i} p_S = p_S} if {i \in S} (and otherwise equals {0} ). We conclude that

\displaystyle  \langle x_i, U_*^{-1} x_i\rangle = e^{-\lambda^*_i} \sum_{S : i \in S} p^*_S = e^{-\lambda^*_i} \frac{k}{n}\,.

Now we can finish the proof: Let {A = U_*^{-1/2}} . Then:

\displaystyle  \sum_{i=1}^n \frac{(A x_i) (A x_i)^T}{\|A x_i\|^2} = U_*^{-1/2} \sum_{i=1}^n \frac{x_i x_i^T}{\langle x_i, U_*^{-1} x_i\rangle} U_*^{-1/2}

\displaystyle = U_*^{-1/2} \sum_{i=1}^n \frac{n}{k} e^{\lambda_i^*} x_i x_i^T U_*^{-1/2} = \frac{n}{k} I\,.