# Entropy optimality: Forster’s isotropy

In this post, we will see an example of entropy optimality applied to a determinantal measure (see, for instance, Terry Tao’s post on determinantal processes and Russ Lyons’ ICM survey). I think this is an especially fertile setting for entropy maximization, but this will be the only post in this vein for now; I hope to return to the topic later.

Our goal is to prove the following theorem of Forster.

Theorem 1 (Forster) Suppose that ${x_1, x_2, \ldots, x_n \in \mathbb R^k}$ are unit vectors such that every subset of ${k}$ vectors is linearly independent. Then there exists a linear mapping ${A : \mathbb R^k \rightarrow \mathbb R^k}$ such that

$\displaystyle \sum_{i=1}^n \frac{(A x_i) (A x_i)^T}{\|A x_i\|^2} = \frac{n}{k} I\,. \ \ \ \ \ (1)$

This result is surprising at first glance. If we simply wanted to map the vectors ${\{x_1, \dots x_n\}}$ to isotropic position, we could use the matrix ${A = (\sum_{i=1}^n x_i x_i^T)^{-1/2}}$. But Forster’s theorem asks that the unit vectors

$\displaystyle \left\{ \frac{Ax_1}{\|A x_1\|}, \ldots, \frac{A x_n}{\|A x_n\|} \right\}$

are in isotropic position. This seems to be a much trickier task.

Forster used this as a step in proving lower bounds on the sign rank of certain matrices. Forster’s proof is based on an iterative argument combined with a compactness assertion.

There is another approach based on convex programming arising in the work of Barthe on a reverse Brascamp-Lieb inequality. The relation to Forster’s theorem was observed in work of Hardt and Moitra; it is essentially the dual program to the one we construct below.

We first recall a few preliminary facts about determinants. For any ${x_1, \ldots, x_n \in \mathbb R^k}$, we have the Cauchy-Binet formula

$\displaystyle \det\left(\sum_{i=1}^n x_i x_i^T\right) = \sum_{|S|=k} \det\left(\sum_{i \in S} x_i x_i^T\right)\,.$

We also have a rank-one update formula for the determinant: If a matrix ${A}$ is invertible, then

$\displaystyle \det\left(A+u u^T\right) = \det(A) \left(1+u^T A^{-1} u\right)\,.$

Finally, for ${k}$ vectors ${x_1, x_2, \ldots, x_k \in \mathbb R^k}$ and nonnegative coefficients ${c_1, c_2, \ldots, c_k \geq 0}$, we have

$\displaystyle \det\left(\sum_{i=1}^k c_i \,x_i x_i^T\right) = \left(\prod_{i=1}^k c_i\right) \det\left(\sum_{i=1}^k x_i x_i^T\right)\,.$

This follows because replacing ${x_i}$ by ${c_i x_i}$ corresponds to multiplying the ${i}$th row and column of ${XX^T}$ by ${\sqrt{c_i}}$, where ${X}$ is the matrix that has the vectors ${x_1, \ldots, x_k}$ as columns.

1.2. A determinantal measure

To prove Theorem 1, we will first define a probability measure on ${{[n] \choose k}}$, i.e., on the ${k}$-subsets of ${\{1,2,\ldots,n\}}$ by setting:

$\displaystyle D_S = \frac{\det\left(\sum_{i \in S} x_i x_i^T\right)}{\det\left(\sum_{i=1}^n x_i x_i^T\right)}\,.$

The Cauchy-Binet formula is precisely the statement that ${\sum_{|S|=k} D_S=1}$, i.e. the collection ${\{D_S\}}$ forms a probability distribution on ${{[n] \choose k}}$. How can we capture the fact that some vectors ${x_1, \ldots, x_n}$ satisfy ${\sum_{i=1}^n x_i x_i^T = \frac{n}{k} I}$ using only the values ${D_S}$?

Using the rank-one update formula, for an invertible ${k \times k}$ matrix ${B}$, we have ${\frac{\partial}{\partial t}\big|_{t=0} \log \det(B+t u u^T) = \langle u, B^{-1} u\rangle}$. Thus ${B}$ is the ${k \times k}$ identity matrix if and only if for every ${u \in \mathbb R^k}$,

$\displaystyle \frac{\partial}{\partial t}\left|_{t=0} \log \det(B+t uu^T)=\|u\|^2\,.\right.$

Note also that using Cauchy-Binet,

$\displaystyle \frac{\partial}{\partial t}\left|_{t=0}\vphantom{\bigoplus}\right. \log \det\left(\sum_{j=1}^n x_j x_j^T + t x_i x_i^T\right)\qquad\qquad\qquad$

$\displaystyle = \frac{\sum_{S : i \in S} \frac{\partial}{\partial t}\big|_{t=0} \det\left(\sum_{j \in S} x_j x_j^T + t x_i x_i^T\right)} {\det\left(\sum_{i=1}^n x_i x_i^T\right)}$

$\displaystyle = \frac{\sum_{S : i \in S} \frac{\partial}{\partial t}\big|_{t=0} (1+t)\det\left(\sum_{j \in S} x_j x_j^T\right)} {\det\left(\sum_{i=1}^n x_i x_i^T\right)} = \sum_{S : i \in S} D_S\,.$

In particular, if ${\sum_{i=1}^n x_i x_i^T = \frac{n}{k} I}$, then for every ${i=1,2,\ldots,n}$, we have

$\displaystyle \frac{k}{n} = \frac{\partial}{\partial t}\big|_{t=0} \log \det\left(\sum_{i=1}^n x_i x_i^T +t x_i x_i^T\right) = \sum_{S : i \in S} D_S\,. \ \ \ \ \ (2)$

Of course, our vectors ${x_1, \ldots, x_n}$ likely don’t satisfy this condition (otherwise we would be done). So we will use the max-entropy philosophy to find the “simplest” perturbation of the ${D_S}$ values that does satisfy it. The optimal solution will yield a matrix ${A}$ satisfying (1).

1.3. Entropy maximization

Consider the following convex program with variables ${\{p_S : S \subseteq [n], |S|=k\}}$.

$\displaystyle \textrm{minimize} \qquad \sum_{|S|=k} p_S \log \frac{p_S}{D_S}$

$\displaystyle \textrm{subject to} \qquad \sum_{S : i \in S} p_S = \frac{k}{n} \qquad \forall i=1,2,\ldots,n$

$\displaystyle \sum_{|S|=k} p_S = 1$

$\displaystyle p_S \geq 0 \qquad \forall |S|=k\,.$

In other words, we look for a distributon on ${[n] \choose k}$ that has minimum entropy relative to ${D_S}$, and such that all the “one-dimensional marginals” are equal (recall (2)). Remarkably, the optimum ${p^*_S}$ will be a determinantal measure as well.

Note that the uniform distribution on subsets of size ${k}$ is a feasible point and the objective is finite precisely because ${D_S \neq 0}$ for every ${S}$. The latter fact follows from our assumption that every subset of ${k}$ vectors is linearly independent.

1.4. Analyzing the optimizer

By setting the gradient of the Lagrangian to zero, we see that the optimal solution has the form

$\displaystyle p_S(\lambda_1, \ldots, \lambda_n) = \frac{\exp\left(\sum_{i \in S} \lambda_i \right) D_S}{\sum_{|S|=k} D_S \exp\left(\sum_{i \in S} \lambda_i\right)} = \frac{\exp\left(\sum_{i \in S} \lambda_i \right) D_S}{\det\left(\sum_{i=1}^n e^{\lambda_i} x_i x_i^T\right)} = \frac{\det\left(\sum_{i \in S} e^{\lambda_i} x_i x_i^T\right)}{\det\left(\sum_{i=1}^n e^{\lambda_i} x_i x_i^T\right)} \ \ \ \ \ (3)$

for some dual variables ${(\lambda_1, \ldots, \lambda_n)}$. Note that the ${\{\lambda_i\}}$ dual variables are unconstrained because they come from equality constraints.

Let us write ${U = \sum_{i=1}^n e^{\lambda_i} x_i x_i^T}$. We use ${p_S^*}$, ${U_*}$, and ${\{\lambda_i^*\}}$ to denote the values at the optimal solution. Using again the rank-one update formula for the determinant,

$\displaystyle \frac{\partial}{\partial \lambda_i} \log \det(U) = e^{\lambda_i} \langle x_i, U^{-1} x_i\rangle\,.$

But just as in (2), we can also use Cauchy-Binet to calculate the derivative (from the second expression in (3)):

$\displaystyle \frac{\partial}{\partial \lambda_i} \log \det(U) = \sum_{S : i \in S} p_S\,,$

where we have used the fact that ${\frac{\partial}{\partial \lambda_i} p_S = p_S}$ if ${i \in S}$ (and otherwise equals ${0}$). We conclude that

$\displaystyle \langle x_i, U_*^{-1} x_i\rangle = e^{-\lambda^*_i} \sum_{S : i \in S} p^*_S = e^{-\lambda^*_i} \frac{k}{n}\,.$

Now we can finish the proof: Let ${A = U_*^{-1/2}}$. Then:

$\displaystyle \sum_{i=1}^n \frac{(A x_i) (A x_i)^T}{\|A x_i\|^2} = U_*^{-1/2} \sum_{i=1}^n \frac{x_i x_i^T}{\langle x_i, U_*^{-1} x_i\rangle} U_*^{-1/2}$

$\displaystyle = U_*^{-1/2} \sum_{i=1}^n \frac{n}{k} e^{\lambda_i^*} x_i x_i^T U_*^{-1/2} = \frac{n}{k} I\,.$

## 4 thoughts on “Entropy optimality: Forster’s isotropy”

1. May I ask what’s the major application of this theorem?

2. I would look at the papers linked in the post–Forster’s paper gives an application to lower bounding the sign rank of a matrix (though this has been surpassed by Linial, et. al. using factorization norms). The Hardt-Moitra paper gives applications in machine learning. And Barthe’s paper (which is actually the first and the theorem should really be credited to Barthe) uses it as a tool for some functional inequalities.