Entropy optimality on path space

After Boaz posted on the mother of all inequalities, it seemed about the right time to get around to the next series of posts on entropy optimality. The approach is the same as before, but now we consider entropy optimality on a path space. After finding an appropriate entropy-maximizer, the Brascamp-Lieb inequality will admit a gorgeous one-line proof. Our argument is taken from the beautiful paper of Lehec.

For simplicity, we start first with an entropy optimization on a discrete path space. Then we move on to Brownian motion.

1.1 Entropy optimality on discrete path spaces

Consider a finite state space {\Omega} and a transition kernel {p : \Omega \times \Omega \rightarrow [0,1]} . Also fix some time {T \geq 0} .

Let {\mathcal P_T} denote the space of all paths {\gamma : \{0,1,\ldots,T\} \rightarrow \Omega} . There is a natural measure {\mu_{\mathcal P}} on {\mathcal P_T} coming from the transition kernel:

\displaystyle  \mu_{\mathcal P}(\gamma) = \prod_{t=0}^{T-1} p\left(\gamma(t), \gamma(t+1)\right)\,.

Now suppose we are given a starting point {x_0 \in \Omega} , and a target distribution specified by a function {f : \Omega \rightarrow {\mathbb R}_+} scaled so that {\mathop{\mathbb E}[f(X_T) \mid X_0 = x_0]=1} . If we let {\nu_T} denote the law of {X_T \mid X_0 = x_0} , then this simply says that {f} is a density with respect to {\nu_T} . One should think about {\nu_T} as the natural law at time {T} (given {X_0=x_0} ), and {f \nu_T} describes a perturbation of this law.

Let us finally define the set {\mathcal M_T(f; x_0)} of all measures {\mu} on {\mathcal P_T} that start at {x_0} and end at {f \nu_T} , i.e. those measures satisfying

\displaystyle  \mu\left(\{\gamma : \gamma(0)=x_0\}\right) = 1\,,

and for every {x \in \Omega} ,

\displaystyle  f(x) \nu_T(x) = \sum_{\gamma \in \mathcal P : \gamma(T)=x} \mu(\gamma)\,.

Now we can consider the entropy optimization problem:

\displaystyle  \min \left\{ D(\mu \,\|\, \mu_{\mathcal P}) : \mu \in \mathcal M_T(f;x_0) \right\}\,. \ \ \ \ \ (1)

One should verify that, like many times before, we are minimizing the relative entropy over a polytope.

One can think of the optimization as simply computing the most likely way for a mass of particles sitting at {x_0} to end up in the distribution {f \nu_T} at time {T} .

The optimal solution {\mu^*} exists and is unique. Moreover, we can describe it explicitly: {\mu^*} is given by a time-inhomogeneous Markov chain. For {0 \leq t \leq T-1} , this chain has transition kernel

\displaystyle  q_t(x,y) = p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)}\,, \ \ \ \ \ (2)

where {H_t} is the heat semigroup of our chain {\{X_t\}} , i.e.

\displaystyle  H_t f(x) = \mathop{\mathbb E}[f(X_t) \mid X_0 = x]\,.

Let {\{W_t\}} denote the time-inhomogeneous chain with transition kernels {\{q_t\}} and {W_0=x_0} and let {\mu} denote the law of the random path {\{W_0, \ldots, W_T\}} . We will now verify that {\mu} is the optimal solution to (1).

We first need to confirm that {\mu \in \mathcal M_T(f;x_0)} , i.e. that {W_T} has law {f \nu_T} . To this end, we will verify inductively that {W_t} has law {(H_{T-t} f)\cdot \nu_t} . For {t=0} , this follows by definition. For the inductive step:

\displaystyle  \begin{array}{lll}  \displaystyle\mathop{\mathbb P}[W_{t+1}=y] &= \sum_{x \in \Omega} \Pr[W_t=x] \cdot p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)} \\ \displaystyle&= \sum_{x \in \Omega} H_{T-t} f(x) \nu_t(x) p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)} \\ \displaystyle&= \sum_{x \in \Omega} \nu_t(x) p(x,y) H_{T-t-1}f(y) \\ \displaystyle & = H_{T-t-1} f(y) \nu_{t+1}(y)\,. \end{array}

We have confirmed that {\mu \in \mathcal M_T(f;x_0)} . Let us now verify its optimality by writing

\displaystyle  D(f \nu_T \,\|\,\nu_T) = \mathop{\mathbb E}_{\nu_T} [f \log f] = \mathop{\mathbb E}[\log f(W_T)]\,,

where the final equality uses the fact we just proved: {W_T} has law {f \nu_T} . Continuing, we have

\displaystyle  \mathop{\mathbb E}[\log f(W_T)] = \sum_{t=0}^{T-1} \mathop{\mathbb E}\left[\log \frac{H_{T-t-1} f(W_{t+1})}{H_{T-t} f(W_t)}\right] = \sum_{t=0}^{T-1} \mathop{\mathbb E} \left[D(q_t(W_t, \cdot) \,\|\, p(W_t,\cdot))\right]\,,

where the final inequality uses the definition of {q_t} in (2). The latter quantity is precisely {D(\mu \,\|\, \mu_{\mathcal P})} by the chain rule for relative entropy.

Exercise: One should check that if {\{A_t\}} and {\{B_t\}} are two time-inhomogeneous Markov chains on {\Omega} with respective transition kernels {a_t} and {b_t} then indeed the chain rule for relative entropy yields

\displaystyle  D(\{A_0, \ldots, A_T\} \,\|\, \{B_0, \ldots, B_T\}) = \sum_{t=0}^{T-1} \mathop{\mathbb E}\left[D\left(a_t(A_t, \cdot)\,\|\,b_t(A_t,\cdot)\right)\right]\,. \ \ \ \ \ (3)

We conclude that

\displaystyle  D(f \nu_T \,\|\, \nu_T) = D(\mu \,\|\,\mu_{\mathcal P})\,,

and from this one immediately concludes that {\mu=\mu^*} . Indeed, for any measure {\mu' \in \mathcal M_T(f;x_0)} , we must have {D(\mu' \,\|\,\mu_{\mathcal P}) \geq D(f \nu_T \,\|\,\nu_T)} . This follows because {f \nu_T} is the law of the endpoint of a path drawn from {\mu'} and {\nu_T} is the law of the endpoint of a path drawn from {\mu} . The relative entropy between the endpoints is certainly less than along the entire path. (This intuitive fact can again be proved via the chain rule for relative entropy by conditioning on the endpoint of the path.)

1.2. The Brownian version

Let us now do the same thing for processes driven by Brownian motion in {\mathbb R^n} . Let {\{B_t : t \in [0,1]\}} be a Brownian motion with {B_0=0} . Let {\gamma_n} be the standard Gaussian measure and recall that {B_1} has law {\gamma_n} .

We recall that if we have two measures {\mu} and {\nu} on {\mathbb R^n} such that {\nu} is absolutely continuous with respect to {\mu} , we define the relative entropy

\displaystyle D(\nu\,\|\,\mu) = \int d\nu \log \frac{d\nu}{d\mu}

Our “path space” will consist of drift processes {\{W_t : t \in [0,1]\}} of the form

\displaystyle  W_t = B_t + \int_0^t u_s\,ds\,, \ \ \ \ \ (4)

where {\{u_s\}} denotes the drift. We require that {\{u_s\}} is progressively measurable, i.e. that the law of {u_s} is determined by the past up to time {s} , and that {\mathop{\mathbb E} \int_0^1 \|u_s\|^2 \,ds < \infty} . Note that we can write such a process in differential notation as

\displaystyle  dW_t = dB_t + u_t\,dt\,,

with {W_0=0} .

Fix a smooth density {f : \mathbb R^n \rightarrow {\mathbb R}_+} with {\int f \,d\gamma_n =1} . In analogy with the discrete setting, let us use {\mathcal M(f)} to denote the set of processes {\{W_t\}} that can be realized in the form (4) and such that {W_0 = 0} and {W_1} has law {f d\gamma_n} .

Let us also use the shorthand {W_{[0,1]} = \{W_t : t\in [0,1]\}} to represent the entire path of the process. Again, we will consider the entropy optimization problem:

\displaystyle  \min \left\{ \vphantom{\bigoplus} D\left(W_{[0,1]} \,\|\, B_{[0,1]}\right) : W_{[0,1]} \in \mathcal M(f) \right\}\,. \ \ \ \ \ (5)

As in the discrete setting, this problem has a unique optimal solution (in the sense of stochastic processes). Here is the main result.

Theorem 1 (Föllmer) If {\{ W_t = B_t + \int_0^t u_s\,ds : t \in [0,1]\}} is the optimal solution to (5), then

\displaystyle  D\left(W_{[0,1]}\,\|\,B_{[0,1]}\right) = D(W_1 \,\|\, B_1) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,.

Just as for the discrete case, one should think of this as asserting that the optimal process only uses as much entropy as is needed for the difference in laws at the endpoint. The RHS should be thought of as an integral over the expected relative entropy generated at time {t} (just as in the chain rule expression (3)).

The reason for the quadratic term is the usual relative entropy approximation for infinitesimal perturbations. For instance, consider the relative entropy between a binary random variable with expected value {\tfrac12 (1-\varepsilon)} and a binary random variable with expected value {\tfrac12} :

\displaystyle  \frac12(1-\varepsilon) \log (1-\varepsilon) + \frac12 (1+\varepsilon) \log (1+\varepsilon) \approx \frac12 \varepsilon^2\,.

I am going to delay the proof of Theorem 1 to the next post because doing it in an elementary way will require some discussion of Ito calculus. For now, let us prove the following.

Lemma 2 For any process {W_{[0,1]} \in \mathcal M(f)} given by a drift {\{u_t : t\in[0,1]\}} , it holds that

\displaystyle  D(W_1 \,\|\, B_1) \leq D(W_{[0,1]} \,\|\, B_{[0,1]}) =\frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,.

Proof: The proof will be somewhat informal. It can be done easily using Girsanov’s theorem, but we try to keep the presentation here elementary and in correspondence with the discrete version above.

Let us first use the chain rule for relative entropy to calculate

\displaystyle  D\left(W_{[0,1]} \,\|\,B_{[0,1]}\right) = \int_0^1 \mathop{\mathbb E}\left[D( dW_t \,\|\, dB_t)\right] = \int_0^1 \mathop{\mathbb E}\left[D(dB_t + u_t\,dt \,\|\,dB_t)\right]\,. \ \ \ \ \ (6)

Note that {dB_t} has the law of a standard {n} -dimensional of covariance {dt \cdot I} .

If {Z} is an {n} -dimensional Gaussian with covariance {\sigma^2 \cdot I} and {u \in \mathbb R^n} , then

\displaystyle  \begin{array}{lll}  D(Z + u \,\|\, Z) &= \mathop{\mathbb E}\left[\log \frac{e^{-\|Z\|^2/2\sigma^2}}{e^{-\|u-Z\|^2/2\sigma^2}}\right] \\ &= \mathop{\mathbb E}\left[\frac{\|u\|^2}{2\sigma^2} + \frac{\langle u,Z\rangle}{\sigma^2}\right] \\ &= \frac{\|u\|^2}{2\sigma^2}\,. \end{array}


\displaystyle  D(dB_t + u_t\,dt \,\|\,dB_t) = \mathop{\mathbb E} \left[\frac{\|u_t\|^2 dt^2}{2 dt}\mid \mathcal F_t\right] =\frac12 \mathop{\mathbb E}\left[\|u_t\|^2\,dt \mid \mathcal F_t\right]\,,

where the latter expectation is understood to be conditioned on the past \mathcal F_t  up to time {t} .

In particular, plugging this into (6), we have

\displaystyle  D\left(W_{[0,1]} \,\|\,B_{[0,1]}\right) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,. \ \ \ \ \ (7)


1.3. Brascamp-Lieb

The proof is taken directly from Lehec. We will use the entropic formulation of Brascamp-Lieb due to Carlen and Cordero-Erausquin.

Let {E} be a Euclidean space with subspaces {E_1, E_2, \ldots, E_m} . Let {P_i} denote the orthogonal projection onto {E_i} . Now suppose that for positive numbers {c_1, c_2, \ldots, c_m > 0} , we have

\displaystyle  \sum_{i=1}^m c_i P_i = \mathrm{id}_E\,. \ \ \ \ \ (8)

By (8), we have for all {x \in E} :

\displaystyle \|x\|^2 = \left\langle x,\sum_{i=1}^m c_i P_i x\right\rangle = \sum_{i=1}^m c_i\|P_i x\|^2\,.

The latter equality uses the fact that each {P_i} is an orthogonal projection.

Let {Z} denote a standard Gaussian on {E} , and let {Z_i} denote a standard Gaussian on {E_i} for each {i=1,2,\ldots, m} .

Theorem 3 (Carlen & Cordero-Erausquin version of Brascamp-Lieb) For any random vector {X \in E} , it holds that

\displaystyle  D(X \,\|\, Z) \geq \sum_{i=1}^m c_i D(P_i X \,\|\, Z_i)\,.

Proof: Let {\{W_t : t \in [0,1]\}} with {dW_t = dB_t + v_t\,dt} denote the entropy-optimal drift process such that {W_1} has the law of {X} . Then by Theorem 1,

\displaystyle  D(X\,\|\,Z) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt = \frac12 \int_0^1 \sum_{i=1}^m c_i \mathop{\mathbb E}\,\|P_i v_t\|^2\,dt \geq \sum_{i=1}^m c_i D(P_i X \,\|\, Z_i)\,,

where the latter inequality uses Lemma 2 and the fact that {P_i W_1} has law {P_i X} . \Box

A note on the large spectrum and generalized Riesz products

I wrote a short note entitled Covering the large spectrum and generalized Riesz products that simplifies and generalizes the approach of the first few posts on Chang’s Lemma and Bloom’s variant.

The approximation statement is made in the context of general probability measures on a finite set (though it should extend at least to the compact case with no issues). The algebraic structure only comes into play when the spectral covering statements are deduced (easily) from the general approximation theorem. The proofs are also done in the general setting of finite abelian groups.

Comments are encouraged, especially about references I may have missed.

Entropy optimality: Forster’s isotropy

In this post, we will see an example of entropy optimality applied to a determinantal measure (see, for instance, Terry Tao’s post on determinantal processes and Russ Lyons’ ICM survey). I think this is an especially fertile setting for entropy maximization, but this will be the only post in this vein for now; I hope to return to the topic later.

Our goal is to prove the following theorem of Forster.

Theorem 1 (Forster) Suppose that {x_1, x_2, \ldots, x_n \in \mathbb R^k} are unit vectors such that every subset of {k} vectors is linearly independent. Then there exists a linear mapping {A : \mathbb R^k \rightarrow \mathbb R^k} such that

\displaystyle  \sum_{i=1}^n \frac{(A x_i) (A x_i)^T}{\|A x_i\|^2} = \frac{n}{k} I\,. \ \ \ \ \ (1)

This result is surprising at first glance. If we simply wanted to map the vectors {\{x_1, \dots x_n\}} to isotropic position, we could use the matrix {A = (\sum_{i=1}^n x_i x_i^T)^{-1/2}} . But Forster’s theorem asks that the unit vectors

\displaystyle  \left\{ \frac{Ax_1}{\|A x_1\|}, \ldots, \frac{A x_n}{\|A x_n\|} \right\}

are in isotropic position. This seems to be a much trickier task.

Forster used this as a step in proving lower bounds on the sign rank of certain matrices. Forster’s proof is based on an iterative argument combined with a compactness assertion.

There is another approach based on convex programming arising in the work of Barthe on a reverse Brascamp-Lieb inequality. The relation to Forster’s theorem was observed in work of Hardt and Moitra; it is essentially the dual program to the one we construct below.

1.1. Some facts about determinants

We first recall a few preliminary facts about determinants. For any {x_1, \ldots, x_n \in \mathbb R^k} , we have the Cauchy-Binet formula

\displaystyle  \det\left(\sum_{i=1}^n x_i x_i^T\right) = \sum_{|S|=k} \det\left(\sum_{i \in S} x_i x_i^T\right)\,.

We also have a rank-one update formula for the determinant: If a matrix {A} is invertible, then

\displaystyle  \det\left(A+u u^T\right) = \det(A) \left(1+u^T A^{-1} u\right)\,.

Finally, for {k} vectors {x_1, x_2, \ldots, x_k \in \mathbb R^k} and nonnegative coefficients {c_1, c_2, \ldots, c_k \geq 0} , we have

\displaystyle  \det\left(\sum_{i=1}^k c_i \,x_i x_i^T\right) = \left(\prod_{i=1}^k c_i\right) \det\left(\sum_{i=1}^k x_i x_i^T\right)\,.

This follows because replacing {x_i} by {c_i x_i} corresponds to multiplying the {i} th row and column of {XX^T} by {\sqrt{c_i}} , where {X} is the matrix that has the vectors {x_1, \ldots, x_k} as columns.

1.2. A determinantal measure

To prove Theorem 1, we will first define a probability measure on {{[n] \choose k}} , i.e., on the {k} -subsets of {\{1,2,\ldots,n\}} by setting:

\displaystyle  D_S = \frac{\det\left(\sum_{i \in S} x_i x_i^T\right)}{\det\left(\sum_{i=1}^n x_i x_i^T\right)}\,.

The Cauchy-Binet formula is precisely the statement that {\sum_{|S|=k} D_S=1} , i.e. the collection {\{D_S\}} forms a probability distribution on {{[n] \choose k}} . How can we capture the fact that some vectors {x_1, \ldots, x_n} satisfy {\sum_{i=1}^n x_i x_i^T = \frac{n}{k} I} using only the values {D_S} ?

Using the rank-one update formula, for an invertible {k \times k} matrix {B} , we have {\frac{\partial}{\partial t}\big|_{t=0} \log \det(B+t u u^T) = \langle u, B^{-1} u\rangle} . Thus {B} is the {k \times k} identity matrix if and only if for every {u \in \mathbb R^k} ,

\displaystyle \frac{\partial}{\partial t}\left|_{t=0} \log \det(B+t uu^T)=\|u\|^2\,.\right.

Note also that using Cauchy-Binet,

\displaystyle  \frac{\partial}{\partial t}\left|_{t=0}\vphantom{\bigoplus}\right. \log \det\left(\sum_{j=1}^n x_j x_j^T + t x_i x_i^T\right)\qquad\qquad\qquad

\displaystyle = \frac{\sum_{S : i \in S} \frac{\partial}{\partial t}\big|_{t=0} \det\left(\sum_{j \in S} x_j x_j^T + t x_i x_i^T\right)} {\det\left(\sum_{i=1}^n x_i x_i^T\right)}

\displaystyle  = \frac{\sum_{S : i \in S} \frac{\partial}{\partial t}\big|_{t=0} (1+t)\det\left(\sum_{j \in S} x_j x_j^T\right)} {\det\left(\sum_{i=1}^n x_i x_i^T\right)} = \sum_{S : i \in S} D_S\,.

In particular, if {\sum_{i=1}^n x_i x_i^T = \frac{n}{k} I} , then for every {i=1,2,\ldots,n} , we have

\displaystyle  \frac{k}{n} = \frac{\partial}{\partial t}\big|_{t=0} \log \det\left(\sum_{i=1}^n x_i x_i^T +t x_i x_i^T\right) = \sum_{S : i \in S} D_S\,. \ \ \ \ \ (2)

Of course, our vectors {x_1, \ldots, x_n} likely don’t satisfy this condition (otherwise we would be done). So we will use the max-entropy philosophy to find the “simplest” perturbation of the {D_S} values that does satisfy it. The optimal solution will yield a matrix {A} satisfying (1).

1.3. Entropy maximization

Consider the following convex program with variables {\{p_S : S \subseteq [n], |S|=k\}} .

\displaystyle  \textrm{minimize} \qquad \sum_{|S|=k} p_S \log \frac{p_S}{D_S}

\displaystyle  \textrm{subject to} \qquad \sum_{S : i \in S} p_S = \frac{k}{n} \qquad \forall i=1,2,\ldots,n

\displaystyle  \sum_{|S|=k} p_S = 1

\displaystyle  p_S \geq 0 \qquad \forall |S|=k\,.

In other words, we look for a distributon on {[n] \choose k} that has minimum entropy relative to {D_S} , and such that all the “one-dimensional marginals” are equal (recall (2)). Remarkably, the optimum {p^*_S} will be a determinantal measure as well.

Note that the uniform distribution on subsets of size {k} is a feasible point and the objective is finite precisely because {D_S \neq 0} for every {S} . The latter fact follows from our assumption that every subset of {k} vectors is linearly independent.

1.4. Analyzing the optimizer

By setting the gradient of the Lagrangian to zero, we see that the optimal solution has the form

\displaystyle  p_S(\lambda_1, \ldots, \lambda_n) = \frac{\exp\left(\sum_{i \in S} \lambda_i \right) D_S}{\sum_{|S|=k} D_S \exp\left(\sum_{i \in S} \lambda_i\right)} = \frac{\exp\left(\sum_{i \in S} \lambda_i \right) D_S}{\det\left(\sum_{i=1}^n e^{\lambda_i} x_i x_i^T\right)} = \frac{\det\left(\sum_{i \in S} e^{\lambda_i} x_i x_i^T\right)}{\det\left(\sum_{i=1}^n e^{\lambda_i} x_i x_i^T\right)} \ \ \ \ \ (3)

for some dual variables {(\lambda_1, \ldots, \lambda_n)} . Note that the {\{\lambda_i\}} dual variables are unconstrained because they come from equality constraints.

Let us write {U = \sum_{i=1}^n e^{\lambda_i} x_i x_i^T} . We use {p_S^*} , {U_*} , and {\{\lambda_i^*\}} to denote the values at the optimal solution. Using again the rank-one update formula for the determinant,

\displaystyle  \frac{\partial}{\partial \lambda_i} \log \det(U) = e^{\lambda_i} \langle x_i, U^{-1} x_i\rangle\,.

But just as in (2), we can also use Cauchy-Binet to calculate the derivative (from the second expression in (3)):

\displaystyle  \frac{\partial}{\partial \lambda_i} \log \det(U) = \sum_{S : i \in S} p_S\,,

where we have used the fact that {\frac{\partial}{\partial \lambda_i} p_S = p_S} if {i \in S} (and otherwise equals {0} ). We conclude that

\displaystyle  \langle x_i, U_*^{-1} x_i\rangle = e^{-\lambda^*_i} \sum_{S : i \in S} p^*_S = e^{-\lambda^*_i} \frac{k}{n}\,.

Now we can finish the proof: Let {A = U_*^{-1/2}} . Then:

\displaystyle  \sum_{i=1}^n \frac{(A x_i) (A x_i)^T}{\|A x_i\|^2} = U_*^{-1/2} \sum_{i=1}^n \frac{x_i x_i^T}{\langle x_i, U_*^{-1} x_i\rangle} U_*^{-1/2}

\displaystyle = U_*^{-1/2} \sum_{i=1}^n \frac{n}{k} e^{\lambda_i^*} x_i x_i^T U_*^{-1/2} = \frac{n}{k} I\,.