After Boaz posted on the mother of all inequalities, it seemed about the right time to get around to the next series of posts on entropy optimality. The approach is the same as before, but now we consider entropy optimality on a path space. After finding an appropriate entropy-maximizer, the Brascamp-Lieb inequality will admit a gorgeous one-line proof. Our argument is taken from the beautiful paper of Lehec.
For simplicity, we start first with an entropy optimization on a discrete path space. Then we move on to Brownian motion.
1.1 Entropy optimality on discrete path spaces
Consider a finite state space and a transition kernel
. Also fix some time
.
Let denote the space of all paths
. There is a natural measure
on
coming from the transition kernel:
Now suppose we are given a starting point , and a target distribution specified by a function
scaled so that
. If we let
denote the law of
, then this simply says that
is a density with respect to
. One should think about
as the natural law at time
(given
), and
describes a perturbation of this law.
Let us finally define the set of all measures
on
that start at
and end at
, i.e. those measures satisfying
and for every ,
Now we can consider the entropy optimization problem:
One should verify that, like many times before, we are minimizing the relative entropy over a polytope.
One can think of the optimization as simply computing the most likely way for a mass of particles sitting at to end up in the distribution
at time
.
The optimal solution exists and is unique. Moreover, we can describe it explicitly:
is given by a time-inhomogeneous Markov chain. For
, this chain has transition kernel
where is the heat semigroup of our chain
, i.e.
Let denote the time-inhomogeneous chain with transition kernels
and
and let
denote the law of the random path
. We will now verify that
is the optimal solution to (1).
We first need to confirm that , i.e. that
has law
. To this end, we will verify inductively that
has law
. For
, this follows by definition. For the inductive step:
We have confirmed that . Let us now verify its optimality by writing
where the final equality uses the fact we just proved: has law
. Continuing, we have
where the final inequality uses the definition of in (2). The latter quantity is precisely
by the chain rule for relative entropy.
Exercise: One should check that if and
are two time-inhomogeneous Markov chains on
with respective transition kernels
and
then indeed the chain rule for relative entropy yields
We conclude that
and from this one immediately concludes that . Indeed, for any measure
, we must have
. This follows because
is the law of the endpoint of a path drawn from
and
is the law of the endpoint of a path drawn from
. The relative entropy between the endpoints is certainly less than along the entire path. (This intuitive fact can again be proved via the chain rule for relative entropy by conditioning on the endpoint of the path.)
1.2. The Brownian version
Let us now do the same thing for processes driven by Brownian motion in . Let
be a Brownian motion with
. Let
be the standard Gaussian measure and recall that
has law
.
We recall that if we have two measures and
on
such that
is absolutely continuous with respect to
, we define the relative entropy
Our “path space” will consist of drift processes of the form
where denotes the drift. We require that
is progressively measurable, i.e. that the law of
is determined by the past up to time
, and that
. Note that we can write such a process in differential notation as
with .
Fix a smooth density with
. In analogy with the discrete setting, let us use
to denote the set of processes
that can be realized in the form (4) and such that
and
has law
.
Let us also use the shorthand to represent the entire path of the process. Again, we will consider the entropy optimization problem:
As in the discrete setting, this problem has a unique optimal solution (in the sense of stochastic processes). Here is the main result.
Theorem 1 (Föllmer) If
is the optimal solution to (5), then
Just as for the discrete case, one should think of this as asserting that the optimal process only uses as much entropy as is needed for the difference in laws at the endpoint. The RHS should be thought of as an integral over the expected relative entropy generated at time (just as in the chain rule expression (3)).
The reason for the quadratic term is the usual relative entropy approximation for infinitesimal perturbations. For instance, consider the relative entropy between a binary random variable with expected value and a binary random variable with expected value
:
I am going to delay the proof of Theorem 1 to the next post because doing it in an elementary way will require some discussion of Ito calculus. For now, let us prove the following.
Lemma 2 For any process
given by a drift
, it holds that
Proof: The proof will be somewhat informal. It can be done easily using Girsanov’s theorem, but we try to keep the presentation here elementary and in correspondence with the discrete version above.
Let us first use the chain rule for relative entropy to calculate
Note that has the law of a standard
-dimensional of covariance
.
If is an
-dimensional Gaussian with covariance
and
, then
Therefore:
where the latter expectation is understood to be conditioned on the past up to time
.
In particular, plugging this into (6), we have
1.3. Brascamp-Lieb
The proof is taken directly from Lehec. We will use the entropic formulation of Brascamp-Lieb due to Carlen and Cordero-Erausquin.
Let be a Euclidean space with subspaces
. Let
denote the orthogonal projection onto
. Now suppose that for positive numbers
, we have
By (8), we have for all :
The latter equality uses the fact that each is an orthogonal projection.
Let denote a standard Gaussian on
, and let
denote a standard Gaussian on
for each
.
Theorem 3 (Carlen & Cordero-Erausquin version of Brascamp-Lieb) For any random vector
, it holds that
Proof: Let with
denote the entropy-optimal drift process such that
has the law of
. Then by Theorem 1,
where the latter inequality uses Lemma 2 and the fact that has law
.
Thank you for the great post. Where can I find the proof of Lemma 2 based on Girsanov’s theorem?
Girsanov’s theorem in this setting gives the change of measure under which
becomes a Brownian motion. Let
be the law of
, and define
by
Then by Girsanov’s theorem, under the measure
, the process
is a Brownian motion.
To apply Girsanov, one should verify Novikov’s condition that
. This could actually be false for our process (since
as
is a possibility). But when the density
has finite relative entropy, one can handle this by truncation and then taking a limit.
Now the change of measure allows us to calculate the relative entropy
The assertion that
follows from the chain rule for relative entropy:
where the second term is non-negative.
This is interesting stuff!
I’m confused about what makes the h(u_t) term appear. Can you elaborate a bit on that?
Also, I think there are some small typos in the last two $$ equations; in both cases what should be P_iX is changed into something a little different.
Hi Aram,
. Thanks!
You are right that there is no additional h(.) term. I even removed it before your comment after I wrote the formal proof in the preceding comment using Girsanov. I will fix the typos involving