Our goal now is to prove Lemma 3 from the last post, thereby completing the entropy-maximization proof of Chang’s Lemma. We prove it in greater generality, because the argument shows how relative entropy provides a powerful potential function for measuring progress.
Let $X$ be a finite set equipped with a measure $\mu$. We use $L^2(X,\mu)$ to denote the space of real-valued functions $f : X \to \mathbb{R}$ equipped with the inner product $\langle f, g \rangle = \mathbb{E}_\mu[fg] = \sum_{x \in X} \mu(x) f(x) g(x)$. Let $\mathcal{D} \subseteq L^2(X,\mu)$ denote the convex polytope of probability densities with respect to $\mu$:

$$\mathcal{D} = \left\{ f \in L^2(X,\mu) : f \geq 0,\ \mathbb{E}_\mu[f] = 1 \right\}.$$
1.1. A ground truth and a family of tests
Fix now a density $f \in \mathcal{D}$ and a family $\mathcal{T} \subseteq L^2(X,\mu)$ that one can think of as “tests” (or properties of a function we might care about). Given an error parameter $\varepsilon > 0$, we define a convex polytope $P_{\varepsilon}(f) \subseteq \mathcal{D}$ as follows:

$$P_{\varepsilon}(f) = \left\{ g \in \mathcal{D} : \langle g, \varphi \rangle \geq \langle f, \varphi \rangle - \varepsilon \ \ \forall \varphi \in \mathcal{T} \right\}.$$

This is the set of densities that have “performance” similar to that of $f$ on all the tests in $\mathcal{T}$. Note that the tests are one-sided; if we wanted two-sided bounds for some $\varphi \in \mathcal{T}$, we could just add $-\varphi$ to $\mathcal{T}$.
1.2. (Projected) coordinate descent in the dual
We now describe a simple algorithm to find a point $g \in P_{\varepsilon}(f)$. It will be precisely analogous to the algorithm we saw in the previous post for the natural polytope coming from Chang’s Lemma. We define a family of functions $\{g_t\} \subseteq \mathcal{D}$ indexed by a continuous time parameter $t \geq 0$:

$$g_t = \frac{\exp\left( \int_0^t \varphi_s \, ds \right)}{\mathbb{E}_\mu\!\left[ \exp\left( \int_0^t \varphi_s \, ds \right) \right]}.$$

Here $\varphi_t \in \mathcal{T}$ is some test we have yet to specify. Intuitively, it will be a test that $g_t$ is not performing well on. The idea is that at time $t$, we exponentially average some of $\varphi_t$ into $g_t$. The exponential ensures that $g_t$ is non-negative, and our normalization ensures that $\mathbb{E}_\mu[g_t] = 1$. Observe that $g_0 \equiv \mathbf{1}$ is the constant $1$ function.
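To make the definition concrete: in discrete time, $g_t$ is just the normalized exponential of the accumulated tests. Here is a minimal Python sketch under toy assumptions (a 4-point set with uniform measure and a single $\pm 1$ test; the name `exp_density` is hypothetical, not from the post):

```python
import numpy as np

# A discrete-time stand-in for the exponential averaging that defines g_t.
# The 4-point set, uniform mu, and the test phi are illustrative choices.
def exp_density(accumulated, mu):
    """Normalize exp(accumulated) into a density: E_mu[g] = 1."""
    w = np.exp(accumulated)
    return w / np.dot(mu, w)

mu = np.full(4, 0.25)                     # uniform measure on a 4-point set
phi = np.array([1.0, -1.0, 1.0, -1.0])    # a single test
g = exp_density(0.5 * phi, mu)            # phi averaged in for "time" 0.5

assert abs(float(np.dot(mu, g)) - 1.0) < 1e-9   # normalization: E_mu[g] = 1
assert bool(np.all(g > 0))                      # exponential keeps g positive
```

With `accumulated = 0` this returns the constant $\mathbf{1}$ function, matching $g_0$.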
For two densities $g, h \in \mathcal{D}$, we define the relative entropy

$$D(g \,\|\, h) = \mathbb{E}_\mu\!\left[ g \log \frac{g}{h} \right].$$

Our goal is to analyze the potential function $\Phi(t) = D(f \,\|\, g_t)$. This functional measures how far the “ground truth” $f$ is from our hypothesis $g_t$.

Note that $D(f \,\|\, g_0) = D(f \,\|\, \mathbf{1}) = \mathrm{Ent}_\mu(f)$, using the notation from the previous post. Furthermore, since the relative entropy between two probability measures is always non-negative, $D(f \,\|\, g_t) \geq 0$ for all $t \geq 0$.
Now a simple calculation gives

$$\frac{d}{dt} D(f \,\|\, g_t) = -\left( \langle f, \varphi_t \rangle - \langle g_t, \varphi_t \rangle \right). \qquad (1)$$

In other words, as long as the constraint corresponding to $\varphi_t$ is violated by $g_t$, i.e. as long as $\langle f, \varphi_t \rangle - \langle g_t, \varphi_t \rangle > \varepsilon$, the potential is decreasing at rate at least $\varepsilon$!

Since the potential starts at $\mathrm{Ent}_\mu(f)$ and must always be non-negative, if we choose at every time $t$ a violated test $\varphi_t$, then after time $T = \mathrm{Ent}_\mu(f)/\varepsilon$, it must be that $g_t \in P_{\varepsilon}(f)$ for some $t \in [0, T]$.
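One can check this identity numerically. The following sketch (a toy set, measure, density, and test of my own choosing, not the post’s code) compares the observed drop of $D(f \,\|\, g_t)$ over a small time step against the gap $\langle f - g_0, \varphi \rangle$ at $t = 0$:

```python
import numpy as np

# Numeric sanity check of d/dt D(f || g_t) = -(<f, phi> - <g_t, phi>).
# The 4-point set, measure mu, density f, and test phi are all toy choices.
mu = np.full(4, 0.25)
f = np.array([2.0, 1.0, 0.5, 0.5])        # a density: E_mu[f] = 1
phi = np.array([1.0, 0.0, -1.0, 0.0])     # a fixed test phi_t = phi

def density(lam):
    """g proportional to exp(lam * phi), normalized against mu."""
    w = np.exp(lam * phi)
    return w / np.dot(mu, w)

def rel_entropy(g, h):
    """D(g || h) with respect to mu."""
    return float(np.dot(mu, g * np.log(g / h)))

dt = 1e-4
drop = rel_entropy(f, density(0.0)) - rel_entropy(f, density(dt))
gap = float(np.dot(mu, (f - density(0.0)) * phi))   # <f - g_0, phi>

# the drop per unit time matches the gap up to O(dt) discretization error
assert abs(drop / dt - gap) < 1e-2
```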
1.4. A sparse solution
But our goal in Chang’s Lemma was to find a “sparse” $g \in P_{\varepsilon}(f)$. In other words, we want $g$ to be built out of only a few constraints. To accomplish this, we should have $\varphi_t$ switch between different tests as little as possible.

So when we find a violated test $\varphi$, let’s keep $\varphi_t = \varphi$ until $\langle f - g_t, \varphi \rangle = 0$. How fast can the quantity $\langle f - g_t, \varphi \rangle$ drop from larger than $\varepsilon$ to zero?
Another simple calculation (using $\frac{d}{dt} g_t = \left( \varphi - \langle g_t, \varphi \rangle \right) g_t$) yields

$$\frac{d}{dt} \langle f - g_t, \varphi \rangle = -\left( \langle g_t, \varphi^2 \rangle - \langle g_t, \varphi \rangle^2 \right).$$

This quantity is at least $-\|\varphi\|_\infty^2$.
In order to calculate how much the potential drops while focused on a single constraint, we can make the pessimistic assumption that $\langle f - g_t, \varphi \rangle$ decreases at the fastest possible rate $\|\varphi\|_\infty^2$. (This is formally justified by Grönwall’s inequality.) In this case (recalling (1)), the potential drop is at least

$$\int_0^{\varepsilon/\|\varphi\|_\infty^2} \left( \varepsilon - t \, \|\varphi\|_\infty^2 \right) dt = \frac{\varepsilon^2}{2 \|\varphi\|_\infty^2}.$$

Since the potential can drop by at most $\mathrm{Ent}_\mu(f)$ overall, the number of steps of our algorithm is at most $\frac{2\,\mathrm{Ent}_\mu(f)}{\varepsilon^2} \max_{\varphi \in \mathcal{T}} \|\varphi\|_\infty^2$, yielding the following result.
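The whole scheme is easy to simulate. Below is a hedged Python sketch of a discrete-time analogue (greedy coordinate ascent in the dual, simplified to stop each constraint at slack $\varepsilon$ rather than driving it to zero); the instance, the step size `eta`, and the names `inner` and `sparse_density` are all my own illustrative choices:

```python
import numpy as np

def inner(mu, a, b):
    """Inner product <a, b> with respect to the measure mu."""
    return float(np.dot(mu, a * b))

def sparse_density(f, tests, eps, mu, eta=0.01, max_iter=200000):
    """Stay on one violated test at a time, raising its dual weight,
    until every constraint <f - g, phi> <= eps is satisfied."""
    lam = np.zeros(len(tests))
    used = set()
    for _ in range(max_iter):
        w = np.exp(sum(l * p for l, p in zip(lam, tests)))
        g = w / np.dot(mu, w)
        violated = [i for i, p in enumerate(tests)
                    if inner(mu, f - g, p) > eps]
        if not violated:
            return g, used
        i = violated[0]
        used.add(i)
        lam[i] += eta

# toy instance: uniform measure on 8 points, f supported on the first half
n = 8
mu = np.full(n, 1.0 / n)
f = np.zeros(n)
f[:4] = 2.0                                # E_mu[f] = 1, so f is a density
tests = [np.where(np.arange(n) < 4, 1.0, -1.0),
         np.where(np.arange(n) % 2 == 0, 1.0, -1.0)]
g, used = sparse_density(f, tests, eps=0.1, mu=mu)
```

On this instance only the first test ever gets violated, so the returned density is built from a single constraint, matching the sparsity the result below promises.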
**Theorem 1** *For every $\varepsilon > 0$, there exists a function $g \in P_{\varepsilon}(f)$ such that for some $k \leq \frac{2\,\mathrm{Ent}_\mu(f)}{\varepsilon^2} \max_{\varphi \in \mathcal{T}} \|\varphi\|_\infty^2$ and $\varphi_1, \ldots, \varphi_k \in \mathcal{T}$,*

$$g = \frac{\exp\left( \sum_{i=1}^{k} \lambda_i \varphi_i \right)}{\mathbb{E}_\mu\!\left[ \exp\left( \sum_{i=1}^{k} \lambda_i \varphi_i \right) \right]}$$

*for some values $\lambda_1, \ldots, \lambda_k \geq 0$.*
In order to prove Lemma 3 from the preceding post, simply use the fact that our tests were characters (and their negations), and these satisfy $\|\varphi\|_\infty \leq 1$.
1.5. Bibliographic notes
Mohit Singh has pointed out to me the similarity of this approach to the Frank-Wolfe algorithm. The use of the potential is a common feature of algorithms that go by the names “boosting” or “multiplicative weights update” (see, e.g., this survey).