Our goal now is to prove Lemma 3 from the last post, thereby completing the entropy-maximization proof of Chang’s Lemma. We prove it here in greater generality, since the argument shows how relative entropy provides a powerful potential function for measuring progress.
Let $X$ be a finite set equipped with a measure $\mu$. We use $L^2(X,\mu)$ to denote the space of real-valued functions $f : X \to \mathbb{R}$ equipped with the inner product $\langle f, g \rangle = \sum_{x \in X} \mu(x) f(x) g(x)$, and we write $\mathbb{E}_\mu[f] = \sum_{x \in X} \mu(x) f(x)$. Let $\mathcal{D} \subseteq L^2(X,\mu)$ denote the convex polytope of probability densities with respect to $\mu$:

$$\mathcal{D} = \Big\{ f \in L^2(X,\mu) : f \geq 0 \text{ and } \mathbb{E}_\mu[f] = 1 \Big\}\,.$$
1.1. A ground truth and a family of tests
Fix now a density $f \in \mathcal{D}$ and a family $\mathcal{F} \subseteq L^2(X,\mu)$ that one can think of as “tests” (or properties of a function we might care about). Given an error parameter $\delta > 0$, we define a convex polytope $P_\delta(f)$ as follows:

$$P_\delta(f) = \Big\{ g \in \mathcal{D} : \langle g, \varphi \rangle \geq \langle f, \varphi \rangle - \delta \ \ \forall \varphi \in \mathcal{F} \Big\}\,.$$
This is the set of functions that have “performance” similar to that of $f$ on all the tests in $\mathcal{F}$. Note that the tests are one-sided; if we wanted two-sided bounds for some $\varphi \in \mathcal{F}$, we could just add $-\varphi$ to $\mathcal{F}$.
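As a concrete illustration, membership in the polytope is just a finite list of linear inequality checks. The following is a hypothetical numerical sketch (the names `in_polytope`, `inner`, and the sample tests are illustrative, not from the post):

```python
import numpy as np

def inner(a, b, mu):
    """Inner product <a, b> = sum_x mu(x) a(x) b(x)."""
    return np.sum(mu * a * b)

def in_polytope(g, f, tests, mu, delta):
    """Check g in P_delta(f): <g, phi> >= <f, phi> - delta for every test phi."""
    return all(inner(g, phi, mu) >= inner(f, phi, mu) - delta for phi in tests)

mu = np.full(4, 0.25)                  # uniform measure on a 4-point set
f = np.array([2.0, 1.0, 0.5, 0.5])     # a density: E_mu[f] = 1
tests = [np.array([1.0, -1.0, 1.0, -1.0]),
         np.array([1.0, 1.0, -1.0, -1.0])]

assert in_polytope(f, f, tests, mu, delta=0.1)         # f passes its own tests
assert not in_polytope(np.ones(4), f, tests, mu, 0.1)  # the constant density fails here
```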
1.2. (Projected) coordinate descent in the dual
We now describe a simple algorithm to find a point $g \in P_\delta(f)$. It will be precisely analogous to the algorithm we saw in the previous post for the natural polytope coming from Chang’s Lemma. We define a family of functions $\{f_t\} \subseteq \mathcal{D}$ indexed by a continuous time parameter $t \geq 0$:

$$f_t = \frac{\exp\big( \int_0^t \varphi_s \, ds \big)}{\mathbb{E}_\mu\big[ \exp\big( \int_0^t \varphi_s \, ds \big) \big]}\,.$$
Here $\varphi_t \in \mathcal{F}$ is some test we have yet to specify. Intuitively, it will be a test that $f_t$ is not performing well on. The idea is that at time $t$, we exponentially average some of $\varphi_t$ into $f_t$. The exponential ensures that $f_t$ is non-negative, and our normalization ensures that $\mathbb{E}_\mu[f_t] = 1$, so that $f_t \in \mathcal{D}$. Observe that $f_0 = \mathbf{1}$ is the constant function.
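A discrete-time sketch of this dynamic may help fix ideas. The following toy implementation (hypothetical; the names and the step size `eta` are mine, and a small step plays the role of the infinitesimal time increment) repeatedly mixes a violated test into the exponent and renormalizes:

```python
import numpy as np

def find_density(f, tests, mu, delta, eta=0.01, max_iters=100000):
    """Discrete-time sketch: drive f_t toward the polytope P_delta(f).

    f     : target density (f >= 0, E_mu[f] = 1)
    tests : list of test functions phi (arrays on the finite set X)
    delta : error parameter; stop once <f - f_t, phi> <= delta for all phi
    eta   : step size standing in for an infinitesimal time step
    """
    exponent = np.zeros_like(f)          # running integral of the chosen tests
    for _ in range(max_iters):
        f_t = np.exp(exponent)
        f_t /= np.sum(mu * f_t)          # normalize so that E_mu[f_t] = 1
        # look for a violated test, i.e. one with <f - f_t, phi> > delta
        gaps = [np.sum(mu * (f - f_t) * phi) for phi in tests]
        worst = int(np.argmax(gaps))
        if gaps[worst] <= delta:
            return f_t                   # f_t lies in P_delta(f)
        exponent += eta * tests[worst]   # exponentially average in the test
    raise RuntimeError("did not converge")

mu = np.full(4, 0.25)
f = np.array([2.0, 1.0, 0.5, 0.5])            # ground-truth density
tests = [np.array([1.0, -1.0, 1.0, -1.0]),    # some bounded tests
         np.array([1.0, 1.0, -1.0, -1.0])]
g = find_density(f, tests, mu, delta=0.05)
assert all(np.sum(mu * (f - g) * phi) <= 0.05 for phi in tests)
```

This is just the multiplicative-weights-style update in discrete time; the continuous-time process in the post corresponds to letting `eta` tend to zero.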
For two densities $f, g \in \mathcal{D}$, we define the relative entropy

$$D(f \,\|\, g) = \mathbb{E}_\mu\Big[ f \log \frac{f}{g} \Big]\,.$$
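A minimal numerical sketch of this definition (the name `rel_entropy` and the sample points are illustrative), which also checks the non-negativity used below:

```python
import numpy as np

def rel_entropy(f, g, mu):
    """Relative entropy D(f || g) = E_mu[f * log(f/g)] between densities."""
    mask = f > 0  # terms with f(x) = 0 contribute 0
    return np.sum(mu[mask] * f[mask] * np.log(f[mask] / g[mask]))

mu = np.full(4, 0.25)                  # uniform measure on a 4-point set
f = np.array([2.0, 1.0, 0.5, 0.5])     # a density: E_mu[f] = 1
g = np.ones(4)                         # the constant density f_0 = 1

assert abs(np.sum(mu * f) - 1.0) < 1e-12
assert rel_entropy(f, g, mu) >= 0      # relative entropy is non-negative
assert rel_entropy(f, f, mu) == 0      # and vanishes only at equality
```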
Our goal is to analyze the potential function $t \mapsto D(f \,\|\, f_t)$. This functional measures how far the “ground truth” $f$ is from our hypothesis $f_t$.
Note that $D(f \,\|\, f_0) = \mathrm{Ent}_\mu(f)$, using the notation from the previous post. Furthermore, since the relative entropy between two probability measures is always non-negative, $D(f \,\|\, f_t) \geq 0$ for all $t \geq 0$.
A simple calculation shows that

$$\frac{d}{dt}\, D(f \,\|\, f_t) = -\langle f - f_t, \varphi_t \rangle\,. \qquad (1)$$

Since the potential starts at $\mathrm{Ent}_\mu(f)$ and must always be non-negative, if we choose at every time $t$ a violated test $\varphi_t$ (one for which $\langle f - f_t, \varphi_t \rangle > \delta$), then after time $T = \mathrm{Ent}_\mu(f)/\delta$, it must be that $f_t \in P_\delta(f)$ for some $t \in [0, T]$.
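For completeness, here is a sketch of the computation behind this rate of decrease. Writing $Z_t = \mathbb{E}_\mu\big[\exp\big(\int_0^t \varphi_s \, ds\big)\big]$ for the normalization, so that $\log f_t = \int_0^t \varphi_s \, ds - \log Z_t$, one has $\frac{d}{dt} \log Z_t = \mathbb{E}_\mu[f_t \varphi_t] = \langle f_t, \varphi_t \rangle$, and therefore:

```latex
\frac{d}{dt}\, D(f \,\|\, f_t)
  = -\frac{d}{dt}\, \mathbb{E}_\mu[f \log f_t]
  = -\mathbb{E}_\mu[f \varphi_t] + \frac{d}{dt} \log Z_t
  = -\langle f, \varphi_t \rangle + \langle f_t, \varphi_t \rangle
  = -\langle f - f_t, \varphi_t \rangle\,.
```

In particular, as long as the current test is violated, the potential decreases at rate more than $\delta$.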
1.4. A sparse solution
But our goal in Chang’s Lemma was to find a “sparse” $g \in P_\delta(f)$. In other words, we want $g$ to be built out of only a few constraints. To accomplish this, we should have $\varphi_t$ switch between different tests as little as possible.
So when we find a violated test $\varphi$, let’s keep $\varphi_t = \varphi$ until $\langle f - f_t, \varphi \rangle = 0$. How fast can the quantity $\langle f - f_t, \varphi \rangle$ drop from larger than $\delta$ to zero?
Another simple calculation (using $\varphi_t = \varphi$) yields

$$\frac{d}{dt}\, \langle f - f_t, \varphi \rangle = -\Big( \langle f_t, \varphi^2 \rangle - \langle f_t, \varphi \rangle^2 \Big)\,.$$

This quantity is at least $-\|\varphi\|_\infty^2$: the right-hand side is minus the variance of $\varphi$ under the measure $f_t \, d\mu$, which is at most $\langle f_t, \varphi^2 \rangle \leq \|\varphi\|_\infty^2$.
In order to calculate how much the potential drops while focused on a single constraint, we can make the pessimistic assumption that $\frac{d}{dt} \langle f - f_t, \varphi \rangle = -\|\varphi\|_\infty^2$. (This is formally justified by Grönwall’s inequality.) In this case (recalling (1)), the potential drop is at least

$$\int_0^{\delta/\|\varphi\|_\infty^2} \big( \delta - \|\varphi\|_\infty^2 \, t \big)\, dt = \frac{\delta^2}{2 \|\varphi\|_\infty^2}\,.$$
Since the potential can drop by at most $\mathrm{Ent}_\mu(f)$ overall, this bounds the number of steps of our algorithm, yielding the following result.
Theorem 1 For every $\delta > 0$, there exists a function $g \in P_\delta(f)$ such that

$$g = \frac{\exp\big( \sum_{i=1}^k c_i \varphi_i \big)}{\mathbb{E}_\mu\big[ \exp\big( \sum_{i=1}^k c_i \varphi_i \big) \big]}$$

for some $\varphi_1, \ldots, \varphi_k \in \mathcal{F}$ and $c_1, \ldots, c_k \in \mathbb{R}$, and for some value

$$k \leq \frac{2 \, \mathrm{Ent}_\mu(f)}{\delta^2} \max_{\varphi \in \mathcal{F}} \|\varphi\|_\infty^2\,.$$
In order to prove Lemma 3 from the preceding post, simply use the fact that our tests were characters (and their negations), and these satisfy $\|\varphi\|_\infty \leq 1$.
1.5. Bibliographic notes
Mohit Singh has pointed out to me the similarity of this approach to the Frank-Wolfe algorithm. The use of the potential $D(f \,\|\, f_t)$ is a common feature of algorithms that go by the names “boosting” or “multiplicative weights update” (see, e.g., this survey).