Entropy optimality: A potential function

Our goal now is to prove Lemma 3 from the last post, thereby completing the entropy-maximization proof of Chang’s Lemma. We will prove it now in greater generality because it shows how relative entropy provides a powerful potential function for measuring progress.

Let {X} be a finite set equipped with a measure {\mu} . We use {L^2(X,\mu)} to denote the space of real-valued functions {f : X \rightarrow \mathbb R} equipped with the inner product {\langle f,g\rangle = \sum_{x \in X} \mu(x) f(x) g(x)} . Let {Q_{\mu} \subseteq L^2(X,\mu)} denote the convex polytope of probability densities with respect to {\mu} :

\displaystyle  Q_{\mu} = \left\{ g \in L^2(X,\mu) : g \geq 0, \mathop{\mathbb E}_{\mu}[g]=1 \right\}\,.

1.1. A ground truth and a family of tests

Fix now a density {f \in Q_{\mu}} and a family {\mathcal T \subseteq L^2(X,\mu)} that one can think of as “tests” (or properties of a function we might care about). Given an error parameter {\varepsilon > 0} , we define a convex polytope {P_{\varepsilon}(f, \mathcal T) \subseteq L^2(X,\mu)} as follows:

\displaystyle  P_{\varepsilon}(f,\mathcal T) = \left\{ g : \langle \varphi, g \rangle \geq \langle \varphi,f\rangle - \varepsilon \quad \forall \varphi \in \mathcal T\right\}\,.

This is the set of functions that have “performance” similar to that of {f} on all the tests in {\mathcal T} . Note that the tests are one-sided; if we wanted two-sided bounds for some {\varphi} , we could just add {-\varphi} to {\mathcal T} .

1.2. (Projected) coordinate descent in the dual

We now describe a simple algorithm to find a point {g \in P_{\varepsilon}(f,\mathcal T) \cap Q_{\mu}} . It will be precisely analogous to the algorithm we saw in the previous post for the natural polytope coming from Chang’s Lemma. We define a family of functions {\{g_t : t \geq 0\} \subseteq Q_{\mu}} indexed by a continuous time parameter:

\displaystyle  g_t = \frac{\exp\left(\int_0^t \varphi_s\,ds\right)}{\mathop{\mathbb E}_{\mu}\left[\exp\left(\int_0^t \varphi_s\,ds\right)\right]}\,.

Here {\varphi_t \in \mathcal T} is some test we have yet to specify. Intuitively, it will be a test that {g_t} is not performing well on. The idea is that at time {t} , we exponentially average some of {\varphi_t} into {g_t} . The exponential ensures that {g_t} is non-negative, and our normalization ensures that {g_t \in Q_{\mu}} . Observe that {g_0 = \mathbf{1}} is the constant {1} function.

1.3. The potential function

For two densities {h,g \in Q_{\mu}} , we define the relative entropy

\displaystyle  D_{\mu}(h\,\|\,g) = \sum_{x \in X} \mu(x) h(x) \log \frac{h(x)}{g(x)}\,.

Our goal is to analyze the potential function {D_{\mu}(f \,\|\,g_t)} . This functional measures how far the “ground truth” {f} is from our hypothesis {g_t} .

Note that {D_{\mu}(f \,\|\,g_0) = D_{\mu}(f \,\|\, \mathbf{1}) = \mathrm{Ent}_{\mu}(f)} , using the notation from the previous post. Furthermore, since the relative entropy between two probability measures is always non-negative, {D_{\mu}(f\,\|\,g_t) \geq 0} for all {t \geq 0} .

Now a simple calculation gives

\displaystyle  \frac{d}{dt} D_{\mu}(f\,\|\,g_t) = \langle \varphi_t, g_t-f\rangle\,. \ \ \ \ \ (1)

In other words, as long as the constraint {\langle \varphi_t, g_t\rangle \geq \langle \varphi_t, f\rangle - \varepsilon} is violated by {g_t} , the potential is decreasing by at least {\varepsilon} !

Since the potential starts at {\mathrm{Ent}_{\mu}(f)} and must always be non-negative, if we choose at every time {t} a violated test {\varphi_t} , then after time {T \leq \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon}} , it must be that {g_t \in P_{\varepsilon}(f,\mathcal T)} for some {t \in [0,T]} .

1.4. A sparse solution

But our goal in Chang’s Lemma was to find a “sparse” {g \in P_{\varepsilon}(f,\mathcal T)} . In other words, we want {g} to be built out of only a few constraints. To accomplish this, we should have {\varphi_t} switch between different tests as little as possible.

So when we find a violated test {\varphi \in \mathcal T} , let’s keep {\varphi_t = \varphi} until {\langle \varphi, g_t\rangle = \langle \varphi,f\rangle} . How fast can the quantity {\langle \varphi, g_t-f\rangle} drop from larger than {\varepsilon} to zero?

Another simple calculation (using {\varphi_t = \varphi} ) yields

\displaystyle  \frac{d}{dt} \langle \varphi, g_t\rangle = - \left\langle \vphantom{\bigoplus} \varphi, g_t (\varphi - \langle \varphi,g_t\rangle)\right\rangle = - \langle \varphi^2, g_t \rangle + \langle \varphi, g_t\rangle^2\,.

This quantity is at least {- \|g_t\|_1 \cdot \|\varphi\|_{\infty}^2 = -\|\varphi\|_{\infty}^2} .

In order to calculate how much the potential drops while focused on a single constraint, we can make the pessimistic assumption that {\frac{d}{dt} \langle \varphi, g_t\rangle = - \|\varphi\|_{\infty}^2} . (This is formally justified by Grönwall’s inequality.) In this case (recalling (1)), the potential drop is at least

\displaystyle  \frac{1}{\|\varphi\|_{\infty}^2} \int_0^{\varepsilon} x \,dx = \frac12 \frac{\varepsilon^2}{\|\varphi\|_{\infty}^2}\,.

Since the potential can drop by at most {\mathrm{Ent}_{\mu}(f)} overall, this bounds the number of steps of our algorithm, yielding the following result.

Theorem 1 For every {\varepsilon > 0} , there exists a function {g \in P_{\varepsilon}(f,\mathcal T) \cap Q_{\mu}} such that for some {\{\lambda_i \geq 0\}} and {\{\varphi_i\} \subseteq \mathcal T}

\displaystyle  g = \frac{\exp\left(\sum_{i=1}^k \lambda_i \varphi_i\right)}{\mathop{\mathbb E}_{\mu}\left[\exp\left(\sum_{i=1}^k \lambda_i \varphi_i\right)\right]}\,,

for some value

\displaystyle  k \leq 2 \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon^2} \sup_{\varphi \in \mathcal T} \|\varphi\|_{\infty}^2\,.

In order to prove Lemma 3 from the preceding post, simply use the fact that our tests were characters (and their negations), and these satisfy {\|\chi_{\alpha}\|_{\infty} \leq 1} .

1.5. Bibliographic notes

Mohit Singh has pointed out to me the similarity of this approach to the Frank-Wolfe algorithm. The use of the potential {D(f \,\|\,g_t)} is a common feature of algorithms that go by the names “boosting” or “multiplicative weights update” (see, e.g., this survey).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s