Entropy optimality: Bloom’s Chang’s Lemma

[Added in post: One might consult this post for a simpler/more general (and slightly more correct) version of the results presented here.]

In our last post regarding Chang’s Lemma, let us visit a version due to Thomas Bloom. We will offer a new proof using entropy maximization. In particular, we will again use only boundedness of the Fourier characters.

There are two new (and elementary) techniques here: (1) using a trade-off between entropy maximization and accuracy and (2) truncating the Taylor expansion of {e^x} .

We use the notation from our previous post: {G=\mathbb F_p^n} for some prime {p} and {\mu} is the uniform measure on {G} . For {\eta > 0}  and {f \in L^2(G)} , we define {\Delta_{\eta}(f) = \{ \alpha \in G : |\hat f(\alpha)| \geq \eta \|f\|_1 \}} . We also use {Q_{\mu} \subseteq L^2(G)} to denote the set of all densities with respect to {\mu} .

Theorem 1 (Bloom) There is a constant {c > 0} such that for every {\eta > 0} and every density {f \in Q_{\mu}} , there is a subset {\Delta \subseteq \Delta_{\eta}(f) \subseteq G} such that {|\Delta| \geq c \eta |\Delta_{\eta}(f)|} and {\Delta} is contained in some {\mathbb F_p} subspace of dimension at most

\displaystyle  \frac{1+\mathrm{Ent}_{\mu}(f)}{c\eta}\,.

Note that we only bound the dimension of a subset of the large spectrum, but the bound on the dimension improves by a factor of {1/\eta} . Bloom uses this as the key step in his proof of what (at the time of writing) constitutes the best asymptotic bounds in Roth’s theorem on three-term arithmetic progressions:

Theorem 2 If a subset {A \subseteq \{1,2,\ldots,N\}} contains no non-trivial three-term arithmetic progressions, then

\displaystyle  |A| \leq O(1) \frac{\left(\log \log N\right)^4}{\log N} N\,.

This represents a modest improvement over the breakthrough of Sanders achieving {\frac{(\log \log N)^{O(1)}}{\log N} N} , but the proof is somewhat different.

1.1. A stronger version

In fact, we will prove a stronger theorem.

Theorem 3 For every {\eta > 0} and every density {f \in Q_{\mu}} , there is a random subset {\Delta \subseteq G} such that almost surely

\displaystyle  \dim_{\mathbb F_p}(\mathrm{span}\, \Delta) \leq 12 \frac{\mathrm{Ent}_{\mu}(f)}{\eta} + O(\log (1/\eta))\,,

and for every {\alpha \in \Delta_{\eta}(f)} , it holds that

\displaystyle  \mathbb P[\alpha \in \Delta] \geq \frac{\eta}{4}\,.

This clearly yields Theorem 1 by averaging.

1.2. The same polytope

To prove Theorem 3, we use the same polytope we saw before. Recall the class of test functionals {\mathcal F = \{ \pm \mathrm{Re}\,\chi_\alpha,  \pm \mathrm{Im}\,\chi_\alpha : \alpha \in G\} \subseteq L^2(G) }

We defined {P(f,\eta) \subseteq L^2(G)} by

\displaystyle  P(f,\eta) = \left\{ g \in L^2(G) : \langle g, \varphi \rangle \geq \langle f,\varphi\rangle - \eta\quad\forall \varphi \in \mathcal F\right\}\,.

Let us consider a slightly different convex optimization:

\displaystyle  \textrm{minimize } \mathrm{Ent}_{\mu}(g)+K \mathbf{\varepsilon} \qquad \textrm{ subject to } g \in P(f,\mathbf{\varepsilon}) \cap Q_{\mu}\,. \ \ \ \ \ (1)

Here, {K} is a constant that we will set soon. On the other hand, {\varepsilon \geq 0} is now intended as an additional variable over which to optimize. We allow the optimization to trade off the entropy term and the accuracy {\varepsilon} . The constant {K > 0} represents how much we value one vs. the other.

Notice that, since {f \in P(f,0) \cap Q_{\mu}} , this convex program satisfies Slater’s condition (there is a feasible point in the relative interior), meaning that strong duality holds (see Section 5.2.3).

1.3. The optimal solution

As in our first post on this topic, we can set the gradient of the Lagrangian equal to zero to obtain the form of the optimal solution: For some dual variables {\{\lambda^*_{\varphi} \geq 0: \varphi \in \mathcal F\} \subseteq \mathbb R} ,

\displaystyle  g^*(x) = \frac{\exp\left(\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} \varphi(x)\right)}{\mathop{\mathbb E}_{\mu} \left[\exp\left(\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} \varphi\right)\right]} \ \ \ \ \ (2)

Furthermore, corresponding to our new variable {\varepsilon} , there is a new constraint on the dual variables:

\displaystyle  \sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} \leq K\,.

Observe now that if we put {K = 2 \frac{\mathrm{Ent}_{\mu}(f)}{\eta}} then we can bound {\varepsilon^*} (the error in the optimal solution): Since {f} is a feasible solution with {\varepsilon=0} , we have

\displaystyle  \mathrm{Ent}_{\mu}(g^*) + K \varepsilon^* \leq \mathrm{Ent}_{\mu}(f)\,,

which implies that {\varepsilon^* \leq \frac{\mathrm{Ent}_{\mu}(f)}{K} = \frac{\eta}{2}} since {\mathrm{Ent}_{\mu}(g^*) \geq 0} .

To summarize: By setting {K} appropriately, we obtain {g^* \in P(f,\eta/2) \cap Q_{\mu}} of the form (2) and such that

\displaystyle  \sum_{\varphi \in \mathcal F} \lambda_{\varphi}^* \leq 2\frac{\mathrm{Ent}_{\mu}(f)}{\eta}\,. \ \ \ \ \ (3)

Note that one can arrive at the same conclusion using the algorithm from our previous post: The version unconcerned with sparsity finds a feasible point after time {T \leq \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon}} . Setting {\varepsilon = \eta/2} yields the same result without using duality.

1.4. A Taylor expansion

Let us slightly rewrite {g^*} by multiplying the numerator and denominator by {\exp\left({\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi}}\right)} . This yields:

\displaystyle  g^* = \frac{\exp\left(\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} (1+\varphi)\right)}{\mathop{\mathbb E}_{\mu} \exp\left(\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} (1+\varphi)\right)}

The point of this transformation is that now the exponent is a sum of positive terms (using {\|\varphi\|_{\infty} \leq 1} ), and furthermore by (3), the exponent is always bounded by

\displaystyle  B \stackrel{\mathrm{def}}{=} 4 \frac{\mathrm{Ent}_{\mu}(f)}{\eta}\,. \ \ \ \ \ (4)

Let us now Taylor expand {e^x = \sum_{j \geq 0} \frac{x^j}{j!}} . Applying this to the numerator, we arrive at an expression

\displaystyle  g^* = \sum_{\vec \alpha} y_{\vec \alpha} T_{\vec \alpha}

where {y_{\vec \alpha} \geq 0} , {\sum_{\vec \alpha} y_{\vec \alpha} = 1} , and each {T_{\vec \alpha} \in Q_{\mu}} is a density. Here, {\vec \alpha} ranges over all finite sequences of elements from {\mathcal F} and

\displaystyle  T_{\vec \alpha} = \frac{\prod_{i=1}^{|\vec \alpha|} (1+\varphi_{\vec \alpha_i})}{\mathop{\mathbb E}_{\mu} \prod_{i=1}^{|\vec \alpha|} (1+\varphi_{\vec \alpha_i})}=\prod_{i=1}^{|\vec \alpha|} (1+\varphi_{\vec \alpha_i})\,,

where we use {|\vec \alpha|} to denote the length of the sequence {\vec \alpha} .

1.5. The random subset

We now define a random function {\mathbf{T} \in L^2(G)} by taking {\mathbf{T}=T_{\vec \alpha}} with probability {y_{\vec \alpha}} .

Consider some {\gamma \in \Delta_{\eta}(f)} . Since {g^* \in P(f,\eta/2)} , we know that {\gamma \in \Delta_{\eta/2}(g^*)} . Thus

\displaystyle  \frac{\eta}{2} < |\langle g^*, \chi_{\gamma}\rangle| \leq \sum_{\vec \alpha} y_{\vec \alpha} |\langle T_{\vec \alpha}, \chi_{\gamma}\rangle| = \mathop{\mathbb E}\,|\langle \mathbf{T},\chi_{\gamma}\rangle|\,.

But we also have {|\langle T_{\vec \alpha}, \chi_{\gamma}\rangle| \leq \|T_{\vec \alpha}\|_1 \cdot \|\chi_{\gamma}\|_{\infty} \leq 1} . This implies that {\mathbb P[|\langle \mathbf{T}, \chi_{\gamma}\rangle| > 0] > \eta/2} .

Equivalently, for any {\gamma \in \Delta_{\eta}(f)} , it holds that {\mathbb P[\gamma \in \Delta_0(\mathbf{T})] > \eta/2} . We would be done with the proof of Theorem 3 if we also knew that {\mathbf{T}} were supported on functions {T_{\vec \alpha}} for which {|\vec \alpha| \leq O(B)} because {\dim_{\mathbb F_p}(\mathrm{span}(\Delta_0(T_{\vec \alpha}))) \leq |\vec \alpha|} . This is not necessarily true, but we can simply truncate the Taylor expansion to ensure it.

1.6. Truncation

Let {p_k(x) = \sum_{j \leq k} \frac{x^j}{j!}} denote the Taylor expansion of {e^x} to degree {k} . Since the exponent in {g^*} is always bounded by {B} (recall (4)), we have

\displaystyle  \sum_{|\vec \alpha| > k} y_{\vec \alpha} \leq \sup_{x \in [0,B]} \frac{|e^x - p_k(x)|}{e^x} \leq \frac{B^{k+1}}{(k+1)!}\,.

By standard estimates, we can choose {k \leq 3 B + O(\log(1/\eta))} to make the latter quantity at most {\eta/4} .

Since {\sum_{|\vec \alpha| > k} y_{\vec \alpha} \leq \eta/4} , a union bound combined with our previous argument immediately implies that for {\gamma \in \Delta_{\eta}(f)} , we have

\displaystyle  \mathbb P\left[|\langle \mathbf{T}, \chi_{\gamma}\rangle| > 0 \textrm{ and } \dim_{\mathbb F_p}(\mathrm{span}(\Delta_0(\mathbf{T}))) \leq k \right] \geq \frac{\eta}{4}\,.

This completes the proof of Theorem 3.

1.7. Prologue: A structure theorem

Generalizing the preceding argument a bit, one can prove the following.

Let {G} be a finite abelian group and use {\hat G} to denote the dual group. Let {\mu} denote the uniform measure on {G} . For every {\gamma \in \hat G} , let {\chi_{\gamma}} denote the corresponding character. Let us define a degree-{k} Reisz product to be a function of the form

\displaystyle  R(x) = \prod_{i=1}^k (1+\varepsilon_{i} \Lambda_i \chi_{\gamma_i}(x))

for some {\gamma_1, \ldots, \gamma_k \in \hat G} and {\varepsilon_1, \ldots, \varepsilon_k \in \{-1,1\}} and {\Lambda_1, \ldots, \Lambda_k \in \{\mathrm{Re},\mathrm{Im}\}} .

Theorem 4 For every {\eta > 0} , the following holds. For every {f : G \rightarrow \mathbb R_+} with {\mathbb E_{\mu} [f]=1} , there exists a {g : G \rightarrow \mathbb R_+} with {\mathbb E_{\mu} [g] = 1} such that {\|\hat f - \hat g\|_{\infty} \leq \eta} and {g} is a convex combination of degree-{k} Reisz products where

\displaystyle  k \leq O(1) \frac{\mathrm{Ent}_{\mu}(f)}{\eta} + O(\log (1/\eta))\,.

1.8. A prologue’s prologue

To indicate the lack of algebraic structure required for the preceding statement, we can set things up in somewhat greater generality.

For simplicity, let {X} be a finite set equipped with a probability measure {\mu} . Recall that {L^2(X)} is the Hilbert space of real-valued functions on {X} equipped with the inner product {\langle f,g\rangle = \mathop{\mathbb E}_{\mu}[fg]} . Let {\mathcal F \subseteq L^2(X)} be a set of functionals with the property that {\|\varphi\|_{\infty} \leq 1} for {\varphi \in \mathcal F} .

Define a degree-{k} {\mathcal F} -Riesz product as a function of the form

\displaystyle  R(x) = \prod_{i=1}^k (1+\varphi_i(x))

for some functions {\varphi_1, \ldots, \varphi_k \in \mathcal F} . Define also the (semi-) norm {\|f\|_{\mathcal F} = \sup_{\varphi \in \mathcal F} |\langle f,\varphi\rangle_{L^2(X)}|} .

Theorem 5 For every {\eta > 0} , the following holds. For every {f : X \rightarrow \mathbb R_+} with {\mathbb E_{\mu} [f]=1} , there exists a {g : X \rightarrow \mathbb R_+} with {\mathbb E_{\mu} [g] = 1} such that {\|f - g\|_{\mathcal F} \leq \eta} and {g} is a convex combination of degree-{k} {\mathcal F} -Riesz products where

\displaystyle  k \leq O(1) \frac{\mathrm{Ent}_{\mu}(f)}{\eta} + O(\log (1/\eta))\,.

Entropy optimality: Chang’s Lemma

In the last post, we saw how minimizing the relative entropy allowed us to obtain a “simple” feasible point of a particular polytope. We continue exploring this theme by giving a related proof of Chang’s Lemma, a Fourier-analytic result that arose in additive combinatorics. Our proof will reveal another aspect of entropy optimality: “Projected sub-gradient descent” and a corresponding convergence analysis.

1.1. Discrete Fourier analysis and Chang’s Lemma

Fix {n \geq 1} and a prime {p} . Let {\mathbb F_p} denote the finite field with {p} elements. We will work over {\mathbb F_p^n} . For convenience, denote {G=\mathbb F_p^n} . Let {\mu} denote the uniform measure on {G} , and for a non-negative function {f : G \rightarrow [0,\infty)} , define the relative entropy

\displaystyle  \mathrm{Ent}_{\mu}(f) = \mathbb E_{\mu} \left[f \log \frac{f}{\mathbb E_{\mu} f}\right]\,,

where we use the shorthand {\mathop{\mathbb E}_{\mu} [f] = \sum_{x \in G} \mu(x) f(x)} . If {\mathop{\mathbb E}_{\mu} [f] = 1} , we say that {f} is a density (with respect to {\mu} ). In that case, {\mathrm{Ent}_{\mu}(f) = D(f \mu \,\|\,\mu)} where {D} denotes the relative entropy (introduced in the last post).

We briefly review some notions from discrete Fourier analysis. One could skip this without much loss; our proof will only use boundedness of the Fourier characters and no other features. For functions {f,g : G \rightarrow \mathbb C} , we will use the inner product {\langle f,g\rangle = \mathop{\mathbb E}_{\mu}[f\bar g]} and the norms {\|f\|_p = (\mathop{\mathbb E}_{\mu}[|f|^p])^{1/p}} for {p \geq 1} .

For an element {\alpha \in G} , we use

\displaystyle  \chi_{\alpha}(x) = \exp\left(\frac{-2\pi}{p} \sum_{i=1}^n \alpha_i x_i\right)

to denote the corresponding Fourier character. Now every {f : G \rightarrow \mathbb C} can be written in the Fourier basis as

\displaystyle  f = \sum_{\alpha \in G} \hat f(\alpha) \chi_{\alpha}\,,

where {\hat f(\alpha) = \langle f,\chi_{\alpha}\rangle} . (For simplicity of notation, we have implicitly identified {G} and {\hat G} .)

We will be interested in the large Fourier coefficients of a function. To that end, for {\eta > 0} , we define

\displaystyle  \Delta_{\eta}(f) = \left\{ \alpha \in G : |\hat f(\alpha)| > \eta \|f\|_1 \right\}\,,

Now we are ready to state Chang’s Lemma; it asserts that if {\mathrm{Ent}_{\mu}(f)} is small, then all of {f} ‘s large Fourier coefficients are contained in a low-dimensional subspace.

Lemma 1 If {f : G \rightarrow [0,\infty)} is a density, then for every {\eta > 0} the set {\Delta_{\eta}(f)} is contained in an {\mathbb F_p} -subspace of dimension at most {d} , where

\displaystyle  d \leq 2 \sqrt{2} \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}\,.

In the special case {p=2} , the proof will actually give constant {2} instead of {2 \sqrt{2}} .

1.2. A slightly stronger statement

In fact, we will prove the following slightly stronger statement.

Theorem 2 If {f : G \rightarrow [0,\infty)} is a density, then for every {\eta > 0} there exists a density {g : G \rightarrow [0,\infty)} such that {\Delta_{\eta}(f) \subseteq \Delta_0(g)} and moreover {g(x) = \Psi(\chi_{\alpha_1}(x), \ldots, \chi_{\alpha_d}(x))} for some {d \leq 1+2\sqrt{2} \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}} and {\{\alpha_i\} \subseteq G} and some function \Psi : \mathbb C^d \to \mathbb R . In other words, {g} is a function of at most {d} characters.

It is an exercise in linear algebra to see that for such a density {g} , the set of non-zero Fourier coefficients {\Delta_0(g)} is contained in the span of {\alpha_1, \ldots, \alpha_d} . Since {\Delta_{\eta}(f) \subseteq \Delta_0(g)} , this yields Lemma 1.

1.3. Entropy optimality

Let {L^2(G)} denote the Hilbert space of real-valued functions on {G} equipped with the inner product {\langle \cdot, \cdot\rangle} . Assume now that {f : G \rightarrow [0,\infty)} is a density and fix {\eta > 0} .

Dtnote by {\mathcal F} the family of functions of the form {\pm \mathrm{Re}\,\chi_\alpha} or {\pm \mathrm{Im}\, \chi_\alpha} for some {\alpha \in G} .

Let us associate to {f} the convex polytope {P(f, \eta) \subseteq L^2(G)} of functions {g : G \rightarrow [0,\infty)} satisfying

\displaystyle  \langle g,\varphi\rangle \geq \langle f,\varphi\rangle - \eta

for all {\varphi \in \mathcal F} .

Just as in the matrix scaling setting, {P(f,\eta)} encapsulates the family of “tests” that we care about (which Fourier coefficient magnitudes are large). Note that if {g \in P(f,\eta/\sqrt{2})} , then {\Delta_{\eta}(f) \subseteq \Delta_0(g)} . We also let {Q} denote the probability (density) simplex: The set of all {g : G \rightarrow [0,\infty)} such that {\mathop{\mathbb E}_{\mu} g = 1} .

Of course, {P(f,\eta) \cap Q} is non-empty because {f \in P(f,\eta) \cap Q} . As before, we will try to find a “simple” representative from {P(f,\eta) \cap Q} by doing entropy maximization (which is the same thing as relative entropy minimization):

\displaystyle  \textrm{minimize } \mathrm{Ent}_{\mu}(g) = \sum_{x \in G} \mu(x) g(x) \log g(x) \qquad \textrm{ subject to } g \in P(f,\eta) \cap Q\,.

1.4. Sub-gradient descent

Soon we will explore this optimization using convex duality. But in this post, let us give an algorithm that tries to locate a feasible point of {P(f,\eta) \cap Q} by a rather naive form of (sub-)gradient descent.

We start with the initial point {g_0 = \mathbf{1}} which is the function that is identically {1} on all of {G} . Note that by strict convexity, this is the unique maximizer of {\mathrm{Ent}_{\mu}(g)} among all {g \in Q} . But clearly it may be that {g_0 \notin P(f,\eta)} .

So suppose we have a function {g_t \in Q} for some {t \geq 0} . If {g_t \in P(f,\eta)} , we are done. Otherwise, there exists a violated constraint: Some {\varphi_t \in \mathcal F} such that {\langle g_t, \varphi_t\rangle < \langle f,\varphi_t\rangle - \eta} . Our goal is now to update {g_t} to {g_{t+1} \in Q} that doesn't violate this constraint quite as much.

Notice the dichotomy being drawn here between the “soft'' constraints of {P(f,\eta)} (which we try to satisfy slowly) and the “hard'' constraints of {Q} that we are enforcing at every step. (This procedure and the analysis that follows are a special case of the mirror descent framework; see, for example, Section 4 of Bubeck’s monograph. In particular, there is a more principled way to design the algorithm that follows.)

We might try to write {g_{t+1} = g_t + \varepsilon \varphi_t} for some small {\varepsilon > 0} , but then {g_{t+1}} could take negative values! Instead, we will follow the approximation {e^x \approx 1+x} (for {x} small) and make the update

\displaystyle  g_{t+1} = \frac{g_t \exp\left(\varepsilon \varphi_t\right)}{\mathbb E_{\mu}[g_t \exp\left(\varepsilon \varphi_t\right)]}\,.

Observe that the exponential weight update keeps {g_{t+1}} positive, and our explicit renormalization ensures that {\mathop{\mathbb E}_{\mu} [g_{t+1}]=1} . Thus {g_{t+1} \in Q} .

Of course, one might feel fishy about this whole procedure. Sure, we have improved the constraint corresponding to {\varphi_t} , but maybe we messed up some earlier constraints that we had satisfied. While this may happen, it cannot go on for too long. In the next post, we will prove the following.

Lemma 3 For an appropriate choice of step size {\varepsilon > 0} , it holds that {g_T \in P(f,\eta)} for some {T \leq 2 \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}} .

Notice that this lemma proves Theorem 2 and thus finishes our proof of Chang’s Lemma. That’s because by virtue of {g_T \in P(f,\eta/\sqrt{2})} , we know that {\Delta_{\eta}(f) \subseteq \Delta_0(g_T)} . Moreover, {g_T} is built out of at most {T+1} characters {\alpha_0, \alpha_1, \ldots, \alpha_{T}} , and thus it satisfies the conclusion of Theorem 2.

For the case {p=2} , we do not have to approximate the real and complex parts separately, so one saves a factor of {\sqrt{2}} .

1.5. Bibliographic notes

This argument bears some similarity to the proof of Impagliazzo, Moore, and Russell. It was discovered in a joint work with Chan, Raghavendra, and Steurer on extended formulations. In both those arguments, the “descent” is implicit and is done by hand—the characters are carefully chosen not to interact. It turns out that this level of care is unnecessary. The analysis in the next post will allow the descent to choose any violated constraint at any time, and moreover will only use the fact that the characters are bounded.

Entropy optimality: Matrix scaling

This is the first in a series of posts on the surprising power of “entropy optimality.” I have recently encountered many incarnations of this principle in a number of d\ifferent settings (functional analysis, stochastic processes, additive number theory, communication complexity, etc.). It is also connected to many well-studied phenomena in machine learning, optimization, and statistical physics. In the spirit of blogs, my goal is for each post to highlight a single interesting aspect of the theory.

For the first post, we’ll look at a very simple but compelling example: matrix scaling. This is a problem that arises in statistics, numerical analysis, and a number of other areas (see Section 1.2 here for references).

Suppose we are given an {n \times n} target matrix {T=(t_{ij})} with nonnegative entries. We also have an {n \times n} input matrix {X=(x_{ij})} and our goal is to multiply the rows and columns of {X} by positive weights so that the resulting matrix has the same row and column sums as {T} .

Stated differently, we look for diagonal matrices {D_1} and {D_2} such that {D_1 X D_2} has the same row and column sums as {T} . A typical choice of target might be {t_{ij} = 1/n} for all {1 \leq i,j \leq n} , meaning that we want to rescale {X} to be doubly stochastic.

Theorem 1 If {x_{ij} = 0 \iff t_{ij} = 0} for all {1 \leq i,j \leq n} , then such a weighting exists.

To prove the theorem, we will search for a matrix {Y = (y_{ij})} with the same row and column sums as {T} : For all {i,j \in [n]} :

\displaystyle \sum_{i} y_{ij}=\sum_i t_{ij} \quad \textrm{ and } \quad \sum_j y_{ij} = \sum_j t_{ij}\,.      (1)

Clearly such a matrix exists; we could just take {Y=T} ! But the idea is that we want {Y}  to be a “simple perturbation” of {X} . We will follow a philosophy analogous to Jaynes’ principle of maximum entropy. What results is precisely the argument of Franklin and Lorenz.

1.1. The relative entropy

Given two probability distributions {\mu} and {\nu} on a finite set {S} , we recall that the relative entropy between two measures {\mu} and {\nu} on {X} is given by

\displaystyle D(\mu\,\|\,\nu) = \sum_{x \in S} \mu(x) \log \frac{\mu(x)}{\nu(x)}\,.

We take this quantity to be infinite unless {\nu(x) = 0 \implies \mu(x)=0} . A fundamental property of {D} is that it is jointly convex in its arguments. For the following, we will only need that {D} is convex in {\mu} which is a simple consequence of the fact that {y \mapsto y \log y} is a convex function on {\mathbb R_+} .

In proving Theorem 1, we may clearly assume that {\sum_{ij} x_{ij} = \sum_{ij} t_{ij} = 1} . Now let {P(T) \subseteq \mathbb R^{n^2}} denote the affine subspace of all {Y} satisfying the constraints (1). We will consider the optimization problem:

\displaystyle \textrm{minimize } D(Y \,\|\, X) = \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} \quad \textrm{ subject to } Y \in P(T)\quad (2)

The idea is that the objective function prefers {Y} to be as close to {X} as possible in terms of relative entropy while still satisfying the constraints. Philosophically, this is supposed to encourage {Y} to be “simple” with respect to {X} .

Observe that, by our assumption on {X} , it follows that {T \in P(T)} is a point satisfying {D(T \,\|\,X) < \infty} . Note that since {P(T)} is a convex, compact set, we know that there is an optimizer, and since {Y \mapsto D(Y \,\|\,X)} is strictly convex, the optimal solution {Y^*} is unique. It will turn out that {Y^*} is of the form {D_1 X D_2} , and this fact will prove our theorem. The last step is to understand the structure of {Y^*} in terms of the Lagrangian.

1.2. The Lagrangian

Let us introduce {2n} dual variables {{ \alpha_i, \beta_j \in \mathbb R : i,j \in [n] }} and consider the Lagrangian (see the Boyd-Vandenberghe book) corresponding to the optimization (2):

\displaystyle \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} + \sum_i \alpha_i \sum_j (t_{ij}-y_{ij}) + \sum_j \beta_j \sum_i (t_{ij}-y_{ij})\,.

At optimality, the gradient should be equal to zero. If we differentiate with respect to some {y_{ij}} , we see that

\displaystyle 1+\log y_{ij} + \alpha_i + \beta_j - \log x_{ij} = 0\,,

which implies that, at optimality,

\displaystyle y^*_{ij} = x_{ij}, e^{-\alpha^*_i-\beta^*_j-1}\,.

In particular, we see that {Y^*} is obtained from {X} by multiplying the rows by {{e^{-\alpha_i^*-1}}} and multiplying the columns by {{e^{-\beta_j^*}}} , completing the proof of Theorem 1.

1.3. Being careful about optimality

An astute reader (or, in this case, commenter) might point out that we did not use the condition {t_{ij} = 0 \implies x_{ij}=0} . But clearly it is necessary, as one can see from examining the pair

\displaystyle  T = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right) \quad X = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right)\,.

See these lecture notes for a detailed account.