My student Punya Biswal just completed this great survey on hypercontractivity and its application in computer science. There is a PDF version from his home page, and accompanying slides.
Hypercontractive inequalities are a useful tool in dealing with extremal questions in the geometry of high-dimensional discrete and continuous spaces. In this survey we trace a few connections between different manifestations of hypercontractivity, and also present some relatively recent applications of these techniques in computer science.
1. Preliminaries and notation
Fourier analysis on the hypercube. We define the inner product on functions
, where the expectation is taken over the uniform (counting) measure on
. The multilinear polynomials
(where
ranges over subsets of
) form an orthogonal basis under this inner product; they are called the Fourier basis. Thus, for any function
, we have
, where the Fourier coefficients
obey Plancherel’s relation
. It is easy to verify that
and
.
Norms. For , define the
norm
. These norms are monotone in
: for every function
,
implies
. For a linear operator
carrying functions
to functions
, we define the
-to-
operator norm
.
is said to be a contraction from
to
when
. Because of the monotonicity of norms, a contraction from
to
is automatically a contraction from
to
for any
. When
and
, then
is said to be hypercontractive.
Convolution operators. Letting represent the coordinatewise product of
, we define the convolution
of two functions
, and note that it is a linear operator
for every fixed
. Convolution is commutative and associative, and the Fourier coefficients of a convolution satisfy the useful property
. We shall be particularly interested in the convolution properties of the following functions
- The Dirac delta
, given by
and
otherwise. It is the identity for convolution and has
for all
.
- The edge functions
given by
is
or
according as
contains or does not contain
, respectively. For any function
,
, where
is obtained from
by flipping just the
th bit. Convolution with
acts as an orthogonal projection (as we can easily see in the Fourier domain), so for any functions
, we have
- The Bonami-Gross-Beckner noise functions
for
, where
and we define
. These operators form a semigroup, because
and
. Note that
. We define the noise operator
acting on functions on the discrete cube by
. In combinatorial terms,
is the expected value of
, where
is obtained from
by independently flipping each bit of
with probability
.
Lemma 1
![]()
Proof: This is easy in the Fourier basis:
2. The Bonami-Gross-Beckner Inequality
2.1. Poincaré and Log-Sobolev inequalities
The Poincaré and logarithmic Sobolev inequalities both relate a function’s global non-constantness to how fast it changes “locally”. The amount of local change is quantified by the energy , where the Dirichlet form
is defined as
( is the set of pairs
that differ in a single coordinate). In terms of the edge functions
, observe that
.
In the case of the Poincaré inequality, we measure the distance of to a constant by its variance
. Then the Poincaré constant (of the discrete cube) is the supremal
such that the inequality
holds for all . This quantity is also the smallest nonzero eigenvalue of the Laplacian of the discrete cube, viewed as a graph (i.e., its spectral expansion).
Another way of measuring the non-constantness of a function is to consider its entropy (where we assume
and use the convention that
). Note that
for any
, so the entropy is homogenous of degree
in its argument. Because we are comparing the entropy with the energy (which is homogenous of degree
) we use the entropy of the square of the function to define the Log-Sobolev constant: the largest
such that the inequality
holds for all .
For the discrete cube , we have
and
, as we shall see below. It is interesting to ask how these quantities are related when we consider other probability spaces equipped with a suitable Dirichlet form (for example,
-regular graphs with
, where the expectation is taken over all edges). Set
for a sufficiently small
and observe that
and
, whereas
This shows that , which is tight in the case of the cube. However, for constant-degree expander families (in particular, for random
-regular graphs with high probability) we have (from Example 4.2, Diaconis and Saloff-Coste)
but
.
2.2. Hypercontractivity and the log-Sobolev inequality
When , the noise operator
is easily seen to contract
: for any
, we have
. Now consider its behavior from
to
for some
. When
, we have
; in particular, for
,
. On the other hand,
, so
. By the intermediate value theorem, there must be some
such that
. A theorem of Gross connects this critical
with the Log-Sobolev constant
of the underlying space:
Stated differently, when
. Thus to prove hypercontractive inequalities on the discrete cube, it suffices to bound the log-Sobolev constant. We shall prove this claim for
, which turns out to imply the general version.
Proof: (of Theorem 2) We shall prove that for
; the remainder of the theorem can be shown using similar techniques. As we observed before, this inequality is tight when
, so it suffices to show that
for
. For notational convenience, let
. Then
Now we use the fact that to get
Applying Lemma 3 and simplifying, we get
We use Lemma 4 to handle the second term, and plug in to get
whose positivity we are guaranteed by the log-Sobolev inequality applied to .
Proof: Recalling Lemma 1 and the projection property of the s, we have
Proof: It suffices to show that for all
and
. But observe that
and the inequality between the integrals follows from convexity.
2.3. Two-point inequality
We begin by showing that the log-Sobolev inequality holds for the uniform distribution on the two-point space with
. Without loss of generality, consider
. Then
and . We shall show that
is non-negative for
. By symmetry it suffices to consider
. But
and
which is non-negative because and
2.4. Tensoring property
Theorem 5 Let
be the log-Sobolev constant of
. Then the log-Sobolev constant of
is
.
When is a power of
, we can conclude inductively that
; a proof along similar lines works for arbitrary
as well.
Proof: } For any , set
. Then by the conditional entropy formula,
and by convexity,
where the notation ranges over edges of
. Taken together, these give
as claimed.
2.5. Non-product groups
Recall that we defined the Dirichlet form
for functions , but it makes sense for any regular graph if we sample
uniformly from the edges. Thus, given any family of regular graphs, we can ask if they satisfy a log-Sobolev inequality of the form
for all suitable
.
It turns out that the relationship between logarithmic Sobolev inequalities and hypercontractive noise operator subgroups, as stated by Gross, holds for a wide class of spaces, not just the hypercube . Diaconis and Saloff-Coste explored an intermediate between these two extremes of specialization to give improved mixing time results for Markov chains on various graphs.
One of the first discrete applications of hypercontractivity was a celebrated theorem of Kahn, Kalai and Linial relating the maximum influence of a function on the hypercube to its variance. In Theorem 7, we discuss some recent work of O’Donnell and Wimmer generalizing the KKL theorem to apply to the wider class of Schreier graphs associated with group actions (defined below).
An action of a group on a set
is a homomorphism from
to the group of bijections on
, and we write
for the image of
under the bijection for
. If
is a set of generators for
, then the Schreier graph
has vertex set
and edges
for all
and
. It is known that every connected regular graph of even degree can be obtained in this way. The definition of the Dirichlet form
generalizes without change, but to be able to derive a log-Sobolev inequality for this space, we must define the noise operator
in an appropriate fashion to satisfy the claim of Lemma 1:
.
3. Boolean-Valued Functions
3.1. Influences
Write for the collection of random variables
. The influence of the
th coordinate on a function
is given by
When is Boolean-valued, this quantity is just the probability that changing
changes
. Writing
in the Fourier basis, we have
and
, so that
. In addition, we define the total influence
.
3.2. Structural results
Boolean functions are natural combinatorial objects, but they were first studied from an analytical viewpoint in work on voting and social choice. In this setting, a function is viewed as a way to combine the preferences of
voters to yield the result of the election. This explains the notions of dictator or junta functions, which depend on only one or a few of their coordinates, respectively. In this context it is also natural to consider functions where no coordinate (“voter”) has a very large influence. Kahn, Kalai, and Linial first introduced the Fourier analysis of Boolean functions as a technique in computer science. Their theorem establishes that if a function is far from a constant (i.e., has variance at least a constant), then it must have a variable of influence
. We state a strengthening of their original inequality due to Talagrand:
Theorem 6 For any
,
We can compare this to the Poincaré inequality on the cube, which can be stated as
(In particular, there exists a variable of influence .) The KKL theorem is a stronger result of the same form: it is a comparison between a local and a global measure of variation. The proofs of KKL and Talagrand used the hypercontractivity of the cube, but we present here a more recent proof due to Rossignol that uses the log-Sobolev inequality instead. For simplicity we’ll just show the weaker statement that the maximum influence is
.
Proof: Write , where
. For each
, the log-Sobolev inequality states that
. By writing
in terms of the Fourier coefficients
, we can check that
, so that we can sum all these inequalities to obtain
In order to bound , we begin by noting that
where the s are the edge functions we defined earlier. Letting
, we have
where we have used the orthogonality of the s and the fact that
.
To bound , we split it up further:
For , we have that
is a nonpositive decreasing function and therefore,
By comparing Fourier coefficients, it is easy to verify that . Therefore, by convexity,
Until now, the proof has made no use of the fact that takes on only Boolean values. Now we argue that because
, we must have
, so that
. Plugging this into our bound for
yields
For , note that
is increasing, so
Summing all these bounds gives us
By the Poincaré inequality, , so we can set
. With this substitution, the above inequality becomes
Suppose . Then
and we know that . On the other hand, if
, then
We are now in a position to state the recent result of O’Donnell and Wimmer generalizing the KKL theorem to Schreier graphs satisfying a certain technical property.
Theorem 7 Let
be a group acting on a set
,
be a union of conjugacy classes that generates
, and
be the log-Sobolev constant of
. Then for any
,
In particular, there is some
such that
.
For an Abelian group such as (the cube), every group element is in a conjugacy class by itself, so the extra condition on
is vacuous. Using
for the cube, we recover the original KKL theorem. O’Donnell et al. apply the generalized result to the non-Abelian group
of permutations on
, generated by transpositions and acting on the family
of
-subsets of
. By viewing these families as sets of
-bit strings, they recover a “rigidity” version of the Kruskal-Katona theorem that states (roughly) that if a subset of a layer of a cube has a small expansion to the layer above it, then it must be correlated to some dictator function.
Coding theoretic interpretation. In the long code, an integer is encoded as the dictator function
. By using many more bits (
rather than
) of redundant storage, we hope to be able to recover from corruptions in the data. The theorem tells us that as long as the corrupted version of an encoding is far from a constant function, it can be decoded to a coordinate whose influence is
times the average influence. Since every coordinate’s influence is nonnegative, only
coordinates can have influence this large. Thus, we have a {}“small” set of candidate long codes to which we might decode the word. To complete this picture, we’d like to understand how far the word can be from functions that depend only on these coordinates; the following theorem of Friedgut, which we state without proof, furnishes this information.
Theorem 8 For every
and
, there is a function
depending on at most
variables such that
.
4. Gaussian isoperimetry and an algorithmic application
Hypercontractive inequalities were first investigated in the context of Gaussian probability spaces, for their applications to quantum field theory. The following simple proof reduces the continuous Gaussian hypercontractive inequality to its discrete counterpart on the cube.
4.1. From the central limit theorem to Gaussian hypercontractivity
Theorem 9 Let
be normally distributed, i.e.,
Then for a smooth function
, the random variable
satisfies
with
and
Proof: We shall approximate the Gaussian distribution by a weighted sum of Bernoulli variables. Let be uniformly distributed, and set
. By the log-Sobolev inequality applied to
, we have
. By the central limit theorem, the right side converges to
as
, so it remains to show that the left side converges to
as well. Let
be the value obtained by replacing the
th coordinate of
with the value
, and observe that
. Then, using the smoothness of
, we have
so that
The second term vanishes as , and the first term converges to
by the Central Limit Theorem.
The tensoring property of log-Sobolev inequalities lets us extend this result to Gaussian distributions over . We are also interested in the corresponding noise operator
, known as the Ornstein-Uhlenbeck operator, which is given by
Theorem 2 has an analog in this setting, which lets us conclude that every function satisfies
where
and
.
4.2. Reverse hypercontractivity and isoperimetry
In 1982, Borell showed a reversed inequality of a similar form when :
Theorem 10 (Reverse hypercontractivity) Fix
and
such that
. Then for any positive-valued function
, we have
.
Note that the expressions are not norms when
; in particular, they are not convex. However, this theorem can be proved by means similar to our proof for the Gaussian log-Sobolev inequality: we start with a base result for the 2-point space, proceed by tensoring to the hypercube, and use the central limit theorem to cover Gaussian space.
As an application of Borell’s result, consider the following strong isoperimetry theorem for Gaussian space (due to Sherman).
Theorem 11 (Gaussian isoperimetry) Let
be independent Gaussian random variables. Then for any set
and any
, we have
Proof: When , there is nothing to prove. Otherwise, let
be the indicator function of
and observe that
. Therefore, for
, we have
by an application of Markov’s inequality. But is just
, and we know by Borell’s theorem that
for
. Thus
where we have used the facts that and
.
4.3. Fast graph partitioning and the constructive Big Core Theorem
Problem and SDP rounding algorithm. In the -balanced separator problem, we are given a graph
on
vertices and asked to find the smallest set of edges such that their removal disconnects the graph into pieces of size at most
. The problem is NP-hard, and the best known approximation ratio\footnote{For technical reasons, it is actually a pseudo-approximation: the algorithm’s output for
is compared to the optimal value for
.} is
.
The first algorithm to achieve this bound was based on a semidefinite program that assigns a unit vector to each vertex and minimizes the total embedded squared length of the edges subject to the constraint that the vertices are spread out and that the squared distances between the points form a metric:
To round this SDP, Arora, Rao and Vazirani pick a random direction and project all the points along
. They then define sets
and
consisting of points
whose projections are sufficiently large, i.e.,
and similarly
, where
is chosen to make
and
have size
with high probability. Next, they discard points
such that
is much smaller than expected for a pair whose projections are
apart. Finally, if the resulting pruned sets
and
are large enough, they show that greedily growing
yields a good cut.
Matchings and cores. The key step in making this argument work is to ensure that not too many pairs are removed in the pruning step. To bound the probability of this bad event, we consider the possibility that for a large fraction
of directions
, there exists a matching of points
such that each pair
is short (i.e.,
) but stretched along
(i.e.,
). Such a set of points is called a
-core. The big core theorem (first proved with optimal parameters by Lee) asserts that this situation can’t arise: for a fixed
, and
, we must have
, which is a contradiction for our chosen values of
.
In order to prove the big core theorem, Lee concatenates pairs that share a point and belong in matchings for nearby directions. The existence of a long chain of such concatenations is what leads to a contradiction: if we consider the endpoints of a chain of length
, the projection
grows linearly in
whereas the distance
grows only as
(recall that the SDP constrained the squared distances to form a metric).
Boosting. The matching chaining argument we have just presented in its simple form doesn’t work, for the following reason. At each chaining step, the fraction of nearby directions available for our use reduces by roughly (by a union bound) so that we are rapidly left with no direction to move in. To remedy this situation, we need to boost the fraction of usable directions at each step, say from
to
, so that we can carry on chaining in spite of a
loss. Lee’s proof uses the standard isoperimetric inequality for the sphere to show that this boosting can be performed with no change in
and a very small penalty in
. In other words, we take advantage of the fact that a very small dilation of a set of constant measure (i.e., the set of available directions) has measure close to
.
Faster algorithms. Lee’s big core theorem is non-constructive in the sense that it only shows the existence of such a long chain of matched pairs in order to give a contradiction. While this form suffices to bound the approximation ratio of the ARV rounding scheme, other variants of their technique require a way to efficiently sample long chains, not just show their existence. Sherman constructs a distribution over directions that does not depend on the point set at all, yet is guaranteed to always have a non-trivial probability of producing long chains of stretched pairs. More precisely,
Theorem 12 (Constructive big core) For any
, there is
and an efficiently sampleable distribution
over the set of sequences of
direction vectors (each in
), such that: for any
-core
, if the string of directions is sampled from
, the expected number of chains whose endpoints are
apart is at least
.
We sketch some of the ideas of the proof here. Sherman constructs two sequences of Gaussian directions and
. Each
is an independent Gaussian vector, whereas each
for
is a Gaussian vector
-correlated with
. Finally, the distribution
is given by randomly shuffling together the
and
, picking a uniformly random
between
and
, and returning the first
elements of the shuffled sequence. The correlated directions
correspond to the steps in which Lee’s proof chained pairs from similar directions, whereas the independent
correspond to the region-growing steps necessary for boosting. By randomly interleaving these two types of moves, Sherman’s sampling algorithm can be oblivious to the actual point set it is acting on.
5. Complexity theoretic applications
5.1. Dictatorship testing with perfect completeness
Definitions. A function is said to be
-quasirandom if
whenever
. In order to show that a given problem is hard to approximate, we often need to design a test that
- performs
queries on a black-box function
,
- accepts every dictator function with probability
(the completeness probability), and
- accepts every
-quasirandom function with probability
(the soundness probability). A test is said to be adaptive if each query is allowed to depend on the result of the queries so far.
While dictatorship tests for the setting have been known for over a decade (first from the work of Håstad and more recently via the Unique Games Conjecture of Khot), there were no nontrivial bounds for
until some recent results of O’Donnell and Wu. Their analysis, which we show below, relies heavily on the hypercontractive inequality.
Theorem 13 For every
, there is a
-query non-adaptive test that accepts every dictator function
with probability
but accepts any
-quasirandom odd function
with probability
.
The proof uses the following strengthening of the hypercontractive inequality for restricted parameter values.
Lemma 14 If
,
, and
satisfy
, then for all
,
.
Proof:
Proof: (of Theorem 13) Define the “not-two” predicate as follows:
if exactly two of
equal
, and
otherwise. Explicitly,
Let be a parameter to be fixed later. For
, we pick bits
as follows:
- with probability
: we choose
uniformly and independently, then set
;
- with probability
: we choose
uniformly, then set
. Note that for
,
is independent of
. We accept if
. It is immediate from the construction of
that
for
. Therefore, if
is a dictator function, it follows that
must also equal
.
Soundness. It remains to analyze the test when is pseudorandom. We begin by writing
in the Fourier basis:
. Therefore, by symmetry,
We shall systematically rewrite the right-hand side in terms of the Fourier coefficients of . By our assumption that
is odd, we have
whenever
has even cardinality. Therefore
. Also,
Consider a summand where , and without loss of generality fix
. It is easy to see that the contributions due to
cancel each other. Thus, the only terms that remain are of the form
, i.e.,
where we have used the fact that . But
is nonzero only for
odd, and
, so we can upper-bound the above sum by
.
Bounding the cubic term. We proceed similarly:
Each of the expectations can be written as a product over coordinates using the fact that individual coordinates of
are chosen independently. When
belongs to exactly one of
(say
), then it contributes a factor
, making the product zero. Similarly, when
belongs to two of the sets (say
), then the contribution is
by our earlier calculation. Finally, when
belongs to all three of the sets, we have
. In light of this calculation, any triple
that makes a nonzero contribution to the sum (1) must be of the form
for suitable sets where
is disjoint from
. Also
must be odd, from which we can show that
must be odd. In terms of these new sets we can rewrite
For a fixed , define the function
by
. Then we have
Write . Then, using the inequality
, we have
and therefore,
To bound the first term, note that . The sum of the squared Fourier coefficients is just
(by Parseval’s identity) and we can use the
-pseudorandomness property to bound the quantity in the maximum: when
, then
and when
then
. Thus the entire first summand is
.
Hypercontractivity. It remains to bound . Fix
and apply the modified hypercontractive inequality:
Now, and
. The contribution of the corresponding term to the sum we were trying to bound is
. For each choice of
, the
terms sum to at most one, and all the
terms themselves sum to at most one. Therefore, we have bounded the entire sum by
as desired.
5.2. Integrality gap for Unique Label Cover SDP
Problem and SDP relaxation. In the Unique Label Cover problem, we are given a label set and a weighted multigraph
whose edges are labeled by permutations
, and are asked to find an assignment
of labels to edges that maximizes the fraction of edges
that are “consistent” with our labeling, i.e.,
. If there exists a labeling that satisfies all the edges, then it is easy to find such a labeling. However, when all we can guarantee is that
fraction of the edges can be satisfied, it is not known how to find a labeling satisfying even
of them. At the same time, present techniques cannot show that finding a
-consistent labeling is NP-hard.
One approach to solving this problem is to use an extension of the Goemans-Williamson SDP for Max-Cut, where we set up a vector for every vertex
and label
:
(The expectation in the objective is over a distribution where is picked with probability proportional to its weight.) The intent is that
should be the probability that
receives label
, and
should be the corresponding joint probability. It is easy to see that this SDP is a relaxation of the original problem.
Gap instance. In an influential paper, Khot and Vishnoi constructed an integrality gap for this SDP: for a label set of size and an arbitrary parameter
, a graph whose optimal labeling satisfies
fraction of the edges, but for which the SDP optimum is at least
. The hypercontractive inequality plays a central role in the soundness analysis, which we present below.
Let be the set of all functions
and
be the Fourier basis
; clearly,
. Observe that
is an Abelian group under pointwise multiplication, and
is a subgroup. We take the quotient
to be the vertex set. Fix an arbitrary representative for each coset and write
. We shall define a weighted edge between every pair of these representative functions, then show how to extend this definition to all pairs of functions, and finally map these edges to edges between cosets.
- The edge
has weight equal to
, where
are drawn to be
-correlated on every bit with uniform marginals, where
.
- With every edge
between representative functions, we associate the identity permutation.
- A non-representative function acts as if its label is assigned according to its coset’s representative. Thus, the permutation associated with
is
.
- In the actual graph under consideration, every edge
appears as an edge
(with the same permutation and weight).
Soundness analysis. Given a labeling on the cosets, we consider the induced labeling
given by
. From our definitions, it is clear that the objective value attained by
is precisely
, where
are chosen as before. Fix any label
and consider the indicator function
of functions that
labels with
. Since exactly one function in each coset gets labeled
, we know that
. Therefore,
which we can upper-bound (using hypercontractivity) by .
Correction: As
increases
$ decreases
Is this something wrong or am i terribly confused?
Hi, your first comment is not true if the norms are defined using averages. I have fixed the definition of convolution–it was a typo.
Thanks for the reply. I would love to see a proof for the fact that
is monotonic in
.
Also the fourier coeff for
,
.
Girish: The proof follows from the concavity of Shannon Entropy:
Fix any vector
and define:
and then one has
and then by concavity of entropy,
, as required.
Sorry, the missing formula in the last post should be:
OMG! it’s hard. -.-“
Using Shannon entropy sounds like a very indirect way.
If $p\ge q$ then $E[|x|^p]^{1/p} = E[(|x|^q)^{p/q}]^{1/p}$. Since the the function $f(x) = x^{p/q}$ is convex (as $p/q\ge 1)$, by Jensen’s inequality,
$\|x\|_p = E[(|x|^q)^{p/q}]^{1/p} \ge (E[|x|^q])^{p/q * 1/p} = \| x \|_q$.