My student Punya Biswal just completed this great survey on hypercontractivity and its application in computer science. There is a PDF version from his home page, and accompanying slides.

Hypercontractive inequalities are a useful tool in dealing with extremal questions in the geometry of high-dimensional discrete and continuous spaces. In this survey we trace a few connections between different manifestations of hypercontractivity, and also present some relatively recent applications of these techniques in computer science.

### 1. Preliminaries and notation

**Fourier analysis on the hypercube.** We define the inner product on functions , where the expectation is taken over the uniform (counting) measure on . The multilinear polynomials (where ranges over subsets of ) form an orthogonal basis under this inner product; they are called the Fourier basis. Thus, for any function , we have , where the Fourier coefficients obey Plancherel’s relation . It is easy to verify that and .

**Norms.** For , define the norm . These norms are monotone in : for every function , implies . For a linear operator carrying functions to functions , we define the -to- operator norm . is said to be a contraction from to when . Because of the monotonicity of norms, a contraction from to is automatically a contraction from to for any . When and , then is said to be hypercontractive.

**Convolution operators.** Letting represent the coordinatewise product of , we define the convolution of two functions , and note that it is a linear operator for every fixed . Convolution is commutative and associative, and the Fourier coefficients of a convolution satisfy the useful property . We shall be particularly interested in the convolution properties of the following functions

- The Dirac delta , given by and otherwise. It is the identity for convolution and has for all .
- The edge functions given by
is or according as contains or does not contain , respectively. For any function , , where is obtained from by flipping just the th bit. Convolution with acts as an orthogonal projection (as we can easily see in the Fourier domain), so for any functions , we have

- The Bonami-Gross-Beckner noise functions for , where and we define . These operators form a semigroup, because and . Note that . We define the noise operator acting on functions on the discrete cube by . In combinatorial terms, is the expected value of , where is obtained from by independently flipping each bit of with probability .

Lemma 1

*Proof:* This is easy in the Fourier basis:

### 2. The Bonami-Gross-Beckner Inequality

#### 2.1. Poincaré and Log-Sobolev inequalities

The Poincaré and logarithmic Sobolev inequalities both relate a function’s global non-constantness to how fast it changes “locally”. The amount of local change is quantified by the *energy* , where the Dirichlet form is defined as

( is the set of pairs that differ in a single coordinate). In terms of the edge functions , observe that .

In the case of the Poincaré inequality, we measure the distance of to a constant by its variance . Then the Poincaré constant (of the discrete cube) is the supremal such that the inequality

holds for all . This quantity is also the smallest nonzero eigenvalue of the Laplacian of the discrete cube, viewed as a graph (i.e., its spectral expansion).

Another way of measuring the non-constantness of a function is to consider its entropy (where we assume and use the convention that ). Note that for any , so the entropy is homogenous of degree in its argument. Because we are comparing the entropy with the energy (which is homogenous of degree ) we use the entropy of the *square* of the function to define the Log-Sobolev constant: the largest such that the inequality

holds for all .

For the discrete cube , we have and , as we shall see below. It is interesting to ask how these quantities are related when we consider other probability spaces equipped with a suitable Dirichlet form (for example, -regular graphs with , where the expectation is taken over all edges). Set for a sufficiently small and observe that and , whereas

This shows that , which is tight in the case of the cube. However, for constant-degree expander families (in particular, for random -regular graphs with high probability) we have (from Example 4.2, Diaconis and Saloff-Coste) but .

#### 2.2. Hypercontractivity and the log-Sobolev inequality

When , the noise operator is easily seen to contract : for any , we have . Now consider its behavior from to for some . When , we have ; in particular, for , . On the other hand, , so . By the intermediate value theorem, there must be some such that . A theorem of Gross connects this critical with the Log-Sobolev constant of the underlying space:

Stated differently, when . Thus to prove hypercontractive inequalities on the discrete cube, it suffices to bound the log-Sobolev constant. We shall prove this claim for , which turns out to imply the general version.

*Proof:* (of Theorem 2) We shall prove that for ; the remainder of the theorem can be shown using similar techniques. As we observed before, this inequality is tight when , so it suffices to show that for . For notational convenience, let . Then

Now we use the fact that to get

Applying Lemma 3 and simplifying, we get

We use Lemma 4 to handle the second term, and plug in to get

whose positivity we are guaranteed by the log-Sobolev inequality applied to .

*Proof:* Recalling Lemma 1 and the projection property of the s, we have

*Proof:* It suffices to show that for all and . But observe that

and the inequality between the integrals follows from convexity.

#### 2.3. Two-point inequality

We begin by showing that the log-Sobolev inequality holds for the uniform distribution on the two-point space with . Without loss of generality, consider . Then

and . We shall show that is non-negative for . By symmetry it suffices to consider . But and

which is non-negative because and

#### 2.4. Tensoring property

Theorem 5Let be the log-Sobolev constant of . Then the log-Sobolev constant of is .

When is a power of , we can conclude inductively that ; a proof along similar lines works for arbitrary as well.

*Proof:* } For any , set . Then by the conditional entropy formula,

and by convexity,

where the notation ranges over edges of . Taken together, these give

as claimed.

#### 2.5. Non-product groups

Recall that we defined the Dirichlet form

for functions , but it makes sense for any regular graph if we sample uniformly from the edges. Thus, given any family of regular graphs, we can ask if they satisfy a log-Sobolev inequality of the form for all suitable .

It turns out that the relationship between logarithmic Sobolev inequalities and hypercontractive noise operator subgroups, as stated by Gross, holds for a wide class of spaces, not just the hypercube . Diaconis and Saloff-Coste explored an intermediate between these two extremes of specialization to give improved mixing time results for Markov chains on various graphs.

One of the first discrete applications of hypercontractivity was a celebrated theorem of Kahn, Kalai and Linial relating the maximum influence of a function on the hypercube to its variance. In Theorem 7, we discuss some recent work of O’Donnell and Wimmer generalizing the KKL theorem to apply to the wider class of Schreier graphs associated with group actions (defined below).

An action of a group on a set is a homomorphism from to the group of bijections on , and we write for the image of under the bijection for . If is a set of generators for , then the Schreier graph has vertex set and edges for all and . It is known that every connected regular graph of even degree can be obtained in this way. The definition of the Dirichlet form generalizes without change, but to be able to derive a log-Sobolev inequality for this space, we must define the noise operator in an appropriate fashion to satisfy the claim of Lemma 1: .

### 3. Boolean-Valued Functions

#### 3.1. Influences

Write for the collection of random variables . The influence of the th coordinate on a function is given by

When is Boolean-valued, this quantity is just the probability that changing changes . Writing in the Fourier basis, we have and , so that . In addition, we define the total influence .

#### 3.2. Structural results

Boolean functions are natural combinatorial objects, but they were first studied from an analytical viewpoint in work on voting and social choice. In this setting, a function is viewed as a way to combine the preferences of voters to yield the result of the election. This explains the notions of dictator or junta functions, which depend on only one or a few of their coordinates, respectively. In this context it is also natural to consider functions where no coordinate (“voter”) has a very large influence. Kahn, Kalai, and Linial first introduced the Fourier analysis of Boolean functions as a technique in computer science. Their theorem establishes that if a function is far from a constant (i.e., has variance at least a constant), then it must have a variable of influence . We state a strengthening of their original inequality due to Talagrand:

Theorem 6For any ,

We can compare this to the Poincaré inequality on the cube, which can be stated as

(In particular, there exists a variable of influence .) The KKL theorem is a stronger result of the same form: it is a comparison between a local and a global measure of variation. The proofs of KKL and Talagrand used the hypercontractivity of the cube, but we present here a more recent proof due to Rossignol that uses the log-Sobolev inequality instead. For simplicity we’ll just show the weaker statement that the maximum influence is .

*Proof:* Write , where . For each , the log-Sobolev inequality states that . By writing in terms of the Fourier coefficients , we can check that , so that we can sum all these inequalities to obtain

In order to bound , we begin by noting that

where the s are the edge functions we defined earlier. Letting , we have

where we have used the orthogonality of the s and the fact that .

To bound , we split it up further:

For , we have that is a nonpositive decreasing function and therefore,

By comparing Fourier coefficients, it is easy to verify that . Therefore, by convexity,

Until now, the proof has made no use of the fact that takes on only Boolean values. Now we argue that because , we must have , so that . Plugging this into our bound for yields

For , note that is increasing, so

Summing all these bounds gives us

By the Poincaré inequality, , so we can set . With this substitution, the above inequality becomes

Suppose . Then

and we know that . On the other hand, if , then

We are now in a position to state the recent result of O’Donnell and Wimmer generalizing the KKL theorem to Schreier graphs satisfying a certain technical property.

Theorem 7Let be a group acting on a set , be a union of conjugacy classes that generates , and be the log-Sobolev constant of . Then for any ,

In particular, there is some such that .

For an Abelian group such as (the cube), every group element is in a conjugacy class by itself, so the extra condition on is vacuous. Using for the cube, we recover the original KKL theorem. O’Donnell et al. apply the generalized result to the non-Abelian group of permutations on , generated by transpositions and acting on the family of -subsets of . By viewing these families as sets of -bit strings, they recover a “rigidity” version of the Kruskal-Katona theorem that states (roughly) that if a subset of a layer of a cube has a small expansion to the layer above it, then it must be correlated to some dictator function.

**Coding theoretic interpretation.** In the *long code*, an integer is encoded as the dictator function . By using many more bits ( rather than ) of redundant storage, we hope to be able to recover from corruptions in the data. The theorem tells us that as long as the corrupted version of an encoding is far from a constant function, it can be decoded to a coordinate whose influence is times the average influence. Since every coordinate’s influence is nonnegative, only coordinates can have influence this large. Thus, we have a {}“small” set of candidate long codes to which we might decode the word. To complete this picture, we’d like to understand how far the word can be from functions that depend only on these coordinates; the following theorem of Friedgut, which we state without proof, furnishes this information.

Theorem 8For every and , there is a function depending on at most variables such that .

### 4. Gaussian isoperimetry and an algorithmic application

Hypercontractive inequalities were first investigated in the context of Gaussian probability spaces, for their applications to quantum field theory. The following simple proof reduces the continuous Gaussian hypercontractive inequality to its discrete counterpart on the cube.

#### 4.1. From the central limit theorem to Gaussian hypercontractivity

Theorem 9Let be normally distributed, i.e.,

Then for a smooth function , the random variable satisfies

with and

*Proof:* We shall approximate the Gaussian distribution by a weighted sum of Bernoulli variables. Let be uniformly distributed, and set . By the log-Sobolev inequality applied to , we have . By the central limit theorem, the right side converges to as , so it remains to show that the left side converges to as well. Let be the value obtained by replacing the th coordinate of with the value , and observe that . Then, using the smoothness of , we have

so that

The second term vanishes as , and the first term converges to by the Central Limit Theorem.

The tensoring property of log-Sobolev inequalities lets us extend this result to Gaussian distributions over . We are also interested in the corresponding noise operator , known as the Ornstein-Uhlenbeck operator, which is given by

Theorem 2 has an analog in this setting, which lets us conclude that every function satisfies where and .

#### 4.2. Reverse hypercontractivity and isoperimetry

In 1982, Borell showed a *reversed* inequality of a similar form when :

Theorem 10 (Reverse hypercontractivity)Fix and such that . Then for any positive-valued function , we have .

Note that the expressions are not norms when ; in particular, they are not convex. However, this theorem can be proved by means similar to our proof for the Gaussian log-Sobolev inequality: we start with a base result for the 2-point space, proceed by tensoring to the hypercube, and use the central limit theorem to cover Gaussian space.

As an application of Borell’s result, consider the following strong isoperimetry theorem for Gaussian space (due to Sherman).

Theorem 11 (Gaussian isoperimetry)Let be independent Gaussian random variables. Then for any set and any , we have

*Proof:* When , there is nothing to prove. Otherwise, let be the indicator function of and observe that . Therefore, for , we have

by an application of Markov’s inequality. But is just , and we know by Borell’s theorem that for . Thus

where we have used the facts that and .

#### 4.3. Fast graph partitioning and the constructive Big Core Theorem

**Problem and SDP rounding algorithm.** In the -balanced separator problem, we are given a graph on vertices and asked to find the smallest set of edges such that their removal disconnects the graph into pieces of size at most . The problem is NP-hard, and the best known approximation ratio\footnote{For technical reasons, it is actually a pseudo-approximation: the algorithm’s output for is compared to the optimal value for .} is .

The first algorithm to achieve this bound was based on a semidefinite program that assigns a unit vector to each vertex and minimizes the total embedded squared length of the edges subject to the constraint that the vertices are spread out and that the squared distances between the points form a metric:

To round this SDP, Arora, Rao and Vazirani pick a random direction and project all the points along . They then define sets and consisting of points whose projections are sufficiently large, i.e., and similarly , where is chosen to make and have size with high probability. Next, they discard points such that is much smaller than expected for a pair whose projections are apart. Finally, if the resulting pruned sets and are large enough, they show that greedily growing yields a good cut.

**Matchings and cores.** The key step in making this argument work is to ensure that not too many pairs are removed in the pruning step. To bound the probability of this bad event, we consider the possibility that for a large fraction of directions , there exists a matching of points such that each pair is short (i.e., ) but stretched along (i.e., ). Such a set of points is called a *-core*. The big core theorem (first proved with optimal parameters by Lee) asserts that this situation can’t arise: for a fixed , and , we must have , which is a contradiction for our chosen values of .

In order to prove the big core theorem, Lee concatenates pairs that share a point and belong in matchings for nearby directions. The existence of a long chain of such concatenations is what leads to a contradiction: if we consider the endpoints of a chain of length , the projection grows linearly in whereas the distance grows only as (recall that the SDP constrained the *squared* distances to form a metric).

**Boosting.** The matching chaining argument we have just presented in its simple form doesn’t work, for the following reason. At each chaining step, the fraction of nearby directions available for our use reduces by roughly (by a union bound) so that we are rapidly left with no direction to move in. To remedy this situation, we need to boost the fraction of usable directions at each step, say from to , so that we can carry on chaining in spite of a loss. Lee’s proof uses the standard isoperimetric inequality for the sphere to show that this boosting can be performed with no change in and a very small penalty in . In other words, we take advantage of the fact that a very small dilation of a set of constant measure (i.e., the set of available directions) has measure close to .

**Faster algorithms.** Lee’s big core theorem is non-constructive in the sense that it only shows the *existence* of such a long chain of matched pairs in order to give a contradiction. While this form suffices to bound the approximation ratio of the ARV rounding scheme, other variants of their technique require a way to *efficiently sample* long chains, not just show their existence. Sherman constructs a distribution over directions that does not depend on the point set at all, yet is guaranteed to always have a non-trivial probability of producing long chains of stretched pairs. More precisely,

Theorem 12 (Constructive big core)For any , there is and an efficiently sampleable distribution over the set of sequences of direction vectors (each in ), such that: for any -core , if the string of directions is sampled from , the expected number of chains whose endpoints are apart is at least .

We sketch some of the ideas of the proof here. Sherman constructs two sequences of Gaussian directions and . Each is an independent Gaussian vector, whereas each for is a Gaussian vector -correlated with . Finally, the distribution is given by randomly shuffling together the and , picking a uniformly random between and , and returning the first elements of the shuffled sequence. The correlated directions correspond to the steps in which Lee’s proof chained pairs from similar directions, whereas the independent correspond to the region-growing steps necessary for boosting. By randomly interleaving these two types of moves, Sherman’s sampling algorithm can be oblivious to the actual point set it is acting on.

### 5. Complexity theoretic applications

#### 5.1. Dictatorship testing with perfect completeness

**Definitions.** A function is said to be -quasirandom if whenever . In order to show that a given problem is hard to approximate, we often need to design a test that

- performs
*queries*on a black-box function , - accepts every dictator function with probability (the
*completeness*probability), and - accepts every -quasirandom function with probability (the
*soundness*probability). A test is said to be*adaptive*if each query is allowed to depend on the result of the queries so far.

While dictatorship tests for the setting have been known for over a decade (first from the work of Håstad and more recently via the Unique Games Conjecture of Khot), there were no nontrivial bounds for until some recent results of O’Donnell and Wu. Their analysis, which we show below, relies heavily on the hypercontractive inequality.

Theorem 13For every , there is a -query non-adaptive test that accepts every dictator function with probability but accepts any -quasirandom odd function with probability .

The proof uses the following strengthening of the hypercontractive inequality for restricted parameter values.

Lemma 14If , , and satisfy , then for all , .

*Proof:*

*Proof:* (of Theorem 13) Define the “not-two” predicate as follows: if exactly two of equal , and otherwise. Explicitly,

Let be a parameter to be fixed later. For , we pick bits as follows:

- with probability : we choose uniformly and independently, then set ;
- with probability : we choose uniformly, then set . Note that for , is independent of . We accept if . It is immediate from the construction of that for . Therefore, if is a dictator function, it follows that must also equal .

**Soundness.** It remains to analyze the test when is pseudorandom. We begin by writing in the Fourier basis: . Therefore, by symmetry,

We shall systematically rewrite the right-hand side in terms of the Fourier coefficients of . By our assumption that is odd, we have whenever has even cardinality. Therefore . Also,

Consider a summand where , and without loss of generality fix . It is easy to see that the contributions due to cancel each other. Thus, the only terms that remain are of the form , i.e.,

where we have used the fact that . But is nonzero only for odd, and , so we can upper-bound the above sum by .

**Bounding the cubic term.** We proceed similarly:

Each of the expectations can be written as a product over coordinates using the fact that individual coordinates of are chosen independently. When belongs to exactly one of (say ), then it contributes a factor , making the product zero. Similarly, when belongs to two of the sets (say ), then the contribution is by our earlier calculation. Finally, when belongs to all three of the sets, we have . In light of this calculation, any triple that makes a nonzero contribution to the sum (1) must be of the form

for suitable sets where is disjoint from . Also must be odd, from which we can show that must be odd. In terms of these new sets we can rewrite

For a fixed , define the function by . Then we have

Write . Then, using the inequality , we have

and therefore,

To bound the first term, note that . The sum of the squared Fourier coefficients is just (by Parseval’s identity) and we can use the -pseudorandomness property to bound the quantity in the maximum: when , then and when then . Thus the entire first summand is .

**Hypercontractivity.** It remains to bound . Fix and apply the modified hypercontractive inequality:

Now, and . The contribution of the corresponding term to the sum we were trying to bound is . For each choice of , the terms sum to at most one, and all the terms themselves sum to at most one. Therefore, we have bounded the entire sum by as desired.

#### 5.2. Integrality gap for Unique Label Cover SDP

**Problem and SDP relaxation.** In the Unique Label Cover problem, we are given a label set and a weighted multigraph whose edges are labeled by permutations , and are asked to find an assignment of labels to edges that maximizes the fraction of edges that are “consistent” with our labeling, i.e., . If there exists a labeling that satisfies all the edges, then it is easy to find such a labeling. However, when all we can guarantee is that fraction of the edges can be satisfied, it is not known how to find a labeling satisfying even of them. At the same time, present techniques cannot show that finding a -consistent labeling is NP-hard.

One approach to solving this problem is to use an extension of the Goemans-Williamson SDP for Max-Cut, where we set up a vector for every vertex and label :

(The expectation in the objective is over a distribution where is picked with probability proportional to its weight.) The intent is that should be the probability that receives label , and should be the corresponding joint probability. It is easy to see that this SDP is a relaxation of the original problem.

**Gap instance.** In an influential paper, Khot and Vishnoi constructed an integrality gap for this SDP: for a label set of size and an arbitrary parameter , a graph whose optimal labeling satisfies fraction of the edges, but for which the SDP optimum is at least . The hypercontractive inequality plays a central role in the soundness analysis, which we present below.

Let be the set of all functions and be the Fourier basis ; clearly, . Observe that is an Abelian group under pointwise multiplication, and is a subgroup. We take the quotient to be the vertex set. Fix an arbitrary representative for each coset and write . We shall define a weighted edge between every pair of these representative functions, then show how to extend this definition to all pairs of functions, and finally map these edges to edges between cosets.

- The edge has weight equal to , where are drawn to be -correlated on every bit with uniform marginals, where .
- With every edge between representative functions, we associate the identity permutation.
- A non-representative function acts as if its label is assigned according to its coset’s representative. Thus, the permutation associated with is .
- In the actual graph under consideration, every edge appears as an edge (with the same permutation and weight).

**Soundness analysis.** Given a labeling on the cosets, we consider the induced labeling given by . From our definitions, it is clear that the objective value attained by is precisely , where are chosen as before. Fix any label and consider the indicator function of functions that labels with . Since exactly one function in each coset gets labeled , we know that . Therefore,

which we can upper-bound (using hypercontractivity) by .

Correction: As increases $ decreases

Is this something wrong or am i terribly confused?

Hi, your first comment is not true if the norms are defined using averages. I have fixed the definition of convolution–it was a typo.

Thanks for the reply. I would love to see a proof for the fact that is monotonic in .

Also the fourier coeff for , .

Girish: The proof follows from the concavity of Shannon Entropy:

Fix any vector and define:

and then one has

and then by concavity of entropy, , as required.

Sorry, the missing formula in the last post should be:

OMG! it’s hard. -.-“

Using Shannon entropy sounds like a very indirect way.

If $p\ge q$ then $E[|x|^p]^{1/p} = E[(|x|^q)^{p/q}]^{1/p}$. Since the the function $f(x) = x^{p/q}$ is convex (as $p/q\ge 1)$, by Jensen’s inequality,

$\|x\|_p = E[(|x|^q)^{p/q}]^{1/p} \ge (E[|x|^q])^{p/q * 1/p} = \| x \|_q$.