
EE378A Statistical Signal Processing Lecture 2 - 04/06/2017

Lecture 2: Statistical Decision Theory
Lecturer: Jiantao Jiao    Scribe: Andrew Hilger

In this lecture, we discuss a unified theoretical framework of statistics proposed by Abraham Wald, which is named statistical decision theory.¹

1 Goals

1. Evaluation: The theoretical framework should aid fair comparisons between algorithms (e.g., maximum entropy vs. maximum likelihood vs. method of moments).

2. Achievability: The theoretical framework should be able to inspire the construction of statistical algorithms that are (nearly) optimal under the optimality criteria introduced in the framework.

2 Basic Elements of Statistical Decision Theory

1. Statistical Experiment: A family of probability measures P = {Pθ : θ ∈ Θ}, where θ is a parameter and Pθ is a probability distribution indexed by the parameter.

2. Data: X ∼ Pθ, where X is a random variable observed for some parameter value θ.

3. Objective: g(θ), e.g., inference on the entropy of distribution Pθ.

4. Decision Rule: δ(X). The decision rule need not be deterministic. In other words, there could be a probabilistically defined decision rule with an associated Pδ|X.

5. Loss Function: L(θ, δ). The loss function tells us how bad we feel about our decision once we find out the true value of the parameter θ chosen by nature.

Example: Pθ(x) = (1/√(2π)) e^{−(x−θ)²/2}, g(θ) = θ, and L(θ, δ) = (θ − δ)². In other words, X is normally distributed with mean θ and unit variance, X ∼ N(θ, 1), and we are trying to estimate the mean θ. We judge our success (or failure) using mean-square error.

3 Risk Function

Definition 1 (Risk Function).

R(θ, δ) ≜ E[L(θ, δ(X))]                                        (1)
        = ∫ L(θ, δ(x)) Pθ(dx)                                  (2)
        = ∫∫ L(θ, δ) P_{δ|X}(dδ|x) Pθ(dx).                     (3)

¹See Wald, Abraham. "Statistical Decision Functions." In Breakthroughs in Statistics, pp. 342-357. Springer New York, 1992.


Figure 1: Example risk functions computed over a range of parameter values for two different decision rules.

A risk function evaluates a decision rule's success over a large number of experiments with a fixed parameter value θ. By the law of large numbers, if we observe X many times independently, the average empirical loss of the decision rule δ will converge to the risk R(θ, δ).

Even after determining the risk functions of two decision rules, it may still be unclear which is better. Consider the example of Figure 1. Two different decision rules δ1 and δ2 result in two different risk functions R(θ, δ1) and R(θ, δ2), evaluated over different values of the parameter θ. The first decision rule δ1 is inferior for low and high values of the parameter θ but is superior for the middle values. Thus, even after computing a risk function R, it can still be unclear which decision rule is better. We need new ideas to enable us to compare different decision rules.

4 Optimality Criterion of Decision Rules

Given the risk function of various decision rules as a function of the parameter θ, there are various approachesto determining which decision rule is optimal.

4.1 Restrict the Competitors

This is a traditional set of methods that were overshadowed by the other approaches we introduce later. A decision rule δ′ is eliminated (or formally, is inadmissible) if there is another decision rule δ that is strictly better, i.e., R(θ, δ′) ≥ R(θ, δ) for all θ ∈ Θ, with strict inequality for at least one θ0 ∈ Θ. However, the problem is that many decision rules cannot be eliminated in this way, and we still lack a criterion to determine which of them is better. To aid in selection, the rationale of restricting the competitors is that only decision rules belonging to a certain class D are considered. The advantage is that sometimes all but one decision rule in D is inadmissible, and we just use the only admissible one.

1. Example 1: Class of unbiased decision rules D′ = {δ : Eθ[δ(X)] = g(θ), ∀θ ∈ Θ}

2. Example 2: Class of invariant estimators.

However, a serious drawback of this approach is that D may be an empty set for various decision-theoretic problems.


4.2 Bayesian: Average Risk Optimality

The idea is to use averaging to reduce the risk function R(θ, δ) to a single number for any given δ.

Definition 2 (Average risk under prior Λ(dθ)).

r(Λ, δ) = ∫ R(θ, δ) Λ(dθ)                                      (4)

Here Λ is the prior distribution, a probability measure on Θ. The Bayesians and the frequentists disagree about Λ; namely, the frequentists do not believe in the existence of a prior. However, there exist justifications of the Bayesian approach beyond the interpretation of Λ as prior belief: indeed, the complete class theorem in statistical decision theory asserts that in various decision-theoretic problems, all admissible decision rules can be approximated by Bayes estimators.²

Definition 3 (Bayes estimator).

δΛ = arg min_δ r(Λ, δ)                                         (5)

The Bayes estimator δΛ can usually be found using the principle of computing posterior distributions. Note that

r(Λ, δ) = ∫∫∫ L(θ, δ) P_{δ|X}(dδ|x) Pθ(dx) Λ(dθ)               (6)
        = ∫ (∫∫ L(θ, δ) P_{δ|X}(dδ|x) P_{θ|X}(dθ|x)) PX(dx)    (7)

where PX(dx) is the marginal distribution of X and P_{θ|X}(dθ|x) is the posterior distribution of θ given X. In Equation (6), Pθ(dx)Λ(dθ) is the joint distribution of θ and X. In Equation (7), we only have to minimize the portion in parentheses to minimize r(Λ, δ), because PX(dx) does not depend on δ.

Theorem 4.³ Under mild conditions,

δΛ(x) = arg min_δ E[L(θ, δ)|X = x]                             (8)
      = arg min_{P_{δ|X}} ∫∫ L(θ, δ) P_{δ|X}(dδ|x) P_{θ|X}(dθ|x)   (9)

Lemma 5. If L(θ, δ) is convex in δ, it suffices to consider deterministic rules δ(x).

Proof  Jensen's inequality:

∫∫ L(θ, δ) P_{δ|X}(dδ|x) P_{θ|X}(dθ|x) ≥ ∫ L(θ, ∫ δ P_{δ|X}(dδ|x)) P_{θ|X}(dθ|x).   (10)

Examples

1. L(θ, δ) = (g(θ) − δ)² ⇒ δΛ(x) = E[g(θ)|X = x]. In other words, the Bayes estimator under squared error loss is the conditional expectation of g(θ) given x.

2. L(θ, δ) = |g(θ)− δ| ⇒ δΛ(x) is any median of the posterior distribution Pθ|X=x.

3. L(θ, δ) = 1(g(θ) ≠ δ) ⇒ δΛ(x) = arg max_θ P_{θ|X}(θ|x) = arg max_θ Pθ(x)Λ(θ). In other words, the indicator loss function results in the maximum a posteriori (MAP) estimator.⁴

²See Chapter 3 of Friedrich Liese and Klaus-J. Miescke. "Statistical Decision Theory: Estimation, Testing, and Selection." Springer, 2009.
³See Theorem 1.1, Chapter 4 of Lehmann, E. L., and Casella, G. Theory of Point Estimation. Springer Science & Business Media, 1998.
⁴Next week, we will cover special cases of Pθ and how to compute the Bayes estimator in a computationally efficient way. In the general case, however, computing the posterior distribution may be difficult.


4.3 Frequentist: Worst-Case Optimality (Minimax)

Definition 6 (Minimax estimator). The decision rule δ* is minimax among all decision rules in D iff

sup_{θ∈Θ} R(θ, δ*) = inf_{δ∈D} sup_{θ∈Θ} R(θ, δ).              (11)

4.3.1 First observation

Since

R(θ, δ) = ∫∫ L(θ, δ) P_{δ|X}(dδ|x) Pθ(dx)                      (12)

is linear in P_{δ|X}, the risk is a convex function of the (randomized) decision rule δ. The supremum of convex functions is convex, so finding the minimax decision rule is a convex optimization problem. However, solving this convex optimization problem may be computationally intractable. For example, it may not even be computationally tractable to compute the supremum of the risk function R(θ, δ) over θ ∈ Θ. Hence, finding the exact minimax estimator is usually hard.

4.3.2 Second observation

Due to the previous difficulty of finding the exact minimax estimator, we turn to another goal: we wish to find an estimator δ′ such that

inf_δ sup_θ R(θ, δ) ≤ sup_θ R(θ, δ′) ≤ c · inf_δ sup_θ R(θ, δ)   (13)

where c > 1 is a constant. The left inequality is trivially true. For the right inequality, in practice one can usually choose some specific δ′ and evaluate an upper bound on sup_θ R(θ, δ′) explicitly. However, it remains to find a lower bound on inf_δ sup_θ R(θ, δ). To solve the problem (and save the world), we can use the minimax theorem.

Theorem 7 (Minimax Theorem (Sion–Kakutani)). Let Λ, X be two compact, convex sets in some topological vector spaces. Let H(λ, x) : Λ × X → R be a continuous function such that:

1. H(λ, ·) is convex for any fixed λ ∈ Λ;

2. H(·, x) is concave for any fixed x ∈ X.

Then

1. Strong duality: max_λ min_x H(λ, x) = min_x max_λ H(λ, x);

2. Existence of a saddle point:

∃(λ*, x*) : H(λ, x*) ≤ H(λ*, x*) ≤ H(λ*, x)   ∀λ ∈ Λ, x ∈ X.

The existence of a saddle point implies the strong duality.

We note that, other than the strong duality, the following weak duality is always true without assumptions on H:

sup_λ inf_x H(λ, x) ≤ inf_x sup_λ H(λ, x)                      (14)

We define the quantity rΛ ≜ inf_δ r(Λ, δ) as the Bayes risk under the prior distribution Λ. We have the following line of argument using weak duality:

inf_δ sup_θ R(θ, δ) = inf_δ sup_Λ r(Λ, δ)                      (15)
                    ≥ sup_Λ inf_δ r(Λ, δ)                      (16)
                    = sup_Λ rΛ                                 (17)


Equation (17) gives us a strong tool for lower bounding the minimax risk: for any prior distribution Λ, the Bayes risk under Λ is a lower bound on the corresponding minimax risk. When the condition of the minimax theorem is satisfied (which may be expected due to the bilinearity of r(Λ, δ) in the pair (Λ, δ)), equation (16) achieves equality, which shows that there exists a sequence of priors such that the corresponding Bayes risk sequence converges to the minimax risk.

In practice, it suffices to choose some appropriate prior distribution Λ in order to find the (nearly) minimax estimator.
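As a concrete illustration of this recipe (a sketch under assumptions not in the notes: squared error loss and X ∼ Binomial(n, θ)), the prior Beta(√n/2, √n/2) is a classical least favorable prior: the corresponding Bayes estimator δ(S) = (S + √n/2)/(n + √n) has constant risk, so its Bayes risk equals its worst-case risk and (15)-(17) hold with equality. The snippet below verifies the constant risk numerically.

import numpy as np

n = 100
c = np.sqrt(n) / 2                        # Beta(c, c) prior; delta(S) = (S + c)/(n + 2c)
thetas = np.linspace(0.01, 0.99, 99)
# Exact risk: E[(delta - theta)^2] = [n theta (1-theta) + c^2 (1-2 theta)^2] / (n + 2c)^2
risk = (n * thetas * (1 - thetas) + c**2 * (1 - 2 * thetas)**2) / (n + 2 * c)**2
print(risk.min(), risk.max())             # constant in theta
print(1 / (4 * (np.sqrt(n) + 1)**2))      # the common value, 1 / (4 (sqrt(n) + 1)^2)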

5 Decision Theoretic Interpretation of Maximum Entropy

Definition 8 (Logarithmic Loss). For a distribution P over X and x ∈ X,

L(x, P) ≜ log (1/P(x)).                                        (18)

Note that P(x) is the discrete mass on x in the discrete case, and is the density at x in the continuous case.

By definition, if P(x) is small, then the loss is large. Now we see how the maximum likelihood estimator reduces to the so-called maximum entropy estimator under the logarithmic loss function. Specifically, suppose we have functions r1(x), · · · , rk(x) and compute the following empirical averages:

βi ≜ (1/n) ∑_{j=1}^n ri(xj) = E_{Pn}[ri(X)],   i = 1, · · · , k      (19)

where Pn denotes the empirical distribution based on the n samples x1, · · · , xn. By the law of large numbers, it is expected that βi is close to the true expectation under P, so we may restrict the true distribution P to the following set:

P = {P : EP [ri(X)] = βi, 1 ≤ i ≤ k}. (20)

Also, denote by Q the set of all probability distributions which belong to the exponential family with statistics r1(x), · · · , rk(x), i.e.,

Q = {Q : Q(x) = (1/Z(λ)) e^{∑_{i=1}^k λi ri(x)}, ∀x ∈ X}        (21)

where Z(λ) is the normalization factor. Now we consider two estimators of P. The first one is the maximum likelihood estimator over Q, i.e.,

PML = arg max_{P∈Q} ∑_{j=1}^n log P(xj).                       (22)

The second one is the maximum entropy estimator over P, i.e.,

PME = arg max_{P∈P} H(P)                                       (23)

where H(P) is the entropy of P: H(P) = ∑_{x∈X} −P(x) log P(x) in the discrete case, and H(P) = ∫_X −P(x) log P(x) dx in the continuous case. The key result is as follows:⁵

Theorem 9. In the above setting, we have PML = PME.

⁵Berger, Adam L., Vincent J. Della Pietra, and Stephen A. Della Pietra. "A Maximum Entropy Approach to Natural Language Processing." Computational Linguistics 22.1 (1996): 39-71.


Proof  We begin with the following lemma, which is the key motivation for the use of the cross-entropy loss in machine learning.

Lemma 10. The true distribution minimizes the expected logarithmic loss:

P = arg min_Q EP[L(X, Q)]                                      (24)

and the corresponding minimum EP[L(X, P)] is the entropy of P.

Proof  The assertion follows from the non-negativity of the KL divergence D(P‖Q) ≥ 0, where D(P‖Q) = EP[ln (dP/dQ)].

Equipped with this lemma, and noting that EP[log (1/Q(x))] is linear in P and convex in Q, by the minimax theorem we have

max_{P∈P} H(P) = max_{P∈P} min_Q EP[log (1/Q(x))] = min_Q max_{P∈P} EP[log (1/Q(x))]      (25)

In particular, we know that (PME, PME) is a saddle point in P × M, where M denotes the set of all probability measures on X. We first take a look at PME, which maximizes the LHS. By variational methods⁶, it is easy to see that the entropy-maximizing distribution PME belongs to the exponential family Q, i.e.,

PME(x) = (1/Z(λ)) e^{∑_{i=1}^k λi ri(x)}                       (26)

for some parameters λ1, · · · , λk. Now, by the definition of saddle points, for any Q ∈ Q ⊂ M we have

E_{PME}[log (1/PME(x))] ≤ E_{PME}[log (1/Q(x))]                (27)

while for any Q ∈ Q with possibly different coefficients λ′,

E_{PME}[log (1/Q(x))] = E_{PME}[log Z(λ′) − ∑_i λ′i ri(x)]     [since Q ∈ Q]     (28)
                     = log Z(λ′) − ∑_i λ′i βi                  [since PME ∈ P]   (29)
                     = E_{Pn}[log Z(λ′) − ∑_i λ′i ri(x)]       [since Pn ∈ P]    (30)
                     = E_{Pn}[log (1/Q(x))]                                      (31)
                     = (1/n) ∑_{j=1}^n log (1/Q(xj)).                            (32)

As a result, the saddle point condition reduces to

(1/n) ∑_{j=1}^n log (1/PME(xj)) ≤ (1/n) ∑_{j=1}^n log (1/Q(xj)),   ∀Q ∈ Q        (33)

establishing the fact that PME = PML.

⁶See Chapter 12 of Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2006.
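Theorem 9 can also be checked numerically. The sketch below (assumptions not from the notes: alphabet X = {0, ..., 5} and a single statistic r(x) = x) fits the exponential-family parameter λ by matching the empirical moment β, which simultaneously solves the maximum likelihood problem (22) over Q and the maximum entropy problem (23) over P.

import numpy as np

xs = np.arange(6)                                     # alphabet {0, ..., 5}
samples = np.array([0, 1, 1, 2, 3, 5, 4, 2, 1, 0])    # toy data (hypothetical)
beta = samples.mean()                                 # empirical moment E_{P_n}[r(X)]

def q(lam):
    w = np.exp(lam * xs)                              # Q_lambda(x) = e^{lambda x} / Z(lambda)
    return w / w.sum()

# Solve E_{Q_lambda}[X] = beta by bisection; this moment is increasing in lambda.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if np.dot(q(mid), xs) < beta:
        lo = mid
    else:
        hi = mid
p_hat = q((lo + hi) / 2)                              # P_ME = P_ML on this model

print(np.dot(p_hat, xs), beta)                        # moment constraint satisfied
print(-np.dot(p_hat, np.log(p_hat)))                  # entropy of the fitted distribution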


EE378A Statistical Signal Processing Lecture 3 - 04/11/2017

Lecture 3: State Estimation in Hidden Markov Processes
Lecturer: Tsachy Weissman    Scribes: Alexandros Anemogiannis, Vighnesh Sachidananda

In this lecture¹, we define Hidden Markov Processes (HMPs) as well as the forward-backward recursion method that efficiently computes the posteriors in HMPs.

1 Markov Chain

We begin by introducing the Markov triplet, which is useful in the design and analysis of algorithms for state estimation in Hidden Markov Processes.

Let X, Y, Z denote discrete random variables. We say that X, Y, and Z form a Markov triplet (which we denote as X − Y − Z) given the following relationships:

X − Y − Z ⟺ p(x, z|y) = p(x|y) p(z|y)                          (1)
          ⟺ p(z|x, y) = p(z|y)                                 (2)

If X, Y, Z form a Markov triplet, their joint distribution enjoys the important property that it can be factored into the product of two functions φ1 and φ2.

Lemma 1. X − Y − Z ⇐⇒ ∃ φ1, φ2 s.t. p(x, y, z) = φ1(x, y)φ2(y, z).

Proof  (Necessity) Assume that X − Y − Z. Then by the definition of the Markov triplet, we have

p(x, y, z) = p(x|y, z) p(y, z) = p(x|y) p(y, z).

Taking φ1(x, y) = p(x|y) and φ2(y, z) = p(y, z) proves this direction.

(Sufficiency) Now suppose the functions φ1, φ2 exist as mentioned in the statement. Then we have:

p(x|y, z) = p(x, y, z) / p(y, z)
          = p(x, y, z) / ∑_x p(x, y, z)
          = φ1(x, y) φ2(y, z) / (φ2(y, z) ∑_x φ1(x, y))
          = φ1(x, y) / ∑_x φ1(x, y).

The final equality shows that p(x|y, z) is not a function of z; hence X − Y − Z follows by definition.

2 Hidden Markov Processes

Based on our understanding of the Markov Triplet, we define the Hidden Markov Process (HMP) as follows:

¹Reading: Ephraim & Merhav 2002; Lectures 9-10 of the 2016 notes.


Definition 2 (Hidden Markov Process). The process {(Xn, Yn)}n≥1 is a Hidden Markov Process if

• {Xn}n≥1 is a Markov process, i.e.,

∀ n ≥ 1 : p(x^n) = ∏_{t=1}^n p(xt|xt−1),

where by convention we take X0 ≡ 0 and hence p(x1|x0) = p(x1).

• {Yn}n≥1 is determined by a memoryless channel, i.e.,

∀ n ≥ 1 : p(y^n|x^n) = ∏_{t=1}^n p(yt|xt).

The processes {Xn}n≥1 and {Yn}n≥1 are called the state process and the noisy observations, respectively.

In view of the above definition, we can derive the joint distribution of the states and observations as follows:

p(x^n, y^n) = p(x^n) p(y^n|x^n) = ∏_{t=1}^n p(xt|xt−1) p(yt|xt).

The main problem in Hidden Markov Models is to compute the posterior probability of the state at any time, given all the observations up to that time, i.e., p(xt|y^t). The naive approach is to simply use the definition of conditional probability:

p(xt|y^t) = p(xt, y^t) / p(y^t) = [∑_{x^{t−1}} p(y^t|xt, x^{t−1}) p(xt, x^{t−1})] / [∑_{xt} ∑_{x^{t−1}} p(y^t|xt, x^{t−1}) p(xt, x^{t−1})].

The above approach needs exponentially many computations for both the numerator and the denominator. To avoid this problem, we develop an efficient method for computing the posterior probabilities using forward recursion. Before getting to the algorithm, we establish some conditional independence relations.

3 Conditional Independence Relations

We introduce some conditional independence relations which help us find more efficient ways of calculatingthe posterior.

1. (X^{t−1}, Y^t) − Xt − (X^n_{t+1}, Y^n_{t+1}):

Proof  We have:

p(x^n, y^n) = ∏_{i=1}^n p(xi|xi−1) p(yi|xi)
            = [∏_{i=1}^t p(xi|xi−1) p(yi|xi)] · [∏_{i=t+1}^n p(xi|xi−1) p(yi|xi)],

where the first bracket is φ1(xt, (x^{t−1}, y^t)) and the second is φ2(xt, (x^n_{t+1}, y^n_{t+1})). The statement then follows from Lemma 1.


2. (X^{t−1}, Y^{t−1}) − Xt − (X^n_{t+1}, Y^n_t):

Proof  Note that if {(Xi, Yi)}_{i=1}^n is a Hidden Markov Process, then the time-reversed process is also a Hidden Markov Process. Applying (a) to this reversed process proves (b).

3. X^{t−1} − (Xt, Y^{t−1}) − (X^n_{t+1}, Y^n_t):

Proof  Left as an exercise.

4 Causal Inference via Forward Recursion

We now derive the forward recursion algorithm as an efficient method to sequentially compute the causal posterior probabilities.

Note that we have

p(xt|y^t) = p(xt, y^t) / ∑_{xt} p(xt, y^t)
          = p(xt, y^{t−1}) p(yt|xt, y^{t−1}) / ∑_{xt} p(xt, y^{t−1}) p(yt|xt, y^{t−1})
          = p(y^{t−1}) p(xt|y^{t−1}) p(yt|xt) / (p(y^{t−1}) ∑_{xt} p(xt|y^{t−1}) p(yt|xt))    (by (b))
          = p(xt|y^{t−1}) p(yt|xt) / ∑_{xt} p(xt|y^{t−1}) p(yt|xt).

Define αt(xt) = p(xt|y^t) and βt(xt) = p(xt|y^{t−1}); then the above computation can be summarized as

αt(xt) = βt(xt) p(yt|xt) / ∑_{xt} βt(xt) p(yt|xt).             (3)

Similarly, we can write β as

β_{t+1}(x_{t+1}) = p(x_{t+1}|y^t)
                 = ∑_{xt} p(x_{t+1}, xt|y^t)
                 = ∑_{xt} p(xt|y^t) p(x_{t+1}|y^t, xt)
                 = ∑_{xt} p(xt|y^t) p(x_{t+1}|xt)              (by (a)).

The above equation can be summarized as

β_{t+1}(x_{t+1}) = ∑_{xt} αt(xt) p(x_{t+1}|xt).                (4)

Equations (3) and (4) indicate that αt and βt can be computed sequentially from each other for t = 1, · · · , n, with the initialization β1(x1) = p(x1). Hence, with this simple algorithm, causal inference can be done efficiently in terms of both computation and memory. This is called the forward recursion.


5 Non-causal Inference via Backward Recursion

Now suppose that we are interested in finding p(xt|y^n) for t = 1, 2, · · · , n in a non-causal manner. For this purpose, we can derive a backward recursion algorithm as follows.

p(xt|y^n) = ∑_{x_{t+1}} p(xt, x_{t+1}|y^n)
          = ∑_{x_{t+1}} p(x_{t+1}|y^n) p(xt|x_{t+1}, y^t)                         (by (c))
          = ∑_{x_{t+1}} p(x_{t+1}|y^n) p(xt|y^t) p(x_{t+1}|xt) / p(x_{t+1}|y^t).  (by (a))

Now, let γt(xt) = p(xt|y^n). Then the above equation can be summarized as

γt(xt) = ∑_{x_{t+1}} γ_{t+1}(x_{t+1}) αt(xt) p(x_{t+1}|xt) / β_{t+1}(x_{t+1}).    (5)

Equation (5) indicates that γt can be recursively computed based on the αt's and βt's for t = n−1, n−2, · · · , 1, with the initialization γn(xn) = αn(xn). This is called the backward recursion.


EE378A Statistical Signal Processing Lecture 4 - 04/13/2017

Lecture 4: State Estimation in Hidden Markov Models (cont.)
Lecturer: Tsachy Weissman    Scribe: David Wugofski

In this lecture we build on the previous analysis of state estimation for Hidden Markov Models to create an algorithm for estimating the posterior distribution of the states given the observations.

1 Notation

A quick summary of the notation:

1. Random Variables (objects): used more “loosely”, i.e. X, Y , U , V

2. Alphabets: X ,Y,U ,V

3. Specific Realizations of Random Variables: x, y, u, v

4. Vector Notation: For a vector (x1, · · · , xn), we denote the vector of values from the i-th component to the j-th component (inclusive) as x_i^j, with the convention x^i ≜ x_1^i.

5. Markov Chains: We use the standard Markov Chain notation X − Y − Z to denote that X,Y, Zform a Markov triplet.

For a discrete random variable (object), we write p(x) for the pmf P(X = x), and similarly p(x, y) for pX,Y(x, y), p(y|x) for pY|X(y|x), etc.

2 Recap of HMP

For a set of unknown states {Xn}n≥1 of a Markov process and noisy observations {Yn}n≥1 of those states taken through some memoryless channel, we can express the joint distribution of the states and outputs:

p(x^n, y^n) = ∏_{t=1}^n p(xt|xt−1) ∏_{t=1}^n p(yt|xt)          (1)

where x0 is taken to be some fixed arbitrary value (e.g., 0). Additionally, we have the following principles for this setting:

(X^{t−1}, Y^t) − Xt − (X^n_{t+1}, Y^n_{t+1})                   (a)
(X^{t−1}, Y^{t−1}) − Xt − (X^n_{t+1}, Y^n_t)                   (b)
X^{t−1} − (Xt, Y^{t−1}) − (X^n_{t+1}, Y^n_t)                   (c)

In the last lecture, through the forward-backward recursion, we derived the following updates for computing the various posteriors:

αt(xt) = p_{Xt|Y^t}(xt|y^t) = βt(xt) p_{Yt|Xt}(yt|xt) / ∑_{xt} βt(xt) p_{Yt|Xt}(yt|xt)      (Measurement Update)

β_{t+1}(x_{t+1}) = p_{X_{t+1}|Y^t}(x_{t+1}|y^t) = ∑_{xt} αt(xt) p_{X_{t+1}|Xt}(x_{t+1}|xt)   (Time Update)

γt(xt) = p_{Xt|Y^n}(xt|y^n) = ∑_{x_{t+1}} γ_{t+1}(x_{t+1}) αt(xt) p_{X_{t+1}|Xt}(x_{t+1}|xt) / β_{t+1}(x_{t+1})   (Backward Recursion)


3 Factor Graphs

Let us derive a graphical representation of the joint distribution expressed in (1) as follows:

1. Create nodes on the graph representing each random variable: X1, ..., Xn, Y1, ..., Yn.

2. For each pair of nodes, create an edge if there is some factor in (1) which contains both variables.

This form of graph is called a Factor Graph.

Example: Figure 1 shows the factor graph for the hidden Markov process in our setting.

Figure 1: The factor graph for our hidden Markov process

For any partition of the graph into three node sets S1, S2, and S3 such that any path from S1 to S3 passes through some node in S2, these sets form a Markov triplet S1 − S2 − S3 (to see this, simply express the joint pmf in terms of the factors). Hence, for any si ∈ Si, i = 1, 2, 3, we have

p(s1, s3|s2) = φ1(s1, s2) φ2(s2, s3)                           (2)

for some functions φ1, φ2 built from the factors used to form the factor graph. Using the factor graph, it is quite easy to establish principles (a)-(c) graphically. For example, principles (a) and (b) can be proved via Figures 2 and 3 shown below.

Figure 2: The factor graph partition for (a)

Figure 3: The factor graph partition for (b)

2

Page 13: Lecture 2: Statistical Decision Theory · EE378A Statistical Signal Processing Lecture 2 - 04/06/2017 Lecture 2: Statistical Decision Theory Lecturer: Jiantao Jiao Scribe: Andrew

We can also establish an additional principle from the factor graph partition presented in Figure 4:

X^{t−1} − (Xt, Y^n) − X^n_{t+1}                                (d)

Figure 4: The graphical proof of (d)

4 General Forward–Backward Recursion: Module Functions

Let us consider a generic discrete random variable U ∼ pU passed through a memoryless channel characterized by pV|U to get a discrete random variable V. By Bayes' rule we know

p_{U|V}(u|v) = pU(u) p_{V|U}(v|u) / ∑_u pU(u) p_{V|U}(v|u)     (3)

If we collect the probabilities p_{U|V}(u|v) into a vector over the values of U, we can define a vector function F s.t.

{p_{U|V}(u|v)}_u = {pU(u) p_{V|U}(v|u) / ∑_u pU(u) p_{V|U}(v|u)}_u ≜ F(pU, p_{V|U}, v)      (4)

Additionally, overloading the notation of F, we can define the matrix function F as

F(pU, p_{V|U}) = {F(pU, p_{V|U}, v)}_v                         (5)

This function F acts as a module function for calculating the posterior distribution of a random variable given a prior distribution and a channel distribution.

In addition to F , we can write a module function for calculating the marginal distribution given a priordistribution on the input and a channel distribution. We label this vector function G and define it as follows:

pV = {pV (v)}v =

{∑u

pU (u)pV |U (v|u)

}v

, G(pU , pV |U ) (6)

With these module functions in mind, we can express our recursions as follows:

αt = F(βt, p_{Yt|Xt}, yt)                      (Measurement Update)
β_{t+1} = G(αt, p_{X_{t+1}|Xt})                (Time Update)
γt = G(γ_{t+1}, F(αt, p_{X_{t+1}|Xt}))         (Backward Recursion)
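In code, the module functions are one-liners. A sketch (conventions assumed for illustration: distributions are vectors, channels are row-stochastic matrices whose rows are indexed by the conditioning variable):

import numpy as np

def F(p_u, p_v_given_u, v):
    # Posterior module, eq. (4): returns the vector {p_{U|V}(u|v)}_u
    joint = p_u * p_v_given_u[:, v]
    return joint / joint.sum()

def F_matrix(p_u, p_v_given_u):
    # Matrix overload, eq. (5): column v is F(p_u, p_{V|U}, v)
    return np.stack([F(p_u, p_v_given_u, v) for v in range(p_v_given_u.shape[1])], axis=1)

def G(p_u, p_v_given_u):
    # Marginalization module, eq. (6): returns the vector {p_V(v)}_v
    return p_u @ p_v_given_u

A = np.array([[0.9, 0.1], [0.2, 0.8]])            # p_{X_{t+1}|X_t}
B = np.array([[0.8, 0.2], [0.3, 0.7]])            # p_{Y_t|X_t}
beta_t = np.array([0.5, 0.5])
alpha_t = F(beta_t, B, 1)                         # measurement update with y_t = 1
beta_next = G(alpha_t, A)                         # time update
gamma_next = np.array([0.6, 0.4])                 # stand-in for gamma_{t+1} (hypothetical)
gamma_t = G(gamma_next, F_matrix(alpha_t, A).T)   # backward step; .T puts the conditioning variable on rows
print(alpha_t, beta_next, gamma_t)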


5 State Estimation: the Viterbi Algorithm

Now we are ready to derive the joint posterior distribution p(xn|yn):

p(x^n|y^n) = p(xn|y^n) ∏_{t=1}^{n−1} p(xt|x^n_{t+1}, y^n)
           = p(xn|y^n) ∏_{t=1}^{n−1} p(xt|x_{t+1}, y^n)                        (by principle (d))
           = p(xn|y^n) ∏_{t=1}^{n−1} p(xt|x_{t+1}, y^t)                        (by principle (c))
           = p(xn|y^n) ∏_{t=1}^{n−1} p(xt|y^t) p(x_{t+1}|xt, y^t) / p(x_{t+1}|y^t)
           = γn(xn) ∏_{t=1}^{n−1} αt(xt) p(x_{t+1}|xt) / β_{t+1}(x_{t+1}).     (by principle (a))

Writing

ln p(x^n|y^n) = ∑_{t=1}^n gt(xt, x_{t+1})

with

gt(xt, x_{t+1}) ≜ ln[αt(xt) p(x_{t+1}|xt) / β_{t+1}(x_{t+1})] for t = 1, · · · , n−1, and gn(xn, x_{n+1}) ≜ ln γn(xn),

we can solve the MAP estimator xMAP with the help of the following definition:

Definition 1. For 1 ≤ k ≤ n, let

Mk(xk) := max_{x^n_{k+1}} ∑_{t=k}^n gt(xt, x_{t+1}).

It is straightforward from the definition that M1(x^{MAP}_1) = ln p(x^{MAP}|y^n) = max_{x^n} ln p(x^n|y^n) and

Mk(xk) = max_{x_{k+1}} max_{x^n_{k+2}} (gk(xk, x_{k+1}) + ∑_{t=k+1}^n gt(xt, x_{t+1}))
       = max_{x_{k+1}} (gk(xk, x_{k+1}) + max_{x^n_{k+2}} ∑_{t=k+1}^n gt(xt, x_{t+1}))
       = max_{x_{k+1}} (gk(xk, x_{k+1}) + M_{k+1}(x_{k+1})).

Since Mn(xn) = gn(xn, x_{n+1}) = ln γn(xn) depends only on xn, we may start from t = n and use the recursive formula above to obtain M1(x1). This is called the Viterbi Algorithm.

The Bellman equations, referred to in the communication setting as the Viterbi Algorithm, are an application of a technique called dynamic programming. Using the recursive equations derived above, we can express the Viterbi Algorithm as follows.

4

Page 15: Lecture 2: Statistical Decision Theory · EE378A Statistical Signal Processing Lecture 2 - 04/06/2017 Lecture 2: Statistical Decision Theory Lecturer: Jiantao Jiao Scribe: Andrew

function Viterbi
    Mn(xn) ← ln γn(xn)                                             ▷ initialization of the log-likelihood
    xMAP(xn) ← ∅                                                   ▷ initialization of the MAP estimator
    for k = n−1, · · · , 1 do
        Mk(xk) ← max_{x_{k+1}} (gk(xk, x_{k+1}) + M_{k+1}(x_{k+1}))        ▷ maximum of the log-likelihood
        x̂_{k+1}(xk) ← arg max_{x_{k+1}} (gk(xk, x_{k+1}) + M_{k+1}(x_{k+1}))
        xMAP(xk) ← [x̂_{k+1}(xk), xMAP(x̂_{k+1}(xk))]                ▷ maximizing sequence with leading term xk
    end for
    M ← max_{x1} M1(x1)                                            ▷ maximum of the overall log-likelihood
    x̂1 ← arg max_{x1} M1(x1)
    xMAP ← [x̂1, xMAP(x̂1)]                                          ▷ overall maximizing sequence
end function


EE378A Statistical Signal Processing Lecture 5 - 04/18/2017

Lecture 5: Particle Filtering and Introduction to Bayesian Inference
Lecturer: Tsachy Weissman    Scribe: Charles Hale

In this lecture, we introduce new methods for state estimation in Hidden Markov Processes beyond discrete alphabets. This first requires us to learn about estimating pdfs based on samples from a different distribution. At the end, we briefly introduce Bayesian inference.

1 Recap: Inference on Hidden Markov Processes (HMPs)

1.1 Setting

A quick recap on the setting of a HMP:

1. {Xn}n≥1 is a Markov process, called “the state process”.

2. {Yn}n≥1 is "the observation process", where Yi is the output of Xi sent through a "memoryless" channel characterized by the distribution PY|X.

Alternatively, we showed in HW1 that the above is totally equivalent to the following:

1. {Xn}n≥1 is “the state process” defined by

Xt = ft(Xt−1,Wt)

where {Wn}n≥1 is an independent process, i.e., each Wt is independent of all the others.

2. {Yn}n≥1 is “the observation process” related to {Xn}n≥1 by

Yt = gt(Xt, Nt)

where {Nn}n≥1 is an independent process.

3. The processes {Nn}n≥1 and {Wn}n≥1 are independent of each other.

1.2 Goal

Since the state process {Xn}n≥1 is "hidden" from us, we only get to observe {Yn}n≥1 and wish to find a way of estimating {Xn}n≥1 based on {Yn}n≥1. To do this, we define the forward recursions for causal estimation:

αt(xt) = F (βt, PYt|Xt, yt) (1)

βt+1(xt+1) = G(αt, PXt+1|Xt) (2)

where F and G are the operators defined in lecture 4.

1.3 Challenges

For general alphabets X, Y, computing the forward recursion is a very difficult problem. However, we know efficient algorithms that can be applied in the following situations:

1. All random variables are discrete with finite alphabets. We can then solve these equations by brute force in time proportional to the alphabet sizes.


2. For t ≥ 1, ft and gt are linear and Nt, Wt, Xt are Gaussian. The solution is the "Kalman filter" algorithm that we derive in Homework 2.

Beyond these special situations, there are many heuristic algorithms for this computation:

1. We can compute the forward recursions approximately by quantizing the alphabets. This leads to a tradeoff between model accuracy and computation cost.

2. This lecture will focus on Particle Filtering, which is a way of estimating the forward recursions with an adaptive quantization.

2 Importance Sampling

Before we can learn about particle filtering, we first need to discuss importance sampling.¹

2.1 Simple Setting

Let X1, · · · , XN be i.i.d. random variables with density f. We wish to approximate f by a probability mass function fN built from our data.

The idea: we can approximate f by taking

fN = (1/N) ∑_{i=1}^N δ(x − Xi)                                 (3)

where δ(x) is a Dirac delta distribution.

Claim: fN is a good estimate of f.

Here we define the "goodness" of our estimate by looking at

E_{X∼fN}[g(X)]                                                 (4)

and seeing how close it is to

E_{X∼f}[g(X)]                                                  (5)

for any function g with E|g(X)| < ∞. We show that, by this definition, (3) is a good estimate of f.

Proof:

E_{X∼fN}[g(X)] = ∫_{−∞}^{∞} fN(x) g(x) dx                      (6)
              = ∫_{−∞}^{∞} (1/N) ∑_{i=1}^N δ(x − Xi) g(x) dx   (7)
              = (1/N) ∑_{i=1}^N ∫_{−∞}^{∞} δ(x − Xi) g(x) dx   (8)
              = (1/N) ∑_{i=1}^N g(Xi)                          (9)

¹See Section 2.12.4 of the Spring 2013 lecture notes for further reference.


By the law of large numbers,

(1/N) ∑_{i=1}^N g(Xi) → E_{X∼f}[g(X)]  a.s.                    (10)

as N → ∞. Hence,

E_{X∼fN}[g(X)] → E_{X∼f}[g(X)]                                 (11)

so fN is indeed a "good" estimate of f.

2.2 General Setting

Suppose now that X1, · · · , XN are i.i.d. with common density q, but we wish to estimate a different pdf f.

Theorem: The estimator

f̂N(x) = ∑_{i=1}^N wi δ(x − Xi)                                 (12)

with weights

wi = (f(Xi)/q(Xi)) / ∑_{j=1}^N (f(Xj)/q(Xj))                   (13)

is a good estimator of f.

Proof: It is easy to see that

E_{X∼f̂N}[g(X)] = ∑_{i=1}^N wi g(Xi)                                                       (14)
               = ∑_{i=1}^N [(f(Xi)/q(Xi)) / ∑_{j=1}^N (f(Xj)/q(Xj))] g(Xi)                 (15)
               = [∑_{i=1}^N (f(Xi)/q(Xi)) g(Xi)] / [∑_{j=1}^N f(Xj)/q(Xj)]                 (16)
               = [(1/N) ∑_{i=1}^N (f(Xi)/q(Xi)) g(Xi)] / [(1/N) ∑_{j=1}^N f(Xj)/q(Xj)].    (17)

As N →∞, the above expression converges to

E[(f(X)/q(X)) g(X)] / E[f(X)/q(X)] = [∫_{−∞}^{∞} (f(x)/q(x)) g(x) q(x) dx] / [∫_{−∞}^{∞} (f(x)/q(x)) q(x) dx]   (18)
                                   = [∫_{−∞}^{∞} f(x) g(x) dx] / [∫_{−∞}^{∞} f(x) dx]       (19)
                                   = ∫_{−∞}^{∞} f(x) g(x) dx                                (20)
                                   = E_{X∼f}[g(X)],                                         (21)

where the unlabeled expectations are with respect to X ∼ q (by the law of large numbers).


Hence,

E_{X∼f̂N}[g(X)] → E_{X∼f}[g(X)]                                 (22)

as N → ∞.

2.3 Implications

The above theorem is useful in situations when we don't know f and cannot draw samples Xi ∼ f, but we know that f(x) = c·h(x) for some constant c (in many practical cases it is computationally prohibitive to obtain the normalization constant c = (∫ h(x) dx)^{−1}). Then

wi = (f(Xi)/q(Xi)) / ∑_{j=1}^N (f(Xj)/q(Xj))                   (23)
   = (c·h(Xi)/q(Xi)) / ∑_{j=1}^N (c·h(Xj)/q(Xj))               (24)
   = (h(Xi)/q(Xi)) / ∑_{j=1}^N (h(Xj)/q(Xj))                   (25)

Using these wi's, we see that we can approximate f using only h and samples drawn from a different distribution q.

2.4 Application

Let X ∼ fX and let Y be the observation of X through a memoryless channel characterized by fY|X. For MAP reconstruction of X, we must compute

f_{X|Y}(x|y) = fX(x) f_{Y|X}(y|x) / ∫_{−∞}^{∞} fX(x) f_{Y|X}(y|x) dx     (26)

In general, this computation may be hard due to the integral appearing in the denominator. But the denominator is just a constant. Hence, suppose we have Xi i.i.d. ∼ fX for 1 ≤ i ≤ N. By the previous theorem, we can take

f̂_{X|y} = ∑_{i=1}^N wi(y) δ(x − Xi)                            (27)

as an approximation, where

wi(y) = (f_{X|y}(Xi)/fX(Xi)) / ∑_{j=1}^N (f_{X|y}(Xj)/fX(Xj))                              (28)
      = [fX(Xi) f_{Y|X}(y|Xi)/fX(Xi)] / ∑_{j=1}^N [fX(Xj) f_{Y|X}(y|Xj)/fX(Xj)]            (29)
      = f_{Y|X}(y|Xi) / ∑_{j=1}^N f_{Y|X}(y|Xj)                                            (30)

Hence, using the samples X1, X2, · · · , XN we can build a good approximation of f_{X|Y}(·|y) for fixed y using the above procedure.


Let us denote this approximation as

f̂N(X1, X2, ..., XN, fX, f_{Y|X}, y) = ∑_{i=1}^N wi(y) δ(x − Xi)          (31)
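A numerical sketch of this approximation (the Gaussian prior and channel below are assumptions for illustration, not from the notes): we sample from the prior, weight by the likelihood as in (30), and compare the weighted posterior mean against the closed-form Gaussian answer.

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
# Assumed toy model: X ~ N(0, 1), Y = X + Z with Z ~ N(0, 0.5^2)
xs = rng.normal(0.0, 1.0, N)                  # X_i i.i.d. ~ f_X
y = 1.2                                       # fixed observation
lik = np.exp(-0.5 * ((y - xs) / 0.5) ** 2)    # f_{Y|X}(y | X_i), up to a constant
w = lik / lik.sum()                           # weights w_i(y), eq. (30)

post_mean = np.dot(w, xs)                     # estimate of E[X | Y = y]
print(post_mean, 0.8 * y)                     # exact Gaussian posterior mean is 0.8 y here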

3 Particle Filtering

Particle Filtering is a way to approximate the forward recursions of an HMP, using importance sampling to approximate the generated distributions rather than computing the true solutions.

particles ← generate {X1^(i)}_{i=1}^N i.i.d. ∼ f_{X1};
for t ≥ 2 do
    α̂t = f̂N({Xt^(i)}_{i=1}^N, βt, P_{Yt|Xt}, yt);              // estimate αt using importance sampling as in (31)
    nextParticles ← generate {Xt^(i)}_{i=1}^N i.i.d. ∼ α̂t;
    reset particles;
    for 1 ≤ i ≤ N do
        particles ← generate X_{t+1}^(i) ∼ f_{X_{t+1}|Xt}(·|Xt^(i));
    end
    β̂_{t+1} = (1/N) ∑_{i=1}^N δ(x − X_{t+1}^(i));
end
Algorithm 1: Particle Filter

This algorithm requires some art to use. It is not well understood how errors propagate through these successive approximations. In practice, the parameters need to be periodically reset, and choosing these hyperparameters is an application-specific task.
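For concreteness, here is a minimal bootstrap-style sketch of Algorithm 1 on an assumed scalar Gaussian model (all model choices hypothetical): at each step the particles are weighted by the likelihood as in (31), resampled, and propagated through the state equation.

import numpy as np

rng = np.random.default_rng(1)
N = 5000
# Assumed model: X_t = 0.9 X_{t-1} + W_t, W_t ~ N(0, 1); Y_t = X_t + N_t, N_t ~ N(0, 0.5^2)
ys = [0.3, 1.1, 0.8, -0.2]                    # toy observations
particles = rng.normal(0.0, 1.0, N)           # X_1^{(i)} i.i.d. ~ f_{X_1}
estimates = []
for y in ys:
    w = np.exp(-0.5 * ((y - particles) / 0.5) ** 2)
    w /= w.sum()                              # importance weights, as in (31)
    estimates.append(np.dot(w, particles))    # estimate of E[X_t | y^t]
    idx = rng.choice(N, size=N, p=w)          # resample from the weighted approximation of alpha_t
    particles = 0.9 * particles[idx] + rng.normal(0.0, 1.0, N)   # propagate through f_{X_{t+1}|X_t}
print(estimates)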

4 Inference under logarithmic loss

Recall Bayesian decision theory, and let us have the following:

• X ∈ X is something we want to infer;

• X ∼ pX (assume that X is discrete);

• x̂ ∈ X̂ is something we wish to reconstruct;

• Λ : X × X̂ → R is our loss function.

First we assume that we do not have any observation, and define the Bayes envelope and the Bayes response as

U(pX) = min_{x̂} E[Λ(X, x̂)] = min_{x̂} ∑_x pX(x) Λ(x, x̂)        (32)

and

X̂Bayes(pX) = arg min_{x̂} E[Λ(X, x̂)].                          (33)

Example: Let X = X̂ = R and Λ(x, x̂) = (x − x̂)². Then

U(pX) = min_{x̂} E_{X∼pX}(X − x̂)² = Var(X)                      (34)
X̂Bayes(pX) = E[X].                                             (35)


Consider now the case where we have some observation (side information) Y, so that our reconstruction X̂ can be a function of Y. Then

E[Λ(X, X̂(Y))] = E[E[Λ(X, X̂(y))|Y = y]]                         (36)
              = ∑_y E[Λ(X, X̂(y))|Y = y] pY(y)                  (37)
              = ∑_y pY(y) ∑_x p_{X|Y=y}(x) Λ(x, X̂(y))          (38)

This implies

min_{X̂(·)} E[Λ(X, X̂(Y))]                                      (39)
  = min_{X̂(·)} ∑_y pY(y) ∑_x p_{X|Y=y}(x) Λ(x, X̂(y))          (40)
  = ∑_y pY(y) U(p_{X|Y=y})                                     (41)

and thus

X̂opt(y) = X̂Bayes(p_{X|Y=y})                                    (42)


EE378A Statistical Signal Processing Lecture 6 - 04/20/2017

Lecture 6: Bayesian Decision Theory and Justification of Log Loss
Lecturer: Tsachy Weissman    Scribe: Okan Atalar

In this lecture¹, we introduce the topic of Bayesian decision theory and define the concept of a loss function for assessing the accuracy of an estimate. The log loss function is presented as an important special case.

1 Notation

A quick summary of the notation:

1. Random Variables (objects): used more “loosely”, i.e., X,Y

2. PMF (PDF): used more “loosely”, i.e. PX , PY

3. Alphabets: X, X̂

4. Estimate of X: x̂

5. Specific Values: x, y

6. Loss Function: Λ

7. Markov Triplet: X − Y − Z notation corresponds to a Markov triplet

8. Logarithm: log notation denotes log2, unless otherwise stated

For a discrete random variable (object) U with pmf PU, we write PU(u) ≜ P(U = u). Often, we'll just write p(u). Similarly: p(x, y) for PX,Y(x, y) and p(y|x) for PY|X(y|x), etc.

2 Bayesian Decision Theory

Let X ∈ X, X ∼ PX, x̂ ∈ X̂, and Λ : X × X̂ → R.

Definition 1 (Bayes Envelope).

U(PX) ≜ min_{x̂} E[Λ(X, x̂)] = min_{x̂} ∑_x PX(x) Λ(x, x̂)        (1)

Definition 2 (Bayes Response).

X̂Bayes(PX) ≜ arg min_{x̂} E[Λ(X, x̂)]                           (2)

Essentially, given a prior distribution on X, we are trying to achieve the best estimate by minimizing the expected value of some "loss function" that we have defined. If, in addition to the prior distribution of X, we are also given an observation Y, the associated Bayes envelope becomes:

min_{X̂(·)} E[Λ(X, X̂(Y))] = ∑_y U(P_{X|Y=y}) PY(y) = E[U(P_{X|Y})]       (3)

In this case, the Bayes Response takes the form:

X̂(y) = X̂Bayes(P_{X|Y=y})                                      (4)

¹Reading: Lecture 3 of the EE378A Spring 2016 notes.


3 Bayesian Inference Under Logarithmic Loss

In this section, we will be using the logarithm function as the loss function.

Definition 3 (Logarithmic Loss Function).

Λ(x, Q) ≜ log (1/Q(x))                                         (5)

where x is the realization of X and Q is an estimate of the pmf of X.

Using Definition 3, the expected value of the log loss can be expressed as:

E[Λ(X, Q)] = ∑_x PX(x) Λ(x, Q) = ∑_x PX(x) log (1/Q(x)) = ∑_x PX(x) log [(1/Q(x)) · (PX(x)/PX(x))]   (6)
           = ∑_x PX(x) log (1/PX(x)) + ∑_x PX(x) log (PX(x)/Q(x))                                    (7)

Definition 4 (Entropy).

H(X) ≜ ∑_x PX(x) log (1/PX(x))                                 (8)

Definition 5 (Relative Entropy, or the Kullback–Leibler Divergence).

D(PX‖Q) ≜ ∑_x PX(x) log (PX(x)/Q(x))                           (9)

Using Equations (6)-(7) and Definitions 4-5:

E[Λ(X, Q)] = H(X) + D(PX‖Q)                                    (10)

Claim: D(PX‖Q) ≥ 0, and D(PX‖Q) = 0 ⟺ Q = PX.

Before attempting the proof, we provide Jensen’s inequality:

Jensen's Inequality: Let φ denote a concave function and X be any random variable. Then

E[φ(X)] ≤ φ(E[X])                                              (11)

Further, if φ is strictly concave, then E[φ(X)] = φ(E[X]) ⟺ X is a constant. In particular, φ(x) = log(x) is a concave function, so for a random variable X > 0:

E[log X] ≤ log E[X].                                           (12)

Proof of the Claim:

D(PX‖Q) = E[log (PX(X)/Q(X))] = −E[log (Q(X)/PX(X))] ≥ −log E[Q(X)/PX(X)] = −log ∑_x PX(x) (Q(x)/PX(x)) = 0     (13)

The last equality follows from ∑_x Q(x) = 1, since Q is a pmf and thus sums to one.

Hence, it follows from the claim that

min_Q E[Λ(X, Q)] = H(X) = U(PX)                                (14)


where the minimizer is Q = PX. Therefore:

X̂Bayes(PX) = PX.                                               (15)
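Equations (10) and (14) are easy to verify numerically on a small alphabet (a sketch with randomly drawn pmfs, all values hypothetical):

import numpy as np

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4))                 # true pmf P_X on a 4-letter alphabet
Q = rng.dirichlet(np.ones(4))                 # an arbitrary estimate Q

H = -np.dot(P, np.log2(P))                    # entropy, eq. (8)
D = np.dot(P, np.log2(P / Q))                 # relative entropy, eq. (9)
expected_loss = -np.dot(P, np.log2(Q))        # E[Lambda(X, Q)] under log loss

print(np.isclose(expected_loss, H + D))       # eq. (10)
print(expected_loss >= H)                     # the minimum H is attained only at Q = P, eq. (14)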

Introducing the notation H(X) ⟺ H(PX) and making use of Equation (3):

min_{X̂(·)} E[Λ(X, X̂(Y))] = E[U(P_{X|Y})] = ∑_y H(P_{X|Y=y}) PY(y)       (16)

Definition 6 (Conditional Entropy).

H(X|Y) ≜ ∑_y H(P_{X|Y=y}) PY(y)                                (17)

Definition 7 (Mutual Information).

I(X;Y) ≜ H(X) − H(X|Y)                                         (18)

Definition 8 (Conditional Relative Entropy).

D(P_{X|Y}‖Q_{X|Y}|PY) ≜ ∑_y D(P_{X|Y=y}‖Q_{X|Y=y}) PY(y)       (19)

Definition 9 (Conditional Mutual Information).

I(X;Y|Z) ≜ H(X|Z) − H(X|Y,Z)                                   (20)

Exercises:
(a) H(X1, X2) = H(X1) + H(X2|X1)
(b) I(X;Y) = H(Y) − H(Y|X)
(c) I(X1, X2;Y) = I(X1;Y) + I(X2;Y|X1)
(d) D(P_{X1,X2}‖Q_{X1,X2}) = D(P_{X1}‖Q_{X1}) + D(P_{X2|X1}‖Q_{X2|X1}|P_{X1})

4 Justification of Log Loss as the Right Loss Function

For general loss function Λ:

Definition 10 (Coherent Dependence Function).

C(Λ, P_{X,Y}) ≜ inf_{x̂} E[Λ(X, x̂)] − inf_{X̂(·)} E[Λ(X, X̂(Y))]           (21)

The first term is the minimum expected loss (for the chosen loss function) in estimating X based only on the prior distribution of X. The second term is the minimum expected loss in estimating X when the observation Y is also taken into account. Therefore, the coherent dependence function measures how much the estimate improves given the observation Y, compared to estimating X based only on the prior distribution of X.

Two examples are provided below to make this concept more clear:
(1) Λ(x, x̂) = (x − x̂)² ⇒ C(Λ, P_{X,Y}) = Var(X) − E[Var(X|Y)]
(2) Λ(x, Q) = log (1/Q(x)) ⇒ C(Λ, P_{X,Y}) = I(X;Y)

Proving the following expressions is left as an exercise:
(a) C(Λ, P_{X,Y}) ≥ 0, and C(Λ, P_{X,Y}) = 0 ⟺ X and Y are independent


(b) C(Λ, P_{X,Y}) ≥ C(Λ, P_{X,Z}) when X − Y − Z

Natural Requirement: If T = T(X) is a function of X, then C(Λ, P_{T,Y}) ≤ C(Λ, P_{X,Y}).

This requirement basically states that a function of X cannot carry more information about Y than X itself does.

Definition 11 (Sufficient Statistic). T = T (X) is a sufficient statistic of X for Y if X − T − Y .

Modest Requirement: If T is a sufficient statistic of X for Y, then C(Λ, P_{T,Y}) ≤ C(Λ, P_{X,Y}).

Theorem: Assume |X| ≥ 3. The only C(Λ, P_{X,Y}) satisfying the modest requirement is C(Λ, P_{X,Y}) = I(X;Y) (up to a multiplicative factor).

Since it was shown that C(Λ, P_{X,Y}) = I(X;Y) if (and only if) the logarithmic loss function Λ(x, Q) = log (1/Q(x)) is used, it follows that the logarithmic loss is essentially the only loss function which satisfies the modest requirement.


EE378A Statistical Signal Processing Lecture 7 - 04/25/2017

Lecture 7: Prediction
Lecturer: Tsachy Weissman    Scribe: Cem Kalkanli

In this lecture we study prediction in the Bayesian setting under general and logarithmic loss. We also briefly begin the subject of prediction of individual sequences.

1 Notation

A quick summary of the notation:

1. Random Variables (objects): used more “loosely”, i.e. X,Y, U, V ;

2. Alphabets: X, X̂;

3. Sequence of Random Variables: X1, X2, · · · , Xt;

4. Loss function: Λ(x, x̂);

5. Estimate of X: x̂;

6. Set of distributions on alphabet X: M(X).

For a discrete random variable (object) U with pmf PU, we write PU(u) ≜ P(U = u). Often, we'll just write p(u). Similarly: p(x, y) for PX,Y(x, y) and p(y|x) for PY|X(y|x), etc.

2 Prediction

Consider X1, X2, · · · , Xn ∈ X .

Definition 1 (Predictor). A predictor F is a sequence of functions F = (Ft)t≥1 with Ft : X^{t−1} → X̂.

Definition 2 (Prediction Loss). Given a loss function Λ : X × X̂ → R+, the prediction loss incurred by the predictor F is defined as

L_F(X^n) = (1/n) ∑_{t=1}^n Λ(Xt, Ft(X^{t−1})).                 (1)

2.1 Prediction under Bayesian Setting

In the Bayesian setting, we assume X1, X2, · · · are governed jointly by some underlying law P. Then we have the following prediction loss.

Definition 3 (Prediction Loss Under the Bayesian Setting).

min_F EP[L_F(X^n)] = (1/n) ∑_{t=1}^n min_{Ft} EP[Λ(Xt, Ft(X^{t−1}))] = (1/n) ∑_{t=1}^n EP[U(P_{Xt|X^{t−1}})]     (2)

where U(·) is the Bayes envelope defined in the previous lecture.


As we can see from (2), our initial objective was to minimize the expected prediction loss over the whole sequence of functions. However, since we can interchange summation and expectation, we can treat the overall predictor F as the composition of the sequentially optimal predictors Ft. Finally, we note that the overall loss is just the sum of expectations of Bayes envelopes.

Theorem 4 (Mismatch Loss). The additional prediction loss incurred by predicting based on a law Q instead of the true law P is upper bounded in terms of the relative entropy:

EP[L_{F_Q}(X^n) − L_{F_P}(X^n)] ≤ Λmax · √((2/n) D(P_{X^n}‖Q_{X^n})),    (3)

where Λmax = sup_{x,x̂} Λ(x, x̂).

Now we are going to introduce Lemma 5 which will be used to prove Theorem 4.

Lemma 5.

EP[Λ(X, X̂Bayes(Q)) − Λ(X, X̂Bayes(P))] ≤ Λmax ‖P − Q‖1 ≤ Λmax √(2 D(P‖Q))          (4)

Proof

EP[Λ(X, X̂Bayes(Q)) − Λ(X, X̂Bayes(P))]
  ≤ EP[Λ(X, X̂Bayes(Q)) − Λ(X, X̂Bayes(P))] + EQ[−Λ(X, X̂Bayes(Q)) + Λ(X, X̂Bayes(P))]      (5)
  = ∑_x (P(x) − Q(x)) (Λ(x, X̂Bayes(Q)) − Λ(x, X̂Bayes(P)))                                 (6)
  ≤ ∑_x |P(x) − Q(x)| Λmax                                                                 (7)
  = Λmax ‖P − Q‖1                                                                          (8)
  ≤ Λmax √(2 D(P‖Q))                                                                       (9)

where (5) holds since the added EQ[·] term is non-negative (X̂Bayes(Q) minimizes the expected loss under Q), and in the last step we used Pinsker's inequality D(P‖Q) ≥ (1/2)‖P − Q‖1².

Now we prove Theorem 4 as follows:

EP[L_{F_Q}(X^n) − L_{F_P}(X^n)]                                                           (10)
  = EP[(1/n) ∑_{t=1}^n (Λ(Xt, F_{t,Q}(X^{t−1})) − Λ(Xt, F_{t,P}(X^{t−1})))]               (11)
  = (1/n) ∑_{t=1}^n EP[EP[Λ(Xt, F_{t,Q}(X^{t−1})) − Λ(Xt, F_{t,P}(X^{t−1})) | X^{t−1}]]   (iterated expectation)  (12)
  ≤ (1/n) ∑_{t=1}^n EP[Λmax √(2 D(P_{Xt|X^{t−1}}‖Q_{Xt|X^{t−1}}))]                        (using Lemma 5)  (13)
  ≤ Λmax √((2/n) ∑_{t=1}^n D(P_{Xt|X^{t−1}}‖Q_{Xt|X^{t−1}} | P_{X^{t−1}}))                (Jensen's inequality)  (14)
  = Λmax √((2/n) D(P_{X^n}‖Q_{X^n}))                                                      (chain rule for relative entropy)  (15)

as desired.


3 Prediction Under Logarithmic Loss

In this section, we specialize our setup to the logarithmic loss, with the predictions being distributions. Here X̂ = M(X) and Λ(x, P) = −log P(x) is the logarithmic loss. Also note that Ft(X^{t−1}) ∈ M(X).

We can thus view our predictor in three different ways:

1. Obviously we can carry over the previous setup and see the predictor as a sequence of functions {Ft}t≥1.

2. Another approach is to see the predictor as a sequence of conditional distributions {Q_{Xt|X^{t−1}}}t≥1, where Ft(X^{t−1}) = Q_{Xt|X^{t−1}}.

3. Finally, we can see our predictor as a model for the data. Then Q is the law we assign to the sequence we observe.

We can immediately adapt the previous results to the log loss case. First, L_F(X^n):

L_F(X^n) = (1/n) ∑_{t=1}^n log (1/Ft(X^{t−1})[Xt])             (16)
         = (1/n) ∑_{t=1}^n log (1/Q_{Xt|X^{t−1}}(Xt))          (17)
         = (1/n) log (1/Q(X^n))    (where now Q is our model of the data)     (18)

Now let us look at the logarithmic loss under the Bayesian setting, assuming the sequence X1, X2, · · · is governed by P:

EP[L_F(X^n)] = EP[(1/n) log (1/Q(X^n))]                        (19)
             = EP[(1/n) log ((P(X^n)/Q(X^n)) · (1/P(X^n)))]    (20)
             = (1/n) [H(X^n) + D(P_{X^n}‖Q_{X^n})]             (21)

Then we see that min_F EP[L_F(X^n)] = (1/n) H(X^n), achieved when Ft(X^{t−1}) = P_{Xt|X^{t−1}}.

4 Prediction of Individual Sequences

As the last part, we look at a different setting. Here we are given a class of predictors F, and we seek a predictor F such that:

lim_{n→∞} max_{X^n} [L_F(X^n) − min_{G∈F} L_G(X^n)] = 0        (22)

4.1 Prediction of Individual Sequences under Log Loss

Now the question arises: given a class of laws F, can we find P such that (23) is small for any X^n:

(1/n) log (1/P(X^n)) − min_{Q∈F} (1/n) log (1/Q(X^n)) = (1/n) log (1/P(X^n)) − (1/n) log (1/max_{Q∈F} Q(X^n)).     (23)

This problem will be solved in the upcoming lecture.


EE378A Statistical Signal Processing Lecture 8 - 04/27/2017

Lecture 8: Prediction of Individual Sequences
Lecturer: Tsachy Weissman    Scribe: Chuan-Zheng Lee

In the last lecture we studied the prediction problem in the Bayesian setting; in this lecture we study the prediction problem under logarithmic loss in the individual-sequence setting.

1 Problem setup

We first recall the problem setup: given a class of probability laws F, we wish to find some law P with a small worst-case regret

WCRn(P, F) ≜ max_{x^n} [log (1/P(x^n)) − min_{Q∈F} log (1/Q(x^n))].      (1)

Then a first natural question is as follows: could we find some P such that

WCRn(P,F) ≤ nεn (2)

with εn → 0 as n→∞?

2 First answer when |F| <∞: uniform mixture

The answer to the previous question turns out to be affirmative when |F| < ∞. Our construction of P is as follows:

PUnif(x^n) = (1/|F|) ∑_{Q∈F} Q(x^n).                           (3)

The performance of PUnif is summarized in the following lemma:

Lemma 1. The worst-case regret of PUnif satisfies:

WCRn(PUnif,F) ≤ log |F|. (4)

Proof  Just apply Q(x^n) ≥ 0 for every Q ∈ F and x^n to conclude that

log (1/PUnif(x^n)) − min_{Q∈F} log (1/Q(x^n)) = log (max_{Q∈F} Q(x^n) / PUnif(x^n))      (5)
  = log |F| + log (max_{Q∈F} Q(x^n) / ∑_{Q∈F} Q(x^n))          (6)
  ≤ log |F|.                                                   (7)

By Lemma 1, we can choose εn = (log |F|)/n → 0 as n → ∞, which provides an affirmative answer to our question.


3 Second answer: normalized maximum likelihood

From the previous section we know that εn → 0 is achievable when |F| < ∞, but we are still looking for the law P which gives the minimum εn, or equivalently, the minimum WCRn(P, F). Complicated as the definition of WCRn(P, F) may look, its minimum surprisingly admits a closed-form expression:

Theorem 2 (Normalized Maximum Likelihood). The minimum worst-case regret is given by

min_P WCRn(P, F) = log ∑_{x^n} max_{Q∈F} Q(x^n).               (8)

This bound is attained by the normalized maximum likelihood predictor

PNML(x^n) = max_{Q∈F} Q(x^n) / ∑_{x^n} max_{Q∈F} Q(x^n).       (9)

Proof  Since max_{Q∈F} Q(x^n) / PNML(x^n) is a constant (independent of x^n), PNML attains the claimed worst-case regret with equality. To prove optimality, suppose P′ is any other predictor. Since both PNML and P′ are probability distributions, there must exist some x^n such that

P′(x^n) ≤ PNML(x^n).                                           (10)

Considering this sequence x^n gives

WCRn(P′, F) ≥ log (max_{Q∈F} Q(x^n) / P′(x^n))                 (11)
            ≥ log (max_{Q∈F} Q(x^n) / PNML(x^n))               (12)
            = log ∑_{x^n} max_{Q∈F} Q(x^n)                     (13)

as desired.
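For the i.i.d. Bernoulli class on {0, 1}^n, the sum in (8) collapses by counting sequences of each type, so the optimal worst-case regret can be computed exactly. A sketch:

from math import comb, log

def nml_regret(n):
    # log sum_{x^n} max_theta Q_theta(x^n) = log sum_k C(n, k) (k/n)^k (1 - k/n)^{n-k}
    total = sum(comb(n, k) * (k / n)**k * (1 - k / n)**(n - k) for k in range(n + 1))
    return log(total, 2)                      # note: 0**0 evaluates to 1 in Python

for n in [10, 100, 1000]:
    print(n, nml_regret(n), 0.5 * log(n, 2))  # regret behaves like (1/2) log n + O(1)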

4 Third answer: Dirichlet mixture

Although the normalized maximum likelihood predictor achieves the optimal worst-case regret, it has a severe drawback: it is usually very hard to compute in practice. In contrast, the mixture idea used in the first answer can usually be computed efficiently (as will be shown below). This observation motivates us to come up with a mixture-based predictor P which enjoys a worst-case regret close to that of PNML.

The general mixture idea is as follows: suppose the model class F = (Qθ)θ∈Θ can be parametrized by θ. Consider some prior w(·) on Θ and choose

Pw(x^n) = ∫_Θ Qθ(x^n) w(dθ)                                    (14)


as our predictor. Note that in this case,

Pw(xt|x^{t−1}) = Pw(x^t) / Pw(x^{t−1})                         (15)
               = ∫_Θ Qθ(x^t) w(dθ) / ∫_Θ Qθ(x^{t−1}) w(dθ)     (16)
               = ∫_Θ Qθ(xt|x^{t−1}) Qθ(x^{t−1}) w(dθ) / ∫_Θ Qθ(x^{t−1}) w(dθ)         (17)
               = ∫_Θ Qθ(xt|x^{t−1}) [Qθ(x^{t−1}) / ∫_Θ Qθ′(x^{t−1}) w(dθ′)] w(dθ)     (18)
               ≜ ∫_Θ Qθ(xt|x^{t−1}) w(dθ|x^{t−1})              (19)

where

w(dθ|x^{t−1}) = Qθ(x^{t−1}) w(dθ) / ∫_Θ Qθ′(x^{t−1}) w(dθ′).   (20)

Then how should we choose the prior w(·)? On one hand, the prior w(·) should assign roughly equal weights to the "representative" sources. On the other hand, it should be chosen such that the sequential probability assignments are computationally tractable. We take a look at some concrete examples.

4.1 Bernoulli case

In the first example, we assume that Qθ = Bern(θ)^{⊗n} and F = {Qθ}θ∈[0,1]. Note that in this case the model class consists of all i.i.d. (Bernoulli) distributions. We assign the Dirichlet prior

w(dθ) = (1/π) · dθ/√(θ(1−θ)) ∼ Beta(1/2, 1/2).                 (21)

The properties of the resulting predictor are left as exercises:

Exercise 3. Denote by n0 = n0(x^n) and n1 = n1(x^n) the number of 0's and 1's in the sequence x^n, respectively. Prove the following:

(a) The conditional probability is given by

Pw(xt|x^{t−1}) = (n_{xt}(x^{t−1}) + 1/2) / t.                  (22)

This is called the Krichevsky–Trofimov sequential probability assignment.

(b) Show that WCRn(Pw, F) = (1/2) log n + O(1).

(c) Considering the normalized maximum likelihood predictor, also show that min_P WCRn(P, F) = (1/2) log n + O(1). Hence, our Dirichlet mixture Pw attains the optimal leading term of the worst-case regret.
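The KT assignment (22) is trivially implemented by keeping two counters; the sketch below also checks the (1/2) log n + 1 regret bound (see Lecture 9) on a toy sequence:

import numpy as np

def kt_sequential(xs):
    # Krichevsky-Trofimov assignment, eq. (22): P(x_t | x^{t-1}) = (n_{x_t}(x^{t-1}) + 1/2) / t
    counts = [0, 0]
    probs = []
    for t, x in enumerate(xs, start=1):
        probs.append((counts[x] + 0.5) / t)
        counts[x] += 1
    return np.array(probs)

xs = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]                       # toy sequence (hypothetical)
log_loss = -np.log2(kt_sequential(xs)).sum()              # log 1/P_KT(x^n)
n = len(xs); n1 = sum(xs); n0 = n - n1
best = -(n1 * np.log2(n1 / n) + n0 * np.log2(n0 / n))     # min_theta log 1/Q_theta(x^n)
print(log_loss - best, 0.5 * np.log2(n) + 1)              # regret vs. the (1/2) log n + 1 bound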

In fact, there is a general theorem on the performance of the mixture-based predictor:

Theorem 4 (Rissanen, 1984). Let F = {Qθ}θ∈Θ with Θ ⊂ R^d. Under benign regularity and smoothness conditions, we have

min_P WCRn(P, F) = (d/2) log n + o(log n).                     (23)


Moreover, there exists some Dirichlet prior w(·) such that

WCRn(Pw, F) = (d/2) log n + o(log n).                          (24)

4.2 Markov sources

Given that we can compete with i.i.d. Bernoulli sequences, how can we compete, under logarithmic loss, with Markov models? Assume X = {0, 1} and denote by Mk the set of all k-th order Markov models. Note that for Q ∈ Mk,

Q(x^n|x^0_{−k+1}) = ∏_{i=1}^n Q(xi|x^{i−1}_{i−k})              (25)
                  = ∏_{all possible k-tuples u^k} ∏_{i : x^{i−1}_{i−k} = u^k} Q(xi|u^k).   (26)

The product ∏_{i : x^{i−1}_{i−k} = u^k} Q(xi|u^k) is the probability of the subsequence {xi : x^{i−1}_{i−k} = u^k} under a Bernoulli(Q(1|u^k)) law. It therefore makes sense to employ the KT estimator on each of these subsequences. Consider the following P such that

P(xt|x^{t−1}_{−k+1}) = (n_{x^t_{t−k}}(x^{t−1}_{−k+1}) + 1/2) / (n_{x^{t−1}_{t−k}}(x^{t−1}_{−k+1}) + 1),     (27)

where n_{u^k}(x^n) = ∑_{i=1}^n 1(x^i_{i−k+1} = u^k).

Then, for this P, we have

WCRn(P, Mk) ≤ ∑_{u^k} [(1/2) ln n_{u^k}(x^{n−1}) + o(ln n)]    (28)
            ≤ (2^k/2) ln (n/2^k) + o(ln n),                    (29)

which matches the Rissanen lower bound in the leading term.
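Implementing (27) just means keeping one KT counter per context (a sketch; the toy sequence and the all-zeros initial context are hypothetical choices):

import numpy as np
from collections import defaultdict

def markov_kt_log_loss(xs, k):
    # Per-context KT assignment, eq. (27): a separate KT counter for each k-tuple context
    counts = defaultdict(lambda: [0, 0])        # context u^k -> [n0, n1]
    seq = tuple([0] * k) + tuple(xs)            # fix the initial context to 0...0
    loss = 0.0
    for i in range(k, len(seq)):
        ctx, x = seq[i - k:i], seq[i]
        n0, n1 = counts[ctx]
        p = ((n1 if x == 1 else n0) + 0.5) / (n0 + n1 + 1)
        loss -= np.log2(p)
        counts[ctx][x] += 1
    return loss

xs = [0, 1, 1, 0] * 25                          # an order-2 Markov (periodic) toy sequence, n = 100
print(markov_kt_log_loss(xs, k=2))              # small: each context quickly becomes deterministic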


EE378A Statistical Signal Processing Lecture 9 - 05/02/2017

Lecture 9: Context Tree Weighting
Lecturer: Tsachy Weissman Scribe: Chuan-Zheng Lee

In the last lecture we talked about the mixture idea for i.i.d. sources and Markov sources. Specifically, we introduced the Krichevsky–Trofimov probability assignment for a binary vector x^n ∈ {0, 1}^n given by

P_KT(x^n) = ∫ ω(θ) θ^{n_1(x^n)} (1−θ)^{n_0(x^n)} dθ = Γ(n_0 + 1/2) Γ(n_1 + 1/2) / [(Γ(1/2))^2 Γ(n + 1)],    (1)

where ω(θ) ∝ 1/√(θ(1−θ)), n_1 ≡ n_1(x^n) is the number of 1's in x^n, n_0 ≡ n_0(x^n) is the number of 0's (so that n = n_0 + n_1), and Γ(·) is the gamma function. Note that ω(θ) is a Dirichlet prior with parameters (1/2, 1/2), and the integrand is just a Dirichlet distribution with parameters (n_0 + 1/2, n_1 + 1/2). The main result is that, for regular models with d degrees of freedom, the Dirichlet mixing idea achieves the optimal worst-case regret (d/2) log n + o(log n) for individual sequences of length n.
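The closed form in (1) is convenient numerically when evaluated in the log domain; below is a small sketch (our own helper, using the convention Γ(1/2) = √π).

import math

def log_kt(n0, n1):
    """log P_KT(x^n) for a binary sequence with n0 zeros and n1 ones,
    computed from the gamma-function closed form in (1)."""
    return (math.lgamma(n0 + 0.5) + math.lgamma(n1 + 0.5)
            - 2.0 * math.lgamma(0.5) - math.lgamma(n0 + n1 + 1))

As a check, exp(log_kt(1, 1)) = 1/8, matching the sequential product (1/2) · (1/4) for the sequence 01.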

In this lecture, we will extend this idea to general tree sources.

1 Tree Source

We first begin with an informal definition of a tree source.

Definition 1 (Context of a symbol). A context of a symbol x_t is a suffix of the sequence that precedes x_t, that is, (· · · , x_{t−2}, x_{t−1}).

Definition 2 (Tree source). A tree source is a source in which a set of suffixes is sufficient to determine the probability distribution of the next item in the sequence (regardless of the earlier past).

A tree source is hence a generalization of an order-one Markov source, where only the previous element in the sequence determines the next: now, instead of just one element being sufficient, the suffix is sufficient. It is perhaps best elucidated by example.

Example 3. Let S = {00, 10, 1} and Θ = {θ_s, s ∈ S}. Then

p(X_t = 1|X_{t−2} = 0, X_{t−1} = 0) = θ_00
p(X_t = 1|X_{t−2} = 1, X_{t−1} = 0) = θ_10
p(X_t = 1|X_{t−1} = 1) = θ_1

We can represent this structure in a binary tree:

[Figure: binary context tree over time slots t − 2, t − 1, t, with leaves labeled θ_1, θ_10, θ_00.]

Remark Note that the source in Example 3 is not first-order Markov, but can be cast as a special case of a second-order Markov source.

We denote by P_{S,θ}(x^n) the associated probability mass function on x^n ∈ X^n.


2 Known tree, unknown parameters

Recall that with Markov sources, we could think of the pmf of any sequence as the product of conditional pmfs. For tree sources, we can factorize similarly, but the factorization pertains to suffixes.

Let P^KT_S be the pmf induced by employing the Krichevsky–Trofimov sequential probability assignment (separately) on each subsequence corresponding to every s ∈ S. We will denote by n_s the number of x_i, 1 ≤ i ≤ n, with context s, and we will show in Homework 4 that

log 1/P_KT(x^n) − min_θ log 1/Q^n_{Bernoulli(θ)}(x^n) ≤ (1/2) log n + 1.

Now enumerating over different contexts, the regret on sequence xn is upper-bounded by

log 1/P^KT_S(x^n) − min_θ log 1/P_{S,θ}(x^n) ≤ ∑_{s∈S} [(1/2) log n_s + 1]
= |S| ∑_{s∈S} (1/|S|) [(1/2) log n_s + 1]
≤(a) |S| [(1/2) log (∑_{s∈S} n_s/|S|) + 1]
= (|S|/2) log(n/|S|) + |S|

where (a) follows from Jensen's inequality. Recall from Lecture 8 that Rissanen showed that the worst-case regret could at best be (d/2) log n + o(log n). This shows that the above strategy (using the Krichevsky–Trofimov assignment) is optimal up to lower order terms.

3 Unknown tree

We say a context tree is of depth D if it is a full binary tree of depth D, where node s contains two integers (n_{s0}, n_{s1}), i.e., the number of 0's and 1's in the sequence x^n which occur right after the context s.

Example 4. Consider the sequence

t:    −2  −1   0   1   2   3   4   5   6   7
x_t:   1   1   0   0   1   0   0   1   1   0   · · ·

We can draw the tree, with each node s labeled with x(s):


[Figure: depth-3 context tree over time slots t − 3, t − 2, t − 1, t, with each node s labeled by x(s): root {x_1, . . . , x_7}; x(1) = {x_3, x_6, x_7}, x(0) = {x_1, x_2, x_4, x_5}; x(11) = {x_7}, x(01) = {x_3, x_6}, x(10) = {x_1, x_4}, x(00) = {x_2, x_5}; x(111) = ∅, x(011) = {x_7}, x(101) = ∅, x(001) = {x_3, x_6}, x(110) = {x_1}, x(010) = {x_4}, x(100) = {x_2}, x(000) = ∅.]

For example, x_3 is preceded by '001', so it is included in x('001') (at t − 3), x('01') (at t − 2) and x('1') (at t − 1). Moreover, for s = 001, we have n_{s0} = n_{s1} = 1 (corresponding to x_3 and x_6, respectively).

Then the context-tree weighting at any leaf node s is P_CTW(x^n, s) = P_KT(n_{s0}, n_{s1}) if depth(s) = D. For any node s other than a leaf node, we have

P_CTW(x^n, s) = (1/2) P_KT(n_{s0}, n_{s1}) + (1/2) P_CTW(x^n, 0s) P_CTW(x^n, 1s)

where 0s means the sequence formed by prepending '0' to s, and 1s means the sequence formed by prepending '1' to s.

Finally, the CTW probability assignment on x^n can be found recursively at the root via P_CTW(x^n) = P_CTW(x^n, ∅).
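The recursion can be implemented directly. The sketch below is a minimal, non-optimized version: `counts` is assumed to map each context string s to its pair (n_{s0}, n_{s1}), with missing contexts treated as (0, 0).

import math

def kt(n0, n1):
    """Closed-form KT probability, Lecture 9, eq. (1)."""
    return math.exp(math.lgamma(n0 + 0.5) + math.lgamma(n1 + 0.5)
                    - 2.0 * math.lgamma(0.5) - math.lgamma(n0 + n1 + 1))

def ctw_prob(counts, s, D):
    """P_CTW(x^n, s): the KT estimate alone at a leaf; otherwise a 1/2-1/2
    mixture of the KT estimate at s and the product over children 0s, 1s."""
    n0, n1 = counts.get(s, (0, 0))
    if len(s) == D:  # leaf node
        return kt(n0, n1)
    return 0.5 * kt(n0, n1) + 0.5 * ctw_prob(counts, '0' + s, D) * ctw_prob(counts, '1' + s, D)

# the overall assignment is the value at the root: ctw_prob(counts, '', D)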


EE378A Statistical Signal Processing Lecture 10 - 05/04/2017

Lecture 10: Regret Analysis of Context Tree Weighting
Lecturer: Tsachy Weissman Scribe: Chuan-Zheng Lee

In the last lecture we introduced how to use the Krichevsky–Trofimov probability assignment to construct the context tree, and the corresponding weighting scheme for tree sources. In this lecture we will analyze the (worst-case, or the average-case) regret of the CTW algorithm under logarithmic loss.

1 Context Tree Weighting: Review

Notation: since the Krichevsky–Trofimov probability assignment only depends on the number of zeros and ones in the sequence, we define ~n = (n_0(x^n), n_1(x^n)) and write P_KT(x^n) as P_KT(~n). The CTW algorithm applies the Krichevsky–Trofimov probability assignment to every context s ∈ S separately, and with each context s ∈ S we associate a vector ~n_s = (n_{s0}(x^n), n_{s1}(x^n)), i.e., the number of occurrences of the contexts s0, s1 in the sequence x^n, respectively.

Now revisit the last example with nodes labeled with their corresponding ~n:

t:    −2  −1   0   1   2   3   4   5   6   7
x_t:   1   1   0   0   1   0   0   1   1   0   · · ·

[Figure: the same depth-3 context tree, now with each node labeled by its count vector: ~n = (4, 3); ~n_1 = (2, 1), ~n_0 = (2, 2); ~n_11 = (1, 0), ~n_01 = (1, 1), ~n_10 = (2, 0), ~n_00 = (0, 2); ~n_111 = (0, 0), ~n_011 = (1, 0), ~n_101 = (0, 0), ~n_001 = (1, 1), ~n_110 = (1, 0), ~n_010 = (1, 0), ~n_100 = (0, 2), ~n_000 = (0, 0).]

The probability assignment of the CTW algorithm on xn and the context s is given as follows:

P_CTW(x^n, s) = { P_KT(~n_s)                                               if depth(s) = D,
                  (1/2) P_KT(~n_s) + (1/2) P_CTW(x^n, 0s) P_CTW(x^n, 1s)    otherwise.

Recursively we find PCTW(xn) = PCTW(xn, ∅), and also the conditional probability

P_CTW(x_{n+1}|x^n) = P_CTW(x^{n+1}) / P_CTW(x^n).


Remark The choice of the weights 1/2 in the second line is not important; they can be changed into any (α, 1−α) with α ∈ (0, 1) with almost no loss in the theoretical analysis (as you may see below). In practice, the choice of α depends on the data.

2 Worst-Case Regret of CTW Algorithm

Consider the context tree in the previous example with depth 3, and suppose that the true tree model S is as follows:

[Figure: the tree model S = {00, 10, 1} of Example 3, with leaves labeled θ_1, θ_10, θ_00 over time slots t − 2, t − 1, t.]

By definition of the CTW algorithm, we have

P_CTW(x^n, 00) ≥ (1/2) P_KT(~n_00)
P_CTW(x^n, 10) ≥ (1/2) P_KT(~n_10)
P_CTW(x^n, 1) ≥ (1/2) P_KT(~n_1)
P_CTW(x^n, 0) ≥ (1/2) P_CTW(x^n, 00) P_CTW(x^n, 10)
P_CTW(x^n) = P_CTW(x^n, ∅)
≥ (1/2) P_CTW(x^n, 0) P_CTW(x^n, 1)
≥ (1/2) · (1/2) P_CTW(x^n, 00) P_CTW(x^n, 10) · (1/2) P_KT(~n_1)
≥ (1/2^5) P_KT(~n_00) P_KT(~n_10) P_KT(~n_1)
= (1/2^5) P^KT_S(x^n).

As a result, for any sequence xn, in this case we have

log 1/P_CTW(x^n) − min_θ log 1/P_{S,θ}(x^n) ≤ 5 + log 1/P^KT_S(x^n) − min_θ log 1/P_{S,θ}(x^n)
≤ 5 + (3/2) log(n/3) + 3
= (3/2) log(n/3) + 8.


In general, repeating the analysis above, we lose a factor of at most 1/2 for each of the |S| leaves and |S| − 1 internal nodes. Therefore, in general, for any S corresponding to a tree source of depth ≤ D,

log 1/P_CTW(x^n) ≤ 2|S| − 1 + log 1/P^KT_S(x^n)
≤ 2|S| − 1 + [(|S|/2) log(n/|S|) + |S| + min_θ log 1/P_{S,θ}(x^n)]
= (|S|/2) log(n/|S|) + 3|S| − 1 + min_θ log 1/P_{S,θ}(x^n),

which again shows the order-optimality of the context-tree weighting for any unknown tree model with depth at most D.

3 Average Regret of CTW Algorithm

Recall that the worst-case regret under logarithmic loss is given by

WCR_n(P, F) = max_{x^n} [log 1/P(x^n) − min_{Q∈F} log 1/Q(x^n)].

Now, for any true probability distribution Q ∈ F, we have

D(Q_{X^n}‖P_{X^n}) = E_Q log (Q(X^n)/P(X^n))
≤ E_Q [log 1/P(X^n) − min_{Q′∈F} log 1/Q′(X^n)]
≤ WCR_n(P, F),

which shows that the average-case regret is upper bounded by the worst-case regret:

sup_{Q∈F} D(Q_{X^n}‖P_{X^n}) ≤ WCR_n(P, F).

Applying this result to the tree source, for any tree source PS,θ with depth(S) ≤ D, we have

D(P^n_{S,θ}‖P^n_CTW) ≤ (|S|/2) log n + O(1).

As a result, we know from Lecture 7 that for prediction under a general loss function Λ,

E[L^F_{P_CTW}(X^n) − L^F_{P_{S,θ}}(X^n)] ≤ Λ_max √((2/n)((|S|/2) log n + O(1))).


EE378A Statistical Signal Processing Lecture 11 - 05/09/2017

Lecture 11: Estimation of Directed Information
Lecturer: Tsachy Weissman Scribe: Maggie Engler

In this lecture, we will introduce the quantity of directed information, provide motivation for estimating the directed information, and look at two estimators and their associated performance guarantees. First, though, we clarify a question from last time about the depth of context tree weighting.

1 Context Tree Weighting, Continued

Recall that for Q_CTW,D, denoting the probability assignment by context tree weighting with depth D, we showed in the previous lecture that

log 1/Q_CTW,D(X^n) − min_θ log 1/P_{S,θ}(X^n) ≤ (|S|/2) log(n/|S|) + 3|S| − 1    (1)
= (|S|/2) log n + O(1)    (2)

for any tree source S with depth at most D. In other words, to obtain the optimal regret up to lower-order terms, in practice we can choose the tree depth D to be greater than our expected depth of the true tree source.

To avoid the issue of estimating the true tree depth, we can further mix many context trees with different tree depths D. Let's consider the following mixture: take any {w_D}_{D≥1} such that all w_D ≥ 0 and ∑_{D=1}^∞ w_D = 1, and let Q = ∑_{D=1}^∞ w_D Q_CTW,D. Now by the standard mixing argument, for any tree source S with depth D, we have

log 1/Q(X^n) ≤ log 1/Q_CTW,D(X^n) + log 1/w_D    (3)
≤ (|S|/2) log(n/|S|) + 3|S| − 1 + log 1/w_D.    (4)

Thus, the mixture Q as defined above is also guaranteed to perform within a constant term of the optimal regret for any tree source S.
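As an illustration, a truncated version of this mixture with the (hypothetical) geometric weights w_D = 2^{−D} can be sketched as follows, reusing ctw_prob from the earlier sketch; dropping the tail D > D_max only lowers the mixture, so the regret bound (4) still applies for every D ≤ D_max.

def mixture_over_depths(counts, D_max):
    """Lower bound on Q(x^n) = sum_D w_D * Q_CTW,D(x^n) with w_D = 2^(-D),
    truncated at D_max for computability."""
    total = 0.0
    for D in range(1, D_max + 1):
        total += 2.0 ** (-D) * ctw_prob(counts, '', D)
    return total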

2 Universality

Note: Remember that if some data is governed by a law P, the excess log-loss of the probability assignment Q (beyond what is optimal) is the relative entropy D(P_{X^n}‖Q_{X^n}), normalized to (1/n) D(P_{X^n}‖Q_{X^n}).

Definition 1 (Stationary). A probability measure P on X^∞ is stationary if for any k-tuple (X_{i_1}, X_{i_2}, · · · , X_{i_k}) and any m,

(X_{i_1}, X_{i_2}, · · · , X_{i_k}) =^d (X_{i_1+m}, X_{i_2+m}, · · · , X_{i_k+m}).    (5)

The shifted k-tuple is equal in distribution to the original. This is also described as shift-invariance and is a common assumption for an unknown underlying distribution.

Definition 2 (Universal). A probability measure Q on X^∞ is universal if

(1/n) D(P_{X^n}‖Q_{X^n}) → 0 as n → ∞    (6)

for all stationary laws P.


Exercise: Show that the mixture Q in (3) is universal.

Universality of Q is clearly a desirable property, as we are guaranteed to have a vanishing excess log-loss with any stationary law P. Additionally, if some Q is universal, it can be used to estimate other quantities of interest, such as directed information.

3 Directed Information

Definition 3 (Directed information). The directed information from X^n to Y^n is defined as

I(X^n → Y^n) ≜ ∑_{i=1}^n I(X^i; Y_i|Y^{i−1})    (7)

More generally,

I(X^{n−d} → Y^n) ≜ I((∅^d, X^{n−d}) → Y^n) = ∑_{i=1}^n I(X^{i−d}; Y_i|Y^{i−1}).    (8)

We remark that mutual information is the canonical measure of the relevance of one process to another, and the directed information is the canonical measure of the causal relevance of one process to another. Notice that directed information is a summation of mutual information between the first i elements of X and the i-th element of Y, conditioned on the previous i − 1 values of Y. Namely, we measure the causal relevance of X^i for our prediction of Y_i, given Y^{i−1}. The following two examples will illustrate why this is a very natural metric in prediction.

Example 1: Suppose that (X_1, X_2, · · · , X_n) and (Y_1, Y_2, · · · , Y_n) represent two stocks in the New York Stock Exchange. You are a young investor desperate to impress your boss at the hedge fund, and you would like to predict the price of stock Y tomorrow, i.e., at day i. You can look solely at Y^{i−1}, but, being more resourceful than that, you decide to determine the extent to which X^{i−1} also informs Y_i, that is, I(X^{i−1}; Y_i|Y^{i−1}). Collected over time, we have ∑_{i=1}^n I(X^{i−1}; Y_i|Y^{i−1}), which is equivalent to I(X^{n−1} → Y^n).

Example 2: Now, on the basis of your excellent application of directed information, you have been promoted to hedge fund manager. Your new role requires you to be prescient of global trends, so this time let (X_1, X_2, · · · , X_n) be the Hang Seng Index, an overall performance measure for the Hong Kong Stock Exchange, and let (Y_1, Y_2, · · · , Y_n) be the Dow Jones Industrial Average. Because of the time difference between New York and Hong Kong, the trading has already closed in Hong Kong before the opening bell in New York. Therefore, when you predict Y_i, you now have access to X_i in addition to X^{i−1} and Y^{i−1} as before. The conditional mutual information is I(X^i; Y_i|Y^{i−1}), and over a period of n days we get ∑_{i=1}^n I(X^i; Y_i|Y^{i−1}) = I(X^n → Y^n).

We describe mutual information as a measure of relevance or correlation between two processes, without implying causation. Directed information allows us to be more explicit about the flow of information from one process to another. The relation is shown below, to be proved as an exercise.

Exercise: Show that I(X^n; Y^n) = I(X^n → Y^n) + I(Y^{n−1} → X^n).

Also note that this implies that I(X^n → Y^n) ≤ I(X^n; Y^n):

I(X^n → Y^n) = ∑_{i=1}^n I(X^i; Y_i|Y^{i−1}) ≤ ∑_{i=1}^n I(X^n; Y_i|Y^{i−1}) = I(X^n; Y^n)    (9)

where the second step is given by the data processing inequality.


4 Estimation of Directed Information

Definition 4 (Directed information rate). Let X and Y be a pair of processes. The directed information rate from X to Y is

I(X → Y) ≜ lim_{n→∞} (1/n) I(X^n → Y^n),    (10)

provided that the limit exists.

Exercise: Show that the limit exists if X and Y are jointly stationary.

Note: Recall that for X ∼ P_X, the entropy may be written H(X) or equivalently H(P_X), which is a function of the distribution.

Notice that

I(X^n → Y^n) = ∑_{i=1}^n I(X^i; Y_i|Y^{i−1}) = ∑_{i=1}^n [H(Y_i|Y^{i−1}) − H(Y_i|X^i, Y^{i−1})]    (11)

This suggests an estimator for the directed information rate. Specifically,

Î(X^n → Y^n) = (1/n) ∑_{i=1}^n [Ĥ(Y_i|Y^{i−1}) − Ĥ(Y_i|X^i, Y^{i−1})].    (12)

Since the (conditional) entropy solely depends on the underlying unknown distribution, we can consider another probability assignment of our choice Q. The idea is that if Q is universal, then the (conditional) entropy based on Q should converge to that based on the true P.
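A sketch of the resulting plug-in estimator (12). The two callables stand for the conditional probabilities induced by universal assignments on the joint process and on Y alone (e.g., CTW-based); their names and signatures are our own illustration, not a fixed API.

import math

def estimate_di_rate(x, y, q_joint, q_y):
    """Plug-in estimate of (1/n) I(X^n -> Y^n): q_y(y_i, y_past) and
    q_joint(y_i, x_past_incl, y_past) return the conditional probabilities
    assigned to y_i; their code lengths estimate the two entropies in (12)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        h_y = -math.log2(q_y(y[i], y[:i]))                  # estimates H(Y_i | Y^{i-1})
        h_xy = -math.log2(q_joint(y[i], x[:i + 1], y[:i]))  # estimates H(Y_i | X^i, Y^{i-1})
        total += h_y - h_xy
    return total / n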

Theorem 5. If Q is universal and (X, Y) are jointly stationary and ergodic (a concept we will not introduce here, which basically means that the law of large numbers holds in the process), then

E[I(X → Y) − Î(X^n → Y^n)] → 0 as n → ∞    (13)

Therefore, this estimator of the directed information rate converges to the true rate.

With a few more constraints, we can further bound how quickly the estimate converges.

Theorem 6. If Q is derived from context tree weighting (CTW) and the joint process (X, Y) is stationary, ergodic, aperiodic, and Markov (of arbitrary order), then

E[I(X → Y) − Î(X^n → Y^n)] ≤ C (log n)/√n    (14)

where C > 0 is a universal constant not depending on n.

For a more in-depth discussion, please see the paper "Universal Estimation of Directed Information" (Jiao, Permuter, Zhao, Kim, and Weissman, 2012).

Another estimator is suggested by the observation that

I(X^n → Y^n) = ∑_{i=1}^n E[D(P_{Y_i|X^i,Y^{i−1}}‖P_{Y_i|Y^{i−1}})]    (15)

with the proof left as an exercise. We can then construct

Î(X^n → Y^n) ≜ (1/n) ∑_{i=1}^n D(Q_{Y_i|X^i,Y^{i−1}}‖Q_{Y_i|Y^{i−1}})    (16)

for our suitable choice of Q. We omit the details here, but remark that this estimator has similar performance guarantees to the first, and often has a slightly smoother convergence in practice, as relative entropy is always non-negative.


EE378A Statistical Signal Processing Lecture 12 - 05/10/2017

Lecture 12: Introduction to the Discrete Universal Denoiser (DUDE)
Lecturer: Tsachy Weissman Scribe: Yihui Quek

In this lecture,^1 we set the stage for the Discrete Universal Denoiser (DUDE), an algorithm for denoising sequences whose symbols take values in a discrete alphabet. The following table shows at a glance how this new problem setting compares to those of the algorithms we have studied in the past:

Algorithm | Key idea | Input sequence alphabet
Viterbi algorithm / Kalman filter | Maximum a posteriori (MAP) decoding | Continuous
Context-tree weighting (CTW) algorithm | Decoding from context | Discrete, generated by a tree source with unknown parameters
Discrete Universal Denoiser (DUDE) | MAP decoding from context | Discrete, generating distribution unknown

1 Setting

Figure 1 shows a toy setting we consider for discrete denoising.

[Figure 1: Denoising setting. x^n = (x_1, . . . , x_n) ∈ X^n → noisy channel → z^n ∈ Z^n → denoiser → x̂^n(z^n) = (x̂_1(z^n), . . . , x̂_n(z^n)) ∈ X̂^n. In this toy example, we assume that the parameters of the source X (i.e., p_X(x^n)), as well as the conditional probabilities of each output symbol produced by the channel on any given input, p_{X,Z}(x_i, z_i), are known.]

We assume all noiseless sequences and observations are composed of symbols which take values in a finite alphabet, X = {a_1, . . . , a_{|X|}} (thus the 'discrete' in the name of the denoiser). Denoising means that given a noisy observation sequence z^n, we would like to reconstruct the original noiseless signal, x^n, by outputting a best estimate x̂^n(z^n). Assume we have a per-symbol loss function Λ : X × X̂ → R and are allowed to look at an entire length-n observation z^n to guess x^n.

Definition 1. An n-block denoiser is a mapping X̂^n : Z^n → X̂^n.

As in the past, we measure the performance of X̂^n by the normalized cumulative loss on a block z^n:

L_{X̂^n}(x^n, z^n) = (1/n) ∑_{i=1}^n Λ(x_i, X̂^n(z^n)[i]).    (1)

^1 Reading: Universal Discrete Denoising: Known Channel, T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, M. J. Weinberger, 2005.


(From now on we use the notation X̂^n(z^n)[i] ≡ x̂_i(z^n).)

Let us start our investigation by assuming that the symbols are generated probabilistically with known parameters, i.e., the probabilities of all symbols, channel and noise are known to the denoiser. Under the cumulative loss, an optimal denoiser is then described by

arg min_{X̂^n(·)∈D_n} E[L_{X̂^n}(x^n, z^n)] = arg min_{X̂^n(·)∈D_n} E[(1/n) ∑_{i=1}^n Λ(x_i, x̂_i(z^n))]    (2)

where D_n denotes the set of all block-n denoisers. We can observe from the right-hand side of (2) that the cumulative loss is separable; it is a sum of terms, each of which depends only on x_i and x̂_i. This suggests a greedy strategy for producing such an estimate; under this strategy, x̂_i(z^n) is nothing but the Bayes response to the conditional posterior probability of each symbol:

x̂_i(z^n) = X̂_Bayes(P_{X_i|Z^n=z^n})    (3)

and the corresponding value of expected loss is the Bayes envelope,

(1/n) ∑_{i=1}^n E[U(P_{X_i|Z^n})].    (4)

We remind the readers that the Bayes envelope is defined as U(P_X) := min_{x̂∈X̂} ∑_x P_X(x) Λ(x, x̂). In addition,

the discreteness of the input alphabet permits us to write a loss matrix that will be useful in later sections:

{Λ(x, x̂)} := [λ_{a_1} λ_{a_2} · · · λ_{a_{|X̂|}}]

where λ_{a_i} is the column vector Λ(·, x̂ = a_i). With this definition in hand, we may rewrite the Bayes envelope in terms of the λ_{a_i}'s:

U(P_X) = min_{x̂} P_X^T · λ_{x̂}    (5)

The conclusion of this analysis reveals that the Bayes response denoiser (3) achieves the minimum loss on a sequence with known generating distribution. But we have glossed over some things in the above analysis. Firstly, the Kalman filter showed us that computing the posterior is already hard in the continuous case, and it requires care in the discrete case as well. Secondly, we should not realistically expect to be handed the probabilistic parameters of the problem on a silver platter; for one, we often do not know the parameters of the generating distribution of the given sequence. We have in fact waded in these waters before; see the CTW algorithm.

2 Towards construction of a universal scheme

From now on we assume that we do not know the generating distribution of the sequence symbols, but we have some characterization of the channel. We would like to construct a universal scheme, i.e., a scheme that attains optimal performance asymptotically no matter the true law that governs the data. Here, the "optimal performance denoiser" is X̂_Bayes(P_{X_i|Z^n}) as established previously.

In this section we consider the "single-letter" problem as shown in Figure 2, where our input sequence is now only one symbol, X, and we are to guess X based on Z. Again, X and X̂ take values in the finite alphabet X = {a_1, . . . , a_{|X|}}. We can also define the channel transition matrix showing the probabilities of mapping inputs to outputs:

{Π(x, z)} := [π_{a_1} π_{a_2} · · · π_{a_{|Z|}}]


where π_{a_i} is the column vector Π(·, z = a_i).

[Figure 2: Single-letter denoising. X ∼ P_X, X ∈ X, passes through the channel Π(x, z) to produce Z ∈ Z; the denoiser outputs X̂ ∈ X̂. We assume that we know the channel transition probability matrix Π(x, z).]

Before we proceed, let us define the Schur/Hadamard product, represented by the symbol ⊙:

Definition 2. For vectors v_1, v_2, v_1 ⊙ v_2 is the vector obtained by element-wise multiplication:

[v_1 ⊙ v_2]_i = v_{1,i} v_{2,i}

Secondly, we generalize the Bayes response (implicitly defined on our loss function Λ(x, x̂)) to take an arbitrary argument v:

X̂_Bayes(v) = arg min_{x̂} v^T · λ_{x̂}    (6)

With these definitions in hand, we may factorize PZ with a simple calculation:

P_Z(z) := ∑_x P_{X,Z}(x, z) = ∑_x P_X(x) Π(x, z) = P_X^T · π_z = [P_X^T · Π](z)    (7)

i.e., P_Z^T = P_X^T · Π. Assume now that |X| = |Z| and Π is invertible.^2 Then we may re-write this as

P_X = Π^{−T} P_Z    (8)

and we may readily write the conditional probability

P_{X|Z=z}(x) := P_X(x) Π(x, z) / P_Z(z) = [Π^{−T} P_Z](x) · π_z(x) / P_Z(z)

P_{X|Z=z} = ([Π^{−T} P_Z] ⊙ π_z) / P_Z(z)    (9)

Let φ(z) represent the operation of denoising. The previous section established that

min_{φ(·)} E Λ(X, φ(Z)) = E[U(P_{X|Z})]

and this is achieved by

φ_opt(z) = X̂_Bayes(P_{X|Z=z}) = X̂_Bayes([Π^{−T} P_Z] ⊙ π_z)    (10)

where we have elided the denominator of (9) in the substitution because the overall normalization is not a function of x and does not affect the minimization.

We can generalize the above analysis to situations where |X| ≤ |Z|, i.e., Π is non-square but of full row rank: in those situations (to be proven in the homework) we have

φ_opt(z) = X̂_Bayes(P_{X|Z=z}) = X̂_Bayes([(ΠΠ^T)^{−1}Π · P_Z] ⊙ π_z).

By the way, (ΠΠ^T)^{−1}Π is defined for any full-row-rank matrix Π; its transpose Π^T(ΠΠ^T)^{−1} is known as the Moore–Penrose pseudoinverse of Π.

^2 Usually a safe assumption, but notable exceptions include the binary symmetric channel with flipping probability 1/2, which is not a useful channel anyway.


3 Universal Discrete Denoising and introduction to DUDE

Now we extend the above analysis to denoising of a sequence, not just a single letter, and in doing so we introduce the DUDE. Figure 1 is applicable to our new setting too, but now we have far less knowledge of the system. More concretely, the assumptions are now:

1. Finite alphabets X, Z, X̂.

2. Source X is governed by a probabilistic and stationary but otherwise unknown generating distribution.

3. The channel is known to be a discrete memoryless channel, and Π is known and of full row rank.

4. Loss function Λ is given.

The idea of DUDE is that we "correct by the context"; whereas in Section 1 we had the luxury of knowing P_{X_i|Z^n=z^n}, which we could input into the Bayes response function, now we must estimate it. We do so by looking at left and right length-k contexts around the symbol to be denoised:

z_1 . . . z_{i−k} . . . z_{i−1} z_i z_{i+1} . . . z_{i+k} . . . z_n

For every pair of left and right contexts, we can define a count vector m ∈ R^{|Z|} to capture the number of times a symbol z ∈ Z appears between these contexts over the entire sequence

m(z^n, ℓ^k, r^k)[z] := |{k + 1 ≤ i ≤ n − k : z^{i−1}_{i−k} = ℓ^k, z^{i+k}_{i+1} = r^k, z_i = z}|;

the DUDE denoiser effectively calculates the Bayes response (3) with this empirical probability distribution. That is, DUDE(k) performs the function F where:

x̂_i(z^n) = F(Λ, Π, m(z^n, z^{i−1}_{i−k}, z^{i+k}_{i+1}), z_i),   k + 1 ≤ i ≤ n − k
= arg min_{x̂} λ_{x̂}^T [[Π^{−T} · m(z^n, z^{i−1}_{i−k}, z^{i+k}_{i+1})] ⊙ π_{z_i}].    (11)

Readers will note the similarity with (10). Surprisingly, it will turn out that the operation performed by the DUDE is linear in n.
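A compact two-pass sketch of the DUDE for a common alphabet {0, . . . , M − 1}, with Π square and invertible (the full-row-rank case would use the pseudoinverse as in the previous section). Variable names are our own; boundary symbols without full contexts are simply copied.

import numpy as np

def dude(z, k, Pi, Lam):
    """Two-pass DUDE: the first pass gathers the count vectors
    m(z^n, l^k, r^k); the second outputs the empirical Bayes response (11)."""
    n, M = len(z), Pi.shape[0]
    counts = {}
    for i in range(k, n - k):
        key = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        counts.setdefault(key, np.zeros(M))[z[i]] += 1
    Pi_inv_T = np.linalg.inv(Pi).T
    x_hat = list(z)  # boundary symbols are left as-is
    for i in range(k, n - k):
        key = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        v = (Pi_inv_T @ counts[key]) * Pi[:, z[i]]  # [Pi^{-T} m] ⊙ pi_{z_i}
        x_hat[i] = int(np.argmin(Lam.T @ v))        # arg min over reconstructions
    return x_hat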


EE378A Statistical Signal Processing Lecture 13 - 05/16/2017

Lecture 13: The Performance of the Discrete Universal Denoiser (DUDE)
Lecturer: Tsachy Weissman Scribe: Ruiyang Song

1 Recap

The system model of the denoising problem is presented in Fig. 1. Let X^n be the unknown input sequence that passes through a discrete memoryless channel. The channel is characterized by its transition matrix Π, which is known and of full row rank. The output of the noisy channel is denoted by Z^n, which is observed by the denoiser. The denoiser then outputs the reconstruction sequence X̂^n. Here, X_i ∈ X, Z_i ∈ Z, X̂_i ∈ X̂, where X, Z, and X̂ are finite alphabets. The distortion of the reconstruction sequence is characterized by a given loss function Λ : X × X̂ → [0, +∞).

[Figure 1: Denoising setting. X^n → noisy channel → Z^n → denoiser → X̂^n.]

The DUDE with context length k works in the following way: in the first pass, the denoiser derives the counting vector of dimension |Z| as follows:

m(z^n, l^k, r^k)[z] = |{k + 1 ≤ i ≤ n − k : z^{i+k}_{i−k} = (l^k, z, r^k)}|.    (1)

Note that m[z] counts the number of items in the sequence z^n that equal z with left context l^k and right context r^k. In the second pass, DUDE outputs the reconstruction sequence X̂^n with

X̂_i(z^n) = φ(Λ, Π, m(z^n, z^{i−1}_{i−k}, z^{i+k}_{i+1}), z_i), for k + 1 ≤ i ≤ n − k,    (2)

where

φ(Λ, Π, v, z) = X̂_Bayes((ΠΠ^T)^{−1}Π v ⊙ π_z)    (3)
= arg min_{x̂} λ_{x̂}^T ((ΠΠ^T)^{−1}Π v ⊙ π_z).    (4)

Exercise 1. Consider the following specific examples:

• For BSC(δ) and Hamming loss, i.e., Π = [1−δ  δ; δ  1−δ], Λ = [0  1; 1  0], show that

φ(Λ, Π, v, z) = { z    if v(z)/v(z̄) ≥ 2δ(1−δ)/(δ^2 + (1−δ)^2),
                  z̄    otherwise,    (5)

where z̄ denotes the complement of the binary symbol z.

• For BEC(ε) and Hamming loss, the alphabets are Z = X ∪ {e}, X̂ = X, and

Π(x, z) = { 1 − ε    if z = x,
            ε        if z = e,
            0        otherwise.    (6)

Show that

φ(Λ, Π, v, z) = { z                       if z ≠ e,
                  arg max_{x∈X} v(x)      if z = e.    (7)


2 Choice of the context length k

The choice of k is important for the performance of DUDE. On one hand, a large k provides more information in each context sequence; on the other hand, when k is large, we will have a small number of counts and might suffer from unreliable context statistics. In practice, we choose

k = k_n = ⌈(1/5) · log n / log |Z|⌉,    (8)

which can be justified through performance analysis. In the following sections, we denote by X̂^n_DUDE the reconstructed sequence given by the DUDE with context length k_n.

3 Performance of DUDE

3.1 Stochastic setting

In this section we assume that the source X is a stationary stochastic process. We introduce the definition of denoisability as follows:

D(X, Z) = lim_{n→∞} min_{X̂^n∈D_n} E L_{X̂^n}(X^n, Z^n),    (9)

where Dn is the set of all denoisers of block length n. Note that

min_{X̂^n∈D_n} E L_{X̂^n}(X^n, Z^n) = (1/n) ∑_{i=1}^n E U(P_{X_i|Z^n}).    (10)

We introduce the properties of D(X, Z) in the following exercises.

Exercise 2.

• If (X,Z) is jointly stationary, then D(X,Z) exists.

• D(X, Z) = lim_{l→∞} E U(P_{X_0|Z^l_{−l}}). Note that when X, Z are jointly stationary, E U(P_{X_i|Z^n}) = E U(P_{X_0|Z^{n−i}_{1−i}}).

In the following theorem, we show the optimality of DUDE for the stochastic setting.

Theorem 3. For any stationary process X,

lim_{n→∞} E L_{X̂_DUDE}(X^n, Z^n) = D(X, Z).    (11)

3.2 Semi-stochastic setting

In this section, we assume that the source x = (x_1, x_2, . . .) is some deterministic sequence. Note that Z is still a stochastic process, with distribution P(Z^n = z^n) = ∏_{i=1}^n Π(x_i, z_i). This is actually more general than the stochastic setting, since we assume nothing about the input sequence. Define

D_k(x^n, z^n) = min_{f: Z^{2k+1} → X̂} (1/n) ∑_{i=k+1}^{n−k} Λ(x_i, f(z^{i+k}_{i−k}))    (12)

to be the error induced by the optimal denoiser with double-sided context length k. Note that here the optimal denoiser has access to both the original sequence and the noisy sequence, but is restricted to have the same output for symbols with the same length-k contexts.

Theorem 4. For any source sequence x,

lim_{n→∞} (L_{X̂^n_DUDE}(x^n, Z^n) − D_{k_n}(x^n, Z^n)) = 0 w.p. 1.    (13)


Note that Theorem 4 is stronger than Theorem 3: Theorem 3 can be proved directly from Theorem 4. We first introduce some mathematical background.

For a sequence {an}, we define

lim sup_{n→∞} a_n = lim_{n→∞} sup_{m≥n} a_m,    (14)
lim inf_{n→∞} a_n = lim_{n→∞} inf_{m≥n} a_m.    (15)

Note that lim sup and lim inf exist for any sequence. Note also the following fact:

Fact 5. lim_{n→∞} a_n exists iff lim sup_{n→∞} a_n = lim inf_{n→∞} a_n.

We will also need the following lemma:

Lemma 6 (Fatou's lemma). If {R_n} is a sequence of non-negative random variables, then

E[lim inf_{n→∞} R_n] ≤ lim inf_{n→∞} E[R_n].    (16)

With the preparation above, we prove Theorem 3 using Theorem 4.

Proof For any fixed l, we have

0 = E[lim sup_{n→∞} (L_{X̂^n_DUDE}(x^n, Z^n) − D_{k_n}(x^n, Z^n))]    (17)
≥ lim sup_{n→∞} [E L_{X̂^n_DUDE}(x^n, Z^n) − E D_{k_n}(x^n, Z^n)]    (18)
≥ lim sup_{n→∞} [E L_{X̂^n_DUDE}(x^n, Z^n) − E D_l(x^n, Z^n)],    (19)

where the first inequality follows from Fatou’s lemma. Note that we can prove

E D_l(X^n, Z^n) ≤ E U(P_{X_0|Z^l_{−l}}).    (20)

Hence, lim sup_{n→∞} E L_{X̂^n_DUDE}(X^n, Z^n) ≤ E U(P_{X_0|Z^l_{−l}}). Since l is arbitrary, we have

lim sup_{n→∞} E L_{X̂^n_DUDE}(X^n, Z^n) ≤ lim_{l→∞} E U(P_{X_0|Z^l_{−l}}) = D(X, Z).    (21)

On the other hand, D(X,Z) measures the fundamental denoisability, and we must have

lim inf_{n→∞} E L_{X̂^n_DUDE}(X^n, Z^n) ≥ D(X, Z).    (22)

Combining (21) and (22) completes the proof.


EE378A Statistical Signal Processing Lecture 14 - 05/18/2017

Lecture 14: Nonparametric Function Estimation
Lecturer: Yanjun Han Scribe: Yanjun Han

1 Introduction

DUDE considers the denoising setting with a discrete source alphabet and reconstruction alphabet, and performs well under both the stochastic and semi-stochastic settings. However, the context-based counting idea used by DUDE cannot be directly applied to the continuous setting. In this lecture we start to look at a new problem: the nonparametric function estimation problem, with a continuous alphabet and new assumptions on the underlying function.

We look at the following regression setting: we have n observations

y_i = f(x_i) + σξ_i,   i = 1, · · · , n    (1)

where f(·) is our target function on [0, 1], which we assume to lie in some prescribed function class F, x_i = i/n are equally spaced points on [0, 1], σ is the known noise level, and ξ_i ∼ N(0, 1) are i.i.d. Gaussian noise.

Our target is to find some estimator f̂ which comes close to minimizing the worst-case L_q risk:

sup_{f∈F} E_f‖f̂ − f‖_q ≲ inf_{f∗} sup_{f∈F} E_f‖f∗ − f‖_q.    (2)

Notations: for non-negative sequences {a_n}, {b_n}, we use a_n ≲ b_n to denote lim sup_{n→∞} a_n/b_n < ∞, i.e., there exists a universal constant C not depending on n such that a_n ≤ C b_n for any n ∈ N. Similarly, we write a_n ≳ b_n to denote b_n ≲ a_n, and a_n ≍ b_n to denote a_n ≲ b_n and b_n ≲ a_n.

References:

1. A. Nemirovski. "Topics in Non-parametric Statistics." Ecole d'Été de Probabilités de Saint-Flour 28 (2000): 85.

2. A.B. Tsybakov. “Introduction to Nonparametric Estimation.” Springer, 2009.

2 The Bias–Variance Decomposition

Before looking at a specific function class F and estimators f̂, we take a look at how to evaluate the performance of any given estimator f̂. We define the following two quantities:

1. Deterministic error: b(x) ≜ E_f f̂(x) − f(x); this term corresponds to the bias, and is a deterministic scalar.

2. Stochastic error: s(x) ≜ f̂(x) − E_f f̂(x); this term corresponds to the variance, and is a random variable.

With the previous definition, for the Lq loss with 1 ≤ q <∞, we can write

E_f‖f̂ − f‖_q = E_f‖f̂ − E_f f̂ + E_f f̂ − f‖_q    (3)
≤ E_f‖f̂ − E_f f̂‖_q + ‖E_f f̂ − f‖_q    (4)
= E_f‖s‖_q + ‖b‖_q    (5)
≤ (E_f‖s‖_q^q)^{1/q} + ‖b‖_q    (6)

where we have used the triangle inequality and Jensen's inequality. As a result, we can decompose the L_q risk into the bias term ‖b‖_q and the variance term (E_f‖s‖_q^q)^{1/q} separately.


3 Function Estimation in Hölder Balls

First we assume that the function class F is the Hölder ball with smoothness parameter s, and construct a sound estimator f̂ for f in this case.

3.1 Introduction to Hölder Balls

The Hölder ball H^s(L) consists of all functions which are Hölder smooth of order s > 0, i.e., for s = m + α with m ∈ N, α ∈ (0, 1],

H^s(L) = { f ∈ C[0, 1] : sup_{x≠y∈[0,1]} |f^{(m)}(x) − f^{(m)}(y)| / |x − y|^α ≤ L }.    (7)

In other words, the Hölder ball consists of functions which are "smooth" of a certain smoothness order.

Exercise 1. Show that for s > 1, if we use m = 0, α = s in the definition (7), then the only functions which satisfy the new definition are the constant functions.

3.2 Kernel–based Estimator

Now we consider how to find an estimator f̂ of f. First suppose that there is no noise, i.e., σ = 0. In this case, we know perfectly the value of f at the points x_i, but do not know the value of f at x ∈ (x_i, x_{i+1}) in between. However, since we know that the function f is smooth, it is tempting to use interpolation to obtain f(x), e.g.,

f̂(x) = (x_{i+1} − x)/(x_{i+1} − x_i) · f(x_i) + (x − x_i)/(x_{i+1} − x_i) · f(x_{i+1}).    (8)

In general, we may write

f̂(x) = ∑_{i: x_i∈I_h(x)} w(x, x_i) f(x_i)    (9)

for some weight function w(x, x_i), where I_h(x) ≜ [x − h, x + h] is a small "window" around x with bandwidth h (to be specified later). Also, in the noisy case where only y_i is available, we may construct

f̂(x) = ∑_{i: x_i∈I_h(x)} w(x, x_i) y_i.    (10)

Note that for simplicity we have ignored the boundary effect when x is close to zero or one.

Now we specify our choice of the weight function. A natural requirement is that ∑_{i: x_i∈I_h(x)} w(x, x_i) = 1 for any x, i.e., the weights should sum to one. This requirement can be generalized in the following way: when f(x) = x^k for k = 0, · · · , m, the estimator on the LHS of (9) should be equal to the true value f(x) = x^k. In other words, we require that

∑_{i: x_i∈I_h(x)} w(x, x_i) x_i^k = x^k,   ∀k = 0, 1, · · · , m.    (11)

Note that as long as nh → ∞, (11) can be satisfied, since we have about 2nh degrees of freedom while there are only m + 1 linear constraints. The following lemma shows some properties of the weights:

Lemma 2. For any x ∈ [0, 1], there exists some weight function w(x, ·) which satisfies (11) and

‖w(x, ·)‖_1 ≲ 1,   ‖w(x, ·)‖_2 ≲ 1/√(nh).    (12)

As a sanity check, just consider the case where m = 0 and w(x, x_i) = 1/(2nh) for any x_i ∈ I_h(x).
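One standard way to realize weights satisfying (11), stated here as a hedged sketch rather than the construction used in the lemma, is a local polynomial fit of degree m: the weights of the fitted value at x automatically reproduce all polynomials of degree up to m on the window.

import numpy as np

def local_poly_weights(x, xs, h, m):
    """Weights w(x, x_i) over the window I_h(x) from a degree-m local
    polynomial fit; they satisfy sum_i w_i x_i^k = x^k for k = 0, ..., m."""
    idx = np.where(np.abs(xs - x) <= h)[0]
    A = np.vander(xs[idx] - x, N=m + 1, increasing=True)  # columns (x_i - x)^k
    W = np.linalg.solve(A.T @ A, A.T)                     # (A^T A)^{-1} A^T
    return idx, W[0]                                      # row 0: weights of the fitted intercept

# the kernel estimator (10) is then f_hat = w @ y[idx], with idx, w = local_poly_weights(...)

For m = 0 this reduces exactly to the uniform weights of the sanity check above.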


3.3 Performance Analysis

Now we analyze the performance of the estimator in (10). By the bias–variance decomposition, it suffices to deal with the variance and the bias separately.

For the variance (stochastic error), we have

s(x) = f̂(x) − E_f f̂(x)    (13)
= ∑_{i: x_i∈I_h(x)} w(x, x_i) y_i − ∑_{i: x_i∈I_h(x)} w(x, x_i) E y_i    (14)
= ∑_{i: x_i∈I_h(x)} w(x, x_i)(f(x_i) + σξ_i) − ∑_{i: x_i∈I_h(x)} w(x, x_i) f(x_i)    (15)
= σ ∑_{i: x_i∈I_h(x)} w(x, x_i) ξ_i    (16)
∼ N(0, σ^2‖w(x, ·)‖_2^2).    (17)

Since for X ∼ N(0, σ^2) we have E|X|^q ≲ σ^q for any q ∈ [1, ∞), for the stochastic error we have E|s(x)|^q ≲ ‖w(x, ·)‖_2^q (suppressing the dependence on σ), and thus

(E‖s‖_q^q)^{1/q} ≲ ‖w(x, ·)‖_2 ≲ 1/√(nh)    (18)

where in the last step we have used Lemma 2.

Now we take a look at the bias. By definition,

b(x) = E_f f̂(x) − f(x) = ∑_{i: x_i∈I_h(x)} w(x, x_i) f(x_i) − f(x).    (19)

By our choice of the weights, we know that for any polynomial P (x) of degree at most m, we have

|b(x)| = | ∑_{i: x_i∈I_h(x)} w(x, x_i)(f(x_i) − P(x_i)) − (f(x) − P(x)) |    (20)
≤ (‖w(x, ·)‖_1 + 1) · ‖f − P‖_{∞,I_h(x)}    (21)
≲ ‖f − P‖_{∞,I_h(x)}    (22)

where ‖f‖_{∞,I} ≜ ess sup{|f(x)| : x ∈ I}. Since this inequality holds for any P, we further have

|b(x)| ≲ inf_{P∈Poly_m} ‖f − P‖_{∞,I_h(x)}    (23)

where Poly_m denotes the class of all polynomials of degree at most m. Hence, to control the bias, it suffices to find a good polynomial approximation of f in the interval I_h(x).

Due to the definition of the Hölder ball, a natural choice of the polynomial is the Taylor expansion polynomial, i.e.,

P(y) = ∑_{k=0}^m (f^{(k)}(x)/k!) (y − x)^k.    (24)


By Taylor’s theorem and the Lagrange remainder term, for any y ∈ Ih(x) we have

|f(y) − P(y)| = | f(y) − ∑_{k=0}^m (f^{(k)}(x)/k!) (y − x)^k |    (25)
= (|f^{(m)}(ξ) − f^{(m)}(x)|/m!) · |y − x|^m    (26)
≤ (L|ξ − x|^α/m!) · |y − x|^m    (27)
≤ (Lh^α/m!) · h^m    (28)
≲ h^s    (29)

where ξ is some point between x and y, and again we suppress the dependence on L. As a result, we have |b(x)| ≲ h^s for any x ∈ [0, 1], and thus

‖b‖_q ≲ h^s.    (30)

Combining the bias and variance together, we have

sup_{f∈H^s(L)} E_f‖f̂ − f‖_q ≲ 1/√(nh) + h^s.    (31)

Setting these two terms to be equal yields h ≍ n^{−1/(2s+1)}, and thus

sup_{f∈H^s(L)} E_f‖f̂ − f‖_q ≲ n^{−s/(2s+1)}.    (32)

In fact, this bound turns out to be optimal:

Theorem 3. For any s > 0 and q ∈ [1, ∞), the minimax L_q risk in estimating f is given by

inf_{f̂} sup_{f∈H^s(L)} E_f‖f̂ − f‖_q ≍ n^{−s/(2s+1)}.    (33)

Moreover, the kernel-based estimator in (10) with bandwidth h ≍ n^{−1/(2s+1)} attains the minimax risk within multiplicative constants.

3.4 Discussions

We briefly discuss the previous result. The most important point is the bias–variance tradeoff: when the bandwidth h increases, the deterministic error b(x) gets larger and the stochastic error s(x) gets smaller. Specifically, we have

(E|s(x)|^q)^{1/q} ≲ 1/√(nh),    (34)
|b(x)| ≲ inf_{P∈Poly_m} ‖f − P‖_{∞,I_h(x)}.    (35)

The reason is also straightforward intuitively: as the bandwidth h increases, the error incurred by using f(x ± h) to approximate f(x) gets larger, i.e., the bias gets larger. In contrast, the number of points to be averaged over (i.e., 2nh) gets larger, which yields a smaller variance. Finally, we need to choose a suitable bandwidth h∗ to balance the bias and variance, as shown in Figure 1.

The second point to note is that the overall estimator f̂ is linear in the observations y_1, · · · , y_n. This is called a linear estimator. Theorem 3 shows that, for nonparametric function estimation over Hölder balls, a linear estimator can attain the minimax risk within constants. Then a natural question arises: can linear approaches always attain the minimax risk in nonparametric function estimation for other natural function classes F? We will show in the next section that the answer is no in general.


[Figure 1: The bias–variance tradeoff.]

The third point is more technical and high-level: in choosing the weight function, we require that our weights keep all polynomials of degree at most m. Some more thought leads to the following question: why are polynomials so special here? Can we choose other basis functions? To answer this question, we need to introduce the Kolmogorov n-width:

Definition 4 (Kolmogorov n-width). Let (X, ‖ · ‖) be a normed vector space, and K ⊂ X be a compact subset. The Kolmogorov n-width of K is defined by

d_n(K) ≜ d_n(K, ‖ · ‖) ≜ inf_{V⊂X} sup_{x∈K} inf_{y∈V} ‖x − y‖    (36)

where the first infimum is taken over all linear subspaces V ⊂ X of dimension at most n.

The following lemma shows the Kolmogorov n-width of the Holder ball:

Lemma 5. For the Hölder ball H^s(L) with smoothness parameter s > 0, its Kolmogorov n-width is

d_n(H^s(L)) ≍ n^{−s}    (37)

and is attained by the polynomial basis V = span{1, x, · · · , x^{n−1}}.

Now we take a look at the bias (35), which shows that the bias is determined by the approximation error of the basis functions we choose. As a result, we should choose the basis functions which attain the Kolmogorov n-width, and by the previous lemma we know that the polynomial basis is the right basis to use here. In general, we would like to remark on another important point here: the choice of basis is very important! We will see more examples in the next lecture.


4 Function Estimation in Sobolev Balls

Next we consider the same regression problem in Sobolev balls, which generalize the Hölder ball in the sense that some spatial inhomogeneity is allowed. We will show that in certain scenarios, any linear approach fails to give the optimal minimax risk.

4.1 Introduction to Sobolev Balls

Sobolev space is another space which naturally measures the smoothness of a function, and plays an important role in approximation theory and partial differential equations. Specifically, the (1D) Sobolev ball W^{k,p}_1(L) consists of all functions f on [0, 1] such that

‖f^{(k)}‖_p ≤ L    (38)

where k > 0 is an integer, and p ∈ [1, ∞]. Technically, we remark that the derivative is defined in terms of distributions (not the classical derivative) and exists for any function. Moreover, to ensure a continuous embedding W^{k,p}_1(L) ⊂ C[0, 1], we need the additional assumption k > 1/p. To see a counterexample, consider f(x) = 1(x ≥ 1/2): we have f′(x) = δ(x − 1/2) with ‖f′‖_1 = 1, so f ∈ W^{1,1}_1(1) but f is not continuous. We also note that the "effective" smoothness s of W^{k,p}_1(L) will become

s = k − 1/p.    (39)

4.2 Performance Analysis of Linear Approaches

Let’s consider the performance of the linear approach in (10), where the weight function keeps all polynomialof degree at most k−1. By the bias-variance tradeoff, it suffices to deal with these two quantities separately.

For the variance, (34) does not depend on the specific function class F , so we still have

(E‖s‖_q^q)^{1/q} ≲ 1/√(nh).    (40)

For the bias, the polynomial approximation error incurred by the Taylor expansion polynomial will change for the Sobolev ball. In fact, by the same argument, for any y ∈ I_h(x) we have

|f(y) − P(y)| ≲ h^{k−1} · |f^{(k−1)}(ξ) − f^{(k−1)}(x)|    (41)
≤ h^{k−1} ∫_x^ξ |f^{(k)}(z)| dz    (42)
≤ h^{k−1} (∫_x^ξ |f^{(k)}(z)|^p dz)^{1/p} (∫_x^ξ 1 dz)^{1−1/p}    (43)
≤ h^{k−1/p} · (∫_{I_h(x)} |f^{(k)}(z)|^p dz)^{1/p}.    (44)

As a result, we have

‖b‖_q^q ≲ ∫_0^1 [ h^{k−1/p} · (∫_{I_h(x)} |f^{(k)}(z)|^p dz)^{1/p} ]^q dx    (45)
= h^{(k−1/p)q} · ∫_0^1 (∫_{I_h(x)} |f^{(k)}(z)|^p dz)^{q/p} dx.    (46)


To upper bound this quantity, we need to employ the Sobolev ball condition. Note that without the exponent q/p, we have

∫_0^1 (∫_{I_h(x)} |f^{(k)}(z)|^p dz) dx = ∫∫_{|x−z|≤h} |f^{(k)}(z)|^p dx dz ≤ 2h ∫_0^1 |f^{(k)}(z)|^p dz ≤ 2hL^p ≲ h.    (47)

In other words, we would like to find the maximum of some integral with exponent q/p, given that this integral without the exponent is a fixed constant. Recall the following fact:

Exercise 6. For non-negative reals a_1, · · · , a_n, show that

a_1^r + a_2^r + · · · + a_n^r ≤ { n^{1−r}(a_1 + a_2 + · · · + a_n)^r    if r ∈ [0, 1],
                                 (a_1 + a_2 + · · · + a_n)^r           if r > 1.    (48)

Using this fact (and after some analysis), we can obtain that

‖b‖_q^q ≲ { h^{kq}                if q ≤ p,
            h^{(k−1/p+1/q)q}      if p < q < ∞.    (49)

Now upon choosing the optimal bandwidth h, we arrive at

sup_{f∈W^{k,p}_1(L)} E_f‖f̂ − f‖_q ≲ { n^{−k/(2k+1)}                          if q ≤ p,
                                       n^{−(k−1/p+1/q)/(2(k−1/p+1/q)+1)}      if p < q < ∞.    (50)

It turns out that this is the best performance we can hope for using linear approaches:

Theorem 7. For any p ∈ [1, ∞], k > 1/p and q ∈ [1, ∞), the minimax linear risk in estimating f over the Sobolev ball W^{k,p}_1(L) is given by

inf_{f̂_lin} sup_{f∈W^{k,p}_1(L)} E_f‖f̂_lin − f‖_q ≍ { n^{−k/(2k+1)}                          if q ≤ p,
                                                      n^{−(k−1/p+1/q)/(2(k−1/p+1/q)+1)}      if p < q < ∞,    (51)

where the infimum is taken over all linear estimators.

4.3 Optimal Performance

One may wonder whether the performance of the best linear estimator in Theorem 7 is optimal even if we allow for non-linear estimators. The answer is yes when q ≤ p, and is no otherwise. This can be seen intuitively with the help of the previous exercise:

1. When q ≤ p, the bias-maximizing function f is non-zero everywhere in [0, 1]. In this case, the function f is almost homogeneous, and it is fine to use the same bandwidth h everywhere. This is called the "regular" regime.

2. When p < q < ∞, the bias-maximizing function f is only supported on some interval of length h, and vanishes outside. As a result, if we knew this supporting interval, we could simply set the bandwidth outside this interval to be Θ(1) (otherwise we are accumulating noise but no signal!), and the variance becomes

(E‖s‖_q^q)^{1/q} ≲ (1/√(nh)) · h^{1/q},    (52)


which is smaller than in the case where we use the same bandwidth everywhere. Combining with the bias bound ‖b‖_q ≲ h^{k−1/p+1/q}, the optimal bandwidth will result in the error

sup_{f∈W^{k,p}_1(L)} E_f‖f̂ − f‖_q ≲ n^{−(k−1/p+1/q)/(2(k−1/p)+1)}.    (53)

This is called the “sparse” regime.

The previous insights explain why linear approaches cannot always attain the minimax risk, and show that different bandwidths should be used in different areas. The exact minimax risk is given in the following theorem:

Theorem 8. For any p ∈ [1, ∞], k > 1/p and q ∈ [1, ∞], the minimax risk in estimating f over the Sobolev ball W^{k,p}_1(L) is given by

inf_{f̂} sup_{f∈W^{k,p}_1(L)} E_f‖f̂ − f‖_q ≍ { n^{−k/(2k+1)}                              if q ≤ (2k + 1)p,
                                              (n/log n)^{−(k−1/p+1/q)/(2(k−1/p)+1)}      if (2k + 1)p < q ≤ ∞,    (54)

where the infimum is taken over all possible estimators.

5 Adaptive Bandwidth: Lepski’s Trick

As is shown in the previous section, we need an adaptive bandwidth when:

1. there is spatial inhomogeneity: the function f may behave differently at different points, and we should choose a suitable bandwidth to achieve the bias-variance tradeoff "locally";

2. some parameters, e.g., s, k, p, are unknown: previously our choice of the optimal bandwidth relied on the parameters s, k, p. If these parameters are unknown, we need to select a good bandwidth based on the empirical data we have seen.

Let’s first take a look at what we have already known. For any x ∈ [0, 1], consider the linear estimator

fh in (10) with bandwidth h ≡ h(x) to estimate f(x). Denote by bh(x), sh(x) the corresponding bias andvariance of this estimator with bandwidth h, then we would like to find some optimal bandwidth h∗ ≡ h∗(x)to balance the bias and variance1:

b_{h∗}(x) = s_{h∗}(x).    (55)

The problem is that we do not know anything about the LHS other than the monotonicity of b_h(x) in h, and for the RHS we also only know that |s_h(x)| ≲ √(log n/(nh)) with overwhelming probability. Let's just define

s_h(x) ≜ Θ(√(log n/(nh))).    (56)

^1 Strictly speaking, b_h(x), s_h(x) are not the exact deterministic and stochastic errors, but suitable upper bounds (e.g., (35), (34)) of these quantities.


5.1 Lepski’s Selection Rule

Surprisingly, the monotonicity of the bias suffices to provide the knowledge to choose the bandwidth adaptively. Here is Lepski's selection rule: for any x ∈ [0, 1], define the set^2

A(x) ≜ {h ≥ 0 : |f̂_h(x) − f̂_{h′}(x)| ≤ 4s_{h′}(x), ∀h′ ∈ (0, h)}    (57)

consisting of "admissible" bandwidths, and then choose the bandwidth ĥ(x) to be

ĥ(x) ≜ max A(x).    (58)

Finally, the overall estimator fLep is given by

f̂_Lep(x) = f̂_{ĥ(x)}(x).    (59)

Note that we only add a little bit of "non-linearity" here: we still use a kernel-based estimator at any point, but the bandwidth differs spatially. Moreover, the overall estimator does not require the knowledge of any parameters (except for an upper bound on the smoothness parameter, for we need to fix a weight function (kernel) which keeps all polynomials of degree up to the smoothness parameter).
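A sketch of the selection rule (57)-(58) over a finite bandwidth grid (as footnote 2 requires). Here `estimates[h]` stands for f̂_h(x) and `s_hat[h]` for the bound s_h(x) of (56); both containers are our own illustrative interface.

def lepski_bandwidth(estimates, s_hat):
    """Return h_hat = max A(x): the largest h whose estimate stays within
    4 * s_{h'}(x) of every estimate at a smaller bandwidth h'."""
    grid = sorted(estimates)
    admissible = [h for h in grid
                  if all(abs(estimates[h] - estimates[hp]) <= 4 * s_hat[hp]
                         for hp in grid if hp < h)]
    return max(admissible)  # the smallest bandwidth is admissible vacuously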

5.2 Intuitions of Lepski’s Trick

Let’s explain intuitively why Lepski’s trick will work, i.e., we will have

|f̂_{ĥ}(x) − f(x)| ≲ |f̂_{h∗}(x) − f(x)|.    (60)

The analysis decomposes into two steps:

1. We first show that h∗ ∈ A(x). It suffices to check the definition: when h < h∗, we have

|f̂_{h∗}(x) − f̂_h(x)| ≤ |f̂_{h∗}(x) − f(x)| + |f̂_h(x) − f(x)|    (61)
≤ b_{h∗}(x) + s_{h∗}(x) + b_h(x) + s_h(x)    (62)
≤ s_{h∗}(x) + s_{h∗}(x) + s_h(x) + s_h(x)    (63)
≤ s_h(x) + s_h(x) + s_h(x) + s_h(x)    (64)
= 4s_h(x).    (65)

Hence, h∗ satisfies the condition, and h∗ ∈ A(x).

2. Next we show that ĥ indeed works. Since ĥ = max A(x) and h∗ ∈ A(x), we have ĥ ≥ h∗, and by the definition of A(x) we know that

|f̂_{ĥ}(x) − f(x)| ≤ |f̂_{ĥ}(x) − f̂_{h∗}(x)| + |f̂_{h∗}(x) − f(x)|    (66)
≤ 4s_{h∗}(x) + b_{h∗}(x) + s_{h∗}(x)    (67)
≤ 6s_{h∗}(x)    (68)

as desired.

^2 Strictly speaking, we should require that h′ take values in a fine finite grid of [0, 1] instead of the continuum.


5.3 Performance Analysis of Lepski’s Trick

The performance of Lepski’s estimator fLep is summarized in the following theorem:

Theorem 9. For any p ∈ [1, ∞], k > 1/p and q ∈ [1, ∞], the Lepski estimator f̂_Lep satisfies

sup_{f∈W^{k,p}_1(L)} E_f‖f̂_Lep − f‖_q ≍ { (n/log n)^{−k/(2k+1)}                        if q ≤ (2k + 1)p,
                                           (n/log n)^{−(k−1/p+1/q)/(2(k−1/p)+1)}       if (2k + 1)p < q ≤ ∞.    (69)

Compared with Theorem 8, we see that this estimator is optimal within logarithmic factors for estimating functions over Sobolev balls. In fact, this logarithmic factor is unavoidable if we want our estimator to be adaptive for any smoothness parameters and any not-too-small subset of [0, 1]; it is the price we need to pay for adaptivity.


EE378A Statistical Signal Processing Lecture 15 - 05/23/2017

Lecture 15: Wavelet Shrinkage
Lecturer: Yanjun Han Scribe: Yanjun Han

1 Recap

In the last lecture, we considered the nonparametric function estimation problem in the regression setting. In particular, we showed that the kernel-based estimator with a suitably chosen adaptive bandwidth essentially achieves the minimax risk over Hölder and Sobolev balls. Specifically, the estimator is constructed as follows:

1. Fix some kernel (or weight function) with suitable regularity conditions (e.g., keep polynomials up to a prescribed degree), and use this kernel to construct a linear estimator f̂_h(x) for any point x ∈ [0, 1] and bandwidth h > 0;

2. For any point x ∈ [0, 1], use some rule (e.g., Lepski's trick) to choose an adaptive bandwidth ĥ(x);

3. Finally, for any point x ∈ [0, 1], use f̂_{ĥ(x)}(x) as the estimator for f(x).

We also recall the Lepski’s trick: for any x ∈ [0, 1], we pick up a set consisting of “admissible” bandwidths:

A = {h ∈ [0, 1] : |fh(x)− fh′(x)| ≤ 4sh′(x), ∀h′ ∈ (0, h)} (1)

and then choose h(x) = maxA. We validate this choice via two steps: we show that the optimal bandwidth

h∗ which balances the bias and variance locally belongs to this set, and then we show that h also works byrelating h to the unknown h∗. This idea can also be generalized to high-dimensional cases with caution, andyou may look at Homework 6 for details.

The main insights of the previous approach are two-fold:

1. The bias-variance tradeoff should be understood and analyzed carefully: this step determines the choice of the optimal bandwidth;

2. “A little bit” non-linearity needs to be added to the estimator: we have shown in the previous lecturethat all linear approaches may fail to give the order-optimal risk in certain models, while an “almost”linear estimator with “a little bit” non-linearity can succeed. In the previous example, we almostuse a linear kernel-based estimator, and the only additional non-linearity is to choose the bandwidthdifferently at different points.

In this lecture, we will attack the same problem using a different approach called wavelet shrinkage, where we can see the same phenomena from a different viewpoint.

2 Gaussian White Noise Model and Change of Basis

In the last lecture we looked at the regression problem in Gaussian noise, and also remarked that the same idea can be used in the density estimation setting, where the kernel-based estimator is called the KDE (Kernel Density Estimator). In this lecture we will look at the Gaussian White Noise Model, and remark that it is essentially the same as the previous two models.


2.1 Equivalences between Models

We call two statistical models equivalent if they can almost simulate each other. An equivalent formulation is that, for any bounded loss function and any objective to be estimated, if there is an estimator f̂_1(x) for one model, we can construct another estimator f̂_2(y) for the other model whose estimation performance (i.e., the risk) is almost no worse than that of f̂_1(x) under any true parameter θ ∈ Θ, and vice versa. Intuitively, if two models are equivalent, for estimation purposes it suffices to look at any one of them. For a rigorous treatment, we refer interested readers to the Le Cam distance.^1

We will consider four different models in the context of nonparametric estimation:

1. Regression Model: we observe n samples {y_i}_{i=1}^n with y_i = f(x_i) + σξ_i for 1 ≤ i ≤ n, where x_i = i/n and ξ_i ∼ N(0, 1) i.i.d.;

2. Gaussian White Noise Model: we observe a process (Y_t)_{t∈[0,1]} with

Y_t = ∫_0^t f(s) ds + (σ/√n) B_t,   t ∈ [0, 1]    (2)

where (B_t)_{t∈[0,1]} is the standard Brownian motion on [0, 1]. This model can also be written in the stochastic differential equation (SDE) form as dY_t = f(t) dt + (σ/√n) dB_t;

3. Density Estimation Model: we observe n i.i.d. samples y_1, · · · , y_n ∼ g(·), where the density g is supported on [0, 1];

4. Poisson Process Model: we observe a Poisson process (Y_t)_{t∈[0,1]} with a time-varying intensity ng(·), where the density function g is supported on [0, 1].

In each model, there is some unknown function/density treated as the unknown parameter, and we observe discrete samples or a stochastic process. Suppose that the parameter space is f ∈ F, where F is some function class possessing a certain order of smoothness. The main result is that these four models are equivalent:

Theorem 1. ²³ Under certain technical conditions, these four models are asymptotically equivalent as n → ∞, with g = f², σ = 1/2 when referring to the last two models.

2.2 Change of Basis

Due to the model equivalence, in this lecture we consider the Gaussian white noise model

dY_t = f(t)dt + (σ/√n) dB_t, t ∈ [0, 1]   (3)

and our target is to find an estimator f̂ which comes close to minimizing the worst-case squared-error risk

sup_{f∈F} E_f ‖f̂ − f‖₂².   (4)

Now we take a look at how the model and the loss function behave after we transform the problem into a different domain.

¹ See, e.g., F. Liese and K. J. Miescke, Statistical Decision Theory: Estimation, Testing and Selection. Springer, 2008.
² L. D. Brown and M. G. Low. Asymptotic equivalence of nonparametric regression and white noise. The Annals of Statistics, 24(6), pp. 2384–2398, 1996.
³ L. D. Brown, A. V. Carter, M. G. Low, and C. H. Zhang. Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift. The Annals of Statistics, 32(5), pp. 2074–2097, 2004.


Let (φ_j)_{j=1}^∞ be an orthonormal basis of L²[0, 1]; then we can represent the function f via its coefficients θ = (θ_j)_{j=1}^∞, where

θ_j ≜ ∫₀¹ φ_j(t)f(t)dt.   (5)

Note that the restriction f ∈ F will be transformed into the condition θ ∈ Θ for some proper parameter set Θ. Also, for any estimator f̂, we can represent it by

θ̂_j ≜ ∫₀¹ φ_j(t)f̂(t)dt.   (6)

As for the observation, we may define

y_j ≜ ∫₀¹ φ_j(t)dY_t   (7)
    = ∫₀¹ φ_j(t)(f(t)dt + (σ/√n) dB_t)   (8)
    = ∫₀¹ φ_j(t)f(t)dt + (σ/√n) ∫₀¹ φ_j(t)dB_t   (9)
    ≡ θ_j + ε · ξ_j   (10)

where ε ≜ σ/√n denotes the noise level, and

ξ_j ≜ ∫₀¹ φ_j(t)dB_t.   (11)

Exercise 2. Use the orthonormality of (φ_j)_{j=1}^∞ to conclude that ξ_j i.i.d.∼ N(0, 1).

By the previous exercise, we know that the Gaussian white noise model reduces to the following Gaussian sequence model under the basis (φ_j)_{j=1}^∞:

y_j = θ_j + εξ_j,  θ = (θ₁, θ₂, · · · ) ∈ Θ,  ε = σ/√n,  ξ_j i.i.d.∼ N(0, 1).   (12)

Also, since ‖θ̂ − θ‖₂ = ‖f̂ − f‖₂ by Parseval's identity, our target becomes to find some θ̂ = (θ̂₁, θ̂₂, · · · ) which comes close to minimizing

sup_{θ∈Θ} E_θ‖θ̂ − θ‖₂² = sup_{θ∈Θ} ∑_{j=1}^∞ E_θ(θ̂_j − θ_j)².   (13)

As a result, under the basis change, we operate in a Gaussian sequence model and would like to estimate the mean vector simultaneously.

2.3 Projection Estimator and Bias-Variance Tradeoff

First let's see how the bias-variance tradeoff looks after the change of basis. Since the observation itself is a natural estimator for the mean in the Gaussian model, a natural candidate to solve the Gaussian sequence model (12) is the projection estimator:

θ̂_j = { y_j, if 1 ≤ j ≤ m,
        0,   if j > m.   (14)


It's easy to see that

E_θ(θ̂_j − θ_j)² = { ε²,   if 1 ≤ j ≤ m,
                    θ_j², if j > m.   (15)

As a result, we have

E_θ‖θ̂ − θ‖₂² = mε² + ∑_{j>m} θ_j².   (16)

Now we have the bias-variance tradeoff in the Gaussian sequence model:

1. Bias: the second term ∑_{j>m} θ_j², which originates from throwing away the data in the tail. Note that when m increases, the bias decreases;

2. Variance: the first term mε², which originates from the noise in the observation sequence (y_j)_{j=1}^∞. Note that when m increases, the variance increases.

We can think of the threshold m as the reciprocal of the bandwidth h in the kernel-based method: as we know from Fourier analysis, the bandwidth in the time domain and that in the frequency domain satisfy the uncertainty principle. Also, the behavior of the bias/variance in m corresponds to that of the bias/variance in h in the time domain. As a result, in the transformed domain, the bias-variance tradeoff is governed by where we truncate the observed sequence.
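A small numerical illustration of this tradeoff (the coefficient decay θ_j = j^{−1} is an arbitrary choice for the sketch, not from the notes):

```python
import numpy as np

# Risk of the projection estimator (16): m*eps^2 + sum_{j>m} theta_j^2,
# for an illustrative coefficient sequence theta_j = 1/j.
eps = 0.05                                   # noise level sigma/sqrt(n)
J = 100_000
theta = 1.0 / np.arange(1, J + 1)
tail = np.cumsum((theta ** 2)[::-1])[::-1]   # tail[m] = sum_{j>m} theta_j^2
ms = np.arange(1, J)
risk = ms * eps ** 2 + tail[ms]              # variance + squared bias
m_opt = ms[np.argmin(risk)]
print(m_opt, risk.min())                     # larger m: less bias, more variance
```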

3 Besov Ball and Wavelet Basis

Before introducing how to add the non-linearity in the transformed domain, we first specify the choice of the function class F and the orthonormal basis (φ_j)_{j=1}^∞. Specifically, we will choose F to be the Besov ball B^s_{p,q}(L), and (φ_j)_{j=1}^∞ to be the wavelet basis. To avoid technicality, we will treat them informally and only talk about the insights behind these concepts, and refer interested readers to the following reference:

• W. Härdle, G. Kerkyacharian, D. Picard, and A. Tsybakov, Wavelets, Approximation, and Statistical Applications. Springer Science & Business Media, Vol. 129, 2012.

3.1 Introduction to Besov Ball

Like the Hölder and Sobolev balls, the Besov ball is another ball which characterizes the smoothness of a function, in a more delicate and complicated way. Specifically, for any s > 0 and p, q ∈ [1,∞], we may define a norm ‖ · ‖_{B^s_{p,q}} which is (informally) close to

‖f‖_{B^s_{p,q}} ≈ ‖f^{(s)}‖_p   (17)

where f^{(s)} is the order-s derivative of f. Note that s may not be an integer, but in the above definition we have a proper definition of a fractional derivative, and the parameter q also affects the definition of the derivative slightly. Intuitively, ‖ · ‖_{B^s_{p,q}} is another norm which characterizes order-s smoothness.

Naturally, the Besov ball B^s_{p,q}(L) is defined by

B^s_{p,q}(L) ≜ {f : ‖f‖_{B^s_{p,q}} ≤ L}.   (18)


3.2 Introduction to Wavelet Basis

The wavelet basis is an orthonormal basis which exploits the idea of multi-resolution analysis: any function is viewed at multiple resolutions. Specifically:

1. There is a father wavelet φ(x) and a mother wavelet ψ(x) on [0, 1];

2. At level j and location k, we define

φ_{jk}(x) ≜ 2^{j/2} φ(2^j x − k),   (19)
ψ_{jk}(x) ≜ 2^{j/2} ψ(2^j x − k),   (20)

with j ∈ N, 0 ≤ k ≤ 2^j − 1.

Note that supp(φ_{jk}) = supp(ψ_{jk}) = [k/2^j, (k+1)/2^j]; the level j characterizes the resolution 2^{−j}, and the parameter k characterizes the spatial location to look at.

Example 3. The Haar wavelet is defined by φ(x) = 1(x ∈ [0, 1]), ψ(x) = 1(x ∈ [0, 1/2]) − 1(x ∈ [1/2, 1]). This is the first wavelet basis, proposed in 1909.
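A minimal sketch of the dilated and translated Haar functions (19)–(20), together with a Riemann-sum sanity check of orthonormality; the grid size is an arbitrary choice.

```python
import numpy as np

def haar_father(x):
    return np.where((x >= 0) & (x < 1), 1.0, 0.0)

def haar_mother(x):
    return (np.where((x >= 0) & (x < 0.5), 1.0, 0.0)
            - np.where((x >= 0.5) & (x < 1), 1.0, 0.0))

def phi_jk(x, j, k):
    # dilated/translated father wavelet: 2^{j/2} phi(2^j x - k)
    return 2 ** (j / 2) * haar_father(2 ** j * x - k)

def psi_jk(x, j, k):
    # dilated/translated mother wavelet: 2^{j/2} psi(2^j x - k)
    return 2 ** (j / 2) * haar_mother(2 ** j * x - k)

# sanity check of orthonormality on a fine grid (Riemann sums)
x = np.linspace(0, 1, 2 ** 16, endpoint=False)
dx = x[1] - x[0]
print(np.sum(psi_jk(x, 2, 1) ** 2) * dx)               # ~ 1  (unit norm)
print(np.sum(psi_jk(x, 2, 1) * psi_jk(x, 3, 5)) * dx)  # ~ 0  (orthogonal)
```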

Not all functions can serve as the father and mother wavelets. A crucial property (besides the orthonormality) for wavelets is that, defining

V_j ≜ span{φ_{jk}(x), 0 ≤ k ≤ 2^j − 1}   (21)
W_j ≜ span{ψ_{jk}(x), 0 ≤ k ≤ 2^j − 1}   (22)

then for any j₀ ∈ N we have

L²[0, 1] = V_{j₀} ⊕ W_{j₀} ⊕ W_{j₀+1} ⊕ · · · .   (23)

As a result, any f ∈ L²[0, 1] can be written as

f(x) = ∑_{k=0}^{2^{j₀}−1} α_{j₀k} φ_{j₀k}(x) + ∑_{j=j₀}^∞ ∑_{k=0}^{2^j−1} β_{jk} ψ_{jk}(x)   (24)

for some coefficients (α_{j₀k}), (β_{jk}). The first term corresponds to the "gross information", i.e., information in the average sense, e.g., the average magnitude in a small interval. The second term corresponds to the "detail information" at level j, i.e., information related to the local change, e.g., how the function oscillates in a small interval. We can view the detail at different scales/resolutions, characterized by the different levels j = j₀, j₀ + 1, · · · .

The reason why we introduce the wavelet basis is that it is the right basis for the Besov ball, as shown in the following theorem.

Theorem 4. Under certain regularity conditions on the wavelet basis, the Besov norm ‖ · ‖_{B^s_{p,q}} of a function is equivalent to the norm ‖ · ‖_{b^s_{p,q}} of its wavelet coefficients, where

‖f‖_{b^s_{p,q}} ≜ (∑_{k=0}^{2^{j₀}−1} |α_{j₀k}|^p)^{1/p} + (∑_{j=j₀}^∞ [2^{j(s+1/2−1/p)} (∑_{k=0}^{2^j−1} |β_{jk}|^p)^{1/p}]^q)^{1/q}.   (25)

We can also write it in the more compact form ‖f‖_{b^s_{p,q}} = ‖α_{j₀}‖_p + ‖(2^{j(s+1/2−1/p)}‖β_j‖_p)_{j≥j₀}‖_q, where the ℓ_p norm is taken with respect to k, and the ℓ_q norm is taken with respect to j.


Recall that we call two norms (X, ‖ · ‖₁), (X, ‖ · ‖₂) equivalent if there exist universal constants c₁, c₂ > 0 such that c₁‖f‖₂ ≤ ‖f‖₁ ≤ c₂‖f‖₂ for any f ∈ X. Then the following corollary is immediate:

Corollary 5. Defining

Θ^s_{p,q} ≜ {θ = ((α_{j₀k}), (β_{jk})) : ‖θ‖_{b^s_{p,q}} ≤ 1},   (26)

there exist constants c₁, c₂ > 0 such that

c₁Θ^s_{p,q} ⊂ B^s_{p,q}(L) ⊂ c₂Θ^s_{p,q}.   (27)

We remark that although the form of Θ^s_{p,q} is still quite complicated, we will make use of some crucial properties of Θ^s_{p,q} to propose sound estimators and validate the fact that the wavelet basis is the right basis for the Besov space.

4 Thresholding and VisuShrink Estimator

In this section we present the thresholding idea to add the correct non-linearity, which motivates the VisuShrink estimator. The reference for this section and the following ones is:

• I. Johnstone, Gaussian Estimation: Sequence and Wavelet Models. Online manuscript (http://statweb.stanford.edu/~imj/GE09-08-15.pdf), September 2015.

4.1 Ideal Truncated Estimator

Before we look into the Gaussian sequence estimation problem, we first gain some insights from Gaussian mean estimation in the scalar case. Consider estimating the mean θ ∈ R in the following scalar model:

y = θ + εξ,  ξ ∼ N(0, 1)   (28)

where ε > 0 is known, and the only assumption we impose on θ is that |θ| ≤ τ for some known τ. The target is to find some estimator θ̂ such that the mean squared error E_θ(θ̂ − θ)² is small. The following insights are straightforward:

1. When τ is large, there is essentially no restriction on the parameter θ, and it is expected that the observation itself, θ̂ = y, should be a near-optimal estimator. In fact, if τ = ∞, the natural estimator θ̂ = y is the Uniformly Minimum Variance Unbiased Estimator (UMVUE) and also minimax.

2. When τ is small (e.g., much smaller than ε), the signal θ is almost completely obscured by the noise. In this case, θ̂ = 0 is expected to be a good estimate, since E(θ̂ − θ)² = θ² ≤ τ² is really small.

Based on these insights, we expect that either θ̂ = y or θ̂ = 0 can do a good job. As a result, we introduce the following concept of the ideal truncated estimator: suppose that there is a genie who knows the true parameter θ but is restricted to use θ̂ = y or θ̂ = 0. It is easy to see that the optimal estimator is

θ̂_ITE = y · 1(|θ| ≥ ε)   (29)

whose mean squared error is given by

E_θ(θ̂_ITE − θ)² = min{θ², ε²}.   (30)

The following theorem shows that the ideal truncated estimator essentially attains the minimax risk for any τ ≥ 0:


Theorem 6. ⁴ For any τ ≥ 0, the ideal truncated estimator attains the minimax risk over |θ| ≤ τ within a multiplicative factor of 2.22:

min{τ², ε²} = sup_{|θ|≤τ} E_θ(θ̂_ITE − θ)² ≤ 2.22 · inf_{θ̂} sup_{|θ|≤τ} E_θ(θ̂ − θ)².   (31)

It's straightforward to generalize this result to the sequence case. Consider the Gaussian sequence model (12) with θ ∈ R(τ), where τ = (τ₁, τ₂, · · · ) is a non-negative sequence and R(τ) is an orthosymmetric rectangle with one vertex τ:

R(τ) ≜ {θ = (θ₁, θ₂, · · · ) : |θ_i| ≤ τ_i, ∀i = 1, 2, · · · }.   (32)

The following corollary is immediate:

Corollary 7. For any non-negative vector τ = (τ₁, τ₂, · · · ), for the Gaussian sequence model (12) we have

∑_{i=1}^∞ min{τ_i², ε²} = sup_{θ∈R(τ)} E_θ‖θ̂_ITE − θ‖₂² ≤ 2.22 · inf_{θ̂} sup_{θ∈R(τ)} E_θ‖θ̂ − θ‖₂².   (33)

4.2 Soft- and Hard-Thresholding Estimator

Note that the ideal truncated estimator is not actually an estimator, since it requires the unavailable information θ. However, the previous corollary motivates us to find a genuine estimator θ̂ whose performance is comparable to that of θ̂_ITE, which has been shown to be nearly minimax. Specifically, we want to find an estimator θ̂ = (θ̂₁, · · · , θ̂_m) such that in the Gaussian sequence model (12) of length m, the inequality

E_θ‖θ̂ − θ‖₂² ≤ something × (ε² + ∑_{i=1}^m min{θ_i², ε²})   (34)

holds for any θ ∈ R^m. Note that for minor technical reasons we need the additional ε² term on the RHS.

Now we take a careful look at our requirement. As a sanity check, the RHS is really small when θ = 0, which forces the LHS to be small as well. In other words, when the true parameter θ is the zero vector, our estimator θ̂ must be close to the zero vector as well. Prove the following result:

Exercise 8. For X₁, · · · , X_m i.i.d.∼ N(0, ε²), we have P(max_{1≤i≤m} |X_i| ≥ ε√(2 log m)) → 0 as m → ∞.

Using this exercise, a natural constraint on the estimator θ̂ is that θ̂ = 0 whenever max_{1≤i≤m} |y_i| ≤ ε√(2 log m). This observation motivates us to do some type of thresholding; specifically, we can define the soft-thresholding and hard-thresholding functions as follows:

η^s_t(y) = sign(y) · (|y| − t)₊   (35)
η^h_t(y) = y · 1(|y| ≥ t).   (36)

Note that these thresholding functions are close to truncation: when |y| is small the functions return zero, and when |y| is large the functions return something close to y.
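A minimal sketch of the two thresholding functions (35)–(36):

```python
import numpy as np

def soft_threshold(y, t):
    # eta^s_t(y) = sign(y) * (|y| - t)_+   -- shrinks everything toward 0
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def hard_threshold(y, t):
    # eta^h_t(y) = y * 1(|y| >= t)         -- keep-or-kill, no shrinkage
    return y * (np.abs(y) >= t)

y = np.array([-3.0, -0.5, 0.2, 1.5])
print(soft_threshold(y, 1.0))   # [-2.  -0.   0.   0.5]
print(hard_threshold(y, 1.0))   # [-3.  -0.   0.   1.5]
```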

Acting coordinatewise, we may also define the soft-thresholding estimator θ̂^s_t(y) = (η^s_t(y₁), · · · , η^s_t(y_m)) and the hard-thresholding estimator θ̂^h_t(y) = (η^h_t(y₁), · · · , η^h_t(y_m)), and the previous analysis motivates us to choose the threshold t to be roughly ε√(2 log m). The following theorem shows that the oracle inequality (34) holds for thresholding estimators:

⁴ D. L. Donoho, R. C. Liu, and B. MacGibbon. Minimax risk over hyperrectangles, and implications. The Annals of Statistics, pp. 1416–1437, 1990.


Theorem 9. ⁵ For t = ε√(2 log m), the soft-thresholding estimator θ̂^s_t(y) satisfies the following oracle inequality in the Gaussian sequence model (12):

E_θ‖θ̂^s_t(y) − θ‖₂² ≤ (2 log m + 1) · (ε² + ∑_{i=1}^m min{θ_i², ε²}).   (37)

The same result holds for the hard-thresholding estimator θ̂^h_t(y) with t = ε√(2 log m + log log m).

4.3 VisuShrink Estimator

Now we are about to describe the VisuShrink estimator for the Gaussian sequence model (12) with Θ = Θ^s_{p,q}. The parameters in this model are θ = ((α_{j₀k}), (β_{jk}) : j ≥ j₀, 0 ≤ k ≤ 2^j − 1), and we rewrite this vector as θ = (θ₁, θ₂, · · · ). The VisuShrink estimator does the following:

θ̂^VISU_i = { η^s_t(y_i) if i ≤ m,
             0          if i > m   (38)

where m is a parameter to be chosen later, and the threshold t is chosen to be t = ε√(2 log m). Note that compared with the projection estimator, the only difference is that when i ≤ m we replace the raw data y_i by its soft-thresholding η^s_t(y_i). Similarly, we can do hard-thresholding as well, with t = ε√(2 log m + log log m) according to Theorem 9.

Now we analyze the performance of the VisuShrink estimator. According to Theorem 9, we know that

E_θ‖θ̂^VISU − θ‖₂² ≤ (2 log m + 1)(ε² + ∑_{i=1}^m min{θ_i², ε²}) + ∑_{i>m} |θ_i|²   (39)
    ≤ (2 log m + 1)(ε² + ∑_{i=1}^∞ min{θ_i², ε²}) + ∑_{i>m} |θ_i|².   (40)

By the definition of Θ^s_{p,q}, choosing m = ε^{−A} for some large enough constant A > 0 yields the tail bound sup_{θ∈Θ^s_{p,q}} ∑_{i>m} |θ_i|² ≪ ε². As a result, for this choice of m, we have

E_θ‖θ̂^VISU − θ‖₂² ≲ log(1/ε) · ∑_{i=1}^∞ min{θ_i², ε²} + ε² log(1/ε).   (41)

Now we come to the crucial property of the parameter set Θ^s_{p,q}: it is solid and orthosymmetric. Equivalently, this means that for any θ ∈ Θ^s_{p,q}, the orthosymmetric hyperrectangle R(|θ|) with vertex |θ| is contained in the parameter set Θ^s_{p,q}. As a result, by Corollary 7 we know that

E_θ‖θ̂^VISU − θ‖₂² ≲ log(1/ε) · ∑_{i=1}^∞ min{θ_i², ε²} + ε² log(1/ε)   (42)
    ≤ 2.22 log(1/ε) · inf_{θ̂} sup_{θ′∈R(|θ|)} E_{θ′}‖θ̂ − θ′‖₂² + ε² log(1/ε)   (43)
    ≤ 2.22 log(1/ε) · inf_{θ̂} sup_{θ′∈Θ^s_{p,q}} E_{θ′}‖θ̂ − θ′‖₂² + ε² log(1/ε)   (44)

where in the last inequality we have used the property R(|θ|) ⊂ Θ^s_{p,q}. As a result, the VisuShrink estimator attains the minimax risk within a logarithmic factor.

attains the minimax risk within a logarithmic factor.Finally, we note the relationship between Θs

p,q and Bsp,q(L), and summarize the VisuShrink estimator asfollows:

⁵ D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, pp. 425–455, 1994.


1. Fix some initial level j₀ (of constant order) and termination level j_ε ≍ log(1/ε);

2. Transform the observation process (Y_t)_{t∈[0,1]} to the wavelet domain with initial level j₀, and obtain the corresponding empirical α-coefficients and β-coefficients;

3. Use the following procedure to obtain new coefficient estimates:

(a) For the α-coefficients (which are only at level j₀), keep them all;

(b) For the β-coefficients, discard them all for j > j_ε, and for j₀ ≤ j ≤ j_ε apply the thresholding estimator (either soft or hard) with the suitable threshold given in Theorem 9 to the observation vector.

4. Transform the estimated wavelet coefficients back to the function space, and obtain f̂^VISU.

The property of the VisuShrink estimator f̂^VISU is summarized in the following theorem:

Theorem 10. ⁶ For any s > 0, p, q ∈ [1,∞], the VisuShrink estimator f̂^VISU attains the minimax risk over Besov balls B^s_{p,q}(L) within logarithmic factors:

sup_{f∈B^s_{p,q}(L)} E_f‖f̂^VISU − f‖₂² ≲ log(1/ε) · inf_{f̂} sup_{f∈B^s_{p,q}(L)} E_f‖f̂ − f‖₂².   (45)
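Putting the pieces together, here is a hedged end-to-end sketch of the VisuShrink recipe using the discrete Haar transform on an equispaced toy signal; the signal, the per-coordinate noise level (playing the role of ε), and the use of all available detail levels (at this finite sample size nothing sits above j_ε) are our own simplifying assumptions.

```python
import numpy as np

def dwt_haar(v):
    """Orthonormal discrete Haar transform of v (length 2^J).
    Returns the single coarsest scaling coefficient (the kept alpha-level)
    and detail-coefficient arrays ordered coarse to fine."""
    v = v.copy(); betas = []
    while len(v) > 1:
        a = (v[0::2] + v[1::2]) / np.sqrt(2)   # averages -> coarser level
        b = (v[0::2] - v[1::2]) / np.sqrt(2)   # differences -> details
        betas.append(b); v = a
    return v, betas[::-1]

def idwt_haar(alpha, betas):
    v = alpha.copy()
    for b in betas:
        a = np.empty(2 * len(v))
        a[0::2] = (v + b) / np.sqrt(2)
        a[1::2] = (v - b) / np.sqrt(2)
        v = a
    return v

def visushrink(y, eps):
    # keep the alpha coefficient; soft-threshold all detail coefficients
    # at the universal threshold t = eps * sqrt(2 log m)  (Theorem 9)
    alpha, betas = dwt_haar(y)
    m = sum(len(b) for b in betas)
    t = eps * np.sqrt(2 * np.log(m))
    betas = [np.sign(b) * np.maximum(np.abs(b) - t, 0.0) for b in betas]
    return idwt_haar(alpha, betas)

rng = np.random.default_rng(1)
n = 2 ** 10
x = np.arange(n) / n
f = np.where(x < 0.5, 1.0, -1.0) + x     # toy piecewise signal (illustrative)
eps = 0.5                                 # per-coordinate noise level
y = f + eps * rng.standard_normal(n)
f_hat = visushrink(y, eps)
print(np.mean((y - f) ** 2), np.mean((f_hat - f) ** 2))  # denoised MSE shrinks
```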

4.4 Discussions

We now make a few remarks on the VisuShrink estimator.

First of all, the VisuShrink estimator is almost a linear estimator (i.e., close to the projection estimator), while there is also a little bit of non-linearity, i.e., the thresholding idea. We can think of the thresholding approach as a selector which chooses the coefficients to keep in a data-dependent manner: when the empirical coefficient is large, we expect it to be useful signal and keep it; when the empirical coefficient is small, we expect it to be noise and discard it. We can compare this idea with Lepski's trick in the sparse regime we defined in the last lecture: there, the signal is supported on a small interval and everything else is noise. In that case, Lepski's trick selects a large bandwidth in the noise regime to essentially average out all the noise, and the VisuShrink estimator simply selects the peak of the transformed signal in the wavelet domain and discards everything else.

Secondly, the VisuShrink estimator employs the shrinkage idea, which in statistics means reducing the variance significantly at the price of a small increase in the bias. This is where the term "shrink" in the name "VisuShrink" comes from. Specifically, compared with the projection estimator which keeps the raw observations, the thresholding idea incurs a larger bias (note that the projection estimator is indeed unbiased on the kept coordinates!), while reducing the variance significantly (e.g., from ε² to min{ε², θ_i²} per symbol). We remark that the variance is the main issue in nonparametric function estimation, while we will see in the next lecture that in functional estimation problems, the bias becomes dominant!

Thirdly, by inspecting the proof we find that we do not use any specific properties of Θ^s_{p,q} (which is of a complicated form) other than that it is solid and orthosymmetric. Also, we proved that the VisuShrink estimator is nearly minimax without even figuring out what the minimax risk is. These observations indicate that the geometry of the parameter set is really important, and the reason why we choose the wavelet basis for the Besov ball is that the Besov ball becomes solid and orthosymmetric in the wavelet domain! In other words, for any orthonormal basis (φ_i)_{i∈I}, the VisuShrink idea still works as long as the associated parameter set Θ is solid and orthosymmetric in the transformed space. In fact, this property requires that the basis be an unconditional basis:

⁶ D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, pp. 425–455, 1994.


Definition 11 (Unconditional Basis). An orthonormal basis (φ_i)_{i∈I} is an unconditional basis of the real normed vector space (X, ‖ · ‖) if and only if there exists a universal constant C > 0 such that

‖∑_{i∈J} ε_i φ_i‖ ≤ C ‖∑_{i∈J} φ_i‖   (46)

holds for any finite J ⊂ I and any signs ε_i ∈ {−1, +1}.

The main messages are that:

1. The unconditional basis is the optimal basis in nonparametric function estimation;

2. The wavelet basis is an unconditional basis for the Besov norm (L²[0, 1], ‖ · ‖_{B^s_{p,q}}) (and the Fourier basis is not).

Finally, we remark that by construction, the VisuShrink estimator does not require knowledge of the parameters s, p, q, L and is thus an adaptive estimator. Similar to Lepski's estimator, the only knowledge VisuShrink requires is an upper bound on the smoothness parameter s, since the termination level j_ε depends on this upper bound.

5 Thresholding and SureShrink Estimator

In the previous section we validated the thresholding idea by proving an oracle inequality (Theorem 9), and used the geometry of the parameter set Θ^s_{p,q} to relate the risk of the ideal truncated estimator to the minimax risk. In this section we will validate the thresholding idea from a different viewpoint, and introduce the resulting SureShrink estimator.

5.1 Gaussian Mean Estimation over ℓ_p Balls with ℓ_q Error

Consider the Gaussian sequence estimation problem (12) with a simpler parameter set: Θ = {θ : ‖θ‖_p ≤ R} is the ℓ_p ball. Also, instead of the mean squared error loss, we consider the general ℓ_q loss, where p ∈ (0,∞], q ∈ [1,∞). The question is: for this simple example, which estimator is nearly minimax under the different parameter configurations (p, q, R, ε)?

We first begin with some insights. When R is large (or equivalently ε is small), the constraint on θ is quite loose, and thus we should use an estimator close to the natural one (i.e., the empirical observation). When R is small (ε is large), the vector is close to zero and we may directly apply the zero estimator. When p ∈ (0,∞] is small, the parameter θ is quite sparse, and thus the resulting estimator should have many zero entries. In contrast, when p is large, the parameter θ can be quite dense, and the natural estimator is expected to work.

As a result, if we treat the natural estimator θ̂ = y as the thresholding estimator θ̂ = η^s_t(y) with threshold t = 0, and the zero estimator θ̂ = 0 as the thresholding estimator θ̂ = η^s_t(y) with threshold t = ∞, it seems that the thresholding estimator with some suitable threshold should work. This intuition turns out to be correct, as shown in the following theorem:

Theorem 12. ⁷ For most parameter configurations (p, q, R, ε), there exists a universal constant C_{p,q} > 0 such that for the Gaussian sequence model (12) with Θ = {θ : ‖θ‖_p ≤ R},

inf_t sup_{θ∈Θ} E_θ‖η^s_t(y) − θ‖_q^q ≤ C_{p,q} · inf_{θ̂} sup_{θ∈Θ} E_θ‖θ̂ − θ‖_q^q.   (47)

The same result also holds for the hard-thresholding estimator η^h_t(y).

⁷ D. L. Donoho and I. M. Johnstone. Minimax risk over ℓ_p-balls for ℓ_q-error. Probability Theory and Related Fields, 99(2), pp. 277–303, 1994.


The implication of Theorem 12 for our case is that although the parameter space Θ^s_{p,q} is of a very complicated form, it only involves combinations of ℓ_p and ℓ_q norms! Hence, we expect that the thresholding idea also works over Θ^s_{p,q}. Specifically, we consider the following estimator:

1. Transform the observation to wavelet coefficients starting from the initial level j₀ (of constant order);

2. Keep all father wavelet coefficients, and for each level j, apply the (soft- or hard-)thresholding estimator to the mother wavelet coefficients with threshold t_j;

3. Transform the coefficients back into functions to yield f̂.

We write the resulting estimator as f̂_t, where t = (t_{j₀}, t_{j₀+1}, · · · ) is the threshold sequence. Based on the previous insights and with the help of Theorem 12, the following result holds:

Theorem 13. ⁸ For nonparametric function estimation over Besov balls, the thresholding estimator with appropriate thresholds attains the minimax risk within a multiplicative factor:

inf_{t=(t_{j₀},t_{j₀+1},··· )} sup_{f∈B^s_{p,q}(L)} E_f‖f̂_t − f‖₂² ≲ inf_{f̂} sup_{f∈B^s_{p,q}(L)} E_f‖f̂ − f‖₂².   (48)

5.2 SURE (Stein’s Unbiased Risk Estimate)

The previous theorem ensures that some thresholding estimator works, but does not specify which threshold we should choose. A naïve thought is that, if we could compare the performances of f̂_t with different t's, then we should choose the one with the minimum error:

t* = arg min_t E_f‖f̂_t − f‖₂².   (49)

However, this approach is infeasible since we do not know the true function f. Despite this difficulty, the good news is that we can still apply this idea, using an unbiased estimator of E_f‖f̂_t − f‖₂² that does not require knowing f.

Now we start to illustrate the idea. Consider the Gaussian sequence model in (12) with length m, and fix any estimator θ̂(y) of θ. Note that g(y) ≜ θ̂(y) − y only depends on y but not on θ, and

E_θ‖θ̂ − θ‖₂² = E_θ‖g(y) + y − θ‖₂²   (50)
    = E_θ‖g(y)‖₂² + E_θ‖y − θ‖₂² + 2E_θ[(y − θ)ᵀ g(y)].   (51)

Exercise 14. Prove Stein's identity: for X ∼ N(µ, σ²) and any (weakly) differentiable function f, we have E[(X − µ)f(X)] = σ²E[f′(X)].

By Stein's identity, we further have

E_θ‖θ̂ − θ‖₂² = E_θ‖g(y)‖₂² + E_θ‖y − θ‖₂² + 2E_θ[(y − θ)ᵀ g(y)]   (52)
    = E_θ‖g(y)‖₂² + mε² + 2ε²E_θ[∇ · g(y)]   (53)
    = E_θ[(m + 2∇ · g(y))ε² + ‖g(y)‖₂²].   (54)

As a result, we have the following definition of Stein's Unbiased Risk Estimate:

Definition 15 (SURE). For the Gaussian sequence model in (12) with length m, and for any estimator θ̂(y) with a weakly differentiable g(y) ≜ θ̂(y) − y, Stein's Unbiased Risk Estimate (SURE) is defined by

r_SURE(y) ≜ (m + 2∇ · g(y))ε² + ‖g(y)‖₂².   (55)

The SURE satisfies E_θ[r_SURE(y)] = E_θ‖θ̂ − θ‖₂² for any θ ∈ Θ.
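A minimal sketch of (55) specialized to the soft-thresholding estimator: there g(y)_i = −y_i when |y_i| ≤ t and −t·sign(y_i) otherwise, so ∇ · g(y) = −#{i : |y_i| ≤ t} and ‖g(y)‖₂² = ∑_i min(y_i², t²). Minimizing SURE over t (it suffices to scan the grid {0} ∪ {|y_i|}) anticipates the SureShrink selection rule of the next subsection; the sparse mean vector is an arbitrary illustration.

```python
import numpy as np

def sure_soft(y, t, eps):
    """SURE (55) for soft thresholding at threshold t:
    (m - 2*#{|y_i|<=t}) * eps^2 + sum_i min(y_i^2, t^2)."""
    m = len(y)
    small = np.abs(y) <= t
    return (m - 2 * np.sum(small)) * eps ** 2 + np.sum(np.minimum(np.abs(y), t) ** 2)

def sure_threshold(y, eps):
    # the minimizer lies on the grid {0} U {|y_i|}; scan it
    candidates = np.concatenate(([0.0], np.abs(y)))
    risks = [sure_soft(y, t, eps) for t in candidates]
    return candidates[int(np.argmin(risks))]

rng = np.random.default_rng(2)
eps, m = 1.0, 1000
theta = np.zeros(m); theta[:20] = 5.0          # sparse mean vector (illustrative)
y = theta + eps * rng.standard_normal(m)
t = sure_threshold(y, eps)
theta_hat = np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
print(t, np.sum((theta_hat - theta) ** 2), np.sum((y - theta) ** 2))
```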

⁸ D. L. Donoho and I. M. Johnstone. Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26(3), pp. 879–921, 1998.


5.3 The SureShrink Estimator

With the SURE in hand, we may introduce the SureShrink estimator, which chooses the threshold sequence t = (t_j)_{j≥j₀} based on the risk estimate. Specifically, on each level j ≥ j₀, we randomly divide the index set {0, 1, · · · , 2^j − 1} into two halves I, I′, and for soft-thresholding we define

t̂_I ≜ arg min_{t≥0} ∑_{k∈I′} [(1 − 2 · 1(|y_{j,k}| ≤ t))ε² + (min{t, |y_{j,k}|})²]   (56)
t̂_{I′} ≜ arg min_{t≥0} ∑_{k∈I} [(1 − 2 · 1(|y_{j,k}| ≤ t))ε² + (min{t, |y_{j,k}|})²]   (57)

where (y_{j,k})_{0≤k≤2^j−1} are the empirical wavelet coefficients on level j. Then we apply the soft-thresholding estimator with threshold t̂_I to all y_{j,k} for k ∈ I, and with threshold t̂_{I′} to all y_{j,k} for k ∈ I′. The same idea can also be applied to the hard-thresholding estimator. We remark that the random sample-splitting approach is purely for technical purposes (to gain independence), and is not necessary in practice. The SureShrink estimator is defined to be f̂_t̂ with the thresholds given by the previous equations.

The theoretical performance of the SureShrink estimator is summarized in the following theorem.

Theorem 16. ⁹ For s > 1/p − 1/2, the SureShrink estimator f̂^SURE = f̂_t̂ essentially performs as well as the optimal thresholding estimator, and thus attains the minimax risk over Besov balls B^s_{p,q}(L) within multiplicative constants:

sup_{f∈B^s_{p,q}(L)} E_f‖f̂^SURE − f‖₂² ≤ (1 + o(1)) · inf_{(t_j)_{j≥j₀}} sup_{f∈B^s_{p,q}(L)} E_f‖f̂_t − f‖₂²   (58)
    ≲ inf_{f̂} sup_{f∈B^s_{p,q}(L)} E_f‖f̂ − f‖₂².   (59)

6 Minimax L_r Risk over Besov Balls

In the previous sections we studied the nonparametric function estimation problem over Besov balls under the mean squared error loss, where we used the L² isometry to transform the problem from the function domain to the wavelet domain. However, we remark that the L² isometry is not essential: the same thresholding idea also applies to the general L_r error, for r ∈ [1,∞]. Furthermore, since we have shown the minimax optimality of various estimators without specifying the value of the minimax risk, for completeness we give the most general result here.

Theorem 17. ¹⁰ For any s > 1/p − 1/r and p, q, r ∈ [1,∞], the minimax L_r risk in estimating the function f over Besov balls B^s_{p,q}(L) is given by

(inf_{f̂} sup_{f∈B^s_{p,q}(L)} E_f‖f̂ − f‖_r^r)^{1/r} ≍
    (ε²)^{s/(2s+1)}                                        if r < (2s + 1)p,
    (ε² log(1/ε))^{s/(2s+1)} (log(1/ε))^{(1/2 − p/(qr))₊}   if r = (2s + 1)p,
    (ε² log(1/ε))^{(s − 1/p + 1/r)/(2(s − 1/p) + 1)}        if r > (2s + 1)p.   (60)

⁹ D. L. Donoho and I. M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432), pp. 1200–1224, 1995.
¹⁰ D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Density estimation by wavelet thresholding. The Annals of Statistics, pp. 508–539, 1996.


In contrast, the minimax linear risk in estimating the function f over Besov balls B^s_{p,q}(L) is given by

(inf_{f̂_lin} sup_{f∈B^s_{p,q}(L)} E_f‖f̂_lin − f‖_r^r)^{1/r} ≍
    (ε²)^{s/(2s+1)}                                    if r ≤ p,
    (ε²)^{(s − 1/p + 1/r)/(2(s − 1/p + 1/r) + 1)}      if p < r < ∞,
    (ε² log(1/ε))^{(s − 1/p)/(2(s − 1/p) + 1)}         if r = ∞   (61)

where the infimum is taken over all possible linear estimators.


EE378A Statistical Signal Processing Lecture 16 - 05/25/2017

Lecture 16: Achieving and Estimating the Fundamental Limit
Lecturer: Jiantao Jiao    Scribe: William Clary

In this lecture, we formally define the two distinct problems of achieving and estimating the fundamental limit, and show that under the logarithmic loss, it is easier to estimate the fundamental limit than to achieve it.

1 The Bayes envelope

The Bayes envelope introduced in the previous lectures can be viewed as the fundamental limit of prediction. Indeed, for a specified loss function Λ(x, x̂), the minimum average loss in predicting X ∼ P_X is given by the Bayes envelope:

U(P_X) ≜ min_{x̂} E_{X∼P_X}[Λ(X, x̂)].   (1)

Throughout this lecture, we observe X₁, X₂, . . . , X_n i.i.d.∼ P_X, where X ∈ 𝒳 = {1, 2, . . . , S}. In other words, the alphabet size of X is |𝒳| = S. We denote by M_S the space of probability measures on 𝒳. We take Λ(x, x̂) to be the logarithmic loss in the sequel; in other words, we have

Λ(x, x̂) = Λ(x, P) = log(1/P(x)),   (2)

for any x ∈ 𝒳, P ∈ M_S.

For non-negative sequences a_γ, b_γ, we use the notation a_γ ≲ b_γ to denote that there exists a universal constant C such that sup_γ a_γ/b_γ ≤ C, and a_γ ≳ b_γ is equivalent to b_γ ≲ a_γ. The notation a_γ ≍ b_γ is equivalent to a_γ ≲ b_γ and b_γ ≲ a_γ. The notation a_γ ≫ b_γ means that lim inf_γ a_γ/b_γ = ∞, and a_γ ≪ b_γ is equivalent to b_γ ≫ a_γ. We write a ∧ b = min{a, b} and a ∨ b = max{a, b}. Moreover, poly_K denotes the set of all polynomials of degree no more than K.

2 Achieving the fundamental limit

Given n i.i.d. observations X₁, X₂, . . . , X_n i.i.d.∼ P_X, we would like to construct a predictor P̂ = P̂(X₁, X₂, . . . , X_n) to predict a fresh independent random variable X ∼ P_X, where X is independent of the training data {X_i}_{i=1}^n.

The average risk of predicting X using the predictor P̂ under the logarithmic loss is given by

E_P[log(1/P̂(X))],   (3)

where the expectation is over the randomness of (X₁, X₂, . . . , X_n, X) ∼ P_X^{⊗(n+1)}.


2.1 The inappropriate question of minimax risk

Since the distribution P_X is unknown, we may take the minimax approach in decision theory and aim at solving the minimax risk. In other words, we aim at solving

inf_{P̂} sup_{P_X∈M_S} E_P[log(1/P̂(X))].   (4)

We now show that this question leads to a degenerate answer that may not be what we want.

Theorem 1. The minimax risk is given by

inf_{P̂} sup_{P_X∈M_S} E_P[log(1/P̂(X))] = log(S),   (5)

and the minimax-risk-achieving P̂ can be taken to be U_S = (1/S, 1/S, . . . , 1/S), where U_S is the uniform distribution on 𝒳.

Proof. We first show that the minimax risk is at least log(S). Indeed, for any predictor P̂ ∈ M_S, we have

E_P[log(1/P̂(X)) | X₁, X₂, . . . , X_n] = ∑_{x∈𝒳} P_X(x) log(1/P̂(x))   (6)
    ≥ ∑_{x∈𝒳} P_X(x) log(1/P_X(x))   (7)
    = H(P_X),   (8)

where we used the non-negativity of the KL divergence, and H(P_X) is the Shannon entropy. Taking P_X = U_S, we have H(P_X) = log S. Taking expectations on both sides with respect to X₁, X₂, . . . , X_n, we know

E_P[log(1/P̂(X))] ≥ log S   (9)

for any predictor P̂.

On the other hand, taking P̂ ≡ U_S, we have

E_P[log(1/P̂(X))] = log S,   (10)

which proves that the minimax risk is at most log S.

Theorem 1 shows that solving for the minimax risk in prediction may lead to inappropriate answers. Indeed, the minimax optimal solution turns out to be a degenerate one that ignores all the training data. What we show next is that focusing on the minimax regret solves this problem in a meaningful way.

2.2 Achieving the fundamental limit: minimax regret

As we argued in the proof of Theorem 1, for any predictor P̂ = P̂(X₁, X₂, . . . , X_n), we have

E_P[log(1/P̂(X))] ≥ H(P_X).   (11)


It motivates us to define the minimax regret as follows:

inf_{P̂} sup_{P_X∈M_S} {E_P[log(1/P̂(X))] − H(P_X)}.   (12)

We have the following algebraic manipulations for any predictor P̂:

E_P[log(1/P̂(X))] − H(P_X) = E_P[E_P[∑_{x∈𝒳} P_X(x) log(1/P̂(x)) − ∑_{x∈𝒳} P_X(x) log(1/P_X(x)) | X₁, X₂, . . . , X_n]]   (13)
    = E_P[∑_{x∈𝒳} P_X(x) log(P_X(x)/P̂(x))]   (14)
    = E_P D(P_X‖P̂),   (15)

where D(P‖Q) = ∑_{x∈𝒳} P(x) log(P(x)/Q(x)) is the KL divergence between P and Q.

In other words, solving the minimax regret of predicting a fresh independent random variable X based on n i.i.d. training samples X₁, X₂, . . . , X_n is equivalent to solving the problem of estimating the discrete distribution P_X under the KL divergence loss.

The minimax regret is characterized by the following theorem.

Theorem 2. ¹ ²

inf_{P̂} sup_{P_X∈M_S} {E_P[log(1/P̂(X))] − H(P_X)} = { (1 + o(1)) · ((S − 1)/(2n)) log(e)   if n ≫ S,
                                                      (1 + o(1)) · log(S/n)               if n ≪ S.   (16)

Moreover, if lim sup n/S ≤ c ∈ (0,∞), the minimax regret is bounded away from zero.

The predictor P̂ that achieves the performance above in the regime n ≫ S is

P̂(x) = (n(x) + β(n(x))) / (n + ∑_{j=1}^S β(n(j)))   for any x ∈ 𝒳,   (17)

where

n(x) = ∑_{i=1}^n 1(X_i = x),   (18)

and (X₁, X₂, . . . , X_n) is the training data. Here

β(k) = { 1/2  if k = 0,
         1    if k = 1,
         3/4  otherwise.   (19)

The predictor P̂ that achieves the performance above in the regime n ≪ S is

P̂(x) = (n(x) + (n/S) log(S/n)) / (n + n log(S/n))   for any x ∈ 𝒳.   (20)
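A minimal sketch of the predictor (17)–(19); the alphabet is indexed 0, . . . , S − 1 for convenience and the sampling distribution is an arbitrary illustration.

```python
import numpy as np

def beta_fn(k):
    # smoothing constants from (19)
    if k == 0: return 0.5
    if k == 1: return 1.0
    return 0.75

def add_beta_predictor(samples, S):
    """Predictor (17): P_hat(x) = (n(x) + beta(n(x))) / (n + sum_j beta(n(j)))."""
    n = len(samples)
    counts = np.bincount(samples, minlength=S)          # n(x) for x = 0..S-1
    betas = np.array([beta_fn(int(c)) for c in counts])
    return (counts + betas) / (n + betas.sum())

rng = np.random.default_rng(3)
S = 5
p = rng.dirichlet(np.ones(S))
samples = rng.choice(S, size=1000, p=p)
p_hat = add_beta_predictor(samples, S)
print(p_hat, p_hat.sum())   # a valid distribution close to p
```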

¹ Paninski, Liam. "Variational minimax estimation of discrete distributions under KL loss." In NIPS, pp. 1033–1040, 2004.
² Braess, Dietrich, and Thomas Sauer. "Bernstein polynomials and learning theory." Journal of Approximation Theory 128, no. 2 (2004): 187–206.


Vanishing minimax regret implies that there exists a predictor P̂ such that its average prediction error E_P[log(1/P̂(X))] on the test set approaches the fundamental limit H(P_X). Theorem 2 shows that it takes at least n ≫ S samples to achieve vanishing minimax regret. Intuitively, one needs to see all the symbols at least once to be able to construct a predictor whose performance approaches the fundamental limit.

The minimax regret definition reflects the traditional understanding of the difficulty of machine learning tasks. In machine learning practice, we iteratively improve our training algorithm, and use its prediction accuracy on the test set to measure the performance of our prediction algorithm. The best performance achieved by existing schemes on the test set is usually understood as the "limit" of prediction for a specific dataset. In this context, Theorem 2 can be interpreted as saying that with n ≲ S samples, there does not exist any prediction algorithm based on n training samples whose performance on the test set can approach the Bayes envelope in the worst case.

As we show in the next section, there exist algorithms that can estimate the fundamental limit with n ≪ S samples, without explicitly constructing a prediction algorithm.

3 Estimating the fundamental limit

We define the problem of estimating the fundamental limit as solving the following minimax problem:

inf_{Ĥ} sup_{P_X∈M_S} E_P|Ĥ − H(P_X)|,   (21)

where the infimum is taken over all possible estimators Ĥ = Ĥ(X₁, X₂, . . . , X_n) that are functions of the empirical training data. The materials in this section are mainly taken from

1. Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. "Minimax estimation of functionals of discrete distributions." IEEE Transactions on Information Theory 61, no. 5 (2015): 2835–2885.

2. Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. "Maximum likelihood estimation of functionals of discrete distributions." arXiv preprint arXiv:1406.6959 (2014).

3.1 The minimax rates

We have the following theorem.

Theorem 3. ³⁴⁵⁶ Suppose n ≳ S/log S. Then,

inf_{Ĥ} sup_{P_X∈M_S} E_P|Ĥ − H(P_X)| ≍ S/(n ln n) + (ln S)/√n.   (22)

Theorem 3 shows that it suffices to take n ≍ S/log S samples to consistently estimate the fundamental limit H(P_X). It is very surprising that the number of samples required is in fact sublinear in S: one can estimate the Shannon entropy uniformly over all P_X ∈ M_S even if one has not seen most of the symbols of the alphabet 𝒳 in the empirical samples.

³ Valiant, Gregory, and Paul Valiant. "Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs." In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 685–694. ACM, 2011.
⁴ Valiant, Gregory, and Paul Valiant. "The power of linear estimators." In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pp. 403–412. IEEE, 2011.
⁵ Wu, Yihong, and Pengkun Yang. "Minimax rates of entropy estimation on large alphabets via best polynomial approximation." IEEE Transactions on Information Theory 62, no. 6 (2016): 3702–3720.
⁶ Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. "Minimax estimation of functionals of discrete distributions." IEEE Transactions on Information Theory 61, no. 5 (2015): 2835–2885.


3.2 Natural candidate: the empirical entropy

One of the most natural estimators for the Shannon entropy H(P_X) given n i.i.d. samples is the empirical entropy, defined as follows.

Denote the empirical distribution by P_n = (p̂₁, p̂₂, . . . , p̂_S), where p̂_i = (1/n)∑_{j=1}^n 1(X_j = i) is the empirical frequency of symbol i in the training set. The empirical entropy is defined as H(P_n), which plugs the empirical distribution into the Shannon entropy functional. Intuitively, since the Shannon entropy is a continuous functional of finite-alphabet distributions, and P_n converges to the true distribution P_X as n → ∞, the plug-in estimate H(P_n) should be a decent estimator for H(P_X) if S is fixed and n → ∞. This is indeed true: it is only in high dimensions that the empirical entropy starts to behave poorly as an estimate of the Shannon entropy.
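A minimal sketch of the empirical entropy; the uniform distribution with n ≪ S is chosen to make the downward bias visible.

```python
import numpy as np

def empirical_entropy(samples):
    """Plug-in (empirical) entropy H(P_n), in nats."""
    counts = np.bincount(samples)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(4)
S, n = 1000, 500                   # high-dimensional regime: n < S
p = np.ones(S) / S                 # uniform, H = ln(S) ~ 6.91 nats
samples = rng.choice(S, size=n, p=p)
print(empirical_entropy(samples))  # severely biased downward when n << S
```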

We have the following theorem quantifying the performance of the empirical entropy in estimating H(PX).

Theorem 4. ⁷⁸ Suppose n ≳ S. Then,

sup_{P_X∈M_S} E_P|H(P_n) − H(P_X)| ≍ S/n + (ln S)/√n.   (23)

Comparing Theorems 3 and 4, the main difference is that the term S/n has been improved to S/(n ln n) in the minimax rate-optimal entropy estimator, while the second term is unchanged. We now investigate where the two terms come from, and how one may construct the minimax rate-optimal estimator starting from the empirical entropy.

3.3 Analysis of the empirical entropy

For any estimator Ĥ, its performance in estimating H(P_X) can be characterized via its bias, defined as E_P Ĥ − H(P_X), and the concentration of Ĥ around its expectation E_P Ĥ. The concentration property may be partially characterized by the variance of the estimator Ĥ, namely Var(Ĥ) = E_P(Ĥ − E_P Ĥ)².

We now argue that in Theorem 4, the term S/n comes from the bias, and the term (ln S)/√n comes from the variance.

Introduce the concave function f(x) = x ln(1/x) on [0, 1]. It is clear that

H(P_n) = ∑_{i=1}^S f(p̂_i).   (24)

We have the following claim.

Claim 5. If p_i ≥ 1/n, then

0 ≤ f(p_i) − E f(p̂_i) ≍ 1/n.   (25)

Moreover,

Var(H(P_n)) ≤ ((ln n)²/n) ∧ (2(ln S + 3)²/n).   (26)

The results in Claim 5 are inspiring. They show that the variance of the empirical entropy can be universally bounded regardless of the support size S. Moreover, the biases contributed by the individual symbols add up linearly, contributing the term S/n. It is clear that in the regime of S fixed and n → ∞ the variance dominates, but in high dimensions the bias dominates. Hence, the key to improving the empirical entropy is to reduce the bias in high dimensions without incurring too much additional variance.

⁷ Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. "Maximum likelihood estimation of functionals of discrete distributions." arXiv preprint arXiv:1406.6959 (2014).
⁸ Wu, Yihong, and Pengkun Yang. "Minimax rates of entropy estimation on large alphabets via best polynomial approximation." IEEE Transactions on Information Theory 62, no. 6 (2016): 3702–3720.


3.4 How can we improve the empirical entropy?

It has been a long journey to find the minimax rate-optimal estimators. Harris in 1975 proposed expanding E_P H(P_n) using a Taylor expansion and obtained

E_P H(P_n) = H(P_X) − (S − 1)/(2n) + (1/(12n²))(1 − ∑_{i=1}^S 1/p_i) + o(1/n³).   (27)

The Taylor expansion result looks decent in the regime where the p_i's are not too small. Indeed, for very small p_i the remainder term ∑_{i=1}^S 1/p_i may be much larger than the true entropy H(P_X) itself. This intuition turns out to be correct: it suffices to do a first-order bias correction using the Taylor series in the regime of "not too small" p_i.

In general, for n · p̂_n ∼ B(n, p), we may write

E_P[f(p̂_n)] = f(p) + (1/2) f″(p) · (p(1 − p)/n) + O(1/n²),

which motivates the bias correction

f^c = f(p̂_n) − (1/2) f″(p̂_n) · (p̂_n(1 − p̂_n)/n).
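For f(x) = x ln(1/x) we have f″(x) = −1/x, so the correction amounts to adding (1 − p̂_i)/(2n) for each observed symbol; summed over symbols, this is the familiar Miller–Madow-type correction. A minimal sketch (the Dirichlet-generated distribution is an arbitrary illustration):

```python
import numpy as np

def entropy_first_order_corrected(samples):
    """Plug-in entropy plus the first-order bias correction
    f_c = f(p_hat) - (1/2) f''(p_hat) p_hat (1 - p_hat) / n,
    which for f(x) = x ln(1/x), f''(x) = -1/x, adds (1 - p_hat)/(2n) per symbol."""
    n = len(samples)
    counts = np.bincount(samples)
    p = counts[counts > 0] / n
    plug_in = -np.sum(p * np.log(p))
    correction = np.sum((1.0 - p) / (2.0 * n))   # ~ (#observed symbols - 1)/(2n)
    return plug_in + correction

rng = np.random.default_rng(5)
S, n = 200, 400
probs = rng.dirichlet(np.ones(S))
samples = rng.choice(S, size=n, p=probs)
truth = -np.sum(probs * np.log(probs))
print(truth, entropy_first_order_corrected(samples))
```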

In the entropy estimation case, we follow the bias correction above and do the following:⁹

Construction 6. If the true p_i ≳ (ln n)/n, we use f(p̂_i) + 1/(2n) instead of f(p̂_i) to estimate f(p_i).

Now the focus is on the small-p_i regime. We need to understand precisely which term contributed the S/n bias bound. Assume for now that all p_i ≲ (ln n)/n. We have the following manipulations:

H(P_n) − H(P_X) = ∑_{i=1}^S [f(p̂_i) − f(p_i)]   (28)
    = ∑_{i=1}^S (f(p̂_i) − P_K(p̂_i)) − ∑_{i=1}^S (f(p_i) − P_K(p_i)) + ∑_{i=1}^S (P_K(p̂_i) − P_K(p_i)),   (29)

where P_K(·) is an arbitrary polynomial of degree no more than K.

The following two observations are crucial for the improvement of the empirical entropy.

Claim 7. If p_i ≲ (ln n)/n, we have p̂_i ≲ (ln n)/n with probability at least 1 − 1/n⁴.

Claim 8. Suppose K ≍ ln n. Then for any constant c > 0,

inf_{P_K∈poly_K} sup_{x∈[0, c(ln n)/n]} |f(x) − P_K(x)| ≍ c/(n ln n).   (30)

Utilizing these two claims, and conditioning on the event that all p_i ≤ c(ln n)/n and p̂_i ≤ c(ln n)/n, we immediately obtain that

|∑_{i=1}^S (f(p̂_i) − P_K(p̂_i))| ≲ S/(n ln n)   (31)
|∑_{i=1}^S (f(p_i) − P_K(p_i))| ≲ S/(n ln n),   (32)

⁹ Note that this bias-correction intuition does not easily generalize to higher-order corrections. For a systematic approach to higher-order bias correction with Taylor series, we refer the readers to Yanjun Han, Jiantao Jiao, Tsachy Weissman, "Minimax Rate-Optimal Estimation of Divergences between Discrete Distributions", arXiv preprint arXiv:1605.09124 (2016).


which implies that

|E_P[∑_{i=1}^S (P_K(p̂_i) − P_K(p_i))]| ≳ S/n   (33)

since we already know from Claim 5 that |E_P H(P_n) − H(P_X)| ≳ S/n.

Thus, we have identified the reason for the poor bias of the empirical entropy: the plug-in approach to estimating the polynomial P_K incurs too much bias. Realizing this turns out to be the crucial step toward the minimax rate-optimal estimator: under the multinomial model there exist unbiased estimators for any polynomial P_K whose degree is no more than n. Indeed, when X ∼ B(n, p), for any integer r ∈ {1, . . . , n}:

E[X(X − 1) · · · (X − r + 1) / (n(n − 1) · · · (n − r + 1))] = p^r.
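A minimal sketch of this unbiased estimator of p^r, with a Monte Carlo check of unbiasedness under arbitrary illustrative parameters:

```python
import numpy as np

def unbiased_power(X, n, r):
    """Unbiased estimator of p^r from X ~ Binomial(n, p):
    X(X-1)...(X-r+1) / (n(n-1)...(n-r+1)); works elementwise on arrays."""
    out = np.ones_like(X, dtype=float)
    for s in range(r):
        out *= (X - s) / (n - s)
    return out

rng = np.random.default_rng(6)
n, p, r = 50, 0.1, 3
draws = rng.binomial(n, p, size=100_000)
print(unbiased_power(draws, n, r).mean(), p ** r)  # ~equal: unbiased
```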

We complete the construction of the minimax rate-optimal estimator as follows:

Construction 9. If the true p_i ≲ (ln n)/n, we use the unbiased estimator of the polynomial P_K(p_i) to estimate f(p_i). Here P_K(·) is the best approximating polynomial of f over the interval [0, c(ln n)/n] introduced in Claim 8.

As for the last step, we need to use the Chernoff bound to show the following results on confidence intervals in the binomial model:

Claim 10. There exist positive real numbers c₁, c₂, c₃, c₄ such that:

• if p_i ∈ [0, c₁(log n)/n], then p̂_i ∈ [0, c₂(log n)/n] with probability at least 1 − 1/n⁴;

• if p_i ∈ [c₃(log n)/n, 1], then p̂_i ∈ [c₄(log n)/n, 1] with probability at least 1 − 1/n⁴.

There are other details needed to make the whole proof work: for example, one needs to argue that this approach does not increase the variance by too much, and one also needs to prove matching minimax lower bounds. In practice one may also remove the constant term in P_K(·) to ensure that zero mass is assigned to symbols that have never appeared in the training data. Thus, we have constructed a minimax rate-optimal estimator that does not require knowledge of the support size S, but behaves nearly as well as the exact minimax estimator that knows S.


EE378A Statistical Signal Processing Lecture 17 - 05/29/2017

Lecture 17: Minimax estimation of high-dimensional functionals
Lecturer: Jiantao Jiao    Scribe: Jonathan Lacotte

1 Estimating the fundamental limit is easier than achieving it: other loss functions

We emphasize that it is a general phenomenon that estimating the fundamental limit is easier than achieving it. Recall the definitions of the two problems of achieving and estimating the fundamental limit:

• Achieving the fundamental limit:

inf_{X̂(X₁,...,X_n)} sup_{P_X∈P} E_P[Λ(X, X̂) − U(P_X)]

• Estimating the fundamental limit:

inf_{Û(X₁,...,X_n)} sup_{P_X∈P} E_P[|Û − U(P_X)|]

Here we observe n i.i.d. samples X₁, X₂, . . . , X_n ∈ 𝒳 with distribution P_X ∈ P, where P is a collection of probability distributions on 𝒳, and U(P_X) = min_{x̂} E_P[Λ(X, x̂)] is the Bayes envelope.

In the last lecture, we showed that in the case of Λ(X, x̂) = Λ(X, P) = log(1/P(X)) and P = M_S, it takes n ≫ S samples to achieve the fundamental limit, and n ≍ S/ln S samples to estimate the fundamental limit.

In this lecture we show that a similar phenomenon happens for another widely used loss function: the Hamming loss. Suppose X = (Y, Z) where Y ∈ 𝒴, |𝒴| = S, and Z ∈ {0, 1}. One may interpret Y as the feature and Z as the label, casting this as a binary classification problem. Consider the Hamming loss Λ(x, t̂) = 1{t̂(Y) ≠ Z}. The Bayes envelope in this case is given by

U(P_X) = E[min(η(Y), 1 − η(Y))],

where η(y) = P[Z = 1 | Y = y].

Theorem 1. Consider P = {P_X | P_Z(0) = 1/2}. Then it requires n ≍ S samples to achieve U(P_X) and n ≍ S/log(S) samples to estimate U(P_X).

2 Boosting the Chow-Liu algorithm with improved mutual information estimates

Graphical models provide us with efficient computational tools to conduct inference on high-dimensional data with potential structure, cf. [1] and references therein. Learning the structure and parameters of graphical models from empirical data is therefore the starting point for all these applications. It has been known that exact learning of a general graphical model is NP-hard [2], and there exist tractable sub-classes, among which tree graphical models are the most famous. The seminal work of Chow and Liu [3] contributed an efficient algorithm to compute the Maximum Likelihood Estimator (MLE) of a tree-structured graphical model based on empirical data, and constitutes one of the very few cases where the exact MLE can be solved efficiently. There are various approaches towards learning more complex structures, for which we refer the reader to [4] for a review.


Concretely, the Chow–Liu algorithm (CL) addresses the following question. Given n i.i.d. samples of a random vector X = (X₁, X₂, . . . , X_d), where X_i ∈ 𝒳, |𝒳| < ∞, we want to estimate the joint distribution of X. Chow and Liu [3] assumed that P_X can be factorized as

P_X = ∏_{i=1}^d P_{X_{m_i} | X_{m_{j(i)}}},   0 ≤ j(i) < i,   (1)

where (m₁, m₂, . . . , m_d) is an unknown permutation of the integers 1, 2, . . . , d, and P_{X_i | X₀} is by definition equal to P_{X_i}. Then, CL outputs the distribution P̂_X that maximizes the likelihood of the observed data. Interestingly, this optimization problem can be efficiently solved after being transformed into a Maximum Weight Spanning Tree (MWST) problem, which can be solved using the Prim or Kruskal algorithm. In particular, they showed that the MLE of the tree structure boils down to the following expression:

E_ML = arg max_{E_Q : Q is a tree} ∑_{e∈E_Q} I(P̂_e),   (2)

where I(P̂_e) is the mutual information associated with the empirical distribution of the two nodes connected via edge e, and E_Q is the set of edges of a tree distribution Q (i.e., Q factors as a tree). In words, it suffices to first compute the empirical mutual information between every pair of nodes (d(d − 1)/2 pairs in total), and the maximum weight spanning tree is the tree structure that maximizes the likelihood. To obtain estimates of the distributions on each edge, Chow and Liu [3] simply assigned the empirical distribution, while picking an arbitrary node as the tree root.
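A minimal sketch of the CL algorithm: empirical pairwise mutual information weights followed by a Kruskal-style maximum weight spanning tree. Swapping `empirical_mi` for a minimax rate-optimal mutual information estimate is exactly the "Modified Chow–Liu" idea discussed below; the star-structured synthetic data are our own illustration.

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y, S):
    """Empirical mutual information between two discrete columns."""
    joint = np.zeros((S, S))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= joint.sum()
    px, py = joint.sum(1), joint.sum(0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))

def chow_liu_edges(data, S):
    """Chow-Liu tree: maximum-weight spanning tree with empirical MI weights
    (Kruskal with union-find over the d variables)."""
    d = data.shape[1]
    weights = sorted(
        ((empirical_mi(data[:, i], data[:, j], S), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True)
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]; u = parent[u]
        return u
    edges = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges

# illustrative star-structured data: X2..Xd are noisy copies of X1 (node 0)
rng = np.random.default_rng(7)
n, d, S = 5000, 5, 4
x1 = rng.integers(0, S, size=n)
data = np.stack([x1] + [np.where(rng.random(n) < 0.7, x1, rng.integers(0, S, n))
                        for _ in range(d - 1)], axis=1)
print(chow_liu_edges(data, S))   # edges should all touch node 0
```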

We begin by asking the following natural question:

Question 2. Is the Chow–Liu algorithm optimal for learning tree graphical models?

Since the Chow–Liu algorithm exactly solves the MLE, and has been widely used in many applications, its optimality seems to be tacitly assumed in much of the literature. However, a closer inspection of the statistical theory [5, 6] reveals that the Chow–Liu algorithm is only known to perform essentially optimally when the number of samples n grows to ∞ while the number of states of the tree stays fixed. Indeed, the modern theory of the maximum likelihood estimation paradigm [7] only justifies the asymptotic efficiency of the MLE, without general non-asymptotic guarantees when we have finitely many samples. In contrast, various modern data-analytic applications deal with datasets that do not have the luxury of too many observations compared to the alphabet size.

To explain the insights underlying our improved algorithm, we revisit equation (2) and note that if we were to replace the empirical mutual information with the true mutual information, the output of the MWST would be the true set of edges of the tree. In light of this, the CL algorithm can be viewed as a "plug-in" estimator that replaces the true mutual information with an estimate of it, namely the empirical mutual information. Naturally then, it is to be expected that a better estimate of the mutual information would lead to a smaller probability of error in identifying the tree. However, how bad can the empirical mutual information be as an estimate of the true mutual information? The following theorem in [8] implies that it can be highly sub-optimal in high-dimensional regimes.

Theorem 3. Suppose we have two random variables X₁, X₂ ∈ 𝒳, |𝒳| < ∞. The minimax sample complexity of estimating the mutual information I(X₁; X₂) under mean squared error is Θ(|𝒳|²/ln|𝒳|), while the sample complexity required for the empirical mutual information to be consistent is Θ(|𝒳|²).

In words, Theorem 3 implies that for the minimax rate-optimal estimator, it suffices to take n ≫ |𝒳|²/ln|𝒳| samples to consistently estimate the mutual information I(X₁; X₂) for any underlying distribution. At the same time, unless n ≫ |𝒳|², there exist distributions for which the error of the empirical mutual information is bounded away from zero. Furthermore, when |𝒳| is large, the approach of estimating the three entropy terms in

I(X₁; X₂) = H(X₁) + H(X₂) − H(X₁, X₂)   (3)

separately using the minimax rate-optimal entropy estimators is minimax rate-optimal for estimating the mutual information.

Now we apply the improved mutual information estimates to improve the Chow–Liu algorithm. In the following experiment, we fix d = 7, |𝒳| = 300, construct a star tree (i.e., all random variables are conditionally independent given X₁), and generate a random joint distribution by assigning independent Beta(1/2, 1/2)-distributed random variables to each entry of the marginal distribution P_{X₁} and the transition probabilities P_{X_k|X₁}, 2 ≤ k ≤ d (with normalization). Then, we increase the sample size n from 10³ to 5.5 × 10⁴, and for each n we conduct 20 Monte Carlo simulations.

Note that the true tree has d − 1 = 6 edges, and any estimated set of edges will have at least one overlap with these 6 edges because the true tree is a star graph. We define the wrong-edges-ratio in this case as the number of edges different from the true set of edges divided by d − 2 = 5. Thus, if the wrong-edges-ratio equals one, the estimated tree is maximally different from the true tree and, in the other extreme, a ratio of zero corresponds to perfect reconstruction. We compute the expected wrong-edges-ratio over 20 Monte Carlo simulations for each n, and the results are exhibited in Figure 1.

[Figure 1: The expected wrong-edges-ratio of the modified algorithm and the original CL algorithm for sample sizes ranging from 10³ to 5.5 × 10⁴. x-axis: number of samples n; y-axis: expected wrong-edges-ratio; curves: Modified CL, Original CL.]

Figure 1 reveals intriguing phase transitions for both the modified and the original CL algorithm. When we have fewer than 3 × 10³ samples, both algorithms yield a wrong-edges-ratio of 1, but soon after the sample size exceeds 6 × 10³, the modified CL algorithm begins to reconstruct the network perfectly, while the original CL algorithm continues to fail maximally until the sample size exceeds 47 × 10³, 8 times the sample size required by the new algorithm, which we temporarily call the "Modified Chow–Liu" algorithm.

Open Question 4. The exact sample complexity for learning tree graphical models is open. Although it is clear that using improved mutual information estimates improves the tree learning process, it is not clear that this is the optimal way to learn trees. One may start by asking the following easier question: for any jointly distributed random variables (X₁, X₂, X₃), what is the sample complexity of testing I(X₁; X₂) − I(X₁; X₃) ≥ ε₂ against I(X₁; X₂) − I(X₁; X₃) ≤ ε₁? Here 0 < ε₁ < ε₂ are fixed constants.

3 Support size estimation

Given a discrete probability distribution $P$, we want to estimate its support size

$$S(P) = \sum_i \mathbb{1}\{p_i > 0\}$$

under the assumption that $\min_i p_i \ge \frac{1}{S}$. Note that this assumption immediately implies that the support size of $P$ is at most $S$. The question is: how can we design the minimax rate-optimal estimator of the true support size? The material in this section comes from [9].

We demonstrate in this section that, using the approximation-based methodology from the last lecture, one can intuitively obtain the sample complexity in a systematic fashion.

It was shown in the last lecture that we can distinguish the cases $p_i \lesssim \frac{\log n}{n}$ and $p_i \gtrsim \frac{\log n}{n}$ with overwhelming probability. This motivates us to separate the problem into two regimes:

1. The case $\frac{1}{S} \gtrsim \frac{\log n}{n}$: in this case the functional we want to estimate is a constant, and it suffices to use the plug-in approach. The resulting bias for each $p_i$ is

$$\left|\mathbb{E}[\mathbb{1}(\hat{p}_i \neq 0)] - \mathbb{1}\{p_i > 0\}\right| = (1 - p_i)^n \le e^{-n p_i} \le e^{-n/S}. \tag{4}$$

(A numerical sketch of this bias follows the list.)

2. The case $\frac{1}{S} \lesssim \frac{\log n}{n}$: in this case some $p_i$ may fall in the interval $[0, \frac{\log n}{n}]$, while others fall in $[\frac{\log n}{n}, 1]$. Those $p_i$ that fall into $[\frac{\log n}{n}, 1]$ are handled with the MLE, so it suffices to look at the $p_i$'s in $[0, \frac{\log n}{n}]$.
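The aggregate plug-in bias in regime 1 is $\sum_i (1-p_i)^n \le S e^{-n/S}$, and this is easy to check numerically. A minimal sketch (our own illustration, using the uniform distribution, where $\min_i p_i = 1/S$ holds with equality):

```python
import numpy as np

S = 1000
p = np.full(S, 1.0 / S)                    # uniform: min_i p_i = 1/S
for n in [S // 2, S, 2 * S, 5 * S]:
    bias = float(np.sum((1.0 - p) ** n))   # E[plug-in] underestimates S(P) by sum_i (1-p_i)^n
    print(n, bias, S * np.exp(-n / S))     # compare with the bound S * e^{-n/S} from (4)
```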

Claim 5. Suppose $q \in (0,1)$. Then

$$\inf_{P_K \in \mathrm{poly}_K,\, P_K(0) = 0}\ \sup_{x \in [q,1]} |P_K(x) - 1| \asymp \left(\frac{1 - \sqrt{q}}{1 + \sqrt{q}}\right)^K \lesssim e^{-K\sqrt{q}}. \tag{5}$$
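Claim 5 can be checked numerically. The extremal polynomial behind the upper bound is (up to constants) a rescaled Chebyshev polynomial: take $P_K(x) = 1 - T_K(m(x))/T_K(m(0))$, where $m(x) = \frac{2x - 1 - q}{1 - q}$ maps $[q,1]$ affinely onto $[-1,1]$; then $P_K(0) = 0$, and since $|T_K| \le 1$ on $[-1,1]$, the sup error on $[q,1]$ is exactly $1/|T_K(m(0))|$. A minimal Python sketch comparing this to the rate in (5):

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def cheb_construction_error(K, q):
    # P_K(x) = 1 - T_K(m(x)) / T_K(m(0)) satisfies P_K(0) = 0 and has
    # sup error 1/|T_K(m(0))| on [q, 1], since |T_K| <= 1 there.
    TK = Chebyshev.basis(K)
    m0 = (-1.0 - q) / (1.0 - q)
    return 1.0 / abs(TK(m0))

q = 0.01
for K in [5, 10, 20, 40, 80]:
    rate = ((1 - np.sqrt(q)) / (1 + np.sqrt(q))) ** K    # the rate in (5)
    print(K, cheb_construction_error(K, q), rate)        # ratio stays bounded (about 2)
```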

Applying Claim 5 to the scaled interval $[0, \frac{\log n}{n}]$, with $q = \frac{1/S}{\log n / n} = \frac{n}{S \log n}$ and $K \asymp \log n$, we obtain that the bias for $p_i \in [\frac{1}{S}, \frac{\log n}{n}]$ is

$$e^{-K\sqrt{q}} = e^{-\log n \sqrt{\frac{n}{S \log n}}} = e^{-\sqrt{\frac{n \log n}{S}}}, \tag{6}$$

which vanishes as long as $n \gg \frac{S}{\ln S}$.

Hence, we have (intuitively) shown that the bias of the minimax-optimal approach should be of order

$$S \cdot e^{-\Theta\left(\sqrt{\frac{n \log n}{S}} \vee \frac{n}{S}\right)}. \tag{7}$$


References

[1] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.

[2] D. Karger and N. Srebro, "Learning Markov networks: maximum bounded tree-width graphs," in Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2001, pp. 392–401.

[3] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.

[4] Y. Zhou, "Structure learning of probabilistic graphical models: a comprehensive survey," arXiv preprint arXiv:1111.6925, 2011.

[5] C. Chow and T. Wagner, "Consistency of an estimate of tree-dependent probability distributions (corresp.)," IEEE Transactions on Information Theory, vol. 19, no. 3, pp. 369–371, 1973.

[6] V. Y. Tan, A. Anandkumar, L. Tong, and A. S. Willsky, "A large-deviation analysis of the maximum-likelihood learning of Markov tree structures," IEEE Transactions on Information Theory, vol. 57, no. 3, pp. 1714–1735, 2011.

[7] E. L. Lehmann and G. Casella, Theory of Point Estimation. Springer, 1998, vol. 31.

[8] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Minimax estimation of functionals of discrete distributions," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015.

[9] Y. Wu and P. Yang, "Chebyshev polynomials, moment matching, and optimal estimation of the unseen," arXiv preprint arXiv:1504.01227, 2015.


EE378A Statistical Signal Processing Lecture 18 - 06/01/2017

Lecture 18: Effective Sample Size Enlargement

Lecturer: Jiantao Jiao    Scribe: Andrew Naber

1 Effective sample size enlargement

Let $\mathcal{M}_S$ be the set of all discrete distributions with support size $S$. Consider a random variable $X$ with distribution $P \in \mathcal{M}_S$. We want to estimate some functional $F(P)$ from $n$ independent samples $X_1, \ldots, X_n$ of $P$. Examples of $F$ include the entropy or the $L_1$ distance from another distribution $Q \in \mathcal{M}_S$. Let $\hat{F} = \hat{F}(X_1, \ldots, X_n)$ denote our estimator of $F$, and let $P_n$ denote the empirical distribution computed from the $n$ samples. Consider the minimax and plug-in risk functions defined as

$$R_{\text{minimax}}(F, \mathcal{P}, n) = \inf_{\hat{F}(X_1, \ldots, X_n)} \sup_{P \in \mathcal{P}} \mathbb{E}|\hat{F} - F(P)| \tag{1}$$

$$R_{\text{plug-in}}(F, \mathcal{P}, n) = \sup_{P \in \mathcal{P}} \mathbb{E}|F(P_n) - F(P)|. \tag{2}$$

In the following table we compare the minimax and plug-in risks for several different functionals $F$. Certain conditions on the parameters are needed for the risk expressions to hold; we ignore them here for simplicity of presentation.

$F(P)$ | $\mathcal{P}$ | $R_{\text{minimax}}(F, \mathcal{P}, n)$ | $R_{\text{plug-in}}(F, \mathcal{P}, n)$
$\sum_{i=1}^S p_i \log\frac{1}{p_i}$ | $\mathcal{M}_S$ | $\frac{S}{n \log n} + \frac{\log S}{\sqrt{n}}$ | $\frac{S}{n} + \frac{\log S}{\sqrt{n}}$
$F_\alpha(P) = \sum_{i=1}^S p_i^\alpha$, $0 < \alpha \le \frac{1}{2}$ | $\mathcal{M}_S$ | $\frac{S}{(n \log n)^\alpha}$ | $\frac{S}{n^\alpha}$
$F_\alpha(P)$, $\frac{1}{2} < \alpha < 1$ | $\mathcal{M}_S$ | $\frac{S}{(n \log n)^\alpha} + \frac{S^{1-\alpha}}{\sqrt{n}}$ | $\frac{S}{n^\alpha} + \frac{S^{1-\alpha}}{\sqrt{n}}$
$F_\alpha(P)$, $1 < \alpha < \frac{3}{2}$ | $\mathcal{M}_S$ | $(n \log n)^{-(\alpha-1)}$ | $n^{-(\alpha-1)}$
$F_\alpha(P)$, $\alpha \ge \frac{3}{2}$ | $\mathcal{M}_S$ | $\frac{1}{\sqrt{n}}$ | $\frac{1}{\sqrt{n}}$
$\sum_{i=1}^S \mathbb{1}(p_i \neq 0)$ | $\{P : \min_i p_i \ge \frac{1}{S}\}$ | $S e^{-\Theta\left(\max\left\{\sqrt{\frac{n \log n}{S}},\, \frac{n}{S}\right\}\right)}$ | $S e^{-\Theta\left(\frac{n}{S}\right)}$
$\sum_{i=1}^S |p_i - q_i|$ ($Q$ fixed) | $\mathcal{M}_S$ | $\sum_{i=1}^S q_i \wedge \sqrt{\frac{q_i}{n \ln n}}$ | $\sum_{i=1}^S q_i \wedge \sqrt{\frac{q_i}{n}}$

In the following table we compare the minimax and plug-in risks for several functionals of two distributions $P, Q \in \mathcal{M}_S$, where we have $m$ samples from $P$ and $n$ samples from $Q$. For the Kullback–Leibler and $\chi^2$ divergence estimators we only consider $(P,Q) \in \{(P,Q) : P, Q \in \mathcal{M}_S,\ \frac{p_i}{q_i} \le u(S)\}$, where $u(S)$ is some function of $S$.


$F(P,Q)$ | $R_{\text{minimax}}(F, \mathcal{P}, m, n)$ | $R_{\text{plug-in}}(F, \mathcal{P}, m, n)$
$\sum_{i=1}^S |p_i - q_i|$ | $\sqrt{\frac{S}{\min\{m,n\} \log(\min\{m,n\})}}$ | $\sqrt{\frac{S}{\min\{m,n\}}}$
$\frac{1}{2}\sum_{i=1}^S (\sqrt{p_i} - \sqrt{q_i})^2$ | $\sqrt{\frac{S}{\min\{m,n\} \log(\min\{m,n\})}}$ | $\sqrt{\frac{S}{\min\{m,n\}}}$
$D(P\|Q) = \sum_{i=1}^S p_i \log\frac{p_i}{q_i}$ | $\frac{S}{m \log m} + \frac{S u(S)}{n \log n} + \frac{\log u(S)}{\sqrt{m}} + \frac{\sqrt{u(S)}}{\sqrt{n}}$ | $\frac{S}{m} + \frac{S u(S)}{n} + \frac{\log u(S)}{\sqrt{m}} + \frac{\sqrt{u(S)}}{\sqrt{n}}$
$\chi^2(P\|Q) = \sum_{i=1}^S \frac{p_i^2}{q_i} - 1$ | $\frac{S u(S)^2}{n \log n} + \frac{u(S)}{\sqrt{m}} + \frac{u(S)^{3/2}}{\sqrt{n}}$ | $\frac{S u(S)^2}{n} + \frac{u(S)}{\sqrt{m}} + \frac{u(S)^{3/2}}{\sqrt{n}}$

Note: In many cases (with the notable exception of the support size estimator), in passing from the plug-in risk to the minimax risk, we replace $m$ and $n$ in the denominators by $m \log m$ and $n \log n$. We term this phenomenon the effective sample size enlargement. In the next section we examine the underlying reason for this phenomenon.
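As a quick illustration of the plug-in column for entropy, the following sketch (our own toy example, using the uniform distribution) estimates the bias of the empirical entropy by simulation; it roughly matches the classical Miller–Madow first-order bias term $(S-1)/(2n)$, consistent with the $S/n$ term in the table.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

S = 100
H_true = np.log(S)                       # entropy of the uniform distribution on S symbols
for n in [400, 800, 1600, 3200]:
    ests = [empirical_entropy(np.bincount(rng.integers(0, S, size=n), minlength=S))
            for _ in range(2000)]
    print(n, H_true - np.mean(ests), (S - 1) / (2 * n))  # observed bias vs. (S-1)/(2n)
```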

2 Preliminaries on approximation theory

Definition 1. The $r$th symmetric difference of $f \in C[0,1]$ is

$$\Delta_h^r f(x) = \begin{cases} 0 & \text{if } x + \frac{rh}{2} \notin [0,1] \text{ or } x - \frac{rh}{2} \notin [0,1], \\ \sum_{k=0}^r (-1)^k \binom{r}{k} f\left(x + \frac{rh}{2} - kh\right) & \text{otherwise.} \end{cases}$$

Example:

$$\Delta_h^1 f(x) = f\left(x + \tfrac{h}{2}\right) - f\left(x - \tfrac{h}{2}\right)$$
$$\Delta_h^2 f(x) = f(x+h) - 2f(x) + f(x-h)$$

Definition 2.

$$\omega^r(f, t) = \sup_{0 < h \le t} \|\Delta_h^r f\|$$
$$\omega_\varphi^r(f, t) = \sup_{0 < h \le t} \|\Delta_{h\varphi(x)}^r f\|$$

where $\varphi(x) = \sqrt{x(1-x)}$ and $\|f\| = \sup_{x \in [0,1]} |f(x)|$.

Properties:

1. $\omega_\varphi^{r+1}(f, t) \le \kappa \cdot \omega_\varphi^r(f, t)$, where $\kappa$ is a constant;

2. $\lim_{t \to 0^+} \omega_\varphi^r(f, t)/t^r = 0 \Rightarrow f \in \mathrm{poly}_{r-1}$.

Example: Let $f(x) = x \log x$. We then have $\omega^2(f, t) \asymp t$ and $\omega_\varphi^r(f, t) \asymp t^2$ for $r \ge 2$.

Definition 3 (Peetre's K-functional).

$$K_r(f, t^r) = \inf_g \left(\|f - g\| + t^r \|g^{(r)}\|\right)$$
$$K_{r,\varphi}(f, t^r) = \inf_g \left(\|f - g\| + t^r \|\varphi^r g^{(r)}\|\right)$$


Lemma 4. For all $f \in C[0,1]$,

$$K_r(f, t^r) \asymp \omega^r(f, t), \qquad K_{r,\varphi}(f, t^r) \asymp \omega_\varphi^r(f, t).$$

We now present a systematic approach to bounding the gap $|\mathbb{E}f(X) - f(\mathbb{E}X)|$ using the K-functional.

Example: Suppose $X \in [0,1]$ is a random variable and $f \in C[0,1]$. For any $g$,

$$|\mathbb{E}f(X) - f(\mathbb{E}X)| = |\mathbb{E}(f(X) - g(X) + g(X) - g(\mathbb{E}X) + g(\mathbb{E}X) - f(\mathbb{E}X))|$$
$$\le 2\|f - g\| + |\mathbb{E}(g(X) - g(\mathbb{E}X))|$$
$$\le 2\|f - g\| + \frac{\mathrm{var}(X)}{2}\|g''\|,$$

and taking the infimum over $g$ gives

$$|\mathbb{E}f(X) - f(\mathbb{E}X)| \lesssim \omega^2\left(f, \frac{\sqrt{\mathrm{var}(X)}}{2}\right),$$

where we have used Taylor's theorem, $g(X) - g(\mathbb{E}X) = g'(\mathbb{E}X)(X - \mathbb{E}X) + \frac{1}{2}g''(\zeta)(X - \mathbb{E}X)^2$, to obtain the bound $|\mathbb{E}(g(X) - g(\mathbb{E}X))| \le \frac{1}{2}\|g''\|\mathrm{var}(X)$, followed by an application of Lemma 4.
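To make the bound concrete, here is a small numerical sketch (our own illustration, with $f(p) = p\log\frac{1}{p}$, a binomial $X = \hat{p}_n$, and a brute-force grid computation of the second modulus) comparing the Jensen-type gap to the $\omega^2$ scale above:

```python
import numpy as np
from scipy.stats import binom

def f(x):
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, -x * np.log(np.where(x > 0, x, 1.0)), 0.0)

def omega2(t, grid=np.linspace(0, 1, 4001)):
    """Brute-force second modulus: sup_{0<h<=t} sup_x |f(x+h) - 2 f(x) + f(x-h)|."""
    best = 0.0
    for h in np.linspace(t / 20, t, 20):
        x = grid[(grid - h >= 0) & (grid + h <= 1)]
        best = max(best, float(np.max(np.abs(f(x + h) - 2 * f(x) + f(x - h)))))
    return best

n, p = 100, 0.2
i = np.arange(n + 1)
gap = abs(float(binom.pmf(i, n, p) @ f(i / n)) - float(f(p)))  # |E f(p_hat) - f(p)|
t = np.sqrt(p * (1 - p) / n) / 2                               # sqrt(var(p_hat)) / 2
print("gap:", gap, " omega2(f, t):", omega2(t))                # gap is controlled by omega2
```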

3 The mathematics behind the effective sample size enlargement

Theorem 5. Consider a binomial model, i.e., $k\hat{p}_k \sim B(k, p)$, and let $f \in C[0,1]$. Note that

$$\mathbb{E}_p f(\hat{p}_k) = \sum_{i=0}^k f\left(\frac{i}{k}\right)\binom{k}{i} p^i (1-p)^{k-i}$$

is a polynomial of degree at most $k$ in $p$, called the Bernstein approximation of order $k$ of the function $f(p)$.

1. Bernstein: For all $f \in C[0,1]$, $\|\mathbb{E}_p f(\hat{p}_k) - f(p)\| \to 0$ as $k \to \infty$.

2. Totik '94, Knoop and Zhou '94: For all $f \in C[0,1]$, $\|\mathbb{E}_p f(\hat{p}_k) - f(p)\| \asymp \omega_\varphi^2(f, \frac{1}{\sqrt{k}})$.

Theorem 6. For any $r < k$, the best approximation error of $f \in C[0,1]$ is upper bounded by

$$\inf_{p_k \in \mathrm{poly}_k} \sup_{x \in [0,1]} |f(x) - p_k(x)| \lesssim \omega_\varphi^r\left(f, \frac{1}{k}\right),$$

where $\mathrm{poly}_k$ is the set of all polynomials of degree at most $k$ on $[0,1]$.

Comparing the two theorems above, the key improvement from the Bernstein approximation to the best approximation shows up in two places: the argument has changed from $1/\sqrt{k}$ to $1/k$, and the modulus order has improved from 2 to an arbitrary number $r$ smaller than $k$. Since increasing the modulus order does not help reduce the approximation error in many situations, the $1/\sqrt{k} \Rightarrow 1/k$ phenomenon usually dominates, as can be seen from the following table.

$f(p)$ | Bernstein approximation error | Best approximation error
$p\log\frac{1}{p}$ | $\frac{1}{k}$ | $\frac{1}{k^2}$
$p^\alpha$, $\alpha > 0$, $\alpha \notin \mathbb{N}$ | $\max\left\{\frac{1}{k}, \frac{1}{k^\alpha}\right\}$ | $\frac{1}{k^{2\alpha}}$
$f(p) = 0$ if $p = 0$, $1$ if $p \ge q$ | $(1-q)^k$ | $\left(\frac{1-\sqrt{q}}{1+\sqrt{q}}\right)^k$
$|p - q|$, $q \in [0,1]$ | $\sqrt{\frac{q(1-q)}{k}} \wedge q \wedge (1-q)$ | $\frac{\sqrt{q(1-q)}}{k} \wedge q \wedge (1-q)$
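The first row of the table is easy to probe numerically. The sketch below (our own illustration) computes the exact Bernstein approximation error of $f(p) = p\log\frac{1}{p}$ and, as a stand-in for the best polynomial approximation (which would require a Remez-type routine), the error of Chebyshev interpolation of the same degree, which is within a logarithmic factor of best. Expect roughly $1/k$ versus $1/k^2$ decay.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev
from scipy.stats import binom

def f(p):
    p = np.asarray(p, dtype=float)
    return np.where(p > 0, -p * np.log(np.where(p > 0, p, 1.0)), 0.0)

grid = np.linspace(0.0, 1.0, 2001)
for k in [8, 16, 32, 64, 128]:
    i = np.arange(k + 1)
    # Bernstein approximation: E_p f(p_hat_k) with k * p_hat_k ~ B(k, p)
    bern = binom.pmf(i[None, :], k, grid[:, None]) @ f(i / k)
    e_bern = float(np.max(np.abs(bern - f(grid))))
    # near-best polynomial of the same degree via Chebyshev interpolation
    cheb = Chebyshev.interpolate(f, k, domain=[0, 1])
    e_near_best = float(np.max(np.abs(cheb(grid) - f(grid))))
    print(k, e_bern, e_near_best)   # roughly 1/k vs. 1/k^2
```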


Now we present a heuristic argument for the effective sample size enlargement phenomenon. Consider the entropy estimation problem. Since the nonsmooth regime is of order $[0, \frac{\log n}{n}]$, if we (approximately) conduct Bernstein approximation over this interval with order $k \asymp \log n$, the overall approximation error after scaling would be

$$\frac{1}{k} \cdot \frac{\log n}{n} \asymp \frac{1}{n}. \tag{3}$$

However, if we use the best approximation of order $k \asymp \log n$ on this interval, we obtain the approximation error

$$\frac{1}{k^2} \cdot \frac{\log n}{n} \asymp \frac{1}{n \ln n}, \tag{4}$$

which justifies the improvement from $n$ to $n \log n$.

Similarly, consider the function $p^\alpha$, $0 < \alpha < 1$. The nonsmooth regime is still $[0, \frac{\log n}{n}]$. The Bernstein approximation error would be

$$\frac{1}{k^\alpha} \cdot \left(\frac{\log n}{n}\right)^\alpha \asymp \frac{1}{n^\alpha}, \tag{5}$$

and the best approximation error would be

$$\frac{1}{k^{2\alpha}} \cdot \left(\frac{\log n}{n}\right)^\alpha \asymp \frac{1}{(n \ln n)^\alpha}. \tag{6}$$

It is also clear why this phenomenon fails to hold for the support size problem: there the improvement is no longer $1/\sqrt{k} \Rightarrow 1/k$, but $q \Rightarrow \sqrt{q}$ in the exponent of the approximation error (compare the third row of the table above: $(1-q)^k \approx e^{-kq}$ versus $\left(\frac{1-\sqrt{q}}{1+\sqrt{q}}\right)^k \approx e^{-2k\sqrt{q}}$).


EE378A Statistical Signal Processing Lecture 3 - 06/08/2017

Lecture 18: Bias Correction with the Jackknife and the Bootstrap

Lecturer: Jiantao Jiao    Scribe: Daria Reshetova

Both the jackknife and the bootstrap are generic methods that can be used to reduce the bias of statistical estimators. However, traditional theory proves incapable of answering whether the bootstrap or the jackknife can reduce the bias of the empirical entropy enough to achieve the minimax rates of entropy estimation. We provide the answers in this lecture using the advanced theory of approximation.

To simplify the presentation, we focus on the binomial model $n\hat{p}_n \sim B(n, p)$ throughout this lecture. However, we emphasize that the methodology applies to any statistical model of interest.

1 Jackknife intuition

Let $e_{1,n}(p) = f(p) - \mathbb{E}_p f(\hat{p}_n)$, $f \in C[0,1]$, be the bias term. Construct the jackknife bias-corrected estimator as

$$\hat{f}_2 = n f(\hat{p}_n) - (n-1) f(\hat{p}_{n-1}).$$

We now heuristically argue that this approach can reduce bias. Suppose that

$$e_{1,n}(p) = \frac{a(p)}{n} + \frac{b(p)}{n^2} + O\left(\frac{1}{n^3}\right), \tag{1}$$

where $a(p)$, $b(p)$ are unknown functions of $p$ that do not depend on $n$. We also have

$$e_{1,n-1}(p) = \frac{a(p)}{n-1} + \frac{b(p)}{(n-1)^2} + O\left(\frac{1}{(n-1)^3}\right). \tag{2}$$

Hence, the overall bias of $\hat{f}_2$ is

$$f(p) - \mathbb{E}_p \hat{f}_2 = n e_{1,n}(p) - (n-1) e_{1,n-1}(p) \tag{3}$$
$$= \frac{b(p)}{n} - \frac{b(p)}{n-1} + O\left(\frac{1}{n^2}\right) \tag{4}$$
$$= -\frac{b(p)}{n(n-1)} + O\left(\frac{1}{n^2}\right), \tag{5}$$

which shows that the bias has been reduced to order $\frac{1}{n^2}$ instead of order $\frac{1}{n}$.

Note: although this result seems solid, it does not capture the dependence on $p$. For example, if $f(p) = p \log\frac{1}{p}$, a standard Taylor expansion shows that the function $b(p)$ and the final $O(\frac{1}{n^2})$ term contain functions of $p$ that explode to infinity as $p \to 0$. Traditional statistical theory thus fails to analyze the uniform bias-reduction properties of the jackknife for functions such as $f(p) = p \log\frac{1}{p}$. This task can be accomplished via the advanced theory of approximation.
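Under the binomial model every expectation here is a finite sum, so the bias reduction (3)–(5) can be verified exactly. A minimal sketch (our own, with $f(p) = p\log\frac{1}{p}$ and a fixed interior $p$); the $1/n^2$ improvement is pointwise in $p$ and degrades as $p \to 0$, consistent with the Note above:

```python
import numpy as np
from scipy.stats import binom

def f(p):
    p = np.asarray(p, dtype=float)
    return np.where(p > 0, -p * np.log(np.where(p > 0, p, 1.0)), 0.0)

def expected_f(m, p):
    """E_p f(p_hat_m), computed exactly from m * p_hat_m ~ B(m, p)."""
    i = np.arange(m + 1)
    return float(binom.pmf(i, m, p) @ f(i / m))

p = 0.3
for n in [10, 20, 40, 80, 160]:
    bias_plugin = float(f(p)) - expected_f(n, p)                   # order 1/n
    e_f2 = n * expected_f(n, p) - (n - 1) * expected_f(n - 1, p)   # E[f_2]
    bias_jack = float(f(p)) - e_f2                                 # order 1/n^2
    print(n, bias_plugin, bias_jack)
```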

2 Approximation-theoretic analysis of Jackknife

Recap:

• $r$th order symmetric difference: $\Delta_h^r f(x) = \sum_{k=0}^r (-1)^k \binom{r}{k} f(x + (r-k)h)$

• $\omega_\varphi^r(f, t) = \sup_{0 < h \le t} \|\Delta_{h\varphi(x)}^r f\|$, $\varphi(x) = \sqrt{x(1-x)}$

• $f(p) = p\log\frac{1}{p} \Rightarrow \omega_\varphi^r(f, t) \asymp t^2$ for fixed $r \ge 2$

Let $n_1 < n_2 < \cdots < n_r = k n_1 = n$, where $k$ is fixed and $n_i \in \mathbb{N}$, and choose $c_1, \ldots, c_r$ such that $\sum_{i=1}^r c_i = 1$ and

$$\sum_{i=1}^r \frac{c_i}{n_i^\rho} = 0 \quad \text{for } \rho \in \mathbb{Z},\ 1 \le \rho \le r-1.$$

The estimator is then

$$\hat{f}_r = \sum_{i=1}^r c_i f(\hat{p}_{n_i}).$$

Note that this framework incorporates the one described in Section 1, with $n_1 = n-1$, $n_2 = n$, $c_1 = -(n-1)$, $c_2 = n$.

Lemma 1. The solutions for $c_i$, $1 \le i \le r$, are given by

$$c_i = \prod_{j \neq i} \frac{n_i}{n_i - n_j}, \quad 1 \le i \le r. \tag{6}$$
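The coefficients in Lemma 1 are precisely Richardson extrapolation to $1/n = 0$: they are the Lagrange basis polynomials at the points $1/n_i$ evaluated at 0. A small numerical check of the defining constraints (the particular $n_i$, chosen here with $n_r = 2 n_1$, are illustrative):

```python
import numpy as np

def jackknife_coeffs(ns):
    """c_i = prod_{j != i} n_i / (n_i - n_j), as in Lemma 1."""
    ns = np.asarray(ns, dtype=float)
    return np.array([np.prod(ni / (ni - np.delete(ns, i)))
                     for i, ni in enumerate(ns)])

ns = np.array([50, 60, 75, 100], dtype=float)   # n_1 < ... < n_r = 2 * n_1
cs = jackknife_coeffs(ns)
print("c:", cs, " sum:", cs.sum())              # coefficients sum to 1
for rho in range(1, len(ns)):                   # and sum_i c_i / n_i^rho = 0, rho = 1..r-1
    print(rho, float(np.sum(cs / ns ** rho)))
```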

Theorem 2. Suppose that $\sum_{i=1}^r |c_i| \le C$, where $C$ is a constant independent of $n$, and fix $r > 0$. Then:

1. $\|f - \mathbb{E}\hat{f}_r\|_\infty \le C\left(\omega_\varphi^{2r}(f, \frac{1}{\sqrt{n}}) + n^{-r}\|f\|\right)$;

2. for fixed $\alpha \in (0, 2r)$: $\|f - \mathbb{E}_p \hat{f}_r\| = O(n^{-\alpha/2}) \Leftrightarrow \omega_\varphi^{2r}(f, t) = O(t^\alpha)$;

3. if $f(p) = p\log\frac{1}{p}$, then $\|f - \mathbb{E}\hat{f}_2\| \asymp \frac{1}{n}$.

There are several interesting interpretations of this result:

1. Iterating the jackknife one more time is equivalent to raising the modulus order by 2.

2. Raising the modulus order does not reduce the error for functions such as $f(p) = p\log\frac{1}{p}$; hence the jackknife does not improve the worst-case bias of the empirical entropy in terms of order.

3. The traditional jackknife, which has $n_1 = n-1$, $n_2 = n$, does not satisfy the conditions of this theorem, since $|c_1| + |c_2| = 2n - 1$ grows with $n$.

We now show that the reason the traditional jackknife is excluded from the previous theorem is that it does not enjoy the nice properties listed there.

Example 3. Denote by $\hat{f}_r$ the $r$th order jackknife with $n_r = n$ and $n_j - n_{j-1} = 1$. Then there exists a fixed function $f \in C[0,1]$ with $\|f\| \le 1$ such that $\|\mathbb{E}\hat{f}_r - f\| \gtrsim n^{r-1}$. Also, there exists $f \in C[0,1]$ such that $\mathrm{Var}(\hat{f}_2) \ge \frac{n^2}{e}$.

This result shows that the traditional jackknife may have very bad bias and variance properties if the function is not very smooth. However, it does not imply that the traditional jackknife fails for functions like $f(p) = p\log\frac{1}{p}$. Since the general theorem does not apply to this situation, one needs to carefully compute the bias of the jackknife estimator for entropy, which proves to be a highly challenging task. We have the following conjecture:

Conjecture 4. Consider $f(p) = p\log\frac{1}{p}$. We conjecture that $\|f - \mathbb{E}\hat{f}_r\| \asymp \frac{1}{n}$ for any fixed $r \ge 2$. This is proved to be true when $r = 2$.


3 Bootstrap

The bootstrap relies on two principles: the plug-in principle and the Monte Carlo principle.

1. Plug-in principle: if $\hat{P}$ is a good estimate of $P$, then $F(\hat{P})$ is a good estimate of $F(P)$.

2. Monte Carlo principle: in order to compute the plug-in estimator $F(\hat{\theta})$ of $F(\theta) = \int g(x)\, dP_\theta$, where $X \sim P_\theta$, we sample $B$ random objects $X_1, X_2, \ldots, X_B \overset{\text{i.i.d.}}{\sim} P_{\hat{\theta}}$ and then use

$$\frac{1}{B}\sum_{i=1}^B g(X_i) \tag{7}$$

to approximately compute $F(\hat{\theta})$.

Now we investigate the bias correction problem. Since $e_1(p) = f(p) - \mathbb{E}_p f(\hat{p}_n)$, the plug-in principle suggests that $e_1(\hat{p}_n) \approx e_1(p)$, which further suggests the bias-corrected estimator

$$\hat{f}_2 = f(\hat{p}_n) + e_1(\hat{p}_n). \tag{8}$$

Note that in general one needs to use the Monte Carlo principle to approximately compute $e_1(\hat{p}_n)$. It is also clear that one can iterate this correction infinitely many times. Indeed, defining $e_{i+1}(p) = e_i(p) - \mathbb{E}e_i(\hat{p}_n)$ (so that $e_2(p) = e_1(p) - \mathbb{E}e_1(\hat{p}_n)$), we have

$$\hat{f}_r = f(\hat{p}_n) + \sum_{i=1}^{r-1} e_i(\hat{p}_n).$$
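A minimal Monte Carlo sketch of one round of the correction (8), for the binomial model with $f(p) = p\log\frac{1}{p}$ (our own illustration; `B` is the number of bootstrap resamples):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(p):
    p = np.asarray(p, dtype=float)
    return np.where(p > 0, -p * np.log(np.where(p > 0, p, 1.0)), 0.0)

def bootstrap_corrected(x, n, B=2000):
    """One round of (8): f_2 = f(p_n) + e_1(p_n), where
    e_1(p_n) = f(p_n) - E_{p_n} f(p_n^*) is estimated by Monte Carlo."""
    p_n = x / n
    p_star = rng.binomial(n, p_n, size=B) / n      # resample under the plug-in p_n
    e1_hat = float(f(p_n)) - float(np.mean(f(p_star)))
    return float(f(p_n)) + e1_hat

p, n = 0.3, 50
xs = rng.binomial(n, p, size=2000)                 # many independent experiments
plug_in = float(np.mean(f(xs / n)))
corrected = float(np.mean([bootstrap_corrected(x, n) for x in xs]))
print("f(p):", float(f(p)), " plug-in:", plug_in, " bootstrap-corrected:", corrected)
```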

This begs the question: how can we understand the bias-correction properties of this bootstrap approach?

1. Question 1: Does $\hat{f}_r$ converge as $r \to \infty$? The answer is YES!

2. Question 2: What does it converge to?

Lemma 5. $\lim_{r \to \infty} \|e_r(p) - (f(p) - L_n[f](p))\| = 0$ for any $f \in C[0,1]$, where $L_n[f](p)$ is the unique Lagrange interpolating polynomial of $f$ at the $n+1$ points $\{0, 1/n, 2/n, \ldots, 1\}$. Since $e_r(p)$ is exactly the bias of $\hat{f}_r$ (by induction on the recursion above), the lemma says that the bias of the limiting estimator is the Lagrange interpolation error $f - L_n[f]$.

3. Question 3: Does $L_n[f](p)$ have good approximation properties?

Example 6 (Bernstein). For $f(p) = |p - 1/2|$, $\lim_{n \to \infty} |f(p) - L_n[f](p)| = \infty$ for every $p$ except $p = 0$, $p = 1/2$, and $p = 1$.
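Example 6 is easy to witness numerically. The following brute-force sketch (our own illustration; the evaluation point $p = 0.93$ is arbitrary, chosen to avoid the equidistant interpolation nodes) evaluates $L_n[f]$ directly:

```python
import numpy as np

def lagrange_at(p, nodes, fvals):
    """Evaluate the Lagrange interpolating polynomial of (nodes, fvals) at p."""
    total = 0.0
    for i, (xi, fi) in enumerate(zip(nodes, fvals)):
        li = np.prod([(p - xj) / (xi - xj) for j, xj in enumerate(nodes) if j != i])
        total += fi * li
    return total

f = lambda x: abs(x - 0.5)
p = 0.93                                  # not a node for any n used below
for n in [10, 20, 40, 60]:
    nodes = np.arange(n + 1) / n          # equidistant nodes {0, 1/n, ..., 1}
    err = abs(f(p) - lagrange_at(p, nodes, [f(x) for x in nodes]))
    print(n, err)                         # grows without bound (Runge-type divergence)
```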

These results show that iterating the bootstrap bias correction too many times may not be a wise idea. What about doing it only a few times? We now show that it then has essentially the same performance as applying the jackknife the same number of times.

Theorem 7. Fix $r > 0$. Then:

1. $\|f(p) - \mathbb{E}_p \hat{f}_r\|_\infty \le C\left(\omega_\varphi^{2r}(f, \frac{1}{\sqrt{n}}) + n^{-r}\|f\|\right)$;

2. for fixed $\alpha \in (0, 2r]$: $\|f(p) - \mathbb{E}_p \hat{f}_r\| = O(n^{-\alpha/2}) \Leftrightarrow \omega_\varphi^{2r}(f, t) = O(t^\alpha)$;

3. if $f(p) = p\log\frac{1}{p}$, then $\|f - \mathbb{E}\hat{f}_r\| \asymp \frac{1}{n}$.
