Tutorial Part I: Information theory meets machine learning


  • Tutorial Part I: Information theory meets machine learning

    Emmanuel Abbe (Princeton University)   Martin Wainwright (UC Berkeley)


  • Introduction

    Era of massive data sets

    Fascinating problems at the interfaces between information theory and statistical machine learning.

    1 Fundamental issues
      Concentration of measure: high-dimensional problems are remarkably predictable
      Curse of dimensionality: without structure, many problems are hopeless
      Low-dimensional structure is essential

    2 Machine learning brings in algorithmic components
      Computational constraints are central
      Memory constraints: need for distributed and decentralized procedures
      Increasing importance of privacy

  • Historical perspective: info. theory and statistics

    Claude Shannon Andrey Kolmogorov

    A rich intersection between information theory and statistics

    1 Hypothesis testing, large deviations
    2 Fisher information, Kullback-Leibler divergence
    3 Metric entropy and Fano's inequality

  • Statistical estimation as channel coding

    Codebook: indexed family of probability distributions $\{Q_\theta \mid \theta \in \Omega\}$

    Codeword: nature chooses some $\theta^* \in \Omega$

    Channel: user observes $n$ i.i.d. draws $X_i \sim Q_{\theta^*}$, collected as $X_1^n \sim Q_{\theta^*}$

    Decoding: estimator $X_1^n \mapsto \widehat{\theta}$ such that $\widehat{\theta}$ is close to $\theta^*$

    Perspective dating back to Kolmogorov (1950s), with many variations:

    codebooks/codewords: graphs, vectors, matrices, functions, densities...

    channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections

    closeness: exact/partial graph recovery in Hamming, $\ell_p$-distances, $L^p(Q)$-distances, sup-norm, etc.
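    The channel-coding analogy can be made concrete with a toy simulation. The sketch below is my own illustration (not from the slides): the Bernoulli codebook, sample size, and seed are arbitrary choices, and the "decoder" is plain maximum likelihood over the finite codebook.

```python
# Toy illustration of estimation as channel decoding (not from the slides):
# a finite "codebook" of Bernoulli distributions, a "codeword" chosen by
# nature, and a maximum-likelihood "decoder" based on n i.i.d. draws.
import numpy as np

rng = np.random.default_rng(0)

codebook = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # family {Q_theta}
theta_star = codebook[3]                          # nature's choice of codeword
n = 200
x = rng.binomial(1, theta_star, size=n)           # channel: n i.i.d. draws

# Decoder: pick the codebook entry with the highest log-likelihood.
loglik = x.sum() * np.log(codebook) + (n - x.sum()) * np.log(1 - codebook)
theta_hat = codebook[np.argmax(loglik)]
print(theta_star, theta_hat)
```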

  • Machine learning: algorithmic issues to forefront!

    1 Efficient algorithms are essential
      only (low-order) polynomial-time methods can ever be implemented
      trade-offs between computational complexity and performance?
      fundamental barriers due to computational complexity?

    2 Distributed procedures are often needed
      many modern data sets: too large to be stored on a single machine
      need methods that operate separately on pieces of data
      trade-offs between decentralization and performance?
      fundamental barriers due to communication complexity?

    3 Privacy and access issues
      conflicts between individual privacy and benefits of aggregation?
      principled information-theoretic formulations of such trade-offs?


  • Part I of tutorial: Three vignettes

    1 Graphical model selection

    2 Sparse principal component analysis

    3 Structured non-parametric regression and minimax theory


  • Vignette A: Graphical model selection

    Simple motivating example: Epidemic modeling

    Disease status of person $s$: $x_s = \begin{cases} +1 & \text{if individual } s \text{ is infected} \\ -1 & \text{if individual } s \text{ is healthy} \end{cases}$

    (1) Independent infection:
    $Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s)$

    (2) Cycle-based infection:
    $Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{(s,t) \in C} \exp(\theta_{st} x_s x_t)$

    (3) Full clique infection:
    $Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{s \neq t} \exp(\theta_{st} x_s x_t)$
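    To make the three infection models concrete, here is a minimal sketch (my own illustration, not part of the tutorial) that evaluates the products above by brute force over all $2^5$ configurations; the parameter values and the choice of cycle $C$ are arbitrary.

```python
# Sketch (not from the slides): evaluate the three epidemic models on 5 nodes
# by brute-force enumeration of the 2^5 sign configurations.
import itertools
import numpy as np

p = 5
theta = 0.2 * np.ones(p)                     # node parameters theta_s
theta_st = 0.5                               # shared edge weight
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
clique = [(s, t) for s in range(p) for t in range(s + 1, p)]

def unnormalized(x, edges):
    """exp{ sum_s theta_s x_s + sum_(s,t) theta_st x_s x_t } for x in {-1,+1}^p."""
    energy = np.dot(theta, x) + theta_st * sum(x[s] * x[t] for s, t in edges)
    return np.exp(energy)

for name, edges in [("independent", []), ("cycle", cycle), ("clique", clique)]:
    weights = np.array([unnormalized(np.array(x), edges)
                        for x in itertools.product([-1, +1], repeat=p)])
    probs = weights / weights.sum()          # normalize to get Q(x_1,...,x_5)
    print(name, "P(all infected) =", round(probs[-1], 4))
```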

  • Possible epidemic patterns (on a square)

  • From epidemic patterns to graphs

  • Underlying graphs

  • Model selection for graphs

    draw $n$ i.i.d. samples from

    $Q(x_1, \ldots, x_p; \theta) \propto \exp\Big\{ \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\}$

    graph $G$ and matrix $[\Theta]_{st} = \theta_{st}$ of edge weights are unknown

    data matrix $X_1^n \in \{-1,+1\}^{n \times p}$

    want estimator $X_1^n \mapsto \widehat{G}$ to minimize the error probability

    $\mathbb{Q}^n\big[\widehat{G}(X_1^n) \neq G\big]$  (prob. that estimated graph differs from truth)

    Channel decoding:

    Think of graphs as codewords, and the graph family as a codebook.

  • Past/on-going work on graph selection

    exact polynomial-time solution on trees (Chow & Liu, 1967)

    testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)

    pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)

    pseudolikelihood and $\ell_1$-regularized neighborhood regression:
      Gaussian case (Meinshausen & Buhlmann, 2006)
      Binary case (Ravikumar, W. & Lafferty, 2006)

    various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)

    lower bounds and inachievability results:
      information-theoretic bounds (Santhanam & W., 2012)
      computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
      phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

  • US Senate network (2004–2006 voting)

  • Experiments for sequences of star graphs

    [Figure: star graphs with p = 9, d = 3 and p = 18, d = 6.]

  • Empirical behavior: Unrescaled plots

    [Figure: star graph, linear fraction of neighbors; curves for p = 64, 100, 225.]

    Plots of success probability versus raw sample size n.

  • Empirical behavior: Appropriately rescaled

    [Figure: star graph, linear fraction of neighbors; curves for p = 64, 100, 225.]

    Plots of success probability versus rescaled sample size $n/(d^2 \log p)$.

  • Some theory: Scaling law for graph selection

    graphs $\mathcal{G}_{p,d}$ with $p$ nodes and maximum degree $d$

    minimum absolute weight $\theta_{\min}$ on edges

    how many samples $n$ are needed to recover the unknown graph?

    Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

    Achievable result: Under regularity conditions, for the graph estimate $\widehat{G}$ produced by $\ell_1$-regularized logistic regression,

    $n > c_u \big(d^2 + \theta_{\min}^{-2}\big) \log p$   (lower bound on sample size)

    $\Longrightarrow\; \mathbb{Q}[\widehat{G} \neq G] \to 0$   (vanishing probability of error)

    Necessary condition: For the graph estimate $\widehat{G}$ produced by any algorithm,

    $n < c_\ell \big(d^2 + \theta_{\min}^{-2}\big) \log p$   (upper bound on sample size)

    $\Longrightarrow\; \mathbb{Q}[\widehat{G} \neq G] \geq 1/2$   (constant probability of error)
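    The achievable result is algorithmic: each node's neighborhood is estimated by regressing that node on all the others with an $\ell_1$ penalty. Below is a minimal sketch of this idea, assuming scikit-learn; the regularization strength, coefficient threshold, and symmetrization rule are illustrative choices, not the calibrated procedure analyzed in the cited papers.

```python
# Sketch: neighborhood selection for a binary graphical model via
# l1-regularized logistic regression (illustrative, not the tuned procedure).
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_graph(X, reg_strength=0.1, tol=1e-3):
    """X: n-by-p matrix with entries in {-1,+1}. Returns a p-by-p 0/1 adjacency estimate."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=int)
    for s in range(p):
        y = (X[:, s] == 1).astype(int)                 # regress node s on the rest
        others = np.delete(np.arange(p), s)
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (reg_strength * n))
        clf.fit(X[:, others], y)
        nbrs = others[np.abs(clf.coef_[0]) > tol]      # nonzero coefficients -> edges
        adj[s, nbrs] = 1
    # symmetrize by taking the union ("OR" rule) of the estimated neighborhoods
    return np.maximum(adj, adj.T)
```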

  • Information-theoretic analysis of graph selection

    Question:

    How to prove lower bounds on graph selection methods?

    Answer:

    Graph selection is an unorthodox channel coding problem.

    codewords/codebook: a graph $G$ in some graph class $\mathcal{G}$

    channel use: draw a sample $X_i = (X_{i1}, \ldots, X_{ip}) \in \{-1,+1\}^p$ from the graph-structured distribution $Q_G$

    decoding problem: use $n$ samples $\{X_1, \ldots, X_n\}$ to correctly distinguish the codeword

    $G \;\longrightarrow\; Q(X \mid G) \;\longrightarrow\; X_1, X_2, \ldots, X_n$

  • Proof sketch: Main ideas for necessary conditions

    based on assessing the difficulty of graph selection over various sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$

    choose $G \in \widetilde{\mathcal{G}}$ uniformly at random, and consider the multi-way hypothesis testing problem based on the data $X_1^n = \{X_1, \ldots, X_n\}$

    for any graph estimator $\psi: \mathcal{X}^n \to \widetilde{\mathcal{G}}$, Fano's inequality implies that

    $\mathbb{Q}[\psi(X_1^n) \neq G] \;\geq\; 1 - \frac{I(X_1^n; G)}{\log |\widetilde{\mathcal{G}}|} - o(1)$

    where $I(X_1^n; G)$ is the mutual information between the observations $X_1^n$ and the randomly chosen graph $G$

    remaining steps:

    1 Construct difficult sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$

    2 Compute or lower bound the log cardinality $\log |\widetilde{\mathcal{G}}|$.

    3 Upper bound the mutual information $I(X_1^n; G)$.
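    As a quick numeric illustration of the Fano bound (my own sketch, not from the slides), one can plug a hypothetical mutual-information bound and ensemble size into the inequality and read off the implied error probability.

```python
# Illustrative use of the Fano bound above: given an upper bound on the
# mutual information I(X_1^n; G) and the ensemble size |G~|, compute the
# implied lower bound on the probability of error (the o(1) term is ignored).
import numpy as np

def fano_error_lower_bound(mutual_info_bound, ensemble_size):
    return 1.0 - mutual_info_bound / np.log(ensemble_size)

# Example: if I(X_1^n; G) <= n * c for some per-sample capacity bound c, the
# bound stays above 1/2 whenever n <= log|G~| / (2 c). All numbers hypothetical.
c, ensemble_size = 0.05, 10**6
for n in [50, 100, 200]:
    print(n, round(fano_error_lower_bound(n * c, ensemble_size), 3))
```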

  • A hard d-clique ensemble

    [Figure: base graph G; graph G_{uv}; graph G_{st}.]

    1 Divide the vertex set $V$ into $p/(d+1)$ groups of size $d+1$, and form the base graph $G$ by making a $(d+1)$-clique $C$ within each group.

    2 Form graph $G_{uv}$ by deleting edge $(u, v)$ from $G$.

    3 Consider the testing problem over the family of graph-structured distributions $\{Q(\cdot\,; G_{st}),\ (s,t) \in C\}$.

    Why is this ensemble hard?

    The Kullback-Leibler divergence between pairs decays exponentially in $d$, unless the minimum edge weight decays as $1/d$.

  • Vignette B: Sparse principal components analysis

    Principal component analysis:

    widely-used method for dimensionality reduction, data compression, etc.; extracts top eigenvectors of the sample covariance matrix

    classical PCA in $p$ dimensions: inconsistent unless $p/n \to 0$

    low-dimensional structure: many applications lead to sparse eigenvectors

    Population model: Rank-one spiked covariance matrix

    $\Sigma \;=\; \underbrace{\nu}_{\text{SNR parameter}} \; \underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_p$

    Sampling model: Draw $n$ i.i.d. zero-mean vectors with $\mathrm{cov}(X_i) = \Sigma$, and form the sample covariance matrix

    $\widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T$, a $p$-dimensional matrix with rank $\leq \min\{n, p\}$
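    A minimal sketch of this sampling model (my own illustration; the dimension, sparsity, and SNR values are arbitrary):

```python
# Sketch: generate data from the rank-one spiked covariance model and form
# the sample covariance matrix. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
p, n, k, snr = 100, 500, 5, 2.0

theta_star = np.zeros(p)
theta_star[:k] = 1.0 / np.sqrt(k)                 # k-sparse, unit-norm spike
Sigma = snr * np.outer(theta_star, theta_star) + np.eye(p)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # n i.i.d. samples
Sigma_hat = (X.T @ X) / n                                  # sample covariance
```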

  • Example: PCA for face compression/recognition

    First 25 standard principal components (estimated from data)

  • Example: PCA for face compression/recognition

    First 25 sparse principal components (estimated from data)

  • Perhaps the simplest estimator.... Diagonal thresholding (Johnstone & Lu, 2008)

    [Figure: diagonal values $\widehat{\Sigma}_{jj}$ plotted against coordinate index $j$.]

    Given $n$ i.i.d. samples $X_i$ with zero mean and spiked covariance $\Sigma = \nu\,\theta^*(\theta^*)^T + I$:

    1 Compute the diagonal entries of the sample covariance: $\widehat{\Sigma}_{jj} = \frac{1}{n} \sum_{i=1}^{n} X_{ij}^2$.

    2 Apply a threshold to the vector $\{\widehat{\Sigma}_{jj},\ j = 1, \ldots, p\}$.
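    A direct implementation of these two steps might look as follows; this is a sketch, and the threshold rule shown (mean plus a few standard deviations of the diagonal entries) is an illustrative choice rather than the calibrated cutoff of Johnstone & Lu.

```python
# Sketch: diagonal thresholding for sparse PCA support recovery.
import numpy as np

def diagonal_thresholding(X, num_sds=3.0):
    """X: n-by-p data matrix (zero mean). Returns indices of retained coordinates."""
    n, p = X.shape
    diag = (X ** 2).mean(axis=0)            # Sigma_hat_jj = (1/n) sum_i X_ij^2
    cutoff = diag.mean() + num_sds * diag.std()   # illustrative threshold
    return np.flatnonzero(diag > cutoff)
```

    The retained coordinates can then be passed to ordinary PCA restricted to that subset.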

  • Diagonal thresholding: unrescaled plots

    [Figure: success probability versus number of observations n for diagonal thresholding, k = O(log p), for p = 100, 200, 300, 600, 1200.]

  • Diagonal thresholding: rescaled plots

    [Figure: success probability versus the rescaled control parameter for diagonal thresholding, k = O(log p), for p = 100, 200, 300, 600, 1200.]

    Scaling is quadratic in sparsity: $n \asymp k^2 \log p$

  • Diagonal thresholding and fundamental limit

    Consider the spiked covariance matrix

    $\Sigma \;=\; \underbrace{\nu}_{\text{signal-to-noise}} \; \underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p \times p}$, where $\theta^* \in B_0(k) \cap B_2(1)$ is $k$-sparse and unit norm

    Theorem (Amini & W., 2009)

    (a) There are thresholds $\tau_\ell^{DT} \leq \tau_u^{DT}$ such that

    $\frac{n}{k^2 \log p} \leq \tau_\ell^{DT} \;\Longrightarrow\;$ DT fails w.h.p.      $\frac{n}{k^2 \log p} \geq \tau_u^{DT} \;\Longrightarrow\;$ DT succeeds w.h.p.

    (b) For the optimal method (exhaustive search), there are thresholds $\tau_\ell^{ES} \leq \tau_u^{ES}$ such that

    $\frac{n}{k \log p} < \tau_\ell^{ES} \;\Longrightarrow\;$ fail w.h.p.      $\frac{n}{k \log p} > \tau_u^{ES} \;\Longrightarrow\;$ succeed w.h.p.

  • One polynomial-time SDP relaxation

    Recall the Courant-Fischer variational formulation:

    $\theta^* = \arg\max_{\|z\|_2 = 1} \Big\{ z^T \underbrace{\big(\nu\,\theta^*(\theta^*)^T + I_{p \times p}\big)}_{\text{population covariance } \Sigma} z \Big\}$

    Equivalently, lifting to matrix variables $Z = z z^T$:

    $\theta^*(\theta^*)^T = \arg\max_{Z \in \mathbb{R}^{p \times p},\, Z = Z^T,\, Z \succeq 0} \big\{ \mathrm{trace}(\Sigma Z) \big\}$  s.t. $\mathrm{trace}(Z) = 1$ and $\mathrm{rank}(Z) = 1$

    Dropping the rank constraint yields a standard SDP relaxation:

    $\widehat{Z} = \arg\max_{Z \in \mathbb{R}^{p \times p},\, Z = Z^T,\, Z \succeq 0} \big\{ \mathrm{trace}(\Sigma Z) \big\}$  s.t. $\mathrm{trace}(Z) = 1$.

    In practice (d'Aspremont et al., 2008):

    apply this relaxation using the sample covariance matrix $\widehat{\Sigma}$

    add the $\ell_1$-constraint $\sum_{i,j=1}^{p} |Z_{ij}| \leq k^2$.
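    This relaxation is easy to prototype with an off-the-shelf convex solver. The sketch below uses cvxpy (my choice of tool; the slides do not prescribe one) and takes any sample covariance matrix Sigma_hat as input; the $\ell_1$ bound mirrors the constraint stated above.

```python
# Sketch: the SDP relaxation for sparse PCA via cvxpy (illustrative).
# Requires a solver that handles SDPs (cvxpy's default SCS does).
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma_hat, k):
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    constraints = [Z >> 0,                       # positive semidefinite
                   cp.trace(Z) == 1,
                   cp.sum(cp.abs(Z)) <= k ** 2]  # l1-constraint from the slide
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma_hat @ Z)), constraints)
    problem.solve()
    # extract a candidate spike direction: top eigenvector of the solution
    eigvals, eigvecs = np.linalg.eigh(Z.value)
    return eigvecs[:, -1]
```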

  • Phase transition for SDP: logarithmic sparsity

    [Figure: SDP relaxation, k = O(log p); success probability versus the rescaled control parameter, for p = 100, 200, 300.]

    Scaling is linear in sparsity: $n \asymp k \log p$

  • A natural question

    Questions:

    Can the logarithmic sparsity or rank-one conditions be removed?

  • Computational lower bound for sparse PCA

    Consider the spiked covariance matrix

    $\Sigma \;=\; \underbrace{\nu}_{\text{signal-to-noise}} \; \underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p \times p}$, where $\theta^* \in B_0(k) \cap B_2(1)$ is $k$-sparse and unit norm

    Sparse PCA detection problem:
    $H_0$ (no signal): $X_i \sim \mathcal{D}(0, I_{p \times p})$
    $H_1$ (spiked signal): $X_i \sim \mathcal{D}(0, \Sigma)$.

    Distribution $\mathcal{D}$ with sub-exponential tail behavior.

    Theorem (Berthet & Rigollet, 2013)

    Under average-case hardness of planted clique, the polynomial-minimax level of detection $\nu_{\mathrm{POLY}}$ is given by

    $\nu_{\mathrm{POLY}} \asymp \sqrt{\frac{k^2 \log p}{n}}$  for all sparsity $\sqrt{\log p} \lesssim k \lesssim \sqrt{p}$

    Classical minimax level: $\nu_{\mathrm{OPT}} \asymp \sqrt{\frac{k \log p}{n}}$.

  • Planted k-clique problem

    [Figure: an Erdos-Renyi random graph and a planted k-clique instance.]

    Binary hypothesis test based on observing a random graph $G$ on $p$ vertices:

    $H_0$: Erdos-Renyi, each edge present at random with prob. 1/2
    $H_1$: Planted $k$-clique, remaining edges random

  • Planted k-clique problem

    [Figure: a random binary matrix and one containing a planted k × k sub-matrix.]

    Binary hypothesis test based on observing a random binary matrix:

    $H_0$: Random $\{0,1\}$ matrix with Ber(1/2) entries off the diagonal
    $H_1$: Planted $k \times k$ sub-matrix
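    A tiny instance generator for the matrix form of the test (my own sketch; the sizes are arbitrary, and the block is planted on independently chosen rows and columns, one common variant) makes the two hypotheses concrete.

```python
# Sketch: generate an instance from H0 (pure Ber(1/2) noise) or H1 (a planted
# k-by-k all-ones sub-matrix on random row/column subsets). Sizes illustrative.
import numpy as np

def planted_submatrix_instance(p, k, planted, rng):
    A = rng.integers(0, 2, size=(p, p))          # Ber(1/2) entries
    if planted:
        rows = rng.choice(p, size=k, replace=False)
        cols = rng.choice(p, size=k, replace=False)
        A[np.ix_(rows, cols)] = 1                # hide a k-by-k block of ones
    return A

rng = np.random.default_rng(2)
A0 = planted_submatrix_instance(200, 15, planted=False, rng=rng)
A1 = planted_submatrix_instance(200, 15, planted=True, rng=rng)
```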

  • Vignette C: (Structured) non-parametric regression

    Goal: How to predict output from covariates?

    given covariates $(x_1, x_2, x_3, \ldots, x_p)$ and an output variable $y$; want to predict $y$ based on $(x_1, \ldots, x_p)$

    Examples: Medical Imaging; Geostatistics; Astronomy; Computational Biology.....

    [Figure: function value versus design value x, fit with (a) a second-order polynomial kernel and (b) a first-order Sobolev (Lipschitz) class.]

    (a) Second-order polynomial fit   (b) Lipschitz function fit

  • High dimensions and sample complexity

    Possible models:

    ordinary linear regression: $y = \underbrace{\sum_{j=1}^{p} \theta_j x_j}_{\langle \theta,\, x \rangle} + w$

    general non-parametric model: $y = f(x_1, x_2, \ldots, x_p) + w$.

    Estimation accuracy: How well can $f$ be estimated using $n$ samples?

    linear models without any structure: error $\|\widehat{\theta} - \theta^*\|_2^2 \asymp p/n$  (linear in $p$)

    with sparsity $k \ll p$: error $\|\widehat{\theta} - \theta^*\|_2^2 \asymp \frac{k \log(ep/k)}{n}$  (logarithmic in $p$)

    non-parametric models: $p$-dimensional, smoothness $\alpha$

    Curse of dimensionality: error $\asymp (1/n)^{\frac{2\alpha}{2\alpha + p}}$  (exponential slow-down)
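    The gap between these scalings is easy to see numerically. The snippet below is my own illustration with all constants set to 1 and arbitrary values of $n$, $p$, $k$, and $\alpha$.

```python
# Illustrative comparison of the three rates above, with all constants set
# to 1: p/n (dense linear), k*log(e*p/k)/n (sparse linear), and the
# non-parametric rate (1/n)^(2*alpha/(2*alpha + p)).
import numpy as np

n, p, k, alpha = 10_000, 1_000, 10, 2.0
dense_linear = p / n
sparse_linear = k * np.log(np.e * p / k) / n
nonparametric = (1.0 / n) ** (2 * alpha / (2 * alpha + p))
print(dense_linear, sparse_linear, nonparametric)
# the non-parametric rate barely improves with n here: (1/n)^(4/1004) ~ 0.96
```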

  • Minimax risk and sample size

    Consider a function class $\mathcal{F}$, and $n$ i.i.d. samples from the model

    $y_i = f^*(x_i) + w_i$, where $f^*$ is some member of $\mathcal{F}$.

    For a given estimator $\{(x_i, y_i)\}_{i=1}^n \mapsto \widehat{f} \in \mathcal{F}$, the worst-case risk in a metric $\rho$ is

    $R_n^{\mathrm{worst}}(\widehat{f}; \mathcal{F}) = \sup_{f^* \in \mathcal{F}} \mathbb{E}\big[\rho^2(\widehat{f}, f^*)\big].$

    Minimax risk

    For a given sample size $n$, the minimax risk is

    $\inf_{\widehat{f}} R_n^{\mathrm{worst}}(\widehat{f}; \mathcal{F}) = \inf_{\widehat{f}} \sup_{f^* \in \mathcal{F}} \mathbb{E}\big[\rho^2(\widehat{f}, f^*)\big]$

    where the infimum is taken over all estimators.


  • How to measure size of function classes?

    A $2\delta$-packing of $\mathcal{F}$ in metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subseteq \mathcal{F}$ such that

    $\rho(f^j, f^k) > 2\delta$ for all $j \neq k$.

    The packing number $M(2\delta)$ is the cardinality of the largest such set.

    Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators)

    Central object in proving minimax lower bounds for non-parametric problems (e.g., Hasminskii & Ibragimov, 1978; Birge, 1983; Yu, 1997; Yang & Barron, 1999)

  • Packing and covering numbers

    [Figure: functions $f^1, f^2, \ldots, f^M$ with pairwise separation $2\delta$.]

    A $2\delta$-packing of $\mathcal{F}$ in metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subseteq \mathcal{F}$ such that $\rho(f^j, f^k) > 2\delta$ for all $j \neq k$. The packing number $M(2\delta)$ is the cardinality of the largest such set.

  • Example: Sup-norm packing for Lipschitz functions

    [Figure: a family of piecewise-linear functions with slopes $\pm L$ used to build the packing.]

    $\delta$-packing set: functions $\{f^1, f^2, \ldots, f^M\}$ such that $\|f^j - f^k\|_\infty > \delta$ for all $j \neq k$.

    For $L$-Lipschitz functions in 1 dimension:

    $M(\delta) \approx 2^{(L/\delta)}$.
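    A rough version of the counting argument behind this scaling (my own sketch, with constants not optimized) runs as follows.

```latex
% Sketch of the packing construction for L-Lipschitz functions on [0,1];
% constants are not optimized.
Partition $[0,1]$ into $m \approx L/(2\delta)$ intervals of width $2\delta/L$.
On each interval, let the function either rise and then fall with slopes $\pm L$
(a ``tent'' of height $\delta$), or take the mirror image (height $-\delta$).
Each sign pattern $\epsilon \in \{-1,+1\}^m$ defines an $L$-Lipschitz function
$f_\epsilon$, and two functions with different sign patterns differ by $2\delta$
at the midpoint of some interval, so
\[
  \|f_\epsilon - f_{\epsilon'}\|_\infty \;\ge\; 2\delta
  \qquad \text{whenever } \epsilon \neq \epsilon'.
\]
Hence
\[
  M(\delta) \;\ge\; 2^{m} \;=\; 2^{\,cL/\delta},
\]
matching the stated scaling $M(\delta) \approx 2^{(L/\delta)}$ up to constants.
```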

  • Standard reduction: from estimation to testing

    [Figure: a $2\delta$-packing $\{f^1, \ldots, f^M\}$ of $\mathcal{F}$.]

    goal: characterize the minimax risk for $\rho$-estimation over $\mathcal{F}$

    construct a $2\delta$-packing of $\mathcal{F}$: a collection $\{f^1, \ldots, f^M\}$ such that $\rho(f^j, f^k) > 2\delta$

    now form an $M$-component mixture distribution as follows:

      draw a packing index $V \in \{1, \ldots, M\}$ uniformly at random

      conditioned on $V = j$, draw $n$ i.i.d. samples $(X_i, Y_i) \sim Q_{f^j}$

    1 Claim: Any estimator $\widehat{f}$ such that $\rho(\widehat{f}, f^V) \leq \delta$ w.h.p. can be used to solve the $M$-ary testing problem.

    2 Use standard techniques (Assouad, Le Cam, Fano, ...) to lower bound the probability of error in the testing problem.

  • Minimax rate via metric entropy matching

    observe $(x_i, y_i)$ pairs from the model $y_i = f^*(x_i) + w_i$

    two possible norms:

    $\|\widehat{f} - f^*\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} \big(\widehat{f}(x_i) - f^*(x_i)\big)^2$, or $\|\widehat{f} - f^*\|_2^2 = \mathbb{E}\big[(\widehat{f}(X) - f^*(X))^2\big]$.

    Metric entropy master equation

    For many regression problems, the minimax rate $\delta_n > 0$ is determined by solving the master equation

    $\log M(\delta; \mathcal{F}, \rho) \asymp n \delta^2$.

    basic idea (with Hellinger metric) dates back to Le Cam (1973)

    elegant general version due to Yang and Barron (1999)
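    The master equation can also be solved numerically for any metric entropy one can evaluate. The sketch below is my own illustration: it solves $\log M(\delta) = n\delta^2$ by bisection, using the entropy scalings from the examples that follow with all constants set to 1.

```python
# Sketch: solve the master equation log M(delta) = n * delta^2 by bisection,
# for a user-supplied metric-entropy function (constants all set to 1).
import numpy as np

def solve_master_equation(log_packing, n, lo=1e-8, hi=1e3, iters=200):
    """Find delta with log_packing(delta) = n * delta^2 (log_packing non-increasing)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if log_packing(mid) > n * mid ** 2:
            lo = mid            # entropy still dominates: increase delta
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Entropy scalings from the examples below (constants suppressed):
k, p, alpha, n = 10, 1_000, 2.0, 10_000
sparse_linear = lambda d: k * np.log(np.e * p / k)        # ~ k log(ep/k)
smooth_1d     = lambda d: (1.0 / d) ** (1.0 / alpha)      # ~ (1/delta)^(1/alpha)

for name, f in [("sparse linear", sparse_linear), ("alpha-smooth", smooth_1d)]:
    delta = solve_master_equation(f, n)
    print(name, "delta^2 ~", round(delta ** 2, 6))
```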


  • Example 1: Sparse linear regression

    Observations $y_i = \langle x_i, \theta^* \rangle + w_i$, where

    $\theta^* \in B_0(k) \cap B_2(1) := \big\{ \theta \in \mathbb{R}^p \mid \|\theta\|_0 \leq k \text{ and } \|\theta\|_2 \leq 1 \big\}$.

    Gilbert-Varshamov: one can construct a $2\delta$-separated set with

    $\log M(2\delta) \gtrsim k \log(ep/k)$

    Master equation and minimax rate

    $\log M(\delta) \asymp n \delta^2 \;\Longrightarrow\; \delta^2 \asymp \frac{k \log(ep/k)}{n}$.

    Polynomial-time achievability:

    by $\ell_1$-relaxations under restricted eigenvalue (RE) conditions (Candes & Tao, 2007; Bickel et al., 2009; Buhlmann & van de Geer, 2011)

    achieve minimax-optimal rates for $\ell_2$-error (Raskutti, W., & Yu, 2011)

    $\ell_1$-methods can be sub-optimal for the prediction error $\|X(\widehat{\theta} - \theta^*)\|^2 / n$ (Zhang, W. & Jordan, 2014)

  • Example 2: $\alpha$-smooth non-parametric regression

    Observations $y_i = f^*(x_i) + w_i$, where

    $f^* \in \mathcal{F}(\alpha) = \big\{ f: [0,1] \to \mathbb{R} \mid f \text{ is } \alpha\text{-times differentiable, with } \textstyle\sum_{j=0}^{\alpha} \|f^{(j)}\| \leq C \big\}$.

    Classical results in approximation theory (e.g., Kolmogorov & Tikhomirov, 1959):

    $\log M(\delta; \mathcal{F}) \asymp (1/\delta)^{1/\alpha}$

    Master equation and minimax rate

    $\log M(\delta) \asymp n \delta^2 \;\Longrightarrow\; \delta^2 \asymp (1/n)^{\frac{2\alpha}{2\alpha+1}}$.


  • Example 3: Sparse additive and $\alpha$-smooth

    Observations $y_i = f^*(x_i) + w_i$, where $f^*$ satisfies the constraints

    Additively decomposable: $f^*(x_1, \ldots, x_p) = \sum_{j=1}^{p} g_j^*(x_j)$

    Sparse: At most $k$ of $(g_1^*, \ldots, g_p^*)$ are non-zero

    Smooth: Each $g_j^*$ belongs to the $\alpha$-smooth family $\mathcal{F}(\alpha)$

    Combining the previous results yields a $2\delta$-packing with

    $\log M(2\delta; \mathcal{F}) \asymp k (1/\delta)^{1/\alpha} + k \log\big(\tfrac{ep}{k}\big)$

    Master equation and minimax rate

    Solving $\log M(\delta) \asymp n \delta^2$ yields

    $\delta^2 \;\asymp\; \underbrace{k (1/n)^{\frac{2\alpha}{2\alpha+1}}}_{\text{$k$-component estimation}} \;+\; \underbrace{\frac{k \log(ep/k)}{n}}_{\text{search complexity}}$

  • Summary

    To be provided during the tutorial....