Tutorial Part I: Information theory meets machine learning


  • Tutorial Part I: Information theory meets machine learning

    Emmanuel Abbe (Princeton University)   Martin Wainwright (UC Berkeley)


  • Introduction

    Era of massive data sets

    Fascinating problems at the interfaces between information theory and statistical machine learning.

    1 Fundamental issues
      Concentration of measure: high-dimensional problems are remarkably predictable
      Curse of dimensionality: without structure, many problems are hopeless
      Low-dimensional structure is essential

    2 Machine learning brings in algorithmic components
      Computational constraints are central
      Memory constraints: need for distributed and decentralized procedures
      Increasing importance of privacy

  • Historical perspective: info. theory and statistics

    Claude Shannon Andrey Kolmogorov

    A rich intersection between information theory and statistics

    1 Hypothesis testing, large deviations
    2 Fisher information, Kullback-Leibler divergence
    3 Metric entropy and Fano's inequality

  • Statistical estimation as channel coding

    Codebook: indexed family of probability distributions $\{Q_\theta \mid \theta \in \Omega\}$

    Codeword: nature chooses some $\theta^* \in \Omega$

    Channel: user observes $n$ i.i.d. draws $X_i \sim Q_{\theta^*}$, collected as $X_1^n \sim Q_{\theta^*}$

    Decoding: estimator $X_1^n \mapsto \widehat{\theta}$ such that $\widehat{\theta}$ is close to $\theta^*$

    Perspective dating back to Kolmogorov (1950s), with many variations:

    codebooks/codewords: graphs, vectors, matrices, functions, densities...

    channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections

    closeness: exact/partial graph recovery in Hamming, $\ell_p$-distances, $L^p(Q)$-distances, sup-norm, etc.
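    The channel-coding analogy can be made concrete with a toy simulation. The sketch below is my own illustration (not from the slides): the Bernoulli codebook, sample size, and seed are arbitrary choices, and the "decoder" is plain maximum likelihood over the finite codebook.

```python
# Toy illustration of estimation as channel decoding (not from the slides):
# a finite "codebook" of Bernoulli distributions, a "codeword" chosen by
# nature, and a maximum-likelihood "decoder" based on n i.i.d. draws.
import numpy as np

rng = np.random.default_rng(0)

codebook = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # family {Q_theta}
theta_star = codebook[3]                          # nature's choice of codeword
n = 200
x = rng.binomial(1, theta_star, size=n)           # channel: n i.i.d. draws

# Decoder: pick the codebook entry with the highest log-likelihood.
loglik = x.sum() * np.log(codebook) + (n - x.sum()) * np.log(1 - codebook)
theta_hat = codebook[np.argmax(loglik)]
print(theta_star, theta_hat)
```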

  • Machine learning: algorithmic issues to forefront!

    1 Efficient algorithms are essential
      only (low-order) polynomial-time methods can ever be implemented
      trade-offs between computational complexity and performance?
      fundamental barriers due to computational complexity?

    2 Distributed procedures are often needed
      many modern data sets: too large to be stored on a single machine
      need methods that operate separately on pieces of data
      trade-offs between decentralization and performance?
      fundamental barriers due to communication complexity?

    3 Privacy and access issues
      conflicts between individual privacy and benefits of aggregation?
      principled information-theoretic formulations of such trade-offs?


  • Part I of tutorial: Three vignettes

    1 Graphical model selection

    2 Sparse principal component analysis

    3 Structured non-parametric regression and minimax theory


  • Vignette A: Graphical model selection

    Simple motivating example: Epidemic modeling

    Disease status of person $s$: $x_s = \begin{cases} +1 & \text{if individual } s \text{ is infected} \\ -1 & \text{if individual } s \text{ is healthy} \end{cases}$

    (1) Independent infection:
    $Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s)$

    (2) Cycle-based infection:
    $Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{(s,t) \in C} \exp(\theta_{st} x_s x_t)$

    (3) Full clique infection:
    $Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{s \neq t} \exp(\theta_{st} x_s x_t)$
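    To make the three infection models concrete, here is a minimal sketch (my own illustration, not part of the tutorial) that evaluates the products above by brute force over all $2^5$ configurations; the parameter values and the choice of cycle $C$ are arbitrary.

```python
# Sketch (not from the slides): evaluate the three epidemic models on 5 nodes
# by brute-force enumeration of the 2^5 sign configurations.
import itertools
import numpy as np

p = 5
theta = 0.2 * np.ones(p)                     # node parameters theta_s
theta_st = 0.5                               # shared edge weight
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
clique = [(s, t) for s in range(p) for t in range(s + 1, p)]

def unnormalized(x, edges):
    """exp{ sum_s theta_s x_s + sum_(s,t) theta_st x_s x_t } for x in {-1,+1}^p."""
    energy = np.dot(theta, x) + theta_st * sum(x[s] * x[t] for s, t in edges)
    return np.exp(energy)

for name, edges in [("independent", []), ("cycle", cycle), ("clique", clique)]:
    weights = np.array([unnormalized(np.array(x), edges)
                        for x in itertools.product([-1, +1], repeat=p)])
    probs = weights / weights.sum()          # normalize to get Q(x_1,...,x_5)
    print(name, "P(all infected) =", round(probs[-1], 4))
```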

  • Possible epidemic patterns (on a square)

  • From epidemic patterns to graphs

  • Underlying graphs

  • Model selection for graphs

    draw $n$ i.i.d. samples from

    $Q(x_1, \ldots, x_p; \theta) \propto \exp\Big\{ \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\}$

    graph $G$ and matrix $[\Theta]_{st} = \theta_{st}$ of edge weights are unknown

    data matrix $X_1^n \in \{-1,+1\}^{n \times p}$

    want estimator $X_1^n \mapsto \widehat{G}$ to minimize the error probability

    $\mathbb{Q}^n\big[\widehat{G}(X_1^n) \neq G\big]$  (prob. that estimated graph differs from truth)

    Channel decoding:

    Think of graphs as codewords, and the graph family as a codebook.

  • Past/on-going work on graph selection

    exact polynomial-time solution on trees (Chow & Liu, 1967)

    testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)

    pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)

    pseudolikelihood and $\ell_1$-regularized neighborhood regression:
      Gaussian case (Meinshausen & Buhlmann, 2006)
      Binary case (Ravikumar, W. & Lafferty, 2006)

    various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)

    lower bounds and inachievability results:
      information-theoretic bounds (Santhanam & W., 2012)
      computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
      phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

  • US Senate network (2004–2006 voting)

  • Experiments for sequences of star graphs

    [Figure: star graphs with p = 9, d = 3 and p = 18, d = 6.]

  • Empirical behavior: Unrescaled plots

    [Figure: star graph, linear fraction of neighbors; curves for p = 64, 100, 225.]

    Plots of success probability versus raw sample size n.

  • Empirical behavior: Appropriately rescaled

    [Figure: star graph, linear fraction of neighbors; curves for p = 64, 100, 225.]

    Plots of success probability versus rescaled sample size $n/(d^2 \log p)$.

  • Some theory: Scaling law for graph selection

    graphs $\mathcal{G}_{p,d}$ with $p$ nodes and maximum degree $d$

    minimum absolute weight $\theta_{\min}$ on edges

    how many samples $n$ are needed to recover the unknown graph?

    Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

    Achievable result: Under regularity conditions, for the graph estimate $\widehat{G}$ produced by $\ell_1$-regularized logistic regression,

    $n > c_u \big(d^2 + \theta_{\min}^{-2}\big) \log p$   (lower bound on sample size)

    $\Longrightarrow\; \mathbb{Q}[\widehat{G} \neq G] \to 0$   (vanishing probability of error)

    Necessary condition: For the graph estimate $\widehat{G}$ produced by any algorithm,

    $n < c_\ell \big(d^2 + \theta_{\min}^{-2}\big) \log p$   (upper bound on sample size)

    $\Longrightarrow\; \mathbb{Q}[\widehat{G} \neq G] \geq 1/2$   (constant probability of error)
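    The achievable result is algorithmic: each node's neighborhood is estimated by regressing that node on all the others with an $\ell_1$ penalty. Below is a minimal sketch of this idea, assuming scikit-learn; the regularization strength, coefficient threshold, and symmetrization rule are illustrative choices, not the calibrated procedure analyzed in the cited papers.

```python
# Sketch: neighborhood selection for a binary graphical model via
# l1-regularized logistic regression (illustrative, not the tuned procedure).
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_graph(X, reg_strength=0.1, tol=1e-3):
    """X: n-by-p matrix with entries in {-1,+1}. Returns a p-by-p 0/1 adjacency estimate."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=int)
    for s in range(p):
        y = (X[:, s] == 1).astype(int)                 # regress node s on the rest
        others = np.delete(np.arange(p), s)
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (reg_strength * n))
        clf.fit(X[:, others], y)
        nbrs = others[np.abs(clf.coef_[0]) > tol]      # nonzero coefficients -> edges
        adj[s, nbrs] = 1
    # symmetrize by taking the union ("OR" rule) of the estimated neighborhoods
    return np.maximum(adj, adj.T)
```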

  • Information-theoretic analysis of graph selection

    Question:

    How to prove lower bounds on graph selection methods?

    Answer:

    Graph selection is an unorthodox channel coding problem.

    codewords/codebook: a graph $G$ in some graph class $\mathcal{G}$

    channel use: draw a sample $X_i = (X_{i1}, \ldots, X_{ip}) \in \{-1,+1\}^p$ from the graph-structured distribution $Q_G$

    decoding problem: use $n$ samples $\{X_1, \ldots, X_n\}$ to correctly distinguish the codeword

    $G \;\longrightarrow\; Q(X \mid G) \;\longrightarrow\; X_1, X_2, \ldots, X_n$

  • Proof sketch: Main ideas for necessary conditions

    based on assessing the difficulty of graph selection over various sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$

    choose $G \in \widetilde{\mathcal{G}}$ uniformly at random, and consider the multi-way hypothesis testing problem based on the data $X_1^n = \{X_1, \ldots, X_n\}$

    for any graph estimator $\psi: \mathcal{X}^n \to \widetilde{\mathcal{G}}$, Fano's inequality implies that

    $\mathbb{Q}[\psi(X_1^n) \neq G] \;\geq\; 1 - \frac{I(X_1^n; G)}{\log |\widetilde{\mathcal{G}}|} - o(1)$

    where $I(X_1^n; G)$ is the mutual information between the observations $X_1^n$ and the randomly chosen graph $G$

    remaining steps:

    1 Construct difficult sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$

    2 Compute or lower bound the log cardinality $\log |\widetilde{\mathcal{G}}|$.

    3 Upper bound the mutual information $I(X_1^n; G)$.
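    As a quick numeric illustration of the Fano bound (my own sketch, not from the slides), one can plug a hypothetical mutual-information bound and ensemble size into the inequality and read off the implied error probability.

```python
# Illustrative use of the Fano bound above: given an upper bound on the
# mutual information I(X_1^n; G) and the ensemble size |G~|, compute the
# implied lower bound on the probability of error (the o(1) term is ignored).
import numpy as np

def fano_error_lower_bound(mutual_info_bound, ensemble_size):
    return 1.0 - mutual_info_bound / np.log(ensemble_size)

# Example: if I(X_1^n; G) <= n * c for some per-sample capacity bound c, the
# bound stays above 1/2 whenever n <= log|G~| / (2 c). All numbers hypothetical.
c, ensemble_size = 0.05, 10**6
for n in [50, 100, 200]:
    print(n, round(fano_error_lower_bound(n * c, ensemble_size), 3))
```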

  • A hard d-clique ensemble

    [Figure: base graph G; graph G_{uv}; graph G_{st}.]

    1 Divide the vertex set $V$ into $p/(d+1)$ groups of size $d+1$, and form the base graph $G$ by making a $(d+1)$-clique $C$ within each group.

    2 Form graph $G_{uv}$ by deleting edge $(u, v)$ from $G$.

    3 Consider the testing problem over the family of graph-structured distributions $\{Q(\cdot\,; G_{st}),\ (s,t) \in C\}$.

    Why is this ensemble hard?

    The Kullback-Leibler divergence between pairs decays exponentially in $d$, unless the minimum edge weight decays as $1/d$.

  • Vignette B: Sparse principal components analysis

    Principal component analysis:

    widely-used method for dimensionality reduction, data compression, etc.; extracts top eigenvectors of the sample covariance matrix

    classical PCA in $p$ dimensions: inconsistent unless $p/n \to 0$

    low-dimensional structure: many applications lead to sparse eigenvectors

    Population model: Rank-one spiked covariance matrix

    $\Sigma \;=\; \underbrace{\nu}_{\text{SNR parameter}} \; \underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_p$

    Sampling model: Draw $n$ i.i.d. zero-mean vectors with $\mathrm{cov}(X_i) = \Sigma$, and form the sample covariance matrix

    $\widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T$, a $p$-dimensional matrix with rank $\leq \min\{n, p\}$
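    A minimal sketch of this sampling model (my own illustration; the dimension, sparsity, and SNR values are arbitrary):

```python
# Sketch: generate data from the rank-one spiked covariance model and form
# the sample covariance matrix. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
p, n, k, snr = 100, 500, 5, 2.0

theta_star = np.zeros(p)
theta_star[:k] = 1.0 / np.sqrt(k)                 # k-sparse, unit-norm spike
Sigma = snr * np.outer(theta_star, theta_star) + np.eye(p)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # n i.i.d. samples
Sigma_hat = (X.T @ X) / n                                  # sample covariance
```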

  • Example: PCA for face compression/recognition

    First 25 standard principal components (estimated from data)

  • Example: PCA for face compression/recognition

    First 25 sparse principal components (estimated from data)

  • Perhaps the simplest estimator.... Diagonal thresholding (Johnstone & Lu, 2008)

    [Figure: diagonal values $\widehat{\Sigma}_{jj}$ plotted against coordinate index $j$.]

    Given $n$ i.i.d. samples $X_i$ with zero mean and spiked covariance $\Sigma = \nu\,\theta^*(\theta^*)^T + I$:

    1 Compute the diagonal entries of the sample covariance: $\widehat{\Sigma}_{jj} = \frac{1}{n} \sum_{i=1}^{n} X_{ij}^2$.

    2 Apply a threshold to the vector $\{\widehat{\Sigma}_{jj},\ j = 1, \ldots, p\}$.
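    A direct implementation of these two steps might look as follows; this is a sketch, and the threshold rule shown (mean plus a few standard deviations of the diagonal entries) is an illustrative choice rather than the calibrated cutoff of Johnstone & Lu.

```python
# Sketch: diagonal thresholding for sparse PCA support recovery.
import numpy as np

def diagonal_thresholding(X, num_sds=3.0):
    """X: n-by-p data matrix (zero mean). Returns indices of retained coordinates."""
    n, p = X.shape
    diag = (X ** 2).mean(axis=0)            # Sigma_hat_jj = (1/n) sum_i X_ij^2
    cutoff = diag.mean() + num_sds * diag.std()   # illustrative threshold
    return np.flatnonzero(diag > cutoff)
```

    The retained coordinates can then be passed to ordinary PCA restricted to that subset.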

  • Diagonal thresholding: unrescaled plots

    [Figure: success probability versus number of observations n for diagonal thresholding, k = O(log p), for p = 100, 200, 300, 600, 1200.]

  • Diagonal thresholding: rescaled plots

    [Figure: success probability versus the rescaled control parameter for diagonal thresholding, k = O(log p), for p = 100, 200, 300, 600, 1200.]

    Scaling is quadratic in sparsity: $n \asymp k^2 \log p$

  • Diagonal thresholding and fundamental limit

    Consider the spiked covariance matrix

    $\Sigma \;=\; \underbrace{\nu}_{\text{signal-to-noise}} \; \underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p \times p}$, where $\theta^* \in B_0(k) \cap B_2(1)$ is $k$-sparse and unit norm

    Theorem (Amini & W., 2009)

    (a) There are thresholds $\tau_\ell^{DT} \leq \tau_u^{DT}$ such that

    $\frac{n}{k^2 \log p} \leq \tau_\ell^{DT} \;\Longrightarrow\;$ DT fails w.h.p.      $\frac{n}{k^2 \log p} \geq \tau_u^{DT} \;\Longrightarrow\;$ DT succeeds w.h.p.

    (b) For the optimal method (exhaustive search), there are thresholds $\tau_\ell^{ES} \leq \tau_u^{ES}$ such that

    $\frac{n}{k \log p} < \tau_\ell^{ES} \;\Longrightarrow\;$ fail w.h.p.      $\frac{n}{k \log p} > \tau_u^{ES} \;\Longrightarrow\;$ succeed w.h.p.

  • One polynomial-time SDP relaxation

    Recall the Courant-Fischer variational formulation:

    $\theta^* = \arg\max_{\|z\|_2 = 1} \Big\{ z^T \underbrace{\big(\nu\,\theta^*(\theta^*)^T + I_{p \times p}\big)}_{\text{population covariance } \Sigma} z \Big\}$

    Equivalently, lifting to matrix variables $Z = z z^T$:

    $\theta^*(\theta^*)^T = \arg\max_{Z \in \mathbb{R}^{p \times p},\, Z = Z^T,\, Z \succeq 0} \big\{ \mathrm{trace}(\Sigma Z) \big\}$  s.t. $\mathrm{trace}(Z) = 1$ and $\mathrm{rank}(Z) = 1$

    Dropping the rank constraint yields a standard SDP relaxation:

    $\widehat{Z} = \arg\max_{Z \in \mathbb{R}^{p \times p},\, Z = Z^T,\, Z \succeq 0} \big\{ \mathrm{trace}(\Sigma Z) \big\}$  s.t. $\mathrm{trace}(Z) = 1$.

    In practice (d'Aspremont et al., 2008):

    apply this relaxation using the sample covariance matrix $\widehat{\Sigma}$

    add the $\ell_1$-constraint $\sum_{i,j=1}^{p} |Z_{ij}| \leq k^2$.
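    This relaxation is easy to prototype with an off-the-shelf convex solver. The sketch below uses cvxpy (my choice of tool; the slides do not prescribe one) and takes any sample covariance matrix Sigma_hat as input; the $\ell_1$ bound mirrors the constraint stated above.

```python
# Sketch: the SDP relaxation for sparse PCA via cvxpy (illustrative).
# Requires a solver that handles SDPs (cvxpy's default SCS does).
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma_hat, k):
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    constraints = [Z >> 0,                       # positive semidefinite
                   cp.trace(Z) == 1,
                   cp.sum(cp.abs(Z)) <= k ** 2]  # l1-constraint from the slide
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma_hat @ Z)), constraints)
    problem.solve()
    # extract a candidate spike direction: top eigenvector of the solution
    eigvals, eigvecs = np.linalg.eigh(Z.value)
    return eigvecs[:, -1]
```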

  • Phase transition for SDP: logarithmic sparsity

    [Figure: SDP relaxation, k = O(log p); success probability versus the rescaled control parameter, for p = 100, 200, 300.]

    Scaling is linear in sparsity: $n \asymp k \log p$

  • A natural question

    Questions:

    Can the logarithmic sparsity or rank-one conditions be removed?

  • Computational lower bound for sparse PCA

    Consider the spiked covariance matrix

    $\Sigma \;=\; \underbrace{\nu}_{\text{signal-to-noise}} \; \underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p \times p}$, where $\theta^* \in B_0(k) \cap B_2(1)$ is $k$-sparse and unit norm

    Sparse PCA detection problem:
    $H_0$ (no signal): $X_i \sim \mathcal{D}(0, I_{p \times p})$
    $H_1$ (spiked signal): $X_i \sim \mathcal{D}(0, \Sigma)$.

    Distribution $\mathcal{D}$ with sub-exponential tail behavior.

    Theorem (Berthet & Rigollet, 2013)

    Under average-case hardness of planted clique, the polynomial-minimax level of detection $\nu_{\mathrm{POLY}}$ is given by

    $\nu_{\mathrm{POLY}} \asymp \sqrt{\frac{k^2 \log p}{n}}$  for all sparsity $\sqrt{\log p} \lesssim k \lesssim \sqrt{p}$

    Classical minimax level: $\nu_{\mathrm{OPT}} \asymp \sqrt{\frac{k \log p}{n}}$.

  • Planted k-clique problem

    [Figure: an Erdos-Renyi random graph and a planted k-clique instance.]

    Binary hypothesis test based on observing a random graph $G$ on $p$ vertices:

    $H_0$: Erdos-Renyi, each edge present at random with prob. 1/2
    $H_1$: Planted $k$-clique, remaining edges random

  • Planted k-clique problem

    [Figure: a random binary matrix and one containing a planted k × k sub-matrix.]

    Binary hypothesis test based on observing a random binary matrix:

    $H_0$: Random $\{0,1\}$ matrix with Ber(1/2) entries off the diagonal
    $H_1$: Planted $k \times k$ sub-matrix
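    A tiny instance generator for the matrix form of the test (my own sketch; the sizes are arbitrary, and the block is planted on independently chosen rows and columns, one common variant) makes the two hypotheses concrete.

```python
# Sketch: generate an instance from H0 (pure Ber(1/2) noise) or H1 (a planted
# k-by-k all-ones sub-matrix on random row/column subsets). Sizes illustrative.
import numpy as np

def planted_submatrix_instance(p, k, planted, rng):
    A = rng.integers(0, 2, size=(p, p))          # Ber(1/2) entries
    if planted:
        rows = rng.choice(p, size=k, replace=False)
        cols = rng.choice(p, size=k, replace=False)
        A[np.ix_(rows, cols)] = 1                # hide a k-by-k block of ones
    return A

rng = np.random.default_rng(2)
A0 = planted_submatrix_instance(200, 15, planted=False, rng=rng)
A1 = planted_submatrix_instance(200, 15, planted=True, rng=rng)
```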

  • Vignette C: (Structured) non-parametric regression

    Goal: How to predict output from covariates?

    given covariates $(x_1, x_2, x_3, \ldots, x_p)$ and an output variable $y$; want to predict $y$ based on $(x_1, \ldots, x_p)$

    Examples: Medical Imaging; Geostatistics; Astronomy; Computational Biology.....

    [Figure: function value versus design value x, fit with (a) a second-order polynomial kernel and (b) a first-order Sobolev (Lipschitz) class.]

    (a) Second-order polynomial fit   (b) Lipschitz function fit

  • High dimensions and sample complexity

    Possible models:

    ordinary linear regression: $y = \underbrace{\sum_{j=1}^{p} \theta_j x_j}_{\langle \theta,\, x \rangle} + w$

    general non-parametric model: $y = f(x_1, x_2, \ldots, x_p) + w$.

    Estimation accuracy: How well can $f$ be estimated using $n$ samples?

    linear models without any structure: error $\|\widehat{\theta} - \theta^*\|_2^2 \asymp p/n$  (linear in $p$)

    with sparsity $k \ll p$: error $\|\widehat{\theta} - \theta^*\|_2^2 \asymp \frac{k \log(ep/k)}{n}$  (logarithmic in $p$)

    non-parametric models: $p$-dimensional, smoothness $\alpha$

    Curse of dimensionality: error $\asymp (1/n)^{\frac{2\alpha}{2\alpha + p}}$  (exponential slow-down)
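    The gap between these scalings is easy to see numerically. The snippet below is my own illustration with all constants set to 1 and arbitrary values of $n$, $p$, $k$, and $\alpha$.

```python
# Illustrative comparison of the three rates above, with all constants set
# to 1: p/n (dense linear), k*log(e*p/k)/n (sparse linear), and the
# non-parametric rate (1/n)^(2*alpha/(2*alpha + p)).
import numpy as np

n, p, k, alpha = 10_000, 1_000, 10, 2.0
dense_linear = p / n
sparse_linear = k * np.log(np.e * p / k) / n
nonparametric = (1.0 / n) ** (2 * alpha / (2 * alpha + p))
print(dense_linear, sparse_linear, nonparametric)
# the non-parametric rate barely improves with n here: (1/n)^(4/1004) ~ 0.96
```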

  • Minimax risk and sample size

    Consider a function class $\mathcal{F}$, and $n$ i.i.d. samples from the model

    $y_i = f^*(x_i) + w_i$, where $f^*$ is some member of $\mathcal{F}$.

    For a given estimator $\{(x_i, y_i)\}_{i=1}^n \mapsto \widehat{f} \in \mathcal{F}$, the worst-case risk in a metric $\rho$ is

    $R_n^{\mathrm{worst}}(\widehat{f}; \mathcal{F}) = \sup_{f^* \in \mathcal{F}} \mathbb{E}\big[\rho^2(\widehat{f}, f^*)\big].$

    Minimax risk

    For a given sample size $n$, the minimax risk is

    $\inf_{\widehat{f}} R_n^{\mathrm{worst}}(\widehat{f}; \mathcal{F}) = \inf_{\widehat{f}} \sup_{f^* \in \mathcal{F}} \mathbb{E}\big[\rho^2(\widehat{f}, f^*)\big]$

    where the infimum is taken over all estimators.


  • How to measure size of function classes?

    A $2\delta$-packing of $\mathcal{F}$ in metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subseteq \mathcal{F}$ such that

    $\rho(f^j, f^k) > 2\delta$ for all $j \neq k$.

    The packing number $M(2\delta)$ is the cardinality of the largest such set.

    Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators)

    Central object in proving minimax lower bounds for non-parametric problems (e.g., Hasminskii & Ibragimov, 1978; Birge, 1983; Yu, 1997; Yang & Barron, 1999)

  • Packing and covering numbers

    [Figure: functions $f^1, f^2, \ldots, f^M$ with pairwise separation $2\delta$.]

    A $2\delta$-packing of $\mathcal{F}$ in metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subseteq \mathcal{F}$ such that $\rho(f^j, f^k) > 2\delta$ for all $j \neq k$. The packing number $M(2\delta)$ is the cardinality of the largest such set.

  • Example: Sup-norm packing for Lipschitz functions

    [Figure: a family of piecewise-linear functions with slopes $\pm L$ used to build the packing.]

    $\delta$-packing set: functions $\{f^1, f^2, \ldots, f^M\}$ such that $\|f^j - f^k\|_\infty > \delta$ for all $j \neq k$.

    For $L$-Lipschitz functions in 1 dimension:

    $M(\delta) \approx 2^{(L/\delta)}$.
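    A rough version of the counting argument behind this scaling (my own sketch, with constants not optimized) runs as follows.

```latex
% Sketch of the packing construction for L-Lipschitz functions on [0,1];
% constants are not optimized.
Partition $[0,1]$ into $m \approx L/(2\delta)$ intervals of width $2\delta/L$.
On each interval, let the function either rise and then fall with slopes $\pm L$
(a ``tent'' of height $\delta$), or take the mirror image (height $-\delta$).
Each sign pattern $\epsilon \in \{-1,+1\}^m$ defines an $L$-Lipschitz function
$f_\epsilon$, and two functions with different sign patterns differ by $2\delta$
at the midpoint of some interval, so
\[
  \|f_\epsilon - f_{\epsilon'}\|_\infty \;\ge\; 2\delta
  \qquad \text{whenever } \epsilon \neq \epsilon'.
\]
Hence
\[
  M(\delta) \;\ge\; 2^{m} \;=\; 2^{\,cL/\delta},
\]
matching the stated scaling $M(\delta) \approx 2^{(L/\delta)}$ up to constants.
```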

  • Standard reduction: from estimation to testing

    [Figure: a $2\delta$-packing $\{f^1, \ldots, f^M\}$ of $\mathcal{F}$.]

    goal: characterize the minimax risk for $\rho$-estimation over $\mathcal{F}$

    construct a $2\delta$-packing of $\mathcal{F}$: a collection $\{f^1, \ldots, f^M\}$ such that $\rho(f^j, f^k) > 2\delta$

    now form an $M$-component mixture distribution as follows:

      draw a packing index $V \in \{1, \ldots, M\}$ uniformly at random

      conditioned on $V = j$, draw $n$ i.i.d. samples $(X_i, Y_i) \sim Q_{f^j}$

    1 Claim: Any estimator $\widehat{f}$ such that $\rho(\widehat{f}, f^V) \leq \delta$ w.h.p. can be used to solve the $M$-ary testing problem.

    2 Use standard techniques (Assouad, Le Cam, Fano, ...) to lower bound the probability of error in the testing problem.

  • Minimax rate via metric entropy matching

    observe $(x_i, y_i)$ pairs from the model $y_i = f^*(x_i) + w_i$

    two possible norms:

    $\|\widehat{f} - f^*\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} \big(\widehat{f}(x_i) - f^*(x_i)\big)^2$, or $\|\widehat{f} - f^*\|_2^2 = \mathbb{E}\big[(\widehat{f}(X) - f^*(X))^2\big]$.

    Metric entropy master equation

    For many regression problems, the minimax rate $\delta_n > 0$ is determined by solving the master equation

    $\log M(\delta; \mathcal{F}, \rho) \asymp n \delta^2$.

    basic idea (with Hellinger metric) dates back to Le Cam (1973)

    elegant general version due to Yang and Barron (1999)
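    The master equation can also be solved numerically for any metric entropy one can evaluate. The sketch below is my own illustration: it solves $\log M(\delta) = n\delta^2$ by bisection, using the entropy scalings from the examples that follow with all constants set to 1.

```python
# Sketch: solve the master equation log M(delta) = n * delta^2 by bisection,
# for a user-supplied metric-entropy function (constants all set to 1).
import numpy as np

def solve_master_equation(log_packing, n, lo=1e-8, hi=1e3, iters=200):
    """Find delta with log_packing(delta) = n * delta^2 (log_packing non-increasing)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if log_packing(mid) > n * mid ** 2:
            lo = mid            # entropy still dominates: increase delta
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Entropy scalings from the examples below (constants suppressed):
k, p, alpha, n = 10, 1_000, 2.0, 10_000
sparse_linear = lambda d: k * np.log(np.e * p / k)        # ~ k log(ep/k)
smooth_1d     = lambda d: (1.0 / d) ** (1.0 / alpha)      # ~ (1/delta)^(1/alpha)

for name, f in [("sparse linear", sparse_linear), ("alpha-smooth", smooth_1d)]:
    delta = solve_master_equation(f, n)
    print(name, "delta^2 ~", round(delta ** 2, 6))
```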


  • Example 1: Sparse linear regression

    Observations $y_i = \langle x_i, \theta^* \rangle + w_i$, where

    $\theta^* \in B_0(k) \cap B_2(1) := \big\{ \theta \in \mathbb{R}^p \mid \|\theta\|_0 \leq k \text{ and } \|\theta\|_2 \leq 1 \big\}$.

    Gilbert-Varshamov: one can construct a $2\delta$-separated set with

    $\log M(2\delta) \gtrsim k \log(ep/k)$

    Master equation and minimax rate

    $\log M(\delta) \asymp n \delta^2 \;\Longrightarrow\; \delta^2 \asymp \frac{k \log(ep/k)}{n}$.

    Polynomial-time achievability:

    by $\ell_1$-relaxations under restricted eigenvalue (RE) conditions (Candes & Tao, 2007; Bickel et al., 2009; Buhlmann & van de Geer, 2011)

    achieve minimax-optimal rates for $\ell_2$-error (Raskutti, W., & Yu, 2011)

    $\ell_1$-methods can be sub-optimal for the prediction error $\|X(\widehat{\theta} - \theta^*)\|^2 / n$ (Zhang, W. & Jordan, 2014)

  • Example 2: $\alpha$-smooth non-parametric regression

    Observations $y_i = f^*(x_i) + w_i$, where

    $f^* \in \mathcal{F}(\alpha) = \big\{ f: [0,1] \to \mathbb{R} \mid f \text{ is } \alpha\text{-times differentiable, with } \textstyle\sum_{j=0}^{\alpha} \|f^{(j)}\| \leq C \big\}$.

    Classical results in approximation theory (e.g., Kolmogorov & Tikhomirov, 1959):

    $\log M(\delta; \mathcal{F}) \asymp (1/\delta)^{1/\alpha}$

    Master equation and minimax rate

    $\log M(\delta) \asymp n \delta^2 \;\Longrightarrow\; \delta^2 \asymp (1/n)^{\frac{2\alpha}{2\alpha+1}}$.


  • Example 3: Sparse additive and $\alpha$-smooth

    Observations $y_i = f^*(x_i) + w_i$, where $f^*$ satisfies the constraints

    Additively decomposable: $f^*(x_1, \ldots, x_p) = \sum_{j=1}^{p} g_j^*(x_j)$

    Sparse: At most $k$ of $(g_1^*, \ldots, g_p^*)$ are non-zero

    Smooth: Each $g_j^*$ belongs to the $\alpha$-smooth family $\mathcal{F}(\alpha)$

    Combining the previous results yields a $2\delta$-packing with

    $\log M(2\delta; \mathcal{F}) \asymp k (1/\delta)^{1/\alpha} + k \log\big(\tfrac{ep}{k}\big)$

    Master equation and minimax rate

    Solving $\log M(\delta) \asymp n \delta^2$ yields

    $\delta^2 \;\asymp\; \underbrace{k (1/n)^{\frac{2\alpha}{2\alpha+1}}}_{\text{$k$-component estimation}} \;+\; \underbrace{\frac{k \log(ep/k)}{n}}_{\text{search complexity}}$

  • Summary

    To be provided during the tutorial....