useful statistical concepts for engineers

Post on 10-Apr-2018


  • 8/8/2019 Useful Statistical Concepts for Engineers

    Useful Statistical concepts for Engineers

    Deepak Agarwal

    ML Class

    12/10/2009

    Yahoo!

    Scope of the lecture

    Basic probability distributions to model randomness in data

    Fitting distributions to data

    Common parametric distributions: discrete distributions, continuous distributions

    Generalized Linear models

    Multi-level hierarchical models

    Generalized Linear mixed effects models

    Role of Probability distributions

    Probability distributions

    Mathematical models to describe intrinsic variation in data

    Help quantify uncertainty and, eventually, make decisions

    How do we construct such distributions to compute probabilities for any subset of the data domain?

    Domain

    Finite set of points (total clicks in 100 displays of ad X on Pub Y)

    Countable but infinite set of points (total visits to webpage Y)

    Real numbers (time spent on webpage Y)

    Is it necessary to specify probabilities for all subsets?

    No. So what do we need to specify?

    Cumulative distribution function (CDF)

    X : random variable

    CDF $F : \mathbb{R} \to [0,1]$ such that

    $F(x) = \Pr(X \le x)$

    F is non-decreasing and right continuous

    The CDF uniquely characterizes a probability distribution

    Given the CDF, we can compute the probability of any subset

    E.g. $P(a < X \le b) = F(b) - F(a)$; $P(X > b) = 1 - F(b)$

    What about more complicated sets?

    In high dimension?
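    A minimal sketch of the two interval formulas above, assuming a standard normal X purely for illustration (the slide does not fix a distribution). The helper name normal_cdf is hypothetical.

```python
import math

def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution, F(x) = Pr(X <= x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

a, b = -1.0, 1.0
p_interval = normal_cdf(b) - normal_cdf(a)   # P(a < X <= b) = F(b) - F(a)
p_upper = 1.0 - normal_cdf(b)                # P(X > b) = 1 - F(b)
print(f"P({a} < X <= {b}) = {p_interval:.4f}")
print(f"P(X > {b}) = {p_upper:.4f}")
```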

    Probability density function (PDF)

    A unique function $p$ from the data domain to $[0, \infty)$ such that

    $P(A) = \int_A p(x) \, dF(x)$ [aggregate density with weights from F]

    Meaning of the notation $\int_A p(x) \, dF(x)$

    Real numbers: $P(A) = \int_A p(x) \, dx$ (continuous distributions)

    Discrete numbers: $P(A) = \sum_{x \in A} p(x)$ (discrete distributions)

    PDF often easier to work with when fitting distributions to data (modeling)

    Empirical CDF

    Empirical CDF $F_m$ for i.i.d. data $X = (x_1, x_2, \ldots, x_m)$

    Probability distribution with mass 1/m on each $x_i$

    1. m=10; -1.21 0.28 1.08 -2.35 0.43 0.51 -0.57 -0.55 -0.56 -0.89

    2. m=10; 0 0 0 0 0 1 0 0 1 1

    [Figure: empirical CDF $F_m(x)$ for example 1; x-axis from -2 to 1, y-axis from 0.0 to 1.0]
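    As a rough sketch, the empirical CDF for the example 1 data can be computed as below. The function name make_ecdf is hypothetical; the only assumption is the definition on this slide (mass 1/m on each observation).

```python
from bisect import bisect_right

data = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]

def make_ecdf(sample):
    """Return F_m, the empirical CDF that puts mass 1/m on each observation."""
    ordered = sorted(sample)
    m = len(ordered)

    def ecdf(x):
        # Number of observations <= x, divided by m.
        return bisect_right(ordered, x) / m

    return ecdf

F_m = make_ecdf(data)
print(F_m(-3.0), F_m(0.0), F_m(2.0))
```

    Six of the ten observations are at or below zero, so F_m(0) is 0.6.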

    Plug-in principle

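    $\theta(F)$: Some characteristic of the theoretical distribution

    E.g. mean, $\theta(F) = \int x \, dF(x)$

    $\theta(F_m)$: Corresponding quantity for the empirical CDF; for the mean this is the sample mean $\frac{1}{m}\sum_{i=1}^m x_i$

    Exercise: Convince yourself this is true

    Plug-in principle: $\theta(F_m)$ is a good estimator of $\theta(F)$ for all characteristics $\theta(F)$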

    Why plug-in works? Glivenko-Cantelli Lemma

    Intuitively, $F_m$ is a good estimator of F

    Estimator should get better with increasing sample size m

    Glivenko-Cantelli: For the i.i.d. scenario

    $P_m(A) \to P(A)$ uniformly over all sets of the form $A = (-\infty, x]$, as $m \to \infty$

    We can infer about distribution with a large sample

    Error does not grow as we increase the number of quantities

    estimated from the same sample

    Justifies why plug-in principle works
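    A small simulation can make the uniform convergence concrete. This is a sketch under assumed settings: Uniform(0,1) data (so F(x) = x), a fixed seed, and a hypothetical helper sup_distance_uniform that returns the largest gap between the empirical and true CDF.

```python
import random

def sup_distance_uniform(m, rng):
    """sup_x |F_m(x) - F(x)| for m i.i.d. Uniform(0,1) draws, where F(x) = x."""
    xs = sorted(rng.random() for _ in range(m))
    # The supremum is attained at, or just before, a jump of F_m.
    return max(max((i + 1) / m - x, x - i / m) for i, x in enumerate(xs))

rng = random.Random(0)
distances = {m: sup_distance_uniform(m, rng) for m in (100, 10_000)}
for m, d in distances.items():
    print(m, round(d, 4))
```

    The gap typically shrinks at a rate of about 1/sqrt(m), which is why one large sample supports many plug-in estimates at once.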

    Quantifying uncertainty in estimates

    We won't have infinite m in practice (costly)

    Quantify uncertainty in estimates for a given m

    Sample mean

    Estimate of uncertainty
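    The slide's formulas did not survive extraction, so the following is a sketch using the standard textbook versions: the sample mean and its estimated standard error s / sqrt(m), where s is the sample standard deviation. The data are the example 1 values from the empirical CDF slide.

```python
import math

data = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
m = len(data)
mean = sum(data) / m
# Unbiased sample variance (divides by m - 1).
variance = sum((x - mean) ** 2 for x in data) / (m - 1)
std_error = math.sqrt(variance / m)   # estimated s.e. of the sample mean
print(f"mean = {mean:.3f}, s.e. = {std_error:.3f}")
```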

    Standard error calculations continued

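    Consider the median

    Its standard error is difficult to compute exactly

    Asymptotic approx: $se \approx 1 / (2 f(\theta) \sqrt{m})$, where $f$ is the (unknown) density at the true median $\theta$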

    Is there a better way?

    Re-sampling from empirical CDF: Bootstrap

    Bootstrap: Random sample (with replacement) from $F_m$

    The samples help compute the s.e.

    For the median example,

    Take a random sample of size m with replacement from the empirical CDF $F_m$ and compute the median

    For B such samples, compute the standard deviation (s.e.) of the median estimates; this quantifies uncertainty

    This works well if $F_m$ is a good approximation to F

    Bootstrap only finds an approximation of $se_F(\theta(F_m))$, computed under $F_m$
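    The median procedure described above can be sketched as follows. The data here are hypothetical (100 simulated N(0,1) draws with a fixed seed), and bootstrap_se is an illustrative helper, not something from the lecture.

```python
import random
import statistics

def bootstrap_se(sample, statistic, n_boot, rng):
    """Bootstrap standard error of `statistic` from n_boot resamples of `sample`."""
    m = len(sample)
    estimates = []
    for _ in range(n_boot):
        # Draw m points with replacement, i.e. an i.i.d. sample from F_m.
        resample = [sample[rng.randrange(m)] for _ in range(m)]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]   # hypothetical data, m = 100
se_median = bootstrap_se(sample, statistics.median, n_boot=200, rng=rng)
print(f"bootstrap s.e. of the median: {se_median:.3f}")
```

    Any other statistic (trimmed mean, quantile, ratio) can be passed in place of statistics.median, which is what makes the method a convenient black box.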

    Why bootstrap works?

    Except for the mean, it is difficult to compute the standard deviation of other sample statistics

    Bootstrap sampling from $F_m$ approximates sampling from F; an easy black box to compute variance estimates

    How many bootstrap samples B ?

    For estimating s.e., 20-100 are good enough

    Depends on m, tails of the underlying distribution

    Exercise: How many distinct bootstrap samples are there for a given m ?

    Bootstrap: Variations

    Does it always work?

    No, especially in cases where $F_m$ is not a good approximation of F

    E.g. sample m data points from Uniform(0, $\theta$)

    $\max_i x_i$ is the ML estimator of $\theta$

    Bootstrap as defined so far won't work well here

    Parametric bootstrap

    What if we know the parametric form of F (e.g. Gaussian)?

    Sample from $F_{m,\mathrm{par}}$ instead of $F_m$
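    One way to see the Uniform(0, theta) failure: a nonparametric bootstrap replicate of the maximum equals the observed maximum whenever that point is resampled, which happens with probability 1 - (1 - 1/m)^m (about 0.63). The sketch below contrasts this with a parametric bootstrap from the fitted Uniform(0, theta_hat). The values theta = 1, m = 50 and B = 2000 are illustrative assumptions.

```python
import random

rng = random.Random(0)
theta = 1.0                                   # hypothetical true parameter
m = 50
sample = [rng.uniform(0.0, theta) for _ in range(m)]
theta_hat = max(sample)                       # ML estimator of theta

B = 2000
# Nonparametric bootstrap: resample the data with replacement.
nonpar = [max(rng.choice(sample) for _ in range(m)) for _ in range(B)]
# Parametric bootstrap: draw fresh data from the fitted Uniform(0, theta_hat).
par = [max(rng.uniform(0.0, theta_hat) for _ in range(m)) for _ in range(B)]

frac_nonpar = sum(v == theta_hat for v in nonpar) / B
frac_par = sum(v == theta_hat for v in par) / B
print(frac_nonpar, frac_par)
```

    A large share of the nonparametric replicates sit exactly on the sample maximum, so their spread says little about the sampling distribution of the estimator, while the parametric replicates vary smoothly below theta_hat.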

    Example

    m=100; data drawn iid from N(0,1); distribution of median

    [Figure: c.v. (y-axis, roughly 0.28 to 0.34) against the number of bootstrap samples B (x-axis, 0 to 5000)]

    How can we use it at Y! ?

    Variance estimates can help with online learning (explore/exploit): John's lecture

    Bootstrapping can help better understand variance properties of models

    Running too many experiments on test data is not a good idea (Kilian's lecture)

    Easy to Map-reduce

    Before we move on, another look at bias-variance tradeoff

    Bias-Variance Tradeoff

    Important in all scenarios (regression, density estimation, ...)

    Recall Rob's example from lecture 1

    [Figure panels: High bias, High variance, Optimal trade-off]

    Bias-Variance continued

    F : True distribution generating the data (not known)

    $\mathcal{F} = \{F_\theta\}$: Model class chosen by analyst to approximate F

    Influenced by things like domain knowledge, previous studies, software availability, my favorite algo, ad-server latency, ...

    E.g. linear models, neural networks, logistic regression, ...

    X : Available data

    Loss $L(F, \hat{F}[X])$: Metric that measures model performance, where $\hat{F}[X]$ is the member of the model class fitted to X

    E.g. MSE, misclassification error, total click lift, total revenue, ...

    Bias-Variance continued

    Loss influenced by two aspects

    How flexible is the model class in approximating reality? (Bias)

    The more flexible it is, the more complex it gets (reduces bias)

    How stable is the best fit from the model class to the data? (Variance)

    Does the fit change a lot with perturbations to the data?

    The more flexible the class to choose from, the more data we need to control the variance

    With too much flexibility and little data, we tend to learn patterns that are not real (chasing the data, too many parameters, generalization error, fitting noise, too many degrees of freedom)
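    A toy illustration of the trade-off, not taken from the lecture: estimate a mean mu with c times the sample mean. Smaller c is a more restrictive (less flexible) estimator, with more bias and less variance. The values mu = 2, sigma = 3, m = 10 are assumptions chosen so that some shrinkage beats the unbiased estimator. For squared error, MSE = bias^2 + variance.

```python
import random
import statistics

def simulate(c, mu=2.0, sigma=3.0, m=10, reps=5000, seed=0):
    """Bias^2, variance and MSE of the estimator c * (sample mean) of mu."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(m)]
        estimates.append(c * sum(xs) / m)
    bias_sq = (statistics.fmean(estimates) - mu) ** 2
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean((e - mu) ** 2 for e in estimates)
    return bias_sq, variance, mse

for c in (1.0, 0.7, 0.3):
    bias_sq, variance, mse = simulate(c)
    print(f"c={c}: bias^2={bias_sq:.3f} variance={variance:.3f} mse={mse:.3f}")
```

    With these settings, moderate shrinkage (c = 0.7) trades a little bias for a larger drop in variance, while heavy shrinkage (c = 0.3) is dominated by bias.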

    Example: Recall regression from Rob's lecture

    Exercise:

    1. Identify the model class

    2. If dim(x) = 20 and m = 10M, which is the more serious problem here (bias or variance)?

    3. Based on 2., what other tools would you try on this problem?

    A useful exercise

    You might have heard things like

    All models are wrong, some are more useful than others

    Google uses simple models but trains them on lots of data

    SVM works well on my data

    Naive Bayes is hard to beat on text data

    Boosting is the best off-the-shelf classifier

    It's all about feature engineering; Maxent, GBDT doesn't matter

    Y! data is too noisy, better to work with simple models

    We have terabytes of data but it is too little

    Exercise: Interpret these in terms of bias-variance tradeoff

    Other remarks on bias-variance

    There is no universal solution to the bias-variance tradeoff

    Several model classes available, each with pros and cons

    Understanding the properties of the model classes and experimenting with data are important

    Inventing new model classes, motivated by failures of existing ones on real applications, is important for advancement of the field

    How to measure performance ?

    Depends on the loss function

    For classification and regression, test errors used in ML

    Several other measures in Statistics (does not use test data)

    MODEL FIT - MODEL COMPLEXITY

    AIC, BIC, DIC, Mallows' $C_p$, Bayes factors, ...

    E.g. AIC = -2 log-likelihood(training) + 2 (# parameters); lower is better

    Based on assumptions that may not hold in all scenarios
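    A sketch of a fit-versus-complexity comparison using the standard definition AIC = -2 log-likelihood + 2 k. The two Gaussian models and the data (the example 1 values from the empirical CDF slide) are chosen here for illustration only.

```python
import math

def gaussian_loglik(xs, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) data."""
    m = len(xs)
    rss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * m * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

def aic(loglik, n_params):
    """AIC = -2 * log-likelihood + 2 * (number of fitted parameters)."""
    return -2.0 * loglik + 2.0 * n_params

xs = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
m = len(xs)

# Model A: mean fixed at 0, variance fitted (1 parameter).
s2_a = sum(x ** 2 for x in xs) / m
aic_a = aic(gaussian_loglik(xs, 0.0, s2_a), n_params=1)

# Model B: mean and variance both fitted (2 parameters).
mu_b = sum(xs) / m
s2_b = sum((x - mu_b) ** 2 for x in xs) / m
aic_b = aic(gaussian_loglik(xs, mu_b, s2_b), n_params=2)
print(f"AIC(A) = {aic_a:.2f}, AIC(B) = {aic_b:.2f}")
```

    Model B always fits the training data at least as well, but on these ten points the improvement does not pay for the extra parameter, so AIC prefers model A.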

    Parametric Distributions: A useful class to work with data

    Parametric models

    Non-parametric approach attractive, no assumptions needed

    Bootstrap and asymptotics often provide answers, BUT

    Hard to incorporate additional knowledge about the system

    Computationally intensive

    Higher uncertainty in estimates: the price we pay for generality

    Theory gets harder for dependent random variables

    Social network data, spatial data, time series

    Parametric models that assume a functional form are an alternative way to model the world

    Faster computation, better estimates if the model is a good approximation to reality

    Easier to model dependent random variables

    Common discrete parametric distributions

    Bernoulli

    Poisson

    Geometric

    Negative Binomial

    Multinomial

    Common continuous parametric distributions

    Normal (Gaussian)

    Log-normal: Normal on log scale

    Gamma : Tails thinner than log-normal

    Beta: flexible class on [0,1]

    Multivariate Normal : Multivariate Gaussian data

    Exponential family: A general class of parametric distributions

    Distribution with PDF given by $p(x \mid \theta) = h(x) \exp\{\theta^T t(x) - g(\theta)\}$

    $g(\theta)$: convex (log-partition function)

    Example: Bernoulli distribution
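    For the Bernoulli case, the usual exponential-family form has natural parameter theta = log(p / (1 - p)), sufficient statistic x, and log-partition function g(theta) = log(1 + e^theta). The sketch below checks numerically that this matches the familiar pmf; the function names are illustrative.

```python
import math

def bernoulli_pmf(x, p):
    """Familiar form: p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p if x == 1 else 1.0 - p

def expfam_pmf(x, theta):
    """exp(theta * x - g(theta)) with log-partition g(theta) = log(1 + e^theta)."""
    g = math.log1p(math.exp(theta))
    return math.exp(theta * x - g)

p = 0.3
theta = math.log(p / (1.0 - p))     # natural parameter (log-odds)
for x in (0, 1):
    print(x, bernoulli_pmf(x, p), expfam_pmf(x, theta))
```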

    Estimation

    Maximum likelihood estimation (MLE)

    For the i.i.d. case, $\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^m \log p(x_i \mid \theta)$

    MLE is asymptotically unbiased, consistent and achieves the lowest variance asymptotically under mild conditions
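    A minimal sketch of maximum likelihood for Bernoulli data, using the 0/1 values from example 2 of the empirical CDF slide. The grid search is deliberately crude and only meant to show that maximizing the log-likelihood recovers the closed-form answer (the sample mean).

```python
import math

clicks = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

def log_likelihood(p, xs):
    """Bernoulli log-likelihood: sum_i log p(x_i | p)."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in xs)

# Crude grid search over p in (0, 1); the closed form is the sample mean.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, clicks))
p_closed_form = sum(clicks) / len(clicks)
print(p_grid, p_closed_form)
```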

    Desirable properties of estimators

    Unbiased

    Consistent

    Low variance, efficient [Attains Cramer-Rao lower bound]

    MLE efficient

    Under mild regularity conditions, MLE is asymptotically unbiased, consistent and efficient

    Other estimators

    MVUE: Minimum variance unbiased estimators

    Search for lowest variance estimator among unbiased ones

    Requires only moment assumptions on the distributions

    Method-of-moments (MOM) estimators

    Equate empirical moments with theoretical ones

    May lose efficiency but easier to estimate in some cases

    Non-i.i.d. data [adding more flexibility]

    Statistically independent but different means

    Too flexible, sharing parameters a good compromise

    E.g. 100 displays of an ad on a website (Bernoulli)

    Click probabilities not same, how do we model it ?

    $\theta_i$ a function of features? E.g. males and females have different click probabilities

    Regression problem (logistic regression, ...)

    $\theta_i := \theta(z_i, \beta) = z_i^T \beta$; $\dim(\beta) = n$
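    To make the regression idea concrete, here is a sketch of logistic regression with a single gender feature, fitted by plain gradient ascent on the Bernoulli log-likelihood. The click counts are hypothetical, and gradient ascent is just one simple fitting choice, not necessarily what the lecture used.

```python
import math

# Hypothetical aggregated data: (features z, displays, clicks); z = (1, is_female).
groups = [((1.0, 0.0), 100, 10),
          ((1.0, 1.0), 100, 20)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(groups, lr=1.0, n_iter=5000):
    """Maximize the Bernoulli log-likelihood by plain gradient ascent."""
    beta = [0.0, 0.0]
    total = sum(n for _, n, _ in groups)
    for _ in range(n_iter):
        grad = [0.0, 0.0]
        for z, n, clicks in groups:
            p = sigmoid(sum(b * zj for b, zj in zip(beta, z)))
            for j, zj in enumerate(z):
                grad[j] += (clicks - n * p) * zj   # d loglik / d beta_j
        beta = [b + lr * g / total for b, g in zip(beta, grad)]
    return beta

beta = fit_logistic(groups)
p_male = sigmoid(beta[0])
p_female = sigmoid(beta[0] + beta[1])
print(beta, p_male, p_female)
```

    With one indicator feature the model is saturated, so the fitted probabilities reproduce the observed click rates of the two groups; with more features than groups of data, the shared coefficients are what pool information.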

    Generalized Linear Models: Flexible class for regressions

    Data: $(x_1, z_1), (x_2, z_2), \ldots, (x_m, z_m)$

    Assumption: the $z_i$ are measured without error (important)

    1-parameter exponential family

    $\mu_i = E(x_i)$, with natural parameter $\theta_i$

    Do linear regression on transformed scale

    $\theta_i := \theta(z_i, \beta) = z_i^T \beta$

    GLM continued

    Example: logistic regression (covered in Rob's lecture)

    Gaussian regression, Poisson regression are special cases

    Referred to as Generalized linear model (GLM)

    McCullagh and Nelder (book)

    Other option: Shrinkage estimators (Stein)

    Stein

    Result

    Stein estimator has smaller MSE than MLE

    Remarkable: Incurring some bias by pooling data reduces variance significantly

    Shrinkage: Estimates pulled towards the mean
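    The slide's formula was lost in extraction, so this sketch uses the classical James-Stein estimator that shrinks towards zero, (1 - (d - 2) / ||x||^2) x, for a single observation x ~ N(theta, I). The lecture's version may instead shrink towards the overall mean. The true means and the number of repetitions are hypothetical.

```python
import random

def james_stein(x):
    """Classical James-Stein estimate for x ~ N(theta, I): shrink x towards 0."""
    d = len(x)
    norm_sq = sum(v * v for v in x)
    return [(1.0 - (d - 2) / norm_sq) * v for v in x]

rng = random.Random(0)
theta = [1.0] * 10            # hypothetical true means, d = 10
reps = 2000
sse_mle = sse_js = 0.0
for _ in range(reps):
    x = [rng.gauss(t, 1.0) for t in theta]     # one noisy observation per mean
    js = james_stein(x)
    sse_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    sse_js += sum((ji - t) ** 2 for ji, t in zip(js, theta))
mse_mle, mse_js = sse_mle / reps, sse_js / reps
print(f"MLE: {mse_mle:.2f}  James-Stein: {mse_js:.2f}")
```

    The MLE here is x itself, with total squared error close to d; the shrunken estimate has lower total error even though each coordinate is biased.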

    Bayesian statistics

    Data and parameters are all random variables that we model

    All inferences about parameters are conditional on the data

    Bayes Theorem

    $[\theta \mid X] = [X \mid \theta] \, [\theta] / [X]$

    Posterior $\propto$ Likelihood $\times$ Prior

    [Figure: likelihood and prior curves; x-axis from -10 to 10, y-axis from 0 to 4]
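    A small worked case of "posterior is proportional to likelihood times prior", assuming a Beta prior on a Bernoulli click probability (a conjugate pair, chosen here for illustration). The data are the 0/1 values from example 2 of the empirical CDF slide, and the uniform Beta(1, 1) prior is an assumption.

```python
def beta_posterior(a, b, data):
    """Beta(a, b) prior + Bernoulli data -> Beta(a + #ones, b + #zeros)."""
    ones = sum(data)
    return a + ones, b + len(data) - ones

clicks = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
a_post, b_post = beta_posterior(1.0, 1.0, clicks)   # uniform prior (assumed)
posterior_mean = a_post / (a_post + b_post)
mle = sum(clicks) / len(clicks)
print(a_post, b_post, posterior_mean, mle)
```

    The posterior mean (1/3) sits between the MLE (0.3) and the prior mean (0.5), which is the same pulling-towards-the-prior effect as shrinkage.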

    Bayesian continued

    Does not depend on asymptotics, works for finite m

    Rich class of models (generally over-parametrized), but avoids over-fitting through constraints on parameters

    Model specification often requires care

    Computationally intensive (but approximations work well for large data)

    Bayesian interpretation of Stein

    Exercise

    Analysis of variance (ANOVA)

    Replications within each group

    E.g. log NGD prices in different DMAs

    How do we estimate the unknown hyper-parameters?

    Estimating hyper-parameters: Empirical Bayes

    Empirical Bayes (EB): Maximize marginal likelihood

    ANOVA example (integral available in closed form)

    EB works well for large data; in small samples it may overfit

    Double dipping
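    Since the slide's model equations were lost, the sketch below assumes a simple normal-normal setup: group means y_i ~ N(theta_i, s2) with known s2, and theta_i ~ N(mu, tau2). Empirical Bayes maximizes the marginal likelihood y_i ~ N(mu, s2 + tau2) over (mu, tau2) and then plugs the estimates into the posterior mean. The data and s2 are hypothetical.

```python
import statistics

def empirical_bayes_means(y, s2):
    """EB estimates for y_i ~ N(theta_i, s2), theta_i ~ N(mu, tau2), s2 known."""
    mu_hat = statistics.fmean(y)
    # Marginally y_i ~ N(mu, s2 + tau2), so the ML estimate of tau2 is:
    tau2_hat = max(0.0, statistics.pvariance(y) - s2)
    weight = tau2_hat / (tau2_hat + s2)          # 0 = full pooling, 1 = no pooling
    return mu_hat, tau2_hat, [mu_hat + weight * (yi - mu_hat) for yi in y]

y = [-1.2, -0.4, 0.1, 0.3, 0.9, 2.1]   # hypothetical group means
mu_hat, tau2_hat, shrunk = empirical_bayes_means(y, s2=0.5)
print(mu_hat, tau2_hat, shrunk)
```

    The same data are used twice, once to estimate the prior and once to compute the posterior, which is the "double dipping" that can cause overfitting in small samples.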

    Example

    Time spent on landing page after a story click on Today Module on Y! Front Page

    Distribution across different properties

    ANOVA

    Observations for a property are the replications; data: log(time spent)

    0.04651195, 0.11435909 , 2.52275583

    Shrinkage

    [Figure: shrinkage estimates plotted against the MLE]

    Estimating hyper-parameters: Full Bayes

    Assume a mild prior on hyper-parameters

    In ANOVA example

    Computation gets difficult, often requires simulation

    Main idea

    Simulate samples from the posterior distribution and draw all conclusions from these (recall parametric bootstrap)

    Several techniques, e.g. Markov Chain Monte Carlo (MCMC)
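    As an illustration of the simulation idea, here is a bare-bones random-walk Metropolis sampler for the posterior of a Bernoulli click probability under a uniform prior, with 3 clicks in 10 displays. The model, step size, chain length and burn-in are all assumptions for this sketch. The exact posterior is Beta(4, 8), so the sampler can be checked against its mean of 1/3.

```python
import math
import random

def log_posterior(p, ones, zeros):
    """Unnormalised log posterior: Bernoulli likelihood x Uniform(0, 1) prior."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return ones * math.log(p) + zeros * math.log(1.0 - p)

def metropolis(ones, zeros, n_samples, step, rng):
    p, samples = 0.5, []
    for _ in range(n_samples):
        proposal = p + rng.gauss(0.0, step)           # symmetric random walk
        log_ratio = log_posterior(proposal, ones, zeros) - log_posterior(p, ones, zeros)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            p = proposal
        samples.append(p)
    return samples

rng = random.Random(0)
draws = metropolis(ones=3, zeros=7, n_samples=20_000, step=0.2, rng=rng)
kept = draws[2_000:]                                   # drop burn-in
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean ~ {posterior_mean:.3f} (exact: {4 / 12:.3f})")
```

    Any posterior summary (intervals, tail probabilities) can then be read off the retained draws, in the same spirit as reading a standard error off bootstrap replicates.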

    Modeling correlations through priors

    Time series: Autoregressive prior

    Conditional independence, marginal dependence

    Attractive way to model correlations

    Spatial correlation

    Generalized linear mixed model (GLMM)

    Fit different regressions to different groups but share parameters

    Example: Random intercept models

    Parallel regression lines for the groups

    Front Page example: log(ts) = a + b*Gender + prop_id

    (Intercept)  gender)0  gender)f  gender)m  sigma^2  tau^2
    0.025        0.114     0.051     0.049     1.32     0.121

    GLMM continued

    Crossed random effects

    Group-specific slopes and intercepts

    FP example

    log(ts) = a + b*gend + Propid*Gender

    Exercise: fit this model using lme4 in R

    Hint: formula (log(ts) ~ gender + (Propid|gender) )

    GLMMs

    From an ML perspective

    Linear models with different cross-product features

    Fancy regularization (different priors for different features)

    No cross-validation, all parameters estimated automatically

    Priors motivated by the problem; highly flexible class

    Model specification has to be done carefully by the analyst

    Extends to exponential family

    Conceptually easy, more computation required

    Software (lme4 in R; PROC GLIMMIX in SAS)

    Generalized linear mixed model (GLMM)

    Back to ANOVA: Regression + ANOVA

    Define , then we can write

    Extends to exponential family, computation gets harder

    Generalized linear mixed models (GLMM)

    Software (lme4 package in R)

    Summary

    We covered

    Bootstrap for the i.i.d. case

    Parametric distributions

    Shrinkage estimators

    Generalized linear models

    Grouped regressions (mixed effects models)

    For non-i.i.d. data, working with flexible parametric models provides a powerful, expressive language to model data

    Needs some practice to master these models

    Next lecture: Olivier Chapelle (Optimization techniques)