05-statisticalmodels

Upload: rhlhotty22

Post on 21-Feb-2018


  • 7/24/2019 05-StatisticalModels

    1/26

    CS109/Stat121/AC209/E-109

    Data Science

    Statistical Models

    Hanspeter Pfister, Joe Blitzstein, and Verena Kaynig

    data y (observed)

    model parameter θ (unobserved)

    statistical inference

    probability

  • 7/24/2019 05-StatisticalModels

    2/26

    This Week

    HW1 due next Thursday - start last week!

    Section assignments coming soon

    Please avoid posting duplicate questions on Piazza (always search first), and avoid posting code from your homework solutions (see Andrew's post https://piazza.com/class/icf0cypdc3243c?cid=310 for more)
  • 7/24/2019 05-StatisticalModels

    3/26

    Drowning in data, but starved for information

    source: http://extensionengine.com/drowning-in-data-the-biggest-hurdle-for-mooc-proliferation/

  • 7/24/2019 05-StatisticalModels

    4/26

    What is a statistical model?

    a family of distributions, indexed by parameters

    sharpens the distinction between data and parameters, and between estimators and estimands

    parametric (e.g., based on Normal, Binomial) vs. nonparametric (e.g., methods like bootstrap, KDE)

    data y (observed)

    model parameter θ (unobserved)

    statistical inference

    probability

  • 7/24/2019 05-StatisticalModels

    5/26

    What good is a statistical model?

    "All models are wrong, but some models are useful." George Box (1919-2013)

  • 7/24/2019 05-StatisticalModels

    6/26

    All models are wrong, but some models are useful. George Box

    In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, those Unconscionable Maps no longer satisfied, and the Cartographers Guild struck a Map of the Empire whose size was that of the Empire, and which coincided point for point with it.

    Jorge Luis Borges, On Exactitude in Science

    Borges Google Doodle

  • 7/24/2019 05-StatisticalModels

    7/26

    Big Data vs. Pig Data: https://scensci.wordpress.com/2012/12/14/big-data-or-pig-data/
  • 7/24/2019 05-StatisticalModels

    8/26

    Statistical Models: Two Books

  • 7/24/2019 05-StatisticalModels

    9/26

    Parametric vs. Nonparametric

    parametric: finite-dimensional parameter space (e.g., mean and variance for a Normal)

    nonparametric: infinite-dimensional parameter space

    is there anything in between?

    nonparametric is very general, but no free lunch!

    remember to plot and explore the data!

  • 7/24/2019 05-StatisticalModels

    10/26

    Parametric Model Example:

    Exponential Distribution

    Remember the memoryless property!

    $f(x) = \lambda e^{-\lambda x}, \quad x > 0$

    [Figure: Exponential pdf and cdf, each plotted against x for x from 0 to 3]

  • 7/24/2019 05-StatisticalModels

    11/26

    Exponential Distribution

    Remember the memoryless property!

    $f(x) = \lambda e^{-\lambda x}, \quad x > 0$

    Exponential is characterized by the memoryless property

    all models are wrong, but some are useful... iterate between exploring the data, model-building, model-fitting, and model-checking

    key building block for more realistic models
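
The memoryless property says that for X ~ Expo(λ), P(X > s + t | X > s) = P(X > t) for all s, t ≥ 0. A minimal numerical check is sketched below; the function names and the rate value 1.5 are illustrative assumptions, not part of the slides.

```python
import math

def expo_survival(x, lam):
    """Survival function P(X > x) for X ~ Expo(lam)."""
    return math.exp(-lam * x)

def conditional_survival(s, t, lam):
    """P(X > s + t | X > s), computed as a ratio of survival probabilities."""
    return expo_survival(s + t, lam) / expo_survival(s, lam)

lam = 1.5  # illustrative rate (assumption)
t = 1.0
for s in (0.0, 0.5, 2.0):
    # The conditional probability should not depend on the elapsed time s.
    print(s, conditional_survival(s, t, lam), expo_survival(t, lam))
```

Whatever waiting time s has already elapsed, the two printed probabilities agree up to floating-point error.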

  • 7/24/2019 05-StatisticalModels

    12/26

    The Weibull Distribution

    Exponential has constant hazard function

    Weibull generalizes this to a hazard that is t to a power

    much more flexible and realistic than Exponential

    representation: a Weibull is an Expo to a power
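
The "Expo to a power" representation can be checked by simulation. The sketch below assumes the parameterization in which X ~ Expo(λ) and T = X^(1/γ), so that T has CDF 1 − exp(−λ t^γ) and hazard λγ t^(γ−1); the names `lam`, `gamma`, and the helper functions are illustrative choices, not taken from the slides.

```python
import math
import random

def weibull_cdf(t, lam, gamma):
    """CDF of T = X**(1/gamma) with X ~ Expo(lam)."""
    return 1.0 - math.exp(-lam * t ** gamma)

def weibull_hazard(t, lam, gamma):
    """Hazard f(t) / (1 - F(t)); proportional to t to a power."""
    return lam * gamma * t ** (gamma - 1)

def simulate_weibull(n, lam, gamma, seed=0):
    """Draw n Weibull values by raising Exponential draws to the power 1/gamma."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) ** (1.0 / gamma) for _ in range(n)]

lam, gamma = 1.0, 2.0  # illustrative parameter values (assumption)
draws = simulate_weibull(100_000, lam, gamma)
t = 1.0
empirical = sum(d <= t for d in draws) / len(draws)
print("empirical P(T <= 1):", empirical)
print("Weibull CDF at 1:   ", weibull_cdf(t, lam, gamma))
```

Setting gamma to 1 recovers the Exponential case, where the hazard is the constant lam.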

  • 7/24/2019 05-StatisticalModels

    13/26

    Family Tree of Parametric Distributions

    [Figure: diagram with nodes Pois, HGeom, Bin (Bern), Gamma (Expo, Chi-Square), NBin (Geom), Beta (Unif), Student-t (Cauchy), and Normal, connected by edges labeled Conditioning, Limit, Poisson process, Conjugacy, and Bank - post office]

    source: Blitzstein-Hwang, Introduction to Probability

  • 7/24/2019 05-StatisticalModels

    14/26

    Binomial Distribution

    [Figure: Binomial pmf plotted against x in four panels: Bin(10,1/2), Bin(10,1/8), Bin(100,0.03), Bin(9,4/5)]

    story: X ~ Bin(n,p) is the number of successes in n independent Bernoulli(p) trials.

  • 7/24/2019 05-StatisticalModels

    15/26

    Binomial Distribution

    story: X ~ Bin(n,p) is the number of successes in n independent Bernoulli(p) trials.

    Example: # votes for candidate A in election with n voters, where each independently votes for A with probability p

    mean is np (by story and linearity of expectation: E(X+Y) = E(X) + E(Y))

    variance is np(1-p) (by story and the fact that Var(X+Y) = Var(X) + Var(Y) if X, Y are uncorrelated)
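
The formulas np and np(1-p) can be confirmed directly from the Binomial pmf. The sketch below sums over the pmf for a few (n, p) pairs; the helper names are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def mean_and_variance(n, p):
    """Mean and variance of Bin(n, p), computed by summing over the pmf."""
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var

for n, p in [(10, 1 / 2), (10, 1 / 8), (100, 0.03), (9, 4 / 5)]:
    mean, var = mean_and_variance(n, p)
    # Compare the pmf-based values with the closed forms np and np(1-p).
    print(n, p, mean, n * p, var, n * p * (1 - p))
```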

  • 7/24/2019 05-StatisticalModels

    16/26

    (Doonesbury)

  • 7/24/2019 05-StatisticalModels

    17/26

    Poisson Distribution

    story: count the number of events that occur, when there are a large number of independent rare events.

    Examples: # mutations in genetics, # of traffic accidents at a certain intersection, # of emails in an inbox

    mean = variance for the Poisson

    $P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$
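
For X ~ Pois(λ) the PMF is P(X = k) = e^(−λ) λ^k / k!, and both the mean and the variance equal λ. The sketch below checks this numerically by truncating the infinite sum at 60 terms, which is an assumption that is safe for the small rates used here; the function names are illustrative.

```python
import math

def pois_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def pois_mean_and_variance(lam, terms=60):
    """Mean and variance from the pmf, truncating the sum after `terms` values."""
    pmf = [pois_pmf(k, lam) for k in range(terms)]
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var

for lam in (2.0, 5.0):
    mean, var = pois_mean_and_variance(lam)
    # Both values should be very close to lam.
    print(lam, mean, var)
```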

  • 7/24/2019 05-StatisticalModels

    18/26

    Poisson PMF, CDF

    [Figure: PMF and CDF plotted against x, for Pois(2) and Pois(5)]

  • 7/24/2019 05-StatisticalModels

    19/26

    Poisson Approximation

    $$|P(X \in B) - P(N \in B)| \le \min\left(1, \frac{1}{\lambda}\right) \sum_{j=1}^{n} p_j^2$$

    if X is the number of events that occur, where the jth event has probability $p_j$, and N ~ Pois(λ), with λ the average number of events that occur.

    Example: in the matching problem, the probability of no match is approximately 1/e.
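
A small numerical illustration of the Poisson approximation follows. It assumes n independent events that all share the same probability p, so that X ~ Bin(n, p) and λ = np; this special case and all function names are assumptions made for the sketch. It computes the largest gap |P(X ∈ B) − P(N ∈ B)| over all sets B and compares it with min(1, 1/λ) Σ p_j². It also evaluates the exact no-match probability in the matching problem through the inclusion-exclusion sum Σ (−1)^k / k!.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def pois_pmf(k, lam):
    """P(N = k) for N ~ Pois(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def worst_case_gap(n, p):
    """Largest |P(X in B) - P(N in B)| over all sets B.

    The maximum is attained on the set where the Binomial pmf exceeds the
    Poisson pmf, so it equals the sum of the positive differences.
    """
    lam = n * p
    return sum(max(binom_pmf(k, n, p) - pois_pmf(k, lam), 0.0)
               for k in range(n + 1))

def approximation_bound(n, p):
    """min(1, 1/lam) * sum of p_j**2 with all p_j equal to p."""
    lam = n * p
    return min(1.0, 1.0 / lam) * n * p ** 2

def no_match_probability(n):
    """Exact probability of no match among n cards (inclusion-exclusion)."""
    return sum((-1) ** k / math.factorial(k) for k in range(n + 1))

n, p = 100, 0.03  # illustrative values (assumption)
print("worst-case gap:", worst_case_gap(n, p))
print("bound:         ", approximation_bound(n, p))
print("no match, n=10:", no_match_probability(10), "vs 1/e:", math.exp(-1))
```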

  • 7/24/2019 05-StatisticalModels

    20/26

    Poisson Model Example

    [Figure: Pr(Y_i = y_i) plotted against number of children, comparing the Poisson model with the empirical distribution]

    source: Hoff, A First Course in Bayesian Statistical Methods

    Extensions: Zero-Inflated Poisson, Negative Binomial,

  • 7/24/2019 05-StatisticalModels

    21/26

    Normal (Gaussian) Distribution

    Wikipedia

    symmetry

    central limit theorem

    characterizations (e.g., via entropy)

    68-95-99.7% rule
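
The 68-95-99.7% rule gives the probability that a Normal random variable falls within 1, 2, and 3 standard deviations of its mean. It can be reproduced from the standard Normal CDF, written here with `math.erf`; the function names are illustrative.

```python
import math

def normal_cdf(z):
    """CDF of the standard Normal, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within(k):
    """P(|Z| <= k) for Z ~ N(0, 1)."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 2), "%")
```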

  • 7/24/2019 05-StatisticalModels

    22/26

    Normal Approximation to Binomial

    Wikipedia
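
As a rough sketch of the Normal approximation to the Binomial, the code below compares the exact CDF of Bin(n, p) with a Normal CDF that has the same mean np and variance np(1-p). The continuity correction (adding 0.5) and the values n = 100, p = 0.5 are assumptions chosen for illustration; the slide itself only shows a picture.

```python
import math

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k + 1))

def normal_cdf(z):
    """CDF of the standard Normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), with a continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return normal_cdf((k + 0.5 - mu) / sigma)

n, p = 100, 0.5  # illustrative values (assumption)
for k in (40, 50, 55, 60):
    print(k, binom_cdf(k, n, p), normal_approx_cdf(k, n, p))
```

The approximation tends to be less accurate when n is small or p is close to 0 or 1.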

  • 7/24/2019 05-StatisticalModels

    23/26

    The Magic of Statistics

    source: Ramsey/Schafer, The Statistical Sleuth

  • 7/24/2019 05-StatisticalModels

    24/26

    The Evil Cauchy Distribution

    http://www.etsy.com/shop/NausicaaDistribution

  • 7/24/2019 05-StatisticalModels

    25/26

    Bootstrap (Efron, 1979)

    data:

    3.142 2.718 1.414 0.693 1.618

    reps:

    1.414 2.718 0.693 0.693 2.718
    1.618 3.142 1.618 1.414 3.142
    1.618 0.693 2.718 2.718 1.414
    0.693 1.414 3.142 1.618 3.142
    2.718 1.618 3.142 2.718 0.693
    1.414 0.693 1.618 3.142 3.142

    resample with replacement, use empirical distribution to approximate true distribution
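
A minimal sketch of the resampling step, using the five data values from the slide. The choice of the sample mean as the statistic, the number of replicates, and the seed are assumptions made for illustration; the slide does not specify them.

```python
import random
import statistics

data = [3.142, 2.718, 1.414, 0.693, 1.618]  # data row from the slide

def bootstrap_means(data, reps, seed=0):
    """Resample the data with replacement `reps` times; return each resample's mean."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(reps)]

means = bootstrap_means(data, reps=2000)
# The spread of the replicate means estimates the standard error of the sample mean.
se = statistics.stdev(means)
print("sample mean:", statistics.mean(data))
print("bootstrap standard error of the mean:", se)
```

Each replicate has the same size as the original data, and values can repeat within a replicate, as in the rows of resamples on the slide.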

  • 7/24/2019 05-StatisticalModels

    26/26

    Bootstrap World

    source: http://pubs.sciepub.com/ijefm/3/3/2/

    which is based on diagram in Efron-Tibshirani book
