05-statisticalmodels
TRANSCRIPT
-
7/24/2019 05-StatisticalModels
1/26
CS109/Stat121/AC209/E-109
Data Science
Statistical Models
Hanspeter Pfister, Joe Blitzstein, and Verena Kaynig
data y(observed)
model parameter !(unobserved)
statistical inference
probability
-
7/24/2019 05-StatisticalModels
2/26
This Week
HW1 due next Thursday - start last week!
Section assignments coming soon
Please avoid posting duplicate questions onPiazza (always searchfirst), and avoid postingcode from your homework solutions (seeAndrews post https://piazza.com/class/
icf0cypdc3243c?cid=310 for more)
https://piazza.com/class/icf0cypdc3243c?cid=310 -
7/24/2019 05-StatisticalModels
3/26
Drowning in data, but starved for information
source: http://extensionengine.com/drowning-in-data-the-biggest-hurdle-for-mooc-proliferation/
http://extensionengine.com/drowning-in-data-the-biggest-hurdle-for-mooc-proliferation/ -
7/24/2019 05-StatisticalModels
4/26
What is a statistical model? a familyof distributions, indexed by parameters sharpens distinction between data and parameters,
and between estimators and estimands
parametric (e.g., based on Normal, Binomial) vs.nonparametric (e.g., methods like bootstrap, KDE)
data y(observed)
model parameter !(unobserved)
statistical inferenceprobability
-
7/24/2019 05-StatisticalModels
5/26
Whatgoodis a statistical model?
All models are wrong, but some models are useful. George Box
All models are wrong, but some models are useful. George Box (1919-2013)
-
7/24/2019 05-StatisticalModels
6/26
All models are wrong, but some models are useful. George Box
In that Empire, the Art of Cartography attained such Perfectionthat the map of a single Province occupied the entirety of a City,and the map of the Empire, the entirety of a Province. In time,
those Unconscionable Maps no longer satisfied, and the
Cartographers Guild struck a Map of the Empire whose size wasthat of the Empire, and which coincided point for point with it.
Jorge Luis Borges,On Exactitude in Science
Borges Google Doodle
-
7/24/2019 05-StatisticalModels
7/26
Big Data vs. Pig Data:https://scensci.wordpress.com/2012/12/14/big-data-or-pig-
data/
https://scensci.wordpress.com/2012/12/14/big-data-or-pig-data/ -
7/24/2019 05-StatisticalModels
8/26
Statistical Models: Two Books
-
7/24/2019 05-StatisticalModels
9/26
Parametric vs. Nonparametric
parametric: finite-dimensional parameter space (e.g.,mean and variance for a Normal)
nonparametric: infinite-dimensional parameter space
is there anything in between? nonparametric is very general, but no free lunch! remember to plot and explore the data!
-
7/24/2019 05-StatisticalModels
10/26
Parametric Model Example:
Exponential Distribution
Remember the memoryless property!
f(x) = ex, x >0
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.
0
0
.2
0.
4
0.
6
0.
8
1.
0
1.
2
x
pdf
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.
0
0.
2
0.
4
0.
6
0.8
1.
0
x
cdf
-
7/24/2019 05-StatisticalModels
11/26
Exponential Distribution
Remember the memoryless property!
f(x) = ex, x >0
Exponential is characterizedby memoryless property all models are wrong, but some are useful... iterate between exploring, the data model-building,
model-fitting, and model-checking
key building block for more realistic models
-
7/24/2019 05-StatisticalModels
12/26
The Weibull Distribution
Exponential has constant hazard function Weibull generalizes this to a hazard that is t to a power much more flexible and realistic than Exponential representation: a Weibull is an Expo to a power
-
7/24/2019 05-StatisticalModels
13/26
Pois
HGeom
Bin
(Bern)
Conditioning
Conditioning
Limit
Limit
Gamma
(Expo, Chi-Square)
NBin
(Geom)
Poisson process
Limit
Beta
(Unif)
Conjugacy
Bank - post office
Conjugacy
Student-t
(Cauchy)
NormalLimit
Limit
Limit
Family Tree of Parametric Distributions
Blitzstein-Hwang, Introduction to Probability
-
7/24/2019 05-StatisticalModels
14/26
Binomial Distribution
0 2 4 6 8 10
0.0
0
0.0
5
0.1
0
0.1
5
0.2
0
0.2
5
0.3
0
x
pmf
0 2 4 6 8 10
Bin(10,1/2)
0 2 4 6 8 10
0.0
0.1
0.2
0.3
0.4
x
pmf
0 2 4 6 8 10
Bin(10,1/8)
0 2 4 6 8 10
0.0
0
0.0
5
0.1
0
0.1
5
0.2
0
0.2
50
.30
x
pmf
0 2 4 6 8 10
Bin(100,0.03)
0 2 4 6 8 10
0.0
0.1
0.2
0.3
0.4
x
pmf
0 2 4 6 8 10
Bin(9,4/5)
story: X~Bin(n,p) is the number of successes in nindependent Bernoulli(p) trials.
-
7/24/2019 05-StatisticalModels
15/26
Binomial Distribution
story: X~Bin(n,p) is the number of successes in n
independent Bernoulli(p) trials.
Example: # votes for candidate A in election with n voters,where each independently votes for A with probability p
mean is np (by story and linearity of expectation:E(X+Y)=E(X)+E(Y))
variance is np(1-p) (by story and the fact thatVar(X+Y)=Var(X)+Var(Y) if X,Y are uncorrelated)
-
7/24/2019 05-StatisticalModels
16/26
(Doonesbury)
-
7/24/2019 05-StatisticalModels
17/26
Poisson Distribution
story: countnumber of events that occur, when there are a
large number of independent rare events.
Examples: # mutations in genetics, # of traffic accidents at acertain intersection, # of emails in an inbox
mean = variance for the Poisson
P(X=k) = ek
k! , k= 0, 1, 2, . . . .
P i PMF CDF
-
7/24/2019 05-StatisticalModels
18/26
0 1 2 3 4 5 6
0.0
0.2
0.4
0.6
0.8
1.0
x
PMF
0 1 2 3 4 5 6
0.0
0.2
0.4
0.6
0.8
1.0
x
CDF
0 2 4 6 8 10
0.0
0.2
0.4
0.6
0.8
1.0
x
PMF
0 2 4 6 8 10
0.0
0.2
0.4
0.6
0.8
1.0
x
CDF
Poisson PMF, CDF
Pois(2)
Pois(5)
P i A i ti
-
7/24/2019 05-StatisticalModels
19/26
Poisson Approximation
|P(X B) P(N B)| min1,1
nX
j=1p2
j .
ifXis the number of events that occur, where the ith event has probabilitypi, and N Pois(), with the average number of events that occur.
Example: in matching problem, the probability of
no match is approximately 1/e.
P i M d l E l
-
7/24/2019 05-StatisticalModels
20/26
Poisson Model Example
0 2 4 6 8
0.0
0
0.1
0
0.2
0
0.3
0
number of children
Pr(Yi==
yi
)
Poisson model
empirical distribution
source: Hoff, A First Course in Bayesian Statistical Methods
Extensions: Zero-Inflated Poisson, Negative Binomial,
-
7/24/2019 05-StatisticalModels
21/26
Normal (Gaussian) Distribution
Wikipedia
symmetry central limit theorem characterizations (e.g., via entropy) 68-95-99.7% rule
-
7/24/2019 05-StatisticalModels
22/26
Normal Approximation to Binomial
Wikipedia
Th M i f St ti ti
-
7/24/2019 05-StatisticalModels
23/26
The Magic of Statistics
source: Ramsey/Schafer, The Statistical Sleuth
-
7/24/2019 05-StatisticalModels
24/26
The Evil Cauchy Distribution
http://www.etsy.com/shop/NausicaaDistribution
http://www.etsy.com/shop/NausicaaDistribution -
7/24/2019 05-StatisticalModels
25/26
Bootstrap (Efron, 1979)
3.142 2.718 1.414 0.693 1.618
1.414 2.718 0.693 0.693 2.718
1.618 3.142 1.618 1.414 3.142
1.618 0.693 2.718 2.718 1.414
0.693 1.414 3.142 1.618 3.142
2.718 1.618 3.142 2.718 0.693
1.414 0.693 1.618 3.142 3.142
data:
reps
resample with replacement, use empirical
distribution to approximate true distribution
-
7/24/2019 05-StatisticalModels
26/26
Bootstrap World
source: http://pubs.sciepub.com/ijefm/3/3/2/
which is based on diagram in Efron-Tibshirani book
http://pubs.sciepub.com/ijefm/3/3/2/