Mathematical Tools for Neural and Cognitive Science
Probability & Statistics: Intro, summary statistics, probability
Fall semester, 2018
- Efron & Tibshirani, Introduction to the Bootstrap, 1998
Some history…
• 1600’s: Early notions of data summary/averaging
• 1700’s: Bayesian prob/statistics (Bayes, Laplace)
• 1920’s: Frequentist statistics for science (e.g., Fisher)
• 1940’s: Statistical signal analysis and communication, estimation/decision theory (e.g., Shannon, Wiener, etc.)
• 1950’s: Return of Bayesian statistics (e.g., Jeffreys, Wald, Savage, Jaynes…)
• 1970’s: Computation, optimization, simulation (e.g., Tukey)
• 1990’s: Machine learning (large-scale computing + statistical inference + lots of data)
• Since 1950’s!: statistical neural/cognitive models
Scientific process (a cycle):
Observe / measure data → Summarize / fit model(s), compare with predictions → Create / modify hypothesis / model → Generate predictions, design experiment → Observe / measure data → …
Descriptive statistics: Central tendency

• We often summarize data with the average. Why?
• The average minimizes the squared error (as in regression!):

  $\mu(\vec{x}) = \arg\min_c \frac{1}{N}\sum_{n=1}^{N}(x_n - c)^2 = \frac{1}{N}\sum_{n=1}^{N} x_n$

• Generalize: minimize the $L_p$ norm:

  $\arg\min_c \left[\frac{1}{N}\sum_{n=1}^{N}|x_n - c|^p\right]^{1/p}$

  – minimize $L_1$ norm: median, $m(\vec{x})$
  – minimize $L_0$ norm: mode
  – minimize $L_\infty$ norm: midpoint of range
• Issues: outliers, asymmetry, bimodality
• How do we choose?
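The $L_p$ view can be checked numerically. A minimal sketch (the sample data and the brute-force grid are invented for illustration, not from the slides):

```python
# Verify numerically that the mean minimizes the L2 error and the median
# minimizes the L1 error, by brute-force search over candidate centers c.
from statistics import mean, median

x = [1.0, 2.0, 2.0, 3.0, 10.0]  # small sample with an outlier

def lp_cost(c, p):
    return (sum(abs(xn - c) ** p for xn in x) / len(x)) ** (1 / p)

candidates = [i / 100 for i in range(1201)]  # grid over [0, 12]
best_l2 = min(candidates, key=lambda c: lp_cost(c, 2))
best_l1 = min(candidates, key=lambda c: lp_cost(c, 1))

print(best_l2, mean(x))    # L2 minimizer lands on the mean (3.6)
print(best_l1, median(x))  # L1 minimizer lands on the median (2.0)
```

Note how the outlier pulls the mean toward 10 but leaves the median untouched, which is exactly the "issues: outliers" point above.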
Descriptive statistics: Dispersion

• Sample standard deviation:

  $\sigma(\vec{x}) = \min_c \left[\frac{1}{N}\sum_{n=1}^{N}(x_n - c)^2\right]^{1/2} = \left[\frac{1}{N}\sum_{n=1}^{N}(x_n - \mu(\vec{x}))^2\right]^{1/2}$

• Mean absolute deviation (MAD) about the median:

  $d(\vec{x}) = \frac{1}{N}\sum_{n=1}^{N}\left|x_n - m(\vec{x})\right|$

• Quantiles
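Both dispersion measures follow directly from their definitions; a sketch (function names and sample data are mine):

```python
# Sample standard deviation and MAD about the median, computed from the
# definitions above (normalizing by N, as on the slide).
from statistics import mean, median

def std_dev(x):
    mu = mean(x)
    return (sum((xn - mu) ** 2 for xn in x) / len(x)) ** 0.5

def mad(x):
    m = median(x)
    return sum(abs(xn - m) for xn in x) / len(x)

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(std_dev(x))  # 2.0
print(mad(x))      # 1.5
```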
Descriptive statistics: Dispersion

Summary statistics (e.g., sample mean/var) can be interpreted as estimates of model parameters. To formalize this, we need tools from probability…
data $\{x_n\}$ → histogram $\{c_k, h_k\}$ → probability distribution $p(x)$
probabilistic model $p_\theta(\vec{x})$ → data $\{\vec{x}_n\}$ (measurement); data → model (inference)
Probabilistic Middleville

In Middleville, every family has two children, brought by the stork.

probabilistic model: The stork delivers boys and girls randomly, with family probabilities {BB, BG, GB, GG} = {0.2, 0.3, 0.2, 0.3}.

data: You pick a family at random and discover that one of the children is a girl.

inference: What are the chances that the other child is a girl?
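Under the reading that "one of the children is a girl" means "at least one is a girl", the answer follows by conditioning on the given family probabilities:

```python
# Condition on "at least one girl": renormalize over the consistent families.
p = {"BB": 0.2, "BG": 0.3, "GB": 0.2, "GG": 0.3}

at_least_one_girl = p["BG"] + p["GB"] + p["GG"]  # 0.8
p_other_girl = p["GG"] / at_least_one_girl

print(p_other_girl)  # 0.375
```

Note the answer differs from the marginal probability of a girl; how the information was obtained matters to the conditioning.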
Statistical Middleville

In Middleville, every family has two children, brought by the stork.

data: In a survey of 100 of the Middleville families, 32 have two girls, 23 have two boys, and the remainder have one of each.

inference: estimate the probabilistic model, i.e., the family probabilities {BB, BG, GB, GG} with which the stork delivers boys and girls.
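The inference here runs the opposite way: from counts to model parameters. A sketch of the relative-frequency (maximum-likelihood) estimate; note the survey cannot distinguish BG from GB, so only the combined "mixed" probability is identifiable:

```python
# Estimate category probabilities from the survey counts by relative frequency.
counts = {"GG": 32, "BB": 23, "mixed": 45}  # 100 families; mixed = BG or GB
n = sum(counts.values())
estimates = {k: v / n for k, v in counts.items()}

print(estimates)  # {'GG': 0.32, 'BB': 0.23, 'mixed': 0.45}
```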
Probability basics (outline)

• distributions: discrete and continuous
• expected value, moments
• cumulative distributions, quantiles, Q-Q plots, drawing samples
• transformations: affine, monotonic nonlinear
Probability: Definitions/notation

Let X, Y, Z be random variables.

They can take on values (like ‘heads’ or ‘tails’; or integers 1–6; or real-valued numbers).

Let x, y, z stand generically for values they can take, and denote events such as X = x.

Write the probability that X takes on value x as P(X = x), or P_X(x), or sometimes just P(x).

P(x) is a function over values x, which we call the probability “distribution” function (pdf) (for continuous variables, “density”).

[Useful to have this notation up on the slide while introducing concepts on the board.]
Probability distributions

Discrete random variable, P(x):

  $0 \le P(x_i) \le 1, \;\; \forall i$

  $\sum_i P(x_i) = 1$

Continuous random variable, p(x):

  $p(x) \ge 0$

  $\int_{-\infty}^{\infty} p(x)\,dx = 1$

  $P(a < x < b) = \int_a^b p(x)\,dx$
Example distributions

[Figures: roll of a fair die; a not-quite-fair coin; sum of two rolled fair dice; clicks of a Geiger counter in a fixed time interval (and time between clicks); horizontal velocity of gas molecules exiting a fan]
Expected value - discrete

  $E(X) = \sum_{i=1}^{N} x_i\, p(x_i)$   [the mean, $\mu$]

[Figure: histogram of # of credit cards per student, and the corresponding probability distribution P(x), with the mean $\mu$ marked]

More generally:

  $E(f(X)) = \sum_{i=1}^{N} f(x_i)\, p(x_i)$
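A sketch of the discrete expectation on a fair die (my example, not the credit-card data from the figure):

```python
# E(X) and, more generally, E(f(X)) as probability-weighted sums.
xs = [1, 2, 3, 4, 5, 6]  # faces of a fair die
p = [1 / 6] * 6

mean_x = sum(x * px for x, px in zip(xs, p))        # E(X) = 3.5
mean_x2 = sum(x ** 2 * px for x, px in zip(xs, p))  # E(X^2) = 91/6

print(mean_x, mean_x2)
```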
Expected value - continuous

  $E(x) = \int x\, p(x)\, dx$   [mean, $\mu$]

  $E(x^2) = \int x^2\, p(x)\, dx$   [“second moment”, $m_2$]

  $E\left((x-\mu)^2\right) = \int (x-\mu)^2\, p(x)\, dx = \int x^2\, p(x)\, dx - \mu^2$   [variance, $\sigma^2$; equal to $m_2$ minus $\mu^2$]

  $E(f(x)) = \int f(x)\, p(x)\, dx$   [“expected value of f”]

Note: this is an inner product, and thus linear:

  $E(af(x) + bg(x)) = aE(f(x)) + bE(g(x))$
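The variance identity $\sigma^2 = m_2 - \mu^2$ can be checked on any distribution; a sketch using a small discrete one (the pmf is invented, and the integrals become sums):

```python
# Check var = m2 - mu^2 by computing both sides from the definitions.
xs = [0, 1, 2, 3, 4]
p = [0.1, 0.2, 0.4, 0.2, 0.1]  # an arbitrary symmetric pmf

mu = sum(x * px for x, px in zip(xs, p))                      # 2.0
m2 = sum(x * x * px for x, px in zip(xs, p))                  # 5.2
var_direct = sum((x - mu) ** 2 * px for x, px in zip(xs, p))

print(var_direct, m2 - mu ** 2)  # both 1.2
```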
Cumulatives

  $c(y) = \int_{-\infty}^{y} p(x)\, dx$

[Figures: pdf p(x) and cumulative c(x), for the sum of two rolled dice (discrete) and for a continuous density]
Drawing samples - discrete

[Figure: a discrete pmf and its cumulative; a uniform draw on [0, 1] is mapped through the inverse cumulative to select a sample]
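The figure's construction, inverting the cumulative at a uniform draw, is a standard way to sample a discrete distribution; a sketch (the values and probabilities are made up):

```python
# Inverse-CDF sampling: a uniform draw in [0, 1) selects the value whose
# cumulative interval contains it.
import random

values = ["a", "b", "c", "d"]
probs = [0.125, 0.375, 0.375, 0.125]

def sample(u):
    c = 0.0
    for v, p in zip(values, probs):
        c += p
        if u < c:
            return v
    return values[-1]  # guard against round-off at u near 1

random.seed(0)
draws = [sample(random.random()) for _ in range(10000)]
print(draws.count("b") / len(draws))  # close to 0.375
```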
Multi-variate probability [on board]

• joint distributions
• marginals (integrating)
• conditionals (slicing)
• Bayes’ rule (inverse probability)
• statistical independence (separability)
• linear transformations
Joint and conditional probability - discrete

P(Ace)   P(Heart)   P(Ace & Heart)   P(Ace | Heart)   P(not Jack of Diamonds)   P(Ace | not Jack of Diamonds)

“Independence”
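These card probabilities can be computed by counting; a sketch with exact fractions (the deck encoding is my own):

```python
# Count events over a 52-card deck; rank and suit are independent,
# but "not Jack of Diamonds" is informative about rank.
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

def prob(event):
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

p_ace = prob(lambda c: c[0] == "A")                    # 1/13
p_heart = prob(lambda c: c[1] == "hearts")             # 1/4
p_ace_and_heart = prob(lambda c: c == ("A", "hearts")) # 1/52
p_ace_given_heart = p_ace_and_heart / p_heart          # 1/13

# "Independence": knowing the suit tells you nothing about the rank.
print(p_ace_given_heart == p_ace)          # True
print(p_ace_and_heart == p_ace * p_heart)  # True

# But excluding one specific card shifts the rank probabilities slightly:
p_not_jd = prob(lambda c: c != ("J", "diamonds"))                          # 51/52
p_ace_given_not_jd = prob(lambda c: c[0] == "A" and c != ("J", "diamonds")) / p_not_jd
print(p_ace_given_not_jd)  # 4/51, not 1/13
```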
Joint distribution (continuous): $p(x, y)$

Marginal distribution: $p(x) = \int p(x, y)\, dy$
Conditional probability

[Venn diagram: regions A, B, A & B, and neither A nor B]

p(A | B) = probability of A given that B is asserted to be true = $\frac{p(A \,\&\, B)}{p(B)}$
Conditional distribution

  $p(x \mid y = 68) = p(x, y = 68) \Big/ \int p(x, y = 68)\, dx = p(x, y = 68) \big/ p(y = 68)$

More generally:

  $p(x \mid y) = p(x, y) / p(y)$

[slice the joint distribution, then normalize by the marginal]
Bayes’ Rule

  $p(A \,\&\, B) = p(B)\, p(A \mid B) = p(A)\, p(B \mid A)$

  $\Rightarrow \; p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$
Bayes’ Rule
p(x|y) = p(y|x) p(x)/p(y)
(a direct consequence of the definition of conditional probability)
Conditional vs. marginal

[Figure: the conditional P(x | Y = 120) overlaid on the marginal P(x)]

In general, the conditionals for different Y values differ. When are they the same? In particular, when are all conditionals equal to the marginal?
Statistical independence

Random variables X and Y are statistically independent if (and only if):

  $p(x, y) = p(x)\, p(y) \;\; \forall\, x, y$

Independence implies that all conditionals are equal to the corresponding marginal:

  $p(x \mid y) = p(x, y) / p(y) = p(x) \;\; \forall\, x, y$

[note: for discrete distributions, this is an outer product!]
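The "outer product" note can be made concrete; a sketch with made-up marginals:

```python
# An independent discrete joint is the outer product of its marginals.
px = [0.2, 0.8]
py = [0.5, 0.3, 0.2]

joint = [[a * b for b in py] for a in px]  # independent by construction

def is_independent(j):
    # Recompute the marginals, then compare the joint to their outer product.
    mx = [sum(row) for row in j]
    my = [sum(row[k] for row in j) for k in range(len(j[0]))]
    return all(abs(j[i][k] - mx[i] * my[k]) < 1e-12
               for i in range(len(mx)) for k in range(len(my)))

print(is_independent(joint))                     # True
print(is_independent([[0.5, 0.0], [0.0, 0.5]]))  # False: perfectly dependent
```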
Sums of RVs

Let Z = X + Y. Since expectation is linear:

  $E(X + Y) = E(X) + E(Y)$

In addition, if X and Y are independent, then

  $E(XY) = E(X)\, E(Y)$

  $\sigma_Z^2 = E\left(\left((X + Y) - (\mu_X + \mu_Y)\right)^2\right) = \sigma_X^2 + \sigma_Y^2$

and $p_Z(z)$ is a convolution of $p_X(x)$ and $p_Y(y)$. [on board]
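Both facts (the convolution and the additivity of variances) can be checked on two independent dice; a sketch:

```python
# pmf of Z = X + Y by convolving the two pmfs; variances add under independence.
die = {k: 1 / 6 for k in range(1, 7)}

def convolve(pa, pb):
    pz = {}
    for a, qa in pa.items():
        for b, qb in pb.items():
            pz[a + b] = pz.get(a + b, 0.0) + qa * qb
    return pz

def mean_var(p):
    mu = sum(k * q for k, q in p.items())
    return mu, sum((k - mu) ** 2 * q for k, q in p.items())

two_dice = convolve(die, die)
mu_z, var_z = mean_var(two_dice)
mu_x, var_x = mean_var(die)

print(max(two_dice, key=two_dice.get))  # 7, the most likely sum
print(abs(var_z - 2 * var_x) < 1e-9)    # variances add: True
```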
Mean and variance

• Mean and variance summarize the centroid/width
• Translation and rescaling of random variables
• Mean/variance of weighted sum of random variables
• The sample average
  – ... converges to the true mean (except for bizarre distributions)
  – ... with variance $\sigma^2/N$
  – ... most common choice for an estimate ...
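The $\sigma^2/N$ behavior of the sample average can be seen by simulation; a sketch with uniform draws (the sample size and trial count are arbitrary choices):

```python
# Variance of the average of N iid draws shrinks as sigma^2 / N.
import random

random.seed(1)
sigma2 = 1 / 12  # variance of Uniform(0, 1)
N = 16
trials = 20000

avgs = [sum(random.random() for _ in range(N)) / N for _ in range(trials)]
mu = sum(avgs) / trials
var_avg = sum((a - mu) ** 2 for a in avgs) / trials

print(var_avg, sigma2 / N)  # both near 1/192
```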
Central limit for a uniform distribution...

[Figures: histograms of 10^4 samples from a uniform density (σ = 1); of (u + u)/√2; of (u + u + u + u)/√4; and of 10 u’s divided by √10, approaching a Gaussian]
Central limit for a binary distribution...

[Figures: histograms of one coin, and of the average of 4, 16, 64, and 256 coins, approaching a Gaussian]
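The coin-averaging demo can be reproduced in a few lines; a sketch (the trial counts are arbitrary):

```python
# Averages of n fair coins: the spread shrinks like 0.5 / sqrt(n) while the
# histogram becomes increasingly bell-shaped.
import random

random.seed(0)

def avg_of_coins(n):
    return sum(random.random() < 0.5 for _ in range(n)) / n

def spread(n, trials=5000):
    s = [avg_of_coins(n) for _ in range(trials)]
    m = sum(s) / trials
    return (sum((x - m) ** 2 for x in s) / trials) ** 0.5

s4, s64 = spread(4), spread(64)
print(s4, s64)  # roughly 0.25 and 0.0625: a 4x shrink for 16x more coins
```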