Review of statistical modeling and probability theory
Alan Moses
ML4bio
What is modeling?
• Describe some observations in a simple, more compact way
X = (X1,X2)
What is modeling?
• Describe some observations in a simple, more compact way
Model: a = −G m / r²
Instead of all the observations, we only need to remember a constant ‘G’ and measure some parameters ‘m’ and ‘r’.
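As a sketch of this idea in Python (the function name and the values for G, Earth's mass, and radius are standard physical constants and our own choices, not from the slides), the whole body of observations collapses into one constant and two measured parameters:

```python
# Newton's gravity model: instead of storing every observed acceleration,
# we keep one constant G and measure the parameters m and r.
G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def acceleration(m, r):
    """Magnitude of gravitational acceleration at distance r from mass m."""
    return G * m / r**2

# At Earth's surface: m ~ 5.972e24 kg, r ~ 6.371e6 m, giving ~9.8 m/s^2
g = acceleration(5.972e24, 6.371e6)
```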
What is statistical modeling?
• Deals also with the ‘uncertainty’ in observations
Expectation
Deviation or Variance
• Also use the term ‘probabilistic’ modeling
• Mathematics is more complicated
What kind of questions will we answer in this course?
What’s the best linear model to explain some data?
What kind of questions will we answer in this course?
Are there multiple groups? What are they?
What kind of questions will we answer in this course?
Given new data, which group do we assign it to?
3 major areas of machine learning
What’s the best linear model to explain some data?
Are there multiple groups? What are they?
Given new data, which group do we assign it to?
• Regression
• Clustering
• Classification
(that have proven useful in biology)
Molecular Biology example
[Figure: distribution of Expression Level, showing the Expectation and Variance; disease cases marked]
X = (L, D)
Molecular Biology example
[Figure, “clustering”: Expression Level distribution split into two classes with (E1, V1) and (E2, V2); disease cases marked]
Class 2 is “enriched” for disease
Molecular Biology example
[Figure, “clustering”: Expression Level distribution split into two classes with (E1, V1) and (E2, V2); disease cases marked]
Class 2 is “enriched” for disease
[Figure, “regression”: Expression Level vs Genotype (AA, Aa, aa)]
Molecular Biology example
[Figure, “clustering”: Expression Level distribution split into two classes with (E1, V1) and (E2, V2); disease cases marked]
Class 2 is “enriched” for disease
[Figure, “regression”: Expression Level vs Genotype (AA, Aa, aa)]
[Figure, “classification”: Expression Level vs Genotype (AA, Aa, aa); new observation with genotype Aa: disease?]
Probability theory
• Probability theory quantifies uncertainty using ‘distributions’
• Distributions are the ‘models’ and they depend on constants and parameters
E.g., in one dimension, the Gaussian or Normal distribution depends on two constants e and π and two parameters that have to be measured, μ and σ
P(X|μ,σ) = (1/√(2πσ²)) · e^(−(X−μ)²/(2σ²))
‘X’ are the possible datapoints that could come from the distribution. In statistics jargon ‘X’ is called a random variable
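A minimal sketch of this density in Python (the function name is ours, not from the course):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """P(X | mu, sigma): the 1-D Gaussian (Normal) density."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# The density peaks at the mean and is symmetric around it;
# for the standard Normal the peak value is 1/sqrt(2*pi)
peak = gaussian_pdf(0.0, 0.0, 1.0)
```

Evaluating the density at individual datapoints is the building block for the likelihood calculations later in the deck.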
Probability theory
• Probability theory quantifies uncertainty using ‘distributions’
• Choosing the distribution or ‘model’ is the first step in building a statistical model
• E.g., data: mRNA expression levels, counts of sequencing reads, presence or absence of protein domains, or ‘A’, ‘C’, ‘G’, and ‘T’s
• We will use different distributions to describe these different types of data.
Typical data and distributions
• Data is categorical (yes or no, A,C,G,T)
• Data is a fraction (e.g., 13 out of 5212)
• Data is a continuous number (e.g., -6.73)
• Data is a ‘natural’ number (0,1,2,3,4…)
• It’s also possible to do regression, clustering and classification without specifying a distribution
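As an illustration, the discrete cases can be written as small probability mass functions. These are common pairings (Bernoulli for yes/no, binomial for fractions, Poisson for natural numbers), not the only valid modeling choices, and the binomial-for-fractions pairing is our addition to the distributions the deck names:

```python
import math

# data type                      -> one common distribution choice
#   categorical (yes/no, A,C,G,T)  -> Bernoulli / categorical
#   fraction (13 out of 5212)      -> binomial
#   natural number (0,1,2,3,...)   -> Poisson
#   continuous number (-6.73)      -> Gaussian (see gaussian_pdf earlier)

def bernoulli_pmf(x, p):          # x in {0, 1}
    return p if x == 1 else 1 - p

def binomial_pmf(k, n, p):        # k successes out of n trials
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):          # counts k = 0, 1, 2, ...
    return lam**k * math.exp(-lam) / math.factorial(k)
```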
Molecular Biology example
[Figure, “classification”: Expression Level vs Genotype (AA, Aa, aa); new observation with genotype Aa: disease?]
• In this example, we might try to combine a Bernoulli for the disease data, Poisson for the genotype and Gaussian for the expression level
• We also might try to classify without specifying distributions
Molecular Biology example
• genomics era means we will almost never have the expression level for just one gene or the genotype at just one locus
• Each gene’s expression level can be considered another ‘dimension’
• for two genes, if each point is data for one person, we can make a graph of this type of data
• for 1000s of genes (Gene 3, Gene 4, Gene 5, …)….
[Figure: scatter plot of Gene 2 Expression Level vs Gene 1 Expression Level]
Molecular Biology example
• genomics era means we will almost never have the expression level for just one gene or the genotype at just one locus
• We’ll usually make 2-D plots, but anything we say about 2-D can usually be generalized to n-dimensions
[Figure: scatter plot of Gene 2 Expression Level vs Gene 1 Expression Level]
Each “observation”, X, contains the expression level for Gene 1 and Gene 2. Represent this as a vector: generally X = (X1, X2); e.g., X = (1.3, 4.6).
Molecular Biology example
• genomics era means we will almost never have the expression level for just one gene or the genotype at just one locus
• We’ll usually make 2-D plots, but anything we say about 2-D can usually be generalized to n-dimensions
[Figure: scatter plot of Gene 2 Expression Level vs Gene 1 Expression Level]
Each “observation”, X, contains the expression level for Gene 1 and Gene 2. Represent this as a vector: generally X = (X1, X2); e.g., X = (1.3, 4.6).
This gives a geometric interpretation to multivariate statistics
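One concrete payoff of the geometric view is that ordinary geometry applies directly to observation vectors, in any number of dimensions. A small sketch (the second sample point is made up for illustration):

```python
import math

# Each observation is a point in "gene-expression space": one coordinate per gene.
x = (1.3, 4.6)   # (Gene 1 level, Gene 2 level) for one person
y = (2.0, 3.1)   # another person (hypothetical values)

# Geometric interpretation: Euclidean distance between two people
dist = math.sqrt(sum((a - b)**2 for a, b in zip(x, y)))
# The same formula works unchanged for vectors with 1000s of genes.
```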
Probability theory
• Probability theory quantifies uncertainty using ‘distributions’
• Distributions are the ‘models’ and they depend on constants and parameters
E.g., in two dimensions, the Gaussian or Normal distribution depends on two constants, e and π, and 5 parameters that have to be measured: μ and Σ

P(X|μ,Σ) = (1/(2π√|Σ|)) · e^(−(X−μ)ᵀ Σ⁻¹ (X−μ)/2)

‘X’ are the possible datapoints that could come from the distribution. In statistics jargon ‘X’ is called a random variable
What does the mean mean in 2 dimensions?
What does the standard deviation mean?
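A sketch of the 2-D density written directly from the formula above, with the 2×2 determinant and inverse coded by hand (the function name is ours):

```python
import math

def gaussian2d_pdf(x, mu, cov):
    """P(X | mu, Sigma) for the 2-D Gaussian, for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # inverse of a 2x2 matrix: swap diagonal, negate off-diagonal, divide by det
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # quadratic form (X - mu)^T Sigma^{-1} (X - mu)
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))
```

With the identity covariance, the density factors into a product of two 1-D Gaussians, which is a quick way to check the formula.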
Bivariate Gaussian
Molecular Biology example
• genomics era means we will almost never have the expression level for just one gene or the genotype at just one locus
• We’ll usually make 2-D plots, but anything we say about 2-D can usually be generalized to n-dimensions
[Figure: scatter plot of Gene 2 Expression Level vs Gene 1 Expression Level]
Each “observation”, X, contains the expression level for Gene 1 and Gene 2. Represent this as a vector: X = (X1, X2)
The mean is also a vector: μ = (μ1, μ2)
The variance is a matrix: Σ = [σ11 σ12; σ21 σ22]
[Figure: four scatter plots, each showing 300 samples drawn in R with mvtnorm’s rmvnorm(n = 300, mean = c(1, 1), sigma = ...), one per covariance matrix below]

“full covariance”: Σ = [1.0 -1.9; -1.9 4.0]
  rmvnorm(n = 300, mean = c(1, 1), sigma = matrix(c(1, -1.9, -1.9, 4), ncol = 2))
“axis-aligned, diagonal covariance”: Σ = [1 0; 0 4]
  rmvnorm(n = 300, mean = c(1, 1), sigma = matrix(c(1, 0, 0, 4), ncol = 2))
“spherical covariance” (Σ = σ²I): Σ = [1 0; 0 1]
  rmvnorm(n = 300, mean = c(1, 1), sigma = matrix(c(1, 0, 0, 1), ncol = 2))
“correlated data”: Σ = [1.0 0.5; 0.5 1.0]
  rmvnorm(n = 300, mean = c(1, 1), sigma = matrix(c(1, 0.5, 0.5, 1), ncol = 2))
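A Python stdlib analogue of the rmvnorm calls above, sketched with a hand-rolled 2×2 Cholesky factor (this is an illustration of how such samplers work under the hood, not the course's code; the function name is ours):

```python
import math
import random

def sample_mvn2(n, mean, cov):
    """Draw n samples from a 2-D Gaussian: the stdlib analogue of
    mvtnorm::rmvnorm, via a hand-rolled Cholesky factor of cov."""
    (a, b), (_, d) = cov            # cov must be symmetric positive-definite
    l11 = math.sqrt(a)              # lower-triangular L with L L^T = cov
    l21 = b / l11
    l22 = math.sqrt(d - l21**2)
    out = []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        out.append((mean[0] + l11 * z1,
                    mean[1] + l21 * z1 + l22 * z2))
    return out

# The four covariance shapes from the slide:
spherical = ((1, 0), (0, 1))        # Sigma = sigma^2 * I
diagonal  = ((1, 0), (0, 4))        # axis-aligned
corr      = ((1, 0.5), (0.5, 1))    # correlated data
full      = ((1, -1.9), (-1.9, 4))  # full covariance
pts = sample_mvn2(300, (1, 1), full)
```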
Probability theory
• Probability theory quantifies uncertainty using ‘distributions’
• Distributions are the ‘models’ and they depend on constants and parameters
• Once we choose a distribution, the next step is to choose the parameters
• This is called “estimation” or “inference”
P(X|μ,σ) = (1/√(2πσ²)) · e^(−(X−μ)²/(2σ²))
Estimation
[Figure: distribution of Expression Level, showing the Expectation and Variance]
We want to make a statistical model.
1. Choose a model (or probability distribution)
2. Estimate its parameters
• Choose the parameters so the model ‘fits the data’
• There are many ways to measure how well a model fits the data
• Different “objective functions” will produce different “estimators” (e.g., MSE, ML, MAP)
P(X|μ,σ) = (1/√(2πσ²)) · e^(−(X−μ)²/(2σ²))
How do we know which parameters fit the data?
Laws of probability
• If X1 … XN are a series of random variables (think datapoints):
P(X1, X2) is the “joint probability” and is equal to P(X1) P(X2) if X1 and X2 are independent.
More generally, for independent variables: P(X1 … XN) = Π_{i=1}^{N} P(Xi)
P(X1 | X2) is the “conditional probability” of event X1 given X2.
Conditional probabilities are related by Bayes’ theorem (true for all distributions):
P(X1 | X2) = P(X2 | X1) P(X1) / P(X2)
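Both laws can be sanity-checked numerically on a small discrete joint distribution (the numbers below are made up for illustration; think X1 = disease status, X2 = test result):

```python
# A made-up joint distribution over two binary variables, summing to 1
joint = {(0, 0): 0.70, (0, 1): 0.10, (1, 0): 0.05, (1, 1): 0.15}

def p1(x1):  # marginal P(X1)
    return sum(v for (a, _), v in joint.items() if a == x1)

def p2(x2):  # marginal P(X2)
    return sum(v for (_, b), v in joint.items() if b == x2)

def cond_1_given_2(x1, x2):  # P(X1 | X2) = P(X1, X2) / P(X2)
    return joint[(x1, x2)] / p2(x2)

def cond_2_given_1(x2, x1):  # P(X2 | X1) = P(X1, X2) / P(X1)
    return joint[(x1, x2)] / p1(x1)

# Bayes' theorem: P(X1|X2) = P(X2|X1) P(X1) / P(X2)
lhs = cond_1_given_2(1, 1)
rhs = cond_2_given_1(1, 1) * p1(1) / p2(1)
```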
Likelihood and MLEs
• Likelihood is the probability of the data (say X) given certain parameters (say θ): L = P(X|θ)
• Maximum likelihood estimation says: choose θ so that the data is most probable, i.e., so that ∂L/∂θ = 0 at the maximum.
• In practice there are many ways to maximize the likelihood.
Example of ML estimation
L = P(X|θ) = P(X1 … XN | μ, σ)

Data:
Xi     P(Xi | μ=6.5, σ=1.5)
5.2    0.182737304
9.1    0.059227322
8.2    0.139963680
7.3    0.230761096
7.8    0.182737304

L = Π_{i=1}^{5} P(Xi | μ=6.5, σ=1.5) = 6.39 × 10⁻⁵

[Figure: likelihood L as a function of the mean, μ]
Example of ML estimation
[Figure: log likelihood, Log(L), as a function of the mean, μ]
In practice, we almost always use the log likelihood, which becomes a very large negative number when there is a lot of data.
![Page 30: Review of statistical modeling and probability theory](https://reader030.vdocument.in/reader030/viewer/2022032612/56813848550346895d9ff612/html5/thumbnails/30.jpg)
Example of ML estimation
[Figure: surface of Log(L) as a function of the mean, μ, and the standard deviation, σ]
ML Estimation
• In general, the likelihood is a function of multiple variables, so the derivatives with respect to all of these should be zero at a maximum
• In the example of the Gaussian, we have two parameters, so that ∂L/∂μ = 0 and ∂L/∂σ = 0
• In general, finding MLEs means solving a set of coupled equations, which usually have to be solved numerically for complex models.
MLEs for the Gaussian
• The Gaussian is the symmetric continuous distribution that has as its “centre” a parameter given by what we consider the “average” (the expectation).
• The MLE for the variance of the Gaussian is like the squared error from the mean, but is actually a biased (but still consistent!) estimator

μ_ML = (1/N) Σ_X X        V_ML = (1/N) Σ_X (X − μ_ML)²
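A quick check of both estimators on the five datapoints from the ML-estimation example; dividing by N versus N − 1 makes the bias visible:

```python
data = [5.2, 9.1, 8.2, 7.3, 7.8]
N = len(data)

mu_ml = sum(data) / N                             # MLE of the mean
v_ml = sum((x - mu_ml)**2 for x in data) / N      # MLE of the variance

# The MLE divides by N, not N-1, which is why it is biased (it slightly
# underestimates the true variance); the unbiased estimator divides by N-1:
v_unbiased = sum((x - mu_ml)**2 for x in data) / (N - 1)
```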
Other estimators
• Instead of likelihood, L = P(X|θ), we can choose parameters to maximize posterior probability: P(θ|X)
• Or minimize the sum of squared errors: Σ_X (X − μ_MSE)²
• Or maximize a penalized likelihood: L* = P(X|θ) e^(−θ²)
• In each case, estimation involves a mathematical optimization problem that usually has to be solved on a computer
• How do we choose?
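To make the contrast concrete, here is a toy sketch comparing the ML estimate with a penalized-likelihood estimate of a Gaussian mean by grid search. Applying the slide's e^(−θ²) penalty to μ is our simplification for illustration; the data and σ are taken from the earlier ML-estimation example:

```python
import math

data = [5.2, 9.1, 8.2, 7.3, 7.8]
sigma = 1.5

def log_lik(mu):
    """Gaussian log likelihood of the data as a function of the mean."""
    return sum(-(x - mu)**2 / (2 * sigma**2)
               - 0.5 * math.log(2 * math.pi * sigma**2) for x in data)

def log_penalized(mu):
    """Penalized likelihood L* = P(X|theta) * exp(-theta^2), in log form."""
    return log_lik(mu) - mu**2

grid = [i / 1000 for i in range(0, 10001)]  # candidate mu values in [0, 10]
mu_ml  = max(grid, key=log_lik)
mu_pen = max(grid, key=log_penalized)
# The penalty pulls the estimate toward zero, so mu_pen < mu_ml
```

Both objectives are solved by the same generic recipe (write down an objective, then optimize it numerically), which is the point of the slide.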