Probability and Stats Intro
Crash course in probability theory and statistics – part 1
Machine Learning, Mon Apr 14, 2008
Motivation
Problem: To avoid relying on “magic” we need mathematics. For machine learning we need to quantify:
● Uncertainty in data measures and conclusions
● “Goodness” of a model (when confronted with data)
● Expected error and expected success rates
● ... and many similar quantities
Probability theory: Mathematical modeling when uncertainty or randomness is present, e.g.
p(X=x_i, Y=y_j) = p_ij, with the frequency interpretation p(X=x_i, Y=y_j) = n_ij / n.
Statistics: The mathematics of the collection of data, description of data, and inference from data.
Introduction to probability theory
Notice: This will be an informal introduction to probability theory (measure theory is out of scope for this course). No sigma-algebras, Borel sets, etc.
For the purpose of this class, our intuition will be right ... in more complex settings it can be very wrong.
We leave the complex setups to the mathematicians and stick to “nice” models.
This introduction will be based on stochastic (random) variables.
We ignore the underlying probability space (Ω, A, P).
If X is the sum of two dice: X(ω) = D_1(ω) + D_2(ω).
We ignore the dice and only consider the variables – X, D_1, and D_2 – and the values they take.
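The two-dice example can be sketched in a few lines of Python (an illustration added here, not from the slides): we only look at the values X takes, never at the underlying sample space.

```python
import random

random.seed(0)

# Simulate the random variable X = D1 + D2 without ever naming the
# underlying sample space: we only record the values the variables take.
def sample_X():
    d1 = random.randint(1, 6)  # value of die D1
    d2 = random.randint(1, 6)  # value of die D2
    return d1 + d2

n = 100_000
counts = {}
for _ in range(n):
    x = sample_X()
    counts[x] = counts.get(x, 0) + 1

# Empirical frequency of X = 7; the exact probability is 6/36 = 1/6.
p7 = counts[7] / n
print(round(p7, 3))  # close to 0.167
```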
Discrete random variables
Sect. 1.2
A discrete random variable, X, is a variable that can take values in a discrete (countable) set { x_i }.
The probability of X taking the value x_i is denoted p(X=x_i) and satisfies p(X=x_i) ≥ 0 for all i, ∑_i p(X=x_i) = 1, and for any subset { x_j } ⊆ { x_i }: p(X ∈ { x_j }) = ∑_j p(x_j).
Intuition/interpretation: If we repeat an experiment (sampling a value for X) n times, and denote by n_i the number of times we observe X=x_i, then n_i/n → p(X=x_i) as n → ∞.
This is the intuition, not a definition! (Definitions based on this end up going in circles.)
The definitions are pure abstract math. Any real-world usefulness is pure luck.
We often simplify the notation and use both p(X) and p(x_i) for p(X=x_i), depending on context.
Joint probability
Sect. 1.2
If a random variable, Z, is a vector, Z = (X, Y), we can consider its components separately.
The probability p(Z=z) where z = (x, y) is the joint probability of X=x and Y=y, written p(X=x, Y=y) or p(x, y).
When clear from context, we write just p(X,Y) or p(x,y), and the notation is symmetric: p(X,Y) = p(Y,X) and p(x,y) = p(y,x).
The probability of X ∈ {x_i} and Y ∈ {y_j} becomes ∑_i ∑_j p(x_i, y_j).
Marginal probability
Sect. 1.2
The probability of X=x_i regardless of the value of Y becomes ∑_j p(x_i, y_j); it is called the marginal probability of X and is written just p(x_i).
The sum rule:
p(X) = ∑_Y p(X, Y)    (1.10)
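A minimal sketch of the sum rule on a made-up joint table (the numbers are illustrative; they just need to sum to 1):

```python
# A toy joint distribution p(x, y) over X in {0, 1, 2} and Y in {0, 1}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.25, (1, 1): 0.15,
    (2, 0): 0.05, (2, 1): 0.25,
}

# Sum rule: p(x) = sum over y of p(x, y)
def marginal_x(joint):
    p_x = {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return p_x

p_x = marginal_x(joint)
print(p_x)  # roughly {0: 0.30, 1: 0.40, 2: 0.30}, up to float rounding
```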
Conditional probability
Sect. 1.2
The conditional probability of X given Y is written p(X|Y) and is the quantity satisfying p(X,Y) = p(X|Y) p(Y).
The product rule:
p(X, Y) = p(X|Y) p(Y)    (1.11)
When p(Y) ≠ 0 we get p(X|Y) = p(X,Y) / p(Y), with a simple interpretation.
Intuition: Before we observe anything, the probability of X is p(X), but after we observe Y it becomes p(X|Y).
![Page 18: Probability And Stats Intro](https://reader034.vdocument.in/reader034/viewer/2022051513/547d318b5806b503408b483c/html5/thumbnails/18.jpg)
Independence
When p(X,Y) = p(X)p(Y) we say that X andY are
independent.
In this case:
Intuition/justification: Observing Y does not change the
probability of X.
Sect. 1.2
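The product rule and independence can be checked numerically on a small table (illustrative numbers, chosen so that the table factorizes):

```python
# Joint table for two binary variables; here p(x, y) = p(x) p(y) by design.
joint = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}

def marginal(joint, axis):
    m = {}
    for pair, p in joint.items():
        m[pair[axis]] = m.get(pair[axis], 0.0) + p
    return m

p_x = marginal(joint, 0)   # {0: 0.40, 1: 0.60}
p_y = marginal(joint, 1)   # {0: 0.30, 1: 0.70}

# Conditional via p(x|y) = p(x, y) / p(y), defined when p(y) != 0.
def conditional(joint, p_y, x, y):
    return joint[(x, y)] / p_y[y]

# The table factorizes, so X and Y are independent and conditioning
# on Y leaves p(x) unchanged.
independent = all(
    abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
    for (x, y) in joint
)
print(independent, conditional(joint, p_y, 0, 1))
```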
Example
Sect. 1.2
B – colour of bucket
F – kind of fruit
p(F=a, B=r) = p(F=a|B=r) p(B=r) = 2/8 × 4/10 = 1/10
p(F=a, B=b) = p(F=a|B=b) p(B=b) = 3/4 × 6/10 = 9/20
p(F=a) = p(F=a, B=r) + p(F=a, B=b) = 1/10 + 9/20 = 11/20
Bayes' theorem
Sect. 1.2
Since p(X,Y) = p(Y,X) (symmetry) and p(X,Y) = p(Y|X) p(X) (product rule), it follows that p(Y|X) p(X) = p(X|Y) p(Y) or, when p(X) ≠ 0:
Bayes' theorem:
p(Y|X) = p(X|Y) p(Y) / p(X)    (1.12)
Here p(Y|X) is the posterior of Y, p(X|Y) the likelihood of Y, and p(Y) the prior of Y.
Sometimes written p(Y|X) ∝ p(X|Y) p(Y), where p(X) = ∑_Y p(X|Y) p(Y) is an implicit normalising factor.
Interpretation: Prior to an experiment, the probability of Y is p(Y). After observing X, the probability is p(Y|X). Bayes' theorem tells us how to move from prior to posterior.
This is possibly the most important equation in the entire class!
Example
Sect. 1.2
B – colour of bucket
F – kind of fruit
If we draw an orange, what is the probability that we drew it from the blue bucket?
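A sketch of the answer, assuming the conditional probabilities implied by the earlier example (p(F=a|B=r) = 2/8, p(F=a|B=b) = 3/4, priors 4/10 and 6/10) and assuming each bucket holds only apples and oranges:

```python
# Priors and likelihoods from the bucket example. p(F=o|B) = 1 - p(F=a|B)
# assumes the buckets contain only apples (a) and oranges (o).
p_B = {"r": 4 / 10, "b": 6 / 10}          # prior over buckets
p_a_given_B = {"r": 2 / 8, "b": 3 / 4}    # p(F=apple | B)
p_o_given_B = {b: 1 - p for b, p in p_a_given_B.items()}

# Normalising factor via the sum rule: p(F=o) = sum over B of p(F=o|B) p(B)
p_o = sum(p_o_given_B[b] * p_B[b] for b in p_B)

# Bayes' theorem: p(B=b | F=o) = p(F=o | B=b) p(B=b) / p(F=o)
posterior_blue = p_o_given_B["b"] * p_B["b"] / p_o
print(posterior_blue)  # 1/3: observing an orange lowers blue's prior of 6/10
```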
Continuous random variables
Sect. 1.2.1
A continuous random variable, X, is a variable that can take values in R^d.
The probability density of X is an integrable function p(x) satisfying p(x) ≥ 0 for all x and ∫ p(x) dx = 1.
The probability of X ∈ S ⊆ R^d is given by p(S) = ∫_S p(x) dx.
Expectation
Sect. 1.2.2
The expectation or mean of a function f of a random variable X is a weighted average:
E[f] = ∑_x p(x) f(x) (discrete case) or E[f] = ∫ p(x) f(x) dx (continuous case).
For both discrete and continuous random variables:
E[f] ≈ (1/N) ∑_{n=1}^{N} f(x_n)    (1.35)
with equality in the limit as N → ∞ when x_n ~ p(X).
Intuition: If you repeatedly play a game with gain f(x), your expected overall gain after n games will be n E[f].
The accuracy of this prediction increases with n.
It might not even be possible to “gain” E[f] in a single game.
Example: Game of dice with a fair die, D the value of the die, “gain” function f(d) = d. Then E[f] = (1+2+3+4+5+6)/6 = 3.5, a value no single roll can give.
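Eq. (1.35) in action: a short Monte Carlo sketch of the dice game (an illustration added here, not from the slides):

```python
import random

random.seed(1)

# "Gain" function f(d) = d for a fair die; the exact mean is E[f] = 3.5.
def f(d):
    return d

N = 200_000
total = sum(f(random.randint(1, 6)) for _ in range(N))
estimate = total / N  # (1/N) sum_n f(x_n), Eq. (1.35)

print(round(estimate, 2))  # close to 3.5, though no single game pays 3.5
```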
Variance
Sect. 1.2.2
The variance of f(x) is defined as
var[f] = E[(f(x) − E[f(x)])^2]
and can be seen as a measure of variability around the mean.
The covariance of X and Y is defined as
cov[x, y] = E_{x,y}[(x − E[x]) (y − E[y])]
and measures the variability of the two variables together.
When cov[x, y] > 0: when x is above its mean, y tends to be above its mean as well.
When cov[x, y] < 0: when x is above its mean, y tends to be below its mean.
When cov[x, y] = 0: x and y are uncorrelated (not necessarily independent; independence implies uncorrelated, though).
(Figure: scatter plots of two variables with cov[x_1, x_2] > 0 and cov[x_1, x_2] = 0.)
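A small numeric sketch of the three covariance cases, including the "uncorrelated but dependent" caveat (the example data is made up):

```python
def mean(v):
    return sum(v) / len(v)

def cov(xs, ys):
    # Sample covariance: average of (x - E[x]) (y - E[y])
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(cov(xs, xs))                      # positive: y rises with x
print(cov(xs, list(reversed(xs))))      # negative: y falls as x rises
# Uncorrelated but dependent: y = (x - 3)^2 is a function of x,
# yet its covariance with x is 0 by symmetry around the mean.
print(cov(xs, [(x - 3.0) ** 2 for x in xs]))  # 0.0
```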
Parameterized distributions
Many distributions are governed by a few parameters.
E.g. coin tossing (Bernoulli distribution) is governed by the probability of “heads”.
Binomial distribution: the number of “heads”, k, out of n coin tosses:
p(k | n, θ) = C(n, k) θ^k (1 − θ)^(n−k)
We can think of a parameterized distribution as a conditional distribution.
The function x → p(x | θ) is the probability of observation x given parameter θ.
The function θ → p(x | θ) is the likelihood of parameter θ given observation x. Sometimes written lhd(θ | x) = p(x | θ).
The likelihood, in general, is not a probability distribution.
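A sketch contrasting the two readings of p(k | n, θ); the numbers n = 10, θ = 0.3, k = 4 are arbitrary choices for illustration:

```python
from math import comb

# Binomial pmf p(k | n, theta) = C(n, k) theta^k (1 - theta)^(n - k)
def binom_pmf(k, n, theta):
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

n, theta = 10, 0.3

# Read as a probability over k (theta fixed), the values sum to 1...
total_over_k = sum(binom_pmf(k, n, theta) for k in range(n + 1))
print(total_over_k)  # 1.0 up to rounding

# ...but read as a likelihood over theta (k fixed), it does NOT
# integrate to 1. Midpoint rule on [0, 1]; the exact value is 1/(n+1).
k, m = 4, 10_000
integral = sum(binom_pmf(k, n, (i + 0.5) / m) for i in range(m)) / m
print(integral)  # about 1/11 = 0.0909..., not 1
```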
Parameter estimation
Generally, parameters are not known but must be estimated from observed data.
Maximum Likelihood (ML): θ_ML = argmax_θ p(x | θ)
Maximum A Posteriori (MAP): θ_MAP = argmax_θ p(θ | x) = argmax_θ p(x | θ) p(θ)
(A Bayesian approach assuming a distribution over parameters.)
Fully Bayesian: keep the posterior p(θ | x) ∝ p(x | θ) p(θ)
(Estimates a distribution rather than a single parameter.)
Parameter estimation
Example: We toss a coin and get a “head”. Our model is a binomial distribution; x is one “head” and θ the probability of a “head”.
Likelihood: p(x | θ) = θ
Prior: p(θ)
Posterior: p(θ | x) ∝ p(x | θ) p(θ) = θ p(θ)
ML estimate: the likelihood θ is maximized at θ_ML = 1 – the estimate claims the coin always lands “heads”!
MAP estimate: maximize the posterior θ p(θ); the prior pulls the estimate away from 1.
Fully Bayesian approach: keep the entire posterior distribution p(θ | x) rather than any single point estimate.
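A sketch of the ML and MAP estimates for the one-head example. The Beta(2, 2) prior is an assumption for illustration only, since the slides leave the prior unspecified:

```python
# Data: a single toss came up heads, so the likelihood is p(x | theta) = theta.
# Assumed prior (NOT from the slides): Beta(2, 2), p(theta) ∝ theta (1 - theta).

def likelihood(theta):
    return theta

def prior(theta):          # Beta(2, 2), unnormalised
    return theta * (1 - theta)

def posterior(theta):      # p(theta | x) ∝ p(x | theta) p(theta)
    return likelihood(theta) * prior(theta)

# Grid search for the maximisers on [0, 1].
grid = [i / 10_000 for i in range(10_001)]
theta_ml = max(grid, key=likelihood)     # ML: 1.0 -- all mass on "heads"
theta_map = max(grid, key=posterior)     # MAP: about 2/3, pulled back by the prior
print(theta_ml, theta_map)
```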
Predictions
Assume now a known joint distribution p(x, t | θ) of explanatory variable x and target variable t. When observing a new x, we can use p(t | x, θ) to make predictions about t.
Decision theory
Sect. 1.5
Based on p(x, t | θ) we often need to make decisions.
This often means taking one of a small set of actions A_1, A_2, ..., A_k based on an observed x.
Assume that the target variable is in this set; then we make decisions based on p(t | x, θ) = p(A_i | x, θ).
Put in a different way: we use p(x, t | θ) to classify x into one of k classes, C_i.
We can approach this by splitting the input into regions, R_i, and make decisions based on these:
In R_1 go for C_1; in R_2 go for C_2.
Choose regions to minimize classification errors:
p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
(In the figure: the red and green areas mis-classify C_2 as C_1, the blue area mis-classifies C_1 as C_2, and at the boundary x_0 the red area is gone and p(mistake) is minimized.)
x_0 is where p(x, C_1) = p(x, C_2) or, equivalently, p(C_1 | x) p(x) = p(C_2 | x) p(x), so we get the intuitively pleasing rule: assign x to the class with the largest posterior p(C_i | x).
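The resulting decision rule can be sketched with two made-up Gaussian class-conditional densities (an illustration, not the figure from the slides):

```python
from math import exp, pi, sqrt

# Two classes with Gaussian class-conditional densities and equal priors.
def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def joint(x, c):  # p(x, C_c) = p(x | C_c) p(C_c)
    mus = {1: 0.0, 2: 3.0}
    return normal_pdf(x, mus[c], 1.0) * 0.5

# Decide the class with the largest p(x, C), i.e. the largest posterior,
# which minimises the probability of a mistake.
def decide(x):
    return 1 if joint(x, 1) >= joint(x, 2) else 2

# The boundary x0 sits where the two joints cross; here, by symmetry, at 1.5.
print(decide(0.5), decide(2.5))  # 1 2
```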
Model selection
Sect. 1.3
Where do we get p(t, x | θ) from in the first place?
There is no right model – a fair coin or a fair die is as unrealistic as a spherical cow!
Sometimes there are obvious candidates to try – either for the joint or conditional probabilities p(x, t | θ) or p(t | x, θ).
Sometimes we can try a "generic" model – linear models, neural networks, ...
This is the topic of most of this class!
But some models are more useful than others.
If we have several models, how do we measure the usefulness of each?
A good measure is prediction accuracy on new data.
If we compare two models, we can take a maximum likelihood approach – pick the model M maximizing p(t, x | M) – or a Bayesian approach – pick the model maximizing p(M | t, x) ∝ p(t, x | M) p(M) – just as for parameters.
But there is an overfitting problem: complex models often fit training data better without generalizing better!
In the Bayesian approach, use p(M) to penalize complex models.
In the ML approach, use some information criterion and maximize ln p(t, x | M) − penalty(M).
Or a more empirical approach: use some method of splitting the data into training data and test data and pick the model that performs best on the test data (and retrain that model with the full dataset).
Summary
● Probabilities
● Stochastic variables
● Marginal and conditional probabilities
● Bayes' theorem
● Expectation, variance and covariance
● Estimation
● Decision theory and model selection