Preliminaries – Prof. Navneet Goyal, CS & IS, BITS Pilani
TRANSCRIPT
Topics
- Probability Theory
- Decision Theory
- Information Theory
Probability Theory
Key concept: dealing with uncertainty
– due to noise and finite data sets
Subtopics:
- Probability densities
- Bayesian probabilities
- The Gaussian (normal) distribution
- Curve fitting revisited
- Bayesian curve fitting
- Maximum likelihood estimation
Probability Theory: Frequentist (Classical) Approach
- Population parameters are fixed constants whose values are unknown
- Experiments are imagined to be repeated an indefinitely large number of times
- If you toss a fair coin 10 times, it may not be unusual to observe 80% heads
- If you toss a coin 10 trillion times, we can be fairly certain that the proportion of heads will be close to 50%
- Long-run behavior defines probability!
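The long-run behavior described above can be illustrated with a tiny simulation (a sketch for illustration, not from the lecture; the fair coin is modeled with Python's `random` module):

```python
# Simulate tossing a fair coin n times and report the proportion of heads.
# With few tosses the proportion can stray far from 0.5; it settles toward
# 0.5 as the number of tosses grows (the frequentist long-run behavior).
import random

random.seed(0)  # fixed seed so the run is reproducible

def heads_proportion(n_tosses):
    """Proportion of heads in n_tosses simulated fair-coin flips."""
    return sum(random.random() < 0.5 for _ in range(n_tosses)) / n_tosses

for n in (10, 1000, 100000):
    print(n, heads_proportion(n))
```

The small-sample proportions jump around, while the 100,000-toss proportion lands very close to 0.5.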
Probability Theory: Frequentist (Classical) Approach
- What is the probability that a terrorist will strike an Indian city using an AK-47? Here it is difficult to conceive of the long-run behavior
- In the frequentist approach the parameters are fixed, and the randomness lies in the data
- Data are viewed as a random sample from a given distribution with unknown but fixed parameters
Probability Theory: Bayesian Approach
- Turn the assumptions around: parameters are considered to be random variables, and the data are considered to be known
- Parameters come from a distribution of possible values; Bayesians look to the observed data to provide information on likely parameter values
- Let θ represent the parameters of the unknown distribution
- The Bayesian approach requires elicitation of a prior distribution for θ, denoted p(θ)
Probability Theory: Bayesian Approach
- p(θ) can model extant expert (domain) knowledge, if any, regarding the distribution of θ
- For example, churn-modeling experts in telcos may be aware that a customer exceeding a certain threshold number of calls to customer service may indicate a likelihood to churn
- This can be combined with prior information about the distribution of customer-service calls, including its mean and standard deviation
- A non-informative prior assigns equal probabilities to all values of the parameter
- e.g., prior probability of both churners and non-churners = 0.5 (the telco in question is doomed!!)
Probability Theory: Bayesian Approach
- The prior distribution is generally dominated by the overwhelming amount of information that is found in the data
- p(θ|X) is the posterior probability, where X represents the entire array of data
- This updating of the knowledge about θ was first performed by Reverend Thomas Bayes (1702–1761)
Probability Theory
Apples and Oranges
Probability Theory
Marginal Probability
Conditional Probability
Joint Probability
The Rules of Probability
Sum Rule
Product Rule
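The formulas for these two rules did not survive extraction; the standard forms for discrete variables X and Y are:

```latex
\begin{aligned}
\text{Sum rule:}\quad & p(X) = \sum_{Y} p(X, Y) \\
\text{Product rule:}\quad & p(X, Y) = p(Y \mid X)\, p(X)
\end{aligned}
```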
Bayes’ Theorem
posterior ∝ likelihood × prior
Bayes’ theorem plays a central role in ML!
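The theorem itself, reconstructed in its standard form (with the denominator supplied by the sum rule):

```latex
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)},
\qquad p(X) = \sum_{Y} p(X \mid Y)\, p(Y)
```

In the parameter-estimation setting discussed earlier, this reads p(θ|X) ∝ p(X|θ) p(θ).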
Joint Distribution over 2 variables
Probability Densities
If the probability of a real-valued variable x falling in the interval (x, x + δx) is given by p(x)δx for δx → 0, then p(x) is called the probability density over x
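In equation form (a standard reconstruction, since the slide's formulas are missing from the transcript): the probability that x lies in an interval (a, b), and the two conditions a density must satisfy, are

```latex
p\bigl(x \in (a, b)\bigr) = \int_a^b p(x)\,\mathrm{d}x,
\qquad p(x) \ge 0,
\qquad \int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1
```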
The Gaussian Distribution
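The slide's formula is missing from the transcript; the standard univariate Gaussian density with mean μ and variance σ² is

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
= \frac{1}{(2\pi\sigma^2)^{1/2}}
\exp\!\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}
```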
Decision Theory
- Probability theory provides us with a consistent mathematical framework for quantifying and manipulating uncertainty
- Decision theory combined with probability theory enables us to make optimal decisions in uncertain situations
- Input vector x, target variable t
- The joint probability distribution p(x, t) provides a complete summary of the uncertainty associated with the variables x and t
- Determination of p(x, t) from a set of training data is an example of inference – a very difficult problem
- In practical applications, we make a specific prediction for the value of t and take a specific action based on our understanding of the values t is likely to take
- This is decision theory
Decision Theory
- The decision stage is generally very simple, even trivial, once we have solved the inference problem
- Role of probabilities in decision making: when we receive an X-ray image of a patient, we need to decide its class
- We are interested in the probabilities of the two classes given the image – use Bayes' theorem
Decision Theory
Errors
Decision Theory
Optimal Decision Boundary??
Equivalent to the minimum-misclassification-rate decision rule: assign each value of x to the class having the higher posterior probability p(Ck|x)
Decision Theory: Minimizing Expected Loss
- Simply minimizing the number of misclassifications does not suffice in all cases
- For example: spam mail filtering, intrusion detection systems, disease diagnosis, etc.
- Attach a very high cost to the type of misclassification you want to minimize/eliminate
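The high-cost idea can be sketched as a minimum-expected-loss rule. The loss matrix and posterior values below are invented for illustration, not from the lecture:

```python
# Minimum-expected-loss decision rule: pick the action j that minimizes
# sum_k loss[k][j] * p(C_k | x), rather than just the most probable class.
def min_expected_loss_decision(posteriors, loss):
    """posteriors[k] = p(C_k | x); loss[k][j] = cost of action j when truth is C_k."""
    n_actions = len(loss[0])
    expected = [sum(loss[k][j] * posteriors[k] for k in range(len(posteriors)))
                for j in range(n_actions)]
    return min(range(n_actions), key=lambda j: expected[j])

# Disease diagnosis: classes = (healthy, sick); actions = (declare healthy, declare sick).
# Missing a sick patient is penalized 100x more than a false alarm.
loss = [[0, 1],     # truly healthy: cost 0 if declared healthy, 1 if declared sick
        [100, 0]]   # truly sick: cost 100 if declared healthy, 0 if declared sick
posteriors = [0.9, 0.1]  # p(healthy|x) = 0.9, p(sick|x) = 0.1
print(min_expected_loss_decision(posteriors, loss))  # → 1: declare "sick" despite the low posterior
```

With a symmetric loss matrix this rule reduces to picking the class with the higher posterior, matching the minimum-misclassification-rate rule.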
Information Theory
- How much information is received when we observe a specific value of a discrete random variable x?
- The amount of information is the degree of surprise: an event that is certain conveys no information; there is more information when the event is unlikely
- Entropy: a measure of disorder/unpredictability, or a measure of surprise
- Tossing a coin:
  - Fair coin – maximum entropy, as there is no way to predict the outcome of the next toss
  - Biased coin – less entropy, as uncertainty is lower and we can bet preferentially on the most frequent result
  - Two-headed coin – zero entropy, as the coin will always turn up heads
- Most collections of data in the real world lie somewhere in between
Information Theory: How to Measure Entropy?
- Information content depends upon the probability distribution of x
- We look for a function h(x) that is a monotonic function of the probability p(x)
- If two events x and y are unrelated, then h(x,y) = h(x) + h(y)
- Two unrelated events are statistically independent: p(x,y) = p(x)p(y)
- Hence h(x) must be the log of p(x): h(x) = -log2 p(x)
- The negative sign ensures that information is positive or zero
Information Theory
- h(x) = -log2 p(x); the negative sign ensures that information is positive or zero
- The choice of base for the logarithm is arbitrary; information theory conventionally uses base 2, so the units of h(x) are 'bits'
- Suppose a sender wishes to transmit the value of a random variable to a receiver
- The average amount of information they transmit is obtained by taking the expectation of h(x) with respect to the distribution p(x)
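Taking that expectation gives the entropy of the distribution (standard form):

```latex
H[x] = -\sum_{x} p(x) \log_2 p(x)
```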
Entropy
An important quantity in:
- coding theory
- statistical physics
- machine learning (classification using decision trees)
Entropy
- Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x?
- If all states are equally likely, H[x] = -8 × (1/8) log2(1/8) = 3 bits; that is, we need to transmit a message of length 3 bits
- Now consider an RV x having 8 possible states (a, b, ..., h) whose respective probabilities are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
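The entropies of the two 8-state distributions above can be checked in a few lines (the probabilities are exactly those listed on the slide):

```python
# Compute H[x] = -sum p(x) log2 p(x) for the uniform and skewed 8-state examples.
from math import log2

def entropy_bits(probs):
    """Entropy in bits, skipping zero-probability states."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1/8] * 8
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy_bits(uniform))  # → 3.0 bits
print(entropy_bits(skewed))   # → 2.0 bits
```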
Entropy
- The non-uniform distribution has a smaller entropy (2 bits) than the uniform one (3 bits)! This has an interpretation in terms of disorder
- Use shorter codes for more probable events and longer codes for less probable events, in the hope of getting a shorter average code length
Entropy
- Noiseless coding theorem (Shannon): entropy is a lower bound on the number of bits needed to transmit the state of a random variable
- When entropy is related to other topics, natural logarithms are often used instead; the units are then 'nats' rather than bits
Linear Basis Function Models
Polynomial basis functions: φj(x) = x^j
These are global: a small change in x affects all basis functions.
Linear Basis Function Models
Gaussian basis functions: φj(x) = exp(-(x - μj)² / (2s²))
These are local: a small change in x only affects nearby basis functions. μj and s control location and scale (width).
Linear Basis Function Models
Sigmoidal basis functions: φj(x) = σ((x - μj)/s), where σ(a) = 1/(1 + exp(-a))
These too are local: a small change in x only affects nearby basis functions. μj and s control location and scale (slope).
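The three basis-function families can be sketched in a few lines (the forms follow the usual textbook definitions; the sample inputs and the values of μ and s are arbitrary):

```python
# Minimal implementations of the polynomial, Gaussian, and sigmoidal basis functions.
from math import exp

def poly_basis(x, j):
    """Polynomial basis: phi_j(x) = x**j (global in x)."""
    return x ** j

def gaussian_basis(x, mu, s):
    """Gaussian basis: phi(x) = exp(-(x - mu)^2 / (2 s^2)) (local, peak 1 at x = mu)."""
    return exp(-((x - mu) ** 2) / (2 * s ** 2))

def sigmoid_basis(x, mu, s):
    """Sigmoidal basis: phi(x) = sigma((x - mu)/s), sigma(a) = 1/(1 + exp(-a))."""
    a = (x - mu) / s
    return 1.0 / (1.0 + exp(-a))

print(poly_basis(2.0, 3))            # → 8.0
print(gaussian_basis(0.5, 0.5, 1.0)) # → 1.0 (at its center)
print(sigmoid_basis(0.5, 0.5, 1.0))  # → 0.5 (at its center)
```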
Home Work
- Read about Gaussian, sigmoidal, and Fourier basis functions
- Sequential learning and online algorithms
- Will discuss in the next class!
The Bias-Variance Decomposition
- The bias-variance decomposition is a formal method for analyzing the prediction error of a predictive model
- Projectile analogy – Bias: the average distance between the target and the location where the projectile hits the ground (depends on the angle)
- Variance: the deviation between each hit location x and the average position where the projectile hits the ground (depends on the force)
- Noise: if the target is not stationary, then the observed distance is also affected by changes in the location of the target
The Bias-Variance Decomposition
- A low-degree polynomial has high bias (fits poorly) but low variance across different data sets
- A high-degree polynomial has low bias (fits well) but high variance across different data sets
- Interactive demo at:
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_bias_variance.htm
The Bias-Variance Decomposition
- True height of the Chinese emperor: 200 cm, about 6'6"
- Poll a random American and ask, "How tall is the emperor?"
- We want to determine how wrong they are, on average
The Bias-Variance Decomposition
- Each scenario has an expected value of 180 (i.e., a bias error of 20), but increasing variance in the estimate
- Squared error = (bias error)² + variance; as the variance increases, the error increases
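The identity above can be checked numerically. The poll answers below are made-up illustrative numbers with mean 180 (truth = 200 cm), not data from the lecture:

```python
# Verify numerically that E[(guess - truth)^2] = bias^2 + variance.
truth = 200.0
guesses = [170.0, 175.0, 180.0, 185.0, 190.0]  # mean = 180 → bias = -20

n = len(guesses)
mean = sum(guesses) / n
bias = mean - truth                                   # -20.0
variance = sum((g - mean) ** 2 for g in guesses) / n  # population variance = 50.0
mse = sum((g - truth) ** 2 for g in guesses) / n      # mean squared error = 450.0

print(mse, bias ** 2, variance)  # → 450.0 400.0 50.0, and 450 = 400 + 50
```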