Machine Learning
Part I: Classification and Bayesian Learning
Ref: E. Alpaydin, Intro to Machine Learning, MIT 2004
Machine Learning

• Machine learning is programming computers to optimize a performance criterion using example data or past experience.
  – Inference from samples
• There is a process that explains the data we observe, but we do not know the details of how the data are generated.
  – Internet requests, failure events, etc.
• Even if it is hard to identify (model) the process completely, we can construct a good and useful approximation that detects certain patterns. Such patterns help us understand the process and make predictions about the future.
Types of Machine Learning

• Supervised learning creates a function from training data. The training data consist of pairs of input objects (typically vectors) and desired outputs.
  – Classification: Given an input, the output is a class label of the input object (e.g., a Boolean yes/no).
  – Regression: If the label is a numerical value, learn the function f(x) that best explains the input instances.
• Unsupervised learning: manual labels of inputs are not used.
  – Clustering: partition a data set into subsets (clusters) so that the data in each subset share some common trait.
• Semi-supervised learning: make use of both labeled and unlabeled data for training.
• Reinforcement learning: learn a policy, i.e., a sequence of outputs; there is no supervised output, only delayed reward.
  – Examples: game playing, robot navigation
Supervised Learning

• Use of Supervised Learning
• Classification
• Regression
• Evaluation Methodology
• Bayesian Learning for Classification
Why Supervised Learning?
• Prediction of future cases: Use the rule to predict the output for future inputs
• Knowledge extraction: The rule is easy to understand
• Compression: The rule is simpler than the data it explains
• Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
Classification

• E.g., credit scoring: differentiate between low-risk and high-risk customers based on their income and savings.
• Rule-based prediction with a discriminant:
  IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
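A minimal sketch of this discriminant in Python; the threshold values below are illustrative stand-ins, not thresholds learned from data:

```python
# Hypothetical thresholds; in practice theta1 and theta2 would be learned.
THETA1 = 40_000   # income threshold
THETA2 = 10_000   # savings threshold

def credit_risk(income: float, savings: float) -> str:
    """Rule-based discriminant: low-risk iff both thresholds are exceeded."""
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(credit_risk(55_000, 12_000))  # -> low-risk
print(credit_risk(55_000, 5_000))   # -> high-risk
```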
Learning a Class from Examples

• Given a set of examples of cars, each labeled "family car" or not according to a survey, class learning is to find a description that is shared by all positive examples.
• Use of the class info:
  – Prediction: Is car x a family car?
  – Knowledge extraction: What do people expect from a family car?
Training set X

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

Label of each instance:

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases}$$

Input representation (attributes: price and engine power):

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
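As a concrete sketch, the training set can be held as a pair of NumPy arrays; all the numbers below are made up for illustration:

```python
import numpy as np

# Each row is one instance x^t = [price, engine power]; values are illustrative.
X = np.array([
    [16_000, 60],
    [24_000, 90],
    [30_000, 150],
    [55_000, 220],
])
# r^t = 1 for a positive (family car) example, 0 for a negative one.
r = np.array([0, 1, 1, 0])
```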
Hypothesis Class: C

Most specific hypothesis, S
Most general hypothesis, G

$$(p_1 \le \text{price} \le p_2) \text{ AND } (e_1 \le \text{engine power} \le e_2)$$

Learning is to find a particular hypothesis h to approximate C.
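A sketch of one such axis-aligned rectangle hypothesis; the bounds p1, p2, e1, e2 are the free parameters that learning must choose:

```python
def rectangle_h(x, p1, p2, e1, e2):
    """Axis-aligned rectangle hypothesis: returns 1 (positive) iff
    p1 <= price <= p2 and e1 <= engine power <= e2."""
    price, engine_power = x
    return int(p1 <= price <= p2 and e1 <= engine_power <= e2)
```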
Hypothesis h and Empirical Error

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ classifies } \mathbf{x} \text{ as positive} \\ 0 & \text{if } h \text{ classifies } \mathbf{x} \text{ as negative} \end{cases}$$

Error of h on $\mathcal{X}$:

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\left(h(\mathbf{x}^t) \neq r^t\right)$$
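Continuing the sketch above (reusing the illustrative X, r, and rectangle_h), the empirical error simply counts disagreements between h and the labels:

```python
def empirical_error(h, X, r):
    """E(h|X): number of instances where h(x^t) != r^t."""
    return sum(int(h(x) != label) for x, label in zip(X, r))

# A candidate rectangle; the bounds here are hypothetical.
h = lambda x: rectangle_h(x, p1=20_000, p2=40_000, e1=80, e2=200)
print(empirical_error(h, X, r))  # 0 if the rectangle separates the data
```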
Model Selection & Generalization

• Learning is an ill-posed problem: the data alone are not sufficient to find a unique solution.
  – Limited number of sample data.
  – Some data might be noisy due to imprecision in recording or labeling, or due to hidden (latent, unobservable) attributes that affect the label of instances.
• The need for inductive bias: assumptions about the hypothesis class H.
  – Why a rectangle, not a circle or an irregular shape?
  – What degree of tightness of fit?
• Generalization: how well a model performs on new data.
Noise and Model Complexity

Noise: any anomaly in the data which makes it infeasible to reach a zero-error classification with a simple hypothesis class.

A simple model is preferred:
• Easy to use and check (lower time complexity)
• Easy to train (lower space complexity)
• Easy to explain (more interpretable)
• Easy to generalize (lower variance)
Probably Approximately Correct (PAC) Learning

• How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε?
• Each strip (between S and G, on each of the four sides of the rectangle) has probability mass at most ε/4.
• Pr that one instance misses a strip: ≤ 1 − ε/4.
• Pr that all N instances miss a strip: ≤ (1 − ε/4)^N.
• Pr that N instances miss any of the 4 strips: ≤ 4(1 − ε/4)^N.
• Require 4(1 − ε/4)^N ≤ δ. Using (1 − x) ≤ exp(−x), it suffices that 4 exp(−εN/4) ≤ δ, i.e., N ≥ (4/ε) log(4/δ).
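A small sketch that evaluates this bound; pac_sample_bound is an illustrative helper name:

```python
import math

def pac_sample_bound(epsilon: float, delta: float) -> int:
    """Smallest N with 4*exp(-epsilon*N/4) <= delta,
    i.e., N >= (4/epsilon) * ln(4/delta)."""
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

# e.g., error at most 0.1 with probability at least 0.95:
print(pac_sample_bound(0.1, 0.05))  # -> 176
```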
2-Class vs K-Class

$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}$$

$$r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \neq i \end{cases}$$

A K-class problem can be viewed as K 2-class problems. Train hypotheses $h_i(\mathbf{x})$, $i = 1, \dots, K$:

$$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \neq i \end{cases}$$
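A hedged sketch of this one-vs-rest scheme; train_binary stands in for any 2-class learner and is an assumption, not something given in the slides:

```python
def one_vs_rest(X, y, num_classes, train_binary):
    """Train K binary hypotheses h_i, where h_i treats class i as
    positive (label 1) and every other class as negative (label 0)."""
    hypotheses = []
    for i in range(num_classes):
        r_i = [1 if label == i else 0 for label in y]
        hypotheses.append(train_binary(X, r_i))
    return hypotheses
```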
Regression

• Examples:
  – Price of a used car
  – Speed of Top500
• x: car attributes; y: price
• y = g(x | θ), where g() is the model and θ its parameters.
• Linear regression: y = wx + w₀
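A minimal sketch of fitting w and w₀ by least squares with NumPy; the data points are made up for illustration:

```python
import numpy as np

# Illustrative data: x = some car attribute, y = price.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit(x, y, 1) returns [w, w0] for the line y = w*x + w0.
w, w0 = np.polyfit(x, y, 1)
print(w, w0)
```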
Basic Concepts

• Interpolation
  – Find a function that best fits a training set with no noise present: r = f(x).
• Extrapolation
  – Predict the output for an x that is NOT in the training set.
• Regression
  – A noise factor must be considered: r = f(x) + ε, OR there are hidden variables we cannot observe: r = f(x, z).
Regression

Given a training set

$$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}, \quad r^t \in \mathbb{R}, \quad r^t = f(x^t) + \varepsilon$$

find g() that minimizes the empirical error:

$$E(g \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - g(x^t) \right]^2$$

For a linear model $g(x) = w_1 x + w_0$:

$$E(w_1, w_0 \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - (w_1 x^t + w_0) \right]^2$$

A quadratic model adds one more parameter: $g(x) = w_2 x^2 + w_1 x + w_0$.
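A sketch of evaluating this empirical error for two candidate models; the data and the helper name squared_error are illustrative:

```python
import numpy as np

def squared_error(g, x, r):
    """E(g|X) = (1/N) * sum_t (r^t - g(x^t))^2."""
    return np.mean((r - g(x)) ** 2)

# Compare a line and a quadratic fitted to the same (made-up) data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
r = np.array([0.1, 0.9, 4.2, 8.8, 16.1])
line = np.poly1d(np.polyfit(x, r, 1))
quad = np.poly1d(np.polyfit(x, r, 2))
print(squared_error(line, x, r), squared_error(quad, x, r))
```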
Underfitting vs Overfitting

• Underfitting: the hypothesis class H is less complex than the actual model C.
  – E.g., using a line to fit data sampled from a 3rd-order polynomial.
  – More sample data increases accuracy, but may not be enough when the actual model is more complex than the hypothesis.
• Overfitting: H is more complex than C.
  – Having more training data helps, but only up to a certain point.
Triple Trade-Off

Trade-off between three factors:
1. Complexity of the hypothesis class H, c(H): the capacity of the hypothesis class.
2. Training set size, N.
3. Generalization error, E, on new examples.

• As N increases, E decreases.
• As c(H) increases, E first decreases and then increases. (The error of an over-complex hypothesis can be kept in check by increasing the amount of training data, but only up to a point.)
Cross-Validation

• To estimate generalization error, we need data unseen during training.
• Three types of data in cross-validation:
  – Training set (50%)
  – Validation set (25%)
  – Test (publication) set (25%)
• Resampling when data are scarce.
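A sketch of the 50/25/25 split, assuming X and r are NumPy arrays; the function name and shuffle seed are illustrative:

```python
import numpy as np

def split_50_25_25(X, r, seed=0):
    """Shuffle, then split into training (50%), validation (25%),
    and test (25%) portions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = len(X) // 2
    n_val = len(X) // 4
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], r[train]), (X[val], r[val]), (X[test], r[test])
```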
Dimensions of a Supervised Learner: Summary

1. Model: $g(\mathbf{x} \mid \theta)$
2. Loss function, L(), measuring the difference between the desired output and the approximation:
$$E(\theta \mid \mathcal{X}) = \sum_t L\left(r^t, g(\mathbf{x}^t \mid \theta)\right)$$
3. Optimization procedure (arg min returns the argument that minimizes):
$$\theta^* = \arg\min_\theta E(\theta \mid \mathcal{X})$$
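As a closing sketch, the three dimensions map directly onto code; the squared loss and the exhaustive grid search below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def g(x, theta):
    """Model: a line with theta = (w1, w0)."""
    w1, w0 = theta
    return w1 * x + w0

def loss(r, y):
    """Loss function L(): squared difference."""
    return (r - y) ** 2

def fit(x, r, candidates):
    """Optimization: theta* = argmin_theta sum_t L(r^t, g(x^t|theta)),
    here by exhaustive search over a candidate grid."""
    return min(candidates, key=lambda th: np.sum(loss(r, g(x, th))))

# Illustrative use: data roughly following r = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.0, 3.1, 4.9, 7.2])
grid = [(w1, w0) for w1 in np.linspace(0, 3, 31) for w0 in np.linspace(0, 2, 21)]
print(fit(x, r, grid))
```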