Introduction to Machine Learning - Aristotelis Tsirigos
Post on 04-Jul-2015
Introduction to Machine Learning
Aristotelis Tsirigos (email: tsirigos@cs.nyu.edu)
Dennis Shasha - Advanced Database Systems, NYU Computer Science
What is Machine Learning?
• Principles, methods and algorithms for predicting using past experience
• Not mere memorization, but capability of generalizing on novel situations based on past experience
• Where no theoretical model exists to explain the data, machine learning can be employed to offer such a model
Learning models
According to how active or passive the learner is:
• Statistical learning model
– No control over observations; they are presented at random in an independent, identically distributed (i.i.d.) fashion
• Online model
– An external source presents the observations to the learner one at a time
• Query model
– The learner actively queries an external “expert” source
Types of learning problems
A very rough categorization of learning problems:
• Unsupervised learning
– Clustering, density estimation, feature selection
• Supervised learning
– Classification, regression
• Reinforcement learning
– Feedback, games
Outline
• Learning methods
– Bayesian learning
– Nearest neighbor
– Decision trees
– Linear classifiers
– Ensemble methods (bagging & boosting)
• Testing the learner
• Learner evaluation
• Practical issues
• Resources
Bayesian learning - Introduction
• Given are:
– Observed data D = { d1, d2, …, dn }
– Hypothesis space H
• In the Bayesian setup we want to find the hypothesis that best fits the data in a probabilistic manner:
• In general, it is computationally intractable without any assumptions about the data
  h_max = argmax_{h ∈ H} P(h|D)
Bayesian learning - Elaboration
• First transformation using Bayes rule:

  P(h|D) = P(D|h) P(h) / P(D)

• Now we get:

  h_max = argmax_{h ∈ H} P(D|h) P(h)

• Notice that the optimal choice also depends on the a priori probability P(h) of hypothesis h
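The MAP rule above can be made concrete with a tiny numeric sketch; the hypotheses, prior values, and likelihood values below are invented for illustration.

```python
# Toy MAP-hypothesis selection; hypotheses, priors and likelihoods
# are invented for illustration.
priors = {"h1": 0.7, "h2": 0.3}        # P(h)
likelihoods = {"h1": 0.1, "h2": 0.4}   # P(D|h) for the observed data D

# P(D) normalizes the posterior but does not change the argmax.
evidence = sum(priors[h] * likelihoods[h] for h in priors)
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}

# h_max = argmax_h P(h|D): the prior pulls toward h1, the data toward h2.
h_map = max(posteriors, key=posteriors.get)
```

Note that the argmax can be computed from P(D|h)·P(h) alone; the evidence P(D) is needed only if the actual posterior values matter.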
Bayesian learning - Independence
• We can further simplify by assuming that the data in D are independently drawn from the underlying distribution:

  P(D|h) = ∏_{i=1..n} P(d_i|h)

• Under this assumption:

  h_max = argmax_{h ∈ H} P(h) ∏_{i=1..n} P(d_i|h)

• How do we estimate P(d_i|h) in practice?
Bayesian learning - Analysis
• For any hypothesis h in H we assume a distribution:
– P(d|h) for any point d in the input space
• If d is a point in an m-dimensional space, then the distribution is in fact:
– P(d^(1), d^(2), …, d^(m) | h)
• Problems:
– Complex optimization problem; suboptimal techniques can be used if the distributions are differentiable
– In general, there are too many parameters to estimate, therefore a lot of data is needed for reliable estimation
Bayesian learning - Analysis
• Need to further analyze the distribution P(d|h):
– We can assume features are independent (Naïve Bayes):
– Or, build Bayesian Networks where dependencies of the features are explicitly modeled
• Still we have to somehow learn the distributions
– Model as parametrized distributions (e.g. Gaussians)
– Estimate the parameters using standard greedy techniques (e.g. Expectation Maximization)
  P(d|h) = ∏_{k=1..m} P(d^(k)|h)
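As a sketch of how P(d^(k)|h) can be estimated in practice under the naïve independence assumption, here is a minimal Naive Bayes classifier over binary features; the toy data set and the Laplace smoothing constant are illustrative assumptions, not from the slides.

```python
# Minimal Naive Bayes over binary features: estimate P(y) and P(x_k|y)
# from counts (with Laplace smoothing), then predict with
# argmax_y P(y) * prod_k P(x_k|y). Data and smoothing are illustrative.
def train_nb(X, y, n_features, alpha=1.0):
    classes = sorted(set(y))
    prior = {c: sum(1 for t in y if t == c) / len(y) for c in classes}
    cond = {}  # cond[c][k] = estimated P(x_k = 1 | y = c)
    for c in classes:
        rows = [x for x, t in zip(X, y) if t == c]
        cond[c] = [(sum(r[k] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                   for k in range(n_features)]
    return prior, cond

def predict_nb(x, prior, cond):
    def score(c):
        p = prior[c]
        for k, v in enumerate(x):
            p *= cond[c][k] if v == 1 else 1.0 - cond[c][k]
        return p
    return max(prior, key=score)

X = [(1, 1), (1, 0), (0, 1), (0, 0)]
y = ["pos", "pos", "neg", "neg"]
prior, cond = train_nb(X, y, n_features=2)
```

Smoothing keeps an unseen feature value from zeroing out an entire product of probabilities.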
Bayesian learning - Summary
• Makes use of prior knowledge of:
– The likelihood of alternative hypotheses, and
– The probability of observing data given a specific hypothesis
• The goal is to determine the most probable hypothesis given a series of observations
• The Naïve Bayes method has been found useful in practical applications (e.g. text classification)
• If the naïve assumption is not appropriate, there is a generic algorithm (EM) that can be used to find a hypothesis that is locally optimal
Nearest Neighbor - Introduction
• Belongs to the class of instance-based learners:
– The learner does not make any global prediction of the target function; it only predicts locally for a given point (lazy learner)
• Idea:
– Given a query instance x, look at past observations D that are “close” to x in order to determine x’s class y
• Issues:
– How do we define distance?
– How do we define the notion of “neighborhood”?
Nearest Neighbor - Details
• Classify new instance x according to its neighborhood N(x)
• Neighborhood can be defined in different ways:
– Constant radius
– k nearest neighbors
• Weights are a function of distance: w_i = w(d(x, x_i))
• Classification rule:

  f(x) = sign( ∑_{i: x_i ∈ N(x)} w_i y_i )

  where y_i is the label for point x_i and w_i is the weight for x_i
Nearest Neighbor - Summary
• Classify new instances according to their closest points
• Control accuracy in three ways:
– Distance metric
– Definition of neighborhood
– Weight assignment
• These parameters must be tuned depending on the problem:
– Is there noise in the data?
– Outliers?
– What is a “natural” distance for the data?
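A minimal sketch of the weighted-vote rule, assuming Euclidean distance, a k-nearest neighborhood, and inverse-distance weights; the toy points and the choice of k are illustrative, not prescribed by the slides.

```python
# Distance-weighted k-nearest-neighbor vote: f(x) = sign(sum_i w_i * y_i)
# over the neighborhood, with w_i = 1 / (distance + eps). Euclidean
# distance, k, and the toy points are illustrative choices.
import math

def knn_predict(x, points, labels, k=3, eps=1e-9):
    dists = [math.dist(x, p) for p in points]   # distance to each training point
    nearest = sorted(range(len(points)), key=lambda i: dists[i])[:k]
    vote = sum(labels[i] / (dists[i] + eps) for i in nearest)
    return 1 if vote >= 0 else -1

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [-1, -1, 1, 1]
```

Inverse-distance weights let nearby points dominate the vote, which softens the influence of a far-away neighbor that only enters because k is fixed.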
Decision Trees - Introduction
• Suppose data is categorical
• Observation:
– Distance cannot be defined in a natural way
– Need a learner that operates directly on the attribute values
Economy | Popularity | Gas prices | War casualties | Elected
--------|------------|------------|----------------|--------
BAD     | HIGH       | LOW        | NO             | YES
BAD     | HIGH       | LOW        | YES            | NO
BAD     | LOW        | HIGH       | NO             | NO
BAD     | LOW        | LOW        | NO             | NO
GOOD    | LOW        | LOW        | YES            | YES
GOOD    | LOW        | HIGH       | NO             | NO
GOOD    | HIGH       | LOW        | NO             | YES
GOOD    | HIGH       | LOW        | YES            | YES
GOOD    | HIGH       | HIGH       | NO             | YES
GOOD    | HIGH       | HIGH       | YES            | YES
Decision Trees - The model
• Idea: a decision tree that “explains” the data
• Observation:
– In general, there is no unique tree to represent the data
– In some nodes the decision is not strongly supported by the data
[Figure: a decision tree over the table above; leaf counts show how many training examples support each decision]

Economy?
  GOOD → Popularity?
    HIGH → Elected = YES (YES=4, NO=0)
    LOW → Gas prices?
      HIGH → Elected = NO (YES=0, NO=1)
      LOW → Elected = YES (YES=1, NO=0)
  BAD → Popularity?
    HIGH → War casualties?
      YES → Elected = NO (YES=0, NO=1)
      NO → Elected = YES (YES=1, NO=0)
    LOW → Elected = NO (YES=0, NO=2)
Decision Trees - Training
• Build the tree from top to bottom choosing one attribute at a time
• How do we make the choice? Idea:
– Choose the most “informative” attribute first
– Having no other information, which attribute allows us to classify correctly most of the time?
• This can be quantified using the Information Gain metric:
– Based on Entropy = Randomness
– Measures the reduction in uncertainty about the target value given the value of one of the attributes, therefore it tells us how informative that attribute is
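The entropy and information-gain computations can be sketched as follows; the tiny data set echoes the Economy/Elected example above, but the specific rows here are illustrative.

```python
# Entropy and information gain: gain(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v).
# The toy attribute and labels are illustrative.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # rows: list of dicts mapping attribute name -> value
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(lab)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in by_value.values())
    return entropy(labels) - remainder

rows = [{"Economy": "GOOD"}, {"Economy": "GOOD"},
        {"Economy": "BAD"}, {"Economy": "BAD"}]
labels = ["YES", "YES", "NO", "NO"]
gain = info_gain(rows, labels, "Economy")  # attribute fully determines the label
```

Here the attribute removes all uncertainty, so the gain equals the full entropy of the labels (1 bit); an uninformative attribute would have gain close to 0.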
Decision Trees - Overfitting
• Problems:
– Solution is greedy, therefore suboptimal
– Optimal solution infeasible due to time constraints
– Is optimal really optimal? What if observations are corrupted by noise?
• We are really interested in the true error, not the training error
– Occam’s razor: prefer simpler solutions
– Apply pruning to eliminate nodes with low statistical support
Decision Trees - Pruning
• Get rid of nodes with low support
[Figure: the decision tree from the previous slide, with low-support subtrees (leaves covering only one or two examples) pruned away]
• The pruned tree does not fully explain the data, but we hope that it will generalize better on unseen instances…
Decision Trees - Summary
• Advantages:
– Handles categorical data
– Easy to interpret in a simple rule format
• Disadvantages:
– Hard to accommodate numerical data
– Suboptimal solution
– Bias towards simple trees
Linear Classifiers - Introduction
• Decision function: f(x) = <w · x> + b
• Predicted label: y = sign(f(x))
• There is an infinite number of hyperplanes f(x)=0 that can separate positive from negative examples!
• Is there an optimal one to choose?
Linear Classifiers - Margins
• Make sure the hyperplane leaves enough room for future points!
• Margins:
– For a training point x_i define its margin:

  γ_i = y_i f(x_i) / ||w||

– For classifier f it is the worst of all margins:

  M(f) = min_i { γ_i }
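The margin definitions can be computed directly; the hyperplane weights, bias, and points below are assumed toy values used only to illustrate the formulas.

```python
# Per-point margins gamma_i = y_i * f(x_i) / ||w|| and classifier margin
# M(f) = min_i gamma_i, for an assumed toy hyperplane f(x) = <w, x> + b.
import math

w, b = (1.0, 1.0), -1.0
norm_w = math.hypot(*w)  # ||w||

def f(x):
    return w[0] * x[0] + w[1] * x[1] + b

points = [(2.0, 2.0), (0.0, 0.0), (3.0, 1.0)]
labels = [1, -1, 1]

gammas = [y * f(x) / norm_w for x, y in zip(points, labels)]
margin = min(gammas)  # the worst (smallest) margin over the training points
```

A positive γ_i means point x_i lies on the correct side of the hyperplane; the classifier margin is set by the point closest to it.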
Linear Classifiers - Optimization
• Now, the only thing we have to do is find the f with the maximum possible margin!
– This is a quadratic optimization problem:
  minimize_{w,b}  ||w||² = <w · w>
  subject to  y_i f(x_i, w, b) ≥ 1,  i = 1, 2, …, n
• This classifier is known as Support Vector Machines
– The optimal w* will yield the maximum margin γ*:

  γ* = ||w*||^(-1)

– Finally, the optimal hyperplane can be written as:

  f(x, w*, b*) = ∑_{i ∈ SV} α_i* y_i <x_i · x> + b*
Linear Classifiers - Problems
• What if data is noisy or, worse, linearly inseparable?
• Solution 1:
– Allow for outliers in the data when data is noisy
• Solution 2:– Increase dimensionality by creating composite features if the
target function is nonlinear
• Solution 3:– Do both 1 and 2
Linear Classifiers - Outliers
• Impose softer restrictions on the margin distribution to accept outliers
• Now our classifier is more flexible and more powerful, but there are more parameters to estimate
Linear Classifiers - Nonlinearity (!)
• Combine input features to form more complex ones:
– Initial space: x = (x_1, x_2, …, x_m)
– Induced space: Φ(x) = (x_1², …, x_m², √2·x_1x_2, √2·x_1x_3, …, √2·x_{m-1}x_m)
– The inner product can now be written as <Φ(x)·Φ(y)> = <x·y>²
• Kernels:– The above product is denoted K(x,y)= <Φ(x)·Φ(y)> and it is
called a kernel– Kernels induce nonlinear feature spaces based on the initial
feature space – There is a huge collection of kernels, for vectors, trees, strings,
graphs, time series, …
• Linear separation in the composite feature space implies a nonlinear separation in the initial space!
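A quick sanity check of the kernel identity above: the explicit quadratic feature map (squared coordinates plus √2-scaled cross terms) yields the same inner product as computing K(x, y) = <x·y>² directly, without ever forming Φ.

```python
# Kernel sanity check: for K(x, y) = <x, y>^2, the explicit quadratic map
# Phi(x) = (x_i^2 ..., sqrt(2) * x_i * x_j ...) gives the same inner product,
# so the kernel computes it without materializing the induced space.
import math
from itertools import combinations

def phi(x):
    squares = [v * v for v in x]
    crosses = [math.sqrt(2) * x[i] * x[j]
               for i, j in combinations(range(len(x)), 2)]
    return squares + crosses

def kernel(x, y):
    return sum(a * b for a, b in zip(x, y)) ** 2

x, y = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
```

The induced space here has m + m(m-1)/2 dimensions, yet the kernel evaluates the same product in O(m) time; that cost gap is the point of the kernel trick.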
Linear Classifiers - Summary
• Provides a generic solution to the learning problem
• We just have to solve an easy optimization problem
• Parametrized by the induced feature space and the noise parameters
• There exist theoretical bounds on their performance
Ensembles - Introduction
• Motivation:
– Finding just one classifier is “too risky”
• Idea:
– Combine a group of classifiers into the final learner
• Intuition:
– Each classifier is associated with some risk of wrong predictions on future data
– Instead of investing in just one risky classifier, we can distribute the decision across many classifiers, thus effectively reducing the overall risk
Ensembles - Bagging
• Main idea:
– From observations D, T subsets D1, …, DT are drawn at random
– For each Di train a “base” classifier fi (e.g. decision tree)
– Finally, combine the T classifiers into one classifier f by taking a majority vote:
  f(x) = sign( ∑_{i=1..T} f_i(x) )
• Observations:
– Need enough observations to get partitions that approximately respect the i.i.d. condition (|D| >> T)
– How do we decide on the base classifier?
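The bagging procedure above can be sketched as follows; the 1-D threshold "stump" stands in for a real base classifier such as a decision tree, and the subset size, T, and toy data are illustrative choices.

```python
# Bagging sketch: T base classifiers are trained on random subsets of the
# data and combined by the majority vote f(x) = sign(sum_i f_i(x)).
# The 1-D threshold "stump" stands in for a real base learner like a tree.
import random

def train_stump(subset):
    # subset: list of (x, y); threshold at the subset's mean x value
    thr = sum(x for x, _ in subset) / len(subset)
    return lambda x: 1 if x >= thr else -1

def bagging(data, T=11, seed=0):
    rng = random.Random(seed)
    base = [train_stump(rng.sample(data, k=len(data) // 2)) for _ in range(T)]
    def f(x):
        vote = sum(fi(x) for fi in base)
        return 1 if vote >= 0 else -1
    return f

data = [(x, -1) for x in range(5)] + [(x, 1) for x in range(5, 10)]
f = bagging(data)
```

Each stump sees a different random subset, so their thresholds differ; the majority vote averages that variability away.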
Ensembles - Boosting
• Main idea:
– Run through a number of iterations
– At each iteration t, a “weak” classifier ft is trained on a weighted version of the training data (initially weights are equal)
– Each point’s weight is updated so that examples with poor margin with respect to ft are assigned a higher weight in an attempt to “boost” them in the next iterations
– The classifier itself is assigned a weight αt according to its training error
• Combine all classifiers into a weighted majority vote:
  f(x) = sign( ∑_t α_t f_t(x) )
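A minimal AdaBoost-style sketch of the steps above, assuming 1-D threshold stumps as the weak classifiers; the data set, number of rounds, and the specific α_t formula (α_t = ½·ln((1-err)/err), the standard AdaBoost choice) are illustrative assumptions.

```python
# AdaBoost-style sketch: each round picks the stump minimizing weighted
# error, weights it by alpha_t = 0.5 * ln((1 - err) / err), and boosts
# the weights of misclassified points for the next round.
import math

def stump_predict(x, thr, sign):
    return sign if x >= thr else -sign

def best_stump(xs, ys, w):
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(xi, thr, sign) != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, T=5):
    n = len(xs)
    w = [1.0 / n] * n
    model = []
    for _ in range(T):
        err, thr, sign = best_stump(xs, ys, w)
        err = max(err, 1e-12)  # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, sign))
        # increase weight of misclassified points, decrease the rest
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, thr, sign))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    def f(x):
        score = sum(a * stump_predict(x, thr, sg) for a, thr, sg in model)
        return 1 if score >= 0 else -1
    return f

xs = [0, 1, 2, 3, 6, 7, 8, 9]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
f = adaboost(xs, ys)
```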
Bagging vs. Boosting
• Two distinct ways to apply the diversification idea
              | Bagging                   | Boosting
--------------|---------------------------|-------------------------
Effect        | Risk minimization         | Margin maximization
Base learner  | Complex                   | Simple
Training data | Partition before training | Adaptive data weighting
Testing the learner
• How do we estimate the learner’s performance?
• Create test sets from the original observations:
– Test set:
• Partition into training and test sets and use the error on the test set as an estimate of the true error
– Leave-one-out:
• Remove one point, train on the rest, then report the error on this point
• Do it for all points and report the mean error
– k-fold Cross Validation:
• Randomly partition the data set into k non-overlapping sets
• Choose one set at a time for testing and train on the rest
• Report the mean error
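The k-fold procedure can be sketched as follows; the majority-label "learner" is a stand-in used only to keep the example self-contained, and the data set is illustrative.

```python
# k-fold cross-validation sketch: shuffle indices, split into k
# non-overlapping folds, hold each fold out once and average the test error.
import random

def kfold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]  # k disjoint folds covering all points

def cross_validate(data, k, train_and_test):
    folds = kfold_indices(len(data), k)
    errors = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(train_and_test(train, test))
    return sum(errors) / k

def majority_learner(train, test):
    labels = [y for _, y in train]
    pred = max(set(labels), key=labels.count)  # predict the majority label
    return sum(1 for _, y in test if y != pred) / len(test)

data = [(i, 1 if i % 4 else -1) for i in range(20)]
mean_err = cross_validate(data, k=5, train_and_test=majority_learner)
```

Every point is used for testing exactly once, so the mean error uses all the data without ever testing on a point that was trained on.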
Learner evaluation - PAC learning
• Probably Approximately Correct (PAC) learning of a target class C using hypothesis space H:
– If for all target functions f in C and for all 0 < ε, δ < 1/2, with probability at least (1-δ) we can learn a hypothesis h in H that approximates f with an error of at most ε
• Generalization (or true) error:– We really care about the error in unseen data
• Statistical learning theory gives us the tools to express the true error (and its confidence) in terms of:
– The empirical (or training) error
– The confidence 1-δ of the true error
– The number of training examples
– The complexity of the classes C and/or H
Learner evaluation - VC dimension
• The VC dimension of a hypothesis space H measures its power to interpret the observations
• Infinite hypothesis space size does not necessarily imply infinite VC dimension!
• Bad news:
– If we allow a hypothesis space with infinite VC dimension, learning is impossible (it requires an infinite number of observations)
• Good news:
– For the class of large margin linear classifiers the following error bound can be proven:
  error(f) ≤ (2/n) · ( (64R²/γ²) · log(enγ/(8R²)) · log(32n/γ²) + log(4/δ) )
Practical issues
• Machine learning is driven by data, so a good learner must be data-dependent in all aspects:
– Hypothesis space
– Prior knowledge
– Feature selection and composition
– Distance/similarity measures
– Outliers/Noise
• Never forget that learners and training algorithms must be efficient in time and space with respect to:
– The feature space dimensionality
– The training set size
– The hypothesis space size
Conclusions
• Machine learning is mostly art and a little bit of science!
• For each problem at hand a different classifier will be the optimal one
• This simply means that the solution must be data-dependent:
– Select an “appropriate” family of classifiers (e.g. Decision Trees)
– Choose the right representation for the data in the feature space
– Tune the available parameters of your favorite classifier to reflect the “nature” of the data
• Many practical applications, especially when there is no good theory available to model the data
Resources
• Books
– T. Mitchell, Machine Learning
– N. Cristianini & J. Shawe-Taylor, An Introduction to Support Vector Machines
– V. Kecman, Learning and Soft Computing
– R. Duda, P. Hart & D. Stork, Pattern Classification
• Online tutorials
– A. Moore, http://www-2.cs.cmu.edu/~awm/tutorials/
• Software
– WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
– SVMlight: http://svmlight.joachims.org/