Introduction to Machine Learning
course 67577 fall 2007
Lecturer: Amnon Shashua
Teaching Assistant: Yevgeny Seldin
School of Computer Science and Engineering
Hebrew University
What is Machine Learning?
• An inference engine (computer program) that, when given sufficient data (examples), computes a function that matches as closely as possible the process generating the data.
• Make accurate predictions based on observed data
• Algorithms to optimize a performance criterion based on observed data
• Learning to do better in the future based on what was experienced in the past
• Programming by examples: instead of writing a program to solve a task directly, machine learning seeks methods by which the computer will come up with its own program based on training examples.
Why Machine Learning?
• Data-driven algorithms are able to examine large amounts of data. A human expert, on the other hand, is likely to be guided by subjective impressions or by examining a relatively small number of examples.
• Humans often have trouble expressing what they know but have no difficulty in labeling data
• Machine learning is effective in domains where declarative (rule based) knowledge is difficult to obtain yet generating training data is easy
Typical Examples
• Visual recognition (say, detect faces in an image): the amount of variability in appearance introduces challenges that are beyond the capacity of direct programming
• Spam filtering: data-driven programming can adapt to changing tactics by spammers
• Extract topics from documents: categorize news articles according to whether they are about politics, sports, science, etc.
• Natural language understanding: from spoken words to text; categorize the meaning of spoken sentences
• Optical character recognition (OCR)
• Medical diagnosis: from symptoms to diagnosis
• Credit card transaction fraud detection
• Wealth prediction
Fundamental Issues
• Over-fitting: doing well on a training set does not guarantee accuracy on new examples (a toy sketch follows this list)
• What is the resource we wish to optimize? For a given accuracy, use the smallest size training set
• Examples are drawn from some (fixed) distribution D over X x Y (instance space x output space). Does the learner actually need to recover D during the learning process?
• How does the learning process depend on the complexity of the family of learning functions (concept class C)? How does one define complexity of C?
• When the goal is to learn the joint distribution D then the problem is computationally unwieldy because the joint distribution table is exponentially large. What assumptions can be made to simplify the task?
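A minimal sketch of the over-fitting issue, assuming numpy is available (the setup is my illustration, not the lecture's): a degree-9 polynomial interpolates 10 noisy training points almost perfectly yet fails on fresh examples drawn from the same distribution.

```python
# Over-fitting demo: near-zero training error, large test error.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # "true" process generating the data

x_train = rng.uniform(0, 1, 10)
y_train = f(x_train) + rng.normal(0, 0.1, 10)  # noisy training labels

coeffs = np.polyfit(x_train, y_train, deg=9)   # interpolates the training set

x_test = rng.uniform(0, 1, 1000)               # fresh examples from the same D
y_test = f(x_test) + rng.normal(0, 0.1, 1000)

train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"train MSE: {train_err:.2e}, test MSE: {test_err:.2e}")
```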
Supervised vs. Un-supervised

Supervised Learning Models: $f : X \to Y$, where X is the instance (data) space and Y is the output space.

• $Y = \{1, \dots, k\}$: Multiclass classification. k = 2 is normally of most interest (a minimal classifier sketch follows these models).
• $Y = \mathbb{R}$: Regression. Predict the price of a used car given brand, year, mileage..; kinematics of a robot arm; navigate by determining the steering angle from image input..
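A minimal supervised-learning sketch (my illustration, not from the lecture): a 1-nearest-neighbor classifier realizes a function $f : X \to Y$ directly from the training set, with $X = \mathbb{R}^n$ and $Y = \{1, \dots, k\}$.

```python
# 1-nearest-neighbor: predict the label of the closest training instance.
import numpy as np

def nearest_neighbor(Z, x):
    """Z is a list of (instance, label) pairs; x is a query instance."""
    xs = np.array([xi for xi, _ in Z])
    ys = [yi for _, yi in Z]
    i = np.argmin(np.linalg.norm(xs - x, axis=1))  # index of nearest instance
    return ys[i]

Z = [(np.array([0.0, 0.0]), 1), (np.array([1.0, 1.0]), 2)]  # toy training set
print(nearest_neighbor(Z, np.array([0.9, 0.8])))            # -> 2
```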
Un-supervised Learning Models: find regularities in the input data, assuming there is some structure in the input space.

• Density estimation
• Clustering (non-parametric density estimation): divide customers into groups which have similar attributes (a minimal clustering sketch follows this list)
• Latent class models: extract topics from documents
• Compression: represent the input space with fewer parameters; projection onto lower-dimensional spaces
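A minimal clustering sketch (my illustration; plain k-means rather than the spectral methods covered later in the course): it divides instances into K groups with similar attributes.

```python
# k-means: alternate between assigning points to centers and recomputing centers.
import numpy as np

def kmeans(S, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), K, replace=False)]  # initial centers
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(S[:, None] - centers[None], axis=2)
        labels = np.argmin(d, axis=1)
        # move each center to the mean of its points (assumes no cluster empties)
        centers = np.array([S[labels == k].mean(axis=0) for k in range(K)])
    return labels, centers

S = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(S, K=2)
```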
Notations
X is the instance space: the space from which observations are drawn. Examples: $X = \{0,1\}^n$, $X = \mathbb{R}^n$, $X = \Sigma^*$.

$x \in X$ is an input instance, a single observation. Examples: $x = (0,1,1,1,0,0)$, $x = (0.5, 2.3, 0, 7.2)$, $x = \text{"text..."}$.

Y is the output space: the set of possible outcomes that can be associated with a measurement. Examples: $Y = \{-1,1\}$, $Y = \mathbb{R}$, $Y = \{1, \dots, k\}$.

An example is an instance-label pair (x, y). If $|Y| = 2$ one typically uses $\{0,1\}$ or $\{-1,1\}$. We say that an example (x, y) is positive if y = 1 and otherwise we call it a negative example.

A training set Z consists of m instance-label pairs: $Z = (x_1, y_1), \dots, (x_m, y_m)$. In some cases we refer to the training set without labels: $S = x_1, \dots, x_m$.
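One way the notation might map to code (my illustration): a training set Z is a list of instance-label pairs, and S is the same set with the labels dropped.

```python
x = [0, 1, 1, 1, 0, 0]            # an input instance from X = {0,1}^n
Z = [([0, 1], 1), ([1, 0], -1)]   # Z = (x_1,y_1),...,(x_m,y_m) with Y = {-1,1}
S = [xi for xi, yi in Z]          # S = x_1,...,x_m (unlabeled)
```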
Notations

A concept (hypothesis) class C is a set (not necessarily finite) of functions of the form:
$$C = \{h \mid h : X \to Y\}$$
Each $h \in C$ is called a concept or hypothesis or classifier. Example: if $X = \{0,1\}^n$, $Y = \{0,1\}$, then C might be $C = \{h_i \mid h_i(x) = x_i\}$.

Other examples:

Separating hyperplanes: $X = \mathbb{R}^n$, and a concept h(x) is specified by a vector $w \in \mathbb{R}^n$ and a scalar b such that:
$$h(x) = \begin{cases} 1 & w^\top x \ge b \\ -1 & \text{otherwise} \end{cases}$$

Conjunction learning: a conjunction is a special case of a Boolean formula. A literal is a variable or its negation, and a term is a conjunction of literals, e.g. $(x_1 \wedge \bar{x}_2 \wedge x_3)$. A target function is a term which consists of a subset of the literals. In this case $X = \{0,1\}^n$, $Y = \{true, false\}$, and $|X| = 2^n$, $|C| = 3^n$.

Decision trees: when $X = \{0,1\}^n$ then any Boolean function can be described by a binary tree. Thus, C consists of decision trees ($|C| = 2^{2^n}$, the number of Boolean functions over $\{0,1\}^n$).
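Minimal sketches of the two concept classes above (my illustrations of the definitions, not code from the course):

```python
import numpy as np

def hyperplane_concept(w, b):
    """Separating hyperplane: h(x) = 1 if w^T x >= b, else -1."""
    return lambda x: 1 if np.dot(w, x) >= b else -1

def conjunction_concept(literals):
    """Conjunction over {0,1}^n: literals is a list of (index, negated) pairs,
    e.g. [(0, False), (1, True)] encodes x1 AND NOT x2 (0-indexed)."""
    return lambda x: all((x[i] == 0) if neg else (x[i] == 1)
                         for i, neg in literals)

h = hyperplane_concept(np.array([1.0, -1.0]), b=0.0)
print(h(np.array([2.0, 1.0])))                  # -> 1
c = conjunction_concept([(0, False), (1, True)])
print(c([1, 0, 1]))                             # -> True
```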
The Formal Learning Model: Probably Approximately Correct (PAC)
• Distribution invariant: Learner does not need to estimate the joint distribution D over X x Y. Assumptions are that examples arrive i.i.d. and that D exists and is fixed.
• The training sample complexity (size of the training set Z) depends only on the desired accuracy and confidence parameters; it does not depend on D.
• Not all concept classes C are PAC-learnable, but some interesting classes are.
Unrealizable case: when $c_t \notin C$; the training set is $Z = \{(x_i, y_i)\}_{i=1}^m$ and D is over $X \times Y$.

Realizable case: when a target concept $c_t(x) \in C$ is known to lie inside C. In this case the training set is $Z = \{(x_i, c_t(x_i))\}_{i=1}^m$, where $S = \{x_1, \dots, x_m\}$ is sampled randomly and independently (i.i.d.) according to some (unknown) distribution D, i.e., S is distributed according to the product distribution $D^m = D \times \dots \times D$.

Given a concept function $h \in C$,
$$err_D(h) = prob_D[x : c_t(x) \ne h(x)] = \int_{x \in X} ind(c_t(x) \ne h(x))\, D(x)\, dx$$
where
$$ind(\text{statement}) = \begin{cases} 1 & \text{statement is } true \\ 0 & \text{statement is } false \end{cases}$$
$err(h)$ is the probability that an instance x sampled according to D will be labeled incorrectly by h(x).
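A Monte-Carlo sketch of $err(h)$ (my illustration; the toy D, $c_t$, and h are assumptions): sample instances from D and count how often h disagrees with the target concept.

```python
import numpy as np

def estimate_err(h, c_t, sample_from_D, m=100_000):
    """Empirical estimate of err(h) from m i.i.d. samples of D."""
    xs = [sample_from_D() for _ in range(m)]
    return np.mean([h(x) != c_t(x) for x in xs])  # fraction mislabeled by h

# Toy setup: D uniform on [0,1], c_t thresholds at 0.5, h thresholds at 0.6.
rng = np.random.default_rng(0)
c_t = lambda x: x >= 0.5
h = lambda x: x >= 0.6
print(estimate_err(h, c_t, lambda: rng.uniform(0, 1)))  # ~0.1
```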
PAC Model Definitions
$\epsilon > 0$, given to the learner, specifies the desired accuracy, i.e. $err(h) \le opt(C) + \epsilon$, where
$$opt(C) = \min_{h \in C} err(h)$$
Note: in the realizable case $opt(C) = 0$ because $err(c_t) = 0$.

$0 < \delta < 1$, given to the learner, specifies the desired confidence, i.e.
$$prob\left[err(h) \le opt(C) + \epsilon\right] \ge 1 - \delta$$
The learner is allowed to deviate occasionally from the desired accuracy, but only rarely so..
PAC Model Definitions
We will say that an algorithm L learns C if for every $\epsilon, \delta$ and for every D over $X \times Y$, L generates a concept function $h \in C$ such that the probability that $err(h) \le opt(C) + \epsilon$ is at least $1 - \delta$.
Formal Definition of PAC Learning

A learning algorithm L is a function:
$$L : \bigcup_{m \ge 1} \{(x_i, y_i)\}_{i=1}^m \to C$$
from the set of all training examples to C with the following property: given any $\epsilon, \delta \in (0,1)$ there is an integer $m_0(\epsilon, \delta)$ such that if $m \ge m_0$ then, for any probability distribution D on $X \times Y$, if Z is a training set of length m drawn randomly according to $D^m$, then with probability of at least $1 - \delta$ the hypothesis $h = L(Z) \in C$ is such that $err(h) \le opt(C) + \epsilon$.

We say that C is learnable (or PAC-learnable) if there is a learning algorithm for C.
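Since conjunctions were given above as a concept class, here is a sketch of a learner L for that class (the classic elimination algorithm, shown as my illustration; the lecture only names the class): start with all 2n literals and delete every literal falsified by a positive example. In the realizable case the output is consistent with Z.

```python
def learn_conjunction(Z, n):
    """Z: list of (x, y) with x in {0,1}^n, y in {True, False}.
    A literal (i, neg) means x_i if neg is False, NOT x_i if neg is True."""
    literals = {(i, neg) for i in range(n) for neg in (False, True)}
    for x, y in Z:
        if y:  # positive example: keep only the literals it satisfies
            literals = {(i, neg) for i, neg in literals
                        if (x[i] == 0) == neg}
    def h(x):
        return all((x[i] == 0) == neg for i, neg in literals)
    return h

Z = [([1, 0, 1], True), ([1, 1, 1], False), ([1, 0, 0], True)]
h = learn_conjunction(Z, n=3)          # learns x1 AND NOT x2
print(h([1, 0, 0]), h([0, 0, 1]))      # -> True False
```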
Formal Definition of PAC Learning
Notes:

• $m_0(\epsilon, \delta)$ does not depend on D, i.e., the PAC model is distribution invariant.
• The class C determines the sample complexity: for "simple" classes $m_0(\epsilon, \delta)$ would be small compared to more "complex" classes. For example, for a finite concept class
$$m_0 = \frac{1}{\epsilon}\left(\ln|C| + \ln\frac{1}{\delta}\right),$$
while for an infinite class
$$m_0 = \frac{1}{\epsilon}\left(vcd(C)\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right),$$
where $vcd(C)$ is the VC dimension of C (a worked example follows below).
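A worked example of the finite-class bound (my arithmetic, using the conjunction class from earlier, where $|C| = 3^n$):

```python
# Sample complexity for learning conjunctions over n = 10 variables.
import math

eps, delta, n = 0.1, 0.05, 10
m0 = (1 / eps) * (math.log(3 ** n) + math.log(1 / delta))
print(math.ceil(m0))  # -> 140 examples suffice
```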
Course Syllabus

3 x PAC:
2 x Separating Hyperplanes:
Support Vector Machine, Kernels, Linear Discriminant Analysis
3 x Unsupervised Learning:
Dimensionality Reduction (PCA), Density Estimation, Non-parametric Clustering (spectral methods)
5 x Statistical Inference:
Maximum Likelihood, Conditional Independence, Latent Class Models, Expectation-Maximization Algorithm, Graphical Models