Introduction to Machine Learning
course 67577 fall 2007
Lecturer: Amnon Shashua
Teaching Assistant: Yevgeny Seldin
School of Computer Science and Engineering
Hebrew University
What is Machine Learning?
• An inference engine (computer program) that, when given sufficient data (examples), computes a function that matches as closely as possible the process generating the data.
• Make accurate predictions based on observed data
• Algorithms to optimize a performance criterion based on observed data
• Learning to do better in the future based on what was experienced in the past
• Programming by examples: instead of writing a program to solve a task directly, machine learning seeks methods by which the computer will come up with its own program based on training examples.
Why Machine Learning?
• Data-driven algorithms are able to examine large amounts of data. A human expert, on the other hand, is likely to be guided by subjective impressions or by examining a relatively small number of examples.
• Humans often have trouble expressing what they know but have no difficulty in labeling data
• Machine learning is effective in domains where declarative (rule based) knowledge is difficult to obtain yet generating training data is easy
Typical Examples
• Visual recognition (say, detect faces in an image): the amount of variability in appearance introduces challenges that are beyond the capacity of direct programming
• Spam filtering: data-driven programming can adapt to changing tactics by spammers
• Extract topics from documents: categorize news articles according to whether they are about politics, sports, science, etc.
• Natural language understanding: from spoken words to text; categorize the meaning of spoken sentences
• Optical character recognition (OCR)
• Medical diagnosis: from symptoms to diagnosis
• Credit card transaction fraud detection
• Wealth prediction
Fundamental Issues
• Over-fitting: doing well on a training set does not guarantee accuracy on new examples (a toy sketch follows this list)
• What is the resource we wish to optimize? For a given accuracy, use the smallest size training set
• Examples are drawn from some (fixed) distribution D over X x Y (instance space x output space). Does the learner actually need to recover D during the learning process?
• How does the learning process depend on the complexity of the family of learning functions (concept class C)? How does one define complexity of C?
• When the goal is to learn the joint distribution D then the problem is computationally unwieldy because the joint distribution table is exponentially large. What assumptions can be made to simplify the task?
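A minimal sketch of the over-fitting issue, assuming numpy is available (the setup is my illustration, not the lecture's): a degree-9 polynomial interpolates 10 noisy training points almost perfectly yet fails on fresh examples drawn from the same distribution.

```python
# Over-fitting demo: near-zero training error, large test error.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # "true" process generating the data

x_train = rng.uniform(0, 1, 10)
y_train = f(x_train) + rng.normal(0, 0.1, 10)  # noisy training labels

coeffs = np.polyfit(x_train, y_train, deg=9)   # interpolates the training set

x_test = rng.uniform(0, 1, 1000)               # fresh examples from the same D
y_test = f(x_test) + rng.normal(0, 0.1, 1000)

train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"train MSE: {train_err:.2e}, test MSE: {test_err:.2e}")
```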
Supervised vs. Un-supervised

Supervised Learning Models: $f : X \to Y$, where X is the instance (data) space and Y is the output space.

• $Y = \{1, \dots, k\}$: Multiclass classification. k = 2 is normally of most interest (a minimal classifier sketch follows these models).
• $Y = \mathbb{R}$: Regression. Predict the price of a used car given brand, year, mileage..; kinematics of a robot arm; navigate by determining the steering angle from image input..
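A minimal supervised-learning sketch (my illustration, not from the lecture): a 1-nearest-neighbor classifier realizes a function $f : X \to Y$ directly from the training set, with $X = \mathbb{R}^n$ and $Y = \{1, \dots, k\}$.

```python
# 1-nearest-neighbor: predict the label of the closest training instance.
import numpy as np

def nearest_neighbor(Z, x):
    """Z is a list of (instance, label) pairs; x is a query instance."""
    xs = np.array([xi for xi, _ in Z])
    ys = [yi for _, yi in Z]
    i = np.argmin(np.linalg.norm(xs - x, axis=1))  # index of nearest instance
    return ys[i]

Z = [(np.array([0.0, 0.0]), 1), (np.array([1.0, 1.0]), 2)]  # toy training set
print(nearest_neighbor(Z, np.array([0.9, 0.8])))            # -> 2
```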
Un-supervised Learning Models: find regularities in the input data, assuming there is some structure in the input space.

• Density estimation
• Clustering (non-parametric density estimation): divide customers into groups which have similar attributes (a minimal clustering sketch follows this list)
• Latent class models: extract topics from documents
• Compression: represent the input space with fewer parameters; projection onto lower-dimensional spaces
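A minimal clustering sketch (my illustration; plain k-means rather than the spectral methods covered later in the course): it divides instances into K groups with similar attributes.

```python
# k-means: alternate between assigning points to centers and recomputing centers.
import numpy as np

def kmeans(S, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), K, replace=False)]  # initial centers
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(S[:, None] - centers[None], axis=2)
        labels = np.argmin(d, axis=1)
        # move each center to the mean of its points (assumes no cluster empties)
        centers = np.array([S[labels == k].mean(axis=0) for k in range(K)])
    return labels, centers

S = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(S, K=2)
```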
Notations
X is the instance space: the space from which observations are drawn. Examples: $X = \{0,1\}^n$, $X = \mathbb{R}^n$, $X = \Sigma^*$.

$x \in X$ is an input instance, a single observation. Examples: $x = (0,1,1,1,0,0)$, $x = (0.5, 2.3, 0, 7.2)$, $x = \text{"text..."}$.

Y is the output space: the set of possible outcomes that can be associated with a measurement. Examples: $Y = \{-1,1\}$, $Y = \mathbb{R}$, $Y = \{1, \dots, k\}$.

An example is an instance-label pair (x, y). If $|Y| = 2$ one typically uses $\{0,1\}$ or $\{-1,1\}$. We say that an example (x, y) is positive if y = 1 and otherwise we call it a negative example.

A training set Z consists of m instance-label pairs: $Z = (x_1, y_1), \dots, (x_m, y_m)$. In some cases we refer to the training set without labels: $S = x_1, \dots, x_m$.
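One way the notation might map to code (my illustration): a training set Z is a list of instance-label pairs, and S is the same set with the labels dropped.

```python
x = [0, 1, 1, 1, 0, 0]            # an input instance from X = {0,1}^n
Z = [([0, 1], 1), ([1, 0], -1)]   # Z = (x_1,y_1),...,(x_m,y_m) with Y = {-1,1}
S = [xi for xi, yi in Z]          # S = x_1,...,x_m (unlabeled)
```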
Notations

A concept (hypothesis) class C is a set (not necessarily finite) of functions of the form:
$$C = \{h \mid h : X \to Y\}$$
Each $h \in C$ is called a concept or hypothesis or classifier. Example: if $X = \{0,1\}^n$, $Y = \{0,1\}$, then C might be $C = \{h_i \mid h_i(x) = x_i\}$.

Other examples:

Separating hyperplanes: $X = \mathbb{R}^n$, and a concept h(x) is specified by a vector $w \in \mathbb{R}^n$ and a scalar b such that:
$$h(x) = \begin{cases} 1 & w^\top x \ge b \\ -1 & \text{otherwise} \end{cases}$$

Conjunction learning: a conjunction is a special case of a Boolean formula. A literal is a variable or its negation, and a term is a conjunction of literals, e.g. $(x_1 \wedge \bar{x}_2 \wedge x_3)$. A target function is a term which consists of a subset of the literals. In this case $X = \{0,1\}^n$, $Y = \{true, false\}$, and $|X| = 2^n$, $|C| = 3^n$.

Decision trees: when $X = \{0,1\}^n$ then any Boolean function can be described by a binary tree. Thus, C consists of decision trees ($|C| = 2^{2^n}$, the number of Boolean functions over $\{0,1\}^n$).
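Minimal sketches of the two concept classes above (my illustrations of the definitions, not code from the course):

```python
import numpy as np

def hyperplane_concept(w, b):
    """Separating hyperplane: h(x) = 1 if w^T x >= b, else -1."""
    return lambda x: 1 if np.dot(w, x) >= b else -1

def conjunction_concept(literals):
    """Conjunction over {0,1}^n: literals is a list of (index, negated) pairs,
    e.g. [(0, False), (1, True)] encodes x1 AND NOT x2 (0-indexed)."""
    return lambda x: all((x[i] == 0) if neg else (x[i] == 1)
                         for i, neg in literals)

h = hyperplane_concept(np.array([1.0, -1.0]), b=0.0)
print(h(np.array([2.0, 1.0])))                  # -> 1
c = conjunction_concept([(0, False), (1, True)])
print(c([1, 0, 1]))                             # -> True
```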
The Formal Learning Model: Probably Approximately Correct (PAC)
• Distribution invariant: Learner does not need to estimate the joint distribution D over X x Y. Assumptions are that examples arrive i.i.d. and that D exists and is fixed.
• The training sample complexity (size of the training set Z) depends only on the desired accuracy and confidence parameters; it does not depend on D.
• Not all concept classes C are PAC-learnable, but some interesting classes are.
Unrealizable case: when $c_t \notin C$; the training set is $Z = \{(x_i, y_i)\}_{i=1}^m$ and D is over $X \times Y$.

Realizable case: when a target concept $c_t(x) \in C$ is known to lie inside C. In this case the training set is $Z = \{(x_i, c_t(x_i))\}_{i=1}^m$, where $S = \{x_1, \dots, x_m\}$ is sampled randomly and independently (i.i.d.) according to some (unknown) distribution D, i.e., S is distributed according to the product distribution $D^m = D \times \dots \times D$.

Given a concept function $h \in C$,
$$err_D(h) = prob_D[x : c_t(x) \ne h(x)] = \int_{x \in X} ind(c_t(x) \ne h(x))\, D(x)\, dx$$
where
$$ind(\text{statement}) = \begin{cases} 1 & \text{statement is } true \\ 0 & \text{statement is } false \end{cases}$$
$err(h)$ is the probability that an instance x sampled according to D will be labeled incorrectly by h(x).
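A Monte-Carlo sketch of $err(h)$ (my illustration; the toy D, $c_t$, and h are assumptions): sample instances from D and count how often h disagrees with the target concept.

```python
import numpy as np

def estimate_err(h, c_t, sample_from_D, m=100_000):
    """Empirical estimate of err(h) from m i.i.d. samples of D."""
    xs = [sample_from_D() for _ in range(m)]
    return np.mean([h(x) != c_t(x) for x in xs])  # fraction mislabeled by h

# Toy setup: D uniform on [0,1], c_t thresholds at 0.5, h thresholds at 0.6.
rng = np.random.default_rng(0)
c_t = lambda x: x >= 0.5
h = lambda x: x >= 0.6
print(estimate_err(h, c_t, lambda: rng.uniform(0, 1)))  # ~0.1
```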
PAC Model Definitions
$\epsilon > 0$, given to the learner, specifies the desired accuracy, i.e. $err(h) \le opt(C) + \epsilon$, where
$$opt(C) = \min_{h \in C} err(h)$$
Note: in the realizable case $opt(C) = 0$ because $err(c_t) = 0$.

$0 < \delta < 1$, given to the learner, specifies the desired confidence, i.e.
$$prob\left[err(h) \le opt(C) + \epsilon\right] \ge 1 - \delta$$
The learner is allowed to deviate occasionally from the desired accuracy, but only rarely so..
PAC Model Definitions
We will say that an algorithm L learns C if for every $\epsilon, \delta$ and for every D over $X \times Y$, L generates a concept function $h \in C$ such that the probability that $err(h) \le opt(C) + \epsilon$ is at least $1 - \delta$.
Formal Definition of PAC Learning

A learning algorithm L is a function:
$$L : \bigcup_{m \ge 1} \{(x_i, y_i)\}_{i=1}^m \to C$$
from the set of all training examples to C with the following property: given any $\epsilon, \delta \in (0,1)$ there is an integer $m_0(\epsilon, \delta)$ such that if $m \ge m_0$ then, for any probability distribution D on $X \times Y$, if Z is a training set of length m drawn randomly according to $D^m$, then with probability of at least $1 - \delta$ the hypothesis $h = L(Z) \in C$ is such that $err(h) \le opt(C) + \epsilon$.

We say that C is learnable (or PAC-learnable) if there is a learning algorithm for C.
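Since conjunctions were given above as a concept class, here is a sketch of a learner L for that class (the classic elimination algorithm, shown as my illustration; the lecture only names the class): start with all 2n literals and delete every literal falsified by a positive example. In the realizable case the output is consistent with Z.

```python
def learn_conjunction(Z, n):
    """Z: list of (x, y) with x in {0,1}^n, y in {True, False}.
    A literal (i, neg) means x_i if neg is False, NOT x_i if neg is True."""
    literals = {(i, neg) for i in range(n) for neg in (False, True)}
    for x, y in Z:
        if y:  # positive example: keep only the literals it satisfies
            literals = {(i, neg) for i, neg in literals
                        if (x[i] == 0) == neg}
    def h(x):
        return all((x[i] == 0) == neg for i, neg in literals)
    return h

Z = [([1, 0, 1], True), ([1, 1, 1], False), ([1, 0, 0], True)]
h = learn_conjunction(Z, n=3)          # learns x1 AND NOT x2
print(h([1, 0, 0]), h([0, 0, 1]))      # -> True False
```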
Formal Definition of PAC Learning
Notes:

• $m_0(\epsilon, \delta)$ does not depend on D, i.e., the PAC model is distribution invariant.
• The class C determines the sample complexity: for "simple" classes $m_0(\epsilon, \delta)$ would be small compared to more "complex" classes. For example, for a finite concept class
$$m_0 = \frac{1}{\epsilon}\left(\ln|C| + \ln\frac{1}{\delta}\right),$$
while for an infinite class
$$m_0 = \frac{1}{\epsilon}\left(vcd(C)\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right),$$
where $vcd(C)$ is the VC dimension of C (a worked example follows below).
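A worked example of the finite-class bound (my arithmetic, using the conjunction class from earlier, where $|C| = 3^n$):

```python
# Sample complexity for learning conjunctions over n = 10 variables.
import math

eps, delta, n = 0.1, 0.05, 10
m0 = (1 / eps) * (math.log(3 ** n) + math.log(1 / delta))
print(math.ceil(m0))  # -> 140 examples suffice
```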
Course Syllabus

3 x PAC:
2 x Separating Hyperplanes:
Support Vector Machine, Kernels, Linear Discriminant Analysis
3 x Unsupervised Learning:
Dimensionality Reduction (PCA), Density Estimation, Non-parametric Clustering (spectral methods)
5 x Statistical Inference:
Maximum Likelihood, Conditional Independence, Latent Class Models, Expectation-Maximization Algorithm, Graphical Models