
Page 1:

Machine Learning

Computational Learning Theory

Hamid R. Rabiee

[Slides are based on Machine Learning Book (Ch. 7) by Tom Mitchell]

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1

Page 2:

Overview

Are there general laws that govern learning?

Sample Complexity: How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?

Computational Complexity: How much computational effort is needed for a learner to converge (with high probability) to a successful hypothesis?

Mistake Bound: How many training examples will the learner misclassify before converging to a successful hypothesis?

Probability that learner outputs a successful hypothesis

These questions will be answered within two analytical frameworks

The Probably Approximately Correct (PAC) framework

The Mistake Bound framework

Page 3:

Overview

Rather than answering these questions for individual learners, we will answer them for broad classes of learners.

In particular we will consider

The size or complexity of the hypothesis space considered by the learner

The accuracy to which the target concept must be approximated

The probability that the learner will output a successful hypothesis

The manner in which training examples are presented to the learner

Page 4:

Prototypical Concept Learning Task

Given:

Instances 𝑿: Possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast

Target function 𝒄: EnjoySport: 𝑿 → {0, 1}

Hypotheses 𝑯: Conjunctions of literals, e.g. < ?, 𝑪𝒐𝒍𝒅, 𝑯𝒊𝒈𝒉, ?, ?, ? >

Training examples 𝑫: Positive and negative examples of the target function: < 𝒙𝟏, 𝒄(𝒙𝟏) >, …, < 𝒙𝒎, 𝒄(𝒙𝒎) >

Determine:

A hypothesis 𝒉 ∈ 𝑯 such that 𝒉(𝒙) = 𝒄(𝒙) for ∀𝒙 ∈ 𝑫?

A hypothesis 𝒉 ∈ 𝑯 such that 𝒉(𝒙) = 𝒄(𝒙) for ∀𝒙 ∈ 𝑿?

Page 5:

Sample Complexity

How many training examples are sufficient to learn the target concept?

If the learner proposes instances as queries to the teacher

Learner proposes instance 𝒙, teacher provides 𝒄(𝒙)

If the teacher (who knows 𝒄) provides training examples

Teacher provides a sequence of examples of the form < 𝒙, 𝒄(𝒙) >

If some random process (e.g., nature) proposes instances

Instance 𝒙 generated randomly, teacher provides 𝒄(𝒙)

Page 6:

Sample Complexity: 1

Learner proposes instance 𝒙, teacher provides 𝒄(𝒙) (assume 𝒄 is in the learner's hypothesis space 𝑯)

Optimal query strategy: play 20 questions

Pick instance 𝒙 such that half of the hypotheses in 𝑽𝑺 classify 𝒙 positive and half classify 𝒙 negative

When this is possible, need log₂ |𝑯| queries to learn 𝒄
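A minimal sketch of this query strategy on a toy hypothesis space (Python; the threshold hypotheses, the oracle, and the function name are illustrative assumptions, not part of the slides):

    import math

    def query_learn(hypotheses, instances, oracle):
        # Membership-query learning: ask about the instance that splits the
        # current version space most evenly, then drop inconsistent hypotheses.
        vs = list(hypotheses)
        queries = 0
        while len(vs) > 1:
            x = min(instances, key=lambda x: abs(sum(h(x) for h in vs) - len(vs) / 2))
            if sum(h(x) for h in vs) in (0, len(vs)):
                break  # no remaining instance distinguishes the hypotheses
            vs = [h for h in vs if h(x) == oracle(x)]  # teacher provides c(x)
            queries += 1
        return vs[0], queries

    # Toy space: thresholds on {0, ..., 15}; target concept c(x) = 1 iff x >= 11
    hypotheses = [lambda x, t=t: int(x >= t) for t in range(16)]
    h, q = query_learn(hypotheses, range(16), lambda x: int(x >= 11))
    print(q, math.ceil(math.log2(len(hypotheses))))  # 4 queries, log2|H| = 4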

Page 7:

Sample Complexity: 2

Teacher (who knows 𝒄) provides training examples (assume 𝒄 is in the learner's hypothesis space 𝑯)

Optimal teaching strategy: depends on the 𝑯 used by the learner

Consider the case where 𝑯 is conjunctions of up to 𝒏 Boolean literals and their negations

e.g., (𝑨𝒊𝒓𝑻𝒆𝒎𝒑 = 𝒘𝒂𝒓𝒎) AND (𝑾𝒊𝒏𝒅 = 𝒔𝒕𝒓𝒐𝒏𝒈), where 𝑨𝒊𝒓𝑻𝒆𝒎𝒑, 𝑾𝒊𝒏𝒅, … each have 2 possible values

If 𝒏 possible Boolean attributes in 𝑯, 𝒏 + 𝟏 examples suffice (why?)

Page 8:

Sample Complexity: 3

Given:

set of instances 𝑿

set of hypotheses 𝑯

set of possible target concepts 𝑪

training instances generated by a fixed, unknown probability distribution 𝓓 over 𝑿

Learner observes a sequence 𝑫 of training examples of form < 𝒙, 𝒄(𝒙) >, for some target concept 𝒄 ∈ 𝑪

instances 𝒙 are drawn from distribution 𝓓

teacher provides target value 𝒄(𝒙) for each

Learner must output a hypothesis 𝒉 estimating 𝒄

𝒉 is evaluated by its performance on subsequent instances drawn according to 𝓓

Note: randomly drawn instances, noise-free classifications

Page 9:

True Error of a Hypothesis

Definition: The true error (denoted 𝒆𝒓𝒓𝒐𝒓𝓓(𝒉)) of hypothesis 𝒉 with respect to target concept 𝒄 and distribution 𝓓 is the probability that 𝒉 will misclassify an instance drawn at random according to 𝓓.

error_𝓓(h) ≡ Pr_{x ∈ 𝓓}[c(x) ≠ h(x)]

Page 10:

Two Notions of Error

Training error of hypothesis 𝒉 with respect to target concept 𝒄

How often 𝒉(𝒙) ≠ 𝒄(𝒙) over training instances

True error of hypothesis 𝒉 with respect to 𝒄

How often 𝒉(𝒙) ≠ 𝒄(𝒙) over future random instances

Our concern:

Can we bound the true error of 𝒉 given the training error of 𝒉?

First consider when training error of 𝒉 is zero (i.e., 𝒉 ∈ 𝑽𝑺 𝑯,𝑫)

Page 11:

Exhausting the Version Space

Definition: The Version Space 𝑽𝑺𝑯,𝑫 is said to be 𝝐-exhausted with respect to 𝒄 and 𝓓 if every hypothesis 𝒉 in 𝑽𝑺𝑯,𝑫 has error less than 𝝐 with respect to 𝒄 and 𝓓:

∀𝒉 ∈ 𝑽𝑺𝑯,𝑫: 𝒆𝒓𝒓𝒐𝒓𝓓(𝒉) < 𝝐

Page 12:

How many examples will 𝝐-exhaust the VS?

Theorem: [Haussler, 1988]

If the hypothesis space 𝑯 is finite, and 𝑫 is a sequence of 𝒎 ≥ 𝟏 independent random examples of some target concept 𝒄, then for any 𝟎 ≤ 𝝐 ≤ 𝟏, the probability that the version space with respect to 𝑯 and 𝑫 is not 𝝐-exhausted (with respect to 𝒄) is less than |H| e^(−εm)

INTERESTING! This bounds the probability that any consistent learner will output a hypothesis 𝒉 with 𝒆𝒓𝒓𝒐𝒓(𝒉) ≥ 𝝐

If we want this probability to be below 𝜹:

|H| e^(−εm) ≤ δ

then

m ≥ (1/ε)(ln|H| + ln(1/δ))
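A minimal sketch of this bound as a helper function (Python; the function and argument names are illustrative, not from the slides):

    import math

    def sample_complexity(h_size, epsilon, delta):
        # m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice to
        # epsilon-exhaust the version space with probability at least 1 - delta.
        return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)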

Page 13:

Learning Conjunctions of Boolean Literals

How many examples are sufficient to assure with probability at least (𝟏 − 𝜹) that every 𝒉 in 𝑽𝑺𝑯,𝑫 satisfies 𝒆𝒓𝒓𝒐𝒓𝓓(𝒉) ≤ 𝝐?

Use our theorem: m ≥ (1/ε)(ln|H| + ln(1/δ))

Suppose 𝑯 contains conjunctions of constraints on up to 𝒏 Boolean attributes (i.e., 𝒏 Boolean literals). Then |𝑯| = 𝟑^𝒏, and

m ≥ (1/ε)(ln 3^n + ln(1/δ)) → m ≥ (1/ε)(n ln 3 + ln(1/δ))
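A quick numeric check of this bound (Python; the choice n = 10, ε = 0.1, δ = 0.05 is only an illustrative assumption):

    import math

    # Conjunctions of up to n Boolean literals: |H| = 3**n
    n, epsilon, delta = 10, 0.1, 0.05
    m = (n * math.log(3) + math.log(1 / delta)) / epsilon
    print(math.ceil(m))  # 140 examples suffice for these settings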

Page 14:

How About ‘EnjoySport’?

m ≥ (1/ε)(ln|H| + ln(1/δ))

If 𝑯 is as given in ‘EnjoySport’ then |𝑯| = 𝟗𝟕𝟑, and

m ≥ (1/ε)(ln 973 + ln(1/δ))

... if we want to assure that with probability 95%, 𝑽𝑺 contains only hypotheses with 𝒆𝒓𝒓𝒐𝒓𝓓(𝒉) ≤ 𝟎.𝟏, then it is sufficient to have 𝒎 examples, where

m ≥ (1/0.1)(ln 973 + ln(1/0.05)) → m ≥ 10(ln 973 + ln 20) → m ≥ 10(6.88 + 3.00) → m ≥ 98.8
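The same arithmetic, checked in Python (values taken from the slide):

    import math

    # EnjoySport: |H| = 973, epsilon = 0.1, delta = 0.05
    m = (math.log(973) + math.log(1 / 0.05)) / 0.1
    print(m, math.ceil(m))  # 98.76..., so 99 examples suffice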

Page 15:

The PAC Learning Model

Consider a class 𝑪 defined over a set of instances 𝑿 of length 𝒏 and a learner 𝑳 using hypothesis space 𝑯.

Definition: 𝑪 is “PAC-learnable” by 𝑳 using 𝑯 if for all 𝒄 ∈ 𝑪, distributions 𝓓 over 𝑿, 𝜺 such that 𝟎 < 𝜺 < 𝟎.𝟓, and 𝜹 such that 𝟎 < 𝜹 < 𝟎.𝟓, learner 𝑳 will, with probability at least (𝟏 − 𝜹), output a hypothesis 𝒉 ∈ 𝑯 such that 𝒆𝒓𝒓𝒐𝒓𝓓(𝒉) ≤ 𝜺, in time that is polynomial in 1/ε, 1/δ, n, and size(c).

Page 16:

Agnostic Learning

So far, assumed 𝒄 ∈ 𝑯

Agnostic learning setting: don't assume 𝒄 ∈ 𝑯

What do we want then?

The hypothesis 𝒉 that makes fewest errors on training data

What is sample complexity in this case?

m ≥ (1/(2ε²))(ln|H| + ln(1/δ))

derived from Hoeffding bounds:

Pr[error_𝓓(h) > error_D(h) + ε] ≤ e^(−2mε²)
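A minimal sketch of the agnostic bound as a helper (Python; names are illustrative, not from the slides):

    import math

    def agnostic_sample_complexity(h_size, epsilon, delta):
        # m >= (1 / (2 * epsilon**2)) * (ln|H| + ln(1/delta)), from the
        # Hoeffding bound, with no assumption that c is in H.
        return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * epsilon ** 2))

    print(agnostic_sample_complexity(973, 0.1, 0.05))  # 494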

Page 17:

Shattering a Set of Instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
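A brute-force check of these definitions on a toy hypothesis space (Python; the 1-D threshold hypotheses are an illustrative assumption, not from the slides):

    from itertools import product

    def shatters(hypotheses, points):
        # S is shattered iff every dichotomy of S is realized by some hypothesis.
        achieved = {tuple(h(x) for x in points) for h in hypotheses}
        return all(d in achieved for d in product((0, 1), repeat=len(points)))

    # Toy space: thresholds on the real line, h_t(x) = 1 iff x >= t
    thresholds = [lambda x, t=t: int(x >= t) for t in range(11)]
    print(shatters(thresholds, [3]))     # True: a single point can be shattered
    print(shatters(thresholds, [3, 7]))  # False: no h labels 3 positive and 7 negative

Since one point can be shattered but no two-point set can, the VC dimension (next slide) of this toy threshold class is 1.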

Page 18:

Three Instances Shattered

Page 19:

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then 𝑽𝑪(𝑯) ≡ ∞

Page 20:

VC Dim. of Linear Decision Surfaces

Page 21:

Sample Complexity from VC Dimension

How many randomly drawn examples suffice to 𝝐-exhaust 𝑽𝑺𝑯,𝑫 with probability at least (𝟏 − 𝜹)?
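The slide leaves the answer for the lecture; the bound given in Mitchell Ch. 7 (due to Blumer et al.) is m ≥ (1/ε)(4 log₂(2/δ) + 8·VC(H)·log₂(13/ε)). A small helper computing it (Python; the function name is illustrative):

    import math

    def vc_sample_complexity(vc_dim, epsilon, delta):
        # m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))  [Mitchell Ch. 7]
        return math.ceil((4 * math.log2(2 / delta)
                          + 8 * vc_dim * math.log2(13 / epsilon)) / epsilon)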

Page 22:

Mistake Bounds

So far: how many examples needed to learn?

What about: how many mistakes before convergence?

Let's consider a setting similar to PAC learning:

Instances drawn at random from 𝑿 according to distribution 𝓓

Learner must classify each instance before receiving the correct classification from the teacher

Can we bound the number of mistakes learner makes before converging?
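One classical answer from Mitchell Ch. 7 is the Halving algorithm, which predicts by majority vote of the current version space and makes at most log₂|H| mistakes when c ∈ H. A minimal sketch (Python; the names and the example stream are illustrative assumptions):

    def halving_mistakes(hypotheses, labelled_stream):
        # Halving algorithm: predict by majority vote of the version space,
        # then eliminate hypotheses inconsistent with the revealed label.
        vs = list(hypotheses)
        mistakes = 0
        for x, label in labelled_stream:
            prediction = int(2 * sum(h(x) for h in vs) >= len(vs))  # majority vote
            mistakes += int(prediction != label)
            vs = [h for h in vs if h(x) == label]
        return mistakes  # at most log2|H| when the target concept is in H

    # Toy run: threshold hypotheses on {0..15}, target c(x) = 1 iff x >= 11
    hypotheses = [lambda x, t=t: int(x >= t) for t in range(16)]
    stream = [(x, int(x >= 11)) for x in range(16)]
    print(halving_mistakes(hypotheses, stream))  # prints a value <= log2(16) = 4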

Page 23:


Any Questions?

End of Lecture 19

Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1