

1

Computational Learning Theory

How many training examples are sufficient to successfully learn the target function?

How much computational effort is needed to converge to a successful hypothesis?

How many mistakes will the learner make before succeeding?


2

Computational Learning Theory

What is “success”?

Inductive learning settings:
1. Learner poses queries to teacher
2. Teacher chooses examples
3. Randomly generated instances, labeled by teacher

Probably approximately correct (PAC) learning
Vapnik-Chervonenkis (VC) dimension
Mistake bounds


3

Computational Learning Theory

What general laws constrain inductive learning? We seek theory to relate:

1. Probability of successful learning
2. Number of training examples
3. Complexity of hypothesis space
4. Accuracy to which target concept is approximated
5. Manner in which training examples are presented


4

Prototypical Concept Learning Task

Given:
Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
Target function c: EnjoySport : X → {0,1}
Hypotheses H: conjunctions of literals, e.g. <?, Cold, High, ?, ?, ?>
Training examples D: positive and negative examples of the target function <x1, c(x1)>, …, <xm, c(xm)>

Determine:
1. A hypothesis h in H such that h(x) = c(x) for all x in D?
2. A hypothesis h in H such that h(x) = c(x) for all x in X?


5

Sample Complexity

How many training examples are sufficient to learn the target concept?

1. If learner proposes instances, as queries to teacher:
   learner proposes instance x, teacher provides c(x)

2. If teacher (who knows c) provides training examples:
   teacher provides sequence of examples of form <x, c(x)>

3. If some random process (e.g., nature) proposes instances:
   instance x generated randomly, teacher provides c(x)


6

Sample Complexity: Case 1

Learner proposes instance x, teacher provides c(x)

Optimal query strategy: pick instance x such that half of the hypotheses in VS classify x as positive and half as negative.

How many queries are needed to learn c?  log2|H|
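A quick numeric illustration (the value |H| = 973 is the EnjoySport hypothesis space used on a later slide, quoted here only as an example):

```python
import math

# With an ideal query that halves the version space each time,
# about log2(|H|) membership queries suffice to identify the target concept.
H_SIZE = 973  # illustrative: the EnjoySport hypothesis space used later
print(math.ceil(math.log2(H_SIZE)))  # 10
```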


7

Sample Complexity: Case 2
Teacher provides training examples

Optimal teaching strategy: depends on H used by learner

Consider the case where H = conjunctions of up to n boolean literals (each literal a variable or its negation), e.g., (AirTemp = Warm) ∧ (Wind = Strong), where AirTemp, Wind, … each have 2 possible values.

How many examples are needed to learn c? use Find-S


8

Find-S Algorithm

1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   For each attribute constraint ai in h
      If the constraint ai in h is satisfied by x
         then do nothing
         else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
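Below is a minimal Python sketch of Find-S for conjunctions of attribute constraints. The tuple representation (a concrete value per attribute, with '?' meaning "any value") and the function name find_s are illustrative choices, not from the slides.

```python
def find_s(examples):
    """Find-S for conjunctive hypotheses over discrete attributes.

    examples: list of (instance, label) pairs, where instance is a tuple of
    attribute values and label is 1 (positive) or 0 (negative).
    Returns the maximally specific hypothesis consistent with the positives,
    as a tuple whose entries are either a required value or '?' (any value).
    """
    positives = [x for x, label in examples if label == 1]
    if not positives:
        return None                 # h stays maximally specific: matches nothing
    h = list(positives[0])          # start from the first positive instance itself
    for x in positives[1:]:         # negative examples are simply ignored
        for i, value in enumerate(x):
            if h[i] != value:       # constraint violated: generalize it
                h[i] = '?'
    return tuple(h)
```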


9

Sample Complexity

Example: attributes A, B, C, D, E; concept (A=0) ∧ (C=1)

+ : 01111   Find-S: (A=0) ∧ (B=1) ∧ (C=1) ∧ (D=1) ∧ (E=1)
+ : 00111   Find-S generalizes to (A=0) ∧ (C=1) ∧ (D=1) ∧ (E=1)
+ : 00101   Find-S generalizes to (A=0) ∧ (C=1) ∧ (E=1)
+ : 00100   Find-S generalizes to (A=0) ∧ (C=1)
- : 10100   Find-S does not generalize to (C=1)
- : 00000   Find-S does not generalize to (A=0)

How many examples are needed to learn c? n+1
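Running the find_s sketch above on this trace reproduces the final hypothesis (encoding the bit strings as tuples of '0'/'1' characters):

```python
# Concept (A=0) ∧ (C=1) over boolean attributes A..E, as in the trace above.
examples = [
    (tuple('01111'), 1),
    (tuple('00111'), 1),
    (tuple('00101'), 1),
    (tuple('00100'), 1),
    (tuple('10100'), 0),   # negatives leave the hypothesis unchanged
    (tuple('00000'), 0),
]
print(find_s(examples))    # ('0', '?', '1', '?', '?'), i.e. (A=0) ∧ (C=1)
```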


10

Sample Complexity: Case 3

Given:
a set of instances X
a set of hypotheses H
a set of possible target concepts C
training instances generated by a fixed, unknown probability distribution D over X

Learner observes a sequence D of training examples of form <x, c(x)>, for some target concept c ∈ C:
instances x are drawn from distribution D
teacher provides target value c(x) for each

Learner must output a hypothesis h estimating c.
Note: probabilistic instances, noise-free classifications.


11

True Error of a Hypothesis

Definition: The true error errorD(h) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.


12

Two Notions of Error

Training error of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over training instances

True error of hypothesis h with respect to c: how often h(x) ≠ c(x) over future random instances

Our concern: Can we bound the true error of h given the training error of h? First consider when training error of h is zero (h ∈ VSH,D )


13

Exhausting the Version Space

Definition: The version space VSH,D is said to be ε-exhausted with respect to c and D, if every hypothesis h in VSH,D has error less than ε with respect to c and D.

(∀h ∈ VSH,D ) errorD(h) < ε


14

How many examples will ε-exhaust the VS? Theorem: [Haussler, 1988].

If H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VSH,D is not ε-exhausted (with respect to c) is at most |H| e^(-εm).

This bounds the probability that any consistent learner will output a hypothesis h with errorD(h) ≥ ε.

We want this probability to be below δ:

|H| e^(-εm) ≤ δ   (7.1)
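As a quick numeric check of the bound (the values |H| = 973, ε = 0.1, δ = 0.05, m = 99 anticipate the EnjoySport example a few slides later and are purely illustrative):

```python
import math

# Probability that the version space is NOT ε-exhausted after m examples,
# bounded above by |H| * exp(-ε m)  (Haussler, 1988).
def not_exhausted_bound(h_size, epsilon, m):
    return h_size * math.exp(-epsilon * m)

print(not_exhausted_bound(973, 0.1, 99))   # ≈ 0.049, just below δ = 0.05
```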


15

Rearranging (7.1) to solve for the number of examples m:

|H| e^(-εm) ≤ δ

e^(-εm) ≤ δ/|H|

-εm ≤ ln(δ/|H|)

εm ≥ ln(|H|/δ) = ln|H| + ln(1/δ)

m ≥ (1/ε)(ln|H| + ln(1/δ))   (7.2)


16

Learning Conjunctions of Boolean Literals

How many examples are sufficient to assure with probability at least (1 - δ) that every h in VSH,D satisfies errorD(h) ≤ ε?

Use our theorem:

m ≥ (1/ε)(ln|H| + ln(1/δ))   (7.2)

Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n, and

m ≥ (1/ε)(ln 3^n + ln(1/δ))

or

m ≥ (1/ε)(n ln 3 + ln(1/δ))


17

Example: Enjoy Sport

If H is as given in EnjoySport then |H| = 973, and

m ≥ (1/ε)(ln 973 + ln(1/δ))

... if we want to assure with probability 95% that VS contains only hypotheses with errorD(h) ≤ 0.1, then it suffices that

m ≥ (1/0.1)(ln 973 + ln(1/0.05)) ≈ 10 (6.88 + 3.00) ≈ 98.8

i.e., m = 99 examples are enough.
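A small Python helper for equation (7.2); the function name and rounding up to an integer are illustrative choices.

```python
import math

# Equation (7.2): m >= (1/ε) (ln|H| + ln(1/δ)) examples suffice.
def sample_complexity(h_size, epsilon, delta):
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

print(sample_complexity(973, 0.1, 0.05))      # EnjoySport: 99
print(sample_complexity(3 ** 10, 0.1, 0.05))  # conjunctions of 10 boolean literals, |H| = 3^10: 140
```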


18

Probably Approximately Correct (PAC) Learning

Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.

Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 - δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).


19

Agnostic Learning

So far, assumed c ∈ H

Agnostic learning setting: don't assume c ∈ H

What do we want then?

The hypothesis h that makes fewest errors on training data

What is sample complexity in this case?


20

Agnostic Learning

Hoeffding bounds: if the training error errortrain(h) is measured over the set D containing m randomly drawn examples, then for a single hypothesis h

Pr[ errorD(h) > errortrain(h) + ε ] ≤ e^(-2mε²)

Consider the probability that any one of the |H| hypotheses could have a large error:

Pr[ (∃h ∈ H) errorD(h) > errortrain(h) + ε ] ≤ |H| e^(-2mε²)

Requiring this probability to be at most δ and solving for m:

m ≥ (1/(2ε²))(ln|H| + ln(1/δ))   (7.3)
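The same calculation for the agnostic bound (7.3), again as a hedged sketch with an illustrative function name:

```python
import math

# Equation (7.3): m >= (1/(2 ε²)) (ln|H| + ln(1/δ)) examples suffice
# when the best hypothesis may still have nonzero training error.
def agnostic_sample_complexity(h_size, epsilon, delta):
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2 * epsilon ** 2))

print(agnostic_sample_complexity(973, 0.1, 0.05))  # 494, versus 99 in the consistent case
```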


21

Shattering a Set of Instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.


22

Three Instances Shattered

[Figure: instance space X with three instances shattered by hypotheses in H]


23

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.


24

VC-Dim of a Set of Real Numbers

X = ℜ
h ∈ H, where h has the form a < x < b
VC(H) = ?

Consider S = {3.1, 5.7}. The 4 hypotheses
(1 < x < 2), (1 < x < 4), (4 < x < 7), and (1 < x < 7) shatter S ⇒ VC(H) ≥ 2

No set S = {x0, x1, x2} of three points can be shattered, since the dichotomy labeling the two outer points positive and the middle point negative cannot be realized by an interval ⇒ VC(H) < 3
⇒ VC(H) = 2
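A brute-force check of the two claims above, treating hypotheses as Python callables (an illustrative encoding, not the slides' notation):

```python
from itertools import product

def interval(a, b):
    """Hypothesis of the form a < x < b over X = R."""
    return lambda x: a < x < b

def shattered(points, hypotheses):
    """True iff every dichotomy of `points` is realized by some hypothesis."""
    for labels in product([False, True], repeat=len(points)):
        if not any(all(h(x) == y for x, y in zip(points, labels)) for h in hypotheses):
            return False
    return True

H4 = [interval(1, 2), interval(1, 4), interval(4, 7), interval(1, 7)]
print(shattered([3.1, 5.7], H4))        # True  -> VC(H) >= 2
print(shattered([1.5, 3.1, 5.7], H4))   # False for these four hypotheses
```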


25

VC-Dim of Linear Decision Surfaces

X = instances corresponding to points (x, y)
H = set of linear decision surfaces
VC(H) = ?

2 points → can be shattered
3 points (not collinear) → can be shattered
4 points → cannot be shattered

⇒ VC(H) = 3
In general, for linear decision surfaces in a k-dimensional space, VC(H) = k + 1


26

Sample Complexity from VC-Dimension

How many randomly drawn examples suffice to ε-exhaust VSH,D with probability at least (1 - δ)? [Blumer et al., 1989]

m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))   (7.7)

Lower bound on sample complexity [Ehrenfeucht et al., 1989]: Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. There exists a distribution D* and target concept in C such that if L observes fewer examples than

max[ (1/ε) log(1/δ), (VC(C) - 1)/(32ε) ]

then with probability ≥ δ, L outputs a hypothesis h having errorD*(h) > ε.
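A sketch of the upper bound, assuming the form of (7.7) shown above; the function name is illustrative.

```python
import math

# Equation (7.7): m >= (1/ε) (4 log2(2/δ) + 8 VC(H) log2(13/ε))
def vc_sample_complexity(vc_dim, epsilon, delta):
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / epsilon)) / epsilon)

print(vc_sample_complexity(3, 0.1, 0.05))  # e.g. linear decision surfaces in the plane: 1899
```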


27

Mistake Bounds

So far: how many examples needed to learn? What about: how many mistakes before convergence? Let's consider similar setting to PAC learning:

Instances drawn at random from X according to distribution D
Learner must classify each instance before receiving the correct classification from the teacher
Can we bound the number of mistakes the learner makes before converging?


28

Mistake Bounds: Find-S

Consider Find-S when H = conjunctions of n boolean literals

Initialize h to the most specific hypothesis
l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln

For each positive training instance x
   Remove from h any literal that is not satisfied by x

Output hypothesis h


29

Mistake Bounds: Find-S

How many mistakes before converging to the correct h?

Find-S can never mistakenly classify a negative instance as positive.
1st positive instance: n of the 2n literals are eliminated.
Each subsequent positive example that is misclassified eliminates at least 1 of the remaining n literals.

⇒ The total number of mistakes is ≤ n + 1 (worst case, e.g. when (∀x) c(x) = 1).


30

Mistake Bounds: Halving Algorithm

Consider the Halving Algorithm:
Learn concept using the version space Candidate-Elimination algorithm
Classify new instances by majority vote of VS members

How many mistakes before converging to the correct h?

… in the worst case:
A mistake occurs only when the majority of hypotheses in the current VS incorrectly classify the new example; each such mistake reduces the VS to at most half its size.
#mistakes ≤ log2|H|

… in the best case: none
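A minimal Python sketch of the Halving Algorithm, with hypotheses represented as callables returning 0 or 1 (ties are broken toward 0, mirroring the Weighted Majority pseudocode later); names are illustrative.

```python
def halving(hypotheses, stream):
    """Predict by majority vote of the current version space, then eliminate
    every hypothesis that disagrees with the revealed label."""
    vs = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes_for_1 = sum(h(x) for h in vs)
        prediction = 1 if 2 * votes_for_1 > len(vs) else 0
        if prediction != label:
            mistakes += 1                      # at most log2(|H|) times on noise-free data
        vs = [h for h in vs if h(x) == label]  # Candidate-Elimination style update
    return mistakes, vs
```

For instance, if H contains 8 hypotheses and the labels are noise-free with the target in H, at most log2 8 = 3 mistakes are made.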


31

Candidate Elimination Algorithm

G ← maximally general hypotheses in H
S ← maximally specific hypotheses in H
For each training example d = <x, c(x)>
   If d is a positive example
      Remove from G any hypothesis that is inconsistent with d
      For each hypothesis s in S that is not consistent with d
         Remove s from S
         Add to S all minimal generalizations h of s such that
            h is consistent with d, and
            some member of G is more general than h
      Remove from S any hypothesis that is more general than another hypothesis in S


32

Candidate Elimination Algorithm

If d is a negative example
   Remove from S any hypothesis that is inconsistent with d
   For each hypothesis g in G that is not consistent with d
      Remove g from G
      Add to G all minimal specializations h of g such that
         h is consistent with d, and
         some member of S is more specific than h
   Remove from G any hypothesis that is less general than another hypothesis in G
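A compact Python sketch of the two slides above, specialized to conjunctions of attribute constraints (each constraint encoded as '?' for "any value", a specific value, or '0' for the empty constraint); the function names and the encoding are illustrative, not the slides' notation.

```python
def matches(h, x):
    """Does conjunctive hypothesis h classify instance x as positive?"""
    return all(c != '0' and (c == '?' or c == v) for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """True iff h1 is at least as general as h2."""
    return all(a == '?' or a == b or b == '0' for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    """examples: list of (instance, label); domains: list of value sets per attribute."""
    n = len(domains)
    S = {tuple('0' * n)}    # maximally specific boundary
    G = {tuple('?' * n)}    # maximally general boundary
    for x, label in examples:
        if label == 1:                                   # positive example
            G = {g for g in G if matches(g, x)}
            new_S = set()
            for s in S:
                if matches(s, x):
                    new_S.add(s)
                else:                                    # unique minimal generalization
                    gen = tuple(v if c == '0' else (c if c == v else '?')
                                for c, v in zip(s, x))
                    if any(more_general_or_equal(g, gen) for g in G):
                        new_S.add(gen)
            S = {s for s in new_S
                 if not any(t != s and more_general_or_equal(s, t) for t in new_S)}
        else:                                            # negative example
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                else:                                    # minimal specializations
                    for i, values in enumerate(domains):
                        if g[i] == '?':
                            for v in values:
                                if v != x[i]:
                                    spec = g[:i] + (v,) + g[i + 1:]
                                    if any(more_general_or_equal(spec, s) for s in S):
                                        new_G.add(spec)
            G = {g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)}
    return S, G
```

Since the minimal generalization of a conjunctive hypothesis is unique, S stays a singleton here; the set-based version above simply mirrors the slides' pseudocode.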


33

Optimal Mistake Bounds

Let MA(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (maximum over all possible c ∈ C and all possible training sequences).

Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of MA(C).


34

Weighted Majority Algorithm

Predicts by taking a weighted vote among a pool of prediction algorithms A, and learns by altering the weight associated with each prediction algorithm.

Able to accommodate inconsistent training data.

The #mistakes of WEIGHTED-MAJORITY can be bounded in terms of the #mistakes committed by the best algorithm in A.


35

Weighted Majority Algorithm

Pooling of multiple learning algorithms. ai is the i-th prediction algorithm in A, wi its weight.

Initialize all wi = 1
For each training example <x, c(x)>
   Initialize q0, q1 to 0
   For each algorithm ai
      If ai(x) = 0 then q0 = q0 + wi
      If ai(x) = 1 then q1 = q1 + wi
   If q1 > q0 then predict c(x) = 1, else predict c(x) = 0
   For each algorithm ai in A
      If ai(x) ≠ c(x) then wi ← β · wi
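A direct Python transcription of the pseudocode above, with prediction algorithms represented as callables returning 0 or 1; the function name is illustrative.

```python
def weighted_majority(algorithms, stream, beta=0.5):
    """Weighted Majority: vote with the current weights, then multiply by beta
    the weight of every algorithm that predicted the wrong label."""
    weights = [1.0] * len(algorithms)
    mistakes = 0
    for x, label in stream:
        q0 = sum(w for a, w in zip(algorithms, weights) if a(x) == 0)
        q1 = sum(w for a, w in zip(algorithms, weights) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0        # ties go to 0, as in the pseudocode
        if prediction != label:
            mistakes += 1
        weights = [w * beta if a(x) != label else w
                   for a, w in zip(algorithms, weights)]
    return mistakes, weights
```

With β = 1/2, the number of mistakes is bounded by 2.4(k + log2 n), where k is the best algorithm's mistake count, as shown on the next two slides.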


36

Mistake Bound Weighted Majority Alg.

Let:
D - any sequence of training examples
A - any set of n prediction algorithms
k - minimum #mistakes made over D by any algorithm in A

Then:

The #mistakes over D made by the Weighted-Majority algorithm using β = 1/2 is ≤ 2.4(k + log2 n)


37

Mistake Bound Weighted Majority Alg.

Let aj = the algorithm in A with the optimal number of mistakes k. ⇒ Its final weight is wj = (1/2)^k.

Consider W = Σi wi, the sum of the weights of all n algorithms in A; W is initially n.
For each mistake made by Weighted-Majority, W is reduced to at most (3/4)W, because the algorithms voting with the (incorrect) weighted majority hold at least (1/2)W, and that portion of W is multiplied by 1/2.
Let M = total #mistakes committed by Weighted-Majority on training sequence D; then the total final weight is ≤ n(3/4)^M.
The final weight wj of the optimal algorithm cannot be greater than the total final weight:

(1/2)^k ≤ n(3/4)^M

Taking log2 of both sides and solving for M:

-k ≤ log2 n + M log2(3/4)

M ≤ (k + log2 n) / (-log2(3/4)) ≈ 2.4 (k + log2 n)


38

Summary

Training error
Sample error
True error errorD(h)
Sample complexity
Mistake bound
Probably approximately correct (PAC) learning
Vapnik-Chervonenkis dimension
Weighted Majority Algorithm