Inductive Learning Decision Tree Method (If it’s not simple, it’s not worth learning it) R&N: Chap. 18, Sect. 18.1–3 Much of this taken from slides of: Jean-Claude Latombe, Stanford University; Stuart Russell, UC Berkley; Lise Getoor, University of Maryland.


Page 1:

Inductive Learning Decision Tree Method

(If it’s not simple, it’s not worth learning it)

R&N: Chap. 18, Sect. 18.1–3
Much of this taken from slides of: Jean-Claude Latombe, Stanford University; Stuart Russell, UC Berkeley; Lise Getoor, University of Maryland.

Page 2:

Motivation

An AI agent operating in a complex world requires an awful lot of knowledge: state representations, state axioms, constraints, action descriptions, heuristics, probabilities, ...

More and more, AI agents are designed to acquire knowledge through learning

Page 3:

What is Learning?

Mostly generalization from experience: “Our experience of the world is specific, yet we are able to formulate general theories that account for the past and predict the future” (M.R. Genesereth and N.J. Nilsson, Logical Foundations of AI, 1987)

Concepts, heuristics, policies
Supervised vs. unsupervised learning

Page 4:

Contents

Introduction to inductive learning
Logic-based inductive learning:
• Decision-tree induction

Page 5:

Logic-Based Inductive Learning

Background knowledge KB

Training set D (observed knowledge) that is not logically implied by KB

Inductive inference: Find h such that KB and h imply D

h = D is a trivial, but uninteresting, solution (data caching)

Page 6:

Rewarded Card Example

Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”

Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
((s=S) ∨ (s=C)) ⇒ BLACK(s)
((s=D) ∨ (s=H)) ⇒ RED(s)

Training set D:
REWARD([4,C])  REWARD([7,C])  REWARD([2,S])
¬REWARD([5,H])  ¬REWARD([J,S])

Page 7:

Rewarded Card Example

Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”

Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
((s=S) ∨ (s=C)) ⇒ BLACK(s)
((s=D) ∨ (s=H)) ⇒ RED(s)

Training set D:
REWARD([4,C])  REWARD([7,C])  REWARD([2,S])
¬REWARD([5,H])  ¬REWARD([J,S])

Possible inductive hypothesis:
h ≡ (NUM(r) ∧ BLACK(s) ⇒ REWARD([r,s]))

There are several possible inductive hypotheses
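The consistency of such a hypothesis with D can be checked mechanically. A minimal sketch (the encoding and helper names are mine; the two ¬REWARD examples follow the hypothesis on this slide):

```python
# Sketch: check that h = NUM(r) & BLACK(s) => REWARD([r,s]) agrees with
# the training set D (hypothetical encoding of the slide's card example).

def num(r):              # ranks 1..10 are "number" cards
    return r in range(1, 11)

def black(s):            # spades and clubs are black
    return s in ("S", "C")

def h(r, s):             # hypothesis: reward iff numbered and black
    return num(r) and black(s)

# Training set D: (rank, suit, rewarded?)
D = [(4, "C", True), (7, "C", True), (2, "S", True),
     (5, "H", False), ("J", "S", False)]

consistent = all(h(r, s) == rewarded for r, s, rewarded in D)
print(consistent)  # True: h agrees with every example in D
```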

Page 8:

Learning a Predicate (Concept Classifier)

Set E of objects (e.g., cards)
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)

Page 9:

Learning a Predicate (Concept Classifier)

Set E of objects (e.g., cards)
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)
Observable predicates A(x), B(x), … (e.g., NUM, RED)
Training set: values of CONCEPT for some combinations of values of the observable predicates

Page 10:

Example of Training Set

Page 11:

Example of Training Set

Note that the training set does not say whether an observable predicate is pertinent or not

Ternary attributes

Goal predicate is PLAY-TENNIS

Page 12:

Learning a Predicate (Concept Classifier)

Set E of objects (e.g., cards)
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)
Observable predicates A(x), B(x), … (e.g., NUM, RED)
Training set: values of CONCEPT for some combinations of values of the observable predicates

Find a representation of CONCEPT in the form: CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable predicates, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))

Page 13:

Learning an Arch Classifier

ARCH(x) ⇔ HAS-PART(x,b1) ∧ HAS-PART(x,b2) ∧ HAS-PART(x,b3) ∧ IS-A(b1,BRICK) ∧ IS-A(b2,BRICK) ∧ ¬MEET(b1,b2) ∧ (IS-A(b3,BRICK) ∨ IS-A(b3,WEDGE)) ∧ SUPPORTED(b3,b1) ∧ SUPPORTED(b3,b2)

These objects are arches: (positive examples)

These aren’t: (negative examples)

Page 14:

Example set

An example consists of the values of CONCEPT and the observable predicates for some object x

An example is positive if CONCEPT is True; otherwise it is negative

The set X of all examples is the example set

The training set is a subset of X (a small one!)

Page 15:

Hypothesis Space

A hypothesis is any sentence of the form: CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built using the observable predicates

The set of all hypotheses is called the hypothesis space H

A hypothesis h agrees with an example if it gives the correct value of CONCEPT

Page 16:

Inductive Learning Scheme

[Diagram: the example set X = {[A, B, …, CONCEPT]} contains positive (+) and negative (−) examples; from a training set D ⊆ X, a learner working in the hypothesis space H = {[CONCEPT(x) ⇔ S(A,B,…)]} produces an inductive hypothesis h]

Page 17:

Size of Hypothesis Space

n observable predicates ⇒ 2^n entries in the truth table defining CONCEPT, and each entry can be filled with True or False

In the absence of any restriction (bias), there are 2^(2^n) hypotheses to choose from

n = 6 ⇒ ~2×10^19 hypotheses!
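The counting argument is easy to verify:

```python
# With n observable predicates there are 2**n rows in the truth table
# defining CONCEPT, and each row can be filled with True or False,
# hence 2**(2**n) distinct Boolean hypotheses.
n = 6
num_rows = 2 ** n              # 64 rows in the truth table
num_hypotheses = 2 ** num_rows
print(num_hypotheses)          # 18446744073709551616, i.e. about 1.8e19
```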

Page 18:

Multiple Inductive Hypotheses

Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”

Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
((s=S) ∨ (s=C)) ⇒ BLACK(s)
((s=D) ∨ (s=H)) ⇒ RED(s)

Training set D:
REWARD([4,C])  REWARD([7,C])  REWARD([2,S])
¬REWARD([5,H])  ¬REWARD([J,S])

h1 ≡ NUM(r) ∧ BLACK(s) ⇒ REWARD([r,s])
h2 ≡ BLACK(s) ∧ ¬(r=J) ⇒ REWARD([r,s])
h3 ≡ ([r,s]=[4,C]) ∨ ([r,s]=[7,C]) ∨ ([r,s]=[2,S]) ⇒ REWARD([r,s])
h4 ≡ ¬([r,s]=[5,H]) ∧ ¬([r,s]=[J,S]) ⇒ REWARD([r,s])

All four agree with all the examples in the training set

Page 19:

Multiple Inductive Hypotheses

Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”

Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
((s=S) ∨ (s=C)) ⇒ BLACK(s)
((s=D) ∨ (s=H)) ⇒ RED(s)

Training set D:
REWARD([4,C])  REWARD([7,C])  REWARD([2,S])
¬REWARD([5,H])  ¬REWARD([J,S])

h1 ≡ NUM(r) ∧ BLACK(s) ⇒ REWARD([r,s])
h2 ≡ BLACK(s) ∧ ¬(r=J) ⇒ REWARD([r,s])
h3 ≡ ([r,s]=[4,C]) ∨ ([r,s]=[7,C]) ∨ ([r,s]=[2,S]) ⇒ REWARD([r,s])
h4 ≡ ¬([r,s]=[5,H]) ∧ ¬([r,s]=[J,S]) ⇒ REWARD([r,s])

All four agree with all the examples in the training set

Need for a system of preferences – called a bias – to compare possible hypotheses

Page 20:

Inductive learning

• Simplest form: learn a function from examples
• f is the target function
• An example is a pair (x, f(x))
• Problem: find a hypothesis h such that h ≈ f, given a training set of examples

(This is a highly simplified model of real learning:
– Ignores prior knowledge
– Assumes examples are given)

Page 21:

Inductive learning method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

Page 22:

Inductive learning method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

Page 23:

Inductive learning method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

Page 24:

Inductive learning method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

Page 25:

Inductive learning method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

Page 26:

Inductive learning method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

Ockham’s razor: prefer the simplest hypothesis consistent with the data

Page 27:

Notion of Capacity

It refers to the ability of a machine to learn any training set without error

A machine with too much capacity is like a botanist with photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything he has seen before

A machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree

Good generalization can only be achieved when the right balance is struck between the accuracy attained on the training set and the capacity of the machine

Page 28:

Keep-It-Simple (KIS) Bias

Examples
• Use many fewer observable predicates than the training set
• Constrain the learnt predicate, e.g., to use only “high-level” observable predicates such as NUM, FACE, BLACK, and RED and/or to have simple syntax

Motivation
• If a hypothesis is too complex, it is not worth learning it (data caching does the job as well)
• There are many fewer simple hypotheses than complex ones, hence the hypothesis space is smaller

Einstein: “A theory must be as simple as possible, but not simpler than this”

Page 29:

Keep-It-Simple (KIS) Bias

Examples
• Use many fewer observable predicates than the training set
• Constrain the learnt predicate, e.g., to use only “high-level” observable predicates such as NUM, FACE, BLACK, and RED and/or to have simple syntax

Motivation
• If a hypothesis is too complex, it is not worth learning it (data caching does the job as well)
• There are many fewer simple hypotheses than complex ones, hence the hypothesis space is smaller

If the bias allows only sentences S that are conjunctions of k ≪ n predicates picked from the n observable predicates, then the size of H is O(n^k)
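A quick sketch of the size of the biased space, counting conjunctions of k of the n predicates, each possibly negated (the exact count depends on the syntax allowed, so this is illustrative):

```python
# Size of the biased hypothesis space: choose k of the n observable
# predicates (math.comb), each of which may appear plain or negated.
# C(n, k) * 2**k is polynomial in n for fixed k, versus 2**(2**n) unbiased.
from math import comb

def biased_space_size(n, k):
    return comb(n, k) * 2 ** k

print(biased_space_size(6, 2))   # 60 hypotheses, versus ~1.8e19 unbiased
```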

Page 30:

Putting Things Together

[Diagram: the object set, goal predicate, and observable predicates determine the example set X; a bias determines the hypothesis space H; the learning procedure L maps the training set D ⊆ X to an induced hypothesis h; h is then evaluated on a test set (yes: done; no: learn again)]

Page 31:

Decision Tree Method

Page 32:

Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Decision tree: A? False → False; A? True → B?; B? False → True; B? True → C?; C? True → True; C? False → False]

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED

Page 33:

Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Decision tree: A? False → False; A? True → B?; B? False → True; B? True → C?; C? True → True; C? False → False]

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
• D = FUNNEL-CAP
• E = BULKY

Page 34:

Training Set

Ex. # A B C D E CONCEPT

1 False False True False True False

2 False True False False False False

3 False True True True True False

4 False False True False False False

5 False False False True True False

6 True False True False False True

7 True False False True False True

8 True False True False True True

9 True True True False True True

10 True True True True True True

11 True True False False False False

12 True True False False True False

13 True False True True True True

Page 35:

(The 13-example training set, repeated)

Possible Decision Tree

[Decision tree with root D: if D is True, test E and then A; if D is False, test C, then B, then E and A. The tree is consistent with all 13 examples]

Page 36:

Possible Decision Tree

[The large tree of the previous slide, rooted at D]
CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))

[The much smaller tree: A?, then B?, then C?]
CONCEPT ⇔ A ∧ (¬B ∨ C)

Page 37:

Possible Decision Tree

[The large tree: CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))]

[The small tree: CONCEPT ⇔ A ∧ (¬B ∨ C)]

KIS bias ⇒ Build the smallest decision tree

Computationally intractable problem ⇒ greedy algorithm

Page 38:

Picking Best Attribute

• Several different methods:
– Reducing classification error (not covered in the text)
– Using information gain

Page 39:

Getting Started: Top-Down Induction of Decision Tree

Ex. # A B C D E CONCEPT

1 False False True False True False

2 False True False False False False

3 False True True True True False

4 False False True False False False

5 False False False True True False

6 True False True False False True

7 True False False True False True

8 True False True False True True

9 True True True False True True

10 True True True True True True

11 True True False False False False

12 True True False False True False

13 True False True True True True

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

Page 40:

Getting Started: Top-Down Induction of Decision Tree

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13

Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? ⇒ Greedy algorithm

Page 41:

Assume It’s A

A
T branch → True: 6, 7, 8, 9, 10, 13; False: 11, 12
F branch → True: (none); False: 1, 2, 3, 4, 5

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise

The number of misclassified examples from the training set is 2

Page 42:

Assume It’s B

B
T branch → True: 9, 10; False: 2, 3, 11, 12
F branch → True: 6, 7, 8, 13; False: 1, 4, 5

If we test only B, we will report that CONCEPT is False if B is True and True otherwise

The number of misclassified examples from the training set is 5

Page 43:

Assume It’s C

C
T branch → True: 6, 8, 9, 10, 13; False: 1, 3, 4
F branch → True: 7; False: 2, 5, 11, 12

If we test only C, we will report that CONCEPT is True if C is True and False otherwise

The number of misclassified examples from the training set is 4

Page 44:

Assume It’s D

D
T branch → True: 7, 10, 13; False: 3, 5
F branch → True: 6, 8, 9; False: 1, 2, 4, 11, 12

If we test only D, we will report that CONCEPT is True if D is True and False otherwise

The number of misclassified examples from the training set is 5

Page 45:

Assume It’s E

E
T branch → True: 8, 9, 10, 13; False: 1, 3, 5, 12
F branch → True: 6, 7; False: 2, 4, 11

If we test only E, we will report that CONCEPT is False, independent of the outcome

The number of misclassified examples from the training set is 6

Page 46:

Assume It’s E

E
T branch → True: 8, 9, 10, 13; False: 1, 3, 5, 12
F branch → True: 6, 7; False: 2, 4, 11

If we test only E, we will report that CONCEPT is False, independent of the outcome

The number of misclassified examples from the training set is 6

So, the best predicate to test is A
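The five per-predicate error counts above can be reproduced directly from the 13-example table (the encoding below is mine):

```python
# Greedy first step: for each predicate, split the 13 examples on its value
# and count the misclassifications the majority rule leaves in each branch.

EXAMPLES = [  # (A, B, C, D, E, CONCEPT), rows 1-13 of the training-set slide
    (0,0,1,0,1,0), (0,1,0,0,0,0), (0,1,1,1,1,0), (0,0,1,0,0,0), (0,0,0,1,1,0),
    (1,0,1,0,0,1), (1,0,0,1,0,1), (1,0,1,0,1,1), (1,1,1,0,1,1), (1,1,1,1,1,1),
    (1,1,0,0,0,0), (1,1,0,0,1,0), (1,0,1,1,1,1),
]

def misclassified(attr):  # attr: 0..4 for predicates A..E
    errors = 0
    for value in (0, 1):
        branch = [ex[5] for ex in EXAMPLES if ex[attr] == value]
        if branch:
            errors += min(branch.count(0), branch.count(1))  # majority rule
    return errors

for name, idx in zip("ABCDE", range(5)):
    print(name, misclassified(idx))   # A 2, B 5, C 4, D 5, E 6
```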

Page 47:

Choice of Second Predicate

A: T → test C; F → False

C (within A = True):
T branch → True: 6, 8, 9, 10, 13; False: (none)
F branch → True: 7; False: 11, 12

The number of misclassified examples from the training set is 1

Page 48:

Choice of Third Predicate

C (within A = True): T → True; F → test B

B (within A = True, C = False):
T branch → True: (none); False: 11, 12 ⇒ False
F branch → True: 7; False: (none) ⇒ True

No misclassified examples remain

Page 49:

Final Tree

[Learned tree: A? False → False; A? True → C?; C? True → True; C? False → B?; B? True → False; B? False → True]

CONCEPT ⇔ A ∧ (C ∨ ¬B), i.e., CONCEPT ⇔ A ∧ (¬B ∨ C)

[This is the same predicate as the target tree shown earlier: A?, then B?, then C?]

Page 50:

Top-Down Induction of a DT

DTL(Δ, Predicates)
1. If all examples in Δ are positive then return True
2. If all examples in Δ are negative then return False
3. If Predicates is empty then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
– root is A,
– left branch is DTL(Δ+A, Predicates−A),
– right branch is DTL(Δ−A, Predicates−A)

(Δ+A is the subset of examples in Δ that satisfy A)

[The learned tree: A, then C, then B, as built on the preceding slides]
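A runnable sketch of DTL (the slide gives only pseudocode; the encoding, helper names, and tie-breaking below are mine), applied to the 13-example training set:

```python
# Minimal DTL with the error-minimizing greedy choice of step 4.
# An example is encoded as (assignment dict, CONCEPT label).

def dtl(examples, predicates):
    labels = [lab for _, lab in examples]
    if all(labels):
        return True                      # step 1: all positive
    if not any(labels):
        return False                     # step 2: all negative
    if not predicates:
        raise RuntimeError("failure")    # step 3 (or majority rule, for noise)

    def errors(p):                       # misclassifications left by the
        e = 0                            # majority rule after splitting on p
        for v in (False, True):
            branch = [lab for x, lab in examples if x[p] == v]
            if branch:
                e += min(branch.count(True), branch.count(False))
        return e

    a = min(predicates, key=errors)      # step 4: error-minimizing predicate
    rest = [p for p in predicates if p != a]
    pos = [(x, lab) for x, lab in examples if x[a]]
    neg = [(x, lab) for x, lab in examples if not x[a]]
    return (a, dtl(pos, rest), dtl(neg, rest))   # step 5: (root, left, right)

def classify(tree, x):
    if isinstance(tree, bool):
        return tree
    a, if_true, if_false = tree
    return classify(if_true if x[a] else if_false, x)

rows = [  # (A, B, C, D, E, CONCEPT), rows 1-13 of the training-set slide
    (0,0,1,0,1,0), (0,1,0,0,0,0), (0,1,1,1,1,0), (0,0,1,0,0,0), (0,0,0,1,1,0),
    (1,0,1,0,0,1), (1,0,0,1,0,1), (1,0,1,0,1,1), (1,1,1,0,1,1), (1,1,1,1,1,1),
    (1,1,0,0,0,0), (1,1,0,0,1,0), (1,0,1,1,1,1),
]
examples = [({k: bool(v) for k, v in zip("ABCDE", r[:5])}, bool(r[5]))
            for r in rows]

tree = dtl(examples, list("ABCDE"))
assert all(classify(tree, x) == lab for x, lab in examples)
print(tree)   # roots the tree at A, as derived on the preceding slides
```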

Page 51:

Top-Down Induction of a DT

DTL(Δ, Predicates)
1. If all examples in Δ are positive then return True
2. If all examples in Δ are negative then return False
3. If Predicates is empty then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
– root is A,
– left branch is DTL(Δ+A, Predicates−A),
– right branch is DTL(Δ−A, Predicates−A)

Noise in the training set! Step 3 may return the majority rule instead of failure

Page 52:

Comments

Widely used algorithm
Greedy
Robust to noise (incorrect examples)
Not incremental

Page 53:

Using Information Theory

Rather than minimizing the probability of error, many existing learning procedures minimize the expected number of questions needed to decide if an object x satisfies CONCEPT

This minimization is based on a measure of the “quantity of information” contained in the truth value of an observable predicate

See R&N p. 659-660

Page 54:

Learning decision trees

Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0–10, 10–30, 30–60, >60)

Page 55:

Attribute-based representations

• Examples described by attribute values (Boolean, discrete, continuous)

• E.g., situations where I will/won't wait for a table:

• Classification of examples is positive (T) or negative (F)

Page 56:

Decision tree learning

• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree

Page 57:

Choosing an attribute

• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

• Patrons? is a better choice

Page 58:

Using information theory

• To implement Choose-Attribute in the DTL algorithm

• Information Content (Entropy):

I(P(v1), …, P(vn)) = Σi −P(vi) log2 P(vi)

• For a training set containing p positive examples and n negative examples:

I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2 (p/(p+n)) − (n/(p+n)) log2 (n/(p+n))
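A direct transcription of the two-class entropy:

```python
# Two-class entropy I(p/(p+n), n/(p+n)) for p positive, n negative examples.
from math import log2

def entropy(p, n):
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                      # 0 * log2(0) is taken as 0
            q = count / total
            result -= q * log2(q)
    return result

print(entropy(6, 6))   # 1.0 bit for a 50/50 training set
print(entropy(4, 0))   # 0.0: a pure set carries no information
```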

Page 59:

Information gain

• A chosen attribute A divides the training set E into subsets E1, … , Ev according to their values for A, where A has v distinct values.

• Information Gain (IG) or reduction in entropy from the attribute test:

• Choose the attribute with the largest IG

remainder(A) = Σi=1..v ((pi+ni)/(p+n)) · I(pi/(pi+ni), ni/(pi+ni))

IG(A) = I(p/(p+n), n/(p+n)) − remainder(A)

Page 60:

Information gain

For the training set, p = n = 6, I(6/12, 6/12) = 1 bit

Consider the attributes Patrons and Type (and others too):

IG(Patrons) = 1 − [ (2/12) I(0,1) + (4/12) I(1,0) + (6/12) I(2/6,4/6) ] = .541 bits
IG(Type) = 1 − [ (2/12) I(1/2,1/2) + (2/12) I(1/2,1/2) + (4/12) I(2/4,2/4) + (4/12) I(2/4,2/4) ] = 0 bits

Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
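These numbers can be reproduced from the per-value positive/negative counts of the 12 restaurant examples:

```python
# Information gain from per-branch (positive, negative) counts.
from math import log2

def entropy(p, n):
    result = 0.0
    for c in (p, n):
        if c:
            q = c / (p + n)
            result -= q * log2(q)
    return result

def info_gain(branches, p, n):
    # branches: list of (p_i, n_i) pairs, one per value of the attribute
    remainder = sum((pi + ni) / (p + n) * entropy(pi, ni)
                    for pi, ni in branches)
    return entropy(p, n) - remainder

patrons = [(0, 2), (4, 0), (2, 4)]         # None, Some, Full
rtype = [(1, 1), (1, 1), (2, 2), (2, 2)]   # French, Italian, Thai, Burger

print(round(info_gain(patrons, 6, 6), 3))  # 0.541 bits
print(round(info_gain(rtype, 6, 6), 3))    # 0.0 bits
```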

Page 61:

Example contd.

• Decision tree learned from the 12 examples:

• Substantially simpler than the “true” tree: a more complex hypothesis isn’t justified by the small amount of data

Page 62:

Evaluation methodology

• Standard methodology:
1. Collect a large set of examples (all with correct classifications)
2. Randomly divide the collection into two disjoint sets: training and test
3. Apply the learning algorithm to the training set, giving hypothesis H
4. Measure the performance of H w.r.t. the test set
• Important: keep the training and test sets disjoint!
• To study the efficiency and robustness of an algorithm, repeat steps 2–4 for different training sets and sizes of training sets
• If you improve your algorithm, start again with step 1 to avoid evolving the algorithm to work well on just this collection
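Step 2 of the methodology in miniature (illustrative stand-in data):

```python
# Random disjoint train/test split, the holdout methodology's step 2.
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    shuffled = examples[:]                 # copy so the caller's list survives
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]  # disjoint training and test sets

data = list(range(12))                     # stand-in for 12 labeled examples
train, test = train_test_split(data)
assert not set(train) & set(test)          # keep the two sets disjoint!
print(len(train), len(test))               # 9 3
```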

Page 63:

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

[Typical learning curve: % correct on the test set (rising toward 100) vs. size of the training set]

Page 64:

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

Overfitting ⇒ risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set

[Typical learning curve: % correct on the test set vs. size of the training set]

Page 65:

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

Overfitting ⇒ risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
• Tree pruning ⇒ terminate the recursion when # errors / information gain is small

Page 66:

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

Overfitting ⇒ risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
• Tree pruning ⇒ terminate the recursion when # errors / information gain is small

The resulting decision tree + majority rule may not classify correctly all examples in the training set

Page 67:

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

Overfitting
• Tree pruning

Incorrect examples
Missing data
Multi-valued and continuous attributes

Page 68:

Extensions of the decision tree learning algorithm

• Using gain ratios (not covered in the text)
• Real-valued data
• Noisy data and overfitting
• Generation of rules
• Setting parameters
• Cross-validation for experimental validation of performance
• C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on

Page 69:

Using gain ratios
• The information gain criterion favors attributes that have a large number of values
  – If an attribute D has a distinct value for each record, then the remainder Info(D,T) is 0, so Gain(D,T) is maximal
• To compensate for this, Quinlan suggests using the following ratio instead of Gain:
  GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
• SplitInfo(D,T) is the information due to the split of T on the basis of the value of the categorical attribute D:
  SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, ..., |Tm|/|T|)
  where {T1, T2, ..., Tm} is the partition of T induced by the value of D

Page 70:

Computing gain ratio

[Figure: the 12-example restaurant training set, grouped by Type (French, Italian, Thai, Burger) and by Pat (Empty, Some, Full); 6 examples are positive (Y) and 6 are negative (N).]

• I(T) = 1
• I(Pat, T) = .47
• I(Type, T) = 1
• Gain(Pat, T) = .53
• Gain(Type, T) = 0

SplitInfo(Pat, T) = -(1/6 log 1/6 + 1/3 log 1/3 + 1/2 log 1/2) = 1/6 * 2.6 + 1/3 * 1.6 + 1/2 * 1 = 1.47

SplitInfo(Type, T) = -(1/6 log 1/6 + 1/6 log 1/6 + 1/3 log 1/3 + 1/3 log 1/3) = 1/6 * 2.6 + 1/6 * 2.6 + 1/3 * 1.6 + 1/3 * 1.6 = 1.93

GainRatio(Pat, T) = Gain(Pat, T) / SplitInfo(Pat, T) = .53 / 1.47 = .36

GainRatio(Type, T) = Gain(Type, T) / SplitInfo(Type, T) = 0 / 1.93 = 0
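These numbers can be checked mechanically. The sketch below assumes the 12 examples are the R&N restaurant data set (reconstructed here, so treat the exact assignment as illustrative); it computes Gain, SplitInfo, and GainRatio from their definitions:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon information I(...) in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition(examples, attr):
    """Group example labels by the value of the given attribute."""
    parts = defaultdict(list)
    for x, y in examples:
        parts[x[attr]].append(y)
    return parts

def gain(examples, attr):
    labels = [y for _, y in examples]
    n = len(examples)
    remainder = sum(len(p) / n * entropy(p)
                    for p in partition(examples, attr).values())
    return entropy(labels) - remainder

def split_info(examples, attr):
    n = len(examples)
    return -sum((len(p) / n) * math.log2(len(p) / n)
                for p in partition(examples, attr).values())

def gain_ratio(examples, attr):
    return gain(examples, attr) / split_info(examples, attr)

# Reconstructed restaurant data: ({attribute: value}, class), 6 Y and 6 N.
examples = [
    ({"Pat": "Some", "Type": "French"}, "Y"),
    ({"Pat": "Full", "Type": "Thai"}, "N"),
    ({"Pat": "Some", "Type": "Burger"}, "Y"),
    ({"Pat": "Full", "Type": "Thai"}, "Y"),
    ({"Pat": "Full", "Type": "French"}, "N"),
    ({"Pat": "Some", "Type": "Italian"}, "Y"),
    ({"Pat": "Empty", "Type": "Burger"}, "N"),
    ({"Pat": "Some", "Type": "Thai"}, "Y"),
    ({"Pat": "Full", "Type": "Burger"}, "N"),
    ({"Pat": "Full", "Type": "Italian"}, "N"),
    ({"Pat": "Empty", "Type": "Thai"}, "N"),
    ({"Pat": "Full", "Type": "Burger"}, "Y"),
]

print(round(gain(examples, "Pat"), 2),        # ≈ 0.54
      round(split_info(examples, "Pat"), 2),  # ≈ 1.46
      round(gain_ratio(examples, "Pat"), 2))  # ≈ 0.37
```

This gives Gain(Pat, T) ≈ 0.54, SplitInfo(Pat, T) ≈ 1.46, GainRatio(Pat, T) ≈ 0.37; the slide's .53 / 1.47 / .36 reflect the coarser rounding of the log values (2.6, 1.6) used there.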

Page 71:

Real-valued data
• Select a set of thresholds defining intervals; each interval becomes a discrete value of the attribute
• Use simple heuristics, e.g., always divide into quartiles
• Use domain knowledge, e.g., divide age into infant (0-2), toddler (3-5), school-aged (5-8)
• Or treat discretization as another learning problem:
  – Try a range of ways to discretize the continuous variable and see which yields “better results” w.r.t. some metric
  – E.g., try the midpoint between every pair of adjacent values
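The last heuristic, trying every midpoint, can be sketched as follows (names are illustrative; the threshold chosen is the one that maximizes information gain):

```python
import math

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try the midpoint between every adjacent pair of sorted values and
    return (threshold, gain) for the split with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no class boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best[1]:
            best = (t, base - remainder)
    return best
```

For example, ages `[1, 2, 4, 6, 7, 9]` with labels `N, N, N, Y, Y, Y` yield the threshold 5.0 (midpoint of 4 and 6), which separates the classes perfectly.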

Page 72:

Noisy data and overfitting (cont.)

• The last problem, irrelevant attributes, can result in overfitting the training data
  – If the hypothesis space has many dimensions because of a large number of attributes, we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features
  – Fix by pruning lower nodes in the decision tree
  – For example, if the Gain of the best attribute at a node is below a threshold, stop and make this node a leaf rather than generating child nodes
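This pre-pruning rule drops straight into a recursive tree builder. A minimal sketch, assuming discrete attributes stored in dicts and a hypothetical `min_gain` threshold:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    labels = [y for _, y in examples]
    parts = {}
    for x, y in examples:
        parts.setdefault(x[attr], []).append(y)
    remainder = sum(len(p) / len(examples) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def build(examples, attrs, min_gain=0.1):
    """Recursively build a tree; pre-prune when the best Gain is too small."""
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs or len(set(labels)) == 1:
        return majority
    best = max(attrs, key=lambda a: gain(examples, a))
    if gain(examples, best) < min_gain:
        return majority  # stop here: leaf labeled by the majority rule
    tree = {}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree[v] = build(subset, [a for a in attrs if a != best], min_gain)
    return (best, tree)
```

With only an uninformative attribute available, the builder prunes immediately and returns a majority leaf instead of a meaningless split.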

Page 73:

Cross-Validation to Reduce Overfitting

• Estimate how well each hypothesis will predict unseen data
• Set aside some fraction of the known data, and use it to test the prediction performance of a hypothesis induced from the remaining data
• K-fold cross-validation: run k experiments, each time setting aside a different 1/k of the data to test on, and average the results
• Use it to decide whether a pruning method is appropriate
• Still need to test on truly unseen data afterwards
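The k experiments can be sketched as a small harness (function names are illustrative; any learner and scoring function can be plugged in):

```python
def k_fold_splits(data, k):
    """Yield k (train, test) pairs; fold i holds out every k-th example starting at i."""
    for i in range(k):
        test = data[i::k]
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test

def cross_validate(data, k, train_fn, score_fn):
    """Average score over k folds: train on k-1 folds, score on the held-out fold."""
    scores = [score_fn(train_fn(train), test)
              for train, test in k_fold_splits(data, k)]
    return sum(scores) / k
```

In practice the data should be shuffled first so the folds are not biased by any ordering in the data set.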

Page 74:

Applications of Decision Trees

• Medical diagnosis / drug design
• Evaluation of geological systems for assessing gas and oil basins
• Early detection of problems (e.g., jamming) during oil-drilling operations
• Automatic generation of rules in expert systems

Page 75:

How well does it work?

• Many case studies have shown that decision trees are at least as accurate as human experts
  – A study on diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly
  – British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms that replaced an earlier rule-based expert system
  – Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example

Page 76:

Summary: Decision tree learning

• Inducing decision trees is one of the most widely used learning methods in practice
• Can outperform human experts on many problems
• Strengths:
  – Fast
  – Simple to implement
  – Results convert to a set of easily interpretable rules
  – Empirically validated in many commercial products
  – Handles noisy data
• Weaknesses:
  – Univariate splits (partitioning on only one attribute at a time) limit the types of trees that can be expressed
  – Large decision trees may be hard to understand
  – Requires fixed-length feature vectors
  – Non-incremental (i.e., a batch method)