TRANSCRIPT
CS 380: ARTIFICIAL INTELLIGENCE DECISION TREE LEARNING
11/20/2013 Santiago Ontañón [email protected] https://www.cs.drexel.edu/~santi/teaching/2013/CS380/intro.html
Machine Learning Summary
• Several types of learning:
  • Learning from examples:
    • Supervised learning
    • Unsupervised learning
  • Reinforcement learning
  • Learning from demonstration (imitation)
  • Etc.
• Today: learning decision trees
Inductive Learning From Examples
• f: unknown function that we want to learn
• f: X → Y, where X is the input space and Y is the target space
  • For example, if we want to use machine learning to learn the evaluation function for "Othello":
    • X: space of Othello boards
    • Y: real numbers
  • If we want to use machine learning to learn how to read hand-written characters:
    • X: 16x16 pixel images
    • Y: characters
• Training set:
  • Set of examples from which to learn: e1 = (x1, f(x1)), e2 = (x2, f(x2)), …
  • For example, for the Othello evaluation function: e1 = (board1, +15), e2 = (board2, -5), …
• Learning algorithm:
  • Method that, given the training set, generates a hypothesis h that fits the data
  • Different learning algorithms explore different hypothesis spaces:
    • Hypothesis space: set of all possible hypotheses that can be formulated
    • The learning algorithm explores this search space looking for the simplest hypothesis that fits the data
Induction of Decision Trees
• One of the earliest forms of machine learning
• Algorithm: ID3
  • Hypothesis space: decision trees
  • Example representation: feature vectors
  • Explores the space of decision trees, trying to find one that fits the data
Decision Tree Example
• Target function: "is it a good day to play tennis?"

[Decision tree:]
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Training Set
f([sunny, hot, high, weak]) = no
f([sunny, hot, high, strong]) = no
etc.
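For concreteness, a minimal sketch of how this training set might be represented in Python. The dict-based representation is an assumption for illustration; attribute names follow the slides, and only the first few of the 14 PlayTennis examples are written out:

```python
# Each training example pairs a feature vector x (here, a dict of
# attribute -> value) with its target label f(x).
training_set = [
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "wind": "weak"}, "no"),
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "wind": "strong"}, "no"),
    ({"outlook": "overcast", "temperature": "hot", "humidity": "high", "wind": "weak"}, "yes"),
    ({"outlook": "rain", "temperature": "mild", "humidity": "high", "wind": "weak"}, "yes"),
    # ... the remaining examples of the 14-example PlayTennis set
]
```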
Learning Decision Trees
• Generating a hypothesis from examples:

[The slide shows the same PlayTennis decision tree as above: Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]
ID3 Algorithm

ID3(examples, attributes_left):
  Tree = new Node()
  If all examples have the same target value vt:
    Tree.target = vt
    Return Tree
  If attributes_left is the empty list:
    Tree.target = most common target value in examples
    Return Tree
  Tree.attribute = A = best attribute in attributes_left
  For each possible value v of A:
    If "examples where A = v" is empty:
      SubTree = leaf node with most common target value in examples
    Else:
      SubTree = ID3(examples where A = v, attributes_left − A)
    Add SubTree to Tree in a branch labeled "A = v"
  Return Tree
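A minimal runnable sketch of this pseudocode in Python (the dict-based tree and the pluggable `best_attribute` function are assumptions for illustration; the information-gain heuristic used on the following slides is sketched later in these notes):

```python
from collections import Counter

def id3(examples, attributes_left, best_attribute):
    """examples: list of (x, label) pairs, where x is a dict mapping
    attribute name -> value. best_attribute(examples, attributes_left)
    returns the attribute to split on (e.g. by information gain)."""
    labels = [label for _, label in examples]
    # Base case 1: all examples share the same target value -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left -> leaf with the majority label.
    if not attributes_left:
        return Counter(labels).most_common(1)[0][0]
    # Recursive case: split on the best attribute.
    a = best_attribute(examples, attributes_left)
    tree = {"attribute": a, "branches": {}}
    # We branch only on values observed in `examples`, so the pseudocode's
    # "examples where A = v is empty" case cannot arise here; it matters
    # when iterating over all a-priori possible values of A instead.
    for v in {x[a] for x, _ in examples}:
        subset = [(x, lab) for x, lab in examples if x[a] == v]
        rest = [b for b in attributes_left if b != a]
        tree["branches"][v] = id3(subset, rest, best_attribute)
    return tree
```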
Tracing ID3 on the training set (the slides repeat the algorithm on each step; only the trace state changes):

1. Initial call: Examples = [all 14 training examples], attributes_left = [day, outlook, temperature, humidity, wind].
2. The best attribute is Outlook; it becomes the root of the tree, with branches Sunny, Overcast, and Rainy.
3. Recursive call for Outlook = Sunny: Examples = [the Sunny examples], attributes_left = [day, temperature, humidity, wind].
4. The best attribute there is Humidity, with branches High and Normal.
5. Recursive call for Humidity = High: Examples = [the Sunny, high-humidity examples], attributes_left = [day, temperature, wind]. All of these examples have target value "no", so ID3 returns the leaf No.
6. Recursive call for Humidity = Normal: all of these examples have target value "yes", so ID3 returns the leaf Yes.
7. The Outlook = Sunny subtree is complete; ID3 proceeds in the same way down the Overcast and Rainy branches.
ID3 Output

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
This tree can now be used to predict the target value for examples that were not in the original training set: generalization
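Prediction is just a walk from the root to a leaf, following the branches that match the example's attribute values. A sketch for the dict-based trees built by the `id3` sketch above:

```python
def predict(tree, x):
    """Classify example x (a dict of attribute -> value) by walking the
    dict-based tree produced by the id3 sketch above."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attribute"]]]
    return tree  # a leaf, i.e. a target label

# Example (assuming the learned PlayTennis tree above):
# predict(tree, {"outlook": "sunny", "temperature": "cool",
#                "humidity": "high", "wind": "strong"})  # -> "no"
```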
Which Attribute is Best?
• In the original training set, we have:
  • 9 examples with "YES"
  • 5 examples with "NO"
• If we start with "outlook" (9 yes / 5 no at the root):
  • sunny: 2 yes / 3 no
  • overcast: 4 yes / 0 no
  • rainy: 3 yes / 2 no
• If we start with "wind" (9 yes / 5 no at the root):
  • strong: 3 yes / 3 no
  • weak: 6 yes / 2 no
• We want examples to be classified as well as possible. Ideally, all "yes" examples end up in one branch, and all "no" examples in another branch.
Entropy
• Given a set of symbols S, drawn from an alphabet B:
• Entropy: the expected number of bits needed to encode the next symbol (the amount of information that knowing one more symbol provides us)
• If the symbols in S are drawn uniformly at random from B, entropy is maximal
• If S always contains the same symbol from B, entropy is minimal (we know which symbol will come next, so no new information)
Entropy
• Example for a binary variable:

[Figure: Entropy(S) as a function of p+ (the proportion of positive examples), rising from 0.0 at p+ = 0, peaking at 1.0 when p+ = 0.5, and falling back to 0.0 at p+ = 1.]

H(X) = -\sum_i p(x_i) \log_2 p(x_i)
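In code, the entropy of a list of labels (a minimal sketch using base-2 logarithms, matching the "bits" interpretation above):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of symbols."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940: the root of the tennis set
print(entropy(["yes", "no"]))             # 1.0: maximal, uniform symbols
print(entropy(["yes", "yes"]))            # -0.0: minimal, always the same symbol
```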
Entropy for Attribute Selection
• The entropy in a node of the tree determines how well "grouped" the examples in that node are:

Outlook (root: 9 yes / 5 no, H = 0.94):
  sunny: 2 yes / 3 no, H = 0.98
  overcast: 4 yes / 0 no, H = 0.00
  rainy: 3 yes / 2 no, H = 0.98

Wind (root: 9 yes / 5 no, H = 0.94):
  strong: 3 yes / 3 no, H = 1.00
  weak: 6 yes / 2 no, H = 0.81
Entropy for Attribute Selection
• Information gain: the reduction in entropy due to selecting attribute A
• Idea: the best attribute is the one that maximizes information gain

Gain(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)
Information Gain
Using the entropies from the previous slide (Outlook: sunny 2 yes / 3 no, overcast 4 yes / 0 no, rainy 3 yes / 2 no; Wind: strong 3 yes / 3 no, weak 6 yes / 2 no):

Gain(outlook) = 0.94 - 5/14 * 0.98 - 4/14 * 0 - 5/14 * 0.98 = 0.24
Gain(wind) = 0.94 - 6/14 * 1.00 - 8/14 * 0.81 = 0.05
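A quick sketch checking these numbers with the `entropy` function above (the split counts are taken from the slide; branch subsets are passed as label lists):

```python
def information_gain(parent_labels, subsets):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v), using the entropy
    function sketched above; `subsets` are the label lists per branch."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(sv) / n * entropy(sv) for sv in subsets)

S = ["yes"] * 9 + ["no"] * 5
outlook = [["yes"] * 2 + ["no"] * 3,   # sunny:    2 yes / 3 no
           ["yes"] * 4,                # overcast: 4 yes / 0 no
           ["yes"] * 3 + ["no"] * 2]   # rainy:    3 yes / 2 no
wind = [["yes"] * 3 + ["no"] * 3,      # strong:   3 yes / 3 no
        ["yes"] * 6 + ["no"] * 2]      # weak:     6 yes / 2 no
print(information_gain(S, outlook))    # ~0.247
print(information_gain(S, wind))       # ~0.048
```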
ID3 Does Best-First Search

[Figure: the space of partial decision trees, starting from a tree with a single node and growing it by expanding one attribute (A1, A2, A3, A4, …) at a time.]
ID3 searches the hypothesis space, starting from a tree with a single node, and using information gain as the heuristic function. We can conceive of better search strategies (e.g. A*), but the computational cost might be too large (although they might learn much better!).
ID3
• Better heuristics than information gain exist:
  • E.g. gain ratio, the Gini index, the RLDM distance, etc.
• Many alternative search strategies exist (e.g. decision forests)
• Over-fitting:
  • What happens if we have an example in the training set that is noise (e.g. one that is mislabeled)?
  • ID3 will try to force it into the decision tree!
  • Strategies exist to avoid over-fitting (e.g. preventing leaves that have only a very small number of examples, as sketched below)
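One such strategy, as a sketch: pre-prune by refusing to split a node that has fewer than `min_examples` examples (a hypothetical threshold parameter, not from the slides), so that a single mislabeled example cannot force its own branch:

```python
from collections import Counter

def id3_pruned(examples, attributes_left, best_attribute, min_examples=3):
    """Like the id3 sketch above, but with a minimum-node-size guard.
    min_examples is a hypothetical pruning threshold."""
    labels = [label for _, label in examples]
    # Pre-pruning: too few examples to split reliably, no attributes
    # left, or already pure -> return a majority-label leaf.
    if (len(examples) < min_examples or not attributes_left
            or len(set(labels)) == 1):
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, attributes_left)
    tree = {"attribute": a, "branches": {}}
    for v in {x[a] for x, _ in examples}:
        subset = [(x, lab) for x, lab in examples if x[a] == v]
        rest = [b for b in attributes_left if b != a]
        tree["branches"][v] = id3_pruned(subset, rest, best_attribute,
                                         min_examples)
    return tree
```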
ID3
• Converting a tree to rules:
  • Each branch of the tree is a rule
  • A decision tree is just a compact way to represent a set of rules

[The same PlayTennis tree as above: Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]

Outlook = rain and Wind = strong ⇒ playtennis = no
Thus, ID3 can be used to extract knowledge (e.g. rules) from large databases of examples
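A sketch of this conversion for the dict-based trees above: each root-to-leaf path becomes one rule.

```python
def tree_to_rules(tree, conditions=()):
    """Turn each root-to-leaf path of a dict-based tree (as built by the
    id3 sketch above) into one rule."""
    if not isinstance(tree, dict):   # leaf: the path so far is one rule
        lhs = " and ".join(f"{a} = {v}" for a, v in conditions)
        return [f"{lhs} => playtennis = {tree}"]
    rules = []
    for v, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attribute"], v),))
    return rules

# e.g. one extracted rule: "outlook = rain and wind = strong => playtennis = no"
```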
Other Supervised ML Methods
• Lazy methods:
  • Instance-based learning
  • Case-based reasoning
• Bayesian learning:
  • Naïve Bayes
  • Bayesian networks
• Regression methods (when the target function is numerical)
• Neural networks
• Boosting
• Bagging
• Support vector machines
• Etc.