TRANSCRIPT
CS 380: ARTIFICIAL INTELLIGENCE DECISION TREE LEARNING
11/20/2013 Santiago Ontañón [email protected] https://www.cs.drexel.edu/~santi/teaching/2013/CS380/intro.html
Machine Learning Summary
• Several types of learning:
  • Learning from examples:
    • Supervised learning
    • Unsupervised learning
  • Reinforcement learning
  • Learning from demonstration (imitation)
  • Etc.
• Today: learning decision trees
Inductive Learning From Examples
• f: unknown function that we want to learn
• f: X → Y, where X is the input space and Y is the target space
  • For example, if we want to use machine learning to learn the evaluation function for "Othello":
    • X: space of Othello boards
    • Y: real numbers
  • If we want to use machine learning to learn how to read hand-written characters:
    • X: 16x16 pixel images
    • Y: characters
• Training set:
  • Set of examples from which to learn: e1 = (x1, f(x1)), e2 = (x2, f(x2)), …
  • For example, for the Othello evaluation function: e1 = (board1, +15), e2 = (board2, -5), …
• Learning algorithm:
  • Method that, given the training set, generates a hypothesis h that fits the data
  • Different learning algorithms explore different hypothesis spaces:
    • Hypothesis space: set of all possible hypotheses that can be formulated
    • The learning algorithm explores this search space looking for the simplest hypothesis that fits the data
Induction of Decision Trees
• One of the earliest forms of machine learning
• Algorithm: ID3
  • Hypothesis space: decision trees
  • Example representation: feature vectors
  • Explores the space of decision trees, trying to find one that fits the data
Decision Tree Example
• Target function: "is it a good day to play tennis?"

[Decision tree:]
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Training Set
f([sunny, hot, high, weak]) = no
f([sunny, hot, high, strong]) = no
etc.
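For concreteness, a minimal sketch of how this training set might be represented in Python. The dict-based representation is an assumption for illustration; attribute names follow the slides, and only the first few of the 14 PlayTennis examples are written out:

```python
# Each training example pairs a feature vector x (here, a dict of
# attribute -> value) with its target label f(x).
training_set = [
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "wind": "weak"}, "no"),
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "wind": "strong"}, "no"),
    ({"outlook": "overcast", "temperature": "hot", "humidity": "high", "wind": "weak"}, "yes"),
    ({"outlook": "rain", "temperature": "mild", "humidity": "high", "wind": "weak"}, "yes"),
    # ... the remaining examples of the 14-example PlayTennis set
]
```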
Learning Decision Trees
• Generating a hypothesis from examples:

[The slide shows the same PlayTennis decision tree as above: Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]
ID3 Algorithm

ID3(examples, attributes_left):
  Tree = new Node()
  If all examples have the same target value vt:
    Tree.target = vt
    Return Tree
  If attributes_left is the empty list:
    Tree.target = most common target value in examples
    Return Tree
  Tree.attribute = A = best attribute in attributes_left
  For each possible value v of A:
    If "examples where A = v" is empty:
      SubTree = leaf node with most common target value in examples
    Else:
      SubTree = ID3(examples where A = v, attributes_left − A)
    Add SubTree to Tree in a branch labeled "A = v"
  Return Tree
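A minimal runnable sketch of this pseudocode in Python (the dict-based tree and the pluggable `best_attribute` function are assumptions for illustration; the information-gain heuristic used on the following slides is sketched later in these notes):

```python
from collections import Counter

def id3(examples, attributes_left, best_attribute):
    """examples: list of (x, label) pairs, where x is a dict mapping
    attribute name -> value. best_attribute(examples, attributes_left)
    returns the attribute to split on (e.g. by information gain)."""
    labels = [label for _, label in examples]
    # Base case 1: all examples share the same target value -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left -> leaf with the majority label.
    if not attributes_left:
        return Counter(labels).most_common(1)[0][0]
    # Recursive case: split on the best attribute.
    a = best_attribute(examples, attributes_left)
    tree = {"attribute": a, "branches": {}}
    # We branch only on values observed in `examples`, so the pseudocode's
    # "examples where A = v is empty" case cannot arise here; it matters
    # when iterating over all a-priori possible values of A instead.
    for v in {x[a] for x, _ in examples}:
        subset = [(x, lab) for x, lab in examples if x[a] == v]
        rest = [b for b in attributes_left if b != a]
        tree["branches"][v] = id3(subset, rest, best_attribute)
    return tree
```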
Tracing ID3 on the training set (the slides repeat the algorithm on each step; only the trace state changes):

1. Initial call: Examples = [all 14 training examples], attributes_left = [day, outlook, temperature, humidity, wind].
2. The best attribute is Outlook; it becomes the root of the tree, with branches Sunny, Overcast, and Rainy.
3. Recursive call for Outlook = Sunny: Examples = [the Sunny examples], attributes_left = [day, temperature, humidity, wind].
4. The best attribute there is Humidity, with branches High and Normal.
5. Recursive call for Humidity = High: Examples = [the Sunny, high-humidity examples], attributes_left = [day, temperature, wind]. All of these examples have target value "no", so ID3 returns the leaf No.
6. Recursive call for Humidity = Normal: all of these examples have target value "yes", so ID3 returns the leaf Yes.
7. The Outlook = Sunny subtree is complete; ID3 proceeds in the same way down the Overcast and Rainy branches.
ID3 Output

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
This tree can now be used to predict the target value for examples that were not in the original training set: generalization
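Prediction is just a walk from the root to a leaf, following the branches that match the example's attribute values. A sketch for the dict-based trees built by the `id3` sketch above:

```python
def predict(tree, x):
    """Classify example x (a dict of attribute -> value) by walking the
    dict-based tree produced by the id3 sketch above."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attribute"]]]
    return tree  # a leaf, i.e. a target label

# Example (assuming the learned PlayTennis tree above):
# predict(tree, {"outlook": "sunny", "temperature": "cool",
#                "humidity": "high", "wind": "strong"})  # -> "no"
```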
Which Attribute is Best?
• In the original training set, we have:
  • 9 examples with "YES"
  • 5 examples with "NO"
• If we start with "outlook" (9 yes / 5 no at the root):
  • sunny: 2 yes / 3 no
  • overcast: 4 yes / 0 no
  • rainy: 3 yes / 2 no
• If we start with "wind" (9 yes / 5 no at the root):
  • strong: 3 yes / 3 no
  • weak: 6 yes / 2 no
• We want examples to be classified as well as possible. Ideally, all "yes" examples end up in one branch, and all "no" examples in another branch.
Entropy
• Given a set of symbols S, drawn from an alphabet B:
• Entropy: the expected number of bits needed to encode the next symbol (the amount of information that knowing one more symbol provides us)
• If the symbols in S are drawn uniformly at random from B, entropy is maximal
• If S always contains the same symbol from B, entropy is minimal (we know which symbol will come next, so no new information)
Entropy
• Example for a binary variable:

[Figure: Entropy(S) as a function of p+ (the proportion of positive examples), rising from 0.0 at p+ = 0, peaking at 1.0 when p+ = 0.5, and falling back to 0.0 at p+ = 1.]

H(X) = -\sum_i p(x_i) \log_2 p(x_i)
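In code, the entropy of a list of labels (a minimal sketch using base-2 logarithms, matching the "bits" interpretation above):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of symbols."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940: the root of the tennis set
print(entropy(["yes", "no"]))             # 1.0: maximal, uniform symbols
print(entropy(["yes", "yes"]))            # -0.0: minimal, always the same symbol
```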
Entropy for Attribute Selection
• The entropy in a node of the tree determines how well "grouped" the examples in that node are:

Outlook (root: 9 yes / 5 no, H = 0.94):
  sunny: 2 yes / 3 no, H = 0.98
  overcast: 4 yes / 0 no, H = 0.00
  rainy: 3 yes / 2 no, H = 0.98

Wind (root: 9 yes / 5 no, H = 0.94):
  strong: 3 yes / 3 no, H = 1.00
  weak: 6 yes / 2 no, H = 0.81
Entropy for Attribute Selection
• Information gain: the reduction in entropy due to selecting attribute A
• Idea: the best attribute is the one that maximizes information gain

Gain(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)
Information Gain
Using the entropies from the previous slide (Outlook: sunny 2 yes / 3 no, overcast 4 yes / 0 no, rainy 3 yes / 2 no; Wind: strong 3 yes / 3 no, weak 6 yes / 2 no):

Gain(outlook) = 0.94 - 5/14 * 0.98 - 4/14 * 0 - 5/14 * 0.98 = 0.24
Gain(wind) = 0.94 - 6/14 * 1.00 - 8/14 * 0.81 = 0.05
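A quick sketch checking these numbers with the `entropy` function above (the split counts are taken from the slide; branch subsets are passed as label lists):

```python
def information_gain(parent_labels, subsets):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v), using the entropy
    function sketched above; `subsets` are the label lists per branch."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(sv) / n * entropy(sv) for sv in subsets)

S = ["yes"] * 9 + ["no"] * 5
outlook = [["yes"] * 2 + ["no"] * 3,   # sunny:    2 yes / 3 no
           ["yes"] * 4,                # overcast: 4 yes / 0 no
           ["yes"] * 3 + ["no"] * 2]   # rainy:    3 yes / 2 no
wind = [["yes"] * 3 + ["no"] * 3,      # strong:   3 yes / 3 no
        ["yes"] * 6 + ["no"] * 2]      # weak:     6 yes / 2 no
print(information_gain(S, outlook))    # ~0.247
print(information_gain(S, wind))       # ~0.048
```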
ID3 Does Best-First Search

[Figure: the space of partial decision trees, starting from a tree with a single node and growing it by expanding one attribute (A1, A2, A3, A4, …) at a time.]
ID3 searches the hypothesis space, starting from a tree with a single node, and using information gain as the heuristic function. We can conceive of better search strategies (e.g. A*), but the computational cost might be too large (although they might learn much better!).
ID3
• Better heuristics than information gain exist:
  • E.g. gain ratio, the Gini index, the RLDM distance, etc.
• Many alternative search strategies exist (e.g. decision forests)
• Over-fitting:
  • What happens if we have an example in the training set that is noise (e.g. one that is mislabeled)?
  • ID3 will try to force it into the decision tree!
  • Strategies exist to avoid over-fitting (e.g. preventing leaves that have only a very small number of examples, as sketched below)
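One such strategy, as a sketch: pre-prune by refusing to split a node that has fewer than `min_examples` examples (a hypothetical threshold parameter, not from the slides), so that a single mislabeled example cannot force its own branch:

```python
from collections import Counter

def id3_pruned(examples, attributes_left, best_attribute, min_examples=3):
    """Like the id3 sketch above, but with a minimum-node-size guard.
    min_examples is a hypothetical pruning threshold."""
    labels = [label for _, label in examples]
    # Pre-pruning: too few examples to split reliably, no attributes
    # left, or already pure -> return a majority-label leaf.
    if (len(examples) < min_examples or not attributes_left
            or len(set(labels)) == 1):
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, attributes_left)
    tree = {"attribute": a, "branches": {}}
    for v in {x[a] for x, _ in examples}:
        subset = [(x, lab) for x, lab in examples if x[a] == v]
        rest = [b for b in attributes_left if b != a]
        tree["branches"][v] = id3_pruned(subset, rest, best_attribute,
                                         min_examples)
    return tree
```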
ID3
• Converting a tree to rules:
  • Each branch of the tree is a rule
  • A decision tree is just a compact way to represent a set of rules

[The same PlayTennis tree as above: Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]

Outlook = rain and Wind = strong ⇒ playtennis = no
Thus, ID3 can be used to extract knowledge (e.g. rules) from large databases of examples
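A sketch of this conversion for the dict-based trees above: each root-to-leaf path becomes one rule.

```python
def tree_to_rules(tree, conditions=()):
    """Turn each root-to-leaf path of a dict-based tree (as built by the
    id3 sketch above) into one rule."""
    if not isinstance(tree, dict):   # leaf: the path so far is one rule
        lhs = " and ".join(f"{a} = {v}" for a, v in conditions)
        return [f"{lhs} => playtennis = {tree}"]
    rules = []
    for v, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attribute"], v),))
    return rules

# e.g. one extracted rule: "outlook = rain and wind = strong => playtennis = no"
```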
Other Supervised ML Methods
• Lazy methods:
  • Instance-based learning
  • Case-based reasoning
• Bayesian learning:
  • Naïve Bayes
  • Bayesian networks
• Regression methods (when the target function is numerical)
• Neural networks
• Boosting
• Bagging
• Support vector machines
• Etc.