COMP 2208: Decision Trees
Dr. Long Tran-Thanh, University of Southampton
Classification
(Agent diagram: perception of the environment feeds classification, which categorizes inputs and updates the belief model; the belief model updates the decision-making policy, which drives behaviour back into the environment.)
Recognizing the type of situation you are in right now is a basic agent task:
Classification
• Robotics: mistaking a human body for a part of a car on the assembly line would be disastrous
• Military: friend or foe?
• Credit card usage: was it fraud or not?
Last lecture: neural networks
Why more classification methods?
• Very powerful in theory
• Promising direction: deep learning
• Still difficult to fully control the technology
• In many cases, other techniques are more efficient
Occam’s razor: the simpler the model, the better the performance; go for something more complicated only if it is really necessary.
In many real-world problems, data cleaning is the most important step; after that, a simple classification method will do the job.
Classification
Bottom-up classification algorithms: inspiration from biology, e.g., neural networks
Top-down classification algorithms: inspiration from higher abstraction levels
Prof or hobo 1?
http://individual.utoronto.ca/somody/quiz.html
Prof or hobo 2?
http://individual.utoronto.ca/somody/quiz.html
Prof or hobo 3?
http://individual.utoronto.ca/somody/quiz.html
Prof or hobo answers
http://individual.utoronto.ca/somody/quiz.html
Answers: Hobo, Hobo, Professor
Back to classification
Classification Algorithm
Different ways to go:
Honey? Fired? Evil plan?
Back to classification
Classification Algorithm
Some classification algorithms:
Logistic regression
Support vector machines (SVMs)
Decision trees + its family
…
• Easy to understand
• (Relatively) easy to implement
• Very efficient in many cases
Decision making process
(Flowchart: a chain of decisions; after each one the question “Did it go well?” branches Yes/No to determine the next step.)
What are the clues that allow you to distinguish a prof from a hobo?
• Clothes people are wearing
• Their eyes
• The beard
• …
Back to the “Prof or hobo” quiz
Main idea: check certain properties in a particular order
Classification with decision trees
• A decision tree takes a series of inputs defining a situation, and
outputs a binary decision/classification.
• A decision tree spells out an order for checking the properties
(attributes) of the situation until we have enough information to
decide what's going on.
• We use the observable attributes to predict the outcome (or some
important hidden or unknown quantity).
Question: what is the optimal (efficient) order of the attributes?
The importance of the ordering
• Think about the “20 questions” game: inefficient questions will lead to low performance
• Think about binary search:
• Optimal: always halve the interval
• Decision trees are very simple to produce if we already know the underlying rules.
• But what if we don’t have the rules, just past examples (experience)?
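The binary-search analogy can be made concrete: always halving the candidate interval pins down one item among n in ceil(log2 n) yes/no questions, which is why question ordering matters. A minimal sketch (function names are mine, for illustration):

```python
import math

def guesses_needed(n: int) -> int:
    """Yes/no questions binary search needs to identify one item
    among n equally likely candidates: ceil(log2(n))."""
    return math.ceil(math.log2(n))

def binary_search_count(target: int, lo: int, hi: int) -> int:
    """Count the halving questions needed to locate target in [lo, hi)."""
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1
        if target >= mid:
            lo = mid      # answer "yes, it is in the upper half"
        else:
            hi = mid      # answer "no, it is in the lower half"
    return questions

# 1024 candidates -> 10 questions, matching ceil(log2(1024)) = 10
```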
Our objective
Often we don't know in advance how to classify things, and want our agent to learn from examples.
Which attribute to start with?
The order of attributes is still very important.
Idea: choose the next attribute whose value can reduce the uncertainty about the outcome of the classification the most
What does it mean when we say that something reduces the uncertainty in our knowledge?
Reducing uncertainty (in knowledge) = increasing (known) information
So we should choose the attribute that provides the highest information gain
Entropy
How to measure information gain (and how to define it)?
Answer: borrow similar concepts from information & coding theory
Entropy (Shannon, 1948):
• A measure of the amount of disorder or uncertainty in a system.
• A tidy room has low entropy: you can be reasonably certain your keys are on the hook you made for them.
• A messy room has high entropy: things are all over the place and your keys could be absolutely anywhere.
Entropy
Classification: input X, output Y. Entropy measures the uncertainty about the outcome Y.

Entropy (Shannon, 1948), in bits:

H(Y) = -Σ_y P(Y = y) · log2 P(Y = y)

where P(Y = y) is how often Y = y, and -log2 P(Y = y) is the measure of information (surprise) when Y = y.
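The definition translates directly into code; a minimal sketch (the function name is mine):

```python
import math

def entropy(probs):
    """H(Y) = -sum_y P(y) * log2 P(y), in bits.
    Terms with P(y) = 0 contribute 0 by the usual convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: entropy([0.5, 0.5]) == 1.0 bit
# A certain outcome carries no surprise: entropy([1.0]) == 0 bits
```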
Entropy example
Weather:

| | Good | OK | Terrible |
|---|---|---|---|
| Birmingham | 0.33 | 0.33 | 0.33 |
| Southampton | 0.3 | 0.6 | 0.1 |
| Glasgow | 0 | 0 | 1 |
Entropy example

| Birmingham | P(x) | log2 P(x) | -P(x) log2 P(x) |
|---|---|---|---|
| Good | 0.33 | -1.58 | 0.53 |
| OK | 0.33 | -1.58 | 0.53 |
| Terrible | 0.33 | -1.58 | 0.53 |

Sum = 1.58 (bits)
Entropy example

| Southampton | P(x) | log2 P(x) | -P(x) log2 P(x) |
|---|---|---|---|
| Good | 0.3 | -1.74 | 0.52 |
| OK | 0.6 | -0.74 | 0.44 |
| Terrible | 0.1 | -3.32 | 0.33 |

Sum = 1.29 (bits)
Entropy example

| Glasgow | P(x) | log2 P(x) | -P(x) log2 P(x) |
|---|---|---|---|
| Good | 0 | -infinity | 0 |
| OK | 0 | -infinity | 0 |
| Terrible | 1 | 0 | 0 |

Sum = 0 (bits)
When we are certain, the entropy is 0.
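The three city figures can be reproduced in a few lines (a sketch; the helper name is mine):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; 0 * log2(0) is taken as 0."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h + 0.0  # normalise IEEE -0.0 to 0.0

birmingham  = [1/3, 1/3, 1/3]   # uniform: maximal uncertainty
southampton = [0.3, 0.6, 0.1]
glasgow     = [0.0, 0.0, 1.0]   # a certain outcome

print(round(entropy(birmingham), 2))   # 1.58
print(round(entropy(southampton), 2))  # 1.3 (the slides' 1.29 comes from rounding each term first)
print(round(entropy(glasgow), 2))      # 0.0
```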
Conditional entropy
Classification: input X, output Y.
Entropy measures the uncertainty of a given state of the system. How do we measure the change when we learn an attribute?

Conditional entropy:

H(Y | X) = -Σ_{x,y} P(x, y) · log2 P(y | x) = Σ_x P(x) · H(Y | X = x)

where P(x, y) is the joint probability and P(y | x) is the conditional probability.
• How much uncertainty would remain about the outcome Y if we knew (for instance) the outcome of attribute X?
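H(Y | X) can be estimated directly from a list of observed (x, y) examples; a sketch under the definition above (function and variable names are mine):

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(pairs):
    """H(Y | X) = sum_x P(x) * H(Y | X = x), in bits,
    estimated from a list of observed (x, y) pairs."""
    n = len(pairs)
    by_x = defaultdict(list)
    for x, y in pairs:
        by_x[x].append(y)
    h = 0.0
    for ys in by_x.values():
        p_x = len(ys) / n                      # P(X = x)
        counts = Counter(ys)
        h_y_given_x = -sum((c / len(ys)) * math.log2(c / len(ys))
                           for c in counts.values())
        h += p_x * h_y_given_x
    return h

# If X fully determines Y, no uncertainty remains:
# conditional_entropy([("a", "yes")] * 3 + [("b", "no")] * 3) == 0.0
```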
Information gain:

G(Y, X) = H(Y) - H(Y | X)

H(Y) is the current level of uncertainty (entropy); H(Y | X) is the possible new level of uncertainty (conditional entropy).
• The difference represents how much the uncertainty would decrease.
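Putting the two quantities together gives G(Y, X) = H(Y) - H(Y | X); a self-contained sketch (names are mine):

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """G(Y, X) = H(Y) - H(Y | X): the drop in uncertainty about Y
    once attribute X is known, estimated from paired observations."""
    n = len(ys)
    h_cond = 0.0
    for x in set(xs):
        subset = [y for xi, y in zip(xs, ys) if xi == x]
        h_cond += len(subset) / n * H(subset)
    return H(ys) - h_cond

# A perfectly predictive attribute recovers all of H(ys):
# information_gain(["a", "a", "b", "b"], [1, 1, 0, 0]) == 1.0
# An uninformative attribute gains nothing:
# information_gain(["a", "b", "a", "b"], [1, 1, 0, 0]) == 0.0
```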
Building a decision tree
Recursive algorithm:
• Split the tree on the attribute with the highest information gain. Then repeat on each branch.
Stopping conditions:
• Don't split if all matching records have the same output value (no point, we know what happens!).
• Don't split if all matching records have the same attribute values (no point, we can't distinguish them).
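The recursion and both stopping conditions can be sketched in a few lines of ID3-style Python (all names are mine; a tree is either a class label or a pair of attribute and value-to-subtree dict):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """rows: list of dicts mapping attribute name -> value."""
    # Stop: all matching records share one output value -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left to distinguish rows -> majority leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        h_cond = 0.0
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            h_cond += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - h_cond

    best = max(attrs, key=gain)  # split on the highest information gain
    tree = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[v] = build_tree(sub_rows, sub_labels,
                             [a for a in attrs if a != best])
    return (best, tree)
```

Classifying a new row is then a matter of walking the tuple/dict structure from the root until a leaf label is reached.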
Example: Predicting the importance of emails
Objective: predict whether the user will read the email
18 emails: 9 read, 9 skipped
“Thread” attribute:

| Thread | Reads | Skips | Row total |
|---|---|---|---|
| new_thread | 7 (70%) | 3 (30%) | 10 |
| follow_up | 2 (25%) | 6 (75%) | 8 |
Example: Predicting the importance of emails
What is the information gain if we choose “Thread” ?
Calculation steps:
• Calculate H(Read)• Calculate H(Read | Thread)• Calculate G(Read, Thread) = H(Read) – H(Read | Thread)
Example: Predicting the importance of emails
Calculating H(Read):
• 18 emails: 9 read, 9 skipped
• P(Read = True) = P(Read = False) = 0.5
• H(Read) = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 (bit)
Example: Predicting the importance of emails
Calculating H(Read | Thread) from the specific conditional entropies:
• Calculate H(Read | Thread = new)
• Calculate H(Read | Thread = follow_up)
• Calculate H(Read | Thread) = P(new)*H(Read | Thread = new) + P(follow_up)*H(Read | Thread = follow_up)
| Thread | Reads | Skips | Row total |
|---|---|---|---|
| new_thread | 7 (70%) | 3 (30%) | 10 |
| follow_up | 2 (25%) | 6 (75%) | 8 |
Example: Predicting the importance of emails
• P(Read = True | new)= 0.7; P(Read = False | new) = 0.3
• H(Read | new) = 0.88
• P(Read = True | follow_up) = 0.25; P(Read = False | follow_up) = 0.75
• H(Read | follow_up) = 0.81
• H(Read | Thread) = 10/18 *0.88 + 8/18*0.81 = 0.85
Example: Predicting the importance of emails
Calculating G(Read, Thread):
• G(Read, Thread) = H(Read) - H(Read | Thread)
• G(Read, Thread) = 1 - 0.85 = 0.15
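The whole calculation can be checked end to end from the counts in the table (a sketch; the helper name is mine):

```python
import math

def H(probs):
    """Shannon entropy in bits; zero probabilities contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Counts from the slides: 18 emails, split by the "Thread" attribute
#   new_thread: 7 reads, 3 skips    follow_up: 2 reads, 6 skips
h_read   = H([9/18, 9/18])                       # = 1 bit
h_new    = H([7/10, 3/10])                       # ~ 0.88
h_follow = H([2/8, 6/8])                         # ~ 0.81
h_cond   = (10/18) * h_new + (8/18) * h_follow   # ~ 0.85
gain     = h_read - h_cond                       # ~ 0.15

print(f"G(Read, Thread) = {gain:.2f}")  # G(Read, Thread) = 0.15
```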
Example: Predicting the importance of emails
Advantages of decision trees
• Decision trees generate understandable (human-readable) rules.
• Once learned, decision trees perform classification very efficiently.
• Decision trees can handle continuous as well as categorical variables: choose a threshold that splits the continuous variable based on information gain.
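The threshold-selection idea in the last bullet can be sketched as a scan over candidate midpoints, keeping the split "x <= t" with the highest information gain (names are mine; a simple midpoint scan is assumed):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try a split 'x <= t' at each midpoint between distinct sorted
    values; return the threshold with the highest information gain."""
    pts = sorted(set(values))
    best_t, best_gain = None, -1.0
    for lo, hi in zip(pts, pts[1:]):
        t = (lo + hi) / 2
        left  = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        h_cond = (len(left) * entropy(left)
                  + len(right) * entropy(right)) / len(labels)
        g = entropy(labels) - h_cond
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain

# Two well-separated clusters split perfectly at the gap:
# best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]) -> (6.5, 1.0)
```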