lect6.pdf
TRANSCRIPT
-
Artificial Intelligence: Machine Learning
Russell & Norvig: Sections 18.1 to 18.4
-
Motivation
Too many to list here!
Recommender systems
  eg. Amazon --> you might like this book…
  eg. LinkedIn --> people you might know…
Pattern Recognition
  Learning to recognize postal codes
  Handwriting recognition --> who wrote this cheque?
  …
-
Why Machine Learning?
To construct programs that automatically improve with experience.
Learning = a crucial characteristic of an intelligent agent
A learning system:
  performs tasks and improves its performance the more tasks it accomplishes
  generalizes from given experiences and is able to make judgments on unseen cases
-
Machine Learning vs Data Mining

Supervised Machine Learning
  Try to predict the outcome of new data from a training set of already classified examples
  "I know what I am looking for, and I am using this training data to help me generalise"
  Eg: diagnostic system: which treatments are best for new diseases?
  Eg: speech recognition: given the acoustic pattern, which word was uttered?

Unsupervised Machine Learning / Data Mining / Knowledge Discovery
  Try to discover valuable knowledge/patterns from large databases of grocery bills, financial transactions, medical records, etc.
  "I'm not sure what I am looking for, but I am sure there's something interesting"
  Eg: Anomaly detection: odd use of a credit card?
  Eg: Learning association rules: strawberries & whipped cream bought together?
-
Types of Learning

In Supervised learning:
  We are given a training set of (X, f(X)) pairs
    big nose, big teeth, big eyes, no moustache --> f(X) = not person
    small nose, small teeth, small eyes, no moustache --> f(X) = person
    small nose, big teeth, small eyes, moustache --> f(X) = ?

In Reinforcement learning:
  We are not given the (X, f(X)) pairs, but somehow we are told whether our learner is right or wrong
  Goal: maximize the number of right answers

In Unsupervised learning:
  We are only given the Xs, not the corresponding f(X)
  No teacher involved
  Goal: find regularities among the Xs (clustering) --> data mining
    big nose, big teeth, big eyes, no moustache --> not given
    small nose, small teeth, small eyes, no moustache --> not given
    small nose, big teeth, small eyes, moustache --> f(X) = ?
-
Remember this slide…
-
Inductive Learning = learning from examples
Most work in ML
Examples are given (positive and/or negative) to train a system in a classification (or regression) task
Extrapolate from the training set to make accurate predictions about future examples
Can be seen as learning a function:
  Given a new instance X you have never seen,
  you must find an estimate of the function f(X), where f(X) is the desired output
Ex:
  X = features of a face (ex. small nose, big teeth, …)
  f(X) = function to tell if X represents a human face or not
  small nose, big teeth, small eyes, moustache --> f(X) = ?
-
Example
Given 5 pairs (X, f(X)) (the training set),
find a function that fits the training set well,
so that given a new X, you can predict its f(X) value
[figure: several candidate curves fitted through the same five (X, f(X)) points]
Note: choosing one function over another beyond just looking at the training set is called inductive bias
  ex: prefer "smoother" functions
  ex: if an attribute does not seem discriminating, drop it
source: Russell & Norvig (2003)
-
Inductive Learning Framework
Input data are represented by a vector of features, X
Each vector X is a list of (attribute, value) pairs
  Ex: X = [nose:big, teeth:big, eyes:big, moustache:no]
The number of features is fixed (positive, finite)
Each attribute has a fixed, finite number of possible values
Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of features
if the output is discrete --> classification
if the output is continuous --> regression
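
For illustration only (the dict encoding is an arbitrary choice, not from the lecture), such an input vector and its label can be written in Python as:

    # One example: a fixed-length list of (attribute, value) pairs,
    # here encoded as a dict, plus its label f(X).
    X = {"nose": "big", "teeth": "big", "eyes": "big", "moustache": "no"}
    f_X = "not person"  # the desired output; known only in supervised learning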
-
Example
Real ML applications typically require hundreds or thousands of examples
source: Alison Cawsey: The Essence of AI (1997).
-
Techniques in ML
Probabilistic Methods
  ex: Naïve Bayes Classifier
Decision Trees
  Use only discriminating features as questions in a big if-then-else tree
Genetic Algorithms
  Good functions evolve out of a population by combining possible solutions to produce offspring solutions and killing off the weaker of those solutions
Neural Networks
  Also called parallel distributed processing or connectionist systems
  Intelligence arises from having a large number of simple computational units
-
Today
Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms
-
Guess Who?
Play online
-
Decision Trees
The simplest, but most successful, form of learning algorithm
A very well-known algorithm is ID3 (Quinlan, 1986) and its successor C4.5
Look for features that are very good indicators of the result, and place them (as questions) near the top of the tree
Split the examples so that those with different values for the chosen feature are in a different set
Repeat the same process with another feature
-
ID3 / C4.5 Algorithm
Top-down construction of the decision tree
Recursive selection of the "best feature" to use at the current node in the tree
Once the feature is selected for the current node, generate child nodes, one for each possible value of the selected attribute
Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node
Repeat for each child node until all examples associated with a node are either all positive or all negative
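
A minimal Python sketch of this recursive construction (my own illustration, not Quinlan's code; the best-feature choice uses the information gain defined a few slides later):

    import math
    from collections import Counter

    def entropy(labels):
        # H = -sum_c p(c) * log2 p(c) over the class labels
        n = len(labels)
        return -sum((m / n) * math.log2(m / n) for m in Counter(labels).values())

    def best_feature(examples, features):
        # Pick the attribute with the largest information gain.
        base = entropy([lbl for _, lbl in examples])
        def gain(a):
            rem = 0.0
            for v in {fd[a] for fd, _ in examples}:
                sub = [lbl for fd, lbl in examples if fd[a] == v]
                rem += len(sub) / len(examples) * entropy(sub)
            return base - rem
        return max(features, key=gain)

    def id3(examples, features):
        # examples: list of (feature_dict, label) pairs
        labels = [lbl for _, lbl in examples]
        if len(set(labels)) == 1:        # all positive or all negative: leaf
            return labels[0]
        if not features:                 # no feature left: majority class
            return Counter(labels).most_common(1)[0][0]
        best = best_feature(examples, features)
        children = {}                    # one child per possible value of best
        for v in {fd[best] for fd, _ in examples}:
            subset = [(fd, lbl) for fd, lbl in examples if fd[best] == v]
            children[v] = id3(subset, [f for f in features if f != best])
        return (best, children)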
-
Example
Info on last year's students, to determine if a student will get an 'A' this year

              Features                                              Output f(X)
Student       'A' last year?  Black hair?  Works hard?  Drinks?     'A' this year?
X1: Richard   Yes             Yes          No           Yes         No
X2: Alan      Yes             Yes          Yes          No          Yes
X3: Alison    No              No           Yes          No          No
X4: Jeff      No              Yes          No           Yes         No
X5: Gail      Yes             No           Yes          Yes         Yes
X6: Simon     No              Yes          Yes          Yes         No
-
Example

Student   'A' last year?  Black hair?  Works hard?  Drinks?  'A' this year?
Richard   Yes             Yes          No           Yes      No
Alan      Yes             Yes          Yes          No       Yes
Alison    No              No           Yes          No       No
Jeff      No              Yes          No           Yes      No
Gail      Yes             No           Yes          Yes      Yes
Simon     No              Yes          Yes          Yes      No

The resulting decision tree:

'A' last year?
  yes --> Works hard?
            yes --> Output = Yes
            no  --> Output = No
  no  --> Output = No
-
Example 2: The Restaurant
Goal: learn whether one should wait for a table
Attributes:
  Alternate: another suitable restaurant nearby
  Bar: comfortable bar for waiting
  Fri/Sat: true on Fridays and Saturdays
  Hungry: whether we are hungry
  Patrons: how many people are present (none, some, full)
  Price: price range ($, $$, $$$)
  Raining: raining outside
  Reservation: reservation made
  Type: kind of restaurant (French, Italian, Thai, Burger)
  WaitEstimate: estimated wait by host (0-10 mins, 10-30, 30-60, >60)
-
Example 2: The Restaurant
Training data:
[table of 12 example restaurant visits]
source: Russell & Norvig (2003)
-
A First Decision Tree
But is it the best decision tree we can build?
source: Russell & Norvig (2003)
-
Ockham's Razor Principle
"It is vain to do more than can be done with less… Entities should not be multiplied beyond necessity." [Ockham, 1324]
In other words… always favor the simplest answer that correctly fits the training data,
i.e. the smallest tree on average
This type of assumption is called inductive bias
  inductive bias = making a choice beyond what the training instances contain
-
Finding the Best Tree
Can be seen as searching the space of all possible decision trees, from the empty tree to the complete tree
Inductive bias: prefer shorter trees on average
How? To search the space of all decision trees:
  always pick the next attribute to split the data on based on its "discriminating power" (information gain)
  in effect, a hill-climbing search where the heuristic is information gain
source: Tom Mitchell, Machine Learning (1997)
-
Which Tree is Best?
[figure: two trees over the same features — a balanced tree (F1? at the root, then F2?/F3?, then F4?-F7?, with a class at each of the eight leaves) versus a deep chain (F1?, then F2?, then F3?, …, with a class leaf hanging off each level)]
-
Choosing the Next Attribute
The key problem is choosing which feature to split a given set of examples on
ID3 uses Maximum Information Gain:
  Choose the attribute that has the largest information gain,
  i.e., the attribute that will result in the smallest expected size of the subtrees rooted at its children
  --> information theory
-
Intuitively…
Patrons:
  If value is Some … all outputs = Yes
  If value is None … all outputs = No
  If value is Full … we need more tests
Type:
  If value is French … we need more tests
  If value is Italian … we need more tests
  If value is Thai … we need more tests
  If value is Burger … we need more tests
…
So Patrons is more discriminating…
source: Russell & Norvig (2003)
-
Next Feature…
For only the data where Patrons = Full:
Hungry:
  If value is Yes … we need more tests
  If value is No … all output = No
Type:
  If value is French … all output = No
  If value is Italian … all output = No
  If value is Thai … we need more tests
  If value is Burger … we need more tests
…
So Hungry is more discriminating (only 1 new branch)…
-
A Better Decision Tree
4 tests instead of 9
11 branches instead of 21
source: Russell & Norvig (2003)
-
Choosing the Next Attribute
The key problem is choosing which feature to split a given set of examples on
Most used strategy: information theory

Entropy (or information content):
$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$

$H(\text{fair coin toss}) = -\sum_i p(x_i) \log_2 p(x_i) = H(\frac{1}{2}, \frac{1}{2}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$ bit

Notation: H(a, b) = entropy if probability of success = a and probability of failure = b
-
Essential Information Theory
Developed by Shannon in the 40s
Notion of entropy (information content): how informative is a piece of information?
If you already have a good idea about the answer (low entropy),
  then a "hint" about the right answer is not very informative…
If you have no idea about the answer (ex. 50-50 split --> high entropy),
  then a "hint" about the right answer is very informative…
-
Entropy
Let X be a discrete random variable (RV)
Entropy (or information content):

$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$

  measures the amount of information in a RV
  = average uncertainty of a RV
  = the average length of the message needed to transmit an outcome $x_i$ of that variable
  measured in bits
  for only 2 outcomes $x_1$ and $x_2$: $0 \le H(X) \le 1$
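
This formula transcribes directly into Python; a minimal sketch (not from the slides), using the convention that a zero-probability outcome contributes nothing:

    import math

    def entropy(probabilities):
        # H(X) = -sum_i p(x_i) * log2 p(x_i); terms with p = 0 contribute 0
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))  # fair coin toss -> 1.0 bit
    print(entropy([1.0, 0.0]))  # no uncertainty  -> 0.0 bits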
-
Choosing the Best Feature (con't)
The "discriminating power" of an attribute A given a data set S:
Let Values(A) = the set of values that attribute A can take
Let $S_v$ = the set of examples in the data set which have value v for attribute A (for each value v from Values(A))

Information gain (or entropy reduction):

$gain(S, A) = H(S) - H(S \mid A)$

where $H(S \mid A) = \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
-
Some Intuition

Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

Size is the least discriminating attribute (i.e. smallest information gain)
Shape and Color are the most discriminating attributes (i.e. highest information gain)
-
A Small Example (1)

Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

Split on Color:  red: 2+ 1-   blue: 0+ 1-
Values(Color) = {red, blue}

$H(S) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$

$gain(S, Color) = H(S) - \sum_{v \in Values(Color)} \frac{|S_v|}{|S|} H(S_v)$

For all values of Color:
$H(S \mid Color{=}red) = H(\frac{2}{3}, \frac{1}{3}) = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918$
$H(S \mid Color{=}blue) = H(0, 1) = -1\log_2 1 - 0 = 0$
$H(S \mid Color) = \frac{3}{4}(0.918) + \frac{1}{4}(0) = 0.6885$
$gain(Color) = H(S) - H(S \mid Color) = 1 - 0.6885 = 0.3115$
-
A Small Example (2)

Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

Split on Shape:  circle: 2+ 1-   square: 0+ 1-
(Note: by definition, $\log 0 = -\infty$; the term $0 \log_2 0$ is taken to be 0)

$H(S) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
$H(S \mid Shape) = \frac{3}{4}(0.918) + \frac{1}{4}(0) = 0.6885$
$gain(Shape) = H(S) - H(S \mid Shape) = 1 - 0.6885 = 0.3115$
-
A Small Example (3)

Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

Split on Size:  big: 1+ 1-   small: 1+ 1-

$H(S) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
$H(S \mid Size) = \frac{1}{2}(1) + \frac{1}{2}(1) = 1$
$gain(Size) = H(S) - H(S \mid Size) = 1 - 1 = 0$
-
A Small Example (4)

Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

$gain(Shape) = 0.3115$
$gain(Color) = 0.3115$
$gain(Size) = 0$

So first separate according to either Color or Shape (root of the tree)
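
The three gains above can be checked mechanically. A self-contained sketch (the dataset is hard-coded from the table; the slides' 0.3115 is the same value under slightly coarser rounding):

    import math

    def entropy(labels):
        # H of the class distribution of a list of labels
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))

    def gain(examples, attr):
        # gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
        labels = [lbl for _, lbl in examples]
        remainder = 0.0
        for v in {fd[attr] for fd, _ in examples}:
            sv = [lbl for fd, lbl in examples if fd[attr] == v]
            remainder += len(sv) / len(examples) * entropy(sv)
        return entropy(labels) - remainder

    S = [({"Size": "Big",   "Color": "Red",  "Shape": "Circle"}, "+"),
         ({"Size": "Small", "Color": "Red",  "Shape": "Circle"}, "+"),
         ({"Size": "Small", "Color": "Red",  "Shape": "Square"}, "-"),
         ({"Size": "Big",   "Color": "Blue", "Shape": "Circle"}, "-")]

    for a in ("Size", "Color", "Shape"):
        print(a, round(gain(S, a), 4))  # Size 0.0, Color 0.3113, Shape 0.3113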
-
The Restaurant Example

$gain(Patrons) = 1 - \left[ \frac{2}{12}H(\frac{0}{2},\frac{2}{2}) + \frac{4}{12}H(\frac{4}{4},\frac{0}{4}) + \frac{6}{12}H(\frac{2}{6},\frac{4}{6}) \right]$
$= 1 - \left[ \frac{2}{12}\left(-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right) + \frac{4}{12}\left(-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4}\right) + \ldots \right] \approx 0.541$ bits

$gain(Type) = 1 - \left[ \frac{2}{12}H(\frac{1}{2},\frac{1}{2}) + \frac{2}{12}H(\frac{1}{2},\frac{1}{2}) + \frac{4}{12}H(\frac{2}{4},\frac{2}{4}) + \frac{4}{12}H(\frac{2}{4},\frac{2}{4}) \right] = 0$ bits

gain(alt) = …   gain(bar) = …   gain(fri) = …   gain(hun) = …
gain(price) = …   gain(rain) = …   gain(res) = …   gain(est) = …

Attribute Patrons has the highest gain, so the root of the tree should be attribute Patrons;
do the same recursively for the subtrees
-
Decision Boundaries of Decision Trees
[figure: training examples plotted in a 2-D feature space (Feature 1 vs. Feature 2)]
-
Decision Boundaries of Decision Trees
[figure: a first test, Feature 1 > t1, splits the space at t1; the labels of the resulting regions are still undetermined (??)]
-
Decision Boundaries of Decision Trees
[figure: a second test, Feature 2 > t2, further splits one side of the t1 boundary; one region remains undetermined (??)]
-
Decision Boundaries of Decision Trees
[figure: after a third test, Feature 2 > t3, the tree's tests carve the feature space into axis-parallel rectangular regions]
-
Applications of Decision Trees
One of the most widely used learning methods in practice
Fast, simple, and traceable
Can out-perform human experts in many problems
  A study for diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly
  Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example
-
Today
Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms
-
Evaluation Methodology
-
Evaluation Methodology
Standard methodology:
1. Collect a large set of examples (all with correct classifications)
2. Divide collection into two disjoint sets: training set and test set (DO NOT LOOK AT THE TEST SET!)
3. Apply learning algorithm to training set
4. Measure performance with the test set: how well does the function you learned in 3. correctly classify the examples in the test set? (i.e. compute accuracy)
Important: keep the training and test sets disjoint!
To study the robustness of the algorithm, repeat steps 2-4 for different training sets and sizes of training sets
-
Error Analysis
Where did the learner go wrong?
Use a confusion matrix / contingency table
(rows = correct class that should have been assigned; columns = classes assigned by the learner)

      C1    C2    C3    C4    C5    C6   …  Total
C1  99.4    .3     0     0    .3     0      100
C2     0  90.2   3.3   4.1     0     0      100
C3     0    .1  93.9   1.8    .1   1.9      100
C4     0    .5   2.2  95.5    .2     0      100
C5     0     0    .3   1.4  96.0   2.5      100
C6     0     0   1.9     0   3.4  93.3      100
…
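
Building such a table from (correct, assigned) pairs is mechanical; a minimal sketch with made-up class names:

    from collections import defaultdict

    def confusion_matrix(correct, assigned):
        # rows: correct class; columns: class assigned by the learner
        matrix = defaultdict(lambda: defaultdict(int))
        for c, a in zip(correct, assigned):
            matrix[c][a] += 1
        return matrix

    m = confusion_matrix(["C1", "C1", "C2", "C2"], ["C1", "C2", "C2", "C2"])
    print(m["C1"]["C2"])  # 1: one C1 example was wrongly assigned to C2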
-
A Learning Curve
[figure: % correct on the test set as a function of training-set size]
Size of training set: the more, the better…
but after a while, not much improvement…
source: Mitchell (1997)
-
Some Words on Training
In all types of learning… watch out for:
  Noisy input
  Overfitting/underfitting the training data
-
Overfitting / Underfitting
-
Overfitting / Underfitting
In all types of learning… watch out for:
Overfitting the training data:
  If a large number of irrelevant features are present, we may find meaningless regularities in the data that are particular to the training data but irrelevant to the problem
  i.e. you find a complicated model with many features, while the real solution is much simpler
Underfitting the training data:
  The training set is not representative enough of the real problem
  i.e. you find a simple explanation with few features, while the real solution is more complex
-
Cross-Validation
Another method to reduce overfitting
K-fold cross-validation:
  run k experiments, each time testing on 1/k of the data and training on the rest
  then average the results
Ex: 10-fold cross-validation
1. Collect a large set of examples (all with correct classifications)
2. Divide collection into two disjoint sets: training (90%) and test (10% = 1/k)
3. Apply learning algorithm to training set
4. Measure performance with the test set
5. Repeat steps 2-4, with the 10 different portions
6. Average the results of the 10 experiments

exp1: [test ][train ......................]
exp2: [train][test ][train ...............]
exp3: [train ......][test ][train ........]
…
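
A compact sketch for a general k (learn and accuracy are hypothetical stand-ins for the learning algorithm and its test-set evaluation):

    def cross_validate(examples, learn, accuracy, k=10):
        fold = len(examples) // k
        scores = []
        for i in range(k):  # steps 2-5: each portion serves once as test set
            test = examples[i * fold:(i + 1) * fold]
            train = examples[:i * fold] + examples[(i + 1) * fold:]
            scores.append(accuracy(learn(train), test))
        return sum(scores) / k  # step 6: average over the k experiments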
-
Today
Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms
-
Unsupervised Learning
Learn without labeled examples, i.e. X is given, but not f(X)
Without f(X), you can't really identify/label a test instance
  small nose, big teeth, small eyes, moustache --> f(X) = ?
But you can:
  Cluster/group the features of the test data into a number of groups
  Discriminate between these groups without actually labeling them
-
Clustering
Represent each instance as a vector
Each vector can be visually represented in an n-dimensional space

     a1  a2  a3  Output
X1    1   0   0    ?
X2    1   6   0    ?
X3    8   0   1    ?
X4    6   1   0    ?
X5    1   7   1    ?

[figure: X1…X5 plotted as points in feature space]
-
Clustering
Clustering algorithm:
  Represent test instances in an n-dimensional space
  Partition them into regions of high density
    How? … many algorithms (ex. k-means)
  Compute the centroid of each region as the average of the data points in the cluster
-
k-means Clustering
The user selects how many clusters they want… (the value of k)
1. Place k points into the space (ex. at random). These points represent the initial group centroids.
2. Assign each data point to the group whose centroid is closest.
3. When all data points have been assigned, recalculate the positions of the k centroids as the average of their clusters.
4. Repeat steps 2 and 3 until none of the data instances change group.
-
Euclidean Distance
To find the nearest centroid… a possible metric is the Euclidean distance between 2 points
p = (p1, p2, …, pn) and q = (q1, q2, …, qn):

$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

Where to assign a data point x?
For all k clusters, choose the one where x has the smallest distance
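
Combining the four k-means steps of the previous slide with this distance metric gives a short, self-contained Python sketch (my own illustration, with random initial centroids):

    import math
    import random

    def euclidean(p, q):
        # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def k_means(points, k):
        centroids = random.sample(points, k)  # step 1: initial centroids
        while True:
            # step 2: assign each data point to its nearest centroid
            clusters = [[] for _ in range(k)]
            for x in points:
                nearest = min(range(k), key=lambda i: euclidean(x, centroids[i]))
                clusters[nearest].append(x)
            # step 3: recompute each centroid as the average of its cluster
            new = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                   for i, c in enumerate(clusters)]
            if new == centroids:  # step 4: stop when no point changes group
                return centroids, clusters
            centroids = new

    # e.g. on the five instances from the clustering slide, with k = 2:
    # k_means([(1, 0, 0), (1, 6, 0), (8, 0, 1), (6, 1, 0), (1, 7, 1)], 2)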
-
Example
[figure: data points and three initial centroids c1, c2, c3 on the 0-5 grid]
-
Example: partition data points to the closest centroid
[figure: each point on the 0-5 grid assigned to the nearest of c1, c2, c3]
-
Example: re-compute the new centroids
[figure: c1, c2, c3 moved to the average position of their clusters]
-
Example: re-assign data points to the new closest centroids
[figure: points re-assigned to the nearest of c1, c2, c3]
-
Example
[figure: the resulting clusters around c1, c2, c3]
-
Notes on k-means
Converges very fast!
BUT: very sensitive to the initial choice of centroids
The user must set the initial k, which is not easy to do…
Many other clustering algorithms exist…
-
Today
Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms