lect6.pdf


Artificial Intelligence


    Russell & Norvig: Sections 18.1 to 18.4

Motivation

Too many applications to list here!

Recommender systems: e.g. Amazon --> "you might like this book…"; e.g. LinkedIn --> "people you might know…"

Pattern recognition: e.g. learning to recognize postal codes; handwriting recognition --> who wrote this cheque? …

Why Machine Learning?

To construct programs that automatically improve with experience. Learning is a crucial characteristic of an intelligent agent.

A learning system performs tasks and improves its performance the more tasks it accomplishes; it generalizes from given experiences and is able to make judgments on unseen cases.


Machine Learning vs Data Mining

Supervised machine learning: try to predict the outcome of new data from a training set of already classified examples.
"I know what I am looking for, and I am using this training data to help me generalise."
E.g. diagnostic system: which treatments are best for new diseases?
E.g. speech recognition: given the acoustic pattern, which word was uttered?

Unsupervised machine learning / data mining / knowledge discovery: try to discover valuable knowledge/patterns from large databases of grocery bills, financial transactions, medical records, etc.
"I'm not sure what I am looking for, but I am sure there's something interesting."
E.g. anomaly detection: odd use of a credit card?
E.g. learning association rules: are strawberries & whipped cream bought together?


Types of Learning

In supervised learning: we are given a training set of (X, f(X)) pairs.
  big nose, big teeth, big eyes, no moustache --> f(X) = not person
  small nose, small teeth, small eyes, no moustache --> f(X) = person
  small nose, big teeth, small eyes, moustache --> f(X) = ?

In reinforcement learning: we are not given the (X, f(X)) pairs, but somehow we are told whether our learner is right or wrong. Goal: maximize the number of right answers.

In unsupervised learning: we are only given the Xs, not the corresponding f(X). No teacher involved. Goal: find regularities among the Xs (clustering). Data mining.
  big nose, big teeth, big eyes, no moustache --> not given
  small nose, small teeth, small eyes, no moustache --> not given
  small nose, big teeth, small eyes, moustache --> f(X) = ?


    Remember this slide…



Inductive Learning

Inductive learning = learning from examples. Most work in ML.

Examples are given (positive and/or negative) to train a system in a classification (or regression) task.
Extrapolate from the training set to make accurate predictions about future examples.

Can be seen as learning a function: given a new instance X you have never seen, you must find an estimate of the function f(X), where f(X) is the desired output.
Ex: X = features of a face (e.g. small nose, big teeth, …); f(X) = function to tell whether X represents a human face or not.
  small nose, big teeth, small eyes, moustache --> f(X) = ?


Example

Given 5 pairs (X, f(X)) (the training set), find a function that fits the training set well, so that given a new X, you can predict its f(X) value.

Note: choosing one function over another beyond just looking at the training set is called inductive bias.
  ex: prefer "smoother" functions
  ex: if an attribute does not seem discriminating, drop it

source: Russell & Norvig (2003)


Inductive Learning Framework

Input data are represented by a vector of features, X.
Each vector X is a list of (attribute, value) pairs. Ex: X = [nose:big, teeth:big, eyes:big, moustache:no]
The number of features is fixed (positive, finite).
Each attribute has a fixed, finite number of possible values.
Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of features.
If the output is discrete --> classification.
If the output is continuous --> regression.
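As a concrete illustration (not from the slides), here is one possible way to encode such feature vectors in Python; the dictionary keys follow the face example above, and this representation is only an assumption reused by the later sketches.

```python
# One possible encoding of an example X as (attribute, value) pairs:
# a plain Python dict keyed by attribute name.
X = {"nose": "big", "teeth": "big", "eyes": "big", "moustache": "no"}

# A training set is then a list of such vectors together with their outputs f(X).
training_set = [
    ({"nose": "big",   "teeth": "big",   "eyes": "big",   "moustache": "no"}, "not person"),
    ({"nose": "small", "teeth": "small", "eyes": "small", "moustache": "no"}, "person"),
]
```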


    Example


    Real ML applications typically require hundreds or thousands of examples

    source: Alison Cawsey: The Essence of AI (1997).


Techniques in ML

Probabilistic methods: ex: Naïve Bayes classifier

Decision trees: use only discriminating features as questions in a big if-then-else tree

Genetic algorithms: good functions evolve out of a population by combining possible solutions to produce offspring solutions and killing off the weaker of those solutions

Neural networks: also called parallel distributed processing or connectionist systems; intelligence arises from having a large number of simple computational units


Today

Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms


    Guess Who?


    Play online


Decision Trees

Simplest, but most successful form of learning algorithm.
A very well-known algorithm is ID3 (Quinlan, 1987) and its successor C4.5.

Look for features that are very good indicators of the result, and put these features at the top of the tree.
Split the examples so that those with different values for the chosen feature are in different sets.
Repeat the same process with another feature.


ID3 / C4.5 Algorithm

Top-down construction of the decision tree.
Recursive selection of the "best feature" to use at the current node in the tree.
Once the feature is selected for the current node, generate child nodes, one for each possible value of the selected attribute.
Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node.
Repeat for each child node until all examples associated with a node are either all positive or all negative.
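A minimal Python sketch of this recursion, assuming the dict-based example representation shown earlier; `choose_attribute` stands in for the "best feature" criterion (information gain, defined later in these slides), and the function names are illustrative rather than taken from ID3/C4.5 itself.

```python
from collections import Counter

def id3(examples, labels, attributes, choose_attribute):
    """Build a decision tree (as nested dicts) top-down, as described above.

    examples:  list of dicts, e.g. {"nose": "big", "teeth": "big", ...}
    labels:    parallel list of outputs, e.g. ["yes", "no", ...]
    choose_attribute(examples, labels, attributes) -> the "best feature"
    """
    # All examples at this node are all positive or all negative: stop (leaf).
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    best = choose_attribute(examples, labels, attributes)
    tree = {best: {}}
    # One child node per possible value of the selected attribute.
    for value in {ex[best] for ex in examples}:
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best],
                                choose_attribute)
    return tree
```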


Example

Info on last year's students, to determine if a student will get an 'A' this year:

                 'A' last year?  Black hair?  Works hard?  Drinks?  |  'A' this year? (output f(X))
  X1: Richard    Yes             Yes          No           Yes      |  No
  X2: Alan       Yes             Yes          Yes          No       |  Yes
  X3: Alison     No              No           Yes          No       |  No
  X4: Jeff       No              Yes          No           Yes      |  No
  X5: Gail       Yes             No           Yes          Yes      |  Yes
  X6: Simon      No              Yes          Yes          Yes      |  No


Example

The decision tree learned from the training data above:

  'A' last year?
    yes --> Works hard?
              yes --> Output = Yes
              no  --> Output = No
    no  --> Output = No
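An earlier slide described decision trees as big if-then-else structures; purely as an illustration, the tree above can be written as the following Python function. The attribute names and boolean encoding are assumptions, not part of the slides.

```python
def will_get_A(student):
    """The tree above, written out as the if-then-else it represents.

    `student` is assumed to be a dict of booleans, e.g.
    {"A_last_year": True, "black_hair": False, "works_hard": True, "drinks": True}.
    """
    if student["A_last_year"]:           # root test: 'A' last year?
        if student["works_hard"]:        # second test, only on the 'yes' branch
            return True                  # Output = Yes
        return False                     # Output = No
    return False                         # Output = No

# e.g. Gail ('A' last year, works hard) --> True
print(will_get_A({"A_last_year": True, "black_hair": False,
                  "works_hard": True, "drinks": True}))
```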


Example 2: The Restaurant

Goal: learn whether one should wait for a table.

Attributes:
  Alternate: another suitable restaurant nearby
  Bar: comfortable bar for waiting
  Fri/Sat: true on Fridays and Saturdays
  Hungry: whether one is hungry
  Patrons: how many people are present (none, some, full)
  Price: price range ($, $$, $$$)
  Raining: raining outside
  Reservation: reservation made
  Type: kind of restaurant (French, Italian, Thai, Burger)
  WaitEstimate: estimated wait by host (0-10 mins, 10-30, 30-60, >60)


Example 2: The Restaurant

Training data: [table of 12 labelled restaurant examples]

source: Norvig (2003)


    A First Decision Tree


    But is it the best decision tree we can build?

    source: Norvig (2003)


Ockham's Razor Principle

"It is vain to do more than can be done with less… Entities should not be multiplied beyond necessity." [Ockham, 1324]

In other words… always favor the simplest answer that correctly fits the training data, i.e. the smallest tree on average.

This type of assumption is called inductive bias.
Inductive bias = making a choice beyond what the training instances contain.


Finding the Best Tree

Finding the best tree can be seen as searching the space of all possible decision trees, from the empty tree to the complete tree.
Inductive bias: prefer shorter trees on average. How?
Search the space of all decision trees: always pick the next attribute to split the data on based on its "discriminating power" (information gain).
In effect, a hill-climbing search where the heuristic is information gain.

source: Tom Mitchell, Machine Learning (1997)


Which Tree is Best?

[Figure: two trees over the same features F1 … F7 and the same class leaves: a shallow, balanced tree (F1 at the root, F2 and F3 below it, F4-F7 at the next level, eight class leaves) versus a deep, chain-like tree (F1, F2, …, F7 tested in sequence, each with a single class leaf hanging off it).]


Choosing the Next Attribute

The key problem is choosing on which feature to split a given set of examples.
ID3 uses maximum information gain: choose the attribute that has the largest information gain, i.e. the attribute that will result in the smallest expected size of the subtrees rooted at its children.
This criterion comes from information theory.


Intuitively…

Patrons:
  If value is Some … all outputs = Yes
  If value is None … all outputs = No
  If value is Full … we need more tests

Type:
  If value is French … we need more tests
  If value is Italian … we need more tests
  If value is Thai … we need more tests
  If value is Burger … we need more tests

So Patrons is more discriminating…

source: Norvig (2003)


Next Feature…

For only the data where Patrons = Full:

Hungry:
  If value is Yes … we need more tests
  If value is No … all outputs = No

Type:
  If value is French … all outputs = No
  If value is Italian … all outputs = No
  If value is Thai … we need more tests
  If value is Burger … we need more tests

So Hungry is more discriminating (only 1 new branch)…


    A Better Decision Tree

    4 tests instead of 9

    11 branches instead of 21

source: Norvig (2003)


Choosing the Next Attribute

The key problem is choosing on which feature to split a given set of examples. Most used strategy: information theory.

Entropy (or information content):

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$

$$H(\text{fair coin toss}) = H\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}$$

H(a, b) = entropy if the probability of success is a and the probability of failure is b.


Essential Information Theory

Developed by Shannon in the 1940s.

Notion of entropy (information content): how informative is a piece of information?
  If you already have a good idea about the answer (low entropy), then a "hint" about the right answer is not very informative…
  If you have no idea about the answer (ex. a 50-50 split; high entropy), then a "hint" about the right answer is very informative…


Entropy

Let X be a discrete random variable (RV).

Entropy (or information content):

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

Entropy:
  measures the amount of information in an RV: the average uncertainty of an RV
  is the average length of the message needed to transmit an outcome x_i of that variable
  is measured in bits
  for only 2 outcomes x_1 and x_2: 0 ≤ H(X) ≤ 1
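A small Python sketch of this definition, estimating H from a list of observed outcomes; the function name and representation are illustrative assumptions, and the printed values match the coin-toss example above.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_i p(x_i) * log2 p(x_i), with probabilities estimated
    from the relative frequencies of the outcomes in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["heads", "tails"]))        # fair coin: 1.0 bit
print(entropy(["yes", "yes", "no"]))      # 2/3 vs 1/3 split: ~0.918 bits
print(entropy(["yes", "yes", "yes"]))     # no uncertainty: 0.0 bits
```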


Choosing the Best Feature (con't)

The "discriminating power" of an attribute A given a data set S:
  Let Values(A) = the set of values that attribute A can take.
  Let S_v = the set of examples in the data set which have value v for attribute A (for each value v from Values(A)).

Information gain (or entropy reduction):

$$gain(S, A) = H(S) - H(S \mid A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
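The same formula as a Python sketch, reusing the `entropy` helper from the previous snippet and the dict-based example representation assumed earlier; nothing here is prescribed by the slides.

```python
def information_gain(examples, labels, attribute):
    """gain(S, A) = H(S) - sum over v in Values(A) of |S_v|/|S| * H(S_v)."""
    total = len(examples)
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:              # Values(A)
        s_v = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        remainder += (len(s_v) / total) * entropy(s_v)        # |S_v|/|S| * H(S_v)
    return entropy(labels) - remainder
```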


Some Intuition

  Size   Color  Shape   Output
  Big    Red    Circle  +
  Small  Red    Circle  +
  Small  Red    Square  -
  Big    Blue   Circle  -

Size is the least discriminating attribute (i.e. smallest information gain).
Shape and color are the most discriminating attributes (i.e. highest information gain).


A Small Example (1)

  Size   Color  Shape   Output
  Big    Red    Circle  +
  Small  Red    Circle  +
  Small  Red    Square  -
  Big    Blue   Circle  -

Color: red: 2+ 1-, blue: 0+ 1-.  Values(Color) = {red, blue}

$$H(S) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$

$$gain(S, Color) = H(S) - \sum_{v \in Values(Color)} \frac{|S_v|}{|S|}\, H(S_v)$$

$$H(S \mid Color{=}red) = H\!\left(\tfrac{2}{3}, \tfrac{1}{3}\right) = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) = 0.918$$

$$H(S \mid Color{=}blue) = H(0, 1) = -\left(0\log_2 0 + 1\log_2 1\right) = 0$$

$$H(S \mid Color) = \tfrac{3}{4}(0.918) + \tfrac{1}{4}(0) = 0.6885$$

$$gain(Color) = H(S) - H(S \mid Color) = 1 - 0.6885 = 0.3115$$


A Small Example (2)

Note: by definition, log 0 = -∞, and 0 log 0 is taken to be 0.

Shape: circle: 2+ 1-, square: 0+ 1-

$$H(S) = -\left(\tfrac{2}{4}\log_2\tfrac{2}{4} + \tfrac{2}{4}\log_2\tfrac{2}{4}\right) = 1$$

$$H(S \mid Shape) = \tfrac{3}{4}(0.918) + \tfrac{1}{4}(0) = 0.6885$$

$$gain(Shape) = H(S) - H(S \mid Shape) = 1 - 0.6885 = 0.3115$$


A Small Example (3)

Size: big: 1+ 1-, small: 1+ 1-

$$H(S) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$

$$H(S \mid Size) = \tfrac{1}{2}(1) + \tfrac{1}{2}(1) = 1$$

$$gain(Size) = H(S) - H(S \mid Size) = 1 - 1 = 0$$


A Small Example (4)

gain(Shape) = 0.3115
gain(Color) = 0.3115
gain(Size) = 0

So first separate according to either Color or Shape (the root of the tree).
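As a quick numeric check of this worked example (purely illustrative, and assuming the `entropy` and `information_gain` sketches above are in scope):

```python
# The four training examples from the slides, in the dict representation
# assumed by the earlier sketches.
examples = [
    {"size": "big",   "color": "red",  "shape": "circle"},   # +
    {"size": "small", "color": "red",  "shape": "circle"},   # +
    {"size": "small", "color": "red",  "shape": "square"},   # -
    {"size": "big",   "color": "blue", "shape": "circle"},   # -
]
labels = ["+", "+", "-", "-"]

for attr in ("size", "color", "shape"):
    print(attr, round(information_gain(examples, labels, attr), 4))
# size 0.0, color 0.3113, shape 0.3113
# (the slides' 0.3115 comes from rounding H(2/3, 1/3) to 0.918; the exact value is 0.9183)
```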


The Restaurant Example

$$gain(Patrons) = 1 - \left[\tfrac{2}{12}\, H\!\left(\tfrac{0}{2},\tfrac{2}{2}\right) + \tfrac{4}{12}\, H\!\left(\tfrac{4}{4},\tfrac{0}{4}\right) + \tfrac{6}{12}\, H\!\left(\tfrac{2}{6},\tfrac{4}{6}\right)\right] = \ldots \approx 0.541 \text{ bits}$$

$$gain(Type) = 1 - \left[\tfrac{2}{12}\, H\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{2}{12}\, H\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{4}{12}\, H\!\left(\tfrac{2}{4},\tfrac{2}{4}\right) + \tfrac{4}{12}\, H\!\left(\tfrac{2}{4},\tfrac{2}{4}\right)\right] = 0 \text{ bits}$$

gain(alt) = …   gain(bar) = …   gain(fri) = …   gain(hun) = …
gain(price) = …   gain(rain) = …   gain(res) = …   gain(est) = …

Attribute pat (Patrons) has the highest gain, so the root of the tree should be attribute Patrons; do the same recursively for the subtrees.
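A quick check of gain(Patrons) in Python (illustrative only; it uses the per-value class counts shown in the formula above, and assumes the 12 training examples split 6 positive / 6 negative so that H(S) = 1 bit, which is what the leading "1 −" implies).

```python
import math

def H(*probs):
    """Entropy of a distribution given as probabilities (0 log 0 treated as 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Patrons splits the 12 examples into None: 0+/2-, Some: 4+/0-, Full: 2+/4-.
gain_patrons = H(6/12, 6/12) - (2/12 * H(0/2, 2/2)
                                + 4/12 * H(4/4, 0/4)
                                + 6/12 * H(2/6, 4/6))
print(round(gain_patrons, 3))   # ~0.541
```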


Decision Boundaries of Decision Trees

[Figures: a 2-D feature space (Feature 1 vs Feature 2) is carved up by successive axis-parallel threshold tests (e.g. Feature 1 > t1, Feature 2 > t2, Feature 2 > t3); each test in the tree adds another rectangular decision region.]


Applications of Decision Trees

One of the most widely used learning methods in practice. Fast, simple, and traceable.

Can out-perform human experts on many problems:
  A study on diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly.
  Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.


Today

Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms


Evaluation Methodology

Standard methodology:
1. Collect a large set of examples (all with correct classifications).
2. Divide the collection into two disjoint sets: training set and test set.
3. Apply the learning algorithm to the training set. DO NOT LOOK AT THE TEST SET!
4. Measure performance with the test set: how well does the function you learned in step 3 correctly classify the examples in the test set? (i.e. compute accuracy)

Important: keep the training and test sets disjoint!
To study the robustness of the algorithm, repeat steps 2-4 for different training sets and sizes of training sets.
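A minimal sketch of steps 2 and 4 in Python (illustrative only; `predict` stands for whatever function the learning algorithm of step 3 produced, and the helper names are assumptions, not part of the slides).

```python
import random

def train_test_split(examples, labels, test_fraction=0.1, seed=0):
    """Step 2: shuffle the data and split it into disjoint training and test sets."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([examples[i] for i in train_idx], [labels[i] for i in train_idx],
            [examples[i] for i in test_idx], [labels[i] for i in test_idx])

def accuracy(predict, test_examples, test_labels):
    """Step 4: fraction of test examples the learned function classifies correctly."""
    correct = sum(predict(x) == y for x, y in zip(test_examples, test_labels))
    return correct / len(test_labels)
```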


Error analysis

Where did the learner go wrong? Use a confusion matrix / contingency table: rows are the correct class (the one that should have been assigned), columns are the classes assigned by the learner.

        C1     C2     C3     C4     C5     C6    …   Total
  C1    99.4   .3     0      0      .3     0         100
  C2    0      90.2   3.3    4.1    0      0         100
  C3    0      .1     93.9   1.8    .1     1.9       100
  C4    0      .5     2.2    95.5   .2     0         100
  C5    0      0      .3     1.4    96.0   2.5       100
  C6    0      0      1.9    0      3.4    93.3      100


A Learning Curve

Size of the training set: the more, the better; but after a while, not much improvement…

source: Mitchell (1997)


Some Words on Training

In all types of learning… watch out for:
  Noisy input
  Overfitting / underfitting the training data


Overfitting / Underfitting

In all types of learning… watch out for:

Overfitting the training data: if a large number of irrelevant features are present, we may find meaningless regularities in the data that are particular to the training data but irrelevant to the problem, i.e. you find a complicated model with many features while the real solution is much simpler.

Underfitting the training data: the training set is not representative enough of the real problem, i.e. you find a simple explanation with few features, while the real solution is more complex.


Cross-validation

Another method to reduce overfitting.

K-fold cross-validation: run k experiments, each time testing on 1/k of the data and training on the rest, then average the results.

Ex: 10-fold cross-validation
1. Collect a large set of examples (all with correct classifications).
2. Divide the collection into two disjoint sets: training (90%) and test (10% = 1/k).
3. Apply the learning algorithm to the training set.
4. Measure performance with the test set.
5. Repeat steps 2-4 with the 10 different portions.
6. Average the results of the 10 experiments.

  exp1: test  train train train …
  exp2: train test  train train …
  exp3: train train test  train …
  …
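A minimal sketch of this loop in Python (illustrative; `learn` and `predict` stand for an arbitrary learning algorithm and the function it returns, and the `accuracy` helper from the evaluation sketch above is assumed to be in scope).

```python
import random

def cross_validation(examples, labels, learn, k=10, seed=0):
    """Run k experiments: train on (k-1)/k of the data, test on the remaining 1/k,
    and return the average accuracy over the k folds."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint portions
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in idx if i not in test_set]
        predict = learn([examples[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        scores.append(accuracy(predict,
                               [examples[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / k
```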


Today

Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms


Unsupervised Learning

Learn without labeled examples, i.e. X is given, but not f(X).
Without an f(X), you can't really identify/label a test instance:
  small nose, big teeth, small eyes, moustache --> f(X) = ?

But you can:
  cluster/group the features of the test data into a number of groups
  discriminate between these groups without actually labeling them


Clustering

Represent each instance as a vector. Each vector can be visually represented in an n-dimensional space.

        a1   a2   a3   Output
  X1    1    0    0    ?
  X2    1    6    0    ?
  X3    8    0    1    ?
  X4    6    1    0    ?
  X5    1    7    1    ?


Clustering

Clustering algorithm:
  Represent the test instances in an n-dimensional space
  Partition them into regions of high density. How? … many algorithms (ex. k-means)
  Compute the centroïd of each region as the average of the data points in the cluster


k-means Clustering

The user selects how many clusters they want… (the value of k)

1. Place k points into the space (e.g. at random). These points represent the initial group centroïds.
2. Assign each data point to the group whose centroïd is closest.
3. When all data points have been assigned, recalculate the positions of the k centroïds as the average of each cluster.
4. Repeat steps 2 and 3 until none of the data instances change group.
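A minimal Python sketch of these four steps (illustrative only; points are assumed to be equal-length tuples of numbers, the Euclidean distance of the next slide is used as the metric, and the stopping test checks that the centroïds no longer move, which corresponds to no instance changing group).

```python
import math
import random

def euclidean(p, q):
    """Distance between two points p and q: sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def k_means(points, k, seed=0):
    """The four steps above; `points` is a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # 1. initial centroids
    while True:
        # 2. assign each point to the group with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # 3. recompute each centroid as the average of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster)
                                           for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])    # keep an empty cluster's centroid
        # 4. stop when nothing changes (centroids are stable)
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids

# e.g. on the vectors X1…X5 from the Clustering slide:
centroids, clusters = k_means([(1, 0, 0), (1, 6, 0), (8, 0, 1), (6, 1, 0), (1, 7, 1)], k=2)
```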


Euclidean Distance

To find the nearest centroïd… a possible metric is the Euclidean distance between 2 points p = (p1, p2, …, pn) and q = (q1, q2, …, qn):

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Where to assign a data point x? For all k clusters, choose the one where x has the smallest distance.


Example

[Figures: k-means with k = 3 (centroïds c1, c2, c3) on a small 2-D data set:
  partition the data points to the closest centroïd;
  re-compute the new centroïds;
  re-assign the data points to the new closest centroïds;
  … until the assignments no longer change.]


Notes on k-means

Converges very fast!
BUT: very sensitive to the initial choice of centroïds
…
The user must set the initial k: not easy to do…
Many other clustering algorithms exist…


Today

Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms