Artificial Intelligence
8. Supervised and unsupervised learning
Japan Advanced Institute of Science and Technology (JAIST)
Yoshimasa Tsuruoka
Outline
• Supervised learning
  – Naive Bayes classifier
• Unsupervised learning
  – Clustering
• Lecture slides
  – http://www.jaist.ac.jp/~tsuruoka/lectures/
Supervised and unsupervised learning
• Supervised learning
  – Each instance is assigned a label
  – Classification, regression
  – Training data need to be created manually
• Unsupervised learning
  – Each instance is just a vector of attribute values
  – Clustering
  – Pattern mining
Naive Bayes classifier
(Chapter 6.9 of Mitchell, T., Machine Learning, 1997)
• Naive Bayes classifier
  – Outputs probabilities
  – Easy to implement
  – Assumes conditional independence between features
  – Efficient learning and classification
• Thomas Bayes (1702 – 1761)
• The reverse conditional probability can be calculated using the original conditional probability and prior probabilities.
Bayes’ theorem

  P(A|B) = P(B|A) P(A) / P(B)
• Can we know the probability of having cancer from the result of a medical test?

  P(cancer | positive) = P(positive | cancer) P(cancer) / P(positive)
  P(¬cancer | positive) = P(positive | ¬cancer) P(¬cancer) / P(positive)

  P(cancer) = 0.008,  P(¬cancer) = 0.992
  P(positive | cancer) = 0.98,  P(negative | cancer) = 0.02
  P(positive | ¬cancer) = 0.03,  P(negative | ¬cancer) = 0.97

• The probability of actually having cancer is not very high.
  P(positive | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
  P(positive | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

  P(cancer | positive) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
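The calculation above can be checked with a few lines of Python. This is a sketch; the variable names are mine, and the numbers come from the slide:

```python
# Bayes' theorem on the cancer-test example (probabilities from the slide).
p_cancer = 0.008            # P(cancer)
p_no_cancer = 0.992         # P(~cancer)
p_pos_given_cancer = 0.98   # P(positive | cancer)
p_pos_given_no = 0.03       # P(positive | ~cancer)

# Unnormalized posteriors (numerators of Bayes' theorem)
joint_cancer = p_pos_given_cancer * p_cancer  # 0.98 * 0.008
joint_no = p_pos_given_no * p_no_cancer       # 0.03 * 0.992

# Normalize by P(positive), which is the sum of the two numerators
p_cancer_given_pos = joint_cancer / (joint_cancer + joint_no)
print(round(p_cancer_given_pos, 2))  # 0.21
```

Note that P(positive) never has to be estimated separately: it falls out of normalizing the two numerators.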
Naive Bayes classifier
• Assume that features are conditionally independent
  v_NB = argmax_{v_j} P(v_j | a_1, a_2, ..., a_n)
       = argmax_{v_j} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)   (Bayes’ theorem)
       = argmax_{v_j} P(a_1, a_2, ..., a_n | v_j) P(v_j)                           (the denominator is constant)
       = argmax_{v_j} P(v_j) ∏_i P(a_i | v_j)                                      (conditional independence)
Training data

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Naive Bayes classifier
• Instance <Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong>
  v_NB = argmax_{v_j ∈ {yes, no}} P(v_j) ∏_i P(a_i | v_j)
       = argmax_{v_j ∈ {yes, no}} P(v_j) P(Outlook = sunny | v_j) P(Temperature = cool | v_j)
                                         P(Humidity = high | v_j) P(Wind = strong | v_j)
Class prior probability
• Maximum likelihood estimation
  – Just counting the number of occurrences in the training data

  P(PlayTennis = yes) = 9/14 = 0.64
  P(PlayTennis = no) = 5/14 = 0.36
Conditional probabilities of features
• Maximum likelihood estimation

  P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
  P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
Class posterior probabilities
  P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
  P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206

• Normalize:

  P(yes | instance) = 0.0053 / (0.0053 + 0.0206) ≈ 0.205
  P(no | instance) = 0.0206 / (0.0053 + 0.0206) ≈ 0.795
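The whole worked example can be reproduced from the training table by counting. The sketch below is my own illustration (the helper `score` is not from the lecture); it computes the unnormalized scores by maximum-likelihood counting:

```python
# PlayTennis training data from the table above (Mitchell, 1997).
data = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def score(instance, label):
    """Unnormalized P(label) * prod_i P(a_i | label), by maximum likelihood."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)                      # class prior, e.g. 9/14
    for i, value in enumerate(instance):
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

instance = ("Sunny", "Cool", "High", "Strong")
s_yes, s_no = score(instance, "Yes"), score(instance, "No")
total = s_yes + s_no
print(round(s_yes, 4), round(s_no, 4))  # 0.0053 0.0206
print(round(s_no / total, 3))           # 0.795 -> predict PlayTennis = no
```

In practice these products of small probabilities are usually computed as sums of logarithms to avoid underflow.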
Smoothing
• Maximum likelihood estimation
  – Estimated probabilities are not reliable when n_c is small
• m-estimate of probability

  (n_c + m p) / (n + m)

  n: number of training examples of the class
  n_c: number of those examples with the given attribute value
  p: prior probability
  m: equivalent sample size
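As a sketch, the m-estimate is a one-line function. The example call is my own choice (m = 1 and a uniform prior over the two Wind values), applied to the Wind = strong | PlayTennis = no counts from the table:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m).

    n_c: count of the attribute value within the class
    n:   number of training examples of the class
    p:   prior estimate of the probability
    m:   equivalent sample size (weight given to the prior)
    """
    return (n_c + m * p) / (n + m)

# Raw ML estimate of P(Wind=strong | PlayTennis=no) is 3/5 = 0.60.
# With m = 1 and a uniform prior p = 0.5 it is pulled toward the prior:
print(m_estimate(3, 5, 0.5, 1))  # (3 + 0.5) / 6 ≈ 0.583

# With n = 0 the estimate falls back to the prior itself:
print(m_estimate(0, 0, 0.5, 1))  # 0.5
```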
Text classification with a Naive Bayes classifier
• Text classification
  – Automatic classification of news articles
  – Spam filtering
  – Sentiment analysis of product reviews
  – etc.
  v_NB = argmax_{v_j ∈ {like, dislike}} P(v_j) ∏_i P(a_i | v_j)
       = argmax_{v_j ∈ {like, dislike}} P(v_j) P(a_1 = "there" | v_j) P(a_2 = "were" | v_j)
                                               ... P(a_45 = "again" | v_j)
There were doors all round the hall, but they were all locked; and when Alice had been all the way down one side and up the other, trying every door, she walked sadly down the middle, wondering how she was ever to get out again.
• These position-specific probabilities cannot be estimated reliably
• Ignore the position and apply m-estimate smoothing
Conditional probabilities of words

  P(a_2 = "were" | v_j)   — the probability of the second word of the document being the word "were"

  P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
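A minimal sketch of this position-independent, add-one-smoothed estimate (it is the m-estimate with a uniform prior p = 1/|Vocabulary| and m = |Vocabulary|). The tiny two-document "corpus" is invented for illustration:

```python
def word_prob(word, docs, vocab):
    """P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|).

    docs:  training documents of class v_j, each a list of words
    n:     total number of word positions in those documents
    n_k:   number of times `word` occurs in those positions
    """
    words = [w for d in docs for w in d]
    n = len(words)
    n_k = words.count(word)
    return (n_k + 1) / (n + len(vocab))

# Toy class-conditional corpus (invented), vocabulary of 5 word types:
docs = [["there", "were", "doors"], ["there", "she", "walked"]]
vocab = {"there", "were", "doors", "she", "walked"}

print(word_prob("there", docs, vocab))  # (2 + 1) / (6 + 5) ≈ 0.273
print(word_prob("alice", docs, vocab))  # unseen word still gets (0 + 1) / 11
```

The add-one count is what keeps an unseen word from zeroing out the whole product in the classifier.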
Unsupervised learning
• No “correct” output for each instance
• Clustering
  – Merging “similar” instances into a group
  – Hierarchical clustering, k-means, etc.
• Pattern mining
  – Discovering frequent patterns from a large amount of data
  – Association rules, graph mining, etc.
Agglomerative clustering
• Define a distance between every pair of instances
  – e.g. a distance derived from cosine similarity
• Algorithm
  1. Start with every instance representing a singleton cluster
  2. Merge the closest two clusters into a single cluster
  3. Repeat this process until all clusters are merged
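The three steps above can be sketched as single-link agglomerative clustering. Euclidean distance on toy 2-D points is my own choice here; a cosine-based distance would slot in the same way. Stopping at a target number of clusters (rather than merging everything into one) is also an assumption, made so the result is visible:

```python
import math

def agglomerate(points, target_clusters=1):
    """Single-link agglomerative clustering with Euclidean distance."""
    clusters = [[p] for p in points]        # 1. every instance is a singleton
    while len(clusters) > target_clusters:  # 3. repeat until merged
        # 2. find the closest pair of clusters (single linkage:
        #    distance = minimum distance between any two members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)      # merge the pair into one cluster
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(points, target_clusters=2))
# two clusters: [(0, 0), (0, 1)] and [(5, 5), (5, 6)]
```

This naive version recomputes all pairwise distances on every merge; real implementations cache the distance matrix and update it incrementally.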