Artificial Intelligence
8. Supervised and unsupervised learning
Japan Advanced Institute of Science and Technology (JAIST)
Yoshimasa Tsuruoka
Outline
• Supervised learning
  – Naive Bayes classifier
• Unsupervised learning
  – Clustering
• Lecture slides
  – http://www.jaist.ac.jp/~tsuruoka/lectures/
Supervised and unsupervised learning
• Supervised learning
  – Each instance is assigned a label
  – Classification, regression
  – Training data need to be created manually
• Unsupervised learning
  – Each instance is just a vector of attribute values
  – Clustering
  – Pattern mining
Naive Bayes classifier
(Chapter 6.9 of Mitchell, T., Machine Learning, 1997)
• Naive Bayes classifier
  – Outputs probabilities
  – Easy to implement
  – Assumes conditional independence between features
  – Efficient learning and classification
• Thomas Bayes (1702 – 1761)
• The reverse conditional probability can be calculated using the original conditional probability and prior probabilities.
Bayes’ theorem

  P(A|B) = P(B|A) P(A) / P(B)
• Can we know the probability of having cancer from the result of a medical test?

  P(cancer | positive) = P(positive | cancer) P(cancer) / P(positive)
  P(¬cancer | positive) = P(positive | ¬cancer) P(¬cancer) / P(positive)

  P(cancer) = 0.008,  P(¬cancer) = 0.992
  P(positive | cancer) = 0.98,  P(negative | cancer) = 0.02
  P(positive | ¬cancer) = 0.03,  P(negative | ¬cancer) = 0.97

• The probability of actually having cancer is not very high.
  P(positive | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
  P(positive | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

  P(cancer | positive) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
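The calculation above can be checked with a few lines of Python. This is a sketch; the variable names are mine, and the numbers come from the slide:

```python
# Bayes' theorem on the cancer-test example (probabilities from the slide).
p_cancer = 0.008            # P(cancer)
p_no_cancer = 0.992         # P(~cancer)
p_pos_given_cancer = 0.98   # P(positive | cancer)
p_pos_given_no = 0.03       # P(positive | ~cancer)

# Unnormalized posteriors (numerators of Bayes' theorem)
joint_cancer = p_pos_given_cancer * p_cancer  # 0.98 * 0.008
joint_no = p_pos_given_no * p_no_cancer       # 0.03 * 0.992

# Normalize by P(positive), which is the sum of the two numerators
p_cancer_given_pos = joint_cancer / (joint_cancer + joint_no)
print(round(p_cancer_given_pos, 2))  # 0.21
```

Note that P(positive) never has to be estimated separately: it falls out of normalizing the two numerators.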
Naive Bayes classifier
• Assume that features are conditionally independent
  v_NB = argmax_{v_j} P(v_j | a_1, a_2, ..., a_n)
       = argmax_{v_j} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)   (Bayes’ theorem)
       = argmax_{v_j} P(a_1, a_2, ..., a_n | v_j) P(v_j)                           (the denominator is constant)
       = argmax_{v_j} P(v_j) ∏_i P(a_i | v_j)                                      (conditional independence)
Training data

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Naive Bayes classifier
• Instance <Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong>
  v_NB = argmax_{v_j ∈ {yes, no}} P(v_j) ∏_i P(a_i | v_j)
       = argmax_{v_j ∈ {yes, no}} P(v_j) P(Outlook = sunny | v_j) P(Temperature = cool | v_j)
                                         P(Humidity = high | v_j) P(Wind = strong | v_j)
Class prior probability
• Maximum likelihood estimation
  – Just counting the number of occurrences in the training data

  P(PlayTennis = yes) = 9/14 = 0.64
  P(PlayTennis = no) = 5/14 = 0.36
Conditional probabilities of features
• Maximum likelihood estimation

  P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
  P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
Class posterior probabilities
  P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
  P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206

• Normalize:

  P(yes | instance) = 0.0053 / (0.0053 + 0.0206) ≈ 0.205
  P(no | instance) = 0.0206 / (0.0053 + 0.0206) ≈ 0.795
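The whole worked example can be reproduced from the training table by counting. The sketch below is my own illustration (the helper `score` is not from the lecture); it computes the unnormalized scores by maximum-likelihood counting:

```python
# PlayTennis training data from the table above (Mitchell, 1997).
data = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def score(instance, label):
    """Unnormalized P(label) * prod_i P(a_i | label), by maximum likelihood."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)                      # class prior, e.g. 9/14
    for i, value in enumerate(instance):
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

instance = ("Sunny", "Cool", "High", "Strong")
s_yes, s_no = score(instance, "Yes"), score(instance, "No")
total = s_yes + s_no
print(round(s_yes, 4), round(s_no, 4))  # 0.0053 0.0206
print(round(s_no / total, 3))           # 0.795 -> predict PlayTennis = no
```

In practice these products of small probabilities are usually computed as sums of logarithms to avoid underflow.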
Smoothing
• Maximum likelihood estimation
  – Estimated probabilities are not reliable when n_c is small
• m-estimate of probability

  (n_c + m p) / (n + m)

  n: number of training examples of the class
  n_c: number of those examples with the given attribute value
  p: prior probability
  m: equivalent sample size
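As a sketch, the m-estimate is a one-line function. The example call is my own choice (m = 1 and a uniform prior over the two Wind values), applied to the Wind = strong | PlayTennis = no counts from the table:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m).

    n_c: count of the attribute value within the class
    n:   number of training examples of the class
    p:   prior estimate of the probability
    m:   equivalent sample size (weight given to the prior)
    """
    return (n_c + m * p) / (n + m)

# Raw ML estimate of P(Wind=strong | PlayTennis=no) is 3/5 = 0.60.
# With m = 1 and a uniform prior p = 0.5 it is pulled toward the prior:
print(m_estimate(3, 5, 0.5, 1))  # (3 + 0.5) / 6 ≈ 0.583

# With n = 0 the estimate falls back to the prior itself:
print(m_estimate(0, 0, 0.5, 1))  # 0.5
```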
Text classification with a Naive Bayes classifier
• Text classification
  – Automatic classification of news articles
  – Spam filtering
  – Sentiment analysis of product reviews
  – etc.
  v_NB = argmax_{v_j ∈ {like, dislike}} P(v_j) ∏_i P(a_i | v_j)
       = argmax_{v_j ∈ {like, dislike}} P(v_j) P(a_1 = "there" | v_j) P(a_2 = "were" | v_j)
                                               ... P(a_45 = "again" | v_j)
There were doors all round the hall, but they were all locked; and when Alice had been all the way down one side and up the other, trying every door, she walked sadly down the middle, wondering how she was ever to get out again.
• These position-specific probabilities cannot be estimated reliably
• Ignore the position and apply m-estimate smoothing
Conditional probabilities of words

  P(a_2 = "were" | v_j)   — the probability of the second word of the document being the word "were"

  P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
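A minimal sketch of this position-independent, add-one-smoothed estimate (it is the m-estimate with a uniform prior p = 1/|Vocabulary| and m = |Vocabulary|). The tiny two-document "corpus" is invented for illustration:

```python
def word_prob(word, docs, vocab):
    """P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|).

    docs:  training documents of class v_j, each a list of words
    n:     total number of word positions in those documents
    n_k:   number of times `word` occurs in those positions
    """
    words = [w for d in docs for w in d]
    n = len(words)
    n_k = words.count(word)
    return (n_k + 1) / (n + len(vocab))

# Toy class-conditional corpus (invented), vocabulary of 5 word types:
docs = [["there", "were", "doors"], ["there", "she", "walked"]]
vocab = {"there", "were", "doors", "she", "walked"}

print(word_prob("there", docs, vocab))  # (2 + 1) / (6 + 5) ≈ 0.273
print(word_prob("alice", docs, vocab))  # unseen word still gets (0 + 1) / 11
```

The add-one count is what keeps an unseen word from zeroing out the whole product in the classifier.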
Unsupervised learning
• No “correct” output for each instance
• Clustering
  – Merging “similar” instances into a group
  – Hierarchical clustering, k-means, etc.
• Pattern mining
  – Discovering frequent patterns from a large amount of data
  – Association rules, graph mining, etc.
Agglomerative clustering
• Define a distance between every pair of instances
  – e.g. a distance derived from cosine similarity
• Algorithm
  1. Start with every instance representing a singleton cluster
  2. Merge the closest two clusters into a single cluster
  3. Repeat this process until all clusters are merged
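The three steps above can be sketched as single-link agglomerative clustering. Euclidean distance on toy 2-D points is my own choice here; a cosine-based distance would slot in the same way. Stopping at a target number of clusters (rather than merging everything into one) is also an assumption, made so the result is visible:

```python
import math

def agglomerate(points, target_clusters=1):
    """Single-link agglomerative clustering with Euclidean distance."""
    clusters = [[p] for p in points]        # 1. every instance is a singleton
    while len(clusters) > target_clusters:  # 3. repeat until merged
        # 2. find the closest pair of clusters (single linkage:
        #    distance = minimum distance between any two members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)      # merge the pair into one cluster
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(points, target_clusters=2))
# two clusters: [(0, 0), (0, 1)] and [(5, 5), (5, 6)]
```

This naive version recomputes all pairwise distances on every merge; real implementations cache the distance matrix and update it incrementally.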