Apriori Algorithm
Example rule: Socks ⇒ Tie
Support is 50% (2/4), Confidence is 66.67% (2/3)
TX1: Shoes, Socks, Tie
TX2: Shoes, Socks, Tie, Belt, Shirt
TX3: Shoes, Tie
TX4: Shoes, Socks, Belt
Example: Five transactions from a supermarket (diaper = "fralda" in Portuguese)
TID  List of Items
1    Beer, Diaper, Baby Powder, Bread, Umbrella
2    Diaper, Baby Powder
3    Beer, Diaper, Milk
4    Diaper, Beer, Detergent
5    Beer, Milk, Coca-Cola
Step 1 (min_sup = 40%, i.e. 2/5)
C1 (candidate 1-itemsets):
Item         Support
Beer         4/5
Diaper       4/5
Baby Powder  2/5
Bread        1/5
Umbrella     1/5
Milk         2/5
Detergent    1/5
Coca-Cola    1/5
L1 (frequent 1-itemsets, support ≥ min_sup):
Item         Support
Beer         4/5
Diaper       4/5
Baby Powder  2/5
Milk         2/5
Step 2 and Step 3
C2 (candidate 2-itemsets):
Item                 Support
Beer, Diaper         3/5
Beer, Baby Powder    1/5
Beer, Milk           2/5
Diaper, Baby Powder  2/5
Diaper, Milk         1/5
Baby Powder, Milk    0
L2 (frequent 2-itemsets):
Item                 Support
Beer, Diaper         3/5
Beer, Milk           2/5
Diaper, Baby Powder  2/5
Step 4 (min_sup = 40%, i.e. 2/5)
C3 (candidate 3-itemsets):
Item                       Support
Beer, Diaper, Baby Powder  1/5
Beer, Diaper, Milk         1/5
Beer, Milk, Baby Powder    0
Diaper, Baby Powder, Milk  0
L3 is empty: no candidate 3-itemset reaches min_sup.
Step 5 (min_sup = 40%, min_conf = 70%)
Rule A ⇒ B            Support(A,B)  Support(A)  Confidence
Beer ⇒ Diaper         60%           80%         75%
Beer ⇒ Milk           40%           80%         50%
Diaper ⇒ Baby Powder  40%           80%         50%
Diaper ⇒ Beer         60%           80%         75%
Milk ⇒ Beer           40%           40%         100%
Baby Powder ⇒ Diaper  40%           40%         100%
Results (rules satisfying min_sup = 40% and min_conf = 70%):
Beer ⇒ Diaper: support 60%, confidence 75%
Diaper ⇒ Beer: support 60%, confidence 75%
Milk ⇒ Beer: support 40%, confidence 100%
Baby Powder ⇒ Diaper: support 40%, confidence 100%
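These rules can be reproduced with a short, self-contained sketch (plain Python; the transactions and the min_sup/min_conf thresholds come from the example above, everything else is illustrative):

```python
from itertools import combinations

# The five supermarket transactions from the example
transactions = [
    {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
    {"Diaper", "Baby Powder"},
    {"Beer", "Diaper", "Milk"},
    {"Diaper", "Beer", "Detergent"},
    {"Beer", "Milk", "Coca-Cola"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

min_sup, min_conf = 0.40, 0.70

# Frequent 1-itemsets (L1) and frequent 2-itemsets (L2)
items = set().union(*transactions)
L1 = [frozenset([i]) for i in items if support({i}) >= min_sup]
L2 = [frozenset(c) for c in combinations(sorted(set().union(*L1)), 2)
      if support(set(c)) >= min_sup]

# Rules A => B from each frequent 2-itemset, filtered by min_conf
for pair in L2:
    for a in pair:
        b = next(iter(pair - {a}))
        conf = support(pair) / support({a})
        if conf >= min_conf:
            print(f"{a} => {b}: support={support(pair):.0%}, confidence={conf:.0%}")
```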
Construct FP-tree from a Transaction Database
min_support = 3
TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o, w}         {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in descending order of frequency → F-list; here F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree by inserting each transaction in F-list order
Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
Resulting FP-tree (root {}), written as its prefix paths with node counts:
  f:4 → c:3 → a:3 → m:2 → p:2
  f:4 → c:3 → a:3 → b:1 → m:1
  f:4 → b:1
  c:1 → b:1 → p:1
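Steps 1 and 2, plus the projection of each transaction into F-list order, can be sketched as follows (Python; the transactions are the ones in the table above, all other names are illustrative):

```python
from collections import Counter

# Transactions from the table above (TID 100..500)
db = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = 3

# Step 1: one DB scan to count item frequencies
counts = Counter(item for t in db for item in t)

# Step 2: F-list = frequent items in descending frequency order
# (ties among count-3 items may come out in a different order than f-c-a-b-m-p)
flist = [item for item, c in counts.most_common() if c >= min_support]
print(flist)

# Step 3 (input side): project each transaction onto the F-list order;
# these ordered item lists are what get inserted into the FP-tree
ordered = [[item for item in flist if item in t] for t in db]
for t in ordered:
    print(t)
```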
Find Patterns Having p From p-conditional Database
Starting at the frequent-item header table in the FP-tree, traverse the FP-tree by following the link of each frequent item p, and accumulate all of the transformed prefix paths of item p to form p's conditional pattern base.
Conditional pattern bases:
item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees
For each pattern base: accumulate the count for each item in the base, then construct the FP-tree over the frequent items of the pattern base.
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped because its count 1 < min_support)
All frequent patterns relating to m:
m, fm, cm, am, fcm, fam, cam, fcam
-> associations
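Because the m-conditional FP-tree is a single path, every frequent pattern containing m is simply m combined with a subset of {f, c, a}; a tiny illustrative sketch (Python):

```python
from itertools import combinations

# The m-conditional FP-tree is the single prefix path f:3 - c:3 - a:3,
# so m joined with every subset of {f, c, a} is frequent (support >= 3)
path = ["f", "c", "a"]
patterns = []
for r in range(len(path) + 1):
    for subset in combinations(path, r):
        patterns.append("".join(subset) + "m")
print(patterns)
# ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```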
k-means Clustering
Cluster centers c_1, c_2, ..., c_k with clusters C_1, C_2, ..., C_k
d_2(\mathbf{x}, \mathbf{z}) = \left( \sum_{i=1}^{d} (x_i - z_i)^2 \right)^{1/2}
k-means Example (K = 2)
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
Algorithm: random initialization of k cluster centers
do {
  - assign each x_i in the dataset to the nearest cluster center (centroid) c_j according to d_2
  - compute all new cluster centers
} until ( |E_new - E_old| < ε  or  number of iterations > max_iterations )
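A compact numpy sketch of this loop (names such as kmeans and tol are illustrative; the stopping test uses a small tolerance in place of ε, and the sketch assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, max_iterations=100, tol=1e-6, rng=np.random.default_rng(0)):
    """Plain k-means following the pseudocode above (squared Euclidean d2)."""
    # Random initialization of k cluster centers (pick k distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    E_old = np.inf
    for _ in range(max_iterations):
        # Assign each x_i to the nearest center c_j according to d2
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Compute all new cluster centers
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # E = sum of squared distances to the assigned centers
        E_new = d2[np.arange(len(X)), labels].sum()
        if abs(E_new - E_old) < tol:
            break
        E_old = E_new
    return centers, labels

# Tiny usage example with two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centers, labels = kmeans(X, k=2)
print(centers, labels)
```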
k-Means vs Mixture of Gaussians
Both are iterative algorithms that assign points to clusters.
k-Means: minimize
E = \sum_{j=1}^{k} \sum_{x \in C_j} d_2(x, c_j)^2
Mixture of Gaussians: maximize P(x \mid C = i), where each component is the Gaussian
P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left[ -\tfrac{1}{2} (x - \mu)^{t} \Sigma^{-1} (x - \mu) \right]
Mixture of Gaussians is the more general formulation.
Equivalent to k-Means when \Sigma_i = I and
P(C = i) = \begin{cases} 1/k & C = i \\ 0 & \text{else} \end{cases}
Tree Clustering
Tree clustering algorithms allow us to reveal the internal similarities of a given pattern set and to structure these similarities hierarchically.
They are typically applied to a small set of typical patterns; for n patterns these algorithms generate a sequence of 1 to n clusters.
Example: the similarity between two clusters is assessed by measuring the similarity of the furthest pair of patterns (one from each cluster).
This is the so-called complete linkage rule.
Impact of cluster distance measures
“Single-Link” (inter-cluster distance= distance between closest pair of points)
“Complete-Link” (inter-cluster distance= distance between farthest pair of points)
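If SciPy is available, the two linkage rules can be compared directly on a small pattern set (an illustrative sketch, not part of the original slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small 2-D pattern set (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5],
              [5.0, 5.0], [5.5, 5.5], [6.0, 5.0]])

# Single linkage: inter-cluster distance = distance between the closest pair of patterns
single = linkage(X, method="single")
# Complete linkage: inter-cluster distance = distance between the farthest pair of patterns
complete = linkage(X, method="complete")

# Cut each tree into two clusters and compare the assignments
print(fcluster(single, t=2, criterion="maxclust"))
print(fcluster(complete, t=2, criterion="maxclust"))
```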
There are two criteria proposed for clustering evaluation and selection of an optimal clustering scheme (Berry and Linoff, 1996)
Compactness, the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized
Separation, the clusters themselves should be widely spaced
Dunn index
d(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} d(x, y)
\mathrm{diam}(C_i) = \max_{x, y \in C_i} d(x, y)
D_k = \min_{1 \le i \le k} \left\{ \min_{\substack{1 \le j \le k \\ j \ne i}} \left\{ \frac{d(C_i, C_j)}{\max_{1 \le l \le k} \mathrm{diam}(C_l)} \right\} \right\}
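A direct transcription of these definitions into Python (brute force over all point pairs, so only suitable for small pattern sets; the names are illustrative):

```python
import numpy as np
from itertools import combinations

def dunn_index(clusters):
    """Dunn index D_k for a list of clusters (each an array of points)."""
    def d(Ci, Cj):   # min distance between a point of Ci and a point of Cj
        return min(np.linalg.norm(x - y) for x in Ci for y in Cj)
    def diam(Ci):    # max distance between two points of the same cluster
        return max((np.linalg.norm(x - y) for x, y in combinations(Ci, 2)), default=0.0)
    max_diam = max(diam(C) for C in clusters)
    min_sep = min(d(Ci, Cj) for Ci, Cj in combinations(clusters, 2))
    return min_sep / max_diam

clusters = [np.array([[0.0, 0.0], [0.0, 1.0]]),
            np.array([[4.0, 4.0], [5.0, 4.0]])]
print(dunn_index(clusters))  # larger value = more compact, better separated clusters
```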
The Davies-Bouldin (DB) index (1979)
d(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} d(x, y)
\mathrm{diam}(C_i) = \max_{x, y \in C_i} d(x, y)
DB_k = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \left\{ \frac{\mathrm{diam}(C_i) + \mathrm{diam}(C_j)}{d(C_i, C_j)} \right\}
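A matching sketch for the DB index exactly as defined above, i.e. using diam and the minimum pairwise cluster distance (illustrative names):

```python
import numpy as np
from itertools import combinations

def davies_bouldin(clusters):
    """DB index as defined on this slide."""
    def d(Ci, Cj):
        return min(np.linalg.norm(x - y) for x in Ci for y in Cj)
    def diam(Ci):
        return max((np.linalg.norm(x - y) for x, y in combinations(Ci, 2)), default=0.0)
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # worst-case (largest) ratio over all other clusters j != i
        total += max((diam(clusters[i]) + diam(clusters[j])) / d(clusters[i], clusters[j])
                     for j in range(k) if j != i)
    return total / k

clusters = [np.array([[0.0, 0.0], [0.0, 1.0]]),
            np.array([[4.0, 4.0], [5.0, 4.0]])]
print(davies_bouldin(clusters))  # smaller value = more compact, better separated clusters
```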
Pattern Classification (2nd ed.), Richard O. Duda, Peter E. Hart, and David G. Stork, Wiley Interscience, 2001
Pattern Recognition: Concepts, Methods and Applications, Joaquim P. Marques de Sa, Springer-Verlag, 2001
Example: Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result (+) in only 98% of the cases in which the disease is actually present, and a correct negative result (-) in only 97% of the cases in which the disease is not present.
Furthermore, 0.008 of the entire population has this cancer.
Normalization
The result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method.
P(cancer \mid +) = \frac{0.0078}{0.0078 + 0.0298} = 0.20745
P(\neg cancer \mid +) = \frac{0.0298}{0.0078 + 0.0298} = 0.79255
(Here 0.0078 = P(+ \mid cancer) P(cancer) = 0.98 × 0.008 and 0.0298 = P(+ \mid \neg cancer) P(\neg cancer) = 0.03 × 0.992.)
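The two normalized terms can be checked with a few lines of arithmetic (Python; the 0.98 / 0.97 / 0.008 figures are the ones given in the example):

```python
# Posterior for the cancer test example, normalizing the two joint terms
p_cancer = 0.008
p_pos_given_cancer = 0.98       # correct positive rate
p_pos_given_no_cancer = 0.03    # 1 - 0.97 correct negative rate

joint_cancer = p_pos_given_cancer * p_cancer               # ~0.0078
joint_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)   # ~0.0298

evidence = joint_cancer + joint_no_cancer
print(joint_cancer / evidence)     # P(cancer | +)      ~ 0.21
print(joint_no_cancer / evidence)  # P(no cancer | +)   ~ 0.79
```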
Belief Networks
Burglary:   P(B) = 0.001
Earthquake: P(E) = 0.002
Alarm, P(A | B, E):
  B=t, E=t: 0.95   B=t, E=f: 0.94   B=f, E=t: 0.29   B=f, E=f: 0.001
JohnCalls, P(J | A):  A=t: 0.90   A=f: 0.05
MaryCalls, P(M | A):  A=t: 0.70   A=f: 0.01
Full Joint Distribution
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i))
P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = P(j \mid a) P(m \mid a) P(a \mid \neg b, \neg e) P(\neg b) P(\neg e)
= 0.90 \times 0.70 \times 0.001 \times 0.999 \times 0.998 = 0.00062
P(Burglary | JohnCalls = true, MaryCalls = true)
• The hidden variables of the query are Earthquake and Alarm.
• For Burglary = true in the Bayesian network:
P(B \mid j, m) = \alpha P(B, j, m) = \alpha \sum_{e} \sum_{a} P(B, e, a, j, m)
P(b \mid j, m) = \alpha \sum_{e} \sum_{a} P(b) P(e) P(a \mid b, e) P(j \mid a) P(m \mid a)
P(b) is constant and can be moved outside the summations, and the P(e) term can be moved outside the summation over a:
P(b \mid j, m) = \alpha P(b) \sum_{e} P(e) \sum_{a} P(a \mid b, e) P(j \mid a) P(m \mid a)
P(B \mid j, m) = \alpha \langle 0.00059224, 0.0014919 \rangle \approx \langle 0.284, 0.716 \rangle
Given JohnCalls = true and MaryCalls = true, the probability that a burglary has occurred is about 28%.
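A small enumeration sketch that reproduces these numbers from the CPTs above (Python; the dictionary names are illustrative):

```python
# Inference by enumeration for P(Burglary | JohnCalls=true, MaryCalls=true),
# using the CPTs of the burglary network above
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | A)

def joint_sum(b):
    """Sum over the hidden variables Earthquake and Alarm for a fixed Burglary=b."""
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            total += P_B[b] * P_E[e] * p_a * P_J[a] * P_M[a]
    return total

unnormalized = {b: joint_sum(b) for b in (True, False)}
alpha = 1 / sum(unnormalized.values())
posterior = {b: alpha * p for b, p in unnormalized.items()}
print(posterior)  # {True: ~0.284, False: ~0.716}
```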
Artificial Intelligence - A Modern Approach, Second Edition, S. Russell and P. Norvig, Prentice Hall, 2003
The credit history loan table has the following information: p(risk is high) = 6/14, p(risk is moderate) = 3/14, p(risk is low) = 5/14
I(credit\_table) = -\frac{6}{14}\log_2\frac{6}{14} - \frac{3}{14}\log_2\frac{3}{14} - \frac{5}{14}\log_2\frac{5}{14}
I(credit\_table) = 1.531 bits
In the credit history loan table we make income the property tested at the root
This makes the division into C1 = {1,4,7,11}, C2 = {2,3,12,14}, C3 = {5,6,8,9,10,13}.
E(income) = \frac{4}{14} I(C_1) + \frac{4}{14} I(C_2) + \frac{6}{14} I(C_3)
E(income) = \frac{4}{14} \cdot 0 + \frac{4}{14} \cdot 1.0 + \frac{6}{14} \cdot 0.65
E(income) = 0.564 bits
gain(income)=I(credit_table)-E(income)
gain(income)=1.531-0.564
gain(income)=0.967 bits
gain(credit history)=0.266
gain(debt)=0.581
gain(collateral)=0.756
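The entropy and gain arithmetic can be verified with a short sketch (Python; the class counts inside C1, C2, C3 are chosen to be consistent with the quoted I(C1) = 0, I(C2) = 1.0, I(C3) = 0.65, since the full credit table is not reproduced here):

```python
import math

def entropy(counts):
    """I(...) = -sum p_i log2 p_i over a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Risk classes in the full credit table: 6 high, 3 moderate, 5 low
I_table = entropy([6, 3, 5])                     # ~1.531 bits

# Partition induced by 'income': assumed class counts inside C1, C2, C3
# consistent with I(C1)=0, I(C2)=1.0, I(C3)=0.65 quoted above
E_income = (4/14) * entropy([4]) + (4/14) * entropy([2, 2]) + (6/14) * entropy([1, 5])
gain_income = I_table - E_income                 # ~0.967 bits
print(I_table, E_income, gain_income)
```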
Overfitting
Consider the error of hypothesis h over
  the training data: error_train(h)
  the entire distribution D of data: error_D(h)
Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
error_train(h) < error_train(h')
and
error_D(h) > error_D(h')
An ID3 tree consistent with the data:
Hair Color?
  Blond → Lotion Used?
    No  → Sunburned (Sarah, Annie)
    Yes → Not Sunburned (Dana, Katie)
  Red   → Sunburned (Emily)
  Brown → Not Sunburned (Alex, Pete, John)
Corresponding rules by C4.5:
If the person's hair is blonde and the person uses lotion, then nothing happens.
If the person's hair color is blonde and the person uses no lotion, then the person turns red.
If the person's hair color is red, then the person turns red.
If the person's hair color is brown, then nothing happens.
Rules with a default rule:
If the person uses lotion, then nothing happens.
If the person's hair color is brown, then nothing happens.
If no other rule applies, then the person turns red.
Artificial Intelligence, Patrick Henry Winston, Addison-Wesley, 1992
Artificial Intelligence - Structures and Strategies for Complex Problem Solving, Second Edition, G. F. Luger and W. A. Stubblefield, Benjamin/Cummings Publishing, 1993
Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
XOR Problem and the Perceptron
Shown by Minsky and Papert in the mid-1960s: a single perceptron cannot represent the XOR function.
Gradient Descent
To understand it, consider the simpler linear unit, where
o = \sum_{i=0}^{n} w_i \cdot x_i
Let's learn the w_i that minimize the squared error over the training set D = {(x_1, t_1), (x_2, t_2), ..., (x_d, t_d), ..., (x_m, t_m)} (t stands for target).
E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left( t_i^d - o_i^d \right)^2
In our example E becomes
E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left( t_i^d - f\Big( \sum_{j=1}^{3} W_{ij} \cdot f\big( \sum_{k=1}^{5} w_{jk}\, x_k^d \big) \Big) \right)^2
E[\vec{w}] is differentiable given that f is differentiable, so gradient descent can be applied.
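For the simpler linear unit, batch gradient descent on E[w] reduces to the update w ← w + η Σ_d (t_d − o_d) x_d; a minimal numpy sketch (all names are illustrative, not the slide's notation):

```python
import numpy as np

def gradient_descent_linear(X, t, eta=0.05, epochs=200):
    """Batch gradient descent for the linear unit o = sum_i w_i x_i,
    minimizing E[w] = 1/2 * sum_d (t_d - o_d)^2 (illustrative sketch)."""
    X = np.hstack([np.ones((len(X), 1)), X])   # x_0 = 1 carries the bias weight w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                   # outputs for all training examples
        grad = -(t - o) @ X         # dE/dw = -sum_d (t_d - o_d) x_d
        w -= eta * grad
    return w

# Tiny usage example: targets generated by t = 1 + 2*x (should be recovered)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent_linear(X, t))   # approximately [1., 2.]
```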
Extension to a Non-linear Decision Boundary
Possible problems of the transformation: a high computational burden, and it is hard to get a good estimate.
SVM solves these two issues simultaneously: kernel tricks give efficient computation, and minimizing ||w||^2 can lead to a "good" classifier.
(Figure: patterns in the input space mapped by φ(·) into the feature space.)
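The kernel trick can be illustrated with a degree-2 polynomial kernel: the kernel value equals the dot product in an explicit feature space without ever constructing that space (a toy sketch, not the SVM optimization itself):

```python
import numpy as np

# For the degree-2 polynomial kernel K(x, z) = (x . z)^2 on 2-D inputs,
# the explicit feature map is phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2),
# and K(x, z) equals phi(x) . phi(z).
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))   # explicit mapping into the feature space
print(poly_kernel(x, z))        # same value via the kernel trick
```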