similarity measure
DESCRIPTION
A similarity measure can represent the similarity between two documents, two queries, or one document and one query.
TRANSCRIPT
![Page 1: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/1.jpg)
Chapter 3
Similarity Measures
Data Mining Technology
![Page 2: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/2.jpg)
Chapter 3
Similarity Measures
Written by Kevin E. Heinrich
Presented by Zhao Xinyou
2007.6.7
Some materials (Examples) are taken from Website.
![Page 3: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/3.jpg)
Searching Process
[Flow diagram. Boxes: Input Text, Process, Index, Query, Sorting, Show Text Result. Search interface: Input Key Words, Search, Results (1. XXXXXXX, 2. YYYYYYY, 3. ZZZZZZZZ, ...)]
![Page 4: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/4.jpg)
Example
![Page 5: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/5.jpg)
Similarity Measures
A similarity measure can represent the similarity between two documents, two queries, or one document and one query.
It makes it possible to rank the retrieved documents in order of presumed importance.
A similarity measure is a function that computes the degree of similarity between a pair of text objects.
A large number of similarity measures have been proposed in the literature, because the best similarity measure doesn't exist (yet!).
PP27-28
![Page 6: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/6.jpg)
Classic Similarity Measures
All similarity measures should map to the range [-1, 1] or [0, 1]:
0 or -1 indicates minimum similarity (incompatible similarity).
1 indicates maximum similarity (absolute similarity).
PP28
![Page 7: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/7.jpg)
Conversion
For example, suppose 1 indicates incompatible similarity and 10 indicates absolute similarity. To convert from [1, 10] to [0, 1]:
s’ = (s – 1) / 9
Generally, we may use:
s’ = (s – min_s) / (max_s – min_s)
The conversion may be linear or non-linear.
PP28
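The linear conversion above can be sketched in a few lines of Python. The function name `rescale` and its argument names are illustrative choices, not something taken from the slides.

```python
def rescale(s: float, min_s: float, max_s: float) -> float:
    """Linearly map a score s from [min_s, max_s] onto [0, 1]."""
    if max_s <= min_s:
        raise ValueError("max_s must be greater than min_s")
    return (s - min_s) / (max_s - min_s)


# The slide's case: scores on a 1..10 scale, so s' = (s - 1) / 9.
print(rescale(1, 1, 10))    # 0.0  (incompatible similarity)
print(rescale(10, 1, 10))   # 1.0  (absolute similarity)
print(rescale(5.5, 1, 10))  # 0.5  (midpoint of the scale)
```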
![Page 8: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/8.jpg)
Vector-Space Model-VSM
In the 1960s, Salton et al. proposed the vector-space model (VSM), which has been successfully applied in SMART (a text searching system).
PP28-29
![Page 9: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/9.jpg)
Example
D is a set that contains m Web documents:
D = {d1, d2, …, di, …, dm}, i = 1, 2, …, m
There are n words among the m Web documents, so each document is described by n word weights:
di = {wi1, wi2, …, wij, …, win}, i = 1, 2, …, m, j = 1, 2, …, n
The query is described in the same way:
Q = {q1, q2, …, qi, …, qn}, i = 1, 2, …, n
PP28-29
If similarity(q, di) > similarity(q, dj), we conclude that di is more relevant than dj.
![Page 10: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/10.jpg)
Simple Measure Technology
PP29
[Venn diagram over the document set: A = retrieved, B = relevant, A∩B = retrieved and relevant]
Precision = Returned Relevant Documents / Total Returned Documents
Recall = Returned Relevant Documents / Total Relevant Documents
P(A,B) = |A∩B| / |A|
R(A,B) = |A∩B| / |B|
![Page 11: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/11.jpg)
Example--Simple Measure Technology
PP29
Documents Set
Relevant only: A, C, E, G, H, I, J
Retrieved and relevant: B, D, F
Retrieved only: W, Y
B = {relevant} = {A, B, C, D, E, F, G, H, I, J}, |B| = 10
A = {retrieved} = {B, D, F, W, Y}, |A| = 5
A∩B = {relevant} ∩ {retrieved} = {B, D, F}, |A∩B| = 3
P = precision = 3/5 = 60%
R = recall = 3/10 = 30%
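The worked example maps directly onto Python sets. This is only a small sketch; the function names `precision` and `recall` are chosen for illustration, and the document labels are the ones used on the slide.

```python
def precision(retrieved: set, relevant: set) -> float:
    """P(A, B) = |A ∩ B| / |A|, with A the retrieved set."""
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set, relevant: set) -> float:
    """R(A, B) = |A ∩ B| / |B|, with B the relevant set."""
    return len(retrieved & relevant) / len(relevant)


relevant = set("ABCDEFGHIJ")   # B, 10 documents
retrieved = set("BDFWY")       # A, 5 documents

print(precision(retrieved, relevant))  # 0.6
print(recall(retrieved, relevant))     # 0.3
```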
![Page 12: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/12.jpg)
Precision-Recall Graph-Curves
There is a tradeoff between precision and recall, so precision is measured at different levels of recall.
PP29-30
[Figure: precision-recall curves in two panels, "One Query" and "Two Queries"]
It is difficult to determine which of these two hypothetical results is better.
![Page 13: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/13.jpg)
Similarity measures based on VSM
Dice coefficient
Overlap coefficient
Jaccard
Cosine
Asymmetric
Dissimilarity
Other measures
PP30
![Page 14: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/14.jpg)
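Dice Coefficient-Cont’
Definition of harmonic mean: for $x_1, x_2, \dots, x_n$, the harmonic mean $E$ equals $n$ divided by $(1/x_1 + 1/x_2 + \dots + 1/x_n)$, that is,
$$E = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
For the harmonic mean ($E$) of precision ($P$) and recall ($R$):
$$E = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2}{\frac{|A|}{|A \cap B|} + \frac{|B|}{|A \cap B|}} = \frac{2\,|A \cap B|}{|A| + |B|}$$
PP30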
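As a quick numerical check, the harmonic mean of the precision and recall from the earlier retrieval example (P = 3/5, R = 3/10) can be computed both from the definition and from the set sizes. This is an illustrative sketch; the helper name `harmonic_mean` is an assumption of this example.

```python
import math


def harmonic_mean(values):
    """Harmonic mean: n divided by the sum of the reciprocals."""
    return len(values) / sum(1.0 / x for x in values)


# Precision and recall from the retrieval example:
# |A| = 5 retrieved, |B| = 10 relevant, |A ∩ B| = 3.
p, r = 3 / 5, 3 / 10

e_from_mean = harmonic_mean([p, r])   # 2 / (1/P + 1/R)
e_from_sets = 2 * 3 / (5 + 10)        # 2|A ∩ B| / (|A| + |B|)

# Both routes give approximately 0.4, up to floating-point rounding.
print(e_from_mean, e_from_sets)
print(math.isclose(e_from_mean, e_from_sets))
```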
![Page 15: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/15.jpg)
Dice Coefficient-Cont’
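Denotation of Dice Coefficient:
$$sim(q, d_j) = D(A, B) = \frac{|A \cap B|}{\alpha |A| + (1 - \alpha) |B|} = \frac{\sum_{k=1}^{n} w_{kq} w_{kj}}{\alpha \sum_{k=1}^{n} w_{kq}^2 + (1 - \alpha) \sum_{k=1}^{n} w_{kj}^2}, \quad \alpha \in (0, 1)$$
PP30
If $\alpha = \frac{1}{2}$, then
$$D(A, B) = \frac{|A \cap B|}{\frac{1}{2}(|A| + |B|)} = \frac{2\,|A \cap B|}{|A| + |B|} = E$$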
α>0.5 : precision is more important
α<0.5 : recall is more important
Usually α=0.5
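A minimal sketch of the weighted Dice coefficient on weight vectors follows. It assumes the form sim(q, d) = Σ w_kq·w_kd / (α·Σ w_kq² + (1 − α)·Σ w_kd²), with α applied to the query side; the function name `dice` and the sample vectors are illustrative only.

```python
import math


def dice(q, d, alpha=0.5):
    """Weighted Dice coefficient between two equal-length weight vectors.

    alpha weights the query term sum(q_k^2); (1 - alpha) weights sum(d_k^2).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    dot = sum(wq * wd for wq, wd in zip(q, d))
    denom = alpha * sum(wq * wq for wq in q) + (1 - alpha) * sum(wd * wd for wd in d)
    return dot / denom if denom else 0.0


q = (0, 0, 2)   # hypothetical query weights
d = (2, 3, 5)   # hypothetical document weights

# With alpha = 0.5 this reduces to 2 * dot / (sum(q^2) + sum(d^2)).
print(dice(q, d))             # 10 / 21
print(dice(q, d, alpha=0.8))  # 10 / (0.8 * 4 + 0.2 * 38)
```

Changing `alpha` only moves weight between the two squared sums in the denominator; the numerator stays the same.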
![Page 16: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/16.jpg)
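Overlap Coefficient
PP30-31
[Venn diagram over the document set: A = queries, B = documents]
$$sim(q, d_j) = O(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} = \frac{\sum_{k=1}^{n} w_{kq} w_{kj}}{\min\left(\sum_{k=1}^{n} w_{kq}^2, \sum_{k=1}^{n} w_{kj}^2\right)}$$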
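The overlap coefficient can be sketched in both its set form, |A∩B| / min(|A|, |B|), and its vector form, Σ w_kq·w_kj / min(Σ w_kq², Σ w_kj²). The function names and sample data below are assumptions made for illustration.

```python
def overlap_sets(a: set, b: set) -> float:
    """Set form: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))


def overlap_vectors(q, d) -> float:
    """Vector form: sum(q_k * d_k) / min(sum(q_k^2), sum(d_k^2))."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return dot / min(sum(wq * wq for wq in q), sum(wd * wd for wd in d))


print(overlap_sets({"t1", "t2", "t3"}, {"t2", "t3", "t4", "t5"}))  # 2 shared / min(3, 4)
print(overlap_vectors((1, 1, 0, 0), (0, 1, 1, 1)))                 # 1 / min(2, 3)
```

With binary weights the two forms agree. With non-binary weights the vector form is not guaranteed to stay within [0, 1]; for instance, q = (0, 0, 2) and d = (2, 3, 5) give 10 / 4 = 2.5.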
![Page 17: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/17.jpg)
Jaccard Coefficient-Cont’
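[Venn diagram over the document set: A = queries, B = documents]
$$sim(q, d_j) = J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\sum_{k=1}^{n} w_{kq} w_{kj}}{\sum_{k=1}^{n} w_{kq}^2 + \sum_{k=1}^{n} w_{kj}^2 - \sum_{k=1}^{n} w_{kq} w_{kj}}$$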
PP31
![Page 18: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/18.jpg)
Example- Jaccard Coefficient
D1 = 2T1 + 3T2 + 5T3, (2,3,5)
D2 = 3T1 + 7T2 + T3 , (3,7,1)
Q = 0T1 + 0T2 + 2T3, (0,0,2)
J(D1 , Q) = 10 / (38+4-10) = 10/32 = 0.31
J(D2 , Q) = 2 / (59+4-2) = 2/61 = 0.03
PP31
$$sim(q, d_j) = \frac{\sum_{k=1}^{n} w_{kq} w_{kj}}{\sum_{k=1}^{n} w_{kq}^2 + \sum_{k=1}^{n} w_{kj}^2 - \sum_{k=1}^{n} w_{kq} w_{kj}}$$
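The two Jaccard values in this example can be recomputed with a short script. It is a sketch that assumes the vector form Σ w_kq·w_kj / (Σ w_kq² + Σ w_kj² − Σ w_kq·w_kj); the function name `jaccard` is illustrative.

```python
def jaccard(q, d) -> float:
    """Vector-form Jaccard coefficient between two weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return dot / (sum(wq * wq for wq in q) + sum(wd * wd for wd in d) - dot)


D1 = (2, 3, 5)
D2 = (3, 7, 1)
Q = (0, 0, 2)

print(round(jaccard(Q, D1), 2))  # 0.31  (10 / 32)
print(round(jaccard(Q, D2), 2))  # 0.03  (2 / 61)
```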
![Page 19: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/19.jpg)
Cosine Coefficient-Cont’
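$$sim(q, d_j) = C(A, B) = \sqrt{P \cdot R} = \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}} = \frac{q \cdot d_j}{|q| \, |d_j|} = \frac{\sum_{k=1}^{n} w_{kq} w_{kj}}{\sqrt{\sum_{k=1}^{n} w_{kq}^2 \cdot \sum_{k=1}^{n} w_{kj}^2}}$$
PP31-32
[Figure: the query vector (q1, q2, …, qn) and the document vectors (d11, d12, …, d1n) and (d21, d22, …, d2n)]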
![Page 20: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/20.jpg)
Example-Cosine Coefficient
Q = 0T1 + 0T2 + 2T3, (0,0,2)
D1 = 2T1 + 3T2 + 5T3, (2,3,5)
D2 = 3T1 + 7T2 + T3, (3,7,1)
$$C(D_1, Q) = \frac{2 \cdot 0 + 3 \cdot 0 + 5 \cdot 2}{\sqrt{(2^2 + 3^2 + 5^2)(0^2 + 0^2 + 2^2)}} = \frac{10}{\sqrt{38 \cdot 4}} = 0.81$$
$$C(D_2, Q) = \frac{2}{\sqrt{59 \cdot 4}} = 0.13$$
PP31-32
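The cosine values for this example can be reproduced as follows. This is a sketch using only the standard library; the function name `cosine` is an illustrative choice.

```python
import math


def cosine(q, d) -> float:
    """Cosine coefficient: dot product divided by the product of vector lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(wq * wq for wq in q) * sum(wd * wd for wd in d))
    return dot / norm if norm else 0.0


Q = (0, 0, 2)
D1 = (2, 3, 5)
D2 = (3, 7, 1)

print(round(cosine(Q, D1), 2))  # 0.81  (10 / sqrt(38 * 4))
print(round(cosine(Q, D2), 2))  # 0.13  (2 / sqrt(59 * 4))
```

Both the Jaccard and the cosine coefficient rank D1 above D2 for this query.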
![Page 21: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/21.jpg)
Asymmetric
PP31
$$sim(q, d_j) = A(q, d_j) = \frac{\sum_{k=1}^{n} \min(w_{kq}, w_{kj})}{\sum_{k=1}^{n} w_{kq}}$$
[Figure: d_i → d_j, w_ki → w_kj]
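A small sketch shows why this measure is called asymmetric: swapping the two arguments changes the denominator. It assumes the form Σ min(w_kq, w_kj) / Σ w_kq with non-negative weights; the function name and sample vectors are illustrative.

```python
def asymmetric(q, d) -> float:
    """Asymmetric similarity: sum of element-wise minima, normalised by sum(q)."""
    total_q = sum(q)
    if total_q == 0:
        return 0.0
    return sum(min(wq, wd) for wq, wd in zip(q, d)) / total_q


q = (0, 0, 2)   # hypothetical query weights
d = (2, 3, 5)   # hypothetical document weights

print(asymmetric(q, d))  # 1.0  (2 / 2: the query is fully covered by the document)
print(asymmetric(d, q))  # 0.2  (2 / 10: the document is only partly covered by the query)
```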
![Page 22: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/22.jpg)
Euclidean distance
$$dis(q, d_j) = d_E(q, d_j) = \sqrt{\sum_{k=1}^{n} (w_{kq} - w_{kj})^2}$$
PP32
![Page 23: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/23.jpg)
Manhattan block distance
$$dis(q, d_j) = d_M(q, d_j) = \sum_{k=1}^{n} |w_{kq} - w_{kj}|$$
PP32
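The two distance (dissimilarity) measures can be sketched as below, reusing the vectors from the Jaccard and cosine examples. Unlike the similarity coefficients, a smaller value here means a closer match. The function names are illustrative.

```python
import math


def euclidean(q, d) -> float:
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((wq - wd) ** 2 for wq, wd in zip(q, d)))


def manhattan(q, d) -> float:
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(wq - wd) for wq, wd in zip(q, d))


Q = (0, 0, 2)
D1 = (2, 3, 5)
D2 = (3, 7, 1)

print(euclidean(Q, D1), manhattan(Q, D1))  # sqrt(22), 8
print(euclidean(Q, D2), manhattan(Q, D2))  # sqrt(59), 11
```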
![Page 24: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/24.jpg)
Other Measures
We may use prior/contextual knowledge.
For example: Sim(q, dj) = [content identifier similarity] + [objective term similarity] + [citation similarity]
PP32
![Page 25: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/25.jpg)
Comparison
PP34
![Page 26: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/26.jpg)
Comparison
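Simple matching: $|A \cap B|$
Dice’s Coefficient: $D = \frac{2\,|A \cap B|}{|A| + |B|}$
Cosine Coefficient: $C = \frac{|A \cap B|}{|A|^{1/2} \cdot |B|^{1/2}}$
Overlap Coefficient: $O = \frac{|A \cap B|}{\min(|A|, |B|)}$
Jaccard’s Coefficient: $J = \frac{|A \cap B|}{|A \cup B|}$
|A|+|B|-|A∩B| ≥ (|A|+|B|)/2
|A| ≥ |A∩B|, |B| ≥ |A∩B|
(|A|+|B|)/2 ≥ √(|A|*|B|)
√(|A|*|B|) ≥ min(|A|, |B|)
O ≥ C ≥ D ≥ J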
PP34
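The ordering O ≥ C ≥ D ≥ J can be checked numerically on the set forms of the four coefficients. The sketch below reuses the retrieved and relevant sets from the precision and recall example; the function name `coefficients` is an assumption of this illustration.

```python
import math


def coefficients(a: set, b: set) -> dict:
    """Set forms of the overlap, cosine, Dice and Jaccard coefficients."""
    inter = len(a & b)
    return {
        "O": inter / min(len(a), len(b)),
        "C": inter / math.sqrt(len(a) * len(b)),
        "D": 2 * inter / (len(a) + len(b)),
        "J": inter / len(a | b),
    }


retrieved = set("BDFWY")        # |A| = 5
relevant = set("ABCDEFGHIJ")    # |B| = 10, |A ∩ B| = 3

c = coefficients(retrieved, relevant)
print(c)
print(c["O"] >= c["C"] >= c["D"] >= c["J"])  # True
```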
![Page 27: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/27.jpg)
Example-Documents-Term-Query-Cont’
D1: A search engine for 3D models
D2: Design and implementation of a string database query language
D3: Ranking of documents by measures considering conceptual dependence between terms
D4: Exploiting hierarchical domain structure to compute similarity
D5: An approach for measuring semantic similarity between words using multiple information sources
D6: Determining semantic similarity among entity classes from different ontologies
D7: Strong similarity measures for ordered sets of documents in information retrieval
T1: search(ing), T2: engine(s), T3: models, T4: database, T5: query, T6: language, T7: documents, T8: measur(es, ing), T9: conceptual, T10: dependence, T11: domain, T12: structure, T13: similarity, T14: semantic, T15: ontologies, T16: information, T17: retrieval
Query: Semantic similarity measures used by search engines and other information searching mechanisms
PP33
![Page 28: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/28.jpg)
Example-Term-Document Matrix-Cont’
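| | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15 | T16 | T17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| D5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| D6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| D7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| Q | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |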
Matrix[q][A]
PP34
![Page 29: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/29.jpg)
Dice coefficient
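$$D(D_1, q) = \frac{\sum_{k=1}^{n} w_{kq} w_{k1}}{\frac{1}{2} \sum_{k=1}^{n} w_{kq}^2 + \frac{1}{2} \sum_{k=1}^{n} w_{k1}^2} = \frac{2 \sum_{k=1}^{n} w_{kq} w_{k1}}{\sum_{k=1}^{n} w_{kq}^2 + \sum_{k=1}^{n} w_{k1}^2}$$
$$= \frac{2 \cdot (2 \cdot 1 + 1 \cdot 1 + 0 \cdot 1 + 0 \cdot 0 + \dots + 0 \cdot 0)}{(2 \cdot 2 + 1 \cdot 1 + \dots + 0 \cdot 0) + (1 \cdot 1 + 1 \cdot 1 + \dots + 0 \cdot 0)} = \frac{2 \cdot (2 + 1)}{9 + 3} = \frac{6}{12} = 0.5$$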
PP30, PP34
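The Dice computation for D1 can be repeated for every document with a short script. The term positions used below are an assumption: they are reconstructed from the document titles, the term list T1 to T17, and the query text, with term T1 counted twice in the query ("search" and "searching"). The helper names `vec` and `dice` are illustrative.

```python
N_TERMS = 17


def vec(weights: dict) -> list:
    """Build a 17-element weight vector from {term number: weight} (1-based)."""
    v = [0] * N_TERMS
    for term, weight in weights.items():
        v[term - 1] = weight
    return v


def dice(q, d) -> float:
    """Dice coefficient with alpha = 1/2: 2 * dot / (sum(q^2) + sum(d^2))."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return 2 * dot / (sum(wq * wq for wq in q) + sum(wd * wd for wd in d))


# Term-document matrix, reconstructed from the titles (assumption).
docs = {
    "D1": vec({1: 1, 2: 1, 3: 1}),
    "D2": vec({4: 1, 5: 1, 6: 1}),
    "D3": vec({7: 1, 8: 1, 9: 1, 10: 1}),
    "D4": vec({11: 1, 12: 1, 13: 1}),
    "D5": vec({8: 1, 13: 1, 14: 1, 16: 1}),
    "D6": vec({13: 1, 14: 1, 15: 1}),
    "D7": vec({7: 1, 8: 1, 13: 1, 16: 1, 17: 1}),
}
query = vec({1: 2, 2: 1, 8: 1, 13: 1, 14: 1, 16: 1})

print(dice(query, docs["D1"]))  # 0.5, matching the hand calculation for D1

# Rank all documents by decreasing Dice similarity to the query.
for name in sorted(docs, key=lambda n: dice(query, docs[n]), reverse=True):
    print(name, round(dice(query, docs[name]), 3))
```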
![Page 30: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/30.jpg)
Final Results
PP34
O≥C≥D≥J
![Page 31: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/31.jpg)
Current Applications
Multi-Dimensional Modeling
Hierarchical Clustering
Bioinformatics
PP35-38
![Page 32: similarity measure](https://reader035.vdocument.in/reader035/viewer/2022062220/554f5466b4c905524c8b5092/html5/thumbnails/32.jpg)
Discussion