Text Similarity & Clustering

Qinpei Zhao, 15 Feb 2011
Outline
- String matching metrics
- Implementation and applications
- Online resources
- Location-based clustering
String Matching Metrics
Exact String Matching
Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
Example: T = “AGCTTGA”, P = “GCT”

Applications:
- Searching for keywords in a file
- Search engines (like Google)
- Database searching
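The exact-matching problem above can be sketched with a naive scan over every alignment (illustrative only; the function name is mine, and real search engines use faster algorithms such as KMP or Boyer-Moore):

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text,
    checking each alignment in turn (O(nm) worst case)."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

print(find_all("AGCTTGA", "GCT"))  # [1]
```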
Approximate String Matching
Determine whether a text string T of length n and a pattern string P of length m "partially" match. Consider the string "approximate". Which of these are partial matches?

aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate

A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2). A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa). The former case is a substitution (revision) difference; the latter is a delete or insert difference. What about two characters that appear out of position? For instance, approximate vs. apporximate?
Approximate String Matching

Example: the misspelled query "Schwarrzenger" should still match "Schwarzenegger" in a candidate list such as Keanu Reeves, Samuel Jackson, Schwarzenegger, ...

Query errors:
- Limited knowledge about the data
- Typos
- Limited input devices (e.g. cell phone input)

Data errors:
- Typos
- Web data
- OCR errors

Applications:
- Spellchecking
- Query relaxation
- ...

Similarity functions:
- Edit distance
- Q-gram
- Cosine
- ...
Edit distance (Levenshtein distance)
Given two strings T and P, the edit distance is the minimum number of substitutions, insertions, and deletions needed to transform T into P.

Time complexity with dynamic programming: O(mn)
Edit distance by dynamic programming (Wagner-Fischer, 1974). Example: T = "temp", P = "tmp":

|   |   | t | m | p |
|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 |
| t | 1 | 0 | 1 | 2 |
| e | 2 | 1 | 1 | 2 |
| m | 3 | 2 | 1 | 2 |
| p | 4 | 3 | 2 | 1 |

Dynamic programming: m[i][j] = min{ m[i-1][j] + 1, m[i][j-1] + 1, m[i-1][j-1] + d(i,j) }, where d(i,j) = 0 if T[i] = P[j], and d(i,j) = 1 otherwise. The bottom-right cell gives ed("temp", "tmp") = 1 (delete the 'e').
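The recurrence above translates directly into code; a minimal sketch (the function name is mine):

```python
def edit_distance(t: str, p: str) -> int:
    """Levenshtein distance via the O(mn) dynamic program."""
    n, m = len(t), len(p)
    # dp[i][j] = edit distance between t[:i] and p[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all of t[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all of p[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if t[i - 1] == p[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + d)  # substitution
    return dp[n][m]

print(edit_distance("temp", "tmp"))  # 1
```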
Q-grams
A string is decomposed into overlapping substrings (grams) of a fixed length q. Example: the 2-grams of "bingo" (with # padding) are #b, bi, in, ng, go, o#.

If ed(T, P) <= k, then:

# of common grams >= # of T's grams - k * q
Q-grams
T = "bingo", P = "going"
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}

Unique(gram1, gram2) = {#b, bi, in, ng, go, o#, #g, oi, g#}
gram1.length = (T.length + (q - 1) * 2 + 1) - q = 6
gram2.length = (P.length + (q - 1) * 2 + 1) - q = 6
L = gram1.length + gram2.length = 12
Similarity = (L - |non-common grams|) / L = 2 * |common grams| / L = 6 / 12 = 0.5
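The padded-gram construction and similarity above can be sketched as follows (function names are mine; grams are counted as multisets):

```python
from collections import Counter

def qgrams(s: str, q: int = 2) -> list[str]:
    """Pad with q-1 '#' characters on each side, then slide a window of length q."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def qgram_similarity(t: str, p: str, q: int = 2) -> float:
    """2 * |common grams| / (|grams of t| + |grams of p|)."""
    g1, g2 = Counter(qgrams(t, q)), Counter(qgrams(p, q))
    common = sum((g1 & g2).values())          # multiset intersection
    total = sum(g1.values()) + sum(g2.values())
    return 2 * common / total

print(qgram_similarity("bingo", "going"))  # 0.5
```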
Cosine similarity
For two vectors A and B, the cosine of the angle θ between them is computed from the dot product and magnitudes: cos θ = (A · B) / (||A|| ||B||)

Implementation: Cosine similarity = (Common Terms) / (sqrt(Number of terms in String1) * sqrt(Number of terms in String2))
Cosine similarity
T = "bingo right", P = "going right"
T1 = {bingo, right}, P1 = {going, right}

L1 = unique(T1).length = 2
L2 = unique(P1).length = 2
Unique(T1 & P1) = {bingo, right, going}, L3 = 3
Common terms = (L1 + L2) - L3 = 1

Similarity = common terms / (sqrt(L1) * sqrt(L2)) = 1 / 2 = 0.5
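The set-based computation above can be sketched as follows (the function name is mine; tokens are whitespace-split words, as in the slide example):

```python
import math

def cosine_similarity(t: str, p: str) -> float:
    """Common terms divided by sqrt(|terms1|) * sqrt(|terms2|)."""
    a, b = set(t.split()), set(p.split())
    common = len(a & b)
    # sqrt(L1) * sqrt(L2) == sqrt(L1 * L2)
    return common / math.sqrt(len(a) * len(b))

print(cosine_similarity("bingo right", "going right"))  # 0.5
```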
Dice coefficient
Similar to cosine similarity:

Dice's coefficient = (2 * Common Terms) / (Number of terms in String1 + Number of terms in String2)
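A matching sketch for Dice's coefficient on the same word-set example (the function name is mine):

```python
def dice_coefficient(t: str, p: str) -> float:
    """2 * |common terms| / (|terms1| + |terms2|), over whitespace-split words."""
    a, b = set(t.split()), set(p.split())
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_coefficient("bingo right", "going right"))  # 0.5
```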
Implementation & Applications
Similarity metrics
- Edit distance
- Q-gram
- Cosine distance
- Dice coefficient
- ...

Similarity between two strings: Demo
| Compared strings | Edit distance (%) | Q-grams, q=2 (%) | Q-grams, q=3 (%) | Q-grams, q=4 (%) | Cosine distance (%) |
|---|---|---|---|---|---|
| Pizza Express Café vs. Pizza Express | 72 | 78.79 | 74.29 | 70.27 | 81.65 |
| Lounasravintola Pinja Ky – Ravintoloita vs. Lounasravintola Pinja | 54 | 67.74 | 67.19 | 65.15 | 63.25 |
| Kioski Piirakkapaja vs. Kioski Marttakahvio | 47 | 45.00 | 33.33 | 31.82 | 50.00 |
| Kauppa Kulta Keidas vs. Kauppa Kulta Nalle | 68 | 66.67 | 63.41 | 60.47 | 66.67 |
| Ravintola Beer Stop Pub vs. Baari, Beer Stop R-kylä | 39 | 41.67 | 36.00 | 30.77 | 50.00 |
| Ravintola Beer Stop Pub vs. Baari, Wanha Mestari R-kylä | 19 | 7.69 | 0.00 | 0.00 | 0.00 |
| Ravintola Foxie s Bar Siirry hakukenttään vs. Baari, Foxie Karsikko | 31 | 25.00 | 15.15 | 11.76 | 23.57 |
| Play baari vs. Ravintola Bar Play – Ravintoloita | 21 | 31.11 | 17.02 | 8.16 | 31.62 |
Applications in MOPSI
- Duplicate record cleaning
- Spelling check (e.g. "communication" vs. "comunication")
- Query relevance/expansion
- Text-level annotation recommendation
- Keyword clustering
- MOPSI search engine
Annotation recommendation
String clustering

- The similarity between every string pair is calculated as a basis for determining the clusters
- The vector model is used for clustering
- A similarity measure is required to calculate the similarity between two strings
String clustering (Cont.)
The final step in creating clusters is to determine when two objects (words) are in the same cluster:

- Hierarchical agglomerative clustering (HAC): start with unclustered items and perform pairwise similarity measures to determine the clusters
- Hierarchical divisive clustering: start with one cluster and break it down into smaller clusters
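The agglomerative (bottom-up) variant can be sketched as follows. This is a minimal single-linkage HAC over a word-set Dice similarity; the similarity choice, threshold value, and all names are my assumptions, not the slide's specification:

```python
def dice(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def hac(strings: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Start with singleton clusters; repeatedly merge the most similar
    pair (single linkage) until no pair reaches the threshold."""
    clusters = [[s] for s in strings]
    while True:
        best, bi, bj = threshold, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(dice(a, b) for a in clusters[i] for b in clusters[j])
                if sim >= best:
                    best, bi, bj = sim, i, j
        if bi < 0:                       # no pair reaches the threshold
            return clusters
        clusters[bi] += clusters.pop(bj)  # merge the best pair

print(hac(["pizza express", "pizza express cafe", "beer stop pub"]))
```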
Objectives of Hierarchy of Clusters
- Reduce the overhead of search: perform top-down searches of the centroids of the clusters in the hierarchy and trim branches that are not relevant
- Provide a visual representation of the information space: visual cues on the size of clusters (size of ellipse) and the strength of the linkage between clusters (dashed line, solid line, ...)
- Expand the retrieval of relevant items: a user, once having identified an item of interest, can request to see other items in the cluster; the user can increase the specificity of items by going to child clusters, or increase the generality by going to the parent cluster
Keyword clustering (semantic)
Thesaurus-based: WordNet, with an advanced web interface to browse the WordNet database.

Thesauri are not available for every language, e.g. Finnish.
Resources
Useful resources
Similarity metrics (http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html )
Similarity metrics (javascript) (http://cs.joensuu.fi/~zhao/Link/ )
Flamingo package (http://flamingo.ics.uci.edu/releases/4.0/ )
WordNet (http://wordnet.princeton.edu/wordnet/related-projects/ )
Location-based clustering
DBSCAN: density-based clustering (KDD'96)

Parameters:
- MinPts
- eps

Time complexity:
- O(log n) per neighbourhood query (getNeighbours, with a spatial index)
- O(n log n) total

Advantages:
- Finds clusters of arbitrary shape
- Noise is handled explicitly
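A compact DBSCAN sketch on 2-D points. It uses brute-force neighbour search, so it runs in O(n²) rather than the O(n log n) a spatial index gives; the function and parameter names are mine:

```python
def dbscan(points, eps: float, min_pts: int):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    labels = [None] * n

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (x, y) in enumerate(points)
                if (x - xi) ** 2 + (y - yi) ** 2 <= eps ** 2]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:       # not a core point
            labels[i] = -1             # noise (may become a border point later)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                   # expand the cluster from core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # noise reached by a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:     # j is also a core point: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2, min_pts=2))  # [0, 0, 0, 1, 1, -1]
```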
DBSCAN result
Joensuu: 29.76, 62.60; Helsinki: 24, 60
Gaussian Mixture Model
Maximum likelihood estimation (via the Expectation-Maximization algorithm)

Parameters required:
- Number of components
- Number of iterations

Advantages:
- Probabilistic (fuzzy) cluster membership
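A minimal 1-D EM sketch for a Gaussian mixture, run for a fixed number of iterations as the slide's parameters suggest. The initialization scheme and all names are my assumptions, not the presentation's implementation:

```python
import math

def em_gmm_1d(xs, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture by EM."""
    lo, hi = min(xs), max(xs)
    mu = [lo + (hi - lo) * j / (k - 1) for j in range(k)]  # spread initial means
    var = [1.0] * k
    w = [1.0 / k] * k                                      # mixing weights
    for _ in range(iters):
        # E-step: responsibility r[i][j] = P(component j | x_i)
        r = []
        for x in xs:
            p = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(p)
            r.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances from responsibilities
        for j in range(k):
            nj = sum(ri[j] for ri in r)
            w[j] = nj / len(xs)
            mu[j] = sum(ri[j] * x for ri, x in zip(r, xs)) / nj
            var[j] = max(sum(ri[j] * (x - mu[j]) ** 2
                             for ri, x in zip(r, xs)) / nj, 1e-6)  # avoid collapse
    return w, mu, var

xs = [0.1, -0.2, 0.0, 9.8, 10.1, 10.0]
w, mu, var = em_gmm_1d(xs)
print([round(m, 2) for m in mu])  # two means, one near 0 and one near 10
```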
GMM results. Joensuu: 29.76, 62.60; Helsinki: 24, 60
Thanks!