Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles
Badenes, Carlos (cbadenes)
Garijo, Daniel (dgarijo)
Priyatna, Freddy (fpriyatna)
{*}@fi.upm.es
EDBT Summer School 2015
Problem
Similarity between Wikipedia Articles
Wikipedia Article:
text
links
categories
Hypothesis
Wikipedia Article:
text -> simTxt (weight α)
links -> simLinks (weight β)
categories -> simCtg (weight γ)
simWA(R1,R2) = α·simTxt(R1,R2) + β·simLinks(R1,R2) + γ·simCtg(R1,R2)
where α + β + γ = 1
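The weighted combination above can be sketched in a few lines of Python. This is only an illustration of the formula: the function name `sim_wa`, its parameter names, and the input values are hypothetical and are not taken from the authors' repository.

```python
import math


def sim_wa(sim_txt, sim_links, sim_ctg, alpha, beta, gamma):
    """Weighted sum of the text, link, and category similarities."""
    # The hypothesis requires the three weights to sum to 1.
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * sim_txt + beta * sim_links + gamma * sim_ctg


# Hypothetical inputs, chosen only to exercise the formula.
score = sim_wa(0.5, 0.2, 0.1, alpha=0.5, beta=0.3, gamma=0.2)
print(round(score, 3))
```

Because the weights sum to 1 and each component lies in [0, 1], the combined score also stays in [0, 1].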
Similarity based on Text
Latent Dirichlet Allocation
Topics: TOPIC_1, TOPIC_2, …, TOPIC_n
Ri: p = [0.5, 0.3, .., 0.7]
Rj: q = [0.2, 0.4, .., 0.9]
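The slide does not say how the two topic vectors are compared. One common choice for LDA output is a score based on the Jensen-Shannon divergence. The sketch below assumes that measure. The function names and the three-topic vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(v)
    return [x / total for x in v]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base 2), a value in [0, 1]."""
    p, q = normalize(p), normalize(q)
    m = [(a + b) / 2 for a, b in zip(p, q)]  # midpoint distribution
    return 1.0 - 0.5 * kl_divergence(p, m) - 0.5 * kl_divergence(q, m)


# Hypothetical topic weights for two articles over three topics.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.4, 0.4]
print(js_similarity(p, q))
```

Identical distributions score 1 and distributions with no shared topics score 0, which matches the range of the other two similarity components.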
Similarity based on Categories
Articles with multiple common categories are likely to be similar
Noise filtering is necessary (e.g., “All articles lacking in-text citations”). See https://github.com/cbadenes/siminwikart-challenge4/blob/master/category/wikipedia_bad_categories.txt
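A minimal sketch of category filtering followed by an overlap score. The slide does not give the category formula, so the score is assumed here to be a simple overlap ratio: shared categories divided by the average number of categories per article. The blocklist contains only the example named above, and the article categories are hypothetical.

```python
# Only the example from the slide; the full list lives in the repository file.
BAD_CATEGORIES = {"All articles lacking in-text citations"}


def filter_categories(categories, bad=BAD_CATEGORIES):
    """Drop maintenance categories that say nothing about the topic."""
    return {c for c in categories if c not in bad}


def sim_ctg(cats_a, cats_b):
    """Assumed overlap ratio between two filtered category sets."""
    a, b = filter_categories(cats_a), filter_categories(cats_b)
    if not a and not b:
        return 0.0  # nothing left to compare
    return len(a & b) / ((len(a) + len(b)) / 2)


# Hypothetical category sets for two articles.
cats_a = {"Living people", "Category X", "All articles lacking in-text citations"}
cats_b = {"Living people", "All articles lacking in-text citations"}
print(round(sim_ctg(cats_a, cats_b), 3))  # 0.667
```

Without the filter, the shared maintenance category would count as a second common category and inflate the score from about 0.667 to 0.8.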
Similarity based on Links
Sim(A,B) = |links(A) ∩ links(B)| / ((|links(A)| + |links(B)|) / 2)
Example: 2 common links, with 5 links in A and 3 links in B: 2 / ((5 + 3) / 2) = 0.5
Articles with multiple common links are likely to be similar
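The link score can be sketched as follows, reading the denominator as the average of the two link counts, as the 2/((5+3)/2) example suggests. The function name and the link identifiers are hypothetical.

```python
def sim_links(links_a, links_b):
    """Shared links divided by the average number of links per article."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0  # two articles without links share nothing
    return len(a & b) / ((len(a) + len(b)) / 2)


# Hypothetical link sets: 5 links, 3 links, 2 in common.
links_a = {"l1", "l2", "l3", "l4", "l5"}
links_b = {"l1", "l2", "l6"}
print(sim_links(links_a, links_b))  # 0.5
```

Read this way, the score is the Dice coefficient of the two link sets: it is 1 when both articles have exactly the same links and 0 when they share none.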
Proof of Concept
Articles: Fernando Alonso, Lionel Messi, Iker Casillas, Princess Akiko
Weights: α = 0.6 (simTxt), β = 0.2 (simLinks), γ = 0.2 (simCtg)
Combined similarity (simWA) for each of the six article pairs:
[1] 0.062, [3] 0.075
[1] 0.666, [3] 0.683
[1] 0.058, [3] 0.069
[1] 0.043, [3] 0.072
[1] 0.019, [3] 0.023
[1] 0.068, [3] 0.069
Component similarities for each pair (the order on the slide does not match the list above):
simTxt = 0.059, simLinks = 0.019, simCtg = [1] 0.117, [3] 0.181
simTxt = 0.065, simLinks = 0.0, simCtg = [1] 0.095, [3] 0.161
simTxt = 0.052, simLinks = 0.019, simCtg = [1] 0.166, [3] 0.172
simTxt = 0.980, simLinks = 0.175, simCtg = [1] 0.217, [3] 0.302
simTxt = 0.060, simLinks = 0.008, simCtg = [1] 0.030, [3] 0.172
simTxt = 0.069, simLinks = 0.004, simCtg = [1] 0.080, [3] 0.134
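As a consistency check, the highest-scoring pair (simTxt = 0.980) can be recomputed from its components. The calculation assumes a weight of 0.6 on simTxt and 0.2 each on simLinks and simCtg. That assignment is inferred from the reported numbers rather than stated separately in the slides.

```latex
\begin{align*}
\mathrm{simWA}_{[1]} &= 0.6 \cdot 0.980 + 0.2 \cdot 0.175 + 0.2 \cdot 0.217
                      = 0.588 + 0.035 + 0.0434 = 0.6664 \\
\mathrm{simWA}_{[3]} &= 0.6 \cdot 0.980 + 0.2 \cdot 0.175 + 0.2 \cdot 0.302
                      = 0.588 + 0.035 + 0.0604 = 0.6834
\end{align*}
```

Both values agree with the reported 0.666 and 0.683 to three decimals, and the text component contributes almost all of the score for this pair.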
Comparison
Lionel Messi vs. Princess Akiko
simTxt = 0.060 -> <common words>
simLinks = 0.008 -> (England, Buenos_Aires, Chile, Madrid, Argentina)
simCtg = [1] 0.030 -> living_person
Proposal
[Figure: two panels, "Graph based on Links" and "Graph based on Similarities"; the edge weights shown range from 0.29 to 0.88]
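One way to picture a graph based on similarities is as a set of weighted edges between article pairs. The sketch below is an assumption about how such a graph could be built, not the authors' method: the `threshold` parameter, the article names, the link sets, and the use of a link-overlap score as the edge weight are all hypothetical.

```python
from itertools import combinations


def similarity_graph(articles, sim, threshold=0.0):
    """Return {(a, b): weight} for every pair scoring above the threshold."""
    edges = {}
    for a, b in combinations(sorted(articles), 2):
        weight = sim(a, b)
        if weight > threshold:
            edges[(a, b)] = weight
    return edges


# Hypothetical articles with hypothetical outgoing links.
LINKS = {
    "A": {"x", "y", "z"},
    "B": {"x", "y"},
    "C": {"w"},
}


def link_overlap(a, b):
    """Shared links divided by the average number of links per article."""
    la, lb = LINKS[a], LINKS[b]
    return len(la & lb) / ((len(la) + len(lb)) / 2)


print(similarity_graph(LINKS, link_overlap))  # {('A', 'B'): 0.8}
```

Any pairwise score can be passed as `sim`, so the same function would work with the combined simWA score in place of the link overlap.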
Problem
Wikipedia links reliability (missing links)
Wikipedia Article:
text
links
categories
Further Refinement
Similarities between categories (as topics) can define relations between articles
Subgraph Pattern Matching + Topic Model
[Figure: two panels, "Graph based on Links" and "Graph based on Similarities"; the edge weights shown range from 0.29 to 0.88]
Code
https://github.com/cbadenes/siminwikart-challenge4