![Page 1: Minimum Spanning Trees Displaying Semantic Similarity Włodzisław Duch & Paweł Matykiewicz Department of Informatics, UMK Toruń School of Computer Engineering,](https://reader033.vdocument.in/reader033/viewer/2022051517/56649d555503460f94a334f1/html5/thumbnails/1.jpg)
Minimum Spanning Trees Displaying Semantic Similarity
Włodzisław Duch & Paweł Matykiewicz
Department of Informatics, UMK Toruń
School of Computer Engineering, NTU Singapore
Cincinnati Children’s Hospital Research Foundation, OH, USA
Google: Duch
The Problem
Finding people who share some of our interests in large organizations or worldwide is difficult.
Analyzing people’s homepages and their lists of publications is a good way to find groups and individuals sharing common scientific interests.
Maps should display individuals and groups. The structure of the graphical representation depends strongly on the selection of keywords or on the dimensionality reduction.
The Data
Reuters-21578 datasets, with 5 categories and 1–176 elements per category.
124 Personal Web Pages of the School of Electrical and Electronic Engineering (EEE) of the Nanyang Technological University (NTU) in Singapore, with 5 categories (control, microelectronics, information, circuit, power), and 14 – 41 documents per category.
Document-word matrix
Document1: word1 word2 word3. word4 word3 word5.
Document2: word1 word3 word5. word1 word3 word6.
The matrix: documents × word frequencies
$$F = \begin{pmatrix} 1 & 1 & 2 & 1 & 1 & 0 \\ 2 & 0 & 2 & 0 & 1 & 1 \end{pmatrix}$$
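The counting step behind F can be sketched in a few lines of Python. This is only an illustration: the function name `document_word_matrix` and the tokenization (lowercased runs of word characters, vocabulary in order of first appearance) are assumptions, not taken from the slides.

```python
import re

def document_word_matrix(documents):
    """Return (vocabulary, F), where F[i][j] counts word j in document i."""
    # Assumed tokenization: lowercase runs of word characters.
    tokenized = [re.findall(r"\w+", text.lower()) for text in documents]
    # Vocabulary in order of first appearance across all documents.
    vocabulary = list(dict.fromkeys(w for doc in tokenized for w in doc))
    index = {w: j for j, w in enumerate(vocabulary)}
    F = [[0] * len(vocabulary) for _ in tokenized]
    for i, doc in enumerate(tokenized):
        for w in doc:
            F[i][index[w]] += 1
    return vocabulary, F

docs = [
    "word1 word2 word3. word4 word3 word5.",
    "word1 word3 word5. word1 word3 word6.",
]
vocabulary, F = document_word_matrix(docs)
print(F)  # [[1, 1, 2, 1, 1, 0], [2, 0, 2, 0, 1, 1]]
```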
Methods used
Inverse document frequency and term weighting.
Simple selection of relevant terms.
Latent Semantic Analysis (LSA) for dimensionality reduction.
Minimum Spanning Trees for visual representation.
TouchGraph XML visualization of MST trees.
Data Preparation
Normalize the columns of F by dividing each by its highest word frequency:
$$tf_{ij} = f_{ij} / \max_i f_{ij}$$
Among n documents, term j occurs in $d_j$ documents; the inverse document frequency $idf_j$ measures the uniqueness of term j:
$$idf_j = \log_2 (n / d_j) + 1, \quad d_j > 0$$
tf × idf term weights:
$$w_{ij} = tf_{ij} \cdot idf_j$$
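As a rough sketch, the weighting can be written directly from these definitions, reading $d_j$ as the number of documents that contain term j. The function name `tfidf_weights` is hypothetical, and terms that occur in no document are simply left at zero weight, which is an assumption.

```python
import math

def tfidf_weights(F):
    """Compute w_ij = tf_ij * idf_j from a document-word count matrix F."""
    n, m = len(F), len(F[0])
    W = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col_max = max(F[i][j] for i in range(n))          # highest frequency of term j
        d_j = sum(1 for i in range(n) if F[i][j] > 0)     # documents containing term j
        if d_j == 0:
            continue  # assumption: unused terms keep zero weight
        idf_j = math.log2(n / d_j) + 1
        for i in range(n):
            W[i][j] = F[i][j] / col_max * idf_j
    return W

F = [[1, 1, 2, 1, 1, 0],
     [2, 0, 2, 0, 1, 1]]
W = tfidf_weights(F)
print(W)  # [[0.5, 2.0, 1.0, 2.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0, 1.0, 2.0]]
```

Terms that appear in only one of the two documents get idf = 2, while terms shared by both get idf = 1.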
Simple selection
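Simple selection: take $w_{ij}$ weights above a certain threshold $\theta_j$, binarize (Θ is the step function) and remove zero rows:
$$h_{ij} = \Theta(w_{ij} - \theta_j)$$
Calculate similarity using the cosine measure:
$$s_{ij} = \frac{\sum_k h_{ik} h_{jk}}{\sqrt{\sum_k h_{ik}^2} \, \sqrt{\sum_k h_{jk}^2}}$$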
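A minimal sketch of these two steps follows. The function names, the example weights and the threshold value 0.75 are all illustrative assumptions; the slides do not say how thresholds are chosen.

```python
import math

def simple_selection(W, thresholds):
    """Binarize: h_ij = 1 if w_ij exceeds the threshold of term j, else 0.
    All-zero rows are dropped; the indices of the kept rows are returned too."""
    H, kept = [], []
    for i, row in enumerate(W):
        h = [1 if w > t else 0 for w, t in zip(row, thresholds)]
        if any(h):
            H.append(h)
            kept.append(i)
    return H, kept

def cosine_similarity(H):
    """Pairwise cosine similarity between the rows of H (no zero rows)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in H]
    n = len(H)
    return [[sum(a * b for a, b in zip(H[i], H[j])) / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]

# Illustrative tf x idf weights for two documents and six terms.
W = [[0.5, 2.0, 1.0, 2.0, 1.0, 0.0],
     [1.0, 0.0, 1.0, 0.0, 1.0, 2.0]]
H, kept = simple_selection(W, thresholds=[0.75] * 6)  # hypothetical threshold
S = cosine_similarity(H)
print(H)        # [[0, 1, 1, 1, 1, 0], [1, 0, 1, 0, 1, 1]]
print(S[0][1])  # 0.5
```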
Dimensionality reduction
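Latent Semantic Analysis (LSA): use Singular Value Decomposition on the weight matrix W:
$$W = U \Lambda V^T$$
with U = eigenvectors of $WW^T$ and V of $W^TW$.
Remove small singular values (the diagonal of Λ), recreate the reduced W and calculate similarity between its rows $W_i$, $W_j$:
$$s_{ij} = \frac{W_i \cdot W_j}{\|W_i\| \, \|W_j\|}$$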
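The reduction can be sketched with NumPy's SVD. The function name `lsa_similarity` and the `keep` parameter are assumptions: `keep` is read here as the fraction of the rank of W that is retained, which is one plausible way to decide what counts as "small". The sketch also assumes that no row of the reduced matrix is all zeros.

```python
import numpy as np

def lsa_similarity(W, keep=0.8):
    """Truncate the SVD of W and return (reduced W, cosine similarity of rows)."""
    W = np.asarray(W, dtype=float)
    U, sing, Vt = np.linalg.svd(W, full_matrices=False)
    rank = int(np.sum(sing > 1e-12))           # numerical rank of W
    k = max(1, int(round(keep * rank)))        # number of components kept (assumed rule)
    W_red = (U[:, :k] * sing[:k]) @ Vt[:k, :]  # rank-k reconstruction of W
    norms = np.linalg.norm(W_red, axis=1)
    S = (W_red @ W_red.T) / np.outer(norms, norms)
    return W_red, S

W = [[0.5, 2.0, 1.0, 2.0, 1.0, 0.0],
     [1.0, 0.0, 1.0, 0.0, 1.0, 2.0]]
W_full, S_full = lsa_similarity(W, keep=1.0)  # nothing removed
W_one, S_one = lsa_similarity(W, keep=0.5)    # rank 2 reduced to rank 1
print(np.round(S_full, 3))
print(np.round(S_one, 3))
```

With only one component kept, every row of the reduced matrix points along the same direction, so all similarities collapse to 1. This is a reminder that too aggressive a reduction erases the distinctions the map is meant to show.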
Kruskal’s Algorithm and Top-Down Clusterization
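The figure for this slide is not preserved, so the following is only a sketch of the standard technique named in the title. Kruskal's algorithm builds the minimum spanning tree from edges sorted by weight; the top-down step is assumed here to mean cutting the heaviest tree edges to split the tree into clusters. Using 1 − similarity as the edge weight, and the function names and toy similarity matrix, are likewise assumptions.

```python
def kruskal_mst(n, edges):
    """Standard Kruskal: edges are (weight, u, v); returns the MST edge list."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:            # edge joins two different components
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

def top_down_clusters(n, mst, k):
    """Assumed top-down step: drop the k-1 heaviest MST edges, return components."""
    kept = sorted(mst)[: len(mst) - (k - 1)]
    label = list(range(n))
    for _, u, v in kept:
        old, new = label[u], label[v]
        if old != new:
            label = [new if l == old else l for l in label]
    clusters = {}
    for node, l in enumerate(label):
        clusters.setdefault(l, []).append(node)
    return sorted(clusters.values())

# Toy similarity matrix for four documents (illustrative values).
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
edges = [(1 - S[i][j], i, j) for i in range(4) for j in range(i + 1, 4)]
mst = kruskal_mst(4, edges)
print([(u, v) for _, u, v in mst])    # [(0, 1), (2, 3), (1, 2)]
print(top_down_clusters(4, mst, 2))   # [[0, 1], [2, 3]]
```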
Modified Kruskal’s Algorithm and Bottom-Up Clusterization
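The slide text does not say how Kruskal's algorithm is modified. One possible reading, assumed purely for illustration, is to grow the forest from the shortest edges as usual but stop merging once the desired number of clusters remains. The function name `bottom_up_clusters` and the toy data are hypothetical.

```python
def bottom_up_clusters(n, edges, k):
    """Kruskal-style merging that stops when k components remain (assumed variant)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for w, u, v in sorted(edges):
        if components == k:
            break                # enough merging done
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv      # merge the two closest clusters
            components -= 1

    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())

# Toy similarity matrix for four documents (illustrative values).
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
edges = [(1 - S[i][j], i, j) for i in range(4) for j in range(i + 1, 4)]
print(bottom_up_clusters(4, edges, 2))  # [[0, 1], [2, 3]]
```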
Reuters results
| Method | Topics | Clusters | Accuracy |
|---|---|---|---|
| No dim. red. | 41 | 129 | 78.2% |
| LSA dim. red. 0.8 (476) | 41 | 124 | 76.2% |
| LSA dim. red. 0.6 (357) | 41 | 127 | 75.2% |
| Simple Selection | 41 | 130 | 78.5% |

Rank of W in SVD = 595
Results for EEE NTU Web pages
| Method | Topics | Clusters | Accuracy |
|---|---|---|---|
| No dim. red. | 10 | 142 | 84.7% |
| LSA dim. red. 0.8 (467) | 10 | 129 | 84.7% |
| LSA dim. red. 0.6 (350) | 10 | 137 | 82.8% |
| Simple Selection | 10 | 145 | 85.5% |
Examples
TouchGraph LinkBrowser http://www.neuron.m4u.pl/search
Results for Discharge Summaries
New experiments on medical texts.
10 classes and 10 documents per class:
Plain Doc-Word matrix ≈ 23%
Stop-List, TW-IDF, S.S. ≈ 64%
Concept Space ≈ 64%
Transformation ≈ 93%
Simple Word-Doc Vector Space
Meta-Map Concept Vector Space
Concept Vector Space after transformation
Summary
In real applications, a knowledge-based approach is needed to select only useful words and to parse people's web pages.
Other visualization methods (like MDS) may be explored.
People have many interests and thus may belong to several topic groups.
This could be a very useful tool for creating new shared-interest groups on the Internet.