Finding Structure in Texts with Topological Data Analysis
Calli Clay and Ella Graham
St. Catherine University
February 1, 2020
Calli Clay and Ella Graham (St. Kate’s) Finding Structure in Texts with TDA February 1, 2020 1 / 17
Introduction
Recently, analyzing data has become more complex because data sets are larger in size and higher in dimension
To address this complexity, we looked at determining the shape of a data set using an approach called topological data analysis
The Shape of a Data Set
Panel 1: A Three Dimensional Data Set (axes: Variable One, Variable Two, Variable Three)
Panel 2: Yet Another Three Dimensional Data Set (axes: Variable One, Variable Two, Variable Three)
Figure: Visualizing data sets (Dr. Pelatt)
Research Goals
Determine the efficiency of topological data analysis as a text analytics tool
Analyze poetry forms including the villanelle and sestina
Analyze music genres including rock music and pop music
Background
Topology is the study of shapes
Figure: Transforming a coffee cup into a donut (Hood)
Background
Persistent homology is a common TDA method
A technique for approximating the topological features of a space in different dimensions
Has not been widely used for analyzing texts
Simplicial Complexes
Geometric representations of the shape of a data set
Simplices are the building blocks for simplicial complexes
Figure: Simplices (Huang)
Simplicial Complexes
We can think of a point cloud as a sample drawn from a topological space
Simplices are used to turn point clouds into simplicial complexes
Accomplished with a Vietoris-Rips complex
Figure: Illustration of building a simplicial complex from a point cloud (Huang)
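The Vietoris-Rips construction can be sketched in a few lines. This is a minimal Python illustration, not the authors' R workflow: the function name `vietoris_rips`, the scale parameter `epsilon`, and the square point cloud are all assumptions made for the example. A set of points forms a simplex whenever every pair of them lies within `epsilon` of each other.

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, epsilon, max_dim=2):
    """Simplices (as sorted index tuples) of the Vietoris-Rips complex
    at scale epsilon, up to dimension max_dim."""
    n = len(points)
    # Pairs of points whose distance is at most epsilon.
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= epsilon
    }
    simplices = [(i,) for i in range(n)]  # every point is a 0-simplex
    for k in range(2, max_dim + 2):       # k vertices make a (k-1)-simplex
        for verts in combinations(range(n), k):
            # Include the simplex when all of its vertex pairs are close.
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices

# Hypothetical point cloud: the corners of a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]

small = vietoris_rips(square, 1.0)  # 4 vertices + 4 sides, diagonals too long
large = vietoris_rips(square, 1.5)  # diagonals join, so 4 triangles appear
print(len(small), len(large))
```

At the smaller scale the complex is a hollow square (one loop); at the larger scale the loop is filled in. Tracking how such features appear and disappear as the scale grows is what persistent homology records.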
Persistent Homology
We use persistent homology to analyze the space that is represented by simplicial complexes
We calculate homology groups in each dimension
Dimension 0 represents connected components
Dimension 1 represents holes or loops
Dimension 2 and higher represent voids (and their higher-dimensional analogues)
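To make the dimensions concrete, the sketch below counts components, loops, and voids (the Betti numbers) of a small simplicial complex. It is an illustrative Python implementation using mod-2 coefficients, which is an assumption for this example; the helper names `rank_gf2` and `betti_numbers` are hypothetical and not taken from any package used in the talk.

```python
from itertools import combinations

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    pivots = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]  # eliminate the leading bit
    return len(pivots)

def betti_numbers(simplices, max_dim):
    """Betti numbers b_0..b_max_dim of a simplicial complex over GF(2).
    simplices: sorted vertex tuples, closed under taking faces."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(s))

    def boundary_rank(d):
        # Rank of the boundary map from d-simplices to (d-1)-simplices.
        if d == 0 or d not in by_dim:
            return 0
        index = {f: i for i, f in enumerate(by_dim[d - 1])}
        rows = []
        for s in by_dim[d]:
            mask = 0
            for face in combinations(s, d):  # faces have d vertices
                mask |= 1 << index[face]
            rows.append(mask)
        return rank_gf2(rows)

    # b_d = (number of d-simplices) - rank(boundary_d) - rank(boundary_{d+1})
    return [
        len(by_dim.get(d, [])) - boundary_rank(d) - boundary_rank(d + 1)
        for d in range(max_dim + 1)
    ]

# Hollow triangle: one component, one loop.
triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(betti_numbers(triangle, 1))     # [1, 1]

# Hollow tetrahedron (all faces, no interior): one component, one void.
tetrahedron = [c for k in (1, 2, 3) for c in combinations(range(4), k)]
print(betti_numbers(tetrahedron, 2))  # [1, 0, 1]
```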
Barcodes
Visual representation of the persistent homology of a given text
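A dimension 0 barcode can be computed by hand for small inputs. The Python sketch below assumes the convention that an edge enters the Vietoris-Rips filtration at a scale equal to the distance between its endpoints (some software uses half that distance); the function name `h0_barcode` and the three sample points are hypothetical. Every point starts its own bar at scale 0, and a bar ends each time two components merge.

```python
from itertools import combinations
from math import dist, inf

def h0_barcode(points):
    """Dimension 0 bars (birth, death) of a Vietoris-Rips filtration."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process candidate edges from shortest to longest.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj        # two components merge at scale d
            bars.append((0.0, d))  # so one of their bars dies at d
    if points:
        bars.append((0.0, inf))    # the last component never dies
    return bars

# Hypothetical points on a line: two close together, one far away.
bars = h0_barcode([(0, 0), (1, 0), (5, 0)])
print(bars)  # [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```

Long bars correspond to clusters that stay separate over a wide range of scales; short bars are usually treated as noise.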
Barcode Example with Poetry
Do not go gentle into that good night by Dylan Thomas
Bottleneck Distance
Once each text file is visually represented by a barcode, we can compare their barcodes to find the bottleneck distance
Measures distance between the persistent homologies of two text files
$$W_\infty(X, Y) = \inf_{\eta : X \to Y} \; \sup_{x \in X} \lVert x - \eta(x) \rVert_\infty$$
Wasserstein distance is another approach
Figure: Barcode 1 in Dimension 0
Figure: Barcode 2 in Dimension 0
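For very small barcodes the bottleneck distance can be computed by brute force. The Python sketch below is only an illustration of the definition, under these assumptions: each barcode is a list of finite (birth, death) pairs, a bar may be matched to the diagonal at cost half its length, and every matching is tried in turn. The name `bottleneck_distance` is hypothetical, and a dedicated library would be needed for realistic sizes because the search grows factorially.

```python
from itertools import permutations
from math import inf

def bottleneck_distance(X, Y):
    """Brute-force bottleneck distance between two small barcodes,
    each a list of finite (birth, death) pairs."""
    DIAG = None  # placeholder for "matched to the diagonal"
    A = list(X) + [DIAG] * len(Y)
    B = list(Y) + [DIAG] * len(X)

    def cost(p, q):
        if p is DIAG and q is DIAG:
            return 0.0
        if p is DIAG:
            return (q[1] - q[0]) / 2  # L-infinity distance to the diagonal
        if q is DIAG:
            return (p[1] - p[0]) / 2
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))  # L-infinity norm

    best = inf
    for perm in permutations(range(len(B))):
        # The cost of a matching is its worst single pairing (the sup).
        worst = max((cost(A[i], B[j]) for i, j in enumerate(perm)), default=0.0)
        best = min(best, worst)  # the inf over all matchings
    return best

# Hypothetical barcodes for two texts.
barcode_1 = [(0.0, 4.0), (0.0, 1.0)]
barcode_2 = [(0.0, 3.0)]
print(bottleneck_distance(barcode_1, barcode_2))  # 1.0
```

Here the best matching pairs (0, 4) with (0, 3) at cost 1 and sends the short bar (0, 1) to the diagonal at cost 0.5, so the distance is 1.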
Process
Using the R programming language in RStudio, we:
Clean each text file
Represent each line of text with a word count vector
Stacking these vectors, one row per line of text, forms a word count matrix
Calculate a distance matrix composed of the pairwise distances between the rows (points) of the word count matrix
Use R packages to calculate the persistent homology, create barcodes, and find pairwise bottleneck distances between barcodes
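The distance matrix step can be sketched as follows. This is a Python illustration rather than the authors' R code; it assumes Euclidean distance, and the small `counts` matrix is made up for the example.

```python
from math import dist

def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between rows."""
    n = len(rows)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = dist(rows[i], rows[j])
    return D

# Hypothetical word count matrix: 3 lines of text over a 4-word vocabulary.
counts = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],  # repeats the first line, so its distance to it is 0
    [0, 0, 2, 1],
]
D = distance_matrix(counts)
for row in D:
    print([round(x, 3) for x in row])
```

Repeated lines, such as the refrains of a villanelle, show up as zero entries off the diagonal. A matrix like this is the input a persistent homology routine works from.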
Word Count Vectors with Song Lyrics
raindrops (an angel cried) by Ariana Grande
“When Raindrops fell down from the sky
the day you left me, an angel cried
oh, she cried, an angel cried
she cried”
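As an illustration, the four lyric lines above can be turned into word count vectors like this. The sketch is in Python rather than R, and the cleaning rule (lowercase, letters and apostrophes only) and the alphabetical vocabulary order are assumptions made for the example.

```python
import re
from collections import Counter

lyric_lines = [
    "When Raindrops fell down from the sky",
    "the day you left me, an angel cried",
    "oh, she cried, an angel cried",
    "she cried",
]

def clean(line):
    """Lowercase a line and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", line.lower())

def count_vector(words, vocabulary):
    """Number of times each vocabulary word occurs in one line."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

tokens = [clean(line) for line in lyric_lines]
vocabulary = sorted({w for words in tokens for w in words})
matrix = [count_vector(words, vocabulary) for words in tokens]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word (16 here), and the third line gets a 2 in the "cried" column because the word occurs twice.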
Issues and Questions
Stop Words: Do they change word count vectors significantly?
Address with standard tf-idf technique (Wagner)
Defining Distance: Euclidean or Angular?
Algorithms: SIF (Similarity Filtration) or SIFTS (Similarity Filtration with Time Skeleton)? (Zhu)
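The first two questions can be explored with a short sketch. It is written in Python, and the particular tf-idf variant (raw count times log(N / document frequency)) and the definition of angular distance as the arccosine of cosine similarity are assumptions; other variants exist, and the slides do not say which one was used. The example matrix is made up.

```python
from math import acos, dist, log, sqrt

def tf_idf(matrix):
    """Reweight a word count matrix: count * log(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[t] > 0) for t in range(n_terms)]
    idf = [log(n_docs / d) if d else 0.0 for d in df]
    return [[row[t] * idf[t] for t in range(n_terms)] for row in matrix]

def angular_distance(u, v):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    if norm == 0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift.
    return acos(max(-1.0, min(1.0, dot / norm)))

# Hypothetical counts for the words "the", "angel", "sky" in three lines.
counts = [
    [2, 1, 0],
    [4, 2, 0],  # same word proportions as the first line, twice as long
    [1, 0, 1],
]

print(dist(counts[0], counts[1]))              # Euclidean: grows with length
print(angular_distance(counts[0], counts[1]))  # angular: ignores length

weighted = tf_idf(counts)
print([row[0] for row in weighted])  # "the" is in every line, so weight 0
```

Euclidean distance separates the first two lines because one is longer, while angular distance treats them as identical in direction. Under this tf-idf variant, a word that appears in every line, as stop words often do, contributes nothing.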
Results
Analyzing poetry using persistent homology is more interesting than analyzing song lyrics
Upon further investigation, we may be able to accurately conclude that TDA is effective for the analysis of poetry
References
H. Edelsbrunner and J. Harer, Computational topology: an introduction. American Mathematical Soc., 2010.
X. Zhu, “Persistent homology: An introduction and a new text representation for natural language processing,” in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
H. Wagner, P. Dłotko, and M. Mrozek, “Computational topology in text mining,” in CT, pp. 68–78, Springer, 2012.
H.-L. Huang, X.-L. Wang, P. P. Rohde, Y.-H. Luo, Y.-W. Zhao, C. Liu, L. Li, N.-L. Liu, C.-Y. Lu, and J.-W. Pan, “Demonstration of topological data analysis on a quantum processor,” Optica, vol. 5, no. 2, pp. 193–198, 2018.
S. Gholizadeh, A. Seyeditabari, and W. Zadrozny, “Topological signature of 19th century novelists: Persistent homology in text mining,” Big Data and Cognitive Computing, vol. 2, no. 4, p. 33, 2018.
M. Hood, “When is a coffee mug a donut? Topology explains it,” 2016.
Ripser, https://live.ripser.org/.