
Finding Structure in Texts with Topological Data Analysis

Calli Clay and Ella Graham

St. Catherine University

February 1, 2020

Calli Clay and Ella Graham (St. Kate’s) Finding Structure in Texts with TDA February 1, 2020 1 / 17


Introduction

Recently, analyzing data has become more complex because data sets are larger in size and higher in dimension

To address this complexity, we looked at determining the shape of a data set using an approach called topological data analysis


The Shape of a Data Set

Figure: Visualizing data sets (Dr. Pelatt). Two three-dimensional scatter plots, titled "A Three Dimensional Data Set" and "Yet Another Three Dimensional Data Set", each with axes Variable One, Variable Two, and Variable Three.


Research Goals

Determine the efficiency of topological data analysis as a text analytics tool

Analyze poetry forms including the villanelle and sestina

Analyze music genres including rock music and pop music


Background

Topology is the study of shapes

Figure: Transforming a coffee cup into a donut (Hood)


Background

Persistent homology is a common TDA method

A technique for approximating the topological features of a space in different dimensions

Has not been widely used for analyzing texts


Simplicial Complexes

Geometric representations of the shape of a data set

Simplices are the building blocks for simplicial complexes

Figure: Simplices (Huang)


Simplicial Complexes

We can think of a point cloud as a sample drawn from a topological space

Simplices are used to turn point clouds into simplicial complexes

Accomplished with a Vietoris-Rips complex

Figure: Illustration of building a simplicial complex from a point cloud (Huang)
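The construction on this slide can be sketched in a few lines. The block below is a minimal Python illustration, not the R workflow used in the talk. The function name `rips_complex`, the `max_dim` cutoff, and the convention that a simplex enters once all pairwise distances are at most `eps` are assumptions made for the example (some texts use `2 * eps` instead).

```python
from itertools import combinations
from math import dist


def rips_complex(points, eps, max_dim=2):
    """Return the simplices (sorted index tuples) of a Vietoris-Rips complex.

    A set of points spans a simplex when every pair in the set is at most
    eps apart. Simplices above dimension max_dim are not generated.
    """
    n = len(points)
    # Pairs of points that are close enough to be joined by an edge.
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= eps}
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):  # k vertices span a (k - 1)-simplex
        for verts in combinations(range(n), k):
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


# Four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
small = rips_complex(square, eps=1.0)  # sides only: a hollow loop
large = rips_complex(square, eps=1.5)  # diagonals join in: the loop fills
```

At `eps=1.0` only the four sides are present, so the complex is a loop. At `eps=1.5` the diagonals (length about 1.41) appear and every triangle is filled in. Enumerating all subsets is only practical for small point clouds.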


Persistent Homology

We use persistent homology to analyze the space that is represented by simplicial complexes

We calculate homology groups in each dimension

Dimension 0 represents components

Dimension 1 represents holes or loops

Dimension 2 and higher represent voids
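To make the three cases concrete, the sketch below counts components, loops, and voids for a few tiny complexes. It is an illustrative Python example, not the authors' code. It assumes coefficients in Z/2 (a common choice in TDA software) and assumes the input complex contains every face of every simplex. The names `rank_gf2` and `betti_numbers` are hypothetical.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over Z/2 of a 0/1 matrix whose rows are stored as int bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        # Clear that bit from every remaining row.
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti_numbers(simplices):
    """Betti numbers [b_0, b_1, ...] of a simplicial complex over Z/2."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    top = max(by_dim)
    # ranks[k] is the rank of the boundary map from k-chains to (k-1)-chains.
    ranks = {0: 0, top + 1: 0}
    for k in range(1, top + 1):
        index = {s: i for i, s in enumerate(by_dim.get(k - 1, []))}
        rows = []
        for s in by_dim.get(k, []):
            mask = 0
            for face in combinations(s, k):  # the faces with k vertices
                mask |= 1 << index[face]
            rows.append(mask)
        ranks[k] = rank_gf2(rows)
    return [len(by_dim.get(k, [])) - ranks[k] - ranks[k + 1]
            for k in range(top + 1)]


vertices = [(0,), (1,), (2,)]
hollow_triangle = vertices + [(0, 1), (0, 2), (1, 2)]
filled_triangle = hollow_triangle + [(0, 1, 2)]
hollow_tetrahedron = (
    [(i,) for i in range(4)]
    + list(combinations(range(4), 2))
    + list(combinations(range(4), 3))
)

loop = betti_numbers(hollow_triangle)     # one component, one loop
disk = betti_numbers(filled_triangle)     # one component, no loop
shell = betti_numbers(hollow_tetrahedron)  # one component, one void
```

Here `loop` is `[1, 1]`, `disk` is `[1, 0, 0]`, and `shell` is `[1, 0, 1]`: filling the triangle removes the loop, and the empty tetrahedron encloses a single void.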


Barcodes

Visual representation of the persistent homology of a given text
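Each bar in a barcode records the scale at which a feature appears and the scale at which it disappears. As a minimal Python sketch (an illustration only, limited to dimension 0 and not the package used in the talk), the bars for connected components can be found by adding edges in order of length and noting when two components merge. The function name `h0_barcode` is hypothetical.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Dimension-0 bars (birth, death) for the Rips filtration of a point cloud."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            bars.append((0.0, length))  # one component dies at this scale
    bars.append((0.0, inf))  # the final component never dies
    return bars


# Two tight pairs of points that sit far apart from each other.
cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
bars = h0_barcode(cloud)
```

For this cloud the bars are two short ones ending at 1.0, one longer bar ending at 4.0, and one infinite bar. The long finite bar is the signal that the data falls into two clusters.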


Barcode Example with Poetry

Do not go gentle into that good night by Dylan Thomas


Bottleneck Distance

Once each text file is visually represented by a barcode, we can compare their barcodes to find the bottleneck distance

Measures distance between the persistent homologies of two text files

$$W_\infty(X, Y) = \inf_{\eta : X \to Y} \, \sup_{x \in X} \| x - \eta(x) \|_\infty$$

Wasserstein distance is another approach

Figure: Barcode 1 in Dimension 0

Figure: Barcode 2 in Dimension 0
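The formula can be checked by hand on very small examples. The Python sketch below is an illustration under stated assumptions, not the routine used in the talk: it tries every matching, so it is only usable for a handful of bars; it handles finite bars only; and it follows the usual convention that a bar may be matched to the diagonal (birth equal to death) at a cost of half its length. The function name `bottleneck_distance` is hypothetical.

```python
from itertools import permutations


def bottleneck_distance(diagram_x, diagram_y):
    """Brute-force bottleneck distance between two small lists of (birth, death)."""

    def to_diagonal(p):
        # L-infinity distance from a bar to the nearest point with birth == death.
        return (p[1] - p[0]) / 2.0

    def linf(p, q):
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    n, m = len(diagram_x), len(diagram_y)
    size = n + m
    # Pad each side with diagonal slots so that every matching is a bijection:
    # X side has n real bars then m diagonal slots,
    # Y side has m real bars then n diagonal slots.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                cost[i][j] = linf(diagram_x[i], diagram_y[j])
            elif i < n:
                cost[i][j] = to_diagonal(diagram_x[i])
            elif j < m:
                cost[i][j] = to_diagonal(diagram_y[j])
            # Diagonal slot to diagonal slot costs nothing.
    best = float("inf")
    for eta in permutations(range(size)):
        worst = max((cost[i][eta[i]] for i in range(size)), default=0.0)
        best = min(best, worst)
    return best


barcode_1 = [(0.0, 4.0), (1.0, 2.0)]
barcode_2 = [(0.0, 5.0)]
distance = bottleneck_distance(barcode_1, barcode_2)
```

The best matching pairs (0, 4) with (0, 5) at cost 1 and sends the short bar (1, 2) to the diagonal at cost 0.5, so `distance` is 1.0. Production libraries replace the loop over permutations with a matching algorithm.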


Process

Using the programming software RStudio, we:

Clean each text file

Represent each line of text with a word count vector

The resulting vector space forms a word count matrix

Calculate a distance matrix composed of the pairwise distances between each point in the word count matrix

Use R packages to calculate the persistent homology, create barcodes, and find pairwise bottleneck distances between barcodes
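The first three steps above can be sketched as follows. This is a Python illustration rather than the authors' R code, and the details are assumptions: cleaning here means lowercasing and dropping punctuation, the vocabulary is sorted alphabetically, and distances are Euclidean. The function names and the sample lines are hypothetical.

```python
import re
from collections import Counter
from math import dist


def clean(line):
    """Lowercase a line and return its words, dropping punctuation and digits."""
    return re.findall(r"[a-z']+", line.lower())


def word_count_matrix(lines):
    """Return (vocabulary, matrix) with one row of word counts per line."""
    tokenized = [clean(line) for line in lines]
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def distance_matrix(matrix):
    """Pairwise Euclidean distances between the rows of a word count matrix."""
    return [[dist(a, b) for b in matrix] for a in matrix]


# Made-up lines standing in for the lines of a text file.
lines = ["The cat sat.", "The cat ran!", "A dog ran."]
vocabulary, counts = word_count_matrix(lines)
distances = distance_matrix(counts)
```

The first two lines differ in one word each way, so their distance is the square root of 2, while the first and third share no words and sit farther apart. The resulting distance matrix is the input that a persistent homology package would take in the final step.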


Word Count Vectors with Song Lyrics

raindrops (an angel cried) by Ariana Grande

“When Raindrops fell down from the sky
the day you left me, an angel cried
oh, she cried, an angel cried
she cried”


Issues and Questions

Stop Words: Do they change word count vectors significantly?

Address with standard tf-idf technique (Wagner)

Defining Distance: Euclidean or Angular?

Algorithms: SIF (similarity filtration) or SIFTS (similarity filtration with time skeleton)? (Zhu)
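The first two questions can be explored on a toy matrix. The Python sketch below is an illustration only: tf-idf has several variants, and the one used here (count times the log of the number of lines divided by the number of lines containing the word) is an assumption, not necessarily the weighting in Wagner et al. The function names `tf_idf` and `angular_distance` are hypothetical.

```python
from math import acos, log, pi, sqrt


def tf_idf(matrix):
    """Reweight word counts by log(number of rows / rows containing the word)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_cols)]
    return [[row[j] * log(n_rows / doc_freq[j]) if doc_freq[j] else 0.0
             for j in range(n_cols)]
            for row in matrix]


def angular_distance(u, v):
    """Angle between two vectors in radians, between 0 and pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    return acos(max(-1.0, min(1.0, dot / norm)))


# Column 0 plays the role of a stop word: it occurs in every line.
counts = [[1, 1, 0],
          [1, 0, 2],
          [1, 0, 0]]
weighted = tf_idf(counts)
raw_angle = angular_distance(counts[0], counts[2])
weighted_angle = angular_distance(weighted[0], weighted[1])
```

A word that occurs in every line receives weight zero, so the stop-word column no longer affects distances. One side effect is visible in the last row: a line made up only of such words becomes the zero vector, for which the angular distance is undefined, so a pipeline has to decide how to treat it. Unlike Euclidean distance, the angle ignores line length, so a line and a repeated copy of it are at angular distance zero.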


Results

Analyzing poetry using persistent homology is more interesting than analyzing song lyrics

Upon further investigation, we may be able to accurately conclude that TDA is effective for the analysis of poetry


References

H. Edelsbrunner and J. Harer, Computational topology: an introduction. American Mathematical Soc., 2010.

X. Zhu, “Persistent homology: An introduction and a new text representation for natural language processing,” in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

H. Wagner, P. Dłotko, and M. Mrozek, “Computational topology in text mining,” in CT, pp. 68–78, Springer, 2012.

H.-L. Huang, X.-L. Wang, P. P. Rohde, Y.-H. Luo, Y.-W. Zhao, C. Liu, L. Li, N.-L. Liu, C.-Y. Lu, and J.-W. Pan, “Demonstration of topological data analysis on a quantum processor,” Optica, vol. 5, no. 2, pp. 193–198, 2018.

S. Gholizadeh, A. Seyeditabari, and W. Zadrozny, “Topological signature of 19th century novelists: Persistent homology in text mining,” Big Data and Cognitive Computing, vol. 2, no. 4, p. 33, 2018.

M. Hood, “When is a coffee mug a donut? topology explains it,” 2016.

Ripser, https://live.ripser.org/.
