Big Data, Big Corpus, and Bigrams: Calculating Literary Complexity
Nathaniel Husted
[email protected]
You too can be a Big Data Scientist!
Terminology: Big Data

Not a new concept (never believe marketers)
Moving target
Data sets large enough to cause extra considerations for processing and storage
Terminology: Big Corpora
Corpus (Plural: Corpora) – A sample set of texts for natural language processing.
Big Corpus – A very large, gigabyte-scale set of texts.
◦ Example: Corpus of Contemporary American English
Terminology: Bigrams
The Quick Brown Fox Leaves.
Also known as a Digram or n-gram for n=2.
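The pairing above can be reproduced in a few lines of plain Python; this is only an illustrative sketch (the talk itself uses NLTK, whose `nltk.bigrams()` does the same thing):

```python
# Slide over the sentence one word at a time,
# pairing each word with its successor.
tokens = "The Quick Brown Fox Leaves".split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('The', 'Quick'), ('Quick', 'Brown'), ('Brown', 'Fox'), ('Fox', 'Leaves')]
```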
Terminology: Literary Complexity
The complexity of a story.
◦ Qualitative
◦ How intertwined the plot lines are
◦ How deep the themes are
◦ How rich the characters are
◦ How much attention it takes on the part of the reader to comprehend the whole

Examples of complex literature:
◦ Finnegans Wake by James Joyce
◦ Foucault’s Pendulum by Umberto Eco
Terminology: A Little Graph Theory
Directed Edge
Undirected Edge
Vertex
Loop
Let’s Put Them All Together… Structural Complexity
How can we quantitatively measure the complexity of a novel?
◦ Structural Complexity!
◦ Biologists use structure to measure the complexity of molecules
◦ Systems scientists use it to measure the complexity of networks

What is Structural Complexity?
◦ The amount of information contained in the relationships between elements of a network.
Metrics of Structural Complexity
Normalized Edge Complexity (NEC)
◦ How many unique bigrams there are versus the theoretical maximum.

Average Edge Complexity (AEC)
◦ Average number of unique bigrams per word.
Shannon Information (SI)
Vertex degree magnitude-based Information (IVD)
http://www.vcu.edu/csbc/pdfs/quantitative_measures.pdf
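A minimal sketch of the two edge-count metrics, assuming the theoretical maximum for a directed bigram graph with self-loops is V²; SI and IVD follow the definitions in the linked PDF and are omitted here:

```python
def nec(num_vertices: int, unique_edges: int) -> float:
    """Normalized Edge Complexity: unique bigrams over the
    theoretical maximum (V**2 edges, assuming a directed
    graph in which self-loops are allowed)."""
    return unique_edges / num_vertices ** 2

def aec(num_vertices: int, unique_edges: int) -> float:
    """Average Edge Complexity: unique bigrams per unique word."""
    return unique_edges / num_vertices

# Toy graph: 6 distinct words joined by 6 unique bigrams.
print(nec(6, 6))  # ~0.167
print(aec(6, 6))  # 1.0
```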
Structural Complexity In Literature: Bigrams as Structural Cues
To use our structural complexity measures, we must “graph” our novel.
Bigrams provide a clear notion of a “graph edge”
Bigrams link word associations together
Structural Complexity In Literature: Bigrams as Structural Cues

The Quick Brown Fox Leaves The House.

[Slide figure: a directed graph with vertices The, Quick, Brown, Fox, Leaves, House, and one edge per bigram in the sentence.]
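The graph on this slide can be rebuilt from the bigrams with a plain adjacency dictionary; a minimal sketch (the talk itself uses a NetworkX `DiGraph` for this):

```python
from collections import defaultdict

tokens = "The Quick Brown Fox Leaves The House".split()

# Each unique word is a vertex; each unique bigram a directed edge.
# Note "The" occurs twice, so it collects two outgoing edges.
graph = defaultdict(set)
for a, b in zip(tokens, tokens[1:]):
    graph[a].add(b)

print(sorted(graph["The"]))  # ['House', 'Quick']
```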
How do we implement all these concepts? Python!
◦ NetworkX
◦ NLTK
◦ xml.etree (ElementTree)
SQLite (xargs)
What is our process?
1. Choose our Corpus
2. Organize our Corpus
3. Parse our Corpus
4. Analyze our Graphs
5. Process our Results
Choosing our Corpus: Project Gutenberg to the Rescue
◦ Tens of thousands of texts
◦ Most, if not all, are in text formats (ASCII, ISO, UTF-8)
◦ Convenient ISO downloads
◦ Public Domain!
Number of works: 19852
Number of authors: 7049
https://www.cs.Indiana.edu/~nhusted/project_source/pgdvd-en-corpus.tar.bz2
Organizing our Corpus

Project Gutenberg provides an RDF card catalogue of its library.
Querying a 250+ MB RDF file with RDF libraries is SLOW.
Parsing with Python’s xml.etree.cElementTree is fast!
Due to Unicode characters, Python 3 is a must.
Storing results in SQLite gives us a compact, quickly searchable format.
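A sketch of this slide's approach on a made-up miniature catalogue; the real Project Gutenberg RDF file uses different namespaces and tag names, so every tag and column name below is illustrative only:

```python
import sqlite3
import xml.etree.ElementTree as ET  # cElementTree is merged into this module in Python 3

# Hypothetical miniature catalogue; the real file is 250+ MB of RDF.
catalog = """
<catalog>
  <book id="11"><title>Alice's Adventures in Wonderland</title><author>Carroll, Lewis</author></book>
  <book id="2701"><title>Moby Dick</title><author>Melville, Herman</author></book>
</catalog>
"""
root = ET.fromstring(catalog)

# SQLite gives a compact, quickly searchable store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER, title TEXT, author TEXT)")
for book in root.iter("book"):
    conn.execute("INSERT INTO books VALUES (?, ?, ?)",
                 (int(book.get("id")),
                  book.findtext("title"),
                  book.findtext("author")))

title = conn.execute(
    "SELECT title FROM books WHERE author LIKE 'Melville%'").fetchone()[0]
print(title)  # Moby Dick
```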
Parsing our Corpus in to Graphs!
Python, NetworkX, and NLTK to the rescue.
NLTK allows quick parsing of the novels.
NetworkX provides an easy-to-use graph library with built-in algorithms.
Analyzing Our Graphs’ Structural Complexity

[Slide figures: the formulas for IVD and AEC]
Storing and Analyzing the Results
Store the results in SQLite
◦ Conveniently searchable, still.
◦ Conveniently readable in R.

Use R for Statistical Analysis
◦ Personal Preference
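A sketch of the storage step; the table and column names here are hypothetical, and the R side (via the RSQLite package) can open the same database file directly:

```python
import sqlite3

# Hypothetical results table; column names are illustrative.
conn = sqlite3.connect(":memory:")  # a real file path would be shared with R
conn.execute("CREATE TABLE metrics (book_id INTEGER, nec REAL, aec REAL)")
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)",
                 [(11, 0.12, 3.4), (2701, 0.09, 4.1)])

# From R: con <- dbConnect(RSQLite::SQLite(), "results.db")
#         df  <- dbGetQuery(con, "SELECT * FROM metrics")
avg_nec = conn.execute("SELECT AVG(nec) FROM metrics").fetchone()[0]
print(avg_nec)  # ~0.105
```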
So what can we say about Structural Complexity?
It seems to have dropped in the late 1800s
Structural Complexity is Analogous to Literary Complexity
Determine authors who have literature deemed “complex”
Publisher’s Weekly Top 10 Most Difficult Books: http://www.publishersweekly.com/pw/by-topic/industry-news/tip-sheet/article/53409-the-top-10-most-difficult-books.html
Conclusions

Structural Complexity is analogous to qualitative measurements of literary complexity
Structural Complexity even allows comparison of novels to other structures such as DNA and protein-protein sequences
Results are preliminary
◦ Data is not Gaussian
◦ Still some catalog creation errors
◦ “Big Data” is still sparse
Big Conclusion: Open Source Science!
Results are Creative Commons!
Code is GPL V3!
Dataset is public domain!
You can do your own analysis!
http://cgi.cs.indiana.edu/~nhusted/dokuwiki/doku.php?id=projects:graphalyzer
https://github.iu.edu/nhusted/GutenbergGraphalyzer
You too can be a Big Data Scientist!