Digitizing Serialized Fiction
Kirk Hess
Library Research Showcase
November 19, 2013
Finding Serialized Fiction
“Many of the newspapers in Farm, Field and Fireside published serialized fiction written by renowned authors as well as lesser known writers and even some long-time readers. The value of this publishing model enabled literature to be disseminated to rural communities and expand the bounds of American literary culture across geographic and socioeconomic lines.”
How can we identify serialized fiction in an article-segmented newspaper archive?
Methodology
• Manual extraction/indexing of one title
  - The Farmer’s Wife
  • Workflow: http://bit.ly/1aCYZSa
  • TEI/Scripto (OCR correction)
• Automated techniques
  • Common n-grams
    • e.g. ‘Chapter (number/Roman numeral)’, ‘To Be Continued’, ‘The End’, etc.
  • Topic/genre/theme
    • e.g. romance, children’s stories, holidays, etc.
  • Named entity recognition
  • Predictive solutions (Bayes, Google API)
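The common n-gram idea above can be sketched with a few regular expressions. This is a minimal illustration, not the workflow used in the project: the function name `serial_markers` and the exact patterns are assumptions, and OCR noise in a real archive would call for looser matching.

```python
import re

# "Chapter" followed by digits or a run of Roman-numeral letters,
# e.g. "Chapter IV" or "CHAPTER 12". Heuristic: a word made only of
# those letters (such as "mild") would also match.
CHAPTER_RE = re.compile(r"\bchapter\s+(?:\d+|[ivxlcdm]+)\b", re.IGNORECASE)
CONTINUED_RE = re.compile(r"\bto\s+be\s+continued\b", re.IGNORECASE)
# "The End" only counts when it closes the article text.
THE_END_RE = re.compile(r"\bthe\s+end\b\W*$", re.IGNORECASE)


def serial_markers(text):
    """Return the set of serialized-fiction markers found in an article."""
    found = set()
    if CHAPTER_RE.search(text):
        found.add("chapter")
    if CONTINUED_RE.search(text):
        found.add("to_be_continued")
    if THE_END_RE.search(text.strip()):
        found.add("the_end")
    return found


print(serial_markers("CHAPTER IV. The wagon rolled on. (To be continued.)"))
```

Articles that return a non-empty set would be candidates for review rather than confirmed fiction, since phrases like these also turn up in non-fiction.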
THE MYSTERIOUS MCCORKLES by F. Roney Weir
http://uller.grainger.uiuc.edu/omeka/items/show/20
Analysis/Results
• Manual Indexing Farmer’s Wife w/Omeka
Sample set completed Fall 2012
http://uller.grainger.illinois.edu/omeka/
• Topic Analysis (Latent Dirichlet Allocation; David Blei et al.) w/Mallet (http://mallet.cs.umass.edu/)
Barney time water butter put milk de corn wagon chickens day weather dinner clean Mercy home lay table dry made Marigold morning make Anne bread
• Network Analysis w/Gephi
Topics and documents are nodes; documents in topics are edges.
• Named Entity Recognition (NER) w/Stanford NLP Named Entity Recognizer
Proper names interfere with LSA; programmatically find names.
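To make the network analysis step concrete, here is a minimal sketch of turning document-topic proportions into an edge list with Source, Target and Weight columns, which Gephi's spreadsheet importer recognizes. The `sample` proportions, the 0.2 threshold and the function names are all hypothetical; the real input would come from the topic model's output.

```python
import csv
import io


def topic_edges(doc_topics, threshold=0.2):
    """Yield (document, topic, weight) edges for topics at or above threshold."""
    for doc_id, weights in doc_topics.items():
        for topic_index, weight in enumerate(weights):
            if weight >= threshold:
                yield doc_id, f"topic_{topic_index}", weight


def write_edge_csv(doc_topics, handle, threshold=0.2):
    """Write a Source,Target,Weight edge list to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for row in topic_edges(doc_topics, threshold):
        writer.writerow(row)


# Hypothetical document-topic proportions (one weight per topic).
sample = {
    "article_001": [0.70, 0.25, 0.05],
    "article_002": [0.10, 0.15, 0.75],
}

buffer = io.StringIO()
write_edge_csv(sample, buffer)
print(buffer.getvalue())
```

Dropping weak document-topic links with a threshold keeps the graph readable; the cut-off is a tuning choice, not a value taken from the slides.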
Analysis/Results (cont.)
• Naïve Bayes Classifier using NLTK toolkit
• Similar to the Movie Review sample, using a small subset of articles: Naïve Bayes Classifier using NLTK, top 2000 words
>>> classifier.show_most_informative_features(5)
contains(having) = True          fictio : nonfic =      1.9 : 1.0
contains(plan) = True            fictio : nonfic =      1.9 : 1.0
contains(growing) = True         fictio : nonfic =      1.9 : 1.0
contains(entertaining) = True    fictio : nonfic =      1.9 : 1.0
contains(home) = True            fictio : nonfic =      1.9 : 1.0
High accuracy (> .95) but weak ratios
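The following is a rough, standard-library sketch of the kind of `contains(word)` Naïve Bayes classifier described above. It is not NLTK's implementation: the add-one smoothing, the document-frequency vocabulary cut-off, the function names and the four toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(labeled_docs, vocab_size=2000):
    """Train on a list of (tokens, label) pairs; returns a model dict."""
    doc_freq = Counter(w for tokens, _ in labeled_docs for w in set(tokens))
    vocab = [w for w, _ in doc_freq.most_common(vocab_size)]
    vocab_set = set(vocab)
    labels = Counter(label for _, label in labeled_docs)
    # present[label][word]: documents with this label that contain the word
    present = {label: Counter() for label in labels}
    for tokens, label in labeled_docs:
        present[label].update(set(tokens) & vocab_set)
    return {"vocab": vocab, "labels": labels, "present": present}


def prob_contains(model, word, label):
    """Add-one smoothed estimate of P(contains(word) = True | label)."""
    return (model["present"][label][word] + 1) / (model["labels"][label] + 2)


def classify(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    seen = set(tokens)
    total = sum(model["labels"].values())
    scores = {}
    for label, count in model["labels"].items():
        score = math.log(count / total)
        for word in model["vocab"]:
            p = prob_contains(model, word, label)
            score += math.log(p if word in seen else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


def ratio(model, word, label_a, label_b):
    """Likelihood ratio for contains(word) = True between two labels."""
    return prob_contains(model, word, label_a) / prob_contains(model, word, label_b)


# Toy training data, invented for this sketch.
training = [
    ("she said softly and he smiled".split(), "fiction"),
    ("he said the door opened slowly".split(), "fiction"),
    ("plant corn early and feed chickens".split(), "nonfiction"),
    ("feed the chickens corn and clean water".split(), "nonfiction"),
]
model = train(training)
print(classify(model, "she smiled and said nothing".split()))
print(ratio(model, "said", "fiction", "nonfiction"))
```

The `ratio` helper is loosely analogous to the fictio : nonfic figures in the output above. Ratios near 1.0 mean a word barely separates the two classes, which is one way to read "weak ratios" despite high accuracy.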
Next Steps
• Implement Veridian
• Crowdsource OCR correction
• Automated tagging of articles
• Direct access to index (Solr)
• Continue NLP research using NLTK Toolkit w/ additional classifiers and NER research, full training set.
• Expand probabilistic statistical methods across the archive (~1 million pages, 5 million articles).