Digitizing Serialized Fiction
Kirk Hess
Library Research Showcase
November 19, 2013
[email protected]

Finding Serialized Fiction

“Many of the newspapers in Farm, Field and Fireside published serialized fiction written by renowned authors as well as lesser known writers and even some long-time readers. The value of this publishing model enabled literature to be disseminated to rural communities and expand the bounds of American literary culture across geographic and socioeconomic lines.”

How can we identify serialized fiction in an article-segmented newspaper archive?

Methodology
• Manual extraction/indexing of one title
  - The Farmer’s Wife
  - Workflow: http://bit.ly/1aCYZSa
  - TEI/Scripto (OCR correction)
• Automated techniques
  - Common n-grams, e.g. ‘Chapter (number/roman numeral)’, ‘To Be Continued’, ‘The End’, etc.
  - Topic/genre/theme, e.g. romance, children’s stories, holidays, etc.
  - Named entity recognition
  - Predictive solutions (Bayes, Google API)
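The common n-gram idea above can be sketched with a few regular expressions. This is a minimal illustration, not the workflow used in the project: the pattern set and the `find_markers` helper are hypothetical, and real OCR text would likely need looser matching.

```python
import re

# Hypothetical marker patterns based on the n-gram examples listed above.
ROMAN = r"[IVXLCDM]+"
MARKERS = {
    # "Chapter 12" or "CHAPTER IV"; case-insensitive, so a short word made
    # only of roman-numeral letters (e.g. "did") can cause a false positive.
    "chapter": re.compile(rf"\bchapter\s+(?:\d+|{ROMAN})\b", re.IGNORECASE),
    "to_be_continued": re.compile(r"\bto\s+be\s+continued\b", re.IGNORECASE),
    # Only count "the end" when it closes the article text.
    "the_end": re.compile(r"\bthe\s+end\b\.?\s*$", re.IGNORECASE),
}

def find_markers(text):
    """Return the sorted names of the serial-fiction markers found in text."""
    return sorted(name for name, pat in MARKERS.items() if pat.search(text))

print(find_markers("CHAPTER IV. The storm broke. (To be continued)"))
```

An article that matches none of the markers is not necessarily nonfiction; the markers are best treated as one signal among several.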

THE MYSTERIOUS MCCORKLES by F. Roney Weir
http://uller.grainger.uiuc.edu/omeka/items/show/20


Analysis/Results
• Manual indexing of The Farmer’s Wife w/Omeka
  - Sample set completed Fall 2012
  - http://uller.grainger.illinois.edu/omeka/

• Topic analysis (Latent Dirichlet Allocation, David Blei et al.) w/MALLET (http://mallet.cs.umass.edu/)
  - Sample topic keywords: Barney time water butter put milk de corn wagon chickens day weather dinner clean Mercy home lay table dry made Marigold morning make Anne bread

• Network analysis w/Gephi
  - Topics and documents are nodes; docs in topics are edges.
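One way to picture the topic/document network is as an edge list built from per-document topic proportions. The sketch below is an assumption about how such a file could be produced, not the project's actual pipeline: the function names, the 0.1 threshold, and the sample proportions are all made up for illustration. The Source/Target/Weight column names follow the convention Gephi's spreadsheet import is generally understood to read.

```python
import csv
import io

def topic_document_edges(doc_topics, threshold=0.1):
    """Build (source, target, weight) edges linking documents to topics.

    doc_topics maps a document id to {topic id: proportion}; only
    proportions at or above threshold become edges.
    """
    edges = []
    for doc_id, topics in sorted(doc_topics.items()):
        for topic_id, weight in sorted(topics.items()):
            if weight >= threshold:
                edges.append((doc_id, f"topic_{topic_id}", weight))
    return edges

def edges_to_csv(edges):
    """Serialize edges as CSV text with Source/Target/Weight columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()

# Made-up proportions for illustration only.
sample = {
    "article_1": {0: 0.72, 1: 0.05},
    "article_2": {0: 0.30, 1: 0.55},
}
edges = topic_document_edges(sample)
print(edges_to_csv(edges))
```

Raising the threshold thins the graph so that only a document's dominant topics remain as edges.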

• Named entity recognition (NER) w/Stanford NLP Named Entity Recognizer
  - Proper names interfere with LSA; find names programmatically
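Character names such as Barney, Mercy, and Anne show up among the topic keywords, which is the interference noted above. The sketch below is a rough capitalization heuristic standing in for a real recognizer; it is not the Stanford NER and is far less accurate. Both helper functions are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z]+")

def candidate_names(text):
    """Guess proper names: words capitalized mid-sentence and never lowercase."""
    capitalized = Counter()
    lowercase = set()
    # Naive sentence split; abbreviations like "Mr." will split too early.
    for sentence in re.split(r"[.!?]+\s*", text):
        tokens = TOKEN.findall(sentence)
        for i, tok in enumerate(tokens):
            if tok.islower():
                lowercase.add(tok)
            # Skip the first token, which is capitalized in any sentence.
            elif i > 0 and tok[0].isupper() and tok[1:].islower():
                capitalized[tok] += 1
    return {w for w in capitalized if w.lower() not in lowercase}

def strip_names(tokens, names):
    """Drop tokens that match a candidate name, ignoring case."""
    lowered = {n.lower() for n in names}
    return [t for t in tokens if t.lower() not in lowered]

text = ("Yesterday Barney drove the wagon home. "
        "Then Mercy and Anne baked bread. The wagon was full.")
names = candidate_names(text)
print(sorted(names))
print(strip_names(["barney", "wagon", "mercy", "bread"], names))
```

Filtering the candidates out of the token stream before topic modeling is one possible use; a trained recognizer would also separate people from places and organizations, which this heuristic cannot do.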


Analysis/Results (cont.)
• Naïve Bayes classifier using the NLTK toolkit
  - Similar to the NLTK movie review sample, trained on a small subset of articles with the top 2000 words as features

>>> classifier.show_most_informative_features(5)
    contains(having) = True        fictio : nonfic =  1.9 : 1.0
    contains(plan) = True          fictio : nonfic =  1.9 : 1.0
    contains(growing) = True       fictio : nonfic =  1.9 : 1.0
    contains(entertaining) = True  fictio : nonfic =  1.9 : 1.0
    contains(home) = True          fictio : nonfic =  1.9 : 1.0

• High accuracy (> .95) but weak ratios
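The ratios in that listing compare how likely a feature is under each label, so a value near 1.0 means the word barely separates fiction from nonfiction. The sketch below recomputes such ratios in plain Python on a tiny made-up corpus. It is a simplified stand-in, not NLTK's implementation; the `feature_ratios` helper and the add-0.5 smoothing are assumptions chosen for illustration.

```python
from collections import Counter

def feature_ratios(labeled_docs, smoothing=0.5):
    """Return {word: P(word present | fiction) / P(word present | nonfiction)}.

    labeled_docs is a list of (set_of_words, label) pairs with labels
    "fiction" or "nonfiction". Probabilities use additive smoothing over
    the two outcomes (present, absent).
    """
    doc_counts = Counter(label for _, label in labeled_docs)
    present = {"fiction": Counter(), "nonfiction": Counter()}
    for words, label in labeled_docs:
        present[label].update(words)
    vocab = set(present["fiction"]) | set(present["nonfiction"])

    def prob(word, label):
        return (present[label][word] + smoothing) / (doc_counts[label] + 2 * smoothing)

    return {w: prob(w, "fiction") / prob(w, "nonfiction") for w in vocab}

# Made-up documents for illustration only.
docs = [
    ({"chapter", "home", "she"}, "fiction"),
    ({"chapter", "she", "said"}, "fiction"),
    ({"corn", "home", "prices"}, "nonfiction"),
    ({"corn", "wagon", "prices"}, "nonfiction"),
]
ratios = feature_ratios(docs)
for word in ("chapter", "home", "corn"):
    print(word, round(ratios[word], 2))
```

In this toy corpus "home" appears equally often under both labels and scores 1.0, while "chapter" scores well above it. A ratio of 1.9 : 1.0 sits close to the uninformative end of that range.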

Next Steps
• Implement Veridian
• Crowdsource OCR correction
• Automated tagging of articles
• Direct access to index (Solr)
• Continue NLP research using NLTK w/ additional classifiers, NER research, and a full training set
• Expand probabilistic statistical methods across the archive (~1 million pages, 5 million articles)
