Dime-Novel Genre Classifier: A Prototype Text-Mining Application
Project Supported by the Digital Convergence Lab at NIU
Members
Marcos Quezada, Information Systems Engineer
Fredrik Stark, Doctoral Candidate in English
Mitchell Zaretsky, Computer Science Major
The Project Team
Client: Matthew Short, Metadata Librarian
Coach: Drew VandeCreek, Director of Digital Scholarship
Technical Support: Michael Swope, Instructional Support Analyst
● Develop a genre-classifier application to assist library cataloguers when digitizing NIU’s collections of dime novels
● Compile a list of genres and related subject terms for possible use in reclassifying online digitized collections
● Investigate text-mining tools for (1) future development of the prototype classifier app and (2) future studies of the collections
Client’s Goals
• Collection of 50,000 analog novels
• Currently available online: 1,900 digitized novels (90,000 pages)
• http://dimenovels.lib.niu.edu
Digitized Text Corpus
Solution: Create a Text-Mining Application
Image Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996) The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM, 39(11)
NIU’s Digitized Dime Novel Collection
Classifier App
Text Files of Better-Represented Genres
Stop Words
Tokenization
Lemmatization
Stemming
Pruning
Bag of Words
Vectorization
Naïve Bayes
Genre Top Words
Classification Checks
Adventure stories | Bildungsromans | Detective and mystery stories | Historical fiction | Love stories | Sea stories | Western stories
warrior | broker | crimin | colonel | sprung | sailor | prairi
trapper | market | disguis | soldier | lover | deck | gulch
ranger | stock | hotel | sword | warrior | vessel | calam
tribe | illustr | polic | scout | mum | crew | warrior
wee | clerk | plot | lieuten | alter | schooner | outlaw
fur | sell | crook | warrior | wee | brig | trapper
sprung | desk | avenu | confeder | prairi | anchor | rifle
savag | rascal | doctor | sprung | nun | pirat | gal
rifle | bought | stair | union | god | ashor | miner
scout | share | confeder | cano | rifle | cabin | scout
Top 10 of 100 Subject Words for Each Genre
• How can our proposed application help cataloguers classify the texts?
• How does computer-assisted classifying of dime novels compare to classifying done by cataloguers?
• To what extent can text-mining tools also help us answer questions about form and content of digitized dime novels?
Project Questions
of Chess.3--Boo_k of Croquet.4-—-Cricket and IFoot-BallEi-Curling and Skating.6—-Riding and. Driving.7--Ya.chting and Rowing.8-Guide to Swimming.SJ-Pedestrianism.“ 10-Books of Fun, Nos. 1, 2, 3BEADLJEPS DIME FAMILY HAND-BOOKSFor Housewives.N0. 1——Cook Book.CCG€G65G2—Recipe Book3—-I—Iousekeeper’s Guide.4-‘-Family Physician.5—Dressmaker and Milliner.H‘ For sale by. all Newsdealers and Booksellers; or sent, post-paid, to any address, on receipt of price-TEN CENTS each. BEADLE AND COMPANY, Publishers,98 William Street, New York. MAD SKIPPER;03»A CRUISE AFTER THE MAELSTROM.A TALE OF THE SVEA.BY ROGER STARBUCK,Luz-nor. or “ 601mm: HARPOON,” "on rm: nut,"“our AWAY,” mo.NEW YORK:BEADLE AND COMPANY, PUBLISHERS,11s WILLIAM swnnm‘. Entered according to Act of Congress, in the you 1&6, byv BEADLEAND COMPANY,h tho Clerk‘: O?lce of the District Court of the United sum for miSouthern District of New York.~(No- 94-) THE MAD $KIPPER.G H A P T E R I-was: nnsmvrnns.“ SPLASH! splash! splash! Here he comesagain-the rain! The calaboose more better than this place. We got no tabao’to smoke nor nothing to eat. Wish me back again in St. Mi?chael with piece of bread and one little baskeet of grapes l” The speaker, an odd-looking Portuguese dwarf with an enormous head, sat by the entrance of a small cave near the summit of a lofty hill overlooking the town and harbor of San Carlos, Chiloe island. Far beyond the town, which consists of little wooden buildings, few of them more
Paratext Contains Multiple Subject Terms
chess book croquet cricket football curling skating riding
driving yachting rowing guide swimming pedestrianism books
fun dime family handbooks housewives cook book recipe
book housekeeper guide family physician dressmaker milliner
sale newsdealers booksellers sent postpaid address receipt
price ten cents beadle company publishers street york mad
skipper cruise maelstrom tale starbuck harpoon away york
beadle company publishers entered congress beadle
company clerk district court united southern district york
mad splash splash splash comes again rain calaboose place
tabao smoke eat wish piece bread little baskeet grapes
speaker oddlooking portuguese dwarf enormous head sat
entrance small cave summit lofty hill overlooking town harbor
san chiloe island town consists little wooden buildings
Mixed Sets of Subject Terms in Filtered Text
Our Response: Remove Multi-Genre Texts
• Initial data set: 1,608 texts
• Multi-genre texts removed: 286 texts
• Revised data set: 1,322 texts
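The filtering step can be sketched as follows. This is a minimal Python illustration, not the team's actual procedure: the record layout (`id`, `genres`) and the example identifiers are hypothetical, and only the three counts come from the slide.

```python
def single_genre_only(records):
    """Keep only records tagged with exactly one genre."""
    return [r for r in records if len(r["genres"]) == 1]

# Hypothetical records; identifiers and tags are invented for illustration.
records = [
    {"id": "example:1", "genres": ["Sea stories"]},
    {"id": "example:2", "genres": ["Sea stories", "Love stories"]},
    {"id": "example:3", "genres": ["Western stories"]},
]
kept = single_genre_only(records)

# Arithmetic check on the counts reported on the slide.
initial, removed = 1608, 286
revised = initial - removed  # 1322
```

Dropping multi-genre texts gives the classifier one unambiguous label per training text, at the cost of a smaller training set.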
• High dimensionality
• Unstructured data
• “Bag of words”
• Vector with one dimension for every unique term in space
• Term Frequency (TF)
• Corpus representation
– Inverse Document Frequency (IDF)
Text Representation
“Why, papa,” whispered the young girl, uneasily, as the boat pulled alongside the little wharf, “these men are all savages!”*
Sentence converted to a vector in the term and document space
whisper 0.059
girl 0.017
boat 0.329
littl 0.003
men 0.003
savag 0.141
* A sentence from page 20 in The Mad Skipper; or, A Cruise after the Maelstrom: A Tale of the Sea. By Roger Starbuck. New York: Beadle and Adams, 3 April 1866.
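A TF-IDF weighting of this kind can be sketched in a few lines of Python. This is an illustration only: it uses a common textbook variant, tf × log(N / df), which may differ from the formula Weka applies, and the three toy documents are hypothetical. It therefore does not reproduce the weights shown on the slide, which depend on the full corpus.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical, already-stemmed toy documents (not from the NIU corpus).
docs = [
    ["whisper", "girl", "boat", "littl", "men", "savag"],
    ["boat", "deck", "sailor", "men"],
    ["girl", "lover", "littl", "men"],
]
vectors = tf_idf(docs)
```

In this toy corpus "men" occurs in every document, so its weight is zero, while "savag" occurs in only one and gets the largest weight in the first vector. The same effect explains why common words such as "littl" and "men" score low on the slide.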
Applying these filter options improved model accuracy from 65% to 75%
• StringToWordVector filter
– TF-IDF
– Lowercase
– Snowball Stemmer
– Jockers’ Stopwords
– Keep 500 Words
Preprocessing
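The filter options above correspond roughly to the steps sketched below in Python. This is an assumption-laden stand-in, not the Weka filter itself: `crude_stem` is a hypothetical suffix stripper used in place of the Snowball stemmer, and `STOPWORDS` is a tiny illustrative list rather than Jockers' stop list.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the project used Jockers' much larger list.
STOPWORDS = {"the", "a", "an", "and", "as", "all", "are", "these", "why"}

def crude_stem(word):
    """Strip a few common suffixes. A rough stand-in for a Snowball stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def top_terms(token_lists, keep=500):
    """Return the `keep` most frequent terms across all documents."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [term for term, _ in counts.most_common(keep)]

tokens = preprocess("These men are all savages!")
```

Here `tokens` is `["men", "savag"]`. Limiting the vocabulary with `top_terms` keeps the vector space small, which is the purpose of the "Keep 500 Words" option.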
• Naïve Bayes Multinomial Classifier
• Treats attributes as independent
• Determines most probable class label
• Ran with 10-fold cross-validation
Model Planning and Training
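For readers unfamiliar with the method, a multinomial Naïve Bayes classifier and a simple k-fold check can be sketched in plain Python. This is not the team's Weka model: the class and function names are hypothetical, add-one smoothing is an assumption, the fold assignment is a plain round-robin (not stratified), and the four toy documents are invented.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.vocab = {t for doc in docs for t in doc}
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        return self

    def predict_proba(self, doc):
        """Return {label: probability}; terms are treated as independent."""
        n_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            score = math.log(n / n_docs)  # class prior
            for term in doc:
                if term in self.vocab:  # terms unseen in training are ignored
                    count = self.term_counts[label][term]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: v / norm for label, v in exp.items()}

    def predict(self, doc):
        probs = self.predict_proba(doc)
        return max(probs, key=probs.get)

def cross_validate(docs, labels, k=10):
    """Return pooled accuracy over k folds (item i is held out in fold i % k)."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(docs)) if i % k != fold]
        test = [i for i in range(len(docs)) if i % k == fold]
        model = MultinomialNB().fit(
            [docs[i] for i in train], [labels[i] for i in train]
        )
        correct += sum(model.predict(docs[i]) == labels[i] for i in test)
    return correct / len(docs)

# Hypothetical stemmed toy documents; not drawn from the NIU corpus.
docs = [
    ["sailor", "deck", "vessel"],
    ["gulch", "outlaw", "rifle"],
    ["crew", "schooner", "deck"],
    ["prairi", "miner", "rifle"],
]
labels = ["Sea stories", "Western stories", "Sea stories", "Western stories"]
model = MultinomialNB().fit(docs, labels)
probs = model.predict_proba(["deck", "crew", "anchor"])
```

The dictionary returned by `predict_proba` is the kind of probability distribution over genres that a cataloguer could inspect before accepting or overriding a suggested label.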
Our Classifier App
• Uses Weka Java API
• Builds classifier from ARFF file
• User inputs novel identifier
• App returns probability distribution
• User can choose to classify novel and save to ARFF file
Our Classifier App Improvements
Model | Instance Accuracy
Initial | 65%
Improved Filter | 75%
Removed Multi-Genre | 83%
Test Corpus from a Newly-Digitized Series
• Multiple genres in the series
• Total issues in test corpus: 214
• Classification convergence of 71%
– 152 matched classifications
– 62 mismatched classifications
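The convergence figure is the share of test issues where the app's genre agrees with the cataloguer's. A minimal Python sketch follows; the function name is hypothetical, and comparing labels case-insensitively is an assumption made because the two sources capitalize genre names differently ("Sea stories" versus "Sea Stories").

```python
def convergence(cataloger_genres, app_genres):
    """Return (matches, mismatches, match rate), comparing case-insensitively."""
    pairs = list(zip(cataloger_genres, app_genres))
    matches = sum(c.strip().lower() == a.strip().lower() for c, a in pairs)
    return matches, len(pairs) - matches, matches / len(pairs)

# Two label pairs of the kind reported for the test corpus.
result = convergence(
    ["Sea stories", "Detective & Mystery Stories"],
    ["Sea Stories", "Bildungsromans"],
)

# Reported counts: 152 matches among 214 issues.
rate = 152 / 214
print(f"{rate:.0%}")  # prints 71%
```

The reported counts are internally consistent: 152 + 62 = 214, and 152 / 214 rounds to 71%.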
Identifier | Cataloger genre | App genre | Confidence | Match
dimenovels:90660 | Sea stories | Sea Stories | 0.9999999986 | 1
dimenovels:90849 | Sea stories | Sea Stories | 1 | 1
dimenovels:90979 | Sea stories | Sea Stories | 0.9999999997 | 1
dimenovels:94229 | Western stories | Western Stories | 1 | 1
dimenovels:94943 | Western stories | Western Stories | 0.999999997 | 1
Some Matched Classifications
Identifier | Cataloger genre | App genre | Confidence | Match
dimenovels:92635 | Detective & Mystery Stories | Bildungsromans | 0.7394990695 | 0
dimenovels:94909 | Detective & Mystery Stories | Bildungsromans | 0.6176507229 | 0
dimenovels:91179 | Detective & Mystery Stories | Sea Stories | 0.4438477785 | 0
dimenovels:93280 | Detective & Mystery Stories | Sea Stories | 0.704385396 | 0
dimenovels:95066 | Detective & Mystery Stories | Western Stories | 0.5247524713 | 0
Some Mismatched Classifications
• Our application can significantly help cataloguers narrow down a dime novel’s genre.
• Our application is an effective tool for testing cataloguers’ judgments on a text’s genre.
• Text-mining uncovers details about form and content of NIU’s digitized dime novels that invite further studies.
Answers to Our Preliminary Questions
Beyond the Classifier App
Cognitive Computing
Links documents that you provide with a pre-existing graph of concepts based on Wikipedia
Understand the content and context within text and images.
Deeper understanding of personality characteristics, needs and values.
Understand the meaning of individual sentences and documents.
Beyond the Classifier App
• Put the Classifier App into use
• Embrace the Analytics Lifecycle
• Foster understanding of computer scripting languages
• Use text-mining tools to prepare texts for stylistic analysis