Dime-Novel Genre Classifier: A Prototype Text-Mining Application

Project Supported by the Digital Convergence Lab at NIU

Upload: marcos-quezada

Post on 28-Jan-2018

Acknowledgments

The Project Team

Members

Marcos Quezada, Information Systems Engineer

Fredrik Stark, Doctoral Candidate in English

Mitchell Zaretsky, Computer Science Major

Client

Matthew Short, Metadata Librarian

Coach

Drew VandeCreek, Director of Digital Scholarship

Technical Support

Michael Swope, Instructional Support Analyst

● Develop a genre-classifier application to assist library cataloguers when digitizing NIU’s collections of dime novels

● Compile a list of genres and related subject terms for possible use in reclassifying online digitized collections

● Investigate text-mining tools for (1) future development of the prototype classifier app and (2) future studies of the collections

Client’s Goals

Research on Dime Novels

• Collection of 50,000 analog novels

• Currently available online: 1,900 digitized novels (90,000 pages)

• http://dimenovels.lib.niu.edu

Digitized Text Corpus

Problem: Assigning Novels to Genres

Solution: Create a Text-Mining Application

Image Source: Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11).

NIU’s Digitized Dime Novel Collection

Classifier App

Text Files of Better-Represented Genres

Stop Words, Tokenization

Lemmatization, Stemming, Pruning

Bag of Words, Vectorization

Naïve Bayes, Genre Top Words

Classification Checks

| Adventure stories | Bildungsromans | Detective and mystery stories | Historical fiction | Love stories | Sea stories | Western stories |
| --- | --- | --- | --- | --- | --- | --- |
| warrior | broker | crimin | colonel | sprung | sailor | prairi |
| trapper | market | disguis | soldier | lover | deck | gulch |
| ranger | stock | hotel | sword | warrior | vessel | calam |
| tribe | illustr | polic | scout | mum | crew | warrior |
| wee | clerk | plot | lieuten | alter | schooner | outlaw |
| fur | sell | crook | warrior | wee | brig | trapper |
| sprung | desk | avenu | confeder | prairi | anchor | rifle |
| savag | rascal | doctor | sprung | nun | pirat | gal |
| rifle | bought | stair | union | god | ashor | miner |
| scout | share | confeder | cano | rifle | cabin | scout |

Top 10 of 100 Subject Words for Each Genre

• How can our proposed application help cataloguers classify the texts?

• How does computer-assisted classifying of dime novels compare to classifying done by cataloguers?

• To what extent can text-mining tools also help us answer questions about the form and content of the digitized dime novels?

Project Questions

Initial Findings

Some Successful Classifications

Some Questionable Classifications

Why the Questionable Classifications?

Stories Surrounded by Paratext

of Chess.3--Boo_k of Croquet.4-—-Cricket and IFoot-BallEi-Curling and Skating.6—-Riding and. Driving.7--Ya.chting and Rowing.8-Guide to Swimming.SJ-Pedestrianism.“ 10-Books of Fun, Nos. 1, 2, 3BEADLJEPS DIME FAMILY HAND-BOOKSFor Housewives.N0. 1——Cook Book.CCG€G65G2—Recipe Book3—-I—Iousekeeper’s Guide.4-‘-Family Physician.5—Dressmaker and Milliner.H‘ For sale by. all Newsdealers and Booksellers; or sent, post-paid, to any address, on receipt of price-TEN CENTS each. BEADLE AND COMPANY, Publishers,98 William Street, New York. MAD SKIPPER;03»A CRUISE AFTER THE MAELSTROM.A TALE OF THE SVEA.BY ROGER STARBUCK,Luz-nor. or “ 601mm: HARPOON,” "on rm: nut,"“our AWAY,” mo.NEW YORK:BEADLE AND COMPANY, PUBLISHERS,11s WILLIAM swnnm‘. Entered according to Act of Congress, in the you 1&6, byv BEADLEAND COMPANY,h tho Clerk‘: O?lce of the District Court of the United sum for miSouthern District of New York.~(No- 94-) THE MAD $KIPPER.G H A P T E R I-was: nnsmvrnns.“ SPLASH! splash! splash! Here he comesagain-the rain! The calaboose more better than this place. We got no tabao’to smoke nor nothing to eat. Wish me back again in St. Mi?chael with piece of bread and one little baskeet of grapes l” The speaker, an odd-looking Portuguese dwarf with an enormous head, sat by the entrance of a small cave near the summit of a lofty hill overlooking the town and harbor of San Carlos, Chiloe island. Far beyond the town, which consists of little wooden buildings, few of them more

Paratext Contains Multiple Subject Terms

chess book croquet cricket football curling skating riding

driving yachting rowing guide swimming pedestrianism books

fun dime family handbooks housewives cook book recipe

book housekeeper guide family physician dressmaker milliner

sale newsdealers booksellers sent postpaid address receipt

price ten cents beadle company publishers street york mad

skipper cruise maelstrom tale starbuck harpoon away york

beadle company publishers entered congress beadle

company clerk district court united southern district york

mad splash splash splash comes again rain calaboose place

tabao smoke eat wish piece bread little baskeet grapes

speaker oddlooking portuguese dwarf enormous head sat

entrance small cave summit lofty hill overlooking town harbor

san chiloe island town consists little wooden buildings

Mixed Sets of Subject Terms in Filtered Text
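The filtering that turns raw page text into a term list like the one above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's Weka configuration: the `filter_text` helper and the tiny `STOP_WORDS` set are hypothetical (the project used Jockers' much longer stop-word list), and OCR-damaged tokens pass through unchanged.

```python
import re

# Hypothetical, very small stop-word set for illustration only;
# the project used Jockers' stop-word list instead.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "by", "we", "got",
              "no", "nor", "nothing", "this", "than", "more", "better",
              "here", "he", "me", "with", "one", "back"}

def filter_text(raw, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    text = raw.lower().replace("-", "")   # "odd-looking" -> "oddlooking"
    tokens = re.findall(r"[a-z]+", text)  # digits and punctuation are discarded
    return [t for t in tokens if t not in stop_words]

sample = ("We got no tabao' to smoke nor nothing to eat. "
          "The speaker, an odd-looking Portuguese dwarf")
print(filter_text(sample))
```

Because the filter treats every page the same way, advertisement and title-page words such as "chess", "cook", and "publishers" survive it alongside the story's own vocabulary, which is how the mixed sets of subject terms arise.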

Single Novels with Different Stories and Genres

Series Genres vs. Story Genres

Our Response: Remove Multi-Genre Texts

• Initial data set: 1,608 texts

• Multi-genre texts removed: 286 texts

• Revised data set: 1,322 texts

A Look under the Hood; or, How Our App Works

• High dimensionality

• Unstructured data

• “Bag of words”

• Vector with one dimension for every unique term in the space

• Term Frequency (TF)

• Corpus representation

– Inverse Document Frequency (IDF)

Text Representation

“Why, papa,” whispered the young girl, uneasily, as the boat pulled alongside the little wharf, “these men are all savages!”*

Sentence converted to a vector in the term and document space

whisper 0.059

girl 0.017

boat 0.329

littl 0.003

men 0.003

savag 0.141

* A sentence from page 20 in The Mad Skipper; or, A Cruise after the Maelstrom: A Tale of the Sea. By Roger Starbuck. New York: Beadle and Adams, 3 April 1866.
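To make the TF and IDF weighting concrete, here is a minimal Python sketch. It assumes one common variant (relative term frequency multiplied by log(N / document frequency)); Weka's exact weighting may differ, so it will not reproduce the weights shown for the sentence above. The `tf_idf` function and the three-document toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Toy corpus of already stemmed tokens (hypothetical, for illustration).
corpus = [
    ["whisper", "girl", "boat", "littl", "men", "savag"],
    ["boat", "deck", "crew", "men"],
    ["girl", "lover", "littl", "men"],
]
for term, weight in sorted(tf_idf(corpus)[0].items()):
    print(f"{term:8s} {weight:.3f}")
```

In this toy corpus a term that occurs in every document ("men") receives a weight of zero, while a term unique to one document ("whisper") receives the largest weight. The same effect would explain why common words such as "littl" and "men" carry small weights in the sentence vector.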

• StringToWordVector filter

– TF-IDF

– Lowercase

– Snowball Stemmer

– Jockers’ Stopwords

– Keep 500 Words

Applying these filter options improved model accuracy from 65% to 75%.

Preprocessing
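The "keep N words" idea can be illustrated with a short Python sketch that prunes the vocabulary to the most frequent terms and then counts them per document. This is an assumption-laden approximation: the helper names `build_vocabulary` and `to_vector` are hypothetical, and Weka's own word-limit option may select terms differently in detail.

```python
from collections import Counter

def build_vocabulary(docs, keep=500):
    """Keep the `keep` most frequent terms across all documents."""
    totals = Counter(term for doc in docs for term in doc)
    return [term for term, _ in totals.most_common(keep)]

def to_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

# Tiny tokenized corpus (hypothetical), pruned to its two most frequent terms.
docs = [["boat", "deck", "boat"], ["boat", "crew", "deck"]]
vocabulary = build_vocabulary(docs, keep=2)
print(vocabulary)
print(to_vector(["deck", "deck", "crew"], vocabulary))
```

Terms that fall outside the kept vocabulary ("crew" here) simply disappear from the vector, which keeps the dimensionality manageable at the cost of ignoring rare words.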

• After filtering, the data appears in vector format

Transformation

• Naïve Bayes Multinomial Classifier

• Treats attributes as independent

• Determines most probable class label

• Ran with 10-fold cross-validation

Model Planning and Training
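The classifier's core idea can be sketched in plain Python. The block below is a minimal multinomial Naive Bayes with add-one smoothing, written for illustration only: the project used Weka's implementation, and the class name `MultinomialNB`, the toy training data, and the choice of smoothing are assumptions rather than details taken from the slides. Cross-validation is not shown.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {term for doc in docs for term in doc}
        self.prior = Counter(labels)            # documents per class
        self.term_counts = defaultdict(Counter) # term counts per class
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.n_docs = len(docs)
        return self

    def predict_proba(self, doc):
        """Return {class: probability} for one tokenized document."""
        log_scores = {}
        for c in self.classes:
            total = sum(self.term_counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / self.n_docs)
            for term in doc:
                if term in self.vocab:  # terms never seen in training are ignored
                    score += math.log((self.term_counts[c][term] + 1) / total)
            log_scores[c] = score
        # Normalize in log space to avoid numeric underflow.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

# Hypothetical training data built from a few of the genre top words.
train_docs = [["sailor", "deck", "vessel"], ["crew", "schooner", "deck"],
              ["prairi", "gulch", "outlaw"], ["miner", "rifle", "scout"]]
train_labels = ["Sea stories", "Sea stories",
                "Western stories", "Western stories"]

model = MultinomialNB().fit(train_docs, train_labels)
dist = model.predict_proba(["deck", "crew", "rifle"])
print(max(dist, key=dist.get), dist)
```

Each term contributes its own factor to the class score, independently of the others. With hundreds of terms per novel these factors multiply into very lopsided distributions, which is one plausible reason a classifier of this kind can report confidences close to 1.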

Our Classifier App

• Uses Weka Java API

• Builds classifier from ARFF file

• User inputs novel identifier

• App returns probability distribution

• User can choose to classify novel and save to ARFF file
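For readers unfamiliar with ARFF, the following Python sketch renders classified texts in that format. The attribute names (`text`, `genre`), the relation name, and the `to_arff` helper are assumptions for illustration; the slides do not show the layout of the project's actual ARFF file.

```python
def to_arff(relation, rows, genres):
    """Render (text, genre) rows as an ARFF document string."""
    def quote(value):
        # Quote values and escape backslashes and single quotes.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute genre {" + ",".join(quote(g) for g in genres) + "}",
        "",
        "@data",
    ]
    lines += [f"{quote(text)},{quote(genre)}" for text, genre in rows]
    return "\n".join(lines) + "\n"

# Hypothetical example: one filtered text with its assigned genre.
arff = to_arff(
    "dime_novels",
    [("splash splash rain calaboose", "Sea stories")],
    ["Sea stories", "Western stories"],
)
print(arff)
```

Saving a newly classified novel then amounts to appending one more data row of this shape, so that it can serve as training data the next time the classifier is built.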

Our Classifier App Improvements

| Model | Instance Accuracy |
| --- | --- |
| Initial | 65% |
| Improved Filter | 75% |
| Removed Multi-Genre | 83% |

Testing Our Classifier App

Test Corpus from a Newly Digitized Series

• Multiple genres in the series

• Total issues in test corpus: 214

• Classification convergence of 71%

– 152 matched classifications

– 62 mismatched classifications
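The convergence figure is the share of issues on which the app and the cataloguer agree: 152 / (152 + 62) = 152 / 214, or about 71%. A small Python sketch of that check follows. The `agreement` helper is hypothetical, and comparing labels case-insensitively is an assumption made because the cataloguer and app labels differ in capitalization ("Sea stories" vs. "Sea Stories").

```python
def agreement(pairs):
    """Return (matches, mismatches, rate) for (cataloger, app) label pairs."""
    matches = sum(1 for cataloger, app in pairs
                  if cataloger.strip().lower() == app.strip().lower())
    mismatches = len(pairs) - matches
    return matches, mismatches, matches / len(pairs)

# Counts reported for the test corpus.
matched, mismatched = 152, 62
total = matched + mismatched
print(total, f"{matched / total:.0%}")  # prints: 214 71%

# Two label pairs in the style of the result tables.
print(agreement([("Sea stories", "Sea Stories"),
                 ("Detective & Mystery Stories", "Bildungsromans")]))
```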

| Identifier | Cataloger genre | App genre | Confidence | Match |
| --- | --- | --- | --- | --- |
| dimenovels:90660 | Sea stories | Sea Stories | 0.9999999986 | 1 |
| dimenovels:90849 | Sea stories | Sea Stories | 1 | 1 |
| dimenovels:90979 | Sea stories | Sea Stories | 0.9999999997 | 1 |
| dimenovels:94229 | Western stories | Western Stories | 1 | 1 |
| dimenovels:94943 | Western stories | Western Stories | 0.999999997 | 1 |

Some Matched Classifications

| Identifier | Cataloger genre | App genre | Confidence | Match |
| --- | --- | --- | --- | --- |
| dimenovels:92635 | Detective & Mystery Stories | Bildungsromans | 0.7394990695 | 0 |
| dimenovels:94909 | Detective & Mystery Stories | Bildungsromans | 0.6176507229 | 0 |
| dimenovels:91179 | Detective & Mystery Stories | Sea Stories | 0.4438477785 | 0 |
| dimenovels:93280 | Detective & Mystery Stories | Sea Stories | 0.704385396 | 0 |
| dimenovels:95066 | Detective & Mystery Stories | Western Stories | 0.5247524713 | 0 |

Some Mismatched Classifications

• Our application can significantly help cataloguers narrow down a dime novel’s genre.

• Our application is an effective tool for testing cataloguers’ judgments on a text’s genre.

• Text mining uncovers details about the form and content of NIU’s digitized dime novels that invite further study.

Answers to Our Preliminary Questions

What Next?

Beyond the Classifier App

Cognitive Computing

Links documents that you provide with a pre-existing graph of concepts based on Wikipedia

Understand the content and context within text and images.

Deeper understanding of personality characteristics, needs and values.

Understand the meaning of individual sentences and documents.

Beyond the Classifier App

Sentiment Analysis

[Figure: plot diagram for The Mad Skipper (1866)]

Beyond the Classifier App

• Put the Classifier App into use

• Embrace the Analytics Lifecycle

• Foster understanding of computer scripting languages

• Use text-mining tools to prepare texts for stylistic analysis

Open Discussion