Web Classification Of Digital Libraries Using GATE Machine Learning
Stephen J. Stose
IST 565 – Spring 2011
Goals and Objectives
• Learn and educate others about GATE (General Architecture for Text Engineering), www.gate.ac.uk, and its Machine Learning (ML) and Weka add-ons
• Illustrate GATE's natural language processing (NLP) and Information Extraction (IE) capabilities
• Apply natural language processing and ML to classify web HTML documents into two categories: Digital Libraries (DL) and non-Digital Libraries (non-DL)
• Explore preliminary results: discriminating DL from non-DL will populate an in-the-works digital library of all digital libraries, www.digitallibrarycentral.com
IST 565: Stose
www.gate.ac.uk
Set of Java tools developed at U. of Sheffield (UK)
An integrated development environment (IDE) with a graphical user interface (GUI)
Multi-lingual support (including RTL languages)
Handles multiple text input formats (XML, TXT, DOC, PDF, database, HTML, SGML); outputs annotated XML
Annotation editor, including OWL and RDF metadata
At its core is ANNIE (A Nearly-New Information Extraction System), for Information Extraction (IE)
General workflow
1. Add HTML documents to the Language Resources panel; combine them into a corpus for processing. Uploading automatically annotates document structure (e.g., HTML <h1>, <meta>, content="" annotations), then shown in the right sidebar.
2. Activate the IE Processing Resources (i.e., ANNIE).
3. Run the Processing Resources pipeline over the corpus to annotate the documents.
4. Understand/edit annotations via the { Type.feature=value } syntax to create further annotations relevant to the study.
5. Add other Processing Resource add-ons to the pipeline (ML Batch Learning, Weka).
6. Initial results: evaluation & summary.
7. Train/test the classifier on a more representative sample.
Trial Corpus: DL_eval_2
(sets chosen for initial testing/evaluation purposes)

Non-DL Set
12 distinct news websites selected:
• Reuters • Newsweek • National Review • LA Times • The Guardian • CS Monitor • CNN • Chicago Tribune • Boston Globe • Bloomberg • BBC • Wall Street Journal

DL Set
13 distinct digital libraries selected from: http://www.columbia.edu/cu/lweb/digital/collections/
GATE GUI
Left Sidebar
Language Resources:
• Load individual documents
• Combine documents into one or more corpora

Right Sidebar
Original markups:
• Markups from structured (XML, HTML) text input are automatically extracted
IE Processing Resources
Default information extraction pipeline (ANNIE): IE processes run over the corpus

• Tokenizer
  • Tokenizes text: Unicode breakup into Token and SpaceToken annotations
  • Based on orthography ("don't" = "do n't") and kind ("word" tokens)
• Gazetteer
  • Looks up entities in a dictionary, providing annotations such as Organization, Person, PersonTitle, Money, Token, Date, Sentence
• Sentence Splitter
• Part of Speech (POS) Tagger
  • e.g., NounPhrase, VerbPhrase
• NE (named-entity) transducer
  • "May Jones" vs. "May 2010" vs. "May I leave this presentation?"
  • General Motors vs. General Lee
• Orthomatcher (co-reference)
  • Mr. Johanovitz = John Johanovitz = he (same person)
  • "The class is great. It is very fun." Here "The class" = "It"
  • CEO = Chief Executive Officer (same entity)
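The tokenizer's orthography-aware splitting mentioned above ("don't" = "do n't") can be imitated with a toy sketch. This is a simplification for illustration only, not ANNIE's actual rule-based Unicode tokenizer:

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation tokens, separating "n't"
    contractions the way described above ("don't" -> "do" + "n't")."""
    # Insert a space before "n't" so it becomes its own token
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # "n't" first so it wins over plain word matching
    return re.findall(r"n't|\w+|[^\w\s]", text)

toy_tokenize("Don't leave!")  # ['Do', "n't", 'leave', '!']
```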
GATE GUI
Left Sidebar
ANNIE Processing Resources (PRs):
• Tokenizer
• Gazetteer
• Sentence Splitter
• Part-of-Speech (POS) Tagger
• Named-Entity (NE) Transducer
• Orthomatcher
GATE GUI
Center
• PR pipeline
• Add/remove various PRs to/from the pipeline (order is important!)
• Select the corpus to apply the pipeline to

Bottom
• PR parameter settings
• Run the application pipeline over the corpus
ANNIE Information Extraction Pipeline (Adapted from gate.ac.uk tutorials)
GATE GUI
Right Sidebar
• Resulting annotations (color-coded) from the ANNIE IE pipeline

Bottom & Popup
• Annotation Type syntax: {Type.feature = value}
  • Token.kind=word
  • Organization.orgType=government
  • Person.gender=male
  • Token.string="term"
• The popup editor allows simple annotation editing
• The syntax provides the basis for writing bulk annotations (JAPE)
Token.string=“term”
Example: Token.string="headquarters" (an n-gram unigram)

A bag-of-words approach to text mining requires demarcated tokens, each token representing one term t in the vector for document d, such that the weight w for term t in document d is computed as:

w(d, t) = tf(d, t) × idf(t)

where tf(d, t) is the term frequency of t in d and idf(t) is the inverse document frequency, a measure of term specificity across the document corpus.
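A minimal sketch of this weighting in Python, assuming the common logarithmic idf variant (the slide does not fix an exact formula):

```python
import math

def tfidf_weights(docs):
    """Compute w(d, t) = tf(d, t) * idf(t) for every term in every document.

    docs: list of token lists, one list per document.
    Uses logarithmic idf: idf(t) = log(N / df(t)), where df(t) is the
    number of documents containing term t.
    """
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: n * math.log(n_docs / df[t]) for t, n in tf.items()})
    return weights

docs = [["digital", "library", "search"], ["news", "politics", "search"]]
w = tfidf_weights(docs)
# "search" appears in both documents, so idf = log(2/2) = 0 and its weight is 0
```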
Attributes and Class

Attributes: tokens & others?
(token grid: illustration of a document as a bag of tokens)
1. { Token.string="term" }
2. Gazetteer
  • Create digital-library dictionary terms
  • { Lookup.majorType=dlwords }
3. <meta> (e.g., Dublin Core):
  • <meta content="U. Digital Library" />
  • { meta.content }
  • Not used in the current talk

Class: DL vs. non-DL
• Demarcate the entire text in each document
• Annotate with DL or with non-DL
Annotate the entire document (Mention); Class is either DL or non-DL
{ Mention.type=nondl }
Newsweek (non-DL)
{ Mention.type=dl }
JohnJayPapers (DL)
JAPE speeds up document annotation (Java Annotation Patterns Engine)

Step 1:
Create a "Key" annotation set.
Provide only a TYPE document annotation (DL or nonDL), with EMPTY { } features, by highlighting and annotating over the entire document.
Step 2:
Write a JAPE script and run it with a JAPE transducer over the corpus pipeline.
This transduces the DL and nonDL annotation types into ONE annotation type, {Mention.type}, with two features:
{Mention.type=dl}
{Mention.type=nondl}
The JAPE script yields the same two features as manual annotation, just with less manual annotation time.
Phase: firstpass
Input: DL nonDL
Options: control = brill

Rule: DL
({DL}):dl
-->
:dl.Mention = {type="dl"}

Rule: nonDL
({nonDL}):nondl
-->
:nondl.Mention = {type="nondl"}
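In effect, the two rules perform a simple mapping from the hand-made annotation types onto one Mention annotation carrying a type feature. A Python paraphrase of that mapping (illustration only, not how the JAPE transducer actually executes):

```python
def transduce(annotation_types):
    """Map DL/nonDL annotation types onto Mention annotations with a
    lower-cased type feature, as the two JAPE rules above do."""
    return [("Mention", {"type": t.lower()}) for t in annotation_types]

transduce(["DL", "nonDL"])
# [('Mention', {'type': 'dl'}), ('Mention', {'type': 'nondl'})]
```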
Attributes and Class: DL or nonDL?
(2 attributes, each applied separately)

1. { Token.string=term } Class
• Uses n-gram unigrams (single tokens) of ALL Tokens within the document

2. { Lookup.majorType=dlwords } Class
• Lookup.majorType = dictionary entities within the Gazetteer
• Created a new majorType of words for dictionary lookup, called "dlwords"
• Uses n-gram unigrams of all Lookup.majorType=dlwords in the document
Updated "dlwords" Gazetteer
For the { Lookup.majorType=dlwords } attribute:
• Developed a set of words most likely to occur on DL websites
• The gazetteer is case-sensitive, so terms were entered in all capitalization and stem variations; for instance:
  • Digital Library
  • Digital library
  • digital library
  • Digital Libraries
  • Digital libraries
  • digital libraries
• Words/phrases selected:
  Advanced Search, Archive, Archives, Browse, Catalog, Collection, Collections, Digital, Digital Archive, Digital Archives, Digital Collection, Digital Collections, Digital Content, Digital Library, Digital Libraries, Digitization, Digitisation, Image, Images, Image Collection, Image Collections, Keyword, Keywords, Library, Libraries, Manuscript, Manuscripts, Repository, Repositories, Search, Search Tips, Special Collection, Special Collections, University, Universities, University Library, University Libraries
Any suggestions?
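Because the gazetteer lookup is case-sensitive, every base phrase needs its capitalization variants as separate entries. A small sketch that expands a base phrase list into such variants (the exact variant set chosen here is an assumption; GATE's gazetteers are plain .lst files, one entry per line):

```python
def gazetteer_variants(phrase):
    """Expand a base phrase into the capitalization variants a
    case-sensitive gazetteer needs as separate entries."""
    return sorted({
        phrase.lower(),       # "digital library"
        phrase.title(),       # "Digital Library"
        phrase.capitalize(),  # "Digital library"
    })

entries = []
for base in ["digital library", "digital libraries", "special collection"]:
    entries.extend(gazetteer_variants(base))
# each entry would become one line of the dlwords gazetteer .lst file
```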
GATE GUI
Center
Add to the pipeline:
Batch Learning PR, GATE's own Machine Learning (ML) PR, for:
• Text Classification
• Chunk Recognition
• Relation Extraction
Requires an XML configuration file with the ML parameter settings
Modes:
• Evaluation
• Training
• Application
GATE ML specification
ML XML Configuration Parameters
• For a list of all parameter settings, see: http://gate.ac.uk/sale/tao/splitch17.html#x22-43500017
• Engines:
  • SVMLibSvmJava
  • SVMExec
  • Paum (Perceptron)
  • PaumExec
  • NaiveBayesWeka
  • KNNWeka
  • C4.5Weka
• Evaluation:
  • k-fold
  • Holdout
• Filtering (for SVM):
  • Balances the negative-to-positive instance ratio
  • Can remove negative instances near the hyperplane
• Many, many complex parameter settings!

Attributes: n-gram unigram on Token.string
Class (marked <CLASS/> in the configuration): Mention.type=dl or Mention.type=nondl
Evaluation: { Token.string=term } Class
Comparing post-evaluation annotations: Mention (expected) vs. MentionTest (observed)

Bloomberg News site mis-classification:
• Expected: Mention.type=nondl
• Observed: Mention.type=dl
Evaluation: { Lookup.majorType=dlwords } Class
• Uses the digital-library terms (dlwords) added to the Gazetteer
• Gazetteer vocabulary = Lookup.majorType
Evaluation: { Lookup.majorType=dlwords } Class
• Changed the configuration file
• Before: bag-of-words over the entire document (all Token.string, i.e., every term in the document)
• Now: bag-of-words constrained by Lookup.majorType (the vocabulary in the Gazetteer, which now includes "dlwords")
• Results (SVM, 0.66 holdout): 100% accuracy; observed = expected
• The n-gram entry in the configuration file:

<NGRAM>
  <NAME>ngram</NAME>
  <NUMBER>1</NUMBER>
  <CONSNUM>1</CONSNUM>
  <CONS-1>
    <TYPE>Lookup</TYPE>
    <FEATURE>majorType</FEATURE>
  </CONS-1>
</NGRAM>
Summary
• Learn and illustrate NLP and ML in the GATE GUI
• Trained a classifier using two kinds of attributes:
  1. Bag-of-words on ALL words occurring in the document: { Token.string }
  2. Bag-of-words on Gazetteer words only (adding "dlwords" to the set): { Lookup.majorType }
• To learn/illustrate GATE ML, a small sample was used (13 DL, 12 nonDL)
• 89% recall/precision: ALL words, { Token.string="…" }
• 100% recall/precision: Gazetteer "dlwords", { Lookup.majorType }
• How will the classifier discriminate DLs from a larger sample that includes ALL kinds of web documents?
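Outside GATE, the intuition behind the dlwords attribute can be conveyed with a crude count-based sketch. This is an illustration only, not the SVM that GATE actually trains; the word subset and threshold are arbitrary assumptions:

```python
# An assumed, lower-cased subset of the dlwords gazetteer
DLWORDS = {"digital library", "digital collection", "archive",
           "manuscript", "special collection", "repository", "digitization"}

def classify(text, threshold=2):
    """Label a page 'dl' if it mentions at least `threshold` distinct
    dlwords, else 'nondl'."""
    lowered = text.lower()
    hits = {w for w in DLWORDS if w in lowered}
    return "dl" if len(hits) >= threshold else "nondl"

classify("Browse the digital library and special collections archive")  # 'dl'
classify("Latest politics and business news headlines")                 # 'nondl'
```

An SVM over unigram features learns a weighted version of this idea rather than a hard count threshold.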
Test Corpus: nonDLvsDL_Eval
(sets annotated with DL/nonDL, pre-processed with ANNIE, and prepared for ML evaluation)

Non-DL Set (181)
• Random website generator: www.whatsmyip.org/random_websites/
• Generated 181 websites that:
  • were not digital libraries
  • were English-language only
  • had a minimum of text
• Still unrealistic, as the DL to non-DL ratio on the WWW is probably 1/1000 at best

DL Set (62)
Chose 62 digital libraries, mostly selected across 3 main DL university portals:
• Harvard University Digital Collections: http://digitalcollections.harvard.edu/
• Columbia University Digital Collections: http://www.columbia.edu/cu/lweb/digital/collections/
• Cornell University Libraries "Windows on the Past": http://cdl.library.cornell.edu/
• A few other select DLs
SVMLibSvmJava Evaluation Results: { Lookup.majorType=dlwords } Class
(with linear kernel; 0.5 uneven margins; cost = 0.7)
• SVM with a linear kernel does astonishingly well (using dlwords)
• SVM with a strict polynomial kernel does less well: (precision, recall, F1) = (.94, .94, .94)
• Experimented with:
  • Cost: allows softer margins, i.e., some misclassification, for better generalization
  • Uneven margins: depends on the positive-to-negative instance ratio; the smaller the number of positive examples, the further below 1 the margin parameter is set
SVMLibSvmJava Evaluation Results: { Token.string } Class
(with linear kernel; 0.5 uneven margins; cost = 0.7)
• SVM linear (using ALL tokens) does slightly better
• Counts as (correct, partialCorrect, spurious, missing):
  • Token.string: (80, 0, 3, 3), F1 = 0.96
  • Lookup.majorType: (79, 0, 4, 4), F1 = 0.95
• With BOTH attributes as Class features, Token.string AND Lookup.majorType performs the same as Token.string alone (since all strings already include the dlwords)
Other classifiers
• NaiveBayesWeka & C4.5 (only Weka default options are exposed in GATE ML)
• { Token.string } and { Lookup.majorType }, each separately: (precision, recall, F1) = (.73, .73, .73) [mis-classified 22/22 of the DLs!]
• Did not investigate why the mis-classification was so severe; perhaps due to token non-independence (for NaiveBayes) and enormous decision trees over tokens (for C4.5)
• SVM is well known as the best text classifier (Sebastiani, 2002; Hotho, Nürnberger & Paass, 2005), based on benchmarks such as the Reuters and 20 Newsgroups collections
Mis-classifications: similar with each attribute

Token.string: Precision (1.0), Recall (0.86)
False negatives (misclassified as nonDL):
• Digital Scriptorium: www.scriptorium.columbia.edu
• Holocaust Rescue & Relief (Andover-Harvard Theological): www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection: www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/
False positives (misclassified as DL): none

Lookup.majorType: Precision (0.95), Recall (0.86)
False negatives (misclassified as nonDL):
• Harvard Business Education for Women (1937-1970): http://www.library.hbs.edu/hc/daring/intro.html#nav-intro
• Holocaust Rescue & Relief (Andover-Harvard Theological): www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection: www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/
False positives (misclassified as DL):
• www.spi-poker.sourceforge.net (nothing in its text is a dlword, so this is puzzling)
Final Conclusions
• Still a small sample, but 96% accuracy (precision = 1.0; recall = 0.86)
• High precision matters more here: missing some DLs is better than having to weed out many false-positive nonDLs (for the future www.digitallibrarycentral.com website)
• The algorithm is unfortunately expected to generate false positives for websites about digital libraries (e.g., D-Lib Magazine is not a DL); perhaps use <meta content=""> to discriminate?
• Only GATE ML "Evaluation" mode is presented here; GATE also offers "Train" and "Application" modes
• Next step: train and apply on a more representative ratio of needle DLs in the haystack of websites (e.g., train: 100/500; apply: 10/1000)
• If it does well, the next step is to set up the algorithm within a web crawler
• Once the DLs are in hand, GATE could then classify them into an already-created taxonomy of DL types (technology, photography, geography, etc.)