Web Classification Of Digital Libraries Using GATE Machine Learning
Stephen J. Stose
IST 565 – Spring 2011
Goals and Objectives
• Learn and educate others about GATE (General Architecture for Text Engineering), www.gate.ac.uk, and its Machine Learning (ML) and Weka add-ons
• Illustrate GATE's natural language processing (NLP) and Information Extraction (IE) capabilities
• Apply natural language processing and ML to classify web HTML documents into two categories: Digital Libraries (DL) and non-Digital Libraries (non-DL)
• Explore preliminary results: discriminating DL from non-DL will populate an in-the-works digital library of all digital libraries, www.digitallibrarycentral.com
IST 565: Stose
www.gate.ac.uk
Set of Java tools developed at U. of Sheffield (UK)
An integrated development environment (IDE) with a graphical user interface (GUI)
Multi-lingual support (including RTL languages)
Handles multiple text input formats (XML, TXT, DOC, PDF, database, HTML, SGML); outputs annotated XML
Annotation editor, including OWL and RDF metadata
At its core is ANNIE (A Nearly-New Information Extraction System), for Information Extraction (IE)
General workflow
1. Add HTML documents to the Language Resources panel; combine them into a corpus for processing. Uploading automatically annotates document structure (e.g., HTML <h1>, <meta>, content="" annotations), then shown in the right sidebar.
2. Activate the IE Processing Resources (i.e., ANNIE).
3. Run the Processing Resources pipeline over the corpus to annotate the documents.
4. Understand/edit annotations via the { Type.feature=value } syntax to create further annotations relevant to the study.
5. Add other Processing Resource add-ons to the pipeline (ML Batch Learning, Weka).
6. Initial results: evaluation & summary.
7. Train/test the classifier on a more representative sample.
Trial Corpus: DL_eval_2
(sets chosen for initial testing/evaluation purposes)

Non-DL Set
12 distinct news websites selected:
• Reuters • Newsweek • National Review • LA Times • The Guardian • CS Monitor • CNN • Chicago Tribune • Boston Globe • Bloomberg • BBC • Wall Street Journal

DL Set
13 distinct digital libraries selected from: http://www.columbia.edu/cu/lweb/digital/collections/
GATE GUI
Left Sidebar
Language Resources:
• Load individual documents
• Combine documents into one or more corpora

Right Sidebar
Original markups:
• Markups from structured (XML, HTML) text input are automatically extracted
IE Processing Resources
Default information extraction pipeline (ANNIE): IE processes run over the corpus

• Tokenizer
  • Tokenizes text: Unicode breakup into Token and SpaceToken annotations
  • Based on orthography ("don't" = "do n't") and kind ("word" tokens)
• Gazetteer
  • Looks up entities in a dictionary, providing annotations such as Organization, Person, PersonTitle, Money, Token, Date, Sentence
• Sentence Splitter
• Part of Speech (POS) Tagger
  • e.g., NounPhrase, VerbPhrase
• NE (named-entity) transducer
  • "May Jones" vs. "May 2010" vs. "May I leave this presentation?"
  • General Motors vs. General Lee
• Orthomatcher (co-reference)
  • Mr. Johanovitz = John Johanovitz = he (same person)
  • "The class is great. It is very fun." Here "The class" = "It"
  • CEO = Chief Executive Officer (same entity)
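The tokenizer's orthography-aware splitting mentioned above ("don't" = "do n't") can be imitated with a toy sketch. This is a simplification for illustration only, not ANNIE's actual rule-based Unicode tokenizer:

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation tokens, separating "n't"
    contractions the way described above ("don't" -> "do" + "n't")."""
    # Insert a space before "n't" so it becomes its own token
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # "n't" first so it wins over plain word matching
    return re.findall(r"n't|\w+|[^\w\s]", text)

toy_tokenize("Don't leave!")  # ['Do', "n't", 'leave', '!']
```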
GATE GUI
Left Sidebar
ANNIE Processing Resources (PRs):
• Tokenizer
• Gazetteer
• Sentence Splitter
• Part-of-Speech (POS) Tagger
• Named-Entity (NE) Transducer
• Orthomatcher
GATE GUI
Center
• PR pipeline
• Add/remove various PRs to/from the pipeline (order is important!)
• Select the corpus to apply the pipeline to

Bottom
• PR parameter settings
• Run the application pipeline over the corpus
ANNIE Information Extraction Pipeline (Adapted from gate.ac.uk tutorials)
GATE GUI
Right Sidebar
• Resulting annotations (color-coded) from the ANNIE IE pipeline

Bottom & Popup
• Annotation Type syntax: {Type.feature = value}
  • Token.kind=word
  • Organization.orgType=government
  • Person.gender=male
  • Token.string="term"
• The popup editor allows simple annotation editing
• The syntax provides the basis for writing bulk annotations (JAPE)
Token.string=“term”
Example: Token.string="headquarters" (an n-gram unigram)

A bag-of-words approach to text mining requires demarcated tokens, each token representing one term t in the vector for document d, such that the weight w for term t in document d is computed as:

w(d, t) = tf(d, t) × idf(t)

where tf(d, t) is the term frequency of t in d and idf(t) is the inverse document frequency, a measure of term specificity across the document corpus.
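A minimal sketch of this weighting in Python, assuming the common logarithmic idf variant (the slide does not fix an exact formula):

```python
import math

def tfidf_weights(docs):
    """Compute w(d, t) = tf(d, t) * idf(t) for every term in every document.

    docs: list of token lists, one list per document.
    Uses logarithmic idf: idf(t) = log(N / df(t)), where df(t) is the
    number of documents containing term t.
    """
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: n * math.log(n_docs / df[t]) for t, n in tf.items()})
    return weights

docs = [["digital", "library", "search"], ["news", "politics", "search"]]
w = tfidf_weights(docs)
# "search" appears in both documents, so idf = log(2/2) = 0 and its weight is 0
```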
Attributes and Class

Attributes: tokens & others?
(token grid: illustration of a document as a bag of tokens)
1. { Token.string="term" }
2. Gazetteer
  • Create digital-library dictionary terms
  • { Lookup.majorType=dlwords }
3. <meta> (e.g., Dublin Core):
  • <meta content="U. Digital Library" />
  • { meta.content }
  • Not used in the current talk

Class: DL vs. non-DL
• Demarcate the entire text in each document
• Annotate with DL or with non-DL
Annotate the entire document (Mention); Class is either DL or non-DL
{ Mention.type=nondl }
Newsweek (non-DL)
{ Mention.type=dl }
JohnJayPapers (DL)
JAPE speeds up document annotation (Java Annotation Patterns Engine)

Step 1:
Create a "Key" annotation set.
Provide only a TYPE document annotation (DL or nonDL), with EMPTY { } features, by highlighting and annotating over the entire document.
Step 2:
Write a JAPE script and run it with a JAPE transducer over the corpus pipeline.
This transduces the DL and nonDL annotation types into ONE annotation type, {Mention.type}, with two features:
{Mention.type=dl}
{Mention.type=nondl}
The JAPE script yields the same two features as manual annotation, just with less manual annotation time.
Phase: firstpass
Input: DL nonDL
Options: control = brill

Rule: DL
({DL}):dl
-->
:dl.Mention = {type="dl"}

Rule: nonDL
({nonDL}):nondl
-->
:nondl.Mention = {type="nondl"}
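In effect, the two rules perform a simple mapping from the hand-made annotation types onto one Mention annotation carrying a type feature. A Python paraphrase of that mapping (illustration only, not how the JAPE transducer actually executes):

```python
def transduce(annotation_types):
    """Map DL/nonDL annotation types onto Mention annotations with a
    lower-cased type feature, as the two JAPE rules above do."""
    return [("Mention", {"type": t.lower()}) for t in annotation_types]

transduce(["DL", "nonDL"])
# [('Mention', {'type': 'dl'}), ('Mention', {'type': 'nondl'})]
```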
Attributes and Class: DL or nonDL?
(2 attributes, each applied separately)

1. { Token.string=term } Class
• Uses n-gram unigrams (single tokens) of ALL Tokens within the document

2. { Lookup.majorType=dlwords } Class
• Lookup.majorType = dictionary entities within the Gazetteer
• Created a new majorType of words for dictionary lookup, called "dlwords"
• Uses n-gram unigrams of all Lookup.majorType=dlwords in the document
Updated "dlwords" Gazetteer
For the { Lookup.majorType=dlwords } attribute:
• Developed a set of words most likely to occur on DL websites
• The gazetteer is case-sensitive, so terms were entered in all capitalization and stem variations; for instance:
  • Digital Library
  • Digital library
  • digital library
  • Digital Libraries
  • Digital libraries
  • digital libraries
• Words/phrases selected:
  Advanced Search, Archive, Archives, Browse, Catalog, Collection, Collections, Digital, Digital Archive, Digital Archives, Digital Collection, Digital Collections, Digital Content, Digital Library, Digital Libraries, Digitization, Digitisation, Image, Images, Image Collection, Image Collections, Keyword, Keywords, Library, Libraries, Manuscript, Manuscripts, Repository, Repositories, Search, Search Tips, Special Collection, Special Collections, University, Universities, University Library, University Libraries
Any suggestions?
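Because the gazetteer lookup is case-sensitive, every base phrase needs its capitalization variants as separate entries. A small sketch that expands a base phrase list into such variants (the exact variant set chosen here is an assumption; GATE's gazetteers are plain .lst files, one entry per line):

```python
def gazetteer_variants(phrase):
    """Expand a base phrase into the capitalization variants a
    case-sensitive gazetteer needs as separate entries."""
    return sorted({
        phrase.lower(),       # "digital library"
        phrase.title(),       # "Digital Library"
        phrase.capitalize(),  # "Digital library"
    })

entries = []
for base in ["digital library", "digital libraries", "special collection"]:
    entries.extend(gazetteer_variants(base))
# each entry would become one line of the dlwords gazetteer .lst file
```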
GATE GUI
Center
Add to the pipeline:
Batch Learning PR, GATE's own Machine Learning (ML) PR, for:
• Text Classification
• Chunk Recognition
• Relation Extraction
Requires an XML configuration file with the ML parameter settings
Modes:
• Evaluation
• Training
• Application
GATE ML specification
ML XML Configuration Parameters
• For a list of all parameter settings, see: http://gate.ac.uk/sale/tao/splitch17.html#x22-43500017
• Engines:
  • SVMLibSvmJava
  • SVMExec
  • Paum (Perceptron)
  • PaumExec
  • NaiveBayesWeka
  • KNNWeka
  • C4.5Weka
• Evaluation:
  • k-fold
  • Holdout
• Filtering (for SVM):
  • Balances the negative-to-positive instance ratio
  • Can remove negative instances near the hyperplane
• Many, many complex parameter settings!

Attributes: n-gram unigram on Token.string
Class (marked <CLASS/> in the configuration): Mention.type=dl or Mention.type=nondl
Evaluation: { Token.string=term } Class
Comparing post-evaluation annotations: Mention (expected) vs. MentionTest (observed)

Bloomberg News site mis-classification:
• Expected: Mention.type=nondl
• Observed: Mention.type=dl
Evaluation: { Lookup.majorType=dlwords } Class
• Uses the digital-library terms (dlwords) added to the Gazetteer
• Gazetteer vocabulary = Lookup.majorType
Evaluation: { Lookup.majorType=dlwords } Class
• Changed the configuration file
• Before: bag-of-words over the entire document (all Token.string, i.e., every term in the document)
• Now: bag-of-words constrained by Lookup.majorType (the vocabulary in the Gazetteer, which now includes "dlwords")
• Results (SVM, 0.66 holdout): 100% accuracy; observed = expected
• The n-gram entry in the configuration file:

<NGRAM>
  <NAME>ngram</NAME>
  <NUMBER>1</NUMBER>
  <CONSNUM>1</CONSNUM>
  <CONS-1>
    <TYPE>Lookup</TYPE>
    <FEATURE>majorType</FEATURE>
  </CONS-1>
</NGRAM>
Summary
• Learn and illustrate NLP and ML in the GATE GUI
• Trained a classifier using two kinds of attributes:
  1. Bag-of-words on ALL words occurring in the document: { Token.string }
  2. Bag-of-words on Gazetteer words only (adding "dlwords" to the set): { Lookup.majorType }
• To learn/illustrate GATE ML, a small sample was used (13 DL, 12 nonDL)
• 89% recall/precision: ALL words, { Token.string="…" }
• 100% recall/precision: Gazetteer "dlwords", { Lookup.majorType }
• How will the classifier discriminate DLs from a larger sample that includes ALL kinds of web documents?
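Outside GATE, the intuition behind the dlwords attribute can be conveyed with a crude count-based sketch. This is an illustration only, not the SVM that GATE actually trains; the word subset and threshold are arbitrary assumptions:

```python
# An assumed, lower-cased subset of the dlwords gazetteer
DLWORDS = {"digital library", "digital collection", "archive",
           "manuscript", "special collection", "repository", "digitization"}

def classify(text, threshold=2):
    """Label a page 'dl' if it mentions at least `threshold` distinct
    dlwords, else 'nondl'."""
    lowered = text.lower()
    hits = {w for w in DLWORDS if w in lowered}
    return "dl" if len(hits) >= threshold else "nondl"

classify("Browse the digital library and special collections archive")  # 'dl'
classify("Latest politics and business news headlines")                 # 'nondl'
```

An SVM over unigram features learns a weighted version of this idea rather than a hard count threshold.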
Test Corpus: nonDLvsDL_Eval
(sets annotated with DL/nonDL, pre-processed with ANNIE, and prepared for ML evaluation)

Non-DL Set (181)
• Random website generator: www.whatsmyip.org/random_websites/
• Generated 181 websites that:
  • were not digital libraries
  • were English-language only
  • had a minimum of text
• Still unrealistic, as the DL to non-DL ratio on the WWW is probably 1/1000 at best

DL Set (62)
Chose 62 digital libraries, mostly selected across 3 main DL university portals:
• Harvard University Digital Collections: http://digitalcollections.harvard.edu/
• Columbia University Digital Collections: http://www.columbia.edu/cu/lweb/digital/collections/
• Cornell University Libraries "Windows on the Past": http://cdl.library.cornell.edu/
• A few other select DLs
SVMLibSvmJava Evaluation Results: { Lookup.majorType=dlwords } Class
(with linear kernel; 0.5 uneven margins; cost = 0.7)
• SVM with a linear kernel does astonishingly well (using dlwords)
• SVM with a strict polynomial kernel does less well: (precision, recall, F1) = (.94, .94, .94)
• Experimented with:
  • Cost: allows softer margins, i.e., some misclassification, for better generalization
  • Uneven margins: depends on the positive-to-negative instance ratio; the smaller the number of positive examples, the further below 1 the margin parameter is set
SVMLibSvmJava Evaluation Results: { Token.string } Class
(with linear kernel; 0.5 uneven margins; cost = 0.7)
• SVM linear (using ALL tokens) does slightly better
• Counts as (correct, partialCorrect, spurious, missing):
  • Token.string: (80, 0, 3, 3), F1 = 0.96
  • Lookup.majorType: (79, 0, 4, 4), F1 = 0.95
• With BOTH attributes as Class features, Token.string AND Lookup.majorType performs the same as Token.string alone (since all strings already include the dlwords)
Other classifiers
• NaiveBayesWeka & C4.5 (only Weka default options are exposed in GATE ML)
• { Token.string } and { Lookup.majorType }, each separately: (precision, recall, F1) = (.73, .73, .73) [mis-classified 22/22 of the DLs!]
• Did not investigate why the mis-classification was so severe; perhaps due to token non-independence (for NaiveBayes) and enormous decision trees over tokens (for C4.5)
• SVM is well known as the best text classifier (Sebastiani, 2002; Hotho, Nürnberger & Paass, 2005), based on benchmarks such as the Reuters and 20 Newsgroups collections
Mis-classifications: similar with each attribute

Token.string: Precision (1.0), Recall (0.86)
False negatives (misclassified as nonDL):
• Digital Scriptorium: www.scriptorium.columbia.edu
• Holocaust Rescue & Relief (Andover-Harvard Theological): www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection: www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/
False positives (misclassified as DL): none

Lookup.majorType: Precision (0.95), Recall (0.86)
False negatives (misclassified as nonDL):
• Harvard Business Education for Women (1937-1970): http://www.library.hbs.edu/hc/daring/intro.html#nav-intro
• Holocaust Rescue & Relief (Andover-Harvard Theological): www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection: www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/
False positives (misclassified as DL):
• www.spi-poker.sourceforge.net (nothing in its text is a dlword, so this is puzzling)
Final Conclusions
• Still a small sample, but 96% accuracy (precision = 1.0; recall = 0.86)
• High precision matters more here: missing some DLs is better than having to weed out many false-positive nonDLs (for the future www.digitallibrarycentral.com website)
• The algorithm is unfortunately expected to generate false positives for websites about digital libraries (e.g., D-Lib Magazine is not a DL); perhaps use <meta content=""> to discriminate?
• Only GATE ML "Evaluation" mode is presented here; GATE also offers "Train" and "Application" modes
• Next step: train and apply on a more representative ratio of needle DLs in the haystack of websites (e.g., train: 100/500; apply: 10/1000)
• If it does well, the next step is to set up the algorithm within a web crawler
• Once the DLs are in hand, GATE could then classify them into an already-created taxonomy of DL types (technology, photography, geography, etc.)