![Page 1: CHASE: Going digital · Bridgeman Digital Art Library Bridgeman Categories Sample Classi cation Data Text mining in digital collections CHASE: Going digital Deirdre Lungley dmlung@essex.ac.uk](https://reader034.vdocument.in/reader034/viewer/2022042222/5ec83d9b22c03d49aa5e9f76/html5/thumbnails/1.jpg)
Introduction · Text Mining · Classification
Bridgeman Digital Art Library · Bridgeman Categories · Sample Classification Data
Text mining in digital collections
CHASE: Going digital
Deirdre Lungley
dmlung@essex.ac.uk
February 6, 2013
Bridgeman Categories
2 Oriental Miniatures
7 Maps
9 Posters
12 Arms, Armour & Militaria
15 Botanical
18 Clocks, Watches, Barometers & Sundials
20 Costume & Fashion
21 Enamels
22 Ephemera
24 Furniture
25 Glass
27 Icons
29 Inventions
30 Jewellery (see also Semi-precious stones)
31 Juvenilia / Children's Toys & Games
33 Lighting
35 Medicine
38 Mythology Mythological Myth
40 Animals
41 Mosaics
44 Semi-precious Stones (see also Jewellery)
46 Science
47 Sculpture
51 Sports and Leisure
56 Trade Emblems, City Crests, Coats of Arms
1126 CHOIR BOOKS
5000 The Arts and Entertainment
5001 Ancient and World Cultures
5002 Architecture
5003 Business and Industry
5004 Places
5005 Science and Medicine
5006 History
5007 Religion and Belief
5010 Travel and Transport
5011 Plants and Animals
5013 Emotions and Ideas
Sample Classification Data
Query/Clicked URL Gold Standard Annotations Classifier Predictions
monster woman 5007 : Religion and Belief 5007 : Religion and Belief
Dulle Griet raiding Hell 5 : Allegory / Allegorical
38 : Mythology Mythological Myth
nuno 5007 : Religion and Belief 5007 : Religion and Belief
The Fishermen from the Polyptych of St. Vincent 42 : Personalities 5012 : Land and Sea
42 : Personalities
girl poor 5009 : People and Society 5009 : People and Society
A Peasant Girl Gathering Faggots in a Wood 5012 : Land and Sea
Tools of the trade
Python:
- High-level language
- Many standard libraries, e.g., XML parser
Natural Language Toolkit (NLTK):
- A platform for building Python programs to work with human language data (nltk.org)
Why?
- Glue between applications
- Data preparation for tools such as Weka
- Allows programmatic access to web services
Example Web Service – WikipediaMiner
Sample Python XML parsing – Wikify RSS title
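The code shown on this slide is not preserved in the transcript. As a rough sketch of the idea only, the following parses an RSS 2.0 document with Python's standard XML library and passes each item title to an annotator. The `annotate` argument is a hypothetical stand-in for a call to a wikification service such as WikipediaMiner, whose request format is not shown here; the sample feed and function names are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Small inline RSS 2.0 sample so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Mona Lisa goes on tour</title></item>
    <item><title>New gallery opens in Colchester</title></item>
  </channel>
</rss>"""


def item_titles(rss_text):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="") for item in root.iter("item")]


def wikify_titles(rss_text, annotate):
    """Apply `annotate` (a hypothetical wikification call) to each item title."""
    return [(title, annotate(title)) for title in item_titles(rss_text)]


def capitalised_words(text):
    """Placeholder annotator: treats capitalised words as candidate topics."""
    return [word for word in text.split() if word[0].isupper()]


for title, topics in wikify_titles(SAMPLE_RSS, capitalised_words):
    print(title, "->", topics)
```

In a real run, `capitalised_words` would be replaced by a function that sends the title to the web service and returns the detected Wikipedia topics.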
Sample Python XML parsing – Wikify RSS title (Output)
Supervised Learning - Basics
Classifier (Model) built from:
- Positive/Negative examples (labelled data)
- Features - present/absent for a given label
Test data built using:
- Present/absent classifier features
Case Study - Support Vector Machine (SVM) Classifier:
- Locates the training points closest to the separating hyperplane (on the margin) - the support vectors
- Used extensively in research
- Here – treat as black box – default settings
SVMLight data format:
`<target> <feature>:<value> ... <feature>:<value>`
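To make the data format concrete, here is a minimal sketch that writes one example in this layout using binary (present/absent) features. The function name and the choice of value `1` for every present feature are assumptions for illustration; absent features are simply left out of the line.

```python
def to_svmlight_line(target, feature_ids):
    """Format one example as '<target> <feature>:<value> ...' with binary features.

    target: +1 or -1; feature_ids: positive integer feature numbers that are
    present in the example (absent features are omitted).
    """
    ids = sorted(set(feature_ids))  # SVMlight expects increasing feature numbers
    pairs = " ".join(f"{fid}:1" for fid in ids)
    return f"{target:+d} {pairs}".rstrip()


print(to_svmlight_line(1, [7, 3, 12]))   # +1 3:1 7:1 12:1
print(to_svmlight_line(-1, [5, 2]))      # -1 2:1 5:1
```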
Figure (classification pipeline):
Training Examples → Feature Extractor → Pos/Neg labelled feature sets → Learning tool → Classifier model
Test Examples → Feature Extractor → Test feature sets → Classifier model → Predictions
Case study: Training Data = Project Gutenberg Catalogue; Test Data = BBC RSS Feed; Learning tool = SVM_Learn; Predictions produced by SVM_Classify
Training Data – Project Gutenberg
Case Study Task: Classify BBC RSS feeds
Retrieve & parse BBC RSS feed
Create Classification Features
- Casefolding
- Tokenisation
- Stemming
- Stopwords
Classify (test data → predictions)
- Output to file on disk
- Call command
- Read file
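The three classification steps (write a file, call a command, read a file) can be sketched as below. This assumes the SVMlight `svm_classify` binary is installed and on the PATH, and that a model file has already been produced by `svm_learn`. The function names and file paths are hypothetical.

```python
import subprocess
from pathlib import Path


def write_examples(path, lines):
    """Write SVMlight-format example lines to a file on disk."""
    Path(path).write_text("\n".join(lines) + "\n")


def run_svm_classify(test_file, model_file, output_file, binary="svm_classify"):
    """Call the external classifier; assumes the SVMlight binary is on PATH."""
    subprocess.run(
        [binary, str(test_file), str(model_file), str(output_file)],
        check=True,  # raise if the command exits with an error
    )


def read_predictions(path):
    """Read one decision value per line; the sign gives the predicted class."""
    values = [float(v) for v in Path(path).read_text().split()]
    return [(1 if v > 0 else -1, v) for v in values]
```

A typical run would call `write_examples("test.dat", lines)`, then `run_svm_classify("test.dat", "model", "predictions")`, then `read_predictions("predictions")`. Treating the classifier as an external command matches the "black box, default settings" approach taken in the case study.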
Retrieve & parse RSS feed
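The slide's code is not preserved here. A minimal sketch of retrieval and parsing with the standard library might look as follows. The feed address is an assumed example rather than the one used in the talk, and the choice to keep only each item's title and description is also an assumption.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed feed address for illustration; substitute the feed being classified.
FEED_URL = "http://feeds.bbci.co.uk/news/rss.xml"


def fetch_feed(url=FEED_URL, timeout=10):
    """Download the raw feed bytes (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_items(feed_bytes):
    """Return a list of {'title', 'description'} dicts, one per <item>."""
    root = ET.fromstring(feed_bytes)
    return [
        {
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        }
        for item in root.iter("item")
    ]
```

Calling `parse_items(fetch_feed())` would return one dictionary per news item. Keeping download and parsing in separate functions lets the parser be tried on a saved copy of the feed without network access.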
Retrieve & parse RSS feed (Output)
Text to Features
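The slide's code is not preserved here. The sketch below walks through casefolding, tokenisation, stopword removal and stemming using only the standard library. The tiny stopword list and the crude suffix stripper are deliberate stand-ins, not the talk's method: a real run would more plausibly use NLTK's stopword corpus and a proper stemmer such as Porter's. All function names are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real run would use a fuller list.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def naive_stem(token):
    """Crude suffix stripper standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_to_terms(text):
    """Casefold, tokenise, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def terms_to_feature_ids(terms, vocabulary):
    """Map terms to integer feature ids, adding unseen terms to `vocabulary`."""
    ids = set()
    for term in terms:
        if term not in vocabulary:
            vocabulary[term] = len(vocabulary) + 1  # ids start at 1
        ids.add(vocabulary[term])
    return sorted(ids)


vocab = {}
terms = text_to_terms("The painter is painting the Gardens of Essex")
print(terms)                               # ['painter', 'paint', 'garden', 'essex']
print(terms_to_feature_ids(terms, vocab))  # [1, 2, 3, 4]
```

When building test data, terms missing from the training vocabulary would normally be skipped rather than added, since the classifier has no weight for them.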
Text to Features (Output)
Classify: Test data → predictions (Output)
Training Data – Project Gutenberg
Create training data (Output)
References:
The Regex Coach
Thank You!