Metadata as Infrastructure for Information Retrieval and Text Mining

Page 1: Metadata as Infrastructure for Information Retrieval and Text Mining

March 2006 NaCTeM – Ray R. Larson

Prof. Ray R. Larson

University of California, Berkeley
School of Information

Metadata as Infrastructure for Information Retrieval and Text Mining

Page 2: Metadata as Infrastructure for Information Retrieval and Text Mining

Overview

Metadata as Infrastructure
– What, Where, When and Who?

What are Entry Vocabulary Indexes?
– Notion of an EVI
– How are EVIs Built

Time Period Directories
– Mining Metadata for new metadata

Page 3: Metadata as Infrastructure for Information Retrieval and Text Mining

Metadata as Infrastructure

The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

Page 4: Metadata as Infrastructure for Information Retrieval and Text Mining

Metadata as Infrastructure

The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.

The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.

Page 5: Metadata as Infrastructure for Information Retrieval and Text Mining

What?

Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.

Two kinds of mapping in every search:

• Documents are assigned to topic categories, e.g. Dewey

• Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.

Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

Page 6: Metadata as Infrastructure for Information Retrieval and Text Mining

‘What’ searches involve mapping to controlled vocabularies

[Diagram: Texts; Thesaurus/Ontology]

Page 7: Metadata as Infrastructure for Information Retrieval and Text Mining

Start with a collection of documents.

Page 8: Metadata as Infrastructure for Information Retrieval and Text Mining

Classify and index with controlled vocabulary, or use a pre-indexed collection.

[Diagram: Index]

Page 9: Metadata as Infrastructure for Information Retrieval and Text Mining

Problem: Controlled vocabularies can be difficult for people to use.

Index term: “pass mtr veh spark ign eng”

In Library of Congress subj: Use “Economic Policy” for “Wirtschaftspolitik”

[Diagram: Index]

Page 10: Metadata as Infrastructure for Information Retrieval and Text Mining

Solution: Entry Level Vocabulary Indexes.

EVI: “pass mtr veh spark ign eng” = “Automobile”

[Diagram: EVI; Index]

Page 11: Metadata as Infrastructure for Information Retrieval and Text Mining

“What” and Entry Vocabulary Indexes

EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…

Page 12: Metadata as Infrastructure for Information Retrieval and Text Mining

Building and Searching EVIs

Searching
• User has a question but is unfamiliar with the domain he wants to search.
• User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.
• Has an Entry Vocabulary Module been built?
– YES: Use an existing EVI.
– NO: Build an Entry Vocabulary Module (EVI).
• Map user’s query to ranked list of controlled vocabulary terms.
• User selects search terms from the ranked list of terms returned by the EVI.

Building an Entry Vocabulary Module (EVI)
• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.

Page 13: Metadata as Infrastructure for Information Retrieval and Text Mining

Technical Details

Building an Entry Vocabulary Module (EVI)
• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.
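The building steps above can be illustrated with a minimal sketch. It assumes a tiny made-up training set, and it swaps part-of-speech tagging and noun-phrase extraction for a plain word tokenizer, so it is not the pipeline used in the project. The names `TRAINING`, `extract_terms` and `contingency` are hypothetical. For one term and one controlled-vocabulary class it counts how many training records fall into each of four cells: term and class, term without class, class without term, neither.

```python
import re

# Hypothetical training records: (title text, set of assigned controlled-vocabulary classes).
TRAINING = [
    ("Automobile engine emissions", {"pass mtr veh spark ign eng"}),
    ("Spark ignition in automobile engines", {"pass mtr veh spark ign eng"}),
    ("Economic policy in Germany", {"Economic Policy"}),
    ("Automobile industry and economic policy", {"Economic Policy"}),
]


def extract_terms(text):
    """Return the set of lowercased words; a stand-in for tagged words and noun phrases."""
    return set(re.findall(r"[a-z]+", text.lower()))


def contingency(records, term, cls):
    """Count records as (a, b, c, d): term&class, term only, class only, neither."""
    a = b = c = d = 0
    for text, classes in records:
        has_term = term in extract_terms(text)
        has_class = cls in classes
        if has_term and has_class:
            a += 1
        elif has_term:
            b += 1
        elif has_class:
            c += 1
        else:
            d += 1
    return a, b, c, d
```

Calling `contingency(TRAINING, "automobile", "pass mtr veh spark ign eng")` gives the four counts that an association measure can then score.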

Page 14: Metadata as Infrastructure for Information Retrieval and Text Mining

Association Measure

        C     ¬C
t       a     b
¬t      c     d

Where t is the occurrence of a term and C is the occurrence of a class in the training set

Page 15: Metadata as Infrastructure for Information Retrieval and Text Mining

Association Measure

Maximum Likelihood ratio

\[ W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)] \]

where

\[ \log L(p, k, n) = k \log(p) + (n - k) \log(1 - p) \]

and

\[ p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d} \]

Viz. Dunning
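A minimal sketch of the measure, for readers who want to try it on their own counts. It reads the second argument of logL as the number of successes k and the third as the number of trials n, which is an assumption about the intended argument order. It also assumes that both rows of the table are non-empty (a+b > 0 and c+d > 0) and treats 0·log(0) as 0. The function names are illustrative.

```python
import math


def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), with 0*log(0) taken as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def association(a, b, c, d):
    """Likelihood-ratio weight W(C, t) from the 2x2 counts a, b, c, d."""
    p1 = a / (a + b)                # share of records with t that are in C
    p2 = c / (c + d)                # share of records without t that are in C
    p = (a + c) / (a + b + c + d)   # share of all records that are in C
    return 2.0 * (
        log_l(p1, a, a + b) + log_l(p2, c, c + d)
        - log_l(p, a, a + b) - log_l(p, c, c + d)
    )
```

When the term carries no information about the class (for example a = b = c = d) the weight is zero, and it grows as the term and the class co-occur more exclusively.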

Page 16: Metadata as Infrastructure for Information Retrieval and Text Mining

Alternatively

Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion
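As one possible reading of this alternative, the sketch below treats the evidence terms collected for each class as a small document and ranks classes for a query with a simple TF-IDF-style score. The scoring formula, the `EVIDENCE` data and the function name `rank_classes` are assumptions chosen for illustration; the slide does not say which IR technique was used.

```python
import math
from collections import Counter

# Hypothetical evidence "documents": terms seen in records assigned to each class.
EVIDENCE = {
    "pass mtr veh spark ign eng": "automobile car engine automobile spark ignition".split(),
    "Economic Policy": "economic policy government automobile industry".split(),
    "Sieges": "siege war city fortress".split(),
}


def rank_classes(query, evidence=EVIDENCE):
    """Rank classes by the summed tf * idf of the query words found in their evidence."""
    n = len(evidence)
    # Document frequency: in how many class "documents" each term occurs.
    df = Counter(term for terms in evidence.values() for term in set(terms))
    scores = {}
    for cls, terms in evidence.items():
        tf = Counter(terms)
        score = sum(
            tf[q] * math.log(1 + n / df[q])
            for q in query.lower().split()
            if q in tf
        )
        if score > 0:
            scores[cls] = score
    return sorted(scores, key=scores.get, reverse=True)
```

The top-ranked classes could then be offered to the user as candidate controlled-vocabulary terms or added to the query.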

Page 17: Metadata as Infrastructure for Information Retrieval and Text Mining

Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Statistical association: W(c, t) = 2[logL(p1, a, a+b) + ...]

Digital library resources

Page 18: Metadata as Infrastructure for Information Retrieval and Text Mining

EVI example

User Query: “Automobile”

EVI 1: Index term: “pass mtr veh spark ign eng”

EVI 2: Index term: “automobiles” OR “internal combustion engines”

Page 19: Metadata as Infrastructure for Information Retrieval and Text Mining

But why stop there?

[Diagram: Index; EVI]

Page 20: Metadata as Infrastructure for Information Retrieval and Text Mining

“Which EVI do I use?”

[Diagram: several indexes, each with its own EVI]

Page 21: Metadata as Infrastructure for Information Retrieval and Text Mining

EVI to EVIs

[Diagram: several indexes, each with its own EVI, plus an EVI2]

Page 22: Metadata as Infrastructure for Information Retrieval and Text Mining

Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Why not treat language the same way?

Page 23: Metadata as Infrastructure for Information Retrieval and Text Mining

It is also difficult to move between different media forms

[Diagram: Texts; Numeric datasets; Thesaurus/Ontology; EVI]

Page 24: Metadata as Infrastructure for Information Retrieval and Text Mining

Searching across data types

Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results

Page 25: Metadata as Infrastructure for Information Retrieval and Text Mining

But texts associated with numeric data can be mapped as well…

[Diagram: Texts; Numeric datasets; captions; Thesaurus/Ontology; EVI]

Page 26: Metadata as Infrastructure for Information Retrieval and Text Mining

EVI to Numeric Data example

[Diagram: numbered steps 1 to 11 linking search interface 1, EVI, LCSH, MARC, online catalog, search results, new query, search interface 2, numeric database, numeric table, captions]

Page 27: Metadata as Infrastructure for Information Retrieval and Text Mining

But there are also geographic dependencies…

[Diagram: Texts; Numeric datasets; captions; Thesaurus/Ontology; EVI; Maps/Geo Data]

Page 28: Metadata as Infrastructure for Information Retrieval and Text Mining

WHERE: Place names are problematic…
• Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .
• Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.
• Name changes: Bombay → Mumbai.
• Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields.
• Anachronisms: No Germany before 1870
• Vague, e.g. Midwest, Silicon Valley
• Unstable boundaries: 19th century Poland; Balkans; USSR

Use a gazetteer!
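A minimal sketch of the idea, assuming a toy in-memory gazetteer: variant and alternative names resolve to the same entry, and a homograph returns several candidates for the user to choose from. The data structure, the function name `lookup` and the coordinates (rounded, approximate values) are illustrative assumptions and do not describe any particular gazetteer service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str       # preferred name
    country: str
    lat: float      # approximate latitude, for illustration only
    lon: float      # approximate longitude, for illustration only


PLACES = [
    Place("Saint Petersburg", "Russia", 59.9, 30.3),
    Place("Cluj", "Romania", 46.8, 23.6),
    Place("Vienna", "Austria", 48.2, 16.4),
    Place("Vienna", "USA (Virginia)", 38.9, -77.3),
]

# Lowercased variant name -> indexes into PLACES.
VARIANTS = {
    "st. petersburg": [0],
    "санкт петербург": [0],
    "saint-pétersbourg": [0],
    "cluj": [1],
    "klausenburg": [1],
    "kolozsvar": [1],
    "vienna": [2, 3],   # homograph: two candidate places
}


def lookup(name):
    """Return every gazetteer entry that the given name may refer to."""
    return [PLACES[i] for i in VARIANTS.get(name.strip().lower(), [])]
```

For example, `lookup("Klausenburg")` resolves to the Cluj entry, while `lookup("Vienna")` returns two candidates that need disambiguation.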

Page 29: Metadata as Infrastructure for Information Retrieval and Text Mining

WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map.

Timebar

Page 30: Metadata as Infrastructure for Information Retrieval and Text Mining

Zoom on map. Click on place for a list of records. Click on record to display text.

Page 31: Metadata as Infrastructure for Information Retrieval and Text Mining

Catalogs and gazetteers should talk to each other!

Geographic sort / display of catalog search result.

Catalog search

Gazetteer search

Page 32: Metadata as Infrastructure for Information Retrieval and Text Mining

So geographic search becomes part of the infrastructure

[Diagram: Texts; Numeric datasets; captions; Thesaurus/Ontology; EVI; Maps/Geo Data; Gazetteers]

Page 33: Metadata as Infrastructure for Information Retrieval and Text Mining

WHEN: Search by time is also weakly supported…
• Calendars are the standard for time
• But people use the names of events to refer to time periods
• Named time periods resemble place names in being:
– Unstable: European War, Great War, First World War
– Multiple: Second World War, Great Patriotic War
– Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.
• Places have temporal aspects & periods have geographical aspects: When the Stone Age was, varies by region

Page 34: Metadata as Infrastructure for Information Retrieval and Text Mining

Similarity between place names and period names

Suggests a similar solution: A gazetteer-like Time Period Directory.

Gazetteer:
– Place name – Type – Spatial markers (Lat & long) – When

Time Period Directory:
– Period name – Type – Time markers (Calendar) – Where

Note the symmetry in the connections between Where and When.

Page 35: Metadata as Infrastructure for Information Retrieval and Text Mining

Solution - Time Period Directories

Initial development involved mining the Library of Congress Subject Authority file for named time periods…

Page 36: Metadata as Infrastructure for Information Retrieval and Text Mining

LC MARC Authorities Records

<USMARC>
<Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
<Fld670><a>Work cat.: 45053442: Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
<Fld670><a>Cath. encyc.</a><b>(Magdeburg: besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg: ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>
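A rough sketch of how a record like the one above might be mined for a named time period. It uses the element names shown in the record (`Fld001`, `Fld151` and its `a` and `y` subfields). The year-range regular expression and the function name `mine_period` are assumptions made for illustration, and real chronological subdivisions come in more shapes than this pattern handles.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the record above (the Fld670 source notes are left out).
RECORD = """<USMARC><Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550></USMARC>"""


def mine_period(xml_text):
    """Extract place, period name and begin/end years from a geographic heading."""
    root = ET.fromstring(xml_text)
    heading = root.find("Fld151")
    if heading is None:
        return None
    chrono = heading.findtext("y")      # chronological subdivision, if present
    if chrono is None:
        return None
    match = re.search(r"(\d{3,4})(?:-(\d{3,4}))?", chrono)
    if match is None:
        return None
    begin = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else begin
    # Text before the years is taken as the period name.
    name = chrono[:match.start()].rstrip(" ,") or chrono
    return {
        "id": root.findtext("Fld001", default="").strip(),
        "place": heading.findtext("a"),
        "period_name": name,
        "begin": begin,
        "end": end,
    }
```

Applied to `RECORD`, this yields the place "Magdeburg (Germany)", the period name "Siege" and the years 1550 to 1551, which is the kind of raw material a Time Period Directory entry needs.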

Page 37: Metadata as Infrastructure for Information Retrieval and Text Mining

timePeriodEntry: Time Period Directory instance. Contains the components described below.

- periodID: Unique identifier.
- periodName: Period name; can be repeated for alternative names. Information about language, script, transliteration scheme. Source information and notes (where the period name was mentioned).
- descriptiveNotes: Description of time period.
- dates: Calendar and date format. Begin & end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing). Notes, sources.
- periodClassification: Period type, e.g. Period of Conflict, Art movement. Can plug in different classification schemes. Can be repeated for several classifications.
- location: Places associated with the time period. Contains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinates. Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names). Recently added coordinates for direct use.
- relatedPeriod: Related time periods. periodID of related periods. Information about relationship type (part-of, successor etc.). Can plug in different relationship type schemes.
- entryMetadata: Notes about creator / creation of instance. Entry date. Modification date.
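The components listed above could be modelled roughly as follows. This is a simplified sketch and not the project's schema: it keeps a single begin and end year instead of the exact / earliest / latest variants, omits entryMetadata, and uses Python field names and a made-up identifier chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Location:
    place_name: str
    gazetteer_id: Optional[str] = None   # hypothetical key into an external gazetteer
    lat: Optional[float] = None
    lon: Optional[float] = None


@dataclass
class RelatedPeriod:
    period_id: str
    relationship: str                    # e.g. "part-of", "successor"


@dataclass
class TimePeriodEntry:
    period_id: str
    period_names: List[str]              # repeated for alternative names
    begin_year: int
    end_year: int
    descriptive_notes: str = ""
    classifications: List[str] = field(default_factory=list)
    locations: List[Location] = field(default_factory=list)
    related_periods: List[RelatedPeriod] = field(default_factory=list)

    def covers(self, year: int) -> bool:
        """True if the given year falls inside the period."""
        return self.begin_year <= year <= self.end_year


siege = TimePeriodEntry(
    period_id="example-0001",            # made-up identifier
    period_names=["Siege, 1550-1551"],
    begin_year=1550,
    end_year=1551,
    classifications=["Period of Conflict"],
    locations=[Location("Magdeburg (Germany)")],
)
```

Keeping `locations` as a list of place references mirrors the symmetry with a gazetteer: a period points to its places just as a place entry can point to its times.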

Page 38: Metadata as Infrastructure for Information Retrieval and Text Mining

Page 39: Metadata as Infrastructure for Information Retrieval and Text Mining

Time periods by named location

Page 40: Metadata as Infrastructure for Information Retrieval and Text Mining

Catalog Search Result

Page 41: Metadata as Infrastructure for Information Retrieval and Text Mining

Web Interface - Access by map

Page 42: Metadata as Infrastructure for Information Retrieval and Text Mining

Zoomable interface gives access to geographically focused info…

Page 43: Metadata as Infrastructure for Information Retrieval and Text Mining

Web Interface - Access by timeline

Link initiates search of the Library of Congress catalog for all records relating to this time period.

Page 44: Metadata as Infrastructure for Information Retrieval and Text Mining

WHEN and WHAT

These named time periods are derived from Library of Congress catalog subject headings and so can be used for catalog searching, which finds books on topics important for that time period.

Page 45: Metadata as Infrastructure for Information Retrieval and Text Mining

Time period directories link via the place (or time)

[Diagram: Texts; Numeric datasets; captions; Thesaurus/Ontology; EVI; Maps/Geo Data; Gazetteers; Time Period Directory; Time lines, Chronologies]

Page 46: Metadata as Infrastructure for Information Retrieval and Text Mining

WHEN, WHERE and WHO

Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

Page 47: Metadata as Infrastructure for Information Retrieval and Text Mining

Place and time are broadly important across numerous tools and genres including, e.g. Language atlases, Library catalogs, Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc.

Biographical dictionaries are heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.

Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.
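The episode view of a life can be sketched as a small record type, filled here with the Emanuel Goldberg facts from the entry above. The class name `Episode`, its field names and the helper `start_year` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    what: str                  # activity (WHAT)
    where: str                 # place (WHERE)
    when: str                  # year or year range as printed (WHEN)
    who: Optional[str] = None  # other person involved (WHO else)


GOLDBERG = [
    Episode("Born", "Moscow", "1881"),
    Episode("PhD", "Univ. of Leipzig", "1906", who="Wilhelm Ostwald"),
    Episode("Director, Zeiss Ikon", "Dresden", "1926-33"),
    Episode("Moved", "Palestine", "1937"),
    Episode("Died", "Tel Aviv", "1970"),
]


def start_year(episode):
    """First four-digit year of the episode's WHEN value."""
    return int(episode.when[:4])


def episodes_between(episodes, first, last):
    """Episodes whose start year lies in the inclusive range first..last."""
    return [e for e in episodes if first <= start_year(e) <= last]
```

Each field is a hook into the infrastructure: `where` can be sent to a gazetteer, `when` to a time period directory, and `who` to another biographical entry.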

Page 48: Metadata as Infrastructure for Information Retrieval and Text Mining

A new form of biographical dictionary would link to all

[Diagram: Texts; Numeric datasets; captions; Thesaurus/Ontology; EVI; Maps/Geo Data; Gazetteers; Time Period Directory; Time lines, Chronologies; Biographical Dictionary]

Page 49: Metadata as Infrastructure for Information Retrieval and Text Mining

A Metadata Infrastructure

CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers

RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages

INTERMEDIA INFRASTRUCTURE

Facet    Authority Control          Special Display Tools
WHAT     Thesaurus                  Syndetic Structure
WHERE    Gazetteer                  Maps
WHEN     Time Period Directory      Timelines
WHO      Biographical Dictionary    Text and Images

Learners

Dossiers

Page 50: Metadata as Infrastructure for Information Retrieval and Text Mining

Acknowledgements

Electronic Cultural Atlas Initiative project

This work was partially supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries, award number LG-02-04-0041-04, Oct 2004 - Sept 2006, entitled “Supporting the Learner: What, Where, When and Who”. See: http://ecai.org/imls2004

Michael Buckland, Fred Gey, Vivien Petras, Matt Meiske, Kim Carl

Contact: [email protected]