automatic cataloging & classification

Automatic cataloging & classification

Eric Childress
OCLC Research

OCLC Members Council
Research and New Technologies Interest Group
25 October 2005

The key question

Can machines be leveraged for?
– Baseline metadata
  • Critical data present
  • Accurate tagging
  • Accurate values
– Ideal: Enriched metadata

The answer:
– Yes…with caveats

Status quo

[Diagram labels: Input, Human Labor, Metadata, Output]

Automation approaches

Harvesting: Drawing from extant metadata in one or more sources

Extraction: Drawing from attributes of the resource and/or content in the resource

Both: Integrating both harvesting & extraction in metadata generation
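To make the combined approach concrete, here is a minimal sketch in Python. It assumes an HTML resource that may carry embedded Dublin Core `<meta>` tags; the class `MetaHarvester`, the function `generate_metadata`, and the sample page are hypothetical names invented for illustration, not part of any tool named in these slides. Harvested values are preferred, and extraction from the resource's own content fills the gaps.

```python
from html.parser import HTMLParser


class MetaHarvester(HTMLParser):
    """Collects <meta name=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Normalize the name so "DC.creator" and "dc.creator" match.
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def generate_metadata(html):
    parser = MetaHarvester()
    parser.feed(html)
    # Harvesting: reuse metadata that is already embedded in the page.
    record = {
        "title": parser.meta.get("dc.title"),
        "creator": parser.meta.get("dc.creator"),
    }
    # Extraction: fall back to the content of the resource itself.
    if not record["title"]:
        record["title"] = parser.title.strip() or None
    return record


# Hypothetical sample page: it has a DC.creator tag but no DC.title tag.
SAMPLE = """<html><head>
<title>Annual Report 2004</title>
<meta name="DC.creator" content="Example Library">
</head><body><p>Text.</p></body></html>"""

record = generate_metadata(SAMPLE)
print(record)
```

Here the creator is harvested from the embedded tag, while the title is extracted from the page because no embedded title was supplied.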

Approaches (cont’d)

Harvesting & extraction can be integrated with other tactics:
– Point-of-transaction capture: Manual and/or automatic capture of metadata during the lifecycle of resource and/or metadata (e.g., the source agency, date of record)
– Human review/prompting: Integrating human decision-making to address cases machines cannot handle efficiently (e.g., linking name references to correct authority file when several names are similar)
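The human review/prompting tactic can be sketched as a routing rule: link automatically when exactly one authority heading is a close match, and hand the case to a person otherwise. The sketch below uses Python's standard `difflib` for string similarity; the function `link_name`, the 0.8 cutoff, and the sample headings are assumptions made for illustration, not a description of any production authority-control system.

```python
import difflib

# Hypothetical authority headings, invented for this example.
AUTHORITY_FILE = [
    "Smith, John, 1950-",
    "Smith, John A.",
    "Smyth, Joan",
    "Jones, Mary",
]


def link_name(name, authority, cutoff=0.8):
    """Link a name automatically only when the match is unambiguous."""
    candidates = difflib.get_close_matches(name, authority, n=5, cutoff=cutoff)
    if len(candidates) == 1:
        # One clear candidate: safe for the machine to link.
        return {"status": "linked", "heading": candidates[0]}
    # Zero or several similar headings: ask a human to decide.
    return {"status": "review", "candidates": candidates}


clear_case = link_name("Jones, Mary", AUTHORITY_FILE)
ambiguous_case = link_name("Smith, John", AUTHORITY_FILE)
print(clear_case)
print(ambiguous_case)
```

The cutoff controls how much work is sent to reviewers: a higher value links less automatically and queues more cases for review.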

Harvesting options

New record, same database:
– OCLC “derive” record technique

External metadata files:
– Z39.50/Zing/MXG
– OAI harvesting
– Citation tools (e.g., EndNote)

Embedded metadata harvesting:
– Processes structured metadata
– Various tools (e.g., DC tools list)

Many harvesting tools include some extraction features (and vice-versa)
– Example: InfoLibrarian appliance

http://dublincore.org/tools/




http://www.infolibcorp.com/ApplianceOverview.html
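As an illustration of the OAI harvesting option, the sketch below parses an OAI-PMH `ListRecords` response carrying unqualified Dublin Core (`oai_dc`) and collects each record's fields. The XML namespaces are the ones defined by OAI-PMH 2.0 and Dublin Core; the sample response, the identifier `oai:example.org:1`, and the function `harvest_records` are made up for the example. A real harvester would also fetch the response over HTTP and follow resumption tokens, which is omitted here.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body, trimmed to a single record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:subject>Cataloging</dc:subject>
          <dc:subject>Metadata</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_records(xml_text):
    """Return one dict per OAI record, with repeatable DC fields as lists."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        fields = {}
        for el in rec.iterfind(".//oai_dc:dc/*", NS):
            # Strip the "{namespace}" prefix to get the bare element name.
            name = el.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((el.text or "").strip())
        records.append({"identifier": identifier, "fields": fields})
    return records


harvested = harvest_records(SAMPLE_RESPONSE)
print(harvested)
```

Keeping every field as a list preserves repeatable elements such as `dc:subject`, which matters when the harvested record is later mapped to another scheme.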

Extraction landscape

Many tools from many sources
– Features vary widely
– Some are narrow-band (e.g., domain-specific, narrow scope of data work)
– Standalone or highly integrated in systems (often as part of digital access mgt. systems)

Frequently-encountered features:
– Simple: document statistics, file type
– Complex: (reliable) language detection, audience level, topics, entities represented, document parts, taxonomy derivation
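The "simple" features above need very little machinery. The following sketch derives a file type from the file name with Python's standard `mimetypes` module and computes basic document statistics with regular expressions. The function `simple_features` and its sentence-splitting rule are illustrative assumptions; the "complex" features listed above (language detection, topics, entities) need considerably more than this.

```python
import mimetypes
import re


def simple_features(filename, text):
    """Extract file type and basic statistics from a text resource."""
    # File type is guessed from the extension only, not from the content.
    mime_type, _encoding = mimetypes.guess_type(filename)
    words = re.findall(r"\w+", text)
    # Naive rule: a sentence ends at ".", "!" or "?".
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "mime_type": mime_type,
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }


features = simple_features("report.html", "Metadata matters. Machines can help!")
print(features)
```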

Extraction approaches

Information extraction:
– “Automatically extract structured or semistructured information from unstructured machine-readable documents” - Wikipedia

Natural language processing
– “A range of computational techniques for analyzing and representing naturally occurring text (free text) at one or more levels of linguistic analysis (e.g., morphological, syntactic, semantic, pragmatic) for the purpose of achieving human-like language processing for knowledge-intensive applications” - AHIMA
– Extracts both explicit & implicit meaning

http://en.wikipedia.org/wiki/Information_extraction

http://library.ahima.org/xpedio/groups/public/documents/ahima/pub_bok1_025042.html

Some work of interest

Library of Congress
NSF-funded NSDL projects
AMeGA
iVia software
RLG’s Automatic Exposure

Library of Congress

BEAT (Bibliographic Enrichment Advisory Team) activities & projects:
– MARC records from harvesting:
  • E-CIP
  • Web access to publications in series
– Numerous enrichment activities:
  • TOCs: E-CIP, ONIX, dTOC project, more
  • Reviews: HNET, Outstanding Reference Sources, HLAS reviews, MARS Best Free Reference Sites
  • Contributor biographic information, ONIX descriptions, sample texts
  • Links to e-versions of various texts
  • Special projects for select LC collections
– Work with bibliographies & pathfinders

http://www.loc.gov/catdir/beat/

NSDL-related projects (selected)

MetaExtract: An NLP System to Automatically Assign Metadata
– CNLP (Syracuse U) & SIS (Syracuse U)
– Builds on several previous projects including:
  • Breaking the MetaData Generation Bottleneck [2000-2002]
    – CNLP (Syracuse U) & U Washington iSchool
    – Application of NLP to automatically generate metadata for course-oriented materials

Lenny
– Cornell NSDL group & INFOMINE
– Orchestrated application of a suite of activities
  • OAI harvesting with metadata augmentation using iVia
  • Loosely-coupled third party services to provide metadata enhancements (correction, augmentation) to metadata destined for a central repository
  • Interactions orchestrated by centralized software application

http://cnlp.syr.edu/

http://www.istweb.syr.edu/

http://www.ischool.washington.edu/

MetaExtract study findings

Auto-generated versus manually-assigned:
– Comparable
  • Performance in Retrieval
  • Quality of most elements (for Browsing)
– Better
  • Coverage of metadata elements

Auto-generated versus full-text:
– Comparable
  • Performance in Retrieval
– Better
  • Enables Fielded searching
  • Enables Browsing of results
– Provides useful structuring of data

http://www.cnlp.org/presentations/slides/JCDL_2004_Final.ppt

Other projects

AMeGA (Automatic Metadata Generation Applications Project)
– UNC-CH SILS Metadata Research Center
– Research initiated to fulfill LC Bibliographic Control Action Plan 4.2 (deliver specifications for tools to effect automated processing of Web-based resources)
– Final report identifies and recommends functionalities for automatic metadata generation applications

iVia software
– Developed by INFOMINE & in use by NSDL, various other digital library projects; LC looking at using iVia
– Sophisticated open source harvester software that can assign LCSH, LCC

Automatic Exposure
– RLG-led initiative advocates capturing standard technical metadata about digital images automatically, as part of image creation

http://www.ils.unc.edu/

http://www.ils.unc.edu/mrc/

http://ivia.ucr.edu/

http://infomine.ucr.edu/

http://www.rlg.org/en/page.php?Page_ID=2681

OCLC activities

OCLC Research projects:
– Automatic classification
– FRBR-related record harvesting
– SchemaTrans

OCLC production services:
– OCLC Digital Archive
– WorldCat link
– OCLC Connexion

Automatic classification work

Scorpion
– Open source software that implements a system for automatically classifying Web-accessible text documents
– Incorporated into Connexion extractor

FAST as a knowledge base for automatic classification project
– Evaluated FAST as a database to support automatic classification

ePrints-UK project
– A collaboration with RDN to pilot Web services to classify records by DDC and provide authority control for personal names for RDN eprint metadata records

http://www.oclc.org/research/software/scorpion

http://www.oclc.org/research/projects/fastac

http://www.oclc.org/research/projects/fast

http://www.oclc.org/research/projects/mswitch/epuk.htm
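For readers new to automatic classification, the toy sketch below shows the general idea of ranking candidate classes for a text document by vocabulary overlap. It is not Scorpion's implementation or algorithm; the class codes `A1`, `B2`, `C3`, their captions, and the functions `tokenize` and `classify` are invented for illustration, and a real system would work against an actual scheme such as DDC with proper term weighting.

```python
import re
from collections import Counter

# Hypothetical classification scheme: code -> caption vocabulary.
CLASSES = {
    "A1": "library cataloging classification metadata records",
    "B2": "computer programming software algorithms",
    "C3": "gardening plants soil flowers",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(document, classes, top_n=2):
    """Rank classes by how often their caption terms occur in the document."""
    doc_terms = Counter(tokenize(document))
    scores = []
    for code, caption in classes.items():
        score = sum(doc_terms[term] for term in set(tokenize(caption)))
        scores.append((score, code))
    # Highest score first; ties are broken by class code for stable output.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(code, score) for score, code in scores[:top_n] if score > 0]


ranked = classify(
    "Automatic classification of library records using software", CLASSES
)
print(ranked)
```

Returning a ranked list rather than a single class leaves room for the human review step described earlier in the deck.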

Other OCLC Research activities

FRBR-related record harvesting
– Best elements of all records in workset used to build a “work” record (Fiction Finder)

SchemaTrans project
– Adopts a novel approach to translating structured metadata between schemes
– Should be friendly to modular augmentation/correction activities

http://www.oclc.org/research/projects/mswitch/1_schematrans.htm
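The workset idea under FRBR-related record harvesting, building one "work" record from the best elements of many records, can be sketched as a per-field merge. The slides do not say how "best" is decided, so the sketch below assumes the most frequent non-empty value wins; the function `build_work_record` and the sample workset are hypothetical and do not describe how Fiction Finder works.

```python
from collections import Counter


def build_work_record(workset):
    """Merge a workset into one record, field by field."""
    fields = {key for record in workset for key in record}
    work = {}
    for field in sorted(fields):
        # Ignore records where the field is missing or empty.
        values = [record[field] for record in workset if record.get(field)]
        if values:
            # Assumption: the most frequent value is treated as the "best".
            work[field] = Counter(values).most_common(1)[0][0]
    return work


# Made-up sample records describing the same work.
WORKSET = [
    {"title": "Moby Dick", "author": "Melville, Herman", "subject": ""},
    {"title": "Moby Dick", "author": "Melville, Herman, 1819-1891", "subject": "Whaling"},
    {"title": "Moby-Dick", "author": "Melville, Herman, 1819-1891"},
]

work_record = build_work_record(WORKSET)
print(work_record)
```

Other selection rules (longest value, value from a trusted source) would slot into the same loop in place of the frequency count.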

OCLC products

OCLC Digital Archive
– Various harvesting options
  • Capture of technical metadata
  • Start descriptive records in Connexion

WorldCat link
– Scheduled ingest of metadata from OAI servers and batch processing into WorldCat

OCLC Connexion
– Extractor processes metadata from web sites
  • Relatively sophisticated harvesting
  • Processes non-canonical metadata
  • Slated for significant upgrade in 2006
– Rules-aided LCSH assignment while editing bibs
– Automatic base authority record generation from relevant bibliographic record (NACO)

http://www.oclc.org/research/projects/fast/default.htm

Links

Recommended reading:
– Liddy, Elizabeth, “Metadata: A Promising Solution” in EDUCAUSE Review, v. 40, n. 3 (May/June 2005)

OCLC Research links:
– Automatic classification projects
– SchemaTrans
– ResearchWorks
