Taxonomies and Indexing: A Technical Strategy
Diane Vizine-Goetz
Office of Research
OCLC Online Computer Library Center, Inc.
Context
Techniques and approaches developed by & for libraries and other institutions responsible for preserving the human record
Broad scope
Long tradition of information organization
Why organize information?
For
• Search and retrieval
• Use
• Preservation & disposition
Why Organize Information by Subject?
Find information on a particular subject
Only (precision) and all (recall) relevant information
Find related information
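The two goals above, retrieving only relevant items and retrieving all relevant items, are conventionally measured as precision and recall. A minimal sketch of the arithmetic, assuming a hypothetical set of document identifiers (the function name and the sample data are illustrative and do not come from the original slides):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Here two of the four retrieved items are relevant (precision 0.50), and two of the five relevant items were found (recall 0.40).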
How?
Subject analysis
Conceptual analysis: determining what an information object is “about”
Translate concepts into knowledge organization (KO) scheme, e.g.,
• Subject indexes
• Thesauri
• Classification scheme
Automated, Semi-automated, Human/Intellectual
Automation & Subject Analysis
Automated (Web search engine; Web search engine + KO scheme)
• Conceptual analysis: automatic identification of key concepts (names, words, and phrases)
• Translation of concepts into KO scheme: NA for a Web search engine alone; automatic translation to KO scheme when a KO scheme is added
Semi-automated (machine-aided indexing/classification)
• Conceptual analysis: automatic identification of key concepts
• Translation of concepts into KO scheme: human-controlled translation to KO scheme
Human/Intellectual (traditional indexing & library cataloging)
• Conceptual analysis: human identification of key concepts (topics, names, time periods, forms/genre)
• Translation of concepts into KO scheme: human translation to KO scheme
Automated Concept Identification
Automated Indexing
• Ranges from simply identifying words in a document to sophisticated analyses that identify key names, words, and phrases
• WordSmith Project http://orc.rsch.oclc.org:5061/
Automated Classification
• Automated assignment of documents to categories or classes
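To make the simpler end of that range concrete, the sketch below counts recurring multiword phrases in a text. It is not the WordSmith method, which the slides do not describe; the stopword list, thresholds, function names, and sample text are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real indexer would use a fuller one.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to"}


def candidate_phrases(text, max_words=3):
    """Count word sequences (1..max_words long) that contain no stopwords."""
    counts = Counter()
    # Split on punctuation first so phrases never span clause boundaries.
    for segment in re.split(r"[.,;:!?()]", text.lower()):
        words = re.findall(r"[a-z]+", segment)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if not any(w in STOPWORDS for w in gram):
                    counts[" ".join(gram)] += 1
    return counts


def key_phrases(text, min_count=2, min_words=2):
    """Return multiword phrases seen at least min_count times, sorted."""
    counts = candidate_phrases(text)
    return sorted(p for p, c in counts.items()
                  if c >= min_count and len(p.split()) >= min_words)


# Hypothetical sample text, written for this example.
sample = (
    "The federal reserve board met on Tuesday. "
    "Analysts expect the federal reserve board to hold rates. "
    "Family planning services were also debated, and funding for "
    "family planning services may change."
)
print(key_phrases(sample))
```

Frequency counting alone returns overlapping fragments (for example both "federal reserve" and "reserve board"), which is one reason more sophisticated analyses are needed to isolate the key names and phrases.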
Political News Concepts Extracted by WordSmith
fair housing
fair housing act
family planning
family planning programmes
family planning programs
family planning services
federal government
federal government deficit
federal reserve
federal reserve bank
federal reserve board
federal reserve chairman alan greenspan
federal reserve system
Advantages of automatic concept identification
Inexpensive
Suitable for indexing/categorizing large quantities of text
Can identify popular and emerging concepts and terminology
Why use knowledge organization schemes?
Knowledge organization schemes such as subject heading lists, thesauri, & classification schemes are specialized languages designed for retrieving information
Goal--to reduce ambiguities that cause precision & recall failures
Free text vs. controlled subject retrieval language
WordSmith
family planning
family planning programmes
family planning programs
family planning services
Library of Congress Subject Headings (LCSH)
Birth control clinics
UF Family planning services
Planned parenthood services
BT Clinics
19860211
MeSH Heading vs. LCSH
MeSH Heading
Family Planning
Note: Programs or services designed to assist the family in controlling reproduction by either improving or diminishing fertility.
Entry Term Birth Control Planned Parenthood Basal Body Temperature Method Birth Limiting Births Averted Family Planning Surveys ...
LCSH
Birth control (19880919)
UF Family planning
Planned parenthood
Population control
Pregnancy--Prevention
BT Hygiene, Sexual
Sexual ethics
RT Contraception
Family size
NT Abortion
Birth Intervals
Childlessness
...
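The two records above cover overlapping concepts under different headings. A minimal sketch of one way to detect such overlap: compare each scheme's headings and entry terms after normalizing case. The dictionaries hold only a subset of the terms shown above, and the function names are hypothetical; a match is a candidate for human review, not an established equivalence.

```python
# Heading -> entry (non-preferred) terms, transcribed from the records above.
# Only the first two MeSH entry terms are used; the LCSH list is the UF list.
MESH = {"Family Planning": ["Birth Control", "Planned Parenthood"]}
LCSH = {"Birth control": ["Family planning", "Planned parenthood",
                          "Population control", "Pregnancy--Prevention"]}


def norm(term):
    """Normalize case and surrounding whitespace for comparison."""
    return term.casefold().strip()


def vocabulary(scheme):
    """Map every normalized heading and entry term to its heading."""
    index = {}
    for heading, entries in scheme.items():
        for term in [heading, *entries]:
            index[norm(term)] = heading
    return index


def candidate_mappings(source, target):
    """Pair headings whose headings or entry terms coincide."""
    target_index = vocabulary(target)
    pairs = set()
    for heading, entries in source.items():
        for term in [heading, *entries]:
            match = target_index.get(norm(term))
            if match:
                pairs.add((heading, match))
    return sorted(pairs)


print(candidate_mappings(MESH, LCSH))
```

With this data the MeSH heading "Family Planning" is paired with the LCSH heading "Birth control", because each heading appears among the other scheme's entry terms.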
Characteristics of subject retrieval languages
Terminology is often domain specific
• Medicine > MeSH; Engineering > INSPEC; Agriculture > Agrovoc
Control vocabulary (synonyms & homonyms)
Express relationships between terms
Within a domain, terms are context independent
Ei Thesaurus™
Bank protection
UF Coastal engineering--Bank protection
Inland waterways--Bank protection
SN Protection of river banks and lake shores. For seacoasts, use SHORE PROTECTION
DT January 1993
BT Protection
RT Banks (bodies of water)
Coastal engineering
Environmental engineering
Erosion
Inland waterways
River control
Shore protection
Slope protection
Soil conservation
MC 407.2; 407.3
OC 914.1
Controlled Vocabulary
Preferred way of expressing a concept, e.g., popular vs. technical
• Heart attack vs. Myocardial infarction
Non-used vocabulary often included
Synonyms
• Current/Outdated terms > Disabled/Handicapped
Lexical variants
• Phrase/Inverted forms > Bilingual education/Education, Bilingual
Quasi-Synonyms
• Synonyms/Antonyms > Literacy/Illiteracy
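Vocabulary control of this kind can be sketched as a lookup from non-used terms to a single preferred term. The pairs below come from the examples above, but which member of each pair is preferred is an assumption made here (the slide only contrasts them), and the function names are hypothetical.

```python
# Entry (non-used) term -> preferred term. Pairs are taken from the
# examples above; which side is "preferred" is an assumption made here.
USE_REFERENCES = {
    "heart attack": "Myocardial infarction",        # popular -> technical
    "handicapped": "Disabled",                      # outdated -> current
    "education, bilingual": "Bilingual education",  # inverted -> phrase form
    "illiteracy": "Literacy",                       # quasi-synonym
}

PREFERRED_TERMS = set(USE_REFERENCES.values())


def normalize(term):
    """Lowercase and collapse whitespace so lookups ignore trivial variation."""
    return " ".join(term.lower().split())


def to_preferred(term):
    """Return the preferred term for a query term, or None if uncontrolled."""
    key = normalize(term)
    if key in USE_REFERENCES:
        return USE_REFERENCES[key]
    for preferred in PREFERRED_TERMS:
        if normalize(preferred) == key:
            return preferred
    return None


for query in ["Heart  Attack", "bilingual education", "road rage"]:
    print(query, "->", to_preferred(query))
```

A term that is neither preferred nor listed as non-used, such as "road rage" here, falls outside the vocabulary and returns None.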
Relationships
Equivalence
• Synonymous terms
Hierarchy
• Generic relationship (kind)
• Whole-part relationship
• Instance relationship (example)
Association
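In thesaurus displays these three relationship types usually appear as the codes UF/USE (equivalence), BT/NT (hierarchy), and RT (association), and each is reciprocal. A minimal sketch of a structure that records both directions at once, using relationships from the Ei Thesaurus and LCSH examples earlier in the slides; the class and method names are hypothetical.

```python
from collections import defaultdict

# Each relationship code and its reciprocal.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


class Thesaurus:
    def __init__(self):
        # term -> relationship code -> set of target terms
        self.relations = defaultdict(lambda: defaultdict(set))

    def add(self, term, code, target):
        """Record a relationship and its reciprocal in one step."""
        self.relations[term][code].add(target)
        self.relations[target][RECIPROCAL[code]].add(term)

    def get(self, term, code):
        """Return the sorted targets of one relationship type for a term."""
        return sorted(self.relations[term][code])


t = Thesaurus()
t.add("Bank protection", "BT", "Protection")       # hierarchy
t.add("Bank protection", "RT", "Erosion")          # association
t.add("Bank protection", "RT", "Shore protection")
t.add("Birth control clinics", "UF", "Family planning services")  # equivalence

print(t.get("Protection", "NT"))                # ['Bank protection']
print(t.get("Erosion", "RT"))                   # ['Bank protection']
print(t.get("Family planning services", "USE")) # ['Birth control clinics']
```

Storing the reciprocal automatically keeps the vocabulary consistent: a searcher who starts at "Protection" can move down to "Bank protection" even though only the BT direction was entered.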
Subject Retrieval using a controlled vocabulary
Related Terms in LCSH
Classification / Categorization System
A systematic arrangement of knowledge into useful categories
General schemes & special schemes
• General: DDC, LCC, UDC
• Special: AGRIS, MSC
Present a generalized view of knowledge at varying levels of depth
May be enumerative or synthetic
Some Advantages of Traditional Schemes
Meaningful notation
Well-developed hierarchies
Well-defined categories
Rich network of relationships
Meaningful Notation (DDC)
005.1 Programming
005.1 Programmation
005.1 Программирование
005.1 Programación
DDC Notation Indicates Hierarchy
600 Technology
630 Agriculture
633 Field and plantation crops
633.1 Cereals
633.11 Wheat
633.12 Buckwheat
633.13 Oats
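Because the notation itself carries the hierarchy, broader classes can be derived from a number by shortening it. The sketch below does this for the example above. It assumes well-formed notation and a strictly regular hierarchy; actual DDC schedules contain exceptions, so this is an approximation, and the function name is hypothetical.

```python
def ddc_ancestors(notation):
    """Return broader DDC-style notations, nearest first.

    Assumes a well-formed notation: three digits, optionally followed by
    a decimal point and more digits (e.g. "633.11").
    """
    main, _, decimals = notation.partition(".")
    ancestors = []
    # Drop decimal digits one at a time: 633.11 -> 633.1 -> 633
    while decimals:
        decimals = decimals[:-1]
        ancestors.append(f"{main}.{decimals}" if decimals else main)
    # Zero out the main-number digits from the right: 633 -> 630 -> 600
    digits = list(main)
    for i in (2, 1):
        if digits[i] != "0":
            digits[i] = "0"
            ancestors.append("".join(digits))
    return ancestors


# Captions taken from the hierarchy above.
CAPTIONS = {
    "600": "Technology",
    "630": "Agriculture",
    "633": "Field and plantation crops",
    "633.1": "Cereals",
    "633.11": "Wheat",
}

path = ddc_ancestors("633.11")[::-1] + ["633.11"]
for depth, number in enumerate(path):
    print("  " * depth + number, CAPTIONS[number])
```

Run on 633.11, this prints the five-level path from 600 Technology down to 633.11 Wheat, indented by depth.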
Well-developed Hierarchies
Hierarchies & Categories
Hierarchical from general to specific
Categories have superordinate, coordinate, subordinate relationships in hierarchy
Subcategories must be mutually exclusive
Hierarchies & Categories
Top > Recreation > Automotive > Driving > Road Rage
Social Problems > Public Safety > Traffic Hazards > Highways > Road Rage
Hierarchies, Categories, Relationships
500 Science
510 Mathematics
512 Algebra, number theory
512.3 Fields
Class here field theory, Galois theory
Class linear algebra in 512.5; class number theory in 512.7
Advantages of Category Schemes
Facilitate retrieval based on concepts not simply keywords
Provide context for search terms (disambiguates)
Facilitate browsing & search refinement
Advantages & Disadvantages of Formal KO Schemes
+
• Bring like items together
• Provide context & show relationships
• Support browsing
• May accommodate multilingual usage
-
• Reactive to emerging topics
• Terminology may not match users' terminology
• Not practical to apply to everything
Advantages & Disadvantages of Free Text
+
• Latest terminology
• Application not an issue
-
• User must supply synonyms and relationships
• Limited browsing
• Little multilingual support
Other Solutions
Combine approaches
Map among KO schemes
Map free text terms to KO schemes
Produce supplemental browsable indexes from free text
Resources
ANSI/NISO Z39.19-1993 (Revision of ANSI Z39.19-1980) Guidelines for the Construction, Format, and Management of Monolingual Thesauri <http://www.niso.org/stantech.html#z3919>
Controlled vocabularies, thesauri and classification systems available in the WWW. DC Subject <http://www.lub.lu.se/metadata/subject-help.html>
The Intellectual Foundation of Information Organization by Elaine Svenonius. MIT Press; ISBN: 0262194333
List of Web Subject Resources <http://www.loc.gov/catdir/pcc/saco/resources.html>
The Organization of Information (Library and Information Science Text Series) by Arlene G. Taylor. Libraries Unlimited; ISBN: 1563084988
Resources for Indexers <http://www.asindexing.org/asires.shtml>