11/21/2000 Information Organization and Retrieval

Thesaurus Design and Development

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval


Review

• Name Authority Control

• Types of Controlled Vocabularies

• Categories and Categorization

• Interface Design


Types of Indexing Languages

• Uncontrolled Keyword Indexing
• Indexing Languages
– Controlled, but not structured
• Thesauri
– Controlled and Structured
• Classification Systems
– Controlled, Structured, and Coded

• Faceted Classification Systems


Uses of Controlled Vocabularies

• Library Subject Headings, Classification and Authority Files
• Commercial Journal Indexing Services and databases
• Yahoo and other Web classification schemes
• Online and Manual Systems within organizations
– SunSolve
– MacArthur


Indexing Languages

• An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.

• An indexing language is the set of terms used in an index to represent topics or features of documents, together with the rules for combining or using those terms.


Classification Systems

• A classification system is an indexing language often based on a broad ordering of topical areas. Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics. Classification schemes commonly use a coded notation to represent a topic and its place in relation to other terms.


Automatic Indexing and Classification

• Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.

• More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.

• Automatic classification attempts to automatically group similar documents using either:
– A fully automatic clustering method.
– An established classification scheme and a set of documents already indexed by that scheme.
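The simplest form of automatic indexing described above (derive keywords, give access to all of them) can be sketched as an inverted index. This is only an illustration: the stopword list, the function names, and the sample documents are assumptions, not part of the lecture.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "for"}

def derive_keywords(text):
    """Lowercase the text, keep alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_index(docs):
    """Map every derived keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in derive_keywords(text):
            index[word].add(doc_id)
    return index

# Illustrative documents only.
docs = {
    "d1": "The design of a thesaurus",
    "d2": "Indexing and retrieval of documents",
}
index = build_index(docs)
```

Every non-stopword becomes an access point, which is what makes this approach uncontrolled: nothing maps variant words to a single preferred term.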


Clustering

Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered; clusters are order-dependent.

[Figure: groups of "Doc" boxes]

Rocchio’s method (yes, the same Rocchio as Relevance Feedback):

1. Select initial centers (i.e., seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
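The three clustering steps can be sketched as follows. This is a minimal single-pass illustration under stated assumptions: documents are bag-of-words count vectors, matching is cosine similarity, assignment is exclusive (one cluster per document), and the seeds are chosen by hand. None of those choices, nor the function names, come from the slide.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Average the term counts of a non-empty group of vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def assign(vectors, centers):
    """For each vector, return the index of its best-matching center."""
    return [max(range(len(centers)), key=lambda i: cosine(v, centers[i]))
            for v in vectors]

def cluster(vectors, seed_ids):
    centers = [vectors[i] for i in seed_ids]      # 1. seed the space
    labels = assign(vectors, centers)             # 2. assign to best center...
    for i in range(len(centers)):                 #    ...and compute centroids
        members = [v for v, lab in zip(vectors, labels) if lab == i]
        if members:
            centers[i] = centroid(members)
    return assign(vectors, centers)               # 3. reassign to centroids

# Illustrative documents only.
texts = [
    "thesaurus term term",
    "thesaurus descriptor term",
    "cluster centroid",
    "centroid cluster seed",
]
vectors = [Counter(t.split()) for t in texts]
labels = cluster(vectors, seed_ids=[0, 2])
```

Choosing different seeds can change the result, which is one way to read the slide's remark that the clusters are order-dependent.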


Automatic Class Assignment

[Figure: "Doc" boxes and a Search Engine]

1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme
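A rough sketch of the four class-assignment steps, assuming the "search engine" is a cosine match between bag-of-words vectors. The class names, pseudo-document texts, threshold, and function names are all hypothetical.

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words count vector for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_classes(doc_text, pseudo_docs):
    """Steps 2-3: search with the document contents, return a ranked list."""
    query = vec(doc_text)
    scored = [(cosine(query, vec(text)), name) for name, text in pseudo_docs.items()]
    return sorted(scored, reverse=True)

def over_threshold(ranked, threshold, n):
    """Step 4, first option: up to n classes scoring at or above the threshold."""
    return [name for score, name in ranked if score >= threshold][:n]

# Step 1: pseudo-documents standing in for intellectually derived classes.
pseudo_docs = {
    "Indexing": "index indexing descriptor thesaurus term",
    "Clustering": "cluster centroid similarity seed",
}
ranked = rank_classes("thesaurus term selection and descriptor choice", pseudo_docs)
assigned = over_threshold(ranked, threshold=0.2, n=2)
top_ranked = ranked[0][1]   # Step 4, second option: top-ranked class only
```

Because each document is matched against fixed class descriptions, the outcome does not depend on the order in which documents arrive.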


Today

• Developing Controlled Vocabularies

• Thesaurus design

• Steps in Thesaurus development

• Indexing


Developing Controlled Vocabularies

• Origins and Uses of Controlled Vocabularies for Information Retrieval

• Types of Indexing Languages, Thesauri and Classification Systems

• Process of Design and Development of Thesauri


Origins

• Very early history of content representation
– Sumerian tokens and “envelopes”
– Alexandria - pinakes
– Indices


Origins

• Biblical Indexes and Concordances
– Hugo de St. Caro – 1247 A.D.: 500 Monks -- KWOC
– Book indexes (Nuremberg Chronicle)
• Library Catalogs
• Journal Indexes
• “Information Explosion” following WWII
– Cranfield Studies of indexing languages and information retrieval
– Development of bibliographic databases
• Index Medicus -- production and Medlars searching


Origins

• Communication theory revisited

• Problems with transmission of meaning

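[Figure: communication model. Source, Encoding, Channel (with Noise), Decoding, Destination; a Message enters at the Source and leaves at the Destination. Retrieval version: Source, Encoding (writing/indexing), Storage, Decoding (Retrieval/Reading), Destination.]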


Structure of an IR System

[Figure: Information Storage and Retrieval System.
Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests.
Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations.
Both stores feed Comparison/Matching, which yields Potentially Relevant Documents.
Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).]

Adapted from Soergel, p. 19


What is a “Controlled Vocabulary”?

• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

• Similarly, there are too many ways of expressing or explaining the topic of a document.

• A controlled vocabulary is a set of rules for topic identification and indexing, together with a THESAURUS, which consists of a “lead-in vocabulary” and a limited and selective “Indexing Language”, sometimes with special coding or structures.


Thesauri

• A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
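One way to picture this definition is as a small data structure in which each preferred term carries its links. The sketch below is an assumption-laden illustration: the class and method names are invented, and the sample terms are placeholders rather than entries from any real thesaurus.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """One preferred term (descriptor) and its links."""
    term: str
    used_for: set = field(default_factory=set)   # synonymous / equivalent terms
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT

class Thesaurus:
    def __init__(self):
        self.entries = {}

    def entry(self, term):
        """Fetch the entry for a preferred term, creating it if needed."""
        if term not in self.entries:
            self.entries[term] = Entry(term)
        return self.entries[term]

    def add_synonym(self, term, synonym):
        self.entry(term).used_for.add(synonym)

    def add_broader(self, term, broader_term):
        """Record BT and its reciprocal NT together so the links stay consistent."""
        self.entry(term).broader.add(broader_term)
        self.entry(broader_term).narrower.add(term)

    def add_related(self, a, b):
        """Related-term links are symmetric."""
        self.entry(a).related.add(b)
        self.entry(b).related.add(a)

# Placeholder terms for illustration only.
t = Thesaurus()
t.add_broader("cars", "vehicles")
t.add_synonym("cars", "automobiles")
t.add_related("cars", "roads")
```

Writing both directions of a link in one call is a design choice made here so that a broader-term reference never lacks its narrower-term counterpart.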


Thesauri (cont.)

• National and International Standards for Thesauri
– ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
– ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval
– ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
– ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri


Thesauri (cont.)

• Examples:
– The ERIC Thesaurus of Descriptors
– The Art and Architecture Thesaurus
– The Medical Subject Headings (MeSH) of the National Library of Medicine


Why develop a thesaurus?

• To provide a conceptual structure or “space” for a body of information
– To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity
– To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material).


Why develop a thesaurus?

• To provide vocabulary (or terminological) control.
– When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with.
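Vocabulary control of this kind can be sketched as a lookup from any entry term to its preferred term. The table contents and the function name below are hypothetical, and the normalization (lowercasing, collapsing whitespace) is just one reasonable assumption.

```python
# Hypothetical lead-in vocabulary: every entry term points at a preferred term.
LEAD_IN = {
    "cars": "cars",
    "automobiles": "cars",
    "motor cars": "cars",
}

def preferred_term(term, lead_in=LEAD_IN):
    """Return the preferred term for whatever term the user starts with, or None."""
    key = " ".join(term.lower().split())   # normalize case and spacing
    return lead_in.get(key)
```

For example, `preferred_term("Motor  Cars")` returns `"cars"`, while a term missing from the lead-in vocabulary returns `None`, signalling a gap.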


Preliminary Considerations

• What is used now?
– Continue using an existing thesaurus?
– Ad hoc modification of existing thesaurus?
– Develop a new well-structured thesaurus?

• What is the scope and complexity of the subject field?

• What kind of retrieval objects or data will be dealt with?

• How exhaustive and specific is the desired description of objects?


Preliminary Considerations

• The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus.
– It is better to plan for a larger and more comprehensive system than for a smaller system that will rapidly become inadequate as the database grows.

• Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists.


Development of a Thesaurus

• Term Selection.
• Merging and Development of Concept Classes.
• Definition of Broad Subject Fields and Subfields.
• Development of Classificatory Structure.
• Review, Testing, Application, Revision.


Flow of Work in Thesaurus Construction

[Flowchart. Boxes, in order of appearance: Select Sources; Assign codes; Select Terms; Record Selected Terms; Sort Terms; Merge identical Terms; Define Broad Subject Fields; Merge Terms in Same Concept class; Sort Terms into Broad Subject Fields; Define Subfields within one Subject Field; Work out detailed structure of the Subject Field; Select Preferred Terms; All Subfields of Broad Subject finished? (Yes/No); All Broad Subjects finished? (Yes/No); Improve Class Structure; Print Classified Index and review; Discuss with Experts and Users; Select descriptors and checklist items; Produce Full Thesaurus and Check references; Assign Notation; Review and Test; Many Modifications? (Yes/No); Revise as needed.]

Based on Soergel, pp. 327-333


1. Term Selection

• Select sources for the collection of terms.
– Prearranged Sources
– Open-ended Sources
• Assign codes to each source.
• Selection of terms
– For part of prearranged and for all open-ended sources
• Enter terms into database with all information.


1.1 Kinds of Sources

• Prearranged Sources
– Existing descriptor lists, classification schemes, thesauri. This includes universal schemes like DDC or LCSH.
– Nomenclatures of single disciplines
– Treatises on the terminology of a field
– Encyclopedias, lexica, dictionaries and glossaries
– Tables of contents of textbooks and handbooks
– Indexes of journals or abstracting journals
– Indexes of other publications in the field


1.1 Kinds of Sources (cont.)

• Open-ended sources
– Lists of search requests or interest profiles
– Description of projects/activities to be served by the information retrieval system
– Discussion with specialists in the field
– Sample of documents in the field
• Ask users why and how these documents relate to the field.
• Have documents indexed by experts in the field
– Lists of titles of documents in the field
– Abstracts and reviews of documents
– Your own knowledge


Selection of Sources

• Prearranged sources require less effort in gathering the material, and may already indicate some relationships between terms and concepts and relationships among terms.
• Open-ended sources can reflect current terminology and may provide more complete coverage.
• Choose a set of sources that are current, as complete as possible, and considered authoritative.


Selection of Sources

• Each selected source is assigned an ID for tracking its use in the development of the thesaurus.
– Useful when making decisions about which terms to prefer
– Useful for backtracking when questions arise (where did this come from?)
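Source tracking of this kind might look like the following sketch, in which every recorded term remembers the IDs of the sources it was taken from. The source IDs, descriptions, and function names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical source register: ID -> description.
SOURCES = {
    "S1": "Existing descriptor list",
    "S2": "Sample of documents in the field",
}

term_sources = defaultdict(set)

def record(term, source_id):
    """Record a candidate term together with the ID of the source it came from."""
    if source_id not in SOURCES:
        raise KeyError(f"unknown source: {source_id}")
    term_sources[term.lower()].add(source_id)

def where_from(term):
    """Backtrack: which sources contributed this term?"""
    return sorted(term_sources.get(term.lower(), ()))

record("Descriptors", "S1")
record("descriptors", "S2")
record("Index terms", "S2")
```

A term attested in several sources, as `"descriptors"` is here, may be a stronger candidate for preferred-term status than one found in a single source.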


Selection of Terms

• Terms can be transferred directly from prearranged sources to the recording medium (cards or database)
– Have to decide which terms and references to include, or whether to take the whole source


Selection of Terms

• In open-ended sources you read through the source and pick out terms (i.e., words and phrases) that might be useful in retrieval or as references to other terms.

• Alternatively, use keyword and phrase extraction software to create lists of terms and select from those.

• Transfer selected terms to the recording medium (cards or database).
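Keyword and phrase extraction can be approximated by counting words and short word sequences, then reviewing the frequent ones by hand. The sketch below assumes a tiny stopword list, phrases of at most two words, and a minimum count of two; all of these, and the function name, are illustrative choices.

```python
import re
from collections import Counter

# Hypothetical stopword list.
STOPWORDS = {"a", "an", "and", "of", "the", "with", "in", "for", "to"}

def candidate_terms(texts, max_len=2, min_count=2):
    """Count words and phrases of up to max_len words; keep the frequent ones."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip candidates that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]

# Illustrative texts only.
texts = [
    "Subject indexing with a controlled vocabulary",
    "A controlled vocabulary supports subject indexing",
]
candidates = dict(candidate_terms(texts))
```

The output is only a candidate list; selecting which entries become terms remains an intellectual decision.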


2. Merging and Development of Concept Classes

• Sort Term DB into alphabetical order.

• First Round: Merge information for Identical terms -- possibly pulling info from additional sources.

• Second Round: Merge synonyms or terms in the same concept class.
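The two merging rounds can be sketched as below, assuming each record is a (term, source ID) pair and that synonym decisions have already been made by a person and written down as a mapping. The record format, the mapping, and the function names are assumptions for illustration.

```python
def merge_identical(records):
    """First round: sort alphabetically, then merge records for identical terms."""
    merged = {}
    for term, source in sorted(records, key=lambda r: r[0].lower()):
        merged.setdefault(term.lower(), set()).add(source)
    return merged

def merge_concept_classes(merged, synonym_of):
    """Second round: fold each term into its concept class, keeping all sources."""
    classes = {}
    for term, sources in merged.items():
        concept = synonym_of.get(term, term)
        entry = classes.setdefault(concept, {"terms": set(), "sources": set()})
        entry["terms"].add(term)
        entry["sources"] |= sources
    return classes

# Illustrative records and a hand-made synonym decision.
records = [("Cars", "S1"), ("cars", "S2"), ("Automobiles", "S2")]
synonym_of = {"automobiles": "cars"}

merged = merge_identical(records)
classes = merge_concept_classes(merged, synonym_of)
```

Identical terms are detected here by case-insensitive comparison only; deciding that two different words name the same concept is left to the human-supplied mapping.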


3. Definition of Broad Subject Fields and Subfields

• Define Broad Subject fields and sort terms into these broad fields

• Define subfields within each broad field and sort terms into these subfields.

• Work out the detailed structure
– Select Preferred Terms
– Merge information for terms in the same concept class
• Repeat these steps
– for each subfield within a broad field
– and for each broad field
– until all terms have been consolidated and preferred terms selected


4. Development of Classificatory Structure

• Produce preliminary version of classified index and update the working database.

• Improve classificatory structure

• Reality check: produce a version of the classified index and distribute it to users/experts.


5. Final Stages

• Review

• Testing

• Application

• Revision


Review

• Discuss classified index with users/experts.
– Select descriptors and checklist descriptors.

• Assign Notational Symbols

• Produce Main Thesaurus & Indexes


Review (cont.)

• Check cross references and insert where needed

• Produce Test Version

• Test by Indexing

• Modify as needed

• Produce Production Version.


Testing a Thesaurus

• Assign descriptors to a sample set of NEW documents (use enough to get an idea of any gaps in the thesaurus).

• Test retrieval using sample questions and see how effectively the thesaurus maps them to the appropriate descriptors.
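Both tests can be sketched as simple checks against a lead-in vocabulary. The vocabulary, the sample concepts and questions, and the function names are hypothetical, and a real test would rely on human judgement rather than exact string matches.

```python
def find_gaps(sample_concepts, lead_in):
    """Concepts found in new documents that the thesaurus cannot map to a descriptor."""
    return sorted({c.lower()
                   for concepts in sample_concepts
                   for c in concepts
                   if c.lower() not in lead_in})

def mapping_rate(sample_queries, lead_in):
    """Fraction of (query term, expected descriptor) pairs mapped correctly."""
    hits = sum(1 for term, expected in sample_queries
               if lead_in.get(term.lower()) == expected)
    return hits / len(sample_queries)

# Hypothetical lead-in vocabulary and test samples.
lead_in = {"cars": "cars", "automobiles": "cars"}
sample_concepts = [["Cars", "Bicycles"], ["automobiles"]]
sample_queries = [("Automobiles", "cars"), ("bikes", "bicycles")]

gaps = find_gaps(sample_concepts, lead_in)
rate = mapping_rate(sample_queries, lead_in)
```

Here the gap list and the failed query point at the same missing concept, which is the kind of evidence that feeds the revision step.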


The Indexing Process

• Concept identification

• Term selection (via thesaurus)

• Term assignment


Application: The Indexing Process (Manual)

[Flowchart. Boxes and decisions: Start; Examine Document and Identify Significant Concepts; Consider First Concept; Does Thesaurus contain term for Concept? (Yes/No); Preferred Term? (Yes/No); Consider Preferred Term; Select Preferred Term; Is Term suitable? (Yes/No); Consider any associated terms in Thesaurus (NT, BT); Would Concept be better represented by one of these terms? (Yes/No); Prefer Alternative Term(s); Select Alternative term to represent Concept; Can Concept be expressed combining terms? (Yes/No); Consider Each of These Terms; Establish Term Denoting Concept; Admit New Term Into Thesaurus; Assign Terms to Document; Is There Another Concept? (Yes/No); End.]

Adapted from ISO 5963, p. 5
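A heavily simplified sketch of the manual indexing loop follows. It keeps only three of the flowchart's decisions (does the thesaurus contain a term, can the concept be expressed by combining terms, otherwise admit a new term) and leaves out the suitability and broader/narrower-term checks. The function name and data shapes are assumptions and are not taken from ISO 5963.

```python
def index_document(concepts, lead_in, combinations):
    """Assign thesaurus terms to one document, concept by concept.

    lead_in maps entry terms to preferred terms; combinations maps a concept
    to a list of existing terms that together express it. Concepts covered by
    neither are admitted as new terms (lead_in is updated in place).
    """
    assigned, new_terms = [], []
    for concept in concepts:
        key = concept.lower()
        if key in lead_in:                 # thesaurus contains a term
            terms = [lead_in[key]]
        elif key in combinations:          # express by combining terms
            terms = combinations[key]
        else:                              # admit new term into thesaurus
            terms = [key]
            new_terms.append(key)
            lead_in[key] = key
        for term in terms:
            if term not in assigned:
                assigned.append(term)
    return assigned, new_terms

# Hypothetical vocabulary and concepts.
lead_in = {"cars": "cars", "automobiles": "cars", "safety": "safety"}
combinations = {"car safety": ["cars", "safety"]}
assigned, new_terms = index_document(
    ["Automobiles", "Car safety", "Airbags"], lead_in, combinations)
```

Terms admitted during indexing, such as `"airbags"` in this run, are natural inputs to the next scheduled revision of the thesaurus.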


Thesaurus Revision and Updates

• There will always be new concepts, products, or expressions that need to be added to the thesaurus.
– Set a regular schedule of reviews and revisions.
– Collect complaints, problems, etc. and fold them into the revision of the thesaurus.


References

• Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.
• Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.
• Standards:
– ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
– ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval
– ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
– ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri