Module 6b: Creating Controlled Vocabularies IMT530: Organization of Information Resources Winter 2007 Michael Crandall

Post on 22-Dec-2015

Steps in Constructing CVs

• Define your domain

• Gather concepts

– From user interviews, search logs, content analysis, preexisting vocabularies

• Select your approach

• Extract terminology

• Control your terms

• Organize your terms

• Maintain, maintain, maintain

Elements of Building CVs

• Select your approach

– Pre- or post-coordinated (sixteenth century lute music or sixteenth century and lutes and music)

– Open or closed (indexers can add terms or not)

– Enumeration vs. synthesis (facets)

• Extract terms

– Warrant (from users or domain or both)

• Control terms

– Specificity (cats or Siamese cats?)

– Control of homographs (qualifications)

– Term consistency and word form (plurals, etc.)

– Multiword/phrase sequence and form (inverted, normal form?)

– Term definitions (scope notes)

– Syntax (citation order)

– Semantic factoring

• Organize terms

– Semantic relationships

Different Approaches

Pre- and Post-Coordination

• Pre-coordination involves creating terms that combine multiple concepts (not words) into a single term

• Post-coordination involves creating terms that contain single concepts only; the concepts are combined at search time rather than when the term is created

• Some authors refer to this as “combination”, and say “pre-combined” and “post-combined”

Single or Multiple Concept?

• Is “information retrieval systems” a single concept or a multiple concept?

• Multiple concepts are often joined with a conjunction (and, or) or a preposition (in, of)

• Multiple concepts are often expressed as subdivisions, which may be marked by a dash (--) or a comma (,)

• Bottom line is, it’s hard to tell in some cases

Examples

Post-Coordinated Terms

Animal nutrition

Effects

Salt

Pre-Coordinated Term

Effects of salt on animal nutrition
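
The contrast in this example can be sketched in a few lines of Python. This is only an illustration: the document ids and the `search_all` helper are hypothetical and do not come from any real indexing system. In the post-coordinated index each single concept has its own entry and the concepts are combined at search time; in the pre-coordinated index the combination was made when the heading was created.

```python
# Post-coordinated index: one entry per single concept.
# The document ids ("doc1", ...) are made up for this illustration.
post_index = {
    "Animal nutrition": {"doc1", "doc2"},
    "Effects": {"doc1", "doc3"},
    "Salt": {"doc1", "doc4"},
}

# Pre-coordinated index: the concepts were combined when the heading was created.
pre_index = {
    "Effects of salt on animal nutrition": {"doc1"},
}


def search_all(index, terms):
    """Return the documents indexed under every term (a Boolean AND at search time)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Post-coordination: the searcher combines three single-concept terms.
post_hits = search_all(post_index, ["Animal nutrition", "Effects", "Salt"])

# Pre-coordination: the searcher selects one heading that already combines them.
pre_hits = pre_index["Effects of salt on animal nutrition"]

print(post_hits, pre_hits)
```

Both routes reach the same document here; the difference is where the work of combining concepts happens.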

More Examples

• More pre-coordinate terms

– France – Textile industries – Skilled Personnel – Training (PRECIS)

– Plants – Nutrition – Genetic aspects (LCSH)

• Pre-coordinate terms often have subdivisions (the words that appear after the hyphens above)

Advantages of Pre-Coordination

• All of the concepts that may apply to indexing a single document may appear in a single term

• Multiple concepts have the context and meaning embedded in syntactic order and constructions

– they may make more sense

– they are more precise

– different syntax means that concepts with different meanings can be represented using the same simple concepts, e.g.:

Art by children        Music industry

Art for children       Music for industry

Art about children     Industrial uses of music

Children and art       Music about industry

Children in art

Advantages of Pre-coordination

• More terms are available for indexers to use to express the subjects of documents

• A multiple-concept search returns a list of terms to select from (not a list of document representations containing those words)

• Thus, a user can browse all the topics to get an overview of what is available

Sample Display

Accidents -- 29 Related LC Subjects

• Accidents

• Accidents Aeronautics Military United States

• Accidents Aeronautics Statistics Periodicals

• Accidents Aerosols

• Accidents Agricultural Laborers United States

• Accidents Agriculture

• Accidents Agriculture Abstracts

• Accidents Agriculture Bibliography

• Accidents Agriculture Research United States
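
A browse display like this one can be approximated with a simple filter over the heading list. The sketch below is not how any particular catalog works: the `browse` function is hypothetical, and it matches only on the first concept of each heading, which is also the only access point a manual system offers.

```python
# A few pre-coordinated headings copied from the sample display, plus one
# made-up heading ("Bridges") to show that unrelated headings are excluded.
headings = [
    "Accidents Agriculture Abstracts",
    "Accidents",
    "Bridges",
    "Accidents Aerosols",
    "Accidents Aeronautics Military United States",
    "Accidents Agriculture",
]


def browse(headings, lead_term):
    """Return, in alphabetical order, every heading whose first concept is lead_term."""
    return sorted(
        h for h in headings
        if h == lead_term or h.startswith(lead_term + " ")
    )


for heading in browse(headings, "Accidents"):
    print(heading)
```

The user sees a list of topics to choose from rather than a list of documents, which is the overview the preceding slide describes.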

Disadvantages of Pre-Coordination

• Must create more terms, more costly to create

• Often, complex rules for combination are needed to create pre-coordinated terms. As a result, the cost of the CV increases, training for CV designers is longer and more difficult, and the possibility of error increases

• Makes for a long term list

Disadvantages of Pre-Coordination

• Long strings of terms may not be interpretable by users

– Ethnic groups – young people – ethnic identity – psychotherapy – cultural aspects

• In manual systems, access is limited to the first concept listed; only with online keyword access are the other embedded concepts accessible.

Advantages of Post-coordination

• Vocabulary is short, because each concept is only represented once

• Rules for creation of terms are often simpler

• Simple, thus easier to construct, thus less costly

• Terms are shorter and easier to read and understand

• In a manual system, individual concepts may be more accessible

Disadvantages of Post-coordination

• Does not allow for subtle distinctions in meaning

– Art and children vs. children in art vs. art by children

– Music in industry vs. industry in music vs. music industry

• May have to assign a lot of headings for a single document, thus relying on searching mechanisms to put them together
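
The loss of distinctions can be made concrete with a small sketch. The mapping from each heading to its single concepts is an assumption made by hand for this illustration, and `collapse_to_concepts` is a hypothetical helper: once the syntax is stripped away, headings with different meanings reduce to the same set of post-coordinated terms.

```python
# Pre-coordinated headings from the slides, each mapped by hand to the single
# concepts it combines (the mapping itself is an assumption for illustration).
headings = {
    "Art by children": {"Art", "Children"},
    "Children in art": {"Art", "Children"},
    "Art for children": {"Art", "Children"},
    "Music in industry": {"Music", "Industry"},
    "Industry in music": {"Music", "Industry"},
}


def collapse_to_concepts(headings):
    """Group headings that reduce to the same set of single concepts."""
    groups = {}
    for heading, concepts in headings.items():
        groups.setdefault(frozenset(concepts), []).append(heading)
    return groups


groups = collapse_to_concepts(headings)
for concepts, members in groups.items():
    print(sorted(concepts), "<-", members)
```

Five headings with five different meanings end up as only two distinguishable term sets.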

Disadvantages of Post-Coordination

• A multiple-concept search frequently results in a list of document representations with those words in them; these results are not grouped according to similarity, but are often listed in a random order.

• The results list of document representations does not give the user an overview of the subject area covered by the words entered in the search

Sample Display

RESULTS OF BOOLEAN “AND” SEARCH ON natural AND disaster

• Mapping vulnerability : disasters, development, and people / edited by Greg Bankoff, Georg Frerks, D 

• Understanding the economic and financial impacts of natural disasters / Charlotte Benson, Edward J.  

• Cultures of disaster : society and natural hazards in the Philippines / Greg Bankoff 

• Malaria control during mass population movements and natural disasters / Peter B. Bloland and Holly  

• Hurricane! : coping with disaster : progress and challenges since Galveston, 1900 / Robert Simpson,  

• The use of earth observing satellites for hazard support [electronic resource] : assessments & scena 

• The vulnerability of cities : natural disasters and social resilience / Mark Pelling 

Open and Closed Controlled Vocabularies

• An open vocabulary is one in which an indexer may add a term at any time if they need it; this is pretty rare in traditional indexing (but common in folksonomies)

• A closed vocabulary is one in which an indexer may not add terms. Term additions are controlled by the creators of the CV, not by indexers
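
The open/closed distinction amounts to a single policy decision about who may add terms. The class below is a minimal sketch under that reading; the `Vocabulary` class and its `open_to_indexers` flag are hypothetical names invented for this example.

```python
class Vocabulary:
    """A toy controlled vocabulary; `open_to_indexers` is a hypothetical flag."""

    def __init__(self, terms, open_to_indexers=False):
        self.terms = set(terms)
        self.open_to_indexers = open_to_indexers

    def assign(self, term):
        """Return the term for indexing, adding it first only if the vocabulary is open."""
        if term in self.terms:
            return term
        if self.open_to_indexers:
            # Open vocabulary: the indexer may add the term on the spot.
            self.terms.add(term)
            return term
        # Closed vocabulary: only the CV creators may add terms.
        raise KeyError(f"{term!r} is not in the vocabulary; ask the CV creators to add it")


closed_cv = Vocabulary({"bowls", "chairs"})
open_cv = Vocabulary({"bowls", "chairs"}, open_to_indexers=True)

open_cv.assign("bookshelves")      # accepted and added to the open vocabulary
try:
    closed_cv.assign("bookshelves")
except KeyError as error:
    print("rejected:", error)
```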

Synthesis and Enumeration

• The synthesis & enumeration attribute has to do with how a controlled vocabulary is set up to operate and with where and at what point term creation happens

• The creation of terms may be restricted to the CV designer in some cases; in other cases, indexers have some flexibility in creating new terms by using a technique called “synthesis”

Enumeration

• An enumerated vocabulary is simply a list of terms. Indexers look at the list, select a term, and use it for indexing

• If no term is present to index a particular document, the indexer either has to ask the CV designer to add one or is stuck

• Many enumerated vocabularies are also closed vocabularies

• Enumerative vocabularies came first in history – it probably didn’t occur to anyone that there could be any other way!

Example of Enumeration

• Sample list of enumerated terms:

– bowls

– plastic bowls

– wood bowls

– wood chairs

– steel chairs

– wood bookshelves

– steel bookshelves

• Note that if an indexer had a document on “steel bowls”, that term is not available. The indexer using this vocabulary would have to either assign “bowls” (not specific) or ask the CV designer to add the term “steel bowls”
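
The indexer's situation with this list can be sketched as a lookup with a fallback. The `choose_term` function and its broadening rule (drop leading words until a listed term is found) are assumptions made for this illustration, not a standard indexing procedure.

```python
# The enumerated list from the slide.
ENUMERATED_TERMS = [
    "bowls", "plastic bowls", "wood bowls", "wood chairs",
    "steel chairs", "wood bookshelves", "steel bookshelves",
]


def choose_term(wanted, terms=ENUMERATED_TERMS):
    """Return (term, exact): the wanted term if listed, else a broader listed term, else None."""
    if wanted in terms:
        return wanted, True
    # Naive broadening: drop leading words until a listed term is found.
    words = wanted.split()
    for start in range(1, len(words)):
        broader = " ".join(words[start:])
        if broader in terms:
            return broader, False
    # Nothing usable is listed; the indexer must ask the CV designer for a new term.
    return None, False


print(choose_term("wood bowls"))
print(choose_term("steel bowls"))
print(choose_term("steel cabinets"))
```

For “steel bowls” the only listed option is the less specific “bowls”, and for a term with no listed broader form the indexer has nothing to assign.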

Advantages of Enumeration

• Enumerated vocabularies are often easy to use because there are fewer rules for indexers (just look up your term, write it down, and move on!)

• All possible terms appear in the vocabulary, so it is easy to search and display all possible terms

Disadvantages of Enumeration

• Some terms are not available for the indexer to use; some combinations simply are not there

• List of terms may become very long (the Library of Congress Classification, a highly enumerated classification scheme, has 46 volumes!)

• Terms may be repeated over and over

– Wood bowls

– Wood chairs

– Wood bookshelves

– Wood cabinets

– Wood structures

Synthesis

• Synthesis is a technique developed in the 20th century as a means of saving space and time in CV creation, and of extending flexibility to the indexer

• In a synthetic system, tables containing single terms are created by the CV designer and indexers follow rules to combine the terms from different tables to create a new term

• We’ll look at this in more detail in a couple of weeks when we discuss faceted classification
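
A synthetic system can be sketched with two small tables and one combination rule. The table contents reuse the materials and objects from the enumeration example; the "material then object" rule and the `synthesize` function are assumptions made for this illustration.

```python
from itertools import product

# Two tables of single terms maintained by the CV designer.
MATERIALS = ["plastic", "steel", "wood"]
OBJECTS = ["bowls", "chairs", "bookshelves"]


def synthesize(material, obj):
    """Combine one term from each table; the 'material then object' order is an assumed rule."""
    if material not in MATERIALS or obj not in OBJECTS:
        raise ValueError("both parts must come from the designer's tables")
    return f"{material} {obj}"


# The indexer builds "steel bowls" without asking the designer to add it.
print(synthesize("steel", "bowls"))

# Every combination the rule allows, compared with the number of table entries.
all_terms = [synthesize(m, o) for m, o in product(MATERIALS, OBJECTS)]
print(len(MATERIALS) + len(OBJECTS), "table entries yield", len(all_terms), "terms")
```

The designer maintains only the table entries, while an enumerated vocabulary covering the same ground would have to list every combination separately.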

Synthesis and Enumeration vs. Pre- & Post-coordination

• The relationship between enumeration & synthesis and pre- & post- coordination is not one-to-one!

• Some enumerated vocabularies are pre-coordinate; others are post-coordinate

• Most synthetic vocabularies are pre-coordinate, but it is possible for a synthetic vocabulary to be post-coordinate, particularly where it is exposed to end users

– Where indexers assign terms from facets, the user has no control over coordination, but where a user can select and combine facets, it’s post-coordinate

Synthesis and Enumeration vs. Open and Closed Vocabularies

• All synthetic systems are open to a limited extent because indexers may combine simple terms to create new longer terms; they are closed in the sense that indexers may not add new terms to the tables

• Synthetic systems are completely open if an indexer is allowed to add terms to the tables, add new tables, and add new rules for term synthesis

Term Control and Semantic Relationships

• Next week

Questions?

• If not, take a break!!!

Exercise 6b

• Today we are starting a three-week exercise in building controlled vocabularies

• The first step is extracting concepts from a defined domain

• Form your groups and work through the two parts of exercise 6b

• Turn in your concept lists, but keep a copy to use in next week’s exercises

Lots to Digest

• More to come next week

• Keep in mind that this detail is what makes a vocabulary possible to maintain over time

– The rules are what make or break a good system