INLS 520 – Fall 2007 Erik Mitchell INLS 520 Information Organization

Page 1: INLS 520

INLS 520

Information Organization

Page 2: INLS 520

Review

• Last week
  – Types of categorization & classification structures
• Classification
  – Definitions
  – Look at Library classification systems for Dewey & Library of Congress

Page 3: INLS 520

Today

• Controlled vocabularies
  – Types
  – Basic concepts
• Related technologies
  – Metadata standards
  – Example Systems
• Knowledge organization systems
  – Term Lists, Thesauri, Taxonomies, Ontologies

Page 4: INLS 520

Concepts & definitions

• Controlled Vocabularies
  – “organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.” (Warner via Leise, Fast)
  – “the primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval” (ANSI Z39.19)
• Knowledge organization systems
  – “tools that present the organized interpretation of knowledge structures” (Hjørland)
  – “classification schemes that organize materials at a general level…, subject headings that provide more detailed access, and authority files that control variant versions of key information” (Hodge)
  – “It depends upon what the meaning of the word 'is' is.” (Clinton)

Page 5: INLS 520

Uses of controlled vocabulary (1)

• Define scope, content, and context of information

• Navigation, breadcrumbs

• Map to user terminology

• Enhance browsing, searching

• Term consistency and relationships

Page 6: INLS 520

Functions of a CV

• Removes ambiguity
  – Synonyms, homonyms, polysemes
• Defines relationships
  – Equivalence, hierarchical, associative (BT, NT, RT, CR), reciprocity
• Provides context
  – Category, scope, qualifiers, modifiers, scope notes
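
The reciprocity point can be made concrete with a minimal sketch. The `Thesaurus` class, its method names, and the sample terms are hypothetical, chosen only for illustration: recording a BT link on one term also records the NT link on the other, and RT links are stored in both directions.

```python
from collections import defaultdict

# Reciprocal relationship types: recording one side implies the other.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}


class Thesaurus:
    """Toy thesaurus (hypothetical) that keeps BT/NT/RT links reciprocal."""

    def __init__(self):
        # term -> relationship type -> set of related terms
        self.links = defaultdict(lambda: defaultdict(set))

    def relate(self, term, rel, other):
        """Record `term rel other` together with the matching reciprocal link."""
        if rel not in RECIPROCAL:
            raise ValueError(f"unknown relationship type: {rel}")
        self.links[term][rel].add(other)
        self.links[other][RECIPROCAL[rel]].add(term)

    def related(self, term, rel):
        """Return the sorted terms linked to `term` by `rel`."""
        return sorted(self.links[term][rel])


# Illustrative terms only; not taken from any published thesaurus.
thesaurus = Thesaurus()
thesaurus.relate("Thesauri", "BT", "Controlled vocabularies")
thesaurus.relate("Taxonomies", "BT", "Controlled vocabularies")
thesaurus.relate("Thesauri", "RT", "Ontologies")

print(thesaurus.related("Controlled vocabularies", "NT"))  # ['Taxonomies', 'Thesauri']
print(thesaurus.related("Ontologies", "RT"))               # ['Thesauri']
```

Storing both directions at write time is one possible design; deriving the reciprocal link at query time would work as well.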

Page 7: INLS 520

Types of Controlled Vocabularies

• Term Lists
  – Glossaries, Dictionaries, Gazetteers, Folksonomies
• Synonym rings
  – Z39.19 example
  – Oracle Text
• Taxonomies
  – Website navigation scheme
• Thesauri / Ontologies
  – Authority files, subject thesauri, topic maps

Page 8: INLS 520

A conceptual map

http://www.taxotips.com/

Page 9: INLS 520

CV Concepts

• Content Analysis
  – Ambiguity
  – Synonymy
  – Exhaustivity
  – Specificity
  – Co-extensivity
  – Aboutness
  – Semantic structure
  – Warrant (User, Literary, Organization)
• Form Analysis
  – Linguistics
  – Grammar
  – Semiotics
  – Single / Multiple terms
• Indexing & Retrieval
  – Pre vs. Post Coordinate
  – Recall vs. Precision
  – Natural language processing (NLP)

Page 10: INLS 520

Content Analysis (1)

• Ambiguity
  – Each term should relate to a single concept
• Synonymy
  – Each concept should be identified by a single entry
• Specificity
  – Using the most specific word or phrase expressing the subject
• Exhaustivity
  – The extent to which the entire document is indexed (Summarization, depth)
• Co-extensivity
  – “Assign as many terms as needed to bring out the main theme, and according to guidelines sub-themes.” (p. 29, Lancaster)
  – “nothing more, nothing less”
• Semantic Structure
  – Terms can be related with equivalence, hierarchy, or associative relationships (Use, See, NT, BT, RT)

Page 11: INLS 520

Content Analysis (2)

• Aboutness = Subject/topic?
  – Wilson (1968)
    • Author intent, topicality, relationship to other resources, textual analysis
  – Fairthorne (1969)
    • Intentional aboutness (author), extensional aboutness (document)
  – Maron (1977)
    • objective about (document), subjective about (user), and retrieval about (information retrieval)
  – Hjørland (2001)
    • “Closely related to theories of meaning, interpretation, and epistemology”

Page 12: INLS 520

Content Analysis (3)

• Wilson’s criteria for evaluating aboutness (1968)
  – Identify author’s purpose (intent)
  – Weigh the predominant topics, elements (topical analysis)
  – Group/count a document’s use of concepts and references (bibliometrics)
  – Identify essential elements (text analysis)

Page 13: INLS 520

Content Analysis (4)

• Literary Warrant
  – “The inclusion of a vocabulary term in a controlled vocabulary based on its appearance in one or more content items. For example, a medical text may use the term ‘oncology.’ Based on literary warrant, that term would be included in the controlled vocabulary even though the general public uses the term ‘cancer.’” (Glosso-Thesaurus)
• User Warrant
  – “The inclusion of a vocabulary term in a controlled vocabulary based on use by users. Such terms can be identified through search log analysis or free listing.” (Glosso-Thesaurus)
• Organizational Warrant
  – “Justification for the...selection of a preferred term due to the characteristics and context of the organization using the resource” (ANSI Z39.19)
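
Search log analysis for user warrant can be sketched in a few lines. This is a minimal illustration, assuming a log that is simply a list of query strings; the function name `candidate_terms`, the sample log, and the sample vocabulary are all hypothetical. It counts queries and reports the frequent ones that the vocabulary does not yet contain.

```python
from collections import Counter


def candidate_terms(search_log, vocabulary, min_count=2):
    """Return (query, count) pairs seen at least `min_count` times but absent from the vocabulary."""
    known = {term.lower() for term in vocabulary}
    counts = Counter(query.strip().lower() for query in search_log)
    return [(q, n) for q, n in counts.most_common() if n >= min_count and q not in known]


# Hypothetical data: users type "cancer" while the vocabulary prefers "Oncology".
log = ["cancer", "Cancer", "oncology", "cancer ", "heart attack", "heart attack", "aconitase"]
cv = ["Oncology", "Myocardial infarction"]

print(candidate_terms(log, cv))  # [('cancer', 3), ('heart attack', 2)]
```

Terms surfaced this way are only candidates; whether they become preferred terms or entry terms pointing to an existing one is an editorial decision.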

Page 14: INLS 520

Form Analysis

– Linguistics
  • Syntax/Form (grammar)
  • Morphology (internal word structure)
  • Semantics (meaning)
  • Pragmatics, discourse analysis (word/phrase use)
– Semiotics
  • Study of signs/symbols
– Lexical structure
  • Document layout, markup, tags (think DOM)

Page 15: INLS 520

Indexing & Retrieval

• Pre/Post-Coordinate
  • Pre-coordinate: organization prior to retrieval
  • Post-coordinate: organization at the point of retrieval
• Recall / Precision
  • Recall: number of relevant docs retrieved / total number of relevant docs in collection
  • Precision: number of relevant docs retrieved / total number of docs retrieved
• Natural language processing
  • Uses semantics and syntax to automatically distill ‘aboutness’
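
As a worked illustration of the two measures under their standard definitions (recall is the share of relevant documents that were retrieved; precision is the share of retrieved documents that are relevant), here is a minimal sketch. The function name and the document ids are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Compute (recall, precision) from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 28 documents come back (ids 1..28);
# 25 documents are relevant, 20 of which were retrieved and 5 missed.
retrieved = range(1, 29)
relevant = list(range(1, 21)) + [101, 102, 103, 104, 105]

recall, precision = recall_precision(retrieved, relevant)
print(round(recall, 2), round(precision, 2))  # 0.8 0.71
```

Both measures share the same numerator; only the denominator changes, which is why broadening a search tends to raise recall while lowering precision.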

Page 16: INLS 520

Recall & Precision

• A collection of 100 documents
• Searches
  – “Vocabularies”
    • Recall 100/100 = 1
    • Precision 100/100 = 1
  – “Facet”
    • Recall 20/100 = .2
    • Precision 20/28 = .71
  – “OWL”
    • Recall 1/100 = .01
    • Precision 1/1 = 1

CV Entry                  # of docs
Controlled Vocabularies   100
Faceted analysis          20
Ontologies                5
OWL                       1
RDF                       3

Recall = # of relevant docs retrieved / total # of relevant docs in collection (the recall figures above use 100, the whole collection, as the relevant set)

Precision = # of relevant docs retrieved / total # of docs retrieved

Page 17: INLS 520

Term List Examples

• Authority files: map to preferred terms
  – Library of Congress
  – Encoded Archival Context
  – Union List of Artist Names
• Glossaries/Dictionaries: words & definitions, sometimes topic focused
  – Glosso-Thesaurus
• Folksonomies
  – Contextualization, Trend discovery, Personal Information
• Synonym rings: used for back-end equivalence in searching
  – Princeton WordNet
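
Back-end equivalence with a synonym ring amounts to query expansion: a query on any member of a ring is run against every member. The sketch below assumes a tiny in-memory index; the rings, the index contents, and the function names are hypothetical and do not reflect how WordNet or any particular search engine implements this.

```python
# Hypothetical synonym rings: every term in a ring is treated as equivalent.
SYNONYM_RINGS = [
    {"heart attack", "myocardial infarction", "mi"},
    {"cancer", "oncology", "neoplasms"},
]

# Hypothetical index: term -> ids of documents tagged with that term.
INDEX = {
    "heart attack": {1, 2},
    "myocardial infarction": {2, 3},
    "cancer": {4},
}


def expand(query):
    """Return the query plus every term that shares a synonym ring with it."""
    query = query.lower()
    terms = {query}
    for ring in SYNONYM_RINGS:
        if query in ring:
            terms |= ring
    return terms


def search(query):
    """Retrieve documents matching the query or any ring-equivalent term."""
    docs = set()
    for term in expand(query):
        docs |= INDEX.get(term, set())
    return sorted(docs)


print(search("Heart attack"))  # [1, 2, 3]
print(search("oncology"))      # [4]
```

Unlike an authority file, a ring has no preferred term: all members are interchangeable at search time.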

Page 18: INLS 520

Thesauri & taxonomy examples

• List of vocabularies
  – http://www.slais.ubc.ca/resources/indexing/database1.htm
  – Taxonomy warehouse
• Two Examples
  – Health & Ageing Thesaurus
  – Thesaurus of Geographic Names

Page 19: INLS 520

Interoperable system example

• NCBI Entrez
  – 35 databases using interoperable controlled vocabulary systems to provide rich meta-searching
• Cross-database discovery – search for “heart attack”
• Cross-database linking – search for aconitase, follow the “other links” tab.

Page 20: INLS 520

Vocabulary and Classification systems - exercise

• Organization structures
  – Term Lists / Enumerative systems
  – Hierarchies
  – Trees
  – Paradigms
  – Facets / Associative relationships
  – Folksonomies
• Break into groups, discuss & list
  – Goal
  – Structure
  – Issues
  – Benefits
• Resources
  – Kwasnik, Boxes & Arrows

Page 21: INLS 520

Choosing a framework

• Use questions
  – Who is your user, what are their needs?
  – What systems are your users familiar with?
  – Will this system be internal/external?
• Content questions
  – How extensive, defined is the information?
  – Is your subject matter static or fluid?
  – What organizational framework best describes your content?
• System Questions
  – What access are you trying to provide?
  – What external pressures exist?
  – What external entities/theories will interact with this system?

Page 22: INLS 520

Interoperability issues

• Similarity of subject matter in domains

• Multiple CVs accepted in a domain

• Specificity/granularity of content indexing

• Use of synonyms, warrant

• Intended use, purpose of system

Page 23: INLS 520

Creating a CV (1)

• Design methods
  – Re-use existing, start with content & desired use ideas
  – Committee / community approach
• Top-down
  – Concept driven
• Bottom-up
  – Document driven
  – Empirical approach
• Deductive approach
  – Select terms, create relationships, perform term control
• Inductive approach
  – Establish CV at outset, build hierarchies on an as-needed basis

Page 24: INLS 520

Creating a CV (2)

• Top-Down
  – Identify audience
  – Identify all topics, concepts, uses, and context of the domain
  – Sort topics identified into an appropriate organization scheme (enumerative, hierarchical, faceted)
  – Solidify structure and clean up gaps & redundancies
  – Assign documents to categories, test retrieval
• Bottom-up
  – Identify audience
  – Survey documents for topics/concepts
  – Build system on the fly – let content drive structure and limits of system
  – Identify gaps & redundancies in system
  – Test retrieval

Page 25: INLS 520

Creating a CV (3)

• Think about scope, use, content, maintenance
• Gather Terms
  – Based on existing systems, content
  – Based on user needs/expectations
  – Investigate issues of specificity, exhaustivity, granularity
• Build hierarchies, relationships
  – Broader/narrower terms, Related terms, Use/Use for, see/see also
• Establish Rules
• Implement
• Evaluate
• Maintain

http://www.boxesandarrows.com/view/creating_a_controlled_vocabulary

Page 26: INLS 520

Evaluating a CV

• Goals
  • Determine if the CV solves retrieval needs of user/system
  • Determine if CV matches user’s content model/term expectations
• Methods
  • Expert evaluation of CV
  • User-based card sorting compared to actual CV
  • Identification of non-included documents
  • Analysis of use of system - HCI
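
One way to compare card-sort results against the actual CV is to measure, per term, how often participants placed it in the category the CV assigns. The sketch below is only one possible scoring, under the assumption that each participant's sort is recorded as a term-to-category mapping; the function name, categories, and data are hypothetical.

```python
def card_sort_agreement(cv_categories, user_sorts):
    """Return, per term, the fraction of participants who chose the CV's category."""
    agreement = {}
    for term, cv_category in cv_categories.items():
        # Only count participants who actually sorted this term.
        placements = [sort[term] for sort in user_sorts if term in sort]
        if placements:
            agreement[term] = sum(p == cv_category for p in placements) / len(placements)
    return agreement


# Hypothetical CV assignments and three participants' card sorts.
cv_categories = {
    "Glossaries": "Term lists",
    "Thesauri": "Relationship lists",
    "Folksonomies": "Term lists",
}
user_sorts = [
    {"Glossaries": "Term lists", "Thesauri": "Relationship lists", "Folksonomies": "Tagging"},
    {"Glossaries": "Term lists", "Thesauri": "Term lists", "Folksonomies": "Tagging"},
    {"Glossaries": "Term lists", "Thesauri": "Relationship lists"},
]

scores = card_sort_agreement(cv_categories, user_sorts)
low_agreement = [term for term, score in scores.items() if score < 0.5]
print(low_agreement)  # ['Folksonomies']
```

Terms with low agreement are the ones where the CV's structure diverges from users' expectations and may need a different placement or an additional entry term.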

Page 27: INLS 520

CV Maintenance

• Primary responsibility
  – Editor, board, committee
• New terms
  – Is it really new or a different view?
  – What is the proper form & placement?
• Modified terms
  – Include a change log
  – Use a “USE” reference to point to new term
• Deleted terms
  – Unused / Overused terms
  – May want to keep for historical retrieval purposes
• Modification history
  – Use modification notes, date/time stamps
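
The maintenance practices above (change log, “USE” references, retained old terms, time stamps) can be sketched as a small data structure. The `Term` and `Vocabulary` classes, their fields, and the sample terms are hypothetical, offered as one possible layout rather than a prescribed one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Term:
    label: str
    use: Optional[str] = None                    # "USE" reference to the replacement term
    history: list = field(default_factory=list)  # (timestamp, note) pairs


class Vocabulary:
    """Toy vocabulary (hypothetical) with a per-term modification history."""

    def __init__(self):
        self.terms = {}

    def add(self, label, note="added"):
        term = Term(label)
        term.history.append((datetime.now(timezone.utc), note))
        self.terms[label] = term
        return term

    def replace(self, old, new, note):
        """Deprecate `old` in favour of `new`, keeping `old` for historical retrieval."""
        if new not in self.terms:
            self.add(new, note=f"added to replace {old!r}")
        old_term = self.terms[old]
        old_term.use = new
        old_term.history.append((datetime.now(timezone.utc), note))

    def resolve(self, label):
        """Follow USE references until a current term is reached."""
        seen = set()
        while self.terms[label].use is not None:
            if label in seen:
                raise ValueError(f"circular USE reference at {label!r}")
            seen.add(label)
            label = self.terms[label].use
        return label


cv = Vocabulary()
cv.add("Folk taxonomies")
cv.replace("Folk taxonomies", "Folksonomies", note="renamed to match user terminology")

print(cv.resolve("Folk taxonomies"))             # Folksonomies
print(len(cv.terms["Folk taxonomies"].history))  # 2
```

Keeping the deprecated term in place with a USE pointer, instead of deleting it, means documents indexed under the old label remain retrievable.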

Page 28: INLS 520

Class exercise

• Protégé overview
  – Orientation
  – Object types (Classes, Slots, Instances)
  – Relationships (hierarchies, associative)
• Replication of the Glosso-Thesaurus
  – Visit the Boxes & Arrows Glosso-Thesaurus
  – Look at the data there and come up with a structure in Protégé that allows replication of the thesaurus
  – Some issues to consider are:
    • Do you want terms to be classes or instances?
    • What is the easiest way to show the relationships (broader term, narrower term, etc.)?
    • Do you need to allow multiple relationships for a given type (BT, RT, etc.)?
    • If you have multiple classes, at what level should you create the slots?

Page 29: INLS 520

Next Week

• More on Knowledge organization systems
  – Taxonomies, Ontologies
  – More work with Protégé