A Theory of Retrieval Using Structured Vocabularies

(SKOS: Preparation for Standardization)

Alistair Miles, CCLRC Rutherford Appleton Laboratory

NKOS Workshop, September 2006, Alicante

purl.org/net/retrieval



What Am I Presenting?

● A formal theory of retrieval using structured vocabularies.
● The main body of my master's dissertation, which is entitled “Retrieval and the Semantic Web”.
● N.B. This presentation is intended to give an overview; for the full text go to purl.org/net/retrieval

Why?

● How do you maximize the utility and minimize the cost of vocabulary control ... ?
● Support standardization initiatives ...
– SKOS to W3C Recommendation,
– BS 8723 parts 3, 4 and 5.
● Check our working assumptions!
● See also “SKOS: Requirements for Standardization”, to be presented at DC 2006.

How?

● Use a formal notation (“Z”) to express underlying ideas with mathematical precision.
● Support formal specification with explanatory prose.
● N.B. This presentation is strictly informal!

Overview of the Theory

● Foundations (Chapter 3)
● Composite Queries (Chapter 4)
● Limited Cost Expansion (Chapter 5)
● Coordination (Chapter 6)
● Translation (Chapter 7)

General Scenario (1)

[Diagram with labels: Controlled Structured Vocabulary, Collection, Index, Query, evaluation, Results.]


General Scenario (2)

[Diagram with labels: Vocabulary A, Index, Query, ???, Vocabulary B.]

Lightning Tour (1) – Foundations

● Structured vocabulary.
● Index.
● Atomic query.
● Direct evaluation (of atomic queries).
● Naïve expansion (of an index).

Lightning Tour (2) – Composite Queries

● Query expressions ...
– “and”, “or”, “not”, “required-optional-prohibited”.
● Composition and decomposition of expressions.
● Direct evaluation (of composite queries).
● Naïve expansion (of composite queries).
● Scoring and ranking of results.

Lightning Tour (3) – Limited Cost Expansion

● Beyond naïve expansion.
● Approximating the numerical “relevance cost” of expansion.
● Limited cost expansion (of an index or query).
● Expansion weight and result scoring.

Lightning Tour (4) – Coordination

● Using vocabulary units in combination.
● Ordered and unordered coordination.
● Coordinated indexes and queries.
● Naïve expansion (of a coordinated index or query).
● Limited cost expansion (of a coordinated index or query).

Lightning Tour (5) – Translation

● Structural mapping.
● Query expression mapping.
● Naïve translation using a structural mapping.
● Naïve translation using a query expression mapping.
● Limited cost translation using a structural mapping.

Caveats

● Much of the prose was written in haste!
● I'm no mathematician or logician!
● My review of the literature is woefully incomplete!
● The chapter on RDF representations (Chapter 8) is rather incomplete and at best only suggestive!
● Use cases need further development.

Foundations

Foundations – The Conceptual Basis of Controlled Vocabularies (1)

● The fundamental purpose of a controlled vocabulary is to establish a set of distinct meanings or “concepts” and to provide a means of referring unambiguously to those concepts.

Foundations – The Conceptual Basis of Controlled Vocabularies (2)

● I have modelled this means of reference as a set of “names”, which I have called “concept names”.
● A controlled vocabulary provides a set of “concept names” which constitutes an artificial language for use in constructing an “index” (i.e. a controlled indexing language).

Foundations – Structure Relations (1)

● A controlled vocabulary may provide one or more binary relations on the set of concept names, which I refer to as “structure relations”.
● The structure relations of a controlled vocabulary together constitute the “structure graph”.

Foundations – Structure Relations (2)

● The theory considers only vocabularies that provide three structure relations, which I have called “broader”, “narrower” and “associated”.
● N.B. No attempt is made to define “broader”, “narrower” or “associated”!
● Their meaning is defined entirely in terms of operational assumptions that may be used to derive retrieval operations.
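As an informal illustration (the full text uses Z, not code), a structured vocabulary can be sketched in Python as a set of concept names plus three binary relations. The class name, the example concept names, and the choice to store “narrower” as the inverse of “broader” and “associated” symmetrically are assumptions made for this sketch; they are not taken from the theory.

```python
from dataclasses import dataclass, field


@dataclass
class Vocabulary:
    """A set of concept names plus three binary structure relations."""
    concept_names: set[str] = field(default_factory=set)
    # Each relation is a set of (from, to) pairs of concept names.
    # Here ("cats", "mammals") in `broader` reads "mammals is broader than cats".
    broader: set[tuple[str, str]] = field(default_factory=set)
    narrower: set[tuple[str, str]] = field(default_factory=set)
    associated: set[tuple[str, str]] = field(default_factory=set)

    def structure_graph(self) -> set[tuple[str, str, str]]:
        """All labelled edges of the three relations taken together."""
        return (
            {(x, "broader", y) for x, y in self.broader}
            | {(x, "narrower", y) for x, y in self.narrower}
            | {(x, "associated", y) for x, y in self.associated}
        )


# Hypothetical example data.
vocab = Vocabulary(
    concept_names={"animals", "mammals", "cats", "pets"},
    broader={("mammals", "animals"), ("cats", "mammals")},
    narrower={("animals", "mammals"), ("mammals", "cats")},
    associated={("cats", "pets"), ("pets", "cats")},
)
graph = vocab.structure_graph()
```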

Foundations – A Structure Graph

[Diagram: a structure graph over concept names (CNAME), with edges labelled B, N and A.]

Foundations – The Structure of an Index

● An “index” consists of one or more “fields”.
● A “field” is a binary relation between “document names” and “concept names”.
● (N.B. I use “document” to refer to any object we are interested in retrieving.)
● An index also provides a name for each field, so we can target particular fields in a query.

Foundations – A Field

[Diagram with labels: DNAME, CNAME.]

Foundations – Types of Index

● An index can have single or multiple fields.
● A field can be functional or relational.
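A minimal sketch of an index, assuming a field is stored as a set of (document name, concept name) pairs and an index as a mapping from field names to fields. The field names and documents are invented, and the reading of “functional” as “at most one concept name per document” is an assumption of this sketch.

```python
# A field is a binary relation: a set of (document name, concept name) pairs.
Field = set[tuple[str, str]]
# An index names each of its fields.
Index = dict[str, Field]

# Hypothetical example data.
index: Index = {
    "subject": {("doc1", "cats"), ("doc1", "pets"), ("doc2", "mammals")},
    "audience": {("doc1", "children")},
}


def is_functional(field: Field) -> bool:
    """True if no document is related to more than one concept name."""
    docs = [doc for doc, _ in field]
    return len(docs) == len(set(docs))


subject_is_functional = is_functional(index["subject"])
audience_is_functional = is_functional(index["audience"])
```

Here the “subject” field is relational, because doc1 carries two concept names, while the “audience” field is functional.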

Foundations – Atomic Queries

● An “atomic query expression” comprises a single field name and a single concept name.

Foundations – Direct Evaluation of Atomic Queries

[Diagram with labels: DNAME, CNAME, QUERY.]
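Direct evaluation of an atomic query can be sketched as a lookup in the named field. The names `AtomicQuery` and `evaluate_atomic` and the example data are hypothetical; the sketch only assumes that a field is a set of (document name, concept name) pairs.

```python
from typing import NamedTuple


class AtomicQuery(NamedTuple):
    field_name: str
    concept_name: str


def evaluate_atomic(index, query):
    """Names of documents indexed with the query's concept name in the query's field."""
    field = index.get(query.field_name, set())
    return {doc for doc, concept in field if concept == query.concept_name}


# Hypothetical example data.
index = {"subject": {("doc1", "cats"), ("doc2", "mammals"), ("doc3", "cats")}}
results = evaluate_atomic(index, AtomicQuery("subject", "cats"))
```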

Foundations – Naïve Assumption of Ideal Indexing

● All documents indexed with a given concept name in a given field are relevant to an atomic query for that concept name in that field.

[Diagram with labels: DNAME, CNAME, QUERY, All Relevant.]

Foundations – Naïve Assumption of Broadening Relevance

[Diagram with labels: DNAME, CNAME, QUERY, B, All Relevant (twice).]

Foundations – Naïve Expansion of a Field

[Diagram with labels: DNAME, CNAME, B.]

Foundations – Naïve Expansion

● By including in the result set further documents that are also relevant to the query, recall is increased at no cost to precision.
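Naïve expansion of a field might look like the sketch below. It assumes one reading of the broadening-relevance assumption: a document indexed with a concept name is also relevant to queries for every concept name reachable through “broader” links. The function names and data are hypothetical, and the full text should be consulted for the formal definition.

```python
def broader_closure(broader, concept):
    """All concept names reachable from `concept` by following broader links."""
    parents = {}
    for narrow, broad in broader:
        parents.setdefault(narrow, set()).add(broad)
    seen, stack = set(), [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_field(field, broader):
    """Add (document, broader concept) pairs for every pair in the field."""
    expanded = set(field)
    for doc, concept in field:
        expanded |= {(doc, b) for b in broader_closure(broader, concept)}
    return expanded


# Hypothetical example data: (narrower concept, broader concept) pairs.
broader = {("cats", "mammals"), ("mammals", "animals")}
field = {("doc1", "cats"), ("doc2", "mammals")}
expanded = expand_field(field, broader)
animal_docs = {doc for doc, concept in expanded if concept == "animals"}
```

Before expansion a query for “animals” returns nothing; after expansion it returns both documents.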

Foundations – Key Ideas

● Assumptions ...
– Naïve assumption of ideal indexing.
– Naïve assumption of broadening relevance.
● Operational definition for “broader/narrower”.
● Naïve expansion of an index to improve recall.
● N.B. This framework is probably sufficient to cover the majority of applications!

Composite Queries

Composite Queries – Query Expressions (1)

● Composite query expression – has one or more “component” (or “child”) query expressions.
● Four types of composite query expression ...
– and
– or
– not
– rop (“required-optional-prohibited”)

Composite Queries – Composition

● A child of a composite query expression can be an atomic expression or another composite expression.
● I.e. expressions can be arbitrarily nested.

Composite Queries – Direct Evaluation

● Results of “and” expression ... set intersection of results of child expressions.
● Results of “or” expression ... set union of results of child expressions.
● Results of “not” expression ... set complement of results of child expression.
● Results of “rop” expression ... set intersection of results of “required” children minus set union of results of “prohibited” children ... N.B. “optional” children are truly optional.

[Diagram with labels: QUERY, and, or, not.]
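The evaluation rules above can be sketched as a recursive function over nested expressions. The class names, the explicit `all_docs` universe used for the complement, and the treatment of a “rop” expression with no required children (it starts from all documents) are assumptions of this sketch.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Atom:
    field_name: str
    concept_name: str


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


@dataclass(frozen=True)
class Not:
    child: object


@dataclass(frozen=True)
class Rop:
    required: tuple = ()
    optional: tuple = ()  # does not change the result set
    prohibited: tuple = ()


def evaluate(expr, index, all_docs):
    """Directly evaluate an arbitrarily nested query expression to a set of document names."""
    if isinstance(expr, Atom):
        field = index.get(expr.field_name, set())
        return {d for d, c in field if c == expr.concept_name}
    if isinstance(expr, And):
        return reduce(set.intersection, (evaluate(c, index, all_docs) for c in expr.children))
    if isinstance(expr, Or):
        return set().union(*(evaluate(c, index, all_docs) for c in expr.children))
    if isinstance(expr, Not):
        return set(all_docs) - evaluate(expr.child, index, all_docs)
    if isinstance(expr, Rop):
        required = reduce(
            set.intersection,
            (evaluate(c, index, all_docs) for c in expr.required),
            set(all_docs),  # assumption: no required children means "all documents"
        )
        prohibited = set().union(*(evaluate(c, index, all_docs) for c in expr.prohibited))
        return required - prohibited
    raise TypeError(f"unknown expression: {expr!r}")


# Hypothetical example data.
index = {"subject": {("doc1", "cats"), ("doc2", "mammals"), ("doc3", "cats"), ("doc3", "pets")}}
all_docs = {"doc1", "doc2", "doc3"}
cats, mammals, pets = (Atom("subject", c) for c in ("cats", "mammals", "pets"))

both = evaluate(And((cats, pets)), index, all_docs)
either = evaluate(Or((cats, mammals)), index, all_docs)
not_cats = evaluate(Not(cats), index, all_docs)
rop = evaluate(Rop(required=(cats,), optional=(mammals,), prohibited=(pets,)), index, all_docs)
```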

Composite Queries – Decomposition

● Decompose arbitrarily nested composite query into “positive” and “negative” atoms.

Composite Queries – Scoring Results

● Two metrics for scoring results of composite queries ...
– Unweighted scoring (number of positive atoms matching the document).
– IDF weighted scoring (take into account inverse document frequency of concept names in the index – greater weight to more “discriminating” atoms).
● Use scores to rank results (we assume in order of greatest relevance).
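The two scoring metrics could be sketched as follows. The slides do not give a formula for inverse document frequency, so the common form log(N / df) is assumed here; the function names and data are hypothetical, and atoms are represented as plain (field name, concept name) pairs.

```python
import math


def matches(index, doc, atom):
    field_name, concept = atom
    return (doc, concept) in index.get(field_name, set())


def unweighted_score(index, doc, positive_atoms):
    """Number of positive atoms that match the document."""
    return sum(1 for atom in positive_atoms if matches(index, doc, atom))


def idf(index, atom, n_docs):
    """Assumed IDF form: log(N / df), where df counts documents carrying the concept."""
    field_name, concept = atom
    df = sum(1 for _, c in index.get(field_name, set()) if c == concept)
    return math.log(n_docs / df) if df else 0.0


def idf_score(index, doc, positive_atoms, n_docs):
    """Sum of IDF weights of the positive atoms that match the document."""
    return sum(idf(index, a, n_docs) for a in positive_atoms if matches(index, doc, a))


# Hypothetical example data: four documents, one field.
index = {"subject": {("doc1", "cats"), ("doc1", "pets"), ("doc2", "cats"),
                     ("doc3", "cats"), ("doc4", "mammals")}}
docs = ["doc1", "doc2", "doc3", "doc4"]
atoms = [("subject", "cats"), ("subject", "pets")]

ranked = sorted(docs, key=lambda d: idf_score(index, d, atoms, len(docs)), reverse=True)
```

In this data “pets” occurs in one document and “cats” in three, so the “pets” atom is the more discriminating one and carries the greater weight.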

Composite Queries – Naïve Query/Index Expansion

[Diagram with labels: Controlled Structured Vocabulary, Index, Query, expand, evaluation, Results.]

Composite Queries – Naïve Query Expansion

● Expand arbitrarily nested query expressions.
● Mathematically equivalent to naïve index expansion (but not computationally equivalent).

Limited Cost Expansion

Limited Cost Expansion – Naïve Assumptions

● Likely to break down, especially for “deep” hierarchies (they do not account for specificity).
● Do not take advantage of associative links.
● Expansion cannot be “tuned”; no possibility for dynamic functionality (“all or nothing”).
● Structure is not utilised for ranking of the expanded result set.

Limited Cost Expansion – Quantitative Assumptions

[Diagram with labels: DNAME, CNAME, QUERY, P(relevance), and “? = N”, “? = A”, “? = B”.]

Limited Cost Expansion – Relevance Cost

● Use a numerical function to model the accumulated “relevance cost” of expansion.
● Use a “cost limit” to provide a cut-off.
● Invert the minimum cost value to obtain an “expansion weight” between 0 and 1 (high weight suggests high probability of relevance).
● Factor expansion weight into result scoring and therefore ranking.
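One way to picture these steps is a shortest-path search over the structure graph with a cut-off. Everything specific in the sketch below is an assumption made for illustration: the per-relation step costs, the additive accumulation of cost, and the inversion 1 / (1 + cost). The cost function and inversion actually used are defined in the full text.

```python
import heapq

# Illustrative step costs only; not values from the theory.
STEP_COST = {"narrower": 0.1, "associated": 0.5, "broader": 0.8}


def limited_cost_expansion(edges, start, cost_limit):
    """Minimum accumulated cost to each concept reachable within the cost limit.

    `edges` maps a concept name to a list of (relation, neighbour) pairs.
    """
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cost, concept = heapq.heappop(heap)
        if cost > best.get(concept, float("inf")):
            continue  # stale heap entry
        for relation, neighbour in edges.get(concept, ()):
            new_cost = cost + STEP_COST[relation]
            if new_cost <= cost_limit and new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return best


def expansion_weight(cost):
    """Assumed inversion: maps cost 0 to weight 1 and larger costs towards 0."""
    return 1.0 / (1.0 + cost)


# Hypothetical example data.
edges = {
    "mammals": [("narrower", "cats"), ("narrower", "dogs"),
                ("associated", "zoology"), ("broader", "animals")],
    "cats": [("narrower", "siamese")],
}
costs = limited_cost_expansion(edges, "mammals", cost_limit=0.6)
weights = {concept: expansion_weight(cost) for concept, cost in costs.items()}
```

With a cost limit of 0.6 the broader concept “animals” is cut off, while the associated concept “zoology” is included at a lower weight than the narrower concepts.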

Limited Cost Expansion – Query/Index Expansion

● Limited cost expansion of either query or index.

Coordination

Coordination – Ordered and Unordered (1)

● Coordination is the act of combining concept names.
● Ordered – order of coordination is significant to meaning.
● Unordered – order of coordination is not significant to meaning.
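The distinction can be illustrated with two standard Python types: a tuple for an ordered coordination and a frozenset for an unordered one. The choice of types and the concept names are assumptions of this sketch, not part of the theory.

```python
# Ordered coordination: position matters, so a tuple is a natural fit.
ordered_1 = ("history", "science")
ordered_2 = ("science", "history")

# Unordered coordination: only membership matters, so a frozenset fits.
unordered_1 = frozenset({"history", "science"})
unordered_2 = frozenset({"science", "history"})

same_when_ordered = ordered_1 == ordered_2
same_when_unordered = unordered_1 == unordered_2

# A coordinated field relates document names to coordinations
# instead of single concept names (hypothetical data).
coordinated_field = {("doc1", unordered_1), ("doc2", unordered_2)}
docs = {doc for doc, coordination in coordinated_field if coordination == unordered_1}
```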


Coordination – A Coordinated Field

[Diagram with labels: DNAME, CNAME.]

Coordination – A Coordinated Query

[Diagram with label: and.]

Coordination – Decomposition

[Diagram in three levels: {a, b, c}; then {a, b}, {b, c}, {a, c}; then a, b, c.]
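The levels in the decomposition diagram match the non-empty proper subsets of an unordered coordination, grouped by size. The sketch below generates them with `itertools.combinations`; the function name is hypothetical, and treating decomposition as subset enumeration is a reading of the diagram rather than the formal definition.

```python
from itertools import combinations


def decompose(coordination):
    """Levels of non-empty proper subsets, from largest to smallest."""
    items = sorted(coordination)
    return [
        [frozenset(combo) for combo in combinations(items, size)]
        for size in range(len(items) - 1, 0, -1)
    ]


levels = decompose({"a", "b", "c"})
pairs, singles = levels
```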

Coordination – Structure Relations

[Diagram: chains a1, a2, a3 and b1, b2, b3, and coordinations in five levels: {a1, b1}; then {a1, b2}, {a2, b1}; then {a1, b3}, {a2, b2}, {a3, b1}; then {a2, b3}, {a3, b2}; then {a3, b3}.]

Coordination – Expansion

● Naïve expansion of coordinated queries or indexes.
● Limited cost expansion of coordinated queries or indexes.

Translation


[Diagram titled “Translation”, with labels: Vocabulary A, Vocabulary B, Mapping, Index, Query, translate, evaluation, Results.]

Translation – Goals

● Automated translation.
● Understand consequences for precision and recall.
● Minimise loss of precision and recall.

Translation – Mapping

● Structural mapping ...
– Use “broader”, “narrower”, “associated” and “equivalent” mapping relations.
● Query expression mapping ...
– Use composite query expression as the target of the mapping.

Translation – Methods

● Naïve translation.
● Limited cost translation ...
– Translation weight.
● N.B. Limited cost translation is much less demanding on the completeness of the mapping!
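A rough sketch of naïve translation using a structural mapping follows. It assumes that an atomic query against vocabulary A is rewritten as an “or” over the vocabulary B concept names mapped as “equivalent” or “narrower”, by analogy with naïve expansion. That rule, the mapping layout, and all names are assumptions of this sketch; the full text gives the actual definitions.

```python
# Hypothetical structural mapping from vocabulary A to vocabulary B.
# ("narrower", "B:kittens") is read here as: B:kittens is narrower than the A concept.
MAPPING = {
    "A:felines": [
        ("equivalent", "B:cats"),
        ("narrower", "B:kittens"),
        ("broader", "B:animals"),
    ],
}


def naive_translate(field_name, concept, mapping, follow=("equivalent", "narrower")):
    """Rewrite an atomic query on vocabulary A as an 'or' of atomic queries on B."""
    targets = sorted(t for relation, t in mapping.get(concept, ()) if relation in follow)
    return ("or", [(field_name, target) for target in targets])


translated = naive_translate("subject", "A:felines", MAPPING)
```

A limited cost translation would, in the same spirit, also follow “associated” and “broader” mappings at a cost and attach a translation weight to each target, which is why it can tolerate a less complete mapping.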

Next Steps ...

Adaptation and Change

● Use mappings to express change in vocabularies.
● Use translations to adapt indexes and/or queries.
● N.B. Requires vocabulary management tools that capture change information at the point of change!

Summary

● Pragmatic, operational approach to describing the use of structured vocabularies for retrieval.
● Formalise the underlying assumptions.
● Support standardization, especially of representations for index, vocabulary and mapping data.