Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages

Vivien Petras
UC Berkeley School of Information


TRANSCRIPT

Page 1: Vivien Petras UC Berkeley School of Information

Translating Dialects in Search:

Mapping between Specialized Languages of Discourse

and Documentary Languages

Vivien Petras

UC Berkeley School of Information

Page 2: Vivien Petras UC Berkeley School of Information

Overcoming the Language Problem in Search

How can someone searching for violins be made aware that there are also fiddles (and vice versa)?

Page 3: Vivien Petras UC Berkeley School of Information

• The Language Problem in Information Retrieval

• Dialects & Contexts

• The Search Term Recommender

• 4 Research Questions

• Exploratory Web Interface

Outline

Page 4: Vivien Petras UC Berkeley School of Information

[Diagram: the Searcher and the Author each start from their own Concept Space; the searcher's Question becomes a Search Statement, the author's Text becomes a Document, and retrieval succeeds when the two match.]

• Mapping between searcher and IR system

• Mapping between author and IR system

• Mapping between search statement and document

IR = Language Mapping Exercise

Page 5: Vivien Petras UC Berkeley School of Information

IR = Language Mapping Exercise

[Diagram: the Searcher's Concept Space yields a Question, expressed as a Search Statement, which Information Retrieval matches against Documents.]

A search statement needs to describe:
• the searcher's question (information need)
• the documents that are relevant to the searcher's question

Page 6: Vivien Petras UC Berkeley School of Information

In Semiotics:

Unlimited semiosis

In Information Science:

Inter-indexer inconsistency

The Language Problem

Page 7: Vivien Petras UC Berkeley School of Information

How to alleviate language ambiguity for search term selection?

Dialects and Contexts

Page 8: Vivien Petras UC Berkeley School of Information

How to alleviate language ambiguity for search term selection?

Wittgenstein Philosophy of language:

Language is disambiguated within contexts and specialized dialects.

Dialects and Contexts

Page 9: Vivien Petras UC Berkeley School of Information

How to alleviate language ambiguity for search term selection?

Wittgenstein Philosophy of language:

Language is disambiguated within contexts and specialized dialects.

Support search term selection:

Dialects and Contexts

Page 10: Vivien Petras UC Berkeley School of Information

How to alleviate language ambiguity for search term selection?

Wittgenstein Philosophy of language:

Language is disambiguated within contexts and specialized dialects.

Support search term selection:
• Within the dialect of a specialized community

Dialects and Contexts

Page 11: Vivien Petras UC Berkeley School of Information

How to alleviate language ambiguity for search term selection?

Wittgenstein Philosophy of language:

Language is disambiguated within contexts and specialized dialects.

Support search term selection:
• Within the dialect of a specialized community
• In context

Dialects and Contexts

Page 12: Vivien Petras UC Berkeley School of Information

How to alleviate language ambiguity for search term selection?

Wittgenstein Philosophy of language:

Language is disambiguated within contexts and specialized dialects.

Support search term selection:
• Within the dialect of a specialized community
• In context
• Using the language of documents (for term matching)

Dialects and Contexts

Page 13: Vivien Petras UC Berkeley School of Information

Search Term Recommender

[Diagram: the information collection is divided into specialties; a Search Statement is matched to a specialty, and the recommender responds "Did you mean…" with specialty terms drawn from that part of the collection.]

Page 14: Vivien Petras UC Berkeley School of Information

Search Term Recommender

Page 15: Vivien Petras UC Berkeley School of Information

• Term selection support (query expansion & reformulation)

• Automatic classification

• Terminology mapping

The Search Term Recommender: Applications

Page 16: Vivien Petras UC Berkeley School of Information

1. How can specialties & specialty dialects be identified in an information collection?

The Search Term Recommender - Questions

Page 17: Vivien Petras UC Berkeley School of Information

1. How can specialties & specialty dialects be identified in an information collection?

2. Do specialty dialects really differ?

The Search Term Recommender - Questions

Page 18: Vivien Petras UC Berkeley School of Information

1. How can specialties & specialty dialects be identified in an information collection?

2. Do specialty dialects really differ?

3. Is performance improved when focusing on specialty dialects?

The Search Term Recommender - Questions

Page 19: Vivien Petras UC Berkeley School of Information

1. How can specialties & specialty dialects be identified in an information collection?

2. Do specialty dialects really differ?

3. Is performance improved when focusing on specialty dialects?

4. How specific should specialties be?

The Search Term Recommender - Questions

Page 20: Vivien Petras UC Berkeley School of Information

1. How can specialties & specialty dialects be identified in an information collection?

2. Do specialty dialects really differ?

3. Is performance improved when focusing on specialty dialects?

4. How specific should specialties be?

Tested on 2 bibliographic databases:
• Inspec
• Medline (Ohsumed collection)

The Search Term Recommender - Questions

Page 21: Vivien Petras UC Berkeley School of Information

• Physics, Electrical and Electronic Engineering, Computers and Control

• Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes

• Test collection:
Number of documents: 427,340
Descriptors per document: 6.99

Inspec

Page 22: Vivien Petras UC Berkeley School of Information

• Biomedicine and Health

• Document: author, title, source, publication year, publication type, abstract, MeSH headings

• Test collection:
Number of documents: 168,463
MeSH headings per document: 3.11

Medline (Ohsumed Collection)

Page 23: Vivien Petras UC Berkeley School of Information

1. How can specialties be identified in an information collection?

2. Do specialty dialects really differ?

3. Is performance improved when focusing on specialty dialects?

4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

The Search Term Recommender System - Questions

Page 24: Vivien Petras UC Berkeley School of Information

• Domain terminology

Determine specialty documents in the collection:

Page 25: Vivien Petras UC Berkeley School of Information

• Domain terminology

• Publication source

Determine specialty documents in the collection:

Page 26: Vivien Petras UC Berkeley School of Information

• Domain terminology

• Publication source

• Bibliometric analysis

Determine specialty documents in the collection:

Page 27: Vivien Petras UC Berkeley School of Information

• Domain terminology

• Publication source

• Bibliometric analysis

• Social network analysis

Determine specialty documents in the collection:

Page 28: Vivien Petras UC Berkeley School of Information

• Domain terminology

• Publication source

• Bibliometric analysis

• Social network analysis

• Subject-specific classification

Determine specialty documents in the collection:

Page 29: Vivien Petras UC Berkeley School of Information

Inspec test collection:
• by top-level categories in the Inspec classification
• 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control

Identification of Specialties in an Information Collection

Page 30: Vivien Petras UC Berkeley School of Information

Inspec test collection:
• by top-level categories in the Inspec classification
• 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control

Ohsumed test collection:
• by journals grouped by subject
• 33 specialties

Identification of Specialties in an Information Collection
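As a rough illustration of the two partitioning approaches above, here is a minimal Python sketch; the record field names (class_codes, journal) and the journal-to-specialty mapping are assumptions for illustration, not the actual Inspec or Ohsumed record schema.

```python
# Sketch: split a bibliographic collection into specialty sub-collections.
# Field names ("class_codes", "journal") are illustrative placeholders.
from collections import defaultdict

# Top-level Inspec classification sections mapped to the three specialties.
INSPEC_TOP_LEVEL = {
    "A": "Physics",
    "B": "Electrical & Electronic Engineering",
    "C": "Computers & Control",
}

def partition_by_classification(records):
    """Group Inspec records by the top-level section of their classification codes."""
    specialties = defaultdict(list)
    for rec in records:
        for code in rec.get("class_codes", []):
            specialty = INSPEC_TOP_LEVEL.get(code[:1].upper())
            if specialty:
                specialties[specialty].append(rec)
    return specialties

def partition_by_journal(records, journal_to_specialty):
    """Group Ohsumed records by the subject area assigned to their journal."""
    specialties = defaultdict(list)
    for rec in records:
        specialty = journal_to_specialty.get(rec.get("journal"))
        if specialty:
            specialties[specialty].append(rec)
    return specialties
```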

Page 31: Vivien Petras UC Berkeley School of Information

1. How can specialties be identified in an information collection?

2. Do specialty dialects really differ?

3. Is performance improved when focusing on specialty dialects?

4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

The Search Term Recommender System - Questions

Page 32: Vivien Petras UC Berkeley School of Information

Differences in specialty dialects (specialty term overlap)

Differences in Language

Page 33: Vivien Petras UC Berkeley School of Information

Differences in specialty dialects (specialty term overlap)

Differences in documentary languages (subject metadata term overlap)

Differences in Language

Page 34: Vivien Petras UC Berkeley School of Information

Differences in specialty dialects (specialty term overlap)

Differences in documentary languages (subject metadata term overlap)

Differences in search term recommender suggestions (term suggestion overlap)

Differences in Language
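The overlap figures on the following slides can be read as simple set comparisons. A minimal sketch, assuming each specialty's dialect (or subject metadata, or suggestion list) has already been reduced to a set of terms:

```python
# Sketch: what share of all distinct terms occurs in more than one specialty?
from collections import Counter

def overlap_share(vocabularies):
    """vocabularies: list of term sets, one per specialty."""
    counts = Counter(term for vocab in vocabularies for term in set(vocab))
    shared = sum(1 for c in counts.values() if c > 1)
    return shared / len(counts) if counts else 0.0

# Toy example (not the real Inspec vocabulary):
physics = {"interstellar molecular clouds", "pre-main-sequence stars", "clouds"}
electrical = {"circuits", "clouds", "atomic clusters"}
computers = {"search problems", "clouds", "atomic clusters"}
print(round(overlap_share([physics, electrical, computers]), 2))  # 0.33
```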

Page 35: Vivien Petras UC Berkeley School of Information

Inspec Dialects (specialty term overlap)

[Venn diagram: specialty term overlap among the Physics, Electrical Engineering, and Computers vocabularies; region shares range from 4% to 33%]

terms analyzed: 60,601

Subject metadata term overlap: 87%
Suggested term overlap: 30%

Page 36: Vivien Petras UC Berkeley School of Information

Ohsumed Dialects (Specialty term overlap)

terms analyzed: 11,663

Subject metadata term overlap: 32%
Suggested term overlap: 30%

[Venn diagram: specialty term overlap among the Communicable Diseases, Gynecology, and Orthopedics vocabularies; region shares range from 2% to 29%]

Page 37: Vivien Petras UC Berkeley School of Information

1. How can specialties be identified in an information collection?

2. Do specialty dialects really differ?

3. Is performance improved when focusing on specialty dialects?

4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

The Search Term Recommender System - Questions

Page 38: Vivien Petras UC Berkeley School of Information

• Suggest subject metadata for documents

• Comparison: specialty vs. general term suggestions

Automatic classification

Page 39: Vivien Petras UC Berkeley School of Information

Title: “A search for clusters of protostars in Orion cloud cores”

Automatic Classification

Originally assigned terms:
1. Infrared sources (astronomical)
2. Interstellar molecular clouds
3. Pre-main-sequence stars
4. Star associations

Specialty Search Term Recommender suggestions:
1. Clouds
2. Clusters of galaxies
3. Interstellar molecular clouds
4. Star clusters
5. Pre-main-sequence stars

General Search Term Recommender suggestions:
1. Search problems
2. Clouds
3. Atomic clusters
4. Clusters of galaxies
5. Interstellar molecular clouds

Page 40: Vivien Petras UC Berkeley School of Information

Title: “A search for clusters of protostars in Orion cloud cores”

Automatic Classification

Originally assigned terms:
1. Infrared sources (astronomical)
2. Interstellar molecular clouds
3. Pre-main-sequence stars
4. Star associations

Specialty Search Term Recommender suggestions:
1. Clouds
2. Clusters of galaxies
3. Interstellar molecular clouds
4. Star clusters
5. Pre-main-sequence stars

General Search Term Recommender suggestions:
1. Search problems
2. Clouds
3. Atomic clusters
4. Clusters of galaxies
5. Interstellar molecular clouds

Recall (hit rate): Specialty STR 2/4 = 0.5; General STR 1/4 = 0.25
Precision (accuracy): Specialty STR 2/5 = 0.4; General STR 1/5 = 0.2

Evaluation

Page 41: Vivien Petras UC Berkeley School of Information

Performance of the STR: Inspec

Inspec specialties and general STRs

[Precision-recall plot (axes 0.0 to 0.5): individual specialty STRs vs. the general STR]

Test Documents: 42,735

Specialties: 3

First 3 suggested:

Recall: 13.6%

Precision: 11.2%

Page 42: Vivien Petras UC Berkeley School of Information

Performance of the STR: Ohsumed

Ohsumed specialties and general STR

[Precision-recall plot (axes 0 to 0.7): individual specialty STRs vs. the general STR]

First 3 suggested:

Recall: 26%

Precision: 25.6%

Test Documents: 18,733

Specialties: 33

Page 43: Vivien Petras UC Berkeley School of Information

1. How can specialties be identified in an information collection?

2. Do specialty dialects really differ?

3. Is performance improved when focusing on specialty dialects?

4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

The Search Term Recommender System - Questions

Page 44: Vivien Petras UC Berkeley School of Information

• Language differences

• Representative sample of specialty language for training

Specificity of Specialties

Page 45: Vivien Petras UC Berkeley School of Information

Identifying subspecialties by classification hierarchy
– e.g. Computers & Control -- Computer Hardware -- Circuits & Devices

Specificity of Specialties - Inspec

Page 46: Vivien Petras UC Berkeley School of Information

Identifying subspecialties by classification hierarchy
– e.g. Computers & Control -- Computer Hardware -- Circuits & Devices

Specificity of Specialties - Inspec

Four levels of specificity in the Inspec collection

[Precision-recall plot (axes 0.0 to 0.7): sub-sub-specialty, sub-specialty, specialty, and general STRs compared]

Test documents: 2,425
Specialties: 3

Page 47: Vivien Petras UC Berkeley School of Information

Identifying subspecialties by journal within subject
– e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal

Specificity of Specialties - Ohsumed

Page 48: Vivien Petras UC Berkeley School of Information

Identifying subspecialties by journal within subject
– e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal

Specificity of Specialties - Ohsumed

Three levels of specificity in the Ohsumed Collection

[Precision-recall plot (axes 0 to 0.7): journal, specialty, and general STRs compared]

Test documents: 745
Specialties: 3

Page 49: Vivien Petras UC Berkeley School of Information

Inspec

http://metadata.sims.berkeley.edu/str/inspec/inspec.html

Ohsumed

http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html

Exploratory Web Interfaces

Page 50: Vivien Petras UC Berkeley School of Information

1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area

Summary

Page 51: Vivien Petras UC Berkeley School of Information

1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area

2. Do specialty dialects really differ?
– Inspec specialties: term overlap 50%, suggestions overlap 30%
– Ohsumed specialties: term overlap 30%, suggestions overlap 30%

Summary

Page 52: Vivien Petras UC Berkeley School of Information

1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area

2. Do specialty dialects really differ?
– Inspec specialties: term overlap 50%, suggestions overlap 30%
– Ohsumed specialties: term overlap 30%, suggestions overlap 30%

3. Is performance improved when focusing on specialty dialects?
– Inspec specialties: 10% improvement over general STR
– Ohsumed specialties: 25% improvement over general STR

Summary

Page 53: Vivien Petras UC Berkeley School of Information

1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area

2. Do specialty dialects really differ?
– Inspec specialties: term overlap 50%, suggestions overlap 30%
– Ohsumed specialties: term overlap 30%, suggestions overlap 30%

3. Is performance improved when focusing on specialty dialects?
– Inspec specialties: 10% improvement over general STR
– Ohsumed specialties: 25% improvement over general STR

4. How specific should specialties be?
– Depends on language differences & collection size

Summary

Page 54: Vivien Petras UC Berkeley School of Information

Overcoming the Language Problem in Search

Search Term Recommender:

See also:

FIDDLES

50% Discount!

Thank you!

[email protected]

Page 55: Vivien Petras UC Berkeley School of Information
Page 56: Vivien Petras UC Berkeley School of Information

Explanatory Slides

Page 57: Vivien Petras UC Berkeley School of Information

“how to obtain the right information for the right user at the right time” (Chu, 2003)

Decision Process under Uncertainty

Information Retrieval

Page 58: Vivien Petras UC Berkeley School of Information

• Searching the Needle in the Haystack

• Which Needle in which Haystack

• How to express the Needle and the Haystack

Language Problem in Information Retrieval

Decision Process under Uncertainty

Page 59: Vivien Petras UC Berkeley School of Information

Uncertainty: Searching the Needle in the Haystack

[Illustration: 400 documents → 20-40 documents]

Blair (1996): for a small collection (40,000 documents) only 0.25-0.5% of the documents were relevant to any given query

Page 60: Vivien Petras UC Berkeley School of Information

• a known needle in a known haystack;
• a known needle in an unknown haystack;
• an unknown needle in an unknown haystack;
• any needle in a haystack;
• the sharpest needle in a haystack;
• most of the sharpest needles in a haystack;
• all the needles in a haystack;
• affirmation of no needles in the haystack;
• things like needles in any haystack;
• let me know whenever a new needle shows up;
• where are the haystacks?; and
• needles, haystacks – whatever

(Koll, 2000)

Uncertainty: Which Needle in which Haystack

Page 61: Vivien Petras UC Berkeley School of Information

How to alleviate language ambiguity?

Ludwig Wittgenstein:
• Language games
• Language regions

Language is disambiguated within contexts and specialized dialects.

Dialects and Contexts

Page 62: Vivien Petras UC Berkeley School of Information

• Divide information collection by specialty

• Association between
– specialty terms
– documentary terms (subject metadata)

• Recommend highly associated terms

The Search Term Recommender Methodology
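The slides do not specify the association measure, so the following is only a sketch of the general idea, using raw co-occurrence counts between document terms and assigned subject metadata; a real STR would presumably use a weighted association statistic.

```python
# Sketch of the STR methodology: per specialty, count how often free-text terms
# co-occur with assigned controlled-vocabulary terms, then recommend the
# descriptors most strongly associated with the terms of a search statement.
from collections import defaultdict

def train_str(documents):
    """documents: iterable of (free_text_terms, assigned_descriptors) pairs
    drawn from one specialty sub-collection."""
    assoc = defaultdict(lambda: defaultdict(int))
    for text_terms, descriptors in documents:
        for t in set(text_terms):
            for d in set(descriptors):
                assoc[t][d] += 1          # simple co-occurrence count
    return assoc

def recommend(assoc, query_terms, k=7):
    """Aggregate association scores over the query terms; return the top k descriptors."""
    scores = defaultdict(int)
    for t in query_terms:
        for d, count in assoc.get(t, {}).items():
            scores[d] += count
    return [d for d, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# Hypothetical usage: one STR per specialty, plus one trained on the whole collection.
# physics_str = train_str(physics_docs)
# recommend(physics_str, ["protostars", "orion", "cloud", "cores"])
```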

Page 63: Vivien Petras UC Berkeley School of Information

Differences in Vocabulary – Inspec Documentary Language (Inspec Descriptor overlap)

Overlap between controlled vocabulary terms in three specialty collections in Inspec (total terms analyzed: 8,447)

[Venn diagram: Inspec descriptor overlap among the Physics, Electrical Engineering, and Computers specialty collections; region shares range from 1% to 62%]

Page 64: Vivien Petras UC Berkeley School of Information

Differences in Vocabulary – Inspec specialty suggestions (suggested term overlap)

Variation in controlled vocabulary term suggestions (7 terms suggested) across the three Inspec specialty STRs (averaged over 300 queries).

7 suggested terms from each of 3 specialties

Maximum overlap: 7

No overlap: 21

Average number of unique terms suggested: 15.6

14-17 unique terms: 63%
10-13 unique terms: 17%
18-20 unique terms: 20%

Page 65: Vivien Petras UC Berkeley School of Information

Differences in Vocabulary – Ohsumed Documentary Language (MeSH heading overlap)

Overlap between controlled vocabulary terms in three specialty collections in Ohsumed (total terms analyzed: 5,376)

[Venn diagram: MeSH heading overlap among the Communicable Diseases, Gynecology, and Orthopedics specialty collections; region shares range from 3% to 34%]

Page 66: Vivien Petras UC Berkeley School of Information

Differences in Vocabulary – Ohsumed specialty suggestions (suggested term overlap)

Variation in controlled vocabulary term suggestions (3 terms suggested) across the 33 Ohsumed specialty STRs (averaged over 283 queries).

3 suggested terms from each of 33 specialties

Maximum overlap: 3

No overlap: 99

Average number of unique terms suggested: 61.9

50-69 unique terms: 54%
30-49 unique terms: 19%
70-98 unique terms: 27%

Page 67: Vivien Petras UC Berkeley School of Information

Does a search term recommender utilizing specialty dialects perform better than a search term recommender based on the general language of the whole collection?

Test documents from specialty were classified with specialty search term recommender & general search term recommender

Search statement: title of test document
Suggested terms: controlled vocabulary

Performance of the STR: Automatic Classification

Page 68: Vivien Petras UC Berkeley School of Information

How many of the originally assigned controlled vocabulary terms were suggested for a test document?

Recall = |originally assigned ∩ suggested| / |originally assigned|

Precision = |originally assigned ∩ suggested| / |suggested|

Example: 4 originally assigned terms

Automatic Classification: Evaluation

Cut-off level | Originally assigned terms among suggested | Recall | Precision
1 | 1 | 0.25 | 1.00
2 | 1 | 0.25 | 0.50
3 | 2 | 0.50 | 0.66
4 | 2 | 0.50 | 0.50
5 | 3 | 0.75 | 0.60
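A minimal sketch of this evaluation; the particular ranking below is hypothetical but reproduces the recall and precision values in the table above (the slide's 0.66 is 2/3 rounded).

```python
# Sketch: recall and precision of suggested terms against the originally
# assigned controlled-vocabulary terms, at increasing cut-off levels.

def recall_precision_at_cutoffs(assigned, suggested):
    assigned = set(assigned)
    rows = []
    for k in range(1, len(suggested) + 1):
        hits = len(assigned & set(suggested[:k]))
        rows.append((k, hits, hits / len(assigned), hits / k))
    return rows

# Hypothetical example: 4 assigned terms, 5 suggested terms, 3 of them correct.
assigned = ["term A", "term B", "term C", "term D"]
suggested = ["term A", "other 1", "term C", "other 2", "term D"]
for k, hits, recall, precision in recall_precision_at_cutoffs(assigned, suggested):
    print(f"cut-off {k}: {hits} hit(s), recall {recall:.2f}, precision {precision:.2f}")
```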