Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages Vivien Petras UC Berkeley School of Information

Page 1

Translating Dialects in Search:

Mapping between Specialized Languages of Discourse

and Documentary Languages

Vivien Petras

UC Berkeley School of Information

Page 2

Overcoming the Language Problem in Search

How can someone searching for violins be made aware that there are also fiddles (and vice versa)?

Page 3

Outline

• The Language Problem in Information Retrieval
• Dialects & Contexts
• The Search Term Recommender
• 4 Research Questions
• Exploratory Web Interface

Page 4

Information Retrieval

“how to obtain the right information for the right user at the right time” (Chu, 2003)

Decision Process under Uncertainty

Page 5

Language Problem in Information Retrieval

• Searching for the Needle in the Haystack
• Which Needle in which Haystack
• How to express the Needle and the Haystack

Decision Process under Uncertainty

Page 6

Language Mapping

[Diagram labels: Searcher and Author, each with a Concept Space; Question and Text; Search Statement and Document; “Match!”]

• Mapping between searcher and IR system
• Mapping between author and IR system
• Mapping between search statement and document

Page 7

IR = Language Mapping Exercise

Searcher

Concept Space

Question

Search Statement

Document

Match!

Information Retrieval

A search statement needs to describe the:
• searcher’s question (information need)
• documents that are relevant to a searcher’s question

Page 8

The Language Problem

In Linguistics: unlimited semiosis

In Information Science: inter-indexer inconsistency (20-60%)

Page 9

Dialects and Contexts

How to alleviate language ambiguity?

Ludwig Wittgenstein:
• Language games
• Language regions

Language is disambiguated within contexts and specialized dialects.

Page 10

Dialects and Contexts

How to alleviate language ambiguity for search term selection?

Support search term selection:
• Within the dialect of a specialized community
• In context
• Using the language of documents (for term matching)

Page 11

Search Term Recommender

[Diagram labels: Search Statement; Information Collection divided into several Specialty segments; “Did you mean…” followed by a list of Specialty Terms]

Page 12

Search Term Recommender

Page 13

The Search Term Recommender Methodology

• Divide information collection by specialty
• Association between
  – specialty terms
  – documentary terms (subject metadata)
• Recommend highly associated terms
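
The three methodology steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the talk: the toy records, the field names (`specialty`, `text`, `descriptors`), the whitespace tokenizer, and the association score (the share of a specialty's documents containing a word that also carry a given descriptor) are all hypothetical. The slides do not say which association measure was actually used.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; field names and contents are illustrative assumptions.
RECORDS = [
    {"specialty": "physics", "text": "protostars in orion cloud cores",
     "descriptors": ["Interstellar molecular clouds", "Pre-main-sequence stars"]},
    {"specialty": "physics", "text": "molecular cloud collapse",
     "descriptors": ["Interstellar molecular clouds"]},
    {"specialty": "computers", "text": "cloud storage clusters",
     "descriptors": ["Distributed databases"]},
]


def train(records):
    """Count word/descriptor co-occurrences separately for each specialty."""
    pair_counts = defaultdict(Counter)  # (specialty, word) -> descriptor counts
    word_counts = Counter()             # (specialty, word) -> number of documents
    for rec in records:
        for word in set(rec["text"].lower().split()):
            key = (rec["specialty"], word)
            word_counts[key] += 1
            pair_counts[key].update(set(rec["descriptors"]))
    return pair_counts, word_counts


def recommend(query, specialty, model, k=3):
    """Return up to k descriptors most associated with the query words."""
    pair_counts, word_counts = model
    scores = Counter()
    for word in set(query.lower().split()):
        key = (specialty, word)
        # Assumed score: fraction of this word's documents carrying the descriptor.
        for descriptor, n in pair_counts[key].items():
            scores[descriptor] += n / word_counts[key]
    return [descriptor for descriptor, _ in scores.most_common(k)]


MODEL = train(RECORDS)
print(recommend("cloud", "physics", MODEL))
print(recommend("cloud", "computers", MODEL))
```

The same query word yields different documentary terms depending on the specialty it is looked up in, which is the behavior the recommender is meant to provide.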

Page 14

The Search Term Recommender: Applications

• Term selection support (query expansion & reformulation)
• Automatic classification
• Terminology mapping

Page 15

The Search Term Recommender - Questions

1. How can specialties & specialty dialects be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

Page 16

Inspec

• Physics, Electrical and Electronic Engineering, Computers and Control
• Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes
• Test collection:
  – Number of documents: 427,340
  – Descriptors per document: 6.99

Page 17

Medline Ohsumed Collection

• Biomedicine and Health
• Document: author, title, source, publication year, publication type, abstract, MeSH Headings
• Test collection:
  – Number of documents: 168,463
  – MeSH Headings per document: 3.11

Page 18

The Search Term Recommender System - Questions

1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

Page 19

Determine specialty documents in the collection:

• Domain terminology
• Publication source
• Bibliometric analysis
• Social network analysis
• Subject-specific classification

Page 20

Identification of Specialties in an Information Collection

Inspec test collection
• by top-level categories in the Inspec classification
• 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control

Ohsumed test collection
• by journals grouped by subject
• 33 specialties
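
Splitting a collection by top-level classification category, as described for Inspec on this slide, could look roughly like the sketch below. The record layout (`id`, `codes`), the sample codes, and the letter-to-specialty mapping are illustrative assumptions rather than details taken from the slides. A document carrying codes from two sections is placed in both specialties.

```python
from collections import defaultdict

# Assumed mapping from the first letter of a classification code to a specialty.
TOP_LEVEL = {
    "A": "Physics",
    "B": "Electrical & Electronic Engineering",
    "C": "Computers & Control",
}

# Hypothetical records; ids and codes are made up for illustration.
SAMPLE = [
    {"id": 1, "codes": ["A9710", "A9840"]},
    {"id": 2, "codes": ["B1265", "C5120"]},
    {"id": 3, "codes": ["C6130"]},
]


def split_by_specialty(records):
    """Group document ids by the top-level section of their classification codes."""
    groups = defaultdict(list)
    for rec in records:
        # A set avoids adding a document twice when several codes share a section.
        specialties = {TOP_LEVEL[code[0]] for code in rec["codes"]
                       if code and code[0] in TOP_LEVEL}
        for specialty in sorted(specialties):
            groups[specialty].append(rec["id"])
    return dict(groups)


for specialty, ids in split_by_specialty(SAMPLE).items():
    print(specialty, ids)
```

For a collection grouped by journal, as in the Ohsumed case, the same loop would key on a journal-to-subject lookup instead of the first letter of a code.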

Page 21

The Search Term Recommender System - Questions

1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

Page 22

Differences in Language

Differences in specialty dialects (specialty term overlap)

Differences in documentary languages (subject metadata term overlap)

Differences in search term recommender suggestions (term suggestion overlap)

Page 23

Inspec Dialects (specialty term overlap)

[Venn diagram: Physics, Electrical Engineering, Computers. Region percentages, in extraction order: 20%, 7%, 13%, 13%, 4%, 33%, 13%]

Terms analyzed: 60,601

Subject metadata term overlap: 87%
Suggested term overlap: 30%
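
A three-way term overlap of the kind shown on this slide can be computed with plain set operations. The sketch below is an illustration only: the function name `venn_shares`, the region labels, and the toy vocabularies are assumptions, and the slides do not state exactly how their percentages were derived. It reports, for each of the seven regions of a three-set Venn diagram, the share of all distinct terms that fall into it.

```python
def venn_shares(a, b, c):
    """Share of all distinct terms in each of the 7 regions of a 3-set Venn diagram."""
    universe = a | b | c
    regions = {
        "A only": a - b - c,
        "B only": b - a - c,
        "C only": c - a - b,
        "A&B only": (a & b) - c,
        "A&C only": (a & c) - b,
        "B&C only": (b & c) - a,
        "A&B&C": a & b & c,
    }
    if not universe:
        # No terms at all: every region is empty.
        return {name: 0.0 for name in regions}
    return {name: len(terms) / len(universe) for name, terms in regions.items()}


# Hypothetical toy vocabularies standing in for three specialty dialects.
PHYSICS = {"field", "laser", "circuit", "algorithm"}
ELECTRICAL = {"circuit", "amplifier", "algorithm", "signal"}
COMPUTERS = {"algorithm", "compiler", "signal"}

for region, share in venn_shares(PHYSICS, ELECTRICAL, COMPUTERS).items():
    print(f"{region}: {share:.0%}")
```

Because the seven regions partition the combined vocabulary, the shares sum to 1 before rounding; rounded percentages may add up to slightly more or less than 100%.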

Page 24

Ohsumed Dialects (specialty term overlap)

[Venn diagram: Communicable Diseases, Gynecology, Orthopedics. Region percentages, in extraction order: 13%, 29%, 8%, 19%, 2%, 21%, 7%]

Terms analyzed: 11,663

Subject metadata term overlap: 32%
Suggested term overlap: 30%

Page 25

The Search Term Recommender System - Questions

1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

Page 26

Comparison: specialty vs. general term suggestions

Automatic classification

Page 27

Automatic Classification

Title: “A search for clusters of protostars in Orion cloud cores”

Originally assigned terms
1. Infrared sources (astronomical)
2. Interstellar molecular clouds
3. Pre-main-sequence stars
4. Star associations

Specialty Search Term Recommender
1. Clouds
2. Clusters of galaxies
3. Interstellar molecular clouds
4. Star clusters
5. Pre-main-sequence stars

General Search Term Recommender
1. Search problems
2. Clouds
3. Atomic clusters
4. Clusters of galaxies
5. Interstellar molecular clouds

Evaluation
• Recall (hit rate): Specialty STR 2/4 = 0.5; General STR 1/4 = 0.25
• Precision (accuracy): Specialty STR 2/5 = 0.4; General STR 1/5 = 0.2
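
The recall and precision figures on this slide follow directly from counting how many suggested terms match the originally assigned ones. The sketch below reproduces that arithmetic with the term lists from the slide; the function name `recall_precision` and the exact-string matching of terms are assumptions made for illustration.

```python
ASSIGNED = [
    "Infrared sources (astronomical)",
    "Interstellar molecular clouds",
    "Pre-main-sequence stars",
    "Star associations",
]
SPECIALTY_STR = [
    "Clouds",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
    "Star clusters",
    "Pre-main-sequence stars",
]
GENERAL_STR = [
    "Search problems",
    "Clouds",
    "Atomic clusters",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
]


def recall_precision(assigned, suggested):
    """Recall = hits / assigned terms; precision = hits / suggested terms."""
    hits = len(set(assigned) & set(suggested))
    return hits / len(assigned), hits / len(suggested)


print(recall_precision(ASSIGNED, SPECIALTY_STR))  # (0.5, 0.4)
print(recall_precision(ASSIGNED, GENERAL_STR))    # (0.25, 0.2)
```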

Page 28

Performance of the STR: Inspec

[Chart: Inspec specialties and general STRs; precision (0.0 to 0.5) plotted against recall (0.0 to 0.5); series: Individual Specialty STRs, General STR]

Test documents: 42,735
Specialties: 3

First 3 suggested:
• Recall: 13.6%
• Precision: 11.2%

Page 29

Performance of the STR: Ohsumed

[Chart: Ohsumed specialties and general STR; precision (0 to 0.7) plotted against recall (0 to 0.7); series: Individual Specialty STRs, General STR]

Test documents: 18,733
Specialties: 33

First 3 suggested:
• Recall: 26%
• Precision: 25.6%

Page 30

The Search Term Recommender System - Questions

1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?

Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

Page 31

Specificity of Specialties

• Language differences
• Collection sizes for training

Page 32

Specificity of Specialties - Inspec

Identifying subspecialties by classification hierarchy
– e.g. Computers & Control -- Computer Hardware -- Circuits & Devices

[Chart: Four levels of specificity in the Inspec collection; precision (0.0 to 0.7) plotted against recall (0.0 to 0.6); series: Sub-sub specialty STR, Sub-specialty STR, Specialty STR, General STR]

Test documents: 2425
Specialties: 3

Page 33

Specificity of Specialties - Ohsumed

Identifying subspecialties by journal within subject
– e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal

[Chart: Three levels of specificity in the Ohsumed Collection; precision (0 to 0.7) plotted against recall (0 to 0.7); series: Journal STR, Specialty STR, General STR]

Test documents: 745
Specialties: 3

Page 34

Exploratory Web Interfaces

Inspec
http://metadata.sims.berkeley.edu/str/inspec/inspec.html

Ohsumed
http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html

Page 35

Summary

1. How can specialties be identified in an information collection?
   – Inspec: subject-specific classification
   – Ohsumed: journal specialty area

2. Do specialty dialects really differ?
   – Inspec specialties: term overlap 50%, suggestions overlap 30%
   – Ohsumed specialties: term overlap 30%, suggestions overlap 30%

3. Is performance improved when focusing on specialty dialects?
   – Inspec specialties: 10% improvement over general STR
   – Ohsumed specialties: 25% improvement over general STR

4. How specific should specialties be?
   – Depends on language differences & collection size

Page 36

Overcoming the Language Problem in Search

Search Term Recommender:

See also:

FIDDLES

50% Discount!

Thank you!

[email protected]