
Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages

Vivien Petras

UC Berkeley School of Information

Overcoming the Language Problem in Search

How can someone searching for violins be made aware that there are also fiddles (and vice versa)?

• The Language Problem in Information Retrieval

• Dialects & Contexts

• The Search Term Recommender

• 4 Research Questions

• Exploratory Web Interface

Outline

[Diagram: on one side, Searcher → Concept Space → Question → Search Statement; on the other, Author → Concept Space → Text → Document; the Search Statement and the Document meet at "Match!"]

• Mapping between searcher and IR system
• Mapping between author and IR system
• Mapping between search statement and document

IR = Language Mapping Exercise

[Diagram: Searcher → Concept Space → Question → Search Statement; the Search Statement meets the Document at "Match!"]

Information Retrieval

A search statement needs to describe the:
• searcher’s question (information need)
• documents that are relevant to a searcher’s question

In Semiotics:

Unlimited semiosis

In Information Science:

Inter-indexer inconsistency

The Language Problem

How to alleviate language ambiguity for search term selection?

Dialects and Contexts


Wittgenstein's philosophy of language:

Language is disambiguated within contexts and specialized dialects.





Support search term selection:
• Within the dialect of a specialized community
• In context
• Using the language of documents (for term matching)


[Diagram: a Search Statement goes to the Search Term Recommender, which draws on the Specialties of an Information Collection and answers "Did you mean…" with Specialty Terms]

Search Term Recommender

• Term selection support (query expansion & reformulation)

• Automatic classification

• Terminology mapping

The Search Term Recommender: Applications

1. How can specialties & specialty dialects be identified in an information collection?

The Search Term Recommender - Questions


2. Do specialty dialects really differ?




3. Is performance improved when focusing on specialty dialects?





4. How specific should specialties be?






Tested on 2 bibliographic databases:
• Inspec
• Medline (Ohsumed collection)


Inspec

• Physics, Electrical and Electronic Engineering, Computers and Control
• Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes
• Test collection:
  – Number of documents: 427,340
  – Descriptors per document: 6.99

Medline Ohsumed Collection

• Biomedicine and Health
• Document: author, title, source, publication year, publication type, abstract, MeSH headings
• Test collection:
  – Number of documents: 168,463
  – MeSH headings per document: 3.11

1. How can specialties be identified in an information collection?




Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)

The Search Term Recommender System - Questions

Determine specialty documents in the collection:
• Domain terminology
• Publication source
• Bibliometric analysis
• Social network analysis
• Subject-specific classification


Identification of Specialties in an Information Collection

Inspec test collection
• by top-level categories in the Inspec classification
• 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control

Ohsumed test collection
• by journals grouped by subject
• 33 specialties
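The partitioning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the code strings, the journal names, and the idea that the first character of a classification code identifies the top-level category are all hypothetical and not taken from the actual test collections.

```python
from collections import defaultdict

def partition_by_specialty(documents, specialties_of):
    """Group document ids into specialty sub-collections.

    `specialties_of` maps a document to the set of specialties it belongs to,
    so a document may land in more than one sub-collection.
    """
    partitions = defaultdict(list)
    for doc in documents:
        for specialty in specialties_of(doc):
            partitions[specialty].append(doc["id"])
    return dict(partitions)

def by_top_level_code(doc):
    # Assumption: the first character of a code names the top-level category.
    return {code[0] for code in doc["codes"]}

# Hypothetical journal-to-subject grouping.
JOURNAL_SUBJECT = {"Journal X": "Orthopedics", "Journal Y": "Gynecology"}

def by_journal_subject(doc):
    subject = JOURNAL_SUBJECT.get(doc["journal"])
    return {subject} if subject else set()

# Hypothetical records, one list per collection style.
inspec_like = [
    {"id": "d1", "codes": ["A9710", "A9840"]},
    {"id": "d2", "codes": ["B1210", "C5120"]},
    {"id": "d3", "codes": ["C6130"]},
]
ohsumed_like = [
    {"id": "m1", "journal": "Journal X"},
    {"id": "m2", "journal": "Journal Y"},
    {"id": "m3", "journal": "Journal X"},
]

inspec_parts = partition_by_specialty(inspec_like, by_top_level_code)
ohsumed_parts = partition_by_specialty(ohsumed_like, by_journal_subject)
```

Passing the grouping rule in as a function keeps the same partitioning code usable for both a classification-based and a journal-based definition of specialty.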

Differences in Language

• Differences in specialty dialects (specialty term overlap)
• Differences in documentary languages (subject metadata term overlap)
• Differences in search term recommender suggestions (term suggestion overlap)
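One way to quantify term overlap between specialties is to assign every distinct term to the exact combination of specialties that use it, which gives the regions of a Venn diagram. The sketch below assumes shares are computed over the union of all terms; the slides do not state the denominator, and the vocabularies here are invented for illustration.

```python
def venn_shares(term_sets):
    """Share of distinct terms falling in each Venn region.

    `term_sets` maps a specialty name to its set of terms. The result maps a
    frozenset of specialty names (the region) to that region's share of the
    union of all terms.
    """
    all_terms = set().union(*term_sets.values())
    counts = {}
    for term in all_terms:
        region = frozenset(name for name, terms in term_sets.items() if term in terms)
        counts[region] = counts.get(region, 0) + 1
    total = len(all_terms)
    return {region: count / total for region, count in counts.items()}

# Hypothetical vocabularies.
shares = venn_shares({
    "Physics": {"field", "plasma", "circuit", "signal"},
    "Electrical Engineering": {"circuit", "signal", "antenna"},
    "Computers": {"signal", "compiler"},
})
physics_only = shares[frozenset({"Physics"})]
shared_by_all = shares[frozenset({"Physics", "Electrical Engineering", "Computers"})]
```

With three specialties there are at most seven non-empty regions, matching the seven percentages on each overlap slide.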


Inspec Dialects (specialty term overlap)

[Venn diagram of three specialties: Physics, Electrical Engineering, Computers. Region percentages as shown: 20%, 7%, 13%, 13%, 4%, 33%, 13%]

Terms analyzed: 60,601
Subject metadata term overlap: 87%
Suggested term overlap: 30%

Ohsumed Dialects (specialty term overlap)

[Venn diagram of three specialties: Communicable Diseases, Gynecology, Orthopedics. Region percentages as shown: 13%, 29%, 8%, 19%, 2%, 21%, 7%]

Terms analyzed: 11,663
Subject metadata term overlap: 32%
Suggested term overlap: 30%

• Suggest subject metadata for documents

• Comparison: specialty vs. general term suggestions

Automatic classification

Automatic Classification

Title: “A search for clusters of protostars in Orion cloud cores”

Originally assigned terms
1. Infrared sources (astronomical)
2. Interstellar molecular clouds
3. Pre-main-sequence stars
4. Star associations

Specialty Search Term Recommender
1. Clouds
2. Clusters of galaxies
4. Star clusters

General Search Term Recommender
1. Search problems
2. Clouds
3. Atomic clusters

Evaluation

                        Specialty STR    General STR
Recall (hit rate)       2/4 = 0.5        1/4 = 0.25
Precision (accuracy)    2/5 = 0.4        1/5 = 0.2

Performance of the STR: Inspec

[Figure: Inspec specialties and general STRs. Precision (0.0 to 0.5) plotted against Recall (0.0 to 0.5). Series: Individual Specialty STRs, General STR]

Test documents: 42,735
Specialties: 3

First 3 suggested:
Recall: 13.6%
Precision: 11.2%

Performance of the STR: Ohsumed

[Figure: Ohsumed specialties and general STR. Precision (0 to 0.7) plotted against Recall (0 to 0.7). Series: Individual Specialty STRs, General STR]

Test documents: 18,733
Specialties: 33

First 3 suggested:
Recall: 26%
Precision: 25.6%

• Language differences

• Representative sample of specialty language for training

Specificity of Specialties

Specificity of Specialties - Inspec

Identifying subspecialties by classification hierarchy
– e.g. Computers & Control -- Computer Hardware -- Circuits & Devices

[Figure: Four levels of specificity in the Inspec collection. Precision (0.0 to 0.7) plotted against Recall (0.0 to 0.6). Series: Sub-sub specialty STR, Sub-specialty STR, Specialty STR, General STR]

Test documents: 2425
Specialties: 3

Specificity of Specialties - Ohsumed

Identifying subspecialties by journal within subject
– e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal

[Figure: Three levels of specificity in the Ohsumed collection. Precision (0 to 0.7) plotted against Recall (0 to 0.7). Series: Journal STR, Specialty STR, General STR]

Test documents: 745
Specialties: 3

Inspec

http://metadata.sims.berkeley.edu/str/inspec/inspec.html

Ohsumed

http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html

Exploratory Web Interfaces


Summary

1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area

2. Do specialty dialects really differ?
– Inspec specialties: term overlap 50%, suggestions overlap 30%
– Ohsumed specialties: term overlap 30%, suggestions overlap 30%

3. Is performance improved when focusing on specialty dialects?
– Inspec specialties: 10% improvement over general STR
– Ohsumed specialties: 25% improvement over general STR

4. How specific should specialties be?
– Depends on language differences & collection size

Overcoming the Language Problem in Search

Search Term Recommender:

See also:

FIDDLES

50% Discount!

Thank you!

[email protected]

Explanatory Slides

“how to obtain the right information for the right user at the right time” (Chu, 2003)

Decision Process under Uncertainty

Information Retrieval

• Searching the Needle in the Haystack

• Which Needle in which Haystack

• How to express the Needle and the Haystack

Language Problem in Information Retrieval

Decision Process under Uncertainty

Uncertainty: Searching the Needle in the Haystack


400 Documents. 20-40 Documents

Blair (1996): for a small collection (40,000 documents) only 0.25-0.5% of the documents were relevant to any given query

• a known needle in a known haystack;
• a known needle in an unknown haystack;
• an unknown needle in an unknown haystack;
• any needle in a haystack;
• the sharpest needle in a haystack;
• most of the sharpest needles in a haystack;
• all the needles in a haystack;
• affirmation of no needles in the haystack;
• things like needles in any haystack;
• let me know whenever a new needle shows up;
• where are the haystacks?; and
• needles, haystacks – whatever

(Koll, 2000)

Uncertainty: Which Needle in which Haystack

How to alleviate language ambiguity?

Ludwig Wittgenstein:
• Language games
• Language regions



• Divide information collection by specialty

• Association between
  – specialty terms
  – documentary terms (subject metadata)

• Recommend highly associated terms

The Search Term Recommender Methodology
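The three methodology steps can be sketched as a small co-occurrence model. The slides do not name the association measure, so the sketch assumes a simple one: the fraction of training documents containing a text term that also carry a given subject term. The function names `train_str` and `recommend` and all training data are hypothetical.

```python
from collections import Counter, defaultdict

def train_str(documents):
    """Count co-occurrences of text terms and subject metadata terms.

    `documents` is an iterable of (text_terms, subject_terms) pairs.
    """
    cooccurrence = defaultdict(Counter)
    term_freq = Counter()
    for text_terms, subject_terms in documents:
        for term in set(text_terms):
            term_freq[term] += 1
            cooccurrence[term].update(set(subject_terms))
    return cooccurrence, term_freq

def recommend(query_terms, model, k=3):
    """Return the k subject terms most strongly associated with the query."""
    cooccurrence, term_freq = model
    scores = Counter()
    for term in query_terms:
        for subject, joint in cooccurrence.get(term, {}).items():
            # Assumed measure: share of documents with `term` that carry `subject`.
            scores[subject] += joint / term_freq[term]
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [subject for subject, _ in ranked[:k]]

# Hypothetical training data: one specialty and the rest of the collection.
astronomy = [
    (["clusters", "protostars", "cloud"], ["Star clusters", "Interstellar molecular clouds"]),
    (["cloud", "cores"], ["Interstellar molecular clouds"]),
]
other = [
    (["clusters", "search"], ["Search problems"]),
    (["atomic", "clusters"], ["Atomic clusters"]),
]

specialty_model = train_str(astronomy)          # one model per specialty
general_model = train_str(astronomy + other)    # one model for the whole collection

specialty_terms = recommend(["search", "clusters"], specialty_model, k=2)
general_terms = recommend(["search", "clusters"], general_model, k=2)
```

In this toy data the specialty model suggests astronomy terms for the query, while the general model is pulled toward terms from other fields, which is the effect the comparison between specialty and general recommenders is meant to measure.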

Differences in Vocabulary – Inspec Documentary Language (Inspec Descriptor overlap)

Overlap between controlled vocabulary terms in three specialty collections in Inspec (total terms analyzed: 8,447)

[Venn diagram of three specialties: Physics, Electrical Engineering, Computers. Region percentages as shown: 62%, 2%, 15%, 1%, 2%, 9%, 8%]

Differences in Vocabulary – Inspec specialty suggestions (suggested term overlap)

Variation in the controlled vocabulary terms suggested by the three Inspec specialty STRs (7 terms suggested by each, averaged over 300 queries).

7 suggested terms from each of 3 specialties
Maximum overlap: 7 unique terms
No overlap: 21 unique terms
Average number of unique terms suggested: 15.6

Unique terms suggested    Share of queries
10-13                     17%
14-17                     63%
18-20                     20%
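The bounds on this slide follow from counting distinct terms across the specialty suggestion lists: identical lists give 7 unique terms, disjoint lists give 3 × 7 = 21. A minimal sketch with placeholder term names:

```python
def count_unique(suggestions_per_specialty):
    """Number of distinct terms across the suggestion lists of all specialty STRs."""
    unique = set()
    for terms in suggestions_per_specialty:
        unique.update(terms)
    return len(unique)

# Placeholder term names; 7 suggestions from each of 3 specialty STRs.
identical = [[f"t{i}" for i in range(7)] for _ in range(3)]
disjoint = [[f"s{s}-t{i}" for i in range(7)] for s in range(3)]

max_overlap = count_unique(identical)
no_overlap = count_unique(disjoint)
```

Averaging this count over a set of queries yields a figure comparable to the reported average number of unique terms.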

Differences in Vocabulary – Ohsumed Documentary Language (MeSH Heading overlap)

Overlap between controlled vocabulary terms in three specialty collections in Ohsumed (total terms analyzed: 5,376)

[Venn diagram of three specialties: Communicable Diseases, Gynecology, Orthopedics. Region percentages as shown: 8%, 34%, 12%, 12%, 3%, 23%, 9%]

Differences in Vocabulary – Ohsumed specialty suggestions (suggested term overlap)

Variation in the controlled vocabulary terms suggested by the 33 Ohsumed specialty STRs (3 terms suggested by each, averaged over 283 queries).

3 suggested terms from each of 33 specialties
Maximum overlap: 3 unique terms
No overlap: 99 unique terms
Average number of unique terms suggested: 61.9

Unique terms suggested    Share of queries
30-49                     19%
50-69                     54%
70-98                     27%

Does a search term recommender utilizing specialty dialects perform better than a search term recommender based on the general language of the whole collection?

Test documents from specialty were classified with specialty search term recommender & general search term recommender

Search statement: title of test document
Suggested terms: controlled vocabulary

Performance of the STR: Automatic Classification

How many of the originally assigned controlled vocabulary terms were suggested for a test document?

$$\text{Recall} = \frac{|\text{Originally assigned} \cap \text{Suggested}|}{|\text{Originally assigned}|}$$

$$\text{Precision} = \frac{|\text{Originally assigned} \cap \text{Suggested}|}{|\text{Suggested}|}$$

Automatic Classification: Evaluation

Example: 4 originally assigned terms

Cut-off level    Originally assigned terms suggested    Recall    Precision
1                1                                      0.25      1
2                1                                      0.25      0.5
3                2                                      0.5       0.66
4                2                                      0.5       0.5
5                3                                      0.75      0.6
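The cut-off table can be reproduced with a short function. The term names below are placeholders; the ranking is chosen so that the originally assigned terms appear at ranks 1, 3 and 5, which is the pattern the table implies.

```python
def recall_precision_at(suggested, assigned, cutoff):
    """Recall and precision over the first `cutoff` suggested terms.

    `suggested` is a ranked list, `assigned` the set of originally assigned
    terms. Assumes cutoff >= 1 and a non-empty `assigned` set.
    """
    top = suggested[:cutoff]
    hits = sum(1 for term in top if term in assigned)
    return hits / len(assigned), hits / len(top)

# Placeholder terms: a1..a4 were originally assigned, x1 and x2 were not.
assigned = {"a1", "a2", "a3", "a4"}
suggested = ["a1", "x1", "a2", "x2", "a3"]

rows = [(k, *recall_precision_at(suggested, assigned, k)) for k in range(1, 6)]
for cutoff, recall, precision in rows:
    print(f"{cutoff}  recall={recall:.2f}  precision={precision:.2f}")
```

Recall can only stay level or rise as the cut-off grows, while precision may fall whenever a suggestion is not among the originally assigned terms.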