Translating Dialects in Search:
Mapping between Specialized Languages of Discourse
and Documentary Languages
Vivien Petras
UC Berkeley School of Information
Overcoming the Language Problem in Search
How can someone searching for violins be made aware that there are also fiddles (and vice versa)?
• The Language Problem in Information Retrieval
• Dialects & Contexts
• The Search Term Recommender
• 4 Research Questions
• Exploratory Web Interface
Outline
[Diagram: Searcher → Concept Space → Question → Search Statement; Author → Concept Space → Text → Document; the Search Statement and the Document meet at "Match!"]
• Mapping between searcher and IR system
• Mapping between author and IR system
• Mapping between search statement and document
IR = Language Mapping Exercise
Information Retrieval
A search statement needs to describe the:
• searcher’s question (information need)
• documents that are relevant to a searcher’s question
In Semiotics:
Unlimited semiosis
In Information Science:
Inter-indexer inconsistency
The Language Problem
How to alleviate language ambiguity for search term selection?
Wittgenstein Philosophy of language:
Language is disambiguated within contexts and specialized dialects.
Support search term selection:
• Within the dialect of a specialized community
• In context
• Using the language of documents (for term matching)
Dialects and Contexts
[Diagram: a Search Statement is sent to the Search Term Recommender, which draws on the specialties of the Information Collection and answers "Did you mean…" with a list of specialty terms]
Search Term Recommender
• Term selection support (query expansion & reformulation)
• Automatic classification
• Terminology mapping
The Search Term Recommender: Applications
1. How can specialties & specialty dialects be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
Tested on 2 bibliographic databases:
• Inspec
• Medline (Ohsumed collection)
The Search Term Recommender - Questions
• Physics, Electrical and Electronic Engineering, Computers and Control
• Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes
• Test collection:
Inspec
Number of documents: 427,340
Descriptors per document: 6.99
• Biomedicine and Health
• Document: author, title, source, publication year, publication type, abstract, Mesh Headings
• Test collection:
Medline Ohsumed Collection
Number of documents: 168,463
Mesh Headings per document: 3.11
1. How can specialties be identified in an information collection?
The Search Term Recommender System - Questions
Determine specialty documents in the collection:
• Domain terminology
• Publication source
• Bibliometric analysis
• Social network analysis
• Subject-specific classification
Inspec test collection
• by top-level categories in the Inspec classification
• 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control
Ohsumed test collection
• by journals grouped by subject
• 33 specialties
Identification of Specialties in an Information Collection
2. Do specialty dialects really differ?
The Search Term Recommender System - Questions
Differences in specialty dialects (specialty term overlap)
Differences in documentary languages (subject metadata term overlap)
Differences in search term recommender suggestions (term suggestion overlap)
Differences in Language
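One way to make "term overlap" concrete is to count, for every term, which specialties use it, and then report each exclusive region of the overlap diagram as a share of all distinct terms. The sketch below assumes that reading of the slide; the function name `venn_region_shares` and the toy vocabularies are illustrative and are not taken from the talk.

```python
def venn_region_shares(vocabularies):
    """Share of all distinct terms that fall into each exclusive overlap region.

    `vocabularies` maps a specialty name to its set of terms. A region is
    identified by the frozenset of specialties that use a given term.
    """
    all_terms = set().union(*vocabularies.values())
    counts = {}
    for term in all_terms:
        region = frozenset(
            name for name, vocab in vocabularies.items() if term in vocab
        )
        counts[region] = counts.get(region, 0) + 1
    total = len(all_terms)
    return {region: n / total for region, n in counts.items()}


# Toy vocabularies (hypothetical, for illustration only).
vocabularies = {
    "Physics": {"cloud", "star", "field", "cluster"},
    "Electrical Engineering": {"field", "circuit", "cluster"},
    "Computers": {"cluster", "search", "circuit"},
}

shares = venn_region_shares(vocabularies)
for region, share in sorted(shares.items(), key=lambda item: sorted(item[0])):
    print(" & ".join(sorted(region)), f"{share:.0%}")
```

Three specialties give at most seven non-empty regions, which matches the seven percentages shown on the overlap slides.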
Inspec Dialects (specialty term overlap)
[Figure: overlap diagram for three specialties: Physics, Electrical Engineering, Computers. Region shares as extracted: 20%, 7%, 13%, 13%, 4%, 33%, 13%]
Terms analyzed: 60,601
Subject metadata term overlap: 87%
Suggested term overlap: 30%
Ohsumed Dialects (specialty term overlap)
[Figure: overlap diagram for three specialties: Communicable Diseases, Gynecology, Orthopedics. Region shares as extracted: 13%, 29%, 8%, 19%, 2%, 21%, 7%]
Terms analyzed: 11,663
Subject metadata term overlap: 32%
Suggested term overlap: 30%
3. Is performance improved when focusing on specialty dialects?
The Search Term Recommender System - Questions
• Suggest subject metadata for documents
• Comparison: specialty vs. general term suggestions
Automatic classification
Title: “A search for clusters of protostars in Orion cloud cores”
Originally assigned terms:
1. Infrared sources (astronomical)
2. Interstellar molecular clouds
3. Pre-main-sequence stars
4. Star associations
Specialty Search Term Recommender:
1. Clouds
2. Clusters of galaxies
3. Interstellar molecular clouds
4. Star clusters
5. Pre-main-sequence stars
General Search Term Recommender:
1. Search problems
2. Clouds
3. Atomic clusters
4. Clusters of galaxies
5. Interstellar molecular clouds
Evaluation:
Recall (hit rate): Specialty STR 2/4 = 0.5; General STR 1/4 = 0.25
Precision (accuracy): Specialty STR 2/5 = 0.4; General STR 1/5 = 0.2
Automatic Classification
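The recall and precision figures for this example can be checked with a few lines of code. This is a minimal sketch that treats both term lists as sets; the function name `recall_precision` is an illustrative choice, while the term lists are copied from the slide.

```python
def recall_precision(assigned, suggested):
    """Recall and precision of suggested terms against originally assigned terms."""
    hits = len(set(assigned) & set(suggested))
    return hits / len(assigned), hits / len(suggested)


assigned = [
    "Infrared sources (astronomical)",
    "Interstellar molecular clouds",
    "Pre-main-sequence stars",
    "Star associations",
]
specialty_suggestions = [
    "Clouds",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
    "Star clusters",
    "Pre-main-sequence stars",
]
general_suggestions = [
    "Search problems",
    "Clouds",
    "Atomic clusters",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
]

specialty_scores = recall_precision(assigned, specialty_suggestions)
general_scores = recall_precision(assigned, general_suggestions)
print("Specialty STR (recall, precision):", specialty_scores)  # (0.5, 0.4)
print("General STR (recall, precision):", general_scores)      # (0.25, 0.2)
```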
Performance of the STR: Inspec
[Figure: precision-recall plot "Inspec specialties and general STRs"; x-axis: Recall, y-axis: Precision; series: Individual Specialty STRs, General STR]
Test Documents: 42,735
Specialties: 3
First 3 suggested:
Recall: 13.6%
Precision: 11.2%
Performance of the STR: Ohsumed
[Figure: precision-recall plot "Ohsumed specialties and general STR"; x-axis: Recall, y-axis: Precision; series: Individual Specialty STRs, General STR]
Test Documents: 18,733
Specialties: 33
First 3 suggested:
Recall: 26%
Precision: 25.6%
4. How specific should specialties be?
The Search Term Recommender System - Questions
• Language differences
• Representative sample of specialty language for training
Specificity of Specialties
Identifying subspecialties by classification hierarchy
– e.g. Computers & Control -- Computer Hardware -- Circuits & Devices
Specificity of Specialties - Inspec
[Figure: precision-recall plot "Four levels of specificity in the Inspec collection"; x-axis: Recall, y-axis: Precision; series: Sub-sub specialty STR, Sub-specialty STR, Specialty STR, General STR]
Test documents: 2,425
Specialties: 3
Identifying subspecialties by journal within subject
– e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal
Specificity of Specialties - Ohsumed
[Figure: precision-recall plot "Three levels of specificity in the Ohsumed Collection"; x-axis: Recall, y-axis: Precision; series: Journal STR, Specialty STR, General STR]
Test documents: 745
Specialties: 3
Inspec
http://metadata.sims.berkeley.edu/str/inspec/inspec.html
Ohsumed
http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html
Exploratory Web Interfaces
1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area
2. Do specialty dialects really differ?
– Inspec specialties: term overlap 50%, suggestions overlap 30%
– Ohsumed specialties: term overlap 30%, suggestions overlap 30%
3. Is performance improved when focusing on specialty dialects?
– Inspec specialties: 10% improvement over general STR
– Ohsumed specialties: 25% improvement over general STR
4. How specific should specialties be?
– Depends on language differences & collection size
Summary
Overcoming the Language Problem in Search
Search Term Recommender:
See also:
FIDDLES
50% Discount!
Thank you!
Explanatory Slides
“how to obtain the right information for the right user at the right time” (Chu, 2003)
Decision Process under Uncertainty
Information Retrieval
• Searching the Needle in the Haystack
• Which Needle in which Haystack
• How to express the Needle and the Haystack
Language Problem in Information Retrieval
Decision Process under Uncertainty
Uncertainty: Searching the Needle in the Haystack
400 Documents. 20-40 Documents
Blair (1996): for a small collection (40,000 documents) only 0.25-0.5% of the documents were relevant to any given query
• a known needle in a known haystack;
• a known needle in an unknown haystack;
• an unknown needle in an unknown haystack;
• any needle in a haystack;
• the sharpest needle in a haystack;
• most of the sharpest needles in a haystack;
• all the needles in a haystack;
• affirmation of no needles in the haystack;
• things like needles in any haystack;
• let me know whenever a new needle shows up;
• where are the haystacks?; and
• needles, haystacks – whatever
(Koll, 2000)
Uncertainty: Which Needle in which Haystack
How to alleviate language ambiguity?
Ludwig Wittgenstein:
• Language games
• Language regions
Language is disambiguated within contexts and specialized dialects.
Dialects and Contexts
• Divide information collection by specialty
• Association between
  – specialty terms
  – documentary terms (subject metadata)
• Recommend highly associated terms
The Search Term Recommender Methodology
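The three methodology steps can be sketched in code. The slides do not name the association measure, so this sketch assumes a Dice coefficient over document co-occurrence; the class name `SearchTermRecommender`, its methods, and the toy training records (including the descriptor "Cluster computing") are hypothetical and are not the system described in the talk.

```python
from collections import Counter, defaultdict


class SearchTermRecommender:
    """Toy recommender: associates title words with subject metadata terms."""

    def __init__(self):
        self.word_df = Counter()          # documents containing each word
        self.desc_df = Counter()          # documents carrying each descriptor
        self.cooc = defaultdict(Counter)  # word -> descriptor -> co-occurrences

    def train(self, documents):
        for text, descriptors in documents:
            words = set(text.lower().split())
            descs = set(descriptors)
            self.word_df.update(words)
            self.desc_df.update(descs)
            for word in words:
                self.cooc[word].update(descs)

    def recommend(self, query, k=3):
        scores = Counter()
        for word in set(query.lower().split()):
            for desc, n in self.cooc.get(word, {}).items():
                # Assumed association measure: Dice coefficient.
                scores[desc] += 2 * n / (self.word_df[word] + self.desc_df[desc])
        # Sort by score, then alphabetically, so ties are deterministic.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        return [desc for desc, _ in ranked[:k]]


# Hypothetical training records: (specialty, title, subject metadata).
collection = [
    ("Physics", "star clusters in molecular clouds",
     ["Star clusters", "Interstellar molecular clouds"]),
    ("Physics", "protostars in cloud cores",
     ["Pre-main-sequence stars", "Interstellar molecular clouds"]),
    ("Computers", "search in computer clusters",
     ["Search problems", "Cluster computing"]),
]

# Step 1: divide the collection by specialty; also train a general STR.
physics_str = SearchTermRecommender()
physics_str.train([(t, d) for s, t, d in collection if s == "Physics"])
general_str = SearchTermRecommender()
general_str.train([(t, d) for _, t, d in collection])

# Steps 2 and 3: score associations and recommend the strongest terms.
physics_terms = physics_str.recommend("clusters of protostars")
general_terms = general_str.recommend("clusters of protostars")
print(physics_terms)
print(general_terms)
```

In this toy run the general recommender lets a computing descriptor into the top three for an astronomy query, while the Physics recommender stays within its own dialect.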
Differences in Vocabulary – Inspec Documentary Language (Inspec Descriptor overlap)
Overlap between controlled vocabulary terms in three specialty collections in Inspec (total terms analyzed: 8,447)
[Figure: overlap diagram for three specialties: Physics, Electrical Engineering, Computers. Region shares as extracted: 62%, 2%, 15%, 1%, 2%, 9%, 8%]
Differences in Vocabulary – Inspec specialty suggestions (suggested term overlap)
Variation in controlled vocabulary term suggestions (7 terms are suggested) between three Inspec specialty STRs (averaged over 300 queries).
7 suggested terms from each of 3 specialties
Maximum overlap: 7 unique terms
No overlap: 21 unique terms
Average number of unique terms suggested: 15.6
[Figure: distribution of the number of unique terms suggested]
10-13 unique terms: 17%
14-17 unique terms: 63%
18-20 unique terms: 20%
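The "unique terms suggested" figure can be read as the size of the union of the suggestion lists returned by the specialty STRs for one query. A minimal sketch under that assumption follows; the function name and the toy suggestion lists are illustrative, and the lists use 3 suggestions per specialty rather than 7 to stay short.

```python
def unique_suggestion_count(suggestions_by_specialty):
    """Number of distinct terms suggested across all specialty STRs for one query."""
    unique = set()
    for terms in suggestions_by_specialty.values():
        unique.update(terms)
    return len(unique)


# Toy suggestion lists for one query (hypothetical).
suggestions = {
    "Physics": ["Clouds", "Star clusters", "Interstellar molecular clouds"],
    "Electrical Engineering": ["Clouds", "Atomic clusters", "Search problems"],
    "Computers": ["Search problems", "Clouds", "Cluster computing"],
}

n_unique = unique_suggestion_count(suggestions)
n_lists = len(suggestions)
n_per_list = 3
print(n_unique, "unique terms; possible range:", n_per_list, "to", n_per_list * n_lists)
```

With 7 suggestions from each of 3 specialties the same bounds give 7 (identical lists) to 21 (no shared term), as stated on the slide.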
Differences in Vocabulary – Ohsumed Documentary Language (Mesh Heading overlap)
Overlap between controlled vocabulary terms in three specialty collections in Ohsumed (total terms analyzed: 5,376)
[Figure: overlap diagram for three specialties: Communicable Diseases, Gynecology, Orthopedics. Region shares as extracted: 8%, 34%, 12%, 12%, 3%, 23%, 9%]
Differences in Vocabulary – Ohsumed specialty suggestions (suggested term overlap)
Variation in controlled vocabulary term suggestions (3 terms are suggested) between 33 Ohsumed specialty STRs (averaged over 283 queries).
3 suggested terms from each of 33 specialties
Maximum overlap: 3 unique terms
No overlap: 99 unique terms
Average number of unique terms suggested: 61.9
[Figure: distribution of the number of unique terms suggested]
30-49 unique terms: 19%
50-69 unique terms: 54%
70-98 unique terms: 27%
Does a search term recommender utilizing specialty dialects perform better than a search term recommender based on the general language of the whole collection?
Test documents from specialty were classified with specialty search term recommender & general search term recommender
Search statement: title of test document
Suggested terms: controlled vocabulary
Performance of the STR: Automatic Classification
How many of the originally assigned controlled vocabulary terms were suggested for a test document?
$$\text{Recall} = \frac{|\text{Originally assigned} \cap \text{Suggested}|}{|\text{Originally assigned}|}$$
$$\text{Precision} = \frac{|\text{Originally assigned} \cap \text{Suggested}|}{|\text{Suggested}|}$$
Example: 4 originally assigned terms
Automatic Classification: Evaluation
Cut-off level | Originally assigned terms suggested | Recall | Precision
1 | 1 | 0.25 | 1.00
2 | 1 | 0.25 | 0.50
3 | 2 | 0.50 | 0.67
4 | 2 | 0.50 | 0.50
5 | 3 | 0.75 | 0.60
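The cut-off table follows from walking down the ranked suggestion list and recomputing recall and precision after each rank. The sketch below reproduces it under the assumption that the hits fall at ranks 1, 3 and 5, which is what the table's hit counts imply; the term names are placeholders, not terms from the talk.

```python
def precision_recall_at_cutoffs(assigned, ranked_suggestions):
    """Rows of (cut-off, hits so far, recall, precision) for a ranked list."""
    assigned_set = set(assigned)
    hits = 0
    rows = []
    for cutoff, term in enumerate(ranked_suggestions, start=1):
        if term in assigned_set:
            hits += 1
        rows.append((cutoff, hits, hits / len(assigned_set), hits / cutoff))
    return rows


# Placeholder terms: 4 originally assigned, 5 suggested, hits at ranks 1, 3, 5.
assigned = ["assigned-1", "assigned-2", "assigned-3", "assigned-4"]
ranked = ["assigned-1", "other-1", "assigned-2", "other-2", "assigned-3"]

rows = precision_recall_at_cutoffs(assigned, ranked)
for cutoff, hits, recall, precision in rows:
    print(cutoff, hits, round(recall, 2), round(precision, 2))
```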