WP 10 Multilingual Access
Philipp Daumke, Stefan Schulz
Multilingual Access - Rationale
English as First Language
English as Second Language
No English Language Skills
English as a Foreign Language
• < 70 % of the world's scientists read in English
• 80 % of the world's electronically stored information is in English
• 90 % of Medline articles are in English (2000)
Sources: The British Council, 2005; Fung ICH: Open access for the non-English-speaking world: overcoming the language barrier. Emerging Themes in Epidemiology, 2008
Non-native speakers
• Broad range of command of English
• Reading skills > writing skills
• Reduced active vocabulary
Difficulty in formulating precise queries
Cross-language document retrieval example
• German query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
• English translation: “Correlation of high blood pressure and lesion of the white substance”
BootStrep WP 10 - Multilingual access
• Objectives:
– To provide a multilingual search interface to the BootStrep Biolexicon / Bioontology
– We do NOT propose to deliver a multilingual extension of the BootStrep biolexicon
• Query Languages: French, German, English, (Italian)
• Output language: English
• Method: Subword-based semantic indexing
• Resources:
– MorphoSaurus multilingual subword lexicon & thesaurus
– MorphoSaurus Semantic Indexer
Technique: Morphosemantic Indexing
• Subword-based, multilingual semantic indexing for document retrieval
• Subwords are atomic, conceptual or linguistic units:
– Stems: stomach, gastr, diaphys
– Prefixes: anti-, bi-, hyper-
– Suffixes: -ary, -ion, -itis
– Infixes: -o-, -s-
• Equivalence classes contain synonymous subwords and their translations:
– #derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … }
– #inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog, ... }
Subword Thesaurus Structure
• Segmentation:
– Myo|kard|itis
– Herz|muskel|entzünd|ung
– Inflamm|ation of the heart muscle
• Equivalence classes (Eq Class) and their subwords:
– MUSCLE: muscle, myo, muskel, muscul
– HEART: herz, heart, card, corazon, card
– INFLAMM: inflamm, -itis, inflam, entzünd
• Indexation:
– #muscle #heart #inflamm
– #heart #muscle #inflamm
– #inflamm #heart #muscle
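The segmentation and indexation steps above can be sketched in Python. This is a minimal illustration, not the MorphoSaurus implementation: the greedy longest-match strategy, the `LEXICON` dictionary, and the function names `segment` and `index` are assumptions made for this sketch, and the lexicon holds only the subwords shown in the example.

```python
# Toy subword lexicon: subword -> equivalence class (MID).
# Entries come from the example above. None marks subwords that carry no
# indexed meaning here (an assumption made for this sketch).
LEXICON = {
    "myo": "#muscle", "muskel": "#muscle", "muscle": "#muscle",
    "kard": "#heart", "card": "#heart", "herz": "#heart", "heart": "#heart",
    "itis": "#inflamm", "entzünd": "#inflamm", "inflamm": "#inflamm",
    "ung": None, "ation": None, "of": None, "the": None,
}


def segment(word, lexicon=LEXICON):
    """Split a word into known subwords by greedy longest match, left to right."""
    word = word.lower()
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in lexicon:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            pos += 1  # no known subword starts here; skip one character
    return parts


def index(text, lexicon=LEXICON):
    """Replace every meaning-bearing subword of a text by its MID."""
    mids = []
    for word in text.split():
        for part in segment(word, lexicon):
            mid = lexicon[part]
            if mid is not None:
                mids.append(mid)
    return mids


for term in ("Myokarditis", "Herzmuskelentzündung",
             "Inflammation of the heart muscle"):
    print(term, "->", " ".join(index(term)))
```

All three terms map to the same set of MIDs, which is what makes them comparable across languages; only the order of the identifiers differs.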
• Thesaurus: ~21,000 equivalence classes (MIDs)
• Lexicon entries:
– English: ~23,000
– German: ~24,000
– Portuguese: ~15,000
– Spanish: ~11,000
– French: ~8,000
– Swedish: ~10,000
– Italian: ~4,000
Indexing Pipeline
Subword-based document transformation
Morphosemantic indexer
Subword-Based Search
• Query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
• Subword-based query transformation: #correl #hyper #tens #lesion #whit #matter
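Because queries and documents are transformed into the same MID space, a German query can be matched against English text. The sketch below illustrates this under stated assumptions: the toy `LEXICON`, the sample documents, the helper names `to_mids` and `search`, and the overlap-based score are all invented for illustration and are not taken from the MorphoSaurus search engine.

```python
# Toy lexicon: subword -> equivalence class (MID). The German/English pairs
# mirror the example query above; a real lexicon is far larger.
LEXICON = {
    "korrel": "#correl", "correl": "#correl",
    "hyper": "#hyper",
    "ton": "#tens", "tens": "#tens",
    "läsion": "#lesion", "lesion": "#lesion",
    "weiß": "#whit", "whit": "#whit",
    "substanz": "#matter", "matter": "#matter",
}


def to_mids(text):
    """Map free text to the set of MIDs of all subwords found in it."""
    mids = set()
    for word in text.lower().split():
        pos = 0
        while pos < len(word):
            # Greedy longest match; skip one character if nothing matches.
            for end in range(len(word), pos, -1):
                if word[pos:end] in LEXICON:
                    mids.add(LEXICON[word[pos:end]])
                    pos = end
                    break
            else:
                pos += 1
    return mids


def search(query, documents):
    """Rank documents by the fraction of query MIDs they contain."""
    query_mids = to_mids(query)
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_mids & to_mids(text))
        if overlap:
            scored.append((doc_id, overlap / len(query_mids)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical English documents, invented for this sketch.
DOCUMENTS = {
    "doc1": "Correlation of hypertension and white matter lesions",
    "doc2": "Gene regulation in Escherichia coli",
    "doc3": "Hypertension treatment",
}

QUERY = "Korrelation von Hypertonie und Läsion der Weißen Substanz…"
print(sorted(to_mids(QUERY)))
print(search(QUERY, DOCUMENTS))
```

A production system would weight MIDs (for example by frequency) rather than count raw overlap; the plain overlap ratio is used here only to keep the example short.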
Adapting Morphosemantic Indexing to BootStrep
• BootStrep terminology mostly disjoint from existing clinical terminology
• Enhancement of data resources (e.g. for acronym resolution, multi-term equivalences)
• BootStrep Terms for multilingual access
– Gene Ontology, InterPro, IntAct, Gene Regulation Ontology, Species
• Medline subcorpus (about E. coli gene regulation)
Ongoing/Completed Tasks
• Manual training of MorphoSaurus lexica by means of the BootStrep corpora (en, de, fr)
• Multilingual Terminology Browser
– 2268 GO terms + translations
– 6925 InterPro terms + translations
– 2082 IntAct terms + translations
– URL: http://www.medinf.uni-freiburg.de/demo/BootStrepBrowser/
• Multilingual Search Engine:
– Document collection: BootStrep-Medline subset
– Languages: English, German, French
– Query modes: Author, Title, Title + keywords, All
Terminology Browser
[Screenshot panels: Navigation, Search Results, Further Information]
Multilingual Search Engine
To do: Tools and Resources
• BootStrep-Browser
– Integration of Species
– Integration of the Gene Regulation Ontology
• Multilingual Search Engine
– Multilingual treatment of acronyms
– Inclusion of species synonym list
– Dealing with mixed queries (German-English, English-French)
– Integration with the fact store
• Continue lexicon population
– Italian terms?
To do: Evaluation
• Creation of a gold standard
– Typical English queries
– Find all relevant documents in the E. coli subset
• CLIR experiments
– Translate queries to French and German
– Compare mean average precision
• Reuse of existing routines on standard benchmarks (OHSUMED, ImageCLEF)
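The planned comparison of mean average precision can be sketched as follows. The function names, the gold standard, and the two ranked runs are hypothetical toy data invented for this illustration; they are not BootStrep results. The `cutoff` parameter is an assumption that allows a top-k variant of the measure.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=None):
    """Average precision of one ranked result list against a gold standard."""
    if cutoff is not None:
        ranked_ids = ranked_ids[:cutoff]
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs, gold, cutoff=None):
    """Mean of the per-query average precision over all gold-standard queries."""
    total = sum(average_precision(runs[q], gold[q], cutoff) for q in gold)
    return total / len(gold)


# Toy gold standard and ranked runs (invented, not measured).
GOLD = {"q1": {"d1", "d3"}, "q2": {"d2"}}
ENGLISH_RUN = {"q1": ["d1", "d3", "d2"], "q2": ["d2", "d1"]}  # baseline
GERMAN_RUN = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d2"]}   # translated queries

map_en = mean_average_precision(ENGLISH_RUN, GOLD)
map_de = mean_average_precision(GERMAN_RUN, GOLD)
print(f"German run reaches {100 * map_de / map_en:.1f} % of the English baseline")
```

Reporting each cross-language run as a percentage of the monolingual English baseline makes languages directly comparable.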
ImageCLEFMed Benchmark
[Figure: bar chart of Top 20 Average Precision as Percent of Baseline (axis 0 to 100), by Language and Condition. Languages: EN, DE, PT, SP, FR, SV, AV. Conditions: Query Translation, Morphosaurus, Morphosaurus+D]
• Baseline: monolingual
– Stemmed English queries
– Stemmed English texts
• Query translation
– Google translator
– Multilingual dictionary compiled from UMLS
• Morphosemantic Indexing
– Interlingual representation of user queries and documents
• Morphosemantic Indexing
– Incorporating disambiguation module