1
The Domain-Specific Track at CLEF 2008
Vivien Petras & Stefan Baerisch
GESIS Social Science Information Centre, Bonn, Germany
Aarhus, Denmark, September 17, 2008
2
Outline
• The Domain-Specific Task
• Collections & Controlled Vocabularies
• Participants, Runs & Relevance Assessments
• Themes
• Outlook
3
The Domain-Specific Task
CLIR on structured scientific document collections:
• social science domain
• bibliographic metadata
• controlled vocabularies for subject description
Leverage for:
• search
• query expansion
• translation
4
The Domain-Specific Task
Tasks:
• Monolingual: against German, English or Russian
• Bilingual: against German, English or Russian
• Multilingual: against combined collection
Topics:
• 25 topics in standard TREC format (title, desc, narr)
• suggestions from 28 subject specialties in the Social Sciences
• translated from German into English and Russian
5
Collections
              German      English                 Russian
Name          GIRT-DE     GIRT-EN     CSA-SA      ISISS
Coverage      1990-2000   1990-2000   1994-1996
Docs          151,319     151,319     20,000      145,802
Abstracts     96%         17%         94%         27%
Description:
• GIRT-DE: German social science literature & projects
• GIRT-EN: GIRT-DE translated
• CSA-SA: Sociological Abstracts
• ISISS: Institute of Scientific Information for Social Sciences of the Russian Academy of Sciences
6
Controlled Vocabularies
                     GIRT   CSA-SA   INION
Descriptors / doc    10     6.4      3.9
Class. codes / doc   2      1.3      n/a
5 different subject-describing terminologies:
• Thesaurus for the Social Sciences (GIRT-DE, -EN)
• Thesaurus of Sociological Indexing Terms (CSA-SA)
• INION Thesaurus (ISISS)
• Social Sciences Classification (GIRT-DE, -EN)
• Sociological Abstracts Classification (CSA-SA)
7
Controlled Vocabularies – Mapping Tools
Translation:
• GIRT German → GIRT English, GIRT Russian
• INION Russian → INION English
Term mappings:
• equivalent terms in vocabularies
• GIRT German / English ↔ CSA-SA English
• GIRT German ↔ INION Russian
counseling for the aged → Counseling + Elderly
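The example mapping on this slide pairs one source term with a combination of two target terms. A minimal sketch of how such a mapping table could be applied to a list of descriptors follows. The `MAPPINGS` table and the `map_terms` function are hypothetical names for illustration; only the single entry shown comes from the slide, and the actual mapping data format is not described here.

```python
# Hypothetical mapping table: source descriptor -> list of target descriptors.
# Only this one entry is taken from the slide; real mappings would be loaded
# from the vocabulary mapping data.
MAPPINGS = {
    "counseling for the aged": ["Counseling", "Elderly"],
}


def map_terms(descriptors, mappings=MAPPINGS):
    """Map source-vocabulary descriptors to target-vocabulary terms.

    Returns (mapped, unmapped): the target terms in first-seen order
    without duplicates, and the source descriptors that had no mapping.
    """
    mapped, unmapped = [], []
    for descriptor in descriptors:
        targets = mappings.get(descriptor.strip().lower())
        if targets is None:
            unmapped.append(descriptor)
            continue
        for target in targets:
            if target not in mapped:
                mapped.append(target)
    return mapped, unmapped


mapped, unmapped = map_terms(["Counseling for the Aged", "unknown term"])
print(mapped, unmapped)
```

A one-to-many entry like this one expands a single descriptor into several target terms, which a retrieval system could then combine with AND or treat as separate query terms; the slide does not say which.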
8
Participants
6 groups
Group       Institution                                             Country
Amsterdam   University of Amsterdam                                 The Netherlands
Chemnitz    Media Informatics, Chemnitz University of Technology    Germany
Cheshire    School of Information, UC Berkeley                      USA
Darmstadt   Technical University Darmstadt                          Germany
Hug         University Hospitals Geneva                             Switzerland
Unine       Computer Science Department, University of Neuchatel    Switzerland
9
Runs
Task                Runs 2008   Runs 2007   Runs 2006
Monolingual
- against German    10          13          13
- against English   12          15          8
- against Russian   9           11          1
Bilingual
- against German    12          14          6
- against English   9           15          3
- against Russian   8           9           3
Multilingual        9           9           2
Total               69          86          36
10
Relevance Assessments
                 German   English   Russian
Pool size        14,793   14,835    13,930
Rel. Docs 2008   15%      14%       2%*
Rel. Docs 2007   22%      25%       10%**
Rel. Docs 2006   39%      26%       n/a
* In Russian collection: 1 topic without relevant docs
** 3 topics without relevant docs
11
Relevance Assessments – Best MAP
Task                Best MAP 2008   Best MAP 2007   Best MAP 2006
Monolingual
- against German    0.4537          0.5051          0.5454
- against English   0.3891          0.3534          0.4576
- against Russian   0.1815          0.1971          0.2542
Bilingual
- against German    0.3702 (82%)    0.4568 (90%)    0.2448 (45%)
- against English   0.3385 (87%)    0.3341 (95%)    0.3301 (72%)
- against Russian   0.0882 (49%)    0.1348 (68%)    0.1648 (62%)
Multilingual        0.2816*         0.0884          0.0753
* German topics; English = 0.2751; Russian = 0.2357
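For readers unfamiliar with the measure, here is a minimal sketch of how mean average precision (MAP) is usually computed, plus the ratio that the bilingual percentages appear to express. Two assumptions are not stated on the slide: that topics without relevant documents are skipped when averaging, and that each percentage is the best bilingual MAP divided by the best monolingual MAP for the same target language. The function names are illustrative only.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of per-topic average precision.

    run:   topic id -> ranked list of document ids
    qrels: topic id -> collection of relevant document ids
    Assumption: topics with no relevant documents are skipped.
    """
    scores = [
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
        if relevant
    ]
    return sum(scores) / len(scores) if scores else 0.0


def percent_of_monolingual(bilingual_map, monolingual_map):
    """Bilingual MAP as a rounded percentage of the monolingual MAP."""
    return round(100 * bilingual_map / monolingual_map)


# 2008 figures from the table: bilingual vs. monolingual, against German.
print(percent_of_monolingual(0.3702, 0.4537))  # 82
```

With the 2008 figures this ratio reproduces the listed 82%, 87% and 49% for German, English and Russian.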
12
Themes - Retrieval models
• Lucene (Xtrieval Chemnitz, Darmstadt)
• Semantic relatedness: Wikipedia / Wiktionary (Darmstadt)
• Language Models (Amsterdam)
• Vector space (EasyIR, Hug)
• Probabilistic – Logistic Regression (Cheshire)
• Comparison: Vector Space, LM, Probabilistic, DFR (Unine)
• Data fusion
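The slide does not say which data fusion method the participants used. As one common option, the sketch below shows CombSUM with min-max score normalization: each run's scores are rescaled to [0, 1] and summed per document. The function names and the sample scores are made up for illustration.

```python
def min_max(scores):
    """Rescale a doc id -> score mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores equal: treat every document as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def comb_sum(runs):
    """Fuse several runs by summing normalized scores (CombSUM).

    runs: list of doc id -> score mappings, one per retrieval system.
    Returns (doc id, fused score) pairs, best first; ties broken by id.
    """
    fused = {}
    for scores in runs:
        if not scores:
            continue
        for doc_id, score in min_max(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + score
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores from two hypothetical systems.
run_a = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
run_b = {"d2": 10.0, "d3": 5.0}
print(comb_sum([run_a, run_b]))
```

Normalizing before summing keeps a system with a large raw score range from dominating the fused ranking.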
13
Themes – Query Expansion
• Blind Feedback (Rocchio)
• idf-window BF (infrequent terms near search term)
• Thesaurus Lookup
• Thesaurus as pivot language: double translation
• Google (text snippets)
• Wikipedia (frequent terms from top-ranked articles)
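As a rough illustration of the first item, here is a minimal sketch of Rocchio-style blind feedback: the top-ranked documents of an initial search are assumed relevant, their term weights are averaged into a centroid, and the strongest new terms are added to the query. The parameter values (`alpha`, `beta`, `n_terms`) and the sample weights are assumptions, not settings reported by any participant. The negative-feedback part of the Rocchio formula is left out because blind feedback has no judged non-relevant documents.

```python
from collections import Counter


def rocchio_expand(query, top_docs, alpha=1.0, beta=0.75, n_terms=5):
    """Expand a weighted query from pseudo-relevant documents.

    query:    term -> weight
    top_docs: list of term -> weight mappings for the top-ranked documents
    Returns the new term -> weight mapping.
    """
    new_query = Counter({term: alpha * w for term, w in query.items()})
    if not top_docs:
        return dict(new_query)

    # Centroid of the pseudo-relevant documents.
    centroid = Counter()
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(top_docs)

    # Keep all original terms and add the n_terms strongest new ones.
    new_terms = sorted(
        (t for t in centroid if t not in query),
        key=lambda t: (-centroid[t], t),
    )[:n_terms]
    for term in list(query) + new_terms:
        new_query[term] += beta * centroid[term]
    return dict(new_query)


# Made-up weights for illustration.
docs = [{"aging": 1.0, "elderly": 2.0}, {"elderly": 2.0, "care": 1.0}]
print(rocchio_expand({"aging": 1.0}, docs, n_terms=1))
```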
14
Themes – Translation
• Google AJAX language API
• Commercial Software (Systran, LEC)
• Bilingual thesaurus look-up
• ML retrieval thesaurus look-up
• Wikipedia (Cross-language links)
15
Summary & Outlook
• Enough interest for 2009?
• Different corpora
• Different tasks
• full topic run (125 topics)
• result: controlled vocabulary terms (not documents)
• robust task
• Full-text retrieval with open access literature
16
Domain-Specific Track: http://www.gesis.org/en/research/information_technology/clef_ds.htm
Vocabulary Mappings: http://www.gesis.org/en/research/information_technology/komohe.htm
Email: [email protected]