1

The Domain-Specific Track at CLEF 2008

Vivien Petras & Stefan Baerisch

GESIS Social Science Information Centre, Bonn, Germany

Aarhus, Denmark, September 17, 2008

2

Outline

• The Domain-Specific Task

• Collections & Controlled Vocabularies

• Participants, Runs & Relevance Assessments

• Themes

• Outlook

3

The Domain-Specific Task

CLIR on structured scientific document collections:

• social science domain

• bibliographic metadata

• controlled vocabularies for subject description

Leverage for:

• search

• query expansion

• translation

4

The Domain-Specific Task

Tasks:

• Monolingual: against German, English or Russian

• Bilingual: against German, English or Russian

• Multilingual: against combined collection

Topics:

• 25 topics in standard TREC format (title, desc, narr)

• suggestions from 28 subject specialties in the Social Sciences

• translated from German into English and Russian
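As a rough illustration of the title/desc/narr structure, the sketch below parses one topic from a TREC-style block. The sample topic text, its number, and the exact tag layout are invented for illustration; the real CLEF topic files may use different tag names.

```python
import re

# Hypothetical topic in a TREC-style layout; the text is invented for illustration.
SAMPLE_TOPIC = """
<top>
<num> 999 </num>
<title> Counseling for the aged </title>
<desc> Find documents on counseling services for elderly people. </desc>
<narr> Relevant documents describe counseling programs aimed at the elderly. </narr>
</top>
"""

FIELDS = ("num", "title", "desc", "narr")


def parse_topics(text):
    """Return one dict per <top>...</top> block with the four topic fields."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        topic = {}
        for field in FIELDS:
            match = re.search(rf"<{field}>(.*?)</{field}>", block, flags=re.DOTALL)
            # Missing fields become empty strings instead of raising.
            topic[field] = match.group(1).strip() if match else ""
        topics.append(topic)
    return topics


topics = parse_topics(SAMPLE_TOPIC)
print(topics[0]["title"])
```

A title-only run would query with `topics[0]["title"]`, while a longer query would concatenate the title and desc fields.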

5

Collections

Name     Language  Coverage   Docs     Abstracts  Description
GIRT-DE  German    1990-2000  151,319  96%        German social science literature & projects
GIRT-EN  English   1990-2000  151,319  17%        GIRT-DE translated
CSA-SA   English   1994-1996  20,000   94%        Sociological Abstracts
ISISS    Russian              145,802  27%        Institute of Scientific Information for Social Sciences of the Russian Academy of Sciences

6

Controlled Vocabularies

                    GIRT  CSA-SA  INION
Descriptors / doc   10    6.4     3.9
Class. codes / doc  2     1.3     n/a

5 different subject-describing terminologies:

• Thesaurus for the Social Sciences (GIRT-DE, -EN)

• Thesaurus of Sociological Indexing Terms (CSA-SA)

• INION Thesaurus (ISISS)

• Social Sciences Classification (GIRT-DE, -EN)

• Sociological Abstracts Classification (CSA-SA)

7

Controlled Vocabularies – Mapping Tools

Translation:

• GIRT German → GIRT English, GIRT Russian

• INION Russian → INION English

Term mappings:

• equivalent terms in vocabularies

• GIRT German / English ↔ CSA-SA English

• GIRT German ↔ INION Russian

Example: counseling for the aged → Counseling + Elderly
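A minimal sketch of how such a term mapping might be applied to a document's or query's descriptors. The table name `GIRT_TO_CSA` and the function are hypothetical; only the "counseling for the aged" entry comes from the slide, and the second input term is made up to show the unmapped case.

```python
# Hypothetical mapping table; only this entry comes from the slide's example.
GIRT_TO_CSA = {
    "counseling for the aged": ["Counseling", "Elderly"],
}


def map_terms(descriptors, mapping):
    """Replace each source descriptor with its mapped target terms.

    Descriptors without a mapping are returned separately so the caller
    can decide how to handle them (for example, keep them as free text).
    """
    mapped, unmapped = [], []
    for term in descriptors:
        targets = mapping.get(term.lower())
        if targets:
            mapped.extend(targets)  # one source term may map to several targets
        else:
            unmapped.append(term)
    return mapped, unmapped


# "migration" is an invented descriptor with no entry in the table.
mapped, unmapped = map_terms(["Counseling for the aged", "migration"], GIRT_TO_CSA)
print(mapped)    # ['Counseling', 'Elderly']
print(unmapped)  # ['migration']
```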

8

Participants

6 groups

Group      Institution                                             Country
Amsterdam  University of Amsterdam                                 The Netherlands
Chemnitz   Media Informatics, Chemnitz University of Technology    Germany
Cheshire   School of Information, UC Berkeley                      USA
Darmstadt  Technical University Darmstadt                          Germany
Hug        University Hospitals Geneva                             Switzerland
Unine      Computer Science Department, University of Neuchatel    Switzerland

9

Runs

Task               Runs 2008  Runs 2007  Runs 2006
Monolingual
- against German   10         13         13
- against English  12         15         8
- against Russian  9          11         1
Bilingual
- against German   12         14         6
- against English  9          15         3
- against Russian  8          9          3
Multilingual       9          9          2
Total              69         86         36

10

Relevance Assessments

                German  English  Russian
Pool size       14793   14835    13930
Rel. Docs 2008  15%     14%      2%*
Rel. Docs 2007  22%     25%      10%**
Rel. Docs 2006  39%     26%      n/a

* In Russian collection: 1 topic without relevant docs

** 3 topics without relevant docs

11

Relevance Assessments – Best MAP

Task               Best MAP 2008  Best MAP 2007  Best MAP 2006
Monolingual
- against German   0.4537         0.5051         0.5454
- against English  0.3891         0.3534         0.4576
- against Russian  0.1815         0.1971         0.2542
Bilingual
- against German   0.3702 (82%)   0.4568 (90%)   0.2448 (45%)
- against English  0.3385 (87%)   0.3341 (95%)   0.3301 (72%)
- against Russian  0.0882 (49%)   0.1348 (68%)   0.1648 (62%)
Multilingual       0.2816*        0.0884         0.0753

* German topics; English = 0.2751; Russian = 0.2357
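The percentages in parentheses appear to express the best bilingual MAP relative to the best monolingual MAP for the same target language. That reading is an assumption, but the short sketch below reproduces the 2008 figures from this slide under it. The variable and function names are illustrative.

```python
# Best MAP values for 2008, copied from this slide.
MONO_2008 = {"German": 0.4537, "English": 0.3891, "Russian": 0.1815}
BILI_2008 = {"German": 0.3702, "English": 0.3385, "Russian": 0.0882}


def relative_performance(bilingual, monolingual):
    """Bilingual best MAP as a rounded percentage of monolingual best MAP."""
    return {
        lang: round(100 * bilingual[lang] / monolingual[lang])
        for lang in bilingual
    }


ratios = relative_performance(BILI_2008, MONO_2008)
print(ratios)  # {'German': 82, 'English': 87, 'Russian': 49}
```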

12

Themes - Retrieval models

• Lucene (Xtrieval Chemnitz, Darmstadt)

• Semantic relatedness: Wikipedia / Wiktionary (Darmstadt)

• Language Models (Amsterdam)

• Vector space (EasyIR, Hug)

• Probabilistic – Logistic Regression (Cheshire)

• Comparison: Vector Space, LM, Probabilistic, DFR (Unine)

• Data fusion
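The slide does not say which fusion method the participants used. As one common option, the sketch below shows CombSUM with min-max score normalization; it is an illustration of the general idea, not a description of any participant's system. Document ids and scores are invented.

```python
from collections import defaultdict


def min_max_normalize(run):
    """Scale the scores of one run into [0, 1]."""
    low, high = min(run.values()), max(run.values())
    if high == low:
        return {doc: 1.0 for doc in run}
    return {doc: (score - low) / (high - low) for doc, score in run.items()}


def comb_sum(runs):
    """Fuse several non-empty runs by summing normalized scores per document."""
    fused = defaultdict(float)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical document ids and scores from two systems.
run_a = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
run_b = {"d2": 0.9, "d3": 0.8, "d4": 0.1}

ranking = comb_sum([run_a, run_b])
print([doc for doc, _ in ranking])  # ['d2', 'd1', 'd3', 'd4']
```

Normalizing before summing matters because the two systems score on different scales; without it, the run with larger raw scores would dominate the fused ranking.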

13

Themes – Query Expansion

• Blind Feedback (Rocchio)

• idf-window BF (infrequent terms near search term)

• Thesaurus Lookup

• Thesaurus as pivot language: double translation

• Google (text snippets)

• Wikipedia (frequent terms from top-ranked articles)
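A minimal sketch of Rocchio-style blind feedback, the first technique in the list above. It treats the top-ranked documents as relevant and adds their strongest terms to the query. Raw term frequencies stand in for whatever weighting a real system would use, and the parameter values (`alpha`, `beta`, `n_new_terms`) and sample data are assumptions chosen for illustration.

```python
from collections import Counter


def rocchio_expand(query_terms, top_docs, alpha=1.0, beta=0.75, n_new_terms=2):
    """Blind feedback: treat the top-ranked documents as relevant.

    query_terms: list of tokens in the original query
    top_docs:    list of token lists, one per top-ranked document
    Returns the reweighted, expanded query as {term: weight}.
    """
    weights = {t: alpha * tf for t, tf in Counter(query_terms).items()}
    if not top_docs:
        return weights

    centroid = Counter()
    for doc in top_docs:
        centroid.update(doc)

    candidates = {}
    for term, tf in centroid.items():
        contribution = beta * tf / len(top_docs)  # average tf, scaled by beta
        if term in weights:
            weights[term] += contribution
        else:
            candidates[term] = contribution

    # Keep only the strongest new terms to limit query drift.
    best_new = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    weights.update(best_new[:n_new_terms])
    return weights


# Hypothetical query and top-ranked documents (already tokenized).
query = ["elderly", "counseling"]
top_docs = [
    ["elderly", "care", "counseling", "care"],
    ["elderly", "care", "nursing"],
    ["counseling", "family"],
]
expanded = rocchio_expand(query, top_docs)
print(sorted(expanded))  # ['care', 'counseling', 'elderly', 'family']
```

The sketch omits the negative (non-relevant) term of the full Rocchio formula, since blind feedback has no judged non-relevant documents to draw on.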

14

Themes – Translation

• Google AJAX language API

• Commercial Software (Systran, LEC)

• Bilingual thesaurus look-up

• ML retrieval thesaurus look-up

• Wikipedia (Cross-language links)

15

Summary & Outlook

• Enough interest for 2009?

• Different corpora

• Different tasks

• full topic run (125 topics)

• result: controlled vocabulary terms (not documents)

• robust task

• Full-text retrieval with open access literature

16

Domain-Specific Track: http://www.gesis.org/en/research/information_technology/clef_ds.htm

Vocabulary Mappings: http://www.gesis.org/en/research/information_technology/komohe.htm

Email: [email protected]