Support for Multilingual Information Access

August 21, 2002, Szechenyi National Library. Support for Multilingual Information Access. Douglas W. Oard, College of Information Studies and Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA

Page 1: Support for  Multilingual Information Access

August 21, 2002 Szechenyi National Library

Support for Multilingual Information Access

Douglas W. Oard

College of Information Studies and Institute for Advanced Computer Studies

University of Maryland, College Park, MD, USA

Page 2: Support for  Multilingual Information Access

Multilingual Information Access

Help people find information that is expressed in any language

Page 3: Support for  Multilingual Information Access

Outline

• User needs

• System design

• User studies

• Next steps

Page 4: Support for  Multilingual Information Access

Global Languages

[Bar chart: speakers (millions), axis 0 to 800, for Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]

Source: http://www.g11n.com/faq.html

Page 5: Support for  Multilingual Information Access

Global Internet User Population

[Charts for 2000 and 2005; surviving labels: English, Chinese]

Source: Global Reach

Page 6: Support for  Multilingual Information Access

Global Internet Hosts

[Bar chart: Internet hosts (millions), axis ticks 0.1, 1.0, 10.0, 100.0, by language (estimated by domain): English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]

Source: Network Wizards Jan 99 Internet Domain Survey

Page 7: Support for  Multilingual Information Access

European Web Size Projection

[Chart: billions of words, axis ticks 0.1, 1.0, 10.0, 100.0, 1,000.0, 10,000.0; series: English, Other European]

Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

Page 8: Support for  Multilingual Information Access

Global Internet Audio

[Chart: English vs. other languages; data labels 1062 and 1438]

Over 2500 Internet-accessible radio and television stations

Source: www.real.com, Mar 2001

Page 9: Support for  Multilingual Information Access

Who needs Cross-Language Search?

• Searchers who can read several languages
  – Eliminate multiple queries
  – Query in most fluent language

• Monolingual searchers
  – If translations can be provided
  – If it suffices to know that a document exists
  – If text captions are used to search for images

Page 10: Support for  Multilingual Information Access

Outline

• User needs

System design

• User studies

• Next steps

Page 11: Support for  Multilingual Information Access

Cross-Language Retrieval
Indexing Languages
Machine-Assisted Indexing

Information Retrieval

Multilingual Metadata

Digital Libraries

International Information Flow
Diffusion of Innovation

Information Use

Automatic Abstracting

Information Science

Machine Translation
Information Extraction
Text Summarization

Natural Language Processing

Multilingual Ontologies

Ontological Engineering

Textual Data Mining

Knowledge Discovery

Machine Learning

Artificial Intelligence

Localization
Information Visualization

Human-Computer Interaction

Web Internationalization

World-Wide Web

Topic Detection and Tracking

Speech Processing

Multilingual OCR

Document Image Understanding

Other Fields

Multilingual Information Access

Page 12: Support for  Multilingual Information Access

Cross-Language Search

Query

Translation

Document Delivery

Cross-Language Browsing

Select

Examine

Multilingual Information Access

Page 13: Support for  Multilingual Information Access

The Search Process

Choose Document-Language Terms

Query-Document Matching

Infer Concepts

Select Document-Language Terms

Document

Author

Query

Choose Document-Language Terms

Monolingual Searcher

Choose Query-Language Terms

Cross-Language Searcher

Page 14: Support for  Multilingual Information Access

Interactive Search

Search

Translated Query

Selection

Ranked List

Examination

Document

Use

Document

Query Formulation

Query Translation

Query

Query Reformulation

Page 15: Support for  Multilingual Information Access
Page 16: Support for  Multilingual Information Access

Synonym Selection

Page 17: Support for  Multilingual Information Access

KeyWord In Context (KWIC)
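
The display shown on this slide did not survive in the transcript. As a rough illustration of the general KWIC idea only (each occurrence of a term shown with a few words of surrounding context), here is a minimal sketch. The function name `kwic`, the `width` parameter, and the bracket notation are assumptions made for this example, not details of the system presented in the talk.

```python
def kwic(text: str, keyword: str, width: int = 3) -> list[str]:
    """Return one line per occurrence of `keyword`, with up to `width`
    whitespace-separated tokens of context on each side."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively, ignoring punctuation attached to the token.
        if tok.strip(".,;:!?\"'()").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


if __name__ == "__main__":
    sample = "The bank raised rates. A river bank flooded."
    for line in kwic(sample, "bank", width=2):
        print(line)
```

Seeing a candidate translation in context like this is one way a searcher might judge which sense of a term is intended.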

Page 18: Support for  Multilingual Information Access
Page 19: Support for  Multilingual Information Access

Outline

• User needs

• System design

User studies

• Next steps

Page 20: Support for  Multilingual Information Access

Cross-Language Evaluation Forum

• Annual European-language retrieval evaluation
  – Documents: 8 languages
    • Dutch, English, Finnish, French, German, Italian, Spanish, Swedish
  – Topics: 8 languages, plus Chinese and Japanese
  – Batch retrieval since 2000

• Interactive track (iCLEF) started in 2001
  – 2001 focus: document selection
  – 2002 focus: query formulation

Page 21: Support for  Multilingual Information Access

iCLEF 2001 Experiment Design

Participant   Task Order
1             Topic11, Topic17   Topic13, Topic29
2             Topic11, Topic17   Topic13, Topic29
3             Topic17, Topic11   Topic29, Topic13
4             Topic17, Topic11   Topic29, Topic13

Topic Key: Narrow, Broad (topic groups as listed: 11, 13 and 17, 29)

System Key: System A, System B

144 trials, in blocks of 16, at 3 sites

Page 22: Support for  Multilingual Information Access

An Experiment Session

• Task and system familiarization

• 4 searches (20 minutes each)
  – Read topic description
  – Examine document translations
  – Judge as many documents as possible
    • Relevant, Somewhat relevant, Not relevant, Unsure, Not judged
    • Instructed to seek high precision

• 8 questionnaires
  – Initial, each topic (4), each system (2), final

Page 23: Support for  Multilingual Information Access

Measure of Effectiveness

• Unbalanced F-Measure:
  – P = precision
  – R = recall
  – α = 0.8

$$F_\alpha = \frac{1}{\dfrac{\alpha}{P} + \dfrac{1-\alpha}{R}}$$

• Favors precision over recall

• This models an application in which:
  – Fluent translation is expensive
  – Missing some relevant documents would be okay
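
A minimal sketch of the measure on this slide, assuming it is the weighted harmonic mean F = 1 / (α/P + (1 − α)/R) with α = 0.8, which is how the garbled formula appears to read. The function name `f_alpha` and the choice to return 0.0 when precision or recall is zero are assumptions made for this example.

```python
def f_alpha(precision: float, recall: float, alpha: float = 0.8) -> float:
    """Unbalanced F-measure: 1 / (alpha / P + (1 - alpha) / R).

    alpha > 0.5 weights precision more heavily than recall.
    Assumption: the score is defined as 0.0 when P or R is zero.
    """
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


if __name__ == "__main__":
    # Same two values, swapped: the high-precision case scores higher.
    print(f_alpha(0.9, 0.3))
    print(f_alpha(0.3, 0.9))
```

With α = 0.8, a searcher who marks few documents but gets them right scores better than one who marks many documents loosely, which matches the instruction to seek high precision.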

Page 24: Support for  Multilingual Information Access

French Results Overview

[Chart labels: CLEF, AUTO]

Page 25: Support for  Multilingual Information Access

English Results Overview

[Chart labels: CLEF, AUTO]

Page 26: Support for  Multilingual Information Access

Commercial vs. Gloss Translation

• Commercial Machine Translation (MT) is almost always better
  – Significant with one-tail t-test (p<0.05) over 16 trials

• Gloss translation usually beats random selection

[Bar chart: retrieval effectiveness (axis 0 to 1.2) by searcher (umd01, umd02, umd03, umd04), MT vs. GLOSS, shown separately for broad topics and narrow topics]

Page 27: Support for  Multilingual Information Access

iCLEF 2002 Experiment Design

Query Formulation

Automatic Retrieval

Interactive Selection

Mean Average Precision

F0.8

Standard Ranked List

Topic Description
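
The diagram names Mean Average Precision (MAP) as the measure applied to the ranked list. A minimal sketch of the usual uninterpolated definition follows, assuming each ranked list contains no duplicate documents. The function names and the choice to score a topic with no relevant documents as 0.0 are assumptions for illustration; evaluations often exclude such topics instead.

```python
def average_precision(ranked_ids, relevant_ids) -> float:
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: undefined case scored as 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs) -> float:
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d2", "d1"], {"d1"}),
    ]
    print(mean_average_precision(runs))
```

Because the score depends only on the ranked list a query produces, it can be computed after each query iteration without any relevance judgments from the searcher.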

Page 28: Support for  Multilingual Information Access

Maryland Experiments

• 48 trials (12 participants)
  – Half with automatic query translation
  – Half with semi-automatic query translation

• 4 subjects searched Der Spiegel and SDA
  – 20-60 relevant documents for 4 topics

• 8 subjects searched Der Spiegel
  – 8-20 relevant documents for 3 topics
    • 0 relevant documents for 1 topic!

Page 29: Support for  Multilingual Information Access

Some Preliminary Results

• Average of 8 query iterations per search

• Relatively insensitive to topic
  – Topic 4 (Hunger Strikes): 6 iterations
  – Topic 2 (Treasure Hunting): 16 iterations

• Sometimes sensitive to system
  – Topics 1 and 2: system effect was small
  – Topics 3 and 4: fewer iterations with semi-automatic
    • Topic 3: European Campaigns against Racism

Page 30: Support for  Multilingual Information Access

Subjective Evaluation

• Semi-automatic system:
  – Ability to select translations: good

• Automatic system:
  – Simpler / less user involvement needed: good
  – Few functions / easier to learn and use: good
  – No control over translations: bad

• Both systems:
  – Highlighting keywords helps: good
  – Untranslated / poorly translated words: bad
  – No Boolean or proximity operator: bad

Page 31: Support for  Multilingual Information Access

Outline

• User needs

• System design

• User studies

Next steps

Page 32: Support for  Multilingual Information Access

Next Steps

• Quantitative analysis from 2002 (MAP, F)
  – Iterative improvement of query quality
    • Utility of MAP as a measure of query quality?
    • Utility of semi-automatic translation
  – Accuracy of relevance judgments

• Search strategies
  – Dependence on system
  – Dependence on topic
  – Dependence on density of relevant documents

Page 33: Support for  Multilingual Information Access

An Invitation

• Join CLEF
  – A first step: Hungarian topics
  – http://clef.iei.pi.cnr.it

• Join iCLEF
  – Help us focus on true user needs!
  – http://terral.lsi.uned.es/iCLEF