Creating a Persian-English Comparable Corpus

Azadeh Shakery, Homa B. Hashemi, Heshaam Faili

University of Tehran

Upload: crevan

Post on 04-Feb-2016


DESCRIPTION

Creating a Persian-English Comparable Corpus. Azadeh Shakery, Homa B. Hashemi, Heshaam Faili. University of Tehran. Cross-Language Information Retrieval: поиск информации, recupero dell'informazione, بازیابی اطلاعات, 信息检索, tiedonhaku, information retrieval. PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Comparable Corpus

Creating a Persian-English Comparable Corpus

Azadeh Shakery, Homa B. Hashemi, Heshaam Faili

University of Tehran

Page 2: Comparable Corpus

CLIR
• Query translation
    • Machine translation
    • Dictionary based
    • Comparable Corpora
    • Parallel Corpora
• Document translation
• Query & Document translation

Page 3: Comparable Corpus

Cross-Language Information Retrieval is the answer

information retrieval

بازیابی اطلاعات

recupero dell'informazione

信息检索

tiedonhaku

поиск информации

Page 4: Comparable Corpus

Needs for Persian CLIR: Some Statistics

Top 10 Internet Languages, 2010

English: only 27.3% of total usage

Rest of languages: 17.8% of total usage

Persian: 52.5% of total users

Source: Internet World Stats, http://internetworldstats.com/


Page 9: Comparable Corpus

Query Translation Using MT

Machine translators produce the best translation

Disadvantages:

• Queries are lists of keywords

• MT returns only “the most likely” translation

Page 10: Comparable Corpus

Query Translation Using Dictionary

No dictionary is complete

Translation ambiguity

“Goal” & “Goal”

Page 11: Comparable Corpus

Query Translation Using Parallel Corpora

[Figure: Persian documents (ا ب پ ت س ش) paired with English documents (A B C D S T)]

Page 12: Comparable Corpus

Query Translation Using Comparable Corpora

[Figure: a collection of Persian documents (ا ب پ ت س ش) and a collection of English documents (A B C D S T)]


Page 14: Comparable Corpus

Roadmap

• Motivation
• Persian Corpora
• Creating a Persian-English Comparable Corpus
• Evaluate Comparable Corpora
    • Assessed Quality of Alignments in one month
    • Extracting Word Associations
    • CLIR with Comparable Corpora

Page 15: Comparable Corpus

Persian Corpora

Monolingual
• Hamshahri corpus (IR)
• Bijankhan corpus (NLP)

Persian-English
• Miangah parallel corpus: 4,860,000 words
• TEP parallel corpus: 612,086 sentences
• Karimi semi-parallel corpus: 1100 documents

No Persian-English Comparable Corpus


Page 17: Comparable Corpus

Our Comparable Corpus

Page 18: Comparable Corpus

[Diagram: Source Doc → Source language query → Target language query → Index (Target Docs) → Matching → Alignment]

Page 19: Comparable Corpus

Source Doc

Survivors of Hurricane Katrina in the southern US are being taken to safety in what is being called the largest airlift in US history.

Up to 40 aircraft are operating round-the-clock to move thousands who had been stranded in New Orleans. On Saturday President Bush announced the deployment of thousands of extra troops in affected areas, amid criticism of the rescue effort. Survivors have been telling harrowing tales of violence. On Saturday more than 10,000 people were removed from flood-ravaged New Orleans.

Source language query: TF, RATF

Page 20: Comparable Corpus

Source language query (TF, RATF):

people, brown, Orleans, emerge, new, Katrina, survivor, flood, thousand, relief, rescue, urgency, Saturday, hurricane
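A minimal sketch of the TF part of this keyword-selection step. It assumes a regex tokenizer and a small hypothetical stopword list, since the slides specify neither. RATF weighting is not shown because it needs collection-level statistics that the slides do not give.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only; the slides do not name one.
STOPWORDS = {
    "a", "an", "and", "are", "as", "be", "been", "being", "called", "from",
    "had", "have", "in", "is", "of", "on", "the", "to", "up", "what", "who",
    "were",
}


def tf_query(text, n_terms=5, stopwords=STOPWORDS):
    """Return the n_terms most frequent non-stopword tokens of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n_terms]]


doc = (
    "Survivors of Hurricane Katrina in the southern US are being taken to "
    "safety in what is being called the largest airlift in US history."
)
query_terms = tf_query(doc, n_terms=5)
```

The function does no stemming, so "survivors" is not reduced to "survivor" as in the slide's query; a stemmer or lemmatizer would be one more assumed component.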

Page 21: Comparable Corpus

Target language query:

خلق قومجمعيت ملتاخيرا نوين شخص زنده

باقيمانده بازمانده روزشنبه پديدار بيرون تندباد طوفانگردباد اجتماع قهوه سرخ قهوهکاترينا سيل درياطوفان غرقسيل گرفتن طغيان راحتي اعانهامداد رفع نگراني برجستهخط فوريت ضرورت كناردريا


Page 23: Comparable Corpus

Matching: target language query against the index of target docs

Target Doc

عمليات گسترده تخليه بازماندگان کاترينا نورمن مينتا وزير حمل و نقل امريکا گفت هواپيماها و هلي کوپترها ساعته در حال کار هستند و تا کنون بيش از هزار نفر را از مناطقي در نيواورليان که بيشترين اسيب را ديده اند تخليه کرده اند اتوبوس ها نيز به بيرون بردن مردم از شهر ادامه مي دهند و اولين قطار شهر را ترک کرده است مقامات نظامي مي گويند تاکنون هزار نفر از توفان زدگان اين شهر ويران نجات يافته اند

(English gloss: Extensive operation to evacuate Katrina survivors. Norman Mineta, the US Secretary of Transportation, said planes and helicopters are working around the clock and have so far evacuated more than a thousand people from the worst-hit areas of New Orleans. Buses also continue to take people out of the city, and the first train has left the city. Military officials say that so far a thousand of the storm victims of this ruined city have been rescued.)

Page 24: Comparable Corpus

Alignment

Two basic criteria:
• Similarity score
• Publication dates

Aligned pair: the English source document (“Survivors of Hurricane Katrina in the southern US are being taken to safety ...”) and the Persian target document (“عمليات گسترده تخليه بازماندگان کاترينا ...”).
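The two alignment criteria can be sketched as a simple filter. This is an illustration under assumptions, not the authors' procedure: the thresholds `min_score` and `max_days`, the document ids, scores and dates are all made-up toy values.

```python
from datetime import date


def align(source_date, candidates, min_score=0.5, max_days=1):
    """Keep target documents that satisfy both alignment criteria.

    candidates: list of (doc_id, similarity_score, publication_date) tuples,
    for example the ranked results of the target language query.
    min_score and max_days are hypothetical thresholds, not values from the slides.
    """
    aligned = []
    for doc_id, score, pub_date in candidates:
        close_in_time = abs((pub_date - source_date).days) <= max_days
        if score >= min_score and close_in_time:
            aligned.append((doc_id, score))
    # Best-scoring alignments first.
    return sorted(aligned, key=lambda pair: -pair[1])


# Toy candidates (invented for illustration).
candidates = [
    ("fa-1", 0.82, date(2005, 9, 4)),   # similar, published one day later
    ("fa-2", 0.90, date(2005, 10, 1)),  # similar, but published weeks later
    ("fa-3", 0.10, date(2005, 9, 3)),   # same day, but dissimilar
]
print(align(date(2005, 9, 3), candidates))  # [('fa-1', 0.82)]
```

Requiring both criteria at once means a high-scoring match that is far apart in time is rejected, which is why `fa-2` is dropped here.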


Page 26: Comparable Corpus

Comparable Corpora Evaluation

Quality of Alignments

1. Same story
2. Related story
3. Shared aspect
4. Common terminology
5. Unrelated

Use the method of “Multilingual information retrieval based on document alignment techniques” [Braschler et al., 1998]

Page 27: Comparable Corpus

CC Evaluation: Language Model

          All dictionary              Top 3 translations,         Top 3 translations,
                                      No Transliteration          Transliteration
          # of Aligns  % of Aligns    # of Aligns  % of Aligns    # of Aligns  % of Aligns
Class 1        4         11.8 %            3          6.9 %            5          9.4 %
Class 2        4         11.8 %           17         39.5 %           24         45.3 %
Class 3        7         20.6 %           14         32.5 %           14         26.4 %
Class 4       11         32.3 %            8         18.6 %            8         15.1 %
Class 5        8         23.5 %            1          2.3 %            2          3.8 %
Total         34        100               43        100               53        100

Page 28: Comparable Corpus

CC Evaluation: Okapi

          Top 3 translations,         Top 3 translations,
          No Transliteration          Transliteration
          # of Aligns  % of Aligns    # of Aligns  % of Aligns
Class 1       11         13.5 %           13         14.9 %
Class 2       46         56.8 %           51         58.6 %
Class 3       20         24.7 %           19         21.8 %
Class 4        4          4.9 %            4          4.6 %
Class 5        0          0 %              0          0 %
Total         81        100               87        100

Page 29: Comparable Corpus

Source Docs: 53697

Target Docs: 191440

Alignment: 7580


Page 31: Comparable Corpus

CC Evaluation: Word Associations

English Word    Persian Word    Google translation    Score
Cancer          سرطان           Cancer                80
                بیماری          Disease               52
                بدن             Body                  51
                سلول            Cell                  43
                مبتلا           Suffering             41
Iraqi           عراق            Iraq                  39
                صدام            Saddam                95
                عراقي           Iraqi                 83
                بغداد           Baghdad               82
                حسين            Hussein               75

Use the method of “Focused web crawling in the acquisition of comparable corpora” [Talvensaari et al., 2008]


Page 33: Comparable Corpus

Cross-Language Information Retrieval

Persian task of CLEF-2008:

• Retrieval of Persian documents from English topics

Queries:

• 50 topics in English and their Persian translations

Page 34: Comparable Corpus

Construct the Query Language Model

Use the top k translations of each query word

English Query: Cancer Drugs

English Word    Persian translations
Cancer          سرطان     Cancer       0.077
                بیماری    Disease      0.049
                بدن       Body         0.049
                سلول      Cell         0.041
                …         …            …
Drugs           درمان     Treatment    0.050
                دارو      Drug         0.049
                داروهای   Drugs        0.042
                بیماری    Disease      0.042
                …         …            …

Persian Query:

سرطان 0.077

بیماری 0.049

درمان 0.050

دارو 0.049
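A small sketch of how such a weighted Persian query could be assembled from the top k translations of each English query word. The function names are hypothetical. Summing the weights of a Persian word reached from several query words, and normalizing the result, are assumptions; the slide shows neither step.

```python
def build_query(translations, query_words, k=2):
    """Build a weighted target-language query from top-k translations.

    translations maps a source word to {target_word: translation_probability}.
    If a target word is reached from several query words its weights are
    summed; that merging rule is an assumption, not taken from the slides.
    """
    query = {}
    for word in query_words:
        candidates = translations.get(word, {})
        # sorted() is stable, so equal probabilities keep their listed order.
        top_k = sorted(candidates.items(), key=lambda kv: -kv[1])[:k]
        for target, prob in top_k:
            query[target] = query.get(target, 0.0) + prob
    return query


def normalize(query):
    """Scale weights to sum to 1 so the query can serve as a language model."""
    total = sum(query.values())
    return {term: weight / total for term, weight in query.items()}


# Translation probabilities as listed on the slide.
translations = {
    "cancer": {"سرطان": 0.077, "بیماری": 0.049, "بدن": 0.049, "سلول": 0.041},
    "drugs": {"درمان": 0.050, "دارو": 0.049, "داروهای": 0.042, "بیماری": 0.042},
}
persian_query = build_query(translations, ["cancer", "drugs"], k=2)
query_lm = normalize(persian_query)
```

With k = 2 this selects the same four Persian terms and weights that the slide lists under "Persian Query".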

Page 35: Comparable Corpus

Cross-Language Information Retrieval

Measure    Monolingual Retrieval    Dictionary        Comparable Corpora
MAP        0.42153                  0.153 (36.29%)    0.14 (33.30%)
Prec@5     0.62                     0.224 (36.12%)    0.244 (39.35%)
Prec@10    0.596                    0.206 (34.56%)    0.232 (38.92%)

Percentages are relative to monolingual retrieval.

Page 36: Comparable Corpus

Creating the First Big Persian-English Comparable Corpus

• Two independent news collections
• Aligned the documents
    • Topic similarities
    • Publication dates
• Alternatives
    • Different translation methods
    • Different retrieval models

Assess Quality of Our Corpus

• Manually evaluate one month of alignments on a five-level relevance scale
• Extract word associations
• Cross-Language Information Retrieval

Page 37: Comparable Corpus

Future Work

• Focus on CLIR task
• Improving quality of extracted word associations
• Other linguistic resources (dictionaries, MT, parallel corpora)
• Use extracted translation knowledge to improve quality of created corpus

Page 38: Comparable Corpus

Thank You

Homa B. Hashemi

Page 39: Comparable Corpus

References

Talvensaari et al. Creating and exploiting a comparable corpus in cross-language information retrieval. TOIS (2007)

Talvensaari et al. Focused web crawling in the acquisition of comparable corpora. Information Retrieval (2008)

Page 40: Comparable Corpus

Appendix: RATF formula

Page 41: Comparable Corpus

CC Evaluation: Extract Word Associations

Weight of source word:

$$w_{ik} = \left(0.5 + 0.5 \cdot \frac{tf_{ik}}{Maxtf_k}\right) \cdot \ln\left(\frac{NT}{dl_k}\right)$$

Weight of target word:

$$W_j = \sum_{r=1}^{|D|} \frac{w_{jr}}{\ln(r + 1)}$$

Page 42: Comparable Corpus

CC Evaluation: Extract Word Associations

Similarity score between a source and target word:

$$sim(s_i, t_j) = \frac{\sum_{\langle d_k, D_k \rangle \in A} w_{ik} \cdot W_{jk}}{\|s_i\| \cdot \left((1 - slope) + slope \cdot \frac{\|t_j\|}{\|T\|}\right)}$$

“Focused web crawling in the acquisition of comparable corpora” [Talvensaari et al., 2008]

Page 43: Comparable Corpus

Appendix: Estimate Query LM

Page 44: Comparable Corpus

CLIR with Comparable Corpora

Inputs: Aligned Documents in L1 and L2; Documents in L1 and L2; Query in L1

(1) Extract Word Similarities
(2) Estimate Word Translation Probabilities
(3) Construct Query Language Model in L2
(4) Use KL-Divergence to Rank Documents
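The final ranking step can be sketched as follows: documents are scored by the negative KL divergence between the query language model and a smoothed document language model. The slides do not say how document models are smoothed, so Dirichlet smoothing with prior `mu` is an assumption here, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def kl_rank(query_lm, docs, mu=2000.0):
    """Rank documents by negative KL divergence from the query model.

    query_lm: {term: probability}. docs: {doc_id: list of tokens}.
    Document models use Dirichlet smoothing with prior mu; both the
    smoothing method and mu are assumptions, not stated on the slides.
    """
    collection = Counter(tok for toks in docs.values() for tok in toks)
    coll_len = sum(collection.values())
    scores = {}
    for doc_id, toks in docs.items():
        counts = Counter(toks)
        score = 0.0
        for term, q_prob in query_lm.items():
            p_coll = collection[term] / coll_len
            if q_prob == 0.0 or p_coll == 0.0:
                # Terms unseen in the whole collection are skipped for
                # every document, so they do not change the ranking.
                continue
            p_doc = (counts[term] + mu * p_coll) / (len(toks) + mu)
            # -KL(Q || D) = -sum q * log(q / p_doc); higher is better.
            score -= q_prob * math.log(q_prob / p_doc)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy documents (invented for illustration).
toy_docs = {
    "d1": ["cancer", "drug", "cancer"],
    "d2": ["sport", "football", "sport"],
}
ranking = kl_rank({"cancer": 0.6, "drug": 0.4}, toy_docs, mu=10.0)
```

In this pipeline `query_lm` would be the L2 query model built in step (3), and `docs` the tokenized L2 collection.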

Page 45: Comparable Corpus

Step 2: Estimate Word Translation Probabilities

Use normalized raw correlation scores:

$$p(u_i \mid w) = \frac{r(w, u_i)}{\sum_{j=1}^{N} r(w, u_j)}$$

where $r(w, u_i)$ is the raw correlation score.

Page 46: Comparable Corpus

Estimate Word Translation Probabilities

Use normalized raw correlation scores:

$$p(u_i \mid w) = \frac{r(w, u_i)}{\sum_{j=1}^{N} r(w, u_j)}$$

where $r(w, u_i)$ is the raw correlation score.

English Word    Persian Word    Google translation    Raw Score    Translation Probability
Cancer          سرطان           Cancer                80.7         0.077
                بیماری          Disease               52           0.049
                بدن             Body                  51.2         0.049
                سلول            Cell                  43.7         0.041
                مبتلا           Suffering             41.6         0.039
                درمان           Treatment             36.9         0.038
                تحقيقات         Research              35.2         0.034
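The normalization above, where each raw correlation score is divided by the sum of raw scores over all candidate words, can be sketched in a few lines. The function name is hypothetical. Because only the seven rows shown in the table are used here, the resulting probabilities are larger than the slide's values, which are presumably normalized over all N candidate words.

```python
def translation_probabilities(raw_scores):
    """Normalize raw correlation scores r(w, u_j) into p(u_i | w)."""
    total = sum(raw_scores.values())
    if total == 0:
        return {u: 0.0 for u in raw_scores}
    return {u: r / total for u, r in raw_scores.items()}


# Raw scores for "Cancer" as listed in the table (only the seven rows shown).
raw = {
    "سرطان": 80.7, "بیماری": 52.0, "بدن": 51.2, "سلول": 43.7,
    "مبتلا": 41.6, "درمان": 36.9, "تحقيقات": 35.2,
}
probs = translation_probabilities(raw)
best = max(probs, key=probs.get)
```

The relative order of the candidates is preserved by the normalization, so `best` is still the highest-scoring Persian word.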