PAN@FIRE 2013: Overview of the Cross-Language !ndian News Story Search (CL!NSS) Track

Parth Gupta (1), Paul Clough (2), Paolo Rosso (1), Mark Stevenson (2), and Rafael E. Banchs (3)

(1) Technical University of Valencia (UPV), Spain
(2) University of Sheffield, UK
(3) Institute for Infocomm Research (I2R), Singapore

http://www.dsic.upv.es/grupos/nle/clinss.html

December 4, 2013

Presenter: Parth Gupta (UPV, Spain)



Outline

1 Motivation

2 Task Description

3 Corpus

4 Evaluation

5 Participation Overview

6 References


Motivation

- Cross-language NLP and IR rely heavily on parallel and comparable data.
- Parallel data is precious but scarce.
- Most of the available data is quasi-comparable, i.e. not topically aligned.
- Technologies that extract parallel or comparable fragments from quasi-comparable data would be very useful in such scenarios.

Current Scene

- Not all languages have parallel data, and the data that is available is too small to rely on.
- Comparable corpora (e.g. Wikipedia) are not reliable for many languages.
- In fact, many languages do not have enough data.


Two Questions:

1 What can be considered a constant source of text across languages?
2 ... that can contain parallel or comparable fragments?

Answer

- Wikipedia articles: people often create pages by translating English pages!
- News stories: journalistic text re-use!

Which languages to work on?

- Resource-poor languages


Background - Web and Languages

Language Web Representation (a)

Rank  Language    Percentage
1     English     54.9%
2     Russian     6.1%
3     German      5.3%
4     Spanish     4.8%
5     Chinese     4.4%
6     French      4.3%
7     Japanese    4.2%
8     Arabic      3.0%
9     Portuguese  2.3%
10    Polish      1.8%
...
36    Latvian     0.1%
37    Estonian    0.1%

(a) Wikipedia page: "Languages used on the Internet"


Background - Web and Languages

Language Population (a)

Rank  Language    Speakers (millions)  % of world
1     Mandarin    955                  14.1
2     Spanish     407                  5.85
3     English     359                  5.52
4     Hindi       311                  4.46
5     Arabic      293                  4.23
6     Portuguese  216                  3.08
7     Bengali     206                  3.05
8     Russian     154                  2.42
9     Japanese    126                  1.92
10    Punjabi     102                  1.44

(a) The estimates used for this list are those of Nationalencyklopedin and are based on figures published in 2010 (via Wikipedia).


Motivation (contd.)

How do such algorithms perform? [Platt et al., 2010]


Wikipedias and News data

Wikipedia  Size
English    4,392,107
Spanish    1,061,460
German     1,658,515
...
Hindi      109,046
Tamil      57,828

Year  NT (1) Size  TOI (2) Size
2011  117,411      243,773
2012  128,610      254,036

(1) Navbharat Times: Hindi daily
(2) Times of India: English daily


Task Description

Observation

- News stories covering the same event published in different languages may be rich sources of parallel and comparable text.
- Some fragments in these stories are parallel, for example, personal quotes and translated versions of the same content.

Definitions [Barker and Gaizauskas, 2012]

- Focal Event: the main event or events which provide a focus for the news story
    e.g. Romney vs. Obama in Ohio: With superior ground operations, the president widens his lead
- Background Event: an event that plays a supporting role in the text, providing context for the focal events
    e.g. Probably the last encounter between the two
- News Event: a group of related events, broader than and including the focal event, which may be reported over time in different news text installments
    e.g. Presidential election polls


Task Description

Statement

For each t ∈ T, find s ∈ S covering the same focal event and news event.

Source Collection: S = L1 ∪ L2 ∪ · · · ∪ Ln
Target Collection: T = English Articles

Link each story t in T to the stories s in S which share the same news event or focal event, for each L.
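The input and output of the task can be sketched as follows. This is only an illustration of the interface: a system receives one target story and a source collection and returns a ranked list of source document ids. The names `rank_sources` and `token_overlap` are hypothetical, and the toy scoring function is a monolingual placeholder; a real system would need a cross-language similarity model, which the slides do not specify.

```python
def token_overlap(text_a, text_b):
    """Placeholder similarity (an assumption): number of shared whitespace tokens."""
    return len(set(text_a.split()) & set(text_b.split()))


def rank_sources(target_text, sources, score, k=10):
    """Return the ids of the k highest-scoring source stories for one target story.

    sources maps a document id to its text; score is any similarity function
    taking (target_text, source_text). Ties are broken by document id.
    """
    scored = [(score(target_text, text), doc_id) for doc_id, text in sources.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Toy, English-only data used purely to exercise the function.
toy_sources = {
    "s1": "sachin scores fifty",
    "s2": "election polls ohio",
    "s3": "sachin tendulkar century",
}
toy_ranking = rank_sources("tendulkar century sachin", toy_sources, token_overlap, k=2)
```

Passing the scoring function as a parameter keeps the ranking step independent of whichever similarity model is plugged in.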


Flow Diagram

Pair(A, B)
  Same News Event
    Same News Event, Same Focal Event
    Same News Event, Different Focal Event
  Different News Event

Year 2012/13 Task: Story Detection
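The decision tree above maps directly onto the graded relevance scale (2, 1, 0) used in the track's evaluation. A minimal sketch, where the function name `relevance_level` and its boolean arguments are illustrative and not part of the track's tooling:

```python
def relevance_level(same_news_event, same_focal_event):
    """Map the two judgments for a pair (A, B) to a graded relevance level.

    2 = same news event and same focal event
    1 = same news event but different focal event
    0 = different news event (the focal-event judgment is then irrelevant)
    """
    if not same_news_event:
        return 0
    return 2 if same_focal_event else 1
```

For example, `relevance_level(True, False)` returns 1, the "same news event, different focal event" branch.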


Target
  Title: There's lot more to talk than my 50th Test ton: Tendulkar
  File: english-document-00006.txt

Source1, relevance level 2 (same focal event)
  Title: मेरी 50वीं सेंचुरी के अलावा भी कई बातें हैं : तेंदुलकर
  English: There are many things except my 50th century: Tendulkar
  File: hindi-document-24799.txt

Source2, relevance level 1 (same news event)
  Title: सचिन ने बनाई सेंचुरी की फिफ्टी
  English: Sachin makes fifty in century
  File: hindi-document-08018.txt

Table: Example English-Hindi text pairs describing the same news event but different focal events


Corpus Statistics

Table: CL!NSS 2012 corpus statistics. The statistics are shown for the source partition Dhi (Hindi) and the target collection Den. The column headers stand for: |D| number of documents in the corpus (partition), |Dtokens| total number of tokens, |Dvoc| total size of the vocabulary (unique terms). k = thousand, M = million.

Partition  |D|     |Dtokens|  |Dvoc|
Den        25      9.3k       2.5k
Dhi        50691   15.6M      143k

Metadata

- Title of the news story
- Date of publication
- Content of the story
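The three columns of the statistics table can be computed along the following lines. This is a sketch under a stated assumption: tokens are obtained by whitespace splitting, since the slides do not say which tokeniser was used, so the numbers it produces need not match the table exactly. The function name `corpus_statistics` is illustrative.

```python
def corpus_statistics(documents):
    """Compute |D|, |Dtokens| and |Dvoc| for an iterable of document texts.

    Assumption: whitespace tokenisation, with no case folding or normalisation.
    """
    num_documents = 0
    num_tokens = 0
    vocabulary = set()
    for text in documents:
        tokens = text.split()
        num_documents += 1
        num_tokens += len(tokens)
        vocabulary.update(tokens)
    return {
        "num_documents": num_documents,      # |D|
        "num_tokens": num_tokens,            # |Dtokens|
        "vocabulary_size": len(vocabulary),  # |Dvoc|
    }


toy_stats = corpus_statistics(["a b a", "b c"])
```

On the two toy documents this yields 2 documents, 5 tokens and a vocabulary of 3 unique terms.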


Evaluation Framework

Relevance

The relevance level of the source news stories for the given test queries will be in {2, 1, 0}, where

- 2 = "same news event + same focal event"
- 1 = "same news event + different focal event"
- 0 = "different news event"

Measures

NDCG@k, k = 1, 5, 10
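For reference, NDCG@k over graded relevance levels can be computed as sketched below. The slides do not state which NDCG variant the track used; this sketch assumes the common exponential-gain form, with gain 2^rel - 1 and a log2(rank + 1) discount, and normalises by the ideal ordering of all judged stories for the query. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)  # position 0 corresponds to rank 1
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, judged_relevances, k):
    """NDCG@k for one query.

    ranked_relevances: relevance levels of the returned stories, in rank order.
    judged_relevances: relevance levels of all judged stories for the query,
    used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# Toy run: an irrelevant story at rank 1, then the level-2 and level-1 stories.
toy_run = [0, 2, 1]
toy_qrels = [2, 1]
toy_scores = {k: ndcg_at_k(toy_run, toy_qrels, k) for k in (1, 5, 10)}
```

In the toy run, NDCG@1 is 0 because the top-ranked story is irrelevant, while NDCG@5 and NDCG@10 are equal because the list holds only three stories.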


Evaluation: Relevance Judgment Tool


Relevance Overview


Timeline

6 May, 2013    Release of training corpus
4 Sept, 2013   Release of test corpus
27 Oct, 2013   Submission of runs
10 Nov, 2013   Release of qrels (result notification)
15 Nov, 2013   Working notes due
05 Dec, 2013   CL!NSS @ FIRE in New Delhi!


Participation Overview

Submission details

- Teams were asked to submit results as a rank-list for each language pair.
- Each team could submit up to 3 runs to try different approaches or configurations.

Participation

Teams          2012  2013
Registered     10    16
Participated   3     8
Runs           8     23
Working notes  2     6


Results


Lessons Learnt

- Manually determining the focal/news events is sometimes quite difficult.
- The scores achieved this year are quite high: NDCG@1 of 0.78 vs. last year's best of 0.32.
- Incorporating meta-information explicitly in similarity estimation helps.
- Carefully selecting query terms from the target documents also helps to improve performance.
- Although the approaches treat the problem as ranking, more sophisticated modeling of stories would certainly help in determining same focal events.


CL!NSS Programme

Time   Details           Speaker/s

4th December
12:00  Overview Talk     Parth Gupta

5th December
15:30  Participant Talk  Amogh Param
15:45  Participant Talk  Piyush Arora
16:00  Participant Talk  Aarti Kumar
16:15  Participant Talk  Sujoy Das


Thank You! (on behalf of the CL!NSS Team)

http://www.dsic.upv.es/grupos/nle/clinss.html


References

Barker, E. and Gaizauskas, R. J. (2012). Assessing the comparability of news texts. In LREC.

Platt, J. C., Toutanova, K., and Yih, W.-t. (2010). Translingual document representations from discriminative projections. In EMNLP, pages 251–261.
