PAN@FIRE 2013: Overview of the Cross-Language !ndian News Story Search (CL!NSS) Track
Parth Gupta (1), Paul Clough (2), Paolo Rosso (1), Mark Stevenson (2), and Rafael E. Banchs (3)
(1) Technical University of Valencia (UPV), Spain
(2) University of Sheffield, UK
(3) Institute for Infocomm Research (I2R), Singapore
http://www.dsic.upv.es/grupos/nle/clinss.html
December 4, 2013
Parth Gupta (UPV, Spain) Overview of CL!NSS Track December 4, 2013 1 / 23
Outline
1 Motivation
2 Task Description
3 Corpus
4 Evaluation
5 Participation Overview
6 References
Motivation
Cross-language NLP and IR rely heavily on parallel and comparable data.
Parallel data is precious but scarce.
Most of the available data is quasi-comparable, i.e. not topically aligned.
Technologies that extract parallel or comparable fragments from quasi-comparable data would be very useful in such scenarios.

Current Scene
Not all languages have parallel data, and what is available is too small to rely on.
Comparable corpora (Wikipedia) are not reliable in many languages.
In fact, many languages do not have enough data.
Two Questions:
1 What can be considered a constant source of text across languages?
2 ... that can contain parallel or comparable fragments?

Answer
Wikipedia articles - often, people create pages by translating English pages!
News stories - journalistic text re-use!

Which languages to work on?
Resource-poor languages
Background - Web and Languages
Language Web Representation (a)

Rank  Language    Percentage
1     English     54.9%
2     Russian     6.1%
3     German      5.3%
4     Spanish     4.8%
5     Chinese     4.4%
6     French      4.3%
7     Japanese    4.2%
8     Arabic      3.0%
9     Portuguese  2.3%
10    Polish      1.8%
...
36    Latvian     0.1%
37    Estonian    0.1%

(a) Wikipedia page: “Languages used on the Internet”
Background - Web and Languages
Language Population (a)

Rank  Language    Speakers (millions)  % of world
1     Mandarin    955                  14.1
2     Spanish     407                  5.85
3     English     359                  5.52
4     Hindi       311                  4.46
5     Arabic      293                  4.23
6     Portuguese  216                  3.08
7     Bengali     206                  3.05
8     Russian     154                  2.42
9     Japanese    126                  1.92
10    Punjabi     102                  1.44

(a) The estimates used for this list are those of Nationalencyklopedin and are based on estimates published in 2010 (source: Wikipedia).
Motivation (contd.)
How do such algorithms perform? [Platt et al., 2010]
Wikipedias and News data
Wikipedia  Size
English    4,392,107
Spanish    1,061,460
German     1,658,515
...
Hindi      109,046
Tamil      57,828

Year  NT (1) Size  TOI (2) Size
2011  117,411      243,773
2012  128,610      254,036

(1) Navbharat Times: Hindi daily
(2) Times of India: English daily
Task Description
Observation
News stories covering the same event published in different languages may be rich sources of parallel and comparable text.
Some fragments in these stories are parallel, for example, personal quotes and translated versions of the same content.

Definitions [Barker and Gaizauskas, 2012]
Focal Event: the main event or events which provide a focus for the news story
  - e.g. Romney vs. Obama in Ohio: With superior ground operations, the president widens his lead
Background Event: an event that plays a supporting role in the text, providing context for the focal events
  - e.g. Probably the last encounter between the two
News Event: a group of related events, broader than and including the focal event, which may be reported over time in different news text installments
  - e.g. Presidential election polls
Task Description
Statement
For each t ∈ T, find s ∈ S covering the same focal event and news event.

Source Collection: S = L_1 ∪ L_2 ∪ ... ∪ L_n
Target Collection: T = English Articles

Link each story t in T to the stories s in S that share the same news event or focal event, for each L_i.
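The statement above can be read as a ranking problem: for one target story t, order the source stories s by some cross-language similarity. The following is only a minimal sketch under loud assumptions; the track does not prescribe any method. The bilingual `lexicon`, the Jaccard overlap, the function names, and the romanised toy tokens are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def translate_tokens(tokens: List[str], lexicon: Dict[str, str]) -> Set[str]:
    """Map source-language tokens into English with a toy lexicon (assumption).

    Tokens missing from the lexicon are kept unchanged, so names and
    numbers that are written identically can still match.
    """
    return {lexicon.get(tok, tok) for tok in tokens}


def rank_sources(
    target_tokens: List[str],
    sources: Dict[str, List[str]],
    lexicon: Dict[str, str],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the top-k (source_id, score) pairs for one target story t."""
    t = {tok.lower() for tok in target_tokens}
    scored = [
        (doc_id, jaccard(t, translate_tokens(toks, lexicon)))
        for doc_id, toks in sources.items()
    ]
    # Sort by descending score; break ties by document id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy data only: romanised placeholder tokens, not real corpus content.
lexicon = {"shatak": "century", "chunav": "election"}
sources = {
    "s1": ["sachin", "shatak"],
    "s2": ["chunav"],
}
ranking = rank_sources(["Sachin", "century", "talk"], sources, lexicon, k=2)
print(ranking)
```

Running `rank_sources` once per target story, and once per source language L_i, would give one rank list per language pair.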
Flow Diagram
Pair(A, B)
  Same News Event
    Same News Event, Same Focal Event
    Same News Event, Different Focal Event
  Different News Event

Year 2012/13 Task: Story Detection
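The decision flow for a pair can be sketched as a small function that maps the two judgments onto the 2/1/0 relevance grades used in the track's evaluation. The function name and boolean parameters are illustrative assumptions, not part of any track tooling.

```python
def relevance_level(same_news_event: bool, same_focal_event: bool) -> int:
    """Grade a (target, source) pair following the flow diagram.

    2 = same news event and same focal event
    1 = same news event but different focal event
    0 = different news event
    """
    if not same_news_event:
        # A news event includes its focal event by definition, so a pair
        # from different news events is graded 0 regardless of the second flag.
        return 0
    return 2 if same_focal_event else 1


print(relevance_level(True, True), relevance_level(True, False), relevance_level(False, False))
```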
Article    Title                                                        Relevance Level
Target     There’s lot more to talk than my 50th Test ton: Tendulkar
           english-document-00006.txt
Source 1   [Hindi title, garbled in extraction]                         2 (same focal event)
           There are many things except my 50th century: Tendulkar
           hindi-document-24799.txt
Source 2   [Hindi title, garbled in extraction]                         1 (same news event)
           Sachin makes fifty in century
           hindi-document-08018.txt

Table: Example English-Hindi text pairs describing the same news event but different focal events
Corpus Statistics
Table: CL!NSS 2012 corpus statistics. The statistics are shown for the source partition D_hi (Hindi) and a target collection D_en. The column headers stand for: |D| number of documents in the corpus (partition), |D_tokens| total number of tokens, |D_voc| total size of vocabulary (unique terms). k = thousand, M = million.

Partition  |D|    |D_tokens|  |D_voc|
D_en       25     9.3k        2.5k
D_hi       50691  15.6M       143k

Metadata
  - Title of the news story
  - Date of publication
  - Content of the story
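The three columns of the statistics table can be computed with a few lines of code. This is a rough sketch that assumes whitespace tokenisation and case-sensitive terms; the tokenisation actually used for the reported figures is not stated here, so the numbers it produces may differ from the table. The function name and dictionary keys are illustrative.

```python
from typing import Dict, Iterable


def corpus_statistics(documents: Iterable[str]) -> Dict[str, int]:
    """Count documents (|D|), tokens (|D_tokens|) and unique terms (|D_voc|)."""
    n_docs = 0
    n_tokens = 0
    vocabulary = set()
    for text in documents:
        tokens = text.split()  # whitespace tokenisation: an assumption
        n_docs += 1
        n_tokens += len(tokens)
        vocabulary.update(tokens)
    return {"D": n_docs, "D_tokens": n_tokens, "D_voc": len(vocabulary)}


# Toy corpus, not CL!NSS data.
stats = corpus_statistics(["a b a", "b c"])
print(stats)
```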
Evaluation Framework
Relevance
The relevance level of each source news story for a given test query is one of 2, 1, 0, where
  - 2 = “same news event + same focal event”
  - 1 = “same news event + different focal event”
  - 0 = “different news event”

Measures
NDCG@k, k = 1, 5, 10
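For readers unfamiliar with the measure, NDCG@k can be sketched as follows. This assumes the common formulation with linear gain and a log2 rank discount; the exact variant used by the track's evaluation script is not given in these slides, so treat the formula and the function names as assumptions.

```python
import math
from typing import Sequence


def dcg_at_k(grades: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the first k grades (linear gain)."""
    return sum(
        grade / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcg_at_k(ranked_grades: Sequence[int], all_grades: Sequence[int], k: int) -> float:
    """NDCG@k for one query.

    ranked_grades: relevance grades (2, 1 or 0) of the returned stories,
                   in the order the system ranked them.
    all_grades:    grades of every judged story for the query, used to
                   build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0


# Toy judgments: the system put a grade-1 story above the grade-2 story.
run = [1, 2, 0]
judged = [2, 1, 0, 0]
for k in (1, 5, 10):
    print(k, round(ndcg_at_k(run, judged, k), 4))
```

With these toy judgments NDCG@1 is 0.5, because the top-ranked story has grade 1 while the ideal top story has grade 2.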
Evaluation: Relevance Judgment Tool
Relevance Overview
Timeline
6 May, 2013 Release of training corpus
4 Sept, 2013 Release of test corpus
27 Oct, 2013 Submission of runs
10 Nov, 2013 Release of qrels (result notification)
15 Nov, 2013 Working notes due
05 Dec, 2013 CL!NSS @ FIRE in New Delhi!
Participation Overview
Submission details
Teams were asked to submit their results as a rank list for each language pair.
Each team could submit up to 3 runs to try different approaches or configurations.
Participation
Teams          2012  2013
Registered     10    16
Participated   3     8
Runs           8     23
Working notes  2     6
Results
Lessons Learnt
Manually determining the focal and news events is sometimes quite difficult.
The scores achieved this year are quite high: NDCG@1 of 0.78 vs. last year’s best of 0.32.
Incorporating meta-information explicitly in similarity estimation helps.
Carefully selecting query terms from the target documents also helps to improve performance.
Although the approaches treat the problem as ranking, more sophisticated modeling of stories would certainly help in determining same focal events.
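One way to picture "incorporating meta-information explicitly" is to blend a text similarity with the closeness of the publication dates, which are part of the corpus metadata. The sketch below is purely illustrative and does not reproduce any participant's system: the linear decay, the `window_days` cut-off, and the interpolation `weight` are all hypothetical choices.

```python
from datetime import date


def date_proximity(d1: date, d2: date, window_days: int = 3) -> float:
    """1.0 for same-day stories, decaying linearly to 0.0 at window_days."""
    gap = abs((d1 - d2).days)
    return max(0.0, 1.0 - gap / window_days)


def combined_score(
    text_sim: float,
    target_date: date,
    source_date: date,
    weight: float = 0.7,
    window_days: int = 3,
) -> float:
    """Interpolate a text similarity in [0, 1] with date proximity.

    weight is the share given to text similarity (hypothetical value).
    """
    proximity = date_proximity(target_date, source_date, window_days)
    return weight * text_sim + (1.0 - weight) * proximity


# Toy example: equal text similarity, different publication gaps.
same_day = combined_score(0.5, date(2013, 5, 6), date(2013, 5, 6))
far_apart = combined_score(0.5, date(2013, 5, 6), date(2013, 5, 16))
print(same_day > far_apart)
```

Under this scheme two source stories with equal text similarity are separated by how close their publication dates are to the target story's date.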
CL!NSS Programme
Time   Details           Speaker/s
4th December
12:00  Overview Talk     Parth Gupta
5th December
15:30  Participant Talk  Amogh Param
15:45  Participant Talk  Piyush Arora
16:00  Participant Talk  Aarti Kumar
16:15  Participant Talk  Sujoy Das
Thank You! (on behalf of the CL!NSS Team)
http://www.dsic.upv.es/grupos/nle/clinss.html
References

Barker, E. and Gaizauskas, R. J. (2012). Assessing the comparability of news texts. In LREC.

Platt, J. C., Toutanova, K., and Yih, W.-t. (2010). Translingual document representations from discriminative projections. In EMNLP, pages 251–261.