LREC 2008, May 26 – June 1, Marrakesh
15 Years of Language Resource Creation and Sharing: A Progress Report on LDC Activities
Christopher Cieri, Mark Liberman
{ccieri,myl}@ldc.upenn.edu
University of Pennsylvania
Linguistic Data Consortium
3600 Market Street, Suite 810
Philadelphia PA. 19104, USA




Language Resource Landscape
Change in Language Resource landscape continues; some new trends emerging since last report
Continuing growth in need for language resources
    number of languages, sophistication of annotation, variety of user communities
Continuing advances in computing enable ever greater resource creation by individual researchers
However, demand for data centers has never been greater
    as measured by: memberships, resource donations, projects
Some technologies approaching human performance
    quality becomes more important even at the cost of volume
    Understanding natural limits of human performance becomes very important
    DARPA TIDES & EARS, 2004/5, groups working in MT & STT did not use all data provided
    DARPA GALE emphasizes source variation, richness, quality of annotation, coordination of resource types
    REFLEX LCTL (Less Commonly Taught Languages), NIST LRE (Language Recognition Evaluation) focus on diversity of languages and resource types, not volume in any specific language or type
Move toward digital linguistic resources by new research communities increases resource sharing
    new communities need simple, adaptive access to existing data and flexible standards
    communities extending data sharing require mapping among alternate representations
Growth of computing around the world
    Increases the diversity of languages represented on the Internet
    Raises the demand for technologies in these languages that in turn requires language resource kits


LDC
Linguistic Data Consortium established 1992
    centralized location to distribute and archive language data
    normalize and manage intellectual property rights and distribution practice
Organized as a consortium, group of organizations, hosted by U. Penn.
Management staff in Philadelphia: 45 FT & <= 65 PT employees
Funding
    DARPA seed funding covered operations + corpus creation
    early support from NSF, NIST
    required to be self-sufficient within 5 years (operation costs <= fees)
    annual membership fees, data licenses
    grant funding for specific resource creation, not maintenance
Data comes from donations, funded projects at LDC or elsewhere, community initiatives and LDC initiatives.
Expansion
    1995: collection, transcription activities
    1998: annotation
    1999: tools and standards
    2002: coordinating multi-site efforts, sharing experience through publications, training
LDC's mission as currently defined is to support language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.
Activities
    resource distribution, intellectual property rights management, resource production
    data collection, annotation, lexicon building
    tool creation, infrastructure building
    creation of best practices, consulting and training
    corpus creation research, resource coordination


Benefits
Broad distribution of data with uniform licensing within and across research communities, which
    relieves funding agencies of distribution costs and
    provides vast amounts of data to members
Sustains stable infrastructure so that
    Research communities know where to find data with greatly standardized terms of use and distribution methods
    Members' access to data is ongoing
    Any patches are available via the same methods
    Tools and specifications are distributed without fee
The cost to create any one of the corpora in the LDC catalog is at least as much as the membership fee; in many cases it is one, two or even three orders of magnitude greater.


LDC
data collection
    news text
    web text: blogs, zines, newsgroups
    broadcast news and talk
    telephone conversation
    meetings
    interviews
    read and prompted speech
    printed, handwritten and hybrid documents
annotation
    quick and careful transcription
    time-alignment and segmentation at the turn, sentence and word level
    tagging of morphology, part-of-speech, gloss
    syntactic annotation
    semantic annotation
    discourse function and disfluency
    categorization according to topic relevance
    identification and classification of entities, relations, events and their co-reference
    summarization of various lengths from 200 words down to titles
    translation, multiple translation, translation quality control
    alignment of translated text at the document, sentence and word levels
lexicon building
    pronunciation, morphological, translation


Publications/Membership
New Membership types:
    Online: access to subset of data included in LDC Online
    Standard: LDC Online access, plus may request licenses for <= 16 corpora, discounted licenses of data from previous years, discounted extra copies of licensed data
    Subscription: same as Standard members, but automatically receive 2 copies of all corpora on media as they are released
    Subscription memberships, added in 2005, now account for 23% of all members.
Cost increases
    Due to rising costs of facilities, materials and labor
    Licensing fees increased in 2007
    Membership fees increased as of January 2008
    Increases modest
        10% for subscription members
        20% for standard members
        compared to average 3% annual increase in time value of money * 15 years
        scaled according to the demand of member type
    "Frequent flyer" and early bird discounts
        5% for any returning member
        5% for any organization joining in first 2 months of membership year
    Overall effect
        subscription members who maintain their membership from year to year and renew early in the year will actually see a 1% decrease in costs.
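The 1% figure for subscription members can be checked with a little arithmetic. The sketch below is illustrative only: the function name `net_change` and the two ways of combining the 5% discounts are assumptions, since the slides do not say how the discounts are applied. It also puts the 10% and 20% increases next to 15 years of 3% annual growth, computed both without and with compounding.

```python
def net_change(increase, discounts, additive=True):
    """Fractional fee change after one increase and a set of discounts.

    additive=True applies the discounts as one combined reduction
    (an assumption); additive=False applies them one after another.
    """
    factor = 1.0 + increase
    if additive:
        factor *= 1.0 - sum(discounts)
    else:
        for d in discounts:
            factor *= 1.0 - d
    return factor - 1.0


# Subscription member: 10% increase, 5% returning + 5% early-bird discounts.
combined = net_change(0.10, [0.05, 0.05])                    # about -0.010
sequential = net_change(0.10, [0.05, 0.05], additive=False)  # about -0.007

# 3% per year over 15 years, for comparison with the 10% and 20% increases.
simple_growth = 0.03 * 15            # 0.45 if not compounded
compound_growth = 1.03 ** 15 - 1.0   # about 0.558 if compounded

print(f"combined discounts:   {combined:+.1%}")
print(f"sequential discounts: {sequential:+.1%}")
print(f"15 years at 3%: {simple_growth:.0%} simple, {compound_growth:.1%} compounded")
```

Under the combined reading, 1.10 x 0.90 = 0.99, which matches the stated 1% decrease; applying the discounts one after another gives a slightly smaller decrease of about 0.7%.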

LDC currently adds 2-3 corpora to Catalog/month.
    Membership and licensing fees support this activity completely

LDC has distributed 53,580 copies of nearly 800 corpora and otherwise shared data with 2540 organizations in 67 countries.


Publications
Since last report, LDC added
    68 titles to Catalog + dozens of corpora for evaluation programs
A sampling of those corpora includes:
    email from the Enron scandal annotated for topic
    Gigaword (billion word) News Text corpora in Arabic, Chinese, English, French, Spanish
    broadcast news in Arabic, Korean
    many contributions from Center for Spoken Language Understanding (CSLU)
        Foreign Accented English, Apple Words and Phrases, Yes/No, Spelled and Spoken Words, Stories, Multilanguage Telephone Speech, Portland and National Cellular Telephone Speech, Names Release, Speaker Recognition, Spoltech Brazilian Portuguese and Voices
    parallel text including Arabic Blogs (DARPA GALE)
    Hungarian-English parallel text (Varga, Németh, Halácsy, Kornai)
    STC-TIMIT: TIMIT data processed through telephone network, contributed by Morales
    Urdu speech from the Army Research Labs
    Speech in Korean and Spanish contributed by West Point
    Treebanks in Arabic, Chinese, Czech, English, Korean with translations of Arabic, Chinese
    Penn Discourse Treebank (Joshi et al.)
    Propbank in Korean
    OntoNotes Release 2.0
    Conversational Telephone Speech in Levantine, Iraqi and Gulf Arabic
    Parallel Text in Arabic and Chinese (including 2 from ISI)
    Broadcast News Parallel Text (LDC, MITRE)
    Video key frames and transcripts created by the TRECVID program
    Broadband Prompted Speech in English and Turkish (Middle East Technical University)
    Telephone Band Speech in Russian
    Evaluation data from the NIST 2003 and 2004 Rich Transcription campaigns
    TimeBank corpus contributed by Pustejovsky et al.
    SpatialML annotation of ACE 2005 Multilingual (Mani, Hitzeman, Richer, Harris)


Sample Projects
DARPA GALE (Global Autonomous Language Exploitation)
    supports multilingual transcription, translation into English and distillation of text into structured information
    text (news, newsgroup, blog), transcribed speech (broadcast news and conversation) translated and aligned at sentence and sub-sentence level, annotations for syntactic structure & propositional content, distillation into structured information
    English, Mandarin and Arabic
MADCAT
    supports systems that perform OCR (,LR) and MT of handwritten, printed and hybrid text
    varying scribe, text type, writing instrument, time, speed of writing, paper quality
    first language Arabic
Mixer Phases 1-5
    support robust speaker recognition technologies
    multigenre: conversational telephone speech, transcript reading, face-to-face interviews
    multilingual: Arabic, English, Mandarin, Russian, Spanish
    multichannel: lavalier on the subject and interviewer, Etymotic Link-It micro-array, podium, PZM, studio, hanging conference room, camcorder, 4 studio mics at varying distances from subject, microphone array, head mounted mic used only for brief telephone calls
LVDID (Language Variation and Dialect Identification)
    >100 conversations in each of a dozen linguistic varieties
    ongoing collection in another 20 varieties
    with all calls audited for sound quality and language
REFLEX-LCTL (Less Commonly Taught Languages) [Simpson et al.]
    supports multiple technologies for LCTLs, especially extraction and translation
    monolingual & parallel news text, bilingual lexicons, encoding converters, word & sentence segmenters, POS tagsets and taggers, morphological analyzers and tagged text, named-entity tagger and tagged text, personal name transliterator and grammatical sketch
    Amazigh (Berber), Bengali, Hungarian, Pashto, Punjabi, Kurdish, Tagalog, Tamil, Thai, Tigrigna, Urdu, Uzbek and Yoruba


Projects

ACE - English, Chinese and Arabic corpora annotated for entities, the relations among them and the events in which they participate and their co-reference.

HAVIC - web video collected, classified and annotated

TREC Video - broadcast video, key frames, transcripts

Mixer Greybeard - multiple telephone conversations from subjects in previous studies

ITRE - scientific text in the biomedical domain treebanked and tagged for entities

OLAC - ongoing development of the Open Language Archives Community

QLDB - methods, tools for querying complex linguistic databases including treebanks


Sample Collaborations
ELRA
    joint programs: NetDC
    collaboration: ENABLER, NEMLAR, FlareNet
    joint data releases
    subcontracts
        LDC->ELRA, MedLTC: Arabic BN collection/transcription to ELRA/MedLTC
        ELRA considering subcontracting Spanish collection/transcription to LDC
ANC
Appen – TRANSTAC
BUTE – REFLEX LCTL
CASL
CMU – REFLEX LCTL Elicitation Corpus
DGA
ELSNET
IRCAM
Melbourne University
OLAC
TalkBank


Future Plans

maintain a leadership role in language resource creation and distribution

continue to support distribution operations and to provide increasing support for local initiatives via memberships and data licenses

extend outreach to new communities including commercial ventures that require specialized corpora

make better use of technologies that are based upon LDC data

generally increase activities devoted to research

simplify production through efficiency and outsourcing

expand provision of tools, specifications and training to members