European Language Resource Association
A European Infrastructure for Language Resource Distribution and HL Technology Evaluation

Khalid CHOUKRI, ELRA/ELDA
55 Rue Brillat-Savarin, F-75013 Paris, France
Tel. +33 1 43 13 33 33 -- Fax. +33 1 43 13 33 30
Email: [email protected]
Web: http://www.elda.fr/

Bergen 2002/10/24-25, Norwegian Language Bank

Upload: judith-poole

Post on 28-Dec-2015



Outline

• Rationale behind ELRA
• ELRA's Mission ... Structure ... Services
• Membership ...
• Identification ...
• Distribution
• Legal issues
• Market forecasts – Needs – Requirements
• Promotion ...
• ELRA Catalogue -- A quick overview – BLARK ...
• Activities in Europe / European & National scenes & Role of ELRA
• The ENABLER Initiative
• Conclusion

European Language Resource Association
An Improved Infrastructure for Data Sharing

Centralized not-for-profit organization for the collection, distribution, and validation of Language Resources and tools.

Operational agency: ELDA (Evaluation & Language Resources Distribution Agency)

The Association

• Membership Drive:
– ELRA is open to European & non-European institutions
– Resources are available to Members & Non-Members
– Pay per resource

• Some of the benefits of becoming a member:
– Substantial discounts on LR prices (over 70%)
– Legal and contractual assistance with respect to LR matters
– Access to validation and production manuals (quality assessment)
– Figures and facts about the market (results of ELRA surveys)
– Newsletter and other publications

European Language Resource Association
An Improved Infrastructure for Data Sharing

A Repository Center:
– Technical & logistic issues
– Commercial issues (prices, fees, royalties)
– Legal issues (licensing, IPR)
– Information dissemination

An association of users of Language Resources

Infrastructure for the evaluation of Human Language Technologies, providing resources, tools, methodologies, logistics

Exit strategies / Capitalization on evaluation packages

Application to Norwegian

ELRA Offer

[Diagram: ELRA Distribution Agency. Resource types: Speech, Lexica, Corpora, Terminology, with mono-lingual and multi-lingual variants. Example applications listed in the diagram:]

• Speech recognition, speech processing, speech synthesis, therapy and speech, ...
• Terminological extraction, NL parser assessment, automatic text summary
• Database creation, lexicon consolidation, translation memory validation, spell checkers, ...
• Information retrieval, document indexing, ...
• Information retrieval, thesaurus implementation, generation, ...
• Translation, generation, dictionary consolidation, ...
• Information retrieval, document indexing, automatic translation systems, ...

Membership Drive

Colleges: Speech, Written, Terminology
Membership fees => 4 categories

[Chart: number of members per year, 1995 to 2000; vertical axis from 0 to 120]

Legal Issues - Licensing

Provider-User Agreements

[Diagram labels: Providers; Norwegian data center; Users]

Legal Issues - Licensing

[Diagram labels: ELRA; Providers / Owners (Distribution Agreement); VAR end-users (VAR Agreement); End-Users (End-User Agreement)]

Quick Overview
Basic Language Resources --- Spoken / Written Resources

What should be available for all languages:

· Lexicons: based on Parole, Simple, (Euro)WordNet, and more generally EAGLES/ISLE
· Corpus:
  (Country/language) National Corpus
  (…) Business/scientific Corpus
  (…) Broadcast News - Transcriptions
· Multilingual/Bilingual: Lexica, Corpora (Comparable / Aligned / Parallel)

Quick Overview
Basic Language Resources --- Spoken / Written Resources

What should be available for all languages:

· Articulatory databases (e.g. ACCOR)
· Basic speech data: some phonetic material and some phonetic sequences, by a small number of speakers, recorded in a quiet environment (EUROM 1 & BABEL)
· Pronunciation lexicon (BDLEX, PHONOLEX)
· Proper names pronunciation lexicon (ONOMASTICA)
· Newspaper read text (BREF, Siemens-100, Apasci)
· Basic telephone speech (SPEECHDAT)
· Telephone-based speaker verification (PolyVar)
· Text corpora for language models (MLCC, Le Monde …)

BLARK: Basic LAnguage Resource Kit

Speech Resources (columns: fre-fr, spa-es, nor-no, ger-de)
Broadcast speech: -- e--- e
Articulatory database: E E E
Microphone/desktop speech: E E e E
Read newspaper texts: E E
Telephone speech database: E E E E
Mobile-radio speech:
Pronunciation lexicon: E e E
Onomasticon: e e e E
Speaker identification speech corpus:

Text Corpora (columns: fre-fr, spa-es, nor-no, ger-de)
Broadcast text corpus:
Conversation text corpus: e
Newswire text corpus: E E E
Monolingual corpus: E e e E
Multilingual and parallel corpus: E E e E
Treebank: e

Basic Speech resources -- (Europe)

Columns: UK I F SF BF G EI SP Cat Bq Pt Gr ThN Dan Sw Nor Finn ChSpCSp Cz Hun Est Rom Latv PL Slova

Articulatory database: A E E E E E E
Basic speech data: A A A E E E E E E E E E S S S S
Pronunciation lexicon: A A A A
Proper names pron. lex.: E E E A*** E E E E A A A A A A
Newspaper read text: A A A E E A
Basic telephone speech: A A A A U A U A A E A A A S U A A U U U U
Teleph. speaker verif.: A A
Text corpora for language models: A A A A A A

Legend:
A: Available through ELRA
S: Available through ELRA within the next quarter
E: Exist/identified but not (never!) available
U: Under completion / well-advanced project with distribution plans
"blank": Probably not available / has not been identified

** We exclude the lexicons that come with SpeechDat
*** Available through German telecoms

Languages

[Bar chart: "Language answers % (above 20)"; two series, % language answers and % respondents; vertical axis from 0% to 90%]

Funding(s) of Language Resources

Public Funding
• Commission of the European Union (e.g. R&D FPs)
• National agencies & authorities

Private Funding

Criteria for Language Resources funding ... !

Brief Overview of recent activities in Europe
European Union Level

Some projects within FP5 and previous FPs related to our concerns:

• Resources production: SpeechDat family
• Specifications of new types of resources: Natural Interaction and MultiModality, within the ISLE (International Standards for Language Engineering) project
• Standards/metadata: EAGLES and its extension, the EU/US collaborative project ISLE; coming soon: INTERA
• Coordination: ENABLER; coming soon: NEMLAR
• Information gathering & dissemination: Euromap and its follow-up Hope

SpeechDat Family

• SpeechDat(M) --- Fixed telephone network -- 1K speakers
• SpeechDat-II: Fixed, Mobile, 1-5K speakers
• SpeechDat-II: Speaker Verification
• SpeechDat-E (CEE - Polish, Czech, Slovak, Russian, Hungarian)
• SALA (Speech Across Latin America) and now SALA-II
• SpeechDat-Car (inc. cellular)
• SpeeCon (Consumer products)
• Orientel


SpeeCon Project

Dialectal zone | Language   | Region                       | Remarks
Esl_ES         | Spanish    | Spain                        | (excluding Latin America)
Rus_RU 1)      | Russian    | Russia                       |
Ita_IT         | Italian    | Italy                        |
Sve_SE_FI      | Swedish    | Sweden and Finland           |
Deu_DE_AT      | German     | Germany and Austria          | (excluding e.g. Belgium, Luxembourg, Switzerland)
Eng_GB         | English    | United Kingdom               |
Dan_DK         | Danish     | Denmark                      |
Dut_BE         | Dutch      | Belgium                      |
Fra_CA         | French     | Canada                       |
Fra_FR         | French     | France                       | (excluding e.g. Belgium, Luxembourg, Switzerland)
Fin_FI         | Finnish    | Finland                      |
Zho_CN_HK      | Mandarin   | P. R. China (incl. Hongkong) | (excluding e.g. Taiwan)
Dut_NL         | Dutch      | The Netherlands              |
Jpn_JP         | Japanese   | Japan                        |
Pol_PL         | Polish     | Poland                       |
Por_PT         | Portuguese | Portugal                     | (excluding Brazil)
Deu_CH         | German     | Switzerland                  |
Eng_US         | English    | USA                          | (excluding e.g. Canada)
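The dialectal zone codes listed for SpeeCon appear to follow one pattern: a three-letter language code followed by one or more two-letter region codes, joined by underscores (for example Deu_DE_AT for German in Germany and Austria). The following is a minimal sketch of splitting such a code, assuming that pattern holds for every row. The names `Zone` and `parse_zone` are hypothetical and are not part of any SpeeCon specification.

```python
from typing import NamedTuple, Tuple


class Zone(NamedTuple):
    language: str             # three-letter language code, e.g. "Deu"
    regions: Tuple[str, ...]  # one or more two-letter region codes


def parse_zone(code: str) -> Zone:
    """Split a dialectal zone code such as 'Sve_SE_FI' into its parts."""
    language, *regions = code.split("_")
    if len(language) != 3 or not regions:
        raise ValueError(f"unexpected zone code: {code!r}")
    if any(len(region) != 2 for region in regions):
        raise ValueError(f"unexpected region in zone code: {code!r}")
    return Zone(language, tuple(regions))


# Codes copied from the SpeeCon list; some zones span two regions.
for code in ["Esl_ES", "Sve_SE_FI", "Deu_DE_AT", "Zho_CN_HK", "Eng_US"]:
    print(code, "->", parse_zone(code))
```

Keeping the regions as a tuple makes zones that span two countries (Sweden and Finland, Germany and Austria) explicit rather than folding them into a single string.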

SpeechDat Family: SALA-II, what you may get with PRIVATE funding

SALA II cellular/mobile network (1000 speakers)

Partner   | Latin America | US and Canada
ATLAS     | Venezuela     |
Loquendo  | Chile         | US English South
Lucent    | Argentina     |
Microsoft | Peru          | US English North
NSC       | Mexico        | US English Midland
Philips   | Brazil        | US Spanish West
Siemens   | Colombia      | US English West
Telisma   | Costa Rica    | Canadian French

Brief Overview of recent activities at National level

Top-down vs Bottom-up approaches

Examples of National Projects/programs

Over nine national projects, among which:

• Netherlands & Belgium: continues, now at Release 5; data available via ELRA (release of April 2002)
• France: Action Techno-Langue
• Italy: Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta
• Norway: Norwegian Language Bank

Dutch & Flemish

Release 1 (March 2000)
· 62 hours of speech samples, orthographically transcribed (615,000 words), 90,000 words enriched with Part-of-Speech tags;
· annotation CD with first version of PRAAT (annotation tool) and first version of documentation (in Dutch), including relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment) (information on the speakers in anonymous form);

Release 2 (October 2000)
· over 150 hours of speech samples, orthographically transcribed (over 1,500,000 words), approximately 750,000 words enriched with Part-of-Speech tags;
· annotation CD with annotation protocols and relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment) (information on the speakers in anonymous form);

Release 3 (April 2001)
· more orthographically transcribed data enriched with Part-of-Speech tags;
· the first broad phonetic transcriptions, word alignments, syntactic annotations, and lexicon link-up will be available;
· annotation CD with documentation, including relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment); this release encompasses the first version of Corex, the exploitation tool.

National Projects/programs: Example of Italy

With contribution from N. Calzolari and A. Zampolli

2 national projects under 2 different “Programs”. The Programs were not specific to HLT, but general: one for industrial R&D and the other for the South of Italy.

Both projects are coordinated by A. Zampolli in Pisa.

Goal: to extend core resources built in EU projects, create new LR, the tools needed to manage the resources, a platform for NLP development, and technology transfer towards SME.

National Projects/programs: Example of Italy

TAL - Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta
(with 13 partners from private organisations)

Duration: 2 years, finished in 2002.

Partners:
CPR - Consorzio Pisa Ricerche; ITC - Istituto Trentino di Cultura; CSELT - Centro Studi e Laboratori Telecomunicazioni; SYNTHEMA; CVR - Consorzio Venezia Ricerche; CERTIA - Centro per la Ricerca, Sviluppo, Formazione nelle Tecnologie e Applicazioni Informatiche; QUINARY; ALCEO; COMPUTER SHARING; DELCO; GST - Gruppo Soluzioni Tecnologiche; INTERACTIVE MEDIA; NECSY - Network Control Systems

National Projects/programs: Example of Italy

LCRMM - Linguistica computazionale: ricerche monolingui e multilingui
(cluster "Linguistica", legge 488, with 16 partners from private and public organisations)

• Duration: 3 years, will finish in 2003.

• Partners:
CPR, Pisa; CIRASS, Napoli; THAMUS, Salerno; ILC-CNR, Pisa; SYNTHEMA, Pisa; Istituto Universitario Orientale, Napoli; Dipartimento di Scienze Storiche del Mondo Antico, Università di Pisa; Sportello per la Cooperazione Scientifica e Tecnologica con i Paesi del Mediterraneo (SMED) del CNR, Napoli.

National Projects/programs: Example of Italy

Italy: Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta

• ItalWordNet (~50,000 entries).

• Corpus di italiano parlato --- 100 hours of speech consisting of:
a) 10h Radio-TV broadcast data (notiziari, interviste, talk show)
b) 60h Map task like collection
c) 5h Lab data for lexical coverage
d) 10h telephone conversational speech
e) 10h Domain specific (finances, touristic information etc.)

• Annotated dialogues for speech interfaces (H-H and H-M interactions) (Dialoghi annotati per applicazioni di interfacce vocali avanzate): 450 dialogues annotated at all levels (morphological … prosody … semantics …)

National Projects/programs: Example of Italy

Goal: to extend core resources built in EU projects, create new LR, the tools needed to manage the resources, a platform for NLP development, and technology transfer towards SME.

The total cost was about 7 million euro, with funding of almost 5 million euro.

The costs were equally divided between Spoken & Written areas.

In both projects the consortia agreed to distribute the LR through ELRA (with a special price for Italian users).

Now, after the conference TIPI in Roma, under the sponsorship of the Ministry of Communications, the topic of HLT has been inserted in the Framework Programme for the financing of R&D in Italy.

It was also decided to constitute a Forum for HLT, of which Zampolli is president. The Forum will start working soon, also to prepare new national initiatives, to maintain LR, to write a white book on HLT in Italy, to coordinate with national activities in other EU countries, etc.

National Projects/programs: Example of France

France

Technolangue Action

With contribution from J. Mariani

Language Technologies
« TechnoLangue » Action

Ministère de la Culture et de la Communication

Ministère de la Jeunesse, de l’Education Nationale et de la Recherche

Ministère de l’Economie, des Finances et de l’Industrie

« TechnoLangue » action

• Report to Prime Minister (November 2000)
• Meeting Min. Industry, Research, Culture: June 2001
• Action: Technology survey and evaluation
• Basic Technological Research
• Articulate with present actions
– Research & Innovation Technological Networks
– 4 ICT RRIT: Telecommunications, Software, Micronanotechnologies, Audiovisual & multimedia
– Ministry of Research action on Technological Survey (VSE)

« TechnoLangue » structure

Infrastructure program to support technological innovation, while existing R&D projects stay with RRIT & VSE (120 M€ / year)

[Diagram: TECHNOLANGUE shown alongside RNRT, RNTL, RIAM, VSE]

[Diagram with three stages: Basic Research, Technology Development, Application Development. Other labels: Bottleneck identification; Research results in quantitative evaluation; Technologies necessitated for applications; Technologies which have been validated for applications; Long term / high risk; Large return of investment; Evolutionary; Usability; Acceptability; Meeting points with technology development; Quantitative Evaluation; Usage Evaluation]

« TechnoLangue » action

• Organization
– Executive Committee (EC) chaired by C. Fluhr (CEA)
– Comprising 15 members:
  • 3 RRIT representatives: B. Bachimont (INA - RIAM), C. Sedogbo (Thalès - RNTL), C. Waast (IBM - RNRT)
  • 3 Public research: C. Fluhr (CEA), E. Geoffrois (DGA), P. Paroubek (Limsi-CNRS)
  • 5 Industrials: K. Choukri (ELDA), B. Normier (Lingway), J.-J. Rigoni (Elan Informatique), F. Segond (Xerox) + C. Sorin (FT R&D)
  • 4 Administrations: S. Chaudiron (MR), J. Mariani (MR), D. Malbert (MCC), J. Mathieu (MinEFI)
– Good balance between research & industry - written/spoken

« TechnoLangue » action

• Install a User Committee
– Ministry of Foreign Affairs
  • Automatic translation, multilingualism…
– Ministry of public administration
  • Simplification of the administrative language...
– Ministry of National Education
  • Training technologies, language training...
– …

« TechnoLangue » Call

• International cooperation
– Cooperation mechanisms within TechnoLangue
  • foreign entities may participate in the projects
  • financing from their own funds
– Future cooperation among similar national programs
  • EU Countries (Italy, Germany, Norway, Spain, Greece, The Netherlands, Switzerland…)
  • Prepare the construction of the European Research Area (ERA)
    – The EC supports the coordination and generic technologies cost
    – Each country supports the cost for covering its language(s): specific technology development/adaptation: (annotated) corpus (spoken/written), lexicon (incl. pronun.), dictionaries...
  • USA, Japan, South Africa…

« TechnoLangue » Call

• 4 meetings of the Executive Committee
• A Call for Proposals with 4 parts
– Part 1: Language resources
– Part 2: Evaluation
– Part 3: Norms & standards
– Part 4: Technological survey
• Calendar:
– Launched April 15, 2002
– Deadline: May 31 / June 10 (Electronic) - June 17 (Paper)
– Results: July 19, 2002

« TechnoLangue » Call

– Language resources
  • Spoken/written data (corpus, dictionaries, terminological data…)
  • Basic Language Processing Tools (Open Source)
  • Production, validation, distribution (incl. legal, economical aspects)
  • For a large use by a large community (education, training…)
– Evaluation
  • Technology (evaluation campaign)
  • Applications (evaluation toolkits)
  • Methodology (metrics / protocols)
– Norms & standards
  • Shared effort to improve French participation
– Technological survey
  • In relationship with on-going actions (Euromap...)

Part 1: Language Resources

• Stimulate the production and the distribution of language resources for:
– answering minimal needs (Basic LAnguage Resource Kit) for the French language;
– promoting resource reusability;
– supporting research;
– helping industrial applications development;
– decreasing the cost of entering the sector for newcomers
• Should include the French language, possibly in connection with other languages

Part 1: Language Resources

• Spoken and written data:
  • oral corpus, pronunciation lexicons, etc.
  • databases for speech synthesis;
  • monolingual and multilingual text corpus (parallel, comparable...);
  • lexicons, terminology, grammars,...
  • lexical semantic resources: ontologies, thesauri,...
  • multimodal corpus, etc.
• Basic software tools:
  • morphosyntactic taggers, syntactic parsers, semantic tools,
  • terminology extractors,
  • language identifiers,
  • corpus annotation tools,
  • lemmatizers, etc.

Part 1: Language Resources

• Encourage and facilitate the use of those resources
– Putting them in new (young) user hands
– Same approach as for GUIs: “VUIs”
– Language Technology Kits with “User’s guide”
  • Distribution towards specialized education entities (NLP, Document Engineering…) and more largely towards training centers (Universities, Technical Universities, Engineering schools...)
  • While ensuring feedback from experience
– Open Source software economical model

Part 2: Evaluation

• 3 areas:
– Technology evaluation
– Application evaluation
– Evaluation methodologies

Part 2: Evaluation

• Technology evaluation
– Organization of comparative evaluation campaigns for technologies presently not covered by European or international programs, or with a complementary approach
– Includes the production of the data necessary for the evaluation, in a monolingual, multilingual or crosslingual context
– Scientific and industrial interest of the evaluation should appear (large enough number of participants)
– The projects must define the evaluation methodology and justify the practical organization aspects

Part 2: Evaluation

• Application evaluation
– The objective is to develop evaluation methodologies for industrial or pre-industrial products
– The methodologies may result in “toolboxes”, also regrouping user-oriented methodologies and protocols, or in test software packages
– The methodologies should be generic (class of applications)
– The proposals should demonstrate the project's economical and industrial interest, and the modalities of the distribution of the “toolboxes”

Part 2: Evaluation

• Evaluation methodologies
– Improve the present evaluation methodologies
– Identify new (quantitative and qualitative) approaches for already evaluated technologies:
  • socio-technical and psycho-cognitive aspects
  • cognitive modeling of evaluation
– Identify protocols for new technologies and applications
  • Virtual Reality, Multimodal interaction, Language on the Internet...

Part 3: Standards

• Support the participation of French actors in normalization and standardization bodies
– Presently weak participation of French actors in normalization and standardization bodies
– Of strategic importance
– Variety of places where the normalization activities are taking place: official or non-official committees, forums, projects,...

Part 3: Standards

• Actions:
– Support the creation of consortia to reinforce the French presence in various bodies (ISO, CEN, W3C,...)
– Help share efforts among French participants
– Identify a topic and ensure a permanent participation in all related bodies: character sets, exchange format, phonetic alphabet transcription, etc.
– Necessity of articulating the project with French bodies already involved: AFNOR, W3C French Chapter,...

Part 4: Survey

• Part 4 - Install an information survey
– Create a portal on Language Engineering in order to give access to:
  • panorama of the industrial and technological offer
  • state-of-the-art in science and technology
  • identification of language resources
  • identification of technological bottlenecks
  • a list of Calls for Proposals
  • a presentation of the market key numbers
  • information on norms and standards (with Internet links)
– Should be linked with existing sites (Euromap,...)

Results

• 52 proposals submitted
– Total proposal costs: 35.9 M€
– Total requested support: 21.7 M€
– Clustering within each of the 4 topics
• 26 projects selected
• 173 participations, 94 participants:
– 33 industry
– 39 public research
– 11 other (Associations, CEA, DGA…)
– 11 foreign (Bell Labs, NII, EPFL, LATL…)
• Budget: 6.2 M€

Results

• 26 selected projects:
– 8 on Language resources
  • BLARK (Cf BNC), Fr-En, G, Sp, It, Arabic dictionaries
  • Specialized (aerospace, automotive…), proper names dictionaries
  • Aligned corpus (7 novels, 19th century literature, in 4 languages)
– 6 on Tools (Open source)
  • Lemmatizer, Chunker, Guesser, Tagger, Parser, Speaker recogn., Topic & NE detector, summarizer, term. extractor, Search engine...
– 3 on Standards (Spoken / Written)
– 1 on Technological survey (Portal)
– 8 on Evaluation: 7 on technology, 1 on usage evaluation
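The figures quoted on the two Results slides can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated on those slides; the variable names are illustrative, and the two derived ratios are plain divisions computed here, not figures reported by the TechnoLangue programme.

```python
# Figures quoted on the Results slides (amounts in millions of euro).
proposals_submitted = 52
projects_selected = 26
requested_support = 21.7
budget = 6.2

participants = {"industry": 33, "public research": 39, "other": 11, "foreign": 11}
projects = {
    "language resources": 8,
    "tools": 6,
    "standards": 3,
    "technological survey": 1,
    "evaluation": 8,
}

# The category counts should add up to the quoted totals (94 and 26).
total_participants = sum(participants.values())
total_projects = sum(projects.values())

# Derived ratios (assumption: simple division, not stated on the slides).
selection_rate = projects_selected / proposals_submitted
funded_share = budget / requested_support

print(f"participants: {total_participants}, projects: {total_projects}")
print(f"selection rate: {selection_rate:.0%}")
print(f"budget / requested support: {funded_share:.1%}")
```

Both breakdowns sum to the totals given on the slides, which suggests the participant and project categories are exhaustive and non-overlapping.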

Technology Evaluation

• Written language
– Machine translation
– Text alignment
– Syntactic parsing
– Information query
• Spoken language
– Speech transcription / indexing (incl. Named Entity)
– Speech synthesis
– Spoken dialog

French Techno-Langue Conclusions

• Launch a large national program on Language Technology (TechnoLangue)
• In the perspective of installing a permanent infrastructure for Language Resources, Evaluation, Standards and Survey
• Hope that it can participate in the construction of the European Research Area
• And articulates well with international activities

National Projects/programs: Example of Norway

Norway: Norwegian Language Bank
Language technology resources in Norway

Launch conference: 24-25 October 2002 (Bergen, Norway)

The language bank will contain three types of data: spoken data, text, and lexical resources.

It will be organized as a foundation with state ownership.

The estimated budget is about NOK 100 million (12 M€).

ENABLER: European National Activities for Basic Language Engineering & Resources -- A new Initiative

• Survey of existing national activities
• Fostering common research and compatibility of LR
• Suggestion for and contribution to international cooperation
• Identification of existing resources (Universal Catalogue)
• The Basics (e.g. standards, tools, evaluation procedures, …)

Extension foreseen / planned

Next meeting: Pisa, 1st December 2002

Information Dissemination

• Newsletter (Bilingual English/French; issued each quarter)
• Catalogue
• Web Site (Bilingual: English/French), Web: http://www.elda.fr/
• ELRA Conference (LREC): International Language Resources & Evaluation Conference (every two years -- next issue: LREC’2002)