TRANSCRIPT
Bergen 2002/10/24-25 Norwegian Language Bank ELRA/ELDA KC/1
Khalid CHOUKRI, ELRA/ELDA
55 Rue Brillat-Savarin, F-75013 Paris, France
Tel. +33 1 43 13 33 33 -- Fax. +33 1 43 13 33 30
Email: [email protected]
Web: http://www.elda.fr/
European Language Resource Association
A European Infrastructure for Language Resource Distribution and HL Technology Evaluation

Outline
• Rationale behind ELRA
• ELRA’s Mission … Structure … Services
– Membership … Identification … Distribution
– Legal issues
– Market forecasts – Needs – Requirements
– Promotion …
• ELRA Catalogue -- A quick overview – BLARK …
• Activities in Europe / European & National scenes & Role of ELRA
• The ENABLER Initiative
• Conclusion

European Language Resource Association: An Improved Infrastructure for Data Sharing
Centralized not-for-profit organization for the collection, distribution, and validation of Language Resources and tools.
Operational agency: ELDA (Evaluation & Language Resources Distribution Agency)

The Association
• Membership Drive:
– ELRA is open to European & non-European institutions
– Resources are available to Members & Non-Members (pay per resource)
• Some of the benefits of becoming a member:
– Substantial discounts on LR prices (over 70%)
– Legal and contractual assistance with respect to LR matters
– Access to validation and production manuals (quality assessment)
– Figures and facts about the market (results of ELRA surveys)
– Newsletter and other publications

European Language Resource Association: An Improved Infrastructure for Data Sharing
• A Repository Center:
– Technical & logistic issues
– Commercial issues (prices, fees, royalties)
– Legal issues (licensing, IPR)
– Information dissemination
• An Association of users of Language Resources
• Infrastructure for the evaluation of Human Language Technologies, providing resources, tools, methodologies, logistics
• Exit strategies / Capitalization on evaluation packages
• Application to Norwegian

ELRA Offer
[Diagram: ELRA Distribution Agency, with four resource types (Speech, Lexica, Corpora, Terminology) and mono-lingual and multi-lingual branches. Application areas listed in the diagram:]
• Speech recognition, speech processing, speech synthesis, therapy and speech, ...
• Terminological extraction, NL parser assessment, automatic text summary
• Database creation, lexicon consolidation, translation memory validation, spell checkers, ...
• Information retrieval, document indexing, ...
• Information retrieval, thesaurus implementation, generation, ...
• Translation, generation, dictionary consolidation, ...
• Information retrieval, document indexing, automatic translation systems, ...

Membership Drive
Colleges: Speech, Written, Terminology
Membership fees => 4 categories
[Chart: Members per year, 1995 to 2000; vertical axis from 0 to 120]

Legal Issues - Licensing
Provider-User Agreements
[Diagram: Providers, Norwegian data center, Users]

Legal Issues - Licensing
[Diagram: Providers / Owners, ELRA, VARs, and End-Users, connected by the Distribution Agreement, the VAR Agreement, and the End-User Agreement]

Quick Overview: Basic Language Resources --- Spoken / Written Resources
What should be available for all languages:
· Lexicons: based on Parole, Simple, (Euro)WordNet and more generally EAGLES/ISLE
· Corpus:
– (Country/language) National Corpus
– Business/scientific Corpus
– Broadcast News - Transcriptions
· Multilingual/Bilingual:
– Lexica
– Corpora (Comparable / Aligned / Parallel)

Quick Overview: Basic Language Resources --- Spoken / Written Resources
What should be available for all languages:
· Articulatory databases (e.g. ACCOR)
· Basic speech data: some phonetic material and some phonetic sequences, by a small number of speakers, recorded in a quiet environment (EUROM 1 & BABEL)
· Pronunciation lexicon (BDLEX, PHONOLEX)
· Proper names pronunciation lexicon (ONOMASTICA)
· Newspaper read text (BREF, Siemens-100, Apasci)
· Basic telephone speech (SPEECHDAT)
· Telephone-based speaker verification (PolyVar)
· Text corpora for language models (MLCC, Le Monde …)

BLARK: Basic LAnguage Resource Kit
Speech Resources (columns: fre-fr spa-es nor-no ger-de)
Broadcast speech -- e--- e
Articulatory database E E E
Microphone/desktop speech E E e E
Read newspaper texts E E
Telephone speech database E E E E
Mobile-radio speech
Pronunciation lexicon E e E
Onomasticon e e e E
Speaker identification speech corpus
Text Corpora (columns: fre-fr spa-es nor-no ger-de)
Broadcast text corpus
Conversation text corpus e
Newswire text corpus E E E
Monolingual corpus E e e E
Multilingual and parallel corpus E E e E
Treebank e

Basic Speech resources -- (Europe)
Speech (columns: UK I F SF BF G EI SP Cat Bq Pt Gr ThN Dan Sw Nor FinnChSpCSp Cz Hun Est RomLatv PLSlova)
Articulatory database: A E E E E E E
Basic speech data: A A A E E E E E E E E E S S S S
Pronunciation lexicon: A A A A
Proper names pron. lex.: E E E A*** E E E E A A A A A A
Newspaper read text: A A A E E A
Basic telephone speech: A A A A U A U A A E A A A S U A A U U U U
Teleph. speaker verif.: A A
Text corpora for language models: A A A A A A
Legend:
A: Available through ELRA
S: Available through ELRA within the next quarter
E: Exist/identified but not (never!) available
U: Under completion / well-advanced project with distribution plans
"blank": Probably not available / has not been identified
** We exclude the lexicons that come with SpeechDat
*** Available through German telecoms
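As a rough illustration of how the legend can be read, the sketch below tallies the status codes listed for the "Basic telephone speech" row. The codes are copied in the order they appear in this transcript; the column-to-language alignment is not recoverable here, so blank cells are not counted. The names `LEGEND`, `BASIC_TELEPHONE_SPEECH`, and `tally` are illustrative only and do not come from the slides.

```python
from collections import Counter

# Status codes as defined in the slide legend.
LEGEND = {
    "A": "Available through ELRA",
    "S": "Available through ELRA within the next quarter",
    "E": "Exist/identified but not available",
    "U": "Under completion, with distribution plans",
}

# Codes of the "Basic telephone speech" row, in transcript order.
# Blank cells (language not identified) are not represented here.
BASIC_TELEPHONE_SPEECH = "A A A A U A U A A E A A A S U A A U U U U".split()


def tally(codes):
    """Count how many entries fall under each legend status."""
    unknown = set(codes) - set(LEGEND)
    if unknown:
        raise ValueError(f"codes not in legend: {sorted(unknown)}")
    return Counter(codes)


counts = tally(BASIC_TELEPHONE_SPEECH)
for code, n in counts.most_common():
    print(f"{code} ({LEGEND[code]}): {n}")
```

Counting per status rather than per language keeps the sketch independent of the lost column alignment.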

Languages
[Chart: language answers % (above 20); series "% language answers" and "% respondents"; vertical axis from 0% to 90%]

Funding(s) of Language Resources
Public Funding
• Commission of the European Union (e.g. R&D FPs)
• National agencies & Authorities
Private Funding
Criteria for Language Resources funding ….. !

Brief Overview of recent activities in Europe: European Union Level
Some projects within FP5 and previous FPs … related to our concerns
• Resources production: SpeechDat Family
• Specifications of new types of resources: Natural Interaction and MultiModality, within the ISLE (International Standards for Language Engineering) project
• Standards/metadata: EAGLES and its extension …, the EU/US collaborative project ISLE. Coming soon: INTERA
• Coordination: ENABLER. Coming soon: NEMLAR
• Information gathering & dissemination: Euromap and its follow-up Hope

SpeechDat Family
• SpeechDat(M) --- Fixed telephone network -- 1K speakers
• SpeechDat-II: Fixed, Mobile, 1-5K speakers
• SpeechDat-II: Speaker Verification
• SpeechDat-E (CEE - Polish, Czech, Slovak, Russian, Hungarian)
• SALA (Speech Across Latin America), and now SALA-II
• SpeechDat-Car (inc. cellular)
• SpeeCon (Consumer products)
• Orientel

SpeeCon Project
Dialectal zone  Language    Region                        Remarks
Esl_ES          Spanish     Spain                         (excluding Latin America)
Rus_RU 1)       Russian     Russia
Ita_IT          Italian     Italy
Sve_SE_FI       Swedish     Sweden and Finland
Deu_DE_AT       German      Germany and Austria           (excluding e.g. Belgium, Luxembourg, Switzerland)
Eng_GB          English     United Kingdom
Dan_DK          Danish      Denmark
Dut_BE          Dutch       Belgium
Fra_CA          French      Canada
Fra_FR          French      France                        (excluding e.g. Belgium, Luxembourg, Switzerland)
Fin_FI          Finnish     Finland
Zho_CN_HK       Mandarin    P. R. China (incl. Hongkong)  (excluding e.g. Taiwan)
Dut_NL          Dutch       The Netherlands
Jpn_JP          Japanese    Japan
Pol_PL          Polish      Poland
Por_PT          Portuguese  Portugal                      (excluding Brazil)
Deu_CH          German      Switzerland
Eng_US          English     USA                           (excluding e.g. Canada)
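The dialectal-zone codes in the SpeeCon table appear to follow a simple pattern: a three-letter language tag followed by one or more two-letter country tags, joined by underscores (for example Sve_SE_FI for Swedish in Sweden and Finland). A minimal sketch of splitting such a code, assuming that pattern holds for every entry, could look as follows; `parse_zone` is a hypothetical helper, not part of any SpeeCon tooling.

```python
def parse_zone(code):
    """Split a dialectal-zone code into (language_tag, [country_tags]).

    Assumes the pattern seen in the table: one three-letter language
    tag, then one or more two-letter upper-case country tags.
    """
    language, *countries = code.split("_")
    if len(language) != 3 or not countries:
        raise ValueError(f"unexpected zone code: {code!r}")
    for country in countries:
        if len(country) != 2 or not country.isupper():
            raise ValueError(f"unexpected country tag in {code!r}")
    return language, countries


for zone in ("Esl_ES", "Sve_SE_FI", "Zho_CN_HK"):
    language, countries = parse_zone(zone)
    print(f"{zone}: language={language}, countries={', '.join(countries)}")
```

Footnote markers such as the "1)" after Rus_RU are not part of the code and would need to be stripped before parsing.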

SpeechDat Family: SALA-II -- what you may get with PRIVATE funding
SALA II: cellular/mobile network (1000 speakers)
Partner    Latin America  US and Canada
ATLAS      Venezuela
Loquendo   Chile          US English South
Lucent     Argentina
Microsoft  Peru           US English North
NSC        Mexico         US English Midland
Philips    Brazil         US Spanish West
Siemens    Colombia       US English West
Telisma    Costa Rica     Canadian French

Brief Overview of recent activities at National level
Top-down vs Bottom-up approaches

Examples of National Projects/programs
Over nine national projects, among which:
• Netherlands & Belgium: continuing, now at Release 5. Data available via ELRA, release of April 2002
• France: Action Techno-Langue
• Italy: Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta (national infrastructure for language resources in the field of automatic processing of spoken and written natural language)
• Norway: Norwegian Language Bank

Dutch & Flemish
Release 1 (March 2000)
· 62 hours of speech samples, orthographically transcribed (615,000 words); 90,000 words enriched with Part-of-Speech tags;
· annotation CD with the first version of PRAAT (annotation tool) and the first version of the documentation (in Dutch), including relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment) (information on the speakers in anonymous form);
Release 2 (October 2000)
· over 150 hours of speech samples, orthographically transcribed (over 1,500,000 words); approximately 750,000 words enriched with Part-of-Speech tags;
· annotation CD with annotation protocols and relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment) (information on the speakers in anonymous form);
Release 3 (April 2001)
· more orthographically transcribed data enriched with Part-of-Speech tags;
· the first broad phonetic transcriptions, word alignments, syntactic annotations, and lexicon link-up will be available;
· annotation CD with documentation, including relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment); this release encompasses the first version of Corex, the exploitation tool.

National Projects/programs: Example of Italy
With contribution from N. Calzolari and A. Zampolli
2 National projects under 2 different “Programs”. The Programs were not specific for HLT, but general: one for industrial R&D and the other for the South of Italy.
Both projects are coordinated by A. Zampolli in Pisa.
Goal: to extend core resources built in EU projects, create new LR, the tools needed to manage the resources, a platform for NLP development, and technology transfer towards SME.

National Projects/programs: Example of Italy
TAL - Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta
(… with 13 partners of private organisations).
Duration: 2 years, finished in 2002.
Partners:
CPR - Consorzio Pisa Ricerche; ITC - Istituto Trentino di Cultura; CSELT - Centro Studi e Laboratori Telecomunicazioni; SYNTHEMA; CVR - Consorzio Venezia Ricerche; CERTIA - Centro per la Ricerca, Sviluppo, Formazione nelle Tecnologie e Applicazioni Informatiche; QUINARY; ALCEO; COMPUTER SHARING; DELCO; GST - Gruppo Soluzioni Tecnologiche; INTERACTIVE MEDIA; NECSY - Network Control Systems

National Projects/programs: Example of Italy
LCRMM – Linguistica computazionale: ricerche monolingui e multilingui (computational linguistics: monolingual and multilingual research)
(cluster "Linguistica", legge 488, with 16 partners of private and public organisations).
• Duration: 3 years, will finish in 2003.
• Partners:
CPR, Pisa; CIRASS, Napoli; THAMUS, Salerno; ILC-CNR, Pisa; SYNTHEMA, Pisa; Istituto Universitario Orientale, Napoli; Dipartimento di Scienze Storiche del Mondo Antico, Università di Pisa; Sportello per la Cooperazione Scientifica e Tecnologica con i Paesi del Mediterraneo (SMED) del CNR, Napoli.

National Projects/programs: Example of Italy
Italy: Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta
• ItalWordNet (~50,000 entries).
• Corpus di italiano parlato (corpus of spoken Italian) --- 100 hours of speech consisting of:
a) 10h Radio-TV broadcast data (news bulletins, interviews, talk shows),
b) 60h Map task like collection
c) 5h Lab data for lexical coverage
d) 10h telephone conversational speech
e) 10h Domain specific (finances, touristic information etc.)
• Annotated dialogues for speech interfaces (H-H and H-M interactions) (Dialoghi annotati per applicazioni di interfacce vocali avanzate): 450 dialogues annotated at all levels (morphological … prosody … semantics …)

National Projects/programs: Example of Italy
Goal: to extend core resources built in EU projects, create new LR, the tools needed to manage the resources, a platform for NLP development, and technology transfer towards SME.
The total cost was about 7 million euro, with funding of almost 5 million euro.
The costs were equally divided between the Spoken & Written areas.
In both projects the consortia agreed to distribute the LR through ELRA (with a special price for Italian users).
Now, after the TIPI conference in Rome, under the sponsorship of the Ministry of Communications, the topic of HLT has been inserted in the Framework Programme for the financing of R&D in Italy.
It was also decided to constitute a Forum for HLT, of which Zampolli is president. The Forum will start working soon, among other things to prepare new national initiatives, maintain LR, write a white paper on HLT in Italy, and coordinate with national activities in other EU countries.

National Projects/programs: Example of France
France: Technolangue Action
With contribution from J. Mariani

Ministère de la Culture et de la Communication
Ministère de la Jeunesse, de l’Education Nationale et de la Recherche
Ministère de l’Economie, des Finances et de l’Industrie
Language Technologies
« TechnoLangue » Action

« TechnoLangue » action
• Report to Prime Minister (November 2000)
• Meeting Min. Industry, Research, Culture: June 2001
• Action: Technology survey and evaluation
• Basic Technological Research
• Articulate with present actions
– Research & Innovation Technological Networks
– 4 ICT RRIT: Telecommunications, Software, Micronanotechnologies, Audiovisual & multimedia
– Ministry of Research action on Technological Survey (VSE)

« TechnoLangue » structure
Infrastructure program to support technological innovation, while existing R&D projects stay with RRIT & VSE (120 M€ / year)
[Diagram: TECHNOLANGUE, RNRT, RNTL, RIAM, VSE]

[Diagram relating Basic Research, Technology Development, and Application Development. Labels: Bottleneck identification; Research results in quantitative evaluation; Technologies necessitated for applications; Technologies which have been validated for applications; Long term / high risk; Large return of investment; Evolutionary; Usability; Acceptability; Meeting points with technology development; Quantitative Evaluation; Usage Evaluation]

« TechnoLangue » action
• Organization
– Executive Committee (EC) chaired by C. Fluhr (CEA)
– Comprising 15 members:
  • 3 RRIT representatives: B. Bachimont (INA - RIAM), C. Sedogbo (Thalès - RNTL), C. Waast (IBM - RNRT)
  • 3 Public research: C. Fluhr (CEA), E. Geoffrois (DGA), P. Paroubek (Limsi-CNRS)
  • 5 Industrials: K. Choukri (ELDA), B. Normier (Lingway), J.-J. Rigoni (Elan Informatique), F. Segond (Xerox) + C. Sorin (FT R&D)
  • 4 Administrations: S. Chaudiron (MR), J. Mariani (MR), D. Malbert (MCC), J. Mathieu (MinEFI)
– Good balance between research & industry - written/spoken

« TechnoLangue » action
• Install a User Committee
– Ministry of Foreign Affairs
  • Automatic translation, multilingualism…
– Ministry of Public Administration
  • Simplification of the administrative language...
– Ministry of National Education
  • Training technologies, language training...
– …

« TechnoLangue » Call
• International cooperation
– Cooperation mechanisms within TechnoLangue
  • foreign entities may participate in the projects
  • financing from their own funds
– Future cooperation among similar national programs
  • EU Countries (Italy, Germany, Norway, Spain, Greece, The Netherlands, Switzerland…)
  • Prepare the construction of the European Research Area (ERA)
    – The EC supports the cost of coordination and generic technologies
    – Each country supports the cost of covering its language(s): specific technology development/adaptation, (annotated) corpus (spoken/written), lexicon (incl. pronunciation), dictionaries...
  • USA, Japan, South Africa…

« TechnoLangue » Call
• 4 meetings of the Executive Committee
• A Call for Proposals with 4 parts
– Part 1: Language resources
– Part 2: Evaluation
– Part 3: Norms & standards
– Part 4: Technological survey
• Calendar:
– Launched April 15, 2002
– Deadline: May 31 / June 10 (Electronic) - June 17 (Paper)
– Results: July 19, 2002

« TechnoLangue » Call
– Language resources
  • Spoken/written data (corpus, dictionaries, terminological data…)
  • Basic Language Processing Tools (Open Source)
  • Production, validation, distribution (incl. legal, economical aspects)
  • For a large use by a large community (education, training…)
– Evaluation
  • Technology (evaluation campaign)
  • Applications (evaluation toolkits)
  • Methodology (metrics / protocols)
– Norms & standards
  • Shared effort to improve French participation
– Technological survey
  • In relationship with on-going actions (Euromap...)

Part 1: Language Resources
• Stimulate the production and the distribution of language resources for:
– answering minimal needs (Basic LAnguage Resource Kit) for the French language;
– promoting resource reusability;
– supporting research;
– helping industrial applications development;
– decreasing the cost of entering the sector for newcomers
• Should include the French language, possibly in connection with other languages

Part 1: Language Resources
• Spoken and written data:
  • oral corpus, pronunciation lexicons, etc.
  • databases for speech synthesis;
  • monolingual and multilingual text corpus (parallel, comparable...);
  • lexicons, terminology, grammars,...
  • Lexical semantic resources: ontologies, thesauri,...
  • Multimodal corpus,... etc.
• Basic software tools:
  • morphosyntactic taggers, syntactic parsers, semantic tools,
  • terminology extractors,
  • language identifiers,
  • corpus annotation tools,
  • lemmatizers,… etc.

Part 1: Language Resources
• Encourage and facilitate the use of those resources
– Putting them in new (young) user hands
– Same approach as for GUIs: “VUIs”
– Language Technology Kits with “User’s guide”
  • Distribution towards specialized education entities (NLP, Document Engineering…) and more broadly towards training centers (Universities, Technical Universities, Engineering schools...)
  • While ensuring feedback from experience
– Open Source software economical model

Part 2: Evaluation
• 3 areas:
– Technology evaluation
– Application evaluation
– Evaluation methodologies

Part 2: Evaluation
• Technology evaluation
– Organization of comparative evaluation campaigns for technologies presently not covered by European or international programs, or with a complementary approach
– Includes the production of the data necessary for the evaluation, in a monolingual, multilingual or crosslingual context
– Scientific and industrial interest of the evaluation should appear (large enough number of participants)
– The projects must define the evaluation methodology and justify the practical organization aspects

Part 2: Evaluation
• Application evaluation
– The objective is to develop evaluation methodologies for industrial or pre-industrial products
– The methodologies may result in “toolboxes”, also regrouping user-oriented methodologies and protocols, or in test software packages
– The methodologies should be generic (class of applications)
– The proposals should demonstrate the economic and industrial interest of the project, and the modalities of the distribution of the “toolboxes”

Part 2: Evaluation
• Evaluation methodologies
– Improve the present evaluation methodologies
– Identify new (quantitative and qualitative) approaches for already evaluated technologies:
  • socio-technical and psycho-cognitive aspects
  • cognitive modeling of evaluation
– Identify protocols for new technologies and applications
  • Virtual Reality, Multimodal interaction, Language on the Internet...

Part 3: Standards
• Support the participation of French actors in normalization and standardization bodies
– Presently weak participation of French actors in normalization and standardization bodies
– Of strategic importance
– Variety of places where the normalization activities are taking place: official or non-official committees, forums, projects,...

Part 3: Standards
• Actions:
– Support the creation of consortia to reinforce the French presence in various bodies (ISO, CEN, W3C,...)
– Help share efforts among French participants
– Identify a topic and ensure a permanent participation in all related bodies: character sets, exchange format, phonetic alphabet transcription, etc.
– Necessity of articulating the project with French bodies already involved: AFNOR, W3C French Chapter,...

Part 4: Survey
• Part 4 - Install an information survey
– Create a portal on Language Engineering in order to give access to:
  • a panorama of the industrial and technological offer
  • the state of the art in science and technology
  • identification of language resources
  • identification of technological bottlenecks
  • a list of Calls for Proposals
  • a presentation of key market figures
  • information on norms and standards (with Internet links)
– Should be linked with existing sites (Euromap,...)

Results
• 52 proposals submitted
– Total proposal costs: 35.9 M€
– Total requested support: 21.7 M€
– Clustering within each of the 4 topics
• 26 projects selected
• 173 participations, 94 participants:
– 33 industry
– 39 public research
– 11 other (Associations, CEA, DGA…)
– 11 foreign (Bell Labs, NII, EPFL, LATL…)
• Budget: 6.2 M€
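The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below reads the decimal commas as decimal points and derives some ratios that the slide itself does not state; the variable names are illustrative. Note that the slide does not say how the 6.2 M€ budget relates to the requested support, so the last ratio simply compares the two figures as printed.

```python
# Figures copied from the "Results" slide (amounts in millions of euro).
proposals_submitted = 52
projects_selected = 26
total_cost_meur = 35.9
requested_support_meur = 21.7
budget_meur = 6.2
participants = {
    "industry": 33,
    "public research": 39,
    "other": 11,
    "foreign": 11,
}

# Derived ratios (not stated on the slide).
selection_rate = projects_selected / proposals_submitted
support_share = requested_support_meur / total_cost_meur
budget_vs_requested = budget_meur / requested_support_meur
total_participants = sum(participants.values())

print(f"selection rate: {selection_rate:.0%}")
print(f"requested support / total cost: {support_share:.1%}")
print(f"budget / requested support: {budget_vs_requested:.1%}")
print(f"participants by category sum to {total_participants}")
```

The per-category participant counts add up to the 94 participants quoted on the slide, which suggests the four categories are meant to be exhaustive.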

Results
• 26 selected projects:
– 8 on Language resources
  • BLARK (Cf BNC), Fr-En, G, Sp, It, Arabic dictionaries
  • Specialized (aerospace, automotive…), proper names dictionaries
  • Aligned corpus (7 novels of 19th century literature in 4 languages)
– 6 on Tools (Open source)
  • Lemmatizer, Chunker, Guesser, Tagger, Parser, Speaker recogn., Topic & NE detector, summarizer, term. extractor, Search engine...
– 3 on Standards (Spoken / Written)
– 1 on Technological survey (Portal)
– 8 on Evaluation: 7 on technology, 1 on usage evaluation

Technology Evaluation
• Written language
– Machine translation
– Text alignment
– Syntactic parsing
– Information query
• Spoken language
– Speech transcription / indexing (incl. Named Entity)
– Speech synthesis
– Spoken dialog

French Techno-Langue Conclusions
• Launch a large national program on Language Technology (TechnoLangue)
• In the perspective of installing a permanent infrastructure for Language Resources, Evaluation, Standards and Survey
• Hope that it can participate in the construction of the European Research Area
• And articulates well with international activities

National Projects/programs: Example of NORWAY
Norway: Norwegian Language Bank
Language technology resources in Norway
Launch conference 24-25 October 2002 (Bergen, Norway):
• The language bank will contain three types of data: spoken data, text, and lexical resources.
• It will be organized as a foundation with state ownership.
• The estimated budget is about NOK 100 million (12 M€).

ENABLER: European National Activities for Basic Language Engineering & Resources -- A new Initiative
• Survey of existing national activities
• Fostering common research and compatibility of LR
• Suggestion for and contribution to international cooperation
• Identification of existing resources (Universal Catalogue)
• The Basics (e.g. standards, tools, evaluation procedures, …)
• Extension foreseen / planned
• Next meeting: Pisa, 1st December 2002

Information Dissemination
• Catalogue
• Web Site (bilingual: English/French). Web: http://www.elda.fr/
• Newsletter (bilingual English/French; issued each quarter)
• ELRA Conference (LREC): International Language Resources & Evaluation Conference (every two years -- next issue: LREC’2002)