
Introduction

A field survey of Dutch language resources has been carried out within the framework of a project launched by the Dutch Language Union (Nederlandse Taalunie) with the aim of strengthening the position of Dutch in Human Language Technologies (HLT).

This field survey was done in three stages.

1: BLARK
The Basic Language Resource Kit (BLARK) is a wish list for an ideal HLT field.

2: Inventory of available HLT resources

3: Priority list
The priority list indicates which materials need to be developed to complete the BLARK. It was drawn up by comparing the inventory with the definition of the BLARK.

The method described can be adopted for languages other than Dutch.

Conclusions

• The current HLT infrastructure is scattered, incomplete, and not sufficiently accessible.

• The available modules and applications are often poorly documented.

• There is a great need for objective and methodologically sound comparisons and benchmarking of the materials.

• The components that constitute the BLARK should be available at low cost or free of charge.

Recommendations

• Establish an HLT agency to collect, document and maintain existing parts of the BLARK.

• Complete the BLARK by encouraging funding bodies to finance the development of the prioritized resources.

• Make the BLARK available to academia and HLT industry under the conditions of open source development.

• Develop benchmarks, test corpora, and a methodology for objective comparison, evaluation and validation of parts of the BLARK.

• Promote HLT education.

• Ensure that enough funding is assigned to fundamental research.

A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch

D. Binnenpoorte, F. de Vriend, J. Sturm, W. Daelemans, H. Strik, C. Cucchiarini

Figure 1 (“+” = important, “++” = very important)

Based on the full matrix, a BLARK was defined.

BLARK for speech technology:

Modules:
• Automatic speech recognition
• Speech synthesis
• Tools for calculating confidence measures
• Tools for identification
• Tools for (semi-)automatic annotation of speech corpora

Data:
• Speech corpora for specific applications
• Multi-modal speech corpora
• Multi-media speech corpora
• Multi-lingual speech corpora
• Benchmarks for evaluation

BLARK for language technology:

Modules:
• Robust modular text pre-processing
• Morphological analysis and morpho-syntactic disambiguation
• Syntactic analysis
• Semantic analysis

Data:
• Monolingual lexicon
• Annotated corpus of text (a treebank)
• Benchmarks for evaluation

In defining the BLARK a distinction was made between:
• Applications
• Modules
• Data

A matrix (fragment in Figure 1) was drawn up describing
• which modules are required for which applications;
• which data are required for which modules;
• what the relative importance is of the modules and data.
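As a rough illustration of how such an importance matrix could be handled in code, the sketch below stores one row per module and one column per application, using the “+” / “++” marks from Figure 1. The cell values, the choice of applications, and the `relevance` helper are hypothetical and are not taken from the actual matrix.

```python
# Illustrative sketch only: the cell values below are invented,
# not copied from the survey's Figure 1.
IMPORTANCE = {"": 0, "+": 1, "++": 2}  # "" = module not required

# Rows are modules, columns are applications (hypothetical subset).
module_by_application = {
    "text pre-processing": {"dictation": "+", "machine translation": "++", "dialogue system": "++"},
    "syntactic analysis": {"dictation": "", "machine translation": "++", "dialogue system": "+"},
    "speech synthesis": {"dictation": "", "machine translation": "", "dialogue system": "++"},
}


def relevance(row):
    """Return (number of applications that need the module, summed importance)."""
    marks = [mark for mark in row.values() if mark]
    return len(marks), sum(IMPORTANCE[mark] for mark in marks)


# Rank modules: first by how many applications need them, then by total importance.
ranked = sorted(
    module_by_application,
    key=lambda name: relevance(module_by_application[name]),
    reverse=True,
)

for name in ranked:
    print(name, relevance(module_by_application[name]))
```

The same structure would work for the data-by-module part of the matrix; only the row and column labels change.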

1: BLARK

A second matrix (fragment in Figure 2) describes the availability of the components in the BLARK.

Figure 2 (1 = ‘module or data set is unavailable’ to 10 = ‘module or data set is easily obtainable’).

An inventory was made to establish which of the components - modules and data - that make up the BLARK are:

• available, i.e. can be bought or are freely obtainable, e.g. as open source;

• (re-)usable.

Inventory based on:

• expert knowledge;

• information found on the internet and in the literature;

• personal communication with actors in the field.

Components can only be considered usable if they are of sufficient quality, which calls for a quality evaluation.

This evaluation was limited to a descriptive level: modules and data were checked against a list of evaluation criteria.

2: Inventory of available resources

Speech technology:

1. Automatic speech recognition (including non-native ASR, robust ASR, adaptation, and prosody recognition)

2. Speech corpora for specific applications (e.g. directory assistance, CALL)

3. Multi-media speech corpora (speech corpora that also contain information from other media such as newspapers, WWW, etc.).

4. Tools for (semi-)automatic transcription of speech data

5. Speech synthesis (including tools for unit selection)

6. Benchmarks for evaluation

Language technology:

1. Annotated corpus of written Dutch

2. Syntactic analysis: robust recognition of sentence structure

3. Robust text pre-processing: tokenization and named entity recognition

4. Semantic annotations for the treebank mentioned above

5. Translation equivalents

6. Benchmarks for evaluation

Requirements for prioritization:

• the components should be relevant for a large number of applications;

• the components should currently be either unavailable, inaccessible, or have insufficient quality;

• developing the components should be feasible in the short term.
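The three requirements can be read as a simple filter over the BLARK components. The sketch below is one possible way to express that, not the procedure actually used in the survey: the `Component` fields, the thresholds, and the example scores are all assumptions, and the second requirement (unavailable, inaccessible, or of insufficient quality) is collapsed into a single availability score on the 1 to 10 scale of Figure 2.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    n_applications: int        # number of applications that need the component
    availability: int          # 1 = unavailable ... 10 = easily obtainable (Figure 2 scale)
    feasible_short_term: bool  # expert judgement, assumed to be given


def is_priority(c, min_applications=3, max_availability=5):
    """Apply the three requirements; both thresholds are hypothetical."""
    return (
        c.n_applications >= min_applications   # relevant for many applications
        and c.availability <= max_availability  # currently hard to obtain
        and c.feasible_short_term               # can be developed in the short term
    )


# Invented example scores, for illustration only.
components = [
    Component("treebank", n_applications=5, availability=2, feasible_short_term=True),
    Component("monolingual lexicon", n_applications=6, availability=8, feasible_short_term=True),
    Component("semantic analysis", n_applications=4, availability=3, feasible_short_term=False),
]

priority_list = [c.name for c in components if is_priority(c)]
print(priority_list)
```

In this made-up example only the treebank passes all three checks: the lexicon is already easy to obtain and semantic analysis is not judged feasible in the short term.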

3: Priority list

Comparison

Feedback

Feedback from the HLT field (academia and industry) was collected at a workshop with about 100 participants.