
Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation

Inguna Skadiņa, Andrejs Vasiļjevs, Raivis Skadiņš, Robert Gaizauskas, Dan Tufiş and Tatiana Gornostay

Challenge of Data-Driven MT

- Rapid development of data-driven methods for MT
- Automated acquisition of linguistic knowledge extracted from huge parallel corpora provides an effective solution that minimizes time- and resource-consuming manual work
- Applicability of current data-driven methods directly depends on the availability of very large quantities of parallel corpus data
- Translation quality of current data-driven MT systems is very low for under-resourced languages and domains


3rd BUCC, Malta, 22-05-10

Problem of availability of linguistic resources

Relevant for “smaller” or under-resourced languages

Example of Latvian: there are few parallel corpora of reasonable size (e.g., JRC-Acquis, EMEA). SMT trained on these corpora performs well on in-domain documents, but it has unacceptable results for other domains (en-lv: 43.4 BLEU in domain, 10.2 BLEU out of domain)

Solution: comparable corpora are much more widely available than parallel translation data



Accurat project

The Accurat mission is to significantly improve MT quality for under-resourced languages and narrow domains by researching novel approaches to how comparable corpora can compensate for a shortage of linguistic resources

ACCURAT methods will be:

Adjustable to new languages and domains

Language independent where possible

2.5-year project, started on January 1, 2010


Key objectives

To create comparability metrics – to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora

To develop, analyze and evaluate methods for automatic acquisition of comparable corpora from the Web

To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT

To measure improvements from applying acquired data against baseline results from SMT and RBMT systems

To evaluate and validate the ACCURAT project results in practical applications


Use Cases

Adjusting MT to narrow domains: automotive engineering, assistive technology and data processing domains

Application for Web authoring: blog and social networking (Zemanta application)

Using SMT in software localization: increasing efficiency in localization, integration with CAT tools



Language Coverage

Focus on under-resourced languages: Latvian, Lithuanian, Estonian, Greek, Croatian, Romanian and Slovenian

Major translation directions like English-Lithuanian, English-Croatian, German-Romanian

Minor translation directions like Lithuanian-Romanian, Romanian-Greek and Latvian-Lithuanian



Work Plan

WP1: To create comparability metrics – to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora (M3-M24)

WP2: To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT (M3-M23)

WP3: To develop, analyze and evaluate methods for automatic acquisition of a comparable corpus from the Web (M1-M22)

WP4: To measure improvements from applying acquired data against results from baseline SMT and RBMT systems (M7-M26)



Work Plan

WP5: To evaluate and validate the ACCURAT project results in three practical applications (M7-M30)

WP6: To disseminate project results and to transfer the project knowledge, technologies, lessons learned and best practices to interested communities and thus to ensure their worldwide impact and long-term sustainability (M1-M30)

WP7: To coordinate the project and provide administrative and financial management (M1-M30)



Milestones

• Initial comparable corpora (M3)
• Tools for collecting comparable corpora from the Web (M22)
• Multilingual comparable corpora (M22)
• Initial comparability metrics (M6)
• Criteria and metrics of comparability and parallelism (M24)
• Application of existing alignment methods (M6)
• Alignment and extraction methods for comparable corpora (M20)
• Baseline SMT systems (M9)
• Improved MT systems (M26)
• Adjusted MT systems in applications (M30)



Key Results

Comparability metrics developed and tools provided

Comparable corpora for under-resourced languages collected and tools provided

Methods and tools for multi-level alignment from comparable corpora developed

Methods for using comparable corpora in both SMT and RBMT developed

Proven application scenarios prepared

Strong increase in MT quality for under-resourced languages and narrow domains



Initial comparable corpora (ICC)

1 million tokens for each under-resourced language

domain corpus for en-de

Domain              Genre         Percent
International news  Newswires     20%
Sports              Newswires     10%
Admin               Legal         10%
Travel              Advice        10%
Software            Wikipedia     15%
Software            User manuals  15%
Medicine            For doctors   10%
Medicine            For patients  10%


Recommended proportions

parallel – 10%

strongly comparable (heavily edited translations or independent, but closely related texts reporting the same event or describing the same subject) – 40%

weakly comparable (e.g., texts within the same broader domain and genre, but varying in subdomains and specific genres; texts in the same narrow subject domain and genre, but describing different events) – 50%

length of each document should be between 500 and 3000 words
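
As a rough illustration of how a collected corpus could be checked against these recommendations, the sketch below computes the share of words per comparability level and lists documents outside the 500-3000 word range. The `Doc` record, its field names and the sample data are hypothetical, and measuring shares by word count (rather than by number of documents) is an assumption; the slides do not say how the proportions are computed.

```python
from collections import defaultdict
from dataclasses import dataclass

# Recommended shares from the slide, as fractions of the corpus.
RECOMMENDED = {
    "parallel": 0.10,
    "strongly comparable": 0.40,
    "weakly comparable": 0.50,
}
MIN_WORDS, MAX_WORDS = 500, 3000  # recommended document length range


@dataclass
class Doc:
    """Hypothetical document record; the field names are illustrative only."""
    doc_id: str
    level: str  # comparability level label
    words: int  # document length in words


def level_shares(docs):
    """Share of total words per comparability level (assumption: word-based)."""
    totals = defaultdict(int)
    for d in docs:
        totals[d.level] += d.words
    grand_total = sum(totals.values())
    return {level: count / grand_total for level, count in totals.items()}


def out_of_range(docs):
    """IDs of documents shorter than MIN_WORDS or longer than MAX_WORDS."""
    return [d.doc_id for d in docs if not MIN_WORDS <= d.words <= MAX_WORDS]


# Invented sample data, only to exercise the functions.
docs = [
    Doc("a", "parallel", 1000),
    Doc("b", "strongly comparable", 3000),
    Doc("c", "weakly comparable", 2500),
    Doc("d", "weakly comparable", 3500),  # longer than recommended
]
shares = level_shares(docs)
for level, target in RECOMMENDED.items():
    print(f"{level}: {shares.get(level, 0.0):.0%} collected, {target:.0%} recommended")
print("outside length range:", out_of_range(docs))
```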


Initial comparable corpora: results


Domain              Genre         Planned  Collected
International news  Newswires     20%      14.73%
Sports              Newswires     10%      8.23%
Admin               Legal         10%      11%
Travel              Advice        10%      14.46%
Software            Wikipedia     15%      5.83%
Software            User manuals  15%      22.11%
Medicine            For doctors   10%      12.35%
Medicine            For patients  10%      11.30%


Initial comparable corpora: results

                     ET-EN  LV-EN  LT-EN  EL-EN  RO-EL  HR-EN  RO-EN  RO-DE  SL-EN
parallel              9.48  11.82  46.17  13.33  32.62  39.51   6.94   8.52  40.17
strongly comparable  51.06  37.51  21.83  20.47  30.96   9.44  17.07  32.67  27.98
weakly comparable    39.46  50.67  32.00  66.20  36.42  51.05  76.00  58.81  31.85


Metadata: language, domain, genre, source, number of words, IPR status, comparability level

Parallel and strongly comparable texts are also aligned at the document level



CES (Corpus Encoding Standards)


Extension to CES-CCES


CES Alignment–Extension to CCES Alignment


Criteria of Comparability and Parallelism

Lack of definite methods to determine the criteria of comparability

Some attempts to measure the degree of comparability according to distribution of topics and publication dates of documents in comparable corpora to estimate the global comparability of the corpora (Saralegi et al., 2008)

Some attempts to determine different kinds of document parallelism in comparable corpora, such as complete parallelism, noisy parallelism and complete non-parallelism

Some attempts to define criteria of parallelism of similar documents in comparable corpora, such as similar number of sentences, sharing sufficiently many links (up to 30%), and monotony of links (up to 90% of links do not cross each other) (Munteanu, 2006)
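
To make the link-based criteria more concrete, here is a minimal sketch of how the share of non-crossing links and the share of linked sentences might be computed for a document pair. It assumes links are given as `(source_index, target_index)` pairs; that representation, the function names and the sample links are hypothetical and are not taken from (Munteanu, 2006).

```python
from itertools import combinations


def crossing(link_a, link_b):
    """Two links cross when their source order and target order disagree."""
    (s1, t1), (s2, t2) = link_a, link_b
    return (s1 - s2) * (t1 - t2) < 0


def monotone_fraction(links):
    """Fraction of links that do not cross any other link."""
    if not links:
        return 0.0
    crossed = set()
    for a, b in combinations(range(len(links)), 2):
        if crossing(links[a], links[b]):
            crossed.update((a, b))
    return 1.0 - len(crossed) / len(links)


def linked_fraction(links, n_source_sentences):
    """Fraction of source sentences that take part in at least one link."""
    return len({s for s, _ in links}) / n_source_sentences


# Invented example: five links, two of which cross each other.
links = [(0, 0), (1, 1), (2, 3), (3, 2), (4, 4)]
print("non-crossing links:", monotone_fraction(links))
print("linked source sentences:", linked_fraction(links, n_source_sentences=10))
```

The thresholds quoted above (30% and 90%) would then be applied to these two values; they are left out of the code because the slide does not spell out exactly how they are used.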



Criteria of Comparability and Parallelism

To investigate criteria for comparability between corpora concentrating on different sets of features:

Lexical features: measuring the degree of 'lexical overlap' between frequency lists derived from corpora

Lexical sequence features: computing N-gram distances in terms of tokens

Morpho-syntactic features: computing N-gram distances in terms of Part-of-Speech codes
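
The three feature families can be illustrated with a small sketch. It assumes both inputs are already in the same language (or share a Part-of-Speech tagset), uses Jaccard overlap of the top-k frequency lists as the lexical overlap, and uses an L1 distance between N-gram relative-frequency profiles as the N-gram distance. These particular measures and all function names are assumptions for illustration; the slides do not say which measures the project uses.

```python
from collections import Counter


def top_words(tokens, k):
    """The k most frequent tokens of a corpus, as a set."""
    return {w for w, _ in Counter(tokens).most_common(k)}


def lexical_overlap(tokens_a, tokens_b, k=500):
    """Jaccard overlap between the two top-k frequency lists."""
    a, b = top_words(tokens_a, k), top_words(tokens_b, k)
    return len(a & b) / len(a | b) if a | b else 0.0


def ngram_profile(symbols, n):
    """Relative frequencies of n-grams over tokens or Part-of-Speech codes."""
    grams = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}


def ngram_distance(symbols_a, symbols_b, n=2):
    """L1 distance between n-gram profiles: 0.0 if identical, 2.0 if disjoint."""
    pa, pb = ngram_profile(symbols_a, n), ngram_profile(symbols_b, n)
    return sum(abs(pa.get(g, 0.0) - pb.get(g, 0.0)) for g in pa.keys() | pb.keys())


# Invented toy data: tokens for the lexical features, POS codes for the
# morpho-syntactic ones.
tokens_a = "the cat sat on the mat".split()
tokens_b = "the dog sat on the rug".split()
pos_a = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
pos_b = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]

print("lexical overlap:", lexical_overlap(tokens_a, tokens_b, k=10))
print("token bigram distance:", ngram_distance(tokens_a, tokens_b, n=2))
print("POS bigram distance:", ngram_distance(pos_a, pos_b, n=2))
```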



First experiment

Comparability of corpora is measured in terms of lexical features (Greek-English and German-English language pairs)

The set-up is similar to (Kilgarriff, 2001):

For each corpus, take the top 500 most frequent words

Relative frequency is used (the absolute frequency, or the word count, divided by the length of the corpus)

Dictionaries are generated automatically by Giza++ from the parallel Europarl corpus

We compare corpora pairwise using a standard Chi-Square distance measure:

ChiSquare = ∑ {w1... w500}((FrqObserved - FrqExpected) ^ 2) / FrqObserved



First experiment

Asymmetric method: relative frequencies in the corpus in language A are treated as “expected” values, and those mapped from the corpus in language B as “observed”. Then we swap corpora A and B and repeat the calculation. Asymmetry comes from words which are missing in one of the lists as compared to the other. Missing words have different relative frequencies that are added to the score, so the distance from A to B can differ from the distance from B to A. We use the minimum of these distances as the final score for the pair of corpora.
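
The experiment can be sketched end to end as follows. The code follows the formula as written on the slide, which divides by the observed frequency (the textbook Pearson statistic divides by the expected one). Two points are assumptions rather than details from the slides: a word missing from one list simply contributes its relative frequency from the other list, and the Giza++ dictionaries are stood in for by plain one-to-one Python dicts. All function names and the toy data are hypothetical.

```python
from collections import Counter


def top_relative_freqs(tokens, k=500):
    """Relative frequency (count / corpus length) of the k most frequent words."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.most_common(k)}


def map_freqs(freqs, dictionary):
    """Map a frequency list into the other language with a word-to-word dictionary.

    Words without a dictionary entry are dropped; frequencies of words that
    share a translation are summed.
    """
    mapped = Counter()
    for word, freq in freqs.items():
        if word in dictionary:
            mapped[dictionary[word]] += freq
    return dict(mapped)


def chi_square(expected, observed):
    """Directed distance using the slide's formula, (O - E)^2 / O per word.

    Assumption: a word present in only one list adds its relative frequency.
    """
    score = 0.0
    for word in expected.keys() | observed.keys():
        e, o = expected.get(word, 0.0), observed.get(word, 0.0)
        if e > 0 and o > 0:
            score += (o - e) ** 2 / o
        else:
            score += e + o  # exactly one of the two is non-zero here
    return score


def corpus_distance(tokens_a, tokens_b, dict_a_to_b, dict_b_to_a, k=500):
    """Compute both directed distances and keep the minimum."""
    fa, fb = top_relative_freqs(tokens_a, k), top_relative_freqs(tokens_b, k)
    a_expected = chi_square(expected=fa, observed=map_freqs(fb, dict_b_to_a))
    b_expected = chi_square(expected=fb, observed=map_freqs(fa, dict_a_to_b))
    return min(a_expected, b_expected)


# Invented toy corpora and dictionaries (English and German).
tokens_en = "the house is big the house is old".split()
tokens_de = "das haus ist gross das haus ist alt".split()
de_to_en = {"das": "the", "haus": "house", "ist": "is", "gross": "big", "alt": "old"}
en_to_de = {en: de for de, en in de_to_en.items()}

print(corpus_distance(tokens_en, tokens_de, en_to_de, de_to_en))
```

A lower score means the two frequency lists are closer; identical lists give 0.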



Features

To extract the features which may be used to identify the comparability between documents

Language Independent

• Document length
• Date
• Character overlap
• Web features
  - URL of doc source
  - Common links
  - Links referring to each other
  - Image links
• Other features …

Language Dependent (requires translation)

• Lexical overlap
• Web features
  - Anchor text
  - Image alt tag
• Genre (?)
• Domain (?)
• Other features …
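
A few of the language-independent features could be computed along the following lines. This is only a sketch: the exact definitions (a shorter-to-longer length ratio, a Dice coefficient over character counts, Jaccard overlap of outgoing links) and the dict keys `text`, `date` and `links` are assumptions chosen for illustration, not the project's feature definitions.

```python
from collections import Counter
from datetime import date


def length_ratio(text_a, text_b):
    """Shorter-to-longer ratio of document lengths in characters (1.0 = equal)."""
    la, lb = len(text_a), len(text_b)
    return min(la, lb) / max(la, lb) if max(la, lb) else 1.0


def date_gap_days(date_a, date_b):
    """Absolute difference between two publication dates, in days."""
    return abs((date_a - date_b).days)


def char_overlap(text_a, text_b):
    """Dice coefficient over character counts (case-insensitive)."""
    ca, cb = Counter(text_a.lower()), Counter(text_b.lower())
    shared = sum((ca & cb).values())
    total = sum(ca.values()) + sum(cb.values())
    return 2 * shared / total if total else 0.0


def common_link_ratio(links_a, links_b):
    """Jaccard overlap of the outgoing link sets of two web documents."""
    a, b = set(links_a), set(links_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def feature_vector(doc_a, doc_b):
    """Feature list for one document pair; documents are hypothetical dicts."""
    return [
        length_ratio(doc_a["text"], doc_b["text"]),
        date_gap_days(doc_a["date"], doc_b["date"]),
        char_overlap(doc_a["text"], doc_b["text"]),
        common_link_ratio(doc_a["links"], doc_b["links"]),
    ]


# Invented sample documents.
doc_a = {"text": "abcd", "date": date(2010, 5, 22), "links": ["x", "y"]}
doc_b = {"text": "abcdabcd", "date": date(2010, 5, 20), "links": ["y", "z"]}
print(feature_vector(doc_a, doc_b))
```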


General Idea

[Figure: document pairs from the Initial Comparable Corpora (e.g., EN and EL), each labelled with a comparability level (parallel, strongly comparable, weakly comparable, not comparable), go through features extraction (f1, f2, f3, … fn) to train a classifier; the classifier then assigns a predicted comparability level to new documents.]
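
The train-then-predict idea can be sketched with a very small classifier. The slides do not name the classifier used, so the nearest-centroid rule below is a stand-in chosen only because it fits in a few lines of standard-library Python; the two-dimensional feature vectors and their values are invented.

```python
from collections import defaultdict
from math import dist


def train_centroids(samples):
    """samples: list of (feature_vector, comparability_level) pairs.

    Returns a mapping from each level to the mean of its feature vectors.
    """
    grouped = defaultdict(list)
    for features, level in samples:
        grouped[level].append(features)
    return {
        level: [sum(column) / len(vectors) for column in zip(*vectors)]
        for level, vectors in grouped.items()
    }


def predict(centroids, features):
    """Return the level whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda level: dist(centroids[level], features))


# Invented training data: (f1, f2) vectors for labelled document pairs.
training = [
    ([0.95, 0.90], "parallel"),
    ([0.90, 0.85], "parallel"),
    ([0.70, 0.60], "strongly comparable"),
    ([0.65, 0.55], "strongly comparable"),
    ([0.40, 0.30], "weakly comparable"),
    ([0.35, 0.25], "weakly comparable"),
    ([0.10, 0.05], "not comparable"),
    ([0.05, 0.10], "not comparable"),
]
centroids = train_centroids(training)
print(predict(centroids, [0.68, 0.58]))
```

In practice the feature vectors would come from the feature extraction step, and any standard classifier could take the place of the centroid rule.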


Metrics of Comparability and Parallelism

Using defined criteria for parallelism, we would like to develop formal automated metrics for determining the degree of comparability

Lack of comparability metrics to evaluate corpus usability for different tasks, such as machine translation, information extraction, cross-language information retrieval

Recent studies (Kilgarriff, 2001; Rayson and Garside, 2000) have added a quantitative dimension to the issue of comparability by studying objective measures for detecting how similar (or different) two corpora are in terms of their lexical content

Further studies (Sharoff, 2007) investigated automatic ways for assessing the composition of web corpora in terms of domains and genres



Sentence Alignment on Parallel Texts

Danielsson, Pernilla and Ridings, Daniel. Practical presentation of a vanilla aligner. Goteborgs universitet, 1997

Melamed, Dan. A Geometric Approach to Mapping Bitext Correspondence. University of Pennsylvania, 1996

Chen, Stanley F. Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st annual meeting on Association for Computational Linguistics (Columbus, Ohio 1993), Association for Computational Linguistics Morristown, NJ, USA, 9-16

State of the Art: Moore, Robert C. Fast and Accurate Sentence Alignment of Bilingual Corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (Tiburon, California 2002), Springer-Verlag, Heidelberg, 135-144:

provisional alignment based on sentence lengths

IBM Model 1 – estimate the Translation Equivalents (TE) table

generate one-to-one links based on sentence lengths and the TE table



Our Sentence Alignment on Parallel Texts

Reification – a link in the alignment is treated as a context independent structured object.

Using SVM (libsvm solution). Features:

translation equivalence

word length correlation (Pearson)

special characters occurrence similarity

word frequency ranks correlation

Crossed links are allowed



Scenarios for Aligning Comparable Corpora

Based on previous experience, the literature and current constraints (time, manpower, computational resources), we envisaged three possible ways of tackling the alignment of comparable corpora in order to get useful results:

QA techniques

Clustering

Windowing



Accurat partners

Tilde (Coordinator), Latvia

University of Sheffield, UK

University of Leeds, UK

Athena Research and Innovation Center in Information Communication and Knowledge Technologies (ILSP), Greece

University of Zagreb, Faculty of Humanities and Social Sciences, Croatia

DFKI, Germany

Institute of Artificial Intelligence, Romania

Linguatec, Germany

Zemanta, Slovenia



The ACCURAT project has received funding from the EU 7th Framework Programme for Research and Technological Development under Grant Agreement N° 248347

Project duration: January 2010 – June 2012

Contact information: Andrejs Vasiljevs, andrejs tilde.lv

Tilde, Vienibas gatve 75a, Riga, LV1004, Latvia

www.accurat-project.eu
