Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation
Inguna Skadiņa, Andrejs Vasiļjevs, Raivis Skadiņš, Robert Gaizauskas, Dan Tufiş and Tatiana Gornostay
Challenge of Data Driven MT
- Rapid development of data-driven methods for MT
- Automated acquisition of linguistic knowledge extracted from huge parallel corpora provides an effective solution that minimizes time- and resource-consuming manual work
- Applicability of current data-driven methods directly depends on the availability of very large quantities of parallel corpus data
- Translation quality of current data-driven MT systems is very low for under-resourced languages and domains
3rd BUCC, Malta, 22-05-10
Problem of availability of linguistic resources
- Relevant for “smaller” or under-resourced languages
- Example of Latvian: few parallel corpora of reasonable size (e.g., JRC-Acquis, EMEA)
- SMT trained on these corpora performs well on in-domain documents, but gives unacceptable results for other domains (en-lv: 43.4 BLEU in domain, 10.2 BLEU out of domain)
- Solution: comparable corpora are much more widely available than parallel translation data
Accurat project
- The Accurat mission is to significantly improve MT quality for under-resourced languages and narrow domains by researching novel approaches to how comparable corpora can compensate for a shortage of linguistic resources
- ACCURAT methods will be:
  - Adjustable to new languages and domains
  - Language independent where possible
- 2.5-year project, started on January 1, 2010
Key objectives
- To create comparability metrics: to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora
- To develop, analyze and evaluate methods for automatic acquisition of comparable corpora from the Web
- To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT
- To measure improvements from applying acquired data against baseline results from SMT and RBMT systems
- To evaluate and validate the ACCURAT project results in practical applications
Use Cases
- Adjusting MT to narrow domain: automotive engineering, assistive technology and data processing domains
- Application for Web authoring: blog and social networking (Zemanta application)
- Using SMT in software localization: increasing efficiency in localization, integration with CAT tools
Language Coverage
- Focus on under-resourced languages: Latvian, Lithuanian, Estonian, Greek, Croatian, Romanian and Slovenian
- Major translation directions like English-Lithuanian, English-Croatian, German-Romanian
- Minor translation directions like Lithuanian-Romanian, Romanian-Greek and Latvian-Lithuanian
Work Plan
- WP1: To create comparability metrics, i.e. to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora (M3-M24)
- WP2: To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT (M3-M23)
- WP3: To develop, analyze and evaluate methods for automatic acquisition of a comparable corpus from the Web (M1-M22)
- WP4: To measure improvements from applying acquired data against results from baseline SMT and RBMT systems (M7-M26)
Work Plan
- WP5: To evaluate and validate the ACCURAT project results in three practical applications (M7-M30)
- WP6: To disseminate project results and to transfer the project knowledge, technologies, lessons learned and best practices to interested communities and thus to ensure their worldwide impact and long-term sustainability (M1-M30)
- WP7: To coordinate the project and provide administrative and financial management (M1-M30)
Milestones
- Tools for collecting comparable corpora from the Web (M22)
- Multilingual comparable corpora (M22)
- Initial comparable corpora (M3)
- Criteria and metrics of comparability and parallelism (M24)
- Initial comparability metrics (M6)
- Alignment and extraction methods for comparable corpora (M20)
- Application of existing alignment methods (M6)
- Improved MT systems (M26)
- Adjusted MT systems in applications (M30)
- Baseline SMT systems (M9)
Key Results
- Comparability metrics developed and tools provided
- Comparable corpora for under-resourced languages collected and tools provided
- Methods and tools for multi-level alignment from comparable corpora developed
- Methods for using comparable corpora in both SMT and RBMT developed
- Proven application scenarios prepared
- Strong increase in MT quality for under-resourced languages and narrow domains
Initial comparable corpora (ICC)
- 1 million tokens for each under-resourced language
- domain corpus for en-de

Domain | Genre | Percent
International news | Newswires | 20%
Sports | Newswires | 10%
Admin | Legal | 10%
Travel | Advice | 10%
Software | Wikipedia | 15%
Software | User manuals | 15%
Medicine | For doctors | 10%
Medicine | For patients | 10%
Recommended proportions
- parallel: 10%
- strongly comparable (heavily edited translations or independent, but closely related texts reporting the same event or describing the same subject): 40%
- weakly comparable (e.g., texts within the same broader domain and genre, but varying in subdomains and specific genres; texts in the same narrow subject domain and genre, but describing different events): 50%
- length of each document should be between 500 and 3000 words
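The recommended proportions and the document length limits can be checked mechanically once each document carries a word count and a comparability level. The following is a minimal sketch, not the project's tooling. It assumes shares are measured by word count (the slide does not say whether shares are by words or by documents), and the names `RECOMMENDED`, `length_ok`, `composition` and `deviations` as well as the toy documents are hypothetical.

```python
# Recommended shares per comparability level, as listed on the slide.
RECOMMENDED = {
    "parallel": 0.10,
    "strongly comparable": 0.40,
    "weakly comparable": 0.50,
}
MIN_WORDS, MAX_WORDS = 500, 3000  # recommended document length limits


def length_ok(word_count):
    """True if a document length lies within the recommended limits."""
    return MIN_WORDS <= word_count <= MAX_WORDS


def composition(documents):
    """Share of each comparability level, measured by word count (assumption).

    `documents` is a list of (word_count, level) pairs.
    """
    total = sum(words for words, _ in documents)
    shares = {level: 0.0 for level in RECOMMENDED}
    for words, level in documents:
        shares[level] += words / total
    return shares


def deviations(shares):
    """Difference between the actual and the recommended share per level."""
    return {level: shares[level] - RECOMMENDED[level] for level in RECOMMENDED}


# Toy, made-up documents for illustration only.
toy_documents = [
    (1000, "parallel"),
    (2000, "strongly comparable"),
    (2000, "strongly comparable"),
    (2500, "weakly comparable"),
    (2500, "weakly comparable"),
]
toy_shares = composition(toy_documents)
print(toy_shares)
print(deviations(toy_shares))
```

Measuring by word count keeps a few very long documents from being hidden behind many short ones. Counting documents instead would only require replacing `words / total` with `1 / len(documents)`.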
Initial comparable corpora: results

Domain | Genre | Planned | Collected
International news | Newswires | 20% | 14.73%
Sports | Newswires | 10% | 8.23%
Admin | Legal | 10% | 11%
Travel | Advice | 10% | 14.46%
Software | Wikipedia | 15% | 5.83%
Software | User manuals | 15% | 22.11%
Medicine | For doctors | 10% | 12.35%
Medicine | For patients | 10% | 11.30%
Initial comparable corpora: results

Comparability level (%) | ET-EN | LV-EN | LT-EN | EL-EN | RO-EL | HR-EN | RO-EN | RO-DE | SL-EN
parallel | 9.48 | 11.82 | 46.17 | 13.33 | 32.62 | 39.51 | 6.94 | 8.52 | 40.17
strongly comparable | 51.06 | 37.51 | 21.83 | 20.47 | 30.96 | 9.44 | 17.07 | 32.67 | 27.98
weakly comparable | 39.46 | 50.67 | 32.00 | 66.20 | 36.42 | 51.05 | 76.00 | 58.81 | 31.85
Metadata
- Language
- Domain
- Genre
- Source
- Number of words
- IPR status
- Comparability level

Parallel and strongly comparable texts are also aligned at the document level
CES (Corpus Encoding Standards)
Extension to CES-CCES
CES Alignment–Extension to CCES Alignment
Criteria of Comparability and Parallelism
- Lack of definite methods to determine the criteria of comparability
- Some attempts to measure the degree of comparability according to distribution of topics and publication dates of documents in comparable corpora to estimate the global comparability of the corpora (Saralegi et al., 2008)
- Some attempts to determine different kinds of document parallelism in comparable corpora, such as complete parallelism, noisy parallelism and complete non-parallelism
- Some attempts to define criteria of parallelism of similar documents in comparable corpora, such as similar number of sentences, sharing sufficiently many links (up to 30%), and monotony of links (up to 90% of links do not cross each other) (Munteanu, 2006)
Criteria of Comparability and Parallelism
To investigate criteria for comparability between corpora concentrating on different sets of features:
- Lexical features: measuring the degree of 'lexical overlap' between frequency lists derived from corpora
- Lexical sequence features: computing N-gram distances in terms of tokens
- Morpho-syntactic features: computing N-gram distances in terms of Part-of-Speech codes
First experiment
- Comparability of corpora is measured in terms of lexical features (Greek-English and German-English language pairs)
- The set-up is similar to (Kilgarriff, 2001):
  - For each corpus take the top 500 most frequent words
  - Relative frequency is used (the absolute frequency, or the word count, divided by the length of the corpus)
  - Automatically generated dictionaries by Giza++ from the parallel Europarl corpus
- We compare corpora pairwise using a standard Chi-Square distance measure:
  ChiSquare = ∑ {w1 ... w500} ((FrqObserved - FrqExpected) ^ 2) / FrqObserved
First experiment
Asymmetric method: relative frequencies in the corpus in language A are treated as “expected” values, and those mapped from the corpus in language B as “observed”. Then we swap corpora A and B and repeat the calculation. Asymmetry comes from words which are missing in one of the lists as compared to the other. Missing words have different relative frequencies that are added to the score, so the distance from A to B can differ from the distance from B to A. We use the minimum of these distances as the final score for the pair of corpora.
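The asymmetric comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation. One deliberate difference from the slide: the sketch divides by the expected frequency, as in the textbook Pearson chi-square, while the slide's formula shows FrqObserved in the denominator. With the expected frequency in the denominator, a word missing from the other list contributes exactly its own relative frequency, which matches the description above. The function names, the toy corpora and the toy dictionary are hypothetical. Dropping words that have no dictionary entry is also an assumption.

```python
from collections import Counter


def relative_frequencies(tokens, top_n=500):
    """Relative frequency (count / corpus length) of the top_n most frequent words."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).most_common(top_n)}


def map_words(freqs, dictionary):
    """Map words through a bilingual dictionary; words without an entry are dropped."""
    mapped = {}
    for word, freq in freqs.items():
        if word in dictionary:
            target = dictionary[word]
            mapped[target] = mapped.get(target, 0.0) + freq
    return mapped


def directed_chi_square(expected, observed):
    """Sum of (observed - expected)^2 / expected over the words of the expected list.

    A word missing from `observed` has observed frequency 0 and therefore
    contributes exactly its expected relative frequency to the score.
    """
    return sum((observed.get(w, 0.0) - e) ** 2 / e for w, e in expected.items())


def corpus_distance(freqs_a, freqs_b_mapped):
    """Minimum of the two directed distances, used as the final score."""
    a_to_b = directed_chi_square(freqs_a, freqs_b_mapped)
    b_to_a = directed_chi_square(freqs_b_mapped, freqs_a)
    return min(a_to_b, b_to_a)


# Toy, made-up data for illustration only.
corpus_a = "the cat sat on the mat".split()
corpus_b = "die katze sass auf der matte".split()
toy_dictionary = {"die": "the", "der": "the", "katze": "cat", "matte": "mat"}

freqs_a = relative_frequencies(corpus_a)
freqs_b = map_words(relative_frequencies(corpus_b), toy_dictionary)
print(directed_chi_square(freqs_a, freqs_b), directed_chi_square(freqs_b, freqs_a))
print(corpus_distance(freqs_a, freqs_b))
```

In the toy data, "sat" and "on" have no counterpart in the mapped list for corpus B, so only the A-to-B direction is penalised. This shows why the two directed distances can differ.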
Features
To extract the features which may be used to identify the comparability between documents

Language independent:
- Document length
- Date
- Character overlap
- Web features
  - URL of doc source
  - Common links
  - Links referring to each other
  - Image links
- Other features …

Language dependent (requires translation):
- Lexical overlap
- Web features
  - Anchor text
  - Image alt tag
- Genre (?)
- Domain (?)
- Other features …
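A few of the language-independent features could be computed as in the sketch below. The slide names the features but does not define them, so every definition here is an assumption: document length is compared as a shorter-to-longer ratio of whitespace token counts, character overlap as Jaccard similarity of character trigrams, and common links as Jaccard similarity of the sets of outgoing links. All function names, the toy texts and the example.org URLs are hypothetical.

```python
def length_ratio(doc_a, doc_b):
    """Ratio of the shorter to the longer document, in whitespace tokens."""
    len_a, len_b = len(doc_a.split()), len(doc_b.split())
    longest = max(len_a, len_b)
    return min(len_a, len_b) / longest if longest else 0.0


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def char_ngrams(text, n=3):
    """Set of character n-grams of the lower-cased text with whitespace removed."""
    squeezed = "".join(text.lower().split())
    return {squeezed[i:i + n] for i in range(len(squeezed) - n + 1)}


def language_independent_features(doc_a, doc_b, links_a, links_b):
    """Feature dictionary for one document pair; no translation is needed."""
    return {
        "length_ratio": length_ratio(doc_a, doc_b),
        "char_overlap": jaccard(char_ngrams(doc_a), char_ngrams(doc_b)),
        "common_links": jaccard(set(links_a), set(links_b)),
    }


# Toy, made-up document pair for illustration only.
features = language_independent_features(
    "ACCURAT project started in 2010",
    "Projekt ACCURAT begann 2010",
    ["http://example.org/a", "http://example.org/b"],
    ["http://example.org/b"],
)
print(features)
```

Character overlap defined this way mostly picks up names, numbers and cognates, so it stays usable across scripts (e.g. Greek and English) only to the extent that such strings are shared.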
General Idea
[Diagram: document pairs (EN, EL) from the Initial Comparable Corpora, labelled with a comparability level (parallel, strongly comparable, weakly comparable, not comparable), pass through features extraction (f1, f2, f3, … fn) to train a classifier; for new documents the classifier outputs a predicted comparability level.]
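The general idea (labelled document pairs, feature vectors, a classifier, a predicted level for new pairs) can be illustrated with a very small example. The slide does not say which classifier is used, so the nearest-centroid classifier below is a stand-in chosen only because it fits in a few lines. The two features (a length ratio and a lexical overlap score) and all training values are made up for illustration, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(examples):
    """Compute the mean feature vector per comparability level.

    `examples` is a list of (feature_vector, level) pairs.
    """
    sums = {}
    counts = defaultdict(int)
    for features, level in examples:
        if level not in sums:
            sums[level] = [0.0] * len(features)
        sums[level] = [s + f for s, f in zip(sums[level], features)]
        counts[level] += 1
    return {level: [s / counts[level] for s in total] for level, total in sums.items()}


def predict(centroids, features):
    """Return the level whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda level: math.dist(centroids[level], features))


# Toy, made-up training data: [length_ratio, lexical_overlap] per document pair.
training = [
    ([0.95, 0.90], "parallel"),
    ([0.90, 0.85], "parallel"),
    ([0.80, 0.60], "strongly comparable"),
    ([0.75, 0.55], "strongly comparable"),
    ([0.60, 0.30], "weakly comparable"),
    ([0.55, 0.25], "weakly comparable"),
    ([0.30, 0.05], "not comparable"),
    ([0.20, 0.02], "not comparable"),
]
centroids = train_centroids(training)
print(predict(centroids, [0.78, 0.58]))
print(predict(centroids, [0.25, 0.03]))
```

Any classifier that accepts numeric feature vectors could take the place of `train_centroids` and `predict`. The part that corresponds to the diagram is the flow from labelled pairs to a predicted comparability level.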
Metrics of Comparability and Parallelism
- Using defined criteria for parallelism, we would like to develop formal automated metrics for determining the degree of comparability
- Lack of comparability metrics to evaluate corpus usability for different tasks, such as machine translation, information extraction, cross-language information retrieval
- Recent studies (Kilgarriff, 2001; Rayson and Garside, 2000) have added a quantitative dimension to the issue of comparability by studying objective measures for detecting how similar (or different) two corpora are in terms of their lexical content
- Further studies (Sharoff, 2007) investigated automatic ways for assessing the composition of web corpora in terms of domains and genres
Sentence Alignment on Parallel Texts
- Danielsson, Pernilla and Ridings, Daniel. Practical presentation of a vanilla aligner. Göteborgs universitet, 1997
- Melamed, Dan. A Geometric Approach to Mapping Bitext Correspondence. University of Pennsylvania, 1996
- Chen, Stanley F. Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st annual meeting on Association for Computational Linguistics (Columbus, Ohio 1993), Association for Computational Linguistics, Morristown, NJ, USA, 9-16
- State of the Art: Moore, Robert C. Fast and Accurate Sentence Alignment of Bilingual Corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (Tiburon, California 2002), Springer-Verlag, Heidelberg, 135-144:
  - provisional alignment based on sentence lengths
  - IBM Model 1: estimate the Translation Equivalents (TE) table
  - generate one-to-one links based on sentence lengths and the TE table
Our Sentence Alignment on Parallel Texts
- Reification: a link in the alignment is treated as a context-independent structured object
- Using SVM (libsvm solution). Features:
  - translation equivalence
  - word length correlation (Pearson)
  - special characters occurrence similarity
  - word frequency ranks correlation
- Crossed links are allowed
Scenarios for Aligning Comparable Corpora
Based on previous experience, literature and current constraints (time, manpower, computational resources), we envisaged 3 possible ways of tackling the alignment of comparable corpora in order to get useful results:
- QA techniques
- Clustering
- Windowing
Accurat partners
- Tilde (Coordinator), Latvia
- University of Sheffield, UK
- University of Leeds, UK
- Athena Research and Innovation Center in Information Communication and Knowledge Technologies (ILSP), Greece
- University of Zagreb, Faculty of Humanities and Social Sciences, Croatia
- DFKI, Germany
- Institute of Artificial Intelligence, Romania
- Linguatec, Germany
- Zemanta, Slovenia
The ACCURAT project has received funding from the EU 7th Framework Programme for Research and Technological Development under Grant Agreement N° 248347
Project duration: January 2010 – June 2012
Contact information: Andrejs Vasiljevs, andrejs tilde.lv
Tilde, Vienibas gatve 75a, Riga, LV1004, Latvia
www.accurat-project.eu