Understanding Machine Translation and the Challenge of Patents

Upload: iconic-translation-machines

Post on 16-Jan-2017

Page 1: Understanding Machine Translation and the Challenge of Patents

Dr. John Tinsley

CEO, IPTranslator

PIUG Annual Conference 2013

Alexandria, VA. April 29th

Understanding Machine Translation and the Challenge of Patents

Page 2: Understanding Machine Translation and the Challenge of Patents

PIUG Annual Conference, Alexandria, April 29, 2013

The need for translation

Accelerating global growth in volume of patents:

- 10.7% increase in PCT applications in 2011
- China +33.4%
- Japan +21%

Page 3: Understanding Machine Translation and the Challenge of Patents

Why listen to me?

Machine translation is what I do!

- BSc in Computational Linguistics

- PhD in Machine Translation (DCU, CMU)

- Software Engineer for MT (CNGL)

- Founder of IPTranslator

Page 4: Understanding Machine Translation and the Challenge of Patents

Machine Translation: The Basics

Machine Translation = automatic translation

- The use of computers to translate from one language into another
- The use of computers to automate some, or all, of the translation process

Statistical Machine Translation (SMT)

An approach to Machine Translation in which translations for an input are estimated from previously seen translation examples and associated (inferred) probabilities.

e.g. IPTranslator, Google Translate

Previous approaches:

- Rule-based (or transfer-based): based on linguistic rules, e.g. Systran; Altavista’s Babelfish
- Example-based: based on translation examples and inferred linguistic patterns

SMT is now by far the predominant approach

Page 5: Understanding Machine Translation and the Challenge of Patents

Bilingual Corpora

A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language
- document(s)
- book(s)

A bilingual corpus is a collection of corresponding texts, in multiple languages
- A document & its translation
- A book in multiple languages
- The European Parliament proceedings

Note:
- source language = original language, or language we’re translating from
- target language = language we’re translating into

[Figure: a bilingual corpus]

Page 6: Understanding Machine Translation and the Challenge of Patents

Aligned Bilingual Corpora

A document-aligned bilingual corpus corresponds on a document level

For translation, we require sentence-aligned bilingual corpora

The sentence on line 1 in the source language text corresponds to (i.e. is a translation of) the sentence on line 1 in the target language text, and so on.

Often referred to as parallel aligned corpora

Sentence-aligned bilingual parallel corpora are essential for statistical machine translation
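
The line-by-line correspondence described above can be sketched in a few lines of Python. This is a minimal illustration, not a real alignment tool: the two in-memory texts and the helper name `align_by_line` are assumptions made for the example, and it presumes the texts are already aligned one sentence per line.

```python
# Hypothetical source (English) and target (Spanish) texts, one sentence per line.
source_text = "I have a cat\nThe dog sleeps\nThank you"
target_text = "Tengo un gato\nEl perro duerme\nGracias"


def align_by_line(source, target):
    """Pair line i of the source text with line i of the target text."""
    src_lines = source.splitlines()
    tgt_lines = target.splitlines()
    # A sentence-aligned corpus must have the same number of lines on both sides.
    if len(src_lines) != len(tgt_lines):
        raise ValueError("texts are not sentence-aligned: line counts differ")
    return list(zip(src_lines, tgt_lines))


pairs = align_by_line(source_text, target_text)
for src, tgt in pairs:
    print(f"{src}  <->  {tgt}")
```

Each tuple in `pairs` is one training example for a statistical system: a source sentence and its translation.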

Page 7: Understanding Machine Translation and the Challenge of Patents

Learning From Previous Translations

Suppose we already know (from a sentence-aligned bilingual corpus) that:
- “dog” is translated as “perro”
- “I have a cat” is translated as “Tengo un gato”

We can theoretically translate:
- “I have a dog” -> “Tengo un perro”
- Even though we have never seen “I have a dog” before

Statistical machine translation induces information about unseen input, based on previously known translations

- Primarily co-occurrence statistics
- Takes contextual information into account

Page 8: Understanding Machine Translation and the Challenge of Patents

Statistical Machine Translation

- Example of a small sentence aligned bilingual corpus for English-French

Page 9: Understanding Machine Translation and the Challenge of Patents

Statistical Machine Translation

- We take some new input to translate

Page 10: Understanding Machine Translation and the Challenge of Patents

Statistical Machine Translation

- We take some new input to translate

- From the corpus we can infer possible target (French) translations for various source (English) words

- We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
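
The idea of picking translations from co-occurrence statistics can be sketched as follows. The three-sentence English-French corpus is invented for illustration (the corpus shown on the slide is not reproduced in this transcript). Raw counts alone tie on such a tiny corpus, so the sketch normalises them with the Dice coefficient; that normalisation is an assumption of this example, not something the talk specifies.

```python
from collections import Counter

# Hypothetical sentence-aligned English-French corpus.
corpus = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]

cooc = Counter()       # how often a source word and a target word share a sentence pair
src_count = Counter()  # number of sentence pairs containing each source word
tgt_count = Counter()  # number of sentence pairs containing each target word

for src_sentence, tgt_sentence in corpus:
    src_words = set(src_sentence.split())
    tgt_words = set(tgt_sentence.split())
    src_count.update(src_words)
    tgt_count.update(tgt_words)
    for s in src_words:
        for t in tgt_words:
            cooc[(s, t)] += 1


def dice(s, t):
    """Dice coefficient: co-occurrence count normalised by both words' frequencies."""
    return 2 * cooc[(s, t)] / (src_count[s] + tgt_count[t])


# For every source word, keep the target word with the highest score.
best = {s: max(tgt_count, key=lambda t: dice(s, t)) for s in src_count}


def translate(sentence):
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(best.get(word, word) for word in sentence.split())


print(best)
print(translate("the blue flower"))  # a sentence that never appears in the corpus
```

The unseen input is translated from statistics gathered on other sentences, but the word-by-word output keeps English word order ("la bleue fleur" rather than "la fleur bleue"). Real systems add models for word order and fluency on top of these lexical statistics.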

Page 11: Understanding Machine Translation and the Challenge of Patents

Statistical Machine Translation

Given a previously unseen input sentence and our collated statistics, we can estimate a translation

Page 12: Understanding Machine Translation and the Challenge of Patents

Advanced Modelling

All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation

Previous example is very simplistic

In reality SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages

Upwards of 2M sentence pairs on average for large-scale systems

Statistics calculated to represent:
- Word-to-word translation probabilities
- Phrase-to-phrase translation probabilities
- Word order probabilities
- Structural information (i.e. syntactic information)
- Fluency of the final output
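
One common way to combine several such statistics is a weighted sum of log-probabilities (a log-linear model), with the highest-scoring candidate chosen as the output. The sketch below shows only that combination step. The candidate sentences, the probability values, the feature names and the weights are all made up for illustration; it is not a description of how IPTranslator or any particular system is built.

```python
import math

# Hypothetical candidate translations with invented model probabilities.
# "translation": how well the words and phrases match the source.
# "fluency": how natural the output reads in the target language.
candidates = {
    "the big red house": {"translation": 0.20, "fluency": 0.10},
    "the big house red": {"translation": 0.25, "fluency": 0.001},
    "the large red home": {"translation": 0.05, "fluency": 0.02},
}

# Hypothetical feature weights; real systems tune these on held-out data.
weights = {"translation": 1.0, "fluency": 1.0}


def score(features):
    """Weighted sum of log-probabilities over all features."""
    return sum(weights[name] * math.log(p) for name, p in features.items())


best = max(candidates, key=lambda text: score(candidates[text]))
print(best)
```

With these invented numbers, the second candidate has the highest translation probability but loses overall because its fluency score is very low.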

Page 13: Understanding Machine Translation and the Challenge of Patents

Data is Key

For SMT data is key

Information (word/phrase correspondences and associated statistics) is only based on what we have seen before in the data

Important that data used to train SMT systems is:
- Of sufficient size: avoid sparseness/skewed statistics
- Representative and relevant: contains the right type of language
- High-quality: absence of misspellings, incorrect alignments etc.
- Proofed by human translators

[Figure: training data]

Page 14: Understanding Machine Translation and the Challenge of Patents

Why is MT Difficult?

A word or a phrase can have more than one meaning (ambiguity – lexical or structural)
- E.g.: “bank”, “dive”; “I saw the man with the telescope”

People use language creatively
- New words are cropping up all the time

Linguistic differences between languages
- E.g. structure of Irish sentences vs. structure of English sentences: “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”

There can be more than one way to express the same meaning
- “New York”, “The Big Apple”, “NYC”

Page 15: Understanding Machine Translation and the Challenge of Patents

Why is MT Difficult?

- Israeli officials are responsible for airport security.
- Israel is in charge of the security at this airport.
- The security work for this airport is the responsibility of the Israel government.
- Israeli side was in charge of the security of this airport.
- Israel is responsible for the airport’s security.
- Israel is responsible for safety work at this airport.
- Israel presides over the security of the airport.
- Israel took charge of the airport security.
- The safety of this airport is taken charge of by Israel.
- This airport’s security is the responsibility of the Israeli security officials.

Page 16: Understanding Machine Translation and the Challenge of Patents

Not all languages are created equal

It’s easier to translate between some language pairs than others

A group of rival companies seek sanctions against Google

Un grupo de compañías rivales pide sanciones contra Google

We believe that the delegates will make their decision after a long debate

Wir glauben dass die Delegierten ihre Entscheidung nach einer langen Debatte treffen

Thank you very much

Go raibh míle maith agat (Lit: May you have a thousand good things)

Page 17: Understanding Machine Translation and the Challenge of Patents

The Challenge of Patents

Long sentences

Complex constructions

L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …

maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.

Page 18: Understanding Machine Translation and the Challenge of Patents

• Authoring guide for “to be translated” text

• Patents break almost all of the rules!

• “Thanks, guys(!)”

The Challenge of Patents

- Very long sentences as standard
- Grammatically incomplete using nominal and telegraphic style (!)
- Passive forms are frequent
- Frequent use of subordinate clauses, participles, implicit constructs
- Inconsistent and incorrect spelling
- High use of neologisms
- Instances of synonymy and polysemy
- Spurious use of punctuation

Page 19: Understanding Machine Translation and the Challenge of Patents

Evaluating Machine Translation Quality

Automatic Evaluation
Judge the quality of an MT system by comparing its output against a human-produced “reference” translation
- Pros: Quick, cheap, consistent
- Cons: Inflexible, cannot be used on ‘new’ input

Human Evaluation
Assessment of output by a bilingual evaluator
- Pros: Reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)
- Cons: Slow, expensive, subjective

Task Based Evaluation
Fluency vs Adequacy

Page 20: Understanding Machine Translation and the Challenge of Patents

Evaluating Machine Translation Quality

Task Based Evaluation
- Standalone evaluation of MT systems is necessary to get a sense of the overall quality of a system
- To determine the ultimate usability of an MT system, intrinsic task-based evaluation is required
- Why? Fluency vs. Adequacy

Fluency: how fluent and grammatically correct the translation output is
Adequacy: how accurately the translation conveys the meaning of the source

Source: La gran casa roja
Output 1: The big blue house
Output 2: The big house red
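
Automatic evaluation against a reference can be sketched with a simple n-gram overlap score, in the spirit of metrics such as BLEU but much reduced. The reference translation "The big red house" is an assumption added for this example (the slide gives only the source and the two outputs), and `clipped_precision` is a hypothetical helper name.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of the candidate's n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "The big red house"  # assumed human reference translation
outputs = {
    "Output 1": "The big blue house",
    "Output 2": "The big house red",
}

for name, text in outputs.items():
    words = clipped_precision(text, reference, 1)
    pairs = clipped_precision(text, reference, 2)
    print(f"{name}: 1-gram precision {words:.2f}, 2-gram precision {pairs:.2f}")
```

Output 2 contains every reference word, while Output 1 has the wrong colour. On word pairs the two outputs score the same, so surface overlap alone does not separate the fluent-but-wrong output from the accurate-but-disfluent one. Such scores are quick and consistent, but they do not show which output is more useful for a given task.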

Page 21: Understanding Machine Translation and the Challenge of Patents

Practical uses of Machine Translation

Understand its limitations and you’ll understand its capabilities!

No

•Translate a patent for filing

•Translate literature for publication

•Translate marketing materials

Yes

•Productivity tool for professional translation

•Understand foreign patents

•Localisation processes and “controlled” content

Page 22: Understanding Machine Translation and the Challenge of Patents

Thank you!

Dr. John Tinsley

[email protected]

@IPTranslator

Page 23: Understanding Machine Translation and the Challenge of Patents

German Verb Movement

We like that Götze scored a goal in the final.

Uns gefällt, dass Götze ein Tor im Finale geschossen hat
(we like that Götze a goal in the final scored has)

Page 24: Understanding Machine Translation and the Challenge of Patents

Sentence: 这是一篇有趣的文章
Words: 这是 一篇 有趣 的 文章

(zhèshì yīpiān yǒuqù de wénzhāng)
(This is an interesting article)

种水果的农民
The farmer who grows fruit

[Lit: “grow fruit (particle) farmer”]
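
Chinese text has no spaces between words, so it must be split into words before translation. One simple dictionary-based way to do this is forward maximum matching, sketched below. The tiny dictionary is made up to cover the example sentence, and nothing here is claimed to be the segmenter used by any particular MT system.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical dictionary covering only the example sentence.
dictionary = {"这是", "一篇", "有趣", "的", "文章"}

print(" ".join(segment("这是一篇有趣的文章", dictionary)))
```

With this dictionary the sketch reproduces the word split shown above. Greedy matching picks the wrong boundary when words overlap ambiguously, which is one reason segmentation adds difficulty for MT.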

Page 25: Understanding Machine Translation and the Challenge of Patents

English: “Software”
Simplified: 软件
Traditional: 軟體

English: “Network”
Simplified: 网络
Traditional: 網路

Я пошёл в магазин

I went to the shop

В магазин пошёл я

I went to the shop

Пошёл я в магазин

I went to the shop

(A)

(B) (C)