
Past, Present, and Future: Machine Translation & Natural Language Processing for Patent Information

Dr. John Tinsley

CEO, Iconic Translation Machines Ltd.

EPOPIC. Madrid. 10th November 2016

Why listen to me?

- BSc in Computational Linguistics
- PhD in Machine Translation
- Language Technology consultant
- Founder of Iconic Translation Machines

Machine Translation is what I do!

The world’s first and only patent-specific machine translation platform

Machine Translation: The Basics

Machine Translation = automatic translation

- The use of computers to translate from one language into another
- The use of computers to automate some, or all, of the translation process

Statistical Machine Translation (SMT)

- An approach to Machine Translation where translations for an input are estimated from previously seen translation examples and their associated (inferred) probabilities
- e.g. IPTranslator, Google Translate

Other approaches

- Rule-based (or transfer-based): based on linguistic rules, e.g. Systran; Altavista’s Babelfish
- Example-based: based on translation examples and inferred linguistic patterns

SMT is now by far the predominant approach*

Bilingual Corpora

A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language

- document(s)
- book(s)

A bilingual corpus is a collection of corresponding texts, in multiple languages

- a document & its translation
- a book in multiple languages
- European Parliament proceedings

Note

- source language = original language, or the language we’re translating from
- target language = language we’re translating into

Aligned Bilingual Corpora

A document-aligned bilingual corpus corresponds on a document level

For translation, we require sentence-aligned bilingual corpora

- The sentence on line 1 in the source language text corresponds to (i.e. is a translation of) the sentence on line 1 in the target language text, etc.
- Often referred to as parallel aligned corpora

Sentence-aligned bilingual parallel corpora are essential for statistical machine translation
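As a minimal sketch of what sentence alignment means in practice, the Python snippet below pairs line i of a source text with line i of a target text. The function name `pair_aligned_lines` and the sample sentences are illustrative assumptions, not part of any particular toolkit.

```python
def pair_aligned_lines(source_lines, target_lines):
    """Pair line i of the source text with line i of the target text."""
    if len(source_lines) != len(target_lines):
        # A sentence-aligned corpus needs the same number of lines on both
        # sides; otherwise the line-by-line correspondence breaks down.
        raise ValueError("source and target must have the same number of lines")
    return [(s.strip(), t.strip()) for s, t in zip(source_lines, target_lines)]


# Illustrative English-Spanish data.
english = ["I have a cat\n", "The house is big\n"]
spanish = ["Tengo un gato\n", "La casa es grande\n"]
corpus = pair_aligned_lines(english, spanish)
```

Each element of `corpus` is a (source sentence, target sentence) pair, which is the unit an SMT system would learn from.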

Learning from Previous Translations

Suppose we already know (from a sentence-aligned bilingual corpus) that:

- “dog” is translated as “perro”
- “I have a cat” is translated as “Tengo un gato”

We can theoretically translate:

- “I have a dog” -> “Tengo un perro”
- Even though we have never seen “I have a dog” before

Statistical machine translation induces information about unseen input, based on previously known translations:

- Primarily co-occurrence statistics
- Takes contextual information into account

Statistical Machine Translation

Example of a small sentence-aligned bilingual corpus for English-French


We take some new sentence to translate


From the corpus we can infer possible target (French) translations for various source (English) words

We can then select the most probable translations based on simple frequencies (co-occurrence statistics)


Given a previously unseen input sentence, and our collated statistics, we can estimate a translation
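The idea of picking translations from co-occurrence statistics can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the three sentence pairs are invented, and the Dice coefficient is used here to normalise the raw co-occurrence counts (so that a frequent word like “la” does not win every time). The slides do not say which statistic was actually used.

```python
from collections import Counter

# Illustrative English-French sentence pairs (not the corpus from the slides).
corpus = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]

src_count = Counter()   # number of sentence pairs containing each source word
tgt_count = Counter()   # number of sentence pairs containing each target word
pair_count = Counter()  # number of sentence pairs containing both words

for src, tgt in corpus:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_count.update(src_words)
    tgt_count.update(tgt_words)
    pair_count.update((s, t) for s in src_words for t in tgt_words)


def best_translation(word):
    """Return the target word with the highest Dice score, or None if unseen."""
    if word not in src_count:
        return None

    def dice(t):
        return 2 * pair_count[(word, t)] / (src_count[word] + tgt_count[t])

    return max(tgt_count, key=dice)


def translate(sentence):
    """Translate word by word, keeping the source word order."""
    return " ".join(best_translation(w) or w for w in sentence.split())


print(translate("the blue flower"))  # la bleue fleur
```

The input “the blue flower” never appears in the toy corpus, yet every word receives a translation. The output keeps the English word order (“la bleue fleur” rather than “la fleur bleue”), which is why real systems also need word-order and fluency statistics.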

Advanced MT

All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation

Previous example is very simplistic

- In reality, SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages
- Upwards of 2M sentence pairs on average for large-scale systems

Other statistics calculated include

- Word-to-word translation probabilities
- Phrase-to-phrase translation probabilities
- Word order probabilities
- Linguistic information (are the words nouns, verbs?)
- Fluency of the final output
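One of the statistics named on this slide, fluency of the final output, is commonly estimated in SMT with an n-gram language model trained on target-language text. The sketch below is a hedged illustration of that general technique, not a description of any specific system: the three training sentences, the add-one smoothing, and the name `fluency_score` are all assumptions made for the example.

```python
import math
from collections import Counter

# Illustrative monolingual French text used to estimate bigram counts.
training = ["la maison bleue", "la fleur rouge", "une fleur bleue"]


def bigrams(sentence):
    """Return adjacent word pairs, with sentence start and end markers."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


bigram_count = Counter()
context_count = Counter()
vocab = set()
for sentence in training:
    for first, second in bigrams(sentence):
        bigram_count[(first, second)] += 1
        context_count[first] += 1
        vocab.update((first, second))


def fluency_score(sentence):
    """Log-probability under an add-one smoothed bigram model (higher = more fluent)."""
    v = len(vocab)
    return sum(
        math.log((bigram_count[pair] + 1) / (context_count[pair[0]] + v))
        for pair in bigrams(sentence)
    )


candidates = ["la bleue fleur", "la fleur bleue"]
print(max(candidates, key=fluency_score))  # la fleur bleue
```

Both candidates contain the same words, so word-to-word probabilities alone cannot separate them. The bigram model prefers the ordering whose word pairs it has seen before.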

Data is Key

For SMT, data is key

- Information (word/phrase correspondences and associated statistics) is only based on what we have seen before in the data

It is important that the data used to train SMT systems (the training data) is:

- Of sufficient size: avoids sparseness/skewed statistics
- Representative and relevant: contains the right type of language
- High-quality: absence of misspellings, incorrect alignments, etc.; proofed by human translators

Why is MT Difficult?

A word or a phrase can have more than one meaning (ambiguity: lexical or structural)

- e.g. “bank”, “dive”, “I saw the man with the telescope”

People use language creatively

- New words are cropping up all the time

Linguistic differences between languages

- e.g. structure of Irish sentences vs. structure of English sentences: “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”

There can be more than one way to express the same meaning

- “New York”, “The Big Apple”, “NYC”

Why is MT Difficult?

- Israeli officials are responsible for airport security.
- Israel is in charge of the security at this airport.
- The security work for this airport is the responsibility of the Israel government.
- Israeli side was in charge of the security of this airport.
- Israel is responsible for the airport’s security.
- Israel is responsible for safety work at this airport.
- Israel presides over the security of the airport.
- Israel took charge of the airport security.
- The safety of this airport is taken charge of by Israel.
- This airport’s security is the responsibility of the Israeli security officials.

No single solution for all languages

English - Spanish

English - French

- Number agreement: the house / the houses vs. la maison / les maisons
- Gender agreement: the house / the cheese vs. la maison / le fromage

No single solution for all languages

English - German

English - Chinese

种水果的农民

The farmer who grows fruit

[Lit: “grow fruit (particle) farmer”]

Not all languages are created equal

French German Turkish Finnish

Spanish Chinese Korean Hungarian

Portuguese Japanese Thai Basque

The Challenge of Patents

Long Sentences

- Largest single document: 249,322 words
- Longest Sentence: 1,417 words

Technical constructions

- L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3
- …maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.

The Challenge of Patents

- Very long sentences as standard
- Grammatically incomplete, using nominal and telegraphic style (!)
- Passive forms are frequent
- Frequent use of subordinate clauses, participles, implicit constructs
- Inconsistent and incorrect spelling
- High use of neologisms
- Instances of synonymy and polysemy
- Spurious use of punctuation

Authoring guide for “to be translated” text

Patents break almost all of the rules!

Evaluating Machine Translation Quality

Automatic Evaluation

- Judge the quality of an MT system by comparing its output against a human-produced “reference” translation
- Pros: Quick, cheap, consistent
- Cons: Inflexible, cannot be used on ‘new’ input

Human Evaluation

- Pros: Reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)
- Cons: Slow, expensive, subjective

Task-Based Evaluation

- Fluency vs. Adequacy

Evaluating Machine Translation Quality

Task-Based Evaluation

- Standalone evaluation of MT systems is necessary to get a sense of the overall quality of a system
- To determine the ultimate usability of an MT system, intrinsic task-based evaluation is required
- Why? Fluency vs. Adequacy

Fluency: how fluent and grammatically correct the translation output is

Adequacy: how accurately the translation conveys the meaning of the source

- Source: La gran casa roja
- Output 1: The big blue house
- Output 2: The big house red
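Automatic evaluation against a reference can be illustrated with clipped n-gram precision, one ingredient of metrics such as BLEU. The sketch below is a simplified illustration, not a full metric. The reference “The big red house” is an assumed human translation of the slide's source sentence “La gran casa roja”, and the function names are chosen for this example only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference (clipped)."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's credit to the number of times the reference has it.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "The big red house"  # assumed reference translation
for output in ["The big blue house", "The big house red"]:
    print(output, ngram_precision(output, reference, 1), ngram_precision(output, reference, 2))
```

On word-level (unigram) precision, Output 2 scores higher because all of its words appear in the reference, while Output 1 contains the wrong colour. On bigram precision the two outputs tie, since each shares only “the big” with the reference. A single automatic score therefore does not tell you whether an output failed on fluency or on adequacy.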

Practical uses of Machine Translation

Understand its limitations and you’ll understand its capabilities!

No

Translate a patent for filing

Translate literature for publication

Translate marketing materials

Anything mission critical without review

Yes

Productivity tool for professional translation

Understand foreign patents

Localisation processes and “controlled” content

High volume, e.g. eDiscovery

Use cases in practice

- Product descriptions to open new markets
- MT for post-editing productivity across industries
- Developer, and user for web content
- Tens of thousands of people using online tools daily

Current Hot Topics

Neural Networks

- Using artificial intelligence and deep learning to develop a completely new way of doing machine translation!

Quality Estimation

- Functionality through which machine translation can “self-assess” the quality of the translations it produces.

Online Adaptive Translation

- Machine translations that can automatically learn and improve based on feedback, particularly from revisions.

Use-case specific MT

- Just like patent MT, but for countless other areas.

About Iconic

We are a Machine Translation and Natural Language Processing software and services provider, delivering expert solutions with Subject Matter Expertise

Iconic Ensemble Architecture…

…enhanced with Neural MT

Speed, Cost, and Quality

What is the difference between machine translation and manual translation when translating a 10-page patent document from Chinese into English?

Machine Translation is not designed to replace professional translation but there are many cases where costly and time-consuming manual translation is simply not necessary.

- Data confidentiality

- File formats

- Potential for customisation, enhancements, and improvement for specific domains

More than just translation

DATA PROCESSING

E.G. OPTICAL CHARACTER RECOGNITION, DIGITISATION

DATABASE BUILDING

E.G. COMBINING THE ABOVE, WITH TRANSLATION, FOR EXPORT

DATA UNDERSTANDING

E.G. SUMMARISATION, CONCEPT & KEY TERM IDENTIFICATION

INFORMATION EXTRACTION

E.G. CITATION ANALYSIS, CROSS-LINGUAL SEARCH

Record Extraction

Extraction algorithms work on cleaned OCR output, using patterns, keywords, and formatting information.
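A pattern-based extraction step of this kind might look like the following Python sketch. It is purely illustrative: the regular expression, the assumed reference format (two-letter office code, number, optional kind code), and the function name `extract_patent_refs` are hypothetical and are not taken from Iconic's system.

```python
import re

# Hypothetical pattern: a two-letter office code, a number that may contain
# separators, and an optional kind code such as A, A1, or B2.
PATENT_REF = re.compile(
    r"\b([A-Z]{2})\s?(\d[\d,./]{3,}\d)(?:\s?([A-Z]\d?))?\b"
)


def extract_patent_refs(text):
    """Return (office, number, kind) tuples found in cleaned OCR text."""
    refs = []
    for office, number, kind in PATENT_REF.findall(text):
        # Normalise the number by dropping thousands separators and slashes.
        refs.append((office, re.sub(r"[,./]", "", number), kind or None))
    return refs


sample = "See US 5,123,456 A and EP 1234567 B1 for details."
print(extract_patent_refs(sample))
# [('US', '5123456', 'A'), ('EP', '1234567', 'B1')]
```

In practice, a pattern like this would be combined with the keyword and formatting cues mentioned above, since OCR noise and varying citation styles make any single expression brittle.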

Citation Analysis

Assessment of record and reference patterns

Application for record extraction

Tracking variations across years

Application for bibliographic data fielding

Reference extraction + fielding

Visit ….com and use the promo code epo2016 to get 20 free pages of translation

Thank you!

[email protected]

@IconicTrans
