Copyright © 2017 Omniscien Technologies. All Rights Reserved.


AI, MT and Language Processing Symposium

Speaker Overview

Dion Wiggins
CTO and Co-Founder
Omniscien Technologies

Dion Wiggins is a highly experienced ICT industry visionary, entrepreneur, analyst and consultant. He has an impressive knowledge in the fields of software development, architecture and management, as well as an in-depth understanding of Asian ICT markets. He is an accomplished speaker and has a high media profile for his perceptive analysis of ICT in Asia/Pacific.

Previously Dion was Vice President and Research Director for Gartner based in Hong Kong, where he was the most senior and highly-respected analyst based in all of Asia. Dion's research reports on ICT in China helped change the way the world views this market.

Dion is also a well-known pioneer of the Asian Internet Industry, being the founder of one of Asia's first ever ISPs (Asia Online in Hong Kong). In his role at Gartner and in various other consulting positions prior to that, Dion advised literally hundreds of enterprises on their ICT strategy.

Dion was a founder of The ActiveX Factory, where he was recipient of the Chairman's Commendation Award presented by Microsoft's Bill Gates for the best showcase of software developed in the Philippines. The US Government has recognized Dion as being in the top 5% of his field worldwide and he is a former holder of a US O1 Extraordinary Ability Visa.


Found in Translation - Language Meets Technology

Dion Wiggins
CTO and Co-Founder
Omniscien Technologies

Language is highly complex. There is a running joke in the translation industry that machine translation will be a solved problem in 5 years. This has been updated every 5 years since the 1950s. The promise was always there, but the technology of the day simply could not deliver. However, real progress has been made.

This opening keynote presentation looks at the current state of language technology in the context of artificial intelligence, machine learning, machine translation and language processing, cutting through the hype to look at real world applications of these technologies in business today. The following 3 days are packed with industry experts, with this presentation acting as a primer ahead of more in-depth discussions.

Dion will introduce some key concepts on a range of topics including deep neural machine translation and artificial intelligence to outline how AI tools can be integrated into language processing workflows that break down language and language barriers. Finally, Dion will explore how language processing in both bilingual and monolingual contexts can find new information that is hidden from the human eye but is actually right in front of us to use in everyday business.

2 April 2018

Found in Translation – Language Meets Technology

Dion Wiggins
Chief Technology Officer


Machine Translation will be a Solved Problem in 5 Years

• Machine translation is improving.

• Different languages have different challenges.

• Technologies have evolved.

• The latest technology (Neural MT) has notably helped improve fluency.

• Toolkits abound.

• Many challenges still ahead.


Limitations of Today’s Machine Translation

• Context is a challenge.

• MT only has the context of the current sentence and does not understand the sentence before, after or other parts of the document.

Word Sense Disambiguation: “I went to the bank”

• I went to the ATM.

• I swam to the river bank.

• I banked my plane into a dive.

• I banked my car into a turn.
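One simple way to approach word sense disambiguation is to compare the words around the ambiguous word with a short word list for each sense. The sketch below is a simplified Lesk-style overlap count. The sense names and word lists in SENSES are invented for illustration; they do not come from a real lexicon or from the system described in this presentation.

```python
# Simplified Lesk-style word sense disambiguation (illustrative only).
# SENSES is a hypothetical, hand-written sense inventory for "bank".
SENSES = {
    "financial": {"atm", "money", "cash", "account", "deposit", "loan"},
    "river": {"river", "swam", "water", "shore", "fishing", "boat"},
    "tilt": {"plane", "car", "turn", "dive", "banked", "steer"},
}


def disambiguate(sentence: str, senses: dict[str, set[str]] = SENSES) -> str:
    """Return the sense whose word list shares the most words with the sentence."""
    words = {w.strip(".,!?\"'").lower() for w in sentence.split()}
    # Count overlapping words per sense; ties resolve to the first sense listed.
    return max(senses, key=lambda name: len(words & senses[name]))
```

With no helpful words nearby, as in “I went to the bank”, every sense scores zero and the function simply falls back to the first sense listed, which is the limitation the slide describes.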


The most ambiguous word in English is “Run”

• I ran for office

• I went for a run

• I ate bad food and got the runs

• I scored a home run

• The dry run went well

• He had a run of good luck

• The medication ran its course

• My stockings got a run

• The chicken run was big

• He was run over by a car

• …

Meanings


Ambiguity can be solved in part

• Words nearby

• The data that the MT engines were trained on

• This example has very focused, minimal data.

Life Sciences – ONLY 1 Million Sentences beats Google & Bing

Language Pair | MT Engine          | BLEU  | F-Measure | Levenshtein Distance | TER
EN-JA         | Omniscien Deep NMT | 48.01 | 77        | 14.78                | 35.98
EN-JA         | Omniscien NMT      | 36.65 | 70        | 19.24                | 48.06
EN-JA         | Google             | 31.74 | 66        | 20.30                | 52.17
EN-JA         | Bing               | 23.00 | 60        | 24.65                | 61.58
JA-EN         | Omniscien Deep NMT | 33.92 | 70        | 39.05                | 49.59
JA-EN         | Omniscien NMT      | 28.82 | 67        | 44.77                | 56.41
JA-EN         | Google             | 26.80 | 65        | 43.38                | 55.58
JA-EN         | Bing               | 17.32 | 56        | 53.65                | 65.97
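For readers unfamiliar with the metrics: BLEU and F-Measure are higher-is-better, while Levenshtein Distance and TER count edits, so lower is better. The sketch below shows a word-level Levenshtein distance and a simple edits-per-reference-word rate. It is a simplified stand-in for TER (real TER also allows block shifts) and is not how the figures in the table were produced; the function names are illustrative.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] = distance between the hyp words seen so far (minus one) and ref[:j].
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            current.append(min(previous[j] + 1,          # delete h
                               current[j - 1] + 1,       # insert r
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def edit_rate(hypothesis: str, reference: str) -> float:
    """Edits per reference word; a simplified TER-like score without shifts."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len
```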


Patents – 12 Million Sentences

• Patents are a very complex domain.

• Google and Omniscien have about the same volume of EN-DE bilingual patent data.

• Google also has many other domains mixed in.

• Omniscien focused the engine on Patent translations ONLY.

• Too many mixed domains confuse context and lower quality.

Language Pair | MT Engine          | BLEU  | F-Measure | Levenshtein Distance | TER
EN-DE         | Omniscien Deep NMT | 43.10 | 72        | 58.99                | 38.94
EN-DE         | Google             | 39.52 | 69        | 64.05                | 42.22
DE-EN         | Omniscien Deep NMT | 58.80 | 81        | 40.47                | 27.08
DE-EN         | Google             | 52.10 | 78        | 51.28                | 32.37


Learning in Machine Translation

• Various Approaches
  • Rule-based (1970s)
  • Word-based (1990s)
  • Phrase-based (2000s)
  • Syntax-based (2010s)
  • Neural-based (2017+)

• Common Approach
  • Probabilistic Estimation (SMT uses this approach)

Today data-driven approaches dominate machine translation

[Figure: diagram with labels Source, Target, Interlingua, Semantic Transfer, Syntax Transfer, Lexical Transfer and Training Data]


Machine Learning vs Neural Learning

Machine Learning

• Approach
  • Analyze problem
  • Feature engineering (coded by a programmer)

• Machine Translation Example
  • What features are relevant for word order?
  • What features are relevant for lexical translation?

Neural Learning

• Promise
  • No more feature engineering

• Neural Learning
  • Discovers the features automatically
  • Learns how to process features

[Figure: two diagrams, one labeled Machine Learning and one labeled Neural Learning, each showing Input, Features, Output]


Shallow NMT vs Deep NMT

Shallow NMT layers: Input, Hidden, Output

Deep NMT layers: Input, Hidden 1, Hidden 2, Hidden 3, Output

• More layers

• More complex feature interactions
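To make the layer counts concrete, here is a minimal sketch of a forward pass through a shallow and a deep stack of layers. It uses plain feedforward layers with random weights, so it only illustrates how depth changes the structure; real NMT systems use recurrent or attention-based layers and trained weights. The layer sizes (4, 8, 2) are arbitrary assumptions.

```python
import math
import random


def make_layer(n_in: int, n_out: int, rng: random.Random) -> list[list[float]]:
    """Random weight matrix with n_out rows of n_in weights (illustrative values)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def forward(x: list[float], layers: list[list[list[float]]]) -> list[float]:
    """Pass x through each layer in turn: weighted sum followed by tanh."""
    for weights in layers:
        x = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]
    return x


rng = random.Random(0)
# Shallow: input (4) -> hidden (8) -> output (2)
shallow = [make_layer(4, 8, rng), make_layer(8, 2, rng)]
# Deep: input (4) -> hidden 1 (8) -> hidden 2 (8) -> hidden 3 (8) -> output (2)
deep = [make_layer(4, 8, rng), make_layer(8, 8, rng),
        make_layer(8, 8, rng), make_layer(8, 2, rng)]

example = [0.5, -0.2, 0.1, 0.9]
shallow_out = forward(example, shallow)
deep_out = forward(example, deep)
```

Each extra hidden layer applies another non-linear transformation to the output of the one before it, which is what allows the more complex feature interactions mentioned on the slide.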


Statistical Vs Neural

                            | Neural Machine Translation                         | Statistical Machine Translation
Training Time               | 12 Hours - 5 Days                                  | 12 Hours
Training Data               | 20-100 Million+ Sentences                          | 1-5 Million Sentences
Translation (Decoding Time) | 50,000 Words Per Minute (WPM)                      | 3,000 WPM
Space on Disk               | Less (1GB)                                         | More (4-70GB)
Hardware                    | GPU                                                | CPU
Mechanism                   | Sentence by Sentence                               | Word by Word / Phrase by Phrase
                            | Attentional encoder-decoder networks; optimization | Statistical Analysis / Probability
                            | Train multiple features jointly                    | Feature engineering required


Statistical Vs Neural

Neural Machine Translation Statistical Machine Translation

Interpretability

Long Distance Reordering

Morphology, Syntax and Agreement Errors

Tolerance of Noisy Data

Tolerance of Out of Domain Data

Multilingual/Multi Domain Translation

Handling of Rare Words

Runtime Control

Short Phrases (1-3 Words)


The Latest Technology is Not Always the Best Solution

• Together, SMT and Deep NMT deliver significantly higher quality translations than either does independently.

• A hybrid solution leverages the strengths and mitigates the weaknesses of both technologies.

• Language Studio integrates Deep NMT and SMT as a seamless offering.

[Figure: Statistical and Neural combined as Hybrid]


Hybrid NMT / SMT Translation Bridges the Gap

Process applies to both NMT and Deep NMT.

1. SMT and NMT work hand in hand.

2. SMT is known to do better on short content (1-2 words), so such content can be sent directly to SMT without the need for NMT.

3. NMT output translation quality is measured.

4. If quality is below a defined quality bar, the content is sent to SMT.

5. The best scoring translation is selected or both outputs can be merged into a single translation output.

6. Optimal settings are determined via an automated process or can be hand tuned for specific cases.

[Flowchart: Source Input; decision “Long or Short?” (Yes/No) routing to SMT or NMT; decision “Low Quality?” (Yes/No); Select Best Score or Merge; Target Output]
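The numbered steps above can be sketched as a small routing function. This is an illustration under stated assumptions, not the Language Studio implementation: the `smt`, `nmt` and `score` callables, the function name and the default thresholds are all hypothetical, and the merge option from step 5 is left out.

```python
from typing import Callable

Translator = Callable[[str], str]      # source text -> translation
Scorer = Callable[[str, str], float]   # (source, translation) -> quality score


def hybrid_translate(source: str, smt: Translator, nmt: Translator, score: Scorer,
                     short_max_words: int = 2, quality_bar: float = 0.5) -> str:
    """Route one segment through SMT and/or NMT following the steps above."""
    # Step 2: short content goes straight to SMT.
    if len(source.split()) <= short_max_words:
        return smt(source)
    # Step 3: translate with NMT and measure the quality of its output.
    nmt_output = nmt(source)
    nmt_score = score(source, nmt_output)
    if nmt_score >= quality_bar:
        return nmt_output
    # Step 4: below the quality bar, so also translate with SMT.
    smt_output = smt(source)
    # Step 5: keep the better-scoring output (merging is not sketched here).
    return smt_output if score(source, smt_output) > nmt_score else nmt_output
```

In practice the two thresholds correspond to the "optimal settings" in step 6, which the slide says are found by an automated process or tuned by hand.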


AI Beyond Machine Translation – Language Processing

• Voice Recognition

• Streams of words with no sentence boundaries or punctuation.

The temperature at the beach is hot today the sun is out people are gathering their beach gear and heading for a day in the sand and water

The temperature at the beach is hot today. The sun is out. People are gathering their beach gear and heading for a day in the sand and water.


Named Entity Recognition

• Break out information that can be processed further

• Make decisions about content

• Pre-Analyze content before translation or further processing

Accommodation, Age / Age Group, Aggression, Brand Name, Cast Member, Color, Chemical, Credit Card, Date, Direction, Distance, E-Mail,
File Name / Path, Food, Formula / Equation, Gender, ID Number, IP Address, Lat/Long, Location, Metric Unit, Medication / Drug,
Medical Condition, Money, Nationality, Number, Occupation, Organization, Person, Phone, Product, Product Code, Quantity,
Relationship, Religion, Size / Size Range, Social media ID, Temperature, Terms, Time / Time Range, Title / Honorific, Transport / Travel, URL / Website, Weapon
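Some of these entity types have a regular surface form and can be picked out with pattern matching, while others (Person, Organization, Medical Condition) generally need trained models or dictionaries. The sketch below covers four of the pattern-friendly types. The regular expressions are deliberately simple assumptions for illustration and will miss or over-match real-world cases.

```python
import re

# Hypothetical, deliberately simple patterns for four of the entity types listed above.
PATTERNS = {
    "E-Mail": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL / Website": re.compile(r"https?://[^\s<>\"]+"),
    "IP Address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "Date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def find_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity type, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    # Sort by start position so results follow the reading order of the text.
    return [(label, value) for _, label, value in sorted(found)]
```

Tagging entities before translation lets a workflow decide what to do with them, for example passing an e-mail address or URL through untranslated.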


Data Enrichment


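[Figure: data enrichment example; recoverable labels: “head” and the Arabic words “العراق” (Iraq) and “كردستان” (Kurdistan)]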


Meta Data Enrichment and Classification


Sentiment Analysis

• Are your customers happy?

• Do users like my product?

• Did you make a cultural error in a market?

• Are your staff engaged?

• Are there problems you need to address?


High Quality Data to Learn From is King

[Figure: data pipeline with stages Web Crawl, RSS Feeds, Document Align, Sentence Align]


Sub-Segmenting Data – Domain Categorization

• The data is out there, if only it could be classified and understood.

[Figure: Big Bucket of Mixed Data; Domain ID (News, Finance, Life Sciences); Named Entity Tagging]
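As a rough illustration of domain identification, the sketch below sorts a mixed bucket of sentences into the three domains named on the slide by counting keyword hits. The keyword lists and function names are hypothetical; a production system would more likely use a trained classifier than hand-written lists.

```python
from collections import defaultdict

# Hypothetical keyword lists; a production system would learn these from data.
DOMAIN_KEYWORDS = {
    "News": {"election", "government", "minister", "reported"},
    "Finance": {"shares", "dividend", "earnings", "investors"},
    "Life Sciences": {"patients", "dose", "clinical", "protein"},
}


def identify_domain(sentence: str) -> str:
    """Pick the domain with the most keyword hits, or 'Unclassified' if none."""
    words = {w.strip(".,;:!?").lower() for w in sentence.split()}
    hits = {domain: len(words & keywords) for domain, keywords in DOMAIN_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "Unclassified"


def split_by_domain(sentences: list[str]) -> dict[str, list[str]]:
    """Sort a mixed bucket of sentences into per-domain buckets."""
    buckets = defaultdict(list)
    for sentence in sentences:
        buckets[identify_domain(sentence)].append(sentence)
    return dict(buckets)
```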


Track content as it is processed and apply runtime changes

TUID:7

05:etu It's gonna be epic.

09:ntt It's gonna be epic.

11:glo It's <aotran type="glo" translation="akan">gonna</aotran> be epic.

13:tok it 's <a translation="akan">gonna</a> be epic <wall/> <a translation=".">.</a>

15:etm it 's <a translation="akan">gonna</a> be epic <wall/> <a translation=".">.</a>

16:tx ia |0-1,0, 0-0 | akan |2-2,0, | menjadi |3-3,0, 0-0 | epik |4-4,0, 0-0 | . |5-5,0, |

17:txu ia |0-1,0, 0-0 | akan |2-2,0, | menjadi |3-3,0, 0-0 | epik |4-4,0, 0-0 | . |5-5,0, |

19:cap Ia akan menjadi epik .

21:rim Ia akan menjadi epik .

22:det Ia akan menjadi epik.

24:pta Ia akan menjadi epik.
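A trace like the one above is easy to inspect programmatically. The sketch below parses each line into a step number, a stage code and the text at that stage, then reports which stages changed the text. The line format is inferred from the trace itself, and the stage codes (etu, ntt, glo and so on) are treated as opaque labels because the slide does not define them.

```python
import re

# "<step number>:<stage code> <text>", e.g. "05:etu It's gonna be epic."
STEP_LINE = re.compile(r"^(\d+):(\w+)\s+(.*)$")


def parse_trace(lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (step, stage, text) for each line that matches; skip the rest."""
    steps = []
    for line in lines:
        match = STEP_LINE.match(line.strip())
        if match:
            steps.append((int(match.group(1)), match.group(2), match.group(3)))
    return steps


def changed_stages(steps: list[tuple[int, str, str]]) -> list[str]:
    """List the stage codes whose text differs from the previous stage."""
    return [stage for (_, _, prev), (_, stage, text) in zip(steps, steps[1:])
            if prev != text]
```

Lines that do not start with a step number, such as the `TUID:7` header, are skipped by the parser.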


Make Decisions about Data and Workflow

Workflow: User Generated Hotel Review, Pre-Process, Determine Optimal Technology Process & Workflow, Translate, Score Range

Score Range | Action
60-100      | Publish
40-59       | Light Post Edit
30-39       | Store for Later Re-Translation
0-29        | Discard

Follow-up steps: Feed Edits Back to Improve Engine; Re-Train Engine; Send for Re-Translation with Improved Engine
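The score ranges above amount to a simple routing rule. The sketch below maps a 0-100 quality score to an action. The pairing of 60-100 with Publish and 40-59 with Light Post Edit is stated on the slide; the pairing of 30-39 with storing for later re-translation and 0-29 with discarding is an assumption read off the slide layout. The function name is illustrative.

```python
def route_by_score(score: int) -> str:
    """Map a 0-100 quality score to a workflow action."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 60:
        return "Publish"
    if score >= 40:
        return "Light Post Edit"
    if score >= 30:
        # Assumed pairing: held back until an improved engine is available.
        return "Store for Later Re-Translation"
    # Assumed pairing: lowest band is dropped.
    return "Discard"
```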


Examples of NLP Tools

• Named Entity Recognition

• Name Translation

• Address Translation

• Syntax Parsing

• Part of Speech

• Capitalization

• Tokenize / Detokenize

• Sentence Segmentation

• Language ID and Encoding ID

• Domain ID / Categorization

• Spelling and Grammar Check

• Ngram Analysis

• Term Extraction and Generation

• Document Alignment

• Sentence Alignment

• Alignment Quality Analysis

• Data Mining and Manufacturing

• Word Lemmatization and Stemming

• Split Sentence Joining

• Smart Sentence Splitting

• Word Romanization and Transliteration

• Confidence Scoring

• Media Extraction (audio/images)

• Decompounding and Recompounding

• Smart Format and Data Conversions

• Sentence Boundary Detection

• Sentence Join Processing

• Multi-Source Data Synchronization

• Data Synthesis

• Data Manufacturing


Examples in E-Commerce

Protective Elbow Knee Pads Outdoor Sports Hunting Cycling Roller Skating Knee Pads Elbow Pads Support Adjustable Size For Scooter Skateboard Bicycle Rollerblades-Black : (Intl) –Intl

2016 Huarache Men and Women Running Shoes Breathable Sneakers Laced Couple Sport Shoes Outdoor Damping Mesh Shoes

LittleJump Organic Cotton Muslin Receiving Blanket Newborn baby swaddling blanket 47” x 47”, Girafe For Unisex – intl

6MM TPE Non-slip Yoga mats fitness Three parts environmental tasteless colchonetefitness yoga gym exercise mats (183*61*0.6 cm) - Purple


Putting This in Context

• Language Processing and Analysis has enabled a huge number of new technologies.

• What used to take weeks or months now takes days.

• In the context of MT engines, training data sizes have grown from 1-5 million sentences to hundreds of millions or billions of sentences.


Learning More – 21 Presentations, 17 Speakers, 3 Days - MONDAY

• M2: Text Analytics: Opportunities for Financial Services Firms
  • Bob Hayward, Chief Customer Officer, Search365

• M3: iflix’s Localization Journey – The marriage of Human and Machine
  • Alphie Larrieu, Technology Manager – Localization, iflix

• M4: Practical challenges in Large Scale Patent Machine Translation
  • Laura Rossi, Manager Language Technology Solutions, LexisNexis Univentio

• M5: Introduction to Language Studio
  • Jason Whittaker, Support Engineer, Omniscien Technologies

• M6: The Ethical Implications of Machine Translation
  • Renato Beninatto, Chief Executive Officer, Nimdzi

• M7: Moving Towards Augmented Translation – A Case Study
  • Jure Dernovsek, Solutions Engineer, memoQ


Learning More – 21 Presentations, 17 Speakers, 3 Days - TUESDAY

• T1: Driving Customer Engagements with Multilingual Chatbots - Moving from Customer Interactions to Customer Engagements
  • Lye King Tho, Watson Data & AI Leader, IBM Watson & Cloud Platform – ASEAN, IBM Watson

• T2: New Frontiers in MT and Post-Editing
  • Conor Bracken, CEO, Andovar

• T3: Taking a Product to China via Digital
  • Chris Morley, Chief Commercial Officer, Retail Global

• T4: Understanding the Benefits of Specialized Machine Translation and Language Processing Solutions
  • Dion Wiggins, Chief Technology Officer, Omniscien Technologies

• T5: Measuring Employee Engagement via AI and Psycholinguistic Analysis
  • Bruno Jakic, Co-Founder, KeenCorp

• T6: MAVERICK PRESENTATION - The Rise of the Machines
  • Bob Hayward + Dion Wiggins

• T7: Stop Reinventing the Wheel! The TAPICC Pre-Standardization Initiative for Translation APIs
  • Serge Gladkoff, Chief Executive Officer, Logrus Global


Learning More – 21 Presentations, 17 Speakers, 3 Days - WEDNESDAY

• W1: Nunc Est Tempus: Now Is the Time to Redesign Your Translation Business
  • Jaap van der Meer, Chief Executive Officer, TAUS

• W2: Transform Your Business with Omnichannel and Journey Analytics
  • Abby Monaco, Senior Product Marketing Manager at NICE Nexidia

• W3: Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context
  • Dr. Anthony Scriffignano, Senior Vice President & Chief Data Scientist, Dun & Bradstreet
  • Warwick Matthews, Senior Director of Identity Data Engineering, Dun & Bradstreet

• W4: Big Data and Domain Adaptation of Machine Translation
  • Dion Wiggins

• W5: Research in Translation - What Is Exciting and Shows Promise Ahead?
  • Philipp Koehn, Chief Scientist, Omniscien Technologies
  • Professor of Computer Science, Johns Hopkins University
