TRANSCRIPT
Copyright © 2017 Omniscien Technologies. All Rights Reserved.
AI, MT and Language Processing Symposium
Dion Wiggins is a highly experienced ICT industry visionary, entrepreneur, analyst and consultant. He has an impressive knowledge in the fields of software development, architecture and management, as well as an in-depth understanding of Asian ICT markets. He is an accomplished speaker and has a high media profile for his perceptive analysis of ICT in Asia/Pacific.
Previously Dion was Vice President and Research Director for Gartner based in Hong Kong, where he was the most senior and highly-respected analyst based in all of Asia. Dion's research reports on ICT in China helped change the way the world views this market.
Dion is also a well-known pioneer of the Asian Internet Industry, being the founder of one of Asia's first ever ISPs (Asia Online in Hong Kong). In his role at Gartner and in various other consulting positions prior to that, Dion advised literally hundreds of enterprises on their ICT strategy.
Dion was a founder of The ActiveX Factory, where he was recipient of the Chairman's Commendation Award presented by Microsoft's Bill Gates for the best showcase of software developed in the Philippines. The US Government has recognized Dion as being in the top 5% of his field worldwide and he is a former holder of a US O1 Extraordinary Ability Visa.
Speaker Overview
Dion Wiggins
CTO and Co-Founder
Omniscien Technologies
Language is highly complex. There is a running joke in the translation industry that machine translation will be a solved problem in 5 years. This has been updated every 5 years since the 1950s. The promise was always there, but the technology of the day simply could not deliver. However, real progress has been made.
This opening keynote presentation looks at the current state of language technology in the context of artificial intelligence, machine learning, machine translation and language processing, cutting through the hype to look at real world applications of these technologies in business today. The following 3 days are packed with industry experts, with this presentation acting as a primer ahead of more in-depth discussions.
Dion will introduce some key concepts on a range of topics including deep neural machine translation and artificial intelligence to outline how AI tools can be integrated into language processing workflows that break down language and language barriers. Finally, Dion will explore how language processing in both bilingual and monolingual contexts can find new information that is hidden from the human eye but is actually right in front of us to use in everyday business.
Found in Translation - Language Meets Technology
Dion Wiggins
CTO and Co-Founder
Omniscien Technologies
2 April 2018
Found in Translation – Language Meets Technology
Dion Wiggins
Chief Technology Officer
Machine Translation will be a Solved Problem in 5 Years
• Machine translation is improving.
• Different languages have different challenges.
• Technologies have evolved.
• The latest technology (Neural MT) has notably helped improve fluency.
• Toolkits abound.
• Many challenges still ahead.
Limitations of Today’s Machine Translation
• Context is a challenge.
• MT only has the context of the current sentence and does not understand the sentence before, after or other parts of the document.
Word Sense Disambiguation: “I went to the bank”
• I went to the ATM.
• I swam to the river bank.
• I banked my plane into a dive.
• I banked my car into a turn.
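Choosing a word sense from nearby words can be sketched as a simple overlap count. The example below is a minimal illustration in the spirit of dictionary-overlap methods, not a description of how any particular MT engine works. The sense labels and the cue words in `SENSE_CUES` are invented for this sketch.

```python
# Hypothetical cue words for three senses of "bank"; illustrative only.
SENSE_CUES = {
    "financial institution": {"atm", "money", "deposit", "cash", "account"},
    "river edge": {"river", "swam", "water", "shore", "fishing"},
    "tilt a vehicle": {"plane", "car", "turn", "dive", "wing"},
}


def disambiguate(sentence: str) -> str:
    """Pick the sense whose cue words overlap most with the sentence."""
    words = {w.strip(".,!?\"'").lower() for w in sentence.split()}
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue words there is no evidence for any sense.
    return best if scores[best] > 0 else "unknown"


for text in ["I swam to the river bank.", "I went to the bank"]:
    print(text, "->", disambiguate(text))
```

The second sentence comes back as "unknown": with only the current sentence as context, nothing tells the senses apart, which is the limitation this slide describes.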
The most ambiguous word in English is “Run”
• I ran for office
• I went for a run
• I ate bad food and got the runs
• I scored a home run
• The dry run went well
• He had a run of good luck
• The medication ran its course
• My stockings got a run
• The chicken run was big
• He was run over by a car
• …
Meanings
Ambiguity can be solved in part
• Words nearby
• The data that the MT engines were trained on
• This example has very focused, minimal data.
Life Sciences – ONLY 1 Million Sentences beats Google & Bing
Language Pair  MT Engine           BLEU   F-Measure  Levenshtein Distance  TER
EN-JA          Omniscien Deep NMT  48.01  77         14.78                 35.98
EN-JA          Omniscien NMT       36.65  70         19.24                 48.06
EN-JA          Google              31.74  66         20.30                 52.17
EN-JA          Bing                23.00  60         24.65                 61.58
JA-EN          Omniscien Deep NMT  33.92  70         39.05                 49.59
JA-EN          Omniscien NMT       28.82  67         44.77                 56.41
JA-EN          Google              26.80  65         43.38                 55.58
JA-EN          Bing                17.32  56         53.65                 65.97
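When reading these scores, note that BLEU and F-Measure reward overlap with a reference translation (higher is better), while Levenshtein Distance and TER count edits needed to reach the reference (lower is better). The sketch below shows a word-level edit distance and a simple edit rate. It is an illustration under assumptions: the slides do not say whether their Levenshtein figure is computed over words or characters, and real TER also allows block shifts, which this sketch omits. The function names are hypothetical.

```python
def levenshtein(ref: list[str], hyp: list[str]) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn hyp into ref (classic dynamic programme)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hyp
                            curr[j - 1] + 1,      # extra word in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def edit_rate(reference: str, hypothesis: str) -> float:
    """Edits per reference word, as a percentage (TER-like, without shifts)."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * levenshtein(ref, hyp) / max(len(ref), 1)


print(edit_rate("the cat sat on the mat", "the cat is on mat"))
```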
Patents – 12 Million Sentences
• Patents is a very complex domain.
• Google and Omniscien have about the same volume of EN-DE bilingual patent data.
• Google also has many other domains mixed in.
• Omniscien focused the engine on Patent translations ONLY.
• Too many mixed domains confuse context and lower quality.
Language Pair  MT Engine           BLEU   F-Measure  Levenshtein Distance  TER
EN-DE          Omniscien Deep NMT  43.10  72         58.99                 38.94
EN-DE          Google              39.52  69         64.05                 42.22
DE-EN          Omniscien Deep NMT  58.80  81         40.47                 27.08
DE-EN          Google              52.10  78         51.28                 32.37
Learning in Machine Translation
• Various Approaches
  • Rule-based (1970s)
  • Word-based (1990s)
  • Phrase-based (2000s)
  • Syntax-based (2010s)
  • Neural-based (2017+)
• Common Approach:
  • Probabilistic Estimation (SMT uses this approach)
Today data-driven approaches dominate machine translation
[Figure: translation pyramid from Source to Target, with levels Lexical Transfer, Syntax Transfer, Semantic Transfer and Interlingua, learned from Training Data]
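Probabilistic estimation from training data can be illustrated with the simplest case: a relative-frequency estimate of how often a source phrase was translated a given way. This is a minimal sketch with invented toy pairs and a hypothetical function name. Production SMT systems combine many such estimates with further models.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(pairs):
    """Relative-frequency estimate:
    p(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        probs[src][tgt] = n / source_counts[src]
    return dict(probs)


# Toy aligned pairs (invented); real training data has millions of them.
pairs = [("bank", "banque"), ("bank", "banque"), ("bank", "rive"), ("run", "courir")]
probs = estimate_translation_probs(pairs)
print(probs["bank"])
```

More in-domain training data sharpens these estimates, which is why focused data can outperform a larger mixed collection.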
Machine Learning vs Neural Learning
Machine Learning
• Approach
  • Analyze problem
  • Feature engineering (coded by a programmer)
• Machine Translation Example
  • What features are relevant for word order?
  • What features are relevant for lexical translation?
Neural Learning
• Promise
  • No more feature engineering
• Neural Learning
  • Discovers the features automatically
  • Learns how to process features
[Diagrams: Machine Learning and Neural Learning, each shown as Input → Features → Output]
Shallow NMT vs Deep NMT
Shallow NMT layers: Input, Hidden, Output
Deep NMT layers: Input, Hidden 1, Hidden 2, Hidden 3, Output
• More layers
• More complex feature interactions
Statistical Vs Neural
                             Neural Machine Translation                          Statistical Machine Translation
Training Time                12 Hours - 5 Days                                   12 Hours
Training Data                20-100 million+ Sentences                           1-5 Million Sentences
Translation (Decoding Time)  50,000 Words Per Minute (WPM)                       3,000 WPM
Space on Disk                Less (1GB)                                          More (4-70GB)
Hardware                     GPU                                                 CPU
Mechanism                    Sentence by Sentence                                Word by Word / Phrase by Phrase
                             Attentional encoder-decoder networks; optimization  Statistical Analysis / Probability
                             Train multiple features jointly                     Feature engineering required
Statistical Vs Neural
Neural Machine Translation Statistical Machine Translation
Interpretability
Long Distance Reordering
Morphology, Syntax and Agreement Errors
Tolerance of Noisy Data
Tolerance of Out of Domain Data
Multilingual/Multi Domain Translation
Handling of Rare Words
Runtime Control
Short Phrases (1-3 Words)
The Latest Technology is Not Always the Best Solution
• Together SMT and Deep NMT deliver significantly higher quality translations than either does independently.
• A hybrid solution leverages the strengths and mitigates the weaknesses of both technologies.
• Language Studio integrates Deep NMT and SMT as a seamless offering.
[Diagram: Statistical and Neural combined into Hybrid]
Hybrid NMT / SMT Translation Bridges the Gap
Process applies to both NMT and Deep NMT.
1. SMT and NMT work hand in hand.
2. SMT is known to do better on short content (1-2 words), which can be sent directly to SMT without the need for NMT.
3. NMT output translation quality is measured.
4. If quality is below a defined quality bar, then send to SMT.
5. The best scoring translation is selected or both outputs can be merged into a single translation output.
6. Optimal settings are determined via an automated process or can be hand tuned for specific cases.
[Flowchart: Source Input, "Long or Short?" decision, NMT, "Low Quality?" decision, SMT, Select Best Score or Merge, Target Output]
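The routing logic in the numbered steps can be sketched as a single function. This is an illustration under stated assumptions, not the Language Studio implementation: `smt`, `nmt` and `score` are hypothetical callables supplied by the caller, the default thresholds are invented, and the merge option from step 5 is left out in favour of picking the best score.

```python
from typing import Callable


def hybrid_translate(
    source: str,
    smt: Callable[[str], str],           # hypothetical SMT engine
    nmt: Callable[[str], str],           # hypothetical NMT engine
    score: Callable[[str, str], float],  # hypothetical quality estimator
    short_max_words: int = 2,            # assumed cut-off for "short" content
    quality_bar: float = 0.5,            # assumed quality bar
) -> str:
    # Step 2: very short content goes directly to SMT.
    if len(source.split()) <= short_max_words:
        return smt(source)
    # Step 3: translate with NMT and measure the output quality.
    nmt_out = nmt(source)
    nmt_score = score(source, nmt_out)
    if nmt_score >= quality_bar:
        return nmt_out
    # Steps 4-5: below the bar, also try SMT and keep the better-scoring output.
    smt_out = smt(source)
    return smt_out if score(source, smt_out) > nmt_score else nmt_out
```

Step 6 corresponds to tuning `short_max_words` and `quality_bar`, either automatically or by hand for specific cases.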
AI Beyond Machine Translation – Language Processing
• Voice Recognition
  • Streams of words with no sentence boundaries or punctuation.
Raw stream: The temperature at the beach is hot today the sun is out people are gathering their beach gear and heading for a day in the sand and water
With boundaries and punctuation restored: The temperature at the beach is hot today. The sun is out. People are gathering their beach gear and heading for a day in the sand and water.
Named Entity Recognition
• Break out information that can be processed further
• Make decisions about content
• Pre-Analyze content before translation or further processing
Accommodation
Age / Age Group
Aggression
Brand Name
Cast Member
Chemical
Color
Credit Card
Date
Direction
Distance
File Name / Path
Food
Formula / Equation
Gender
ID Number
IP Address
Lat/Long
Location
Medical Condition
Medication / Drug
Metric Unit
Money
Nationality
Number
Occupation
Organization
Person
Phone
Product
Product Code
Quantity
Relationship
Religion
Size / Size Range
Social Media ID
Temperature
Terms
Time / Time Range
Title / Honorific
Transport / Travel
URL / Website
Weapon
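A few of these entity types follow surface patterns that can be picked out with regular expressions. The sketch below covers four of them as an illustration only. The patterns are deliberately simplified assumptions (the IP pattern does not check that each number is 0-255, for example), and types such as Person or Organization generally need trained models rather than patterns.

```python
import re

# Simplified, illustrative patterns for four of the entity types listed above.
PATTERNS = {
    "IP Address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "URL / Website": re.compile(r"https?://[^\s<>\"]+"),
    "Temperature": re.compile(r"-?\d+(?:\.\d+)?\s?°[CF]\b"),
    "Money": re.compile(r"[$€£]\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def tag_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity type, matched text) pairs found in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


sample = "Server 192.168.0.1 at https://example.com costs $1,200.50 and runs at 35 °C."
for label, value in tag_entities(sample):
    print(label, "->", value)
```

Once entities are tagged, later steps can decide how to treat them, for example leaving a URL untranslated or converting a temperature.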
Data Enrichment
[Figure: data enrichment example containing the Arabic text "العراق كردستان" (Iraq, Kurdistan)]
Meta Data Enrichment and Classification
Sentiment Analysis
• Are your customers happy?
• Do users like my product?
• Did you make a cultural error in a market?
• Are your staff engaged?
• Are there problems you need to address?
High Quality Data to Learn From is King
Web Crawl
RSS Feeds
Document Align
Sentence Align
Sub-Segmenting Data – Domain Categorization
• The data is out there, if only it could be classified and understood.
Big Bucket of Mixed Data
Domain ID
News
Finance
Life Sciences
Named Entity Tagging
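Splitting a mixed bucket of data into domains can be sketched with a keyword-overlap classifier. This is a toy illustration: the keyword sets in `DOMAIN_KEYWORDS` are invented, and a practical domain identifier would more likely be a trained text classifier.

```python
import re
from collections import defaultdict

# Invented keyword sets for the three domains named on the slide.
DOMAIN_KEYWORDS = {
    "News": {"election", "minister", "reported", "government"},
    "Finance": {"shares", "dividend", "equity", "interest", "bond"},
    "Life Sciences": {"clinical", "dose", "patients", "protein", "trial"},
}


def identify_domain(sentence: str) -> str:
    """Assign the domain whose keywords overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {name: len(keys & words) for name, keys in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unclassified"


def split_by_domain(sentences: list[str]) -> dict[str, list[str]]:
    """Sort a mixed bucket of sentences into per-domain buckets."""
    buckets = defaultdict(list)
    for sentence in sentences:
        buckets[identify_domain(sentence)].append(sentence)
    return dict(buckets)


mixed = ["Patients in the clinical trial received a lower dose.",
         "The bond paid a dividend.",
         "It's gonna be epic."]
print(split_by_domain(mixed))
```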
Track content as it is processed and apply runtime changes
TUID:7
05:etu It's gonna be epic.
09:ntt It's gonna be epic.
11:glo It's <aotran type="glo" translation="akan">gonna</aotran> be epic.
13:tok it 's <a translation="akan">gonna</a> be epic <wall/> <a translation=".">.</a>
15:etm it 's <a translation="akan">gonna</a> be epic <wall/> <a translation=".">.</a>
16:tx ia |0-1,0, 0-0 | akan |2-2,0, | menjadi |3-3,0, 0-0 | epik |4-4,0, 0-0 | . |5-5,0, |
17:txu ia |0-1,0, 0-0 | akan |2-2,0, | menjadi |3-3,0, 0-0 | epik |4-4,0, 0-0 | . |5-5,0, |
19:cap Ia akan menjadi epik .
21:rim Ia akan menjadi epik .
22:det Ia akan menjadi epik.
24:pta Ia akan menjadi epik.
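The idea of tracking a translation unit through each processing stage can be sketched as a pipeline that records its output after every step. This is a minimal illustration: the stage names and functions below are hypothetical stand-ins, not the stage codes shown in the log above, and the translation step itself is omitted.

```python
import re


def tokenize(text: str) -> str:
    # Split sentence-final punctuation off as its own token.
    return re.sub(r"([.!?])$", r" \1", text)


def lowercase(text: str) -> str:
    return text.lower()


def capitalize(text: str) -> str:
    return text[:1].upper() + text[1:]


def detokenize(text: str) -> str:
    # Re-attach punctuation to the preceding word.
    return re.sub(r"\s+([.!?,])", r"\1", text)


# Hypothetical stage list; a real pipeline has many more steps.
STAGES = [("tokenize", tokenize), ("lowercase", lowercase),
          ("capitalize", capitalize), ("detokenize", detokenize)]


def run_with_trace(tuid: int, text: str) -> list[tuple[int, str, str]]:
    """Run every stage and record (unit ID, stage name, output) after each."""
    trace = [(tuid, "input", text)]
    for name, step in STAGES:
        text = step(text)
        trace.append((tuid, name, text))
    return trace


for tuid, stage, output in run_with_trace(7, "It's gonna be epic."):
    print(f"TUID:{tuid} {stage}: {output}")
```

Keeping the intermediate output of each stage makes it possible to see where a problem was introduced and to apply a change at exactly that stage.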
Make Decisions about Data and Workflow
[Workflow diagram: User Generated Hotel Review → Pre-Process → Determine Optimal Technology, Process & Workflow → Translate → Score Range]
Score Range
60-100: Publish
40-59: Light Post Edit
30-39: Store for Later Re-Translation
0-29: Discard
Follow-up steps: Feed Edits Back to Improve Engine; Re-Train Engine; Send for Re-Translation with Improved Engine
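Routing content by its score range can be sketched as a small lookup function. The 60-100 and 40-59 bands are labeled on the slide. The mapping of 30-39 to storing for later re-translation and 0-29 to discarding is an assumption drawn from the slide layout, and the function name is hypothetical.

```python
def route_by_score(score: int) -> str:
    """Map a translation quality score (0-100) to a workflow action."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 60:
        return "Publish"
    if score >= 40:
        return "Light Post Edit"
    if score >= 30:
        # Assumed mapping: keep the content until an improved engine is available.
        return "Store for Later Re-Translation"
    # Assumed mapping: lowest band is not worth keeping.
    return "Discard"


for s in (85, 45, 33, 10):
    print(s, "->", route_by_score(s))
```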
Examples of NLP Tools
• Named Entity Recognition
• Name Translation
• Address Translation
• Syntax Parsing
• Part of Speech
• Capitalization
• Tokenize / Detokenize
• Sentence Segmentation
• Language ID and Encoding ID
• Domain ID / Categorization
• Spelling and Grammar Check
• Ngram Analysis
• Term Extraction and Generation
• Document Alignment
• Sentence Alignment
• Alignment Quality Analysis
• Data Mining and Manufacturing
• Word Lemmatization and Stemming
• Split Sentence Joining
• Smart Sentence Splitting
• Word Romanization and Transliteration
• Confidence Scoring
• Media Extraction (audio/images)
• Decompounding and Recompounding
• Smart Format and Data Conversions
• Sentence Boundary Detection
• Sentence Join Processing
• Multi-Source Data Synchronization
• Data Synthesis
• Data Manufacturing
Examples in E-Commerce
Protective Elbow Knee Pads Outdoor Sports Hunting Cycling Roller Skating Knee Pads Elbow Pads Support Adjustable Size For Scooter Skateboard Bicycle Rollerblades-Black : (Intl) –Intl
2016 Huarache Men and Women Running Shoes Breathable Sneakers Laced Couple Sport Shoes Outdoor Damping Mesh Shoes
LittleJump Organic Cotton Muslin Receiving Blanket Newborn baby swaddling blanket 47” x 47”, Girafe For Unisex – intl
6MM TPE Non-slip Yoga mats fitness Three parts environmental tasteless colchonetefitness yoga gym exercise mats (183*61*0.6 cm) - Purple
Putting This in Context
• Language Processing and Analysis has enabled a huge number of new technologies.
• What used to take weeks or months now takes days.
• In the context of MT engines, training data sizes have grown from 1-5 million sentences to hundreds of millions or billions of sentences.
Learning More – 21 Presentations, 17 Speakers, 3 Days - MONDAY
• M2: Text Analytics: Opportunities for Financial Services Firms
  • Bob Hayward, Chief Customer Officer, Search365
• M3: iflix’s Localization Journey – The marriage of Human and Machine
  • Alphie Larrieu, Technology Manager – Localization, iflix
• M4: Practical challenges in Large Scale Patent Machine Translation
  • Laura Rossi, Manager Language Technology Solutions, LexisNexis Univentio
• M5: Introduction to Language Studio
  • Jason Whittaker, Support Engineer, Omniscien Technologies
• M6: The Ethical Implications of Machine Translation
  • Renato Beninatto, Chief Executive Officer, Nimdzi
• M7: Moving Towards Augmented Translation – A Case Study
  • Jure Dernovsek, Solutions Engineer, memoQ
Learning More – 21 Presentations, 17 Speakers, 3 Days - TUESDAY
• T1: Driving Customer Engagements with Multilingual Chatbots - Moving from Customer Interactions to Customer Engagements
  • Lye King Tho, Watson Data & AI Leader, IBM Watson & Cloud Platform – ASEAN, IBM Watson
• T2: New Frontiers in MT and Post-Editing
  • Conor Bracken, CEO, Andovar
• T3: Taking a Product to China via Digital
  • Chris Morley, Chief Commercial Officer, Retail Global
• T4: Understanding the Benefits of Specialized Machine Translation and Language Processing Solutions
  • Dion Wiggins, Chief Technology Officer, Omniscien Technologies
• T5: Measuring Employee Engagement via AI and Psycholinguistic Analysis
  • Bruno Jakic, Co-Founder, KeenCorp
• T6: MAVERICK PRESENTATION - The Rise of the Machines
  • Bob Hayward + Dion Wiggins
• T7: Stop Reinventing the Wheel! The TAPICC Pre-Standardization Initiative for Translation APIs
  • Serge Gladkoff, Chief Executive Officer, Logrus Global
Learning More – 21 Presentations, 17 Speakers, 3 Days - WEDNESDAY
• W1: Nunc Est Tempus: Now Is the Time to Redesign Your Translation Business
  • Jaap van der Meer, Chief Executive Officer, TAUS
• W2: Transform Your Business with Omnichannel and Journey Analytics
  • Abby Monaco, Senior Product Marketing Manager at NICE Nexidia
• W3: Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context
  • Dr. Anthony Scriffignano, Senior Vice President & Chief Data Scientist, Dun & Bradstreet
  • Warwick Matthews, Senior Director of Identity Data Engineering, Dun & Bradstreet
• W4: Big Data and Domain Adaptation of Machine Translation
  • Dion Wiggins
• W5: Research in Translation - What Is Exciting and Shows Promise Ahead?
  • Philipp Koehn, Chief Scientist, Omniscien Technologies; Professor of Computer Science, Johns Hopkins University