dual learning: algorithms and applications · estimated labeling cost for one language pair 7000...

135
Dual Learning: Algorithms and Applications Tao Qin Senior Research Manager Microsoft Research Asia 11/14/2018 1 Tao Qin - ACML 2018

Upload: others

Post on 21-May-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Learning: Algorithms and Applications

Tao Qin

Senior Research Manager

Microsoft Research Asia

11/14/20181Tao Qin - ACML 2018

Page 2: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Outline

1. Motivation and basic concept

2. Dual learning from unlabeled data

3. Dual learning from labeled data

4. More applications

5. Summary and outlook

11/14/2018 2Tao Qin - ACML 2018

Page 3: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Deep learning is making breakthroughs

11/14/2018 3Tao Qin - ACML 2018

Page 4: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Spider

COMPUTER VISION

Page 5: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Switchboard

Switchboard cellular

Meeting speech

Microsoft, Switchboard

IBM, Switchboard

Broadcast speech

5.1%

SPEECH

Page 6: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

NATURAL LANGUAGE

Page 7: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Three Pillars of Deep Learning

• Big models: 1000+ layers, tens of billions of parameters

• Big data: web pages, search logs, social networks, and new mechanisms for data collection: conversation and crowdsourcing

• Big computing: CPU clusters, GPU clusters, FPGA farms, provided by Amazon, Azure, Ali etc.

11/14/2018 Tao Qin - ACML 2018 7

Page 8: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Deep learning is facing many challenges

11/14/2018 8Tao Qin - ACML 2018

Page 9: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Big-Data Challenge

• Today’s deep learning highly relies on huge amount of human-labeled training data

Tasks Typical training data

Image classification Millions of labeled images

Speech recognition Thousands of hours of annotated voice data

Machine translation Tens of millions of bilingual sentence pairs

Human labeling is in general very expensive, and it is hard, if not impossible, to obtain large-scale labeled data for rare domains.

11/14/2018 9Tao Qin - ACML 2018

Page 10: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Cost Estimation for Machine Translation

$0.075 × 30 × 10,000,000 = $22.5𝑀

Cost per word: $0.05-0.10

Average length of a sentence

Assume 10M sentences to translate

Estimated labeling cost for one language pair

❑ 7000 different languages that are spoken around the world❑ The 100-th largest language has over 7 million native speakers

100 × 99

2× $22.5𝑀 ≈ $113𝐵

Number of language pairs for top 100 languages

11/14/2018 10Tao Qin - ACML 2018

Page 11: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Our proposal: Dual Learning

11/14/2018 11Tao Qin - ACML 2018

Page 12: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

The Beauty of Symmetry

• Symmetry is almost everywhere in our world!

11/14/2018 12Tao Qin - ACML 2018

Page 13: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Duality in Machine Translation

Welcome to Beijing! 欢迎来到北京!

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

English->Chinese translation

Chinese->English translation

11/14/2018 13Tao Qin - ACML 2018

Page 14: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Duality in Speech Processing

Welcome to Beijing!

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

Speech recognition

Speech synthesis

11/14/2018 14Tao Qin - ACML 2018

Page 15: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Duality in Image Processing

girl in pink dress is jumping in air.

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

Image captioning

Image generation

11/14/2018 15Tao Qin - ACML 2018

Page 16: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Duality in Question Answering and Generation

for what purpose do organisms make peroxide and superoxide ?

Parts of the immune system of higher organisms create peroxide , superoxide , and singlet oxygen to

destroy invading microbes .

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

Question answering

Question generation

11/14/2018 16Tao Qin - ACML 2018

Page 17: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Duality in Search and Advertising

AmazonShopping

Amazon.com

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

Search: find webpages for a given query

Advertising: suggest keywords for a given webpage

11/14/2018 17Tao Qin - ACML 2018

Page 18: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Structural Duality in AI

• Structural duality is very common in artificial intelligence

AI Tasks X→ Y Y → XMachine translation Translation from language EN to CH Translation from language CH to EN

Speech processing Speech recognition Text to speech

Image understanding Image captioning Image generation

Conversation Question answering Question generation (e.g., Jeopardy!)

Search engine Query-document matching Query/keyword suggestion

Primal Task Dual TaskCurrently most machine learning algorithms do not exploit structure duality for training and inference.

11/14/2018 18Tao Qin - ACML 2018

Page 19: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Learning

• A new learning framework that leverages the symmetric (primal-dual) structure of AI tasks to obtain effective feedback or regularization signals to enhance the learning/inference process.

11/14/2018 19Tao Qin - ACML 2018

Page 20: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Learning from

Unlabeled Data

• Algorithms for machine translation

• Dual unsupervised learning

• Dual transfer learning

• Unsupervised machine translation

• Algorithms for image translation

• DualGAN, CycleGAN, DiscoGAN

• Conditional image translation

11/14/2018 20Tao Qin - ACML 2018

Page 21: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Why Machine Translation?

• Of great business value

https://nimdzi.com/wp-content/uploads/2018/03/2018-Nimdzi-100-First-Edition.pdf

11/14/2018 21Tao Qin - ACML 2018

Page 22: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Why Machine Translation?

• Perfectly fits into the setting of dual learning• There is no information loss in X->Y and Y->X mapping

• A very challenging AI task and a hot research direction• in NLP conferences, e.g., ACL, EMNLP, NAACL, …

• in ML conferences, e.g., NIPS, ICML, ICLR, …

• in AI conferences, e.g., IJCAI, AAAI, …

• Dedicated conferences for MT• 17th Machine Translation Summit

• 3rd Conference on Machine Translation (WMT18)

11/14/2018 22Tao Qin - ACML 2018

Page 23: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Neural Machine Translation

11/14/2018 23Tao Qin - ACML 2018

Page 24: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Encoder-Decoder for sequence generation

𝒆=(Economic, growth, has, slowed, down, in, recent, years,.)

𝒇=( )经济, 发展, 变, 慢, 了, .近, 几年,

Encoder

Decoder

-0.2 0.9 -0.1 0.50.7 0.0 0.2

11/14/2018 24Tao Qin - ACML 2018

Page 25: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Decoder Recurrent Neural Network

Encoder Recurrent Neural Network

Encoder-Decoder for machine translation

𝒛𝒊

𝒖𝒊Wo

rdSa

mp

leR

ecu

rren

tSt

ate

𝒇=( )

𝒔𝒊Sou

rce

Stat

e

𝒘𝒊Sou

rce

Wo

rdD

ecod

erEn

cod

er

Sutskever et al., NIPS, 2014

经济, 发展, 变, 慢, 了, .近, 几年,

(1)

(2)(3)

𝒆=(Economic, growth, has, slowed, down, in, recent, years,.)

11/14/2018 25Tao Qin - ACML 2018

Page 26: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Attention based Encoder-Decoder

𝒛𝒊

𝒖𝒊

𝒄𝒊

𝒉𝒋

Wo

rdSa

mp

leR

ecu

rren

tSt

ate

Co

nte

xtV

ecto

rSo

urc

eV

ecto

rs

Attention Weight

Enco

der

Atten

tion

𝒇=( )

Deco

der

近, 几年,

𝒆=(Economic, growth, has, slowed, down, in, recent, years,.)

Bahdanau et al., ICLR, 2015

11/14/2018 26Tao Qin - ACML 2018

Page 27: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Attention based Encoder-Decoder

Bahdanau et al., ICLR, 2015

𝒛𝒊

𝒖𝒊

𝒄𝒊

𝒉𝒋

Wo

rdSa

mp

leR

ecu

rren

tSt

ate

Co

nte

xt

vec

tor

Sou

rce

Vec

tors

Enco

der

Atten

tion

𝒇=( )

Deco

der

发展, 变, 慢, 了, .近, 几年,

⨀ Attention Weight ⊕

经济,

𝒆=(Economic, growth, has, slowed, down, in, recent, years,.)

11/14/2018 27Tao Qin - ACML 2018

Page 28: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

If you don’t have enough labeled data for training,

Dual Unsupervised Learning can leverage structural duality to learn from unlabeled data

NIPS 2016

11/14/2018 28Tao Qin - ACML 2018

Page 29: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Unsupervised Learning

Ch->En translation

English sentence 𝑥 Chinese sentence𝑦 = 𝑓(𝑥)New English sentence

𝑥′ = 𝑔(𝑦)

Feedback signals during the loop:• 𝑅 𝑥, 𝑓, 𝑔 = 𝑠 𝑥, 𝑥′ : BLEU of 𝑥′ given 𝑥 . • 𝐿(𝑦) and 𝐿(𝑥′): Language model of 𝑦 and 𝑥’

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

Agent Agent

En->Ch translation

Reinforcement Learning algorithms can be used to improve both primal and dual models according to feedback signals

11/14/2018 29Tao Qin - ACML 2018

Page 30: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Learning with Policy Gradient

• Basic idea• If a large reward is observed for an action, update the policy (e.g., models 𝑓, 𝑔) towards increasing the probability of the action; otherwise, update the policy towards decreasing the probability of the action

• Algorithm• Compute the gradient Δ𝑓, Δ𝑔 of the two models 𝑓 and 𝑔.

• If the feedback is positive, e.g., 𝑠(𝑥, 𝑥′), 𝐿(𝑥′), and 𝐿(𝑦) are larger than certain thresholds, update the models as 𝑓 = 𝑓 + 𝛼Δ𝑓, 𝑔 = 𝑔 + 𝛼Δ𝑔

• If the feedback is negative, e.g., 𝑠(𝑥, 𝑥′), 𝐿(𝑥′), and 𝐿(𝑦) are smaller than certain thresholds, update the models as 𝑓 = 𝑓 − 𝛼Δ𝑓, 𝑔 = 𝑔 − 𝛼Δ𝑔

11/14/2018 30Tao Qin - ACML 2018

Page 31: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Experiment

• Machine translation as a first playground• Translation between English <–> France

• Benchmarked aligned data set

• Benchmarked monolingual data set

• Baseline algorithm : • LSTM based neural machine translation model (NMT), ICLR 2015, “Neural

Machine Translation by Jointly Learning to Align and Translate”, from Y. Bengio’s group

11/14/2018 31Tao Qin - ACML 2018

Page 32: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Experimental

• Our algorithm• Step 1: Warm start models

• 5-day trained nmt model with 10% training data

• Step 2: Self-play with monolingual data from the warm start model using reinforcement learning

• We use policy gradient algorithm in reinforcement learning, and continue training for 2 days.

11/14/2018 32Tao Qin - ACML 2018

Page 33: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Experimental Results

NMT with 10%bilingual data

Dual learning with10% bilingual data

NMT with 100%bilingual data

BLEU score: French->English

↑ 0.3 (1%)

↓ 5.0 (17%)

Starting from initial models obtained from only 10% bilingual data, dual learning can achieve similar accuracy as the NMT model learned from 100% bilingual data!

11/14/2018 33Tao Qin - ACML 2018

Page 34: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Extension to Multiple Associated Tasks

• The idea of dual learning can be extended to more than two associated tasks, as long as they can provide informative feedback signals from a closed loop.

English

Chinese

Japanese Text

Speech

Image

11/14/2018 34Tao Qin - ACML 2018

Page 35: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Comparison

Dual learning: automatically generate reinforcement feedback for unlabeled data, multiple tasks involved.

Unsupervised/semi-supervised learning:no feedback signals for unlabeled data, only one task.

Multi-task learning: multiple tasks share the same representation.

Dual learning: dual tasks don’t need to share representations, only if the loop is closed.

Transfer learning: use auxiliary tasks to boostthe target task.

Dual learning: all the tasks are mutually and simultaneously boosted.

Co-training: only one task, assuming different feature sets that provide complementary information about the instance .

Dual learning: multiple tasks involved, no assumption on feature set.

11/14/2018 35Tao Qin - ACML 2018

Page 36: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Virtual Duality: GANs

Noise 𝑥 Generated image𝑦 = 𝑓(𝑥)

Natural or generated

Discriminator 𝑔

Generator 𝑓

• The generator receives a reward signal from the discriminator letting it know whether the generated data is real or not.

• Feedback signal: 𝑅 𝑥, 𝑓, 𝑔 = 𝑔 𝑦 = 𝑔(𝑓 𝑥 )

Random vector

Generator

Sample

Real-world images

Sample

Discriminator

Fake? Real?

Loss

11/14/2018 36Tao Qin - ACML 2018

Page 37: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

If you have one well-trained model,

Dual Transfer Learningcan leverage it and unlabeled data to improve the other model

AAAI 2018

11/14/2018 37Tao Qin - ACML 2018

Page 38: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Leverage Unlabeled Data

• Starting point: 𝑃 𝑦 = 𝐸𝑥~𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓

• Standard learning: for labeled data

max𝑓

𝑥,𝑦 ∈ℒ

log 𝑃(𝑦|𝑥; 𝑓)

• New objective: for unlabeled data

min𝑓

𝑦∈𝜘

𝑃 𝑦 − 𝐸𝑥~𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓2

11/14/2018 38Tao Qin - ACML 2018

Page 39: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Tech Challenges

• How to efficiently compute 𝐸𝑥~𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓 ?• Exponentially many possible 𝑥’s

• Cannot enumerate all of them

• Sampling?• Naïve sampling does not work

𝐸𝑥~𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓 =

𝑥

𝑃 𝑦 𝑥; 𝑓 𝑃(𝑥)

11/14/2018 39Tao Qin - ACML 2018

Page 40: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Our Solution

𝐸𝑥~𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓 =

𝑥

𝑃 𝑦 𝑥; 𝑓 𝑃(𝑥)

=

𝑥

𝑃 𝑦 𝑥; 𝑓 𝑃 𝑥

𝑃 𝑥 𝑦; 𝑔𝑃(𝑥|𝑦; 𝑔)

= 𝐸𝑥~𝑃 𝑥 𝑦; 𝑔𝑃 𝑦 𝑥; 𝑓 𝑃 𝑥

𝑃 𝑥 𝑦; 𝑔

≈1

𝐾

𝑖=1

𝐾𝑃 𝑦 𝑥𝑖; 𝑓 𝑃 𝑥𝑖𝑃 𝑥𝑖 𝑦; 𝑔

, 𝑥𝑖~𝑃(𝑥|𝑦; 𝑔)

Use the dual model 𝑔 for importance sampling11/14/2018 40Tao Qin - ACML 2018

Page 41: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Experimental Results

En->Fr De->En

↑2.93 (10%) ↑1.36 (4%)

11/14/2018 41Tao Qin - ACML 2018

Page 42: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

So, unsupervised machine translation

Many language pairs have no labeled data

There are more than 7000 languages in our planet today!

11/14/2018 42Tao Qin - ACML 2018

Page 43: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Unsupervised Machine Translation machine translation with zero labeled data

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho, ICLR 2018

Credit: Figures in this section come from the paper.

11/14/2018 43Tao Qin - ACML 2018

Page 44: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

System Architecture

• Dual structure• E.g., French<->English

• Shared encoder• Only one encoder works for both

French and English

• Fixed embeddings in the encoder

• Pre-train cross-lingual word embeddings

• Keep fixed during training

11/14/2018 44Tao Qin - ACML 2018

Page 45: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Unsupervised Training

• Autoencoder with self-reconstruction loss

• X->Z->X

• Y->Z->Y

• Dual translation with back-reconstruction loss

• X->Z->Y->Z->X

• Y->Z->X->Z->Y

11/14/2018 45Tao Qin - ACML 2018

Page 46: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Unsupervised Training

• With attention model, it is easy to obtain a trivial autoencoder

• Simple copy operation

• Denoising autoencoder• n(X)->Z->X

• n(Y)->Z->Y

• Making random swaps between contiguous words

11/14/2018 46Tao Qin - ACML 2018

Page 47: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Results

11/14/2018 47Tao Qin - ACML 2018

Page 48: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Unsupervised Neural Machine Translation Using Monolingual Corpora Only

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato, ICLR 2018

Credit: Figures in this section come from the paper.

11/14/2018 48Tao Qin - ACML 2018

Page 49: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Key Ideas

Autoencoder

Dual translation

11/14/2018 49Tao Qin - ACML 2018

Page 50: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Unsupervised Training

• Denoising autoencoder

• Cross-domain dual translation

• Adversary training

11/14/2018 50Tao Qin - ACML 2018

Page 51: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Unsupervised Training

11/14/2018 51Tao Qin - ACML 2018

Page 52: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Final Total Objectives

11/14/2018 52Tao Qin - ACML 2018

Page 53: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Results

11/14/2018 53Tao Qin - ACML 2018

Page 54: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Learning from

Unlabeled Data

• Algorithms for machine translation

• Dual unsupervised learning

• Dual transfer learning

• Unsupervised machine translation

• Algorithms for image translation

• DualGAN/CycleGAN/DiscoGAN

• Face attribute manipulation

• Face aging

• Conditional image translation

11/14/2018 54Tao Qin - ACML 2018

Page 55: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Image-to-Image Translation

• im2im is to translate one image from source domain to target domain

Style 𝒜 to ℬ

Day to night

1. http://funny.pho.to/day-to-night-effect/2. https://turbo.deepart.io/

11/14/2018 55Tao Qin - ACML 2018

Page 56: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

DualGAN: Unsupervised Dual Learning for Image-to-Image Translation

Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, ICCV 2017

Credit: Figures in this section come from the paper.

11/14/2018 56Tao Qin - ACML 2018

Page 57: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

System Architecture

11/14/2018 57Tao Qin - ACML 2018

Page 58: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Day ➔ Night

11/14/2018 58Tao Qin - ACML 2018

Page 59: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Label ➔ Facade

11/14/2018 59Tao Qin - ACML 2018

Page 60: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Photo ➔ Sketch

11/14/2018 60Tao Qin - ACML 2018

Page 61: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Sketch ➔ Photo

11/14/2018 61Tao Qin - ACML 2018

Page 62: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Chinese paintings ➔ oil paintings

11/14/2018 62Tao Qin - ACML 2018

Page 63: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

11/14/2018 63Tao Qin - ACML 2018

Page 64: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

11/14/2018 64Tao Qin - ACML 2018

Page 65: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Papers with the Same Idea

• Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, ICCV 2017

• Learning to Discover Cross-Domain Relations with Generative Adversarial Networks, ICML 2017

11/14/2018 65Tao Qin - ACML 2018

Page 66: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

11/14/2018 66Tao Qin - ACML 2018

Page 67: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

11/14/2018 67Tao Qin - ACML 2018

Page 68: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Learning Residual Images for Face Attribute Manipulation

Wei Shen, Rujie Liu, CVPR 2017

Credit: Figures in this section come from the paper.

11/14/2018 68Tao Qin - ACML 2018

Page 69: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Face Attribute Manipulation

11/14/2018 69Tao Qin - ACML 2018

Page 70: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

System Architecture

11/14/2018 70Tao Qin - ACML 2018

Page 71: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Discriminator Loss

• Tranformation networks 𝐺0, 𝐺1• Discriminator network 𝐷

• Discriminator loss

Category label❑0: no glass❑1: with glass❑2: fake image

11/14/2018 71Tao Qin - ACML 2018

Page 72: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Original images

VAE-GAN

This work

Residual image

Glasses

11/14/2018 73Tao Qin - ACML 2018

Page 73: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Original images

VAE-GAN

This work

Residual image

Beard

11/14/2018 74Tao Qin - ACML 2018

Page 74: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Original images

VAE-GAN

This work

Residual image

Mouth

11/14/2018 75Tao Qin - ACML 2018

Page 75: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Original images

VAE-GAN

This work

Residual image

Smile

11/14/2018 76Tao Qin - ACML 2018

Page 76: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Original images

VAE-GAN

This work

Residual image

Female - Male

11/14/2018 77Tao Qin - ACML 2018

Page 77: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Original images

VAE-GAN

This work

Residual image

Young - Old

11/14/2018 78Tao Qin - ACML 2018

Page 78: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Ablation Study

Original images

Result images

Without dual learning

11/14/2018 79Tao Qin - ACML 2018

Page 79: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Conditional GANs for Face Aging and Rejuvenation

Jingkuan Song, Jingqiu Zhang, Lianli Gao, Xianglong Liu, Heng Tao Shen, IJCAI 2018

Credit: Figures in this section come from the paper.

11/14/2018 80Tao Qin - ACML 2018

Page 80: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Face Aging and Rejuvenation

11/14/2018 81Tao Qin - ACML 2018

Page 81: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

System Architecture

11/14/2018 82Tao Qin - ACML 2018

Page 82: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

11/14/2018 83Tao Qin - ACML 2018

Page 83: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Conditional Image Translation

CVPR 2018

11/14/2018 84Tao Qin - ACML 2018

Page 84: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

What exists/lacks in current im2im

• An assumption∀𝑥𝐴 ∈ 𝒟𝐴, 𝑥𝐵 ∈ 𝒟𝐵: • 𝑥𝐴 = 𝑥𝐴

𝑖 ⊕𝑥𝐴𝑠, 𝑥𝐵 = 𝑥𝐵

𝑖 ⊕𝑥𝐵𝑠

• 𝑥⋅𝑖=domain independent features; 𝑥⋅

𝑠=domain specific features

• Conventional im2im cannot specify domain specific features• Cannot control the style of generated images

• Generate with random domain specific features

• How to specify domain-specific features

11/14/2018 85Tao Qin - ACML 2018

Page 85: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

From im2im to conditional im2im

• Input: ∀𝑥𝐴 ∈ 𝒟𝐴, 𝑥𝐵 ∈ 𝒟𝐵: 𝑥𝐴 = 𝑥𝐴𝑖 ⊕𝑥𝐴

𝑠, 𝑥𝐵 = 𝑥𝐵𝑖 ⊕𝑥𝐵

𝑠

• Output: 𝑥𝐴𝐵 = 𝐺𝐴→𝐵 𝑥𝐴, 𝑥𝐵 = 𝑥𝐴𝑖 ⊕𝑥𝐵

𝑠

𝑥𝐵𝐴 = 𝐺𝐵→𝐴 𝑥𝐵 , 𝑥𝐴 = 𝑥𝐵𝑖 ⊕𝑥𝐴

𝑠

Example:

11/14/2018 86Tao Qin - ACML 2018

Page 86: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

System Architecture

11/14/2018 87Tao Qin - ACML 2018

Page 87: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Loss Function

• ℓGAN = log 𝑑𝐴 𝑥𝐴 + log 1 − 𝑑𝐴 𝑥𝐵𝐴

+ log 𝑑𝐵 𝑥𝐵 + log(1 − 𝑑𝐵(𝑥𝐴𝐵))

• ℓdualim 𝑥𝐴, 𝑥𝐵 = ‖ ‖𝑥𝐴 − ො𝑥𝐴

2 + ‖𝑥𝐵 − ො𝑥𝐵‖2

• ℓdualdi 𝑥𝐴, 𝑥𝐵 = ‖ ฮ𝑥𝐴

𝑖 − ො𝑥𝐴𝑖 2

+ ‖𝑥𝐵𝑖 − ො𝑥𝐵

𝑖 ‖2

• ℓdualds 𝑥𝐴, 𝑥𝐵 = ‖ ‖𝑥𝐴

𝑠 − ො𝑥𝐴𝑠 2 + ‖𝑥𝐵

𝑠 − ො𝑥𝐵𝑠‖2

Image level

𝑥⋅𝑖 level

𝑥⋅𝑠 level

11/14/2018 88Tao Qin - ACML 2018

Page 88: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Male ↔ Female

11/14/2018 89Tao Qin - ACML 2018

Page 89: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Bag ➔ Edge

11/14/2018 90Tao Qin - ACML 2018

Page 90: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Edge ➔ Shoe

11/14/2018 91Tao Qin - ACML 2018

Page 91: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Learning from Labeled

Data

• Dual supervised learning

• Dual inference

• Multi-agent dual learning

• Model-level dual learning

11/14/2018 92Tao Qin - ACML 2018

Page 92: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Probabilistic View of Structural Duality

• The structural duality implies strong probabilistic connections between the models of dual AI tasks.

• This can be used beyond unsupervised learning• Structural regularizer to enhance supervised learning

• Additional criterion to improve inference

𝑃 𝑥, 𝑦 = 𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓 = 𝑃 𝑦 𝑃 𝑥 𝑦; 𝑔

Primal View Dual View

11/14/2018 93Tao Qin - ACML 2018

Page 93: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Supervised Learningcan learn from labeled data more effectively

ICML 2017

11/14/2018 94Tao Qin - ACML 2018

Page 94: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Supervised Learning

Labeled data 𝑥 label 𝑦 = 𝑓(𝑥)

𝑥 = 𝑔(𝑦)

Feedback signals during the loop:• 𝑅 𝑥, 𝑓, 𝑔 = |𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓 − 𝑃 𝑦 𝑃 𝑥 𝑦; 𝑔 |: the gap between

the joint probability 𝑃(𝑥, 𝑦) obtained in two directions

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

Agent Agent

min |𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓 − 𝑃 𝑦 𝑃 𝑥 𝑦; 𝑔 |

max log𝑃(𝑦|𝑥; 𝑓)

max log𝑃(𝑥|𝑦; 𝑔)

Label 𝑦

𝑃 𝑥, 𝑦= 𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓=𝑃 𝑦 𝑃 𝑥 𝑦; 𝑔

11/14/2018 95Tao Qin - ACML 2018

Page 95: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Loss Function and Algorithm

11/14/2018 96Tao Qin - ACML 2018

Page 96: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Neural Machine Translation

English sentence 𝑥 Chinese sentence𝑦 = 𝑓(𝑥)

𝑓: 𝑥 → 𝑦

𝑔: 𝑦 → 𝑥

Ch->En translation

En->Ch translation

English sentence 𝑥 = 𝑔(𝑦) Chinese sentence 𝑦

14

16

18

20

22

24

26

28

30

32

34

En→Fr Fr→En En→De De→En En→Zh(MT08)

Zh→En(MT08)

En→Zh(MT12)

Zh→En(MT12)

NMT 29.92 27.49 16.54 20.69 15.45 31.67 15.05 30.54

DSL 31.99 28.35 17.91 20.81 15.87 33.59 16.1 32

29.92

27.49

16.54

20.69

15.45

31.67

15.05

30.54

31.99

28.35

17.91

20.81

15.87

33.59

16.1

32

Machine Translation

11/14/2018 97Tao Qin - ACML 2018

Page 97: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Image Understanding

• Dataset: CIFAR10

• Primal model: ResNet

• Dual model: PixelCNN++

• Marginal distribution• 𝑃(𝑦) for label: uniform

distribution

• 𝑃(𝑥) for image: PixelCNN++

Image 𝑥 Label 𝑦 = 𝑓(𝑥)

𝑓: 𝑥 → 𝑦

𝑔: 𝑦 → 𝑥

Conditional image generation

Image classification

Image 𝑥 = 𝑔(𝑦) label 𝑦

11/14/2018 98Tao Qin - ACML 2018

Page 98: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Image UnderstandingImage Classification vs. Image Generation

Error rate (%)

ResNet-32(baseline) 7.51

Dual ResNet-32 6.82

ResNet-110 (baseline) 6.43

Dual ResNet-110 5.40

Image classification

Image generationBit per dimension: 2.94->2.93

https://github.com/Microsoft/DualLearning11/14/2018 99Tao Qin - ACML 2018

Page 99: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Sentiment Analysis

Sentence 𝑥 Label 𝑦 = 𝑓(𝑥)

𝑓: 𝑥 → 𝑦

𝑔: 𝑦 → 𝑥

Conditional sentence generation

Sentence classification

Sentence 𝑥 = 𝑔(𝑦) label 𝑦

Classification Error (%)

Generation Perplexity

baseline 10.10 59.19

DSL 9.20 58.78

https://github.com/Microsoft/DualLearning11/14/2018 100Tao Qin - ACML 2018

Page 100: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Theoretical Analysis

• Dual supervised learning generalizes better than standard supervised learning

The product space of the two models satisfying probabilistic duality: 𝑃 𝑥 𝑃 𝑦 𝑥; 𝑓 = 𝑃 𝑦 𝑃 𝑥 𝑦; 𝑔

11/14/2018 101Tao Qin - ACML 2018

Page 101: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Even if you cannot change/re-train models,

Dual Inferencecan help boost the inference results

IJCAI 2017

11/14/2018 102Tao Qin - ACML 2018

Page 102: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Inference

input 𝑥 Predicted output 𝑦 = 𝑓(𝑥)

Predicted output 𝑥 = 𝑔(𝑦)

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

𝑃 𝑦 𝑥; 𝑓 =𝑃 𝑥 𝑦; 𝑔 𝑃 𝑦

𝑃 𝑥

Choose the 𝑦 that can maximize 𝑃(𝑦|𝑥; 𝑓): 𝑓 𝑥 = argmax𝑦𝑃(𝑦|𝑥; 𝑓)Standard inference

𝑓𝑑𝑢𝑎𝑙 𝑥 = argmax𝑦 𝛼𝑃 𝑦 𝑥; 𝑓 + 1 − 𝛼𝑃 𝑥 𝑦; 𝑔 𝑃 𝑦

𝑃 𝑥

𝑔𝑑𝑢𝑎𝑙 𝑦 = argmax𝑦 𝛽𝑃 𝑥 𝑦; 𝑔 + 1 − 𝛽𝑃 𝑦 𝑥; 𝑓 𝑃 𝑥

𝑃 𝑦

Dual inference: leverage both primal and dual models for inference

Input 𝑦

Choose the 𝑥 that can maximize 𝑃(𝑥|𝑦; 𝑔): 𝑔 𝑦 = argmax𝑥𝑃(𝑥|𝑦; 𝑔)

11/14/2018 103Tao Qin - ACML 2018

Page 103: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Neural Machine Translation

English sentence 𝑥 Chinese sentence𝑦 = 𝑓(𝑥)

𝑓: 𝑥 → 𝑦

𝑔: 𝑦 → 𝑥

Ch->En translation

En->Ch translation

English sentence 𝑥 = 𝑔(𝑦) Chinese sentence 𝑦

11/14/2018 104Tao Qin - ACML 2018

Page 104: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Image Understanding

• Dataset: CIFAR10

• Primal model: ResNet

• Dual model: PixelCNN++

• Marginal distribution• 𝑃(𝑦) for label: uniform

distribution

• 𝑃(𝑥) for image: PixelCNN++

Image 𝑥 Label 𝑦 = 𝑓(𝑥)

𝑓: 𝑥 → 𝑦

𝑔: 𝑦 → 𝑥

Conditional image generation

Image classification

Image 𝑥 = 𝑔(𝑦) label 𝑦

11/14/2018 105Tao Qin - ACML 2018

Page 105: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Image UnderstandingImage Classification vs. Image Generation

11/14/2018 106Tao Qin - ACML 2018

Page 106: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Sentiment Analysis

Sentence 𝑥 Label 𝑦 = 𝑓(𝑥)

𝑓: 𝑥 → 𝑦

𝑔: 𝑦 → 𝑥

Conditional sentence generation

Sentence classification

Sentence 𝑥 = 𝑔(𝑦) label 𝑦

11/14/2018 107Tao Qin - ACML 2018

Page 107: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Sentiment Analysis

Sentence 𝑥 Label 𝑦 = 𝑓(𝑥)

𝑓: 𝑥 → 𝑦

𝑔: 𝑦 → 𝑥

Conditional sentence generation

Sentence classification

Sentence 𝑥 = 𝑔(𝑦) label 𝑦

Standard

this movie is one of the funniest movies i have ever seen. the acting is great, the plot is simple. it is one of the best movies i’ve seen in a long time

Dual

i love this movie. i watched it over and over again and ihave to say that it is one of the best movies i’ve seen in a long time. the plot is simple, the acting is great. if you are looking for a good movie, go to see this movie

Standard

when i first saw this movie, i thought it was going to be funny, but it didn’t. it was so bad, i didn’t think it was going to be funny. the only thing I can say about this movie is that it is so bad that it’s not funny.

Dual

i give it 2 out of 10 because , it’s the worst movie I have ever seen . the only thing i can say about this movie is that it is so bad that it makes no sense at all . don’t waste your time .

11/14/2018 108Tao Qin - ACML 2018

Page 108: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Theoretical Analysis

• Dual inference has generalization guarantee although training and inference become a little inconsistent.

The generalization bound for dual inference is comparable to that of standard inference.

11/14/2018 109Tao Qin - ACML 2018

Page 109: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Multi-agent Dual LearningEnsemble multiple primal/dual models

Ongoing work

11/14/2018 110Tao Qin - ACML 2018

Page 110: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Refresh of Dual Learning

English sentence 𝑥 Chinese sentence𝑦 = 𝑓(𝑥)New English sentence

𝑥′ = 𝑔(𝑦)

Feedback signals during the loop:• Δ𝑥 𝑥, 𝑥′; 𝑓, 𝑔 : similarity of 𝑥′ and 𝑥;• Δ𝑦 𝑦, 𝑦′; 𝑔, 𝑓 : similarity of 𝑦′ and 𝑦;

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

Agent Agent

Zh→En translation

En→Zh translation

Policy gradient is used to improve both primal and dual models according to feedback signals

(NIPS 2016)

Training objective function:1

‖ℳ𝑥‖

𝑥∈ℳ𝑥

Δ𝑥 𝑥, 𝑔 𝑓 𝑥 +1

‖ℳ𝑦‖

𝑦∈ℳ𝑦

Δ𝑦 𝑦, 𝑓 𝑔 𝑦11/14/2018 111Tao Qin - ACML 2018

Page 111: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Motivation

English sentence 𝑥 Chinese sentence𝑦 = 𝑓(𝑥)New English sentence

𝑥′ = 𝑔(𝑦)

In this loop, 𝑓: generator 𝑦′ = 𝑓 𝑥 ;𝑔: evaluator of 𝑓; Δ𝑥(𝑥, 𝑔(𝑦′))Evaluation quality plays a central role.

Primal Task 𝑓: 𝑥 → 𝑦

Dual Task 𝑔: 𝑦 → 𝑥

Agent Agent

Zh→En translation

En→Zh translation

Employing multiple agents can improve evaluation qualities:Multi-Agent Dual Learning11/14/2018 112Tao Qin - ACML 2018

Page 112: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Framework

English sentence 𝑥 Chinese sentence𝑦 = 𝐹𝛼(𝑥)New English sentence

𝑥′ = 𝐺𝛽(𝑦)

Primal Task 𝐹𝛼: 𝑥 → 𝑦

Dual Task 𝐺𝛽: 𝑦 → 𝑥

Agent Agent

Zh→En translation

En→Zh translation

𝜶𝟎

𝛼1

𝛼2

𝜷𝟎𝛽1

𝛽2

𝑓𝑖 is fixed (𝑖 ≥ 1)𝐹𝛼 = σ𝑖=0

𝑁−1𝛼𝑖𝑓𝑖

𝑔𝑗 is fixed (𝑗 ≥ 1)

𝐺𝛽 = σ𝑗=0𝑁−1𝛽𝑗𝑔𝑗

Feedback signals during the loop:• Δ𝑥 𝑥, 𝑥′; 𝐹𝛼, 𝐺𝛽 : similarity of 𝑥′ and 𝑥;

• Δ𝑦 𝑦, 𝑦′; 𝐺𝛽 , 𝐹𝛼 : similarity of 𝑦′ and 𝑦;

Training objective function:1

‖ℳ𝑥‖

𝑥∈ℳ𝑥

Δ𝑥 𝑥, 𝐺𝛽 𝐹𝛼 𝑥 +1

‖ℳ𝑦‖

𝑦∈ℳ𝑦

Δ𝑦 𝑦, 𝐹𝛼 𝐺𝛽 𝑦

Train and update 𝑓0 and 𝑔0

11/14/2018 113Tao Qin - ACML 2018

Page 113: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

A Computation-Efficient Solution

• It is too cost to load 2𝑁 models into GPU memory• An off-policy way:

• Given an 𝑥, ො𝑦 ∼1

𝑁−1σ𝑖=1𝑁−1𝑓𝑖(𝑥); Given a 𝑦, ො𝑥 ∼

1

𝑁−1σ𝑗=1𝑁−1𝑔𝑗

• Calculate 𝑃𝑥→ ො𝑦 =1

𝑁−1σ𝑖=1𝑁−1𝑃(ො𝑦|𝑥; 𝑓𝑖), 𝑃𝑦→ ො𝑥 =

1

𝑁−1σ𝑗=1𝑁−1𝑃(ො𝑥|𝑦; 𝑔𝑗)

𝐴 ො𝑦→𝑥 = σ𝑗=1𝑁−1𝑃(𝑥| ො𝑦; 𝑔𝑗), 𝐴 ො𝑥→𝑦 = σ𝑖=1

𝑁−1𝑃(𝑦| ො𝑥; 𝑓𝑖)

• 𝑓0 = 𝑓0 − 𝜂∇𝑓0 ቈ

𝑁−1 𝑃𝑥→ෝ𝑦+𝑃 ො𝑦 𝑥; 𝑓0𝑁𝑃𝑥→ෝ𝑦

log𝐴ෝ𝑦→𝑥+𝑃 𝑥 ො𝑦; 𝑔0

𝑁+

𝑁−1 𝑃𝑦→ෝ𝑥+𝑃 ො𝑥 𝑦; 𝑔0𝑁𝑃𝑦→ෝ𝑥

log𝐴ෝ𝑥→𝑦+𝑃 𝑦 𝑥; 𝑓0

𝑁

• Similar for 𝑔0• The GPU only needs to load 2 models only

• If we focus on one-direction translation, only 1 model needs to be loaded

11/14/2018 114Tao Qin - ACML 2018

Page 114: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

IWSLT 2014 (<200k bilingual data)

33.61

27.95

40.537.95

18.9416.06

33.25

22.2

34.57

29.07

42.1438.09

21.14

16.31

35.42

22.75

35.44

29.52

42.4138.62

21.56

16.71

36.08

23.8

De-to-En En-to-De Es-to-En En-to-Es Ru-to-En En-to-Ru He-to-En En-to-He

Transformer Standard Dual Multi-Agent Dual

Deutsch Español русский עברית

State-of-the-art resultsDe↔En: 2 × 5 agents{Es, Ru, He}↔En: 2 × 3 agents

11/14/2018 115Tao Qin - ACML 2018

Page 115: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

WMT 2014 (4.5M bilingual data)

• On Bench-mark dataset WMT 2014,

28.4

32.15

28.4

32.1529.44

32.9929.93

34.98

30.0533.4

30.67

35.64

En-to-De (+0M monolingual) De-to-En (+0M monolingual) En-to-De (+8M monolingual) De-to-En (+8M monolingual)

Transformer Standard Dual Multi-Agent Dual

State-of-the-art resultswith WMT2014 data only

En↔De: 2 × 3 agents

11/14/2018 116Tao Qin - ACML 2018

Page 116: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

WMT 2016 Unsupervised NMT (0 bilingual data)

17.2 18.0119.3

2122.17

23.83

Facebook's Transformer Standard Dual Multi-Agent Dual

En-to-De De-to-En

En↔De: 2 × 3 agents

11/14/2018 117Tao Qin - ACML 2018

Page 117: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

WMT En->De 2016∼2018

2016 2017 2018

Facebook’s model (single) 37.04 ± 0.16 31.86 ± 0.21 44.63 ± 0.12

Facebook’s model (ensemble) 37.99 32.80 46.05

Multi-Agent Dual (Single) 40.71 ± 0.08 33.47 ± 0.15 48.97 ± 0.06

Multi-Agent Dual (Ensemble) 41.19 34.12 49.77

11/14/2018 118Tao Qin - ACML 2018

Page 118: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Model-level Dual LearningBeyond data-level dual learning

ICML 2018

11/14/2018 119Tao Qin - ACML 2018

Page 119: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Recap of Dual Learning

• Dual unsupervised learning: • 𝑥 → ො𝑦 → ො𝑥; Build feedback signal Δ(𝑥, ො𝑥); Update• Reconstruction duality

• Dual supervised Learning: • 𝑃 𝑥 𝑃 𝑦 𝑥 = 𝑃 𝑦 𝑃(𝑥|𝑦) as constraint• Joint-probability duality

• Dual transfer learning: • 𝑃 𝑦 = σ𝑥𝑃(𝑥, 𝑦)• Marginal distribution duality

• Dual inference:• argmin𝑦′∈𝒴 𝛼ℓ𝑝(𝑥, 𝑦′) + (1 − 𝛼)ℓ𝑑(𝑥, 𝑦′)• Reconstruction & Joint-probability duality

data-level duality

11/14/2018 120Tao Qin - ACML 2018

Page 120: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

A Further Step

• We find that there exists “model-level duality”

• Take English↔Chinese as an example

• Why don’t we share the modules that have similar functionality ? • 𝐸𝑒𝑛 = 𝐷𝑒𝑛; 𝐸𝑧ℎ = 𝐷𝑧ℎ

English Encoder Chinese Decoder

En→Zh Attention

Chinese Encoder English Decoder

Zh→En Attention

Same structure

Same structure

𝐸𝑒𝑛 𝐸𝑧ℎ𝐷𝑧ℎ 𝐷𝑒𝑛

11/14/2018 121Tao Qin - ACML 2018

Page 121: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Quick View of the Model

En Component Zh Component

En→Zh AttentionZh→En AttentionEn↔Zh

En Component Zh Component

En→Zh Attention

En Component Zh Component

Zh→En AttentionEn←Zh

En→Zh

11/14/2018 122Tao Qin - ACML 2018

Page 122: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Symmetric Settings

𝑥𝒞𝑋

𝑍𝑋 ℎ𝑋

𝜑𝑋𝑐

𝑋 component

𝜑𝑋𝑧

𝜑𝑋𝑐

𝜑𝑋𝑧

𝜑𝑋ℎ

𝑦𝒞𝑌

𝑍𝑌 ℎ𝑌

𝜑𝑌𝑐

𝑌 component

𝜑𝑌𝑧

𝜑𝑌𝑐

𝜑𝑌𝑧

𝜑𝑌ℎ

𝑥0

0 ℎ𝑋

𝜑𝑋𝑐

𝑋 component

𝜑𝑋𝑧

𝜑𝑋𝑐

𝜑𝑋𝑧

𝑦

𝑍𝑌 ℎ𝑌

𝜑𝑌𝑐

𝑌 component

𝜑𝑌𝑧

𝜑𝑌𝑐

𝜑𝑌𝑧

𝜑𝑌ℎ

𝑥

𝑍𝑋 ℎ𝑋

𝜑𝑋𝑐

𝑋 component

𝜑𝑋𝑐

𝜑𝑋𝑧

𝜑𝑋ℎ

𝑦0

0 ℎ𝑌

𝜑𝑌𝑐

𝑌 component

𝜑𝑌𝑧

𝜑𝑌𝑐

𝜑𝑌𝑧

𝜑𝑋𝑧

works for encoder-decoder based frameworki.e., both 𝒳 and 𝒴 are sequence collections

11/14/2018 123Tao Qin - ACML 2018

Page 123: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Asymmetric Settings

𝑥

ℎ𝑋

𝜑𝑋𝑐

𝑦𝜑𝑌

𝑥

ℎ𝑋

𝜑𝑋𝑐

𝑦𝜑𝑌

𝜑𝑋ℎ

works for encoder-classifier based frameworki.e., 𝒳 might be collections of sequences;𝒴 = {0,1,⋯ , 𝑐}

11/14/2018 124Tao Qin - ACML 2018

Page 124: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Results: Symmetric Setting

11/14/2018 125Tao Qin - ACML 2018

Page 125: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Results: Symmetric Setting

11/14/2018 126Tao Qin - ACML 2018

Page 126: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Results: Symmetric Setting

11/14/2018 127Tao Qin - ACML 2018

Page 127: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Results: Asymmetric Setting

11/14/2018 128Tao Qin - ACML 2018

Page 128: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Results: + Dual inference

• NMT• De→En: 34.71 → 35.19

• En→De: 28.64 → 28.83

• Sentiment classification• 7.41 → 6.96

11/14/2018 129Tao Qin - ACML 2018

Page 129: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

More Applications

Neural machine translation

Image understanding Sentiment analysis

Question Answering/generation

Image translation Face manipulation

11/14/2018 130Tao Qin - ACML 2018

Page 130: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Dual Question Answering/Generation

• Primal task: question answering, question ➔ answer

• Dual task: question generation, answer ➔ question

Duyu Tang, Nan Duan, Tao Qin, and Ming Zhou. "Question Answering and Question Generation as Dual Tasks." arXiv preprint arXiv:1706.02027 (2017).

Dual learning can significantly improve the accuracy of both question answering and generation

11/14/2018 131Tao Qin - ACML 2018

Page 131: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Visual Question Answering/GenerationYikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang, Ming Zhou, “Visual Question Generation as Dual Task of Visual Question

Answering”, CVPR, 2018.

11/14/2018 132Tao Qin - ACML 2018

Page 132: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Visual Question Answering/GenerationYikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang, Ming Zhou, “Visual Question Generation as Dual Task of Visual Question

Answering”, CVPR, 2018.

11/14/2018 133Tao Qin - ACML 2018

Page 133: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Summary

• Basic idea: leverage structure duality for machine learning

• Works for different learning settings• Unsupervised/semi-supervised learning, supervised learning, transfer

learning

• Both training and inference

• Both data level and model level

• Applied to many applications• Machine translation, question answering/generation, …

• Image classification/generation, sentiment classification/generation, …

• Image translation, face manipulation, …

11/14/2018 134Tao Qin - ACML 2018

Page 134: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

Outlook

• More algorithms, more applications

• Stability, efficiency

• Theoretical understanding• When it works/fails

• Why it works

• Open-source tools

11/14/2018 135Tao Qin - ACML 2018

Page 135: Dual Learning: Algorithms and Applications · Estimated labeling cost for one language pair 7000 different languages that are spoken around the world The 100-th largest language has

deep learning reinforcement learning

[email protected]://research.Microsoft.com/~taoqin