
Multilingual NLP via Cross-Lingual Word Embeddings

Ivan Vulić

LTL, University of Cambridge & PolyAI

FoTran Workshop

Helsinki; September 28, 2018

Joint Work with...

...plus efforts of many other researchers...


A cold open...


Embeddings are Everywhere (Monolingual)

Intrinsic (semantic) tasks:

Semantic similarity and synonym detection, word analogies, concept categorization

Selectional preferences, language grounding

Extrinsic (downstream) tasks:

Named entity recognition, part-of-speech tagging, chunking, semantic role labeling, dependency parsing

Sentiment analysis, discourse parsing

Beyond NLP:

Information retrieval, Web search

Question answering, dialogue systems

Embeddings are Everywhere (Cross-lingual)

Intrinsic (semantic) tasks:

Cross-lingual and multilingual semantic similarity

Bilingual lexicon learning, multi-modal representations

Extrinsic (downstream) tasks:

Cross-lingual SRL, POS tagging, NER; cross-lingual dependency parsing and sentiment analysis; cross-lingual annotation and model transfer

Statistical/Neural machine translation

Beyond NLP:

(Cross-Lingual) Information retrieval

(Cross-Lingual) Question answering


A slow intro...


Motivation

We want to understand and model the meaning of...

Source: dreamstime.com

...without manual/human input and without perfect MT

Motivation

The NLP community has developed useful features for several tasks, but finding features that are...

1. task-invariant (POS tagging, SRL, NER, parsing, ...)

(monolingual word embeddings)

2. language-invariant (English, Dutch, Chinese, Spanish, ...)

(cross-lingual word embeddings → this talk)

...is non-trivial and time-consuming (20+ years of feature engineering...)


Learn word-level features which generalise across tasks and languages


The World Existed B.E. (Before Embeddings)

B.E. Example 1: Cross-lingual (Brown) clusters

[Täckström et al., NAACL 2012; Faruqui and Dyer, ACL 2013]

The World Existed B.E.

B.E. Example 2a: Traditional count-based cross-lingual spaces...

[Gaussier et al., ACL 2004; Laroche and Langlais, COLING 2010]

The World Existed B.E.

B.E. Example 2b: ...and bootstrapping from a limited bilingual signal

[Peirsman and Padó, NAACL 2010; Vulić and Moens, EMNLP 2013]

The World Existed B.E.

B.E. Example 3: Cross-lingual topical spaces

[Mimno et al., EMNLP 2009; Vulić et al., ACL 2011]

Motivation Revisited

Cross-lingual word embeddings:

- simple: quick and efficient to train
- state-of-the-art and omnipresent
- lightweight and cheap

Cross-lingual word embedding models are to MT what (monolingual) word embedding models are to language modeling.

by Sebastian Ruder

Are they?

They support cross-lingual NLP: enabling cross-lingual modeling and cross-lingual transfer

Learning Word Representations

Key idea

Distributional hypothesis → words with similar meanings are likely to appear in similar contexts

[Harris, Word 1954]

Wittgenstein shouts: Meaning as use!

Firth calmly states: You shall know a word by the company it keeps.

Word Embeddings

Representation of each word w ∈ V :

vec(w) = [f_1, f_2, ..., f_dim]

Word representations in the same semantic (or embedding) space!

Image courtesy of [Gouws et al., ICML 2015]


Cross-Lingual Word Embeddings (CLWEs)

Representation of a word w_1^S ∈ V^S:

vec(w_1^S) = [f_1^1, f_2^1, ..., f_dim^1]

Exactly the same representation for w_2^T ∈ V^T:

vec(w_2^T) = [f_1^2, f_2^2, ..., f_dim^2]

Language-independent word representations in the same shared semantic (or embedding) space!

Cross-Lingual Word Embeddings

[Luong et al., NAACL 2015]

Cross-Lingual Word Embeddings

Monolingual vs. Cross-lingual

Q1 → Algorithm Design: How to align semantic spaces in two different languages?

Q2 → Data Requirements: Which bilingual signals are used for the alignment?

Data Requirements: Bilingual Signals

Alignment level   Parallel        Comparable
Word              Dictionaries    Images
Sentence          Translations    Captions
Document          -               Wikipedia

Nature and alignment level of data sources required by CLWE models

Data requirements are more important for the final CLWE model performance than the actual underlying architecture [Levy et al., EACL 2017]

Data Requirements: Bilingual Signals

[Upadhyay et al., ACL 2016]

Illustration of different data requirements and bilingual signals

1. Word-level: word alignments and translation dictionaries

2. Sentence-level: sentence alignments

3. Document-level: Wikipedia/news articles

4. Other: visual alignment: image captions, eye-tracking data

5. Without any bilingual signal?


Welcome to the CLWE Jungle I

Word-level alignment:

Parallel: Mikolov et al. (2013); Faruqui & Dyer (2014); Xing et al. (2015); Dinu et al. (2015); Lazaridou et al. (2015); Zhang et al. (2016); Zou et al. (2013); Ammar et al. (2016); Artetxe et al. (2016); Xiao & Guo (2014); Gouws and Søgaard (2015); Duong et al. (2016); Gardner et al. (2015); Smith et al. (2017); Mrkšić et al. (2017); Artetxe et al. (2017)

Comparable: Kiela et al. (2015); Vulić et al. (2016); Vulić et al. (2017); Zhang et al. (2017); Hauer et al. (2017)

Welcome to the CLWE Jungle II

Sentence-level alignment:

Parallel: Guo et al. (2015); Vyas and Carpuat (2016); Hermann and Blunsom (2013); Lauly et al. (2013); Kočiský et al. (2014); Chandar et al. (2014); Pham et al. (2015); Klementiev et al. (2012); Soyer et al. (2015); Luong et al. (2015); Gouws et al. (2015); Coulmance et al. (2015); Shi et al. (2015); Rajendran et al. (2016); Levy et al. (2017)

Comparable: Calixto et al. (2017); Gella et al. (2017)

Document-level alignment:

Parallel: Hermann and Blunsom (2014)

Comparable: Vulić & Korhonen (2016); Vulić and Moens (2016); Søgaard et al. (2015); Mogadala and Rettinger (2016)

General (Simplified) Methodology

(Bilingual) data sources are more important than the chosen algorithm

Most algorithms are formulated as: J = L1(W) + L2(V) + Ω(W, V)

Learning from Word-Level Information

1. Mapping-based or offline approaches: train monolingual word representations independently and then learn a transformation matrix from one language to the other

2. Pseudo-bilingual approaches: train a monolingual WE model on automatically constructed corpora containing words from both languages

3. Joint online approaches: take parallel text as input and minimize the source and target language losses jointly with the regularization term

These approaches are equivalent, modulo optimization strategies

Learning from Word-Level Information: Example 1

The geometric relations that hold between words are similar across languages: e.g., numbers and animals in English show a similar geometric constellation as their Spanish counterparts

[Mikolov et al., arXiv 2013]

Learning from Word-Level Information: Example 1

Basic mapping approach

[Mikolov et al., arXiv 2013]

Learn to transform the pre-trained source language embeddings into a space

where the distance between a word and its translation pair is minimised

Learning from Word-Level Information: Example 1

Basic mapping approach

Based on given word translation pairs (x_i^s, x_i^t), i = 1, ..., n

Mean-squared error, i.e., the sum of squared Euclidean distances between the transformed source vectors and the actual target vectors:

Ω_MSE = Σ_{i=1}^{n} ||W x_i^s − x_i^t||²

Ω_MSE = ||W X^s − X^t||²_F

Standard (re)formulation:

J = L1_SGNS + L2_SGNS + Ω_MSE
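Not part of the original deck: a minimal NumPy sketch of this mapping step, assuming two matrices Xs and Xt whose i-th rows hold the embeddings of the i-th seed translation pair (all names here are illustrative). The unconstrained solution corresponds to Ω_MSE above; the orthogonally constrained (Procrustes) variant used in later work is included for comparison.

```python
import numpy as np

def learn_mapping(Xs, Xt, orthogonal=False):
    """Learn W minimising ||Xs W - Xt||_F^2 over seed translation pairs.

    Xs, Xt: (n_pairs, dim) arrays; row i of Xs is the source embedding and
    row i of Xt the target embedding of the i-th dictionary entry.
    (Row-vector convention: x W here plays the role of W x on the slide.)
    """
    if orthogonal:
        # Closed-form orthogonal Procrustes solution via the SVD of Xs^T Xt
        u, _, vt = np.linalg.svd(Xs.T @ Xt)
        return u @ vt
    # Unconstrained least-squares solution, in the spirit of Mikolov et al. (2013)
    W, *_ = np.linalg.lstsq(Xs, Xt, rcond=None)
    return W

# Toy usage with random stand-ins for real monolingual embeddings
rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 300))
W = learn_mapping(Xs, Xt, orthogonal=True)
mapped = Xs @ W   # source vectors expressed in the target space
```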

Bootstrapping from Small Seed Dictionaries

The latest instance of the bootstrapping idea [Artetxe et al., ACL 2017]

Self-learning iterative framework: starting from cognates or numerals
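A compact, hypothetical Python sketch of that self-learning loop (not the authors' implementation): learn an orthogonal map on the current dictionary, re-induce a dictionary by nearest-neighbour search, and repeat. It assumes unit-normalised embeddings and brute-force search, and omits the vocabulary cut-offs and convergence criterion of the published method.

```python
import numpy as np

def normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def self_learning(Xs_full, Xt_full, seed_pairs, n_iters=5):
    """Iteratively grow a dictionary and re-learn an orthogonal mapping.

    Xs_full, Xt_full: (Vs, d) and (Vt, d) monolingual embedding matrices.
    seed_pairs: list of (source_index, target_index) seed translations,
                e.g. obtained from shared numerals or cognates.
    """
    Xs, Xt = normalize(Xs_full), normalize(Xt_full)
    pairs, W = list(seed_pairs), None
    for _ in range(n_iters):
        src = np.array([i for i, _ in pairs])
        tgt = np.array([j for _, j in pairs])
        # 1) Orthogonal Procrustes on the current (possibly tiny) dictionary
        u, _, vt = np.linalg.svd(Xs[src].T @ Xt[tgt])
        W = u @ vt
        # 2) Induce a new dictionary: nearest target neighbour of every
        #    mapped source word (dot product = cosine on unit vectors)
        sims = (Xs @ W) @ Xt.T          # (Vs, Vt); fine for toy vocabularies
        pairs = list(enumerate(sims.argmax(axis=1)))
        # 3) Repeat; in practice, stop once the dictionary stabilises
    return W, pairs
```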


Bootstrapping from Small Seed Dictionaries

Lexicon induction results from [Artetxe et al., ACL 2017]

Bootstrapping helps with really small seed dictionaries

Has a performance plateau been reached with simple mapping approaches?

Learning without any Bilingual Signal?

The first instance of the fully unsupervised learning idea [Conneau et al., ICLR 2018; Artetxe et al., ACL 2018]

Adversarial training: playing the two-way game between a generator (mapping) and a discriminator

Alignment at the level of distributions

Supporting unsupervised NMT [Lample et al., ICLR 2018] and unsupervised CLIR [Litschko et al., SIGIR 2018]
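For illustration only, a rough PyTorch sketch of the adversarial game (not the released code of either system): a linear map acts as the generator and a small MLP as the discriminator; the published models additionally keep the map near-orthogonal and refine the adversarially induced alignment afterwards.

```python
import torch
import torch.nn as nn

def adversarial_align(Xs, Xt, dim=300, steps=10000, batch=64, lr=1e-3):
    """Minimal sketch of adversarial CLWE alignment.

    Xs, Xt: float tensors of monolingual embeddings, shapes (Vs, dim), (Vt, dim).
    A linear 'generator' W maps source vectors into the target space; a small
    discriminator tries to tell mapped source vectors from real target vectors.
    """
    W = nn.Linear(dim, dim, bias=False)
    D = nn.Sequential(nn.Linear(dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))
    opt_w = torch.optim.SGD(W.parameters(), lr=lr)
    opt_d = torch.optim.SGD(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()

    for _ in range(steps):
        xs = Xs[torch.randint(len(Xs), (batch,))]
        xt = Xt[torch.randint(len(Xt), (batch,))]
        # Discriminator step: mapped source = label 0, real target = label 1
        opt_d.zero_grad()
        d_loss = bce(D(W(xs).detach()), torch.zeros(batch, 1)) \
               + bce(D(xt), torch.ones(batch, 1))
        d_loss.backward()
        opt_d.step()
        # Mapping step: update W so that mapped source vectors fool D
        opt_w.zero_grad()
        g_loss = bce(D(W(xs)), torch.ones(batch, 1))
        g_loss.backward()
        opt_w.step()
    # nn.Linear applies x @ weight.T, so weight is the learned mapping matrix
    return W.weight.detach()
```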


Learning without any Bilingual Signal?

But is it really that great?

[Søgaard, Ruder, and Vulić; ACL 2018]

Mapping Approach: Bilingual to Multilingual

Fixing one space (English) and learning transformations for all monolingual word embeddings in other languages

This constructs a multilingual space from input monolingual spaces

[Smith et al., ICLR 2017]:

A unified multilingual space for 89 languages using fastText vectors and 5k Google Translate pairs for each language pair
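A small illustrative sketch of this pivoting construction, assuming one seed dictionary per language into English and reusing the orthogonal Procrustes solution sketched earlier; all names are hypothetical.

```python
import numpy as np

def map_all_to_pivot(spaces, dictionaries, pivot="en"):
    """Map every language's embeddings into a fixed pivot (English) space.

    spaces: {lang: (V_lang, d) embedding matrix}
    dictionaries: {lang: (src_idx, piv_idx) index arrays of seed pairs
                   between `lang` and the pivot}
    Returns {lang: matrix already expressed in the pivot space}.
    """
    shared = {pivot: spaces[pivot]}          # the pivot space stays fixed
    for lang, X in spaces.items():
        if lang == pivot:
            continue
        src_idx, piv_idx = dictionaries[lang]
        # Orthogonal Procrustes: align lang -> pivot on the seed pairs
        u, _, vt = np.linalg.svd(X[src_idx].T @ spaces[pivot][piv_idx])
        shared[lang] = X @ (u @ vt)
    return shared
```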


Learning from Word-Level Information: Example 2

Pseudo-bilingual training

[Gouws and Søgaard, NAACL 2015]

Barista: Bilingual Adaptive Reshuffling with Individual Stochastic Alternatives

1. Build a pseudo-bilingual corpus (or corrupt the original training data) using a set of predefined equivalences (e.g., POS tags, word translation pairs)

2. Train a monolingual WE model on the corrupted/shuffled corpus

Original : we build the house

POS : we build la voiture / they run la house

Translations: we construire the maison / nous build la house


Learning from Word-Level Information: Example 2

[Gouws and Søgaard, NAACL 2015]

Barista: Bilingual Adaptive Reshuffling with Individual Stochastic Alternatives

1 Input: Source corpus Cs, target corpus Ct, dictionary D

2 Concatenate Cs and Ct and then shuffle the concatenated corpus

3 Pass over the corpus; for each word w: if the set {w′ | (w, w′) ∈ D or (w′, w) ∈ D} is non-empty and of cardinality k, replace w with one of its equivalents w′ with probability 1/(2k) each; keep w otherwise

4 Train a standard WE model on the pseudo-bilingual corpus C′ (a sketch of steps 2-4 follows below)
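A sketch of steps 2-4 under the stated assumptions (tokenised sentences, a dictionary of equivalence sets); the replacement probability follows the description above, i.e. each of the k equivalents with probability 1/(2k), so a covered word is swapped half of the time.

```python
import random

def barista_corpus(sents_s, sents_t, equiv):
    """Build a pseudo-bilingual corpus in the spirit of Barista.

    sents_s, sents_t: lists of tokenised sentences (lists of words).
    equiv: dict mapping a word to the set of its equivalents in the other
           language (translations, or words sharing an equivalence class).
    """
    corpus = sents_s + sents_t
    random.shuffle(corpus)                  # shuffle at the sentence level
    mixed = []
    for sent in corpus:
        new_sent = []
        for w in sent:
            cands = equiv.get(w, set())
            # each of the k equivalents is chosen with prob. 1/(2k),
            # i.e. w is replaced half of the time, kept otherwise
            if cands and random.random() < 0.5:
                new_sent.append(random.choice(sorted(cands)))
            else:
                new_sent.append(w)
        mixed.append(new_sent)
    return mixed   # feed this to any standard monolingual WE model (e.g. SGNS)
```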


Learning from Word-Level Information: Example 2

The chosen equivalence class dictates the shape of the output space

Cross-lingual embedding space with POS equivalences

Learning from Word-Level Information: Example 3

Joint (Online) Training

Bilingual Skip-Gram (BiSkip) proposed by Luong et al., NAACL 2015

Very similar to the original mapping approach, but formulated as online

monolingual and cross-lingual training on pseudo-bilingual corpora

J = L1_SGNS + L2_SGNS + Ω_L1,L2

Ω_L1,L2 = SGNS_L1→L2 + SGNS_L2→L1

Learning from Word-Level Information: Example 3

Multilingual Skip-Gram (MultiSkip) by Ammar et al., NAACL 2016

Simple extension (again) → summing up bilingual objectives for all available parallel corpora

J = L1_SGNS + L2_SGNS + L3_SGNS + Ω_L1,L2 + Ω_L2,L3 + Ω_L1,L3

Learning from Word-Level Information: Example 4

Post-processing/specialisation

State-of-the-art: Paragram and Attract-Repel [Wieting et al., TACL 2015; Mrkšić et al., TACL 2017]

If S is the set of synonymous word pairs, the procedure iterates over mini-batches of such constraints B_S, optimising the following cost function:

S(B_S) = Σ_{(x_l, x_r) ∈ B_S} [ ReLU(δ_sim + x_l · t_l − x_l · x_r) + ReLU(δ_sim + x_r · t_r − x_l · x_r) ]

where δ_sim is the similarity margin and t_l and t_r are negative examples for the given word pair (x_l, x_r).

CLWEs via Semantic Specialisation

Cross-lingual supervision: again word-level linguistic constraints (e.g., synonyms)

Constraint                        Relation   Source
(en_sweet, it_dolce)              SYN        BabelNet
(en_work, fr_travail)             SYN        BabelNet
(en_school, de_schule)            SYN        BabelNet
(fr_montagne, de_gebirge)         SYN        BabelNet
(sh_gradonačelnik, en_mayor)      SYN        BabelNet
(nl_vrouw, it_donna)              SYN        BabelNet
(en_sour, it_dolce)               ANT        BabelNet
(en_asleep, fr_éveillé)           ANT        BabelNet
(en_cheap, de_teuer)              ANT        BabelNet
(de_langsam, es_rápido)           ANT        BabelNet
(sh_obeshrabriti, en_encourage)   ANT        BabelNet
(fr_jour, nl_nacht)               ANT        BabelNet

CLWEs via Semantic Specialisation

Negative Examples for each Synonymy Pair

For each synonymy pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen from the remaining in-batch vectors so that t_l is the one closest (by cosine similarity) to x_l and t_r is closest to x_r.
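A NumPy sketch of the attract term with exactly this in-batch negative selection; it is illustrative only and leaves out the repel term for antonyms and the regularisation that keeps vectors close to their initial distributional positions.

```python
import numpy as np

def attract_cost(Xl, Xr, delta_sim=0.6):
    """Attract term of an Attract-Repel-style objective for one mini-batch.

    Xl, Xr: (B, d) unit-normalised vectors; row i of Xl and Xr forms one
    synonymy (e.g. cross-lingual translation) pair.
    delta_sim: the similarity margin (a tunable hyperparameter).
    """
    sims_l = Xl @ Xl.T                 # cosine similarities within the batch
    sims_r = Xr @ Xr.T
    np.fill_diagonal(sims_l, -np.inf)  # a vector cannot be its own negative
    np.fill_diagonal(sims_r, -np.inf)
    Tl = Xl[sims_l.argmax(axis=1)]     # t_l: in-batch vector closest to x_l
    Tr = Xr[sims_r.argmax(axis=1)]     # t_r: in-batch vector closest to x_r
    pos = np.sum(Xl * Xr, axis=1)      # x_l . x_r for each pair
    loss = np.maximum(0.0, delta_sim + np.sum(Xl * Tl, axis=1) - pos) \
         + np.maximum(0.0, delta_sim + np.sum(Xr * Tr, axis=1) - pos)
    return float(loss.sum())
```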


Learning from Aligned Sentences: Example 1

Trans-gram model: very similar to BiSkip, but it relies on full sentential contexts

[Coulmance et al., EMNLP 2015]

J = L1_SGNS + L2_SGNS + Ω_L1,L2

Ω_L1,L2 = SGNS_L1→L2 + SGNS_L2→L1

Déjà vu, right?

Learning from Aligned Sentences: Example 2

BilBOWA model: very similar to online models, but using sentence-level cross-lingual information

[Gouws et al., ICML 2015] Déjà vu, right?

J = L1_SGNS + L2_SGNS + Ω_BilBOWA

Ω_BilBOWA = || (1/m) Σ_{w_i^s ∈ sent_s} x_i^s − (1/n) Σ_{w_j^t ∈ sent_t} x_j^t ||²
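A tiny sketch of the Ω_BilBOWA term for a single aligned sentence pair (illustrative; the full model optimises this jointly with the two monolingual SGNS losses over many sentence pairs).

```python
import numpy as np

def bilbowa_omega(sent_s_vecs, sent_t_vecs):
    """BilBOWA-style cross-lingual regulariser for one aligned sentence pair.

    sent_s_vecs, sent_t_vecs: (m, d) and (n, d) arrays holding the current
    embeddings of the words in an aligned source/target sentence.
    The term pulls the two bag-of-words sentence means together.
    """
    diff = sent_s_vecs.mean(axis=0) - sent_t_vecs.mean(axis=0)
    return float(diff @ diff)   # squared Euclidean distance of the means
```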


Learning from Aligned Documents: Example 1

Again, pseudo-bilingual training


Learning from Aligned Documents: Example 2

Inverted indexing

[Søgaard et al., ACL 2015]; slide courtesy of Anders Søgaard

Summary

Slides courtesy of Anders Søgaard

Evaluation

Quality of embeddings:

Intrinsic evaluation: how good are they at representing word meaning?

Extrinsic evaluation: how good are they as features in downstream tasks?

Intrinsic >> Extrinsic? Intrinsic << Extrinsic? Intrinsic ≈ Extrinsic?

The last option would be the most ideal, but it hardly holds in practice!

[Schnabel et al., EMNLP 2015; Tsvetkov et al., EMNLP 2015]

Similarity-Based vs Feature-Based?

Parallel to the division into intrinsic and extrinsic tasks, there is another perspective:

Similarity-based use of embeddings: applications that rely on (constrained) nearest neighbour search in the cross-lingual word embedding graph

Feature-based use of embeddings: use cross-lingual embeddings directly as features (information retrieval, cross-lingual transfer?)

Cross-Lingual Lexicon Induction and Cross-Lingual Word Similarity

Semantic similarity = measuring distance in the induced space

Bilingual lexicons: (en_morning, it_mattina), (en_carpet, pt_tapete), ...
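A minimal sketch of similarity-based lexicon induction, assuming the two embedding matrices already live in one shared space and are unit-normalised; names are illustrative.

```python
import numpy as np

def induce_lexicon(X_src, X_tgt, src_words, tgt_words, topk=1):
    """Bilingual lexicon induction by nearest-neighbour search in a shared space.

    X_src, X_tgt: unit-normalised embedding matrices in the same cross-lingual
    space; src_words and tgt_words are the matching vocabularies.
    """
    sims = X_src @ X_tgt.T                       # cosine similarity matrix
    best = np.argsort(-sims, axis=1)[:, :topk]   # top-k target candidates
    return {src_words[i]: [tgt_words[j] for j in best[i]]
            for i in range(len(src_words))}
```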


Cross-Lingual Lexicon Induction

Similarity-based vs. classification-based lexicon induction

Word embeddings used as features for classification

[Irvine and Callison-Burch, CompLing 2017; Heyman et al., EACL 2017]

Extrinsic Evaluation
Cross-lingual Document Classification

Quite popular in the CLWE literature

Cross-lingual knowledge transfer?

[Klementiev et al., COLING 2012; Heyman et al., DAMI 2015]

Cross-Lingual Document Classification

Extrinsic Evaluation
Cross-lingual Document Classification

Using word-level information composed into a document-level representation

[Hermann and Blunsom, ACL 2014]

1. Is this really an optimal way to approach this task?
2. Is this really an optimal way to evaluate word embeddings?

Cross-Lingual Dependency Parsing


Other Applications
Cross-lingual Task X, Cross-lingual Task Y, ...

Cross-lingual supersense tagging

[Gouws and Søgaard, NAACL 2015]

Word alignment

[Levy et al., EACL 2017]

POS tagging

[Søgaard et al., ACL 2015; Zhang et al., NAACL 2016]

Entity linking: wikification

[Tsai and Roth, NAACL 2016]

Relation prediction

[Glavaš and Vulić, NAACL 2018]

Other Applications
Cross-lingual Task X, Cross-lingual Task Y, ...

Cross-lingual frame-semantic parsing and discourse parsing

[Johannsen et al., EMNLP 2015; Braud et al., EACL 2017]

Cross-lingual lexical entailment

[Vyas and Carpuat, NAACL 2016; Upadhyay et al., NAACL 2018]

Semantic role labeling

[Kozhevnikov and Titov, ACL 2013; Akbik et al., ACL 2016]

Sentiment analysis

[Zhou et al., ACL 2016; Abdalla and Hirst, arXiv 2017]


Extrinsic Evaluation and Application
Cross-lingual Information Retrieval

Semantic representations for cross-lingual information retrieval (CLIR)

[Vulić and Moens, SIGIR 2015; Mitra and Craswell, arXiv 2017]

Extrinsic Evaluation and Application
Cross-lingual Information Retrieval

Document and Query Embeddings

Composition of word embeddings [Mitchell and Lapata, ACL 2008]:

vec(d) = vec(w_1) + vec(w_2) + ... + vec(w_{|N_d|})

The dim-dimensional document embedding in the same cross-lingual word embedding space:

vec(d) = [f_{d,1}, ..., f_{d,k}, ..., f_{d,dim}]

Extrinsic Evaluation and Application
Cross-lingual Information Retrieval

Document and Query Embeddings

→ The same principles with queries

vec(Q) = vec(q_1) + vec(q_2) + ... + vec(q_m)

The dim-dimensional query embedding in the same bilingual word embedding space:

vec(Q) = [f_{Q,1}, ..., f_{Q,k}, ..., f_{Q,dim}]
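A sketch of this additive composition and the resulting ranking, assuming a single dictionary that maps words of both languages to their shared-space vectors; the function names are hypothetical.

```python
import numpy as np

def embed_text(tokens, vectors):
    """Sum the cross-lingual word vectors of the tokens we have vectors for."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    dim = next(iter(vectors.values())).shape
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

def rank_documents(query_tokens, docs_tokens, vectors):
    """Rank (possibly other-language) documents by cosine similarity to the query."""
    q = embed_text(query_tokens, vectors)
    scores = []
    for tokens in docs_tokens:
        d = embed_text(tokens, vectors)
        denom = np.linalg.norm(q) * np.linalg.norm(d) + 1e-9
        scores.append(float(q @ d) / denom)
    return np.argsort(scores)[::-1]   # document indices, best match first
```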

Extrinsic Evaluation and Application
Language Understanding: Multilingual Dialog State Tracking


Extrinsic Evaluation and Application
Language Understanding: Multilingual Dialog State Tracking

The Neural Belief Tracker is a novel DST model/framework which aims to satisfy the following design goals:

1 End-to-end learnable (no SLU modules or semantic dictionaries).

2 Generalisation to unseen slot values.

3 Capability of leveraging the semantic content of pre-trained word vector spaces without human supervision.

[Mrkšić et al., ACL 2017; Mrkšić and Vulić, ACL 2018]


Extrinsic Evaluation and Application
Language Understanding: Multilingual Dialog State Tracking

Representation Learning + Label Embedding + Separate Binary Decisions

To overcome data sparsity, NBT models use label embedding to decompose multi-class classification into many binary ones.

Extrinsic Evaluation and Application
Language Understanding: Multilingual Dialog State Tracking

DST in Italian and German

Multilingual WOZ 2.0 (Mrkšić et al., 2017): Italian and German. The 1,200 dialogues in WOZ 2.0 were translated by native Italian and German speakers instructed to consider the preceding dialogue context.

Word Vector Space                      EN     IT     DE
Best Baseline Vector Space             81.6   71.8   50.5
Monolingual Distributional Vectors     77.6   71.2   46.6
+ Monolingual Specialisation           80.9   72.7   52.4
++ Cross-Lingual Specialisation        80.3   75.3   55.7
+++ Multilingual DST Model             82.8   77.1   57.7

Extrinsic Evaluation and Application
Language Understanding: Multilingual Dialog State Tracking

DST Models for Resource-Poor Languages

Ontology Grounding: Multilingual DST Models

The domain ontology (i.e. the concepts it expresses) is language agnostic, which means that `labels' persist across languages. Using training data for two (or more) languages, and cross-lingual vectors of high quality, we train the first-ever multilingual DST model.

Take-Home Messages

1. Cross-Lingual Word Embedding Models are More Similar to Old(er) NLP Ideas than It Seems

But they are simple, efficient, and state-of-the-art...

2. CLWE Models are More Similar than It Seems

We have addressed similarities between a plethora of cross-lingual word embedding models

3. Data is Crucial (more than the Chosen Algorithm)

Optimization and various modeling tricks matter, but the chosen multilingual signal is the most important factor

4. CL Word Embeddings Support Cross-Lingual NLP

They are useful across a wide variety of cross-lingual NLP tasks, for cross-lingual (re-lexicalized) transfer, language understanding, IR, ...


Embeddings are Everywhere (Cross-lingual)

Intrinsic (semantic) tasks:

Cross-lingual and multilingual semantic similarity

Bilingual lexicon learning, multi-modal representations

Extrinsic (downstream) tasks:

Cross-lingual SRL, POS tagging, NER; cross-lingual dependency parsing and sentiment analysis; cross-lingual annotation and model transfer

Statistical/Neural machine translation

Beyond NLP:

(Cross-Lingual) Information retrieval

(Cross-Lingual) Question answering

Cross-Lingual Space of Thankyous

This talk was based on a recent tutorial:

http://people.ds.cam.ac.uk/iv250/tutorial/xlingrep-tutorial.pdf

Book under review: Morgan & Claypool Synthesis Lectures

with Anders Søgaard, Sebastian Ruder, Manaal Faruqui
