TRANSCRIPT
How to build a multi-lingual text classifier? July 24, 2019 · Axel de Romblay
© 2019 CONFIDENTIAL
Motivation: WHAT FOR?
Let's introduce some important figures, goals & KPIs.
A multi-lingual video catalog
Dailymotion hosts hundreds of millions of videos in more than 20 languages. Our purpose is to share the most compelling music, entertainment, news and sports content around.
Content categorization for a better user experience
Why do we care at dailymotion about being able to accurately categorize content at scale?
• Watching interface
• Search engine
• SEO & acquisition
Criteria for a good classification
• High precision/coverage tradeoff: tag the maximum number of videos with a minimum error rate
• Fast & up-to-date annotation: get updated topics/categories
• Relevance & quality: get relevant & meaningful topics/categories
• Multi-lingual annotation: tag videos for all languages
First steps: WHAT ARE THE REQUIREMENTS?
Let's introduce the settings: language detection, video annotation using NEL & unsupervised categorization of topics.
Topic annotation for English & French videos
Reference: https://medium.com/dailymotion/topic-annotation-automatic-algorithms-data-377079d27936
Topic annotation pipeline
Text extraction & language detection
How to detect the language?
• Polyglot Python package (based on cld2)
• Naïve Bayes classifier trained on millions of web pages for each language
• Optimized to run at scale
• Supports 196 languages
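The cld2 approach mentioned above can be illustrated with a toy character-trigram Naïve Bayes detector. This is only a sketch: the real model is trained on millions of web pages per language, while the two training sentences here are made up for the example.

```python
import math
from collections import Counter

# Toy sketch of cld2-style detection: one character-trigram Naive Bayes
# model per language. The tiny training texts below are illustrative only.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "fr": "le renard brun rapide saute par dessus le chien paresseux",
}

def trigrams(text):
    text = f"  {text} "  # pad so word boundaries become trigrams too
    return [text[i:i + 3] for i in range(len(text) - 2)]

MODELS = {lang: Counter(trigrams(txt)) for lang, txt in TRAIN.items()}

def detect(text):
    def log_prob(lang):
        counts, total = MODELS[lang], sum(MODELS[lang].values())
        # Laplace smoothing so unseen trigrams do not zero out the score
        return sum(math.log((counts[g] + 1) / (total + 1)) for g in trigrams(text))
    return max(MODELS, key=log_prob)

print(detect("the dog runs"))    # "en"
print(detect("le chien saute"))  # "fr"
```

In practice the Polyglot package wraps a pre-trained cld2 model, so no training is needed at call time.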
Topic generator: Named Entity Linking with the Wikidata knowledge graph
• Open-source knowledge graph
• Multilingual
• Updated
• 50M interconnected entities
Preprocessing
• Standard preprocessing: stop words, …
• Tokenization: detection of overlapping words
Disambiguation
Given a word a, we choose the appropriate Wikidata entity pa using:
• the commonness of pa: Pr(pa | a)
• a relatedness score between a and pa
Pruning
We keep relevant Wikidata entities using:
• the link probability of a word
• the coherence of the word
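The exact formulas were shown as images in the slides and did not survive extraction. As a hedged illustration, here are two quantities commonly used in entity linking over a Wikipedia/Wikidata graph: the commonness prior Pr(pa | a) estimated from anchor-link counts, and the Milne-Witten relatedness from shared in-links. The talk's own definitions may differ.

```python
import math

def commonness(anchor_counts, entity):
    """Pr(entity | anchor): how often this anchor text links to this entity."""
    total = sum(anchor_counts.values())
    return anchor_counts.get(entity, 0) / total if total else 0.0

def milne_witten(in_a, in_b, n_entities):
    """Standard Milne-Witten relatedness between two entities,
    computed from their sets of in-linking entities."""
    shared = len(in_a & in_b)
    if shared == 0:
        return 0.0
    big, small = max(len(in_a), len(in_b)), min(len(in_a), len(in_b))
    return 1 - (math.log(big) - math.log(shared)) / (math.log(n_entities) - math.log(small))

# Illustrative counts: the anchor "paris" links to the city (Q90) 900 times
# and to another entity 100 times, so the city has commonness 0.9.
print(commonness({"Q90": 900, "Q47899": 100}, "Q90"))  # 0.9
```

Disambiguation then typically combines the commonness prior with the relatedness of a candidate to the other entities mentioned in the same text.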
Topic filter: feature engineering & centrality classification problem
• Training set
• Candidate topics
• Features: coherence, popularity, disambiguation score, location score, …
• Machine learning
• Topic categorization
Unsupervised & uncontextual topic categorization using Wikidata
• Gather the topics: get fewer classes than topics
• Get different levels of classification (hierarchy): set a good number of levels and good splits for each level (avoiding very small & very big classes)
• Get a relevant number of classes for each topic: have at least one class (coverage) and limit the number of classes per topic
• Get a good label for each class: match the IAB taxonomy or at least get a Wikidata QID
• Coherence/consistency: similar classes must be in the same level (e.g. countries, teams, …)
[Example hierarchy: 20 topics (e.g. Cristiano Ronaldo, RM, Rap, Rock) grouped into 15 classes (11 + 2 + 2) such as Humans / Not Humans, Football Player, Model, Singer]
What are the criteria for a good categorization?
Preprocessing: selecting a Wikidata subgraph with product rules
• Wikidata: 44M topics (vertices) & 4k relations (types of edges)
• Our subgraph: 500k topics & 10 relations
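The slides do not spell out the rules, but the core of going from 4k relations to 10 is a whitelist over edge types. A minimal sketch, with made-up example triples (entity IDs are real Wikidata QIDs/PIDs, but the selection rules are assumptions):

```python
# Toy subgraph extraction: keep only edges whose relation type is
# whitelisted. The real pipeline combines several such rules ("product
# rules") over 44M vertices; this is a three-edge illustration.
EDGES = [
    ("Q615", "P106", "Q937857"),      # Messi -- occupation --> footballer
    ("Q615", "P19", "Q1486"),         # Messi -- place of birth --> Rosario
    ("Q937857", "P279", "Q2066131"),  # footballer -- subclass of --> athlete
]
KEEP_RELATIONS = {"P106", "P279"}  # e.g. occupation, subclass-of

subgraph = [(s, r, o) for s, r, o in EDGES if r in KEEP_RELATIONS]
print(len(subgraph))  # 2
```

Dropping relations like "place of birth" is what keeps the class graph about *what a topic is* rather than every fact about it.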
From a graph of connected classes to a hierarchy of classes
Compute a granularity measure G
It ranks all the classes and allows us to split them into different levels depending on G:
G(c) = number of leaves having a path to the class c
Compute a correlation measure C
It detects and drops correlated classes, which drastically reduces the number of classes per topic:
C(c1, c2) =
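The granularity measure can be sketched on a toy class graph. The slide's formula for C was an image lost in extraction, so the correlation below is a plausible stand-in (Jaccard overlap of leaf sets), not the talk's actual definition.

```python
# Toy class graph: child -> parents. Leaves are topics; internal nodes
# are classes. G(c) counts leaves that have a path up to class c.
PARENTS = {
    "ronaldo": ["football_player"], "messi": ["football_player"],
    "rihanna": ["singer"], "football_player": ["humans"], "singer": ["humans"],
}
LEAVES = ["ronaldo", "messi", "rihanna"]

def ancestors(node):
    seen, stack = set(), [node]
    while stack:
        for p in PARENTS.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def leaf_set(cls):
    return {leaf for leaf in LEAVES if cls in ancestors(leaf)}

def granularity(cls):
    return len(leaf_set(cls))  # G(c): leaves with a path to cls

def correlation(c1, c2):
    # Assumed stand-in for C(c1, c2): overlap of the classes' leaf sets.
    a, b = leaf_set(c1), leaf_set(c2)
    return len(a & b) / len(a | b) if a | b else 0.0

print(granularity("humans"))           # 3
print(granularity("football_player"))  # 2
```

Classes with similar G land in the same hierarchy level; highly correlated classes are redundant and one of them can be dropped.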
A concrete example…
Multi-lingual text classifier
HOW DO WE DO THIS?
Now we can move on to NLP algorithms :)
What we want to do
Get a contextual categorization of our video catalog… for all languages!
How we do this
1. Get a robust representation of our video catalog (multi-lingual embeddings)
2. Train a predictive model on top of it, on French & English videos only!
3. Transfer to other languages
4. Evaluate performance
What we have
57% of our video catalog (French + English) is annotated & categorized into different levels.
The categorization is uncontextual (it only depends on the topic and not on the video).
Where do we stand
Multi-label classification
Get a contextual classification for French & English videos with sparse inputs
Pros
• Fast & accurate model: BOW trained using DataFlow; top-1 accuracy = 0.9
• Low memory usage: sparse model implemented using tf.keras
Cons
• Not transferable to other languages: BOW vocabulary for French/English only
Reference: https://medium.com/dailymotion/how-to-design-deep-learning-models-with-sparse-inputs-in-tensorflow-keras-fd5e754abec1
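The memory argument behind the sparse model can be shown with a dependency-free toy: each video stores only its nonzero token counts instead of a dense vocabulary-sized vector. The vocabulary, weights and classes below are made up; the talk's actual model is a tf.keras implementation (see the referenced post).

```python
# Toy sparse bag-of-words linear classifier. With a real ~1M-token
# vocabulary, a dense (n_videos x vocab) matrix would be enormous, while
# each video only activates a handful of entries.
VOCAB = {"goal": 0, "match": 1, "song": 2, "album": 3}
# weights[class][token_index]: an illustrative 2-class linear model
WEIGHTS = {"sports": [2.0, 1.5, -1.0, -1.0], "music": [-1.0, -1.0, 2.0, 1.5]}

def bow(tokens):
    """Sparse representation: dict of token_index -> count, zeros omitted."""
    counts = {}
    for t in tokens:
        if t in VOCAB:
            idx = VOCAB[t]
            counts[idx] = counts.get(idx, 0) + 1
    return counts

def predict(tokens):
    x = bow(tokens)
    # Dot product touches only the nonzero entries of x
    scores = {c: sum(w[i] * v for i, v in x.items()) for c, w in WEIGHTS.items()}
    return max(scores, key=scores.get)

print(predict(["goal", "match", "goal"]))  # "sports"
print(predict(["song", "album"]))          # "music"
```

In tf.keras the same idea is expressed with `SparseTensor` inputs feeding a dense output layer.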
Robust multi-lingual embeddings with BERT using a Tesla V100
Reference: https://github.com/google-research/bert
Fine-tuning on French/English videos by adding prediction layers
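The idea of adding prediction layers on top of a (frozen) encoder can be shown with a dependency-free stand-in: the multilingual BERT embeddings are simulated by fixed vectors, and only the added sigmoid head is trained. All vectors, names and labels below are made up for illustration.

```python
import math

# Frozen "embeddings" standing in for multilingual BERT outputs of
# labeled French/English videos (2-dim toy vectors, real ones are 768-dim).
EMB = {
    "video_en_sports": [1.0, 0.1], "video_fr_sports": [0.9, 0.0],
    "video_en_music": [0.0, 1.0], "video_fr_music": [0.1, 0.9],
}
LABELS = {"video_en_sports": 1, "video_fr_sports": 1,
          "video_en_music": 0, "video_fr_music": 0}  # 1 = sports

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Train only the added prediction layer (w, b); the encoder stays fixed.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):  # plain gradient descent on the log loss
    for name, x in EMB.items():
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - LABELS[name]  # gradient of log loss w.r.t. the logit
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

p = sigmoid(sum(wi * xi for wi, xi in zip(w, EMB["video_en_sports"])) + b)
print(round(p))  # 1: classified as sports
```

Because the encoder aligns languages in one space, a head trained on FR/EN alone can then score videos in any language it embeds.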
Some qualitative results on other languages
• We display the most confident predictions on other languages (test set)
• To decrease the number of false positives, we can set a threshold to reach 85% precision and deduce the recall
• But how can we make sure that this threshold is the same on the test set?
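The threshold-setting step above can be sketched directly: scan candidate thresholds on a validation set, keep the lowest one that still reaches 85% precision, and read off the recall. The scores and labels are made up.

```python
def precision_recall_at(scores, labels, thr):
    """Precision and recall when predicting positive for score >= thr."""
    pred_pos = [l for s, l in zip(scores, labels) if s >= thr]
    tp = sum(pred_pos)
    precision = tp / len(pred_pos) if pred_pos else 0.0
    recall = tp / sum(labels) if sum(labels) else 0.0
    return precision, recall

# Illustrative model confidences and ground-truth labels (1 = correct topic)
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
labels = [1, 1, 1, 0, 1, 0]

best = None
for t in sorted(set(scores), reverse=True):
    p, r = precision_recall_at(scores, labels, t)
    if p >= 0.85:
        best = (t, p, r)  # lowest threshold still meeting the precision bar

print(best)  # (0.8, 1.0, 0.75)
```

The open question in the slide remains: a threshold calibrated on FR/EN validation data is not guaranteed to yield the same precision on other languages.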
The next step(s)
CONCLUSION
Let’s conclude with the next steps
First, find a quantitative metric for transfer learning.
Two biases:
• Content probably depends on the language (Korean videos tend to display more news, English videos more sports, …)
• BERT is supposed to align multi-lingual embeddings.
Some ideas:
• Translate some videos into English using the Google Translation API
• Apply the BOW model to get a ground truth on other languages
Then tune hyperparameters using state-of-the-art optimization (BOHB).
Reference: https://www.automl.org/blog_bohb/
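The budget-allocation half of BOHB (the Hyperband part) can be sketched as successive halving: evaluate many configurations cheaply, keep the top half, and repeat with more budget. Real BOHB additionally replaces random sampling with a model-based (TPE-style) sampler; the objective below is a made-up stand-in.

```python
import random

random.seed(0)

def score(cfg, budget):
    # Stand-in objective: configs near 0.1 are best; larger budgets
    # (e.g. more training epochs) give a less noisy estimate.
    return -abs(cfg - 0.1) + random.gauss(0, 0.05 / budget)

configs = [random.uniform(0, 1) for _ in range(16)]  # random candidates
budget = 1
while len(configs) > 1:
    ranked = sorted(configs, key=lambda c: score(c, budget), reverse=True)
    configs = ranked[: len(configs) // 2]  # keep the top half
    budget *= 2                            # and double their budget

print(len(configs))  # 1 surviving configuration
```

This spends most compute on promising configurations, which is what makes hyperparameter search affordable for models like the fine-tuned classifier above.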
Push to the production environment.
• Code on GitHub (SQL / Python / TensorFlow / DataFlow) & dump models / tables (Google Cloud)
• Check CI passes: run unit tests, style checks, code reviews, …
• Build a Docker image and push it to the Quay repository
• Deploy (Kubernetes)
• Schedule the tasks (Airflow)
• Monitor (Datadog & Tableau)
Thanks to my squad & thank you!
Contact: https://www.linkedin.com/in/axel-de-romblay-6444a990/