TRANSCRIPT
How to build a multi-lingual text classifier? July 24, 2019 · Axel de Romblay
© 2019 CONFIDENTIAL
Motivation: WHAT FOR?
Let's introduce some important figures, goals & KPIs.
A multi-lingual video catalog
Dailymotion hosts hundreds of millions of videos in more than 20 languages. Our purpose is to share the most compelling music, entertainment, news and sports content around.
Content categorization for a better user experience
Why do we care at dailymotion about being able to accurately categorize content at scale?
• Watching interface
• Search engine
• SEO & acquisition
Criteria for a good classification
• High precision/coverage tradeoff: tag the maximum number of videos with a minimum error rate
• Fast & up-to-date annotation: get updated topics/categories
• Relevance & quality: get relevant & meaningful topics/categories
• Multi-lingual annotation: tag videos for all languages
First steps: WHAT ARE THE REQUIREMENTS?
Let's introduce the settings: language detection, video annotation using NEL & unsupervised categorization of topics.
Topic annotation for English & French videos
Reference: https://medium.com/dailymotion/topic-annotation-automatic-algorithms-data-377079d27936
Topic annotation pipeline
Text extraction & language detection
How to detect the language?
• Polyglot Python package (based on cld2)
• Naïve Bayes classifier trained on millions of web pages for each language
• Optimized to run at scale
• Supports 196 languages
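The cld2 approach mentioned above can be illustrated with a toy character-trigram Naïve Bayes detector. This is only a sketch: the real model is trained on millions of web pages per language, while the two training sentences here are made up for the example.

```python
import math
from collections import Counter

# Toy sketch of cld2-style detection: one character-trigram Naive Bayes
# model per language. The tiny training texts below are illustrative only.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "fr": "le renard brun rapide saute par dessus le chien paresseux",
}

def trigrams(text):
    text = f"  {text} "  # pad so word boundaries become trigrams too
    return [text[i:i + 3] for i in range(len(text) - 2)]

MODELS = {lang: Counter(trigrams(txt)) for lang, txt in TRAIN.items()}

def detect(text):
    def log_prob(lang):
        counts, total = MODELS[lang], sum(MODELS[lang].values())
        # Laplace smoothing so unseen trigrams do not zero out the score
        return sum(math.log((counts[g] + 1) / (total + 1)) for g in trigrams(text))
    return max(MODELS, key=log_prob)

print(detect("the dog runs"))    # "en"
print(detect("le chien saute"))  # "fr"
```

In practice the Polyglot package wraps a pre-trained cld2 model, so no training is needed at call time.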
Topic generator: Named Entity Linking with the Wikidata knowledge graph
• Open-source knowledge graph
• Multilingual
• Updated
• 50M interconnected entities
Preprocessing
• Standard preprocessing: stop words, …
• Tokenization: detection of overlapping words
Disambiguation
Given a word a, we choose the appropriate Wikidata entity pa using:
• the commonness of pa: Pr(pa | a)
• a relatedness score between a and pa
Pruning
We keep relevant Wikidata entities using:
• the link probability of a word
• the coherence of the word
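The exact formulas were shown as images in the slides and did not survive extraction. As a hedged illustration, here are two quantities commonly used in entity linking over a Wikipedia/Wikidata graph: the commonness prior Pr(pa | a) estimated from anchor-link counts, and the Milne-Witten relatedness from shared in-links. The talk's own definitions may differ.

```python
import math

def commonness(anchor_counts, entity):
    """Pr(entity | anchor): how often this anchor text links to this entity."""
    total = sum(anchor_counts.values())
    return anchor_counts.get(entity, 0) / total if total else 0.0

def milne_witten(in_a, in_b, n_entities):
    """Standard Milne-Witten relatedness between two entities,
    computed from their sets of in-linking entities."""
    shared = len(in_a & in_b)
    if shared == 0:
        return 0.0
    big, small = max(len(in_a), len(in_b)), min(len(in_a), len(in_b))
    return 1 - (math.log(big) - math.log(shared)) / (math.log(n_entities) - math.log(small))

# Illustrative counts: the anchor "paris" links to the city (Q90) 900 times
# and to another entity 100 times, so the city has commonness 0.9.
print(commonness({"Q90": 900, "Q47899": 100}, "Q90"))  # 0.9
```

Disambiguation then typically combines the commonness prior with the relatedness of a candidate to the other entities mentioned in the same text.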
Topic filter: feature engineering & centrality classification problem
• Training set
• Candidate topics
• Features: coherence, popularity, disambiguation score, location score, …
• Machine learning
• Topic categorization
Unsupervised & uncontextual topic categorization using Wikidata
• Gather the topics: get fewer classes than topics
• Get different levels of classification (hierarchy): set a good number of levels and good splits for each level (avoiding very small & very big classes)
• Get a relevant number of classes for each topic: have at least one class (coverage) and limit the number of classes per topic
• Get a good label for each class: match the IAB taxonomy or at least get a Wikidata QID
• Coherence/consistency: similar classes must be in the same level (e.g. countries, teams, …)
[Example hierarchy: 20 topics (e.g. Cristiano Ronaldo, RM, Rap, Rock) grouped into 15 classes (11 + 2 + 2) such as Humans / Not Humans, Football Player, Model, Singer]
What are the criteria for a good categorization?
Preprocessing: selecting a Wikidata subgraph with product rules
• Wikidata: 44M topics (vertices) & 4k relations (types of edges)
• Our subgraph: 500k topics & 10 relations
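The slides do not spell out the rules, but the core of going from 4k relations to 10 is a whitelist over edge types. A minimal sketch, with made-up example triples (entity IDs are real Wikidata QIDs/PIDs, but the selection rules are assumptions):

```python
# Toy subgraph extraction: keep only edges whose relation type is
# whitelisted. The real pipeline combines several such rules ("product
# rules") over 44M vertices; this is a three-edge illustration.
EDGES = [
    ("Q615", "P106", "Q937857"),      # Messi -- occupation --> footballer
    ("Q615", "P19", "Q1486"),         # Messi -- place of birth --> Rosario
    ("Q937857", "P279", "Q2066131"),  # footballer -- subclass of --> athlete
]
KEEP_RELATIONS = {"P106", "P279"}  # e.g. occupation, subclass-of

subgraph = [(s, r, o) for s, r, o in EDGES if r in KEEP_RELATIONS]
print(len(subgraph))  # 2
```

Dropping relations like "place of birth" is what keeps the class graph about *what a topic is* rather than every fact about it.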
From a graph of connected classes to a hierarchy of classes
Compute a granularity measure G
It ranks all the classes and allows us to split them into different levels depending on G:
G(c) = number of leaves having a path to the class c
Compute a correlation measure C
It detects and drops correlated classes, which drastically reduces the number of classes per topic:
C(c1, c2) =
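The granularity measure can be sketched on a toy class graph. The slide's formula for C was an image lost in extraction, so the correlation below is a plausible stand-in (Jaccard overlap of leaf sets), not the talk's actual definition.

```python
# Toy class graph: child -> parents. Leaves are topics; internal nodes
# are classes. G(c) counts leaves that have a path up to class c.
PARENTS = {
    "ronaldo": ["football_player"], "messi": ["football_player"],
    "rihanna": ["singer"], "football_player": ["humans"], "singer": ["humans"],
}
LEAVES = ["ronaldo", "messi", "rihanna"]

def ancestors(node):
    seen, stack = set(), [node]
    while stack:
        for p in PARENTS.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def leaf_set(cls):
    return {leaf for leaf in LEAVES if cls in ancestors(leaf)}

def granularity(cls):
    return len(leaf_set(cls))  # G(c): leaves with a path to cls

def correlation(c1, c2):
    # Assumed stand-in for C(c1, c2): overlap of the classes' leaf sets.
    a, b = leaf_set(c1), leaf_set(c2)
    return len(a & b) / len(a | b) if a | b else 0.0

print(granularity("humans"))           # 3
print(granularity("football_player"))  # 2
```

Classes with similar G land in the same hierarchy level; highly correlated classes are redundant and one of them can be dropped.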
A concrete example…
Multi-lingual text classifier
HOW DO WE DO THIS?
Now we can move on to NLP algorithms :)
What we want to do
Get a contextual categorization of our video catalog… for all languages!
How we do this
1. Get a robust representation of our video catalog (multi-lingual embeddings)
2. Train a predictive model on top of it, on French & English videos only!
3. Transfer to other languages
4. Evaluate performance
What we have
57% of our video catalog (French + English) is annotated & categorized into different levels.
The categorization is uncontextual (it only depends on the topic and not on the video).
Where do we stand
Multi-label classification
Get a contextual classification for French & English videos with sparse inputs
Pros
• Fast & accurate model: BOW trained using DataFlow; top-1 accuracy = 0.9
• Low memory usage: sparse model implemented using tf.keras
Cons
• Not transferable to other languages: BOW vocabulary for French/English only
Reference: https://medium.com/dailymotion/how-to-design-deep-learning-models-with-sparse-inputs-in-tensorflow-keras-fd5e754abec1
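The memory argument behind the sparse model can be shown with a dependency-free toy: each video stores only its nonzero token counts instead of a dense vocabulary-sized vector. The vocabulary, weights and classes below are made up; the talk's actual model is a tf.keras implementation (see the referenced post).

```python
# Toy sparse bag-of-words linear classifier. With a real ~1M-token
# vocabulary, a dense (n_videos x vocab) matrix would be enormous, while
# each video only activates a handful of entries.
VOCAB = {"goal": 0, "match": 1, "song": 2, "album": 3}
# weights[class][token_index]: an illustrative 2-class linear model
WEIGHTS = {"sports": [2.0, 1.5, -1.0, -1.0], "music": [-1.0, -1.0, 2.0, 1.5]}

def bow(tokens):
    """Sparse representation: dict of token_index -> count, zeros omitted."""
    counts = {}
    for t in tokens:
        if t in VOCAB:
            idx = VOCAB[t]
            counts[idx] = counts.get(idx, 0) + 1
    return counts

def predict(tokens):
    x = bow(tokens)
    # Dot product touches only the nonzero entries of x
    scores = {c: sum(w[i] * v for i, v in x.items()) for c, w in WEIGHTS.items()}
    return max(scores, key=scores.get)

print(predict(["goal", "match", "goal"]))  # "sports"
print(predict(["song", "album"]))          # "music"
```

In tf.keras the same idea is expressed with `SparseTensor` inputs feeding a dense output layer.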
Robust multi-lingual embeddings with BERT using a Tesla V100
Reference: https://github.com/google-research/bert
Fine-tuning on French/English videos by adding prediction layers
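The idea of adding prediction layers on top of a (frozen) encoder can be shown with a dependency-free stand-in: the multilingual BERT embeddings are simulated by fixed vectors, and only the added sigmoid head is trained. All vectors, names and labels below are made up for illustration.

```python
import math

# Frozen "embeddings" standing in for multilingual BERT outputs of
# labeled French/English videos (2-dim toy vectors, real ones are 768-dim).
EMB = {
    "video_en_sports": [1.0, 0.1], "video_fr_sports": [0.9, 0.0],
    "video_en_music": [0.0, 1.0], "video_fr_music": [0.1, 0.9],
}
LABELS = {"video_en_sports": 1, "video_fr_sports": 1,
          "video_en_music": 0, "video_fr_music": 0}  # 1 = sports

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Train only the added prediction layer (w, b); the encoder stays fixed.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):  # plain gradient descent on the log loss
    for name, x in EMB.items():
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - LABELS[name]  # gradient of log loss w.r.t. the logit
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

p = sigmoid(sum(wi * xi for wi, xi in zip(w, EMB["video_en_sports"])) + b)
print(round(p))  # 1: classified as sports
```

Because the encoder aligns languages in one space, a head trained on FR/EN alone can then score videos in any language it embeds.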
Some qualitative results on other languages
• We display the most confident predictions on other languages (test set)
• To decrease the number of false positives, we can set a threshold to reach 85% precision and deduce the recall
• But how can we make sure that this threshold is the same on the test set?
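The threshold-setting step above can be sketched directly: scan candidate thresholds on a validation set, keep the lowest one that still reaches 85% precision, and read off the recall. The scores and labels are made up.

```python
def precision_recall_at(scores, labels, thr):
    """Precision and recall when predicting positive for score >= thr."""
    pred_pos = [l for s, l in zip(scores, labels) if s >= thr]
    tp = sum(pred_pos)
    precision = tp / len(pred_pos) if pred_pos else 0.0
    recall = tp / sum(labels) if sum(labels) else 0.0
    return precision, recall

# Illustrative model confidences and ground-truth labels (1 = correct topic)
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
labels = [1, 1, 1, 0, 1, 0]

best = None
for t in sorted(set(scores), reverse=True):
    p, r = precision_recall_at(scores, labels, t)
    if p >= 0.85:
        best = (t, p, r)  # lowest threshold still meeting the precision bar

print(best)  # (0.8, 1.0, 0.75)
```

The open question in the slide remains: a threshold calibrated on FR/EN validation data is not guaranteed to yield the same precision on other languages.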
The next step(s)
CONCLUSION
Let’s conclude with the next steps
First, find a quantitative metric for transfer learning.
Two biases:
• Content probably depends on the language (Korean videos tend to display more news, English videos more sports, …)
• BERT is supposed to align multi-lingual embeddings.
Some ideas:
• Translate some videos into English using the Google Translation API
• Apply the BOW model to get a ground truth on other languages
Then tune hyperparameters using state-of-the-art optimization (BOHB).
Reference: https://www.automl.org/blog_bohb/
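The budget-allocation half of BOHB (the Hyperband part) can be sketched as successive halving: evaluate many configurations cheaply, keep the top half, and repeat with more budget. Real BOHB additionally replaces random sampling with a model-based (TPE-style) sampler; the objective below is a made-up stand-in.

```python
import random

random.seed(0)

def score(cfg, budget):
    # Stand-in objective: configs near 0.1 are best; larger budgets
    # (e.g. more training epochs) give a less noisy estimate.
    return -abs(cfg - 0.1) + random.gauss(0, 0.05 / budget)

configs = [random.uniform(0, 1) for _ in range(16)]  # random candidates
budget = 1
while len(configs) > 1:
    ranked = sorted(configs, key=lambda c: score(c, budget), reverse=True)
    configs = ranked[: len(configs) // 2]  # keep the top half
    budget *= 2                            # and double their budget

print(len(configs))  # 1 surviving configuration
```

This spends most compute on promising configurations, which is what makes hyperparameter search affordable for models like the fine-tuned classifier above.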
Push to the production environment.
• Code on GitHub (SQL / Python / TensorFlow / DataFlow) & dump models / tables (Google Cloud)
• Check CI passes: run unit tests, style checks, code reviews, …
• Build a Docker image and push it to the Quay repository
• Deploy (Kubernetes)
• Schedule the tasks (Airflow)
• Monitor (Datadog & Tableau)
Thanks to my squad & thank you!
Contact: https://www.linkedin.com/in/axel-de-romblay-6444a990/