criteo tektosdata meetup

Copyright © 2015 Criteo

The Criteo Experience

Olivier Koch

Engineering Program Manager, Criteo

TektosData Meetup “Data Meets Business”

May 31, 2016


Outline

• What does Criteo do?

• Deep dive into our technical stack

• Delivery at scale

• A few lessons learned



Banners… what else?


Advertiser Publisher


Online advertising at scale


3B displays / day

40 PB of data

15,000 servers

worldwide


• Deep dive into Criteo


Bidding

• Should we bid?

• At which price?

Recommendation

• Which products should we display?

Look & Feel

• Big image vs small image

• Background color, ...

Prediction

• Generic prediction engine

• Specific models trained on TBs of data


Bidding

As we sell performance, Criteo’s and the client’s interests are aligned, so the engine aims to maximize the value we generate for our clients

As the cost of a display is lower than the bid and independent of it (2nd price auction or floor), we should always bid the maximum value that the client is willing to pay for a display

We bid the expected value of the display for the client

[Figure: a display with Value = 1€ compared against prices CPM = 0,6€, 0,7€, 0,75€ (below the value) and CPM = 1,1€, 1,2€, 1,3€ (above the value)]

This bidding strategy is optimal: we are sure to buy all profitable displays and only them
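The optimality claim can be checked with a small simulation. The sketch below is only an illustration: it reads the six prices from the figure as the competing price (highest other bid or floor) of six separate auctions, which is an assumption, and the function names are hypothetical.

```python
def second_price_outcome(bid, value, competing_price):
    """Profit of one auction under second-price rules.

    We win when our bid is at least the competing price (highest other
    bid or floor) and we then pay that competing price, not our own bid.
    """
    if bid >= competing_price:
        return value - competing_price
    return 0.0


VALUE = 1.0  # value of the display for the client, from the slide
COMPETING_PRICES = [0.6, 0.7, 0.75, 1.1, 1.2, 1.3]  # read from the figure


def total_profit(bid):
    """Sum of profits over all example auctions for a fixed bid."""
    return sum(second_price_outcome(bid, VALUE, p) for p in COMPETING_PRICES)


truthful = total_profit(1.0)    # bids the value: wins the three profitable displays
underbid = total_profit(0.72)   # misses the display priced at 0.75
overbid = total_profit(1.25)    # also buys the displays priced at 1.1 and 1.2 at a loss

print(truthful, underbid, overbid)
```

Bidding below the value forfeits profitable displays, and bidding above it buys unprofitable ones, so neither deviation beats bidding the value itself.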


Bid = CPC pClick pSale AOV

2012 - Ensures constant value allocation between Criteo and its clients

2013 - CRO: “Conversion Rate Optimizer”

2014 - COS Optimizer

This value depends on the predicted performance and the client’s objective

Revenue that the display will generate for the client

Maximum share that the client is willing to pay
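As a rough illustration of how the two annotated quantities could combine, the sketch below multiplies the expected revenue of a display (pClick x pSale x AOV) by the maximum share the client accepts to pay. The function names, the 10% share and the 100€ order value are assumptions made for the example, not figures from the talk.

```python
def expected_revenue(p_click, p_sale_given_click, aov):
    """Revenue the display is expected to generate for the client."""
    return p_click * p_sale_given_click * aov


def bid_for_display(p_click, p_sale_given_click, aov, max_share):
    """Bid = maximum share the client will pay x expected revenue."""
    return max_share * expected_revenue(p_click, p_sale_given_click, aov)


# Hypothetical inputs: 0.5% click rate, 2% sale rate after a click,
# 100 EUR average order value, client willing to pay 10% of revenue.
bid = bid_for_display(p_click=0.005, p_sale_given_click=0.02, aov=100.0, max_share=0.10)
bid_cpm = 1000 * bid  # same bid expressed per thousand displays

print(f"bid per display: {bid:.4f} EUR, per thousand: {bid_cpm:.2f} EUR")
```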


We train our prediction models on our historical displays

Historical displays

Variables

Level of engagement of the user

Quality of inventory

User fatigue

For travel: time to check-in and number of nights

Legend: clicked displays; converted displays (size = order value)

Our ability to predict relies greatly on the relevance of the variables we consider

Machine Learning Algorithms


Recommendation

Recommend products for a user

• What we want: reco(user) = products

• 1B users x 3B products!

• But we need to scale and keep it fresh


User X saw orange shoes

Users who saw these same shoes also saw

Most viewed products on the client’s site are

We use collaborative filtering to select candidate products

Candidate products for user X are

Historical

Similar

Best-of
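The three candidate sources can be sketched with plain co-view counts. This is a toy, in-memory illustration of item-to-item collaborative filtering, not the production system; the browsing histories, product names and function names are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_coview_counts(histories):
    """Count, for each product, how many users also saw each other product."""
    coviews = defaultdict(Counter)
    for products in histories.values():
        for a, b in permutations(set(products), 2):
            coviews[a][b] += 1
    return coviews


def candidate_products(user, histories, coviews, top_n=2):
    """Return the historical, similar and best-of candidates for one user."""
    seen = histories[user]
    similar_counts = Counter()
    for product in seen:
        similar_counts.update(coviews[product])
    for product in seen:  # do not re-propose what is already in the history
        similar_counts.pop(product, None)
    views = Counter(p for products in histories.values() for p in products)
    return {
        "historical": list(seen),
        "similar": [p for p, _ in similar_counts.most_common(top_n)],
        "best_of": [p for p, _ in views.most_common(top_n)],
    }


# Hypothetical browsing histories on one client's site.
histories = {
    "x": ["orange_shoes"],
    "a": ["orange_shoes", "blue_shoes"],
    "b": ["orange_shoes", "blue_shoes", "red_hat"],
    "c": ["red_hat", "blue_shoes"],
    "d": ["blue_shoes"],
}
candidates = candidate_products("x", histories, build_coview_counts(histories))
print(candidates)
```

At the scale quoted above (1B users x 3B products), the co-view counts would have to be computed offline and in a distributed fashion; the dictionary version only shows the logic.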


Products delivering the best performance are displayed

Variables

Products seen by the user

Time since product event

Level of similarity

Product features

Historical displays

Legend: clicked products; converted products (size = order value)

Products are selected based on their pClick x pSale x AOV

Machine Learning Algorithms


Look & Feel

Historical displays (color = look & feel)

We train our prediction models on our historical displays

Variables

Some of which we control:

How user interacts with banner

Organization of information

Colorset

Some of which we don’t:

Zone format

Publisher

Legend: clicked displays; converted displays (size = order value)

Look and feel will be selected based on its pClick x pSale x AOV

[Example banner: “My company”, “BUY!”]

Machine Learning Algorithms


Prediction

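Predict: $\mathbb{E}[\mathit{SalesAmount}] = \mathbb{P}(\mathit{Click}) \cdot \mathbb{P}(\mathit{Sale} \mid \mathit{Click}) \cdot \mathbb{E}[\mathit{SalesAmount} \mid \mathit{Sale}]$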

Each model is trained independently & refreshed as often as possible

Three sources of features: user, ad, page (mostly categorical).

Optimizing for sales amount

Models: P(Click) logistic, P(Sale|Click) logistic, E[SalesAmount|Sale] log-normal (all regularized!)
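A minimal sketch of how the three model outputs could be combined into one expected sales amount. It assumes each logistic model yields a linear score passed through a sigmoid, and that the log-normal model yields the parameters mu and sigma of the log order value; the function names and the input values are hypothetical.

```python
import math


def sigmoid(score):
    """Map a linear model score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def lognormal_mean(mu, sigma):
    """Mean of a log-normal variable whose logarithm is N(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma ** 2)


def expected_sales_amount(click_score, sale_score, mu, sigma):
    """E[SalesAmount] = P(Click) * P(Sale | Click) * E[SalesAmount | Sale]."""
    p_click = sigmoid(click_score)
    p_sale_given_click = sigmoid(sale_score)
    return p_click * p_sale_given_click * lognormal_mean(mu, sigma)


# Hypothetical scores for one display.
print(expected_sales_amount(click_score=-5.3, sale_score=-3.9, mu=4.0, sigma=0.8))
```

Note that the mean of a log-normal variable is exp(mu + sigma^2 / 2), not exp(mu), so the variance of the fitted model matters for the predicted amount.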


Learn on huge volumes of data

10 000 displays lead to 50 clicks, which lead to 1 sale
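The funnel above translates into the following rates, which show why sale models need so many displays. The target of 1 000 observed sales is an arbitrary figure chosen for the example.

```python
DISPLAYS, CLICKS, SALES = 10_000, 50, 1  # funnel from the slide

click_rate = CLICKS / DISPLAYS        # clicks per display
sale_rate = SALES / CLICKS            # sales per click
sales_per_display = SALES / DISPLAYS  # sales per display


def displays_needed(target_sales):
    """Displays required, at this funnel, to observe a given number of sales."""
    return target_sales * DISPLAYS // SALES


print(f"click rate {click_rate:.2%}, sale rate {sale_rate:.2%}")
print(f"displays needed for 1 000 sales: {displays_needed(1_000):,}")
```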


In-house Machine Learning library -- IRMA

We have our own large-scale distributed machine learning library on top of Hadoop, used for all models.

From an ML perspective we rely, in most cases, on an L-BFGS solver initialized with SGD (see, e.g., A. Agarwal et al., A Reliable Effective Terascale Linear Learning System).
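To make "L-BFGS initialized with SGD" concrete, here is a single-machine, pure-Python sketch on a tiny synthetic dataset. It is not IRMA code: the data, hyperparameters and function names are assumptions, and the L-BFGS part is a textbook two-loop recursion with a backtracking line search.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss_and_grad(w, data, l2):
    """Mean logistic loss with an L2 penalty, and its gradient."""
    n = len(data)
    loss, grad = 0.5 * l2 * dot(w, w), [l2 * wi for wi in w]
    for x, y in data:
        z = dot(w, x)
        loss += (max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z) / n
        err = sigmoid(z) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / n
    return loss, grad


def sgd_warm_start(data, dim, l2, epochs=2, lr=0.5, seed=0):
    """A few cheap SGD passes to get close to the optimum."""
    rng, w = random.Random(seed), [0.0] * dim
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):
            err = sigmoid(dot(w, x)) - y
            w = [wi - lr * (err * xi + l2 * wi) for wi, xi in zip(w, x)]
    return w


def lbfgs(w, data, l2, iters=50, memory=5):
    """Batch L-BFGS refinement starting from w."""
    loss, grad = loss_and_grad(w, data, l2)
    history = []  # (s, y, rho) triples, oldest first
    for _ in range(iters):
        q, alphas = grad[:], []
        for s, y, rho in reversed(history):  # first loop: newest to oldest
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:  # scale by an estimate of the inverse Hessian
            s, y, _ = history[-1]
            q = [dot(s, y) / dot(y, y) * qi for qi in q]
        for (s, y, rho), a in zip(history, reversed(alphas)):  # oldest to newest
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        direction = [-qi for qi in q]
        step, slope = 1.0, dot(grad, direction)
        while True:  # backtracking line search (Armijo condition)
            new_w = [wi + step * di for wi, di in zip(w, direction)]
            new_loss, new_grad = loss_and_grad(new_w, data, l2)
            if new_loss <= loss + 1e-4 * step * slope or step < 1e-10:
                break
            step *= 0.5
        s = [a - b for a, b in zip(new_w, w)]
        y = [a - b for a, b in zip(new_grad, grad)]
        if dot(s, y) > 1e-12:
            history = (history + [(s, y, 1.0 / dot(s, y))])[-memory:]
        w, loss, grad = new_w, new_loss, new_grad
        if math.sqrt(dot(grad, grad)) < 1e-6:
            break
    return w, loss


# Synthetic data: bias feature plus two random features (hypothetical).
rng = random.Random(0)
data = []
for _ in range(200):
    u, v = rng.uniform(-1, 1), rng.uniform(-1, 1)
    label = 1.0 if rng.random() < sigmoid(-1 + 2 * u - v) else 0.0
    data.append(([1.0, u, v], label))

L2 = 0.01
w_warm = sgd_warm_start(data, dim=3, l2=L2)
warm_loss, _ = loss_and_grad(w_warm, data, L2)
w_final, final_loss = lbfgs(w_warm, data, L2)
final_grad_norm = math.sqrt(sum(g * g for g in loss_and_grad(w_final, data, L2)[1]))
print(warm_loss, final_loss, final_grad_norm)
```

The SGD passes are cheap and noisy; L-BFGS then refines the solution with full-batch gradients, which is the part that distributes naturally.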


Learning duration: trading time and volume

Longer ⇒ Volume ↑ VS Shorter ⇒ Reactivity ↑

[Figure: sales amount (€) per day from 11/01/2014 to 20/02/2014, with Valentine’s day eve marked; precision as a function of learning duration, series 12/02/2014 to 18/02/2014 and “All”]


Each model is trained on several TB of data and contains millions of features

We learn several hundreds of models, refreshed many times per day

How about large-scale distributed machine learning?

Wait a minute: how do you handle TBs of training data?



Distribution of L-BFGS & SGD

Hadoop AllReduce

L-BFGS, being a batch algorithm, is easy to distribute (by distributing the computation of the gradient), while SGD is more difficult; for SGD we use parameter averaging, which needs some tweaking (learning rate, number of epochs, …). Within SGD, we use Hogwild! for multi-threading.

Zookeeper ensures fault tolerance.
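The two distribution schemes can be sketched in a single process, with Python lists standing in for workers. This is only an illustration under that assumption: `allreduce_sum` mimics what an AllReduce step computes, the function names are hypothetical, and neither Hogwild! nor the Zookeeper-based fault tolerance is shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_sum(w, examples):
    """Sum of per-example logistic-loss gradients on one worker's shard."""
    grad = [0.0] * len(w)
    for x, y in examples:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def allreduce_sum(vectors):
    """Element-wise sum of the workers' vectors (what AllReduce returns to each)."""
    return [sum(column) for column in zip(*vectors)]


def distributed_gradient(w, shards):
    """Batch gradient: each worker sums over its shard, then one AllReduce."""
    total = allreduce_sum([gradient_sum(w, shard) for shard in shards])
    n = sum(len(shard) for shard in shards)
    return [g / n for g in total]


def local_sgd(w, shard, lr=0.1, epochs=5):
    """SGD run independently on one shard."""
    w = list(w)
    for _ in range(epochs):
        for x, y in shard:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def average_parameters(models):
    """Parameter averaging: combine the workers' SGD solutions."""
    return [sum(column) / len(models) for column in zip(*models)]


# Toy data split over two "workers".
examples = [([1.0, 0.5], 1.0), ([1.0, -1.0], 0.0), ([1.0, 2.0], 1.0), ([1.0, -0.5], 0.0)]
shards = [examples[:2], examples[2:]]
w0 = [0.0, 0.0]

single_machine = [g / len(examples) for g in gradient_sum(w0, examples)]
sharded = distributed_gradient(w0, shards)
averaged = average_parameters([local_sgd(w0, shard) for shard in shards])
print(single_machine, sharded, averaged)
```

The sharded batch gradient matches the single-machine gradient exactly, which is why L-BFGS distributes without changing the algorithm. Averaged SGD models are only an approximation of what one sequential SGD run would produce, hence the tweaking mentioned above.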


A word on advanced techniques

IRMA is not only about vanilla logistic regression with L2 regularization; it also contains more advanced techniques: transfer learning, factorization machines, learning to rank, …

For example, we use cost-sensitive learning for bidding.


Offline & online evaluation

Two steps:

• Offline testing is fast, cheap, and efficient for wide exploration

• Online testing is expensive but has the final word

The more data you have, the faster you can make decisions
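One way to see why more data speeds up online decisions is a two-proportion z-test on click rates. The sketch below is a generic statistics example, not a description of the actual A/B testing pipeline; the click rates (0.50% vs 0.55%) and traffic volumes are assumptions.

```python
import math


def two_proportion_z(clicks_a, displays_a, clicks_b, displays_b):
    """z-score of the difference in click rate between populations A and B."""
    rate_a = clicks_a / displays_a
    rate_b = clicks_b / displays_b
    pooled = (clicks_a + clicks_b) / (displays_a + displays_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / displays_a + 1 / displays_b))
    return (rate_b - rate_a) / std_err


# Same relative uplift, observed on 100k and then on 1M displays per population.
z_small = two_proportion_z(500, 100_000, 550, 100_000)
z_large = two_proportion_z(5_000, 1_000_000, 5_500, 1_000_000)

print(f"z with 100k displays each: {z_small:.2f}")
print(f"z with 1M displays each:   {z_large:.2f}")
```

With 100k displays per population the uplift stays below the usual 1.96 threshold for 95% confidence; with ten times the traffic the same uplift is clearly significant, so the test can stop sooner at higher volume.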



Physical infrastructure

7 in-house data centers on 3 continents

~ 15000 servers, largest Hadoop cluster in Europe

More than 35 PB of Big Data storage

Traffic

800k HTTP requests / sec (peak activity)

29000 impressions / sec (peak activity)

<10 ms to process bidding request

<100 ms to process reco request


Academic research @ Criteo

• Our 1st public dataset is online: http://bit.ly/1vgw2XC

• New 1TB dataset released last year

• Recent publications:

Offline evaluation of response prediction in online advertising auctions, O. Chapelle, WWW’15.

Sources of variability in large-scale machine learning systems, D. Lefortier, A. Truchet, and M. de Rijke, NIPS workshop on ML systems, 2015.

Cost-sensitive learning for bidding in online advertising auctions, F. Vasile and D. Lefortier, NIPS workshop on ML for e-commerce, 2015.


New areas of research

• Counterfactual evaluation (offline A/B tests)

• Product embeddings for recommendation

• Policy learning



• Delivery at scale


The early days of Criteo


Single C# repository

Build in 90 minutes

Weekly merges


What could go wrong?



Delivery at scale at Criteo


Trunk-based development (TBD)

Fast commits

Code reviews with Gerrit

The MOAB

Deploy with scp / bittorrent

Automatic metrics checks

=> 200+ happy engineers!


The Criteo MOAB



Delivery at scale at Criteo



• A few lessons learned


Start small

• If you can't build it with a few machines, it's likely you won't be able to do it with many


First Google computer


Start small

• Keep fancy algorithms for later

The PageRank algorithm


Iterate fast

• Easy access to data (20PB vs 4GB of clean, carefully selected data)

• Convenient technologies (e.g. Python & notebooks, scikit-learn)

• Make IT a non-problem

• Keep projects small (typical project size 3-9 months)


Talent magnet


Keep teams small

3 members: 3 channels

4 members: 6 channels

5 members: 10 channels

10 members: 45 channels

…
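The figures above follow the number of distinct pairs in a team, n(n-1)/2, so communication channels grow quadratically with team size. A short check (the function name is ours):

```python
def communication_channels(members):
    """Number of distinct pairs of people in a team of the given size."""
    return members * (members - 1) // 2


channel_counts = {n: communication_channels(n) for n in (3, 4, 5, 10)}
print(channel_counts)
```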


Build the right team

• Variety of skills

• Software/ML engineers, ops/devops

• Analysts/BI

• Product

• Designers

• Managers



Make the team agile

• Use a flat, distributed hierarchy model and make people sit next to each other

[Diagram: EPM, ENG LEAD, PM, MGR]


Make the team agile

• Use the right tools

• Slack

• Jira

• Confluence

• Git

• Gerrit

• OKR


Build the culture

• Let ideas emerge bottom-up

• Hackathons (for real)

• 10% projects

• Transparency: make info available to all

• Use mature technologies

• You will fail. That’s OK!



Take-aways

• Start small

• Iterate fast

• Build the team

• Make the team agile

• Build the culture



• Thanks! Questions?