TRANSCRIPT
Training Large-scale Ad Ranking Models in Spark
PRESENTED BY Patrick Pletscher | October 19, 2015
About Us
Michal Aharon, Oren Somekh, Yaacov Fernandess, Yair Koren, Amit Kagian, Shahar Golan, Raz Nissim, Patrick Pletscher
Amir Ingber
Haifa
Collaborator
What We Do
Research focused on ad ranking algorithms for Yahoo Gemini Native Ads
Ad Ranking Overview
• Advertisers run several campaigns, each with several ads
• Each ad has a bid set by the advertiser; different ad price types
- pay per view
- pay per click
- various conversion price types
• Auction for each impression on a Gemini Native enabled property
- auction between all eligible ads (filtered by targeting/budget)
- the ad with the highest expected revenue wins (see the sketch below)
• Need to know the (personalized!) probability of a click
- we mostly get money for clicks / conversions!
[Figure: auction example; Ad 1 bids $1 with a 5% predicted CTR (5c expected revenue per impression), Ad 2 bids $2 with a 1% CTR (2c), so Ad 1 wins for this user]
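A minimal sketch of the expected-revenue ranking for pay-per-click ads, mirroring the figure above (the Ad case class and the bid/CTR values are illustrative assumptions):

// Minimal sketch of ranking ads by expected revenue; values mirror the figure above.
case class Ad(name: String, bidDollars: Double)

def expectedRevenue(ad: Ad, predictedCtr: Double): Double =
  ad.bidDollars * predictedCtr

val candidates = Seq(
  (Ad("Ad 1", 1.0), 0.05),   // $1 bid x 5% CTR = 5 cents expected revenue
  (Ad("Ad 2", 2.0), 0.01)    // $2 bid x 1% CTR = 2 cents expected revenue
)

// the eligible ad with the highest expected revenue wins the auction
val winner = candidates.maxBy { case (ad, ctr) => expectedRevenue(ad, ctr) }._1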
Click-Through Rate (CTR) Prediction
• Given a user and context, predict the probability of a click for an ad
• Probably the most “profitable” machine learning problem in industry
- simple binary problem; but want probabilities, not just the label
- very skewed label distribution: clicks << skips
- tons of data (every impression generates a training example)
- limitations at serving: need to predict quickly
• Basic setting quite well-studied; scale makes it challenging
- Google (McMahan et al. 2013)
- Facebook (He et al. 2014)
- Yahoo (Aharon et al. 2013)
- others (Chapelle et al. 2014)
• Some more involved research topics
- Exploration/Exploitation tradeoff
- Learning from logged feedback
Overview - CTR Prediction for Gemini Native Ads
• Collaborative Filtering approach (Aharon et al. 2013)
- Current production system
- Implemented in Hadoop MapReduce
- Used in Gemini Native ad ranking
• Large-scale Logistic Regression
- A research prototype
- Implemented in Spark
- The combination of Spark & Scala allows us to iterate quickly
- Takes several concepts from the CF approach
Large-scale Logistic Regression in Spark
Apache Spark
• “Apache Spark is a fast and general engine for large-scale data processing”
• Similar to Hadoop
• Advantages over Hadoop MapReduce
- Option to cache data in memory, great for iterative computations
- A lot of syntactic sugar (see the sketch below)
‣ filter, reduceByKey, distinct, sortByKey, join
‣ in general, Spark/Scala code is very concise
- Spark Shell, great for interactive/ETL* workflows
- Dataframes interesting for data scientists coming from R / Python
• Includes modules for
- machine learning
- streaming
- graph computations
- SQL / Dataframes
*ETL: Extract, transform, load
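A small illustration of the syntactic sugar listed above, as a word-count-style job (the input path and whitespace tokenization are illustrative assumptions):

// Minimal sketch of the RDD operations mentioned above; the input path and
// record layout are hypothetical.
val lines = sc.textFile("hdfs:///path/to/some/text")
val tokenCounts = lines
  .flatMap(_.split("\\s+"))   // tokenize each line
  .filter(_.nonEmpty)         // drop empty tokens
  .map(token => (token, 1))
  .reduceByKey(_ + _)         // count occurrences per token
  .sortByKey()                // order alphabetically by token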
Spark at Yahoo
• Spark 1.5.1, the latest version of Spark
• Runs on top of Hadoop YARN 2.6
- integrates nicely with existing Hadoop tools and infrastructure at Yahoo
- data is generally stored in HDFS
• Clusters are centrally managed
• Large Hadoop deployment at Yahoo
- A few different clusters
- Each has at least a few thousand nodes
[Stack diagram: Spark, MapReduce, and Hive run on YARN (resource management), on top of HDFS (storage)]
Dataset for CTR Prediction
• Billions of ad impressions daily
- Need for Streaming / Batched Streaming
- Each impression has a unique id
• Need click information for every impression for learning
- Join impressions with a click stream every x minutes
- Need to wait for the click; introduces some delay
[Timeline diagram: impressions and clicks arrive as separate streams in 15-minute windows (18:30, 18:45, 19:00, 19:15); each window's impressions are joined with the clicks to produce labeled events; in Spark: union & reduceByKey]
Example - Joining Impression & Click RDDs
val keyAndImpressions = impressions
  .map(e => (e.joinKey, ("i", e)))

val keyAndClicks = clicks
  .map(e => (e.joinKey, ("c", e)))

keyAndImpressions.union(keyAndClicks)
  .reduceByKey(smartCombine)
  .flatMap { case (k, (t, event)) =>
    t match {
      case "ci" => Some(LabeledEvent(event, clicked = 1))  // impression with a click
      case "i"  => Some(LabeledEvent(event, clicked = 0))  // impression without a click
      case "c"  => None                                    // click without a matching impression
    }
  }

def smartCombine(event1: (String, Event), event2: (String, Event)): (String, Event) = {
  (event1._1, event2._1) match {
    case ("c", "c") => event1              // de-dupe
    case ("i", "i") => event1              // de-dupe
    case ("c", "i") => ("ci", event2._2)   // combine click and impression
    case ("i", "c") => ("ci", event1._2)   // combine click and impression
    case ("ci", _)  => event1              // de-dupe
    case (_, "ci")  => event2              // de-dupe
  }
}
Incremental Learning Architecture
[Architecture diagram: for each window (18:30, 18:45, 19:00, 19:15) impressions and clicks are joined into labeled events; labeled events pass through feature extraction to become learning examples, which feed an incremental learning step that updates the model produced in the previous window]
Large-scale Logistic Regression
• Industry standard for CTR prediction (McMahan et al. 2013, He et al. 2014)
• Models the probability of a click as p(click | x) = 1 / (1 + exp(-w·x)); a small prediction sketch follows this list
- x, the feature vector
‣ high-dimensional vector but sparse (few non-zero values)
‣ model expressivity controlled by the features
‣ a lot of hand-tuning and playing around
- w, the model parameters
‣ need to be learned
‣ generally rather non-sparse
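A minimal sketch of how this prediction can be computed from a sparse feature vector (the Map-based sparse representation is an illustrative assumption, not the production code):

// Minimal sketch of a logistic-regression click prediction over sparse features;
// the Map-based sparse representation is an illustrative assumption.
def predictCtr(weights: Array[Double], sparseFeatures: Map[Int, Double]): Double = {
  // dot product w·x only over the non-zero entries of x
  val score = sparseFeatures.map { case (index, value) => weights(index) * value }.sum
  1.0 / (1.0 + math.exp(-score))   // sigmoid maps the score to a probability
}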
Features for Logistic Regression
• Basic features
- age, gender
- browser, device
• Feature crosses
- E.g. age x gender x state (30 year old male from Boston)
- mostly indicator features
- Examples:
‣ gender^age m^30
‣ gender^device m^Windows_NT
‣ gender^section m^5417810
‣ gender^state m^2347579
‣ age^device 30^Windows_NT
• Feature hashing to get a vector of fixed length
- hash all the index tuples, e.g. (gender^age, m^30), to get a numeric index
- will introduce collisions! Choose dimensionality large enough
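A minimal sketch of such feature hashing (the hash function and the dimensionality are illustrative assumptions):

// Minimal sketch of feature hashing: map a (featureName, featureValue) pair
// to an index in a fixed-length vector. Hash choice and size are assumptions.
val numFeatures = 1 << 22            // choose large enough to keep collisions rare

def hashIndex(featureName: String, featureValue: String): Int = {
  val h = (featureName + ":" + featureValue).hashCode
  math.abs(h % numFeatures)          // note: different tuples may still collide
}

// e.g. the indicator feature for the cross (gender^age, m^30)
val index = hashIndex("gender^age", "m^30")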
Parameter Estimation
• Basic Problem: Regularized Maximum Likelihood
- minimize over w: Σᵢ log(1 + exp(-yᵢ · w·xᵢ)) + λ · R(w)
‣ the log-loss term fits the training data; the regularization term prevents overfitting
- Often: L1 regularization instead of L2
‣ promotes sparsity in the weight vector
‣ more efficient predictions in serving (also requires less memory!)
- Batch vs. streaming
‣ in our case: batched streaming, every x min perform an incremental model update
• Follow-the-regularized-leader, FTRL (McMahan et al. 2013)
- sequential online algorithm: only uses a data point once
- similar to stochastic gradient descent
- per coordinate learning rates
- encourages sparseness
- FTRL stores a weight and an accumulated gradient per coordinate (see the per-coordinate sketch below)
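A minimal per-coordinate FTRL-Proximal sketch in the spirit of McMahan et al. 2013 (the class, variable names, and hyper-parameters are illustrative assumptions, not the production implementation):

// Minimal per-coordinate FTRL-Proximal sketch (after McMahan et al. 2013);
// names and hyper-parameters are illustrative assumptions.
class FtrlModel(numFeatures: Int,
                alpha: Double = 0.1,    // per-coordinate learning-rate scale
                beta: Double = 1.0,
                lambda1: Double = 1.0,  // L1 strength, encourages sparse weights
                lambda2: Double = 0.0) {
  private val z = Array.fill(numFeatures)(0.0)  // accumulated (adjusted) gradients
  private val n = Array.fill(numFeatures)(0.0)  // accumulated squared gradients

  private def weight(i: Int): Double =
    if (math.abs(z(i)) <= lambda1) 0.0          // L1 keeps small coordinates at exactly zero
    else -(z(i) - math.signum(z(i)) * lambda1) / ((beta + math.sqrt(n(i))) / alpha + lambda2)

  def predict(features: Seq[(Int, Double)]): Double = {
    val score = features.map { case (i, x) => weight(i) * x }.sum
    1.0 / (1.0 + math.exp(-score))
  }

  // one online update: use each labeled example once (label: 1.0 = click, 0.0 = skip)
  def update(features: Seq[(Int, Double)], label: Double): Unit = {
    val p = predict(features)
    features.foreach { case (i, x) =>
      val g = (p - label) * x                                      // gradient of the log loss
      val sigma = (math.sqrt(n(i) + g * g) - math.sqrt(n(i))) / alpha
      z(i) += g - sigma * weight(i)
      n(i) += g * g
    }
  }
}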
Basic Parallelized FTRL in Spark
def train(examples: RDD[LearningExample]): Unit = {
  val delta = examples
    .repartition(numWorkers)
    .mapPartitions(xs => updatePartition(xs, weights, counts))
    .treeReduce { case (a, b) => (a._1 + b._1, a._2 + b._2) }
  weights += delta._1 / numWorkers.toDouble
  counts += delta._2 / numWorkers.toDouble
}

def updatePartition(examples: Iterator[LearningExample],
                    weights: DenseVector[Double],
                    counts: DenseVector[Double]): Iterator[(DenseVector[Double], DenseVector[Double])] = {
  // standard FTRL code for examples
  // hack: actually a single result, but Spark's mapPartitions expects an iterator!
  Iterator((deltaWeights, deltaCounts))
}
Summary: LR with Spark
• Efficient: can learn on all the data
- before: somewhat aggressive subsampling of the skips
• Possible to do feature pre-processing
- in Hadoop MapReduce much harder: only one pass over the data
- drop infrequent features, TF-IDF, … (see the sketch below)
• Spark-shell as a life-saver
- helps to debug problems, as one can inspect intermediate results at scale
- have yet to try Zeppelin notebooks
• Easy to unit test complex workflows
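A minimal sketch of the kind of pre-processing pass that cached RDDs make easy, here dropping infrequent features (names, types, and the threshold are illustrative assumptions):

// Minimal sketch of a pre-processing pass enabled by caching: count feature
// occurrences and drop infrequent ones. Names, types and threshold are assumptions.
val minCount = 100L

// rawExamples: RDD[Seq[String]], one sequence of feature strings per labeled event
val examples = rawExamples.cache()

val frequentFeatures = examples
  .flatMap(identity)                               // one record per feature occurrence
  .map(feature => (feature, 1L))
  .reduceByKey(_ + _)
  .filter { case (_, count) => count >= minCount }
  .keys
  .collect()
  .toSet

val frequentBc = sc.broadcast(frequentFeatures)    // ship the whitelist to the workers
val filteredExamples = examples.map(_.filter(frequentBc.value.contains))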
Spark: Lessons Learned
Upgrade!
• Spark has a pretty regular 3-month release schedule
• Always run with the latest version
- Lots of bugs get fixed
- Difficult to keep up with new functionality (see DataFrame vs. RDD)
• Speed improvements over the past year
Configurations
• Our solution
- config directory containing
‣ Logging: log4j.properties
‣ Spark itself: spark-defaults.conf
‣ our code: application.conf
- two versions of configs: local & cluster
- in YARN: specify them using --files argument & SPARK_CONF_DIR variable
• Use Typesafe’s config library for all application-related configs (see the sketch below)
- provide sensible defaults for everything
- overwrite using application.conf
• Do not hard-code any configurations in code
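A minimal sketch of the Typesafe config pattern (the config keys and values shown are hypothetical):

// Minimal sketch of loading application settings with Typesafe's config library;
// the keys and values are hypothetical.
import com.typesafe.config.ConfigFactory

// defaults live in reference.conf bundled with the jar; application.conf
// (shipped to the cluster via --files) overrides them per environment
val config = ConfigFactory.load()

val numWorkers = config.getInt("training.num-workers")
val inputPath  = config.getString("training.input-path")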
Accumulators
• Use accumulators for ensuring correctness!
• Example:
- parse data, ignore event if there is a problem with the data
- use accumulator to count these failed lines
class Parser(failedLinesAccumulator: Accumulator[Int]) extends Serializable {
  def parse(s: String): Option[Event] = {
    try {
      // parsing logic goes here
      Some(...)
    } catch {
      case e: Exception =>
        failedLinesAccumulator += 1
        None
    }
  }
}

val accumulator = sc.accumulator(0, "failed lines")
val parser = new Parser(accumulator)
val events = sc.textFile("hdfs:///myfile")
  .flatMap(s => parser.parse(s))
RDD vs. DataFrame in Spark
• Initially Spark advocated the Resilient Distributed Dataset (RDD) as the data set abstraction
- type-safe
- usually stores some Scala case class
- code relatively easy to understand
• Recently Spark is pushing towards using DataFrame
- similar to R and Python’s Pandas data frames
- some advantages
‣ less rigid types: can append columns
‣ speed
- disadvantage: code readability suffers for non-basic types
‣ user-defined types
‣ user-defined functions
• Have not fully migrated to it yet
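A small side-by-side sketch of the two abstractions (the Event case class, the sqlContext in scope, and the aggregation are illustrative assumptions, using the Spark 1.5-era API):

// Minimal sketch contrasting the RDD and DataFrame styles; Event and the
// sqlContext in scope are illustrative assumptions.
case class Event(userId: String, adId: String, clicked: Int)

// RDD style: type-safe, works directly with the case class
val clicksPerAdRdd = events                // events: RDD[Event]
  .filter(_.clicked == 1)
  .map(e => (e.adId, 1L))
  .reduceByKey(_ + _)

// DataFrame style: columns referenced by name; less type-safe, but the
// optimizer can do more and columns can be appended easily
import sqlContext.implicits._
val clicksPerAdDf = events.toDF()
  .filter($"clicked" === 1)
  .groupBy("adId")
  .count()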
Every Day I’m Shuffling…
• Careful with operations which send a lot of data over the network
- reduceByKey
- repartition / shuffle
• Careful with sending too much data to the driver
- collect
- reduce
• Found mapPartitions & treeReduce useful in some cases (see the FTRL example)
• Play with Spark configurations: frameSize, maxResultSize, timeouts… (see the sketch below)
[Diagram: a simple job DAG (textFile → flatMap → map → reduceByKey); the reduceByKey step triggers a shuffle]
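A minimal sketch of the kind of shuffle- and driver-related settings one might tune (Spark 1.x keys; the chosen values are illustrative assumptions, not recommendations):

// Minimal sketch of tuning shuffle/driver-related settings (Spark 1.x keys);
// the values are illustrative assumptions, not recommendations.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.frameSize", "256")        // MB; for large task results / closures
  .set("spark.driver.maxResultSize", "4g")   // cap on results collected back to the driver
  .set("spark.network.timeout", "300s")      // more generous timeouts for big shuffles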
Machine Learning in Spark
• Relatively basic
- some algorithms don’t scale so well
- not customizable enough for experts:
‣ optimizers that assume a regularizer
‣ built our own DSL for feature extraction & combination
‣ a lot of the APIs are not exposed, i.e. private to Spark
- will hopefully get there eventually
• Nice: new Transformer / Estimator / Pipeline approach
- Inspired by scikit-learn, makes it easy to combine different algorithms
- Requires DataFrame
- Example (from the Spark docs):

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
Thank you!