TRANSCRIPT
Training Large-scale Ad Ranking Models in Spark
PRESENTED BY Patrick Pletscher | October 19, 2015
About Us
Michal Aharon, Oren Somekh, Yaacov Fernandess, Yair Koren, Amit Kagian, Shahar Golan, Raz Nissim, Patrick Pletscher
Amir Ingber
Haifa
Collaborator
What We Do
Research focused on ad ranking algorithms for Yahoo Gemini Native Ads
Ad Ranking Overview
• Advertisers run several campaigns, each with several ads
• Each ad has a bid set by the advertiser; different ad price types
- pay per view
- pay per click
- various conversion price types
• Auction for each impression on a Gemini Native enabled property
- auction between all eligible ads (filtered by targeting/budget)
- the ad with the highest expected revenue wins (see the sketch below)
• Need to know the (personalized!) probability of a click
- we mostly get money for clicks / conversions!
[Figure: auction example; Ad 1 bids $1 with a 5% predicted CTR (5c expected revenue per impression), Ad 2 bids $2 with a 1% CTR (2c), so Ad 1 wins for this user]
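A minimal sketch of the expected-revenue ranking for pay-per-click ads, mirroring the figure above (the Ad case class and the bid/CTR values are illustrative assumptions):

// Minimal sketch of ranking ads by expected revenue; values mirror the figure above.
case class Ad(name: String, bidDollars: Double)

def expectedRevenue(ad: Ad, predictedCtr: Double): Double =
  ad.bidDollars * predictedCtr

val candidates = Seq(
  (Ad("Ad 1", 1.0), 0.05),   // $1 bid x 5% CTR = 5 cents expected revenue
  (Ad("Ad 2", 2.0), 0.01)    // $2 bid x 1% CTR = 2 cents expected revenue
)

// the eligible ad with the highest expected revenue wins the auction
val winner = candidates.maxBy { case (ad, ctr) => expectedRevenue(ad, ctr) }._1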
Click-Through Rate (CTR) Prediction
• Given a user and context, predict the probability of a click for an ad
• Probably the most “profitable” machine learning problem in industry
- simple binary problem; but want probabilities, not just the label
- very skewed label distribution: clicks << skips
- tons of data (every impression generates a training example)
- limitations at serving: need to predict quickly
• Basic setting quite well-studied; scale makes it challenging
- Google (McMahan et al. 2013)
- Facebook (He et al. 2014)
- Yahoo (Aharon et al. 2013)
- others (Chapelle et al. 2014)
• Some more involved research topics
- Exploration/Exploitation tradeoff
- Learning from logged feedback
Overview - CTR Prediction for Gemini Native Ads
• Collaborative Filtering approach (Aharon et al. 2013)
- Current production system
- Implemented in Hadoop MapReduce
- Used in Gemini Native ad ranking
• Large-scale Logistic Regression
- A research prototype
- Implemented in Spark
- The combination of Spark & Scala allows us to iterate quickly
- Takes several concepts from the CF approach
Large-scale Logistic Regression in Spark
Apache Spark
• “Apache Spark is a fast and general engine for large-scale data processing”
• Similar to Hadoop
• Advantages over Hadoop MapReduce
- Option to cache data in memory, great for iterative computations
- A lot of syntactic sugar (see the sketch below)
‣ filter, reduceByKey, distinct, sortByKey, join
‣ in general, Spark/Scala code is very concise
- Spark Shell, great for interactive/ETL* workflows
- Dataframes interesting for data scientists coming from R / Python
• Includes modules for
- machine learning
- streaming
- graph computations
- SQL / Dataframes
*ETL: Extract, transform, load
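A small illustration of the syntactic sugar listed above, as a word-count-style job (the input path and whitespace tokenization are illustrative assumptions):

// Minimal sketch of the RDD operations mentioned above; the input path and
// record layout are hypothetical.
val lines = sc.textFile("hdfs:///path/to/some/text")
val tokenCounts = lines
  .flatMap(_.split("\\s+"))   // tokenize each line
  .filter(_.nonEmpty)         // drop empty tokens
  .map(token => (token, 1))
  .reduceByKey(_ + _)         // count occurrences per token
  .sortByKey()                // order alphabetically by token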
Spark at Yahoo
• Spark 1.5.1, the latest version of Spark
• Runs on top of Hadoop YARN 2.6
- integrates nicely with existing Hadoop tools and infrastructure at Yahoo
- data is generally stored in HDFS
• Clusters are centrally managed
• Large Hadoop deployment at Yahoo
- A few different clusters
- Each has at least a few thousand nodes
[Stack diagram: Spark, MapReduce, and Hive run on YARN (resource management), on top of HDFS (storage)]
Dataset for CTR Prediction
• Billions of ad impressions daily
- Need for Streaming / Batched Streaming
- Each impression has a unique id
• Need click information for every impression for learning
- Join impressions with a click stream every x minutes
- Need to wait for the click; introduces some delay
[Timeline diagram: impressions and clicks arrive as separate streams in 15-minute windows (18:30, 18:45, 19:00, 19:15); each window's impressions are joined with the clicks to produce labeled events; in Spark: union & reduceByKey]
Example - Joining Impression & Click RDDs
val keyAndImpressions = impressions
  .map(e => (e.joinKey, ("i", e)))

val keyAndClicks = clicks
  .map(e => (e.joinKey, ("c", e)))

keyAndImpressions.union(keyAndClicks)
  .reduceByKey(smartCombine)
  .flatMap { case (k, (t, event)) =>
    t match {
      case "ci" => Some(LabeledEvent(event, clicked = 1))  // impression with a click
      case "i"  => Some(LabeledEvent(event, clicked = 0))  // impression without a click
      case "c"  => None                                    // click without a matching impression
    }
  }

def smartCombine(event1: (String, Event), event2: (String, Event)): (String, Event) = {
  (event1._1, event2._1) match {
    case ("c", "c") => event1              // de-dupe
    case ("i", "i") => event1              // de-dupe
    case ("c", "i") => ("ci", event2._2)   // combine click and impression
    case ("i", "c") => ("ci", event1._2)   // combine click and impression
    case ("ci", _)  => event1              // de-dupe
    case (_, "ci")  => event2              // de-dupe
  }
}
Incremental Learning Architecture
[Architecture diagram: for each window (18:30, 18:45, 19:00, 19:15) impressions and clicks are joined into labeled events; labeled events pass through feature extraction to become learning examples, which feed an incremental learning step that updates the model produced in the previous window]
Large-scale Logistic Regression
• Industry standard for CTR prediction (McMahan et al. 2013, He et al. 2014)
• Models the probability of a click as p(click | x) = 1 / (1 + exp(-w·x)); a small prediction sketch follows this list
- x, the feature vector
‣ high-dimensional vector but sparse (few non-zero values)
‣ model expressivity controlled by the features
‣ a lot of hand-tuning and playing around
- w, the model parameters
‣ need to be learned
‣ generally rather non-sparse
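A minimal sketch of how this prediction can be computed from a sparse feature vector (the Map-based sparse representation is an illustrative assumption, not the production code):

// Minimal sketch of a logistic-regression click prediction over sparse features;
// the Map-based sparse representation is an illustrative assumption.
def predictCtr(weights: Array[Double], sparseFeatures: Map[Int, Double]): Double = {
  // dot product w·x only over the non-zero entries of x
  val score = sparseFeatures.map { case (index, value) => weights(index) * value }.sum
  1.0 / (1.0 + math.exp(-score))   // sigmoid maps the score to a probability
}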
Features for Logistic Regression
• Basic features
- age, gender
- browser, device
• Feature crosses
- E.g. age x gender x state (30 year old male from Boston)
- mostly indicator features
- Examples:
‣ gender^age m^30
‣ gender^device m^Windows_NT
‣ gender^section m^5417810
‣ gender^state m^2347579
‣ age^device 30^Windows_NT
• Feature hashing to get a vector of fixed length
- hash all the index tuples, e.g. (gender^age, m^30), to get a numeric index
- will introduce collisions! Choose dimensionality large enough
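A minimal sketch of such feature hashing (the hash function and the dimensionality are illustrative assumptions):

// Minimal sketch of feature hashing: map a (featureName, featureValue) pair
// to an index in a fixed-length vector. Hash choice and size are assumptions.
val numFeatures = 1 << 22            // choose large enough to keep collisions rare

def hashIndex(featureName: String, featureValue: String): Int = {
  val h = (featureName + ":" + featureValue).hashCode
  math.abs(h % numFeatures)          // note: different tuples may still collide
}

// e.g. the indicator feature for the cross (gender^age, m^30)
val index = hashIndex("gender^age", "m^30")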
Parameter Estimation
• Basic Problem: Regularized Maximum Likelihood
- minimize over w: Σᵢ log(1 + exp(-yᵢ · w·xᵢ)) + λ · R(w)
‣ the log-loss term fits the training data; the regularization term prevents overfitting
- Often: L1 regularization instead of L2
‣ promotes sparsity in the weight vector
‣ more efficient predictions in serving (also requires less memory!)
- Batch vs. streaming
‣ in our case: batched streaming, every x min perform an incremental model update
• Follow-the-regularized-leader, FTRL (McMahan et al. 2013)
- sequential online algorithm: only uses a data point once
- similar to stochastic gradient descent
- per coordinate learning rates
- encourages sparseness
- FTRL stores a weight and an accumulated gradient per coordinate (see the per-coordinate sketch below)
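A minimal per-coordinate FTRL-Proximal sketch in the spirit of McMahan et al. 2013 (the class, variable names, and hyper-parameters are illustrative assumptions, not the production implementation):

// Minimal per-coordinate FTRL-Proximal sketch (after McMahan et al. 2013);
// names and hyper-parameters are illustrative assumptions.
class FtrlModel(numFeatures: Int,
                alpha: Double = 0.1,    // per-coordinate learning-rate scale
                beta: Double = 1.0,
                lambda1: Double = 1.0,  // L1 strength, encourages sparse weights
                lambda2: Double = 0.0) {
  private val z = Array.fill(numFeatures)(0.0)  // accumulated (adjusted) gradients
  private val n = Array.fill(numFeatures)(0.0)  // accumulated squared gradients

  private def weight(i: Int): Double =
    if (math.abs(z(i)) <= lambda1) 0.0          // L1 keeps small coordinates at exactly zero
    else -(z(i) - math.signum(z(i)) * lambda1) / ((beta + math.sqrt(n(i))) / alpha + lambda2)

  def predict(features: Seq[(Int, Double)]): Double = {
    val score = features.map { case (i, x) => weight(i) * x }.sum
    1.0 / (1.0 + math.exp(-score))
  }

  // one online update: use each labeled example once (label: 1.0 = click, 0.0 = skip)
  def update(features: Seq[(Int, Double)], label: Double): Unit = {
    val p = predict(features)
    features.foreach { case (i, x) =>
      val g = (p - label) * x                                      // gradient of the log loss
      val sigma = (math.sqrt(n(i) + g * g) - math.sqrt(n(i))) / alpha
      z(i) += g - sigma * weight(i)
      n(i) += g * g
    }
  }
}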
Basic Parallelized FTRL in Spark
def train(examples: RDD[LearningExample]): Unit = {
  val delta = examples
    .repartition(numWorkers)
    .mapPartitions(xs => updatePartition(xs, weights, counts))
    .treeReduce { case (a, b) => (a._1 + b._1, a._2 + b._2) }
  weights += delta._1 / numWorkers.toDouble
  counts += delta._2 / numWorkers.toDouble
}

def updatePartition(examples: Iterator[LearningExample],
                    weights: DenseVector[Double],
                    counts: DenseVector[Double]): Iterator[(DenseVector[Double], DenseVector[Double])] = {
  // standard FTRL code for examples
  // hack: actually a single result, but Spark's mapPartitions expects an iterator!
  Iterator((deltaWeights, deltaCounts))
}
Summary: LR with Spark
• Efficient: can learn on all the data
- before: somewhat aggressive subsampling of the skips
• Possible to do feature pre-processing
- in Hadoop MapReduce much harder: only one pass over the data
- drop infrequent features, TF-IDF, … (see the sketch below)
• Spark-shell as a life-saver
- helps to debug problems, as one can inspect intermediate results at scale
- have yet to try Zeppelin notebooks
• Easy to unit test complex workflows
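A minimal sketch of the kind of pre-processing pass that cached RDDs make easy, here dropping infrequent features (names, types, and the threshold are illustrative assumptions):

// Minimal sketch of a pre-processing pass enabled by caching: count feature
// occurrences and drop infrequent ones. Names, types and threshold are assumptions.
val minCount = 100L

// rawExamples: RDD[Seq[String]], one sequence of feature strings per labeled event
val examples = rawExamples.cache()

val frequentFeatures = examples
  .flatMap(identity)                               // one record per feature occurrence
  .map(feature => (feature, 1L))
  .reduceByKey(_ + _)
  .filter { case (_, count) => count >= minCount }
  .keys
  .collect()
  .toSet

val frequentBc = sc.broadcast(frequentFeatures)    // ship the whitelist to the workers
val filteredExamples = examples.map(_.filter(frequentBc.value.contains))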
Spark: Lessons Learned
Upgrade!
• Spark has a pretty regular 3-month release schedule
• Always run with the latest version
- Lots of bugs get fixed
- Difficult to keep up with new functionality (see DataFrame vs. RDD)
• Speed improvements over the past year
Configurations
• Our solution
- config directory containing
‣ Logging: log4j.properties
‣ Spark itself: spark-defaults.conf
‣ our code: application.conf
- two versions of configs: local & cluster
- in YARN: specify them using --files argument & SPARK_CONF_DIR variable
• Use Typesafe’s config library for all application-related configs (see the sketch below)
- provide sensible defaults for everything
- overwrite using application.conf
• Do not hard-code any configurations in code
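A minimal sketch of the Typesafe config pattern (the config keys and values shown are hypothetical):

// Minimal sketch of loading application settings with Typesafe's config library;
// the keys and values are hypothetical.
import com.typesafe.config.ConfigFactory

// defaults live in reference.conf bundled with the jar; application.conf
// (shipped to the cluster via --files) overrides them per environment
val config = ConfigFactory.load()

val numWorkers = config.getInt("training.num-workers")
val inputPath  = config.getString("training.input-path")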
Accumulators
• Use accumulators for ensuring correctness!
• Example:
- parse data, ignore event if there is a problem with the data
- use accumulator to count these failed lines
class Parser(failedLinesAccumulator: Accumulator[Int]) extends Serializable {
  def parse(s: String): Option[Event] = {
    try {
      // parsing logic goes here
      Some(...)
    } catch {
      case e: Exception =>
        failedLinesAccumulator += 1
        None
    }
  }
}

val accumulator = sc.accumulator(0, "failed lines")
val parser = new Parser(accumulator)
val events = sc.textFile("hdfs:///myfile")
  .flatMap(s => parser.parse(s))
RDD vs. DataFrame in Spark
• Initially Spark advocated the Resilient Distributed Dataset (RDD) as the data set abstraction
- type-safe
- usually stores some Scala case class
- code relatively easy to understand
• Recently Spark is pushing towards using DataFrame
- similar to R and Python’s Pandas data frames
- some advantages
‣ less rigid types: can append columns
‣ speed
- disadvantage: code readability suffers for non-basic types
‣ user-defined types
‣ user-defined functions
• Have not fully migrated to it yet
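A small side-by-side sketch of the two abstractions (the Event case class, the sqlContext in scope, and the aggregation are illustrative assumptions, using the Spark 1.5-era API):

// Minimal sketch contrasting the RDD and DataFrame styles; Event and the
// sqlContext in scope are illustrative assumptions.
case class Event(userId: String, adId: String, clicked: Int)

// RDD style: type-safe, works directly with the case class
val clicksPerAdRdd = events                // events: RDD[Event]
  .filter(_.clicked == 1)
  .map(e => (e.adId, 1L))
  .reduceByKey(_ + _)

// DataFrame style: columns referenced by name; less type-safe, but the
// optimizer can do more and columns can be appended easily
import sqlContext.implicits._
val clicksPerAdDf = events.toDF()
  .filter($"clicked" === 1)
  .groupBy("adId")
  .count()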
Every Day I’m Shuffling…
• Careful with operations which send a lot of data over the network
- reduceByKey
- repartition / shuffle
• Careful with sending too much data to the driver
- collect
- reduce
• Found mapPartitions & treeReduce useful in some cases (see the FTRL example)
• Play with Spark configurations: frameSize, maxResultSize, timeouts… (see the sketch below)
[Diagram: a simple job DAG (textFile → flatMap → map → reduceByKey); the reduceByKey step triggers a shuffle]
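A minimal sketch of the kind of shuffle- and driver-related settings one might tune (Spark 1.x keys; the chosen values are illustrative assumptions, not recommendations):

// Minimal sketch of tuning shuffle/driver-related settings (Spark 1.x keys);
// the values are illustrative assumptions, not recommendations.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.frameSize", "256")        // MB; for large task results / closures
  .set("spark.driver.maxResultSize", "4g")   // cap on results collected back to the driver
  .set("spark.network.timeout", "300s")      // more generous timeouts for big shuffles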
Machine Learning in Spark
• Relatively basic
- some algorithms don’t scale so well
- not customizable enough for experts:
‣ optimizers that assume a regularizer
‣ built our own DSL for feature extraction & combination
‣ a lot of the APIs are not exposed, i.e. private to Spark
- will hopefully get there eventually
• Nice: new Transformer / Estimator / Pipeline approach
- Inspired by scikit-learn, makes it easy to combine different algorithms
- Requires DataFrame
- Example (from the Spark docs):

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
Thank you!