Spark ETL Techniques - Creating an Optimal Fantasy Baseball Roster


Page 1: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark ETL: How to create an optimal daily fantasy baseball roster.

Chicago Hadoop Users Group May 12, 2015 Don Drake

[email protected] @dondrake

Page 2: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Overview

• Who am I?
• ETL
• Daily Fantasy Baseball
• Spark
• Data Flow
• Extracting - web crawler
• Parquet
• DataFrames
• Python or Scala?
• Transforming - RDDs
• Transforming - Moving Average

Page 3: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Page 4: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Who Am I?

• Don Drake @dondrake
• Currently: Principal Big Data Consultant @ Allstate
• 5 years consulting on Hadoop
• Independent consultant for last 14 years
• Previous clients:
  • Navteq / Nokia
  • Sprint
  • Mobile Meridian - Co-Founder
  • MailLaunder.com - my SaaS anti-spam service
  • cars.com
  • Tribune Media Services
  • Family Video
  • Museum of Science and Industry

Page 5: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

ETL

• Informally: any repeatable, programmed data movement
• Extraction - get data from another source
  • Oracle/PostgreSQL
  • CSV
  • Web crawler
• Transform (toy example below)
  • Normalize formatting of phone #'s, addresses
  • Create surrogate keys
  • Join data sources
  • Aggregate
• Load
  • Load into data warehouse
  • CSV
  • Send to predictive model
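As a toy illustration of a transform step (the data and function name are made up, and a SparkContext sc is assumed as in the demos later), normalizing phone numbers inside a map():

import re

def normalize_phone(raw):
    # strip everything but digits, keep the last 10 (US-style numbers)
    digits = re.sub(r'\D', '', raw)
    return digits[-10:] if len(digits) >= 10 else digits

phones = sc.parallelize(["(312) 555-1212", "312.555.1212"])
print(phones.map(normalize_phone).collect())
# ['3125551212', '3125551212']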

Page 6: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark

• Apache Spark is a fast and general-purpose engine for large-scale data processing.
• It provides high-level APIs in Java, Scala, and Python (REPLs for Scala and Python)
• Includes an advanced DAG execution engine that supports in-memory computing
• RDD - Resilient Distributed Dataset - the core construct of the framework
• Includes a set of high-level tools: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming
• Can run on a cluster (Hadoop (YARN), EC2, Mesos), standalone, or local
• Open source; core committers from Databricks
• Latest version is 1.3.1, which includes DataFrames
• Started in 2009 (AMPLab) as a research project; an Apache project since 2013
• LOTS of momentum.

Page 7: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Daily Fantasy Baseball

Page 8: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Page 9: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark 101 - Execution

• Driver - your program's main() method
  • Only 1 per application
• Executors - do the distributed work
  • As many as your cluster can handle
  • You determine ahead of time how many
  • You determine the amount of RAM required (config sketch below)
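A minimal sketch of fixing the executor count and memory up front via SparkConf; the values and app name are illustrative, and a YARN cluster is assumed:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-etl-demo")          # hypothetical application name
        .set("spark.executor.instances", "8")  # how many executors to start
        .set("spark.executor.memory", "4g"))   # RAM per executor
sc = SparkContext(conf=conf)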

Page 10: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark 101 - RDD

• RDD - Resilient Distributed Dataset
• Can be created from Hadoop input formats (text file, sequence file, Parquet file, HBase, etc.) OR by transforming other RDDs
• RDDs have actions and transformations; transformations return pointers to new RDDs
• RDDs can contain anything

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 126

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

Page 11: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

DEMO

Page 12: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Page 13: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark 101 - DEMO

#!/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

nums.getNumPartitions()

nums.count()

Page 14: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark 101 - Lazy Demo

#!/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

# accumulator counts how many times isEven() actually runs
check = sc.accumulator(0)

def isEven(x):
    check.add(1)
    return x % 2 == 0

evens = nums.filter(isEven)

evens.count()

evens.collect()

check

Page 15: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark 101 - Lazy RDDs

• RDDs are evaluated *only* when an action is called on them.
• This allows multiple transformations on an RDD, letting Spark compute an optimal execution plan.
• Uncached RDDs are re-evaluated *every* time an action is called on them.
• Cache RDDs if you know you will iterate over them more than once (sketch below).
• You can cache to RAM + disk; by default, persistence is to RAM only.
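A sketch of that caching advice, reusing the lazy demo's nums and isEven:

from pyspark import StorageLevel

evens = nums.filter(isEven)
evens.persist(StorageLevel.MEMORY_AND_DISK)  # RAM + disk; cache() is RAM only
evens.count()    # first action computes and materializes the cache
evens.collect()  # later actions reuse the cached partitions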

Page 16: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Page 17: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Data Flow Overview

Page 18: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Extraction & Ingestion

Page 19: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Extraction

• We have seen it's easy to create a parallelized data structure that we can execute code against
• Pro-tip: if extracting from a relational database, use Sqoop + save as Parquet
• We need to download a set of files for each baseball game previously played (player statistics)
• TODO List Pattern
  • We know the filenames to download for each game (they are static)
  • Download all files for a game in parallel

Page 20: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

def getFiles(game):
    session = requests.Session()
    files = scrape.get_files([game], session=session)
    count, fails = scrape.download(files, DownloadMLB.cache)
    return (count, fails)

def summarize(a, x):
    total = a[0] + x[0]
    alist = a[1]
    alist.extend(x[1])
    return (total, alist)

def run(self):
    sc = SparkContext()
    start_scrape = datetime.now()
    begin, begin_parts = scrape.get_boundary(self.begin)
    end, end_parts = scrape.get_boundary(self.end)

    session = requests.Session()

    all_years_months_days = self.getYearsMonths(self.WEB_ROOT, session)
    games = scrape.get_games(all_years_months_days, session=session)

    gamesRDD = sc.parallelize(games)
    print "gamesRDD=", gamesRDD

    gamesRDD.foreach(dump)
    print "# partitions:", gamesRDD.getNumPartitions()
    print "count=", gamesRDD.count()
    res = gamesRDD.map(getFiles).reduce(summarize)
    print "res=", res

    count = res[0]
    fails = res[1]
    end_scrape = datetime.now()
    self.log.info("%d files downloaded in %s", count,
                  str(end_scrape - start_scrape))
    if fails:
        for url in fails:
            self.log.error("failed to download %s", url)

    sc.stop()

Page 21: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

TODO DEMO

Page 22: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Butter….. Butter…… Parquet?

• Features
  • Interoperability - Spark, Impala, Sqoop, much more
  • Space efficient
  • Query efficient
• Schema - (field name, data type, nullable)
• Columnar format
• Different encoding (compression) algorithms
  • Delta encoding - diffs per row per column
  • Dictionary encoding (~60k items), e.g. dates, IP addresses
  • Run-length encoding - for repetitive data (illustrated below)
• Example: http://blog.cloudera.com/blog/2014/05/using-impala-at-scale-at-allstate/
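To make run-length encoding concrete, a toy illustration in Python (this shows the idea only, not Parquet's actual implementation):

from itertools import groupby

def rle(values):
    # collapse runs of repeated values into (value, count) pairs
    return [(v, len(list(g))) for v, g in groupby(values)]

print(rle(["CHC", "CHC", "CHC", "SF", "SF"]))
# [('CHC', 3), ('SF', 2)]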

Page 23: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

http://parquet.apache.org/presentations/

Page 24: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

http://parquet.apache.org/presentations/

Page 25: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Page 26: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Using DataFrames - Spark SQL

• Previously called SchemaRDD; DataFrames now contain extra functionality to query/filter
• Allow you to write SQL (joins, filters, etc.) against DataFrames (or RDDs with a little effort)
• All DataFrames contain a schema

batter_mov_avg = sqlContext.parquetFile(self.rddDir + "/" +
    "batter_moving_averages.parquet")
batter_mov_avg.registerTempTable("bma")
batter_mov_avg.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
print "batter_mov_avg=", batter_mov_avg.take(2)

batter_games = sqlContext.sql("select * from games g, bma, game_players gp \
    where bma.player_id = gp.player_id \
    and bma.game_date = gp.game_date \
    and gp.game_id = g.game_id \
    and g.game_date = gp.game_date \
    and g.game_date = bma.game_date")

Page 27: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

DataFrames to the rescue

• DataFrames offer a DSL that provides distributed data manipulation
• In Python, you can convert a DataFrame to a pandas data frame and vice versa (sketch below)
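A minimal sketch of the round trip (small results only, since toPandas() collects everything to the driver):

pandas_df = batter_mov_avg.toPandas()             # Spark DataFrame -> pandas
spark_df = sqlContext.createDataFrame(pandas_df)  # pandas -> Spark DataFrame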

Page 28: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

More DataFrames

• We can select columns from a DataFrame
• Run our own transform on it with map()
• The transform function gets a Row() object and returns one
• The toDF() function will infer data types for you
• https://issues.apache.org/jira/browse/SPARK-7182

batting_features = batter_games.select(*unique_cols)
print "batting_features=", batting_features.schema
#print "batting_features=", batting_features.show()

def transformBatters(row_object):
    row = row_object.asDict()
    row = commonTransform(row)
    return Row(**row)

batting_features = batting_features.map(transformBatters).toDF()
batting_features.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)

self.rmtree(self.rddDir + "/" + "batting_features.parquet")
batting_features.saveAsParquetFile(self.rddDir + "/" +
    "batting_features.parquet")

Page 29: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Programmatically Specifying the Schema

class Game(AbstractDF):
    schema = StructType(sorted(
        [
            StructField("game_id", StringType()),
            StructField("game_date", DateType()),
            StructField("id", IntegerType()),
            StructField("type", StringType()),
            StructField("local_game_time", StringType()),
            StructField("game_pk", IntegerType()),
            StructField("gameday_sw", StringType()),
            StructField("game_time_et", TimestampType()),
            StructField("home_code", StringType()),
            StructField("home_abbrev", StringType()),
            StructField("home_name", StringType()),
            StructField("home_won", IntegerType()),
            StructField("home_loss", IntegerType()),
            StructField("home_division_id", IntegerType()),
            StructField("home_league", StringType()),
            StructField("away_code", StringType()),
            StructField("away_abbrev", StringType()),
            StructField("away_name", StringType()),
            StructField("away_won", IntegerType()),
            StructField("away_loss", IntegerType()),
            StructField("away_division_id", IntegerType()),
            StructField("away_league", StringType()),
            StructField("stadium_id", IntegerType()),
            StructField("stadium_name", StringType()),
            StructField("stadium_venue_w_chan_loc", StringType()),
            StructField("stadium_location", StringType()),
            StructField("modified", TimestampType()),
        ],
        key=lambda x: x.name))
    skipSelectFields = ['modified']

Page 30: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

AbstractDF

• Download: https://gist.github.com/dondrake/c7fcf42cf051492fdd91
• AbstractDF adds 3 major features to the class:
  • Creates a Python object with attributes based on the field names defined in the schema
    • e.g. g = Game(); g.game_id = '123'
  • Exposes a helper function to create a Row object containing all of the fields in the schema, stored with the correct data types
    • Needed so Scala can correctly infer the data types of the values sent
  • Provides a method that will return a list of columns to use in a select statement (usage sketch below)
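A usage sketch; the helper names createRow() and selectFields() below are illustrative guesses, not necessarily the gist's actual API:

from datetime import date

g = Game()                  # attributes come from the schema field names
g.game_id = '123'
g.game_date = date(2015, 5, 12)

row = g.createRow()         # hypothetical helper: a Row with correct types
cols = Game.selectFields()  # hypothetical: schema columns minus skipSelectFields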

Page 31: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Python or Scala??

Page 32: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Python or Scala???

• Use Python for prototyping
  • Access to pandas, scikit-learn, etc.
  • Python is strongly typed (but not statically typed)
  • Spark's Python API support lags and is not as popular as Scala's. Bugs might not get fixed right away, or at all.
  • Python is slower due to the gateway required: all data must be serialized and sent through the gateway to/from Scala. (Not as bad for DataFrames.)
• Use Scala for application development
  • Scala's learning curve is steep.
  • The functional programming learning curve is steep.
  • Scala is a statically typed language.
• Java was intentionally left off
  • Don't bother with Java 7 (IMO)

Page 33: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Transformations: Building a moving average

Page 34: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Example of a moving average

Page 35: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

How do we compute a 5-day MA for batting average?

• For an individual player, on a particular date, we need the previous 5 days' batting average (toy example below).
  • Please note: not all players play every day.
• To build a predictive model, we need historical data, so we would need to calculate this for each day, every player, and (possibly) each metric.
• We would also want to use different moving average durations (e.g. 5, 7, and 14 day moving averages) for each player-day.
• Our compute space just got big, and it is also embarrassingly parallel.
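A toy illustration in plain Python (made-up numbers): the average covers the player's last five games played, not the last five calendar days:

# (game_date, batting average that day); note 2015-05-03 was not played
games = [("2015-05-01", 0.250), ("2015-05-02", 0.333),
         ("2015-05-04", 0.000), ("2015-05-05", 0.500),
         ("2015-05-06", 0.250)]

last5 = [avg for _, avg in games[-5:]]
print(sum(last5) / len(last5))  # ~0.267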

Page 36: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Some Stats

• Load table (2 full years + ~17% of this year so far):
  • gamePlayers = sqlContext.parquetFile(rddDir + 'game_players.parquet').cache()
• # Hitters:
  • gamePlayers.filter(gamePlayers.fd_position != 'P').count()
  • 248856L (# of hitter-gamedate combinations)
• # Pitchers:
  • gamePlayers.filter(gamePlayers.fd_position == 'P').count()
  • 244926L
• A season-level moving average would require about 248,000 * (162/2) rows of data (~20 million rows)

Page 37: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

MapReduce is dead. Long live MapReduce.

• Spark has a set of transformations it can perform on RDDs of key/value pairs (quick sketch below).
• Transformations
  • groupByKey([numTasks])
  • reduceByKey(func, [numTasks])
  • aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
  • sortByKey([ascending], [numTasks])
  • join(otherDataset, [numTasks])
  • cogroup(otherDataset, [numTasks])
• Actions
  • reduce(func)
  • countByKey()
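A quick sketch with toy data: reduceByKey combines values map-side before the shuffle, which is why it beats groupByKey for simple aggregations:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda x, y: x + y)
print(counts.collect())  # [('a', 2), ('b', 1)]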

Page 38: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Broadcast Map-side Join

• "Broadcast" a dictionary of lists of game_dates, keyed by the year the game took place.
• A flatMap operation loops over the broadcast game_dates for the period in which the stats took place (sketched below)
• We use flatMap because we emit many rows from a single input (7-day, 14-day, etc.)
  • Key: player_id + asOfDate + moving_length
  • Value: game stats dictionary from the input
• Since this generates a lot of output, I repartition(50) the output of flatMap
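A sketch of the pattern; statsRDD, game_dates_by_year, and the field names are illustrative stand-ins, not the talk's actual code:

MA_WINDOWS = [7, 14]  # moving-average lengths in days (illustrative)
bc_dates = sc.broadcast(game_dates_by_year)  # hypothetical dict: year -> [dates]

def expand(stat):
    # emit this game's stats once per (player, asOfDate, window) it feeds into
    out = []
    for as_of in bc_dates.value[stat['game_date'].year]:
        for n in MA_WINDOWS:
            days_back = (as_of - stat['game_date']).days
            if 0 < days_back <= n:
                out.append(((stat['player_id'], as_of, n), stat))
    return out

expanded = statsRDD.flatMap(expand).repartition(50)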

Page 39: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Reduce

• Now call groupByKey() (sketch below)
  • Creates an RDD whose values are lists of the values sharing the same key. These values are the stats used to calculate the average.
  • If performing an aggregation, use reduceByKey instead; less shuffling is involved.
• Calculate the average (and stddev, etc.)
• Emit a new key of player_id + asOfDate
  • Value is the dictionary of all moving-average fields
• Run reduceByKey() to combine (concatenate) the 7-day, 14-day metrics into a single row
• Save results as Parquet; they will be joined with other features to create a predictive model
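Continuing the sketch above (NUMERIC_FIELDS is an illustrative list of stat column names): average each key's stats, then concatenate the per-window results into one row per (player_id, asOfDate):

def average(stats):
    n = float(len(stats))
    return dict(('avg_' + f, sum(s[f] for s in stats) / n)
                for f in NUMERIC_FIELDS)

averaged = expanded.groupByKey().map(
    lambda kv: ((kv[0][0], kv[0][1]),                 # (player_id, asOfDate)
                {'%d_day' % kv[0][2]: average(list(kv[1]))}))

def merge(a, b):
    # concatenate the 7-day, 14-day dictionaries into one row
    merged = dict(a)
    merged.update(b)
    return merged

combined = averaged.reduceByKey(merge)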

Page 40: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Useful Links

1. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
2. https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf
3. https://zeppelin.incubator.apache.org/
4. https://codewords.recurse.com/issues/one/an-introduction-to-functional-programming
5. http://apache-spark-user-list.1001560.n3.nabble.com/

Page 41: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark Books

Page 42: Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Q & A