Spark: The Good, the Bad, and the Ugly
Sarah Guido | Data Day Texas 2016
About this talk
About me & Bitly
Spark journey to date
The future!
The good, the bad, and the ugly
About this talk
This talk is:
• My own experience using Spark
• New toys come with caveats
This talk is not:
• Ground truth
• An argument for or against using Spark
About me!
• Lead data scientist at Bitly (bitly.is/hiring)
• NYC Python meetup organizer
• Spark user
• @sarah_guido
SPARK JOURNEY
The stage
“Don’t you guys just, like, shorten links?”
The stage
The stage
• Need for big data analysis tools
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
1 HOUR OF DECODES: 10 GB
1 DAY OF DECODES: 240 GB
1 MONTH OF DECODES: ~7 TB
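These volumes are mutually consistent; a quick back-of-the-envelope check (assuming a 30-day month):

```python
# Back-of-the-envelope check that the decode volumes above line up.
per_hour_gb = 10
per_day_gb = per_hour_gb * 24            # 240 GB per day
per_month_tb = per_day_gb * 30 / 1000.0  # ~7.2 TB for a 30-day month
print(per_day_gb, per_month_tb)
```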
Data
{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}
Spark: the Why
• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting-edge technology
Spark: the What
• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop
• Python, Scala, Java, and R APIs
Spark: the How
• Partitions your data to operate over it in parallel
• Capability to add map/reduce features
• Lazy – only computes when an action is called (e.g. collect() or writing to a file)
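That laziness can be illustrated with plain Python's own lazy iterators (an analogy only, not PySpark code): transformations build up a pipeline, and nothing runs until a terminal call plays the role of an action like collect().

```python
# Plain-Python analogy for Spark's lazy evaluation (not actual PySpark).
nums = range(1, 6)
doubled = map(lambda x: x * 2, nums)    # lazy, like rdd.map(...)
big = filter(lambda x: x > 4, doubled)  # still lazy, like rdd.filter(...)
result = list(big)  # the "action": only now does any work happen
# result == [6, 8, 10]
```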
SPARK 1.2: First exploration of Spark during hack week
SPARK 1.3: DataFrames, AWS config
SPARK 1.4: Decision to use Scala, official AWS support
SPARK 1.5: Prototype of trend detection
Spark environment
• From Hadoop to AWS
• Runs on AWS EMR clusters
• Reads data from S3
• Special config
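A sketch of what that special config can look like in spark-defaults.conf on an EMR cluster; the property names are standard Spark settings, but the values are illustrative, not Bitly's actual configuration:

```
# spark-defaults.conf sketch (illustrative values, not Bitly's real config)
spark.master              yarn
spark.executor.instances  10
spark.executor.memory     4g
spark.serializer          org.apache.spark.serializer.KryoSerializer
```

With the cluster configured this way, jobs can read S3 paths directly, e.g. sc.textFile on an s3:// path.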
THE GOOD
Speed
Both development time and data extraction
Hadoop vs. Spark
Speed
Easy to prototype
Building a working Spark program is relatively simple.*
* In simple circumstances.
Integration with Jupyter Notebook
THE BAD
Updates are major and frequent
AWS Spark version lag
Documentation
• Lack of project maturity
• Only simple examples
• Books end up being obsolete
• Can't just Stack Overflow it!
THE UGLY
DataFrames
DataFrames are appealing to data scientists familiar with Python/Pandas, but their lack of flexibility makes them difficult to use.
Python vs. Scala: the Saga
Python (I know Python!)
Scala (Python API lacks parity with Scala)
Python (But I KNOW Python and can code faster!)
Scala (Lack of parity, weird undebuggable Python errors?!)
THE FUTURE
My hopes
• Python catches up – lowers the barrier to entry
• Community (and therefore knowledge base) grows
• More people talk about using Spark
• Best practices emerge
• Spark used in projects/stack at Bitly
I want to try
• Layer between NSQ and Spark – Cassandra?
• Tool like H2O
• Zeppelin (again)
• Spark Slack channel?
Resources
• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
• Other Spark users
THANK YOU
@sarah_guido