Spark: The Good, the Bad, and the Ugly
Sarah Guido | Data Day Texas 2016
About this talk
About me & Bitly
Spark journey to date
The future!
The good, the bad, and the ugly
About this talk
This talk is:
• My own experience using Spark
• New toys come with caveats
This talk is not:
• Ground truth
• An argument for or against using Spark
About me!
• Lead data scientist at Bitly (bitly.is/hiring)
• NYC Python meetup organizer
• Spark user
• @sarah_guido
SPARK JOURNEY
The stage
“Don’t you guys just, like, shorten links?”
The stage
The stage
• Need for big data analysis tools
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
1 HOUR OF DECODES: 10 GB
1 DAY OF DECODES: 240 GB
1 MONTH OF DECODES: ~7 TB
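These volumes are mutually consistent; a quick back-of-the-envelope check (assuming a 30-day month):

```python
# Back-of-the-envelope check that the decode volumes above line up.
per_hour_gb = 10
per_day_gb = per_hour_gb * 24            # 240 GB per day
per_month_tb = per_day_gb * 30 / 1000.0  # ~7.2 TB for a 30-day month
print(per_day_gb, per_month_tb)
```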
Data
{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}
Spark: the Why
• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting-edge technology
Spark: the What
• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop
• Python, Scala, Java, and R APIs
Spark: the How
• Partitions your data to operate over it in parallel
• Capability to add map/reduce features
• Lazy – only computes when an action is called (e.g. collect() or writing to a file)
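That laziness can be illustrated with plain Python's own lazy iterators (an analogy only, not PySpark code): transformations build up a pipeline, and nothing runs until a terminal call plays the role of an action like collect().

```python
# Plain-Python analogy for Spark's lazy evaluation (not actual PySpark).
nums = range(1, 6)
doubled = map(lambda x: x * 2, nums)    # lazy, like rdd.map(...)
big = filter(lambda x: x > 4, doubled)  # still lazy, like rdd.filter(...)
result = list(big)  # the "action": only now does any work happen
# result == [6, 8, 10]
```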
SPARK 1.2: First exploration of Spark during hack week
SPARK 1.3: DataFrames, AWS config
SPARK 1.4: Decision to use Scala, official AWS support
SPARK 1.5: Prototype of trend detection
Spark environment
• From Hadoop to AWS
• Runs on AWS EMR clusters
• Reads data from S3
• Special config
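A sketch of what that special config can look like in spark-defaults.conf on an EMR cluster; the property names are standard Spark settings, but the values are illustrative, not Bitly's actual configuration:

```
# spark-defaults.conf sketch (illustrative values, not Bitly's real config)
spark.master              yarn
spark.executor.instances  10
spark.executor.memory     4g
spark.serializer          org.apache.spark.serializer.KryoSerializer
```

With the cluster configured this way, jobs can read S3 paths directly, e.g. sc.textFile on an s3:// path.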
THE GOOD
Speed
Both development time and data extraction
Hadoop vs. Spark
Speed
Easy to prototype
Building a working Spark program is relatively simple.*
* In simple circumstances.
Integration with Jupyter Notebook
THE BAD
Updates are major and frequent
AWS Spark version lag
Documentation
• Lack of project maturity
• Only simple examples
• Books end up being obsolete
• Can't just Stack Overflow it!
THE UGLY
DataFrames
DataFrames are appealing to data scientists familiar with Python/Pandas, but their lack of flexibility makes them difficult to use.
Python vs. Scala: the Saga
Python (I know Python!)
Scala (Python API lacks parity with Scala)
Python (But I KNOW Python and can code faster!)
Scala (Lack of parity, weird undebuggable Python errors?!)
THE FUTURE
My hopes
• Python catches up – lowers the barrier to entry
• Community (and therefore knowledge base) grows
• More people talk about using Spark
• Best practices emerge
• Spark used in projects/stack at Bitly
I want to try
• Layer between NSQ and Spark – Cassandra?
• Tool like H2O
• Zeppelin (again)
• Spark Slack channel?
Resources
• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
• Other Spark users
THANK YOU
@sarah_guido