Spark: The Good, the Bad, and the Ugly

Sarah Guido, Data Day Texas 2016

Upload: sarah-guido

Post on 21-Feb-2017


TRANSCRIPT

Page 1: Spark: The Good, the Bad, and the Ugly

Spark: The Good, the Bad, and the Ugly

Sarah Guido
Data Day Texas 2016

Page 2: Spark: The Good, the Bad, and the Ugly

About this talk

About me & Bitly

Spark journey to date

The future!

The good, the bad, and the ugly

Page 3: Spark: The Good, the Bad, and the Ugly

About this talk

This talk is
• My own experience using Spark
• New toys come with caveats

This talk is not
• Ground truth
• An argument for or against using Spark

Page 4: Spark: The Good, the Bad, and the Ugly

About me!

• Lead data scientist at Bitly (bitly.is/hiring)
• NYC Python meetup organizer
• Spark user
• @sarah_guido

Page 5: Spark: The Good, the Bad, and the Ugly

SPARK JOURNEY

Page 6: Spark: The Good, the Bad, and the Ugly

The stage

“Don’t you guys just, like, shorten links?”

Page 7: Spark: The Good, the Bad, and the Ugly

The stage

Page 8: Spark: The Good, the Bad, and the Ugly

The stage

• Need for big data analysis tools
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!

Page 9: Spark: The Good, the Bad, and the Ugly

1 HOUR OF DECODES: 10 GB

1 DAY OF DECODES: 240 GB

1 MONTH OF DECODES: ~7 TB
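As a rough consistency check on these figures, the sketch below scales the hourly volume up to a day and a month. It assumes a 30-day month and decimal units (1 TB = 1000 GB), neither of which is stated on the slide.

```python
# Scale the slide's hourly figure up to a day and a month.
# Assumptions (not from the slide): 30-day month, 1 TB = 1000 GB.
GB_PER_HOUR = 10

gb_per_day = GB_PER_HOUR * 24
tb_per_month = gb_per_day * 30 / 1000

print(f"1 day:   {gb_per_day} GB")
print(f"1 month: ~{tb_per_month:.1f} TB")
```

Under those assumptions the daily figure matches the slide exactly and the monthly figure lands close to the slide's "~7 TB".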

Page 10: Spark: The Good, the Bad, and the Ugly

Data

{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}
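A minimal sketch of reading one such record in Python, using only the standard library. The field meanings in the comments are guesses from the sample values (the slide does not define them), and the record is abbreviated to a few of the fields shown above.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Abbreviated copy of the sample record above.
raw = (
    '{"c": "US", "tz": "America/Los_Angeles", '
    '"u": "http://www.nytimes.com/2015/03/22/opinion/sunday/'
    'why-health-care-tech-is-still-so-bad.html?smid=tw-share", '
    '"t": 1427288425, "cy": "Seattle"}'
)

record = json.loads(raw)

# Assumption: "t" is a Unix timestamp in seconds.
clicked_at = datetime.fromtimestamp(record["t"], tz=timezone.utc)

# Assumption: "u" is the destination URL of the shortened link.
domain = urlsplit(record["u"]).netloc

print(clicked_at.isoformat(), domain, record["c"], record["cy"])
```

At the volumes quoted earlier, this per-record parsing is the kind of step that would be applied across many files in parallel rather than in a single process.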

Page 11: Spark: The Good, the Bad, and the Ugly

Spark: the Why

• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting edge technology

Page 12: Spark: The Good, the Bad, and the Ugly

Spark: the What

• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop
• Python, Scala, Java, R APIs

Page 13: Spark: The Good, the Bad, and the Ugly

Spark: the How

• Partitions your data to operate over in parallel
• Capability to add map/reduce features
• Lazy – only runs when an action is called (e.g. collect() or writing to a file)
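The three points above can be illustrated without Spark at all. The sketch below is a plain-Python analogy, not Spark's API: the `LazyDataset` class and its methods are hypothetical names chosen to mirror the idea of partitioned data, map/reduce-style steps, and work that is deferred until an action such as `collect()` is called.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for a partitioned, lazily evaluated dataset (not Spark's API)."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions  # list of lists, one per partition
        self.steps = steps            # recorded transformations, not yet run

    @classmethod
    def from_items(cls, items, num_partitions):
        # Round-robin split of the input into partitions.
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Transformation: only records the step and returns a new dataset.
        return LazyDataset(self.partitions, self.steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also recorded, not executed.
        return LazyDataset(self.partitions, self.steps + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step, in order, to one partition.
        for kind, fn in self.steps:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    def collect(self):
        # Action: this is where the recorded steps actually run.
        # A real engine would process the partitions in parallel.
        return [x for p in self.partitions for x in self._run_partition(p)]

    def reduce(self, fn):
        # Action: reduce within each partition, then combine the partial results.
        results = (self._run_partition(p) for p in self.partitions)
        partials = [reduce(fn, part) for part in results if part]
        return reduce(fn, partials)


calls = []

def square(x):
    calls.append(x)  # lets us observe when the work really happens
    return x * x

ds = (
    LazyDataset.from_items(range(1, 7), num_partitions=3)
    .map(square)
    .filter(lambda x: x % 2 == 0)
)
print(calls)                          # [] because nothing has run yet
print(sorted(ds.collect()))           # [4, 16, 36]
print(ds.reduce(lambda a, b: a + b))  # 56
```

Because each action re-runs the recorded steps, `square` is called again for `reduce`; this toy version does no caching.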

Page 14: Spark: The Good, the Bad, and the Ugly

SPARK 1.2: First exploration of Spark during hack week

SPARK 1.3: DataFrames, AWS config

SPARK 1.4: Decision to use Scala, official AWS support

SPARK 1.5: Prototype of trend detection

Page 15: Spark: The Good, the Bad, and the Ugly

Spark environment

• From Hadoop to AWS
• Runs on AWS EMR clusters
• Reads data from S3
• Special config

Page 16: Spark: The Good, the Bad, and the Ugly

THE GOOD

Page 17: Spark: The Good, the Bad, and the Ugly

Speed

Both development time and data extraction

Hadoop vs. Spark

Page 18: Spark: The Good, the Bad, and the Ugly

Speed

Page 19: Spark: The Good, the Bad, and the Ugly

Easy to prototype

Building a working Spark program is relatively simple.*

* In simple circumstances.

Page 20: Spark: The Good, the Bad, and the Ugly

Integration with Jupyter Notebook

Page 21: Spark: The Good, the Bad, and the Ugly

THE BAD

Page 22: Spark: The Good, the Bad, and the Ugly

Updates are major and frequent

Page 23: Spark: The Good, the Bad, and the Ugly

AWS Spark version lag

Page 24: Spark: The Good, the Bad, and the Ugly

Documentation

• Lack of project maturity
• Simple examples
• Books end up being obsolete
• Can’t just Stack Overflow it!

Page 25: Spark: The Good, the Bad, and the Ugly

THE UGLY

Page 26: Spark: The Good, the Bad, and the Ugly

DataFrames

DataFrames are appealing to data scientists familiar with Python/Pandas, but their lack of flexibility makes them difficult to use.

Page 27: Spark: The Good, the Bad, and the Ugly

Python vs. Scala: the Saga

Page 28: Spark: The Good, the Bad, and the Ugly

Python vs. Scala: the Saga

Python (I know Python!)

Scala (Python API lacks parity with Scala)

Python (But I KNOW Python and can code faster!)

Scala (Lack of parity, weird undebuggable Python errors?!)

Page 29: Spark: The Good, the Bad, and the Ugly

Python vs. Scala: the Saga

Page 30: Spark: The Good, the Bad, and the Ugly

THE FUTURE

Page 31: Spark: The Good, the Bad, and the Ugly

My hopes

• Python catches up – barrier to entry
• Community (and therefore knowledge base) grows
• More people talk about using Spark
• Best practices
• Used in projects/stack at Bitly

Page 32: Spark: The Good, the Bad, and the Ugly

I want to try

• Layer between NSQ and Spark – Cassandra?
• Tool like H2O
• Zeppelin (again)
• Spark Slack channel?

Page 33: Spark: The Good, the Bad, and the Ugly

Resources

• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
• Other Spark users

Page 34: Spark: The Good, the Bad, and the Ugly

THANK YOU

@sarah_guido