Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa from Osaka
Many thanks to Holden Karau for the discussion we had about this talk.
Agenda
Who am I?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I?
Software engineer working for Sky, from architecture design to troubleshooting in the field
Translator working with O’Reilly Japan
‘Learning Spark’ is the 27th book I’ve translated
Awarded the Rakuten Tech Award Silver 2010 for translating ‘Hadoop: The Definitive Guide’
A bed for 6 cats
Works of 2015
Available Jan 2016?
Past works
Motivation for today’s talk
I want to deal with my ‘Big’ data, WITH PYTHON!!
Apache Spark
You may already have heard a lot
Fast, distributed data processing framework with high-level APIs
Written in Scala, runs in the JVM
[Diagram: the Hadoop ecosystem stack — HDFS and HBase on the OS, YARN on top, running MapReduce, Hive etc., Impala etc. (in-memory SQL engine), and Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]
Why it’s fast
No need to write temporary data to storage at every step
No need to invoke a JVM process at every step
[Diagram: MapReduce vs. Spark. In MapReduce, every map and reduce step invokes a JVM and does I/O against HDFS. In Spark, a single executor JVM is invoked once and runs the whole chain of functions (f1 reads data into an RDD, f4 persists to storage, f5 does a shuffle), with intermediate results kept in memory as RDDs.]
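A minimal PySpark sketch of the pattern the diagram shows: one chain of transformations runs inside long-lived executor JVMs, and a cached RDD keeps intermediate results in memory instead of writing them to storage. The input path and parsing logic here are hypothetical.

from pyspark import SparkContext

sc = SparkContext(appName="chained-transformations")

# f1: read data into an RDD (hypothetical input path)
lines = sc.textFile("hdfs:///data/events.txt")

# f2, f3: chained transformations, no storage I/O in between
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))

# keep the intermediate RDD in executor memory for reuse
pairs.cache()

# a shuffle happens here, but no new JVM is started
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.take(10))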
Apache Spark and non-JVM languages
Spark supports non-JVM languages
Shells:
PySpark, for Python users
SparkR, for R users
GUI environments: Jupyter, RStudio
You can write application code in these languages
The Web UI tells us a lot
http://<address>:4040
Performance problems with those languages
Data processing with those languages may be several times slower than with JVM languages
The reason lies in the architecture
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
The choices you have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more CPU cores to make up for the performance gap
DataFrame APIs come to the rescue !
DataFrame
Tabular data with a schema, built on top of RDDs
Successor of SchemaRDD (since Spark 1.3)
Has a rich set of APIs for data operations
Or, you can simply use SQL!
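A small sketch of both styles, using a toy DataFrame; the data, column names and temp table name are made up for illustration.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# a toy DataFrame with a schema (hypothetical data)
df = sqlContext.createDataFrame(
    [("abc", 1), ("def", 2), ("abc", 3)], ["name", "value"])

# DataFrame API
df.groupBy("name").count().show()

# ... or simply use SQL
df.registerTempTable("t")
sqlContext.sql("SELECT name, COUNT(*) FROM t GROUP BY name").show()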
Do it within the JVM
When you call DataFrame APIs from non-JVM languages, data is not transferred between the JVM and the language runtime
As a result, the performance is almost the same as with JVM languages
Only code goes through
DataFrame APIs compared to RDD APIs by Examples
[Diagram: RDD API — the driver sends a Python lambda such as lambda items: items[0] == ‘abc’ to the executor; the cached DataFrame’s rows are transferred from the executor JVM to the Python runtime to evaluate it, and the result is transferred back.]
[Diagram: DataFrame API — the driver sends filter(df[“_1”] == “abc”) as code only; the filter runs inside the executor JVM against the cached DataFrame, and no data crosses between the JVM and Python.]
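As a sketch, the two styles side by side, assuming df is the cached DataFrame from the diagrams with a column named _1:

# RDD-style filter: the lambda runs in Python worker processes,
# so every row is serialized out of the JVM and back again
matches_rdd = df.rdd.filter(lambda items: items[0] == 'abc')

# DataFrame-style filter: the expression is shipped as code only
# and evaluated inside the executor JVM; no row data reaches Python
matches_df = df.filter(df["_1"] == "abc")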
Watch out for UDFs
You can write UDFs in Python
You can use lambdas in Python, too
Once you use them, data flows between the two worlds
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())
df.select(slen(df.name)).collect()
Make it small first, then use UDFs
Filter or sample your ‘big’ data with DataFrame APIs
Then use UDFs
The SQL optimizer does not take UDF costs into account when making plans (so far)
[Diagram: ‘BIG’ data in a DataFrame is filtered down with ‘native’ APIs to ‘small’ data in a DataFrame, which is then processed with whatever operations, including UDFs.]
slen = udf(lambda s: len(s), IntegerType())
sqc.sql('select … from df where fname like "tama%" and slen(name)').collect()
The fname like "tama%" predicate is processed first!
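The same filter-then-UDF pattern with the DataFrame API, as a sketch; the fname and name columns follow the SQL above.

# native filter first: evaluated in the JVM, shrinks the data
small = df.filter(df.fname.like('tama%'))

# the Python UDF then only sees the already-small DataFrame
small.select(slen(small.name)).collect()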
Ingesting Data
It’s slow to deal with files like CSVs from a non-JVM driver
Convert raw data to ‘DataFrame-native’ formats like Parquet first
Such files can be processed directly by the JVM processes (the executors), even when you use a non-JVM language
[Diagram: ingesting local data through the non-JVM driver — data flows from local files on the driver machine through the driver process and Py4J into the JVM DataFrame, and from there to the executors and HDFS (Parquet).]
Ingesting Data
[Diagram: once the data is in HDFS as Parquet, only code crosses Py4J between the non-JVM driver and the JVM DataFrame; the executors read the Parquet files directly.]
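A sketch of the conversion step, assuming the spark-csv package is on the classpath; the paths and options are hypothetical.

# read the raw CSV once (the slow, non-native path)
raw = sqlContext.read.format('com.databricks.spark.csv') \
    .option('header', 'true').load('hdfs:///data/raw.csv')

# persist it as Parquet, a 'DataFrame-native' format
raw.write.parquet('hdfs:///data/events.parquet')

# later jobs read the Parquet files directly in the executor JVMs
df = sqlContext.read.parquet('hdfs:///data/events.parquet')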
Appendix: Parquet
Parquet: a general-purpose file format for analytic workloads
Columnar storage: reduces I/O significantly
High compression rate
Projection pushdown
Today’s workloads are becoming CPU-intensive: Parquet is built for very fast reads, with CPU internals in mind
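A sketch of projection pushdown in action (path and column names are hypothetical): selecting a few columns from a Parquet-backed DataFrame reads only those column chunks from storage.

df = sqlContext.read.parquet('hdfs:///data/events.parquet')

# only the 'name' and 'value' column chunks are read from disk
df.select('name', 'value').show()

# filters on Parquet data can also be pushed down to skip data
df.filter(df.value > 100).count()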