Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa from Osaka
Many thanks to Holden Karau for the discussion we had about this talk.
Agenda
Who am I?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I?
Software engineer working for Sky, from architecture design to troubleshooting in the field
Translator working with O’Reilly Japan
‘Learning Spark’ is the 27th book I’ve translated
Awarded the Rakuten Tech Award Silver 2010 for translating ‘Hadoop: The Definitive Guide’
A bed for 6 cats
Works of 2015
Available Jan 2016?
Past works
Motivation for today’s talk
I want to deal with my ‘Big’ data, WITH PYTHON!!
Apache Spark
You may already have heard a lot
Fast, distributed data processing framework with high-level APIs
Written in Scala, runs in the JVM
[Diagram: the Hadoop ecosystem stack — HDFS and HBase on the OS, YARN on top, running MapReduce, Hive etc., Impala etc. (in-memory SQL engine), and Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]
Why it’s fast
No need to write temporary data to storage at every step
No need to invoke a JVM process at every step
[Diagram: MapReduce vs. Spark. In MapReduce, every map and reduce step invokes a JVM and does I/O against HDFS. In Spark, a single executor JVM is invoked once and runs the whole chain of functions (f1 reads data into an RDD, f4 persists to storage, f5 does a shuffle), with intermediate results kept in memory as RDDs.]
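A minimal PySpark sketch of the pattern the diagram shows: one chain of transformations runs inside long-lived executor JVMs, and a cached RDD keeps intermediate results in memory instead of writing them to storage. The input path and parsing logic here are hypothetical.

from pyspark import SparkContext

sc = SparkContext(appName="chained-transformations")

# f1: read data into an RDD (hypothetical input path)
lines = sc.textFile("hdfs:///data/events.txt")

# f2, f3: chained transformations, no storage I/O in between
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))

# keep the intermediate RDD in executor memory for reuse
pairs.cache()

# a shuffle happens here, but no new JVM is started
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.take(10))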
Apache Spark and non-JVM languages
Spark supports non-JVM languages
Shells:
PySpark, for Python users
SparkR, for R users
GUI environments: Jupyter, RStudio
You can write application code in these languages
The Web UI tells us a lot
http://<address>:4040
Performance problems with those languages
Data processing with those languages may be several times slower than with JVM languages
The reason lies in the architecture
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
The choices you have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more CPU cores to make up for the performance gap
DataFrame APIs come to the rescue !
DataFrame
Tabular data with a schema, built on top of RDDs
Successor of SchemaRDD (since Spark 1.3)
Has a rich set of APIs for data operations
Or, you can simply use SQL!
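A small sketch of both styles, using a toy DataFrame; the data, column names and temp table name are made up for illustration.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# a toy DataFrame with a schema (hypothetical data)
df = sqlContext.createDataFrame(
    [("abc", 1), ("def", 2), ("abc", 3)], ["name", "value"])

# DataFrame API
df.groupBy("name").count().show()

# ... or simply use SQL
df.registerTempTable("t")
sqlContext.sql("SELECT name, COUNT(*) FROM t GROUP BY name").show()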
Do it within the JVM
When you call DataFrame APIs from non-JVM languages, data is not transferred between the JVM and the language runtime
As a result, the performance is almost the same as with JVM languages
Only code goes through
DataFrame APIs compared to RDD APIs by Examples
[Diagram: RDD API — the driver sends a Python lambda such as lambda items: items[0] == ‘abc’ to the executor; the cached DataFrame’s rows are transferred from the executor JVM to the Python runtime to evaluate it, and the result is transferred back.]
[Diagram: DataFrame API — the driver sends filter(df[“_1”] == “abc”) as code only; the filter runs inside the executor JVM against the cached DataFrame, and no data crosses between the JVM and Python.]
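As a sketch, the two styles side by side, assuming df is the cached DataFrame from the diagrams with a column named _1:

# RDD-style filter: the lambda runs in Python worker processes,
# so every row is serialized out of the JVM and back again
matches_rdd = df.rdd.filter(lambda items: items[0] == 'abc')

# DataFrame-style filter: the expression is shipped as code only
# and evaluated inside the executor JVM; no row data reaches Python
matches_df = df.filter(df["_1"] == "abc")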
Watch out for UDFs
You can write UDFs in Python
You can use lambdas in Python, too
Once you use them, data flows between the two worlds
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())
df.select(slen(df.name)).collect()
Make it small first, then use UDFs
Filter or sample your ‘big’ data with DataFrame APIs
Then use UDFs
The SQL optimizer does not take UDF costs into account when making plans (so far)
[Diagram: ‘BIG’ data in a DataFrame is filtered down with ‘native’ APIs to ‘small’ data in a DataFrame, which is then processed with whatever operations, including UDFs.]
slen = udf(lambda s: len(s), IntegerType())
sqc.sql('select … from df where fname like "tama%" and slen(name)').collect()
The fname like "tama%" predicate is processed first!
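The same filter-then-UDF pattern with the DataFrame API, as a sketch; the fname and name columns follow the SQL above.

# native filter first: evaluated in the JVM, shrinks the data
small = df.filter(df.fname.like('tama%'))

# the Python UDF then only sees the already-small DataFrame
small.select(slen(small.name)).collect()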
Ingesting Data
It’s slow to deal with files like CSVs from a non-JVM driver
Convert raw data to ‘DataFrame-native’ formats like Parquet first
Such files can be processed directly by the JVM processes (the executors), even when you use a non-JVM language
[Diagram: ingesting local data through the non-JVM driver — data flows from local files on the driver machine through the driver process and Py4J into the JVM DataFrame, and from there to the executors and HDFS (Parquet).]
Ingesting Data
[Diagram: once the data is in HDFS as Parquet, only code crosses Py4J between the non-JVM driver and the JVM DataFrame; the executors read the Parquet files directly.]
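A sketch of the conversion step, assuming the spark-csv package is on the classpath; the paths and options are hypothetical.

# read the raw CSV once (the slow, non-native path)
raw = sqlContext.read.format('com.databricks.spark.csv') \
    .option('header', 'true').load('hdfs:///data/raw.csv')

# persist it as Parquet, a 'DataFrame-native' format
raw.write.parquet('hdfs:///data/events.parquet')

# later jobs read the Parquet files directly in the executor JVMs
df = sqlContext.read.parquet('hdfs:///data/events.parquet')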
Appendix: Parquet
Parquet: a general-purpose file format for analytic workloads
Columnar storage: reduces I/O significantly
High compression rate
Projection pushdown
Today’s workloads are becoming CPU-intensive: Parquet is built for very fast reads, with CPU internals in mind
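A sketch of projection pushdown in action (path and column names are hypothetical): selecting a few columns from a Parquet-backed DataFrame reads only those column chunks from storage.

df = sqlContext.read.parquet('hdfs:///data/events.parquet')

# only the 'name' and 'value' column chunks are read from disk
df.select('name', 'value').show()

# filters on Parquet data can also be pushed down to skip data
df.filter(df.value > 100).count()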