improving python and spark performance and ... · • apache member • apache pmcs: arrow, kudu,...

49
Improving Python and Spark Performance and Interoperability with Apache Arrow Julien Le Dem Principal Architect Dremio Li Jin Software Engineer Two Sigma Investments

Upload: others

Post on 22-May-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

Improving Python and Spark Performance and Interoperability with Apache Arrow

Julien Le DemPrincipal ArchitectDremio

Li JinSoftware EngineerTwo Sigma Investments

Page 2: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

About Us

• Architect at @DremioHQ• Formerly Tech Lead at Twitter on Data 

Platforms• Creator of Parquet• Apache member• Apache PMCs: Arrow, Kudu, Incubator, 

Pig, Parquet

Julien Le Dem@J_

Li Jin@icexelloss

• Software Engineer at Two Sigma Investments• Building a python­based analytics platform with PySpark• Other open source projects:

– Flint: A Time Series Library on Spark

– Cook: A Fair Share Scheduler on Mesos

Page 3: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Agenda

• Current state and limitations of PySpark UDFs

• Apache Arrow overview

• Improvements realized

• Future roadmap

Page 4: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

Current state and limitations of PySpark UDFs

Page 5: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Why do we need User Defined Functions?

• Some computation is more easily expressed with Python than Spark built­in functions.

• Examples:– weighted mean– weighted correlation – exponential moving average

Page 6: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

What is PySpark UDF

• PySpark UDF is a user defined function executed in Python runtime.

• Two types:– Row UDF: 

• lambda x: x + 1• lambda date1, date2: (date1 - date2).years

– Group UDF (subject of this presentation):• lambda values: np.mean(np.array(values))

Page 7: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Row UDF

• Operates on a row by row basis– Similar to `map` operator

• Example …df.withColumn(

‘v2’, udf(lambda x: x+1, DoubleType())(df.v1)

)

• Performance:– 60x slower than build­in functions for simple case

Page 8: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Group UDF

• UDF that operates on more than one row– Similar to `groupBy` followed by `map` operator

• Example:– Compute weighted mean by month

Page 9: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Group UDF

• Not supported out of box:– Need boiler plate code to pack/unpack multiple rows into a nested row

• Poor performance– Groups are materialized and then converted to Python data structures

Page 10: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Example: Data Normalization

(values – values.mean()) / values.std()

Page 11: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Example: Data Normalization

Page 12: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Example: Monthly Data Normalization

Useful bits

Page 13: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Example: Monthly Data Normalization

Boilerplate

Boilerplate

Page 14: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Example: Monthly Data Normalization

• Poor performance ­ 16x slower than baseline

groupBy().agg(collect_list())

Page 15: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Problems

• Packing / unpacking nested rows• Inefficient data movement (Serialization / Deserialization)• Scalar computation model: object boxing and interpreter overhead

Page 16: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

Apache Arrow

Page 17: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Arrow: An open source standard

• Common need for in memory columnar• Building on the success of Parquet.• Top­level Apache project• Standard from the start

– Developers from 13+ major open source projects involved• Benefits:

– Share the effort– Create an ecosystem

Calcite

Cassandra

Deeplearning4j

Drill

Hadoop

HBase

Ibis

Impala

Kudu

Pandas

Parquet

Phoenix

Spark

Storm

R

Page 18: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Arrow goals

• Well­documented and cross language compatible• Designed to take advantage of modern CPU• Embeddable 

­ In execution engines, storage layers, etc.• Interoperable

Page 19: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

High Performance Sharing & InterchangeBefore With Arrow

• Each system has its own internal memory format

• 70-80% CPU wasted on serialization and deserialization

• Functionality duplication and unnecessary conversions

• All systems utilize the same memory format

• No overhead for cross-system communication

• Projects can share functionality (eg: Parquet-to-Arrow reader)

Page 20: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Columnar data

persons = [{nam e: ’Joe',age: 18,phones: [

‘555-111-1111’, ‘555-222-2222’

] }, {

nam e: ’Jack',age: 37,phones: [ ‘555-333-3333’ ]

}]

Page 21: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Record Batch Construction

Schema Negotiation

Schema Negotiation

Dictionary Batch

Dictionary Batch

Record Batch

Record Batch

Record Batch

Record Batch

Record Batch

Record Batch

name (offset)name (offset)

name (data)name (data)

age (data)age (data)

phones (list offset)phones (list offset)

phones (data)phones (data)

data header (describes offsets into data)data header (describes offsets into data)

name (bitmap)name (bitmap)

age (bitmap)age (bitmap)

phones (bitmap)phones (bitmap)

phones (offset)phones (offset)

{nam e: ’Joe',age: 18,phones: [

‘555-111-1111’, ‘555-222-2222’

] }

Each box (vector) is contiguous memory The entire record batch is contiguous on wireEach box (vector) is contiguous memory The entire record batch is contiguous on wire

Page 22: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

In memory columnar format for speed

• Maximize CPU throughput­ Pipelining­ SIMD­ cache locality

• Scatter/gather I/O

Page 23: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Results­ PySpark Integration: 

53x speedup (IBM spark work on SPARK­13534)http://s.apache.org/arrowresult1

­ Streaming Arrow Performance7.75GB/s data movementhttp://s.apache.org/arrowresult2

­ Arrow Parquet C++ Integration4GB/s readshttp://s.apache.org/arrowresult3

­ Pandas Integration9.71GB/shttp://s.apache.org/arrowresult4

Page 24: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Arrow Releases

178195

311

85

237

131

76

17

Changes Days

Page 25: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

Improvements to PySpark  with Arrow

Page 26: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

How PySpark UDF works

Executor

PythonWorker

UDF: scalar -> scalar

Batched Rows

Batched Rows

Page 27: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Current Issues with UDF

• Serialize / Deserialize in Python• Scalar computation model (Python for loop)

Page 28: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Profile lambda x: x+1 Actual Runtime is 2s without profiling.8 Mb/s

91.8%

Page 29: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Vectorize Row UDF

ExecutorPythonWorker

UDF: pd.DataFrame ­> pd.DataFrame

Rows ­> RB

RB ­> Rows

Page 30: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Why pandas.DataFrame

• Fast, feature­rich, widely used by Python users• Already exists in PySpark (toPandas)• Compatible with popular Python libraries:

­ NumPy, StatsModels, SciPy, scikit­learn…• Zero copy to/from Arrow

Page 31: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Scalar vs Vectorized UDF

20x Speed Up

Actual Runtime is 2s without profiling

Page 32: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Scalar vs Vectorized UDF

OverheadRemoved

Page 33: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Scalar vs Vectorized UDF

Less System CallFaster I/O

Page 34: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Scalar vs Vectorized UDF

4.5x Speed Up

Page 35: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Support Group UDF

• Split­apply­combine:­ Break a problem into smaller pieces­ Operate on each piece independently­ Put all pieces back together

• Common pattern supported in SQL, Spark, Pandas, R … 

Page 36: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Split­Apply­Combine (Current)

• Split: groupBy, window, …• Apply: mean, stddev, collect_list, rank …• Combine: Inherently done by Spark

Page 37: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Split­Apply­Combine (with Group UDF)

• Split: groupBy, window, …• Apply: UDF• Combine: Inherently done by Spark

Page 38: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Introduce groupBy().apply()

• UDF: pd.DataFrame ­> pd.DataFrame– Treat each group as a pandas DataFrame– Apply UDF on each group– Assemble as PySpark DataFrame

Page 39: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Introduce groupBy().apply()

RowsRows

RowsRows

RowsRows

GroupsGroups

GroupsGroups

GroupsGroups

GroupsGroups

GroupsGroups

GroupsGroups

                 Each Group:pd.DataFrame ­> pd.DataFramegroupBy

Page 40: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Previous Example: Data Normalization

(values – values.mean()) / values.std()

Page 41: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Previous Example: Data Normalization

5x Speed Up

Current: Group UDF:

Page 42: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Limitations

• Requires Spark Row <­> Arrow RecordBatch conversion– Incompatible memory layout (row vs column)

• (groupBy) No local aggregation– Difficult due to how PySpark works. See 

https://issues.apache.org/jira/browse/SPARK­10915 

Page 43: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

Future Roadmap

Page 44: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

What’s Next (Arrow)

• Arrow RPC/REST• Arrow IPC• Apache {Spark, Drill, Kudu} to Arrow Integration

– Faster UDFs, Storage interfaces

Page 45: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

What’s Next (PySpark UDF)

• Continue working on SPARK­20396• Support Pandas UDF with more PySpark functions:

– groupBy().agg()– window

Page 46: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

What’s Next (PySpark UDF)

Page 47: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Get Involved

• Watch SPARK­20396• Join the Arrow community

[email protected]– Slack:

• https://apachearrowslackin.herokuapp.com/– http://arrow.apache.org– Follow @ApacheArrow

Page 48: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

© 2017 Dremio Corporation, Two Sigma Investments, LP

Thank you

• Bryan Cutler (IBM), Wes McKinney (Two Sigma Investments) for helping build this feature

• Apache Arrow community• Spark Summit organizers• Two Sigma and Dremio for supporting this work

Page 49: Improving Python and Spark Performance and ... · • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Li Jin @icexelloss • Software Engineer

This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy

any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment

advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”).

Such views reflect significant assumptions and subjective of the author(s) of the document and are subject to change without notice. The

document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of

such information in no way implies an endorsement of the source of such information or its validity.

The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If

so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and

comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any

association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.