PySpark Best Practices
Juliet Hougland (Cloudera), Sept 2015, @j_houg

TRANSCRIPT

Page 1: PySpark Best Practices


Juliet Hougland Sept 2015 @j_houg

PySpark Best Practices

Page 2: PySpark Best Practices


Page 3: PySpark Best Practices


Spark

• Core written in Scala, operates on the JVM
• Also has Python and Java APIs
• Hadoop friendly
  • Input from HDFS, HBase, Kafka
  • Management via YARN
• Interactive REPL
• ML library == MLLib

Page 4: PySpark Best Practices


Spark MLLib

• Model building and eval
• Fast
• Basics covered
  • LR, SVM, decision trees
  • PCA, SVD
  • K-means
  • ALS
• Algorithms expect RDDs of consistent types (i.e. LabeledPoints)
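A minimal sketch of that consistency requirement with toy inline data, using the Spark 1.x pyspark.mllib API the deck covers; the numbers are purely illustrative:

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="mllib-sketch")

# Every record is the same type: a LabeledPoint (double label + feature vector).
points = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [5.0, 0.5]),
    LabeledPoint(1.0, [4.5, 0.2]),
    LabeledPoint(0.0, [0.5, 1.5]),
])

model = LogisticRegressionWithSGD.train(points, iterations=10)
print(model.predict([4.0, 0.3]))   # most likely prints 1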


Page 5: PySpark Best Practices


RDDs

sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()

[Diagram: an HDFS file split into 4 partitions]

Thanks: Kostas Sakellis

Page 6: PySpark Best Practices


RDDs

[Diagram: the 4 HDFS partitions feed the 4 partitions of the first RDD]

Thanks: Kostas Sakellis

sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()

Page 7: PySpark Best Practices


RDDs

[Diagram: HDFS partitions flow through two RDDs of 4 partitions each]

Thanks: Kostas Sakellis

sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()

Page 8: PySpark Best Practices


RDDs

[Diagram: HDFS partitions flow through three RDDs of 4 partitions each]

Thanks: Kostas Sakellis

sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()

Page 9: PySpark Best Practices


RDDs

[Diagram: HDFS partitions flow through three RDDs of 4 partitions each into a final Count]

Thanks: Kostas Sakellis

sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
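The chained snippet above assumes two user-defined functions. The names come from the slides; the bodies below are hypothetical, just to make the example concrete (and they assume pandas is available on the workers):

import pandas as pd

def to_series(line):
    # Hypothetical parser: one comma-separated line of numbers -> a pandas Series.
    return pd.Series([float(x) for x in line.split(",")])

def has_outlier(series, threshold=10.0):
    # Hypothetical filter: any value above a fixed cutoff counts as an outlier.
    return bool((series > threshold).any())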

Page 10: PySpark Best Practices


Spark Execution Model

Page 11: PySpark Best Practices


PySpark Execution Model

Page 12: PySpark Best Practices


PySpark Driver Program

sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()

Function closures need to be executed on worker nodes by a Python process.

Page 13: PySpark Best Practices


How do we ship around Python functions?

sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()

Page 14: PySpark Best Practices


Pickle!

https://flic.kr/p/c8N4sE

Page 15: PySpark Best Practices


Pickle!

sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
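A small illustration of the idea, not PySpark's actual machinery: plain pickle handles top-level functions but not lambdas or closures, which is why PySpark bundles a cloudpickle variant to serialize the functions passed to map() and filter() before shipping them to the worker Python processes:

import pickle

def square(x):
    # Top-level functions pickle fine: by reference to their module and name.
    return x * x

roundtripped = pickle.loads(pickle.dumps(square))
assert roundtripped(3) == 9

try:
    pickle.dumps(lambda x: x + 1)      # plain pickle chokes on lambdas...
except Exception as exc:
    print("plain pickle fails:", exc)  # ...hence cloudpickle in PySpark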

Page 16: PySpark Best Practices


Best Practices for Writing PySpark

Page 17: PySpark Best Practices


REPLs and Notebooks

https://flic.kr/p/5hnPZp

Page 18: PySpark Best Practices


Share your code

https://flic.kr/p/sw2cnL

Page 19: PySpark Best Practices


Standard Python Project

my_pyspark_proj/
  awesome/
    __init__.py
  bin/
  docs/
  setup.py
  tests/
    awesome_tests.py
    __init__.py

Page 20: PySpark Best Practices


What is the shape of a PySpark job?

https://flic.kr/p/4vWP6U

Page 21: PySpark Best Practices


PySpark Structure?

• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

https://flic.kr/p/ZW54

Shout out to my colleagues in the UK

Page 22: PySpark Best Practices


PySpark Structure?

my_pyspark_proj/
  awesome/
    __init__.py
    DataIO.py
    Featurize.py
    Model.py
  bin/
  docs/
  setup.py
  tests/
    __init__.py
    awesome_tests.py
    resources/
      data_source_sample.csv

• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

Page 23: PySpark Best Practices


Simple Main Method
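The slide's code screenshot did not survive the transcript. A minimal sketch of the shape such a main method tends to take, following the project layout above; the functions inside DataIO, Featurize, and Model are invented here for illustration:

import argparse

from pyspark import SparkConf, SparkContext

from awesome import DataIO, Featurize, Model   # modules from the layout above


def main():
    # Parse CLI args & configure the Spark app
    parser = argparse.ArgumentParser(description="Fancy maths with Spark")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    sc = SparkContext(conf=SparkConf().setAppName("awesome"))

    # Read in data -> raw data into features -> fancy maths -> write out data
    raw = DataIO.read(sc, args.input)               # hypothetical helpers
    features = Featurize.create_labeled_points(raw)
    model = Model.train(features)
    DataIO.write(sc, model, args.output)


if __name__ == "__main__":
    main()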

Page 24: PySpark Best Practices


Write Testable Code

• Write a function for anything inside a transformation
• Make it static
• Separate feature generation or data standardization from your modeling

Featurize.py
…

@staticmethod
def label(single_record):
    …
    return label_as_a_double

@staticmethod
def descriptive_name_of_feature1():
    ...
    return a_double

@staticmethod
def create_labeled_point(data_usage_rdd, sms_usage_rdd):
    ...
    return LabeledPoint(label, [feature1])

Page 25: PySpark Best Practices


Write Serializable Code

• Functions and the context they need in order to execute (closures) must be serializable
• Keep functions simple. I suggest static methods.
• Some things are impossiblish
  • DB connections => use mapPartitions instead (see the sketch below)

https://flic.kr/p/za5cy
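The DB-connection point in code form: a minimal mapPartitions sketch where one connection is opened per partition instead of being captured in a closure (connect_to_db and insert are placeholders, not a real API):

def save_partition(records):
    conn = connect_to_db()          # placeholder: one connection per partition
    try:
        for record in records:
            conn.insert(record)     # placeholder write
    finally:
        conn.close()
    return []                       # mapPartitions must return an iterable

# rdd is assumed to be an RDD of records to persist; the count() forces evaluation.
rdd.mapPartitions(save_partition).count()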

Page 26: PySpark Best Practices


Testing with SparkTestingBase

• Provides a SparkContext and configures the Spark master
• Quiets Py4J
• https://github.com/holdenk/spark-testing-base
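If you want the same conveniences without the library, a hand-rolled stand-in (plain unittest, not spark-testing-base's actual API) could look like:

import logging
import unittest

from pyspark import SparkConf, SparkContext


class PySparkTestCase(unittest.TestCase):
    """Local SparkContext shared by the tests in a class, with Py4J quieted."""

    @classmethod
    def setUpClass(cls):
        logging.getLogger("py4j").setLevel(logging.WARN)
        conf = SparkConf().setMaster("local[4]").setAppName("tests")
        cls.sc = SparkContext(conf=conf)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()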

Page 27: PySpark Best Practices


Testing Suggestions

• Unit test as much as possible
• Integration test the whole flow
• Test for:
  • Deviations of data from expected format
  • RDDs with empty partitions
  • Correctness of results

https://flic.kr/p/tucHHL
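Building on the hand-rolled base class above, a test that covers empty partitions and result correctness might look like this (to_series and has_outlier are the hypothetical functions sketched earlier):

class FeaturizeTests(PySparkTestCase):

    def test_counts_outliers_despite_empty_partitions(self):
        # Two records spread over four partitions, so some partitions are empty.
        rdd = self.sc.parallelize(["1.0,2.0,100.0", "1.0,1.0,1.0"], 4)
        outliers = rdd.map(to_series).filter(has_outlier)
        # Only the first record has a value above the outlier cutoff.
        self.assertEqual(outliers.count(), 1)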

Page 28: PySpark Best Practices


Best Practices for Running PySpark

Page 29: PySpark Best Practices


Writing distributed code is the easy part…

Running it is hard.

Page 30: PySpark Best Practices


Get Serious About Logs

• Get the YARN app id from the WebUI or Console
• yarn logs <app-id>
• Quiet down Py4J
• Log records that have trouble getting processed
• Earlier exceptions are more relevant than later ones
• Look at both the Python and Java stack traces
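Two of those points in code form: quieting Py4J on the driver and logging records that fail to parse instead of letting one bad line kill the job (to_series is the hypothetical parser from earlier):

import logging

# Quiet down Py4J's INFO-level chatter in driver logs.
logging.getLogger("py4j").setLevel(logging.ERROR)

def to_series_or_none(line):
    # Log troublesome records so they show up in `yarn logs`, then skip them.
    try:
        return to_series(line)
    except ValueError:
        logging.warning("Could not parse record: %r", line)
        return None

clean = sc.textFile("hdfs://…", 4).map(to_series_or_none).filter(lambda s: s is not None)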

Page 31: PySpark Best Practices


Know your environment

• You may want to use Python packages on your cluster
• Actively manage dependencies on your cluster
  • Anaconda or virtualenv is good for this
• Spark versions < 1.4.0 require the same version of Python on the driver and workers
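One quick sanity check for the pre-1.4 caveat: compare the driver's Python version with what the workers report (assumes a running SparkContext named sc):

import sys

driver_version = sys.version_info[:2]

def report_version(_):
    import sys                      # imported on the worker
    return sys.version_info[:2]

worker_versions = set(sc.parallelize(range(100), 10).map(report_version).collect())
print("driver:", driver_version, "workers:", worker_versions)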

Page 32: PySpark Best Practices


Complex Dependencies

Page 33: PySpark Best Practices


Many Python Environments

The path to the Python binary to use on the cluster can be set with PYSPARK_PYTHON.

It can be set in spark-env.sh:

if [ -n "${PYSPARK_PYTHON}" ]; then
  export PYSPARK_PYTHON=<path>
fi

Page 34: PySpark Best Practices


Thank You

Questions? @j_houg