data science toolkit 101: set up python, spark, & jupyter

IBM Cloud Data Services. data science toolkit 101: set up Python, Spark, & Jupyter. Raj Singh, PhD, Developer Advocate: Geo | Open Data. http://ibm.biz/rajrsingh twitter: @rajrsingh

Upload: raj-singh

Post on 23-Feb-2017


TRANSCRIPT

Page 1: data science toolkit 101: set up Python, Spark, & Jupyter

IBM Cloud Data Services

data science toolkit 101: set up Python, Spark, & Jupyter
Raj Singh, PhD
Developer Advocate: Geo | Open Data
http://ibm.biz/rajrsingh
twitter: @rajrsingh

Page 2: data science toolkit 101: set up Python, Spark, & Jupyter

@rajrsingh | IBM Cloud Data Services

Agenda

•Installation
  • Python
  • Spark
  • Pixiedust

•Examples

Page 3: data science toolkit 101: set up Python, Spark, & Jupyter


IBM Analytics

Data Science Experience (DSX)

Page 4: data science toolkit 101: set up Python, Spark, & Jupyter


What is Spark?

•In-memory Hadoop
  • Hadoop was massively scalable but slow
  • “Up to 100x faster” (10x faster if memory is exhausted)
•What is Hadoop?
  • HDFS: fault-tolerant storage using horizontally scalable commodity hardware
  • MapReduce: programming style for distributed processing
•Presents data as an object independent of the underlying storage
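
To make the "MapReduce programming style" concrete, here is a minimal single-machine sketch of a word count split into map, shuffle, and reduce steps. It is plain Python, not Hadoop or Spark code, and the function names are hypothetical; a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three steps together on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))
```

Calling `word_count(["spark is fast", "hadoop is scalable"])` counts "is" twice and every other word once.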

Page 5: data science toolkit 101: set up Python, Spark, & Jupyter


Spark abstracted storage

•Scala
•PySpark = (Spark + Python)
•Drivers
  • File storage
  • Cloudant
  • dashDB
  • Cassandra
  • …
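
One way to picture the storage abstraction from PySpark is through the DataFrame reader, where only the format name and its options change between back ends. The sketch below assumes a Spark 1.6 `sqlContext` like the one the PySpark shell provides; `load_table` is a hypothetical helper, and the format name and options a particular driver expects (Cloudant, dashDB, Cassandra) are not given on the slide and have to come from that driver's documentation.

```python
def load_table(sqlContext, source_format, **options):
    """Load a DataFrame from any storage back end that has a Spark driver.

    source_format selects the driver (for example "json" for files);
    options are passed through to that driver unchanged.
    """
    reader = sqlContext.read.format(source_format)
    return reader.options(**options).load()


# In a PySpark session, a JSON file could be read like this
# (the path is a placeholder, not a file from the deck):
#   df = load_table(sqlContext, "json", path="/tmp/people.json")
```

The calling code stays the same whichever driver is behind it, which is the point of the slide.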

Page 6: data science toolkit 101: set up Python, Spark, & Jupyter


Python installation with miniconda

1. https://www.continuum.io/downloads (choose version 2.7)

2. Miniconda2 install into this location: /Users/<username>/miniconda2

3. bash$ conda install pandas jupyter matplotlib

4. bash$ which python
   /Users/<username>/miniconda2/bin/python
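
After step 3, a quick check from the new Python can confirm that the packages are importable. This is a minimal sketch; `missing_packages` is a hypothetical helper, and the import name `notebook` for the Jupyter notebook server is an assumption about what `conda install jupyter` pulls in.

```python
import importlib


def missing_packages(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Not installed in the Python that is running this check.
            missing.append(name)
    return missing
```

Running `missing_packages(["pandas", "matplotlib", "notebook"])` with the miniconda2 interpreter should return an empty list if the install succeeded.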

https://dzone.com/refcardz/apache-spark

Page 7: data science toolkit 101: set up Python, Spark, & Jupyter


Spark installation

•http://spark.apache.org/downloads.html
  • Spark release: 1.6.2
  • package type: Pre-built for Hadoop 2.6

•mkdir dev

•cd dev

•tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz

•ln -s spark-1.6.2-bin-hadoop2.6 spark

•mkdir ~/dev/notebooks
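
The same unpack, symlink, and notebooks-directory steps can be scripted. The sketch below is one possible equivalent using only the standard library; `unpack_spark` is a hypothetical helper, and it assumes the archive has a single top-level directory, as the Spark download does.

```python
import os
import tarfile


def unpack_spark(archive, dev_dir, link_name="spark"):
    """Extract a Spark .tgz into dev_dir and point a stable symlink at it."""
    if not os.path.isdir(dev_dir):
        os.makedirs(dev_dir)

    with tarfile.open(archive, "r:gz") as tar:
        # The first path component is the versioned directory name.
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(dev_dir)

    # Equivalent of: ln -s spark-1.6.2-bin-hadoop2.6 spark
    link = os.path.join(dev_dir, link_name)
    if os.path.islink(link):
        os.remove(link)
    os.symlink(top_level, link)

    # Equivalent of creating the notebooks directory under dev.
    notebooks = os.path.join(dev_dir, "notebooks")
    if not os.path.isdir(notebooks):
        os.makedirs(notebooks)
    return link
```

The symlink means later configuration can refer to `~/dev/spark` and keep working when a different Spark version is unpacked next to it.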

Page 8: data science toolkit 101: set up Python, Spark, & Jupyter


PySpark configuration

•create directory ~/.ipython/kernels/pyspark1.6/
•create file kernel.json in that directory:

{
  "display_name": "pySpark (Spark 1.6.2) Python 2",
  "language": "python",
  "argv": [
    "/Users/sparktest/miniconda2/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/sparktest/dev/spark",
    "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
    "SPARK_DRIVER_MEMORY": "10G",
    "SPARK_LOCAL_IP": "127.0.0.1"
  }
}

•cd ~/dev/spark/conf
•cp spark-defaults.conf.template spark-defaults.conf
•add to end of spark-defaults.conf:

spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*
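
Because kernel.json hard-codes the paths of the user "sparktest", it may be easier to generate the file for your own account. The sketch below rebuilds the same structure from two paths; `build_kernel_spec` and `write_kernel_spec` are hypothetical helpers, and the py4j file name (0.9) matches Spark 1.6.2 as on the slide and would differ for other releases.

```python
import json
import os


def build_kernel_spec(python_bin, spark_home, cores=10, driver_memory="10G"):
    """Return the kernel.json contents as a dict for the given paths."""
    spark_python = os.path.join(spark_home, "python")
    py4j_zip = os.path.join(spark_python, "lib", "py4j-0.9-src.zip")
    return {
        "display_name": "pySpark (Spark 1.6.2) Python 2",
        "language": "python",
        "argv": [python_bin, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            "PYTHONPATH": spark_python + "/:" + py4j_zip,
            "PYTHONSTARTUP": os.path.join(spark_python, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "--master local[{0}] pyspark-shell".format(cores),
            "SPARK_DRIVER_MEMORY": driver_memory,
            "SPARK_LOCAL_IP": "127.0.0.1",
        },
    }


def write_kernel_spec(spec, kernel_dir):
    """Write spec to kernel_dir/kernel.json, creating the directory if needed."""
    if not os.path.isdir(kernel_dir):
        os.makedirs(kernel_dir)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=2)
    return path
```

For example, `write_kernel_spec(build_kernel_spec(os.path.expanduser("~/miniconda2/bin/python"), os.path.expanduser("~/dev/spark")), os.path.expanduser("~/.ipython/kernels/pyspark1.6"))` would write the file for the current user. The `cores` and `driver_memory` defaults mirror the slide and should be lowered on a smaller machine.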

Page 9: data science toolkit 101: set up Python, Spark, & Jupyter


PySpark test

•bash$ cd ~/dev

•bash$ jupyter notebook

•in the upper right of the Jupyter screen, click New and choose pySpark (Spark 1.6.2) Python 2 (or whatever name you specified in your kernel.json file)

•in the notebook's first cell enter sc.version and click the >| button to run it (or hit CTRL + Enter).
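
Printing `sc.version` only proves that the SparkContext exists. A slightly stronger check, sketched below, also runs a tiny job through it. `smoke_test` is a hypothetical helper; `sc` is the SparkContext created by the shell.py startup script named in kernel.json.

```python
def smoke_test(sc):
    """Return the Spark version and the result of a trivial distributed sum."""
    # Distribute the numbers 0..99 and add them up on the workers.
    total = sc.parallelize(range(100)).sum()
    return sc.version, total


# In a notebook cell started with the pySpark kernel:
#   smoke_test(sc)
```

The second element of the result should be 4950; the first is whatever Spark release the kernel points at.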

Page 10: data science toolkit 101: set up Python, Spark, & Jupyter


Pixiedust installation

•cd ~/dev

•git clone https://github.com/ibm-cds-labs/pixiedust.git

•pip install --user --upgrade --no-deps -e /Users/sparktest/dev/pixiedust

•pip install maven-artifact

•pip install mpld3

Page 12: data science toolkit 101: set up Python, Spark, & Jupyter

IBM Cloud Data Services

Raj Singh
Developer Advocate: Geo | Open Data
http://ibm.biz/rajrsingh

Twitter: @rajrsingh
LinkedIn: rajrsingh

Thanks