data science toolkit 101: set up Python, Spark, & Jupyter
TRANSCRIPT
IBM Cloud Data Services
data science toolkit 101
set up Python, Spark, & Jupyter
Raj Singh, PhD
Developer Advocate: Geo | Open Data
[email protected]
http://ibm.biz/rajrsingh
Twitter: @rajrsingh
@rajrsingh | IBM Cloud Data Services
Agenda
• Installation
  • Python
  • Spark
  • Pixiedust
• Examples
IBM Analytics
Data Science Experience (DSX)
What is Spark?
• In-memory Hadoop
  • Hadoop was massively scalable but slow
  • "Up to 100x faster" (10x faster if memory is exhausted)
• What is Hadoop?
  • HDFS: fault-tolerant storage using horizontally scalable commodity hardware
  • MapReduce: a programming style for distributed processing
• Presents data as an object independent of the underlying storage
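To make the MapReduce idea concrete, here is a minimal word-count sketch in plain Python (standard library only; no Hadoop or Spark involved) — the map step mirrors what each node does to its chunk of data, and the reduce step mirrors the merge of partial results:

```python
from functools import reduce
from collections import Counter

# Map phase: each "node" turns its chunk of text into (word, 1) pairs
def map_chunk(chunk):
    return [(word, 1) for word in chunk.split()]

# Reduce phase: merge the per-chunk pair lists into a single count
def reduce_counts(acc, pairs):
    for word, n in pairs:
        acc[word] += n
    return acc

chunks = ["spark makes hadoop fast", "hadoop scales out"]  # stand-ins for HDFS blocks
mapped = [map_chunk(c) for c in chunks]  # on a cluster, this runs in parallel
counts = reduce(reduce_counts, mapped, Counter())
print(counts["hadoop"])  # 2
```

In real MapReduce the map outputs are shuffled across the network so each reducer sees all pairs for its keys; here a single reduce stands in for that step.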
Spark abstracted storage
• Scala
• PySpark = Spark + Python
• Drivers
  • File storage
  • Cloudant
  • dashDB
  • Cassandra
  • …
Python installation with miniconda
1. Download from https://www.continuum.io/downloads (choose version 2.7)
2. Install Miniconda2 into this location: /Users/<username>/miniconda2
3. bash$ conda install pandas jupyter matplotlib
4. bash$ which python
   /Users/<username>/miniconda2/bin/python
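A quick sanity check for the steps above — a sketch that prints which interpreter is running and whether the step-3 packages import cleanly (`jupyter_core` is used here as the importable stand-in for the `jupyter` metapackage):

```python
import importlib
import sys

# On a correct Miniconda install this points into .../miniconda2/bin/
print(sys.executable)

# Check that the packages from step 3 import cleanly
for pkg in ("pandas", "jupyter_core", "matplotlib"):
    try:
        importlib.import_module(pkg)
        print(pkg + ": OK")
    except ImportError:
        print(pkg + ": MISSING - rerun 'conda install pandas jupyter matplotlib'")
```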
https://dzone.com/refcardz/apache-spark
Spark installation
• http://spark.apache.org/downloads.html
  • Spark release: 1.6.2
  • package type: Pre-built for Hadoop 2.6
• mkdir ~/dev
• cd ~/dev
• tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
• ln -s spark-1.6.2-bin-hadoop2.6 spark
• mkdir ~/dev/notebooks
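The steps above can be sketched as a shell script. This minimal sketch uses a temporary directory instead of ~/dev and elides the tarball extraction; the point it demonstrates is the symlink layout, which lets you upgrade Spark later by re-pointing one link instead of editing every path that references it:

```shell
# Reproduce the install layout in a throwaway directory (stand-in for ~/dev)
DEV=$(mktemp -d)

# Stand-in for the directory the tarball would extract to, plus notebooks dir
mkdir -p "$DEV/spark-1.6.2-bin-hadoop2.6" "$DEV/notebooks"

# The version-independent "spark" name points at the versioned directory
ln -s "$DEV/spark-1.6.2-bin-hadoop2.6" "$DEV/spark"

ls -l "$DEV"
readlink "$DEV/spark"
```

With this layout, SPARK_HOME in the kernel configuration can stay `.../dev/spark` across upgrades.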
PySpark configuration
• create directory ~/.ipython/kernels/pyspark1.6/
• create file kernel.json in that directory:

{
  "display_name": "pySpark (Spark 1.6.2) Python 2",
  "language": "python",
  "argv": [
    "/Users/sparktest/miniconda2/bin/python",
    "-m", "ipykernel",
    "-f", "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/sparktest/dev/spark",
    "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
    "SPARK_DRIVER_MEMORY": "10G",
    "SPARK_LOCAL_IP": "127.0.0.1"
  }
}

• cd ~/dev/spark/conf
• cp spark-defaults.conf.template spark-defaults.conf
• add to the end of spark-defaults.conf:

spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*
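Rather than hand-editing the JSON, you could generate kernel.json with a short script. This is a sketch, not part of the original deck: the `/Users/sparktest` home directory is the example value from the slide (substitute your own), and it writes to a temporary directory for demonstration — on a real install point `kernel_dir` at ~/.ipython/kernels/pyspark1.6/:

```python
import json
import os
import tempfile

# Example paths from the slide -- replace with your own locations
home = "/Users/sparktest"
spark_home = os.path.join(home, "dev", "spark")

kernel = {
    "display_name": "pySpark (Spark 1.6.2) Python 2",
    "language": "python",
    "argv": [
        os.path.join(home, "miniconda2", "bin", "python"),
        "-m", "ipykernel",
        "-f", "{connection_file}",
    ],
    "env": {
        "SPARK_HOME": spark_home,
        "PYTHONPATH": "%s/python/:%s/python/lib/py4j-0.9-src.zip"
                      % (spark_home, spark_home),
        "PYTHONSTARTUP": os.path.join(spark_home, "python", "pyspark", "shell.py"),
        "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
        "SPARK_DRIVER_MEMORY": "10G",
        "SPARK_LOCAL_IP": "127.0.0.1",
    },
}

# Demo target; on a real install use ~/.ipython/kernels/pyspark1.6/
kernel_dir = os.path.join(tempfile.mkdtemp(), "pyspark1.6")
os.makedirs(kernel_dir)
path = os.path.join(kernel_dir, "kernel.json")
with open(path, "w") as f:
    json.dump(kernel, f, indent=2)
print("wrote", path)
```

Generating the file keeps the three spark paths derived from one `spark_home` variable, so a Spark upgrade means changing a single line.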
PySpark test
• bash$ cd ~/dev
• bash$ jupyter notebook
• In the upper right of the Jupyter screen, click New and choose pySpark (Spark 1.6.2) Python 2 (or whatever name you specified in your kernel.json file)
• In the notebook's first cell, enter sc.version and click the >| button to run it (or hit CTRL + Enter)
Pixiedust installation
• cd ~/dev
• git clone https://github.com/ibm-cds-labs/pixiedust.git
• pip install --user --upgrade --no-deps -e /Users/sparktest/dev/pixiedust
• pip install maven-artifact
• pip install mpld3
Examples
• Pixiedust
  • https://github.com/ibm-cds-labs/pixiedust
• Demographic analyses
  • http://ibm-cds-labs.github.io/open-data/samples/
  • or https://github.com/ibm-cds-labs/open-data/tree/master/samples
Raj Singh
Developer Advocate: Geo | Open Data
[email protected]
http://ibm.biz/rajrsingh
Twitter: @rajrsingh
LinkedIn: rajrsingh
Thanks