data science toolkit 101: set up Python, Spark, & Jupyter
TRANSCRIPT
IBM Cloud Data Services
data science toolkit 101
set up Python, Spark, & Jupyter
Raj Singh, PhD
Developer Advocate: Geo | Open Data
[email protected]
http://ibm.biz/rajrsingh
Twitter: @rajrsingh
@rajrsingh | IBM Cloud Data Services
Agenda
• Installation
  • Python
  • Spark
  • Pixiedust
• Examples
IBM Analytics
Data Science Experience (DSX)
What is Spark?
• In-memory Hadoop
  • Hadoop was massively scalable but slow
  • "Up to 100x faster" (10x faster if memory is exhausted)
• What is Hadoop?
  • HDFS: fault-tolerant storage using horizontally scalable commodity hardware
  • MapReduce: a programming style for distributed processing
• Presents data as an object independent of the underlying storage
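To make the MapReduce idea concrete, here is a minimal word-count sketch in plain Python (standard library only; no Hadoop or Spark involved) — the map step mirrors what each node does to its chunk of data, and the reduce step mirrors the merge of partial results:

```python
from functools import reduce
from collections import Counter

# Map phase: each "node" turns its chunk of text into (word, 1) pairs
def map_chunk(chunk):
    return [(word, 1) for word in chunk.split()]

# Reduce phase: merge the per-chunk pair lists into a single count
def reduce_counts(acc, pairs):
    for word, n in pairs:
        acc[word] += n
    return acc

chunks = ["spark makes hadoop fast", "hadoop scales out"]  # stand-ins for HDFS blocks
mapped = [map_chunk(c) for c in chunks]  # on a cluster, this runs in parallel
counts = reduce(reduce_counts, mapped, Counter())
print(counts["hadoop"])  # 2
```

In real MapReduce the map outputs are shuffled across the network so each reducer sees all pairs for its keys; here a single reduce stands in for that step.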
Spark abstracted storage
• Scala
• PySpark = Spark + Python
• Drivers
  • File storage
  • Cloudant
  • dashDB
  • Cassandra
  • …
Python installation with miniconda
1. Download from https://www.continuum.io/downloads (choose version 2.7)
2. Install Miniconda2 into this location: /Users/<username>/miniconda2
3. bash$ conda install pandas jupyter matplotlib
4. bash$ which python
   /Users/<username>/miniconda2/bin/python
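A quick sanity check for the steps above — a sketch that prints which interpreter is running and whether the step-3 packages import cleanly (`jupyter_core` is used here as the importable stand-in for the `jupyter` metapackage):

```python
import importlib
import sys

# On a correct Miniconda install this points into .../miniconda2/bin/
print(sys.executable)

# Check that the packages from step 3 import cleanly
for pkg in ("pandas", "jupyter_core", "matplotlib"):
    try:
        importlib.import_module(pkg)
        print(pkg + ": OK")
    except ImportError:
        print(pkg + ": MISSING - rerun 'conda install pandas jupyter matplotlib'")
```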
https://dzone.com/refcardz/apache-spark
Spark installation
• http://spark.apache.org/downloads.html
  • Spark release: 1.6.2
  • package type: Pre-built for Hadoop 2.6
• mkdir ~/dev
• cd ~/dev
• tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
• ln -s spark-1.6.2-bin-hadoop2.6 spark
• mkdir ~/dev/notebooks
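The steps above can be sketched as a shell script. This minimal sketch uses a temporary directory instead of ~/dev and elides the tarball extraction; the point it demonstrates is the symlink layout, which lets you upgrade Spark later by re-pointing one link instead of editing every path that references it:

```shell
# Reproduce the install layout in a throwaway directory (stand-in for ~/dev)
DEV=$(mktemp -d)

# Stand-in for the directory the tarball would extract to, plus notebooks dir
mkdir -p "$DEV/spark-1.6.2-bin-hadoop2.6" "$DEV/notebooks"

# The version-independent "spark" name points at the versioned directory
ln -s "$DEV/spark-1.6.2-bin-hadoop2.6" "$DEV/spark"

ls -l "$DEV"
readlink "$DEV/spark"
```

With this layout, SPARK_HOME in the kernel configuration can stay `.../dev/spark` across upgrades.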
PySpark configuration
• create directory ~/.ipython/kernels/pyspark1.6/
• create file kernel.json in that directory:

{
  "display_name": "pySpark (Spark 1.6.2) Python 2",
  "language": "python",
  "argv": [
    "/Users/sparktest/miniconda2/bin/python",
    "-m", "ipykernel",
    "-f", "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/sparktest/dev/spark",
    "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
    "SPARK_DRIVER_MEMORY": "10G",
    "SPARK_LOCAL_IP": "127.0.0.1"
  }
}

• cd ~/dev/spark/conf
• cp spark-defaults.conf.template spark-defaults.conf
• add to the end of spark-defaults.conf:

spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*
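Rather than hand-editing the JSON, you could generate kernel.json with a short script. This is a sketch, not part of the original deck: the `/Users/sparktest` home directory is the example value from the slide (substitute your own), and it writes to a temporary directory for demonstration — on a real install point `kernel_dir` at ~/.ipython/kernels/pyspark1.6/:

```python
import json
import os
import tempfile

# Example paths from the slide -- replace with your own locations
home = "/Users/sparktest"
spark_home = os.path.join(home, "dev", "spark")

kernel = {
    "display_name": "pySpark (Spark 1.6.2) Python 2",
    "language": "python",
    "argv": [
        os.path.join(home, "miniconda2", "bin", "python"),
        "-m", "ipykernel",
        "-f", "{connection_file}",
    ],
    "env": {
        "SPARK_HOME": spark_home,
        "PYTHONPATH": "%s/python/:%s/python/lib/py4j-0.9-src.zip"
                      % (spark_home, spark_home),
        "PYTHONSTARTUP": os.path.join(spark_home, "python", "pyspark", "shell.py"),
        "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
        "SPARK_DRIVER_MEMORY": "10G",
        "SPARK_LOCAL_IP": "127.0.0.1",
    },
}

# Demo target; on a real install use ~/.ipython/kernels/pyspark1.6/
kernel_dir = os.path.join(tempfile.mkdtemp(), "pyspark1.6")
os.makedirs(kernel_dir)
path = os.path.join(kernel_dir, "kernel.json")
with open(path, "w") as f:
    json.dump(kernel, f, indent=2)
print("wrote", path)
```

Generating the file keeps the three spark paths derived from one `spark_home` variable, so a Spark upgrade means changing a single line.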
PySpark test
• bash$ cd ~/dev
• bash$ jupyter notebook
• In the upper right of the Jupyter screen, click New and choose pySpark (Spark 1.6.2) Python 2 (or whatever name you specified in your kernel.json file)
• In the notebook's first cell, enter sc.version and click the >| button to run it (or hit CTRL + Enter)
Pixiedust installation
• cd ~/dev
• git clone https://github.com/ibm-cds-labs/pixiedust.git
• pip install --user --upgrade --no-deps -e /Users/sparktest/dev/pixiedust
• pip install maven-artifact
• pip install mpld3
Examples
• Pixiedust
  • https://github.com/ibm-cds-labs/pixiedust
• Demographic analyses
  • http://ibm-cds-labs.github.io/open-data/samples/
  • or https://github.com/ibm-cds-labs/open-data/tree/master/samples
Raj Singh
Developer Advocate: Geo | Open Data
[email protected]
http://ibm.biz/rajrsingh
Twitter: @rajrsingh
LinkedIn: rajrsingh
Thanks