pydata2014
DESCRIPTION
Slides from PyData SV 2014
TRANSCRIPT
![Page 1: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/1.jpg)
Ferry - Share & Deploy Big Data Applications with Docker
James Horey
![Page 2: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/2.jpg)
• Writing a simple application with Bokeh
• Packaging our application with Docker
• Orchestrating our application with Ferry
Technical material can be found at: https://github.com/jhorey/pydata
![Page 3: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/3.jpg)
Bokeh
![Page 4: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/4.jpg)
U.S. Census
http://api.census.gov/data/2011/acs5?get=DP03_0062E&for=county:*&in=state:06
Median income (DP03_0062E), all counties (county:*), California (state:06)
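The query above can be assembled and its response unpacked with the standard library alone. The sketch below is a minimal illustration, not the code from the talk: the helper names `build_acs_url` and `rows_to_records` are made up here, it assumes the endpoint answers with a JSON array of arrays whose first row is the header, and the sample payload is placeholder data rather than real Census figures.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://api.census.gov/data/2011/acs5"  # endpoint from the slide


def build_acs_url(variable, state_fips, base_url=BASE_URL):
    """Build a query for one ACS variable across all counties of a state."""
    query = {
        "get": variable,
        "for": "county:*",
        "in": "state:" + state_fips,
    }
    # Keep ':' and '*' literal so the URL matches the form shown on the slide.
    return base_url + "?" + urlencode(query, safe=":*")


def rows_to_records(payload):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


if __name__ == "__main__":
    print(build_acs_url("DP03_0062E", "06"))
    # Placeholder payload: the values are illustrative, not real Census data.
    sample = json.loads('[["DP03_0062E", "state", "county"], ["0", "06", "000"]]')
    print(rows_to_records(sample))
```

Keeping the URL builder separate from the parser means the parsing step can be exercised on a saved response without touching the network.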
![Page 5: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/5.jpg)
Download some data
![Page 6: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/6.jpg)
Let’s install Bokeh
$ pip install bokeh
>> Downloading/unpacking bokeh
>> SystemError: Cannot compile 'Python.h'. Perhaps you need to install python-dev|python-devel.
$ apt-get install python-dev && pip install bokeh
>> gcc: error trying to exec 'cc1plus': execvp: No such file or directory
$ apt-get install g++
$ pip install bokeh
RuntimeError: bokeh sample data directory does not exist, please execute bokeh.sampledata.download()
$ python
>>> import bokeh.sampledata
>>> bokeh.sampledata.download()
![Page 7: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/7.jpg)
A simple application
$ python plot.py Kentucky
Louisville
![Page 8: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/8.jpg)
Let’s share
#!/bin/bash

# Make sure we have 'pip' installed
apt-get install python-pip

# Install packages in right order
apt-get --yes install g++ python-dev
pip install bokeh

# Now download the data
python geography.py data/
python population economic Kentucky data/

# Start the web server
python webserver data/
• Your script didn’t work
• Oh, I was supposed to run this as sudo?
• Ok, it still didn’t work
• I get this funny error
• Oh yeah, I’m running Redhat
• Ok I’m at my desk, just use my computer
![Page 9: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/9.jpg)
• Encapsulates applications in isolated containers
• Makes it easy and safe to distribute applications
• Easy to get started
![Page 10: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/10.jpg)
Our Dockerfile
Start from a clean Precise image
Install stuff
Add our files
Run this when starting
$ docker build -t ferry/pydata .
$ docker push ferry/pydata
![Page 11: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/11.jpg)
Sharing made simple
$ docker pull ferry/pydata
$ docker run -p 8000:8000 --name p1 -d ferry/pydata
p1
Kernel
Hardware
![Page 12: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/12.jpg)
Sharing made simple
$ docker pull ferry/pydata
$ docker run -p 8000:8000 --name p1 -d ferry/pydata
$ docker run -p 8001:8000 --name p2 -d ferry/pydata
$ docker run -p 8002:8000 --name p3 -d ferry/pydata
p1 p2 p3
Kernel
Hardware
• Containers share basic kernel and hardware capabilities
• No virtualization
• Containers are isolated
• Access via port forwarding
You can run these commands now!
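The port-forwarding pattern on this slide, where every container listens on 8000 internally and each gets its own host port, can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the function name `docker_run_commands` and its parameters are hypothetical, it only builds the command strings and does not invoke Docker, and it uses the double-dash `--name` spelling that current Docker releases accept.

```python
def docker_run_commands(image, count, first_host_port=8000, container_port=8000):
    """Return one `docker run` command per container, each on its own host port."""
    commands = []
    for i in range(count):
        host_port = first_host_port + i  # 8000, 8001, 8002, ...
        name = "p{}".format(i + 1)       # p1, p2, p3, ... as on the slide
        commands.append(
            "docker run -p {}:{} --name {} -d {}".format(
                host_port, container_port, name, image
            )
        )
    return commands


if __name__ == "__main__":
    for command in docker_run_commands("ferry/pydata", 3):
        print(command)
```

Because the containers are isolated, only the host side of each `-p host:container` pair has to be unique.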
![Page 13: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/13.jpg)
• Highly scalable and fault-tolerant
• Great for storing streaming data (sensors, messages)
CREATE KEYSPACE census
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE census;

CREATE TABLE acs_economic_data (
  state_cd TEXT,
  state_name TEXT,
  county_cd TEXT,
  county_name TEXT,
  median INT,
  mean INT,
  capita INT,
  PRIMARY KEY (county_cd, state_cd)
);
![Page 14: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/14.jpg)
Orchestration
Web + DB (one container)
• Simple
• Full control
• More work for you
Web, DB (separate containers)
• Simpler Dockerfile
• More extensible
• How to orchestrate?
![Page 15: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/15.jpg)
• Specify the containers that constitute your application in YAML
• Support for Hadoop, Cassandra, GlusterFS, and OpenMPI
• It’s a little bit like pip for your Docker-based runtime environment
Ferry
http://ferry.opencore.io
![Page 16: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/16.jpg)
Our Application
backend:
   - storage:
        personality: "cassandra"
        instances: 1
connectors:
   - personality: "ferry/pydata-cassandra"
     ports: ["8000:8000"]

# The cassandra-client base comes with the various drivers
# pre-installed.
FROM ferry/cassandra-client
NAME ferry/pydata-cassandra

# Place the start scripts in the events directories so they
# are started when the connector is brought up.
ADD ./scripts/startcas.sh /service/runscripts/start/
ADD ./scripts/restartcas.sh /service/runscripts/restart/
RUN chmod a+x /service/runscripts/start/startcas.sh
RUN chmod a+x /service/runscripts/restart/restartcas.sh
+
![Page 17: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/17.jpg)
Easy to share (again)
$ ferry start cassandra.yml
sa-df8d0aa6
$ ferry ps
UUID         Storage      Compute  Connectors   Status   Base           Time
----         -------      -------  ----------   ------   ----           ----
sa-df8d0aa6  se-54ed4e93           se-a5350a8d  running  cassandra.yml
$ ferry ssh sa-df8d0aa6
root@client-se-a5350a8d:~# ps -eaf | grep python
root 144 1 0 19:49 ? 00:00:00 python /home/ferry/pydata/bokeh/webserver.py /home/ferry/pydata/data
![Page 18: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/18.jpg)
What’s it doing?
$ ferry start cassandra.yml
Web C* C*
root@client-se-a5350a8d:~# env | grep BACK
BACKEND_STORAGE_TYPE=cassandra
BACKEND_STORAGE_IP=10.1.0.12
Generate Config
![Page 19: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/19.jpg)
What’s it doing?
$ ferry start yarn
Client
Y Y
root@client-se-b597cb21:~# env | grep BACK
BACKEND_STORAGE_TYPE=gluster
BACKEND_STORAGE_IP=10.1.0.18
BACKEND_COMPUTE_TYPE=yarn
BACKEND_COMPUTE_IP=10.1.0.15
G G
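Inside the connector, the application can discover its backends from the `BACKEND_*` variables shown in the `env` output. The sketch below is one way to read them, not Ferry's own client code: the function name `backend_from_env` and the shape of the returned dictionary are assumptions made for illustration, and only the four variable names that appear on the slides are consulted.

```python
import os


def backend_from_env(environ=None):
    """Collect BACKEND_<KIND>_TYPE / BACKEND_<KIND>_IP pairs into a dict."""
    environ = os.environ if environ is None else environ
    backends = {}
    for kind in ("STORAGE", "COMPUTE"):
        backend_type = environ.get("BACKEND_{}_TYPE".format(kind))
        backend_ip = environ.get("BACKEND_{}_IP".format(kind))
        # Skip a backend kind unless both its type and address are present.
        if backend_type and backend_ip:
            backends[kind.lower()] = {"type": backend_type, "ip": backend_ip}
    return backends


if __name__ == "__main__":
    print(backend_from_env())
```

A web server started from the connector's start script could, for example, pass `backend_from_env()["storage"]["ip"]` to its database client instead of hard-coding an address.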
![Page 20: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/20.jpg)
What’s it doing?
$ ferry stop sa-c6cbb572
Client
Y Y
G G
![Page 21: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/21.jpg)
Next steps
$ ferry share sa-df8d0aa6
[Diagram: the stack w, c*, c* on Hardware, shown three times]
![Page 22: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/22.jpg)
Next steps
$ ferry deploy sa-df8d0aa6
[Diagram: w, c*, c* on one Hardware box; w and c*, c* on separate Hardware boxes; labels VPC, EC2, S3]
![Page 23: Pydata2014](https://reader031.vdocument.in/reader031/viewer/2022020217/54c65ebc4a795934598b4610/html5/thumbnails/23.jpg)
• Even simple applications can be complicated to install and run
• Docker helps quite a bit with this
• Ferry helps build out big data applications