Hadoop on OpenStack - Sahara @DevNation 2014
DESCRIPTION
Data analysis is hard enough, don't get bogged down managing Hadoop...
TRANSCRIPT
![Page 1: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/1.jpg)
Big data processing with Hadoop on OpenStack
Matthew Farrellee (@spinningmatt)
Red Hat
![Page 2: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/2.jpg)
Here for a talk about Savanna?
Oops, this talk is about Sahara.
Good news is they’re the same thing.
Savanna was renamed for trademark reasons to Sahara.
You have to go to page 10 of Google results to find out why: https://www.google.com/search?q=savanna+hadoop&start=90
![Page 3: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/3.jpg)
In brief - what is Hadoop
● Narrow - Apache Hadoop - a specific Apache project originally from Yahoo!, based on papers published by Google
● Broad - an ecosystem of projects, mostly Apache, that integrate in some way with Apache Hadoop
● Most common to use the broad definition
![Page 4: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/4.jpg)
Hadoop from Hortonworks (+ others)
● Multiple projects
○ Workload management
○ Resource management
○ System management
○ Data ingest & storage
○ Compute frameworks
○ Domain languages
● Data storage and processing focused
![Page 5: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/5.jpg)
In brief - what is OpenStack
OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
![Page 6: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/6.jpg)
An ecosystem of projects
● Compute - Nova
● Networking - Neutron
● Object Storage - Swift
● Block Storage - Cinder
● Identity - Keystone
● Image Service - Glance
● Dashboard - Horizon
● Telemetry - Ceilometer
● Orchestration - Heat
● Data Processing - Sahara
![Page 7: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/7.jpg)
Longer comments on big data
Choose your own adventure…
Go to the next slide and get the day over sooner
See some shoegazing followed by a rant and have the day last longer
![Page 8: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/8.jpg)
Interest (via Google Trends)
[Chart: Google Trends interest for Hadoop, EC2, and OpenStack]
www.google.com/trends/explore#q=hadoop,ec2,openstack
![Page 9: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/9.jpg)
Interest (via Google Trends)
[Chart: Google Trends interest for Hadoop, EC2, and OpenStack]
www.google.com/trends/explore#q=hadoop,ec2,openstack
EC2 beta Aug 25 2006 (http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)
![Page 10: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/10.jpg)
Data analysis is hard
![Page 11: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/11.jpg)
Analysis - have a question
● Even this alone is hard to come up with
● The question you answer won’t be the question you set out to ask
● You’ll have to iterate and refine
Can I predict doctor specialty from what procedures they perform?
![Page 12: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/12.jpg)
Analysis - finding the data
● Publicly -
○ Tons of data repositories
○ No consistency, even within a specific repository
● Privately -
○ Data often hidden in silos
○ Even less consistency
● Avoid datasets that don’t come with a dictionary
○ Data w/o a dictionary is like code w/o comments
![Page 13: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/13.jpg)
Analysis - acceptable use
● Publicly -
○ Data sets often have associated licenses
○ Yes, even public (government) sets
○ You may have to find an alternative set
● Privately -
○ Often tightly controlled, considered sensitive business data
○ If you can use it, maybe only in a specific place
○ Likely no alternatives
![Page 14: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/14.jpg)
Analysis - explore / clean the data
● The story of Stephen Glasser and Cheryl Palma
● Two of the oldest people in the medical profession working with Medicare
● Stephen Glasser graduated in 1773
● Cheryl Palma graduated in 1776
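The graduation years on this slide are the kind of value a quick sanity check catches while exploring a data set. A minimal sketch, assuming records are dictionaries with hypothetical `name` and `graduation_year` fields and an arbitrary cutoff of 80 years since graduation. The field names, the cutoff, and the third record are illustrative assumptions, not the real Medicare schema:

```python
from datetime import date

# Hypothetical records: field names are illustrative, not the real schema.
# The third record is an invented example added for contrast.
records = [
    {"name": "Stephen Glasser", "graduation_year": 1773},
    {"name": "Cheryl Palma", "graduation_year": 1776},
    {"name": "Example Provider", "graduation_year": 1998},
]

MAX_YEARS_SINCE_GRADUATION = 80  # assumed cutoff, tune for the data set


def implausible_graduations(rows, today_year=None,
                            max_years=MAX_YEARS_SINCE_GRADUATION):
    """Return rows whose graduation year is in the future or too far in the past."""
    if today_year is None:
        today_year = date.today().year
    earliest = today_year - max_years
    return [r for r in rows
            if not (earliest <= r["graduation_year"] <= today_year)]


# Use a fixed reference year so the result does not depend on the clock.
suspect = implausible_graduations(records, today_year=2014)
for row in suspect:
    print(f"check {row['name']}: graduated {row['graduation_year']}")
```

Flagged rows still need a human decision: a typo to correct, a placeholder value to drop, or a real outlier to keep.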
![Page 15: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/15.jpg)
Analysis - finally
● You got some answer to a question you approximately asked
● You must refine the question and process
● Repeat
This is hard enough without having to manage tools and infrastructure!
![Page 16: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/16.jpg)
Sahara’s goal
Make managing Hadoop+ infrastructure and tools so simple that doing so never gets in your way
![Page 17: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/17.jpg)
Sahara is
● An OpenStack project in the Data Processing program
● Started one year ago (Summit in Portland)
● Incubated in Icehouse (6 months ago)
● Integrated for Juno (6 months from now)
![Page 18: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/18.jpg)
Sahara’s architecture
[Architecture diagram. Labels: Horizon (Sahara Pages), Sahara Python Client, REST API, Auth (Keystone), Sahara Service, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Data Access Layer, Vendors Plugins, Data Sources, Job Sources, Swift, Hadoop VMs, Heat, Nova, Glance, Cinder, Neutron, Trove DB]
![Page 19: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/19.jpg)
Sahara’s plugin architecture
● This is important!
● It’s where Hadoop distribution vendors integrate their management software
● It’s how users pick different software versions
● Currently: Vanilla (reference impl. w/ Apache versions), HDP (via Ambari), IDH (via Intel Manager) and under review CDH and Spark
![Page 20: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/20.jpg)
Sahara lets you
● Create and manage clusters
● Define and run analysis jobs
● All through a programmatic interface
● Or a web console
![Page 21: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/21.jpg)
Sahara’s REST API
![Page 22: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/22.jpg)
API v1 (Cluster operations)
● http://bit.ly/1hRXrVX
● Plugins
○ list - comes from configuration
○ get - provides capabilities of a plugin, e.g. services
● Images
○ register - provide basic metadata, username - going away w/ Heat
○ tag/untag - associate image w/ a plugin
![Page 23: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/23.jpg)
API v1 (Cluster operations) (cont)
● http://bit.ly/1hRXrVX
● Templates
○ node groups
○ clusters
● Clusters
○ Instances of templates
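One way to picture how node group templates, cluster templates, and clusters relate is as plain data. The sketch below is an illustration only: the class and field names (`NodeGroupTemplate`, `flavor`, `node_groups`, `status`) and the sample values (flavors, version string, process lists) are assumptions, not Sahara's actual API or schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeGroupTemplate:
    """Describes one kind of node: its size and the processes it runs."""
    name: str
    flavor: str                 # hypothetical instance size label
    node_processes: List[str]   # processes to run on nodes of this group


@dataclass
class ClusterTemplate:
    """Combines node groups with an instance count for each."""
    name: str
    plugin: str
    version: str                 # placeholder version string
    node_groups: Dict[str, int]  # node group template name -> instance count


@dataclass
class Cluster:
    """An instance of a cluster template."""
    name: str
    template: ClusterTemplate
    status: str = "requested"    # hypothetical status value

    def instance_count(self) -> int:
        return sum(self.template.node_groups.values())


master = NodeGroupTemplate("master", "m1.medium", ["namenode", "jobtracker"])
worker = NodeGroupTemplate("worker", "m1.large", ["datanode", "tasktracker"])
template = ClusterTemplate("small-hadoop", "vanilla", "x.y.z",
                           {master.name: 1, worker.name: 3})
cluster = Cluster("demo", template)
print(cluster.instance_count())
```

Keeping the template separate from the cluster means the same definition can be launched repeatedly, which is what "instances of templates" implies.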
![Page 24: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/24.jpg)
API v1.1 (Elastic Data Processing)
● http://bit.ly/1kXGjGj
● Data Source
○ Input and output locations (Swift/HDFS urls)
● Job Binaries
○ Often JARs or scripts stored in Swift or ...
● Jobs
○ Templates for a job with missing parameters
● Job executions
○ Instances of templates with parameters provided
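The split between jobs (templates with missing parameters) and job executions (templates with the parameters filled in) can be sketched with plain Python objects. Everything named here, including `Job`, `JobExecution`, `required_params`, and the `swift://` locations, is hypothetical and does not reflect Sahara's actual resource schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Job:
    """A job template: knows its binary and which parameters are still missing."""
    name: str
    job_type: str
    binary: str                       # location of a JAR or script
    required_params: Tuple[str, ...]

    def execute(self, **params: str) -> "JobExecution":
        """Create an execution once every required parameter is provided."""
        missing = [p for p in self.required_params if p not in params]
        if missing:
            raise ValueError(f"missing parameters: {', '.join(missing)}")
        return JobExecution(job=self, params=dict(params))


@dataclass(frozen=True)
class JobExecution:
    """An instance of a job template with its parameters provided."""
    job: Job
    params: Dict[str, str]


job = Job("count-transactions", "Pig",
          "swift://container/script.pig", ("input", "output"))
run = job.execute(input="swift://container/in",
                  output="swift://container/out")
print(run.params["input"])
```

The same `job` can be executed again with different input and output locations, without redefining the binary.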
![Page 25: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/25.jpg)
API v2 (future)
Consistent, stable, and clean evolution of v1 & v1.1
○ Image handling in v1 wasn’t RESTful
○ Reduce use of internally stored binaries
○ Jobs & job executions weren’t RESTful
○ Resource naming wasn’t consistent (clusters v job-executions & cluster-templates v jobs)
○ Prune unused operations, e.g. status-refresh
○ Align resource lifecycle, e.g. terminate = stop&delete vs terminate = stop
![Page 26: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/26.jpg)
Sahara’s Plugin API
![Page 27: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/27.jpg)
Sahara’s Plugin API
● http://bit.ly/1h4MiAW
● get_versions
● get_configs(version)
● get_node_processes(version)
● get_required_image_tags(version)
● validate(cluster)
● configure_cluster(cluster)
● start_cluster(cluster)
● scale_cluster(cluster)
● ...
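The method list on this slide maps naturally onto an abstract base class. The sketch below reuses the method names from the slide, but the class names, argument and return types, and the toy `ToyPlugin` are assumptions for illustration, not Sahara's actual plugin base class. The slide's trailing "..." indicates further methods, which are not guessed at here.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

# Stand-in for whatever cluster object a real plugin receives (assumption).
Cluster = Dict[str, Any]


class PluginSketch(ABC):
    """Method names follow the slide; signatures and return types are assumed."""

    @abstractmethod
    def get_versions(self) -> List[str]: ...

    @abstractmethod
    def get_configs(self, version: str) -> Dict[str, Any]: ...

    @abstractmethod
    def get_node_processes(self, version: str) -> Dict[str, List[str]]: ...

    @abstractmethod
    def get_required_image_tags(self, version: str) -> List[str]: ...

    @abstractmethod
    def validate(self, cluster: Cluster) -> None: ...

    @abstractmethod
    def configure_cluster(self, cluster: Cluster) -> None: ...

    @abstractmethod
    def start_cluster(self, cluster: Cluster) -> None: ...

    @abstractmethod
    def scale_cluster(self, cluster: Cluster) -> None: ...


class ToyPlugin(PluginSketch):
    """A toy plugin that only records state changes on a dict."""

    def get_versions(self) -> List[str]:
        return ["0.1"]

    def get_configs(self, version: str) -> Dict[str, Any]:
        return {"replication": 1}

    def get_node_processes(self, version: str) -> Dict[str, List[str]]:
        return {"storage": ["namenode", "datanode"]}

    def get_required_image_tags(self, version: str) -> List[str]:
        return ["toy", version]

    def validate(self, cluster: Cluster) -> None:
        if cluster["version"] not in self.get_versions():
            raise ValueError(f"unsupported version: {cluster['version']}")

    def configure_cluster(self, cluster: Cluster) -> None:
        cluster["configs"] = self.get_configs(cluster["version"])

    def start_cluster(self, cluster: Cluster) -> None:
        cluster["status"] = "started"

    def scale_cluster(self, cluster: Cluster) -> None:
        cluster["status"] = "scaled"


plugin = ToyPlugin()
cluster = {"name": "demo", "version": "0.1"}
plugin.validate(cluster)
plugin.configure_cluster(cluster)
plugin.start_cluster(cluster)
print(cluster["status"])
```

The point of the shape is that the service can drive any vendor's tooling through the same calls, while each plugin decides what validate, configure, start, and scale mean for its distribution.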
![Page 28: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/28.jpg)
Roadmap
● I mentioned a couple things, but this is a community project
● The Icehouse release is tomorrow
● Design summit, where developers & users & business get together to define the roadmap, is May 13-16 in Atlanta
![Page 29: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/29.jpg)
Demo with bigpetstore
● http://jayunit100.github.io/bigpetstore/slides
● Bigpetstore (by @jayunit100)
○ A full stack Hadoop application
○ Uses the main players in the Hadoop ecosystem
○ To demonstrate a single domain
○ Just accepted into the Bigtop project!
![Page 30: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/30.jpg)
Demo with bigpetstore...live (cont)
We’re going to perform petstore transaction analysis -
1. Generate data from a model
2. Transform data for processing
3. Process w/ Pig or Mahout, we’ll do Pig
4. Visualize results in web app
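The four steps can be mimicked end to end in a few lines of plain Python. This is a toy stand-in, not the bigpetstore code: the demo itself processes data with Pig on Hadoop, and the product names, states, record layout, and function names below are all invented for illustration.

```python
import random
from collections import Counter

# Invented sample values, not taken from bigpetstore.
PRODUCTS = ["dog food", "cat litter", "fish tank"]
STATES = ["NC", "MA", "TX"]


def generate(n, seed=0):
    """Step 1: generate raw transaction lines from a trivial random model."""
    rng = random.Random(seed)
    return [f"{rng.choice(STATES)},{rng.choice(PRODUCTS)},{rng.randint(1, 50)}"
            for _ in range(n)]


def transform(lines):
    """Step 2: parse raw lines into (state, product, price) tuples."""
    rows = []
    for line in lines:
        state, product, price = line.split(",")
        rows.append((state, product, float(price)))
    return rows


def process(rows):
    """Step 3: count transactions per product (a group-and-count aggregation)."""
    return Counter(product for _, product, _ in rows)


def visualize(counts):
    """Step 4: render the counts as a text bar chart."""
    return "\n".join(f"{product:<12}{'#' * n}"
                     for product, n in sorted(counts.items()))


raw = generate(20)
counts = process(transform(raw))
print(visualize(counts))
```

Each stage consumes only the previous stage's output, which is what lets the real pipeline swap Pig for Mahout at step 3 without touching the others.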
![Page 31: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/31.jpg)
Demo video...
https://www.youtube.com/watch?v=vmry_kXqn4c
![Page 32: Hadoop on OpenStack - Sahara @DevNation 2014](https://reader033.vdocument.in/reader033/viewer/2022052820/54c66e5a4a795944538b4613/html5/thumbnails/32.jpg)