© 2014 IBM Corporation
Getting started with Hadoop on the Cloud
Nicolas Morales – Solutions Engineer – [email protected] – @NicolasJMorales
October 11, 2014
Welcome
Goal: Get you started with Hadoop on the Cloud
▪ Hadoop
− What technical problem is it helping solve? → BIG DATA
− What is Hadoop?
− BigInsights (IBM’s Hadoop distro)
▪ Bluemix (IBM’s PaaS cloud solution)
− What technical problem is it helping solve?
− Analytics for Hadoop in the Cloud
▪ Demo & Get hands-on
− Bluemix: bluemix.net
− Hadoop Dev: ibm.biz/hadoopdev
It starts with a line of code.
Source: Wikibon, 2/12/2014
Source: The Forrester Wave: Big Data Hadoop Solutions, Q1 2014, 2/27/2014
What is Big Data?
A way to describe data problems that are unsolvable using traditional tools
More Analytics on More Data for More People
What Data?
Transactional & Application Data
Machine Data
Social Data
Enterprise Content
More Analytics on More Data for More People
In 2005 there were 1.3 billion RFID tags in circulation around the world… by the end of 2011, this was about 30 billion and growing even faster.
An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics…
• 1 BILLION lines of code
• EACH engine generating 10 TB every 30 minutes!
Welcome to the Instrumented Interconnected World!
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014
6,000,000 users on Twitter pushing out 300,000 tweets per day
500,000,000 users on Twitter pushing out 400,000,000 tweets per day
Users: 83x
Tweets per day: 1333x
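The two multipliers on this slide can be checked with a few lines of arithmetic. The sketch below only reuses the four figures from the slide; the per-user ratio at the end is an extra derived number, not something the slide states.

```python
# Figures copied from the slide; the slide does not date the two snapshots.
users_then, tweets_per_day_then = 6_000_000, 300_000
users_now, tweets_per_day_now = 500_000_000, 400_000_000

user_growth = users_now / users_then                      # about 83.3
tweet_growth = tweets_per_day_now / tweets_per_day_then   # about 1333.3

# Derived here, not on the slide: how much more each user tweets on average.
per_user_growth = tweet_growth / user_growth

print(f"users grew {user_growth:.0f}x, daily tweets grew {tweet_growth:.0f}x")
print(f"tweets per user grew {per_user_growth:.0f}x")
```

The point of the comparison is that tweet volume grew much faster than the user base, so data volume is driven by activity per user as well as by head count.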
We’ve Moved into a New Era of Computing
• Volume: 12+ terabytes of Tweets created daily.
• Velocity: 5+ million trade events per second.
• Variety: 100’s of different types of data.
• Veracity: Only 1 in 3 decision makers trust their information.
Imagine the Possibilities of Harnessing Your Data Resources
• Retailer reduces time to run queries by 80% to optimize inventory
• Stock Exchange cuts queries from 26 hours to 2 minutes on 2 PB
• Government cuts acoustic analysis from hours to 70 milliseconds
• Utility avoids power failures by analyzing 10 PB of data in minutes
• Telco analyses streaming network data to reduce hardware costs by 90%
• Hospital analyses streaming vitals to detect illness 24 hours earlier
Big data challenges exist in every organization today
Insurance
• 360˚ View of Domain or Subject
• Catastrophe Modeling
• Fraud & Abuse
• Producer Performance Analytics
• Analytics Sandbox
Banking
• Optimizing Offers and Cross-sell
• Customer Service and Call Center Efficiency
• Fraud Detection & Investigation
• Credit & Counterparty Risk
Every Industry can Leverage Big Data and Analytics
Telco
• Pro-active Call Center
• Network Analytics
• Location Based Services
Energy & Utilities
• Smart Meter Analytics
• Distribution Load Forecasting/Scheduling
• Condition Based Maintenance
• Create & Target Customer Offerings
Media & Entertainment
• Business process transformation
• Audience & Marketing Optimization
• Multi-Channel Enablement
• Digital commerce optimization
Retail
• Actionable Customer Insight
• Merchandise Optimization
• Dynamic Pricing
Travel & Transport
• Customer Analytics & Loyalty Marketing
• Predictive Maintenance Analytics
• Capacity & Pricing Optimization
Consumer Products
• Shelf Availability
• Promotional Spend Optimization
• Merchandising Compliance
• Promotion Exceptions & Alerts
Government
• Civilian Services
• Defense & Intelligence
• Tax & Treasury Services
Healthcare
• Measure & Act on Population Health Outcomes
• Engage Consumers in their Healthcare
Automotive
• Advanced Condition Monitoring
• Data Warehouse Optimization
• Actionable Customer Intelligence
Life Sciences
• Increase visibility into drug safety and effectiveness
Chemical & Petroleum
• Operational Surveillance, Analysis & Optimization
• Data Warehouse Consolidation, Integration & Augmentation
• Big Data Exploration for Interdisciplinary Collaboration
Aerospace & Defense
• Uniform Information Access Platform
• Data Warehouse Optimization
• Airliner Certification Platform
• Advanced Condition Monitoring (ACM)
Electronics
• Customer/ Channel Analytics
• Advanced Condition Monitoring
Enabling everybody to leverage Big Data
GPS
External Data
Business Users... offer personalized price promotions to different customer segments in real time
Business Development... find and deliver new mechanisms to monetize network traffic and partner with upstream content providers
Administrators... secure, manage, and optimize data access and analysis operations
Executive Leaders... get real-time reports and analysis based on data inside as well as outside the enterprise (web, social media, etc.)
Business Analysts... analyze social media buzz for new services/offerings to gauge initial success and any course correction needed
Developers... develop new apps and detailed algorithms in response to user and business requirements
Data Scientists... analyze subscriber usage patterns in real time and combine them with the subscriber profile to deliver promotional or retention offers
Leveraging Big Data Requires Multiple Platform Capabilities
• Manage & store huge volume of any data: Hadoop File System, MapReduce
• Manage streaming data: Stream Computing
• Analyze unstructured data: Text Analytics Engine
• Structure and control data: Data Warehousing
• Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
• Understand and navigate federated big data sources: Federated Discovery and Navigation
What is Hadoop?
▪ Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
▪ Hides underlying system details and complexities from the user
▪ Developed in Java
▪ Core sub-projects:
− MapReduce
− Hadoop Distributed File System, a.k.a. HDFS
▪ Supported by several Hadoop-related projects
− HBase
− Zookeeper
− Avro
− Flume
− etc.
▪ Meant for heterogeneous commodity hardware
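To make the MapReduce idea concrete, here is a minimal single-process sketch of the classic word count. It uses plain Python rather than the Hadoop Java API, so the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative assumptions; on a real cluster the framework runs the map and reduce steps in parallel on many nodes and performs the shuffle itself.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data on Hadoop", "Hadoop on the cloud"])
print(counts["hadoop"], counts["on"], counts["cloud"])  # prints: 2 2 1
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines without the programmer writing any distribution logic.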
Design Principles of Hadoop
▪ New way of storing and processing the data:
− Let the system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Relatively inexpensive hardware
▪ Bring processing to Data!
▪ Hadoop = HDFS + MapReduce infrastructure + …
▪ Optimized to handle
− Massive amounts of data through parallelism
− A variety of data (structured, unstructured, semi-structured)
− Using inexpensive commodity hardware
▪ Reliability provided through replication
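A toy model can show why replication gives reliability on inexpensive hardware. The sketch below is an assumption-laden simplification: it places replicas round-robin, whereas real HDFS placement is rack-aware, and the node names and block count are made up. The replication factor of 3 matches the usual HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def readable_after_failure(placement, failed):
    """A file stays readable if every block still has at least one live replica."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )

nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
placement = place_blocks(num_blocks=8, nodes=nodes)

print(readable_after_failure(placement, {"node1", "node2"}))           # True
print(readable_after_failure(placement, {"node1", "node2", "node3"}))  # False
```

With three replicas per block, any two node failures are survivable; losing three nodes can remove every copy of some block, which is why the system re-replicates after a failure rather than waiting.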
Map-Reduce → Hadoop → BigInsights
Hadoop Open Source Projects
▪ Hadoop is supplemented by an ecosystem of open source projects
What’s a Hadoop Distribution?
▪ What’s a Linux Distribution?
− Linux Kernel
− Open Source Tools around Kernel
− Installer
− Administration UI
▪ Open Source Distribution Formula
− Kernel
− Core Projects around Kernel
− Value Add
• Test Components
• Installer
• Administration UI
• Apps
IBM Enriches Hadoop
▪ Scalable
− New nodes can be added on the fly
▪ Affordable
− Massively parallel computing on commodity servers
▪ Flexible
− Hadoop is schema-less, and can absorb any type of data
▪ Fault Tolerant
− Through MapReduce software framework
▪ Performance & reliability
− Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
▪ Enterprise Hardening of Hadoop
▪ Productivity Accelerators
− Web-based UIs and tools
− End-user visualization
− Analytic Accelerators
− +++
▪ Enterprise Integration
− To extend & enrich your information supply chain
IBM BigInsights – Open Source and IBM Value Adds
IBM value adds:
• Real-time Analytics: InfoSphere Streams
• Enterprise Performance: Adaptive MapReduce & Big SQL
• Storage Integration: GPFS POSIX Distributed Filesystem
• Data Governance and Security: Data Click, LDAP and Secured Cluster
• Search: BigIndex and Data Explorer
• Data Exploration: BigSheets “schema-on-read” tooling
• Predictive Modeling: Big R “scalable data mining” on R
• Text Analytics: Text processing with AQL
• ANSI SQL: Big SQL optimized SQL support
• Application Tooling: Toolkits and accelerators
Open source components: MapReduce, HDFS, HBase, Flume, Pig, Lucene, Jaql, ZooKeeper, Oozie, Hive, Sqoop, HCatalog
100% based on Apache Open Source Hadoop Components
Manage your cluster from the integrated Web Console
▪ Start or stop services
▪ Monitor overall system health
▪ Inspect status of specific services
▪ Add / remove nodes
▪ Manage your Apps and workflows from the console
▪ Drill down into Map/Reduce, Tasks, Attempts
▪ Access status, logs, counters of individual flows / jobs
Manage your HDFS Files
▪ Navigate the distributed file system to see what’s stored
▪ Create/remove/rename directories
▪ Modify permissions
▪ Upload / download files, remove/rename files, edit files
▪ Execute Hadoop file system shell commands
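The console operations above have direct equivalents in the standard `hadoop fs` shell. The sketch below builds a few such commands from Python; the directory and file names are hypothetical, and `run_all` only executes anything when a `hadoop` client is actually on the PATH.

```python
import shutil
import subprocess

def hadoop_fs(*args):
    """Build the argv list for one `hadoop fs` shell command."""
    return ["hadoop", "fs", *args]

# Hypothetical paths and file name, chosen only for illustration.
commands = [
    hadoop_fs("-mkdir", "-p", "/user/biadmin/demo"),          # create a directory
    hadoop_fs("-put", "tweets.json", "/user/biadmin/demo/"),  # upload a local file
    hadoop_fs("-chmod", "750", "/user/biadmin/demo"),         # modify permissions
    hadoop_fs("-ls", "/user/biadmin/demo"),                   # see what is stored
]

def run_all(commands):
    """Run each command in order, but only if a `hadoop` client is installed."""
    if shutil.which("hadoop") is None:
        return False  # no client on this machine; nothing is executed
    for argv in commands:
        subprocess.run(argv, check=True)
    return True

for argv in commands:
    print(" ".join(argv))
```

Calling `run_all(commands)` on a machine with a configured Hadoop client would perform the same steps the web console offers through its file browser.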
Monitoring cluster, components and applications
▪ Cluster: system load average, CPU/Disk/Memory/Network utilization, nodes live status
▪ HDFS: block and file info, NameNode JVM and GC info, throughput bytes written/read
▪ MapReduce: job status, Mapper, Reducer, JobTracker
▪ HBase: region split info, # of queries/stored files/regions, etc.
▪ Hive: metadata store (call frequency and duration)
▪ Oozie statistics
▪ Zookeeper: queries, latency, watcher count, followers, etc.
▪ Flume: source and sink, # of retries and bytes written, etc.
EXTENSIBLE!!
Build your own monitoring dashboards with the KPIs that interest you!
Text Analytics: Getting measurable insights
▪ Most of the world’s data is in unstructured or semi-structured text.
▪ Social media is full of discussions about products and services
▪ Company internal information is locked in blobs, description fields, and sometimes even discarded
▪ How do you get a metrics-based understanding of facts from unstructured text?
• Healthcare Analytics: E-medical records, hospital reports
• Public Sector: Case files, police records, emergency calls…
• Automotive Quality Insight: Tech notes, call logs, online media
• Insurance Fraud: Insurance claims
• Social Media for Marketing: Twitter, Facebook, blogs, forums
Over 80% of stored information is unstructured*
Structural analysis
Mining and visualization
Big R
“End-to-end integration of R into IBM BigInsights”
1. Explore, visualize, transform, and model big data using familiar R syntax and paradigm
2. Scale out R
• Partitioning of large data (“divide”)
• Parallel cluster execution of pushed down R code (“conquer”)
• All of this from within the R environment (Jaql, Map/Reduce are hidden from you)
• Almost any R package can run in this environment
Diagram labels: R Clients, R Packages, Embedded R Execution, Data Sources
Pull data (summaries) to the R client, or push R functions right on the data
BigSheets - Spreadsheet-style Analytic Tool
No programming knowledge needed!
How it works
▪ Model “big data” collected from various sources as collections
▪ Filter and enrich content with built-in functions
▪ Combine data in different collections
▪ Visualize results through spreadsheets, charts
▪ Export data into common formats (if desired)
Overview of Application Development Lifecycle
Package and publish your application using the BigInsights Eclipse Task Launcher
How it works
▪ Sample your Data
▪ Develop your application using BigInsights tools
▪ Test your application
▪ Package and publish your application
▪ Deploy your application on the cluster
Task Wizards for ease of use to Develop Applications
Editors for: Java, Java MapReduce, Hive, Jaql, Pig, Big SQL, BigSheets Reader, BigSheets Macro, AQL module, Jaql Module, etc.
Running Applications in Big Data
How it works
Built-in apps make it easy to run Big Data applications & tasks:
▪ Import and Export Data from a Database or files
▪ Import and Export Web and Social Data
▪ Perform Text Analytics on specified content
▪ Query HBase Content
▪ Query content stored in BigInsights using Big SQL.
▪ Execute Pig or Jaql applications.
EXTENSIBLE!! Build your own applications and make them easy to execute from an appealing Application launcher
Big SQL
▪ IBM’s SQL engine for Hadoop
▪ Comprehensive, standard SQL
– SELECT: joins, unions, aggregates, subqueries . . .
– GRANT/REVOKE, INSERT … INTO
– PL/SQL
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
▪ Optimization and performance
– Java MapReduce layer replaced with high performance IBM MPP engine (C++)
– Continuous running daemons (no start up latency)
– Message passing allows data to flow between nodes without persisting intermediate results
– In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
▪ Various storage formats supported
– Data persisted in DFS, Hive
– No IBM proprietary format required
▪ Integration with RDBMSs via LOAD, query federation
Diagram labels: SQL-based Application, IBM data server client, Big SQL Engine, SQL MPP Run-time, Data Sources (CSV, Seq, Parquet, RC, ORC, Avro, Custom, JSON), BigInsights
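As an illustration of the kind of standard SQL listed above (a join, an aggregate and a subquery in one SELECT), the sketch below runs such a query with Python's built-in `sqlite3` module. SQLite is only a stand-in so the example is runnable anywhere; it is not Big SQL, and the table and column names are made up.

```python
import sqlite3

# SQLite stands in for a SQL engine; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 20.0), (1, 30.0), (2, 50.0), (3, 45.0);
""")

# Join + aggregate + subquery: revenue per region from above-average orders.
query = """
    SELECT c.region, COUNT(*) AS num_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount > (SELECT AVG(amount) FROM orders)
    GROUP BY c.region
    ORDER BY revenue DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('west', 1, 50.0), ('east', 1, 45.0)]
```

The appeal of a SQL layer on Hadoop is that a query like this stays the same whether the tables hold four rows or billions; only the engine underneath changes.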
Big Data Accelerators Make it Easier than Ever to Build Big Data Applications
• Telecommunications Event Data: CDR streaming analytics, Deep Customer Event Analytics (ships with InfoSphere Streams)
• Social Data Analytics: Sentiment Analytics, Intent to purchase (ships with InfoSphere BigInsights & Streams)
• Machine Data Analytics: Operational data including logs for operations efficiency (ships with InfoSphere BigInsights)
Social Data Analytics
Using social media as a rich source of information

Sample tweets (annotated on the slide with labels: Location, Age, Interest, Behavior, Intent to consume, Consumption, Prediction):
• Maybe our politicians should take a playbook out of the rivalry between duke/unc and take it to the courts http://ity.com/wfUsir
• I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Raleigh) w/ 2 others http://4sq.com/gbsaYR
• @silliesylvia good!!! U shouldnt! Think about the important stuff, like ur 43rd birthday ;) btw happy birthday Sylvia ;)
• @silliesylvia I <3 your leather leggings!! Its so katniss!!
• @silliesylvia $10 dollars says matthew & mary get married next season :) #downtownabbey
• @bamagirl can’t wait to watch sherlock with you! Oh, robert downey jr, I still love you but bbc is so amazing
• OMG OMG. just dropped my new ipad3 crappola!!!
• dear redbox please have kings speech for my new tv colin firth movie marathon

360 degree profile
Personal Attributes
• Sylvia Campbell, Female, In a Relationship
• 32 years old, birthday on 7/17
• Lives near Raleigh, NC
• College graduate; Income of 80-120k
Buzz/Sentiment
• Retweets BF’s comments
• Interest in BBC shows: Downton Abbey, Sherlock, Fringe, (P&P?)
• Sherlock Holmes, Robert Downey, Jr.
• Hunger Games, Katniss/J. Lawrence
Interests/Behavior
• Watch movies, tv shows
• Romance plots, “hero types”, strong women
• Uses iPad 3, Redbox, Hulu
• Shopping, interest in sales/deals
• Duke/UNC basketball
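To give a feel for how labels such as Location or Intent to consume could be attached to tweets, here is a deliberately naive keyword-rule sketch. The rules and the pairing of tweets with labels are my own guesses for illustration; the accelerator described on the slide relies on far richer text analytics, and nothing here reflects its actual rules.

```python
import re

# Hand-written keyword rules (illustrative assumptions, not the product's logic).
RULES = {
    "Location": re.compile(r"\bI'm at\b", re.IGNORECASE),
    "Intent to consume": re.compile(r"\b(can.t wait to|please have)\b", re.IGNORECASE),
    "Consumption": re.compile(r"\bjust dropped my new\b", re.IGNORECASE),
}

def tag(tweet):
    """Return every label whose pattern occurs in the tweet text."""
    return [label for label, pattern in RULES.items() if pattern.search(tweet)]

# Tweet texts taken from the slide (URLs and trailing words trimmed).
tweets = [
    "I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Raleigh) w/ 2 others",
    "dear redbox please have kings speech for my new tv",
    "OMG OMG. just dropped my new ipad3 crappola!!!",
]

for tweet in tweets:
    print(tag(tweet), "<-", tweet[:40])
```

Keyword rules like these break quickly on real data, which is the motivation for a dedicated extraction language and prebuilt accelerators.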
Machine Data Analysis is a Business Imperative
▪ Cost of system downtime
− 49 percent of Fortune 500 companies experience more than 80 hours of system downtime annually [1]
• Cost of downtime varies from $90,000/hour in the media sector to $6.48 million/hour for large online brokerages
• 80 hours * $6.48M = approx $518M per year
− System downtime costs North American businesses $26.5 billion a year in lost revenue [2]
▪ When systems go down
− Sales and other processes stop
− Work in progress may be destroyed
− Failure to meet SLAs and contractual obligations can result in damages, fees, adverse publicity and damage to reputation
− Customers are lost to competitors, some permanently
− Productivity suffers and remediation costs additional $$$
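The downtime estimate on this slide is simple multiplication, and working it through gives the range between the cheapest and most expensive sectors quoted. The sketch uses only the slide's three figures.

```python
hours_down = 80                       # hours of downtime per year (from the slide)
cost_per_hour_media = 90_000          # USD per hour, media sector
cost_per_hour_brokerage = 6_480_000   # USD per hour, large online brokerage

low = hours_down * cost_per_hour_media
high = hours_down * cost_per_hour_brokerage

print(f"${low / 1e6:.1f}M to ${high / 1e6:.1f}M per year")  # $7.2M to $518.4M per year
```

So the headline figure of roughly half a billion dollars applies to the brokerage case; at media-sector rates the same 80 hours cost about $7.2M.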
Evolution of Cloud Technologies
• Virtualization: “I want to get more out of my existing hardware”
• Dynamic Hybrid: “I want to strategically use public and private cloud together”
• Cloud Native: “I want to rapidly build new, born on the cloud, engaging applications in a continuous delivery model”
• Business Services (SaaS): “I want to use an app without having to own it”
• Cloud Enabled: “I want to move my existing middleware workloads to the cloud”
PaaS sits at the center of the cloud delivery model
Stack layers shown for each model: Networking, Storage, Servers, Virtualization, O/S, Middleware, Runtime, Data, Applications
• Infrastructure as a Service (IT Admin): vendor manages part of the stack in the cloud, client manages the rest
• Platform as a Service (Developer): vendor manages part of the stack in the cloud, client manages the rest
• Software as a Service (Business Person): vendor manages the stack in the cloud
Toward IaaS: Customization; higher costs; slower time to value
Toward SaaS: Standardization; lower costs; faster time to value
Developers, Developers, Developers!
• Move quickly, see results fast.
• Learn by tinkering and playing.
• Need to learn new skills through playing and experimenting safely.
• Need freedom to experiment without worrying about pricing right away.
Bluemix is an open-standard, cloud-based platform for building, managing, and running applications of all types (web, mobile, big data, new smart devices, and so on).
Go Live in Seconds
The developer can choose any language runtime or bring their own. Zero to production in one command.
DevOps
Development, monitoring, deployment, and logging tools allow the developer to run the entire application.
APIs and Services
A catalog of IBM, third party, and open source API services allow the developer to stitch an application together in minutes.
On-Prem Integration
Build hybrid environments. Connect to on-premise assets plus other public and private clouds.
Flexible Pricing
Sign up in minutes. Pay as you go and subscription models offer choice and flexibility.
Layered Security
IBM secures the platform and infrastructure and provides you with the tools to secure your apps.
What is Bluemix?
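As a sketch of what "zero to production in one command" can look like on a Cloud Foundry based platform, here is a minimal Python web app that binds to the port the platform passes in through the `PORT` environment variable. It assumes a Python runtime or buildpack is available and that the app is deployed with the Cloud Foundry CLI's `cf push`; the greeting text and the fallback port are arbitrary choices.

```python
import os
from wsgiref.simple_server import make_server

def app(environ, start_response):
    """Smallest possible WSGI app: answer every request with plain text."""
    body = b"Hello from the cloud\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

def port_from_env(default=8080):
    """Cloud Foundry tells the app which port to listen on via $PORT."""
    return int(os.environ.get("PORT", default))

def main():
    """Start serving; call this from the app's start command."""
    make_server("0.0.0.0", port_from_env(), app).serve_forever()
```

Calling `main()` as the start command is all the app needs; provisioning the VM, routing and scaling are left to the platform, which is the point the slide is making.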
Create apps quickly with prebuilt services
• Runtimes, services, and tooling up to you
Choice
Industry Leading IBM Capabilities
• Services leveraging the depth of IBM software
• Full range of capabilities
Completeness
• Open source platform and services
• Third party to enable key use cases
• Security Services
• Web and application services
• Cloud Integration Services
• Mobile Services
• Database services
• Big Data services
• Internet of Things Services
• Watson Services
• DevOps Services
A full range of capabilities to suit any great idea.
Embracing Cloud Foundry as an Open Source PaaS
Continuing our history of embracing and extending Open Source
Cloud Foundry is more than code
• Meets Developer’s Needs: Focus on app development, not provisioning VMs, databases, messaging servers, etc. Agile development model. Deploy and scale in seconds.
• Open Cloud Platform: There is an increasing appetite for cloud-based mobile, social and analytics applications from line-of-business executives, which drives the need for a more open cloud development platform.
• Compelling Community: Cloud Foundry has a compelling community and emerging ecosystem as well as a mature set of capabilities and robustness.
IBM extends CF by adding developer tools, runtimes, & services
Capabilities include Java, mobile backend development, application monitoring, as well as capabilities from ecosystem partners and open source, all through an as-a-service model in the cloud.
An Entire Continuum Working Together
Diagram labels: Infrastructure Services; Defined Pattern Services (virtual appliances, each with metadata and an operating system, two with an application server and one with an HTTP server); Composable Services; Business Services; Systems of Record; Analytics
IBM Analytics for Hadoop Service
▪ Powered by
− BigInsights 3.0 & Bluemix
▪ Get started with Hadoop in Minutes
− Tutorial: https://developer.ibm.com/hadoop/docs/tutorials/
▪ Dedicated Single Node Env
• BIAdmin Authority
• Access to the Web console
• Secure HTTPS channel powered by SSL certificates
• Bluemix Single Sign On (SSO)
Register today at bluemix.net
With on-demand services and infrastructure, developers can go from 0 to running code in a matter of minutes.
When coupled with DevOps, teams both large and small can automate the development and delivery of many applications.
By connecting securely to on-prem infrastructure, organizations can extend their existing investments.
1. Rapidly bring products and services to market at lower cost
2. Continuously deliver new functionality to their applications
3. Extend existing investments in IT infrastructure
Want to learn more?
▪ Download Quick Start Edition
▪ Test drive the technologies
– Follow online tutorials
– Enroll in online classes
– Watch video demos, read articles, etc.
▪ Links all available from HadoopDev
– https://developer.ibm.com/hadoop/
BigInsights Quick Start Edition
▪ Download: http://ibm.co/QuickStart
Big Data Developers
http://www.meetup.com/BigDataDevelopers/
http://bigdatadevelopers.meetup.com/
▪ FREE
▪ All types of practitioners
▪ All skill levels
▪ Hands-on Labs
▪ Future Meetups:
− Hadoop
− Text Analytics
− Real-time Analytics
− SQL for Hadoop
− HBase
− Social Media Analytics
− Machine Data Analytics
− Security and Privacy