Hadoop Analytics + Enterprise Class Storage: One-Stop Solution from EMC for High Impact Business...
DESCRIPTION
Learn how to quickly and easily deploy a powerful yet worry-free Hadoop-based analytics engine using Greenplum HD, Isilon Scale-Out NAS, and EMC services. Whether you are ready to take the plunge with Hadoop or want the confidence to grow your Hadoop deployment into full-scale production, see how EMC provides a tested solution for doing so.
TRANSCRIPT
1 © Copyright 2012 EMC Corporation. All rights reserved.
EMC Isilon Big Data Storage and Hadoop Analytics
Jemish Patel
Today’s Agenda
• The Big Data Opportunity
• Big Data Analytics with Hadoop
• Technology Challenges of Hadoop
• EMC’s Hadoop Solutions for the Enterprise
• Q+A
The Big Data Opportunity
“Big Data Is Less About Size, And More About Freedom”
― TechCrunch
“Findings: ‘Big Data’ Is More Extreme Than Volume”
― Gartner
“Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World”
― IDC
“Total data: ‘bigger’ than big data”
― 451 Group
THE ERA OF BIG DATA IS HERE
BIG DATA IS TRANSFORMING BUSINESS
Big Data in Action
• Healthcare
– Leverage historical data to discover better treatments
• Financial Services
– Data-driven banking stress tests & risk analysis
• Utilities
– Machine learning to predict service outages & prevent energy theft
Hadoop & Big Data
The Promise of Big Data Analytics
Leverage data assets to identify key trends and new business opportunities
Analyze new sources of information to gain competitive advantages
Take an agile approach to analytics that can adapt at the speed of business
Scale your storage and analysis platform to handle Big Data’s volume, velocity and variety
The Emergence of Hadoop
• Created 5-6 years ago by former Yahoo! engineer Doug Cutting
• Software platform designed to analyze massive amounts of unstructured data
• Two core components:
– Hadoop Distributed File System (HDFS) (storage)
– MapReduce (compute)
• Now a top-level Apache project backed by a large open source development community
MapReduce
• "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
• "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
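The map and reduce steps described on this slide can be illustrated with a minimal, single-process word count in Python. This is only a sketch of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run each input split on a different worker node.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map over every split, then reduce each group of values."""
    intermediate = []
    for doc in documents:  # on a cluster, each split goes to a worker node
        intermediate.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


if __name__ == "__main__":
    splits = ["big data is here", "big data is transforming business"]
    print(word_count(splits))
```

The grouping step in the middle is what lets the reduce side see every value for a given key at once, regardless of which worker produced it.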
Services for MapReduce
• JobTracker – A master node that manages job submissions, scheduling, and reprocessing in case of failures. A job consists of a mapper, a reducer, and a list of inputs.
• TaskTracker – Each slave node in the cluster runs a TaskTracker process. The JobTracker instructs the TaskTrackers to run and monitor tasks. A task consists of a map or a reduce over a piece of data.
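The division of labor between the two services can be sketched as a toy scheduling loop. Everything here is an assumption made for illustration: `run_job`, `healthy_worker`, and `failing_worker` are hypothetical names, workers are plain Python callables standing in for TaskTrackers, and the retry policy (round-robin, three attempts) is not taken from Hadoop.

```python
def run_job(inputs, task, workers, max_attempts=3):
    """Toy JobTracker: assign one task per input piece and retry failures.

    `workers` is a list of callables standing in for TaskTrackers. Each takes
    (task, piece) and returns a result, or raises RuntimeError on failure.
    """
    results = {}
    for index, piece in enumerate(inputs):
        for attempt in range(max_attempts):
            # Pick the next worker round-robin so a retry lands elsewhere.
            worker = workers[(index + attempt) % len(workers)]
            try:
                results[index] = worker(task, piece)
                break
            except RuntimeError:
                continue  # reschedule this task on another worker
        else:
            raise RuntimeError(f"task {index} failed after {max_attempts} attempts")
    return [results[i] for i in range(len(inputs))]


def healthy_worker(task, piece):
    """A TaskTracker stand-in that simply runs the task."""
    return task(piece)


def failing_worker(task, piece):
    """A TaskTracker stand-in that always fails."""
    raise RuntimeError("simulated TaskTracker failure")


if __name__ == "__main__":
    count_words = lambda text: len(text.split())
    print(run_job(["a b", "c d e"], count_words, [failing_worker, healthy_worker]))
```

The point of the sketch is that the job as a whole survives a failed worker because the master tracks each task individually and reassigns it.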
HDFS – Hadoop Distributed Filesystem
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
• HDFS has a permissions model for files and directories that is much like POSIX.
Services for HDFS
• Namenode – Manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
• Datanode – The workhorses of the filesystem. Datanodes store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks they are storing.
• Secondary Namenode – Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine.
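The secondary namenode's merge of the namespace image with the edit log can be pictured as replaying logged operations onto a copy of the image. The sketch below is an illustration under simplifying assumptions, not HDFS's on-disk format: the image is a plain dictionary, the edit log is a list of tuples, and `apply_edit` and `checkpoint` are hypothetical names.

```python
def apply_edit(namespace, edit):
    """Replay one logged operation onto an in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"blocks": list(edit[2])}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image and return both.

    Returns the new image plus an empty edit log, which is why periodic
    checkpoints keep the log from growing without bound.
    """
    new_image = {path: {"blocks": list(meta["blocks"])} for path, meta in image.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


if __name__ == "__main__":
    image = {"/logs/day1": {"blocks": [1]}}
    log = [
        ("create", "/logs/day2", [2, 3]),
        ("rename", "/logs/day1", "/archive/day1"),
        ("delete", "/logs/day2"),
    ]
    print(checkpoint(image, log))
```

Working on a copy mirrors the idea that the merge happens off to the side, so the original image stays usable until the new one is complete.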
Hadoop Eco-System Components
Ecosystem
• Pig – A high-level data-flow language and execution framework for parallel computation
• Mahout – A scalable machine learning and data mining library
• Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying (SQL)
• HBase – A scalable, distributed database that supports structured data storage for large tables
• R (RHIPE) – Combines Hadoop with the R analytics language
Core
• MapReduce – Compute Layer (Job Scheduling / Execution)
• HDFS – Storage Layer (Hadoop Distributed Filesystem)
Why Hadoop is Important
Pragmatic approach to analytics on a very large scale
– Opens up new ways of gaining insights and identifying opportunities for businesses
Designed to address the rise of unstructured data
– Enterprise data to grow by 650% over next 5 years
– More than 80% of this growth will be unstructured data
Evolution of the Hadoop Market
Adoption stages: Innovators / Early Adopters, Early Majority, Late Majority, Laggards
Hadoop Early Adopters
HADOOP PROFILE (TO DATE)
• Pioneers and academics
• Application Architect
• Visionary
• Open source / community driven
• Build-your-own server, application & storage infrastructure
• Commodity components
• Web 2.0, Universities, Life Sciences
Hadoop Early Majority
HADOOP PROFILE (EMERGING)
• IT Manager & CIO, Data Scientist, Line-of-business
• Commercial distribution, Turnkey solution, End-to-End Data protection
• Fortune 1000, Financial Services, Retail
Technology Challenges of Hadoop
Hadoop Architecture
[Diagram: one Hadoop Name Node and six Hadoop Data Nodes connected over Ethernet]
1. Data is ingested into the Hadoop File System (HDFS)
2. Computation occurs inside Hadoop (MapReduce)
3. Results are exported from HDFS for use
Writing Data into Hadoop
Reading Data from HDFS
Technology Challenges of Hadoop
Hadoop DAS Environment
1. Dedicated Storage Infrastructure – One-off for Hadoop only
2. Single Point of Failure – Namenode
3. Lacking Enterprise Data Protection – No snapshots, replication, or backup
4. Poor Storage Efficiency – 3X mirroring
5. Fixed Scalability – Rigid compute-to-storage ratio
6. Manual Import/Export – No protocol support
EMC Addresses the Hadoop Challenge
Challenges:
1. Dedicated Storage Infrastructure – One-off for Hadoop only
2. Single Point of Failure – Namenode
3. Lacking Enterprise Data Protection – No snapshots, replication, or backup
4. Poor Storage Efficiency – 3X mirroring
5. Fixed Scalability – Rigid compute-to-storage ratio
6. Manual Import/Export – No protocol support
How EMC addresses them:
1. Scale-Out Storage Platform – Multiple applications & workflows
2. No Single Point of Failure – Distributed Namenode
3. End-to-End Data Protection – SnapshotIQ, SyncIQ, NDMP Backup
4. Industry-Leading Storage Efficiency – >80% Storage Utilization
5. Independent Scalability – Add compute & storage separately
6. Multi-Protocol – Industry standard protocols: NFS, CIFS, FTP, HTTP, HDFS
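The storage-efficiency contrast in item 4 comes down to simple arithmetic, sketched below. The two efficiency figures are taken from the slide (3X mirroring, and a claimed utilization above 80%, treated here as exactly 0.80); the function names and the 100 TB data set are hypothetical, and the sketch ignores any other overhead such as temporary or scratch space.

```python
MIRROR_3X_EFFICIENCY = 1 / 3   # three full copies of every block
CLAIMED_EFFICIENCY = 0.80      # lower bound of the ">80%" figure on the slide


def usable_capacity_tb(raw_tb, efficiency):
    """Usable capacity obtained from a given amount of raw disk."""
    return raw_tb * efficiency


def raw_needed_tb(data_tb, efficiency):
    """Raw disk required to hold a given amount of data."""
    return data_tb / efficiency


if __name__ == "__main__":
    data_tb = 100  # hypothetical data set size
    print("3X mirroring needs", round(raw_needed_tb(data_tb, MIRROR_3X_EFFICIENCY)), "TB raw")
    print("80% utilization needs", round(raw_needed_tb(data_tb, CLAIMED_EFFICIENCY)), "TB raw")
```

Under these assumptions, 100 TB of data needs about 300 TB of raw disk with 3X mirroring and about 125 TB at 80% utilization.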
The EMC Isilon Advantage for Hadoop
Writing into Hadoop with Isilon
• Isilon serves as both the namenode and the datanode.
• It provides scalability and protection of the data.
• The Hadoop cluster no longer has a single point of failure and no longer writes multiple 64MB-128MB chunks of data to datanodes.
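For a sense of scale, the block writes that a conventional DAS-based Hadoop cluster performs can be estimated as follows. This is a back-of-the-envelope sketch: the 64 MB and 128 MB block sizes and the replication factor of 3 come from the slides, while the function name `block_writes` and the 1 GB example file are hypothetical.

```python
import math


def block_writes(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, total block copies written to datanodes)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication


if __name__ == "__main__":
    for block_size in (64, 128):
        blocks, copies = block_writes(1024, block_size_mb=block_size)
        print(f"1 GB file, {block_size} MB blocks: {blocks} blocks, {copies} block copies")
```

With these inputs, a 1 GB file becomes 16 blocks and 48 block copies at 64 MB, or 8 blocks and 24 block copies at 128 MB.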
Reading Hadoop Data with Isilon
• Data is read off the cluster back to the compute nodes.
• The datanodes are now just compute nodes and are independent of the data in the Hadoop cluster.
– Benefit: the Hadoop hardware can be upgraded without migrating the data.
Accelerating the Benefits of Hadoop for the Enterprise
Reducing Risk
End-to-End Data Protection
Organizational Knowledge/Experience
Industry’s First and Only Scale-Out Storage Solution with Native Hadoop Integration
EMC’s Enterprise Hadoop Solution
EMC Greenplum HD and EMC Isilon Scale-Out Storage
• Apache Hadoop certified by Greenplum
• Simple platform management and control
• Parallel analytics access with Greenplum Database
[Diagram layers: Compute, Storage]
Greenplum: Not Just About Technology
• Data Science teams will become the driving force for success with big data analytics
• Greenplum is committed to the future of data science
– University data science program collaboration with Stanford and UC Berkeley
– Community investment including the Greenplum Analytic Workbench, Community edition software, and Data Science Summits
• Greenplum built its own Data Science practice
– Leading PhDs with analytic tools expertise
Hadoop in Action
Customer Case Study: Purdue University
Background
• Leading Big Ten university renowned worldwide for its research and academic excellence
Challenge
• Large Hadoop environment for researchers in Statistics Department
• No central storage infrastructure, leading to many different, disparate islands of data without consistent protection or performance
• Small IT staff managing large amounts of data and hundreds of data-intensive users
Solution
• Deployed Isilon with HDFS, which plugged seamlessly into their Hadoop environment
• Created a single, shared storage resource for data computing and analytics
• Delivered a highly reliable and flexible storage infrastructure that protected data from loss or corruption
• Eliminated need to migrate data between storage silos, delivering immediate accessibility and significantly higher performance
“We tested EMC Isilon with Hadoop in our statistics department, which must often analyze huge data sets. EMC Isilon's multi-protocol capabilities provided fast and reliable delivery of data to our statisticians, demonstrating the potential to increase the time spent on actually doing the science, while reducing management costs.”
Alex Younts, Purdue University
Customer Case Study: Global Shipping & Transportation Co.
Background
• Leading global shipping and transportation company
Challenge
• Large amounts of data in different formats from various business units, with a focus on the e-commerce self-service site and its semi-structured (XML) and unstructured log data
• Looking to optimize their current ways of analyzing this data regardless of format
• Wanted to understand which devices were accessing their self-service site, in order to measure usage patterns and enhance the user experience on their e-commerce site
Solution
• Using Isilon with HDFS as the central storage for their Hadoop environment, they eliminated any ETL steps, as data could simply be copied over standard protocols
• Created a single, shared storage resource for data analytics, supporting structured, semi-structured, and unstructured data queries across their entire data set
• Delivered a highly reliable and flexible storage infrastructure that enabled mechanisms such as backup and archive to be part of their analytics workflow
Questions?
Thank You!
Provide Feedback & Win!
125 attendees will receive $100 iTunes gift cards. To enter the raffle, simply complete:
– 5 session surveys
– The conference survey
Download the EMC World Conference App to learn more: emcworld.com/app