Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

1 © Copyright 2012 EMC Corporation. All rights reserved. EMC Isilon Big Data Storage and Hadoop Analytics Jemish Patel

Upload: emc-academic-alliance

Post on 18-Nov-2014

DESCRIPTION

Learn how to deploy a powerful, low-maintenance Hadoop-based analytics engine quickly and easily using Greenplum HD, Isilon Scale-Out NAS and EMC services. Whether you are ready to take the plunge with Hadoop or want the confidence to grow an existing Hadoop deployment to full-scale production, see how EMC provides a tested solution for doing so.

TRANSCRIPT

Page 1: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

1 © Copyright 2012 EMC Corporation. All rights reserved.

EMC Isilon Big Data Storage and Hadoop Analytics

Jemish Patel

Page 2: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Today’s Agenda

• The Big Data Opportunity

• Big Data Analytics with Hadoop

• Technology Challenges of Hadoop

• EMC’s Hadoop Solutions for the Enterprise

• Q+A

Page 3: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

The Big Data Opportunity

Page 4: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

“Big Data Is Less About Size, And More About Freedom”

― TechCrunch

“Findings: ‘Big Data’ Is More Extreme Than Volume”

― Gartner

“Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World”

― IDC

“Total data: ‘bigger’ than big data”

― 451 Group

Page 5: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

THE ERA OF BIG DATA IS HERE

Page 6: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

BIG DATA IS TRANSFORMING BUSINESS

Page 7: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Big Data in Action

• Healthcare

– Leverage historical data to discover better treatments

• Financial Services

– Data-driven banking stress tests & risk analysis

• Utilities

– Machine-learning to predict service outages & prevent energy theft

Page 8: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Hadoop & Big Data

Page 9: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

The Promise of Big Data Analytics

Leverage data assets to identify key trends and new business opportunities

Analyze new sources of information to gain competitive advantages

Take an agile approach to analytics that can adapt at the speed of business

Scale your storage and analysis platform to handle Big Data’s volume, velocity and variety

Page 10: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

The Emergence of Hadoop

• Created 5-6 years ago (as of this 2012 talk) by former Yahoo! engineer Doug Cutting

• Software platform designed to analyze massive amounts of unstructured data

• Two core components:

– Hadoop Distributed File System (HDFS) (storage)

– MapReduce (compute)

• Now a top-level Apache project backed by a large open source development community

Page 11: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

MapReduce

• "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.

• "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
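The Map and Reduce steps can be sketched on a single machine with the classic word-count example. This is a minimal illustration, not Hadoop's API: the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) are hypothetical, and the "worker nodes" are simulated by ordinary function calls over a list of input splits.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_step(split: str) -> List[Tuple[str, int]]:
    """Map: a worker turns one input split into (word, 1) pairs."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the intermediate values by key before reducing."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def word_count(splits: List[str]) -> Dict[str, int]:
    """Run map over every split, group by key, then reduce each group."""
    intermediate = [pair for split in splits for pair in map_step(split)]
    return dict(reduce_step(k, v) for k, v in shuffle(intermediate).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big insight"]))
```

Each split is processed independently in `map_step`, which is what allows a real cluster to distribute the sub-problems across worker nodes.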

Page 12: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

MapReduce

Page 13: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Services for MapReduce

• JobTracker – A master node that manages job submissions, scheduling and reprocessing in case of job failures. Jobs consist of a mapper, a reducer and a list of inputs.

• TaskTracker – Each slave node in the cluster runs a TaskTracker process. The JobTracker instructs the TaskTrackers to run and monitor tasks. A task consists of a map or a reduce over a piece of data.
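The division of labor between the two services can be sketched as a toy simulation. The classes below only borrow the names JobTracker and TaskTracker; they are not Hadoop's implementation. The `fail_first` switch, the round-robin assignment and the `max_attempts` limit are assumptions made for illustration, showing a job (mapper, reducer, list of inputs) being split into tasks and a failed task being reprocessed on another tracker.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class TaskTracker:
    """Runs one task at a time; can simulate a single failure."""
    name: str
    fail_first: bool = False  # hypothetical switch used only for this demo
    _failed: bool = field(default=False, init=False, repr=False)

    def run(self, task: Callable[[str], int], piece: str) -> int:
        if self.fail_first and not self._failed:
            self._failed = True
            raise RuntimeError(f"{self.name} failed")
        return task(piece)


class JobTracker:
    """Schedules map tasks on trackers and retries failed tasks elsewhere."""

    def __init__(self, trackers: List[TaskTracker], max_attempts: int = 3):
        self.trackers = trackers
        self.max_attempts = max_attempts

    def submit(self, mapper: Callable[[str], int],
               reducer: Callable[[List[int]], int],
               inputs: Sequence[str]) -> int:
        results: List[int] = []
        for i, piece in enumerate(inputs):
            for attempt in range(self.max_attempts):
                # Round-robin assignment; a retry moves to the next tracker.
                tracker = self.trackers[(i + attempt) % len(self.trackers)]
                try:
                    results.append(tracker.run(mapper, piece))
                    break
                except RuntimeError:
                    continue
            else:
                raise RuntimeError(f"task {i} failed {self.max_attempts} times")
        return reducer(results)


if __name__ == "__main__":
    cluster = [TaskTracker("tt1", fail_first=True), TaskTracker("tt2")]
    total_chars = JobTracker(cluster).submit(len, sum, ["ab", "cde", "f"])
    print(total_chars)
```

Here the first task fails on `tt1` and is rerun on `tt2`, so the job still completes.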

Page 14: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

HDFS – Hadoop Distributed Filesystem

• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

• HDFS has a permissions model for files and directories that is much like POSIX.
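A POSIX-style permission check of the kind HDFS models can be sketched as follows. This is a simplified illustration, assuming only the classic owner/group/other mode bits; the helper `may_access` and the user and group names are hypothetical, and superuser bypass and ACLs are not modeled.

```python
import stat
from typing import Set

_BITS = {"r": 4, "w": 2, "x": 1}


def may_access(mode: int, owner: str, group: str,
               user: str, user_groups: Set[str], want: str) -> bool:
    """Return True if `user` holds permission `want` ('r', 'w' or 'x').

    Exactly one class applies: owner, else group, else other.
    """
    if user == owner:
        shift = 6
    elif group in user_groups:
        shift = 3
    else:
        shift = 0
    return bool((mode >> shift) & _BITS[want])


if __name__ == "__main__":
    mode = 0o640  # owner: read/write, group: read, other: nothing
    print(stat.filemode(stat.S_IFREG | mode))
    print(may_access(mode, "alice", "stats", "bob", {"stats"}, "r"))
    print(may_access(mode, "alice", "stats", "bob", {"stats"}, "w"))
```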

Page 15: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Services for HDFS

• Namenode – Manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.

• Datanode – The workhorses of the filesystem. Datanodes store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.

• Secondary Namenode – Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine.
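The relationship between the namespace image, the edit log and the secondary namenode's merge can be sketched with a toy in-memory model. The class `ToyNamenode` and the function `checkpoint` are hypothetical names, and real HDFS persists both structures to disk in its own formats; the sketch only shows why replaying the log over the image gives the current namespace and why merging keeps the log short.

```python
from typing import Dict, List, Tuple

EditOp = Tuple[str, str, List[str]]  # (operation, path, block ids)


class ToyNamenode:
    def __init__(self) -> None:
        self.image: Dict[str, List[str]] = {}  # namespace at last checkpoint
        self.edit_log: List[EditOp] = []       # changes since that checkpoint

    def create(self, path: str, blocks: List[str]) -> None:
        self.edit_log.append(("create", path, list(blocks)))

    def delete(self, path: str) -> None:
        self.edit_log.append(("delete", path, []))

    def namespace(self) -> Dict[str, List[str]]:
        """Current view: the image with the edit log replayed on top."""
        view = {path: list(blocks) for path, blocks in self.image.items()}
        for op, path, blocks in self.edit_log:
            if op == "create":
                view[path] = list(blocks)
            else:
                view.pop(path, None)
        return view


def checkpoint(namenode: ToyNamenode) -> None:
    """Merge the edit log into the image, then truncate the log."""
    namenode.image = namenode.namespace()
    namenode.edit_log.clear()


if __name__ == "__main__":
    nn = ToyNamenode()
    nn.create("/logs/a", ["blk_1", "blk_2"])
    nn.create("/logs/b", ["blk_3"])
    nn.delete("/logs/a")
    checkpoint(nn)
    print(nn.image, len(nn.edit_log))
```

The namespace is the same before and after `checkpoint`; only where it is recorded changes.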

Page 16: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Hadoop Eco-System Components

Pig – A high-level data-flow language and execution framework for parallel computation

Mahout – A scalable machine learning and data mining library

Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying (SQL)

HBase – A scalable, distributed database that supports structured data storage for large tables

R (RHIPE) – Combines Hadoop + R analytics language

[Diagram labels. Ecosystem: HBase, Hive, Mahout, Pig, R (RHIPE). Core: MapReduce – Compute Layer (Job Scheduling / Execution); HDFS – Storage Layer (Hadoop Distributed Filesystem)]

Page 17: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Why Hadoop is Important

Pragmatic approach to analytics on a very large scale

– Opens up new ways of gaining insights and identifying opportunities for businesses

Designed to address the rise of unstructured data

– Enterprise data to grow by 650% over the next 5 years

– More than 80% of this growth will be unstructured data
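To make the two growth figures concrete, the arithmetic can be worked through for a hypothetical starting volume. The 100 TB starting point and the function name `project_growth` are assumptions for illustration; the 650% and 80% figures come from the slide, and 80% is treated as a lower bound because the slide says "more than 80%".

```python
from typing import Dict


def project_growth(current_tb: float,
                   growth_pct: float = 650.0,
                   unstructured_share: float = 0.80) -> Dict[str, float]:
    """Project total volume after growth and the unstructured part of that growth."""
    added_tb = current_tb * growth_pct / 100.0
    return {
        "total_tb": current_tb + added_tb,
        "added_tb": added_tb,
        "added_unstructured_tb": added_tb * unstructured_share,
    }


if __name__ == "__main__":
    # Hypothetical enterprise that holds 100 TB today.
    for name, value in project_growth(100.0).items():
        print(f"{name}: {value:.0f} TB")
```

Growing by 650% means ending with 7.5 times the starting volume, and at least 5.2 times the starting volume arrives as unstructured data.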

Page 18: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Evolution of the Hadoop Market

[Chart: adoption curve with the segments Innovators/Early Adopters, Early Majority, Late Majority and Laggards; Hadoop Early Adopters and Hadoop Early Majority are marked on the curve]

Page 19: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Evolution of the Hadoop Market

[Chart: adoption curve with Hadoop Early Adopters and Hadoop Early Majority marked]

HADOOP PROFILE (TO DATE)

– Pioneers and academics; Application Architect; Visionary

– Open source / community driven; Build-your-own server, application & storage infrastructure; Commodity components

– Web 2.0; Universities; Life Sciences

Page 20: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Evolution of the Hadoop Market

[Chart: adoption curve with Hadoop Early Adopters and Hadoop Early Majority marked]

HADOOP PROFILE (TO DATE)

– Pioneers and academics; Application Architect; Visionary

– Open source / community driven; Build-your-own server, application & storage infrastructure; Commodity components

– Web 2.0; Universities; Life Sciences

HADOOP PROFILE (EMERGING)

– IT Manager & CIO; Data Scientist; Line-of-business

– Commercial distribution; Turnkey solution; End-to-End Data protection

– Fortune 1000; Financial Services; Retail

Page 21: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Technology Challenges of Hadoop

Page 22: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Hadoop Architecture

1. Data is ingested into the Hadoop Distributed File System (HDFS)

2. Computation occurs inside Hadoop (MapReduce)

3. Results are exported from HDFS for use

[Diagram: one Hadoop Name Node and six Hadoop Data Nodes connected over Ethernet]

Page 23: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Writing Data into Hadoop

Page 24: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Reading Data from HDFS

Page 25: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Technology Challenges of Hadoop

Hadoop DAS Environment

1 Dedicated Storage Infrastructure – One-off for Hadoop only

2 Single Point of Failure – Namenode

3 Lacking Enterprise Data Protection – No Snapshots, replication, backup

4 Poor Storage Efficiency – 3X mirroring

5 Fixed Scalability – Rigid compute to storage ratio

6 Manual Import/Export – No protocol support

[Diagram: Hadoop DAS environment with a single Name node]

Page 26: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Technology Challenges of Hadoop

[Diagram: the Hadoop DAS environment and its six challenges from the previous slide, now showing the Namenode and data blocks labeled 1x, 2x and 3x]

Page 27: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

EMC Addresses the Hadoop Challenge

1 Dedicated Storage Infrastructure – One-off for Hadoop only

2 Single Point of Failure – Namenode

3 Lacking Enterprise Data Protection – No Snapshots, replication, backup

4 Poor Storage Efficiency – 3X mirroring

5 Fixed Scalability – Rigid compute to storage ratio

6 Manual Import/Export – No protocol support

1 Scale-Out Storage Platform – Multiple applications & workflows

2 No Single Point of Failure – Distributed Namenode

3 End-to-End Data Protection – SnapshotIQ, SyncIQ, NDMP Backup

4 Industry-Leading Storage Efficiency – >80% Storage Utilization

5 Independent Scalability – Add compute & storage separately

6 Multi-Protocol – Industry standard protocols: NFS, CIFS, FTP, HTTP, HDFS
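The storage-efficiency contrast in item 4 can be checked with simple arithmetic. This is a back-of-the-envelope sketch: the 100 TB data set and both function names are assumptions, 3X mirroring is modeled as three full copies of every block, and 80% is used as the lower bound of the ">80% Storage Utilization" figure.

```python
def raw_for_replication(usable_tb: float, replicas: int) -> float:
    """Raw capacity needed when every block is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return usable_tb * replicas


def raw_for_utilization(usable_tb: float, utilization: float) -> float:
    """Raw capacity needed when `utilization` of raw space holds user data."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return usable_tb / utilization


if __name__ == "__main__":
    data_tb = 100.0  # hypothetical data set size
    das_raw = raw_for_replication(data_tb, replicas=3)
    scale_out_raw = raw_for_utilization(data_tb, utilization=0.80)
    print(f"3X mirroring:    {das_raw:.0f} TB raw")
    print(f"80% utilization: {scale_out_raw:.0f} TB raw")
```

Under these assumptions, 3X mirroring corresponds to roughly 33% utilization, so the same data needs more than twice the raw capacity.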

Page 28: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

The EMC Isilon Advantage for Hadoop

1 Scale-Out Storage Platform – Multiple applications & workflows

2 No Single Point of Failure – Distributed Namenode

3 End-to-End Data Protection – SnapshotIQ, SyncIQ, NDMP Backup

4 Industry-Leading Storage Efficiency – >80% Storage Utilization

5 Independent Scalability – Add compute & storage separately

6 Multi-Protocol – Industry standard protocols: NFS, CIFS, FTP, HTTP, HDFS

Page 29: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Writing into Hadoop with Isilon

• Isilon becomes the namenode as well as the datanode.

• Isilon provides scalability and protection of the data.

• The Hadoop cluster no longer has a single point of failure and no longer writes multiple 64 MB-128 MB chunks of data to datanodes.
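For a sense of scale, the number of chunk writes a conventional DAS-based cluster performs for one file can be estimated as below. This is an illustrative calculation only: the 1 GiB file size and the helper names are assumptions, the block sizes are the 64 MB and 128 MB figures from the slide, and three replicas corresponds to the 3X mirroring described earlier in the deck.

```python
MIB = 1024 * 1024


def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed to hold a file (the last block may be partial)."""
    if block_bytes <= 0:
        raise ValueError("block size must be positive")
    return -(-file_bytes // block_bytes)  # ceiling division on integers


def block_writes(file_bytes: int, block_bytes: int, replicas: int = 3) -> int:
    """Total block copies written to datanodes for one file."""
    return block_count(file_bytes, block_bytes) * replicas


if __name__ == "__main__":
    file_bytes = 1024 * MIB  # hypothetical 1 GiB input file
    for block_mib in (64, 128):
        writes = block_writes(file_bytes, block_mib * MIB)
        print(f"{block_mib} MiB blocks: {writes} block copies written")
```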

Page 30: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Reading Hadoop Data with Isilon

Data is read off the cluster back to the compute nodes.

The datanodes are now just compute nodes and are independent of the data in the Hadoop cluster.

– Benefit: the Hadoop hardware can be upgraded without migrating the data

Page 31: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Accelerating the Benefits of Hadoop for the Enterprise

Reducing Risk

End-to-End Data Protection

Organizational Knowledge/Experience

Industry’s First and Only Scale-Out Storage Solution with Native Hadoop Integration

Page 32: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

EMC’s Enterprise Hadoop Solution

EMC Greenplum HD and EMC Isilon Scale-Out Storage

• Apache Hadoop certified by Greenplum

• Simple platform management and control

• Parallel analytics access with Greenplum Database

[Diagram labels: Storage, Compute]

Page 33: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Greenplum: Not Just About Technology

• Data Science teams will become the driving force for success with big data analytics

• Greenplum is committed to the future of data science

– University data science program collaboration with Stanford and UC Berkeley

– Community investment including the Greenplum Analytic Workbench, Community edition software, and Data Science Summits

• Greenplum built its own Data Science practice

– Leading PhDs with analytic tools expertise

Page 34: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Hadoop in Action

Page 35: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Customer Case Study: Purdue University

Background

Leading Big Ten university renowned worldwide for its research and academic excellence.

Page 36: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Customer Case Study: Purdue University

Challenge

• Large Hadoop environment for researchers in Statistics Department

• No central storage infrastructure, leading to many different, disparate islands of data without consistent protection or performance

• Small IT staff managing large amounts of data and hundreds of data-intensive users

Page 37: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Customer Case Study: Purdue University

Solution

• Deployed Isilon with HDFS, which plugged seamlessly into their Hadoop environment

• Created a single, shared storage resource for data computing and analytics

• Delivered a highly reliable and flexible storage infrastructure that protected data from loss or corruption

• Eliminated need to migrate data between storage silos, delivering immediate accessibility and significantly higher performance

Page 38: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Customer Case Study: Purdue University

“We tested EMC Isilon with Hadoop in our statistics department, which must often analyze huge data sets. EMC Isilon's multi-protocol capabilities provided fast and reliable delivery of data to our statisticians, demonstrating the potential to increase the time spent on actually doing the science, while reducing management costs.”

― Alex Younts, Purdue University

Page 39: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Customer Case Study: Global Shipping & Transportation Co.

Background

Leading global shipping and transportation company.

Page 40: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Customer Case Study: Global Shipping & Transportation Co.

Challenge

• Large amounts of data in different formats from various business units. Focused on an e-commerce self-service site with semi-structured (XML) and unstructured log data.

• Looking to optimize their current ways of analyzing this data, regardless of format.

• Wanted to understand which devices were accessing their self-service site in order to measure usage patterns and enhance the user experience on their e-commerce site.

Page 41: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Customer Case Study: Global Shipping & Transportation Co.

Solution

• Using Isilon with HDFS as the central storage for their Hadoop environment, they eliminated any ETL steps, as data could simply be copied over standard protocols.

• Created a single, shared storage resource for analytics, supporting queries across their entire data set whether the data is structured, semi-structured or unstructured.

• Delivered a highly reliable and flexible storage infrastructure that enabled mechanisms such as backup and archive to be part of their analytics workflow.

Page 42: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Questions?

Page 43: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Thank You!

Page 44: Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Provide Feedback & Win!

125 attendees will receive $100 iTunes gift cards. To enter the raffle, simply complete:

– 5 session surveys

– The conference survey

Download the EMC World Conference App to learn more: emcworld.com/app
