The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Big Data in Practice: A Pragmatic Approach to Adoption and Value Creation

Raj Nair, Data Practitioner and Consultant

Upload: kcmallu

Post on 26-Jan-2015


Category:

Data & Analytics



DESCRIPTION

What's the origin of Big Data? What are the real life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organizations?

TRANSCRIPT

Page 1: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Big Data in Practice: A Pragmatic Approach to Adoption and Value Creation

Raj Nair, Data Practitioner and Consultant

Page 2: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Application Services

• Enterprise Resource Planning (ERP)

• eCommerce / eBusiness

• Enterprise App Dev and ECM

• Legacy Support, Systems Integration and Conversion

Info Management

• Business Intelligence and Analytics

• Dashboards, Scorecards, Reporting

• MDM & Data Modeling

• Data Marts, ODS, ETL, Data Mining

IT Infrastructure

• IT Professional Services

• Network Administration & Support

• DB Admin & Maintenance

• Hosting and Application Support

Process & Governance

• SDLC – Agile, TDD, TFD Iterative

• Requirements Analysis, PMP, Change Management and Automated QA

• Training & Knowledge Transition and Technical Documentation


Page 3: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Content NOT FOR DISTRIBUTION: Property of Raj Nair


© Copyright 2011 Object Technology Solutions, Inc. (OTSI)

Object Technology Solutions Inc. (OTSI) is a leading Information Technology (IT) Services and Solutions company founded in 1999.

Serves a clientele of Fortune 500 companies, providing IT solutions in the areas of SDLC, Information Management, Business Intelligence, ERP, eCommerce (B2B, B2C), Mobile, Enterprise Solutions, Middleware and Infrastructure.

Technology Expertise and Experience

SAP - Business Objects, ERP; Microsoft - SharePoint, .Net, SQL Server, Project Server; IBM - WebSphere, Cognos, Rational Suite; HP - Testing tools, PPM

Data - Oracle, DB2, SQL Server, Teradata; OS - Windows, Unix (AIX, Linux, HP-UX) etc.; Open Source, Java

Certified Diversity Supplier in KS, MO and IL

Page 4: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

1 Big Data – The Original Use Case

2 Mainstream Big Data

3 Real World Use Cases and Applications

4 Practical Adoption : Opportunity Identification

5 Big Data 2.0 – What’s on the Horizon ?

6 Conclusion

Page 5: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

An Open Source Engine

The Year was 2002 ….

Doug Cutting, Mike Cafarella

Page 6: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Already Somebody’s Biz Problem

• Problem of Capacity & Scale

http://

Page 7: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

The Perfect Storm

MapReduce Google File System

BigTable

Page 8: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

MapReduce

Google File System

+

=

Page 9: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

1 Big Data – The Original Use Case

2 Mainstream Big Data

3 Real World Use Cases and Applications

4 Practical Adoption : Opportunity Identification

5 Big Data 2.0 – What’s on the Horizon ?

6 Conclusion

Page 10: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Yes, But… We are not Google

Sears: Dynamic Pricing

AT&T: Quantifying customer impact from failed cell towers

Nokia: Holistic view of how users interact with apps across the world

Zions Bancorp: Analyze 130 data sources for fraud

Cerner: Detecting Health Risks

Page 11: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Every Day Big Data

Reaching scale-up limits on your server

Represents tools, technologies, frameworks for storage and processing at scale

Represents Opportunity

Page 14: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Big Data 1.0 – The Hadoop Ecosystem

Software library

Framework for large scale distributed processing

Ability to scale to thousands of computers

Page 15: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Design Principles

Classic Hadoop MapReduce – Batch Processing

- Large Data Sets

- Moving computation is cheaper than moving data

- Hardware Failure, redundancy
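
The batch model named on this slide can be illustrated with a minimal, single-process sketch of the map, shuffle, and reduce phases. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and a real cluster would run the map step on the nodes that already hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    print(word_count(["big data is big", "data moves slowly"]))
```

Because each map call sees only its own line and each reduce call sees only one key, both steps could run in parallel on separate machines, which is what lets the approach scale out.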

Page 16: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

This not “That”

Is | Is Not

A Software Framework (Storage/Compute) | A Database Management System; An appliance

Batch Processing | For real-time or interaction

Write Once, Read Many | Delete and Update or “ACID”

Unassuming of data formats | Imposing any schemas

Open Source | Lock In

Made for commodity servers with local disks | Meant to be run in virtualized environments

Page 17: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

What is this you call data?

Unlearn current notion of “Data”

Native Data Source

Page 18: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

HDFS: Storage and Archival

Data Processing

- MapReduce: Programming Library

- Crunch: Data Pipeline processing

- Pig: M/R Abstraction

Workload Management

- HBase: Real time access (low latency)

- Hive: Data Warehouse (High Latency)

Data Movement

- Sqoop: Data Transfer

- Flume: Data Streaming

Page 19: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Component | Purpose | Use it for

HDFS | Distributed Storage | Raw data storage and archival

Flume | Data Movement | Continuous Streaming into HDFS

Sqoop | Data Movement | Data transfer from RDBMS to HDFS/HBase

HBase | Workload Mgmt | Near real-time read/write access to large data sets

Hive | Workload Mgmt | Analytical queries; data warehouse

MapReduce | Data Processing | Low level custom code for data processing

Crunch | Data Processing | (Java) Coding M/R pipelines, aggregations

Pig | Data Processing | Scripting language; similar to Crunch

Page 20: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

A Powerful Paradigm

Hadoop – Separate Layers

- Storage Layer

- Metadata

- Processing Engine

- Query Engine

- Multiple Query Engines

- Data in Native format

Oracle, SQL Server, DB2

- Storage + Query (per product)

- Tightly integrated Proprietary Stacks, cannot free your data

Page 21: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

1 Big Data – The Original Use Case

2 Mainstream Big Data

3 Real World Use Cases and Applications

4 Practical Adoption : Opportunity Identification

5 Big Data 2.0 – What’s on the Horizon ?

6 Conclusion

Page 22: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Opportunity…

Transform Data Processing

Exploration

Information Enrichment

Data Archival

Page 23: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Data Processing Pipeline

Several sources

Varying Frequencies

Varying Formats

Quality check

Validations, Scrubbing

Transformations/Rules

Prune app data sources

Discard/Archive
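
The pipeline stages listed on this slide (scrubbing, validation, transformation, and discard/archive) can be sketched as a few small functions. This is an illustrative outline only: the record fields (`id`, `amount`, `region`) and the rules applied to them are hypothetical, not taken from the talk.

```python
def scrub(record):
    """Scrubbing: trim stray whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def validate(record):
    """Quality check: require a non-empty id and a numeric amount."""
    if not record.get("id"):
        return False
    try:
        float(record.get("amount", ""))
    except (TypeError, ValueError):
        return False
    return True


def transform(record):
    """Transformations/rules: normalise types and casing."""
    out = dict(record)
    out["amount"] = round(float(out["amount"]), 2)
    out["region"] = (out.get("region") or "unknown").upper()
    return out


def run_pipeline(records):
    """Return (accepted, rejected); rejected rows are kept for archival."""
    accepted, rejected = [], []
    for raw in records:
        cleaned = scrub(raw)
        if validate(cleaned):
            accepted.append(transform(cleaned))
        else:
            rejected.append(raw)
    return accepted, rejected


if __name__ == "__main__":
    # Hypothetical records arriving from different sources and formats.
    sample = [
        {"id": " a1 ", "amount": " 10.50 ", "region": "ks"},
        {"id": "", "amount": "5"},
        {"id": "b2", "amount": "n/a"},
    ]
    good, bad = run_pipeline(sample)
    print(good)
    print(len(bad), "records set aside")
```

Keeping the rejected rows rather than dropping them mirrors the "Discard/Archive" step: the raw input stays available if the rules change later.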

Page 24: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Data Processing Engine

Data Warehouse

Data Storage

Page 25: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

ETL Engine

Data Warehouse

Data Storage

Page 26: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

ELT

Data Warehouse

Data Storage

Page 27: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

From Source to Business Value

Shoe-horning

Relational fit Loading Archiving / Purging

Biz Rules

Validations Scrubbing Mapping Transforms

Staging Distribution

Prep Tuning

Data stores

Minutes/Hours Subset of Data

Hours Reliability

Sourcing

Missed SLAs = Biz Frustration

Page 28: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

From Source to Business Value

Significantly more data sources

Highly scalable, significantly performant data processing

New business value, Faster time to value

Page 29: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Data Exploration

Large reservoir of data

Descriptive Statistics

Central Tendencies

Dispersion

Visualization

Surprise Me!
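
As a small illustration of the descriptive statistics named on this slide, the sketch below computes central tendencies and dispersion with Python's standard `statistics` module. The `describe` helper and the sample values are assumptions for illustration; on a large reservoir of data the same measures would be computed by a distributed job rather than in memory.

```python
import statistics


def describe(values):
    """Summarise a list of numbers: central tendency and dispersion."""
    return {
        "count": len(values),
        # Central tendencies
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Dispersion
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "pstdev": statistics.pstdev(values),  # population standard deviation
    }


if __name__ == "__main__":
    # Hypothetical sample, e.g. response times in seconds.
    for name, value in describe([2, 4, 4, 4, 5, 5, 7, 9]).items():
        print(f"{name}: {value}")
```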

Page 31: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Information Enrichment

Page 33: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Data Archival

Recycle Policy

Page 34: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Data Archival

Storage in Native Format

Redundancy , Replication

Easily accessible, inexpensive

Page 35: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

1 Big Data – The Original Use Case

2 Mainstream Big Data

3 Real World Use Cases and Applications

4 Practical Adoption : Opportunity Identification

5 Big Data 2.0 – What’s on the Horizon ?

6 Conclusion

Page 36: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Practical Adoption

Big Data Technologies don’t solve all problems

Leveraging existing investments

Complexities of existing systems

Page 37: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Proof of Concept

Use your own data – realistic results

Focus on very specific pain points

Know what you are going to measure

Page 38: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Opportunity Identification

Shoe-horning

Relational fit Loading Archiving / Purging

Biz Rules

Validations Scrubbing Mapping

Staging Distribution

Prep Tuning

Data stores

Minutes/Hours Subset of Data

Hours Reliability

Sourcing

Page 39: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Data Processing Engine

Data Warehouse

Data Storage

Page 40: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Data Processing Engine

Data Warehouse

Data Storage

Keep all your raw data

Cheaper Hardware

Low cost per byte $$ High value per byte

Offload from RDBMS

Improve scale, performance

Leverage existing tools

Page 41: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Hardware on a budget

Master: $5000

- 12 cores

- 32 GB RAM

- 2 TB SATA Drives, 7.2K RPM

Workers: $5000 each

- 4 Nodes

- 12 cores

- 16 GB RAM

- 4 TB SATA Drives each, 7.2K RPM

4-Port 10 Gig Switch - $1500

Grand Total < $30,000

Software costs? - 0
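
The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the prices stated above (one master, four workers, one switch, no software cost); the constant and function names are made up for illustration.

```python
MASTER_COST = 5000    # one master node
WORKER_COST = 5000    # per worker node
WORKER_COUNT = 4
SWITCH_COST = 1500    # 4-port 10 Gig switch
SOFTWARE_COST = 0     # open source stack


def cluster_cost(master=MASTER_COST, worker=WORKER_COST,
                 workers=WORKER_COUNT, switch=SWITCH_COST,
                 software=SOFTWARE_COST):
    """Total up-front cost of the cluster in dollars."""
    return master + worker * workers + switch + software


if __name__ == "__main__":
    total = cluster_cost()
    print(f"Total: ${total:,}")
    print("Under the $30,000 budget:", total < 30_000)
```

With the stated prices the total is 5000 + 4 × 5000 + 1500 = 26,500 dollars, which is consistent with the "Grand Total < $30,000" line.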

Page 42: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

NoSQL

Data Processing Engine

Data Warehouse

Data Storage

Keep all your raw data Cheaper Hardware

NoSQL

Low cost per byte $$ High value per byte

Page 43: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Exploratory BI / Analysis Data

Storage

Makes Data exploration practically cheaper and faster

Use existing visualization tools (Tableau or other)

Check for integration with R

Page 44: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Data Architecture

• Single Important factor

• Don’t miss technology trends

But ….

It’s more about the battle plan

Page 45: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

1 Big Data – The Road to Now

2 Mainstream Big Data

3 Real World Use Cases and Applications

4 Practical Adoption : Opportunity Identification

5 Big Data 2.0 – What’s on the Horizon ?

6 Conclusion

Page 46: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

What about that RDBMS?

Too many new data types

Extreme demands for loading & query access

Dynamic / just in time schemas

SQL is great, but why limit to relational?

Still great for transactional workloads
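
One way to read "dynamic / just in time schemas" is schema-on-read: data is stored in its native form, and each consumer decides which fields it wants when it reads. The sketch below illustrates that idea with raw JSON lines; the event data and the `read_with_schema` helper are hypothetical and not part of any Hadoop API.

```python
import json

# Raw events kept in native format; the second record has an extra field
# that no table definition was ever altered to accommodate.
RAW_EVENTS = [
    '{"user": "u1", "action": "click", "ts": 1}',
    '{"user": "u2", "action": "view", "ts": 2, "device": "mobile"}',
]


def read_with_schema(raw_lines, fields):
    """Project each raw JSON line onto the requested fields at read time.

    Fields missing from a record come back as None instead of failing a load.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


if __name__ == "__main__":
    # Two consumers apply two different schemas to the same stored data.
    print(list(read_with_schema(RAW_EVENTS, ["user", "action"])))
    print(list(read_with_schema(RAW_EVENTS, ["user", "device"])))
```

A relational table would require the `device` column to exist before the second record could be loaded; here the decision is deferred until someone asks for it.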

Page 47: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

What’s Next?

Multi-tenant Hadoop

SQL on Hadoop

Security In-memory Real Time

Page 48: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

HDFS 2 Storage and Archival

MapReduce (BATCH)

HBase (online)

Hive (interactive)

YARN: Yet Another Resource Negotiator

In-memory Search

Application Container - scale resource management

Map Reduce becomes “one type of application workload”

Multi-tenant Hadoop

Page 49: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

SQL on Hadoop

Impala

• Cloudera

• MPP Engine

Tez

• HortonWorks

• SQL on Hive

Phoenix

• Apache

• SQL on HBase

Page 50: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

In memory and Real Time

Spark

• Up to 100x faster than M/R (in-memory workloads)

Storm

• Event processing

Apache Drill

• Low latency ad hoc queries

• Interactive queries at scale

Page 51: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Honorable (Proprietary) mentions

RDBMS on Hadoop

Complete Package

MPP, SMP, DataFlow

HortonWorks underneath

Manage, Analyze machine generated data

Page 52: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

1 Big Data – The Road to Now

2 Mainstream Big Data

3 Real World Use Cases and Applications

4 Practical Adoption : Opportunity Identification

5 Big Data 2.0 – What’s on the Horizon ?

6 Conclusion

Page 53: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Where can I get Hadoop?

Distributors

Open Source Apache Project

And these guys…

Cloud

Page 54: The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

Conclusion

The Power & Paradigm of Distributed Computing

“Nativity” of Data – Unlearn old notions

Identify, understand your data processing pipeline

POC with a measurable, specific use case

Data Architecture – key to sustainable scalability

Stay informed
