the practice of big data - the hadoop ecosystem explained with usage scenarios

Big Data in Practice: A Pragmatic Approach to Adoption and Value Creation

Raj Nair, Data Practitioner and Consultant

Application Services

• Enterprise Resource Planning (ERP)

• eCommerce / eBusiness

• Enterprise App Dev and ECM

• Legacy Support, Systems Integration and Conversion

Info Management

• Business Intelligence and Analytics

• Dashboards, Scorecards, Reporting

• MDM & Data Modeling

• Data Marts, ODS, ETL, Data Mining

IT Infrastructure

• IT Professional Services

• Network Administration & Support

• DB Admin & Maintenance

• Hosting and Application Support

Process & Governance

• SDLC – Agile, TDD, TFD, Iterative

• Requirements Analysis, PMP, Change Management and Automated QA

• Training & Knowledge Transition and Technical Documentation


Content NOT FOR DISTRIBUTION: Property of Raj Nair


© Copyright 2011 Object Technology Solutions, Inc. (OTSI)

Object Technology Solutions Inc. (OTSI) is a leading Information Technology (IT) Services and Solutions company founded in 1999.

Serves a clientele of Fortune 500 companies, providing IT solutions in the areas of SDLC, Information Management, Business Intelligence, ERP, eCommerce (B2B, B2C), Mobile, Enterprise Solutions, Middleware and Infrastructure.

Technology Expertise and Experience

SAP - Business Objects, ERP

Microsoft - SharePoint, .Net, SQL Server, Project Server

IBM - WebSphere, Cognos, Rational Suite

HP - Testing tools, PPM

Data - Oracle, DB2, SQL Server, Teradata

OS - Windows, Unix (AIX, Linux, HP-UX) etc.

Open Source, Java

Certified Diversity Supplier in KS, MO and IL

1 Big Data – The Original Use Case

2 Mainstream Big Data

3 Real World Use Cases and Applications

4 Practical Adoption: Opportunity Identification

5 Big Data 2.0 – What’s on the Horizon?

6 Conclusion

An Open Source Engine

The year was 2002…

Doug Cutting, Mike Cafarella

Already Somebody’s Biz Problem

• Problem of Capacity & Scale

http://

The Perfect Storm

Google papers: Google File System, MapReduce, BigTable

Google File System + MapReduce = Hadoop







Yes, But… We are not Google

Sears: Dynamic pricing

AT&T: Quantifying customer impact from failed cell towers

Nokia: Holistic view of how users interact with apps across the world

Zions Bancorp: Analyze 130 data sources for fraud

Cerner: Detecting health risks

Every Day Big Data

Reaching scale-up limits on your server

Represents tools, technologies, frameworks for storage and processing at scale

Represents Opportunity

Big Data 1.0 – The Hadoop Ecosystem

Software library

Framework for large scale distributed processing

Ability to scale to thousands of computers

Classic Hadoop MapReduce – Batch Processing

Design Principles

- Large Data Sets

- Moving computation is cheaper than moving data

- Hardware Failure, redundancy
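To make the batch model and the "moving computation is cheaper than moving data" principle more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It is not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the word-count task are illustrative assumptions. On a real cluster, the map function would be shipped to the nodes holding each block of input, and the framework would handle the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    print(word_count(["big data is big", "data at scale"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is what lets the same logic scale to large data sets.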

This, not “That”

Is | Is Not
A Software Framework (Storage/Compute) | A Database Management System; An appliance
Batch Processing | For real-time or interaction
Write Once, Read Many | Delete and Update or “ACID”
Unassuming of data formats | Imposing any schemas
Open Source | Lock In
Made for commodity servers with local disks | Meant to be run in virtualized environments

What is this you call data?

Unlearn current notion of “Data”

Native Data Source

HDFS: Storage and archival

Data Processing

MapReduce: Programming library

Crunch: Data pipeline processing

Pig: M/R abstraction

Workload Management

HBase: Real-time access (low latency)

Hive: Data warehouse (high latency)

Data Movement

Sqoop: Data transfer

Flume: Data streaming

Component | Purpose | Use it for
HDFS | Distributed Storage | Raw data storage and archival
Flume | Data Movement | Continuous streaming into HDFS
Sqoop | Data Movement | Data transfer from RDBMS to HDFS/HBase
HBase | Workload Mgmt | Near real-time read/write access to large data sets
Hive | Workload Mgmt | Analytical queries; data warehouse
MapReduce | Data Processing | Low-level custom code for data processing
Crunch | Data Processing | (Java) Coding M/R pipelines, aggregations
Pig | Data Processing | Scripting language; similar to Crunch

A Powerful Paradigm

Hadoop – Separate Layers

Storage Layer, Metadata, Processing Engine, Query Engine

Multiple Query Engines

Data in Native format

Oracle, SQL Server, DB2

Storage and Query in one stack

Tightly integrated proprietary stacks, cannot free your data







Opportunity…

Transform Data Processing

Exploration

Information Enrichment

Data Archival

Data Processing Pipeline

Several sources

Varying Frequencies

Varying Formats

Quality check

Validations, Scrubbing

Transformations/Rules

Prune app data sources

Discard/Archive
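The pipeline stages listed above (quality check, validations, scrubbing, transformations/rules, discard/archive) can be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed design: the record fields (id, amount, source), the 1000.0 threshold, and all function names are hypothetical.

```python
def validate(record):
    """Quality check: accept only records with an id and a numeric amount."""
    if not record.get("id"):
        return False
    try:
        float(record.get("amount", ""))
    except (TypeError, ValueError):
        return False
    return True


def scrub(record):
    """Scrubbing: trim whitespace, cast types, normalise the source name."""
    return {
        "id": str(record["id"]).strip(),
        "amount": float(record["amount"]),
        "source": str(record.get("source", "unknown")).strip().lower(),
    }


def transform(record):
    """Transformation/rule: flag large amounts (threshold is an arbitrary example)."""
    enriched = dict(record)
    enriched["large"] = enriched["amount"] >= 1000.0
    return enriched


def run_pipeline(records):
    """Return (clean, rejected); rejected records can be discarded or archived."""
    clean, rejected = [], []
    for record in records:
        if validate(record):
            clean.append(transform(scrub(record)))
        else:
            rejected.append(record)
    return clean, rejected


if __name__ == "__main__":
    raw = [
        {"id": " a1 ", "amount": "1500", "source": " Web "},
        {"id": "", "amount": "10"},
        {"id": "a2", "amount": "abc"},
    ]
    clean, rejected = run_pipeline(raw)
    print(clean)
    print(len(rejected), "rejected")
```

Keeping each stage as a separate function over a single record is what makes this kind of pipeline easy to move into a distributed map step later.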

[Diagram: three variants, each with Data Storage and a Data Warehouse: Data Processing Engine, ETL Engine, ELT]

From Source to Business Value

Shoe-horning

Relational fit, Loading, Archiving / Purging

Biz Rules

Validations, Scrubbing, Mapping, Transforms

Staging, Distribution

Prep, Tuning

Data stores

Minutes/Hours, Subset of Data

Hours, Reliability

Sourcing

Missed SLAs = Biz Frustration

From Source to Business Value

Significantly more data sources

Highly scalable, high-performance data processing

New business value, Faster time to value

Data Exploration

Large reservoir of data

Descriptive Statistics

Central Tendencies

Dispersion

Visualization

Surprise Me!
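As a small illustration of the descriptive statistics mentioned above (central tendencies and dispersion), the following sketch uses Python's standard statistics module on a list of numbers. The describe function and the sample values are hypothetical; on a real data reservoir the same measures would typically be computed by a distributed job or a tool such as R.

```python
import statistics


def describe(values):
    """Summarise a numeric sample: central tendency and dispersion.

    Requires at least two values, because the sample standard
    deviation is undefined for a single observation.
    """
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),      # central tendency
        "median": statistics.median(ordered),  # central tendency
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],     # dispersion
        "stdev": statistics.stdev(ordered),    # dispersion (sample)
    }


if __name__ == "__main__":
    # Hypothetical sample, e.g. order values pulled from raw storage.
    for name, value in describe([2, 4, 4, 4, 5, 5, 7, 9]).items():
        print(name, value)
```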

Data Exploration

Courtesy: Data Science Central http://www.datasciencecentral.com/profiles/blogs/r-hadoop-data-analytics-heaven











Information Enrichment

Data Archival

Recycle Policy

Data Archival

Storage in Native Format

Redundancy, Replication

Easily accessible, inexpensive







Practical Adoption

Big Data Technologies don’t solve all problems

Leveraging existing investments

Complexities of existing systems

Proof of Concept

Use your own data – realistic results

Focus on very specific pain points

Know what you are going to measure

Opportunity Identification

Shoe-horning

Relational fit, Loading, Archiving / Purging

Biz Rules

Validations, Scrubbing, Mapping

Staging, Distribution

Prep, Tuning

Data stores

Minutes/Hours, Subset of Data

Hours, Reliability

Sourcing


Data Warehouse

Data Storage

Keep all your raw data

Cheaper Hardware

Low cost per byte $$ High value per byte

Offload from RDBMS

Improve scale, performance

Leverage existing tools

Hardware on a budget

Master: $5000

- 12 cores

- 32 GB RAM

- 2 TB SATA Drives, 7.2K RPM

Workers: $5000 each

- 4 Nodes

- 12 cores

- 16 GB RAM

- 4 TB SATA Drives each, 7.2K RPM

4-Port 10 Gig Switch: $1500

Grand Total < $30,000

Software costs: $0
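The budget arithmetic above can be checked with a short calculation. The prices and node counts come from the slide; two values are assumptions made only for illustration: that each worker contributes 4 TB of raw disk in total, and that HDFS keeps its default of three replicas per block.

```python
MASTER_COST = 5000        # from the slide
WORKER_COST = 5000        # from the slide, per worker node
WORKER_COUNT = 4          # from the slide
SWITCH_COST = 1500        # from the slide
DISK_TB_PER_WORKER = 4    # assumption: 4 TB of raw disk per worker
REPLICATION_FACTOR = 3    # assumption: HDFS default replication


def cluster_cost():
    """Total hardware cost: one master, the workers, and the switch."""
    return MASTER_COST + WORKER_COUNT * WORKER_COST + SWITCH_COST


def usable_tb():
    """Raw worker disk divided by the number of replicas kept per block."""
    return WORKER_COUNT * DISK_TB_PER_WORKER / REPLICATION_FACTOR


def cost_per_usable_tb():
    """Hardware cost spread over the replicated (usable) capacity."""
    return cluster_cost() / usable_tb()


if __name__ == "__main__":
    print("Hardware total: $", cluster_cost())
    print("Usable capacity (TB):", round(usable_tb(), 2))
    print("Cost per usable TB: $", round(cost_per_usable_tb(), 2))
```

Under these assumptions the hardware total is $26,500, which is consistent with the "< $30,000" figure on the slide.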

NoSQL


Data Warehouse

Data Storage

Keep all your raw data

Cheaper Hardware

Low cost per byte $$ High value per byte

Exploratory BI / Analysis

Data Storage

Makes data exploration practically cheaper and faster

Use existing visualization tools (Tableau or other)

Check for integration with R

Data Architecture

• Single most important factor

• Don’t miss technology trends

But ….

It’s more about the battle plan







What about that RDBMS?

Too many new data types

Extreme demands for loading & query access

Dynamic / just in time schemas

SQL is great, but why limit to relational?

Still great for transactional workloads
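The "dynamic / just in time schemas" point above is often described as schema-on-read: data is stored in its native format and a schema is applied only when a question is asked. A minimal sketch, assuming newline-delimited JSON as the raw format; the sample lines, field names, and the read_with_schema function are all hypothetical.

```python
import json

# Hypothetical raw lines as they might sit in storage, untouched at load time.
RAW_LINES = [
    '{"user": "u1", "event": "click", "ts": 1}',
    '{"user": "u2", "event": "view", "ts": 2, "device": "mobile"}',
    "not json at all",
]


def read_with_schema(lines, fields):
    """Apply a schema at read time by projecting only the requested fields.

    Lines that cannot be parsed as a JSON object are skipped here rather
    than rejected when the data was loaded. Missing fields become None.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {name: record.get(name) for name in fields}


if __name__ == "__main__":
    # Two different "schemas" over the same raw data.
    print(list(read_with_schema(RAW_LINES, ["user", "event"])))
    print(list(read_with_schema(RAW_LINES, ["user", "device"])))
```

The same stored lines answer both queries without any reload, which is the contrast with a relational table whose columns must be fixed before data arrives.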

What’s Next?

Multi-tenant Hadoop

SQL on Hadoop

Security

In-memory

Real Time

Multi-tenant Hadoop

HDFS 2: Storage and Archival

YARN (Yet Another Resource Negotiator)

Application Container - scale resource management

MapReduce becomes “one type of application workload”

Workloads on YARN: MapReduce (batch), HBase (online), Hive (interactive), In-memory, Search

SQL on Hadoop

Impala

• Cloudera

• MPP Engine

Tez

• HortonWorks

• SQL on Hive

Phoenix

• Apache

• SQL on HBase

In memory and Real Time

Spark

• Up to 100x faster than M/R

Storm

• Event processing

Apache Drill

• Low latency ad hoc queries

• Interactive queries at scale

Honorable (Proprietary) mentions

RDBMS on Hadoop

Complete Package

MPP, SMP, DataFlow

HortonWorks underneath

Manage, Analyze machine generated data







Where can I get Hadoop?

Distributors

Open Source Apache Project

And these guys…

Cloud

Conclusion

The Power & Paradigm of Distributed Computing

“Nativity” of Data – Unlearn old notions

Identify, understand your data processing pipeline

POC with a measurable, specific use case

Data Architecture – key to sustainable scalability

Stay informed
