How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics Strata Conference, Sept 22 nd 2011, New York, NY Dr. Amr Awadallah, Founder, CTO, VP of Engineering [email protected], twitter: @awadallah


Page 1: How Apache Hadoop is Revolutionizing Business Intelligence ...assets.en.oreilly.com/1/event/63/How Hadoop is... · How Apache Hadoop is Revolutionizing Business Intelligence and Data

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics

Strata Conference, Sept 22nd 2011, New York, NY

Dr. Amr Awadallah, Founder, CTO, VP of Engineering, [email protected], twitter: @awadallah


Business Intelligence Before Adopting Apache Hadoop

Copyright © 2011, Cloudera, Inc. All Rights Reserved. 2

Storage Only Grid (original raw data)

Instrumentation

Collection

RDBMS (processed data)

BI Reports + Interactive Apps

Mostly Append

ETL Compute Grid

Moving Data To Compute Doesn’t Scale

Can’t Explore Original High Fidelity Raw Data

Archiving = Premature Data Death


Business Intelligence After Adopting Apache Hadoop


Hadoop: Storage + Compute Grid

Instrumentation

Collection

RDBMS

BI Reports + Interactive Apps

Complex Data Processing

Mostly Append

Data Exploration & Advanced Analytics

ETL and Aggregations

Keep Data Alive For Ever


So What is Apache Hadoop?

• A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license)

• Core Hadoop has two main components:
  • Hadoop Distributed File System: self-healing, high-bandwidth clustered storage
  • MapReduce: fault-tolerant distributed processing

• Key business values:
  • Flexible – Store any data, Run any analysis (Mine First, Govern Later)
  • Scalable – Start at 1TB/3-nodes then grow to petabytes/thousands of nodes
  • Affordable – Cost per TB at a fraction of traditional options
  • Open Source – No Lock-In, Rich Ecosystem, Large developer community
  • Broadly adopted – A large and active ecosystem, Proven to run at scale
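To make the MapReduce component concrete, here is a minimal single-process sketch of the programming model (map, group by key, reduce). It is an illustration only, not Hadoop's API: the names `run_mapreduce`, `word_mapper` and `sum_reducer` are hypothetical, and a real cluster distributes these phases across many nodes.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory model of the MapReduce flow (hypothetical helper)."""
    groups = defaultdict(list)
    # Map phase: every input record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: values are grouped by key.
            groups[key].append(value)
    # Reduce phase: each key's values are folded into a single result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    # Total the counts collected for one word.
    return sum(values)


lines = ["store any data", "run any analysis"]
counts = run_mapreduce(lines, word_mapper, sum_reducer)
print(counts)
```

Only the mapper and reducer carry application logic; the grouping step in between is what the framework provides.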



The Main Benefit: Agility/Flexibility


Schema-on-Write (RDBMS):

• Schema must be created before data is loaded

• An explicit load operation has to take place, which transforms the data to the database's internal structure

• New columns must be added explicitly before data for such columns can be loaded into the database

Benefits:

• Read is Fast

• Standards/Governance

Schema-on-Read (Hadoop):

• Data is simply copied to the file store, no special transformation is needed

• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns

• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them

Benefits:

• Load is Fast

• Flexibility/Agility
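The schema-on-read idea can be sketched in a few lines. This is a toy model under stated assumptions, not Hive's SerDe interface: `RAW_STORE`, `load`, `read`, `serde_v1` and `serde_v2` are hypothetical names, and the tab-separated record layout is invented for illustration.

```python
RAW_STORE = []  # stand-in for files copied into the file store unchanged


def load(line):
    # Load is fast: the raw line is stored as-is, with no transformation.
    RAW_STORE.append(line)


def serde_v1(line):
    # First reader: extracts only the two columns known at the time.
    fields = line.split("\t")
    return {"user": fields[0], "url": fields[1]}


def serde_v2(line):
    # Updated reader: also parses an optional third field (assumed: referrer).
    fields = line.split("\t")
    row = {"user": fields[0], "url": fields[1]}
    row["referrer"] = fields[2] if len(fields) > 2 else None
    return row


def read(serde):
    # The schema is applied here, at read time.
    return [serde(line) for line in RAW_STORE]


load("alice\t/home")
load("bob\t/cart\t/home")  # a new field starts flowing before any reader knows it

rows_v1 = read(serde_v1)
rows_v2 = read(serde_v2)
print(rows_v1)
print(rows_v2)
```

Because the raw lines were never transformed, switching to `serde_v2` exposes the extra field for records that were loaded before the reader was updated.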


What is Complex Data Processing?

1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).

2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.

3. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.

4. Hive: A SQL interpreter out of Facebook, also includes a meta-store mapping files to their schemas and associated SerDe.

5. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
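As an illustration of the Streaming option, the sketch below follows the Hadoop Streaming convention of exchanging tab-separated key/value text lines, with the reducer receiving its input sorted by key. The log format (HTTP status as the last field) and the function names `map_lines` and `reduce_lines` are assumptions made for this example; on a cluster the two functions would live in separate scripts reading standard input, and here the sort is simulated locally.

```python
from itertools import groupby


def map_lines(lines):
    # Mapper: emit "status<TAB>1" per log line (status assumed to be the last field).
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[-1]}\t1"


def reduce_lines(sorted_lines):
    # Reducer: input arrives sorted by key, so equal keys are adjacent.
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


log = ["GET /home 200", "GET /cart 404", "GET /home 200"]
# sorted() stands in for the framework's sort between the map and reduce phases.
result = list(reduce_lines(sorted(map_lines(log))))
print(result)
```

The same two functions could be written in any language that can read and write text lines, which is the flexibility the Streaming option trades against raw performance.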


What This Means For You: Agility


Up Front Design Just in Time


What This Means For You: Innovation


Data Committee Data Scientist


What This Means For You: Consolidation


Silos Sharing


What This Means For You: Extract Value from Latent Data


Archive to Tape Keep Data Alive


Benefit #2: Scalability


What This Means For You: Ability to Grow Fluidly


What This Means For You: Data Beats Algorithm


Smarter Algos More Data


Where Does Hadoop Fit in the Enterprise Data Stack?


[Diagram] Components shown: Logs, Files, Web Data; Relational Databases; Enterprise Data Warehouse; Web Application; Enterprise Reporting; BI, Analytics; Low-Latency Serving Systems; Cloudera Mgmt Suite; IDEs; Development Tools; ETL Tools; Business Intelligence Tools.

People shown: Analysts, Business Users, Customers, Data Scientists, System Operators, Data Architects.


Use The Right Tool For The Right Job


Relational Databases:

Use when:

• Interactive OLAP Analytics (<1sec)

• Multistep ACID Transactions

• 100% SQL Compliance

Hadoop:

Use when:

• Structured or Not (Agility)

• Scalability of Storage/Compute

• Complex Data Processing


Two Core Use Cases Common Across Many Industries


Industry | ADVANCED ANALYTICS Use Case | DATA PROCESSING Use Case

Web | Social Network Analysis | Clickstream Sessionization

Media | Content Optimization | Clickstream Sessionization

Telco | Network Analytics | Mediation

Retail | Loyalty & Promotions | Data Factory

Financial | Fraud Analysis | Trade Reconciliation

Federal | Entity Analysis | SIGINT

Bioinformatics | Genome Mapping | Sequencing Analysis

Manufacturing | Product Quality | Mfg Process Tracking
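Clickstream sessionization, listed above for Web and Media, can be sketched as grouping each user's clicks and starting a new session after a period of inactivity. The 30-minute gap, the `(user, timestamp)` record shape and the name `sessionize` are assumptions for this illustration, not details from the talk.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold


def sessionize(clicks, gap=SESSION_GAP_SECONDS):
    """Split each user's click timestamps (in seconds) into sessions."""
    by_user = defaultdict(list)
    for user, timestamp in clicks:
        by_user[user].append(timestamp)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for timestamp in stamps[1:]:
            if timestamp - user_sessions[-1][-1] > gap:
                # Too long since the previous click: start a new session.
                user_sessions.append([timestamp])
            else:
                user_sessions[-1].append(timestamp)
        sessions[user] = user_sessions
    return sessions


clicks = [("alice", 0), ("alice", 600), ("bob", 100), ("alice", 4000)]
sessions = sessionize(clicks)
print(sessions)
```

Grouping by user corresponds to a natural shuffle key, which is one reason this workload maps well onto MapReduce-style processing.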


CDH: Cloudera’s Distribution Including Apache Hadoop


• Open Source – 100% Apache licensed, 100% Open Source, 100% Free.
• Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA
• Integrated – All required component versions & dependencies are managed for you
• Industry Standard – Existing RDBMS, ETL and BI systems work best with it
• Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32/64bit, etc.

Coordination: ZOOKEEPER

Data Integration: FLUME, SQOOP, ODBC

Fast Read/Write Access: HBASE

Languages / Compilers: PIG, HIVE

Workflow: OOZIE

Scheduling: OOZIE

Metadata: HIVE

UI Framework: HUE

SDK: HUE SDK


SCM Express: Simplifies Installation and Configuration


Service & Configuration Manager (SCM) Express takes the complexity out of deploying and configuring CDH.

Provision a complete Hadoop stack in minutes

Centrally manage system services through a user-friendly interface

Manages services for up to 50 nodes

FREE to download

KEY FEATURES

1. Automated, wizard-based installation of the complete Hadoop stack

2. Central, real-time dashboard for configuration management

3. Ability to configure the cluster while it’s running

4. Incorporates comprehensive validation and error checking

5. Automates the expansion of services to new nodes when they come online


What is Cloudera Enterprise?


Simplify and Accelerate Hadoop Deployment

Reduce Adoption Costs and Risks

Lower the Cost of Administration

Increase the Transparency & Control of Hadoop

Leverage the Experience of Our Experts

Cloudera Enterprise makes open source Apache Hadoop enterprise-easy

EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments

EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production

CLOUDERA ENTERPRISE COMPONENTS

Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration

Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera Enterprise


Hadoop World 2011

The largest gathering of Hadoop practitioners, developers, business executives, industry luminaries and innovative companies in the Hadoop ecosystem.


• 1400 attendees, 25+ sponsors

• 60 sessions across 5 tracks for:

– Business Decision Makers

– Enterprise Architects

– IT Operators

– Data Scientists

– Developers

• Cloudera Training and Certification (November 7, 10, 11)

November 8-9

Sheraton New York Hotel & Towers, NYC

Learn more and register at www.hadoopworld.com

$50 discount for Strata attendees


What I Would Like You To Remember:

• The Key Benefits of the Apache Hadoop Data Platform:
  • Agility/Flexibility (Enables Innovation/Exploration).
  • Complex Data Processing (Any Language, Any Problem).
  • Scalability of Storage/Compute (Freedom to Grow).
  • Economical Active Archive (Keep All Your Data Alive).

• Cloudera Enterprise enables:
  • Lower the Cost of Management and Administration.
  • Simplify and Accelerate Hadoop Deployment.
  • Increase the Transparency & Control of Hadoop.
  • Firm SLAs on Issue Resolution.


Contact Information:


Amr [email protected]

650-644-3921

http://twitter.com/awadallah


Appendix


Hadoop Timeline


2002 2003 2004 2005 2006 2007 2008 2009

• Doug Cutting & Mike Cafarella started working on Nutch

• Google publishes GFS & MapReduce papers

• Doug Cutting adds DFS & MapReduce support to Nutch

• Yahoo! hires Cutting, Hadoop spins out of Nutch

• Facebook launches Hive: SQL Support for Hadoop

• Fastest sort of a TB, 3.5 mins over 910 nodes

• NY Times converts 4TB of image archives over 100 EC2s

• Fastest sort of a TB, 62 secs over 1,460 nodes

• Sorted a PB in 16.25 hours over 3,658 nodes

• Hadoop Summit 2009, 750 attendees

• Cloudera Founded

• Doug Cutting joins Cloudera


Cloudera’s Track Record


• Customers:
  • Multiple customers with >1,000 Hadoop nodes under management
  • Supporting dozens of diverse production use cases including ones that are revenue critical with tight SLAs

• Community: years of demonstrated leadership in the Apache Hadoop ecosystem. Cloudera employees are:
  • The largest contributor to the Hadoop ecosystem in patches
  • Founders of 70% of the projects in the Apache Hadoop ecosystem including Apache Hadoop itself
  • The first to build & integrate what is now the reference Hadoop stack

• Industry: Multiple years of experience providing Hadoop solutions across industries:
  • 2 of the top 5 payments companies run Cloudera
  • 3 of the top 5 commercial banks run Cloudera
  • 2 of the top 4 online travel companies run Cloudera


Cloudera Enterprise Management Suite


Utility It Helps You… So You Can… It’s Like…

Activity Monitor
  It Helps You: Consolidate all user activities into a real-time view; Diagnose user performance; Track activity metrics
  So You Can: Improve performance; Improve conformance to SLAs; Improve QOS
  It’s Like: MySQL Enterprise Monitor; Quest Foglight for Oracle / SQL Server

Service & Configuration Manager
  It Helps You: Manage system services; Automate changes; Validate settings; 1-click security
  So You Can: Lower cost of administration; Improve uptime
  It’s Like: Red Hat Satellite Server; Microsoft System Center; Oracle Enterprise Manager

Resource Manager
  It Helps You: Report on the usage of scarce resources; Plan for capacity expansion
  So You Can: Improve quality of service; Extend the life of the cluster
  It’s Like: VMware vCenter

Authorization Manager
  It Helps You: Centralize management of all users, groups and privileges; Manage permissions via delegated administration
  So You Can: Lower the costs of administration; Improve compliance
  It’s Like: Teradata security administration


CDH Integrates with Existing IT Infrastructure

Categories shown: Databases, Cloud/OS, Hardware, BI/Analytics, ETL
