Hadoop in the Enterprise - Dr. Amr Awadallah @ MicroStrategy World 2011

Apache Hadoop in the Enterprise

Cloudera, Inc.

Amr Awadallah, Founder, CTO, VP of Engineering.

aaa@cloudera.com, twitter: @awadallah

MicroStrategy World – January 2011 – Las Vegas

Source: IDC White Paper - sponsored by EMC.

As the Economy Contracts, the Digital Universe Expands. May 2009.

Unstructured Data Explosion

• 2,500 exabytes of new information in 2012 with Internet as primary driver

• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

Relational

Complex, Unstructured

Dramatic Changes in Enterprise Data Needs

Data Explosion

• Any Type of Data

• From Many Sources

• Instrument Everything

Hard Problems

• Complex Analysis

• At Lowest Granularity

• Data Beats Algorithm

What is Hadoop?

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)

• Core Hadoop has two main components:

• Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage

• MapReduce: fault-tolerant distributed processing

• Key business values:

• Flexible -> Store any data, run any analysis (Mine First, Govern Later)

• Affordable -> Cost per TB at a fraction of traditional options

• Broadly adopted -> A large and active ecosystem

• Proven at scale -> Several petabyte deployments in production today

• Open Source -> No Lock-In, low cost, large developer community.

Cloudera’s Data Operating System (CDH)

• Open Source – 100% Apache licensed

• Simplified – Component versions & dependencies managed for you

• Reliable – Predictable release schedules, Patched with fixes to improve stability

• Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32- or 64-bit, etc.

• Integrated – All components & functions interoperate through standard APIs

• Supported – Founders, committers, contributors across all projects

[Diagram: CDH components: Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, Zookeeper, Avro, Pig, Hive]

Benefit #1: Agility

Schema-on-Write (RDBMS):

• Schema must be created before data is loaded

• An explicit load operation must take place, which transforms the data into the database's internal structure

• New columns must be added explicitly before data for such columns can be loaded into the database

• Benefits: Read is Fast, Standards/Governance

Schema-on-Read (Hadoop):

• Data is simply copied to the file store, no special transformation is needed

• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns

• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it

• Benefits: Load is Fast, Evolving Schemas/Agility
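A minimal sketch of the schema-on-read idea, assuming JSON lines as the raw storage format. The `read_with_schema` function is a hypothetical stand-in for a SerDe, not a Hive API: raw records are stored untouched, and columns are extracted only when the data is read.

```python
import json

# Raw records are kept exactly as they arrived; nothing is transformed at load time.
RAW_LINES = [
    '{"user": "alice", "amount": 12.5}',
    '{"user": "bob", "amount": 7.0, "country": "EG"}',  # a new field starts flowing
]

def read_with_schema(lines, columns):
    """Hypothetical SerDe-like reader: extract only the requested columns at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Columns that a record does not carry come back as None.
        rows.append(tuple(record.get(col) for col in columns))
    return rows

# Original schema: two columns.
v1 = read_with_schema(RAW_LINES, ["user", "amount"])

# "Updating the SerDe" here just means asking for one more column; the new field
# becomes visible in data that was already stored, with no reload.
v2 = read_with_schema(RAW_LINES, ["user", "amount", "country"])

print(v1)
print(v2)
```

The same stored lines serve both schema versions, which is the agility benefit; the cost is that parsing happens on every read.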

Benefit #2: Data Consolidation

A single data system to enable processing across the universe of data types.

Complex Data

Documents Web feeds System logs Online forums

Structured Data (“relational”)

CRM Financials Logistics Data Marts

SharePoint Sensor data EMB archives Images/Video

Inventory Sales records HR records Web Profiles

Benefit #3: Any Programming Language (Not Only SQL)

1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).

2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.

3. Cascading: A thin Java library that sits on top of MapReduce and lets developers assemble complex processes.

4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.

5. Hive: A SQL interpreter out of Facebook that also includes a meta-store mapping files to their schemas and associated SerDe.

6. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.

Benefit #4: Balancing Return on Investment (or Byte!)

Low ROB

• Return on Byte = value to be extracted from that byte divided by the cost of storing that byte

• If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage.

High ROB
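The Return on Byte ratio can be written out as a small calculation. The dollar figures below are made up purely for illustration and do not come from the talk; the function names are hypothetical.

```python
def return_on_byte(value_extracted, storage_cost):
    """ROB = value extracted from the data / cost of storing it (same currency units)."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost

def worth_keeping_online(value_extracted, storage_cost):
    # Per the slide, data with ROB < 1 tends to be pushed out to tape.
    return return_on_byte(value_extracted, storage_cost) >= 1

# Illustrative numbers only: the same data, worth 50 units, on two storage tiers.
expensive_tier = return_on_byte(50, 200)  # 0.25, below 1
cheaper_tier = return_on_byte(50, 20)     # 2.5, above 1

print(expensive_tier, worth_keeping_online(50, 200))
print(cheaper_tier, worth_keeping_online(50, 20))
```

The value of the data is unchanged between the two calls; only the storage cost moves the ratio above 1, which is the argument for more economical active storage.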

Use The Right Tool For The Right Job

Hadoop:

Use when:

• Structured or Not (Agility)

• Scalable Storage/Compute

• Complex Data Processing

Relational Databases:

Use when:

• Interactive OLAP Analytics (<1sec)

• Multistep ACID Transactions

• SQL Compliance

Where Does Hadoop Fit in the Enterprise Data Stack?

[Diagram: data sources (logs, files, web data) and the systems around them: enterprise data warehouse, relational databases, low-latency serving systems, web application, enterprise reporting, BI and analytics, and Cloudera management apps. Users shown: analysts, business users, data scientists, system administrators, data architects.]

Apache Hive Features

• A subset of SQL covering the most common statements

• JDBC/ODBC support

• Agile data types: Array, Map, Struct, and JSON objects

• Pluggable SerDe system to work on unstructured files directly.

• User Defined Functions and Aggregates

• Regular Expression support

• MapReduce support

• Partitions and Buckets (for performance optimization)

• In The Works: Indices, Columnar Storage, Views, Microstrategy compatibility, Explode/Collect

• More details: http://wiki.apache.org/hadoop/Hive

Broad Adoption in Key Verticals

Verticals: Financial Services, Telecom, Retail, Government

Stakeholders: Risk Analysts Intelligence; Research Insight Team; IT: Operations; IT: Data Engineering

Example Applications:

• Risk management: “Examine purchase behavior across debit and credit properties to better identify high-risk customers.”

• “Analyze calling patterns among users and current capacity to forecast traffic growth and locate new towers.”

• Brand Equity: “Monitor customer and product data recorded across internal & external sources to trend brand valuation.”

• Traffic Analysis: “Use multimedia data from various sources to build an actionable graph of relationships among targets.”

Customers

How are Customers Using Cloudera?

Analyze search terms and subsequent user purchase decisions to tune search results, increase conversion rates

Digest long-term historical trade data to identify fraudulent activity and build real-time fraud prevention

Model site visitor behavior with analytics that deliver better recommendations for new purchases

Continually refine predictive models for advertising response rates to deliver more precisely targeted advertisements

Replace expensive legacy ETL system with more flexible, cheaper infrastructure that is 20 times faster

Correlate educational outcomes with programs and student histories to improve results

Examine customer behavior to improve loan risk scoring

Answering Questions that Were Impossible to Ask Before

Big Bank

More: http://www.cloudera.com/company/press-center/hadoop-world-nyc/

Cloudera Offerings

Software Services Training

Facilitating enterprise adoption of Hadoop

• Improves conformance to important IT SLAs, policies and procedures

• Lowers the cost of management and administration

• Increases reliability and consistency of the platform

• Certified integration with RDBMS, ETL, BI, Server, and Cloud Systems

Cloudera Enterprise: Enterprise Support and Management Applications

Integrating with Existing IT Infrastructure

RDBMS Cloud/OS Hardware BI/Analytics

MicroStrategy (for interactive Dashboards)

Informatica (for Extract-Transform-Load, aka ETL)

Summary

• Cloudera’s Data OS (CDH) enables:

• Data Agility (Evolving Schemas)

• Consolidation (Structured or Not)

• Complex Data Processing (Any Language)

• Economical Storage (Enable Return-on-Byte > 1)

• Cloudera Enterprise enables:

• Conformance to important IT SLAs, policies and procedures

• Lower cost of management and administration

• Increased reliability and consistency

• Certified integration with existing IT infrastructure

Contact Information and Free Hadoop Book

Amr Awadallah

CTO, Cloudera, Inc.

aaa@cloudera.com

650-644-3921

twitter.com/awadallah

twitter.com/cloudera

Appendix

Cloudera Overview

Hadoop…

Jeff Hammerbacher, Chief Scientist

Amr Awadallah, CTO, VP Engineering

Doug Cutting, Chief Architect

… meets enterprise

Mike Olson - CEO

Omer Trajman – VP, Customer Solutions

John Kreisa –VP, Marketing

Charles Zedlewski – VP, Product Management

Ed Albanese – Head of Business Development

Investors: Accel Partners, Greylock Partners, Meritech Capital Partners

Product category: Data Management

Business model: Cloudera offers Software, Support, Training, and Professional Services

Employees: 70+

Customers: 75+

Headquarters: Palo Alto, California

Elevator pitch: The leading provider of Apache Hadoop-based software and services for the enterprise

Vision: We enable organizations to profit from all of their data

Why CDH (Cloudera Distribution for Hadoop)?

Features and their benefits:

• It’s packaged: Much easier for users to install CDH than any other form of Hadoop.

• It’s patched: This makes CDH more stable and secure than just downloading an Apache branch.

• It’s proven: Thousands of organizations already use CDH today, so risk is lower.

• It’s highly functional: CDH will cover more use cases and users will be more productive than if they were just using core Hadoop.

• It’s integrated: Save time (of piecing a system together yourself) and lower risk (of choosing the wrong combination of versions or patches).

• It’s the accepted standard: More of your preexisting investments in RDBMS, ETL and BI work best with CDH.

• It’s supported: CDH is one of only two distributions that has a commercial entity standing behind it.

• It’s 100% Apache licensed: Investment in this technology is insured.

Hadoop Timeline

2002 2003 2004 2005 2006 2007 2008 2009

Doug Cutting & Mike Cafarella started working on Nutch

Google publishes GFS & MapReduce papers

Cutting adds DFS & MapReduce support to Nutch

Yahoo! hires Cutting, Hadoop spins out of Nutch

Web-scale deployments at Y!, Facebook, Last.fm

Fastest sort of a TB, 3.5mins over 910 nodes

NY Times converts 4TB of image archives over 100 EC2s

• Fastest sort of a TB, 62secs over 1,460 nodes

• Sorted a PB in 16.25hours over 3,658 nodes

Hadoop Summit 2009, 750 attendees

Cloudera Founded

Cloudera hires Cutting

10 Common Hadoop-able Problems

1. Modeling true risk

2. Customer churn analysis

3. Recommendation engine

4. Ad targeting

5. PoS transaction analysis

6. Analyzing network data to predict failure

7. Threat analysis

8. Trade surveillance

9. Search quality

10. Data “sandbox”

Case Studies: Hadoop World 2009

• VISA: Large Scale Transaction Analysis

• JP Morgan Chase: Data Processing for Financial Services

• China Mobile: Data Mining Platform for Telecom Industry

• Rackspace: Cross Data Center Log Processing

• Booz Allen Hamilton: Protein Alignment using Hadoop

• eHarmony: Matchmaking in the Hadoop Cloud

• General Sentiment: Understanding Natural Language

• Yahoo!: Social Graph Analysis

• Visible Technologies: Real-Time Business Intelligence

Slides and Videos: http://www.cloudera.com/hadoop-world-nyc

Case Studies: Hadoop World 2010

• eBay: Hadoop at eBay

• Twitter: The Hadoop Ecosystem at Twitter

• Yale University: MapReduce and Parallel Database Systems

• General Electric: Sentiment Analysis powered by Hadoop

• Facebook: HBase in Production

• AOL: AOL’s Data Layer

• Raytheon: SHARD: Storing and Querying Large-Scale Data

• StumbleUpon: Mixing Real-Time and Batch Processing

More Info: http://www.cloudera.com/company/press-center/hadoop-world-nyc/

Hadoop Design Axioms

1. System Shall Manage and Heal Itself

2. Performance Shall Scale Linearly

3. Compute Should Move to Data

4. Simple Core, Modular and Extensible

Block Size = 64MB

Replication Factor = 3

HDFS: Hadoop Distributed File System

Cost/GB is a few ¢/month vs $/month
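A quick worked calculation using the two HDFS figures quoted on this slide (64MB blocks, replication factor 3). The function name is hypothetical, and the sketch assumes the final partial block occupies only the space it actually needs.

```python
import math

BLOCK_SIZE_MB = 64      # block size from the slide
REPLICATION_FACTOR = 3  # replication factor from the slide

def hdfs_footprint(file_size_mb):
    """Return (logical blocks, stored block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # Assumption: a partial last block is not padded, so raw usage is size x replication.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb

print(hdfs_footprint(1024))  # a 1 GB file: (16, 48, 3072)
print(hdfs_footprint(100))   # a 100 MB file: (2, 6, 300)
```

Every stored byte costs three bytes of raw disk, which is why the low cost per GB of commodity storage matters to the overall economics.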

MapReduce: Distributed Processing

MapReduce Example for Word Count

[Diagram: input splits 1…N feed map tasks 1…M, each taking (docid, text) and emitting (words, counts). The shuffle sorts these into (sorted words, counts) for reduce tasks 1…R, each of which writes an output file of (sorted words, sum of counts). Example input: “To Be Or Not To Be?”; example pairs shown: (Be, 12), (Be, 30).]

Map(in_key, in_value) => list of (out_key, intermediate_value)

Reduce(out_key, list of intermediate_values) => out_value(s)

cat *.txt | mapper.pl | sort | reducer.pl > out.txt
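The shell pipeline above can be mirrored in a few lines of Python as a rough sketch. Here `mapper` and `reducer` are stand-ins for `mapper.pl` and `reducer.pl` (whose contents are not shown in the slides), and `sorted` plays the role of `sort`, that is, the shuffle.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, like a streaming mapper writing to stdout."""
    for line in lines:
        for word in line.lower().split():
            word = word.strip('.,;:!?"\'')
            if word:
                yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts for each run of identical keys; input must be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

text = ["To be or not", "to be?"]

# Equivalent in spirit to: cat | mapper | sort | reducer
counts = dict(reducer(sorted(mapper(text))))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The reducer relies only on identical keys arriving next to each other, so it never has to hold the whole data set in memory. That is the property the sort step provides.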

Hadoop High-Level Architecture

• Name Node: Maintains mapping of file blocks to data node slaves

• Job Tracker: Schedules jobs across task tracker slaves

• Data Node: Stores and serves blocks of data

• Task Tracker: Runs tasks (work units) within a job

• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs

• Data Node and Task Tracker share a physical node
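A toy model of the read path described above, as a sketch only. The class and method names are invented for illustration and are not Hadoop APIs. The name node holds just the block-to-node mapping, while data nodes hold the bytes.

```python
class NameNode:
    """Toy stand-in: keeps only the mapping of file blocks to data nodes."""
    def __init__(self):
        self.block_map = {}  # (path, block_index) -> list of data node names

    def register(self, path, block_index, nodes):
        self.block_map[(path, block_index)] = list(nodes)

    def locate(self, path):
        keys = sorted(k for k in self.block_map if k[0] == path)
        return [(index, self.block_map[(p, index)]) for p, index in keys]

class DataNode:
    """Toy stand-in: stores and serves blocks of data."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, path, block_index, data):
        self.blocks[(path, block_index)] = data

    def serve(self, path, block_index):
        return self.blocks[(path, block_index)]

def client_read(name_node, data_nodes, path):
    """Ask the name node where blocks live, then fetch each block from a data node."""
    chunks = []
    for block_index, holders in name_node.locate(path):
        chunks.append(data_nodes[holders[0]].serve(path, block_index))
    return b"".join(chunks)

# Hypothetical two-node cluster holding one two-block file.
name_node = NameNode()
data_nodes = {"dn1": DataNode("dn1"), "dn2": DataNode("dn2")}
data_nodes["dn1"].store("/logs/a.txt", 0, b"hello ")
data_nodes["dn2"].store("/logs/a.txt", 0, b"hello ")
data_nodes["dn2"].store("/logs/a.txt", 1, b"world")
name_node.register("/logs/a.txt", 0, ["dn1", "dn2"])
name_node.register("/logs/a.txt", 1, ["dn2"])

print(client_read(name_node, data_nodes, "/logs/a.txt"))  # b'hello world'
```

File contents never pass through the name node in this model. It answers only "where", which keeps it small relative to the data it indexes.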

Hive vs Pig Example (count distinct values > 0)

• Hive syntax:

SELECT COUNT(DISTINCT col1)

FROM mytable

WHERE col1 > 0;

• Pig syntax:

mytable = LOAD 'myfile' AS (col1, col2, col3);

mytable = FOREACH mytable GENERATE col1;

mytable = FILTER mytable BY col1 > 0;

mytable = DISTINCT mytable;

mytable = GROUP mytable ALL;

mytable = FOREACH mytable GENERATE COUNT(mytable);

DUMP mytable;
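For comparison, the same "count distinct values > 0" logic in plain Python, following the Pig steps one at a time. The sample rows are invented for illustration.

```python
# Hypothetical rows of (col1, col2); only col1 matters for this query.
rows = [(3, "a"), (0, "b"), (3, "c"), (-1, "d"), (7, "e")]

def count_distinct_positive(rows):
    projected = (col1 for col1, _ in rows)      # FOREACH ... GENERATE col1
    filtered = (v for v in projected if v > 0)  # FILTER ... BY col1 > 0
    distinct = set(filtered)                    # DISTINCT
    return len(distinct)                        # COUNT over the whole relation

print(count_distinct_positive(rows))  # 2 (the values 3 and 7)
```

Hive states the result it wants in one declarative statement, while Pig, like this function, spells out the data flow step by step.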

Hive Agile Data Types

• STRUCTS: SELECT mytable.mycolumn.myfield FROM …

• MAPS (Hashes): SELECT mytable.mycolumn[mykey] FROM …

• ARRAYS: SELECT mytable.mycolumn[5] FROM …

• JSON: SELECT get_json_object(mycolumn, objpath)
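As a loose analogy, the four access patterns map onto ordinary Python structures. The row contents and the `get_json_path` helper are hypothetical; the helper handles only dotted object keys such as `$.store.city`, a small subset of what a real JSON path can express.

```python
import json

# Hypothetical row with one column of each type.
row = {
    "mystruct": {"myfield": "x"},                      # STRUCT
    "mymap": {"mykey": 42},                            # MAP
    "myarray": [10, 20, 30, 40, 50, 60],               # ARRAY
    "myjson": '{"store": {"city": "Las Vegas"}}',      # JSON kept as a string
}

def get_json_path(json_text, path):
    """Hypothetical helper: follow a dotted path like '$.store.city' into a JSON string."""
    value = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        value = value[key]
    return value

print(row["mystruct"]["myfield"])                    # struct field access
print(row["mymap"]["mykey"])                         # map lookup by key
print(row["myarray"][5])                             # array element by index
print(get_json_path(row["myjson"], "$.store.city"))  # JSON path extraction
```

The JSON column stays a plain string until it is queried, which is the same schema-on-read pattern applied inside a single column.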
