
  • © Hortonworks, Inc. All Rights Reserved.

    Apache Hadoop Today & Tomorrow

    Eric Baldeschwieler, CEO Hortonworks, Inc.

    twitter: @jeric14 (@hortonworks) www.hortonworks.com


  • © Hortonworks, Inc. All Rights Reserved.

    Agenda

    Brief Overview of Apache Hadoop

    Where Apache Hadoop is Used

    Apache Hadoop Core: Hadoop Distributed File System (HDFS), Map/Reduce

    Where Apache Hadoop Is Going

    Q&A

    2

  • © Hortonworks, Inc. All Rights Reserved.

    3

    What is Apache Hadoop?

    A set of open source projects owned by the Apache Foundation that transforms commodity computers and networks into a distributed service
    • HDFS – Stores petabytes of data reliably
    • Map-Reduce – Allows huge distributed computations

    Key Attributes
    • Reliable and Redundant – Doesn't slow down or lose data even as hardware fails
    • Simple and Flexible APIs – Our rocket scientists use it directly!
    • Very powerful – Harnesses huge clusters, supports best-of-breed analytics
    • Batch processing centric – Hence its great simplicity and speed; not a fit for all use cases

  • © Hortonworks, Inc. All Rights Reserved.

    What is it used for?

    Internet scale data
    • Web logs – years of logs at many TB/day
    • Web search – all the web pages on earth
    • Social data – all message traffic on Facebook

    Cutting edge analytics
    • Machine learning, data mining…

    Enterprise apps
    • Network instrumentation, mobile logs
    • Video and audio processing
    • Text mining

    And lots more!

    4

  • © Hortonworks, Inc. All Rights Reserved.

    5

    Apache Hadoop Projects

    Core Apache Hadoop
    • HDFS (Hadoop Distributed File System) – object storage
    • MapReduce (Distributed Programming Framework) – computation

    Related Apache Projects
    • Hive (SQL) and Pig (Data Flow) – programming languages
    • HCatalog (Metadata) and HBase (Columnar Storage) – table storage
    • Zookeeper (Coordination)
    • HMS (Management)

  • © Hortonworks, Inc. All Rights Reserved.

    6

    Where Hadoop is Used

  • © Hortonworks, Inc. All Rights Reserved.

    Everywhere!

    Adopters by year, 2006 – 2010 (source: The Datagraph Blog)

    (Logos shown include Amazon Web Services, Rapleaf, Visible Technologies, ContextWeb, Rackspace, Apollo Group, Rubicon Project, Samsung, Accela, Forward3D, Infochimps, LinkedIn, Pronux, Twitter, Microsoft, Netflix, and AdMeld)

    7

  • © Hortonworks, Inc. All Rights Reserved.

    8

    HADOOP @ YAHOO!

    40K+ servers, 170 PB storage, 5M+ monthly jobs, 1000+ active users

  • © Hortonworks, Inc. All Rights Reserved.

    CASE STUDY YAHOO! HOMEPAGE

    9

    Personalized for each visitor. Result: twice the engagement

    +160% clicks vs. one size fits all
    +79% clicks vs. randomly selected
    +43% clicks vs. editor selected

    Recommended links, News, Interests, Top Searches

  • © Hortonworks, Inc. All Rights Reserved.

    CASE STUDY YAHOO! HOMEPAGE

    10

    Data flows between the science Hadoop cluster, the production Hadoop cluster, and the serving systems:

    » Science Hadoop cluster – machine learning over user behavior builds ever better categorization models (weekly)
    » Production Hadoop cluster – identifies user interests using the categorization models and produces serving maps (users → interests) every five minutes
    » Serving systems – build customized home pages with the latest data (thousands / second); engaged users' behavior feeds back into the clusters

  • © Hortonworks, Inc. All Rights Reserved.

    CASE STUDY YAHOO! MAIL
    Enabling quick response in the spam arms race

    • 450M mailboxes
    • 5B+ deliveries/day
    • Antispam models retrained every few hours on Hadoop (science and production clusters)

    “40% less spam than Hotmail and 55% less spam than Gmail”

    11

  • © Hortonworks, Inc. All Rights Reserved.

    12

    Apache Hadoop: A Brief History

    • Yahoo! and other early adopters – scale and productize Hadoop (2006 – present)
    • Other Internet companies – add tools / frameworks, enhance Hadoop (2008 – present)
    • Service providers – provide training, support, hosting (2010 – present)
    • Wide enterprise adoption – funds further development, enhancements (nascent / 2011)

  • © Hortonworks, Inc. All Rights Reserved.

    13

    Traditional Enterprise Architecture
    Data Silos + ETL

    Serving applications (web serving, NoSQL, RDBMS) and unstructured systems (serving logs, social media, sensor data, text systems) feed the traditional data warehouses, BI & analytics (EDW, data marts, BI / analytics) through traditional ETL & message buses.

  • © Hortonworks, Inc. All Rights Reserved.

    14

    Hadoop Enterprise Architecture
    Connecting All of Your Big Data

    Apache Hadoop (EsTsL, where s = Store, plus custom analytics) sits between the serving applications (web serving, NoSQL, RDBMS), the unstructured systems (serving logs, social media, sensor data, text systems), and the traditional data warehouses, BI & analytics (EDW, data marts, BI / analytics), alongside traditional ETL & message buses.

  • © Hortonworks, Inc. All Rights Reserved.

    15

    Hadoop Enterprise Architecture
    Connecting All of Your Big Data

    Apache Hadoop (EsTsL, where s = Store, plus custom analytics) connects the serving applications, unstructured systems, and traditional data warehouses, BI & analytics, alongside traditional ETL & message buses.

    80-90% of data produced today is unstructured
    Gartner predicts 800% data growth over the next 5 years

  • © Hortonworks, Inc. All Rights Reserved.

    What is Driving Adoption?

    Business drivers
    • Identified high-value projects that require use of more data
    • Belief that there is great ROI in mastering big data

    Financial drivers
    • Growing cost of data systems as a proportion of IT spend
    • Cost advantage of commodity hardware + open source
    • Enables departmental-level big data strategies

    Technical drivers
    • Existing solutions failing under growing requirements
    • 3Vs – volume, velocity, variety
    • Proliferation of unstructured data

    16

  • © Hortonworks, Inc. All Rights Reserved.

    17

    Big Data Platforms
    Cost per TB, Adoption

    (Chart: size of bubble = cost effectiveness of solution)

  • © Hortonworks, Inc. All Rights Reserved.

    18

    Apache Hadoop Core

  • © Hortonworks, Inc. All Rights Reserved.

    19

    Overview

    Frameworks share commodity hardware:
    • Storage – HDFS
    • Processing – MapReduce

    Typical cluster hardware:
    • 20-40 nodes / rack (1-2U servers)
    • 16 cores, 48G RAM, 6-12 × 2TB disks per node
    • 1-2 GigE to each node; 2 × 10GigE from each rack switch to the network core

  • © Hortonworks, Inc. All Rights Reserved.

    Map/Reduce

    Map/Reduce is a distributed computing programming model. It works like a Unix pipeline:

    cat input | grep | sort | uniq -c > output
    Input | Map | Shuffle & Sort | Reduce | Output

    Strengths:
    • Easy to use! Developer just writes a couple of functions
    • Moves compute to data – schedules work on the HDFS node holding the data when possible
    • Scans through data, reducing seeks
    • Automatic reliability and re-execution on failure

    20
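    A minimal word-count sketch of the "couple of functions" idea, written in the Hadoop Streaming style (read lines on stdin, emit tab-separated key/value pairs on stdout). The file name and the single-machine pipeline are illustrative assumptions, not from the slides; on a cluster, Hadoop Streaming supplies the distributed shuffle & sort between the two functions.

        # wordcount_streaming.py - illustrative Hadoop Streaming-style map/reduce job.
        # Single-machine usage mirroring the Unix pipeline above (assumed file names):
        #   cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce
        import sys

        def mapper():
            # Map: emit "word<TAB>1" for every word on stdin.
            for line in sys.stdin:
                for word in line.split():
                    print(f"{word}\t1")

        def reducer():
            # Reduce: input arrives sorted by key, so counts for a word are adjacent.
            current, count = None, 0
            for line in sys.stdin:
                word, value = line.rstrip("\n").split("\t")
                if word != current:
                    if current is not None:
                        print(f"{current}\t{count}")
                    current, count = word, 0
                count += int(value)
            if current is not None:
                print(f"{current}\t{count}")

        if __name__ == "__main__":
            mapper() if sys.argv[1] == "map" else reducer()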

  • © Hortonworks, Inc. All Rights Reserved.

    HDFS: Scalable, Reliable, Manageable

    21

    Scale IO, storage, CPU
    • Add commodity servers & JBODs
    • 4K nodes in a cluster, 80…

    Fault tolerant & easy management
    • Built-in redundancy
    • Tolerates disk and node failures
    • Automatically manages addition/removal of nodes
    • One operator per 8K nodes!!

    Storage servers used for computation
    • Move computation to data

    Not a SAN
    • But high-bandwidth network access to data via Ethernet

    Immutable file system
    • Read, write, sync/flush
    • No random writes

  • © Hortonworks, Inc. All Rights Reserved.

    HDFS Use Cases

    Petabytes of unstructured data for parallel, distributed analytics processing using commodity hardware

    Solve problems that cannot be solved using traditional systems, at a cheaper price:
    • Large storage capacity (>100PB raw)
    • Large IO/computation bandwidth (>4K servers); >4 Terabit bandwidth to disk (conservatively)
    • Scale by adding commodity hardware
    • Cost per GB ~= $1.5, including the MapReduce cluster

    22
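    A rough back-of-the-envelope reading of the ">4 Terabit" figure above; the per-server disk bandwidth is an assumed, conservative number for illustration, not one given on the slide.

        # Illustrative check of the ">4 Terabit bandwidth to disk" claim.
        # Assumes ~1 Gbit/s (~125 MB/s) of usable disk bandwidth per server,
        # conservative for a node with 6-12 spinning disks (assumption, not from the slide).
        servers = 4000                   # ">4K servers" from the slide
        gbit_per_server = 1              # assumed
        total_tbit_per_s = servers * gbit_per_server / 1000
        print(f"~{total_tbit_per_s:.0f} Tbit/s aggregate bandwidth to disk")   # ~4 Tbit/s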

  • © Hortonworks, Inc. All Rights Reserved.

    HDFS Architecture

    Namenode
    • Persistent namespace metadata & journal (kept on NFS)
    • Namespace state – hierarchical namespace: file name → block IDs
    • Block map – block ID → block locations

    Datanodes
    • Send heartbeats & block reports to the namenode
    • Serve blocks (block ID → data) from JBOD block storage
    • Horizontally scale IO and storage; each block (b1, b2, b3, b5) is replicated across several datanodes
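    A toy sketch of the two pieces of namenode state in the diagram: the hierarchical namespace (file name → block IDs) and the block map (block ID → datanode locations). The path, block IDs, and datanode names are hypothetical examples, not real HDFS state or the real HDFS API.

        # Toy model of namenode metadata (illustrative only).
        namespace = {   # hierarchical namespace: file name -> block IDs
            "/logs/2011/10/05/part-0000": ["b1", "b2", "b3"],
        }
        block_map = {   # block map: block ID -> datanode locations (replicas)
            "b1": ["datanode-1", "datanode-2", "datanode-4"],
            "b2": ["datanode-2", "datanode-3", "datanode-4"],
            "b3": ["datanode-1", "datanode-2", "datanode-3"],
        }

        def locate(path):
            # Resolve a file to the datanodes holding each of its blocks,
            # as the namenode does when a client opens a file for reading.
            return [(block, block_map[block]) for block in namespace[path]]

        print(locate("/logs/2011/10/05/part-0000"))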

  • © Hortonworks, Inc. All Rights Reserved.

    Client Read & Write Directly from Closest Server

    Namenode

    Namespace State

    Block Map

    Horizontally Scale IO and Storage

    24

    b1

    b5

    b3

    JBOD

    b2

    b3

    b1

    JBOD

    b3

    b5

    b2

    JBOD

    b1 b5

    b2

    JBOD

    Client

    1 open

    2 read

    Client

    2 write

    1 create

    End-to-end checksum
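    A minimal sketch of the end-to-end checksum idea: the writer stores a checksum with each block and the reader recomputes it before trusting the data, falling back to another replica on a mismatch. It uses a generic CRC from Python's standard library purely for illustration; this is not HDFS's actual block or checksum format.

        # Sketch of end-to-end checksumming (illustrative, not the real HDFS format).
        import zlib

        stored_blocks = {}   # block ID -> (data, checksum), standing in for a datanode's disk

        def write_block(block_id, data):
            # Writer computes the checksum and stores it alongside the block.
            stored_blocks[block_id] = (data, zlib.crc32(data))

        def read_block(block_id):
            # Reader recomputes the checksum and rejects silently corrupted data.
            data, checksum = stored_blocks[block_id]
            if zlib.crc32(data) != checksum:
                raise IOError(f"checksum mismatch on {block_id}; try another replica")
            return data

        write_block("b1", b"some log records")
        assert read_block("b1") == b"some log records"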

  • © Hortonworks, Inc. All Rights Reserved.

    Actively maintain data reliability

    25

    The namenode holds the namespace state and block map. Datanodes periodically check block checksums, so a bad or lost block replica is detected. Then:

    1. The namenode asks a datanode holding a good replica to replicate the block
    2. That datanode copies the block to another datanode
    3. The receiving datanode reports blockReceived back to the namenode

  • © Hortonworks, Inc. All Rights Reserved.

    HBase

    Hadoop ecosystem database, based on Google BigTable

    Goal: hosting of very large tables (billions of rows × millions of columns) on commodity hardware

    A multidimensional sorted map: Table => Row => Column => Version => Value

    Distributed column-oriented store; scaling (sharding etc.) is done automatically

    No SQL; CRUD etc.

    26
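    A tiny sketch of the "multidimensional sorted map" shape described above, modeled with nested Python dicts. The table, row key, columns, and timestamps are made-up examples; this shows the data model only, not the HBase client API.

        # Toy model of HBase's data model: Table => Row => Column => Version => Value.
        users_table = {
            "row-001": {                      # row key (rows are kept sorted by key)
                "info:name": {
                    1318000000: b"Alice",     # version (timestamp) -> value
                    1317000000: b"A. Smith",  # older versions are retained
                },
                "info:email": {
                    1318000000: b"alice@example.com",
                },
            },
        }

        def get(table, row, column):
            # Return the newest version of a cell, like a simple point read.
            versions = table[row][column]
            return versions[max(versions)]

        print(get(users_table, "row-001", "info:name"))   # b'Alice'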

  • © Hortonworks, Inc. All Rights Reserved.

    27

    What’s Next

  • © Hortonworks, Inc. All Rights Reserved.

    About Hortonworks – Basics

    Founded – July 1st, 2011; 22 architects & committers from Yahoo!

    Mission – Architect the future of Big Data: revolutionize and commoditize the storage and processing of Big Data via open source

    Vision – Half of the world's data will be stored in Hadoop within five years

  • © Hortonworks, Inc. All Rights Reserved.

    Game Plan

    • Support the growth of a huge Apache Hadoop ecosystem
    • Invest in ease of use, management, and other enterprise features
    • Define APIs for ISVs, OEMs and others to integrate with Apache Hadoop
    • Continue to invest in advancing the Hadoop core; remain the experts
    • Contribute all of our work to Apache
    • Profit by providing training & support to the Hadoop community

  • © Hortonworks, Inc. All Rights Reserved.

    30

    Lines of Code Contributed to Apache Hadoop

  • © Hortonworks, Inc. All Rights Reserved.

    Apache Hadoop Roadmap

    Phase 1 – Making Apache Hadoop Accessible (2011)
    • Release the most stable version of Hadoop ever (Hadoop 0.20.205)
    • Frequent sustaining releases
    • Release directly usable code via Apache (RPMs, .debs…)
    • Improve project integration (HBase support)

    Phase 2 – Next-Generation Apache Hadoop (2012; alphas in Q4 2011)
    • Address key product gaps (HA, management…)
    • Enable partner innovation via open APIs
    • Enable community innovation via modular architecture

    31

  • © Hortonworks, Inc. All Rights Reserved.

    Next-Generation Hadoop

    Core
    • HDFS Federation – scale out and innovation via new APIs
    • Will run on 6000-node clusters with 24TB of disk per node = 144PB in the next release
    • Next Gen MapReduce – support for MPI and many other programming models
    • HA (no SPOF) and wire compatibility

    Data – HCatalog 0.3
    • Pig, Hive, MapReduce and Streaming as clients
    • HDFS and HBase as storage systems
    • Performance and storage improvements

    Management & ease of use
    • Ambari – an Apache Hadoop management & monitoring system
    • Stack installation and centralized config management
    • REST and GUI for user & administrator tasks

    32

  • © Hortonworks, Inc. All Rights Reserved.

    Thank You!

    33

    Questions

    Twitter: @jeric14 (@hortonworks) www.hortonworks.com


    Apache Hadoop Today & TomorrowAgendaWhat is Apache Hadoop?What is it used for?Apache Hadoop ProjectsSlide Number 6Everywhere!Slide Number 8Slide Number 9Slide Number 10Slide Number 11A Brief HistoryTraditional Enterprise Architecture�Data Silos + ETLSlide Number 14Slide Number 15What is Driving Adoption?Big Data Platforms�Cost per TB, AdoptionSlide Number 18Slide Number 19Map/ReduceHDFS: Scalable, Reliable, ManagableHDFS Use CasesHDFS ArchitectureClient Read & Write Directly �from Closest ServerActively maintain data reliabilityHBaseSlide Number 27About Hortonworks – BasicsGame PlanLines of Code Contributed to �Apache Hadoop�Apache Hadoop RoadmapNext-Generation HadoopThank You!