
  • © Hortonworks, Inc. All Rights Reserved.

    Apache Hadoop Today & Tomorrow

    Eric Baldeschwieler, CEO Hortonworks, Inc.

    twitter: @jeric14 (@hortonworks) www.hortonworks.com


  • © Hortonworks, Inc. All Rights Reserved.

    Agenda

    Brief Overview of Apache Hadoop

    Where Apache Hadoop is Used

    Apache Hadoop Core: Hadoop Distributed File System (HDFS), Map/Reduce

    Where Apache Hadoop Is Going

    Q&A

    2

  • © Hortonworks, Inc. All Rights Reserved.

    3

    What is Apache Hadoop?

    A set of open source projects owned by the Apache Foundation that transforms commodity computers and networks into a distributed service
    • HDFS – Stores petabytes of data reliably
    • Map-Reduce – Allows huge distributed computations

    Key Attributes
    • Reliable and Redundant – Doesn't slow down or lose data even as hardware fails
    • Simple and Flexible APIs – Our rocket scientists use it directly!
    • Very powerful – Harnesses huge clusters, supports best-of-breed analytics
    • Batch processing centric – Hence its great simplicity and speed; not a fit for all use cases

  • © Hortonworks, Inc. All Rights Reserved.

    What is it used for?

    Internet scale data
    • Web logs – years of logs at many TB/day
    • Web search – all the web pages on earth
    • Social data – all message traffic on Facebook

    Cutting edge analytics
    • Machine learning, data mining…

    Enterprise apps
    • Network instrumentation, mobile logs
    • Video and audio processing
    • Text mining

    And lots more!

    4

  • © Hortonworks, Inc. All Rights Reserved.

    5

    Apache Hadoop Projects

    Core Apache Hadoop
    • HDFS (Hadoop Distributed File System) – object storage
    • MapReduce (Distributed Programming Framework) – computation

    Related Apache Projects
    • Hive (SQL) and Pig (Data Flow) – programming languages
    • HCatalog (Metadata) and HBase (Columnar Storage) – table storage
    • Zookeeper (Coordination)
    • HMS (Management)

  • © Hortonworks, Inc. All Rights Reserved.

    6

    Where Hadoop is Used

  • © Hortonworks, Inc. All Rights Reserved.

    Everywhere!

    Adopters by year, 2006 – 2010 (source: The Datagraph Blog)

    (Logos shown include Amazon Web Services, Rapleaf, Visible Technologies, ContextWeb, Rackspace, Apollo Group, Rubicon Project, Samsung, Accela, Forward3D, Infochimps, LinkedIn, Pronux, Twitter, Microsoft, Netflix, and AdMeld)

    7

  • © Hortonworks, Inc. All Rights Reserved.

    8

    HADOOP @ YAHOO!

    40K+ servers, 170 PB storage, 5M+ monthly jobs, 1000+ active users

  • © Hortonworks, Inc. All Rights Reserved.

    CASE STUDY YAHOO! HOMEPAGE

    9

    Personalized for each visitor. Result: twice the engagement

    +160% clicks vs. one size fits all
    +79% clicks vs. randomly selected
    +43% clicks vs. editor selected

    Recommended links, News, Interests, Top Searches

  • © Hortonworks, Inc. All Rights Reserved.

    CASE STUDY YAHOO! HOMEPAGE

    10

    Data flows between the science Hadoop cluster, the production Hadoop cluster, and the serving systems:

    » Science Hadoop cluster – machine learning over user behavior builds ever better categorization models (weekly)
    » Production Hadoop cluster – identifies user interests using the categorization models and produces serving maps (users → interests) every five minutes
    » Serving systems – build customized home pages with the latest data (thousands / second); engaged users' behavior feeds back into the clusters

  • © Hortonworks, Inc. All Rights Reserved.

    CASE STUDY YAHOO! MAIL
    Enabling quick response in the spam arms race

    • 450M mailboxes
    • 5B+ deliveries/day
    • Antispam models retrained every few hours on Hadoop (science and production clusters)

    “40% less spam than Hotmail and 55% less spam than Gmail”

    11

  • © Hortonworks, Inc. All Rights Reserved.

    12

    Apache Hadoop: A Brief History

    • Yahoo! and other early adopters – scale and productize Hadoop (2006 – present)
    • Other Internet companies – add tools / frameworks, enhance Hadoop (2008 – present)
    • Service providers – provide training, support, hosting (2010 – present)
    • Wide enterprise adoption – funds further development, enhancements (nascent / 2011)

  • © Hortonworks, Inc. All Rights Reserved.

    13

    Traditional Enterprise Architecture
    Data Silos + ETL

    Serving applications (web serving, NoSQL, RDBMS) and unstructured systems (serving logs, social media, sensor data, text systems) feed the traditional data warehouses, BI & analytics (EDW, data marts, BI / analytics) through traditional ETL & message buses.

  • © Hortonworks, Inc. All Rights Reserved.

    14

    Hadoop Enterprise Architecture
    Connecting All of Your Big Data

    Apache Hadoop (EsTsL, where s = Store, plus custom analytics) sits between the serving applications (web serving, NoSQL, RDBMS), the unstructured systems (serving logs, social media, sensor data, text systems), and the traditional data warehouses, BI & analytics (EDW, data marts, BI / analytics), alongside traditional ETL & message buses.

  • © Hortonworks, Inc. All Rights Reserved.

    15

    Hadoop Enterprise Architecture
    Connecting All of Your Big Data

    Apache Hadoop (EsTsL, where s = Store, plus custom analytics) connects the serving applications, unstructured systems, and traditional data warehouses, BI & analytics, alongside traditional ETL & message buses.

    80-90% of data produced today is unstructured
    Gartner predicts 800% data growth over the next 5 years

  • © Hortonworks, Inc. All Rights Reserved.

    What is Driving Adoption?

    Business drivers
    • Identified high-value projects that require use of more data
    • Belief that there is great ROI in mastering big data

    Financial drivers
    • Growing cost of data systems as a proportion of IT spend
    • Cost advantage of commodity hardware + open source
    • Enables departmental-level big data strategies

    Technical drivers
    • Existing solutions failing under growing requirements
    • 3Vs – volume, velocity, variety
    • Proliferation of unstructured data

    16

  • © Hortonworks, Inc. All Rights Reserved.

    17

    Big Data Platforms
    Cost per TB, Adoption

    (Chart: size of bubble = cost effectiveness of solution)

  • © Hortonworks, Inc. All Rights Reserved.

    18

    Apache Hadoop Core

  • © Hortonworks, Inc. All Rights Reserved.

    19

    Overview

    Frameworks share commodity hardware:
    • Storage – HDFS
    • Processing – MapReduce

    Typical cluster hardware:
    • 20-40 nodes / rack (1-2U servers)
    • 16 cores, 48G RAM, 6-12 × 2TB disks per node
    • 1-2 GigE to each node; 2 × 10GigE from each rack switch to the network core

  • © Hortonworks, Inc. All Rights Reserved.

    Map/Reduce

    Map/Reduce is a distributed computing programming model. It works like a Unix pipeline:

    cat input | grep | sort | uniq -c > output
    Input | Map | Shuffle & Sort | Reduce | Output

    Strengths:
    • Easy to use! Developer just writes a couple of functions
    • Moves compute to data – schedules work on the HDFS node holding the data when possible
    • Scans through data, reducing seeks
    • Automatic reliability and re-execution on failure

    20
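    A minimal word-count sketch of the "couple of functions" idea, written in the Hadoop Streaming style (read lines on stdin, emit tab-separated key/value pairs on stdout). The file name and the single-machine pipeline are illustrative assumptions, not from the slides; on a cluster, Hadoop Streaming supplies the distributed shuffle & sort between the two functions.

        # wordcount_streaming.py - illustrative Hadoop Streaming-style map/reduce job.
        # Single-machine usage mirroring the Unix pipeline above (assumed file names):
        #   cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce
        import sys

        def mapper():
            # Map: emit "word<TAB>1" for every word on stdin.
            for line in sys.stdin:
                for word in line.split():
                    print(f"{word}\t1")

        def reducer():
            # Reduce: input arrives sorted by key, so counts for a word are adjacent.
            current, count = None, 0
            for line in sys.stdin:
                word, value = line.rstrip("\n").split("\t")
                if word != current:
                    if current is not None:
                        print(f"{current}\t{count}")
                    current, count = word, 0
                count += int(value)
            if current is not None:
                print(f"{current}\t{count}")

        if __name__ == "__main__":
            mapper() if sys.argv[1] == "map" else reducer()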

  • © Hortonworks, Inc. All Rights Reserved.

    HDFS: Scalable, Reliable, Manageable

    21

    Scale IO, storage, CPU
    • Add commodity servers & JBODs
    • 4K nodes in a cluster, 80…

    Fault tolerant & easy management
    • Built-in redundancy
    • Tolerates disk and node failures
    • Automatically manages addition/removal of nodes
    • One operator per 8K nodes!!

    Storage servers used for computation
    • Move computation to data

    Not a SAN
    • But high-bandwidth network access to data via Ethernet

    Immutable file system
    • Read, write, sync/flush
    • No random writes

  • © Hortonworks, Inc. All Rights Reserved.

    HDFS Use Cases

    Petabytes of unstructured data for parallel, distributed analytics processing using commodity hardware

    Solve problems that cannot be solved using traditional systems, at a cheaper price:
    • Large storage capacity (>100PB raw)
    • Large IO/computation bandwidth (>4K servers); >4 Terabit bandwidth to disk (conservatively)
    • Scale by adding commodity hardware
    • Cost per GB ~= $1.5, including the MapReduce cluster

    22
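    A rough back-of-the-envelope reading of the ">4 Terabit" figure above; the per-server disk bandwidth is an assumed, conservative number for illustration, not one given on the slide.

        # Illustrative check of the ">4 Terabit bandwidth to disk" claim.
        # Assumes ~1 Gbit/s (~125 MB/s) of usable disk bandwidth per server,
        # conservative for a node with 6-12 spinning disks (assumption, not from the slide).
        servers = 4000                   # ">4K servers" from the slide
        gbit_per_server = 1              # assumed
        total_tbit_per_s = servers * gbit_per_server / 1000
        print(f"~{total_tbit_per_s:.0f} Tbit/s aggregate bandwidth to disk")   # ~4 Tbit/s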

  • © Hortonworks, Inc. All Rights Reserved.

    HDFS Architecture

    Namenode
    • Persistent namespace metadata & journal (kept on NFS)
    • Namespace state – hierarchical namespace: file name → block IDs
    • Block map – block ID → block locations

    Datanodes
    • Send heartbeats & block reports to the namenode
    • Serve blocks (block ID → data) from JBOD block storage
    • Horizontally scale IO and storage; each block (b1, b2, b3, b5) is replicated across several datanodes
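    A toy sketch of the two pieces of namenode state in the diagram: the hierarchical namespace (file name → block IDs) and the block map (block ID → datanode locations). The path, block IDs, and datanode names are hypothetical examples, not real HDFS state or the real HDFS API.

        # Toy model of namenode metadata (illustrative only).
        namespace = {   # hierarchical namespace: file name -> block IDs
            "/logs/2011/10/05/part-0000": ["b1", "b2", "b3"],
        }
        block_map = {   # block map: block ID -> datanode locations (replicas)
            "b1": ["datanode-1", "datanode-2", "datanode-4"],
            "b2": ["datanode-2", "datanode-3", "datanode-4"],
            "b3": ["datanode-1", "datanode-2", "datanode-3"],
        }

        def locate(path):
            # Resolve a file to the datanodes holding each of its blocks,
            # as the namenode does when a client opens a file for reading.
            return [(block, block_map[block]) for block in namespace[path]]

        print(locate("/logs/2011/10/05/part-0000"))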

  • © Hortonworks, Inc. All Rights Reserved.

    Client Read & Write Directly from Closest Server

    Namenode

    Namespace State

    Block Map

    Horizontally Scale IO and Storage

    24

    b1

    b5

    b3

    JBOD

    b2

    b3

    b1

    JBOD

    b3

    b5

    b2

    JBOD

    b1 b5

    b2

    JBOD

    Client

    1 open

    2 read

    Client

    2 write

    1 create

    End-to-end checksum
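    A minimal sketch of the end-to-end checksum idea: the writer stores a checksum with each block and the reader recomputes it before trusting the data, falling back to another replica on a mismatch. It uses a generic CRC from Python's standard library purely for illustration; this is not HDFS's actual block or checksum format.

        # Sketch of end-to-end checksumming (illustrative, not the real HDFS format).
        import zlib

        stored_blocks = {}   # block ID -> (data, checksum), standing in for a datanode's disk

        def write_block(block_id, data):
            # Writer computes the checksum and stores it alongside the block.
            stored_blocks[block_id] = (data, zlib.crc32(data))

        def read_block(block_id):
            # Reader recomputes the checksum and rejects silently corrupted data.
            data, checksum = stored_blocks[block_id]
            if zlib.crc32(data) != checksum:
                raise IOError(f"checksum mismatch on {block_id}; try another replica")
            return data

        write_block("b1", b"some log records")
        assert read_block("b1") == b"some log records"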

  • © Hortonworks, Inc. All Rights Reserved.

    Actively maintain data reliability

    25

    The namenode holds the namespace state and block map. Datanodes periodically check block checksums, so a bad or lost block replica is detected. Then:

    1. The namenode asks a datanode holding a good replica to replicate the block
    2. That datanode copies the block to another datanode
    3. The receiving datanode reports blockReceived back to the namenode

  • © Hortonworks, Inc. All Rights Reserved.

    HBase

    Hadoop ecosystem database, based on Google BigTable

    Goal: hosting of very large tables (billions of rows × millions of columns) on commodity hardware

    A multidimensional sorted map: Table => Row => Column => Version => Value

    Distributed column-oriented store; scaling (sharding etc.) is done automatically

    No SQL; CRUD etc.

    26
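    A tiny sketch of the "multidimensional sorted map" shape described above, modeled with nested Python dicts. The table, row key, columns, and timestamps are made-up examples; this shows the data model only, not the HBase client API.

        # Toy model of HBase's data model: Table => Row => Column => Version => Value.
        users_table = {
            "row-001": {                      # row key (rows are kept sorted by key)
                "info:name": {
                    1318000000: b"Alice",     # version (timestamp) -> value
                    1317000000: b"A. Smith",  # older versions are retained
                },
                "info:email": {
                    1318000000: b"alice@example.com",
                },
            },
        }

        def get(table, row, column):
            # Return the newest version of a cell, like a simple point read.
            versions = table[row][column]
            return versions[max(versions)]

        print(get(users_table, "row-001", "info:name"))   # b'Alice'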

  • © Hortonworks, Inc. All Rights Reserved.

    27

    What’s Next

  • © Hortonworks, Inc. All Rights Reserved.

    About Hortonworks – Basics

    Founded – July 1st, 2011; 22 architects & committers from Yahoo!

    Mission – Architect the future of Big Data: revolutionize and commoditize the storage and processing of Big Data via open source

    Vision – Half of the world's data will be stored in Hadoop within five years

  • © Hortonworks, Inc. All Rights Reserved.

    Game Plan

    • Support the growth of a huge Apache Hadoop ecosystem
    • Invest in ease of use, management, and other enterprise features
    • Define APIs for ISVs, OEMs and others to integrate with Apache Hadoop
    • Continue to invest in advancing the Hadoop core; remain the experts
    • Contribute all of our work to Apache
    • Profit by providing training & support to the Hadoop community

  • © Hortonworks, Inc. All Rights Reserved.

    30

    Lines of Code Contributed to Apache Hadoop

  • © Hortonworks, Inc. All Rights Reserved.

    Apache Hadoop Roadmap

    Phase 1 – Making Apache Hadoop Accessible (2011)
    • Release the most stable version of Hadoop ever (Hadoop 0.20.205)
    • Frequent sustaining releases
    • Release directly usable code via Apache (RPMs, .debs…)
    • Improve project integration (HBase support)

    Phase 2 – Next-Generation Apache Hadoop (2012; alphas in Q4 2011)
    • Address key product gaps (HA, management…)
    • Enable partner innovation via open APIs
    • Enable community innovation via modular architecture

    31

  • © Hortonworks, Inc. All Rights Reserved.

    Next-Generation Hadoop

    Core
    • HDFS Federation – scale out and innovation via new APIs
    • Will run on 6000-node clusters with 24TB of disk per node = 144PB in the next release
    • Next Gen MapReduce – support for MPI and many other programming models
    • HA (no SPOF) and wire compatibility

    Data – HCatalog 0.3
    • Pig, Hive, MapReduce and Streaming as clients
    • HDFS and HBase as storage systems
    • Performance and storage improvements

    Management & ease of use
    • Ambari – an Apache Hadoop management & monitoring system
    • Stack installation and centralized config management
    • REST and GUI for user & administrator tasks

    32

  • © Hortonworks, Inc. All Rights Reserved.

    Thank You!

    33

    Questions

    Twitter: @jeric14 (@hortonworks) www.hortonworks.com


    Apache Hadoop Today & TomorrowAgendaWhat is Apache Hadoop?What is it used for?Apache Hadoop ProjectsSlide Number 6Everywhere!Slide Number 8Slide Number 9Slide Number 10Slide Number 11A Brief HistoryTraditional Enterprise Architecture�Data Silos + ETLSlide Number 14Slide Number 15What is Driving Adoption?Big Data Platforms�Cost per TB, AdoptionSlide Number 18Slide Number 19Map/ReduceHDFS: Scalable, Reliable, ManagableHDFS Use CasesHDFS ArchitectureClient Read & Write Directly �from Closest ServerActively maintain data reliabilityHBaseSlide Number 27About Hortonworks – BasicsGame PlanLines of Code Contributed to �Apache Hadoop�Apache Hadoop RoadmapNext-Generation HadoopThank You!