hadoopintro


TRANSCRIPT

  • Slide 1/31

    An Introduction to Hadoop

  • Slide 2/31

    Hello

    Processing against a 156-node cluster

    Certified Hadoop Developer

    Certified Hadoop System Administrator

  • Slide 3/31

    Goals

    Why should you care?

    What is it? How does it work?

  • Slide 4/31

    Data Everywhere

    "Every two days now we create as much information as we did from the dawn of civilization up until 2003."

    - Eric Schmidt, then CEO of Google, Aug 4, 2010

  • Slides 5-7/31

    Data Everywhere (three chart-only slides; no text captured in the transcript)

  • Slide 8/31

    The Hadoop Project

    Originally based on papers published by Google in 2003 and 2004

    Hadoop started in 2006 at Yahoo!

    Top-level Apache Foundation project

    Large, active user base, user groups

    Very active development, strong development team

  • Slide 9/31

    Who Uses Hadoop? (image-only slide; no text captured in the transcript)

  • Slide 10/31

    Hadoop Components

    HDFS (Storage): self-healing, high-bandwidth clustered storage

    MapReduce (Processing): fault-tolerant distributed processing

  • Slide 11/31

    Typical Cluster

    3-4,000 commodity servers

    Each server: 2x quad-core CPUs, 16-24 GB RAM, 4-12 TB disk space

    20-30 servers per rack
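
    (A back-of-the-envelope illustration, not from the slides: at the top end, 4,000 servers × 12 TB is about 48 PB of raw disk, and with the three-way block replication described later that is roughly 16 PB of usable storage.)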

  • Slide 12/31

    2 Kinds of Nodes

    Master Nodes

    Slave Nodes

  • Slide 13/31

    Master Nodes

    NameNode: only 1 per cluster; metadata server and database

    SecondaryNameNode: helps with some housekeeping

    JobTracker: only 1 per cluster; job scheduler

  • Slide 14/31

    Slave Nodes

    DataNodes: 1-4,000 per cluster; block data storage

    TaskTrackers: 1-4,000 per cluster; task execution

  • Slide 15/31

    HDFS Basics

    HDFS is a filesystem written in Java (a short client sketch follows below)

    Sits on top of a native filesystem

    Provides redundant storage for massive amounts of data

    Uses cheap(ish), unreliable computers
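
    A minimal sketch of what "a filesystem written in Java" looks like from the client side: writing and reading a file through the HDFS FileSystem API. The path is a hypothetical example; the Configuration picks up the cluster address from core-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);      // HDFS client, layered over the native filesystem

        Path p = new Path("/tmp/hello.txt");       // hypothetical path
        FSDataOutputStream out = fs.create(p, true);
        out.writeUTF("hello, HDFS");
        out.close();

        FSDataInputStream in = fs.open(p);
        System.out.println(in.readUTF());          // prints: hello, HDFS
        in.close();
      }
    }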

  • Slide 16/31

    HDFS Data

    Data is split into blocks and stored on multiple nodes in the cluster

    Each block is usually 64 MB or 128 MB (configurable; see the sketch below)

    Each block is replicated multiple times (configurable)

    Replicas stored on different DataNodes

    Large files, 100 MB+
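
    A hedged sketch of the two configurable knobs just mentioned, using the Hadoop 1.x property names dfs.block.size and dfs.replication (the file path is a made-up example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSettings {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB blocks (Hadoop 1.x name)
        conf.setInt("dfs.replication", 3);                  // three replicas per block

        FileSystem fs = FileSystem.get(conf);
        // Both can also be set per file via the long form of create():
        // create(path, overwrite, bufferSize, replication, blockSize)
        fs.create(new Path("/data/big.log"), true, 4096, (short) 3, 128L * 1024 * 1024).close();
      }
    }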

  • Slide 17/31

    NameNode

    A single NameNode stores all metadata: filenames, locations on DataNodes of each block, owner, group, etc.

    All information maintained in RAM for fast lookup

    Filesystem metadata size is limited to the amount of available RAM on the NameNode
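
    (A rough sizing illustration, using a commonly quoted rule of thumb rather than a number from the slides: if each file, directory, and block costs on the order of 150 bytes of NameNode heap, then 10 million blocks consume roughly 1.5 GB of RAM regardless of how large the blocks themselves are. This is why HDFS favors a small number of large files over many small ones.)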

  • Slide 18/31

    SecondaryNameNode

    The SecondaryNameNode is not a failover NameNode

    Does memory-intensive administrative functions for the NameNode

    Should run on a separate machine

  • Slide 19/31

    DataNode

    DataNodes store file contents

    Stored as opaque blocks on the underlying filesystem

    Different blocks of the same file will be stored on different DataNodes

    Same block is stored on three (or more) DataNodes for redundancy

  • Slide 20/31

    Self-healing

    DataNodes send heartbeats to the NameNode

    After a period without any heartbeats, a DataNode is assumed to be lost

    NameNode determines which blocks were on the lost node

    NameNode finds other DataNodes with copies of these blocks

    These DataNodes are instructed to copy the blocks to other nodes

    Replication is actively maintained

  • Slide 21/31

    HDFS Data Storage

    NameNode holds file metadata; DataNodes hold the actual data

    Block size is 64 MB, 128 MB, etc.; each block replicated three times

    NameNode metadata:
        foo.txt: blk_1, blk_2, blk_3
        bar.txt: blk_4, blk_5

    DataNode block layout (each block held by exactly three DataNodes):
        DataNode 1: blk_1, blk_2, blk_3, blk_5
        DataNode 2: blk_1, blk_3, blk_4
        DataNode 3: blk_1, blk_4, blk_5
        DataNode 4: blk_2, blk_4
        DataNode 5: blk_2, blk_3, blk_5
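
    On a live cluster this layout can be inspected from the client API; a sketch, where /foo.txt is the hypothetical file from the slide:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path("/foo.txt")); // hypothetical file
        // One BlockLocation per block, each listing the DataNodes holding a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset=" + b.getOffset() + " len=" + b.getLength()
              + " hosts=" + String.join(",", b.getHosts()));
        }
      }
    }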

  • Slide 22/31

    What is MapReduce?

    MapReduce is a method for distributing a task across multiple nodes

    Automatic parallelization and distribution

    Each node processes data stored on that node (processing goes to the data)

  • Slide 23/31

    Features of MapReduce

    Fault-tolerance

    Status and monitoring tools

    A clean abstraction for programmers

  • Slide 24/31

    JobTracker

    MapReduce jobs are controlled by a software daemon known as the JobTracker

    The JobTracker resides on a master node

    Assigns Map and Reduce tasks to other nodes on the cluster

    These nodes each run a software daemon known as the TaskTracker

    The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker

  • Slide 25/31

    Two Parts

    Developer specifies two functions: map() and reduce()

    The framework does the rest (a driver sketch follows below)
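
    A minimal sketch of what "the rest" looks like from the developer's side: a driver that wires the two functions into a job, using the newer org.apache.hadoop.mapreduce API. WordCountMapper and WordCountReducer refer to the sketches after slides 28 and 29; input and output paths are supplied on the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // sketch after slide 28
        job.setReducerClass(WordCountReducer.class);  // sketch after slide 29
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }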

  • Slide 26/31

    map()

    The Mapper reads data in the form of key/value pairs

    It outputs zero or more key/value pairs

    map(key_in, value_in) -> (key_out, value_out)

  • Slide 27/31

    reduce()

    After the Map phase, all the intermediate values for a given intermediate key are combined together into a list

    This list is given to one or more Reducers

    The Reducer outputs zero or more final key/value pairs; these are written to HDFS

  • Slide 28/31

    map() Word Count (a Java version follows below)

    map(String input_key, String input_value):
        foreach word w in input_value:
            emit(w, 1)

    Input:  (1234, "to be or not to be")
            (5678, "to see or not to see")

    Output: (to,1), (be,1), (or,1), (not,1), (to,1), (be,1),
            (to,1), (see,1), (or,1), (not,1), (to,1), (see,1)
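
    The same Mapper as real Hadoop Java, mirroring the well-known WordCount example in the org.apache.hadoop.mapreduce API (a sketch, not the slide author's code). The input key is the byte offset of the line, which WordCount ignores:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE); // emit (word, 1)
        }
      }
    }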

  • 8/12/2019 hadoopintro

    29/31

    reduce() Word Countreduce(String output_key, List middle_vals)

    set count = 0

    foreach v in intermediate_vals:

    count += vemit(output_key, count)

    (to, [1,1,1,1])

    (be,[1,1])

    (or,[1,1])

    (not,[1,1])

    (see,[1,1])

    (to, 4)

    (be,2)

    (or,2)

    (not,2)

    (see,2)
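
    And the matching Reducer in Hadoop Java, again following the standard WordCount shape (a sketch under the same assumptions as the Mapper above):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable v : values) {
          count += v.get();           // sum the list of 1s for this word
        }
        result.set(count);
        context.write(key, result);   // e.g. (to, 4)
      }
    }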

  • Slide 30/31

    Resources

    http://hadoop.apache.org/

    http://developer.yahoo.com/hadoop/

    http://www.cloudera.com/resources/?media=Video

  • Slide 31/31

    Questions?