
  • 8/20/2019 Forrester Hadoop Infrastructure Architecture

    1/16

    Forrester Research, Inc., 60 Acorn Park Drive, Cambridge, MA 02140 USA Tel: +1 617.613.6000 |  Fax: +1 617.613.5000 |  www.forrester.com

Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture

    by Richard Fichera, April 9, 2014

    For: Infrastructure

    & Operations

    Professionals

    KEY TAKEAWAYS

Hadoop Provides A Foundational Technology Upon Which To Build Customer Engagement

Big data can correlate data and events from multiple sources. Historically, the tools and infrastructure to do this have been prohibitively expensive. With the advent of Hadoop and its ecosystem of tools, firms looking for an incremental advantage have been able to turn this data into actionable insights in ways that were unimaginable a decade ago.

Hadoop Can Become The Hub Of An Enterprise’s Big Data Strategy

Because Hadoop is an inherently extensible open source system built on an extremely powerful abstraction layer for managing large collections of both structured and unstructured data, it is increasingly becoming an enterprise hub for all big data, and an active community of new and legacy independent software vendors is building upon it.

Hadoop Infrastructure Is Different — But I&O Professionals Need Few New Skills To Deal With It

While the effective use of Hadoop entails complex and, for most organizations, new software skills, the infrastructure for Hadoop can be designed and managed by I&O pros after learning some basic configuration rules and management practices. Generally, no significant new I&O skills are needed to set up and manage a Hadoop environment.

Hadoop Will Drive Organizations Toward DevOps

The Hadoop life cycle is dynamic, with high-velocity change during development, potential movement between cloud prototyping and in-house production, and rapid incremental change of production environments as workloads are added and tuned. This operational profile deeply favors and motivates a strong DevOps process in enterprises adopting Hadoop.



    © 2014, Forrester Research, Inc. All rights reserved. Unauthorized reproduction is strictly prohibited. Information is based on best available

    resources. Opinions reflect judgment at the time and are subject to change. Forrester ®, Technographics®, Forrester Wave, RoleView, TechRadar,

    and Total Economic Impact are trademarks of Forrester Research, Inc. All other trademarks are the property of their respective companies. To

    purchase reprints of this document, please email [email protected]. For additional information, go to www.forrester.com.

    FOR INFRASTRUCTURE & OPERATIONS PROFESSIONALS

    WHY READ THIS REPORT

The proliferation of customer-facing data-intensive systems in almost every modern enterprise has catalyzed the rapid deployment of big data environments, commonly with Hadoop as the underlying processing environment. Unfortunately, infrastructure and operations (I&O) pros have had little guidance in understanding how to configure and manage the underlying infrastructure to support Hadoop and its ecosystem of tools and applications. This report helps I&O professionals understand the basics of Hadoop infrastructure and includes guidelines for system configuration, rough data sizing, and suggestions on how to plan for the inevitable growth of the Hadoop big data environment.

Table Of Contents

The Business Problem — Deriving Time-Sensitive Results From Big Data

Hadoop Basics For I&O Pros — Parallelism, Replication, And Scalability

How To Architect The Right Infrastructure For Hadoop

It’s Alive — And Growing — Staffing And Operations For Hadoop

WHAT IT MEANS

Hadoop Will Become A Critical Part Of Core Enterprise Business

    Notes & Resources

    This report is based on ongoing research

    into the evolution of Hadoop and big data

    infrastructure architecture — specifically how

    to help businesses and IT build platforms

    that will support scalable solutions for

    customer engagement and analytics. Specific

    inputs to this report came from interviews

    with Hadoop and other big data solution

    suppliers, system vendors, Hadoop architects,

    and users of Hadoop, along with collaboration

    with other Forrester analysts, clients, and

    discussions with the members of the Forrester

    Leadership Boards.

    Related Research Documents

    The Forrester Wave™: Big Data Hadoop

    Solutions, Q1 2014

    February 27, 2014

The Forrester Wave™: Enterprise Data Warehouse, Q4 2013

    December 9, 2013

    Predictions 2014: All Things Data

    February 7, 2014

Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture

by Richard Fichera with Laura Koetzle, Brian Hopkins, and Katherine Williamson




    THE BUSINESS PROBLEM — DERIVING TIME-SENSITIVE RESULTS FROM BIG DATA 

The value of big data — some of which exists in the form of “digital breadcrumbs” that customers leave behind as they navigate the Web, some as explicit actions on their part, such as tweets and Facebook entries, and some as structured output from various applications and systems — lies in the ability to correlate data and events from multiple sources. Historically, the ability to rapidly collate these disparate events and chunks of data has been almost nonexistent. With the advent of Hadoop and its ecosystem of tools, companies looking for an incremental advantage have been able to turn this sea of data into actionable insights in ways that were unimaginable a decade ago. For example, Hadoop allows you to:

■ Conduct customer sentiment analysis from Twitter, Facebook, and other sources. For example, film studios want to maximize their revenues, and rapid adjustments to promotional programs based on customer reactions can have a major impact. You can use Hadoop to mine customer sentiment from social sources like Twitter and Facebook, blogs, product reviews, and press articles. In the case of a newly released movie, for example, you can use Hadoop to examine massive numbers of text items, analyze their content, and aggregate the results into a composite metric. You can run the solution against real-time data streams, which allows you to see results in a time frame within which you can make decisions about online marketing programs.1

■ Understand your customers’ life-cycle progress with web clickstream data. You can use Hadoop to analyze the massive data streams generated by active websites to better understand user data, such as how users navigate and how long they look, and the patterns that distinguish an early-stage buyer from a mere window-shopper. Clickstream analysis by the large web companies such as Google and Amazon was one of the earliest commercial uses of Hadoop. Last year, a major shipping and logistics company used Hadoop to analyze weblogs to detect mobile devices so it could more finely tailor online services.2

■ Build a more flexible enterprise data hub. Many firms found that Hadoop’s inherently flexible and scalable architecture coupled with its open source origins made it an attractive enterprise data hub for performing extract, transform, and load (ETL) functions for other existing enterprise systems. By substituting Hadoop for increasingly expensive proprietary ETL solutions such as those from Informatica, Oracle, and SAP, enterprises gain a flexible and extensible utility to connect both current and future systems and applications.3 One major financial services company uncovered massive fraud with a new Hadoop project — and also saved $30 million by substituting Hadoop for conventional ETL and data warehousing tools. FedEx, using existing data from other production systems, used Hadoop to identify high-revenue source and destination ZIP codes and to identify patterns that led to shipment delays.

■ Increase system reliability with sensor and log data analysis. You can use Hadoop to analyze the data generated by sensors and the log data from almost any conceivable equipment to look for patterns and correlations. For example, a major supplier of smart grid metering analyzes electric meter results at the rate of over 1 million meters per second using a combination of


Hadoop and other technologies to improve power pricing and load management. General Electric uses Hadoop for analysis of real-time data from jet engines, wind turbines, locomotives, and other devices to schedule preventative maintenance service before those critical systems fail. Shanghai Telecom processes video data from thousands of monitoring points using Hadoop as a data storage and processing hub — and performance is five times faster, which has improved Shanghai Telecom’s ability to rapidly respond to emergency situations. Other examples include HVAC optimization for office buildings and data centers to save money and reduce energy consumption, and urban traffic flow monitoring and control to reduce congestion. The list is endless — almost every activity we engage in generates sensor and log data that is amenable to analysis, and Hadoop has emerged as the platform of choice.

    We Already Have Tools; Why Have Developers Flocked To Hadoop?

Most of these business problems existed before the phrase “big data” entered common parlance, and we’ve deployed generations of specialized solutions, such as relational databases, business intelligence (BI) tools, and specialized statistical analysis applications, in attempts to solve them. What drives developers and marketers to Hadoop? The migration to Hadoop is primarily driven by three factors:

■ Data type and source generality. Part of Hadoop’s appeal is that it is not specifically optimized for any specific solution or data type but rather a general framework for parallel processing, so your developers and data scientists can add any relevant data, whatever its format or source.4 Other tools, both open source and independent software vendor (ISV) solutions, can be layered on top of this framework, but the basic Hadoop tooling is flexible enough to deal with both structured and unstructured data, batch and streaming data, and can be programmed in almost all standard languages. In addition, Hadoop supports standard connectors such as open database connectivity (ODBC) to enterprise staples like SQL, SAP, and Excel.

■ Strong ecosystem and community. Hadoop reaps the benefit of an active community of open source developers, consultants, and an ever-increasing library of ISV solutions such as Vertica for real-time columnar analytics and MarkLogic for flexible NoSQL queries and transactional capabilities, as well as open source offerings such as HBase and MongoDB that either layer on top of Hadoop or feed/extract data from it. Even the largest of the proprietary ISV solution communities cannot match the sum total of this activity or its rapid growth trajectory. You will never be a technology orphan with Hadoop.

■ Lower cost. Even including the cost of specialized staff and the increasing use of value-added Hadoop distributions and services like Cloudera and Hortonworks, Hadoop is cheaper to get started with than the ISV solutions of previous generations. Additionally, Hadoop was architected to run on lower-cost server and storage infrastructure, which also removes the “hidden” cost of high-end infrastructure.


    Why Do I&O Pros Need To Get Involved With Hadoop Now?

Developers and data geeks have been using Hadoop for a decade in ever-increasing numbers, with an inflection point triggered by Cloudera’s 2008 launch. Because previous uses of Hadoop were all post-processing, your firm’s Hadoop pioneers could start out with any old cheap scale-out-type servers you had lying around; with cloud capacity, they didn’t need any help setting up that sort of vanilla infrastructure, so they didn’t involve the I&O team. But that’s changed — Hadoop infrastructure architecture has burst onto the to-do list of I&O pros. There are two reasons for this:

1. It’s gotten too big to stay in the skunkworks shadows. Your Hadoop cluster has now grown to the point where it’s chewing up a lot of resources, and your developers don’t want to support the infrastructure by themselves.

2. Your firm can win, serve, or retain more customers with higher-performance analytics. Your customer analytics and business insights leaders need higher-performance solutions to insert the perfect advertisement or present the right custom offer to your customers. And that means that developers need help from I&O pros to design the right infrastructure to run on.

    HADOOP BASICS FOR I&O PROS — PARALLELISM, REPLICATION, AND SCALABILITY 

Hadoop is an open source implementation of MapReduce, one of Google’s foundational technologies. Hadoop has emerged as a new way to process and integrate a variety of customer-related data, including clickstream, geographic data, and text, and turn this data into actionable insights.5 It can work in a batch or real-time environment, and it is capable of digesting any data type, both structured and unstructured. Many of these same capabilities are available from legacy analytics and data warehouse systems, but Hadoop can routinely deliver results with superior performance at anywhere from one-fifth to one-tenth the cost.6 And in this case, “cheaper and faster” really does mean better, because the lower cost and greater speed allow you to solve problems that were previously uneconomical to attack.7

The tradeoff for Hadoop’s dramatic improvement in the underlying economics of data analysis is that Hadoop is very different from existing enterprise database processing; it requires entirely new skill sets and has even catalyzed the creation of a new specialty, the “data scientist,” who specializes in architecting the Hadoop data environment and its connections to the rest of the enterprise. Fortunately, as Hadoop has developed, so has its supporting infrastructure, and today’s tools include enterprise staples such as SQL front ends for Hadoop, which makes it much more accessible for traditional programmers and database administrators.


    What’s Under The Hood: Hadoop’s Primary Components

Hadoop is built on a distributed architecture in which each processing node (server) has its own storage and processing capacity and data is moved between nodes as needed but never processed remotely. Fundamentally, the MapReduce technology involves splitting processing into multiple parallel tasks, performing operations on the data (the Map part) and then sorting and aggregating the data (the Reduce function).8 Hadoop is built to run on an Ethernet-connected cluster of basic servers with direct-attached storage, without any hardware redundancy or even RAID protection. It does this by borrowing the replication scheme from Google’s massive file system, in which each block of data is copied to three separate locations to protect against system failure.9 A Hadoop environment is composed of a number of software components, each with its own infrastructure requirements (see Figure 1). Hadoop differs from most environments that I&O pros are familiar with because Hadoop:

■ Uses an architecture that assumes (and tolerates) machine failures. Hadoop was designed with an understanding that with scale, hardware failures are inevitable. The odds of any given disk failing are low, and given expensive and redundant hardware, core enterprise systems can be protected with a high degree of confidence. But with a Hadoop cluster with 1,000 disks (probably in the upper quartile of Hadoop clusters for size, but certainly nowhere near the largest), failures will be common. Hadoop tolerates multiple disk failures gracefully and allows both incremental replacement and more choice in the economics of disk drive selection. As a result, most Hadoop installations use low-cost SAS disks as opposed to high-end small computer system interface (SCSI) disks, and they dispense with RAID entirely.
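A back-of-the-envelope calculation shows why failure tolerance matters at this scale; the 4% annualized failure rate is an illustrative assumption, not a figure from this report:

```python
def expected_disk_failures(disks, afr=0.04):
    """Expected disk failures per year across a cluster.

    afr: annualized failure rate per disk; 4% is an illustrative
    assumption, not a figure from this report.
    """
    per_year = disks * afr
    days_between = 365 / per_year  # mean days between failures, cluster-wide
    return per_year, days_between

failures, days = expected_disk_failures(1000)
print(failures, days)  # 40.0 failures/year, i.e. one roughly every 9 days
```

At that rate, treating a disk failure as a routine event rather than an incident is the only workable operational posture.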

■ Hides the housekeeping. Unlike environments where I&O pros expect to have detailed insight into the performance and usage characteristics of the components and storage, Hadoop allows a Hadoop cluster to be managed as a relatively opaque black box, whose contents are of interest to the Hadoop specialists. I&O pros need only deal with the requirements for storage expansion and any required network changes within the Hadoop cluster.

■ Is relentlessly scalable. Hadoop environments grow. Period. They do not shrink; they are not often “cleaned up”; and, because Hadoop is a universal operating environment for big data, they are “data magnets” once an organization begins to understand the potential of Hadoop. They tend to be populated with data from multiple sources, often in advance of a clear need, and once a given project or analysis experiment is done, the data inevitably stays in the Hadoop cluster, either because the experiment has turned into a production job or so that the data scientist can use it in some undefined future experiment. For the I&O practitioner, this monotonic trend in capacity means that you need a well-articulated process for capacity expansion that allows regular addition of capacity in the form of servers with attached storage.
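The Map-then-Reduce flow described above can be sketched in a few lines of plain Python; this is a single-process illustration of the programming model, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit an intermediate (key, value) pair for each word."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key.
    In Hadoop, the framework does this between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
```

In a real cluster, the map and reduce steps run as parallel tasks on the data nodes that hold the relevant blocks, which is what lets the same logic scale from three lines of text to petabytes.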


Figure 1 Primary Hadoop Components
Source: Forrester Research, Inc.

Component | What it does | How many | Special considerations
Core components — NameNode, MapReduce, OpenJDK, and YARN | Keeps track of the data across the cluster; manages location, replication, and availability; and runs the basic MapReduce logic | 1 master active at any time | Should be configured as a redundant pair
JobTracker | Keeps track of Hadoop jobs across the cluster | 1 master active at any time | Can run on the same server as the NameNode and other core components
HDFS and TaskTracker | Uses the local OS and file system of each node to perform processing | Multiple nodes contain the actual data | This is the component that implements the MapReduce functions and scales as data and processing volumes grow.

    HOW TO ARCHITECT THE RIGHT INFRASTRUCTURE FOR HADOOP

If Hadoop has ended up on I&O’s plate simply because the cluster has grown beyond skunkworks size (meaning you don’t have any high-performance requirements and your customer analytics leaders can’t foresee having any), your job is simple. All you need to do is add compute/storage capacity regularly and occasionally add more network bandwidth if things get slow. If you do have high-performance requirements, your infrastructure choices will mean the difference between success and failure. Here’s what you need to know:

■ The NameNode and JobTracker are critical. The rather oddly named NameNode server is the server that keeps track of the Hadoop data in 64 MB or 128 MB data segments and has long been a source of concern for Hadoop architects as Hadoop has moved from an experimental utility to production status. The recent 2.0 release of Hadoop has added the ability to easily configure redundant NameNodes as a standard feature, easing this concern, and Forrester recommends that any production Hadoop environment be configured this way, since the NameNode servers are typically small-to-medium two-socket servers with only a few disks — a small insurance premium to pay to protect against a major disruption.10 Forrester also recommends running the NameNode and JobTracker on a single node with an identical failover node.11
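Those 64 MB and 128 MB segments determine how much bookkeeping the NameNode carries; a quick sketch (the helper and the 100 TB figure are illustrative, not from the report):

```python
def hdfs_block_count(data_tb, block_mb=128):
    """Number of data segments (before three-way replication) the
    NameNode must track for a given amount of stored data."""
    return (data_tb * 1024 * 1024) // block_mb

print(hdfs_block_count(100))               # 819200 segments at 128 MB
print(hdfs_block_count(100, block_mb=64))  # 1638400 segments at 64 MB
```

Halving the segment size doubles the NameNode's tracking load, which is one reason larger segment sizes became the common default as clusters grew.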


■ Hadoop’s network requirements are not complex . . . Hadoop is designed to run over a standard Ethernet network, and Hadoop clusters use only very basic network functions, so you only need basic network switches. Production Hadoop clusters have three networks — the data cluster network, an administrative network, and a systems management network (the latter two can be collapsed into a single network to keep the servers to a simple dual network interface controller [NIC] configuration for these functions). The data cluster network, over which all the data into and out of the compute nodes will pass, is the critical network resource in a Hadoop cluster.

■ . . . but you’ll need beefy network links to support high-performance customer analytics. Forrester recommends that the Hadoop network connecting the nodes within a rack be 10 Gb by default and that all of the data node servers get redundant 10 Gb links. The connections between racks should be at least 10 Gb, and based on our interviews with Hadoop experts, Forrester recommends 40 Gb interconnects between racks.12 Core enterprise networks are always configured with dual paths, with each server connected to a different logical half of the network, so that processing can continue in the event of a network switch failure. Because Hadoop is likely to become critical for delivering customized services to customers, Forrester recommends that Hadoop be configured with dual network connections.13 Regardless of the choice of dual- or single-path network, the overall topology must be designed so that the network can accommodate additional leaf nodes as the cluster scales.
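As a sanity check on those interconnect recommendations, the rack-to-core oversubscription ratio is easy to compute; the 20-nodes-per-rack figure below is an illustrative assumption:

```python
def rack_oversubscription(nodes_per_rack, node_link_gb, uplink_gb):
    """Ratio of aggregate node bandwidth within a rack to the
    inter-rack uplink bandwidth (higher means more contention)."""
    return (nodes_per_rack * node_link_gb) / uplink_gb

# Assumed example: 20 data nodes per rack with 10 Gb links,
# and one 40 Gb uplink per the recommendation above.
print(rack_oversubscription(20, 10, 40))  # 5.0
```

A 5:1 ratio is only acceptable because most MapReduce traffic stays inside the rack; shuffle-heavy workloads push architects toward fatter uplinks.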

■ You should spend time configuring the data nodes. Hadoop departs from standard enterprise application practice by federating all storage attached to the processing nodes into the global HDFS instead of using centralized network-attached storage (NAS) or storage area network (SAN) for pooled storage.14 This architecture, coupled with the inherent redundancy of the Hadoop environment, allows processing and storage capacity to scale incrementally in lock-step and reduces cost. However, this means that it falls to the infrastructure architect to select the correct ratio of processing to storage. Typically, Hadoop nodes have large disk configurations in relation to the CPU and memory, but Forrester believes that this balance varies widely with the potential workloads. While the actual data capacity per core will vary, the most common practice in configuring Hadoop processing nodes is to allocate one disk per core.
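The one-disk-per-core rule translates into a quick node-capacity estimate; the 16-core and 3 TB figures are illustrative assumptions, and the divisor reflects the three-way replication described earlier:

```python
def data_node_capacity(cores, disk_tb, replication=3):
    """One-disk-per-core rule of thumb: disks, raw TB, and usable TB
    (after HDFS replication) for a single data node."""
    disks = cores                      # one disk per core
    raw_tb = disks * disk_tb
    usable_tb = raw_tb / replication   # three copies of every block
    return disks, raw_tb, usable_tb

print(data_node_capacity(16, 3))  # (16, 48, 16.0)
```

So an assumed 16-core node with 3 TB drives holds 48 TB raw but contributes only about 16 TB of unique data to the cluster.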

■ I&O pros can buy Hadoop-specific configurations today. In response to the variation in workloads, all of the tier one and most of the tier two system vendors offer multiple configurations targeted at Hadoop clusters (see Figure 2). These generally do not include elaborate redundant power supplies, advanced onboard management capabilities, or other legacy enterprise reliability artifacts like extra fans or sensors. I&O pros can also source these configurations from tier two hardware vendors and can get consulting assistance and ongoing support from value-added distribution providers such as Hortonworks and Cloudera, or from regional/local consultancies that focus on Hadoop.15 Some of these vendors offer useful free online sizing and configuration tools.16


    Figure 2 Sample Hadoop Compute And Data Node Configurations

    IT’S ALIVE — AND GROWING — STAFFING AND OPERATIONS FOR HADOOP

We provide the answers to infrastructure and operations professionals’ four most important high-performance Hadoop infrastructure questions.

1. What Skills And Staff Do I&O Pros Need To Run A Hadoop Environment?

Once the Hadoop cluster is up and running, I&O pros have to keep it going. Fortunately, staffing for Hadoop operations involves only basic Linux, storage, and networking skills. An operations group familiar with the installation and operation of standard servers can master the additional Hadoop-specific skills required. I&O pros must manage Hadoop runtime environments with specialized Hadoop management tools like Apache Ambari for cluster deployment and management or Apache Serengeti for managing Hadoop in virtualized environments, plus standard systems management tools like Nagios, iDRAC, Director, or OneView for the basics of server operation. The open source Hadoop distribution includes management tools that allow I&O to look at the cluster operations, workloads, and network activity. Additionally, the value-added suppliers such as Cloudera, Hortonworks, and Intel all supply enhanced management capabilities to enable deployment, updates, backup, and operational monitoring of the Hadoop cluster.

2. How Big Will My Hadoop Cluster Be?

The size of the Hadoop cluster will obviously vary with the amount of data to be processed, but a basic Hadoop sizing rule of thumb is 4 x D, where D is the initial size of the data.17 Space for additional tools on top of the basic Hadoop operating environment will vary, but in general, most tools to access Hadoop using SQL-like queries and other techniques tend to build relatively compact sets of indices on top of Hadoop, and the additional overhead will likely be a single-digit percentage on top of the basic storage sizing. Another interesting metric is cluster size. One major system

Figure 2 Sample Hadoop Compute And Data Node Configurations
Source: Forrester Research, Inc.

Workload | CPU | Memory | Disks/TB | TB/core
Mainstream — counting, correlating, sorting, aggregating tasks like log analysis, basic event correlation, website traffic analysis | 1 or 2 socket x 8–10 core x86, low CPU bin | 64–128 GB | 8–12/8–36 | 0.6–2.25
Computationally intensive MapReduce jobs such as optimization calculations, image analysis, time-series and streaming data, financial analysis, ETL | 2 socket x 10 core x86, high-performance bin | 128–256 GB | 12–24/12–72 | 0.6–3.6
Standard file server (for comparison purposes) | 2 socket x 10 core low bin x86 | 64–128 GB | 24–45/48–135 | 2.4–6.8


vendor’s Hadoop practice noted that the average size of a starter Hadoop cluster was three or four data nodes plus either a single or dual set of servers to run the NameNode and other management components (Forrester recommends the dual configuration).
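The 4 x D rule plus a single-digit index overhead gives a quick capacity estimate; the 5% overhead used here is an illustrative value within that single-digit range, not a figure from the report:

```python
def cluster_storage_tb(initial_data_tb, index_overhead=0.05):
    """Rough Hadoop storage sizing: 4 x D working space plus an
    assumed allowance for SQL-style index layers on top."""
    base = 4 * initial_data_tb
    return round(base * (1 + index_overhead), 1)

print(cluster_storage_tb(100))  # 100 TB of data -> 420.0 TB of cluster storage
```

The 4x multiplier absorbs both the three-way replication and working space for intermediate results, which is why the cluster footprint so quickly outgrows the raw data.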

    3. Can I Run Hadoop On A VM Cluster?

One of the emerging frontiers in Hadoop environments is the use of virtual machines (VMs) as the Hadoop cluster processing nodes.18 If the Hadoop cluster nodes are utilized 20% of the time or less, it’s feasible to run multiple VMs on each node and get increased throughput from the existing servers. There is considerable activity in the Hadoop community, much of it sponsored by VMware, Intel, and some of the system vendors, to add explicit extensions to Hadoop to make it more convenient to deploy and operate in a virtual environment. There are two basic architectures for running Hadoop on VM clusters:

■ Local storage on each node. In this model, each compute node hosts multiple VMs and has its own physical storage. The VMs appear to the Hadoop environment exactly as if they were each a standalone server, and the Hadoop distributed file system (HDFS) handles the allocation and movement of data between the storage visible to each VM as if they were separate physical servers. This architecture does not perturb the basic Hadoop operating or management model, and the Hadoop cluster just looks like it has more servers with less storage per server.

■ Shared storage. This model is less common. The inhibiting factor in a shared storage environment is the complexity of managing the network storage rather than potential performance problems; with current-generation storage arrays and 10G Ethernet and Fibre Channel, the actual transfer of data is no longer an issue.

    The provisioning, presentation, and management of network-attached storage for VM clusters in general, and Hadoop in particular, is changing at a rapid rate, and Forrester believes that within the next 12 to 18 months, we will have multiple mature options to easily provision and manage shared-storage Hadoop clusters. Storage vendors such as EMC with its ViPR product and VMware with its new Virtual SAN (VSAN) offering will streamline shared-storage Hadoop environments.19 Forrester strongly recommends that Hadoop architects consider shared storage environments, particularly solutions such as VSAN that federate the existing storage on individual server nodes, as opposed to solutions that require the purchase of external network-attached storage arrays.

    If you wish to investigate a VM-based Hadoop installation, another alternative is a packaged converged infrastructure (CI) solution that includes the hypervisor, the requisite management tools, and, especially, storage.20
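    The 20% utilization argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is ours, not the report's; the 80% utilization ceiling and the assumption of similarly loaded VMs are illustrative.

    ```python
    # Rough consolidation sketch: if physical Hadoop nodes average 20%
    # utilization, estimate how many similarly loaded VMs one node could host
    # before hitting a target ceiling. All figures are illustrative assumptions.

    def vms_per_node(avg_utilization: float, target_ceiling: float = 0.8) -> int:
        """How many VMs with the given average utilization fit on one node."""
        if not 0 < avg_utilization <= 1:
            raise ValueError("utilization must be in (0, 1]")
        # Small epsilon guards against floating-point truncation (e.g., 0.8 / 0.1).
        return max(1, int(target_ceiling / avg_utilization + 1e-9))

    # A node busy 20% of the time could plausibly host four such VMs
    # under an 80% ceiling.
    print(vms_per_node(0.20))  # -> 4
    ```

    The same arithmetic shows why consolidation stops paying off as node utilization rises: at 50% average utilization, only one VM fits under the ceiling.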

    FOR INFRASTRUCTURE & OPERATIONS PROFESSIONALS

    Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture

    © 2014, Forrester Research, Inc. Reproduction Prohibited. April 9, 2014

    4. Why Not Do Hadoop In The Cloud?

    While there are cloud-based Hadoop offerings from multiple vendors, such as Amazon, Google, and Rackspace, the majority of production Hadoop installations are and will remain on-premises for several reasons:

    ■ Heavy and increasing workloads favor on-premises Hadoop. Hadoop's infrastructure design is comparably cost-effective to the designs used by the cloud providers. Hadoop clusters tend to be heavily utilized, with capacity being added as resources get scarce rather than being massively overprovisioned. These characteristics make the argument for cloud's cost advantage less compelling, since cloud usually compares best against lightly loaded in-house resources. Additionally, as Hadoop becomes a production resource, Hadoop cluster workloads and storage requirements tend to increase without the dramatic peaks and valleys that might make a cloud deployment attractive for its ability to scale down as well as up.21 The best use case for Hadoop in the cloud is development, which almost always requires constant changes to the environment and has the kind of highly variable workload profile that favors cloud. Hadoop in the cloud also reduces the major DevOps (development + operations) overhead associated with physical installations.

    ■ Cloud storage is both slower and more expensive for data sets that just keep growing. Most cost comparisons show that low-cost enterprise storage is still cheaper than cloud for long-term data storage, particularly since low-cost cloud storage may have unacceptably long access times. Also, Hadoop tends to collect 10 times or more data than legacy transactional environments do, data scientists and their customer-focused business stakeholders will almost never want to discard Hadoop data, and the access requirements are unpredictable — all of which favors on-premises storage.

    ■ Data sources and locality make a big difference for performance. In cases where the data is entirely cloud-generated (such as analysis of Twitter, blog posts, and other social media data), running Hadoop clusters in the cloud might make sense. But as Hadoop is used increasingly for real-time customer-facing systems with data coming from multiple venues, I&O pros will likely need to build it out in a physical facility with the right (deterministic bandwidth and latency) network interconnects to minimize the end-to-end latency of the application. Thus, the optimal facility for your Hadoop cluster is likely either your enterprise data center or a colocation or hosting facility with the right peer interconnects — not a cloud environment with unknown and probably longer-latency network connections.


    WHAT IT MEANS

    HADOOP WILL BECOME A CRITICAL PART OF CORE ENTERPRISE BUSINESS

    The capabilities of Hadoop 2.0 will accelerate the use of Hadoop as a real-time platform and as a platform for other analytics software. As the number of applications that require real-time performance grows, these requirements will bias future Hadoop infrastructure in predictable directions:

    ■ Hadoop will include flash memory in the processing nodes. As Hadoop becomes the foundation of real-time processes and applications, the requirements for the processing nodes begin to escalate. While the amount of data that can be handled in a given amount of time scales very well with the number of additional nodes, if the requirements dictate faster response for a given amount of data, the only reasonable solution is to use faster nodes with flash memory. All commercial Hadoop distributions can take advantage of flash memory on the individual compute nodes.

    ■ You will need to install application-specific Hadoop clusters. Because Hadoop encompasses a wide range of processing, as applications scale and as new applications requiring unusual processing come online, it may be necessary to begin to install new Hadoop clusters with application-specific processing nodes for financial risk calculations or time-series analysis.

    ■ Hadoop will become the big data hub and integration point for enterprise systems. As Hadoop becomes more of a closely coupled adjunct to customer-facing systems like eCommerce platforms and mainstream enterprise systems such as enterprise resource planning (ERP) and classic data warehouse applications, the quality of the integration with the enterprise systems becomes critical. In these cases, I&O pros may need to host the Hadoop cluster in the same infrastructure as the enterprise apps to achieve tighter integration with the core applications.

    ENDNOTES

    1 "Real time" is a slippery term, but the original definition from control systems theory is still valid — real-time processing allows decisions to be made within the cycle time of the process in question. In other words, real time means something that happens quickly enough to matter to a person or process waiting for the result. Thus, a signal-processing system might define real time as fractions of a microsecond; an advertising insertion system designed to offer up an ad to a customer viewing a company website might define it as hundreds of milliseconds; and for a marketing campaign, real time might be hours.

    2 This pilot illustrates both the potential of Hadoop and the size of some big data repositories. The pilot project involved processing 570 billion weblog records in 9 minutes on what was described as a "small cluster."
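    A quick arithmetic check (ours, not the report's) shows what that pilot's throughput implies:

    ```python
    # Back-of-the-envelope check on the pilot's figures: 570 billion weblog
    # records in 9 minutes works out to roughly a billion records per second.
    records = 570e9
    seconds = 9 * 60
    records_per_second = records / seconds
    print(f"{records_per_second:.3g} records/second")  # roughly 1.06e9
    ```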


    3 These ETL (extract, transform, load) vendors are all rapidly adapting their products to work with Hadoop, mostly in the form of using HDFS as an underlying data store. But once users get a taste of the potential economics of using an open source stack for what was previously a purely proprietary stack, it becomes difficult for legacy vendors to support historical margins.

    4 This assertion leads instantly to the challenge that Hadoop may in fact be very flexible but not very efficient at any particular task. There is some truth to that assertion — an optimized columnar database like Vertica or Netezza may be frighteningly efficient at certain kinds of queries but will simply not be able to do others involving different data types. The beauty of Hadoop's architecture is that, in exchange for some lack of optimization, it allows an almost infinite plasticity in terms of problems and data types. In addition, the lack of optimization is somewhat offset by the fact that Hadoop can be composed from commodity components, so the Hadoop solution vendor cannot extract much of a premium for integrating the hardware platform, unlike many other commercial solutions. However, much of Hadoop's attraction also lies in the fact that MapReduce is a separate function from the Hadoop file system, and many specialized applications, Vertica among them, are integrating with HDFS to take advantage of its reliable and highly scalable storage architecture.

    5 We might as well deal with the inevitable right off the bat — how the heck did it get named Hadoop? Hadoop was initially created by Doug Cutting and Mike Cafarella, and Hadoop was the name of Cutting's young son's favorite stuffed elephant.

    6 Hadoop can also be used internally for operational improvement. It is a powerful platform for analysis of log files from servers, network equipment, and any of the myriad devices that spit out data as they operate. Using Hadoop can simplify and accelerate turning these inchoate streams of data into real understanding about efficiencies, actual costs, and compliance.

    7 As more people can use a previously unavailable technology and apply it to problems that were previously uneconomical to attack, more benefits accrue to more players in the economy. This will follow the same path as the evolution of supercomputing and advanced design automation, which were originally used mostly for aerospace and defense products — today, many consumer goods are designed with the next generation of those same tools. In the case of Hadoop, for example, the ability to apply this technology to a single movie release, where the goal is to raise the box-office take by a few million dollars in a few markets, would have been completely uneconomical five years ago.

    8 The details are out of scope for this report, but I&O pros can get a foundation in MapReduce fundamentals from the Hadoop website (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html).

    9 For the technically inclined: the replication factor is configurable, but most installations seem to be sticking with the default 3x. The blocks in HDFS are huge, either 64 MB or 128 MB, so the amount of metadata that the NameNode server must keep track of is manageable, with only eight or 16 unique blocks per GB (8,000 or 16,000 per TB), times three for the replicas. At PB scales, HDFS, probably using the 128 MB block size, will have tens of millions of block entries, small enough to keep large portions of the metadata tables in memory for efficient operation.
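    The block arithmetic in this endnote can be sketched directly. The snippet below is our illustration; it assumes the default 3x replication and counts only block entries, ignoring per-file and per-directory metadata.

    ```python
    # NameNode metadata arithmetic: number of block entries for a given data set.
    # Assumes default 3x replication; ignores per-file and directory metadata.

    def hdfs_block_entries(data_bytes: int, block_mb: int = 128,
                           replication: int = 3) -> int:
        """Block replicas the NameNode must track for `data_bytes` of data."""
        block_bytes = block_mb * 1024 ** 2
        # Ceiling division: a partial trailing block still costs one entry.
        unique_blocks = -(-data_bytes // block_bytes)
        return unique_blocks * replication

    TB, PB = 1024 ** 4, 1024 ** 5
    print(hdfs_block_entries(TB))  # 8,192 unique blocks x 3 = 24,576 entries
    print(hdfs_block_entries(PB))  # about 25 million entries at 1 PB
    ```

    The 1 PB result — roughly 25 million entries at the 128 MB block size — matches the endnote's "tens of millions" figure.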


    10 The failure of a nonredundant NameNode is not a total disaster, because the NameNode writes a log file that can be used to reconstruct the data. But it can be a long process — think days, not minutes or a few hours — for even modest 10 TB to 100 TB file systems.

    11 Hadoop 2.0 has also added a number of major enhancements, particularly the ability for workloads other than MapReduce to run on top of HDFS, enabling a wide range of third-party tools and other open source projects to take advantage of Hadoop's robust federated storage and processing architecture. Many of these tools were already available, but the incorporation of support for a general-purpose extension to MapReduce (YARN) makes Hadoop 2.0 a much more suitable general-purpose big data and analytics platform.

    12 Given the continued cost decline in network switches, Forrester recommends that I&O groups contemplating implementation of high-performance Hadoop clusters evaluate 40 Gb inter-rack links. For environments where the MapReduce jobs are primarily aggregation, enumeration, and sorting (the traditional Hadoop workloads), a simple 1 Gb NIC per server may be sufficient. However, as Hadoop workloads grow to incorporate real-time (what Hadoop practitioners often refer to as continuous processing) as well as batch data from other enterprise and external data sources, the additional jobs and the constant ETL processing can add significant network traffic, and the traditional 1 Gb connection may become a bottleneck.

    13 A dual network will entail twice the cost for network equipment and a slight additional cost for dual-NIC configurations on each server. Forrester cannot make a blanket recommendation on this aspect of the infrastructure but cautions that reconfiguring from a single- to a dual-path network is very disruptive and time-consuming — assume that it will take a day per rack, provided that you configured the racks correctly in the first place to have the power and space to accommodate the additional switches.

    14 Hadoop was not the first scalable software environment to use this technique, drawn as it was from Google's proprietary MapReduce concept, and similar concepts can be found underlying earlier experiments and products such as AFS (the Andrew File System), DCE (the Distributed Computing Environment), and IBM's GPFS (General Parallel File System), but it is arguably the most successful and rapidly growing example.

    15 Intel has an alliance with Hortonworks through which it has produced a version of Hadoop optimized for x86 execution, as well as actively supporting an alternative Apache Hadoop distribution.

    16 Hortonworks' sizing tool is an excellent example. Source: Hortonworks (http://hortonworks.com/resources/cluster-sizing-guide/?utm_source=google&utm_medium=ppc&utm_campaign=Sitelinks).

    17 Here's the derivation for 4 x D: You start by assuming that the data will require (R x D) + S, where R is the replication factor, which most installations leave at the default of 3; D is the initial size of the data; and S is the amount of space allocated for the "shuffle" phase of the MapReduce process, where intermediate blocks of results are moved from node to node and aggregated. One senior Hadoop consultant noted that the usual rule of thumb for shuffle space is to allow approximately the same amount of space as the original data before replication. Using this formula gives us a basic Hadoop sizing rule of thumb of 4 x D.
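    The derivation can be written as a one-line sizing function; the sketch below is ours, with the shuffle allowance exposed as a parameter so the rule can be varied.

    ```python
    # Capacity rule of thumb from the endnote: raw capacity = (R x D) + S, with
    # replication R = 3 and shuffle space S assumed equal to the unreplicated
    # data size D, which yields the 4 x D rule.

    def hadoop_raw_capacity(data_tb: float, replication: int = 3,
                            shuffle_ratio: float = 1.0) -> float:
        """Raw cluster storage (TB) needed for `data_tb` of unreplicated data."""
        return (replication + shuffle_ratio) * data_tb

    print(hadoop_raw_capacity(100))  # 100 TB of data -> 400.0 TB raw, i.e., 4 x D
    ```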

    18 Currently there is only a limited sample of production environments, but the number is growing for the same reasons that other applications initially moved to VMs from physical infrastructure — capital resource efficiency and management cost.


    19 VSAN in its initial version(s) will almost certainly be optimized for the random read/write mix characteristic of a general-purpose enterprise VM cluster, as opposed to the read-dominated and large-block transfer patterns of a Hadoop cluster, but it is hard to imagine that VMware will not pursue Hadoop optimization in the near-term future.

    20 While a wide range of offerings from vendors such as HP, IBM, Cisco, and Dell bundle storage, network, and compute nodes as integrated VM clusters, Forrester recommends also evaluating newer options from Nutanix and SimpliVity for virtualized Hadoop because of their deeply integrated federated storage architectures, which may ameliorate many of the complexities of managing Hadoop storage resources. Also worthy of additional scrutiny are advanced storage-centric products like Maxta, Tintri, and Atlantis Computing, which dramatically simplify the task of deploying and managing VMware storage. Emerging tools such as EMC ViPR, which can overlay an HDFS definition on top of an existing SAN, offer additional potential for simplifying storage management of Hadoop on a VM cluster. To the extent that VMware makes its VSAN product relevant to Hadoop, all other solutions may lose a great deal of relevance.

    21 One example that we were able to find was a 200-VM instance running on Amazon's Elastic MapReduce service that generated bills in excess of $40,000 per month before the user brought the workload in-house onto an eight-node x86 Hadoop cluster.
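    A rough break-even calculation illustrates why that user moved in-house. Only the $40,000 monthly bill and the eight-node count come from the cited example; the per-node hardware cost below is our assumption for illustration, and operating costs are ignored.

    ```python
    # Hypothetical break-even sketch for the example above. The per-node cost is
    # an assumed figure; power, space, and admin costs are deliberately ignored.
    emr_monthly_cost = 40_000        # from the cited example, USD per month
    nodes = 8                        # from the cited example
    assumed_cost_per_node = 15_000   # assumption: all-in hardware cost per node

    capital_cost = nodes * assumed_cost_per_node
    breakeven_months = capital_cost / emr_monthly_cost
    print(breakeven_months)  # -> 3.0 (months until the hardware outlay is recovered)
    ```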


    Forrester Research (Nasdaq: FORR) is a global research and advisory firm serving professionals in 13 key roles across three distinct client segments. Our clients face progressively complex business and technology decisions every day. To help them understand, strategize, and act upon opportunities brought by change, Forrester provides proprietary research, consumer and business data, custom consulting, and events.

    Forrester Focuses On Infrastructure & Operations Professionals

    You are responsible for identifying — and justifying — which technologies

    and process changes will help you transform and industrialize your

    company’s infrastructure and create a more productive, resilient, and

    effective IT organization. Forrester’s subject-matter expertise and

    deep understanding of your role will help you create forward-thinking

    strategies; weigh opportunity against risk; justify decisions; and optimize

    your individual, team, and corporate performance.

    IAN OLIVER, client persona representing Infrastructure & Operations Professionals

     About Forrester

    A global research and advisory firm, Forrester inspires leaders, informs better decisions, and helps the world's top companies turn the complexity of change into business advantage. Our research-based insight and objective advice enable IT professionals to

    lead more successfully within IT and extend their impact beyond

    the traditional IT organization. Tailored to your individual role, our

    resources allow you to focus on important business issues —

    margin, speed, growth — first, technology second.

    FOR MORE INFORMATION

    To find out how Forrester Research can help you be successful every day, please contact the office nearest you, or visit us at www.forrester.com. For a complete list of worldwide locations, visit www.forrester.com/about.

    CLIENT SUPPORT

    For information on hard-copy or electronic reprints, please contact Client Support at +1 866.367.7378, +1 617.613.5730, or [email protected]. We offer quantity discounts and special pricing for academic and nonprofit institutions.
