
  • 8/20/2019 Forrester Hadoop Infrastructure Architecture

    1/16

    Forrester Research, Inc., 60 Acorn Park Drive, Cambridge, MA 02140 USA Tel: +1 617.613.6000 |  Fax: +1 617.613.5000 |  www.forrester.com

Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture

    by Richard Fichera, April 9, 2014

    For: Infrastructure

    & Operations

    Professionals

    KEY TAKEAWAYS

Hadoop Provides A Foundational Technology Upon Which To Build Customer Engagement

Big data can correlate data and events from multiple sources. Historically, the tools and infrastructure to do this have been prohibitively expensive. With the advent of Hadoop and its ecosystem of tools, firms looking for an incremental advantage have been able to turn this data into actionable insights in ways that were unimaginable a decade ago.

Hadoop Can Become The Hub Of An Enterprise’s Big Data Strategy

Because Hadoop is an inherently extensible open source system built on an extremely powerful abstraction layer for managing large collections of both structured and unstructured data, it is increasingly becoming an enterprise hub for all big data, and an active community of new and legacy independent software vendors is building upon it.

Hadoop Infrastructure Is Different — But I&O Professionals Need Few New Skills To Deal With It

While the effective use of Hadoop entails complex and, for most organizations, new software skills, the infrastructure for Hadoop can be designed and managed by I&O pros after learning some basic configuration rules and management practices. Generally, no significant new I&O skills are needed to set up and manage a Hadoop environment.

Hadoop Will Drive Organizations Toward DevOps

The Hadoop life cycle is dynamic, with high-velocity change during development, potential movement between cloud prototyping and in-house production, and rapid incremental change of production environments as workloads are added and tuned. This operational profile deeply favors and motivates a strong DevOps process in enterprises adopting Hadoop.



    © 2014, Forrester Research, Inc. All rights reserved. Unauthorized reproduction is strictly prohibited. Information is based on best available

    resources. Opinions reflect judgment at the time and are subject to change. Forrester ®, Technographics®, Forrester Wave, RoleView, TechRadar,

    and Total Economic Impact are trademarks of Forrester Research, Inc. All other trademarks are the property of their respective companies. To

    purchase reprints of this document, please email [email protected]. For additional information, go to www.forrester.com.

    FOR INFRASTRUCTURE & OPERATIONS PROFESSIONALS

    WHY READ THIS REPORT

The proliferation of customer-facing data-intensive systems in almost every modern enterprise has catalyzed the rapid deployment of big data environments, commonly with Hadoop as the underlying processing environment. Unfortunately, infrastructure and operations (I&O) pros have had little guidance in understanding how to configure and manage the underlying infrastructure to support Hadoop and its ecosystem of tools and applications. This report helps I&O professionals understand the basics of Hadoop infrastructure and includes guidelines for system configuration, rough data sizing, and suggestions on how to plan for the inevitable growth of the Hadoop big data environment.

Table Of Contents

The Business Problem — Deriving Time-Sensitive Results From Big Data

Hadoop Basics For I&O Pros — Parallelism, Replication, And Scalability

How To Architect The Right Infrastructure For Hadoop

It’s Alive — And Growing — Staffing And Operations For Hadoop

WHAT IT MEANS

Hadoop Will Become A Critical Part Of Core Enterprise Business

    Notes & Resources

    This report is based on ongoing research

    into the evolution of Hadoop and big data

    infrastructure architecture — specifically how

    to help businesses and IT build platforms

    that will support scalable solutions for

    customer engagement and analytics. Specific

    inputs to this report came from interviews

    with Hadoop and other big data solution

    suppliers, system vendors, Hadoop architects,

    and users of Hadoop, along with collaboration

    with other Forrester analysts, clients, and

    discussions with the members of the Forrester

    Leadership Boards.

    Related Research Documents

    The Forrester Wave™: Big Data Hadoop

    Solutions, Q1 2014

    February 27, 2014

The Forrester Wave™: Enterprise Data Warehouse, Q4 2013

    December 9, 2013

    Predictions 2014: All Things Data

    February 7, 2014

Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture

by Richard Fichera with Laura Koetzle, Brian Hopkins, and Katherine Williamson




    THE BUSINESS PROBLEM — DERIVING TIME-SENSITIVE RESULTS FROM BIG DATA 

The value of big data — some of which exists in the form of “digital breadcrumbs” that customers leave behind as they navigate the Web, some as explicit actions on their part, such as tweets and Facebook entries, and some as structured output from various applications and systems — lies in the ability to correlate data and events from multiple sources. Historically, the ability to rapidly collate these disparate events and chunks of data has been almost nonexistent. With the advent of Hadoop and its ecosystem of tools, companies looking for an incremental advantage have been able to turn this sea of data into actionable insights in ways that were unimaginable a decade ago. For example, Hadoop allows you to:

■ Conduct customer sentiment analysis from Twitter, Facebook, and other sources. For example, film studios want to maximize their revenues, and rapid adjustments to promotional programs based on customer reactions can have a major impact. You can use Hadoop to mine customer sentiment from social sources like Twitter and Facebook, blogs, product reviews, and press articles. In the case of a newly released movie, for example, you can use Hadoop to examine massive numbers of text items, analyze their content, and aggregate the results into a composite metric. You can run the solution against real-time data streams, which allows you to see results in a time frame within which you can make decisions about online marketing programs.1

■ Understand your customers’ life-cycle progress with web clickstream data. You can use Hadoop to analyze the massive data streams generated by active websites to better understand user data, such as how users navigate and how long they look, and the patterns that distinguish an early-stage buyer from a mere window-shopper. Clickstream analysis by the large web companies such as Google and Amazon was one of the earliest commercial uses of Hadoop. Last year, a major shipping and logistics company used Hadoop to analyze weblogs to detect mobile devices so it could more finely tailor online services.2

■ Build a more flexible enterprise data hub. Many firms found that Hadoop’s inherently flexible and scalable architecture coupled with its open source origins made it an attractive enterprise data hub for performing extract, transform, and load (ETL) functions for other existing enterprise systems. By substituting Hadoop for increasingly expensive proprietary ETL solutions such as those from Informatica, Oracle, and SAP, enterprises gain a flexible and extensible utility to connect both current and future systems and applications.3 One major financial services company uncovered massive fraud with a new Hadoop project — and also saved $30 million by substituting Hadoop for conventional ETL and data warehousing tools. FedEx, using existing data from other production systems, used Hadoop to identify high-revenue source and destination ZIP codes and to identify patterns that led to shipment delays.

■ Increase system reliability with sensor and log data analysis. You can use Hadoop to analyze the data generated by sensors and the log data from almost any conceivable equipment to look for patterns and correlations. For example, a major supplier of smart grid metering analyzes electric meter results at the rate of over 1 million meters per second using a combination of


Hadoop and other technologies to improve power pricing and load management. General Electric uses Hadoop for analysis of real-time data from jet engines, wind turbines, locomotives, and other devices to schedule preventative maintenance service before those critical systems fail. Shanghai Telecom processes video data from thousands of monitoring points using Hadoop as a data storage and processing hub — and performance is five times faster, which has improved Shanghai Telecom’s ability to rapidly respond to emergency situations. Other examples include HVAC optimization for office buildings and data centers to save money and reduce energy consumption, and urban traffic flow monitoring and control to reduce congestion. The list is endless — almost every activity we engage in generates sensor and log data that is amenable to analysis, and Hadoop has emerged as the platform of choice.

    We Already Have Tools; Why Have Developers Flocked To Hadoop?

Most of these business problems existed before the phrase “big data” entered common parlance, and we’ve deployed generations of specialized solutions, such as relational databases, business intelligence (BI) tools, and specialized statistical analysis applications, in attempts to solve them. What drives developers and marketers to Hadoop? The migration to Hadoop is primarily driven by three factors:

■ Data type and source generality. Part of Hadoop’s appeal is that it is not specifically optimized for any specific solution or data type but rather a general framework for parallel processing, so your developers and data scientists can add any relevant data, whatever its format or source.4 Other tools, both open source and independent software vendor (ISV) solutions, can be layered on top of this framework, but the basic Hadoop tooling is flexible enough to deal with both structured and unstructured data, batch and streaming data, and can be programmed in almost all standard languages. In addition, Hadoop supports standard connectors such as open database connectivity (ODBC) to enterprise staples like SQL, SAP, and Excel.

■ Strong ecosystem and community. Hadoop reaps the benefit of an active community of open source developers, consultants, and an ever-increasing library of ISV solutions such as Vertica for real-time columnar analytics and MarkLogic for flexible NoSQL queries and transactional capabilities, as well as open source offerings such as HBase and MongoDB that either layer on top of Hadoop or feed/extract data from it. Even the largest of the proprietary ISV solution communities cannot match the sum total of this activity or its rapid growth trajectory. You will never be a technology orphan with Hadoop.

■ Lower cost. Even including the cost of specialized staff and the increasing use of value-added Hadoop distributions and services like Cloudera and Hortonworks, Hadoop is cheaper to get started with than the ISV solutions of previous generations. Additionally, Hadoop was architected to run on lower-cost server and storage infrastructure, which also removes the “hidden” cost of high-end infrastructure.


    Why Do I&O Pros Need To Get Involved With Hadoop Now?

Developers and data geeks have been using Hadoop for a decade in ever-increasing numbers, with an inflection point triggered by Cloudera’s 2008 launch. Because previous uses of Hadoop were all post-processing, your firm’s Hadoop pioneers could start out with any old cheap scale-out-type servers you had lying around; with cloud capacity, they didn’t need any help setting up that sort of vanilla infrastructure, so they didn’t involve the I&O team. But that’s changed — Hadoop infrastructure architecture has burst onto the to-do list of I&O pros. There are two reasons for this:

1. It’s gotten too big to stay in the skunkworks shadows. Your Hadoop cluster has now grown to the point where it’s chewing up a lot of resources, and your developers don’t want to support the infrastructure by themselves.

2. Your firm can win, serve, or retain more customers with higher-performance analytics. Your customer analytics and business insights leaders need higher-performance solutions to insert the perfect advertisement or present the right custom offer to your customers. And that means that developers need help from I&O pros to design the right infrastructure to run on.

    HADOOP BASICS FOR I&O PROS — PARALLELISM, REPLICATION, AND SCALABILITY 

Hadoop is an open source implementation of MapReduce, one of Google’s foundational technologies. Hadoop has emerged as a new way to process and integrate a variety of customer-related data, including clickstream, geographic data, and text, and turn this data into actionable insights.5 It can work in a batch or real-time environment, and it is capable of digesting any data type, both structured and unstructured. Many of these same capabilities are available from legacy analytics and data warehouse systems, but Hadoop can routinely deliver results with superior performance at anywhere from one-fifth to one-tenth the cost.6 And in this case, “cheaper and faster” really does mean better, because the lower cost and greater speed allow you to solve problems that were previously uneconomical to attack.7

The tradeoff for Hadoop’s dramatic improvement in the underlying economics of data analysis is that Hadoop is very different from existing enterprise database processing; it requires entirely new skill sets and has even catalyzed the creation of a new specialty, the “data scientist,” who specializes in architecting the Hadoop data environment and its connections to the rest of the enterprise. Fortunately, as Hadoop has developed, so has its supporting infrastructure, and today’s tools include enterprise staples such as SQL front ends for Hadoop, which makes it much more accessible for traditional programmers and database administrators.


    What’s Under The Hood: Hadoop’s Primary Components

Hadoop is built on a distributed architecture in which each processing node (server) has its own storage and processing capacity and data is moved between nodes as needed but never processed remotely. Fundamentally, the MapReduce technology involves splitting processing into multiple parallel tasks, performing operations on the data (the Map part) and then sorting and aggregating the data (the Reduce function).8 Hadoop is built to run on an Ethernet-connected cluster of basic servers with direct-attached storage, without any hardware redundancy or even RAID protection. It does this by borrowing the replication scheme from Google’s massive file system, in which each block of data is copied to three separate locations to protect against system failure.9 A Hadoop environment is composed of a number of software components, each with its own infrastructure requirements (see Figure 1). Hadoop differs from most environments that I&O pros are familiar with because Hadoop:

■ Uses an architecture that assumes (and tolerates) machine failures. Hadoop was designed with an understanding that with scale, hardware failures are inevitable. The odds of any given disk failing are low, and given expensive and redundant hardware, core enterprise systems can be protected with a high degree of confidence. But with a Hadoop cluster with 1,000 disks (probably in the upper quartile of Hadoop clusters for size, but certainly nowhere near the largest), failures will be common. Hadoop tolerates multiple disk failures gracefully and allows both incremental replacement and more choice in the economics of disk drive selection. As a result, most Hadoop installations use low-cost SAS disks as opposed to high-end small computer system interface (SCSI) disks, and they dispense with RAID entirely.
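A back-of-the-envelope calculation shows why failure tolerance matters at this scale; the 4% annualized failure rate is an illustrative assumption, not a figure from this report:

```python
def expected_disk_failures(disks, afr=0.04):
    """Expected disk failures per year across a cluster.

    afr: annualized failure rate per disk; 4% is an illustrative
    assumption, not a figure from this report.
    """
    per_year = disks * afr
    days_between = 365 / per_year  # mean days between failures, cluster-wide
    return per_year, days_between

failures, days = expected_disk_failures(1000)
print(failures, days)  # 40.0 failures/year, i.e. one roughly every 9 days
```

At that rate, treating a disk failure as a routine event rather than an incident is the only workable operational posture.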

■ Hides the housekeeping. Unlike environments where I&O pros expect to have detailed insight into the performance and usage characteristics of the components and storage, Hadoop allows a Hadoop cluster to be managed as a relatively opaque black box, whose contents are of interest to the Hadoop specialists. I&O pros need only deal with the requirements for storage expansion and any required network changes within the Hadoop cluster.

■ Is relentlessly scalable. Hadoop environments grow. Period. They do not shrink; they are not often “cleaned up”; and, because Hadoop is a universal operating environment for big data, they are “data magnets” once an organization begins to understand the potential of Hadoop. They tend to be populated with data from multiple sources, often in advance of a clear need, and once a given project or analysis experiment is done, the data inevitably stays in the Hadoop cluster, either because the experiment has turned into a production job or so that the data scientist can use it in some undefined future experiment. For the I&O practitioner, this monotonic trend in capacity means that you need a well-articulated process for capacity expansion that allows regular addition of capacity in the form of servers with attached storage.
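The Map-then-Reduce flow described above can be sketched in a few lines of plain Python; this is a single-process illustration of the programming model, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit an intermediate (key, value) pair for each word."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key.
    In Hadoop, the framework does this between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
```

In a real cluster, the map and reduce steps run as parallel tasks on the data nodes that hold the relevant blocks, which is what lets the same logic scale from three lines of text to petabytes.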


Figure 1 Primary Hadoop Components
Source: Forrester Research, Inc.

Component | What it does | How many | Special considerations
Core components — NameNode, MapReduce, OpenJDK, and YARN | Keeps track of the data across the cluster; manages location, replication, and availability; and runs the basic MapReduce logic | 1 master active at any time | Should be configured as a redundant pair
JobTracker | Keeps track of Hadoop jobs across the cluster | 1 master active at any time | Can run on the same server as the NameNode and other core components
HDFS and TaskTracker | Uses the local OS and file system of each node to perform processing | Multiple nodes contain the actual data | This is the component that implements the MapReduce functions and scales as data and processing volumes grow.

    HOW TO ARCHITECT THE RIGHT INFRASTRUCTURE FOR HADOOP

If Hadoop has ended up on I&O’s plate simply because the cluster has grown beyond skunkworks size (meaning you don’t have any high-performance requirements and your customer analytics leaders can’t foresee having any), your job is simple. All you need to do is add compute/storage capacity regularly and occasionally add more network bandwidth if things get slow. If you do have high-performance requirements, your infrastructure choices will mean the difference between success and failure. Here’s what you need to know:

■ The NameNode and JobTracker are critical. The rather oddly named NameNode server is the server that keeps track of the Hadoop data in 64 MB or 128 MB data segments and has long been a source of concern for Hadoop architects as Hadoop has moved from an experimental utility to production status. The recent 2.0 release of Hadoop has added the ability to easily configure redundant NameNodes as a standard feature, easing this concern, and Forrester recommends that any production Hadoop environment be configured this way, since the NameNode servers are typically small-to-medium two-socket servers with only a few disks — a small insurance premium to pay to protect against a major disruption.10 Forrester also recommends running the NameNode and JobTracker on a single node with an identical failover node.11
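Those 64 MB and 128 MB segments determine how much bookkeeping the NameNode carries; a quick sketch (the helper and the 100 TB figure are illustrative, not from the report):

```python
def hdfs_block_count(data_tb, block_mb=128):
    """Number of data segments (before three-way replication) the
    NameNode must track for a given amount of stored data."""
    return (data_tb * 1024 * 1024) // block_mb

print(hdfs_block_count(100))               # 819200 segments at 128 MB
print(hdfs_block_count(100, block_mb=64))  # 1638400 segments at 64 MB
```

Halving the segment size doubles the NameNode's tracking load, which is one reason larger segment sizes became the common default as clusters grew.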


■ Hadoop’s network requirements are not complex . . . Hadoop is designed to run over a standard Ethernet network, and Hadoop clusters use only very basic network functions, so you only need basic network switches. Production Hadoop clusters have three networks — the data cluster network, an administrative network, and a systems management network (the latter two can be collapsed into a single network to keep the servers to a simple dual network interface controller [NIC] configuration for these functions). The data cluster network, over which all the data into and out of the compute nodes will pass, is the critical network resource in a Hadoop cluster.

■ . . . but you’ll need beefy network links to support high-performance customer analytics. Forrester recommends that the Hadoop network connecting the nodes within a rack be 10 Gb by default and that all of the data node servers get redundant 10 Gb links. The connections between racks should be at least 10 Gb, and based on our interviews with Hadoop experts, Forrester recommends 40 Gb interconnects between racks.12 Core enterprise networks are always configured with dual paths, with each server connected to a different logical half of the network, so that processing can continue in the event of a network switch failure. Because Hadoop is likely to become critical for delivering customized services to customers, Forrester recommends that Hadoop be configured with dual network connections.13 Regardless of the choice of dual- or single-path network, the overall topology must be designed so that the network can accommodate additional leaf nodes as the cluster scales.
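As a sanity check on those interconnect recommendations, the rack-to-core oversubscription ratio is easy to compute; the 20-nodes-per-rack figure below is an illustrative assumption:

```python
def rack_oversubscription(nodes_per_rack, node_link_gb, uplink_gb):
    """Ratio of aggregate node bandwidth within a rack to the
    inter-rack uplink bandwidth (higher means more contention)."""
    return (nodes_per_rack * node_link_gb) / uplink_gb

# Assumed example: 20 data nodes per rack with 10 Gb links,
# and one 40 Gb uplink per the recommendation above.
print(rack_oversubscription(20, 10, 40))  # 5.0
```

A 5:1 ratio is only acceptable because most MapReduce traffic stays inside the rack; shuffle-heavy workloads push architects toward fatter uplinks.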

■ You should spend time configuring the data nodes. Hadoop departs from standard enterprise application practice by federating all storage attached to the processing nodes into the global HDFS instead of using centralized network-attached storage (NAS) or storage area network (SAN) for pooled storage.14 This architecture, coupled with the inherent redundancy of the Hadoop environment, allows processing and storage capacity to scale incrementally in lock-step and reduces cost. However, this means that it falls to the infrastructure architect to select the correct ratio of processing to storage. Typically, Hadoop nodes have large disk configurations in relation to the CPU and memory, but Forrester believes that this balance varies widely with the potential workloads. While the actual data capacity per core will vary, the most common practice in configuring Hadoop processing nodes is to allocate one disk per core.
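The one-disk-per-core rule translates into a quick node-capacity estimate; the 16-core and 3 TB figures are illustrative assumptions, and the divisor reflects the three-way replication described earlier:

```python
def data_node_capacity(cores, disk_tb, replication=3):
    """One-disk-per-core rule of thumb: disks, raw TB, and usable TB
    (after HDFS replication) for a single data node."""
    disks = cores                      # one disk per core
    raw_tb = disks * disk_tb
    usable_tb = raw_tb / replication   # three copies of every block
    return disks, raw_tb, usable_tb

print(data_node_capacity(16, 3))  # (16, 48, 16.0)
```

So an assumed 16-core node with 3 TB drives holds 48 TB raw but contributes only about 16 TB of unique data to the cluster.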

■ I&O pros can buy Hadoop-specific configurations today. In response to the variation in workloads, all of the tier one and most of the tier two system vendors offer multiple configurations targeted at Hadoop clusters (see Figure 2). These generally do not include elaborate redundant power supplies, advanced onboard management capabilities, or other legacy enterprise reliability artifacts like extra fans or sensors. I&O pros can also source these configurations from tier two hardware vendors and can get consulting assistance and ongoing support from value-added distribution providers such as Hortonworks and Cloudera, or from regional/local consultancies that focus on Hadoop.15 Some of these vendors offer useful free online sizing and configuration tools.16


    Figure 2 Sample Hadoop Compute And Data Node Configurations

    IT’S ALIVE — AND GROWING — STAFFING AND OPERATIONS FOR HADOOP

We provide the answers to infrastructure and operations professionals’ four most important high-performance Hadoop infrastructure questions.

1. What Skills And Staff Do I&O Pros Need To Run A Hadoop Environment?

Once the Hadoop cluster is up and running, I&O pros have to keep it going. Fortunately, staffing for Hadoop operations involves only basic Linux, storage, and networking skills. An operations group familiar with the installation and operation of standard servers can master the additional Hadoop-specific skills required. I&O pros must manage Hadoop runtime environments with specialized Hadoop management tools like Apache Ambari for cluster deployment and management or Apache Serengeti for managing Hadoop in virtualized environments, plus standard systems management tools like Nagios, iDRAC, Director, or OneView for the basics of server operation. The open source Hadoop distribution includes management tools that allow I&O to look at the cluster operations, workloads, and network activity. Additionally, the value-added suppliers such as Cloudera, Hortonworks, and Intel all supply enhanced management capabilities to enable deployment, updates, backup, and operational monitoring of the Hadoop cluster.

2. How Big Will My Hadoop Cluster Be?

The size of the Hadoop cluster will obviously vary with the amount of data to be processed, but a basic Hadoop sizing rule of thumb is 4 x D, where D is the initial size of the data.17 Space for additional tools on top of the basic Hadoop operating environment will vary, but in general, most tools to access Hadoop using SQL-like queries and other techniques tend to build relatively compact sets of indices on top of Hadoop, and the additional overhead will likely be a single-digit percentage on top of the basic storage sizing. Another interesting metric is cluster size. One major system

Figure 2 Sample Hadoop Compute And Data Node Configurations
Source: Forrester Research, Inc.

Workload | CPU | Memory | Disks/TB | TB/core
Mainstream — counting, correlating, sorting, aggregating tasks like log analysis, basic event correlation, website traffic analysis | 1 or 2 socket x 8–10 core x86, low CPU bin | 64–128 GB | 8–12/8–36 | 0.6–2.25
Computationally intensive MapReduce jobs such as optimization calculations, image analysis, time-series and streaming data, financial analysis, ETL | 2 socket x 10 core x86, high-performance bin | 128–256 GB | 12–24/12–72 | 0.6–3.6
Standard file server (for comparison purposes) | 2 socket x 10 core low bin x86 | 64–128 GB | 24–45/48–135 | 2.4–6.8


vendor’s Hadoop practice noted that the average size of a starter Hadoop cluster was three or four data nodes plus either a single or dual set of servers to run the NameNode and other management components (Forrester recommends the dual configuration).
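The 4 x D rule plus a single-digit index overhead gives a quick capacity estimate; the 5% overhead used here is an illustrative value within that single-digit range, not a figure from the report:

```python
def cluster_storage_tb(initial_data_tb, index_overhead=0.05):
    """Rough Hadoop storage sizing: 4 x D working space plus an
    assumed allowance for SQL-style index layers on top."""
    base = 4 * initial_data_tb
    return round(base * (1 + index_overhead), 1)

print(cluster_storage_tb(100))  # 100 TB of data -> 420.0 TB of cluster storage
```

The 4x multiplier absorbs both the three-way replication and working space for intermediate results, which is why the cluster footprint so quickly outgrows the raw data.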

    3. Can I Run Hadoop On A VM Cluster?

One of the emerging frontiers in Hadoop environments is the use of virtual machines (VMs) as the Hadoop cluster processing nodes.18 If the Hadoop cluster nodes are utilized 20% of the time or less, it’s feasible to run multiple VMs on each node and get increased throughput from the existing servers. There is considerable activity in the Hadoop community, much of it sponsored by VMware, Intel, and some of the system vendors, to add explicit extensions to Hadoop to make it more convenient to deploy and operate in a virtual environment. There are two basic architectures for running Hadoop on VM clusters:

■ Local storage on each node. In this model, each compute node hosts multiple VMs and has its own physical storage. The VMs appear to the Hadoop environment exactly as if they were each a standalone server, and the Hadoop distributed file system (HDFS) handles the allocation and movement of data between the storage visible to each VM as if they were separate physical servers. This architecture does not perturb the basic Hadoop operating or management model, and the Hadoop cluster just looks like it has more servers with less storage per server.

■ Shared storage. This model is less common. The inhibiting factor in a shared storage environment is the complexity of managing the network storage rather than potential performance problems; with current-generation storage arrays and 10G Ethernet and Fibre Channel, the actual transfer of data is no longer an issue.

    The provisioning, presentation, and management of network-attached storage for VM clusters in general, and Hadoop in particular, is changing at a rapid rate, and Forrester believes that within the next 12 to 18 months, we will have multiple mature options to easily provision and manage shared-storage Hadoop clusters. Storage vendors such as EMC with its ViPR product and VMware with its new Virtual SAN (VSAN) offering will streamline shared-storage Hadoop environments.19 Forrester strongly recommends that Hadoop architects consider shared storage environments, particularly solutions such as VSAN that federate the existing storage on individual server nodes, as opposed to solutions that require the purchase of external network-attached storage arrays.

    If you wish to investigate a VM-based Hadoop installation, another alternative is a packaged converged infrastructure (CI) solution that includes the hypervisor, the requisite management tools, and, especially, storage.20
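    The 20% utilization argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is ours, not the report's; the 80% utilization ceiling and the assumption of similarly loaded VMs are illustrative.

    ```python
    # Rough consolidation sketch: if physical Hadoop nodes average 20%
    # utilization, estimate how many similarly loaded VMs one node could host
    # before hitting a target ceiling. All figures are illustrative assumptions.

    def vms_per_node(avg_utilization: float, target_ceiling: float = 0.8) -> int:
        """How many VMs with the given average utilization fit on one node."""
        if not 0 < avg_utilization <= 1:
            raise ValueError("utilization must be in (0, 1]")
        # Small epsilon guards against floating-point truncation (e.g., 0.8 / 0.1).
        return max(1, int(target_ceiling / avg_utilization + 1e-9))

    # A node busy 20% of the time could plausibly host four such VMs
    # under an 80% ceiling.
    print(vms_per_node(0.20))  # -> 4
    ```

    The same arithmetic shows why consolidation stops paying off as node utilization rises: at 50% average utilization, only one VM fits under the ceiling.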

    FOR INFRASTRUCTURE & OPERATIONS PROFESSIONALS

    Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture

    © 2014, Forrester Research, Inc. Reproduction Prohibited. April 9, 2014

    4. Why Not Do Hadoop In The Cloud?

    While there are cloud-based Hadoop offerings from multiple vendors, such as Amazon, Google, and Rackspace, the majority of production Hadoop installations are and will remain on-premises for several reasons:

    ■ Heavy and increasing workloads favor on-premises Hadoop. Hadoop's infrastructure design is comparably cost-effective to the designs used by the cloud providers. Hadoop clusters tend to be heavily utilized, with capacity being added as resources get scarce rather than being massively overprovisioned. These characteristics make the argument for cloud's cost advantage less compelling, since cloud usually compares best against lightly loaded in-house resources. Additionally, as Hadoop becomes a production resource, Hadoop cluster workloads and storage requirements tend to increase without the dramatic peaks and valleys that might make a cloud deployment attractive for its ability to scale down as well as up.21 The best use case for Hadoop in the cloud is development, which almost always requires constant changes to the environment and has the kind of highly variable workload profile that favors cloud. Hadoop in the cloud also reduces the major DevOps (development + operations) overhead associated with physical installations.

    ■ Cloud storage is both slower and more expensive for data sets that just keep growing. Most cost comparisons show that low-cost enterprise storage is still cheaper than cloud for long-term data storage, particularly since low-cost cloud storage may have unacceptably long access times. Also, Hadoop tends to collect 10 times or more data than legacy transactional environments do, data scientists and their customer-focused business stakeholders will almost never want to discard Hadoop data, and the access requirements are unpredictable — all of which favors on-premises storage.

    ■ Data sources and locality make a big difference for performance. In cases where the data is entirely cloud-generated (such as analysis of Twitter, blog posts, and other social media data), running Hadoop clusters in the cloud might make sense. But as Hadoop is used increasingly for real-time customer-facing systems with data coming from multiple venues, I&O pros will likely need to build it out in a physical facility with the right (deterministic bandwidth and latency) network interconnects to minimize the end-to-end latency of the application. Thus, the optimal facility for your Hadoop cluster is likely either your enterprise data center or a colocation or hosting facility with the right peer interconnects — not a cloud environment with unknown and probably longer-latency network connections.


    WHAT IT MEANS

    HADOOP WILL BECOME A CRITICAL PART OF CORE ENTERPRISE BUSINESS

    The capabilities of Hadoop 2.0 will accelerate the use of Hadoop as a real-time platform and as a platform for other analytics software. As the number of applications that require real-time performance grows, these requirements will bias future Hadoop infrastructure in predictable directions:

    ■ Hadoop will include flash memory in the processing nodes. As Hadoop becomes the foundation of real-time processes and applications, the requirements for the processing nodes begin to escalate. While the amount of data that can be handled in a given amount of time scales very well with the number of additional nodes, if the requirements dictate faster response for a given amount of data, the only reasonable solution is to use faster nodes with flash memory. All commercial Hadoop distributions can take advantage of flash memory on the individual compute nodes.

    ■ You will need to install application-specific Hadoop clusters. Because Hadoop encompasses a wide range of processing, as applications scale and as new applications requiring unusual processing come online, it may be necessary to begin to install new Hadoop clusters with application-specific processing nodes for financial risk calculations or time-series analysis.

    ■ Hadoop will become the big data hub and integration point for enterprise systems. As Hadoop becomes more of a closely coupled adjunct to customer-facing systems like eCommerce platforms and mainstream enterprise systems such as enterprise resource planning (ERP) and classic data warehouse applications, the quality of the integration with the enterprise systems becomes critical. In these cases, I&O pros may need to host the Hadoop cluster in the same infrastructure as the enterprise apps to achieve tighter integration with the core applications.

    ENDNOTES

    1 "Real time" is a slippery term, but the original definition from control systems theory is still valid — real-time processing allows decisions to be made within the cycle time of the process in question. In other words, real time means something that happens quickly enough to matter to a person or process waiting for the result. Thus, a signal-processing system might define real time as fractions of a microsecond; an advertising insertion system designed to offer up an ad to a customer viewing a company website might define it as hundreds of milliseconds; and for a marketing campaign, real time might be hours.

    2 This pilot illustrates both the potential of Hadoop and the size of some big data repositories. The pilot project involved processing 570 billion weblog records in 9 minutes on what was described as a "small cluster."
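    A quick arithmetic check (ours, not the report's) shows what that pilot's throughput implies:

    ```python
    # Back-of-the-envelope check on the pilot's figures: 570 billion weblog
    # records in 9 minutes works out to roughly a billion records per second.
    records = 570e9
    seconds = 9 * 60
    records_per_second = records / seconds
    print(f"{records_per_second:.3g} records/second")  # roughly 1.06e9
    ```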


    3 These ETL (extract, transform, load) vendors are all rapidly adapting their products to work with Hadoop, mostly in the form of using HDFS as an underlying data store. But once users get a taste of the potential economics of using an open source stack for what was previously a purely proprietary stack, it becomes difficult for legacy vendors to support historical margins.

    4 This assertion leads instantly to the challenge that Hadoop may in fact be very flexible but not very efficient at any particular task. There is some truth to that assertion — an optimized columnar database like Vertica or Netezza may be frighteningly efficient at certain kinds of queries but will simply not be able to do others involving different data types. The beauty of Hadoop's architecture is that, in exchange for some lack of optimization, it allows an almost infinite plasticity in terms of problems and data types. In addition, the lack of optimization is somewhat offset by the fact that Hadoop can be composed from commodity components, so the Hadoop solution vendor cannot extract much of a premium for integrating the hardware platform, unlike many other commercial solutions. However, much of Hadoop's attraction also lies in the fact that MapReduce is a separate function from the Hadoop file system, and many specialized applications, Vertica among them, are integrating with HDFS to take advantage of its reliable and highly scalable storage architecture.

    5 We might as well deal with the inevitable right off the bat — how the heck did it get named Hadoop? Hadoop was initially created by Doug Cutting and Mike Cafarella, and Hadoop was the name of Cutting's young son's favorite stuffed elephant.

    6 Hadoop can also be used internally for operational improvement. It is a powerful platform for analysis of log files from servers, network equipment, and any of the myriad devices that spit out data as they operate. Using Hadoop can simplify and accelerate turning these inchoate streams of data into real understanding about efficiencies, actual costs, and compliance.

    7 As more people can use a previously unavailable technology and apply it to problems that were previously uneconomical to attack, more benefits accrue to more players in the economy. This will follow the same path as the evolution of supercomputing and advanced design automation, which were originally used mostly for aerospace and defense products — today, many consumer goods are designed with the next generation of those same tools. In the case of Hadoop, for example, the ability to apply this technology to a single movie release, where the goal is to raise the box-office take by a few million dollars in a few markets, would have been completely uneconomical five years ago.

    8 The details are out of scope for this report, but I&O pros can get a foundation in MapReduce fundamentals from the Hadoop website (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html).

    9 For the technically inclined: the replication factor is configurable, but most installations seem to be sticking with the default 3x. The blocks in HDFS are huge, either 64 MB or 128 MB, so the amount of metadata that the NameNode server must keep track of is manageable, with only eight or 16 unique blocks per GB (8,000 or 16,000 per TB), times three for the replicas. At PB scales, HDFS, probably using the 128 MB block size, will have tens of millions of block entries, small enough to keep large portions of the metadata tables in memory for efficient operation.
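    The block arithmetic in this endnote can be sketched directly. The snippet below is our illustration; it assumes the default 3x replication and counts only block entries, ignoring per-file and per-directory metadata.

    ```python
    # NameNode metadata arithmetic: number of block entries for a given data set.
    # Assumes default 3x replication; ignores per-file and directory metadata.

    def hdfs_block_entries(data_bytes: int, block_mb: int = 128,
                           replication: int = 3) -> int:
        """Block replicas the NameNode must track for `data_bytes` of data."""
        block_bytes = block_mb * 1024 ** 2
        # Ceiling division: a partial trailing block still costs one entry.
        unique_blocks = -(-data_bytes // block_bytes)
        return unique_blocks * replication

    TB, PB = 1024 ** 4, 1024 ** 5
    print(hdfs_block_entries(TB))  # 8,192 unique blocks x 3 = 24,576 entries
    print(hdfs_block_entries(PB))  # about 25 million entries at 1 PB
    ```

    The 1 PB result — roughly 25 million entries at the 128 MB block size — matches the endnote's "tens of millions" figure.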


    10 The failure of a nonredundant NameNode is not a total disaster, because the NameNode writes a log file that can be used to reconstruct the data. But it can be a long process — think days, not minutes or a few hours — for even modest 10 TB to 100 TB file systems.

    11 Hadoop 2.0 has also added a number of major enhancements, particularly the ability for workloads other than MapReduce to run on top of HDFS, enabling a wide range of third-party tools and other open source projects to take advantage of Hadoop's robust federated storage and processing architecture. Many of these tools were already available, but the incorporation of support for a general-purpose extension to MapReduce (YARN) makes Hadoop 2.0 a much more suitable general-purpose big data and analytics platform.

    12 Given the continued cost decline in network switches, Forrester recommends that I&O groups contemplating implementation of high-performance Hadoop clusters evaluate 40 Gb inter-rack links. For environments where the MapReduce jobs are primarily aggregation, enumeration, and sorting (the traditional Hadoop workloads), a simple 1 Gb NIC per server may be sufficient. However, as Hadoop workloads grow to incorporate real-time (what Hadoop practitioners often refer to as continuous processing) as well as batch data from other enterprise and external data sources, the additional jobs and the constant ETL processing can add significant network traffic, and the traditional 1 Gb connection may become a bottleneck.

    13 A dual network will entail twice the cost for network equipment and a slight additional cost for dual-NIC configurations on each server. Forrester cannot make a blanket recommendation on this aspect of the infrastructure but cautions that reconfiguring from a single- to a dual-path network is very disruptive and time-consuming — assume that it will take a day per rack, provided that you configured the racks correctly in the first place to have the power and space to accommodate the additional switches.

    14 Hadoop was not the first scalable software environment to use this technique, drawn as it was from Google's proprietary MapReduce concept, and similar concepts can be found underlying earlier experiments and products such as AFS (the Andrew File System), DCE (the Distributed Computing Environment), and IBM's GPFS (General Parallel File System), but it is arguably the most successful and rapidly growing example.

    15 Intel has an alliance with Hortonworks through which it has produced a version of Hadoop optimized for x86 execution, as well as actively supporting an alternative Apache Hadoop distribution.

    16 Hortonworks' sizing tool is an excellent example. Source: Hortonworks (http://hortonworks.com/resources/cluster-sizing-guide/?utm_source=google&utm_medium=ppc&utm_campaign=Sitelinks).

    17 Here's the derivation for 4 x D: You start by assuming that the data will require (R x D) + S, where R is the replication factor, which most installations leave at the default of 3; D is the initial size of the data; and S is the amount of space allocated for the "shuffle" phase of the MapReduce process, where intermediate blocks of results are moved from node to node and aggregated. One senior Hadoop consultant noted that the usual rule of thumb for shuffle space is to allow approximately the same amount of space as the original data before replication. Using this formula gives us a basic Hadoop sizing rule of thumb of 4 x D.
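    The derivation can be written as a one-line sizing function; the sketch below is ours, with the shuffle allowance exposed as a parameter so the rule can be varied.

    ```python
    # Capacity rule of thumb from the endnote: raw capacity = (R x D) + S, with
    # replication R = 3 and shuffle space S assumed equal to the unreplicated
    # data size D, which yields the 4 x D rule.

    def hadoop_raw_capacity(data_tb: float, replication: int = 3,
                            shuffle_ratio: float = 1.0) -> float:
        """Raw cluster storage (TB) needed for `data_tb` of unreplicated data."""
        return (replication + shuffle_ratio) * data_tb

    print(hadoop_raw_capacity(100))  # 100 TB of data -> 400.0 TB raw, i.e., 4 x D
    ```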

    18 Currently there is only a limited sample of production environments, but the number is growing for the same reasons that other applications initially moved to VMs from physical infrastructure — capital resource efficiency and management cost.


    19 VSAN in its initial version(s) will almost certainly be optimized for the random read/write mix characteristic of a general-purpose enterprise VM cluster, as opposed to the read-dominated and large-block transfer patterns of a Hadoop cluster, but it is hard to imagine that VMware will not pursue Hadoop optimization in the near-term future.

    20 While a wide range of offerings from vendors such as HP, IBM, Cisco, and Dell bundle storage, network, and compute nodes as integrated VM clusters, Forrester recommends also evaluating newer options from Nutanix and SimpliVity for virtualized Hadoop because of their deeply integrated federated storage architectures, which may ameliorate many of the complexities of managing Hadoop storage resources. Also worthy of additional scrutiny are advanced storage-centric products like Maxta, Tintri, and Atlantis Computing, which dramatically simplify the task of deploying and managing VMware storage. Emerging tools such as EMC ViPR, which can overlay an HDFS definition on top of an existing SAN, offer additional potential for simplifying storage management of Hadoop on a VM cluster. To the extent that VMware makes its VSAN product relevant to Hadoop, all other solutions may lose a great deal of relevance.

    21 One example that we were able to find was a 200-VM instance running on Amazon's Elastic MapReduce service that generated bills in excess of $40,000 per month before the user brought the workload in-house onto an eight-node x86 Hadoop cluster.
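    A rough break-even calculation illustrates why that user moved in-house. Only the $40,000 monthly bill and the eight-node count come from the cited example; the per-node hardware cost below is our assumption for illustration, and operating costs are ignored.

    ```python
    # Hypothetical break-even sketch for the example above. The per-node cost is
    # an assumed figure; power, space, and admin costs are deliberately ignored.
    emr_monthly_cost = 40_000        # from the cited example, USD per month
    nodes = 8                        # from the cited example
    assumed_cost_per_node = 15_000   # assumption: all-in hardware cost per node

    capital_cost = nodes * assumed_cost_per_node
    breakeven_months = capital_cost / emr_monthly_cost
    print(breakeven_months)  # -> 3.0 (months until the hardware outlay is recovered)
    ```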


    Forrester Research (Nasdaq: FORR) is a global research and advisory firm serving professionals in 13 key roles across three distinct client segments. Our clients face progressively complex business and technology decisions every day. To help them understand, strategize, and act upon opportunities brought by change, Forrester provides proprietary research, consumer and business data, custom consulting, and events.

    Forrester Focuses On Infrastructure & Operations Professionals

    You are responsible for identifying — and justifying — which technologies

    and process changes will help you transform and industrialize your

    company’s infrastructure and create a more productive, resilient, and

    effective IT organization. Forrester’s subject-matter expertise and

    deep understanding of your role will help you create forward-thinking

    strategies; weigh opportunity against risk; justify decisions; and optimize

    your individual, team, and corporate performance.

    IAN OLIVER, client persona representing Infrastructure & Operations Professionals

     About Forrester

    A global research and advisory firm, Forrester inspires leaders, informs better decisions, and helps the world's top companies turn the complexity of change into business advantage. Our research-based insight and objective advice enable IT professionals to

    lead more successfully within IT and extend their impact beyond

    the traditional IT organization. Tailored to your individual role, our

    resources allow you to focus on important business issues —

    margin, speed, growth — first, technology second.

    FOR MORE INFORMATION

    To find out how Forrester Research can help you be successful every day, please contact the office nearest you, or visit us at www.forrester.com. For a complete list of worldwide locations, visit www.forrester.com/about.

    CLIENT SUPPORT

    For information on hard-copy or electronic reprints, please contact Client Support at +1 866.367.7378, +1 617.613.5730, or [email protected]. We offer quantity discounts and special pricing for academic and nonprofit institutions.
