forrester hadoop infrastructure architecture
TRANSCRIPT
8/20/2019 Forrester Hadoop Infrastructure Architecture
1/16
Forrester Research, Inc., 60 Acorn Park Drive, Cambridge, MA 02140 USA Tel: +1 617.613.6000 | Fax: +1 617.613.5000 | www.forrester.com
Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture
by Richard Fichera, April 9, 2014
For: Infrastructure
& Operations
Professionals
KEY TAKEAWAYS
Hadoop Provides A Foundational Technology Upon Which To Build Customer Engagement
Big data can correlate data and events from multiple sources. Historically, the tools and
infrastructure to do this have been prohibitively expensive. With the advent of Hadoop
and its ecosystem of tools, firms looking for an incremental advantage have been able to
turn this data into actionable insights in ways that were unimaginable a decade ago.
Hadoop Can Become The Hub Of An Enterprise’s Big Data Strategy
Because Hadoop is an inherently extensible open source system built on an extremely
powerful abstraction layer for managing large collections of both structured and
unstructured data, it is increasingly becoming an enterprise hub for all big data, and an
active community of new and legacy independent software vendors is building upon it.
Hadoop Infrastructure Is Different — But I&O Professionals Need Few
New Skills To Deal With It
While the effective use of Hadoop entails complex and, for most organizations, new
software skills, the infrastructure for Hadoop can be designed and managed by I&O
pros after learning some basic configuration rules and management practices. Generally,
no significant new I&O skills are needed to set up and manage a Hadoop environment.
Hadoop Will Drive Organizations Toward DevOps
The Hadoop life cycle is dynamic, with high-velocity change during development,
potential movement between cloud prototyping and in-house production, and rapid
incremental change of production environments as workloads are added and tuned.
This operational profile deeply favors and motivates a strong DevOps process in
enterprises adopting Hadoop.
© 2014, Forrester Research, Inc. All rights reserved. Unauthorized reproduction is strictly prohibited. Information is based on best available
resources. Opinions reflect judgment at the time and are subject to change. Forrester ®, Technographics®, Forrester Wave, RoleView, TechRadar,
and Total Economic Impact are trademarks of Forrester Research, Inc. All other trademarks are the property of their respective companies. To
purchase reprints of this document, please email [email protected]. For additional information, go to www.forrester.com.
FOR INFRASTRUCTURE & OPERATIONS PROFESSIONALS
WHY READ THIS REPORT
The proliferation of customer-facing data-intensive systems in almost every modern enterprise has
catalyzed the rapid deployment of big data environments, commonly with Hadoop as the underlying
processing environment. Unfortunately, infrastructure and operations (I&O) pros have had little guidance
in understanding how to configure and manage the underlying infrastructure to support Hadoop and its
ecosystem of tools and applications. This report helps I&O professionals understand the basics of Hadoop
infrastructure and includes guidelines for system configuration, rough data sizing, and suggestions on how
to plan for the inevitable growth of the Hadoop big data environment.
Table Of Contents
The Business Problem — Deriving Time-
Sensitive Results From Big Data
Hadoop Basics For I&O Pros — Parallelism,
Replication, And Scalability
How To Architect The Right Infrastructure
For Hadoop
It’s Alive — And Growing — Staffing And Operations For Hadoop
WHAT IT MEANS
Hadoop Will Become A Critical Part Of Core
Enterprise Business
Notes & Resources
This report is based on ongoing research
into the evolution of Hadoop and big data
infrastructure architecture — specifically how
to help businesses and IT build platforms
that will support scalable solutions for
customer engagement and analytics. Specific
inputs to this report came from interviews
with Hadoop and other big data solution
suppliers, system vendors, Hadoop architects,
and users of Hadoop, along with collaboration
with other Forrester analysts, clients, and
discussions with the members of the Forrester
Leadership Boards.
Related Research Documents
The Forrester Wave™: Big Data Hadoop
Solutions, Q1 2014
February 27, 2014
The Forrester Wave™: Enterprise Data Warehouse, Q4 2013
December 9, 2013
Predictions 2014: All Things Data
February 7, 2014
Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture by Richard Fichera
with Laura Koetzle, Brian Hopkins, and Katherine Williamson
THE BUSINESS PROBLEM — DERIVING TIME-SENSITIVE RESULTS FROM BIG DATA
The value of big data — some of which exists in the form of “digital breadcrumbs” that customers
leave behind as they navigate the Web, some as explicit actions on their part, such as tweets and
Facebook entries, and some as structured output from various applications and systems — lies in
the ability to correlate data and events from multiple sources. Historically, the ability to rapidly
collate these disparate events and chunks of data has been almost nonexistent. With the advent of
Hadoop and its ecosystem of tools, companies looking for an incremental advantage have been able
to turn this sea of data into actionable insights in ways that were unimaginable a decade ago. For
example, Hadoop allows you to:
■ Conduct customer sentiment analysis from Twitter, Facebook, and other sources. For
example, film studios want to maximize their revenues, and rapid adjustments to promotional
programs based on customer reactions can have a major impact. You can use Hadoop to mine
customer sentiment from social sources like Twitter and Facebook, blogs, product reviews, and
press articles. In the case of a newly released movie, for example, you can use Hadoop to examine
massive numbers of text items, analyze their content, and aggregate the results into a composite
metric. You can run the solution against real-time data streams, which allows you to see results in
a time frame within which you can make decisions about online marketing programs.1
■ Understand your customers’ life-cycle progress with web clickstream data. You can use
Hadoop to analyze the massive data streams generated by active websites to better understand
user data, such as how users navigate and how long they look, and the patterns that distinguish
an early-stage buyer from a mere window-shopper. Clickstream analysis by the large web
companies such as Google and Amazon was one of the earliest commercial uses of Hadoop. Last
year, a major shipping and logistics company used Hadoop to analyze weblogs to detect mobile
devices so it could more finely tailor online services.2
■ Build a more flexible enterprise data hub. Many firms found that Hadoop’s inherently
flexible and scalable architecture coupled with its open source origins made it an attractive
enterprise data hub for performing extract, transform, and load (ETL) functions for other
existing enterprise systems. By substituting Hadoop for increasingly expensive proprietary
ETL solutions such as those from Informatica, Oracle, and SAP, enterprises gain a flexible and
extensible utility to connect both current and future systems and applications.3 One major
financial services company uncovered massive fraud with a new Hadoop project — and also
saved $30 million by substituting Hadoop for conventional ETL and data warehousing tools.
FedEx, using existing data from other production systems, used Hadoop to identify high-
revenue source and destination ZIP codes and to identify patterns that led to shipment delays.
■ Increase system reliability with sensor and log data analysis. You can use Hadoop to analyze
the data generated by sensors and the log data from almost any conceivable equipment to look
for patterns and correlations. For example, a major supplier of smart grid metering analyzes
electric meter results at the rate of over 1 million meters per second using a combination of
Hadoop and other technologies to improve power pricing and load management. General
Electric uses Hadoop for analysis of real-time data from jet engines, wind turbines, locomotives,
and other devices to schedule preventative maintenance service before those critical systems fail.
Shanghai Telecom processes video data from thousands of monitoring points using Hadoop as
a data storage and processing hub — and performance is five times faster, which has improved
Shanghai Telecom’s ability to rapidly respond to emergency situations. Other examples include
HVAC optimization for office buildings and data centers to save money and reduce energy
consumption, and urban traffic flow monitoring and control to reduce congestion. The list is
endless — almost every activity we engage in generates sensor and log data that is amenable to
analysis, and Hadoop has emerged as the platform of choice.
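The sentiment workflow described above (score many text items, then aggregate into a composite metric) can be sketched in miniature. The word lists and scoring rule below are hypothetical stand-ins for a real sentiment model; a production version would run this scoring as a distributed Hadoop job over millions of items:

```python
# Toy sketch of the sentiment-aggregation pattern: score each text
# item, then aggregate the scores into a single composite metric.
# The word lists and scoring rule are hypothetical stand-ins for a
# real sentiment model.

POSITIVE = {"great", "loved", "fantastic"}
NEGATIVE = {"boring", "awful", "waste"}

def score(text):
    """Score one text item: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def composite_metric(items):
    """Average per-item score across all items (the composite metric)."""
    scores = [score(t) for t in items]
    return sum(scores) / len(scores)

reviews = [
    "Loved it, fantastic cast",
    "A boring waste of two hours",
    "Great soundtrack, great pacing",
]
print(round(composite_metric(reviews), 2))  # → 0.67
```

In a real deployment, the scoring step would run as the parallel (Map) phase across the cluster and the aggregation as the Reduce phase.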
We Already Have Tools; Why Have Developers Flocked To Hadoop?
Most of these business problems existed before the phrase “big data” entered common parlance, and
we’ve deployed generations of specialized solutions, such as relational databases, business intelligence
(BI) tools, and specialized statistical analysis applications, in attempts to solve them. What drives
developers and marketers to Hadoop? The migration to Hadoop is primarily driven by three factors:
■ Data type and source generality. Part of Hadoop’s appeal is that it is not specifically optimized
for any specific solution or data type but rather is a general framework for parallel processing, so
your developers and data scientists can add any relevant data, whatever its format or source.4
Other tools, both open source and independent software vendor (ISV) solutions, can be
layered on top of this framework, but the basic Hadoop tooling is flexible enough to deal with
both structured and unstructured data, batch and streaming data, and can be programmed in
almost all standard languages. In addition, Hadoop supports standard connectors such as open
database connectivity (ODBC) to enterprise staples like SQL, SAP, and Excel.
■ Strong ecosystem and community. Hadoop reaps the benefit of an active community of open
source developers, consultants, and an ever-increasing library of ISV solutions such as Vertica
for real-time columnar analytics and MarkLogic for flexible NoSQL queries and transactional
capabilities, as well as open source offerings such as HBase and MongoDB that either layer on
top of Hadoop or feed/extract data from it. Even the largest of the proprietary ISV solution
communities cannot match the sum total of this activity or its rapid growth trajectory. You will
never be a technology orphan with Hadoop.
■ Lower cost. Even including the cost of specialized staff and the increasing use of value-added
Hadoop distributions and services like Cloudera and Hortonworks, Hadoop is cheaper to
get started with than the ISV solutions of previous generations. Additionally, Hadoop was
architected to run on lower-cost server and storage infrastructure, which also removes the
“hidden” cost of high-end infrastructure.
Why Do I&O Pros Need To Get Involved With Hadoop Now?
Developers and data geeks have been using Hadoop for a decade in ever-increasing numbers, with
an inflection point triggered by Cloudera’s 2008 launch. Because previous uses of Hadoop were
all post-processing, your firm’s Hadoop pioneers could start out with any old cheap scale-out-
type servers you had lying around; with cloud capacity, they didn’t need any help setting up that
sort of vanilla infrastructure, so they didn’t involve the I&O team. But that’s changed — Hadoop
infrastructure architecture has burst onto the to-do list of I&O pros. There are two reasons for this:
1. It’s gotten too big to stay in the skunkworks shadows. Your Hadoop cluster has now grown to
the point where it’s chewing up a lot of resources, and your developers don’t want to support the
infrastructure by themselves.
2. Your firm can win, serve, or retain more customers with higher-performance analytics. Your
customer analytics and business insights leaders need higher-performance solutions to insert
the perfect advertisement or present the right custom offer to your customers. And that means
that developers need help from I&O pros to design the right infrastructure to run on.
HADOOP BASICS FOR I&O PROS — PARALLELISM, REPLICATION, AND SCALABILITY
Hadoop is an open source implementation of MapReduce, one of Google’s foundational
technologies. Hadoop has emerged as a new way to process and integrate a variety of customer-
related data, including clickstream, geographic data, and text, and turn this data into actionable
insights.5 It can work in a batch or real-time environment, and it is capable of digesting any data
type, both structured and unstructured. Many of these same capabilities are available from legacy
analytics and data warehouse systems, but Hadoop can routinely deliver results with superior
performance at anywhere from one-fifth to one-tenth the cost.6 And in this case, “cheaper and faster”
really does mean better, because the lower cost and greater speed allow you to solve problems that
were previously uneconomical to attack.7
The tradeoff for Hadoop’s dramatic improvement in the underlying economics of data analysis is
that Hadoop is very different from existing enterprise database processing; it requires
entirely new skill sets and has even catalyzed the creation of a new specialty, the “data scientist,”
who specializes in architecting the Hadoop data environment and its connections to the rest of the
enterprise. Fortunately, as Hadoop has developed, so has its supporting infrastructure, and today’s
tools include enterprise staples such as SQL front ends for Hadoop, which make it much more
accessible for traditional programmers and database administrators.
What’s Under The Hood: Hadoop’s Primary Components
Hadoop is built on a distributed architecture in which each processing node (server) has its own
storage and processing capacity and data is moved between nodes as needed but never processed
remotely. Fundamentally, the MapReduce technology involves splitting processing into multiple
parallel tasks, performing operations on the data (the Map part) and then sorting and aggregating the
data (the Reduce function).8 Hadoop is built to run on an Ethernet-connected cluster of basic servers
with direct-attached storage, without any hardware redundancy or even RAID protection. It does
this by borrowing the replication scheme from Google’s massive file system, in which each block of
data is copied to three separate locations to protect against system failure.9 A Hadoop environment is
composed of a number of software components, each with its own infrastructure requirements (see
Figure 1). Hadoop differs from most environments that I&O pros are familiar with because Hadoop:
■ Uses an architecture that assumes (and tolerates) machine failures. Hadoop was designed
with an understanding that with scale, hardware failures are inevitable. The odds of any given
disk failing are low, and given expensive and redundant hardware, core enterprise systems
can be protected with a high degree of confidence. But with a Hadoop cluster with 1,000 disks
(probably in the upper quartile of Hadoop clusters for size, but certainly nowhere near the
largest), failures will be common. Hadoop tolerates multiple disk failures gracefully and allows
both incremental replacement and more choice in the economics of disk drive selection. As
a result, most Hadoop installations use low-cost SAS disks as opposed to high-end small
computer system interface (SCSI) disks, and they dispense with RAID entirely.
■ Hides the housekeeping. Unlike environments where I&O pros expect to have detailed insight
into the performance and usage characteristics of the components and storage, Hadoop allows
a Hadoop cluster to be managed as a relatively opaque black box whose contents are of interest
to the Hadoop specialists. I&O pros need only deal with the requirements for storage expansion
and any required network changes within the Hadoop cluster.
■ Is relentlessly scalable. Hadoop environments grow. Period. They do not shrink; they are not
often “cleaned up”; and, because Hadoop is a universal operating environment for big data, they
are “data magnets” once an organization begins to understand the potential of Hadoop. They
tend to be populated with data from multiple sources, often in advance of a clear need, and once
a given project or analysis experiment is done, the data inevitably stays in the Hadoop cluster,
either because the experiment has turned into a production job or so that the data scientist can
use it in some undefined future experiment. For the I&O practitioner, this monotonic trend
in capacity means that you need a well-articulated process for capacity expansion that allows
regular addition of capacity in the form of servers with attached storage.
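The Map and Reduce phases described above can be sketched with Hadoop-Streaming-style mapper and reducer functions. The word-count example below wires them together in a single process purely for illustration; in a real cluster, Hadoop runs the map tasks in parallel across nodes and performs the sort/shuffle between the two phases:

```python
# Minimal sketch of the Map -> sort -> Reduce pipeline that Hadoop
# runs at scale. The mapper emits (key, 1) pairs, the framework
# sorts by key, and the reducer aggregates each key's values.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: split a line into (word, 1) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum the counts for one word."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map
pairs = [kv for line in lines for kv in mapper(line)]
# Shuffle/sort (done by the Hadoop framework in a real cluster)
pairs.sort(key=itemgetter(0))
# Reduce
result = dict(reducer(k, (v for _, v in g))
              for k, g in groupby(pairs, key=itemgetter(0)))
print(result["the"], result["fox"])  # → 3 2
```

The same mapper and reducer, written as standalone scripts reading stdin and writing stdout, could run unchanged under Hadoop Streaming across thousands of nodes; the framework supplies the parallelism, sorting, and fault tolerance.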
Figure 1 Primary Hadoop Components

Component | What it does | How many | Special considerations
Core components — NameNode, MapReduce, OpenJDK, and YARN | Keeps track of the data across the cluster; manages location, replication, and availability; and runs the basic MapReduce logic | 1 master active at any time | Should be configured as a redundant pair
JobTracker | Keeps track of Hadoop jobs across the cluster | 1 master active at any time | Can run on the same server as the NameNode and other core components
HDFS and TaskTracker | Uses the local OS and file system of each node to perform processing | Multiple nodes contain the actual data | This is the component that implements the MapReduce functions and scales as data and processing volumes grow.

Source: Forrester Research, Inc.
HOW TO ARCHITECT THE RIGHT INFRASTRUCTURE FOR HADOOP
If Hadoop has ended up on I&O’s plate simply because the cluster has grown beyond skunkworks
size (meaning you don’t have any high-performance requirements and your customer analytics
leaders can’t foresee having any), your job is simple. All you need to do is add compute/storage
capacity regularly and occasionally add more network bandwidth if things get slow. If you do have
high-performance requirements, your infrastructure choices will mean the difference between
success and failure. Here’s what you need to know:
■ The NameNode and JobTracker are critical. The rather oddly named NameNode server is the
server that keeps track of the Hadoop data’s 64 MB or 128 MB data segments and has long been
a source of concern for Hadoop architects as Hadoop has moved from an experimental utility
to production status. The recent 2.0 release of Hadoop has added the ability to easily configure
redundant NameNodes as a standard feature, easing this concern, and Forrester recommends
that any production Hadoop environment be configured this way, since the NameNode servers
are typically small-to-medium two-socket servers with only a few disks — a small insurance
premium to pay to protect against a major disruption.10 Forrester also recommends running the
NameNode and JobTracker on a single node with an identical failover node.11
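The redundant-NameNode capability above is enabled through configuration. A minimal, illustrative hdfs-site.xml fragment might look like the following; the nameservice and host names are hypothetical, and a production setup also requires shared edit-log storage, fencing, and client failover settings not shown here:

```xml
<!-- Illustrative HDFS HA fragment; nameservice and host names are hypothetical -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```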
■ Hadoop’s network requirements are not complex . . . Hadoop is designed to run over a standard
Ethernet network, and Hadoop clusters use only very basic network functions, so you only need
basic network switches. Production Hadoop clusters have three networks — the data cluster
network, an administrative network, and a systems management network (the latter two can be
collapsed into a single network to keep the servers to a simple dual network interface controller
[NIC] configuration for these functions). The data cluster network, over which all the data into
and out of the compute nodes will pass, is the critical network resource in a Hadoop cluster.
■ . . . but you’ll need beefy network links to support high-performance customer analytics. 
Forrester recommends that the Hadoop network connecting the nodes within a rack be 10 Gb
by default and that all of the data node servers get redundant 10 Gb links. The connections
between racks should be at least 10 Gb, and based on our interviews with Hadoop experts,
Forrester recommends 40 Gb interconnects between racks.12 Core enterprise networks are
always configured with dual paths, with each server connected to a different logical half of
the network, so that processing can continue in the event of a network switch failure. Because
Hadoop is likely to become critical for delivering customized services to customers, Forrester
recommends that Hadoop be configured with dual network connections.13 Regardless of the
choice of dual- or single-path network, the overall topology must be designed so that the
network can accommodate additional leaf nodes as the cluster scales.
■ You should spend time configuring the data nodes. Hadoop departs from standard enterprise
application practice by federating all storage attached to the processing nodes into the global
HDFS instead of using centralized network-attached storage (NAS) or storage area network
(SAN) for pooled storage.14 This architecture, coupled with the inherent redundancy of the
Hadoop environment, allows processing and storage capacity to scale incrementally in lockstep
and reduces cost. However, this means that it falls to the infrastructure architect to select the
correct ratio of processing to storage. Typically, Hadoop nodes have large disk configurations
in relation to the CPU and memory, but Forrester believes that this balance varies widely with
the potential workloads. While the actual data capacity per core will vary, the most common
practice in configuring Hadoop processing nodes is to allocate one disk per core.
■ I&O pros can buy Hadoop-specific configurations today. In response to the variation
in workloads, all of the tier one and most of the tier two system vendors offer multiple
configurations targeted at Hadoop clusters (see Figure 2). These generally do not include
elaborate redundant power supplies, advanced onboard management capabilities, or other
legacy enterprise reliability artifacts like extra fans or sensors. I&O pros can also source these
configurations from tier two hardware vendors and can get consulting assistance and
ongoing support from value-added distribution providers such as Hortonworks and Cloudera,
or from regional/local consultancies that focus on Hadoop.15 Some of these vendors offer useful
free online sizing and configuration tools.16
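The one-disk-per-core rule of thumb above translates into a simple node-level sanity check. The sketch below is illustrative only; the drive size, socket, and core counts are assumptions, not recommendations from this report:

```python
def data_node_sketch(sockets, cores_per_socket, drive_tb, replication=3):
    """Rough data-node sizing using the one-disk-per-core rule of thumb.

    Returns (disks, raw_tb, usable_tb). Usable capacity divides raw
    capacity by the HDFS replication factor (three copies by default).
    """
    cores = sockets * cores_per_socket
    disks = cores                      # one spindle per core
    raw_tb = disks * drive_tb
    usable_tb = raw_tb / replication
    return disks, raw_tb, usable_tb

# Illustrative: a 2-socket, 10-core-per-socket server with 3 TB drives
disks, raw, usable = data_node_sketch(2, 10, 3)
print(disks, raw, usable)  # → 20 60 20.0
```

The point of the exercise is the ratio check: a node whose disk count is far above or below its core count deserves a second look against the intended workload mix.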
Figure 2 Sample Hadoop Compute And Data Node Configurations
IT’S ALIVE — AND GROWING — STAFFING AND OPERATIONS FOR HADOOP
We provide the answers to infrastructure and operations professionals’ four most important high-
performance Hadoop infrastructure questions.
1. What Skills And Staff Do I&O Pros Need To Run A Hadoop Environment?
Once the Hadoop cluster is up and running, I&O pros have to keep it going. Fortunately, staffing for
Hadoop operations involves only basic Linux, storage, and networking skills. An operations group
familiar with the installation and operation of standard servers can master the additional Hadoop-
specific skills required. I&O pros must manage Hadoop runtime environments with specialized
Hadoop management tools like Apache Ambari for cluster deployment and management, or
Serengeti for managing Hadoop in virtualized environments, plus standard systems management
tools like Nagios, iDRAC, Director, or OneView for the basics of server operation. The open source
Hadoop distribution includes management tools that allow I&O to look at the cluster operations,
workloads, and network activity. Additionally, the value-added suppliers such as Cloudera,
Hortonworks, and Intel all supply enhanced management capabilities to enable deployment, updates,
backup, and operational monitoring of the Hadoop cluster.
2. How Big Will My Hadoop Cluster Be?
The size of the Hadoop cluster will obviously vary with the amount of data to be processed, but
a basic Hadoop sizing rule of thumb is 4 x D, where D is the initial size of the data.17 Space for
additional tools on top of the basic Hadoop operating environment will vary, but in general, most
tools to access Hadoop using SQL-like queries and other techniques tend to build relatively compact
sets of indices on top of Hadoop, and the additional overhead will likely be a single-digit percentage
on top of the basic storage sizing. Another interesting metric is cluster size. One major system
Workload profile | CPU | Memory | Disks/TB | TB/core
Mainstream — counting, correlating, sorting, aggregating tasks like log analysis, basic event correlation, website traffic analysis | 1 or 2 socket x 8–10 core x86, low CPU bin | 64–128 GB | 8–12/8–36 | 0.6–2.25
Computationally intensive MapReduce jobs such as optimization calculations, image analysis, time-series and streaming data, financial analysis, ETL | 2 socket x 10 core x86, high-performance bin | 128–256 GB | 12–24/12–72 | 0.6–3.6
Standard file server (for comparison purposes) | 2 socket x 10 core low-bin x86 | 64–128 GB | 24–45/48–135 | 2.4–6.8

Source: Forrester Research, Inc.
vendor’s Hadoop practice noted that the average size of a starter Hadoop cluster was three or four
data nodes plus either a single or dual set of servers to run the NameNode and other management
components (Forrester recommends the dual configuration).
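The 4 x D rule of thumb above can be turned into a rough cluster-size estimate. The per-node raw capacity used in the example below is an illustrative assumption:

```python
import math

def cluster_size_estimate(initial_data_tb, node_raw_tb, factor=4):
    """Estimate raw cluster capacity and data-node count from the 4 x D rule.

    factor=4 covers the three HDFS replicas plus working/temporary space;
    node_raw_tb is the raw disk capacity of one data node.
    """
    required_tb = factor * initial_data_tb
    nodes = math.ceil(required_tb / node_raw_tb)
    return required_tb, nodes

# Illustrative: 50 TB of initial data on nodes with 36 TB raw capacity each
required, nodes = cluster_size_estimate(50, 36)
print(required, nodes)  # → 200 6
```

Remember that this is a starting point only; as the report notes, Hadoop clusters grow monotonically, so the expansion process matters more than the initial estimate.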
3. Can I Run Hadoop On A VM Cluster?
One of the emerging frontiers in Hadoop environments is the use of virtual machines (VMs) as
the Hadoop cluster processing nodes.18 If the Hadoop cluster nodes are utilized 20% of the time or
less, it’s feasible to run multiple VMs on each node and get increased throughput from the existing
servers. There is considerable activity in the Hadoop community, much of it sponsored by VMware,
Intel, and some of the system vendors, to add explicit extensions to Hadoop to make it more
convenient to deploy and operate in a virtual environment. There are two basic architectures for
running Hadoop on VM clusters:
■ Local storage on each node. In this model, each compute node hosts multiple VMs and has
its own physical storage. The VMs appear to the Hadoop environment exactly as if they were
each a standalone server, and the Hadoop distributed file system (HDFS) handles the allocation
and movement of data between the storage visible to each VM as if they were separate physical
servers. This architecture does not perturb the basic Hadoop operating or management model,
and the Hadoop cluster just looks like it has more servers with less storage per server.
■ Shared storage. This model is less common. The inhibiting factor in a shared storage
environment is the complexity of managing the network storage rather than potential
performance problems; with current-generation storage arrays and 10G Ethernet and
Fibre Channel, the actual transfer of data is no longer an issue.
The provisioning, presentation, and management of network-attached storage for VM clusters in
general and Hadoop in particular is changing at a rapid rate, and Forrester believes that within
the next 12 to 18 months, we will have multiple mature options to easily provision and manage
shared storage Hadoop clusters. Storage vendors such as EMC with its ViPR product and VMware
with its new virtual SAN (VSAN) offering will streamline shared storage Hadoop environments.19
Forrester strongly recommends that Hadoop architects consider shared storage environments,
particularly solutions such as VSAN that federate the existing storage on individual server nodes
as opposed to requiring purchase of external network-attached storage arrays.
If you wish to investigate a VM-based Hadoop installation, another alternative is packaged
converged infrastructure (CI) solutions that include the hypervisor, the requisite management
tools, and, especially, storage.20
4. Why Not Do Hadoop In The Cloud?
While there are cloud-based Hadoop offerings from multiple vendors, such as Amazon, Google, and
Rackspace, the majority of production Hadoop installations are and will remain on-premises for
several reasons:
■ Heavy and increasing workloads favor on-premises Hadoop. Hadoop’s infrastructure
design is comparably cost-effective to those used by the cloud providers. Hadoop clusters tend
to be heavily utilized, with capacity being added as resources get scarce, rather than being
massively overprovisioned. These characteristics make the argument for cloud’s cost advantage
less compelling, since cloud usually compares best against lightly loaded in-house resources.
Additionally, as Hadoop becomes a production resource, Hadoop cluster workloads and
storage requirements tend to increase without the dramatic peaks and valleys that might make
a cloud deployment attractive for its ability to scale down as well as up.21 The best use case for
Hadoop in the cloud is for development, which almost always requires constant changes to the
environment and has the kind of highly variable workload profile that favors cloud. Hadoop in
the cloud also reduces the major DevOps (development + operations) overhead associated with
physical installations.
■ Cloud storage is both slower and more expensive for data sets that just keep growing. Most cost comparisons show that low-cost enterprise storage is still cheaper than cloud for long-term
data storage, particularly since low-cost cloud storage may have unacceptably long access times.
Also, Hadoop tends to collect 10 times or more data than legacy transactional environments do,
plus data scientists and their customer-focused business stakeholders will almost never want to
discard Hadoop data, and the access requirements are unpredictable — all of which favors on-
premises storage.
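The cost argument above can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative only: the per-GB prices and the 500 TB data set are hypothetical assumptions, not figures from this report, and the on-premises side deliberately ignores power, space, and administration to keep the comparison simple.

```python
def storage_costs_usd(tb, cloud_gb_month, onprem_gb_capex, months=36):
    """Compare cumulative cloud storage rent with a one-time disk purchase.

    All prices are hypothetical inputs; on-premises operating costs
    (power, space, admin) are intentionally omitted from this sketch.
    """
    gb = tb * 1000
    cloud = gb * cloud_gb_month * months   # pay-per-month object storage
    onprem = gb * onprem_gb_capex          # one-time capital purchase
    return cloud, onprem

# Hypothetical: $0.03/GB-month cloud vs. $0.10/GB low-cost enterprise disk,
# for 500 TB of Hadoop data retained for three years.
cloud, onprem = storage_costs_usd(500, 0.03, 0.10)
print(f"cloud: ${cloud:,.0f}  on-premises: ${onprem:,.0f}")
```

Because Hadoop data is rarely discarded, the rented-storage term grows with both volume and time, which is the crux of the argument above.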
■ Data sources and locality make a big difference for performance. In cases where the data is entirely cloud-generated (such as analysis of Twitter, blog posts, and other social media data),
running Hadoop clusters in the cloud might make sense. But as Hadoop is used increasingly for
real-time customer-facing systems with data coming from multiple venues, I&O pros will likely
need to build it out in a physical facility with the right (deterministic bandwidth and latency)
network interconnects to minimize the end-to-end latency of the application. Thus, the optimal
facility for your Hadoop cluster is likely either your enterprise data center or a colocation or
hosting facility with the right peer interconnects — not a cloud environment with unknown and
probably longer latency network connections.
WHAT IT MEANS
HADOOP WILL BECOME A CRITICAL PART OF CORE ENTERPRISE BUSINESS
The capabilities of Hadoop 2.0 will accelerate the use of Hadoop as a real-time platform and as a platform for other analytics software. As the number of applications that require real-time
performance grows, these requirements will bias future Hadoop infrastructure in predictable directions:
■ Hadoop will include flash memory in the processing nodes. As Hadoop becomes the foundation of real-time processes and applications, the requirements for the processing
nodes begin to escalate. While the amount of data that can be handled in a given amount of
time scales very well with the number of additional nodes, if the requirements dictate faster
response for a given amount of data, the only reasonable solution is to use faster nodes with
flash memory. All commercial Hadoop distributions can take advantage of flash memory on
the individual compute nodes.
■ You will need to install application-specific Hadoop clusters. Because Hadoop encompasses a wide range of processing, as applications scale and as new applications requiring unusual
processing come online, it may be necessary to begin to install new Hadoop clusters with
application-specific processing nodes for financial risk calculations or time-series analysis.
■ Hadoop will become the big data hub and integration point for enterprise systems. As Hadoop becomes more of a closely coupled adjunct to customer-facing systems like
eCommerce platforms and mainstream enterprise systems such as enterprise resource
planning (ERP) and classic data warehouse applications, the quality of the integration
with the enterprise systems becomes critical. In these cases, I&O pros may need to host the Hadoop cluster in the same infrastructure as the enterprise apps to achieve tighter
integration with the core applications.
ENDNOTES
1 “Real time” is a slippery term, but the original definition from control systems theory is still valid — real-
time processing allows decisions to be made within the cycle time of the process in question. In other words,
something that happens quickly enough to matter to a person or process waiting for the result. Thus, a
signal processing system might define real time as fractions of a microsecond, while an advertising
insertion system designed to offer up an ad to a customer viewing a company website might be hundreds of
milliseconds, and for a marketing campaign, real time might be hours.
2 This pilot illustrates both the potential of Hadoop and the size of some of the big data repositories. The pilot
project involved processing 570 billion weblog records in 9 minutes on what was described as a “small cluster.”
3 These ETL vendors are all rapidly adapting their products to work with Hadoop, mostly in the form of
using HDFS as an underlying data store. But once users get a taste of the potential economics of using an
open source stack for what was previously a purely proprietary stack, it becomes difficult for legacy vendors
to support historical margins.
4 This assertion leads instantly to the challenge that Hadoop may in fact be very flexible but not in fact very
efficient at any particular task. There is some truth to that assertion — an optimized columnar database like
Vertica or Netezza may be frighteningly efficient at certain kinds of queries but will simply not be able to do
others involving different data types. The beauty of Hadoop's architecture is that in exchange for some lack of
optimization, it allows an almost infinite plasticity in terms of problems and data types. In addition, the lack
of optimization is somewhat offset by the fact that it can be composed from commodity components, so the
Hadoop solution vendor cannot extract much of a premium for integrating the hardware platform, unlike
many other commercial solutions. However, much of Hadoop's attraction also lies in the fact that MapReduce
is a separate function from the Hadoop file system, and many specialized applications, Vertica among them,
are integrating with HDFS to take advantage of its reliable and highly scalable storage architecture.
5 We might as well deal with the inevitable right off the bat — how the heck did it get named Hadoop?
Hadoop was initially created by Doug Cutting and Mike Cafarella, and Hadoop was the name of Cutting's
young son's favorite stuffed elephant.
6 Hadoop can also be used internally for operational improvement. It is a powerful platform for analysis of
log files from servers, network equipment, and any of the myriad devices that spit out data as they operate.
Using Hadoop can simplify and accelerate turning these inchoate streams of data into real understanding
about efficiencies, actual costs, and compliance.
7 As more people can use a previously unavailable technology and apply it to problems that were previously
uneconomical to attack, more benefits accrue to more players in the economy. This will follow the same path as the evolution of supercomputing and advanced design automation, which were originally used
mostly for aerospace and defense products — today, many consumer goods are designed with the next
generation of those same tools. In the case of Hadoop, for example, the ability to apply this technology to a
single movie release where the goal is to raise the box-office take by a few million dollars in a few markets
would have been completely uneconomical five years ago.
8 The details are out of scope for this report, but I&O pros can get a foundation in MapReduce fundamentals
from the Hadoop website (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html).
9 For the technically inclined, the replication factor is configurable, but most installations seem to be sticking
with the default 3x. The blocks in the HDFS are huge, either 64 MB or 128 MB, so the amount of metadata
that the NameNode server must keep track of is manageable, with only eight or 16 times three blocks per GB (8,000 or 16,000 per TB). At PB scales, the HDFS, probably using the 128 MB block size, will have in the
tens of millions of block entries, small enough to keep large portions of the metadata tables in memory for
efficient operation.
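The block-count arithmetic in this endnote can be checked with a few lines of code. The block sizes and the default 3x replication factor come from the text above; the 1 PB figure is an example chosen to match the "PB scales" claim.

```python
import math

# Binary size units; HDFS block sizes are conventionally powers of two.
MiB, GiB, TiB, PiB = 2**20, 2**30, 2**40, 2**50

def block_entries(data_bytes, block_bytes, replication=3):
    """Number of block replicas the NameNode must track for a data set."""
    return math.ceil(data_bytes / block_bytes) * replication

# 8 or 16 blocks per GB before replication, as the endnote states:
assert block_entries(GiB, 128 * MiB, replication=1) == 8
assert block_entries(GiB, 64 * MiB, replication=1) == 16

# Roughly 8,000 or 16,000 per TB before replication:
assert block_entries(TiB, 128 * MiB, replication=1) == 8192

# At 1 PB with 128 MB blocks and 3x replication: tens of millions of entries.
print(block_entries(PiB, 128 * MiB))  # 25,165,824
```

At a few hundred bytes of metadata per entry, tens of millions of entries is comfortably memory-resident on a well-provisioned NameNode, which is the point of the endnote.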
10 The failure of a nonredundant NameNode is not a total disaster, because the NameNode writes a log file that
can be used to reconstruct the data. But it can be a long process — think days, not minutes or a few hours —
for even modest 10 TB to 100 TB file systems.
11 Hadoop 2.0 has also added a number of major enhancements, particularly the ability for workloads other
than MapReduce to run on top of HDFS, enabling a wide range of third-party tools and other open source
projects to take advantage of Hadoop's robust federated storage and processing architecture. Many of
these tools were already available, but the incorporation of support for a general-purpose extension to
MapReduce makes Hadoop 2.0 a much more suitable general-purpose big data and analytics platform.
12 Given the continued cost decline in network switches, Forrester recommends that I&O groups contemplating
implementation of high-performance Hadoop clusters evaluate 40 Gb inter-rack links. For environments
where the MapReduce jobs are primarily aggregation, enumeration, and sorting (the traditional Hadoop
workloads), a simple 1 Gb NIC per server may be sufficient. However, as Hadoop workloads grow to
incorporate real-time (what Hadoop practitioners often refer to as continuous processing) as well as batch
data from other enterprise and external data sources, the additional jobs and the constant ETL processing
can add significant network traffic, and the traditional 1 Gb connection may become a bottleneck.
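The bandwidth concern can be illustrated with simple wire-time arithmetic. The sketch below is an idealized best case that ignores protocol overhead and contention, and the 1 TB traffic volume is an assumed example, not a figure from this report.

```python
def wire_seconds(data_bytes, link_gbps):
    """Ideal time to move data_bytes over a link rated in gigabits per second."""
    return data_bytes * 8 / (link_gbps * 10**9)

# Moving a hypothetical 1 TB of shuffle/ETL traffic through one server NIC:
TB = 10**12
for gbps in (1, 10, 40):
    print(f"{gbps:>2} Gb/s: {wire_seconds(TB, gbps) / 60:6.1f} minutes")
```

At 1 Gb/s a terabyte occupies the NIC for well over two hours of pure wire time, which is why added real-time and ETL flows can quickly turn the traditional 1 Gb connection into a bottleneck.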
13 A dual network will entail twice the cost for network equipment and a slight additional cost for dual-NIC
configurations on each server. Forrester cannot make a blanket recommendation on this aspect of the
infrastructure but cautions that reconfiguring from a single- to dual-path network is very disruptive and
time-consuming — assume that it will take a day per rack, assuming that you configured the racks correctly
in the first place to have the power and space to accommodate the additional switches.
14 Hadoop was not the first scalable software environment to use this technique, drawn as it was from Google's
proprietary MapReduce concept, and similar concepts can be found underlying earlier experiments and
products such as AFS (the Andrew File System), DCE (the Distributed Computing Environment), and IBM's GPFS (General Parallel File System), but it is arguably the most successful and rapidly growing example.
15 Intel has an alliance with Hortonworks through which it has produced a version of Hadoop optimized for
x86 execution, as well as actively supporting an alternative Apache Hadoop distribution.
16 Hortonworks’ sizing tool is an excellent example. Source: Hortonworks (http://hortonworks.com/resources/
cluster-sizing-guide/?utm_source=google&utm_medium=ppc&utm_campaign=Sitelinks).
17 Here's the derivation for 4 x D: You start by assuming that the data will require (R x D) + S, where R is the
replication factor, which most installations leave at the default factor of 3, D is the initial size of the data, and
S is the amount of space allocated for the “shuffle” phase of the MapReduce process, where intermediate
blocks of results are moved from node to node and aggregated. One senior Hadoop consultant noted that the
usual rule of thumb for shuffle space is to allow approximately the same amount of space as the original data
before replication. Using this formula gives us a basic Hadoop sizing rule of thumb of 4 x D.
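The derivation reduces to a one-line formula. A minimal sketch follows; the 100 TB input is a hypothetical example.

```python
def raw_hdfs_capacity(data_size, replication=3, shuffle_factor=1.0):
    """(R x D) + S, with shuffle space S ~= D by the rule of thumb above."""
    return replication * data_size + shuffle_factor * data_size

# Default 3x replication plus shuffle space yields the 4 x D rule of thumb:
print(raw_hdfs_capacity(100))  # 400 (TB of raw capacity for 100 TB of data)
```

Keeping R and the shuffle factor as parameters makes it easy to re-run the estimate for installations that deviate from the defaults.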
18 Currently there is only a limited sample of production environments, but the number is growing for
the same reasons that other applications initially moved to VMs from physical infrastructure — capital
resource efficiency and management cost.
19 VSAN in its initial version(s) will almost certainly be optimized for the random read/write mix
characteristic of a general-purpose enterprise VM cluster, as opposed to the read-dominated and large-
block transfer patterns of a Hadoop cluster, but it is hard to imagine that VMware will not pursue Hadoop
optimization in the near-term future.
20 While a wide range of offerings from vendors such as HP, IBM, Cisco, and Dell bundle storage,
network, and compute nodes as integrated VM clusters, Forrester recommends also evaluating newer
options from Nutanix and SimpliVity for virtualized Hadoop because of their deeply integrated federated
storage architecture, which may prove to ameliorate many of the complexities of managing the Hadoop
storage resources. Also worthy of additional scrutiny are advanced storage-centric products like Maxta,
Tintri, and Atlantis Computing, which dramatically simplify the task of deploying and managing VMware
storage. Emerging tools such as EMC ViPR, which can overlay an HDFS definition on top of an existing SAN,
offer additional potential for simplifying storage management of Hadoop on a VM cluster. To the extent that
VMware makes its VSAN product relevant to Hadoop, all other solutions may lose a great deal of relevance.
21 One example that we were able to find was a 200 VM instance running on Amazon's Elastic MapReduce
service that generated bills in excess of $40,000 per month before the user brought the workload in-house
on an eight-node x86 Hadoop cluster.
Forrester Research (Nasdaq: FORR) is a global research and advisory firm serving professionals in 13 key roles across three distinct client
segments. Our clients face progressively complex business and technology decisions every day. To help them understand, strategize, and act
upon opportunities brought by change, Forrester provides proprietary research, consumer and business data, custom consulting, events and
Forrester Focuses On Infrastructure & Operations Professionals
You are responsible for identifying — and justifying — which technologies
and process changes will help you transform and industrialize your
company’s infrastructure and create a more productive, resilient, and
effective IT organization. Forrester’s subject-matter expertise and
deep understanding of your role will help you create forward-thinking
strategies; weigh opportunity against risk; justify decisions; and optimize
your individual, team, and corporate performance.
IAN OLIVER, client persona representing Infrastructure & Operations Professionals
About Forrester
A global research and advisory firm, Forrester inspires leaders, informs better decisions, and helps the world's top companies turn
the complexity of change into business advantage. Our research-
based insight and objective advice enable IT professionals to
lead more successfully within IT and extend their impact beyond
the traditional IT organization. Tailored to your individual role, our
resources allow you to focus on important business issues —
margin, speed, growth — first, technology second.
FOR MORE INFORMATION
To find out how Forrester Research can help you be successful every day, please
contact the office nearest you, or visit us at www.forrester.com. For a complete list
of worldwide locations, visit www.forrester.com/about.
CLIENT SUPPORT
For information on hard-copy or electronic reprints, please contact Client Support
at +1 866.367.7378, +1 617.613.5730, or [email protected]. We offer
quantity discounts and special pricing for academic and nonprofit institutions.