LSI Corporation
Contributed articles written & placed by Gallagher PR
1
Table of Contents
Title Date Publication Page
Busting Through the Biggest Bottleneck in Virtualized Servers
7/31/13 Data & Storage Management Report
3
Accelerating Big Data Analytics with Flash Caching
7/23/13 Silicon Angle 7
Reality Check: The role of ‘smart silicon’ in mobile networks
7/9/13 RCR Wireless 13
The Revival of Direct Attached Storage for Oracle Databases
7/9/13 Database Trends & Applications 15
Achieving Low Latency: The Fall and Rise of Storage Caching
7/1/13 Datanami 18
Addressing the data deluge challenge in mobile networks with intelligent content caching
6/7/13 Electronic Component News 24
PCIe flash: It solves lots of problems, but also makes a bunch - so what's its future?
5/14/13 TechSpot 28
Mega Datacenters: Pioneering the Future of IT Infrastructure
4/25/13 DatacenterPOST 33
The Evolution Of Solid-State Storage In Enterprise Servers
4/23/13 Electronic Design 37
Networks to Get Smarter and Faster in 2013 and Beyond
3/13/13 Converge! Network Digest 46
Maximizing Solid State Storage Capacity in Small Form Factors (Single-chip "DRAM-less")
3/4/13 Electronic Component News 50
Bridging the Data Deluge Gap--The Role of Smart Silicon in Networks
2/28/13 EE Times 53
Accelerating SAN Storage with Server Flash Caching
1/31/13 Computer Technology Review 55
2
Understanding SSD Over-provisioning
1/8/13 Electronic Design News 58
Next-generation Multicore SoC Architectures for Tomorrow’s Communications Networks
12/11/12 Embedded Computing Design 64
The Inside View: Why Network Infrastructures Need Smarter Silicon
11/10/12 Network World 69
Avoiding “Whack-A-Mole” in the Datacenter
9/10/12 Data Center Journal 72
Virtualization of Data Centers: New Options in the Control and Data Planes (Part III)
8/30/12 Data Center Knowledge 79
Bridging the Data Deluge Gap
8/23/12 Forbes 81
Virtualization of Data Centers: New Options in the Control & Data Planes (Part II)
8/20/12 Data Center Knowledge 84
Virtualization of Data Centers: New Options in the Control and Data Planes
8/2/12 Data Center Knowledge 86
3
Busting Through the Biggest Bottleneck in Virtualized Servers
By Tony Afshary
The data deluge has brought renewed focus on an old problem: the enormous performance gap that exists
in input/output (I/O) between a server’s memory and its storage. I/O typically takes a mere 100
nanoseconds for information stored in the server’s memory, while I/O to a hard disk drive (HDD) takes
about 10 milliseconds — a difference of five orders of magnitude that is having a profound adverse impact
on application performance.
This bottleneck exists for both dedicated and virtualized servers, but can be far worse with the latter
because virtualization creates the potential for much greater resource contention. Virtualization affords
numerous benefits by dramatically improving server utilization (from around 10 percent in dedicated
servers to 50 percent or more), but the increased per-server application load inevitably exacerbates the
I/O bottleneck. Multiple applications, all competing for the same finite I/O, can turn what
might have been orderly, sequential access for each into completely random reads and writes for the server,
creating a worst-case scenario for HDD performance.
In a virtualized server, the primary symptom of contention is when any virtual machine (VM) must wait for
CPU cycles, and/or for I/O to memory or disk. Fortunately, such contention can be minimized by judicious
balancing of the total workload among all virtual servers, and by optimizing the allocation of each server’s
physical resources. Taking these steps can enable a VM to perform as well as a dedicated server.
Unfortunately, however, server virtualization is normally accompanied by storage virtualization, which
virtually assures an adverse impact on application performance. Compared to direct attached storage
(DAS), a storage area network (SAN) or network-attached storage (NAS) has a higher I/O latency,
combined with lower bandwidth and throughput, which also decreases I/O operations per second (IOPS).
Frequent congestion on the intervening Fibre Channel (FC), FC over Ethernet, iSCSI or Ethernet network
further degrades performance.
The extent of the I/O bottleneck issue became apparent in a recent LSI survey of 412 European
datacenter managers. The results revealed that while 93 percent acknowledge the critical importance of
optimizing application performance, a full 75 percent do not feel they are achieving the desired results.
Not surprisingly, 70 percent of the survey respondents cited storage I/O as the single biggest bottleneck
in the datacenter today.
Cache in Flash
Caching data to memory in a server, or in a SAN controller or cache appliance, is a proven technique for
reducing I/O latency and, thereby, improving application-level performance. But because the size of the
cache that is economically feasible with random access memory (measured in gigabytes) is only a small
fraction of the capacity of even a single disk drive (measured in terabytes), traditional RAM-based caching
is increasingly unable to deliver the performance gains required in today’s virtualized datacenters.
Consider what happens in a typical virtualized server. Each VM is allocated some amount of RAM, and
together these allocations usually exceed the total amount of physical memory available. This can result in
the VMs competing for memory, and as they do, it is necessary for the hypervisor to swap pages out and
in, to and from (very slow) disk storage, further exacerbating the I/O bottleneck.
Flash memory technology helps break through the cache size limitation imposed by RAM to again make
caching an effective and cost-effective means for accelerating application performance. As shown in the
4
diagram, flash memory, with an I/O latency of less than 50 microseconds, fills the significant performance
gap between main memory and Tier 1 storage.
Flash memory fills the void in both latency and capacity that exists between main memory and fast-
spinning hard disk drives
The closer the data is to the processor, the better the performance. This is why applications requiring high
performance normally use DAS, and it is also why flash cache provides its biggest benefit when placed
directly in the virtualized server on the PCI Express (PCIe) bus. Intelligent caching software is then used
to automatically and transparently place “hot data” (the most frequently accessed data) in the low-latency
flash memory, where it is accessed up to 200 times faster than when on a Tier 1 HDD. The flash cache can
also be configured to become the “swap cache” for main memory, thus helping to mitigate performance
problems being caused by memory contention.
5
The intelligent caching software detects hot data by constantly monitoring the physical server’s I/O
activity to find the specific ranges of logical block addresses that are experiencing the most reads and/or
writes, and continuously moving these into the cache. With this approach, the flash cache is able to
support all of the VMs running in any server.
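A minimal sketch of this kind of frequency-based hot-data tracking, in Python. The region size, promotion threshold and promotion policy are illustrative assumptions, not LSI's actual algorithm:

```python
from collections import Counter

REGION_SIZE = 2048       # LBAs per tracked region (assumed granularity)
PROMOTE_THRESHOLD = 100  # accesses before a region is deemed "hot" (assumed)

class HotDataTracker:
    """Counts reads/writes per LBA region and promotes the busiest to cache."""

    def __init__(self):
        self.access_counts = Counter()
        self.cached_regions = set()

    def record_io(self, lba: int) -> None:
        region = lba // REGION_SIZE
        self.access_counts[region] += 1
        if (self.access_counts[region] >= PROMOTE_THRESHOLD
                and region not in self.cached_regions):
            self.cached_regions.add(region)  # copy this region into flash cache

    def is_cached(self, lba: int) -> bool:
        return lba // REGION_SIZE in self.cached_regions

tracker = HotDataTracker()
for _ in range(150):                 # a database repeatedly hitting one index block
    tracker.record_io(4096)
print(tracker.is_cached(4096))       # prints: True
```

Because the tracker watches physical I/O below the hypervisor, a single cache serves every VM on the host without per-VM configuration.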
The intelligent caching algorithms normally give the highest priority to highly random, small block-
oriented applications, such as those for databases and on-line transaction processing, because these stand
to benefit the most. By contrast, applications with sequential read and/or write operations benefit very
little from caching (except when multiple such applications are configured to run on the same server!), so
these are given the lowest priority.
How can flash memory, with a latency up to 100 times higher than RAM's, outperform traditional caching
systems? The reason is the sheer capacity possible with flash memory, which dramatically increases the
“hit rate” of the cache. Indeed, with some flash cache cards now supporting multiple terabytes of high-
performance solid state storage, there is often sufficient capacity to store rather large datasets for all of a
server’s VMs as hot data.
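The arithmetic behind this trade-off is simple: average I/O latency is the hit rate times the cache latency plus the miss rate times the HDD latency. Using the article's figures (under 50 microseconds for flash, 10 milliseconds for HDD) and purely illustrative hit rates:

```python
def effective_latency_us(hit_rate, cache_latency_us, miss_latency_us=10_000):
    """Average I/O latency for a cache with the given hit rate (HDD miss = 10 ms)."""
    return hit_rate * cache_latency_us + (1 - hit_rate) * miss_latency_us

# Small RAM cache: very fast (~0.1 us) but low hit rate (illustrative 40%).
ram_cache = effective_latency_us(0.40, 0.1)
# Large flash cache: slower per access (50 us) but high hit rate (illustrative 95%).
flash_cache = effective_latency_us(0.95, 50)

print(f"RAM cache: {ram_cache:.0f} us, flash cache: {flash_cache:.0f} us")
```

Even though each flash access is far slower than RAM, the much larger cache's higher hit rate avoids so many 10 ms disk misses that its average latency is an order of magnitude lower.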
Exhaustive internal LSI testing has shown that the application-level performance gains afforded by flash
cache acceleration in both dedicated and virtualized servers are considerable. For servers with DAS, which
already enjoy the “proximity performance advantage” over SAN/NAS environments, typical improvements
can be in the range of 5 to 10 times. In environments with a SAN or NAS, which experience additional
latency caused by the network, server-side flash caching can improve performance even more — by up to
30 times in some cases.
Flash Forward to the Future
Flash memory has a very promising future. Flash is already the preferred storage medium in tablets and
ultrabooks, and increasingly in laptop computers. Solid state drives (SSDs) are replacing or supplementing
HDDs in desktop PCs and servers with DAS, while the fastest SSD storage tiers are growing larger in SAN
and NAS configurations.
Solid state storage is also non-volatile, so unlike caching with RAM, which is read-only and subject to data
loss during a power failure, a flash cache can support both reads and writes, and some solutions now offer
RAID-like data protection, making the cache the equivalent of a fast storage tier. Internal LSI
testing has shown that adding write acceleration to the flash cache (with writes then persisted to
primary storage) can improve application-level performance in write-intensive applications even beyond
the 10 to 30 times noted above.
The key to making continued improvements in flash price/performance, similar to what has been the case
for CPUs with Moore’s Law, is the flash storage processors (FSPs) that facilitate shrinking flash memory
geometries and/or higher cell densities. To accommodate these advances, future generations of FSPs will
need to offer ever more sophisticated error correction (to improve reliability) and wear-leveling (to
improve endurance).
Flash memory enjoys some other advantages that are beneficial in virtualized datacenters, as well,
including a combination of higher density and lower power consumption compared to HDDs, which enables
more storage in a smaller space that also requires less cooling. SSDs are also typically far more reliable
than HDDs, and should one ever fail, RAID data protection is restored much faster.
As the price/performance of flash memory continues to improve, the rapid adoption of solid state storage
will likely continue in the datacenter. But don’t expect SSDs to completely replace HDDs any time soon.
HDDs have tremendous advantages in storage capacity and the per-gigabyte cost of that capacity. And
because the vast majority of data in most organizations is accessed only occasionally, the higher I/O
6
latency of HDDs is normally of little consequence—particularly because this “dusty data” can quickly
become hot data in a flash (pun intended!) on those infrequent occasions when needed.
Flash cache has now become part of the virtualization paradigm because it maximizes virtualization's
benefits. Servers are virtualized to get more work from each one, resulting in considerable savings in
capital and operational expenditures, as well as in precious space and power. Storage is virtualized to
achieve similar savings through greater efficiencies and economies of scale. Flash cache helps provide a
more cost-effective way to get even more work from virtualized servers and faster work from virtualized
storage.
ABOUT Tony Afshary
Tony Afshary is the business line director for Nytro Solutions Products within the Accelerated Solutions
Division of LSI Corporation. In this role, he is responsible for product management and product marketing
of the LSI Nytro product family of enterprise flash-based storage, including PCIe based flash, utilizing
seamless and intelligent placement of data to accelerate datacenter applications.
Previously, Afshary was responsible for marketing and planning of LSI’s data protection and management
and storage virtualization products. Prior to that, he was the director of Customer/Application Engineering
for LSI’s server/storage products. He has been in the storage industry for over 13 years. Before joining
LSI, Afshary worked at Intel for 11 years, managing marketing and development activities for storage and
communication processors. Afshary received a bachelor’s degree in Electrical and Computer Engineering
and an MBA from Arizona State University.
7
Accelerating Big Data Analytics with Flash Caching
By Kimberly Leyenaar
The global volume, velocity and variety of data are all increasing, and these three dimensions of the data
deluge—the massive growth of digital information—are what make Hadoop software ideal for big data
analytics. Hadoop is purpose-built for analyzing a variety of structured and unstructured data, but its
biggest advantage is its ability to cost-effectively analyze an unprecedented volume of data on clusters of
commodity servers.
While Hadoop is built around the ability to linearly scale and distribute MapReduce jobs across a cluster,
there is now a more cost-effective option for scaling performance in Hadoop clusters: high-performance
read/write PCIe flash cache acceleration cards.
Scaling Hadoop Performance: A Historical Perspective
The closer the data is to the processor, the lower the latency and the better the performance. This
fundamental principle of data proximity has guided the Hadoop architecture, and it is the main
reason for Hadoop’s success as a high-performance big data analytics solution.
8
To keep the data close to the processor, Hadoop uses servers with direct-attached storage (DAS). And to
get the data even closer to the processor, the servers are usually equipped with significant amounts of
random access memory (RAM).
Small portions of a MapReduce job are distributed across multiple nodes in a cluster for processing in
parallel, giving Hadoop its linear scalability. Depending on the nature of the MapReduce jobs, bottlenecks
can form either in the network or in the individual server nodes. These bottlenecks can often be eliminated
by adding more servers, more processor cores, or more RAM.
With MapReduce jobs, a server’s maximum performance is usually determined by its maximum RAM
capacity. This is particularly true during the Reduce phase, when intermediate data shuffles, sorts and
merges exceed the server RAM size, forcing the processing to be performed with input/output (I/O) to
hard disk drives (HDDs).
As the need for I/O to disk increases, performance degrades considerably. Slow storage I/O is rooted in
the mechanics of traditional HDDs, and this increased latency of I/O to disk imposes a severe performance
penalty.
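The spill behavior described above is essentially an external merge sort: sort what fits in RAM, write each sorted run out, then stream-merge the runs, with every spill and merge pass paying disk latency. A simplified sketch, in which in-memory lists stand in for the on-disk run files a real shuffle would write:

```python
import heapq
import itertools

def external_sort(records, memory_limit):
    """Sort an iterable when only `memory_limit` records fit in RAM:
    sort each chunk that fits, 'spill' it as a sorted run (to disk in a
    real system), then stream-merge the runs back together."""
    runs = []
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, memory_limit))
        if not chunk:
            break
        runs.append(sorted(chunk))    # each run would be written to an HDD
    return list(heapq.merge(*runs))   # merging reads every run back from disk

print(external_sort([5, 3, 8, 1, 9, 2, 7], memory_limit=3))  # prints: [1, 2, 3, 5, 7, 8, 9]
```

Each record is written and re-read at least once, which is why Reduce-phase performance collapses to HDD speed once the working set exceeds RAM.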
One cost-effective way to break through the disk I/O bottleneck and further scale the performance of
the Hadoop cluster is to use solid state flash memory for caching.
Scaling Hadoop Performance with Flash Caching
Data has been cached from slower to faster media since the advent of the mainframe computer, and it
remains an essential function in every computer today. The enduring and widespread use of caching
demonstrates its ability to deliver substantial and cost-effective performance improvements.
9
When a server is equipped with its full complement of RAM and that
memory is fully utilized by applications, the only way to increase caching capacity is to add a different
type of memory. One option is NAND flash memory, which is up to 200 times faster than a high-
performance HDD.
A new class of server-side PCIe flash solution uniquely integrates onboard flash memory with Serial-
Attached SCSI (SAS) interfaces to create high-performance DAS configurations consisting of a mix of solid
state and hard disk drive storage, coupling the performance benefits of flash with the capacity and cost
advantages of HDDs.
Testing Cluster Performance With and Without Flash Caching
To compare cluster performance with and without flash caching, LSI used the widely accepted TeraSort
benchmark. TeraSort tests performance in applications that sort large numbers of 100-byte records, which
requires a considerable amount of computation, networking and storage I/O—all characteristics of real-
world Hadoop workloads.
LSI used an eight-node cluster for its 100-gigabyte (GB) TeraSort test. Each server was equipped with 12
CPU cores, 64 GB of RAM and eight 1-terabyte HDDs, as well as an LSI® Nytro MegaRAID 8100-4i
acceleration card combining 100GB of onboard flash memory with intelligent caching software and LSI
dual-core RAID-on-Chip (ROC) technology. The acceleration card’s onboard flash memory was deactivated
for the test without caching.
No software change was required because the flash caching is transparent to the server applications,
operating system, file subsystem and device drivers. Notably, RAID (Redundant Array of Independent
Disks) storage is not normally used in Hadoop clusters because of the way the Hadoop Distributed File
10
System replicates data among nodes. So while the RAID capability of the Nytro MegaRAID acceleration
card would not be used in all Hadoop clusters, this feature adds little to the overall cost of the card.
LSI internal testing with flash caching activated found that the TeraSort test consistently completed
approximately 33 percent faster. This performance improvement from caching scales in proportion to the
size of the cluster needed to complete a specific MapReduce or other job within a required run time.
LSI Nytro MegaRAID card using the TeraSort benchmark completed Hadoop jobs 33 percent
faster (LSI internal test; individual results may vary).
Saving Cash with Cache
Based on results from the internal LSI TeraSort benchmark performance test, the table below compares
the estimated total cost of ownership (TCO) of two cluster configurations—one with and one without flash
caching—that are both capable of completing the same job in the same amount of time.
11
                                        Without Caching   With Caching
Number of Servers                       1,000             750
Servers (MSRP of $6,280)                $6,280,000        $4,710,000
Nytro MegaRAID Cards (MSRP of $1,799)   $0                $1,349,250
Total Hardware Costs                    $6,280,000        $6,059,250
Costs for Rack Space, Power, Cooling
and Administration Over 3 Years *       $19,610,000       $14,707,500
3-Year Total Cost of Ownership          $25,890,000       $20,766,750

* Cost computed using data from the Uptime Institute, an independent division of The 451 Group
The tests showed that in certain circumstances, using fewer servers to accommodate the same processing
time requirement can reduce TCO by up to 20 percent, or $5.1 million, over three years.
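The published totals can be reproduced with a short calculation; note that the per-server three-year opex figure below is back-computed from the table rather than published separately:

```python
def three_year_tco(servers, server_price, card_price, opex_per_server_3yr):
    """3-year TCO = hardware (servers + caching cards) + 3-year opex."""
    hardware = servers * (server_price + card_price)
    return hardware + servers * opex_per_server_3yr

# Per-server 3-year opex implied by the table ($19,610,000 / 1,000 servers).
OPEX_PER_SERVER = 19_610_000 / 1000

without_cache = three_year_tco(1000, 6280, 0, OPEX_PER_SERVER)
with_cache = three_year_tco(750, 6280, 1799, OPEX_PER_SERVER)
savings = without_cache - with_cache

print(f"${savings:,.0f} saved ({savings / without_cache:.0%})")  # prints: $5,123,250 saved (20%)
```

The savings come overwhelmingly from the 250 eliminated servers' rack space, power, cooling and administration, which more than offset the cost of the 750 caching cards.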
Conclusion
Organizations using big data analytics now have another option for scaling performance: PCIe flash cache
acceleration cards. While these tests centered on Hadoop clusters, LSI’s extensive internal testing with
various databases and other popular applications consistently demonstrates performance improvement
gains ranging from a factor of three (for DAS configurations) to a factor of 30 (for SAN and NAS
configurations).
12
Big data is only as useful as the analytics that organizations use to unlock its full value, making Hadoop a
powerful tool for analyzing data to gain deeper insights in science, research, government and business.
Servers need to be smarter and more efficient, and flash caching enables fewer servers (with fewer
software licenses) to perform more work, more cost-effectively, for data sets large and small: a great
option for IT managers working to do more with less under the growing pressure of the data deluge.
About the Author
Kimberly Leyenaar is a Principal Big Data Engineer and Solution Technologist for LSI’s Accelerated Storage
Division. An Electrical Engineering graduate from the University of Central Florida, she has been a storage
performance engineer and architect for over 14 years. At LSI, she now focuses on discovering innovative
ways to solve the challenges surrounding Big Data applications and architectures.
13
Reality Check: The role of ‘smart silicon’ in mobile
networks
By Greg Huff, SVP and CTO, LSI
Editor’s Note: Welcome to our weekly Reality Check column, where we invite C-level executives and
advisory firms from across the mobile industry to provide their unique insights into the marketplace.
What does “smart silicon” (specialized integrated circuits with both general-purpose and function-specific
processors) have to do with next-generation mobile services? Plenty, especially as the number of
bandwidth-hungry devices and applications continues to grow unabated. To accommodate the
accompanying data deluge, base station throughput will need to increase by more than an order of
magnitude from 300 megabits per second in 3G networks to 5 gigabits per second in LTE networks. LTE-
Advanced technology will require base station throughput to double again to 10 Gbps.
Several related changes are also having an impact on base stations. Next-generation access networks are
using more and smaller cells to deliver the higher data rates reliably. Multiple radios are being employed
in cloud-like distributed antenna systems. Network topologies are flattening. Content is being cached at
the edge to conserve backhaul bandwidth. Operators are offering advanced quality of service and location-
based services, and are moving to application-aware billing.
These changes are motivating mobile network operators to seek more intelligent and more cost-effective
ways to keep pace with the data deluge, and this is where smart silicon can help. General-purpose
processors are simply too slow for base station functions that must operate deep inside every packet in
real-time, such as packet classification, digital signal processing, transcoding, encryption/decryption and
traffic management.
For this reason, packet-level processing functions are increasingly being performed in hardware to
improve performance, and these hardware accelerators are now being integrated with multicore
processors in specialized system-on-chip communications processors. The number of function-specific
acceleration engines available also continues to grow, and more engines (along with more processor
cores) can now be placed on a single SoC. With current technology, it is even possible to integrate an
equipment vendor’s unique intellectual property into a custom SoC for use in a proprietary system. In
many cases, these advances now make it possible to replace multiple SoCs with a single SoC in base
stations.
14
In addition to delivering higher throughput, SoCs reduce the total cost of the system, resulting in a
significant improvement in its price/performance, while the inclusion of multiple acceleration engines
makes it easier to satisfy end-to-end QoS and service-level agreement requirements. An equally important
consideration in mobile network infrastructures is power consumption, and here too the SoC has a distinct
advantage with its ability to replace multiple discrete components with a single, energy-efficient integrated
circuit.
Another challenge involves the way hardware acceleration is implemented in some SoCs. The problem is
caused by the workflow within the SoC when packets must pass through several hardware acceleration
engines, as is the case for many services and applications. If traffic flows must be handled by a general-
purpose processor core whenever traversing a different acceleration engine, undesirable latency and jitter
(variability in latency) will both increase, potentially significantly.
Some next-generation SoCs address this issue by supporting configurable pipelines capable of processing
packets deterministically. Each separate service-oriented pipeline creates a message-passing control path
that enables system designers to specify different packet-processing flows that utilize different
combinations of acceleration engines. Such granular traffic management enables any service to process
any traffic flow directly through any engines required and in any sequence desired without intervention
from a general-purpose processor, thereby minimizing latency and assuring that even the strictest QoS
and SLA guarantees can be met.
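A configurable, service-oriented pipeline of this kind can be sketched as follows; the engine names, flow definitions and dispatch mechanism are purely illustrative, not any particular SoC's interface:

```python
# Hypothetical acceleration engines: each transforms a packet (a dict) and
# passes it on. In silicon these are fixed-function hardware blocks.
ENGINES = {
    "classify":  lambda pkt: {**pkt, "flow": hash(pkt["dst"]) % 4},
    "decrypt":   lambda pkt: {**pkt, "payload": pkt["payload"].lower()},
    "transcode": lambda pkt: {**pkt, "codec": "amr-wb"},
    "shape":     lambda pkt: {**pkt, "queued": True},
}

# Each service configures its own pipeline: its packets traverse exactly
# these engines, in this order, with no general-purpose core in the path.
PIPELINES = {
    "voice": ["classify", "decrypt", "transcode", "shape"],
    "web":   ["classify", "decrypt", "shape"],
}

def process(pkt, service):
    """Run a packet through the engine sequence configured for its service."""
    for engine in PIPELINES[service]:
        pkt = ENGINES[engine](pkt)
    return pkt

out = process({"dst": "10.0.0.1", "payload": "HELLO"}, "web")
print(out["queued"], out["payload"])  # prints: True hello
```

Because each pipeline is fixed at configuration time, every packet of a given service takes a deterministic path with predictable latency, which is what keeps jitter bounded.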
Without these advances in integrated circuits, it would be virtually impossible for mobile operators to keep
pace with the data deluge. So what does “smart silicon” have to do with next-generation mobile services,
especially when it comes to reducing cost while improving overall system performance? Everything.
Greg Huff is SVP and CTO for LSI. In this capacity, he is responsible for shaping the future growth strategy
of LSI products within the storage and networking markets. Huff joined the company in May 2011 from
Hewlett-Packard, where he was VP and CTO of the company’s Industry Standard Server business. In that
position, he was responsible for the technical strategy of HP’s ProLiant servers, BladeSystem family
products and its infrastructure software business. Prior to that, he served as research and development
director for the HP Superdome product family. Huff earned a bachelor’s degree in Electrical Engineering
from Texas A&M University and an MBA from the Cox School of Business at Southern Methodist University.
15
The Revival of Direct Attached Storage for Oracle Databases
By Tony Afshary
Storage area networks (SANs) and network-attached storage (NAS) owe their popularity to some
compelling advantages in scalability, utilization and data management. But achieving high performance
for some applications with a SAN or NAS can come at a premium price. In those database applications
where performance is critical, direct-attached storage (DAS) offers a cost-effective high-performance
solution. This is true for both dedicated and virtualized servers, and derives from the way high-speed
flash memory storage options can be integrated seamlessly into a DAS configuration.
Revival of DAS in the IT Infrastructure
Storage subsystems and their capacities have changed significantly since the turn of the millennium,
and these advancements have caused a revival of DAS in both small and medium businesses and large
enterprises. To support this trend, vendors have added support for DAS to their existing product lines
and introduced new DAS-based solutions. Some of these new solutions combine DAS with solid state
storage, RAID data protection and intelligent caching technology that continuously places “hot” data in
the onboard flash cache to accelerate performance.
Why the renewed interest in DAS now, after so many organizations have implemented SAN and/or
NAS? There are three reasons. The first is performance: DAS outperforms all forms
of networked storage owing to its substantially lower latency. The second is cost savings from
minimizing the need to purchase and administer SAN or NAS storage systems and the host bus
adapters (HBAs) required to access them. The third is ease of use: implementing and managing
DAS is simple compared to the other storage architectures. This is particularly true for Oracle
database applications.
The Evolution of DAS
DAS technology has evolved considerably over the years. For example, Serial-Attached SCSI (SAS)
expanders and switches enable database administrators (DBAs) to create very large DAS configurations
capable of containing hundreds of drives, while support for both SAS and SATA enables DBAs to deploy
those drives in tiers. And new management tools, including both graphical user and command line
interfaces, have dramatically simplified DAS administration.
While networked storage continues to have an advantage in resource utilization compared to DAS, the
cost of unused spindles today is easily offset by the substantial performance gains DAS delivers for
applications running software with costly per-server licenses. In fact, having some unused spindles on a
database server offers the ability to “tune” the storage system as needed.
A DBA could, for example, use the spare spindles to either isolate certain database objects for better
performance, or allocate them to an existing RAID LUN. When using only HDDs for a database that
requires high throughput in I/O operations per second (IOPS), allocating database objects over more
spindles increases database performance. Allocating more spindles for performance rather than for
capacity is referred to as “short stroking.” With a smaller number of tracks containing data,
repositioning of the drive’s head is minimized, thereby reducing latency and increasing IOPS.
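The effect of short stroking can be approximated with simple service-time arithmetic; the drive parameters below are illustrative, not measured values:

```python
def array_iops(spindles, seek_ms, rotation_ms):
    """Aggregate random-read IOPS for an HDD array: per-I/O service time is
    average seek plus half a rotation; IOPS scale with spindle count."""
    service_time_ms = seek_ms + rotation_ms / 2
    return spindles * 1000 / service_time_ms

# Full-stroke drives: ~8 ms average seek; 15k RPM rotation = 4 ms.
full_stroke = array_iops(spindles=8, seek_ms=8.0, rotation_ms=4.0)

# Short-stroked: confining data to a narrow band of tracks cuts the average
# seek to ~2 ms, and spreading the load over more spindles multiplies IOPS.
short_stroke = array_iops(spindles=16, seek_ms=2.0, rotation_ms=4.0)

print(f"{full_stroke:.0f} vs {short_stroke:.0f} IOPS")  # prints: 800 vs 4000 IOPS
```

The capacity on the unused inner tracks is deliberately sacrificed, which is the trade-off the paragraph above describes: spindles allocated for performance rather than capacity.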
As is often the case in data centers, ongoing operational expenditures, especially for management,
eclipse the capital expenditure involved. Such is the case for SAN and NAS, which require a storage
administrator. No such ongoing OpEx is incurred with DAS, especially when using Oracle’s Automatic
Storage Management (ASM) system. And with the need for costly HBAs, switches and other
16
infrastructure in SAN/NAS environments, DAS often affords a lower CapEx today, as well, particularly in
database applications.
Today’s Oracle DBA
Being an Oracle DBA today is quite different compared to even just a few years ago. As organizations
strive to do more with less, Oracle has been teaming with partners to provide the tools and functionality
DBAs need to be more productive while enhancing performance. Consider just one example of how
much a DBA’s responsibilities have changed: improving performance by minimizing I/O waits, or the
percentage of time processors are waiting on disk I/O.
To increase storage performance by minimizing I/O waits in a typical database using exclusively HDDs,
a DBA might need to take one or more of the following actions:
• Isolate “hot” datafiles to cold disks or, if the storage device is highly utilized, move datafiles to
other spindles to even out the disk load.
• Rebuild the storage in a different RAID configuration, such as from RAID 5 to RAID 10, to increase
performance.
• Add more “short stroked” disks to the array to get more IOPS.
• Increase the buffer space in the System Global Area (SGA) and/or make use of the different caches inside
the SGA to fine-tune how data is accessed.
• Move “hot” data to a higher-performance storage tier, such as HDDs with faster spindles or solid
state drives (SSDs).
• Minimize or eliminate fragmentation in tables and index tablespaces.
Note that many of these actions require the DBA to evaluate the database continuously to
determine what currently constitutes “hot” data, and to make constant adjustments to optimize
performance. Some also require scheduling downtime to implement and test the changes being made.
An alternative to such constant and labor-intensive fine-tuning is the use of server-side flash storage
solutions that plug directly into the PCIe bus and integrate intelligent caching with support for DAS in
RAID configurations. Intelligent caching software automatically and transparently moves “hot” data—
that which is being accessed the most frequently—from the attached DAS HDDs to fast, on-board NAND
flash memory, thereby significantly decreasing the latency of future reads.
Testing of Flash Cache for DAS
Extensive evaluation of server-side flash-based application acceleration solutions under different
scenarios to assess improvements in IOPS, transactions per second, application response times and
other performance metrics, reveals that for I/O-intensive database applications, moving data closer to
the CPU delivers improvements in performance ranging from a factor of three to an astonishing factor of
100 in some cases.
In all test scenarios, the use of server-side flash caching consistently delivered superior
performance. The reduction in application response time ranged from 5X to 10X with no fine-tuning of
the configuration. When the database was tuned for both the HDD-only and flash cache configurations,
response times were reduced by nearly 30X from 710 milliseconds (HDD-only) to 25 milliseconds with
the use of cache.
These results demonstrate that while tuning efforts are effective, they are substantially more effective
with the use of flash cache. And even without tuning, flash cache is able to reduce response times by
up to an order of magnitude.
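The “nearly 30X” figure follows directly from the reported response times:

```python
hdd_only_ms = 710   # tuned HDD-only response time from the text
cached_ms = 25      # tuned response time with flash cache

speedup = hdd_only_ms / cached_ms
print(f"{speedup:.1f}x")   # 28.4x, i.e. "nearly 30X"
```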
Superior Performance with DAS
The use of direct-attached storage has once again become the preferred option for Oracle databases for
a variety of reasons. Not only does DAS deliver superior performance in database servers to get the
most from costly software licenses, it is also easier to administer, especially when using Oracle’s
Automatic Storage Management system. Some solutions also now enable DAS to be shared by multiple
servers.
Even better performance and cost efficiency can be achieved by complementing DAS with intelligent
server-side flash cache acceleration cards that minimize I/O latency and maximize IOPS. In addition, by
allowing infrequently accessed data to remain on HDD storage, organizations can deploy an economical
mix of high-performance flash and high-capacity hard-disk storage to optimize both the cost per IOPS
and the cost per gigabyte of storage.
Server-side flash caching solutions can also be used in SAN environments to improve
performance. Such tests have revealed both significant reductions in response times and dramatic
increases in transaction throughput. So whether using DAS or SAN, the combination of server-side flash
and intelligent caching has proven to be a cost-effective way to maximize performance and efficiency
from the storage subsystem.
About the author:
Tony Afshary is director of marketing, Accelerated Solutions Division, LSI, which designs semiconductors
and software that accelerate storage and networking in datacenters, mobile networks and client
computing.
Achieving Low Latency: The Fall and Rise of
Storage Caching
Tony Afshary
The caching of content from disk storage to high-speed memory is a proven technology for reducing read
latency and improving application-level performance. The problem with traditional caching, however, is
one of scale: Random access memory, typically used for caches, is limited to Gigabytes, while hard disk
drive-based storage exists on the order of Terabytes. The three orders of magnitude difference in scale
puts a practical limit on the potential performance gains. Flash memory has now made caching beneficial
again owing to its combination of low latency (on a par with memory) and high capacity (on a par with
hard disk drives).
A Brief History of Cache
The caching of data from slower media to faster ones has existed since the days of
mainframe computing, and quickly made its debut on PCs shortly after they entered
the market. Caching also exists at multiple levels and in different locations—from the
L1 and L2 cache built into processors to the dynamic RAM (DRAM) caching in the
controllers used with storage area networks (SANs) and network-attached storage
(NAS).
The long, widespread use of caching is a testament to its benefit: dramatically improving performance in a
transparent and cost-effective manner. For example, PCs constantly cache data from the hard disk drive
(HDD) to main memory to improve input/output (I/O) throughput. I/O to main memory takes about 100
nanoseconds, while I/O to a fast-spinning HDD takes around 10 milliseconds—a difference of five orders of
magnitude.
In this example, the cache works by moving the data and/or software currently being accessed (the
so-called “hot data”) from the HDD to main memory. The operating system’s file subsystem makes these
movements constantly and automatically using algorithms that detect hot data to improve the “hit rate” of
the cache. With such behind-the-scenes transparency, the only thing a user should ever notice is an
improvement in performance after adding more DRAM.
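The leverage that hit rate provides can be seen in the standard average-access-time formula, using the illustrative latencies above:

```python
def effective_latency(hit_rate, cache_ns=100, hdd_ns=10_000_000):
    """Average access time in nanoseconds for a given cache hit rate,
    using the article's illustrative latencies (100 ns DRAM, 10 ms HDD)."""
    return hit_rate * cache_ns + (1 - hit_rate) * hdd_ns

# Misses dominate: even at a 90% hit rate the average is ~1 ms.
# Pushing the hit rate toward 100% is exactly what a larger (flash)
# cache buys, which is why cache capacity matters so much.
print(f"{effective_latency(0.90) / 1000:.0f} us average")
print(f"{effective_latency(0.99) / 1000:.0f} us average")
```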
The data deluge impacting today’s datacenters, however, is causing traditional DRAM-based caching to
become less effective. The reason is that the amount of memory possible in a server or a caching
appliance is only a small fraction of the capacity of even a single disk drive. Because datacenters now
store multiple Terabytes or even Petabytes of data, and I/O rates are increasing with more applications
being run on virtualized servers, the performance gains from traditional forms of caching are becoming
increasingly insufficient.
Fortunately, there is now a solution for overcoming the limitation imposed by traditional DRAM-based
caching: flash memory.
Figure 1: Flash memory fills the void in both latency
and capacity between main memory and fast-spinning
hard disk drives.
Cache in a Flash
As shown in Figure 1, flash memory breaks through DRAM’s cache size limitation to again make
caching a highly effective and cost-effective means for accelerating application-level performance. Another
important advantage over DRAM is that flash memory is non-volatile, enabling it to retain stored
information even when not powered.
NAND flash memory-based storage solutions typically deliver the highest performance gains when the
flash cache is placed directly in the server on the high-performance Peripheral Component Interconnect
Express® (PCIe) bus. Even though flash memory has a higher latency than DRAM, PCIe-based flash cache
adapters deliver superior performance for two reasons. The first is the significantly higher capacity of flash
memory, which substantially increases the hit rate of the cache. Indeed, with some flash adapters now
supporting multiple Terabytes of solid state storage, there is often sufficient capacity to store entire
databases or other datasets as hot data.
The second reason involves the location of the flash cache: directly in the server on the PCIe bus. With no
external connections and no intervening network to a SAN or NAS (that is also subject to frequent
congestion), the hot data is accessible in a flash (pun intended).
Intelligent caching software running on the host server detects hot data blocks and caches these to the
flash cache. As shown in Figure 2, the caching software is located between the file system and the storage
device drivers. Direct-attached storage (DAS) and SAN use existing drivers; the flash cache card has a
Memory Pipeline Technology (MPT) driver. As hot data “cools,” the caching software automatically replaces
it with hotter data.
Figure 2: The intelligent caching software operates
between the server’s file system and the device
drivers to provide transparency to the
applications.
The intelligent caching software normally gives the highest priority to highly random, small-block I/O
applications, such as those for databases and online transaction processing (OLTP), as these
stand to benefit the most. The software detects hot data by monitoring I/O activity to find the specific
ranges of logical block addresses (LBAs) that are experiencing the most reads and/or writes, and moves
these into the cache.
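A much-simplified version of that LBA-range heat tracking, with a hypothetical 1 MB region size (real products use their own granularities and decay policies):

```python
from collections import Counter

REGION_BLOCKS = 2048   # hypothetical: 1 MB regions of 512-byte LBAs

heat = Counter()

def record_io(lba):
    """Attribute each I/O to its LBA region; the hottest regions become
    candidates for promotion into the flash cache."""
    heat[lba // REGION_BLOCKS] += 1

# random OLTP-style I/O clustering in the first region
for lba in [10, 500, 2100, 12, 700, 30, 2060]:
    record_io(lba)

hottest = [region for region, _ in heat.most_common(1)]
print(hottest)   # [0]: region 0 saw the most reads/writes
```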
By contrast, because applications with sequential read and/or write operations benefit very little from
caching, these are given a low priority. The reason is that arrays of 6 Gigabit/second (Gb/s) Serial-Attached
SCSI (SAS) and Serial ATA (SATA) HDDs can already achieve a satisfactory aggregate throughput of up to
3000 Megabytes/second (MB/s), and roughly double that with 12 Gb/s SAS.
Most PCIe flash adapters contain at least two SSD modules to support RAID (Redundant Array of
Independent Disks) configurations. In the unprotected RAID 0 mode, data is striped across both SSD
modules, creating a larger cache. In the protected RAID 1 mode, data is mirrored across the SSD modules
so that in the event one fails, the other has a complete copy.
Any data written to the flash cache must also be written to primary DAS or SAN storage, and there are
two ways this can occur. In Write Through mode, any data written to flash is simultaneously written to
primary storage. Because most applications will wait for confirmation that a write has been completed
before proceeding, this increases I/O latency. In Write Back mode data is written only to an SSD, or when
using mirroring, both SSDs, allowing write operations to be completed substantially faster. All writes are
then persisted to primary storage when the data cools and is replaced in the cache. Write Through mode
can safely use a RAID 0 configuration of the flash cache; Write Back mode should employ a RAID 1
configuration for adequate data protection.
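The two write modes can be contrasted in a small model; the latencies are illustrative only:

```python
FLASH_US, HDD_US = 50, 10_000   # illustrative write latencies in microseconds

def write(mode):
    """Latency the application observes for one acknowledged write."""
    if mode == "write-through":
        # data must reach flash AND primary storage before the ack
        return max(FLASH_US, HDD_US)
    if mode == "write-back":
        # ack as soon as the (mirrored) flash copy is committed;
        # primary storage is updated later, when the data cools
        return FLASH_US
    raise ValueError(mode)

print(write("write-through"))   # 10000: gated by the HDD
print(write("write-back"))      # 50: gated by flash alone
```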
Benchmark Test Results
LSI® has conducted extensive testing of application acceleration solutions under different scenarios to
assess improvements in I/O operations per second (IOPS), transactions per second, user response times
and other performance metrics. For I/O-intensive applications, these tests reveal improvements in
performance ranging from a factor of 3x to an astonishing factor of 100x. Reported here are the results of
one such test.
This particular test evaluates both the response times and transactional throughput of a MySQL OLTP
application using the SysBench system performance benchmark. The basic configuration is a dedicated
server with DAS consisting entirely of HDDs. The flash cache is a 100 Gigabyte Nytro™ MegaRAID®
8100-4i PCIe adapter with the Nytro XD intelligent caching software running in the host. Four different flash
cache configurations are used based on a combination of write modes (Write Through or Write Back) and
RAID levels (0 or 1).
Figure 3: Response times (in milliseconds) were
reduced by 65 percent using the flash cache in
Write Back mode with RAID 1 protection.
The “No SSD” results shown in Figures 3 and 4 are for the baseline configuration using HDDs with no flash
cache. In Write Through (WT) mode, all write operations are made directly to the HDDs, which limits the
performance gains to only about 20 percent. In Write Back (WB) mode, writes are made to the flash
cache, resulting in a response time improvement of up to 80 percent, as shown in Figure 3. But because
data protection is prudent with WB mode (as no protection would require using transaction logs to recover
from an SSD failure), a more realistic improvement would be 65 percent for the flash cache configured
with RAID 1 protection.
Figure 4: Transactions per second increased by a
factor of 3 using the flash cache in Write Back
mode with RAID 1 protection.
As with response times, transactions per second (TPS) throughput rates improve dramatically when the
flash cache is used for both reads and writes. And for some applications, the benefit of the 5-times
improvement in TPS shown in Figure 4 might outweigh the exposure from a lack of data protection,
particularly given the high reliability of flash memory. But even with RAID 1 protection, TPS throughput
increases by a factor of 3 over the “No SSD” configuration.
These tests show that even a relatively modest amount of flash cache (100 Gigabytes) can deliver
meaningful performance gains. Tests with 800 Gigabytes of flash reveal an improvement of up to 30 times
in SAN environments for some applications.
Conclusion
The size of a cache relative to the size of the data store is a key determining factor in its ability to improve
performance. This is the reason DRAM-based caches, limited to Gigabytes of capacity, have become less
effective under the growing data deluge. With SSDs and PCIe flash adapters now supporting Terabytes of
capacity, the size of the cache becomes considerably greater relative to the data store, which makes
caching proportionally more effective.
Another determining factor is the nature of the target application. I/O-intensive applications that involve
random read/write access stand to benefit substantially, while those accessing data sequentially,
especially in large blocks, stand to benefit little, if at all.
The final determining factor is the caching software’s ability to maximize the hit rate by accurately
identifying the hot spots in the data, as these are constantly changing for applications with random I/O
operations. Most do a fairly effective job, and the larger flash cache capacity now makes this a less critical
factor.
Although a flash cache inevitably offers at least some improvement in performance, the extent of the gain
might not be cost-justifiable. Fortunately there are free tools available that can predict the performance
gains possible on a per-application basis. These tools employ intelligent caching algorithms, similar to
what is actually used in the cache, to evaluate access patterns and provide an estimate of the likely
improvement in performance.
The opportunity to achieve substantial gains, combined with the ability to quantify the potential benefit in
advance of making any investment, make flash caching solutions an option worthy of serious
consideration in virtually any datacenter today.
About the Author
Tony Afshary is the Business Line Director for Nytro Solutions Products at LSI’s
Accelerated Solutions Division. In this role, he is responsible for Product
Management & Product Marketing for LSI's Nytro Family of enterprise flash based
storage, including PCIe based Flash, utilizing seamless and intelligent placement of
data to accelerate data-center applications.
Addressing the data deluge challenge in mobile
networks with intelligent content caching
Seong Hwan Kim, Ph.D., Technical Marketing Manager, LSI
The most recent IDC Predictions 2013: Competing on the 3rd Platform report forecasts the biggest driver
of IT growth to once again be mobile devices (smartphones, tablets, e-readers, etc.), generating around
20 percent of all IT purchases and accounting for more than 50 percent of all IT market growth. Mobile
devices continue to provide the ubiquitous and constant Internet access that is creating massive amounts
of multimedia traffic, with video remaining the dominant component in this data deluge.
Mobile networks are struggling to satiate the seemingly unquenchable thirst from more users for faster
access to more and more digital content. This dynamic is creating a “data deluge gap”—a disparity
between network capacity and growing demand. Competitive pressures prevent mobile operators from
being able to make the capital investment required to close this widening gap with brute force bandwidth,
making it necessary to explore new ways of providing services more intelligently and cost-effectively.
This article explores one such technique: intelligent content caching to improve overall throughput by
minimizing traffic flows end-to-end in mobile networks.
Meeting user expectations
Before exploring content caching, it is instructive to understand the user expectations driving the data
deluge gap. A recent study (reported in an Open Networking Summit presentation titled OpenRadio:
Software Defined Wireless Infrastructure) found that it takes around 7-20 seconds to load a full Web page
over mobile networks. On a corporate LAN or home broadband network, Web pages typically take 6
seconds or less to load. This diverging user experience adds to the perception that mobile networks are
too slow.
Meeting user expectations will become even more challenging as the amount of video and multimedia
traffic increases. Cisco’s Visual Networking Index forecasts video will constitute more than 70 percent of
all network traffic in the near future. Accommodating this explosive growth, particularly during periods of
peak activity, will require both more bandwidth and more intelligent use of that bandwidth from the access
to the core in mobile networks.
New mobile network management solutions will need to go beyond Quality of Service (QoS) and other
traditional traffic management provisions, however. The reason is: while QoS can prioritize traffic flows, it
can do nothing to minimize them. So as mobile networks become increasingly like content delivery
networks, it will be necessary to operate them as such. And one proven technique for minimizing the
amount of traffic end-to-end in content delivery networks is caching.
Intelligent content caching in mobile networks
Intelligent content caching is a cost-effective way to improve the Quality of Experience (QoE) for mobile
users. The fundamental idea of intelligent caching is to store popular content as close as possible to the
users, thereby making it more readily available while simultaneously minimizing backhaul traffic.
Content caching employs a geographically distributed, layered architecture, as shown in Figure 1.
There are two layers of caching established by location: one is at the edge or access portion of the
network; the other is more centralized toward the core of the network. Such a model is defined as
hierarchical caching.
Figure 1. Hierarchical caching architecture
While it can be financially justified by the cost savings, caching at Layer 1, at the edge of the network on
eNodeB or Radio Network Controller platforms, requires a higher initial investment owing to the high
number of access nodes involved. With far fewer nodes in the core, such as a gateway node and/or a
central datacenter, caching at Layer 2 requires a relatively low initial investment.
In a hierarchical caching architecture, content is cached concurrently in both layers to compound the
bandwidth savings. Numerous industry studies have shown that caching at Layer 2 can reduce traffic from
the mobile network core to the Internet by more than 30 percent. Caching at Layer 1 can reduce backhaul
traffic from the radio area network (RAN) to the core by 30 percent or more depending on the cache hit
rate, as recently reported in a Light Reading Webinar titled Extensibility: The Key to Maximizing Caching
Investments.
How intelligent content caching works
The bandwidth-reducing benefit of caching increases as the “hit rate” increases, which it inevitably does
with popular content, such as video going viral or a breaking news story. Figure 2 shows two different
data paths: a “cold” path for the first time content is accessed by any user; and a “hot” path for
subsequent access from cache by other users. This particular configuration employs an intelligent
communications processor to offload the CPU for better performance, and a “flash cache” card with solid
state memory. Not shown is the coordination of cached content between Layers 1 and 2.
Figure 2. Hot and cold data paths
In the cold data path, from a user’s perspective, if the deep packet inspection (DPI) engine fails to match
the request to an entry in the cache content table (that is, the content is not already cached), the
processor’s classification engine passes the request to the uplink Ethernet connection to be fetched from
an upstream source, either the Layer 2 cache or the target site on the Internet. If the content is coming
from the Internet and each cache has available capacity, the content will be placed in both Layer 1 and
Layer 2 cache while it is being delivered to the user. Intelligent algorithms are used to continuously
determine which content should be cached based on a combination of recency, popularity and other
factors.
Again from a user’s perspective, but this time from a different user, the DPI engine checks to see if the
content requested has been cached locally. If it is found in the cache content table, the processor’s
classification engine sends the request to the local, Layer 1 cache. All subsequent requests from this
particular user for this particular content are recognized directly by the classification engine and do not,
therefore, require any further involvement from the DPI engine.
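The cold and hot paths can be sketched as a two-layer lookup (hypothetical structure; a real implementation keys on DPI-extracted content identifiers and bounds each cache):

```python
layer1, layer2 = {}, {}   # edge (Layer 1) and core (Layer 2) caches,
                          # unbounded here for simplicity

def fetch(content_id, origin):
    """Serve from the nearest cache; on a cold path, populate both layers."""
    if content_id in layer1:
        return layer1[content_id], "hot:layer1"
    if content_id in layer2:
        layer1[content_id] = layer2[content_id]   # warm the edge cache
        return layer1[content_id], "hot:layer2"
    data = origin(content_id)                     # cold path to the Internet
    layer1[content_id] = layer2[content_id] = data
    return data, "cold"

origin = lambda cid: f"content-{cid}"
print(fetch("abc", origin)[1])   # cold
print(fetch("abc", origin)[1])   # hot:layer1
```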
While many of the content caching solutions available today utilize x86 or other general-purpose CPUs to
perform traffic inspection, this approach is not well suited for a Layer 1 cache where there are
requirements for low power consumption and low cost. Offloading the CPU with an intelligent
communications processor equipped with purpose-built acceleration engines, as depicted in Figure 2, can
yield up to a 5-times improvement in performance.
The problem with using general-purpose CPUs for packet-level processing is that critical, real-time tasks
like traffic inspection are often performed only at the port level. Because many applications use HTTP as a
transport layer, the lack of deep understanding of the specific applications in the network traffic flows
hinders efficient content management. So while a general-purpose CPU programming model makes
software development easier, it can result in CPU resources being overwhelmed and poor
performance/watt/cost.
By contrast, the hardware acceleration engines in purpose-built System on Chip (SoC) communications
processors provide much deeper application-level awareness in real-time, which is critical in broadband 3G
and 4G mobile networks. The SoC design also provides superior throughput performance while consuming
less power.
The use of solid state storage in purpose-built, small form factor flash cache acceleration cards similarly
maximizes performance with minimal power consumption compared to caching in memory or on hard disk
drives. A Vodafone “Typical Data Usage” chart shows that a 4-minute YouTube video is about 11 MB of
content, for example, while the video streaming of a 30-minute TV episode represents about 90 MB of
data. A flash cache acceleration card with 512 GB of capacity would, therefore, be capable of storing about
50,000 of these video clips or about 6,000 of the half-hour streaming videos.
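Those capacity estimates are simple division on the figures just cited:

```python
capacity_mb = 512 * 1024         # 512 GB card, in MB
clip_mb, episode_mb = 11, 90     # figures from the Vodafone chart

print(capacity_mb // clip_mb)    # 47662 clips (the article rounds to "about 50,000")
print(capacity_mb // episode_mb) # 5825 episodes ("about 6,000")
```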
Conclusion
Intelligent content caching affords three major benefits that together help close the data deluge gap. First,
by reducing latency, user QoE is improved dramatically, even under heavy loads, resulting in more
satisfied users. Second, by distributing the total load more evenly from the edge to the core, overall
network throughput can be optimized. Third and perhaps most importantly, profitability is increased
through a combination of more revenue from satisfied users and better utilization of available backhaul
bandwidth.
These benefits can all be maximized by using solutions purpose-built for the special needs of mobile
networks. The use of specialized mobile communications processors that combine multiple CPU cores with
multiple hardware acceleration engines—all on a single integrated circuit—results in maximum
performance with minimal power consumption. Dedicated and standards-based flash cache acceleration
cards provide both the performance and versatility needed to optimize the configuration of a hierarchical
caching architecture.
It bears repeating: As mobile networks become increasingly like content delivery networks, it will be
necessary to operate them as such. And intelligent content caching is a proven technique for delivering
content more quickly and cost-effectively.
About the author
Seong Hwan Kim is a Technical Marketing Manager for the Networking Solutions Group at LSI Corporation.
He has close to 20 years of experience in computer networks and digital communications. His expertise is
in enterprise networking, network and server virtualization, SDN/OpenFlow, cloud acceleration, wireless
communications and QoS/QoE management. A noted industry expert, he holds several networking
patents. His work has been published in numerous venues, including IEEE Communications and Elsevier
magazines, and he has presented at several industry conferences.
Seong Hwan Kim holds a Ph.D. in Electrical Engineering from the State University of New York at Stony
Brook and an MBA from Lehigh University.
PCIe flash: It solves lots of problems, but also
makes a bunch - so what's its future?
By Rob Ober
Editor’s Note:
This is a guest post by Rob Ober, corporate strategist at LSI. Prior to joining LSI, Rob was a fellow in the
Office of the CTO at AMD. He was also a founding board member of OLPC (the “$100 laptop”; laptop.org)
and OpenSPARC.
I want to warn you, there is some thick background information here first. But don’t worry. I’ll get to the
meat of the topic and that’s this: Ultimately, I think that PCIe cards will evolve to more external,
rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but
other leaders in flash are going down this path too...
I’ve been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance,
capacity, cost and performance have all been concerns to grapple with. Of course the flash is changing
too as the nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single-level cell (SLC) to multi-level cell
(MLC) to triple-level cell (TLC), and all the variants of these “trimmed” for specific use cases. The spec
“endurance” has gone from 1 million program/erase cycles (PE) to 3,000, and in some cases 500.
It’s worth pointing out that almost all the “magic” that has been developed around flash was already
scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity
increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data
transfer… And the “requirement” for state-of-the-art single-operation write latency has fallen well below
the write latency of the flash itself. (What the…?? Yeah – I’ll talk about that later in some other blog. But
flash is ~1,500 µs write latency, where state-of-the-art flash cards are ~50 µs.) When I describe the state
of technology it sounds pretty pessimistic. I’m not. We’ve overcome a lot.
We built our first PCIe card solution at LSI in 2009. It wasn’t perfect, but it was better than anything else
out there in many ways. We’ve learned a lot in the years since – both from making them, and from
dealing with customers and users – of both our own solutions and our competitors’. We’re lucky to be an
important player in storage, so in general the big OEMs, large enterprises and the mega datacenters all
want to talk with us – not just about what we have or can sell, but what we could have and what
we could do. They’re generous enough to share what works and what doesn’t. What the values of
solutions are and what the pitfalls are too. Honestly? It’s the mega datacenters in the lead both practically
and in vision.
If you haven’t nodded off to sleep yet, that’s a long-winded way of saying – things have changed fast,
and, boy, we’ve learned a lot in just a few years.
Most important thing we’ve learned…
Most importantly, we’ve learned it’s latency that matters. No one is pushing the IOPS limits of flash, and
no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.
PCIe cards are great, but…
We’ve gotten lots of feedback, and one of the biggest things we’ve learned is – PCIe flash cards are
awesome. They radically change the performance profiles of most applications, especially databases,
allowing servers to run efficiently and the actual work done by each server to multiply 4x to 10x (and in a few extreme
cases 100x). So the feedback we get from large users is “PCIe cards are fantastic. We’re so thankful they
came along. But…” There’s always a “but,” right??
It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using
them. We’re not the only ones hearing it. To be clear, none of these are stopping people from deploying
PCIe flash… the attraction is just too compelling. But the problems are real, and they have real
implications, and the market is asking for real solutions.
Stranded capacity & IOPS
o Some “leftover” space is always needed in a PCIe card. Databases don’t do well when they
run out of storage! But you still pay for that unused capacity.
o All the IOPS and bandwidth are rarely used – sure, latency is met, but there is capability left
on the table.
o Not enough capacity on a card – It’s hard to figure out how much flash a server/application
will need. But there is no flexibility. If my working set goes one byte over the card capacity,
well, that’s a problem.
Stranded data on server fail
o If a server fails – all that valuable hot data is unavailable. Worse – it all needs to be
reconstructed when the server comes back online because it will be stale. It takes quite a while
to rebuild 2 TBytes of interesting data. Hours to days.
PCIe flash storage is a separate storage domain vs. disks and boot.
o Have to explicitly manage LUNs, move data to make it hot.
o Often have to manage via different API’s and management portals.
o Applications may even have to be re-written to use different APIs, depending on the vendor.
Depending on the vendor, performance doesn’t scale.
o One card gives awesome performance improvement. Two cards don’t give quite the same
improvement.
o Three or four cards don’t give any improvement at all. Performance maxed out somewhere
below 2 cards. It turns out drivers and server onloaded code create resource bottlenecks,
but this is more a competitor’s problem than ours.
Depending on the vendor, performance sags over time.
o More and more computation (latency) is needed in the server as flash wears and needs
more error correction.
o This is more a competitor’s problem than ours.
It’s hard to get cards in servers.
o A PCIe card is a card – right? Not really. Getting a high capacity card in a half height, half
length PCIe form factor is tough, but doable. However, running that card has problems.
o It may need more than 25W of power to run at full performance – the slot may or may not
provide it. Flash burns power proportionately to activity, and writes/erases are especially
intense on power. It’s really hard to remove more than 25W air cooling in a slot.
o The air is preheated, or the slot doesn’t get good airflow. It ends up being a server-by-
server, slot-by-slot qualification process. (Yes, slot by slot…) As trivial as this sounds, it’s
actually one of the biggest problems.
Of course, everyone wants these fixed without affecting single operation latency, or increasing cost, etc.
That’s what we’re here for though – right? Solve the impossible?
A quick summary is in order. It’s not looking good. For a given solution, flash is getting less reliable, there
is less bandwidth available at capacity because there are fewer die, we’re driving latency way below the
actual write latency of flash, and we’re not satisfied with the best solutions we have for all the reasons
above.
The implications
If you think these through enough, you start to consider one basic path. It also turns out we’re not the
only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals
are:
Unified storage infrastructure for boot, flash, and HDDs
Pooling of storage so that resources can be allocated/shared
Low latency, high performance as if those resources were DAS attached, or PCIe card flash
Bonus points for file store with a global name space
One easy answer would be – that’s a flash SAN or NAS. But that’s not the answer. Not many customers
want a flash SAN or NAS – not for their new infrastructure, but more importantly, all the data is at the
wrong end of the straw. The poor server is left sucking hard. Remember – this is flash, and people use
flash for latency. Today these SAN type of flash devices have 4x-10x worse latency than PCIe cards. Ouch.
You have to suck the data through a relatively low bandwidth interconnect, after passing through both the
storage and network stacks. And there is interaction between the I/O threads of various servers and
applications – you have to wait in line for that resource. It’s true there is a lot of startup energy in this
space. It seems to make sense if you’re a startup, because SAN/NAS is what people use today, and
there’s lots of money spent in that market today. However, it’s not what the market is asking for.
Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front bay
PCIe SSDs (HDD form factor or NVMe – lots of names) that crowd out your disk drive bays. But they don’t
fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the
cards every 5 years a few minutes faster. Wow. With NVMe SSDs, you can fit fewer HDDs – not good.
They also provide uniformly bad cooling, and hard limit power to 9W or 25W per device. But to protect the
storage in these devices, you need to have enough of them that you can RAID or otherwise protect. Once
you have enough of those for protection, they give you awesome capacity, IOPs and bandwidth, too much
in fact, but that’s not what applications need – they need low latency for the working set of data.
What do I think the PCIe replacement solutions in the near future will look like? You need to pool the flash
across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity). You need
to protect against failures/errors and limit the span of failure, commit writes at very low latency (lower
than native flash) and maintain low latency, bottleneck-free physical links to each server… To me that
implies:
- Small enclosure per rack handling ~32 or more servers
- Enclosure manages temperature and cooling optimally for performance/endurance
- Remote configuration/management of the resources allocated to each server
- Ability to re-assign resources from one server to another in the event of server/VM blue-screen
- Low-latency/high-bandwidth physical cable or backplane from each server to the enclosure
- Replaceable inexpensive flash modules in case of failure
- Protection across all modules (erasure coding) to allow continuous operation at very high bandwidth
- NV memory to commit writes with extremely low latency
- Ultimately, integration with the whole storage architecture at the rack: the same APIs, drivers, etc.
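To make the protection point concrete, here is a minimal sketch of single-parity erasure coding across flash modules. This is purely illustrative – it assumes byte-string "stripes" and simple XOR parity, whereas real pooled-flash enclosures use stronger codes (such as Reed-Solomon) over device blocks:

```python
# Illustrative single-parity erasure code across flash modules.
# Assumption: each module holds one equal-length data stripe.

def make_parity(modules):
    """XOR the data stripes on each module into one parity stripe."""
    parity = bytearray(len(modules[0]))
    for stripe in modules:
        for i, b in enumerate(stripe):
            parity[i] ^= b
    return bytes(parity)

def rebuild(surviving, parity):
    """Recover the stripe on a single failed module from the survivors."""
    lost = bytearray(parity)
    for stripe in surviving:
        for i, b in enumerate(stripe):
            lost[i] ^= b
    return bytes(lost)

modules = [b"AAAA", b"BBBB", b"CCCC"]   # data stripes on three flash modules
parity = make_parity(modules)

# Module 1 fails; reads continue from the survivors plus parity.
recovered = rebuild([modules[0], modules[2]], parity)
assert recovered == b"BBBB"
```

The key property for the enclosure is that reads and writes keep flowing at high bandwidth while one inexpensive module is swapped and rebuilt.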
That means the performance looks exactly as if each server had multiple PCIe cards, but the capacity and
bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards
will evolve into more external, rack-level, pooled flash solutions, without sacrificing the great attributes
they have today. This is just my opinion, but as I say – other leaders in flash are going down this path too…
What’s your opinion?
Rob Ober drives LSI into new technologies, businesses and products as an LSI fellow in Corporate
Strategy. Prior to joining LSI, he was a fellow in the Office of the CTO at AMD, responsible for mobile
platforms, embedded platforms and wireless strategy. He was a founding board member of OLPC ($100
laptop.org) and OpenSPARC.
MEGA DATACENTERS: PIONEERING THE
FUTURE OF IT INFRASTRUCTURE
Rob Ober, LSI Fellow, Processor and System Architect, LSI Corporate Strategy Office, says:
The unrelenting growth in the volume and velocity of data worldwide is spurring innovation in datacenter
infrastructures, and mega datacenters (MDCs) are on the leading edge of these advances. Although MDCs
are relatively new, their exponential growth – driven by this data deluge – has thrust them into rarefied
regions of the global server market: they now account for about 25 percent of servers shipped.
Rapid innovation is the watchword at MDCs. It is imperative to their core business and, on a much larger
scale, forcing a rethinking of IT infrastructures of all sizes. The pioneering efforts of MDCs in private
clouds, compute clusters, data analytics and other IT applications now provide valuable insights into the
future of IT. Any organization stands to benefit by emulating MDC techniques to improve scalability,
reliability, efficiency and manageability and reduce the cost of work done as they confront changing
business dynamics and rising financial pressures.
The Effects of Scale at MDCs
MDCs and traditional datacenters are miles apart in scale, though the architects at each face many of the
same challenges. Most notably, both are trying to do more with less by implementing increasingly
sophisticated applications and optimizing the investments needed to confront the data deluge. The sheer
scale of MDCs, however, magnifies even the smallest inefficiency or problem. Economics force MDCs to
view the entire datacenter as a resource pool to be optimized as it delivers more services and supports
more users.
MDCs like those at Facebook, Amazon, Google and China’s Tencent use a small set of distinct platforms,
each optimized for a specific task, such as storage, database, analytics, search or web services. The scale
of these MDCs is staggering: Each typically houses 200,000 to 1,000,000 servers, and from 1.5 million to
10 million disk drives. Storage is their largest cost. The world’s largest MDCs deploy LSI flash cards, flash
cache acceleration, host bus adapters, serial-attached SCSI (SAS) infrastructure and RAID storage
solutions, giving LSI unique insight into challenges these organizations are facing, and how they are
pioneering various architectural solutions to common problems.
MDCs prefer open source software for operating systems and other infrastructure, and the applications are
usually self-built. Most MDC improvements have been given back to the open source community. In many
MDCs, even the hardware infrastructure might be self-built or, at a minimum, self-specified for optimal
configurations – options that might not be available to smaller organizations.
Server virtualization is only rarely used in MDCs. Instead of using virtual machines to run multiple
applications on a single server, MDCs prefer to run applications across clusters consisting of hundreds to
thousands of server nodes dedicated to a specific task. For example, the server cluster may contain only
boot storage, RAID-protected storage for database or transactional data, or unprotected direct-map drives
with data replication across facilities depending on the task or application it is performing. MDC
virtualization applications are all open source. They are used for containerization to simplify the
deployment and replication of images. Because re-imaging or updating virtualization applications occurs
frequently, boot image management is another challenge.
The large clusters at MDCs make the latency of inter-node communications critical to application
performance, so MDCs make extensive use of 10Gbit Ethernet in servers today and, in some cases, they
even deploy 40Gbit infrastructure as needed. MDCs also optimize performance by deploying networks with
static configurations that minimize transactional latency. And MDC architects are now deploying at least
some software defined network (SDN) infrastructure to optimize performance, simplify management at
scale and reduce costs.
To some, MDCs are seen as cheap, refusing to pay for any value-added functionality from vendors. But
that’s a subtle misunderstanding of their motivations. With as many as 1 million servers, MDCs require a
lights-out infrastructure maintained primarily by automated scripts and only a few technicians assigned
simple maintenance tasks. MDCs also maintain a ruthless focus on minimizing any unnecessary spending,
using the savings to grow and optimize work performed per dollar spent.
MDCs are very careful to eliminate features not central to their core applications, even if provided for free,
since they increase operating expenditures. Chips, switches and buttons, lights, cables, screws and
latches, software layers and anything else that does nothing to improve performance only adds to power
and cooling demands and service overhead. The addition of one unnecessary LED in 200,000 servers, for
example, is considered an excess that consumes 26,000 watts of power and can increase operating costs
by $10,000 per year.
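The arithmetic behind the LED example is easy to check. The per-LED wattage and electricity rate below are my assumptions (they are not stated in the article), chosen so the totals line up with the figures quoted:

```python
# Back-of-the-envelope check of the unnecessary-LED example.
# Assumed inputs (not from the article): 0.13 W per LED and
# $0.044/kWh, a plausible wholesale datacenter electricity rate.

LED_WATTS = 0.13
SERVERS = 200_000
RATE_PER_KWH = 0.044
HOURS_PER_YEAR = 24 * 365

total_watts = LED_WATTS * SERVERS                  # 26,000 W
annual_kwh = total_watts * HOURS_PER_YEAR / 1000   # ~227,760 kWh
annual_cost = annual_kwh * RATE_PER_KWH            # ~$10,000 per year

print(round(total_watts), round(annual_cost))
```

At MDC scale, even a tenth of a watt per server is a five-figure line item, which is why every superfluous component gets designed out.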
Even minor problems can become major issues at scale. One of the biggest operational challenges for
MDCs is HDD failure rates. Despite the low price of hard disk drives (HDDs), failures can cause costly
disruptions in large clusters, where these breakdowns are routine. Another challenge is managing rarely
used archival data that may exceed petabytes and is now approaching exabytes of online storage,
consuming more space and power while delivering diminishing value. Every organization faces similar
challenges, albeit on a smaller scale.
Lessons Learned from Mega Datacenters
Changing business dynamics and financial pressures are forcing all organizations to rethink the types of IT
infrastructure and software applications they deploy. The low cost of MDC cloud services is motivating
CFOs to demand more capabilities at lower costs from their CIOs, who in turn are turning to MDCs to find
inspiration and ways to address these challenges.
The first lesson any organization can learn from MDCs is to simplify maintenance and management by
deploying a more homogeneous infrastructure. Minimizing infrastructure spending where it matters little
and focusing it where it matters most frees capital to be invested in architectural enhancements that
maximize work-per-dollar. Investing in optimization and efficiency helps reduce infrastructure and
associated management costs, including those for maintenance, power and cooling. Incorporating more
lights-out self-management also pays off, supporting more capabilities with existing staff.
The second lesson is that maintaining five-nines (99.999%) reliability drives up costs and becomes
increasingly difficult architecturally as the infrastructure scales. A far more cost-effective architecture is
one that allows subsystems to fail, letting the rest of the system operate unimpeded and the overall system
self-heal. Because all applications are clustered, a single misbehaving node can degrade the performance
of the entire cluster. MDCs take the offending server off line, enabling all others to operate at peak
performance. The hardware and software needed for such an architecture are readily available today,
enabling any organization to emulate this approach. And though the expertise needed to effectively deploy
a cluster is still rare, new orchestration layers are emerging to automate cluster management.
Storage, one of the most critical infrastructure subsystems, directly impacts application performance and
server utilization. MDCs are leaders in optimizing datacenter storage efficiency, providing high-availability
operation to satisfy requirements for data retention and disaster recovery. All MDCs rely exclusively on
direct-attached storage (DAS), which carries a much lower purchase cost, is simpler to maintain and
delivers higher performance than a storage area network (SAN) or network-attached storage (NAS).
Although many MDCs minimize costs by using consumer-grade Serial ATA (SATA) HDDs and solid state
drives (SSDs), they almost always deploy these drives on a SAS infrastructure to maximize performance
and simplify management. More MDCs are now migrating to large-capacity, enterprise-grade SAS drives
for higher reliability and performance, especially as SAS migrates from 6Gbit/s to 12Gbit/s bandwidth.
When evaluating storage performance, most organizations focus on I/O operations per second (IOPs) and
MBytes/s throughput metrics. MDCs have discovered, though, that applications driving IOPs to SSDs quickly reach
other limits, often peaking well below 200,000 IOPs, and that MBytes/s performance has only a
modest impact on work done. A more meaningful metric is I/O latency because it correlates more directly
with application performance and server utilization – the very reason MDCs are deploying more SSDs or
solid state caching (or both) to minimize I/O latency and increase work-per-dollar.
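Why latency is the better predictor of work done can be seen with Little’s Law: a thread issuing synchronous I/Os can complete at most concurrency/latency operations per second, regardless of the device’s peak IOPs rating. A quick sketch, using the article’s ballpark latencies:

```python
# Little's Law applied to a single thread doing synchronous I/O:
# throughput = outstanding I/Os / latency.

def synchronous_iops(latency_seconds, outstanding_ios=1):
    """Per-thread I/O completion rate bounded by device latency."""
    return outstanding_ios / latency_seconds

hdd = synchronous_iops(10e-3)    # 10 ms HDD read
ssd = synchronous_iops(200e-6)   # 200 us SSD read
print(hdd, ssd, ssd / hdd)       # the gap is pure latency, not peak IOPs
```

A device rated at hundreds of thousands of IOPs helps little if each application thread is stalled waiting on one I/O at a time, which is exactly the server-utilization effect MDCs measure.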
Typical HDD read/write latency is on the order of 10 milliseconds. By contrast, typical SSD read and write
latencies are around 200 microseconds and 100 microseconds, respectively – about two orders of
magnitude lower. Specialized PCIe® flash cache acceleration cards can reduce latency another order of
magnitude to tens of microseconds. Using solid state storage to supplement or replace HDDs enables
servers and applications to do four to 10 times more work. Server-based flash caching provides even
greater gains in SAN and NAS environments – up to 30 times.
Flash cache acceleration cards deliver the lowest latency when plugged directly into a server’s PCIe bus.
Intelligent caching software continuously and transparently places hot data (the most frequently accessed
or temporally important) in low-latency flash storage to improve performance. Some flash cache
acceleration cards support multiple terabytes of solid state storage, holding entire databases or working
datasets as hot data. And because there is no intervening network and no risk of associated congestion,
the cached data is accessible quickly and deterministically under any workload.
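The hot-data placement described above can be modeled with a simple recency-based cache. This is a toy LRU sketch only – the names and eviction policy are mine, and commercial caching software is far more sophisticated (tracking frequency, temporal importance, and write-back state):

```python
# Toy model of hot-data placement: keep recently touched blocks in
# flash, fall back to the slower backend (HDD/SAN/NAS) on a miss.

from collections import OrderedDict

class FlashCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()   # block id -> data, coldest first

    def read(self, block_id, backend_read):
        if block_id in self.blocks:           # hot data: served from flash
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        data = backend_read(block_id)         # cold data: fetch from backend
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:  # evict the coldest block
            self.blocks.popitem(last=False)
        return data

cache = FlashCache(capacity_blocks=2)
cache.read(1, lambda b: f"data{b}")
cache.read(2, lambda b: f"data{b}")
cache.read(1, lambda b: f"data{b}")   # re-touching block 1 keeps it hot
cache.read(3, lambda b: f"data{b}")   # evicts block 2, now the coldest
assert list(cache.blocks) == [1, 3]
```

With multi-terabyte flash cards, the cache can be large enough that the entire working set stays resident and evictions become rare.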
Deploying an all-solid-state Tier 0 for some applications is also now feasible, and at least one MDC uses
SSDs exclusively. In the enterprise, decisions about using SSDs usually focus on the storage layer, and
cost per GByte or IOPs, pitting HDDs against SSDs with an emphasis on capital expenditure. MDCs have
discovered that SSDs deliver better price/performance than HDDs by maximizing work-per-dollar
investments in other infrastructure (especially servers and software licenses), and by reducing overall
maintenance costs. Solid state storage is also more reliable, easier to manage, faster to replicate and
rebuild, and more energy-efficient than HDDs – all advantages to any datacenter.
Pioneering the Datacenter of the Future
MDCs have been driving open source solutions with proven performance, reliability and scalability. In
some cases, these pioneering efforts have enabled applications to scale far beyond any commercial
product. Examples include Hadoop® for analytics and derivative applications, and clustered query and
database applications like Cassandra™ and Google’s Dremel. The state-of-the-art for these and other
applications is evolving quickly, literally month-by-month. These open source solutions are seeing
increasing adoption and inspiring new commercial solutions.
Two other, relatively new initiatives are expected to bring MDC advances to the enterprise market, just as
Linux® software did. One is the Open Compute Project, which offers a minimalist, cost-effective, easy-to-scale
hardware infrastructure for compute clusters. Open Compute could also foster its own innovation,
including an open hardware support services business model similar to the one now used for open source
software. The second initiative is OpenStack® software, which promises a higher level of automation for
managing pools of compute, storage and networking resources, ultimately leading to the ability to operate
a software defined datacenter.
A related MDC initiative involves disaggregating servers at the rack level. Disaggregation separates the
processor from memory, storage, networking and power, and pools these resources at the rack level,
enabling the lifecycle of each resource to be managed on its own optimal schedule to help minimize costs
while increasing work-per-dollar. Some architects believe that these initiatives could reduce total cost of
ownership by a staggering 70 percent.
Maximizing work-per-dollar at the rack and datacenter levels is one of the best ways today for IT
architects in any organization to do more with less. MDCs are masters at this type of high efficiency as
they continue to redefine how datacenters will scale to meet the formidable challenges of the data deluge.
About the Author
Robert Ober is an LSI Fellow in Corporate Strategy, driving LSI into new technologies, businesses and
products. He has 30 years of experience in processor and system architecture. Prior to joining LSI, Rob
was a Fellow in the Office of the CTO at AMD, with responsibility for mobile platforms, embedded
platforms and wireless strategy. He was one of the founding Board members of OLPC ($100 laptop.org)
and was influential in its technical evolution, and was also a Board Member of OpenSPARC.
Previously Rob was Chief Architect at Infineon Technologies, responsible for the TriCore family of
processors used in automotive, communication and security products. In addition, he drove improvements
in semiconductor methodology, libraries, process and the mobile phone platforms. Rob was manager of
Newton Technologies at Apple Computer and was involved in the creation of the PowerPC Macintosh
computers, PowerPC, StrongARM and ARC processors. He also has experience in development of CDC,
CRAY and SPARC supercomputers, mainframes and high-speed networks, and he has dozens of patents in
mobility, computing and processor architecture. Rob has an honors Bachelor of Applied Science (BASc.) in
Systems Design Engineering from the University of Waterloo in Ontario, Canada.
The Evolution Of Solid-State Storage In
Enterprise Servers By Tom Heil
Solid-state drives (SSDs) and PCI Express (PCIe) flash memory adapters are growing in popularity in
enterprise, service provider, and cloud datacenters due to their ability to cost-effectively improve
application-level performance. A PCIe flash adapter is a solid-state storage device that plugs directly into a
PCIe slot of an individual server, placing fast, persistent storage near server processors to accelerate
application-level performance.
By placing storage closer to the server’s CPU, PCIe flash adapters dramatically reduce latency in storage
transactions compared with traditional hard-disk drive (HDD) storage. However, the configuration lacks
standardization and critical storage device attributes such as external serviceability and hot-pluggability.
To overcome these limitations, various organizations are developing PCIe storage standards that extend
PCIe onto the server storage mid-plane to provide external serviceability. These new PCIe storage
standards take full advantage of flash memory’s low latency and provide an evolutionary path for its use
in enterprise servers.
The Need For Speed
Many applications benefit considerably from the use of solid-state storage owing to the enormous latency
gap that exists between the server’s main memory and its direct-attached HDDs. Flash storage enables
database applications, for example, to experience improvements of four to 10 times because access to
main memory takes about 100 ns while input/output (I/O) to traditional rotating storage is on the order of
10 ms or more (Fig. 1).
1. NAND flash memory fills the gap in latency between a server’s main memory and fast-
spinning hard-disk drives.
This access latency difference, approximately five orders of magnitude, has a profound adverse impact on
application-level performance and response times. Latency to external storage area networks (SANs) and
network-attached storage (NAS) is even higher owing to the intervening network infrastructure (e.g.,
Fibre Channel or Ethernet).
Flash memory provides a new high-performance storage tier that fills the gap between a server’s dynamic
random access memory (DRAM) and Tier 1 storage consisting of the fastest-spinning HDDs. This new “Tier
0” of solid-state storage, with latencies from 50 µs to several hundred microseconds, delivers dramatic
gains in application-level performance while continuing to leverage rotating media’s cost-per-gigabyte
advantage in all lower tiers.
Because the need for speed is so pressing in many of today’s applications, IT managers could not wait for
new flash-optimized storage standards to be finalized and become commercially available. That’s why
SSDs supporting the existing SAS and SATA standards as well as proprietary PCIe-based flash adapters
are already being deployed in datacenters. However, these existing solid-state storage solutions utilize
very different configurations.
SAS And SATA SSDs
The norm today for direct-attached storage (DAS) is a rack-mount server with an externally accessible
chassis having multiple 9-W storage bays capable of accepting a mix of SAS and SATA drives operating at
up to 6 Gbits/s. The storage mid-plane typically interfaces with the server motherboard via a PCIe-based
host redundant array of independent disks (RAID) adapter that has an embedded RAID-on-chip (ROC)
controller (Fig. 2).
2. SAS and SATA SSDs are supported today in standard storage bays with a RAID-on-chip
(ROC) controller on the server’s PCIe bus.
While originally designed for HDDs, this configuration is ideal for SSDs that utilize 2.5-in. and 3.5-in. HDD
form factors. Support for SAS and SATA HDDs and SSDs in various RAID configurations provides a
number of benefits in DAS configurations, such as the ability to mix high-performance SAS drives with
low-cost SATA drives in tiers of storage directly on the server. The fastest Tier 0 can utilize SAS SSDs,
while the slowest tier utilizes SATA HDDs (or external SAN or NAS). In some configurations, firmware on
the RAID adapter can transparently cache application data onto SSDs.
Being externally accessible and hot-pluggable, the configuration of disks can be changed as needed to
improve performance by adding more SSDs, or to expand capacity in any tier, as well as to replace
defective drives to restore full RAID-level data protection. Because the arrangement is fully standardized,
any bay can support any SAS or SATA drive. Device connectivity is easily scaled via an in-server SAS
expander or via SAS connections to external drive enclosures, commonly called JBODs for “just a bunch of
disks.”
The main advantage of deploying flash in HDD form factors using established SAS and SATA protocols is
that it significantly accelerates application performance while leveraging mature standards and the
existing infrastructure (both hardware and software). So, this configuration will remain popular well into
the future in all but the most demanding latency-sensitive applications. Enhancements also continue to be
made, including RAID adapters getting faster with PCIe version 3.0, and 12-Gbit/s SAS SSDs that are
poised for broad deployment beginning in 2013.
Even with continual advances and enhancements, though, SAS and SATA cannot capitalize fully on flash
memory’s performance potential. The most obvious constraints are the limited power (9 W) and channel
width (one or two lanes) available in a storage bay that was initially designed to accommodate rotating
magnetic media, not flash. These constraints limit the performance possible with the amount of flash that
can be deployed in a typical HDD form factor, and they are the driving force behind the emergence of PCIe
flash adapters.
PCIe Flash Adapters
Instead of plugging into a storage bay, a flash adapter plugs directly into a PCIe bus slot on the server’s
motherboard, giving it direct access to the CPU and main memory (Fig. 3). The result is a latency as low
as 50 µs for (buffered) I/O operations to solid-state storage. Because there are no standards yet for PCIe
storage devices, flash adapter vendors must supply a device driver to interface with the host’s file system.
In some cases, vendor-specific drivers are bundled with popular server operating systems.
3. PCIe flash adapters overcome the limitations imposed by legacy storage protocols, but they
must be plugged directly into the server’s PCIe bus.
Unlike storage bays that provide one or two lanes, server PCIe slots are typically four or eight lanes wide.
An eight-lane (x8) PCIe (version 3.0) slot, for example, can provide a throughput of 8 Gbytes/s (eight
lanes at 1 Gbyte/s each). By contrast, a SAS storage bay can scale to 3 Gbytes/s (two lanes at 12 Gbits/s
or 1.5 Gbytes/s each). The higher bandwidth increases I/O operations per second (IOPs), which reduces
the transaction latency experienced by some applications.
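The raw-bandwidth comparison above can be reproduced directly from the per-lane line rates the article quotes (encoding overhead is ignored here, as it is in the article):

```python
# Raw bandwidth: x8 PCIe 3.0 slot vs. a two-lane 12-Gbit/s SAS bay.
# Per-lane rates as quoted in the text; protocol overhead ignored.

PCIE3_GBYTES_PER_LANE = 1.0          # ~1 GByte/s per PCIe 3.0 lane
SAS12_GBYTES_PER_LANE = 12 / 8       # 12 Gbits/s = 1.5 GBytes/s per lane

pcie_x8 = 8 * PCIE3_GBYTES_PER_LANE  # x8 slot
sas_bay = 2 * SAS12_GBYTES_PER_LANE  # two-lane storage bay
print(pcie_x8, sas_bay)              # 8.0 3.0
```

The slot’s 2.7x raw-bandwidth advantage, combined with its higher power budget, is what lets a flash adapter sustain many more parallel flash operations than a drive bay.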
Another significant advantage of a PCIe slot is the higher power available, which enables larger flash
arrays, as well as more parallel read/write operations to the array(s). The PCIe bus supports up to 25 W
per slot, and if even more is needed, a separate connection can be made to the server’s power supply,
similar to the way high-end PCIe graphics cards are configured in workstations. For half-height, half-
length (HHHL) cards today, 25 W is normally sufficient. Ultra-high-capacity full-height cards often require
additional power.
A PCIe flash adapter can be utilized either as flash cache or as a primary storage solid-state drive. The
more common configuration today is flash cache to accelerate I/O to DAS, SAN, or NAS rotating media.
Adapters used as an SSD are often available with advanced capabilities, such as host-based RAID for data
protection. But the PCIe bus isn’t an ideal platform for primary storage due to its lack of external
serviceability and hot-pluggability.
Flash Cache Acceleration Cards
Caching content to memory in a server is a proven technique for reducing latency and, thereby, improving
application-level performance. But because the amount of memory possible in a server (measured in
gigabytes) is only a small fraction of the capacity of even a single disk drive (measured in terabytes),
achieving performance gains from this traditional form of caching is becoming difficult.
Flash memory breaks through the cache size limitation imposed by DRAM to again make caching a highly
effective and cost-effective means for accelerating application-level performance. Flash memory is also
non-volatile, giving it another important advantage over DRAM caches. As a result, PCIe-based flash cache
adapters such as the LSI Nytro XD solution have already become popular for enhancing performance.
Solid-state memory typically delivers the highest performance gains when the flash cache is placed
directly in the server on the PCIe bus. Embedded or host-based intelligent caching software is used to
place “hot data” (the most frequently accessed data) in the low-latency, high-performance flash storage.
Even though flash memory has a higher latency than DRAM, PCIe flash cache cards deliver superior
performance for two reasons.
The first is the significantly higher capacity of flash memory, which dramatically increases the “hit rate” of
the cache. Indeed, with some flash cards now supporting multiple terabytes of solid-state storage, there is
often sufficient capacity to store entire databases or other datasets as “hot data.” The second reason
involves the location of the flash cache: directly in the server on the PCIe bus. With no external
connections and no intervening network to a SAN or NAS (that is also subject to frequent congestion and
deep queues), the “hot data” is accessible in a flash (pun intended) in a deterministic manner under all
circumstances.
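The effect of the higher hit rate is easy to quantify with the standard expected-latency formula. The latencies below are illustrative assumptions (50 µs for a PCIe flash hit, 5 ms for a congested SAN read on a miss), not measured figures:

```python
# Average read latency of a flash cache in front of networked storage.
# Assumed latencies: 50 us flash hit, 5,000 us (5 ms) SAN read on a miss.

def effective_latency(hit_rate, hit_us, miss_us):
    """Expected latency = hit_rate * hit + (1 - hit_rate) * miss."""
    return hit_rate * hit_us + (1 - hit_rate) * miss_us

# A terabyte-scale cache that holds the whole working set pushes the
# hit rate toward 1.0, so average latency approaches flash latency.
for hit_rate in (0.50, 0.90, 0.99):
    print(hit_rate, effective_latency(hit_rate, 50, 5000))
```

Note how sensitive the average is to the last few points of hit rate, which is why terabyte-class caches that capture the entire dataset change the economics so sharply.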
Although the use of PCIe flash adapters can dramatically improve application performance, PCIe was not
designed to accommodate storage devices directly. PCIe adapters are not externally serviceable, are not
hot-pluggable, and are difficult to manage as part of an enterprise storage infrastructure. The proprietary
nature of PCIe flash adapters also is an impediment to a robust, interoperable multi-party device
ecosystem. Overcoming these limitations requires a new industry-standard PCIe storage solution.
Express Bay
Support for the PCIe interface on an externally accessible storage mid-plane is emerging based on the
Express Bay standard with the SFF-8639 connector. Express Bay provides four dedicated PCIe lanes and
up to 25 W to accommodate ultra-high-performance, high-capacity Enterprise PCIe SSDs (eSSD) in a 2.5-
in. or 3.5-in. disk drive form factor.
As a superset of today’s standard disk drive bay, Express Bay maintains backward compatibility with
existing SAS and SATA devices. The SSD Form Factor Working Group is creating the Express Bay
standard, Enterprise SSD Form Factor 1.0 Specification, in cooperation with the SFF Committee, the SCSI
Trade Association, the PCI Special Interest Group, and the Serial ATA International Organization.
Enterprise SSDs for Express Bay will initially use vendor-specific protocols enabled by vendor-supplied
host drivers. Enterprise SSDs compliant with the new NVM Express (NVMe) flash-optimized host interface
protocol will emerge in 2013. The NVMe Work Group (www.nvmexpress.org) is defining NVMe for use in
PCIe devices targeting both clients (PCs, ultrabooks, etc.) and servers. By 2014, standard NVMe host
drivers should be available in all major operating systems, eliminating the need for vendor-specific drivers
(except when a vendor supplies a driver to enable unique capabilities).
Also in 2014, Enterprise PCIe SSDs compliant with the new SCSI Express (SCSIe) host interface protocol
are expected to make their debut. SCSIe SSDs will be optimized for enterprise applications and should fit
seamlessly under existing enterprise storage applications based on the SCSI architecture and command
set. SCSIe is being defined by the SCSI Trade Association and the InterNational Committee for
Information Technology Standards (INCITS) Technical Committee T10 for SCSI Storage Interfaces.
Most mid-planes supporting the Express Bays will interface with the server via two separate PCIe-based
cards: a PCIe switch to support high-performance Enterprise PCIe SSDs and a RAID adapter to support
legacy SAS and SATA devices (Fig. 4). Direct support for PCIe (through the PCIe switch) makes it
possible to put flash cache acceleration solutions in the Express Bay.
4. Express Bay fully supports the low latency of flash memory with the high performance of
PCIe, while maintaining backwards compatibility with existing SAS and SATA HDDs and SSDs.
This configuration is expected to become preferable to the flash adapters now being plugged directly
into the server’s PCIe bus. Nevertheless, PCIe flash adapters may continue to be used in ultra-high-
performance or ultra-high-capacity applications that justify utilizing the wider x8 PCIe bus slots and/or
additional power available only within the server.
Because it is more expensive to provision an Express Bay than a standard drive bay, server vendors are
likely to limit deployment of Express Bays until market demand for Enterprise PCIe SSDs increases. Early
server configurations may support perhaps two or four Express Bays, with the remainder being standard
bays. Server vendors may also offer some models with a high number of (or nothing but) Express Bays to
target ultra-high-performance and ultra-high-capacity applications, especially those that require little or
no rotating media storage.
SATA Express
PCIe flash storage also is expected to become common in client devices beginning in 2013 with the advent
of the new SATA Express (SATAe) standard. Like SATA before them, SATAe devices are expected to be
adopted in the enterprise due to the low cost that inevitably results from the economics of high-volume
client-focused technologies.
The SATAe series of standards includes a flash-only M.2 form factor (previously called the next-generation
form factor or NGFF) for ultrabooks and netbooks and a 2.5-in. disk drive compatible form factor for
laptop and desktop PCs. SATAe standards are being developed by the Serial ATA International
Organization (www.sata-io.org). Initial SATAe devices will use the current AHCI protocol to leverage
industry-standard SATA host drivers, but will quickly move to NVMe once standard NVMe drivers become
incorporated into major operating systems.
The SATAe 2.5-in. form factor is expected to play a significant role in enterprise storage. It is designed to
plug into either an Express Bay or a standard drive bay. In both cases, the PCIe signals are multiplexed
atop the existing SAS/SATA lanes. Either bay then can accommodate a SATAe SSD or a SAS or SATA
drive (Fig. 5). Of course, the Express Bay can additionally accommodate x4 Enterprise PCIe SSDs as
previously discussed.
5. Although designed for client PCs, new SATA Express drives will be supported in a standard
bay by multiplexing the PCIe protocols atop existing SAS/SATA lanes.
The configuration implies future RAID controller support for SATAe drives to supplement existing support
for SAS and SATA drives. Note that although SATAe SSDs will outperform SATA SSDs, they will lag 12-
Gbit/s SAS SSD performance (two lanes of 12 Gbits/s are faster than two lanes of 8-Gbit/s PCIe 3.0). The
SATAe M.2 form factor will also be adopted in the enterprise in situations where a client-class PCIe SSD is
warranted, but the flexibility and/or external serviceability of a storage form factor is not required.
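The parenthetical bandwidth claim can be checked with quick arithmetic. The sketch below (Python, for illustration) applies each interface's published line-encoding overhead: SAS-3 runs at 12 Gbits/s with 8b/10b encoding, while PCIe 3.0 runs at 8 GT/s with 128b/130b encoding.

```python
# Effective per-lane throughput after line-encoding overhead.
# Two lanes of each are compared, matching the configuration above.
sas3_lane = 12.0 * 8 / 10        # 9.6 Gbits/s usable per SAS-3 lane
pcie3_lane = 8.0 * 128 / 130     # ~7.88 Gbits/s usable per PCIe 3.0 lane

print(f"2 x SAS-3 12G: {2 * sas3_lane:.2f} Gbits/s")   # 19.20
print(f"2 x PCIe 3.0:  {2 * pcie3_lane:.2f} Gbits/s")  # 15.75
```

Even after SAS's heavier 8b/10b overhead, two 12-Gbit/s SAS lanes still deliver more usable bandwidth than two PCIe 3.0 lanes, which is why SATAe SSDs will lag 12-Gbit/s SAS SSDs.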
Summary
With its ability to bridge the large gap in I/O latency between main memory and hard-disk drives, flash
memory has exposed some limitations in existing storage standards. These standards have served the
industry well, and SAS and SATA HDDs and SSDs will continue to be deployed in enterprise and cloud
applications well into the foreseeable future. Indeed, the new standards being developed all accommodate
today’s existing and proven standards, making the integration of solid-state storage seamless and
evolutionary, not disruptive or revolutionary.
To take full advantage of flash memory’s ultra-low latency, proprietary solutions that leverage the high
performance of the PCIe bus have emerged in advance of the new storage standards. But while PCIe
delivers the performance needed, it was never intended to be a storage architecture. In effect, the new
storage standards extend the PCIe bus onto the server’s externally accessible mid-plane, which was
designed as a storage architecture.
Yogi Berra famously observed, “It’s tough to make predictions, especially about the future.” But because
the new standards all preserve backwards compatibility, there is no need to predict a “winner” among
them. In fact, all are likely to coexist, perhaps in perpetuity, because each is focused on specific and
different needs in client and server storage. Fortunately, Express Bay supports both new and legacy
standards, as well as proprietary solutions, all concurrently. This freedom of choice down to the level of an
individual bay eliminates the need for the industry to choose only one as “the” standard.
Tom Heil is a senior systems architect and Distinguished Engineer in LSI’s Storage Products
Division, where he is responsible for technology strategy, product line definition, and business
planning. He is a 25-year veteran of the computer and storage industry and holds 17 patents in
computer and I/O architecture. He can be reached at tom.heil@lsi.com.
Networks to Get Smarter and Faster in 2013
and Beyond
By Greg Huff, Chief Technology Officer at LSI
Architects and managers of networks of all types – enterprise, storage and mobile – are struggling under
the formidable pressure of massive data growth. To accelerate performance amid this data deluge, they
have two options: the traditional brute force approach of deploying systems beefed up with more general-
purpose processors, or turning to systems with intelligent silicon powered by purpose-built hardware
accelerators integrated with multi-core processors.
Adding more and faster general-purpose processors to routers, switches and other networking equipment
can improve performance but adds to system costs and power demands while doing little to address
latency, a major cause of performance problems in networks. By contrast, smart silicon minimizes or
eliminates performance choke points by reducing latency for specific processing tasks. In 2013 and
beyond, design engineers will increasingly deploy smart silicon to gain its order-of-magnitude higher
performance and greater efficiencies in cost and power.
Enterprise Networks
In the past, Moore’s Law was sufficient to keep pace with increasing computing and networking workloads.
Hardware and software largely advanced in lockstep: as processor performance increased, more
sophisticated features could be added in software. These parallel improvements made it possible to create
more abstracted software, enabling much higher functionality to be built more quickly and with less
programming effort. Today, however, these layers of abstraction are making it difficult to perform more
complex tasks with adequate performance.
General-purpose processors, regardless of their core count and clock rate, are too slow for functions such
as classification, cryptographic security and traffic management that must operate deep inside each and
every packet. What’s more, these specialized functions must often be performed sequentially, restricting
the opportunity to process them in parallel in multiple cores. By contrast, these and other specialized
types of processing are ideal applications for smart silicon, and it is increasingly common to have multiple
intelligent acceleration engines integrated with multiple cores in specialized System-on-Chip (SoC)
communications processors.
The number of function-specific acceleration engines available continues to grow, and shrinking
geometries now make it possible to integrate more engines onto a single SoC. It is even possible to
integrate a system vendor’s unique intellectual property as a custom acceleration engine within an SoC.
Taken together, these advances make it possible to replace multiple SoCs with a single SoC to enable
faster, smaller, more power-efficient networking architectures.
Storage Networks
The biggest bottleneck in data centers today is caused by the five orders of magnitude difference in I/O
latency between main memory in servers (100 nanoseconds) and traditional hard disk drives (10
milliseconds). Latency to external storage area networks (SANs) and network-attached storage (NAS) is
even higher because of the intervening network and performance restrictions resulting when a single
resource services multiple, simultaneous requests sequentially in deep queues.
Caching content to memory in a server or in a SAN on a Dynamic RAM (DRAM) cache appliance is a
proven technique for reducing latency and thereby improving application-level performance. But today,
because the amount of memory possible in a server or cache appliance (measured in gigabytes) is only a
small fraction of the capacity of even a single disk drive (measured in terabytes), the performance gains
achievable from traditional caching are insufficient to deal with the data deluge.
Advances in NAND flash memory and flash storage processors, combined with more intelligent caching
algorithms, break through the traditional caching scalability barrier to make caching an effective, powerful
and cost-efficient way to accelerate application performance going forward. Solid state storage is ideal for
caching as it offers far lower latency than hard disk drives with comparable capacity. Besides delivering
higher application performance, caching enables virtualized servers to perform more work, cost-
effectively, with the same number of software licenses.
Solid state storage typically produces the highest performance gains when the flash cache is placed
directly in the server on the PCIe® bus. Intelligent caching software is used to place hot, or most
frequently accessed, data in low-latency flash storage. The hot data is accessible quickly and
deterministically under any workload since there is no external connection, no intervening network to a
SAN or NAS and no possibility of associated traffic congestion and delay. Exciting to those charged with
managing or analyzing massive data inflows, some flash cache acceleration cards now support multiple
terabytes of solid state storage, enabling the storage of entire databases or other datasets as hot data.
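A simple expected-latency model shows why cache capacity, through its effect on hit rate, matters so much. The 10 ms HDD and 100 ns DRAM figures come from this article; the ~50 µs flash latency and both hit rates are illustrative assumptions:

```python
def avg_latency_us(hit_rate, cache_us, miss_us):
    """Expected access latency given a cache hit rate (microseconds)."""
    return hit_rate * cache_us + (1 - hit_rate) * miss_us

HDD_US = 10_000.0   # ~10 ms per random HDD access (from the article)
DRAM_US = 0.1       # ~100 ns main-memory access (from the article)
FLASH_US = 50.0     # illustrative flash read latency (assumed)

# A gigabyte-scale DRAM cache might hit only 30% of a large working
# set, while a terabyte-scale flash cache can hold far more hot data:
print(avg_latency_us(0.30, DRAM_US, HDD_US))    # ~7000 us average
print(avg_latency_us(0.95, FLASH_US, HDD_US))   # 547.5 us average
```

The miss path dominates: even though flash is hundreds of times slower than DRAM, the far larger flash cache cuts average latency by an order of magnitude because so few requests fall through to disk.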
Mobile Networks
Traffic volume in mobile networks is doubling every year, driven mostly by the explosion of video
applications. Per-user access bandwidth is also increasing rapidly as we move from 3G to LTE and LTE-
Advanced. This will in turn lead to the advent of even more graphics-intensive, bandwidth-hungry
applications.
Base stations must rapidly evolve to manage rising network loads. In the infrastructure, multiple radios are
now being used in cloud-like distributed antenna systems, and network topologies are flattening. Operators
are planning to deliver advanced quality of service with location-based services and application-aware
billing. As in the enterprise, handling these complex, real-time tasks is increasingly feasible only with
acceleration engines built into smart silicon.
To deliver higher 4G data speeds reliably to a growing number of mobile devices, access networks need
more, and smaller, cells, which drives the deployment of SoCs in base stations. Reducing
component count with SoCs has another important advantage: lower power consumption. From the edge
to the core, power consumption is now a critical factor in all network infrastructures.
The use of System-on-Chip ICs with multiple cores and multiple acceleration engines will be essential in 3G
and 4G mobile networks.
Enterprise networks, datacenter storage architectures and mobile network infrastructures are in the midst
of rapid, complex change. The best and possibly only way to efficiently and cost-effectively address these
changes and harness the opportunities of the data deluge is by adopting smart silicon solutions that are
emerging in many forms to meet the challenges of next-generation networks.
About the Author
Greg Huff is Chief Technology Officer at LSI. In this capacity, he is responsible for
shaping the future growth strategy of LSI products within the storage and
networking markets. Huff joined the company in May 2011 from HP, where he
was vice president and chief technology officer of the company’s Industry
Standard Server business. In that position, he was responsible for the technical
strategy of HP’s ProLiant servers, BladeSystem family products and its
infrastructure software business. Prior to that, he served as research and
development director for the HP Superdome product family. Huff earned a
bachelor's degree in Electrical Engineering from Texas A&M University and an MBA
from the Cox School of Business at Southern Methodist University.
Maximizing solid-state storage capacity in
small form factors
Kent Smith, Senior Director of Marketing, Flash Components Division, LSI
Users want ever-smaller and lighter devices but also demand ever-increasing storage capacity to keep
more apps and data loaded on their mobile computing platforms. To accommodate these two competing
objectives, solid-state storage form factors will need to get smaller, while NAND flash memory geometries
will be shrinking and storing more bits per cell. The combination is having an impact on the way flash
memory is being designed into ultrabooks, netbooks and other mobile computing devices.
The first consideration in designing for maximum capacity is the form factor of the printed circuit board
(PCB) for the storage components. The latest storage form factors being standardized are known as M.2
(previously called the next generation form factor or NGFF). As shown in Figure 1, the most popular M.2
form factor among system manufacturers is 40 percent smaller than the mSATA card. In addition to being
more compact, the M.2 specification has been optimized for solid state storage and includes connector
keys for SATA, 2x or 4x PCI Express.
Figure 1. This popular version of the new M.2 form factor (on the right) offers 40 percent less area than the existing mSATA
form factor.
For applications where additional capacity is required (and space is available), the M.2 specification
supports other card dimensions, including some with lengths up to 110 mm, providing nearly 60 percent
more area than mSATA. There are also other custom and proprietary designs that stack multiple flash
memory packages or use multiple PCBs, growing taller in the z-height dimension above the base PCB to
reduce the overall footprint for a given aggregate volume.
The smaller area available on the M.2 card is driving the need for using smaller flash memory geometries
and/or more bits per cell. As shown in Figure 2, the combination has dramatically increased the density of
storage possible. For example, in the same footprint, 50 nm flash using single-level cells (SLC) can store
only 2 Gigabytes (GB), while 19 nm flash using multi-level cells (MLC) can store 32 GB—16 times the
density for approximately the same cost. With triple-level cells (TLC), also at 19 nm, the same footprint
could have a capacity as high as 48 GB.
Figure 2. Smaller flash memory geometries and more bits per cell combine to increase the capacity in Gigabytes per square
millimeter possible in small form factors.
Next-generation flash storage processors
Taking full advantage of shrinking geometries and higher bit densities of NAND flash memory requires
some changes to flash storage processors (FSP). The FSP is responsible for managing the pages and
blocks of flash memory, and also provides the input/output (I/O) interface with the system. Two of the
biggest challenges for FSPs today involve error correction and endurance.
As flash memory geometries shrink, cells become smaller and, therefore, hold less of a charge for the one,
two or three bits they store. For illustrative purposes imagine a 50 nm cell storing a single bit, which
might hold about 1000 electrons, and a 20 nm cell storing two bits, which might hold only 100 electrons—
an order of magnitude fewer. While the number of electrons cited here does not reflect actual
measurements, the comparison does demonstrate that the lower charge available with fewer electrons
increases the potential for read errors from the flash, which must be corrected by the FSP.
Traditional approaches to error correction, such as Reed-Solomon (RS) or BCH (also named for its co-
inventors Bose, Ray-Chaudhuri and Hocquenghem), are giving way to the Low-Density Parity Check
(LDPC) in next-generation FSPs. LDPC can provide error correction performance close to the theoretical
limits of any technique. Adding sophisticated digital signal processing enables detection and correction of
even more errors. The few errors that cannot be corrected could then be handled by an integral data
protection technology, much like the RAID (redundant array of independent disks) technology used in
direct-attached storage and storage area network controllers.
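Real LDPC and BCH decoders are well beyond a short sketch, but the core parity-check idea, redundant bits whose recomputed "syndrome" locates a flipped bit, can be illustrated with the classic (and far weaker) Hamming(7,4) code:

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword with 3 parity bits."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):
    """Locate and fix any single-bit error, then return the data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1          # correct the flipped bit
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a single-bit read error
print(hamming74_decode(word))         # [1, 0, 1, 1]
```

An FSP's LDPC engine works on the same principle at vastly larger scale, using soft information from the flash cells to correct many errors per codeword rather than one.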
Higher density flash cells with higher error rates wear out sooner. For this reason, the garbage collection
and wear-leveling capabilities of the FSP have become increasingly important. The need for garbage
collection and wear-leveling in NAND flash causes the amount of data being physically written to flash
memory to be a multiple of the logical data intended to be written. This phenomenon is expressed as a
simple ratio called “write amplification,” which ideally would approach 1.0. Because these “unnecessary”
writes wear out cells prematurely, next-generation FSPs will benefit greatly from some type of data
reduction technology to minimize write amplification and, thereby, maximize the flash memory’s useful
life.
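As a concrete illustration, write amplification is simply the ratio of bytes physically written to flash versus bytes the host asked to write (the volumes below are hypothetical):

```python
def write_amplification(physical_bytes, logical_bytes):
    """Ratio of flash writes to host writes; ideally approaches 1.0."""
    return physical_bytes / logical_bytes

# Hypothetical example: garbage collection and wear-leveling caused
# 30 GB of flash writes in order to store 20 GB of host data.
print(write_amplification(30e9, 20e9))   # 1.5
```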
Another technique for increasing capacity is to eliminate the need for a separate DRAM buffer, which is
required in solid state storage solutions to maintain the “map” consisting of a combination of the flash
memory file index and logical block addresses (LBAs). But the DRAM chip consumes precious space and
power that could (and should) be used for more flash memory. DRAM-less chip designs, such as the LSI
SandForce FSP, are also key to enabling SSD manufacturers to develop higher capacity drives for today’s
growing class of thin-and-light ultrabook platforms. By creating designs that do not require an external
DRAM buffer, these next-generation single-chip FSPs are what will make it possible to maximize solid
state storage capacity in small form factors.
About the author
Kent Smith is senior director of Marketing for the Flash Components Division of LSI Corporation, where he
is responsible for all outbound marketing and performance analysis. Prior to LSI, Smith was the senior
director of Corporate Marketing at SandForce, which was acquired by LSI in 2012, his second company to
be sold to LSI. He has over 25 years of marketing and management experience in the storage and high-
tech industry, holding senior management positions at companies including SiliconStor, Polycom, Adaptec,
Acer and Quantum. Smith holds an MBA from the University of Phoenix.
Bridging the Data Deluge Gap—The Role of
Smart Silicon in Networks
Michael Merluzzi, LSI Corporation
The proliferation of smart mobile devices, video, user-generated content and social networking, and the
rising adoption of cloud services for both enterprise and consumer services are all driving explosive growth
of wireless networking infrastructure. Globally, mobile data traffic is expected to grow 18-fold between
2011 and 2016, reaching 10.8 exabytes per month by 2016. Today, video traffic alone accounts for 40
percent of the wireless network load. The number of mobile devices connected to wireless networks will
reach 25 billion, averaging 3.5 devices for every person on the planet, by 2015. That number is expected
to double, to 50 billion, by 2020. This growth in storage capacity and network traffic is far outstripping the
infrastructure build-out required to support it, a phenomenon known as the data deluge gap.
To bridge this gap, the industry needs to leverage smarter silicon technology to scale datacenter
infrastructures more cost effectively. Besides helping close the data deluge gap, smarter data processing
offers potential dramatic improvements in application performance. A recent survey of 412 European
datacenter managers conducted by LSI revealed that while 93 percent acknowledged the critical
importance of improving application performance, a full 75 percent do not feel that they are achieving the
desired results. This indicates that there is rising pressure on datacenter managers to find smarter ways
to push systems to do much more work within the same power and cost profiles.
Accelerating Networks
Smart software running on general-purpose processors, increasingly with multiple cores, is pervasive in
the datacenter. Processors have long inhabited switches and routers, firewalls and load-balancers, WAN
accelerators and VPN gateways. None of these systems is fast enough, however, to keep pace with the
data deluge on its own, for a basic reason: general-purpose processors must treat every byte equally.
While such equality is perfectly acceptable for system-level versatility, it is inadequate for low-level, high-
volume packet processing.
This reality is driving the need for more intelligence in silicon that is purpose-built for specific networking
applications to provide the right balance of performance, power consumption and programmability.
Today’s smart silicon has reached a level of price/performance that makes it more cost-effective than
adding general-purpose processors.
The latest generation of smart silicon typically features multiple cores of general-purpose processors and
multiple acceleration engines for common networking functions, such as packet classification with deep
packet inspection, security processing, especially for encryption and decryption, and traffic management.
Some of these acceleration engines are so powerful they can completely offload specialized network
processing from general-purpose processors, making it easier to perform switching, routing and other
networking functions entirely in smart line cards installed in servers and networking appliances to further
accelerate overall network performance.
In many organizations today, microseconds matter, driving strong demand for faster response times. For
trading firms, latency can be measured in millions of dollars per millisecond. For others, such as online
retailers, every millisecond of delay can mean lost sales and fading customer loyalty. Tomorrow’s
datacenter networks will need to be both faster and flatter, and therefore, smarter than ever. To eliminate
the data deluge gap and maximize performance, systems need to be smarter, and those smarts will
increasingly need to take the form of purpose-built silicon.
About the Author
Michael Merluzzi is product marketing manager in the Networking Solutions Group of LSI Corporation.
Focusing on mobile backhaul applications, Merluzzi is responsible for marketing of integrated platform
solutions and application-enabling software for the LSI Axxia family of multicore communication
processors. Previously, he held a variety of roles in technical marketing, applications engineering and
software development. Merluzzi holds a bachelor's degree in Electrical Engineering from The Pennsylvania
State University and master's degrees in Business Administration and Computer Engineering from Lehigh
University.
Accelerating SAN Storage with Server Flash Caching
By Tony Afshary
The data deluge, with its relentless increase in the volume and velocity of data, has brought renewed
focus on an old problem: the enormous performance gap that exists in input and output (I/O) operations
between a server’s memory and disk storage. I/O takes a mere 100 nanoseconds for information stored in
a server’s memory, whereas I/O to a hard disk drive (HDD) takes about 10 milliseconds — a difference of
five orders of magnitude that is having a profound adverse impact on application performance and
response times.
The lower bandwidth and higher latency in a storage area network (SAN) or network-attached storage
(NAS) combine to exacerbate the performance problem, which gets even worse with the frequent traffic
congestion on the intervening Fibre Channel (FC), FC over Ethernet, iSCSI or Ethernet network. This
storage bottleneck has grown over the years as the increase in drive capacities has outstripped the
decrease in latency of faster-spinning drives. As a result, the performance limitations of most applications
have become tied to latency more than bandwidth or I/Os per second (IOps), and this trend is expected to
accelerate as the amount of data being created continues to grow between 30 and 50 percent per year.
It is instructive to look at the situation from another perspective. The past three decades have witnessed a
3000 times increase in network bandwidth, while network latency has been reduced by only about 30
times. During the same period, the gains in processor performance, disk capacity and memory capacity
have similarly eclipsed the relatively modest reduction in latency.
The extent of the problem became apparent in a recent survey conducted by LSI of 412 European
datacenter managers. The results revealed that while 93 percent acknowledge the critical importance of
optimizing application performance, a full 75 percent do not feel they are achieving the desired results.
Not surprisingly, 70 percent of the survey respondents cited storage I/O as the single biggest bottleneck
in the datacenter today.
The challenge will only get greater, caused by what LSI calls the data deluge gap — the disparity between
the 30 to 50 percent annual growth in storage capacity requirements and the 5 to 7 percent annual
increase in IT budgets. The net effect is that data is growing faster than the IT infrastructure investment
required to store, transmit, analyze and manage it. The result is that IT departments and datacenter
managers are under increasing pressure to find smarter ways to bridge the data deluge gap and improve
performance.
Cache in a Flash
Caching content to memory in a server or in a SAN on a Dynamic RAM (DRAM) cache appliance is a
proven technique for improving storage performance by reducing latency, and thereby improving
application-level performance. But because the amount of memory possible in a server or cache appliance
(measured in gigabytes) is only a small fraction of the capacity of even a single hard disk drive (measured
in terabytes), performance gains from this traditional form of caching are becoming increasingly insufficient
to overcome the challenges of the data deluge gap.
NAND flash memory technology breaks through the cache size limitation imposed by traditional memory
to again make caching the most effective and cost-effective means for accelerating application
performance. As shown in the diagram, NAND flash memory fills the significant void between main
memory and Tier 1 storage in both capacity and latency.
Flash memory fills the void in both latency and capacity between main memory and fast-
spinning hard disk drives.
Solid state memory typically delivers the highest performance gains when the flash cache acceleration
card is placed directly in the server on the PCI Express (PCIe) bus. Embedded or host-based intelligent
caching software is used to place “hot data” (the most frequently accessed data) in the low-latency flash
storage, where data is accessed up to 200 times faster than with a Tier 1 HDD, where less frequently
accessed data is stored.
Astute readers may be questioning how flash cache, with a latency 100 times higher than DRAM, can
outperform traditional caching systems. There are two reasons for this. The first is the significantly higher
capacity of flash memory, which dramatically increases the “hit rate” of the cache. Indeed, with some of
these flash cache cards now supporting multiple terabytes of solid state storage, there is often sufficient
capacity to store entire databases or other datasets as “hot data.”
The second reason involves the location of the flash cache: directly in the server on the high-speed PCIe
bus. With no internal or external connections and no intervening network subject to frequent
congestion, the “hot data” is accessible in a flash (pun intended) and in a deterministic manner under all
circumstances.
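The "hot data" placement that caching software performs can be sketched as a simple least-recently-used (LRU) policy; real caching engines use more sophisticated heuristics, so this Python model is illustrative only:

```python
from collections import OrderedDict

class HotDataCache:
    """Minimal LRU cache sketch: keeps the most recently used blocks."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # refresh recency
            self.hits += 1
            return True                          # served from flash cache
        self.misses += 1
        self.blocks[block_id] = True             # fetch and cache it
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict the coldest block
        return False

cache = HotDataCache(capacity=100)
for _ in range(10):              # a skewed, repetitive workload
    for block in range(80):      # all 80 hot blocks fit in the cache
        cache.read(block)
print(cache.hits / (cache.hits + cache.misses))  # 0.9
```

When the working set fits entirely in the cache, as in this toy run, only the first pass misses; this is exactly the regime that terabyte-scale flash caches make achievable for whole databases.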
Tests show that the performance gains of server-side flash-based caching are both consistent and
significant under real-world conditions. Tests performed by LSI using Quest Benchmark Factory software
and audited by the Transaction Processing Performance Council, clearly demonstrate how a PCIe-based flash
acceleration card can improve database application-level performance by a conservative 5 to 10 times
compared to either direct-attached storage (DAS) or a SAN.
More and Better Flash
As the pricing of flash memory continues to drop and its performance continues to improve, flash memory
will become more prevalent throughout the datacenter. Will flash-based solid state drives (SSDs) ever
replace hard disk drives? No, at least not in the foreseeable future. HDDs have enormous advantages in
storage capacity and in the cost of that capacity on a per-gigabyte basis. And because the vast majority of
data in most organizations is only rarely accessed, the higher latency of HDDs is normally of little
consequence — especially if this “dusty data” can become “hot data” in a PCIe flash cache accelerator on
those rare occasions when it is needed.
The key to making continued improvements in flash price/performance — comparable to that of
processors according to Moore’s Law — is advancements in the flash controllers that facilitate ever-
shrinking NAND memory geometries, already under 20 nanometers. The latest generation of flash
controllers offers sophisticated wear-leveling to improve flash memory endurance, and enhanced error
correction algorithms to improve reliability with RAID-like data protection.
These advances are making it possible for PCIe-based flash caching solutions to provide advanced
capabilities beyond those available with traditional caching. For example, caching has historically been a
read-only technology, but RAID-like data protection for writes to flash memory has the effect of making
the cache the equivalent of a fast storage tier. The addition of acceleration for writes to flash cache (which
are then persisted to RAID-based DAS or SAN) can improve application-level performance by up to 30
times compared to HDD-only storage systems.
The Future of Flash
Flash memory has already become the primary storage in tablets and ultrabooks, and a growing number
of laptop computers. Solid state drives are replacing or supplementing hard disk drives in desktop
computers and the direct-attached storage in servers, while SSD storage tiers are growing larger in SAN
and NAS configurations. And the use of PCIe-based acceleration adapters is growing rapidly owing to their
ability to bridge the data deluge gap better than any other alternative.
Some of the other advantages of flash (not discussed here) are giving these trends additional momentum.
Flash has a higher density than hard disk drives, enabling more storage in a smaller space. Flash also
consumes less power, and therefore, requires less cooling. These advantages are equally beneficial at both
a small scale in a tablet and a large scale in a datacenter.
Even as flash memory becomes more pervasive throughout datacenters, there will continue to be a need
for PCIe flash acceleration cards in servers for quite some time. Indeed, the flash cache is expected to
remain the most effective and cost-effective way to accelerate application performance for the foreseeable
future.
Tony Afshary is the director of marketing for the Accelerated Solutions Division of LSI Corporation.
Understanding SSD over-provisioning
Kent Smith, LSI Corporation
The over-provisioning of NAND flash memory in solid state drives (SSDs) and flash memory-based
accelerator cards (cache) is a required practice in the storage industry owing to the need for a controller
to manage the NAND flash memory. This is true for all segments of the computer industry—from
ultrabooks and tablets to enterprise and cloud servers.
Essentially, over-provisioning allocates a portion of the total flash memory available to the flash storage
processor, which it needs to perform various memory management functions. This leaves less usable
capacity, of course, but results in superior performance and endurance. More sophisticated applications
require more over-provisioning, but the benefits inevitably outweigh the reduction in usable capacity.
The Need for Over-provisioning NAND Flash Memory
NAND flash memory is unlike both random access memory and magnetic media, including hard disk
drives, in one fundamental way: there is no ability to overwrite existing content. Instead, entire blocks of
flash memory must first be erased before any new pages can be written.
With a hard disk drive (HDD), for example, the act of “deleting” files affects only the metadata in the
directory. No data is actually deleted on the drive; the sectors used previously are merely made available
as “free space” for storing new data. This is the reason “deleted” files can be recovered (or “undeleted”)
from HDDs, and why it is necessary to actually erase sensitive data to fully secure a drive.
With NAND flash memory, by contrast, free space can only be created by actually deleting or erasing the
data that previously occupied any block of memory. The process of reclaiming blocks of flash memory that
no longer contain valid data is called “garbage collection.” Only when the blocks, and the pages they
contain, have been cleared in this fashion are they then able to store new data during a write operation.
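A toy model of this reclamation step might look like the following sketch; a real FSP tracks page state per block and relocates still-valid pages before erasing, but the victim-selection idea is the same (all names and numbers here are hypothetical):

```python
def garbage_collect(blocks):
    """Pick the block with the fewest valid pages, return those pages
    for rewriting elsewhere, and erase the block (modeled as emptying
    it). blocks: list of sets, each holding a block's valid page IDs."""
    victim = min(range(len(blocks)), key=lambda i: len(blocks[i]))
    relocated = blocks[victim]       # valid pages to copy out first
    blocks[victim] = set()           # erased: whole block now writable
    return victim, relocated

# Hypothetical layout: block 1 holds only one valid page, so it is the
# cheapest block to reclaim.
blocks = [{1, 2, 3}, {7}, {4, 5, 6, 8}]
victim, pages = garbage_collect(blocks)
print(victim, sorted(pages))         # 1 [7]
```

Note that the one relocated page must itself be rewritten to flash, which is precisely how garbage collection contributes to write amplification.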
The flash storage processor (FSP) is responsible for managing the pages and blocks of memory, and also
provides the interface with the operating system’s file subsystem. This need to manage individual cells,
pages and blocks of flash memory requires some overhead, and that in turn, means that the full amount
of memory is not available to the user. To provide a specified amount of user capacity it is therefore
necessary to over-provision the amount of flash memory, and as will be shown later, the more over-
provisioning the better.
The portion of total NAND flash memory capacity held in reserve (unavailable to the user) for use by the
FSP is used for garbage collection (the major use); FSP firmware (a small percentage); spare blocks
(another small percentage); and optionally, enhanced data protection beyond the basic error correction
(space requirement varies).
Even though there is a loss in user capacity with over-provisioning, the user does receive two important
benefits: better performance and greater endurance. The former is one of the reasons for using flash
memory, including in solid state drives (SSDs), while the latter addresses an inherent limitation in flash
memory.
Percentage Over-provisioning
The equation for calculating the percentage of over-provisioning is rather straightforward:

Over-provisioning percentage = ((Physical capacity - User capacity) / User capacity) x 100

For example, in a configuration consisting of 128 Gigabytes (GB) of flash memory total, 120 GB of which is available to the user, the system is over-provisioned by (128 - 120) / 120 = 6.7 percent, which is typically rounded up to 7 percent.
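The calculation can be sketched as a small Python helper (the function name is ours, purely for illustration):

```python
def over_provisioning_pct(physical_gb: float, user_gb: float) -> float:
    """Over-provisioning as a percentage of the user-visible capacity."""
    return (physical_gb - user_gb) / user_gb * 100

# 128 GB of physical flash, 120 GB exposed to the user:
print(round(over_provisioning_pct(128, 120), 1))  # 6.7, typically rounded to 7
```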
It is also important to note another factor that often causes confusion: a binary gigabyte (a gibibyte) is not the same as a decimal gigabyte. As shown in Figure 1, a binary GB is 7.37 percent larger than a decimal GB. Most operating systems display the binary representation for both memory and storage, which makes over-provisioning appear smaller: the actual number of bytes is 7.37 percent higher than the number displayed. This is why an SSD listed as providing 128 GB of user space can still function with 128 GB of physical memory. Using the calculation above, the over-provisioning amount would appear to be zero percent, which is impossible for NAND flash; in reality the drive is over-provisioned by roughly 7.37 percent.
Figure 1. The difference between a binary Gigabyte and a decimal Gigabyte
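The binary/decimal discrepancy is easy to verify directly; the short snippet below reproduces the 7.37 percent figure:

```python
# A binary gigabyte (GiB, 2**30 bytes) vs. a decimal gigabyte (10**9 bytes).
binary_gb = 2 ** 30
decimal_gb = 10 ** 9

difference_pct = (binary_gb - decimal_gb) / decimal_gb * 100
print(round(difference_pct, 2))  # 7.37

# An SSD marketed as 128 (decimal) GB of user space, built from 128 GiB of
# physical flash, is therefore inherently over-provisioned by the same amount:
physical_bytes = 128 * binary_gb
user_bytes = 128 * decimal_gb
print(round((physical_bytes - user_bytes) / user_bytes * 100, 2))  # 7.37
```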
Test Environment
To isolate the over-provisioning variable, the tests were conducted on a single SSD with Toshiba MLC
(multi-level cell) 24nm NAND flash memory controlled by an LSI SF-2281 flash storage processor. It is
important to note that the FSP used employs the LSI DuraWrite™ technology that optimizes writes to flash
memory, and utilizes intelligent block management and wear-leveling to improve reliability and
endurance. These capabilities combine to afford over five years of useful life for MLC-based flash memory
with typical use cases.
Previous testing performed by LSI revealed that entropy has an effect on performance only for SSDs
without data reduction technology. For this reason, the red lines in the graphs showing the results for
100% entropy are labeled “Typical SSDs.” This series of tests, which used SSDs equipped with LSI DuraWrite data reduction technology, was designed to evaluate performance at different levels of both over-provisioning and entropy, and specifically to test the hypothesis that data reduction can improve performance at lower levels of entropy.
Test result data points are based on post-garbage collection, steady state operation. All preconditioning
used the same transfer size and type as the test result (e.g. random 4KB results are preconditioned with
random 4KB transfers until reaching steady state operation).
VDBench V5.02 was used as the main test software with IOMeter V1.1.0 providing cross-check
verification. The test PC was configured with an Intel Core i5-2500K 3.30 GHz processor, the Intel H67
Express chipset, Intel Rapid Storage Technology 10.1.0.1008 (with AHCI Enabled); 4 GB of 1333 MHz
RAM; and Windows 7 Professional (32-bit).
Performance Test Results
Sequential write performance was uniform across all tested over-provisioning levels, ranging from zero to 75 percent. This flat performance derives from the nature of sequential writes to flash. As data is written sequentially to flash memory, it completely fills all of the pages in a block. When the drive becomes full, blocks whose data is no longer valid must first be erased via the garbage collection process, which in this case simply erases entire blocks without needing to move (read then write) any individual pages that might otherwise still be valid. Because there are no incremental writes during this form of garbage collection, there is no benefit from additional free space. With SSDs that use a data reduction technology like DuraWrite from LSI, the level of flat performance increases as a function of the entropy (data randomness): the lower the entropy, the higher the performance. In this situation, however, the increase in performance comes from the reduced writes being completed sooner, not from the additional free space.
Throughput performance for sustained 4KB random writes improved as the amount of over-provisioning
increased. Additionally, for SSDs with DuraWrite data reduction technology, the throughput improvement
also increased at all levels of entropy.
Figure 2 shows the results of this test. Increased over-provisioning improves performance for random writes because of how garbage collection operates. As data is written randomly, the
logical block addresses (LBAs) being updated are distributed across all the blocks of the flash. This causes
a number of small “holes” of invalid data pages among valid data pages. During garbage collection those
blocks with invalid data pages require the valid data to be read and moved to new empty blocks. This
background read and write operation requires time to execute and prevents the SSD from responding to
read and write requests from the host, giving the perception of slower overall performance. When the
over-provisioning is a higher percentage of the total flash memory, the time required for garbage
collection is reduced, enabling the SSD to operate faster.
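The mechanism described above can be illustrated with a deliberately simplified simulation. This toy model (uniform random single-page host writes, greedy victim selection) is not the algorithm any real flash storage processor uses, but it shows why more spare area means fewer relocated pages and lower write amplification:

```python
import random

def simulate_write_amplification(user_pages=2048, op_fraction=0.28,
                                 pages_per_block=64, host_writes=60_000,
                                 seed=0):
    """Write amplification of a toy SSD under uniform random host writes.

    Greedy garbage collection: when no free block remains, the block with
    the fewest valid pages is reclaimed and its surviving pages rewritten.
    """
    rng = random.Random(seed)
    n_blocks = int(user_pages * (1 + op_fraction)) // pages_per_block
    valid = [set() for _ in range(n_blocks)]   # valid logical pages per block
    where = {}                                 # logical page -> block index
    free = list(range(1, n_blocks))            # block 0 starts as open block
    open_blk, fill, physical = 0, 0, 0

    for _ in range(host_writes):
        pending = [rng.randrange(user_pages)]  # one host write (plus GC moves)
        while pending:
            lp = pending.pop()
            if lp in where:                    # the old copy becomes invalid
                valid[where[lp]].discard(lp)
            if fill == pages_per_block:        # open block is full
                if not free:                   # reclaim the emptiest block
                    victim = min((b for b in range(n_blocks) if b != open_blk),
                                 key=lambda b: len(valid[b]))
                    pending.extend(valid[victim])  # survivors must be rewritten
                    valid[victim].clear()
                    free.append(victim)        # victim is erased
                open_blk, fill = free.pop(), 0
            valid[open_blk].add(lp)            # program the page
            where[lp] = open_blk
            fill += 1
            physical += 1
    return physical / host_writes

# More spare area -> fewer valid pages relocated -> lower write amplification:
for op in (0.07, 0.28, 0.50):
    print(f"OP {op:.0%}: WA = {simulate_write_amplification(op_fraction=op):.2f}")
```

The printed write amplification falls steadily as the over-provisioning fraction rises, mirroring the trend the tests measured.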
Figure 2. The effect of over-provisioning on write performance throughput
The need for garbage collection and wear-leveling with NAND flash memory causes the amount of data
being physically written to be a multiple of the logical data intended to be written. This phenomenon is
expressed as a simple ratio called “write amplification,” which ideally would approach 1.0 for standard
SSDs with sequential writes, but typically is much higher due to the addition of random writes in most
environments. With SSDs that have DuraWrite technology, the typical user experiences a much lower
write amplification, often averaging only 0.5. Keeping write amplification low is important for extending the flash memory’s useful life.
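Write amplification is simply the ratio of physical to logical writes; a trivial sketch (the byte counts are illustrative, not measured values):

```python
def write_amplification(physical_bytes_written, host_bytes_written):
    """Ratio of data physically written to flash vs. data the host sent."""
    return physical_bytes_written / host_bytes_written

# A standard SSD: garbage collection and wear-leveling add extra writes.
print(write_amplification(1.5e9, 1.0e9))  # 1.5

# With data reduction, less data than the host sent may reach the flash,
# so the ratio can drop below 1.0 (the article cites ~0.5 as typical).
print(write_amplification(0.5e9, 1.0e9))  # 0.5
```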
Random write operations have the greatest impact on write amplification, so to best view the effect of
over-provisioning on write amplification, tests were conducted under those conditions. As shown in Figure
3, write amplification for sustained 4KB random writes benefited significantly from a higher percentage of
over-provisioning for SSDs that do not include DuraWrite technology. For SSDs that do include DuraWrite or a similar data reduction technology, the improvement in write amplification increased at a higher rate at higher levels of entropy.
Note also how the use of a data reduction technology like DuraWrite minimizes the benefits of over-
provisioning for lower levels of entropy. When the entropy of the user data is low, DuraWrite is able to
reduce the amount of space consumed in the flash memory. Because the operating system is unaware of
this reduction, the extra space is automatically used by the flash storage processor as additional over-
provisioning space. As the entropy of the data increases, the additional free space decreases. At 100
percent entropy the additional over-provisioning is zero, which is the same result as a “Typical SSD” (red
line) that does not employ a data reduction technology. Referring again to Figure 3, a standard SSD with
28 percent over-provisioning would have the same write amplification as an SSD with DuraWrite
technology at zero percent over-provisioning for data with an entropy as high as 75 percent.
Figure 3. The effect of over-provisioning on write amplification
With the advent of SSDs, and the need to manage them differently from traditional HDDs, a TRIM
command was added to storage protocols to enable operating systems to designate blocks of data that are
no longer valid. Until the SSD is informed that data is invalid (either by a TRIM command or by a new write to a currently occupied LBA), it will continue to preserve that data during the garbage collection process, resulting in less free space and higher write amplification. TRIM enables the SSD to perform its garbage collection and free up the storage
space occupied by invalid data in advance of future write operations.
Figure 4 shows the effect of the TRIM command on over-provisioning. For a “marketed” percentage of
over-provisioning (28 percent in this example), the amount effectively increases after performing a TRIM
operation. Note how the capacity originally designated as Free Space remains consumed as Presumed
Valid Data by the SSD after being deleted by the operating system or the user until a TRIM command is
received. In effect, the TRIM operation provides dynamic over-provisioning because it increases the
resulting over-provisioning after completion.
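The dynamic over-provisioning effect of TRIM can be worked through numerically using the article’s 28 percent example (the 40 GB of deleted-but-untrimmed data below is a hypothetical figure chosen for illustration):

```python
def spare_area_pct(physical_gb, preserved_gb):
    """Spare area available for garbage collection, as a percentage of the
    data the SSD must still treat as valid."""
    return (physical_gb - preserved_gb) / preserved_gb * 100

physical_gb, user_gb = 128, 100  # the article's 28% marketed example
deleted_gb = 40                  # hypothetical: deleted by the OS, not yet trimmed

# Before TRIM, deleted data is still "presumed valid" and must be preserved:
print(round(spare_area_pct(physical_gb, user_gb), 1))               # 28.0
# After TRIM, the stale 40 GB becomes additional (dynamic) spare area:
print(round(spare_area_pct(physical_gb, user_gb - deleted_gb), 1))  # 113.3
```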
Figure 4. The effect of the TRIM command on over-provisioning percentage
Conclusion
The over-provisioned capacity of NAND flash memory creates the space the flash storage processor needs
to manage the flash memory more intelligently and effectively. As shown by these test results, higher
percentages of over-provisioning improve both write performance and write amplification. Higher
percentages of over-provisioning can also improve the endurance of flash memory and enable more robust
forms of data protection beyond basic error correction.
Only SSDs that utilize a data reduction technology, such as DuraWrite in the LSI SandForce flash storage
processors, can take advantage of lower levels of entropy to improve performance based on the increase
in “dynamic” over-provisioning.
Owing to the many benefits of over-provisioning, a growing number of SSDs now enable users to control
the percentage of over-provisioning by allocating a smaller portion of the total available flash memory to
user capacity during formatting. With increased capacities based on the ever-shrinking geometries of
NAND flash memory technology, combined with steady advances in flash storage processors, it is
reasonable to expect that over-provisioning will become less of an issue with users over time.
About the author
Kent Smith is senior director of Marketing for the Flash Components Division of LSI Corporation, where he
is responsible for all outbound marketing and performance analysis. Prior to LSI, Smith was the senior
director of Corporate Marketing at SandForce, which was acquired by LSI in 2012, his second company to
be sold to LSI. He has over 25 years of marketing and management experience in the storage and high-
tech industry, holding senior management positions at companies including SiliconStor, Polycom, Adaptec,
Acer and Quantum. Smith holds an MBA from the University of Phoenix.
Next-generation multicore SoC architectures for
tomorrow's communications networks
David Sonnier, LSI Corporation
IT managers are under increasing pressure to boost network capacity and performance to cope
with the data deluge. Networking systems are under a similar form of stress with their
performance degrading as new capabilities are added in software. The solution to both needs is
next-generation System-on-Chip (SoC) communications processors that combine multiple cores
with multiple hardware acceleration engines.
The data deluge, with its massive growth in both mobile and enterprise network traffic, is driving
substantial changes in the architectures of base stations, routers, gateways, and other networking
systems. To maintain high performance as traffic volume and velocity continue to grow, next-generation
communications processors combine multicore processors with specialized hardware acceleration engines
in SoC ICs.
The following discussion examines the role of the SoC in today’s network infrastructures, as well as how
the SoC will evolve in coming years. Before doing so, it is instructive to consider some of the trends
driving this need.
Networks under increasing stress
In mobile networks, per-user access bandwidth is increasing by more than an order of magnitude from
200-300 Mbps in 3G networks to 3-5 Gbps in 4G Long-Term Evolution (LTE) networks. LTE-Advanced technology will double bandwidth again to 5-10 Gbps. Higher-speed access networks will need more and
smaller cells to deliver these data rates reliably to a growing number of mobile devices.
In response to these and other trends, mobile base station features are changing significantly.
Multiple radios are being used in cloud-like distributed antenna systems. Network topologies are
flattening. Operators are offering advanced Quality of Service (QoS) and location-based services and
moving to application-aware billing. The increased volume of traffic will begin to place considerable stress
on both the access and backhaul portions of the network.
Traffic is similarly exploding within data center networks. Organizations are pursuing limitless-scale
computing workloads on virtual machines, which is breaking many of the traditional networking protocols
and procedures. The network itself is also becoming virtual and shifting to a Network-as-a-Service (NaaS)
paradigm, which is driving organizations to a more flexible Software-Defined Networking (SDN)
architecture.
These trends will transform the data center into a private cloud with a service-oriented network. This
private cloud will need to interact more seamlessly and securely with public cloud offerings in hybrid
arrangements. The result will be the need for greater intelligence, scalability, and flexibility throughout
the network.
Moore’s Law not keeping pace
Once upon a time, Moore’s Law – the doubling of processor performance every 18 months or so – was
sufficient to keep pace with computing and networking requirements. Hardware and software advanced in
lockstep in both computers and networking equipment. As software added more features with greater
sophistication, advances in processors maintained satisfactory levels of performance. But then along came
the data deluge.
In mobile networks, for example, traffic volume is growing by some 78 percent per year, owing mostly to
the increase in video traffic. This is already causing considerable congestion, and the problem will only get
worse when an estimated 50 billion mobile devices are in use by 2016 and the total volume of traffic
grows by a factor of 50 in the coming decade.
In data centers, data volume and velocity are also growing exponentially. According to IDC, digital data
creation is rising 60 percent per year. The research firm’s Digital Universe Study predicts that annual data
creation will grow 44-fold between 2009 and 2020 to 35 zettabytes (35 trillion gigabytes). All of this data
must be moved, stored, and analyzed, making Big Data a big problem for most organizations today.
With the data deluge demanding more from network infrastructures, vendors have applied a Band-Aid to
the problem by adding new software-based features and functions in networking equipment. Software has
now grown so complex that hardware has fallen behind. One way for hardware to catch up is to use
processors with multiple cores. If one general-purpose processor is not enough, try two, four, 16, or
more.
Another way to improve hardware performance is to combine something new – multiple cores – with
something old – Reduced Instruction Set Computing (RISC) technology. With RISC, less is more based on
the uniform register file load/store architecture and simple addressing modes. ARM, for example, has
made some enhancements to the basic RISC architecture to achieve a better balance of high performance,
small code size, low power consumption, and small silicon area, with the last two factors being important
to increasing the core count.
Hardware acceleration necessary, but …
General-purpose processors, regardless of the number of cores, are simply too slow for functions that
must operate deep inside every packet, such as packet classification, cryptographic security, and
traffic management, which is needed for intelligent QoS. Because these functions must often be performed
in serial fashion, there is limited opportunity to process them simultaneously in multiple cores. For these
reasons, such functions have long been performed in hardware, and it is increasingly common to have
these hardware accelerators integrated with multicore processors in specialized SoC communications
processors.
The number of function-specific acceleration engines available also continues to grow, and more engines
(along with more cores) can now be placed on a single SoC. Examples of acceleration engines include
packet classification, deep packet inspection, encryption/decryption, digital signal processing, transcoding,
and traffic management. It is even possible now to integrate a system vendor’s unique intellectual
property into a custom acceleration engine within an SoC. Taken together, these advances make it
possible to replace multiple SoCs with a single SoC in many networking systems (see Figure 1).
Figure 1: SoC communications processors combine multiple
general-purpose processor cores with multiple task-specific
acceleration engines to deliver higher performance with a
lower component count and lower power consumption.
In addition to delivering higher throughput, SoCs reduce the cost of equipment, resulting in a significant
price/performance improvement. Furthermore, the ability to tightly couple multiple acceleration engines
makes it easier to satisfy end-to-end QoS and service-level agreement requirements. The SoC also offers
a distinct advantage when it comes to power consumption, which is an increasingly important
consideration in network infrastructures, by providing the ability to replace multiple
discrete components in a single energy-efficient IC.
The powerful capabilities of today’s SoCs make it possible to offload packet processing entirely to the line cards of systems such as routers and switches. In distributed architectures like the IP Multimedia Subsystem (IMS) and SDN,
the offload can similarly be distributed among multiple systems, including servers.
Although hardware acceleration is necessary, the way it is implemented in some SoCs today may no
longer be sufficient in applications requiring deterministic performance. The problem is caused by the
workflow within the SoC itself when packets must pass through several hardware accelerators, which is
increasingly the case for systems tasked with inspecting, transforming, securing, and otherwise
manipulating traffic.
If traffic must be handled by a general-purpose processor each time it passes through a different
acceleration engine, latency can increase dramatically, and deterministic performance cannot be
guaranteed under all circumstances. This problem will get worse as data rates increase in Ethernet
networks from 1 Gbps to 10 Gbps, and in mobile networks from 300 Mbps in 3G networks to 5 Gbps in 4G
networks.
Next-generation multicore SoCs
LSI addresses the data path problem in its Axxia SoCs with Virtual Pipeline technology. The Virtual
Pipeline creates a message-passing control path that enables system designers to dynamically specify
different packet-processing flows that require different combinations of multiple acceleration engines. Each
traffic flow is then processed directly through any engine in any desired sequence without intervention
from a general-purpose processor (see Figure 2). This design natively supports connecting different
heterogeneous cores together, enabling more flexibility and better power optimization.
Figure 2: To maximize performance, next-generation SoC
communications processors process packets directly and
sequentially in multiple acceleration engines without intermediate
intervention from the CPU cores.
In addition to faster, more efficient packet processing, next-generation SoCs also include more general-
purpose processor cores (to 32, 64, and beyond), highly scalable and lower-latency interconnects,
nonblocking switching, and a wider choice of standard interfaces (Serial RapidIO, PCI Express, USB, I2C,
and SATA) and higher-speed Ethernet interfaces (1G, 2.5G, 10G, and 40G+). To easily integrate these
increasingly sophisticated capabilities into a system’s design, software development kits are enhanced
with tools that simplify development, testing, debugging, and optimization tasks.
Next-generation SoC ICs accelerate time to market for new products while lowering both manufacturing
costs and power consumption. With deterministic performance for data rates in excess of 40 Gbps,
embedded hardware is once again poised to accommodate any additional capabilities required by the data
deluge for another three to four years.
Why next-generation infrastructures need
smarter silicon By Jim Anderson
Given the explosive growth in data traffic, Moore's Law is not enough to keep pace with
demand for higher network speeds. A smarter silicon and software approach is needed.
This vendor-written tech primer has been edited by Network World to eliminate product promotion, but
readers should note it will likely favor the submitter's approach.
Among the best ways to accelerate the performance of mobile and data center networks is to combine
general-purpose processors with smart silicon accelerator engines that significantly streamline the way
bits are prioritized and moved to optimize network performance and cloud-based services.
One of the fundamental challenges facing the industry is the data deluge gap -- the disparity between the
30% to 50% annual growth in network and storage capacity requirements and the 5% to 7% annual
increase in IT budgets. The growing adoption of cloud-based services and soaring generation and
consumption of data storage are driving exponential growth in the volume of data crossing the network to
and from the cloud. With the growth in data traffic far outstripping the infrastructure build-out required to
support it, network managers are under pressure to find smarter ways to improve performance.
Cloud data center networks were built with existing technologies and have thus far succeeded in
improving performance through brute force -- adding more hardware such as servers, switches, processor
cores and memory. This approach, however, is costly and unsustainable, increasing hardware costs along
with floor space, cooling and power requirements, and falls well short of solving the problem of network
latency.
Adding intelligence in the form of smarter silicon streamlines processing of data packets traversing mobile
and data center networks. In particular, smart silicon enables next-generation networks to understand the
criticality of data, then manipulate, prioritize and route it in ways that reduce overall traffic and ensure important digital information, such as real-time data for voice and video, is delivered on time.
Smarter networks
General-purpose processors, which increasingly feature multiple cores, pervade network infrastructures.
These processors drive switches and routers, firewalls and load-balancers, WAN accelerators and VPN
gateways. None of these systems is fast enough, however, to keep pace with the data deluge on its own,
and for a basic reason: general-purpose processors are designed purely for compute-centric, server-class
workloads and are not optimized for handling the unique network-centric workloads in current and next-
generation infrastructures.
Smart silicon, however, can accelerate throughput for real-time workloads, such as high-performance
packet processing, while ensuring deterministic performance over changing traffic demands.
Smart silicon typically features multiple cores of general-purpose processors complemented by multiple
acceleration engines for common networking functions, such as packet classification with deep packet
inspection, security processing and traffic management. Some of these acceleration engines are powerful
enough to completely offload specialized packet processing tasks from general-purpose processors,
making it possible to perform switching, routing and other networking functions entirely in fast path
accelerators to vastly improve overall network performance. Offloading compute-intensive workloads to
acceleration engines that are optimized for a particular workload can also deliver a significant
performance-per-watt advantage over purely general-purpose processors.
Customized smart silicon can be a great option for a network equipment vendor wanting to carve out a
unique competitive advantage by integrating its own optimizations. For example, a vendor's proprietary,
differentiating intellectual property can be integrated into silicon to provide advantages over general-
purpose processors, including for optimized baseband processing, deep packet inspection and traffic
management. This level of integration requires close collaboration between network equipment and
semiconductor vendors.
Tomorrow's data center network will need to be both faster and flatter, and therefore smarter than ever.
One of the key challenges to overcome in virtualized mega data centers is control plane scalability. To
enable cloud-scale data centers, the control plane needs to scale either up or out. In the traditional scale-
up approach, additional or more powerful compute engines, acceleration engines or both are deployed to
help scale up networking control plane performance.
In emerging scale-out architectures like software-defined networking (SDN), the control plane is
separated from the data plane, and then typically executed on standard servers. In both scale-up and
scale-out architectures, intelligent multicore communications processors that combine general-purpose
processors with specialized hardware acceleration engines can dramatically improve control plane
performance. Some functions, such as packet processing and traffic management, often can be offloaded
to line cards equipped with these purpose-built communications processors.
While the efficacy of distributing the control and data planes remains an open question, it's clear that SDN
will need smart silicon to deliver on its promise of scalable performance.
Smarter storage
Smarter silicon in storage can also help close the data deluge gap. The storage I/O choke point is rooted
in the mechanics of traditional hard disk drive (HDD) platters and actuator arms and their speed limits in
transferring data from the disk media, as evidenced in the difference of five orders of magnitude in I/O
latency between memory (at 100 nanoseconds) and Tier 1 HDDs (at 10 milliseconds).
Another limitation is the amount of memory that can be supported in traditional caching systems
(measured in gigabytes), which is a small fraction of the capacity of a single disk drive (measured in
terabytes). Both offer little room for performance improvements beyond increasing the gigabytes
of Dynamic RAM (DRAM) in caching appliances or adding more of today's fast-spinning HDDs.
Solid state storage in the form of NAND flash memory, on the other hand, is particularly effective in
bridging this significant bottleneck, delivering high-speed I/O similar to memory at capacities on a par
with HDDs. For its part, smart silicon delivers sophisticated wear-leveling, garbage collection and unique
data reduction techniques to improve flash memory endurance and enhanced error correction algorithms
for RAID-like data protection. Flash memory helps bridge both the capacity and latency gap between
DRAM caching and HDDs.
Solid state memory typically delivers the highest performance gains when the flash cache acceleration
card is placed directly in the server on the PCI Express (PCIe) bus. Embedded or host-based intelligent
caching software is used to place "hot data" in the flash memory, where data can be accessed in 20
microseconds -- 140 times faster than with a Tier 1 HDD, at 2,800 microseconds. Some of these cards
support multiple terabytes of solid state storage, and a new class of solution now also offers both internal
flash and Serial-Attached SCSI (SAS) interfaces to combine high-performance solid state and RAID HDD
storage. A PCIe-based flash acceleration card can improve database application-level performance by five
to 10 times in DAS and SAN environments.
Smart silicon is at the heart of all of these solutions. So without the deep inside view of the semiconductor
vendors, the system vendors would have no hope of ever closing the data deluge gap.
Avoiding “Whack-A-Mole” in the Data Center
By Jeff Richardson
It’s a curse in any network infrastructure, especially in the data center: clear one performance bottleneck,
and another drag on data or application speed surfaces elsewhere in a never-ending game of “Whack-A-
Mole.” In today’s data centers, the “Whack-A-Mole” mallet is swinging like never before as these
bottlenecks pop up with increasing frequency in the face of the data deluge—the exponential growth of
digital information worldwide.
Some of these choke points are familiar, such as the timeworn input/output (I/O) path between servers
and disk storage, whether directly attached or in a storage-area network, as microprocessor capability and speed have outpaced storage. Other, newer bottlenecks are cropping up with the growing consolidation and
virtualization of servers and storage in data center clouds as more organizations deploy cloud
architectures to pool storage, processing and networking in order to increase computing resource
efficiency and utilization, improve resiliency and scalability, and reduce costs.
Improving data center efficiency has always come down to balancing and optimizing these resources, but
this calibration is being radically disturbed today by major transitions in the network, such as the growth
of Gigabit Ethernet to 10 Gigabit and soon to 40 Gigabit, the emergence of multicore and other ever-faster
processors, and the rising deployments of solid-state storage. As virtualization increases server utilization,
and therefore efficiency, it also exacerbates interactive resource conflicts in memory and I/O. And even
more resource conflicts are bound to emerge as big-data applications evolve to run over ever-growing
clusters of tens of thousands of computers that process, manage and store petabytes of data.
With these dynamic changes to the data center, maintaining acceptable levels of performance is becoming
a greater challenge. But there are proven ways to address the most common bottlenecks today—ways
that will give IT managers a stronger hand in the high-stakes bottleneck reduction contest.
Bridging the I/O Gap Between Memory and Hard-Disk Drives
Hard-disk drive (HDD) I/O is a major bottleneck in direct-attached storage (DAS) servers, storage-area
networks (SANs) and network-attached storage (NAS) arrays. Specifically, I/O to memory in a server
takes about 100 nanoseconds, whereas I/O to a Tier One HDD takes about 10 milliseconds—a difference
of 100,000 times that chokes application performance. Latency in a SAN or NAS often is even higher
because of data-traffic congestion on the intervening Fibre Channel (FC), FC over Ethernet or iSCSI
network.
These bottlenecks have grown over the years as increases in drive capacity have outstripped decreases in
latency of faster-spinning drives, and in confronting the data deluge, IT managers have needed to add
more hard disks and deeper queues just to keep pace. As a result, the performance limitations of most
applications have become tied to latency instead of bandwidth or I/Os per second (IOPS), and this
problem threatens to worsen as the need for storage capacity continues to grow by 50–100 percent per
year. Keep in mind that the last three decades have seen only a 30x reduction in latency, while network
bandwidth has improved 3,000x over the same period. Processor throughput, disk capacity and memory
capacity have also seen large gains.
Caching content to memory in a server or in the SAN on a dynamic RAM (DRAM) cache appliance can help
reduce latency, and therefore improve application-level performance. But because the amount of memory
possible in a server or cache appliance, measured in gigabytes, is only a small fraction of the capacity of
even a single hard-disk drive, measured in terabytes, performance gains from caching are often
inadequate.
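The bookkeeping behind such caches is essentially least-recently-used (LRU) tracking: keep the most recently touched blocks in fast memory and evict the coldest ones. As an illustrative sketch only (class and method names are ours, not any vendor's API):

```python
from collections import OrderedDict

class HotDataCache:
    """Minimal LRU cache of the kind a DRAM or flash caching layer uses to
    keep 'hot' blocks close to the CPU. Illustrative sketch, not a product."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, block_id):
        if block_id in self._store:
            self._store.move_to_end(block_id)   # mark as most recently used
            return self._store[block_id]
        return None                             # miss: fall through to the HDD

    def put(self, block_id, data):
        self._store[block_id] = data
        self._store.move_to_end(block_id)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict the least recently used
```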
Solid-state storage in the form of NAND flash memory is particularly effective in bridging the significant
latency gap between memory and HDDs. In both capacity and latency, flash memory bridges the gap
between DRAM caching and HDDs, as the chart below shows. Traditionally, flash has been very expensive
to deploy and difficult to integrate into existing storage architectures. Today, decreases in the cost of flash
coupled with hardware and software innovations that ease deployment have made the ROI for flash-based
storage more compelling.
Flash memory fills the void in both latency and capacity between dynamic RAM in a cache appliance and
fast-spinning hard-disk drives.
Solid-state memory typically delivers the highest performance gains when the flash acceleration card is
placed directly in the server on the PCI Express (PCIe) bus. Embedded or host-based intelligent caching
software is used to place “hot data” in the flash memory, where data is accessed in about 20
microseconds—140 times faster than with a Tier One HDD, at 2,800 microseconds—giving users data they
care about far faster. Some of these cards support multiple terabytes of solid-state storage, and a new
class of solution now also offers both internal flash and Serial Attached SCSI (SAS) interfaces to create a
combination high-performance solid-state and RAID HDD storage solution. A PCIe-based flash acceleration
card can improve database application-level performance by 5 to 10 times in either a DAS or SAN
environment.
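The payoff of placing hot data in flash follows directly from the cache hit rate. A simple average-latency model, using the 20-microsecond flash and 2,800-microsecond HDD figures above:

```python
def effective_latency_us(hit_rate, flash_us=20.0, hdd_us=2800.0):
    """Average read latency when a fraction of requests hit the flash cache."""
    return hit_rate * flash_us + (1.0 - hit_rate) * hdd_us

# With a 90% hit rate, average latency drops from 2,800 us to about 298 us,
# roughly a 9x application-visible speedup.
print(effective_latency_us(0.9))
```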
Scaling the Virtualized Data Center Network
One common bottleneck in virtualized data centers today is the switching control plane—a potential choke
point that can limit network performance as the number of virtual machines grows. Control-plane
workloads increase in four sometimes related ways:
Server virtualization adds considerable control overhead, especially when moving virtual machines (VMs)
More and larger server clusters, such as for analyzing big data, substantially increase the traffic flow for
inter-node communications
The explosion in CPU cores—driven by the need to avert bottlenecks in server processing power—
increases both the number of VMs per server and the size of server clusters
Data center networks flatten as they grow, both to accommodate these changes and to maintain latency
and throughput performance in the face of relentless growth
These changes are severely stressing the control plane. During a VM migration, for example, rapid
changes in connections, address resolution protocol (ARP) messages and routing tables can overwhelm
existing control-plane solutions, especially in large-scale virtualized environments. As a result, large-scale
VM data migration is often impractical because of the overhead involved.
To enable large-scale VM migration, the control plane needs to scale either up or out. In the traditional
scale-up approach, the existing control-plane solutions within networking platforms are supplemented by
additional or more-powerful compute engines, acceleration engines or both to help scale control-plane
performance. These supplemental resources free up CPU cycles for other tasks, improving overall network
performance.
In the scale-up architecture, existing network platforms are supplemented by additional and/or more-
powerful compute engines to help execute the network control stack.
In emerging scale-out architectures, the control plane is separated from the data plane, and then typically
executed on standard servers. In some cases, control-plane tasks are divided into sub-tasks, such as
discovery, dissemination and recovery, which are then distributed across these servers. Emerging
architectures such as SDN (software-defined networking) employ scale-out approaches for greater control-
plane scalability. These architectures also enable IT managers to virtualize the network substrate and to
better manage and secure data center traffic.
In the scale-out architecture, the separation and distribution of the control and data planes lends itself
well to software-defined networking, such as with OpenFlow.
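The scale-out division of the control plane into sub-tasks can be sketched as a placement function that deterministically maps each sub-task to a controller server in a pool. This is a toy illustration; the server names and hashing scheme are our assumptions, not any product's design:

```python
import hashlib

# Toy sketch of the scale-out idea: control-plane sub-tasks are hashed onto a
# pool of commodity controller servers so no single box handles everything.
# Controller names and sub-task list are illustrative assumptions.
CONTROLLERS = ["ctrl-0", "ctrl-1", "ctrl-2"]
SUB_TASKS = ["discovery", "dissemination", "recovery"]

def assign(task, controllers=CONTROLLERS):
    """Deterministically place a sub-task on one controller in the pool."""
    digest = hashlib.sha256(task.encode()).digest()
    return controllers[digest[0] % len(controllers)]

placement = {task: assign(task) for task in SUB_TASKS}
print(placement)
```

Because the mapping is deterministic, any node can recompute where a sub-task lives without consulting a central registry, which is one reason hashed placement is common in distributed designs.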
In both scale-up and scale-out architectures, intelligent multicore communications processors, which
combine general-purpose processors with specialized hardware acceleration engines for specific functions,
can produce dramatic improvements in control-plane performance. Some functions, such as packet
processing and traffic management, often can be offloaded entirely to line cards equipped with such
purpose-built communications processors.
Near-term Advances That Promise to Improve Both Server I/O and Network Performance
In many organizations today, milliseconds matter, driving strong demand for shorter response times. For
some, like trading firms, latency can be measured in millions of dollars per millisecond. For others, such as
online retailers, every millisecond of delay caused by latency can compromise competitiveness and
customer satisfaction, and ultimately directly affect revenue.
As more digital information is driven throughout the data center, fast solid-state storage will be
increasingly deployed for storage server caching, and for solid-state drives (SSDs) in tiered DAS and SAN
configurations. The growth of SSD capacity and shipment volumes continues, reducing the cost per
gigabyte through economies of scale, while smart flash storage processors with sophisticated garbage
collection, wear-leveling and enhanced error-correction algorithms continue to improve SSD endurance.
Increasing use of 10 Gigabit and 40 Gigabit Ethernet, and broad deployment of 12Gbps SAS technology,
will also contribute to higher data rates. Besides doubling the throughput of existing 6Gbps SAS
technology, 12Gbps SAS will use performance improvements in PCIe 3.0 to achieve more than one million
IOPS.
As data center networks continue to flatten, new forms of acceleration and programmability in both the
control and data planes will be needed. Greater use of hardware acceleration for both packet processing
and traffic management will deliver deterministic performance under varying traffic loads in these flat,
scaled-up or scaled-out networks.
More Bottlenecks to Come
As servers move to 10 Gigabit Ethernet, the rack will become its own bottleneck. To help clear this
bottleneck, solid-state storage will shuttle data among servers at high speed, purpose-built PCIe cards will
enable fast inter-server communications, and all components within a rack will likely be restructured to
optimize performance and cost. As data centers begin to resemble private clouds and increasingly employ
public cloud services in a multi-tenant, hybrid arrangement, the switching services plane will need to more
intelligently classify and manage traffic to improve application-level performance and enhance security.
With the increasing use of encrypted and tunneled traffic, these and other CPU-intensive packet
processing tasks will need to be offloaded to function-specific acceleration engines to enable a fully
distributed intelligent fabric.
High-speed communications processors, acceleration engines, solid-state storage and other technologies
that increase performance and reduce latency in data center networks will take on increasing importance
as networks and data centers continue to struggle with massive data growth, and as IT managers race to
increase data speed within their architectures just to keep up with relentless demand for faster access to
digital information.
About the Author
Jeff Richardson is executive vice president and chief operating officer for LSI. In
this capacity, he oversees all marketing, engineering and manufacturing of the
company’s product operations. Previously, Richardson was executive vice
president and general manager of the LSI Semiconductor Solutions Group, where
he was responsible for LSI’s silicon solutions across all segments of data
networking/communications, server, hard disk drive, enterprise tape and storage
systems markets.
Richardson joined the company in June 2005 from Intel Corporation, where he
served as vice president of the Digital Enterprise Group and general manager of
the Server Platform Group. Before that, Richardson was vice president and
general manager of the Intel Enterprise Solutions and Services Division. Before
joining Intel in 1992, he held engineering positions at Altera Corporation, Chips and Technologies (the first
fabless semiconductor company), and Amdahl Corporation. Richardson earned a bachelor’s degree in
electrical engineering from the University of Colorado in 1987. He is a member of the board of directors of
Volterra Semiconductor Corporation.
Leading article photo courtesy of Mike Towber
Virtualization of Data Centers: New Options in
the Control and Data Planes (Part III)
Raghu Kondapalli is director of technology focused on Strategic Planning and Solution
Architecture for the Networking Components Division of LSI Corporation. He brings rich
experience and deep knowledge of the cloud-based, service provider and enterprise
networking business, specifically in packet processing, switching and SoC architectures.
This Industry Perspectives article is the third and final in a series of three that analyzes the
network-related issues being caused by the Data Deluge in virtualized data centers, and how these are
having an effect on both cloud service providers and the enterprise. The focus of the first article was on
the overall effect server virtualization is having on storage virtualization and traffic flows in the data
center network, while the second article dove a bit deeper into the network management complexities and
control plane requirements needed to address those challenges. This article examines two ways of scaling
the control plane to accommodate these additional requirements in virtualized data centers.
The control plane can scale in two directions: out or up. In the scale-out approach, the control plane
functions are separated and distributed across physical or virtual servers. In the scale-up approach, the
server’s processing power is augmented by adding extra compute resources, such as x86 processors. In
both the scale-out and scale-up architectures, performance can be further enhanced by providing
function-specific hardware acceleration.
Control Plane Scale-out Architecture
In the scale-out architecture, the basic platform is implemented with generic processors augmented by
separate communications processors with specialized hardware accelerators that can offload control plane
functions. The control plane tasks are divided into sub-tasks, such as discovery, dissemination, and
recovery, and are then distributed across the data center. Because the various tasks can execute on any
server in the network or in the cloud, the scale-out architecture lends itself well to Software Defined
Networking (SDN). Owing to its distributed arrangement, the architecture requires robust communications
between the control plane and the data planes using APIs for the network protocol, such as OpenFlow.
Depending on the network size and configuration, hardware acceleration of these networking functions
may be necessary to achieve satisfactory performance. Protocol-aware communications processors are
designed to handle specific control plane tasks and/or network management functions, including packet
analysis and routing, security, ARP offload, OAM offload, IGMP messages, networking statistics,
application-aware firewalling, QoS, etc.
Control Plane Scale-up Architecture
In the scale-up architecture, the existing network control platforms are supplemented by additional and/or
more powerful compute engines to help execute the network control stack. These supplemental resources
free up server CPU cycles for other tasks, and result in an overall improvement in the network
performance. Because general-purpose processors are not optimized for packet processing functions,
however, they are not an ideal solution for the scale-up architecture. As with the scale-out architecture,
performance can be improved dramatically using function-specific, protocol-aware communications
processors.
Bridging The Data Deluge Gap
Guest post written by Abhi Talwalkar
Abhi Talwalkar is CEO of LSI Corp.
In the first 60 seconds of reading this article, 1 billion gigabytes of information will flow
across mobile networks around the world. That’s the equivalent of a tenth of all the information contained
in the Library of Congress crisscrossing the Internet in a minute. This massive flow of information,
happening every minute of every day, will grow ten-fold over the next several years, according to IDC.
The amount of static data – information stored on drives or servers – also is expected to expand at an
incredibly rapid rate. As individuals and businesses, we are all dealing with the impact of this data deluge.
At the same time, IT budgets are growing only 5%-7% per year. Herein lies the real challenge:
information is growing faster than the IT infrastructure investment required to store, transmit,
analyze and manage it, leaving a widening "data deluge gap." And unless new forms of intelligence,
including those powered by smart
silicon, are integrated into datacenters and networks to clear bottlenecks and bridge the gap between
traffic growth and IT investments, the world’s information society could face significant economic and
technical roadblocks.
As with many things in technology, this gap represents enormous challenges, but also offers huge
opportunities.
One outcome of unrelenting data growth in datacenters and mobile networks has been the accelerated
adoption of cloud computing. The “cloud” solves many technical challenges and helps deliver services
more efficiently by leveraging spending on existing infrastructure. But it is fraught with its own challenges,
especially for architects of datacenters and mobile networks wrestling with how to address daunting
scalability, flexibility and capacity requirements in order to unlock the greatest value from the information
created in the data deluge.
In today’s data-driven world, information has enormous value. Make no mistake: The “digital divide” is
very real, as those with slow or limited access to data get out-traded on Wall Street, out-marketed on the
Internet and risk falling behind in education, business and medicine. Data is most valuable when it is
used, shared, analyzed and made available to connected devices and people. But the determination of
what constitutes valuable data must often be made in nanoseconds.
Together these challenges mean that the industry must bridge the gap to get the maximum return on
information from the highest value data. Of course, this is much easier said than done. To eliminate traffic
bottlenecks in storage systems and in enterprise and mobile networks, smart silicon must be integrated
within strategic areas of IT infrastructures. Ironically, as one kind of chip enables the creation of huge
volumes of data, other smart chips are needed to help increase the speed of the system and direct the
flow of that data.
So where are these bottlenecks?
Today, mobile networks suffer the most acute impact as data traffic growth, driven by huge adoption of
smartphones, tablets and other client devices, is forecast to grow at a 78% compounded annual growth
rate from 2011 to 2016. In data centers, gains in storage performance have fallen well short of increases
in processor speed, which continues to double every few years in keeping with Moore’s Law. These
storage and networking choke points are expected to tighten as the number of connected and mobile
devices rises from about 8 billion today to 50 billion by 2020, and as the volume of data continues to grow
by 30% to 50% a year.
In mobile networks, the dramatic rise in video is driving explosive data growth. What’s more, end users
want faster access to higher quality content, including bandwidth-hungry high-definition video and other
rich media.
But video poses real challenges, as it consumes considerably more bandwidth than both voice and data,
and video quality degrades substantially, often unacceptably, when network congestion interrupts or
delays individual packets in traffic streams. In other words: not all packets are created equal, which
means as video traffic grows, mobile networks are going to need to get smarter about how they manage
the packets traversing their infrastructures. The devil is in the details, which is why smart silicon is
required to address this challenge, performing tasks like packet inspection: looking into packets as they
move through the network and deciding what to do with them and which ones to prioritize.
When it comes to bottlenecks in datacenters, the Data Deluge Gap affects everyone, from the largest
service provider and enterprise to small and medium businesses, and the billions of end users consuming
data-intensive services. Here the biggest bottleneck is between a server’s central processing unit and its
storage, whether directly attached or in a storage area network. Retrieving and storing data from a hard-
disk drive takes one million times longer than accessing it from server memory, a difference that can
severely degrade application performance.
For transaction-oriented businesses such as online retailers, these drags on performance can mean the
difference between profitability and losses. For retail, healthcare and pharmaceutical companies that now
rely on critical findings of Big Data analytics, performance slowdowns can compromise key aspects of
competitiveness such as how quickly and where a product is brought to market. For service organizations,
a delay of a few seconds can mean the difference between deepening customer loyalty and
abandonment. Think about waiting on a website for your shopping cart to load and credit card to clear;
the longer you wait, the more likely you are to choose another site, and some sites see thousands of such
abandoned carts every hour. Or look at the world of high-speed trading, where millions of dollars balance on
milliseconds of timing. The stakes couldn’t be higher.
The opportunity here is to leverage technologies such as flash memory, which can accelerate access by as
much as 300x over existing technologies but until recently has been too expensive to deploy broadly.
Capturing that opportunity requires smart silicon, but also a rethinking of architectures and storage.
The biggest trends in IT today (big data, cloud computing, social media and the growing ranks of connected
devices in the "Internet of things") all mean one thing: a relentless, massive flow of data that needs to be
shared and stored. With a creaking infrastructure, the system at times risks paralysis or overflow, and
localized outages have already shown us the chaos that could occur. Many challenges remain.
More exciting is the opportunity. Data has enormous value and potential to improve our society. To
liberate information from the performance constraints of today’s storage and networking infrastructures,
we need to focus our brightest minds on solutions like bringing smart silicon to strategic points in
networks or datacenters, or on hardware or software that helps route and prioritize the most important
data for the fastest access. That is how we can tackle the data deluge and create the best user experience
for all.
Virtualization of Data Centers: New Options in
the Control & Data Planes (Part II)
Raghu Kondapalli is director of technology focused on Strategic Planning and Solution Architecture for
the Networking Components Division of LSI Corporation. He brings rich experience and deep
knowledge of the cloud-based, service provider and enterprise networking business, specifically in
packet processing, switching and SoC architectures.
This Industry Perspectives article is the second in a series of three that analyzes the
network-related issues being caused by the Data Deluge in virtualized data centers, and how these are
having an effect on both cloud service providers and the enterprise. The focus of the first article was on
the overall effect server virtualization is having on storage virtualization and traffic flows in the datacenter
network. This article dives a bit deeper into the network challenges in virtualized data centers as well as
the network management complexities and control plane requirements needed to address those
challenges.
Server Virtualization Overhead
Server virtualization has enabled tens to hundreds of VMs per server in data centers using multi-core CPU
technology. As a result, packet processing functions, such as packet classification, routing decisions,
encryption/decryption, etc., have increased exponentially. Because discrete networking systems may not
scale cost-effectively to meet these increased processing demands, some changes are also needed in the
network.
Networking functions that are implemented in software in network hypervisors are not very efficient,
because x86 servers are not optimized for packet processing. The control plane, therefore, needs to be
scaled somehow by adding communications processors capable of offloading network control tasks, and
both the control and data planes stand to benefit substantially from hardware assistance provided by such
function-specific acceleration.
The table below shows the effect on packet processing overhead of virtualizing 1,000 servers. As shown,
by mapping each CPU core to four virtual machines (VMs), and assuming 1 percent traffic management
overhead with a 25 percent east-west traffic flow, the network management overhead increases by a
factor of 32 times in this example of a virtualized data center.
This table shows the effect on network management overhead of virtualizing 1,000 servers.
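One arithmetic reading of that 32x figure is a simple endpoint count: with multiple VMs per core, each physical server becomes many managed network endpoints. The sketch below is a hedged reconstruction, not the article's own model; the 4 VMs-per-core value is from the text, but the 8-core server size is our assumption for illustration:

```python
# Hedged reconstruction of the 32x management-overhead figure.
# vms_per_core comes from the article; cores_per_server is an assumption.
servers = 1000
cores_per_server = 8          # assumed, not stated in the article
vms_per_core = 4              # from the article

virtual_endpoints = servers * cores_per_server * vms_per_core
overhead_factor = virtual_endpoints / servers
print(overhead_factor)        # each physical endpoint becomes 32 virtual ones
```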
Virtual Machine Migration
Support for VM migration among servers, either within one server cluster or across multiple clusters,
creates additional management complexity and packet processing overhead. IT administrators may decide
to move a VM from one server to another for a variety of reasons, including resource availability, quality-
of-experience, maintenance, and hardware/software or network failures. The hypervisor handles these VM
migration scenarios by first reserving a VM on the destination server, then moving the VM to its new
destination, and finally tearing down the original VM.
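That three-step sequence (reserve, move, tear down) can be sketched with a toy host model; the class and method names here are illustrative, not a hypervisor API:

```python
class Host:
    """Minimal stand-in for a hypervisor host (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.vms = {}

    def reserve(self, vm_id):
        self.vms[vm_id] = "reserved"

    def activate(self, vm_id, state):
        self.vms[vm_id] = state

    def copy_state(self, vm_id):
        return self.vms[vm_id]

    def teardown(self, vm_id):
        del self.vms[vm_id]

def migrate_vm(vm_id, src, dst):
    # The three-step flow described above:
    dst.reserve(vm_id)                           # 1. reserve on the destination
    dst.activate(vm_id, src.copy_state(vm_id))   # 2. move the VM's state across
    src.teardown(vm_id)                          # 3. tear down the original VM
```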
Hypervisors often cannot generate address resolution protocol (ARP) broadcasts quickly enough to
notify the network of VM moves, especially in large-scale virtualized environments. The network can even become
so congested from the control overhead occurring during a VM migration that the ARP messages fail to get
through in a timely manner. With such a significant impact on network behavior being caused by rapid
changes in connections, ARP messages and routing tables, existing control plane solutions need an
upgrade to more scalable architectures.
Multi-tenancy and Security
Owing to the high costs associated with building and operating a data center, many IT organizations are
moving to a multi-tenant model where different departments or even different companies (in the cloud)
share a common infrastructure of virtualized resources. Data protection and security are critical needs in
multi-tenant environments, which require logical isolation of resources without dedicating physical
resources to any customer.
The control plane must, therefore, provide secure access to data center resources and be able to change
the security posture dynamically during VM migrations. The control plane may also need to implement
customer-specific policies and Quality of Service (QoS) levels.
Service Level Agreements and Resource Metering
The network-as-a-service paradigm requires active resource metering to ensure SLAs are
maintained. Resource metering through the collection of network statistics is useful for calculating return
on investment, and evaluating infrastructure expansion and upgrades, as well as for monitoring SLAs.
The network monitoring tasks are currently spread across the hypervisor, legacy management tools, and
some newer infrastructure monitoring tools. Collecting and consolidating this management information
adds further complexity to the control plane for both the data center operator and multi-tenant
enterprises.
The next article in the series will examine two ways of scaling the control plane to accommodate these additional
packet processing requirements in virtualized data centers.
Virtualization of Data Centers: New Options in
the Control and Data Planes
Raghu Kondapalli is director of technology focused on Strategic Planning and Solution Architecture for
the Networking Components Division of LSI Corporation. He brings rich experience and deep
knowledge of the cloud-based, service provider and enterprise networking business, specifically in
packet processing, switching and SoC architectures.
The Data Deluge occurring in today’s content-rich Internet, cloud and enterprise
applications is growing the volume, velocity and variety of information data centers must now process. In
response, organizations have begun virtualizing their data centers to become more cost-effective, power-
efficient, scalable and agile.
The migration began with server virtualization using technologies like multi-core CPUs and multi-thread
operating systems. Next was the virtualization of storage area networks (SANs) and network attached
storage (NAS) to cope with the Data Deluge more efficiently and cost-effectively. The final target for
virtualization is the data center network itself, which will necessitate changes in both the control and
data planes to manage traffic flows more intelligently and improve overall performance.
This Industry Perspectives article is the first in a series of three that analyzes the network-related
challenges in virtualized data centers, and how these are having an effect on network infrastructures—
from the SAN to the core. The focus here is on the effect server virtualization is having on storage
virtualization and traffic flows in the data center network.
Server Virtualization’s Effect on Storage and the Network
The need for instantaneous and reliable access to data across all segments of today’s connected world is
pushing the boundaries of data center virtualization. Cloud computing, with its superior scalability and
lower total-cost-of-ownership (TCO), is at the leading edge of this trend by requiring virtualization of the
entire datacenter in a multi-tenancy environment.
Servers were initially virtualized by implementing virtual machines (VMs) in software with the hypervisor
creating a layer of abstraction between physical and virtual machines, thereby absorbing many of the
connectivity, manageability and scalability issues. Software-based hypervisors, however, are unable to
keep pace with the increased performance demands of the Data Deluge. Processor extensions to support
x86 virtualization made their debut in the mid-2000s, providing the hardware acceleration needed to
improve performance.
Storage
Virtualization of storage is typically done in a SAN, which houses both the VM images and some or all of
the data needed by the applications. VM support requires extra storage in the SAN to back up and replicate
the images dynamically, and during the initial phase of storage virtualization, storage hypervisors helped
administrators perform these tasks more easily by disguising the actual complexity of the SAN. These
techniques by themselves, however, proved insufficient for the relentless growth in storage demands. And
once again, advances in hardware, particularly the use of flash memory in solid-state drives (SSDs),
became critical to boosting SAN performance. Such tiered and/or application-aware storage solutions
deliver hardware acceleration to both the SAN and directly attached storage (DAS), providing both
improved I/O throughput and real-time analytics.
Until recently, most of the efforts in data center virtualization addressed the server and storage segments.
Network virtualization has been ad hoc, at best, normally implemented as an add-on module to traditional
compute-centric hypervisors. Network-specific extensions to hypervisors handle basic connectivity and
fault management, and are able to meet the performance needs for small data centers. The current
generation of large-scale server farms, however, must have thousands of servers with potentially dozens
of VMs per server. The application workloads, which are generally distributed across several VMs, increase
VM-to-VM communications (east-west traffic), while other factors, such as VM migration and storage
applications like data replication, have also increased east-west traffic flows. And these changes are
occurring as client-to-server communications (north-south traffic) also continues to grow exponentially.
Reaping Benefits of Virtualization
Currently, IT departments are exploring new options for data center networks to better reap the benefits
of virtualization. At present, several solutions have been proposed to improve data center network
utilization and performance. At the network architectural level, isolating the control plane functions from
the data plane, and virtualizing both, is a growing trend that involves improving the efficiency of the
existing network infrastructures with simple upgrades. Scale-out and scale-up are two such techniques
that are now being used, and these will be covered in more detail in the third article in this series.
A related trend involves Software-Defined Networking (SDN), which is another abstraction where network
application stacks are presented with a virtual view of the network that shields its physical topology. SDN
also enables control plane tasks to be virtualized and distributed across the network. OpenFlow is one
example of an SDN that proposes to separate control plane functions, such as routing, from data plane
functions, like forwarding, enabling them to execute independently on different devices—potentially from
different vendors.
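The separation OpenFlow proposes can be illustrated with a toy flow table: the controller makes the routing decisions and installs match-to-port rules, while the switch does nothing but look them up and forward. This is a conceptual sketch, not the actual OpenFlow protocol or message format:

```python
# Toy illustration of the control/data plane split in the OpenFlow spirit.
# Function names and the table format are illustrative assumptions.
def controller_install(flow_table, match, out_port):
    """Control plane: make the routing decision and push a rule to the switch."""
    flow_table[match] = out_port

def switch_forward(flow_table, packet_dst):
    """Data plane: pure table lookup; unknown flows are punted to the controller."""
    return flow_table.get(packet_dst, "send-to-controller")

table = {}
controller_install(table, "10.0.1.0/24", 3)
print(switch_forward(table, "10.0.1.0/24"))      # forwards out port 3
print(switch_forward(table, "192.168.0.0/16"))   # no rule: ask the controller
```

Because the lookup side is trivial, the forwarding hardware can stay simple and fast while all the intelligence lives in software on the controller, which is the essence of the scale-out argument above.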
But before exploring these proposed network virtualization options, it is useful to dive a bit deeper into the
networking issues in a virtualized datacenter, and this is the subject of the second article in this series.