LSI Corporation
Contributed articles written & placed by Gallagher PR
1
Table of Contents
Title Date Publication Page
Busting Through the Biggest Bottleneck in Virtualized Servers
7/31/13 Data & Storage Management Report
3
Accelerating Big Data Analytics with Flash Caching
7/23/13 Silicon Angle 7
Reality Check: The role of ‘smart silicon’ in mobile networks
7/9/13 RCR Wireless 13
The Revival of Direct Attached Storage for Oracle Databases
7/9/13 Database Trends & Applications 15
Achieving Low Latency: The Fall and Rise of Storage Caching
7/1/13 Datanami 18
Addressing the data deluge challenge in mobile networks with intelligent content caching
6/7/13 Electronic Component News 24
PCIe flash: It solves lots of problems, but also makes a bunch - so what's its future?
5/14/13 TechSpot 28
Mega Datacenters: Pioneering the Future of IT Infrastructure
4/25/13 DatacenterPOST 33
The Evolution Of Solid-State Storage In Enterprise Servers
4/23/13 Electronic Design 37
Networks to Get Smarter and Faster in 2013 and Beyond
3/13/13 Converge! Network Digest 46
Maximizing Solid State Storage Capacity in Small Form Factors (Single-chip "DRAM-less")
3/4/13 Electronic Component News 50
Bridging the Data Deluge Gap--The Role of Smart Silicon in Networks
2/28/13 EE Times 53
Accelerating SAN Storage with Server Flash Caching
1/31/13 Computer Technology Review 55
2
Understanding SSD Over-provisioning
1/8/13 Electronic Design News 58
Next-generation Multicore SoC Architectures for Tomorrow’s Communications Networks
12/11/12 Embedded Computing Design 64
The Inside View: Why Network Infrastructures Need Smarter Silicon
11/10/12 Network World 69
Avoiding “Whack-A-Mole” in the Datacenter
9/10/12 Data Center Journal 72
Virtualization of Data Centers: New Options in the Control and Data Planes (Part III)
8/30/12 Data Center Knowledge 79
Bridging the Data Deluge Gap
8/23/12 Forbes 81
Virtualization of Data Centers: New Options in the Control & Data Planes (Part II)
8/20/12 Data Center Knowledge 84
Virtualization of Data Centers: New Options in the Control and Data Planes
8/2/12 Data Center Knowledge 86
3
Busting Through the Biggest Bottleneck in Virtualized Servers
By Tony Afshary
The data deluge has brought renewed focus on an old problem: the enormous performance gap that exists
in input/output (I/O) between a server’s memory and its storage. I/O typically takes a mere 100
nanoseconds for information stored in the server’s memory, while I/O to a hard disk drive (HDD) takes
about 10 milliseconds — a difference of five orders of magnitude that is having a profound adverse impact
on application performance.
This bottleneck exists for both dedicated and virtualized servers, but can be far worse with the latter
because virtualization creates the potential for much greater resource contention. Virtualization affords
numerous benefits by dramatically improving server utilization (from around 10 percent in dedicated
servers to 50 percent or more), but the increased per-server application load inevitably exacerbates the
I/O bottleneck. Multiple applications, all competing for the same finite I/O, can turn what
might have been orderly, sequential access for each into completely random reads and writes for the server,
creating a worst-case scenario for HDD performance.
In a virtualized server, the primary symptom of contention is when any virtual machine (VM) must wait for
CPU cycles, and/or for I/O to memory or disk. Fortunately, such contention can be minimized by judicious
balancing of the total workload among all virtual servers, and by optimizing the allocation of each server’s
physical resources. Taking these steps can enable a VM to perform as well as a dedicated server.
Unfortunately, however, server virtualization is normally accompanied by storage virtualization, which
virtually assures an adverse impact on application performance. Compared to direct attached storage
(DAS), a storage area network (SAN) or network-attached storage (NAS) has a higher I/O latency,
combined with lower bandwidth and throughput, which also decreases I/O operations per second (IOPS).
Frequent congestion on the intervening Fibre Channel (FC), FC over Ethernet, iSCSI or Ethernet network
further degrades performance.
The extent of the I/O bottleneck issue became apparent in a recent LSI survey of 412 European
datacenter managers. The results revealed that while 93 percent acknowledge the critical importance of
optimizing application performance, a full 75 percent do not feel they are achieving the desired results.
Not surprisingly, 70 percent of the survey respondents cited storage I/O as the single biggest bottleneck
in the datacenter today.
Cache in Flash
Caching data to memory in a server, or in a SAN controller or cache appliance, is a proven technique for
reducing I/O latency and, thereby, improving application-level performance. But because the size of the
cache that is economically feasible with random access memory (measured in gigabytes) is only a small
fraction of the capacity of even a single disk drive (measured in terabytes), traditional RAM-based caching
is increasingly unable to deliver the performance gains required in today’s virtualized datacenters.
Consider what happens in a typical virtualized server. Each VM is allocated some amount of RAM, and
together these allocations usually exceed the total amount of physical memory available. This can result in
the VMs competing for memory, and as they do, it is necessary for the hypervisor to swap pages out and
in, to and from (very slow) disk storage, further exacerbating the I/O bottleneck.
Flash memory technology helps break through the cache size limitation imposed by RAM to again make
caching an effective and cost-effective means for accelerating application performance. As shown in the
4
diagram, flash memory, with an I/O latency of less than 50 microseconds, fills the significant performance
gap between main memory and Tier 1 storage.
Flash memory fills the void in both latency and capacity that exists between main memory and fast-
spinning hard disk drives
The closer the data is to the processor, the better the performance. This is why applications requiring high
performance normally use DAS, and it is also why flash cache provides its biggest benefit when placed
directly in the virtualized server on the PCI Express (PCIe) bus. Intelligent caching software is then used
to automatically and transparently place “hot data” (the most frequently accessed data) in the low-latency
flash memory, where it is accessed up to 200 times faster than when on a Tier 1 HDD. The flash cache can
also be configured to become the “swap cache” for main memory, thus helping to mitigate performance
problems being caused by memory contention.
5
The intelligent caching software detects hot data by constantly monitoring the physical server’s I/O
activity to find the specific ranges of logical block addresses that are experiencing the most reads and/or
writes, and continuously moving these into the cache. With this approach, the flash cache is able to
support all of the VMs running in any server.
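A minimal sketch of this kind of frequency-based hot-data tracking, in Python. The region size, promotion threshold and promotion policy are illustrative assumptions, not LSI's actual algorithm:

```python
from collections import Counter

REGION_SIZE = 2048       # LBAs per tracked region (assumed granularity)
PROMOTE_THRESHOLD = 100  # accesses before a region is deemed "hot" (assumed)

class HotDataTracker:
    """Counts reads/writes per LBA region and promotes the busiest to cache."""

    def __init__(self):
        self.access_counts = Counter()
        self.cached_regions = set()

    def record_io(self, lba: int) -> None:
        region = lba // REGION_SIZE
        self.access_counts[region] += 1
        if (self.access_counts[region] >= PROMOTE_THRESHOLD
                and region not in self.cached_regions):
            self.cached_regions.add(region)  # copy this region into flash cache

    def is_cached(self, lba: int) -> bool:
        return lba // REGION_SIZE in self.cached_regions

tracker = HotDataTracker()
for _ in range(150):                 # a database repeatedly hitting one index block
    tracker.record_io(4096)
print(tracker.is_cached(4096))       # prints: True
```

Because the tracker watches physical I/O below the hypervisor, a single cache serves every VM on the host without per-VM configuration.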
The intelligent caching algorithms normally give the highest priority to highly random, small block-
oriented applications, such as those for databases and on-line transaction processing, because these stand
to benefit the most. By contrast, applications with sequential read and/or write operations benefit very
little from caching (except when multiple such applications are configured to run on the same server!), so
these are given the lowest priority.
How can flash memory, with a latency up to 100 times higher than RAM's, outperform traditional caching
systems? The reason is the sheer capacity possible with flash memory, which dramatically increases the
“hit rate” of the cache. Indeed, with some flash cache cards now supporting multiple terabytes of high-
performance solid state storage, there is often sufficient capacity to store rather large datasets for all of a
server’s VMs as hot data.
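The arithmetic behind this trade-off is simple: average I/O latency is the hit rate times the cache latency plus the miss rate times the HDD latency. Using the article's figures (under 50 microseconds for flash, 10 milliseconds for HDD) and purely illustrative hit rates:

```python
def effective_latency_us(hit_rate, cache_latency_us, miss_latency_us=10_000):
    """Average I/O latency for a cache with the given hit rate (HDD miss = 10 ms)."""
    return hit_rate * cache_latency_us + (1 - hit_rate) * miss_latency_us

# Small RAM cache: very fast (~0.1 us) but low hit rate (illustrative 40%).
ram_cache = effective_latency_us(0.40, 0.1)
# Large flash cache: slower per access (50 us) but high hit rate (illustrative 95%).
flash_cache = effective_latency_us(0.95, 50)

print(f"RAM cache: {ram_cache:.0f} us, flash cache: {flash_cache:.0f} us")
```

Even though each flash access is far slower than RAM, the much larger cache's higher hit rate avoids so many 10 ms disk misses that its average latency is an order of magnitude lower.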
Exhaustive internal LSI testing has shown that the application-level performance gains afforded by flash
cache acceleration in both dedicated and virtualized servers are considerable. For servers with DAS, which
already enjoy the “proximity performance advantage” over SAN/NAS environments, typical improvements
can be in the range of 5 to 10 times. In environments with a SAN or NAS, which experience additional
latency caused by the network, server-side flash caching can improve performance even more — by up to
30 times in some cases.
Flash Forward to the Future
Flash memory has a very promising future. Flash is already the preferred storage medium in tablets and
ultrabooks, and increasingly in laptop computers. Solid state drives (SSDs) are replacing or supplementing
HDDs in desktop PCs and servers with DAS, while the fastest SSD storage tiers are growing larger in SAN
and NAS configurations.
Solid state storage is also non-volatile, so unlike caching with RAM, which is read-only and subject to data
loss during a power failure, a flash cache can support both reads and writes, and some solutions now offer
RAID-like data protection, making the cache the equivalent of a fast storage tier. Internal LSI
testing has shown that adding write acceleration to the flash cache (with writes then persisted to
primary storage) can improve application-level performance in write-intensive applications even beyond
the 10 to 30 times noted above.
The key to making continued improvements in flash price/performance, similar to what has been the case
for CPUs with Moore’s Law, is the flash storage processors (FSPs) that facilitate shrinking flash memory
geometries and/or higher cell densities. To accommodate these advances, future generations of FSPs will
need to offer ever more sophisticated error correction (to improve reliability) and wear-leveling (to
improve endurance).
Flash memory enjoys some other advantages that are beneficial in virtualized datacenters, as well,
including a combination of higher density and lower power consumption compared to HDDs, which enables
more storage in a smaller space that also requires less cooling. SSDs are also typically far more reliable
than HDDs, and should one ever fail, RAID data protection is restored much faster.
As the price/performance of flash memory continues to improve, the rapid adoption of solid state storage
will likely continue in the datacenter. But don’t expect SSDs to completely replace HDDs any time soon.
HDDs have tremendous advantages in storage capacity and the per-gigabyte cost of that capacity. And
because the vast majority of data in most organizations is accessed only occasionally, the higher I/O
6
latency of HDDs is normally of little consequence—particularly because this “dusty data” can quickly
become hot data in a flash (pun intended!) on those infrequent occasions when needed.
Flash cache has now become part of the virtualization paradigm because it maximizes virtualization's
benefits. Servers are virtualized to get more work from each one, resulting in considerable savings in
capital and operational expenditures, as well as in precious space and power. Storage is virtualized to
achieve similar savings through greater efficiencies and economies of scale. Flash cache helps provide a
more cost-effective way to get even more work from virtualized servers and faster work from virtualized
storage.
ABOUT Tony Afshary
Tony Afshary is the business line director for Nytro Solutions Products within the Accelerated Solutions
Division of LSI Corporation. In this role, he is responsible for product management and product marketing
of the LSI Nytro product family of enterprise flash-based storage, including PCIe based flash, utilizing
seamless and intelligent placement of data to accelerate datacenter applications.
Previously, Afshary was responsible for marketing and planning of LSI’s data protection and management
and storage virtualization products. Prior to that, he was the director of Customer/Application Engineering
for LSI’s server/storage products. He has been in the storage industry for over 13 years. Before joining
LSI, Afshary worked at Intel for 11 years, managing marketing and development activities for storage and
communication processors. Afshary received a bachelor’s degree in Electrical and Computer Engineering
and an MBA from Arizona State University.
7
Accelerating Big Data Analytics with Flash Caching
By Kimberly Leyenaar
The global volume, velocity and variety of data are all increasing, and these three dimensions of the data
deluge—the massive growth of digital information—are what make Hadoop software ideal for big data
analytics. Hadoop is purpose-built for analyzing a variety of structured and unstructured data, but its
biggest advantage is its ability to cost-effectively analyze an unprecedented volume of data on clusters of
commodity servers.
While Hadoop is built around the ability to linearly scale and distribute MapReduce jobs across a cluster,
there is now a more cost-effective option for scaling performance in Hadoop clusters: high-performance
read/write PCIe flash cache acceleration cards.
Scaling Hadoop Performance: A Historical Perspective
The closer the data is to the processor, the lower the latency and the better the performance. This
fundamental principle of data proximity has guided the Hadoop architecture, and it is the main
reason for Hadoop’s success as a high-performance big data analytics solution.
8
To keep the data close to the processor, Hadoop uses servers with direct-attached storage (DAS). And to
get the data even closer to the processor, the servers are usually equipped with significant amounts of
random access memory (RAM).
Small portions of a MapReduce job are distributed across multiple nodes in a cluster for processing in
parallel, giving Hadoop its linear scalability. Depending on the nature of the MapReduce jobs, bottlenecks
can form either in the network or in the individual server nodes. These bottlenecks can often be eliminated
by adding more servers, more processor cores, or more RAM.
With MapReduce jobs, a server’s maximum performance is usually determined by its maximum RAM
capacity. This is particularly true during the Reduce phase, when intermediate data shuffles, sorts and
merges exceed the server RAM size, forcing the processing to be performed with input/output (I/O) to
hard disk drives (HDDs).
As the need for I/O to disk increases, performance degrades considerably. Slow storage I/O is rooted in
the mechanics of traditional HDDs, and this increased latency of I/O to disk imposes a severe performance
penalty.
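The spill behavior described above is essentially an external merge sort: sort what fits in RAM, write each sorted run out, then stream-merge the runs, with every spill and merge pass paying disk latency. A simplified sketch, in which in-memory lists stand in for the on-disk run files a real shuffle would write:

```python
import heapq
import itertools

def external_sort(records, memory_limit):
    """Sort an iterable when only `memory_limit` records fit in RAM:
    sort each chunk that fits, 'spill' it as a sorted run (to disk in a
    real system), then stream-merge the runs back together."""
    runs = []
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, memory_limit))
        if not chunk:
            break
        runs.append(sorted(chunk))    # each run would be written to an HDD
    return list(heapq.merge(*runs))   # merging reads every run back from disk

print(external_sort([5, 3, 8, 1, 9, 2, 7], memory_limit=3))  # prints: [1, 2, 3, 5, 7, 8, 9]
```

Each record is written and re-read at least once, which is why Reduce-phase performance collapses to HDD speed once the working set exceeds RAM.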
One cost-effective way to break through the disk I/O bottleneck and further scale the performance of
the Hadoop cluster is to use solid state flash memory for caching.
Scaling Hadoop Performance with Flash Caching
Data has been cached from slower to faster media since the advent of the mainframe computer, and it
remains an essential function in every computer today. The enduring and widespread use of caching
demonstrates its ability to deliver substantial and cost-effective performance improvements.
9
When a server is equipped with its full complement of RAM and that
memory is fully utilized by applications, the only way to increase caching capacity is to add a different
type of memory. One option is NAND flash memory, which is up to 200 times faster than a high-
performance HDD.
A new class of server-side PCIe flash solution uniquely integrates onboard flash memory with Serial-
Attached SCSI (SAS) interfaces to create high-performance DAS configurations consisting of a mix of solid
state and hard disk drive storage, coupling the performance benefits of flash with the capacity and cost
advantages of HDDs.
Testing Cluster Performance With and Without Flash Caching
To compare cluster performance with and without flash caching, LSI used the widely accepted TeraSort
benchmark. TeraSort tests performance in applications that sort large numbers of 100-byte records, which
requires a considerable amount of computation, networking and storage I/O—all characteristics of real-
world Hadoop workloads.
LSI used an eight-node cluster for its 100-gigabyte (GB) TeraSort test. Each server was equipped with 12
CPU cores, 64 GB of RAM and eight 1-terabyte HDDs, as well as an LSI® Nytro MegaRAID 8100-4i
acceleration card combining 100GB of onboard flash memory with intelligent caching software and LSI
dual-core RAID-on-Chip (ROC) technology. The acceleration card’s onboard flash memory was deactivated
for the test without caching.
No software change was required because the flash caching is transparent to the server applications,
operating system, file subsystem and device drivers. Notably, RAID (Redundant Array of Independent
Disks) storage is not normally used in Hadoop clusters because of the way the Hadoop Distributed File
10
System replicates data among nodes. So while the RAID capability of the Nytro MegaRAID acceleration
card would not be used in all Hadoop clusters, this feature adds little to the overall cost of the card.
LSI internal testing with flash caching activated found that the TeraSort test consistently completed
approximately 33 percent faster. This performance improvement from caching scales in proportion to the
size of the cluster needed to complete a specific MapReduce or other job within a required run time.
LSI Nytro MegaRAID card using the TeraSort benchmark completed Hadoop jobs 33 percent
faster (LSI internal test; individual results may vary).
Saving Cash with Cache
Based on results from the internal LSI TeraSort benchmark performance test, the table below compares
the estimated total cost of ownership (TCO) of two cluster configurations—one with and one without flash
caching—that are both capable of completing the same job in the same amount of time.
11
                                        Without Caching   With Caching
Number of Servers                       1,000             750
Servers (MSRP of $6,280)                $6,280,000        $4,710,000
Nytro MegaRAID Cards (MSRP of $1,799)   $0                $1,349,250
Total Hardware Costs                    $6,280,000        $6,059,250
Costs for Rack Space, Power, Cooling
and Administration Over 3 Years *       $19,610,000       $14,707,500
3-Year Total Cost of Ownership          $25,890,000       $20,766,750

* Cost computed using data from the Uptime Institute, an independent division of The 451 Group
The tests showed that in certain circumstances, using fewer servers to accommodate the same processing
time requirement can reduce TCO by up to 20 percent, or $5.1 million, over three years.
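The published totals can be reproduced with a short calculation; note that the per-server three-year opex figure below is back-computed from the table rather than published separately:

```python
def three_year_tco(servers, server_price, card_price, opex_per_server_3yr):
    """3-year TCO = hardware (servers + caching cards) + 3-year opex."""
    hardware = servers * (server_price + card_price)
    return hardware + servers * opex_per_server_3yr

# Per-server 3-year opex implied by the table ($19,610,000 / 1,000 servers).
OPEX_PER_SERVER = 19_610_000 / 1000

without_cache = three_year_tco(1000, 6280, 0, OPEX_PER_SERVER)
with_cache = three_year_tco(750, 6280, 1799, OPEX_PER_SERVER)
savings = without_cache - with_cache

print(f"${savings:,.0f} saved ({savings / without_cache:.0%})")  # prints: $5,123,250 saved (20%)
```

The savings come overwhelmingly from the 250 eliminated servers' rack space, power, cooling and administration, which more than offset the cost of the 750 caching cards.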
Conclusion
Organizations using big data analytics now have another option for scaling performance: PCIe flash cache
acceleration cards. While these tests centered on Hadoop clusters, LSI’s extensive internal testing with
various databases and other popular applications consistently demonstrates performance improvement
gains ranging from a factor of three (for DAS configurations) to a factor of 30 (for SAN and NAS
configurations).
12
Big data is only as useful as the analytics that organizations use to unlock its full value, making Hadoop a
powerful tool for analyzing data to gain deeper insights in science, research, government and business.
Servers need to be smarter and more efficient, and flash caching enables fewer servers (with fewer
software licenses) to perform more work, more cost-effectively, for data sets large and small: a great
option for IT managers working to do more with less under the growing pressure of the data deluge.
About the Author
Kimberly Leyenaar is a Principal Big Data Engineer and Solution Technologist for LSI’s Accelerated Storage
Division. An Electrical Engineering graduate from the University of Central Florida, she has been a storage
performance engineer and architect for over 14 years. At LSI, she now focuses on discovering innovative
ways to solve the challenges surrounding Big Data applications and architectures.
13
Reality Check: The role of ‘smart silicon’ in mobile
networks
By Greg Huff, SVP and CTO, LSI
Editor’s Note: Welcome to our weekly Reality Check column, where we invite C-level executives and
advisory firms from across the mobile industry to provide their unique insights into the marketplace.
What does “smart silicon” (specialized integrated circuits with both general-purpose and function-specific
processors) have to do with next-generation mobile services? Plenty, especially as the number of
bandwidth-hungry devices and applications continues to grow unabated. To accommodate the
accompanying data deluge, base station throughput will need to increase by more than an order of
magnitude from 300 megabits per second in 3G networks to 5 gigabits per second in LTE networks. LTE-
Advanced technology will require base station throughput to double again to 10 Gbps.
Several related changes are also having an impact on base stations. Next-generation access networks are
using more and smaller cells to deliver the higher data rates reliably. Multiple radios are being employed
in cloud-like distributed antenna systems. Network topologies are flattening. Content is being cached at
the edge to conserve backhaul bandwidth. Operators are offering advanced quality of service and location-
based services, and are moving to application-aware billing.
These changes are motivating mobile network operators to seek more intelligent and more cost-effective
ways to keep pace with the data deluge, and this is where smart silicon can help. General-purpose
processors are simply too slow for base station functions that must operate deep inside every packet in
real-time, such as packet classification, digital signal processing, transcoding, encryption/decryption and
traffic management.
For this reason, packet-level processing functions are increasingly being performed in hardware to
improve performance, and these hardware accelerators are now being integrated with multicore
processors in specialized system-on-chip communications processors. The number of function-specific
acceleration engines available also continues to grow, and more engines (along with more processor
cores) can now be placed on a single SoC. With current technology, it is even possible to integrate an
equipment vendor’s unique intellectual property into a custom SoC for use in a proprietary system. In
many cases, these advances now make it possible to replace multiple SoCs with a single SoC in base
stations.
14
In addition to delivering higher throughput, SoCs reduce the total cost of the system, resulting in a
significant improvement in its price/performance, while the inclusion of multiple acceleration engines
makes it easier to satisfy end-to-end QoS and service-level agreement requirements. An equally important
consideration in mobile network infrastructures is power consumption, and here too the SoC has a distinct
advantage with its ability to replace multiple discrete components with a single, energy-efficient integrated
circuit.
Another challenge involves the way hardware acceleration is implemented in some SoCs. The problem is
caused by the workflow within the SoC when packets must pass through several hardware acceleration
engines, as is the case for many services and applications. If traffic flows must be handled by a general-
purpose processor core whenever traversing a different acceleration engine, undesirable latency and jitter
(variability in latency) will both increase, potentially significantly.
Some next-generation SoCs address this issue by supporting configurable pipelines capable of processing
packets deterministically. Each separate service-oriented pipeline creates a message-passing control path
that enables system designers to specify different packet-processing flows that utilize different
combinations of acceleration engines. Such granular traffic management enables any service to process
any traffic flow directly through any engines required and in any sequence desired without intervention
from a general-purpose processor, thereby minimizing latency and assuring that even the strictest QoS
and SLA guarantees can be met.
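A configurable, service-oriented pipeline of this kind can be sketched as follows; the engine names, flow definitions and dispatch mechanism are purely illustrative, not any particular SoC's interface:

```python
# Hypothetical acceleration engines: each transforms a packet (a dict) and
# passes it on. In silicon these are fixed-function hardware blocks.
ENGINES = {
    "classify":  lambda pkt: {**pkt, "flow": hash(pkt["dst"]) % 4},
    "decrypt":   lambda pkt: {**pkt, "payload": pkt["payload"].lower()},
    "transcode": lambda pkt: {**pkt, "codec": "amr-wb"},
    "shape":     lambda pkt: {**pkt, "queued": True},
}

# Each service configures its own pipeline: its packets traverse exactly
# these engines, in this order, with no general-purpose core in the path.
PIPELINES = {
    "voice": ["classify", "decrypt", "transcode", "shape"],
    "web":   ["classify", "decrypt", "shape"],
}

def process(pkt, service):
    """Run a packet through the engine sequence configured for its service."""
    for engine in PIPELINES[service]:
        pkt = ENGINES[engine](pkt)
    return pkt

out = process({"dst": "10.0.0.1", "payload": "HELLO"}, "web")
print(out["queued"], out["payload"])  # prints: True hello
```

Because each pipeline is fixed at configuration time, every packet of a given service takes a deterministic path with predictable latency, which is what keeps jitter bounded.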
Without these advances in integrated circuits, it would be virtually impossible for mobile operators to keep
pace with the data deluge. So what does “smart silicon” have to do with next-generation mobile services,
especially when it comes to reducing cost while improving overall system performance? Everything.
Greg Huff is SVP and CTO for LSI. In this capacity, he is responsible for shaping the future growth strategy
of LSI products within the storage and networking markets. Huff joined the company in May 2011 from
Hewlett-Packard, where he was VP and CTO of the company’s Industry Standard Server business. In that
position, he was responsible for the technical strategy of HP’s ProLiant servers, BladeSystem family
products and its infrastructure software business. Prior to that, he served as research and development
director for the HP Superdome product family. Huff earned a bachelor’s degree in Electrical Engineering
from Texas A&M University and an MBA from the Cox School of Business at Southern Methodist University.
15
The Revival of Direct Attached Storage for Oracle Databases
By Tony Afshary
Storage area networks (SANs) and network-attached storage (NAS) owe their popularity to some
compelling advantages in scalability, utilization and data management. But achieving high performance
for some applications with a SAN or NAS can come at a premium price. In those database applications
where performance is critical, direct-attached storage (DAS) offers a cost-effective high-performance
solution. This is true for both dedicated and virtualized servers, and derives from the way high-speed
flash memory storage options can be integrated seamlessly into a DAS configuration.
Revival of DAS in the IT Infrastructure
Storage subsystems and their capacities have changed significantly since the turn of the millennium,
and these advancements have caused a revival of DAS in both small and medium businesses and large
enterprises. To support this trend, vendors have added support for DAS to their existing product lines
and introduced new DAS-based solutions. Some of these new solutions combine DAS with solid state
storage, RAID data protection and intelligent caching technology that continuously places “hot” data in
the onboard flash cache to accelerate performance.
Why the renewed interest in DAS now, after so many organizations have implemented SAN and/or
NAS? There are three reasons. The first is performance: DAS outperforms all forms
of networked storage owing to its substantially lower latency. The second is cost savings from
minimizing the need to purchase and administer SAN or NAS storage systems and the host bus
adapters (HBAs) required to access them. The third is ease of use: implementing and managing
DAS is simple compared to the other storage architectures. This is particularly true for Oracle
database applications.
The Evolution of DAS
DAS technology has evolved considerably over the years. For example, Serial-Attached SCSI (SAS)
expanders and switches enable database administrators (DBAs) to create very large DAS configurations
capable of containing hundreds of drives, while support for both SAS and SATA enables DBAs to deploy
those drives in tiers. And new management tools, including both graphical user and command line
interfaces, have dramatically simplified DAS administration.
While networked storage continues to have an advantage in resource utilization compared to DAS, the
cost of unused spindles today is easily offset by the substantial performance gains DAS delivers for
applications running software with costly per-server licenses. In fact, having some unused spindles on a
database server offers the ability to “tune” the storage system as needed.
A DBA could, for example, use the spare spindles to either isolate certain database objects for better
performance, or allocate them to an existing RAID LUN. When using only HDDs for a database that
requires high throughput in I/O operations per second (IOPS), allocating database objects over more
spindles increases database performance. Allocating more spindles for performance rather than for
capacity is referred to as “short stroking.” With a smaller number of tracks containing data,
repositioning of the drive’s head is minimized, thereby reducing latency and increasing IOPS.
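The effect of short stroking can be approximated with simple service-time arithmetic; the drive parameters below are illustrative, not measured values:

```python
def array_iops(spindles, seek_ms, rotation_ms):
    """Aggregate random-read IOPS for an HDD array: per-I/O service time is
    average seek plus half a rotation; IOPS scale with spindle count."""
    service_time_ms = seek_ms + rotation_ms / 2
    return spindles * 1000 / service_time_ms

# Full-stroke drives: ~8 ms average seek; 15k RPM rotation = 4 ms.
full_stroke = array_iops(spindles=8, seek_ms=8.0, rotation_ms=4.0)

# Short-stroked: confining data to a narrow band of tracks cuts the average
# seek to ~2 ms, and spreading the load over more spindles multiplies IOPS.
short_stroke = array_iops(spindles=16, seek_ms=2.0, rotation_ms=4.0)

print(f"{full_stroke:.0f} vs {short_stroke:.0f} IOPS")  # prints: 800 vs 4000 IOPS
```

The capacity on the unused inner tracks is deliberately sacrificed, which is the trade-off the paragraph above describes: spindles allocated for performance rather than capacity.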
As is often the case in data centers, ongoing operational expenditures, especially for management,
eclipse the capital expenditure involved. Such is the case for SAN and NAS, which require a storage
administrator. No such ongoing OpEx is incurred with DAS, especially when using Oracle’s Automatic
Storage Management (ASM) system. And with the need for costly HBAs, switches and other
16
infrastructure in SAN/NAS environments, DAS often affords a lower CapEx today, as well, particularly in
database applications.
Today’s Oracle DBA
Being an Oracle DBA today is quite different compared to even just a few years ago. As organizations
strive to do more with less, Oracle has been teaming with partners to provide the tools and functionality
DBAs need to be more productive while enhancing performance. Consider just one example of how
much a DBA’s responsibilities have changed: improving performance by minimizing I/O waits, or the
percentage of time processors are waiting on disk I/O.
To increase storage performance by minimizing I/O waits in a typical database using exclusively HDDs,
a DBA might need to take one or more of the following actions:
• Isolate “hot” datafiles to cold disks or, if the storage device is highly utilized, move datafiles to
other spindles to even out the disk load.
• Rebuild the storage in a different RAID configuration, such as from RAID 5 to RAID 10, to increase
performance.
• Add more “short stroked” disks to the array to get more IOPS.
• Increase the buffer space in the System Global Area (SGA) and/or make use of the different caches inside
the SGA to fine-tune how data is accessed.
• Move “hot” data to a higher-performance storage tier, such as HDDs with faster spindles or solid
state drives (SSDs).
• Minimize or eliminate fragmentation in tables and index tablespaces.
Note that many of these actions require the DBA to evaluate the database continuously to
determine what currently constitutes “hot” data, and to make constant adjustments to optimize
performance. Some also require scheduling downtime to implement and test the changes being made.
An alternative to such constant and labor-intensive fine-tuning is the use of server-side flash storage
solutions that plug directly into the PCIe bus and integrate intelligent caching with support for DAS in
RAID configurations. Intelligent caching software automatically and transparently moves “hot” data—
that which is being accessed the most frequently—from the attached DAS HDDs to fast, on-board NAND
flash memory, thereby significantly decreasing the latency of future reads.
Testing of Flash Cache for DAS
Extensive evaluation of server-side flash-based application acceleration solutions under different
scenarios to assess improvements in IOPS, transactions per second, application response times and
other performance metrics, reveals that for I/O-intensive database applications, moving data closer to
the CPU delivers improvements in performance ranging from a factor of three to an astonishing factor of
100 in some cases.
In all test scenarios, the use of server-side flash caching consistently delivered superior
performance. The reduction in application response time ranged from 5X to 10X with no fine-tuning of
the configuration. When the database was tuned for both the HDD-only and flash cache configurations,
response times were reduced by nearly 30X from 710 milliseconds (HDD-only) to 25 milliseconds with
the use of cache.
These results demonstrate that while tuning efforts are effective, they are substantially more effective
with the use of flash cache. And even without tuning, flash cache is able to reduce response times by
up to an order of magnitude.
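The “nearly 30X” figure follows directly from the reported response times:

```python
hdd_only_ms = 710   # tuned HDD-only response time from the text
cached_ms = 25      # tuned response time with flash cache

speedup = hdd_only_ms / cached_ms
print(f"{speedup:.1f}x")   # 28.4x, i.e. "nearly 30X"
```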
Superior Performance with DAS
The use of direct-attached storage has once again become the preferred option for Oracle databases for
a variety of reasons. Not only does DAS deliver superior performance in database servers to get the
most from costly software licenses, it is also easier to administer, especially when using Oracle’s
Automatic Storage Management system. Some solutions also now enable DAS to be shared by multiple
servers.
Even better performance and cost efficiency can be achieved by complementing DAS with intelligent
server-side flash cache acceleration cards that minimize I/O latency and maximize IOPS. In addition, by
allowing infrequently accessed data to remain on HDD storage, organizations can deploy an economical
mix of high-performance flash and high-capacity hard-disk storage to optimize both the cost per IOPS
and the cost per gigabyte of storage.
Server-side flash caching solutions can also be used in SAN environments to improve
performance. Such tests have revealed both significant reductions in response times and dramatic
increases in transaction throughput. So whether using DAS or SAN, the combination of server-side flash
and intelligent caching has proven to be a cost-effective way to maximize performance and efficiency
from the storage subsystem.
About the author:
Tony Afshary is director of marketing, Accelerated Solutions Division, LSI, which designs semiconductors
and software that accelerate storage and networking in datacenters, mobile networks and client
computing.
Achieving Low Latency: The Fall and Rise of
Storage Caching
Tony Afshary
The caching of content from disk storage to high-speed memory is a proven technology for reducing read
latency and improving application-level performance. The problem with traditional caching, however, is
one of scale: Random access memory, typically used for caches, is limited to Gigabytes, while hard disk
drive-based storage exists on the order of Terabytes. The three orders of magnitude difference in scale
puts a practical limit on the potential performance gains. Flash memory has now made caching beneficial
again owing to its combination of low latency (on a par with memory) and high capacity (on a par with
hard disk drives).
A Brief History of Cache
The caching of data from slower media to faster ones has existed since the days of
mainframe computing, and quickly made its debut on PCs shortly after they entered
the market. Caching also exists at multiple levels and in different locations—from the
L1 and L2 cache built into processors to the dynamic RAM (DRAM) caching in the
controllers used with storage area networks (SANs) and network-attached storage
(NAS).
The long, widespread use of caching is a testament to its benefit: dramatically improving performance in a
transparent and cost-effective manner. For example, PCs constantly cache data from the hard disk drive
(HDD) to main memory to improve input/output (I/O) throughput. I/O to main memory takes about 100
nanoseconds, while I/O to a fast-spinning HDD takes around 10 milliseconds—a difference of five orders of
magnitude.
In this example, the cache works by moving the data and/or software currently being accessed (the
so-called “hot data”) from the HDD to main memory. The operating system’s file subsystem makes these
movements constantly and automatically using algorithms that detect hot data to improve the “hit rate” of
the cache. With such behind-the-scenes transparency, the only thing a user should ever notice is an
improvement in performance after adding more DRAM.
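The leverage that hit rate provides can be seen in the standard average-access-time formula, using the illustrative latencies above:

```python
def effective_latency(hit_rate, cache_ns=100, hdd_ns=10_000_000):
    """Average access time in nanoseconds for a given cache hit rate,
    using the article's illustrative latencies (100 ns DRAM, 10 ms HDD)."""
    return hit_rate * cache_ns + (1 - hit_rate) * hdd_ns

# Misses dominate: even at a 90% hit rate the average is ~1 ms.
# Pushing the hit rate toward 100% is exactly what a larger (flash)
# cache buys, which is why cache capacity matters so much.
print(f"{effective_latency(0.90) / 1000:.0f} us average")
print(f"{effective_latency(0.99) / 1000:.0f} us average")
```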
The data deluge impacting today’s datacenters, however, is causing traditional DRAM-based caching to
become less effective. The reason is that the amount of memory possible in a server or a caching
appliance is only a small fraction of the capacity of even a single disk drive. Because datacenters now
store multiple Terabytes or even Petabytes of data, and I/O rates are increasing with more applications
being run on virtualized servers, the performance gains from traditional forms of caching are becoming
increasingly insufficient.
Fortunately, there is now a solution for overcoming the limitation imposed by traditional DRAM-based
caching: flash memory.
Figure 1: Flash memory fills the void in both latency
and capacity between main memory and fast-spinning
hard disk drives.
Cache in a Flash
As shown in Figure 1, flash memory breaks through DRAM’s cache size limitation to again make
caching a highly effective and cost-effective means for accelerating application-level performance. Another
important advantage over DRAM is that flash memory is non-volatile, enabling it to retain stored
information even when not powered.
NAND flash memory-based storage solutions typically deliver the highest performance gains when the
flash cache is placed directly in the server on the high-performance Peripheral Component Interconnect
Express® (PCIe) bus. Even though flash memory has a higher latency than DRAM, PCIe-based flash cache
adapters deliver superior performance for two reasons. The first is the significantly higher capacity of flash
memory, which substantially increases the hit rate of the cache. Indeed, with some flash adapters now
supporting multiple Terabytes of solid state storage, there is often sufficient capacity to store entire
databases or other datasets as hot data.
The second reason involves the location of the flash cache: directly in the server on the PCIe bus. With no
external connections and no intervening network to a SAN or NAS (that is also subject to frequent
congestion), the hot data is accessible in a flash (pun intended).
Intelligent caching software running on the host server detects hot data blocks and caches these to the
flash cache. As shown in Figure 2, the caching software is located between the file system and the storage
device drivers. Direct-attached storage (DAS) and SAN use existing drivers; the flash cache card has a
Memory Pipeline Technology (MPT) driver. As hot data “cools,” the caching software automatically replaces
it with hotter data.
Figure 2: The intelligent caching software operates
between the server’s file system and the device
drivers to provide transparency to the
applications.
The intelligent caching software normally gives the highest priority to highly random, small-block I/O
applications, such as those for databases and online transaction processing (OLTP), as these
stand to benefit the most. The software detects hot data by monitoring I/O activity to find the specific
ranges of logical block addresses (LBAs) that are experiencing the most reads and/or writes, and moves
these into the cache.
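A much-simplified version of that LBA-range heat tracking, with a hypothetical 1 MB region size (real products use their own granularities and decay policies):

```python
from collections import Counter

REGION_BLOCKS = 2048   # hypothetical: 1 MB regions of 512-byte LBAs

heat = Counter()

def record_io(lba):
    """Attribute each I/O to its LBA region; the hottest regions become
    candidates for promotion into the flash cache."""
    heat[lba // REGION_BLOCKS] += 1

# random OLTP-style I/O clustering in the first region
for lba in [10, 500, 2100, 12, 700, 30, 2060]:
    record_io(lba)

hottest = [region for region, _ in heat.most_common(1)]
print(hottest)   # [0]: region 0 saw the most reads/writes
```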
By contrast, because applications with sequential read and/or write operations benefit very little from
caching, these are given a low priority. The reason is that arrays of 6 Gigabit/second (Gb/s) Serial-Attached
SCSI (SAS) and Serial ATA (SATA) HDDs can already achieve a satisfactory aggregate throughput of up to
3000 Megabytes/second (MB/s), and roughly double that with 12 Gb/s SAS.
Most PCIe flash adapters contain at least two SSD modules to support RAID (Redundant Array of
Independent Disks) configurations. In the unprotected RAID 0 mode, data is striped across both SSD
modules, creating a larger cache. In the protected RAID 1 mode, data is mirrored across the SSD modules
so that in the event one fails, the other has a complete copy.
Any data written to the flash cache must also be written to primary DAS or SAN storage, and there are
two ways this can occur. In Write Through mode, any data written to flash is simultaneously written to
primary storage. Because most applications will wait for confirmation that a write has been completed
before proceeding, this increases I/O latency. In Write Back mode data is written only to an SSD, or when
using mirroring, both SSDs, allowing write operations to be completed substantially faster. All writes are
then persisted to primary storage when the data cools and is replaced in the cache. Write Through mode
can safely use a RAID 0 configuration of the flash cache; Write Back mode should employ a RAID 1
configuration for adequate data protection.
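The two write modes can be contrasted in a small model; the latencies are illustrative only:

```python
FLASH_US, HDD_US = 50, 10_000   # illustrative write latencies in microseconds

def write(mode):
    """Latency the application observes for one acknowledged write."""
    if mode == "write-through":
        # data must reach flash AND primary storage before the ack
        return max(FLASH_US, HDD_US)
    if mode == "write-back":
        # ack as soon as the (mirrored) flash copy is committed;
        # primary storage is updated later, when the data cools
        return FLASH_US
    raise ValueError(mode)

print(write("write-through"))   # 10000: gated by the HDD
print(write("write-back"))      # 50: gated by flash alone
```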
Benchmark Test Results
LSI® has conducted extensive testing of application acceleration solutions under different scenarios to
assess improvements in I/O operations per second (IOPS), transactions per second, user response times
and other performance metrics. For I/O-intensive applications, these tests reveal improvements in
performance ranging from a factor of 3x to an astonishing factor of 100x. Reported here are the results of
one such test.
This particular test evaluates both the response times and transactional throughput of a MySQL OLTP
application using the SysBench system performance benchmark. The basic configuration is a dedicated
server with DAS consisting entirely of HDDs. The flash cache is a 100 Gigabyte Nytro™ MegaRAID®
8100-4i PCIe adapter with the Nytro XD intelligent caching software running in the host. Four different flash
cache configurations are used based on a combination of write modes (Write Through or Write Back) and
RAID levels (0 or 1).
Figure 3: Response times (in milliseconds) were
reduced by 65 percent using the flash cache in
Write Back mode with RAID 1 protection.
The “No SSD” results shown in Figures 3 and 4 are for the baseline configuration using HDDs with no flash
cache. In Write Through (WT) mode, all write operations are made directly to the HDDs, which limits the
performance gains to only about 20 percent. In Write Back (WB) mode, writes are made to the flash
cache, resulting in a response time improvement of up to 80 percent, as shown in Figure 3. But because
data protection is prudent with WB mode (as no protection would require using transaction logs to recover
from an SSD failure), a more realistic improvement would be 65 percent for the flash cache configured
with RAID 1 protection.
Figure 4: Transactions per second increased by a
factor of 3 using the flash cache in Write Back
mode with RAID 1 protection.
As with response times, transactions per second (TPS) throughput rates improve dramatically when the
flash cache is used for both reads and writes. And for some applications, the benefit of the 5-times
improvement in TPS shown in Figure 4 might outweigh the exposure from a lack of data protection,
particularly given the high reliability of flash memory. But even with RAID 1 protection, TPS throughput
increases by a factor of 3 over the “No SSD” configuration.
These tests show that even a relatively modest amount of flash cache (100 Gigabytes) can deliver
meaningful performance gains. Tests with 800 Gigabytes of flash reveal an improvement of up to 30 times
in SAN environments for some applications.
Conclusion
The size of a cache relative to the size of the data store is a key determining factor in its ability to improve
performance. This is the reason DRAM-based caches, limited to Gigabytes of capacity, have become less
effective under the growing data deluge. With SSDs and PCIe flash adapters now supporting Terabytes of
capacity, the size of the cache becomes considerably greater relative to the data store, which makes
caching proportionally more effective.
Another determining factor is the nature of the target application. I/O-intensive applications that involve
random read/write access stand to benefit substantially, while those accessing data sequentially,
especially in large blocks, stand to benefit little, if at all.
The final determining factor is the caching software’s ability to maximize the hit rate by accurately
identifying the hot spots in the data, as these are constantly changing for applications with random I/O
operations. Most do a fairly effective job, and the larger flash cache capacity now makes this a less critical
factor.
Although a flash cache inevitably offers at least some improvement in performance, the extent of the gain
might not be cost-justifiable. Fortunately there are free tools available that can predict the performance
gains possible on a per-application basis. These tools employ intelligent caching algorithms, similar to
what is actually used in the cache, to evaluate access patterns and provide an estimate of the likely
improvement in performance.
The opportunity to achieve substantial gains, combined with the ability to quantify the potential benefit in
advance of making any investment, make flash caching solutions an option worthy of serious
consideration in virtually any datacenter today.
About the Author
Tony Afshary is the Business Line Director for Nytro Solutions Products at LSI’s
Accelerated Solutions Division. In this role, he is responsible for Product
Management & Product Marketing for LSI's Nytro Family of enterprise flash based
storage, including PCIe based Flash, utilizing seamless and intelligent placement of
data to accelerate data-center applications.
Addressing the data deluge challenge in mobile
networks with intelligent content caching
Seong Hwan Kim, Ph.D., Technical Marketing Manager, LSI
The most recent IDC Predictions 2013: Competing on the 3rd Platform report forecasts the biggest driver
of IT growth to once again be mobile devices (smartphones, tablets, e-readers, etc.), generating around
20 percent of all IT purchases and accounting for more than 50 percent of all IT market growth. Mobile
devices continue to provide the ubiquitous and constant Internet access that is creating massive amounts
of multimedia traffic, with video remaining the dominant component in this data deluge.
Mobile networks are struggling to satiate the seemingly unquenchable thirst from more users for faster
access to more and more digital content. This dynamic is creating a “data deluge gap”—a disparity
between network capacity and growing demand. Competitive pressures prevent mobile operators from
being able to make the capital investment required to close this widening gap with brute force bandwidth,
making it necessary to explore new ways of providing services more intelligently and cost-effectively.
This article explores one such technique: intelligent content caching to improve overall throughput by
minimizing traffic flows end-to-end in mobile networks.
Meeting user expectations
Before exploring content caching, it is instructive to understand the user expectations driving the data
deluge gap. A recent study (reported in an Open Networking Summit presentation titled OpenRadio:
Software Defined Wireless Infrastructure) found that it takes around 7-20 seconds to load a full Web page
over mobile networks. On a corporate LAN or home broadband network, Web pages typically take 6
seconds or less to load. This diverging user experience adds to the perception that mobile networks are
too slow.
Meeting user expectations will become even more challenging as the amount of video and multimedia
traffic increases. Cisco’s Visual Networking Index forecasts video will constitute more than 70 percent of
all network traffic in the near future. Accommodating this explosive growth, particularly during periods of
peak activity, will require both more bandwidth and more intelligent use of that bandwidth from the access
to the core in mobile networks.
New mobile network management solutions will need to go beyond Quality of Service (QoS) and other
traditional traffic management provisions, however. The reason is: while QoS can prioritize traffic flows, it
can do nothing to minimize them. So as mobile networks become increasingly like content delivery
networks, it will be necessary to operate them as such. And one proven technique for minimizing the
amount of traffic end-to-end in content delivery networks is caching.
Intelligent content caching in mobile networks
Intelligent content caching is a cost-effective way to improve the Quality of Experience (QoE) for mobile
users. The fundamental idea of intelligent caching is to store popular content as close as possible to the
users, thereby making it more readily available while simultaneously minimizing backhaul traffic.
Content caching employs a geographically distributed, layered architecture, as shown in Figure 1.
There are two layers of caching established by location: one is at the edge or access portion of the
network; the other is more centralized toward the core of the network. Such a model is defined as
hierarchical caching.
Figure 1. Hierarchical caching architecture
While it can be financially justified by the cost savings, caching at Layer 1, at the edge of the network on
eNodeB or Radio Network Controller platforms, requires a higher initial investment owing to the high
number of access nodes involved. With far fewer nodes in the core, such as a gateway node and/or a
central datacenter, caching at Layer 2 requires a relatively low initial investment.
In a hierarchical caching architecture, content is cached concurrently in both layers to compound the
bandwidth savings. Numerous industry studies have shown that caching at Layer 2 can reduce traffic from
the mobile network core to the Internet by more than 30 percent. Caching at Layer 1 can reduce backhaul
traffic from the radio area network (RAN) to the core by 30 percent or more depending on the cache hit
rate, as recently reported in a Light Reading Webinar titled Extensibility: The Key to Maximizing Caching
Investments.
How intelligent content caching works
The bandwidth-reducing benefit of caching increases as the “hit rate” increases, which it inevitably does
with popular content, such as video going viral or a breaking news story. Figure 2 shows two different
data paths: a “cold” path for the first time content is accessed by any user; and a “hot” path for
subsequent access from cache by other users. This particular configuration employs an intelligent
communications processor to offload the CPU for better performance, and a “flash cache” card with solid
state memory. Not shown is the coordination of cached content between Layers 1 and 2.
Figure 2. Hot and cold data paths
In the cold data path, from a user’s perspective, if the deep packet inspection (DPI) engine fails to match
the request to an entry in the cache content table (that is, the content is not already cached), the
processor’s classification engine passes the request to the uplink Ethernet connection to be fetched from
an upstream source, either the Layer 2 cache or the target site on the Internet. If the content is coming
from the Internet and each cache has available capacity, the content will be placed in both Layer 1 and
Layer 2 cache while it is being delivered to the user. Intelligent algorithms are used to continuously
determine which content should be cached based on a combination of recency, popularity and other
factors.
Again from a user’s perspective, but this time from a different user, the DPI engine checks to see if the
content requested has been cached locally. If it is found in the cache content table, the processor’s
classification engine sends the request to the local, Layer 1 cache. All subsequent requests from this
particular user for this particular content are recognized directly by the classification engine and do not,
therefore, require any further involvement from the DPI engine.
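The cold and hot paths can be sketched as a two-layer lookup (hypothetical structure; a real implementation keys on DPI-extracted content identifiers and bounds each cache):

```python
layer1, layer2 = {}, {}   # edge (Layer 1) and core (Layer 2) caches,
                          # unbounded here for simplicity

def fetch(content_id, origin):
    """Serve from the nearest cache; on a cold path, populate both layers."""
    if content_id in layer1:
        return layer1[content_id], "hot:layer1"
    if content_id in layer2:
        layer1[content_id] = layer2[content_id]   # warm the edge cache
        return layer1[content_id], "hot:layer2"
    data = origin(content_id)                     # cold path to the Internet
    layer1[content_id] = layer2[content_id] = data
    return data, "cold"

origin = lambda cid: f"content-{cid}"
print(fetch("abc", origin)[1])   # cold
print(fetch("abc", origin)[1])   # hot:layer1
```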
While many of the content caching solutions available today utilize x86 or other general-purpose CPUs to
perform traffic inspection, this approach is not well suited for a Layer 1 cache where there are
requirements for low power consumption and low cost. Offloading the CPU with an intelligent
communications processor equipped with purpose-built acceleration engines, as depicted in Figure 2, can
yield up to a 5-times improvement in performance.
The problem with using general-purpose CPUs for packet-level processing is that critical, real-time tasks
like traffic inspection are often performed only at the port level. Because many applications use HTTP as a
transport layer, the lack of deep understanding of the specific applications in the network traffic flows
hinders efficient content management. So while a general-purpose CPU programming model makes
software development easier, it can result in CPU resources being overwhelmed and poor
performance/watt/cost.
By contrast, the hardware acceleration engines in purpose-built System on Chip (SoC) communications
processors provide much deeper application-level awareness in real-time, which is critical in broadband 3G
and 4G mobile networks. The SoC design also provides superior throughput performance while consuming
less power.
The use of solid state storage in purpose-built, small form factor flash cache acceleration cards similarly
maximizes performance with minimal power consumption compared to caching in memory or on hard disk
drives. A Vodafone “Typical Data Usage” chart shows that a 4-minute YouTube video is about 11 MB of
content, for example, while the video streaming of a 30-minute TV episode represents about 90 MB of
data. A flash cache acceleration card with 512 GB of capacity would, therefore, be capable of storing about
50,000 of these video clips or about 6,000 of the half-hour streaming videos.
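Those capacity estimates are simple division on the figures just cited:

```python
capacity_mb = 512 * 1024         # 512 GB card, in MB
clip_mb, episode_mb = 11, 90     # figures from the Vodafone chart

print(capacity_mb // clip_mb)    # 47662 clips (the article rounds to "about 50,000")
print(capacity_mb // episode_mb) # 5825 episodes ("about 6,000")
```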
Conclusion
Intelligent content caching affords three major benefits that together help close the data deluge gap. First,
by reducing latency, user QoE is improved dramatically, even under heavy loads, resulting in more
satisfied users. Second, by distributing the total load more evenly from the edge to the core, overall
network throughput can be optimized. Third and perhaps most importantly, profitability is increased
through a combination of more revenue from satisfied users and better utilization of available backhaul
bandwidth.
These benefits can all be maximized by using solutions purpose-built for the special needs of mobile
networks. The use of specialized mobile communications processors that combine multiple CPU cores with
multiple hardware acceleration engines—all on a single integrated circuit—results in maximum
performance with minimal power consumption. Dedicated and standards-based flash cache acceleration
cards provide both the performance and versatility needed to optimize the configuration of a hierarchical
caching architecture.
It bears repeating: As mobile networks become increasingly like content delivery networks, it will be
necessary to operate them as such. And intelligent content caching is a proven technique for delivering
content more quickly and cost-effectively.
About the author
Seong Hwan Kim is a Technical Marketing Manager for the Networking Solutions Group at LSI Corporation.
He has close to 20 years of experience in computer networks and digital communications. His expertise is
in enterprise networking, network and server virtualization, SDN/OpenFlow, cloud acceleration, wireless
communications and QoS/QoE management. A noted industry expert, he holds several networking
patents. His work has been published in numerous venues, including IEEE Communications and Elsevier
magazines, and he has presented at several industry conferences.
Seong Hwan Kim holds a Ph.D. in Electrical Engineering from the State University of New York at Stony
Brook and an MBA from Lehigh University.
PCIe flash: It solves lots of problems, but also
makes a bunch - so what's its future?
By Rob Ober
Editor’s Note:
This is a guest post by Rob Ober, corporate strategist at LSI. Prior to joining LSI, Rob was a fellow in the
Office of the CTO at AMD. He was also a founding board member of OLPC (the “$100 laptop”; laptop.org)
and OpenSPARC.
I want to warn you, there is some thick background information here first. But don’t worry. I’ll get to the
meat of the topic and that’s this: Ultimately, I think that PCIe cards will evolve to more external,
rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but
other leaders in flash are going down this path too...
I’ve been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance,
capacity, cost and performance have all been concerns to grapple with. Of course the flash is changing
too as the nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single-level cell (SLC) to multi-level cell
(MLC) to triple-level cell (TLC), and all the variants of these “trimmed” for specific use cases. The spec
“endurance” has gone from 1 million program/erase cycles (PE) to 3,000, and in some cases 500.
It’s worth pointing out that almost all the “magic” that has been developed around flash was already
scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity
increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data
transfer… And the “requirement” for state-of-the-art single-operation write latency has fallen well below
the write latency of the flash itself. (What the…?? Yeah – I’ll talk about that later in some other blog. But
flash is ~1,500 µs write latency, where state-of-the-art flash cards are ~50 µs.) When I describe the state
of technology it sounds pretty pessimistic. I’m not. We’ve overcome a lot.
We built our first PCIe card solution at LSI in 2009. It wasn’t perfect, but it was better than anything else
out there in many ways. We’ve learned a lot in the years since – both from making them, and from
dealing with customers and users – of both our own solutions and our competitors’. We’re lucky to be an
important player in storage, so in general the big OEMs, large enterprises and the mega datacenters all
want to talk with us – not just about what we have or can sell, but what we could have and what
we could do. They’re generous enough to share what works and what doesn’t. What the values of
solutions are and what the pitfalls are too. Honestly? It’s the mega datacenters in the lead both practically
and in vision.
If you haven’t nodded off to sleep yet, that’s a long-winded way of saying – things have changed fast,
and, boy, we’ve learned a lot in just a few years.
Most important thing we’ve learned…
Most importantly, we’ve learned it’s latency that matters. No one is pushing the IOPS limits of flash, and
no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.
PCIe cards are great, but…
We’ve gotten lots of feedback, and one of the biggest things we’ve learned is – PCIe flash cards are
awesome. They radically change the performance profiles of most applications, especially databases,
allowing servers to run efficiently and the actual work done by each server to multiply 4x to 10x (and in a few extreme
cases 100x). So the feedback we get from large users is “PCIe cards are fantastic. We’re so thankful they
came along. But…” There’s always a “but,” right??
It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using
them. We’re not the only ones hearing it. To be clear, none of these are stopping people from deploying
PCIe flash… the attraction is just too compelling. But the problems are real, and they have real
implications, and the market is asking for real solutions.
Stranded capacity & IOPS
o Some “leftover” space is always needed in a PCIe card. Databases don’t do well when they
run out of storage! But you still pay for that unused capacity.
o All the IOPS and bandwidth are rarely used – sure, latency is met, but there is capability left
on the table.
o Not enough capacity on a card – It’s hard to figure out how much flash a server/application
will need. But there is no flexibility. If my working set goes one byte over the card capacity,
well, that’s a problem.
Stranded data on server fail
o If a server fails – all that valuable hot data is unavailable. Worse – it all needs to be
reconstructed when the server comes back online because it will be stale. It takes quite a while
to rebuild 2 TBytes of interesting data. Hours to days.
PCIe flash storage is a separate storage domain vs. disks and boot.
o Have to explicitly manage LUNs, move data to make it hot.
o Often have to manage via different API’s and management portals.
o Applications may even have to be re-written to use different APIs, depending on the vendor.
Depending on the vendor, performance doesn’t scale.
o One card gives awesome performance improvement. Two cards don’t give quite the same
improvement.
o Three or four cards don’t give any improvement at all. Performance maxed out somewhere
below 2 cards. It turns out drivers and server onloaded code create resource bottlenecks,
but this is more a competitor’s problem than ours.
Depending on the vendor, performance sags over time.
o More and more computation (latency) is needed in the server as flash wears and needs
more error correction.
o This is more a competitor’s problem than ours.
It’s hard to get cards in servers.
o A PCIe card is a card – right? Not really. Getting a high capacity card in a half height, half
length PCIe form factor is tough, but doable. However, running that card has problems.
o It may need more than 25W of power to run at full performance – the slot may or may not
provide it. Flash burns power proportionately to activity, and writes/erases are especially
intense on power. It’s really hard to remove more than 25W air cooling in a slot.
o The air is preheated, or the slot doesn’t get good airflow. It ends up being a server-by-
server, slot-by-slot qualification process. (Yes, slot by slot…) As trivial as this sounds, it’s
actually one of the biggest problems.
Of course, everyone wants these fixed without affecting single operation latency, or increasing cost, etc.
That’s what we’re here for though – right? Solve the impossible?
A quick summary is in order. It’s not looking good. For a given solution, flash is getting less reliable, there
is less bandwidth available at capacity because there are fewer die, we’re driving latency way below the
actual write latency of flash, and we’re not satisfied with the best solutions we have for all the reasons
above.
The implications
If you think these through enough, you start to consider one basic path. It also turns out we’re not the
only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals
are:
Unified storage infrastructure for boot, flash, and HDDs
Pooling of storage so that resources can be allocated/shared
Low latency, high performance as if those resources were DAS attached, or PCIe card flash
Bonus points for file store with a global name space
One easy answer would be – that’s a flash SAN or NAS. But that’s not the answer. Not many customers
want a flash SAN or NAS – not for their new infrastructure, but more importantly, all the data is at the
wrong end of the straw. The poor server is left sucking hard. Remember – this is flash, and people use
flash for latency. Today these SAN type of flash devices have 4x-10x worse latency than PCIe cards. Ouch.
You have to suck the data through a relatively low bandwidth interconnect, after passing through both the
storage and network stacks. And there is interaction between the I/O threads of various servers and
applications – you have to wait in line for that resource. It’s true there is a lot of startup energy in this
space. It seems to make sense if you’re a startup, because SAN/NAS is what people use today, and
there’s lots of money spent in that market today. However, it’s not what the market is asking for.
Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front bay
PCIe SSDs (HDD form factor or NVMe – lots of names) that crowd out your disk drive bays. But they don’t
fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the
cards every 5 years a few minutes faster. Wow. With NVMe SSDs, you can fit fewer HDDs – not good.
They also provide uniformly bad cooling, and hard limit power to 9W or 25W per device. But to protect the
storage in these devices, you need to have enough of them that you can RAID or otherwise protect. Once
you have enough of those for protection, they give you awesome capacity, IOPs and bandwidth, too much
in fact, but that’s not what applications need – they need low latency for the working set of data.
What do I think the PCIe replacement solutions in the near future will look like? You need to pool the flash
across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity). You need
to protect against failures/errors and limit the span of failure, commit writes at very low latency (lower
than native flash) and maintain low latency, bottleneck-free physical links to each server… To me that
implies:
- Small enclosure per rack handling ~32 or more servers
- Enclosure manages temperature and cooling optimally for performance/endurance
- Remote configuration/management of the resources allocated to each server
- Ability to re-assign resources from one server to another in the event of server/VM blue-screen
- Low-latency/high-bandwidth physical cable or backplane from each server to the enclosure
- Replaceable inexpensive flash modules in case of failure
- Protection across all modules (erasure coding) to allow continuous operation at very high bandwidth
- NV memory to commit writes with extremely low latency
- Ultimately, integration with the whole storage architecture at the rack: the same APIs, drivers, etc.
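To make the protection point concrete, here is a minimal sketch of single-parity erasure coding across flash modules. This is purely illustrative – it assumes byte-string "stripes" and simple XOR parity, whereas real pooled-flash enclosures use stronger codes (such as Reed-Solomon) over device blocks:

```python
# Illustrative single-parity erasure code across flash modules.
# Assumption: each module holds one equal-length data stripe.

def make_parity(modules):
    """XOR the data stripes on each module into one parity stripe."""
    parity = bytearray(len(modules[0]))
    for stripe in modules:
        for i, b in enumerate(stripe):
            parity[i] ^= b
    return bytes(parity)

def rebuild(surviving, parity):
    """Recover the stripe on a single failed module from the survivors."""
    lost = bytearray(parity)
    for stripe in surviving:
        for i, b in enumerate(stripe):
            lost[i] ^= b
    return bytes(lost)

modules = [b"AAAA", b"BBBB", b"CCCC"]   # data stripes on three flash modules
parity = make_parity(modules)

# Module 1 fails; reads continue from the survivors plus parity.
recovered = rebuild([modules[0], modules[2]], parity)
assert recovered == b"BBBB"
```

The key property for the enclosure is that reads and writes keep flowing at high bandwidth while one inexpensive module is swapped and rebuilt.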
That means the performance looks exactly as if each server had multiple PCIe cards, but the capacity and
bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards
will evolve into more external, rack-level, pooled flash solutions, without sacrificing the great attributes
they have today. This is just my opinion, but as I say – other leaders in flash are going down this path too…
What’s your opinion?
Rob Ober drives LSI into new technologies, businesses and products as an LSI fellow in Corporate
Strategy. Prior to joining LSI, he was a fellow in the Office of the CTO at AMD, responsible for mobile
platforms, embedded platforms and wireless strategy. He was a founding board member of OLPC ($100
laptop.org) and OpenSPARC.
MEGA DATACENTERS: PIONEERING THE
FUTURE OF IT INFRASTRUCTURE
Rob Ober, LSI Fellow, Processor and System Architect, LSI Corporate Strategy Office, says:
The unrelenting growth in the volume and velocity of data worldwide is spurring innovation in datacenter
infrastructures, and mega datacenters (MDCs) are on the leading edge of these advances. Although MDCs
are relatively new, their exponential growth – driven by this data deluge – has thrust them into rarefied
regions of the global server market: they now account for about 25 percent of servers shipped.
Rapid innovation is the watchword at MDCs. It is imperative to their core business and, on a much larger
scale, forcing a rethinking of IT infrastructures of all sizes. The pioneering efforts of MDCs in private
clouds, compute clusters, data analytics and other IT applications now provide valuable insights into the
future of IT. Any organization stands to benefit by emulating MDC techniques to improve scalability,
reliability, efficiency and manageability and reduce the cost of work done as they confront changing
business dynamics and rising financial pressures.
The Effects of Scale at MDCs
MDCs and traditional datacenters are miles apart in scale, though the architects at each face many of the
same challenges. Most notably, both are trying to do more with less by implementing increasingly
sophisticated applications and optimizing the investments needed to confront the data deluge. The sheer
scale of MDCs, however, magnifies even the smallest inefficiency or problem. Economics force MDCs to
view the entire datacenter as a resource pool to be optimized as it delivers more services and supports
more users.
MDCs like those at Facebook, Amazon, Google and China’s Tencent use a small set of distinct platforms,
each optimized for a specific task, such as storage, database, analytics, search or web services. The scale
of these MDCs is staggering: Each typically houses 200,000 to 1,000,000 servers, and from 1.5 million to
10 million disk drives. Storage is their largest cost. The world’s largest MDCs deploy LSI flash cards, flash
cache acceleration, host bus adapters, serial-attached SCSI (SAS) infrastructure and RAID storage
solutions, giving LSI unique insight into challenges these organizations are facing, and how they are
pioneering various architectural solutions to common problems.
MDCs prefer open source software for operating systems and other infrastructure, and the applications are
usually self-built. Most MDC improvements have been given back to the open source community. In many
MDCs, even the hardware infrastructure might be self-built or, at a minimum, self-specified for optimal
configurations – options that might not be available to smaller organizations.
Server virtualization is only rarely used in MDCs. Instead of using virtual machines to run multiple
applications on a single server, MDCs prefer to run applications across clusters consisting of hundreds to
thousands of server nodes dedicated to a specific task. For example, the server cluster may contain only
boot storage, RAID-protected storage for database or transactional data, or unprotected direct-map drives
with data replication across facilities depending on the task or application it is performing. MDC
virtualization applications are all open source. They are used for containerization to simplify the
deployment and replication of images. Because re-imaging or updating virtualization applications occurs
frequently, boot image management is another challenge.
The large clusters at MDCs make the latency of inter-node communications critical to application
performance, so MDCs make extensive use of 10Gbit Ethernet in servers today and, in some cases, they
even deploy 40Gbit infrastructure as needed. MDCs also optimize performance by deploying networks with
static configurations that minimize transactional latency. And MDC architects are now deploying at least
some software defined network (SDN) infrastructure to optimize performance, simplify management at
scale and reduce costs.
To some, MDCs are seen as cheap, refusing to pay for any value-added functionality from vendors. But
that’s a subtle misunderstanding of their motivations. With as many as 1 million servers, MDCs require a
lights-out infrastructure maintained primarily by automated scripts and only a few technicians assigned
simple maintenance tasks. MDCs also maintain a ruthless focus on minimizing any unnecessary spending,
using the savings to grow and optimize work performed per dollar spent.
MDCs are very careful to eliminate features not central to their core applications, even if provided for free,
since they increase operating expenditures. Chips, switches and buttons, lights, cables, screws and
latches, software layers and anything else that does nothing to improve performance only adds to power
and cooling demands and service overhead. The addition of one unnecessary LED in 200,000 servers, for
example, is considered an excess that consumes 26,000 watts of power and can increase operating costs
by $10,000 per year.
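The arithmetic behind the LED example is easy to check. The per-LED wattage and electricity rate below are my assumptions (they are not stated in the article), chosen so the totals line up with the figures quoted:

```python
# Back-of-the-envelope check of the unnecessary-LED example.
# Assumed inputs (not from the article): 0.13 W per LED and
# $0.044/kWh, a plausible wholesale datacenter electricity rate.

LED_WATTS = 0.13
SERVERS = 200_000
RATE_PER_KWH = 0.044
HOURS_PER_YEAR = 24 * 365

total_watts = LED_WATTS * SERVERS                  # 26,000 W
annual_kwh = total_watts * HOURS_PER_YEAR / 1000   # ~227,760 kWh
annual_cost = annual_kwh * RATE_PER_KWH            # ~$10,000 per year

print(round(total_watts), round(annual_cost))
```

At MDC scale, even a tenth of a watt per server is a five-figure line item, which is why every superfluous component gets designed out.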
Even minor problems can become major issues at scale. One of the biggest operational challenges for
MDCs is HDD failure rates. Despite the low price of hard disk drives (HDDs), failures can cause costly
disruptions in large clusters, where these breakdowns are routine. Another challenge is managing rarely
used archival data that may exceed petabytes and is now approaching exabytes of online storage,
consuming more space and power while delivering diminishing value. Every organization faces similar
challenges, albeit on a smaller scale.
Lessons Learned from Mega Datacenters
Changing business dynamics and financial pressures are forcing all organizations to rethink the types of IT
infrastructure and software applications they deploy. The low cost of MDC cloud services is motivating
CFOs to demand more capabilities at lower costs from their CIOs, who in turn are turning to MDCs to find
inspiration and ways to address these challenges.
The first lesson any organization can learn from MDCs is to simplify maintenance and management by
deploying a more homogeneous infrastructure. Minimizing infrastructure spending where it matters little
and focusing it where it matters most frees capital to be invested in architectural enhancements that
maximize work-per-dollar. Investing in optimization and efficiency helps reduce infrastructure and
associated management costs, including those for maintenance, power and cooling. Incorporating more
lights-out self-management also pays off, supporting more capabilities with existing staff.
The second lesson is that maintaining five-nines (99.999%) reliability drives up costs and becomes
increasingly difficult architecturally as the infrastructure scales. A far more cost-effective architecture is
one that allows subsystems to fail, letting the rest of the system operate unimpeded and the overall system
self-heal. Because all applications are clustered, a single misbehaving node can degrade the performance
of the entire cluster. MDCs take the offending server off line, enabling all others to operate at peak
performance. The hardware and software needed for such an architecture are readily available today,
enabling any organization to emulate this approach. And though the expertise needed to effectively deploy
a cluster is still rare, new orchestration layers are emerging to automate cluster management.
Storage, one of the most critical infrastructure subsystems, directly impacts application performance and
server utilization. MDCs are leaders in optimizing datacenter storage efficiency, providing high-availability
operation to satisfy requirements for data retention and disaster recovery. All MDCs rely exclusively on
direct-attached storage (DAS), which carries a much lower purchase cost, is simpler to maintain and
delivers higher performance than a storage area network (SAN) or network-attached storage (NAS).
Although many MDCs minimize costs by using consumer-grade Serial ATA (SATA) HDDs and solid state
drives (SSDs), they almost always deploy these drives on a SAS infrastructure to maximize performance
and simplify management. More MDCs are now migrating to large-capacity, enterprise-grade SAS drives
for higher reliability and performance, especially as SAS migrates from 6Gbit/s to 12Gbit/s bandwidth.
When evaluating storage performance, most organizations focus on I/O operations per second (IOPs) and
MBytes/s throughput metrics. MDCs have discovered, though, that applications driving IOPs to SSDs quickly reach
other limits, often peaking well below 200,000 IOPs, and that MBytes/s performance has only a
modest impact on work done. A more meaningful metric is I/O latency because it correlates more directly
with application performance and server utilization – the very reason MDCs are deploying more SSDs or
solid state caching (or both) to minimize I/O latency and increase work-per-dollar.
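Why latency is the better predictor of work done can be seen with Little’s Law: a thread issuing synchronous I/Os can complete at most concurrency/latency operations per second, regardless of the device’s peak IOPs rating. A quick sketch, using the article’s ballpark latencies:

```python
# Little's Law applied to a single thread doing synchronous I/O:
# throughput = outstanding I/Os / latency.

def synchronous_iops(latency_seconds, outstanding_ios=1):
    """Per-thread I/O completion rate bounded by device latency."""
    return outstanding_ios / latency_seconds

hdd = synchronous_iops(10e-3)    # 10 ms HDD read
ssd = synchronous_iops(200e-6)   # 200 us SSD read
print(hdd, ssd, ssd / hdd)       # the gap is pure latency, not peak IOPs
```

A device rated at hundreds of thousands of IOPs helps little if each application thread is stalled waiting on one I/O at a time, which is exactly the server-utilization effect MDCs measure.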
Typical HDD read/write latency is on the order of 10 milliseconds. By contrast, typical SSD read and write
latencies are around 200 microseconds and 100 microseconds, respectively – about two orders of
magnitude lower. Specialized PCIe® flash cache acceleration cards can reduce latency another order of
magnitude to tens of microseconds. Using solid state storage to supplement or replace HDDs enables
servers and applications to do four to 10 times more work. Server-based flash caching provides even
greater gains in SAN and NAS environments – up to 30 times.
Flash cache acceleration cards deliver the lowest latency when plugged directly into a server’s PCIe bus.
Intelligent caching software continuously and transparently places hot data (the most frequently accessed
or temporally important) in low-latency flash storage to improve performance. Some flash cache
acceleration cards support multiple terabytes of solid state storage, holding entire databases or working
datasets as hot data. And because there is no intervening network and no risk of associated congestion,
the cached data is accessible quickly and deterministically under any workload.
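The hot-data placement described above can be modeled with a simple recency-based cache. This is a toy LRU sketch only – the names and eviction policy are mine, and commercial caching software is far more sophisticated (tracking frequency, temporal importance, and write-back state):

```python
# Toy model of hot-data placement: keep recently touched blocks in
# flash, fall back to the slower backend (HDD/SAN/NAS) on a miss.

from collections import OrderedDict

class FlashCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()   # block id -> data, coldest first

    def read(self, block_id, backend_read):
        if block_id in self.blocks:           # hot data: served from flash
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        data = backend_read(block_id)         # cold data: fetch from backend
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:  # evict the coldest block
            self.blocks.popitem(last=False)
        return data

cache = FlashCache(capacity_blocks=2)
cache.read(1, lambda b: f"data{b}")
cache.read(2, lambda b: f"data{b}")
cache.read(1, lambda b: f"data{b}")   # re-touching block 1 keeps it hot
cache.read(3, lambda b: f"data{b}")   # evicts block 2, now the coldest
assert list(cache.blocks) == [1, 3]
```

With multi-terabyte flash cards, the cache can be large enough that the entire working set stays resident and evictions become rare.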
Deploying an all-solid-state Tier 0 for some applications is also now feasible, and at least one MDC uses
SSDs exclusively. In the enterprise, decisions about using SSDs usually focus on the storage layer, and
cost per GByte or IOPs, pitting HDDs against SSDs with an emphasis on capital expenditure. MDCs have
discovered that SSDs deliver better price/performance than HDDs by maximizing work-per-dollar
investments in other infrastructure (especially servers and software licenses), and by reducing overall
maintenance costs. Solid state storage is also more reliable, easier to manage, faster to replicate and
rebuild, and more energy-efficient than HDDs – all advantages to any datacenter.
Pioneering the Datacenter of the Future
MDCs have been driving open source solutions with proven performance, reliability and scalability. In
some cases, these pioneering efforts have enabled applications to scale far beyond any commercial
product. Examples include Hadoop® for analytics and derivative applications, and clustered query and
database applications like Cassandra™ and Google’s Dremel. The state-of-the-art for these and other
applications is evolving quickly, literally month-by-month. These open source solutions are seeing
increasing adoption and inspiring new commercial solutions.
Two other, relatively new initiatives are expected to bring MDC advances to the enterprise market, just as
Linux® software did. One is the Open Compute Project, which offers a minimalist, cost-effective, easy-to-scale
hardware infrastructure for compute clusters. Open Compute could also foster its own innovation,
including an open hardware support services business model similar to the one now used for open source
software. The second initiative is OpenStack® software, which promises a higher level of automation for
managing pools of compute, storage and networking resources, ultimately leading to the ability to operate
a software defined datacenter.
A related MDC initiative involves disaggregating servers at the rack level. Disaggregation separates the
processor from memory, storage, networking and power, and pools these resources at the rack level,
enabling the lifecycle of each resource to be managed on its own optimal schedule to help minimize costs
while increasing work-per-dollar. Some architects believe that these initiatives could reduce total cost of
ownership by a staggering 70 percent.
Maximizing work-per-dollar at the rack and datacenter levels is one of the best ways today for IT
architects in any organization to do more with less. MDCs are masters at this type of high efficiency as
they continue to redefine how datacenters will scale to meet the formidable challenges of the data deluge.
About the Author
Robert Ober is an LSI Fellow in Corporate Strategy, driving LSI into new technologies, businesses and
products. He has 30 years of experience in processor and system architecture. Prior to joining LSI, Rob
was a Fellow in the Office of the CTO at AMD, with responsibility for mobile platforms, embedded
platforms and wireless strategy. He was one of the founding Board members of OLPC ($100 laptop.org)
and was influential in its technical evolution, and was also a Board Member of OpenSPARC.
Previously Rob was Chief Architect at Infineon Technologies, responsible for the TriCore family of
processors used in automotive, communication and security products. In addition, he drove improvements
in semiconductor methodology, libraries, process and the mobile phone platforms. Rob was manager of
Newton Technologies at Apple Computer and was involved in the creation of the PowerPC Macintosh
computers, PowerPC, StrongARM and ARC processors. He also has experience in development of CDC,
CRAY and SPARC supercomputers, mainframes and high-speed networks, and he has dozens of patents in
mobility, computing and processor architecture. Rob has an honors Bachelor of Applied Science (BASc.) in
Systems Design Engineering from the University of Waterloo in Ontario, Canada.
The Evolution Of Solid-State Storage In
Enterprise Servers By Tom Heil
Solid-state drives (SSDs) and PCI Express (PCIe) flash memory adapters are growing in popularity in
enterprise, service provider, and cloud datacenters due to their ability to cost-effectively improve
application-level performance. A PCIe flash adapter is a solid-state storage device that plugs directly into a
PCIe slot of an individual server, placing fast, persistent storage near server processors to accelerate
application-level performance.
By placing storage closer to the server’s CPU, PCIe flash adapters dramatically reduce latency in storage
transactions compared with traditional hard-disk drive (HDD) storage. However, the configuration lacks
standardization and critical storage device attributes such as external serviceability and hot-pluggability.
To overcome these limitations, various organizations are developing PCIe storage standards that extend
PCIe onto the server storage mid-plane to provide external serviceability. These new PCIe storage
standards take full advantage of flash memory’s low latency and provide an evolutionary path for its use
in enterprise servers.
The Need For Speed
Many applications benefit considerably from the use of solid-state storage owing to the enormous latency
gap that exists between the server’s main memory and its direct-attached HDDs. Flash storage enables
database applications, for example, to experience improvements of four to 10 times because access to
main memory takes about 100 ns while input/output (I/O) to traditional rotating storage is on the order of
10 ms or more (Fig. 1).
1. NAND flash memory fills the gap in latency between a server’s main memory and fast-
spinning hard-disk drives.
This access latency difference, approximately five orders of magnitude, has a profound adverse impact on
application-level performance and response times. Latency to external storage area networks (SANs) and
network-attached storage (NAS) is even higher owing to the intervening network infrastructure (e.g.,
Fibre Channel or Ethernet).
Flash memory provides a new high-performance storage tier that fills the gap between a server’s dynamic
random access memory (DRAM) and Tier 1 storage consisting of the fastest-spinning HDDs. This new “Tier
0” of solid-state storage, with latencies from 50 µs to several hundred microseconds, delivers dramatic
gains in application-level performance while continuing to leverage rotating media’s cost-per-gigabyte
advantage in all lower tiers.
Because the need for speed is so pressing in many of today’s applications, IT managers could not wait for
new flash-optimized storage standards to be finalized and become commercially available. That’s why
SSDs supporting the existing SAS and SATA standards as well as proprietary PCIe-based flash adapters
are already being deployed in datacenters. However, these existing solid-state storage solutions utilize
very different configurations.
SAS And SATA SSDs
The norm today for direct-attached storage (DAS) is a rack-mount server with an externally accessible
chassis having multiple 9-W storage bays capable of accepting a mix of SAS and SATA drives operating at
up to 6 Gbits/s. The storage mid-plane typically interfaces with the server motherboard via a PCIe-based
host redundant array of independent disks (RAID) adapter that has an embedded RAID-on-chip (ROC)
controller (Fig. 2).
2. SAS and SATA SSDs are supported today in standard storage bays with a RAID-on-chip
(ROC) controller on the server’s PCIe bus.
While originally designed for HDDs, this configuration is ideal for SSDs that utilize 2.5-in. and 3.5-in. HDD
form factors. Support for SAS and SATA HDDs and SSDs in various RAID configurations provides a
number of benefits in DAS configurations, such as the ability to mix high-performance SAS drives with
low-cost SATA drives in tiers of storage directly on the server. The fastest Tier 0 can utilize SAS SSDs,
while the slowest tier utilizes SATA HDDs (or external SAN or NAS). In some configurations, firmware on
the RAID adapter can transparently cache application data onto SSDs.
Being externally accessible and hot-pluggable, the configuration of disks can be changed as needed to
improve performance by adding more SSDs, or to expand capacity in any tier, as well as to replace
defective drives to restore full RAID-level data protection. Because the arrangement is fully standardized,
any bay can support any SAS or SATA drive. Device connectivity is easily scaled via an in-server SAS
expander or via SAS connections to external drive enclosures, commonly called JBODs for “just a bunch of
disks.”
The main advantage of deploying flash in HDD form factors using established SAS and SATA protocols is
that it significantly accelerates application performance while leveraging mature standards and the
existing infrastructure (both hardware and software). So, this configuration will remain popular well into
the future in all but the most demanding latency-sensitive applications. Enhancements also continue to be
made, including RAID adapters getting faster with PCIe version 3.0, and 12-Gbit/s SAS SSDs that are
poised for broad deployment beginning in 2013.
Even with continual advances and enhancements, though, SAS and SATA cannot capitalize fully on flash
memory’s performance potential. The most obvious constraints are the limited power (9 W) and channel
width (one or two lanes) available in a storage bay that was initially designed to accommodate rotating
magnetic media, not flash. These constraints limit the performance possible with the amount of flash that
can be deployed in a typical HDD form factor, and they are the driving force behind the emergence of PCIe
flash adapters.
PCIe Flash Adapters
Instead of plugging into a storage bay, a flash adapter plugs directly into a PCIe bus slot on the server’s
motherboard, giving it direct access to the CPU and main memory (Fig. 3). The result is a latency as low
as 50 µs for (buffered) I/O operations to solid-state storage. Because there are no standards yet for PCIe
storage devices, flash adapter vendors must supply a device driver to interface with the host’s file system.
In some cases, vendor-specific drivers are bundled with popular server operating systems.
3. PCIe flash adapters overcome the limitations imposed by legacy storage protocols, but they
must be plugged directly into the server’s PCIe bus.
Unlike storage bays that provide one or two lanes, server PCIe slots are typically four or eight lanes wide.
An eight-lane (x8) PCIe (version 3.0) slot, for example, can provide a throughput of 8 Gbytes/s (eight
lanes at 1 Gbyte/s each). By contrast, a SAS storage bay can scale to 3 Gbytes/s (two lanes at 12 Gbits/s
or 1.5 Gbytes/s each). The higher bandwidth increases I/O operations per second (IOPs), which reduces
the transaction latency experienced by some applications.
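The raw-bandwidth comparison above can be reproduced directly from the per-lane line rates the article quotes (encoding overhead is ignored here, as it is in the article):

```python
# Raw bandwidth: x8 PCIe 3.0 slot vs. a two-lane 12-Gbit/s SAS bay.
# Per-lane rates as quoted in the text; protocol overhead ignored.

PCIE3_GBYTES_PER_LANE = 1.0          # ~1 GByte/s per PCIe 3.0 lane
SAS12_GBYTES_PER_LANE = 12 / 8       # 12 Gbits/s = 1.5 GBytes/s per lane

pcie_x8 = 8 * PCIE3_GBYTES_PER_LANE  # x8 slot
sas_bay = 2 * SAS12_GBYTES_PER_LANE  # two-lane storage bay
print(pcie_x8, sas_bay)              # 8.0 3.0
```

The slot’s 2.7x raw-bandwidth advantage, combined with its higher power budget, is what lets a flash adapter sustain many more parallel flash operations than a drive bay.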
Another significant advantage of a PCIe slot is the higher power available, which enables larger flash
arrays, as well as more parallel read/write operations to the array(s). The PCIe bus supports up to 25 W
per slot, and if even more is needed, a separate connection can be made to the server’s power supply,
similar to the way high-end PCIe graphics cards are configured in workstations. For half-height, half-
length (HHHL) cards today, 25 W is normally sufficient. Ultra-high-capacity full-height cards often require
additional power.
A PCIe flash adapter can be utilized either as flash cache or as a primary storage solid-state drive. The
more common configuration today is flash cache to accelerate I/O to DAS, SAN, or NAS rotating media.
Adapters used as an SSD are often available with advanced capabilities, such as host-based RAID for data
protection. But the PCIe bus isn’t an ideal platform for primary storage due to its lack of external
serviceability and hot-pluggability.
Flash Cache Acceleration Cards
Caching content to memory in a server is a proven technique for reducing latency and, thereby, improving
application-level performance. But because the amount of memory possible in a server (measured in
gigabytes) is only a small fraction of the capacity of even a single disk drive (measured in terabytes),
achieving performance gains from this traditional form of caching is becoming difficult.
Flash memory breaks through the cache size limitation imposed by DRAM to again make caching a highly
effective and cost-effective means for accelerating application-level performance. Flash memory is also
non-volatile, giving it another important advantage over DRAM caches. As a result, PCIe-based flash cache
adapters such as the LSI Nytro XD solution have already become popular for enhancing performance.
Solid-state memory typically delivers the highest performance gains when the flash cache is placed
directly in the server on the PCIe bus. Embedded or host-based intelligent caching software is used to
place “hot data” (the most frequently accessed data) in the low-latency, high-performance flash storage.
Even though flash memory has a higher latency than DRAM, PCIe flash cache cards deliver superior
performance for two reasons.
The first is the significantly higher capacity of flash memory, which dramatically increases the “hit rate” of
the cache. Indeed, with some flash cards now supporting multiple terabytes of solid-state storage, there is
often sufficient capacity to store entire databases or other datasets as “hot data.” The second reason
involves the location of the flash cache: directly in the server on the PCIe bus. With no external
connections and no intervening network to a SAN or NAS (that is also subject to frequent congestion and
deep queues), the “hot data” is accessible in a flash (pun intended) in a deterministic manner under all
circumstances.
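The effect of the higher hit rate is easy to quantify with the standard expected-latency formula. The latencies below are illustrative assumptions (50 µs for a PCIe flash hit, 5 ms for a congested SAN read on a miss), not measured figures:

```python
# Average read latency of a flash cache in front of networked storage.
# Assumed latencies: 50 us flash hit, 5,000 us (5 ms) SAN read on a miss.

def effective_latency(hit_rate, hit_us, miss_us):
    """Expected latency = hit_rate * hit + (1 - hit_rate) * miss."""
    return hit_rate * hit_us + (1 - hit_rate) * miss_us

# A terabyte-scale cache that holds the whole working set pushes the
# hit rate toward 1.0, so average latency approaches flash latency.
for hit_rate in (0.50, 0.90, 0.99):
    print(hit_rate, effective_latency(hit_rate, 50, 5000))
```

Note how sensitive the average is to the last few points of hit rate, which is why terabyte-class caches that capture the entire dataset change the economics so sharply.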
Although the use of PCIe flash adapters can dramatically improve application performance, PCIe was not
designed to accommodate storage devices directly. PCIe adapters are not externally serviceable, are not
hot-pluggable, and are difficult to manage as part of an enterprise storage infrastructure. The proprietary
nature of PCIe flash adapters also is an impediment to a robust, interoperable multi-party device
ecosystem. Overcoming these limitations requires a new industry-standard PCIe storage solution.
Express Bay
Support for the PCIe interface on an externally accessible storage mid-plane is emerging based on the
Express Bay standard with the SFF-8639 connector. Express Bay provides four dedicated PCIe lanes and
up to 25 W to accommodate ultra-high-performance, high-capacity Enterprise PCIe SSDs (eSSD) in a 2.5-
in. or 3.5-in. disk drive form factor.
As a superset of today’s standard disk drive bay, Express Bay maintains backward compatibility with
existing SAS and SATA devices. The SSD Form Factor Working Group is creating the Express Bay
standard, Enterprise SSD Form Factor 1.0 Specification, in cooperation with the SFF Committee, the SCSI
Trade Association, the PCI Special Interest Group, and the Serial ATA International Organization.
Enterprise SSDs for Express Bay will initially use vendor-specific protocols enabled by vendor-supplied
host drivers. Enterprise SSDs compliant with the new NVM Express (NVMe) flash-optimized host interface
protocol will emerge in 2013. The NVMe Work Group (www.nvmexpress.org) is defining NVMe for use in
PCIe devices targeting both clients (PCs, ultrabooks, etc.) and servers. By 2014, standard NVMe host
drivers should be available in all major operating systems, eliminating the need for vendor-specific drivers
(except when a vendor supplies a driver to enable unique capabilities).
Also in 2014, Enterprise PCIe SSDs compliant with the new SCSI Express (SCSIe) host interface protocol
are expected to make their debut. SCSIe SSDs will be optimized for enterprise applications and should fit
seamlessly under existing enterprise storage applications based on the SCSI architecture and command
set. SCSIe is being defined by the SCSI Trade Association and the InterNational Committee for
Information Technology Standards (INCITS) Technical Committee T10 for SCSI Storage Interfaces.
Most mid-planes supporting the Express Bays will interface with the server via two separate PCIe-based
cards: a PCIe switch to support high-performance Enterprise PCIe SSDs and a RAID adapter to support
legacy SAS and SATA devices (Fig. 4). Direct support for PCIe (through the PCIe switch) makes it
possible to put flash cache acceleration solutions in the Express Bay.
4. Express Bay fully supports the low latency of flash memory with the high performance of
PCIe, while maintaining backwards compatibility with existing SAS and SATA HDDs and SSDs.
This configuration is expected to become preferable to the flash adapters now being plugged directly
into the server’s PCIe bus. Nevertheless, PCIe flash adapters may continue to be used in ultra-high-
performance or ultra-high-capacity applications that justify utilizing the wider x8 PCIe bus slots and/or
additional power available only within the server.
Because it is more expensive to provision an Express Bay than a standard drive bay, server vendors are
likely to limit deployment of Express Bays until market demand for Enterprise PCIe SSDs increases. Early
server configurations may support perhaps two or four Express Bays, with the remainder being standard
bays. Server vendors may also offer some models with a high number of (or nothing but) Express Bays to
target ultra-high-performance and ultra-high-capacity applications, especially those that require little or
no rotating media storage.
SATA Express
PCIe flash storage also is expected to become common in client devices beginning in 2013 with the advent
of the new SATA Express (SATAe) standard. Like SATA before them, SATAe devices are expected to be
adopted in the enterprise due to the low cost that inevitably results from the economics of high-volume
client-focused technologies.
The SATAe series of standards includes a flash-only M.2 form factor (previously called the next-generation
form factor or NGFF) for ultrabooks and netbooks and a 2.5-in. disk drive compatible form factor for
laptop and desktop PCs. SATAe standards are being developed by the Serial ATA International
Organization (www.sata-io.org). Initial SATAe devices will use the current AHCI protocol to leverage
industry-standard SATA host drivers, but will quickly move to NVMe once standard NVMe drivers become
incorporated into major operating systems.
The SATAe 2.5-in. form factor is expected to play a significant role in enterprise storage. It is designed to
plug into either an Express Bay or a standard drive bay. In both cases, the PCIe signals are multiplexed
atop the existing SAS/SATA lanes. Either bay then can accommodate a SATAe SSD or a SAS or SATA
drive (Fig. 5). Of course, the Express Bay can additionally accommodate x4 Enterprise PCIe SSDs as
previously discussed.
5. Although designed for client PCs, new SATA Express drives will be supported in a standard
bay by multiplexing the PCIe protocols atop existing SAS/SATA lanes.
The configuration implies future RAID controller support for SATAe drives to supplement existing support
for SAS and SATA drives. Note that although SATAe SSDs will outperform SATA SSDs, they will lag 12-
Gbit/s SAS SSD performance (two lanes of 12 Gbits/s are faster than two lanes of 8-Gbit/s PCIe 3.0). The
SATAe M.2 form factor will also be adopted in the enterprise in situations where a client-class PCIe SSD is
warranted, but the flexibility and/or external serviceability of a storage form factor is not required.
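The parenthetical bandwidth claim can be checked with quick arithmetic. The sketch below (Python, for illustration) applies each interface's published line-encoding overhead: SAS-3 runs at 12 Gbits/s with 8b/10b encoding, while PCIe 3.0 runs at 8 GT/s with 128b/130b encoding.

```python
# Effective per-lane throughput after line-encoding overhead.
# Two lanes of each are compared, matching the configuration above.
sas3_lane = 12.0 * 8 / 10        # 9.6 Gbits/s usable per SAS-3 lane
pcie3_lane = 8.0 * 128 / 130     # ~7.88 Gbits/s usable per PCIe 3.0 lane

print(f"2 x SAS-3 12G: {2 * sas3_lane:.2f} Gbits/s")   # 19.20
print(f"2 x PCIe 3.0:  {2 * pcie3_lane:.2f} Gbits/s")  # 15.75
```

Even after SAS's heavier 8b/10b overhead, two 12-Gbit/s SAS lanes still deliver more usable bandwidth than two PCIe 3.0 lanes, which is why SATAe SSDs will lag 12-Gbit/s SAS SSDs.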
Summary
With its ability to bridge the large gap in I/O latency between main memory and hard-disk drives, flash
memory has exposed some limitations in existing storage standards. These standards have served the
industry well, and SAS and SATA HDDs and SSDs will continue to be deployed in enterprise and cloud
applications well into the foreseeable future. Indeed, the new standards being developed all accommodate
today’s existing and proven standards, making the integration of solid-state storage seamless and
evolutionary, not disruptive or revolutionary.
To take full advantage of flash memory’s ultra-low latency, proprietary solutions that leverage the high
performance of the PCIe bus have emerged in advance of the new storage standards. But while PCIe
delivers the performance needed, it was never intended to be a storage architecture. In effect, the new
storage standards extend the PCIe bus onto the server’s externally accessible mid-plane, which was
designed as a storage architecture.
Yogi Berra famously observed, “It’s tough to make predictions, especially about the future.” But because
the new standards all preserve backwards compatibility, there is no need to predict a “winner” among
them. In fact, all are likely to coexist, perhaps in perpetuity, because each is focused on specific and
different needs in client and server storage. Fortunately, Express Bay supports both new and legacy
standards, as well as proprietary solutions, all concurrently. This freedom of choice down to the level of an
individual bay eliminates the need for the industry to choose only one as “the” standard.
Tom Heil is a senior systems architect and Distinguished Engineer in LSI’s Storage Products
Division, where he is responsible for technology strategy, product line definition, and business
planning. He is a 25-year veteran of the computer and storage industry and holds 17 patents in
computer and I/O architecture. He can be reached at tom.heil@lsi.com.
Networks to Get Smarter and Faster in 2013
and Beyond
By Greg Huff, Chief Technology Officer at LSI
Architects and managers of networks of all types – enterprise, storage and mobile – are struggling under
the formidable pressure of massive data growth. To accelerate performance amid this data deluge, they
have two options: the traditional brute force approach of deploying systems beefed up with more general-
purpose processors, or turning to systems with intelligent silicon powered by purpose-built hardware
accelerators integrated with multi-core processors.
Adding more and faster general-purpose processors to routers, switches and other networking equipment
can improve performance but adds to system costs and power demands while doing little to address
latency, a major cause of performance problems in networks. By contrast, smart silicon minimizes or
eliminates performance choke points by reducing latency for specific processing tasks. In 2013 and
beyond, design engineers will increasingly deploy smart silicon to gain its order-of-magnitude higher
performance and greater efficiencies in cost and power.
Enterprise Networks
In the past, Moore’s Law was sufficient to keep pace with increasing computing and networking workloads.
Hardware and software largely advanced in lockstep: as processor performance increased, more
sophisticated features could be added in software. These parallel improvements made it possible to create
more abstracted software, enabling much higher functionality to be built more quickly and with less
programming effort. Today, however, these layers of abstraction are making it difficult to perform more
complex tasks with adequate performance.
General-purpose processors, regardless of their core count and clock rate, are too slow for functions such
as classification, cryptographic security and traffic management that must operate deep inside each and
every packet. What’s more, these specialized functions must often be performed sequentially, restricting
the opportunity to process them in parallel in multiple cores. By contrast, these and other specialized
types of processing are ideal applications for smart silicon, and it is increasingly common to have multiple
intelligent acceleration engines integrated with multiple cores in specialized System-on-Chip (SoC)
communications processors.
The number of function-specific acceleration engines available continues to grow, and shrinking
geometries now make it possible to integrate more engines onto a single SoC. It is even possible to
integrate a system vendor’s unique intellectual property as a custom acceleration engine within an SoC.
Taken together, these advances make it possible to replace multiple SoCs with a single SoC to enable
faster, smaller, more power-efficient networking architectures.
Storage Networks
The biggest bottleneck in data centers today is caused by the five orders of magnitude difference in I/O
latency between main memory in servers (100 nanoseconds) and traditional hard disk drives (10
milliseconds). Latency to external storage area networks (SANs) and network-attached storage (NAS) is
even higher because of the intervening network and performance restrictions resulting when a single
resource services multiple, simultaneous requests sequentially in deep queues.
Caching content to memory in a server or in a SAN on a Dynamic RAM (DRAM) cache appliance is a
proven technique for reducing latency and thereby improving application-level performance. But today,
because the amount of memory possible in a server or cache appliance (measured in gigabytes) is only a
small fraction of the capacity of even a single disk drive (measured in terabytes), the performance gains
achievable from traditional caching are insufficient to deal with the data deluge.
Advances in NAND flash memory and flash storage processors, combined with more intelligent caching
algorithms, break through the traditional caching scalability barrier to make caching an effective, powerful
and cost-efficient way to accelerate application performance going forward. Solid state storage is ideal for
caching as it offers far lower latency than hard disk drives with comparable capacity. Besides delivering
higher application performance, caching enables virtualized servers to perform more work, cost-
effectively, with the same number of software licenses.
Solid state storage typically produces the highest performance gains when the flash cache is placed
directly in the server on the PCIe® bus. Intelligent caching software is used to place hot, or most
frequently accessed, data in low-latency flash storage. The hot data is accessible quickly and
deterministically under any workload since there is no external connection, no intervening network to a
SAN or NAS and no possibility of associated traffic congestion and delay. Exciting to those charged with
managing or analyzing massive data inflows, some flash cache acceleration cards now support multiple
terabytes of solid state storage, enabling the storage of entire databases or other datasets as hot data.
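A simple expected-latency model shows why cache capacity, through its effect on hit rate, matters so much. The 10 ms HDD and 100 ns DRAM figures come from this article; the ~50 µs flash latency and both hit rates are illustrative assumptions:

```python
def avg_latency_us(hit_rate, cache_us, miss_us):
    """Expected access latency given a cache hit rate (microseconds)."""
    return hit_rate * cache_us + (1 - hit_rate) * miss_us

HDD_US = 10_000.0   # ~10 ms per random HDD access (from the article)
DRAM_US = 0.1       # ~100 ns main-memory access (from the article)
FLASH_US = 50.0     # illustrative flash read latency (assumed)

# A gigabyte-scale DRAM cache might hit only 30% of a large working
# set, while a terabyte-scale flash cache can hold far more hot data:
print(avg_latency_us(0.30, DRAM_US, HDD_US))    # ~7000 us average
print(avg_latency_us(0.95, FLASH_US, HDD_US))   # 547.5 us average
```

The miss path dominates: even though flash is hundreds of times slower than DRAM, the far larger flash cache cuts average latency by an order of magnitude because so few requests fall through to disk.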
Mobile Networks
Traffic volume in mobile networks is doubling every year, driven mostly by the explosion of video
applications. Per-user access bandwidth is also increasing rapidly as we move from 3G to LTE and LTE-
Advanced. This will in turn lead to the advent of even more graphics-intensive, bandwidth-hungry
applications.
Base stations must rapidly evolve to manage rising network loads. In the infrastructure, multiple radios are
now being used in cloud-like distributed antenna systems, and network topologies are flattening. Operators
are planning to deliver advanced quality of service with location-based services and application-aware
billing. As in the enterprise, handling these complex, real-time tasks is increasingly feasible only with
acceleration engines built into smart silicon.
To deliver higher 4G data speeds reliably to a growing number of mobile devices, access networks need
more, and smaller, cells, which drives the deployment of SoCs in base stations. Reducing
component count with SoCs has another important advantage: lower power consumption. From the edge
to the core, power consumption is now a critical factor in all network infrastructures.
The use of System-on-Chip ICs with multiple cores and multiple acceleration engines will be essential in 3G
and 4G mobile networks.
Enterprise networks, datacenter storage architectures and mobile network infrastructures are in the midst
of rapid, complex change. The best and possibly only way to efficiently and cost-effectively address these
changes and harness the opportunities of the data deluge is by adopting smart silicon solutions that are
emerging in many forms to meet the challenges of next-generation networks.
About the Author
Greg Huff is Chief Technology Officer at LSI. In this capacity, he is responsible for
shaping the future growth strategy of LSI products within the storage and
networking markets. Huff joined the company in May 2011 from HP, where he
was vice president and chief technology officer of the company’s Industry
Standard Server business. In that position, he was responsible for the technical
strategy of HP’s ProLiant servers, BladeSystem family products and its
infrastructure software business. Prior to that, he served as research and
development director for the HP Superdome product family. Huff earned a
bachelor's degree in Electrical Engineering from Texas A&M University and an MBA
from the Cox School of Business at Southern Methodist University.
Maximizing solid-state storage capacity in
small form factors
Kent Smith, Senior Director of Marketing, Flash Components Division, LSI
Users want ever-smaller and lighter devices but also demand ever-increasing storage capacity to keep
more apps and data loaded on their mobile computing platforms. To accommodate these two competing
objectives, solid-state storage form factors will need to get smaller, while NAND flash memory geometries
will be shrinking and storing more bits per cell. The combination is having an impact on the way flash
memory is being designed into ultrabooks, netbooks and other mobile computing devices.
The first consideration in designing for maximum capacity is the form factor of the printed circuit board
(PCB) for the storage components. The latest storage form factors being standardized are known as M.2
(previously called the next generation form factor or NGFF). As shown in Figure 1, the most popular M.2
form factor among system manufacturers is 40 percent smaller than the mSATA card. In addition to being
more compact, the M.2 specification has been optimized for solid state storage and includes connector
keys for SATA, 2x or 4x PCI Express.
Figure 1. This popular version of the new M.2 form factor (on the right) offers 40 percent less area than the existing mSATA
form factor.
For applications where additional capacity is required (and space is available), the M.2 specification
supports other card dimensions, including some with lengths up to 110 mm, providing nearly 60 percent
more area than mSATA. There are also other custom and proprietary designs that stack multiple flash
memory packages or use multiple PCBs, growing taller in the z-height dimension above the base PCB to
reduce the overall footprint for a given aggregate volume.
The smaller area available on the M.2 card is driving the need for using smaller flash memory geometries
and/or more bits per cell. As shown in Figure 2, the combination has dramatically increased the density of
storage possible. For example, in the same footprint, 50 nm flash using single-level cells (SLC) can store
only 2 Gigabytes (GB), while 19 nm flash using multi-level cells (MLC) can store 32 GB—16 times the
density for approximately the same cost. With triple-level cells (TLC), also at 19 nm, the same footprint
could have a capacity as high as 48 GB.
Figure 2. Smaller flash memory geometries and more bits per cell combine to increase the capacity in Gigabytes per square
millimeter possible in small form factors.
Next-generation flash storage processors
Taking full advantage of shrinking geometries and higher bit densities of NAND flash memory requires
some changes to flash storage processors (FSP). The FSP is responsible for managing the pages and
blocks of flash memory, and also provides the input/output (I/O) interface with the system. Two of the
biggest challenges for FSPs today involve error correction and endurance.
As flash memory geometries shrink, cells become smaller and, therefore, hold less of a charge for the one,
two or three bits they store. For illustrative purposes imagine a 50 nm cell storing a single bit, which
might hold about 1000 electrons, and a 20 nm cell storing two bits, which might hold only 100 electrons—
an order of magnitude fewer. While the number of electrons cited here does not reflect actual
measurements, the comparison does demonstrate that the lower charge available with fewer electrons
increases the potential for read errors from the flash, which must be corrected by the FSP.
Traditional approaches to error correction, such as Reed-Solomon (RS) or BCH (also named for its co-
inventors Bose, Ray-Chaudhuri and Hocquenghem), are giving way to the Low-Density Parity Check
(LDPC) in next-generation FSPs. LDPC can provide error correction performance close to the theoretical
limits of any technique. Adding sophisticated digital signal processing enables detection and correction of
even more errors. The few errors that cannot be corrected could then be handled by an integral data
protection technology, much like the RAID (redundant array of independent disks) technology used in
direct-attached storage and storage area network controllers.
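Real LDPC and BCH decoders are well beyond a short sketch, but the core parity-check idea, redundant bits whose recomputed "syndrome" locates a flipped bit, can be illustrated with the classic (and far weaker) Hamming(7,4) code:

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword with 3 parity bits."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):
    """Locate and fix any single-bit error, then return the data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1          # correct the flipped bit
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a single-bit read error
print(hamming74_decode(word))         # [1, 0, 1, 1]
```

An FSP's LDPC engine works on the same principle at vastly larger scale, using soft information from the flash cells to correct many errors per codeword rather than one.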
Higher density flash cells with higher error rates wear out sooner. For this reason, the garbage collection
and wear-leveling capabilities of the FSP have become increasingly important. The need for garbage
collection and wear-leveling in NAND flash causes the amount of data being physically written to flash
memory to be a multiple of the logical data intended to be written. This phenomenon is expressed as a
simple ratio called “write amplification,” which ideally would approach 1.0. Because these “unnecessary”
writes wear out cells prematurely, next-generation FSPs will benefit greatly from some type of data
reduction technology to minimize write amplification and, thereby, maximize the flash memory’s useful
life.
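As a concrete illustration, write amplification is simply the ratio of bytes physically written to flash versus bytes the host asked to write (the volumes below are hypothetical):

```python
def write_amplification(physical_bytes, logical_bytes):
    """Ratio of flash writes to host writes; ideally approaches 1.0."""
    return physical_bytes / logical_bytes

# Hypothetical example: garbage collection and wear-leveling caused
# 30 GB of flash writes in order to store 20 GB of host data.
print(write_amplification(30e9, 20e9))   # 1.5
```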
Another technique for increasing capacity is to eliminate the need for a separate DRAM buffer, which is
required in solid state storage solutions to maintain the “map” consisting of a combination of the flash
memory file index and logical block addresses (LBAs). But the DRAM chip consumes precious space and
power that could (and should) be used for more flash memory. DRAM-less chip designs, such as the LSI
SandForce FSP, are also key to enabling SSD manufacturers to develop higher capacity drives for today’s
growing class of thin-and-light ultrabook platforms. By creating designs that do not require an external
DRAM buffer, these next-generation single-chip FSPs are what will make it possible to maximize solid
state storage capacity in small form factors.
About the author
Kent Smith is senior director of Marketing for the Flash Components Division of LSI Corporation, where he
is responsible for all outbound marketing and performance analysis. Prior to LSI, Smith was the senior
director of Corporate Marketing at SandForce, which was acquired by LSI in 2012, his second company to
be sold to LSI. He has over 25 years of marketing and management experience in the storage and high-
tech industry, holding senior management positions at companies including SiliconStor, Polycom, Adaptec,
Acer and Quantum. Smith holds an MBA from the University of Phoenix.
Bridging the Data Deluge Gap—The Role of
Smart Silicon in Networks
Michael Merluzzi, LSI Corporation
The proliferation of smart mobile devices, video, user-generated content and social networking, and the
rising adoption of cloud services for both enterprise and consumer services are all driving explosive growth
of wireless networking infrastructure. Globally, mobile data traffic is expected to grow 18-fold between
2011 and 2016, reaching 10.8 exabytes per month by 2016. Today, video traffic alone accounts for 40
percent of the wireless network load. The number of mobile devices connected to wireless networks will
reach 25 billion, averaging 3.5 devices for every person on the planet, by 2015. That number is expected
to double, to 50 billion, by 2020. This growth in storage capacity and network traffic is far outstripping the
infrastructure build-out required to support it, a phenomenon known as the data deluge gap.
To bridge this gap, the industry needs to leverage smarter silicon technology to scale datacenter
infrastructures more cost effectively. Besides helping close the data deluge gap, smarter data processing
offers potential dramatic improvements in application performance. A recent survey of 412 European
datacenter managers conducted by LSI revealed that while 93 percent acknowledged the critical
importance of improving application performance, a full 75 percent do not feel that they are achieving the
desired results. This indicates that there is rising pressure on datacenter managers to find smarter ways
to push systems to do much more work within the same power and cost profiles.
Accelerating Networks
Smart software running on general-purpose processors, increasingly with multiple cores, is pervasive in
the datacenter. Processors have long inhabited switches and routers, firewalls and load-balancers, WAN
accelerators and VPN gateways. None of these systems is fast enough, however, to keep pace with the
data deluge on its own, for a basic reason: general-purpose processors must treat every byte equally.
While such equality is perfectly acceptable for system-level versatility, it is inadequate for low-level, high-
volume packet processing.
This reality is driving the need for more intelligence in silicon that is purpose-built for specific networking
applications to provide the right balance of performance, power consumption and programmability.
Today’s smart silicon has reached a level of price/performance that makes it more cost-effective than
adding general-purpose processors.
The latest generation of smart silicon typically features multiple cores of general-purpose processors and
multiple acceleration engines for common networking functions, such as packet classification with deep
packet inspection, security processing, especially for encryption and decryption, and traffic management.
Some of these acceleration engines are so powerful they can completely offload specialized network
processing from general-purpose processors, making it easier to perform switching, routing and other
networking functions entirely in smart line cards installed in servers and networking appliances to further
accelerate overall network performance.
In many organizations today, microseconds matter, driving strong demand for faster response times. For
trading firms, latency can be measured in millions of dollars per millisecond. For others, such as online
retailers, every millisecond of delay can mean lost sales and fading customer loyalty. Tomorrow’s
datacenter networks will need to be both faster and flatter, and therefore, smarter than ever. To eliminate
the data deluge gap and maximize performance, systems need to be smarter, and those smarts will
increasingly need to take the form of purpose-built silicon.
About the Author
Michael Merluzzi is product marketing manager in the Networking Solutions Group of LSI Corporation.
Focusing on mobile backhaul applications, Merluzzi is responsible for marketing of integrated platform
solutions and application-enabling software for the LSI Axxia family of multicore communication
processors. Previously, he held a variety of roles in technical marketing, applications engineering and
software development. Merluzzi holds a bachelor's degree in Electrical Engineering from The Pennsylvania
State University and master's degrees in Business Administration and Computer Engineering from Lehigh
University.
Accelerating SAN Storage with Server Flash Caching
By Tony Afshary
The data deluge, with its relentless increase in the volume and velocity of data, has brought renewed
focus on an old problem: the enormous performance gap that exists in input and output (I/O) operations
between a server’s memory and disk storage. I/O takes a mere 100 nanoseconds for information stored in
a server’s memory, whereas I/O to a hard disk drive (HDD) takes about 10 milliseconds — a difference of
five orders of magnitude that is having a profound adverse impact on application performance and
response times.
The lower bandwidth and higher latency in a storage area network (SAN) or network-attached storage
(NAS) combine to exacerbate the performance problem, which gets even worse with the frequent traffic
congestion on the intervening Fibre Channel (FC), FC over Ethernet, iSCSI or Ethernet network. This
storage bottleneck has grown over the years as the increase in drive capacities has outstripped the
decrease in latency of faster-spinning drives. As a result, the performance limitations of most applications
have become tied to latency more than bandwidth or I/Os per second (IOps), and this trend is expected to
accelerate as the amount of data being created continues to grow between 30 and 50 percent per year.
It is instructive to look at the situation from another perspective. The past three decades have witnessed a
3000 times increase in network bandwidth, while network latency has been reduced by only about 30
times. During the same period, the gains in processor performance, disk capacity and memory capacity
have similarly eclipsed the relatively modest reduction in latency.
The extent of the problem became apparent in a recent survey conducted by LSI of 412 European
datacenter managers. The results revealed that while 93 percent acknowledge the critical importance of
optimizing application performance, a full 75 percent do not feel they are achieving the desired results.
Not surprisingly, 70 percent of the survey respondents cited storage I/O as the single biggest bottleneck
in the datacenter today.
The challenge will only get greater, caused by what LSI calls the data deluge gap — the disparity between
the 30 to 50 percent annual growth in storage capacity requirements and the 5 to 7 percent annual
increase in IT budgets. The net effect is that data is growing faster than the IT infrastructure investment
required to store, transmit, analyze and manage it. The result is that IT departments and datacenter
managers are under increasing pressure to find smarter ways to bridge the data deluge gap and improve
performance.
Cache in a Flash
Caching content to memory in a server or in a SAN on a Dynamic RAM (DRAM) cache appliance is a
proven technique for improving storage performance by reducing latency, and thereby improving
application-level performance. But because the amount of memory possible in a server or cache appliance
(measured in gigabytes) is only a small fraction of the capacity of even a single hard disk drive (measured
in terabytes), performance gains from this traditional form of caching are becoming increasingly insufficient
to overcome the challenges of the data deluge gap.
NAND flash memory technology breaks through the cache size limitation imposed by traditional memory
to again make caching the most effective and cost-effective means for accelerating application
performance. As shown in the diagram, NAND flash memory fills the significant void between main
memory and Tier 1 storage in both capacity and latency.
Flash memory fills the void in both latency and capacity between main memory and fast-
spinning hard disk drives.
Solid state memory typically delivers the highest performance gains when the flash cache acceleration
card is placed directly in the server on the PCI Express (PCIe) bus. Embedded or host-based intelligent
caching software is used to place “hot data” (the most frequently accessed data) in the low-latency flash
storage, where data is accessed up to 200 times faster than with a Tier 1 HDD, where less frequently
accessed data is stored.
Astute readers may be questioning how flash cache, with a latency 100 times higher than DRAM, can
outperform traditional caching systems. There are two reasons for this. The first is the significantly higher
capacity of flash memory, which dramatically increases the “hit rate” of the cache. Indeed, with some of
these flash cache cards now supporting multiple terabytes of solid state storage, there is often sufficient
capacity to store entire databases or other datasets as “hot data.”
The second reason involves the location of the flash cache: directly in the server on the high-speed PCIe
bus. With no internal or external connections and no intervening network subject to frequent
congestion, the “hot data” is accessible in a flash (pun intended) and in a deterministic manner under all
circumstances.
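The "hot data" placement that caching software performs can be sketched as a simple least-recently-used (LRU) policy; real caching engines use more sophisticated heuristics, so this Python model is illustrative only:

```python
from collections import OrderedDict

class HotDataCache:
    """Minimal LRU cache sketch: keeps the most recently used blocks."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # refresh recency
            self.hits += 1
            return True                          # served from flash cache
        self.misses += 1
        self.blocks[block_id] = True             # fetch and cache it
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict the coldest block
        return False

cache = HotDataCache(capacity=100)
for _ in range(10):              # a skewed, repetitive workload
    for block in range(80):      # all 80 hot blocks fit in the cache
        cache.read(block)
print(cache.hits / (cache.hits + cache.misses))  # 0.9
```

When the working set fits entirely in the cache, as in this toy run, only the first pass misses; this is exactly the regime that terabyte-scale flash caches make achievable for whole databases.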
Tests show that the performance gains of server-side flash-based caching are both consistent and
significant under real-world conditions. Tests performed by LSI using Quest Benchmark Factory software
and audited by the Transaction Processing Performance Council, clearly demonstrate how a PCIe-based flash
acceleration card can improve database application-level performance by a conservative 5 to 10 times
compared to either direct-attached storage (DAS) or a SAN.
More and Better Flash
As the pricing of flash memory continues to drop and its performance continues to improve, flash memory
will become more prevalent throughout the datacenter. Will flash-based solid state drives (SSDs) ever
replace hard disk drives? No, at least not in the foreseeable future. HDDs have enormous advantages in
storage capacity and in the cost of that capacity on a per-gigabyte basis. And because the vast majority of
data in most organizations is only rarely accessed, the higher latency of HDDs is normally of little
consequence — especially if this “dusty data” can become “hot data” in a PCIe flash cache accelerator on
those rare occasions when it is needed.
The key to making continued improvements in flash price/performance — comparable to that of
processors according to Moore’s Law — is advancements in the flash controllers that facilitate ever-
shrinking NAND memory geometries, already under 20 nanometers. The latest generation of flash
controllers offers sophisticated wear-leveling to improve flash memory endurance, and enhanced error
correction algorithms to improve reliability with RAID-like data protection.
These advances are making it possible for PCIe-based flash caching solutions to provide advanced
capabilities beyond those available with traditional caching. For example, caching has historically been a
read-only technology, but RAID-like data protection for writes to flash memory has the effect of making
the cache the equivalent of a fast storage tier. The addition of acceleration for writes to flash cache (which
are then persisted to RAID-based DAS or SAN) can improve application-level performance by up to 30
times compared to HDD-only storage systems.
The Future of Flash
Flash memory has already become the primary storage in tablets and ultrabooks, and a growing number
of laptop computers. Solid state drives are replacing or supplementing hard disk drives in desktop
computers and the direct-attached storage in servers, while SSD storage tiers are growing larger in SAN
and NAS configurations. And the use of PCIe-based acceleration adapters is growing rapidly owing to their
ability to bridge the data deluge gap better than any other alternative.
Some of the other advantages of flash (not discussed here) are giving these trends additional momentum.
Flash has a higher density than hard disk drives, enabling more storage in a smaller space. Flash also
consumes less power, and therefore, requires less cooling. These advantages are equally beneficial at both
a small scale in a tablet and a large scale in a datacenter.
Even as flash memory becomes more pervasive throughout datacenters, there will continue to be a need
for PCIe flash acceleration cards in servers for quite some time. Indeed, the flash cache is expected to
remain the most effective and cost-effective way to accelerate application performance for the foreseeable
future.
Tony Afshary is the director of marketing for the Accelerated Solutions Division of LSI Corporation.
Understanding SSD over-provisioning
Kent Smith, LSI Corporation
The over-provisioning of NAND flash memory in solid state drives (SSDs) and flash memory-based
accelerator cards (cache) is a required practice in the storage industry owing to the need for a controller
to manage the NAND flash memory. This is true for all segments of the computer industry—from
ultrabooks and tablets to enterprise and cloud servers.
Essentially, over-provisioning allocates a portion of the total flash memory available to the flash storage
processor, which it needs to perform various memory management functions. This leaves less usable
capacity, of course, but results in superior performance and endurance. More sophisticated applications
require more over-provisioning, but the benefits inevitably outweigh the reduction in usable capacity.
The Need for Over-provisioning NAND Flash Memory
NAND flash memory is unlike both random access memory and magnetic media, including hard disk
drives, in one fundamental way: there is no ability to overwrite existing content. Instead, entire blocks of
flash memory must first be erased before any new pages can be written.
With a hard disk drive (HDD), for example, the act of “deleting” files affects only the metadata in the
directory. No data is actually deleted on the drive; the sectors used previously are merely made available
as “free space” for storing new data. This is the reason “deleted” files can be recovered (or “undeleted”)
from HDDs, and why it is necessary to actually erase sensitive data to fully secure a drive.
With NAND flash memory, by contrast, free space can only be created by actually deleting or erasing the
data that previously occupied any block of memory. The process of reclaiming blocks of flash memory that
no longer contain valid data is called “garbage collection.” Only when the blocks, and the pages they
contain, have been cleared in this fashion are they then able to store new data during a write operation.
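A toy model of this reclamation step might look like the following sketch; a real FSP tracks page state per block and relocates still-valid pages before erasing, but the victim-selection idea is the same (all names and numbers here are hypothetical):

```python
def garbage_collect(blocks):
    """Pick the block with the fewest valid pages, return those pages
    for rewriting elsewhere, and erase the block (modeled as emptying
    it). blocks: list of sets, each holding a block's valid page IDs."""
    victim = min(range(len(blocks)), key=lambda i: len(blocks[i]))
    relocated = blocks[victim]       # valid pages to copy out first
    blocks[victim] = set()           # erased: whole block now writable
    return victim, relocated

# Hypothetical layout: block 1 holds only one valid page, so it is the
# cheapest block to reclaim.
blocks = [{1, 2, 3}, {7}, {4, 5, 6, 8}]
victim, pages = garbage_collect(blocks)
print(victim, sorted(pages))         # 1 [7]
```

Note that the one relocated page must itself be rewritten to flash, which is precisely how garbage collection contributes to write amplification.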
The flash storage processor (FSP) is responsible for managing the pages and blocks of memory, and also
provides the interface with the operating system’s file subsystem. This need to manage individual cells,
pages and blocks of flash memory requires some overhead, and that in turn, means that the full amount
of memory is not available to the user. To provide a specified amount of user capacity it is therefore
necessary to over-provision the amount of flash memory, and as will be shown later, the more over-
provisioning the better.
The portion of total NAND flash memory capacity held in reserve (unavailable to the user) for use by the
FSP is used for garbage collection (the major use); FSP firmware (a small percentage); spare blocks
(another small percentage); and optionally, enhanced data protection beyond the basic error correction
(space requirement varies).
Even though there is a loss in user capacity with over-provisioning, the user does receive two important
benefits: better performance and greater endurance. The former is one of the reasons for using flash
memory, including in solid state drives (SSDs), while the latter addresses an inherent limitation in flash
memory.
Percentage Over-provisioning
The equation for calculating the percentage of over-provisioning is rather straightforward:

Over-provisioning percentage = ((Physical capacity - User capacity) / User capacity) x 100

For example, in a configuration consisting of 128 Gigabytes (GB) of flash memory total, 120 GB of which is available to the user, the system is over-provisioned by (128 - 120) / 120 = 6.7 percent, which is typically rounded up to 7 percent.
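The calculation can be sketched as a small Python helper (the function name is ours, purely for illustration):

```python
def over_provisioning_pct(physical_gb: float, user_gb: float) -> float:
    """Over-provisioning as a percentage of the user-visible capacity."""
    return (physical_gb - user_gb) / user_gb * 100

# 128 GB of physical flash, 120 GB exposed to the user:
print(round(over_provisioning_pct(128, 120), 1))  # 6.7, typically rounded to 7
```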
It is also important to note another factor that often causes confusion: a binary gigabyte (a gibibyte) is not the same as a decimal gigabyte. As shown in Figure 1, a binary GB is 7.37 percent larger than a decimal GB. Most operating systems display the binary representation for both memory and storage, which makes over-provisioning appear smaller: the actual number of bytes is 7.37 percent higher than the number displayed. This is why an SSD listed as providing 128 GB of user space can still function with 128 GB of physical memory. Using the calculation above, the over-provisioning amount would appear to be zero percent, which is impossible for NAND flash; in reality the drive is over-provisioned by roughly 7.37 percent.
Figure 1. The difference between a binary Gigabyte and a decimal Gigabyte
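The binary/decimal discrepancy is easy to verify directly; the short snippet below reproduces the 7.37 percent figure:

```python
# A binary gigabyte (GiB, 2**30 bytes) vs. a decimal gigabyte (10**9 bytes).
binary_gb = 2 ** 30
decimal_gb = 10 ** 9

difference_pct = (binary_gb - decimal_gb) / decimal_gb * 100
print(round(difference_pct, 2))  # 7.37

# An SSD marketed as 128 (decimal) GB of user space, built from 128 GiB of
# physical flash, is therefore inherently over-provisioned by the same amount:
physical_bytes = 128 * binary_gb
user_bytes = 128 * decimal_gb
print(round((physical_bytes - user_bytes) / user_bytes * 100, 2))  # 7.37
```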
Test Environment
To isolate the over-provisioning variable, the tests were conducted on a single SSD with Toshiba MLC
(multi-level cell) 24nm NAND flash memory controlled by an LSI SF-2281 flash storage processor. It is
important to note that the FSP used employs the LSI DuraWrite™ technology that optimizes writes to flash
memory, and utilizes intelligent block management and wear-leveling to improve reliability and
endurance. These capabilities combine to afford over five years of useful life for MLC-based flash memory
with typical use cases.
Previous testing performed by LSI revealed that entropy has an effect on performance only for SSDs
without data reduction technology. For this reason, the red lines in the graphs showing the results for
100% entropy are labeled “Typical SSDs.” This series of tests, which used SSDs equipped with LSI DuraWrite data reduction technology, was designed to evaluate performance at different levels of both over-provisioning and entropy, and specifically to test the hypothesis that data reduction can improve performance at lower levels of entropy.
Test result data points are based on post-garbage collection, steady state operation. All preconditioning
used the same transfer size and type as the test result (e.g. random 4KB results are preconditioned with
random 4KB transfers until reaching steady state operation).
VDBench V5.02 was used as the main test software with IOMeter V1.1.0 providing cross-check
verification. The test PC was configured with an Intel Core i5-2500K 3.30 GHz processor, the Intel H67
Express chipset, Intel Rapid Storage Technology 10.1.0.1008 (with AHCI Enabled); 4 GB of 1333 MHz
RAM; and Windows 7 Professional (32-bit).
Performance Test Results
Sequential write performance was uniform across all tested over-provisioning levels, ranging from zero to 75 percent. This flat performance derives from the nature of sequential writes to flash. As data is written sequentially to flash memory, it completely fills all of the pages in a block. When the drive becomes full, blocks whose data is no longer valid must first be erased via the garbage collection process, which in this case simply erases entire blocks without needing to move (read then write) any individual pages that might otherwise still be valid. Because there are no incremental writes during this form of garbage collection, there is no benefit from additional free space. With SSDs that use a data reduction technology like DuraWrite from LSI, the level of flat performance increases as a function of the entropy (data randomness): the lower the entropy, the higher the performance. In this situation, however, the increase in performance comes from the reduced writes being completed sooner, not from the additional free space.
Throughput performance for sustained 4KB random writes improved as the amount of over-provisioning
increased. Additionally, for SSDs with DuraWrite data reduction technology, the throughput improvement
also increased at all levels of entropy.
Figure 2 shows the results of this test. Increased over-provisioning improves performance for random writes because of how garbage collection operates. As data is written randomly, the
logical block addresses (LBAs) being updated are distributed across all the blocks of the flash. This causes
a number of small “holes” of invalid data pages among valid data pages. During garbage collection those
blocks with invalid data pages require the valid data to be read and moved to new empty blocks. This
background read and write operation requires time to execute and prevents the SSD from responding to
read and write requests from the host, giving the perception of slower overall performance. When the
over-provisioning is a higher percentage of the total flash memory, the time required for garbage
collection is reduced, enabling the SSD to operate faster.
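The mechanism described above can be illustrated with a deliberately simplified simulation. This toy model (uniform random single-page host writes, greedy victim selection) is not the algorithm any real flash storage processor uses, but it shows why more spare area means fewer relocated pages and lower write amplification:

```python
import random

def simulate_write_amplification(user_pages=2048, op_fraction=0.28,
                                 pages_per_block=64, host_writes=60_000,
                                 seed=0):
    """Write amplification of a toy SSD under uniform random host writes.

    Greedy garbage collection: when no free block remains, the block with
    the fewest valid pages is reclaimed and its surviving pages rewritten.
    """
    rng = random.Random(seed)
    n_blocks = int(user_pages * (1 + op_fraction)) // pages_per_block
    valid = [set() for _ in range(n_blocks)]   # valid logical pages per block
    where = {}                                 # logical page -> block index
    free = list(range(1, n_blocks))            # block 0 starts as open block
    open_blk, fill, physical = 0, 0, 0

    for _ in range(host_writes):
        pending = [rng.randrange(user_pages)]  # one host write (plus GC moves)
        while pending:
            lp = pending.pop()
            if lp in where:                    # the old copy becomes invalid
                valid[where[lp]].discard(lp)
            if fill == pages_per_block:        # open block is full
                if not free:                   # reclaim the emptiest block
                    victim = min((b for b in range(n_blocks) if b != open_blk),
                                 key=lambda b: len(valid[b]))
                    pending.extend(valid[victim])  # survivors must be rewritten
                    valid[victim].clear()
                    free.append(victim)        # victim is erased
                open_blk, fill = free.pop(), 0
            valid[open_blk].add(lp)            # program the page
            where[lp] = open_blk
            fill += 1
            physical += 1
    return physical / host_writes

# More spare area -> fewer valid pages relocated -> lower write amplification:
for op in (0.07, 0.28, 0.50):
    print(f"OP {op:.0%}: WA = {simulate_write_amplification(op_fraction=op):.2f}")
```

The printed write amplification falls steadily as the over-provisioning fraction rises, mirroring the trend the tests measured.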
Figure 2. The effect of over-provisioning on write performance throughput
The need for garbage collection and wear-leveling with NAND flash memory causes the amount of data
being physically written to be a multiple of the logical data intended to be written. This phenomenon is
expressed as a simple ratio called “write amplification,” which ideally would approach 1.0 for standard
SSDs with sequential writes, but typically is much higher due to the addition of random writes in most
environments. With SSDs that have DuraWrite technology, the typical user experiences a much lower
write amplification, often averaging only 0.5. Keeping write amplification low is important for extending the flash memory’s useful life.
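Write amplification is simply the ratio of physical to logical writes; a trivial sketch (the byte counts are illustrative, not measured values):

```python
def write_amplification(physical_bytes_written, host_bytes_written):
    """Ratio of data physically written to flash vs. data the host sent."""
    return physical_bytes_written / host_bytes_written

# A standard SSD: garbage collection and wear-leveling add extra writes.
print(write_amplification(1.5e9, 1.0e9))  # 1.5

# With data reduction, less data than the host sent may reach the flash,
# so the ratio can drop below 1.0 (the article cites ~0.5 as typical).
print(write_amplification(0.5e9, 1.0e9))  # 0.5
```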
Random write operations have the greatest impact on write amplification, so to best view the effect of
over-provisioning on write amplification, tests were conducted under those conditions. As shown in Figure
3, write amplification for sustained 4KB random writes benefited significantly from a higher percentage of
over-provisioning for SSDs that do not include DuraWrite technology. For SSDs that do include DuraWrite or a similar data reduction technology, the improvement in write amplification increased at a higher rate at higher levels of entropy.
Note also how the use of a data reduction technology like DuraWrite minimizes the benefits of over-
provisioning for lower levels of entropy. When the entropy of the user data is low, DuraWrite is able to
reduce the amount of space consumed in the flash memory. Because the operating system is unaware of
this reduction, the extra space is automatically used by the flash storage processor as additional over-
provisioning space. As the entropy of the data increases, the additional free space decreases. At 100
percent entropy the additional over-provisioning is zero, which is the same result as a “Typical SSD” (red
line) that does not employ a data reduction technology. Referring again to Figure 3, a standard SSD with
28 percent over-provisioning would have the same write amplification as an SSD with DuraWrite
technology at zero percent over-provisioning for data with an entropy as high as 75 percent.
Figure 3. The effect of over-provisioning on write amplification
With the advent of SSDs, and the need to manage them differently from traditional HDDs, a TRIM
command was added to storage protocols to enable operating systems to designate blocks of data that are
no longer valid. Until the SSD is informed that data is invalid (either by a TRIM command or by a new write to a currently occupied LBA), it will continue to preserve that data during the garbage collection process, resulting in less free space and higher write amplification. TRIM enables the SSD to perform its garbage collection and free up the storage
space occupied by invalid data in advance of future write operations.
Figure 4 shows the effect of the TRIM command on over-provisioning. For a “marketed” percentage of
over-provisioning (28 percent in this example), the amount effectively increases after performing a TRIM
operation. Note how the capacity originally designated as Free Space remains consumed as Presumed
Valid Data by the SSD after being deleted by the operating system or the user until a TRIM command is
received. In effect, the TRIM operation provides dynamic over-provisioning because it increases the
resulting over-provisioning after completion.
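The dynamic over-provisioning effect of TRIM can be worked through numerically using the article’s 28 percent example (the 40 GB of deleted-but-untrimmed data below is a hypothetical figure chosen for illustration):

```python
def spare_area_pct(physical_gb, preserved_gb):
    """Spare area available for garbage collection, as a percentage of the
    data the SSD must still treat as valid."""
    return (physical_gb - preserved_gb) / preserved_gb * 100

physical_gb, user_gb = 128, 100  # the article's 28% marketed example
deleted_gb = 40                  # hypothetical: deleted by the OS, not yet trimmed

# Before TRIM, deleted data is still "presumed valid" and must be preserved:
print(round(spare_area_pct(physical_gb, user_gb), 1))               # 28.0
# After TRIM, the stale 40 GB becomes additional (dynamic) spare area:
print(round(spare_area_pct(physical_gb, user_gb - deleted_gb), 1))  # 113.3
```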
Figure 4. The effect of the TRIM command on over-provisioning percentage
Conclusion
The over-provisioned capacity of NAND flash memory creates the space the flash storage processor needs
to manage the flash memory more intelligently and effectively. As shown by these test results, higher
percentages of over-provisioning improve both write performance and write amplification. Higher
percentages of over-provisioning can also improve the endurance of flash memory and enable more robust
forms of data protection beyond basic error correction.
Only SSDs that utilize a data reduction technology, such as DuraWrite in the LSI SandForce flash storage
processors, can take advantage of lower levels of entropy to improve performance based on the increase
in “dynamic” over-provisioning.
Owing to the many benefits of over-provisioning, a growing number of SSDs now enable users to control
the percentage of over-provisioning by allocating a smaller portion of the total available flash memory to
user capacity during formatting. With increased capacities based on the ever-shrinking geometries of
NAND flash memory technology, combined with steady advances in flash storage processors, it is
reasonable to expect that over-provisioning will become less of an issue with users over time.
About the author
Kent Smith is senior director of Marketing for the Flash Components Division of LSI Corporation, where he
is responsible for all outbound marketing and performance analysis. Prior to LSI, Smith was the senior
director of Corporate Marketing at SandForce, which was acquired by LSI in 2012, his second company to
be sold to LSI. He has over 25 years of marketing and management experience in the storage and high-
tech industry, holding senior management positions at companies including SiliconStor, Polycom, Adaptec,
Acer and Quantum. Smith holds an MBA from the University of Phoenix.
Next-generation multicore SoC architectures for
tomorrow's communications networks
David Sonnier, LSI Corporation
IT managers are under increasing pressure to boost network capacity and performance to cope
with the data deluge. Networking systems are under a similar form of stress with their
performance degrading as new capabilities are added in software. The solution to both needs is
next-generation System-on-Chip (SoC) communications processors that combine multiple cores
with multiple hardware acceleration engines.
The data deluge, with its massive growth in both mobile and enterprise network traffic, is driving
substantial changes in the architectures of base stations, routers, gateways, and other networking
systems. To maintain high performance as traffic volume and velocity continue to grow, next-generation
communications processors combine multicore processors with specialized hardware acceleration engines
in SoC ICs.
The following discussion examines the role of the SoC in today’s network infrastructures, as well as how
the SoC will evolve in coming years. Before doing so, it is instructive to consider some of the trends
driving this need.
Networks under increasing stress
In mobile networks, per-user access bandwidth is increasing by more than an order of magnitude from
200-300 Mbps in 3G networks to 3-5 Gbps in 4G Long-Term Evolution (LTE) networks. LTE-Advanced technology will double bandwidth again to 5-10 Gbps. Higher-speed access networks will need more and
smaller cells to deliver these data rates reliably to a growing number of mobile devices.
In response to these and other trends, mobile base station features are changing significantly.
Multiple radios are being used in cloud-like distributed antenna systems. Network topologies are
flattening. Operators are offering advanced Quality of Service (QoS) and location-based services and
moving to application-aware billing. The increased volume of traffic will begin to place considerable stress
on both the access and backhaul portions of the network.
Traffic is similarly exploding within data center networks. Organizations are pursuing limitless-scale
computing workloads on virtual machines, which is breaking many of the traditional networking protocols
and procedures. The network itself is also becoming virtual and shifting to a Network-as-a-Service (NaaS)
paradigm, which is driving organizations to a more flexible Software-Defined Networking (SDN)
architecture.
These trends will transform the data center into a private cloud with a service-oriented network. This
private cloud will need to interact more seamlessly and securely with public cloud offerings in hybrid
arrangements. The result will be the need for greater intelligence, scalability, and flexibility throughout
the network.
Moore’s Law not keeping pace
Once upon a time, Moore’s Law – the doubling of processor performance every 18 months or so – was
sufficient to keep pace with computing and networking requirements. Hardware and software advanced in
lockstep in both computers and networking equipment. As software added more features with greater
sophistication, advances in processors maintained satisfactory levels of performance. But then along came
the data deluge.
In mobile networks, for example, traffic volume is growing by some 78 percent per year, owing mostly to
the increase in video traffic. This is already causing considerable congestion, and the problem will only get
worse when an estimated 50 billion mobile devices are in use by 2016 and the total volume of traffic
grows by a factor of 50 in the coming decade.
In data centers, data volume and velocity are also growing exponentially. According to IDC, digital data
creation is rising 60 percent per year. The research firm’s Digital Universe Study predicts that annual data
creation will grow 44-fold between 2009 and 2020 to 35 zettabytes (35 trillion gigabytes). All of this data
must be moved, stored, and analyzed, making Big Data a big problem for most organizations today.
With the data deluge demanding more from network infrastructures, vendors have applied a Band-Aid to
the problem by adding new software-based features and functions in networking equipment. Software has
now grown so complex that hardware has fallen behind. One way for hardware to catch up is to use
processors with multiple cores. If one general-purpose processor is not enough, try two, four, 16, or
more.
Another way to improve hardware performance is to combine something new – multiple cores – with
something old – Reduced Instruction Set Computing (RISC) technology. With RISC, less is more based on
the uniform register file load/store architecture and simple addressing modes. ARM, for example, has
made some enhancements to the basic RISC architecture to achieve a better balance of high performance,
small code size, low power consumption, and small silicon area, with the last two factors being important
to increasing the core count.
Hardware acceleration necessary, but …
General-purpose processors, regardless of the number of cores, are simply too slow for functions that
must operate deep inside every packet, such as packet classification, cryptographic security, and
traffic management, which is needed for intelligent QoS. Because these functions must often be performed
in serial fashion, there is limited opportunity to process them simultaneously in multiple cores. For these
reasons, such functions have long been performed in hardware, and it is increasingly common to have
these hardware accelerators integrated with multicore processors in specialized SoC communications
processors.
The number of function-specific acceleration engines available also continues to grow, and more engines
(along with more cores) can now be placed on a single SoC. Examples of acceleration engines include
packet classification, deep packet inspection, encryption/decryption, digital signal processing, transcoding,
and traffic management. It is even possible now to integrate a system vendor’s unique intellectual
property into a custom acceleration engine within an SoC. Taken together, these advances make it
possible to replace multiple SoCs with a single SoC in many networking systems (see Figure 1).
Figure 1: SoC communications processors combine multiple
general-purpose processor cores with multiple task-specific
acceleration engines to deliver higher performance with a
lower component count and lower power consumption.
In addition to delivering higher throughput, SoCs reduce the cost of equipment, resulting in a significant
price/performance improvement. Furthermore, the ability to tightly couple multiple acceleration engines
makes it easier to satisfy end-to-end QoS and service-level agreement requirements. The SoC also offers
a distinct advantage when it comes to power consumption, which is an increasingly important
consideration in network infrastructures, by providing the ability to replace multiple
discrete components in a single energy-efficient IC.
The powerful capabilities of today’s SoCs make it possible to offload packet processing entirely to the line cards of systems such as routers and switches. In distributed architectures like the IP Multimedia Subsystem (IMS) and SDN,
the offload can similarly be distributed among multiple systems, including servers.
Although hardware acceleration is necessary, the way it is implemented in some SoCs today may no
longer be sufficient in applications requiring deterministic performance. The problem is caused by the
workflow within the SoC itself when packets must pass through several hardware accelerators, which is
increasingly the case for systems tasked with inspecting, transforming, securing, and otherwise
manipulating traffic.
If traffic must be handled by a general-purpose processor each time it passes through a different
acceleration engine, latency can increase dramatically, and deterministic performance cannot be
guaranteed under all circumstances. This problem will get worse as data rates increase in Ethernet
networks from 1 Gbps to 10 Gbps, and in mobile networks from 300 Mbps in 3G networks to 5 Gbps in 4G
networks.
Next-generation multicore SoCs
LSI addresses the data path problem in its Axxia SoCs with Virtual Pipeline technology. The Virtual
Pipeline creates a message-passing control path that enables system designers to dynamically specify
different packet-processing flows that require different combinations of multiple acceleration engines. Each
traffic flow is then processed directly through any engine in any desired sequence without intervention
from a general-purpose processor (see Figure 2). This design natively supports connecting different
heterogeneous cores together, enabling more flexibility and better power optimization.
Figure 2: To maximize performance, next-generation SoC
communications processors process packets directly and
sequentially in multiple acceleration engines without intermediate
intervention from the CPU cores.
In addition to faster, more efficient packet processing, next-generation SoCs also include more general-
purpose processor cores (to 32, 64, and beyond), highly scalable and lower-latency interconnects,
nonblocking switching, and a wider choice of standard interfaces (Serial RapidIO, PCI Express, USB, I2C,
and SATA) and higher-speed Ethernet interfaces (1G, 2.5G, 10G, and 40G+). To easily integrate these
increasingly sophisticated capabilities into a system’s design, software development kits are enhanced
with tools that simplify development, testing, debugging, and optimization tasks.
Next-generation SoC ICs accelerate time to market for new products while lowering both manufacturing
costs and power consumption. With deterministic performance for data rates in excess of 40 Gbps,
embedded hardware is once again poised to accommodate any additional capabilities required by the data
deluge for another three to four years.
Why next-generation infrastructures need
smarter silicon By Jim Anderson
Given the explosive growth in data traffic, Moore's Law is not enough to keep pace with
demand for higher network speeds. A smarter silicon and software approach is needed.
This vendor-written tech primer has been edited by Network World to eliminate product promotion, but
readers should note it will likely favor the submitter's approach.
Among the best ways to accelerate the performance of mobile and data center networks is to combine
general-purpose processors with smart silicon accelerator engines that significantly streamline the way
bits are prioritized and moved to optimize network performance and cloud-based services.
One of the fundamental challenges facing the industry is the data deluge gap -- the disparity between the
30% to 50% annual growth in network and storage capacity requirements and the 5% to 7% annual
increase in IT budgets. The growing adoption of cloud-based services and soaring generation and
consumption of data storage are driving exponential growth in the volume of data crossing the network to
and from the cloud. With the growth in data traffic far outstripping the infrastructure build-out required to
support it, network managers are under pressure to find smarter ways to improve performance.
Cloud data center networks were built with existing technologies and have thus far succeeded in
improving performance through brute force -- adding more hardware such as servers, switches, processor
cores and memory. This approach, however, is costly and unsustainable, increasing hardware costs along
with floor space, cooling and power requirements, and falls well short of solving the problem of network
latency.
Adding intelligence in the form of smarter silicon streamlines processing of data packets traversing mobile
and data center networks. In particular, smart silicon enables next-generation networks to understand the
criticality of data, then manipulate, prioritize and route it in ways that reduce overall traffic and ensure important digital information, such as real-time data for voice and video, is delivered on time.
Smarter networks
General-purpose processors, which increasingly feature multiple cores, pervade network infrastructures.
These processors drive switches and routers, firewalls and load-balancers, WAN accelerators and VPN
gateways. None of these systems is fast enough, however, to keep pace with the data deluge on its own,
and for a basic reason: general-purpose processors are designed purely for compute-centric, server-class
workloads and are not optimized for handling the unique network-centric workloads in current and next-
generation infrastructures.
Smart silicon, however, can accelerate throughput for real-time workloads, such as high-performance
packet processing, while ensuring deterministic performance over changing traffic demands.
Smart silicon typically features multiple cores of general-purpose processors complemented by multiple
acceleration engines for common networking functions, such as packet classification with deep packet
inspection, security processing and traffic management. Some of these acceleration engines are powerful
enough to completely offload specialized packet processing tasks from general-purpose processors,
making it possible to perform switching, routing and other networking functions entirely in fast path
accelerators to vastly improve overall network performance. Offloading compute-intensive workloads to
acceleration engines that are optimized for a particular workload can also deliver a significant
performance-per-watt advantage over purely general-purpose processors.
Customized smart silicon can be a great option for a network equipment vendor wanting to carve out a
unique competitive advantage by integrating its own optimizations. For example, a vendor's proprietary,
differentiating intellectual property can be integrated into silicon to provide advantages over general-
purpose processors, including for optimized baseband processing, deep packet inspection and traffic
management. This level of integration requires close collaboration between network equipment and
semiconductor vendors.
Tomorrow's data center network will need to be both faster and flatter, and therefore smarter than ever.
One of the key challenges to overcome in virtualized mega data centers is control plane scalability. To
enable cloud-scale data centers, the control plane needs to scale either up or out. In the traditional scale-
up approach, additional or more powerful compute engines, acceleration engines or both are deployed to
help scale up networking control plane performance.
In emerging scale-out architectures like software-defined networking (SDN), the control plane is
separated from the data plane, and then typically executed on standard servers. In both scale-up and
scale-out architectures, intelligent multicore communications processors that combine general-purpose
processors with specialized hardware acceleration engines can dramatically improve control plane
performance. Some functions, such as packet processing and traffic management, often can be offloaded
to line cards equipped with these purpose-built communications processors.
While the efficacy of distributing the control and data planes remains an open question, it's clear that SDN
will need smart silicon to deliver on its promise of scalable performance.
Smarter storage
Smarter silicon in storage can also help close the data deluge gap. The storage I/O choke point is rooted
in the mechanics of traditional hard disk drive (HDD) platters and actuator arms and their speed limits in
transferring data from the disk media, as evidenced in the difference of five orders of magnitude in I/O
latency between memory (at 100 nanoseconds) and Tier 1 HDDs (at 10 milliseconds).
Another limitation is the amount of memory that can be supported in traditional caching systems
(measured in gigabytes), which is a small fraction of the capacity of a single disk drive (measured in
terabytes). Both offer little room for performance improvements beyond increasing the gigabytes
of Dynamic RAM (DRAM) in caching appliances or adding more of today's fast-spinning HDDs.
Solid state storage in the form of NAND flash memory, on the other hand, is particularly effective in
bridging this significant bottleneck, delivering high-speed I/O similar to memory at capacities on a par
with HDDs. For its part, smart silicon delivers sophisticated wear-leveling, garbage collection and unique
data reduction techniques to improve flash memory endurance and enhanced error correction algorithms
for RAID-like data protection. Flash memory helps bridge both the capacity and latency gap between
DRAM caching and HDDs.
Solid state memory typically delivers the highest performance gains when the flash cache acceleration
card is placed directly in the server on the PCI Express (PCIe) bus. Embedded or host-based intelligent
caching software is used to place "hot data" in the flash memory, where data can be accessed in 20
microseconds -- 140 times faster than with a Tier 1 HDD, at 2,800 microseconds. Some of these cards
support multiple terabytes of solid state storage, and a new class of solution now also offers both internal
flash and Serial-Attached SCSI (SAS) interfaces to combine high-performance solid state and RAID HDD
storage. A PCIe-based flash acceleration card can improve database application-level performance by five
to 10 times in DAS and SAN environments.
Smart silicon is at the heart of all of these solutions. So without the deep inside view of the semiconductor
vendors, the system vendors would have no hope of ever closing the data deluge gap.
Avoiding “Whack-A-Mole” in the Data Center
By Jeff Richardson
It’s a curse in any network infrastructure, especially in the data center: clear one performance bottleneck,
and another drag on data or application speed surfaces elsewhere in a never-ending game of “Whack-A-
Mole.” In today’s data centers, the “Whack-A-Mole” mallet is swinging like never before as these
bottlenecks pop up with increasing frequency in the face of the data deluge—the exponential growth of
digital information worldwide.
Some of these choke points are familiar, such as the timeworn input/output (I/O) path between servers
and disk storage, whether directly attached or in a storage-area network, as microprocessor capability and speed have outpaced storage. Other, newer bottlenecks are cropping up with the growing consolidation and
virtualization of servers and storage in data center clouds as more organizations deploy cloud
architectures to pool storage, processing and networking in order to increase computing resource
efficiency and utilization, improve resiliency and scalability, and reduce costs.
Improving data center efficiency has always come down to balancing and optimizing these resources, but
this calibration is being radically disturbed today by major transitions in the network, such as the growth
of Gigabit Ethernet to 10 Gigabit and soon to 40 Gigabit, the emergence of multicore and other ever-faster
processors, and the rising deployments of solid-state storage. As virtualization increases server utilization,
and therefore efficiency, it also exacerbates interactive resource conflicts in memory and I/O. And even
more resource conflicts are bound to emerge as big-data applications evolve to run over ever-growing
clusters of tens of thousands of computers that process, manage and store petabytes of data.
With these dynamic changes to the data center, maintaining acceptable levels of performance is becoming
a greater challenge. But there are proven ways to address the most common bottlenecks today—ways
that will give IT managers a stronger hand in the high-stakes bottleneck reduction contest.
Bridging the I/O Gap Between Memory and Hard-Disk Drives
Hard-disk drive (HDD) I/O is a major bottleneck in direct-attached storage (DAS) servers, storage-area
networks (SANs) and network-attached storage (NAS) arrays. Specifically, I/O to memory in a server
takes about 100 nanoseconds, whereas I/O to a Tier One HDD takes about 10 milliseconds—a difference
of 100,000 times that chokes application performance. Latency in a SAN or NAS often is even higher
because of data-traffic congestion on the intervening Fibre Channel (FC), FC over Ethernet or iSCSI
network.
These bottlenecks have grown over the years as increases in drive capacity have outstripped decreases in
latency of faster-spinning drives, and in confronting the data deluge, IT managers have needed to add
more hard disks and deeper queues just to keep pace. As a result, the performance limitations of most
applications have become tied to latency instead of bandwidth or I/Os per second (IOPS), and this
problem threatens to worsen as the need for storage capacity continues to grow by 50–100 percent per
year. Keep in mind that the last three decades have seen only a 30x reduction in latency, while network
bandwidth has improved 3,000x over the same period. Processor throughput, disk capacity and memory
capacity have also seen large gains.
Caching content to memory in a server or in the SAN on a dynamic RAM (DRAM) cache appliance can help
reduce latency, and therefore improve application-level performance. But because the amount of memory
possible in a server or cache appliance, measured in gigabytes, is only a small fraction of the capacity of
even a single hard-disk drive, measured in terabytes, performance gains from caching are often
inadequate.
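The bookkeeping behind such caches is essentially least-recently-used (LRU) tracking: keep the most recently touched blocks in fast memory and evict the coldest ones. As an illustrative sketch only (class and method names are ours, not any vendor's API):

```python
from collections import OrderedDict

class HotDataCache:
    """Minimal LRU cache of the kind a DRAM or flash caching layer uses to
    keep 'hot' blocks close to the CPU. Illustrative sketch, not a product."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, block_id):
        if block_id in self._store:
            self._store.move_to_end(block_id)   # mark as most recently used
            return self._store[block_id]
        return None                             # miss: fall through to the HDD

    def put(self, block_id, data):
        self._store[block_id] = data
        self._store.move_to_end(block_id)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict the least recently used
```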
Solid-state storage in the form of NAND flash memory is particularly effective in bridging the significant
latency gap between memory and HDDs. In both capacity and latency, flash memory bridges the gap
between DRAM caching and HDDs, as the chart below shows. Traditionally, flash has been very expensive
to deploy and difficult to integrate into existing storage architectures. Today, decreases in the cost of flash
coupled with hardware and software innovations that ease deployment have made the ROI for flash-based
storage more compelling.
Flash memory fills the void in both latency and capacity between dynamic RAM in a cache appliance and
fast-spinning hard-disk drives.
Solid-state memory typically delivers the highest performance gains when the flash acceleration card is
placed directly in the server on the PCI Express (PCIe) bus. Embedded or host-based intelligent caching
software is used to place “hot data” in the flash memory, where data is accessed in about 20
microseconds—140 times faster than with a Tier One HDD, at 2,800 microseconds—giving users data they
care about far faster. Some of these cards support multiple terabytes of solid-state storage, and a new
class of solution now also offers both internal flash and Serial Attached SCSI (SAS) interfaces to create a
combination high-performance solid-state and RAID HDD storage solution. A PCIe-based flash acceleration
card can improve database application-level performance by 5 to 10 times in either a DAS or SAN
environment.
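The payoff of placing hot data in flash follows directly from the cache hit rate. A simple average-latency model, using the 20-microsecond flash and 2,800-microsecond HDD figures above:

```python
def effective_latency_us(hit_rate, flash_us=20.0, hdd_us=2800.0):
    """Average read latency when a fraction of requests hit the flash cache."""
    return hit_rate * flash_us + (1.0 - hit_rate) * hdd_us

# With a 90% hit rate, average latency drops from 2,800 us to about 298 us,
# roughly a 9x application-visible speedup.
print(effective_latency_us(0.9))
```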
Scaling the Virtualized Data Center Network
One common bottleneck in virtualized data centers today is the switching control plane—a potential choke
point that can limit network performance as the number of virtual machines grows. Control-plane
workloads increase in four sometimes related ways:
Server virtualization adds considerable control overhead, especially when moving virtual machines (VMs)
More and larger server clusters, such as for analyzing big data, substantially increase the traffic flow for
inter-node communications
The explosion in CPU cores—driven by the need to avert bottlenecks in server processing power—
increases both the number of VMs per server and the size of server clusters
Data center networks flatten as they grow, both to accommodate these changes and to maintain latency
and throughput performance in the face of relentless growth
These changes are severely stressing the control plane. During a VM migration, for example, rapid
changes in connections, address resolution protocol (ARP) messages and routing tables can overwhelm
existing control-plane solutions, especially in large-scale virtualized environments. As a result, large-scale
VM data migration is often impractical because of the overhead involved.
To enable large-scale VM migration, the control plane needs to scale either up or out. In the traditional
scale-up approach, the existing control-plane solutions within networking platforms are supplemented by
additional or more-powerful compute engines, acceleration engines or both to help scale control-plane
performance. These supplemental resources free up CPU cycles for other tasks, improving overall network
performance.
In the scale-up architecture, existing network platforms are supplemented by additional and/or more-
powerful compute engines to help execute the network control stack.
In emerging scale-out architectures, the control plane is separated from the data plane, and then typically
executed on standard servers. In some cases, control-plane tasks are divided into sub-tasks, such as
discovery, dissemination and recovery, which are then distributed across these servers. Emerging
architectures such as SDN (software-defined networking) employ scale-out approaches for greater control-
plane scalability. These architectures also enable IT managers to virtualize the network substrate and to
better manage and secure data center traffic.
In the scale-out architecture, the separation and distribution of the control and data planes lends itself
well to software-defined networking, such as with OpenFlow.
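The scale-out division of the control plane into sub-tasks can be sketched as a placement function that deterministically maps each sub-task to a controller server in a pool. This is a toy illustration; the server names and hashing scheme are our assumptions, not any product's design:

```python
import hashlib

# Toy sketch of the scale-out idea: control-plane sub-tasks are hashed onto a
# pool of commodity controller servers so no single box handles everything.
# Controller names and sub-task list are illustrative assumptions.
CONTROLLERS = ["ctrl-0", "ctrl-1", "ctrl-2"]
SUB_TASKS = ["discovery", "dissemination", "recovery"]

def assign(task, controllers=CONTROLLERS):
    """Deterministically place a sub-task on one controller in the pool."""
    digest = hashlib.sha256(task.encode()).digest()
    return controllers[digest[0] % len(controllers)]

placement = {task: assign(task) for task in SUB_TASKS}
print(placement)
```

Because the mapping is deterministic, any node can recompute where a sub-task lives without consulting a central registry, which is one reason hashed placement is common in distributed designs.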
In both scale-up and scale-out architectures, intelligent multicore communications processors, which
combine general-purpose processors with specialized hardware acceleration engines for specific functions,
can produce dramatic improvements in control-plane performance. Some functions, such as packet
processing and traffic management, often can be offloaded entirely to line cards equipped with such
purpose-built communications processors.
Near-term Advances That Promise to Improve Both Server I/O and Network Performance
In many organizations today, milliseconds matter, driving strong demand for shorter response times. For
some, like trading firms, latency can be measured in millions of dollars per millisecond. For others, such as
online retailers, every millisecond of delay caused by latency can compromise competitiveness and
customer satisfaction, and ultimately directly affect revenue.
As more digital information is driven throughout the data center, fast solid-state storage will be
increasingly deployed for storage server caching, and for solid-state drives (SSDs) in tiered DAS and SAN
configurations. The growth of SSD capacity and shipment volumes continues, reducing the cost per
gigabyte through economies of scale, while smart flash storage processors with sophisticated garbage
collection, wear-leveling and enhanced error-correction algorithms continue to improve SSD endurance.
Increasing use of 10 Gigabit and 40 Gigabit Ethernet, and broad deployment of 12Gbps SAS technology,
will also contribute to higher data rates. Besides doubling the throughput of existing 6Gbps SAS
technology, 12Gbps SAS will use performance improvements in PCIe 3.0 to achieve more than one million
IOPS.
As data center networks continue to flatten, new forms of acceleration and programmability in both the
control and data planes will be needed. Greater use of hardware acceleration for both packet processing
and traffic management will deliver deterministic performance under varying traffic loads in these flat,
scaled-up or scaled-out networks.
More Bottlenecks to Come
As servers move to 10 Gigabit Ethernet, the rack will become its own bottleneck. To help clear this
bottleneck, solid-state storage will shuttle data among servers at high speed, purpose-built PCIe cards will
enable fast inter-server communications, and all components within a rack will likely be restructured to
optimize performance and cost. As data centers begin to resemble private clouds and increasingly employ
public cloud services in a multi-tenant, hybrid arrangement, the switching services plane will need to more
intelligently classify and manage traffic to improve application-level performance and enhance security.
With the increasing use of encrypted and tunneled traffic, these and other CPU-intensive packet
processing tasks will need to be offloaded to function-specific acceleration engines to enable a fully
distributed intelligent fabric.
High-speed communications processors, acceleration engines, solid-state storage and other technologies
that increase performance and reduce latency in data center networks will take on increasing importance
as networks and data centers continue to struggle with massive data growth, and as IT managers race to
increase data speed within their architectures just to keep up with relentless demand for faster access to
digital information.
About the Author
Jeff Richardson is executive vice president and chief operating officer for LSI. In
this capacity, he oversees all marketing, engineering and manufacturing of the
company’s product operations. Previously, Richardson was executive vice
president and general manager of the LSI Semiconductor Solutions Group, where
he was responsible for LSI’s silicon solutions across all segments of data
networking/communications, server, hard disk drive, enterprise tape and storage
systems markets.
Richardson joined the company in June 2005 from Intel Corporation, where he
served as vice president of the Digital Enterprise Group and general manager of
the Server Platform Group. Before that, Richardson was vice president and
general manager of the Intel Enterprise Solutions and Services Division. Before
joining Intel in 1992, he held engineering positions at Altera Corporation, Chips and Technologies (the first
fabless semiconductor company), and Amdahl Corporation. Richardson earned a bachelor’s degree in
electrical engineering from the University of Colorado in 1987. He is a member of the board of directors of
Volterra Semiconductor Corporation.
Leading article photo courtesy of Mike Towber
Virtualization of Data Centers: New Options in
the Control and Data Planes (Part III)
Raghu Kondapalli is director of technology focused on Strategic Planning and Solution
Architecture for the Networking Components Division of LSI Corporation. He brings rich
experience and deep knowledge of the cloud-based, service provider and enterprise
networking business, specifically in packet processing, switching and SoC architectures.
This Industry Perspectives article is the third and final in a series of three that analyzes the
network-related issues being caused by the Data Deluge in virtualized data centers, and how these are
having an effect on both cloud service providers and the enterprise. The focus of the first article was on
the overall effect server virtualization is having on storage virtualization and traffic flows in the data
center network, while the second article dove a bit deeper into the network management complexities and
control plane requirements needed to address those challenges. This article examines two ways of scaling
the control plane to accommodate these additional requirements in virtualized data centers.
The control plane can scale in two directions: out or up. In the scale-out approach, the control plane
functions are separated and distributed across physical or virtual servers. In the scale-up approach, the
server’s processing power is augmented by adding extra compute resources, such as x86 processors. In
both the scale-out and scale-up architectures, performance can be further enhanced by providing
function-specific hardware acceleration.
Control Plane Scale-out Architecture
In the scale-out architecture, the basic platform is implemented with generic processors augmented by
separate communications processors with specialized hardware accelerators that can offload control plane
functions. The control plane tasks are divided into sub-tasks, such as discovery, dissemination, and
recovery, and are then distributed across the data center. Because the various tasks can execute on any
server in the network or in the cloud, the scale-out architecture lends itself well to Software Defined
Networking (SDN). Owing to its distributed arrangement, the architecture requires robust communications
between the control plane and the data planes using APIs for the network protocol, such as OpenFlow.
Depending on the network size and configuration, hardware acceleration of these networking functions
may be necessary to achieve satisfactory performance. Protocol-aware communications processors are
designed to handle specific control plane tasks and/or network management functions, including packet
analysis and routing, security, ARP offload, OAM offload, IGMP messages, networking statistics,
application-aware firewalling, QoS, etc.
Control Plane Scale-up Architecture
In the scale-up architecture, the existing network control platforms are supplemented by additional and/or
more powerful compute engines to help execute the network control stack. These supplemental resources
free up server CPU cycles for other tasks, and result in an overall improvement in the network
performance. Because general-purpose processors are not optimized for packet processing functions,
however, they are not an ideal solution for the scale-up architecture. As with the scale-out architecture,
performance can be improved dramatically using function-specific, protocol-aware communications
processors.
Bridging The Data Deluge Gap
Guest post written by Abhi Talwalkar
Abhi Talwalkar is CEO of LSI Corp.
In the first 60 seconds of reading this article, 1 billion gigabytes of information will flow
across mobile networks around the world. That’s the equivalent of a tenth of all the information contained
in the Library of Congress crisscrossing the Internet in a minute. This massive flow of information,
happening every minute of every day, will grow ten-fold over the next several years, according to IDC.
The amount of static data – information stored on drives or servers – also is expected to expand at an
incredibly rapid rate. As individuals and businesses, we are all dealing with the impact of this data deluge.
At the same time, IT budgets are growing only 5%-7% per year. Herein lies the real challenge:
information is growing faster than the IT infrastructure investment required to store, transmit,
analyze and manage it, leaving a widening "data deluge gap." And unless new forms of intelligence,
including those powered by smart
silicon, are integrated into datacenters and networks to clear bottlenecks and bridge the gap between
traffic growth and IT investments, the world’s information society could face significant economic and
technical roadblocks.
As with many things in technology, this gap represents enormous challenges, but also offers huge
opportunities.
One outcome of unrelenting data growth in datacenters and mobile networks has been the accelerated
adoption of cloud computing. The “cloud” solves many technical challenges and helps deliver services
more efficiently by leveraging spending on existing infrastructure. But it is fraught with its own challenges,
especially for architects of datacenters and mobile networks wrestling with how to address daunting
scalability, flexibility and capacity requirements in order to unlock the greatest value from the information
created in the data deluge.
In today’s data-driven world, information has enormous value. Make no mistake: The “digital divide” is
very real, as those with slow or limited access to data get out-traded on Wall Street, out-marketed on the
Internet and risk falling behind in education, business and medicine. Data is most valuable when it is
used, shared, analyzed and made available to connected devices and people. But the determination of
what constitutes valuable data must often be made in nanoseconds.
Together these challenges mean that the industry must bridge the gap to get the maximum return on
information from the highest value data. Of course, this is much easier said than done. To eliminate traffic
bottlenecks in storage systems and in enterprise and mobile networks, smart silicon must be integrated
within strategic areas of IT infrastructures. Ironically, as one kind of chip enables the creation of huge
volumes of data, other smart chips are needed to help increase the speed of the system and direct the
flow of that data.
So where are these bottlenecks?
Today, mobile networks suffer the most acute impact as data traffic growth, driven by huge adoption of
smartphones, tablets and other client devices, is forecast to grow at a 78% compounded annual growth
rate from 2011 to 2016. In data centers, gains in storage performance have fallen well short of increases
in processor speed, which continues to double every few years in keeping with Moore’s Law. These
storage and networking choke points are expected to tighten as the number of connected and mobile
devices rises from about 8 billion today to 50 billion by 2020, and as the volume of data continues to grow
by 30% to 50% a year.
In mobile networks, the dramatic rise in video is driving explosive data growth. What’s more, end users
want faster access to higher quality content, including bandwidth-hungry high-definition video and other
rich media.
But video poses real challenges, as it consumes considerably more bandwidth than both voice and data,
and video quality degrades substantially, often unacceptably, when network congestion interrupts or
delays individual packets in traffic streams. In other words: not all packets are created equal, which
means as video traffic grows, mobile networks are going to need to get smarter about how they manage
the packets traversing their infrastructures. The devil is in the details, which is why smart silicon is
required to address this challenge, performing tasks like packet inspection: looking into packets as they
move through the network and deciding what to do with them and which ones to prioritize.
When it comes to bottlenecks in datacenters, the Data Deluge Gap affects everyone, from the largest
service provider and enterprise to small and medium businesses, and the billions of end users consuming
data-intensive services. Here the biggest bottleneck is between a server’s central processing unit and its
storage, whether directly attached or in a storage area network. Retrieving and storing data from a hard-
disk drive takes one million times longer than accessing it from server memory, a difference that can
severely degrade application performance.
For transaction-oriented businesses such as online retailers, these drags on performance can mean the
difference between profitability and losses. For retail, healthcare and pharmaceutical companies that now
rely on critical findings of Big Data analytics, performance slowdowns can compromise key aspects of
competitiveness such as how quickly and where a product is brought to market. For service organizations,
a delay of a few seconds can mean the difference between deepening customer loyalty and
abandonment. Think about waiting on a website for your shopping cart to load and credit card to clear;
the longer you wait, the more likely you are to choose another site, and some sites see thousands of such
abandoned carts every hour. Or look at the world of high-speed trading, where millions of dollars balance on
milliseconds of timing. The stakes couldn’t be higher.
The opportunity here is to leverage technologies such as flash memory, which can accelerate access by as
much as 300x over existing technologies but until recently has been too expensive to deploy broadly.
Capturing that opportunity requires smart silicon, but also a rethinking of architectures and storage.
The biggest trends in IT today (big data, cloud computing, social media and the growing ranks of connected
devices in the "Internet of things") all mean one thing: a relentless, massive flow of data that needs to be
shared and stored. With a creaking infrastructure, the system at times risks paralysis or overflow, and
localized outages have already shown us the chaos that could occur. Many challenges remain.
More exciting is the opportunity. Data has enormous value and potential to improve our society. To
liberate information from the performance constraints of today’s storage and networking infrastructures,
we need to focus our brightest minds on solutions like bringing smart silicon to strategic points in
networks or datacenters, or on hardware or software that helps route and prioritize the most important
data for the fastest access. That is how we can tackle the data deluge and create the best user experience
for all.
Virtualization of Data Centers: New Options in
the Control & Data Planes (Part II)
Raghu Kondapalli is director of technology focused on Strategic Planning and Solution Architecture for
the Networking Components Division of LSI Corporation. He brings rich experience and deep
knowledge of the cloud-based, service provider and enterprise networking business, specifically in
packet processing, switching and SoC architectures.
This Industry Perspectives article is the second in a series of three that analyzes the
network-related issues being caused by the Data Deluge in virtualized data centers, and how these are
having an effect on both cloud service providers and the enterprise. The focus of the first article was on
the overall effect server virtualization is having on storage virtualization and traffic flows in the datacenter
network. This article dives a bit deeper into the network challenges in virtualized data centers as well as
the network management complexities and control plane requirements needed to address those
challenges.
Server Virtualization Overhead
Server virtualization has enabled tens to hundreds of VMs per server in data centers using multi-core CPU
technology. As a result, packet processing functions, such as packet classification, routing decisions,
encryption/decryption, etc., have increased exponentially. Because discrete networking systems may not
scale cost-effectively to meet these increased processing demands, some changes are also needed in the
network.
Networking functions that are implemented in software in network hypervisors are not very efficient,
because x86 servers are not optimized for packet processing. The control plane, therefore, needs to be
scaled somehow by adding communications processors capable of offloading network control tasks, and
both the control and data planes stand to benefit substantially from hardware assistance provided by such
function-specific acceleration.
The table below shows the effect on packet processing overhead of virtualizing 1,000 servers. As shown,
by mapping each CPU core to four virtual machines (VMs), and assuming 1 percent traffic management
overhead with a 25 percent east-west traffic flow, the network management overhead increases by a
factor of 32 times in this example of a virtualized data center.
This table shows the effect on network management overhead of virtualizing 1,000 servers.
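One arithmetic reading of that 32x figure is a simple endpoint count: with multiple VMs per core, each physical server becomes many managed network endpoints. The sketch below is a hedged reconstruction, not the article's own model; the 4 VMs-per-core value is from the text, but the 8-core server size is our assumption for illustration:

```python
# Hedged reconstruction of the 32x management-overhead figure.
# vms_per_core comes from the article; cores_per_server is an assumption.
servers = 1000
cores_per_server = 8          # assumed, not stated in the article
vms_per_core = 4              # from the article

virtual_endpoints = servers * cores_per_server * vms_per_core
overhead_factor = virtual_endpoints / servers
print(overhead_factor)        # each physical endpoint becomes 32 virtual ones
```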
Virtual Machine Migration
Support for VM migration among servers, either within one server cluster or across multiple clusters,
creates additional management complexity and packet processing overhead. IT administrators may decide
to move a VM from one server to another for a variety of reasons, including resource availability, quality-
of-experience, maintenance, and hardware/software or network failures. The hypervisor handles these VM
migration scenarios by first reserving a VM on the destination server, then moving the VM to its new
destination, and finally tearing down the original VM.
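That three-step sequence (reserve, move, tear down) can be sketched with a toy host model; the class and method names here are illustrative, not a hypervisor API:

```python
class Host:
    """Minimal stand-in for a hypervisor host (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.vms = {}

    def reserve(self, vm_id):
        self.vms[vm_id] = "reserved"

    def activate(self, vm_id, state):
        self.vms[vm_id] = state

    def copy_state(self, vm_id):
        return self.vms[vm_id]

    def teardown(self, vm_id):
        del self.vms[vm_id]

def migrate_vm(vm_id, src, dst):
    # The three-step flow described above:
    dst.reserve(vm_id)                           # 1. reserve on the destination
    dst.activate(vm_id, src.copy_state(vm_id))   # 2. move the VM's state across
    src.teardown(vm_id)                          # 3. tear down the original VM
```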
Hypervisors often cannot generate address resolution protocol (ARP) broadcasts quickly enough to
notify the network of VM moves, especially in large-scale virtualized environments. The network can even become
so congested from the control overhead occurring during a VM migration that the ARP messages fail to get
through in a timely manner. With such a significant impact on network behavior being caused by rapid
changes in connections, ARP messages and routing tables, existing control plane solutions need an
upgrade to more scalable architectures.
Multi-tenancy and Security
Owing to the high costs associated with building and operating a data center, many IT organizations are
moving to a multi-tenant model where different departments or even different companies (in the cloud)
share a common infrastructure of virtualized resources. Data protection and security are critical needs in
multi-tenant environments, which require logical isolation of resources without dedicating physical
resources to any customer.
The control plane must, therefore, provide secure access to data center resources and be able to change
the security posture dynamically during VM migrations. The control plane may also need to implement
customer-specific policies and Quality of Service (QoS) levels.
Service Level Agreements and Resource Metering
The network-as-a-service paradigm requires active resource metering to ensure SLAs are
maintained. Resource metering through the collection of network statistics is useful for calculating return
on investment, and evaluating infrastructure expansion and upgrades, as well as for monitoring SLAs.
The network monitoring tasks are currently spread across the hypervisor, legacy management tools, and
some newer infrastructure monitoring tools. Collecting and consolidating this management information
adds further complexity to the control plane for both the data center operator and multi-tenant
enterprises.
The next article in the series will examine two ways of scaling the control plane to accommodate these additional
packet processing requirements in virtualized data centers.
Virtualization of Data Centers: New Options in
the Control and Data Planes
Raghu Kondapalli is director of technology focused on Strategic Planning and Solution Architecture for
the Networking Components Division of LSI Corporation. He brings rich experience and deep
knowledge of the cloud-based, service provider and enterprise networking business, specifically in
packet processing, switching and SoC architectures.
The Data Deluge occurring in today’s content-rich Internet, cloud and enterprise
applications is growing the volume, velocity and variety of information data centers must now process. In
response, organizations have begun virtualizing their data centers to become more cost-effective, power-
efficient, scalable and agile.
The migration began with server virtualization using technologies like multi-core CPUs and multi-thread
operating systems. Next was the virtualization of storage area networks (SANs) and network attached
storage (NAS) to cope with the Data Deluge more efficiently and cost-effectively. The final target for
virtualization is the data center network itself, which will necessitate changes in both the control and
data planes to manage traffic flows more intelligently and improve overall performance.
This Industry Perspectives article is the first in a series of three that analyzes the network-related
challenges in virtualized data centers, and how these are having an effect on network infrastructures—
from the SAN to the core. The focus here is on the effect server virtualization is having on storage
virtualization and traffic flows in the data center network.
Server Virtualization’s Effect on Storage and the Network
The need for instantaneous and reliable access to data across all segments of today’s connected world is
pushing the boundaries of data center virtualization. Cloud computing, with its superior scalability and
lower total-cost-of-ownership (TCO), is at the leading edge of this trend by requiring virtualization of the
entire datacenter in a multi-tenancy environment.
Servers were initially virtualized by implementing virtual machines (VMs) in software with the hypervisor
creating a layer of abstraction between physical and virtual machines, thereby absorbing many of the
connectivity, manageability and scalability issues. Software-based hypervisors, however, are unable to
keep pace with the increased performance demands of the Data Deluge. Processor extensions to support
x86 virtualization made their debut in the mid-2000s, providing the hardware acceleration needed to
improve performance.
Storage
Virtualization of storage is typically done in a SAN, which houses both the VM images and some or all of
the data needed by the applications. VM support requires extra storage in the SAN to back up and replicate
the images dynamically, and during the initial phase of storage virtualization, storage hypervisors helped
administrators perform these tasks more easily by disguising the actual complexity of the SAN. These
techniques by themselves, however, proved insufficient for the relentless growth in storage demands. And
once again, advances in hardware, particularly the use of flash memory in solid-state drives (SSDs),
became critical to boosting SAN performance. Such tiered and/or application-aware storage solutions
deliver hardware acceleration to both the SAN and directly attached storage (DAS), providing both
improved I/O throughput and real-time analytics.
Until recently, most of the efforts in data center virtualization addressed the server and storage segments.
Network virtualization has been ad hoc, at best, normally implemented as an add-on module to traditional
compute-centric hypervisors. Network-specific extensions to hypervisors handle basic connectivity and
fault management, and are able to meet the performance needs for small data centers. The current
generation of large-scale server farms, however, must have thousands of servers with potentially dozens
of VMs per server. The application workloads, which are generally distributed across several VMs, increase
VM-to-VM communications (east-west traffic), while other factors, such as VM migration and storage
applications like data replication, have also increased east-west traffic flows. And these changes are
occurring as client-to-server communications (north-south traffic) also continues to grow exponentially.
Reaping Benefits of Virtualization
Currently, IT departments are exploring new options for data center networks to better reap the benefits
of virtualization. At present, several solutions have been proposed to improve data center network
utilization and performance. At the network architectural level, isolating the control plane functions from
the data plane, and virtualizing both, is a growing trend that involves improving the efficiency of the
existing network infrastructures with simple upgrades. Scale-out and scale-up are two such techniques
that are now being used, and these will be covered in more detail in the third article in this series.
A related trend involves Software-Defined Networking (SDN), which is another abstraction where network
application stacks are presented with a virtual view of the network that shields its physical topology. SDN
also enables control plane tasks to be virtualized and distributed across the network. OpenFlow is one
example of an SDN that proposes to separate control plane functions, such as routing, from data plane
functions, like forwarding, enabling them to execute independently on different devices—potentially from
different vendors.
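The separation OpenFlow proposes can be illustrated with a toy flow table: the controller makes the routing decisions and installs match-to-port rules, while the switch does nothing but look them up and forward. This is a conceptual sketch, not the actual OpenFlow protocol or message format:

```python
# Toy illustration of the control/data plane split in the OpenFlow spirit.
# Function names and the table format are illustrative assumptions.
def controller_install(flow_table, match, out_port):
    """Control plane: make the routing decision and push a rule to the switch."""
    flow_table[match] = out_port

def switch_forward(flow_table, packet_dst):
    """Data plane: pure table lookup; unknown flows are punted to the controller."""
    return flow_table.get(packet_dst, "send-to-controller")

table = {}
controller_install(table, "10.0.1.0/24", 3)
print(switch_forward(table, "10.0.1.0/24"))      # forwards out port 3
print(switch_forward(table, "192.168.0.0/16"))   # no rule: ask the controller
```

Because the lookup side is trivial, the forwarding hardware can stay simple and fast while all the intelligence lives in software on the controller, which is the essence of the scale-out argument above.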
But before exploring these proposed network virtualization options, it is useful to dive a bit deeper into the
networking issues in a virtualized datacenter, and this is the subject of the second article in this series.