
Maximizing File Transfer Performance Using 10Gb Ethernet and Virtualization
A FedEx Case Study

THE CHALLENGE

Spiraling data center network complexity, cable maintenance and troubleshooting costs, and increasing bandwidth requirements led FedEx Corporation to investigate techniques for simplifying the network infrastructure and boosting file transfer throughput.

In response to this challenge, FedEx—in collaboration with Intel—conducted a case study to determine the most effective approach to achieving near-native 10-gigabit file transfer rates in a virtualized environment based on VMware vSphere* ESX* 4 running on servers powered by the Intel® Xeon® processor 5500 series. The servers were equipped with Intel® 10 Gigabit AF DA Dual Port Server Adapters supporting direct-attach copper twinaxial cable connections. For this implementation, the Intel® Virtual Machine Device Queues (VMDq) feature was enabled in VMware NetQueue*.

File transfer applications are widely used in production environments, including replicated data sets, databases, backups, and similar operations. As part of this case study, several of these applications are used in the test sequences. This case study:

•Investigates the platform hardware and software limits of file transfer performance
•Identifies the bottlenecks that restrict transfer rates
•Evaluates trade-offs for each of the proposed solutions
•Makes recommendations for increasing file transfer performance in 10 Gigabit Ethernet (10G) native Linux* and a 10G VMware virtualized environment

The latest 10G solutions let users cost-effectively consolidate the many Ethernet and Fibre Channel adapters deployed in a typical VMware ESX implementation. VMware ESX, running on Intel Xeon processor 5500 series-based servers, provides a reliable, high-performance solution for handling this workload.

THE PROCESS

In the process of building a new data center, FedEx Corporation, the largest express shipping company in the world, evaluated the potential benefits of 10G, considering these key questions:

•Can 10G make the network less complex and streamline infrastructure deployment?
•Can 10G help solve cable management issues?
•Can 10G meet our increasing bandwidth requirements as we target higher virtual machine (VM) consolidation ratios?

CASE STUDY
Intel® Xeon® processor 5500 series
Virtualization

FedEx Corporation is a worldwide information and business solution company, with a superior portfolio of transportation and delivery services.


Cost Factors
How does 10G affect costs? Both 10GBASE-T and Direct Attach Twinax cabling (sometimes referred to as 10GSFP+Cu or SFP+ Direct Attach) cost less than USD 400 per adapter port. In comparison, a 10G Short Reach (10GBASE-SR) fiber connection costs approximately USD 700 per adapter port.1

In existing VMware production environments, FedEx had used eight 1GbE connections implemented with two quad-port cards (plus one or two 100/1000 ports for management) in addition to two 4Gb Fibre Channel links. Based on market pricing, moving the eight 1GbE connections on the quad-port cards to two 10GBASE-T or 10G Twinax connections is cost effective today. Using two 10G connections can actually consume less power than the eight 1GbE connections they replace—providing additional savings over the long term. Having less cabling typically reduces instances of wiring errors and lowers maintenance costs, as well. FedEx used IEEE 802.1q trunking to separate traffic flows on the 10G links.
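To illustrate that kind of trunking, the sketch below carves two 802.1Q-tagged VLANs out of a single 10G port using iproute2; the interface name, VLAN IDs, and addresses are placeholders, and systems of the RHEL 5.3 era would more likely achieve the same result with vconfig or ifcfg files.

```bash
# Hypothetical sketch: separate traffic flows on one 10G link with 802.1Q VLAN tags.
# eth2, VLAN IDs 100/200, and the addresses are illustrative, not values from the study.
ip link add link eth2 name eth2.100 type vlan id 100   # e.g., production/file-transfer traffic
ip link add link eth2 name eth2.200 type vlan id 200   # e.g., management traffic
ip addr add 10.100.0.5/24 dev eth2.100
ip addr add 10.200.0.5/24 dev eth2.200
ip link set eth2.100 up
ip link set eth2.200 up
```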

Engineering Trade-offs
Engineering trade-offs are also a consideration.

Direct Attach/Twinax 10G cabling has a maximum reach of seven meters for passive cables, which affects the physical wiring scheme to be implemented. The servers have to be located within a seven-meter radius of the 10G switches to take advantage of this new technology. However, active cables are available that can extend this range if necessary. A key feature of the Twinax cable technology is that it uses exactly the same form-factor connectors that the industry-standard SFP+ optical modules use. Using this technology allows you to select the most cost-effective and power-efficient passive Twinax for short reaches and then move up to active Twinax, SR fiber, or even Long Reach (LR) fiber for longer runs. Twinax adapters consume approximately 3W per port when using passive cabling.

10GBASE-T's maximum reach of 100 meters makes it a flexible, data center-wide deployment option for 10GbE. 10GBASE-T is also backwards-compatible with today's widely deployed Gigabit Ethernet infrastructures. This feature makes it an excellent technology for migrating from GbE to 10GbE, as IT can use existing infrastructures and deployment knowledge. Current-generation 10GBASE-T adapters use approximately 10W per port, and upcoming products will consume even less, making them suitable for integration onto motherboards in future server generations.

Figure 1. Configuration of the server wiring when using eight 1GbE connections.

Figure 2. Configuration of the server wiring when two 10G connections replace eight 1GbE connections.

Framing the Challenge
File transfer applications are widely used in production environments to move data between systems for various purposes and to replicate data sets across servers and applications that share these data sets. FedEx, for example, uses ftp, scp, and rsync for data replication and distribution in their production networks.

In addition to the considerations associated with cable consolidation, cost factors, and the power advantages of using 10G, another key question remained: Can today's servers effectively take advantage of 10G pipes? Using ftp, FedEx was able to drive 320 Mbps over a 1G connection. Initial 10G testing, however, indicated that they could only achieve 560 Mbps, despite the potential capabilities of 10x faster pipes.

Plugging a 10G NIC into a server does not automatically deliver 10 Gbps of application-level throughput. An obvious question arises: What can be done to maximize file transfer performance on modern servers using 10G?


Hardware
Intel® Xeon® processor X5560 series @ 2.8 GHz (8 cores, 16 threads); SMT, NUMA, VT-x, VT-d, EIST, Turbo enabled (default in BIOS); 24 GB memory; Intel 10GbE CX4 Server Adapter with VMDq

Test Methodology
RAM disk used, not disk drives. We are focused on network I/O, not disk I/O.

What is being transferred?
Directory structure, part of a Linux repository: ~8 GB total, ~5000 files, variable file size, average file size ~1.6 MB
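A minimal sketch of staging such a data set on a RAM disk is shown below; the mount point, tmpfs size, and source path are assumptions, not values taken from the study.

```bash
# Stage the test data in RAM so measurements reflect network I/O rather than disk I/O.
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=12g tmpfs /mnt/ramdisk
cp -a /srv/linux-repo-subset /mnt/ramdisk/repo   # ~8 GB, ~5000 files, avg ~1.6 MB each
```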

Data Collection Tools Used
Linux* utility "sar": capture receive throughput and CPU utilization
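For example, receive throughput and CPU utilization can be sampled once per second for the duration of a run; the interval, sample count, and output files below are illustrative.

```bash
# Collect per-interface rx/tx rates and overall CPU utilization while a transfer runs.
sar -n DEV 1 600 > sar_net.log &   # network device statistics, 1-second samples
sar -u 1 600 > sar_cpu.log &       # CPU utilization, 1-second samples
wait
```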

Application Tools Used
Netperf (common network micro-benchmark); OpenSSH, OpenSSL (standard Linux layers); HPN-SSH (optimized version of OpenSSH); scp, rsync (standard Linux file transfer utilities); bbcp ("BitTorrent-like" file transfer utility)

Figure 3. Native test configuration details: two Intel® Xeon® processor X5500 series servers running RHEL 5.3 64-bit, connected back-to-back with Intel Oplin 10GbE CX4 adapters; netperf, bbcp, scp, and rsync run over SSH/HPN-SSH from the source server to the destination server.

FedEx and Intel delved deeper into the investigation, using several common file transfer tools on both native Linux* and VMware Linux VMs. The test environment featured Red Hat Enterprise Linux (RHEL) 5.3 and VMware vSphere ESX 4 running on Intel Xeon processor 5500 series-based servers. These servers were equipped with Intel 10 Gigabit AF DA Dual Port Server Adapters supporting direct-attach copper twinaxial (10GBASE-CX1) cable connections and the VMDq feature supported by VMware NetQueue.

Native Test Configuration
Figure 3 details the components of the native test configuration. The test systems were connected back-to-back over 10G to eliminate variables that could be caused by the network switches themselves. This should be considered a best-case scenario because any switches add some finite amount of latency in real-world scenarios, possibly degrading performance for some workloads.

A RAM disk, rather than physical disk drives, was used in all testing to focus on the network I/O performance rather than being limited by disk I/O performance.

The default bulk encryption used in OpenSSH and HPN-SSH is Advanced Encryption Standard (AES) 128-bit. This was not changed during testing.

The application test tools included the following:

•netperf: This commonly used network-oriented, low-level synthetic micro-benchmark does very little processing beyond forwarding the packets. It is effective for evaluating the capabilities of the network interface itself.

•OpenSSH, OpenSSL: These standard Linux layers perform encryption for remote access, file transfers, and so on.

•HPN-SSH: This optimized version of OpenSSH was developed by the Pittsburgh Supercomputing Center (PSC). For more details, visit www.psc.edu/networking/projects/hpn-ssh/

•scp: The standard Linux secure copy utility

•rsync: The standard Linux directory synchronization utility

•bbcp: A peer-to-peer file copy utility (similar to BitTorrent) developed by the Stanford Linear Accelerator Center (SLAC). For more details, go to www.slac.stanford.edu/~abh/bbcp/


Synthetic Benchmarks versus Real File Transfer Workloads for Native Linux*
The following two sections examine the differences between synthetic benchmarking and benchmarks generated during actual workloads while running native Linux.

Synthetic Benchmarks

Figure 4 shows the results of two test cases using netperf. In the first case, the 10G card was plugged into a PCIe* Gen1 x4 slot, which limited the throughput to about 6 Gbps because of the PCIe bus bottleneck. In the second case, the card was plugged into a PCIe Gen1 x8 slot, which allowed full throughput at near line rate.

PCIe slots can present a problem if the physical slot size does not match the actual connection to the chipset. To determine which slots are capable of full PCIe width and performance, check with the system vendor. The proper connection width can also be verified using system tools and log files. PCIe Gen1 x8 (or PCIe Gen2 x4, if supported) is necessary to achieve 10 Gbps throughput for one 10G port. A dual-port 10G card requires twice the PCIe bus bandwidth.
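A quick way to check the negotiated link on a Linux host is shown below; the PCI address is a placeholder, and LnkCap reports what the device supports while LnkSta shows what was actually negotiated.

```bash
# Confirm the PCIe width/speed the 10GbE adapter actually negotiated.
lspci | grep -i ethernet                           # find the adapter's bus:device.function
lspci -vvv -s 07:00.0 | grep -E 'LnkCap:|LnkSta:'  # 07:00.0 is an illustrative address
```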

As demonstrated, achieving 10 Gbps transfer rates is quite easy using a synthetic benchmark. The next section looks at a case where actual workloads are involved.
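The synthetic measurement itself is straightforward; a minimal netperf run of this kind, with the destination address and duration as placeholders, looks like the following.

```bash
# On the destination server, start the netperf companion daemon:
netserver
# On the source server, run a 60-second TCP bulk-transfer test toward it:
netperf -H 10.0.0.2 -t TCP_STREAM -l 60
```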

Figure 4. Synthetic benchmark for native Linux*: netperf receive throughput (Mbps) with the adapter in a PCIe* Gen1 x4 slot versus a PCIe Gen1 x8 slot.

Benchmarks Based on Actual Workloads

Real applications present their own unique challenges. Figure 5 shows the earlier netperf results as a reference bar on the left and seven test cases to the right. Tool choice obviously matters, but the standard tools are not very well threaded, so they don't take full advantage of the eight cores, 16 threads, and NIC queues (more than 16) available in this particular hardware platform. The scp tool running over standard ssh, or scp(ssh) in Figure 5, and the rsync(ssh) case both achieve only 400–550 Mbps, about the same as FedEx's initial disappointing results with ftp. Multi-threaded file transfer tools offer a potential performance boost, and two promising candidates emerged during a Web search: HPN-SSH from PSC and bbcp from SLAC.

Figure 5. Comparison of various file copy tools (one stream), native Linux*: receive throughput (Mbps) and average CPU utilization (%) for netperf, scp(ssh), rsync(ssh), scp(HPN-SSH), rsync(HPN-SSH), scp(HPN-SSH + no crypto), rsync(HPN-SSH + no crypto), and bbcp.


Using the HPN-SSH layer to replace the OpenSSH layer drives the throughput up to about 1.4 Gbps. HPN-SSH uses up to four threads for the cryptographic processing and more than doubles application-layer throughput. Clearly, performance is moving in the right direction.

If improving the bulk cryptographic processing helps that much, what could the team achieve by disabling it entirely? HPN-SSH offers that option as well. Without bulk cryptography, scp achieves 2 Gbps and rsync achieves 2.3 Gbps. The performance gains in these cases, however, rely on bypassing encryption for file transfers. HPN-SSH provides significant performance advantages over the default OpenSSH layer that is provided in most Linux distributions; this approach warrants further study.
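For reference, a hedged sketch of such a transfer is shown below; the NoneEnabled/NoneSwitch option names come from PSC's HPN-SSH documentation and require HPN-patched OpenSSH on both ends (with the none cipher allowed in the server's sshd_config), so verify them against the build in use.

```bash
# scp over HPN-SSH with bulk encryption disabled (authentication stays encrypted).
# Host and paths are placeholders; both client and server must be HPN-SSH builds.
scp -r -oNoneEnabled=yes -oNoneSwitch=yes /mnt/ramdisk/repo user@10.0.0.2:/mnt/ramdisk/
```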

According to a presentation created by SLAC titled "P2P Data Copy Program bbcp" (http://www.slac.stanford.edu/grp/scs/paper/chep01-7-018.pdf), bbcp encrypts sensitive passwords and control information but does not encrypt the bulk data transfer. This design trade-off sacrifices privacy for speed. Even without encrypting the bulk data, this can still be an effective trade-off for many environments where data privacy is not a critical concern. With bbcp, using the default four threads, the file transfer rates reached 3.6 Gbps. This represents the best results so far, surpassing even the HPN-SSH cases with no cryptographic processing, and it's more than six times better than the initial test results with scp(ssh). The bbcp approach is very efficient and bears further consideration.
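A hedged example invocation follows, with bbcp's default four parallel streams made explicit; the host and paths are placeholders, and a single archive file is used for simplicity even though the study copied a directory tree.

```bash
# bbcp transfer using four parallel TCP streams (-s 4, which is also the default).
tar -C /mnt/ramdisk -cf /mnt/ramdisk/repo.tar repo          # pack the test tree into one file
bbcp -s 4 /mnt/ramdisk/repo.tar user@10.0.0.2:/mnt/ramdisk/ # bulk data is not encrypted
```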

Based on these results, FedEx is actively evaluating the use of HPN-SSH and bbcp in their production environments. None of the techniques tried so far, however, has even come close to achieving 10 Gbps of throughput—not even reaching half of that target. At this point, the Intel and FedEx engineering team focused on identifying any bottlenecks preventing the 10 Gbps file transfer target from being reached. Beyond rewriting all of the tools to enhance performance through multi-threading, the team also wanted to know whether any other available techniques could boost file transfer rates.

To gain a better understanding of the performance issues, the engineering team ran eight file transfers in parallel streams to attempt to drive up aggregate file transfer throughput performance and obtain better utilization of the platform hardware resources. Figure 6 indicates the results.
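A minimal sketch of the approach, assuming the test tree has been pre-split into eight subdirectories and using scp over ssh, is shown below; the destination host and paths are placeholders.

```bash
# Launch eight concurrent scp streams, one per subset of the data, then wait for all of them.
DST="user@10.0.0.2:/mnt/ramdisk/repo/"
for i in 0 1 2 3 4 5 6 7; do
  scp -r "/mnt/ramdisk/repo/part$i" "$DST" &
done
wait   # aggregate throughput is the sum of the eight streams
```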

In Figure 6, the first four red bars show that using eight parallel streams overcomes the threading limits of these tools and drives aggregate bulk encrypted throughput much higher:

•2.7 Gbps with scp (ssh)
•3.3 Gbps with rsync (ssh)
•4.4 Gbps with scp (HPN-SSH)
•4.2 Gbps with rsync (HPN-SSH)

These results demonstrate that using more parallelism dramatically scales up performance by more than five times, but the testing did not demonstrate eight times the throughput when using eight threads in parallel. The resolution to the problem does not lie in simply using the brute-force approach and running more streams in parallel.

Figure 6. Comparison of various file copy tools (eight streams vs. one stream), native Linux*: receive throughput (Mbps) and average CPU utilization (%).

These results also show that bulk encryption is expensive, in terms of both throughput and CPU utilization. HPN-SSH, with its multi-threaded cryptography, still provides a significant benefit, but not as dramatic a benefit as in the single-stream case.

The results associated with the remaining three red bars of Figure 6 are instructive. The first two cases use HPN-SSH with no bulk cryptography, and the third case is the eight-thread bbcp result in which bulk data transfers are not encrypted. These results demonstrate that it is possible to achieve nearly the same 10 Gbps line rate throughput number as the netperf micro-benchmark result when running real file transfer applications.

As this testing indicates, using multiple parallel streams and disabling bulk cryptographic processing is effective for obtaining near 10-Gbps line rate file transfer throughput in Linux. These trade-offs may or may not be applicable for a given network environment, but they do indicate an effective technique for gaining better performance with fewer trade-offs in the future. Intel is continuing to work with the Linux community to find solutions to increase file transfer performance.

The next-generation Intel Xeon 5600 processor family, code-named Westmere, and other future Intel Xeon processors will have a new instruction set enhancement, called AES-NI, to improve performance for Advanced Encryption Standard (AES) bulk encryption. This advance promises to deliver faster file transfer rates when bulk encryption is turned on with AES-NI-enabled platforms.
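One rough way to gauge how much bulk AES costs on a given host, and how much an AES-NI-capable processor and OpenSSL build would help, is OpenSSL's built-in speed test; this is a generic illustration, not a measurement from the study.

```bash
# Single-core AES-128-CBC throughput via the EVP interface; builds that use AES-NI
# report substantially higher numbers on processors that support the instructions.
openssl speed -evp aes-128-cbc
```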

Best Practices for Native Linux
Follow these practices to achieve the best performance results when performing file transfer operations under native Linux:
•Configuration: PCIe Gen1 x8 minimum for 1x 10G port; PCIe Gen2 x8 minimum for 2x 10G ports on one card
•BIOS settings: Turn ON Energy Efficient mode (Turbo, EIST), SMT, NUMA
•Turn ON Receive (Rx) Multi-Queue (MQ) support (enabled by default in RHEL); Transmit (Tx) is currently limited to one queue in RHEL, while SLES 11 RC supports MQ Tx. A quick way to confirm that receive queues are in use is sketched after this list.
•Factor in these limitations of Linux file transfer tools and SSH/SSL layers:
–scp and ssh: single threaded
–rsync: dual threaded
–HPN-SSH: four cryptography threads and a single-threaded MAC layer
–bbcp: encrypted setup handshake, but not bulk transfer; defaults to four threads
•Use multiple parallel streams to overcome tool thread limits and maximize throughput.
•Bulk cryptography operations limit performance. Disable cryptography for those environments where it is acceptable. The Intel Xeon 5600 processor family (code-named Westmere) and future Intel Xeon processors will improve bulk cryptographic performance using AES-NI.
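As referenced in the multi-queue bullet above, one way to confirm that receive traffic is actually spreading across queues is to look at per-queue statistics and interrupt vectors; the interface name is a placeholder and the statistic names follow the ixgbe driver's conventions, so they may differ for other drivers.

```bash
# Check that multiple receive queues are active and accumulating packets.
ethtool -S eth2 | grep -E 'rx_queue_[0-9]+_(packets|bytes)'
cat /proc/interrupts | grep eth2    # expect one MSI-X vector per queue
```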

Achieving Native Performance in a Virtualized Environment
Earlier sections illustrated effective techniques for maximizing file transfer performance using various tools in the Linux native environment and described some of the factors that limit file transfer performance.

The following sections examine performance in a virtualized environment to determine the level of network throughput that can be achieved for both synthetic benchmarks and various file transfer tools.

In this virtualized environment, the test systems are provisioned with VMware vSphere ESX 4 running on Intel Xeon processor 5500 series-based servers. The test systems are connected back-to-back over 10G with VMDq enabled in VMware NetQueue. The VMs on these servers were provisioned with RHEL 5.3 (64-bit) and, as in the native cases, the test team used the same application tools and test methodology except in one instance: the team used the esxtop utility for measuring the servers' throughput and CPU utilization.
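In batch mode, esxtop can log those counters to a CSV file for the duration of a run; the sampling interval and iteration count below are illustrative.

```bash
# On each ESX host: record all counters every 5 seconds for 120 samples (~10 minutes),
# then post-process the CSV for per-VM receive throughput and CPU utilization.
esxtop -b -d 5 -n 120 > esxtop_run.csv
```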

Within the virtualized environment, these test scenarios were used:
•One virtual machine with eight vCPUs and 12 GB RAM
•One virtual machine (eight vCPUs and 12 GB RAM) with VMDirectPath I/O
•Eight virtual machines, with each virtual machine having one vCPU and 2 GB RAM
•Eight virtual machines, both with and without the VMDq feature

CASE 1: One Virtual Machine with Eight vCPUs and 12 GB RAM
Each server had one VM configured with eight vCPUs and 12 GB of RAM, using RHEL 5.3 (64-bit) as the guest operating system (OS). The test team ran netperf as a micro-benchmark and also various file transfer tools for transferring a directory structure of approximately 8 GB from the VM on the first server to the VM on the second server. Figure 7 shows the test configuration.

Similar to the testing done in the native Linux case, the test team compared the data from netperf, one stream of file transfer, and eight parallel file transfers. Figure 8 shows the receive network throughput and total average CPU utilization for the micro-benchmark (netperf) and for various file transfer tools when running a single stream of copy. The figure compares the native results with the results in the virtualized environment.

Figure 7. Test configuration for Case 1: one VM (eight vCPUs) running RHEL 5.3 and the file transfer applications on each server, connected through the vSwitch of VMware® ESX® 4.0 on Intel® Xeon® processor X5500 series hosts, with the Intel Oplin 10GbE CX4 adapters connected back-to-back.


Figure 8 indicates that the throughput in a VM is lower across all cases when compared with the native case. In the virtualized case, the throughput for the netperf test is 5.8 Gbps compared to 9.3 Gbps in the native case. Even in the case of file transfer tools, such as scp and rsync running over standard ssh, the throughput ranges from 300 Mbps to 500 Mbps, which is slightly lower when compared with the native case.

Using the HPN-SSH layer to replace the OpenSSH layer increases the throughput. Also, disabling cryptography increases the throughput, but not to as high a level as the line rate. The shape of the curve, however, is similar.

The same limitations that occurred in the native case (such as standard tools not being well threaded and cryptography adding to the overhead) also apply in this case. Because of this, the file transfer tools cannot take full advantage of multiple cores and the NIC queues.

Most of the tools and utilities—including ssh and scp—are single threaded; the rsync utility is dual threaded. Using the HPN-SSH layer to replace the OpenSSH layer helps increase the throughput. In HPN-SSH, the cryptography operations are multi-threaded (four threads), which boosts performance significantly. The single-threaded MAC layer, however, still creates a bottleneck. When HPN-SSH is run with cryptography disabled, the performance increases, but the benefits of encrypted data transfer are lost. This is similar to the case with bbcp, which is multi-threaded (using four threads by default), but the bulk transfer is not encrypted.

The next test uses eight parallel streams, attempting to work around the threading limitations of various tools. Figure 9 shows the receive network throughput and CPU utilization for various file transfer tools when running eight parallel streams of copies. This chart also includes comparisons with the native data results.

Figure 8. Throughput comparison of various file copy tools (one stream), ESX* 4.0 GA, one VM with eight vCPUs: receive throughput (Mbps) and average CPU utilization (%), native vs. virtualized.

Figure 9. Throughput comparison of various file copy tools (eight streams), ESX* 4.0 GA, one VM with eight vCPUs: receive throughput (Mbps) and average CPU utilization (%), native vs. virtualized.

From Figure 9 it is clear that using eight parallel streams overcomes the tools' threading limitations. Figure 9 also shows that cryptography operations are a limiter in the first four cases of the file transfer tools. Better throughput is indicated in the last three cases, in which the copies were made without using cryptography.

Even though these results are better compared to the one-stream data, they still don't come close to the line rate (approximately 10 Gbps) achieved in the native case. Since the testing used one VM with eight vCPUs, the test team determined that this might be a good case for using direct assignment of the 10G NIC to the virtual machine.


CASE 2: One Virtual Machine with Eight vCPUs and VMDirectPath
The test bed setup and configuration in this case was similar to that of the previous case, except that the 10G NIC was direct assigned to the VM. Figure 10 shows the test configuration for this case.

The test team started with the synthetic benchmark netperf and then ran eight parallel streams of copies for various file transfer tools. The results shown in Figure 11 compare the performance numbers from the VM with DirectPath I/O to those of the VM with no DirectPath I/O and of the native configuration.

As Figure 11 illustrates, VMDirectPath (VT-d direct assignment) of the 10G NIC to the VM increases performance to a level that is close to the native performance results. Nonetheless, the trade-offs associated with using VMDirectPath are substantial.

A number of features are not available for VMs configured with VMDirectPath, including VMotion*, suspend/resume, record/replay, fault tolerance, high availability, DRS, and so on. Because of these limitations, the use of VMDirectPath will continue to be a niche solution awaiting future developments that minimize them. VMDirectPath may be practical to use today for virtual security-appliance VMs, since these VMs typically do not migrate from host to host. VMDirectPath may also be useful for other applications that have their own clustering technology and don't rely on the VMware platform features for fault tolerance.

Figure 10. Test configuration for Case 2: the same back-to-back setup as Case 1, with one VM (eight vCPUs) per server, but with the 10GbE adapter assigned directly to the VM via VMDirectPath I/O.

Figure 11. Throughput comparison of various file copy tools (eight streams) using VMDirectPath, ESX* 4.0 GA: receive throughput (Mbps) and average CPU utilization (%) for native, VM without VMDirectPath I/O, and VM with VMDirectPath I/O.


CASE 3: Eight Virtual Machines, Each with One vCPU and 2 GB RAM
The two previous cases incorporated a single large VM with eight vCPUs, which is somewhat similar to a native server. In Case 3, the scenario includes eight virtual machines, with each VM configured with one vCPU and 2 GB RAM. The guest OS is still RHEL 5.3 (64-bit). Both of the servers include eight virtual machines, and the tests run one stream of copy per VM (so in effect eight parallel streams of copies are running). Figure 12 shows the configuration for this case.

Figure 13 indicates the performance results when running a synthetic micro-benchmark and real file transfer applications on eight VMs in parallel. Comparisons with the results for the native server are shown in blue.

As shown in Figure 13, the aggregate throughput with netperf across eight VMs reaches the same level of throughput as achieved in the native case. With the file transfer tools, cryptography still imposes performance limitations, but the multi-threaded cryptography in HPN-SSH improves performance when compared to the standard utilities, such as ssh. Larger benefits are gained when bulk cryptography is disabled, as indicated by the results shown by the last three red bars. Several of these results show that the virtualized case can equal the performance of the native case.

Figure 12. Test configuration for Case 3: eight VMs per server (one vCPU each) running RHEL 5.3 on VMware* ESX* 4.0, connected through the vSwitch, with the Intel Oplin 10GbE CX4 adapters connected back-to-back.

Figure 13. Throughput comparison of various file copy tools with eight VMs and one stream per VM (ESX* 4.0 GA): receive throughput (Mbps) and average CPU utilization (%), native vs. virtualized.


Sub Case: Eight Virtual Machines, with and without VMDq
Because the test configuration includes multiple VMs, the VMDq feature offers some advantages. This feature helps offload network I/O data processing from the virtual machine monitor (VMM) software to network silicon. The test team ran a sub case, comparing the receive throughput from various file transfer tools by enabling and disabling the VMDq and associated NetQueue feature.

The results of the first four copy operations in Figure 14 show that there is no difference in throughput, regardless of whether VMDq is enabled or disabled. This is primarily because of the limitations imposed by bulk cryptography operations. When running the file transfer application tools with cryptography disabled—as in the last three cases—the advantages of the VMDq feature become clear. The results of the last three operations show the advantage of VMDq with the cryptography bottleneck removed. These data results indicate that VMDq improves performance with multiple VMs if there are no system bottlenecks in place (such as cryptography or slot width).

Based on the full range of test results, the test team developed a set of best practices to improve performance in virtualized environments, as detailed in the next section.

Figure 14. Throughput comparison of various file copy tools (eight streams) with and without Virtual Machine Device Queues (VMDq), ESX* 4.0 GA, eight VMs: receive throughput (Mbps) and average CPU utilization (%) by type of file copy, VMDq vs. no VMDq.

Best Practices for Virtualized Environments (ESX* 4.0)
Consider the following guidelines for achieving the best data transfer performance in a virtualized environment running VMware vSphere ESX 4.

•Confirm the actual width of PCIe slots.
–Physical slot size doesn't always match the actual connection to the chipset; check with your server vendor to determine which ones are really full width.
–Verify the proper connection using system tools.
–PCIe Gen1 x8 (or PCIe Gen2 x4, if supported) is required to achieve 10 Gbps throughput for one 10G port. PCIe Gen2 x8 is necessary for 2x 10G ports.
–The transfer rate limitation with PCIe Gen1 x4 is approximately 6 Gbps.
•By default, NetQueue/VMDq is enabled for both Rx and Tx queues.
–One queue is allocated per core/thread up to the hardware limits.
–Performance is improved with multiple VMs if there are no system bottlenecks, such as slot width or cryptography.
•Use vmxnet3 (new in ESX 4.0) or vmxnet2 (ESX 3.5) instead of the default e1000 driver. A quick check of the configured adapter type is sketched after this list.
•Use multiple VMs to improve throughput rather than one large VM.
–Two-vCPU VMs show better throughput than one-vCPU VMs in certain cases.
•Tool limitations, slot limitations, and cryptography limitations still apply, as in the native case.
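As referenced in the vmxnet3 bullet above, the virtual NIC type is normally selected in the vSphere Client when the adapter is added; one quick way to confirm which emulation a VM is configured with is to inspect its .vmx file (the datastore path below is a placeholder).

```bash
# "vmxnet3" (or "vmxnet") should appear here rather than the default "e1000".
grep -i 'ethernet0.virtualDev' /vmfs/volumes/datastore1/vm1/vm1.vmx
```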


The Results
This case study and the results obtained through the collaborative efforts of FedEx and Intel engineers suggest a number of ways in which file transfer performance in the data center can be maximized, depending on the tools and virtualization techniques used, as well as the hardware configuration.

The most important performance-related considerations, based on the observations and data results obtained during testing, are as follows:

•Be aware of potential PCIe slot hardware issues. Ensure that the PCIe bus width and speed provide sufficient I/O bandwidth for full performance.
•Consider Twinax cabling to greatly reduce the cost of 10G Ethernet. The SFP+ form factor provides a single physical interface standard that can cover the application range—from the lowest cost, lowest power, and shortest reach networking available to the longest reach situations that may be encountered. Choosing the cable type on a port-by-port basis can provide a cost-effective and flexible approach for environmental needs. When using this new cabling approach, 10G can be cost effective today, as compared to commonly deployed usage models with six, eight, or more 1G ports per server.
•Set the appropriate system configuration parameters. The test team evaluated each of the recommended settings detailed in the body of this case study and determined that these settings either improved performance or were neutral for the test workloads. Make sure the BIOS is set correctly and that the OS, VMM, and applications take full advantage of modern platform features.
•Synthetic benchmarks are not the same as real workloads. Although synthetic benchmarks can be useful tools to evaluate subsystem performance, they can be misleading for estimating application-level throughput.
•Choose the right tools and usage models. Identify tool-threading limits, and use more parallelism when possible. Cryptography can be a bottleneck, so disable it if it is not required. If it is required, choose an implementation that has optimal performance.
•Set your expectations correctly. Configurations using multiple VMs will often outperform a single VM. Can your application scale in this fashion? In some cases a multi-vCPU VM is better if the applications are well threaded. Moving directly from your physical server configurations and tunings to the nearest equivalent VM is an effective starting point, but ultimately may not be the best trade-off. Virtualization features, such as VMDq and NetQueue, will perform better when multiple VMs are being used and there are no system bottlenecks, such as PCIe bandwidth limitations or cryptography processing.

This case study demonstrates that it is possible to achieve close to 10G line rate throughput from today's servers powered by Intel Xeon processors running RHEL 5.3 and VMware ESX 4.0. Use the testing methods described to validate the 10G network you are building and to help identify and resolve problems. The results detailed in this case study make it clear that tool choices, usage models, and configuration settings are more important than whether the application is running in a virtualized environment.


For More Information
For answers to questions about server performance, data center architecture, or application optimization, check out the resources on The Server Room site: http://communities.intel.com/community/openportit/server. Intel experts are available to help you improve performance within your network infrastructure and achieve your data center goals.

AUTHORS

Chris Greer, chief engineer, FedEx Services

Chris Greer is a chief engineer of technical architecture at FedEx Services. He has worked for 12 years at FedEx as a network and systems engineer. Currently, he works to research and define network and server standards with an emphasis on server virtualization.

Bob Albers, I/O usage architect, Intel Corporation

Bob Albers is an I/O usage architect in Intel's Digital Enterprise Group's End-User Platform Integration team. He has over 32 years of experience in computer and network design and architecture, including 25 years at Intel. For the last several years Bob has been focused on understanding the usage models and improving the performance of network and security workloads running on native and virtualized Intel servers.

Sreeram Sammeta, senior systems engineer, Intel Corporation

Sreeram Sammeta is a senior systems engineer in Intel's Digital Enterprise Group's End-User Platform Integration group. He has previously worked in various engineering positions in Information Technology during the last 7 years. He holds an M.S. in electrical engineering from SIU. His primary focus area is building end-to-end proofs of concept: exploring, deploying, and validating emerging end-user usage models. His recent interests include evaluating next-generation data center virtualization technologies, focusing on the performance of virtualized network and security infrastructure appliances.


1 Based on list prices for current Intel® Ethernet products. This document and the information given are for the convenience of Intel's customer base and are provided "AS IS" WITH NO WARRANTIES WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS. Receipt or possession of this document does not grant any license to any of the intellectual property described, displayed, or contained herein. Intel products are not intended for use in medical, life-saving, life-sustaining, critical control, or safety systems, or in nuclear facility applications.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel may make changes to specifications, product descriptions and plans at any time, without notice.

Copyright © Intel Corporation. All rights reserved. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and other countries. Other names and brands may be claimed as the property of others. Printed in USA. Please Recycle. 323329-002US