On the Delivered Performance of the Sun Crypto Accelerator 1000

Shih-Hao Hung and Pallab Bhattacharya

Performance and Availability Engineering

Sun Microsystems Inc.

901 San Antonio Road

Palo Alto, CA 94303

{hungsh, pallab}@eng.sun.com

The paper was originally presented at the Sun User Performance Group Conference, Honolulu, April 2002.

Abstract

Computer and network security has received an enormous amount of attention. In particular, security protocols and cryptographic operations are constantly employed to ensure authentic and private communications in the business world. However, the high cost associated with cryptographic operations has been a critical factor that has slowed down many transactions and has prevented security protocols from being widely enforced. For example, Web browsing using a secure protocol often results in slow responses as the Web servers frequently reach their capacities.

The Sun Crypto Accelerator 1000 (Sun CA1000) is a hardware-based, high-performance cryptographic security solution that enables a Sun server to handle large client loads under security measures deployed using techniques such as the Secure Sockets Layer (SSL) protocol, which is commonly used in e-commerce environments. Many critical cryptographic functions employed by SSL can be off-loaded from a Web server to the Sun CA1000 board(s) and performed at high speed.

The Sun CA1000 solution takes the performance of a Sun-based secure Web server to the next level. Our evaluation shows that it delivers a remarkable improvement in the performance of the Sun Fire™ 6800 server. Each Sun CA1000 board can process 4400 RSA operations per second and 495 Mbps of Triple-DES data encryption, a performance that is 20 times and 8 times faster, respectively, than that of the host processor. With seamless integration, such optimization accelerates the Web server by more than 270% in our Web server benchmarks and allows the host processors to dedicate themselves to other aspects of Web applications.

1. Introduction

Computer and network security is increasingly important to individuals and corporations in today's highly computerized society. Many security measures rely on cryptographic functions to ensure the confidentiality, authentication, and integrity of the data stored in computers and the messages transmitted over the network. These security measures, however, carry cost and overhead that have been significant concerns for users.

The predominant security protocol for electronic commerce, on which we focus in this paper, is Secure Sockets Layer (SSL) [1][2]. SSL encrypts client-server communications and provides functions for authentication and message integrity. SSL was originally developed by Netscape. As it gained popularity, open-source libraries such as Network Security Services (NSS) [3] and OpenSSL [4] became available to the public and were used to implement all sorts of security features for network applications. SSL is supported by the leading Web servers, e.g. iPlanet™ Web Server (iWS) [5] and Apache [6], as well as by Web browsers.

Support for SSL demands considerable performance from a Web server system due to the extra computations and messages involved in the protocol processing. The most computation-intensive cryptographic operations occur during SSL's session creation, where a cryptographic session is established between a client and a server. Additionally, SSL incurs computational load for encrypting data exchanged between the client and the server, which is also known as bulk data encryption.

The Sun Crypto Accelerator 1000 solution accelerates both session creation and bulk encryption computations for SSL, which enhances the performance of SSL on Sun servers. The Sun CA1000 is designed to off-load cryptographic computations from the host processors. Many critical cryptographic functions, such as RSA [7] and Triple-DES (3DES) [8], can be off-loaded from a Web server to the Sun CA1000 and performed in parallel. Such an optimization significantly improves the throughput of a secure Web server and saves the host processors for other work.

This paper discusses our observations on the delivered performance of the Sun CA1000. In Section 2, we give an overview of the capability of the Sun CA1000 and evaluate its performance using in-house microbenchmark programs. We reveal the potential of the Sun CA1000 in terms of the acceleration of cryptographic functions (RSA and 3DES) and the savings of host processor time. In Section 3, as a case study, we examine the effectiveness of the Sun CA1000 in accelerating iPlanet Web Server 6.0 (iWS6). We measure the performance of a Sun Fire 6800 server running iWS6, using a modified SPECweb99_SSL benchmark program that drives the server with secure Web requests (HTTPS) generated from remote clients. Section 4 discusses the performance considerations and tuning options that users should pay attention to when running high-performance secure Web servers. Section 5 summarizes our findings and concludes the paper.

2. The Sun Crypto Accelerator 1000

The Sun Crypto Accelerator 1000 solution¹ consists of a single-chip cryptographic accelerator in a PCI package and software packages with SSL support. The solution is compatible with most PCI-based Sun servers running the Solaris™ 8 Operating Environment. Currently, SSL support is available for iPlanet 4.x/6.x Web Servers and OpenSSL/Apache 1.3.12.

1. In this paper, we use the term “solution” to emphasize the integration of hardware and software.


Figure 1 illustrates how Web servers exercise the components of the Sun CA1000 software and hardware. Multiple servers can utilize the Sun CA1000 boards simultaneously via the iWS adapter or the Apache module that Sun provides. The Sun CA1000 cryptographic library (libcryptography, for short) is responsible for scheduling cryptographic jobs on the Sun CA1000 boards. In addition, libcryptography contains software algorithms that are used to perform cryptographic functions in case all the Sun CA1000 boards in the system fail.

The following two operations, which are among the most widely used yet most costly cryptographic operations, can be accelerated¹:

• RSA: 4400 private key (CRT) operations/sec.
• 3DES-CBC: 495 Mbps (16K packet size).

The Sun CA1000 also supports other cryptographic functions, such as DSA and SHA1. In this paper, we chose to focus on the performance of RSA and 3DES as case studies for public-key and symmetric-key cryptography, respectively.

In this section, we briefly describe the Sun Fire 6800 system that we used to evaluate the Sun CA1000. Then we describe the RSA and 3DES functions and validate the performance of the Sun CA1000 through a series of microbenchmark programs that run on a host machine. We measure and compare the throughput as well as the host processor utilization on a Sun Fire 6800 server with and without the Sun CA1000.

2.1 System under Test

Throughout this paper we evaluate the Sun CA1000 solution using the Sun Fire 6800 midframe server. The system that we use contains up to 24 UltraSPARC® III processors running at 900 MHz. The system can be dynamically configured to bring components on-line or take them off-line without disrupting system operation or requiring a system reboot.

The Sun Fire 6800 server can have up to four I/O assemblies, and each I/O assembly has eight 64-bit PCI slots. All PCI slots support a 33 MHz bus speed. Two PCI slots additionally support 66 MHz operation. The I/O subsystem is designed to be scalable and upgradable so the system can handle I/O-intensive applications such as Web servers and database applications.

Sufficient memory and network interfaces are installed so that performance is not affected by virtual memory swapping or network congestion.

1. These are the preliminary specifications for the pre-production Sun CA1000 boards that we received for testing. The final specifications have not been decided yet as of the submission date of this paper. The production release is expected to meet or exceed the specifications quoted in this paper. More cryptographic functions may be supported through software update.

Figure 1: Web servers and the components of the Sun CA1000 solution. [Diagram, top to bottom: applications (iWS and Apache instances); Sun CA1000 solution software (Sun CA1000 iWS adapter, Sun CA1000 Apache module, Sun CA1000 cryptographic library, Sun CA1000 driver); hardware (Sun CA1000 boards).]


The processors used for the test are currently Sun’s top-of-the-line processors. Other experiments that we carried out indicate that slower systems benefit more from the use of the Sun CA1000 because software implementations run slower on these machines.

2.2 RSA Performance

RSA is a public key algorithm invented by Ron Rivest, Adi Shamir, and Leonard Adleman in 1977 [7]. It is the most popular public key algorithm used in establishing SSL sessions [1]. RSA operations are computationally expensive, and it is time consuming for a general-purpose processor to perform them. In this subsection, we examine the potential of the Sun CA1000 in terms of its capacity and efficiency in RSA processing.

2.2.1 RSAperf

We used a microbenchmark program developed at Sun, called RSAperf (rsaperf), to measure the performance of a system in processing RSA operations. As illustrated in Figure 2, multiple instances of RSAperf call the cryptographic library provided by the Sun CA1000 package to carry out RSA operations. The cryptographic library takes advantage of the Sun CA1000 automatically if the board is present, or performs the cryptographic functions in software using the host processor(s) if no Sun CA1000 board is on-line¹. The software implementation is highly optimized for UltraSPARC® processors.

We executed RSAperf on our Sun Fire 6800 server. The results are listed in Table 1. We measured the throughput of 1024-bit RSA operations, the most commonly used RSA key size, with 1, 2, 4, and 8 on-line host processors. During the run, utilization of the host processors was monitored using the Solaris mpstat utility.
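The kind of measurement RSAperf performs can be illustrated with a small sketch. This is not the RSAperf source, which is not shown in this paper; it is a minimal Python stand-in that times RSA private-key operations computed with the Chinese Remainder Theorem (CRT). The key is a toy built from two known Mersenne primes, far shorter than 1024 bits, and the names `rsa_private_crt` and `ops_per_second` are hypothetical.

```python
import time

# Toy key built from two known Mersenne primes. Assumption for illustration
# only: real RSAperf runs use 1024-bit keys and the Sun CA1000 library.
P = 2**127 - 1
Q = 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

# Precomputed CRT parameters.
DP = D % (P - 1)
DQ = D % (Q - 1)
QINV = pow(Q, -1, P)


def rsa_private_crt(c: int) -> int:
    """Compute c^D mod N using two half-size exponentiations (CRT)."""
    m1 = pow(c, DP, P)
    m2 = pow(c, DQ, Q)
    h = (QINV * (m1 - m2)) % P
    return m2 + h * Q


def ops_per_second(duration: float = 0.2) -> float:
    """Run private-key operations back to back and report the rate."""
    c = pow(42, E, N)
    done = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        rsa_private_crt(c)
        done += 1
    return done / (time.perf_counter() - start)


if __name__ == "__main__":
    print(f"{ops_per_second():.0f} toy RSA-CRT ops/sec on one core")
```

The absolute rate printed by this sketch is not comparable to the figures in Table 1, because the key size, language, and hardware all differ; only the measurement pattern is the same.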

2.2.2 Software Performance

The software columns in Table 1 show the performance of the system when the operations were done completely by the host processor(s). The host processors were fully utilized to execute the RSA operations, as indicated by the 100% CPU utilization. A single-processor system is capable of 245 RSA operations per second. The throughput scales almost linearly as the number of processors increases, as RSAperf can take advantage of multiple processors.

2.2.3 Accelerated Performance

With one Sun CA1000 board, the single-processor system is capable of 4287 RSA operations per second. For that performance, only 20% of host processor time is needed for running the benchmark program and driving the Sun CA1000 board to its full potential.

Comparing the performance of the single-processor systems, as illustrated by Figure 3, the system with one Sun CA1000 board generates 4287/245 = 17.5 times as much throughput as the system without Sun CA1000 boards. Thus, the Sun CA1000 board would save 17 host processors for the user in delivering ~4000 RSA ops/s. Furthermore, considering that one Sun CA1000 board needs only 20% of one host processor to realize this performance, the Sun CA1000-enabled system is 17.5/0.20 = 87.5 times more efficient in running RSAperf.

Figure 2: RSAperf, Sun CA1000 software and hardware. [Diagram: on the host, multiple RSAperf instances call the Sun CA1000 cryptographic library, which either runs its software algorithms or passes jobs through the Sun CA1000 driver to the Sun CA1000 boards.]

1. Note that this is a high-availability feature to ensure the cryptographic functions are done correctly even when no Sun CA1000 board is installed in the system.
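The two ratios quoted for the single-processor comparison in Section 2.2.3 can be reproduced from the Table 1 figures. The sketch below is only a worked version of that arithmetic; the function names are hypothetical.

```python
# Single-processor figures from Table 1.
SOFTWARE_OPS, SOFTWARE_CPU = 245.0, 1.00   # ops/sec, fraction of one CPU
ACCEL_OPS, ACCEL_CPU = 4287.0, 0.20


def throughput_ratio(fast_ops: float, slow_ops: float) -> float:
    """How many times more operations per second the faster setup delivers."""
    return fast_ops / slow_ops


def per_cpu_efficiency_ratio(fast_ops, fast_cpu, slow_ops, slow_cpu) -> float:
    """Ratio of operations delivered per unit of host CPU time."""
    return (fast_ops / fast_cpu) / (slow_ops / slow_cpu)


speedup = throughput_ratio(ACCEL_OPS, SOFTWARE_OPS)
efficiency = per_cpu_efficiency_ratio(
    ACCEL_OPS, ACCEL_CPU, SOFTWARE_OPS, SOFTWARE_CPU
)

print(f"throughput ratio:             {speedup:.1f}x")
print(f"ops per unit of host CPU:     {efficiency:.1f}x")
```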

Number of Host   Software                         Sun CA1000
Processors       RSA ops/sec   CPU Utilization    RSA ops/sec   CPU Utilization
1                245           100%               4287          20%
2                484           100%               4337          18%
4                931           100%               4397          7%
8                1908          100%               4381          3%

Table 1: RSA performance on Sun Fire 6800 Server.

Figure 3: RSA throughput (ops/sec) on Sun Fire 6800 Server, varying the number of on-line host processors. [Plot: x-axis, number of processors (0 to 10); y-axis, RSA ops/sec (0 to 5000); series: SCA1000, Software.]

2.2.4 Scalability of Accelerated Performance

The design of the Sun CA1000 emphasizes performance scalability. The Sun CA1000 board handles cryptographic operations concurrently in a pipelined fashion, and its architecture takes advantage of the abundant parallelism existing in the tasks that Web servers typically handle.

To optimize throughput, one Sun CA1000 board can accept up to 24 concurrent RSA operations without the host processor(s) waiting for prior operations to finish. In addition, multiple Sun CA1000 boards can be installed and utilized on one host machine to work for the same or different applications. The Sun CA1000 cryptographic library automatically schedules the cryptographic tasks across all the Sun CA1000 boards available in the system.

Figure 4 shows the results on the two-processor system, varying the number of processes that RSAperf forks to submit RSA jobs. With one RSAperf process, the Sun CA1000 board performs one RSA operation at a time, and the system is capable of 539 ops/sec. From this test, we conclude that the latency for completing one RSA operation is on average 1.85 milliseconds. In comparison, the host processor needs approximately 4 milliseconds to perform the same operation in software.

To fully exploit the potential of one Sun CA1000 board, RSAperf needs to submit at least 24 RSA jobs in parallel. Adding further processes does not increase the performance, as a single Sun CA1000 board cannot accept more than 24 concurrent RSA operations.

By adding a second Sun CA1000 board to the system, the peak performance can be doubled to 8795 RSA ops/sec. Our experiments show that it takes approximately 64 processes to keep two Sun CA1000 boards busy.

We also configured RSAperf to run in thread mode, which uses threads instead of processes to generate the workload. The two sets of results are very close, which means that applications can take advantage of the Sun CA1000 with either processes or threads.
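The latency figures in Section 2.2.4 follow from the single-stream throughput: when operations are issued strictly one at a time, the average latency is the reciprocal of the rate. The sketch below works through that step and also compares the single-stream rate with the peak rate that Table 1 reports for the two-processor system. The function name is hypothetical, and the "concurrency gain" is simply a ratio of two measured rates, not a statement about the board's internal pipeline depth.

```python
def latency_ms(ops_per_sec: float) -> float:
    """Average time per operation when operations run strictly one at a time."""
    return 1000.0 / ops_per_sec


# One RSAperf process driving one board (Section 2.2.4).
board_latency = latency_ms(539)
# One host processor running software RSA (Table 1).
software_latency = latency_ms(245)
# Peak rate with many concurrent submitters on two host processors (Table 1)
# divided by the rate with a single submitter.
concurrency_gain = 4337 / 539

print(f"board latency:    {board_latency:.2f} ms per operation")
print(f"software latency: {software_latency:.2f} ms per operation")
print(f"gain from concurrent submission: {concurrency_gain:.1f}x")
```

The gain of roughly eight illustrates why a single submitting process leaves most of the board's capacity unused.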

2.3 Triple-DES Performance

Triple-DES (3DES) [8] is based on the Data Encryption Standard (DES), one of the most widely used symmetric cipher/bulk-data encryption algorithms. 3DES is computationally expensive, as one 3DES operation essentially requires three DES operations. The high cost associated with 3DES makes it one of the slower symmetric ciphers when implemented in software.

2.3.1 3DESperf

We used a microbenchmark program, called 3DESperf, to evaluate the performance of 3DES executed in software and hardware. Similar to RSAperf (see Figure 2), 3DESperf calls the Sun CA1000 cryptographic library to carry out 3DES encryption of given messages as fast as the host machine can. The cryptographic library uses a Sun CA1000 board automatically when the board is present, or uses host processors to perform the jobs when no Sun CA1000 board is on-line.

2.3.2 Size of Messages

The size of the encrypted messages is an important factor for the accelerated 3DES performance. For a Sun CA1000 board to encrypt a message using 3DES, the message has to be transferred between the host and the Sun CA1000 board via the PCI bus. The cost associated with job scheduling and data transfer discourages small payloads from utilizing hardware acceleration. There is a fixed software cost for setting up data structures, entering the kernel, and scheduling operations on the hardware, which limits the effectiveness of the hardware for short messages.

For this reason, the Sun CA1000 cryptographic library handles messages smaller than 1K bytes in software rather than directing them to hardware. Our measurements, described in the next subsection, validate the efficacy of this policy.

Figure 4: Sun CA1000 RSA throughput on Sun Fire 6800 Server with two on-line processors, varying the number of RSAperf processes used to generate the workload. [Plot: x-axis, number of RSAperf processes (0 to 150); y-axis, RSA ops/sec (0 to 10000); series: 2x SCA1000, 1x SCA1000.]
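The size-based policy described in Section 2.3.2 can be sketched as a simple dispatch rule, together with the break-even reasoning behind it. The code below is an illustration only and is not the Sun CA1000 library's interface: `choose_path` and `break_even_bytes` are hypothetical names, and the cost model (a fixed per-job overhead plus a per-byte cost) is an assumption made to show why a threshold exists, not a measured model of the board.

```python
# Policy stated in the text: messages smaller than 1K bytes stay in software.
SOFTWARE_THRESHOLD_BYTES = 1024


def choose_path(message_len: int, board_online: bool = True) -> str:
    """Pick where a 3DES job runs under the size-threshold policy."""
    if not board_online or message_len < SOFTWARE_THRESHOLD_BYTES:
        return "software"
    return "hardware"


def break_even_bytes(fixed_cost_s: float,
                     hw_bytes_per_s: float,
                     sw_bytes_per_s: float) -> float:
    """Message size n at which fixed_cost + n/hw equals n/sw.

    Assumed model: hardware pays a fixed per-job cost but streams faster;
    software has no fixed cost but a lower rate.
    """
    if hw_bytes_per_s <= sw_bytes_per_s:
        raise ValueError("hardware must stream faster than software")
    return fixed_cost_s / (1.0 / sw_bytes_per_s - 1.0 / hw_bytes_per_s)


if __name__ == "__main__":
    for size in (256, 1024, 16384):
        print(size, "bytes ->", choose_path(size))
```

Under this model, any message shorter than the break-even size completes sooner in software, which is the behavior a fixed threshold approximates.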

2.3.3 Accelerated Performance

Table 2 shows the benchmark results measured on a system with dual 900 MHz UltraSPARC® III processors and one Sun CA1000 board. As the size of messages increases, the throughput from software increases because fewer function calls are involved in performing the encryption.

The overhead of function calls and job scheduling impacts the performance of the Sun CA1000 board significantly for short messages. Still, the data presented in Table 2 show that the performance is better with the Sun CA1000 board for messages of 1 KB and longer.

For messages of 8 KB and larger, the Sun CA1000 board outperforms the software mechanism by at least 3.4 times while utilizing at most 54% of the host processors. In terms of host processor utilization per byte encrypted, the system with one Sun CA1000 board is 6.4 times more efficient than the system without it for encrypting 8 KB messages.

Figure 5 shows that the performance of software 3DES scales well when more host processors are on-line. More 3DES throughput can be generated by adding host processors. It requires, however, eight host processors to achieve the same level of performance that one Sun CA1000 board provides. Notice that only 20% processor utilization is needed to perform 500 Mbps 3DES with one Sun CA1000 board. Essentially, the Sun CA1000 board saves almost eight processors in this case.

Message   Software                               Sun CA1000
Size      Throughput (Mbps)   CPU Utilization    Throughput (Mbps)   CPU Utilization
16K       142                 100%               501                 20%
8K        140                 100%               482                 54%
4K        133                 100%               419                 93%
2K        128                 100%               240                 100%
1K        114                 100%               136                 100%

Table 2: 3DES performance on Sun Fire 6800 Server with 2 processors, varying message size.

Figure 5: 3DES throughput on Sun Fire 6800 Server, varying the number of on-line host processors (Message Size=16KB). [Plot: x-axis, number of processors (0 to 10); y-axis, Mbps (0 to 600); series: Software, SCA1000.]
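The speedup and per-byte efficiency figures quoted for Table 2 can be recomputed for every message size. The sketch below is a worked version of that arithmetic using the Table 2 values; the helper name `ratios` is hypothetical.

```python
# (message size, software Mbps, software CPU, Sun CA1000 Mbps, Sun CA1000 CPU)
# taken from Table 2; CPU utilization is expressed as a fraction.
TABLE_2 = [
    ("16K", 142, 1.00, 501, 0.20),
    ("8K", 140, 1.00, 482, 0.54),
    ("4K", 133, 1.00, 419, 0.93),
    ("2K", 128, 1.00, 240, 1.00),
    ("1K", 114, 1.00, 136, 1.00),
]


def ratios(sw_mbps, sw_cpu, hw_mbps, hw_cpu):
    """Return (throughput speedup, throughput-per-CPU efficiency ratio)."""
    speedup = hw_mbps / sw_mbps
    efficiency = (hw_mbps / hw_cpu) / (sw_mbps / sw_cpu)
    return speedup, efficiency


if __name__ == "__main__":
    for size, sw, sw_cpu, hw, hw_cpu in TABLE_2:
        speedup, efficiency = ratios(sw, sw_cpu, hw, hw_cpu)
        print(f"{size:>4}: speedup {speedup:.1f}x, per-CPU efficiency {efficiency:.1f}x")
```

For the smallest sizes the two ratios coincide, because both configurations keep the host processors fully busy.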

2.4 Performance with Various PCI Bus Speeds

The Sun CA1000 board is compatible with both 32-bit and 64-bit PCI buses running at either 33 MHz or 66 MHz. The data collected on the Sun Fire 6800 server are based on a 64-bit PCI bus. While many Sun platforms are equipped with 64-bit PCI slot(s), we evaluated the performance of Sun CA1000 boards on a small server with a 32-bit PCI slot, and the results showed that the performance of a Sun CA1000 board could be sub-optimal when connected to a 32-bit PCI slot.

RSA operations require very little bus traffic besides job scheduling, thus the bus speed does not pose a limitation. For 3DES, a 64-bit 33 MHz PCI bus still provides sufficient bandwidth for one Sun CA1000 board to encrypt at 495 Mbps, but a 32-bit 33 MHz PCI bus limits the 3DES throughput to approximately 295 Mbps for this UltraSPARC II-based server.

Thus, users should be aware that the performance of a Sun CA1000 board can be affected by the PCI bus, especially when multiple high-speed I/O devices, such as Gigabit network cards, disk controllers, or another Sun CA1000 board, contend for the same PCI bus.
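A rough back-of-the-envelope calculation suggests why the bus width matters for 3DES. The sketch below computes the theoretical peak of a PCI bus as width times nominal clock, and the share of that peak needed if every encrypted byte crosses the bus twice (plaintext to the board, ciphertext back). The two-crossing assumption and the function names are ours, not measurements from this paper, and real buses sustain noticeably less than their theoretical peak.

```python
def pci_peak_mbps(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak transfer rate in Mbps: one bus-width word per clock."""
    return width_bits * clock_mhz


def bus_load_fraction(encrypt_mbps: float, width_bits: int, clock_mhz: float) -> float:
    """Share of the theoretical peak used, assuming data crosses the bus twice."""
    return 2 * encrypt_mbps / pci_peak_mbps(width_bits, clock_mhz)


if __name__ == "__main__":
    for width in (32, 64):
        peak = pci_peak_mbps(width, 33)
        load = bus_load_fraction(495, width, 33)
        print(f"{width}-bit/33 MHz: peak {peak:.0f} Mbps, "
              f"495 Mbps 3DES needs {load:.0%} of peak")
```

Under these assumptions, full-rate 3DES would need nearly all of the theoretical capacity of a 32-bit 33 MHz bus but less than half of a 64-bit 33 MHz bus, which is consistent in direction with the lower throughput observed on the 32-bit slot.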

3. Accelerating Secure Web Server Performance

The World Wide Web is used by millions of people every day. Numerous applications and services are provided to end users through Web servers and the HyperText Transfer Protocol (HTTP). SSL is a protocol that provides a secure channel between two machines by protecting data in transit and authenticating the machines that are exchanging information. Web traffic can be protected by running HTTP over SSL (HTTPS), which is supported by nearly all modern commercial Web servers.

HTTPS puts a much higher load on the server machine than does pure HTTP, which is why HTTPS is typically used only when the situation demands a secure channel, such as when sending credit card information. Even so, World Wide Web users should not be surprised when they receive slow responses from the use of HTTPS.

We focus on the performance of secure Web servers in this section, as an example of how a high-performance crypto accelerator can improve the performance of an important application. Besides Web servers, a variety of other network applications also use SSL to facilitate secure communications. Popular network protocols, such as NNTP, SMTP, POP, and IMAP, can take advantage of SSL to secure news and mail access on the Internet [1]. As with Web servers, the high cost of SSL has prevented many Internet protocols from using SSL on a regular basis. With the advent of SSL accelerators like the Sun CA1000, we envision that the Internet will become a more secure place in the near future.

3.1 SSL Handshakes: Session Creation and Resumption

Here we briefly introduce the notion of an SSL session. Details can be found in [1].

An SSL connection represents one specific communication channel. It can be divided into two phases, the handshake and data transfer phases. The handshake phase authenticates the server and establishes the private cryptographic keys to be used to encrypt the data to be transmitted during the data transfer phase.

When an SSL connection is created between the client and server for the first time, a full handshake is needed for the client and server to negotiate the algorithms and the master secret that are to be used to authenticate and encrypt bulk data in the current and perhaps future SSL connections. An SSL session is a virtual construct representing the negotiated algorithms and the master secret, as the result of the first SSL connection. A full SSL handshake, also referred to as an SSL session creation, requires a computationally expensive public key cryptographic operation using algorithms such as RSA to establish the master secret.

One opportunity to reduce the overhead of session creation is to use the session resumption mechanism provided by SSL, which is also known as session reuse or session caching. An SSL connection can resume a previous session by reusing the master secret established in a previous SSL connection. This avoids the computationally expensive public key cryptographic operation required for creating a new SSL session.
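The difference between session creation and session resumption can be illustrated with a toy server-side session cache. This is a simplified model written for this explanation, not an SSL implementation: the class `ToySslServer` and its methods are hypothetical, no real handshake or RSA operation takes place, and a random 48-byte value merely stands in for the master secret.

```python
import secrets


class ToySslServer:
    """Counts full handshakes versus resumptions using a session cache."""

    def __init__(self):
        self.session_cache = {}   # session id -> stand-in master secret
        self.full_handshakes = 0  # each would cost one RSA private-key op
        self.resumptions = 0      # no public key operation needed

    def connect(self, session_id=None):
        """Open a connection, resuming a cached session when possible."""
        if session_id is not None and session_id in self.session_cache:
            self.resumptions += 1
            return session_id
        # Full handshake: in real SSL this is where the RSA operation occurs.
        self.full_handshakes += 1
        new_id = secrets.token_hex(16)
        self.session_cache[new_id] = secrets.token_bytes(48)
        return new_id

    def resumption_ratio(self) -> float:
        """Resumed connections divided by all connections."""
        total = self.full_handshakes + self.resumptions
        return self.resumptions / total if total else 0.0


if __name__ == "__main__":
    server = ToySslServer()
    sid = server.connect()          # first connection: full handshake
    for _ in range(9):
        server.connect(sid)         # later connections reuse the session
    print(server.full_handshakes, server.resumptions, server.resumption_ratio())
```

In this model only the first of ten connections pays for a full handshake, which is why a workload with a high resumption ratio places far less RSA load on the server.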

3.2 Benchmarking Methodology

Popular benchmark programs, for example HTTP_LOAD [10], WebStone [11], and SPECweb [12], exist for measuring the performance of non-secure Web servers. One can also find secure versions of these benchmark programs, which may or may not be officially supported, to experiment with secure Web servers. In our earlier experiments, we used secure versions of HTTP_LOAD, WebStone, and SPECweb96, and they all generated similar results against small Web servers.

A secure version of SPECweb99, called SPECweb99_SSL, was in its final development stage when we experimented with the Sun CA1000. By April 2002, the official version should be available to the public. SPECweb99_SSL is being promoted by major computer manufacturers as the standard benchmark program for performance testing of secure Web servers. Sun has been actively involved in the development of SPECweb99_SSL, and we adapted a pre-release version of SPECweb99_SSL. For testing the Sun CA1000, we found the benchmark program to be very usable and stable.

As illustrated in Figure 6, we use SPECweb99_SSL to generate the workload from a set of client machines to emulate real-world scenarios where thousands or more clients could be simultaneously accessing a secure Web server via HTTPS. We customized the SPECweb99_SSL workload generation program to provide the tests we needed to evaluate the impact that the Sun CA1000 has on the performance of Web servers. These customized tests were designed to measure the performance of session creation and session resumption on the server¹.

In our session creation test, the client program generates HTTPS requests to fetch small (102 bytes) files from the server, one file per SSL session, with no session resumption allowed. Small files are intentionally chosen to minimize the effects of bulk data encryption and file transfer (disk I/O) in the test, so the focus can be on the session creation/SSL handshake phase. SSL sessions are established using 1024-bit RSA.

Our session resumption test generates similar workloads, except that the requests generated by the client program allow the Web server to speed up the SSL handshake phase with session resumption. In our experiments, Web servers took full advantage of session resumption after the first HTTPS operations from the clients, thus the ratio of session resumption² was nearly 100%.

Figure 6 shows two basic server configurations: (a) software-only and (b) accelerated. On the software-only server, the NSS cryptographic modules handle the cryptographic operations for the Web server using their own (internal) implementation. On the accelerated server, most cryptographic functions are handled by the NSS internal modules, except that the RSA operations are accelerated using the Sun CA1000 board(s).

Our session creation test was designed to measure how well the Sun CA1000 solution can accelerate session creations with its high-performance RSA capability. As opposed to the session creation test, the session resumption test does not involve heavy cryptographic operations. Hence the results of the session resumption test should not be affected by the presence of the Sun CA1000 software/hardware. Because session creation basically includes the operations involved in session resumption, session resumption should be faster than session creation, and we consider session resumption performance an upper bound that limits how well the same system can do in the session creation test.

Figure 6: A functional diagram for the Web server benchmark environments. [Diagram: in both (a) the software-only server and (b) the accelerated server, SPECweb99_SSL clients send HTTPS requests over a TCP/IP network to secure iWS instances on the iWS6 server, each using an NSS cryptographic module. In (b), the NSS modules additionally reach the Sun CA1000 boards through the Sun CA1000 iWS adapter, the Sun CA1000 cryptographic library, and the Sun CA1000 driver.]

1. We cannot disclose our SPECweb99_SSL results as the benchmark was not finalized when we performed the test. Our customized tests deviate substantially from the standard SPECweb99_SSL, and the results presented in this paper should not be compared with any standard SPECweb99_SSL results.

2. Session resumption ratio is defined as (number of session resumption operations)/(number of HTTPS operations).

3.3 iPlanet Web Server Performance

We have used the Sun CA1000 solution to accelerate iPlanet Web Server 6.0 and Apache 1.3.12. In our experience, both Web servers, running on small systems with no more than 8 processors, gave comparable performance. On large systems, iWS6 scaled well beyond 8 processors, while the performance of Apache 1.3.12 suffered from its architecture and its lack of performance tuning options¹. The abundant performance tuning options provided by iWS6, on the other hand, allowed us to properly set up the Web server to give good performance for each machine configuration. In this paper, we choose to study the performance of iWS6 only.

3.3.1 Session Creation Performance

Table 3 and Figure 7 show the SSL session creation performance of iWS6 running on a Sun Fire 6800 server with a variety of configurations², measured with the modified SPECweb99_SSL benchmark program mentioned in the previous subsection. The Sun Fire 6800 server can have up to 24 UltraSPARC® III processors on-line, and we selectively disabled some of the processors to measure the scalability of the Web server.

As the number of processors increased, the performance of the Web server increased, but not linearly, because of increased contention for system resources. Overall, the “software curve” in Figure 7 shows that the Sun Fire 6800 server scaled well, as a 16-processor system produced approximately 13 times the throughput of a one-processor system, an efficiency³ of 81%, without the Sun CA1000.

1. Apache version 2 claims to perform better with its new thread model and tuning options. Sun CA1000 support for Apache 2 is planned.

2. Note that two Sun CA1000 boards were used for the 24-processor configuration because the capacity of one Sun CA1000 is reached in the 16-processor case.

3. Efficiency is defined as (normalized performance)/(no. of processors), which characterizes how effectively the processors are utilized to enhance application performance in a multiprocessor system.

No. of Host   NSS Module                   Sun CA1000                   Acceleration
Processors    Throughput    Normalized     Throughput    Normalized     (Sun CA1000/Software)
              (HTTPS/sec)   Performance    (HTTPS/sec)   Performance
1             124.7         1              337           2.70           2.7
2             236.3         1.89           668           5.36           2.83
4             462.4         3.71           1288          10.33          2.79
8             845.5         6.78           2457          19.70          2.91
12            1239          9.93           3636          29.16          2.93
16            1611          12.92          4385          35.16          2.72

Table 3: iWS6 SSL session creation performance on Sun Fire 6800 Server.
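The derived columns of Table 3, and the 81% efficiency quoted for the 16-processor software configuration, can be recomputed from the two measured throughput columns. The sketch below is a worked version of that arithmetic; the function name `derived` is hypothetical, and normalization is relative to the one-processor software result, as in the table.

```python
# processors -> (NSS module HTTPS/sec, Sun CA1000 HTTPS/sec), from Table 3.
TABLE_3 = {
    1: (124.7, 337),
    2: (236.3, 668),
    4: (462.4, 1288),
    8: (845.5, 2457),
    12: (1239, 3636),
    16: (1611, 4385),
}
BASELINE = TABLE_3[1][0]  # one-processor software throughput


def derived(processors: int) -> dict:
    """Recompute normalized performance, acceleration, and efficiency."""
    software, accelerated = TABLE_3[processors]
    return {
        "normalized_software": software / BASELINE,
        "normalized_accelerated": accelerated / BASELINE,
        "acceleration": accelerated / software,
        # Efficiency = normalized performance / number of processors.
        "software_efficiency": software / BASELINE / processors,
    }


if __name__ == "__main__":
    for n in TABLE_3:
        d = derived(n)
        print(f"{n:>2} CPUs: accel {d['acceleration']:.2f}x, "
              f"software efficiency {d['software_efficiency']:.0%}")
```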


The Web server also scaled well when it was accelerated by the Sun CA1000, until the capacity of the Sun CA1000 board for processing RSA operations was reached. With one Sun CA1000 board, the performance of the Web server scaled to 4385 HTTPS/sec (session creations per second). With two Sun CA1000 boards, 5777 HTTPS/sec was produced on the 24-processor system.

3.3.2 Session Creation Acceleration

Table 3 also lists the acceleration ratios that one Sun CA1000 board provided to each test configuration. The acceleration ratios are between 2.7 and 2.93 across the table.

In a preliminary study, we found that, with one processor, the NSS internal cryptographic module is capable of processing 208 RSA ops/sec, compared to the 245 RSA ops/sec that we measured with the Sun CA1000 library when the hardware is disabled (Section 2.2). Figure 8 compares the SSL session creation performance with different implementations of the 1024-bit RSA algorithm.

The one-processor system served at an average rate of 124.7 HTTPS/sec using the NSS crypto module. Thus, the system spent approximately 124.7/208 = 60% of the host processing power in the RSA computations and the remaining 40% in other parts of HTTPS processing.

If the 60% of software RSA processing time is completely off-loaded from the host processor to the Sun CA1000 board, the host needs only 40% of its original time per request, so we should see the Sun CA1000 solution accelerate SSL session creation by a factor of 1/(1 - 0.6) = 2.5, which is very close to the 2.7 acceleration ratio that we observed (see the footnote on this discrepancy).
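The estimate above is an Amdahl-style calculation and can be reproduced with a few lines of Python. This is a minimal sketch that assumes the RSA work is removed from the host entirely and that nothing else changes; the function name is illustrative.

```python
def offload_speedup(requests_per_sec, rsa_ops_per_sec):
    """Estimate the speedup from fully off-loading RSA work.

    requests_per_sec: measured software-only throughput (one RSA op per request).
    rsa_ops_per_sec:  RSA operations per second the host can do on its own.
    """
    # Fraction of host time spent in RSA when serving at the measured rate.
    rsa_fraction = requests_per_sec / rsa_ops_per_sec
    # If that fraction disappears, the remaining work bounds the throughput.
    speedup = 1.0 / (1.0 - rsa_fraction)
    return rsa_fraction, speedup


if __name__ == "__main__":
    fraction, speedup = offload_speedup(124.7, 208.0)
    print(f"RSA fraction: {fraction:.2f}, predicted speedup: {speedup:.2f}")
```

With the measured values of 124.7 HTTPS/sec and 208 RSA ops/sec, the predicted factor rounds to 2.5, slightly below the observed 2.7.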

3.3.3 Peak Performance of Single Sun CA1000 Systems

The potential of a single Sun CA1000 board was fully exploited by the 16-processor system, giving a performance of 4385 HTTPS/sec. Recall that each HTTPS request from the client invoked a new SSL session that required one 1024-bit RSA operation from the server. Thus the performance is bound by the Sun CA1000 board's capacity in handling RSA operations. This is confirmed by the 13% CPU idle time observed in the 16-processor case: when the RSA bandwidth of the Sun CA1000 board was exhausted, some host processors had to wait to submit RSA jobs to the Sun CA1000.

The 4385 HTTPS/sec performance was reported by the benchmark program at the end of a five-minute run to reflect the average throughput during the run. We validated the test by monitoring the Sun CA1000's performance counters, which showed the same average completion rate of RSA jobs.

Figure 7: iWS6 SSL Session Creation Performance on Sun Fire 6800 Server. [Plot of HTTPS/sec (0 to 7000) versus No. of Host Processors (0 to 28); legend entries recovered: NSS Module, 2x SCA-1000.]

1. The discrepancy could be due to measurement errors or factors which are not obvious to us at this moment.

Figure 8: Comparison of iWS6 SSL Session Creation Performance on one-processor Sun Fire 6800 Server. [Bar chart, HTTPS/sec: NSS Module 124.7; SCA-1000 Software 136; SCA-1000 Hardware 337.]

3.3.4 Scalability - Multiple Sun CA1000 Boards

While a single Sun CA1000 board can effectively accelerate the Web server to 4385 HTTPS/sec, the Sun CA1000 solution scaled further: two Sun CA1000 boards enabled an average rate of 5778 HTTPS/sec in the 24-processor case. Note that the 15% idle time observed on the host processors was caused by the limited workload generation capacity of the clients.

This experiment shows that Web server performance can scale with multiple Sun CA1000 boards. This scalability makes the Sun CA1000 an attractive, competitive, upgradable solution for the years to come.

3.3.5 Session Resumption Performance

Under the session resumption test, we allow the client and server to reuse an SSL session an unlimited number of times once the session is created. Since SSL session resumption does not require expensive public key cryptographic operations, we did not find the Sun CA1000 to provide any acceleration for this test, nor did we find that the Sun CA1000 impaired the session resumption performance.

The performance gap between session resumption and session creation with the Sun CA1000 solution is relatively small, as illustrated by Figure 9. For the 12-processor system, session resumption performance is approximately 10% better than session creation performance with the Sun CA1000 solution. In fact, this summarizes the role of the Sun CA1000 solution in the SSL handshake phase: the Sun CA1000 effectively reduces the cost of SSL session creation to nearly the same as the cost of session resumption for our test configurations.

4. Tuning for Secure iWS

Before the advent of products like the Sun Crypto Accelerator 1000, it was difficult for users or even developers to envision a high-performance secure Web server that can carry out 4400 HTTPS/sec and 495 Mbps of 3DES encryption simultaneously. Running at this level of performance, a Web server has to be efficient in handling not only the SSL protocol, but also HTTP, TCP/IP, and disk I/O operations. Performance tuning is a common practice for a high-performance Web server, as Web servers and operating systems often need to be optimized to handle high Web traffic [13]. This section shares our tuning experience.

Figure 9: Comparison of iWS6 performance on 12-processor Sun Fire 6800 Server. [Bar chart, y-axis labeled HTTP/sec: Session Creation 3636; Session Resumption 3964.]

4.1 Networking

Today, 100 Mbps networks are widely deployed, and gigabit networks are beginning to gain momentum. While network bandwidth traditionally has not been a very critical issue for secure Web servers, to take full advantage of the hardware encryption capability of the Sun Crypto Accelerator 1000, e.g., 495 Mbps 3DES, the host machine must be equipped with proper network interface(s) to support the network bandwidth.

HTTPS incurs extra handshake messages for establishing each SSL connection. During the handshake phase, a full SSL handshake would require the server to receive 4 messages from the client and reply with 5 messages. While these handshake messages are relatively short, they put pressure on the network interface and the operating system. For example, to serve 4,000 HTTPS requests in one second, the server has to handle 16,000 incoming handshake messages and deliver 20,000 outgoing messages to the clients in one second. In addition, these messages incur TCP ACK packets. Even with a high-speed network interface such as the Sun GigabitEthernet/P 2.0 adapter, the network interface can become a point of contention simply from serving SSL handshakes.
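The handshake message load scales linearly with the session creation rate. The following Python sketch reproduces the arithmetic above, using the 4 incoming and 5 outgoing messages per full handshake stated in the text; the function name is illustrative, and TCP ACK packets are deliberately left out.

```python
def handshake_message_rates(sessions_per_sec, msgs_in=4, msgs_out=5):
    """Return (incoming, outgoing) handshake messages per second.

    msgs_in / msgs_out are the per-handshake message counts quoted in the
    text for a full SSL handshake; TCP ACK traffic is not counted.
    """
    return sessions_per_sec * msgs_in, sessions_per_sec * msgs_out


if __name__ == "__main__":
    incoming, outgoing = handshake_message_rates(4000)
    print(f"incoming: {incoming}/sec, outgoing: {outgoing}/sec")
```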

In our session creation tests, we found that a 100 Mbps network interface (Sun HME) can support up to 2900 HTTPS/sec of short file requests with no tuning applied. At the same time, the network interface generates 13,000 interrupts to the processor every second. Switching to a Gigabit network interface may not solve the problem unless the network interface is much more efficient in packet processing.

The short SSL handshake messages can be handled more efficiently by adjusting the operating system (TCP/IP stack) tuning parameters that control the Nagle algorithm and deferred ACKs [1][14]. However, optimizing for short messages can hurt the performance of bulk data transfer if it is not done properly. Users may need to profile the network traffic on their Web server, since different Web applications may behave differently.
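The tuning described above is applied system-wide in the operating system's TCP/IP stack, and the specific parameters are not listed here. As a per-socket analogue only, the Python sketch below disables the Nagle algorithm on a single TCP socket through the standard TCP_NODELAY option, which lets short messages leave without waiting to be coalesced. It is an illustration of the mechanism, not the procedure used in this study.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with the Nagle algorithm disabled.

    TCP_NODELAY is a standard per-socket option; it is shown here as an
    application-level analogue of the system-wide tuning discussed above.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_low_latency_socket()
    enabled = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    print("TCP_NODELAY enabled:", bool(enabled))
    s.close()
```

As the text warns, disabling coalescing helps short messages but can hurt bulk transfers, so such a setting should be validated against the actual traffic mix.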

4.2 iWS6 Tuning Options

Performance tuning is a common practice for a high-performance Web server. Because the secure version of iPlanet Web Server is based on the non-secure version, most tuning parameters that affect HTTP protocol processing would also affect the performance of a secure Web server. Readers are referred to [13] for tuning options for non-secure iWS. In this subsection, we provide some general guidelines for tuning a secure Web server running iWS6. Notice that many tuning options require intimate knowledge of the server platform and the Web applications that run on the server.


4.2.1 Listen Sockets and Virtual Servers

By default, an iWS6 server instance listens on port 443 to handle requests from all IP addresses. Performance can be affected if contention occurs on the listen socket.

iWS6 supports the notion of virtual servers that share the same configuration information, except that each single virtual server can be associated with one listen socket. We found that virtual servers can be employed (via the Web server's server.xml file) to create multiple listen sockets for a single Web server instance to reduce socket contention.

4.2.2 Keep-Alive/Persistent Connection Requests

The use of persistent requests, also known as keep-alive requests, generally reduces the overhead of creating TCP connections for multiple HTTPS requests. Unlike regular requests, the response for a keep-alive request does not cause a close of the TCP connection. Keep-alive may therefore improve the performance of the Web server, because subsequent requests reuse the established TCP connection instead of opening a new one.
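The effect of a persistent connection can be observed with a small, self-contained Python sketch. It uses plain HTTP on the loopback interface rather than HTTPS and iWS6, so it only illustrates the connection reuse itself: two requests are sent over one client connection, and the server records the client's source port for each request. All class and function names are illustrative.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class KeepAliveHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 makes connections persistent unless the client asks to close.
    protocol_version = "HTTP/1.1"
    seen_client_ports = []

    def do_GET(self):
        # Record the client's source port; it stays the same when the
        # underlying TCP connection is reused.
        KeepAliveHandler.seen_client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def fetch_twice_on_one_connection():
    """Send two GET requests over a single persistent connection."""
    KeepAliveHandler.seen_client_ports.clear()
    server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
        for _ in range(2):
            conn.request("GET", "/")
            conn.getresponse().read()
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return list(KeepAliveHandler.seen_client_ports)


if __name__ == "__main__":
    ports = fetch_twice_on_one_connection()
    print("same TCP connection reused:", ports[0] == ports[1])
```

Both requests arrive from the same source port, which shows that only one TCP connection was established for them.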

4.2.3 RqThrottle

By default, iWS6 runs with multiple worker threads to process HTTP/HTTPS requests. The number of worker threads can be adjusted by setting a parameter called RqThrottle in the Web server's magnus.conf file. The more worker threads the Web server employs, the more concurrent requests the Web server can handle, but the number of worker threads needs to match the processing power of the system and the application to achieve optimal performance.

4.2.4 MaxProcs

By default, iWS6 runs with one process, within which multiple worker threads are used to process requests. This thread model is favored because threads typically have less overhead in context switching and migration than processes. Under certain workloads, running iWS with multiple processes can benefit performance, as processes offer these advantages over threads: (1) processes can be bound to processors using the Solaris pbind facility to reduce migration, and (2) objects are often replicated in each process, reducing contention for the locks that protect them.

A parameter in the Web server's magnus.conf file, called MaxProcs, can be set to specify the number of processes that the Web server runs with. Notice that RqThrottle adjusts the number of worker threads per process. Thus MaxProcs and RqThrottle jointly adjust the total number of threads employed by the Web server.
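Because RqThrottle is a per-process setting, the total number of worker threads is the product of the two parameters. The short Python sketch below makes this explicit; the values used are hypothetical examples, not recommendations or defaults taken from this study.

```python
def total_worker_threads(max_procs, rq_throttle):
    """Total worker threads when RqThrottle applies to each process."""
    if max_procs < 1 or rq_throttle < 1:
        raise ValueError("both parameters must be at least 1")
    return max_procs * rq_throttle


if __name__ == "__main__":
    # Hypothetical settings: 4 processes with 128 worker threads each.
    print(total_worker_threads(max_procs=4, rq_throttle=128))
```

When MaxProcs is raised, RqThrottle may need to be lowered accordingly to keep the total thread count matched to the processing power of the system.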

4.2.5 Multiple Server Instances

Multiple independent Web server instances can simultaneously take advantage of the Sun CA1000 solution. Occasionally, as a way to minimize interference between different Web applications and to allow performance to scale on a large system, it is better to partition the system into several processor groups and run separate Web server instances in different processor groups.

4.2.6 Thread Libraries

The Solaris 8 Operating Environment (Solaris OE) offers two different flavors of thread libraries for serving multithreaded applications and handling multithreading activities. Since iWS6 is a highly-parallel multithreaded application, its performance can depend significantly on the thread library. In our experience, a secure iWS6 runs better with the alternate thread library (also known as lwp) in Solaris 8 OE. The alternate thread library will be the default thread library in the forthcoming Solaris 9 OE.

4.2.7 Memory Allocation

The performance of the Memory Allocation Subsystem (MAS) in iWS6 can affect the performance of a secure Web server. iWS6 comes with support for the default MAS provided by Solaris OE, or can be tuned to take advantage of another MAS, such as SmartHeap™, which is bundled with some versions of iWS6. For the Web server to use SmartHeap, the Web server start script should be modified: a few lines in the start script should be un-commented to enable the use of SmartHeap.

Solaris 8 OE has a MAS available via a dynamically loadable library, /usr/lib/libmtmalloc.so, that has proven to be the fastest MAS available on Solaris OE. If SmartHeap is not enabled, then setting the environment variable LD_PRELOAD to /usr/lib/libmtmalloc.so immediately before the Web server is started allows the Web server to use this MAS.
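One way to set LD_PRELOAD only for the Web server process, rather than for the whole login session, is to build a dedicated environment for the start command. The Python sketch below shows the idea. The helper name is illustrative, the start script path in the comment is a placeholder rather than the real iWS6 location, and the space separator assumes the whitespace-separated list format that the Solaris runtime linker accepts for LD_PRELOAD.

```python
import os


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with library_path preloaded first."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # Keep any library that was already preloaded, after the new one.
    env["LD_PRELOAD"] = library_path if not existing else f"{library_path} {existing}"
    return env


if __name__ == "__main__":
    env = env_with_preload("/usr/lib/libmtmalloc.so")
    print("LD_PRELOAD =", env["LD_PRELOAD"])
    # A start script could then be launched with this environment, e.g.
    # subprocess.run(["/path/to/server/start"], env=env)  # placeholder path
```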

4.2.8 SSL3SessionTimeout

This tunable in iPlanet Web Server's magnus.conf is expressed in seconds. The default value for this parameter is 24 hours. It controls the maximum time that an SSL version 3 (SSL3) session will be cached by the Web server. Since SSL session caching can consume memory (see the next tunable), this value should be tuned down for a site that experiences heavy new SSL session creation traffic. Tuning this value too low can increase the load on the SSL session creation subsystem in the Web server.

4.2.9 SSLCacheEntries

This is a tunable in iPlanet Web Server's magnus.conf. It sets the number of SSL sessions that can be cached. The Web server may decide to remove a session entry from its session cache when the session cache is full, even if the SSL3SessionTimeout has not expired.

If a Web site does not want to support resumable SSL sessions, then this parameter should be tuned to its minimum, i.e., 1. Note that the server will use the default cache size of 10000 if the parameter is set to 0.
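The interaction between the two tunables can be illustrated with a toy session cache in Python. This is a simplified model, not the iWS6 or NSS implementation: the class and method names are hypothetical, and evicting the oldest stored entry when the cache is full is an assumption, since the actual eviction policy is not described here.

```python
from collections import OrderedDict


class SessionCache:
    """Toy model of an SSL session cache with a size limit and a timeout."""

    def __init__(self, max_entries, timeout_seconds):
        self.max_entries = max_entries          # plays the role of SSLCacheEntries
        self.timeout_seconds = timeout_seconds  # plays the role of SSL3SessionTimeout
        self._entries = OrderedDict()           # session id -> creation time

    def store(self, session_id, now):
        if session_id in self._entries:
            del self._entries[session_id]
        elif len(self._entries) >= self.max_entries:
            # Assumed policy: drop the oldest entry, even if it has not timed out.
            self._entries.popitem(last=False)
        self._entries[session_id] = now

    def can_resume(self, session_id, now):
        created = self._entries.get(session_id)
        if created is None:
            return False  # never cached, or already evicted
        if now - created > self.timeout_seconds:
            del self._entries[session_id]
            return False  # expired: a full handshake is needed again
        return True


if __name__ == "__main__":
    cache = SessionCache(max_entries=2, timeout_seconds=100)
    for t, sid in enumerate(["a", "b", "c"]):
        cache.store(sid, now=t)
    print("a resumable:", cache.can_resume("a", now=3))
    print("b resumable:", cache.can_resume("b", now=3))
    print("c resumable after timeout:", cache.can_resume("c", now=200))
```

In this model, a small cache forces early evictions and a short timeout forces expirations; both turn would-be resumptions into full session creations, which is the trade-off the two tunables control.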

5. Summary

In this paper, we presented various aspects of the performance acceleration delivered by the Sun Crypto Accelerator 1000. We evaluated the potential of the Sun CA1000 solution by measuring its performance with low-level RSA and 3DES benchmark programs. We have also shown that the Sun CA1000 solution successfully accelerated SSL session creation for secure Web servers in our HTTPS benchmarks.

Performance tuning is often necessary for a Web server to achieve high performance. We have shared some tuning guidelines in this paper. We work with engineers of the Sun CA1000 Team, iPlanet and NSS to deliver the best performance of the integrated Sun/Sun CA1000/iWS solution. Based on our experience, we believe that the Sun CA1000 solution will be appreciated by users as a cost-effective solution for running their applications securely without sacrificing performance.

6. Acknowledgement

This paper is a result of a project started jointly by the Cryptography group and the Performance and Availability Engineering (PAE) group at Sun. Experts on iWS and NSS from iPlanet later joined and contributed to this project, as the co-optimization of the Web server and the Sun CA1000 became a key focus. In PAE, we would like to thank the Sun CA1000 Engineering Team for providing software and hardware support to this performance study. We also would like to acknowledge iPlanet and AOL/NSS for investigating with us issues related to iWS and SSL.

References

[1] E. Rescorla. SSL and TLS: Designing and Building Secure Systems. Addison Wesley, 2001.

[2] W. Stallings. Cryptography and Network Security - Principles and Practice, 2nd Ed. Prentice Hall, 1998.

[3] The Network Security Services (NSS) Project Homepage. Overview of NSS Open Source Crypto Libraries. http://www.mozilla.org/projects/security/pki/nss/overview.html.

[4] The OpenSSL Project Homepage. http://www.openssl.org/.

[5] iPlanet Web Server Homepage. http://www.iplanet.com/products/iplanet_web_enterprise/home_web_server.html.

[6] The Apache Software Foundation. http://www.apache.org/.

[7] R. L. Rivest, A. Shamir, and L. M. Adleman. On Digital Signatures and Public Key Cryptosystems. Technical Report MIT/LCS/TR-212, MIT Laboratory for Computer Science, January 1979.

[8] ANSI. American National Standard for Financial Institution Key Management. ANSI X9.17, 1985.

[9] RSA Laboratories. PKCS #11: Cryptographic Token Interface Standard. Rev. 1, November 2001.

[10] ACME Labs. http_load - multiprocessing http test client. http://www.acme.com/software/http_load/.

[11] Mindcraft. WebStone - The Benchmark for Web Servers. http://www.mindcraft.com/webstone/.

[12] Standard Performance Evaluation Corporation. SPECweb99 Design Document. http://www.spec.org/osg/web99/docs/whitepaper.html, July 2000.

[13] N. Sun, A. Guzovskiy, P. Bhattacharya, and K.-T. Ko. Web Server Performance under SPECweb99 Workload. Sun Users Performance Group Conference (SUPerG), Oct. 2001, Amsterdam, Netherlands.

[14] J. Huang, S.-H. Hung, G.-P. Musumeci, M. S. Klivansky, and K.-T. Ko. Sun Fire Gigabit Performance Characterization. Sun Users Performance Group Conference, Oct. 2001, Amsterdam, Netherlands.


Legal Notice

©2002 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun Logo, Sun Fire, iPlanet, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. Sun Microsystems, Inc. has intellectual property rights relating to technology described in this document. In particular, and without limitation, these intellectual property rights may include one or more patents or pending patent applications in the U.S. or other countries.
