Network Server Performance and Scalability
DESCRIPTION
Network Server Performance and Scalability. Scott Rixner, Rice Computer Architecture Group, http://www.cs.rice.edu/CS/Architecture/. June 22, 2005.
TRANSCRIPT
Network Server Performance and Scalability
June 22, 2005
Scott Rixner, Rice Computer Architecture Group
http://www.cs.rice.edu/CS/Architecture/
© Scott Rixner, 2005 Network Server Performance and Scalability 2
Rice Computer Architecture Group
Rice Computer Architecture
Faculty
– Scott Rixner
Students
– Mike Calhoun
– Hyong-youb Kim
– Jeff Shafer
– Paul Willmann
Research Focus
– System architecture
– Embedded systems
http://www.cs.rice.edu/CS/Architecture/
Network Servers Today
Content types
– Mostly text, small images
– Low-quality video (300-500 Kbps)

[Diagram: clients on 3 Mbps links reach a network server over the Internet via a 1 Gbps link]
Network Servers in the Future
Content types
– Diverse multimedia content
– DVD-quality video (10 Mbps)

[Diagram: clients on 100 Mbps links reach a network server over the Internet via a 100 Gbps link]
TCP Performance Issues
Network Interfaces
– Limited flexibility
– Serialized access
Computation
– Only about 3,000 instructions per packet
– However, very low IPC and parallelization difficulties
Memory
– Large connection data structures (about 1 KB each)
– Low locality, high DRAM latency
Selected Research
Network Interfaces
– Programmable NIC design
– Firmware parallelization
– Network interface data caching
Operating Systems
– Connection handoff to the network interface
– Parallelizing network stack processing
System Architecture
– Memory controller design
Designing a 10 Gigabit NIC
Programmability for performance
– Computation offloading improves performance
NICs have power and area concerns
– Architecture solutions should be efficient
Above all, must support 10 Gbps links
– What are the computation and memory requirements?
– What architecture efficiently meets them?
– What firmware organization should be used?
Aggregate Requirements: 10 Gbps, Maximum-Sized Frames

              Instruction    Control Data    Frame Data
              Throughput     Bandwidth       Bandwidth
TX Frame      229 MIPS       2.6 Gbps        19.75 Gbps
RX Frame      206 MIPS       2.2 Gbps        19.75 Gbps
Total         435 MIPS       4.8 Gbps        39.5 Gbps

1514-byte frames at 10 Gbps = 812,744 frames/s
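The 812,744 frames/s figure can be reproduced from standard Ethernet wire overheads. A quick sketch (the FCS, preamble/SFD, and inter-frame gap accounting is an assumption about how the slide counts wire bytes, since only the 1514-byte frame size is given):

```python
# Frames per second for maximum-sized Ethernet frames on a 10 Gbps link.
# Each 1514-byte frame occupies more wire time than its own bytes: the
# 4-byte FCS, the 8-byte preamble/SFD, and the 12-byte minimum
# inter-frame gap are transmitted (or reserved) as well.
LINK_BPS = 10 * 10**9                  # 10 Gbps
WIRE_BYTES = 1514 + 4 + 8 + 12         # frame + FCS + preamble/SFD + IFG = 1538

frames_per_sec = LINK_BPS / (WIRE_BYTES * 8)
print(round(frames_per_sec))           # → 812744

# Dividing the 435 MIPS total by the frame rate gives the per-frame
# instruction budget the firmware must fit within (~535 instructions).
budget = 435 * 10**6 / frames_per_sec
```

This per-frame budget is why the later slides worry about sustaining hundreds of MIPS on an embedded device rather than raw single-core speed.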
Meeting 10 Gbps Requirements
Processor Architecture
– At least 435 MIPS within an embedded device
– Limited instruction-level parallelism
– Abundant task-level parallelism
Memory Architecture
– Control data needs low latency, small capacity
– Frame data needs high bandwidth, large capacity
– Must partition storage
Processor Architecture
               Perfect    1BP     No BP
In-order       1          0.87    0.87
Out-of-order   2          1.74    1.21

2x performance is costly
– Branch prediction, reorder buffer, renaming logic, wakeup logic
– Overheads translate to greater than 2x core power and area costs
– Great for a GP processor; not for an embedded device
Are there other opportunities for parallelism?
– Many steps to process a frame: run them simultaneously
– Many frames need processing: process them simultaneously
Solution: use parallel single-issue cores
Control Data Caching

[Chart: hit ratio (percent, 0-60) vs. cache size (16 B to 32 KB); 6-processor hit ratio curve]

SMPCache trace analysis of a 6-processor NIC architecture
A Programmable 10 Gbps NIC

[Block diagram: CPUs 0 through P-1, each with its own I-cache fed from an instruction memory, and scratchpads 0 through S-1, connected by a (P+4)x(S) 32-bit crossbar; the crossbar also links a PCI interface (to the PCI bus), an Ethernet interface, and an external memory interface to off-chip DRAM]
Network Interface Firmware
NIC processing steps are well defined
Must provide high latency tolerance
– DMA to host
– Transfer to/from network
Event mechanism is the obvious choice
– How do you process and distribute events?
Task Assignment with an Event Register
[Diagram: an event register with a PCI Read Bit, a SW Event Bit, and other bits. The PCI interface finishes work and sets its bit; processors inspect the completed transactions, then need to enqueue TX data; processors pass the data to the Ethernet interface]
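The event-register idea can be sketched as a bitmask that hardware units set and firmware processors poll. A minimal model (bit names and handler actions are illustrative assumptions, not the NIC's actual firmware):

```python
# Sketch of event-register task dispatch: hardware sets bits in a shared
# status word; firmware polls the word and runs the handler for each set
# bit, then clears it to acknowledge the event.
PCI_READ_BIT = 1 << 0   # PCI interface finished a read transaction
SW_EVENT_BIT = 1 << 1   # a processor requested follow-on processing

def dispatch(event_register, handlers):
    """Run the handler for every set bit; return the cleared register."""
    for bit, handler in handlers.items():
        if event_register & bit:
            handler()
            event_register &= ~bit   # acknowledge (clear) the event
    return event_register

log = []
handlers = {
    PCI_READ_BIT: lambda: log.append("inspect completed PCI transactions"),
    SW_EVENT_BIT: lambda: log.append("pass TX data to Ethernet interface"),
}
remaining = dispatch(PCI_READ_BIT | SW_EVENT_BIT, handlers)
```

Because any processor can inspect the register, the same mechanism supports either of the two firmware organizations on the following slides.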
Task-level Parallel Firmware
[Timeline: driven by the PCI Read Bit and hardware status, Proc 0 transfers DMAs 0-4 and then DMAs 5-9, while Proc 1 processes DMAs 0-4 and then DMAs 5-9; each processor idles while the other completes its stage]
Frame-level Parallel Firmware
[Timeline: Proc 0 transfers DMAs 0-4, processes them, and builds the event, while Proc 1 transfers DMAs 5-9, processes them, and builds its event; each processor handles its frames end to end, leaving less idle time]
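The two organizations divide the same work differently: task-level assigns one pipeline stage per processor, frame-level assigns whole frames per processor. A toy model (stage names and the round-robin assignment are illustrative, not the actual firmware):

```python
# Toy model contrasting the two firmware organizations.
STAGES = ["fetch_descriptor", "transfer_dma", "process"]
FRAMES = list(range(10))

def task_level(frames):
    """Task-level: each processor owns one stage for all frames."""
    # processor i performs STAGES[i] on every frame in order
    return {stage: [f"{stage}(frame {f})" for f in frames] for stage in STAGES}

def frame_level(frames, num_procs):
    """Frame-level: each processor runs every stage for its own frames."""
    work = {p: [] for p in range(num_procs)}
    for f in frames:
        owner = f % num_procs             # static round-robin assignment
        work[owner] += [f"{stage}(frame {f})" for stage in STAGES]
    return work

by_stage = task_level(FRAMES)             # 3 processors, one per stage
by_frame = frame_level(FRAMES, 2)         # 2 processors, 5 frames each
```

In the frame-level split no frame's state is shared between processors, which is why the deck's results favor that organization.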
Scaling in Two Dimensions
[Chart: throughput (Gbps, 0-20) vs. core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, with a line marking the 10 Gbps Ethernet limit]
A Programmable 10 Gbps NIC
This NIC architecture relies on:
– Data memory system: partitioned organization, not coherent caches
– Processor architecture: parallel scalar processors
– Firmware: frame-level parallel organization
– RMW instructions: reduce ordering overheads
A programmable NIC: a substrate for offload services
NIC Offload Services
Network Interface Data Caching
Connection Handoff
Virtual Network Interfaces
…
Network Interface Data Caching
Cache data in the network interface
Reduces interconnect traffic
Software-controlled cache
Minimal changes to the operating system
Prototype web server
– Up to 57% reduction in PCI traffic
– Up to 31% increase in server performance
– Peak 1571 Mbps of content throughput
– Breaks the PCI bottleneck
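"Software-controlled" means the OS, not hardware, decides what stays in NIC memory: it tracks which content blocks are resident and only pushes a block across the PCI bus on a miss. A minimal sketch (the class, LRU policy, and names are assumptions, not the prototype's code):

```python
# Sketch of a software-controlled NIC data cache. The host keeps a
# directory of blocks resident in NIC memory; sending a cached block
# generates no PCI data traffic, a miss costs one DMA transfer.
from collections import OrderedDict

class NicDataCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.resident = OrderedDict()   # block id -> True, in LRU order
        self.pci_transfers = 0

    def send_block(self, block_id):
        if block_id in self.resident:
            self.resident.move_to_end(block_id)   # hit: no PCI traffic
            return "hit"
        self.pci_transfers += 1                    # miss: DMA over PCI
        self.resident[block_id] = True
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)      # evict LRU block
        return "miss"

cache = NicDataCache(capacity_blocks=2)
results = [cache.send_block(b) for b in ["a", "b", "a", "c", "a"]]
# hot block "a" crosses the PCI bus only once; repeats are served from NIC memory
```

With web content showing good temporal reuse (next slides), hits like these are what cut PCI traffic by up to 57%.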
Results: PCI Traffic
[Chart: PCI is saturated at its ~1260 Mb/s limit; ~60% of the traffic is content (1198 Mb/s of HTTP content) and ~30% is overhead]
Content Locality

Block cache with 4 KB block size
8-16 MB caches capture locality
Results: PCI Traffic Reduction
Low temporal reuse: low PCI utilization
Good temporal reuse: CPU bottleneck
36-57% reduction with four traces
Up to 31% performance improvement
Connection Handoff to the NIC
No magic processor on NIC
– OS must control work between itself and NIC
Move established connections between OS and NIC
– Connection: unit of control
– OS decides when and what
Benefits
– Sockets are intact: no need to change applications
– Zero-copy
– No port allocation or routing on NIC
– Can adapt to route changes

[Diagram: the OS stack (sockets, TCP/IP, driver) and the NIC stack (TCP/IP, Ethernet/lookup) connected by a handoff interface: 1. Handoff, 2. Send, 3. Receive, 4. Ack, …]
Connection Handoff
Traditional offload
– NIC replicates entire network stack
– NIC can limit connections due to resource limitations
Connection handoff
– OS decides which subset of connections NIC should handle
– NIC resource limitations limit amount of offload, not number of connections

[Diagram: connections split between the OS and the NIC]
Establishment and Handoff
OS establishes connections
OS decides whether or not to hand off each connection

[Diagram: 1. Establish a connection (OS); 2. Handoff (connection moves to the NIC)]
Data Transfer
Offloaded connections require minimal support from OS for data transfers
– Socket layer for interface to applications
– Driver layer for interrupts, buffer management

[Diagram: 3. Send, Receive, Ack, … — data flows between the OS and the connection on the NIC]
Connection Teardown
Teardown requires both NIC and OS to deallocate connection data structures

[Diagram: 4. De-alloc on the NIC; 5. De-alloc in the OS]
Connection Handoff Status
Working prototype built on FreeBSD
Initial results for web workloads
– Reductions in cycles and cache misses on host
– Transparently handle multiple NICs
– Fewer messages on PCI
  • 1.4 per packet to 0.6 per packet
  • Socket-level instead of packet-level communication
– ~17% throughput increase (simulations)
To do
– Framework for offload policies
– Test zero-copy, more workloads
– Port to Linux
Virtual Network Interfaces
Traditionally used for user-level network access
– Each process has its own "virtual NIC"
– Provide protection among processes
Can we use this concept to improve network stack performance within the OS?
– Possibly, but we need to understand the behavior of the OS on networking workloads first
Networking Workloads
Performance is influenced by
– The operating system's network stack
– The increasing number of connections
– Microprocessor architecture trends
Networking Performance
Bound by TCP/IP processing
2.4 GHz Intel Xeon: 2.5 Gbps for one nttcp stream (Hurwitz and Feng, IEEE Micro 2004)

[Chart: execution-time breakdown (0-100%) for the SPECweb, Rice, IBM, NASA, and World Cup workloads into driver, Ethernet, IP, TCP, system call, user, and other time]
Throughput vs. Connections
Faster links mean more connections; more connections mean worse performance

[Chart: HTTP content throughput (Mb/s, 0-1200) vs. number of connections (4 to 2048) for the CS, IBM, NASA, and WC traces]
The End of the Uniprocessor?
Uniprocessors have become too complicated
– Clock speed increases have slowed down
– Increasingly complicated architectures for performance
Multi-core processors are becoming the norm
– IBM POWER4: 2 cores (2001)
– Intel Pentium 4: 2 hyperthreads (2002)
– Sun UltraSPARC IV: 2 cores (2004)
– AMD Opteron: 2 cores (2005)
– Sun Niagara: 8 cores, 4 threads each (est. 2006)
How do we use these cores for networking?
Parallelism with Data-Synchronized Stacks
Linux 2.4.20+, FreeBSD 5+
Parallelism with Control-Synchronized Stacks

DragonflyBSD, Solaris 10
Parallelization Challenges
Data-Synchronous
– Lots of thread parallelism
– Significant locking overheads
Control-Synchronous
– Reduces locking
– Load balancing issues
Which approach is better?
– Throughput? Scalability?
– We're optimizing both schemes in FreeBSD 5 to find out
Network Interface
– Serialization point
– Can virtualization help?
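The two synchronization styles can be sketched as a toy model (illustrative classes, not FreeBSD or Solaris code): data-synchronized stacks let any thread touch any connection behind a lock, while control-synchronized stacks pin each connection to one owning thread and drop the lock.

```python
# Sketch of the two synchronization styles for a parallel network stack.
import threading

class DataSyncStack:
    """Any thread may process any connection; per-connection locks
    protect shared state, adding locking overhead on every packet."""
    def __init__(self, num_conns):
        self.locks = [threading.Lock() for _ in range(num_conns)]
        self.counts = [0] * num_conns

    def process(self, conn):
        with self.locks[conn]:
            self.counts[conn] += 1

class ControlSyncStack:
    """Each connection is pinned to one worker thread, so its state
    needs no lock; load balance depends on the static mapping."""
    def __init__(self, num_conns, num_threads):
        self.owner = [c % num_threads for c in range(num_conns)]
        self.counts = [0] * num_conns

    def process(self, conn, thread_id):
        assert self.owner[conn] == thread_id   # only the owner may touch it
        self.counts[conn] += 1                 # no lock needed

data_stack = DataSyncStack(4)
ctrl_stack = ControlSyncStack(4, num_threads=2)
data_stack.process(0)
ctrl_stack.process(2, thread_id=0)   # connection 2 is owned by thread 2 % 2 == 0
```

The trade-off the slide names falls out directly: the first class pays a lock per operation, the second pays nothing per operation but can leave a thread idle if its connections go quiet.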
Memory Controller Architecture
Improve DRAM efficiency
– Memory access scheduling
– Virtual channels
Improve copy performance
– 45-61% of kernel execution time can be copies
– Best copy algorithm dependent on copy size, cache residency, cache state
– Probe copy
– Hardware copy acceleration
Improve I/O performance…
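Since the best copy algorithm depends on size and cache residency, the kernel can pick a strategy per copy. A sketch of that selection (the thresholds and strategy names are illustrative assumptions, not the actual design):

```python
# Sketch of size- and residency-based copy-strategy selection.
def choose_copy_strategy(size_bytes, likely_cached):
    if size_bytes <= 128:
        return "load-store"        # tiny copies: plain loads/stores win
    if likely_cached:
        return "cache-to-cache"    # resident data: keep the copy in cache
    return "hardware-engine"       # large, uncached: offload to a copy engine

small = choose_copy_strategy(64, likely_cached=False)
warm = choose_copy_strategy(4096, likely_cached=True)
bulk = choose_copy_strategy(1 << 20, likely_cached=False)
```

A hardware engine only pays off when the copy is large enough to amortize the setup cost and cold enough that cache pollution is not an issue, which is why size and cache state both enter the decision.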
Summary
Our focus is on system-level architectures for networking
Network interfaces must evolve
– No longer just a PCI-to-Ethernet bridge
– Need to provide capabilities to help the operating system
Operating systems must evolve
– Future systems will have 10s to 100s of processors
– Networking must be parallelized: many bottlenecks remain
Synergy between the NIC and OS cannot be ignored
Memory performance is also increasingly a critical factor