Network Server Performance and Scalability
DESCRIPTION
Network Server Performance and Scalability. Scott Rixner, Rice Computer Architecture Group, http://www.cs.rice.edu/CS/Architecture/. June 22, 2005.
TRANSCRIPT
Network Server Performance and Scalability
June 22, 2005
Scott Rixner, Rice Computer Architecture Group
http://www.cs.rice.edu/CS/Architecture/
© Scott Rixner, 2005 Network Server Performance and Scalability 2
Rice Computer Architecture Group
Rice Computer Architecture
Faculty
– Scott Rixner
Students
– Mike Calhoun
– Hyong-youb Kim
– Jeff Shafer
– Paul Willmann
Research Focus
– System architecture
– Embedded systems
http://www.cs.rice.edu/CS/Architecture/
Network Servers Today
Content types
– Mostly text, small images
– Low-quality video (300-500 Kbps)

[Diagram: clients on 3 Mbps links reach a network server over the Internet via a 1 Gbps link]
Network Servers in the Future
Content types
– Diverse multimedia content
– DVD-quality video (10 Mbps)

[Diagram: clients on 100 Mbps links reach a network server over the Internet via a 100 Gbps link]
TCP Performance Issues
Network Interfaces
– Limited flexibility
– Serialized access
Computation
– Only about 3,000 instructions per packet
– However, very low IPC and parallelization difficulties
Memory
– Large connection data structures (about 1 KB each)
– Low locality, high DRAM latency
Selected Research
Network Interfaces
– Programmable NIC design
– Firmware parallelization
– Network interface data caching
Operating Systems
– Connection handoff to the network interface
– Parallelizing network stack processing
System Architecture
– Memory controller design
Designing a 10 Gigabit NIC
Programmability for performance
– Computation offloading improves performance
NICs have power and area concerns
– Architecture solutions should be efficient
Above all, must support 10 Gbps links
– What are the computation and memory requirements?
– What architecture efficiently meets them?
– What firmware organization should be used?
Aggregate Requirements: 10 Gbps, Maximum-Sized Frames

              Instruction    Control Data    Frame Data
              Throughput     Bandwidth       Bandwidth
TX Frame      229 MIPS       2.6 Gbps        19.75 Gbps
RX Frame      206 MIPS       2.2 Gbps        19.75 Gbps
Total         435 MIPS       4.8 Gbps        39.5 Gbps

1514-byte frames at 10 Gbps = 812,744 frames/s
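The 812,744 frames/s figure can be reproduced from standard Ethernet wire overheads. A quick sketch (the FCS, preamble/SFD, and inter-frame gap accounting is an assumption about how the slide counts wire bytes, since only the 1514-byte frame size is given):

```python
# Frames per second for maximum-sized Ethernet frames on a 10 Gbps link.
# Each 1514-byte frame occupies more wire time than its own bytes: the
# 4-byte FCS, the 8-byte preamble/SFD, and the 12-byte minimum
# inter-frame gap are transmitted (or reserved) as well.
LINK_BPS = 10 * 10**9                  # 10 Gbps
WIRE_BYTES = 1514 + 4 + 8 + 12         # frame + FCS + preamble/SFD + IFG = 1538

frames_per_sec = LINK_BPS / (WIRE_BYTES * 8)
print(round(frames_per_sec))           # → 812744

# Dividing the 435 MIPS total by the frame rate gives the per-frame
# instruction budget the firmware must fit within (~535 instructions).
budget = 435 * 10**6 / frames_per_sec
```

This per-frame budget is why the later slides worry about sustaining hundreds of MIPS on an embedded device rather than raw single-core speed.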
Meeting 10 Gbps Requirements
Processor Architecture
– At least 435 MIPS within an embedded device
– Limited instruction-level parallelism
– Abundant task-level parallelism
Memory Architecture
– Control data needs low latency, small capacity
– Frame data needs high bandwidth, large capacity
– Must partition storage
Processor Architecture
               Perfect    1BP     No BP
In-order       1          0.87    0.87
Out-of-order   2          1.74    1.21

2x performance is costly
– Branch prediction, reorder buffer, renaming logic, wakeup logic
– Overheads translate to greater than 2x core power and area costs
– Great for a GP processor; not for an embedded device
Are there other opportunities for parallelism?
– Many steps to process a frame: run them simultaneously
– Many frames need processing: process them simultaneously
Solution: use parallel single-issue cores
Control Data Caching

[Chart: hit ratio (percent, 0-60) vs. cache size (16 B to 32 KB); 6-processor hit ratio curve]

SMPCache trace analysis of a 6-processor NIC architecture
A Programmable 10 Gbps NIC

[Block diagram: CPUs 0 through P-1, each with its own I-cache fed from an instruction memory, and scratchpads 0 through S-1, connected by a (P+4)x(S) 32-bit crossbar; the crossbar also links a PCI interface (to the PCI bus), an Ethernet interface, and an external memory interface to off-chip DRAM]
Network Interface Firmware
NIC processing steps are well defined
Must provide high latency tolerance
– DMA to host
– Transfer to/from network
Event mechanism is the obvious choice
– How do you process and distribute events?
Task Assignment with an Event Register
[Diagram: an event register with a PCI Read Bit, a SW Event Bit, and other bits. The PCI interface finishes work and sets its bit; processors inspect the completed transactions, then need to enqueue TX data; processors pass the data to the Ethernet interface]
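The event-register idea can be sketched as a bitmask that hardware units set and firmware processors poll. A minimal model (bit names and handler actions are illustrative assumptions, not the NIC's actual firmware):

```python
# Sketch of event-register task dispatch: hardware sets bits in a shared
# status word; firmware polls the word and runs the handler for each set
# bit, then clears it to acknowledge the event.
PCI_READ_BIT = 1 << 0   # PCI interface finished a read transaction
SW_EVENT_BIT = 1 << 1   # a processor requested follow-on processing

def dispatch(event_register, handlers):
    """Run the handler for every set bit; return the cleared register."""
    for bit, handler in handlers.items():
        if event_register & bit:
            handler()
            event_register &= ~bit   # acknowledge (clear) the event
    return event_register

log = []
handlers = {
    PCI_READ_BIT: lambda: log.append("inspect completed PCI transactions"),
    SW_EVENT_BIT: lambda: log.append("pass TX data to Ethernet interface"),
}
remaining = dispatch(PCI_READ_BIT | SW_EVENT_BIT, handlers)
```

Because any processor can inspect the register, the same mechanism supports either of the two firmware organizations on the following slides.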
Task-level Parallel Firmware
[Timeline: driven by the PCI Read Bit and hardware status, Proc 0 transfers DMAs 0-4 and then DMAs 5-9, while Proc 1 processes DMAs 0-4 and then DMAs 5-9; each processor idles while the other completes its stage]
Frame-level Parallel Firmware
[Timeline: Proc 0 transfers DMAs 0-4, processes them, and builds the event, while Proc 1 transfers DMAs 5-9, processes them, and builds its event; each processor handles its frames end to end, leaving less idle time]
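The two organizations divide the same work differently: task-level assigns one pipeline stage per processor, frame-level assigns whole frames per processor. A toy model (stage names and the round-robin assignment are illustrative, not the actual firmware):

```python
# Toy model contrasting the two firmware organizations.
STAGES = ["fetch_descriptor", "transfer_dma", "process"]
FRAMES = list(range(10))

def task_level(frames):
    """Task-level: each processor owns one stage for all frames."""
    # processor i performs STAGES[i] on every frame in order
    return {stage: [f"{stage}(frame {f})" for f in frames] for stage in STAGES}

def frame_level(frames, num_procs):
    """Frame-level: each processor runs every stage for its own frames."""
    work = {p: [] for p in range(num_procs)}
    for f in frames:
        owner = f % num_procs             # static round-robin assignment
        work[owner] += [f"{stage}(frame {f})" for stage in STAGES]
    return work

by_stage = task_level(FRAMES)             # 3 processors, one per stage
by_frame = frame_level(FRAMES, 2)         # 2 processors, 5 frames each
```

In the frame-level split no frame's state is shared between processors, which is why the deck's results favor that organization.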
Scaling in Two Dimensions
[Chart: throughput (Gbps, 0-20) vs. core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, with a line marking the 10 Gbps Ethernet limit]
A Programmable 10 Gbps NIC
This NIC architecture relies on:
– Data memory system: partitioned organization, not coherent caches
– Processor architecture: parallel scalar processors
– Firmware: frame-level parallel organization
– RMW instructions: reduce ordering overheads
A programmable NIC: a substrate for offload services
NIC Offload Services
Network Interface Data Caching
Connection Handoff
Virtual Network Interfaces
…
Network Interface Data Caching
Cache data in the network interface
Reduces interconnect traffic
Software-controlled cache
Minimal changes to the operating system
Prototype web server
– Up to 57% reduction in PCI traffic
– Up to 31% increase in server performance
– Peak 1571 Mbps of content throughput
– Breaks the PCI bottleneck
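"Software-controlled" means the OS, not hardware, decides what stays in NIC memory: it tracks which content blocks are resident and only pushes a block across the PCI bus on a miss. A minimal sketch (the class, LRU policy, and names are assumptions, not the prototype's code):

```python
# Sketch of a software-controlled NIC data cache. The host keeps a
# directory of blocks resident in NIC memory; sending a cached block
# generates no PCI data traffic, a miss costs one DMA transfer.
from collections import OrderedDict

class NicDataCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.resident = OrderedDict()   # block id -> True, in LRU order
        self.pci_transfers = 0

    def send_block(self, block_id):
        if block_id in self.resident:
            self.resident.move_to_end(block_id)   # hit: no PCI traffic
            return "hit"
        self.pci_transfers += 1                    # miss: DMA over PCI
        self.resident[block_id] = True
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)      # evict LRU block
        return "miss"

cache = NicDataCache(capacity_blocks=2)
results = [cache.send_block(b) for b in ["a", "b", "a", "c", "a"]]
# hot block "a" crosses the PCI bus only once; repeats are served from NIC memory
```

With web content showing good temporal reuse (next slides), hits like these are what cut PCI traffic by up to 57%.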
Results: PCI Traffic
[Chart: PCI is saturated at its ~1260 Mb/s limit; ~60% of the traffic is content (1198 Mb/s of HTTP content) and ~30% is overhead]
Content Locality

Block cache with 4 KB block size
8-16 MB caches capture locality
Results: PCI Traffic Reduction
Low temporal reuse: low PCI utilization
Good temporal reuse: CPU bottleneck
36-57% reduction with four traces
Up to 31% performance improvement
Connection Handoff to the NIC
No magic processor on NIC
– OS must control work between itself and NIC
Move established connections between OS and NIC
– Connection: unit of control
– OS decides when and what
Benefits
– Sockets are intact: no need to change applications
– Zero-copy
– No port allocation or routing on NIC
– Can adapt to route changes

[Diagram: the OS stack (sockets, TCP/IP, driver) and the NIC stack (TCP/IP, Ethernet/lookup) connected by a handoff interface: 1. Handoff, 2. Send, 3. Receive, 4. Ack, …]
Connection Handoff
Traditional offload
– NIC replicates entire network stack
– NIC can limit connections due to resource limitations
Connection handoff
– OS decides which subset of connections NIC should handle
– NIC resource limitations limit amount of offload, not number of connections

[Diagram: connections split between the OS and the NIC]
Establishment and Handoff
OS establishes connections
OS decides whether or not to hand off each connection

[Diagram: 1. Establish a connection (OS); 2. Handoff (connection moves to the NIC)]
Data Transfer
Offloaded connections require minimal support from OS for data transfers
– Socket layer for interface to applications
– Driver layer for interrupts, buffer management

[Diagram: 3. Send, Receive, Ack, … — data flows between the OS and the connection on the NIC]
Connection Teardown
Teardown requires both NIC and OS to deallocate connection data structures

[Diagram: 4. De-alloc on the NIC; 5. De-alloc in the OS]
Connection Handoff Status
Working prototype built on FreeBSD
Initial results for web workloads
– Reductions in cycles and cache misses on host
– Transparently handle multiple NICs
– Fewer messages on PCI
  • 1.4 per packet to 0.6 per packet
  • Socket-level instead of packet-level communication
– ~17% throughput increase (simulations)
To do
– Framework for offload policies
– Test zero-copy, more workloads
– Port to Linux
Virtual Network Interfaces
Traditionally used for user-level network access
– Each process has its own "virtual NIC"
– Provide protection among processes
Can we use this concept to improve network stack performance within the OS?
– Possibly, but we need to understand the behavior of the OS on networking workloads first
Networking Workloads
Performance is influenced by
– The operating system's network stack
– The increasing number of connections
– Microprocessor architecture trends
Networking Performance
Bound by TCP/IP processing
2.4 GHz Intel Xeon: 2.5 Gbps for one nttcp stream (Hurwitz and Feng, IEEE Micro 2004)

[Chart: execution-time breakdown (0-100%) for the SPECweb, Rice, IBM, NASA, and World Cup workloads into driver, Ethernet, IP, TCP, system call, user, and other time]
Throughput vs. Connections
Faster links mean more connections; more connections mean worse performance

[Chart: HTTP content throughput (Mb/s, 0-1200) vs. number of connections (4 to 2048) for the CS, IBM, NASA, and WC traces]
The End of the Uniprocessor?
Uniprocessors have become too complicated
– Clock speed increases have slowed down
– Increasingly complicated architectures for performance
Multi-core processors are becoming the norm
– IBM POWER4: 2 cores (2001)
– Intel Pentium 4: 2 hyperthreads (2002)
– Sun UltraSPARC IV: 2 cores (2004)
– AMD Opteron: 2 cores (2005)
– Sun Niagara: 8 cores, 4 threads each (est. 2006)
How do we use these cores for networking?
Parallelism with Data-Synchronized Stacks
Linux 2.4.20+, FreeBSD 5+
Parallelism with Control-Synchronized Stacks

DragonflyBSD, Solaris 10
Parallelization Challenges
Data-Synchronous
– Lots of thread parallelism
– Significant locking overheads
Control-Synchronous
– Reduces locking
– Load balancing issues
Which approach is better?
– Throughput? Scalability?
– We're optimizing both schemes in FreeBSD 5 to find out
Network Interface
– Serialization point
– Can virtualization help?
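The two synchronization styles can be sketched as a toy model (illustrative classes, not FreeBSD or Solaris code): data-synchronized stacks let any thread touch any connection behind a lock, while control-synchronized stacks pin each connection to one owning thread and drop the lock.

```python
# Sketch of the two synchronization styles for a parallel network stack.
import threading

class DataSyncStack:
    """Any thread may process any connection; per-connection locks
    protect shared state, adding locking overhead on every packet."""
    def __init__(self, num_conns):
        self.locks = [threading.Lock() for _ in range(num_conns)]
        self.counts = [0] * num_conns

    def process(self, conn):
        with self.locks[conn]:
            self.counts[conn] += 1

class ControlSyncStack:
    """Each connection is pinned to one worker thread, so its state
    needs no lock; load balance depends on the static mapping."""
    def __init__(self, num_conns, num_threads):
        self.owner = [c % num_threads for c in range(num_conns)]
        self.counts = [0] * num_conns

    def process(self, conn, thread_id):
        assert self.owner[conn] == thread_id   # only the owner may touch it
        self.counts[conn] += 1                 # no lock needed

data_stack = DataSyncStack(4)
ctrl_stack = ControlSyncStack(4, num_threads=2)
data_stack.process(0)
ctrl_stack.process(2, thread_id=0)   # connection 2 is owned by thread 2 % 2 == 0
```

The trade-off the slide names falls out directly: the first class pays a lock per operation, the second pays nothing per operation but can leave a thread idle if its connections go quiet.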
Memory Controller Architecture
Improve DRAM efficiency
– Memory access scheduling
– Virtual channels
Improve copy performance
– 45-61% of kernel execution time can be copies
– Best copy algorithm dependent on copy size, cache residency, cache state
– Probe copy
– Hardware copy acceleration
Improve I/O performance…
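Since the best copy algorithm depends on size and cache residency, the kernel can pick a strategy per copy. A sketch of that selection (the thresholds and strategy names are illustrative assumptions, not the actual design):

```python
# Sketch of size- and residency-based copy-strategy selection.
def choose_copy_strategy(size_bytes, likely_cached):
    if size_bytes <= 128:
        return "load-store"        # tiny copies: plain loads/stores win
    if likely_cached:
        return "cache-to-cache"    # resident data: keep the copy in cache
    return "hardware-engine"       # large, uncached: offload to a copy engine

small = choose_copy_strategy(64, likely_cached=False)
warm = choose_copy_strategy(4096, likely_cached=True)
bulk = choose_copy_strategy(1 << 20, likely_cached=False)
```

A hardware engine only pays off when the copy is large enough to amortize the setup cost and cold enough that cache pollution is not an issue, which is why size and cache state both enter the decision.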
Summary
Our focus is on system-level architectures for networking
Network interfaces must evolve
– No longer just a PCI-to-Ethernet bridge
– Need to provide capabilities to help the operating system
Operating systems must evolve
– Future systems will have 10s to 100s of processors
– Networking must be parallelized: many bottlenecks remain
Synergy between the NIC and OS cannot be ignored
Memory performance is also increasingly a critical factor