Fast-Path Data Movement for IP Transports - Duke University


Slide 1

Fast-Path Data Movement for IP Transports

A Tour of the Alternatives

Jeff Chase
Duke University

Slide 2

The Big Questions

- Can/should IP be competitive in “high performance” domains (e.g., “SAN” markets)?
  - iSCSI vs. FibreChannel
  - IP over 10+GE vs. Infiniband
  - IP offers many advantages
- What will it take to get us there?
- What does this have to do with IETF?

Slide 3

Focusing on the Issue

- The key issue IS NOT:
  - The pipes: Ethernet has come a long way since 1981.
    - Add another zero every three years?
  - Transport architecture: generality of IP is worth the cost.
  - Protocol overhead: run better code on a faster CPU.
  - Interrupts, checksums, etc.: the NIC vendors can innovate here without us.

All of these are part of the bigger picture, but we don’t need an IETF working group to “fix” them.

Slide 4

The Copy Problem

- The key issue IS data movement within the host.
  - Combined with other overheads, copying sucks up resources needed for application processing.
- The problem won’t go away with better technology.
  - Faster CPUs don’t help: it’s the memory.
- General solutions are elusive… on the receive side.
- The problem exposes basic structural issues:
  - interactions among NIC, OS, APIs, protocols.

Slide 5

What is IETF’s Role?

- Can we fix it locally within the end systems?
- To what degree does it involve the protocols?
- Haven’t we been doing zero-copy TCP etc. for decades?

Slide 6

“Zero-Copy” Alternatives

- Option 1: page flipping
  - NIC places payloads in aligned memory; OS uses virtual memory to map it where the app wants it.
- Option 2: scatter/gather API
  - NIC puts the data wherever it wants; app accepts the data wherever it lands.
- Option 3: direct data placement
  - NIC puts data where the headers say it should go.

Each solution involves the OS, application, and NIC to some degree.

Slide 7

Page Flipping: the Basics

[Diagram: NIC performs header splitting and fills aligned payload buffers; VM remaps pages at the socket layer, across the kernel/user (K/U) boundary.]

The receiving app specifies buffers (per RFC 793 copy semantics).

Goal: deposit payloads in aligned buffer blocks suitable for the OS VM and I/O system.

Slide 8

Page Flipping with Small MTUs

[Diagram: NIC and host, kernel/user (K/U) boundary.]

Split transport headers, sequence and coalesce payloads for each connection/stream/flow.

Give up on Jumbo Frames.
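The split-and-coalesce step above can be sketched as follows. This is a toy model, not a real NIC interface: `PAGE`, `coalesce`, and the 16-byte "page" are illustrative inventions. Transport headers are handed to the protocol stack while in-order payload bytes are packed into aligned, page-sized buffers that the socket layer could later flip to the application.

```python
# Toy sketch of header splitting + payload coalescing for small MTUs.
# All names here are illustrative, not a real NIC or kernel API.

PAGE = 16  # toy "page" size; real hardware would use 4KB or larger


def coalesce(segments):
    """segments: list of (seq, payload) for one flow, possibly reordered.
    Returns (headers_seen, list of page-sized payload buffers)."""
    headers = []
    stream = bytearray()
    for seq, payload in sorted(segments):   # sequence per flow
        headers.append(seq)                 # header side goes to the stack
        stream += payload                   # payloads coalesced in order
    # Pack the in-order byte stream into aligned, page-sized buffers.
    pages = [bytes(stream[i:i + PAGE]) for i in range(0, len(stream), PAGE)]
    return headers, pages


# Four small-MTU segments arrive, two of them out of order.
segs = [(16, b"PAGE-FLI"), (0, b"PAYLOADS"), (8, b"FILL-A-M"), (24, b"PPABLE..")]
hdrs, pages = coalesce(segs)
print(pages[0])   # first aligned "page": b'PAYLOADSFILL-A-M'
```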

Slide 9

Page Flipping with a ULP

[Diagram: NIC and host, kernel/user (K/U) boundary.]

Split transport and ULP headers, coalesce payloads for each stream (or ULP PDU).

ULP PDUs encapsulated in stream transport (TCP, SCTP).

Example: an NFS client reading a file.

Slide 10

Page Flipping: Pros and Cons

- Pro: sometimes works.
  - Application buffers must match transport alignment.
- NIC must split headers and coalesce payloads to fill aligned buffer pages.
- NIC must recognize and separate ULP headers as well as transport headers.
- Page remap requires TLB shootdown for SMPs.
  - Cost/overhead scales with number of processors.

Slide 11

Option 2: Scatter/Gather

[Diagram: NIC and host, kernel/user (K/U) boundary. The NIC demultiplexes packets by ID of the receiving process.]

Deposit data anywhere in the buffer pool for the recipient.

System and apps see data as arbitrary scatter/gather buffer chains (read-only).

Fbufs and IO-Lite [Rice]
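The scatter/gather API style of Option 2 can be demonstrated with the POSIX `readv()` call, exposed in Python as `os.readv`. A pipe stands in here (hypothetically) for the NIC's incoming data stream; the point is that one call scatters the bytes across a non-contiguous buffer chain, and the app accepts the data wherever it lands.

```python
import os

# Scatter/gather receive sketch: the kernel deposits incoming bytes
# across whatever buffer chain the app supplies, instead of the app
# demanding one contiguous copy at a chosen address.
# (A pipe plays the role of the NIC's data stream in this toy setup.)
r, w = os.pipe()
os.write(w, b"PAYLOAD-SPREAD-ACROSS-CHAIN")
os.close(w)

# The "buffer pool": three non-contiguous chunks of different sizes.
chain = [bytearray(8), bytearray(8), bytearray(16)]
n = os.readv(r, chain)           # one syscall scatters into all three
os.close(r)

print(n)                         # 27 bytes placed in total
print(bytes(chain[0]))           # first 8 bytes land here: b'PAYLOAD-'
print(b"".join(chain)[:n])       # the app walks the chain to see the stream
```

The gather-side twin, `os.writev`, sends from a buffer chain the same way; together they are the classic Unix expression of this API style.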

Slide 12

Scatter/Gather: Pros and Cons

- Pro: just might work.
- New APIs
- New applications
- New NICs
- New OS
- May not meet app alignment constraints.

Slide 13

Option 3: Direct Data Placement

NIC “steers” payloads directly to app buffers, as directed by transport and/or ULP headers.

Slide 14

DDP: Pros and Cons

- Effective: deposits payloads directly in designated receive buffers, without copying or flipping.
- General: works independent of MTU, page size, buffer alignment, presence of ULP headers, etc.
- Low-impact: if the NIC is “magic”, DDP is compatible with existing apps, APIs, ULPs, and OS.
- Of course, there are no magic NICs…

Slide 15

DDP: Examples

- TCP Offload Engines (TOE) can steer payloads directly to preposted buffers.
  - Similar to page flipping (“pack” each flow into buffers)
  - Relies on preposting, doesn’t work for ULPs
- ULP-specific NICs (e.g., iSCSI)
  - Proliferation of special-purpose NICs
  - Expensive for future ULPs
- RDMA on non-IP networks
  - VIA, Infiniband, ServerNet, etc.

Slide 16

Remote Direct Memory Access

[Diagram: an RDMA-like transport shim carries directives and steering tags in the data stream between the NIC and a remote peer.]

Register buffer steering tags with the NIC, pass them to the remote peer.

Directives and steering tags guide NIC data placement.
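A minimal sketch of the steering-tag idea, assuming a `ToyRnic` class invented for illustration (not any real RDMA verbs API): the receiver registers a buffer and gets back a steering tag; each arriving segment names (tag, offset), so the "NIC" writes each payload straight to its final location, with no staging copy, even when segments arrive out of order.

```python
# Toy model of RDMA-style direct data placement. ToyRnic and its
# methods are illustrative inventions, not a real hardware interface.

class ToyRnic:
    def __init__(self):
        self._regions = {}
        self._next_stag = 1

    def register(self, buf: bytearray) -> int:
        """Register a receive buffer; return its steering tag."""
        stag = self._next_stag
        self._next_stag += 1
        self._regions[stag] = buf
        return stag

    def place(self, stag: int, offset: int, payload: bytes) -> None:
        """Place a segment directly at its final offset (no staging copy)."""
        region = self._regions[stag]
        if offset + len(payload) > len(region):
            raise ValueError("placement outside registered region")
        region[offset:offset + len(payload)] = payload


nic = ToyRnic()
app_buf = bytearray(12)
stag = nic.register(app_buf)   # the tag would be advertised to the remote peer

# Segments arrive out of order; each lands in place, no reassembly copy.
nic.place(stag, 6, b"PORTS!")
nic.place(stag, 0, b"TRANS")
nic.place(stag, 5, b"-")
print(bytes(app_buf))          # b'TRANS-PORTS!'
```

The bounds check in `place` is the toy version of the protection question raised later under "RDMA Issues": a tag must confer access only to its registered region.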

Slide 17

The Case for RDMA over IP

- RDMA-like functions offer a general solution for fast-path data movement.
  - New transport functions to enhance performance.
- Lots of experience with RDMA on non-IP interconnects.
  - RDMA is an accepted direct-access model.
  - Protocols and APIs already exist to use RDMA.
- Leverages IP infrastructure, generality, cost.
  - IP everywhere
- RDMA NICs are general across a range of apps and ULPs.

Slide 18

The Case Against

- Requires a standard protocol for direct data placement over IP transports.
  - Interoperability
  - Must leverage and conform to existing/future framework for security and management.
- Requires some extension of ULPs or apps to use it.
  - Low to moderate
- “No RDMA protocol exists that can solve X.”

Slide 19

Summary/Conclusion

- Avoiding receiver copies is a complex problem.
  - No easy answers, but partial solutions abound.
- General fast-path data movement can benefit any protocol/application that moves a lot of data.
  - Position IP as competitive for high-performance networking;
  - currently effective only for NIC-based ULPs.
- RDMA standards will promote a market for interoperable general-purpose direct data placement NICs.
  - Build on growing market for IP NICs (TOE, iSCSI, VI)
  - Generalize to a wide range of ULPs and future ULPs.

Slide 20

RDMA Issues

- Negotiating RDMA functionality at connection setup
- Buffer registration and protection
  - Memory semantics vs. steering semantics
  - Persistence of buffer tags
  - Access control granularity
- Read vs. write
- Round trip costs and RDMA buffer “window” on LFNs
- Standardize NIC interface to host, or value diversity?
- Interaction of protocols and NIC implementation
  - Efficient implementation is paramount (cost and performance).
  - Requires support for out-of-order data placement?
- Related messaging semantics

Slide 21

SCTP?

- IP transports should be considered on an equal basis with respect to fast-path data movement.
- Everything I have said applies equally to SCTP.
  - Same problem, same solutions.
- SCTP stream IDs could be used for demux…
  - …just like early demux for TCP connections.
  - Separate the ULP headers on a different stream?
  - Requires close scrutiny + still have to flip or DDP.
- SCTP could be an excellent basis for RDMA.
  - Could enable out-of-order placement given a chunk type extension for RDMA.
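The stream-ID demux idea can be sketched as follows, with a hypothetical `demux` helper (not an SCTP API): each data chunk carries a stream identifier and a stream sequence number, so a stack can separate, say, ULP headers on one stream from bulk payloads on another and reassemble each stream independently.

```python
# Toy demux of SCTP-style data chunks by stream ID. Illustrative only:
# chunks are modeled as (stream_id, ssn, data) tuples.

from collections import defaultdict


def demux(chunks):
    """chunks: iterable of (stream_id, ssn, data). Returns a dict mapping
    each stream ID to its payload reassembled in stream-sequence order."""
    streams = defaultdict(list)
    for sid, ssn, data in chunks:
        streams[sid].append((ssn, data))   # route by stream ID
    return {sid: b"".join(d for _, d in sorted(parts))
            for sid, parts in streams.items()}


# Stream 1 carries (hypothetically) ULP headers; stream 2 carries payload.
# Chunks arrive interleaved and partially out of order.
chunks = [(1, 0, b"HDR0"), (2, 1, b"-DATA1"), (1, 1, b"HDR1"), (2, 0, b"DATA0")]
out = demux(chunks)
print(out[1])   # header stream, in order:  b'HDR0HDR1'
print(out[2])   # payload stream, in order: b'DATA0-DATA1'
```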

Slide 22

Duke Experiments

- Extend a public-domain Unix kernel with high-speed networking optimizations known in the literature.
  - FreeBSD 4.0...
  - ...extended for zero-copy sockets and checksum offloading.
  - Zero-copy TCP and NFS/UDP
  - Tigon GE and Trapeze/Myrinet
- Use netperf to benchmark TCP/IP on different configurations, using iprobe to profile host CPU activity.
  - 2 types of PCs and 2 types of Alphas
  - 32- and 64-bit PCI
  - vary MTUs from 1500 bytes up to 32KB
- Experiment with various combined optimizations, and study the numbers.

Slide 23

CPU Overhead for TCP/IP

[Figure: stacked CPU utilization (0-100%) for six configurations (Ethernet MTU; 8KB MTU; 32KB MTU; 32KB MTU + checksum offloading; 32KB MTU + zero-copy; 32KB MTU + checksum offloading + zero-copy), broken down into Idle, Copy & Checksum, Interrupt, VM, TPZ Driver, Buffering, and TCP/IP. Platform: 500 MHz Monet at 370 Mb/s.]

Slide 24

Trapeze/L7 TCP/IP Results

[Figure: CPU utilization of 92%, 41%, and 17% across configurations combining a large MTU, TCP checksum offloading, and zero-copy sockets for FreeBSD. Platform: Compaq “Monet”, Alpha 21264 at 500 MHz, 64-bit PCI at 33 MHz, 1.28 Gb/s Myrinet (L7).]

Slide 25

Netperf/TCP at 2 Gb/s

Dell PowerEdge 4400, 64/66 PCI, 733 MHz Intel P-III
Trapeze/Myrinet-2000 adapters (LANai-9 chipset)