extoll lustre network driver overview and preliminary results
TRANSCRIPT
Extoll Lustre Network Driver Overview and Preliminary Results
Sarah M. Neuwirth University of Heidelberg, Germany
LAD’18, Paris, 2018
Extoll Interconnect Architectural Idea
• Lean network interface − Low latency message exchange − High hardware message rate − Optimized memory footprint for scalability
• Switchless design − 3D Torus direct network − Reliable network − High Scalability
• Efficient, pipelined hardware architecture
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth 2
Extoll Interconnect Hardware Architecture – Overview
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth 3
Host InterfaceNetwork InterfaceNetwork
NetworkSwitch
LP
LP
LP
(7th) LP
LP
LP
LP
NP
NP
NP
NP
Register File
ATU
SMFU
Egress Ingress
RMA
Req CmpResp
VELO
Req Cmp
LP = Link Port NP = Network Port Req = Requester Resp = Responder Cmp = Completer
On-Chip Network
PCIe x16 Gen3
EXTOLL NIC
Interconnection Network
Extoll Interconnect Functional Units
• VELO – Virtualized Engine for Low Overhead − Low latency two-sided messaging for small payloads
• RMA – Remote Memory Access − Supports put and get operations to transfers large
amounts of data with one- and zero-copy methods − Local and remote completion notifications
• ATU – Address Translation Unit − Mapping of memory regions and page lists
• SMFU – Shared Memory Functional Unit − Distributed shared memory − Remote load/store operations
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth
Network Interface
Register File
ATU
SMFU
RMA
VELO
On-Chip Network
4
Extoll Basedriver
ATUDriver
User Application
libVELO
Middleware & Library (i.e., MPI, GASPI)
Extoll Hardware
VELO RMA RegisterFileATU
User Space
NIC
Kernel Space
Application Management
libRMA
RFAccess
RMADriver
VELODriver
EMP
PCIeConfig-space
libSMFU
SMFUDriver
SMFU
KernelAPI
Extoll Interconnect Software Environment
• Device driver and modules − Provide operating system bypass − Manage resources
• Kernel API provides interface for… − Network interface (EXN) − Direct sockets (AF_EXTL) − Network-attached accelerators (VPCI) − Lustre Network Driver (EXLND)
• Low-level user libraries − OpenMPI MTL for Extoll
• PGAS support (e.g., GASPI, GASNet) • EMP management software ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth 5
Extoll Tourmalet Throughput Performance – RDMA Capability
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth
• Two DL380 HP Servers each equipped with − EXTOLL Tourmalet Card − ConnectIB FDR Card − Intel(R) Xeon(R) CPU
E5-2620 v3 @ 2.40GHz − 32 GB RAM
• Software: − CentOS 7.2 − Mellanox OFED 3.0-1.01 − EXTOLL SW Stack 1.3.1 − OSU MPI Benchmark
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
1 2 4 8 16 32 64 128
256
512 1K 2K 4K 8K 16K
32K
64K
128K
256K
512K 1M 2M 4M
Band
wid
th (M
B/s)
Message size (Bytes)
EXTOLL
MVAPICH
OpenMPI BTL
OpenMPI MXM
6
LNET Lustre Network Communication Protocol
• LNET − Defines the communication infrastructure
between Lustre clients and servers − Abstracts network details from Lustre
• Supports most network technologies − TCP/IP networks, Cray Gemini, Infiniband − RDMA for bulk data transfers − Provides routing between LNET networks
• Lustre Network Driver (LND) − Implements LNET-to-LND API − Provides support for a particular network type
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth
TCP/IP
socklnd
LNET Code Module
Network RPC API
Lustre Request Processing
Application I/O
gnilnd o2iblnd ...
Gemini Infiniband ...
7
TCP/IP
socklnd
LNET Code Module
Network RPC API
Lustre Request Processing
Application I/O
gnilnd o2iblnd exlnd
Gemini Infiniband Extoll
EXLND Internals EXLND – Extoll Lustre Network Driver
• Built upon Extoll Kernel API • Data transfer protocols − Bulk transfers: • Rendezvous protocol: RMA puts/gets • Memory mapped via ATU
→ Network Logical Address (NLA) − Requests and immediate transfers: • Eager protocol: VELO messages
• Kernel module: kexlnd.ko • Network name: ex • Network adapter: ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth
networks=ex(ex0) 8
EXLND Internals Resource Management – Memory Descriptors
• Memory descriptors (MD) are − Typically either struct kvec or page array − Mapped via scatter/gather lists for RDMA (bulk data transfers)
• Idea: map scatter/gather lists via ATU into Extoll address space ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth
Portali ME ME ME
Portals Table
...
MemoryRegion
MD MD MD
Portal Index
MDBULK message
Memory Descriptor
Extoll Address Space
Physical Address Space ATU
NetworkLogical
Address
Lustre Portals Mechanism Memory Region Mapping
9
EXLND Internals Resource Management – Completions
• EXLND completions are… − Basically TX and RX descriptors − Distinguished by RMA sub-units
• Three different queues: − Responder queue handles LNetPut TX path − Completer queue handles both LNetPut RX
and LNetGet TX − Requester queue handles LNetGet RX path
• Completions are matched with incoming RMA notifications
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth 10
Network Interface
Register File
ATU
SMFU
Egress Ingress
RMA
Req CmpResp
VELO
Req Cmp
On-Chip Network
lnd_send()
Setup GET source for RDMAMap memory via ATU -> NLA
Build VELO messageLNET header & NLA
Parse LNET header
lnd_recv()
Setup GET sink -> NLA
VELO sendRDMA prep message
RMA GETBulk data transfer
GET Completion Notification (TX)
GET Completion Notification (RX)
Node A Node B
EXLND Internals Bulk Data Transfer – LNetPut()
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth
LNetPut(): • Sends data asynchronously •Mostly issued by clients to
request data from a server or to request a bulk data transmission to a server •Used by servers to send a
reply back to the client
11
EXLND Internals Bulk Data Transfer – LNetGet()
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth
LNetGet(): • Serves as a remote read
operation and is used by servers to retrieve data in bulk read transmissions from clients
12
lnd_send()
Setup PUT sink for RDMAMap memory via ATU -> NLA
Build VELO messageLNET header & NLA
Parse LNET header
lnd_recv()
Setup PUT source -> NLA
VELO sendRDMA prep message
RMA PUTBulk data transfer
PUT Completion Notification (TX origin)
PUT Completion Notification (TX)
Node A Node B
EXLND Performance Evaluation Test System and Lustre Configuration
• 5 Supermicro Barebone SuperServer SYS-1019GP-TT each equipped with: − Intel Xeon Silver 4110 2,1 GHz − 32 GB RAM − EXTOLL Tourmalet Card − OSS/MDS: Intel SSD 545s 128 GB
• CentOS 7.3 − Lustre Patch 2.10.1 − EXTOLL Software Stack v1.4.0 − EXLND v0.2 (experimental)
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth 13
EXLND Performance Evaluation LNET Selftest – I/O Size Scaling
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth 14
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
8K 16K 32K 64K 128K 192K 256K 384K 512K 768K 1M
Thro
ughp
ut (M
B/s)
I/O Size (in Bytes)
LNET Selftest BRW - I/O Size Scaling
EXLND WriteEXLND ReadEXN(TCP) WriteEXN(TCP) Read
EXLND Performance Evaluation LNET Selftest – Thread Scaling
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth 15
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
1 2 4 8 16 32 64 128 256
Thro
ughp
ut (M
B/s)
Number of Threads
LNET Selftest BRW - Thread Scaling
EXLND WriteEXLND ReadEXN(TCP) WriteEXN(TCP) Read
EXLND Performance Evaluation Other Benchmarks / Applications
• Application performance: IOR Benchmark − Sequential read/write, file-per-process − Throughput limited by write/read speed of SSDs: ~350 MB/s per SSD − Need larger system (more OSSs/OSTs) for evaluation
• Metadata traffic performance: mdtest − Measure values like file creations/seconds or stat operations/seconds
ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth
#Tasks 1 2 4 8 16 32
File creation Ops/s 1782 3134 10161 17857 19925 25766
File stat Ops/s 6141 10046 18957 27391 38746 49356
16
EXLND Development Roadmap
• Finalize LND code − Perform code optimizations including • Improved scatter/gather I/O • Multiple scheduler threads • Default credit configuration and tunables
− Handle remaining corner cases − Large scale stability tests
• Ensure compatibility with recent LNET changes − Currently supports latest Lustre 2.8 and 2.10 releases
• Push code to the Lustre community ZITI • Extoll LND: Overview and Preliminary Results • Sarah M. Neuwirth 17