Madeleine
Olivier Aumage
Runtime Project, INRIA – LaBRI
Bordeaux, France
Objective
Rational task assignment in high-performance communication stacks
[Diagram: software stack — application / programming environment / middle-level interface / low-level interface / network — mapped to model, abstraction, and hardware control]
Madeleine
A communication support for clusters and multi-clusters
Features
Abstract interface: programming by contract — specification of constraints, freedom for optimization
Active software support: dynamic optimization, adaptivity, transparency
Interface
Definitions
Connection: uni-directional point-to-point link, FIFO ordering
Channel: graph of connections; multiplexing unit; network virtualization
[Diagram: a channel as a graph of connections between processes]
Communication model
Characteristics
Model: message passing, incremental message building
Expressiveness: control of data blocks by flags — a contract between the programmer and the interface
Primitives
Main commands
Send: mad_begin_packing, mad_pack, …, mad_pack, mad_end_packing
Receive: mad_begin_unpacking, mad_unpack, …, mad_unpack, mad_end_unpacking
Message building
Commands:
mad_pack(cnx, buffer, len, pack_mode, unpack_mode)
mad_unpack(cnx, buffer, len, pack_mode, unpack_mode)
Send contract options (send modes): Send_CHEAPER, Send_SAFER, Send_LATER
Receive contract options (receive modes): Receive_CHEAPER, Receive_EXPRESS
Constraints: strictly symmetrical pack/unpack sequences; the (len, pack_mode, unpack_mode) triplets must be identical on the send and receive sides; data consistency
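To make the contract concrete, here is a small self-contained sketch of a symmetric pack/unpack sequence. The *_mock functions are illustrative stand-ins only — a simple in-memory FIFO takes the place of the network; they mimic the shape of mad_begin_packing / mad_pack / mad_end_packing but are not the Madeleine implementation.

```c
#include <string.h>

/* Mock connection: a byte FIFO standing in for a network link.
 * These *_mock functions imitate the shape of the Madeleine API;
 * they are NOT the real library. */
typedef struct { unsigned char buf[1024]; size_t len, pos; } cnx_mock;

static void begin_packing_mock(cnx_mock *c) { c->len = 0; }
static void pack_mock(cnx_mock *c, const void *p, size_t n)
{ memcpy(c->buf + c->len, p, n); c->len += n; }        /* append a block */
static void end_packing_mock(cnx_mock *c) { (void)c; }   /* would flush    */

static void begin_unpacking_mock(cnx_mock *c) { c->pos = 0; }
static void unpack_mock(cnx_mock *c, void *p, size_t n)
{ memcpy(p, c->buf + c->pos, n); c->pos += n; }        /* extract a block */
static void end_unpacking_mock(cnx_mock *c) { (void)c; }

int roundtrip(void)
{
    cnx_mock c;
    int x = 42; double y = 3.5;

    /* Sender: a sequence of packs. */
    begin_packing_mock(&c);
    pack_mock(&c, &x, sizeof x);
    pack_mock(&c, &y, sizeof y);
    end_packing_mock(&c);

    /* Receiver: strictly the same sequence of (len) blocks. */
    int rx; double ry;
    begin_unpacking_mock(&c);
    unpack_mock(&c, &rx, sizeof rx);
    unpack_mock(&c, &ry, sizeof ry);
    end_unpacking_mock(&c);

    return rx == 42 && ry == 3.5;
}
```

The contract is visible in the structure itself: the receiver issues exactly the same block sequence the sender packed, which is what lets a real implementation reorder, aggregate, or delay transfers underneath.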
[Diagram: send-side timeline — pack, buffer modification, end_packing — under Send_SAFER, Send_LATER, and Send_CHEAPER]
Contract between the programmer and the interface
Send_SAFER / Send_LATER / Send_CHEAPER
Control of data transfer: amount of optimization
Promises by the programmer: data consistency
Special services: delayed send, buffer reuse
Specification at the semantic level: independence between request and implementation
[Diagram: receive-side timeline — unpack, data availability, end_unpacking — under Receive_EXPRESS (data available right after unpack) and Receive_CHEAPER (availability guaranteed only at end_unpacking)]
Message structuring
Receive_CHEAPER / Receive_EXPRESS
Receive_EXPRESS: mandatory immediate receive — for data needed to interpret/extract the rest of the message
Receive_CHEAPER: unconstrained reception of the block — for message contents
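A minimal sketch of the usual idiom, using the same kind of illustrative mock connection (not the real Madeleine API): a length field that must be interpreted before the rest of the message can be unpacked travels in express mode, while the bulk payload can use the cheaper mode.

```c
#include <string.h>

/* Mock FIFO standing in for a Madeleine connection (illustrative only). */
typedef struct { unsigned char buf[256]; size_t len, pos; } cnx_mock;
static void pack_mock(cnx_mock *c, const void *p, size_t n)
{ memcpy(c->buf + c->len, p, n); c->len += n; }
static void unpack_mock(cnx_mock *c, void *p, size_t n)
{ memcpy(p, c->buf + c->pos, n); c->pos += n; }

int header_then_body(void)
{
    cnx_mock c = {0};

    /* Sender: the length is needed to interpret the message, so it would be
     * packed with Receive_EXPRESS; the payload itself can be Receive_CHEAPER. */
    int n = 5;
    char payload[5] = "abcd";             /* 4 chars + NUL */
    pack_mock(&c, &n, sizeof n);          /* express: usable right after unpack */
    pack_mock(&c, payload, (size_t)n);    /* cheaper: may arrive later/aggregated */

    /* Receiver: must extract the express field first -- its value drives
     * the size of the next unpack. */
    int rn;
    unpack_mock(&c, &rn, sizeof rn);
    char rbuf[16];
    unpack_mock(&c, rbuf, (size_t)rn);

    return rn == 5 && strcmp(rbuf, "abcd") == 0;
}
```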
Organization
Two-layered model: buffer management (data-processing code reuse) and hardware abstraction
Modular approach: buffer management modules (BMM), drivers, transmission modules (TM)
[Diagram: layered organization — interface, buffer management (BMMs), network management (transmission modules and drivers), network]
Network management layer
Data transfers: send, receive, group transfers
Transfer method selection: choice function
Transmission modules
Depends on the network
One module per transfer method — GM driver: 2 TMs; BIP driver: 2 TMs; SCI driver: 3 TMs; VIA driver: 3 TMs
Each TM is associated with a buffer management module
[Diagram: a transmission module moving packed buffers between the Madeleine interface/BMM and the network, possibly using a thread]
Buffers
Generic management layer
Virtual buffers: static, dynamic
Groups: aggregation, splitting
Buffer management modules
Buffer type: static/dynamic
Aggregation mode: none, sequential, half-sequential
Aggregation shape: symmetrical/non-symmetrical
Implementation
Status
Network drivers: Quadrics, MX, GM, SISCI, MPI, TCP, VRP, VIA, UDP, SBP, BIP
Distribution: GPL license
Availability
Linux: IA32, IA64, x86-64, Alpha, Sparc, PowerPC
MacOS X: G4
Solaris: IA32, Sparc
AIX: PowerPC
Windows NT: IA32
Tests – current platform
Test environment
Cluster of dual Pentium 4 HT 2.66 GHz PCs, 1 GB RAM; Gigabit Ethernet, SISCI/SCI, MX & GM/Myrinet, Quadrics Elan4
Testing procedure
Test: 1000 × (send + receive); result: ½ × average over 5 runs
Latency
[Plot: latency (µs) vs packet size (4–8192 bytes) for Mad/SISCI, Mad/GM, Mad/MX, Mad/Quadrics]
Bandwidth
[Plot: bandwidth (MB/s) vs packet size (bytes) for Mad/SISCI, Mad/GM, Mad/MX, Mad/Quadrics]
Tests – older platform
Testing environments
Cluster of dual Pentium II 450 MHz PCs, 128 MB RAM; Fast Ethernet, SISCI/SCI, BIP/Myrinet
Testing procedure
Test: 1000 × (send + receive); result: ½ × average over 5 runs
SISCI/SCI – latency
[Plot: latency (µs) vs packet size (bytes) for Mad/SISCI vs raw SISCI]
SISCI/SCI – bandwidth
[Plot: bandwidth (MB/s) vs packet size (bytes) for Mad/SISCI vs raw SISCI]
SISCI/SCI – latency: packs vs. messages
[Plot: latency (µs) vs packet size (bytes) for Mad/SISCI, comparing N separate messages against N packs in one message, N = 2, 4, 8, …, 256]
SISCI/SCI – bandwidth: packs vs. messages
[Plot: bandwidth (MB/s) vs packet size (bytes) for Mad/SISCI, comparing N separate messages against N packs in one message, N = 2, 4, 8, …, 256]
MPI API — generic interface: point-to-point communication, collective communication, group building
Abstract Device Interface (ADI) — generic interface: data type management, request queue management
SMP_PLUG: local communication
CH_SELF: local loops
CH_MAD: communication, polling loops, internal MPICH protocols
Madeleine: communication, multi-protocol support
Networks: QSNET, TCP, UDP, BIP, MX, GM, SISCI
Users – MPICH/Madeleine
MPICH/Mad/SCI – latency
[Plot: latency (µs) vs packet size (bytes) for Mad, MPICH/Mad, SCI-MPICH, SCA MPI]
MPICH/Mad/SCI – bandwidth
[Plot: bandwidth (MB/s) vs packet size (bytes) for Mad, MPICH/Mad, SCI-MPICH, SCA MPI]
[Diagram: Padico architecture — application over MPI / JVM / ORB; Marcel threads and Madeleine (communication, multi-protocol support) with Circuit and VSock; Padico core: task manager, thread manager, Padico micro-kernel, net access; networks: QSNET, TCP, UDP, BIP, MX, GM, SISCI]
Padico
Users – Padico
Padico – latency
[Plot: latency (µs) vs packet size (bytes) for Madeleine, VSock, MPI, CORBA]
Padico – bandwidth
[Plot: bandwidth (MB/s) vs packet size (bytes) for Madeleine, VSock, MPI, CORBA, Java]
Conclusion
Unified communication support
Abstract interface; contract-based programming; modular/adaptive architecture; dynamic optimization; transparent multi-cluster support
On-going/future work
Programming interface: message structuring, exploitation of near-future information, reduction of pathological cases, fault tolerance
Processing of communication sequences: code specialization, compilation
Session management: deployment, dynamicity, fault tolerance, scaling
Madeleine I Madeleine II Madeleine III Madeleine IV
Some limitations of Madeleine (version III)
Objectives for a new Madeleine
Some optimizations are out of reach for Madeleine — the optimization range is too narrow
Need information about what is coming in the near future; need to be more liberal in allowing permutations in the packet flow
Optimization strategies involve too much work from the driver programmer — need to share more of the strategic code, and to easily evaluate and even mix various strategies
Optimization sequences are synchronous with the application program — need to synchronize optimization sequences with the NIC instead
Proposal: Madeleine IV
[Diagram: an optimizer thread between the sender thread and the driver/network, configured by hardware-specific parameters: tracks, tactics, strategies, constraints]
Concepts
Definitions
Tracks: mapping of hardware multiplexing units (tags) — a main track for control packets, small packets, …; optional auxiliary tracks for other traffic (large messages, …)
Tactics: basic optimization operations — permutation, aggregation, piggybacking, association, splitting, track change
Strategies: a set of tactics working towards one optimization goal
Constraints: tactic compatibility, send/receive modes
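The aggregation tactic can be sketched in a few lines: several small packets are coalesced into one wire packet, each prefixed by a length header so the receiver can split them apart again. The names and the wire layout below are hypothetical, for illustration only — not Madeleine's actual format.

```c
#include <string.h>

/* Illustrative sketch of the "aggregation" tactic. One wire_packet holds
 * several application packets, each prefixed by a 2-byte length header. */
typedef struct { unsigned char data[512]; size_t len; } wire_packet;

/* Sender side: append one more small packet to the aggregate. */
void aggregate(wire_packet *w, const void *p, unsigned short n)
{
    memcpy(w->data + w->len, &n, sizeof n);   /* header: payload length */
    w->len += sizeof n;
    memcpy(w->data + w->len, p, n);           /* payload */
    w->len += n;
}

/* Receiver side: split the aggregate back into packets; returns the count. */
int split(const wire_packet *w, unsigned char out[][64], unsigned short *sizes)
{
    size_t pos = 0; int k = 0;
    while (pos < w->len) {
        unsigned short n;
        memcpy(&n, w->data + pos, sizeof n); pos += sizeof n;
        memcpy(out[k], w->data + pos, n);    pos += n;
        sizes[k++] = n;
    }
    return k;
}
```

An optimizer applying this tactic trades one network transaction per small packet for a single larger one, at the cost of the headers and the split on the receive side.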
Proposal: Madeleine IV (cont'd)
[Diagram: the same optimizer-thread architecture, now highlighting packet headers]
Giving up a little raw efficiency to gain much more flexibility
Opportunistic packet aggregation/permutation: inside a single packet flow, and across multiple packet flows
Side effects on control packets (rendez-vous ACKs): piggybacking, multiplexing
Concurrent communication progression
Communication scheduling
The NIC is responsible for requesting work
Packets are built when the NIC is ready
The optimizer gets more time to gather up-to-date optimization clues
Tests
Test environment
Cluster of dual Pentium 4 HT 2.66 GHz PCs, 1 GB RAM; MX/Myrinet
Testing procedure
Test: 1000 × (send + receive); result: ½ × average over 5 runs
Test – latency
[Plot: latency (µs) vs packet size (4–2048 bytes) for raw MX, Mad3, Mad4]
Test – bandwidth
[Plot: bandwidth (MB/s) vs packet size (4 B–1 MB) for raw MX, Mad3, Mad4]
Test – latency when aggregating short packets
[Plot: latency (µs) vs packet size (4–256 bytes) for Mad3 vs Mad4]
Opportunistic aggregation on rendez-vous
Aggregating a short packet with a rendez-vous request for a long packet
No gain with MX/Myrinet:
Madeleine III — latency: 310 µs, bandwidth: 201 MB/s
Madeleine IV — latency: 314 µs, bandwidth: 200 MB/s
MX flow control gets in the way
Conclusion
A new architecture for optimizing communication: wider optimization spectrum, better interaction between software and hardware
A platform for experimenting with optimizations: optimization tactics
A prototype implemented on top of MX/Myrinet: proof of concept
On-going and future work
Optimization: tactic combinations, automatic strategy selection, external strategies (plug-ins)
Interface expressiveness: extended packs, one-sided communication
Load balancing, multi-rail: benefit from all available links
Cluster architectures
Characteristics
A set of computers: regular off-the-shelf PCs
A "classical" network: slow; used for administration and services
A fast network: low latency, high bandwidth; used by applications
[Diagram: two clusters, each with its own fast internal network, linked by a slow network]
Three programming models
Programming environments:
Message passing — PVM, MPI
Service invocation — SUN RPC, OVM, PM2, etc.; RSR (Nexus); Java RMI; CORBA
Distributed shared memory — TreadMarks, DSM-Threads, DSM-PM2
Each model has its use
Research theme
Interfacing programming environments with networking technologies
[Diagram: application processes and programming environments (message passing; service invocation: RPC, RMI; distributed shared memory) over a communication support layer, itself over the networks: Ethernet, Myrinet, SCI, Quadrics, Infiniband]
Features needed
A generic communication interface
Neutrality: independence with respect to the target programming model — message passing; service invocation (RPC, RMI); distributed shared memory
Portability: independence with respect to hardware — computing hardware, networking hardware
Efficiency: raw performance (latency, bandwidth, reactivity) and application performance
Available solutions
High level network interfaces?
Example: MPI
Advantages: portability, standardization, rich features, efficiency
Drawbacks: the interface is not adapted to complex communication schemes — no way to express relations between the pieces of data in a message; lack of expressiveness
Problem example
Remote service invocation — request: a header (service descriptor) and a body (service arguments)
First option – two messages
[Diagram: the client sends the header and the body as two separate MPI messages over one MPI connection; the server receives the header first, then the body, to rebuild the request]
Problem example
Remote service invocation — second option: one copy
[Diagram: the client copies header and body into a single MPI message over one MPI connection; the server must perform an extra copy to split the message back into header and body]
In both cases, MPI is not expressive enough
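The cost of the one-copy option can be made explicit with a small sketch: header and body must be staged into one contiguous buffer before a single send, and that staging is exactly the extra copy the slide criticizes. All names here (header_t, build_request) are illustrative, not from any real MPI-based RPC layer.

```c
#include <stddef.h>
#include <string.h>

/* Option 2 ("one copy"): header and body are copied into a single staging
 * buffer so they can travel as one message. The two memcpy calls are the
 * extra cost compared with an interface that could describe both pieces
 * directly. Names are hypothetical. */
typedef struct { int service_id; } header_t;

size_t build_request(unsigned char *staging,
                     const header_t *h, const void *body, size_t body_len)
{
    memcpy(staging, h, sizeof *h);                 /* copy 1: header */
    memcpy(staging + sizeof *h, body, body_len);   /* copy 2: body   */
    return sizeof *h + body_len;                   /* single message length */
}
```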
Available solutions (cont’d)
Low level interfaces?
Examples: BIP, GAMMA, GM, SISCI, VIA
Advantages: efficiency, exploitation of the hardware potential
Drawbacks: hardware dependency, limited abstraction level, difficult development, limited potential for code reuse
Short-lived developments?
Available solutions (end)
Middle-level communication interface?
Examples: Nexus, Active Messages, Fast Messages
Advantages: abstraction, efficiency, relative portability
Drawbacks: neutrality? expressiveness? an active-message (or similar) programming model forces unnecessary additional processing — a problem of approach
Objective
Proposal for a generic middle-level communication interface
Independence with respect to programming environments: neutral programming model
Independence with respect to networking technology: performance portability
[Diagram: programming environments Env 1…n mapped onto networks Net 1…m, with and without a common intermediate interface]
Objectives
Widen the scope of optimizations
Allow different strategies to be easily implemented and evaluated, in a portable way
Optimize the activity of the network cards: NIC-driven transfers; balancing transfers across several NICs
New optimization tactics
Example 1: two consecutive packets whose receive mode is express — with Madeleine 3; tactic: aggregation of short messages
Example 2: a packet requiring a rendez-vous, received in express mode, followed by a packet that does not need one — with Madeleine 3; tactic: rendez-vous aggregation
Example
Send:
begin_send(dest)
pack(data,   long,  r_express)
pack(index1, short, r_express)
pack(index2, short, r_express)
end_send()
Receive:
begin_recv()
unpack(data,   long,  r_express)
unpack(index1, short, r_express)
unpack(index2, short, r_express)
total = data[index1] + data[index2]
end_recv()
[Diagram: the optimizer between the application and the network, on both the send and receive sides — packets awaiting acknowledgement, unexpected packets, ACKs]
Static buffers
Buffer managers: filling; drivers: allocation/free
Dynamic buffers
Buffer managers: allocation/free, copy (when necessary), aggregation by affinity
Aggregation – sequential
[Diagram: flush points in a sequential aggregation involving TM1 and TM2]
Aggregation – half-sequential
[Diagram: flush points in a half-sequential aggregation with a main track and TM 1 / TM 2]
Aggregation shape: symmetrical / non-symmetrical
[Diagram: flush patterns on the send and receive sides for the two shapes]
Special cases
Send_LATER / Receive_CHEAPER: automatic half-sequential aggregation
[Diagram: main track with TM 1 and TM 2; transfer at end_packing]
Special cases
Send_LATER / Receive_EXPRESS: half-sequential aggregation for everybody; send delayed until the end_packing call
[Diagram sequence: packs accumulate on the send side while unpacks on the receive side wait for the expected data; at end_packing the delayed send fills the buffers and the transfer takes place]
Tests – first part
Testing environment
Cluster of dual Pentium II 450 MHz PCs, 128 MB RAM; Fast Ethernet, SISCI/SCI, BIP/Myrinet
Testing procedure
Test: 1000 × (send + receive); result: ½ × average over 5 runs
Grids?
Heterogeneity
Grids
Idea
A grid
A computer? An interconnected set of grids
Multi-cluster support
Exploitation of clusters of clusters
Fast cluster networks; fast inter-cluster networks; network-level heterogeneity
[Diagram: several clusters, each with its own high-performance network, interconnected]
Idea
Physical channels: tied to one physical network; do not necessarily cover every node of the session
Virtual channels: cover every node of the session; contain one or more physical channels
[Diagram: a virtual channel spanning a Myrinet channel and an SCI channel]
Integration
Generic transmission module: limited stack traversal on forwarding nodes
[Diagram: on a forwarding node, a generic TM and a forwarding module (with threads) relay data between network 1 and network 2 without climbing back up through the full Madeleine stack]
Bandwidth preservation
Pipeline: concurrent receive and re-send using two buffers
One copy: the same buffer is used for receive and re-send
[Diagram: buffer 1 and buffer 2 alternating between receive and re-send on the LANai]
Deployment
Session spawning – Léonie
Sessions — flexibility: multi-cluster, unified launch, grouped spawns; extensibility: support for optimized distributed process launchers
Network — information table generation: process directory, routing tables for virtual channels; ordering: NIC initialization, channel opening
[Diagram: Léonie driving the Madeleine session]
Virtual connections – latency (SISCI+BIP)
[Plot: latency (µs) vs packet size (bytes) for the two forwarding directions, BIP+SISCI and SISCI+BIP, across Myrinet and SCI]
Virtual connections – bandwidth (SISCI+BIP)
[Plot: bandwidth (MB/s) vs packet size (4 B–1 MB) for BIP+SISCI and SISCI+BIP]