1 opportunities and challenges of modern communication architectures: case study with qsnet cac...
1
Opportunities and Challenges of Modern Communication
Architectures: Case Study with QsNet
CAC Workshop
Santa Fe, NM, 2004
Sameer Kumar* and Laxmikant V. Kale
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
2
Outline
• Processor virtualization
• QsNet
• Opportunities
• Performance evaluation of QsNet
• Challenges of QsNet
• Summary
3
Processor Virtualization
• Basic idea of processor virtualization
  • User specifies interaction between objects (VPs)
  • RTS maps VPs onto physical processors
  • Typically, # virtual processors > # processors
• Embodied in Charm++ and AMPI
[Figure: user view vs. system implementation]
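As a rough illustration of "RTS maps VPs onto physical processors", the sketch below places virtual processors round-robin. The function name `map_vps_round_robin` and the round-robin policy are assumptions made for this example only; the slides do not say which placement policy the Charm++ or AMPI runtime uses.

```python
from collections import defaultdict


def map_vps_round_robin(num_vps, num_procs):
    """Assign each virtual processor (VP) to a physical processor.

    Hypothetical helper: round-robin is only one possible policy and is
    not taken from the slides.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = defaultdict(list)
    for vp in range(num_vps):
        # VP i lands on physical processor i mod num_procs.
        placement[vp % num_procs].append(vp)
    return dict(placement)


# Typical case from the slide: more virtual than physical processors.
placement = map_vps_round_robin(num_vps=8, num_procs=2)
print(placement)
```

With 8 VPs on 2 processors, each physical processor hosts 4 VPs, which is what gives the runtime something else to run while one VP waits on communication.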
4
QsNet
• Popular interconnect from Quadrics
• Several parallel systems in the Top500 use QsNet
  • Pittsburgh's Lemieux (6 TF)
  • ASCI-Q (20 TF)
• Elite network
• Elan adaptor
5
Elite Network
• 320 MB/s each way after protocol
• Reliable fat-tree network
  • Multiple routes provide fault tolerance
• Adaptive wormhole routing
• 35 ns per hop
6
Elan Network Adaptor
• Features
  • Low latency (4.5 μs for MPI)
  • High bandwidth (320 MB/s/node)
• Components
  • SPARC processor
  • DMA engine
  • 64 MB RAM
  • On-chip cache
7
Low CPU Overhead
[Plot: latency and CPU overhead (μs, 0 to 35) vs. message size (16 to 4096 bytes)]
CPU Overhead is small and does not change much with the message size
8
Traditional Message Passing
[Timeline diagram: processors P0 and P1 over time, showing send overhead, receive overhead, and idle time]
Traditional message passing does not utilize the low CPU overhead of Elan
9
Adaptive Overlap
[Timeline diagram: processors P0 and P1 over time, each running VP0 and VP1, showing send overhead and receive overhead]
Processor Virtualization takes full advantage of the low CPU overhead of Elan
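The overlap argument can be made concrete with a toy model. Everything in it is an assumption for illustration, not a measurement from the talk: each processor has a fixed amount of compute per step split evenly among its VPs, and while one VP waits for a message the remaining VPs' compute can fill the wait. The function name `idle_time_per_step` is hypothetical.

```python
def idle_time_per_step(compute_time, message_latency, vps_per_proc):
    """Toy estimate of processor idle time while one VP waits for a message.

    Assumed model (not from the slides): compute_time is split evenly among
    the VPs on a processor, and the work of the other VPs can be scheduled
    during the wait.
    """
    if vps_per_proc < 1:
        raise ValueError("need at least one VP per processor")
    # Work belonging to the VPs that are not waiting.
    other_work = compute_time * (vps_per_proc - 1) / vps_per_proc
    # Whatever part of the latency is not covered by other work is idle time.
    return max(0.0, message_latency - other_work)


# One VP per processor: the whole latency shows up as idle time.
print(idle_time_per_step(8.0, 4.0, 1))
# Eight VPs per processor: the wait is hidden behind other VPs' work.
print(idle_time_per_step(8.0, 4.0, 8))
```

The model ignores scheduling and per-message CPU overhead, which is reasonable only because that overhead is small on Elan, as the earlier slide notes.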
10
Benefit of Adaptive Overlap
Problem setup: 3D stencil calculation of size 240^3 run on Lemieux.
The plot shows AMPI with virtualization ratios of 1 and 8.
[Plot: execution time (sec, 0.001 to 1) vs. number of processors (1 to 1000), log-log scale, for AMPI(1) and AMPI(8)]
11
Charm++ Message Driven Execution
[Diagram labels: Handler, Scheduler, Pump, Garbage Collection, Send, Tport Send, Post Receives, Receive Message]
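For readers unfamiliar with message-driven execution, here is a minimal sketch of the general idea: a scheduler repeatedly takes the next available message and invokes the handler it names. The `Scheduler` class and the `ping`/`pong` handlers are hypothetical and say nothing about how the Charm++ scheduler, its message pump, or the Elan Tport layer are actually implemented.

```python
from collections import deque


class Scheduler:
    """Toy message-driven scheduler (illustrative only, not Charm++ code)."""

    def __init__(self):
        self.queue = deque()   # pending (handler name, payload) messages
        self.handlers = {}     # handler name -> callable

    def register(self, name, fn):
        self.handlers[name] = fn

    def enqueue(self, name, payload):
        self.queue.append((name, payload))

    def run(self):
        # Execution is driven by message availability, not by program order.
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](self, payload)


log = []


def ping(sched, n):
    log.append(("ping", n))
    if n > 0:
        sched.enqueue("pong", n - 1)


def pong(sched, n):
    log.append(("pong", n))
    if n > 0:
        sched.enqueue("ping", n - 1)


sched = Scheduler()
sched.register("ping", ping)
sched.register("pong", pong)
sched.enqueue("ping", 2)
sched.run()
print(log)
```

In a real runtime the queue would be fed by the network layer between handler invocations; here the handlers feed it themselves to keep the example self-contained.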
12
NAMD: A Production MD System
• Written in Charm++
• Fully featured program
• NIH-funded development
• Distributed free of charge (5000+ downloads so far)
• Binaries and source code
• Installed at NSF centers
• Large published simulations (e.g., aquaporin simulation featured in keynote)
13
Scaling NAMD
• Several QsNet challenges had to be overcome to scale NAMD
14
QsNet Challenge: Latency
[Plot: short-message latency (μs, 0 to 20) vs. number of receives posted (1, 5, 9, 17, 33), for MPI and Converse]
Applications need to post receives for messages of different sizes
15
Latency Bottlenecks
• Slow NIC processor with a 100 MHz clock
• Cache size only 8 KB
  • Traversing a large loop flushes it
[Table: cache misses vs. number of receives posted; cell boundaries were lost in extraction, raw values: 1 860175 924759 10303713 17406017 100800 3]
16
Managing Latency: Message Combining
• Organize processors in a 2D (virtual) mesh
• Phase 1: Processors send messages to row neighbors (√P - 1 messages)
• Message from (x1,y1) to (x2,y2) goes via (x1,y2)
• Phase 2: Processors send messages to column neighbors (√P - 1 messages)
• 2(√P - 1) messages instead of P - 1
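The routing rule and the message counts on this slide can be checked with a small sketch. It assumes P is a perfect square and that rank r sits at mesh coordinate (r // side, r % side); both the numbering and the function names `mesh_route` and `messages_sent` are assumptions for illustration.

```python
import math


def mesh_route(src, dst, side):
    """Return [src, intermediate, dst] ranks for the two-phase mesh route.

    Assumes rank r sits at (x, y) = (r // side, r % side). A message from
    (x1, y1) to (x2, y2) goes via (x1, y2), as on the slide.
    """
    x1, _y1 = divmod(src, side)
    _x2, y2 = divmod(dst, side)
    via = x1 * side + y2
    # If src and dst share a row or column, via coincides with dst or src.
    return [src, via, dst]


def messages_sent(num_procs):
    """Messages each processor sends: (mesh strategy, direct strategy)."""
    side = math.isqrt(num_procs)
    if side * side != num_procs:
        raise ValueError("this sketch assumes a square processor count")
    # side - 1 combined messages in each of the two phases, versus
    # one message per other processor when sending directly.
    return 2 * (side - 1), num_procs - 1


print(mesh_route(1, 14, side=4))
print(messages_sent(1024))
```

For 1024 processors the mesh strategy sends 62 combined messages per processor instead of 1023 small ones, trading extra bytes forwarded for far fewer per-message costs.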
17
NAMD PME Performance
[Plot: step time (0 to 140) vs. number of processors (256, 512, 1024), for Mesh, Direct, and Native MPI]
Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages.
18
QsNet Challenge: Bandwidth
• One way: 290 MB/s
• Two way: 128 MB/s
• QsNet network bandwidth: 320 MB/s
• PCI/DMA contention restricts bandwidth on Alpha servers
19
Improving Bandwidth
Node bandwidth (MB/s) for different placements of source and destination:

| | Main-Main | Elan-Main | Elan-Elan |
|---|---|---|---|
| One way | 290 | 305 | 319 |
| Two way | 128 | 305 | 319 |

Sending messages from Elan memory is faster
20
QsNet Challenge: Stretched Handlers
• Stretched sends
• Similar stretches observed in the middle of entry methods
[Figure: NAMD timeline, processors vs. time; legend: force compute, integrate; annotation: green superscripts]
21
Stretching Solution
• Stretched sends
  • Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged
  • Solved the problem by working closely with Quadrics and obtaining a patch
  • Isend now blocks only on the rendezvous of the previous message to the same destination
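The difference between the two blocking rules can be stated as a small predicate. This is a simplified model of the behaviour described on the slide, not Elan library code: `isend_would_block` is a hypothetical name, and tracking a set of destinations with an unacknowledged rendezvous is an approximation of "the previous Isend".

```python
def isend_would_block(dest, unacked_dests, patched):
    """Would a new Isend to dest block under the modelled rule?

    unacked_dests: destinations whose last rendezvous is still
    unacknowledged (an assumed bookkeeping structure for this sketch).
    """
    if patched:
        # After the patch: only an outstanding rendezvous to the same
        # destination blocks the send.
        return dest in unacked_dests
    # Before the patch: an outstanding rendezvous to any destination blocks.
    return len(unacked_dests) > 0


# A rendezvous to processor 5 is outstanding; we now send to processor 3.
print(isend_would_block(3, {5}, patched=False))
print(isend_would_block(3, {5}, patched=True))
```

Under the original rule one slow acknowledgement stalls sends to every destination, which is consistent with the stretched sends seen in the timeline.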
22
Stretching Solution Contd.
• Stretches in the middle of entry methods
  • Caused by OS daemons
  • Using blocking receives minimized these stretches
  • Daemons can be scheduled when the processor is idle
23
NAMD With Blocking Receives
[Figure: NAMD timeline with blocking receives, processors vs. time]
24
NAMD Performance on Lemieux
[Plot: NAMD step time (ms, left axis, 0 to 30) and performance (GFLOPS, right axis, 0 to 1200) vs. number of processors (1000 to 3000)]
25
Summary
• QsNet is an excellent network
• NIC co-processor is ideal for message-driven execution
• Programming guidelines
  • Send messages from Elan memory
  • Post a limited number of receives, and post them before the sends
  • Use blocking receives to avoid stretching
26
Future Work
• One-sided communication
  • Barrier?
• Persistent one-sided communication
  • Reserve buffers on destination