© 2004 Dorian C. Arnold April 14, 2004
MRNet: From Scalable Performance
to Scalable Reliability

Dorian C. Arnold
University of Wisconsin-Madison
Paradyn/Condor Week, April 14-16, 2004
Madison, WI
– 2 – Scalability and Reliability © 2004 Dorian C. Arnold
More HPC Facts

Statistics from the Top500 List:
• 24% have ≥ 512 processors
• 10% have ≥ 1024 processors
• 9 systems have ≥ 4096 processors
• The largest system has 8192 processors
• By 2009, the 500th entry will be faster than today's #1

Bottom line: HPC systems with many thousands of nodes will soon be the standard.
Applications Must Address Scalability!

Challenge 1: Scalable Performance
• Provide distributed tools with a mechanism for scalable, efficient group communications and data analyses.
– Scalable multicast
– Scalable reductions
– In-network data aggregations
Applications Must Address Scalability!

Scalability necessitates reliability!

Challenge 2: Scalable Reliability
• Provide mechanisms for reliability in our large-scale environment that do not degrade scalability.
– Scalable multicast
– Scalable reductions
– In-network data aggregations
Target Applications
• Distributed tools and debuggers
– Paradyn, Tau, PAPI's perfometer, …
• Grid and distributed middleware
– Condor, Globus
• Cluster and system monitoring applications
• Distributed shell for command-line tools

Goal: Provide a generic scaling mechanism for monitoring, control, troubleshooting, and general middleware components for Grid infrastructures.
Challenge 1: Scalable Performance

Problem: Centralization leads to poor scalability
• Communication overhead does not scale.
• Data analyses are restricted to the front-end.

[Figure: tool front-end connected directly to back-ends BE0 … BEn-1, each producing data a0 … an-1]
MRNet: Solution to Scalable Tool Performance

Multicast/Reduction Network
• Scalable data multicast and reduction operations.
• In-network data aggregations.

[Figure: tool front-end connected to back-ends BE0 … BEn-1 through a tree of internal processes, with data a0 … an-1 aggregated at each level]
Paradyn/MRNet Integration

Scalable start-up
• Broadcast metric data to daemons
• Gather daemon data at the front-end
• Front-end/daemon clock skew detection

Performance data aggregation
• Time-based synchronization
Paradyn Data Aggregation (32 metrics)

[Figure: service rate/arrival rate (0-1) vs. number of back-ends (0-600), for 32, 16, 8, and 1 metrics on a flat tree, and 32 metrics with an 8-way fanout]
MRNet References

Technical papers:
• Roth, Arnold, and Miller, “MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools”, in SC2003 (Phoenix, AZ, November 2003).
• Roth, Arnold, and Miller, “Benchmarking the MRNet Distributed Tool Infrastructure: Lessons Learned”, in the 2004 High-Performance Grid Computing Workshop, held in conjunction with IPDPS 2004 (Santa Fe, New Mexico, April 2004).
Scalable Performance Achieved: What Next?

More and increasingly complex components in large-scale systems.

MTTF_system = MTTF_component^2 / ( N (N − 1) × MTTR_component )

A system with 10,000 nodes is 10^4 times more likely to fail than one with 100 nodes.
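Under the dual-failure model behind the formula above (the system fails only when a second node dies while another is still being repaired), the 10^4 claim can be checked in a few lines; the component MTTF and MTTR values below are illustrative, not from the slides:

```python
def system_mttf(n, mttf_comp, mttr_comp):
    """MTTF of an n-node system under the dual-failure model:
    the system fails when a second node dies while one is in repair."""
    return mttf_comp ** 2 / (n * (n - 1) * mttr_comp)

# Illustrative component values: 5-year MTTF, 1-hour repair time.
mttf = 5 * 365 * 24.0   # hours
mttr = 1.0              # hours

small = system_mttf(100, mttf, mttr)
large = system_mttf(10_000, mttf, mttr)

# The failure rate grows as N(N-1) ~ N^2, so going from 100 to 10,000
# nodes multiplies the failure likelihood by roughly 100^2 = 10^4.
print(small / large)   # 10100.0, i.e. ~10^4
```

The ratio is independent of the component MTTF and MTTR; only the N(N−1) term matters.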
Challenge 2: Scalable Reliability

Goals:
• Design scalable reliability mechanisms for communication infrastructures with reduction operations and in-network data aggregations.
• Develop a quantitative understanding of the scalability trade-offs between different levels of resiliency and reliability.
Challenge 2: Scalable Reliability

Reliability vs. Resiliency:
• A reliable system executes correctly in the presence of (tolerated) failures.
• A resilient system recovers to a mode in which it can once again execute correctly.
– During a failure, errors are visible at the system interface level.
Challenge 2: Scalable Reliability

Problem:
• Scalability → decentralization, low overhead
– Scalability wants simple systems.
• Reliability → consensus, convergence, high overhead
– Reliability wants complex systems.

How can we leverage our tree-based topology to achieve scalable reliability?
Recovery Models and Semantics

Fault model: crash-stop failures

TCP-like reliability for tree-based multicast and reduction operations

System should tolerate any and all internal-node failures
• System slowly degrades to a flat topology

Models based on operational complexity
• E.g., are in-network filters stateful?
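The stateful-filter question drives recovery cost: a stateless filter can simply be restarted, while a stateful one carries history that must be reconstructed after a crash. A minimal sketch of the distinction (illustrative code, not MRNet's filter API):

```python
# Stateless filter: output depends only on the current wave of inputs,
# so a restarted node needs no recovered history.
def max_filter(values):
    return max(values)

# Stateful filter: carries history across waves; after a crash that
# history must be recovered (or recomputed) for correct results.
class RunningMeanFilter:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def push(self, values):
        self.total += sum(values)
        self.count += len(values)
        return self.total / self.count

f = RunningMeanFilter()
print(max_filter([3, 1, 4]))   # 4
print(f.push([3, 1, 4]))       # ≈ 2.667
print(f.push([2, 2]))          # 2.4
```

Losing `f`'s `total`/`count` mid-stream silently corrupts every later mean, which is why stateful filters make the recovery model harder.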
Recovery Models and Semantics: Challenges
• Detecting loss, duplication, and reordering
• Quick recovery from message loss
• Correct recovery from failure
• Recovery of state information from aggregation operations
• Simultaneous failures
• Validation of our scalability methodology
Challenge 2: Scalable Reliability

Hypothesis: Aggregating control messages can effectively achieve scalable, reliable systems.
Example: Scalable Failure Detection

Goal: A scalable failure-detection service with high rates of convergence.

Previous work:
• Non-scalable overhead
• Poor convergence properties
• Non-deterministic guarantees
• Costly assumptions
– E.g., fully-connected meshes
Failure Detection Approaches
• Gossip-style failure detection and propagation
– Gupta et al., van Renesse et al.
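For context, the gossip-style protocols cited here work roughly as follows: each node keeps a heartbeat table, periodically pushes it to a random peer, and receivers keep the per-node maximum; a counter that stops advancing marks a suspected failure. A simplified single-process sketch (the round structure and node count are illustrative):

```python
import random

# Gossip-style heartbeat tables (after Gupta et al. / van Renesse et al.):
# every node keeps {node_id: heartbeat}, bumps its own entry, and pushes
# its table to one random peer per round; receivers keep the per-node max.
def merge(local, incoming):
    for node, hb in incoming.items():
        local[node] = max(local.get(node, 0), hb)

tables = {n: {n: 0} for n in range(4)}   # four hypothetical nodes

for _ in range(10):                      # ten gossip rounds
    for n in tables:
        tables[n][n] += 1                # bump own heartbeat
        peer = random.choice([p for p in tables if p != n])
        merge(tables[peer], tables[n])   # gossip to one random peer

# A node that stops gossiping leaves a frozen counter in its peers'
# tables; after a timeout, those peers suspect it has failed.
print(tables[0])
```

The per-round message load is what the slide's "non-scalable overhead" criticism points at: every node gossips a table whose size grows with the system.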
Failure Detection Approaches
• Hierarchical heartbeat detection and propagation
– Felber et al., Overcast, Grid monitoring
Scalable Failure Detection

Tracking senders in an aggregated message:
• Naïve approaches:
– Append a 32/64-bit source ID for each source
• Pathological case: many senders
– Bit-array in which bits represent potential sources
• Pathological case: many potential sources, few actual senders
• Our approach:
– Variable-size bit-array:
• The number of bits varies according to the descendants beneath the intermediate node (i.e., its depth in the topology)
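One way to sketch the variable-size bit-array (illustrative, not MRNet's implementation): each internal node sizes its array to its own descendant count and builds it by concatenating its children's arrays, so no node pays for sources outside its subtree:

```python
# Variable-size bit-array: each internal node tracks only its own
# descendants, one bit per leaf of its subtree (illustrative sketch).
def make_mask(num_leaves):
    return [0] * num_leaves          # one bit per descendant leaf

def aggregate(child_masks):
    # The parent's array is the concatenation of its children's arrays,
    # so its size grows only with its own descendant count.
    merged = []
    for mask in child_masks:
        merged.extend(mask)
    return merged

# Two child subtrees with 3 and 2 leaves; leaves 0 and 2 of the first
# and leaf 1 of the second contributed to the aggregated message.
left = make_mask(3);  left[0] = left[2] = 1
right = make_mask(2); right[1] = 1
print(aggregate([left, right]))      # [1, 0, 1, 0, 1]
```

This avoids both pathological cases above: the cost is one bit per descendant rather than 32/64 bits per sender or one bit per potential source system-wide.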
Scalable Failure Detection

Hierarchical heartbeats/propagation (with message aggregation):

[Figure: heartbeat bit-arrays, e.g. 0 0 1 1 and 1 0 1, merged as they propagate up the tree]
Scalable Failure Detection

Study the scalability and convergence implications of our scalable failure-detection protocol.
• In theory:
– Pure hierarchical: msgs = n^h × h
– Hierarchical with aggregation: msgs = (n^(h+1) − 1)/(n − 1) − 1
• Example, n = 8, h = 4 (4096 leaves):
– Pure hierarchical: 16,384 msgs
– With aggregation: 4,680 msgs
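The two example counts follow directly from the formulas; a quick check for n = 8, h = 4:

```python
def pure_hierarchical(n, h):
    # Every one of the n**h leaf heartbeats is forwarded individually
    # across all h levels of the tree.
    return n ** h * h

def with_aggregation(n, h):
    # One aggregated message per tree edge: every node except the root
    # sends once, i.e. the geometric sum (n**(h+1) - 1)/(n - 1) minus 1.
    return (n ** (h + 1) - 1) // (n - 1) - 1

print(pure_hierarchical(8, 4))   # 16384
print(with_aggregation(8, 4))    # 4680
```
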
Scalable Event Propagation

Implement a generic event-propagation service
• Encode events into 1-byte codes
• Combine with the aggregation protocol for low-overhead control messages
• Piggyback control messages on data messages
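The piggybacking idea can be sketched as prepending a small event section to each outgoing data packet; the layout and event codes below are illustrative, not MRNet's actual wire format:

```python
# Piggyback 1-byte event codes on an outgoing data packet.
# Illustrative layout (not MRNet's actual wire format):
#   byte 0      : event count k
#   bytes 1..k  : event codes
#   bytes k+1.. : data payload
NODE_FAILED, NODE_JOINED = 0x01, 0x02

def pack(payload, events):
    return bytes([len(events)]) + bytes(events) + payload

def unpack(packet):
    count = packet[0]
    return list(packet[1:1 + count]), packet[1 + count:]

pkt = pack(b"perf-data", [NODE_FAILED, NODE_JOINED])
events, payload = unpack(pkt)
print(events, payload)   # [1, 2] b'perf-data'
```

Because control events ride on messages that are being sent anyway, event propagation adds a few bytes per packet rather than extra messages.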
Summary

MRNet provides tools and grid services with scalable communications and data analyses.

We are studying techniques to provide high degrees of reliability at large scales.

MRNet website:
• http://www.paradyn.org/mrnet