scalable fault-tolerant interconnection networks for large ... · scalable fault-tolerant...

Barcelona Supercomputing Center, Spain

Workshop on Design for Reliability

Scalable Fault-tolerant Interconnection Networks for Large-scale Computing

Systems

Jose Carlos Sancho

Ana Jokanovic and Jesus Labarta

Outline

HPC network interconnects trends

Future network failure expectations

Handling errors in today’s HPC networks

Proposed scalable fault-tolerant approaches

Permanent or hardware failures

Transient failures or bit errors

Conclusions

HPC interconnect technologies

Optical interconnects are preferred over copper

• Size: Lower cable size & weight, Smaller connector size

• Attenuation: Longer distance, more power-efficient transmission at higher BW*distance

• Isolation: Lower EMI, less crosstalk, more reliable

source node

Switch

destination node

Switch

Active Optical Cable - Opto/Electronic conversions (CMOS integration) - Bandwidth 40 Gbps (4 x 10 Gbps) - < 10-15 Bit Error Rate (temporal or soft failures) - 5 x 106 hours Mean time between permanent failures

HPC networks size

Top10 historical data Source: Peter M. Kogge and Timothy J. Dysart, “Using the TOP500 to Trace and Project Technology and Architecture Trends”, SC11

Exascale will need more sockets

HPC network bandwidth

HPC network bandwidth is increasing

Source: InfiniBand Roadmap, available from http://www.infinibandta.org

#Cores growing faster than BW!!

Power wall

The Energy issue is well understood Systems are constrained by

power, exascale < 20MW

Data movement is the key to save energy Up to 200 x more energy needed to transport a bit from a nearest

neighbor chip than to operate on it:

• Energy needed for a floating point operation (0.1-0.05 pJ/bit)

• Energy needed for (electronic) data-transport on card (2-10 pJ/bit)

Sources: DARPA/IPTO study, by Peter Kogge, et. al. available on http://www.nd.edu/~kogge/reports.html

Coal energy plant

Optical transceiver power

Power scales with data rate

Source: Samuel Palermo, “Energy Efficient CMOS Optical I/O”, CMOS Emerging Technologies Conference

BER < 10-10

Optical BER

Low power increases BER

Power (in dBm) = 10 log10

Power

1mW

Source: Yimin Kang, “Monolithic germanium/silicon avalanche photodiodes with 340 GHz gain–bandwidth product”, Nature Photonics 3, 59 - 63 (2009)

1 mW = 0 dBm 1 µW = -30 dBm

Power wall implications

Trading off power with BER @ high data rates

Low failures Low power

Handling errors in InfiniBand (1)

Hardware failures

S0 S1 S2 S3 S4 S5 S6 S7

S8 S9 S10 S11

S12 S13

A0

S14 S15

A1

Fault link: - Discard packet - Inform SMA

Packets

Subnet Manager (SMA) - Re-program routing tables at S8 - Inform A0

Handling errors in InfiniBand (2)

Bit errors

Switch receiving

packet

VC

RC

| C

RC

| d

ata

pay

load

| h

ead

er

Bit errors

NACK

EBPD | header

Checking CRC Discard faulty packet

Discard faulty packet

Compute source node

Switch receiving

packet

Compute destination

node

Pack

et r

etra

nsm

issi

on

Too much overhead - Notification - Re-transmission

Scalable fault-tolerant approaches

Longer distances between end nodes prevent to handle errors efficiently

Handle it locally !

Local hardware failure recovery

S0 S1 S2 S3 S4 S5 S6 S7

S8 S9 S10 S11

S12 S13

A0

S14 S15

A1

Fault link: -Re-route packet -Inform SMA

Local failure recovery

Subnet Manager (SMA) - Re-program routing tables at S8

Open issues - Deadlock? - Who to re-route packet?

Local bit error recovery

Switch receiving

packet

… |

dat

a |

ECC

| d

ata

| EC

C |

hea

der

Bit errors

NACK

- Use Error Correcting Codes (ECC) - Populate packets with ECCs

Compute source node

Switch receiving

packet

Compute destination

node

Packet free of errors

Open issues - Which ECC? - Communication protocol?

Local failure recovery

Conclusions

Bit errors and hardware failures may happen more often in future computing systems

The number of network components is growing

Power wall increases BER

Current networks are still relaying on end-node recovery which adds too much overhead

Locally handling errors is promising but still requires efficient techniques to be explored which may be different depending of the type of failure

Acknowledgements

Spanish Ministry of Science and Innovation

scalable fault-tolerant interconnection networks for large ... · scalable fault-tolerant...

Documents