The Race Towards Co-Design Architecture For Next Generation HPC Systems Gilad Shainer HPC Advisory Council Spain, September 2015


Page 1:

The Race Towards Co-Design Architecture For Next Generation HPC Systems

Gilad Shainer

HPC Advisory Council

Spain, September 2015

Page 2:

Global Perspective

Page 3:

The Race Towards Exascale

[Timeline figure, 2000–2020: Terascale → Petascale → Exascale. “Roadrunner” was the first petascale system. Key transitions along the way: SMP to clusters, single-core to multi-core. Axes: Performance vs. Technology Development.]

Page 4:

The Road to Exascale Computing

[Figure: Cluster → Multi-Core → Co-Design]

Page 5:

Co-Design Architecture – From Discrete to System Focused

[Figure: optimization shifting from the discrete (device) level to the system level]

Page 6:

Exascale will be Enabled via Co-Design Architecture

• Software – Hardware

• Hardware – Hardware (e.g. GPU-Direct)

• Software – Software (e.g. OpenUCX)

• Industry – Users – Academia

• Standard, Open Source, Eco-System

• Programmable, Configurable, Innovative

Page 7:

Software-Hardware Co-Design? Example: Breaking the Latency Wall

• Today: network devices already operate at ~100ns latency

• Challenge: how to enable the next order-of-magnitude performance improvement?

• Solution: Co-Design - mapping the communication frameworks onto all active devices

• Result: reduce HPC communication framework latency by an order of magnitude

Co-Design Architecture Paves the Road to Exascale Performance

[Figure: latency evolution]

10 years ago: Network ~10 microseconds; Communication Framework ~100s of microseconds

Today: Network ~0.1 microsecond; Communication Framework ~10s of microseconds

Future (Co-Design): Network ~0.1 microsecond; Communication Framework ~1 microsecond
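The slide's argument can be restated numerically. Below is a toy model using the approximate latency figures quoted in the deck; the numbers are illustrative orders of magnitude, not measurements:

```python
# Toy model of the latency budget from the slide; figures are the deck's
# approximate values in microseconds, not measurements.
budgets = {
    "10 years ago": {"network": 10.0, "framework": 100.0},
    "today":        {"network": 0.1,  "framework": 10.0},
    "future":       {"network": 0.1,  "framework": 1.0},
}

def total_latency(era):
    """End-to-end latency: network plus communication-framework overhead."""
    b = budgets[era]
    return b["network"] + b["framework"]

for era, b in budgets.items():
    print(f"{era:>13}: network ~{b['network']:g} us, "
          f"framework ~{b['framework']:g} us, total ~{total_latency(era):g} us")

# The network stopped being the bottleneck; the next ~10x must come from
# the communication framework, which is the co-design argument.
assert total_latency("today") / total_latency("future") > 5
```

The point of the model: shaving the network below ~0.1 microsecond buys almost nothing while the framework still costs tens of microseconds, so the order-of-magnitude gain has to come from the software side.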

Page 8:

The Future of Latency – Application Level

• Migrating complete operations from the software / CPU to the Network

• Hardware – Software co-design
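The benefit of migrating complete operations to the network can be sketched with a toy overlap model (the function name and the numbers below are illustrative assumptions, not from the deck): once the NIC drives the operation, CPU compute and communication overlap instead of serializing.

```python
# Toy overlap model: when an operation (e.g. a collective) is offloaded to
# the network hardware, the CPU can compute while communication progresses.
# Names and numbers are illustrative, not from the slides.

def wall_time(compute_us, comm_us, offloaded):
    if offloaded:
        # Communication progresses in the NIC/switch; CPU computes in parallel.
        return max(compute_us, comm_us)
    # Host-driven: the CPU must drive the communication, so the two serialize.
    return compute_us + comm_us

compute, comm = 50.0, 40.0
print("host-driven:", wall_time(compute, comm, offloaded=False), "us")
print("offloaded  :", wall_time(compute, comm, offloaded=True), "us")
```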

Page 9:

• The road to Exascale requires order-of-magnitude performance improvements

• Co-Design architecture enables all active devices to become co-processors

The Road to Exascale – Co-Design System Architecture

Mapping Communication Frameworks on All Active Devices

Page 10:

The Elements of the Co-Design Architecture

[Figure: the co-design stack]

Applications (Innovations, Scalability, Performance)

Communication Frameworks (MPI, SHMEM/PGAS)

Offloading Technologies: In-Network Computing (RDMA, GPUDirect, Direct Communications)

Flexibility: Programmability, Virtualization, Heterogeneous System, Software-Defined X, Backward and Future Compatibility

Co-Design Implementation Via Offloading Technologies
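In-network computing can be illustrated with a toy reduction: switches combine partial results level by level, so the root receives one already-reduced message instead of N contributions. This is a schematic sketch of the idea, not the actual switch-offload protocol:

```python
# Toy illustration of in-network computing: a reduction combined inside a
# switch tree delivers one already-reduced message to the root, instead of
# funneling all N contributions to a single host. Schematic only.

def host_based_reduce(values):
    # The root receives and adds all N contributions itself.
    return sum(values), len(values)          # (result, sequential steps)

def in_network_reduce(values, radix=2):
    # Each tree level combines groups of `radix` partial sums;
    # the number of levels is ceil(log_radix(N)).
    n, levels = len(values), 0
    while n > 1:
        n = -(-n // radix)                   # ceiling division
        levels += 1
    return sum(values), levels               # (result, tree depth)

vals = list(range(1024))
assert host_based_reduce(vals)[0] == in_network_reduce(vals)[0]
print("host-based steps:", host_based_reduce(vals)[1])   # 1024
print("in-network steps:", in_network_reduce(vals)[1])   # 10
```

The same result arrives in tree-depth many steps rather than N, which is why every active device in the fabric is worth treating as a co-processor.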

Page 11:

Exascale Co-Design Collaboration

Collaborative Effort

Industry, National Laboratories and Academia

The Next Generation HPC Software Framework

Page 12:

What We Don’t Want to Achieve

Page 13:

Different Model - Co-Design Effort

• Co-design effort between national laboratories, academia, and industry

Applications: LAMMPS, NWCHEM, etc.

Programming models: MPI, PGAS/GASNet, etc.

Middleware: UCX

Driver and Hardware

(Co-design spans all layers of this stack.)

Page 14:

UCX Framework Mission

• Collaboration between industry, laboratories, and academia

• Create an open-source, production-grade communication framework for HPC applications

• Enable the highest performance through co-design of software-hardware interfaces

• Unify industry, national laboratory, and academia efforts

Performance oriented: optimization for low software overheads in the communication path allows near native-level performance

Community driven: collaboration between industry, laboratories, and academia

Production quality: developed, maintained, tested, and used by industry and the research community

API: exposes broad semantics that target data-centric and HPC programming models and applications

Research: the framework concepts and ideas are driven by research in academia, laboratories, and industry

Cross platform: support for InfiniBand, Cray, various shared memory (x86-64 and Power), GPUs

Co-design of Exascale Network APIs

Page 15:

The UCX Framework

UC-S for Services: provides basic infrastructure for component-based programming, memory management, and useful system utilities. Functionality: platform abstractions and data structures.

UC-T for Transport: low-level API that exposes basic network operations supported by the underlying hardware. Functionality: work request setup and instantiation of operations.

UC-P for Protocols: high-level API that uses the UCT framework to construct protocols commonly found in applications. Functionality: multi-rail, device selection, pending queue, rendezvous, tag matching, software atomics, etc.
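The UC-T/UC-P split can be sketched with a toy model: a minimal transport exposes raw operations, and a protocol layer implements policy, such as choosing eager vs. rendezvous delivery by message size. The class names and the 8 KiB threshold below are illustrative assumptions; this is not the real UCX API:

```python
# Toy sketch of the UC-T / UC-P layering. Names and the threshold are
# illustrative; real UCX tunes the rendezvous cutoff per transport.

RNDV_THRESHOLD = 8192  # bytes (assumed cutoff for this sketch)

class ToyTransport:                      # UC-T-like: raw operations
    def __init__(self):
        self.log = []
    def send_eager(self, data):
        self.log.append(("eager", len(data)))
    def send_rts(self, size):            # rendezvous: ready-to-send handshake
        self.log.append(("rts", size))
    def rdma_get(self, size):            # receiver pulls the payload directly
        self.log.append(("rdma_get", size))

class ToyProtocol:                       # UC-P-like: policy built on UC-T
    def __init__(self, transport):
        self.t = transport
    def tag_send(self, data):
        if len(data) < RNDV_THRESHOLD:
            self.t.send_eager(data)      # small: copy into the message
        else:
            self.t.send_rts(len(data))   # large: handshake, then zero-copy
            self.t.rdma_get(len(data))

t = ToyTransport()
p = ToyProtocol(t)
p.tag_send(b"x" * 100)
p.tag_send(b"x" * 100_000)
print(t.log)
```

The design point this mirrors: keeping device-specific operations in the transport layer lets the protocol layer stay portable across InfiniBand, Cray, shared memory, and GPUs.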

Page 16:

UCX High-level Overview

Page 17:

Collaboration

• Mellanox co-designs network interface and contributes MXM technology

– Infrastructure, transport, shared memory, protocols, integration with OpenMPI/SHMEM, MPICH

• ORNL co-designs network interface and contributes UCCS project

– InfiniBand optimizations, Cray devices, shared memory

• NVIDIA co-designs high-quality support for GPU devices

– GPU-Direct, GDR copy, etc.

• IBM co-designs network interface and contributes ideas and concepts from PAMI

• UH/UTK focus on integration with their research platforms

Page 18:

UCX Information

Page 20:

GPUDirect RDMA and Sync (3.0 and 4.0)

• Hardware – Hardware co-design
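This hardware-hardware co-design can be illustrated with a toy copy-count model: without GPUDirect RDMA, GPU data is staged through a host bounce buffer before the NIC sends it; with it, the NIC accesses GPU memory directly. A schematic sketch, not the actual driver path:

```python
# Toy copy-count model of the data path GPUDirect RDMA removes.
# Path elements and copy counts are schematic, not a driver trace.

def copies_on_send(gpudirect_rdma):
    path = ["GPU memory"]
    if not gpudirect_rdma:
        path.append("host bounce buffer")    # extra copy through host RAM
    path.append("NIC")
    return path, len(path) - 1               # (data path, copies on the way)

print(copies_on_send(False))   # staged path: 2 copies
print(copies_on_send(True))    # direct path: 1 copy
```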

Page 21:

HPC|Music Project

HPC|Music is an advanced research project on High Performance Computing and music production, dedicated to enabling HPC in music creation. Its goal is to develop HPC cluster and cloud solutions that further enable the future of music production and reproduction.

Page 22:

Thank You!

Web: www.hpcadvisorycouncil.com

Email: [email protected]

Facebook: http://www.facebook.com/HPCAdvisoryCouncil

Twitter: www.twitter.com/hpccouncil

YouTube: www.youtube.com/user/hpcadvisorycouncil