Accelerated Technical- and High Performance Computing
Klaus Gottschalk - [email protected] Architect IBM Cognitive Systems
IBM Cognitive Systems
High Level Strategy Drivers and Directions
• Data Volumes are Exploding, Especially Unstructured Data
  • Data Needs to be Collected, Managed, and ‘Digested’
• Deriving Insight and Information from the Data requires:
  • Data access and availability
  • A variety of processing steps in a ‘Workflow’, and processing optimizations
• Need for Compute continues to Grow
  • Per IDC, technical computing growth @ 11.9% (vs. 4.9%) in 2015, supporting both High Performance Computing and High Performance Data Analytics
  • Moore's Law and Frequency stabilization require more threads, cores, & nodes
  • Accelerated Computing emerges: GPUs, FPGAs, CAPI attached Flash & I/O
• Energy Efficiency continues to rise in value and requires:
  • Processing Elements that are Optimized to the task
  • Energy and Data aware Workflow Management
• The OpenPOWER Foundation provides innovation opportunities to a variety of Partners
[Figure: Price/Performance, 2000 to 2020. Full system stack innovation required. Layers: Technology and Processors; Firmware / OS, Accelerators, Software, Storage, Network; Workflow, Dependency Graph]
OpenPOWER, a catalyst for Open Innovation
The OpenPOWER Foundation creates an open ecosystem, using the POWER Architecture to share expertise, investment, and
server-class intellectual property to serve the evolving needs of customers.
Performance of leading POWER architecture: Broadens the capability and performance of the POWER platform
Open Development: OpenPOWER enables greater innovation through both open software and open hardware
Collaboration across multiple thought leaders: Collaborative development model drives collective thought leadership, simultaneously across multiple disciplines
System / Integration
I/O / Storage / Acceleration
Boards / Systems
Chip / SOC
This is What A Revolution Looks Like © 2017 OpenPOWER Foundation
Software
Implementation / HPC / Research
300+ Members
31 Countries
40+ ISVs
Portfolio of HPC Solutions
HPC Software
• Deployment tools, integrated management
• Compilers: gcc, IBM XLC, LLVM OpenMP 4, PGI Fortran/C/C++, Java, OpenACC, OpenMP
• Debuggers, Profilers, Math libraries, MPI & HPC apps
Processors & Systems
• High Performance Processors & Systems
• Accelerator, networking, storage integration via NVLink & CAPI
• Highest memory throughput
High Performance File System & Storage
• Highest Performance HPC Storage: Elastic Storage Server
• High Performance Spectrum Scale (GPFS) Parallel File System
• Data centric design
High Speed Interconnect
• High speed interconnect/network fabric from Mellanox Technologies
• MPI acceleration in the IB fabric, reducing CPU overhead
• Support for GPUDirect, NVMe over fabric
OpenPOWER: Open Architecture for HPC & Analytics
Processor IP Licensing
§ Licensing processor core to enable semiconductor partners like Suzhou PowerCore to build POWER chips
Open Interfaces
§ Tight integration using CAPI & NVLink with Accelerators (NVIDIA, Xilinx), Networking (Mellanox), Storage (CAPI Flash)
Systems & Software
§ Enabling System Partners to build POWER-based servers and Open Sourcing Software including Firmware & Hypervisor
Collaborative Innovation between IBM and NVIDIA: POWER8 with NVLink
Built for Developer Goals
• Think less about architecture in code
• Break apart my problem less
• Spend less time optimizing
• Write simpler code
Casting NVLink into Silicon
• IBM: transistors and I/O to NVLink on CPU
• NVIDIA: deep interface into GPU (NVLink)
• 2+ years in the making
• 2.5X the bandwidth from CPU:GPU, built into the chip
Don’t overthink your hardware
Don’t waste time writing for data movement
Easily unleash the parallelism of your GPU
[Figure labels: with NVLink™; Embedded NVLink™]
NVLink And Unified Memory/Page Migration Engine
Tesla P100 architecture simplifies programming, sharing memory between CPU & GPU
• Unified memory: allows programs to access full memory addresses of all CPU and GPUs
• Page Migration Engine: GPU memory faults seamlessly migrate to CPU memory
NVLink
• POWER8 with NVLink ensures fast data access for pages and data movement
• Fat and Flat: Memory migration on POWER systems moves at the same bandwidth CPU-to-GPU or GPU-to-GPU
Programming consequences
• Far simpler programming and memory model
  § Eliminates the programming details of allocating and copying device memory
• Larger data sizes permissible
  § Applications can now use data sets that are larger than the memory size of the GPU
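The last point, data sets larger than GPU memory, can be illustrated with a toy model of demand paging. This is only a conceptual sketch in Python, not how the Page Migration Engine is implemented: the class `ToyGpuMemory`, the helper `sweep`, the page counts, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyGpuMemory:
    """Toy model: a GPU that can hold at most `capacity` pages.

    Assumption: least-recently-used eviction, chosen only for illustration.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # page id -> None, ordered by recency
        self.faults = 0

    def touch(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)    # mark as most recently used
            return
        self.faults += 1                       # fault: page is migrated in
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used page
        self.resident[page] = None


def sweep(gpu, num_pages, passes):
    """Touch pages 0..num_pages-1 in order, `passes` times; return fault count."""
    for _ in range(passes):
        for page in range(num_pages):
            gpu.touch(page)
    return gpu.faults


fits = sweep(ToyGpuMemory(capacity=4), num_pages=4, passes=3)
oversubscribed = sweep(ToyGpuMemory(capacity=4), num_pages=6, passes=3)
print(fits, oversubscribed)
```

In this model the oversubscribed sweep still completes, but every access faults and triggers a migration, which is one way to see why the bandwidth of the CPU-GPU link matters for such workloads.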
Why it Matters: Use Cases where NVLink will have the most Impact
Mask Bus Transfers from Host-Device
Constant Data Transfers between adjacent GPUs
Burst Data at Startup and Teardown
Stream Data at Same Rate as Computation
Genomics, Cryptography, Video Processing, etc.
CFD/CAE, Machine Learning, Deep Learning, etc.
Molecular Dynamics, Amber, etc.
Accelerated Databases, Analytics, etc.
HPC Pre-Sales Centers and Technical Support
• PADC centers with IBM, NVIDIA and Mellanox focused on accelerated applications and technical collaborations
• IBM Systems Client Centers
  § HPC Briefings
  § HPC Workshops
  § HPC Benchmarks
UK Science and Technology Facilities Council (STFC) PADC
§ IBM PADC Montpellier joint center with NVIDIA and Mellanox
§ IBM PADC Boeblingen joint center with NVIDIA
IBM Poughkeepsie POWER HPC Benchmark Center
For latest HPC information refer to the IBM Systems Client Centers HPC page
IBM Austin POWER HPC Executive Briefing Center
NEW! NVIDIA/IBM Acceleration Lab
POWER9 Chip
New Core Microarchitecture
§ Stronger thread performance
§ Efficient agile pipeline
§ POWER ISA v3.0
Enhanced Cache Hierarchy
§ 120MB NUCA L3 architecture
§ 12 x 20-way associative regions
§ Advanced replacement policies
§ Fed by 7 TB/s on-chip bandwidth
Cloud + Virtualization Innovation
§ Quality of service assists
§ New interrupt architecture
§ Workload optimized frequency
§ Hardware enforced trusted execution
14nm finFET
§ Improved device performance and reduced energy
§ 17 layer metal stack and eDRAM
§ 8.0 billion transistors
Leadership Acceleration Platform
§ Enhanced on-chip acceleration
§ NVIDIA NVLink 2.0: High bandwidth, advanced features
§ CAPI 2.0: Coherent accelerator and storage attach (PCIe G4)
§ OpenCAPI: Improved latency and bandwidth, open interface
State of the Art I/O Subsystem
§ PCIe Gen4 – 48 lanes
High Bandwidth Signaling
§ 16 Gb/s interface: Local SMP
§ 25 Gb/s interface: 25G Link for Accelerator and remote SMP
Witherspoon (4-6 GPU) Server
Anticipated 10X performance improvement over 2015 solution
• combined GPU and CPU advances
Witherspoon is the platform that will deliver the commitments made in the CORAL contract
• 2 POWER9, 4 GPU for LLNL, water cooled
• 2 POWER9, 6 GPU for ORNL, water cooled
4 GPU
CORAL
§ 2 versions of code for each application:
  • Baseline: (lab codes) minimal changes + offloading directives (e.g. OpenACC)
  • Optimized: can create codes from scratch, using any language we choose
§ Tools to implement/modify the lab codes:
  • Languages: MPI, OpenMP, OpenACC, CUDA, Fortran, etc.
  • Architectures: Power Processors, GPUs, InfiniBand, NVLink, etc.
§ OpenACC directives to off-load work to the GPU
CORAL SYSTEM ARCHITECTURE
Compute Rack: 18 Servers/rack, 779 TFlop/rack, 10.8 TB/rack, 55 kWatts max
System: 200 Pflops compute + 5 PB Active Flash + 120 PB Disk
Scalable Active Network: Mellanox IB 4X EDR Switch
Converged 2U server drawer for HPC and Cloud
ESS Rack

- Scalable system software and data architecture
- LLVM Open Source compiler
- Water cooling
- Integrated Local Active Storage

256 Compute Racks
40 Disk Racks
16 Optional Flash Racks
TMS drawers or Flash cards. CAPI attached. Globally accessible with local processing.
POWER9: 22 Cores, 4 Threads/core, 0.65 DP TF/s, 3.7 GHz
SXM2 Volta: 7.0 DP TF/[email protected]/s
POWER9 2 Socket Server: 2 P9 + 4/6 Volta GPU (@7 TF/s)
512 GiB SMP Memory (32 GB DDR4 RDIMMs), 64/96 GiB GPU Memory (HBM stacks)
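The per-rack and system peak figures can be cross-checked from the component numbers on this slide. The following is a minimal sketch in Python; the function names are made up for illustration, and it assumes that the rack and system totals refer to the 6-GPU server and that all 256 compute racks are identical.

```python
# Figures quoted on the slide.
P9_DP_TFLOPS = 0.65       # peak double-precision TF/s per POWER9 socket
VOLTA_DP_TFLOPS = 7.0     # peak double-precision TF/s per Volta GPU
SOCKETS_PER_SERVER = 2
SERVERS_PER_RACK = 18
COMPUTE_RACKS = 256


def server_peak_tflops(gpus_per_server):
    """Peak DP TF/s of one 2-socket server with the given GPU count."""
    return SOCKETS_PER_SERVER * P9_DP_TFLOPS + gpus_per_server * VOLTA_DP_TFLOPS


def rack_peak_tflops(gpus_per_server):
    """Peak DP TF/s of one compute rack."""
    return SERVERS_PER_RACK * server_peak_tflops(gpus_per_server)


def system_peak_pflops(gpus_per_server):
    """Peak DP PF/s of all compute racks (assumes identical racks)."""
    return COMPUTE_RACKS * rack_peak_tflops(gpus_per_server) / 1000.0


# Assumption: the slide's rack and system totals refer to the 6-GPU server.
rack_tf = rack_peak_tflops(6)
system_pf = system_peak_pflops(6)
print(f"rack: {rack_tf:.1f} TF/s, system: {system_pf:.1f} PF/s")
```

Under these assumptions a 6-GPU server peaks at about 43.3 TF/s, a rack at about 779 TF/s, and 256 racks at just under 200 PF/s, which is consistent with the 779 TFlop/rack and 200 Pflops quoted above.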
DOE Project CORAL Status
Oak Ridge (IBM) on-time*
Livermore (IBM) on-time*
Argonne (Intel/Cray) delayed+
*) https://www.hpcwire.com/2017/10/03/olcfs-200-petaflops-summit-machine-still-slated-2018-start/
+) https://www.nextplatform.com/2017/05/23/surprises-2018-doe-budget-supercomputing/
OpenCAPI Consortium Founded
SILICON VALLEY, CA - 14 Oct 2016: OpenCAPI Consortium formed by AMD, Dell EMC, Google, HP, IBM, Mellanox, Micron, NVIDIA and Xilinx. Servers and related products based on the new standard are expected in the second half of 2017.
OpenCAPI Approach
• What is OpenCAPI?
  • OpenCAPI is an Open Interface Architecture that allows any microprocessor to attach to
    • Coherent user-level accelerators and I/O devices
    • Advanced memories accessible via read/write or user-level DMA semantics
  • Agnostic to processor architecture
• Key Attributes of OpenCAPI
  • High-bandwidth, low latency interface optimized to enable streamlined implementation of attached devices
  • Attached devices operate natively within an application’s user space and coherently with processors
  • Supports a wide range of use cases and access semantics
• 100% Open Consortium
  • All company participants welcome
  • All ISA participants welcome
Looking Ahead: POWER9 Accelerator Interfaces
Extreme Accelerator Bandwidth and Reduced Latency
• PCIe Gen 4 x 48 lanes – 192 GB/s peak bandwidth (duplex)
• IBM BlueLink 25Gb/s x 48 lanes – 300 GB/s peak bandwidth (duplex)
Coherent Memory and Virtual Addressing Capability for all Accelerators
• CAPI 2.0 - 4x bandwidth of POWER8 using PCIe Gen 4
• NVLink 2.0 – Next generation of GPU/CPU bandwidth and integration using BlueLink
• OpenCAPI – High bandwidth, low latency and open interface using BlueLink
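The two peak bandwidth figures follow from lane count and per-lane signaling rate. A minimal sketch in Python, assuming the nominal 16 Gb/s per PCIe Gen 4 lane, counting both directions for the duplex number, and ignoring line-encoding and protocol overhead; the function name is made up for illustration.

```python
def duplex_gbytes_per_s(lanes, gbits_per_lane):
    """Peak duplex bandwidth in GB/s: lanes x rate, two directions, 8 bits per byte.

    Assumption: raw signaling rate only, no encoding or protocol overhead.
    """
    return lanes * gbits_per_lane * 2 / 8


pcie_gen4 = duplex_gbytes_per_s(lanes=48, gbits_per_lane=16)
bluelink = duplex_gbytes_per_s(lanes=48, gbits_per_lane=25)
print(pcie_gen4, bluelink, bluelink / pcie_gen4)
```

With the same 48 lanes, the higher 25 Gb/s signaling rate accounts for the full difference between the two figures (a factor of 25/16).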
CAPI Accelerator Cards
Nallatech team explaining CAPI Flash card: https://www.youtube.com/watch?v=1n_ceKkCRuk
Thank you!
ibm.com/systems/hpc