supercomputing at 1/10 the costcomp.chem.nottingham.ac.uk/enca/sc_tenth_cost.pdf · 4 t20 gpus 4...

SUPERCOMPUTING AT 1/10TH THE COST

Timothy Lanfear, NVIDIA

© NVIDIA Corporation 2010

WHY GPU COMPUTING?


Science is Desperate for Throughput

[Chart: molecular simulation size against computing performance in Gigaflops, log scale from 1 to 1,000,000,000, with 1 Petaflop and 1 Exaflop marked. Years shown: 1982, 1997, 2003, 2006, 2010, 2012.]

Systems shown:
BPTI: 3K atoms
Estrogen Receptor: 36K atoms
F1-ATPase: 327K atoms
Ribosome: 2.7M atoms
Chromatophore: 50M atoms
Bacteria: 100s of Chromatophores

Ran for 8 months to simulate 2 nanoseconds
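To put the "8 months for 2 nanoseconds" figure in the usual molecular dynamics unit of nanoseconds per day, here is a minimal Python sketch. The 30-day month and the function names are assumptions made only for this estimate; the slide gives no more detail than the two numbers.

```python
# Figures from the slide: 8 months of wall-clock time for 2 ns of simulated time.
# DAYS_PER_MONTH is an assumption made only for this rough estimate.
DAYS_PER_MONTH = 30
WALL_MONTHS = 8
SIMULATED_NS = 2


def ns_per_day():
    """Simulated nanoseconds produced per wall-clock day."""
    return SIMULATED_NS / (WALL_MONTHS * DAYS_PER_MONTH)


def days_to_simulate(target_ns, speedup=1.0):
    """Wall-clock days needed for target_ns at a hypothetical speedup factor."""
    return target_ns / (ns_per_day() * speedup)


if __name__ == "__main__":
    print(f"Throughput: {ns_per_day():.4f} ns/day")
    print(f"2 ns at 10x speedup: {days_to_simulate(2, speedup=10):.0f} days")
```

The `speedup` parameter is hypothetical: it only shows how the wall-clock time scales if throughput improves by a given factor, and is not a measured GPU result.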


Power Crisis in Supercomputing

[Chart: power draw of leading supercomputers by performance class (Gigaflop, Teraflop, Petaflop, Exaflop). Years shown: 1982, 1996, 2008, 2020. Household Power Equivalent scale: Block, Neighborhood, Town, City.]

Power levels shown: 60,000 Watts; 850,000 Watts; 7,000,000 Watts; 25,000,000 Watts

Machines labeled: Jaguar, Los Alamos


Top 5 Machines: Performance and Power

[Chart: Gigaflops (axis 0 to 3000) and Megawatts (axis 0 to 8) for Tianhe-1A, Jaguar, Nebulae, Tsubame 2.0, and Hopper II.]


NVIDIA AND GPU COMPUTING


GeForce

Tegra

Quadro

Tesla


Mainstream Applications Going Parallel

CUDA Accelerates Adobe Mercury Playback Engine

Amazingly fluid, real-time video editing

Quick preview of real-time edits and effects

Realistic preview of final content

Faster encoding


Dawning Nebulae

Second Fastest Supercomputer in the World

1.27 Petaflop

4640 Tesla GPUs



The World’s Fastest Supercomputer

Tianhe-1A

2.507 Petaflop

7168 Tesla M2050 GPUs

National Supercomputing Center in Tianjin


NVIDIA® TESLA™ GPUS FOR HPC


NVIDIA Tesla 20-Series Products

Data Center

Workstation


NVIDIA Tesla GPU Computing Products

Form factor         Product                    GPUs        Single Precision  Double Precision  Memory              Mem BW
Server Module       Tesla M2070 / Tesla M2050  1 T20 GPU   1030 GFlops       515 GFlops        6 GB / 3 GB         148.4 GB/s
Server Module       Tesla M1060                1 T10 GPU   933 GFlops        78 GFlops         4 GB                102 GB/s
1U Systems          Tesla S2050                4 T20 GPUs  4120 GFlops       2060 GFlops       12 GB (S2050)       148.4 GB/s
1U Systems          Tesla S1070                4 T10 GPUs  4140 GFlops       346 GFlops        16 GB (4 GB / GPU)  102 GB/s
Workstation Boards  Tesla C2070 / Tesla C2050  1 T20 GPU   1030 GFlops       515 GFlops        6 GB / 3 GB         144 GB/s
Workstation Boards  Tesla C1060                1 T10 GPU   933 GFlops        78 GFlops         4 GB                102 GB/s
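The system-level figures in the product table above can be cross-checked against the per-GPU figures. The following is a minimal Python sketch; the `PRODUCTS` dictionary and both helper functions are illustrative names introduced here, and the numbers are copied from the table, not measured.

```python
# Peak throughput in GFlops, copied from the product table above.
# The dictionary layout and function names are illustrative only.
PRODUCTS = {
    "Tesla M2070 / M2050": {"gpus": 1, "sp": 1030, "dp": 515},
    "Tesla M1060": {"gpus": 1, "sp": 933, "dp": 78},
    "Tesla S2050": {"gpus": 4, "sp": 4120, "dp": 2060},
    "Tesla S1070": {"gpus": 4, "sp": 4140, "dp": 346},
    "Tesla C2070 / C2050": {"gpus": 1, "sp": 1030, "dp": 515},
    "Tesla C1060": {"gpus": 1, "sp": 933, "dp": 78},
}


def dp_to_sp_ratio(name):
    """Double-precision peak as a fraction of single-precision peak."""
    p = PRODUCTS[name]
    return p["dp"] / p["sp"]


def per_gpu(name, key):
    """Peak throughput per GPU for key 'sp' or 'dp'."""
    p = PRODUCTS[name]
    return p[key] / p["gpus"]


if __name__ == "__main__":
    for name in PRODUCTS:
        print(f"{name:22s} DP/SP = {dp_to_sp_ratio(name):.3f}  "
              f"DP per GPU = {per_gpu(name, 'dp'):.1f} GFlops")
```

On these figures, the T20-based products list double precision at half the single-precision rate, while the T10-based products list it at under a tenth. The S2050 numbers are exactly four times the single-GPU T20 numbers.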


Tesla: Built for Professional Computing

Features

Feature: 4x higher double precision (on 20-series)
Benefit: Higher performance for scientific CUDA applications

Feature: ECC only on Tesla & Quadro (on 20-series)
Benefit: Data reliability inside the GPU and on DRAM memories

Feature: Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only 1 DMA engine)
Benefit: Higher performance for CUDA applications (by overlapping communication & computation)

Feature: Larger memory for larger data sets – 3GB and 6GB products
Benefit: Higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE)

Feature: Cluster management software tools available on Tesla only
Benefit: Needed for GPU monitoring and job scheduling in data center deployments

Feature: TCC (Tesla Compute Cluster) driver supported for Windows OS only on Tesla
Benefit: Higher performance for CUDA applications due to lower kernel launch overhead. TCC adds support for RDP and Services

Feature: Integrated OEM workstations and servers
Benefit: Trusted, reliable systems built for Tesla products

Feature: Professional ISVs will certify CUDA applications only on Tesla
Benefit: Bug reproduction, support, feature requests for Tesla only

Quality & Warranty

Feature: 2 to 4 day stress testing & memory burn-in for reliability. Added margin in memory and core clocks for added reliability
Benefit: Built for 24/7 computing in data center and workstation environments

Feature: Manufactured & guaranteed by NVIDIA
Benefit: No changes in key components like GPU and memory without notice. Always the same clocks for known, reliable performance

Feature: 3-year warranty from NVIDIA
Benefit: Reliable, long life products

Support & Lifecycle

Feature: Enterprise support, higher priority for CUDA bugs and requests
Benefit: Ability to influence CUDA and GPU roadmap. Get early access to feature requests

Feature: 18-24 months availability + 6-month EOL notice
Benefit: Reliable product supply


GPU COMPUTING ECOSYSTEM


NVIDIA Developer Eco-System

Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView

Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA

GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python

Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP

Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib

GPGPU Consultants & Training: ANEO, GPU Tech

OEM Solution Providers


Doing GP-GPU Right: Hardware + Software

GPU Computing Applications

Libraries and Middleware: cuFFT, cuBLAS, CULA LAPACK, NPP & cuDPP, Video, PhysX (Physics), OptiX (Ray Tracing), mental ray and iray (Rendering), RealityServer (3D Web Services)

Languages: C, C++, OpenCL™, DirectCompute, Fortran, Java and Python

NVIDIA GPU
CUDA Parallel Computing Architecture

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.


CUDA C/C++ Continuous Innovation

Timeline, 2007 to 2010: July 07, Nov 07, April 08, Aug 08, July 09, Nov 09, Mar 10

CUDA Toolkit 1.1
• Win XP 64
• Atomics support
• Multi-GPU support

• Double Precision
• Compiler Optimizations
• Vista 32/64
• Mac OSX
• 3D Textures
• HW Interpolation

• C Compiler
• C Extensions
• Single Precision
• BLAS
• FFT
• SDK: 40 examples

CUDA Visual Profiler 2.2

cuda-gdb HW Debugger

• DP FFT
• 16-32 Conversion intrinsics
• Performance enhancements

Parallel Nsight Beta

• C++ inheritance
• Fermi arch support
• Tools updates
• Driver / RT interop

Parallel Nsight (Visual Studio)

Visual Profiler for Linux

cuda-gdb for Linux


Commercial Debuggers for GPUs

GPU Debugging

Making it easy

Allinea DDT — CUDA Enabled

TotalView for CUDA


GPU Programming Books


[Map of GPU deployments at research institutions. Legend: Existing Deployment, Prospective Deployment]

FermiLab
NCSA: 384 GPUs
Daresbury Lab
PNNL
256 GPUs
Max Planck Institute
Argonne Lab
Harvard
Oxford
Jefferson Labs
Georgia Tech
TACC
Delaware
OSC
Maryland
Johns Hopkins
WestGrid
Stanford
UNC
NERSC
CEA
Aarhus
Norwegian Univ of S & T
Braunschweig
Copenhagen
Oak Ridge
Wisconsin
VaTech
Cambridge
Groningen
Utah
Berkeley
CSIRO: 256 GPUs
Chinese Academy of Sciences: 2000+ GPUs
National Taiwan Univ
IIT Delhi
Tokyo Tech: 680 GPUs
Peking University
Univ of Science & Tech
Tsinghua University
NCHC
Curtin University
Swinburne University
Riken: 220 GPUs
Osaka
Nagasaki
KISTI
Anna Univ
IIT Madras
Indian Institute of Science
NIT Calicut
Dept of Space
LRDE
Indian Inst of Tropical Meteorology
Nizhegorodsky University
Kazan Univ
St. Petersburg University
Institute of Physics
SNU
Yonsei


“In testing our key applications, the Tesla GPUs delivered speed-ups that we had never seen before, sometimes even orders of magnitude.”
Satoshi Matsuoka, Professor, Tokyo Institute of Technology

“Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.”
Jack Dongarra, Professor, University of Tennessee, Author of Linpack

“I believe history will record Fermi as a significant milestone.”
Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley, Co-Author of Computer Architecture: A Quantitative Approach


GPU Technology Conference 2010

The most important event in the GPU ecosystem

Monday, September 20 – Thursday, September 23, 2010

San Jose Convention Center, San Jose, California


Supercomputing 2010 Conference

Keynote: Bill Dally

Booth talks: Takayuki Aoki, Tamrat Belayneh, Jack Dongarra, Robert Farber, Wu Feng, Wei Ge, Mark Govett, Satoshi Matsuoka, Patrick McCormick, Paul Navratil, Thomas Schulthess, John Stone, Jeffrey Vetter, Michael Wolf

Four HPCwire awards

Gordon Bell prize

Best student paper

NVIDIA GPUs all over the exhibition floor


GPU Revolutionizing Computing

[Chart: GFlops by GPU generation: T8 (128 core), T10 (240 core), Fermi (512 core), Kepler, Maxwell.]

A 2015 GPU *

~20× the performance of today’s GPU
~5,000 cores at ~3 GHz (50 mW each)
~20 TFLOPS
~1.2 TB/s of memory bandwidth

* This is a sketch of what a GPU in 2015 might look like; it does not reflect any actual product plans.
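The figures in the 2015 GPU sketch can be combined into a few derived quantities. The following back-of-envelope Python sketch uses only the approximate numbers on the slide; the constant and function names are illustrative, and the power figure counts only the 50 mW per core, not memory or any other part of a board.

```python
# Approximate figures from the "2015 GPU" sketch on the slide.
CORES = 5_000
CLOCK_HZ = 3e9
MILLIWATTS_PER_CORE = 50
PEAK_FLOPS = 20e12            # ~20 TFLOPS
MEM_BW_BYTES_PER_S = 1.2e12   # ~1.2 TB/s
SPEEDUP_VS_TODAY = 20         # "~20x the performance of today's GPU"


def core_power_watts():
    """Total power of the cores alone."""
    return CORES * MILLIWATTS_PER_CORE / 1000


def flops_per_core_per_cycle():
    """Implied floating-point operations per core per clock cycle."""
    return PEAK_FLOPS / (CORES * CLOCK_HZ)


def flops_per_byte():
    """Operations available per byte of memory traffic."""
    return PEAK_FLOPS / MEM_BW_BYTES_PER_S


def implied_today_gflops():
    """Peak of "today's GPU" implied by the ~20x claim."""
    return PEAK_FLOPS / SPEEDUP_VS_TODAY / 1e9


if __name__ == "__main__":
    print(f"Core power:           {core_power_watts():.0f} W")
    print(f"Flops per core-cycle: {flops_per_core_per_cycle():.2f}")
    print(f"Flops per byte:       {flops_per_byte():.1f}")
    print(f"Implied today's GPU:  {implied_today_gflops():.0f} GFlops")
```

The implied baseline of roughly 1000 GFlops is close to the 1030 GFlops single-precision figure quoted for a T20 GPU earlier in the deck, which suggests the sketch is stated in single-precision terms; the slide itself does not say so.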
