SUPERCOMPUTING AT 1/10TH THE COST
Timothy Lanfear, NVIDIA
© NVIDIA Corporation 2010
WHY GPU COMPUTING?
Science is Desperate for Throughput
[Chart: simulation throughput in gigaflops, log scale from 1 to 1,000,000,000 with 1 Petaflop and 1 Exaflop marked; time axis 1982, 1997, 2003, 2006, 2010, 2012]
• BPTI: 3K atoms
• Estrogen Receptor: 36K atoms
• F1-ATPase: 327K atoms
• Ribosome: 2.7M atoms
• Chromatophore: 50M atoms
• Bacteria: 100s of Chromatophores
Ran for 8 months to simulate 2 nanoseconds
Power Crisis in Supercomputing
[Chart: supercomputer performance class (Gigaflop, Teraflop, Petaflop, Exaflop) against power draw; time axis 1982, 1996, 2008, 2020]
Household Power Equivalent: Block, Neighborhood, Town, City
Power labels: 60,000 Watts; 850,000 Watts; 7,000,000 Watts; 25,000,000 Watts
Machine labels: Jaguar, Los Alamos
Top 5 Machines: Performance and Power
[Bar chart: Gigaflops (axis 0 to 3000) and Megawatts (axis 0 to 8) for Tianhe-1A, Jaguar, Nebulae, Tsubame 2.0, Hopper II]
NVIDIA AND GPU COMPUTING
GeForce
Tegra
Quadro
Tesla
Mainstream Applications Going Parallel
CUDA Accelerates Adobe Mercury Playback Engine
• Amazingly fluid, real-time video editing
• Quick preview of real-time edits and effects
• Realistic preview of final content
• Faster encoding
Dawning Nebulae
Second Fastest Supercomputer in the World
1.27 Petaflop
4640 Tesla GPUs
The World’s Fastest Supercomputer
Tianhe-1A
2.507 Petaflop
7168 Tesla M2050 GPUs
National Supercomputing Center in Tianjin
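As a rough cross-check of the Tianhe-1A figures, the short Python sketch below multiplies the GPU count from this slide by the Tesla M2050 double-precision peak of 515 GFlops given in the product table later in this deck. It is an illustration only: the variable names are hypothetical, the calculation ignores any CPU contribution, and the slide does not say what the 2.507 Petaflop figure measures.

```python
# Numbers copied from the slides; variable names are illustrative assumptions.
GPU_COUNT = 7168                 # Tesla M2050 GPUs in Tianhe-1A
M2050_DP_GFLOPS = 515            # double-precision peak per T20 GPU (product table)
QUOTED_SYSTEM_GFLOPS = 2_507_000 # 2.507 Petaflop expressed in GFlops

# Aggregate double-precision peak of the GPUs alone (CPUs ignored).
aggregate_gpu_peak_gflops = GPU_COUNT * M2050_DP_GFLOPS

# How the quoted system figure compares with that aggregate GPU peak.
quoted_over_gpu_peak = QUOTED_SYSTEM_GFLOPS / aggregate_gpu_peak_gflops

print(f"Aggregate GPU DP peak: {aggregate_gpu_peak_gflops / 1e6:.2f} Petaflop")
print(f"Quoted figure / GPU peak: {quoted_over_gpu_peak:.2f}")
```

Under these assumptions the GPUs alone add up to about 3.69 Petaflop of double-precision peak, so the quoted 2.507 Petaflop is roughly 68% of that number.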
NVIDIA® TESLA™ GPUS FOR HPC
NVIDIA Tesla 20-Series Products
Data Center
Workstation
NVIDIA Tesla GPU Computing Products

Form factor         Product                    GPUs        Single Precision  Double Precision  Memory              Mem BW
Server Module       Tesla M2070 / Tesla M2050  1 T20 GPU   1030 GFlops       515 GFlops        6 GB / 3 GB         148.4 GB/s
Server Module       Tesla M1060                1 T10 GPU   933 GFlops        78 GFlops         4 GB                102 GB/s
1U Systems          Tesla S2050                4 T20 GPUs  4120 GFlops       2060 GFlops       12 GB               148.4 GB/s
1U Systems          Tesla S1070                4 T10 GPUs  4140 GFlops       346 GFlops        16 GB (4 GB / GPU)  102 GB/s
Workstation Boards  Tesla C2070 / Tesla C2050  1 T20 GPU   1030 GFlops       515 GFlops        6 GB / 3 GB         144 GB/s
Workstation Boards  Tesla C1060                1 T10 GPU   933 GFlops        78 GFlops         4 GB                102 GB/s
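The ratios implied by the product figures above can be checked with a few lines of arithmetic. The Python sketch below is an illustration rather than part of the original slides: the dictionary `products` and the helpers `dp_to_sp_ratio` and `dp_per_gpu` are hypothetical names, and the numbers are copied from the product table.

```python
# Peak figures (GFlops) copied from the Tesla product table.
# All names here are illustrative; none belong to an NVIDIA API.
products = {
    "M2070/M2050 (1 T20)": {"gpus": 1, "sp": 1030, "dp": 515},
    "M1060 (1 T10)": {"gpus": 1, "sp": 933, "dp": 78},
    "S2050 (4 T20)": {"gpus": 4, "sp": 4120, "dp": 2060},
    "S1070 (4 T10)": {"gpus": 4, "sp": 4140, "dp": 346},
}


def dp_to_sp_ratio(spec):
    """Double-precision peak as a fraction of single-precision peak."""
    return spec["dp"] / spec["sp"]


def dp_per_gpu(spec):
    """Double-precision peak divided evenly across the GPUs in the product."""
    return spec["dp"] / spec["gpus"]


for name, spec in products.items():
    print(f"{name}: DP/SP = {dp_to_sp_ratio(spec):.3f}, "
          f"DP per GPU = {dp_per_gpu(spec):.1f} GFlops")

# Double-precision gain of one T20 GPU over one T10 GPU.
t20_over_t10_dp = products["M2070/M2050 (1 T20)"]["dp"] / products["M1060 (1 T10)"]["dp"]
print(f"T20 / T10 double precision: {t20_over_t10_dp:.1f}x")
```

On these figures a T20 board reaches half of its single-precision rate in double precision, against roughly a twelfth for a T10 board, and about 6.6 times the T10 double-precision peak.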
Tesla: Built for Professional Computing

Features
• 4x higher double precision (on 20-series). Benefit: higher performance for scientific CUDA applications.
• ECC only on Tesla & Quadro (on 20-series). Benefit: data reliability inside the GPU and on DRAM memories.
• Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only 1 DMA engine). Benefit: higher performance for CUDA applications (by overlapping communication & computation).
• Larger memory for larger data sets: 3 GB and 6 GB products. Benefit: higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE).
• Cluster management software tools available on Tesla only. Benefit: needed for GPU monitoring and job scheduling in data center deployments.
• TCC (Tesla Compute Cluster) driver supported for Windows OS only on Tesla. Benefit: higher performance for CUDA applications due to lower kernel launch overhead. TCC adds support for RDP and Services.
• Integrated OEM workstations and servers. Benefit: trusted, reliable systems built for Tesla products.
• Professional ISVs will certify CUDA applications only on Tesla. Benefit: bug reproduction, support, feature requests for Tesla only.

Quality & Warranty
• 2 to 4 day stress testing & memory burn-in for reliability. Added margin in memory and core clocks for added reliability. Benefit: built for 24/7 computing in data center and workstation environments.
• Manufactured & guaranteed by NVIDIA. Benefit: no changes in key components like GPU and memory without notice. Always the same clocks for known, reliable performance.
• 3-year warranty from NVIDIA. Benefit: reliable, long life products.

Support & Lifecycle
• Enterprise support, higher priority for CUDA bugs and requests. Benefit: ability to influence CUDA and GPU roadmap. Get early access to feature requests.
• 18-24 months availability + 6-month EOL notice. Benefit: reliable product supply.
GPU COMPUTING ECOSYSTEM
NVIDIA Developer Eco-System
• Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView
• Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA
• GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
• Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
• Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib
• GPGPU Consultants & Training: ANEO, GPU Tech
• OEM Solution Providers
Doing GP-GPU Right: Hardware + Software
[Layered diagram, top to bottom]
GPU Computing Applications
C, C++, OpenCL™, DirectCompute, Fortran, Java and Python
Libraries and Middleware: cuFFT, cuBLAS, CULA LAPACK, NPP & cuDPP, Video, PhysX Physics, OptiX Ray Tracing, mental ray and iray Rendering, RealityServer 3D Web Services
NVIDIA GPU: CUDA Parallel Computing Architecture
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
CUDA C/C++ Continuous Innovation
[Timeline 2007 to 2010; markers: July 07, Nov 07, April 08, Aug 08, July 09, Nov 09, Mar 10]
CUDA Toolkit 1.0
• C Compiler
• C Extensions
• Single Precision
• BLAS
• FFT
• SDK: 40 examples
CUDA Toolkit 1.1
• Win XP 64
• Atomics support
• Multi-GPU support
CUDA Toolkit 2.0
• Double Precision
• Compiler Optimizations
• Vista 32/64
• Mac OSX
• 3D Textures
• HW Interpolation
CUDA Visual Profiler 2.2
cuda-gdb HW Debugger
CUDA Toolkit 2.3
• DP FFT
• 16-32 Conversion intrinsics
• Performance enhancements
Parallel Nsight Beta
CUDA Toolkit 3.0
• C++ inheritance
• Fermi arch support
• Tools updates
• Driver / RT interop
Parallel Nsight for Visual Studio
Visual Profiler for Linux
cuda-gdb for Linux
Commercial Debuggers for GPUs
• Allinea DDT, CUDA Enabled: GPU Debugging, Making it easy
• TotalView for CUDA
GPU Programming Books
[World map of GPU deployments. Legend: Existing Deployment, Prospective Deployment. Labels on the map follow.]
FermiLab; NCSA (384 GPUs); Daresbury Lab; PNNL; 256 GPUs; Max Planck Institute; Argonne Lab; Harvard; Oxford; Jefferson Labs; Georgia Tech; TACC; Delaware; OSC; Maryland; Johns Hopkins; WestGrid; Stanford; UNC; NERSC; CEA; Aarhus; Norwegian Univ of S & T; Braunschweig; Copenhagen; Oak Ridge; Wisconsin; VaTech; Cambridge; Groningen; Utah; Berkeley
CSIRO (256 GPUs); Chinese Academy of Sciences (2000+ GPUs); National Taiwan Univ; IIT Delhi; Tokyo Tech (680 GPUs); Peking University; Univ of Science & Tech; Tsinghua University; NCHC; Curtin University; Swinburne University; Riken (220 GPUs); Osaka; Nagasaki; KISTI; SNU; Yonsei
Anna Univ; IIT Madras; Indian Institute of Science; NIT Calicut; Dept of Space; LRDE; Indian Inst of Tropical Meteorology; Nizhegorodsky University; Kazan Univ; St. Petersburg University; Institute of Physics
“In testing our key applications, the Tesla GPUs delivered speed-ups that we had never seen before, sometimes even orders of magnitude.”
Satoshi Matsuoka, Professor, Tokyo Institute of Technology

“Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.”
Jack Dongarra, Professor, University of Tennessee, Author of Linpack

“I believe history will record Fermi as a significant milestone.”
Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley, Co-Author of Computer Architecture: A Quantitative Approach
GPU Technology Conference 2010
The most important event in the GPU ecosystem
Monday, September 20 to Thursday, September 23, 2010
San Jose Convention Center, San Jose, California
Supercomputing 2010 Conference
• Keynote: Bill Dally
• Booth talks: Takayuki Aoki, Tamrat Belayneh, Jack Dongarra, Robert Farber, Wu Feng, Wei Ge, Mark Govett, Satoshi Matsuoka, Patrick McCormick, Paul Navratil, Thomas Schulthess, John Stone, Jeffrey Vetter, Michael Wolf
• Four HPCwire awards
• Gordon Bell prize
• Best student paper
• NVIDIA GPUs all over the exhibition floor
GPU Revolutionizing Computing
[Chart: GFlops by GPU generation: T8 (128 core), T10 (240 core), Fermi (512 core), Kepler, Maxwell]
A 2015 GPU *
• ~20× the performance of today’s GPU
• ~5,000 cores at ~3 GHz (50 mW each)
• ~20 TFLOPS
• ~1.2 TB/s of memory bandwidth
* This is a sketch of what a GPU in 2015 might look like; it does not reflect any actual product plans.
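The numbers in the 2015 GPU sketch can be related to one another with simple arithmetic. The Python snippet below is a back-of-the-envelope illustration, not something taken from the slides: the variable names are hypothetical, and the inputs are the approximate figures quoted in the bullet list.

```python
# Approximate figures copied from the "A 2015 GPU" sketch.
# Variable names are illustrative assumptions, not NVIDIA terminology.
CORES = 5_000
CLOCK_HZ = 3e9
POWER_PER_CORE_MW = 50        # milliwatts per core
PEAK_FLOPS = 20e12            # ~20 TFLOPS
MEM_BW_BYTES_PER_S = 1.2e12   # ~1.2 TB/s
SPEEDUP_VS_TODAY = 20         # "~20x the performance of today's GPU"

# Total power drawn by the cores alone, in watts.
core_power_w = CORES * POWER_PER_CORE_MW / 1000

# Floating-point operations each core must complete per clock cycle.
flops_per_core_cycle = PEAK_FLOPS / (CORES * CLOCK_HZ)

# Memory bandwidth available per floating-point operation.
bytes_per_flop = MEM_BW_BYTES_PER_S / PEAK_FLOPS

# Performance of "today's GPU" implied by the 20x claim.
implied_today_flops = PEAK_FLOPS / SPEEDUP_VS_TODAY

print(f"Core power: {core_power_w:.0f} W")
print(f"Flops per core per cycle: {flops_per_core_cycle:.2f}")
print(f"Bytes per flop: {bytes_per_flop:.2f}")
print(f"Implied performance today: {implied_today_flops / 1e12:.0f} TFLOPS")
```

Taken at face value, the sketch implies about 250 W for the cores, a little over one floating-point operation per core per cycle, 0.06 bytes of memory bandwidth per flop, and a present-day baseline of about 1 TFLOPS, which is in line with the 1030 GFlops single-precision figure quoted for a T20 GPU.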