Palestra - USP (Lecture at USP)
Product Availability Update
Product   | Inventory                         | Lead time for big orders | Notes
C1060     | 200 units                         | 8 weeks                  | Build [...]
M1060     | 500 units                         | 8 weeks                  | Build [...]
S1070-400 | 50 units                          | 10 weeks                 | Build [...]
S1070-500 | 25 units + 75 being built         | 10 weeks                 | Build [...]
M2050     | Shipping now; building 20K for Q2 | 8 weeks                  | Sold out thr[...]
S2050     | Shipping now; building 200 for Q2 | 8 weeks                  | Sold out thr[...]
C2050     | 2000 units                        | 8 weeks                  | Will mainta[...]
M2070     | Sept 2010                         | -                        | Get PO in now
C2070     | Sept-Oct 2010                     | -                        | Get PO in now
M2070-Q   | Oct 2010                          | -                        |
Parallel Processing with GPUs on the Fermi Architecture
Arnaldo Tavares, Tesla Sales Manager for Latin America
Quadro or Tesla?
Computer Aided Design e.g. CATIA, SolidWorks, Siemens NX
3D Modeling / Animation e.g. 3ds Max, Maya, Softimage
Video Editing / FX e.g. Adobe CS5, Avid
Numerical Analytics e.g. MATLAB, Mathematica
Computational Biology e.g. AMBER, NAMD, VMD
Computer Aided Engineering e.g. ANSYS, SIMULIA/ABAQUS
GPU Computing
CPU + GPU Co-Processing
CPU (4 cores): 48 GigaFlops (double precision)
GPU: 515 GigaFlops (double precision)
(Average efficiency in Linpack: 50%)
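To make the co-processing model concrete, here is a minimal CUDA C sketch (illustrative, not taken from the slides; the kernel name and sizes are arbitrary): the host CPU stages the data and launches the work, while the GPU runs the data-parallel kernel.

```c
// Minimal CPU + GPU co-processing sketch: the CPU orchestrates,
// the GPU executes the data-parallel part.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) side: allocate and initialize input data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (GPU) side: allocate memory and copy the inputs over PCIe.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // The CPU launches the kernel; the GPU does the parallel arithmetic.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);   // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```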
Application speedups reported with GPUs across domains (50x to 150x is typical):
- 146x: Medical Imaging (U of Utah)
- 36x: Molecular Dynamics (U of Illinois, Urbana)
- 18x: Video Transcoding (Elemental Technologies)
- 50x: MATLAB Computing (AccelerEyes)
- 149x: Financial Simulation (Oxford)
- 47x: Linear Algebra (Universidad Jaime I)
- 20x: 3D Ultrasound (Techniscan)
- 130x: Quantum Chemistry (U of Illinois, Urbana)
Increasing Number of Professional CUDA Applications (available or announced)

Tools & Libraries: CUDA C/C++, PGI CUDA Fortran, Thrust C++ Template Library, CUDA FFT, CUDA BLAS, RNG & SPARSE CUDA Libraries, NVIDIA NPP Performance Primitives, NVIDIA Video Libraries, MAGMA (LAPACK), EMPhotonics CULAPACK, AccelerEyes Jacket for MATLAB, MATLAB, Wolfram Mathematica, CAPS HMPP, PGI Accelerators, PGI CUDA-x86, Parallel Nsight for Visual Studio, Allinea DDT debugger, TauCUDA performance tools, ParaTools VampirTrace, VSG Open Inventor, Bright Cluster Manager, Platform LSF Cluster Manager

Oil & Gas: StoneRidge RTM, Headwave Suite, Acceleware RTM Solver, GeoStar Seismic Suite, ffA SVI Pro, OpenGeoSolutions OpenSEIS, Paradigm RTM, Paradigm SKUA, Seismic City RTM, Tsunami RTM

Bio-Chemistry: TeraChem, BigDFT, ABINIT, VMD, Acellera ACEMD, AMBER, DL-POLY, GROMACS, HOOMD, LAMMPS, NAMD, GAMESS, CP2K, OpenEye ROCS

Bio-Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDA-SW++ (Smith-Waterman), GPU-HMMER, HEX Protein Docking, MUMmerGPU, PIPER Docking

CAE: ANSYS Mechanical, ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometch Particleworks, Remcom XFdtd 7.0, Metacomp CFD++, LSTC LS-DYNA 971
Increasing Number of Professional CUDA Applications (continued; available or announced)

Medical: Siemens 4D Ultrasound, Digisens, UsefulProgress

Rendering: NVIDIA OptiX (SDK), mental images iray (OEM), Bunkspeed Shot (iray), Refractive Software Octane, Random Control Arion, Caustic Graphics, Lightworks Artisan, Autodesk 3ds Max, Weta Digital PantaRay, ILM Plume

Video: Adobe Premiere Pro CS5, Elemental Video, MainConcept CUDA Encoder, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, The Foundry Kronos, TDVision TDVCodec, ARRI (various apps), Black Magic Da Vinci, GenArts Sapphire, MotionDSP Ikena Video, Digital Anarchy Photo

Finance: Murex MACS, Numerix Risk, RMS Risk Management Solutions, SciComp SciFinance, Hanweck Options Analytics, Aquimin AlphaVision

EDA: Synopsys TCAD, SPEAG SEMCAD X, Agilent EMPro 2010, Agilent ADS SPICE, CST Microwave Studio, Acceleware FDTD Solver, Acceleware EM Solution, Gauda OPC, Rocketick Verilog Simulation

Other: Schrodinger Core Hopping, Manifold GIS, Dalsa Machine Vision, MVTec Machine Vision, NAG RNG
3 of the Top 5 Supercomputers
[Bar chart: Linpack performance of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II and Tera-100, on a scale of 0 to 3000 Teraflops]
What if Every Supercomputer Had Fermi?
[Chart: Linpack Teraflops of the Top 500 Supercomputers (Nov 2009 list)]
- 150 GPUs (37 TeraFlops, $740K) would reach the Top 150
- 225 GPUs (55 TeraFlops, $1.1 M) would reach the Top 100
- 450 GPUs (110 TeraFlops, $2.2 M) would reach the Top 50
Hybrid ExaScale Trajectory
- 2008: 1 TFLOP, 7.5 kW
- 2010: 1.27 PFLOPS, 2.55 MW
- 2017*: 2 EFLOPS, 10 MW
* This is a projection based on Moore's law and does not represent a committed roadmap.
Tesla Roadmap
The March of the GPUs
[Chart: peak memory bandwidth (GB/s), 2007 to 2010, scale 0 to 250: NVIDIA T10 and T20 GPUs vs. Nehalem 3 GHz and Westmere 3 GHz CPUs]
[Chart: peak double-precision GFlops/s, 2007 to 2012, scale 0 to 1200: NVIDIA T10, T20 and T20A GPUs vs. Nehalem 3 GHz, Westmere 3 GHz and 8-core Sandy Bridge 3 GHz CPUs]
Series: NVIDIA GPU double precision (ECC off), NVIDIA GPU double precision, x86 CPU double precision
Project Denver
Expected Tesla Roadmap with Project Denver
Workstation / Data Center Solutions
- Workstations: up to 4x Tesla C2050/C2070 GPUs
- Integrated CPU-GPU server: 2x Tesla M2050 GPUs in 1U
- OEM CPU server + Tesla S2050/S2070: 4 Tesla GPUs in 2U
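In a multi-GPU node such as the 1U and 2U configurations above, the CUDA runtime exposes each Tesla board as a separate device. A small illustrative sketch (not from the slides) enumerates them with standard runtime calls:

```c
// Enumerate the GPUs in a multi-GPU server and select one before
// allocating memory or launching kernels on it.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);            // e.g. 2 for a 1U box with two M2050s
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s (%d multiprocessors)\n",
               dev, prop.name, prop.multiProcessorCount);
    }
    cudaSetDevice(0);                      // subsequent CUDA calls target device 0
    return 0;
}
```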
Tesla C-Series Workstation GPUs

Feature | Tesla C2050 | Tesla C2070
Processor | Tesla 20-series GPU | Tesla 20-series GPU
Number of cores | 448 | 448
Caches | 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache | same
Peak floating-point performance | 1030 Gigaflops (single), 515 Gigaflops (double) | same
GPU memory | 3 GB (2.625 GB with ECC on) | 6 GB (5.25 GB with ECC on)
Memory bandwidth | 144 GB/s (GDDR5) | 144 GB/s (GDDR5)
System I/O | PCIe x16 Gen2 | PCIe x16 Gen2
Power | 238 W (max) | 238 W (max)
Availability | Shipping now | Shipping now
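The table values can be checked programmatically on an installed board. An illustrative sketch using cudaGetDeviceProperties (the printed values depend on the actual board and on whether ECC is enabled):

```c
// Query a C2050/C2070-class board and print the properties that
// correspond to the specification table above.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Name:            %s\n", prop.name);
    printf("Multiprocessors: %d (x 32 cores each on Fermi = %d CUDA cores)\n",
           prop.multiProcessorCount, prop.multiProcessorCount * 32);
    printf("Global memory:   %.2f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("L2 cache:        %d KB\n", prop.l2CacheSize / 1024);
    printf("ECC enabled:     %s\n", prop.ECCEnabled ? "yes" : "no");
    return 0;
}
```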
How is the GPU Used?
- Basic building block: the Streaming Multiprocessor (SM)
- SIMD: Single Instruction, Multiple Data
- The same instruction is issued to all cores, but each core operates on different data
- SIMD within an SM, MIMD across the GPU chip
Source: presentation by Felipe A. Cruz, Nagasaki University
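A short illustrative kernel (not from the slides) makes the SIMD idea concrete: every thread executes the same instruction stream, but the index it computes selects different data.

```c
// Illustrative SAXPY kernel: one instruction stream, many data elements.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread gets its own index
    if (i < n)
        y[i] = a * x[i] + y[i];                     // same operation, different data
}

// Example launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
// Threads within a block run SIMD-style on one SM; different blocks execute
// independently across the SMs, which is the MIMD aspect at the chip level.
```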
The Use of GPUs and Bottleneck Analysis
Source: presentation by Takayuki Aoki, Tokyo Institute of Technology
The Fermi Architecture
- 3 billion transistors
- 16 Streaming Multiprocessors (SMs)
- 6 x 64-bit memory partitions = 384-bit memory interface
- Host interface: connects the GPU to the CPU via PCI-Express
- GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
SM Architecture
- 32 CUDA cores per SM (512 total)
- 16 load/store units: source and destination addresses calculated for 16 threads per clock
- 4 Special Function Units (sine, cosine, square root, etc.)
- 64 KB of on-chip RAM split between shared memory and L1 cache (configurable)
- Dual warp scheduler
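The shared memory / L1 split is a per-kernel preference in the CUDA runtime. A minimal illustrative sketch (the kernel name is hypothetical; cudaFuncSetCacheConfig and the cudaFuncCachePrefer* values are standard runtime API):

```c
// Request how the 64 KB of per-SM RAM is partitioned for a given kernel.
#include <cuda_runtime.h>

__global__ void stencilKernel(float *out, const float *in, int n)
{
    // ... kernel body that benefits from a large shared-memory tile ...
}

void configureCache(void)
{
    // Prefer the larger shared-memory partition (48 KB shared / 16 KB L1) ...
    cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferShared);
    // ... or the larger L1 partition instead:
    // cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferL1);
}
```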
Dual Warp Scheduler
- 1 warp = 32 parallel threads
- 2 warps are issued and executed concurrently
- Each warp is issued to a group of 16 CUDA cores
- Most instructions can be dual-issued (exception: double-precision instructions)
- The dual-issue model allows near-peak hardware performance
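A small illustrative kernel (not from the slides) makes the warp arithmetic concrete: with 256 threads per block there are 256 / 32 = 8 warps per block, and the built-in warpSize constant gives each thread its warp and lane.

```c
// Identify the warp and lane of each thread; with 256-thread blocks,
// blockDim.x / warpSize = 8 warps per block are available to the schedulers.
__global__ void warpInfo(int *warpOfThread)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;      // warp index within the block (0..7 here)
    int lane = threadIdx.x % warpSize;      // lane within the warp (0..31)
    warpOfThread[tid] = warp * 100 + lane;  // packed only for easy inspection
}
```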
CUDA Core Architecture
[Diagram: Fermi SM layout: instruction cache, warp schedulers and dispatch units, register file, 32 CUDA cores, load/store units, special function units, interconnect network, 64 KB shared memory / L1 cache, uniform cache. Each CUDA core contains a dispatch port, operand collector, an FP unit, an INT unit, and a result queue.]
- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Newly designed integer ALU optimized for 64-bit and extended-precision operations
- Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision
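An illustrative device-side example (not from the slides): the CUDA math functions fma() and fmaf() map to this fused multiply-add hardware, computing a*b + c with a single rounding step.

```c
// Fused multiply-add in double precision; fmaf() is the single-precision form.
__global__ void fmaDemo(const double *a, const double *b, const double *c,
                        double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // One rounding step instead of two (multiply, then add).
        out[i] = fma(a[i], b[i], c[i]);
    }
}
```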