Product Availability Update

- C1060: 200 units in inventory; 8-week lead time for big orders; Build
- M1060: 500 units; 8 weeks; Build
- S1070-400: 50 units; 10 weeks; Build
- S1070-500: 25 units + 75 being built; 10 weeks; Build
- M2050: shipping now, building 20K for Q2; 8 weeks; sold out through ...
- S2050: shipping now, building 200 for Q2; 8 weeks; sold out through ...
- C2050: 2000 units; 8 weeks; will maintain ...
- M2070: Sept 2010; get PO in now
- C2070: Sept-Oct 2010; get PO in now
- M2070-Q: Oct 2010
Parallel Processing with GPUs on the Fermi Architecture
Arnaldo Tavares, Tesla Sales Manager for Latin America
Quadro or Tesla?
Computer Aided Design, e.g. CATIA, SolidWorks, Siemens NX
3D Modeling / Animation, e.g. 3ds Max, Maya, Softimage
Video Editing / FX, e.g. Adobe CS5, Avid
Numerical Analytics, e.g. MATLAB, Mathematica
Computational Biology, e.g. AMBER, NAMD, VMD
Computer Aided Engineering, e.g. ANSYS, SIMULIA/ABAQUS
GPU Computing
CPU + GPU Co-Processing
CPU (4 cores): 48 GigaFlops (DP)
GPU: 515 GigaFlops (DP)
(Average efficiency in Linpack: 50%)
Application speedups with GPUs:
- 146X Medical Imaging (U of Utah)
- 36X Molecular Dynamics (U of Illinois, Urbana)
- 18X Video Transcoding (Elemental Tech)
- 50X MATLAB Computing (AccelerEyes)
- 149X Financial Simulation (Oxford)
- 47X Linear Algebra (Universidad Jaime I)
- 20X 3D Ultrasound (Techniscan)
- 130X Quantum Chemistry (U of Illinois, Urbana)
Increasing Number of Professional CUDA Applications (Available / Announced)

Tools & Libraries: NVIDIA Video Libraries; AccelerEyes Jacket (MATLAB); EMPhotonics CULAPACK; Bright Cluster Manager; CAPS HMPP; MATLAB; Thrust C++ Template Library; CUDA C/C++; PGI CUDA Fortran; PGI CUDA-x86; PGI Accelerators; Parallel Nsight (Visual Studio IDE); Allinea DDT Debugger; TauCUDA Perf Tools; NVIDIA NPP Performance Primitives; ParaTools VampirTrace; Platform LSF Cluster Manager; MAGMA (LAPACK); CUDA FFT; CUDA BLAS; RNG & SPARSE CUDA Libraries; Wolfram Mathematica; VSG Open Inventor

Oil & Gas: StoneRidge RTM; Headwave Suite; Acceleware RTM Solver; GeoStar Seismic Suite; ffA SVI Pro; OpenGeoSolutions OpenSEIS; Paradigm RTM; Paradigm SKUA; Seismic City RTM; Tsunami RTM

CAE: ACUSIM AcuSolve 1.8; Autodesk Moldflow; Prometech Particleworks; Remcom XFdtd 7.0; Metacomp CFD++; LSTC LS-DYNA 971; ANSYS Mechanical

Bio-Chemistry: TeraChem; BigDFT; ABINIT; VMD; Acellera ACEMD; AMBER; DL-POLY; GROMACS; HOOMD; LAMMPS; NAMD; GAMESS; CP2K

Bio-Informatics: CUDA-BLASTP; CUDA-EC; CUDA-MEME; CUDA-SW++ (Smith-Waterman); GPU-HMMER; HEX Protein Docking; MUMmerGPU; PIPER Docking; OpenEye ROCS
Increasing Number of Professional CUDA Applications, continued (Available / Announced)

Medical: Siemens 4D Ultrasound; Digisens; Schrodinger Core Hopping; Useful Progress

Rendering: Lightworks Artisan; Autodesk 3ds Max; NVIDIA OptiX (SDK); mental images iray (OEM); Bunkspeed Shot (iray); Refractive Software Octane; Random Control Arion; Caustic Graphics; Weta Digital PantaRay; ILM Plume

Video: MotionDSP Ikena Video; Digital Anarchy Photo; Elemental Video; Fraunhofer JPEG2000; Cinnafilm Pixel Strings; Assimilate SCRATCH; The Foundry Kronos; TDVision TDVCodec; ARRI (various apps); Black Magic Da Vinci; MainConcept CUDA Encoder; GenArts Sapphire; Adobe Premiere Pro CS5

Finance: Aquimin AlphaVision; Murex MACS; Numerix RMS Risk Management Solutions; Hanweck Options Analytics; SciComp SciFinance; NAG RNG

EDA: Synopsys TCAD; SPEAG SEMCAD X; Agilent EMPro 2010; CST Microwave; Agilent ADS SPICE; Acceleware FDTD Solver; Acceleware EM Solution; Gauda OPC; Rocketick Verilog Sim

Other: Manifold GIS; Dalsa Machine Vision; MVTec Machine Vision
3 of Top 5 Supercomputers
[Chart: Linpack performance (axis 0 to 3000 teraflops) of the top systems: Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, Tera-100]
What if Every Supercomputer Had Fermi?
[Chart: Linpack performance (teraflops, axis 0 to 1000) of the Top 500 Supercomputers (Nov 2009), with callouts showing the GPU system needed to reach each tier:]
- Top 150: 150 GPUs, 37 TeraFlops, $740K
- Top 100: 225 GPUs, 55 TeraFlops, $1.1M
- Top 50: 450 GPUs, 110 TeraFlops, $2.2M
Hybrid ExaScale Trajectory
- 2008: 1 TFLOP, 7.5 kilowatts
- 2010: 1.27 PFLOPS, 2.55 megawatts
- 2017*: 2 EFLOPS, 10 megawatts
* This is a projection based on Moore's law and does not represent a committed roadmap.
Tesla Roadmap
The March of the GPUs
[Chart 1: Peak memory bandwidth (GBytes/s, axis 0 to 250), 2007 to 2011: NVIDIA T10 and T20 GPUs vs. Nehalem 3 GHz and Westmere 3 GHz CPUs]
[Chart 2: Peak double-precision floating point (GFlops/s, axis 0 to 1200), 2007 to 2012: NVIDIA T10, T20, and T20A GPUs vs. Nehalem 3 GHz, Westmere 3 GHz, and 8-core Sandy Bridge 3 GHz CPUs]
Legend: NVIDIA GPU double precision (ECC off); NVIDIA GPU double precision; x86 CPU double precision
Project Denver
Expected Tesla Roadmap with Project Denver
Workstation / Data Center Solutions
- Workstations: up to 4x Tesla C2050/70 GPUs
- Integrated CPU-GPU server: 2x Tesla M2050/70 in 1U
- OEM CPU server + Tesla S2050/70: 4 Tesla GPUs in 2U
Tesla C-Series Workstation GPUs (Tesla C2050 / Tesla C2070)
- Processor: Tesla 20-series GPU
- Number of cores: 448
- Caches: 64 KB L1 cache + shared memory per SM (32 cores); 768 KB L2 cache
- Peak floating-point performance: 1030 Gigaflops (single), 515 Gigaflops (double)
- GPU memory: C2050: 3 GB (2.625 GB with ECC on); C2070: 6 GB (5.25 GB with ECC on)
- Memory bandwidth: 144 GB/s (GDDR5)
- System I/O: PCIe x16 Gen2
- Power: 238 W (max)
- Availability: shipping now
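As a quick sanity check, the CUDA runtime API can report most of the figures in this table at run time. A minimal sketch; the 32-cores-per-SM multiplier applies to Fermi-class parts like the C2050/C2070:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, dev);
            printf("Device %d: %s (compute capability %d.%d)\n",
                   dev, p.name, p.major, p.minor);
            printf("  SMs: %d (x 32 cores on Fermi = %d CUDA cores)\n",
                   p.multiProcessorCount, 32 * p.multiProcessorCount);
            printf("  Global memory: %.3f GB, ECC %s\n",
                   p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
                   p.ECCEnabled ? "on" : "off");
            printf("  Shared memory per block: %zu KB\n",
                   p.sharedMemPerBlock / 1024);
        }
        return 0;
    }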
How is the GPU Used?
- Basic component: the Streaming Multiprocessor (SM)
- SIMD: Single Instruction, Multiple Data: the same instruction is issued to all cores, but each core operates on different data
- SIMD within an SM, MIMD across the GPU chip
Source: presentation by Felipe A. Cruz, Nagasaki University
The Use of GPUs and Bottleneck Analysis
Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology
The Fermi Architecture
- 3 billion transistors
- 16 Streaming Multiprocessors (SMs)
- 6 x 64-bit memory partitions = 384-bit memory interface
- Host interface: connects the GPU to the CPU via PCI-Express
- GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
SM Architecture
- 32 CUDA cores per SM (512 total)
- 16 load/store units: source and destination addresses calculated for 16 threads per clock
- 4 Special Function Units (sine, cosine, square root, etc.)
- 64 KB of on-chip RAM, split between shared memory and L1 cache (configurable per kernel; see the sketch after this list)
- Dual Warp Scheduler
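The shared memory / L1 split is selectable per kernel through the runtime API. A minimal sketch, assuming a hypothetical kernel that prefers the larger 48 KB shared-memory configuration:

    #include <cuda_runtime.h>

    __global__ void stencil_kernel(const float *in, float *out, int n) {
        // Hypothetical kernel that benefits from more shared memory.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    int main() {
        // Request the 48 KB shared / 16 KB L1 split for this kernel; the
        // alternative is cudaFuncCachePreferL1 (16 KB shared / 48 KB L1).
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
        // ... allocate buffers and launch stencil_kernel<<<blocks, threads>>>(...)
        return 0;
    }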
Dual Warp Scheduler
- 1 warp = 32 parallel threads
- 2 warps are issued and executed concurrently
- Each warp is dispatched to 16 CUDA cores
- Most instructions can be dual-issued (exception: double-precision instructions)
- The dual-issue model allows near-peak hardware performance
CUDA Core Architecture
[Diagram: an SM with instruction cache, scheduler and dispatch units, a register file, a grid of CUDA cores, load/store units, special function units, an interconnect network, 64 KB configurable cache / shared memory, and a uniform cache; each CUDA core contains a dispatch port, operand collector, FP unit, INT unit, and result queue]
- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Newly designed integer ALU optimized for 64-bit and extended-precision operations
- Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision
Fused Multiply-Add Instruction (FMA)
FMA computes D = A x B + C with a single rounding at the end, whereas a conventional multiply-add (MAD) rounds after the multiply and again after the add; keeping the full-precision intermediate product gives more accurate results.
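In C for CUDA, the compiler normally contracts a*x + y into FMA on Fermi automatically; the device intrinsic makes the fused operation explicit. A minimal sketch (axpy_fma is a hypothetical kernel name):

    #include <cuda_runtime.h>

    // Each thread computes y[i] = a*x[i] + y[i] with a single rounding step.
    __global__ void axpy_fma(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = __fmaf_rn(a, x[i], y[i]);  // fused multiply-add, round-to-nearest
    }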
GigaThread Hardware Thread Scheduler (HTS)
- Hierarchically manages thousands of simultaneously active threads
- 10x faster application context switching (each program receives a time slice of processing resources)
- Concurrent kernel execution
GigaThread Hardware Thread Scheduler
Concurrent kernel execution + faster context switch
[Diagram: on a timeline, serial kernel execution runs Kernel 1 through Kernel 5 one after another, while parallel kernel execution packs independent kernels side by side on the GPU, finishing sooner]
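Concurrent kernel execution is exposed to the programmer through CUDA streams: kernels launched into different streams may overlap when resources allow. A minimal sketch with two hypothetical, independent kernels:

    #include <cuda_runtime.h>

    __global__ void kernel_a() { /* independent work */ }
    __global__ void kernel_b() { /* independent work */ }

    int main() {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // In the default stream these launches would serialize; in separate
        // streams, Fermi's HTS may run them concurrently.
        kernel_a<<<16, 256, 0, s1>>>();
        kernel_b<<<16, 256, 0, s2>>>();

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        return 0;
    }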
GigaThread Streaming Data Transfer (SDT) Engine
- Dual DMA engines
- Simultaneous CPU-to-GPU and GPU-to-CPU data transfer, fully overlapped with CPU and GPU processing time
[Activity snapshot: a timeline where CPU work, SDT0 uploads, GPU kernels (Kernel 0 through Kernel 3), and SDT1 downloads all proceed concurrently]
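The dual DMA engines are driven from the host with asynchronous copies in separate streams, which requires page-locked (pinned) host memory. A minimal sketch (buffer names are illustrative):

    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h_in, *h_out, *d_a, *d_b;

        // Pinned host memory is required for genuinely asynchronous copies.
        cudaMallocHost((void **)&h_in, bytes);
        cudaMallocHost((void **)&h_out, bytes);
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);

        cudaStream_t up, down;
        cudaStreamCreate(&up);
        cudaStreamCreate(&down);

        // With dual DMA engines, the upload of the next batch (stream "up")
        // can proceed while results of a previous batch (held in d_b here)
        // download in stream "down".
        cudaMemcpyAsync(d_a, h_in, bytes, cudaMemcpyHostToDevice, up);
        cudaMemcpyAsync(h_out, d_b, bytes, cudaMemcpyDeviceToHost, down);
        cudaDeviceSynchronize();

        cudaStreamDestroy(up); cudaStreamDestroy(down);
        cudaFreeHost(h_in); cudaFreeHost(h_out);
        cudaFree(d_a); cudaFree(d_b);
        return 0;
    }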
Cached Memory Hierarchy
- First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency
- Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU
- Global memory (up to 6 GB)
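To illustrate the shared memory level of this hierarchy, a common pattern is a per-block reduction staged in the SM's on-chip RAM. A minimal sketch, assuming blocks of 256 threads (block_sum is a hypothetical kernel):

    __global__ void block_sum(const float *in, float *block_sums, int n) {
        __shared__ float tile[256];      // lives in the SM's 64 KB shared/L1 RAM
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                 // all threads in the block see the tile

        // Tree reduction within the block, entirely in on-chip shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = tile[0];  // one partial sum per block, in global memory
    }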
CUDA: Compute Unified Device Architecture
NVIDIA's parallel computing architecture: a software development platform aimed at the GPU architecture.
[Stack diagram, bottom to top: (1) CUDA parallel compute engines inside the GPU; (2) CUDA support in the kernel-level driver; (3) the CUDA driver, exposing the PTX ISA; (4) device-level APIs: the OpenCL driver for applications using OpenCL C, the CUDA Driver API for applications using C for CUDA, and DirectX 11 Compute for applications using HLSL; (5) language integration: the C Runtime for CUDA, for applications using C, C++, Fortran, Java, Python, ...]
Thread Hierarchy
- Kernels (simple C programs) are executed by threads
- Threads are grouped into blocks
- Threads in a block can synchronize execution
- Blocks are grouped into a grid
- Blocks are independent (they must be able to execute in any order)
Source: presentation by Felipe A. Cruz, Nagasaki University
Memory and Hardware Hierarchy
- Threads access registers; CUDA cores execute threads
- Threads within a block can share data/results via shared memory; Streaming Multiprocessors (SMs) execute blocks
- Grids use global memory for result sharing (after kernel-wide global synchronization); the GPU executes grids
Source: presentation by Felipe A. Cruz, Nagasaki University
Full View of the Hierarchy Model

CUDA      Hardware Level   Memory Access
Thread    CUDA Core        Registers
Block     SM               Shared Memory
Grid      GPU              Global Memory
Device    Node             Host Memory
IDs and Dimensions
[Diagram: a device running Grid 1, a 2D arrangement of blocks; Block (1,1) expands into a 3x3 arrangement of threads, Thread (0,0) through Thread (2,2)]
- Threads: 3D IDs, unique within a block
- Blocks: 2D IDs, unique within a grid
- Dimensions are set at launch time and can be unique for each grid
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim
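A minimal sketch of how these built-in variables map a 2D launch onto matrix elements (scale_matrix and its parameters are illustrative):

    // Map the 2D block/thread IDs to one element of an n_x by n_y
    // matrix stored in row-major order, and scale it in place.
    __global__ void scale_matrix(float *m, int n_x, int n_y, float s) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
        if (x < n_x && y < n_y)
            m[y * n_x + x] *= s;
    }

    // Launch: dimensions are chosen at launch time.
    //   dim3 block(16, 16);
    //   dim3 grid((n_x + 15) / 16, (n_y + 15) / 16);
    //   scale_matrix<<<grid, block>>>(d_m, n_x, n_y, 2.0f);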
Compiling C for CUDA Applications

    void serial_function(...) { ... }
    void other_function(int ...) { ... }
    void saxpy_serial(float ...) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    void main() {
        float x;
        saxpy_serial(...);
        ...
    }

[Diagram: the serial C code above is modified into parallel CUDA code; NVCC (Open64) separates the key kernels (CUDA object files) from the rest of the application (CPU object files), and the linker combines both into a single CPU + GPU executable]
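For reference, a typical invocation, assuming the code is saved as saxpy.cu and nvcc from the CUDA toolkit is on the PATH:

    nvcc -o saxpy saxpy.cu
    ./saxpy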
C for CUDA: C with a few keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C for CUDA code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
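For completeness, a minimal self-contained host program around saxpy_parallel; the allocation sizes, test values, and printout are illustrative:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void saxpy_parallel(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h_x = (float *)malloc(bytes), *h_y = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_y, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f (expected 4.0)\n", h_y[0]);

        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }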
Software Programming
[Slides 35 to 42: a sequence of software-programming example figures]
Source: presentation by Andreas Klöckner, NYU
CUDA C/C++ Leadership

- July 07: CUDA Toolkit 1.0: C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
- Nov 07: CUDA Toolkit 1.1: Win XP 64, atomics support, multi-GPU support
- April 08 / Aug 08: CUDA Toolkit 2.0: double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
- July 09: CUDA Toolkit 2.3: DP FFT, 16-32 conversion intrinsics, performance enhancements; Visual Profiler 2.2; cuda-gdb HW debugger
- Nov 09: Parallel Nsight beta
Why should I choose Tesla over consumer cards?
Feature / Benefit

Features:
- 4x higher double precision (on 20-series): higher performance for scientific codes
- ECC, only on Tesla and Quadro (on 20-series): data reliability inside the GPU and ...
- Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only 1 DMA engine): higher performance for CUDA applications that overlap communication and computation
- Larger memory for larger data sets (3 GB and 6 GB products): higher performance on a wide range of applications (manufacturing, FEA, ...)
- Cluster management software tools available on Tesla only: needed for GPU monitoring and job scheduling in deployments
- TCC (Tesla Compute Cluster) driver for Windows OS, supported only on Tesla: higher performance for CUDA applications (lower overhead); TCC adds support for ...
- Integrated OEM workstations and servers: trusted, reliable systems built for ...
- Professional ISVs certify CUDA applications only on Tesla: bug reproduction, support, feature requests

Quality & Warranty:
- 2-to-4-day stress testing and memory burn-in, plus added margin in memory and core clocks for reliability: built for 24/7 computing in data centers and ...
- Manufactured and guaranteed by NVIDIA: no changes in key components like the GPU; always the same clocks for known, ...
- 3-year warranty from HP: reliable, long life ...

Support & Lifecycle:
- Enterprise support, higher priority for CUDA bugs and requests: ability to influence the CUDA and GPU roadmap with feature requests
- 18 to 24 months availability + 6-month EOL notice: reliable product supply