Palestra - USP (Lecture at USP)
Product Availability Update
Product   | Inventory                         | Lead time for big orders | Notes
C1060     | 200 units                         | 8 weeks                  | Build [...]
M1060     | 500 units                         | 8 weeks                  | Build [...]
S1070-400 | 50 units                          | 10 weeks                 | Build [...]
S1070-500 | 25 units + 75 being built         | 10 weeks                 | Build [...]
M2050     | Shipping now; building 20K for Q2 | 8 weeks                  | Sold out thr[...]
S2050     | Shipping now; building 200 for Q2 | 8 weeks                  | Sold out thr[...]
C2050     | 2000 units                        | 8 weeks                  | Will mainta[...]
M2070     | Sept 2010                         | -                        | Get PO in now
C2070     | Sept-Oct 2010                     | -                        | Get PO in now
M2070-Q   | Oct 2010                          | -                        |
Parallel Processing with GPUs on the Fermi Architecture
Arnaldo Tavares, Tesla Sales Manager for Latin America
Quadro or Tesla?
Computer Aided Design e.g. CATIA, SolidWorks, Siemens NX
3D Modeling / Animation e.g. 3ds Max, Maya, Softimage
Video Editing / FX e.g. Adobe CS5, Avid
Numerical Analytics e.g. MATLAB, Mathematica
Computational Biology e.g. AMBER, NAMD, VMD
Computer Aided Engineering e.g. ANSYS, SIMULIA/ABAQUS
GPU Computing
CPU + GPU Co-Processing
CPU (4 cores): 48 GigaFlops (double precision)
GPU: 515 GigaFlops (double precision)
(Average efficiency in Linpack: 50%)
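To make the co-processing model concrete, here is a minimal CUDA C sketch (illustrative, not taken from the slides; the kernel name and sizes are arbitrary): the host CPU stages the data and launches the work, while the GPU runs the data-parallel kernel.

```c
// Minimal CPU + GPU co-processing sketch: the CPU orchestrates,
// the GPU executes the data-parallel part.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) side: allocate and initialize input data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (GPU) side: allocate memory and copy the inputs over PCIe.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // The CPU launches the kernel; the GPU does the parallel arithmetic.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);   // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```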
Application speedups reported with GPUs across domains (50x to 150x is typical):
- 146x: Medical Imaging (U of Utah)
- 36x: Molecular Dynamics (U of Illinois, Urbana)
- 18x: Video Transcoding (Elemental Technologies)
- 50x: MATLAB Computing (AccelerEyes)
- 149x: Financial Simulation (Oxford)
- 47x: Linear Algebra (Universidad Jaime I)
- 20x: 3D Ultrasound (Techniscan)
- 130x: Quantum Chemistry (U of Illinois, Urbana)
Increasing Number of Professional CUDA Applications (available or announced)

Tools & Libraries: CUDA C/C++, PGI CUDA Fortran, Thrust C++ Template Library, CUDA FFT, CUDA BLAS, RNG & SPARSE CUDA Libraries, NVIDIA NPP Performance Primitives, NVIDIA Video Libraries, MAGMA (LAPACK), EMPhotonics CULAPACK, AccelerEyes Jacket for MATLAB, MATLAB, Wolfram Mathematica, CAPS HMPP, PGI Accelerators, PGI CUDA-x86, Parallel Nsight for Visual Studio, Allinea DDT debugger, TauCUDA performance tools, ParaTools VampirTrace, VSG Open Inventor, Bright Cluster Manager, Platform LSF Cluster Manager

Oil & Gas: StoneRidge RTM, Headwave Suite, Acceleware RTM Solver, GeoStar Seismic Suite, ffA SVI Pro, OpenGeoSolutions OpenSEIS, Paradigm RTM, Paradigm SKUA, Seismic City RTM, Tsunami RTM

Bio-Chemistry: TeraChem, BigDFT, ABINIT, VMD, Acellera ACEMD, AMBER, DL-POLY, GROMACS, HOOMD, LAMMPS, NAMD, GAMESS, CP2K, OpenEye ROCS

Bio-Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDA-SW++ (Smith-Waterman), GPU-HMMER, HEX Protein Docking, MUMmerGPU, PIPER Docking

CAE: ANSYS Mechanical, ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometch Particleworks, Remcom XFdtd 7.0, Metacomp CFD++, LSTC LS-DYNA 971
Increasing Number of Professional CUDA Applications (continued; available or announced)

Medical: Siemens 4D Ultrasound, Digisens, UsefulProgress

Rendering: NVIDIA OptiX (SDK), mental images iray (OEM), Bunkspeed Shot (iray), Refractive Software Octane, Random Control Arion, Caustic Graphics, Lightworks Artisan, Autodesk 3ds Max, Weta Digital PantaRay, ILM Plume

Video: Adobe Premiere Pro CS5, Elemental Video, MainConcept CUDA Encoder, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, The Foundry Kronos, TDVision TDVCodec, ARRI (various apps), Black Magic Da Vinci, GenArts Sapphire, MotionDSP Ikena Video, Digital Anarchy Photo

Finance: Murex MACS, Numerix Risk, RMS Risk Management Solutions, SciComp SciFinance, Hanweck Options Analytics, Aquimin AlphaVision

EDA: Synopsys TCAD, SPEAG SEMCAD X, Agilent EMPro 2010, Agilent ADS SPICE, CST Microwave Studio, Acceleware FDTD Solver, Acceleware EM Solution, Gauda OPC, Rocketick Verilog Simulation

Other: Schrodinger Core Hopping, Manifold GIS, Dalsa Machine Vision, MVTec Machine Vision, NAG RNG
3 of the Top 5 Supercomputers
[Bar chart: Linpack performance of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II and Tera-100, on a scale of 0 to 3000 Teraflops]
What if Every Supercomputer Had Fermi?
[Chart: Linpack Teraflops of the Top 500 Supercomputers (Nov 2009 list)]
- 150 GPUs (37 TeraFlops, $740K) would reach the Top 150
- 225 GPUs (55 TeraFlops, $1.1 M) would reach the Top 100
- 450 GPUs (110 TeraFlops, $2.2 M) would reach the Top 50
Hybrid ExaScale Trajectory
- 2008: 1 TFLOP, 7.5 kW
- 2010: 1.27 PFLOPS, 2.55 MW
- 2017*: 2 EFLOPS, 10 MW
* This is a projection based on Moore's law and does not represent a committed roadmap.
Tesla Roadmap
The March of the GPUs
[Chart: peak memory bandwidth (GB/s), 2007 to 2010, scale 0 to 250: NVIDIA T10 and T20 GPUs vs. Nehalem 3 GHz and Westmere 3 GHz CPUs]
[Chart: peak double-precision GFlops/s, 2007 to 2012, scale 0 to 1200: NVIDIA T10, T20 and T20A GPUs vs. Nehalem 3 GHz, Westmere 3 GHz and 8-core Sandy Bridge 3 GHz CPUs]
Series: NVIDIA GPU double precision (ECC off), NVIDIA GPU double precision, x86 CPU double precision
Project Denver
Expected Tesla Roadmap with Project Denver
Workstation / Data Center Solutions
- Workstations: up to 4x Tesla C2050/C2070 GPUs
- Integrated CPU-GPU server: 2x Tesla M2050 GPUs in 1U
- OEM CPU server + Tesla S2050/S2070: 4 Tesla GPUs in 2U
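In a multi-GPU node such as the 1U and 2U configurations above, the CUDA runtime exposes each Tesla board as a separate device. A small illustrative sketch (not from the slides) enumerates them with standard runtime calls:

```c
// Enumerate the GPUs in a multi-GPU server and select one before
// allocating memory or launching kernels on it.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);            // e.g. 2 for a 1U box with two M2050s
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s (%d multiprocessors)\n",
               dev, prop.name, prop.multiProcessorCount);
    }
    cudaSetDevice(0);                      // subsequent CUDA calls target device 0
    return 0;
}
```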
Tesla C-Series Workstation GPUs

Feature | Tesla C2050 | Tesla C2070
Processor | Tesla 20-series GPU | Tesla 20-series GPU
Number of cores | 448 | 448
Caches | 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache | same
Peak floating-point performance | 1030 Gigaflops (single), 515 Gigaflops (double) | same
GPU memory | 3 GB (2.625 GB with ECC on) | 6 GB (5.25 GB with ECC on)
Memory bandwidth | 144 GB/s (GDDR5) | 144 GB/s (GDDR5)
System I/O | PCIe x16 Gen2 | PCIe x16 Gen2
Power | 238 W (max) | 238 W (max)
Availability | Shipping now | Shipping now
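The table values can be checked programmatically on an installed board. An illustrative sketch using cudaGetDeviceProperties (the printed values depend on the actual board and on whether ECC is enabled):

```c
// Query a C2050/C2070-class board and print the properties that
// correspond to the specification table above.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Name:            %s\n", prop.name);
    printf("Multiprocessors: %d (x 32 cores each on Fermi = %d CUDA cores)\n",
           prop.multiProcessorCount, prop.multiProcessorCount * 32);
    printf("Global memory:   %.2f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("L2 cache:        %d KB\n", prop.l2CacheSize / 1024);
    printf("ECC enabled:     %s\n", prop.ECCEnabled ? "yes" : "no");
    return 0;
}
```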
How is the GPU Used?
- Basic building block: the Streaming Multiprocessor (SM)
- SIMD: Single Instruction, Multiple Data
- The same instruction is issued to all cores, but each core operates on different data
- SIMD within an SM, MIMD across the GPU chip
Source: presentation by Felipe A. Cruz, Nagasaki University
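A short illustrative kernel (not from the slides) makes the SIMD idea concrete: every thread executes the same instruction stream, but the index it computes selects different data.

```c
// Illustrative SAXPY kernel: one instruction stream, many data elements.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread gets its own index
    if (i < n)
        y[i] = a * x[i] + y[i];                     // same operation, different data
}

// Example launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
// Threads within a block run SIMD-style on one SM; different blocks execute
// independently across the SMs, which is the MIMD aspect at the chip level.
```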
The Use of GPUs and Bottleneck Analysis
Source: presentation by Takayuki Aoki, Tokyo Institute of Technology
The Fermi Architecture
- 3 billion transistors
- 16 Streaming Multiprocessors (SMs)
- 6 x 64-bit memory partitions = 384-bit memory interface
- Host interface: connects the GPU to the CPU via PCI-Express
- GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
SM Architecture
- 32 CUDA cores per SM (512 total)
- 16 load/store units: source and destination addresses calculated for 16 threads per clock
- 4 Special Function Units (sine, cosine, square root, etc.)
- 64 KB of on-chip RAM split between shared memory and L1 cache (configurable)
- Dual warp scheduler
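The shared memory / L1 split is a per-kernel preference in the CUDA runtime. A minimal illustrative sketch (the kernel name is hypothetical; cudaFuncSetCacheConfig and the cudaFuncCachePrefer* values are standard runtime API):

```c
// Request how the 64 KB of per-SM RAM is partitioned for a given kernel.
#include <cuda_runtime.h>

__global__ void stencilKernel(float *out, const float *in, int n)
{
    // ... kernel body that benefits from a large shared-memory tile ...
}

void configureCache(void)
{
    // Prefer the larger shared-memory partition (48 KB shared / 16 KB L1) ...
    cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferShared);
    // ... or the larger L1 partition instead:
    // cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferL1);
}
```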
Dual Warp Scheduler
- 1 warp = 32 parallel threads
- 2 warps are issued and executed concurrently
- Each warp is issued to a group of 16 CUDA cores
- Most instructions can be dual-issued (exception: double-precision instructions)
- The dual-issue model allows near-peak hardware performance
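A small illustrative kernel (not from the slides) makes the warp arithmetic concrete: with 256 threads per block there are 256 / 32 = 8 warps per block, and the built-in warpSize constant gives each thread its warp and lane.

```c
// Identify the warp and lane of each thread; with 256-thread blocks,
// blockDim.x / warpSize = 8 warps per block are available to the schedulers.
__global__ void warpInfo(int *warpOfThread)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;      // warp index within the block (0..7 here)
    int lane = threadIdx.x % warpSize;      // lane within the warp (0..31)
    warpOfThread[tid] = warp * 100 + lane;  // packed only for easy inspection
}
```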
CUDA Core Architecture
[Diagram: Fermi SM layout: instruction cache, warp schedulers and dispatch units, register file, 32 CUDA cores, load/store units, special function units, interconnect network, 64 KB shared memory / L1 cache, uniform cache. Each CUDA core contains a dispatch port, operand collector, an FP unit, an INT unit, and a result queue.]
- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Newly designed integer ALU optimized for 64-bit and extended-precision operations
- Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision
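An illustrative device-side example (not from the slides): the CUDA math functions fma() and fmaf() map to this fused multiply-add hardware, computing a*b + c with a single rounding step.

```c
// Fused multiply-add in double precision; fmaf() is the single-precision form.
__global__ void fmaDemo(const double *a, const double *b, const double *c,
                        double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // One rounding step instead of two (multiply, then add).
        out[i] = fma(a[i], b[i], c[i]);
    }
}
```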