Palestra - USP

Transcript
  • Slide 1/44

    Product Availability Update

    Product | Inventory | Lead time for big orders | Notes
    C1060 | 200 units | 8 weeks | Build
    M1060 | 500 units | 8 weeks | Build
    S1070-400 | 50 units | 10 weeks | Build
    S1070-500 | 25 units + 75 being built | 10 weeks | Build
    M2050 | Shipping now; building 20K for Q2 | 8 weeks | Sold out thr...
    S2050 | Shipping now; building 200 for Q2 | 8 weeks | Sold out thr...
    C2050 | 2000 units | 8 weeks | Will mainta...
    M2070 | Sept 2010 | - | Get PO in now
    C2070 | Sept-Oct 2010 | - | Get PO in now
    M2070-Q | Oct 2010 | - |

    Parallel Processing on GPUs in the Fermi Architecture
    Arnaldo Tavares, Tesla Sales Manager for Latin America

  • Slide 2/44

    Quadro or Tesla?

    Computer Aided Design e.g. CATIA, SolidWorks, Siemens NX

    3D Modeling / Animation e.g. 3ds, Maya, Softimage

    Video Editing / FX e.g. Adobe CS5, Avid

    Numerical Analytics e.g. MATLAB, Mathematica

    Computational Biology e.g. AMBER, NAMD, VMD

    Computer Aided Engineering e.g. ANSYS, SIMULIA/ABAQUS

  • Slide 3/44

    GPU Computing

    CPU + GPU Co-Processing

    CPU (4 cores): 48 GigaFlops (DP)
    GPU: 515 GigaFlops (DP)
    (Average efficiency in Linpack: 50%)

  • Slide 4/44

    Application speedups with GPUs (roughly 50x to 150x):

    - 146x: Medical Imaging (U of Utah)
    - 36x: Molecular Dynamics (U of Illinois, Urbana)
    - 18x: Video Transcoding (Elemental Tech)
    - 50x: MATLAB Computing (AccelerEyes)
    - 149x: Financial Simulation (Oxford)
    - 47x: Linear Algebra (Universidad Jaime)
    - 20x: 3D Ultrasound (Techniscan)
    - 130x: Quantum Chemistry (U of Illinois, Urbana)

  • Slide 5/44

    Increasing Number of Professional CUDA Applications (legend: Available Now / Announced)

    Tools:
    MATLAB; AccelerEyes Jacket for MATLAB; Wolfram Mathematica; CUDA C/C++; PGI CUDA Fortran; PGI Accelerators; PGI CUDA-x86; CAPS HMPP; Thrust C++ Template Library; Parallel Nsight Visual Studio IDE; Allinea DDT Debugger; TAU CUDA Perf Tools; ParaTools VampirTrace; Bright Cluster Manager; Platform LSF Cluster Manager; VSG Open Inventor

    Libraries:
    CUDA FFT; CUDA BLAS; CUDA RNG & SPARSE Libraries; NVIDIA NPP Performance Primitives; NVIDIA Video Libraries; EM Photonics CULA (LAPACK); MAGMA (LAPACK)

    Oil & Gas:
    Stone Ridge RTM; Headwave Suite; Acceleware RTM Solver; GeoStar Seismic Suite; ffA SVI Pro; OpenGeoSolutions OpenSEIS; Paradigm RTM; Paradigm SKUA; Seismic City RTM; Tsunami RTM

    Bio-Chemistry:
    TeraChem; BigDFT; ABINIT; VMD; Acellera ACEMD; AMBER; DL-POLY; GROMACS; HOOMD; LAMMPS; NAMD; GAMESS; CP2K; OpenEye ROCS

    Bio-Informatics:
    CUDA-BLASTP; CUDA-EC; CUDA-MEME; CUDA-SW++ (Smith-Waterman); GPU-HMMER; HEX Protein Docking; MUMmerGPU; PIPER Docking

    CAE:
    ACUSIM AcuSolve 1.8; Autodesk Moldflow; Prometech Particleworks; Remcom XFdtd 7.0; Metacomp CFD++; LSTC LS-DYNA 971; ANSYS Mechanical

  • Slide 6/44

    Increasing Number of Professional CUDA Applications, continued (legend: Available Now / Announced)

    Medical:
    Siemens 4D Ultrasound; Digisens; Useful Progress Med...

    Rendering:
    Lightworks Artisan; Autodesk 3ds Max; NVIDIA OptiX (SDK); mental images iray (OEM); Bunkspeed Shot (iray); Refractive SW Octane; RandomControl Arion; Caustic Graphics; Weta Digital PantaRay; ILM Plume

    Video:
    Digital Anarchy Photo; Elemental Video; Fraunhofer JPEG2000; Cinnafilm Pixel Strings; Assimilate SCRATCH; The Foundry Kronos; TDVision TDVCodec; ARRI (various apps); Black Magic Da Vinci; MainConcept CUDA Encoder; GenArts Sapphire; Adobe Premiere Pro CS5; MotionDSP Ikena Video

    Finance:
    NAG RNG; SciComp SciFinance; Hanweck Options Analy...; Aquimin AlphaVision; Murex MACS; Numerix Risk; RMS Risk Mgt Solutions

    EDA:
    Synopsys TCAD; SPEAG SEMCAD X; Agilent EMPro 2010; CST Microwave; Agilent ADS SPICE; Acceleware FDTD Solver; Acceleware EM Solution; Gauda OPC; Rocketick Verilog Sim

    Other:
    Schrodinger Core Hopping; Manifold GIS; Dalsa Machine Vision; MVTec Machine Vision

  • Slide 7/44

    3 of the Top 5 Supercomputers

    [Bar chart: Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, Te...; y-axis 0 to 3000 Gigaflops]

  • Slide 8/44

    3 of the Top 5 Supercomputers

    [Bar chart: Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, Te...; y-axis 0 to 3000 Gigaflops]

  • Slide 9/44

    What if Every Supercomputer Had Fermi?

    [Chart: Linpack Teraflops of the Top 500 Supercomputers (Nov 2009), y-axis 0 to 1000]

    - Top 150: 150 GPUs, 37 TeraFlops, $740K
    - Top 100: 225 GPUs, 55 TeraFlops, $1.1M
    - Top 50: 450 GPUs, 110 TeraFlops, $2.2M

  • Slide 10/44

    Hybrid ExaScale Trajectory

    - 2008: 1 TFLOP, 7.5 KWatts
    - 2010: 1.27 PFLOPS, 2.55 MWatts
    - 2017*: 2 EFLOPS, 10 MWatts

    * This is a projection based on Moore's law and does not represent a committed roadmap.

  • Slide 11/44

    Tesla Roadmap

  • Slide 12/44

    The March of the GPUs

    [Chart 1: Peak Memory Bandwidth, GBytes/s, 2007 to 2010, y-axis 0 to 250; series: NVIDIA GPUs (T10, T20) vs x86 CPUs (Nehalem 3 GHz, Westmere 3 GHz)]

    [Chart 2: Peak Double Precision FP, GFlops/s, 2007 to 2012, y-axis 0 to 1200; series: NVIDIA GPUs (T10, T20, T20A) vs x86 CPUs (Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz)]

    Legend: NVIDIA GPU (ECC off); Double Precision: NVIDIA GPU; Double Precision: x86 CPU

  • Slide 13/44

    Project Denver

  • Slide 14/44

    Expected Tesla Roadmap with Project Denver

  • Slide 15/44

    Workstation / Data Center Solutions

    - Workstations: up to 4x Tesla C2050/70 GPUs
    - Integrated CPU-GPU server: 2x Tesla M2050 in 1U
    - OEM CPU server + Tesla S2050/70: 4 Tesla GPUs in 2U

  • Slide 16/44

    Tesla C-Series Workstation GPUs (Tesla C2050 / Tesla C2070)

    - Processor: Tesla 20-series GPU
    - Number of cores: 448
    - Caches: 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache
    - Floating point peak performance: 1030 Gigaflops (single precision), 515 Gigaflops (double precision)
    - GPU memory: C2050: 3 GB (2.625 GB with ECC on); C2070: 6 GB (5.25 GB with ECC on)
    - Memory bandwidth: 144 GB/s (GDDR5)
    - System I/O: PCIe x16 Gen2
    - Power: 238 W (max) for both
    - Availability: shipping now
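    These numbers can be confirmed on an installed board at run time. The sketch below is an illustration of mine (not from the slides) that uses only standard CUDA runtime calls and prints a few fields corresponding to the table above.

        // query_device.cu : print properties that map to the spec table above
        #include <cstdio>
        #include <cuda_runtime.h>

        int main() {
            int count = 0;
            cudaGetDeviceCount(&count);
            for (int dev = 0; dev < count; ++dev) {
                cudaDeviceProp p;
                cudaGetDeviceProperties(&p, dev);
                printf("Device %d: %s\n", dev, p.name);
                // On Fermi each SM has 32 CUDA cores, so 14 SMs = 448 cores.
                printf("  Streaming Multiprocessors : %d\n", p.multiProcessorCount);
                printf("  Global memory             : %.2f GB\n",
                       p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
                printf("  L2 cache size             : %d KB\n", p.l2CacheSize / 1024);
                printf("  ECC enabled               : %s\n", p.ECCEnabled ? "yes" : "no");
                printf("  Compute capability        : %d.%d\n", p.major, p.minor);
            }
            return 0;
        }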

  • Slide 17/44

    How is the GPU Used?

    - Basic component: the Streaming Multiprocessor (SM)
    - SIMD: Single Instruction, Multiple Data
    - The same instruction is issued to all cores, but each core operates on different data
    - SIMD within an SM, MIMD across the GPU chip

    Source: Presentation from Felipe A. Cruz, Nagasaki University
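    To make the SIMD point concrete, here is a minimal kernel sketch of mine (not from the slides): every thread in a warp executes the same instruction stream on its own element, and a data-dependent branch makes the warp execute both paths one after the other (divergence).

        // simd_example.cu : same instruction stream, different data per thread
        __global__ void clamp_negative(const float *in, float *out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks its own element
            if (i < n) {
                float v = in[i];
                // All 32 threads of a warp issue the same instructions;
                // if some threads take the branch and others do not, the warp
                // executes both paths serially (branch divergence).
                if (v < 0.0f)
                    out[i] = 0.0f;
                else
                    out[i] = v;
            }
        }

        // Host-side launch, e.g.:
        //   clamp_negative<<<(n + 255) / 256, 256>>>(d_in, d_out, n);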

  • Slide 18/44

    The Use of GPUs and Bottleneck Analysis

    Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology

  • Slide 19/44

    The Fermi Architecture

    - 3 billion transistors
    - 16 Streaming Multiprocessors (SMs)
    - 6 x 64-bit memory partitions = 384-bit memory interface
    - Host Interface: connects the GPU to the CPU via PCI-Express
    - GigaThread global scheduler: distributes thread blocks to the SM thread schedulers

  • Slide 20/44

    SM Architecture

    - 32 CUDA cores per SM (512 total)
    - 16 load/store units: source and destination addresses calculated for 16 threads per clock
    - 4 Special Function Units (sine, cosine, square root, etc.)
    - 64 KB of RAM for shared memory and L1 cache (configurable split; see the sketch below)
    - Dual Warp Scheduler
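    The 64 KB shared-memory / L1 split mentioned above is selectable per kernel through the CUDA runtime. A minimal sketch of mine (the kernel name my_kernel is a placeholder), using only the standard cudaFuncSetCacheConfig call:

        #include <cuda_runtime.h>

        // Hypothetical kernel, used only to show the per-kernel cache configuration.
        __global__ void my_kernel(float *data)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            data[i] *= 2.0f;
        }

        int main()
        {
            // On Fermi the 64 KB of on-chip RAM per SM can be split as
            // 48 KB shared / 16 KB L1 or 16 KB shared / 48 KB L1.
            cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
            // ... or prefer shared memory instead:
            // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
            return 0;
        }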

  • Slide 21/44

    Dual Warp Scheduler

    - 1 warp = 32 parallel threads
    - 2 warps are issued and executed concurrently
    - Each warp goes to 16 CUDA cores
    - Most instructions can be dual-issued (exception: double precision instructions)
    - The dual-issue model allows near-peak hardware performance

  • Slide 22/44

    CUDA Core Architecture

    [Diagram: SM block layout (scheduler, dispatch, 32 CUDA cores, load/store units, special function units, interconnect, 64 KB shared memory / L1 cache, uniform cache) and CUDA core detail (dispatch port, operand collector, FP unit, INT unit, result queue)]

    - New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
    - Newly designed integer ALU optimized for 64-bit and extended precision operations
    - Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision

  • Slide 23/44

    Fused Multiply-Add Instruction (FMA)

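    As an illustration of mine (not from the slides): in CUDA device code the FMA is reachable both explicitly, through the standard math functions fmaf/fma, and implicitly, since nvcc contracts a*x + y into an FMA by default (controlled by the -fmad=true/false option).

        __global__ void fma_demo(const float *x, const float *y, float *out, float a, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                // Explicit single-precision fused multiply-add: one rounding step.
                out[i] = fmaf(a, x[i], y[i]);
                // The plain expression a * x[i] + y[i] is normally contracted
                // into the same FMA instruction by the compiler (-fmad=true by default).
            }
        }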

  • Slide 24/44

    GigaThread™ Hardware Thread Scheduler (HTS)

    - Hierarchically manages thousands of simultaneously active threads
    - 10x faster application context switching (each program receives a time slice of processing resources)
    - Concurrent kernel execution

  • Slide 25/44

    GigaThread Hardware Thread Scheduler

    Concurrent Kernel Execution + Faster Context Switch

    [Diagram: serial vs parallel kernel execution on a time axis; kernels 1 to 5 run back-to-back in the serial case, while independent kernels overlap in the parallel case]
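    Concurrent kernel execution is exposed to the programmer through CUDA streams: kernels launched into different streams may overlap on Fermi. The following is a minimal sketch of mine (kernel_a and kernel_b are placeholder kernels, not code from the presentation).

        #include <cuda_runtime.h>

        __global__ void kernel_a(float *d, int n) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }
        __global__ void kernel_b(float *d, int n) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }

        int main()
        {
            const int n = 1 << 20;
            float *d0, *d1;
            cudaMalloc(&d0, n * sizeof(float));
            cudaMalloc(&d1, n * sizeof(float));
            cudaMemset(d0, 0, n * sizeof(float));
            cudaMemset(d1, 0, n * sizeof(float));

            cudaStream_t s0, s1;
            cudaStreamCreate(&s0);
            cudaStreamCreate(&s1);

            // Independent kernels placed in different streams may run concurrently on Fermi.
            kernel_a<<<(n + 255) / 256, 256, 0, s0>>>(d0, n);
            kernel_b<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);

            cudaDeviceSynchronize();
            cudaStreamDestroy(s0);
            cudaStreamDestroy(s1);
            cudaFree(d0);
            cudaFree(d1);
            return 0;
        }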

  • Slide 26/44

    GigaThread Streaming Data Transfer Engine (SDT)

    - Dual DMA engines
    - Simultaneous CPU-to-GPU and GPU-to-CPU data transfer
    - Fully overlapped with CPU and GPU processing time

    Activity snapshot:
    [Diagram: timeline of kernels 0 to 3 showing CPU work, SDT0 transfers, GPU execution, and SDT1 transfers overlapping]
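    From the host, the dual DMA engines are driven with asynchronous copies on separate streams and page-locked (pinned) host memory, so a host-to-device and a device-to-host transfer can be in flight at the same time. A minimal sketch of mine, assuming a placeholder kernel named process:

        #include <cuda_runtime.h>

        __global__ void process(float *d, int n) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }

        int main()
        {
            const int n = 1 << 20;
            float *h_in, *h_out, *d_in, *d_out;
            cudaMallocHost(&h_in,  n * sizeof(float));   // pinned host memory, required for async copies
            cudaMallocHost(&h_out, n * sizeof(float));
            cudaMalloc(&d_in,  n * sizeof(float));
            cudaMalloc(&d_out, n * sizeof(float));       // stands for data produced by earlier GPU work

            cudaStream_t up, down;
            cudaStreamCreate(&up);
            cudaStreamCreate(&down);

            // CPU->GPU copy on one stream, GPU->CPU copy on another: with two DMA
            // engines the transfers can overlap each other and the kernel work.
            cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, up);
            process<<<(n + 255) / 256, 256, 0, up>>>(d_in, n);
            cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, down);

            cudaDeviceSynchronize();
            cudaStreamDestroy(up);
            cudaStreamDestroy(down);
            cudaFreeHost(h_in); cudaFreeHost(h_out);
            cudaFree(d_in); cudaFree(d_out);
            return 0;
        }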

  • Slide 27/44

    Cached Memory Hierarchy

    - First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
    - Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency
    - Unified L2 cache (768 KB): fast, coherent data sharing across all cores of the GPU
    - Global memory (up to 6 GB)
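    To show where the per-SM shared memory sits in a program, here is a small block-level sum reduction sketch (my illustration, not from the slides): each block stages its elements in __shared__ memory, synchronizes, and writes one partial sum to global memory.

        #define BLOCK 256

        __global__ void block_sum(const float *in, float *partial, int n)
        {
            __shared__ float tile[BLOCK];                 // lives in the SM's shared memory / L1 block
            int i = blockIdx.x * blockDim.x + threadIdx.x;

            tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // stage one element per thread
            __syncthreads();                              // whole block sees the staged tile

            // Tree reduction within the block.
            for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
                if (threadIdx.x < stride)
                    tile[threadIdx.x] += tile[threadIdx.x + stride];
                __syncthreads();
            }

            if (threadIdx.x == 0)
                partial[blockIdx.x] = tile[0];            // one partial sum per block to global memory
        }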

  • Slide 28/44

    CUDA: Compute Unified Device Architecture

    - NVIDIA's parallel computing architecture
    - Software development platform aimed at the GPU architecture

    [Diagram: the CUDA software stack]
    - Device-level APIs: applications using DirectX (HLSL, DirectX 11 Compute); applications using OpenCL (OpenCL C, OpenCL Driver); applications using the CUDA Driver API (C for CUDA compiled to PTX (ISA))
    - Language integration: applications using C, C++, Fortran, Java, Python, ... via the C Runtime for CUDA (C for CUDA)
    - Both paths sit on the CUDA support in the kernel-level driver and the CUDA parallel compute engines inside the GPU
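    One small consequence of this layering that can be checked directly: the kernel-level driver and the C runtime are versioned separately, and the runtime reports both. A minimal sketch of mine (not from the slides):

        #include <cstdio>
        #include <cuda_runtime.h>

        int main()
        {
            int driverVersion = 0, runtimeVersion = 0;
            cudaDriverGetVersion(&driverVersion);    // CUDA version supported by the kernel-level driver
            cudaRuntimeGetVersion(&runtimeVersion);  // version of the C runtime linked into this program
            printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                   driverVersion / 1000, (driverVersion % 100) / 10,
                   runtimeVersion / 1000, (runtimeVersion % 100) / 10);
            return 0;
        }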

  • Slide 29/44

    Thread Hierarchy

    - Kernels (simple C programs) are executed by threads
    - Threads are grouped into blocks
    - Threads in a block can synchronize execution
    - Blocks are grouped into a grid
    - Blocks are independent (they must be able to execute in any order)

    Source: Presentation from Felipe A. Cruz, Nagasaki University

  • Slide 30/44

    Memory and Hardware Hierarchy

    - Threads access registers
    - CUDA cores execute threads
    - Threads within a block can share data/results via shared memory
    - Streaming Multiprocessors (SMs) execute blocks
    - Grids use global memory for result sharing (after kernel-wide global synchronization)
    - The GPU executes grids

    Source: Presentation from Felipe A. Cruz, Nagasaki University

  • Slide 31/44

    Full View of the Hierarchy Model

    CUDA   | Hardware level | Memory access
    Thread | CUDA Core      | Registers
    Block  | SM             | Shared Memory
    Grid   | GPU            | Global Memory
    Device | Node           | Host Memory

  • Slide 32/44

    IDs and Dimensions

    [Diagram: a device running Grid 1 with blocks (0,0), (0,1), (1,1), ...; Block (1,1) expanded into threads (0,0) through (2,2)]

    - Threads: 3D IDs, unique within a block
    - Blocks: 2D IDs, unique within a grid
    - Dimensions are set at launch time and can be unique for each grid
    - Built-in variables: threadIdx, blockIdx, blockDim, gridDim
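    A small sketch of how these built-in variables combine for a 2D launch (my example, not from the slides): blockIdx, blockDim, and threadIdx give each thread a unique (row, column) position, and gridDim describes how many blocks were launched.

        __global__ void fill_2d(float *img, int width, int height)
        {
            int col = blockIdx.x * blockDim.x + threadIdx.x;   // x position inside the grid
            int row = blockIdx.y * blockDim.y + threadIdx.y;   // y position inside the grid
            if (col < width && row < height)
                img[row * width + col] = (float)(row + col);
        }

        // Host-side launch with 2D blocks and a 2D grid:
        //   dim3 block(16, 16);
        //   dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        //   fill_2d<<<grid, block>>>(d_img, width, height);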

  • Slide 33/44

    Compiling C for CUDA Applications

    Example serial C code:

        void serial_function(...) { ... }
        void other_function(int ...) { ... }
        void saxpy_serial(float ...) {
            for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
        }
        void main() {
            float x;
            saxpy_serial(...);
            ...
        }

    Flow:
    - Modify the key kernels into parallel CUDA code
    - NVCC (based on Open64) splits the source: the C for CUDA key kernels are compiled into CUDA object files, while the rest of the C application goes to the host CPU compiler
    - The linker combines both sets of object files into a single CPU + GPU executable
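    In practice this split is driven by the nvcc command line. A typical Fermi-era invocation might look like the following sketch (file names are placeholders of mine; only standard nvcc options are used):

        # Compile the CUDA source (host + device parts) to an object file, targeting Fermi (sm_20).
        nvcc -arch=sm_20 -c saxpy.cu -o saxpy.o

        # Compile ordinary host code with the host compiler.
        g++ -c main.cpp -o main.o

        # Let nvcc drive the final link so the CUDA runtime library is pulled in.
        nvcc -arch=sm_20 saxpy.o main.o -o app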

  • Slide 34/44

    C for CUDA: C with a few keywords

    Standard C code:

        void saxpy_serial(int n, float a, float *x, float *y)
        {
            for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
        }

        // Invoke serial SAXPY kernel
        saxpy_serial(n, 2.0, x, y);

    Parallel C for CUDA code:

        __global__ void saxpy_parallel(int n, float a, float *x, float *y)
        {
            int i = blockIdx.x*blockDim.x + threadIdx.x;
            if (i < n) y[i] = a*x[i] + y[i];
        }

        // Invoke parallel SAXPY kernel with 256 threads/block
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
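    For completeness, here is a hedged sketch of the host code that would surround saxpy_parallel above: allocate device memory, copy the inputs over, launch, and copy the result back. This host wrapper is my addition and not part of the original slide.

        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void saxpy_parallel(int n, float a, float *x, float *y)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = a * x[i] + y[i];
        }

        int main()
        {
            const int n = 1 << 20;
            size_t bytes = n * sizeof(float);

            // Host data.
            float *h_x = (float*)malloc(bytes);
            float *h_y = (float*)malloc(bytes);
            for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

            // Device data.
            float *d_x, *d_y;
            cudaMalloc(&d_x, bytes);
            cudaMalloc(&d_y, bytes);
            cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

            // Launch with 256 threads per block, as on the slide.
            int nblocks = (n + 255) / 256;
            saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

            cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
            printf("y[0] = %f (expected 4.0)\n", h_y[0]);

            cudaFree(d_x); cudaFree(d_y);
            free(h_x); free(h_y);
            return 0;
        }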

  • Slide 35/44

    Software Programming

    Source: Presentation from Andreas Klöckner, NYU

  • Slide 36/44

    Software Programming

    Source: Presentation from Andreas Klöckner, NYU

  • Slide 37/44

    Software Programming

    Source: Presentation from Andreas Klöckner, NYU

  • Slide 38/44

    Software Programming

    Source: Presentation from Andreas Klöckner, NYU

  • Slide 39/44

    Software Programming

    Source: Presentation from Andreas Klöckner, NYU

  • Slide 40/44

    Software Programming

    Source: Presentation from Andreas Klöckner, NYU

  • Slide 41/44

    Software Programming

    Source: Presentation from Andreas Klöckner, NYU

  • Slide 42/44

    Software Programming

    Source: Presentation from Andreas Klöckner, NYU

  • Slide 43/44

    CUDA C/C++ Leadership

    Timeline 2007 to 2009 (July 07, Nov 07, April 08, Aug 08, July 09, Nov 09):

    - CUDA Toolkit 1.0: C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
    - CUDA Toolkit 1.1: Win XP 64, atomics support, multi-GPU support
    - CUDA Toolkit 2.0: double precision, compiler optimizations, Vista 32/64, Mac OSX, 3D textures, HW interpolation
    - CUDA Toolkit 2.3: DP FFT, 16-32 conversion intrinsics, performance enhancements
    - CUDA Visual Profiler 2.2, cuda-gdb HW debugger
    - Parallel Nsight Beta

  • Slide 44/44

    Why should I choose Tesla over consumer cards? (Feature -> Benefit)

    Features:
    - 4x higher double precision (on 20-series) -> higher performance for scientific ...
    - ECC only on Tesla & Quadro (on 20-series) -> data reliability inside the GPU and ...
    - Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only 1 DMA engine) -> higher performance for CUDA appl..., communication & com...
    - Larger memory for larger data sets, 3 GB and 6 GB products -> higher performance on a wide range of app..., manufacturing, FEA
    - Cluster management software tools available on Tesla only -> needed for GPU monitoring and job s... deployments
    - TCC (Tesla Compute Cluster) driver supported for Windows OS only on Tesla -> higher performance for CUDA application ... overhead; TCC adds support for ...
    - Integrated OEM workstations and servers -> trusted, reliable systems built f...
    - Professional ISVs will certify CUDA applications only on Tesla -> bug reproduction, support, feature r...

    Quality & Warranty:
    - 2 to 4 day stress testing & memory burn-in for reliability; added margin in memory and core clocks for added reliability -> built for 24/7 computing in data center an...
    - Manufactured & guaranteed by NVIDIA -> no changes in key components like GPU ...; always the same clocks for known, ...
    - 3-year warranty from HP -> reliable, long-life pr...

    Support & Lifecycle:
    - Enterprise support, higher priority for CUDA bugs and requests -> ability to influence CUDA and GPU road..., features reques...
    - 18-24 months availability + 6-month EOL notice -> reliable product s...

