Palestra - USP

  • Slide 1/44

    Product Availability Update

    Product     Inventory                          Lead time for big orders   Notes
    C1060       200 units                          8 weeks                    Build
    M1060       500 units                          8 weeks                    Build
    S1070-400   50 units                           10 weeks                   Build
    S1070-500   25 units + 75 being built          10 weeks                   Build
    M2050       Shipping now; building 20K for Q2  8 weeks                    Sold out thr…
    S2050       Shipping now; building 200 for Q2  8 weeks                    Sold out thr…
    C2050       2000 units                         8 weeks                    Will mainta…
    M2070       Sept 2010                          -                          Get PO in now
    C2070       Sept-Oct 2010                      -                          Get PO in now
    M2070-Q     Oct 2010                           -

    Parallel Processing with GPUs on the Fermi Architecture
    Arnaldo Tavares, Tesla Sales Manager for Latin America

  • Slide 2/44

    Quadro or Tesla?

    Computer Aided Design e.g. CATIA, SolidWorks, Siemens NX

    3D Modeling / Animation e.g. 3ds, Maya, Softimage

    Video Editing / FX e.g. Adobe CS5, Avid

    Numerical Analytics e.g. MATLAB, Mathematica

    Computational Biology e.g. AMBER, NAMD, VMD

    Computer Aided Engineering e.g. ANSYS, SIMULIA/ABAQUS

  • Slide 3/44

    GPU Computing

    CPU + GPU Co-Processing
    CPU (4 cores): 48 GigaFlops (DP)
    GPU: 515 GigaFlops (DP)
    (Average efficiency in Linpack: 50%)
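    To make the co-processing model concrete, here is a minimal CUDA C sketch (mine, not from the talk) in which the CPU orchestrates memory and transfers while the GPU runs the data-parallel kernel; the kernel name and sizes are illustrative.

        #include <stdio.h>
        #include <stdlib.h>
        #include <cuda_runtime.h>

        // GPU side: one thread per array element (data-parallel work).
        __global__ void saxpy(int n, float a, const float *x, float *y) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = a * x[i] + y[i];
        }

        // CPU side: allocate, copy across PCIe, launch, copy back.
        int main(void) {
            const int n = 1 << 20;
            size_t bytes = n * sizeof(float);
            float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
            for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

            float *dx, *dy;
            cudaMalloc(&dx, bytes);
            cudaMalloc(&dy, bytes);
            cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

            saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);

            cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
            printf("y[0] = %f\n", hy[0]);   // expect 5.0
            cudaFree(dx); cudaFree(dy); free(hx); free(hy);
            return 0;
        }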

  • Slide 4/44

    146X   Medical Imaging        U of Utah
    36X    Molecular Dynamics     U of Illinois, Urbana
    18X    Video Transcoding      Elemental Tech
    50X    Matlab Computing       AccelerEyes
    149X   Financial simulation   Oxford
    47X    Linear Algebra         Universidad Jaime
    20X    3D Ultrasound          Techniscan
    130X   Quantum Chemistry      U of Illinois, Urbana

    50x - 150x

  • Slide 5/44

    Increasing Number of Professional CUDA Apps

    Tools & Libraries: CUDA C/C++, PGI CUDA Fortran, PGI Accelerators, PGI CUDA-x86, CAPS HMPP, Thrust C++ Template Lib, CUDA FFT, CUDA BLAS, RNG & SPARSE CUDA Libraries, NVIDIA NPP Performance Primitives, NVIDIA Video Libraries, MAGMA (LAPACK), EMPhotonics CULAPACK, MATLAB, AccelerEyes Jacket (MATLAB), Wolfram Mathematica, Parallel Nsight (Visual Studio IDE), Allinea DDT Debugger, TauCUDA Perf Tools, ParaTools VampirTrace, Bright Cluster Manager, Platform LSF Cluster Manager

    Oil & Gas: StoneRidge RTM, Headwave Suite, Acceleware RTM Solver, GeoStar Seismic Suite, ffA SVI Pro, OpenGeoSolutions OpenSEIS, Paradigm RTM, Paradigm SKUA, Seismic City RTM, Tsunami RTM, VSG Open Inventor

    Bio-Chemistry: TeraChem, BigDFT, ABINIT, VMD, Acellera ACEMD, AMBER, DL-POLY, GROMACS, HOOMD, LAMMPS, NAMD, GAMESS, CP2K

    Bio-Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDA-SW++ (Smith-Waterman), GPU-HMMER, HEX Protein Docking, MUMmerGPU, PIPER Docking, OpenEye ROCS

    CAE: ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometech Particleworks, Remcom XFdtd 7.0, Metacomp CFD++, LSTC LS-DYNA 971, ANSYS Mechanical

    Legend: Available Now / Announced

  • Slide 6/44

    Increasing Number of Professional CUDA Apps

    Medical: Siemens 4D Ultrasound, Digisens, Useful Progress
    EDA: Synopsys TCAD, SPEAG SEMCAD X, Agilent EMPro 2010, CST Microwave, Agilent ADS SPICE, Acceleware FDTD Solver, Acceleware EM Solution, Gauda OPC, Rocketick Verilog Sim
    Finance: Aquimin AlphaVision, NAG RNG, SciComp SciFinance, Hanweck Options Analytics, Murex MACS, Numerix Risk / RMS Risk Mgt Solutions
    Rendering: Lightworks Artisan, Autodesk 3ds Max, NVIDIA OptiX (SDK), mental images iray (OEM), Bunkspeed Shot (iray), Refractive SW Octane, Random Control Arion, Caustic Graphics, Weta Digital PantaRay, ILM Plume
    Video: Digital Anarchy Photo, Elemental Video, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, The Foundry Kronos, TDVision TDVCodec, ARRI Various Apps, Black Magic Da Vinci, MainConcept CUDA Encoder, GenArts Sapphire, Adobe Premiere Pro CS5
    Other: Schrodinger Core Hopping, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, MVTec Machine Vision

    Legend: Available Now / Announced

  • Slide 7/44

    3 of Top 5 Supercomputers

    [Bar chart: Linpack performance of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, Te…; y-axis 0 to 3000, labeled Gigaflops on the slide]

  • Slide 8/44

    3 of Top 5 Supercomputers

    [Same bar chart as the previous slide]

  • Slide 9/44

    What if Every Supercomputer Had Fermi?

    [Bar chart: Linpack Teraflops of the Top 500 Supercomputers (Nov 2009), y-axis 0 to 1000]

    150 GPUs = 37 TeraFlops ($740K): enough to enter the Top 150
    225 GPUs = 55 TeraFlops ($1.1M): enough to enter the Top 100
    450 GPUs = 110 TeraFlops ($2.2M): enough to enter the Top 50

    (These figures line up with the earlier numbers: 150 GPUs x 515 Gigaflops at ~50% Linpack efficiency is roughly 37 TeraFlops.)

  • Slide 10/44

    Hybrid ExaScale Trajectory

    2008: 1 TFLOP, 7.5 KWatts
    2010: 1.27 PFLOPS, 2.55 MWatts
    2017*: 2 EFLOPS, 10 MWatts

    * This is a projection based on Moore's law and does not represent a committed roadmap

  • Slide 11/44

    Tesla Roadmap

  • Slide 12/44

    The March of the GPUs

    [Chart 1: Peak memory bandwidth (GBytes/s) by year from 2007, y-axis 0 to 250: NVIDIA T10 and T20 vs. Nehalem 3 GHz and Westmere 3 GHz]
    [Chart 2: Peak double precision GFlops/s, 2007 to 2012, y-axis 0 to 1200: NVIDIA T10, T20, and T20A vs. Nehalem 3 GHz, Westmere 3 GHz, and 8-core Sandy Bridge 3 GHz]
    Legend: NVIDIA GPU (ECC off); Double Precision: NVIDIA GPU; Double Precision: x86 CPU

  • Slide 13/44

    Project Denver

  • Slide 14/44

    Expected Tesla Roadmap with Project Denver

  • Slide 15/44

    Workstation / Data Center Solutions

    Workstations: up to 4x Tesla C2050/70 GPUs
    Integrated CPU-GPU server: 2x Tesla M2050/70 in 1U
    OEM CPU server + Tesla S2050/70: 4 Tesla GPUs in 2U

  • Slide 16/44

    Tesla C-Series Workstation GPUs

                           Tesla C2050                  Tesla C2070
    Processor              Tesla 20-series GPU          Tesla 20-series GPU
    Number of cores        448                          448
    Caches                 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache
    Peak floating point    1030 Gigaflops (single), 515 Gigaflops (double)
    GPU memory             3 GB (2.625 GB with ECC on)  6 GB (5.25 GB with ECC on)
    Memory bandwidth       144 GB/s (GDDR5)             144 GB/s (GDDR5)
    System I/O             PCIe x16 Gen2                PCIe x16 Gen2
    Power                  238 W (max)                  238 W (max)
    Availability           Shipping Now                 Shipping Now
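    Most of the table above can be read back at runtime through the CUDA runtime API. A small sketch (mine, not from the slides); the cores-per-SM factor is hard-coded for Fermi-class parts like the C2050/C2070.

        #include <stdio.h>
        #include <cuda_runtime.h>

        int main(void) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, 0);   // query device 0

            // Fermi: 32 cores per SM at compute capability 2.0, 48 at 2.1;
            // a 14-SM C2050 reports 14 * 32 = 448 cores.
            int coresPerSM = (prop.major == 2) ? (prop.minor == 0 ? 32 : 48) : 0;

            printf("Name:               %s\n", prop.name);
            printf("Compute capability: %d.%d\n", prop.major, prop.minor);
            printf("SM count:           %d\n", prop.multiProcessorCount);
            if (coresPerSM)
                printf("CUDA cores:         %d\n", prop.multiProcessorCount * coresPerSM);
            printf("Global memory:      %.3f GB\n", prop.totalGlobalMem / 1073741824.0);
            printf("L2 cache:           %d KB\n", prop.l2CacheSize / 1024);
            printf("ECC enabled:        %s\n", prop.ECCEnabled ? "yes" : "no");
            // Peak bandwidth: 2 transfers/clock (DDR) * clock (kHz) * bus width in bytes.
            printf("Peak bandwidth:     %.0f GB/s\n",
                   2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
            return 0;
        }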

  • Slide 17/44

    How is the GPU Used?

    Basic component: the Streaming Multiprocessor (SM)
    SIMD: Single Instruction, Multiple Data
    The same instruction runs on all cores, but each core can operate on different data
    SIMD at the SM level, MIMD at the GPU chip level

    Source: presentation by Felipe A. Cruz, Nagasaki University
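    One consequence of SIMD execution, sketched below with an illustrative kernel of mine: all threads of a warp share one instruction stream, so a data-dependent branch makes the warp execute both paths in sequence (branch divergence).

        // Each thread reads its own element (same instruction, different data).
        __global__ void clampNegatives(int n, float *data) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n && data[i] < 0.0f) {
                // If only some threads of a warp take this branch, the warp
                // serializes the taken and not-taken paths (divergence cost).
                data[i] = 0.0f;
            }
        }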

  • Slide 18/44

    The Use of GPUs and Bottleneck Analysis

    Source: presentation by Takayuki Aoki, Tokyo Institute of Technology

  • Slide 19/44

    The Fermi Architecture

    3 billion transistors
    16 Streaming Multiprocessors (SMs)
    6 x 64-bit memory partitions = 384-bit memory interface
    Host Interface: connects the GPU to the CPU via PCI-Express
    GigaThread global scheduler: distributes thread blocks to the SMs' thread schedulers

  • Slide 20/44

    SM Architecture

    32 CUDA cores per SM (512 total)
    16 load/store units: source and destination addresses calculated for 16 threads per clock
    4 special function units (sine, cosine, square root, etc.)
    64 KB of RAM for shared memory and L1 cache (configurable)
    Dual warp scheduler
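    The shared memory / L1 split is selectable per kernel. A hedged sketch of how a Fermi-era program might request the larger shared memory configuration; the kernel is illustrative and expects 256-thread blocks.

        #include <cuda_runtime.h>

        // Stage a block's elements in shared memory, then write them back reversed.
        __global__ void reverseBlock(float *d) {
            __shared__ float tile[256];            // carved out of the SM's 64 KB RAM
            int t = threadIdx.x;
            tile[t] = d[blockIdx.x * blockDim.x + t];
            __syncthreads();                       // whole block must finish loading
            d[blockIdx.x * blockDim.x + t] = tile[255 - t];
        }

        void configure(void) {
            // Prefer the 48 KB shared / 16 KB L1 split for this kernel on Fermi.
            cudaFuncSetCacheConfig(reverseBlock, cudaFuncCachePreferShared);
        }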

  • Slide 21/44

    Dual Warp Scheduler

    1 warp = 32 parallel threads
    2 warps are issued and executed concurrently
    Each warp goes to 16 CUDA cores
    Most instructions can be dual issued (exception: double precision instructions)
    The dual-issue model allows near-peak hardware performance
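    In CUDA source, warp membership follows directly from the thread index; a small illustrative device snippet (device-side printf requires compute capability 2.0, i.e. Fermi):

        #include <stdio.h>

        __global__ void whoAmI(void) {
            int warp = threadIdx.x / warpSize;   // warpSize is the built-in 32
            int lane = threadIdx.x % warpSize;   // position within the warp
            // Threads sharing a 'warp' value issue together; the two schedulers
            // each pick one ready warp per cycle and send it to 16 cores.
            printf("block %d, warp %d, lane %d\n", blockIdx.x, warp, lane);
        }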

  • Slide 22/44

    CUDA Core Architecture

    [Diagram: Fermi SM block with instruction cache, warp schedulers and dispatch units, register file, CUDA cores, load/store units, special function units, interconnect network, 64 KB configurable shared memory / L1 cache, and uniform cache; each CUDA core contains a dispatch port, operand collector, FP unit, INT unit, and result queue]

    New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
    Newly designed integer ALU optimized for 64-bit and extended precision operations
    Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision
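    To show what the FMA instruction buys (a sketch of mine, not from the deck): fmaf computes a*b+c with a single rounding, while a separate multiply then add rounds twice, so the two results can differ in the last bit.

        __device__ float with_fma(float a, float b, float c) {
            return fmaf(a, b, c);            // one rounding; double version: fma()
        }

        __device__ float without_fma(float a, float b, float c) {
            // __fmul_rn keeps the compiler from contracting this into an FMA,
            // so the product is rounded before the add (two roundings total).
            return __fmul_rn(a, b) + c;
        }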