Siddharth Sharma, Oct 2016 WHAT’S NEW IN CUDA 8


Page 1: WHAT’S NEW IN CUDA 8 - NVIDIA (on-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdf)

Siddharth Sharma, Oct 2016

WHAT’S NEW IN CUDA 8

Page 2

WHAT’S NEW IN CUDA 8

Why Should You Care

>2X

Run Computations Faster*

Solve Larger Problems**

Critical Path Analysis

* HOOMD Blue v1.3.3 Lennard-Jones liquid benchmark

• K80 and P100 (PCIe); Base clocks; 4 GPUs per PCIe root complex

• 2x K80 indicates 2-GPU configuration (or 1x K80 board)

• CUDA 8 GA with r361.79 (K80) and r361.93.02 (P100)

• Host System: Intel Xeon Broadwell dual socket 22-core E5-2699 v4 @ 2.2GHz, 3.6GHz Turbo with CentOS 7.2 x86-64 and 256GB memory

** HPGMG-FV benchmark

• K80 and P100 (PCIe); Base clocks; 4 GPUs per PCIe root complex

• 2x K80 indicates 2-GPU configuration (or 1x K80 board)

• CUDA 8 GA with r361.79 (K80) and r361.93.02 (P100)

• Host System: Intel Xeon Broadwell dual socket 22-core E5-2699 v4 @ 2.2GHz, 3.6GHz Turbo with CentOS 7.2 x86-64 and 256GB memory

Page 3

WHAT’S NEW IN CUDA 8

LIBRARIES

• New nvGRAPH library

• Support for FP16, INT8

UNIFIED MEMORY

• Demand Paging

• New Tuning APIs

• Data Coherence & Atomics

PASCAL ARCHITECTURE

• NVLINK

• HBM2 Stacked Memory

• Page Migration Engine

DEVELOPER TOOLS

• Critical Path Analysis

• NVCC Compile Time

• OpenACC Profiling

Page 4

PASCAL ARCHITECTURE

Page 5

PASCAL ARCHITECTURE

Webinar: “Inside Pascal”

Mark Harris (NVIDIA), Lars Nyland (NVIDIA)

GPU Technology Conference 2016 - ID S6176

Page 6

CUDA 8 ON P100: >3X FASTER THAN CPUs

[Chart: application throughput speedup vs. CPU, y-axis 0x to 35x. Benchmarks: LAMMPS, NAMD, HPCG (128), HPCG (256), HOOMD Blue, QUDA (Average), MiniFe. Series: E5-2699v4, 2x K80, 1x P100, 2x P100, 4x P100, 8x P100]

Performance may vary based on OS and software versions, and motherboard configuration

• K80 and P100 (PCIe); Base clocks; 4 GPUs per PCIe root complex

• CUDA 8 GA with r361.79 (K80) and r361.93.02 (P100)

• Host System: Intel Xeon Broadwell dual socket 22-core E5-2699 v4 @ 2.2GHz, 3.6GHz Turbo with CentOS 7.2 x86-64 and 256GB memory

Page 7

UNIFIED MEMORY

Page 8

UNIFIED MEMORY

Past Developer View

Implicit Memory Management

Starting with Kepler and CUDA 6

[Diagram: two panels, each showing a CPU and a Pascal GPU. Labels: System Memory, GPU Memory, Unified Memory]

Page 9

APPLICATIONS: LARGE VARIATIONS IN DATASET SIZES

• Ray-tracing: Larger scenes to render

• Graph Analysis: Larger datasets

• Combustion: More species & improved accuracy

• Quantum Chemistry: Larger systems

Page 10

CUDA 8: PASCAL UNIFIED MEMORY

Easier Memory Management, APIs for High Performance

ENABLE LARGE DATA MODELS

• Oversubscribe GPU memory

• Allocate up to system memory size

TUNE UNIFIED MEMORY PERFORMANCE

• APIs for Pre-fetching & Read duplication

• Usage hints via cudaMemAdvise API

SIMPLER DATA ACCESS

• CPU/GPU Data coherence

• Unified memory atomic operations

[Diagram: CUDA 8 Unified Memory shared by a CPU and a Pascal GPU. Caption: Allocate Beyond GPU Memory Size]

Page 11

CUDA 8 UNIFIED MEMORY — EXAMPLE

64 GB unified memory allocation on P100 with 16 GB physical memory

Transparent – No API changes

Works on Pascal & future architectures

Allocating 4x more than P100 physical memory

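void foo() {
    // Allocate 64 GB. The ULL suffix forces 64-bit arithmetic;
    // 64*1024*1024*1024 evaluated as int would overflow.
    char *data;
    size_t size = 64ULL * 1024 * 1024 * 1024;
    cudaMallocManaged(&data, size);
}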

Page 12

CUDA 8 UNIFIED MEMORY — EXAMPLE

Both CPU code and CUDA kernel accessing ‘data’ simultaneously

Possible with CUDA 8 unified memory on Pascal

Accessing data simultaneously by CPU and GPU codes

__global__ void mykernel(char *data) {
    data[1] = 'g';
}

void foo() {
    char *data;
    cudaMallocManaged(&data, 2);

    mykernel<<<...>>>(data);
    // no synchronize here
    data[0] = 'c';

    cudaFree(data);
}

Webinar: “CUDA 8 Unified Memory”

Page 13

>3X SPEEDUP WITH UNIFIED MEMORY

[Chart: HPGMG AMR application throughput (Million DOF/s, y-axis 0 to 180) vs. Working Set (GB) at 1.4, 4.7, 8.6, 28.9, 58.6. Series: CPU only, K40, P100 (Unified Memory), P100 (Unified Memory with hints). Annotations: All 5 levels fit in GPU memory; Only 2 levels fit; Only 1 level fits; P100 GPU Memory Size (16GB)]

Performance may vary based on OS and software versions, and motherboard configuration

• HPGMG AMR on 1xK40, 1xP100 (PCIe) with CUDA 8 (r361)

• CPU measurements with Intel Xeon Haswell dual socket 10-core E5-2650 v3 @ 2.3 GHz, 3.0 GHz Turbo, HT on

• Host System: Intel Xeon Haswell dual socket 16-core E5-2630 v3 @ 2.4GHz, 3.2GHz Turbo

Page 14

LIBRARIES

Page 15

GRAPH ANALYTICS

Insight from connections in big data

GENOMICS

CYBER SECURITY / NETWORK ANALYTICS

SOCIAL NETWORK ANALYSIS

… and much more: Parallel Computing, Recommender Systems, Fraud Detection, Voice Recognition, Text Understanding, Search

Image credits: Wikimedia Commons, Circos.ca

Page 16

https://developer.nvidia.com/nvGRAPH

nvGRAPH

Parallel Library for Interactive and High Throughput Graph Analytics

Solve graphs with up to 2.5 Billion edges on a single GPU (Tesla M40)

Includes PageRank, Single Source Shortest Path and Single Source Widest Path algorithms

Semi-ring SpMV operations provide building blocks for graph traversal algorithms

GPU Accelerated Graph Analytics

PageRank: Search; Recommendation Engines; Social Ad Placement

Single Source Shortest Path: Robotic Path Planning; Power Network Planning; Logistics & Supply Chain Planning

Single Source Widest Path: IP Routing; Chip Design / EDA; Traffic sensitive routing
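To make the semi-ring SpMV idea on this slide concrete, here is a minimal host-only C++ sketch, not nvGRAPH code. It assumes a hypothetical `CsrGraph` container and hypothetical function names, and it swaps the usual (+, ×) of a sparse matrix-vector product for (min, +), which turns repeated SpMV into a Bellman-Ford style shortest-path relaxation.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical CSR container (an assumption, not the nvGRAPH data structure).
// Row i lists the incoming edges of vertex i: col_indices holds the edge
// sources and weights holds the edge weights.
struct CsrGraph {
    int n;
    std::vector<int> row_offsets;   // size n + 1
    std::vector<int> col_indices;
    std::vector<float> weights;
};

// y[i] = min_j (A[i][j] + x[j]): an SpMV where (+, *) is replaced by (min, +).
std::vector<float> min_plus_spmv(const CsrGraph& g, const std::vector<float>& x) {
    const float inf = std::numeric_limits<float>::infinity();
    std::vector<float> y(g.n, inf);
    for (int i = 0; i < g.n; ++i) {
        for (int e = g.row_offsets[i]; e < g.row_offsets[i + 1]; ++e) {
            y[i] = std::min(y[i], g.weights[e] + x[g.col_indices[e]]);
        }
    }
    return y;
}

// Single-source shortest paths by repeated min-plus SpMV (Bellman-Ford style).
std::vector<float> sssp_min_plus(const CsrGraph& g, int source) {
    assert(source >= 0 && source < g.n);
    const float inf = std::numeric_limits<float>::infinity();
    std::vector<float> dist(g.n, inf);
    dist[source] = 0.0f;
    for (int iter = 0; iter + 1 < g.n; ++iter) {
        std::vector<float> relaxed = min_plus_spmv(g, dist);
        bool changed = false;
        for (int i = 0; i < g.n; ++i) {
            if (relaxed[i] < dist[i]) {
                dist[i] = relaxed[i];
                changed = true;
            }
        }
        if (!changed) break;  // distances have converged
    }
    return dist;
}
```

Replacing (min, +) with (max, min) in `min_plus_spmv` would give the analogous relaxation step for a widest-path search, which is why a single SpMV kernel parameterized by the semi-ring can serve several traversal algorithms.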

Page 17

> 200X SPEEDUP ON PAGERANK VS GALOIS

Performance may vary based on OS and software versions, and motherboard configuration

• nvGRAPH on M40 (ECC ON, r352), P100 (r361), Base clocks, input and output data on device

• GraphMat, Galois (v2.3) on Intel Xeon Broadwell dual-socket 22-core/socket E5-2699 v4 @ 2.22GHz, 3.6GHz Turbo

• Comparing Average Time per Iteration (ms) for PageRank

• Host System: Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo

• CentOS 7.2 x86-64 with 128GB System Memory

[Chart: PageRank speedup vs. Galois, y-axis 0x to 250x, on the soc-LiveJournal and Twitter datasets. Series: Galois (1x), GraphMat, M40, P100]

Page 18

HIGHER THROUGHPUT THROUGH LOWER PRECISION COMPUTATION

• Deep Learning: cuBLAS FP16 and INT8 GEMMs

• Radio Astronomy: cuFFT native FP16 operations

• Fluid Dynamics: cuSPARSE FP16 CSRMV
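As a rough illustration of what lower-precision computation means for something like an INT8 GEMM, the host-only C++ sketch below quantizes floats to 8-bit integers, accumulates their products in 32 bits, and rescales the result. The function names and the symmetric scaling scheme are assumptions chosen for illustration; this is not the cuBLAS API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric quantization (an assumed scheme): q = round(x / scale),
// clamped to [-127, 127] so every value fits in a signed 8-bit integer.
std::vector<std::int8_t> quantize_int8(const std::vector<float>& v, float scale) {
    assert(scale > 0.0f);
    std::vector<std::int8_t> q;
    q.reserve(v.size());
    for (float x : v) {
        long r = std::lround(x / scale);
        r = std::max(-127L, std::min(127L, r));
        q.push_back(static_cast<std::int8_t>(r));
    }
    return q;
}

// Dot product of 8-bit inputs with a 32-bit accumulator, so the sum of
// products does not overflow the narrow input type.
std::int32_t dot_int8(const std::vector<std::int8_t>& a,
                      const std::vector<std::int8_t>& b) {
    assert(a.size() == b.size());
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return acc;
}

// Map the integer result back to the floating-point range of the inputs.
float dequantize_dot(std::int32_t acc, float scale_a, float scale_b) {
    return static_cast<float>(acc) * scale_a * scale_b;
}
```

Each input element occupies one byte instead of four, so more values move per memory transaction; the cost is the rounding and clamping error introduced by `quantize_int8`.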

Page 19

DEVELOPER TOOLS

Page 20

DEPENDENCY ANALYSIS

The longest running kernel is not always the most critical optimization target

Easily find the critical kernel to optimize

[Diagram: timeline with a CPU row (A, wait, B, wait) and a GPU row (Kernel X, 5%; Kernel Y, 40%), with an "Optimize Here" marker]
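The claim that the longest kernel is not always the one to optimize can be checked with a small critical-path calculation. The host-only C++ sketch below is not the profiler's algorithm; it assumes a hypothetical `Activity` record and made-up durations, and finds the longest dependency chain through a timeline in which a short kernel blocks the CPU while a much longer kernel overlaps with CPU work.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Activity {
    std::string name;
    double duration;          // arbitrary time units (made-up values)
    std::vector<int> deps;    // indices of activities that must finish first
};

// Longest (critical) path through activities listed in topological order,
// i.e. every dependency index is smaller than the activity's own index.
std::vector<std::string> critical_path(const std::vector<Activity>& acts) {
    std::vector<double> finish(acts.size(), 0.0);
    std::vector<int> pred(acts.size(), -1);
    int last = -1;
    for (int i = 0; i < static_cast<int>(acts.size()); ++i) {
        double start = 0.0;
        for (int d : acts[i].deps) {
            assert(d >= 0 && d < i);
            // The dependency that finishes latest determines the start time.
            if (finish[d] >= start) {
                start = finish[d];
                pred[i] = d;
            }
        }
        finish[i] = start + acts[i].duration;
        if (last < 0 || finish[i] > finish[last]) last = i;
    }
    // Walk back from the latest-finishing activity to recover the path.
    std::vector<std::string> path;
    for (int i = last; i >= 0; i = pred[i]) path.push_back(acts[i].name);
    std::reverse(path.begin(), path.end());
    return path;
}

// Hypothetical timeline: the CPU waits for kernel_X before running cpu_B,
// while the much longer kernel_Y overlaps with cpu_B.
std::vector<Activity> example_timeline() {
    return {
        {"cpu_A",    10.0, {}},
        {"kernel_X",  5.0, {0}},
        {"kernel_Y", 40.0, {1}},
        {"cpu_B",    50.0, {1}},
        {"cpu_sync",  1.0, {2, 3}},
    };
}
```

In this made-up timeline, `kernel_Y` is eight times longer than `kernel_X`, yet only `kernel_X` lies on the critical path, so shortening `kernel_Y` would leave the end-to-end time unchanged.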

Page 21

IDENTIFY BOTTLENECKS ON CRITICAL PATH

Visual Profiler and NVPROF

Unguided Analysis

Dependency Analysis

Functions on critical path

Page 22

IDENTIFY BOTTLENECKS ON CRITICAL PATH

Visual Profiler and NVPROF

Inbound dependencies

Launch copy_kernel MemCpy HtoD [sync]

Outbound dependencies

MemCpy DtoH [sync]

Page 23

OpenACC PROFILING

OpenACC->Driver API->Compute correlation

OpenACC->Source Code correlation

OpenACC timeline

OpenACC Properties

Page 24

PROFILE CPU CODE + GPU CODE IN VISUAL PROFILER

Profile execution times on host function calls

View CPU code function hierarchy

Page 25

PROFILE UNIFIED MEMORY

Webinar: “CUDA 8 Tools Webinar”

MONITOR NVLINK BANDWIDTH

Page 26

COMPILE NVCC 2X FASTER

Improved Developer Productivity

Performance may vary based on OS and software versions, and motherboard configuration

• Average total compile times (per translation unit)

• Host system: Intel Core i7-3930K 6-cores @ 3.2GHz

• CentOS x86_64 Linux release 7.1.1503 (Core) with GCC 4.8.3 20140911

• GPU target architecture sm_52

Webinar: “CUDA 8 Performance Report”

Page 27

NEW PLATFORMS SUPPORTED

Platform | Operating Systems | Compilers

Windows | Windows Server 2012 R2 | Microsoft Visual Studio 2015 Update 3 and VS Community 2015

Linux | Fedora 23, Ubuntu 16.04, SLES 1 | PGI C++ 16.1/16.4, Clang 3.7, ICC 16.0

Mac | OS X 10.12 | GCC 5.x

Page 28

WHAT’S NEW IN CUDA 8

LIBRARIES

• New nvGRAPH library

• Support for FP16, INT8

UNIFIED MEMORY

• Demand Paging

• New Tuning APIs

• Data Coherence & Atomics

PASCAL ARCHITECTURE

• NVLINK

• HBM2 Stacked Memory

• Page Migration Engine

DEVELOPER TOOLS

• Critical Path Analysis

• NVCC Compile Time

• OpenACC Profiling

Page 29

CUDA 8 – DOWNLOAD TODAY!

• CUDA Applications in your Industry: www.nvidia.com/object/gpu-applications-domain.htm

• Additional Webinars:

• Inside PASCAL

• CUDA 8 Performance Report

• CUDA 8 Tools

• CUDA 8 Unified Memory

• CUDA 8 Release Notes: www.docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#abstract

Everything You Need to Accelerate Applications

developer.nvidia.com/cuda-toolkit

Page 30

THANK YOU