Portable parallelism in Diderot John Reppy University of Chicago December 2, 2011


Portable parallelism in Diderot

John Reppy

University of Chicago

December 2, 2011


Portable parallelism

The challenge of portable parallelism

Wide range of parallel hardware with varying memory architectures:
- CMP (multicore): shared cache, uniform shared memory.
- SMP: separate caches, non-uniform shared memory (NUMA).
- GPUs: wide-vector instructions, explicit memory hierarchy, distributed memory.
- Cluster: separate caches and distributed memory.
- Supercomputer: specialized interconnects, heterogeneous architectures, etc.

And we have a wide range of parallel applications.


Portable parallel programming

[Figure: applications (image analysis, video encoding, ray tracer, portfolio valuation, computer game) connected through programs to platforms (supercomputers, clusters, SMP/CMP, GPU).]

The Ideal: write once, run everywhere, for any application


The Reality: write once per platform per application


Manticore: restrict platforms


Diderot: restrict applications (also Spiral and Delite)


Diderot

Diderot is a cross-discipline project involving
- Biomedical image analysis and visualization
- Programming language design and implementation

Plan: use ideas from programming languages to improve the state of the art in image analysis and visualization.

Joint work with Gordon Kindlmann and students Charisee Chiw, Lamont Samuels, and Nick Seltzer.


Why image analysis is important

[Figure: diagram relating a physical object, image data, and a computational representation, with stages labeled Imaging, Analysis, and Visualization.]

Scientists need tools to extract structure from many kinds of image data.


Image analysis and visualization

- We are interested in a class of algorithms that compute geometric properties of objects from imaging data.
- These algorithms compute over a continuous tensor field that is reconstructed from discrete data using a separable convolution kernel.

[Figure: discrete image data shown next to the continuous field reconstructed from it.]


Image analysis and visualization (continued ...)

Examples include
- Direct volume rendering (requires reconstruction, derivatives)
- Fiber tractography (requires tensor fields)
- Particle systems (requires dynamic numbers of computational elements)


DSL for image analysis

Diderot is a parallel DSL for image analysis and visualization algorithms.

We have two main design goals for Diderot:
- Provide a high-level mathematical programming model that abstracts away from discrete image data and the target architecture.
- Use domain knowledge to get good performance on a range of parallel platforms, without requiring an understanding of parallel programming.

Note: Diderot is not an embedded DSL.


Diderot programming model

- The Diderot programming model is based on a collection of mostly autonomous strands that are embedded in a continuous tensor field.
- Each strand has a state and an update method, which encapsulates the computational kernel of the algorithm.
- Diderot abstracts away from details such as the discrete image data, the representation of reals (float vs double), and the target machine (e.g., CPU vs GPU).
- The computational kernel of a Diderot program is expressed using the concepts and direct-style notation of tensor calculus. These include tensor operations (•, ×) and higher-order field operations (∇), etc.
- No shared mutable state.


Diderot parallelism model

Bulk-synchronous parallel with “deterministic” semantics.

[Figure: strands advancing through successive execution steps separated by global computation; labels: strand state, update, idle, read, spawn, new, die.]

Note: the current language implements a subset of this model.
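The bulk-synchronous model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Diderot runtime: the `Strand` class, the halving `update` kernel, and the status names are hypothetical, and spawning and global computation are left out.

```python
from dataclasses import dataclass

ACTIVE, STABLE, DEAD = "active", "stable", "dead"

@dataclass
class Strand:
    value: float            # hypothetical per-strand state
    status: str = ACTIVE

def update(strand):
    """Hypothetical kernel: die on negative input, else halve until below 1."""
    if strand.value < 0:
        return Strand(strand.value, DEAD)
    new_value = strand.value / 2.0
    return Strand(new_value, STABLE if new_value < 1.0 else ACTIVE)

def run(strands, max_steps=100):
    """Run execution steps until no strand is active; returns (strands, steps)."""
    steps = 0
    while steps < max_steps and any(s.status == ACTIVE for s in strands):
        # Each active strand computes its next state from its own current
        # state only; all new states are installed together at the step end.
        strands = [update(s) if s.status == ACTIVE else s for s in strands]
        # Strands that died are dropped before the next step begins.
        strands = [s for s in strands if s.status != DEAD]
        steps += 1
    return strands, steps
```

Because no strand reads another strand's in-progress state, the list comprehension could be evaluated in any order, or in parallel, with the same result.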


Example: Curvature

    field#2(3)[] F = bspln3 ⊛ load("quad-patches.nrrd");
    field#0(2)[3] RGB = tent ⊛ load("2d-bow.nrrd");
    · · ·
    strand RayCast (int ui, int vi) {
        · · ·
        update {
            · · ·
            vec3 grad = -∇F(pos);
            vec3 norm = normalize(grad);
            tensor[3,3] H = ∇⊗∇F(pos);
            tensor[3,3] P = identity[3] - norm⊗norm;
            tensor[3,3] G = -(P•H•P)/|grad|;
            real disc = sqrt(2.0*|G|^2 - trace(G)^2);
            real k1 = (trace(G) + disc)/2.0;
            real k2 = (trace(G) - disc)/2.0;
            vec3 matRGB = // material RGBA
                RGB([max(-1.0, min(1.0, 6.0*k1)),
                     max(-1.0, min(1.0, 6.0*k2))]);
            · · ·
        }
        · · ·
    }

[Figure: two-dimensional colormap indexed by (k1, k2), with corners labeled (-1,-1) and (1,1).]
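To make the curvature arithmetic concrete, here is a small pure-Python transcription of the update body's formulas. It is a sketch under assumptions: the gradient and Hessian are passed in as plain lists rather than probed from a field, `|G|` is taken to be the Frobenius norm, and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def principal_curvatures(grad_f, hess_f):
    """grad_f is ∇F at a point, hess_f is ∇⊗∇F there; returns (k1, k2)."""
    grad = [-g for g in grad_f]                      # grad = -∇F(pos)
    mag = math.sqrt(sum(g * g for g in grad))        # |grad|, assumed nonzero
    norm = [g / mag for g in grad]
    # P projects onto the plane tangent to the isosurface.
    P = [[(1.0 if i == j else 0.0) - norm[i] * norm[j] for j in range(3)]
         for i in range(3)]
    PHP = matmul(matmul(P, hess_f), P)
    G = [[-PHP[i][j] / mag for j in range(3)] for i in range(3)]
    trace = G[0][0] + G[1][1] + G[2][2]
    frob_sq = sum(G[i][j] ** 2 for i in range(3) for j in range(3))
    # max() guards against a slightly negative value from rounding.
    disc = math.sqrt(max(0.0, 2.0 * frob_sq - trace ** 2))
    return (trace + disc) / 2.0, (trace - disc) / 2.0
```

For F = x² + y² + z² at (1, 0, 0), the gradient is (2, 0, 0) and the Hessian is 2I, so both curvatures have magnitude 1, as expected for a unit sphere; the sign follows the `grad = -∇F` convention used in the slide.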


Example: 2D Isosurface

    · · ·
    strand sample (int ui, int vi) {
        output vec2 pos = · · ·;
        // set isovalue to closest of 50, 30, or 10
        real isoval = 50.0 if F(pos) >= 40.0
                 else 30.0 if F(pos) >= 20.0
                 else 10.0;
        int steps = 0;
        update {
            if (!inside(pos, F) || steps > stepsMax)
                die;
            vec2 grad = ∇F(pos);
            // delta = Newton-Raphson step
            vec2 delta = normalize(grad) * (F(pos) - isoval)/|grad|;
            if (|delta| < epsilon)
                stabilize;
            pos = pos - delta;
            steps = steps + 1;
        }
    }
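The same Newton-Raphson iteration can be traced outside Diderot with an analytic field. The sketch below is an assumption-laden stand-in: `F`, `grad_F`, and `inside` are ordinary Python callables supplied by the caller, and returning `None` plays the role of `die` while returning a position plays the role of `stabilize`.

```python
import math

def find_isocontour(F, grad_F, pos, inside, epsilon=1e-9, steps_max=50):
    """Move pos onto the nearest of the isovalues 50, 30, or 10."""
    x, y = pos
    value = F(x, y)
    isoval = 50.0 if value >= 40.0 else 30.0 if value >= 20.0 else 10.0
    steps = 0
    while True:
        if not inside(x, y) or steps > steps_max:
            return None                              # die
        gx, gy = grad_F(x, y)
        mag = math.hypot(gx, gy)                     # assumed nonzero
        # normalize(grad) * (F - isoval) / |grad|  ==  grad * (F - isoval) / |grad|^2
        scale = (F(x, y) - isoval) / (mag * mag)
        dx, dy = gx * scale, gy * scale
        if math.hypot(dx, dy) < epsilon:
            return (x, y), isoval                    # stabilize
        x, y = x - dx, y - dy
        steps += 1
```

With F(x, y) = x² + y² and a start at (3, 4), the field value is 25, the chosen isovalue is 30, and the position moves radially outward until it sits on the circle of radius √30.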


Implementation issues

Diderot compiler and runtime

- Compiler is 21,000 lines of SML (2,500 in front-end).
- Multiple backends: vectorized C and OpenCL (CUDA under construction).
- Multiple runtimes: Sequential C, Parallel C, OpenCL.
- Designed to generate libraries, but also supports standalone executables.


Probing tensor fields

A probe gets compiled down into code that maps the world-space coordinates to image space and then convolves the image values in the neighborhood of the position.

[Figure: discrete image data V convolved with the kernel h (V ⊛ h) gives the continuous field F.]

In 2D, the reconstruction is (recall that h is separable)

$$
F(x) = \sum_{i=1-s}^{s} \sum_{j=1-s}^{s} V[n + \langle i, j \rangle] \, h(f_x - i) \, h(f_y - j)
$$

where $s$ is the support of $h$, $n = \lfloor M^{-1} x \rfloor$ and $f = M^{-1} x - n$.
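The reconstruction formula translates directly into a double loop. The following is a minimal sketch rather than the code the compiler emits: it assumes M is the identity (world space equals image space), uses a tent kernel with support s = 1, and `probe2d` is a hypothetical name.

```python
import math

def tent(t):
    """Tent (linear) kernel with support s = 1."""
    return max(0.0, 1.0 - abs(t))

def probe2d(V, x, y, h=tent, s=1):
    """F(x) = sum over i, j in 1-s..s of V[n + <i,j>] * h(fx - i) * h(fy - j).

    V is a nested list indexed V[ix][iy]. The caller must keep (x, y) far
    enough from the border that every index n + <i,j> is in range.
    """
    nx, ny = math.floor(x), math.floor(y)   # n = floor(M^-1 x), with M = identity
    fx, fy = x - nx, y - ny                 # f = M^-1 x - n
    total = 0.0
    for i in range(1 - s, s + 1):
        for j in range(1 - s, s + 1):
            total += V[nx + i][ny + j] * h(fx - i) * h(fy - j)
    return total
```

With the tent kernel this reduces to bilinear interpolation, so samples of a linear function such as V[i][j] = i + 2j are reproduced exactly between grid points.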


Probing tensor fields (continued ...)

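In general, generating the probe operations is more challenging. The first step is to normalize field expressions. For example,

$$
\nabla(s * (V \circledast h)) \Rightarrow s * (\nabla(V \circledast h)) \Rightarrow s * (V \circledast (\nabla h))
$$

In the implementation, we view $\nabla$ as a “tensor” of partial-derivative operators

$$
\nabla = \begin{bmatrix} \frac{\partial}{\partial x} \\ \frac{\partial}{\partial y} \end{bmatrix}
\qquad
\nabla \otimes \nabla = \begin{bmatrix}
\frac{\partial^2}{\partial x^2} & \frac{\partial^2}{\partial x \, \partial y} \\
\frac{\partial^2}{\partial x \, \partial y} & \frac{\partial^2}{\partial y^2}
\end{bmatrix}
$$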
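The two rewrites in the normalization example can be expressed as a tiny term rewriter. This is a hypothetical illustration, not the compiler's actual intermediate representation: expressions are encoded as nested tuples such as `("grad", e)`, `("scale", s, e)`, and `("conv", V, h)`, and only these two rules are implemented.

```python
def normalize(e):
    """Push ("grad", ...) inward until it applies to the kernel."""
    if not isinstance(e, tuple):
        return e                                   # a leaf such as "V" or "h"
    op, *args = e
    if op == "grad":
        inner = normalize(args[0])
        if isinstance(inner, tuple) and inner[0] == "scale":
            # ∇(s * F)  =>  s * ∇F
            return ("scale", inner[1], normalize(("grad", inner[2])))
        if isinstance(inner, tuple) and inner[0] == "conv":
            # ∇(V ⊛ h)  =>  V ⊛ (∇h)
            return ("conv", inner[1], ("grad", inner[2]))
        return ("grad", inner)
    return (op, *[normalize(a) for a in args])
```

Applying `normalize` to `("grad", ("scale", "s", ("conv", "V", "h")))` yields `("scale", "s", ("conv", "V", ("grad", "h")))`, mirroring the chain of rewrites shown above.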


Probing tensor fields (continued ...)

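Each component in the partial-derivative tensor corresponds to a component in the result of the probe.

$$
V \circledast (\nabla h)
= V \circledast \begin{bmatrix} \frac{\partial}{\partial x} h \\ \frac{\partial}{\partial y} h \end{bmatrix}
= \begin{bmatrix}
\sum_{i=1-s}^{s} \sum_{j=1-s}^{s} V[n + \langle i, j \rangle] \, h'(f_x - i) \, h(f_y - j) \\
\sum_{i=1-s}^{s} \sum_{j=1-s}^{s} V[n + \langle i, j \rangle] \, h(f_x - i) \, h'(f_y - j)
\end{bmatrix}
$$

A later stage of the compiler expands out the evaluations of $h$ and $h'$.

Probing code has high arithmetic intensity and is a good candidate for vectorization and GPUs.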
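A gradient probe built from h and h' might look like the sketch below. It rests on several assumptions: M is the identity, the kernel is the uniform cubic B-spline (taken here to be what `bspln3` denotes, with support s = 2), and the function names are hypothetical.

```python
import math

def bspln3(t):
    """Uniform cubic B-spline kernel, support s = 2."""
    a = abs(t)
    if a < 1.0:
        return 2.0 / 3.0 - a * a + 0.5 * a ** 3
    if a < 2.0:
        return (2.0 - a) ** 3 / 6.0
    return 0.0

def bspln3_deriv(t):
    """First derivative of the cubic B-spline kernel."""
    a = abs(t)
    if a < 1.0:
        return -2.0 * t + 1.5 * t * a
    if a < 2.0:
        sign = 1.0 if t >= 0 else -1.0
        return -sign * (2.0 - a) ** 2 / 2.0
    return 0.0

def probe_grad2d(V, x, y, s=2):
    """Return (dF/dx, dF/dy) at (x, y) for the field V ⊛ bspln3."""
    nx, ny = math.floor(x), math.floor(y)
    fx, fy = x - nx, y - ny
    gx = gy = 0.0
    for i in range(1 - s, s + 1):
        for j in range(1 - s, s + 1):
            v = V[nx + i][ny + j]
            gx += v * bspln3_deriv(fx - i) * bspln3(fy - j)   # h'(fx - i) h(fy - j)
            gy += v * bspln3(fx - i) * bspln3_deriv(fy - j)   # h(fx - i) h'(fy - j)
    return gx, gy
```

Each output component reuses the same 4×4 neighborhood of samples with different kernel weights, which is one way to see why this code has high arithmetic intensity.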


Targeting GPUs

- Standard GPGPU programming models (CUDA and OpenCL) are low-level and expose hardware details.
- Diderot frees the programmer from those issues, but the compiler and runtime must still handle them.
- We need to be smart about memory access and divergence.


NVIDIA’s Fermi architecture

[Figure: multi-processor compute units attached to a shared L2 cache (768 KB) and shared global memory.]

- Multi-processor compute units share L2 cache and global memory.
- Single-Instruction, Multiple-Thread execution model.
- Each warp (32 threads) executes the same instruction.
- Predication used to handle divergent control flow.
- Each compute unit runs its own warps.


Fermi Compute Unit (CUDA 2.0)

Dispatch two half-warps per clock.

[Figure: compute unit with an instruction cache, two warp scheduler/dispatch units, two groups of 16 cores, 16 load/store units, a 32K by 32-bit register file (holds thread state), and 64 KB of L1 cache/local memory.]


OpenCL Parallelism Model

Grid of work items (threads) organized into work groups.

[Figure: grid of work items partitioned into work groups.]

Standard approach: map data to the grid and run data-parallel computation. This approach does not work well for irregular workloads.


Persistent threads

[Hoberock et al. 2009; Parker 2010; Wald 2011]

Instead of using the GPU scheduler, each workgroup runs a 32-wide parallel strand scheduler (64-wide on AMD hardware).

[Figure: a work queue of strand blocks feeding workgroups of 32×1 work items (1 warp each).]

Each scheduler runs strand update methods until there are no more blocks.
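The persistent-threads idea can be imitated on a CPU with a shared queue and a fixed pool of long-lived workers. This is only an analogy under assumptions: Python threads stand in for workgroups, a `queue.Queue` stands in for the GPU work queue, and `run_persistent_workers` is a hypothetical name.

```python
import queue
import threading

def run_persistent_workers(blocks, update, num_workers=4):
    """Each worker repeatedly grabs a strand block until the queue is empty."""
    work = queue.Queue()
    for block in blocks:
        work.put(block)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                block = work.get_nowait()      # grab the next strand block
            except queue.Empty:
                return                         # no more blocks: worker exits
            updated = [update(s) for s in block]
            with lock:
                results.append(updated)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The number of workers is fixed up front and independent of the number of blocks, so an irregular workload is balanced by whichever worker becomes free next taking the next block. The order of `results` depends on scheduling.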


Avoiding divergence

Each execution step is divided into two phases: update and compaction.

[Figure: an execution step consisting of an update kernel followed by a compact kernel.]

When occupancy gets too low, we compact across blocks.
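A sequential sketch of the two-phase step follows. The details are assumptions made for illustration: finished strands are marked with `None`, occupancy is measured as the fraction of live slots over all blocks, and the 0.5 threshold is arbitrary rather than the value the runtime uses.

```python
def update_phase(blocks, update):
    """Update every live strand; update returns the new state or None when done."""
    return [[update(s) if s is not None else None for s in block]
            for block in blocks]

def occupancy(blocks):
    """Fraction of strand slots that still hold a live strand."""
    slots = sum(len(b) for b in blocks)
    live = sum(1 for b in blocks for s in b if s is not None)
    return live / slots if slots else 1.0

def compact(blocks, width):
    """Pack the live strands from all blocks into dense blocks of `width`."""
    live = [s for b in blocks for s in b if s is not None]
    return [live[k:k + width] for k in range(0, len(live), width)]

def execution_step(blocks, update, width=32, threshold=0.5):
    """One step: update phase, then compaction if occupancy is too low."""
    blocks = update_phase(blocks, update)
    if occupancy(blocks) < threshold:
        blocks = compact(blocks, width)
    return blocks
```

Packing live strands together means a 32-wide scheduler spends fewer lanes on strands that have already finished, which is the divergence cost the compaction phase is meant to reduce.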


Latency hiding

To hide memory latency, we run multiple workgroups per GPU compute unit.

[Figure: speedup (axis 0 to 5) versus number of workers per CU (1 to 10) for vr-lite, illust-vr, and lic2d.]

Runtime system could adjust the number of cores dynamically.


Performance

Experimental framework

- Compare four versions of benchmarks: Teem/C, Sequential Diderot, Parallel Diderot, GPU Diderot.
- SMP machine: 8-core MacPro with 2.93 GHz Xeon X5570 processors (SSE-4)
- GPU machine: Linux box with NVIDIA Tesla C2070 (14×32 cores).
- Four typical benchmark programs
  - vr-lite: simple volume-renderer with Phong shading running on CT scan of hand
  - illust-vr: fancy volume-renderer with cartoon shading running on CT scan of hand
  - lic2d: line integral convolution in 2D running on synthetic data
  - ridge3d: particle-based ridge detection running on lung data


SMP scaling

Parallel performance scaling with respect to sequential Diderot.

[Figure: speedup (axis 0 to 8) versus number of threads (1 to 8) for vr-lite, illust-vr, lic2d, and ridge3d, with a "perfect" scaling reference line.]


Performance comparison

[Figure: speedup vs. Teem/C (axis 0 to 30) for vr-lite, illust-vr, lic2d, and ridge3d; bars for Teem/C, Sequential, SMP-8, and GPU-8.]

Note that ridge3d triggers a bug in NVIDIA’s OpenCL compiler.


Future research

Language evolution

- Dynamic strand creation.
- Strand-strand interactions.
- Global computation mechanisms.
- Type inference and dimension polymorphism.


Long-term goals

In the future, we would like to generalize this work in two directions:
- Extend Diderot to other classes of algorithms (e.g., object recognition).
- Generalize approach to other domains.


Conclusion

Domain-specific languages can provide both high-level notation and portable parallel performance.


Questions?

http://diderot-language.cs.uchicago.edu
