

References

H. Vo, D. Osmari, B. Summa, J. Comba, V. Pascucci, and C. Silva. "Parallel Dataflow Scheme for Streaming (Un)Structured Data." SCI Institute, University of Utah, Technical Report #UUSCI-2009-004, 2009.

H. Vo and C. Silva. "Multi-Threaded Streaming Pipeline for VTK." SCI Institute, University of Utah, Technical Report #UUSCI-2009-005, 2009.

www.sci.utah.edu

A Parallel Streaming Framework for Multi-Processing-Unit Systems

Huy Vo, Brian Summa, Valerio Pascucci, Claudio Silva (SCI Institute, University of Utah, USA)
Daniel Osmari, Joao Comba (UFRGS, Brazil)
Presenter: Huy Vo

Classification of Dataflow

[Figure: dataflow systems classified by POLICY (PUSH vs. PULL) and SCOPE (CENTRALIZED vs. DISTRIBUTED). Event-driven requests push data downstream in response to New Event messages; demand-driven requests pull data by sending Data Request messages upstream. A SCHEDULER triggers, executes, and collects feedback from modules and distributes RESOURCES, while each module takes INPUT, performs computation, and produces OUTPUT. Legend: Module, Message, Data Dependency, Executive.]

Our Unified Model

[Figure: in the unified model, each module's executive handles both New Event and Data Request messages, so a pipeline of modules A, B, C, D can be driven in either push or pull mode. Data lives in a streamable/separable data structure: pieces (Piece #1, Piece #2) flow through replicated instances of A and B and are recombined by a Merge module, enabling task, pipeline, and data concurrency.]

Motivation

Problems with current dataflow systems (VTK, AVS, Data Explorer, ...):
• API "polluted" over the years (many designs date from the 80s-90s)
• Do not map well to modern highly parallel architectures: multi-core CPUs, GPUs, ...
• Large datasets are not well supported

Need a unified dataflow architecture supporting:
• Streaming
• Heterogeneous architectures
• Efficiency
• Concurrency

Streaming Model

[Figure: Streaming Push vs. Streaming Pull between modules A and B. In streaming push, A sends Piece #1 and then Piece #2 downstream to B as they are produced. In streaming pull, B issues "Request Piece #1", receives Piece #1, then issues "Request Piece #2" and receives Piece #2.]
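The difference between the two modes is which module initiates each piece transfer. The following minimal C++ sketch illustrates that pattern only; the Piece type and the producer/consumer classes are hypothetical and not the framework's API.

#include <cstdio>

struct Piece { int id; };  // illustrative streaming piece

// Streaming push: the upstream module drives the transfer.
struct PushConsumer {
    void ReceivePiece(const Piece& p) { std::printf("push: got piece #%d\n", p.id); }
};
struct PushProducer {
    PushConsumer* downstream = nullptr;
    void Run(int numPieces) {
        for (int id = 1; id <= numPieces; ++id)
            downstream->ReceivePiece(Piece{id});  // one "New Event" per piece
    }
};

// Streaming pull: the downstream module drives the transfer.
struct PullProducer {
    Piece RequestPiece(int id) { return Piece{id}; }  // answers a "Data Request"
};
struct PullConsumer {
    PullProducer* upstream = nullptr;
    void Run(int numPieces) {
        for (int id = 1; id <= numPieces; ++id) {
            Piece p = upstream->RequestPiece(id);
            std::printf("pull: got piece #%d\n", p.id);
        }
    }
};

int main() {
    PushConsumer pc; PushProducer pp; pp.downstream = &pc; pp.Run(2);
    PullProducer qp; PullConsumer qc; qc.upstream = &qp; qc.Run(2);
    return 0;
}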

Task Scheduler & Resource Distribution

• Use a priority queue with the topological order of the modules as priority ranks (a small sketch follows below)
• Include streaming piece numbers in the priority ranks
• The resource-distribution strategy is based on execution time
• Maximize concurrency while still taking data dependencies into account
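The ranking scheme can be made concrete with a small priority queue. This is a minimal C++ sketch, not the framework's implementation: the Task record, its field names, and the tie-breaking order between module rank and piece number are all assumptions made for illustration.

#include <cstdio>
#include <queue>
#include <vector>

// Illustrative task record.
struct Task {
    int moduleRank;   // topological order of the module in the pipeline
    int pieceNumber;  // streaming piece this task operates on
};

// Tasks for earlier pipeline stages, and earlier pieces within a stage,
// are popped from the queue first.
struct TaskOrder {
    bool operator()(const Task& a, const Task& b) const {
        if (a.moduleRank != b.moduleRank) return a.moduleRank > b.moduleRank;
        return a.pieceNumber > b.pieceNumber;
    }
};

int main() {
    std::priority_queue<Task, std::vector<Task>, TaskOrder> ready;
    ready.push({2, 0});
    ready.push({1, 1});
    ready.push({1, 0});
    while (!ready.empty()) {
        Task t = ready.top();
        ready.pop();
        std::printf("run module rank %d on piece %d\n", t.moduleRank, t.pieceNumber);
    }
    return 0;
}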

Execution Parallelism

• Pipeline the GPU branch across separate CUDA devices
• Load balancing between the CPU and GPU converges over streaming time (a sketch of one such adjustment follows below)
• Achieved a throughput of 250M pairs/s on two Xeon W5580s and a Quadro FX5800
• Achieved a throughput of over 350M pairs/s on two Xeon E5550s and two Tesla C1060s
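One simple way to realize load balancing that converges over streaming time is to adjust the CPU/GPU work split after each piece based on measured execution times. The following C++ sketch shows that idea only; the fraction, the step size, the fake timings, and the update rule are illustrative assumptions, not the framework's actual policy.

#include <cstdio>

// Fraction of each streaming piece assigned to the GPU; after each piece,
// nudge the split toward whichever device finished faster.
double updateGpuFraction(double gpuFraction, double cpuSeconds, double gpuSeconds) {
    const double step = 0.05;
    if (gpuSeconds < cpuSeconds)
        gpuFraction += step;   // GPU was faster: give it more work
    else
        gpuFraction -= step;   // CPU was faster: give it more work
    if (gpuFraction < 0.0) gpuFraction = 0.0;
    if (gpuFraction > 1.0) gpuFraction = 1.0;
    return gpuFraction;
}

int main() {
    double gpuFraction = 0.5;  // start with an even split
    // Fake per-piece timings, just to show the fraction converging.
    double cpuTimes[] = {1.0, 0.9, 0.8, 0.8, 0.8};
    double gpuTimes[] = {0.4, 0.5, 0.6, 0.7, 0.8};
    for (int piece = 0; piece < 5; ++piece) {
        gpuFraction = updateGpuFraction(gpuFraction, cpuTimes[piece], gpuTimes[piece]);
        std::printf("piece %d: GPU fraction = %.2f\n", piece, gpuFraction);
    }
    return 0;
}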


Ongoing Progress

• Complete implementation for CPU thread resources
• Basic concrete implementation of CUDA execution for GPU resources:
  - Connections can direct data transfers to the appropriate cudaMemcpy() calls (see the sketch after this list)
  - A custom scheduling strategy to minimize host-device data transfers
• Dynamic load balancing for GPU kernels
• Concrete implementation of MPI to run on a cluster
• Incorporate TBB/OpenMP for task scheduling
• A generalized streaming data structure
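The point about connections directing transfers to the right cudaMemcpy() call can be illustrated by choosing the cudaMemcpyKind from the resource type at each end of a connection. This is a sketch compiled against the CUDA runtime API; the Endpoint tag and the helper names are assumptions for illustration, not the framework's actual interface.

#include <cuda_runtime.h>

// Illustrative tag for where a connection endpoint's data lives.
enum class Endpoint { CPU, GPU };

// Pick the memcpy direction from the two endpoints of a connection.
inline cudaMemcpyKind transferKind(Endpoint src, Endpoint dst) {
    if (src == Endpoint::CPU && dst == Endpoint::GPU) return cudaMemcpyHostToDevice;
    if (src == Endpoint::GPU && dst == Endpoint::CPU) return cudaMemcpyDeviceToHost;
    if (src == Endpoint::GPU && dst == Endpoint::GPU) return cudaMemcpyDeviceToDevice;
    return cudaMemcpyHostToHost;
}

// Move one piece of data across a connection; a CPU-to-CPU connection
// could instead pass a pointer and skip the copy entirely.
inline cudaError_t transferPiece(void* dst, const void* src, size_t bytes,
                                 Endpoint from, Endpoint to) {
    return cudaMemcpy(dst, src, bytes, transferKind(from, to));
}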

System Architecture

[Figure: Data Block and Priority Rank. Two VTK pipelines (vtkDataSetReader → vtkContourFilter → vtkDataSetMapper → vtkActor) share a vtkRenderer and vtkRenderWindow; (a) shows the module graph and (b) shows the corresponding data blocks, numbered 1-16 by priority rank.]

Pull(), Push() and ReleaseInputs()

[Figure: a progressive-rendering pipeline in which 12 DataAccess modules feed an Average module (12 images in, 1 image out) driving a ProgressiveRenderer; the same pipeline can be executed in push or pull mode.]

Push (better pipeline parallelism):

viewportChangedEvent()
    while (LOD > 0)
        // Set LOD of DataAccesses
        ...
        Push(<list of DataAccess>)
        LOD = LOD - 1

Pull (simpler interface):

viewportChangedEvent()
    while (LOD > 0)
        // Propagate LOD to DataAccesses
        ...
        Pull(Average Module)
        LOD = LOD - 1

Compute() {
    // input-dependent computation
    ..
    ReleaseInputs()
    // input-independent computation
    ..
}
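To make the ReleaseInputs() idea concrete, here is a minimal C++ sketch of a module whose Compute() copies what it needs from its input, releases the input so upstream modules can start on the next piece, and then finishes its work independently. Only the Compute()/ReleaseInputs() split mirrors the pseudocode above; the class, the vector-based data, and the smoothing loop are illustrative assumptions.

#include <cstddef>
#include <vector>

// Illustrative module, not the framework's actual base class.
class SmoothModule {
public:
    void SetInput(const std::vector<float>* in) { input = in; }

    void Compute() {
        // Input-dependent computation: copy the piece we need.
        std::vector<float> local(*input);

        // Release the input early; upstream modules may now produce the
        // next streaming piece in parallel with the rest of this work.
        ReleaseInputs();

        // Input-independent computation: work only on the local copy.
        output.resize(local.size());
        for (std::size_t i = 0; i < local.size(); ++i) {
            float prev = (i == 0) ? local[i] : local[i - 1];
            float next = (i + 1 == local.size()) ? local[i] : local[i + 1];
            output[i] = (prev + local[i] + next) / 3.0f;
        }
    }

    const std::vector<float>& GetOutput() const { return output; }

private:
    void ReleaseInputs() { input = nullptr; }  // sketch: just drop the reference

    const std::vector<float>* input = nullptr;
    std::vector<float> output;
};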

[Figure: example VTK image pipelines used for comparison. A vtkPNGReader feeds vtkImageGaussianSmooth, vtkImageLuminance, and vtkImageThreshhold stages whose results reach a vtkDataSetMapper, vtkActor, and vtkRenderer; variants stream the data with vtkImageDataStreamer or split and recombine it with vtkImageDataStreamGenerator, vtkImageDataStreamMerger, and vtkImageBlend. Modules are labeled with their priority ranks.]

Preliminary Results

Multi-View Visualization (~3x speedup on a 4-core machine)

[Figure: the multi-view visualization pipeline. Twelve DataAccess modules feed GrayScale and Downsample stages, from which Average, Standard Deviation, Difference (L2 Norm), and Edge Detection views are computed and displayed by a ProgressiveRenderer (12 input images, 1 image per view).]

Streaming Simplification of Tetrahedral Meshes

Model (Tets)     Original   Streaming   Quadric Inst.   Pipeline Inst.
Torso (1.0M)     8.5s       8.4s        7.9s            3.0s
Fighter (1.4M)   10.1s      9.5s        8.4s            3.6s
Rbl (3.8M)       37.4s      35.6s       31.5s           8.9s
Mito (5.5M)      52.0s      53.1s       44.9s           10.2s

[Figure: streaming simplification pipelines. A TetStreamingMesh is split into pieces that pass through replicated UpdateQuadrics and SimplificationBuf → Decimate stages and are recombined by a TetStreamMerger.]

CPU/GPU Sort

[Figure: the hybrid CPU/GPU sorting pipeline. Streaming unsorted data is split between a CPU Threaded Radix Sort branch and a GPU branch (GPU Presort, GPU Scan Blocks, GPU Sum Blocks, GPU Scatter); the partial results are combined by CPU Threaded Merge and CPU Threaded Accumulate stages to produce the sorted output.]
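As a rough, CPU-only illustration of the streaming structure of this pipeline (blocks sorted in parallel, then merged), the sketch below sorts each incoming block on its own thread and merges the results; std::sort stands in for the GPU presort/scan/scatter stages, so this shows the block-and-merge pattern rather than the actual radix sort or CUDA kernels.

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <thread>
#include <vector>

int main() {
    // Streaming unsorted data arrives as fixed-size blocks (pieces).
    std::vector<std::vector<int>> blocks = {
        {9, 3, 7, 1}, {8, 2, 6, 4}, {5, 0, 11, 10}};

    // Sort each block on its own worker thread; in the real pipeline the
    // GPU branch would presort, scan, and scatter these blocks instead.
    std::vector<std::thread> workers;
    for (auto& block : blocks)
        workers.emplace_back([&block] { std::sort(block.begin(), block.end()); });
    for (auto& w : workers) w.join();

    // Merge stage, shown here as a simple sequential pairwise merge.
    std::vector<int> sorted;
    for (const auto& block : blocks) {
        std::vector<int> merged;
        std::merge(sorted.begin(), sorted.end(), block.begin(), block.end(),
                   std::back_inserter(merged));
        sorted.swap(merged);
    }

    for (int v : sorted) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}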