
PARALLEL PROCESSING, UNIT 4. Dr. Ahmed Sallam, www.sallam.cf (sallamah.weebly.com)


Page 1: PARALLEL PROCESSING

PARALLEL PROCESSING

UNIT 4

Dr. Ahmed Sallam

www.sallam.cf > (sallamah.weebly.com)

1

Page 2:

QUIZ

For a scan of n elements, the amount of work is: O(log n), O(n), O(n log n), O(n^2)

For a scan of n elements, the number of steps is: O(log n), O(n), O(n log n), O(n^2)

2

Page 3:

OUTLINES

Compact

Compact-like

Segment scan

Sorting

Odd Even Sort

Merge Sort

Radix Sort

Quick Sort

3

Page 4:

COMPACT (FILTER)

4

Page 5:

COMPACT

Filter the red objects only. This helps us keep the objects we care about and ignore the others.

This saves space and processing time.

5

Page 6:

COMPACT MODEL

6

Input:     a b c d e …
Predicate: T F T F T …   (e.g. “Is my index even?”)

Sparse output: a - c - e …
Dense output:  a c e …

Page 7:

WHY DO WE PREFER DENSE COMPACT?

Suppose we want to apply toGray() on every red object.

7

// Sparse: 15 threads, one per object, most of them idle
if (isRed(object)) {
    toGray(object)
}

// Dense: 15 threads for compact, then only 4 threads for map
cc = compact(objects, isRed())
map(cc, toGray())
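The thread counts above can be simulated serially. This is a hypothetical sketch (the `is_red`/`to_gray` names and the object layout are assumptions, not from the slides), counting how many launched "threads" do useful work in each formulation:

```python
def is_red(obj):
    return obj["color"] == "red"

def to_gray(obj):
    obj["color"] = "gray"

# 15 objects, every 4th one red -> 4 red objects
objects = [{"color": "red" if i % 4 == 0 else "blue"} for i in range(15)]

# Sparse: launch one thread per object; non-red threads sit idle.
sparse_threads = len(objects)            # 15 threads, only 4 useful

# Dense: compact first (15 threads), then map over the survivors only.
cc = [o for o in objects if is_red(o)]   # stands in for compact(objects, isRed)
dense_map_threads = len(cc)              # 4 threads, all useful
for o in cc:                             # map(cc, toGray)
    to_gray(o)
```

The compact step still touches all 15 elements once, but every subsequent map over the dense array pays only for the elements that matter.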

Page 8:

QUIZ

When should we use compact?

When the number of elements is: Small / Large

When the operations on these elements are: Expensive (complex) / Cheap (simple)

8

Page 9:

COMPACT PARALLELIZATION

How to compute the scatter address/index in parallel?

9

Input:     a b c d e …
Predicate: T F T F T …   (e.g. “Is my index even?”)

Sparse output: a - c - e …
Dense output:  a c e …

Page 10:

COMPACT PARALLELIZATION

We can paraphrase the problem as follows:

Input:  T F F T T F
Output: 0 - - 1 2 -

And for the computer we paraphrase it again as follows:

Input:  1 0 0 1 1 0
Output: 0 - - 1 2 -

10

How do we get the addresses? Exclusive sum scan of the 1/0 array gives 0 1 1 1 2 3; its values at the true positions are exactly the scatter addresses.
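The exclusive-scan trick can be checked with a short serial reference (a sketch, not GPU code): scanning the 1/0 predicate array yields, at every true position, that element's scatter address.

```python
def exclusive_sum_scan(xs):
    # serial exclusive prefix sum: out[i] = sum of xs[:i]
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

pred    = [True, False, False, True, True, False]   # T F F T T F
scan_in = [1 if p else 0 for p in pred]             # 1 0 0 1 1 0
addr    = exclusive_sum_scan(scan_in)               # 0 1 1 1 2 3
output  = [addr[i] if pred[i] else None for i in range(len(pred))]
# output -> [0, None, None, 1, 2, None]
```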

Page 11:

COMPACT PARALLEL ALGORITHM

1. Generate the predicate array (map): predicate_array = predicate_function(input_array)

2. Generate the scan-in array of 1s and 0s (map): scan_in_array = convert(predicate_array)

3. Generate the scatter-addresses array (scan): addresses_array = exclusive_sum_scan(scan_in_array)

4. Scatter the input elements to their addresses (scatter): output = scatter(addresses_array, input_array)

11
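The four steps above can be sketched serially (helper names are placeholders; on a GPU each step is its own parallel kernel):

```python
def exclusive_sum_scan(xs):
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def compact(input_array, predicate_function):
    pred = [predicate_function(x) for x in input_array]   # 1. map: predicate array
    scan_in = [1 if p else 0 for p in pred]               # 2. map: convert to 1s/0s
    addresses = exclusive_sum_scan(scan_in)               # 3. scan: scatter addresses
    output = [None] * sum(scan_in)
    for i, x in enumerate(input_array):                   # 4. scatter
        if pred[i]:
            output[addresses[i]] = x
    return output

compact([10, 11, 12, 13, 14], lambda v: v % 2 == 0)
# -> [10, 12, 14]
```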

Page 12:

QUIZ

Suppose we need to compact 1M elements with the following predicate functions: A: isDivisiableBy_17( )  B: isNotDivisiableBy_34( )

12

For each stage, is A faster, A = B, or B faster?
Map: A = B.  Scan: A = B.  Scatter: A is faster, since A keeps few elements (about 1M/17) while B keeps many.

Page 13:

OUTLINES

Compact

Compact-like

Segment scan

Sorting

Odd Even Sort

Merge Sort

Radix Sort

Quick Sort

13

Page 14:

COMPACT-LIKE

14

Page 15:

RECAP COMPACT ALLOCATION

Compact allocates 1 output slot for each true element and 0 outputs for each false element.

Can we generalize (not only 1s and 0s)? The number of outputs can be computed dynamically for each input item.

15

Page 16:

CLIPPING

Suppose a set of triangles is sent as input to a computer graphics pipeline.

16


Page 17:

CLIPPING PROBLEM

How do we clip triangles at the boundaries?

17


Page 18:

BAD SOLUTION

Allocate the maximum possible space in an intermediate array (5 slots for each triangle in our case), then apply compact.

Disadvantages: wasteful in space, and we must scan a large intermediate array.

18

Input:        a c e d f b g
Intermediate: a ? ? ? ? b ? ⋯   (5 slots per input, mostly unused)

Page 19:

GOOD SOLUTION (GENERAL COMPACT)

Request allocations per input element, apply scan, allocate the output array sized by the total from the scan, and then apply scatter.

19

Input:     a c e d f b g
Requests:  0 1 0 1 2 1 1
Addresses: 0 0 1 1 2 4 5   (exclusive sum scan of the requests)
Output:    6 slots (0..5); each element scatters its outputs starting at its address
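The general (allocate-by-request) compact can be sketched serially with the slide's request counts (the element names and the `(element, k)` payload are placeholders):

```python
def exclusive_sum_scan(xs):
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

elements = ["a", "c", "e", "d", "f", "b", "g"]
requests = [0, 1, 0, 1, 2, 1, 1]           # outputs each element will produce

addresses = exclusive_sum_scan(requests)   # [0, 0, 1, 1, 2, 4, 5]
total = addresses[-1] + requests[-1]       # 6 output slots, no waste

output = [None] * total
for i, e in enumerate(elements):           # scatter: element i writes its
    for k in range(requests[i]):           # requests[i] outputs starting
        output[addresses[i] + k] = (e, k)  # at addresses[i]
```

Unlike the "bad solution", the output array is exactly as large as the total number of requested outputs.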

Page 20:

OTHER APPLICATIONS OF SCAN

Data compression

Collision detection

20

Page 21:

OUTLINES

Compact

Compact-like

Segment scan

Sorting

Odd Even Sort

Merge Sort

Radix Sort

Quick Sort

21

Page 22:

SEGMENTED SCAN

22

Page 23:

SEGMENTED SCAN

Many small scans: launch each independently, then combine them as segments.

Remember: we pack all segments into one big array processed by one kernel, instead of running a separate kernel over each segment, to get maximum benefit from the GPU's power.

We use a separate array to indicate the segments' heads.

23

Input: 1 2 3 4 5 6 7 8
Heads: 1 0 1 0 0 1 0 0

Scan:           0 1 3 6 10 15 21 28
Segmented scan: 0 1 | 0 3 7 | 0 6 13   (exclusive, restarting at each head)
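A serial reference for the segmented exclusive sum scan (the GPU version runs one kernel over the packed array; this just defines the expected output):

```python
def segmented_exclusive_scan(values, heads):
    out, acc = [], 0
    for v, h in zip(values, heads):
        if h:            # a head restarts the running sum
            acc = 0
        out.append(acc)  # exclusive: emit before adding the element
        acc += v
    return out

segmented_exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8],
                         [1, 0, 1, 0, 0, 1, 0, 0])
# -> [0, 1, 0, 3, 7, 0, 6, 13]
```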

Page 24:

SEGMENTED SCAN APPLICATION

“Sparse matrix * dense vector” multiplication (SpMV). A sparse matrix contains a lot of zeros among its elements; a dense one contains few or none.

Naive sparse-matrix multiplication performs a lot of unnecessary multiplications (by zero).

E.g. Google PageRank: a non-zero value indicates a link between the webpage indicated by the column index and the webpage indicated by the row index.

[Figure: the link matrix, with both dimensions spanning all web pages]

Page 25:

REVIEW MATRIX MULTIPLICATION

[1 0 3]   [0]   [1*0 + 0*1 + 3*2]   [6]
[2 4 1] x [1] = [2*0 + 4*1 + 1*2] = [6]
[0 1 5]   [2]   [0*0 + 1*1 + 5*2]   [11]

Page 26:

COMPRESSED SPARSE ROW

We can represent a sparse matrix in CSR format as follows:

[a 0 b]
[c d e]
[0 0 f]

Value:  [a b c d e f]
Column: [0 2 0 1 2 2]
RowPtr: [0 2 5]

26

(The H marks indicate the segment heads: the first stored value of each row, at the positions given by RowPtr.)

Page 27:

CSR MULTIPLICATION

We represent the sparse matrix in CSR format: Value: [a b c d e f], Column: [0 2 0 1 2 2], RowPtr: [0 2 5]

1. Create segments over Value using RowPtr
2. Gather the vector values using Column
3. Pairwise multiply the results of 1 and 2
4. Apply a backward (inclusive) segmented sum scan, so each row's total lands at its segment head

27

Segments from RowPtr [0 2 5]: [a b | c d e | f]
Gather with Column [0 2 0 1 2 2] from vector [x y z]: [x z x y z z]
Pairwise products: [a*x b*z | c*x d*y e*z | f*z]
Row results at the heads: out(0) = a*x + b*z, out(1) = c*x + d*y + e*z, out(2) = f*z

Can we apply reduce instead??
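The CSR steps can be checked numerically. This sketch substitutes made-up numbers for the slide's symbols (a..f become 1..6, x y z become 1 2 3) and uses a per-row sum, i.e. the segmented reduce the slide asks about; a backward segmented scan would compute the same sums:

```python
value  = [1, 2, 3, 4, 5, 6]      # a b c d e f
column = [0, 2, 0, 1, 2, 2]
rowptr = [0, 2, 5]
vec    = [1, 2, 3]               # x y z

gathered = [vec[c] for c in column]                  # step 2: gather
products = [v * g for v, g in zip(value, gathered)]  # step 3: multiply

# steps 1 + 4: sum each RowPtr-delimited segment
bounds = rowptr + [len(value)]
out = [sum(products[bounds[r]:bounds[r + 1]]) for r in range(len(rowptr))]
# out -> [7, 26, 18]
```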

Page 28:

OUTLINES

Compact

Compact-like

Segment scan

Sorting

Odd Even Sort

Merge Sort

Radix Sort

Quick Sort

28

Page 29:

SORTING

29

Page 30:

ODD-EVEN (BRICK) SORT

It’s the parallel version of bubble sort

30

5 1 4 2 3

1 5 2 4 3

1 2 5 3 4

1 2 3 5 4

1 2 3 4 5

Steps? Work? Options: O(1), O(n), O(log n), O(n log n), O(n^2)
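The trace above can be reproduced with a serial sketch of odd-even (brick) sort; on a GPU, every compare-swap inside one phase runs in its own thread, so the phases are the parallel steps:

```python
def odd_even_sort(a):
    a = list(a)
    n = len(a)
    for phase in range(n):                    # n phases (steps)
        for i in range(phase % 2, n - 1, 2):  # independent pairs: parallel on a GPU
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

odd_even_sort([5, 1, 4, 2, 3])
# -> [1, 2, 3, 4, 5]
```

With n phases of up to n/2 parallel compare-swaps each, this gives O(n) steps and O(n^2) work, matching bubble sort's work but with far fewer steps.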

Page 31:

MERGE SORT

31

1M
512K  512K
⋮
2048  2048  ⋯
1024  1024  1024  1024  ⋯
512  512  512  512  ⋯
⋮
1  1  1  1  ⋯  1  1

Stage 1:
• Tons of small merge tasks
• 1 merge per thread
• We can use shared memory
• We can also use a serial algorithm to sort each small block of elements

Stage 2:
• A bunch of medium merge tasks
• 1 merge per block

Stage 3:
• Only 1 merge task remains: two huge sorted lists
• Choose splitters (every 256th element) in each list and sort them
• Use the stage-2 merge task to merge the elements between splitters

Page 32:

MERGE TASK

32

A: 0 1 5 12 34    B: 3 4 15 59 102

Serial algorithm: repeatedly compare the two heads.
Merged: 0 1 3 4 5 12 15 34 59 102

Parallel algorithm: for each element i ∈ A (and symmetrically for B):
1. Find the position of i in A from its thread ID
2. Find the position of i in B with binary search (log n)
3. Sum the results of 1 and 2 to get the position in the output
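The rank computation can be sketched serially, using Python's `bisect` for the binary search (each loop iteration stands in for one thread):

```python
from bisect import bisect_left, bisect_right

def parallel_merge(A, B):
    # A and B are sorted lists; each element's output position is its
    # own index plus its rank in the other list.
    out = [None] * (len(A) + len(B))
    for i, v in enumerate(A):
        # position in A = thread ID (i); position in B = binary search
        out[i + bisect_left(B, v)] = v
    for j, v in enumerate(B):
        # bisect_right on ties so equal elements of A land first
        out[j + bisect_right(A, v)] = v
    return out

parallel_merge([0, 1, 5, 12, 34], [3, 4, 15, 59, 102])
# -> [0, 1, 3, 4, 5, 12, 15, 34, 59, 102]
```

Using `bisect_left` for A's elements and `bisect_right` for B's ensures ties never collide on the same output slot.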

Page 33:

MERGE TASK (CONT.)

How do we merge two huge lists in parallel? No single block can do all of this work.

33

List 1: ⋯   List 2: ⋯

Find every nth (e.g. 256th) element as splitters: A B C D in list 1, E F G H in list 2.

Merge the splitters, e.g.: A E F B C G D H

Merge the elements between every two consecutive splitters, e.g. F and C:
1. Find F's position in list 1
2. Find C's position in list 2
3. Merge the elements between these two positions in the two lists

Page 34:

MERGE ANALYSIS

34

The merge tree (1-element lists up through 512, 1024, 2048, …, 512K, 1M) has log n levels, and each level processes n elements in total.

Steps? Work? Options: O(1), O(n), O(log n), O(n log n), O(n^2)

Page 35:

RADIX SORT

1. Start with the LSB
2. Split the input into 2 sets based on the current bit (keeping the order within each set)
3. Move to the next bit toward the MSB, and repeat step 2

35

Input:       0 5 4 7 1 6   (binary: 000 101 100 111 001 110)
After bit 0: 0 4 6 5 7 1   (000 100 110 101 111 001)
After bit 1: 0 4 5 1 6 7   (000 100 101 001 110 111)
After bit 2: 0 1 4 5 6 7   (000 001 100 101 110 111)

Page 36:

RADIX SORT PARALLELIZATION

Input:       0 5 2 7 1 6   (binary: 000 101 010 111 001 110)
After bit 0: 0 2 6 5 7 1

What is this algorithm? Compact (a stable split is two compacts).

What is the predicate? (i&1)==0, i.e. the current bit of the element is 0.
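The split primitive can be built from the same scan machinery as compact. A serial sketch of radix sort on top of it:

```python
def exclusive_sum_scan(xs):
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def split_by_bit(a, bit):
    # Stable split: elements whose current bit is 0 come first,
    # addresses computed exactly like compact's address step.
    flags = [(x >> bit) & 1 for x in a]        # the predicate's answer
    zeros = [1 - f for f in flags]
    addr0 = exclusive_sum_scan(zeros)          # addresses for bit == 0
    n0 = sum(zeros)
    out = [None] * len(a)
    for i, x in enumerate(a):
        # i - addr0[i] = number of ones before position i
        out[addr0[i] if flags[i] == 0 else n0 + i - addr0[i]] = x
    return out

def radix_sort(a, bits=3):
    for b in range(bits):                      # LSB -> MSB
        a = split_by_bit(a, b)
    return a

radix_sort([0, 5, 2, 7, 1, 6])
# -> [0, 1, 2, 5, 6, 7]
```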

Page 37:

RADIX SORT ANALYSIS

37

Work: O(kn), i.e. linear. Steps: O(k).

k: number of bits, n: number of elements.

(Example: 0 5 2 7 1 6 is sorted in k = 3 passes.)

Page 38:

QUICK SORT

Choose a pivot element. Compare all elements with the pivot. Split into 3 arrays: <p, =p, >p. Recurse on each array.

38

3 5 2 4 1, P = 3  →  <3: 2 1, =3: 3, >3: 5 4

Recurse on [2 1] with P = 2  →  <2: 1, =2: 2, giving 1 2

Page 39:

QUICK SORT PARALLELIZATION

Older GPUs don't support recursion.

Current GPUs support recursion.

39

3 5 2 4 1, P = 3
compact: <3  |  compact: =3  |  compact: >3
2 1 | 3 | 5 4
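The parallel formulation reduces each level to three compacts. A serial sketch, where list comprehensions stand in for the `<p`, `=p`, `>p` compact kernels and Python recursion for the GPU's:

```python
def quick_sort(a):
    if len(a) <= 1:
        return a
    p = a[0]                              # pivot
    less  = [x for x in a if x < p]       # compact: < p
    equal = [x for x in a if x == p]      # compact: = p
    more  = [x for x in a if x > p]       # compact: > p
    return quick_sort(less) + equal + quick_sort(more)

quick_sort([3, 5, 2, 4, 1])
# -> [1, 2, 3, 4, 5]
```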

Page 40:

NOTE

All of the sort algorithms we have studied are key-value sorts, where we usually depend on an integer key for the sorting.

However, if your items carry more complex data (e.g. a structure with many fields), use a key or a pointer to the value to apply the sorting.

40

Page 41:

OUTLINES

Compact

Compact-like

Segment scan

Sorting

Odd Even Sort

Merge Sort

Radix Sort

Quick Sort

41

Page 42:

RED EYE REMOVAL

Stencil, Sort, Map

42