parallel&scan&people.cs.pitt.edu/~bmills/docs/teaching/cs1645/lecture_scan.pdfcompact •...
TRANSCRIPT
Parallel Scan
Bryan Mills, PhD
Spring 2017
Scan
• Prefix Sum Example
1 2 3 4 5 6 7 8
1 3 6 10 15 21 28 36
INPUT
OUTPUT
Generalized Scan
• f(x,y) = operaJon – Can be any binary associaJve operaJon
• sum, mulJply, logical and/or, max, min
• 0 = IdenJty element – zero for sum, one for mulJplicaJon, true for logical, false for logical or, 0 for unsigned max
Generalized Scan Type acc = identity;for (i = 0; I < elems.length(); i++) {
acc = f(acc, elems[i]);out[i] = acc;
}
• How many steps?
• How much work?
n
n
Why?
INPUT
OUTPUT
• Many problems look like this – Output depends on one new input and previous output – Balancing checkbook, regressions, even sorJng can.
Types of Scan
• Exclusive – Output all elements excluding the current
• Inclusive – Output all elements including the current
1 2 3 4 5 6 7 8
0 1 3 6 10 15 21 28
INPUT
OUTPUT
1 3 6 10 15 21 28 36 OUTPUT
Looks serial INPUT
OUTPUT
• It is non-‐obvious how to convert to parallel however we will look at two algorithms – Hillis/Steele – Blelloch
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
3
Input
Step 1
Add direct neighbors n = 1
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
3 5 7 9 11 13 15
Input
Step 1
Add direct neighbors n = 1
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
1 3 5 7 9 11 13 15
Input
Step 1
If no neighbors then copy previous step
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
1 3 5 7 9 11 13 15
Input
Step 1
6 Step 2
Add neighbors two away n = 2
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
1 3 5 7 9 11 13 15
Input
Step 1
6 10 14 18 22 26 Step 2
Add neighbors two away n = 2
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
1 3 5 7 9 11 13 15
Input
Step 1
1 3 6 10 14 18 22 26 Step 2
If no neighbors copy down
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
1 3 5 7 9 11 13 15
Input
Step 0
1 3 6 10 14 18 22 26 Step 1
Add neighbors four away n = 4 Generally n = 2step
15 Step 2
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
1 3 5 7 9 11 13 15
Input
Step 0
1 3 6 10 14 18 22 26 Step 1
Add neighbors four away n = 4 Generally n = 2step
15 21 28 36 Step 2
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
1 3 5 7 9 11 13 15
Input
Step 0
1 3 6 10 14 18 22 26 Step 1
Copy down others
1 3 6 10 15 21 28 36 Step 2
Hillis/Steele Inclusive Scan 1 2 3 4 5 6 7 8
1 3 5 7 9 11 13 15
Input
Step 0
1 3 6 10 14 18 22 26 Step 1
You now have the inclusive scan.
1 3 6 10 15 21 28 36 Step 2
Steps = O(log n) Work = O(n log n) <-‐ dimensions of rectangle above
Blelloch Exclusive Scan
• Happens in two passes: – Reduce
• Like previous reduce steps but keep around intermediate results
– Down sweep • New operaJon
Blelloch Exclusive Scan 1 2 3 4 5 6 7 8 Input
Blelloch Exclusive Scan 1 2 3 4 5 6 7 8 Input
3 7 11 15 Step 0 n = 2^0 = 1
Start a simple reduce
Blelloch Exclusive Scan 1 2 3 4 5 6 7 8 Input
3 7 11 15 Step 0 n = 2^0 = 1
Similar to other reduces, however keep around the intermediate results: 3, 7, 11, 15, 10, 26
10 26 Step 1 n = 2^1 = 2
36 Step 2 n = 2^2 = 4
Blelloch Down Sweep OperaJon
• Reverse reduce step – Same inputs (leb and right) – But two outputs also leb and right
• Add L+R and put on right • Copy down R and put it on leb
L R
L+R R
Blelloch Down Sweep 1 2 3 4 5 6 7 8 Input
3 7 11 15 Step 0 n = 2^0 = 1
10 26 Step 1 n = 2^1 = 2
36 Step 2 n = 2^2 = 4
0
First replace last column with idenJty element, for sum this is zero (0)
Replace with IdenJty elem
Blelloch Down Sweep 1 2 3 4 5 6 7 8 Input
3 7 11 15 Step 0 n = 2^0 = 1
10 26 Step 1 n = 2^1 = 2
36 Step 2 n = 2^2 = 4
10 0 Replace with IdenJty elem
Down sweep OperaJon
Copy element down from previous reduce, this is why you need to keep around the intermediate reduce results.
Blelloch Down Sweep 1 2 3 4 5 6 7 8 Input
3 7 11 15 Step 0 n = 2^0 = 1
10 26 Step 1 n = 2^1 = 2
36 Step 2 n = 2^2 = 4
10 0 Replace with IdenJty elem
0 10 Down sweep OperaJon
Down sweep operaJon Copy right to leb (0) Put L+R to the right (10 + 0 = 10)
1 2 3 4 5 6 7 8 Input
3 7 11 15 Reduce Step 0
10 26 Reduce Step 1
36 Reduce Step 2
10 0 IdenJty Elem
3 0 11 10 Down Sweep 0
0 3 10 21 Down Sweep 1
1 2 3 4 5 6 7 8 Input
3 7 11 15 Reduce Step 0
10 26 Reduce Step 1
36 Reduce Step 2
10 0 IdenJty Elem
3 0 11 10 Down Sweep 0
1 0 3 3 5 10 7 21 Down Sweep 1
0 1 3 6 10 15 21 28 Down Sweep 2
Blelloch Exclusive Scan
• # Steps? – 2 log n = O(log n)
• Work? – O(n)
Compact
• Many Jmes you lots of data and you only want to perform some computaJon on a subset of that data. – Logs analysis: only look at logs containing a certain search term or type of search term
– Graphics: only perform ray tracing on elements in the viewport
– Big Data: Calculate histogram of incomes for everyone with a dog
What is compact
• Given some predicate funcJon remove those elements which return false and “squeeze” the data into the required space.
1 2 3 4 5 6 7 8
T F T F T F T F
Input
Predicate Is odd?
1 3 5 7 Output
Parallel Compact
• Just have each thread evaluate predicate and copy only on true.
1 2 3 4 5 6 7 8
T F T F T F T F
Input
Predicate Is odd?
1 3 5 7 Dense Output
1 -‐ 3 -‐ 5 -‐ 7 -‐ Sparse Output
Sparse is easy in parallel, how do we get dense output in parallel?
Parallel Compact
• Just have each thread evaluate predicate and copy only on true.
1 2 3 4 5 6 7 8
T F T F T F T F
Input
Predicate Is odd?
1 3 5 7 Dense Output
1 -‐ 3 -‐ 5 -‐ 7 -‐ Sparse Output
Sparse is easy in parallel, how do we get dense output in parallel?
Parallel Compact 1 2 3 4 5 6 7 8
T F T F T F T F
Input
Predicate Is odd?
1 3 5 7 Dense Output
1 -‐ 3 -‐ 5 -‐ 7 -‐ Sparse Output
Look at the desired address locaJons in the dense output for each for each element in the sparse output
Address in dense output 0 -‐ 1 -‐ 2 -‐ 3 -‐
Parallel Compact 1 2 3 4 5 6 7 8
T F T F T F T F
Input
Predicate Is odd?
1 3 5 7 Dense Output
True/False = 1/0
Address in dense output 0 -‐ 1 -‐ 2 -‐ 3 -‐
1 0 1 0 1 0 1 0
0 1 1 2 2 3 3 4 Exclusive Scan on 1/0
To get addresses for dense output run a SCAN ! !
Parallel Compact Steps
1. Run Predicate 2. Create a scan-‐in array – True = 1 – False = 0
3. Run exclusive scan over scan-‐in array – Output is the scaker addresses for input
4. Scaker the input into output addresses