ii extreme models study the two extremes of parallel computation models: abstract sm (pram); ignores...

32
II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates hardware details Everything else falls between these two extremes Topics in This Part Chapter 5 PRAM and Basic Algorithms Chapter 6 More Shared-Memory Algorithms Chapter 7 Sorting and Selection Networks Chapter 8 Other Circuit-Level Examples

Upload: zaria-reaver

Post on 29-Mar-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

II Extreme Models Study the two extremes of parallel computation models:

• Abstract SM (PRAM); ignores implementation issues• Concrete circuit model; incorporates hardware details• Everything else falls between these two extremes

Topics in This Part

Chapter 5 PRAM and Basic Algorithms

Chapter 6 More Shared-Memory Algorithms

Chapter 7 Sorting and Selection Networks

Chapter 8 Other Circuit-Level Examples

Page 2: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

7 Sorting and Selection Networks Become familiar with the circuit model of parallel processing:

• Go from algorithm to architecture, not vice versa• Use a familiar problem to study various trade-offs

Topics in This Chapter

7.1 What is a Sorting Network?

7.2 Figures of Merit for Sorting Networks

7.3 Design of Sorting Networks

7.4 Batcher Sorting Networks

7.5 Other Classes of Sorting Networks

7.6 Selection Networks

Page 3: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

7.1 What is a Sorting Network?

Fig. 7.1 An n-input sorting network or an n-sorter.

xxx

x

.

.

.

.

.

.

n-sorter

0

1

2

n–1

yyy

y

0

1

2

n–1

The outputs are a permutation of the inputs satisfying y Š y Š ... Š y (non-descending)

0 1 n–1

Fig. 7.2 Block diagram and four different schematic representations for a 2-sorter.

2-sorter

input min0

input1 max

in out

in out

Block Diagram Alternate Representations

in out

in out

Page 4: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

2-sorter

input min0

input1 max

in out

in out

Block Diagram Alternate Representations

in out

in out

2-sorter

Building Blocks for Sorting Networks

Fig. 7.3 Parallel and bit-serial hardware realizations of a 2-sorter.

Q R

S

Com-pare

1

0

1

0

k

k

k

k

min(a, b )

max(a, b )

b<a?

a

b

Q R

S

1

0

1

0

min(a, b )

max(a, b )

b<a?

a

b M

SB

-firs

t ser

ial i

nput

s

a<b?

Reset

Implementation with bit-parallel inputs

Implementation with bit-serial inputs

Page 5: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Proving a Sorting Network Correctx0x1

x

x3

2

y0

y1

y

y3

2

2

3

1

5

3

2

5

1

1

3

2

5

1

2

3

5

Fig. 7.4 Block diagram and schematic representation of a 4-sorter.

Method 1: Exhaustive test – Try all n! possible input orders

Method 2: Ad hoc proof – for the example above, note that y0 is smallest, y3 is largest, and the last comparator sorts the other two outputs

Method 3: Use the zero-one principle – A comparison-based sorting algorithm is correct iff it correctly sorts all 0-1 sequences (2n tests)

Page 6: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Validating a Sorting Network

Stick diagram schematic

b

a

d

c

Track 1

Track 2

Track 3

Track 4

min(a, b)

max(a, b)

min(c, d)

max(c, d)

min(a, b, c, d)

max(a, b, c, d)

????

????

In the example above, it was fairly easy to show the validity of the sorting network. Generally, it is much more difficult

How would one establish the validity of this 16-input sorting network?

More importantly, how does one come up with this design in the first place?

Page 7: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

The Zero-One Principle

0000

0000

0001

0001

0010

0001

0011

0011

0100

0001

0101

0011

0110

0011

0111

0111

1000

0001

1001

0011

1010

0011

1011

0111

1100

0011

1101

0111

1110

0111

1111

1111

A sorting net built of comparators is valid if it correctly sorts all 0-1 sequences

So, we can validate a sorting network using 2n rather than n! input patterns

n = 12: 2n = 4096, n! = 479,001,600 (thousands vs. half a billion)

Page 8: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Elaboration on the Zero-One Principle

Deriving a 0-1 sequence that is not correctly sorted, given an arbitrary sequence that is not correctly sorted.

Let outputs yi and yi+1 be out of order, that is yi > yi+1

Replace inputs that are strictly less than yi with 0s and all others with 1s

The resulting 0-1 sequence will not be correctly sorted either

6-sorter

1 3 6* 5* 8 9

3 6 9 1 8 5

Invalid

0 1 1 0 1 0

0 0 1 0 1 1

Page 9: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

7.2 Figures of Merit for Sorting Networks

Delay: Number of levels

Cost: Number of comparators

Cost Delay

x0x1

x

x3

2

y0

y1

y

y3

2

2

3

1

5

3

2

5

1

1

3

2

5

1

2

3

5

In the following example, we have 5 comparators

The following 4-sorter has 3 comparator levels on its critical path

The cost-delay product for this example is 15

Fig. 7.4 Block diagram and schematic representation of a 4-sorter.

Page 10: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Cost as a Figure of Merit

n = 9, 25 modules, 9 levelsn = 10, 29 modules, 9 levels

n = 12, 39 modules, 9 levels

n = 16, 60 modules, 10 levels

Fig. 7.5 Some low-cost sorting networks.

Page 11: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Delay as a Figure of Merit

Fig. 7.6 Some fast sorting networks.

n = 6, 12 modules, 5 levels

n = 9, 25 modules, 8 levelsn = 10, 31 modules, 7 levels

n = 12, 40 modules, 8 levels

n = 16, 61 modules, 9 levels

These 3 comparators constitute one level

Page 12: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Cost-Delay Product as a Figure of Merit

n = 6, 12 modules, 5 levels

n = 9, 25 modules, 8 levelsn = 10, 31 modules, 7 levels

n = 12, 40 modules, 8 levels

n = 16, 61 modules, 9 levels

Fast 10-sorter from Fig. 7.6

n = 9, 25 modules, 9 levelsn = 10, 29 modules, 9 levels

n = 12, 39 modules, 9 levels

n = 16, 60 modules, 10 levels

Low-cost 10-sorter from Fig. 7.5

Cost Delay = 29 9 = 261 Cost Delay = 31 7 = 217

The most cost-effective n-sorter may be neither the fastest design, nor the lowest-cost design

Applications of sorting networks:Directing information packets to their destinations in a network routerConnecting n processors to n memory modules in a parallel computer

Page 13: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

7.3 Design of Sorting Networks

Fig. 7.7 Brick-wall 6-sorter based on odd–even transposition.

Rotate by 90 degrees

Rotate by 90 degrees to see the odd-even exchange patterns

C(n) = n(n – 1)/2D(n ) = n

Cost Delay = n2(n – 1)/2 = Q(n3)

Page 14: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Insertion Sort and Selection Sort

Fig. 7.8 Sorting network based on insertion sort or selection sort.

xxx

x

.

.

.

(n–1)-sorter

0

1

2

n–2

yyy

y

0

1

2

n–2

x n–1

.

.

.

y n–1

xxx

x

.

.

.

(n–1)-sorter

0

1

2

n–2

yyy

y

0

1

2

n–2

x n–1

.

.

.

y n–1

.

.

.

Insert ion sort Selection sort

Parallel insertion sort = Parallel selection sort = Parallel bubble sort!

C(n) = n(n – 1)/2D(n ) = 2n – 3 Cost Delay = Q(n3)

Page 15: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Theoretically Optimal Sorting Networks

AKS sorting network(Ajtai, Komlos, Szemeredi: 1983)

xxx

x

.

.

.

.

.

.

n-sorter

0

1

2

n–1

yyy

y

0

1

2

n–1

The outputs are a permutation of the inputs satisfying y Š y Š ... Š y (non-descending)

0 1 n–1

O(log n) depth

O(n log n)size

Unfortunately, AKS networks are not practical owing to large (4-digit) constant factors involved; improvements since 1983 not enough

Note that even for these optimal networks, delay-cost product is suboptimal; but this is the best we can do

Existing sorting networks have O(log2 n) latency and O(n log2 n) cost

Given that log2 n is only 20 for n = 1 000 000, the latter are more practical

Page 16: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

7.4 Batcher Sorting NetworksOne type of sorting network proposed by Batcher is based on the idea of an (m ,m' )-merger and uses a technique known as even–odd merge or odd-even merge. An (m ,m' )-merger is a circuit that merges two sorted sequences of lengths m and m' into a singlesorted sequence of length m + m'.

Let the two sorted sequences be:

The odd–even merge is done by merging the even- and odd-indexed elements of the two lists separately:

If we now compare–exchange the pairs of elements the resulting sequence will be completely sorted. Note that v0 , which is known to be the smallest element overall, is excluded from the final compare–exchange operations

Page 17: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Batcher Sorting Networks

Page 18: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Batcher Sorting Networks

Page 19: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Batcher Sorting Networks

Fig. 7.9 Batcher’s even–odd merging network for 4 + 7 inputs.

x

x

x

x

y

y

y

y

y

y

y v

v

v

v

v

v

0

1

2

3

0

1

2

3

4

5

6

0

1

2

3

4

5

w

w

w

w

w

0

1

2

3

4

(2, 4)-merger (2, 3)-merger

First sorted sequ- ence x

Second sorted sequ- ence y

(2, 3)-mergera0a1

b0b1b2

(1, 2)-mergerc0

d0d1

(1, 1) (1, 2)

(1, 1)

Page 20: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Batcher’s Even-Odd Merge Sorting

Batcher’s (m, m) even-odd merger, for m a power of 2:

C(m) = 2C(m/2) + m – 1 = (m – 1) + 2(m/2 – 1) + 4(m/4 – 1) + . . . = m log2m + 1

D(m) = D(m/2) + 1 = log2 m + 1

Cost Delay = Q(m log2 m)

Page 21: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Batcher’s Even-Odd Merge Sorting

Batcher sorting networks based on the even-odd merge technique:

C(n) = 2C(n/2) + (n/2)(log2(n/2)) + 1 n(log2n)2/ 2

D(n) = D(n/2) + log2(n/2) + 1 = D(n/2) + log2n = log2n (log2n + 1)/2

Cost Delay = Q(n log4n)

n/2-sorter

n/2-sorter

(n/2, n/2)- merger

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

Fig. 7.10 The recursive structure of Batcher’s even–odd merge sorting network.

Page 22: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Example Batcher’s Even-Odd 8-Sorter

n/2-sorter

n/2-sorter

(n/2, n/2)- merger

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

4-sorters Even (2,2)-merger

Odd (2,2)-merger

Fig. 7.11 Batcher’s even-odd merge sorting network for eight inputs .

Page 23: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Bitonic-Sequence Sorter

Fig. 14.2 Sorting a bitonic sequence on a linear array.

Shift right half of data to left half (superimpose the two halves)

In each position, keep the smaller value of each pair and ship the larger value to the right

Each half is a bitonic sequence that can be sorted independently

0 1 2 n–1

0 1 2 n–1

. . .

. . .

Bitonic sequence

Shifted right half

n/2

n/2

. . .

. . .

Bitonic sequence:

1 3 3 4 6 6 6 2 2 1 0 0 Rises, then falls

8 7 7 6 6 6 5 4 6 8 8 9 Falls, then rises

8 9 8 7 7 6 6 6 5 4 6 8 The previous sequence, right-rotated by 2

Page 24: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Batcher’s Bitonic Sorting Networks

Fig. 7.12 The recursive structure of Batcher’s bitonic sorting network.

n/2-sorter

n/2-sorter

n-input bitonic- sequence sorter

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

Bitonic sequence

. . .

. . .

Page 25: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Batcher’s Bitonic Sorting Networks

Batcher’s bitonic sorting network for 16 inputs.

Bitonic Sequence of inputs

Page 26: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Batcher’s Bitonic Sorting Networks

Batcher’s bitonic sorting network for eight inputs.

Page 27: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

7.5 Other Classes of Sorting Networks

Fig. 7.14 Periodic balanced sorting network for eight inputs.

Desirable properties:a. Regular / modular (easier VLSI layout).b. Simpler circuits via reusing the blocksc. With an extra block tolerates some faults (missed exchanges)d. With 2 extra blocks provides tolerance to any single faults (a missed or incorrect exchange)e. Multiple passes through faulty network (graceful degradation)f. Single-block design becomes fault-tolerant by using an extra stage

Page 28: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Shearsort-Based Sorting Networks (1)

Fig. 7.15 Design of an 8-sorter based on shearsort on 24 mesh.

0 1 2 3

4567

Snake-like row sorts

Column sorts

0 1 2 3 4 5 6 7

Snake-like row sorts

Corresponding 2-D mesh

Page 29: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Shearsort-Based Sorting Networks (2)

Fig. 7.16 Design of an 8-sorter based on shearsort on 24 mesh.

0 1 2 3 4 5 6 7

0 1

3 2

54

7 6

Corresponding 2-D mesh

Left column sort

Right column sort

Snake-like row sort

Left column sort

Right column sort

Snake-like row sort

Some of the sameadvantages asperiodic balancedsorting networks

Page 30: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

7.6 Selection NetworksDirect design may yield simpler/faster selection networks

4-sorters Even (2,2)-merger

Odd (2,2)-merger

3rd smallest element

Can remove this block if smallest three inputs needed

Can remove these four comparators

Deriving an (8, 3)-selector from Batcher’s even-odd merge 8-sorter.

Page 31: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Categories of Selection Networks

Unfortunately we know even less about selection networks than we do about sorting networks.

One can define three selection problems [Knut81]:

I. Select the k smallest values; present in sorted order II. Select kth smallest value III. Select the k smallest values; present in any order

Circuit and time complexity: (I) hardest, (III) easiest

Classifiers:Selectors that separate the smaller half of values from the larger half

Smaller4 values

Larger4 values

8 inputs 8-Classifier

Page 32: II Extreme Models Study the two extremes of parallel computation models: Abstract SM (PRAM); ignores implementation issues Concrete circuit model; incorporates

Type-III Selection Networks

[0,7]

[0,7]

[0,7]

[0,7]

[0,7]

[0,7]

[0,7]

[0,7]

[0,6]

[1,7]

[0,6]

[0,6]

[0,6]

[1,7]

[1,7]

[1,7]

[1,6]

[1,6]

[1,6]

[1,6]

[0,3]

[0,4]

[0,4]

[0,4]

[0,4] [0,3]

[3,7]

[4,7][3,7][3,7]

[3,7]

[4,7]

[1,3]

[1,5]

[1,5] [1,3]

[4,6][2,6]

[2,6]

[4,6]

Figure 7.17 A type III (8, 4)-selector. 8-Classifier