Communication Lower Bound for the Fast Fourier Transform Michael Anderson Communication-Avoiding Algorithms (CS294) Fall 2011

Upload: madeleine-savery

Post on 31-Mar-2015


TRANSCRIPT

Page 1: Communication Lower Bound for the Fast Fourier Transform Michael Anderson Communication-Avoiding Algorithms (CS294) Fall 2011

Communication Lower Bound for the Fast Fourier Transform

Michael Anderson

Communication-Avoiding Algorithms (CS294)

Fall 2011

Page 2

Sources

• J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In STOC '81: Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326--333, New York, NY, USA, 1981. ACM.

• J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In COCOON, pages 270--281, 1995.

• CS256 Applied Theory of Computation Brown University. Lecture 18 (http://www.cs.brown.edu/courses/csci2560/lectures/lect.18.MemoryHierarchyIII.pdf)

• J. E. Savage. Models of Computation: Exploring the Power of Computing.

• A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116--1127, 1988.

Page 3

Outline

1. Fast Fourier Transform
2. Lower bound
   1. Two-level pebble game
   2. S-span
3. Upper bound
4. Multilevel pebble game
5. Open Problems


Page 5

Discrete Fourier Transform

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi j k n / N}$$

Here $X$ is the output vector and $x$ is the input vector.

Page 6

Unroll Output Vector

$$
\begin{aligned}
X_0 &= \sum_{n=0}^{N-1} x_n \omega^{0 \cdot n} \\
X_1 &= \sum_{n=0}^{N-1} x_n \omega^{1 \cdot n} \\
X_2 &= \sum_{n=0}^{N-1} x_n \omega^{2 \cdot n} \\
X_3 &= \sum_{n=0}^{N-1} x_n \omega^{3 \cdot n} \\
&\;\;\vdots \\
X_{N-2} &= \sum_{n=0}^{N-1} x_n \omega^{(N-2) \cdot n} \\
X_{N-1} &= \sum_{n=0}^{N-1} x_n \omega^{(N-1) \cdot n}
\end{aligned}
$$

$$\omega = e^{-2\pi j / N}$$

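Page 7

Unroll Input Vector

$$
\begin{aligned}
X_0 &= x_0\omega^{0 \cdot 0} + x_1\omega^{0 \cdot 1} + x_2\omega^{0 \cdot 2} + \cdots + x_{N-2}\omega^{0 \cdot (N-2)} + x_{N-1}\omega^{0 \cdot (N-1)} \\
X_1 &= x_0\omega^{1 \cdot 0} + x_1\omega^{1 \cdot 1} + x_2\omega^{1 \cdot 2} + \cdots + x_{N-2}\omega^{1 \cdot (N-2)} + x_{N-1}\omega^{1 \cdot (N-1)} \\
X_2 &= x_0\omega^{2 \cdot 0} + x_1\omega^{2 \cdot 1} + x_2\omega^{2 \cdot 2} + \cdots + x_{N-2}\omega^{2 \cdot (N-2)} + x_{N-1}\omega^{2 \cdot (N-1)} \\
X_3 &= x_0\omega^{3 \cdot 0} + x_1\omega^{3 \cdot 1} + x_2\omega^{3 \cdot 2} + \cdots + x_{N-2}\omega^{3 \cdot (N-2)} + x_{N-1}\omega^{3 \cdot (N-1)} \\
X_4 &= x_0\omega^{4 \cdot 0} + x_1\omega^{4 \cdot 1} + x_2\omega^{4 \cdot 2} + \cdots + x_{N-2}\omega^{4 \cdot (N-2)} + x_{N-1}\omega^{4 \cdot (N-1)} \\
&\;\;\vdots \\
X_{N-2} &= x_0\omega^{(N-2) \cdot 0} + x_1\omega^{(N-2) \cdot 1} + x_2\omega^{(N-2) \cdot 2} + \cdots + x_{N-2}\omega^{(N-2)(N-2)} + x_{N-1}\omega^{(N-2)(N-1)} \\
X_{N-1} &= x_0\omega^{(N-1) \cdot 0} + x_1\omega^{(N-1) \cdot 1} + x_2\omega^{(N-1) \cdot 2} + \cdots + x_{N-2}\omega^{(N-1)(N-2)} + x_{N-1}\omega^{(N-1)(N-1)}
\end{aligned}
$$

$$\omega = e^{-2\pi j / N}$$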

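Page 8

Phrase as Matrix-Vector Multiply

$$
\underbrace{\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \\ X_4 \\ \vdots \\ X_{N-2} \\ X_{N-1} \end{pmatrix}}_{\text{output vector}}
=
\begin{pmatrix}
\omega^{0 \cdot 0} & \omega^{0 \cdot 1} & \omega^{0 \cdot 2} & \cdots & \omega^{0 \cdot (N-2)} & \omega^{0 \cdot (N-1)} \\
\omega^{1 \cdot 0} & \omega^{1 \cdot 1} & \omega^{1 \cdot 2} & \cdots & \omega^{1 \cdot (N-2)} & \omega^{1 \cdot (N-1)} \\
\omega^{2 \cdot 0} & \omega^{2 \cdot 1} & \omega^{2 \cdot 2} & \cdots & \omega^{2 \cdot (N-2)} & \omega^{2 \cdot (N-1)} \\
\omega^{3 \cdot 0} & \omega^{3 \cdot 1} & \omega^{3 \cdot 2} & \cdots & \omega^{3 \cdot (N-2)} & \omega^{3 \cdot (N-1)} \\
\omega^{4 \cdot 0} & \omega^{4 \cdot 1} & \omega^{4 \cdot 2} & \cdots & \omega^{4 \cdot (N-2)} & \omega^{4 \cdot (N-1)} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\omega^{(N-2) \cdot 0} & \omega^{(N-2) \cdot 1} & \omega^{(N-2) \cdot 2} & \cdots & \omega^{(N-2)(N-2)} & \omega^{(N-2)(N-1)} \\
\omega^{(N-1) \cdot 0} & \omega^{(N-1) \cdot 1} & \omega^{(N-1) \cdot 2} & \cdots & \omega^{(N-1)(N-2)} & \omega^{(N-1)(N-1)}
\end{pmatrix}
\underbrace{\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{N-2} \\ x_{N-1} \end{pmatrix}}_{\text{input vector}}
$$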
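As a minimal sketch of the matrix-vector view, the Python below builds the matrix of powers of ω and multiplies it by the input vector. The names `dft_matrix` and `dft` are illustrative and do not come from the slides. This is the direct O(N²) evaluation, not a fast algorithm.

```python
import cmath


def dft_matrix(n):
    """Return the n-by-n matrix W with W[k][m] = omega**(k*m),
    where omega = exp(-2*pi*j/n)."""
    omega = cmath.exp(-2j * cmath.pi / n)
    return [[omega ** (k * m) for m in range(n)] for k in range(n)]


def dft(x):
    """Direct DFT: multiply the DFT matrix by the input vector x."""
    w = dft_matrix(len(x))
    return [sum(w_km * x_m for w_km, x_m in zip(row, x)) for row in w]


if __name__ == "__main__":
    # A unit impulse transforms to a constant vector.
    print(dft([1, 0, 0, 0]))
```

Each output entry reads every input entry, which is the N² work that the factorization on the following slides avoids.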

Page 9

Factorization

[Figure: a single DFT block between the input vector and the output vector]

Page 10

Factorization

[Figure: input vector, two DFT blocks, a column of eight "+*" nodes, output vector]

Butterfly operation on a pair of inputs $x_0$, $x_1$, producing outputs $x_0'$, $x_1'$:

$$x_0' = x_0 + x_1\omega^k$$

$$x_1' = x_0 - x_1\omega^k$$
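The factorization can be sketched as a recursive radix-2 FFT, assuming the input length is a power of two. The standard butterfly combines one output of each half-size DFT into x0 + w·x1 and x0 − w·x1. The names `butterfly` and `fft` are illustrative, not from the slides.

```python
import cmath


def butterfly(x0, x1, w):
    """Standard radix-2 butterfly: returns (x0 + w*x1, x0 - w*x1)."""
    t = w * x1
    return x0 + t, x0 - t


def fft(x):
    """Recursive radix-2 FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # half-size DFT of the even-indexed inputs
    odd = fft(x[1::2])   # half-size DFT of the odd-indexed inputs
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor omega**k
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out


if __name__ == "__main__":
    print(fft([1, 2, 3, 4]))
```

Each level of recursion performs N/2 butterflies, and there are log N levels, which gives the FFT graph its layered butterfly structure.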

Page 11

Factorization

[Figure: input vector, four DFT blocks, two columns of eight "+*" nodes, output vector]

Page 12

FFT

[Figure: FFT data flow from the input vector to the output vector, with a "Shuffle" stage followed by a "Compute" stage made up of 24 "+*" nodes]

Page 13

Outline

1. Fast Fourier Transform
2. Lower bound
   1. Two-level pebble game
   2. S-span
3. Upper bound
4. Multilevel pebble game
5. Open Problems

Page 14

Red Blue (2-level) Pebble Game

• Used to analyze communication in straight-line programs (e.g. matrix multiply, FFT, matrix transpose)

• Played on a DAG. Vertices represent inputs, intermediate data, and operations. Edges represent data dependencies.

• Pebbles represent memory locations. Each pebble color represents a distinct level of the memory hierarchy. Placing a pebble on a vertex means storing that data element at the corresponding level.

Page 15

Red Blue (2-level) Pebble Game

Red Pebble (Fast Memory)

Blue Pebble (Slow Memory)

Page 16

Rules of the Red Blue Pebble Game

• (Initialization) A blue pebble can be placed on any input vertex at any time

• (Input) A red pebble may be placed on any vertex that contains a blue pebble

• (Output) A blue pebble may be placed on any vertex that contains a red pebble

• (Computation) A red pebble can be placed on any vertex if all of its immediate predecessors have red pebbles

• (Deletion) A pebble can be removed at any time

• (Goal) All output vertices contain blue pebbles

Page 17

Playing the Game

• A pebbling strategy is a sequence of steps in which the rules on the previous slide are used to move pebbles

• The number of red pebbles (size of fast memory) is limited to S (assume infinite blue pebbles).

• A communication lower bound (or minimum I/O time) is obtained by bounding from below the number of (Input) and (Output) moves made by any possible pebbling strategy.

• The total number of computation steps should also be minimized
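The rules and the red-pebble limit S can be sketched as a small move checker. This is an illustrative model only: the class name `RedBluePebbleGame`, its method names, and the single-butterfly example DAG are assumptions made here, not part of the slides. Every (Input) and (Output) move increments an I/O counter, and a move that breaks a rule raises an error.

```python
class RedBluePebbleGame:
    """Minimal checker for red-blue pebble game moves (hypothetical API)."""

    def __init__(self, preds, inputs, outputs, s):
        self.preds = preds            # vertex -> list of immediate predecessors
        self.inputs = set(inputs)
        self.outputs = set(outputs)
        self.s = s                    # number of red pebbles (fast memory size)
        self.red = set()
        self.blue = set(inputs)       # (Initialization) inputs start in slow memory
        self.io_moves = 0

    def _place_red(self, v):
        if v not in self.red and len(self.red) >= self.s:
            raise ValueError("no free red pebble")
        self.red.add(v)

    def load(self, v):
        """(Input) red pebble on a vertex that holds a blue pebble."""
        if v not in self.blue:
            raise ValueError("vertex has no blue pebble")
        self._place_red(v)
        self.io_moves += 1

    def store(self, v):
        """(Output) blue pebble on a vertex that holds a red pebble."""
        if v not in self.red:
            raise ValueError("vertex has no red pebble")
        self.blue.add(v)
        self.io_moves += 1

    def compute(self, v):
        """(Computation) all immediate predecessors must hold red pebbles."""
        if v in self.inputs:
            raise ValueError("input vertices cannot be computed")
        if not all(u in self.red for u in self.preds[v]):
            raise ValueError("a predecessor has no red pebble")
        self._place_red(v)

    def delete_red(self, v):
        """(Deletion) remove a red pebble."""
        self.red.discard(v)

    def done(self):
        """(Goal) every output vertex holds a blue pebble."""
        return self.outputs <= self.blue


def pebble_one_butterfly():
    """Pebble a single butterfly u1, u2 -> v1, v2 with S = 3 red pebbles."""
    preds = {"u1": [], "u2": [], "v1": ["u1", "u2"], "v2": ["u1", "u2"]}
    game = RedBluePebbleGame(preds, ["u1", "u2"], ["v1", "v2"], s=3)
    game.load("u1")
    game.load("u2")
    game.compute("v1")
    game.store("v1")
    game.delete_red("v1")
    game.compute("v2")
    game.store("v2")
    return game
```

In this example the strategy uses four I/O moves (two loads and two stores), which is also the minimum for this graph since both inputs must be read and both outputs written.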

Page 18

S-span

• The S-span of a DAG G, ρ(S,G), is the maximum number of vertices of G that can be pebbled with S red pebbles in the red pebble game, maximized over all initial placements of the S red pebbles.

• The red pebble game is like the red-blue game, except that blue pebbles cannot be placed on intermediate vertices.

[Figure legend: red pebble; initial red pebble (S = 6)]

Page 19

Using S-span for Lower Bounds

Divide the computation into h sub-pebblings (C_1, C_2, ..., C_h), each of which communicates no more than S words between levels 1 and 2.

Each sub-pebbling has 2S words available (S words initially in the cache plus S inputs). Therefore, each sub-pebbling can perform no more than ρ(2S,G) operations.

[Figure: sequence of sub-pebblings C_1, C_2, ..., C_h, each consisting of input, level-1 operations, and output]

Page 20

Using S-span for Lower Bounds

• Theorem: For every pebbling P of G = (V,E) in the red-blue pebble game with S red pebbles, the I/O time used, T_2(S,G,P), satisfies:

$$\lceil T_2(S,G,P)/S \rceil \, \rho(2S,G) \ge |V| - |\mathrm{In}(G)|$$

Here ⌈T_2(S,G,P)/S⌉ is the number of words moved (in batches of S words), ρ(2S,G) is an upper bound on the arithmetic intensity (number of operations per 2S words), and |V| − |In(G)| is the total number of operations. Equivalently:

$$\lceil T_2(S,G,P)/S \rceil \ge \frac{|V| - |\mathrm{In}(G)|}{\rho(2S,G)}$$

Page 21

What is the S-span of the FFT DAG?

Lemma 1: The S-span of the FFT DAG on n inputs is no greater than 2 S log(S) when S < n.

Proof: Let num(p) denote the number of moves currently allocated to pebble p. Consider a butterfly whose lower-level nodes u1 and u2 carry pebbles p1 and p2, and whose upper-level nodes are v1 and v2. Both p1 and p2 are moved to the upper-level nodes v1 and v2. (This is not a legal play, but it gives an upper bound.) If num(p1) = num(p2), increment both; otherwise, increment the smaller.

The total number of red pebbling moves is therefore bounded by:

$$2 \sum_{p \in \text{pebbles}} \mathrm{num}(p)$$

[Figure: butterfly with lower-level nodes u1, u2 (carrying pebbles p1, p2) and upper-level nodes v1, v2]


Page 23

What is the S-span of the FFT DAG?

Lemma 2: For each pebble p on node n in the FFT DAG, the number of nodes, N(p), that contained a red pebble in the initial configuration and that are connected by a directed path to n is at least $2^{\mathrm{num}(p)}$.

Proof (Induction): Base case: num(p) = 1. In this case, the node n needed $2 = 2^1$ inputs.

Inductive step: Assume that N(p) is at least $2^{\mathrm{num}(p)}$ for every pebble before a butterfly operation, so that N(p) is at least $2^{e-1}$ when num(p) = e-1. Show that N(p) becomes at least $2^e$ when num(p) is incremented to e during the butterfly operation.

Case 1: Pebbles p1 and p2 enter a butterfly operation with num(p1) = num(p2) = e-1. Since u1 and u2 are roots of disjoint trees, each with at least $2^{e-1}$ initial pebbles, the total number of initial pebbles is now at least $2 \cdot 2^{e-1} = 2^e$.

Case 2: num(p) = e-1 < num(partner) in the butterfly. Then num(partner) ≥ e, so the partner must already have been connected to at least $2^e$ initial pebbles, and the node that p moves to is reachable from all of them.

Page 24

What is the S-span of the FFT DAG?

There are S pebbles and each pebble can only cover one initial placement. Therefore num(p) ≤ log(S), because there must be at least $2^{\mathrm{num}(p)}$ initial pebbles (Lemma 2).

According to Lemma 1, the total number of pebbling moves is bounded by:

$$2 \sum_{p \in \text{pebbles}} \mathrm{num}(p) \le 2 \sum_{p \in \text{pebbles}} \log(S) = 2 S \log(S)$$

So the S-span is at most 2 S log(S). QED

Page 25

FFT Two-level Hierarchy Lower Bound

$$\lceil T_2(S,G,P)/S \rceil \, \rho(2S,G) \ge |V| - |\mathrm{In}(G)|$$

Substituting the S-span bound ρ(2S,G) ≤ 2(2S) log(2S) = 4S log(2S):

$$\lceil T_2(S,G,P)/S \rceil \ge \frac{N \log(N) - N}{4S \log(2S)}$$

$$T_2(S,G,P) = \Omega\!\left(\frac{N \log N}{\log S}\right)$$

Here T_2(S,G,P) is the number of words moved.
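To get a feel for the constants, the bound can be evaluated numerically. The sketch below assumes base-2 logarithms and uses the quantities as they appear on the slides (S-span at most 2 S log S, operation count N log N − N). The function names are illustrative, not from the slides.

```python
import math


def s_span_bound(s):
    """Upper bound on the S-span of the FFT DAG: 2 * S * log2(S)."""
    return 2 * s * math.log2(s)


def min_batches(n, s):
    """Smallest integer allowed for ceil(T2 / S) by the two-level bound."""
    ops = n * math.log2(n) - n  # operation count as used on the slide
    return math.ceil(ops / s_span_bound(2 * s))


def words_moved_lower_bound(n, s):
    """ceil(T2 / S) >= b with T2 an integer implies T2 >= S * (b - 1) + 1."""
    return s * (min_batches(n, s) - 1) + 1


if __name__ == "__main__":
    n, s = 2 ** 20, 2 ** 10
    print(min_batches(n, s), words_moved_lower_bound(n, s))
    # For comparison, N log N / log S with the same N and S:
    print(n * math.log2(n) / math.log2(s))
```

For N = 2^20 and S = 2^10 the explicit bound is within a small constant factor of N log N / log S, consistent with the Ω statement.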

Page 26

Outline

1. Fast Fourier Transform
2. Lower bound
   1. Two-level pebble game
   2. S-span
3. Upper bound
4. Multilevel pebble game
5. Open Problems

Page 27

Transpose FFT

Page 28

Transpose FFT (Upper Bound)

Suppose the FFT size is a power of 2 ($N = 2^d$). There are log(N) levels in the FFT DAG.

Divide the large FFT into many FFTs of size S, where S is the size of fast memory. There are log(N)/log(S) stages of independent size-S FFTs. After each stage, store the outputs in slow memory, for a total of O(N log(N)/log(S)) words moved between fast and slow memory, which matches the lower bound up to a constant factor.
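The staged scheme can be sketched in Python as follows. This is a hedged illustration, not the author's implementation: it assumes N and S are powers of two, models slow memory as a list and fast memory as a small dict of at most S entries, and does not count the initial bit-reversal shuffle as I/O. The names `blocked_fft`, `bit_reverse`, and `naive_dft` are made up for this example.

```python
import cmath


def naive_dft(x):
    """Direct O(N^2) DFT, used only to check the result."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def blocked_fft(x, s):
    """Radix-2 FFT done in passes of log2(s) butterfly levels each.

    Returns (result, words_moved), counting every load into and every
    store out of the modeled fast memory.
    """
    n = len(x)
    d = n.bit_length() - 1  # log2(N)
    b = s.bit_length() - 1  # log2(S)
    if n != 1 << d or s != 1 << b or b < 1:
        raise ValueError("N and S must be powers of two, with S >= 2")
    slow = [0j] * n
    for i, v in enumerate(x):
        slow[bit_reverse(i, d)] = complex(v)  # shuffle (not counted as I/O)
    words_moved = 0
    for first in range(0, d, b):              # one pass per group of levels
        last = min(first + b, d)
        block = 1 << (last - first)           # block size, at most S
        for base in range(n):
            if (base >> first) & (block - 1):
                continue                      # not the lowest index of its block
            idx = [base | (k << first) for k in range(block)]
            fast = {i: slow[i] for i in idx}  # load block into fast memory
            words_moved += block
            for stage in range(first, last):
                half = 1 << stage
                for i in idx:
                    if i & half:
                        continue              # i is the upper butterfly input
                    w = cmath.exp(-2j * cmath.pi * (i % half) / (2 * half))
                    t = w * fast[i | half]
                    fast[i], fast[i | half] = fast[i] + t, fast[i] - t
            for i in idx:
                slow[i] = fast[i]             # store block back to slow memory
            words_moved += block
    return slow, words_moved


if __name__ == "__main__":
    result, moved = blocked_fft(list(range(64)), 8)
    print(moved)
```

With N = 64 and S = 8 there are log(N)/log(S) = 2 passes, each loading and storing all N words, so the counter reports 2 · N · 2 = 256 words. Counting loads and stores separately gives the factor of 2 relative to N log(N)/log(S).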

Page 29

Outline

1. Fast Fourier Transform
2. Lower bound
   1. Two-level pebble game
   2. S-span
3. Upper bound
4. Multilevel pebble game
5. Open Problems

Page 30

Multilevel Pebble Game

• Red/blue pebble game was for 2 levels (fast and slow)

• For the multilevel game, data begins and ends in the highest-level memory (level L) and can be transferred between consecutive levels (l-1 to l or vice versa)

[Figure: memory hierarchy from Level-1 (Registers) through Level-2 (On-chip cache), ..., up to Level-L (Main Memory)]

Page 31

Rules of the Multilevel Pebble Game

• (Initialization) A level-L pebble can be placed on any input vertex at any time

• (Computation) A first-level pebble can be placed on any vertex if all of its immediate predecessors have first-level pebbles

• (Deletion) Except for level-L pebbles on output vertices, a pebble at any level can be removed at any time

• (Input from level-l) For 2 ≤ l ≤ L, a level-(l-1) pebble can be placed on any vertex carrying a level-l pebble

• (Output to level-l) For 2 ≤ l ≤ L, a level-l pebble can be placed on any vertex carrying a level-(l-1) pebble

• (Goal) All output vertices contain level-L pebbles

Page 32

Terminology

• Resource vector p = (p_1, p_2, p_3, ..., p_{L-1}), where p_l is the number of pebbles at level l. (The number of pebbles at the highest level is assumed infinite.)

• s_l = total number of pebbles available at level l or below

• A minimal pebbling minimizes the number of I/O operations at the highest level, then the number of I/O operations at each successively lower level, and finally the number of computation steps.

• T_l = number of I/O operations at level l

Page 33

Multilevel S-Span

Theorem:

Consider a minimal pebbling of the DAG G = (V,E) in the standard memory hierarchy game with resource vector p using s_l pebbles at level l or less. The following lower bound must be satisfied:

$$\left\lceil T_l^{(L)}(p,G) / s_{l-1} \right\rceil \, \rho(2 s_{l-1}, G) \ge |V|$$

[Figure: level-l sub-pebblings C_1, C_2, ..., C_h, each consisting of input, level-(l-1) operations, and output]

Page 34

Relating Multilevel to 2-level

Theorem:

The following inequality holds for 2 ≤ l ≤ L when the graph G is pebbled in the L-level game with resource vector p.

$$T_l^{(L)}(p,G) \ge T_2^{(2)}(s_{l-1},G)$$

Page 35

Review

• The minimum I/O time for the FFT in the 2-level case is Θ(N log N / log S)

• This was determined by finding the S-span of the FFT graph and using it to bound the number of words transferred between memory levels

• The transpose FFT algorithm achieves this lower bound up to a constant factor (so the lower bound is tight)

• Two-level lower bounds can be generalized to multi-level memory hierarchies

Page 36

Outline

1. Fast Fourier Transform
2. Lower bound
   1. Two-level pebble game
   2. S-span
3. Upper bound
4. Multilevel pebble game
5. Open Problems

Page 37

Open Problems

• Communication lower bounds for 2-D and 3-D FFTs
  • I suspect that the S-span argument also holds for the 2-D case
  • What if S is larger than one row?

• Determining the FFT lower bound for the parallel model described in this class
  • Lower bounds for a "parallel hierarchical memory model" using randomized sorting algorithms for communication can be found in J. S. Vitter and E. A. M. Shriver, "Algorithms for Parallel Memory II: Hierarchical Multilevel Memories"

• Using the pebble game (S-span) method to analyze new algorithms
  • Matrix multiply, sorting, and several other examples can be found in the references listed earlier

Page 38

Questions?