
Page 1: Autotuning (1/2) - Vuduc

Autotuning (1/2): Cache-oblivious algorithms

Prof. Richard Vuduc

Georgia Institute of Technology

CSE/CS 8803 PNA: Parallel Numerical Algorithms

[L.17] Tuesday, March 4, 2008

1

Page 2: Autotuning (1/2) - Vuduc

Today’s sources

CS 267 (Demmel & Yelick @ UCB; Spring 2007)

“An experimental comparison of cache-oblivious and cache-conscious programs?” by Yotov, et al. (SPAA 2007)

“The memory behavior of cache oblivious stencil computations,” by Frigo & Strumpen (2007)

Talks by Matteo Frigo and Kaushik Datta at CScADS Autotuning Workshop (2007)

Demaine’s course notes @ MIT: http://courses.csail.mit.edu/6.897/spring03/scribe_notes

2

Page 3: Autotuning (1/2) - Vuduc

Review: Tuning matrix multiply

3

Page 4: Autotuning (1/2) - Vuduc

Tiled MM on AMD Opteron 2.2 GHz (4.4 Gflop/s peak), 1 MB L2 cache

< 25% peak! We evidently still have a lot of work to do...

4

Page 5: Autotuning (1/2) - Vuduc

[Figure: memory hierarchy, fast to slow: registers, L1, TLB, L2, main memory]

5

Page 6: Autotuning (1/2) - Vuduc

[Figure: blocked matrix multiply, C = A · B]

6

Page 7: Autotuning (1/2) - Vuduc

Software pipelining: Interleave iterations to delay dependent instructions

[Figure: instructions from iterations i−4 through i+1 interleaved]

Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
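The interleaving above can be sketched in C. This manually pipelined dot product is an illustrative kernel (not from the slides): the next iteration's loads are issued before the current iteration's multiply-add completes, so the latencies can overlap.

```c
// Manually software-pipelined dot product (sketch, 1-stage pipeline):
// the loads for iteration i+1 overlap the multiply-add of iteration i.
double dot_pipelined(const double *x, const double *y, int n) {
    double sum = 0.0;
    if (n <= 0) return sum;
    double xi = x[0], yi = y[0];       // prologue: first iteration's loads
    for (int i = 0; i < n - 1; i++) {
        double xn = x[i + 1];          // next iteration's loads...
        double yn = y[i + 1];
        sum += xi * yi;                // ...overlap this iteration's multiply-add
        xi = xn; yi = yn;
    }
    sum += xi * yi;                    // epilogue: last iteration
    return sum;
}
```

In practice the compiler (or an out-of-order core) performs this transformation; writing it by hand mainly helps in-order machines.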

7

Page 8: Autotuning (1/2) - Vuduc

[Figure: Dense Matrix Multiply Performance (Square n×n Operands) on an 800 MHz Intel Pentium III-mobile. Performance (Mflop/s, fraction of peak) vs. matrix dimension n, for: Vendor, Goto-BLAS, reg/insn-level + cache tiling + copy, cache tiling + copy opt., and reference implementations]

Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)

8

Page 9: Autotuning (1/2) - Vuduc

Cache-oblivious matrix multiply

[Yotov, Roeder, Pingali, Gunnels, Gustavson (SPAA 2007)]

[Talk by M. Frigo at CScADS Autotuning Workshop 2007]

9

Page 10: Autotuning (1/2) - Vuduc

Memory model for analyzing cache-oblivious algorithms

[Figure: two-level hierarchy, fast cache above slow memory]

Two-level memory hierarchy

M = capacity of the cache (“fast”)

L = cache line size

Fully associative

Optimal replacement: evicts the line whose next use is most distant

Sleator & Tarjan (CACM 1985): LRU and FIFO come within a constant factor of optimal, given a cache larger by a constant factor

“Tall cache”: M ≥ Ω(L²)

Limits: See Brodal & Fagerberg (STOC 2003)

When might this not hold?

10

Page 11: Autotuning (1/2) - Vuduc

A recursive algorithm for matrix-multiply

[Figure: C, A, B each divided into 2×2 blocks: Cij, Aij, Bij]

Divide all dimensions in half

Bilardi, et al.: Use Gray code ordering

Cost (flops):

T(n) = { 8 · T(n/2)   if n > 1
       { O(1)         if n = 1

⟹ T(n) = O(n³)
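A minimal sketch of this recursion in C, assuming row-major n×n operands with n a power of two, and recursing all the way down to 1×1×1 (a real implementation would stop at a tuned micro-kernel):

```c
// Recursive (cache-oblivious) matrix multiply: C += A * B, where A, B, C
// are n x n blocks inside row-major N x N matrices. All three dimensions
// are divided in half, giving eight half-size products.
static void rec_mm(int n, int N, const double *A, const double *B, double *C) {
    if (n == 1) {              // base case: scalar multiply-accumulate
        *C += (*A) * (*B);
        return;
    }
    int h = n / 2;
    const double *A11 = A,          *A12 = A + h,
                 *A21 = A + h * N,  *A22 = A + h * N + h;
    const double *B11 = B,          *B12 = B + h,
                 *B21 = B + h * N,  *B22 = B + h * N + h;
    double *C11 = C,          *C12 = C + h,
           *C21 = C + h * N,  *C22 = C + h * N + h;
    // C11 += A11*B11 + A12*B21, and similarly for the other three blocks
    rec_mm(h, N, A11, B11, C11); rec_mm(h, N, A12, B21, C11);
    rec_mm(h, N, A11, B12, C12); rec_mm(h, N, A12, B22, C12);
    rec_mm(h, N, A21, B11, C21); rec_mm(h, N, A22, B21, C21);
    rec_mm(h, N, A21, B12, C22); rec_mm(h, N, A22, B22, C22);
}
```

Each level of recursion touches progressively smaller blocks, so at some depth all three operands fit in each level of the cache hierarchy without the code naming any cache size.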

11

Page 12: Autotuning (1/2) - Vuduc

A recursive algorithm for matrix-multiply

[Figure: C, A, B each divided into 2×2 blocks]

Divide all dimensions in half

Bilardi, et al.: Use Gray code ordering

I/O complexity?

12

Page 13: Autotuning (1/2) - Vuduc

A recursive algorithm for matrix-multiply

[Figure: C, A, B each divided into 2×2 blocks]

Divide all dimensions in half

Bilardi, et al.: Use Gray code ordering

Number of misses, with the tall-cache assumption:

Q(n) = { 8 · Q(n/2)   if n > √(M/3)
       { 3n²/L        otherwise

≤ Θ( n³ / (L·√M) )

13

Page 14: Autotuning (1/2) - Vuduc

Alternative: Divide longest dimension (Frigo, et al.)

[Figure: three cases: split m (halve A by rows into A1, A2 and C into C1, C2); split k (halve A by columns into A1, A2 and B by rows into B1, B2); split n (halve B into B1, B2 and C into C1, C2 by columns)]

Cache misses:

Q(m, k, n) ≤ { Θ( (mk + kn + mn)/L )   if mk + kn + mn ≤ αM
             { 2 · Q(m/2, k, n)        if m ≥ k, n
             { 2 · Q(m, k/2, n)        if k > m, k ≥ n
             { 2 · Q(m, k, n/2)        otherwise

= Θ( mkn / (L·√M) )
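A sketch of this longest-dimension recursion in C. The cutoff CUT is an illustrative stand-in for "mk + kn + mn fits in cache"; the three split cases mirror the recurrence above.

```c
// Divide-longest-dimension recursion (after Frigo et al.):
// C (m x n) += A (m x k) * B (k x n), row-major with leading dims lda/ldb/ldc.
enum { CUT = 8 };   // illustrative base-case cutoff

static void mm_base(int m, int k, int n,
                    const double *A, int lda, const double *B, int ldb,
                    double *C, int ldc) {
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i * ldc + j] += A[i * lda + p] * B[p * ldb + j];
}

static void mm_rec(int m, int k, int n,
                   const double *A, int lda, const double *B, int ldb,
                   double *C, int ldc) {
    if (m <= CUT && k <= CUT && n <= CUT) {
        mm_base(m, k, n, A, lda, B, ldb, C, ldc);
    } else if (m >= k && m >= n) {   // split m: halve A and C by rows
        int h = m / 2;
        mm_rec(h, k, n, A, lda, B, ldb, C, ldc);
        mm_rec(m - h, k, n, A + h * lda, lda, B, ldb, C + h * ldc, ldc);
    } else if (k >= n) {             // split k: both halves update all of C
        int h = k / 2;
        mm_rec(m, h, n, A, lda, B, ldb, C, ldc);
        mm_rec(m, k - h, n, A + h, lda, B + h * ldb, ldb, C, ldc);
    } else {                         // split n: halve B and C by columns
        int h = n / 2;
        mm_rec(m, k, h, A, lda, B, ldb, C, ldc);
        mm_rec(m, k, n - h, A, lda, B + h, ldb, C + h, ldc);
    }
}
```

Unlike the all-dimensions version, this handles arbitrary rectangular shapes, which is why Frigo et al. use it in practice.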

14

Page 15: Autotuning (1/2) - Vuduc

Relax tall-cache assumption using suitable layout

Source: Yotov, et al. (SPAA 2007) and Frigo, et al. (FOCS ’99)

Row-major: needs tall cache

Row-block-row: M ≥ Ω(L)

Morton-Z: no assumption

15

Page 16: Autotuning (1/2) - Vuduc

Latency-centric vs. bandwidth-centric views of blocking

[Figure: blocked loop indices I, J, K]

Latency view:

Time per flop ≈ 1 + (α/τ) · (1/b)

(α/τ) · (1/κ) ≤ b ≤ √(M/3)

Bandwidth view ⇐ assume we can perfectly overlap computation & communication:

Peak flop/cy ≡ φ; bandwidth, word/cy ≡ β

(2n³/b) · (1/β) ≤ 2n³ · (1/φ)  ⟹  φ/β ≤ b

16

Page 17: Autotuning (1/2) - Vuduc

Latency-centric vs. bandwidth-centric views of blocking

Example platform: Itanium 2 (2 FMAs/cycle)

[Figure: hierarchy FPU, registers, L1, L2, L3, memory, with per-link bandwidths in words/cycle]

1 ≤ b_R ≤ 6, given 1.33 ≤ β(R, L2) ≤ 4

1 ≤ b_L2 ≤ 6, given 1.33 ≤ β(L2, L3) ≤ 4

8 ≤ b_L3 ≤ 418, given 0.02 ≤ β(L3, memory) ≤ 0.5

Consider L3 ←→ memory bandwidth:

φ = 4 flops/cycle; β = 0.5 words/cycle

L3 capacity = 4 MB (512 kwords)

Need 8 ≤ b_L3 ≤ 418

Implications: approximate cache-oblivious blocking works

A wide range of block sizes should be OK

If the upper bound is more than 2× the lower bound, divide-and-conquer generates a block size in range

Source: Yotov, et al. (SPAA 2007)

17

Page 18: Autotuning (1/2) - Vuduc

Cache-oblivious vs. cache-aware

Does cache-oblivious perform as well as cache-aware?

If not, what can be done?

Next: Summary of Yotov, et al., study (SPAA 2007)

Stole slides liberally

18

Page 19: Autotuning (1/2) - Vuduc

All- vs. largest-dimension

Performance is similar; assume the all-dimensions scheme

19

Page 20: Autotuning (1/2) - Vuduc

Data structures

Morton-Z is complicated and yields the same or worse performance, so assume row-block-row layout

20

Page 21: Autotuning (1/2) - Vuduc

Example 1: Ultra IIIi

1 GHz ⇒ 2 Gflop/s peak

Memory hierarchy

32 registers

L1 = 64 KB, 4-way

L2 = 1 MB, 4-way

Sun compiler

21

Page 22: Autotuning (1/2) - Vuduc

• Iterative: triply nested loop
• Recursive: down to 1 × 1 × 1

[Diagram: outer control structure (iterative | recursive) → inner control structure (statement)]

22

Page 23: Autotuning (1/2) - Vuduc

[Diagram: outer control structure (iterative | recursive) → inner control structure (statement | recursive) → micro-kernel (none / compiler)]

• Recursion down to NB
• Unfold completely below NB to get a basic block
• Micro-kernel: the basic block, compiled with the native compiler
  • Best performance for NB = 12
  • Compiler unable to use registers
  • Unfolding reduces control overhead
  • Limited by I-cache

23

Page 24: Autotuning (1/2) - Vuduc

• Recursion down to NB
• Unfold completely below NB to get a basic block
• Micro-kernel:
  • Scalarize all array references in the basic block
  • Compile with the native compiler

[Diagram: outer control structure (iterative | recursive) → inner control structure (statement | recursive) → micro-kernel (none / compiler)]

24

Page 25: Autotuning (1/2) - Vuduc

• Recursion down to NB
• Unfold completely below NB to get a basic block
• Micro-kernel:
  • Perform Belady’s register allocation on the basic block
  • Schedule using the BRILA compiler

[Diagram: outer control structure (iterative | recursive) → inner control structure (statement | recursive) → micro-kernel (none / compiler | scalarized / compiler | Belady / BRILA)]

25

Page 26: Autotuning (1/2) - Vuduc

• Recursion down to NB
• Unfold completely below NB to get a basic block
• Micro-kernel:
  • Construct a preliminary schedule
  • Perform graph-coloring register allocation
  • Schedule using the BRILA compiler

[Diagram: outer control structure (iterative | recursive) → inner control structure (statement | recursive) → micro-kernel (none / compiler | scalarized / compiler | Belady / BRILA | coloring / BRILA)]

26

Page 27: Autotuning (1/2) - Vuduc

• Recursion down to MU × NU × KU
• Micro-kernel:
  • Completely unroll the MU × NU × KU triply nested loop
  • Construct a preliminary schedule
  • Perform graph-coloring register allocation
  • Schedule using the BRILA compiler

[Diagram: outer control structure (iterative | recursive) → inner control structure (statement | recursive | iterative) → micro-kernel (none / compiler | scalarized / compiler | Belady / BRILA | coloring / BRILA)]
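As an illustration of what an unrolled, scalarized micro-kernel looks like, here is a hypothetical 2×2 register-blocked kernel in C (MU = NU = 2 chosen for brevity; real values come from tuning, and a tool like BRILA would schedule the unrolled body):

```c
// Illustrative 2x2 register-blocked micro-kernel: the 2x2 block of C is
// held in scalars ("registers") across the k loop, as the scalarization
// step produces. A is a 2-row panel (lda = panel width), B a 2-column
// panel (ldb = 2), C a 2x2 block with row stride ldc.
static void micro_kernel_2x2(int K, const double *A, int lda,
                             const double *B, int ldb,
                             double *C, int ldc) {
    double c00 = C[0],   c01 = C[1];
    double c10 = C[ldc], c11 = C[ldc + 1];
    for (int k = 0; k < K; k++) {
        double a0 = A[k],       a1 = A[lda + k];      // column k of A panel
        double b0 = B[k * ldb], b1 = B[k * ldb + 1];  // row k of B panel
        c00 += a0 * b0; c01 += a0 * b1;
        c10 += a1 * b0; c11 += a1 * b1;
    }
    C[0]   = c00; C[1]       = c01;
    C[ldc] = c10; C[ldc + 1] = c11;
}
```

Because each element of A and B loaded is used twice, register blocking raises the flop-to-load ratio, which is exactly what the compiler could not do on the unscalarized basic block.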

27

Page 28: Autotuning (1/2) - Vuduc

Mini-kernel

• Recursion down to NB
• Mini-kernel:
  • NB × NB × NB triply nested loop
  • Tiling for the L1 cache
  • Body is the micro-kernel

[Diagram: outer control structure (iterative | recursive) → inner control structure (statement | recursive | iterative) → mini-kernel → micro-kernel (none / compiler | scalarized / compiler | Belady / BRILA | coloring / BRILA)]

28

Page 29: Autotuning (1/2) - Vuduc

[Diagram: the full taxonomy, now including the ATLAS CGw/S and ATLAS Unleashed variants]

ATLAS CGw/S, ATLAS Unleashed: specialized code generator with search

29

Page 30: Autotuning (1/2) - Vuduc

[Diagram: as on the previous slide]

30

Page 31: Autotuning (1/2) - Vuduc

Summary: Engineering considerations

Need to cut off the recursion

Careful scheduling/tuning required at “leaves”

Yotov, et al., report that full recursion + a tuned micro-kernel achieves ≤ 2/3 of the best performance

Open issues

Recursively scheduled kernels are worse than iteratively scheduled kernels; why?

Prefetching needed, but how best to apply in recursive case?

31

Page 32: Autotuning (1/2) - Vuduc

Administrivia

32

Page 33: Autotuning (1/2) - Vuduc

Upcoming schedule changes

Some adjustment of topics (TBD)

Tu 3/11 — Project proposals due

Th 3/13 — SIAM Parallel Processing (attendance encouraged)

Tu 4/1 — No class

Th 4/3 — Attend talk by Doug Post from DoD HPC Modernization Program

33

Page 34: Autotuning (1/2) - Vuduc

Homework 1: Parallel conjugate gradients

Put name on write-up!

Grading: 100 pts max

Correct implementation — 50 pts

Evaluation — 30 pts

Tested on two sample matrices — 5

Implemented and tested on stencil — 10

“Explained” performance (e.g., per proc, load balance, comp. vs. comm) — 15

Performance model — 15 pts

Write-up “quality” — 5 pts

34

Page 35: Autotuning (1/2) - Vuduc

Projects

Proposals due Tu 3/11

Your goal should be to do something useful, interesting, and/or publishable!

Something you’re already working on, suitably adapted for this course

Faculty-sponsored/mentored

Collaborations encouraged

35

Page 36: Autotuning (1/2) - Vuduc

My criteria for “approving” your project

“Relevant to this course:” Many themes, so think (and “do”) broadly

Parallelism and architectures

Numerical algorithms

Programming models

Performance modeling/analysis

36

Page 37: Autotuning (1/2) - Vuduc

General styles of projects

Theoretical: Prove something hard (high risk)

Experimental:

Parallelize something

Take existing parallel program, and improve it using models & experiments

Evaluate algorithm, architecture, or programming model

37

Page 38: Autotuning (1/2) - Vuduc

Examples

Anything of interest to a faculty member/project outside CoC

Parallel sparse triple product (R·A·Rᵀ, used in multigrid)

Future FFT

Out-of-core or I/O-intensive data analysis and algorithms

Block iterative solvers (convergence & performance trade-offs)

Sparse LU

Data structures and algorithms (trees, graphs)

Look at mixed precision

Discrete-event approaches to continuous systems simulation

Automated performance analysis and modeling, tuning

“Unconventional,” but related:

Distributed deadlock detection for MPI

UPC language extensions (dynamic block sizes)

Exact linear algebra

38

Page 39: Autotuning (1/2) - Vuduc

Switch: M. Frigo’s talk slides from the CScADS 2007 autotuning workshop

http://cscads.rice.edu/workshops/july2007/autotune-workshop-07

39

Page 40: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computations

[Frigo and Strumpen (ICS 2005)]

[Datta, et al. (2007)]

40

Page 41: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computation

[Figure: space-time diagram of a 1-D stencil, x = 0 … 16, t = 0 … 10]

41

Page 42: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computation

[Figure: space-time diagram, x = 0 … 16, t = 0 … 10]

w < 2×h:

42

Page 43: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computation

[Figure: space-time diagram, x = 0 … 16, t = 0 … 10]

w < 2×h ⇒ “Time-cut”:

43

Page 44: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computation

[Figure: space-time diagram, x = 0 … 16, t = 0 … 10]

w ≥ 2×h:

44

Page 45: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computation

[Figure: space-time diagram, x = 0 … 16, t = 0 … 10]

w ≥ 2×h ⇒ “Space-cut”:

45

Page 46: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computation

[Figure: space-time diagram, x = 0 … 16, t = 0 … 10]

w ≥ 2×h ⇒ “Space-cut”:

46

Page 47: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computation

[Figure: space-time diagram, x = 0 … 16, t = 0 … 10]

w < 2×h ⇒ “Time-cut”:

47

Page 48: Autotuning (1/2) - Vuduc

Cache-oblivious stencil computation

[Figure: space-time diagram, x = 0 … 16, t = 0 … 10]

Theorem [Frigo & Strumpen (ICS 2005)]: d = dimension ⇒

Q(n, t; d) = O( nᵈ · t / M^(1/d) )
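The time-cut/space-cut recursion on the preceding slides can be sketched in C, following the 1-D trapezoidal decomposition of Frigo & Strumpen. The grid size, 3-point stencil coefficients, and fixed-boundary handling below are illustrative assumptions (the periodic case needs extra care).

```c
// 1-D cache-oblivious stencil walk (sketch after Frigo & Strumpen).
// u[] holds two time planes; boundary points u[*][0], u[*][NX-1] stay fixed.
enum { NX = 64 };
static double u[2][NX];

// Advance point x from time t to t+1 (3-point averaging stencil).
static void kernel(int t, int x) {
    int cur = t % 2, nxt = 1 - cur;
    u[nxt][x] = 0.25 * u[cur][x - 1] + 0.5 * u[cur][x] + 0.25 * u[cur][x + 1];
}

// Trapezoid over time [t0, t1), with space edges x0 + dx0*(t - t0) and
// x1 + dx1*(t - t0); slopes dx0, dx1 are in {-1, 0, 1}.
static void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int dt = t1 - t0;
    if (dt == 1) {                         // base: sweep one time step
        for (int x = x0; x < x1; x++) kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            // wide (w >= 2h): space-cut along a slope -1 edge; the left
            // trapezoid has no dependencies on the right one
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            walk1(t0, t1, x0, dx0, xm, -1);
            walk1(t0, t1, xm, -1, x1, dx1);
        } else {
            // tall (w < 2h): time-cut at the midpoint
            int s = dt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}
```

A call like walk1(0, T, 1, 0, NX - 1, 0) performs the same kernel applications as T sweeps of an iterative loop, but in an order that keeps the working set within any cache level, with no cache parameters in the code.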

48

Page 49: Autotuning (1/2) - Vuduc

Source: Datta, et al. (2007)

Cache-oblivious stencil computation: Fewer misses but more time

49

Page 50: Autotuning (1/2) - Vuduc

Cache-conscious algorithm

[Figure: space-time diagram, x = 0 … 16, t = 0 … 10, with space blocked into width-b panels]

50

Page 51: Autotuning (1/2) - Vuduc

Cache-conscious algorithm

Source: Datta, et al. (2007)

51

Page 52: Autotuning (1/2) - Vuduc

“In conclusion…”

52

Page 53: Autotuning (1/2) - Vuduc

Backup slides

53