Dynamic analysis in the Reduceron Matthew Naylor and Colin Runciman University of York


A question

“I wonder how popular Haskell needs to become for Intel to optimize their processors for my runtime, rather than the other way around.”
Simon Marlow, 2009

Underlying problem?

Constraints imposed by conventional processors make efficient implementations of functional languages sophisticated and limited.

Possible solution

Build a machine especially to run functional programs.

“Current RISC technology will probably have increased in speed enough by the time a graph-reduction chip could be designed and fabricated to make the exercise pointless.”
Koopman, 1990

But now the situation is different…

FPGAs (Reconfigurable H/W)

Contain a large set of components that can be connected together in any desired way.

Components: logic cells (gates and flip-flops) and block RAMs.

Widely available, quick to program, autonomous.

The Reduceron

A computer designed to run lazy functional programs, not restricted by conventional architectural constraints, implemented on an FPGA, using a functional language.

(Origin: Chris Kania)

Graph reduction, step by step

Suppose that function f is defined by

    f ys x xs = g x (h xs ys)

where g and h are functions and the following machine-state arises during reduction.

    Stack:      f  a  b  c
    Templates:  f:  g @2 <ptr>
                    h @3 @1
    Heap:       (empty)

Here <ptr> is the template's pointer to the application h @3 @1.

Graph reduction, step by step

The stack (f a b c) and the template for f stay unchanged throughout. Each operation appends to the heap, and the step count is cumulative.

    Operation                                        Steps   Heap
    f <- Stack[0];    g <- Code[f];     g -> Heap      3     g
    arg <- Code[f+1]; b <- Stack[arg];  b -> Heap      6     g b
    ptr <- Code[f+2]; ptr’ -> Heap                     8     g b ptr’
    h <- Code[f+3];   h -> Heap                       10     g b ptr’ h
    arg <- Code[f+4]; c <- Stack[arg];  c -> Heap     13     g b ptr’ h c
    arg <- Code[f+5]; a <- Stack[arg];  a -> Heap     16     g b ptr’ h c a

The von Neumann Bottleneck

One word at a time. Each of the 16 memory transactions is done sequentially.
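The count of 16 can be reproduced with a small model. The following is a minimal sketch, not the Reduceron's own code: the `Atom` type, the flat template encoding, and all function names are assumptions made for illustration. It charges one stack read to fetch f, then a template read and a heap write per atom, plus a stack read for each argument reference.

```haskell
-- Toy model of sequential template instantiation (illustrative only;
-- the types and names here are hypothetical, not from the Reduceron).
data Atom
  = Fun String   -- a function name such as g or h
  | Arg Int      -- @i on the slides: reference to the i-th stack element
  | Ptr Int      -- pointer to the p-th application of the template body
  | Val String   -- an opaque argument value such as a, b or c
  deriving (Eq, Show)

-- Memory transactions needed to copy one template atom to the heap:
-- every atom costs a template read and a heap write; an argument
-- reference additionally costs a stack read.
atomCost :: Atom -> Int
atomCost (Arg _) = 3
atomCost _       = 2

-- Instantiate one atom: substitute arguments and relocate pointers
-- by the current heap size (base).
instAtom :: [Atom] -> Int -> Atom -> Atom
instAtom stack _    (Arg i) = stack !! i
instAtom _     base (Ptr p) = Ptr (base + p)
instAtom _     _    atom    = atom

-- Total cost: one stack read to fetch f, plus the per-atom costs.
sequentialCost :: [Atom] -> Int
sequentialCost template = 1 + sum (map atomCost template)

-- The running example: f ys x xs = g x (h xs ys), applied to a, b, c.
exampleStack :: [Atom]
exampleStack = [Fun "f", Val "a", Val "b", Val "c"]

exampleTemplate :: [Atom]
exampleTemplate = [Fun "g", Arg 2, Ptr 1, Fun "h", Arg 3, Arg 1]

-- Heap contents after instantiating into an empty heap.
exampleHeap :: [Atom]
exampleHeap = map (instAtom exampleStack 0) exampleTemplate
```

Under these assumptions `sequentialCost exampleTemplate` gives 16, and `exampleHeap` holds g b ptr’ h c a in order.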

Widening the bottleneck

Widening the bottleneck, again

Many words at a time.

Applying a function “in one go”

    Stack:      f  a  b  c
    Templates:  f:  g @2 <ptr>
                    h @3 @1
    Heap:       g b <ptr’>
                h c a

The function f is applied in a single clock cycle.

Reduceron “instruction set”

    Operation         Clock cycles
    Apply             n/2
    Unwind            1
    Update            1
    Primitive Apply   1

Where n = number of applications in function body.

Template instantiation?

Yes, the Reduceron actually does template instantiation!

But what about the G-machine and the ABC machine?

Template instantiation

“This chapter introduces the simplest possible implementation of a functional language: a graph-reducer based on template instantiation”
Simon Peyton Jones, 1992

Dynamic analysis (1)
Update avoidance

Shared pointers

Distinguish between:
  unshared applications, and
  possibly-shared applications.

Idea: When an unshared application is reduced to normal form, no update is needed.

Dynamic vs. static analysis

“Create all closures as [unshared], and dynamically change their tag to [possibly-shared] if they become shared. We call this operation dashing.”

“In general we strongly suspect that the cost of dashing greatly outweighs the advantages of precision when compared to the [static analysis] method.”
Peyton Jones, 1988

Dashing when applying

    Stack:      s  f  g  x
    Templates:  s:  @1 @3 <ptr>
                    @2 @3
    Heap:       (empty)

When the function s, containing two references to its 3rd argument, is applied, the 3rd argument is dashed.

    Heap:       f x <ptr’>
                g x

Dashing when applying

    inst :: Reduceron -> Atom -> Atom
    inst r a =
      pick
        [ a!isARG   --> dash (a!isArgShared) x   -- argument: take it from the stack, dashing if shared
        , a!isAP    -->                          -- application pointer: relocate by the heap size
            makeAP (a!isShared)
                   (r!heap!size + a!pointer)
        , otherwise --> a                        -- any other atom is copied unchanged
        ]
      where
        otherwise = inv (a!isARG <|> a!isAP)
        x = pick (zip (a!argIndex!velems)
                      (r!vstack!tops!velems))

Dashing when unwinding

    Stack:  x  c
    Heap:   x:  f a b

When a pointer x to a shared application appears on top of the stack, the unwound application is dashed.

    Stack:  f  a  b  c
    Heap:   x:  f a b

Dashing when unwinding

    unwind :: Reduceron -> Recipe
    unwind r =
      Seq [ r!newTop <== vhead as
          , r!vstack!update n (unwindMask n)
                              (as!vtail!velems)
          , app!hasAlts |> r!astack!push (app!alts)
          , upd |> r!ustack!push
                     (makeUpdate (r!vstack!size)
                                 (r!top!pointer))
          ]
      where
        app = r!heap!outputB
        n   = app!appArity
        as  = vmap (dash sh) (app!atoms)   -- dash each atom if the unwound pointer was shared
        sh  = r!top!isShared
        upd = sh <&> inv (app!isNF)        -- push an update only for a shared non-normal-form

Dashing when updating

    Stack:  C  a  as
    Heap:   x:  f a b

When a normal-form is reached, it is copied onto the heap, overwriting the original application, and its arguments are dashed.

    Stack:  C  a  as
    Heap:   x:  C a as

Dashing in the Reduceron

In each of the three cases, dashing is just bit-flipping under some simple-to-compute conditions.
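The three cases can be modelled in ordinary Haskell to make the "bit-flipping" concrete. This is a hedged software sketch rather than the hardware description shown on the slides: the `Atom` type, the flag layout, and the names `unwindAtoms`, `updateAtoms` and `needsUpdate` are assumptions introduced here.

```haskell
-- Software model of dashing (illustrative; names and types are hypothetical).
data Atom
  = Fun String
  | Arg Bool Int   -- Arg shared i: @i, with 'shared' set when the body refers to @i more than once
  | App Bool Int   -- App shared addr: pointer to a heap application, with its possibly-shared bit
  deriving (Eq, Show)

-- Dashing: set the possibly-shared bit of a pointer when the condition holds.
-- Non-pointer atoms are unaffected.
dash :: Bool -> Atom -> Atom
dash cond (App sh addr) = App (sh || cond) addr
dash _    atom          = atom

-- Case 1, applying: an argument is fetched from the stack and dashed if the
-- template refers to it more than once; template pointers are relocated.
instAtom :: Int -> [Atom] -> Atom -> Atom
instAtom _        stack (Arg shared i) = dash shared (stack !! i)
instAtom heapSize _     (App sh p)     = App sh (heapSize + p)
instAtom _        _     atom           = atom

-- Case 2, unwinding: if the pointer on top of the stack was shared,
-- every atom of the unwound application is dashed.
unwindAtoms :: Bool -> [Atom] -> [Atom]
unwindAtoms topShared = map (dash topShared)

-- Case 3, updating: the normal form now lives on both stack and heap,
-- so its arguments are dashed unconditionally.
updateAtoms :: [Atom] -> [Atom]
updateAtoms = map (dash True)

-- An update is needed only for a shared application not already in normal form.
needsUpdate :: Bool -> Bool -> Bool
needsUpdate topShared isNormalForm = topShared && not isNormalForm
```

For the s example, instantiating the template `[Arg False 1, Arg True 3, App False 1, Arg False 2, Arg True 3]` against a stack whose third argument is an unshared pointer yields two copies of that pointer, both marked possibly-shared.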

Normal-form updates avoided

[Chart: vertical axis from 0 to 100. Mean = 56.2%]

Total updates avoided

[Chart: vertical axis from 0 to 100. Two means are marked: 56.2% (the normal-form figure from the preceding chart) and 87.2%.]

% runtime saving

[Chart: vertical axis from 0 to 50. Mean = 21.5%]

Dynamic analysis (2)
Speculative evaluation of primitive redexes (PRS)

Primitive redexes

Suppose that function f is defined by

    f x xs a = g (x+a) xs

where g is a function, and + is primitive addition.

    Stack:      f  5  xs  2
    Templates:  f:  g <ptr> @2
                    @1 + @3
    Heap:       (empty)

Primitive redexes

Application of f results in the primitive redex 5+2 being instantiated on the heap.

    Stack:      f  5  xs  2
    Templates:  f:  g <ptr> @2
                    @1 + @3
    Heap:       g <ptr’> xs
                5 + 2

Speculative evaluation

Idea: At instantiation time, look at the arguments in a primitive application. If they are already evaluated, apply the primitive speculatively.

    Stack:      f  5  xs  2
    Templates:  f:  g <ptr> @2
                    @1 + @3
    Heap:       g 7 xs
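The idea can be sketched as a decision made while instantiating a primitive application. This is an illustrative model only: the `Atom` and `PrimApp` types and the `speculate` function are assumptions introduced here, and only addition is covered.

```haskell
-- Sketch of speculative evaluation of a primitive redex at instantiation
-- time (illustrative; all names and types are hypothetical).
data Atom
  = IntLit Int   -- an already-evaluated integer
  | Fun String
  | Arg Int      -- @i: reference to the i-th stack element
  | Ptr Int      -- pointer to an unevaluated heap application
  deriving (Eq, Show)

-- A primitive application appearing in a template body.
data PrimApp = Add Atom Atom
  deriving (Eq, Show)

-- Replace an argument reference by the corresponding stack element.
subst :: [Atom] -> Atom -> Atom
subst stack (Arg i) = stack !! i
subst _     atom    = atom

-- If both operands are evaluated, return the result directly (Right);
-- otherwise return the instantiated application to be built on the heap (Left).
speculate :: [Atom] -> PrimApp -> Either PrimApp Atom
speculate stack (Add x y) =
  case (subst stack x, subst stack y) of
    (IntLit a, IntLit b) -> Right (IntLit (a + b))
    (x', y')             -> Left (Add x' y')
```

With the stack f 5 xs 2 from the slide, `speculate` applied to `Add (Arg 1) (Arg 3)` returns `Right (IntLit 7)`, so 7 is written in place of a pointer to 5 + 2. If the first argument were a pointer to an unevaluated application instead, the result would be a `Left` and the redex would be built on the heap as before.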

% runtime saving, due to PRS

[Chart: vertical axis from -10 to 50, labelled "amount of primitive reduction". Mean = 12%]

Example of PRS success

Function to enumerate a range of integers:

    enum n m = if n <= m
                 then Cons n (enum (n+1) m)
                 else Nil

We wish to evaluate: enum 0 10

    enum 0 10
      = if 0 <= 10
          then Cons 0 (enum (0+1) 10)
          else Nil
      = if True
          then Cons 0 (enum 1 10)
          else Nil
      = Cons 0 (enum 1 10)

Both primitive redexes, 0 <= 10 and 0+1, have evaluated arguments, so they are reduced at instantiation time.

Example of PRS failure

Function to enumerate a range of integers:

    enum n m = if n <= m
                 then Cons n (enum (n+1) m)
                 else Nil

We wish to evaluate enum x 10, where x points to the unevaluated application f y on the heap:

    Stack:  enum  x  10
    Heap:   x:  f y

    enum x 10
      = if x <= 10
          then Cons x (enum (x+1) 10)
          else Nil
      = Cons x (enum (x+1) 10)

Evaluating the comparison forces x, leaving x: 3 on the heap. But x+1 was instantiated while x was still unevaluated, so it was not reduced speculatively.

Possible solutions

Option 1: Worker/wrapper transformation. Since enum is strict, introduce enumWrap which forces n and m before passing them to enum.

Option 2: Update the stack. Leave evaluated arguments of conditional primitives on the stack, to be observed by case alternatives.
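Option 1 might look roughly as follows in source form. This is a sketch in ordinary Haskell, assuming `seq` as the forcing mechanism and a hypothetical `List` type; it shows the shape of the transformation only and says nothing about how the Reduceron compiler would implement it.

```haskell
-- Sketch of the worker/wrapper idea for enum (illustrative only;
-- the List type and toList helper are hypothetical).
data List = Nil | Cons Int List
  deriving (Eq, Show)

-- The worker: the original definition from the slides.
enum :: Int -> Int -> List
enum n m = if n <= m
             then Cons n (enum (n + 1) m)
             else Nil

-- The wrapper: force both arguments before entering the worker, so that
-- n <= m and n+1 see evaluated operands when the body is instantiated.
enumWrap :: Int -> Int -> List
enumWrap n m = n `seq` m `seq` enum n m

-- Helper for inspecting results as an ordinary Haskell list.
toList :: List -> [Int]
toList Nil         = []
toList (Cons x xs) = x : toList xs
```

Callers would use `enumWrap` in place of `enum`. Recursive calls inside the worker already pass n+1, which PRS evaluates once n is known, so the wrapper is only needed at the entry point.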

Conclusions

The Reduceron enjoys many freedoms.

Dynamic analysis can be very attractive.

There is a lot of low-level parallelism to exploit even in “sequential” graph reduction.

Still to come: parallel reduction and compiler optimisation. PRS is just the beginning.

Quick progress report

Speed-up factor, since IFL’07

[Chart: vertical axis from 1 to 15.]

My desk

Intel Core 2 Duo E8400
* 6M Cache
* 3.00 GHz
* 1333 MHz FSB
* Released in 2008

Xilinx Virtex 5 FPGA
* LX110T (mid-range)
* Speed-grade 1 (lowest)
* Released in 2006

Reduceron synthesis results

    Fmax          110 MHz
    Slice usage   2284 (13%)
    LUTs          5865 (8%)
    Flip-flops    1018 (1%)
    BRAM usage    4.7 megabits (89%)

NOTE: memory capacity is very small; it could be enhanced by moving the heap into off-chip memory.

Slow-down factor, against PC

[Chart: vertical axis from 1 to 9; series: GHC -O2 and Clean.]

Acknowledgements

This research is supported by EPSRC grant EP/G011052/1.

Thanks to the Xilinx University Program for donating the FPGA and development board.

Thanks to Satnam Singh for his advice and contacts which enabled us to obtain this board. The money saved has been used to extend my contract by three months!

Megabits of block RAM, over time

[Chart: vertical axis from 0 to 45 megabits; horizontal axis: Virtex II (2000), Virtex 4 (2004), Virtex 5 (2006), Virtex 6 (2009).]

An off-chip heap

• Heap could be moved into off-chip RAM.
  – To cater for memory-hungry programs.
• Should be possible without modifying existing design significantly.
  – Modern memory chips clock much higher than 110 MHz.
  – Example: reduced-latency RAM (4-cycle access) clocking at over 400 MHz.

% runtime spent on arithmetic

[Chart: vertical axis from 0 to 35. Mean = 12%]