
Statically Bounding Memory Usage for SCORE Process Networks

Eylon Caspi

EE290N

5/15/02

University of California, Berkeley



Overview

Motivation: SCORE, “Page Packing”

Bounding Memory using Automata Composition

Preliminary Results

Open Questions / Future Work


Reconfigurable Computing

• Programmable logic + programmable interconnect (e.g. FPGA)

• Fills the gap
 - 10x-100x better than microprocessor
 - 10x-100x worse than ASIC
 - In performance (MOPS), in density (MOPS/mm², MOPS/mW)

• Hardware scales by tiling / duplicating
 - High parallelism; spatial data paths

• But no abstraction for software survival
 - No binary compatibility
 - No performance scaling
 - Designer targets a specific device, specific resource constraints



Virtual Hardware

• Compute model has unbounded resources
 - Programmer no longer targets particular device size

• Paging
 - “Compute pages” swapped in/out (like VM)
 - Page context = thread (FSM to access streams, block)

• Efficient virtualization
 - Amortize reconfiguration cost over an entire input buffer

[Figure: compute pages labeled Transform, Quantize, RLE, Encode, connected by buffers]


SCORE Model

• Program = data-flow graph of stream-connected SFSMs
 - Kahn process network: blocking read, non-blocking write (almost)

• Compute: SFSM (Streaming Finite State Machine)
 - Concretely: page + FSM to implement token-flow semantics
 - Abstractly: task with local control

• Communication: Stream
 - FIFO channel, unbounded buffering

• Storage: Memory Segment
 - Memory block with streaming interface

• Dynamics:
 - Dynamic local behavior in SFSM
 - Unbounded resource usage: stream buffer expansion
 - Dynamic graph allocation in STM (Streaming Turing Machine)

• Model admits parallelism at multiple levels: ILP, pipeline, data

SCORE: Stream Computations Organized for Reconfigurable Execution


SCORE Programming Model: TDF

• TDF = intermediate, behavioral language for:
 - EFSM operators
 - Static operator graphs

• State machine for:
 - Firing signatures
 - Control flow (branching)

• Firing semantics:
 - When in state X, wait for X’s inputs, then fire (consume, act)

    select ( input boolean s,
             input unsigned[8] t,
             input unsigned[8] f,
             output unsigned[8] o )
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]
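The firing semantics of the select operator can be illustrated with a small simulation. This is only a sketch in Python, not TDF and not the SCORE runtime: the names `SIGNATURE` and `run_select` are hypothetical, and the input streams are modeled as finite queues, so the operator simply stops when the current state's signature can no longer be satisfied.

```python
from collections import deque

# Firing signature of each state: the inputs the state waits for.
# (Hypothetical table, transcribed from the TDF select operator above.)
SIGNATURE = {"S": ("s",), "T": ("t",), "F": ("f",)}


def run_select(s, t, f):
    """Simulate select on three finite input streams; return (outputs, final state)."""
    streams = {"s": deque(s), "t": deque(t), "f": deque(f)}
    out, state = [], "S"
    # Fire only while every input in the current state's signature has a token.
    while all(streams[name] for name in SIGNATURE[state]):
        if state == "S":
            state = "T" if streams["s"].popleft() else "F"
        elif state == "T":
            out.append(streams["t"].popleft())
            state = "S"
        else:  # state F
            out.append(streams["f"].popleft())
            state = "S"
    return out, state


out, state = run_select([True, False, True], [10, 11], [20])
print(out, state)
```

Note that a token left waiting on `t` while the operator sits in state F is never consumed until control returns to T; this is the blocking-read behavior that makes buffer requirements depend on control flow.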


The Compilation Problem

Programming Model                      Execution Model
• Communicating EFSM operators         • Communicating page configs
  - unrestricted size, # IOs, timing     - fixed size, # IOs, timing
                                       • Paged virtual hardware

[Figure: Compile step mapping TDF operators, streams and memory segments onto compute pages, streams and memory segments]

Compilation is a resource-binding transform on state machines + data-paths


A Problem with Page Packing

• Intent: Pack multiple, small operators onto 1 page to reduce fragmentation

• Problem: Streams within a page are registers; must guarantee bounded stream depth

• Theorem: Memory bound is undecidable (for a Turing complete process network model)

[Figure: operators packed onto Page 1 and Page 2]


Possible Solutions

• Handle unbounded streams
 - External buffering
 - Registers (guess at depth) + external buffer as fall-back
 - Not practical for many small ops

• Guarantee depth bound for some cases
 - Only one FSM per page + I/O pipelines
 - Identify compatible FSMs, balanced schedules


Bounded Buffer Example: Single Stream

[Figure: producer (x=) and consumer (=x) state machines connected by one stream; two panels: Static Rate, Dynamic Rate]

A pair composition with a single stream needs no buffer


Unbounded Buffer Example: Multi Stream

[Figure: producers (x=, y=) and consumers (=x, =y) connected by two streams; two panels: Bounded buffer, Unbounded buffer]

• Ad-hoc analysis gets complicated quickly

• What about >2 SFSMs?


Interface Automata

• A finite state machine that transitions on I/O actions
 - Not input-enabled (not every I/O on every cycle)

• G = (V, E, Ai, Ao, Ah, Vstart)
 - Ai = input actions       x?   (in CSP notation)
 - Ao = output actions      y!   (in CSP notation)
 - Ah = internal actions    z;   (in CSP notation)
 - E ⊆ V × (Ai ∪ Ao ∪ Ah) × V   (transition on action)

• Execution trace = (v, a, v, a, …)   (non-deterministic branching)

[Figure: interface automaton for the select operator (inputs s, t, f; output o) with states S, S’, T, F, T’, F’ and actions s?, st;, sf;, t?, f?, o!]

de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001
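One way to make the tuple G concrete is a plain edge list. The sketch below is an illustration only: the `Edge` type and the helper names are hypothetical, and the edge list is one plausible reading of the select automaton in the figure (read s, branch internally on its value, read t or f, then emit o), not a verified transcription.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    action: str  # CSP-style label: "x?" input, "y!" output, "z;" internal
    dst: str


# Assumed transcription of the select automaton; start state is S.
SELECT_START = "S"
SELECT_EDGES = [
    Edge("S", "s?", "S'"),
    Edge("S'", "st;", "T"),   # internal choice: s was true
    Edge("S'", "sf;", "F"),   # internal choice: s was false
    Edge("T", "t?", "T'"),
    Edge("F", "f?", "F'"),
    Edge("T'", "o!", "S"),
    Edge("F'", "o!", "S"),
]


def kind(action):
    """Classify an action by its CSP-style suffix."""
    return {"?": "input", "!": "output", ";": "internal"}[action[-1]]


def enabled(edges, state):
    """Actions the automaton can take in a state (it is not input-enabled)."""
    return sorted(e.action for e in edges if e.src == state)


def accepts(edges, start, actions):
    """True if the action sequence labels some path from the start state."""
    frontier = {start}
    for a in actions:
        frontier = {e.dst for e in edges if e.src in frontier and e.action == a}
        if not frontier:
            return False
    return True


print(enabled(SELECT_EDGES, "S'"))
print(accepts(SELECT_EDGES, SELECT_START, ["s?", "st;", "t?", "o!"]))
```

Because the automaton is not input-enabled, `accepts` rejects a trace such as `["s?", "t?"]`: in state S’ no input action is offered at all.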


Automata Composition

• Composition ~ product FSM with synchronization (rendezvous) on common actions

• Composition edges:
 (i) step A on unshared action
 (ii) step B on unshared action
 (iii) step both on shared action

• Compatible Composition ⇒ Bounded Memory

[Figure: A (A to A’ on x?, A’ to A on y!) feeds B (B to B’ on y?, B’ to B on z!) over stream y, with external input x and output z. Panels: Direct Product over states AB, A’B, AB’, A’B’ (edges x?, y!, y?, z!) and Composition over the same states (edges x?, y;, z!)]
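The three kinds of composition edges can be sketched as a reachability search over product states. This is a minimal illustration under stated assumptions, not the tool used for the results: an automaton is represented as a hypothetical `(start, edges)` pair, each edge is `(src, name, kind, dst)` with kind `"?"`, `"!"` or `";"`, and the A and B edge lists are read off the figure.

```python
def compose(p, q):
    """Product of two automata with rendezvous on shared action names."""
    (p0, pe), (q0, qe) = p, q
    visible = lambda es: {n for _, n, k, _ in es if k != ";"}
    shared = visible(pe) & visible(qe)
    start = (p0, q0)
    edges, seen, todo = [], {start}, [start]
    while todo:
        a, b = state = todo.pop()
        steps = []
        for s, n, k, d in pe:  # (i) step P alone on an unshared action
            if s == a and (k == ";" or n not in shared):
                steps.append((n, k, (d, b)))
        for s, n, k, d in qe:  # (ii) step Q alone on an unshared action
            if s == b and (k == ";" or n not in shared):
                steps.append((n, k, (a, d)))
        for s, n, k, d in pe:  # (iii) step both on a shared action
            for s2, n2, k2, d2 in qe:
                if (s, s2) == (a, b) and n == n2 and n in shared and {k, k2} == {"!", "?"}:
                    steps.append((n, ";", (d, d2)))  # rendezvous becomes internal
        for n, k, nxt in steps:
            edges.append((state, n, k, nxt))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, edges


A = ("A", [("A", "x", "?", "A'"), ("A'", "y", "!", "A")])
B = ("B", [("B", "y", "?", "B'"), ("B'", "z", "!", "B")])
start, edges = compose(A, B)
states = {start} | {dst for *_, dst in edges}
for e in sorted(edges):
    print(e)
```

In this example the shared action y appears only once, as the internal step from A’B to AB’, whereas the direct product would also carry the unsynchronized y! and y? edges.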


Compatibility

• Illegal(P,Q) = { reachable product states (p,q) ∈ VP × VQ s.t. p produces a shared action that q does not accept, or vice versa }

• Interface automata P, Q are compatible if:
 - CSP: Always
 - I/O Automata: Illegal(P,Q) = ∅
 - Interface Automata: Illegal(P,Q,Env) = ∅
   Least restrictive environment Env accepts all outputs, provides no inputs
   Compatible states are those that never reach illegal states via internal, output transitions

• This is overly restrictive for SCORE
 - Can enter an illegal state, stall the illegal producer, and step the consumer on a different action!
 - IllegalSCORE(P,Q) = { reachable product states (p,q) ∈ VP × VQ s.t. (p produces a shared action that q does not accept, and q has no alternative non-shared actions), or vice versa }
   = reachable deadlock(VP × VQ) \ deadlock(VP) deadlock(VQ)

[Figure: A offering x! paired with B accepting x?; A offering x! paired with B taking internal action t;]
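The difference between Illegal and IllegalSCORE can be shown on two tiny automata. The sketch below follows the wording of the two definitions literally; the representation (`(start, edges)` with edges `(src, name, kind, dst)`), the helper names, and both example automata are assumptions made for illustration.

```python
def out_edges(auto, state):
    return [(n, k, d) for s, n, k, d in auto[1] if s == state]


def shared_names(p, q):
    names = lambda a: {n for _, n, k, _ in a[1] if k != ";"}
    return names(p) & names(q)


def reachable(p, q):
    """Product states reachable by solo steps and rendezvous on shared actions."""
    shared = shared_names(p, q)
    start = (p[0], q[0])
    seen, todo = {start}, [start]
    while todo:
        a, b = todo.pop()
        nxt = [(d, b) for n, k, d in out_edges(p, a) if k == ";" or n not in shared]
        nxt += [(a, d) for n, k, d in out_edges(q, b) if k == ";" or n not in shared]
        nxt += [(d, d2) for n, k, d in out_edges(p, a) if n in shared
                for n2, k2, d2 in out_edges(q, b) if n2 == n and {k, k2} == {"!", "?"}]
        for s in nxt:
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return seen


def blocked(producer, ps, consumer, cs, shared, score):
    """True if ps offers a shared output that cs does not accept."""
    cons = out_edges(consumer, cs)
    offers = {n for n, k, _ in out_edges(producer, ps) if k == "!" and n in shared}
    accepts = {n for n, k, _ in cons if k == "?"}
    stuck = bool(offers - accepts)
    if score:  # SCORE variant: legal if the consumer can step on something else
        has_alt = any(k == ";" or n not in shared for n, k, _ in cons)
        stuck = stuck and not has_alt
    return stuck


def illegal(p, q, score=False):
    shared = shared_names(p, q)
    return {(a, b) for a, b in reachable(p, q)
            if blocked(p, a, q, b, shared, score) or blocked(q, b, p, a, shared, score)}


# A always offers x!; B takes an internal step t; before accepting x?.
A = ("A", [("A", "x", "!", "A")])
B = ("B", [("B", "t", ";", "B'"), ("B'", "x", "?", "B")])
print(illegal(A, B), illegal(A, B, score=True))
```

Here the state (A, B) is illegal under the interface-automata definition, since B does not yet accept x, but not under the SCORE definition, because A can simply stall while B takes its internal step.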


Alternate Composition Semantics

• Automata Composition (P Q) (the rest of this talk)
 - Compatibility = no reachable deadlock
 - Pessimistic; correct in any environment
 - Any state can stall ⇒ state explosion

• Interface Composition (P || Q)
 - Compatibility = no reachable illegal states for given env.
 - Optimistic; correct in environment that provides no inputs
 - Outputs cannot stall ⇒ smaller composition

• SCORE Composition? (P Q)
 - How to get smaller compositions, correct in any environment?
 - Strategic use of output stall
   Compatibility by construction? (disallow transitions to illegal paths)
   Minimal stutter (stall) transitions?


An Incompatible SCORE Composition

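[Figure: A (A to A’ on i?, A’ to A’’ on x!, A’’ to A on y!) feeds B (B to B’ on y?, B’ to B’’ on x?, B’’ to B on o!) over streams x and y, with external input i and output o. The composition over the nine product states AB … A’’B’’ has edges i?, x;, y;, o!]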


Adding a Buffer

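[Figure: queue Q inserted on stream x between A and B. Q has states Q, Q’ with Q to Q’ on x? and Q’ to Q on x!. The composition of A with Q over states AQ, A’Q, A’’Q, AQ’, A’Q’, A’’Q’ has edges i?, x;, x!, y!]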


Buffered Composition

[Figure: the composite AQ composed with B (B to B’ on y?, B’ to B’’ on x?, B’’ to B on o!). The composition over the 18 product states AQB … A’’Q’B’’ has edges i?, x;, y;, o!]
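The effect of the one-place buffer can be checked mechanically. The sketch below is an assumption-laden illustration: automata are `(start, edges)` pairs with edges `(src, name, kind, dst)`, the helpers `producer`, `consumer`, `queue` and `deadlocks` are hypothetical, and the two sides of the queue get distinct port names (`xa` into the queue, `xb` out of it) so that name-based rendezvous stays unambiguous, whereas the slides label both sides x.

```python
def compose(p, q):
    """Product of two automata with rendezvous on shared action names."""
    (p0, pe), (q0, qe) = p, q
    visible = lambda es: {n for _, n, k, _ in es if k != ";"}
    shared = visible(pe) & visible(qe)
    start = (p0, q0)
    edges, seen, todo = [], {start}, [start]
    while todo:
        a, b = state = todo.pop()
        steps = [(n, k, (d, b)) for s, n, k, d in pe
                 if s == a and (k == ";" or n not in shared)]
        steps += [(n, k, (a, d)) for s, n, k, d in qe
                  if s == b and (k == ";" or n not in shared)]
        steps += [(n, ";", (d, d2)) for s, n, k, d in pe for s2, n2, k2, d2 in qe
                  if (s, s2) == (a, b) and n == n2 and n in shared and {k, k2} == {"!", "?"}]
        for n, k, nxt in steps:
            edges.append((state, n, k, nxt))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, edges


def producer(x):  # A: read i, write x, write y
    return ("A", [("A", "i", "?", "A'"), ("A'", x, "!", "A''"), ("A''", "y", "!", "A")])


def consumer(x):  # B: read y, read x, write o
    return ("B", [("B", "y", "?", "B'"), ("B'", x, "?", "B''"), ("B''", "o", "!", "B")])


def queue(src, dst, depth):
    """FIFO of the given depth as an automaton; state n means n tokens held."""
    edges = []
    for n in range(depth):
        edges.append((n, src, "?", n + 1))
        edges.append((n + 1, dst, "!", n))
    return (0, edges)


def deadlocks(auto):
    """Reachable states with no outgoing edge."""
    start, edges = auto
    states = {start} | {dst for *_, dst in edges}
    return states - {src for src, *_ in edges}


unbuffered = compose(producer("x"), consumer("x"))
buffered = compose(compose(producer("xa"), queue("xa", "xb", 1)), consumer("xb"))
print(deadlocks(unbuffered), deadlocks(buffered))
```

Without the queue, A blocks offering x while B still waits for y; with one place of buffering on x, A can deposit x, hand over y, and B then drains the queue.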


Adding a Buffer, Alternate Order

[Figure: queue Q composed with B first. The composition over states QB, Q’B, QB’, Q’B’, QB’’, Q’B’’ has edges x?, x;, y?, o!]


Buffered Composition, Alternate Order

[Figure: A composed with the composite QB. The composition over the 18 product states AQB … A’’Q’B’’ has edges i?, x;, y;, o!]


Composition is Associative*

[Figure: the two composition orders, A with (Q with B) and (A with Q) with B, shown side by side over the 18 product states AQB … A’’Q’B’’ with edges i?, x;, y;, o!]


Static Stream Depth Bound Analysis

Basic idea: try to compose A, B with increasingly large queues

Given: Graph of TDF operators
Output: Stream depth bound (ℕ ∪ {∞}) for each stream

Initialize: depth[ei] ← 0 for all streams (edges) ei

For each pair (A,B) of connected operators
  Let {ei} be set of streams connecting A, B
  Construct interface automata for A, B
    each ei induces actions:
      shared action a_ei if depth[ei] < ∞
      non-shared actions ai_ei, ao_ei if depth[ei] = ∞
  While not Done
    Construct composition: C ← composition of A, B, and { Q(depth[ei]) ∀ ei s.t. depth[ei] < ∞ }
    Compute illegal states: IllegalSCORE(C)
    If IllegalSCORE(C) = ∅
      Done with pair A, B
    Else
      For each shared action a_ei that is output but not input in some illegal state s ∈ IllegalSCORE(C)
        depth[ei]++
        If (depth[ei] ≥ depth_threshold) then depth[ei] ← ∞
      If depth[ei] = ∞ ∀ ei then Done with pair A, B

Return depth[]

[Figure: A and B connected by streams e1, e2, with external input i and output o]
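The loop above can be sketched for a single pair of operators. This is a simplified illustration, not the implementation behind the reported results: instead of composing explicit queue automata Q(depth), it tracks each bounded stream's occupancy as a counter in the product state, which is a swapped-in shortcut. The function names, the `(start, edges)` representation, and the choice to give up (mark every stream ∞) on a deadlock that no blocked output explains are all assumptions.

```python
import math


def find_blocked(A, B, bounded, depth):
    """Return None if no reachable deadlock, else the set of streams with a blocked output."""
    autos = (A, B)
    idx = {e: i for i, e in enumerate(bounded)}
    start = (A[0], B[0], (0,) * len(bounded))
    seen, todo = {start}, [start]
    deadlocked, blocked = False, set()
    while todo:
        state = todo.pop()
        locs, occ = state[:2], state[2]
        nxt = []
        for me in (0, 1):
            other = 1 - me
            for src, name, kind, dst in autos[me][1]:
                if src != locs[me]:
                    continue
                new_locs = list(locs)
                new_locs[me] = dst
                if kind == ";" or name not in idx:   # external or unbounded: never stalls
                    nxt.append((*new_locs, occ))
                elif depth[name] == 0:               # depth 0: rendezvous, writer initiates
                    if kind == "!":
                        for s2, n2, k2, d2 in autos[other][1]:
                            if s2 == locs[other] and n2 == name and k2 == "?":
                                both = list(new_locs)
                                both[other] = d2
                                nxt.append((*both, occ))
                else:                                # finite queue of depth[name]
                    i = idx[name]
                    level = occ[i] + (1 if kind == "!" else -1)
                    if 0 <= level <= depth[name]:
                        nxt.append((*new_locs, occ[:i] + (level,) + occ[i + 1:]))
        if not nxt:                                  # deadlock: note stalled outputs
            deadlocked = True
            for me in (0, 1):
                for src, name, kind, _ in autos[me][1]:
                    if src == locs[me] and kind == "!" and name in idx:
                        blocked.add(name)
        for s in nxt:
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return blocked if deadlocked else None


def analyze_pair(A, B, streams, threshold=4):
    """Depth bound per stream for one connected pair (math.inf means unbounded)."""
    depth = {e: 0 for e in streams}
    while True:
        bounded = [e for e in streams if depth[e] != math.inf]
        if not bounded:
            return depth
        blocked = find_blocked(A, B, bounded, depth)
        if blocked is None:
            return depth
        if not blocked:  # assumption: a deadlock no buffer can fix marks all streams unbounded
            return {e: math.inf for e in streams}
        for e in blocked:
            depth[e] += 1
            if depth[e] >= threshold:
                depth[e] = math.inf


A = ("A", [("A", "i", "?", "A'"), ("A'", "x", "!", "A''"), ("A''", "y", "!", "A")])
B = ("B", [("B", "y", "?", "B'"), ("B'", "x", "?", "B''"), ("B''", "o", "!", "B")])
print(analyze_pair(A, B, ["x", "y"]))
```

For the A, B pair used in the buffering example, the first iteration finds A blocked on x, and the second iteration, with one place on x, finds no deadlock.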


Results – Pair Composition*

                  (with streams to mem)                               (without streams to mem)
App               #streams  Trivially  Non-Triv.  Unbounded  #SFSMs  #SFSM  Trivially  Non-Triv.  Not
                            Bounded    Bounded                       pairs  Compos     Compos     Compos
IIR                      9          9          0          0       8      7          7          0       0
Wavelet Encode          58         35          0         23      30     24         15          0       9
Wavelet Decode          57         34         12         11      27     31         26          0       5
JPEG Encode             62         25         13         24      13     11          6          1       4
JPEG Decode             61          -          -          -      12      -          -          -       -
MPEG Encode IP         421        351         43         70      92    154        144          5       5
MPEG Encode IPB        488        402         58         22     114    211        192          8       5

* Max stream depth = 2


Maximum Depth Parameter

                            Non-Trivially Bounded                   Non-Trivially Composable
                            for given max stream depth      #SFSM   for given max stream depth
App               #streams    0    1    2    3    4         pairs     0    1    2    3    4
IIR                      9    0    0    0    0    0             7     0    0    0    0    0
Wavelet Encode          58    0    0    0    0    0            24     0    0    0    0    0
Wavelet Decode          57   12   12   12   12   12            31     0    0    0    0    0
JPEG Encode             62   10   12   13   15   16            11     0    1    1    1    1
JPEG Decode             61   17   18    -    -    -             9     2    2    -    -    -
MPEG Encode IP         421   39   42   43    -    -           154     4    5    5    -    -
MPEG Encode IPB        488   53   57   58    -    -           211     6    8    8    -    -


Composite Automaton Size

                  # Nodes in Largest Composition for given max stream depth     Run time (seconds)
App                      0          1          2          3          4          for max stream depth 2
IIR                     12         12         12         12         12            0.2
Wavelet Encode        1587       1587       3094     11,417     30,630            7.5
Wavelet Decode         961       1922     11,798     37,712     84,784            7.2
JPEG Encode           3785       6175       8086     16,244     27,138            9.5
JPEG Decode           3785    196,576   196,576*   196,576*   196,576*            245*
MPEG Encode IP        7887     89,541    334,757   334,757*   334,757*            125
MPEG Encode IPB       8478    100,909    385,334   385,334*   385,334*            150

* Crashed; Partial Results


Composing More than 2 SFSMs

• Page Packing by incrementally growing a cluster?

• Larger composition should improve stream depth bound
 - Restricts environment around a pair of SFSMs
 - Fewer transitions ⇒ fewer reachable deadlocks

• BUT larger composition can expose deadlocked feedback loop

[Figure: composing 2 SFSMs (Page 1, Page 2) versus composing 3 SFSMs on Page 1, with stream depth annotations]


Synthesizing a Composite SFSM?

• How to turn a composite automaton into TDF or page logic?

• TDF does not support all non-deterministic branches
 - Multiple inputs: ok (state with multiple signatures / cases)
 - Multiple outputs: must sequentialize (how?)
 - Input + output: ???
   Input before output: may cause deadlock if output feeds back to input
   Output before input: may stall composite on output back-pressure

• Conjecture: It is always safe to sequentialize outputs before inputs

• Heavier-weight automata can check input availability / output space before blocking on I/O
 - “System-Level Types for Component-Based Design,” Lee + Xiong, EMSOFT 2001 (used in Ptolemy)

[Figure: IA composition A || B over states AB, A’B, AB’, A’B’ with edges x?, y;, z!]


Summary

• SCORE
 - Process network model to support software longevity, scalability on massively parallel HW

• Automata composition with finite queues
 - Compatibility ⇒ Bounded memory

• Initial results: pair composition

• Future work:
 - Faster run time (semantics for smaller composite size)
 - Compose more than 2 SFSMs
 - Page Packing


Supplemental


TDF Dataflow Process Network

• Dataflow Process Network [Parks + Lee, IEEE, May ’95]
 - Process enabled by set of firing rules: R = {R1, R2, …, RN}
 - Firing rule = set of patterns: Ri = {Ri,1, Ri,2, …, Ri,p}

• DF process for a TDF operator:
 - Feedback arc for state
 - One firing rule per state
 - Patterns match state value + presence of desired inputs
   E.g. for state i: Ri = {Ri,1, Ri,2, …, [i]}
 - Patterns:
   Ri,j = [*] if input j is in state i’s input signature
   Ri,j = ⊥ if input j is not in state i’s input signature
   Ri,p = [i] for final input, representing state arc

• These are sequential firing rules

• Partitioned SFSM adds “wait” state

[Figure: process with a feedback arc carrying its state]
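The mapping from state signatures to firing rules can be written out for the select operator. The sketch below is illustrative only: `firing_rules` and `enabled_rule` are hypothetical helpers, patterns are represented as plain strings, and ⊥ is taken to mean that no token is required on that input.

```python
ANY = "[*]"    # pattern: one token of any value must be present
EMPTY = "⊥"    # pattern: nothing required on this input (assumed reading)


def firing_rules(inputs, signatures):
    """Build one firing rule per state: a pattern per input, then the state-arc pattern."""
    rules = {}
    for state, sig in signatures.items():
        patterns = [ANY if name in sig else EMPTY for name in inputs]
        patterns.append(f"[{state}]")  # final pattern matches the token on the state arc
        rules[state] = patterns
    return rules


def enabled_rule(rules, inputs, available, state):
    """Return the state whose rule matches the current state token and inputs, else None."""
    for rule_state, patterns in rules.items():
        *input_patterns, state_pattern = patterns
        if state_pattern != f"[{state}]":
            continue
        if all(p == EMPTY or available.get(name, 0) >= 1
               for name, p in zip(inputs, input_patterns)):
            return rule_state
    return None


inputs = ["s", "t", "f"]
rules = firing_rules(inputs, {"S": {"s"}, "T": {"t"}, "F": {"f"}})
for state, patterns in rules.items():
    print(state, patterns)
```

Because the state-arc pattern differs for every rule, at most one rule can match at a time, which is what lets the rules be tested in sequence.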


SFSM Partitioning Transform

• Only 1 partition active at a time
 - Transform to activate via streams

• New state in each partition: “wait”
 - Used when not active
 - Waits for activation from other partition(s)
 - Has one input signature (firing rule) per activator

• Firing rules are not sequential, but determinism guaranteed
 - Only 1 possible activator

• Activation streams from given source to given dest. partitions can be merged + binary-encoded

[Figure: SFSM with states A, B, C, D split into partition {A,B} with added state WaitAB and partition {C,D} with added state WaitCD, connected by activation streams]


SCORE Hardware Model

• Paged FPGA
 - Compute Page (CP): fixed-size slice of RC hardware, fixed number of I/O ports
 - Distributed, on-chip memory: Configurable Memory Block (CMB), stream access
 - High-level interconnect

• Microprocessor
 - Run-time support + user code


Functional Simulation

• FPGA based on HSRA [Berkeley, FPGA ’99]
 - CP: 512 4-LUTs
 - CMB: 2Mbit DRAM
 - Area for CP-CMB pair:
   .25 µm: 12.9mm² (1/9 of PII-450)
   .18 µm: 6.7mm² (1/16 of PIII-600)
 - Page reconfiguration: 5000 cycles (from CMB)
 - Synchronous operation (same clock speed as processor)

• x86 microprocessor

• Page Scheduler task
 - Swap on timer interrupt (every 250,000 cycles)
 - Fully dynamic scheduling
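The two cycle counts above give a rough feel for how reconfiguration cost is amortized. The arithmetic below is only a back-of-the-envelope sketch: it assumes each page is reconfigured at most once per timer interval, which the slides do not state.

```python
RECONFIG_CYCLES = 5_000      # page reconfiguration from CMB (from the slide)
TIMESLICE_CYCLES = 250_000   # timer interrupt period (from the slide)

# Assumption: one reconfiguration per page per time slice.
overhead = RECONFIG_CYCLES / TIMESLICE_CYCLES
print(f"reconfiguration overhead per time slice: {overhead:.1%}")
```

Under that assumption, a page spends about 2% of each time slice being reconfigured.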


Application: JPEG Encode


Execution Results

[Figures: four result charts, one per slide; axis label: Hardware Size (CP-CMB Pairs)]


Page Hardware Model

• Page = fixed-size slice of resources + stream interface

• FSM for: Firing • Output emission • Data-path control • Branching

[Figure: page containing an FSM, reconfigurable logic and fixed logic]


Page Firing Logic

• Sample firing logic
 - 3 inputs (A,B,C)
 - 3 outputs (X,Y,Z)
 - Single signature