Statically Bounding Memory Usage for SCORE Process Networks
Eylon Caspi
EE290N
5/15/02
University of California, Berkeley
5/15/02 Eylon Caspi – EE290N 2
Overview
Motivation: SCORE, “Page Packing”
Bounding Memory using Automata Composition
Preliminary Results
Open Questions / Future Work
Reconfigurable Computing
• Programmable logic + programmable interconnect (e.g. FPGA)
• Fills the gap
  - 10x-100x better than microprocessor
  - 10x-100x worse than ASIC
  - In performance (MOPS), in density (MOPS/mm², MOPS/mW)
• Hardware scales by tiling / duplicating
  - High parallelism; spatial data paths
• But no abstraction for software survival
  - No binary compatibility
  - No performance scaling
  - Designer targets a specific device, specific resource constraints
Graphics copyright by their respective company
Virtual Hardware
• Compute model has unbounded resources
  - Programmer no longer targets particular device size
• Paging
  - “Compute pages” swapped in/out (like VM)
  - Page context = thread (FSM to access streams, block)
• Efficient virtualization
  - Amortize reconfiguration cost over an entire input buffer
[Figure: compute pages (Transform, Quantize, RLE, Encode) connected through buffers]
SCORE Model
• Program = data-flow graph of stream-connected SFSMs
  - Kahn process network: blocking read, non-blocking write (almost)
• Compute: SFSM (Streaming Finite State Machine)
  - Concretely: page + FSM to implement token-flow semantics
  - Abstractly: task with local control
• Communication: Stream
  - FIFO channel, unbounded buffering
• Storage: Memory Segment
  - Memory block with streaming interface
• Dynamics:
  - Dynamic local behavior in SFSM
  - Unbounded resource usage: stream buffer expansion
  - Dynamic graph allocation in STM (Streaming Turing Machine)
• Model admits parallelism at multiple levels: ILP, pipeline, data
Stream Computations Organized for Reconfigurable Execution
SCORE Programming Model: TDF
• TDF = intermediate, behavioral language for:
  - EFSM operators
  - Static operator graphs
• State machine for:
  - Firing signatures
  - Control flow (branching)
• Firing semantics: when in state X, wait for X’s inputs, then fire (consume, act)

select ( input boolean s, input unsigned[8] t,
         input unsigned[8] f, output unsigned[8] o )
{
  state S (s) : if (s) goto T; else goto F;
  state T (t) : o=t; goto S;
  state F (f) : o=f; goto S;
}

[Figure: select operator with inputs s, t, f and output o]
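The firing semantics of the select operator can be illustrated with a small simulation. This is only a sketch in Python, not TDF tooling: the function name `run_select` and the use of finite lists as streams are assumptions made for illustration. Each state waits for the inputs in its own signature and fires only when they are present.

```python
from collections import deque

def run_select(s, t, f):
    """Simulate the TDF `select` operator on finite input streams.

    Each state blocks until the input named in its signature is available,
    then fires: it consumes that token and acts on it.
    """
    s, t, f = deque(s), deque(t), deque(f)
    out = []
    state = "S"
    while True:
        if state == "S":
            if not s:
                break                      # would block waiting for s
            state = "T" if s.popleft() else "F"
        elif state == "T":
            if not t:
                break                      # would block waiting for t
            out.append(t.popleft())        # o = t
            state = "S"
        else:                              # state F
            if not f:
                break                      # would block waiting for f
            out.append(f.popleft())        # o = f
            state = "S"
    return out

print(run_select([True, False, True], [10, 11], [20]))
```

Note that the operator never inspects `t` while in state F, so tokens may pile up on the stream it is not currently reading; this is the buffering behavior the rest of the talk tries to bound.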
The Compilation Problem
Programming Model:
• Communicating EFSM operators
  - unrestricted size, # IOs, timing
Execution Model:
• Communicating page configs
  - fixed size, # IOs, timing
• Paged virtual hardware
[Figure: “Compile” maps TDF operators, streams, and memory segments onto compute pages, streams, and memory segments]
Compilation is a resource-binding transform on state machines + data-paths
A Problem with Page Packing
• Intent: pack multiple, small operators onto 1 page to reduce fragmentation
• Problem: streams within a page are registers; must guarantee bounded stream depth
• Theorem: memory bound is undecidable (for a Turing-complete process network model)
[Figure: operators packed onto Page 1 and Page 2]
Possible Solutions
• Handle unbounded streams
  - External buffering
  - Registers (guess at depth) + external buffer as fall-back
  - Not practical for many small ops
• Guarantee depth bound for some cases
  - Only one FSM per page + I/O pipelines
  - Identify compatible FSMs, balanced schedules
Bounded Buffer Example: Single Stream
[Figure: producer (x=) and consumer (=x) state machines connected by a single stream x, in two panels: Static Rate and Dynamic Rate]
A pair composition with a single stream needs no buffer
Unbounded Buffer Example: Multi Stream
[Figure: producer/consumer pairs over two streams x and y, with producers writing x= y= and consumers reading =x =y in differing orders; panels: Bounded buffer, Unbounded buffer]
Ad-hoc analysis gets complicated quickly
What about >2 SFSMs?
Interface Automata
• A finite state machine that transitions on I/O actions
  - Not input-enabled (not every I/O on every cycle)
• G = (V, E, Ai, Ao, Ah, Vstart)
  - Ai = input actions, x? (in CSP notation)
  - Ao = output actions, y! (in CSP notation)
  - Ah = internal actions, z; (in CSP notation)
  - E ⊆ V × (Ai ∪ Ao ∪ Ah) × V (transition on action)
• Execution trace = (v, a, v, a, …) (non-deterministic branching)
[Figure: interface automaton for the select operator (inputs s, t, f; output o), with states S, S’, T, T’, F, F’ and transitions s?, st;, sf;, t?, f?, o!]
de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001
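The tuple G = (V, E, Ai, Ao, Ah, Vstart) maps directly onto a small record type. The sketch below is an illustration in Python, not the representation used in the talk; the class name `InterfaceAutomaton`, its method names, and the exact transition structure given for select (read off the state and action names in the figure) are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InterfaceAutomaton:
    """G = (V, E, Ai, Ao, Ah, Vstart); E holds (source, action, target) triples."""
    states: frozenset
    edges: frozenset
    inputs: frozenset
    outputs: frozenset
    internal: frozenset
    start: str

    def enabled(self, state):
        """Actions that have an outgoing edge from `state`."""
        return {a for (v, a, _) in self.edges if v == state}

    def is_input_enabled(self):
        """True only if every state accepts every input action."""
        return all(self.inputs <= self.enabled(v) for v in self.states)

# Assumed transition structure for `select` (hypothetical reading of the figure).
SELECT = InterfaceAutomaton(
    states=frozenset({"S", "S'", "T", "T'", "F", "F'"}),
    edges=frozenset({
        ("S", "s", "S'"),                       # s?  read the selector
        ("S'", "st", "T"), ("S'", "sf", "F"),   # st; / sf;  internal branch
        ("T", "t", "T'"), ("F", "f", "F'"),     # t? / f?  read chosen input
        ("T'", "o", "S"), ("F'", "o", "S"),     # o!  emit output
    }),
    inputs=frozenset({"s", "t", "f"}),
    outputs=frozenset({"o"}),
    internal=frozenset({"st", "sf"}),
    start="S",
)

print(SELECT.enabled("S'"))
print(SELECT.is_input_enabled())
```

The second call reports that select is not input-enabled: in state S it accepts only s, which is exactly the property that separates interface automata from I/O automata.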
Automata Composition
• Composition ~ product FSM with synchronization (rendezvous) on common actions
• Composition edges:
  (i) step A on unshared action
  (ii) step B on unshared action
  (iii) step both on shared action
[Figure: automata A (states A, A’; actions x?, y!) and B (states B, B’; actions y?, z!) connected as x → A → y → B → z; their direct product and their composition over states AB, A’B, AB’, A’B’, where the shared action y becomes internal (y;)]
Compatible Composition ⇒ Bounded Memory
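The three kinds of composition edges can be sketched as a reachable-product construction. This is an illustrative Python sketch under stated assumptions: automata are plain dictionaries, the function name `compose` is made up, and the transitions given for A and B are a reading of the two-state example in the figure rather than a confirmed transcription.

```python
def compose(P, Q):
    """Reachable product of P and Q, synchronizing on shared actions."""
    shared = (P["out"] & Q["in"]) | (Q["out"] & P["in"])
    start = (P["start"], Q["start"])
    seen, work, edges = {start}, [start], set()
    while work:
        p, q = work.pop()
        p_steps = [(a, w) for (v, a, w) in P["edges"] if v == p]
        q_steps = [(a, w) for (v, a, w) in Q["edges"] if v == q]
        # (i) step P alone on an unshared action
        steps = [(a, (w, q)) for (a, w) in p_steps if a not in shared]
        # (ii) step Q alone on an unshared action
        steps += [(a, (p, w)) for (a, w) in q_steps if a not in shared]
        # (iii) step both on a shared action (rendezvous)
        steps += [(a, (wp, wq)) for (a, wp) in p_steps
                  for (b, wq) in q_steps if a == b and a in shared]
        for a, nxt in steps:
            edges.add(((p, q), a, nxt))
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return {
        "start": start, "states": seen, "edges": edges,
        "in": (P["in"] | Q["in"]) - shared,
        "out": (P["out"] | Q["out"]) - shared,
        "hid": P["hid"] | Q["hid"] | shared,   # shared actions become internal
    }

# Assumed example: x -> A -> y -> B -> z
A = {"start": "A", "in": {"x"}, "out": {"y"}, "hid": set(),
     "edges": {("A", "x", "A'"), ("A'", "y", "A")}}
B = {"start": "B", "in": {"y"}, "out": {"z"}, "hid": set(),
     "edges": {("B", "y", "B'"), ("B'", "z", "B")}}

C = compose(A, B)
print(sorted(C["states"]))
print(len(C["edges"]))
```

Under these assumptions the composite has four states and five edges, and y appears only as an internal action; the direct product would additionally keep the unsynchronized y! and y? edges.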
Compatibility
• Illegal(P,Q) = { reachable product states (p,q) ∈ VP × VQ s.t. p produces a shared action that q does not accept, or vice versa }
• Interface automata P, Q are compatible if:
  - CSP: always
  - I/O Automata: Illegal(P,Q) = ∅
  - Interface Automata: Illegal(P,Q,Env) = ∅
    Least restrictive environment Env accepts all outputs, provides no inputs
    Compatible states are those that never reach illegal states via internal, output transitions
• This is overly restrictive for SCORE
  - Can enter an illegal state, stall the illegal producer, and step the consumer on a different action!
  - IllegalSCORE(P,Q) = { reachable product states (p,q) ∈ VP × VQ s.t. (p produces a shared action that q does not accept, and q has no alternative non-shared actions), or vice versa }
    = reachable deadlock(VP × VQ) \ deadlock(VP) × deadlock(VQ)
[Figure: two example pairs: A offering x! with B offering x?; A offering x! with B offering t;]
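The difference between the strict illegal-state rule and the relaxed SCORE rule can be checked on single product states. The following Python sketch is illustrative only: the helper names (`offers`, `blocked_outputs`, `illegal`, `illegal_score`) and the tiny one-edge automata are assumptions, and reachability is ignored so that the two predicates stay easy to compare.

```python
def offers(M, state):
    """Actions M can take from `state`."""
    return {a for (v, a, _) in M["edges"] if v == state}

def blocked_outputs(P, Q, shared, p, q):
    """Shared outputs that P offers in p but Q does not accept in q."""
    return (offers(P, p) & P["out"] & shared) - offers(Q, q)

def illegal(P, Q, shared, p, q):
    """Strict rule: any refused shared output makes (p, q) illegal."""
    return bool(blocked_outputs(P, Q, shared, p, q) or
                blocked_outputs(Q, P, shared, q, p))

def illegal_score(P, Q, shared, p, q):
    """SCORE rule: a refused output is illegal only if the refusing side
    has no non-shared action it could take instead."""
    p_stuck = blocked_outputs(P, Q, shared, p, q) and not (offers(Q, q) - shared)
    q_stuck = blocked_outputs(Q, P, shared, q, p) and not (offers(P, p) - shared)
    return bool(p_stuck or q_stuck)

shared = {"x", "y"}                                        # streams from A to B
A = {"out": {"x", "y"}, "edges": {("A", "x", "A'")}}       # A offers x!
B_ready = {"out": set(), "edges": {("B", "x", "B'")}}      # B offers x?
B_busy = {"out": set(), "edges": {("B", "t", "B'")}}       # B offers internal t;
B_other = {"out": set(), "edges": {("B", "y", "B'")}}      # B waits for y? instead

print(illegal(A, B_busy, shared, "A", "B"),
      illegal_score(A, B_busy, shared, "A", "B"))
```

With B busy on an internal step, the strict rule flags the state while the SCORE rule does not, because A can simply stall until B has moved on. When B instead waits on the other shared stream y, neither side can step and both rules flag the state.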
Alternate Composition Semantics
• Automata Composition (P Q) (the rest of this talk)
  - Compatibility = no reachable deadlock
  - Pessimistic; correct in any environment
  - Any state can stall ⇒ state explosion
• Interface Composition (P || Q)
  - Compatibility = no reachable illegal states for given env.
  - Optimistic; correct in environment that provides no inputs
  - Outputs cannot stall ⇒ smaller composition
• SCORE Composition? (P Q)
  - How to get smaller compositions, correct in any environment?
  - Strategic use of output stall
    Compatibility by construction? (disallow transitions to illegal paths)
    Minimal stutter (stall) transitions?
An Incompatible SCORE Composition
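[Figure: A (states A, A’, A’’; actions i?, x!, y!) feeds B (states B, B’, B’’; actions y?, x?, o!) over streams x and y, with external input i and output o; product automaton over the nine states AB … A’’B’’ with transitions i?, x;, y;, o!]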
Adding a Buffer
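[Figure: a queue Q (states Q, Q’; actions x?, x!) is inserted on stream x between A and B; composition of A and Q over states AQ, A’Q, A’’Q, AQ’, A’Q’, A’’Q’ with transitions i?, x;, x!, y!]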
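A bounded buffer of a given depth can itself be written as an automaton whose states count occupancy, which is what lets it take part in composition like any other component. The Python sketch below is an assumption-laden illustration: the function name `queue_automaton`, the state names Q0, Q1, …, and the dictionary layout are hypothetical, and the automaton tracks only how many tokens are queued, not their values.

```python
def queue_automaton(stream, depth):
    """Bounded FIFO on `stream` as an automaton with depth+1 occupancy states.

    Actions follow the slide notation: stream + '?' accepts a token from the
    producer, stream + '!' hands a token to the consumer.
    """
    states = [f"Q{k}" for k in range(depth + 1)]
    edges = set()
    for k in range(depth):
        edges.add((states[k], stream + "?", states[k + 1]))   # enqueue
        edges.add((states[k + 1], stream + "!", states[k]))   # dequeue
    return {"start": states[0], "states": set(states), "edges": edges}

Q1 = queue_automaton("x", 1)
print(sorted(Q1["edges"]))
```

For depth 1 this yields the two-state queue of the figure (Q and Q’ there, Q0 and Q1 here); depth 0 degenerates to a single state with no edges.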
Buffered Composition
[Figure: the composite AQ composed with B (states B, B’, B’’; actions y?, x?, o!); product automaton over the 18 states AQB … A’’Q’B’’ with transitions i?, x;, y;, o!]
Adding a Buffer, Alternate Order
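[Figure: alternate order: the queue Q (states Q, Q’; actions x?, x!) is first composed with B (states B, B’, B’’; actions y?, x?, o!), giving states QB, Q’B, QB’, Q’B’, QB’’, Q’B’’ with transitions x?, x;, y?, o!]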
Buffered Composition, Alternate Order
[Figure: A (states A, A’, A’’; actions i?, x!, y!) composed with the composite QB; product automaton over the 18 states AQB … A’’Q’B’’ with transitions i?, x;, y;, o!]
Composition is Associative*
[Figure: the automata for A (Q B) and (A Q) B shown side by side over the same 18 states AQB … A’’Q’B’’]
Static Stream Depth Bound Analysis
Basic idea: try to compose A, B with increasingly large queues

Given: graph of TDF operators
Output: stream depth bound (∈ ℕ ∪ {∞}) for each stream

Initialize: depth[ei] ← 0 for all streams (edges) ei
For each pair (A,B) of connected operators
  Let {ei} be the set of streams connecting A, B
  Construct interface automata for A, B; each ei induces actions:
    shared action a_ei if depth[ei] < ∞
    non-shared actions a^i_ei, a^o_ei if depth[ei] = ∞
  While not Done
    Construct composition: C ← composition of A, B, and { Q(depth[ei]) ∀ei s.t. depth[ei] < ∞ }
    Compute illegal states: IllegalSCORE(C)
    If IllegalSCORE(C) = ∅
      – Done with pair A, B
    Else
      – For each shared action a_ei that is output but not input in some illegal state s ∈ IllegalSCORE(C)
        » depth[ei]++
        » If (depth[ei] ≥ depth_threshold) then depth[ei] ← ∞
      – If depth[ei] = ∞ ∀ei then Done with pair A, B
Return depth[]
[Figure: A and B connected by streams e1, e2 with external input i and output o; the composed pair A B with input i and output o]
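The loop above can be sketched for a single pair of operators. This Python sketch makes several simplifying assumptions that are not in the talk: queues are modeled as occupancy counters rather than composed Q automata, A is the writer and B the reader on every shared stream, external streams never block, every state has at least one outgoing edge, and an illegal state is approximated as a reachable state with no enabled step. The names `blocked_streams` and `bound_depths` are hypothetical.

```python
INF = float("inf")

def bump(occ, index, delta):
    """Copy of the occupancy tuple with one entry changed."""
    return occ[:index] + (occ[index] + delta,) + occ[index + 1:]

def blocked_streams(A, B, shared, depth):
    """Explore reachable (a, b, occupancy) states; return the shared streams
    on which A is blocked writing in some state with no enabled step."""
    start = (A["start"], B["start"], (0,) * len(shared))
    seen, work, blamed = {start}, [start], set()
    while work:
        a, b, occ = work.pop()
        steps, blocked = [], set()
        b_edges = [e for e in B["edges"] if e[0] == b]
        for (_, s, a2) in (e for e in A["edges"] if e[0] == a):
            if s not in shared or depth[s] == INF:
                steps.append((a2, b, occ))              # non-shared action
            elif depth[s] == 0:                         # rendezvous with B
                hits = [(a2, b2, occ) for (_, t, b2) in b_edges if t == s]
                steps += hits
                if not hits:
                    blocked.add(s)
            elif occ[shared.index(s)] < depth[s]:       # room in the queue
                steps.append((a2, b, bump(occ, shared.index(s), +1)))
            else:
                blocked.add(s)                          # queue is full
        for (_, s, b2) in b_edges:
            if s not in shared or depth[s] == INF:
                steps.append((a, b2, occ))              # non-shared action
            elif depth[s] > 0 and occ[shared.index(s)] > 0:
                steps.append((a, b2, bump(occ, shared.index(s), -1)))
        if not steps:
            blamed |= blocked                           # deadlock state
        for nxt in steps:
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return blamed

def bound_depths(A, B, shared, depth_threshold):
    """Grow queue depths until no deadlock remains or all streams are unbounded."""
    depth = {s: 0 for s in shared}
    while True:
        blamed = blocked_streams(A, B, shared, depth)
        if not blamed:
            return depth
        for s in blamed:
            depth[s] += 1
            if depth[s] >= depth_threshold:
                depth[s] = INF
        if all(d == INF for d in depth.values()):
            return depth

# The incompatible pair from the earlier slides: A writes x then y, B reads y then x.
A = {"start": "A", "edges": [("A", "i", "A'"), ("A'", "x", "A''"), ("A''", "y", "A")]}
B = {"start": "B", "edges": [("B", "y", "B'"), ("B'", "x", "B''"), ("B''", "o", "B")]}
print(bound_depths(A, B, ["x", "y"], depth_threshold=4))
```

On this pair the sketch settles on depth 1 for x and depth 0 for y, matching the single buffer Q added on stream x in the buffered composition slides. A producer that writes x forever while the consumer only reads y is pushed past the threshold and reported as ∞.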
Results – Pair Composition*
                  (with streams to mem)                      (without streams to mem)
App               #streams  Trivially  Non-Triv.  Unbounded  #SFSMs  #SFSM  Trivially  Non-Triv.  Not
                            Bounded    Bounded                       pairs  Compos.    Compos.    Compos.
IIR               9         9          0          0          8       7      7          0          0
Wavelet Encode    58        35         0          23         30      24     15         0          9
Wavelet Decode    57        34         12         11         27      31     26         0          5
JPEG Encode       62        25         13         24         13      11     6          1          4
JPEG Decode       61        -          -          -          12      -      -          -          -
MPEG Encode IP    421       351        43         70         92      154    144        5          5
MPEG Encode IPB   488       402        58         22         114     211    192        8          5
* Max stream depth = 2
Maximum Depth Parameter
App               #streams  Non-Trivially Bounded       #SFSM  Non-Trivially Composable
                            for given max stream depth  pairs  for given max stream depth
                            0    1    2    3    4              0    1    2    3    4
IIR               9         0    0    0    0    0       7      0    0    0    0    0
Wavelet Encode    58        0    0    0    0    0       24     0    0    0    0    0
Wavelet Decode    57        12   12   12   12   12      31     0    0    0    0    0
JPEG Encode       62        10   12   13   15   16      11     0    1    1    1    1
JPEG Decode       61        17   18   -    -    -       9      2    2    -    -    -
MPEG Encode IP    421       39   42   43   -    -       154    4    5    5    -    -
MPEG Encode IPB   488       53   57   58   -    -       211    6    8    8    -    -
Composite Automaton Size
App               # Nodes in Largest Composition                    Run time (seconds)
                  for given max stream depth                        for max stream depth
                  0         1         2         3         4         2
IIR               12        12        12        12        12        0.2
Wavelet Encode    1587      1587      3094      11,417    30,630    7.5
Wavelet Decode    961       1922      11,798    37,712    84,784    7.2
JPEG Encode       3785      6175      8086      16,244    27,138    9.5
JPEG Decode       3785      196,576   196,576*  196,576*  196,576*  245*
MPEG Encode IP    7887      89,541    334,757   334,757*  334,757*  125
MPEG Encode IPB   8478      100,909   385,334   385,334*  385,334*  150
* Crashed; Partial Results
Composing More than 2 SFSMs
• Page Packing by incrementally growing a cluster?
• Larger composition should improve stream depth bound
  - Restricts environment around a pair of SFSMs
  - Fewer transitions ⇒ fewer reachable deadlocks
• BUT larger composition can expose deadlocked feedback loop
[Figure: “Compose 2 SFSMs” across Page 1 and Page 2 with stream depth bounds 1, 2, 4, ∞; “Compose 3 SFSMs” on Page 1 with some of those bounds marked “?”]
Synthesizing a Composite SFSM?
• How to turn a composite automaton into TDF or page logic?
• TDF does not support all non-deterministic branches
  - Multiple inputs: ok (state with multiple signatures / cases)
  - Multiple outputs: must sequentialize (how?)
  - Input + output: ???
    Input before output: may cause deadlock if output feeds back to input
    Output before input: may stall composite on output back-pressure
• Conjecture: it is always safe to sequentialize outputs before inputs
• Heavier-weight automata can check input availability / output space before blocking on I/O
  - “System-Level Types for Component-Based Design,” Lee + Xiong, EMSOFT 2001 (used in Ptolemy)
[Figure: IA composition A || B over states AB, A’B, AB’, A’B’ with transitions x?, y;, z!]
Summary
• SCORE: process network model to support software longevity, scalability on massively parallel HW
• Automata composition with finite queues: Compatibility ⇒ Bounded memory
• Initial results: pair composition
• Future work:
  - Faster run time (semantics for smaller composite size)
  - Compose more than 2 SFSMs
  - Page Packing
Supplemental
TDF Dataflow Process Network
• Dataflow Process Network [Parks + Lee, IEEE May ‘95]
  - Process enabled by set of firing rules: R = {R1, R2, …, RN}
  - Firing rule = set of patterns: Ri = {Ri,1, Ri,2, …, Ri,p}
• DF process for a TDF operator:
  - Feedback arc for state
  - One firing rule per state
  - Patterns match state value + presence of desired inputs
    E.g. for state i: Ri = {Ri,1, Ri,2, …, [i]}
  - Patterns:
    Ri,j = [*] if input j is in state i’s input signature
    Ri,j = ⊥ if input j is not in state i’s input signature
    Ri,p = [i] for final input, representing state arc
• These are sequential firing rules
  - Partitioned SFSM adds “wait” state
[Figure: process with a feedback arc carrying its state]
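The construction of one firing rule per state can be sketched for the select operator from earlier in the talk. This Python sketch is illustrative: the function names `firing_rules` and `enabled_rule`, the string encodings of the patterns, and the use of token counts to stand for stream contents are all assumptions.

```python
WILDCARD = "[*]"   # any single token present on this input
ABSENT = "⊥"       # input is not examined by this rule

def firing_rules(inputs, signatures):
    """One rule per state: a pattern per input, then the state pattern [i]."""
    return {
        state: [WILDCARD if j in sig else ABSENT for j in inputs] + [f"[{state}]"]
        for state, sig in signatures.items()
    }

def enabled_rule(rules, inputs, state, available):
    """Return the rule for the current state if its inputs are present, else None.

    `available` maps an input name to the number of tokens waiting on it.
    """
    rule = rules[state]
    needed = [j for j, pat in zip(inputs, rule) if pat == WILDCARD]
    return rule if all(available.get(j, 0) > 0 for j in needed) else None

# Input signatures of `select`: state S reads s, T reads t, F reads f.
INPUTS = ["s", "t", "f"]
RULES = firing_rules(INPUTS, {"S": {"s"}, "T": {"t"}, "F": {"f"}})

print(RULES["S"])
print(enabled_rule(RULES, INPUTS, "T", {"s": 1}))
```

Because the state value arrives on the feedback arc, exactly one rule can match at a time, which is what makes these rules sequential.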
SFSM Partitioning Transform
• Only 1 partition active at a time
  - Transform to activate via streams
• New state in each partition: “wait”
  - Used when not active
  - Waits for activation from other partition(s)
  - Has one input signature (firing rule) per activator
• Firing rules are not sequential, but determinism guaranteed
  - Only 1 possible activator
• Activation streams from given source to given dest. partitions can be merged + binary-encoded
[Figure: SFSM with states A, B, C, D partitioned into {A, B} plus state WaitAB and {C, D} plus state WaitCD]
SCORE Hardware Model
• Paged FPGA
  - Compute Page (CP): fixed-size slice of RC hardware, fixed number of I/O ports
  - Distributed, on-chip memory: Configurable Memory Block (CMB), stream access
  - High-level interconnect
• Microprocessor
  - Run-time support + user code
Functional Simulation
• FPGA based on HSRA [Berkeley, FPGA ’99]
  - CP: 512 4-LUTs
  - CMB: 2Mbit DRAM
  - Area for CP-CMB pair:
    .25: 12.9mm² (1/9 of PII-450)
    .18: 6.7mm² (1/16 of PIII-600)
  - Page reconfiguration: 5000 cycles (from CMB)
  - Synchronous operation (same clock speed as processor)
• x86 microprocessor
• Page Scheduler task
  - Swap on timer interrupt (every 250,000 cycles)
  - Fully dynamic scheduling
Application: JPEG Encode
Execution Results
[Figures: execution results plotted against Hardware Size (CP-CMB Pairs), four slides]
Page Hardware Model
• Page = fixed-size slice of resources + stream interface
• FSM for:
  - Firing
  - Output emission
  - Data-path control
  - Branching
[Figure: page hardware showing the FSM, reconfigurable logic, and fixed logic]
Page Firing Logic
Sample firing logic: 3 inputs (A, B, C), 3 outputs (X, Y, Z), single signature