Compilation for Scalable, Paged Virtual Hardware

Eylon Caspi

Qualifying Exam

3/6/01

University of California, Berkeley


The Compilation Problem

Programming Model
• Communicating EFSM operators
  - unrestricted size, # IOs, timing

Execution Model
• Communicating page configs
  - fixed size, # IOs, timing
• Paged virtual hardware

[Figure: "Compile" maps a graph of TDF operators, streams, and memory segments onto compute pages, streams, and memory segments]

Compilation is a resource-binding transform on state machines + data-paths


Overview

• Motivation
  - Paged virtual hardware: software survival + scalability
  - SCORE programming model
• Compilation methodology
  - New page partitioning techniques
  - Automatic synthesis & partitioning of communicating FSMs
• Evaluation + Architectural Studies
• Timeline


Reconfigurable Computing

Programmable logic + programmable interconnect (e.g. FPGA)

10x-100x gain vs. microprocessors in:
• Performance
• Functional density (work per area-time)

Spatial computing
• Parallelism; custom data paths

Programmability
• Custom execution sequence; specialization

BUT current models expose resource constraints to the programmer
• Programmer has to target a specific device
• Limits software longevity


Solution: Virtual Hardware

Compute model with unbounded resources
• Programmer no longer targets a specific device
• Enables software longevity, scalability

Requires efficient hardware virtualization
• Large device: concurrent spatial execution
• Small device: time multiplexing
• Paging model


Previous Approaches to Paging

WASMII: Register IO [Ling+Amano, FCCM ‘93]
• Page IO via registers
• Evaluate each page for a cycle, then reconfigure
• Reconfiguration time dominates execution

DPGA: Configuration Cache [DeHon, FPGA ‘94], TM-FPGA [Xilinx, FCCM ‘97]
• Fast reconfiguration, paid for in area and power
• Reconfiguration power dominates execution

PipeRench: Stripes [CMU, FPGA ‘98]
• Pipelined reconfiguration
• Feed-forward computation only


Paging + Streaming

Streaming allows efficient, useful virtualization
• Amortizes reconfiguration cost over a larger epoch
• Exploits program structure
• Less restrictive communication topology

Compiler and scheduler’s joint responsibility


SCORE Compute Model

Program = DFG of compute nodes
• Kahn process network
  - blocking read, non-blocking write

Compute: SFSM (Streaming Finite State Machine)
• Concretely: page + FSM to implement token-flow semantics
• Abstractly: task with local control

Communication: Stream
• Abstraction of wire, with buffering

Storage: Memory Segment

Dynamics:
• Dynamic local behavior in SFSM
• Unbounded resource usage: stream buffer expansion
• Dynamic graph allocation in STM (Streaming Turing Machine)


SCORE Programming Model: TDF

TDF = intermediate, behavioral language for:
• EFSM operators
• Static operator graphs

State machine for:
• Firing signatures
• Control flow (branching)

Firing semantics: when in state X, wait for X’s inputs, then fire (consume, act)

select ( input boolean s,
         input unsigned[8] t,
         input unsigned[8] f,
         output unsigned[8] o )
{
  state S (s) : if (s) goto T; else goto F;
  state T (t) : o=t; goto S;
  state F (f) : o=f; goto S;
}

[Figure: the select operator with input streams s, t, f and output stream o]
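The firing semantics of the `select` operator can be sketched in Python. This is only an illustration, not part of the SCORE tool chain: the function name `run_select` and the use of finite token lists are assumptions, and a blocked read is modeled by stopping instead of waiting.

```python
from collections import deque

def run_select(s, t, f):
    """Simulate the TDF `select` operator on finite lists of input tokens."""
    s, t, f = deque(s), deque(t), deque(f)
    out = []
    state = "S"
    # Input signature of each state: the streams it must read before firing.
    signature = {"S": (s,), "T": (t,), "F": (f,)}
    # Fire only while every stream in the current state's signature has a token.
    while all(signature[state]):
        if state == "S":
            state = "T" if s.popleft() else "F"   # consume s, branch
        elif state == "T":
            out.append(t.popleft())               # o = t
            state = "S"
        else:
            out.append(f.popleft())               # o = f
            state = "S"
    return out
```

For example, `run_select([True, False, True], [10, 11], [20])` returns `[10, 20, 11]`: each token on `s` chooses whether the next output comes from `t` or from `f`, and the unselected stream is left untouched.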


SCORE Hardware Model

Paged FPGA
• Compute Page (CP)
  - Fixed-size slice of RC hardware
  - Fixed number of I/O ports
• Distributed, on-chip memory
  - Configurable Memory Block (CMB)
  - Stream access
• High-level interconnect

Microprocessor
• Run-time support + user code


SCORE Software Infrastructure

Device Simulator
• Cycle-accurate behavioral simulation
• Parameterized (e.g. #pages)
• Interacts with concurrent user processes (STMs) via stream API

Page Scheduler
• Version 1: dynamic, list-based scheduling (by input availability)
• Version 2: static, precedence-based

TDF Compiler
• Compiles to working C++ simulation code
• No partitioning (page = 1 TDF operator)

Applications
• Wavelet, JPEG, MPEG, IIR

[Figure: runtime vs. device size]


Communication is King

With virtualization, inter-page delay is unknown, sensitive to:
• Placement
• Interconnect implementation
• Page schedule
• Technology: wire delay is growing

Inter-page feedback is SLOW
• Partition to contain feedback loops in page
• Schedule to contain feedback loops on device


Structural Partitioning is Not Enough

Structural partitioning does not address feedback loops
• Wire min-cut: FM, flow-based
• Minimum wire length: spectral
• Delay-optimal DAG mapping: DAGON, FlowMap, Wong

Structural partitioning does not address communication rates, dynamics
• All loops are NOT created equal


FSM Decomposition is not enough

• Ashar+Devadas+Newton (ICCAD ‘89): minimize logic
• Kuo+Liu+Cheng (ISCAS ‘95): minimize wires
• Benini+DeMicheli+Vermeulen (ISCAS ‘98): minimize power

None consider inter-page delay
None consider cutting / scheduling data-path separately from FSM


Outline

• Motivation
• Compilation Methodology
• Evaluation + Architectural Studies
• Time Line


Compilation – Scope

Synthesis + partitioning of SFSMs
• TDF to pages
• Resource binding

Target
• Parameterized hardware model / simulation

Constrained optimization problem
• Constraints: page area, IO, timing
• Optimality criteria
  - Primary: communication delay
  - Secondary: communication bandwidth, area


Compilation Flow Overview

(1) Optimizations
(2) Data path timing + scheduling
(3) Partitioning

Ignore:
• Place / route / retime in page
  - Known solutions in the community
• Page scheduling
  - Responsibility of separate scheduler


Synthesis + Partitioning Flow

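[Figure: flow of compilation stages, grouped under the labels Optimization, Data-path, and Partitioning; a "p" marker (legend: Preliminary Code) appears on five stages]

Stages named in the figure:
• Compiler Optimizations
• Pipeline Extraction
• Data Path Mapping
• Partition Large States
• Schedule DF into States
• Cluster States
• Page Packing
• Synthesize Page FSMs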


How Big is an Operator?

[Figure: Area for 47 Operators (Before Pipeline Extraction). X axis: operator (sorted by area). Y axis: area (4-LUTs), 0 to 3500. Series: FSM area, DF area]

Applications: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR
Also listed on the slide: Wavelet Decode, Wavelet Encode, JPEG Encode, MPEG Encode


Partitioning Tasks

(1) Decompose/shrink SFSMs
(2) Pack SFSMs onto page

[Figure: the same stage flow as on the "Synthesis + Partitioning Flow" slide]


Pipeline Extraction

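Hoist uncontrolled feed-forward data-flow out of FSMD

Benefits:
• Shrink FSM cyclic core
• Extracted pipeline has more freedom for scheduling and partitioning

Example ("Extract"):
  before:  state foo(x): if (x==0) ...
  after:   state foo(xz): if (xz) ...

[Figure: before extraction, stream x feeds the state's data-flow (DF) and control-flow (CF) logic directly; after extraction, a pipeline computes x==0 from x and the state reads the result as stream xz]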


Pipeline Extraction – Extractable Area

[Figure: Extractable Data-Path Area for 47 Operators. X axis: operator (sorted by data-path area). Y axis: area (4-LUTs), 0 to 3500. Series: extracted DF area, residual DF area]

Applications: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR


Pipeline Extraction – Residual SFSM

[Figure: Area for 47 Operators (After Pipeline Extraction). X axis: operator (sorted by area). Y axis: area (4-LUTs), 0 to 3000. Series: FSM area, residual DF area]

Applications: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR


Data-path Mapping / Scheduling

Task:
• Bind technology-specific area/time to data-path primitives
• Schedule data-path primitives in state machine

Fixed-frequency target
• Decompose primitives into multi-cycle operations
• Data-path module library / tree matching

Pipeline linearized sequences / loops
• DAG mapping state logic is insufficient

Compiler technology
• Code motion
• Software pipelining


Delay-Oriented State Clustering

Indivisible unit: state (CF+DF)
• Spatial locality in state logic

Cluster states into page-size sub-machines
• Inter-page communication for data flow, state flow

Sequential delay is in inter-page state transfer
• Cluster to maintain local control
• Cluster to contain state loops

Similar to:
• VLIW trace scheduling [Fisher ‘81]
• FSM decomp. for low power [Benini/DeMicheli ISCAS ‘98]
• VM/cache code placement
• GarpCC HW/SW partitioning [Callahan ‘00]


State Clustering Formulation

Min-cut transition probabilities in state flow graph
• Probabilities from profiling

Area-constrained
• Balanced min-cut partitioning [Yang+Wong, ACM ‘94]
• Iterate to desired partition area: $(1-\epsilon)\,A \le a(X) \le (1+\epsilon)\,A$

IO-constrained
• Add wire edges
• Mix edge weights: $c\,w_{\mathrm{wire}} + (1-c)\,w_{\mathrm{SF}}$
• Use smallest IO-feasible $c$

Requires all states to be smaller than page
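The mixed-weight formulation can be illustrated with a small Python sketch. It is not the talk's implementation: exhaustive enumeration stands in for a flow-based balanced min-cut partitioner, only a single bipartition is computed, the sweep of c in fixed steps is an assumption, and all names (`mixed_weight`, `best_bipartition`, `cluster`) and the example graph are hypothetical.

```python
from itertools import combinations

def mixed_weight(w_wire, w_sf, c):
    """Edge weight c*w_wire + (1-c)*w_SF."""
    return c * w_wire + (1 - c) * w_sf

def best_bipartition(area, edges, c, A, eps):
    """Brute-force stand-in for a balanced min-cut partitioner.

    area:  {state: area}
    edges: [(u, v, w_wire, w_sf)], treated as undirected
    Returns (cost, X, cut_wires) for the cheapest X whose area a(X)
    satisfies (1-eps)*A <= a(X) <= (1+eps)*A, or None if none exists.
    """
    states = sorted(area)
    best = None
    for k in range(1, len(states)):
        for members in combinations(states, k):
            X = set(members)
            a = sum(area[s] for s in X)
            if not (1 - eps) * A <= a <= (1 + eps) * A:
                continue
            cut = [e for e in edges if (e[0] in X) != (e[1] in X)]
            cost = sum(mixed_weight(ww, wsf, c) for _, _, ww, wsf in cut)
            wires = sum(ww for _, _, ww, _ in cut)
            if best is None or cost < best[0]:
                best = (cost, X, wires)
    return best

def cluster(area, edges, A, eps, io_limit, steps=10):
    """Sweep c upward from 0; return the first IO-feasible (c, X, cut_wires)."""
    for i in range(steps + 1):
        c = i / steps
        result = best_bipartition(area, edges, c, A, eps)
        if result is not None and result[2] <= io_limit:
            return c, result[1], result[2]
    return None

# Hypothetical 4-state loop a->b->c->d->a with (wire count, transition probability).
area = {"a": 10, "b": 10, "c": 10, "d": 10}
edges = [("a", "b", 2, 0.9), ("b", "c", 16, 0.1),
         ("c", "d", 2, 0.9), ("d", "a", 16, 0.1)]
print(cluster(area, edges, A=20, eps=0.1, io_limit=8))
```

In this made-up example, c = 0 cuts the two unlikely transitions but needs 32 wires across the cut, which exceeds the IO limit of 8. The sweep stops at c = 0.1, where the cut moves to the 2-wire edges (X = {a, d}, 4 cut wires), trading more frequent state transfers for IO feasibility.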

[Figure: example state flow graph with labels p1 to p5, w1 to w9, and a1 to a4]


Page Packing

Cluster SFSMs + pipelines
• Avoid page fragmentation

Min-cut streams of top-level DFG
• Allow cutting pipelines, not SFSMs
• Area and IO constrained (Wong balanced min-cut partition)
• Disallow certain topologies
  - No dynamic-rate streams in page
  - Data-flow feedback?


Evaluating Paging Overhead

Applications
• Must be rewritten in TDF
• Existing: Wavelet, JPEG, MPEG, IIR
• To do: ADPCM, BABAR particle detector

Metrics
• Circuit area (#pages x page-size)
• Page delay (LUT depth per firing)
• Performance (total run-time, “makespan”)

Baseline comparison
• “Unpartitioned”: page = 1 TDF operator
• Ideal virtualization with zero partitioning cost (cannot do better)


Page Size Studies

Paging overhead varies with:
• Application
• Page size, IO
• Match thereof

Is paging overhead robust to a mismatch?
Vary page parameters, measure:
(1) Pure area overhead
(2) Pure performance overhead
    - Execute spatially in expanded hardware
(3) Virtualized performance overhead
    - Execute in fixed device size


Status

SCORE compiler / simulator / scheduler
• Compile + execute unpartitioned (page = 1 TDF op)

Preliminary synthesis + partitioning work
• Pipeline extraction
• FSM synthesis to SIS
• Area-constrained state clustering

To do
• Complete initial implementation
• Evaluate
• Improve: secondary implementation


To Complete Initial Implementation

• IO-constrained state clustering
• Decompose large states
• Page packing
• Data-path scheduling in states
• Synthesize partitioned SFSMs


Secondary Implementation – Possibilities

Optimizations
• SW pipelining
• Use SUIF

State clustering with replication

Unified state clustering + page packing
• Cluster states of all operators simultaneously

Finer-grained clustering
• Recast as BDF, min-cut stream rates


Time Line

[Figure: Gantt chart. Time axis: months 3 to 12 of 2001, then months 1 to 8 of 2002. Bars: Impl. 1, Eval, Impl. 2, Eval, Thesis writing]


Summary

Partitioning and paging enable
• Software survival / scaling
• Efficient use of small HW for dynamic apps

My contributions
• Methodology for page synthesis + partitioning
  - Necessary for efficient virtualization
• Evaluation framework
  - Verify that paging can be efficient
• Architectural studies


Supplemental Material

• SFSMs + transforms
• SCORE simulation + scaling results
• Page hardware model
• Synthesis observations
• Architectural studies


TDF Dataflow Process Network

Dataflow Process Network [Parks+Lee, IEEE May ‘95]
• Process enabled by set of firing rules: $R = \{R_1, R_2, \ldots, R_N\}$
• Firing rule = set of patterns: $R_i = \{R_{i,1}, R_{i,2}, \ldots, R_{i,p}\}$

DF process for a TDF operator:
• Feedback arc for state
• One firing rule per state
• Patterns match state value + presence of desired inputs
  - E.g. for state $i$: $R_i = \{R_{i,1}, R_{i,2}, \ldots, [i]\}$
• Patterns:
  - $R_{i,j} = [*]$ if input $j$ is in state $i$'s input signature
  - $R_{i,j} = \bot$ if input $j$ is not in state $i$'s input signature
  - $R_{i,p} = [i]$ for final input, representing state arc

These are sequential firing rules
Partitioned SFSM adds “wait” state
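The construction of one firing rule per state can be sketched in Python. The representation is an assumption made for illustration: `"[*]"` marks an input that must hold a token, `None` marks an input that is not examined, the last pattern is the state-arc pattern, and the function names are hypothetical.

```python
def firing_rules(inputs, signatures):
    """Build one firing rule per state.

    inputs:     ordered list of input stream names
    signatures: {state: set of input names the state reads}
    Each rule is a list of per-input patterns followed by the state pattern.
    """
    rules = {}
    for state, needed in signatures.items():
        patterns = ["[*]" if name in needed else None for name in inputs]
        patterns.append(f"[{state}]")   # final pattern matches the state arc
        rules[state] = patterns
    return rules

def enabled(rule, available, current_state):
    """True if the rule matches: right state, and a token on every '[*]' input."""
    *input_patterns, state_pattern = rule
    if state_pattern != f"[{current_state}]":
        return False
    return all(n > 0 for p, n in zip(input_patterns, available) if p == "[*]")

# Rules for a three-state operator whose states S, T, F read s, t, f respectively.
rules = firing_rules(["s", "t", "f"], {"S": {"s"}, "T": {"t"}, "F": {"f"}})
print(rules["T"])
```

Because each rule ends in a distinct state pattern, at most one rule can match at a time, so the rules can be tested in a fixed order.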


SFSM Partitioning Transform

Only 1 partition active at a time
• Transform to activate via streams

New state in each partition: “wait”
• Used when not active
• Waits for activation from other partition(s)
• Has one input signature (firing rule) per activator

Firing rules are not sequential, but determinism guaranteed
• Only 1 possible activator

Activation streams from given source to given dest. partitions can be merged + binary-encoded
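A rough Python sketch of this transform follows. It assumes an SFSM is given as a list of (source, destination) state transitions plus a state-to-partition map; the names `partition_sfsm` and `Wait<partition>`, and the example transitions, are hypothetical. A width of 0 bits means the mere presence of a token identifies the entry state.

```python
from math import ceil, log2

def partition_sfsm(transitions, part_of):
    """Split an SFSM's transitions by partition and derive activation streams.

    Returns (local, streams, widths):
      local:   {partition: [(src, dst)]} with added Wait<partition> states
      streams: {(src_part, dst_part): set of entry states}, one merged
               activation stream per ordered partition pair
      widths:  bits needed to binary-encode the entry state on each stream
    """
    local = {p: [] for p in set(part_of.values())}
    streams = {}
    for src, dst in transitions:
        ps, pd = part_of[src], part_of[dst]
        if ps == pd:
            local[ps].append((src, dst))
        else:
            local[ps].append((src, f"Wait{ps}"))   # source partition goes idle
            local[pd].append((f"Wait{pd}", dst))   # wait state fires on activation
            streams.setdefault((ps, pd), set()).add(dst)
    widths = {k: ceil(log2(len(v))) for k, v in streams.items()}
    return local, streams, widths

# Hypothetical 4-state loop split into partitions {A, B} and {C, D}.
transitions = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
part_of = {"A": "AB", "B": "AB", "C": "CD", "D": "CD"}
local, streams, widths = partition_sfsm(transitions, part_of)
print(local["AB"])
```

Here partition AB keeps A to B, replaces B to C with B to WaitAB, and gains WaitAB to A, which fires when partition CD sends an activation token.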

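[Figure: an SFSM with states A, B, C, D is split into partition {A, B} with added state WaitAB and partition {C, D} with added state WaitCD; the activation streams between partitions are labeled {A,B} and {C,D}]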


Distributing/Collecting Shared Streams

Requires inter-page synchronization for ordering

Two schemes for input distribution
(1) Send token to all pages
    - Inactive pages must discard tokens, must know how many to discard
(2) Send token only to active page
    - Distributor must know state
    - (a) present state requests token, OR
    - (b) previous state pre-fetches token

One scheme for output collection
    - Collector must know state

How to cluster distributors / collectors?
• Distributor scheme (1) and collector incur no sequential delay (wire min-cut ok)
• Distributor scheme (2)(a) can be cast into delay-optimal state clustering:
  - Decompose reading states into sequences of single-read states
  - Pre-cluster states that read same stream; this forms distributors
  - Sequential delay of read request is now modeled as state transfer to distributor

[Figure: A, B, C, D sharing input stream i and output stream o]


Decomposing Large States

A state may be larger than a page

Decomposing into a sequence of page-size states leads to excessive inter-page transfer

Better: delay-optimal DAG-mapping into parallel pages


SFSM Optimizations

Many traditional compiler optimization techniques apply to TDF
• State flow ~ basic block flow
• Different cost model
  - “Unlimited” registers and functional units

E.g. work-reducing optimizations
• Constant folding / propagation
• Common subexpression elimination
• Hoist loop invariants
• Strength reduction


SCORE Functional Simulation

FPGA based on HSRA [Berkeley, FPGA ’99]
• CP: 512 4-LUTs
• CMB: 2Mbit DRAM
• Area for CP-CMB pair:
  - 0.25 µm: 12.9 mm² (1/9 of PII-450)
  - 0.18 µm: 6.7 mm² (1/16 of PIII-600)
• Page reconfiguration: 5000 cycles (from CMB)
• Synchronous operation (same clock speed as processor)

x86 microprocessor

Page Scheduler task
• Swap on timer interrupt (every 250,000 cycles)
• Fully dynamic scheduling
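As a back-of-the-envelope check on these parameters, the fraction of a timeslice spent reconfiguring can be computed as below. This assumes each page is reloaded at most once per timeslice and that the reload is counted inside the timeslice; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def reconfig_overhead(reconfig_cycles, timeslice_cycles):
    """Fraction of a timeslice spent reconfiguring, assuming one reload per timeslice."""
    if reconfig_cycles > timeslice_cycles:
        raise ValueError("reconfiguration does not fit in the timeslice")
    return reconfig_cycles / timeslice_cycles

# Simulation parameters from the slide: 5000-cycle reload, 250,000-cycle timeslice.
print(reconfig_overhead(5000, 250_000))   # 0.02
# A hypothetical 10,000-cycle timeslice would spend half its time reconfiguring.
print(reconfig_overhead(5000, 10_000))    # 0.5
```

Under these assumptions the simulated device spends about 2% of each timeslice on reconfiguration, which shows how a long epoch amortizes the reload cost.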


Application: JPEG Encode


Scaling Results: JPEG Encode

[Figure: total time (makespan, in millions of cycles) vs. number of physical compute pages]


Page Hardware Model

Page = fixed-size slice of resources + stream interface

FSM for:
• Firing
• Output emission
• Data-path control
• Branching

[Figure: page block diagram with parts labeled FSM, Reconfigurable, and Fixed logic]


Page Firing Logic

Sample firing logic
• 3 inputs (A,B,C)
• 3 outputs (X,Y,Z)
• Single signature


How Large is a State?

[Figure: Histogram of Data-Path Area Per State (1404 States from 5 Applications). X axis: data-path area (4-LUTs), 0 to 300 in steps of 20. Y axis: count. Bar labels as extracted, bin assignment not recoverable: 162, 31, 68, 4, 35, 3, 1, 3, 1, 2, 1, 18, 3, 317, 764]

Applications: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), IIR


SFSM Firing Delay

Complex SFSM may require ≥1 cycle just for control
• Evaluate firing rule, generate control signals, compute next state

Should we partition SFSM to minimize FSM logic?
• No: incurring inter-page communication latency is worse!

[Figure: Histogram of FSM Delay for 47 Operators (unpartitioned). X axis: delay (4-LUT depth), 0 to 7. Y axis: count]

[Figure: Histogram of FSM Inputs for 47 Operators (unpartitioned). X axis: number of inputs, 0 to 70. Y axis: count]

Applications: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR


Scaling the Hardware Resources

A simplified scaling model for architectural studies

Scaling page size (LUTs) induces scaling of other resources, e.g.:
• Scaling memory
  - Constant CP-to-CMB ratio
• Scaling page IO
  - Rent’s Rule: $IO = C A^{p}$, $(0 \le p \le 1)$
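Rent's Rule as stated above can be evaluated with a few lines of Python. The coefficient and exponent used in the example calls are placeholders chosen for illustration, not values from the talk, and the function name is hypothetical.

```python
def page_io(area_luts, c, p):
    """Rent's Rule: IO = C * A**p, with Rent exponent 0 <= p <= 1."""
    if not 0 <= p <= 1:
        raise ValueError("Rent exponent p must lie in [0, 1]")
    return c * area_luts ** p

# With p = 0.5 (placeholder), quadrupling the page area doubles the page IO.
small = page_io(512, c=1.0, p=0.5)
large = page_io(2048, c=1.0, p=0.5)
print(large / small)
```

With p = 1 the IO count grows linearly with page area, and with p = 0 it stays constant, so the exponent controls how quickly IO must scale as pages grow.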