Lecture 10b: Implementing DSP Functionality: Alternatives

Page 1: Lecture 10b:   Implementing DSP Functionality: Alternatives

1Kurt Keutzer

Lecture 10b: Implementing DSP Functionality: Alternatives

Prepared by: Professor Kurt Keutzer

Computer Science 252, Spring 2000

With contributions from:

Prof. Heinrich Meyr, University of Aachen

Philip Chong, David Chinnery, Rhett Davis, Paul Husted, Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang

Page 2: Lecture 10b:   Implementing DSP Functionality: Alternatives

System Implementation Choices

[Figure: System Functionality can be implemented on an off-the-shelf µP/DSP, an embedded core µP/DSP, an application-specific µP (ASIP), or an ASIC. The DSP core and the ASIP core are each drawn with a program ROM, a coefficient ROM, and control.]

Page 3: Lecture 10b:   Implementing DSP Functionality: Alternatives

Making a Successful Comparison - 1

• Find an interesting application kernel
  • Viterbi decoding for speech processing (not a full modem!)
• Find realistic constraints native to the application
  • n=2, K=7, QPSK, 100 kbs, BER = 10^-4
• Find architectures/implementations that are promising for the application
  • TI TMS320C54, Tensilica Xtensa
  • What are the relevant features of this architecture that support this application?
• Fix application constraints across all implementations (above)
• Fix key parameters for implementation comparison
  • performance (constraint)
  • area
  • power

Page 4: Lecture 10b:   Implementing DSP Functionality: Alternatives

Making a Successful Comparison - 2

• Identify how key parameters will be measured
  • performance - instruction set simulator, eval board
  • area - data sheets, gate estimates
  • power - eval board, TI application note
• Implement your application kernel
  • Examine different algorithms
  • Start with code downloaded from the web - multimedia benchmarks etc.
  • Build your software development/evaluation environment: http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm

Page 5: Lecture 10b:   Implementing DSP Functionality: Alternatives

Making a Successful Comparison - 3

• Implement your application kernel (cont.)
  • Phase 0: Research
    • Find application notes, research reports for your own or comparable architectures
  • Phase 1: Estimation
    • Develop a quick estimate based on initial code
    • Integrate research findings
    • Do a quick back-of-envelope reality check
  • Phase 2: Real implementation/Tuning
    • Tailor algorithm, implementation to architecture
    • Do your very best!
    • Have a contest with your partner
  • Phase 3: Evaluation
    • Apply evaluation tools to key parameters
    • Evaluate and compare results - return to Phase 2
• If your life depended on choosing the right part - what would you do?

Page 6: Lecture 10b:   Implementing DSP Functionality: Alternatives

Making a Successful Comparison - 4

• Final evaluation and comparison - compare all implementations
• To evaluate for a product - everything is fair game
• To evaluate principally the architectures - need to consider:
  • fab differences - TSMC vs. IBM (10-20% faster)
  • process differences - 0.35 micron vs. 0.25 micron (50% faster)
  • power supply differences - 3.0V vs. 1.5V
  • ASIC vs. custom implementations (2x faster)
• Now evaluate - if I were the architect of this processor/implementor of this system on a chip, what would I do differently?
  • cache sizes
  • register availability
  • additional instructions
  • on-chip memory

Page 7: Lecture 10b:   Implementing DSP Functionality: Alternatives

Making a Successful Comparison - 5

Just for fun …

• In addition to primary constraints (speed, cost, power), final real-world considerations:
  • business relationships (joint partnership with Lucent)
  • time-to-market issues
    • time to configure?
    • software development environment
    • library/application software support
    • application engineering support

Page 8: Lecture 10b:   Implementing DSP Functionality: Alternatives

Viterbi Algorithm

Prof. Heinrich Meyr

University of Aachen

Page 9: Lecture 10b:   Implementing DSP Functionality: Alternatives

Viterbi Decoders in digital communication systems

[Figure: transmit chain Signal Source, Source Coder, Convolutional or Trellis Coder & Mapper, Modulator, Channel; receive chain Demodulator, Viterbi Decoder, Source Decoder, Signal Sink. Signal labels: information bits into the coder, channel symbols c_k out of the mapper, received symbols y_k into the Viterbi decoder, decoded bits out of it.]

Page 10: Lecture 10b:   Implementing DSP Functionality: Alternatives

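Convolutional Coder and Trellis diagram

[Figure: three connected blocks. CONVOLUTIONAL CODER: information bits u_k pass through two z^-1 delays (u_k-1, u_k-2), modulo-2 additions form the code symbols b_k, and a BPSK mapper produces the channel symbols c_k. CHANNEL: additive white noise n_k is added, giving y_k. VITERBI DECODER: a four-state trellis (states 0 to 3, states s_0,k ... s_3,k and s_0,k+1 ... s_3,k+1, state bits x_1,k x_0,k, time axis 0, k, k+1, T-1, T) runs from a known start state X_0 = 0 to a known end state X_T = 0; branches are marked b_i,k = 0 or b_i,k = 1; decisions go to the Survivor Memory, which outputs the decoded bits.]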

Page 11: Lecture 10b:   Implementing DSP Functionality: Alternatives

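ACS recursion for M = 2

[Figure: two branches merge into state i at step k. Each candidate is the metric of a predecessor state, Z(0,i) or Z(1,i), at step k-1 plus the branch metric of the corresponding branch, (0,i)_k or (1,i)_k. The new metric for state i at step k is Max{ , } over the two candidates. The winning branch is the survivor path, the other the competing path; the figure labels the case d_i,k = 1. The Greek symbols for the metrics were lost in extraction.]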
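The recursion above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function name `acs`, its argument names, and the convention that larger metrics are better (following the Max on the slide) are assumptions.

```python
def acs(metric_pred0, metric_pred1, branch0, branch1):
    """One add-compare-select step for a state with two predecessors (M = 2).

    Returns (new_state_metric, decision). The decision bit names the
    predecessor whose path survived. Larger metrics win, as in Max{ , }.
    """
    cand0 = metric_pred0 + branch0   # add
    cand1 = metric_pred1 + branch1
    if cand1 > cand0:                # compare
        return cand1, 1              # select: branch 1 survives
    return cand0, 0                  # select: branch 0 survives (also on ties)


new_metric, decision = acs(metric_pred0=10, metric_pred1=7, branch0=-1, branch1=3)
print(new_metric, decision)
```

The decision bit is the only thing the survivor memory needs to keep per state and step; the metrics themselves can be overwritten.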

Page 12: Lecture 10b:   Implementing DSP Functionality: Alternatives

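Viterbi Decoder block diagram

[Figure: channel symbols y_k enter the TMU (transition metric unit), which passes branch metrics to the ACSU (add-compare-select unit). The ACSU holds the state metrics in a latch and passes decision bits to the SMU (survivor memory unit), which outputs the decoded bits u.]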
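To make the three blocks concrete, here is a small hard-decision decoder in which the branch metric, add-compare-select, and traceback steps are marked as TMU, ACSU, and SMU. It is a sketch under stated assumptions: it uses a K = 3, rate 1/2 code with generators 7 and 5 (octal) rather than the project's K = 7 code, counts matching bits as the branch metric, and all names are illustrative.

```python
K = 3                      # constraint length (assumed for brevity)
POLYS = (0b111, 0b101)     # generators 7, 5 octal (assumed)
N_STATES = 1 << (K - 1)


def parity(x):
    return bin(x).count("1") & 1


def encode(bits):
    """Rate-1/2 convolutional encoder; state holds the K-1 most recent bits."""
    state, out = 0, []
    for u in bits:
        reg = (u << (K - 1)) | state
        out.append(tuple(parity(reg & p) for p in POLYS))
        state = reg >> 1
    return out


def viterbi_decode(symbols):
    neg = float("-inf")
    metrics = [0.0] + [neg] * (N_STATES - 1)   # known start state 0
    decisions = []                             # survivor memory contents
    for sym in symbols:
        new_metrics = [neg] * N_STATES
        row = [0] * N_STATES
        for nxt in range(N_STATES):
            u = nxt >> (K - 2)                 # input bit that leads into nxt
            for d in (0, 1):                   # the two predecessor states
                prev = ((nxt << 1) & (N_STATES - 1)) | d
                reg = (u << (K - 1)) | prev
                expected = tuple(parity(reg & p) for p in POLYS)
                # TMU: branch metric = number of matching code bits
                bm = sum(e == r for e, r in zip(expected, sym))
                cand = metrics[prev] + bm      # ACSU: add
                if cand > new_metrics[nxt]:    # ACSU: compare
                    new_metrics[nxt] = cand    # ACSU: select
                    row[nxt] = d
        metrics = new_metrics
        decisions.append(row)
    # SMU: trace back from the best final state
    state = max(range(N_STATES), key=lambda s: metrics[s])
    bits = []
    for row in reversed(decisions):
        bits.append(state >> (K - 2))
        state = ((state << 1) & (N_STATES - 1)) | row[state]
    return bits[::-1]


message = [1, 0, 1, 1, 0, 0, 1, 0] + [0] * (K - 1)   # tail bits flush the encoder
received = encode(message)
received[3] = (received[3][0] ^ 1, received[3][1])   # inject one channel bit error
print(viterbi_decode(received) == message)
```

This version stores every decision row and traces back once at the end; a streaming decoder would instead trace back over a fixed depth D.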

Page 13: Lecture 10b:   Implementing DSP Functionality: Alternatives

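Characteristic of a 2-bit step-at-zero quantizer

[Figure: staircase characteristic over the normalized input level (axis ticks -2, -1, 1, 2) with output levels Q=-2, Q=-1, Q=0, Q=1 and saturation at both ends. A second axis labeled "Interpretation" carries the values -2, -1, 1, 2.]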

Page 14: Lecture 10b:   Implementing DSP Functionality: Alternatives

Architecture

Page 15: Lecture 10b:   Implementing DSP Functionality: Alternatives

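Node parallel ACS architecture

[Figure: the TMU supplies branch metrics (0,i)_k and (1,i)_k to N parallel ACS units (ACS 0, ACS 1, ..., ACS N-1). The state metrics (indices 0,k through N-1,k) are held in a Register and routed back to the ACS units through a Shuffle-Exchange Network. The decisions dec(i,k) go to the SMU.]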

Page 16: Lecture 10b:   Implementing DSP Functionality: Alternatives

Alternative Implementations

[Figure: arrangements of ACS units and memories (M), contrasting butterfly ACS units with shared ACS units.]

Page 17: Lecture 10b:   Implementing DSP Functionality: Alternatives

Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code

[Figure: the four old state metrics (0,k ... 3,k) are read from the path metric memory through MUXes into ACS units, which produce the new state metrics (0,k+1 ... 3,k+1). The trellis is drawn as butterflies so that ACS units can be shared.]

Page 18: Lecture 10b:   Implementing DSP Functionality: Alternatives

Survivor Memory Unit

Page 19: Lecture 10b:   Implementing DSP Functionality: Alternatives

REA hardware architecture

[Figure: survivor memory for the four-state trellis (states s_0,k ... s_3,k) built from an array of processing elements (PE). Each row holds one state's estimated bit sequence û^[0] ... û^[3], with columns indexed from k-1 through k-D+1 to k-D; the decision bits d_0 ... d_3 steer which predecessor row is copied at each step. Branch labels such as u(0,0), u(1,0), u(1,3) and the constant 0/1 inputs are visible but not fully recoverable.]

Page 20: Lecture 10b:   Implementing DSP Functionality: Alternatives

[Figure: survivor paths through the trellis, split into an acquisition region (acquisition of the final survivor) and a decoding region. Decoded sequence: 0 0 ... 0 1 0. The time axis marks û^[0]_k, û^[0]_k-D, and û^[0]_k-(D+M-1).]

Page 21: Lecture 10b:   Implementing DSP Functionality: Alternatives

Viterbi Project Constraints

• uncoded word length = 1
• coded word length (n) = 2; this means that it is rate 1/2
• constraint length (K aka. L) = 7; this means that the number of states in the trellis is 2^(K-1), or 64 states
• branch metric calculation is QPSK
• soft decision wordlength (q) = 6
• chain-backing depth (D) = 96
• generator polynomials: p0 = 171, p1 = 133 (octal); this means that p0 = 1111001, p1 = 1011011
• data rate 100 kbs
• goal: bit error rate (BER) = 10^-4
• signal to noise ratio (SNR)
• degradation 0.05dB
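As a quick check of these constraints, the sketch below builds the rate 1/2, K = 7 encoder from the two octal polynomials and prints their binary forms and the state count. The tap ordering (most significant polynomial bit applied to the newest input bit) and the function names are assumptions, not taken from the slides.

```python
K = 7                       # constraint length from the slide
P0, P1 = 0o171, 0o133       # generator polynomials from the slide (octal)


def parity(x):
    return bin(x).count("1") & 1


def encode(bits):
    """Rate-1/2 encoder: two code bits per input bit.

    Assumption: the newest bit sits in the most significant tap position.
    """
    state, out = 0, []
    for u in bits:
        reg = (u << (K - 1)) | state          # K-bit shift register contents
        out.append((parity(reg & P0), parity(reg & P1)))
        state = reg >> 1                      # keep the K-1 most recent bits
    return out


print(format(P0, "b"), format(P1, "b"))   # binary forms of the polynomials
print(1 << (K - 1))                        # number of trellis states
print(encode([1, 0, 0, 0, 0, 0, 0]))       # impulse response of the encoder
```

Feeding a single 1 followed by zeros reads the polynomial taps back out, one bit per step, which is a convenient sanity test for any encoder implementation.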

Page 22: Lecture 10b:   Implementing DSP Functionality: Alternatives

Viterbi Decoder Implementation on an ARM

EE 290S Final Project

May 4, 1999

Philip Chong

Page 23: Lecture 10b:   Implementing DSP Functionality: Alternatives

ARM Overview

32-bit RISC microprocessor

Five stage pipeline

Features fast ALU operations (barrel shifter)

Scalar integer unit, no FPU

Page 24: Lecture 10b:   Implementing DSP Functionality: Alternatives

Algorithm Tweaking

Performing the metric computation through table lookup (load = 1 delay slot) is faster than using ALU (multiplication = up to 3 delay slots)

Parity computation (Viterbi code) can also be done through table lookup
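The parity lookup can be illustrated as follows. This is only a sketch of the general idea: the table size (256 entries), the 16-bit input width, and the function names are assumptions, and the original ARM code is not shown in the slides.

```python
# 256-entry table built once: parity of every 8-bit value.
PARITY8 = [bin(i).count("1") & 1 for i in range(256)]


def parity_lookup(x):
    """Parity of a value up to 16 bits wide using two table lookups."""
    return PARITY8[x & 0xFF] ^ PARITY8[(x >> 8) & 0xFF]


def parity_alu(x):
    """Reference version: fold the bits one at a time with shifts and XORs."""
    p = 0
    while x:
        p ^= x & 1
        x >>= 1
    return p


print(parity_lookup(0o171), parity_alu(0o171))
```

The lookup replaces a data-dependent loop with a fixed number of loads, which is the trade the slide describes.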

Page 25: Lecture 10b:   Implementing DSP Functionality: Alternatives

Reducing Memory Footprint

Cache misses can be very costly due to pipeline stalls

We are willing to give up some algorithmic efficiency to eliminate cache misses

To minimize the memory footprint, we pack 32 bits of traceback into a single word; we can easily unpack this data with the barrel shifter (a 1-cycle operation)

For 128 level traceback, memory requirements are 512 bytes (metrics table) + 1024 bytes (traceback) + 768 bytes (parity lookup tables) = 2304 bytes
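The packing and the memory budget can be sketched as below. The bit layout (state 0 in bit 0 of word 0) and the helper names are assumptions; the 512-byte and 768-byte table sizes are taken from the slide as given, and only the traceback size is derived here from 64 states and 128 levels.

```python
N_STATES = 64            # 2^(K-1) for K = 7
TRACEBACK_DEPTH = 128    # traceback levels quoted on the slide
WORD_BITS = 32


def pack_decisions(decisions):
    """Pack one decision bit per state into 32-bit words (assumed layout)."""
    words = [0] * (len(decisions) // WORD_BITS)
    for state, bit in enumerate(decisions):
        words[state // WORD_BITS] |= (bit & 1) << (state % WORD_BITS)
    return words


def unpack_decision(words, state):
    """Shift-and-mask read of a single decision bit."""
    return (words[state // WORD_BITS] >> (state % WORD_BITS)) & 1


traceback_bytes = TRACEBACK_DEPTH * N_STATES // 8   # one bit per state per level
metrics_bytes = 512                                 # from the slide
parity_bytes = 768                                  # from the slide
total_bytes = metrics_bytes + traceback_bytes + parity_bytes
print(traceback_bytes, total_bytes)
```

With one bit per state, each trellis step costs two 32-bit words for 64 states, which is what keeps the whole working set small enough to stay in cache.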

Page 26: Lecture 10b:   Implementing DSP Functionality: Alternatives

Simulation Results

Simulated decoding of 4096 bits on a 125 MHz 3.3V model

Execution requires 11.72M ARM instruction cycles, giving 44 kb/s data rate

Power consumption was estimated at 52.47 mW

Scaling simulation results up to 275 MHz 2.0V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW
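The quoted figures can be reproduced with simple arithmetic. The scaling rule used here (throughput proportional to clock frequency, power proportional to frequency times voltage squared) is an inference from the numbers, not something the slide states; variable names are illustrative.

```python
bits = 4096
cycles = 11.72e6
f_sim, v_sim, p_sim = 125e6, 3.3, 52.47e-3    # simulated model: Hz, V, W
f_fast, v_fast = 275e6, 2.0                   # faster part: Hz, V

rate_sim = bits / (cycles / f_sim)            # bits per second at 125 MHz
rate_fast = rate_sim * f_fast / f_sim         # assumed linear in clock rate
p_fast = p_sim * (f_fast / f_sim) * (v_fast / v_sim) ** 2   # assumed f * V^2 scaling

print(round(rate_sim / 1e3, 1), "kb/s at 125 MHz")
print(round(rate_fast / 1e3, 1), "kb/s at 275 MHz")
print(round(p_fast * 1e3, 2), "mW at 275 MHz, 2.0 V")
```

Under these assumptions the decoder needs about 2860 cycles per decoded bit, which is why even the fastest part lands just under the 100 kb/s target.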

Page 27: Lecture 10b:   Implementing DSP Functionality: Alternatives

Summary

Clock speed: 275 MHz

Execution Performance: 96kb/s

Power Dissipation: 42.40 mW (5.68 mW/mm^2)

Area: 7.47 mm^2 in 0.25 µm

Design Effort: 4 days

Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking

Page 28: Lecture 10b:   Implementing DSP Functionality: Alternatives

Conclusion/Thanks

One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR

Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available

Many thanks to Marlene Wan for providing power estimation

Page 29: Lecture 10b:   Implementing DSP Functionality: Alternatives

Viterbi Decoder Implementation on a TI C54x

EE 290S Final Project

May 4, 1999

Paul Husted

Page 30: Lecture 10b:   Implementing DSP Functionality: Alternatives

Introduction

Implemented Viterbi Decoder on a TI TMS320VC5402 DSP

Examine:
• Performance (bits/sec)
• Power (mW/bit)
• Cost ($/unit, area)
• Design effort (engineer-months)

Page 31: Lecture 10b:   Implementing DSP Functionality: Alternatives

Viterbi Decoder Specifications

Implementation Specifications:
• Constraint Length (K aka. L) = 7
• Branch Metric Calculation is QPSK
• Soft Decision Wordlength (q) = 6
• Chain-backing Depth (D) = 96
• Gen. Polynomials: p0 = 171, p1 = 133 (octal)
• Data Rate 100 kbs
• Goal: Bit Error Rate (BER) = 10^-4

Page 32: Lecture 10b:   Implementing DSP Functionality: Alternatives

C54x Capabilities

Capabilities of all C54x DSP Cores:
• Three 16-bit Data, One 16-bit program bus
• 40 bit ACC with 40 bit barrel shifter
• Two independent accumulators
• A single cycle non-pipelined MAC
• Single-instruction repeat and block-repeat
• Six channel DMA controller
• Arithmetic instructions with parallel store and parallel load

Page 33: Lecture 10b:   Implementing DSP Functionality: Alternatives

Helpful Instructions for the Viterbi Decoder

• The C54x Has Specialized Instruction Set
  • Dual Add/Subtract in 1 Cycle
  • Compare, Select, and Store Unit (CSSU)
    • Compare Branch Metrics
    • Store Larger Value, Store Decision Bit
    • Increment Address Registers in Circular Buffer
    • 1 Cycle
• Allows Butterfly (2 States) in 5 cycles

Page 34: Lecture 10b:   Implementing DSP Functionality: Alternatives

Butterfly Implementation

[Figure: butterfly from Old(2*j) and Old(2*j+1) to New(j) and New(j+2^(K-2)), computed with the instruction pairs DADST, CMPS and DSADT, CMPS.]

T Register = Local Distance
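In software terms, one butterfly does a dual add/subtract followed by two compare-selects. The sketch below mirrors that structure; the sign convention (which branch adds and which subtracts the local distance), the larger-is-better comparison, and all names are assumptions rather than a transcription of the C54x code.

```python
K = 7
HALF = 1 << (K - 2)      # offset between the two new states of a butterfly


def butterfly(old, new, decisions, j, t):
    """Update New(j) and New(j + 2^(K-2)) from Old(2*j) and Old(2*j+1).

    t plays the role of the local distance held in the T register.
    """
    a, b = old[2 * j], old[2 * j + 1]
    # Dual add/subtract: four candidate metrics from two old metrics.
    up0, up1 = a + t, b - t
    lo0, lo1 = a - t, b + t
    # Compare-select-store: keep the larger candidate and a decision bit.
    new[j], decisions[j] = (up1, 1) if up1 > up0 else (up0, 0)
    new[j + HALF], decisions[j + HALF] = (lo1, 1) if lo1 > lo0 else (lo0, 0)


old = list(range(2 * HALF))
new = [0] * (2 * HALF)
decisions = [0] * (2 * HALF)
butterfly(old, new, decisions, j=0, t=5)
print(new[0], decisions[0], new[HALF], decisions[HALF])
```

A full trellis step is 32 such butterflies for K = 7; at 5 cycles each that is roughly 160 cycles per decoded bit before traceback and overhead.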

Page 35: Lecture 10b:   Implementing DSP Functionality: Alternatives

TI TMS320VC5402 DSP

Specific Chip Characteristics:
• Operates at 100 MIPS
• Core Voltage of 1.8V
• I/O Pins Operate at 3.3V
• 16K Word x 16 Bits of Dual-Access RAM
• 4K Word x 16 Bits of ROM
• Internal DMA
• Created in 0.18 Micron Technology

Page 36: Lecture 10b:   Implementing DSP Functionality: Alternatives

Dataflow

• Data I/O
  • Input Values Assumed to be Placed at Specified Memory Location by Internal DMA
  • Output Values Assumed to be Removed from Another Memory Location by Internal DMA
  • Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing

Page 37: Lecture 10b:   Implementing DSP Functionality: Alternatives

Implementation Analysis

• Viterbi Decoder Code Created in Assembly
• Linked to Processor Specific Memory Map
• Simulated on Cycle-Accurate Simulator
  • Used Correct Memory Model for VC5402

Page 38: Lecture 10b:   Implementing DSP Functionality: Alternatives

Implementation Results

                        Estimated              Actual
Code Size               500 Instructions       1032 (16 bit) Words
Data Size               1280 (16 bit) Words    1280 (16 bit) Words
MIPS (100 Kbps)         18.425                 21.53125
Max. Speed (100 MIPS)   582 Kbps               464.7 Kbps

Page 39: Lecture 10b:   Implementing DSP Functionality: Alternatives

Power Calculation

• Compared with TI Figures:
  • TI uses 1/2 MACs, 1/2 NOPs For Power Figure
  • .25 Micron Estimate is .45 mA/MIPS
  • Fully Static Design can be Clocked at Any Rate
  • Viterbi Code Uses 1.08 Times More Current than TI Estimate
• At 22 MIPS, 19.25 mW are Consumed in the Core
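The core power figure follows from the current estimate. In this sketch the 1.8 V core voltage is taken from the chip characteristics slide; pairing it with the 0.45 mA/MIPS estimate is an assumption that happens to reproduce the quoted number.

```python
ma_per_mips = 0.45       # TI estimate from the slide
viterbi_factor = 1.08    # Viterbi code draws 1.08x the TI estimate
mips = 22                # load at 100 kb/s, rounded up on the slide
core_voltage = 1.8       # volts, from the TMS320VC5402 slide (assumed to apply here)

current_ma = ma_per_mips * viterbi_factor * mips
power_mw = current_ma * core_voltage
print(round(current_ma, 3), "mA,", round(power_mw, 2), "mW")
```

Because the design is fully static, the same calculation applies at any clock rate, so power tracks the MIPS actually needed rather than the 100 MIPS peak.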

Page 40: Lecture 10b:   Implementing DSP Functionality: Alternatives

Area Estimate

• TI Will Not Release Die Sizes
• .25 Micron Chips Fit Inside 3.2 mm x 3.2 mm Area on a 144 pin BGA
• Maximum Die Size is thus 10.24 mm^2

Page 41: Lecture 10b:   Implementing DSP Functionality: Alternatives

Development Cost

• Engineering Time Estimate - 3 days
  • Assumes Engineer Has Experience with Assembly Language and TI Tools
• Tool Cost - $13262.45
  • Includes Emulator, Simulator, Compiler, Assembler, Linker, Debugger
• Cost of Chip - $8.52

Page 42: Lecture 10b:   Implementing DSP Functionality: Alternatives

Conclusion

Optimized Instructions Make Algorithm Efficient

Static Design Allows Clock Rate to be Set As Needed to Reduce Power

Flexibility Exists to Perform Other Processing of Data

Very Little Development Time/Cost

Page 43: Lecture 10b:   Implementing DSP Functionality: Alternatives

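ACS TIE Extension with State (ACS)

[Figure: datapath of the ACS TIE instruction with state. Four 8-bit branch metrics are packed in one 32-bit word: bm3 (bits 31:24), bm2 (23:16), bm1 (15:8), bm0 (7:0). For each of the source registers Rs and Rt, adders combine packed path-metric (pm) fields with branch metrics, and a subtract with msb test selects the survivor. The result register Rr receives the new path metrics (pm) and the decision bits. The instruction drives the Control block.]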

Page 44: Lecture 10b:   Implementing DSP Functionality: Alternatives

Tensilica Viterbi Implementation

Niraj Shah

Scott Weber

290A Final Presentation

Page 45: Lecture 10b:   Implementing DSP Functionality: Alternatives

Tensilica Flow

[Figure: tool flow with the labels uArch Designer, TIE, Tensilica Processor Generator, gen, .c, xt-gcc, .o, and xt-run.]

Page 46: Lecture 10b:   Implementing DSP Functionality: Alternatives

Xtensa Architecture

[Figure: the Xtensa Core connects to the TIE unit through Rs, Rt, Rr, and I.]

• TIE Extensions:
  • single cycle
  • state free
  • no new exceptions
  • no stalls
  • typeless data
• Rs, Rt, Rr are 32 bit regs
• I is the instruction controlling the TIE unit
• Xtensa Core is a 32 bit configurable RISC processor

Page 47: Lecture 10b:   Implementing DSP Functionality: Alternatives

Viterbi Architecture

[Figure: blocks ADC I/O Device, Init, ACS, TraceBack, and RAM, with the annotation "Measured Performance Here".]

Page 48: Lecture 10b:   Implementing DSP Functionality: Alternatives

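TIE SetupBMreg (ACS)

[Figure: datapath of the SetupBMreg TIE instruction. I is taken from bits 7:0 of one source register and Q from bits 7:0 of the other (Rs, Rt). Adders and subtractors, together with the constant 0x7F, form four 8-bit branch metrics packed as bm3 (bits 31:24), bm2 (23:16), bm1 (15:8), bm0 (7:0). The instruction drives the Control block; Rr is the result register.]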

Page 49: Lecture 10b:   Implementing DSP Functionality: Alternatives

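ACS TIE Extension (ACS)

[Figure: datapath of the ACS TIE instruction. Packed branch metrics bm3 (bits 31:24), bm2 (23:16), bm1 (15:8), bm0 (7:0) are added to packed path-metric (pm) fields from Rs and Rt; a subtract with msb test selects the survivor; Rr receives the new path metric, zero fill, and the decision bits. Four operations run in parallel: ACS03 || ACS12 || ACS30 || ACS21. The instruction drives the control.]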

Page 50: Lecture 10b:   Implementing DSP Functionality: Alternatives

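ACS TIE Extension with State (ACS)

[Figure: datapath of the ACS TIE instruction with state. Four 8-bit branch metrics are packed in one 32-bit word: bm3 (bits 31:24), bm2 (23:16), bm1 (15:8), bm0 (7:0). For each of the source registers Rs and Rt, adders combine packed path-metric (pm) fields with branch metrics, and a subtract with msb test selects the survivor. The result register Rr receives the new path metrics (pm) and the decision bits. The instruction drives the Control block.]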

Page 51: Lecture 10b:   Implementing DSP Functionality: Alternatives

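TIE Zmask (TraceBack)

[Figure: datapath of the Zmask TIE instruction used in traceback. Fields of Rs and Rt are combined with AND (&), OR (|), and shift-left-by-1 (<<1) operations and the masks 0x7F and 0x3F to form Rr. The instruction drives the Control block.]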

Page 52: Lecture 10b:   Implementing DSP Functionality: Alternatives

Designs

All designs had a BER of 0.000095 after 10 million iterations

• Design 1: 100 MHz, 48 mW, 1K DCache, 1K ICache, TIE
• Design 1+: 222 MHz, 144 mW, 1K DCache, 1K ICache, TIE
• Design 2-: 100 MHz, 69 mW, 16K DCache, 16K ICache, TIE
• Design 2: 222 MHz, 191 mW, 16K DCache, 16K ICache, TIE
• Design 3: 222 MHz, 191 mW, 16K DCache, 16K ICache, TIE with state

Page 53: Lecture 10b:   Implementing DSP Functionality: Alternatives

Performance

[Bar chart, Kb/s, axis 0 to 1200]

Design      Cache   Perfect Cache
Design 1    118     409
Design 1+   263     909
Design 2-   357     409
Design 2    793     909
Design 3    966     1142

Page 54: Lecture 10b:   Implementing DSP Functionality: Alternatives

Energy Dissipation

[Bar chart, uJ/bit, axis 0 to 0.6]

Design      Cache   Perfect Cache
Design 1    0.4     0.12
Design 1+   0.54    0.16
Design 2-   0.19    0.17
Design 2    0.24    0.21
Design 3    0.2     0.17
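The energy figures follow from power divided by throughput. The sketch below recomputes them from the Designs and Performance slides, reading the chart values in order as (cache, perfect cache) pairs per design; that pairing is an assumption about the chart layout, and the names are illustrative.

```python
# design: (power in mW, Kb/s with real cache, Kb/s with perfect cache)
designs = {
    "Design 1":  (48, 118, 409),
    "Design 1+": (144, 263, 909),
    "Design 2-": (69, 357, 409),
    "Design 2":  (191, 793, 909),
    "Design 3":  (191, 966, 1142),
}


def energy_uj_per_bit(power_mw, rate_kbps):
    """mW divided by kb/s gives microjoules per bit."""
    return power_mw / rate_kbps


for name, (power, cache, perfect) in designs.items():
    print(name,
          round(energy_uj_per_bit(power, cache), 2),
          round(energy_uj_per_bit(power, perfect), 2))
```

Seen this way, Design 1+ is the worst case: raising the clock with a 1K cache nearly triples power but only doubles throughput, so energy per bit goes up.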

Page 55: Lecture 10b:   Implementing DSP Functionality: Alternatives

n(s*J)/Bit

[Bar chart, n(s*J)/Bit, axis 0 to 3.5]

Design      Cache   Perfect Cache
Design 1    3.39    0.293
Design 1+   2.05    0.176
Design 2-   0.532   0.416
Design 2    0.315   0.231
Design 3    0.207   0.148

Page 56: Lecture 10b:   Implementing DSP Functionality: Alternatives

Die Area

[Bar chart, mm^2, axis 0 to 7]

Design      Cache   Perfect Cache
Design 1    2.1     2.1
Design 1+   2.37    2.37
Design 2-   6.14    6.14
Design 2    6.7     6.7
Design 3    6.7     6.7

Page 57: Lecture 10b:   Implementing DSP Functionality: Alternatives

Conclusions

TIE extensions, cache configuration, and improved code efficiency resulted in an order of magnitude improvement from our original

For power and performance, the effect of cache size is greater than the effect of a higher clock frequency

Use voltage scaling to reduce the power

If streaming data, then scale frequency

Adding state will result in the ability to increase performance

Having the ability to remove core instructions will decrease decode complexity and should lower power and area

Page 58: Lecture 10b:   Implementing DSP Functionality: Alternatives

Soft Core Viterbi Decoder

EECS 290A Project

Dave Chinnery, Rhett Davis, Chris Taylor, Ning Zhang

Page 59: Lecture 10b:   Implementing DSP Functionality: Alternatives

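High Level Architecture

[Figure: block diagram annotated with each block's share of gates, area, and power; the block names are not recoverable. Shares as listed:]

% Gates   % Area   % Power
23%       36%      30%
0%        48%      15%
38%       8%       22%
18%       4%       16%
9%        2%       8%
4%        1%       5%
2%        1%       4%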

Page 60: Lecture 10b:   Implementing DSP Functionality: Alternatives

Branch & Path Metric Generation

• Branch Metrics Computation apparently implemented with a CORDIC block (contains 840 MUX's, 58 adders & flip-flops, 32 15-bit busses)
• Branch Metrics Hard-wired to each ACS unit
• Path Metrics Stored in ACS units
• Each ACS unit handles 16 states
• Hard-wired Path Metric Interconnect

Page 61: Lecture 10b:   Implementing DSP Functionality: Alternatives

ACS Architecture

• Each ACS unit stores 32 path metrics
• Only two SRAMs are active at a time
• Across all four ACS units, each path metric is stored twice
• SRAM accounts for 88% of the area and 27% of the power for each ACS unit

[Figure: ACS unit with blocks labeled 8x9 SRAM, PMU, PML, BMU, BML, Add, Compare, Select, Pipeline Register, and MUX.]

Page 62: Lecture 10b:   Implementing DSP Functionality: Alternatives

Traceback Architecture

• State-Machine blocks are just large sum-of-products combinational networks (351 gates each)
• Each memory unit contains a 16x64 SRAM and logic (192 MUX's, 128 flip-flops)

[Figure: Traceback Unit. Decision Bits enter the Traceback Memory Unit (SRAM, MUX, Pipeline Register, signal Next_ramin, 192-bit path) and leave as Out Decision Bits; a Finite State Machine controls the traceback. Traceback Memory Unit: 22% Area, 20% Power. Finite State Machine: 11% Area, 13% Power.]

Page 63: Lecture 10b:   Implementing DSP Functionality: Alternatives

Design Flow

• Synthesis & Module Generation
  • Design Compiler Synthesis script (from Mentor/Inventra)
  • SRAM Generator (from Norman Walker)
• Pre-Layout Verification & Analysis
  • VHDL gate-level sims (timing verification, switching activity annotation)
  • PowerMill Simulations (SRAM, core)
  • Design Compiler, Power Compiler (Static timing, power analysis)
• Floor Planning, Place & Route
  • Floor Planning (Preview)
  • Place & Route (Silicon Ensemble)
• Post-Layout Verification & Analysis
  • Interconnect Parasitic Extraction ("report simcap")
  • PowerMill simulations, PathMill static analysis
  • Design Compiler, Power Compiler (Static timing, power analysis with back-annotated interconnect parasitics)

Page 64: Lecture 10b:   Implementing DSP Functionality: Alternatives

Synthesis and SRAM Generation

• Synthesis with Synopsys Design Compiler
  • Constraint: 66 kHz clock (effectively infinite)
  • Bottom-up synthesis of 62 VHDL entities
• Low-Power SRAM generator (from Pleiades)
  • Very large sense-amps, control logic
  • Optimized for power, speed at low supply-voltages
  • Word-length limited to a power of 2

Page 65: Lecture 10b:   Implementing DSP Functionality: Alternatives

Simulation Models

Behavioral C
• Parameterized, bit-true, and fast
• Used for system level design and BER simulations

Behavioral VHDL
• Parameterized, bit-true, and cycle-true
• Used for structural simulations and test bench reference

RTL VHDL
• Synthesizable, crafted for specific parameters and implementation structure
• Used for synthesis quality

Page 66: Lecture 10b:   Implementing DSP Functionality: Alternatives

BER Simulation Results

Page 67: Lecture 10b:   Implementing DSP Functionality: Alternatives

SRAM

• Simulation Tools: TimeMill & PowerMill
• Parameters
  • 66 MHz clock
  • Voltage 2.5V
  • Random Generated Test Vectors
• Results
  • Power Analysis
  • Timing Analysis

Page 68: Lecture 10b:   Implementing DSP Functionality: Alternatives

SRAM: Power Numbers

SRAM used for ACS Unit: 8 words by 9 data bits

Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    663.73      1.659       24.885
Write Activity   563.21      1.408       21.120
Read/Write       612.29      1.530       22.950

Parasitic Extraction

Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    949.89      2.3747      35.6205
Write Activity   772.830     1.9320      28.980
Read/Write       851.42      2.1285      31.9275
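The three columns are related by the supply voltage and the cycle time. In this sketch the 2.5 V supply comes from the SRAM parameters slide, while the 15 ns cycle is inferred from the table itself (1.659 mW corresponds to 24.885 pJ), which matches a clock of about 66 MHz; both function names are illustrative.

```python
VDD = 2.5          # volts, from the SRAM simulation parameters
CYCLE_NS = 15.0    # inferred from the table; about 66.7 MHz


def power_mw(current_ua):
    """Average power from average current: uA * V = uW, then to mW."""
    return current_ua * VDD / 1000.0


def energy_pj(p_mw):
    """Energy per clock cycle: mW * ns = pJ."""
    return p_mw * CYCLE_NS


for label, current in [("read", 663.73), ("write", 563.21), ("read/write", 612.29)]:
    p = power_mw(current)
    print(label, round(p, 3), "mW", round(energy_pj(p), 2), "pJ")
```

The same two conversions reproduce the rows with parasitic extraction, for example 949.89 µA gives about 2.3747 mW.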

Page 69: Lecture 10b:   Implementing DSP Functionality: Alternatives

SRAM: Power Numbers

SRAM used for Traceback Unit: 16 words by 64 data bits

Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    2170.7      5.4267      81.4005
Write Activity   1893.4      4.7335      71.0025
Read/Write       2086.9      5.2172      78.2580

Parasitic Extraction?

Page 70: Lecture 10b:   Implementing DSP Functionality: Alternatives

SRAM: Timing Numbers

• Delays
  • Setup Time; Hold Time: time needed for data address to become stable

                 Setup (ns)   Hold (ns)   Data Resolution (ns)
ACS SRAM         ~1           ~2          ~1.8
Traceback SRAM   ~1           ~2          ~5

Page 71: Lecture 10b:   Implementing DSP Functionality: Alternatives

Place and Route

• Floor planning of the Viterbi SRAM macro cells and standard cells was done in Preview, and Silicon Ensemble was used for routing.
• Total SRAM macro cell area was 1.58 mm^2 (1.08 mm^2 with 9x8 SRAMs)
  • Area of the 16 9x8 bit SRAM macro cells: 0.052 mm^2 each, 62% larger than required, as 16x8 bit SRAMs were used (SRAM generator output had been verified for powers of 2)
  • Area of the 3 16x64 bit SRAM macro cells: 0.25 mm^2 each
• Area of the standard cells 1.02 mm^2 (0.35 mm^2 from DEF file)
• Final chip area was 4.0 mm^2 (original estimate 2.5 mm^2)
• Parasitics for timing simulation were extracted from the final routed nets in Silicon Ensemble.

Page 72: Lecture 10b:   Implementing DSP Functionality: Alternatives

Wiring Statistics

Six metal layers, layers 5 and 6 used for power and ground respectively

Ground and power spaced alternately 100 um apart horizontally and vertically.

There were about 6200 nets and 46,114 vias.

Total wire lengths:

metal layer 1: 3,293 um

metal layer 2: 458,440 um

metal layer 3: 510,517 um

metal layer 4: 218,023 um

metal layer 5: 96,882 um signal, and 38,400 um power

metal layer 6: 8,660 um signal, and 37,500 um ground

wire length: 685 mm horizontal, 611 mm vertical, total 1296 mm

Page 73: Lecture 10b:   Implementing DSP Functionality: Alternatives

Final Placement and Routing

Significant routing congestion at 16 by 64 bit SRAM outputs, due to Silicon Ensemble grid size of 1 um (observe white and light blue wires).

Minimum of 6 unroutable nets observed, even at 12 mm2 chip area.

Final size was 1.25 mm x 3.2 mm, 4 mm2, with 9 unroutable nets.

Violation reports in Silicon Ensemble did not identify which nets were unroutable, other than problems with ground and power connections.

Page 74: Lecture 10b:   Implementing DSP Functionality: Alternatives

Static Timing Checks

                    Delay Before      Delay After       Max Clock         Max Symbol
                    Annotation (ns)   Annotation (ns)   Frequency (MHz)   Rate (Msps)
Critical Path       8.7               17                60                3.8
Longest SRAM Path   8.5               14                -                 -

• All timing checks performed with Design Compiler's report_timing command
• Parasitic capacitances back-annotated with the set_load command
• No RC parasitics annotated
• No SRAM model was used for timing checks
• Critical Path was from ACS control logic, through a PM output MUX select signal (in an ACS unit), through the following ACS unit.
• Checks performed at 2.5V

Page 75: Lecture 10b:   Implementing DSP Functionality: Alternatives

Static Power Checks

Power                Before Annotation   After SAIF Annotation   After Parasitic Annotation
Cell Internal (mW)   28                  20                      20
Net Switching (mW)   15                  6.3                     8.7
Total Dynamic (mW)   43                  26                      29
Cell Leakage (nW)    750                 810                     810

• All power checks performed with Design Compiler's report_power command
• Switching activity was measured for every output port (transition counts over 16,000-cycle simulation)
• Back-annotation performed with SAIF files
• No SRAM model was used for power checks (added in manually)
• Checks performed at 2.5V w/ 60 MHz clock

Page 76: Lecture 10b:   Implementing DSP Functionality: Alternatives

Delay and Energy Scaling

Page 77: Lecture 10b:   Implementing DSP Functionality: Alternatives

Performance Results

For fixed throughput requirement 100ksps:

                            Supply        Clock Rate   Symbol Rate   Power
                            Voltage (V)   (MHz)        (Msps)        (mW)
Optimized for Performance   2.5           1.6          0.1           1.59
Optimized for Energy        0.8           1.6          0.1           0.16
Optimized for EDP           1.25          1.6          0.1           0.40

                            Supply        Clock Rate   Symbol Rate   Energy Delay    Power
                            Voltage (V)   (MHz)        (Msps)        Product (fJs)   (mW)
Optimized for Performance   2.5           60           3.75          4.24            59.6
Optimized for Energy        0.8           7.46         0.47          3.49            0.76
Optimized for EDP           1.25          25.12        1.57          2.53            6.24
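The columns of both tables are tied together by a few relations, sketched below. The 16 clock cycles per symbol is an assumption (64 states spread over four ACS units handling 16 states each) that matches the clock and symbol rates listed; the energy-delay product is taken as energy per symbol times the symbol period, and the fixed-throughput power as linear in clock rate. Names are illustrative.

```python
CYCLES_PER_SYMBOL = 16   # assumed: 64 states / 4 ACS units

# operating point: (supply V, max clock MHz, power mW at max clock)
points = {
    "performance": (2.5, 60.0, 59.6),
    "energy": (0.8, 7.46, 0.76),
    "EDP": (1.25, 25.12, 6.24),
}


def symbol_rate_msps(clock_mhz):
    return clock_mhz / CYCLES_PER_SYMBOL


def edp_fjs(power_mw, clock_mhz):
    """Energy per symbol times symbol period, in femtojoule-seconds."""
    rate = symbol_rate_msps(clock_mhz) * 1e6       # symbols per second
    energy = power_mw * 1e-3 / rate                # joules per symbol
    return energy / rate * 1e15


def power_at_clock(power_mw, clock_mhz, target_mhz=1.6):
    """Power after slowing the clock at the same supply voltage."""
    return power_mw * target_mhz / clock_mhz


for name, (vdd, clock, power) in points.items():
    print(name, vdd,
          round(symbol_rate_msps(clock), 2),
          round(edp_fjs(power, clock), 2),
          round(power_at_clock(power, clock), 2))
```

At the 100 ksps target the clock is only 1.6 MHz, so the lowest supply voltage wins on power by roughly a factor of ten over the 2.5 V point.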

Page 78: Lecture 10b:   Implementing DSP Functionality: Alternatives

Summary NORMALIZED (100kbs)

Implementation   Performance   Power   Norm    Gates   Area     Gates/     Power (uW)/   Effort
                 (kbs)         (mW)    Power           (mm^2)   Area       Gate          (days)
CP 1             100.00        40.68   294.4   50000   2.10     23809.52   0.814         6
ARM              100.00        36.86   266.8   50000   7.47     6695.68    0.737         4
CP 2             100.00        2.47    17.9    47098   6.69     7040.06    0.052         6
CP 3             100.00        2.02    14.7    26480   6.69     3958.15    0.076         6
DSP              100.00        1.97    14.3    47098   10.24    4599.41    0.042         3
ASIC             100.00        0.14    1.0     35100   4.00     8775.00    0.004         30

Page 79: Lecture 10b:   Implementing DSP Functionality: Alternatives

Summary MAX PERFORMANCE

Implementation   Performance   Power    Norm    Gates   Area     Gates/     Power (uW)/   Effort
                 (kbs)         (mW)     Power           (mm^2)   Area       Gate          (days)
Reference        N/A           100.00   N/A     N/A     N/A      N/A        N/A           N/A
ARM              116.48        42.94    0.8     50000   7.47     6695.68    0.86          4
CP 1             118.00        48.00    0.9     50000   2.10     23809.52   0.96          6
DSP              464.70        89.46    1.8     47098   10.24    4599.41    1.90          3
CP 2             793.00        191.00   3.8     47098   6.69     7040.06    4.06          6
CP 3             966.00        191.00   3.8     26480   6.69     3958.15    7.21          6
ASIC             3750.00       50.60    1.0     35100   4.00     8775.00    1.44          30