lec26-wrapup - university of california,...

56
EE141 EECS151/251A Spring 2019 Digital Design and Integrated Circuits Instructor: John Wawrzynek Lecture 26: Wrap-up

Upload: others

Post on 01-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

EECS151/251ASpring2019 DigitalDesignandIntegratedCircuitsInstructor:JohnWawrzynek

Lecture 26:Wrap-up

Page 2: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Outline❑ Important takeaways ❑ Exam Topics

2

Page 3: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 3

Why Study and Learn Digital Design?❑ We expect that many of our graduates will eventually be

employed as designers. ▪ Digital design is not a spectator sport. The only way to learn

it, and to appreciate the issues, is to do it. ▪ To a large extent, it comes with practice/experience (this course is

just the beginning). ▪ Another way to get better is to study other designs. Not time to

do much of this during the semester, but a good practice for later. ❑ However, a significant percentage of our graduates will not

be digital designers. What’s in it for them? ▪ Better manager of designers, marketers, field engineers, etc. ▪ Better researcher/scientist/designer in related areas

– Software engineers, fabrication process development, etc. ▪ To become a better user of electronic systems.

Page 4: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 4

In What Context Will You be Designing?

❑ Electronic design is a critical tool for most areas of pure science: ▪ Astrophysics – special electronics used for processing radio antenna

signals. ▪ Genomics – special processing architectures for DNA string matching. ▪ In general - sensor processing, control, and number crunching. ▪ Machine Learning now relies heavily on special hardware. ▪ In some fields, computation has replaced experimentation – particle

physics, world weather prediction (fluid dynamics). ❑ In computer engineering, prototypes often designed, implemented, and

studied to “prove out” an idea. Common within Universities and industrial research labs. Lessons learned and proven ideas often transferred to industry through licensing, technical communications, or startup companies. ▪ RISC processors were first proved out at Berkeley and IBM Research

Engineers learn so that they can build.

Scientists build so that they can learn.

Page 5: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 5

Designs in Industry❑ Of course, companies are the primary employer of

designers. Provide some useful products to society or government and make a profit for the shareholders.

❑ Interesting recent shift ▪ All software giants now

have hardware design teams (embedded and chips)

▪ Google, Amazon, Facebook, Microsoft, …

Page 6: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 6

Ten Big Ideas from EECS1511. Modularity and Hierarchy is an

important way to describe and think about digital systems.

2. Parallelism is a key property of hardware systems and distinguishes them from serial software execution.

3. Clocking and the use of state elements (latches, flip-flops, and memories) control the flow of data.

4. Cost/Performance/Power tradeoffs are possible at all levels of the system design.

5. Boolean Algebra and other logic representations.

6. Hardware Description Languages (HDLs) and Logic Synthesis are a central tool for digital design.

7. Datapath + Controller is a effective design pattern.

8. Finite State Machines abstraction gives us a way to model any digital system – used for designing controllers.

9. Arithmetic circuits are often based on “long-hand” arithmetic techniques.

10.FPGAs + ASICs give us a convenient and flexible implementation technology.

Page 7: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 7

What We Didn’t Cover❑ Design Verification and Testing

▪ Industrial designers spend more than half their time testing and verifying correctness of their designs.

– Some of this covered in the lab and a bit in lecture. Didn’t cover rigorous testing procedures.

▪ Most industrial products are designed from the start for testability. Important for design verification and later for manufacturing test.

▪ Related: Fault modeling and fault tolerant design.

❑ Other High-level Optimization Techniques ▪ High-level Synthesis - now starting to catch on

❑ Other High-level Architectures: GPUs, video processing, network routers, …

❑ Asynchronous Design

Page 8: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Most Closely Related Courses❑ CS152 Computer Architecture and Engineering

▪ Design and Analysis of Microprocessors ▪ Applies basic design concepts from EECS151

❑ EE241B Digital Integrated Circuits ▪ Transistor-level design of ICs ▪ More on Advanced ASIC Tool use

❑ CS250 VLSI Systems Design ▪ Advanced-undergrad/grad course ▪ Design tradeoffs at the chip design level

8

Page 9: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 9

Future Design Issues❑ Automatic High-level synthesis (HLS) and optimization (with

micro-architecture synthesis) and hardware/software co-design.

❑ Current trend is towards “system on a chip” (SOC) design methodology: ▪ Pre-designed subsystems (processor cores, bus controllers,

memory systems, network interfaces, etc. ) connected with standard on-chip interconnect or bus.

▪ Strong emphasis on “accelerators”. ❑ A number of alternatives to silicon VLSI have been

proposed, including techniques based on: ▪ Carbon nanotubes*, molecular electronics, quantum mechanics,

and biological processes. ▪ How will these change the way we design systems?

*In 2012, IBM produced a sub-10 nm carbon nanotube transistor that outperformed silicon on speed and power.[15] "The superior low-voltage performance of the sub-10 nm CNT transistor proves the viability of nanotubes for consideration in future aggressively scaled transistor technologies", according to the abstract of the paper in Nano Letters.[16]

Page 10: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Final Exam and Project Info❑ Exam held in scheduled final exam slot:

Friday May 17, 7-10PM, Hearst Gym 251 ❑ “Comprehensive” Final Exam ❑ Emphasis on second half

❑ Project interviews ❑ Project final reports

10

Page 11: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 11

The exam will take place Friday May 17, 7–10PM in Hearst Gym 251. The exam comprises a set of questions with 1 point per expected minute of completion with a total of approximately 90 points. 251A students will be asked to complete extra questions. All students are allowed one 2-sided 8.5 × 11 inch sheet of notes. No calculators, phones, or other electronic devices will be allowed. Slide-rules will be permitted.

Topics: The final exam will be comprehensive and test all topics covered this semester. However, emphasis will be placed on topics covered after the midterm exam—those listed below.

1. Sources of Power and Energy consumption in Digital ICs 2. Principles Behind Six Low-power Design Techniques 3. How to Improve Energy Efficiency through Parallelism and Pipelining

4. How to Design a RISC-V Single-Cycle Processor from the ISA 5. Processor Pipelining Hazards and Mechanisms 6. Principle behind and motivations for hardware acceleration 7. Line Drawing Accelerator Design Details 8. Memory Block Internal Architecture 9. SRAM Cell and Read/Write Operation

10. Memory Block Periphery Circuits 11. Memory Decoder Design 12. DRAM Cell and Read/Write Operation 13. Dual-port Memory Architecture 14. Effect of Clock Uncertainties on Maximum Clock Frequency 15. Source of Clock Uncertainties 16. Principle of Good Clock Distribution 17. IR and dI/dt effects in Power distribution

Page 12: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 12

18. Cascading Memory blocks for More Width, Depth, and Ports

19. FIFO Implementation 20. Memory Block Specification in Verilog 21. Serialization versus Parallelization in Iterative Computations

22. Principles of Pipelining and Restrictions of Loops

23. C-Slow Technique for Pipelining Loops 24. Carry Select Adder Design 25. Carry Lookahead and Parallel Prefix Adders

26. Bit-Serial Addition 27. Array Multiplier Design 28. Carry Save Addition 29. Signed Multiplication 30. Booth Encoding 31. Bit-Serial Multiplication 32. CSD Multiplier Design 33. Log and Barrel Shifters Design and Analysis

34. Use of Counters in Controller Design 35. Binary Counter Design and Optimization

36. Ring Counter Design 37. LFSR Implementation 38. List Processor Design and Optimizations 39. Modulo Scheduling 40. Types and Sources of Faults in ICs 41. Hamming Codes

Page 13: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Ring Counters❑ “one-hot” counters 0001, 0010, 0100, 1000, 0001, …

“Self-starting” version:

Page 14: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Building an LFSR from a Primitive Polynomial❑ For k-bit LFSR number the flip-flops with FF1 on the right. ❑ Find the primitive polynomial of the form xk + … + 1. ❑ The feedback path comes from the Q output of the leftmost FF, corresponding

to the xk term. ❑ The x0 = 1 term corresponds to connecting the feedback directly to the D input

of FF 1. ❑ Each term of the form xn corresponds to connecting an xor between FF n and

n+1. ❑ 4-bit example, uses x4 + x + 1

▪ x4 ⇔ FF4’s Q output ▪ x ⇔ xor between FF1 and FF2 ▪ 1 ⇔ FF1’s D input

❑ To build an 8-bit LFSR, use the primitive polynomial x8 + x4 + x3 + x2 + 1 and connect xors between FF2 and FF3, FF3 and FF4, and FF4 and FF5.

Page 15: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,
Page 16: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Six low-power design techniques

Power-down idle transistors

Parallelism and pipelining

Slow down non-critical paths

Clock gating

Thermal management

Data-dependent processing

Page 17: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Gate delay roughly linear

with Vdd

This magic trick brought to you by Cory Hall ...

3636

Active Power ReductionActive Power Reduction

Slow Fast Slow

Lo

w S

up

ply

Vo

lta

ge

Hig

h S

up

ply

Vo

lta

ge

Multiple Supply

Voltages

Logic BlockFreq = 1

Vdd = 1

Throughput = 1

Power = 1

Area = 1

Pwr Den = 1

Vdd

Logic Block

Freq = 0.5

Vdd = 0.5

Throughput = 1

Power = 0.25

Area = 2

Pwr Den = 0.125

Vdd/2

Logic Block

Replicated DesignsAnd so, we can transform this:

Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”.

P ~ F ⨯ Vdd2

P ~ 1 ⨯ 1 2

Into this: Top block processes “left”, bottom “right”.

3636

Active Power ReductionActive Power Reduction

Slow Fast Slow

Lo

w S

up

ply

Vo

lta

ge

Hig

h S

up

ply

Vo

lta

ge

Multiple Supply

Voltages

Logic BlockFreq = 1

Vdd = 1

Throughput = 1

Power = 1

Area = 1

Pwr Den = 1

Vdd

Logic Block

Freq = 0.5

Vdd = 0.5

Throughput = 1

Power = 0.25

Area = 2

Pwr Den = 0.125

Vdd/2

Logic Block

Replicated Designs

CV2 power only

P ~ #blks ⨯ F ⨯ Vdd 2

P ~ 2 ⨯ 1/2 ⨯ 1/4 = 1/4

Page 18: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Single-CycleRISC-VRV32IDatapath

�18

IMEM

ALU

Imm.Gen

+4DMEM

BranchComp.

Reg[]

AddrAAddrB

DataAAddrD

DataB

DataD

AddrDataW

DataR

10

0121

0pc 0

1

inst[11:7]inst[19:15]inst[24:20]

inst[31:7]

pc+4alu

mem

wbalupc+4

pc

imm[31:0]

Reg[rs2]

inst[31:0] ImmSelRegWEnBrUnBrEqBrLT ASelBSel ALUSel MemRW WBSelPCSel

wbReg[rs1]

Page 19: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Pipelined Processor

19

Page 20: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EnergyEfficiencyofCPUversusASICversusFPGA

!20

Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. SIGARCH Comput. Archit. News, 38:37–47, June 2010.

CPUASIC

500x

∴ FPGA : CPU = 70x

Ian Kuon and Jonathan Rose. Measuring the gap between fpgas and asics. In Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, FPGA ’06, pages 21–30, New York, NY, USA, 2006. ACM FPGA ASIC

7x

Similar story for performance efficiency

Page 21: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Spring 2019 EECS151 Page

Line Drawing Algorithm

"21

This version assumes: x0 < x1, y0 < y1, slope =< 45 degreesfunction line(x0, x1, y0, y1) int deltax := x1 - x0 int deltay := y1 - y0 int error := deltax / 2 int y := y0 for x from x0 to x1 plot(x,y) error := error - deltay if error < 0 then y := y + 1 error := error + deltax

Note: error starts at deltax/2 and gets decremented by deltay for each x. y gets incremented when error goes negative, therefore y gets incremented at a rate proportional to deltax/deltay.

0 1 2 3 4 5 6 7 8 9 10 11 121234567

Page 22: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Memory Architecture Overview

22

❑ Word lines used to select a row for reading or writing

❑ Bit lines carry data to/from periphery

❑ Core aspect ratio keep close to 1 to help balance delay on word line versus bit line

❑ Address bits are divided between the two decoders

❑ Row decoder used to select word line

❑ Column decoder used to select one or more columns for input/output of data

Page 23: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

SRAM read/write operations

23

Page 24: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Periphery

❑ Decoders ❑ Sense Amplifiers ❑ Input/Output Buffers ❑ Control / Timing Circuitry

24

Page 25: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Row Decoder • Expands L-K address lines into 2L-K word lines

25

❑ Example: decoder for 8Kx8 memory block ❑ core arranged as

256x256 cells ❑ Need 256 AND gates,

each driving one word line

Page 26: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Write: C S is charged or discharged by asserting WL and BL.Read: Charge redistribution takes places between bit line and storage capacitance

Voltage swing is small; typically around 250 mV.

1-Transistor DRAM Cell

26

VBL

CS << CBL

VBIT= 0 or (VDD – VT)

❑ To get sufficient Cs, special IC process is used ❑ Cell reading is destructive, therefore read operation always is followed by a

write-back ❑ Cell looses charge (leaks away in ms - highly temperature dependent),

therefore cells occasionally need to be “refreshed” - read/write cycle

Page 27: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 27

Dual-ported Memory Internals❑ Add decoder, another set of

read/write logic, bits lines, word lines:

deca decbcell

array

r/w logic

r/w logic

data portsaddress

ports

• Example cell: SRAM

• Repeat everything but cross-coupled inverters.

• This scheme extends up to a couple more ports, then need to add additional transistors.

b2 b2b1 b1

WL2

WL1

Page 28: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Clock Constraints in Edge-Triggered Systems

28

If launching edge is late and receiving edge is early, the data will not be too late if:

Minimum cycle time is determined by the maximum delays through the logic

tclk-q,max + tlogic,max + tsetup < TCLK – tJS,1 – tJS,2 + δ

tclk-q,max + tlogic,max + tsetup - δ + 2tJS < TCLK

Skew can be either positive or negative Jitter tJS usually expressed as peak-to-peak or n x RMS value

Page 29: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Clock UncertaintiesSources of clock uncertainty

29

Page 30: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

H-Tree

30

Equal wire length/number of buffers to get to every location

Page 31: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Power Supply Impedance❑ No voltage source is ideal - ||Z|| > 0 ❑ Two principal elements increase Z: ▪ Resistance of supply lines (IR drop) ▪ Inductance of supply lines (L⋅di/dt drop)

31

Page 32: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Cascading Memory-BlocksHow to make larger memory blocks out of smaller ones.

Increasing the depth. Example: given 1Kx8, want 2Kx8

32

Page 33: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

FIFO Implementation Details• Assume, dual-port memory with asynchronous read,

synchronous write. • Binary counter for each of read and write address.

CEs (count enable) controlled by WE and RE. • Equal comparator to see when pointers match. • State elements for FULL and EMPTY flags:

• Control logic (FSM) with truth-table (draft) shown to left.

WE RE equal* EMPTYi FULLi

0 0 0 0 0 0 0 1 EMPTYi-1 FULLi-1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 1 1 EMPTYi-1 FULLi-1

* Actually need 2 signals: “will be equal after read” and “will be equal after write”

Page 34: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 34

Verilog RAM Specification// // Single-Port RAM with Asynchronous Read // module ramBlock (clk, we, a, di, do); input clk; input we; // write enable input [19:0] a; // address input [7:0] di; // data in output [7:0] do; // data out reg [7:0] ram [1048575:0]; // 8x1Meg always @(posedge clk) begin // Synch write if (we) ram[a] <= di; assign do = ram[a]; // Asynch readendmodule

What do the synthesis tools do with this?

Page 35: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Fall 2019 EECS151/251A Page

Time-Multiplexing• Time multiplex single ALU for

all adds and multiplies: • Attempts to minimize cost at

the expense of time. – Need to add extra register,

muxes, control.

• If we adopt above approach, we can then consider the combinational hardware circuit diagram as an abstract computation-graph.

• This time-multiplexing “covers” the computation graph by performing the action of each node one at a time. (Sort of emulates it.)

Using other primitives, other coverings are possible.

!35

Page 36: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Fall 2019 EECS151/251A Page

Limits on Pipelining• Without FF overhead, throughput improvement α # of stages. • After many stages are added FF overhead begins to dominate:

• Other limiters to effective pipelining: – clock skew contributes to clock overhead – unequal stages – FFs dominate cost – clock distribution power consumption – feedback (dependencies between loop iterations)

FF “overhead” is the setup and clk to Q times.

!36

Page 37: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Fall 2019 EECS151/251A Page

Pipelining Loops with Feedback

However, we can overlap the “non-feedback” part of the iterations:

Add is associative and communitive. Therefore we can reorder the computation to shorten the delay of the feedback path:

yi = (yi-1 + xi) + a = (a + xi) + yi-1

add1 xi+a xi+1+a xi+2+a

add2 yi yi+1 yi+2

• Pipelining is limited to 2 stages.

“Loop carry dependency”

“Shorten” the feedback path.

!37

Page 38: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Fall 2019 EECS151/251A Page

“C-slow” Technique• Essentially this means we go

ahead and cut feedback path:

• This makes operations in adjacent pipeline stages independent and allows full cycle for each:

• C computations (in this case C=2) can use the pipeline simultaneously.

• Must be independent. • Input MUX interleaves input

streams. • Each stream runs at half the

pipeline frequency. • Pipeline achieves full

throughput.

add1 x+b x+b x+b x+b x+b x+b mult ay ay ay ay ay ay add2 y y y y y y

!38

Multithreaded Processors use this.

Page 39: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Carry Select Adder❑ Extending Carry-select to multiple blocks

❑ What is the optimal # of blocks and # of bits/block? ▪ If blocks too small delay dominated by total mux delay ▪ If blocks too large delay dominated by adder ripple delay

T α sqrt(N), Cost ≈2*ripple + muxes 39

Page 40: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Parallel Prefix Adder Example

G = g1 + g0 p1

P = p1p0

g1 p1g2 p2g3 p3

G = g2 + g1 p2

P = p2p1

G = g3 + g2 p3

P = p3p2

g0 p0

G = g2 + g1 p2 + g0p2p1

= c3G = g3 + g2 p3 +(g1 + g0p1)p3p2

= g3 + g2p3 + g1p3p2 + g0p3p2p1

= c4

c2

c1

si = ai ⊕ bi ⊕ ci = pi ⊕ ci 40

Page 41: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Bit-serial Adder

❑ Addition of 2 n-bit numbers: ▪ takes n clock cycles, ▪ uses 1 FF, 1 FA cell, plus registers ▪ the bit streams may come from or go to other circuits, therefore the

registers might not be needed.

• A, B, and R held in shift-registers. Shift right once per clock cycle.

• Reset is asserted by controller.

41

Page 42: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Combinational Multiplier (unsigned) X3 X2 X1 X0 * Y3 Y2 Y1 Y0 -------------------- X3Y0 X2Y0 X1Y0 X0Y0 + X3Y1 X2Y1 X1Y1 X0Y1 + X3Y2 X2Y2 X1Y2 X0Y2 + X3Y3 X2Y3 X1Y3 X0Y3 ----------------------------------------- Z7 Z6 Z5 Z4 Z3 Z2 Z1 Z0

HA

x3

FA

x2

FA

x1

FA

x2

FA

x1

HA

x0

FA

x1

HA

x0

HA

x0

FA

x3

FA

x2

FA

x3

x3 x2 x1 x0

z0

z1

z2

z3z4z5z6z7

y3

y2

y1

y0

Propagation delay ~2N

multiplicandmultiplier

Partial products, one for each bit in multiplier (each bit needs just one AND gate)

42

Page 43: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

2’s Complement Multiplication

FA

x3

FA

x2

FA

x1

FA

x2

FA

x1

HA

x0

FA

x1

HA

x0

HA

x0

FA

x3

FA

x2

FA

x3

HA

1

1

x3 x2 x1 x0

z0

z1

z2

z3z4z5z6z7

y3

y2

y1

y0

43

Page 44: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Page

Carry-Save Addition• Speeding up multiplication is a

matter of speeding up the summing of the partial products.

• “Carry-save” addition can help. • Carry-save addition passes

(saves) the carries to the output, rather than propagating them.

• Example: sum three numbers, 310 = 0011, 210 = 0010, 310 = 0011

310 0011 + 210 0010 c 0100 = 410 s 0001 = 110

310 0011 c 0010 = 210

s 0110 = 610

1000 = 810

carry-save add

carry-save add

carry-propagate add

• In general, carry-save addition takes in 3 numbers and produces 2. • Sometimes called a “3:2 compressor”: 3 input signals into 2 in a potentially lossy

operation • Whereas, carry-propagate takes 2 and produces 1. • With this technique, we can avoid carry propagation until final addition

!44

Page 45: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Page

Bit-serial Multiplier• Bit-serial multiplier (n2 cycles, one bit of result per n cycles):

• Control Algorithm:

repeat n cycles { // outer (i) loop repeat n cycles{ // inner (j) loop shiftA, selectSum, shiftHI } shiftB, shiftHI, shiftLOW, reset }

Note: The occurrence of a control signal x means x=1. The absence of x means x=0.

!45

Page 46: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Booth recoding

BK+1 0 0 0 0 1 1 1 1

BK 0 0 1 1 0 0 1 1

BK-1 0 1 0 1 0 1 0 1

action

add 0add A add A

add 2*A sub 2*A sub A sub A add 0

A “1” in this bit means the previous stage needed to add 4*A. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4*A in the previous stage is like adding A in this stage!

-2*A+A

-A+A

from previous bit paircurrent bit pair(On-the-fly canonical signed digit encoding!)

BK+1,K*A = 0*A → 0 = 1*A → A = 2*A → 4A – 2A = 3*A → 4A – A

46

Page 47: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Canonic Signed Digit Representation❑ CSD represents numbers using 1, 1, & 0 with the least

possible number of non-zero digits. ▪ Strings of 2 or more non-zero digits are replaced. ▪ Leads to a unique representation.

❑ To form CSD representation might take 2 passes: ▪ First pass: replace all occurrences of 2 or more 1’s: 01..10 by 10..10 ▪ Second pass: same as above, plus replace 0110 by 0010

and 0110 by 0010 ❑ Examples:

❑ Can we further simplify the multiplier circuits?

0010111 = 23 0011001 0101001 = 32 - 8 - 1011101 = 29

100101 = 32 - 4 + 1

0110110 = 54 1011010 1001010 = 64 - 8 - 2

47

Page 48: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Log Shifter / Rotator❑ Log(N) stages, each shifts (or not) by a power of 2 places, S=[s2;s1;s0]:

Shift by N/2

Shift by 2

Shift by 1

Page 49: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Barrel Shifter❑ Cost/delay?

▪ (don’t forget the decoder)

49

Page 50: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Controller using Counters• State Transition Diagram:

▪ Assume presence of two binary counters. An “i” counter for the outer loop and “j” counter for inner loop.

TC is asserted when the counter reaches it maximum count value. CE is “count enable”. The counter increments its value on the rising edge of the clock if CE is asserted.

Page 51: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

5. Optimization, Architecture #4

❑ Datapath:

❑ Incremental cost: – Addition of another register & mux, adder mux, and control.

❑ Performance: find max time of the four actions 1. XßMemory[NUMA], 0.5+1+10+1+0.5 = 13ns NUMAßNEXT+1; same for all ⇒ T>13ns, F<77MHz 2. NEXTßMemory[NEXT], SUMßSUM+X;

LD_NUMA

51

Page 52: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Spring 2019 EECS151 - Lec24 Page

nexti

Modulo Scheduling List Processor

• Finished schedule for 4 iterations:

nexti NEXTßMemory[NEXT]

numai NUMAßNEXT+1

xi XßMemory[NUMA]

sumi SUMßSUM+X

• Assuming a single adder and a single ported memory. Minimal schedule section length = 2. Because both memory and adder are used for 2

cycles during one iteration.

Memory next1 next2 x1 next3 x2 next4 x3 adder numa1 numa2 sum1 numa3 sum2 numa4 sum3

numai

memory

adder

nextinumai

memory

adderXi-1

nextinumai

memory

adderXi-1

sumi-2

wrap-around, decrease subscript

wrap-around, decrease subscript

!52

Page 53: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141

Synchronous Counters❑ How do we extend to n-bits? ❑ Extrapolate c+: d+ = d ⊕ abc, e+ = e ⊕ abcd

❑ Has difficulty scaling (AND gate inputs grow with n)

❑ CE is “count enable”, allows external control of counting, ❑ TC is “terminal count”, is asserted on highest value, allows

cascading, external sensing of occurrence of max value.

TC

Page 54: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Spring 2019 EECS150 - Lec25-faults Page

Types of Faults in Digital Designs

• Design Bugs (function, timing, power draw) – detected and corrected at design time through testing and

verification (simulation, static checks) • Manufacturing Defects (violation of design rules,

impurities in processing, statistical variations) – post production testing for sorting – spare on-chip resources for repair

• Runtime Failures (physical effects and environmental conditions) – assuming design is correct and no manufacturing defects

"54

Page 55: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

Spring 2019 EECS150 – Lec25-fault Page

Hamming Error Correcting Code• Use more parity bits to pinpoint bit(s)

in error, so they can be corrected. • Example: Single error correction

(SEC) on 4-bit data – use 3 parity bits, with 4-data bits

results in 7-bit code word – 3 parity bits sufficient to identify any

one of 7 code word bits – overlap the assignment of parity bits

so that a single error in the 7-bit word can be corrected

• Procedure: group parity bits so they correspond to subsets of the 7 bits: – p1 protects bits 1,3,5,7

– p2 protects bits 2,3,6,7

– p3 protects bits 4,5,6,7

1 2 3 4 5 6 7 p1 p2 d1 p3 d2 d3 d4

Bit position number 001 = 110

011 = 310

101 = 510

111 = 710

010 = 210

011 = 310

110 = 610

111 = 710

100 = 410

101 = 510

110 = 610

111 = 710

p1

p2

p3

Note: number bits from left to right.

!55

Page 56: lec26-wrapup - University of California, Berkeleyinst.eecs.berkeley.edu/~eecs151/sp19/files/lec26-wrapup.pdfRehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,

EE141 !56

The End.

❑ Special thanks to our GSIs: Chris and Arya.

❑ Good luck on the final.

❑ Thanks for a great semester!