SETIAWAN SOEKAMTOPUTRA

SUMMARY OF COURSE PROJECTS

MASTER OF ELECTRICAL AND COMPUTER ENGINEERING

ILLINOIS INSTITUTE OF TECHNOLOGY
DECEMBER 2010 GRADUATE

CONTENTS

• 32-bit Pipelined CPU
• MC68K-Based Monitor Program
• Pipelined MIPS Processor with hazard handler and data forwarding
• Simple Mesh-Like and Ring-Like Network on Chip Design
• Small office network design
• 4-bit 10T adder circuit with dual-Vt logic design
• Single-ended 6T vs. standard 6T SRAM bitcell design
• QR Matrix Factorization
• Electro Active Polymer Energy Harvesting Design
• Advanced Encryption Standard Hardware Design

SPRING 2009

• Introduction to VLSI Design
  • 32-bit Pipelined CPU
  • Multiplier with accumulator and pipeline optimization
• Microcomputer
  • MC68K-Based Monitor Program
• Advanced Computer Architecture
  • Pipelined MIPS Processor with hazard handler and data forwarding

32-BIT PIPELINED CPU

• Hardware Description Language
  • Verilog
• Tools
  • Compiler: Cadence Verilog-XL
  • Logic synthesis: Synopsys Design Compiler
  • Simulation: Cadence SimVision, Mentor Graphics ModelSim
  • Place and route: Cadence SoC Encounter
• Objectives
  • Execute the ASIC flow for this implementation using Verilog
  • Run RTL, post-synthesis, and post-place-and-route simulation for verification
  • Determine maximum frequency, area, delay, and power


• 32-bit Memory File
• Eight ALU functions: multiplication, add, subtraction, OR, AND, XOR, XNOR
• M: multiplicand, N: multiplier
• Multiplier:
  • Radix-2^r produces N/r partial products for an N-bit multiplier
  • Radix-4 Booth-encoded multiplier reduces the number of partial products (N/2 vs. N)
  • Wallace tree reduces the number of logic levels required to perform the summation
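The partial-product reduction can be illustrated with a small behavioral sketch of radix-4 Booth recoding. It is not the Verilog from the project; the function names and the 8-bit example width are assumptions chosen for illustration. An N-bit multiplier is recoded into N/2 digits in {-2, -1, 0, 1, 2}, so only N/2 partial products need to be summed.

```python
# Behavioral sketch of radix-4 Booth recoding (illustrative names, not the project's RTL).
# Each digit is -2*y[i+1] + y[i] + y[i-1], indexed here by the 3-bit window value.
BOOTH_TABLE = [0, 1, 1, 2, -2, -1, -1, 0]


def booth_radix4_digits(multiplier, n_bits):
    """Recode an n_bits-wide two's-complement multiplier into n_bits/2 Booth digits."""
    if n_bits % 2:
        raise ValueError("n_bits must be even for radix-4 recoding")
    y = (multiplier & ((1 << n_bits) - 1)) << 1  # append the implicit y[-1] = 0
    return [BOOTH_TABLE[(y >> i) & 0b111] for i in range(0, n_bits, 2)]


def booth_multiply(multiplicand, multiplier, n_bits):
    """Sum the N/2 partial products digit * multiplicand * 4**k."""
    digits = booth_radix4_digits(multiplier, n_bits)
    return sum(d * multiplicand * (4 ** k) for k, d in enumerate(digits))


print(booth_radix4_digits(5, 8))   # four digits for an 8-bit multiplier
print(booth_multiply(-10, 5, 8))
```

In hardware, each digit selects 0, ±M, or ±2M (a shift and an optional negate), and a Wallace tree would then compress the N/2 partial products.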



• Results
  • Maximum frequency: 40 MHz < f < 41 MHz


• Case studies:
  • Case 1: Replace the ALU multiplier with a multiplier-accumulator (MAC), useful for implementing DSP functions
  • Case 2: Pipeline optimization
• MAC benefit: reduces the number of instructions needed to compute the final result of sum-of-products functions.
• Pipeline optimization is applied by inserting registers on the critical path (in this case, the MAC unit).


• Case 1 results

• Case 2 results


• Case 2: decision on where to place the pipeline registers

• Provided:
  • Multiplier-accumulator block diagram
  • Simple CPU design written in Verilog
  • All required tools
• Implementation
  • Construct the aforementioned unit in Verilog and modify the design to fit the new unit
  • Insert a number of registers for pipelining
• Design functionality test
  • Verify in simulation that the function F = (-10)*5 + (-60)*2 + (-60)*8 outputs the correct result
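As a reference for the simulation check, the expected value of F can be computed with a minimal software model of a multiply-accumulate loop. This is only a sketch; the function name is illustrative and it does not model the pipeline or register widths of the actual design.

```python
def mac(pairs):
    """Accumulate a*b over (a, b) pairs, one multiply-accumulate step per pair."""
    acc = 0
    for a, b in pairs:
        acc += a * b  # a single MAC operation replaces a separate multiply and add
    return acc


# The test function from the slide: F = (-10)*5 + (-60)*2 + (-60)*8
F = mac([(-10, 5), (-60, 2), (-60, 8)])
print(F)  # -650
```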



• Results


• Additional analysis results
  • Finding the maximum frequency
    • Expected maximum frequency of the design: 58 MHz
  • Frequency vs. area vs. power consumption

MC68K-BASED MONITOR PROGRAM

• Instructor: Dr. Jafar Saniie
• Requirements/Specifications
  • Construct a simple monitor program for the MC68000 processor that lets the user perform common memory and register accesses and provides basic exception handlers.
• Language
  • 68000 assembly language
• Tools
  • EASy68K Editor/Assembler/Simulator

• Monitor program flowchart




• Monitor program system diagram


• Includes a command interpreter that checks and validates user input.
• Monitor debugger commands:
  • MEMD: Memory Display
  • MEMS: Memory Set
  • SORT: Memory Sort
  • FILL: Memory Fill
  • MOVE: Memory Move
  • MEMM: Memory Modify
  • FIND: Block Memory Search
  • REGM: Register Modify
  • REGD: Register Display
  • RUNS: Execute program at specified location
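The command interpreter's check-and-validate step can be sketched as a table lookup. The actual monitor is written in 68000 assembly; the Python below only illustrates the idea, and the argument syntax (whitespace-separated tokens such as `$1000`) is a hypothetical choice, not taken from the project.

```python
# Command names are from the slide; everything else here is an illustrative assumption.
COMMANDS = {
    "MEMD": "Memory Display", "MEMS": "Memory Set", "SORT": "Memory Sort",
    "FILL": "Memory Fill", "MOVE": "Memory Move", "MEMM": "Memory Modify",
    "FIND": "Block Memory Search", "REGM": "Register Modify",
    "REGD": "Register Display", "RUNS": "Execute program at specified location",
}


def parse_command(line):
    """Return (command, arguments) for a known command, or None for invalid input."""
    tokens = line.strip().split()
    if not tokens:
        return None
    name = tokens[0].upper()
    if name not in COMMANDS:
        return None
    return name, tokens[1:]


print(parse_command("memd $1000 $1010"))
print(parse_command("HELP"))
```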



• Monitor debugger exception-handling commands:
  • TBUS: Bus Error Exception
  • TADD: Address Error
  • TILL: Illegal Instruction Exception
  • TPRI: Privilege Violation
  • TDIV: Division by Zero

• Results (a selection of the 17 commands implemented)
  • Register display
  • Memory display
  • Command interpreter

HIGH-PERFORMANCE PIPELINED MIPS PROCESSOR

• MIPS (Microprocessor without Interlocked Pipeline Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA)

• Instructor: Prof. Jia Wang
• Requirements/Specifications
  • Design a MIPS processor with pipelining, data forwarding, and hazard handling capabilities.
  • Run RTL simulation to verify functionality
• Language
  • VHDL
• Tools
  • ModelSim PE 6.5
  • MARS 3.6 MIPS Simulator
• Provided:
  • Data memory unit design
  • Testbench code

• Data width: 32-bit
• 5-stage pipeline
  • Instruction Fetch
  • Instruction Decode
  • Execute
  • Memory Access
  • Write-Back
• Main Modules
  • Program Counter (PC)
  • Control Unit
  • ALU Control Unit
  • Register File
  • ALU
  • Instruction Memory
  • Data Memory
  • Hazard Detection Unit
  • Forwarding Unit
• Branch Hazard
  • Branch calculation occurs in the Instruction Decode stage
  • A branch miss costs only one stall cycle.
• Data Hazard
  • Stall if the data being written is needed by the next instruction
• Data Forwarding
  • Result data is used immediately rather than being written back to the register file first.
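The forwarding and stall decisions can be sketched in software. The conditions below follow the common textbook formulation for a 5-stage MIPS pipeline and are an assumption; the project's VHDL may differ in detail. Signal names such as `ex_mem_rd` are illustrative.

```python
def forward_select(id_ex_rs, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose the source of an ALU operand that reads register id_ex_rs."""
    # The most recent result (EX/MEM) takes priority; register 0 is never forwarded.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    return "REGFILE"


def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Stall one cycle when the next instruction needs a value still being loaded."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


# add $1,$2,$3 followed by sub $4,$1,$5: operand $1 comes from the EX/MEM register.
print(forward_select(1, True, 1, False, 0))
# lw $2,0($1) followed by and $4,$2,$5: forwarding alone is not enough, so stall.
print(load_use_stall(True, 2, 2, 5))
```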




• MIPS Architecture


• Test program (Running on MARS 3.6)



• Result



FALL 2009

• Hardware/Software Co-Design
  • Simple Mesh-Like Network on Chip Design
  • Simple Ring-Like Network on Chip Design
• Introduction to Computer Network
  • Design of a 2-story small office computer network

HARDWARE/SOFTWARE CO-DESIGN

• Projects:
  • Network on chip prototype design with three nodes
  • Simple Mesh-Like Network on Chip Design

NETWORK ON CHIP PROTOTYPE DESIGN WITH THREE NODES

• Instructor: Prof. Jia Wang
• Specifications
  • Three nodes in a partially connected mesh topology NoC architecture
  • Three processing elements and three routers
  • Queue system: FIFO
• Language
  • SystemC running on Visual C++
• Tools
  • Microsoft Visual C++

• Three-node NoC System Diagram

• Third node function (called PE_dumpbox)
  • It receives all packets that cannot be processed by the destination processing unit due to overloading in the network

• Results
  • Overload in the Router 1 network buffer at cycle 3
  • The third processing unit, PE_dumpbox, receives the packet

MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN

• Specifications
  • A simple mesh-like NoC architecture
  • Each router has one processing element (PE)
  • Queue system: FIFO
  • Size: 4-by-4, matrix-like layout
• Language
  • SystemC
• Tools
  • Microsoft Visual C++


• Simple NoC Architecture

• Results
  • Generated packets
  • The results show that the packets are delivered

• Results
  • Delays occur because only one packet is delivered to a processing element (PE) at a time

• Benefit and drawback:
  • Benefit: packets arrive at the destination address in fewer hops, reducing contention and increasing the average bit rate.
  • Drawback: the design becomes more complex and more wires are needed.

INTRODUCTION TO COMPUTER NETWORK

• Project:
  • Design a prototype of a 2-story small office computer network capable of serving 20 users, with three department LANs, four servers, and wireless Internet
• Language
  • N/A
• Tools
  • Microsoft Visio

SMALL OFFICE NETWORK DESIGN

• Proposed configurations
  • IP address allocation
  • Design topology
• Office layout (2nd floor and 1st floor)
  • Colored arrows show how cables are managed

SPRING 2010

• Advanced VLSI
  • 4-bit 10T adder circuit with dual-Vt logic design
• High Performance VLSI IC System
  • Single-ended 6T vs. standard 6T SRAM bitcell design comparison
• QR Factorization
  • Implementing the QR factorization algorithm in C

4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN

• Project:
  • 4-bit 10T adder circuit with dual-Vt logic design
• Specifications
  • Adder circuit is based on:
    J. Lin, M. Sheu, and C. Ho. A Novel High-Speed and Energy Efficient 10-Transistor Full Adder Design. IEEE Trans. on Circuits and Systems, May 2007.
  • Adder: cascaded ripple-carry adders
  • Technology node: 45 nm (FreePDK)
  • Voltage: 1.1 V @ 25 MHz
  • Performance measurements (delay and power consumption) for the 10T adder circuit using high-threshold-voltage (high-Vt), low-Vt, and dual-Vt transistors
• Tools
  • Cadence Virtuoso Schematic Design
  • Synopsys HSPICE Simulator
  • Nanosim Simulator


• High Vt vs. low Vt

• Full Adder Design (1-bit)
  • Complementary and level restoring carry logic (CLRCL)


• Full Adder Design (1-bit) Critical Path
  • Dual-Vt: low-Vt is applied to the transistors on the critical path for speed, and high-Vt to the others for low leakage
  • The NMOS in the multiplexer and the PMOS in the inverter are low-Vt transistors


• Logic Equation
  Sum = (A XNOR B)·Cin + (A XOR B)·Cin_bar
  Cout = (A XOR B)·Cin + (A XNOR B)·A
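The two logic equations can be checked against ordinary binary addition with a short behavioral sketch. It models only the Boolean function, not the transistor-level 10T circuit; the function names and the 4-bit ripple-carry wrapper are illustrative.

```python
def full_adder(a, b, cin):
    """1-bit full adder using the Sum and Cout equations from the slide."""
    xor = a ^ b
    xnor = xor ^ 1
    s = (xnor & cin) | (xor & (cin ^ 1))   # Sum = (A XNOR B).Cin + (A XOR B).Cin_bar
    cout = (xor & cin) | (xnor & a)        # Cout = (A XOR B).Cin + (A XNOR B).A
    return s, cout


def ripple_add4(x, y, cin=0):
    """Cascade four 1-bit adders; returns (4-bit sum, carry out)."""
    total, carry = 0, cin
    for i in range(4):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


# Exhaustive check of the 1-bit truth table against integer addition.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(a, b, c)
            assert s + 2 * cout == a + b + c

print(ripple_add4(9, 7))  # 9 + 7 = 16: sum bits 0000 with carry out 1
```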

• Design Components
  • Inverter (left) and multiplexer (right)


• 1-bit Full Adder (consisting of multiplexers and inverters) and its symbol

• 4-bit Full Adder


• Methodology
  • Use combinations of input vectors to measure delay and power consumption
  • Delay: switching delay between the least significant bit (bit 0) and the most significant bit (bit 3)
  • Power: average and maximum power during simulation
• Results
  • Delay (in seconds)
    [Chart: high-to-low and low-to-high delay for the High-VT, Low-VT, and Dual-VT designs; axis from 0.00E+00 to 4.00E-10 s]


• Results
  • Power consumption (in watts)
    [Chart: average power (avgpwr), axis from 0.00E+00 to 6.00E-05 W; second panel, axis from 0.00E+00 to 5.00E-04 W]

• Results


• Issue
  • Voltage degradation occurs, particularly with high-Vt transistors or at high frequency (> 125 MHz), because pass transistors deliver a weak 1 (NMOS) or a weak 0 (PMOS).

SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN

• Specifications
  • Design from: J. Singh, et al. Single Ended 6T SRAM with Isolated Read-Port for Low-Power Embedded Systems. IEEE. 2009
  • Technology node: 45 nm
  • Use: high-Vt MOSFETs
• Tools
  • Cadence Virtuoso Schematic Design
  • Synopsys HSPICE Simulator


• Background
  • SRAM consumes the majority of the die area
  • Dynamic power: consumed by read and write activity
  • Static power: consumed while retaining the stored logic value
• Benefits/Drawbacks of Single-Ended SRAM
  • Faster reading of logic ‘1’
  • One bit line (no complementary bit-bar line): wire reduction
  • More delay in writing ‘1’ due to the weak-1 behavior of the NMOS pass transistor (but around 85% of writes are zero writes)
  • Role of the isolated read port: prevents the bitcell content from being exposed during reads
  • Considerably lower power dissipation and better read SNM


• Standard 6T SRAM
  • Read: precharge BL and BL*, then set WordLine = 1
  • Write: drive the new value onto BL and BL*, then set WordLine = 1
• Transistor sizing:
  • Access transistor: medium
  • Pull-up transistor: weak
  • Pull-down transistor: strong


• Standard SRAM Design (using Cadence Virtuoso)



• Single-Ended SRAM Design



• Comparison Results
  • Write Delay (0 to 0.5Vdd or 1 to 0.5Vdd)

“…around 85% of the instruction write bits are ‘0,’ and over 90% of the data write bits are ‘0.’…” (quoted from [3])

[3] Y. Chang, F. Lai, C. Yang. Zero-Aware Asymmetric SRAM Cell for Reducing Cache Power in Writing Zero. IEEE Trans. on VLSI Systems, Vol. 12, No. 8, August 2004.


• Comparison Results
  • Power Consumption Comparison

• Noise Margin

QR MATRIX FACTORIZATION

• Purpose:
  • Implement the QR factorization algorithm in C
• Specifications
  • Written in C under Red Hat Linux
• QR Factorization
  • A matrix decomposition method for solving linear problems or equations without inverting the left-hand-side matrix.
  • Applicable to: m-by-n matrix A
  • Decomposition: A = QR, where Q is an orthogonal matrix of size m-by-m and R is an upper triangular matrix
  • The QR decomposition provides an alternative way of solving the system of equations Ax = b without inverting the matrix A. Because Q is orthogonal, Q^T Q = I, so Ax = b is equivalent to Rx = Q^T b, which is easier to solve since R is triangular.
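The solve path Rx = Q^T b can be sketched as follows. The slides do not say which QR algorithm the C program uses, so modified Gram-Schmidt is an assumption here, as are the function names; this variant produces the reduced factorization (Q is m-by-n), which coincides with the m-by-m form for a square full-rank A.

```python
import math


def qr_gram_schmidt(A):
    """Reduced QR of a full-column-rank m-by-n matrix via modified Gram-Schmidt."""
    m, n = len(A), len(A[0])
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [A[i][j] for i in range(m)]          # j-th column of A
        for k in range(j):
            R[k][j] = sum(q_cols[k][i] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * q_cols[k][i] for i in range(m)]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / R[j][j] for x in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return Q, R


def solve_qr(A, b):
    """Solve Ax = b as Rx = Q^T b using back substitution (no matrix inverse)."""
    Q, R = qr_gram_schmidt(A)
    m, n = len(A), len(A[0])
    y = [sum(Q[i][j] * b[i] for i in range(m)) for j in range(n)]
    x = [0.0] * n
    for j in range(n - 1, -1, -1):
        x[j] = (y[j] - sum(R[j][k] * x[k] for k in range(j + 1, n))) / R[j][j]
    return x


# Example system: 2x + y = 3, x + 3y = 5
x = solve_qr([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
print(x)
```

Because R is upper triangular, the last unknown is found first and each earlier unknown needs only the values already computed.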



• Algorithm



• Result


FALL 2010

• Electro Active Polymer Energy Harvesting
• Advanced Encryption Standard

ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN

• EAP circuitry converts mechanical energy to electrical energy when it is stretched under a bias voltage.
• EAP material: VHB 4905 tape and carbon grease

• Previous prototype:
  • Charge management IC: TI’s bq2000
  • Li-ion battery: 3 V, 45 mAh
  • Application: TI’s eZ430-F2013
  • Boost converter to supply the biasing voltage (5 V to 1.5 kV):
    • EMCO Q15N-5
• Drawbacks
  • High energy consumption
  • EAP output power is too small to even turn on the battery charging circuit (which needs 20.6 mA)
• Solutions
  • EAP material efficiency
    • Higher capacitance
  • Battery and circuit that can store small amounts of energy without requiring much energy to operate
  • Apply a low biasing voltage to eliminate the boost converter

• Simulation model using Simulink
  • Circuit model parameters:
    • EAP model parameters, input voltage (battery), and output capacitor Co
  • EAP model parameters:
    • Cidle, Cforced, and force frequency f (how often the EAP is stretched)
  • An absolute-value function creates an always-positive sine waveform from the original sine wave

• Simulation result:


• Prototype:
  • Battery charging: Cymbet CBC5300
  • Battery: 2xCBC050 (3x50uAh) at 3.5 V output
  • Capability to harvest 1.05 V
  • PCB layout tool: Altium Designer
  • Application: MSP430-F2274 with CC2500 2.4 GHz RF transceiver

• Input power
  • Tested using a voltage generator at 1.042 V
  • Current drawn was 529 µA
• Output power
  • Power for the RF transceiver and MSP430
  • Power for an additional load (tested using a 330 kΩ resistor)

• Power for the RF transceiver and MSP430
  • Depends on how often the device transmits data
    • Set to 5 seconds
  • Based on TI's SLAA378C documentation, for a 5-second period between transmissions, the expected average current consumption is 8.4 µA.
  • Voltage is approx. 3.2 V
• Power for the load
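The measured figures translate into power with P = V × I. The short calculation below uses only the values quoted on these slides (1.042 V and 529 µA at the input, about 3.2 V and 8.4 µA for the RF transceiver and MSP430); the variable names are illustrative, and the load power is left out because the slides do not give the voltage across the 330 kΩ resistor.

```python
# Values quoted on the slides; variable names are illustrative.
V_IN = 1.042        # volts, voltage generator setting
I_IN = 529e-6       # amps, current drawn at the input
V_RF = 3.2          # volts, approximate supply for RF transceiver and MSP430
I_RF_AVG = 8.4e-6   # amps, expected average current for a 5-second transmit period

p_in = V_IN * I_IN        # input power in watts
p_rf = V_RF * I_RF_AVG    # average RF + MSP430 power in watts

print(f"Input power:     {p_in * 1e6:.1f} uW")
print(f"RF/MSP430 power: {p_rf * 1e6:.2f} uW")
```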



• Efficiency
  • Pstore is the power stored in the battery.
  • η =
  • Note that Pstore is roughly averaged from the battery charging profile.
  • Also note that the battery still held some charge during the experiment.
• Battery charging profile for CBC050

ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN

• Variant AES with 512-bit and 1024-bit keys
• Area and power consumption comparison with 128-bit and 256-bit AES keys
• CMOS technology: 45 nm
• Operating voltage: 1.1 V @ 100 MHz
• Verilog language
• Tools:
  • Synthesis: Synopsys Design Compiler
  • Simulation: ModelSim
• Find the relationship between key size and implemented hardware area and power consumption.

• Longer key size:
  • More secure
  • More iteration rounds (1)
  • Increased power and area
• Rijndael Algorithm (flowchart)
  • Inputs: Plaintext and Cipher Key
  • Key Expansion: produces RoundKey[0] and RoundKey[i] from the Cipher Key
  • Initial Round: AddRoundKey
  • Normal Round: SubBytes, ShiftRows, MixColumns, AddRoundKey; then i = i + 1, repeated while i < number of rounds
  • Final Round: SubBytes, ShiftRows, AddRoundKey
  • Output: Ciphered Text

• Block View of AES Operation
  • Plaintext (bytes 0 to 15) XOR first round key (bytes 0 to 15) gives the state block, filled column by column:
      0   4   8   12
      1   5   9   13
      2   6   10  14
      3   7   11  15
  • SubBytes replaces each byte with its S-box value:
      S0  S4  S8  S12
      S1  S5  S9  S13
      S2  S6  S10 S14
      S3  S7  S11 S15
  • State block after ShiftRows:
      S0  S4  S8  S12
      S5  S9  S13 S1
      S10 S14 S2  S6
      S15 S3  S7  S11
  • MixColumns a(x): XOR-based, applied per column
  • State block after MixColumns, XORed with the next round key:
      M0  M4  M8  M12        K0  K4  K8  K12
      M5  M9  M13 M1    XOR  K1  K5  K9  K13
      M10 M14 M2  M6         K2  K6  K10 K14
      M15 M3  M7  M11        K3  K7  K11 K15
  • The result is ready for the next round
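The state layout and the ShiftRows and AddRoundKey steps shown in the block view can be sketched in software. This is a behavioral illustration only, not the project's Verilog; the function names are assumptions, and SubBytes and MixColumns are omitted to keep the sketch short.

```python
def bytes_to_state(block):
    """Arrange 16 bytes into a 4x4 state, filled column by column (state[row][col])."""
    return [[block[4 * c + r] for c in range(4)] for r in range(4)]


def add_round_key(state, key_state):
    """XOR every state byte with the matching round-key byte."""
    return [[s ^ k for s, k in zip(s_row, k_row)] for s_row, k_row in zip(state, key_state)]


def shift_rows(state):
    """Rotate row r left by r positions (row 0 is unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Using byte indices 0..15 as data reproduces the index pattern from the slide.
state = bytes_to_state(list(range(16)))
for row in shift_rows(state):
    print(row)
```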


• Block Diagram
  [Diagram labels: Cipher_key, Key Expansion Module, Plain_text, Mux, AddRoundKey, SubBytes and ShiftRows, MixColumns, Mux, AddRoundKey, Mux, Initial value (zero), Ciphered_text]

Results

[Chart: dynamic, static, and total power in mW for AES128, AES256, AES512, and AES1024, with a linear fit of total power: f(x) = 0.85245812 x + 2.73899385, R² = 0.985616267025268]

           Dynamic power (mW)   Static power (mW)   Total power (mW)
AES128     3.3574               0.2971603           3.6545603
AES256     3.9442               0.3341722           4.2783722
AES512     5.0289               0.409219            5.438119
AES1024    5.6042               0.5053051           6.1095051
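The linear trend line reported for total power can be reproduced with an ordinary least-squares fit. The sketch below assumes the x values are the category indices 1 to 4 (AES128 through AES1024), which is how spreadsheet trend lines typically treat category axes; the function name is illustrative.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope*x + intercept, plus R squared."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Total power in mW from the table; x = 1..4 is an assumed category index.
total_power = [3.6545603, 4.2783722, 5.438119, 6.1095051]
slope, intercept, r2 = linear_fit([1, 2, 3, 4], total_power)
print(slope, intercept, r2)
```

With these inputs the slope and intercept agree with the trend line quoted on the chart, which suggests the fit was taken over the four key sizes in order rather than over the key length in bits.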


Results: Area

[Chart: implemented hardware area for each AES key size; axis from 52500 to 97500]
