2. BACKGROUND AND RELATED WORK

This chapter provides the necessary background information that is

useful to understand the main contributions of the thesis. The following

section presents a brief discussion on techniques for designing for low

power consumption. Section 2.2 describes the various attributes of

Instruction Set Architecture (ISA). Section 2.3 discusses processor

performance and high performance architectural features. In Section 2.4, an

overview of different types of embedded processors is presented.

Section 2.5 deals with the architectural aspects of the embedded systems.

Section 2.6 presents an overview of the emergence of different RISC

processors including MIPS. Section 2.7 explains how MIPS32 instructions

waste bits. In Section 2.8, various techniques for embedded code

size reduction are reviewed. Finally, the need for a new and dedicated ISA

for Embedded SoCs is elaborated in Section 2.9.

2.1 DESIGN FOR LOW POWER CONSUMPTION

Power dissipation and energy efficiency are primary design

constraints for both simple and complex processors. As a result of the

growing market for battery-powered portable embedded systems, the drive

for minimum power consumption has become equally important as the

drive for increased performance. Power consumption in processors

consists of a static component, called leakage power, and a dynamic

component, called switching power. The total power consumption of a CMOS

circuit comprises three components [21]:

1. Switching power: This is the power dissipated by charging and

discharging the gate output capacitance, CL, and represents

the useful work performed by the gate. The energy per output

transition is given by the following equation, where Vdd is the power

supply voltage:

$$E_t = \frac{1}{2}\, C_L\, V_{dd}^{2} = 1\ \text{picojoule} \qquad (2.1)$$

2. Short-circuit power: When the gate inputs are at an intermediate

level, both the p- and n-type networks can conduct. This results

in a transitory conducting path from Vdd to Vss. In a careful design

that avoids slow signal transitions, the short-circuit power is

usually a small fraction of the switching power.

3. Leakage current: The transistor networks do conduct a very

small current when they are in their 'off' state. Though it is

generally negligible in an active circuit, it can drain a supply

battery over a long period of time.

In a well designed active circuit, the switching power dominates, with

the short-circuit power forming 10% to 20% of the total power, and the

leakage current being significant only when the circuit is inactive.

Therefore, the total power dissipation, Pc, of a CMOS circuit, neglecting the

short-circuit and leakage components, is given by summing the dissipation

of every gate g in the circuit C:

$$P_C = \frac{1}{2}\, f\, V_{dd}^{2} \sum_{g \in C} A_g\, C_L^{g} \qquad (2.2)$$

where f is the clock frequency, Ag is the gate active factor (reflecting the fact

that not all gates switch every cycle) and $C_L^{g}$ is the gate load capacitance.
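Equations (2.1) and (2.2) can be evaluated numerically. The following is a minimal sketch, not part of the original work: the `Gate` class, the function names and all component values (1000 gates, 10 fF loads, 20% activity, 100 MHz) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Gate:
    activity: float         # A_g: fraction of clock cycles in which the gate switches
    load_cap_farads: float  # C_L^g: load capacitance driven by the gate


def switching_energy(load_cap_farads: float, vdd: float) -> float:
    """Energy per output transition, Equation (2.1): E_t = 1/2 * C_L * Vdd^2."""
    return 0.5 * load_cap_farads * vdd ** 2


def dynamic_power(gates: Iterable[Gate], freq_hz: float, vdd: float) -> float:
    """Switching power, Equation (2.2): P_C = 1/2 * f * Vdd^2 * sum(A_g * C_L^g)."""
    return 0.5 * freq_hz * vdd ** 2 * sum(g.activity * g.load_cap_farads for g in gates)


# Hypothetical circuit: 1000 identical gates, 10 fF load each, 20% activity.
gates = [Gate(activity=0.2, load_cap_farads=10e-15) for _ in range(1000)]
p_high = dynamic_power(gates, freq_hz=100e6, vdd=3.3)
p_low = dynamic_power(gates, freq_hz=100e6, vdd=1.65)
print(f"P at 3.3 V:  {p_high * 1e3:.4f} mW")
print(f"P at 1.65 V: {p_low * 1e3:.4f} mW")
```

Halving Vdd in this sketch cuts the computed switching power to one quarter, which is the quadratic dependence on supply voltage expressed by Equation (2.2).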

The typical gate load capacitance is a function of the process

technology and therefore not under the control of the designer. The

remaining parameters in the equation suggest the following approaches to

low-power design:

1. Minimize the power supply voltage, Vdd.

2. Minimize the circuit activity, A. Techniques such as clock gating

fall under this heading.

3. Minimize the number of gates. Simpler circuits use less power

than complex ones, all other things being equal.

4. Minimize the clock frequency, f. Although a lower clock rate

reduces the power consumption, it also reduces performance,

having a neutral effect on power-efficiency. If, however, a

reduced clock frequency allows operation at a reduced Vdd, this

will be highly beneficial to the power-efficiency.

5. Exploit parallelism. Duplicating a circuit allows the two circuits

to sustain the same performance at half the clock frequency of

the original circuit, which allows the required performance to be

delivered with a lower power supply voltage.
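Approach 5 can be checked against Equation (2.2). The following worked sketch uses the shorthand $C_{\mathrm{eff}}$ for the activity-weighted capacitance sum and an illustrative voltage ratio of 0.7; both are assumptions introduced here, and the overhead of distributing work to the two circuits and merging their results is neglected.

```latex
% Original circuit, with C_eff = \sum_{g \in C} A_g C_L^g:
P_C = \tfrac{1}{2}\, f\, V_{dd}^{2}\, C_{\mathrm{eff}}

% Duplicated circuit: twice the capacitance, half the clock frequency,
% and a reduced supply voltage V'_{dd}:
P_{\mathrm{par}}
  = \tfrac{1}{2} \cdot \frac{f}{2} \cdot V_{dd}'^{\,2} \cdot 2\, C_{\mathrm{eff}}
  = \tfrac{1}{2}\, f\, V_{dd}'^{\,2}\, C_{\mathrm{eff}}
  = \left( \frac{V_{dd}'}{V_{dd}} \right)^{2} P_C

% Illustrative value (assumed): V'_{dd} = 0.7\, V_{dd}
P_{\mathrm{par}} = 0.49\, P_C
```

At an unchanged supply voltage the duplicated circuit dissipates the same power as the original, so the saving comes entirely from the lower voltage that the halved clock frequency permits.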

Although static leakage power has historically been small compared

to dynamic switching power, the situation is changing as the feature sizes

decrease. The feature size of a chip process technology refers to the

smallest size of transistors, wires, or gaps between them that can be

created on the chip die with that process technology. As these sizes

decrease, the capacitance of the system of transistors, $C_L^{g}$, is lowered. This

reduced capacitance decreases the switching time of these transistors (or

gate delay), resulting in faster logic performance accommodating faster

processor clock frequencies. The gate activity factor approximates the

average switching activity of the circuit for each clock edge. The supply

voltage, Vdd, is lowered to reduce interference with the ever-closer

neighbouring components and to meet thermal requirements. Lowering Vdd

greatly reduces dynamic power consumption since the dynamic power is


proportional to the square of this supply voltage. However, lowering the

supply voltage in turn often requires a lowering of the threshold voltage, the

voltage level at which transistors switch, to maintain fast clock rates.

Lowering the threshold voltage and moving the threshold closer to ground

causes a disproportionate increase in the static leakage current and thus

an increase in static power consumption [22].

For a fixed task, decreasing the clock rate reduces the power, but

not the energy. The energy to execute a workload is equal to the average

power multiplied by the execution time for the workload. For BOPES

devices, battery life is more important than actual power consumption.

Hence energy is the proper metric.
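The distinction between power and energy can be illustrated with a short calculation. This is a sketch under stated assumptions: the function `run_task`, the lumped capacitance `c_eff_farads` (standing for the activity-weighted sum in Equation (2.2)) and all numeric values are hypothetical, and leakage is ignored.

```python
import math


def run_task(cycles, freq_hz, vdd, c_eff_farads):
    """Return (average power in W, execution time in s, energy in J) for a fixed task.

    Power follows Equation (2.2) with the capacitance sum lumped into c_eff_farads;
    energy is average power multiplied by execution time.
    """
    power = 0.5 * freq_hz * vdd ** 2 * c_eff_farads
    exec_time = cycles / freq_hz
    return power, exec_time, power * exec_time


# Hypothetical workload of one million clock cycles.
fast = run_task(1_000_000, freq_hz=100e6, vdd=3.3, c_eff_farads=2e-12)
slow = run_task(1_000_000, freq_hz=50e6, vdd=3.3, c_eff_farads=2e-12)
slow_low_v = run_task(1_000_000, freq_hz=50e6, vdd=2.5, c_eff_farads=2e-12)

for name, (p, t, e) in [("fast", fast), ("slow", slow), ("slow, low Vdd", slow_low_v)]:
    print(f"{name:14s} power={p * 1e3:.4f} mW  time={t * 1e3:.1f} ms  energy={e * 1e6:.3f} uJ")
```

In this model, halving the clock halves the power but doubles the execution time, leaving the energy unchanged; the energy falls only when the lower clock rate is combined with a lower supply voltage.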

2.2 INSTRUCTION SET ARCHITECTURE (ISA)

The features that are built into an architecture's instruction set are

commonly referred to as the Instruction Set Architecture or ISA. The ISA

defines such features as the operations that can be used by the

programmers to create programs under that architecture, the operands

(data) that can be accepted and processed by the architecture, the storage, the

addressing modes used to gain access to and process operands, and

handling of interrupts. These features are important because an ISA

implementation is a determining factor in defining important characteristics

of an embedded design, such as performance, design time, available

functionality, and cost. In the embedded domain, it used to be true that

minimizing gates was the most important consideration of an ISA design

[7]. This is what led to many of the idiosyncrasies of early DSP designs.

Advances in VLSI technologies have changed this, and most of the

embedded world can now afford enough complexity to allow much more

regular and orthogonal instruction sets.


2.2.1 Instruction Types and Operations

The following information is provided either directly or indirectly by

an instruction [9]:

1. Operation code (opcode): Nature of operation done by the

instruction

2. Data: Type of data - binary, decimal, character etc.

3. Operand location: Memory, register etc.

4. Operand addressing: Method of specifying the operand location

(address)

5. Instruction length: Size - one byte, two bytes etc.

6. Number of address fields: zero address, single address, two

address etc.

Two computers of different architectures do not have the same

instruction set. Almost every architecture provides certain unique

instructions that ease the burden of compiler/programmer or the hardware

design. Based on the operations performed by the instructions, it is

common to classify the instructions into following types:

1. Data transfer instructions: These move data from one

register/memory location to another.

2. Arithmetic instructions: These perform arithmetical operations.

3. Logical instructions: These perform Boolean logical operations.

4. Control transfer instructions: These modify program execution

sequence.

5. Input/output (I/O) instructions: These transfer information

between external peripherals and system nucleus

(CPU/memory)

6. String manipulation instructions: These manipulate strings of

byte, word, double word etc.


7. Translate instructions: These convert the data from one

format to another.

8. Processor control instructions: These control the processor

operation.

Table 2.1 lists sample instructions for each of the above eight types

and corresponding actions done by the processor for these instructions.

Table 2.1: Sample instructions and processor actions

Instruction type: Data transfer

Instruction   Action by processor
MOVE          Transfer data from source location to destination location
LOAD          Transfer data from a memory location to a CPU register
STORE         Transfer data from a CPU register to a memory location
PUSH          Transfer data from the source to stack (top)
POP           Transfer data from stack (top) to the destination
XCHG          Exchange; swap the contents of the source and destination
CLEAR         Reset the destination with all 0's
SET           Set the destination with all 1's

Instruction type: Arithmetic

Instruction   Action by processor
ADD           Add; calculate sum of two operands
ADC           Add with carry; calculate the sum of operands and the 'carry' bit
SUB           Subtract; calculate the difference of two numbers
SUBB          Subtract with borrow; calculate the difference with 'borrow'
MUL           Multiply; calculate the product of two operands
DIV           Divide; calculate the quotient and remainder of two numbers
NEG           Negate; change sign of operand
INC           Increment; add 1 to operand
DEC           Decrement; subtract 1 from operand
SHIFTA        Shift arithmetic; shift the operand (left or right) with sign extension

Instruction type: Logical

Instruction   Action by processor
NOT           Complement the operand
OR            Perform bit-wise logical OR of operands
AND           Perform bit-wise logical AND of operands
XOR           Perform bit-wise 'exclusive OR' of operands
SHIFT         Shift the operand (left or right), filling the empty bit positions with 0's
ROT           Rotate; shift the operand (left or right) with wrap-around
TEST          Test for specified condition and set or reset relevant flags

Instruction type: Control transfer

Instruction   Action by processor
JUMP          Branch; enter the specified address into Program Counter (PC)
JUMPIF        Branch on condition; enter the specified address into PC only if the
              specified condition is satisfied; conditional transfer
JUMPSUB       CALL; save current 'program control status' (into stack) and then enter
              the specified address into PC
RET           RETURN; restore 'program control status' (from stack) into PC and other
              relevant registers and flags
INT           Interrupt; create a software interrupt; save 'program control status'
              (into stack) and enter the address corresponding to the specified
              vector code into PC
IRET          Interrupt return; restore 'program control status' (from stack) into PC
              and other relevant registers and flags
LOOP          Iteration; decrement the implied register by 1 and test for non-zero;
              if satisfied, enter the specified address into PC

Instruction type: Input-output

Instruction   Action by processor
IN            Input; read data from the specified input port/device into specified
              or implied register
OUT           Output; write data from specified or implied register into an output
              port/device
TEST I/O      Read the status from I/O subsystem and set condition flags (codes)
START I/O     Inform the I/O processor (or the data channel) to start the I/O program
              consisting of commands for the I/O operations
HALT I/O      Inform the I/O processor (or the data channel) to abort the I/O program
              consisting of commands for the I/O operations under progress

Instruction type: String manipulation

Instruction   Action by processor
MOVS          Move byte or word of string
LODS          Load byte or word of string
CMPS          Compare byte or word of strings
STOS          Store byte or word of string
SCAS          Scan byte or word of string

Instruction type: Translate

Instruction   Action by processor
XLAT          Translate; convert the given code into another by table lookup
PACK          Convert the unpacked decimal number into packed decimal
UNPACK        Convert the packed decimal number into unpacked decimal

Instruction type: Processor control

Instruction   Action by processor
HLT           Halt; stop instruction cycle (processing)
STI (EI)      Set/enable interrupt; sets interrupt enable flag to '1', so as to allow
              maskable interrupts
CLI (DI)      Clear/disable interrupt; resets interrupt enable flag to '0' so as to
              ignore maskable interrupts
WAIT          Freeze instruction cycle till a specified condition, such as an input
              signal becoming active, is satisfied
NOOP          No operation; no action
ESC           Escape; the next instruction after the ESC is to be skipped since it is
              meant for the coprocessor
LOCK          Reserve the bus, and hence the memory, till the next instruction,
              following the LOCK instruction, is executed/completed
CMC           Complement 'carry' flag
CLC           Clear 'carry' flag
STC           Set 'carry' flag

2.2.2 Operation codes

There are a number of ways to allocate opcodes to an instruction

[11]. The design issue is to reduce the number of bits in the instruction

(small bit budget) while providing a large number of opcodes for a rich

instruction set. Following three design techniques have been used to meet

these requirements:

1. A fixed-length opcode allocated to variable-length instructions as in

IBM S370 (Figure. 2.1)

2. A variable-length opcode provided by opcode expansion, allocated

in variable-length instructions as in Intel x86 (Figure. 2.2)

3. A variable-length opcode provided by opcode expansion, allocated

in a fixed-length instruction as in MIPS32 (Figure. 2.3).
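The third technique can be illustrated with a small decoder for MIPS32 instruction words, in which a primary opcode of zero signals that the function field in the low six bits extends the opcode. This is a simplified sketch: the function name `decode_mips32` and the returned dictionary layout are hypothetical, and every primary opcode other than 0, 2 and 3 is treated as an immediate-format instruction, which ignores further expansion classes of the real ISA.

```python
def decode_mips32(word):
    """Split a 32-bit MIPS32 instruction word into its fields (simplified sketch)."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")

    opcode = (word >> 26) & 0x3F  # primary opcode, bits 31..26

    if opcode == 0:
        # Opcode expansion: the 6-bit funct field selects the actual operation.
        return {
            "format": "R",
            "opcode": opcode,
            "rs": (word >> 21) & 0x1F,
            "rt": (word >> 16) & 0x1F,
            "rd": (word >> 11) & 0x1F,
            "shamt": (word >> 6) & 0x1F,
            "funct": word & 0x3F,
        }
    if opcode in (2, 3):
        # J and JAL: the remaining 26 bits hold the jump target.
        return {"format": "J", "opcode": opcode, "target": word & 0x03FFFFFF}

    # Simplification: all other opcodes are treated as immediate format.
    return {
        "format": "I",
        "opcode": opcode,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": word & 0xFFFF,
    }


print(decode_mips32(0x012A4020))  # add  $t0, $t1, $t2
print(decode_mips32(0x21280005))  # addi $t0, $t1, 5
```

The instruction length stays fixed at 32 bits in both cases; only the number of bits that identify the operation changes, from 6 for `addi` to 12 (opcode plus funct) for `add`.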

2.2.3 Addressing modes

Addressing mode is the method by which the location of an

operand is specified within an instruction. Table 2.2 defines popular

addressing modes. A given ISA may not support all the addressing modes.

Table 2.2: Addressing modes and mechanisms

Addressing mode:   Implied addressing
Mechanism:         Operand address is not specified explicitly
Remarks/examples:  RET and IRET

Addressing mode:   Immediate addressing
Mechanism:         Operand is given in the instruction
Remarks/examples:  Fast operand fetch but operand size is limited as it increases
                   instruction length

Addressing mode:   Direct addressing (Absolute addressing)
Mechanism:         Operand is in a memory location; its address is given in the
                   instruction
Remarks/examples:  One memory access required to get the operand

Addressing mode:   Indirect addressing
Mechanism:         Operand is in a memory location; its address is also in memory;
                   address of the location containing the operand address is given
                   in the instruction
Remarks/examples:  Two memory accesses are required to get the operand

Addressing mode:   Register direct addressing
Mechanism:         Operand is in a register; the register address/number is given
                   in the instruction
Remarks/examples:  Faster operand fetch compared to direct addressing

Addressing mode:   Register indirect addressing
Mechanism:         Operand is in memory; its address is in a register;
                   address/number of the register is given in the instruction
Remarks/examples:  Faster operand fetch than indirect addressing

Addressing mode:   Base register addressing
Mechanism:         Operand is in memory; its address is specified in two parts; the
                   instruction gives an offset number and also specifies the base
                   register; the offset (integer number) has to be added to the
                   base register contents
Remarks/examples:  Useful in relocation of programs

Addressing mode:   PC-relative addressing
Mechanism:         Similar to base register addressing, but the register always
                   being the PC
Remarks/examples:  Mostly used by branch instructions

Addressing mode:   Index addressing
Mechanism:         The operand is in memory; the instruction gives an address, and
                   the index register contains an offset number; the address and
                   the offset number are added to get the operand address
Remarks/examples:  Convenient for indexing arrays
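The mechanisms in Table 2.2 can be expressed as a small operand-resolution routine. The sketch below is illustrative only: the mode names, the dictionary-based register file and memory, and the way the instruction field is passed in are all hypothetical simplifications, not the behaviour of any particular processor.

```python
def operand_address(mode, field, regs, mem, pc=0):
    """Return the memory address of the operand for the memory-based modes."""
    if mode == "direct":
        return field                    # address is given in the instruction
    if mode == "indirect":
        return mem[field]               # instruction points to the location holding the address
    if mode == "register_indirect":
        return regs[field]              # register holds the address
    if mode == "base":
        base_reg, offset = field
        return regs[base_reg] + offset  # base register contents plus offset
    if mode == "pc_relative":
        return pc + field               # as base addressing, with the PC as the register
    if mode == "index":
        address, index_reg = field
        return address + regs[index_reg]  # instruction address plus index register offset
    raise ValueError(f"unsupported addressing mode: {mode}")


def fetch_operand(mode, field, regs, mem, pc=0):
    """Return the operand value for the given addressing mode."""
    if mode == "immediate":
        return field                    # operand is part of the instruction
    if mode == "register_direct":
        return regs[field]              # operand is in a register
    return mem[operand_address(mode, field, regs, mem, pc)]


# Hypothetical machine state used only for this illustration.
regs = {"R1": 100, "R2": 3}
mem = {100: 42, 104: 7, 200: 100, 503: 9}

print(fetch_operand("immediate", 5, regs, mem))
print(fetch_operand("indirect", 200, regs, mem))
print(fetch_operand("base", ("R1", 4), regs, mem))
print(fetch_operand("index", (500, "R2"), regs, mem))
```

The indirect case performs two lookups in `mem`, while the register indirect case performs one, which mirrors the access counts noted in the remarks column of the table.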


Figure. 2.1: IBM S370 Instruction Formats


Figure. 2.2: INTEL Pentium Pro Instruction Formats

Figure. 2.3: MIPS32 Instruction Formats


2.2.4 Data types

Application programs may use various types of data depending on

the problem. A machine language program can operate either on numeric

data or on non-numeric data. The numeric data can be either binary or

decimal numbers. The non-numeric data can be any of the following types:

characters, addresses, and logical data. All non-binary data is represented

inside a computer in the binary coded form. The binary data can be

represented either as a fixed-point or a floating-point number. In fixed-point

number representation, the position of the binary point is rigidly fixed in

one place. In floating-point number representation, the binary point's

position can be anywhere. The fixed-point numbers are known as integers

whereas the floating-point numbers are known as real numbers. Arithmetic

operations on fixed-point numbers are simple and they require minimum

hardware circuits. The floating-point arithmetic is complex and requires

extensive hardware circuits. Compared to fixed-point numbers, the floating-

point numbers have two advantages:

1. The maximum or minimum value that can be represented in

floating-point number representation is higher. Hence it is

useful in dealing with very small or very large numbers.

2. The floating-point number representation leads to better

accuracy in arithmetic operations.

2.2.5 ISA Models

There are several different ISA models that architectures are based

upon, each with its own specifications for the various features. The most

commonly implemented ISA models are application-specific, general

purpose and instruction level parallel. Application-Specific ISA Models

define processors that are intended for specific embedded applications,

such as processors made only for TVs. General-purpose ISA models are


typically implemented in processors targeted to be used in a wide variety of

systems, rather than only in specific types of embedded systems. CISC

model and RISC model are the common types of general-purpose ISA

architectures implemented in embedded processors. Many current

processor designs fall under the CISC or RISC category primarily because

of their heritage. RISC processors have become more complex, while CISC

processors have become more efficient to compete with their RISC

counterparts, thus blurring the line between the definition of a RISC versus

a CISC architecture. Technically, these processors have both RISC and

CISC attributes, regardless of their definitions. Instruction-level parallelism

ISA architectures are similar to general-purpose ISAs, except that they

execute multiple instructions in parallel, as the name implies. Examples of

instruction-level parallelism ISAs [9] include SIMD model, Superscalar

model, and VLIW model.

2.3 PROCESSOR PERFORMANCE AND ADVANCED ARCHITECTURES

The performance of a processor is measured by the amount of time

taken by the processor to execute a program. The processor performs an

instruction cycle for each instruction. Table 2.3 illustrates the actions taken

at various steps of the instruction cycle for ADD instruction. Elementary

operations performed by the processor during instruction cycle execution

are known as micro-operations. A given micro-operation takes place when

the corresponding control signal is issued by the processor. Table 2.4

illustrates some sample micro-operations performed by the processor. The

time taken for executing different instructions is not the same. Hence the

type of instructions executed in a program and the number of instructions

executed by the processor while running the program decide the time

taken by the processor to execute a program.

Table 2.3: Instruction cycle steps and actions for ADD instruction

Sl. No. 1
Step:                             Instruction fetch
Action responsibility:            Control unit; external action
Remarks:                          Fetches next instruction from main memory
Parameter affecting performance:  Memory access time

Sl. No. 2
Step:                             Instruction decode
Action responsibility:            Control unit; internal action
Remarks:                          Analyses opcode pattern in the instruction and
                                  identifies the exact operation specified
Parameter affecting performance:  Decode time

Sl. No. 3
Step:                             Operand fetch
Action responsibility:            Control unit; external (memory) or internal action
                                  depending on the location of operands
Remarks:                          Determines the operand addresses and then fetches
                                  the operands, one by one, from main memory or CPU
                                  registers and supplies them to the ALU
Parameter affecting performance:  (1) Operand address calculation time
                                  (2) Register/memory access time

Sl. No. 4
Step:                             Execute (ADD)
Action responsibility:            ALU; internal action
Remarks:                          Specified arithmetic operation is done
Parameter affecting performance:  Addition time

Sl. No. 5
Step:                             Result store
Action responsibility:            Control unit; external or internal action
Remarks:                          Stores the result in memory or registers
Parameter affecting performance:  Register/memory access time

Table 2.4: Sample micro-operations

Sl. no. 1
Control signal:   MAR ← PC
Micro-operation:  Contents of PC are copied (transferred) to Memory Address
                  Register (MAR)
Remarks:          The first micro-operation in instruction fetch

Sl. no. 2
Control signal:   PC ← PC + 4
Micro-operation:  Contents of PC are incremented by 4
Remarks:          The PC always points to next instruction address

Sl. no. 3
Control signal:   IR ← MBR
Micro-operation:  Contents of Memory Buffer Register (MBR) are copied to
                  Instruction Register (IR)
Remarks:          The last micro-operation in instruction fetch

Sl. no. 4
Control signal:   MBR ← R2
Micro-operation:  Contents of R2 register are copied to MBR
Remarks:          The first micro-operation in result store
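The instruction-fetch micro-operations of Table 2.4 can be traced in a few lines of code. This is a minimal sketch: the `Cpu` class and its dictionary-based memory are hypothetical, and the memory read that loads MBR is an assumed intermediate step that Table 2.4 does not list.

```python
class Cpu:
    """Toy register set for tracing the fetch micro-operations."""

    def __init__(self, memory):
        self.memory = memory  # maps byte address -> 32-bit instruction word
        self.pc = 0           # Program Counter
        self.mar = 0          # Memory Address Register
        self.mbr = 0          # Memory Buffer Register
        self.ir = 0           # Instruction Register

    def instruction_fetch(self):
        self.mar = self.pc                # MAR <- PC (first micro-operation)
        self.mbr = self.memory[self.mar]  # assumed step: memory read into MBR
        self.pc = self.pc + 4             # PC <- PC + 4 (points to next instruction)
        self.ir = self.mbr                # IR <- MBR (last micro-operation)
        return self.ir


# Hypothetical program image: two 32-bit words at consecutive addresses.
cpu = Cpu({0: 0x012A4020, 4: 0x21280005})
first = cpu.instruction_fetch()
print(f"IR={first:#010x} PC={cpu.pc}")
```

Each assignment corresponds to one control signal being issued; a real control unit may overlap some of these steps within a clock cycle.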

The following equation is commonly used for expressing a

computer's performance ability:

$$\frac{\text{time}}{\text{program}} = \frac{\text{time}}{\text{cycle}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{instructions}}{\text{program}} \qquad (2.3)$$

In other words, the execution time is given by the following equation:

$$T_p = N_{ie} \times \frac{CPI}{F} \qquad (2.4)$$

where $N_{ie}$ is the number of instructions executed (and not the number of

instructions present in the program), CPI is the average number of clock

cycles needed for an instruction, and F is the clock frequency. The CISC

approach attempts to minimize the number of instructions per program,

sacrificing the number of cycles per instruction. RISC does the opposite,

reducing the cycles per instruction at the cost of the number of instructions

per program.
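The trade-off described above can be made concrete by evaluating the execution-time equation for two machines. The figures below are hypothetical and chosen purely for illustration; they are not measurements of any real CISC or RISC processor.

```python
def execution_time(instructions_executed, cpi, clock_hz):
    """Execution time in seconds: instructions executed x CPI / clock frequency."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return instructions_executed * cpi / clock_hz


# Hypothetical figures: the CISC-style machine executes fewer instructions
# at a higher CPI; the RISC-style machine executes more at a lower CPI.
t_cisc = execution_time(instructions_executed=1_000_000, cpi=4.0, clock_hz=100e6)
t_risc = execution_time(instructions_executed=1_500_000, cpi=1.2, clock_hz=100e6)

print(f"CISC-style: {t_cisc * 1e3:.1f} ms")
print(f"RISC-style: {t_risc * 1e3:.1f} ms")
```

With these assumed numbers the RISC-style machine finishes sooner despite executing 50% more instructions, because the product of instruction count and CPI is smaller. Different assumptions can reverse the outcome.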


For any specific computer, there are two simple measurements that

give us an idea about its performance:

1. Response time or execution time: This is the time taken by the

computer to execute a given program – from the start to the

end of completion of the program. The response time for a

program is different for different computers.

2. Throughput: This is the work done (total number of programs

executed) by the computer during a given period of time.

2.3.1 Instruction Pipelining

In a simple processor (scalar, non-pipelined), the steps of an

instruction cycle are sequentially performed one after the other and

execution of successive instructions is also done sequentially, one after

the other. Instruction pipelining (Figure. 2.4) is a technique in which

execution of successive instructions is overlapped. The goal is to

increase the total number of instructions executed in a given period of time.

In a pipelined processor, different sections of the processor perform

different steps of the instruction cycle for different instructions at a given

time. Each step is called a pipe stage. All the pipe stages together form a

pipe.

Figure. 2.4: A six stage instruction pipeline


In a six stage instruction pipeline, six instructions can be active

simultaneously. If it is assumed that all instructions are independent of

other instructions, then for each clock cycle, one instruction can be

completed due to overlap of instruction cycles of consecutive instructions.

In practice, three types of hazards - data, structural, and control - reduce

the pipeline efficiency [9].

Dependencies between instructions are a property of programs. If

two instructions are dependent, they should not be executed

simultaneously. They may be partially overlapped. Two instructions may be

either directly data dependent or indirectly data dependent through another

instruction due to chain of dependencies. In case of dependence, there are

two possible solutions:

1. Preserving the dependence but preventing a hazard

2. Removing the dependence by transforming the object code.

Techniques used for detecting and preventing hazards should

preserve program order so that the overall behaviour and results of the

program are not affected.
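Detecting data dependences of the kind described above amounts to tracking which instruction last wrote each register. The sketch below is one simple way to do this for read-after-write dependences only; the `Instr` record, the function names and the four-instruction program are hypothetical.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]       # register written, if any
    sources: Tuple[str, ...]  # registers read


def direct_dependences(program: List[Instr]) -> Set[Tuple[int, int]]:
    """Return pairs (i, j) where instruction j reads a register last written by i."""
    deps = set()
    last_writer = {}
    for j, ins in enumerate(program):
        for src in ins.sources:
            if src in last_writer:
                deps.add((last_writer[src], j))
        if ins.dest is not None:
            last_writer[ins.dest] = j
    return deps


def is_dependent(deps: Set[Tuple[int, int]], i: int, j: int) -> bool:
    """True if j depends on i directly or through a chain of dependences."""
    frontier, seen = [i], set()
    while frontier:
        k = frontier.pop()
        for a, b in deps:
            if a == k and b not in seen:
                if b == j:
                    return True
                seen.add(b)
                frontier.append(b)
    return False


program = [
    Instr("ADD R1, R2, R3", "R1", ("R2", "R3")),
    Instr("SUB R4, R1, R5", "R4", ("R1", "R5")),
    Instr("OR  R6, R4, R7", "R6", ("R4", "R7")),
    Instr("AND R8, R9, R10", "R8", ("R9", "R10")),
]
deps = direct_dependences(program)
print(sorted(deps))
```

In this program the OR instruction depends on ADD only indirectly, through SUB, while the AND instruction is independent of the others and could be overlapped freely.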

2.3.2 RISC Instructions and Pipelining

Though pipelining can be implemented in both CISC and RISC types

of processors to enhance performance, it is simpler to design a pipelined

RISC processor. The following properties of RISC architecture help in

simplifying the pipeline design:

1. All instructions are of equal size, say 4 bytes.

2. Instruction formats are not many; just 1 to 3.

3. Arithmetic and other operations on data always have operands

(data) in registers (not in memory).

4. Only load and store instructions can access memory.


Generally RISC processors have three types of instructions: ALU

instructions, Load and store instructions and Branch and Jump type

instructions. In ALU Instructions, the operands are available in registers.

On completion, the results should be stored in registers. In load and store

instructions, one operand is in register and the other operand is in memory.

The address of the memory operand is generally specified as the sum of

two parts: the base register contents and the offset indicated by the

immediate field in the instruction. In branches and jumps, the branch

conditions are usually specified in one of the two ways:

1. Comparison of two items in registers

2. Condition bits or condition codes

Unconditional jumps are present in almost all RISC processors.

Traditional RISC pipeline has five stages as shown in Figure. 2.5 (a).

Figure. 2.5 (b) shows timing diagram while executing 6 instructions over 10

clock cycles. Figure. 2.5 (c) shows the RISC pipeline as a series of data

paths shifted in time.
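The count of 10 clock cycles for 6 instructions follows from the overlap: the first instruction needs one cycle per stage and each later instruction completes one cycle after its predecessor. The sketch below reproduces this for an idealised pipeline with no hazards or stalls; the function names are hypothetical, and the stage names are those used for the pipeline registers in this chapter.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed by an ideal pipeline: n_stages + n_instructions - 1."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def timing_diagram(n_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    width = total_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = [""] * width
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        rows.append(row)
    return rows


for i, row in enumerate(timing_diagram(6), start=1):
    print(f"I{i}: " + " ".join(f"{cell:>3}" for cell in row))
print("cycles:", total_cycles(6))
```

A non-pipelined processor taking five cycles per instruction would need 30 cycles for the same six instructions, so the ideal pipeline is three times faster in this case.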

Figure. 2.5 (a): Five stage pipeline

Figure. 2.5 (b): Timing Diagram

[Figure: six instructions in program execution sequence (vertical axis), each passing through CC, R, ALU, DC and R in successive clock cycles, plotted against time in clock cycles 1 to 10 (horizontal axis). CC: Code Cache (instruction memory); R: Registers; ALU: Arithmetic Logic Unit; DC: Data Cache (data memory)]

Figure. 2.5 (c): RISC Pipeline as a series of datapaths

Tradeoffs in microarchitecture have changed somewhat since the

RISC five-stage pipeline [7]. In the early RISC days, transistor count

limitations convinced the designers to reuse the ALU for address

computations. Today, transistors are almost free of cost but wires are

expensive. Each additional pipeline stage has a marginal benefit in terms of

spreading out the work in smaller steps that may allow a lower cycle time,

and a marginal cost in terms of added design complexity and global

overheads. Table 2.5 defines the clock cycles, respective stages of

instruction cycle and micro operations. The actual number of clock cycles

required for different instructions is as follows:

Unconditional branch instruction: 2 (cycles 1 and 2)

Store instruction: 4 (cycles 1 to 4)

Any other instruction: 5 (cycles 1 to 5)

There are many alternate design options offering varying

performance levels. The designer chooses the best option taking into

account the hardware cost and required performance level.

There are two major problems in a practical pipeline:

1. Resource Conflict: Two different operations at two

sections/stages may need the same hardware resource in the

same clock cycle, due to overlapping of instructions. To resolve

this, multiple resources of the same type can be provided in the

hardware. This will increase the cost and hence should be

done judiciously.

2. Interference between adjacent stages: Two instructions in

different stages of the pipeline should not interfere with each

other. To resolve this, pipeline registers are used between

successive stages of the pipeline. The pipeline registers are

named indicating the stages linked by them such as IF/ID,


ID/EX, EX/MEM and MEM/WB. The result of any specific stage

is stored in the pipeline register at the end of a clock cycle.

During the next clock cycle, the contents of the pipeline register

serve as input to the next stage. In some cases, the result

generated by one stage may not be used as input to the next

stage. It may propagate through more than one stage. For

example, for a STORE instruction, the data to be stored is read from the

source register in the ID stage but it is written to memory only in the MEM stage.

Table 2.5: Typical instruction cycle phases in RISC processors

Sl. no. 1, Clock cycle 1
Instruction cycle phase:     Instruction Fetch (IF)
Major micro operations:      a. Send PC contents to memory
                             b. Fetch the current instruction from memory
                             c. Increment PC by 4 to indicate the next instruction
                                address
Hardware sections involved:  a. Cache memory

Sl. no. 2, Clock cycle 2
Instruction cycle phase:     Instruction Decode (ID); plus Register Read cycle
Major micro operations:      a. Decode the instruction
                             b. Read the contents of source registers
                             c. Compare the contents of registers (as preparation
                                for certain instructions such as compare)
Hardware sections involved:  a. Instruction decoder
                             b. Registers
                             c. Adder/comparator

Sl. no. 3, Clock cycle 3
Instruction cycle phase:     Execution (EX); plus Effective address cycle
Major micro operations:      a. For ALU instruction, the specified operation is
                                done by the ALU
                             b. For memory reference instruction (load/store), the
                                effective address is calculated by the ALU by adding
                                the base register contents and the offset
                             c. For branch instruction, testing of branch condition
                                is done
Hardware sections involved:  a. ALU
                             b. ALU
                             c. ALU

Sl. no. 4, Clock cycle 4
Instruction cycle phase:     Memory Access (MEM); plus branch completion
Major micro operations:      a. For load instruction, memory read operation from the
                                effective address is done
                             b. For store instruction, memory write operation at the
                                effective address is done, storing the contents of
                                the source register
                             c. For branch instruction, the branch address is
                                entered in PC if branch occurs
Hardware sections involved:  a. Cache memory
                             b. Cache memory

Sl. no. 5, Clock cycle 5
Instruction cycle phase:     Write-back (WB)
Major micro operations:      a. The result is stored in the destination register for
                                load instruction and ALU instruction
Hardware sections involved:  a. Registers


2.3.3 Superscalar processor

In a scalar pipelined processor, though there are multiple

instructions simultaneously active in the pipeline, there is only one

execution unit/functional unit. Hence at a given time, only one instruction

can be in the execution unit. In a superscalar architecture, there are

multiple pipelines in the processor and hence two or more instructions can

be executed simultaneously. In other words, in a superscalar processor,

the same type of operation (add, shift etc.) can be executed simultaneously in

a single clock cycle on multiple pipelines for different instructions. Figure. 2.6

shows the organization of a superscalar processor with two pipelines [9]. In

some superscalar processors, instruction sequencing is static (at

compilation time) but in the majority of superscalar processors, it is dynamic (at

run time). The control unit is complex in a dynamic superscalar processor,

whereas in a static superscalar processor the complexity lies in the

compiler.

2.3.4 Very Long Instruction Word (VLIW) Processor

The VLIW architecture exploits Instruction Level Parallelism (ILP)

with close cooperation between the compiler and the processor. The

processor has multiple functional units similar to a dynamic superscalar

processor but scheduling is done by the compiler that groups several

independent operations into a very long instruction word. Each VLIW has

multiple fields/slots with each slot containing one RISC like operation. Each

operation corresponds to a functional unit. During the execution of a VLIW,

the processor performs all the operations in parallel in different functional

units. Figure. 2.7 illustrates the principle of a VLIW processor [9].
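The grouping of operations into slots can be sketched as bit packing. The slot names and the 256-bit word with 32-bit operations are taken from Figure 2.7; everything else in the sketch is an assumption made for illustration, including the function names, the slot ordering within the word, and the use of an all-zero word as the no-operation filler for empty slots.

```python
SLOT_BITS = 32
SLOTS = ["add", "mul", "load", "store", "cmp", "branch", "mulfl", "addfl"]
NOP = 0  # assumed encoding for an empty slot


def pack_vliw(ops):
    """Pack per-slot 32-bit operations into one long instruction word."""
    unknown = set(ops) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    word = 0
    for name in SLOTS:
        op = ops.get(name, NOP)  # slots the compiler could not fill get a NOP
        if not 0 <= op < (1 << SLOT_BITS):
            raise ValueError(f"operation for slot {name!r} does not fit in 32 bits")
        word = (word << SLOT_BITS) | op
    return word


def unpack_vliw(word):
    """Split a long instruction word back into its per-slot operations."""
    mask = (1 << SLOT_BITS) - 1
    last = len(SLOTS) - 1
    return {name: (word >> (SLOT_BITS * (last - i))) & mask
            for i, name in enumerate(SLOTS)}


# Hypothetical bundle: only two independent operations were found by the compiler.
bundle = pack_vliw({"add": 0x012A4020, "load": 0x8D280000})
print(f"{bundle:064x}")
print(unpack_vliw(bundle))
```

Slots left empty still occupy their 32 bits in the word, which is why code size is a recurring concern for VLIW designs when the compiler cannot find enough independent operations.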

Figure. 2.6: Superscalar Processor Organisation

[Figure: block diagram showing main memory, system bus, unified cache, IF unit, instruction queue (2 instructions), decode and dispatch unit, two execute units (EU-1 and EU-2) fed with odd and even instructions and each comprising OF, EX and SR stages, registers, write buffers and cache memory. IF: Instruction Fetch; OF: Operand Fetch; EX: Execute; SR: Store Results; EU: Execute unit]

62

Figure. 2.7: VLIW Processor Organisation: (a) Inside VLIW Processor; (b) VLIW and one operation, showing a 256-bit instruction word with eight 32-bit slots (add, mul, load, store, cmp, branch, mulfl, addfl). IR: Instruction Register; FU: Functional Unit; RF: Register File; INT: Integer; MAU: Memory Addressing Unit.

2.3.5 Cache Memory

The cache memory is a small and fast intermediate buffer between

the processor and the main memory with the objective of reducing the

processor's waiting time during main memory access. The presence of

cache memory is not known to application programs. Figure. 2.8 illustrates

the use of cache memory.

Figure. 2.8: Use of Cache memory

The main memory is conceptually divided into many blocks, each

containing a fixed number of consecutive locations. The cache memory is organized as a number of lines, and the size of each line is the same as the capacity of a main memory block. The cache operation is based on locality of reference [23], a property inherent in programs. Most of the time, the instructions or data needed are located in main memory locations that are physically close to the location currently being accessed. There are two kinds of behaviour pattern:

1. Temporal locality: A recently accessed memory location is

likely to be accessed again.

2. Spatial locality: The neighbouring location to the recently

accessed memory location is likely to be accessed.

In view of these two properties, while reading a location from main

memory, the content of entire block is transferred and stored in cache

memory. There are more blocks in main memory than the number of lines

in cache memory. Hence a mapping function is followed by the cache

controller to systematically map any main memory block to one of the

cache lines. When the processor needs a memory operand, the cache

controller checks the cache memory to find out if the current main memory

address is already mapped onto cache. If it is mapped, it means the

required item is available in cache memory and this condition is called

'cache hit'. Then the required information is read from cache memory.

On the other hand, if the current main memory address is not

mapped in cache memory, the required information is not available in

cache memory and this situation is known as 'cache miss'. In this case, the

entire block containing the main memory address is brought into the cache

memory. The time taken to bring the required item from the main memory

and supply it to the processor is known as 'miss penalty'. The hit rate (also known as hit ratio) is the ratio of the number of accesses which

faced 'cache hit' to the total number of accesses.
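The hit, miss and hit ratio terms defined above can be made concrete with a small simulation. The sketch below assumes a direct-mapped cache (block number modulo number of lines) as the mapping function; the text does not say which mapping is used, so this choice, the function name simulate_direct_mapped and the default sizes are illustrative assumptions.

```python
def simulate_direct_mapped(addresses, num_lines=4, block_size=4):
    """Return (hits, misses, hit_ratio) for a sequence of memory addresses.

    Assumes a direct-mapped cache: each main memory block maps to exactly
    one cache line, chosen as block_number % num_lines.
    """
    lines = [None] * num_lines          # block number currently held by each line
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size      # main memory block containing this address
        line = block % num_lines        # mapping function: block -> cache line
        if lines[line] == block:
            hits += 1                   # cache hit: item is read from the cache
        else:
            misses += 1                 # cache miss: the whole block is brought in
            lines[line] = block
    total = hits + misses
    return hits, misses, (hits / total if total else 0.0)


if __name__ == "__main__":
    # Sequential accesses benefit from spatial locality.
    print(simulate_direct_mapped(range(8)))
    # Two blocks that map to the same line evict each other on every access.
    print(simulate_direct_mapped([0, 16, 0, 16]))
```

The second call shows why the choice of mapping function matters: with direct mapping, addresses 0 and 16 compete for the same line, so temporal locality alone does not guarantee hits.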

The cache memory is of two types: Unified cache or common cache,

and Split cache. The unified cache stores both instructions and data. In

split cache, there is a separate instruction cache (also known as code

cache) and data cache. Some computers use a two level or three level

cache memory system. The cache immediately next to the processor is

known as level 1 cache or primary cache. The next level cache is called a

level 2 cache or secondary cache. Most microprocessors are incorporating

multi-level caches on-chip.

2.3.6 Virtual Memory

The virtual memory concept facilitates the execution of large programs in systems with smaller physical memory. Virtual memory is desirable in the following two cases:

1. The logical memory space of the processor is small.

2. The physical main memory space has to be kept small to reduce the cost, though the processor has a large logical memory space.

Figure. 2.9 illustrates the concept of virtual memory. In a virtual memory system, the OS automatically manages long programs by storing the entire program on a large hard disk. At a given time, only some portions of the program are stored in main memory. During the execution of the program, different portions of the program are swapped between the main memory and the hard disk as needed. The program does not address the physical memory directly.

Figure. 2.9: Virtual memory concept (CM: cache memory, an optional unit)

While referring to an instruction or operand, the program provides the logical address, and the virtual memory hardware (also known as the memory management unit or MMU) in the processor translates it into the equivalent physical memory address [9]. There are two popular methods of virtual memory implementation: paging and segmentation. In paging, the system software divides the program into pages of equal size. In segmentation, the machine language programmer organizes the program into different segments, which need not be of the same size. Figure. 2.10 illustrates the mechanism of virtual memory.

Figure. 2.10: Virtual memory mechanism
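The address translation performed under paging can be sketched as follows. This is a minimal model under stated assumptions: a 4096-byte page size, a page table held as a Python dictionary, and the names translate and PageFault are all hypothetical and chosen only for illustration.

```python
PAGE_SIZE = 4096  # bytes per page; an assumed value


class PageFault(Exception):
    """Raised when the referenced page is not resident in main memory."""


def translate(logical_address, page_table):
    """Translate a logical address into a physical address.

    page_table maps page number -> frame number for resident pages only.
    """
    page, offset = divmod(logical_address, PAGE_SIZE)
    if page not in page_table:
        # In a real system the OS would now bring the page in from disk.
        raise PageFault(page)
    return page_table[page] * PAGE_SIZE + offset


if __name__ == "__main__":
    table = {0: 5, 1: 2}             # pages 0 and 1 are in frames 5 and 2
    print(translate(4100, table))    # page 1, offset 4
    try:
        translate(3 * PAGE_SIZE, table)
    except PageFault as fault:
        print("page fault on page", fault.args[0])
```

The offset within the page is carried over unchanged; only the page number is replaced by a frame number, which is why equal-sized pages make the translation simple.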

2.3.7 Multicore CPU

Building a high performance computer system by linking together

several low performing computers is a standard technique of achieving

parallelism. This idea is the basis for development of multiprocessor

systems. Designing a microcomputer using multiple single-chip

microprocessors has been a cost-effective strategy for several years in the

past. The latest trend is the design of multicore microprocessors, which has brought a major change in the way multiprocessor systems are developed and used for various applications [10]. Figure. 2.11 illustrates the concept of multicore with four cores on a single die. Figure. 2.12 illustrates the organization of the SPARC64 VII, a popular quad-core CPU.

Figure. 2.11: A Quad-core CPU

Figure. 2.12: SPARC64 VII Processor

Chip Multiprocessing technology is an architecture in which multiple

physical cores are integrated on a single processor module. Each physical

core runs a single execution thread of a multithreaded application

independently from other cores at any given time. With this technology,

multi-core processors offer several times the performance of single-core modules. The ability to process multiple instructions at each clock cycle provides the performance advantage, but improvements also result from the short distances and fast bus speeds between cores on the same chip, as compared to traditional CPU-to-CPU communication in a multiprocessor system.

2.4 EMBEDDED PROCESSORS

Processors are the main functional units of an embedded system,

and are primarily responsible for processing instructions and data. An

embedded system contains at least one master processor, acting as the

central controlling device, and can have additional slave processors that

work with and are controlled by the master processor. These slave

processors may either extend the instruction set of the master processor or

act to manage buses and input/output (I/O) devices. The complexity of the

master processor usually determines whether it is classified as a

microprocessor or a microcontroller. Traditionally, microprocessors contain

a minimal set of integrated memory and I/O components, whereas the

microcontrollers have most of the system memory and I/O components

integrated on the chip. However, these traditional definitions are becoming

somewhat inaccurate in view of convergence taking place in recent

processor designs. There are literally hundreds of embedded processors

available and these can be grouped into various architectures [6]. What

differentiates one processor group's architecture from another is the set of

machine code instructions that the processors within the architecture group

can execute. Processors are considered to be of the same architecture

when they can execute the same set of machine code instructions. Table

2.6 lists some examples of real-world processors and the architecture

families they fall under. Table 2.7 lists the merits and demerits of different types of processors that can be embedded in a complex embedded system [8].

Table 2.6: Typical Embedded Architectures and Processors

Architecture | Processor | Manufacturer
AMD | Au1xx | Advanced Micro Devices
ARM | ARM7, ARM9, ... | ARM, ...
ColdFire | 5282, 5272, 5307, 5407, ... | Motorola/Freescale, ...
M32/R | 32170, 32180, 32182, 32192, ... | Renesas/Mitsubishi, ...
MIPS32 | R3K, R4K, 5K, 16, ... | MT14kx, IDT, MIPS Technologies, ...
NEC | Vr55xx, Vr54xx, Vr41xx | NEC Corporation, ...
PowerPC | 82xx, 74xx, 8xx, 7xx, 6xx, 5xx, 4xx | IBM, Motorola/Freescale, ...
SuperH (SH) | SH3, SH4 | Hitachi, ...
SHARC | SHARC | Analog Devices, Transtech DSP, Radstone, ...
strongARM | strongARM | Intel, ...
SPARC | UltraSPARC II | Sun Microsystems, ...
TMS320C6xxx | TMS320C6xxx | Texas Instruments
x86 | X86 [386, 486, Pentium(II, III, IV)...] | Intel, Transmeta, National Semiconductor, Atlas, ...
Tricore | Tricore1, Tricore2, ... | Infineon, ...

Table 2.7: Processor types in Complex Embedded Systems

Processor type | Application | Advantage | Disadvantage
General purpose microprocessor | When intensive computations are required and large embedded software is located in the external memory cores or chips | No engineering cost in designing the processor | Additional redundant execution units that are not needed in the given system design
Microcontroller | Used with internal memory, devices and peripherals and when embedded software is located in the internal ROM or flash memory | No engineering cost in designing the processor | Additional manufacturing costs and redundant application units which are not needed in the given system design
DSP | Used with signal processing-related instructions for filters, image, audio, and video and CODEC applications | No engineering cost involved in designing the signal processor | Manufacturing cost may be high
Single purpose processors and application specific system processor | Control I/O and bus operations and peripherals and devices | They support other processing units in the system and execute specific hardware processes fast | In-house engineering cost of development, royalty payments for an IP core of processor and time-to-market cost
Multicore processor | To significantly enhance the performance of the system | Reduced engineering cost | Increased manufacturing cost
Accelerator | To accelerate the execution of codes. A floating point coprocessor accelerates mathematical operations and Java accelerator accelerates Java code execution. | Increases performance by co-processing with the main processor | Increased engineering cost of development or royalty payments for the IP core of processor

2.5 EMBEDDED SYSTEM ARCHITECTURES

Embedded computer systems range from everyday machines - most

of the microwaves and washing machines, printers, network switches, and

automobiles - to handheld digital devices (such as PDAs, cell phones, and

music players) to videogame consoles and digital set-top boxes. Except in some applications such as PDAs, in many embedded applications the only programming occurs at the developer's site, in connection with the initial loading of the application code or a later software upgrade of that application. Thus, the application is carefully tuned for the processor and system [3].

Embedded systems often process information in different ways from

general-purpose processors. Typically these applications include deadline-driven constraints, the so-called real-time constraints. In these applications, a particular computation must be completed within a certain time limit, failing which the system will malfunction. A real-time performance requirement is

one where a segment of the application has an absolute maximum

execution time that is allowed. For example, in a digital set-top box the time

to process each video frame is limited, since the processor must accept

and process the frame before the next frame arrives (typically called hard

real-time systems). In some applications, a more liberal requirement exists:

the average time for a particular task is constrained as well as is the

number of instances when some maximum time is exceeded. Such

approaches (typically called soft real-time) arise when it is possible to

occasionally miss the time constraint on an event, as long as not too many

are missed. Real-time performance tends to be highly application

dependent.
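The difference between the two kinds of requirement can be expressed as two simple checks over measured execution times. The sketch below is only an illustration of the definitions above; the function names, the parameter names and the sample frame times are hypothetical.

```python
def meets_hard_deadline(times, deadline):
    """Hard real-time: every instance must finish within the deadline."""
    return all(t <= deadline for t in times)


def meets_soft_deadline(times, avg_limit, max_limit, allowed_misses):
    """Soft real-time: the average time is bounded, and only a limited
    number of instances may exceed the maximum time."""
    if not times:
        return True
    average = sum(times) / len(times)
    misses = sum(1 for t in times if t > max_limit)
    return average <= avg_limit and misses <= allowed_misses


if __name__ == "__main__":
    frame_times_ms = [30, 31, 29, 45, 30]   # hypothetical per-frame processing times
    print(meets_hard_deadline(frame_times_ms, deadline=33))
    print(meets_soft_deadline(frame_times_ms, avg_limit=33, max_limit=40,
                              allowed_misses=1))
```

With these sample values the single slow frame violates the hard requirement but is tolerated by the soft one, which is the distinction the text draws.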

Embedded system applications typically involve processing

information as signals that may be an image, a motion picture composed of

a series of images, a control sensor measurement, and so on. Signal

processing requires specific computation that many embedded processors

are optimized for.

Two other key characteristics exist in many embedded applications:

the need to minimize memory and the need to minimize power. The

importance of memory size translates to an emphasis on code size, since

data size is dictated by the application. Some architectures have special instruction set capabilities to reduce code size. Larger memories also mean

more power, and optimizing power is often critical in embedded

applications. Although the emphasis on low power is frequently driven by

the use of batteries, the need to use less expensive packaging (plastic

versus ceramic) and the absence of a fan for cooling also demand reduced

power consumption.

Often an application’s functional and performance requirements are

met by combining a custom hardware solution together with software

running on a standardized embedded processor core, which is designed to

interface to such special-purpose hardware. In practice, embedded

problems are usually solved by one of three approaches:

1. The designer uses a combined hardware/software solution that

includes some custom hardware and an embedded processor

core that is integrated with the custom hardware, often on the

same chip.

2. The designer uses custom software running on an off-the-shelf

embedded processor.

3. The designer uses a digital signal processor and custom

software for the processor.

Embedded systems are a very broad category of computing devices.

For example, the TI 320C55 DSP is a relatively “RISC-like” processor

designed for embedded applications, with very fine-tuned capabilities. On

the other end of the spectrum, the TI 320C64x is a very high-performance,

eight-issue VLIW processor for very demanding tasks. Media extensions

attempt to merge DSPs with some more general-purpose processing

abilities to make these processors usable for signal processing

applications. Hennessy and Patterson have examined several case studies [3], including the Sony PlayStation 2, digital cameras, and cell phones. The PlayStation 2 performs detailed three-dimensional graphics, whereas a

cell phone encodes and decodes signals according to elaborate

communication standards. But both have system architectures that are very

different from general-purpose desktop or server platforms. In general,

architectural decisions that seem practical for general-purpose applications,

such as multiple levels of caching or out-of-order superscalar execution,

are much less desirable in embedded applications. This is due to chip area,

cost, power, and real-time constraints. The programming model that these

systems present places more demands on both the programmer and the

compiler for extracting parallelism.

2.5.1 Digital Signal Processor

A digital signal processor (DSP) is a special-purpose processor

optimized for executing digital signal processing algorithms [5]. Most of

these algorithms, from time-domain filtering (e.g., infinite impulse response

and finite impulse response filtering), to convolution, to transforms (e.g.,

fast Fourier transform, discrete cosine transform), to even forward error

correction (FEC) encodings, all have as their kernel the same operation: a

multiply-accumulate operation. Each of these transforms has at its core a sum of products. To accelerate this, DSPs typically feature special-purpose

hardware to perform multiply-accumulate (MAC). A MAC instruction of

“MAC A, B, C” has the semantics of “A = A + B * C.” In some situations, the

performance of this operation is so critical that a DSP is selected for an

application based solely upon its MAC operation throughput. DSPs often

employ fixed-point arithmetic. In addition to MAC operations, DSPs often also have operations to accelerate portions of communications algorithms. An important class of these algorithms revolves around encoding and decoding forward error correction codes, in which extra information is added to the digital bit stream to guard against errors in transmission. At

one end of the DSP spectrum is the TI 320C55 architecture optimized for

low-power, embedded applications with a seven-staged pipelined CPU.
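To show how the multiply-accumulate operation forms the kernel of a filter, the following sketch implements a finite impulse response (FIR) filter as a sum of products built from repeated MAC steps. It uses plain Python integers as a stand-in for fixed-point values; the function names mac and fir_filter and the sample data are illustrative assumptions, not part of any DSP instruction set.

```python
def mac(acc, b, c):
    """Semantics of 'MAC A, B, C': A = A + B * C."""
    return acc + b * c


def fir_filter(samples, coeffs):
    """FIR filter: y[n] = sum over k of coeffs[k] * samples[n - k].

    Samples before the start of the input are treated as zero.
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, coeff in enumerate(coeffs):
            if n - k >= 0:
                acc = mac(acc, coeff, samples[n - k])   # one MAC per filter tap
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    print(mac(10, 2, 3))
    print(fir_filter([1, 2, 3, 4], [1, 1]))   # two-tap filter on sample data
```

Each output sample costs one MAC per filter tap, which is why MAC throughput is often used as the deciding figure when a DSP is selected.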

The source of input data to DSP is some form of digitized signal, like

a photo image captured by a digital camera, a voice packet going through a

network router, or an audio clip played by a digital keyboard. As with

microcontrollers, DSPs also tend to incorporate many peripherals that are

useful in signal processing on a single IC. For example, a DSP device may

contain a number of analog-to-digital and digital-to-analog converters,

pulse-width modulators, direct memory access controllers, timers, and

counters.

2.5.2 Media Extensions

Media Extensions is a middle ground between DSPs and

microcontrollers. These extensions add DSP-like capabilities to

microcontroller architectures at relatively low cost. Because media

processing is judged by human perception, the data for multimedia

operations are often much narrower than the 64-bit data word of modern

desktop and server processors. For example, floating-point operations for

graphics are normally in single precision, not double precision, and often at

a precision less than specified by IEEE 754. Rather than waste the 64-bit arithmetic-logical units (ALUs) when operating on 32-bit, 16-bit, or even 8-bit integers, multimedia instructions can operate on several narrower data items at the same time. Thus, a partitioned add operation on 16-bit data with a 64-bit ALU would perform four 16-bit adds in a single clock cycle. The only extra hardware required is that needed to prevent carries between the four 16-bit partitions of the ALU. For example, such instructions might be used for graphical operations on pixels [10]. These operations are commonly called single-instruction multiple-data (SIMD) or vector instructions. Most graphics

multimedia applications use 32-bit floating-point operations.
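The partitioned add described above can be modelled in software. The sketch below treats a 64-bit word as four 16-bit lanes and discards the carry out of each lane, which is the effect of the carry-blocking hardware. Wrap-around (modular) arithmetic in each lane is an assumption; some media extensions offer saturating variants instead. The name partitioned_add16 is hypothetical.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def partitioned_add16(a, b):
    """Add four 16-bit lanes packed in 64-bit words, with no carry between lanes."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        result |= (lane_sum & LANE_MASK) << shift   # drop the carry out of this lane
    return result


if __name__ == "__main__":
    a = 0x0001_FFFF_0003_0004
    b = 0x0001_0001_0001_0001
    print(hex(partitioned_add16(a, b)))              # lanes added independently
    print(hex((a + b) & 0xFFFF_FFFF_FFFF_FFFF))      # ordinary 64-bit add, for contrast
```

In the ordinary 64-bit add the overflow of the lane holding 0xFFFF spills into the next lane, whereas the partitioned add keeps each lane's result independent.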

2.5.3 Embedded Multiprocessors

In the embedded space, a number of special-purpose designs have used customized multiprocessors, including the Sony PlayStation 2 [7]. Many special-purpose embedded designs consist of a general-purpose

programmable processor or DSP with special-purpose, finite-state

machines that are used for stream-oriented I/O. In applications ranging

from computer graphics and media processing to telecommunications, this

style of special-purpose multiprocessor is becoming common. Although the

inter-processor interactions in such designs are highly regimented and

relatively simple—consisting primarily of a simple communication

channel—because much of the design is committed to silicon, ensuring that

the communication protocols among the input/output processors and the

general-purpose processor are correct is a major challenge in such

designs. As a recent trend, embedded multiprocessors are built from

several general-purpose processors. These multiprocessors have been

focused primarily on the high-end telecommunications and networking

market, where scalability is critical. An example of such a design is the

MXP processor designed by empowerTel Networks for use in voice-over-IP

systems. The MXP processor consists of four main components:

1. An interface to serial voice streams, including support for

handling jitter

2. Support for fast packet routing and channel lookup

3. A complete Ethernet interface, including the MAC layer

4. Four MIPS32 R4000-class processors, each with its own cache

(a total of 48 KB or 12 KB per processor)

The MIPS processors are used to run the code responsible for

maintaining the voice-over-IP channels, including the assurance of quality

of service, echo cancellation, simple compression, and packet encoding.

Since the goal is to run as many independent voice streams as possible, a

multiprocessor is an ideal solution. Because of the small size of the MIPS

cores, the entire chip takes only 13.5 million transistors. Future generations of

the chip are expected to handle more voice channels, as well as do more

sophisticated echo cancellation, voice activity detection, and more

sophisticated compression.

Multiprocessing is becoming widespread in the embedded

computing arena for two primary reasons. First, the issues of binary

software compatibility, which plague desktop and server systems, are less

relevant in the embedded space. Often software in an embedded

application is written from scratch for an application or significantly

modified. Second, the applications often have natural parallelism,

especially at the high end of the embedded space. Examples of this natural

parallelism abound in applications such as a set-top box, a network switch,

a cell phone or a game system. The lower barriers to use of thread-level

parallelism together with the greater sensitivity to die cost (and hence

efficient use of silicon) are leading to widespread adoption of

multiprocessing in the embedded space, as the application needs grow to

demand more performance.

Desktop computers and servers rely on the memory hierarchy to

reduce average access time to relatively static data, but there are

embedded applications where data are often a continuous stream. In such

applications there is still spatial locality, but temporal locality is much more

limited. The steady stream of graphics and audio demanded by electronic games leads to a different approach to memory design: high bandwidth via many dedicated, independent memories.

2.6 MIPS32 Vs OTHER RISC PROCESSORS

Although the modern version of the RISC design dates to the 1980s, a number of systems of the 1960s and 1970s have been credited as the first RISC architecture, partly based on their use of the load/store approach. For example, the CDC 6600 designed by Seymour Cray in 1964 used a

load/store architecture with only two addressing modes (register+register,

and register+immediate constant) and 74 opcodes, with the basic clock

cycle/instruction issue rate being 10 times faster than the memory access

time [24,25].

The modern RISC revolution started with the projects at Stanford

University and University of California, Berkeley and IBM. Stanford's design

led to the successful MIPS architecture, while Berkeley's RISC project has

been commercialized as the SPARC. Another success from this era was

IBM's 801 that eventually led to the Power Architecture. As these projects

matured, a wide variety of similar designs flourished in the late 1980s and

early 1990s, representing a major force in the Unix workstation market as

well as embedded processors in laser printers, routers and similar

products. The Berkeley RISC project delivered the RISC-I processor in

1982. Compared with averages of about 100,000 transistors in newer CISC designs of the era, the RISC-I, consisting of only 44,420 transistors, had only 32 instructions with three addressing modes, and yet completely outperformed any other single-chip design. They followed this up with the 40,760-transistor, 39-instruction RISC-II in 1983, which ran over three times as fast

as RISC-I. In 1986, Hewlett Packard started using an early implementation

of their PA-RISC in some of their computers. In the meantime, the Berkeley

RISC effort had become so well known that it eventually became the name

for the entire concept and in 1987 Sun Microsystems began shipping

systems with the SPARC processor, directly based on the Berkeley RISC-II

system.

Well-known RISC families include DEC Alpha, AMD 29k, ARC,

ARM, Atmel AVR, Blackfin, Intel i860 and i960, MIPS, Motorola 88000, PA-

RISC, Power (including PowerPC), SuperH, and SPARC. In the 21st

century, the use of ARM architecture processors in smart phones and

tablet computers such as the iPad, Android, and Windows RT tablets

provided a wide user base for RISC-based systems. RISC processors are

also used in supercomputers such as the K computer, the fastest on the

TOP500 list in 2011, and Sequoia, the fastest on the 2012 list.

Over the years, RISC instruction sets have grown in size, and today

many of them have a larger set of instructions than many CISC CPUs.

Some RISC processors such as the PowerPC have instruction sets as

large as the CISC IBM System/370, for example; conversely, the DEC

PDP-8—clearly a CISC CPU because many of its instructions involve

multiple memory accesses—has only 8 basic instructions and a few

extended instructions. RISC architectures are now used across a wide range of platforms, from cellular telephones and tablet computers to some of the world's fastest supercomputers. As of 2014, a new research ISA, RISC-V, has been under development at the University of California, Berkeley, emphasizing features such as many-core, heterogeneous multiprocessing, virtualisability, and dense instruction encoding.

2.6.1 CISC and RISC Convergence

State of the art processor technology has changed significantly since

RISC chips were first introduced in the early '80s. Because a number of

advancements are used by both RISC and CISC processors, the lines

between the two architectures have begun to blur. In fact, the two

architectures almost seem to have adopted the strategies of the other.

Since the processor speeds have increased, CISC chips are now able to

execute more than one instruction within a single clock. This also allows

CISC chips to make use of pipelining. With other technological

improvements, it is now possible to fit many more transistors on a single

chip. This gives RISC processors enough space to incorporate more

complicated, CISC-like commands. RISC chips also make use of more

complicated hardware, making use of extra function units for superscalar

execution. All of these factors have led some groups to conclude that now

in the present "post-RISC" era, the two architectures have become so

similar that distinguishing between them is no longer relevant. However, it

should be noted that RISC chips still retain some important traits. RISC

chips strictly utilize uniform, single-cycle instructions. They also retain the

register-to-register, load/store architecture. And despite their extended

instruction sets, RISC chips still have a large number of general purpose

registers.

The question of whether ISA plays an intrinsic role in performance or

energy efficiency is becoming important [26]. The traditionally low-power ARM ISA (a RISC) is entering the high-performance server market, while the traditionally high-performance x86 ISA (a CISC) is entering the mobile low-power device market.

The MIPS architecture grew out of a graduate course by John L. Hennessy at Stanford University in 1981, resulted in a functioning system in 1983, and could run simple programs by 1984. The MIPS approach

emphasized an aggressive clock cycle and the use of the pipeline, making

sure it could be run as "full" as possible. The MIPS system was followed by

the MIPS-X and in 1984 Hennessy and his colleagues formed MIPS

Computer Systems. The commercial venture resulted in the R2000

microprocessor in 1985, and was followed by the R3000 in 1988. The

company was purchased by Silicon Graphics, Inc. in 1992, and was spun

off as MIPS Technologies, Inc. in 1998. Subsequently, Imagination Technologies bought the company.


2.7 MIPS32 INSTRUCTIONS AND CODE WASTAGE

RISC processors generally have three types of instructions: ALU,

Load or store, and Branch and Jump. Though RISC processors have a limited number of addressing modes, there are variations among the processors. The MIPS processor has only two addressing modes: immediate and displacement, both with 16-bit fields [3].

Figure. 2.3 seen earlier in section 2.2.2 summarises the basic

formats of MIPS32 integer instructions [27] with examples. The length of

the fields in bits is indicated inside brackets. All the instructions are 32-bits

and the most significant six bits contain the opcode. In the I-type and J-type

instructions, the opcode itself indicates the exact operation. In the R-type

instructions, the op field identifies the instruction type and the fn field (least

significant bits 0-5) indicates the exact operation. For example, the six-bit pattern 000000 in op identifies all R-type instructions and the fn pattern indicates the exact function, i.e., whether the instruction is add, and, sub, mul, div, shift, etc. For the and instruction, the fn is 0x24 whereas for the or instruction, the fn is 0x25. The R-type is for register-to-register operations.

The I-type is for data transfers, branches, and immediate operations. In

load/store type instructions, the offset field is added to the contents of the

rs register, usually an address, to form the effective address for one of the

operands, either the source or destination.

The branch instructions use a signed 16-bit offset field, enabling a jump of
up to $2^{15}-1$ instructions forward or $2^{15}$ instructions backward. In
I-type arithmetic instructions, the immediate field is sign-extended to 32
bits to form one of the operands, and the other operand is available in the rs
register. In I-type logical instructions, the immediate field is zero-extended
to form the second operand and the rs register holds the first operand. The
J-type is for jumps, and the instruction address is identified by the 26-bit
target field. The actual instruction address is formed by shifting the target
field left by two bits, giving a 28-bit value that is combined with the upper
four bits of the program counter. There are two more jump instructions, jr and
jalr, which follow a different format: they hold the instruction address in
the rs register and have no target field.
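The field layout described above can be sketched in a few lines of Python. This is only an illustrative decoder, not part of the thesis tooling: the function names `decode_mips32` and `sign_extend16` are hypothetical, only the three basic formats are handled, and any opcode other than 0x00, 0x02 (j) and 0x03 (jal) is simply treated as I-type. The jump address is assumed to take its upper four bits from the address of the instruction following the jump.

```python
def sign_extend16(value: int) -> int:
    """Interpret a 16-bit field as a two's-complement signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def decode_mips32(word: int, pc: int = 0) -> dict:
    """Split a 32-bit MIPS32 word into the fields of its basic format."""
    op = (word >> 26) & 0x3F              # most significant six bits
    if op == 0x00:                        # R-type: the fn field selects the operation
        return {
            "format": "R", "op": op,
            "rs": (word >> 21) & 0x1F,
            "rt": (word >> 16) & 0x1F,
            "rd": (word >> 11) & 0x1F,
            "shamt": (word >> 6) & 0x1F,
            "fn": word & 0x3F,
        }
    if op in (0x02, 0x03):                # J-type: j and jal
        target = word & 0x03FFFFFF        # 26-bit target field
        # Shift left by two bits, then take the upper four bits from PC + 4.
        address = ((pc + 4) & 0xF0000000) | (target << 2)
        return {"format": "J", "op": op, "target": target, "address": address}
    imm = word & 0xFFFF                   # simplification: everything else is I-type
    return {
        "format": "I", "op": op,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": imm,                       # zero-extended view (logical instructions)
        "imm_signed": sign_extend16(imm), # sign-extended view (arithmetic, branches)
    }
```

For instance, `decode_mips32(0x012A4024)` (the encoding of `and $t0, $t1, $t2`) returns op 0 and fn 0x24, while the signed range of the 16-bit field, `sign_extend16(0x8000)` to `sign_extend16(0x7FFF)`, matches the branch reach quoted above.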

The drawbacks of RISC instruction formats due to the fixed instruction size
are as follows:

1. Several bits are unused in many instructions, since all instructions have
to be 32 bits long. Table 2.8 lists the extent of unused bits in six integer
instructions of the MIPS32 ISA.

2. The R-type instructions use a total of 12 bits (the op and fn fields) to
specify the operation, although there are at most 64 different R-type
operations in the MIPS32 ISA, which six bits could encode.

Table 2.8: Typical Wastage of Bits in MIPS32 Instructions

Instruction   Action                  No. of unused bits
rfe           Return from exception   19
syscall       System call             20
nop           No operation            20
addu          Addition                5
mult          Multiply                10
lui           Load upper immediate    5

3. In immediate-type instructions such as addi, 16 bits are used for
specifying the immediate operand. In most cases, 8 bits are sufficient for the
immediate operand and the remaining 8 bits become redundant. Similarly, in
branch instructions such as beq, the offset field is underutilized in those
cases where the required offset can be specified with 8 bits.
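The third drawback can be made concrete with a small calculation. The sketch below is an illustration only and is not the custom tool used in chapter 3: the helper names and the sample immediate values are hypothetical. It computes the smallest two's-complement width that holds each immediate, counts how many fit in 8 bits, and totals the bits left unused in the 16-bit field.

```python
def min_signed_bits(value: int) -> int:
    """Smallest two's-complement width (in bits) that can hold value."""
    if value >= 0:
        return value.bit_length() + 1      # one extra bit for the sign
    return (-value - 1).bit_length() + 1


def immediate_waste(immediates, field_bits: int = 16):
    """Return (count fitting in 8 bits, total unused bits in the field)."""
    fits_in_8 = sum(1 for v in immediates if min_signed_bits(v) <= 8)
    unused = sum(field_bits - min_signed_bits(v) for v in immediates)
    return fits_in_8, unused


# Hypothetical immediates, as might appear in addi instructions.
sample = [4, -8, 1, 255, -1, 1000]
fits, unused_bits = immediate_waste(sample)
print(f"{fits} of {len(sample)} immediates fit in 8 bits; "
      f"{unused_bits} of {16 * len(sample)} field bits are unused")
```

With these sample values, four of the six immediates fit in 8 bits; note that 255 does not, because the field is signed.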

The impact of these drawbacks on the code size has been quantified

in chapter 3 by analysing typical embedded object codes with the help of a

custom built tool. The outcome of this analysis has formed the basis for the

architectural modifications proposed in chapter 4 and chapter 5.

2.8 CODE SIZE REDUCTION IN EMBEDDED SYSTEMS

In embedded applications, every bit of code counts since it directly affects
both the program memory size and the amount of bit traffic between the program
memory and the processor. Static code size is directly proportional to cost in
terms of program ROM size in embedded systems. Dynamic code size has
repercussions on instruction cache effectiveness and hence on performance.
Depending on the complexity of the system, the code memory accounts for more
than 50% of the embedded product. Instruction fetches take 5 to 15% of the
execution time for a typical 32-bit embedded RISC processor [7]. Since
embedded systems are not user programmable, several techniques are available
to developers, at both the compiler level and the hardware level, for
compressing the original code generated by the compiler. However, most of
these solutions reduce performance. Although this thesis argues in favour of
redesigning existing RISC processors, a review of the philosophy behind these
code compression techniques and of the extent of code compression achieved is
provided to help appreciate the benefits of the architectural solution
proposed here.

Several techniques to reduce code size have been implemented [28]. These are
classified into three types [2]: code compression, compiler techniques, and
ad hoc ISA modification. The first two techniques retain the original ISA,
whereas the third involves supporting a new instruction set that is a subset
of the original ISA. An overview of these three techniques is given below.

2.8.1 Code Compression

Code compression, initially applied to single-issue processors such as CISC
and RISC, is now also used in VLIW processors. The compression methods [28]
are based on traditional data compression techniques, including entropy
encoding such as Huffman encoding [29] and arithmetic coding [30, 31, 32],
dictionary-based compression [33], operand factorization [34], and re-encoding
of the original RISC instructions, to name a few. Code compression involves
compressing the executable RISC object code offline and storing the compressed
code in code memory. Decompression is done on the fly, for each instruction,
during program execution. The decompression unit is placed between the
processor core and memory, either post-cache (between the cache and the
processor) or pre-cache (between the code memory and the cache) [35]. In the
pre-cache architecture, the code memory contains compressed code but the
instruction cache contains uncompressed code. Decompression occurs only when
there is a cache miss and hence is not time critical. In the post-cache
architecture, both the code memory and the instruction cache contain
compressed code. Decompression occurs during every instruction fetch and hence
lies in the critical path of the instruction pipeline.

The criterion used to measure the efficiency of a code compression scheme is
the compression ratio, which is defined as the ratio of the size of the
compressed program to the size of the original program. A large body of
knowledge is available on lossless compression [36], and hardware for
low-power, high-performance compression and decompression has been proposed
[37]. However, code compression has some distinctive requirements [38]. First,
it must be possible to decompress a program during execution with random
access, starting from several points inside the program, since branch, jump,
and call instructions can alter the flow of program execution. Second,
compression and decompression algorithms can be highly asymmetric: compression
is performed once and for all (offline) when the executable is generated,
while decompression is performed during program execution. Decompression
should therefore be fast and power efficient, because its hardware cost must
be fully amortized by the corresponding savings in memory size and power,
without compromising performance.

The compression methods [28] result in either variable-width or fixed-width
instructions. Decompression is more complex with variable-width instructions,
as the width of an instruction is not known before decompression. Normally,
the code compression strategy does not require any modification to the
processor architecture. The instruction fetch unit generates the next
instruction address, which is normally the sum of the previous instruction
address and the size of the previous instruction. On encountering a branch,
jump, or call instruction, the target address is calculated and the target
instruction is fetched from the memory or cache. If the program memory
contains compressed code, a mapping between the original address space and the
compressed address space is necessary. An alternative approach [33] requires a
two-phase offline action after compilation: first, the whole program is
compressed; then, in a second phase, the branch offsets are patched to point
into the compressed code. In this approach, the processor needs to be modified
to handle unaligned (compressed) branch targets.

Wolfe and Chanin [30, 39] were the first to apply code compression to embedded
systems. Their scheme, known as the Compressed Code RISC Processor (CCRP),
uses Huffman coding to compress MIPS object code, and a Line Address Table
(LAT) to map original program block addresses to compressed code block
addresses. The LAT is stored in program memory. The code memory holds the
compressed code and the code cache holds the uncompressed code. Compression is
done by a software tool after linking, and the compressed program is placed
into a special memory area, identified by the linker as a compressed text
segment, which also has a special section for decompression tables. A
byte-based Huffman coding algorithm is used, with a cache line as the basic
block to be compressed. A TLB-like buffer called the Cache Line Address
Lookaside Buffer (CLB) is introduced to minimise LAT accesses and save time.
Decompression is slow since Huffman codes are of variable length.
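The idea of compressing each cache line separately and locating it through an address table can be sketched as follows. This is a minimal software illustration of the principle and not the CCRP hardware or its actual table format: the 32-byte line size, the function names, and the representation of the LAT as a plain list of byte offsets are all assumptions, and the program length is assumed to be a multiple of the line size.

```python
import heapq
from collections import Counter

LINE_BYTES = 32  # assumed cache line size


def build_codes(data: bytes) -> dict:
    """Build a byte-based Huffman code table {byte value: bit string}."""
    freq = Counter(data)
    if len(freq) == 1:                          # degenerate one-symbol program
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, partial code table)
    heap = [(n, i, {b: ""}) for i, (b, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {b: "0" + c for b, c in c1.items()}
        merged.update({b: "1" + c for b, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def compress(program: bytes, codes: dict):
    """Compress line by line; return (compressed memory, LAT of byte offsets)."""
    blocks, lat, offset = [], [], 0
    for start in range(0, len(program), LINE_BYTES):
        bits = "".join(codes[b] for b in program[start:start + LINE_BYTES])
        bits += "0" * (-len(bits) % 8)          # pad each block to a byte boundary
        block = int(bits, 2).to_bytes(len(bits) // 8, "big")
        lat.append(offset)                      # where this line starts in memory
        blocks.append(block)
        offset += len(block)
    return b"".join(blocks), lat


def decompress_line(memory: bytes, lat: list, codes: dict, index: int) -> bytes:
    """Random access: decompress only the cache line with the given index."""
    decode = {code: byte for byte, code in codes.items()}
    end = lat[index + 1] if index + 1 < len(lat) else len(memory)
    bits = "".join(f"{byte:08b}" for byte in memory[lat[index]:end])
    out, current = bytearray(), ""
    for bit in bits:                            # bit-serial decode of variable-length codes
        current += bit
        if current in decode:
            out.append(decode[current])
            current = ""
            if len(out) == LINE_BYTES:          # ignore padding bits
                break
    return bytes(out)
```

On a cache miss for line `i`, only `decompress_line(memory, lat, codes, i)` is needed, which is why the table lookup makes branch targets reachable without decompressing the whole program. The compression ratio defined earlier is `len(memory) / len(program)`, excluding the table and code-book overhead.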

The CCRP method established the foundation for the IBM CodePack compression
technology for the PowerPC 400 series [40]. Compressed code is stored in the
external memory, and the CodePack unit is placed between the memory and the
cache, as illustrated in Figure. 2.13. Decompression is triggered by an
instruction cache miss. The translation between the compressed and
uncompressed lines is held in the LAT. Each 32-bit PowerPC instruction is
divided into two 16-bit parts, and two Huffman tables are used, one for each
part. The Huffman-like codewords are assigned on the basis of frequency
distribution. Words are grouped in sets, and words belonging to the same set
are assigned codewords of the same length. For each cache miss, CodePack
fetches and decompresses two cache blocks instead of only the one requested.
This approach does not involve compiler modification or processor design
change. The original work of Wolfe and Chanin achieves a compression ratio of
30 to 50%, whereas the IBM CodePack technique gives a compression ratio
between 36% and 47%.

2.8.2 Dictionary-based Compression

Dictionary-based compression is another compression method [38, 28, 41]. It is
based on the observation that the same instructions with the same operands
reappear repeatedly in embedded object code. The compression algorithm creates
a dictionary of distinct instructions and replaces each instruction in the
original program with the corresponding index into the dictionary, as
illustrated in Figure. 2.14. Thus, the instructions are substituted by
'codewords'.

Figure. 2.13: IBM CodePack Code Compression for PowerPC

As the codeword is smaller than the original instruction, the size of the code
is reduced. During program execution, the codeword (dictionary index) fetched
from the program memory is used to look up the original uncompressed
instruction in the dictionary. Figure. 2.15 illustrates the decompression
operation of the dictionary method of compression. Given a program with N
unique instructions, the length of the codeword is $\lceil \log_2 N \rceil$
bits.
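The scheme can be illustrated with a short sketch. The function names and the sample instruction words below are hypothetical, and the size estimate assumes 32-bit instructions with the dictionary stored alongside the codeword stream.

```python
def dictionary_compress(instructions):
    """Replace each instruction word by its index into a dictionary."""
    dictionary, index_of, codewords = [], {}, []
    for word in instructions:
        if word not in index_of:               # first occurrence: add to dictionary
            index_of[word] = len(dictionary)
            dictionary.append(word)
        codewords.append(index_of[word])
    # ceil(log2 N) bits per codeword, with a minimum of one bit
    width = max(1, (len(dictionary) - 1).bit_length())
    return dictionary, codewords, width


def dictionary_decompress(dictionary, codewords):
    """Decompression is a plain table lookup per codeword."""
    return [dictionary[index] for index in codewords]


# Hypothetical 32-bit instruction words with repetitions.
program = [0x8FBF0014, 0x27BD0018, 0x8FBF0014, 0x03E00008,
           0x27BD0018, 0x8FBF0014, 0x00000000, 0x8FBF0014]
dictionary, codewords, width = dictionary_compress(program)
original_bits = 32 * len(program)
compressed_bits = width * len(codewords) + 32 * len(dictionary)
print(width, original_bits, compressed_bits)
```

Here four distinct instructions give 2-bit codewords, so the eight-instruction program shrinks from 256 bits to 16 bits of codewords plus a 128-bit dictionary. The saving grows with the amount of repetition, since the dictionary cost is paid once.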

Figure. 2.14: Dictionary based compression

Figure. 2.15: Decompression procedure for the dictionary based compression

The dictionary is usually implemented in ROM in the control path of the
processor. Dictionary-based compression is a simple scheme offering fast
decompression. The decompressor is essentially a simple table, and it can be
integrated with the instruction decoder into a single pipeline stage. Though
this scheme is straightforward, offering inexpensive address translation and a
sizable reduction of memory fetch bandwidth (i.e., the number of bits
transferred from code memory to execute a program), [7] argues that 'this
approach is the least appealing for an embedded system'. On the other hand,
[39] establishes that dictionary-based compression is competitive with
CodePack for static footprint compression and achieves superior results for
bus traffic and energy reduction.

In the expression-tree-based algorithms for code compression proposed by Guido
et al. [42], the encoded symbols are extracted from program expression trees
and dictionary-based decompression engines are implemented.

2.8.3 Compiler Techniques

Modern embedded compilers are often more complex than general-purpose
compilers. A traditional compiler mainly aims to optimize a one-dimensional
cost function represented by the number of cycles needed to execute a program.
For an embedded compiler, on the other hand, code size and energy are as
important as the speed of execution. Certain scalar optimizations performed by
traditional compilers are also relevant in embedded systems. For example,
transformations such as dead code elimination, common subexpression
elimination, strength reduction, copy propagation, and constant folding reduce
code size and power consumption apart from improving speed. However, certain
ILP-oriented optimizations such as loop unrolling, tail duplication, procedure
inlining and cloning, speculation, and global code motion offer better speed
but may hurt code size and power consumption [7].

Research on code compression has been very active in the compiler community
[11, 43], with the goal of finding compact program representations. Pure
software techniques [39], in which the compiler reduces program size and
instructions are decompressed during execution, have been popular in the
embedded community. The compiler techniques for code compression for RISC
architectures by Cooper and McIntosh [44] map isomorphic instruction sequences
into abstract routine calls or cross-jumping. Profile-guided code compression,
which applies Huffman coding to infrequently executed functions, has been
suggested by Debray and Evans [45], [46]. A control-flow-graph-centric
software approach to reduce memory space consumption has been proposed by
Ozturk et al. [47]. Their approach involves on-the-fly compression and
decompression of the object code of embedded applications. A flexible
decompressor approach, applicable to multiple platforms, was proposed by
Shogan and Childers [48], who implemented IBM's CodePack algorithms within the
fetch step of a Software Dynamic Translator (SDT) as a pure software
infrastructure.

Thus, compiler techniques for code compression involve register renaming,
interprocedural optimization, and procedural abstraction of repeated code
fragments. Procedural abstraction is a program optimization technique that
replaces repeated sequences of common code with calls to a single procedure.
These compiler techniques are attractive since they have no runtime
decompression overheads, do not require any hardware change, and generate code
that can be directly executed by the processor. However, the software tools,
such as compilers and linkers, need to be modified.
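Procedural abstraction can be illustrated with a deliberately simplified sketch. It is not the algorithm of any of the cited works: it looks only for exact textual repeats of a fixed length, abstracts just the single most frequent one, and ignores practical concerns such as branches inside the sequence or the register holding the return address. The function name, the `call proc0`/`ret` pseudo-instructions, and the sample code are hypothetical.

```python
from collections import Counter


def abstract_repeats(code, length: int = 3):
    """Replace the most frequent run of `length` instructions with calls."""
    windows = Counter(tuple(code[i:i + length])
                      for i in range(len(code) - length + 1))
    if not windows:
        return list(code)
    sequence, count = windows.most_common(1)[0]
    if count < 2:                              # nothing is repeated
        return list(code)
    out, i, replaced = [], 0, 0
    while i < len(code):
        if tuple(code[i:i + length]) == sequence:
            out.append("call proc0")           # one call replaces the whole run
            i += length
            replaced += 1
        else:
            out.append(code[i])
            i += 1
    if replaced < 2:                           # repeats overlapped; no gain
        return list(code)
    # The shared body is emitted once, followed by a return.
    return out + ["proc0:"] + list(sequence) + ["ret"]


body = ["lw r1,0(sp)", "addi r1,r1,1", "sw r1,0(sp)"]
code = body + ["mul r2,r2,r3"] + body + ["sub r4,r4,r2"] + body
print(abstract_repeats(code))
```

In this example the 11 original instructions become 9 (three calls, two unchanged instructions, the three-instruction body, and a return). The saving comes at the cost of extra call and return instructions at run time, which is the usual trade-off of this transformation.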

2.8.4 Ad hoc ISA Modification

This approach customizes the existing RISC instruction set architecture with
narrow instructions supporting fewer operations, smaller operand fields, and
fewer registers. For example, the Thumb [49] instruction set is a modification
of the original ARM instruction set (32-bit instructions). It has 36 different
16-bit instructions, which form a subset of the ARM instructions. Similarly,
in MIPS16, a subset of 32-bit MIPS instructions is mapped to 16-bit
instructions, which can be translated in real time into 32-bit MIPS
instructions. This approach involves considerable effort to design the new
instruction set and requires a new instruction decoder and a new set of
software development tools, such as a compiler, assembler, and linker. A code
saving of up to 40% has been reported. However, the dense instruction sets
often cause performance penalties [39] due to the lack of instructions. Also,
the processor hardware needs additional decoder/decompression logic to support
both ISAs. Both ARM and MIPS have responded to the first criticism by
introducing the Thumb2 and microMIPS ISAs, which support two instruction
sizes: 16-bit and 32-bit. Although the performance degradation has been
addressed to a certain extent, the processors still need additional
decoder/converter logic to detect the 16-bit instructions and convert them
into 32-bit instructions.

There have been attempts to develop tiny RISC processors [50]. The DMN-6 has
16 8-bit registers, executes just 12 instructions, and has no cache memory.
Known as a Minimal RISC processor, it is meant exclusively for use in toys.

2.9 ISA LEVEL CODE SIZE REDUCTION

Instruction set architects have broadly used two techniques to reduce the
relative energy cost of instruction stream delivery. One approach is to
increase the amount of work performed by a single instruction. Vector
machines, for example, reduce instruction bandwidth demands by expressing a
large amount of SIMD parallelism in a single instruction [9]. CISC machines do
so by combining multiple simple operations into a single instruction and
providing more addressing modes. An alternative approach is to reduce the size
of the instructions. CISC instruction sets have generally been composed of
variable-length instructions: the simple and more common ones are usually
encoded in fewer bits than those that require more operands or occur less
frequently. RISC ISAs initially sacrificed the code density advantages of
variable-length instruction encodings in favour of simple, fixed-length 32-bit
encodings. Subsequently, RISC instruction set extensions have provided
fixed-length 16-bit encodings (as in ARM Thumb and MIPS16), although often at
the expense of performance and with limited access to some hardware features.
Next-generation RISC ISAs (as in ARM Thumb2, microMIPS, and RISC-V) partly
resolve these drawbacks by encoding the most common instructions densely while
maintaining most or all of the functionality of the 32-bit ISA. However, these
ISAs have not fully resolved the issue of code density, since they continue to
prioritise low pipeline design complexity. Hence they have only two different
instruction sizes: two bytes and four bytes. These are still called variable
instruction length ISAs, which is a misnomer; hybrid instruction length is the
proper term. In contrast, the hybrid length encoding proposed in this thesis
recommends a new ISA with four different instruction sizes, which reduces the
average instruction length with the goal of minimizing code memory size. It
also improves energy per operation by reducing instruction fetch traffic.

Depending on the memory word size, in a stream of hybrid-length instructions,
some instructions will reside in more than one memory word and will require
more than one memory access to fetch. Figure. 2.16 illustrates a memory map of
a sequence of x86 instructions [11]. The digits indicate the instruction
number in the stream. The eight instructions in the stream require seven
memory cycles, giving 0.875 memory cycles per instruction. For this example,
the average number of bytes per instruction is 3.375. Published statistics on
the IBM S/360 show that this CISC architecture has approximately four bytes
per instruction [11].
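The arithmetic behind these figures can be reproduced with a short sketch. The individual instruction lengths below are hypothetical, chosen only so that eight instructions total 27 bytes (8 x 3.375); they are not read from Figure. 2.16. A 4-byte memory word, contiguous packing from an aligned address, and one memory cycle per word fetched are also assumptions.

```python
def fetch_statistics(lengths, word_bytes: int = 4) -> dict:
    """Memory-word usage for instructions packed back to back."""
    total_bytes = sum(lengths)
    words = -(-total_bytes // word_bytes)      # ceiling division
    straddling, address = 0, 0
    for n in lengths:
        first_word = address // word_bytes
        last_word = (address + n - 1) // word_bytes
        if last_word > first_word:             # instruction spans a word boundary
            straddling += 1
        address += n
    return {
        "total_bytes": total_bytes,
        "memory_cycles": words,
        "cycles_per_instruction": words / len(lengths),
        "bytes_per_instruction": total_bytes / len(lengths),
        "straddling_instructions": straddling,
    }


# Hypothetical lengths in bytes for eight variable-length instructions.
stats = fetch_statistics([2, 3, 5, 1, 4, 6, 3, 3])
print(stats)
```

With these lengths the stream occupies 27 bytes in seven 4-byte words, reproducing the 0.875 memory cycles and 3.375 bytes per instruction quoted above; four of the eight instructions cross a word boundary.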

2.10 CONCLUSIONS

This chapter has provided an overview of the various attributes of an ISA and
of different types of embedded processors. The cause of the increased code
size of embedded processors has been illustrated with the example of the
MIPS32 ISA. Different techniques for code size reduction in embedded systems
have been briefly reviewed.

Figure. 2.16: Memory map of variable instruction stream

The next chapter analyses the behaviour of embedded object code for MIPS32,
and Chapter 4 discusses two different techniques of hybrid instruction
encoding for the MIPS32 processor to minimise code size.
