2. BACKGROUND AND RELATED WORK
This chapter provides the necessary background information that is
useful to understand the main contributions of the thesis. The following
section presents a brief discussion on techniques for designing for low
power consumption. Section 2.2 describes the various attributes of
Instruction Set Architecture (ISA). Section 2.3 discusses processor
performance and high performance architectural features. In section 2.4, an
overview of different types of embedded processors is presented.
Section 2.5 deals with the architectural aspects of the embedded systems.
Section 2.6 presents an overview of emergence of different RISC
processors including MIPS. Section 2.7 explains how MIPS32 instructions
waste bits. In section 2.8, various techniques, followed for embedded code
size reduction, are reviewed. Finally, the need for a new and dedicated ISA
for Embedded SoCs is elaborated in section 2.9.
2.1 DESIGN FOR LOW POWER CONSUMPTION
Power dissipation and energy efficiency are primary design
constraints for both simple and complex processors. As a result of the
growing market for battery-powered portable embedded systems, the drive
for minimum power consumption has become as important as the
drive for increased performance. Power consumption in processors
consists of a static component, called leakage power, and a dynamic
component, called switching power. The total power consumption of a CMOS
circuit comprises three components [21]:
1. Switching power: This is the power dissipated by charging and
discharging the gate output capacitance, CL, and represents
the useful work performed by the gate. The energy per output
transition is given by the following equation where Vdd is power
supply voltage:
$$E_t = \frac{1}{2} \cdot C_L \cdot V_{dd}^2 = 1\ \text{picojoule} \qquad (2.1)$$
2. Short-circuit power: When the gate inputs are at an intermediate
level, both the p- and n-type networks can conduct. This results
in a transitory conducting path from Vdd to Vss. In a careful design
that avoids slow signal transitions, the short-circuit power is
usually a small fraction of the switching power.
3. Leakage current: The transistor networks do conduct a very
small current when they are in their 'off' state. Though it is
generally negligible in an active circuit, it can drain a supply
battery over a long period of time.
In a well designed active circuit, the switching power dominates, with
the short-circuit power forming 10% to 20% of the total power, and the
leakage current being significant only when the circuit is inactive.
Therefore, the total power dissipation, Pc, of a CMOS circuit, neglecting the
short-circuit and leakage components, is given by summing the dissipation
of every gate g in the circuit C:
$$P_C = \frac{1}{2} \cdot f \cdot V_{dd}^2 \cdot \sum_{g \in C} A_g \cdot C_L^g \qquad (2.2)$$
where f is the clock frequency, Ag is the gate active factor (reflecting the fact
that not all gates switch every cycle) and $C_L^g$ is the gate load capacitance.
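Equations 2.1 and 2.2 can be evaluated numerically with a minimal sketch such as the one below. The function names and the example circuit (1000 identical gates, 10 fF load, activity factor 0.1, 100 MHz clock) are assumptions chosen for illustration only; they are not values from the thesis.

```python
def switching_energy(c_load, vdd):
    """Energy per output transition, E_t = 1/2 * C_L * Vdd^2 (Equation 2.1)."""
    return 0.5 * c_load * vdd ** 2


def dynamic_power(freq_hz, vdd, gates):
    """Switching power, P_C = 1/2 * f * Vdd^2 * sum(A_g * C_L^g) (Equation 2.2).

    gates is a list of (activity_factor, load_capacitance_in_farads) pairs.
    """
    return 0.5 * freq_hz * vdd ** 2 * sum(a * c for a, c in gates)


# Hypothetical circuit: 1000 identical gates, 10 fF load, activity factor 0.1.
gates = [(0.1, 10e-15)] * 1000
p_full = dynamic_power(100e6, 3.3, gates)   # nominal supply voltage
p_half = dynamic_power(100e6, 1.65, gates)  # supply voltage halved
print(p_full, p_half, p_half / p_full)
```

Because the supply voltage enters as a square, halving Vdd in this sketch cuts the switching power to one quarter at the same clock frequency.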
The typical gate load capacitance is a function of the process
technology and therefore not under the control of the designer. The
remaining parameters in the equation suggest following approaches to low-
power design:
1. Minimize the power supply voltage, Vdd.
2. Minimize the circuit activity, A. Techniques such as clock gating
fall under this heading.
3. Minimize the number of gates. Simpler circuits use less power
than complex ones, all other things being equal.
4. Minimize the clock frequency, f. Although a lower clock rate
reduces the power consumption, it also reduces performance
having a neutral effect on power-efficiency. If, however, a
reduced clock frequency allows operation at a reduced Vdd, this
will be highly beneficial to the power-efficiency.
5. Exploit parallelism. Duplicating a circuit allows the two circuits
to sustain the same performance at half the clock frequency of
the original circuit, which allows the required performance to be
delivered with a lower power supply voltage.
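Approach 5 can be worked through with Equation 2.2. The sketch below lumps all gates of one copy of the circuit into a single effective switched capacitance, written here as C_eff (an assumed shorthand for the sum of A_g C_L^g), and uses V'_dd for the hypothetical reduced supply voltage that the halved clock permits.

```latex
\begin{align*}
P_{\text{single}} &= \tfrac{1}{2} \, f \, V_{dd}^2 \, C_{\text{eff}} \\
P_{\text{dual}}   &= 2 \cdot \tfrac{1}{2} \cdot \tfrac{f}{2} \cdot {V'_{dd}}^2 \, C_{\text{eff}}
                   = \tfrac{1}{2} \, f \, {V'_{dd}}^2 \, C_{\text{eff}} \\
\frac{P_{\text{dual}}}{P_{\text{single}}} &= \left( \frac{V'_{dd}}{V_{dd}} \right)^2
\end{align*}
```

At equal throughput, the duplicated circuit therefore saves power only to the extent that the supply voltage can actually be lowered; an assumed V'_dd of 0.7 Vdd, for instance, would give a ratio of 0.49. The overhead of distributing work to the two copies and merging their results is ignored here.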
Although static leakage power has historically been small compared
to dynamic switching power, the situation is changing as the feature sizes
decrease. The feature size of a chip process technology refers to the
smallest size of transistors, wires, or gaps between them that can be
created on the chip die with that process technology. As these sizes
decrease, the capacitance of the system of transistors, $C_L^g$, is lowered. This
reduced capacitance decreases the switching time of these transistors (or
gate delay), resulting in faster logic performance accommodating faster
processor clock frequencies. The gate activity factor approximates the
average switching activity of the circuit for each clock edge. The supply
voltage, Vdd, is lowered to reduce interference with the ever-closer
neighbouring components and to meet thermal requirements. Lowering Vdd
greatly reduces dynamic power consumption since the dynamic power is
proportional to the square of this supply voltage. However, lowering the
supply voltage in turn often requires a lowering of the threshold voltage, the
voltage level at which transistors switch, to maintain fast clock rates.
Lowering the threshold voltage and moving the threshold closer to ground
causes a disproportionate increase in the static leakage current and thus
an increase in static power consumption [22].
For a fixed task, decreasing the clock rate reduces the power, but
not the energy. The energy to execute a workload is equal to the average
power multiplied by the execution time for the workload. For BOPES
devices, battery life is more important than actual power consumption.
Hence energy is the proper metric.
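The distinction between power and energy can be illustrated with a short sketch built on Equation 2.2. The workload size, supply voltage and effective switched capacitance below are hypothetical, and the supply voltage is deliberately held constant so that only the clock rate changes.

```python
def run_task(cycles, freq_hz, vdd, switched_cap):
    """Return (power in W, time in s, energy in J) for a fixed task.

    switched_cap stands for the sum of A_g * C_L^g over all gates (Equation 2.2).
    """
    power = 0.5 * freq_hz * vdd ** 2 * switched_cap
    time = cycles / freq_hz
    return power, time, power * time


# Hypothetical workload: 1e8 cycles, 1 nF effective switched capacitance, 3.3 V.
fast = run_task(1e8, 100e6, 3.3, 1e-9)
slow = run_task(1e8, 50e6, 3.3, 1e-9)
print("100 MHz:", fast)
print(" 50 MHz:", slow)
```

Halving the clock rate halves the power but doubles the execution time, so the energy drawn from the battery for the task is unchanged unless the lower clock rate also allows a lower Vdd.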
2.2 INSTRUCTION SET ARCHITECTURE (ISA)
The features that are built into an architecture's instruction set are
commonly referred to as the Instruction Set Architecture or ISA. The ISA
defines such features as the operations that can be used by the
programmers to create programs under that architecture, the operands
(data) that can be accepted and processed by architecture, the storage, the
addressing modes used to gain access to and process operands, and
handling of interrupts. These features are important because an ISA
implementation is a determining factor in defining important characteristics
of an embedded design, such as performance, design time, available
functionality, and cost. In the embedded domain, it used to be true that
minimizing gates was the most important consideration of an ISA design
[7]. This is what led to many of the idiosyncrasies of early DSP designs.
Advances in VLSI technologies have changed this, and most of the
embedded world can now afford enough complexity to allow much more
regular and orthogonal instruction sets.
2.2.1 Instruction Types and Operations
The following information is provided either directly or indirectly by
an instruction [9]:
1. Operation code (opcode): Nature of operation done by the
instruction
2. Data: Type of data - binary, decimal, character etc.
3. Operand location: Memory, register etc.
4. Operand addressing: Method of specifying the operand location
(address)
5. Instruction length: Size - one byte, two bytes etc.
6. Number of address fields: zero address, single address, two
address etc.
Two computers of different architectures do not have the same
instruction set. Almost every architecture provides certain unique
instructions that ease the burden of compiler/programmer or the hardware
design. Based on the operations performed by the instructions, it is
common to classify the instructions into following types:
1. Data transfer instructions: These move data from one
register/memory location to another.
2. Arithmetic instructions: These perform arithmetical operations.
3. Logical instructions: These perform Boolean logical operations.
4. Control transfer instructions: These modify program execution
sequence.
5. Input/output (I/O) instructions: These transfer information
between external peripherals and system nucleus
(CPU/memory)
6. String manipulation instructions: These manipulate strings of
byte, word, double word etc.
7. Translate instructions: These convert the data from one
format to another.
8. Processor control instructions: These control the processor
operation.
Table 2.1 lists sample instructions for each of the above eight types
and corresponding actions done by the processor for these instructions.
Table 2.1: Sample instructions and processor actions

Instruction type | Instruction | Action by processor
Data transfer | MOVE | Transfer data from source location to destination location
Data transfer | LOAD | Transfer data from a memory location to a CPU register
Data transfer | STORE | Transfer data from a CPU register to a memory location
Data transfer | PUSH | Transfer data from the source to stack (top)
Data transfer | POP | Transfer data from stack (top) to the destination
Data transfer | XCHG | Exchange; swap the contents of the source and destination
Data transfer | CLEAR | Reset the destination with all 0's
Data transfer | SET | Set the destination with all 1's
Arithmetic | ADD | Add; calculate sum of two operands
Arithmetic | ADC | Add with carry; calculate the sum of operands and the 'carry' bit
Arithmetic | SUB | Subtract; calculate the difference of two numbers
Arithmetic | SUBB | Subtract with borrow; calculate the difference with 'borrow'
Arithmetic | MUL | Multiply; calculate the product of two operands
Arithmetic | DIV | Divide; calculate the quotient and remainder of two numbers
Arithmetic | NEG | Negate; change sign of operand
Arithmetic | INC | Increment; add 1 to operand
Arithmetic | DEC | Decrement; subtract 1 from operand
Arithmetic | SHIFTA | Shift arithmetic; shift the operand (left or right) with sign extension
Logical | NOT | Complement the operand
Logical | OR | Perform bit-wise logical OR of operands
Logical | AND | Perform bit-wise logical AND of operands
Logical | XOR | Perform bit-wise 'exclusive OR' of operands
Logical | SHIFT | Shift the operand (left or right) filling the empty bit positions as 0's
Logical | ROT | Rotate; shift the operand (left or right) with wrap-around
Logical | TEST | Test for specified condition and set or reset relevant flags
Control transfer | JUMP | Branch; enter the specified address into Program Counter (PC)
Control transfer | JUMPIF | Branch on condition; enter the specified address into PC only if the specified condition is satisfied; conditional transfer
Control transfer | JUMPSUB | CALL; save current 'program control status' (into stack) and then enter the specified address into PC
Control transfer | RET | RETURN; unsave (restore) 'program control status' (from stack) into PC and other relevant registers and flags
Control transfer | INT | Interrupt; create a software interrupt; save 'program control status' (into stack) and enter the address corresponding to the specified vector code into PC
Control transfer | IRET | Interrupt return; restore (unsave) 'program control status' (from stack) into PC and other relevant registers and flags
Control transfer | LOOP | Iteration; decrement the implied register by 1 and test for non-zero; if satisfied, enter the specified address into PC
Input-output | IN | Input; read data from the specified input port/device into specified or implied register
Input-output | OUT | Output; write data from specified or implied register into an output port/device
Input-output | TEST I/O | Read the status from I/O subsystem and set condition flags (codes)
Input-output | START I/O | Inform the I/O processor (or the data channel) to start the I/O program consisting of commands for the I/O operations
Input-output | HALT I/O | Inform the I/O processor (or the data channel) to abort the I/O program consisting of commands for the I/O operations under progress
String manipulation | MOVS | Move byte or word of string
String manipulation | LODS | Load byte or word of string
String manipulation | CMPS | Compare byte or word of strings
String manipulation | STOS | Store byte or word of string
String manipulation | SCAS | Scan byte or word of string
Translate | XLAT | Translate; convert the given code into another by table lookup
Translate | PACK | Convert the unpacked decimal number into packed decimal
Translate | UNPACK | Convert the packed decimal number into unpacked decimal
Processor control | HLT | Halt; stop instruction cycle (processing)
Processor control | STI (EI) | Set/enable interrupt; sets interrupt enable flag to '1', so as to allow maskable interrupts
Processor control | CLI (DI) | Clear/disable interrupt; resets interrupt enable flag to '0' so as to ignore maskable interrupts
Processor control | WAIT | Freeze instruction cycle till a specified condition, such as an input signal becoming active, is satisfied
Processor control | NOOP | No operation; no action
Processor control | ESC | Escape; the next instruction after the ESC is to be skipped since it is meant for the coprocessor
Processor control | LOCK | Reserve the bus, and hence the memory, till the next instruction, following the LOCK instruction, is executed/completed
Processor control | CMC | Complement 'carry' flag
Processor control | CLC | Clear 'carry' flag
Processor control | STC | Set 'carry' flag
2.2.2 Operation codes
There are a number of ways to allocate opcodes to an instruction
[11]. The design issue is to reduce the number of bits in the instruction
(small bit budget) while providing a large number of opcodes for a rich
instruction set. The following three design techniques have been used to meet
these requirements:
1. A fixed-length opcode allocated in variable-length instructions, as in
IBM S370 (Figure. 2.1)
2. A variable-length opcode provided by opcode expansion, allocated
in variable-length instructions, as in Intel x86 (Figure. 2.2)
3. A variable-length opcode provided by opcode expansion, allocated
in a fixed-length instruction, as in MIPS32 (Figure. 2.3).
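The third technique can be illustrated with a small decoder sketch for MIPS32. In MIPS32 the primary opcode occupies bits 31 to 26 of the 32-bit instruction; when it is zero, the operation is selected by the 6-bit function field in bits 5 to 0, which is how the opcode is expanded inside a fixed-length word. Only a handful of encodings are listed, and the table and function names are illustrative.

```python
# A few MIPS32 encodings; the tables are deliberately incomplete.
PRIMARY_OPCODES = {0x04: "beq", 0x08: "addi", 0x23: "lw", 0x2B: "sw"}
SPECIAL_FUNCTS = {0x20: "add", 0x22: "sub", 0x24: "and", 0x25: "or"}


def decode_mnemonic(word):
    """Return the mnemonic for a 32-bit MIPS32 instruction word."""
    opcode = (word >> 26) & 0x3F          # bits 31..26
    if opcode == 0:                        # expanded opcode: look at funct
        funct = word & 0x3F                # bits 5..0
        return SPECIAL_FUNCTS.get(funct, "unknown")
    return PRIMARY_OPCODES.get(opcode, "unknown")


# add $t0, $t1, $t2  ->  opcode 0, rs=9, rt=10, rd=8, shamt=0, funct=0x20
add_word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | 0x20
# lw $t0, 4($sp)     ->  opcode 0x23, rs=29, rt=8, immediate=4
lw_word = (0x23 << 26) | (29 << 21) | (8 << 16) | 4
print(decode_mnemonic(add_word), decode_mnemonic(lw_word))
```

The register-type instructions share one primary opcode and are told apart by the function field, so the instruction length stays fixed while the effective opcode length varies.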
2.2.3 Addressing modes
Addressing mode is the method by which the location of an
operand is specified within an instruction. Table 2.2 defines popular
addressing modes. A given ISA may not support all the addressing modes.
Table 2.2: Addressing modes and mechanisms

Addressing mode | Mechanism | Remarks/examples
Implied addressing | Operand address is not specified explicitly | RET and IRET
Immediate addressing | Operand is given in the instruction | Fast operand fetch but operand size is limited as it increases instruction length
Direct addressing (Absolute addressing) | Operand is in a memory location; its address is given in the instruction | One memory access required to get the operand
Indirect addressing | Operand is in a memory location; its address is also in memory; address of the location containing the operand address is given in the instruction | Two memory accesses are required to get the operand
Register direct addressing | Operand is in a register; the register address/number is given in the instruction | Faster operand fetch compared to direct addressing
Register indirect addressing | Operand is in memory; its address is in a register; address/number of the register is given in the instruction | Faster operand fetch than indirect addressing
Base register addressing | Operand is in memory; its address is specified in two parts; the instruction gives an offset number and also specifies the base register; the offset (integer number) has to be added to the base register contents | Useful in relocation of programs
PC-relative addressing | Similar to base register addressing, but the register is always the PC | Mostly used by branch instructions
Index addressing | The operand is in memory; the instruction gives an address, and the index register contains an offset number; the address and the offset number are added to get the operand address | Convenient for indexing arrays
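The mechanisms in Table 2.2 can be sketched as operand-fetch rules on a toy machine. The function below is an illustration only: the mode names, the dictionary-based registers and memory, and the layout of the `field` argument are all assumptions made for this example and do not correspond to any particular ISA. Implied addressing is omitted because it carries no operand field.

```python
def fetch_operand(mode, field, regs, mem, pc=0):
    """Return the operand selected by `field` under the given addressing mode.

    regs maps register names to values; mem maps addresses to values.
    """
    if mode == "immediate":            # operand is in the instruction itself
        return field
    if mode == "direct":               # one memory access
        return mem[field]
    if mode == "indirect":             # two memory accesses
        return mem[mem[field]]
    if mode == "register":             # operand is in a register
        return regs[field]
    if mode == "register_indirect":    # register holds the operand address
        return mem[regs[field]]
    if mode == "base":                 # base register contents + offset
        base_reg, offset = field
        return mem[regs[base_reg] + offset]
    if mode == "pc_relative":          # like base addressing, with the PC as base
        return mem[pc + field]
    if mode == "index":                # address in instruction + index register
        address, index_reg = field
        return mem[address + regs[index_reg]]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"R1": 100, "R2": 3}
mem = {24: 5, 50: 100, 100: 7, 104: 9, 203: 11}
print(fetch_operand("indirect", 50, regs, mem))        # reads mem[50], then mem[100]
print(fetch_operand("index", (200, "R2"), regs, mem))  # reads mem[200 + 3]
```

The number of dictionary lookups into `mem` in each branch mirrors the memory-access counts given in the Remarks column of the table.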
Figure. 2.1: IBM S370 Instruction Formats
Figure. 2.2: INTEL Pentium Pro Instruction Formats
Figure. 2.3: MIPS32 Instruction Formats
2.2.4 Data types
Application programs may use various types of data depending on
the problem. A machine language program can operate either on numeric
data or on non-numeric data. The numeric data can be either binary or
decimal numbers. The non-numeric data can be any of the following types:
characters, addresses, and logical data. All non-binary data is represented
inside a computer in binary coded form. The binary data can be
represented either as a fixed-point or a floating-point number. In fixed-point
number representation, the position of the binary point is rigidly fixed in
one place. In floating-point number representation, the binary point's
position can be anywhere. The fixed-point numbers are known as integers
whereas the floating-point numbers are known as real numbers. Arithmetic
operations on fixed-point numbers are simple and they require minimum
hardware circuits. The floating-point arithmetic is complex and requires
extensive hardware circuits. Compared to fixed-point numbers, the floating-
point numbers have two advantages:
1. The range of values that can be represented in
floating-point number representation is wider. Hence it is
useful in dealing with very small or very large numbers.
2. The floating-point number representation leads to better
accuracy in arithmetic operations.
2.2.5 ISA Models
There are several different ISA models that architectures are based
upon, each with its own specifications for the various features. The most
commonly implemented ISA models are application-specific, general
purpose and instruction level parallel. Application-Specific ISA Models
define processors that are intended for specific embedded applications,
such as processors made only for TVs. General-purpose ISA models are
typically implemented in processors targeted to be used in a wide variety of
systems, rather than only in specific types of embedded systems. CISC
model and RISC model are the common types of general-purpose ISA
architectures implemented in embedded processors. Many current
processor designs fall under the CISC or RISC category primarily because
of their heritage. RISC processors have become more complex, while CISC
processors have become more efficient to compete with their RISC
counterparts, thus blurring the line between the definition of a RISC versus
a CISC architecture. Technically, these processors have both RISC and
CISC attributes, regardless of their definitions. Instruction-level parallelism
ISA architectures are similar to general-purpose ISAs, except that they
execute multiple instructions in parallel, as the name implies. Examples of
instruction-level parallelism ISAs [9] include SIMD model, Superscalar
model, and VLIW model.
2.3 PROCESSOR PERFORMANCE AND ADVANCED ARCHITECTURES
The performance of a processor is measured by the amount of time
taken by the processor to execute a program. The processor performs an
instruction cycle for each instruction. Table 2.3 illustrates the actions taken
at various steps of the instruction cycle for ADD instruction. Elementary
operations performed by the processor during instruction cycle execution
are known as micro-operations. A given micro-operation takes place when
the corresponding control signal is issued by the processor. Table 2.4
illustrates some sample micro-operations performed by the processor. The
time taken for executing different instructions is not the same. Hence the
types of instructions executed in a program and the number of instructions
executed by the processor, while running the program, decide the time
taken by the processor to execute a program.
Table 2.3: Instruction cycle steps and actions for ADD instruction

Sl. No. | Step | Action responsibility | Remarks | Parameter affecting performance
1 | Instruction fetch | Control unit; external action | Fetches next instruction from main memory | Memory access time
2 | Instruction decode | Control unit; internal action | Analyses opcode pattern in the instruction and identifies the exact operation specified | Decode time
3 | Operand fetch | Control unit; external (memory) or internal action depending on the location of operands | Determines the operand addresses and then fetches the operands, one by one, from main memory or CPU registers and supplies them to ALU | (1) Operand address calculation time; (2) Register/memory access time
4 | Execute (ADD) | ALU; internal action | Specified arithmetic operation is done | Addition time
5 | Result store | Control unit; external or internal action | Stores the result in memory or registers | Register/memory access time
Table 2.4: Sample micro-operations

Sl. no. | Control signal | Micro-operation | Remarks
1 | MAR ← PC | Contents of PC are copied (transferred) to Memory Address Register (MAR) | The first micro-operation in instruction fetch
2 | PC ← PC + 4 | Contents of PC are incremented by 4 | The PC always points to next instruction address
3 | IR ← MBR | Contents of Memory Buffer Register (MBR) are copied to Instruction Register (IR) | The last micro-operation in instruction fetch
4 | MBR ← R2 | Contents of R2 register are copied to MBR | The first micro-operation in result store
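The fetch-related micro-operations of Table 2.4 can be strung together in a small sketch. The `Cpu` class and the dictionary used as memory are hypothetical, and the memory read step (MBR ← Memory[MAR]) is an assumed intermediate micro-operation that the table does not list.

```python
class Cpu:
    """Toy register set for illustrating instruction-fetch micro-operations."""

    def __init__(self, memory):
        self.memory = memory   # maps byte addresses to 32-bit instruction words
        self.pc = 0            # Program Counter
        self.mar = 0           # Memory Address Register
        self.mbr = 0           # Memory Buffer Register
        self.ir = 0            # Instruction Register

    def fetch(self):
        self.mar = self.pc                 # MAR <- PC (first micro-operation)
        self.mbr = self.memory[self.mar]   # MBR <- Memory[MAR] (assumed step)
        self.pc = self.pc + 4              # PC <- PC + 4
        self.ir = self.mbr                 # IR <- MBR (last micro-operation)
        return self.ir


cpu = Cpu({0: 0x012A4020, 4: 0x8FA80004})
first = cpu.fetch()
second = cpu.fetch()
print(hex(first), hex(second), cpu.pc)
```

After each fetch the PC already points to the next instruction address, as noted in the Remarks column.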
The following equation is commonly used for expressing a
computer's performance ability:
$$\frac{\text{time}}{\text{program}} = \frac{\text{time}}{\text{cycle}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{instructions}}{\text{program}} \qquad (2.3)$$
In other words, the execution time is given by the following equation:
$$T_p = \frac{N_{ie} \times CPI}{F} \qquad (2.4)$$
where Nie is the number of instructions executed (and not the number of
instructions present in the program), CPI is the average number of clock
cycles needed for an instruction, and F is the clock frequency. The CISC
approach attempts to minimize the number of instructions per program,
sacrificing the number of cycles per instruction. RISC does the opposite,
reducing the cycles per instruction at the cost of the number of instructions
per program.
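Equation 2.4 and the CISC/RISC trade-off can be tried out with a short sketch. The instruction counts and CPI values below are invented for illustration and do not describe any real processor.

```python
def execution_time(n_instructions, cpi, freq_hz):
    """Execution time T_p = N_ie * CPI / F (Equation 2.4), in seconds."""
    return n_instructions * cpi / freq_hz


# Hypothetical program on two hypothetical 100 MHz machines.
t_cisc = execution_time(1_000_000, 4.0, 100e6)   # fewer instructions, higher CPI
t_risc = execution_time(1_500_000, 1.2, 100e6)   # more instructions, lower CPI
print(t_cisc, t_risc)
```

With these assumed numbers the machine with the lower CPI finishes first despite executing 50% more instructions; a different instruction mix could reverse the outcome, since only the product of the two factors matters.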
For any specific computer, there are two simple measurements that
give us an idea about its performance:
1. Response time or execution time: This is the time taken by the
computer to execute a given program – from the start to the
end of completion of the program. The response time for a
program is different for different computers.
2. Throughput: This is the work done (total number of programs
executed) by the computer during a given period of time.
2.3.1 Instruction Pipelining
In a simple processor (scalar, non-pipelined), the steps of an
instruction cycle are performed sequentially, one after the other, and
successive instructions are also executed sequentially, one after
the other. Instruction pipelining (Figure. 2.4) is a technique in which
the execution of successive instructions is overlapped. The goal is to
increase the total number of instructions executed in a given period of time.
In a pipelined processor, different sections of the processor perform
different steps of the instruction cycle for different instructions at a given
time. Each step is called a pipe stage. All the pipe stages together form a
pipe.
Figure. 2.4: A six stage instruction pipeline
In a six stage instruction pipeline, six instructions can be active
simultaneously. If it is assumed that all instructions are independent of
other instructions, then for each clock cycle, one instruction can be
completed due to overlap of instruction cycles of consecutive instructions.
In practice, three types of hazards - data, structural, and control - reduce
the pipeline efficiency [9].
Dependencies between instructions are a property of programs. If
two instructions are dependent, they should not be executed
simultaneously. They may be partially overlapped. Two instructions may be
either directly data dependent or indirectly data dependent through another
instruction due to chain of dependencies. In case of dependence, there are
two possible solutions:
1. Preserving the dependence but preventing a hazard
2. Removing the dependence by transforming the object code.
Techniques used for detecting and preventing hazards should
preserve program order so that the overall behaviour and results of the
program are not affected.
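Direct and indirect data dependence, as described above, can be checked with a simplified sketch. The `Instr` tuple and register names are hypothetical, only read-after-write dependences are considered, and the case where an intervening instruction overwrites a register is ignored for brevity.

```python
from collections import namedtuple

# op: mnemonic, dest: destination register, srcs: tuple of source registers
Instr = namedtuple("Instr", "op dest srcs")


def directly_dependent(earlier, later):
    """True if `later` reads the register written by `earlier`."""
    return earlier.dest is not None and earlier.dest in later.srcs


def depends_on(program, later, earlier):
    """True if program[later] depends on program[earlier], directly or
    through a chain of intermediate instructions."""
    pending, seen = [later], set()
    while pending:
        k = pending.pop()
        for j in range(earlier, k):
            if j not in seen and directly_dependent(program[j], program[k]):
                if j == earlier:
                    return True
                seen.add(j)
                pending.append(j)
    return False


program = [
    Instr("add", "r1", ("r2", "r3")),   # 0
    Instr("sub", "r4", ("r1", "r5")),   # 1: reads r1 written by 0
    Instr("mul", "r6", ("r4", "r7")),   # 2: reads r4 written by 1
    Instr("or", "r8", ("r9", "r10")),   # 3: independent of the others
]
print(depends_on(program, 1, 0), depends_on(program, 2, 0), depends_on(program, 3, 0))
```

Instruction 2 does not read r1 itself, yet it depends on instruction 0 through instruction 1, which is the chain of dependencies mentioned in the text.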
2.3.2 RISC Instructions and Pipelining
Though pipelining can be implemented in both CISC and RISC types
of processors to enhance performance, it is simpler to design a pipelined
RISC processor. The following properties of RISC architecture help in
simplifying the pipeline design:
1. All instructions are of equal size, say 4 bytes.
2. Instruction formats are not many; just 1 to 3.
3. Arithmetic and other operations on data always have operands
(data) in registers (not in memory).
4. Only load and store instructions can access memory.
Generally RISC processors have three types of instructions: ALU
instructions, Load and store instructions and Branch and Jump type
instructions. In ALU Instructions, the operands are available in registers.
On completion, the results should be stored in registers. In load and store
instructions, one operand is in register and the other operand is in memory.
The address of the memory operand is generally specified as the sum of
two parts: the base register contents and the offset indicated by the
immediate field in the instruction. In branches and jumps, the branch
conditions are usually specified in one of the two ways:
1. Comparison of two items in registers
2. Condition bits or condition codes
Unconditional jumps are present in almost all RISC processors.
Traditional RISC pipeline has five stages as shown in Figure. 2.5 (a).
Figure. 2.5 (b) shows timing diagram while executing 6 instructions over 10
clock cycles. Figure. 2.5 (c) shows the RISC pipeline as a series of data
paths shifted in time.
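The figure of 6 instructions over 10 clock cycles follows from a simple count: in an ideal pipeline with k stages, n instructions need k + n - 1 cycles. The sketch below assumes no hazards or stalls and that every instruction passes through all five stages; the stage abbreviations are those used in Table 2.5.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_cycles(n_instructions, n_stages=len(STAGES)):
    """Clock cycles needed by an ideal pipeline with no stalls."""
    return n_stages + n_instructions - 1


def timing_diagram(n_instructions):
    """One row per instruction; column c holds the stage occupied in cycle c + 1."""
    total = pipeline_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = ["."] * total
        for s, stage in enumerate(STAGES):
            row[i + s] = stage       # instruction i enters the pipe in cycle i + 1
        rows.append(row)
    return rows


for row in timing_diagram(6):
    print(" ".join(f"{cell:>3}" for cell in row))
```

Each printed row is shifted one column to the right of the previous one, which is the overlap that the timing diagram depicts.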
Figure. 2.5 (a): Five stage pipeline
Figure. 2.5 (b): Timing Diagram
[Figure 2.5 (c) content: six instructions, each passing through the datapath sequence CC, R, ALU, DC, R, shifted in time. Horizontal axis: time in clock cycles (1 to 10). Vertical axis: program execution sequence. CC - Code Cache (instruction memory); R - Registers; ALU - Arithmetic Logic Unit; DC - Data Cache (data memory).]
Figure. 2.5 (c): RISC Pipeline as a series of datapaths
Tradeoffs in micro architecture have changed somewhat since the
RISC five-stage pipeline [7]. In the early RISC days, transistor count
limitations convinced the designers to reuse the ALU for address
computations. Today, transistors are almost free of cost but wires are
expensive. Each additional pipeline stage has a marginal benefit in terms of
spreading out the work in smaller steps that may allow a lower cycle time,
and a marginal cost in terms of added design complexity and global
overheads. Table 2.5 defines the clock cycles, respective stages of
instruction cycle and micro operations. The actual numbers of clock cycles
required for different instructions are as follows:
Unconditional branch instruction: 2 (cycles 1 and 2)
Store instruction: 4 (cycles 1 to 4)
Any other instruction: 5 (cycles 1 to 5)
There are many alternate design options offering varying
performance levels. The designer chooses the best option taking into
account the hardware cost and required performance level.
There are two major problems in a practical pipeline:
1. Resource Conflict: Two different operations at two
sections/stages may need the same hardware resource in the
same clock cycle, due to overlapping of instructions. To resolve
this, multiple resources of the same type can be provided in the
hardware. This will increase the cost and hence should be
done judiciously.
2. Interference between adjacent stages: Two instructions in
different stages of the pipeline should not interfere with each
other. To resolve this, pipeline registers are used between
successive stages of the pipeline. The pipeline registers are
named indicating the stages linked by them such as IF/ID,
ID/EX, EX/MEM and MEM/WB. The result of any specific stage
is stored in the pipeline register at the end of a clock cycle.
During the next clock cycle, the contents of the pipeline register
serve as input to the next stage. In some cases, the result
generated by one stage may not be used as input to the next
stage. It may propagate through more than one stage. For
example, for a STORE instruction, the data to be stored is read from
the source register in the ID stage but it is written to memory only
in the MEM stage.
Table 2.5: Typical instruction cycle phases in RISC processors

Sl. no. | Clock cycle | Instruction cycle phase | Major micro operations | Hardware sections involved
1 | 1 | Instruction Fetch (IF) | a. Send PC contents to memory; b. Fetch the current instruction from memory; c. Increment PC by 4 to indicate the next instruction address | a. Cache memory
2 | 2 | Instruction Decode (ID); plus Register Read cycle | a. Decode the instruction; b. Read the contents of source registers; c. Compare the contents of registers (as preparation for certain instructions such as compare) | a. Instruction decoder; b. Registers; c. Adder / comparator
3 | 3 | Execution (EX); plus Effective address cycle | a. For ALU instruction, the specified operation is done by the ALU; b. For memory reference instruction (Load/store), the effective address is calculated by ALU by adding the base register contents and the offset; c. For branch instruction, testing of branch condition is done | a. ALU; b. ALU; c. ALU
4 | 4 | Memory Access (MEM); plus branch completion | a. For load instruction, memory read operation from the effective address is done; b. For store instruction, memory write operation at the effective address, storing the contents of source register; c. For branch instruction, the branch address is entered in PC if branch occurs | a. Cache memory; b. Cache memory
5 | 5 | Write-back (WB) | a. The result is stored in the destination register for load instruction and ALU instruction | a. Registers
2.3.3 Superscalar processor
In a scalar pipelined processor, though there are multiple
instructions simultaneously active in the pipeline, there is only one
execution unit/functional unit. Hence at a given time, only one instruction
can be in the execution unit. In a superscalar architecture, there are
multiple pipelines in the processor and hence two or more instructions can
be executed simultaneously. In other words, in a superscalar processor,
the same type of operation (add, shift etc.) can be executed simultaneously in
a single clock cycle on multiple pipelines for different instructions. Figure. 2.6
shows the organization of a superscalar processor with two pipelines [9]. In
some superscalar processors, instruction sequencing is static (at
compilation time) but in the majority of superscalar processors, it is dynamic (at
run time). In a dynamic superscalar processor the control unit is
complex, whereas in a static superscalar processor the complexity lies in
the compiler.
2.3.4 Very Long Instruction Word (VLIW) Processor
The VLIW architecture exploits Instruction Level Parallelism (ILP)
with close cooperation between the compiler and the processor. The
processor has multiple functional units similar to a dynamic superscalar
processor but scheduling is done by the compiler that groups several
independent operations into a very long instruction word. Each VLIW has
multiple fields/slots with each slot containing one RISC like operation. Each
operation corresponds to a functional unit. During the execution of a VLIW,
the processor performs all the operations in parallel in different functional
units. Figure. 2.7 illustrates the principle of a VLIW processor [9].
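The idea of a very long instruction word made of fixed slots can be sketched as bit packing. The slot names and the sizes (eight 32-bit operations in a 256-bit word) are taken from the labels of Figure 2.7; the operation encodings used below are placeholders, not a real VLIW encoding.

```python
SLOTS = ["add", "mul", "load", "store", "cmp", "branch", "mulfl", "addfl"]
SLOT_BITS = 32


def pack_vliw(operations):
    """Pack one 32-bit operation per slot into a single 256-bit word.

    The first operation ends up in the most significant 32 bits.
    """
    if len(operations) != len(SLOTS):
        raise ValueError("one operation per slot is required")
    word = 0
    for op in operations:
        if not 0 <= op < (1 << SLOT_BITS):
            raise ValueError("operation does not fit in a 32-bit slot")
        word = (word << SLOT_BITS) | op
    return word


def unpack_vliw(word):
    """Split a 256-bit word back into its eight 32-bit operations."""
    mask = (1 << SLOT_BITS) - 1
    n = len(SLOTS)
    return [(word >> (SLOT_BITS * (n - 1 - i))) & mask for i in range(n)]


ops = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]  # placeholder encodings
word = pack_vliw(ops)
for name, op in zip(SLOTS, unpack_vliw(word)):
    print(f"{name:>6}: {op:#010x}")
```

Because each slot position corresponds to one functional unit, the hardware can dispatch all eight operations without any run-time scheduling; choosing which operations may share a word is left to the compiler.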
[Figure 2.6 labels: I F unit; instruction queue (2 instructions); decode and dispatch; odd and even instruction paths to the execute units EU-1 and EU-2, each with OF, EX and SR stages; registers; result; write buffers; unified cache; cache memory; system bus; main memory. OF - Operand Fetch; IF - Instruction Fetch; EX - Execute; SR - Store Results; EU - Execute unit.]
Figure. 2.6: Superscalar Processor Organisation
[Figure 2.7 labels: (a) Inside VLIW Processor: instruction cache memory; instruction register (IR) with the slots add, mul, load, store, cmp, branch, mulfl, addfl; functional units (FUs): INT ALU, INT MUL/DIV, MAU 1, MAU 2, INT ALU, Branch unit, FLOAT MUL/DIV, FLOAT ADDER; integer RF; floating point RF; bus interface; data cache; system bus. (b) VLIW and one operation: a 256-bit VLIW with the slots add, mul, load, store, cmp, branch, mulfl, addfl, and one 32-bit operation (add R1 R2). IR - Instruction Register; FU - Functional Unit; RF - Register File; INT - Integer; MAU - Memory Addressing Unit.]
Figure. 2.7: VLIW Processor Organisation
63
2.3.5 Cache Memory
The cache memory is a small and fast intermediate buffer between
the processor and the main memory with the objective of reducing the
processor's waiting time during main memory access. The presence of
cache memory is transparent to application programs. Figure. 2.8 illustrates
the use of cache memory.
Figure. 2.8: Use of Cache memory
The main memory is conceptually divided into many blocks, each
containing a fixed number of consecutive locations. The cache memory is
organized as a number of lines, and the size of each line is the same as the
capacity of a main memory block. The cache operation is based on locality of
reference [23], a property inherent in programs. Most of the time, the
instructions or data needed next are available in main memory locations
that are physically close to the location currently being accessed. There are
two kinds of behaviour pattern:
1. Temporal locality: A recently accessed memory location is
likely to be accessed again.
2. Spatial locality: The neighbouring location to the recently
accessed memory location is likely to be accessed.
In view of these two properties, while reading a location from main
memory, the content of entire block is transferred and stored in cache
memory. There are more blocks in main memory than the number of lines
in cache memory. Hence a mapping function is followed by the cache
controller to systematically map any main memory block to one of the
cache lines. When the processor needs a memory operand, the cache
controller checks the cache memory to find out if the current main memory
address is already mapped onto cache. If it is mapped, it means the
required item is available in cache memory and this condition is called
'cache hit'. Then the required information is read from cache memory.
On the other hand, if the current main memory address is not
mapped in cache memory, the required information is not available in
cache memory and this situation is known as 'cache miss'. In this case, the
entire block containing the main memory address is brought into the cache
memory. The time taken to bring the required item from the main memory
and supply it to the processor is known as 'miss penalty'. The hit rate (also
known as hit ratio) is the ratio of the number of accesses that result in a
'cache hit' to the total number of accesses.
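The hit, miss and hit rate bookkeeping described above can be illustrated with a small simulation. The sketch assumes one particular mapping function, direct mapping (block number modulo the number of lines), which the text does not prescribe. The class name DirectMappedCache and its parameters are hypothetical, and each line stores the full block number rather than a shortened tag.

```python
class DirectMappedCache:
    """Toy model of a direct-mapped cache (an assumed mapping function)."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines   # block number held by each line
        self.hits = 0
        self.accesses = 0

    def access(self, address):
        """Return True on a cache hit, False on a cache miss."""
        block = address // self.block_size   # main memory block number
        line = block % self.num_lines        # mapping function
        self.accesses += 1
        if self.tags[line] == block:
            self.hits += 1
            return True
        self.tags[line] = block              # miss: bring in the whole block
        return False

    def hit_rate(self):
        """Fraction of accesses so far that were cache hits."""
        return self.hits / self.accesses if self.accesses else 0.0
```

With four lines of 16 locations each, reading addresses 0 to 63 in sequence misses once per block and hits on the remaining 15 locations, which gives a hit rate of 60/64. This is the effect of spatial locality.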
The cache memory is of two types: Unified cache or common cache,
and Split cache. The unified cache stores both instructions and data. In
split cache, there is a separate instruction cache (also known as code
cache) and data cache. Some computers use a two level or three level
cache memory system. The cache immediately next to the processor is
known as level 1 cache or primary cache. The next level cache is called a
level 2 cache or secondary cache. Most microprocessors are incorporating
multi-level caches on-chip.
2.3.6 Virtual Memory
The virtual memory concept facilitates the execution of large programs in
systems with smaller physical memory. Virtual memory is desirable in the
following two cases:
1. The logical memory space of the processor is small.
2. The physical main memory space has to be kept small to
reduce the cost though the processor has a large logical memory
space.
Figure. 2.9 illustrates the concept of virtual memory. In a virtual
memory system, the OS automatically manages long programs by
storing the entire program on a large hard disk. At a given time, only some
portions of the program are stored in main memory. During the execution of
the program, different portions of the program are swapped between the
main memory and the hard disk on a need basis. The program does not address
the physical memory directly.
CM - Cache memory; optional unit.
Figure. 2.9: Virtual memory concept
While referring to an instruction or operand, the program provides the logical
address, and the virtual memory hardware (also known as the memory
management unit or MMU) in the processor translates it into the equivalent
physical memory address [9]. There are two popular methods of virtual
memory implementation: paging and segmentation. In paging, the system
software divides the program into pages of equal size. In segmentation,
the machine language programmer organizes the program into different
segments which need not be of the same size. Figure. 2.10 illustrates the
mechanism of virtual memory.
Figure. 2.10: Virtual memory mechanism
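The address translation performed under paging can be sketched as follows. This is a minimal illustration, not a model of any particular MMU. The page size of 4096 bytes, the dictionary used as a page table, and the names translate and PageFault are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # bytes per page (assumed value)


class PageFault(Exception):
    """Raised when the referenced page is not in main memory."""


def translate(logical_address, page_table):
    """Translate a logical address to a physical address.

    page_table maps a page number to a frame number for the pages
    currently resident in main memory.
    """
    page, offset = divmod(logical_address, PAGE_SIZE)
    if page not in page_table:
        raise PageFault(page)   # the OS would load the page from disk
    return page_table[page] * PAGE_SIZE + offset
```

The offset within the page is carried over unchanged; only the page number is replaced by a frame number. A missing entry corresponds to a portion of the program that is still on the hard disk.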
2.3.7 Multicore CPU
Building a high performance computer system by linking together
several low performing computers is a standard technique of achieving
parallelism. This idea is the basis for development of multiprocessor
systems. Designing a microcomputer using multiple single-chip
microprocessors has been a cost-effective strategy for several years in the
past. The latest trend is the design of multicore microprocessors, resulting
in a quantum change in the way multiprocessor systems are developed and
used for various applications [10]. Figure. 2.11 illustrates the concept of
multicore with four cores in a single die. Figure. 2.12 illustrates the
organization of the SPARC64 VII, a popular quad-core CPU.
Figure. 2.11: A Quad-core CPU
Figure. 2.12: SPARC64 VII Processor
Chip Multiprocessing technology is an architecture in which multiple
physical cores are integrated on a single processor module. Each physical
core runs a single execution thread of a multithreaded application
independently from other cores at any given time. With this technology,
multi-core processors offer several times the performance of single-core
modules. The ability to process multiple instructions at each clock cycle
provides the performance advantage, but improvements also result from
the short distances and fast bus speeds between chips as compared to
traditional CPU to CPU communication in a multiprocessor system.
2.4 EMBEDDED PROCESSORS
Processors are the main functional units of an embedded system,
and are primarily responsible for processing instructions and data. An
embedded system contains at least one master processor, acting as the
central controlling device, and can have additional slave processors that
work with and are controlled by the master processor. These slave
processors may either extend the instruction set of the master processor or
act to manage buses and input/output (I/O) devices. The complexity of the
master processor usually determines whether it is classified as a
microprocessor or a microcontroller. Traditionally, microprocessors contain
a minimal set of integrated memory and I/O components, whereas the
microcontrollers have most of the system memory and I/O components
integrated on the chip. However, these traditional definitions are becoming
somewhat inaccurate in view of convergence taking place in recent
processor designs. There are literally hundreds of embedded processors
available and these can be grouped into various architectures [6]. What
differentiates one processor group's architecture from another is the set of
machine code instructions that the processors within the architecture group
can execute. Processors are considered to be of the same architecture
when they can execute the same set of machine code instructions. Table
2.6 lists some examples of real-world processors and the architecture
families they fall under. Table 2.7 lists the merits and demerits of different
types of processors that can be embedded in a complex embedded system [8].
Table 2.6: Typical Embedded Architectures and Processors
Architecture | Processor | Manufacturer
AMD | Au1xx | Advanced Micro Devices
ARM | ARM7, ARM9, ... | ARM, ...
ColdFire | 5282, 5272, 5307, 5407, ... | Motorola/Freescale, ...
M32/R | 32170, 32180, 32182, 32192, ... | Renesas/Mitsubishi, ...
MIPS32 | R3K, R4K, 5K, 16, ... | MT14kx, IDT, MIPS Technologies, ...
NEC | Vr55xx, Vr54xx, Vr41xx | NEC Corporation, ...
PowerPC | 82xx, 74xx, 8xx, 7xx, 6xx, 5xx, 4xx | IBM, Motorola/Freescale, ...
SuperH (SH) | SH3, SH4 | Hitachi, ...
SHARC | SHARC | Analog Devices, Transtech DSP, Radstone, ...
strongARM | strongARM | Intel, ...
SPARC | UltraSPARC II | Sun Microsystems, ...
TMS320C6xxx | TMS320C6xxx | Texas Instruments
x86 | X86 [386, 486, Pentium (II, III, IV) ...] | Intel, Transmeta, National Semiconductor, Atlas, ...
Tricore | Tricore1, Tricore2, ... | Infineon, ...
Table 2.7: Processor types in Complex Embedded Systems
Processor type | Application | Advantage | Disadvantage
General purpose microprocessor | When intensive computations are required and large embedded software is located in the external memory cores or chips | No engineering cost in designing the processor | Additional redundant execution units that are not needed in the given system design
Microcontroller | Used with internal memory, devices and peripherals and when embedded software is located in the internal ROM or flash memory | No engineering cost in designing the processor | Additional manufacturing costs and redundant application units which are not needed in the given system design
DSP | Used with signal processing-related instructions for filters, image, audio, and video and CODEC applications | No engineering cost involved in designing the signal processor | Manufacturing cost may be high
Single purpose processors and application specific system processor | Control I/O and bus operations and peripherals and devices | They support other processing units in the system and execute specific hardware processes fast | In-house engineering cost of development, royalty payments for an IP core of processor and time-to-market cost
Multicore processor | To significantly enhance the performance of the system | Reduced engineering cost | Increased manufacturing cost
Accelerator | To accelerate the execution of codes. A floating point coprocessor accelerates mathematical operations and Java accelerator accelerates Java code execution. | Increases performance by co-processing with the main processor | Increased engineering cost of development or royalty payments for the IP core of processor
2.5 EMBEDDED SYSTEM ARCHITECTURES
Embedded computer systems range from everyday machines - most
of the microwaves and washing machines, printers, network switches, and
automobiles - to handheld digital devices (such as PDAs, cell phones, and
music players) to videogame consoles and digital set-top boxes. Except in
some applications such as PDAs, in many embedded applications the only
programming occurs at the developer's site, in connection with the initial loading
of the application code or a later software upgrade of that application. Thus,
the application is carefully tuned for the processor and system [3].
Embedded systems often process information in different ways from
general-purpose processors. Typically these applications include deadline-
driven constraints—so-called real-time constraints. In these applications, a
particular computation must be completed within a certain time limit, failing
which the system will malfunction. A real-time performance requirement is
one where a segment of the application has an absolute maximum
execution time that is allowed. For example, in a digital set-top box the time
to process each video frame is limited, since the processor must accept
and process the frame before the next frame arrives. Systems with such
requirements are typically called hard real-time systems. In some applications, a more liberal requirement exists:
the average time for a particular task is constrained as well as is the
number of instances when some maximum time is exceeded. Such
approaches (typically called soft real-time) arise when it is possible to
occasionally miss the time constraint on an event, as long as not too many
are missed. Real-time performance tends to be highly application
dependent.
Embedded system applications typically involve processing
information as signals that may be an image, a motion picture composed of
a series of images, a control sensor measurement, and so on. Signal
processing requires specific computation that many embedded processors
are optimized for.
Two other key characteristics exist in many embedded applications:
the need to minimize memory and the need to minimize power. The
importance of memory size translates to an emphasis on code size, since
data size is dictated by the application. Some architectures have special
instruction set capabilities to reduce code size. Larger memories also mean
more power, and optimizing power is often critical in embedded
applications. Although the emphasis on low power is frequently driven by
the use of batteries, the need to use less expensive packaging (plastic
versus ceramic) and the absence of a fan for cooling also demand reduced
power consumption.
Often an application’s functional and performance requirements are
met by combining a custom hardware solution together with software
running on a standardized embedded processor core, which is designed to
interface to such special-purpose hardware. In practice, embedded
problems are usually solved by one of three approaches:
1. The designer uses a combined hardware/software solution that
includes some custom hardware and an embedded processor
core that is integrated with the custom hardware, often on the
same chip.
2. The designer uses custom software running on an off-the-shelf
embedded processor.
3. The designer uses a digital signal processor and custom
software for the processor.
Embedded systems are a very broad category of computing devices.
For example, the TI 320C55 DSP is a relatively “RISC-like” processor
designed for embedded applications, with very fine-tuned capabilities. On
the other end of the spectrum, the TI 320C64x is a very high-performance,
eight-issue VLIW processor for very demanding tasks. Media extensions
attempt to merge DSPs with some more general-purpose processing
abilities to make these processors usable for signal processing
applications. Hennessy and Patterson have examined [3] several case
studies, including the Sony PlayStation 2, digital cameras, and cell phones.
The PlayStation 2 performs detailed three-dimensional graphics, whereas a
cell phone encodes and decodes signals according to elaborate
communication standards. But both have system architectures that are very
different from general-purpose desktop or server platforms. In general,
architectural decisions that seem practical for general-purpose applications,
such as multiple levels of caching or out-of-order superscalar execution,
are much less desirable in embedded applications. This is due to chip area,
cost, power, and real-time constraints. The programming model that these
systems present places more demands on both the programmer and the
compiler for extracting parallelism.
2.5.1 Digital Signal Processor
A digital signal processor (DSP) is a special-purpose processor
optimized for executing digital signal processing algorithms [5]. Most of
these algorithms, from time-domain filtering (e.g., infinite impulse response
and finite impulse response filtering), to convolution, to transforms (e.g.,
fast Fourier transform, discrete cosine transform), to even forward error
correction (FEC) encodings, all have as their kernel the same operation: a
multiply-accumulate operation. Each of the transforms named above, for example, has as its core a sum of
products. To accelerate this, DSPs typically feature special-purpose
hardware to perform multiply-accumulate (MAC). A MAC instruction of
“MAC A, B, C” has the semantics of “A = A + B * C.” In some situations, the
performance of this operation is so critical that a DSP is selected for an
application based solely upon its MAC operation throughput. DSPs often
employ fixed-point arithmetic. In addition to MAC operations, DSPs often
also have operations to accelerate portions of communications algorithms.
An important class of these algorithms revolves around encoding and
decoding forward error correction codes—codes in which extra information
is added to the digital bit stream to guard against errors in transmission. At
one end of the DSP spectrum is the TI 320C55 architecture optimized for
low-power, embedded applications with a seven-stage pipelined CPU.
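The multiply-accumulate kernel described earlier in this section can be illustrated with a finite impulse response filter written as a chain of MAC steps. This is a sketch only: the function names mac and fir_filter are hypothetical, plain Python integers stand in for fixed-point values, and samples before the start of the input are assumed to be zero.

```python
def mac(acc, b, c):
    """Multiply-accumulate: returns acc + b * c (the "MAC A, B, C" semantics)."""
    return acc + b * c


def fir_filter(samples, coefficients):
    """Finite impulse response filter built from MAC steps.

    y[n] = sum over k of coefficients[k] * samples[n - k]; samples before
    the start of the input are treated as zero.
    """
    output = []
    for n in range(len(samples)):
        acc = 0
        for k, coeff in enumerate(coefficients):
            if n - k >= 0:
                acc = mac(acc, coeff, samples[n - k])
        output.append(acc)
    return output
```

Every output sample costs one MAC per filter coefficient, which is why the MAC throughput of a DSP largely determines how long a filter it can sustain at a given sample rate.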
The source of input data to DSP is some form of digitized signal, like
a photo image captured by a digital camera, a voice packet going through a
network router, or an audio clip played by a digital keyboard. As with
microcontrollers, DSPs also tend to incorporate many peripherals that are
useful in signal processing on a single IC. For example, a DSP device may
contain a number of analog-to-digital and digital-to-analog converters,
pulse-width modulators, direct memory access controllers, timers, and
counters.
2.5.2 Media Extensions
Media extensions are a middle ground between DSPs and
microcontrollers. These extensions add DSP-like capabilities to
microcontroller architectures at relatively low cost. Because media
processing is judged by human perception, the data for multimedia
operations are often much narrower than the 64-bit data word of modern
desktop and server processors. For example, floating-point operations for
graphics are normally in single precision, not double precision, and often at
a precision less than specified by IEEE 754. Rather than waste the 64-bit
arithmetic-logical units (ALUs) when operating on 32-bit, 16-bit, or even
8-bit integers, multimedia instructions can operate on several narrower data
items at the same time. Thus, a partitioned add operation on 16-bit data
with a 64-bit ALU would perform four 16-bit adds in a single clock cycle. The
extra hardware required is only to prevent carries between the four 16-bit
partitions of the ALU. For example, such instructions might be used for
graphical operations on pixels [10]. These operations are commonly called
single-instruction multiple data (SIMD) or vector instructions. Most graphics
multimedia applications use 32-bit floating-point operations.
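The partitioned add described above can be modelled in software to show what "preventing carries between partitions" means. The sketch below assumes four 16-bit lanes in a 64-bit word with wrap-around (non-saturating) arithmetic in each lane. The function name partitioned_add16 is hypothetical, and real media extensions perform this in hardware in a single operation rather than in a loop.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def partitioned_add16(x, y):
    """Add two 64-bit words as four independent 16-bit lanes.

    Each lane wraps around modulo 2**16; no carry crosses a lane
    boundary.
    """
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        a = (x >> shift) & LANE_MASK
        b = (y >> shift) & LANE_MASK
        result |= ((a + b) & LANE_MASK) << shift
    return result
```

For instance, adding 0x0001_FFFF_0003_0004 and 0x0001_0001_0001_0001 lane by lane gives 0x0002_0000_0004_0005, whereas an ordinary 64-bit add would let the carry out of the second lane from the top spill into the top lane and give 0x0003_0000_0004_0005.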
2.5.3 Embedded Multiprocessors
In the embedded space, a number of special-purpose designs have
used customized multiprocessors, including the Sony PlayStation
2 [7]. Many special-purpose embedded designs consist of a general-purpose
programmable processor or DSP with special-purpose, finite-state
machines that are used for stream-oriented I/O. In applications ranging
from computer graphics and media processing to telecommunications, this
style of special-purpose multiprocessor is becoming common. Although the
inter-processor interactions in such designs are highly regimented and
relatively simple—consisting primarily of a simple communication
channel—because much of the design is committed to silicon, ensuring that
the communication protocols among the input/output processors and the
general-purpose processor are correct is a major challenge in such
designs. As a recent trend, embedded multiprocessors are built from
several general-purpose processors. These multiprocessors have been
focused primarily on the high-end telecommunications and networking
market, where scalability is critical. An example of such a design is the
MXP processor designed by empowerTel Networks for use in voice-over-IP
systems. The MXP processor consists of four main components:
1. An interface to serial voice streams, including support for
handling jitter
2. Support for fast packet routing and channel lookup
3. A complete Ethernet interface, including the MAC layer
4. Four MIPS32 R4000-class processors, each with its own cache
(a total of 48 KB or 12 KB per processor)
The MIPS processors are used to run the code responsible for
maintaining the voice-over-IP channels, including the assurance of quality
of service, echo cancellation, simple compression, and packet encoding.
Since the goal is to run as many independent voice streams as possible, a
multiprocessor is an ideal solution. Because of the small size of the MIPS
cores, the entire chip takes only 13.5 million transistors. Future generations of
the chip are expected to handle more voice channels, as well as do more
sophisticated echo cancellation, voice activity detection, and more
sophisticated compression.
Multiprocessing is becoming widespread in the embedded
computing arena for two primary reasons. First, the issues of binary
software compatibility, which plague desktop and server systems, are less
relevant in the embedded space. Often software in an embedded
application is written from scratch for an application or significantly
modified. Second, the applications often have natural parallelism,
especially at the high end of the embedded space. Examples of this natural
parallelism abound in applications such as a set-top box, a network switch,
a cell phone or a game system. The lower barriers to use of thread-level
parallelism together with the greater sensitivity to die cost (and hence
efficient use of silicon) are leading to widespread adoption of
multiprocessing in the embedded space, as the application needs grow to
demand more performance.
Desktop computers and servers rely on the memory hierarchy to
reduce average access time to relatively static data, but there are
embedded applications where data are often a continuous stream. In such
applications there is still spatial locality, but temporal locality is much more
limited. The steady stream of graphics and audio demanded by electronic
games leads to a different approach to memory design. The style is high
bandwidth via many dedicated independent memories.
2.6 MIPS32 Vs OTHER RISC PROCESSORS
Although the modern version of the RISC design dates to the 1980s,
a number of systems of the 1960s and 1970s have been credited as the first RISC
architecture, partly based on their use of the load/store approach. For
example, the CDC 6600 designed by Seymour Cray in 1964 used a
load/store architecture with only two addressing modes (register+register,
and register+immediate constant) and 74 opcodes, with the basic clock
cycle/instruction issue rate being 10 times faster than the memory access
time [24,25].
The modern RISC revolution started with the projects at Stanford
University and University of California, Berkeley and IBM. Stanford's design
led to the successful MIPS architecture, while Berkeley's RISC project has
been commercialized as the SPARC. Another success from this era was
IBM's 801 that eventually led to the Power Architecture. As these projects
matured, a wide variety of similar designs flourished in the late 1980s and
early 1990s, representing a major force in the Unix workstation market as
well as embedded processors in laser printers, routers and similar
products. The Berkeley RISC project delivered the RISC-I processor in
1982. Compared with averages of about 100,000 transistors in newer CISC designs of
the era, the RISC-I consisted of only 44,420 transistors and had only 32
instructions with three addressing modes, and yet completely outperformed
any other single-chip design. The Berkeley team followed this up with the 40,760
transistor, 39 instruction RISC-II in 1983, which ran over three times as fast
as RISC-I. In 1986, Hewlett Packard started using an early implementation
of their PA-RISC in some of their computers. In the meantime, the Berkeley
RISC effort had become so well known that it eventually became the name
for the entire concept and in 1987 Sun Microsystems began shipping
systems with the SPARC processor, directly based on the Berkeley RISC-II
system.
Well-known RISC families include DEC Alpha, AMD 29k, ARC,
ARM, Atmel AVR, Blackfin, Intel i860 and i960, MIPS, Motorola 88000, PA-
RISC, Power (including PowerPC), SuperH, and SPARC. In the 21st
century, the use of ARM architecture processors in smart phones and
tablet computers such as the iPad, Android, and Windows RT tablets
provided a wide user base for RISC-based systems. RISC processors are
also used in supercomputers such as the K computer, the fastest on the
TOP500 list in 2011, and Sequoia, the fastest on the 2012 list.
Over the years, RISC instruction sets have grown in size, and today
many of them have a larger set of instructions than many CISC CPUs.
Some RISC processors such as the PowerPC have instruction sets as
large as the CISC IBM System/370, for example; conversely, the DEC
PDP-8—clearly a CISC CPU because many of its instructions involve
multiple memory accesses—has only 8 basic instructions and a few
extended instructions. RISC architectures are now used across a wide
range of platforms, from cellular telephones and tablet computers to some
of the world's fastest supercomputers. As of 2014, a new research ISA, RISC-V, has
been under development at University of California, Berkeley, emphasizing
features such as many-core, heterogeneous multiprocessing,
virtualisability, and dense instruction encoding.
2.6.1 CISC and RISC Convergence
State of the art processor technology has changed significantly since
RISC chips were first introduced in the early '80s. Because a number of
advancements are used by both RISC and CISC processors, the lines
between the two architectures have begun to blur. In fact, the two
architectures almost seem to have adopted the strategies of the other.
Since the processor speeds have increased, CISC chips are now able to
execute more than one instruction within a single clock. This also allows
CISC chips to make use of pipelining. With other technological
improvements, it is now possible to fit many more transistors on a single
chip. This gives RISC processors enough space to incorporate more
complicated, CISC-like commands. RISC chips also make use of more
complicated hardware, making use of extra function units for superscalar
execution. All of these factors have led some groups to conclude that now
in the present "post-RISC" era, the two architectures have become so
similar that distinguishing between them is no longer relevant. However, it
should be noted that RISC chips still retain some important traits. RISC
chips strictly utilize uniform, single-cycle instructions. They also retain the
register-to-register, load/store architecture. And despite their extended
instruction sets, RISC chips still have a large number of general purpose
registers.
The question of whether ISA plays an intrinsic role in performance or
energy efficiency is becoming important [26]. The traditionally low-power
ARM ISA (a RISC) is entering the high-performance server market, while the
traditionally high-performance x86 ISA (a CISC) is entering the mobile
low-power device market.
The MIPS architecture grew out of a graduate course by John L.
Hennessy at Stanford University in 1981, resulted in a functioning system
in 1983, and could run simple programs by 1984. The MIPS approach
emphasized an aggressive clock cycle and the use of the pipeline, making
sure it could be run as "full" as possible. The MIPS system was followed by
the MIPS-X and in 1984 Hennessy and his colleagues formed MIPS
Computer Systems. The commercial venture resulted in the R2000
microprocessor in 1985, and was followed by the R3000 in 1988. The
company was purchased by Silicon Graphics, Inc. in 1992, and was spun
off as MIPS Technologies, Inc. in 1998. Subsequently Imagination
Technologies has bought the company.
2.7 MIPS32 INSTRUCTIONS AND CODE WASTAGE
RISC processors generally have three types of instructions: ALU,
Load or store, and Branch and Jump. Though RISC processors have a
limited number of addressing modes, there are variations among the
processors. The MIPS processor has only two addressing modes: immediate
and displacement, both with 16-bit fields [3].
Figure. 2.3 seen earlier in section 2.2.2 summarises the basic
formats of MIPS32 integer instructions [27] with examples. The length of
the fields in bits is indicated inside brackets. All the instructions are 32 bits
long and the most significant six bits contain the opcode. In the I-type and J-type
instructions, the opcode itself indicates the exact operation. In the R-type
instructions, the op field identifies the instruction type and the fn field (least
significant bits 0-5) indicates the exact operation. For example, the six-bit
pattern 000000 in op identifies all R-type instructions and the fn pattern
indicates the exact function, i.e., whether the instruction is add, and, sub, mul, div,
shift etc. For the and instruction, the fn is 0x24 whereas for the or
instruction, the fn is 0x25. The R-type is for register-to-register operations.
The I-type is for data transfers, branches, and immediate operations. In
load/store type instructions, the offset field is added to the contents of the
rs register, usually an address, to form the effective address for one of the
operands, either the source or destination.
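The field layout described above can be made concrete with a small decoder. This is a sketch under the standard MIPS32 bit positions. The function name decode_fields is hypothetical, and the field name shamt (shift amount, bits 10-6) is the conventional name for the R-type field that the text does not mention. The decoder extracts every field from every word; which of them are meaningful depends on the instruction type.

```python
def decode_fields(word):
    """Split a 32-bit MIPS32 instruction word into its fields."""
    return {
        "op": (word >> 26) & 0x3F,       # bits 31-26, all formats
        "rs": (word >> 21) & 0x1F,       # bits 25-21, R-type and I-type
        "rt": (word >> 16) & 0x1F,       # bits 20-16, R-type and I-type
        "rd": (word >> 11) & 0x1F,       # bits 15-11, R-type
        "shamt": (word >> 6) & 0x1F,     # bits 10-6,  R-type
        "fn": word & 0x3F,               # bits 5-0,   R-type
        "immediate": word & 0xFFFF,      # bits 15-0,  I-type
        "target": word & 0x03FFFFFF,     # bits 25-0,  J-type
    }
```

As a usage example, the word 0x012A4024 (the instruction and $t0, $t1, $t2, with $t0, $t1 and $t2 being registers 8, 9 and 10) decodes to op 0, rs 9, rt 10, rd 8 and fn 0x24.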
The branch instructions use a signed 16-bit offset field enabling a
jump of up to 2^15-1 instructions forward or 2^15 instructions backward. In I-type
arithmetic instructions, the immediate field is sign-extended to 32-bits to
form one of the operands, and the other operand is available in the rs
register. In I-type logical instructions, the immediate field is zero-extended
to form the second operand and the rs register has the first operand. The
J-type is for jumps and the instruction address is identified by the 26-bit
target field. The actual instruction address is formed by shifting the target
field contents left by two bits to give a 28-bit value, which is combined with
the upper four bits of the program counter. There are two more jump
instructions, jr and jalr, which follow different formats and they contain the
instruction address in the rs register and they have no target field.
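The address arithmetic for branches and J-type jumps can be sketched as follows, using the MIPS32 definitions: the branch offset is a signed count of 4-byte instructions relative to the instruction after the branch, and the 26-bit jump target is shifted left by two bits and combined with the upper four bits of the address of the following instruction. The function names are hypothetical and delay-slot execution is not modelled.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset_field):
    """Target of a branch located at address pc.

    The signed offset counts instructions (4 bytes each) relative to
    the instruction following the branch.
    """
    return (pc + 4 + (sign_extend16(offset_field) << 2)) & 0xFFFFFFFF


def jump_target(pc, target_field):
    """Target of a J-type jump located at address pc.

    The 26-bit target is shifted left by two bits and combined with
    the upper four bits of the address of the following instruction.
    """
    return ((pc + 4) & 0xF0000000) | ((target_field & 0x03FFFFFF) << 2)
```

A branch at address 0x00400000 with offset field 0xFFFF (that is, -1) therefore branches to itself, and a jump can only reach addresses within the same 256 MB region as the instruction that follows it.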
The drawbacks of RISC instruction formats due to the fixed instruction
size feature are as follows:
1. Several bits are unused in many instructions, since all
instructions have to be 32 bits long. Table 2.8 lists the extent of
unused bits in six integer instructions of the MIPS32 ISA.
2. The R-type instructions use a total of 12 bits to specify the
operation though there are at most 64 different R-type
operations in the MIPS32 ISA.
Table 2.8: Typical Wastage of Bits in MIPS32 Instructions
Instruction | Action | No. of unused bits
Rfe | Return from exception | 19
Syscall | System call | 20
Nop | No operation | 20
addu | Addition | 5
mult | Multiply | 10
lui | Load upper immediate | 5
3. In immediate type instructions such as addi, 16 bits are used
for specifying the immediate operand. In most cases, 8 bits are
sufficient for the immediate operand and the remaining 8 bits
become redundant. In branch instructions such as beq, the
offset field is underutilized in those cases where the offset
required can be specified with 8 bits.
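The third drawback can be quantified for a given list of immediate or offset values with a few lines of code. This is a hedged sketch and not the custom tool used in chapter 3: the function names are hypothetical, the values are treated as signed two's-complement numbers, and the input list in the usage note is made up for illustration.

```python
def fits_in_signed(value, bits):
    """True if value is representable as a bits-wide two's-complement number."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def redundant_immediate_bits(immediates, field_bits=16, short_bits=8):
    """Count immediate-field bits that a shorter field would have saved.

    Every immediate that fits in short_bits wastes
    field_bits - short_bits bits of its field_bits-wide field.
    """
    short = sum(1 for v in immediates if fits_in_signed(v, short_bits))
    return short * (field_bits - short_bits)
```

For the made-up list [1, -4, 200, 127, 4096], three of the five values fit in 8 bits, so 3 x 8 = 24 of the 80 immediate bits are redundant.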
The impact of these drawbacks on the code size has been quantified
in chapter 3 by analysing typical embedded object codes with the help of a
custom built tool. The outcome of this analysis has formed the basis for the
architectural modifications proposed in chapter 4 and chapter 5.
2.8 CODE SIZE REDUCTION IN EMBEDDED SYSTEMS
In embedded applications, every bit of code counts since it directly
affects both the program memory size, and the amount of bit traffic
between the program memory and the processor. Static code size is
directly proportional to cost in terms of program ROM size in embedded
systems. Dynamic code size has repercussions on instruction cache
effectiveness and hence on performance. Depending on the complexity of
the system, the code memory takes beyond 50% of the embedded product.
The instruction fetches take 5 to 15% of the execution time for a typical 32-
bit embedded RISC processor [7]. Since embedded systems are not user
programmable, several techniques are available to the developers, both at
compiler level and hardware level for compressing the original code
generated by the compiler. However, most solutions reduce performance.
Although this thesis favours redesigning existing RISC processors, the
philosophy behind these code compression techniques and the extent of
code compression they achieve are reviewed to help appreciate the
benefits of the architectural solution proposed here.
Several techniques to reduce code size have been implemented
[28]. These are classified into three types [2]: Code compression, Compiler
techniques and Ad hoc ISA modification. The first two techniques retain the
original ISA whereas the third technique involves supporting a new
instruction set that is a subset of the original ISA. An overview of these
three techniques is given below.
2.8.1 Code Compression
Code compression, initially applied to single-issue processors such
as CISC and RISC, is now used in VLIW processors also. The
compression methods [28] are based on traditional data compression
techniques, including entropy encoding such as Huffman encoding [29] and
arithmetic coding [30,31,32], dictionary-based compression [33], operand
factorization [34], and re-encoding of the original RISC instructions, to
name a few. Code compression involves compressing the executable RISC
object code offline and storing the compressed code in code memory. The
decompression is done on-the-fly, for each instruction, during program
execution. The decompression unit is placed between the processor core
and memory either as post-cache (between the cache and the processor),
or as pre-cache (between the code memory and the cache) [35]. In the pre-
cache architecture, the code memory contains compressed code but the
instruction cache memory contains uncompressed code. Decompression
occurs whenever there is a cache miss and hence it is not time critical. In
the post-cache architecture, both code memory and instruction cache
contain compressed code. Decompression occurs during every instruction
fetch and hence it is in the critical path of the instruction pipeline.
The criterion used to measure the efficiency of a code compression
scheme is the compression ratio, defined as the ratio of the size of the
compressed program to the size of the original program; a smaller ratio
therefore indicates better compression. A large body of knowledge is
available on lossless compression [36], and hardware for low-power,
high-performance compression and decompression has been proposed [37].
However, code compression has some distinctive requirements [38]. First,
it must be possible to decompress a program during execution with
random access, starting from several points inside the program, since
branch, jump, and call instructions can alter the flow of execution.
Second, compression and decompression algorithms can be highly
asymmetric: compression can be performed once and for all (offline)
when the executable is generated, while decompression is performed
during program execution. Decompression should therefore be fast and
power efficient, because its hardware cost must be fully amortized by
the corresponding savings in memory size and power, without
compromising performance.
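The definition of compression ratio can be made concrete with a small calculation. The function name and the program sizes below are hypothetical and serve only to illustrate the definition.

```python
def compression_ratio(compressed_size: int, original_size: int) -> float:
    """Size of the compressed program divided by the size of the original program."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return compressed_size / original_size


# Hypothetical example: a 64 KB program compressed to 40 KB
ratio = compression_ratio(40 * 1024, 64 * 1024)
print(ratio)  # 0.625, i.e. the compressed code is 62.5% of the original
```

Under this definition, a scheme with a ratio of 0.5 halves the code memory requirement, while a ratio of 1.0 means no saving at all.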
The compression methods [28] result in either variable-width or
fixed-width instructions. Decompression is more complex with
variable-width instructions, as the width of an instruction is not known
before decompression. Normally, the code compression strategy does not
require any modification to the processor architecture. The instruction
fetch unit generates the next instruction address, which is normally the
sum of the previous instruction address and the size of the previous
instruction. On encountering a branch, jump, or call instruction, the
target address is calculated and the target instruction is fetched from
the memory or cache. If the program memory contains compressed code, a
mapping between the original address space and the compressed address
space is necessary. An alternative approach [33] requires a two-phase
offline process after compilation: first, the whole program is
compressed; then, in a second phase, the branch offsets are patched to
point into the compressed code. In this approach, the processor needs to
be modified to handle unaligned (compressed) branch targets.
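The mapping between the original and compressed address spaces can be sketched as a table holding the start offset of every compressed block. The block size, the compressed block sizes, and the function names below are assumptions chosen for illustration and do not correspond to any particular scheme cited here.

```python
def build_address_map(compressed_sizes):
    """Return the start offset (in bytes) of each compressed block."""
    offsets, position = [], 0
    for size in compressed_sizes:
        offsets.append(position)
        position += size
    return offsets


def translate(original_address, block_size, offsets):
    """Map an original address to (compressed block start, offset inside the
    decompressed block)."""
    block = original_address // block_size
    return offsets[block], original_address % block_size


# Hypothetical program: four 32-byte blocks compressed to these sizes
offsets = build_address_map([20, 17, 25, 12])
print(offsets)                      # [0, 20, 37, 62]
print(translate(70, 32, offsets))   # (37, 6): third block, byte 6 after decompression
```

A branch to original address 70 thus requires fetching and decompressing the block stored at compressed offset 37 before the target instruction can be located.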
Wolfe and Chanin [30, 39] were the first to apply code compression
to embedded systems. Their scheme, known as the Compressed Code RISC
Processor (CCRP), uses Huffman coding to compress MIPS object code
and a Line Access Table (LAT) to map original program block addresses
to compressed code block addresses. The LAT is stored in program
memory. The code memory holds the compressed code and the code cache
holds the uncompressed code. Compression is done by a software
tool after linking, and the compressed program is placed into a special
memory area, identified by the linker as a compressed text segment, that
also has a special section for decompression tables. A byte-based Huffman
coding algorithm is used, with a cache line as the basic block to be
compressed. A TLB-like buffer called the Cache Line Address Lookaside
Buffer (CLB) is introduced to minimise LAT accesses and save time.
Decompression is relatively slow because Huffman codes are of variable
length.
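A minimal sketch of byte-based Huffman coding applied to a single cache line is given below. It is a generic textbook construction, not the CCRP implementation; the function names and the 32-byte sample line are assumptions. The bit-by-bit loop in the decoder shows why variable-length codes are slow to decompress: the end of a codeword is known only after its bits have been examined.

```python
import heapq
from collections import Counter


def huffman_code(data: bytes) -> dict:
    """Build a prefix code (byte value -> bit string) from byte frequencies."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:                      # degenerate case: one distinct byte
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(data: bytes, code: dict) -> str:
    """Concatenate the codewords of all bytes into one bit string."""
    return "".join(code[b] for b in data)


def decode(bits: str, code: dict) -> bytes:
    """Recover the bytes by scanning the bit string one bit at a time."""
    inverse = {v: k for k, v in code.items()}
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)


# Hypothetical 32-byte cache line: four copies of one word, then zero padding
line = bytes.fromhex("8fa20010" * 4 + "00000000" * 4)
code = huffman_code(line)
bits = encode(line, code)
print(len(bits), "bits instead of", 8 * len(line))
assert decode(bits, code) == line
```

In a real scheme the code table would be built once over the whole program and stored with the compressed text, so that every cache line can be decoded independently.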
The CCRP method established the foundation for the IBM CodePack
compression technology for the PowerPC 400 series [40]. Compressed
code is stored in the external memory and CodePack is placed between
the memory and the cache, as illustrated in Figure. 2.13. Decompression is
triggered by an instruction cache miss. The translation between the
compressed and uncompressed lines is held in the LAT. The 32-bit
PowerPC instructions are divided into two 16-bit parts, and two Huffman
tables are used, one for each part. The Huffman-like codewords are
assigned on the basis of the frequency distribution. Words are grouped in
sets, and words belonging to the same set are assigned codewords of the
same length. For each cache miss, CodePack fetches and decompresses two
cache blocks instead of only the one requested. This approach does not
involve compiler modification or processor design change. The original
work of Wolfe and Chanin achieves 30 to 50% compression ratio whereas
the IBM CodePack technique gives a compression ratio between 36% and 47%.
2.8.2 Dictionary-based Compression
Dictionary-based compression is another compression method
[38,28,41]. It is based on the property that the same instructions with the
same operands reappear repeatedly in embedded object code. The
compression algorithm creates a dictionary of distinct instructions and
replaces each instruction in the original program with the corresponding
index into the dictionary, as illustrated in Figure. 2.14. Thus, the
instructions are substituted by 'codewords'.
Figure. 2.13: IBM CodePack Code Compression for PowerPC
As the codeword is smaller than the original instruction, the size of
the code is reduced. During program execution, the codeword (dictionary
index) fetched from the program memory is used to fetch the original
uncompressed instruction from the dictionary. Figure. 2.15 illustrates the
decompression operation of the dictionary method of compression. Given a
program with N unique instructions, the length of the codeword is
⌈log2 N⌉ bits.
Figure. 2.14: Dictionary based compression
Figure. 2.15: Decompression procedure for the dictionary based
compression
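The dictionary method lends itself to a compact sketch. The code below builds a dictionary of distinct 32-bit instruction words, replaces each instruction by its dictionary index, and computes the codeword length for N unique instructions. The function names and the sample instruction words are hypothetical.

```python
def build_dictionary(instructions):
    """Return (dictionary of distinct words, list of codewords/indices)."""
    dictionary, index_of = [], {}
    for word in instructions:
        if word not in index_of:
            index_of[word] = len(dictionary)
            dictionary.append(word)
    return dictionary, [index_of[w] for w in instructions]


def codeword_bits(n_unique: int) -> int:
    """Bits needed to index n_unique entries: ceil(log2 N), at least 1."""
    return max(1, (n_unique - 1).bit_length())


def decompress(codewords, dictionary):
    """Decompression is a plain table lookup per codeword."""
    return [dictionary[c] for c in codewords]


# Hypothetical program of six 32-bit instruction words, three of them distinct
program = [0x8FA20010, 0x00000000, 0x8FA20010, 0x24420001, 0x00000000, 0x8FA20010]
dictionary, codewords = build_dictionary(program)
bits = codeword_bits(len(dictionary))
print(codewords, bits)   # [0, 1, 0, 2, 1, 0] 2
print("code bits:", len(codewords) * bits, "dictionary bits:", 32 * len(dictionary))
assert decompress(codewords, dictionary) == program
```

The dictionary itself must also be stored, so the net saving depends on how often instructions repeat; in this tiny example the dictionary dominates, whereas in a large program with many repeated instructions it is amortized.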
The dictionary is usually implemented in ROM in the control path of
the processor. Dictionary-based compression is a simple scheme offering
fast decompression. The decompressor is essentially a simple table; it
can be integrated with the instruction decoder into a single pipeline
stage. Though this scheme is straightforward, offering inexpensive
address translation and a sizable reduction of memory fetch bandwidth
(i.e., the number of bits transferred from code memory to execute a
program), [7] argues that 'this approach is the least appealing for an
embedded system'. On the other hand, [39] establishes that
dictionary-based compression is competitive with CodePack for static
footprint compression, and achieves superior results for bus traffic and
energy reduction.
In the expression-tree-based algorithms [42] for code compression
proposed by Guido et al., the encoded symbols are extracted from program
expression trees and dictionary-based decompression engines are
implemented.
2.8.3 Compiler Techniques
Modern embedded compilers are often more complex than general
purpose compilers. A traditional compiler mainly aims to optimize a one-
dimensional cost function represented by the number of cycles needed to
execute a program. For an embedded compiler, on the other hand, code
size and energy are as important as the speed of execution. Certain
scalar optimizations performed by traditional compilers are relevant in
embedded systems also. For example, transformations such as dead code
elimination, common subexpression elimination, strength reduction, copy
propagation, and constant folding reduce code size and power consumption
apart from improving speed. However, certain ILP-oriented optimizations
such as loop unrolling, tail duplication, procedure inlining and cloning,
speculation, and global code motion offer better speed but may hurt code
size and power consumption [7]. Research on code compression has been
very active in
the compiler community [11, 43] with the goal of finding compact program
representations. Pure software techniques [39], in which the compiler
reduces program size and instructions are decompressed during execution,
have been popular in the embedded community. Compiler techniques for code
compression for RISC architectures by Cooper and McIntosh [44] map
isomorphic instruction sequences into abstract routine calls or
cross-jumping. A profile-guided code compression that applies Huffman
coding to infrequently executed functions has been suggested by Debray
and Evans [45], [46]. A control flow graph centric software approach to
reduce memory space consumption has been proposed by Ozturk et al. [47].
Their approach involves on-the-fly compression/decompression of object
codes of embedded applications. A flexible decompressor approach,
applicable to multiple platforms, was proposed by Shogan and Childers [48]
with their implementation of IBM's CodePack algorithms within the fetch
step of a Software Dynamic Translator (SDT) in a pure software
infrastructure. Thus, compiler techniques for code compression involve
register renaming, interprocedural optimization, and procedural
abstraction of repeated code fragments. Procedural abstraction is a
program optimization technique that replaces repeated sequences of
common code with calls to a single procedure. The above compiler
techniques are attractive since they have no runtime decompression
overheads, do not require any hardware change, and generate code that
can be directly executed by the processor. However, there is a need to
modify software tools such as compilers and linkers.
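Procedural abstraction can be illustrated with a deliberately simplified sketch. It searches a list of assembly-like instruction strings for a sequence of a given length that occurs again later, and replaces every occurrence with a call to a single procedure. The function names, the instruction strings, and the procedure name inc are hypothetical, and the sketch ignores practical issues such as branch delay slots, preservation of the return address register, and branch targets inside the abstracted sequence.

```python
def find_repeated(program, length):
    """Return the first sequence of `length` instructions that occurs again
    later in the program without overlapping, or None."""
    for i in range(len(program) - length + 1):
        seq = program[i:i + length]
        for j in range(i + length, len(program) - length + 1):
            if program[j:j + length] == seq:
                return seq
    return None


def abstract(program, seq, name):
    """Replace each occurrence of seq with a call; return (main code, procedure)."""
    out, i = [], 0
    while i < len(program):
        if program[i:i + len(seq)] == seq:
            out.append(f"jal {name}")
            i += len(seq)
        else:
            out.append(program[i])
            i += 1
    procedure = [f"{name}:"] + seq + ["jr $ra"]
    return out, procedure


body = ["lw $t0, 0($a0)", "addi $t0, $t0, 1", "sw $t0, 0($a0)"]
program = body + ["move $a0, $s1"] + body + ["move $a0, $s2"] + body
main, procedure = abstract(program, find_repeated(program, 3), "inc")
print(main)        # three calls and two moves
print(procedure)   # label, the shared body, and a return
```

Here 11 instructions shrink to 5 in the main code plus a 4-instruction procedure (excluding the label). The saving grows with the length of the sequence and the number of repetitions, at the cost of extra call and return instructions at run time.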
2.8.4 Ad hoc ISA Modification
This approach customizes the existing RISC instruction set
architecture with narrow instructions supporting fewer operations, smaller
operand fields, and fewer registers. For example, the Thumb [49]
instruction set is a modification of the original ARM instruction set (32-bit
instructions). It has 36 different 16-bit instructions which form a subset of
the ARM instructions. Similarly, in MIPS16, a subset of the 32-bit MIPS
instructions is mapped to 16-bit MIPS instructions which can be translated
in real time into 32-bit MIPS instructions. This approach involves
considerable effort to design the new instruction set and requires a new
instruction decoder and a new set of software development tools, such as
a compiler, assembler, and linker. A code saving of up to 40% has been
reported. However, the dense instruction sets often cause performance
penalties [39] owing to the limited set of instructions available. Also,
the processor hardware needs additional decoder/decompression logic to
support both ISAs. Both ARM and MIPS have responded to the first
criticism by introducing Thumb2 and microMIPS. These ISAs support two
instruction sizes: 16-bit and 32-bit. Although the performance
degradation has been addressed to a certain extent, the processors still
have additional decoder/converter logic to detect the 16-bit instructions
and convert them into 32-bit instructions.
There have been attempts to develop tiny RISC processors [50].
The DMN-6 has sixteen 8-bit registers, executes just 12 instructions, and
has no cache memory. Known as the Minimal RISC processor, it is meant
exclusively for use in toys.
2.9 ISA LEVEL CODE SIZE REDUCTION
Instruction set architects have broadly used two techniques to
reduce the relative energy cost of instruction stream delivery. One
approach is to increase the amount of work performed by a single
instruction. Vector machines, for example, reduce instruction bandwidth
demands by expressing a large amount of SIMD parallelism in a single
instruction [9]. CISC machines do so by combining multiple simple
operations into a single instruction and providing more addressing modes.
An alternative approach is to reduce the size of the instructions. CISC
instruction sets have generally been composed of variable-length
instructions: the simple and more common ones are usually encoded in
fewer bits than those that require more operands or occur less frequently.
RISC ISAs initially sacrificed the code density advantages of variable-
length instruction encodings in favour of simple, fixed-length 32-bit
encodings. Subsequently, RISC instruction set extensions have provided
fixed-length 16-bit encodings (as in ARM Thumb and MIPS16), although
often at the expense of performance and with limited access to some
hardware features. Next generation RISC ISAs (as in ARM Thumb2, microMIPS
and RISC-V) partly resolve these drawbacks by encoding the most common
instructions densely, while maintaining most or all of the functionality
of the 32-bit ISA. However, these ISAs have not fully resolved the issue
of code density, since they continue to give importance to limiting
pipeline design complexity. Hence they have only two different
instruction sizes: two bytes and four bytes. Still, these are called
variable instruction length ISAs, which is a misnomer; hybrid instruction
length is the more accurate term. On the other hand, the hybrid length
encoding proposed in this thesis recommends a new ISA with four different
instruction sizes that reduces the average length of instructions, with
the goal of minimizing code memory size. It also improves energy per
operation by reducing instruction fetch traffic.
Depending on the memory word size, in a stream of instructions of
differing lengths, some instructions will reside in more than one memory
word and will require more than one memory access to be fetched.
Figure. 2.16 illustrates a memory map of a sequence of x86
instructions [11]. The digits indicate the instruction number in the stream.
The eight instructions in the stream require seven memory cycles, giving
0.875 memory cycles per instruction. For this example, the average
number of bytes per instruction is 3.375. Published statistics on the IBM
S/360 show that this CISC architecture has approximately four bytes per
instruction [11].
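The figures quoted for this example can be reproduced with a short calculation. The individual instruction lengths below are hypothetical, since they are not listed in the text; they are chosen only so that eight instructions total 27 bytes (8 × 3.375). The 4-byte memory word, the contiguous packing from a word boundary, and the single fetch per word are likewise assumptions, picked because they are consistent with the seven memory cycles stated above.

```python
import math


def fetch_statistics(instruction_lengths, word_bytes):
    """Return (memory cycles, cycles per instruction, bytes per instruction)
    for instructions packed contiguously from a word boundary."""
    total_bytes = sum(instruction_lengths)
    cycles = math.ceil(total_bytes / word_bytes)
    n = len(instruction_lengths)
    return cycles, cycles / n, total_bytes / n


def straddling(instruction_lengths, word_bytes):
    """Count instructions whose bytes lie in more than one memory word."""
    count, start = 0, 0
    for length in instruction_lengths:
        end = start + length - 1
        if start // word_bytes != end // word_bytes:
            count += 1
        start += length
    return count


lengths = [3, 4, 2, 5, 3, 4, 1, 5]        # hypothetical lengths, 27 bytes in total
print(fetch_statistics(lengths, 4))       # (7, 0.875, 3.375)
print(straddling(lengths, 4))             # 6
```

With these assumed lengths, six of the eight instructions cross a word boundary, which illustrates why the fetch unit for such an instruction stream needs buffering and alignment logic.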
2.10 CONCLUSIONS
This chapter has provided an overview of the various attributes of an
ISA and of the different types of embedded processors. The cause of the
increased code size of embedded processors has been illustrated with the
example of the MIPS32 ISA. Different techniques for code size reduction
in embedded systems have been briefly reviewed.
Figure. 2.16: Memory map of variable instruction stream
The next chapter analyses the behaviour of embedded object codes
of MIPS32, and Chapter 4 discusses two different techniques of hybrid
instruction encoding for the MIPS32 processor to minimise the code size.