
Page 1: Ecs 7th sem-cse-unit-2

Embedded Computing Systems

Unit – II

Text Books:

1. Wayne Wolf: Computers as Components, Principles of Embedded Computing

Systems Design, 2nd Edition, Elsevier, 2008.

2. Shibu K V: Introduction to Embedded Systems, Tata McGraw Hill, 2009

(Chapters 10, 13)

Reference Books:

1. James K. Peckol: Embedded Systems, A contemporary Design Tool, Wiley

India, 2008

2. Tammy Noergaard: Embedded Systems Architecture, Elsevier, 2005

By

Dr. K. Satyanarayan Reddy

CiTECH, B’lore-36.

Page 2: Ecs 7th sem-cse-unit-2

UNIT-II: Instruction Sets, CPUs (7 hours)

Preliminaries, ARM Processor, Programming Input and Output, Supervisor Mode, Exceptions, Traps, Coprocessors, Memory System Mechanisms, CPU Performance, CPU Power Consumption.

Design Example: Data Compressor.


Page 3: Ecs 7th sem-cse-unit-2

PRELIMINARIES

Note: two CPUs are used as examples.

The ARM processor [Fur96, Jag95] is widely used in cellphones and many other systems.

Instruction Sets – These are the programmer’s interface to the hardware.


Page 4: Ecs 7th sem-cse-unit-2

Computer Architecture Taxonomy

A block diagram for one type of computer is shown in Figure below.

The computing system consists of a Central Processing Unit (CPU) and a Memory.


Fig 1. A von Neumann Architecture Computer

Fig 2. A Harvard Architecture

Page 5: Ecs 7th sem-cse-unit-2

The memory holds both DATA and INSTRUCTIONS, and can be read or written when given an address.

A computer whose memory holds both DATA and INSTRUCTIONS is known as a VON NEUMANN MACHINE.

The CPU has several internal REGISTERS that store values used internally.

One of those registers is the Program Counter (PC), which holds the address of an instruction in memory.

The CPU fetches the instruction from memory, decodes the instruction, and executes it.

The Program Counter does not directly determine what the machine does next; it only points indirectly to an instruction in memory.

The action of the CPU can be changed simply by changing the instructions.

It is this separation of the instruction memory from the CPU that distinguishes a stored-program computer from a general finite-state machine.


Computer Architecture Taxonomy cont’d…

Page 6: Ecs 7th sem-cse-unit-2

An alternative to the von Neumann style of organizing computers is the Harvard Architecture, which is nearly as old as the von Neumann Architecture.

As shown in Figure below, a Harvard machine has separate memories for data and program.

The Program Counter points to program memory, not data memory.

As a result, it is harder to write self-modifying programs (programs that write data values, then use those values as instructions) on Harvard machines.


Fig 2. A Harvard Architecture

Computer Architecture Taxonomy cont’d…

Page 7: Ecs 7th sem-cse-unit-2

Harvard Architectures are widely used, as the separation of program and data memories provides higher performance for Digital Signal Processing.

Processing signals in real time places great strains on the data access system in two ways:

I. Large amounts of data flow through the CPU; and

II. That data must be processed at precise intervals, not just when the CPU gets around to it.

Data sets that arrive continuously and periodically are called Streaming Data.

Advantage: Having two memories with separate ports provides higher memory bandwidth; not making data and instruction accesses compete for the same port also makes it easier to move the data at the proper times.

DSPs constitute a large fraction of all microprocessors sold today, and most of them are based on Harvard architectures.

e.g.: Most of the telephone calls in the world go through at least two DSPs, one at each end of the phone call.


Computer Architecture Taxonomy cont’d…

Page 8: Ecs 7th sem-cse-unit-2

Computer Architectures can also be organized based on their INSTRUCTIONS and how they are executed.

Many early computer architectures were what is known today as Complex Instruction Set Computers (CISC).

These machines provided a variety of instructions that may perform very complex tasks, like string searching, and had a number of different instruction formats of varying lengths.

One of the advances in the development of high-performance microprocessors was the concept of Reduced Instruction Set Computers (RISC), having fewer and simpler instructions.

The instructions were also chosen so that they could be efficiently executed in Pipelined Processors.

Early RISC designs substantially outperformed CISC designs of the period.


Computer Architecture Taxonomy cont’d…

Page 9: Ecs 7th sem-cse-unit-2

Computers can also be classified by several characteristics of their instruction sets.

The instruction set of the computer defines the interface between software modules and the underlying hardware.

Instructions can have a variety of characteristics, including:

■ Fixed vs. Variable length.

■ Addressing Modes.

■ Numbers of Operands.

■ Types of Operations supported.

The set of registers available for use by programs is called the Programming Model, also known as the Programmer Model.

Note: The CPU has many other registers that are used for internal operations and are unavailable to programmers.


Computer Architecture Taxonomy cont’d…

Page 10: Ecs 7th sem-cse-unit-2

Assembly Language

Figure below shows a fragment of ARM assembly code having basic features:

■ One instruction appears per line.

■ Labels, which give names to memory locations, start in the first column.

■ Instructions must start in the second column or after to distinguish them from labels.

■ Comments run from some designated comment character (; in the case of ARM) to the end of the line.

Assembly language follows this relatively structured form to make it easy for the assembler to parse the program and to consider most aspects of the program line by line.


An example of ARM Assembly Language


Page 11: Ecs 7th sem-cse-unit-2

Figure below shows the format of an ARM data processing instruction such as an ADD.

For the instruction ADDGT r0,r3,#5

the cond field would be set according to the GT condition (1100), the opcode field would be set to the binary code for the ADD instruction (0100), the first operand register Rn would be set to 3 to represent r3, the destination register Rd would be set to 0 for r0, and the operand 2 field would be set to the immediate value of 5.

Assemblers must also provide some pseudo-ops to help programmers create complete assembly language programs. An example of a pseudo-op is one that allows data values to be loaded into memory locations. These allow constants, for example, to be set into memory.


Assembly Language cont’d….


Format of ARM data processing instructions
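The field values above can be checked with a small C sketch that packs them into a 32-bit word. The cond, opcode, Rn, Rd, and operand-2 fields follow the slide; the I (immediate) and S (set-flags) bit positions are part of the standard encoding but are not listed in the figure's field description, so treat them here as assumptions.

#include <stdint.h>
#include <stdio.h>

/* Pack the fields of an ARM data-processing instruction that uses an
 * immediate second operand (a sketch for illustration only). */
static uint32_t encode_dp_imm(uint32_t cond, uint32_t opcode,
                              uint32_t rn, uint32_t rd, uint32_t imm8)
{
    uint32_t I = 1;                  /* operand 2 is an immediate (assumed bit 25) */
    uint32_t S = 0;                  /* do not update the condition flags          */
    return (cond   << 28) |          /* condition field                            */
           (I      << 25) |
           (opcode << 21) |
           (S      << 20) |
           (rn     << 16) |          /* first operand register                     */
           (rd     << 12) |          /* destination register                       */
           (imm8 & 0xFF);            /* 8-bit immediate, rotation field = 0        */
}

int main(void)
{
    /* ADDGT r0,r3,#5 : cond = GT (1100), opcode = ADD (0100) */
    printf("0x%08X\n", encode_dp_imm(0xC, 0x4, 3, 0, 5)); /* prints 0xC2830005 */
    return 0;
}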

Page 12: Ecs 7th sem-cse-unit-2

An example of a memory allocation pseudo-op for ARM is shown in the code below.

The ARM % pseudo-op allocates a block of memory of the size specified by the operand and initializes those locations to zero.


Assembly Language cont’d….

BIGBLOCK % 10

Pseudo-ops for Allocating Memory

Page 13: Ecs 7th sem-cse-unit-2

ARM PROCESSOR

ARM is actually a family of RISC architectures that have been developed over many years.

The textual description of instructions is called an assembly language.

ARM instructions are written one per line, starting after the first column.

Comments begin with a semicolon and continue to the end of the line.

A label, giving a name to a memory location, comes at the beginning of the line, starting in the first column.

Here is an example:

LDR r0,[r8] ;a comment

label ADD r4,r0,r1


Page 14: Ecs 7th sem-cse-unit-2

Processor and Memory Organization

ARM7 is a von Neumann Architecture machine, while ARM9 uses a Harvard Architecture.

Possible performance differences may exist between the two processors.

The ARM architecture supports two basic types of data:

■ The standard ARM word is 32 bits long.

■ The word may be divided into four 8-bit bytes.

An Address refers to a byte, not a word.

Therefore, word 0 in the ARM address space is at location 0, word 1 is at 4, word 2 is at 8, and so on.


Page 15: Ecs 7th sem-cse-unit-2

The ARM processor can be configured at power-up to address the bytes in a word in either little-endian mode (with the lowest-order byte residing in the low-order bits of the word) or big-endian mode (the lowest-order byte stored in the highest bits of the word), as shown in Figure below:


Processor and Memory Organization cont’d….
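As a quick illustration of the two byte orders described above, this small C sketch (not ARM-specific) stores a known 32-bit word and inspects the byte that lands at the lowest address.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t word = 0x11223344;
    uint8_t *bytes = (uint8_t *)&word;   /* view the word as four bytes */

    if (bytes[0] == 0x44)
        printf("little-endian: lowest-order byte at the lowest address\n");
    else if (bytes[0] == 0x11)
        printf("big-endian: highest-order byte at the lowest address\n");
    return 0;
}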

Page 16: Ecs 7th sem-cse-unit-2

Data Operations

Arithmetic and logical operations in C are performed on variables.

Variables are implemented as memory locations.

Therefore, to be able to write instructions to perform C expressions and assignments, both arithmetic and logical instructions must be considered, as well as instructions for reading and writing memory.

Figure below shows a sample fragment of C code with data declarations and several assignment statements.

The variables a, b, c, x, y, and z all become data locations in memory.


A C fragment with Data Operations
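The C fragment itself is not reproduced in this transcript; the following sketch is consistent with the three expressions that the later slides compile to ARM code, and is offered only as a stand-in for the figure.

int a, b, c, x, y, z;

void data_ops(void)
{
    x = (a + b) - c;
    y = a * (b + c);
    z = (a << 2) | (b & 15);
}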

Page 17: Ecs 7th sem-cse-unit-2

In the ARM processor, arithmetic and logical operations cannot be performed directly on memory locations.

ARM is a Load-Store Architecture — data operands must first be loaded into the CPU, and the results must be stored back to main memory.

Figure below shows the registers in the basic ARM programming model.

ARM has 16 general-purpose registers, r0 through r15.

Except for r15, they are identical; any operation that can be done on one of them can be done on the others as well.


Data Operations cont’d….

The r15 register has the same capabilities as the other registers, but it is also used as the Program Counter. The Program Counter (PC) should NOT BE OVERWRITTEN for use in data operations. However, giving the PC the properties of a general-purpose register allows the program counter value to be used as an operand in computations, which can make certain programming tasks easier.

Page 18: Ecs 7th sem-cse-unit-2

The other important basic register in the programming model is the Current Program Status Register (CPSR).

This register is set automatically during every arithmetic, logical, or shifting operation.

The top four bits of the CPSR hold the following useful information about the results of that arithmetic/logical operation:

■ The negative (N) bit is set when the result is negative in two's-complement arithmetic.

■ The zero (Z) bit is set when every bit of the result is zero.

■ The carry (C) bit is set when there is a carry out of the operation.

■ The overflow (V) bit is set when an arithmetic operation results in an overflow.


Data Operations cont’d….

Page 19: Ecs 7th sem-cse-unit-2

Example: Status bit computation in the ARM

An ARM word is 32 bits. In C notation, a hexadecimal number starts with 0x, such as 0xffffffff, which is the two's-complement representation of -1 in a 32-bit word.

Here are some sample calculations (a C sketch of how these bits can be computed follows the list):

■ -1 + 1 = 0: Written in 32-bit format, this becomes 0xffffffff + 0x1 = 0x0, giving the CPSR value of NZCV = 0110.

■ 0 - 1 = -1: 0x0 - 0x1 = 0xffffffff, with NZCV = 1000.

■ 2^31 - 1 + 1 = -2^31: 0x7fffffff + 0x1 = 0x80000000, with NZCV = 1001.
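The following C sketch (an illustration, not ARM code) reproduces the status-bit calculation for 32-bit additions; subtraction is omitted because ARM's carry-as-not-borrow convention would need separate handling.

#include <stdint.h>
#include <stdio.h>

/* Compute the N, Z, C, and V bits for a 32-bit addition. */
static void nzcv_add(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a + (uint64_t)b;     /* keep the carry-out      */
    uint32_t r = (uint32_t)wide;

    int N = (r >> 31) & 1;                         /* result is negative      */
    int Z = (r == 0);                              /* result is zero          */
    int C = (int)((wide >> 32) & 1);               /* carry out of bit 31     */
    int V = (int)(((~(a ^ b) & (a ^ r)) >> 31) & 1); /* signed overflow       */

    printf("0x%08X + 0x%08X = 0x%08X  NZCV = %d%d%d%d\n", a, b, r, N, Z, C, V);
}

int main(void)
{
    nzcv_add(0xFFFFFFFFu, 0x1u);  /* -1 + 1 = 0          -> NZCV = 0110 */
    nzcv_add(0x7FFFFFFFu, 0x1u);  /* 2^31-1 + 1 = -2^31  -> NZCV = 1001 */
    return 0;
}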


Page 20: Ecs 7th sem-cse-unit-2

The basic form of a data instruction is simple:

ADD r0,r1,r2

This instruction sets register r0 to the sum of the values stored in r1 and r2 (i.e., r0 = r1 + r2).

In addition to specifying registers as sources for operands, instructions may also provide IMMEDIATE operands, which encode a constant value directly in the instruction.

For example,

ADD r0,r1,#2

sets r0 to "r1 + 2". The major data operations are summarized in the adjacent Figure.


Data Operations cont’d….

ARM Data Instructions

Page 21: Ecs 7th sem-cse-unit-2

RSB performs a subtraction with the order of the two operands reversed, so that

RSB r0,r1,r2 sets r0 to r2 - r1.

The bit-wise logical operations perform logical AND, OR, and XOR operations (the exclusive-or is called EOR).

The BIC instruction stands for bit clear:

BIC r0,r1,r2 sets r0 to r1 AND NOT r2. This instruction uses the second source operand as a mask:

Where a bit in the mask is 1, the corresponding bit in the first source operand is cleared.


Data Operations cont’d….

Page 22: Ecs 7th sem-cse-unit-2

The MUL instruction multiplies two values, but with some restrictions:

No operand may be an IMMEDIATE, and the two source operands must be different registers.

The MLA instruction performs a multiply-accumulate operation, which is particularly useful in matrix operations and signal processing.

The instruction MLA r0,r1,r2,r3

sets r0 to r1 x r2 + r3.


Data Operations cont’d….

Page 23: Ecs 7th sem-cse-unit-2

The SHIFT operations can be applied to ARITHMETIC and LOGICAL instructions.

The shift modifier is always applied to the second source operand.

A left shift moves bits up toward the most-significant bit (msb), while a right shift moves bits down toward the least-significant bit (lsb) in the word.

The LSL and LSR modifiers perform left and right logical shifts, filling the vacated bits of the operand with zeroes.

An arithmetic shift left is equivalent to an LSL, but the ASR copies the sign bit — if the sign is 0, a 0 is copied, while if the sign is 1, a 1 is copied.

The rotate modifiers always rotate right, moving the bits that fall off the least-significant bit up to the most-significant bit in the word.

The RRX modifier performs a 33-bit rotate, with the CPSR's C bit being inserted above the sign bit of the word; this allows the carry bit to be included in the rotation.


Data Operations cont’d….
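As a rough illustration of these modifiers, here is a C sketch of their effect on a 32-bit word. The RRX version takes the current carry bit as an extra argument and, for brevity, does not model the new carry value that the real instruction would produce.

#include <stdint.h>

/* Logical shift left/right: vacated bits are filled with zeroes. */
uint32_t lsl(uint32_t x, unsigned n) { return x << n; }
uint32_t lsr(uint32_t x, unsigned n) { return x >> n; }

/* Arithmetic shift right: the sign bit is copied into the vacated bits. */
uint32_t asr(uint32_t x, unsigned n) { return (uint32_t)((int32_t)x >> n); }

/* Rotate right (0 < n < 32): bits falling off the lsb reappear at the msb. */
uint32_t ror(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* RRX: 33-bit rotate, with the carry bit inserted above the sign bit. */
uint32_t rrx(uint32_t x, unsigned carry) { return (x >> 1) | ((uint32_t)carry << 31); }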

Page 24: Ecs 7th sem-cse-unit-2

The instructions in Figure below are comparison operations; they do not modify general-purpose registers but only set the values of the NZCV bits of the CPSR register.

The compare instruction CMP r0,r1 computes r0 - r1, sets the status bits, and throws away the result of the subtraction.

CMN uses an addition to set the status bits.

TST performs a bit-wise AND on the operands, while TEQ performs an exclusive-or.


Data Operations cont’d….

ARM Comparison Instructions

Page 25: Ecs 7th sem-cse-unit-2

Figure below summarizes the ARM move instructions. The instruction MOV r0,r1

sets the value of r0 to the current value of r1 (i.e., r0 = r1).

The MVN instruction complements the operand bits (one's complement) during the move.


ARM MOVE instructions

Data Operations cont’d….

Page 26: Ecs 7th sem-cse-unit-2

Values are transferred between registers and memory using the load-store instructions summarized in Figure below.

LDRB and STRB load and store bytes rather than whole words, while LDRH and STRH operate on half-words and LDRSH extends the sign bit on loading.

An ARM address may be 32 bits long. The ARM load and store instructions do not directly refer to main memory addresses, since a 32-bit address would not fit into an instruction that included an opcode and operands. Instead, the ARM uses register-indirect addressing.


Data Operations cont’d….

Page 27: Ecs 7th sem-cse-unit-2

In register-indirect addressing, the value stored in the register is used as the address to be fetched from memory; the result of that fetch is the desired operand value.

Thus, as illustrated in Figure below, if we set r1 = 0x100, the instruction LDR r0,[r1]

sets r0 to the value of memory location 0x100.

Similarly, STR r0,[r1] would store the contents of r0 in the memory location whose address is given in r1.


Data Operations cont’d….

Page 28: Ecs 7th sem-cse-unit-2

There are several possible variations:

LDR r0,[r1,-r2] loads r0 from the address given by r1 - r2, while LDR r0,[r1,#4] loads r0 from the address r1 + 4.

The next question is how to get an address into a register in the first place, that is, how to set a register to an arbitrary 32-bit value.

In the ARM, the standard way to set a register to an address is by PERFORMING ARITHMETIC on the PROGRAM COUNTER, which is stored in r15.

By adding to or subtracting from the PC a constant equal to the distance between the current instruction and the desired location, the desired address can be generated without performing a load.


Data Operations cont’d….

Page 29: Ecs 7th sem-cse-unit-2

The ARM programming system provides an ADR pseudo-operation to simplify this step. Thus, as shown in Figure below, if location 0x100 is given the name FOO, then the pseudo-operation ADR r1,FOO can be used to perform the same function of loading r1 with the address 0x100.


Data Operations cont’d….

Page 30: Ecs 7th sem-cse-unit-2


Data Operations cont’d…. An Example: C assignments in ARM instructions

x = (a + b) - c;

The semicolon (;) begins a comment after an instruction, which continues to the end of that line. The statement x = (a + b) - c; can be implemented by using r0 for a, r1 for b, r2 for c, and r3 for x. Registers are also needed for indirect addressing; in this case the same indirect addressing register, r4, will be reused for each variable load. The code must load the values of a, b, and c into these registers before performing the arithmetic, and it must store the value of x back to memory when it is done. This code performs the necessary steps:

ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
ADR r4,b ; get address for b, reusing r4
LDR r1,[r4] ; load value of b
ADD r3,r0,r1 ; set intermediate result for x to a + b
ADR r4,c ; get address for c
LDR r2,[r4] ; get value of c
SUB r3,r3,r2 ; complete computation of x
ADR r4,x ; get address for x
STR r3,[r4] ; store x at proper location

Page 31: Ecs 7th sem-cse-unit-2

The operation y = a * (b + c); can be coded similarly, but in this case more registers will be reused: r0 is used for both a and b, r1 for c, and r2 for y. Once again, r4 will be used to store addresses for indirect addressing.

The resulting code is

ADR r4,b ; get address for b

LDR r0,[r4] ; get value of b

ADR r4,c ; get address for c

LDR r1,[r4] ; get value of c

ADD r2,r0,r1 ; compute partial result of y

ADR r4,a ; get address for a

LDR r0,[r4] ; get value of a

MUL r2,r2,r0 ; compute final value of y

ADR r4,y ; get address for y

STR r2,[r4] ; store value of y at proper location


Data Operations cont’d…. An Example:

Page 32: Ecs 7th sem-cse-unit-2

The C statement z = (a << 2) | (b & 15); can be coded using r0 for a and z, r1 for b, and r4 for addresses as follows:

ADR r4,a ; get address for a

LDR r0,[r4] ; get value of a

MOV r0,r0,LSL 2 ; perform shift

ADR r4,b ; get address for b

LDR r1,[r4] ; get value of b

AND r1,r1,#15 ; perform logical AND

ORR r1,r0,r1 ; compute final value of z

ADR r4,z ; get address for z

STR r1,[r4] ; store value of z


Data Operations cont’d…. An Example:

Page 33: Ecs 7th sem-cse-unit-2

There are three addressing modes: Register, Immediate, and Indirect.

The ARM also supports several forms of base-plus-offset addressing, which is related to indirect addressing.

But rather than using a register value directly as an address, the register value is added to another value to form the address.

For instance, LDR r0,[r1,#16] loads r0 with the value stored at location r1 + 16.

Here, r1 is referred to as the Base and the immediate value '16' as the Offset.

When the offset is an immediate, it may have any value up to 4,096; another register may also be used as the offset.


Data Operations cont’d….

Page 34: Ecs 7th sem-cse-unit-2

This addressing mode has two other variations: Auto-indexing and Post-indexing.

Auto-indexing updates the base register, so that LDR r0,[r1,#16]! first adds 16 to the value of r1 and then uses that new value as the address.

The ! operator causes the Base Register to be updated with the computed address so that it can be used again later.

The auto-indexing instruction fetches from the same memory location as LDR r0,[r1,#16], but it also modifies the value of the base register r1.

Post-indexing does not perform the offset calculation until after the fetch has been performed.

Consequently, LDR r0,[r1],#16

will load r0 with the value stored at the memory location whose address is given by r1, and then add 16 to r1 and set r1 to the new value (r0 = [r1], and then r1 = r1 + 16).


Data Operations cont’d….
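A rough C analogy of the three variants may help; the pointer p stands in for the base register r1 and the constant 16 for the immediate offset. This is an illustrative sketch, not compiler output.

/* LDR r0,[r1,#16]  : base plus offset; the base register is not changed */
int base_plus_offset(char *p)  { return *(int *)(p + 16); }

/* LDR r0,[r1,#16]! : auto-indexing; the base is updated first, then used */
int auto_indexed(char **p)     { *p += 16; return *(int *)*p; }

/* LDR r0,[r1],#16  : post-indexing; the old base is used, then updated */
int post_indexed(char **p)     { int v = *(int *)*p; *p += 16; return v; }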

Page 35: Ecs 7th sem-cse-unit-2

Data Operations cont’d…. Flow of Control

The B (branch) instruction is the basic mechanism in ARM for changing the flow of control.

The address that is the destination of the branch is often called the branch target.

Branches are PC-relative: the branch specifies the offset from the current PC value to the branch target.

The offset is in words, but because the ARM is byte-addressable, the offset is multiplied by four (shifted left two bits, actually) to form a byte address.

Thus, the instruction

B #100

will add 400 to the current PC value (4-byte word size x 100).


Page 36: Ecs 7th sem-cse-unit-2

The ARM allows any instruction, including branches, to be executed conditionally.

This allows branches to be conditional, as well as data operations.

Figure below summarizes the condition codes.


Data Operations cont’d…. Flow of Control

Condition codes in ARM

Page 37: Ecs 7th sem-cse-unit-2

Flow of Control cont’d…. Example: Implementing an if statement in ARM

The following if statement is used as an example:

if (a < b) {

x = 5;

y = c + d;

}

else x = c – d;

The implementation uses two blocks of code, one for the true case and another for the false case.

A branch may either fall through to the true case or branch to the false case:

; compute and test the condition

ADR r4,a ; get address for a

LDR r0,[r4] ; get value of a

ADR r4,b ; get address for b

LDR r1,[r4] ; get value of b

CMP r0, r1 ; compare a < b

BGE fblock ; if a >= b, take branch

; the true block follows

MOV r0,#5 ; generate value for x

ADR r4,x ; get address for x

STR r0,[r4] ; store value of x

ADR r4,c ; get address for c

LDR r0,[r4] ; get value of c

ADR r4,d ; get address for d

LDR r1,[r4] ; get value of d

ADD r0,r0,r1 ; compute c + d

ADR r4,y ; get address for y

STR r0,[r4] ; store value of y

B after ; branch around the false block

; the false block follows

fblock ADR r4,c ; get address for c

LDR r0,[r4] ; get value of c

ADR r4,d ; get address for d

LDR r1,[r4] ; get value of d

SUB r0,r0,r1 ; compute c – d

ADR r4,x ; get address for x

STR r0,[r4] ; store value of x

after ... ; code after the if statement


Page 38: Ecs 7th sem-cse-unit-2

Implementing the C switch statement in ARM

The switch statement in C takes the following form:

switch (test)

{

case 0: ... break;

case 1: ... break;

...

}

The above statement could be coded like a series of if statements, testing test against each case value in turn. However, it can be more efficiently implemented by using base-plus-offset addressing and building what is known as a branch table:

ADR r2,test ; get address for test

LDR r0,[r2] ; load value for test

ADR r1,switchtab ; load address for switch table

LDR r15,[r1,r0,LSL #2]

switchtab DCD case0

DCD case1

...

case0 ... ; code for case 0

...

case1 ... ; code for case 1

...


Page 39: Ecs 7th sem-cse-unit-2

This implementation of the switch statement uses the value of test as an offset into a table, where the table holds the addresses of the blocks of code that implement the various cases.

The heart of this code is the LDR instruction, which packs a lot of functionality into a single instruction:

■ It shifts the value of r0 left two bits to turn the offset into a word address.

■ It uses base-plus-offset addressing to add the left-shifted value of test (held in r0) to the address of the base of the table held in r1.

■ It sets the PC (r15) to the new address computed by the instruction.

Each case is implemented by a block of code that is located elsewhere in memory.

The branch table begins at the location named switchtab. The DCD statement is a way of loading a 32-bit address into memory at that point, so the branch table holds the addresses of the starting points of the blocks that correspond to the cases.


Implementing the C switch statement in ARM cont’d….
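The same branch-table idea can be sketched in C with an array of function pointers indexed by test; the case functions below are purely illustrative.

#include <stdio.h>

static void case0(void) { puts("case 0"); }
static void case1(void) { puts("case 1"); }

/* The C-level analogue of switchtab: each entry holds the address of the
 * block of code for one case. */
static void (*const switchtab[])(void) = { case0, case1 };

void dispatch(unsigned test)
{
    /* Like LDR r15,[r1,r0,LSL #2]: scale the index to a table entry and
     * transfer control to the address stored there. */
    if (test < sizeof(switchtab) / sizeof(switchtab[0]))
        switchtab[test]();
}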

Page 40: Ecs 7th sem-cse-unit-2

Loops: Example

The loop is a very common C statement, particularly in signal processing code.

Loops can be naturally implemented using conditional branches. Because loops often operate on values stored in arrays, loops are also a good illustration of another use of the base-plus-offset addressing mode.

A simple but common use of a loop is the FIR filter, explained in the Application Example below.

FIR filters: A Finite Impulse Response (FIR) filter is a commonly used method for processing signals. The FIR filter is a simple sum of products:

f = c_1*x_1 + c_2*x_2 + … + c_N*x_N

In use as a filter, the x_i are assumed to be samples of data taken periodically, while the c_i are coefficients. This computation is usually drawn like this:


This representation assumes that the samples are coming in periodically and that the FIR filter output is computed once every time a new sample comes in. The boxes represent delay elements that store the recent samples to provide the x_i. The delayed samples are individually multiplied by the c_i and then summed to provide the filter output.
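A minimal C sketch of this streaming form of the filter is shown below; the number of taps N and the coefficient values are illustrative, not taken from the slides.

#define N 4

static const int c[N] = { 1, 2, 2, 1 };  /* example coefficients          */
static int x[N];                         /* delay line: x[0] is the newest sample */

/* Called once per incoming sample; returns the filter output. */
int fir_step(int sample)
{
    int i, f = 0;

    for (i = N - 1; i > 0; i--)          /* shift the delay line          */
        x[i] = x[i - 1];
    x[0] = sample;                       /* insert the new sample         */

    for (i = 0; i < N; i++)              /* f = sum of c[i] * x[i]        */
        f = f + c[i] * x[i];
    return f;
}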

Page 41: Ecs 7th sem-cse-unit-2

Example: An FIR filter for the ARM

The C code for the FIR filter of the Application Example in slide 40 is as follows:

for (i = 0, f = 0; i < N; i++)

f = f + c[i] * x[i];

The arrays c and x can be addressed using base-plus-offset addressing: we load one register with the address of the zeroth element of each array and use the register holding i as the offset.

The C language defines a for loop as equivalent to a while loop with proper initialization and termination.

Using that rule, the for loop can be rewritten as:

i = 0;

f = 0;

while (i < N)

{

f = f + c[i]*x[i];

i++;

}


Page 42: Ecs 7th sem-cse-unit-2

Here is the code for the loop:

; loop initiation code

MOV r0,#0 ; use r0 for i, set to 0

MOV r8,#0 ; use a separate index for arrays

ADR r2,N ; get address for N

LDR r1,[r2] ; get value of N for loop termination test

MOV r2,#0 ; use r2 for f, set to 0

ADR r3,c ; load r3 with address of base of c array

ADR r5,x ; load r5 with address of base of x array

; loop body

loop LDR r4,[r3,r8] ; get value of c[i]

LDR r6,[r5,r8] ; get value of x[i]

MUL r4,r4,r6 ; compute c[i]*x[i]

ADD r2,r2,r4 ; add into running sum f

; update loop counter and array index

ADD r8,r8,#4 ; add one word offset to array index

ADD r0,r0,#1 ; add 1 to i

; test for exit

CMP r0,r1

BLT loop ; if i < N, continue loop

loopend...

The result of a 32-bit x 32-bit multiplication is a 64-bit result.

The ARM MUL instruction leaves the lower 32 bits of the result in the destination register.

So long as the result fits within 32 bits, this is the desired action.

If the input values are such that the result can sometimes exceed 32 bits, then the code must be redesigned to compute higher-resolution values.


Example : An FIR filter for the ARM cont’d….

Page 43: Ecs 7th sem-cse-unit-2

Functions

The other important class of C statement to consider is the function. A C function returns a value (unless its return type is void); subroutine or procedure are the common names for such a construct when it does not return a value. Consider this simple use of a function in C:

x = a + b;

foo(x);

y = c - d;

A function returns to the code immediately after the function call, in this case the assignment to y.

A simple branch is insufficient because the program would not know where to return.

To return properly, the PC value must be saved when the procedure/function is called and, when the procedure is finished, the PC must be set to the address of the instruction just after the call to the procedure.


Page 44: Ecs 7th sem-cse-unit-2

The branch-and-link instruction is used in the ARM for procedure calls.

For instance, BL foo will perform a branch and link to the code starting at location foo (using PC-relative addressing).

The branch and link is much like a branch, except that before branching it stores the current PC value in r14.

Thus, to return from a procedure, the value of r14 is simply moved to r15:

MOV r15,r14

Note: The PC value stored in r14 should not be overwritten during the procedure.


Functions cont’d….

Page 45: Ecs 7th sem-cse-unit-2

e.g.: If a C function is called within another C function, the second function call will overwrite r14, destroying the return address for the first function call.

The standard procedure for allowing nested procedure calls (including recursive procedure calls) is to build a stack, as illustrated in Figure below. The C code shows a series of functions that call other functions: f1( ) calls f2( ), which in turn calls f3( ).


Functions cont’d….

The right side of the figure shows the state of the procedure call stack during the execution of f3( ). The stack contains one activation record for each active procedure. When f3( ) finishes, it can pop the top of the stack to get its return address, leaving the return address for f2( ) waiting at the top of the stack for its return.
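Since the figure's C code is not reproduced in this transcript, a minimal sketch of the call chain it describes might look like this (the bodies are placeholders):

void f3(int c) { (void)c; /* innermost function; its return address is on top of the stack */ }
void f2(int b) { f3(b); } /* calls f3( ) */
void f1(int a) { f2(a); } /* calls f2( ) */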

Page 46: Ecs 7th sem-cse-unit-2

Procedure calls in ARM: Example

void f1(int a) { f2(a); }

The ARM C compiler's convention is to use register r13 to point to the top of the stack. Assume that the argument a has been passed into f1() on the stack and that the argument for f2 (which happens to be the same value) must be pushed onto the stack before calling f2().

Here is some handwritten code for f1(), which includes a call to f2():

f1 LDR r0,[r13] ; load value of a argument into r0 from stack
; call f2()
STR r14,[r13]! ; store f1's return address on the stack
STR r0,[r13]! ; store argument to f2 onto stack
BL f2 ; branch and link to f2
; return from f1()
SUB r13,#4 ; pop f2's argument off the stack
LDR r13!,r15 ; restore registers and return


Page 47: Ecs 7th sem-cse-unit-2

PROGRAMMING INPUT AND OUTPUT

Input and Output Devices: Input and output devices usually have some analog or nonelectronic component.

e.g.: A disk drive has a rotating disk and analog read/write electronics.

Figure below shows the structure of a typical I/O device and its relationship to the CPU.

The interface between the CPU and the device's internals (e.g., the rotating disk and read/write electronics in a disk drive) is a set of registers.

The CPU talks to the device by reading and writing the registers.

Devices typically have several registers:

■ Data registers hold values that are treated as data by the device, such as the data read or written by a disk.

■ Status registers provide information about the device's operation, such as whether the current transaction has completed.


Structure of a typical I/O Device

Note: Some of the registers may be read-only, such as a status register that indicates when the device is done, while others may be readable or writable.
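One common C idiom for such a register interface is a volatile-qualified structure placed at the device's base address. The register names, layout, base address, and busy bit below are hypothetical, not those of any particular device.

#include <stdint.h>

struct dev_regs {
    volatile uint8_t data;     /* data register                         */
    volatile uint8_t status;   /* status register (assume bit 0 = busy) */
};

#define DEV ((struct dev_regs *)0x1000)   /* assumed memory-mapped base address */

void dev_write(uint8_t value)
{
    DEV->data = value;                    /* write the data register        */
    while (DEV->status & 0x01)            /* wait while the device is busy  */
        ;
}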

Page 48: Ecs 7th sem-cse-unit-2

Application Example : The 8251 UART

The 8251 UART (Universal Asynchronous Receiver/Transmitter) is the original device used for serial communications, such as the serial port connections on PCs.

The 8251 was introduced as a stand-alone integrated circuit for early microprocessors.

Its functions nowadays are typically subsumed by a larger chip, but these more advanced devices still use the basic programming interface defined by the 8251.


PROGRAMMING INPUT AND OUTPUT cont’d….

Page 49: Ecs 7th sem-cse-unit-2

The UART is programmable for a variety of transmission and reception parameters.

However, the basic format of transmission is simple.

Data are transmitted as streams of characters, each of which has the following form:

Every character starts with a start bit (a 0) and ends with a stop bit (a 1).

The start bit allows the receiver to recognize the start of a new character; the stop bit (a 1) guarantees a transition that the receiver can detect at the start of the next character's start bit.

The data bits are sent as high and low voltages at a uniform rate.

That rate is known as the Baud Rate; the period of one bit is the inverse of the baud rate.


PROGRAMMING INPUT AND OUTPUT cont’d….
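A small worked example of the bit-period relationship; the 9600-baud rate and the 10-bit frame (start + 8 data + stop) are only an illustration.

#include <stdio.h>

int main(void)
{
    double baud = 9600.0;
    double bit_period_us = 1.0e6 / baud;          /* about 104 us per bit        */
    double char_time_us  = 10.0 * bit_period_us;  /* about 1042 us per character */

    printf("bit period = %.1f us, character time = %.1f us\n",
           bit_period_us, char_time_us);
    return 0;
}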

Page 50: Ecs 7th sem-cse-unit-2

Before transmitting or receiving data, the CPU must set the UART's mode registers to correspond to the data line's characteristics.

The parameters for the serial port are familiar from the parameters for a serial communications program (such as Kermit):

■ the baud rate;

■ the number of bits per character (5 through 8);

■ whether parity is to be included and whether it is even or odd; and

■ the length of a stop bit (1, 1.5, or 2 bits).

The UART includes one 8-bit register that buffers characters between the UART and the CPU bus.

The Transmitter Ready output indicates that the transmitter is ready to accept a data character; the Transmitter Empty signal goes high when the UART has no characters to send.

On the receiver side, the Receiver Ready pin goes high when the UART has a character ready to be read by the CPU.


PROGRAMMING INPUT AND OUTPUT cont’d….

Page 51: Ecs 7th sem-cse-unit-2

Input and Output Primitives

Microprocessors can provide programming support for input and output in two ways:

• I/O instructions and

• memory-mapped I/O.

I/O instructions provide a separate address space for I/O devices.

The most common way to implement I/O is by memory mapping — even CPUs that provide I/O instructions can also implement memory-mapped I/O.

As the name implies, memory-mapped I/O provides addresses for the registers in each I/O device.

Programs use the CPU's normal read and write instructions to communicate with the devices.


Page 52: Ecs 7th sem-cse-unit-2

Following Example illustrates memory-mapped I/O on the ARM.

Memory-mapped I/O on ARM

The EQU pseudo-op can be used to define a symbolic name for the memory location of our I/O device:

DEV1 EQU 0x1000

Given that name, we can use the following standard code to read and write the device register:

LDR r1,#DEV1 ; set up device address

LDR r0,[r1] ; read DEV1

LDR r0,#8 ; set up value to write

STR r0,[r1] ; write 8 to device


Input and Output Primitives cont’d….

Page 53: Ecs 7th sem-cse-unit-2

How can I/O devices be accessed directly in a high-level language like C?

When a variable in C is defined and used, the compiler hides the variable's address from the user.

But pointers can be used to manipulate the addresses of I/O devices.

The traditional names for functions that read and write arbitrary memory locations are peek and poke.

The peek function can be written in C as:

int peek(char *location) {

return *location; /* de-reference location pointer */

}


Input and Output Primitives cont’d….

Page 54: Ecs 7th sem-cse-unit-2

The argument to peek is a pointer that is de-referenced by the C * operator to read the location.

Thus, the following code can be written to read a device register:

#define DEV1 0x1000
...
dev_status = peek(DEV1); /* read device register */

The poke function can be implemented as:

void poke(char *location, char newval) {
    (*location) = newval; /* write to location */
}

To write to the status register, the following code can be used:

poke(DEV1,8); /* write 8 to device register */

These functions can, of course, be used to read and write arbitrary memorylocations, not just devices.


Input and Output Primitives cont’d….
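In practice the location pointers are usually declared volatile so that an optimizing compiler does not cache or remove the device accesses. The variants below are a sketch of that refinement, not the slide's original code (note the changed return type of peek).

char peek(volatile char *location)
{
    return *location;              /* every call performs a real read  */
}

void poke(volatile char *location, char newval)
{
    *location = newval;            /* every call performs a real write */
}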

Page 55: Ecs 7th sem-cse-unit-2

Busy-Wait I/O

The most basic way to use devices in a program is busy-wait I/O.

Devices are typically slower than the CPU and may require many cycles to complete an operation.

If the CPU is performing multiple operations on a single device, such as writing several characters to an output device, then it must wait for one operation to complete before starting the next one.

(e.g.: If the CPU tries to start writing the second character before the device has finished with the first one, the device will probably never print the first character.)

Asking an I/O device whether it is finished by reading its status register is often called POLLING.


Page 56: Ecs 7th sem-cse-unit-2

Busy-wait I/O Programming

e.g.: A sequence of characters is to be written to an output device. The device has two registers: one for the character to be written and a status register. The status register's value is 1 when the device is busy writing and 0 when the write transaction has completed. The peek and poke functions will be used to write the busy-wait routine in C. First, symbolic names are defined for the register addresses:

#define OUT_CHAR 0x1000 /* output device character register */
#define OUT_STATUS 0x1001 /* output device status register */

The sequence of characters is stored in a standard C string, which is terminated by a null (0) character. Peek and poke can be used to send the characters and wait for each transaction to complete:

char *mystring = "Hello, world."; /* string to write */
char *current_char; /* pointer to current position in string */
current_char = mystring; /* point to head of string */
while (*current_char != '\0') { /* until null character */
    poke(OUT_CHAR,*current_char); /* send character to device */
    while (peek(OUT_STATUS) != 0); /* keep checking status */
    current_char++; /* update character pointer */
}


Busy-Wait I/O cont’d….

Note: The outer while loop sends the characters one at a time. The inner while loop checks the device status — it implements the busy-wait function by repeatedly checking the device status until the status changes to 0.

Page 57: Ecs 7th sem-cse-unit-2

Copying characters from input to output using busy-wait I/O: a character needs to be read repeatedly from the input device and written to the output device.

First, the addresses for the device registers need to be defined :

#define IN_DATA 0x1000

#define IN_STATUS 0x1001

#define OUT_DATA 0x1100

#define OUT_STATUS 0x1101

The input device sets its status register to 1 when a new character has been read; we must set the status register back to 0 after the character has been read so that the device is ready to read another character.

When writing, we must set the output status register to 1 to start writing and wait for it to return to 0.

Peek and Poke can be used repeatedly to perform the read/write operation:

while (TRUE)

{ /* perform operation forever */

/* read a character into achar */

while (peek(IN_STATUS) == 0); /* wait until ready */

achar = (char)peek(IN_DATA); /* read the character */

/* write achar */

poke(OUT_DATA,achar);

poke(OUT_STATUS,1); /* turn on device */

while (peek(OUT_STATUS) != 0); /* wait until done */

}


Busy-Wait I/O cont’d….

Page 58: Ecs 7th sem-cse-unit-2

Interrupts

Busy-wait I/O is extremely inefficient — the CPU does nothing but test the device status while the I/O transaction is in progress.

In many cases, the CPU could do useful work in parallel with the I/O transaction, such as:

■ Computation, as in determining the next output to send to the device or processing the last input received, and

■ Control of other I/O devices.

To allow parallelism, new mechanisms need to be introduced into the CPU.

The Interrupt Mechanism allows devices to signal the CPU and to force execution of a particular piece of code.

When an interrupt occurs, the program counter's value is changed to point to an Interrupt Handler Routine (also known as a Device Driver) that takes care of the device: writing the next data, reading data that have just become ready, and so on.


Page 59: Ecs 7th sem-cse-unit-2

The interrupt mechanism saves the value of the PC at the interruption so that the CPU can return to the program that was interrupted.

Interrupts therefore allow the flow of control in the CPU to change easily between different contexts, such as a foreground computation and multiple I/O devices.

As shown in Figure below, the interface between the CPU and the I/O device includes the following signals for interrupting:

■ The I/O device asserts the interrupt request signal when it wants service from the CPU; and

■ The CPU asserts the interrupt acknowledge signal when it is ready to handle the I/O device's request.


Interrupts cont’d….

Page 60: Ecs 7th sem-cse-unit-2

The I/O device's logic decides when to interrupt; e.g., it may generate an interrupt when its status register goes into the ready state.

The CPU may not be able to service an interrupt request immediately, because it may be doing something else that must be finished first.

e.g.: A program that talks to both a high-speed disk drive and a low-speed keyboard should be designed to finish a disk transaction before handling a keyboard interrupt.

Only when the CPU decides to acknowledge the interrupt does it change the Program Counter to point to the device's handler.

The Interrupt Handler operates like a subroutine, except that it is not called by the executing program.

The program that runs when no interrupt is being handled is often called the Foreground Program; when the interrupt handler finishes, it returns to the foreground program, wherever processing was interrupted.


Interrupts cont’d….

Page 61: Ecs 7th sem-cse-unit-2

Example: Consider the interrupt style of processing and compare it to busy-wait I/O.

Assume that C functions can be written to act as interrupt handlers.

Those handlers will work with the devices in much the same way as in busy-wait I/O, by reading and writing status and data registers.

The main difference is in handling the output — the interrupt signals that the character is done, so the handler does not have to do anything.

A global character variable will be used by the input handler to pass the character to the foreground program.

Because the foreground program doesn't know when an interrupt occurs, a global Boolean variable, gotchar, will also be used to signal when a new character has been received.


Interrupts cont’d….

Page 62: Ecs 7th sem-cse-unit-2

The code for the input and output handlers follows:

void input_handler()

{ /* get a character and put in global */

achar = peek(IN_DATA); /* get character */

gotchar = TRUE; /* signal to main program */

poke(IN_STATUS,0); /* reset status to initiate next transfer */

}

void output_handler()

{ /* react to character being sent */

/* don't have to do anything */

}

The main program is like the busy-wait program. It looks at gotchar to check when a new character has been read and then immediately sends it out to the output device.

main()

{

while (TRUE)

{ /* read then write forever */

if (gotchar)

{ /* write a character */

poke(OUT_DATA,achar); /* put character in device */

poke(OUT_STATUS,1); /* set status to initiate write */

gotchar = FALSE; /* reset flag */

}

}

}


Interrupts cont’d….

Page 63: Ecs 7th sem-cse-unit-2

Copying characters from input to output with interrupts and buffers: the program performs reads and writes independently, rather than reading a single character and then writing it.

The read and write routines communicate through the following global variables:

■ A character string io_buf will hold a queue of characters that have been read but not yet written.

■ A pair of integers buf_head (start) and buf_tail (end) will point to the first and last characters read.

■ An integer error will be set to 1 whenever io_buf overflows.

The global variables allow the input and output devices to run at different rates.

The queue io_buf acts as a wraparound buffer; characters are added at the tail when an input is received and removed from the head (the book says tail) when output is needed.


Interrupts cont’d….

Page 64: Ecs 7th sem-cse-unit-2

When the first character is read, the tail is incremented after the character is added to the queue, leaving the buffer and pointers looking like the following:

When the buffer is full, one character slot is left unused (figure on the left below).

As the next figure shows, if another character were added and the tail updated (wrapping it around to the head of the buffer), one would be unable to distinguish a full buffer from an empty one.


Interrupts cont’d….

Page 65: Ecs 7th sem-cse-unit-2

The following code provides the declarations for the above global variables and some service routines for adding and removing characters from the queue. Because interrupt handlers are regular code, we can use subroutines to structure the code just as with any program.

#define BUF_SIZE 8
char io_buf[BUF_SIZE]; /* character buffer */
int buf_head = 0, buf_tail = 0; /* current position in buffer */
int error = 0; /* set to 1 if buffer ever overflows */

int empty_buffer() { /* returns TRUE if buffer is empty */
    return buf_head == buf_tail;
}

int full_buffer() { /* returns TRUE if buffer is full */
    return (buf_tail + 1) % BUF_SIZE == buf_head;
}

int nchars() { /* returns the number of characters in the buffer */
    if (buf_tail >= buf_head) return buf_tail - buf_head;
    else return BUF_SIZE + buf_tail - buf_head;
}

void add_char(char achar)
{ /* add a character at the buffer tail */
    io_buf[buf_tail++] = achar;
    /* check pointer */
    if (buf_tail == BUF_SIZE) buf_tail = 0;
}

char remove_char()
{ /* take a character from the buffer head */
    char achar;
    achar = io_buf[buf_head++]; /* check pointer */
    if (buf_head == BUF_SIZE) buf_head = 0;
    return achar;
}
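A short usage sketch of these queue routines follows; the calls are hypothetical and shown only to illustrate the head/tail behaviour (with BUF_SIZE = 8, at most 7 characters can be held at once).

void buffer_demo(void)
{
    add_char('a');                 /* queue: a           */
    add_char('b');                 /* queue: a b         */
    char first = remove_char();    /* first == 'a'       */
    int  n     = nchars();         /* n == 1 ('b' left)  */
    (void)first; (void)n;          /* silence unused-variable warnings */
}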

Fig.: Memory Mapping of Devices — the input device's data register (IN_DATA) is mapped at 0x1000 and its status register (IN_STATUS) at 0x1001; the output device's data register (OUT_DATA) is mapped at 0x1100 and its status register (OUT_STATUS) at 0x1101.

Page 66: Ecs 7th sem-cse-unit-2

Assume that there are two interrupt handling routines defined in C, input_handler for the input device and output_handler for the output device.

These routines work with the device in much the same way as did the busy-wait routines.

Here is the code for the input handler:

#define IN_DATA 0x1000

#define IN_STATUS 0x1001

void input_handler()

{

char achar;

if (full_buffer()) /* error */

error = 1;

else { /* read the character and update pointer */

achar = peek(IN_DATA); /* read character */

add_char(achar); /* add to queue */

}

poke(IN_STATUS,0); /* set status register back to 0 */

/* if buffer was empty, start a new output transaction */

if (nchars() == 1)

{ /* buffer had been empty until this interrupt */

poke(OUT_DATA,remove_char()); /* send character */

poke(OUT_STATUS,1); /* turn device on */

}

}

#define OUT_DATA 0x1100

#define OUT_STATUS 0x1101

void output_handler()

{

if (!empty_buffer())

{ /* start a new character */

poke(OUT_DATA,remove_char()); /* send character */

poke(OUT_STATUS,1); /* turn device on */

}

}


Interrupts cont’d….

Page 67: Ecs 7th sem-cse-unit-2

Debugging Interrupt Code

Assume that the foreground code is performing a matrix multiplication operation y = Ax + b:

for (i = 0; i < M; i++)

{

y[i] = b[i];

for (j = 0; j < N; j++)

y[i] = y[i] + A[i][j]*x[j];

}

The interrupt handlers are used to perform I/O while the matrix computation is performed, but with one small change: read_handler has a bug that causes it to change the value of j.

Any CPU register that is written by the interrupt handler must be saved before it is modified and restored before the handler exits.

Any type of bug, such as forgetting to save the register or to properly restore it, can cause that register to mysteriously change value in the foreground program.


Page 68: Ecs 7th sem-cse-unit-2

The CPU implements interrupts by checking the interrupt request line at the beginning of execution of every instruction.

If an interrupt request has been asserted, the CPU does not fetch the instruction pointed to by the PC.

Instead, the CPU sets the PC to a predefined location, which is the beginning of the interrupt handling routine.

The starting address of the interrupt handler is usually given as a pointer: rather than defining a fixed location for the handler, the CPU defines a location in memory that holds the address of the handler, which can then reside anywhere in memory.

Because the CPU checks for interrupts at every instruction, it can respond quickly to service requests from devices.

However, the interrupt handler must return to the foreground program without disturbing the foreground program's operation.

Since subroutines perform a similar function, it is natural to build the CPU's interrupt mechanism to resemble its subroutine function.


Interrupts cont’d….

Page 69: Ecs 7th sem-cse-unit-2

Interrupts cont’d…. Priorities and Vectors

There are two ways in which interrupts can be generalized to handle multiple devices and to provide more flexible definitions for the associated hardware and software:

■ Interrupt priorities allow the CPU to recognize some interrupts as more important than others, and

■ Interrupt vectors allow the interrupting device to specify its handler.

Prioritized interrupts not only allow multiple devices to be connected to the interrupt line but also allow the CPU to ignore less important interrupt requests while it handles more important requests.


Page 70: Ecs 7th sem-cse-unit-2

As shown in Figure below, the CPU provides several different interrupt request signals, shown here as L1, L2, up to Ln.

The lower-numbered interrupt lines are given higher priority, so in this case, if devices 1, 2, and n all requested interrupts simultaneously, 1's request would be acknowledged because it is connected to the highest-priority interrupt line.

Rather than providing a separate interrupt acknowledge line for each device, most CPUs use a set of signals that provide the priority number of the winning interrupt in binary form (so that interrupt level 7 requires 3 bits rather than 7).

A device knows that its interrupt request was accepted by seeing its own priority number on the interrupt acknowledge lines.


Interrupts cont’d….Priorities and Vectors

Page 71: Ecs 7th sem-cse-unit-2

How can the priority of a device be changed? Simply by connecting it to a different interrupt request line.

This requires hardware modification, so if priorities need to be changeable, removable cards, programmable switches, or some other mechanism should be provided to make the change easy.

Masking: The priority mechanism must ensure that a lower-priority interrupt does not occur when a higher-priority interrupt is being handled.

This decision process is known as masking.


Interrupts cont’d….Priorities and Vectors

Page 72: Ecs 7th sem-cse-unit-2

When an interrupt is acknowledged, the CPU stores in an internal register the priority level of that interrupt.

When a subsequent interrupt is received, its priority is checked against the priority register; the new request is acknowledged only if it has higher priority than the currently pending interrupt.

When the interrupt handler exits, the priority register must be reset.

Non-Maskable Interrupt (NMI): The highest-priority interrupt is normally called the Non-Maskable Interrupt (NMI).

The NMI cannot be turned off and is usually reserved for interrupts caused by power failures; a simple circuit can be used to detect a dangerously low power supply, and the NMI interrupt handler can be used to save critical state in nonvolatile memory, turn off I/O devices to eliminate spurious device operation during power-down, etc.


Interrupts cont’d….Priorities and Vectors
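A minimal C sketch of the masking decision described above, assuming that a smaller number means a higher priority (as on the previous slide) and using a hypothetical NO_INTERRUPT marker for the idle state:

#define NO_INTERRUPT 1000             /* assumed "no interrupt active" marker */

static int current_priority = NO_INTERRUPT;   /* models the CPU's priority register */

int accept_request(int requested_priority)
{
    if (requested_priority < current_priority) {   /* higher priority wins        */
        current_priority = requested_priority;     /* remember it while handling  */
        return 1;                                   /* acknowledge the request     */
    }
    return 0;                                       /* leave the request pending   */
}

void handler_exit(void)
{
    current_priority = NO_INTERRUPT;   /* reset the priority register on exit */
}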

Page 73: Ecs 7th sem-cse-unit-2

As shown in Figure below, a small amount of logic external to the CPU can be used to generate an interrupt whenever any one of a group of devices needs to request service.

The CPU will call the interrupt handler associated with this priority; that handler does not know which of the devices actually requested the interrupt.

The handler uses software polling to check the status of each device: in this example, it would read the status registers of devices 1, 2, and 3 to see which of them is ready and requesting service.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 73

Interrupts cont’d….Priorities and Vectors

Using polling to share an Interrupt over several devices

Page 74: Ecs 7th sem-cse-unit-2

Assume that there are three devices A, B, and C. A has priority 1 (highest priority), B priority 2, and C priority 3.

The adjacent UML sequence diagram shows which interrupt handler is executing as a function of time for a sequence of interrupt requests.

In each case, an interrupt handler keeps running until either it is finished or a higher-priority interrupt arrives.

The C interrupt, although it arrives early, does not finish for a long time because interrupts from both A and B intervene.

The System Design must take into account the worst-case combinations of interrupts that can occur to ensure that no device goes without service for too long.

When both A and B interrupt simultaneously, A's interrupt gets priority; when A's handler is finished, the priority mechanism automatically answers B's pending interrupt.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 74

I/O with Prioritized Interrupts: Example

Page 75: Ecs 7th sem-cse-unit-2

Vectors provide flexibility in a different dimension, namely, the ability to define the interrupt handler that should service a request from a device.

Adjacent Figure shows the Hardware Structure required to support interrupt vectors.

In addition to the interrupt request and acknowledge lines, additional Interrupt Vector Lines run from the devices to the CPU.

After a device's request is acknowledged, it sends its Interrupt Vector over those lines to the CPU.

The CPU then uses the vector number as an index in a table stored in memory as shown in Figure.

The location referenced in the interrupt vector table by the vector number gives the address of the handler.

There are two important things to notice about the interrupt vector mechanism.

First, the device, not the CPU, stores its vector number.

In this way, a device can be given a new handler simply by changing the vector number it sends, without modifying the system software.
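
A minimal C sketch of how such a table can be used for dispatch; the table size, handler type, and initialization are assumptions for illustration, not a specific CPU's mechanism.

#define NUM_VECTORS 16

typedef void (*isr_t)(void);                 /* handler function type */

isr_t interrupt_vector_table[NUM_VECTORS];   /* filled in at system initialization */

/* Called after the device has sent its vector number: look up the
   handler in the table and call it. */
void dispatch_interrupt(unsigned vector)
{
    if (vector < NUM_VECTORS && interrupt_vector_table[vector] != 0)
        interrupt_vector_table[vector]();
}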

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 75

Interrupt Vectors

I/O with Prioritized Interrupts: Example

Page 76: Ecs 7th sem-cse-unit-2

Interrupt Overheads

Consider the complete interrupt handling process.

Once a device requests an interrupt, some steps are performed by the CPU, some by the device, and others by software. The major steps in the process are given here:

1. CPU: The CPU checks for pending interrupts at the beginning of an instruction.

It answers the highest-priority interrupt, which must have a higher priority than that given in the interrupt priority register.

2. Device: The device receives the acknowledgment and sends the CPU its interrupt vector.

3. CPU: The CPU looks up the device handler address in the interrupt vector table using the vector as an index.

A subroutine-like mechanism is used to save the current value of the PC and possibly other internal CPU state, such as general-purpose registers.

4. Software: The device driver may save additional CPU state.

It then performs the required operations on the device. It then restores any saved state and executes the interrupt return instruction.

5. CPU: The interrupt return instruction restores the PC and other automatically saved state to return execution to the code that was interrupted.

Interrupts do not come without a performance penalty.

In addition to the execution time required for the code that talks directly to the devices, there is execution time overhead associated with the interrupt mechanisms.

■ The interrupt itself has overhead similar to a subroutine call.

Because an interrupt causes a change in the program counter, it incurs a branch penalty.

In addition, if the interrupt automatically stores CPU registers, that action requires extra cycles, even if the state is not modified by the interrupt handler.

■ In addition to the branch delay penalty, the interrupt requires extra cycles to acknowledge the interrupt and obtain the vector from the device.

■ The interrupt handler will, in general, save and restore CPU registers that were not automatically saved by the interrupt.

■ The interrupt return instruction incurs a branch penalty as well as the time required to restore the automatically saved state.

8 September 2014

ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 76

Page 77: Ecs 7th sem-cse-unit-2

ARM7 supports two types of interrupts: Fast Interrupt Requests (FIQs) and Interrupt Requests (IRQs).

An FIQ takes priority over an IRQ.

The Interrupt Table is always kept in the Bottom Memory Addresses, starting at location 0.

The entries in the table typically contain subroutine calls to the appropriate handler.

The ARM7 performs the following steps when responding to an interrupt [ARM99B]:

■ saves the appropriate value of the PC to be used for returning to the foreground program,

■ copies the CPSR into a Saved Program Status Register (SPSR),

■ forces bits in the CPSR to note the interrupt, and

■ forces the PC to the appropriate Interrupt Vector.

When leaving the interrupt handler, the handler should:

■ restore the proper PC value,

■ restore the CPSR from the SPSR, and

■ clear interrupt disable flags.

The worst-case latency to respond to an interrupt includes the following components:

■ two cycles to synchronize the external request,

■ up to 20 cycles to complete the current instruction,

■ three cycles for data abort, and

■ two cycles to enter the interrupt handling state.

This adds up to 27 clock cycles. The best-case latency is four clock cycles.

8 September 2014

ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 77

Interrupts in ARM

Page 78: Ecs 7th sem-cse-unit-2

SUPERVISOR MODE, EXCEPTIONS, AND TRAPS

Supervisor Mode: Complex systems are often implemented as several programs that communicate with each other.

These programs may run under the command of an Operating System.

It may be desirable to provide hardware checks to ensure that the programs do not interfere with each other, say by erroneously writing into a segment of memory used by another program.

Software debugging is important but can leave some problems in a running system; hardware checks ensure an additional level of safety.

In such cases it is often useful to have a Supervisor Mode provided by the CPU. Normal programs run in User Mode.

The Supervisor Mode has more privileges than the User Mode.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 78

Page 79: Ecs 7th sem-cse-unit-2

One such privileged facility is the Memory Management System, which allows the addresses of memory locations to be changed dynamically.

Control of the Memory Management Unit (MMU) is typically reserved for Supervisor Mode to avoid the obvious problems that could occur when program bugs cause inadvertent changes in the Memory Management Registers.

The ARM instruction that puts the CPU in supervisor mode is called SWI:

SWI CODE_1

It can be executed conditionally, as with any ARM instruction.

SWI causes the CPU to go into supervisor mode and sets the PC to 0x08.

The argument to SWI is a 24-bit immediate value that is passed on to the supervisor mode code; it allows the program to request various services from the supervisor mode.

In supervisor mode, the bottom 5 bits of the CPSR are set to the supervisor mode pattern to indicate that the CPU is in supervisor mode.

The old value of the CPSR just before the SWI is stored in a register called the Saved Program Status Register (SPSR).

There are in fact several SPSRs for different modes; the supervisor mode SPSR is referred to as SPSR_svc.

To return from supervisor mode, the supervisor restores the PC from register r14 and restores the CPSR from the SPSR_svc.

8 September 2014

ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 79

SUPERVISOR MODE cont’d….

Page 80: Ecs 7th sem-cse-unit-2

Exceptions

An exception is an internally detected error, e.g., division by zero.

One way to handle it would be to check every divisor before division to be sure it is not zero, but this would both substantially increase the size of numerical programs and cost a great deal of CPU time evaluating the divisor's value.

The CPU can more efficiently check the divisor's value during execution.

Since the time at which a zero divisor will be found is not known in advance, this event is similar to an interrupt except that it is generated inside the CPU.

The exception mechanism provides a way for the program to react to such unexpected events.

Just as interrupts can be seen as an extension of the subroutine mechanism, exceptions are generally implemented as a variation of an interrupt.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 80

Page 81: Ecs 7th sem-cse-unit-2

Exceptions in general require both prioritization and vectoring.

Exceptions must be prioritized because a single operation may generate more than one exception.

e.g., an illegal operand and an illegal memory access.

The priority of exceptions is usually fixed by the CPU architecture.

Vectoring provides a way for the user to specify the handler for the exception condition.

The vector number for an exception is usually predefined by the architecture; it is used to index into a table of exception handlers.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 81

Exceptions cont’d….

Page 82: Ecs 7th sem-cse-unit-2

Traps

A TRAP, also known as a Software Interrupt, is an instruction that explicitly generates an exception condition.

The most common use of a Trap is to enter Supervisor Mode.

The entry into supervisor mode must be controlled to maintain security; if the interface between user and supervisor mode is improperly designed, a user program may be able to sneak code into the supervisor mode that could be executed to perform harmful operations.

The ARM provides the SWI interrupt for software interrupts.

This instruction causes the CPU to enter supervisor mode.

An opcode is embedded in the instruction that can be read by the handler.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 82

Page 83: Ecs 7th sem-cse-unit-2

CO-PROCESSORS

To provide flexibility at the instruction set level regarding which features are implemented in the CPU, CPU architects attach co-processors to the CPU, through which some of the instructions are implemented.

e.g., floating-point arithmetic was introduced into the Intel architecture by providing a separate chip (the Intel 8087 co-processor) that implemented the floating-point instructions.

To support co-processors, certain opcodes must be reserved in the instruction set for co-processor operations.

Because it executes instructions, a co-processor must be tightly coupled to the CPU.

When the CPU receives a co-processor instruction, the CPU must activate the co-processor and pass it the relevant instruction.

Co-processor instructions can load and store co-processor registers or can perform internal operations.

The CPU can suspend execution to wait for the co-processor instruction to finish; it can also take a more superscalar approach and continue executing instructions while waiting for the co-processor to finish.

The ARM Architecture provides support for up to 16 co-processors.

Co-processors are able to perform load and store operations on their own registers.

They can also move data between the co-processor registers and main ARM registers.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 83

Page 84: Ecs 7th sem-cse-unit-2

MEMORY SYSTEM MECHANISMS

Caches: Caches are used to speed up memory system performance.

Many microprocessor architectures include a cache, which speeds up the average memory access time when properly used.

However, it also increases the variability of memory access times: accesses that hit in the cache will be fast, while accesses to locations not in the cache will be slow.

A cache is a small, fast memory that holds copies of some of the contents of main memory.

Because the cache is fast, it provides higher-speed access for the CPU; but since it is small, not all requests can be satisfied by the cache, forcing the system to wait for the slower main memory.

Caching makes sense when the CPU is using only a relatively small set of memory locations at any one time; the set of active locations is often called the Working Set.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 84

Page 85: Ecs 7th sem-cse-unit-2

Figure below shows how the cache supports reads in the memory system.

A Cache Controller mediates between the CPU and the memory system comprising the main memory.

The cache controller sends a memory request to the cache and main memory.

If the requested location is in the cache, the cache controller forwards the location's contents to the CPU and aborts the main memory request; this condition is known as a Cache Hit.

If the location is not in the cache, the controller waits for the value from main memory and forwards it to the CPU; this situation is known as a Cache Miss.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 85

MEMORY SYSTEM MECHANISMS: Caches

The Cache in the Memory System

Page 86: Ecs 7th sem-cse-unit-2

Cache Misses can be classified into several types depending on the situation that generated them:

■ a compulsory miss (also known as a cold miss) occurs the first time a location is used,

■ a capacity miss is caused by a too-large working set, and

■ a conflict miss happens when two locations map to the same location in the cache.

Some basic formulas for memory system performance: Let h be the hit rate, the probability that a given memory location is in the cache. It follows that (1 - h) is the miss rate, or the probability that the location is not in the cache.

The average memory access time can be computed as

tav = h.tcache + (1 - h).tmain .........................(1)

where tcache is the access time of the cache and tmain is the main memory access time.

The hit rate depends on the program being executed and the cache organization. The best-case memory access time (ignoring cache controller overhead) is tcache, while the worst-case access time is tmain.

Given that tmain is typically 50–60 ns for DRAM, while tcache is at most a few nanoseconds, the spread between worst-case and best-case memory delays is substantial.
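
A quick numeric check of equation (1), using assumed figures of h = 0.9, tcache = 2 ns, and tmain = 50 ns (illustrative values, not measurements):

#include <stdio.h>

int main(void)
{
    double h = 0.9, t_cache = 2.0, t_main = 50.0;     /* ns, assumed      */
    double t_av = h * t_cache + (1.0 - h) * t_main;   /* equation (1)     */
    printf("average access time = %.1f ns\n", t_av);  /* prints 6.8 ns    */
    return 0;
}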

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 86

MEMORY SYSTEM MECHANISMS: Caches

Page 87: Ecs 7th sem-cse-unit-2

Modern CPUs may use multiple levels of cache as shown in Figure below.

The first-level cache (commonly known as the L1 cache) is closest to the CPU, the second-level cache (L2 cache) feeds the first-level cache, and so on.

The second-level cache is much larger but is also slower.

If h1 is the first-level hit rate and h2 is the rate at which accesses hit the second-level cache but not the first-level cache, then the average access time for a two-level cache system is given by:

tav = h1.tL1 + h2.tL2 + (1 - h1 - h2).tmain .........................(2)

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 87

MEMORY SYSTEM MECHANISMS: Caches

A two-level Cache System

Page 88: Ecs 7th sem-cse-unit-2

The simplest way to implement a cache is a DIRECT-MAPPED cache, as shown in Figure below.

The cache consists of cache blocks, each of which includes a tag to show which memory location is represented by this block, a data field holding the contents of that memory, and a valid tag to show whether the contents of this cache block are valid. An address is divided into three sections.

The index is used to select which cache block to check. The tag is compared against the tag value in the block selected by the index. If the address tag matches the tag value in the block, that block includes the desired memory location.

If the length of the data field is longer than the minimum addressable unit, then the lowest bits of the address are used as an offset to select the required value from the data field. Given the structure of the cache, there is only one block that must be checked to see whether a location is in the cache; the index uniquely determines that block. If the access is a hit, the data value is read from the cache.
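
The address split can be expressed directly in C; the sketch below assumes an illustrative geometry of 64 blocks of 16 bytes each, not a particular cache.

#include <stdint.h>

#define OFFSET_BITS 4    /* 16-byte blocks (assumed)  */
#define INDEX_BITS  6    /* 64 cache blocks (assumed) */

static inline uint32_t cache_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);                  /* lowest bits select data within the block */
}
static inline uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* selects which block to check */
}
static inline uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);                /* compared with the stored tag */
}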

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 88

MEMORY SYSTEM MECHANISMS: Caches

A Direct-mapped cache

Page 89: Ecs 7th sem-cse-unit-2

Writes are slightly more complicated than reads because both the cache and main memory have to be updated.

Write-Through: every write changes both the cache and the corresponding main memory location (usually through a write buffer).

This scheme ensures that the cache and main memory are consistent, but may generate some additional main memory traffic.

Write-Back Policy: The number of writes to main memory can be reduced by using a write-back policy.

If we write to main memory only when we remove a block (or location) from the cache, we eliminate the writes that occur when a location is written several times before it is removed from the cache.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 89

MEMORY SYSTEM MECHANISMS: Caches

Page 90: Ecs 7th sem-cse-unit-2

The limitations of the direct-mapped cache can be reduced by going to the SET-ASSOCIATIVE CACHE structure shown in Figure below.

A set-associative cache is characterized by the number of banks or ways it uses, giving an n-way set-associative cache.

A set is formed by all the blocks (one for each bank) that share the same index. Each set is implemented with a direct-mapped cache.

Cache Hit: A cache request is broadcast to all banks simultaneously. If any of the banks has the location, the cache reports a HIT.

Although memory locations map onto blocks using the same function, there are n separate blocks for each set of locations.

Therefore, several locations that happen to map onto the same cache block can be cached simultaneously.

The set-associative cache structure incurs a little extra overhead and is slightly slower than a direct-mapped cache, but the higher hit rates it provides often compensate.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 90

MEMORY SYSTEM MECHANISMS: Caches

An Example of a 4-Way Set-Associative Cache Map: four banks (ways), each containing blocks B0, B1, B2, and B3.

Page 91: Ecs 7th sem-cse-unit-2

Consider a very simple caching scheme that uses 2 bits of the address as the tag.

We compare a direct-mapped cache with four blocks and a two-way set-associative cache with four sets, using Least Recently Used (LRU) replacement to make it easy to compare the two caches.

A 3-bit address is used for simplicity. The contents of the memory follow:

Address Data Address Data

000 0101 100 1000

001 1111 101 0001

010 0000 110 1010

011 0110 111 0100

We give each cache the same pattern of addresses (in binary, to simplify picking out the index): 001, 010, 011, 100, 101, and 111.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 91

Example: Direct-mapped vs. Set-Associative Caches

Page 92: Ecs 7th sem-cse-unit-2

First, consider how the direct-mapped cache works and how its state changes as these addresses are presented.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 92

Example: Direct-mapped vs. Set-Associative Caches

Page 93: Ecs 7th sem-cse-unit-2

Using a similar procedure, we can determine what ends up in the two-way set-associative cache.

The only difference is that there is some freedom when a block has to be replaced with new data.

To make the results easy to understand, an LRU replacement policy is used.

To start with, each way is made the size of the original direct-mapped cache.

The final state of the two-way set-associative cache follows:

A cache that holds both instructions and data is called a Unified Cache.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 93

Example: Direct-mapped vs. Set-Associative Caches


Page 94: Ecs 7th sem-cse-unit-2

Memory Management Units (MMU) and Address Translation

An MMU translates addresses between the CPU and Physical Memory.

This translation process is often known as Memory Mapping, since addresses are mapped from a logical space into a physical space.

In embedded systems, MMUs appear primarily in the host processor.

Early computers used MMUs to compensate for limited address space in their instruction sets.

As modern CPUs typically do not have this limitation, MMUs are used to provide Virtual Addressing.

As shown in Figure below, the MMU accepts Logical Addresses from the CPU.

Logical addresses refer to the program's abstract address space but do not correspond to actual RAM locations.

The MMU uses its tables to translate them into physical addresses that do correspond to RAM.

By changing the MMU's tables, the physical location at which the program resides can be changed without modifying the program's code or data.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 94

A Virtually Addressed Memory System

Page 95: Ecs 7th sem-cse-unit-2

If a secondary storage unit such as flash or a disk is added, parts of the program can be removed from main memory.

In a Virtual Memory System, the MMU keeps track of which logical addresses are actually resident in main memory; those that do not reside in main memory are kept on the secondary storage device.

When the CPU requests an address that is not in main memory, the MMU generates an exception called a Page Fault.

The handler for this exception executes code that reads the requested location from the secondary storage device into main memory.

The program that generated the page fault is restarted by the handler only after:

■ the required memory has been read back into main memory, and

■ the MMU's tables have been updated to reflect the changes.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 95

Memory Management Units(MMU) and Address Translation cont’d….

Page 96: Ecs 7th sem-cse-unit-2

There are two styles of address translation: Segmented and Paged.

Each has advantages and the two can be combined to form a segmented, paged addressing scheme.

As shown in Figure below, segmenting is designed to support a large, arbitrarily sized region of memory, while pages describe small, equally sized regions.

A segment is usually described by its start address and size, allowing different segments to be of different sizes. Pages are of uniform size, which simplifies the hardware required for address translation.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 96

Memory Management Units(MMU) and Address Translation cont’d….

A segmented, paged scheme is created by dividing each segment into pages and using two steps for address translation.

Paging introduces the possibility of fragmentation as program pages are scattered around physical memory.


Page 97: Ecs 7th sem-cse-unit-2

A simple segmenting scheme is shown in Figure below: the MMU maintains a segment register that describes the currently active segment.

This register points to the base of the current segment.

The address extracted from an instruction is used as the offset within the segment.

The physical address is formed by adding the segment base to the offset.

Most segmentation schemes also check the physical address against the upper limit of the segment by extending the segment register to include the segment size and comparing the offset to the allowed size.
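
A minimal sketch of this base-plus-offset translation with a limit check; the segment register layout and field names are assumptions for illustration.

#include <stdint.h>

struct segment_reg {
    uint32_t base;   /* start of the segment in physical memory */
    uint32_t size;   /* allowed segment size in bytes           */
};

/* Returns 1 and writes the physical address on success,
   or 0 if the offset exceeds the segment limit.          */
int translate_segment(const struct segment_reg *seg,
                      uint32_t offset, uint32_t *phys)
{
    if (offset >= seg->size)
        return 0;                 /* segmentation violation */
    *phys = seg->base + offset;   /* base + offset          */
    return 1;
}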

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 97

Memory Management Units(MMU) and Address Translation cont’d….

Address Translation for a Segment

Page 98: Ecs 7th sem-cse-unit-2

The translation of paged addresses requires more MMU state but a simpler calculation.

As shown in Figure below, the Logical Address is divided into two sections: a Page Number and an Offset.

The page number is used as an index into a page table, which stores the physical address for the start of each page.
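
A minimal sketch of paged translation with a flat page table, assuming 4-KB pages; the table layout is an assumption for illustration.

#include <stdint.h>

#define PAGE_BITS 12                 /* 4-KB pages (assumed) */
#define PAGE_SIZE (1u << PAGE_BITS)

extern uint32_t page_table[];        /* physical base address of each page */

uint32_t translate_page(uint32_t logical)
{
    uint32_t page_number = logical >> PAGE_BITS;        /* index into the page table */
    uint32_t offset      = logical & (PAGE_SIZE - 1);   /* position within the page  */
    return page_table[page_number] + offset;
}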

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 98

Memory Management Units(MMU) and Address Translation cont’d….

The page table is normally kept in main memory, which means that an address translation requires a memory access.

Page 99: Ecs 7th sem-cse-unit-2

The Page Table may be organized in several ways, as shown in Figure below.

The simplest scheme is a Flat Table. The table is indexed by the page number and each entry holds the page descriptor.

A more sophisticated method is a TREE. The root entry of the tree holds pointers to Pointer Tables at the next level of the tree; each pointer table is indexed by a part of the page number.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 99

Memory Management Units(MMU) and Address Translation cont’d….

Alternative schemes for organizing page tables

Page 100: Ecs 7th sem-cse-unit-2

Eventually (after three levels, in this case) the descriptor table is reached, which includes the desired page descriptor.

The efficiency of paged address translation may be increased by caching page translation information.

A cache for address translation is known as a Translation Lookaside Buffer (TLB).

The MMU reads the TLB to check whether a page number is currently in the TLB cache; if so, it uses that value rather than reading from memory.

Some extensions to both segmenting and paging are useful for virtual memory:

■ At minimum, a present bit is necessary to show whether the logical segment or page is currently in physical memory.

■ A dirty bit shows whether the page/segment has been written to. This bit is maintained by the MMU, since it knows about every write performed by the CPU.

■ Permission bits are often used. Some pages/segments may be readable but not writable. If the CPU supports supervisor and user modes, pages/segments may be accessible in supervisor mode but not in user mode.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 100

Memory Management Units(MMU) and Address Translation cont’d….

Page 101: Ecs 7th sem-cse-unit-2

The ARM MMU supports the following types of memory regions for address translation:

■ a section is a 1-MB block of memory,

■ a large Page is 64 KB, and

■ a small Page is 4 KB.

An address is marked as section mapped or page mapped.

A two-level scheme is used to translate addresses.

The first-level table, which is pointed to by the Translation Table Base register, holds descriptors for section translation and pointers to the second-level tables. The second-level tables describe the translation of both large and small pages.

The basic two-level process for a large or small page is illustrated in the adjacent Figure.

The details differ between large and small pages, such as the size of the second-level table index.

The first- and second-level pages also contain access control bits for virtual memory and protection.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 101

ARM two-stage address translation

Memory Management Units(MMU) and Address Translation cont’d….

Page 102: Ecs 7th sem-cse-unit-2

CPU PERFORMANCE

Pipelining: Modern CPUs are designed as pipelined machines in which several instructions are executed in parallel.

Pipelining greatly increases the efficiency of the CPU.

The ARM7 has a three-stage pipeline:

■ Fetch: the instruction is fetched from memory.

■ Decode: the instruction's opcode and operands are decoded to determine what operation is to be performed.

■ Execute: the decoded instruction is executed.

Each of these operations requires one clock cycle for typical instructions.

Latency of Instruction Execution: A normal instruction requires three clock cycles to completely execute; this is known as the latency of instruction execution.

Since the pipeline has three stages, an instruction is completed in every clock cycle.

In other words, the pipeline has a throughput of one instruction per cycle.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 102

Page 103: Ecs 7th sem-cse-unit-2

Figure below illustrates the position of instructions in the pipeline during execution.

A vertical slice through the timeline shows all instructions in the pipeline at that time.

By following an instruction horizontally, the progress of instruction execution can be seen.

The C55x includes a seven-stage pipeline:

1. Fetch.

2. Decode.

3. Address computes data and branch addresses.

4. Access 1 reads data.

5. Access 2 finishes data read.

6. Read stage puts operands onto internal busses.

7. Execute performs operations.

RISC machines are designed to keep the pipeline busy. CISC machines may display a wide variation in instruction timing.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 103

CPU PERFORMANCE cont’d….

Pipelined execution of ARM instructions

Page 104: Ecs 7th sem-cse-unit-2

Figure below illustrates a data stall in the execution of a sequence of instructions starting with a Load Multiple (LDMIA) instruction.

Since there are two registers to load, the instruction must stay in the execution phase for two cycles.

In a multiphase execution, the decode stage is also occupied, since it must continue to remember the decoded instruction.

As a result, the SUB instruction is fetched at the normal time but not decoded until the LDMIA is finishing. This delays the fetching of the third instruction, the CMP.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 104

CPU PERFORMANCE cont’d….

Pipelined execution of multicycle ARM instruction

Page 105: Ecs 7th sem-cse-unit-2

Branches also introduce control stall delays into the pipeline, commonly referred to as the branch penalty, as shown in Figure below.

The decision whether to take the conditional branch BNE is not made until the third clock cycle of that instruction's execution, which computes the branch target address.

By that time, the succeeding instruction at PC+4 has already been fetched and has started to be decoded.

When the branch is taken, the branch target address is used to fetch the branch target instruction.

Since the CPU must wait for the execution cycle to complete before knowing the target, two cycles of work on instructions in the not-taken path must be thrown away.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 105

CPU PERFORMANCE cont’d….

Pipelined execution of a branch in ARM

Page 106: Ecs 7th sem-cse-unit-2

One solution to this problem is to introduce the Delayed Branch.

With a delayed branch instruction, some number of instructions directly after the branch are always executed, whether or not the branch is taken.

This allows the CPU to keep the pipeline full during execution of the branch.

However, some of those instructions after the delayed branch may be no-ops.

Any instruction in the delayed branch window must be valid for both execution paths, whether or not the branch is taken.

If there are not enough instructions to fill the delayed branch window, it must be filled with no-ops.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 106

CPU PERFORMANCE cont’d….

Page 107: Ecs 7th sem-cse-unit-2

This knowledge of instruction execution time is used to evaluate the execution time of some C code, as shown in the Example below:

Execution time of a for loop on the ARM

Using the C code for the FIR filter of Application Example 2.1:

for (i = 0, f = 0; i < N; i++)
    f = f + c[i] * x[i];

Repeating the ARM code for this loop:

; loop initiation code

MOV r0,#0 ; use r0 for i, set to 0

MOV r8,#0 ; use a separate index for arrays

ADR r2,N ; get address for N

LDR r1,[r2] ; get value of N for loop termination test

MOV r2,#0 ; use r2 for f, set to 0

ADR r3,c ; load r3 with address of base of c array

ADR r5,x ; load r5 with address of base of x array

; loop body

loop LDR r4,[r3,r8] ; get value of c[i]

LDR r6,[r5,r8] ; get value of x[i]

MUL r4,r4,r6 ; compute c[i]*x[i]

ADD r2,r2,r4 ; add into running sum f

; update loop counter and array index

ADD r8,r8,#4 ; add one word offset to array index

ADD r0,r0,#1 ; add 1 to i

; test for exit

CMP r0,r1

BLT loop ; if i < N, continue loop

loopend...

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 107

CPU PERFORMANCE cont’d….

Page 108: Ecs 7th sem-cse-unit-2

Inspection of the code shows that the only instruction that may take more than one cycle is the conditional branch in the loop test.

The instructions and the associated number of clock cycles in each block can be counted as follows:

The unconditional branch at the end of the update block always incurs a branch penalty of two cycles.

The BLT instruction in the test block incurs a pipeline delay of two cycles when the branch is taken.

That happens for all but the last iteration, when the instruction has an execution time of ttest,worst; the last iteration executes in time ttest,best.

A formula for the total execution time of the loop in cycles can be written as

tloop = tinit + N(tbody + tupdate) + (N - 1)ttest,worst + ttest,best .........................(3)
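
A worked evaluation of this formula in C, using assumed cycle counts rather than measured values (tinit = 7, tbody = 4, tupdate = 2, ttest,worst = 4, ttest,best = 2, and N = 8):

#include <stdio.h>

int main(void)
{
    int N = 8;                                   /* loop iterations (assumed)       */
    int t_init = 7, t_body = 4, t_update = 2;    /* cycles per block (assumed)      */
    int t_test_worst = 4, t_test_best = 2;       /* test block, taken / last pass   */

    int t_loop = t_init + N * (t_body + t_update)
               + (N - 1) * t_test_worst + t_test_best;
    printf("t_loop = %d cycles\n", t_loop);      /* 85 cycles for these numbers */
    return 0;
}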

Although caches are invisible in the programming model, they have a profound effect on the CPU's performance. A cache miss is often several clock cycles slower than a cache hit.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 108

CPU PERFORMANCE cont’d….

Page 109: Ecs 7th sem-cse-unit-2

CPU POWER CONSUMPTION

Here we study the characteristics of CPUs that influence power consumption and the mechanisms provided by CPUs to control how much power they consume.

First, it is important to distinguish between Energy and Power.

Power is defined as Energy Consumption per unit time.

Heat generation depends on power consumption.

Generally, the term Power will be used as a synonym for energy and power consumption (distinguishing between them only when necessary).

The high-level Power Consumption Characteristics of CPUs and other system components are derived from the circuits used to build those components.

Virtually all digital systems are built with Complementary Metal Oxide Semiconductor (CMOS) circuitry.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 109

Page 110: Ecs 7th sem-cse-unit-2

The basic sources of CMOS power consumption are easily identified and briefly described below.

■ Voltage Drops: The dynamic power consumption of a CMOS circuit is proportional to the square of the power supply voltage (V²).

Therefore, by reducing the power supply voltage to the lowest level that provides the required performance, the power consumption can be reduced significantly.

Parallel hardware can also be added, allowing a further reduction in the power supply voltage while maintaining the required performance (a numeric sketch of the voltage scaling appears after this list).

■ Toggling: A CMOS circuit uses most of its power when it is changing its Output Value.

This provides two ways to reduce power consumption:

1. By reducing the speed at which the circuit operates, its power consumption can be reduced.

2. By eliminating unnecessary changes to the inputs of a CMOS circuit, and thereby eliminating unnecessary glitches at the circuit outputs, unnecessary power consumption is eliminated.

■ Leakage: Even when a CMOS circuit is not active, some charge leaks out of the circuit's nodes through the substrate.

The only way to eliminate leakage current is to remove the power supply.

Completely disconnecting the power supply eliminates power consumption, but it usually takes a significant amount of time to reconnect the system to the power supply and reinitialize its internal state so that it once again performs properly.
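
A small numeric sketch of the V² dependence mentioned under Voltage Drops, comparing an assumed 3.3 V supply with a reduced 2.5 V supply at the same clock frequency:

#include <stdio.h>

int main(void)
{
    double v_old = 3.3, v_new = 2.5;                   /* volts, assumed values          */
    double ratio = (v_new * v_new) / (v_old * v_old);  /* dynamic power scales with V^2  */
    printf("dynamic power scales by %.2f\n", ratio);   /* about 0.57                     */
    return 0;
}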

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 110

CPU POWER CONSUMPTION cont’d….

Page 111: Ecs 7th sem-cse-unit-2

In CMOS CPUs, the following power-saving strategies are used:

■ CPUs can be used at reduced voltage levels, e.g., reducing the power supply from 1 to 0.9 V causes the power consumption to drop by a factor of about 1.2.

■ The CPU can be operated at a lower clock frequency to reduce power consumption.

■ The CPU may internally disable certain function units that are not required for the currently executing function. This reduces energy consumption.

■ Some CPUs allow parts of the CPU to be totally disconnected from the power supply to eliminate leakage currents.

There are two types of power management features provided by CPUs.

A Static Power Management Mechanism is invoked by the user but does not otherwise depend on CPU activities.

e.g., a power-down mode intended to save energy. This mode provides a high-level way to reduce unnecessary power consumption. Power-down modes typically end upon receipt of an interrupt or other event.

A Dynamic Power Management Mechanism takes actions to control power based upon the dynamic activity in the CPU.

e.g., the CPU may turn off certain sections of the CPU when the instructions being executed do not need them.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 111

CPU POWER CONSUMPTION cont’d….

Page 112: Ecs 7th sem-cse-unit-2

Example: Energy efficiency features in the PowerPC 603

The PowerPC 603 was designed specifically for low-power operation while retaining high performance.

It typically dissipates 2.2 W running at 80 MHz.

The architecture provides three low-power modes (Doze, Nap, and Sleep) that provide static power management capabilities for use by the programs and operating system.

The 603 also uses a variety of dynamic power management techniques for power minimization that are performed automatically, without program intervention.

To reduce power consumption, the CPU uses the dynamic techniques mentioned below.

■ An Execution Unit that is not being used can be Shut Down.

■ The cache, an 8-KB, two-way set-associative cache, was organized into subarrays so that at most two out of eight subarrays will be accessed on any given clock cycle.

Not all units in the CPU are active all the time; idling them when they are not being used can save power.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 112

Page 113: Ecs 7th sem-cse-unit-2

The table below shows the percentage of time various units in the 603 were idle for the SPEC integer and floating-point benchmarks.

Idle units are turned off automatically by switching off their clocks.

Various stages of the pipeline are turned on and off, depending on which stages are necessary at the current time.

Measurements comparing the chip's power consumption with and without dynamic power management show that dynamic techniques provide significant power savings.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 113

Example: Energy efficiency features in the PowerPC 603

Page 114: Ecs 7th sem-cse-unit-2

CPU as a Power State Machine

The modes of a CPU can be modeled by a Power State Machine.

An example is shown in Figure below.

Each state in the machine represents a different mode of the machine, and every state is labeled with its average power consumption.

The example machine has two states: run mode with power consumption Prun and sleep mode with power consumption Psleep.

Transitions show how the machine can go from state to state; each transition is labeled with the time required to go from the source to the destination state.
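
A minimal sketch of how such a power state machine could be represented in C and used to estimate the energy of one sleep episode; the fields, the figures they would hold, and the simple accounting rule are assumptions for illustration.

enum pstate { RUN, SLEEP };

struct power_model {
    double p_run, p_sleep;   /* average power in each state (W)         */
    double t_rs, t_sr;       /* run-to-sleep and sleep-to-run times (s) */
};

/* Energy cost of dropping to sleep for t_idle seconds and waking up;
   the transitions are charged at run-mode power for simplicity.      */
double sleep_episode_energy(const struct power_model *m, double t_idle)
{
    return m->p_run * (m->t_rs + m->t_sr) + m->p_sleep * t_idle;
}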

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 114

A Power State Machine for a Processor

Page 115: Ecs 7th sem-cse-unit-2

Example: Power-saving modes of the StrongARM SA-1100

The StrongARM SA-1100 is designed to provide sophisticated power management capabilities that are controlled by the on-chip power manager.

The processor takes two power supplies, as shown in the following figure:

VDD is the main power supply for the core CPU and is nominally 3.3 V.

The VDDX supply is used for the pins and other logic such as the power manager; it is normally at 1.5 V (the two supplies share a common ground).

The system can supply two inputs about the status of the power supply.

VDD_FAULT tells the CPU that the main power supply is not being properly regulated, while BATT_FAULT indicates that the battery has been removed or is low.

Either of these events can cause the CPU to go into a low-power mode. In low-power operation, the VDD supply can be turned off (the VDDX supply always remains on). When resuming operation, the PWR_EN signal is used by the CPU to tell the external power supply to ramp up the VDD power supply.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 115

Page 116: Ecs 7th sem-cse-unit-2

The SA-1100 provides the three power modes described below.

■ Run Mode is normal operation and has the highest power consumption.

■ Idle Mode saves power by stopping the CPU clock. The system unit modules (real-time clock, operating system timer, interrupt control, general-purpose I/O, and power manager) all remain operational.

Idle mode is entered by executing a three-instruction sequence.

The CPU returns to run mode upon receiving an interrupt from one of the internal system units or from a peripheral, or by resetting the CPU. This causes the machine to restart the CPU clock and to resume execution where it left off.

■ Sleep Mode shuts off most of the chip's activity. Entering sleep mode causes the system to shut down on-chip activity, reset the CPU, and negate the PWR_EN pin to tell the external electronics that the chip's power supply should be driven to 0 V.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 116

Example: Power-saving modes of the StrongARM SA-1100

Page 117: Ecs 7th sem-cse-unit-2

A separate I/O power supply remains on and supplies power to the power manager so that the CPU can be awakened from sleep mode; the low-speed clock keeps the power manager running at low speeds sufficient to manage sleep mode.

The CPU software should set several registers to prepare for sleep mode.

Sleep mode is entered by forcing the sleep bit in the power manager control register; it can also be entered by a power supply fault.

The sleep shutdown sequence happens in three steps, each of which requires about 30 µs.

The machine wakes up from sleep state on a preprogrammed wake-up event.

The wake-up sequence has three steps: the PWR_EN pin is asserted to turn on the external power supply and the CPU waits for about 10 ms; the 3.686-MHz oscillator is ramped up to speed; and the internal reset is negated and the CPU boot sequence begins. Sleep mode saves over three orders of magnitude of power consumption.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 117

Example: Power-saving modes of the StrongARM SA-1100

Page 118: Ecs 7th sem-cse-unit-2

Design Example: DATA COMPRESSOR

A data compressor takes in data with a constant number of bits per data element and puts out a compressed data stream in which the data is encoded in variable-length symbols.

Requirements and Algorithm: The Huffman coding technique is used.

The data compressor takes in a sequence of input symbols and then produces a stream of output symbols.

Assume for simplicity that the input symbols are one byte in length.

The output symbols are variable length, so a format has to be chosen to deliver the output data.

Delivering each coded symbol separately is tedious, since the length of each symbol would have to be supplied and external code would have to be used to pack them into words.

On the other hand, bit-by-bit delivery is almost certainly too slow; hence, the data compressor is relied on to pack the coded symbols into an array.

There is NO one-to-one relationship between the input and output symbols, and one has to wait for several input symbols before a packed output word comes out.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 118

Page 119: Ecs 7th sem-cse-unit-2

Huffman Coding for Text Compression

Text compression algorithms aim at statistical reductions in the volume of data.

Huffman coding makes use of information on the frequency of characters to assign variable-length codes to characters.

If shorter bit sequences are used to identify more frequent characters, then the length of the total sequence will be reduced.

In order to be able to decode the incoming bit string, the code characters must have unique prefixes:

No code may be a prefix of a longer code for another character.

As a simple example of Huffman coding, assume that these characters have the following probabilities P of appearance in a message:

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 119

Page 120: Ecs 7th sem-cse-unit-2

The code is built from the bottom up. After sorting the characters by probability, a new symbol is created by adding a bit.

The joint probability of finding either one of those characters is then computed, and the table is re-sorted.

The result is a tree that can be read top down to find the character codes.

The coding tree for our example appears below.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 120

Huffman Coding for Text Compression cont’d….

Reading the codes off the tree from the root to the leaves, the following coding of the characters is obtained:
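
A minimal sketch of table-driven Huffman encoding of one symbol in C; the symbol_table contents and the buffer convention (most significant bit first, buffer bytes pre-zeroed) are assumptions for illustration, not the book's implementation.

#include <stdint.h>

struct hcode {
    uint32_t bits;    /* code value, right-justified */
    int      length;  /* number of valid bits        */
};

extern struct hcode symbol_table[256];   /* hypothetical table indexed by the input byte */

/* Append one encoded symbol to a bit buffer, MSB first; 'bitpos' is the
   number of bits already in 'buf'. Returns the new bit count.          */
int encode_symbol(unsigned char c, uint8_t *buf, int bitpos)
{
    struct hcode h = symbol_table[c];
    for (int i = h.length - 1; i >= 0; i--, bitpos++) {
        if ((h.bits >> i) & 1u)
            buf[bitpos / 8] |= (uint8_t)(0x80u >> (bitpos % 8));
    }
    return bitpos;
}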

Page 121: Ecs 7th sem-cse-unit-2

Specification: Let's refine the description of Figure below to come up with a more complete specification for the data compression module.

That collaboration diagram concentrates on the steady-state behavior of the system. For a fully functional system, the following additional behavior has to be provided.

■ To provide the compressor with a new symbol table.

■ To flush the symbol buffer, causing the system to release all pending symbols that have been partially packed.

This may be done when the symbol table is changed or in the middle of an encoding session, to keep a transmitter busy.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 121

Huffman Coding for Text Compression cont’d….

Page 122: Ecs 7th sem-cse-unit-2

A class description for this refined understanding of the requirements on the module is shown in Figure below.

The class's buffer and current-bit behaviors keep track of the state of the encoding, and the table attribute provides the current symbol table.

The class has three methods, as follows:

■ Encode() performs the basic encoding function. It takes in a 1-byte input symbol and returns two values: a boolean showing whether it is returning a full buffer and, if the boolean is true, the full buffer itself.

■ New-symbol-table() installs a new symbol table into the object and throws away the current contents of the internal buffer.

■ Flush() returns the current state of the buffer, including the number of valid bits in the buffer.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 122

Huffman Coding for Text Compression cont’d….

Definition of the Data-Compressor Class

Page 123: Ecs 7th sem-cse-unit-2

Also, classes for the data buffer and the symbol table need to be defined.

These classes are shown in Figure below.

The data-buffer class will be used to hold both packed and unpacked symbols.

It defines the buffer itself and the length of the buffer.

A data type also has to be defined, because the longest encoded symbol is longer than an input symbol.

The longest Huffman code for an eight-bit input symbol is 256 bits.

The insert function packs a new symbol into the upper bits of the buffer; it also puts the remaining bits in a new buffer if the current buffer overflows.
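
A minimal C sketch of this insert behavior, assuming a 32-bit data buffer and right-justified codes no longer than the buffer; it illustrates the idea only and is not the program design left for self-study.

#include <stdint.h>

#define BUF_BITS 32

struct data_buffer {
    uint32_t bits;    /* packed bits, filled from the most significant end */
    int      length;  /* number of valid bits in 'bits'                    */
};

/* Pack 'codelen' bits of 'code' into 'buf'; on overflow the remaining
   bits start 'next'. Assumes 0 < codelen <= BUF_BITS and buf not full.
   Returns 1 if 'buf' became full, otherwise 0.                         */
int insert(struct data_buffer *buf, struct data_buffer *next,
           uint32_t code, int codelen)
{
    int space = BUF_BITS - buf->length;
    if (codelen <= space) {                               /* symbol fits                 */
        buf->bits |= code << (space - codelen);
        buf->length += codelen;
        return buf->length == BUF_BITS;
    }
    buf->bits |= code >> (codelen - space);               /* top bits fill this buffer   */
    buf->length = BUF_BITS;
    next->length = codelen - space;                       /* rest starts the next buffer */
    next->bits = (code & ((1u << next->length) - 1)) << (BUF_BITS - next->length);
    return 1;
}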

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 123

Additional Class Definitions for the Data Compressor

Huffman Coding for Text Compression cont’d….

Page 124: Ecs 7th sem-cse-unit-2

The Symbol-table class indexes the encoded version of each symbol.

The class defines an access behavior for the table; it also defines a load behavior to create a new symbol table.

The relationships between these classes are shown in Figure 1: a data-compressor object includes one BUFFER and one SYMBOL TABLE.

Figure 2 shows a state diagram for the encode behavior.

It shows that most of the effort goes into filling the buffers with variable-length symbols.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 124

Relationships between classes in the data compressor

Figure 1 Figure 2

State diagram for encode behavior

Huffman Coding for Text Compression cont’d….

Page 125: Ecs 7th sem-cse-unit-2

Figure below shows a state diagram for insert. It shows that two cases must be considered: either the new symbol does not fill the current buffer, or it does.

(Note: Program design is left for self-study.)

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 125

Huffman Coding for Text Compression cont’d….

State diagram for Insert Behavior

Page 126: Ecs 7th sem-cse-unit-2

Testing: One way to test the code is to run it and look at the output without considering how the code is written.

In this case, a symbol table can be loaded, some symbols run through the compressor, and the output checked to see whether the correct result is obtained.

The symbol table can be obtained from outside sources, or a small program can be written to generate it. Several different symbol tables should be tested, and enough symbols for each symbol table have to be tested.

One way to help automate testing is to write a Huffman decoder.

As illustrated in Figure below, a set of symbols can be run through the encoder and then through the decoder, and the input and output simply compared to make sure that they are the same.

If they are not, then both the encoder and decoder have to be checked to locate the problem.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 126

Huffman Coding for Text Compression cont’d….

A Test of the Encoder

Page 127: Ecs 7th sem-cse-unit-2

Another way to test the code is to examine the code itself and try to identify potential problem areas. When the code is read, look for places where data operations take place to see that they are performed properly.

Also take a look at the conditionals to identify different cases that need to be exercised.

Some ideas of things to look out for are listed below:

■ Is it possible to run past the end of the symbol table?

■ What happens when the next symbol does not fill up the buffer?

■ What happens when the next symbol exactly fills up the buffer?

■ What happens when the next symbol overflows the buffer?

■ Do very long encoded symbols work properly? How about very short ones?

■ Does flush() work properly?

Testing the internals of code often requires building scaffolding code.

e.g., to test the insert method separately, a program has to be built that calls the method with the proper values.

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 127

Huffman Coding for Text Compression cont’d….

Page 128: Ecs 7th sem-cse-unit-2

THANK YOU

8 September 2014 ECS-VII Sem-CSE-VTU: By Dr. K. Satyanarayan Reddy 128