new unit iv

7/28/2019 new Unit IV


The ARM PROGRAMMER'S MODEL

A processor typically has many invisible registers involved in executing an instruction. The values of these registers before and after the instruction is executed are insignificant (not important); only the values in the visible registers have any significance (important). The visible registers in an ARM processor are shown in Figure 1.

When writing user-level programs, only the 15 general-purpose 32-bit registers (r0 to r14), the program counter (r15) and the current program status register (CPSR) need to be used.

Figure 1: ARM's visible registers.

a) Registers: A total of 37 registers are available

R0 to R15 are directly accessible

15 general-purpose registers (R0 to R14)

R13: Stack pointer (SP) (by convention); each processor mode has its own stack

R14: Link register (LR)

R15: The Program Counter (PC)

CPSR - Current Program Status Register, contains the condition code flags and the current mode bits

5 SPSRs (Saved Program Status Registers), which are loaded with the CPSR when an exception occurs

The visible registers depend on the processor mode

The other registers (the banked registers) are switched in to support IRQ, FIQ, Supervisor, Abort and Undefined mode processing

b) Basic Operating Modes of the ARM (seven modes):

User: Unprivileged mode under which most tasks run

FIQ: Entered when a high-priority (fast) interrupt is raised

IRQ: Entered when a low-priority (normal) interrupt is raised

Supervisor: Entered on reset and when a Software Interrupt instruction is executed

Abort: Used to handle memory access violations

Undefined: Used to handle undefined instructions

System: Privileged mode using the same registers as User mode

c) Exceptions:

Exception Mode Priority Vector Address

Reset supervisor 1 0x00000000

Undefined instruction undefined 6 0x00000004

Software interrupt supervisor 6 0x00000008

Prefetch Abort Abort 5 0x0000000C

Data Abort Abort 2 0x00000010

Interrupt IRQ 4 0x00000018

Fast interrupt FIQ 3 0x0000001C

Table 1 - Exception types, sorted by Interrupt Vector addresses

On Exception (when the ARM moves from User mode to another mode):

1. (PC + 4) → LR

2. CPSR → SPSR_mod (any mode)

3. Vector address → PC

4. R13 and R14 are replaced by R13_mod and R14_mod

5. In FIQ mode, R8 to R12 are also replaced by their banked copies
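The entry sequence above can be sketched as a small state update. This is only an illustrative model: the dictionary layout, the `enter_exception` name and the separate `mode` field are assumptions made for the example, and details such as interrupt masking are left out.

```python
# Hypothetical model of ARM exception entry (names and layout are illustrative).
# Each exception maps to (mode entered, vector address), as in Table 1.
VECTORS = {
    "reset": ("svc", 0x00),
    "undefined": ("und", 0x04),
    "swi": ("svc", 0x08),
    "prefetch_abort": ("abt", 0x0C),
    "data_abort": ("abt", 0x10),
    "irq": ("irq", 0x18),
    "fiq": ("fiq", 0x1C),
}

def enter_exception(state, kind):
    """Return a new state after taking exception `kind`."""
    mode, vector = VECTORS[kind]
    new = dict(state)
    new["lr_" + mode] = (state["pc"] + 4) & 0xFFFFFFFF  # (PC + 4) -> LR_mod
    new["spsr_" + mode] = state["cpsr"]                 # CPSR -> SPSR_mod
    new["mode"] = mode                                  # banked R13/R14 now visible
    new["pc"] = vector                                  # vector address -> PC
    return new

state = {"pc": 0x8000, "cpsr": 0x10, "mode": "usr"}
after = enter_exception(state, "irq")
```

Here `after["pc"]` is the IRQ vector and the saved values sit in the IRQ-mode copies, so the User-mode registers are left untouched for the return.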

d) Program Status Registers

CPSR (Current PSR): The ARM contains a Current Program Status Register (CPSR), plus five Saved Program Status Registers (SPSRs) for exception handling. The functions of these registers are:

Hold information about the most recently performed ALU operation.

Control the enabling and disabling of interrupts.

Set the processor operating mode.

Condition Code Flags:

N = Negative result from ALU

Z = Zero result from ALU

C = ALU operation carried out

V = ALU operation overflowed

Interrupt Disable Bits:

I = 1, disables the IRQ

F = 1, disables the FIQ

T Bit:

T = 0, processor in ARM state

T = 1, processor in Thumb state

Mode Bits:

M[4:0] define the processor mode.

N: Negative; the last ALU operation which changed the flags produced a negative result (the top bit of the 32-bit result was a one).

Z: Zero; the last ALU operation which changed the flags produced a zero result (every bit of the 32-bit result was zero).

C: Carry; the last ALU operation which changed the flags generated a carry-out, either as a result of an arithmetic operation in the ALU or from the shifter.

V: oVerflow; the last arithmetic ALU operation which changed the flags generated an overflow into the sign bit.
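As a rough illustration of these four definitions, the sketch below computes N, Z, C and V for a 32-bit addition. The function name `add_with_flags` is made up for the example, and only addition is modelled (not subtraction or shifter carry-out).

```python
def add_with_flags(a, b):
    """Add two 32-bit values and return (result, N, Z, C, V)."""
    mask = 0xFFFFFFFF
    a &= mask
    b &= mask
    full = a + b                      # unbounded sum, may need 33 bits
    result = full & mask
    n = (result >> 31) & 1            # top bit of the 32-bit result
    z = int(result == 0)              # every bit of the result is zero
    c = int(full > mask)              # carry out of bit 31
    # Overflow: both operands have the same sign and the result's sign differs.
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, n, z, c, v
```

For example, adding 1 to 0x7FFFFFFF sets N and V (a positive value overflowed into the sign bit), while adding 1 to 0xFFFFFFFF sets Z and C.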

Processor Modes:

Mode Abbreviation CPSR M[4:0] Use

User usr 10000 Normal program execution mode

FIQ fiq 10001 Processing fast interrupts

IRQ irq 10010 Processing general-purpose interrupts

Supervisor svc 10011 Processing software interrupts

Abort abt 10111 Processing memory faults

Undefined und 11011 Handling undefined instruction traps

System sys 11111 Running privileged OS tasks (ARM architecture v4 and above)
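The mode encodings in the table can be used to decode a CPSR value. The sketch below is illustrative: the bit positions of the flags (N, Z, C, V in bits 31 to 28; I, F, T in bits 7 to 5) are not given in these notes and are taken from the standard ARM CPSR layout, and the `decode_cpsr` name is hypothetical.

```python
# Mode encodings M[4:0] from the table above.
MODES = {
    0b10000: "usr", 0b10001: "fiq", 0b10010: "irq", 0b10011: "svc",
    0b10111: "abt", 0b11011: "und", 0b11111: "sys",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its flag, control and mode fields."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,   # 1 disables IRQ
        "F": (cpsr >> 6) & 1,   # 1 disables FIQ
        "T": (cpsr >> 5) & 1,   # 1 means Thumb state
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }
```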



c) The Memory System:

Figure 2: ARM memory organization.

Byte-addressed memory

Maximum 4 GB of memory

Data types: {byte, half-word, word} {signed, unsigned}

Byte: 8 bits

Half-word: 16 bits; must be aligned to an even boundary

Word: 32 bits; must be aligned to a 4-byte boundary

Notes:

Internally, all operations are on word operands

Byte and half-word are only supported by data transfer instructions

ARM instructions are all words

Thumb instructions are all half-words

Figure 2 shows a small area of memory where each byte location has a unique number. A byte may occupy any of these locations, and a few examples are shown in the figure.

A word-sized data item must occupy a group of four byte locations starting at a byte address which is a multiple of four, and again the figure contains a couple of examples.

Half-words occupy two byte locations starting at an even byte address.
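The alignment rules above reduce to a simple divisibility check. A minimal sketch, with `SIZES` and `is_aligned` as illustrative names:

```python
# Size in bytes of each ARM data type described above.
SIZES = {"byte": 1, "half-word": 2, "word": 4}

def is_aligned(address, kind):
    """True if `address` is a legal start address for a data item of type `kind`."""
    return address % SIZES[kind] == 0
```

For instance, address 6 is a valid half-word address but not a valid word address, while any address is valid for a byte.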



d) Load-Store Architecture:

1. ARM employs a load-store architecture.

2. This means that the instruction set will only process (add, subtract, and so on) values which are in registers (or specified directly within the instruction itself), and will always place the results of such processing into a register.

3. The only operations which apply to memory state are ones which copy memory values into registers (load instructions) or copy register values into memory (store instructions).

ARM Instructions:

1. Data processing instructions: These use and change only register values. (For example, an instruction can add two registers and place the result in a register.)

2. Data transfer instructions: These copy memory values into registers (load instructions) or copy register values into memory (store instructions). An additional form, useful only in systems code, exchanges a memory value with a register value.

3. Control flow instructions: Normal instruction execution uses instructions stored at consecutive memory addresses. Control flow instructions cause execution to switch to a different address, either permanently (branch instructions), or saving a return address to resume the original sequence (branch and link instructions), or trapping into system code (supervisor calls).

e) Supervisor Mode:

The ARM processor supports a protected supervisor mode. Supervisor mode is entered on reset and when a Software Interrupt instruction is executed.

The ARM Instruction Set:

All ARM instructions are 32 bits wide and are aligned on 4-byte boundaries in memory.

The most notable features of the ARM instruction set are:

The load-store architecture

3-address data processing instructions (that is, the two source operand registers and the result register are all independently specified)

Conditional execution of every instruction

The inclusion of very powerful load and store multiple register instructions

The ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle

Open instruction set extension through the coprocessor instruction set, including adding new registers and data types to the programmer's model

A very dense 16-bit compressed representation of the instruction set in the Thumb architecture

f) The I/O System:

The ARM handles I/O (input/output) peripherals (such as disk controllers, network interfaces, and so on) as memory-mapped devices with interrupt support.

The internal registers in these devices appear as addressable locations; those registers may be read and written using the load-store instructions.

Peripherals may attract the processor's attention by making an interrupt request using either the normal interrupt (IRQ) or the fast interrupt (FIQ) input. Both interrupt inputs are level-sensitive and maskable.

Some systems may include direct memory access (DMA) hardware external to the processor to handle high-bandwidth I/O traffic (large data transfers between an I/O device and memory).

g) ARM Exceptions:

Exceptions are generated by internal and external sources to cause the processor to handle an event, such as an externally generated interrupt or an attempt to execute an undefined instruction.

Before handling the exception, the processor state must be preserved ((PC + 4) → LR, CPSR → SPSR_mod (any mode), vector address → PC) so that the original program can be resumed when the exception routine has completed.

The general procedure to handle an exception is:

1. The current state is saved by copying the PC into r14_exc and the CPSR into SPSR_exc (where exc stands for the exception type).

2. The processor operating mode is changed to the appropriate exception mode.

3. The PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception.

The PC now points to the start address of the exception handler (the vector address), and the processor handles the exception.

Exception Return

Any modified user registers must be restored

Restore CPSR

Resume PC in the correct instruction stream

(The return to the user program is achieved by restoring the user registers and then using an instruction to restore the PC and the CPSR atomically.)

3-stage pipeline ARM organization

The organization of an ARM with a 3-stage pipeline is illustrated in Figure 1.

a) The principal components are:

The register bank, which stores the processor state (results and status of results). It has two read ports and one write port which can each be used to access any register, plus an additional read port and an additional write port that give special access to r15, the program counter.

The barrel shifter, which can shift or rotate one operand by any number of bits.

The ALU, which performs the arithmetic and logic functions required by the instruction set.

The address register and incrementer, which select and hold all memory addresses and generate sequential addresses when required.

The data registers, which hold data passing to and from memory.

The instruction decoder and associated control logic.



In a single-cycle data processing instruction, two register operands are accessed, the value on the B bus is shifted and combined with the value on the A bus in the ALU, then the result is written back into the register bank.

The program counter value is fed into the incrementer, then the incremented value is copied back into r15 (PC) in the register bank and also into the address register to be used as the address for the next instruction fetch.

b) The 3-Stage Pipeline:

ARM processors employ a simple 3-stage pipeline with the following pipeline stages:

1. Fetch:

The instruction is fetched from memory and placed in the instruction pipeline.

2. Decode:

The instruction is decoded and the datapath control signals are prepared for the next cycle. In this stage the instruction 'owns' the decode logic but not the datapath.

3. Execute:

(i) The instruction 'owns' the datapath,

(ii) The register bank is read (data read),

(iii) An operand (data) is shifted,

(iv) The ALU result is generated and written back into a destination register.

At any one time, three different instructions may occupy each of these stages (fetch, decode and execute), so the hardware in each stage has to be capable of independent operation.



Figure 1: 3-stage pipeline ARM organization.

When the processor is executing simple data processing instructions the pipeline enables one instruction to be completed every clock cycle.

An individual instruction takes three clock cycles to complete, so it has a three-cycle latency, but the throughput (output rate) is one instruction per cycle.

The 3-stage pipeline operation for single-cycle instructions is shown in Figure 2.
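The latency and throughput figures can be checked with a toy schedule. The sketch below is an idealised model that assumes every instruction spends exactly one cycle in each stage with no stalls; `pipeline_schedule` is a hypothetical helper, not a description of the hardware.

```python
def pipeline_schedule(n_instructions, stages=("fetch", "decode", "execute")):
    """Return {cycle: {stage: instruction index}} for single-cycle instructions."""
    schedule = {}
    for i in range(n_instructions):
        for s, stage in enumerate(stages):
            # Instruction i enters stage s in cycle i + s + 1 (cycles start at 1).
            schedule.setdefault(i + s + 1, {})[stage] = i
    return schedule

sched = pipeline_schedule(4)
total_cycles = max(sched)   # cycle in which the last instruction executes
```

With four instructions the last one completes in cycle 6: three cycles of latency for the first instruction, then one more instruction completed per cycle.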


Figure 2: ARM Single-Cycle Instruction 3-Stage Pipeline Operation

When a multi-cycle instruction is executed the pipeline operation flow is less regular, as illustrated in Figure 3.

Figure 3: ARM MULTI-CYCLE Instruction 3-Stage Pipeline Operation.

Figure 3 shows a sequence of single-cycle ADD instructions with a data store instruction, STR, occurring after the first ADD.

The cycles that access main memory are shown with light shading, so it can be seen that memory is used in every cycle.

The datapath is used in every cycle, being involved (used) in all the execute cycles, the address calculation and the data transfer.

The decode logic is always generating the control signals for the datapath to use in the next cycle, so in addition to the decode cycles it is also generating the control for the data transfer during the address calculation cycle of the STR.

The simplest way to view breaks in the ARM pipeline is to observe the following:

1. All instructions occupy the datapath for one or more adjacent cycles.

2. For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle.

3. During the first datapath cycle each instruction issues a fetch for the next instruction but one.

4. Branch instructions flush and refill the instruction pipeline.



c) PC Behavior:

1. The program counter is visible to the user as r15 (register 15). Because instructions are fetched before they are executed, it must run ahead of the current instruction to point to the location of the instruction being fetched.

2. During the first execute cycle of an instruction, the PC points eight bytes (two instructions) ahead of that instruction.

5-stage pipeline ARM organization

The 3-stage pipeline used in the ARM cores is very cost-effective, but its performance is lower than that of a 5-stage pipeline.

The time, T, required to execute a given program is given by:

T = (Ninst × CPI) / fclk

Where,

Ninst is the number of ARM instructions executed

CPI is the average number of clock cycles per instruction

fclk is the processor's clock frequency

Since Ninst is constant for a given program (it cannot be changed), there are only two ways to increase performance:

1. Increase the clock rate, fclk. This requires the logic in each pipeline stage to be simplified and, therefore, the number of pipeline stages to be increased.

2. Reduce the average number of clock cycles per instruction (CPI). This requires either that instructions occupy fewer pipeline slots or that pipeline stalls caused by dependencies between instructions are reduced.
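The relation between execution time, instruction count, CPI and clock frequency (T = Ninst × CPI / fclk) can be tried out numerically. The figures below are invented purely for illustration and do not describe any particular ARM core.

```python
def execution_time(n_inst, cpi, f_clk_hz):
    """Program execution time in seconds: T = Ninst * CPI / fclk."""
    return n_inst * cpi / f_clk_hz

# Hypothetical numbers: one million instructions on two imagined cores.
t_slow = execution_time(1_000_000, cpi=1.9, f_clk_hz=50e6)
t_fast = execution_time(1_000_000, cpi=1.5, f_clk_hz=100e6)
speedup = t_slow / t_fast
```

Both routes to higher performance show up in the ratio: the lower CPI and the higher clock rate each shorten T, and their effects multiply.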

a) Memory Bottleneck:

The 3-stage ARM core (ARM7TDMI) is a von Neumann architecture. A 3-stage ARM core accesses memory on (almost) every clock cycle, either to fetch an instruction or to transfer data. The von Neumann bottleneck is caused by the mismatch in speed between the CPU and memory.

To get a better CPI the memory system must deliver more than one value in each clock cycle, either by delivering more than 32 bits per cycle from a single memory or by having separate memories for instruction and data accesses (this is called a Harvard architecture).

Any computer with a single memory for both instructions and data has its performance limited by the available memory bandwidth.

Solution:

1. Faster memory: more than one value in each clock cycle

2. Separate memories for instruction and data accesses

b) High Performance ARM:

The ARM processors which use a 5-stage pipeline:

1. Reduce the maximum work in each stage

2. Achieve a higher clock frequency

Separate instruction and data memories reduce the CPI.

The ARM processors which use a 5-stage pipeline have the following pipeline stages:

1. Fetch

The instruction is fetched from memory and placed in the instruction pipeline.

2. Decode

The instruction is decoded and register operands are read from the register file. There are three operand read ports in the register file, so most ARM instructions can get all their operands in one cycle.

3. Execute

An operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU.

4. Buffer/data

Data memory is accessed if required. Otherwise the ALU result is simply buffered for one clock cycle to give the same pipeline flow for all instructions.

5. Write-back

The results generated by the instruction are written back to the register file, including any data loaded from memory.

Figure 1: ARM Single-Cycle Instruction 5-Stage Pipeline Operation



Figure 1: ARM9TDMI 5-Stage Pipeline Organization

c) Data forwarding:

A data dependency arises when an instruction needs to use the result of a previous instruction before that result has been returned to the register file.

The only way to resolve data dependencies without stalling the pipeline is to introduce forwarding paths.

Forwarding paths allow results to be passed between stages as soon as they are available (from the execute stage itself).

The 5-stage ARM pipeline requires each of the three source operands to be forwarded from any of three intermediate result registers, as shown in Figure 1.

Exception:

There is one case where, even with data forwarding, it is not possible to avoid a pipeline stall (data dependency). Consider the following code sequence as an example:

LDR rN, [ . . ] ; load rN from somewhere

ADD r2, r1, rN ; and use it immediately



In the above case a one-cycle stall appears in the pipeline because of the data dependency: the value loaded into rN only enters the processor at the end of the buffer/data stage, but it is needed by the following instruction at the start of the execute stage.

The only way to avoid this stall is to encourage the compiler (or assembly language programmer) not to put a dependent instruction (like ADD r2, r1, rN) immediately after a load instruction (like LDR rN, [ . . ]).

Apart from this load-use case, the data hazards introduced by a sequence of instructions can be resolved with the simple hardware technique called forwarding.
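A compiler or assembly programmer can look for the load-use pattern mechanically. The sketch below is a simplified illustration: instructions are represented as hypothetical `(opcode, destination, sources)` tuples, only the immediately following instruction is checked, and every other hazard is assumed to be covered by forwarding.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by using a loaded register immediately.

    program: list of (opcode, dest, sources) tuples in execution order.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LDR" and prev_dest in cur_sources:
            stalls += 1
    return stalls

# The load is followed directly by its consumer: one stall.
dependent = [("LDR", "r3", ["r0"]), ("ADD", "r2", ["r1", "r3"])]
# An independent instruction is scheduled in between: no stall.
reordered = [("LDR", "r3", ["r0"]), ("SUB", "r4", ["r5", "r6"]),
             ("ADD", "r2", ["r1", "r3"])]
```

Moving an unrelated instruction into the slot after the load removes the stall without changing the result.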

d) PC (Program Counter) Generation:

The 5-stage pipeline reads the instruction operands one stage earlier in the pipeline, so r15 would naturally be read as PC+4 rather than PC+8 (PC+8 is the value seen in the 3-stage pipeline organization).

The incremented PC value from the fetch stage is fed directly to the register file in the decode stage, bypassing the pipeline register between the two stages. PC+4 for the next instruction is equal to PC+8 for the current instruction, so the correct r15 value is obtained without additional hardware.

ARM Instruction Execution

The execution of an ARM instruction can be understood from the datapath organization presented in Figure 1.

a) Data Processing Instructions:

A data processing instruction requires two operands, one of which is always a register and the other is either a second register or an immediate value.

The second operand is passed through the barrel shifter, where it is subject to a general shift operation, and then it is combined with the first operand in the ALU using a general ALU operation.

Finally, the result from the ALU is written back into the destination register.

All these operations take place in a single clock cycle, as shown in Figure 4.

Note also how the PC value in the address register is incremented and copied back into both the address register and r15 in the register bank.

The next instruction is loaded into the bottom of the instruction pipeline (i. pipe).

When an immediate value is required, it is extracted from the current instruction at the top of the instruction pipeline.

For data processing instructions only the bottom eight bits (bits [7:0]) of the instruction are used in the immediate value.



b) Data Transfer Instructions:

A data transfer (load or store) instruction computes a memory address in the same way that a data processing instruction computes its result.

A register is used as the base address, to which an offset is added. When the offset is a 12-bit immediate value it is used without a shift operation. The calculated address is sent to the address register, and in a second cycle the data transfer takes place.

The datapath operation for a data store instruction (STR) with an immediate offset is shown in Figure 2.

This instruction needs two machine cycles.

During the first cycle the incremented PC value is stored in the register bank, so that the address register is free to accept the data transfer address for the second cycle.

At the end of the second cycle the PC is fed back to the address register to allow instruction prefetching to continue.



Figure 3: The First Two (of three) Cycles of a Branch Instruction.

In byte store instructions the 'data out' block extracts the bottom byte from the register and replicates it four times across the 32-bit data bus. The external memory control logic can then use the bottom two bits of the address bus to activate the appropriate byte within the memory system.

A load instruction gets data from memory through the data in register in the second cycle, and a third cycle is needed to transfer the data from there to the destination register.
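The byte replication performed on a byte store can be written out in a few lines. This is a behavioural sketch only; `store_byte_bus` is an illustrative name, and the returned lane number simply stands in for whatever byte-select signals the external memory logic derives from the two low address bits.

```python
def store_byte_bus(register_value, address):
    """Return (32-bit data bus value, byte lane selected) for a byte store."""
    byte = register_value & 0xFF        # bottom byte of the source register
    data_bus = byte * 0x01010101        # same byte copied onto all four lanes
    lane = address & 0b11               # bottom two address bits pick the lane
    return data_bus, lane
```

Because every lane carries the same byte, the memory system only has to enable the one lane named by the address; no shifting of the data is needed.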

Branch Instructions:

Branch instructions compute the target address in the first cycle, as shown in Figure 3.

A 24-bit immediate field is extracted from the instruction and shifted left two bit positions to give a word-aligned offset, which is added to the PC. The result is issued as an instruction fetch address.

When the branch must later return to the calling code, the return address is also copied into the link register (r14); this is called a branch with link instruction.

During the 1st cycle the branch address is computed.

During the 2nd cycle the return address is saved in r14 (the link register).

The 3rd cycle is required to complete the pipeline refilling. It is also used to make a small correction to the value stored in the link register so that it points to the instruction which follows the branch in the main program (the point to return to after the last instruction of the subprogram).
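The branch target calculation can be sketched as follows. Two things here go beyond what the notes state and are taken from the ARM architecture: the 24-bit field is treated as a signed value, and the PC used in the addition is eight bytes ahead of the branch instruction. The `branch_target` name is illustrative.

```python
def branch_target(instruction_address, imm24):
    """Target address of an ARM branch located at `instruction_address`."""
    imm24 &= 0xFFFFFF
    if imm24 & 0x800000:            # sign-extend the 24-bit field
        imm24 -= 1 << 24
    offset = imm24 << 2             # shift left two bits: word-aligned offset
    pc = instruction_address + 8    # PC runs two instructions ahead
    return (pc + offset) & 0xFFFFFFFF
```

A field value of 0xFFFFFE (that is, -2) gives an offset of -8, which cancels the PC lead and makes the instruction branch to itself.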


ARM Implementation

The ARM design is divided into two sections:

1. Register transfer level (RTL): describes the datapath section

2. Finite state machine (FSM): describes the control section

CLOCKING SCHEME:

The ARM design uses 2-phase non-overlapping clocks, as shown in Figure 1 (most ARMs do not operate with edge-sensitive registers).

Figure 1: 2-Phase Non-Overlapping Clock Scheme.



The two clock phases are generated internally from a single input clock signal.

Data movement is controlled by passing data alternately through latches which are open during phase 1 and latches which are open during phase 2.

The non-overlapping property of the phase 1 and phase 2 clocks ensures that there are no race conditions in the circuit.

DATAPATH TIMING:

The normal timing of the datapath components in a 3-stage pipeline is illustrated in Figure 2.

When phase 1 goes high, data is read from the selected registers onto the read buses, which become valid early in phase 1.

One operand (data read from a register) is passed through the barrel shifter for a shift operation, and the shifter output becomes valid a little later in phase 1.

The ALU has input latches which are open during phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1.

The ALU continues to process the operands through phase 2.

The ALU result is written into the destination register at the end of phase 2.

Figure 2: ARM Datapath Timing (3-Stage Pipeline).

The minimum datapath cycle time is the sum of:

The register read time;

The shifter delay;

The ALU delay;

The register write set-up time;

The phase 2 to phase 1 non-overlap time.

The ALU delay is highly variable, depending on the operation it is performing.

1. Logical operations are relatively fast, since they involve no carry propagation.

2. Arithmetic operations (addition, subtraction and comparisons) are relatively slow, since they involve carry propagation (a longer path).



ADDER DESIGN:

The Original ARM1 Ripple-Carry Adder Circuit:

The first ARM processor prototype used a simple ripple-carry adder, as shown in Figure 3, designed using a CMOS AND-OR-INVERT gate.

The worst-case carry path is 32 gates long.

Figure 3: The Original ARM1 Ripple-Carry Adder Circuit.

The ARM2 4-bit carry look-ahead scheme:

In order to allow a higher clock rate, ARM2 used a 4-bit carry look-ahead scheme to reduce the worst-case carry path length. The circuit is shown in Figure 4.

The logic circuit produces carry generate (G) and propagate (P) signals which control the 4-bit carry look-ahead circuit output.

As a result, the carry propagate path length is reduced to eight gate delays.

Figure 4: The ARM2 4-bit carry look-ahead scheme.

ALU FUNCTIONS:

The ALU does not only add its two inputs. It must perform all the data operations (add, subtract, increment, decrement, etc.) defined by the instruction set, including address computations for memory transfers, branch calculations, bit-wise logical functions, and so on.

The full ARM2 ALU logic is illustrated in Figure 5.



Figure 5: The ARM2 ALU logic for one result bit

The set of functions generated by this ALU and the associated values of the ALU function selects are listed in Table 1.

Table 1: ARM2 ALU function codes

THE ARM6 CARRY-SELECT ADDER:

The ARM6 uses a carry-select adder to improve the worst-case add time (faster addition).

This form of adder computes the sums of various fields of the word (32-bit data) for a carry-in (c) of both zero and one, and then the final result is selected by using the correct carry-in (c) value to control a multiplexer, as shown in Figure 6.

Figure 6: The ARM6 Carry-Select Adder Scheme

The critical path is now O(log2[word width]) gates long, though direct comparison with previous schemes is difficult since the fan-out on some of these gates is high. However, the worst-case addition time is significantly faster than the 4-bit carry look-ahead adder, at the cost of significantly increased silicon area.
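The carry-select idea can be modelled in software. The sketch below is only a behavioural illustration: it assumes uniform 8-bit fields (the field sizes actually used on the ARM6 are not given in these notes), and the `carry_select_add` name is hypothetical. Each field's two candidate sums are formed without knowing the carry-in, and the carry then acts purely as a multiplexer select.

```python
def carry_select_add(a, b, carry_in=0, width=32, block=8):
    """Add two width-bit values field by field using carry selection."""
    mask = (1 << block) - 1
    result, carry = 0, carry_in
    for shift in range(0, width, block):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum0 = x + y          # candidate sum if the carry-in is 0
        sum1 = x + y + 1      # candidate sum if the carry-in is 1
        chosen = sum1 if carry else sum0   # multiplexer controlled by the carry
        result |= (chosen & mask) << shift
        carry = chosen >> block            # carry out selects in the next field
    return result, carry
```

In hardware the candidate sums for all fields are computed in parallel, so only the chain of selections lies on the critical path; the loop here just makes that selection chain explicit.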

ARM6 ALU STRUCTURE:

The ARM6 carry-select adder does not allow the arithmetic and logic functions to be merged into a single structure as was done on the ARM2.

A separate logic unit runs in parallel with the adder, and a multiplexer selects the output from the adder or from the logic unit as required.

The overall ALU structure is shown in Figure 7.

The input operands are each selectively inverted, then added and combined in the logic unit, and finally the required result is selected and issued on the ALU result bus.

The C and V flags are generated in the adder (they have no meaning for logical operations), the N flag is copied from bit 31 of the result and the Z flag is evaluated from the whole result bus.

Figure 7: The ARM6 ALU Organization.



CARRY ARBITRATION ADDER:

The adder logic was further improved on the ARM9TDMI by implementing a carry arbitration adder. This adder computes all intermediate carry values using a 'parallel-prefix' tree, which is a very fast parallel logic structure.

The carry arbitration scheme recodes the conventional information in terms of two new variables, u and v.

Table 2: ARM9 carry arbitration encoding

Consider the computation of the carry out, C, from a particular bit position in the adder with inputs A and B. Before the carry in is known, the information available is as shown in Table 2. The information is encoded by u and v. This information can be combined with that from a neighboring bit position using the formula:

THE BARREL SHIFTER:

The ARM architecture supports instructions which perform a shift operation in series with an ALU operation, so the barrel shifter adds delay to the datapath. In order to minimize the delay through the shifter, a cross-bar switch matrix is used to steer each input to the appropriate output.

The principle of the cross-bar switch is illustrated in Figure 8, which shows only a 4 x 4 cross-bar switch matrix.

Figure 8: The cross-bar switch barrel shifter principle.



(The ARM processors use a 32 x 32 matrix.) Each input is connected to each output through a switch.

The shifting functions are implemented by wiring switches along diagonals to a common control input:

1. For a left or right shift function, one diagonal is turned on. This connects all the input bits to their respective outputs where they are used. In the ARM the barrel shifter operates in negative (opposite) logic, where a '1' is represented as a potential near ground and a '0' by a potential near the supply.

2. For a rotate right function, the right shift diagonal is enabled together with the complementary left shift diagonal. (For example, on the 4-bit matrix, rotate right one bit is implemented using the 'right 1' and the 'left 3' (3 = 4 - 1) diagonals.)

3. Arithmetic shift right uses sign-extension rather than zero-fill for the unconnected output bits.

MULTIPLIER DESIGN:

All ARM processors except the first prototype have integer multiplication hardware.

Two styles of multiplier have been used:

Older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate instructions.

Recent ARM cores have high-performance multiplication hardware and support the 64-bit result multiply and multiply-accumulate instructions.

Low Cost Implementation:

The multiplier employs a modified Booth's algorithm to produce the 2-bit product. This allows all four values of the 2-bit multiplier to be implemented by a simple shift and add or subtract, possibly carrying the x 4 over to the next cycle.

The control settings for the Nth cycle of the multiplication are shown in Table 3.

Table 3: The 2-bit Multiplication Algorithm, Nth cycle.
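The contents of Table 3 are not reproduced in these notes, so the sketch below is not a transcription of it. It is a generic two-bits-per-step scheme that matches the description given: each step uses one shift and one add or subtract, and a multiplier value of 3 is handled as "subtract now, carry x 4 into the next step". The `booth_multiply` name and the unsigned 32-bit result are assumptions of the example.

```python
def booth_multiply(multiplicand, multiplier, width=32):
    """Multiply two unsigned values, two multiplier bits per step (low width bits)."""
    mask = (1 << width) - 1
    acc, carry = 0, 0
    for n in range(width // 2):
        digit = ((multiplier >> (2 * n)) & 0b11) + carry   # 0..4 with carry-in
        if digit == 1:
            acc += multiplicand << (2 * n)        # add x1
            carry = 0
        elif digit == 2:
            acc += multiplicand << (2 * n + 1)    # add x2
            carry = 0
        elif digit == 3:
            acc -= multiplicand << (2 * n)        # x3 = x4 - x1: subtract, carry x4
            carry = 1
        elif digit == 4:
            carry = 1                             # x4: nothing now, carry x4
        else:
            carry = 0                             # x0: nothing to add
    return acc & mask
```

A carry left over after the last step would add a multiple of 2^32, so it does not affect the 32-bit result and is dropped.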



This multiplication uses the existing barrel shifter and ALU of the ARM processor, so little additional hardware is required.

HIGH SPEED MULTIPLIER:

Where multiplication performance is very important, more hardware resource must be dedicated to it.

In some embedded systems the ARM core is used to perform real-time digital signal processing (DSP) in addition to general control functions.

The high-speed multiplier uses intermediate results which include partial sums and partial carries.

Carry-save adders are used for this.

These two binary results are added together at the end of the multiplication.

The main ALU is used for this.

A carry-propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum.

A carry-save adder takes one binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry).

Figure 9: Carry-propagate (a) and carry-save (b) adder structures

ARM HIGH-SPEED MULTIPLIER ORGANIZATION:

The CSA has 4 layers of adders, each handling 2 multiplier bits => multiplies 8 bits per clock cycle

The partial sum and carry are cleared at the beginning, or initialized to accumulate a value

The multiplier is shifted right 8 bits per cycle in the Rs register

The partial sum and carry are rotated right 8 bits per cycle

Performance: up to 4 clock cycles (early termination is possible)

Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of the area of the simpler cores)

Figure 10: ARM high speed multiplier organization

THE REGISTER BANK:

The last major block on the ARM datapath is the register bank. All the user-visible state (data) is stored in 31 general-purpose 32-bit registers. The transistor circuit of the register cell used in ARM cores up to the ARM6 is shown in Figure 11.

Figure 11: ARM6 Register Cell Circuit

The storage cell is an asymmetric cross-coupled pair of CMOS inverters.

The A and B read buses are precharged to Vdd during phase 2 of the clock cycle.



This register cell design works well with a 5 volt supply. A low supply voltage gives good power-efficiency.

These register cells are arranged in columns to form a 32-bit register, and the columns are packed together to form the complete register bank.

The decoders for the read and write enable lines are then packed above the columns, as shown in Figure 12.

Figure 12: ARM Registers Bank Floorplan.

DATAPATH LAYOUT:

1. The ARM datapath is laid out to a constant pitch (fixed spacing) per bit.

2. It is a good idea to produce a floor-plan for the datapath noting the 'passenger' buses through each block, as illustrated in Figure 13.

3. The order of the function blocks is chosen to minimize the number of additional buses passing over the more complex functions.



Figure 13: ARM Core Datapath Buses.

CONTROL STRUCTURES:

The control logic on the ARM has three structural components:

1. An instruction decoder PLA (programmable logic array), which uses some of the instruction bits and an internal cycle counter to define the operation to be performed in the next cycle.

2. Distributed secondary control logic, which uses the class information from the main decoder PLA to select other instruction bits and/or processor state information to control the datapath.

3. Decentralized control units for specific instructions that take a number of cycles to complete (load and store multiple, multiply and coprocessor operations).

The main decoder PLA has around 14 inputs, 40 product terms and 40 outputs, the precise number varying slightly between the different cores. On recent ARM cores it is implemented as two PLAs: a small, fast PLA which generates the time-critical outputs and a larger, slower PLA which generates all the other outputs. Functionally, however, it can be viewed as a single PLA unit.
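The way a PLA turns inputs into outputs through product terms can be sketched as follows. The 4-bit input width, the three product terms and the output names are all hypothetical and chosen only to keep the example small; they are not the contents of the real ARM decoder.

```python
# AND plane: each product term is (care_mask, required_bits). The term
# fires when the input bits selected by care_mask equal required_bits.
# These entries are illustrative, not taken from any ARM core.
PRODUCT_TERMS = [
    (0b1100, 0b0000),  # term 0: inputs 3 and 2 are both 0
    (0b1100, 0b0100),  # term 1: input 3 is 0, input 2 is 1
    (0b1001, 0b1001),  # term 2: inputs 3 and 0 are both 1
]

# OR plane: each output is the OR of a subset of the product terms.
OUTPUT_TERMS = {
    "alu_enable": [0, 1],
    "mem_access": [2],
    "write_back": [0, 2],
}

def pla_decode(inputs):
    """Evaluate the AND plane, then the OR plane, for one input word."""
    fired = [(inputs & care) == want for care, want in PRODUCT_TERMS]
    return {name: any(fired[t] for t in terms)
            for name, terms in OUTPUT_TERMS.items()}
```

In the real decoder the inputs would be instruction bits together with the cycle counter, so the same instruction can produce different control outputs on successive cycles.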

Figure 14: ARM control logic structure.



ARCHITECTURAL SUPPORT FOR SYSTEM DEVELOPMENT

I-THE ADVANCED MICROCONTROLLER BUS ARCHITECTURE (AMBA):

ARM processor cores have bus interfaces designed for high-speed cache interfacing; when a core is used as a component on a system chip, some interfacing is required to allow the ARM to communicate with other on-chip macrocells.

ARM Limited specified the Advanced Microcontroller Bus Architecture, AMBA, to standardize the on-chip connection of different macrocells.

AMBA buses:

Three buses are defined within the AMBA specification:

1. The Advanced High-performance Bus (AHB) is used to connect high-performance system modules. It supports burst mode data transfers and split transactions, and all timing is referenced to a single clock edge.

2. The Advanced System Bus (ASB) is used to connect high-performance system modules. It supports burst mode data transfers.

3. The Advanced Peripheral Bus (APB) offers a simpler interface for low-performance peripherals.

A typical AMBA-based microcontroller will incorporate either an AHB or an ASB together with an APB, as illustrated in Figure 1. The ASB is the older form of system bus, with AHB being introduced later to improve support for higher performance, synthesis and timing verification.

The APB is generally used as a local secondary bus which appears as a single slave module on the AHB or ASB.

Figure 1: A typical AMBA-based system.

Arbitration:

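A bus transaction is initiated by a bus master, which must first request access from a central arbiter. The arbiter decides the relative priorities of competing requests.

The master, x, issues a request (AREQx) to the central arbiter.

When the bus is available, the arbiter issues a grant (AGNTx) to the master.

The arbitration must take account of the bus lock signal (BLOK), so that an atomic bus transaction is not broken up.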
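The request, grant and lock behaviour can be sketched as a small function. The fixed-priority scheme (lowest-numbered master wins) is an assumption made for this example; the text only says that the arbiter decides priorities. The function and parameter names are hypothetical.

```python
def arbitrate(requests, current_owner=None, bus_locked=False):
    """Return the index x of the master to receive AGNTx, or None.

    requests[x] models AREQx. bus_locked models BLOK being asserted
    by the master that currently owns the bus.
    """
    # A locked bus stays with its owner so an atomic transfer can finish.
    if bus_locked and current_owner is not None:
        return current_owner
    # Assumed policy: the lowest-numbered requesting master wins.
    for x, requesting in enumerate(requests):
        if requesting:
            return x
    return None
```

With the lock asserted, a higher-priority request has to wait until the current owner releases the bus.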



Bus transfers:

When a master has been granted access to the bus, it issues address and control information to indicate the type of the transfer and which slave device should respond. The following signals are used to define the transaction timing:

The bus clock, BCLK. (This will usually be the same as mclk, the ARM processor clock.)

Bus transaction, BTRAN[1:0], indicates whether the next bus cycle will be address-only, sequential or non-sequential.

The address bus, BA[31:0].

Bus transfer direction, BWRITE.

Bus protection signals, BPROT[1:0], which indicate instruction or data fetches and supervisor or user access.

The transfer size, BSIZE[1:0], specifies a byte, half-word or word transfer.

Bus lock, BLOK, allows a master to retain the bus to complete an atomic read-modify-write transaction.

The data bus, BD[31:0], used to transmit write data and to receive read data.

A slave unit may process the requested transaction immediately, accepting write data or issuing read data on BD[31:0], or signal one of the following responses:

Bus wait, BWAIT, allows a slave module to insert wait states when it cannot complete the transaction in the current cycle.

Bus last, BLAST, allows a slave to terminate a sequential burst to force the bus master to issue a new bus transaction request to continue.

Bus error, BERROR, indicates a transaction that cannot be completed. If the master is a processor it should abort the transfer.
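How a master might react to these slave responses can be sketched as a simple loop. This is a rough illustration only: the string labels, the "DONE" response for an immediate completion and the returned status names are hypothetical, and real ASB timing is not modelled.

```python
def run_transfer(responses):
    """Step through per-cycle slave responses for one transfer.

    responses is an iterable of "BWAIT", "BLAST", "BERROR" or "DONE"
    ("DONE" is a made-up label for a transfer completing normally).
    Returns (status, bus cycles used).
    """
    cycles = 0
    for response in responses:
        cycles += 1
        if response == "BWAIT":
            continue  # wait state: retry in the next cycle
        if response == "BERROR":
            return ("aborted", cycles)
        if response == "BLAST":
            # Burst is cut short; the master must issue a new request.
            return ("burst ended", cycles)
        return ("completed", cycles)
    return ("pending", cycles)
```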

Bus Reset:

Correct ASB power-up is ensured by imposing an asynchronous reset mode that forces all drivers off the bus independently of the clock.

Test Interface:

A possible use of the AMBA is to provide support for a modular testing methodology through the Test Interface Controller. This approach allows each module on the AMBA to be tested independently by allowing an external tester to appear as a bus master on the ASB.

The test interface uses two test request inputs (TREQA and TREQB).

Advanced Peripheral Bus:

The Advanced Peripheral Bus is a simple, static bus which operates as a stub on an ASB to offer a minimalist interface to very simple peripheral macrocells.

The bus includes:

Address (PADDR[n:0]; the full 32 bits are not usually required).

Read and write data (PRDATA[m:0] and PWDATA[m:0], where m is 7, 15 or 31).

A read/write direction indicator (PWRITE).

Individual peripheral select strobes (PSELx).

Peripheral timing strobe (PENABLE).

APB transfers are timed to PCLK, and all APB devices are reset with PRESETn.
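The role of the select and timing strobes can be illustrated with a small behavioural model. It assumes the usual two-cycle APB access (a setup cycle with PSELx high and PENABLE low, followed by an enable cycle with PENABLE high), which is not described in the text above. The peripheral class, its byte-wide registers (m = 7) and the helper function are hypothetical.

```python
class ApbRegisterFile:
    """Hypothetical APB peripheral holding a few byte-wide registers."""

    def __init__(self, size=4):
        self.regs = [0] * size

    def clock(self, psel, penable, pwrite, paddr, pwdata=0):
        """Model one PCLK edge; returns PRDATA on a read, else None."""
        if not (psel and penable):
            return None  # not selected, or still in the setup cycle
        if pwrite:
            self.regs[paddr] = pwdata & 0xFF  # PWDATA[7:0]
            return None
        return self.regs[paddr]               # PRDATA[7:0]

def apb_access(peripheral, paddr, pwrite, pwdata=0):
    """Drive one transfer: a setup cycle, then an enable cycle."""
    peripheral.clock(psel=True, penable=False, pwrite=pwrite,
                     paddr=paddr, pwdata=pwdata)
    return peripheral.clock(psel=True, penable=True, pwrite=pwrite,
                            paddr=paddr, pwdata=pwdata)
```

For example, writing 0x1AB to register 2 and reading it back returns 0xAB, because only the low 8 data bits are implemented in this model.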

Advanced High performance Bus:



The AHB is intended to replace the ASB in very high performance systems, for example those based on the ARM1020E.

It supports split transactions.

It uses a single clock edge to control all of its operations.

It uses a centrally multiplexed bus scheme rather than a bidirectional bus.

II-THE ARMulator:

The ARMulator is part of the cross-development toolkit. It is a software emulator of the ARM processor which supports the debugging and evaluation of ARM code without requiring an ARM processor chip.

The ARMulator has an important role in embedded system design. It supports the high-level prototyping of various parts of the system to support the development of software and the evaluation of architectural alternatives (helping to develop new architectures).

It is made up of four components:

The processor core model

A memory interface

A coprocessor interface

An operating system interface

System Modeling:

Using the ARMulator it is possible to build a complete, clock-cycle accurate software model of a system including a cache, MMU, physical memory, peripheral devices, operating system and software.

Once the design is reasonably stable (or correct), hardware development will probably move into a CAD (Computer Aided Design) environment, but software development can continue using the ARMulator-based model.

III-THE JTAG BOUNDARY SCAN TEST ARCHITECTURE:

Two difficult areas in the development of a product are:

1. The production testing of the VLSI component

2. The production testing of the assembled printed circuit board

The production testing of the assembled printed circuit board is addressed by IEEE standard number 1149. This standard was developed by the Joint Test Action Group (hence JTAG), and the architecture described by the standard is known either as 'JTAG boundary scan' or as 'IEEE 1149'.

The general structure of the JTAG boundary scan test interface is shown in Figure 1.



Figure 1. JTAG Boundary Scan Organization.

Test signals:

The interface works with five dedicated signals which must be provided on each chip that supports the test standard:

TRST is a test reset input which initializes the test interface.

TCK is the test clock which controls the timing of the test interface independently from any system clocks.

TMS is the test mode select which controls the operation of the test interface state machine.

TDI is the test data input line which supplies the data to the boundary scan or instruction registers.

TDO is the test data output line which carries the sampled values from the boundary scan chain and propagates data to the next chip in the serial test circuit.

TAP controller:

The operation of the test interface is controlled by the Test Access Port (TAP) controller.

Data registers:

The data registers are:

1. The device ID register

2. The bypass register

3. The boundary scan register

The behavior of a particular chip is determined by the contents of the test instruction register, which can select between various different data registers:

1. The device ID register reads out an identification number which is hard-wired into the chip (identifying which chip is being tested).

2. The bypass register connects TDI to TDO with a 1-clock delay to give the tester rapid access to another device in the test loop on the same board.



3. The boundary scan register intercepts (captures) all the signals between the core logic and the pins and comprises the individual register bits which are shown as the squares connected to the core logic in Figure 2.

4. Other registers may be employed on the chip to test other functions as required.
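The serial behaviour of these data registers can be shown with a small simulation of a test loop. It is a simplified sketch: the TAP state machine and instruction register are not modelled, each device is assumed to have one data register already selected, and the class and function names are hypothetical.

```python
class BypassRegister:
    """Single-bit register: TDO is TDI delayed by one TCK."""

    def __init__(self):
        self.bit = 0

    def clock(self, tdi):
        tdo, self.bit = self.bit, tdi
        return tdo

class ShiftRegister:
    """Multi-bit data register (for example a boundary scan register)."""

    def __init__(self, bits):
        self.bits = list(bits)

    def clock(self, tdi):
        tdo = self.bits[-1]                 # last bit leaves on TDO
        self.bits = [tdi] + self.bits[:-1]  # TDI enters at the front
        return tdo

def shift_chain(devices, tdi_stream):
    """Clock a serial chain once per input bit and collect the final TDO."""
    out = []
    for bit in tdi_stream:
        for device in devices:
            # TDO of one chip feeds TDI of the next chip in the loop.
            bit = device.clock(bit)
        out.append(bit)
    return out
```

Two devices in bypass delay the data by only two clocks, which is why bypassing gives the tester rapid access to the device of interest further along the loop.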

Figure 2: A Possible JTAG Extension for Macrocell Testing.

Instructions:

Instructions inside the instruction register may be public or private.

Public instructions are declared and available for general test use

Private instructions are for specialized on-chip test purposes

Public instructions:

The minimum set of public instructions that all compliant devices must support is:

1. BYPASS: here the device connects TDI to TDO through a single clock delay.

2. EXTEST: here the boundary scan register is connected between TDI and TDO and the pin states are captured and controlled by the register.

3. IDCODE: here the ID register is connected between TDI and TDO.

4. INTEST: here the boundary scan register is connected between TDI and TDO and the core logic input and output states are captured and controlled by the register.

PCB testing:

The principal goal of the JTAG test circuit is to enable the continuity of tracks and the connectivity of solder connections on printed circuit boards (PCBs) to be tested.

Embedded-ICE:

The EmbeddedICE module introduces breakpoint and watchpoint registers, which are accessed as additional data registers using special JTAG instructions, and a trace buffer.

Macrocell testing:

A complex system chip contains a number of complex predesigned macrocells together with some application-specific custom logic.



The ARM processor core is itself one such macrocell; the remaining macrocells may come from ARM Limited, a semiconductor partner or some third party supplier.

In such cases the designer of the system chip will have limited knowledge of the macrocells and will depend on the macrocell suppliers for the production test patterns for each macrocell.

Each macrocell may have a boundary scan path through which the test patterns may be applied using the JTAG architecture, as shown in Figure 2.

IV-THE ARM DEBUG ARCHITECTURE:

Debugging any computer system can be a complex task. There are two basic approaches to debugging:

1. Watching a system from the outside using test equipment such as a logic analyzer (the simple method)

2. Viewing a system from the inside with tools that support single stepping, the setting of breakpoints, and so on (the more powerful method)

Desktop Debugging:

When the software being debugged runs on a desktop machine, all the user interface components are readily available and the debugger may itself be another piece of software running on the same machine.

Breakpoints are set by replacing an instruction in the object program with a call to the debugger.

Embedded Debugging:

Debugging becomes more difficult when the target system is embedded. Now there is no user interface in the system, so the debugger must run on a remote host through some sort of communication link to the target embedded system.

If the code is in ROM, instructions cannot simply be replaced by calls to the debugger because the locations are not writeable.

The standard solution for debugging an embedded system is the In-Circuit Emulator (ICE).

The processor in the target system is removed and replaced by an emulator. The emulator may be based around the same processor chip, or a variant with more pins, but it will also incorporate buffers to copy the bus activity to a trace buffer and various hardware resources which can watch for particular events, such as execution passing through a breakpoint. The trace buffer and hardware resources are managed by software running on a host desktop system.

Debugging processor cores:

The ICE approach depends on there being an identifiable processor chip which can be removed and replaced by the ICE; this is not normally possible when the processor is a core within a complex system chip.

Although simulation using software models such as the ARMulator should remove many of the design bugs before the physical system is built, it is often impossible to run the full software system under emulation, and it can be difficult to model all the real-time constraints accurately.

Therefore it is likely that some debugging of the complete hardware and software system will be necessary. How can this be achieved? This is still an area of active research to identify the best overall strategy.

ARM debug Hardware:

To provide the debug facilities offered by a typical ICE, the user must be able to set breakpoints and watchpoints, to inspect and modify the processor and system state, and to see a trace of processor activity.

In addition to the breakpoint and watchpoint events, it may also be desirable to halt the processor when a system-level event occurs. The debug architecture includes external inputs for this purpose. The on-chip cell containing these facilities is called the EmbeddedICE module.

Embedded-ICE:

The EmbeddedICE module consists of:

1. Two watchpoint registers

2. A control register

3. Status registers

The watchpoint registers can halt the ARM core when the address, data and control signals match the value programmed into the watchpoint register. Since the comparison is performed under a mask, either watchpoint register can be configured to operate as a breakpoint register capable of halting the processor when an instruction in either ROM or RAM is executed. The comparison and mask logic is illustrated in Figure 1.

Figure 1: EmbeddedICE Signal Comparison Logic.

Each watchpoint can look for a particular combination of values on the ARM address bus, data bus, trans, opc, mas[1:0] and r/w control signals, and if either combination is matched the processor is stopped. EmbeddedICE registers are programmed via the JTAG test port.
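Comparison under a mask can be sketched as follows. The example assumes that a 1 in a mask bit means "ignore this bit", compares only the address and data buses, and leaves out the control signals; the class and field names are hypothetical.

```python
from dataclasses import dataclass

WORD = 0xFFFFFFFF

@dataclass
class Watchpoint:
    addr_value: int
    addr_mask: int = 0      # assumed convention: 1 bits are ignored
    data_value: int = 0
    data_mask: int = WORD   # by default the data bus is ignored entirely

    def matches(self, addr, data):
        """True when every unmasked address and data bit equals the value."""
        addr_hit = ((addr ^ self.addr_value) & ~self.addr_mask & WORD) == 0
        data_hit = ((data ^ self.data_value) & ~self.data_mask & WORD) == 0
        return addr_hit and data_hit

def should_halt(watchpoints, addr, data):
    """The core is stopped if either watchpoint matches."""
    return any(w.matches(addr, data) for w in watchpoints)
```

Masking the two low address bits makes a watchpoint cover a whole word, and masking all data bits turns it into an address-only comparison of the kind a breakpoint needs.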

Accessing state:

The EmbeddedICE module allows a program to be halted at specific points, but it does not directly allow the processor or system state to be inspected or modified. This is achieved via further scan paths which are also accessed through the JTAG port.



Figure 2: EmbeddedICE Register Read and Write Structure
