arm organization and implementation aleksandar milenkovic e-mail:...

ARMOrganization and Implementation

Aleksandar MilenkovicE-mail: [email protected]

Web: http://www.ece.uah.edu/~milenka

mailto:[email protected]







http://www.ece.uah.edu/~milenka








2

Outline

ARM Architecture ARM Organization and Implementation ARM Instruction Set Architectural Support for High-level Languages Thumb Instruction Set Architectural Support for System Development ARM Processor Cores Memory Hierarchy Architectural Support for Operating Systems ARM CPU Cores Embedded ARM Applications

3

ARM organization

Register file – 2 read ports, 1 write port +

1 read, 1 write port reserved for r15 (pc)

Barrel shifter – shift or rotate one operand for any number of bits

ALU – performs the arithmetic and logic functions required

Memory address register + incrementer

Memory data registers Instruction decoder and

associated control logic

multiply

data out register

instruction

decode

&

control

incrementer

registerbank

address register

barrelshifter

A[31:0]

D[31:0]

data in register

ALU

control

PC

PC

ALU bus

A bus

B bus

register

4

Three-stage pipeline

Fetch the instruction is fetched from memory and placed in

the instruction pipeline Decode

the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath

Execute the instruction owns the datapath; the register bank

is read, an operand shifted, the ALU register generated and written back into a destination register

5

ARM single-cycle instruction pipeline

fetch decode execute

time

1



2

3instruction

6

ARM single-cycle instruction pipeline

add r0,r1,#5

sub r2,r3,r6

cmp r2,#3

fetch

time

decode

fetch

execute add

decode

fetch

execute sub

decode execute cmp

1 2 3

7

ARM multi-cycle instruction pipeline

fetch ADD decode execute

time

1

fetch STR decode calc. addr.

fetch ADD decode execute

2

3

data xfer

fetch ADD decode execute4

5 fetch ADD decode execute

instruction

Decode logic is always generating the control signals for the datapath to use in the next cycle

8

ARM multi-cycle LDMIA (load multiple) instruction

fetch decodeex ld r2ldmia r0,{r2,r3}

sub r2,r3,r6

cmp r2,#3

ex ld r3

fetch

time

decode ex sub

fetch decodeex cmp

Decode stage occupied since ldmia must continue toremember decoded instruction

sub fetched at normal time butnot decoded until LDMIA is finishing

Instruction delayed

9

Control stalls: due to branches

Branches often introduce stalls (branch penalty) Stall time may depend on whether branch is taken

May have to squash instructions that already started executing

Don’t know what to fetch until condition is evaluated

10

ARM pipelined branch

time

fetch decode ex bnebne foo

sub r2,r3,r6

fetch decode

foo add r0,r1,r2

ex bne

fetch decode ex add

ex bne

Decision not made until the third clock cycle

Two cycles of work thrown away if bne takes place

11

Pipeline: how it works

All instructions occupy the datapath for one or more adjacent cycles

For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle

During the fist datapath cycle each instruction issues a fetch for the next instruction but one

Branch instruction flush and refill the instruction pipeline

12

ARM9TDMI 5-stage pipeline

I-cache

rot/sgn ex

+4

byte repl.

ALU

I decode

register read

D-cache

fetch

instructiondecode

execute

buffer/data

write-back

forwardingpaths

immediatefields

nextpc

regshift

load/storeaddress

LDR pc

SUBS pc

post-index

pre-index

LDM/STM

register write

r15

pc + 8

pc + 4

+4

mux

shift

mul

B, BLMOV pc

Fetch Decode

instruction is decoded register operands read

(3 read ports) Execute

an operand is shifted and the ALU result generated, or

address is computed Buffer/data

data memory is accessed (load, store)

Write-back write to register file

13

ARM9TDMI Data Forwarding

I-cache

rot/sgn ex

+4

byte repl.

ALU

I decode

register read

D-cache

fetch

instructiondecode

execute

buffer/data

write-back

forwardingpaths

immediatefields

nextpc

regshift

load/storeaddress

LDR pc

SUBS pc

post-index

pre-index

LDM/STM

register write

r15

pc + 8

pc + 4

+4

mux

shift

mul

B, BLMOV pc

ADD r3, r2, r1, LSL #3ADD r5, r5, r3, LSL r2

r3 := r2 + 8 x r1r5 := r5 + 2r2 x r3

ADD r3, r2, r1, LSL #3ADD r8, r9, r10ADD r5, r5, r3, LSL r2

r3 := r2 + 8 x r1r8 := r9 + r10r5 := r5 + 2r2 x r3

LD r3, [r2] ADD r1, r2, r3

r3 := mem[r2]r1 := r2 + r3

Data Forwarding

Stall?

14

ARM9TDMI PC generation

I-cache

rot/sgn ex

+4

byte repl.

ALU

I decode

register read

D-cache

fetch

instructiondecode

execute

buffer/data

write-back

forwardingpaths

immediatefields

nextpc

regshift

load/storeaddress

LDR pc

SUBS pc

post-index

pre-index

LDM/STM

register write

r15

pc + 8

pc + 4

+4

mux

shift

mul

B, BLMOV pc

3-stage pipeline PC behavior:

operands are read in execution stage r15 = PC + 8

5-stage pipeline operands are read in decode

stage and r15 = PC + 4? incompatibilities between 3-

stage and 5-stage implementations => unacceptable

to avoid this 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

15

Data processing instruction datapath activity

address register

increment

registersRd

Rn

PC

Rm

as ins.

as instruction

mult

data out data in i. pipe

(a) register – register operations

address register

increment

registersRd

Rn

PC

as ins.

as instruction

mult


[7:0]

(b) register – immediate operations

Reg-RegRd = Rn op

Rmr15 = AR + 4

AR = AR + 4Reg-Imm

Rd = Rn op Imm

r15 = AR + 4AR = AR + 4

16

STR (store register) datapath activity

address register

increment

registersRn

PC

lsl #0

= A / A + B / A - B

mult


[11:0]

address register

increment

registersRn

Rd

shifter

= A + B / A - B

mult

PC

byte? data in i. pipe

(a) 1st cycle – compute address (b) 2nd cycle – store data & auto-index

Compute addressAR = Rn op

Dispr15 = AR + 4

Store dataAR = PCmem[AR] =

Rd<x:y>If

autoindexing=>Rn = Rn +/- 4

17

The first two (of three) cycles of a branch instruction

address register

increment

registersPC

lsl #2

= A + B

mult


[23:0]

address register

increment

registersR14

PC

shifter

= A

mult


(a) 1st cycle – compute branch target(b) 2nd cycle – save return address

Third cycle: do a small correction to the value stored in the link register in order that it points to directly at the instruction which follows the branch?

Compute target addressAR = PC + Disp,lsl

#2Save return address

(if required)r14 = PCAR = AR + 4

18

ARM Implementation

Datapath Control unit (FSM)

19

2-phase non-overlapping clock scheme

Most ARMs do not operate on edge-sensitive registers

Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal

Data movement is controlled by passing the data alternatively through latches which are open during phase 1 or latches during phase 2

1 clock cycle

phase 1

phase 2

20

ARM datapath timing Register read

Register read buses – dynamic, precharged during phase 2 During phase 1 selected registers discharge the read buses

which become valid early in phase 1 Shift operation

second operand passes through barrel shifter ALU operation

ALU has input latches which are open in phase 1,allowing the operands to begin combining in ALU as soon as they are valid, but they close at the end of phase 1

so that the phase 2 precharge does not get through to the ALU

ALU processes the operands during the phase 2, producing the valid output towards the end of the phase

the result is latched in the destination register at the end of phase 2

21

ARM datapath timing (cont’d)

read bus valid

shift out valid

ALU out

shift time

ALU time

registerwrite time

registerreadtime

ALU operandslatched

phase 1

phase 2

prechargeinvalidatesbuses

Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time

22

The original ARM1 ripple-carry adder

Carry logic: use CMOS AOI (And-Or-Invert) gate Even bits use circuit show below Odd bits use the dual circuit with inverted inputs

and outputs and AND and OR gates swapped around

Worst case path:32 gates long

AB

Cin

sum

Cout

23

ARM2 4-bit carry look-ahead scheme

Carry Generate (G)Carry Propagate (P)

Cout[3] =Cin[0].P + G Use AOI and

alternate AND/OR gates Worst case:

8 gates longA[3:0]

B[3:0]

Cin[0]

sum[3:0]

Cout[3]

4-bitadderlogic

P

G

24

The ARM2 ALU logic for one result bit

ALU functions data operations (add, sub, ...) address computations for memory accesses branch target computations bit-wise logical

operations ...

ALUbus

432105

NBbus

NAbus

carrylogic

fs:

G

P

25

ARM2 ALU function codes

26

The ARM6 carry-select adder scheme

Compute sums of various fields of the wordfor carry-in of zero and carry-in of one

Final result is selected by using the correct carry-in value to control a multiplexor

sum[31:16]sum[15:8]sum[7:4]sum[3:0]

s s+1

a,b[31:28]a,b[3:0]

+ +, +1

c

+, +1

mux

mux

mux

Worst case: O(log2[word width]) gates long

Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.

27

The ARM6 ALU organization

Not easy to merge the arithmetic and logic functions =>a separate logic unit runs in parallel with the adder,

and multiplexor selects the output

Z

N

VC

logic/arithmetic

C infunction

invert A invert B

result

result mux

logic functions

A operand latch B operand latch

XOR gates XOR gates

adder

zero detect

28

ARM9 carry arbitration encoding

Carry arbitration adder

A B C u v

0 0 0 0 0

0 1 unknown 1 0

1 0 unknown 1 0

1 1 1 1 1

29

The cross-bar switch barrel shifter

Shifter delay is critical since it contributes directly to the datapath cycle time

Cross-bar switch matrix (32 x 32) Principle for 4x4 matrix

in[0]

in[1]

in[2]

in[3]

out[0] out[1] out[2] out[3]

no shiftright 1right 2right 3

left 1

left 2

left 3

30

The cross-bar switch barrel shifter (cont’d)

Precharged logic is used => each switch is a single NMOS transistor

Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0 giving the zero filling required by the shift semantics

For rotate right, the right shift diagonal is enabled + complementary shift left diagonal (e. g., ‘right 1’ + ‘left 3’)

Arithmetic shift right:use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately

31

The 2-bit multiplication algorithm, Nth cycle

Carry - i n Mul t i p l i er Shi f t ALU Carry -o ut0 x 0 LSL #2N A + 0 0

x 1 LSL #2N A + B 0x 2 LSL #(2N + 1) A – B 1x 3 LSL #2N A – B 1

1 x 0 LSL #2N A + B 0x 1 LSL #(2N + 1) A + B 0x 2 LSL #2N A – B 1x 3 LSL #2N A + 0 1

32

Carry-propagate (a) and carry-save (b) adder structures

+A B Cin

Cout S(a) +

A B Cin

Cout S+

A B Cin

Cout S+

A B Cin

Cout S

+A B Cin

Cout S(b) +

A B Cin

Cout S+

A B Cin

Cout S+

A B Cin

Cout S

33

ARM high-speed multiplier organization

Rs >> 8 bits/cycle

carry-save adders

partial sum

partial carry

initialization for MLAregisters

Rm

ALU (add partials)

rotate sum andcarry 8 bits/cycle

34

ARM2 register cell circuit

A busB bus

ALU bus

writeread

Bread

A

35

ARM register bank floorplan

A bus read decoders

B bus read decoders

write decoders

register cellsPC

Vdd

Vss

ALUbus

PCbus

INCbus

ALUbus

A bus

B bus

36

ARM core datapath buses

address register

incrementer

register bank

multiplier

ALU

shifter

data in

instruction pipe

data out

A B

W

instruction

Din

shift out

PC

Ad

inc

37

ARM control logic structure

decodePLA

cyclecount

multiplycontrol

load/storemultiple

addresscontrol

registercontrol

ALUcontrol

shiftercontrol

instruction

coprocessor

arm organization and implementation aleksandar milenkovic e-mail:...

Documents

branch instruction

implementation arm instruction

instruction delayed

cycle slide

multiple instruction

decoded instruction

alu register

r2 ex bne