arm organization and implementation aleksandar milenkovic e-mail:...
TRANSCRIPT
ARMOrganization and Implementation
Aleksandar MilenkovicE-mail: [email protected]
Web: http://www.ece.uah.edu/~milenka
2
Outline
ARM Architecture ARM Organization and Implementation ARM Instruction Set Architectural Support for High-level Languages Thumb Instruction Set Architectural Support for System Development ARM Processor Cores Memory Hierarchy Architectural Support for Operating Systems ARM CPU Cores Embedded ARM Applications
3
ARM organization
Register file – 2 read ports, 1 write port +
1 read, 1 write port reserved for r15 (pc)
Barrel shifter – shift or rotate one operand for any number of bits
ALU – performs the arithmetic and logic functions required
Memory address register + incrementer
Memory data registers Instruction decoder and
associated control logic
multiply
data out register
instruction
decode
&
control
incrementer
registerbank
address register
barrelshifter
A[31:0]
D[31:0]
data in register
ALU
control
PC
PC
ALU bus
A bus
B bus
register
4
Three-stage pipeline
Fetch the instruction is fetched from memory and placed in
the instruction pipeline Decode
the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
Execute the instruction owns the datapath; the register bank
is read, an operand shifted, the ALU register generated and written back into a destination register
5
ARM single-cycle instruction pipeline
fetch decode execute
time
1
fetch decode execute
fetch decode execute
2
3instruction
6
ARM single-cycle instruction pipeline
add r0,r1,#5
sub r2,r3,r6
cmp r2,#3
fetch
time
decode
fetch
execute add
decode
fetch
execute sub
decode execute cmp
1 2 3
7
ARM multi-cycle instruction pipeline
fetch ADD decode execute
time
1
fetch STR decode calc. addr.
fetch ADD decode execute
2
3
data xfer
fetch ADD decode execute4
5 fetch ADD decode execute
instruction
Decode logic is always generating the control signals for the datapath to use in the next cycle
8
ARM multi-cycle LDMIA (load multiple) instruction
fetch decodeex ld r2ldmia r0,{r2,r3}
sub r2,r3,r6
cmp r2,#3
ex ld r3
fetch
time
decode ex sub
fetch decodeex cmp
Decode stage occupied since ldmia must continue toremember decoded instruction
sub fetched at normal time butnot decoded until LDMIA is finishing
Instruction delayed
9
Control stalls: due to branches
Branches often introduce stalls (branch penalty) Stall time may depend on whether branch is taken
May have to squash instructions that already started executing
Don’t know what to fetch until condition is evaluated
10
ARM pipelined branch
time
fetch decode ex bnebne foo
sub r2,r3,r6
fetch decode
foo add r0,r1,r2
ex bne
fetch decode ex add
ex bne
Decision not made until the third clock cycle
Two cycles of work thrown away if bne takes place
11
Pipeline: how it works
All instructions occupy the datapath for one or more adjacent cycles
For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
During the fist datapath cycle each instruction issues a fetch for the next instruction but one
Branch instruction flush and refill the instruction pipeline
12
ARM9TDMI 5-stage pipeline
I-cache
rot/sgn ex
+4
byte repl.
ALU
I decode
register read
D-cache
fetch
instructiondecode
execute
buffer/data
write-back
forwardingpaths
immediatefields
nextpc
regshift
load/storeaddress
LDR pc
SUBS pc
post-index
pre-index
LDM/STM
register write
r15
pc + 8
pc + 4
+4
mux
shift
mul
B, BLMOV pc
Fetch Decode
instruction is decoded register operands read
(3 read ports) Execute
an operand is shifted and the ALU result generated, or
address is computed Buffer/data
data memory is accessed (load, store)
Write-back write to register file
13
ARM9TDMI Data Forwarding
I-cache
rot/sgn ex
+4
byte repl.
ALU
I decode
register read
D-cache
fetch
instructiondecode
execute
buffer/data
write-back
forwardingpaths
immediatefields
nextpc
regshift
load/storeaddress
LDR pc
SUBS pc
post-index
pre-index
LDM/STM
register write
r15
pc + 8
pc + 4
+4
mux
shift
mul
B, BLMOV pc
ADD r3, r2, r1, LSL #3ADD r5, r5, r3, LSL r2
r3 := r2 + 8 x r1r5 := r5 + 2r2 x r3
ADD r3, r2, r1, LSL #3ADD r8, r9, r10ADD r5, r5, r3, LSL r2
r3 := r2 + 8 x r1r8 := r9 + r10r5 := r5 + 2r2 x r3
LD r3, [r2] ADD r1, r2, r3
r3 := mem[r2]r1 := r2 + r3
Data Forwarding
Stall?
14
ARM9TDMI PC generation
I-cache
rot/sgn ex
+4
byte repl.
ALU
I decode
register read
D-cache
fetch
instructiondecode
execute
buffer/data
write-back
forwardingpaths
immediatefields
nextpc
regshift
load/storeaddress
LDR pc
SUBS pc
post-index
pre-index
LDM/STM
register write
r15
pc + 8
pc + 4
+4
mux
shift
mul
B, BLMOV pc
3-stage pipeline PC behavior:
operands are read in execution stage r15 = PC + 8
5-stage pipeline operands are read in decode
stage and r15 = PC + 4? incompatibilities between 3-
stage and 5-stage implementations => unacceptable
to avoid this 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs
15
Data processing instruction datapath activity
address register
increment
registersRd
Rn
PC
Rm
as ins.
as instruction
mult
data out data in i. pipe
(a) register – register operations
address register
increment
registersRd
Rn
PC
as ins.
as instruction
mult
data out data in i. pipe
[7:0]
(b) register – immediate operations
Reg-RegRd = Rn op
Rmr15 = AR + 4
AR = AR + 4Reg-Imm
Rd = Rn op Imm
r15 = AR + 4AR = AR + 4
16
STR (store register) datapath activity
address register
increment
registersRn
PC
lsl #0
= A / A + B / A - B
mult
data out data in i. pipe
[11:0]
address register
increment
registersRn
Rd
shifter
= A + B / A - B
mult
PC
byte? data in i. pipe
(a) 1st cycle – compute address (b) 2nd cycle – store data & auto-index
Compute addressAR = Rn op
Dispr15 = AR + 4
Store dataAR = PCmem[AR] =
Rd<x:y>If
autoindexing=>Rn = Rn +/- 4
17
The first two (of three) cycles of a branch instruction
address register
increment
registersPC
lsl #2
= A + B
mult
data out data in i. pipe
[23:0]
address register
increment
registersR14
PC
shifter
= A
mult
data out data in i. pipe
(a) 1st cycle – compute branch target(b) 2nd cycle – save return address
Third cycle: do a small correction to the value stored in the link register in order that it points to directly at the instruction which follows the branch?
Compute target addressAR = PC + Disp,lsl
#2Save return address
(if required)r14 = PCAR = AR + 4
18
ARM Implementation
Datapath Control unit (FSM)
19
2-phase non-overlapping clock scheme
Most ARMs do not operate on edge-sensitive registers
Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
Data movement is controlled by passing the data alternatively through latches which are open during phase 1 or latches during phase 2
1 clock cycle
phase 1
phase 2
20
ARM datapath timing Register read
Register read buses – dynamic, precharged during phase 2 During phase 1 selected registers discharge the read buses
which become valid early in phase 1 Shift operation
second operand passes through barrel shifter ALU operation
ALU has input latches which are open in phase 1,allowing the operands to begin combining in ALU as soon as they are valid, but they close at the end of phase 1
so that the phase 2 precharge does not get through to the ALU
ALU processes the operands during the phase 2, producing the valid output towards the end of the phase
the result is latched in the destination register at the end of phase 2
21
ARM datapath timing (cont’d)
read bus valid
shift out valid
ALU out
shift time
ALU time
registerwrite time
registerreadtime
ALU operandslatched
phase 1
phase 2
prechargeinvalidatesbuses
Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time
22
The original ARM1 ripple-carry adder
Carry logic: use CMOS AOI (And-Or-Invert) gate Even bits use circuit show below Odd bits use the dual circuit with inverted inputs
and outputs and AND and OR gates swapped around
Worst case path:32 gates long
AB
Cin
sum
Cout
23
ARM2 4-bit carry look-ahead scheme
Carry Generate (G)Carry Propagate (P)
Cout[3] =Cin[0].P + G Use AOI and
alternate AND/OR gates Worst case:
8 gates longA[3:0]
B[3:0]
Cin[0]
sum[3:0]
Cout[3]
4-bitadderlogic
P
G
24
The ARM2 ALU logic for one result bit
ALU functions data operations (add, sub, ...) address computations for memory accesses branch target computations bit-wise logical
operations ...
ALUbus
432105
NBbus
NAbus
carrylogic
fs:
G
P
25
ARM2 ALU function codes
26
The ARM6 carry-select adder scheme
Compute sums of various fields of the wordfor carry-in of zero and carry-in of one
Final result is selected by using the correct carry-in value to control a multiplexor
sum[31:16]sum[15:8]sum[7:4]sum[3:0]
s s+1
a,b[31:28]a,b[3:0]
+ +, +1
c
+, +1
mux
mux
mux
Worst case: O(log2[word width]) gates long
Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.
27
The ARM6 ALU organization
Not easy to merge the arithmetic and logic functions =>a separate logic unit runs in parallel with the adder,
and multiplexor selects the output
Z
N
VC
logic/arithmetic
C infunction
invert A invert B
result
result mux
logic functions
A operand latch B operand latch
XOR gates XOR gates
adder
zero detect
28
ARM9 carry arbitration encoding
Carry arbitration adder
A B C u v
0 0 0 0 0
0 1 unknown 1 0
1 0 unknown 1 0
1 1 1 1 1
29
The cross-bar switch barrel shifter
Shifter delay is critical since it contributes directly to the datapath cycle time
Cross-bar switch matrix (32 x 32) Principle for 4x4 matrix
in[0]
in[1]
in[2]
in[3]
out[0] out[1] out[2] out[3]
no shiftright 1right 2right 3
left 1
left 2
left 3
30
The cross-bar switch barrel shifter (cont’d)
Precharged logic is used => each switch is a single NMOS transistor
Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0 giving the zero filling required by the shift semantics
For rotate right, the right shift diagonal is enabled + complementary shift left diagonal (e. g., ‘right 1’ + ‘left 3’)
Arithmetic shift right:use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
31
The 2-bit multiplication algorithm, Nth cycle
Carry - i n Mul t i p l i er Shi f t ALU Carry -o ut0 x 0 LSL #2N A + 0 0
x 1 LSL #2N A + B 0x 2 LSL #(2N + 1) A – B 1x 3 LSL #2N A – B 1
1 x 0 LSL #2N A + B 0x 1 LSL #(2N + 1) A + B 0x 2 LSL #2N A – B 1x 3 LSL #2N A + 0 1
32
Carry-propagate (a) and carry-save (b) adder structures
+A B Cin
Cout S(a) +
A B Cin
Cout S+
A B Cin
Cout S+
A B Cin
Cout S
+A B Cin
Cout S(b) +
A B Cin
Cout S+
A B Cin
Cout S+
A B Cin
Cout S
33
ARM high-speed multiplier organization
Rs >> 8 bits/cycle
carry-save adders
partial sum
partial carry
initialization for MLAregisters
Rm
ALU (add partials)
rotate sum andcarry 8 bits/cycle
34
ARM2 register cell circuit
A busB bus
ALU bus
writeread
Bread
A
35
ARM register bank floorplan
A bus read decoders
B bus read decoders
write decoders
register cellsPC
Vdd
Vss
ALUbus
PCbus
INCbus
ALUbus
A bus
B bus
36
ARM core datapath buses
address register
incrementer
register bank
multiplier
ALU
shifter
data in
instruction pipe
data out
A B
W
instruction
Din
shift out
PC
Ad
inc
37
ARM control logic structure
decodePLA
cyclecount
multiplycontrol
load/storemultiple
addresscontrol
registercontrol
ALUcontrol
shiftercontrol
instruction
coprocessor