CS 152 Computer Architecture and Engineering – Lecture 27: Mid-Term II Review
TRANSCRIPT
2005-5-3 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 27 – Mid-Term II Review
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
CS 152: What’s left ...
Today: Mid-term Review, HKN ...
Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda.
Tuesday 5/10: Final presentations.
This time, more of an overview style ...
No class on Thursday.
Deadline to bring up grading issues: Tues 5/10 @ 5 PM. Contact John at lazzaro@eecs
Peer Review: For final project. Please send by Friday at 5 PM.
No electronic devices, no notes ...
2005-3-31 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 19 – Error Correcting Codes
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Understand how Hamming Codes work. Cosmic ray hit D₁. But how do we know that?

Bit layout, positions 7 down to 1: D₃ D₂ D₁ P₂ D₀ P₁ P₀

We write:        0 1 1 0 0 1 1
Later, we read:  0 1 0 0 0 1 1

On readout we compute:
P₀ xor D₃ xor D₁ xor D₀ = 1 xor 0 xor 0 xor 0 = 1
P₁ xor D₃ xor D₂ xor D₀ = 1 xor 0 xor 1 xor 0 = 0
P₂ xor D₃ xor D₂ xor D₁ = 0 xor 0 xor 1 xor 0 = 1

P₂P₁P₀ = b101 = 5. What does “5” mean? The position of the flipped bit (position 5 is D₁)! To repair, just flip it back ...

Note: we number the least significant bit with 1, not 0! 0 is reserved for “no errors”.
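To make the readout computation concrete, here is a minimal C sketch (not from the lecture; the codeword value and function names are mine) of Hamming(7,4) syndrome decoding with the slide's bit layout:

    /* Positions 7..1 = D3 D2 D1 P2 D0 P1 P0; bit i of 'word' holds position i+1. */
    #include <stdio.h>

    static int hamming_syndrome(unsigned word)
    {
        int syndrome = 0;
        for (int k = 0; k < 3; k++) {          /* three parity checks */
            int parity = 0;
            for (int pos = 1; pos <= 7; pos++)
                if (pos & (1 << k))            /* check k covers position pos */
                    parity ^= (word >> (pos - 1)) & 1;
            syndrome |= parity << k;
        }
        return syndrome;   /* 0 = no error, else the position of the flipped bit */
    }

    int main(void)
    {
        unsigned written = 0x33;                 /* D3..P0 = 0 1 1 0 0 1 1 */
        unsigned readout = written ^ (1u << 4);  /* cosmic ray flips D1 (position 5) */
        int pos = hamming_syndrome(readout);
        if (pos) readout ^= 1u << (pos - 1);     /* repair: just flip it back */
        printf("syndrome = %d, repaired = %s\n", pos,
               readout == written ? "yes" : "no");
        return 0;
    }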
Understand Parity Code math (“parity codes”). Simple case: two 1 KB blocks of data (A and B).
Create a third block, C:
C = A xor B (do xor on each bit of the block)
Read all three blocks. If A or B is not available but C is, regenerate A or B:
A = C xor B    B = C xor A
The math is easy: the trick is system design! Examples: RAID, voice-over-IP parity FEC.
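A minimal C sketch of the math above, assuming the slide's two 1 KB blocks (the function names are mine):

    #include <stddef.h>

    enum { BLOCK = 1024 };

    /* C = A xor B (do xor on each byte of the block). */
    void make_parity(const unsigned char *a, const unsigned char *b,
                     unsigned char *c)
    {
        for (size_t i = 0; i < BLOCK; i++)
            c[i] = a[i] ^ b[i];
    }

    /* Regenerate a lost block: A = C xor B (or, symmetrically, B = C xor A). */
    void regenerate(const unsigned char *c, const unsigned char *survivor,
                    unsigned char *lost)
    {
        for (size_t i = 0; i < BLOCK; i++)
            lost[i] = c[i] ^ survivor[i];
    }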
Understand Parity Code system design. The disk will tell you “this block does not exist” or “the disk is dead” by returning an error code when you do a read.
Often, applications number packets as they send them, by adding a “sequence number” to the packet header. Receivers detect a “break” in the number sequence ...
If we know this will happen in advance, what can we do, at the OS or application level?
Understand the checksum “big picture” ... Can checksums detect every possible error?
Answer: No -- for a 16-bit checksum, there are many possible packets that have the same checksum. If you are unlucky enough to have your transmission errors convert a block into another block with the same checksum value, you will not detect the error!
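A minimal C sketch of one such unlucky case, assuming an Internet-style 16-bit one's-complement checksum (an assumption; the slide does not name a specific checksum): swapping two words leaves the sum unchanged, so that error goes undetected.

    #include <stdio.h>
    #include <stdint.h>

    static uint16_t checksum16(const uint16_t *w, int n)
    {
        uint32_t sum = 0;
        for (int i = 0; i < n; i++) {
            sum += w[i];
            sum = (sum & 0xffff) + (sum >> 16);   /* end-around carry */
        }
        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint16_t a[4] = { 0x1234, 0x5678, 0x9abc, 0xdef0 };
        uint16_t b[4] = { 0x5678, 0x1234, 0x9abc, 0xdef0 };  /* two words swapped */
        printf("%04x %04x\n", checksum16(a, 4), checksum16(b, 4));  /* equal! */
        return 0;
    }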
2005-4-5 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 20 – Advanced Processors I
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Understand superpipeline performance:

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
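Worked example (numbers assumed for illustration, not from the lecture): a program that executes 10^9 instructions at 1.2 cycles per instruction on a 500 MHz clock (2 ns per cycle) takes 10^9 x 1.2 x 2 ns = 2.4 seconds. A superpipeline raises the clock rate but can also raise cycles per instruction, so it is the product that decides whether it wins.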
[Excerpt: IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001:]
[Fig. 1. Process SEM cross section.]
The process threshold voltage was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
threshold-voltage versus poly-length dependence, and source-to-body bias is used
to electrically limit transistor leakage in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance in a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two
instructions are fetched in parallel while executing thumb in-
structions. A simplified diagram of the processor pipeline is
shown in Fig. 2, where the state boundaries are indicated by
gray. [Fig. 2. Microprocessor pipeline organization.] Features
that allow the microarchitecture to achieve high speed are as
follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
Decoupled Instruction Fetch. A two-instruction-deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the
energy consumed to complete a given task. Conventional wisdom
has been that shorter pipelines are more efficient due to re...
Q. Could adding pipeline stages reduce CPI for an application? (ARM XScale: 8 stages)
A. Yes, due to these problems:

CPI problem: Taken branches cause longer stalls.
  Possible solution: branch prediction, loop unrolling.
CPI problem: Cache misses take more clock cycles.
  Possible solution: larger caches, add prefetch opcodes to the ISA.
Understand branch prediction in-depth

[Figure: 5-stage pipeline (IF fetch, ID decode, EX ALU, MEM, WB) with instructions I1-I6 flowing through at times t1-t8.]

The EX stage computes if the branch is taken. If we predicted incorrectly, the instructions fetched behind the branch MUST NOT complete!
We update the PC based on the outputs of the branch predictor. If it is perfect, the pipe stays full! Dynamic predictors: a cache of branch history.
Given the PC of a fetched instruction (from the I-cache), the branch predictor makes predictions: Is it a control instruction? Taken or not taken? The PC a branch “targets”.
“In-Depth” means down to this level ...

[Figure: two flip-flops (D Q, D Q). One bit holds the prediction for the next branch (1 = take, 0 = not take); the other records whether the last prediction was correct (1 = yes, 0 = no).]

We do not change the prediction the first time it is incorrect. Why?

      ADDI R4, R0, 11
loop: SUBI R4, R4, 1
      BNE  R4, R0, loop

This branch is taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict “take” the first time through.
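A minimal C sketch, not from the slides, of the two-flip-flop predictor just described: flip the prediction only after two incorrect predictions in a row.

    #include <stdbool.h>

    struct predictor {
        bool predict_take;   /* prediction for next branch (1 = take) */
        bool last_correct;   /* was the last prediction correct? */
    };

    /* Returns what was predicted, then updates state with the real outcome. */
    bool predict_and_update(struct predictor *p, bool taken)
    {
        bool prediction = p->predict_take;
        bool correct = (prediction == taken);
        if (!correct && !p->last_correct)
            p->predict_take = taken;   /* two misses in a row: change our mind */
        p->last_correct = correct;
        return prediction;
    }

On the loop above, the single not-taken misprediction at loop exit does not flip the prediction, so the next trip through the loop starts out correctly predicted “take”.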
!"#$%&
!"#$%
&'"()*+,-*.,,/
012.3-*4++556
789($:;9*<9:$*=)'"'($%":#$:(#
>8#?
>8#?
.*(?( .*(?(
+(?( +(?( +(?(
!"##$
%&%'#&(')
%*+,&*##$
%&%'#&(')
789($:;9*89:$#*)'@%*:9$%"9'A*B:B%A:9%*"%C:#$%"#
!"" ;B%"'9D#*'"%*A'$()%D*E)%9*'9*:9#$"8($:;9*
%9$%"#*'*F89($:;9*89:$*
!"" :9B8$#*$;*'*F89($:;9*89:$*G%1C1-*"%C:#$%"*F:A%H
('9*()'9C%*D8":9C*'*A;9C*A'$%9(?*;B%"'$:;9
'((%B$
'((%B$
!"#$%
&'"()*+,-*.,,/
012.3-*4++550
&8A$:BA%*789($:;9*<9:$#
I7 IJ KL
M4< &%N
7'DD
7N8A
7D:@
I##8%
OPQR#
7PQR#
Example: Superscalar MIPS. Fetches 2instructions at a time. If first integer and
second floating point, issue in same cycle
Understand lockstep superscalar concept
Integer instruction FP instruction
LD F0,0(R1)LD F6,-8(R1)LD F10,-16(R1) ADDD F4,F0,F2LD F14,-24(R1) ADDD F8,F6,F2LD F18,-32(R1) ADDD F12,F10,F2SD 0(R1),F4 ADDD F16,F14,F2SD -8(R1),F8 ADDD F20,F18,F2SD -16(R1),F12SD -24(R1),F16
Two issuesper cycle
One issueper cycle
UC Regents Spring 2005 © UCBCS 152 L26: Synchronization
2005-4-7 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 21 – Advanced Processors II
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Out-of-order CPU design will NOT appear on the exam: the homework was sufficient.
Understand Precise Interrupts ...

[6.823 lecture slides (Krste Asanović), March 10, 2005]

Effectiveness? Renaming and out-of-order execution was first implemented in 1969 in the IBM 360/91 but did not show up in the subsequent models until the mid-nineties. Why?
1. Exceptions not precise!
2. Effective on a very small class of programs.
One more problem needed to be solved.

Precise Interrupts
Definition: It must appear as if an interrupt (or exception) is taken between two instructions (say Ii and Ii+1):
- the effect of all instructions up to and including Ii is totally complete
- no effect of any instruction after Ii has taken place
The interrupt handler either aborts the program or restarts it at Ii+1.
Follows from the “contract” between the architect and the programmer ...

... in the context of Static Pipelines

Effect on Interrupts: Out-of-order Completion
I1  DIVD  f6, f6, f4
I2  LD    f2, 45(r3)
I3  MULTD f0, f2, f4
I4  DIVD  f8, f6, f2
I5  SUBD  f10, f0, f6
I6  ADDD  f6, f8, f2
Out-of-order completion: 1 2 2 3 1 4 3 5 5 4 6 6
Consider interrupts: restore f2? restore f10?
Precise interrupts are difficult to implement at high speed: we want to start execution of later instructions before exception checks are finished on earlier instructions.

Exception Handling in a 5-Stage Pipeline
- Hold exception flags in the pipeline until the commit point (M stage)
- Exceptions in earlier pipe stages override later exceptions
- Inject external (asynchronous) interrupts at the commit point (override others)
- If an exception reaches commit: update the Cause and EPC registers, kill all stages, inject the handler PC into the fetch stage
[Figure: 5-stage pipeline (PC, Inst. Mem, Decode, Data Mem, writeback) carrying per-stage exception and PC registers (ExcD/PCD, ExcE/PCE, ExcM/PCM); kill signals for the F, D, and E stages and for writeback; exception sources such as Illegal Opcode, Overflow, Data Addr Except, and PC Address Exceptions; Cause and EPC registers and Select Handler PC logic at the commit point.]

Key observation: architected state only changes in the memory and register write stages.
2005-4-12 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 22 – Advanced Processors III
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
[6.823 lecture slides (Krste Asanović), November 10, 2004]

Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe (F D X M W), cycles t0-t9:
T1: LW   r1, 0(r2)
T2: ADD  r7, r1, r4
T3: XORI r5, r4, #12
T4: SW   0(r7), r5
T1: LW   r5, 12(r1)

The last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile.

Simple Multithreaded Pipeline
[Figure: pipeline with four per-thread PCs and four per-thread register files (GPR); a 2-bit thread select chooses which PC fetches from the I$ and which register file each instruction reads/writes; X/Y operand paths and D$.]
Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage.

Understand static pipeline multithreading: 4 "CPUs", each running at 1/4 clock.
Many variants ...
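A minimal C sketch, with assumed names, of the fixed round-robin interleaving described above: each cycle fetches from the next thread's PC, so with 4 threads on a 5-stage pipe a thread's prior instruction has written back before its next one reads the regfile.

    enum { NTHREADS = 4 };

    static unsigned pc[NTHREADS];      /* one architectural PC per thread */

    unsigned fetch_next(unsigned cycle)
    {
        unsigned t = cycle % NTHREADS; /* thread select, carried down the pipe */
        unsigned fetch_pc = pc[t];
        pc[t] += 4;                    /* next sequential instruction */
        return fetch_pc;
    }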
Understand pros/cons of shared L2 ...
[Excerpt: IEEE Micro (Hot Chips 15) article on the IBM Power5:]

... supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview
Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance [5]. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm². The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core
We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram. In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

[Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).]

(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.
Also see the Lecture 27 slides on these related topics!
Understand Niagara design choices
8 cores: single-issue, 6-stage pipeline, 4-way multi-threaded, fast crypto support.
Shared resources: 3 MB on-chip cache; 4 DDR2 interfaces (32 GB DRAM, 20 Gb/s); 1 shared FP unit; GB Ethernet ports.
Die size: 340 mm² in 90 nm. Power: 50-60 W.
Sources: Hot Chips, via EE Times, Infoworld. J. Schwartz weblog (Sun COO).
Understand Cell design choices ...
256 KB Local Store -- 128 128-bit registers. The SPU issues 2 inst/cycle (in order) to 7 execution units. The SPU fills the Local Store using DMA to DRAM and network.
Programmers manage caching explicitly.
2005-4-14 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 23 – Buses, Disks, and RAID
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Understand the “bus master” concept ...

[Figure 2-1. Simplified block diagram of the Xserve G5 main logic board: two 64-bit PowerPC G5 processor modules on processor interface buses running at half the processor speed; a U3H memory controller and PCI bus bridge with a 400 MHz ECC DDR memory bus to the DIMM slots; 16-bit 4.8 GBps and 8-bit 1.6 GBps HyperTransport links; a PCI-X bridge with PCI-X slots (Bus A, Bus B) and a 33 MHz PCI bus for PCI, PCI-X, or graphics support; a K2 I/O device and disk controller with 1.5 Gbps Serial ATA buses to the internal hard drive connectors, an ATA/100 bus to the internal optical drive connector, FireWire 400 (front) and FireWire 800 (rear) ports via a FireWire PHY, two USB 2.0 ports (480 Mbps) via a PCI USB controller, two 10/100/1000 Ethernet ports via an Ethernet controller, a serial port, boot ROM, PMU99 power controller, and system activity lights.]

The Xserve G5 has the following separate buses:
- Processor bus: running at half the speed of the processor, 64-bit data throughput per processor, connecting the processor module to the U3H IC. Dual-processor systems have two independent, 64-bit processor buses, each running at half the speed of the processors.
- Memory bus: 400 MHz, 128-bit bus connecting the main ECC DDR SDRAM memory to the U3H IC.
- PCI-X bridge bus: supports two 64-bit PCI-X slots.

The Apple Xserve G5 has 8 DIMM slots, to support 8 GB.
The memory controller is the only “bus master”: it can start transactions on the bus, but the DIMMs cannot.
DIMMs respond to transaction requests. Since the memory controller is the only bus master, and there are a small number of DIMM slots, bus sharing is easy: use dedicated wires to each slot.
Understand “bus vs switch” issues ...
[Same Xserve G5 block diagram and bus list as on the previous slide.]
+++ Low cost. One set of wires from memory controller can support up to 8 DIMMs.
--- Latency of bus increases with length of wires (needed to reach all 8 DIMM sockets), and the loading of 8 DIMMs. Must design for worst-case (8 DIMMs), even if only 1 DIMM is present.
--- Shared wires limit maximum bandwidth from memory. If memory controller had 8 sets of dedicated wires, one per DIMM, memory bandwidth would be much better (but more expensive).
Understand serial bus pros and cons. Serial: data is sent “bit by bit” over one logical wire. Examples: USB, FireWire, Ethernet.

Serial pros and cons:
+++ Low cost: a small number of wires costs less. Also, cheap wires and connectors can be used, since skew is less of a problem.
+++ Sending data over many wires introduces “skew”: signals travel on each wire at a slightly different speed, and skew limits the speed and length of a bus. Serial buses have fewer skew issues, because they only use one logical wire.
--- When only using one wire, there is a bandwidth limit. Thus, DIMMs use many wires (a “parallel” bus, not “serial”).
Understand disk block organization ...
Each ring is a “track”. A track is divided into “sectors”. A sector codes a fixed # of bytes (ex: 4K blocks). Outer tracks hold more sectors. (Many more tracks and sectors than shown!)
2005 desktop rotation speed: 7200 RPM.
Understand the Disk Latency Equation. Latency of a disk block read =
  Queueing Time      (zero if no other accesses pending)
+ Controller Time    (usually short)
+ Seek Time          (2005: about 8 ms)
+ Rotation Time      (1/2 full rotation time: 4.2 ms @ 7200 RPM)
+ Transfer Time      (1 ms @ 7200 RPM)
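A back-of-the-envelope C sketch of the equation above; the constants are the slide's 2005-era examples plus an assumed controller time.

    #include <stdio.h>

    int main(void)
    {
        double rpm = 7200.0;
        double full_rotation_ms = 60000.0 / rpm;       /* ~8.3 ms per revolution */
        double queueing_ms   = 0.0;  /* no other accesses pending */
        double controller_ms = 0.1;  /* "usually short" (assumed value) */
        double seek_ms       = 8.0;  /* 2005: about 8 ms */
        double rotation_ms   = full_rotation_ms / 2.0; /* ~4.2 ms, half rotation */
        double transfer_ms   = 1.0;  /* 1 ms @ 7200 RPM */

        printf("disk block read ~ %.1f ms\n", queueing_ms + controller_ms +
               seek_ms + rotation_ms + transfer_ms);
        return 0;
    }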
Understand how to reason about RAID.
RAID 5: Interleaved Parity Disks. A 5-disk recovery group: D0, D1, D2, D3, Parity.
Logical blocks Bn on the array (parity blocks rotate across the disks):
  B0  B1  B2   B3  P0
  B4  B5  B6   B7  P1
  B8  B9  B10  P2  B11
  ...
+++ Writes of parity blocks are distributed across the 5 disks.
See COD/3e, pages 574-580, for RAID details. You will be responsible for the book's level of detail on Mid-term II.
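A minimal C sketch of RAID 5 recovery within one stripe, assuming a 5-disk group and 512-byte blocks (sizes assumed): the parity block is the xor of the data blocks, so any one lost block is the xor of the four survivors.

    enum { NDISKS = 5, SECTOR = 512 };

    void recover_block(unsigned char stripe[NDISKS][SECTOR], int lost)
    {
        for (int i = 0; i < SECTOR; i++) {
            unsigned char x = 0;
            for (int d = 0; d < NDISKS; d++)
                if (d != lost)
                    x ^= stripe[d][i];   /* xor of all surviving blocks */
            stripe[lost][i] = x;         /* regenerated lost block */
        }
    }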
2005-4-19 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 24 – Networks
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Know this material well ...
Understand bottom-up networking ...
For this “hop”, the IP packet is sent “inside” of a wireless 802.11b WiFi packet.
For the next “hop”, the IP packet is sent “inside” of a cable modem DOCSIS packet.
ISO layer names: the IP packet is “Layer 3”; WiFi and cable modem packets are “Layer 2”; radio/cable waveforms are “Layer 1”.
email WWW phone...
SMTP HTTP RTP...
TCP UDP…
IP
Ethernet Wi-Fi…
CSMA async sonet...
copper fiber radio...
Diagram Credit: Steve Deering
Protocol Complexity
Understand the IP abstraction ...
Internet Protocol (IP): an abstraction for applications to target, and for link networks to support. Very simple, very successful.
The link layer is not expected to be perfect. IP presents link network errors/losses in an abstract way (not a link-specific way).
Understand how IP numbers work
198.211.61.22 ??? A user-friendly form of the 32-bit unsigned value 3335732502, which is: 198*2^24 + 211*2^16 + 61*2^8 + 22.
IP4 number for this computer: 198.211.61.22. Every directly connected host has a unique IP number.
Upper limit of 2^32 IP4 numbers (some are reserved for other purposes).
Next-generation IP (IP6) limit: 2^128.
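A minimal C sketch of the dotted-quad arithmetic above; in practice a library call such as inet_pton does this, the point here is just the math.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t ip = (198u << 24) | (211u << 16) | (61u << 8) | 22u;
        printf("198.211.61.22 = %u\n", (unsigned)ip);  /* prints 3335732502 */
        return 0;
    }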
Understand the IP header fields ...

Bitfield numbers:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Source Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Destination Address                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   Payload data (size implied by Total Length header field)    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The first five words are the header; the rest is data.
Version field: IP4, IP6, etc. Protocol field: how the destination should interpret the payload data.
IHL field: # of words in the header. The typical header (IHL = 5 words) is shown. Longer headers add extra fields after the destination address.
Destination Address: the “To:” IP number. Source Address: the “From:” IP number. Note: could be a lie ...
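A minimal C sketch, with assumed struct and function names, of pulling the fields above out of a received IPv4 packet; buf points at the first header byte.

    #include <stdint.h>

    struct ip_fields {
        uint8_t  version, ihl, ttl, protocol;
        uint16_t total_length;
        uint32_t src, dst;
    };

    struct ip_fields parse_ip(const uint8_t *buf)
    {
        struct ip_fields f;
        f.version      = buf[0] >> 4;            /* 4 for IP4 */
        f.ihl          = buf[0] & 0xf;           /* header words; 5 is typical */
        f.total_length = (uint16_t)((buf[2] << 8) | buf[3]);
        f.ttl          = buf[8];
        f.protocol     = buf[9];                 /* how to interpret the payload */
        f.src = (uint32_t)buf[12] << 24 | (uint32_t)buf[13] << 16
              | (uint32_t)buf[14] << 8  | buf[15];
        f.dst = (uint32_t)buf[16] << 24 | (uint32_t)buf[17] << 16
              | (uint32_t)buf[18] << 8  | buf[19];
        return f;
    }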
2005-4-21 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 25 – Routers
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Understand the MGR Router ...
[Excerpt: IEEE/ACM Transactions on Networking, vol. 6, no. 3, June 1998:]
A 50-Gb/s IP Router
Craig Partridge, Senior Member, IEEE, Philip P. Carvey, Member, IEEE, Ed Burgess, Isidro Castineyra, Tom Clarke, Lise Graham, Michael Hathaway, Phil Herman, Allen King, Steve Kohalmi, Tracy Ma, John Mcallen, Trevor Mendez, Walter C. Milliken, Member, IEEE, Ronald Pettyjohn, Member, IEEE, John Rokosz, Member, IEEE, Joshua Seeger, Michael Sollins, Steve Storch, Benjamin Tober, Gregory D. Troxel, David Waitzman, and Scott Winterble
Abstract—Aggressive research on gigabit-per-second networks has led to dramatic improvements in network transmission speeds. One result of these improvements has been to put pressure on router technology to keep pace. This paper describes a router, nearly completed, which is more than fast enough to keep up with the latest transmission technologies. The router has a backplane speed of 50 Gb/s and can forward tens of millions of packets per second.
Index Terms—Data communications, internetworking, packetswitching, routing.
I. INTRODUCTION
TRANSMISSION link bandwidths keep improving, at
a seemingly inexorable rate, as the result of research
in transmission technology [26]. Simultaneously, expanding
network usage is creating an ever-increasing demand that can
only be served by these higher bandwidth links. (In 1996
and 1997, Internet service providers generally reported that
the number of customers was at least doubling annually and
that per-customer bandwidth usage was also growing, in some
cases by 15% per month.)
Unfortunately, transmission links alone do not make a
network. To achieve an overall improvement in networking
performance, other components such as host adapters, operat-
ing systems, switches, multiplexors, and routers also need to
get faster. Routers have often been seen as one of the lagging
technologies. The goal of the work described here is to show
that routers can keep pace with the other technologies and are
fully capable of driving the new generation of links (OC-48c
at 2.4 Gb/s).
A multigigabit router (a router capable of moving data
at several gigabits per second or faster) needs to achieve
three goals. First, it needs to have enough internal bandwidth
to move packets between its interfaces at multigigabit rates.
Second, it needs enough packet processing power to forward
several million packets per second (MPPS). A good rule
of thumb, based on the Internet’s average packet size of
approximately 1000 b, is that for every gigabit per second
of bandwidth, a router needs 1 MPPS of forwarding power.1
Third, the router needs to conform to a set of protocol
standards. For Internet protocol version 4 (IPv4), this set of
standards is summarized in the Internet router requirements
[3]. Our router achieves all three goals (but for one minor
variance from the IPv4 router requirements, discussed below).
This paper presents our multigigabit router, called the MGR,
which is nearly completed. This router achieves up to 32
MPPS forwarding rates with 50 Gb/s of full-duplex backplane
capacity.2 About a quarter of the backplane capacity is lost
to overhead traffic, so the packet rate and effective bandwidth
are balanced. Both rate and bandwidth are roughly two to ten
times faster than the high-performance routers available today.
II. OVERVIEW OF THE ROUTER ARCHITECTURE
A router is a deceptively simple piece of equipment. At
minimum, it is a collection of network interfaces, some sort of
bus or connection fabric connecting those interfaces, and some
software or logic that determines how to route packets among
those interfaces. Within that simple description, however, lies a
number of complexities. (As an illustration of the complexities,
consider the fact that the Internet Engineering Task Force’s
Requirements for IP Version 4 Routers [3] is 175 pages long
and cites over 100 related references and standards.) In this
section we present an overview of the MGR design and point
out its major and minor innovations. After this section, the rest
of the paper discusses the details of each module.
Footnote 1: See [25]. Some experts argue for more or less packet processing power. Those arguing for more power note that a TCP/IP datagram containing an ACK but no data is 320 b long. Link-layer headers typically increase this to approximately 400 b. So if a router were to handle only minimum-sized packets, a gigabit would represent 2.5 million packets. On the other side, network operators have noted a recent shift in the average packet size to nearly 2000 b. If this change is not a fluke, then a gigabit would represent only 0.5 million packets.
Footnote 2: Recently some companies have taken to summing switch bandwidth in and out of the switch; in that case this router is a 100-Gb/s router.
The “MGR” Router was a research project in the late 1990s. It kept up with the “line rate” of the fastest links of its day (OC-48c, 2.4 Gb/s optical).
Architectural approach is still valid today ...
At the level we presented it in class. However, it will be much easier to understand it at that level if you read the paper (on website).
Know the life of a packet in a router ...
Fig. 1. MGR outline.
A. Design Summary
A simplified outline of the MGR design is shown in Fig. 1,
which illustrates the data processing path for a stream of
packets entering from the line card on the left and exiting
from the line card on the right.
The MGR consists of multiple line cards (each supporting
one or more network interfaces) and forwarding engine cards,
all plugged into a high-speed switch. When a packet arrives
at a line card, its header is removed and passed through the
switch to a forwarding engine. (The remainder of the packet
remains on the inbound line card). The forwarding engine
reads the header to determine how to forward the packet and
then updates the header and sends the updated header and
its forwarding instructions back to the inbound line card. The
inbound line card integrates the new header with the rest of
the packet and sends the entire packet to the outbound line
card for transmission.
Not shown in Fig. 1 but an important piece of the MGR
is a control processor, called the network processor, that
provides basic management functions such as link up/down
management and generation of forwarding engine routing
tables for the router.
B. Major Innovations
There are five novel elements of this design. This section
briefly presents the innovations. More detailed discussions,
when needed, can be found in the sections following.
First, each forwarding engine has a complete set of the
routing tables. Historically, routers have kept a central master
routing table and the satellite processors each keep only a
modest cache of recently used routes. If a route was not in a
satellite processor’s cache, it would request the relevant route
from the central table. At high speeds, the central table can
easily become a bottleneck because the cost of retrieving a
route from the central table is many times (as much as 1000
times) more expensive than actually processing the packet
header. So the solution is to push the routing tables down
into each forwarding engine. Since the forwarding engines
only require a summary of the data in the route (in particular,
next hop information), their copies of the routing table, called
forwarding tables, can be very small (as little as 100 kB for
about 50k routes [6]).
Second, the design uses a switched backplane. Until very
recently, the standard router used a shared bus rather than
a switched backplane. However, to go fast, one really needs
the parallelism of a switch. Our particular switch was custom
designed to meet the needs of an Internet protocol (IP) router.
Third, the design places forwarding engines on boards
distinct from line cards. Historically, forwarding processors
have been placed on the line cards. We chose to separate them
for several reasons. One reason was expediency; we were not
sure if we had enough board real estate to fit both forwarding
engine functionality and line card functions on the target
card size. Another set of reasons involves flexibility. There
are well-known industry cases of router designers crippling
their routers by putting too weak a processor on the line
card, and effectively throttling the line card’s interfaces to
the processor’s speed. Rather than risk this mistake, we built
the fastest forwarding engine we could and allowed as many
(or few) interfaces as is appropriate to share the use of the
forwarding engine. This decision had the additional benefit of
making support for virtual private networks very simple—we
can dedicate a forwarding engine to each virtual network and
ensure that packets never cross (and risk confusion) in the
forwarding path.
Placing forwarding engines on separate cards led to a fourth
innovation. Because the forwarding engines are separate from
the line cards, they may receive packets from line cards that ...

1. A packet arrives at a line card. The line card sends the packet header to a forwarding engine for processing.
Note: We can balance the number of line cards and forwarding engines for efficiency: this is how packet routing parallelizes.
Understand the forwarding problem ...
[Same IP header diagram as above; the forwarding decision keys on the Destination Address field.]
To: IP number
Forwarding engine looks at the destination address, and decides which outbound line card will get the packet closest to its destination. How?
And how network structure affects it. Routers route to a “network”, not a “host”. /xx means the top xx bits of the 32-bit address identify a single network.
Thus, all of UCB only needs 6 routing table entries. Today, the Internet routing table has about 100,000 entries.
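A minimal C sketch of the /xx prefix matching just described; the example network in the comment is an assumption for illustration, not from the slides.

    #include <stdbool.h>
    #include <stdint.h>

    bool prefix_match(uint32_t dest, uint32_t network, int xx)
    {
        uint32_t mask = xx ? ~0u << (32 - xx) : 0;   /* keep the top xx bits */
        return (dest & mask) == (network & mask);
    }
    /* e.g., a /16 route covering 128.32.0.0 matches any 128.32.x.y destination. */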
Understand how switches work ...
[Figure: line cards and forwarding engines connected through the switch.]
A pipelined arbitration system decides how to connect up the switch. The connections for the transfer at epoch N are computed in epochs N-3, N-2, and N-1, using dedicated switch allocation wires.
2005-4-26 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 26 – Synchronization
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Understand Sequential Consistency
Sequential Consistency: as if each thread takes turns executing, and instructions in each thread execute in program order.
Sequentially consistent architectures get the right answer, but give up many optimizations.

T1 code (producer):
      ORi  R1, R0, x      ; Load x value into R1
      LW   R2, tail(R0)   ; Load queue tail into R2
      SW   R1, 0(R2)      ; (1) Store x into queue
      ADDi R2, R2, 4      ; Shift tail by one word
      SW   R2, tail(R0)   ; (2) Update tail memory addr

T2 code (consumer):
      LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; (3) Load queue tail into R4
      BEQ  R4, R3, spin   ; If queue empty, wait
      LW   R5, 0(R3)      ; (4) Read x from queue into R5
      ADDi R3, R3, 4      ; Shift head by one word
      SW   R3, head(R0)   ; Update head memory addr

Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2 ... but not 2, 3, 1, 4!
Understand memory fences

[Figure: queue in memory, growing toward higher addresses. Before: the queue holds y, with Head and Tail bracketing it. After: the queue holds y and x, with Tail advanced by one word.]

T1 code (producer):
      ORi  R1, R0, x      ; Load x value into R1
      LW   R2, tail(R0)   ; Load queue tail into R2
      SW   R1, 0(R2)      ; (1) Store x into queue
      MEMBAR
      ADDi R2, R2, 4      ; Shift tail by one word
      SW   R2, tail(R0)   ; (2) Update tail memory addr

T2 code (consumer):
      LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; (3) Load queue tail into R4
      BEQ  R4, R3, spin   ; If queue empty, wait
      MEMBAR
      LW   R5, 0(R3)      ; (4) Read x from queue into R5
      ADDi R3, R3, 4      ; Shift head by one word
      SW   R3, head(R0)   ; Update head memory addr

Ensures 1 happens before 2, and 3 happens before 4.
Understand Test and Set ...

An example atomic read-modify-write ISA instruction:

Test&Set(m, R):
    R = M[m];
    if (R == 0) then M[m] = 1;

P:    Test&Set R6, mutex(R0) ; Mutex check
      BNE  R6, R0, P         ; If not 0, spin

      ; Critical section:
      LW   R3, head(R0)      ; Load queue head into R3
spin: LW   R4, tail(R0)      ; Load queue tail into R4
      BEQ  R4, R3, spin      ; If queue empty, wait
      LW   R5, 0(R3)         ; Read x from queue into R5
      ADDi R3, R3, 4         ; Shift head by one word
      SW   R3, head(R0)      ; Update head memory addr

V:    SW   R0, mutex(R0)     ; Give up mutex

Assuming sequential consistency: 3 MEMBARs not shown ...
Note: with Test&Set(), the M[m]=1 state corresponds to the last slide's s=0 state!
What if the OS swaps a process out while in the critical section? “High-latency locks”, a source of Linux audio problems (and others).
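A minimal C sketch of the P/V spin lock above using C11 atomics; atomic_flag_test_and_set is the portable cousin of the Test&Set ISA instruction (a set flag means "locked", matching the M[m] = 1 state).

    #include <stdatomic.h>

    static atomic_flag mutex = ATOMIC_FLAG_INIT;

    void P(void)   /* acquire: spin until our test-and-set reads "free" */
    {
        while (atomic_flag_test_and_set(&mutex))
            ;      /* someone else holds it: keep spinning */
    }

    void V(void)   /* release: give up mutex */
    {
        atomic_flag_clear(&mutex);
    }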
and non-blocking synchronization ...

Another atomic read-modify-write instruction:

Compare&Swap(Rt, Rs, m):
    if (Rt == M[m]) then
        M[m] = Rs; Rs = Rt; status = success;
    else
        status = fail;

try:  LW   R3, head(R0)      ; Load queue head into R3
spin: LW   R4, tail(R0)      ; Load queue tail into R4
      BEQ  R4, R3, spin      ; If queue empty, wait
      LW   R5, 0(R3)         ; Read x from queue into R5
      ADDi R6, R3, 4         ; Shift head by one word
      Compare&Swap R3, R6, head(R0) ; Try to update head
      BNE  R3, R6, try       ; If not success, try again

If R3 != R6, another thread got here first, so we must try again.
If a thread swaps out before the Compare&Swap, there is no latency problem; this code only “holds” the lock for one instruction!
Assuming sequential consistency: MEMBARs not shown ...
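A minimal C sketch of the non-blocking dequeue above, using C11 atomic_compare_exchange_strong in place of the Compare&Swap instruction; the queue layout (head and tail held as byte offsets into a word array) is an assumption for illustration.

    #include <stdatomic.h>
    #include <stdint.h>

    int dequeue(_Atomic uint32_t *head, _Atomic uint32_t *tail,
                const int *queue_base)
    {
        uint32_t old_head, new_head;
        int x;
        do {
            do {                                     /* if queue empty, wait */
                old_head = atomic_load(head);
            } while (old_head == atomic_load(tail));
            x = queue_base[old_head / 4];            /* read x from queue */
            new_head = old_head + 4;                 /* shift head by one word */
            /* update head only if no other thread got here first */
        } while (!atomic_compare_exchange_strong(head, &old_head, new_head));
        return x;
    }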
2005-4-28 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 27 – Multiprocessors
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Understand the coherency problem ...

Write-through caches; shared main memory initially holds value 5 at address 16.
1. CPU0: LW R2, 16(R0). CPU0's cache now holds (addr 16, value 5).
2. CPU1: LW R2, 16(R0). CPU1's cache now holds (addr 16, value 5).
3. CPU1: SW R0, 16(R0). CPU1 writes 0: main memory and CPU1's cache now hold (16, 0), but CPU0's cache still holds (16, 5).

View of memory no longer “coherent”. Loads of location 16 from CPU0 and CPU1 see different values!
Today: What to do ...
and cache placement impact ...
[Figure: CPU0 and CPU1 connect through a memory switch to a shared multi-bank cache, backed by shared main memory.]
For modern clock rates, access to a shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing down the clock 10X for LWs. Not good.
This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched.
Using different architectures ...
[Figure: CPU0 and CPU1, each with private L1 caches, connect through a memory switch or bus to a shared multi-bank L2 cache, backed by shared main memory.]
Advantages of shared L2 over private L2s: processors communicate at cache speed, not DRAM speed; constructive interference, if both CPUs need the same data/instr.
Disadvantage: CPUs share BW to the L2 cache ...
Thus, we need to solve the cache coherency problem for the L1 caches.
Understand the write-thru solution ...
[Figure: CPU0 and CPU1, each with a cache and a cache snooper, attached to a shared memory bus and a shared main memory hierarchy.]
For write-thru caches:
1. The writing CPU takes control of the bus.
2. The address to be written is invalidated in all other caches. Reads will no longer hit in the cache and get stale data.
3. The write is sent to main memory. Reads will cache miss, and retrieve the new value from main memory.
To a first order, reads will “just work” if write-thru caches implement this policy.
A “two-state” protocol (cache lines are “valid” or “invalid”).
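A minimal C sketch of the two-state write-thru snoop above, with an assumed direct-mapped cache model: on a bus write, every other cache invalidates a matching line, and memory is always updated.

    #include <stdbool.h>

    enum { NCPUS = 2, NLINES = 64, MEMWORDS = 1 << 16 };

    struct cache { bool valid[NLINES]; int tag[NLINES]; int data[NLINES]; };
    static struct cache caches[NCPUS];
    static int main_memory[MEMWORDS];   /* addr assumed < MEMWORDS */

    void bus_write(int writer, int addr, int value)
    {
        int line = addr % NLINES;
        for (int c = 0; c < NCPUS; c++)    /* snoopers watch the bus */
            if (c != writer && caches[c].valid[line]
                            && caches[c].tag[line] == addr)
                caches[c].valid[line] = false;   /* invalidate stale copy */
        main_memory[addr] = value;         /* write-through to main memory */
    }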
Understand the NUMA concept ...
[Figure: CPU 0 ... CPU 1023, each with a cache and part of main memory (DRAM) attached to it, joined by an interconnection network.]
Each CPU has part of main memory attached to it. To access other parts of main memory, use the interconnection network.
For best results, applications take the non-uniform memory latency into account. Good for applications that match the machine model ...
And the cluster concept ... In some applications, each machine can handle a net query by itself.
Example: serving static web pages. Each machine has a copy of the website.
but I intentionally ignore them here because they are well studied elsewhere and because the issues in this article are largely orthogonal to the use of databases.

Advantages
The basic model that giant-scale services follow provides some fundamental advantages:
- Access anywhere, anytime. A ubiquitous infrastructure facilitates access from home, work, airport, and so on.
- Availability via multiple devices. Because the infrastructure handles most of the processing, users can access services with devices such as set-top boxes, network computers, and smart phones, which can offer far more functionality for a given cost and battery life.
- Groupware support. Centralizing data from many users allows service providers to offer group-based applications such as calendars, teleconferencing systems, and group-management systems such as Evite (http://www.evite.com/).
- Lower overall cost. Although hard to measure, infrastructure services have a fundamental cost advantage over designs based on stand-alone devices. Infrastructure resources can be multiplexed across active users, whereas end-user devices serve at most one user (active or not). Moreover, end-user devices have very low utilization (less than 4 percent), while infrastructure resources often reach 80 percent utilization. Thus, moving anything from the device to the infrastructure effectively improves efficiency by a factor of 20. Centralizing the administrative burden and simplifying end devices also reduce overall cost, but are harder to quantify.
- Simplified service updates. Perhaps the most powerful long-term advantage is the ability to upgrade existing services or offer new services without the physical distribution required by traditional applications and devices. Devices such as Web TVs last longer and gain usefulness over time as they benefit automatically from every new Web-based service.

Components
Figure 1 shows the basic model for giant-scale sites. The model is based on several assumptions. First, I assume the service provider has limited control over the clients and the IP network. Greater control might be possible in some cases, however, such as with intranets. The model also assumes that queries drive the service. This is true for most common protocols including HTTP, FTP, and variations of RPC. For example, HTTP's basic primitive, the “get” command, is by definition a query. My third assumption is that read-only queries greatly outnumber updates (queries that affect the persistent data store). Even sites that we tend to think of as highly transactional, such as e-commerce or financial sites, actually have this type of “read-mostly” traffic [1]: Product evaluations (reads) greatly outnumber purchases (updates), for example, and stock quotes (reads) greatly outnumber stock trades (updates). Finally, as the sidebar, “Clusters in Giant-Scale Services” (next page) explains, all giant-scale sites use clusters.

The basic model includes six components:
- Clients, such as Web browsers, standalone e-mail readers, or even programs that use XML and SOAP, initiate the queries to the services.
- The best-effort IP network, whether the public Internet or a private network such as an intranet, provides access to the service.
- The load manager provides a level of indirection between the service's external name and the servers' physical names (IP addresses) to preserve the external name's availability in the presence of server faults. The load manager balances load among active servers. Traffic might flow through proxies or firewalls before the load manager.
- Servers are the system's workers, combining CPU, memory, and disks into an easy-to-replicate unit.

[IEEE Internet Computing, http://computer.org/internet/, July-August 2001, “Giant-Scale Services”]
Figure 1. The basic model for giant-scale services. Clients connect via the Internet and then go through a load manager that hides down nodes and balances traffic. [Figure: clients, IP network, load manager, single-site server group with optional backplane, persistent data store.]

The load manager is a special-purpose computer that assigns incoming HTTP connections to a particular machine. Image from Eric Brewer's IEEE Internet Computing article.
Good luck on the mid-term!
Today: Mid-term Review, HKN ...
Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda.
Tuesday 5/10: Final presentations.
This time, more of an overview style ...
No class on Thursday.
Deadline to bring up grading issues: Tues 5/10 @ 5 PM. Contact John at lazzaro@eecs
Peer Review: For final project. Please send by Friday at 5 PM.
No electronic devices, no notes ...