TRANSCRIPT
UC Regents Spring 2005 © UCB, CS 152 L26: Synchronization
2005-4-26 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 26 – Synchronization
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Last Time: How Routers Work
[Excerpt from IEEE/ACM Transactions on Networking, vol. 6, no. 3, June 1998:]
Fig. 1. MGR outline.
A. Design Summary
A simplified outline of the MGR design is shown in Fig. 1,
which illustrates the data processing path for a stream of
packets entering from the line card on the left and exiting
from the line card on the right.
The MGR consists of multiple line cards (each supporting
one or more network interfaces) and forwarding engine cards,
all plugged into a high-speed switch. When a packet arrives
at a line card, its header is removed and passed through the
switch to a forwarding engine. (The remainder of the packet
remains on the inbound line card). The forwarding engine
reads the header to determine how to forward the packet and
then updates the header and sends the updated header and
its forwarding instructions back to the inbound line card. The
inbound line card integrates the new header with the rest of
the packet and sends the entire packet to the outbound line
card for transmission.
Not shown in Fig. 1 but an important piece of the MGR
is a control processor, called the network processor, that
provides basic management functions such as link up/down
management and generation of forwarding engine routing
tables for the router.
B. Major Innovations
There are five novel elements of this design. This section
briefly presents the innovations. More detailed discussions,
when needed, can be found in the sections following.
First, each forwarding engine has a complete set of the
routing tables. Historically, routers have kept a central master
routing table and the satellite processors each keep only a
modest cache of recently used routes. If a route was not in a
satellite processor’s cache, it would request the relevant route
from the central table. At high speeds, the central table can
easily become a bottleneck because the cost of retrieving a
route from the central table is many times (as much as 1000
times) more expensive than actually processing the packet
header. So the solution is to push the routing tables down
into each forwarding engine. Since the forwarding engines
only require a summary of the data in the route (in particular,
next hop information), their copies of the routing table, called
forwarding tables, can be very small (as little as 100 kB for
about 50k routes [6]).
Second, the design uses a switched backplane. Until very
recently, the standard router used a shared bus rather than
a switched backplane. However, to go fast, one really needs
the parallelism of a switch. Our particular switch was custom
designed to meet the needs of an Internet protocol (IP) router.
Third, the design places forwarding engines on boards
distinct from line cards. Historically, forwarding processors
have been placed on the line cards. We chose to separate them
for several reasons. One reason was expediency; we were not
sure if we had enough board real estate to fit both forwarding
engine functionality and line card functions on the target
card size. Another set of reasons involves flexibility. There
are well-known industry cases of router designers crippling
their routers by putting too weak a processor on the line
card, and effectively throttling the line card’s interfaces to
the processor’s speed. Rather than risk this mistake, we built
the fastest forwarding engine we could and allowed as many
(or few) interfaces as is appropriate to share the use of the
forwarding engine. This decision had the additional benefit of
making support for virtual private networks very simple—we
can dedicate a forwarding engine to each virtual network and
ensure that packets never cross (and risk confusion) in the
forwarding path.
Placing forwarding engines on separate cards led to a fourth
innovation. Because the forwarding engines are separate from
the line cards, they may receive packets from line cards that ...
2. Forwarding engine determines the next hop for the packet, and returns next-hop data to the line card, together with an updated header.
Recall: Two CPUs sharing memory
[Excerpt shown on the slide, from the IEEE Micro article on the Power5 (Hot Chips 15):]

... supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview
Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm². The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core
We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram. In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).
In earlier lectures, we pretended it was easy to let several CPUs share a memory system. In fact, it is an architectural challenge; even letting several threads on one machine share memory is tricky.
Today: Hardware Thread Support
Producer/Consumer: One thread writes A, one thread reads A.
Locks: Two threads share write access to A.
On Thursday: Multiprocessor memory system design and synchronization issues.
Thursday is a simplified overview -- graduate-level architecture courses spend weeks on this topic ...
How 2 threads share a queue ...
[Diagram: the queue as words in memory, higher address numbers to the right; Tail and Head pointers mark the ends of the queue.]
We begin with an empty queue ...
Thread 1 (T1) adds data to the tail of the queue: the "Producer" thread.
Thread 2 (T2) takes data from the head of the queue: the "Consumer" thread.
Producer adding x to the queue ...
Before: [diagram: empty queue]
After: [diagram: queue holds x; Tail has advanced by one word]

T1 code (producer):
  ORi  R1, R0, xval   ; Load x value into R1
  LW   R2, tail(R0)   ; Load tail pointer into R2
  SW   R1, 0(R2)      ; Store x into queue
  ADDi R2, R2, 4      ; Shift tail by one word
  SW   R2, tail(R0)   ; Update tail memory addr
Producer adding y to the queue ...
Before: [diagram: queue holds x]
After: [diagram: queue holds x and y; Tail has advanced by one word]

T1 code (producer):
  ORi  R1, R0, yval   ; Load y value into R1
  LW   R2, tail(R0)   ; Load tail pointer into R2
  SW   R1, 0(R2)      ; Store y into queue
  ADDi R2, R2, 4      ; Shift tail by one word
  SW   R2, tail(R0)   ; Update tail memory addr
Consumer reading the queue ...
Before: [diagram: queue holds x and y]
After: [diagram: queue holds y; Head has advanced past x]

T2 code (consumer):
  LW   R3, head(R0)   ; Load head pointer into R3
spin: LW   R4, tail(R0)   ; Load tail pointer into R4
  BEQ  R4, R3, spin   ; If queue empty, wait
  LW   R5, 0(R3)      ; Read x from queue into R5
  ADDi R3, R3, 4      ; Shift head by one word
  SW   R3, head(R0)   ; Update head pointer
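The same producer and consumer steps, written as a plain C sketch with no synchronization (names like enqueue, dequeue, head, and tail are illustrative, not from the slides; volatile only forces the memory accesses, it does not order them):

#include <stdint.h>

#define QUEUE_WORDS 1024
volatile uint32_t queue[QUEUE_WORDS];        /* the words in memory       */
volatile uint32_t *volatile head = queue;    /* consumer reads from *head */
volatile uint32_t *volatile tail = queue;    /* producer writes to *tail  */

/* T1: producer adds one word at the tail (mirrors ORi/LW/SW/ADDi/SW). */
void enqueue(uint32_t x)
{
    volatile uint32_t *t = tail;   /* LW   R2, tail(R0) */
    *t = x;                        /* SW   R1, 0(R2)    */
    tail = t + 1;                  /* ADDi, SW tail     */
}

/* T2: consumer spins until the queue is non-empty, then removes one word. */
uint32_t dequeue(void)
{
    volatile uint32_t *h = head;   /* LW   R3, head(R0) */
    while (tail == h)              /* spin: queue empty */
        ;
    uint32_t x = *h;               /* LW   R5, 0(R3)    */
    head = h + 1;                  /* ADDi, SW head     */
    return x;
}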
What can go wrong?
T1 code (producer):
  ORi  R1, R0, x      ; Load x value into R1
  LW   R2, tail(R0)   ; Load tail pointer into R2
  SW   R1, 0(R2)      ; (1) Store x into queue
  ADDi R2, R2, 4      ; Shift tail by one word
  SW   R2, tail(R0)   ; (2) Update tail pointer

T2 code (consumer):
  LW   R3, head(R0)   ; Load head pointer into R3
spin: LW   R4, tail(R0)   ; Load tail pointer into R4
  BEQ  R4, R3, spin   ; (3) If queue empty, wait
  LW   R5, 0(R3)      ; (4) Read x from queue into R5
  ADDi R3, R3, 4      ; Shift head by one word
  SW   R3, head(R0)   ; Update head pointer

[Before/After queue diagrams as on the earlier slides.]

What if the order is 2, 3, 4, 1? Then x is read before it is written! The CPU running T1 has no way to know it's bad to delay 1!
Leslie Lamport: Sequential Consistency
Sequential Consistency: as if each thread takes turns executing, and instructions in each thread execute in program order.
Sequentially consistent architectures get the right answer, but give up many optimizations.

T1 code (producer):
  ORi  R1, R0, x      ; Load x value into R1
  LW   R2, tail(R0)   ; Load queue tail into R2
  SW   R1, 0(R2)      ; (1) Store x into queue
  ADDi R2, R2, 4      ; Shift tail by one word
  SW   R2, tail(R0)   ; (2) Update tail memory addr

T2 code (consumer):
  LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; Load queue tail into R4
  BEQ  R4, R3, spin   ; (3) If queue empty, wait
  LW   R5, 0(R3)      ; (4) Read x from queue into R5
  ADDi R3, R3, 4      ; Shift head by one word
  SW   R3, head(R0)   ; Update head memory addr

Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2 ... but not 2, 3, 1, 4!
Efficient alternative: Memory barriers
In the general case, the machine is not sequentially consistent. When needed, a memory barrier (a fence) may be added to the program: all memory operations before the fence complete, then the memory operations after the fence begin.

  ORi  R1, R0, x      ;
  LW   R2, tail(R0)   ;
  SW   R1, 0(R2)      ; (1)
  MEMBAR
  ADDi R2, R2, 4      ;
  SW   R2, tail(R0)   ; (2)

Ensures 1 completes before 2 takes effect.
MEMBAR is expensive, but you only pay for it when you use it. Many MEMBAR variations exist for efficiency (versions that only affect loads or stores, certain memory regions, etc.).
Producer/consumer memory fences
T1 code (producer):
  ORi  R1, R0, x      ; Load x value into R1
  LW   R2, tail(R0)   ; Load queue tail into R2
  SW   R1, 0(R2)      ; (1) Store x into queue
  MEMBAR              ;
  ADDi R2, R2, 4      ; Shift tail by one word
  SW   R2, tail(R0)   ; (2) Update tail memory addr

T2 code (consumer):
  LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; Load queue tail into R4
  BEQ  R4, R3, spin   ; (3) If queue empty, wait
  MEMBAR              ;
  LW   R5, 0(R3)      ; (4) Read x from queue into R5
  ADDi R3, R3, 4      ; Shift head by one word
  SW   R3, head(R0)   ; Update head memory addr

[Before/After queue diagrams as on the earlier slides.]
Ensures 1 happens before 2, and 3 happens before 4.
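For comparison, a minimal C11 sketch of the same fence placement, with atomic_thread_fence standing in for MEMBAR (the queue layout and names are illustrative, not from the slides):

#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_WORDS 1024
static uint32_t queue[QUEUE_WORDS];
static _Atomic(uint32_t *) head = queue;   /* consumer's pointer */
static _Atomic(uint32_t *) tail = queue;   /* producer's pointer */

/* T1: producer. The fence ensures the data store (1) is visible
 * before the tail update (2).                                    */
void enqueue(uint32_t x)
{
    uint32_t *t = atomic_load_explicit(&tail, memory_order_relaxed);
    *t = x;                                    /* (1) store x into queue */
    atomic_thread_fence(memory_order_release); /* MEMBAR                 */
    atomic_store_explicit(&tail, t + 1, memory_order_relaxed); /* (2)    */
}

/* T2: consumer. The fence ensures the non-empty check (3) happens
 * before the data load (4).                                      */
uint32_t dequeue(void)
{
    uint32_t *h = atomic_load_explicit(&head, memory_order_relaxed);
    while (atomic_load_explicit(&tail, memory_order_relaxed) == h)
        ;                                      /* (3) spin while empty   */
    atomic_thread_fence(memory_order_acquire); /* MEMBAR                 */
    uint32_t x = *h;                           /* (4) read x             */
    atomic_store_explicit(&head, h + 1, memory_order_relaxed);
    return x;
}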
Reminder: Final Project Checkoff
[Block diagram: a Pipelined CPU with an Instruction Cache and a Data Cache, a DRAM Controller, and DRAM, connected by the IC, IM, DC, and DM buses.]
TAs will provide "secret" MIPS machine code tests.
Bonus points if these tests run by 2 PM. If not, the TAs give you test code to use over the weekend.
CS 152: What’s left ...
Monday 5/2: Final report due, 11:59 PM
Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda.
Tuesday 5/10: Final presentations.
Watch email for final project peer review request.
No class on Thursday. Review session on Tuesday 5/2, + HKN (???).
Deadline to bring up grading issues: Tues 5/10 @ 5 PM. Contact John at lazzaro@eecs.
Sharing Write Access
One producer, two consumers ...
T2 & T3 (two copies of the consumer thread):
  LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; Load queue tail into R4
  BEQ  R4, R3, spin   ; If queue empty, wait
  LW   R5, 0(R3)      ; Read x from queue into R5
  ADDi R3, R3, 4      ; Shift head by one word
  SW   R3, head(R0)   ; Update head memory addr

T1 code (producer):
  ORi  R1, R0, x      ; Load x value into R1
  LW   R2, tail(R0)   ; Load queue tail into R2
  SW   R1, 0(R2)      ; Store x into queue
  ADDi R2, R2, 4      ; Shift tail by one word
  SW   R2, tail(R0)   ; Update tail memory addr

[Before/After queue diagrams as on the earlier slides.]

Critical section: T2 and T3 must take turns running the consumer code shown in red.
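To make the race concrete, here is a hedged pthreads sketch (illustrative; it reuses the unsynchronized enqueue/dequeue sketched earlier) in which two consumer threads run the same code. Without mutual exclusion, both may load the same head pointer and return the same queue entry:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* The unsynchronized queue operations sketched earlier (illustrative). */
extern void enqueue(uint32_t x);
extern uint32_t dequeue(void);

/* T2 and T3: two copies of the consumer thread. */
static void *consumer(void *arg)
{
    /* Race: both threads can load the same head pointer, read the same
     * entry, and then both write back head+4, losing one update.       */
    uint32_t x = dequeue();
    printf("consumer %d got %u\n", (int)(intptr_t)arg, (unsigned)x);
    return NULL;
}

int main(void)
{
    enqueue(1);
    enqueue(2);

    pthread_t t2, t3;
    pthread_create(&t2, NULL, consumer, (void *)(intptr_t)2);
    pthread_create(&t3, NULL, consumer, (void *)(intptr_t)3);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    return 0;
}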
Abstraction: Semaphores (Dijkstra, 1965)
Semaphore: unsigned int s. s is initialized to the number of threads permitted in the critical section at once (in our example, 1).
P(s): If s > 0, s-- and return. Otherwise, sleep. When woken, do s-- and return.
V(s): Do s++, awaken one sleeping process, and return.

Example use (initial s = 1):
  P(s);
  ... critical section (s = 0) ...
  V(s);

When awake, V(s) and P(s) are atomic: no interruptions, with exclusive access to s.
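A minimal sketch of the same P/V pattern using POSIX semaphores, where sem_wait and sem_post play the roles of P and V (the function name and placement are illustrative, not from the slides):

#include <semaphore.h>

static sem_t s;                    /* guards the critical section       */

void consumer_step(void)
{
    sem_wait(&s);                  /* P(s): block until s > 0, then s-- */
    /* ... critical section: read the queue head and advance it ...     */
    sem_post(&s);                  /* V(s): s++, wake one sleeper       */
}

int main(void)
{
    sem_init(&s, 0, 1);            /* one thread in the section at once */
    consumer_step();
    sem_destroy(&s);
    return 0;
}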
Spin-Lock Semaphores: Test and Set

Critical section (assuming sequential consistency; 3 MEMBARs not shown ...):
  LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; Load queue tail into R4
  BEQ  R4, R3, spin   ; If queue empty, wait
  LW   R5, 0(R3)      ; Read x from queue into R5
  ADDi R3, R3, 4      ; Shift head by one word
  SW   R3, head(R0)   ; Update head memory addr

An example atomic read-modify-write ISA instruction:
Test&Set(m, R):
  R = M[m];
  if (R == 0) then M[m] = 1;
What if the OS swaps a process out while in the critical section? “High-latency locks”, a source of Linux audio problems (and others)
P: Test&Set R6, mutex(R0) ; Mutex check
   BNE R6, R0, P          ; If not 0, spin

V: SW R0, mutex(R0)       ; Give up mutex
Note: With Test&Set(), the M[m]=1 state corresponds to last slide’s s=0 state!
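A minimal C11 sketch of the same spin-lock, using atomic_flag_test_and_set as the Test&Set primitive (the variable and function names are illustrative; flag clear corresponds to s = 1, flag set to s = 0):

#include <stdatomic.h>

static atomic_flag mutex = ATOMIC_FLAG_INIT;   /* clear = unlocked  */

void lock(void)     /* P: spin until we atomically take the flag */
{
    while (atomic_flag_test_and_set(&mutex))
        ;                                      /* already set: spin */
}

void unlock(void)   /* V: give up the mutex */
{
    atomic_flag_clear(&mutex);
}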
Non-blocking synchronization ...
Another atomic read-modify-write instruction:
Compare&Swap(Rt, Rs, m):
  if (Rt == M[m]) then
    M[m] = Rs; Rs = Rt; status = success;
  else
    status = fail;

If a thread swaps out before the Compare&Swap, there is no latency problem; this code only "holds" the lock for one instruction!

try:  LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; Load queue tail into R4
      BEQ  R4, R3, spin   ; If queue empty, wait
      LW   R5, 0(R3)      ; Read x from queue into R5
      ADDi R6, R3, 4      ; Shift head by one word
      Compare&Swap R3, R6, head(R0) ; Try to update head
      BNE  R3, R6, try    ; If not success, try again
If R3 != R6, another thread got here first, so we must try again.
Assuming sequential consistency: MEMBARs not shown ...
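A hedged C11 sketch of the same non-blocking head update, with atomic_compare_exchange_strong standing in for the Compare&Swap instruction (queue layout and names are illustrative; C11 atomics default to sequentially consistent ordering, matching the slide's assumption):

#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_WORDS 1024
static uint32_t queue[QUEUE_WORDS];
static _Atomic(uint32_t *) head = queue;
static _Atomic(uint32_t *) tail = queue;

/* Non-blocking consumer: read the entry at head, then try to publish
 * head+1 with a compare-and-swap; if another consumer advanced head
 * first, the CAS fails and we simply try again.                      */
uint32_t dequeue(void)
{
    for (;;) {
        uint32_t *h = atomic_load(&head);      /* try: LW R3, head(R0) */
        while (atomic_load(&tail) == h)        /* spin: queue empty    */
            ;
        uint32_t x = *h;                       /* LW R5, 0(R3)         */
        if (atomic_compare_exchange_strong(&head, &h, h + 1))
            return x;                          /* CAS succeeded        */
        /* CAS failed: another thread got here first; try again.       */
    }
}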
Semaphores with just LW & SW?
Can we implement semaphores with just normal loads and stores? Yes! Assuming sequential consistency ...
In practice, we create sequential consistency by using memory fence instructions ... so, not really "normal".
Since load-and-store semaphore algorithms are quite tricky to get right, it is more convenient to use Test&Set or Compare&Swap instead.
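As an illustration of a loads-and-stores-only scheme, here is a hedged sketch of Peterson's two-thread mutual exclusion algorithm (not from the slides), with sequentially consistent C11 atomics supplying the ordering that memory fences would provide on a real machine:

#include <stdatomic.h>
#include <stdbool.h>

/* Peterson's algorithm for two threads (ids 0 and 1), built from
 * ordinary loads and stores only.                                 */
static atomic_bool interested[2];
static atomic_int  turn;

void ls_lock(int me)
{
    int other = 1 - me;
    atomic_store(&interested[me], true);    /* SW: I want in          */
    atomic_store(&turn, other);             /* SW: you go first       */
    while (atomic_load(&interested[other])  /* LW: spin while the     */
           && atomic_load(&turn) == other)  /* other thread is ahead  */
        ;
}

void ls_unlock(int me)
{
    atomic_store(&interested[me], false);   /* SW: I'm done           */
}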
Conclusions: Synchronization
MEMBAR: Memory fences, in lieu of full sequential consistency.
Test&Set: A spin-lock instruction for sharing write access.
Compare&Swap: A non-blocking alternative to share write access.