project2_talk.ppt
TRANSCRIPT
-
8/9/2019 project2_talk.ppt
1/16
Checkpoint Based Recovery from Power Failures
Christopher Sutardja
Emil Stefanov
-
Goals
• Consistent checkpoint – A consistent snapshot of memory for a specific time in the past.
• Safe even under power failure – The checkpoint is never "in transition".
• Small storage overhead – Not much more than double the memory.
• Low performance overhead – Should not stall the processor for too long.
• Scalable – Scales well in large core networks such as meshes.
-
Related Work
• On the feasibility of incremental checkpointing for scientific computing by J. Sancho et al.
  – Speculates about the future role of checkpointing in parallel machines.
  – As the number of processing nodes grows exponentially, failure of any one node becomes much more likely.
  – Error correction codes and other redundancies would introduce too much overhead when used alone.
  – As a result, research into checkpoint recovery is growing in importance.
-
Related Work
• Modular Checkpointing for Atomicity by L. Ziarek et al.
  – Introduces an abstraction called stabilizers to make checkpointing easier.
  – Targets message-passing machines.
    • Makes consistent checkpointing more challenging.
-
Related Work
• SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery by D. Sorin et al.
  – Explores the concept of checkpointing in logical time.
  – Multiple checkpoints.
  – Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.
  – Low execution overhead.
  – Not safe from power failures.
-
Related Work
• ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors by M. Prvulovic et al.
  – Explores different ways of rollback recovery in shared-memory multiprocessor systems. Considers:
    • the scope of the checkpoint
    • memory
    • checkpointing mechanism.
  – Achieves about 6% checkpointing overhead.
  – Not safe from power failures.
  – Not geared towards non-volatile memory: requires fast writes.
-
Related Work
• Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory by Chin Wu et al.
  – As Flash Memory becomes cheaper and denser, the uses for Flash increase.
  – Yet another use of flash for recovery.
-
[Figure: a Core with L1 and L2 caches and four DRAM modules]
-
[Figure: the same system with checkpointing hardware added. Labels: Core with Checkpoint A and Checkpoint B; L1 and L2 caches, each with a Cache Checkpoint Controller, Checkpoint A, and Checkpoint B; Checkpoint Coordinator; Address Decoder; four DRAM modules, each with a DRAM Checkpointer, a Checkpoint Buffer, and a Log]
-
Checkpointing Techniques
• For Caches and Cores:
  – Each cache/core has two flash storages adjacent to it.
    • One is for the previous checkpoint.
    • One is for the current checkpoint.
  – During a checkpoint, the cache/core internal state is copied to flash storage.
• For DRAM:
  – The checkpointing system snoops on DRAM.
  – DRAM changes are continuously logged to flash memory.
  – A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
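
The DRAM logging path can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design from the talk: the class name `DramCheckpointer`, the `buffer_capacity` parameter, and the choice to log (address, new value) pairs are all assumptions made for the example.

```python
from collections import deque


class DramCheckpointer:
    """Hypothetical model of a unit that snoops DRAM writes and logs them to flash."""

    def __init__(self, buffer_capacity=4):
        self.buffer = deque()                   # volatile staging buffer
        self.buffer_capacity = buffer_capacity  # assumed size, not from the slides
        self.flash_log = []                     # stands in for the non-volatile log

    def snoop_write(self, address, value):
        # Record every observed DRAM write; drain the buffer once it fills up.
        self.buffer.append((address, value))
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        # Move buffered changes into the flash log in arrival order.
        while self.buffer:
            self.flash_log.append(self.buffer.popleft())


ckpt = DramCheckpointer(buffer_capacity=2)
ckpt.snoop_write(0x10, 1)
ckpt.snoop_write(0x14, 2)  # buffer is full here, so both entries move to the log
ckpt.snoop_write(0x18, 3)  # stays buffered until the next flush
```

In this model the processor only ever appends to the buffer, which is the property that lets the slow flash writes proceed in the background.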
-
Responsibilities of the Main Components
• Checkpoint Coordinator
  – Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.
• DRAM Checkpointer
  – Continuously logs DRAM changes.
  – Checkpoints when instructed by the coordinator.
• Cache Checkpoint Controller
  – Checkpoints the adjacent cache when instructed by the coordinator.
-
Steps for Checkpointing (1 of 2)
1. The coordinator sets the checkpoint signal to 1.
2. In parallel each
  a. Core:
    i. Pauses processing instructions.
    ii. Copies internal state to flash memory.
  b. Cache Checkpoint Controller:
    i. Copies cache internal state to flash memory (data is copied one line at a time).
  c. DRAM Checkpointer:
    i. Flushes buffer to flash log.
    ii. Notifies checkpoint coordinator that the buffer has been flushed.
-
Steps for Checkpointing (2 of 2)
3. The coordinator sets the checkpoint signal to 0.
4. In parallel each
  a. Core:
    i. Flips flash memory bit to indicate the new checkpoint buffer.
  b. Cache Checkpoint Controller:
    i. Flips flash memory bit to indicate the new checkpoint buffer.
  c. DRAM Checkpointer:
    i. Marks checkpoint boundary in flash log.
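
The copy-then-flip pattern used by the cores and cache controllers can be illustrated with a small Python sketch. The class `FlashPair` and its method names are hypothetical, and the sketch assumes the single-bit flip is atomic with respect to power loss, which the slides imply but do not spell out.

```python
class FlashPair:
    """Hypothetical model of two flash slots plus a bit naming the committed one."""

    def __init__(self, initial_state):
        self.slots = [initial_state, None]
        self.valid_bit = 0  # index of the slot holding the last complete checkpoint

    def copy_state(self, state):
        # Phase 1: write only into the slot NOT named by valid_bit,
        # so the committed checkpoint is never overwritten.
        self.slots[1 - self.valid_bit] = state

    def flip(self):
        # Phase 2: commit the new checkpoint by flipping one bit.
        self.valid_bit = 1 - self.valid_bit

    def committed(self):
        return self.slots[self.valid_bit]


pair = FlashPair("state@t0")
pair.copy_state("state@t1")
before_flip = pair.committed()  # a power failure here would still see state@t0
pair.flip()
after_flip = pair.committed()
```

Because the old slot is untouched until after the flip, there is no moment at which neither slot holds a complete checkpoint.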
-
[Figure: Core, L1, and L2, each with Checkpoint A and Checkpoint B flash storage; each cache has a Cache Checkpoint Controller]
-
[Figure: DRAM checkpoint log. An Address Decoder feeds four Checkpoint Buffer / Log pairs. Log regions, between "start" and "end" markers: Previous Checkpoint (random access), Previous Checkpoint Changes, Next Checkpoint Changes, Buffered Changes]
-
Recovering
1. Determining which Checkpoint to use
  a. System checks which Checkpoint is the most recent.
  b. If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
2. Restoring Previous State
  a. Each architectural register is rewritten.
  b. Each cache is written to by its adjacent flash buffer, one cache line at a time.
  c. Main Memory is recovered.
  d. Take advantage of pipelined write if available.
3. Resume Execution
  a. Resume program counter
  b. Notify that CP
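
One way to picture the main-memory part of recovery is as a replay of the flash log up to the last checkpoint boundary. The Python sketch below is an assumption-laden illustration: the `BOUNDARY` marker, the function `recover_memory`, and the representation of the log as (address, value) pairs applied to a base image are all invented for the example and are not taken from the talk.

```python
BOUNDARY = "BOUNDARY"  # hypothetical marker written when a checkpoint completes


def recover_memory(base_image, flash_log):
    """Rebuild memory as of the last completed checkpoint."""
    # Find the last boundary; anything after it belongs to a checkpoint
    # that was still in progress when power was lost.
    last = 0
    for index, entry in enumerate(flash_log):
        if entry == BOUNDARY:
            last = index

    memory = dict(base_image)
    for entry in flash_log[:last]:
        if entry != BOUNDARY:
            address, value = entry
            memory[address] = value
    return memory


log = [(0x10, 1), BOUNDARY, (0x10, 2), (0x14, 7), BOUNDARY, (0x10, 9)]
restored = recover_memory({0x10: 0}, log)
```

The trailing write to address 0x10 is discarded because no boundary follows it, which mirrors the rule that an in-progress checkpoint is never used.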