project2_talk.ppt

Upload: bmbm-farid

Post on 01-Jun-2018


TRANSCRIPT

  • 8/9/2019 project2_talk.ppt

    1/16

    Checkpoint Based Recovery from Power Failures

    Christopher Sutardja

    Emil Stefanov

  • 8/9/2019 project2_talk.ppt

    2/16

    Goals

    • Consistent checkpoint – A consistent snapshot of memory for a specific time in the past.

    • Safe even under power failure – The checkpoint is never "in transition".

    • Small storage overhead – Not much more than double the memory.

    • Low performance overhead – Should not stall the processor for too long.

    • Scalable – Scales well in large core networks such as meshes.

  • 8/9/2019 project2_talk.ppt

    3/16

    Related Work

    • On the feasibility of incremental checkpointing for scientific computing, by J. Sancho et al.

     – Speculates about the future role of checkpointing in parallel machines.

     – As the number of processing nodes grows exponentially, failure of any one node becomes much more likely.

     – Error correction codes and other redundancies would introduce too much overhead when used alone.

     – As a result, research on checkpoint recovery is growing in importance.

  • 8/9/2019 project2_talk.ppt

    4/16

    Related Work

    • Modular Checkpointing for Atomicity, by L. Ziarek et al.

     – Introduces an abstraction called stabilizers to make checkpointing easier.

     – Targets message-passing machines.

    • Makes consistent checkpointing more challenging.

  • 8/9/2019 project2_talk.ppt

    5/16

    Related Work

    • SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery, by D. Sorin et al.

     – Explores the concept of checkpointing in logical time.

     – Multiple checkpoints.

     – Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.

     – Low execution overhead.

     – Not safe from power failures.

  • 8/9/2019 project2_talk.ppt

    6/16

    Related Work

    • ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors, by M. Prvulovic et al.

     – Explores different ways of rollback recovery in shared-memory multiprocessor systems. Considers:

        • the scope of the checkpoint

        • memory

        • checkpointing mechanism.

     – Achieves about 6% checkpointing overhead.

     – Not safe from power failures.

     – Not geared towards non-volatile memory: requires fast writes.

  • 8/9/2019 project2_talk.ppt

    7/16

    Related Work

    • Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory, by Chin Wu et al.

     – As Flash Memory becomes cheaper and denser, the uses for Flash increase.

     – Yet another use of flash for recovery.

  • 8/9/2019 project2_talk.ppt

    8/16

    [Figure: block diagram with the labels Core, L1, L2, and four DRAM blocks.]

  • 8/9/2019 project2_talk.ppt

    9/16

    [Figure: block diagram with the labels Core, L1, and L2, each with Checkpoint A and Checkpoint B storage; a Cache Checkpoint Controller for each cache; four DRAM blocks, each with a DRAM Checkpointer, a Checkpoint Buffer, and a Log; an Address Decoder; and a Checkpoint Coordinator.]

  • 8/9/2019 project2_talk.ppt

    10/16

    Checkpointing Techniques

    • For Caches and Cores:

     – Each cache/core has two flash storages adjacent to it.

        • One is for the previous checkpoint.

        • One is for the current checkpoint.

     – During a checkpoint, the cache/core internal state is copied to flash storage.

    • For DRAM:

     – The checkpointing system snoops on DRAM.

     – DRAM changes are continuously logged to flash memory.

     – A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
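The DRAM logging path described above can be sketched in a few lines. This is a minimal model under stated assumptions, not the authors' design: the class names `DramCheckpointer` and `DramLogger`, the buffer capacity, the one-entry-per-tick drain rate, and the modulo address interleaving are all hypothetical choices made for illustration.

```python
from collections import deque


class DramCheckpointer:
    """Hypothetical model of one DRAM checkpointer: snoops writes into a
    volatile buffer and drains them into an append-only flash log."""

    def __init__(self, buffer_capacity=4):
        self.buffer = deque()            # volatile checkpoint buffer
        self.buffer_capacity = buffer_capacity
        self.flash_log = []              # simulated flash log
        self.stalls = 0                  # writes that had to wait for a drain

    def snoop_write(self, address, value):
        """Record a DRAM write observed by the checkpointing system."""
        if len(self.buffer) >= self.buffer_capacity:
            # Buffer full: the write waits until one entry reaches flash.
            self.stalls += 1
            self.drain(1)
        self.buffer.append((address, value))

    def drain(self, max_entries):
        """Move up to max_entries buffered writes to flash, oldest first."""
        moved = 0
        while self.buffer and moved < max_entries:
            self.flash_log.append(self.buffer.popleft())
            moved += 1
        return moved


class DramLogger:
    """Several checkpointers side by side behind an address decoder."""

    def __init__(self, num_banks=4, buffer_capacity=4):
        self.banks = [DramCheckpointer(buffer_capacity) for _ in range(num_banks)]

    def write(self, address, value):
        # Assumed decoding: simple modulo interleaving across banks.
        self.banks[address % len(self.banks)].snoop_write(address, value)

    def tick(self):
        # Assumed drain rate: each bank moves one entry to flash per tick.
        for bank in self.banks:
            bank.drain(1)

    @property
    def stalls(self):
        return sum(bank.stalls for bank in self.banks)


logger = DramLogger(num_banks=4, buffer_capacity=2)
for addr in range(16):
    logger.write(addr, addr * 10)
    logger.tick()
print(logger.stalls)
```

In this model a stall occurs only when one bank receives writes faster than its buffer drains, which is the intuition behind spreading the log across parallel buffers.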

  • 8/9/2019 project2_talk.ppt

    11/16

    Responsibilities of the Main Components

    • Checkpoint Coordinator

     – Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.

    • DRAM Checkpointer

     – Continuously logs DRAM changes.

     – Checkpoints when instructed by the coordinator.

    • Cache Checkpoint Controller

     – Checkpoints the adjacent cache when instructed by the coordinator.

  • 8/9/2019 project2_talk.ppt

    12/16

    Steps for Checkpointing (1 of 2)

    1. The coordinator sets the checkpoint signal to 1.

    2. In parallel, each:

        a. Core:

            i. Pauses processing instructions.

            ii. Copies internal state to flash memory.

        b. Cache Checkpoint Controller:

            i. Copies cache internal state to flash memory; data is copied one line at a time.

        c. DRAM Checkpointer:

            i. Flushes buffer to flash log.

            ii. Notifies checkpoint coordinator that the buffer has been flushed.

  • 8/9/2019 project2_talk.ppt

    13/16

    Steps for Checkpointing (2 of 2)

    3. The coordinator sets the checkpoint signal to 0.

    4. In parallel, each:

        a. Core:

            i. Flips flash memory bit to indicate the new checkpoint buffer.

        b. Cache Checkpoint Controller:

            i. Flips flash memory bit to indicate the new checkpoint buffer.

        c. DRAM Checkpointer:

            i. Marks checkpoint boundary in flash log.
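The two-phase checkpoint procedure can be illustrated with a small simulation. It is a sketch under assumptions rather than the presented hardware: `TwoSlotStore`, `DramLog`, `checkpoint`, and the `"BOUNDARY"` marker are hypothetical names, the components run sequentially here although the slides describe them as parallel, and a crash is simulated only between the two phases.

```python
BOUNDARY = "BOUNDARY"  # hypothetical marker written into the flash log


class TwoSlotStore:
    """Hypothetical flash storage beside one core or cache: slots A and B
    plus one flash bit naming the last complete checkpoint."""

    def __init__(self, initial_state):
        self.slots = [dict(initial_state), {}]  # index 0 = A, index 1 = B
        self.valid = 0

    def copy_state(self, state):
        # Signal = 1: write only into the slot that is NOT currently valid.
        self.slots[1 - self.valid] = dict(state)

    def flip(self):
        # Signal = 0: flipping one bit publishes the new checkpoint.
        self.valid = 1 - self.valid

    def committed(self):
        return self.slots[self.valid]


class DramLog:
    """Hypothetical DRAM checkpointer state: a volatile buffer and a flash log."""

    def __init__(self):
        self.buffer = []
        self.flash_log = []

    def record(self, address, value):
        self.buffer.append((address, value))

    def flush(self):
        self.flash_log.extend(self.buffer)
        self.buffer.clear()

    def mark_boundary(self):
        self.flash_log.append(BOUNDARY)


def checkpoint(stores_and_states, dram_logs, crash_after_phase_one=False):
    """Run steps 1-4; return True if the checkpoint completed."""
    # Steps 1-2 (signal = 1): copy state and flush buffers.
    for store, state in stores_and_states:
        store.copy_state(state)
    for log in dram_logs:
        log.flush()
    if crash_after_phase_one:
        return False  # simulated power failure before any bit flips
    # Steps 3-4 (signal = 0): flip bits and mark the log boundary.
    for store, _ in stores_and_states:
        store.flip()
    for log in dram_logs:
        log.mark_boundary()
    return True


core = TwoSlotStore({"pc": 0})
log = DramLog()
log.record(8, 1)
checkpoint([(core, {"pc": 100})], [log], crash_after_phase_one=True)
print(core.committed())  # still the old checkpoint
```

Because the first phase writes only to the inactive slot, a power failure during the copy leaves the previously committed slot untouched; the new state becomes visible only when the bit is flipped.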

  • 8/9/2019 project2_talk.ppt

    14/16

    [Figure: Core, L1, and L2, each with Checkpoint A and Checkpoint B storage; each cache has a Cache Checkpoint Controller.]

  • 8/9/2019 project2_talk.ppt

    15/16

    [Figure: four Checkpoint Buffer and Log pairs behind an Address Decoder. Log layout labels: Previous Checkpoint (random access), Previous Checkpoint Changes, Next Checkpoint Changes, Buffered Changes, start, end.]

  • 8/9/2019 project2_talk.ppt

    16/16

    Recovering

    1. Determining which Checkpoint to use

        a. System checks which Checkpoint is the most recent.

        b. If the most recent checkpoint was in progress during the crash, the older checkpoint is used.

    2. Restoring Previous State

        a. Each architectural register is rewritten.

        b. Each cache is written to by its adjacent flash buffer, one cache line at a time.

        c. Main Memory is recovered.

        d. Take advantage of pipelined write if available.

    3. Resume Execution

        a. Resume program counter

        b. Notify that CP
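The recovery steps can be sketched as follows. The slides do not spell out how main memory is rebuilt, so this example assumes that logged changes up to the last boundary marker are applied to a base memory image and that anything after the marker belongs to an unfinished checkpoint and is discarded. The names `committed_changes`, `recover`, and `BOUNDARY` are hypothetical.

```python
BOUNDARY = "BOUNDARY"  # hypothetical checkpoint-boundary marker in the flash log


def committed_changes(flash_log):
    """Return logged writes up to the last boundary marker.

    Entries after the last marker belong to a checkpoint that was still in
    progress at the crash, so they are ignored (assumption of this sketch).
    """
    last = -1
    for index, entry in enumerate(flash_log):
        if entry == BOUNDARY:
            last = index
    return [entry for entry in flash_log[:last + 1] if entry != BOUNDARY]


def recover(slots, valid_bit, base_memory, flash_log):
    """Rebuild core state and main memory after a power failure.

    slots       -- the two saved core states (checkpoint A and B)
    valid_bit   -- flash bit naming the last complete checkpoint
    base_memory -- memory image that the logged changes are applied to
    """
    # Step 1: the flash bit selects the last complete checkpoint.
    state = dict(slots[valid_bit])
    # Step 2: restore main memory by replaying committed changes in order.
    memory = dict(base_memory)
    for address, value in committed_changes(flash_log):
        memory[address] = value
    # Step 3: the caller resumes execution from state["pc"].
    return state, memory


slots = [{"pc": 100, "r1": 7}, {"pc": 250, "r1": 9}]
log = [(0, 1), (4, 2), BOUNDARY, (0, 5)]
state, memory = recover(slots, 0, {0: 0, 4: 0, 8: 0}, log)
print(state["pc"], memory)
```

Here slot B holds a half-written newer state and the log holds one write after the boundary; both are ignored, so the restored registers and memory describe the same point in time.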