multiprocessors - inf.ed.ac.uk · multiprocessors ! why multiprocessors?! – power wall! – ilp...

19
Inf3 Computer Architecture - 2012-2013 1 Multiprocessors Why multiprocessors? Power Wall ILP Wall Limitation of ILP in programs Complexity of superscalar design Cost-effectiveness: easier to connect several ready processors than designing a new, more powerful, processor Better integration and performance than networked individual computers

Upload: others

Post on 19-Oct-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

  • Inf3 Computer Architecture - 2012-2013 1

    Multiprocessors !

      Why multiprocessors?!–  Power Wall!

    –  ILP Wall!  Limitation of ILP in programs!  Complexity of superscalar design!

    –  Cost-effectiveness: easier to connect several ready processors than designing a new, more powerful, processor!

    –  Better integration and performance than networked individual computers!

  • But…!

      Software should expose parallelism!–  Programmers should write parallel programs!–  Legacy code should be parallelized!

    Inf3 Computer Architecture - 2012-2013 2

    …as hard as any (problem) that computer science has faced.

  • Inf3 Computer Architecture - 2012-2013 3

    Amdahl’s Law and Efficiency!

      Let: F → fraction of problem that can be parallelized! Spar → speedup obtained on parallelized fraction! P → number of processors!

      e.g.: 16 processors (Spar = 16), F = 0.9 (90%), !

    Soverall = 1

    (1 – F) + F

    Spar

    Soverall = 1

    (1 – 0.9) + 0.9

    16

    = 6.4

    E = Soverall

    P

    E = 6.4

    16 = 0.4 (40%)

  • Inf3 Computer Architecture - 2012-2013 4

    Inter-processor Communication Models!

      Shared memory!

      Message passing!

    flag = 0; … a = 10; flag = 1;

    flag = 0; … while (!flag) {} x = a * y;

    Producer (p1) Consumer (p2)

    … a = 10; send(p2, a, label);

    … receive(p1, b, label); x = b * y;

    Producer (p1) Consumer (p2)

  • Shared Memory vs Message Passing!

      Shared memory pros!–  Easier to program: correctness first, performance later!–  For OS only (relatively) minor extensions required!

      Shared memory cons!–  Synchronization complex!–  Communication implicit harder to optimize!–  Coherence must be ensured. !

    Inf3 Computer Architecture - 2012-2013 5

  • HW Support for Shared Memory!

      Cache Coherence!–  Caches + multiprocessers stale values!–  System must behave correctly in the presence of

    caches!  Write propagation!  Write serialization!

      Memory Consistency!–  When should writes propagate?!–  How are memory operations ordered?!–  What value should a read return?!

      Primitive synchronization!–  Memory fences: memory ordering on demand!–  Read-Modify-writes: support for locks (critical sections) !

    Inf3 Computer Architecture - 2012-2013!

  • Cache Coherence!

    Inf3 Computer Architecture - 2012-2013!

    flag = 0; … data = 10;

    flag = 1;

    flag = 0; …

    while (!flag) {}

    x = data * y;

    Producer (p1) Consumer (p2)

    The update to flag (and data) should be (eventually) visible to p2

    7

  • Memory Consistency!

    Inf3 Computer Architecture - 2012-2013!

    flag = 0; … data = 10;

    flag = 1;

    flag = 0; …

    while (!flag) {}

    x = data * y;

    Producer (p1) Consumer (p2)

    If p2 sees the update to flag, will p2 see the update to data?

    8

  • Primitive Synchronization !

    Inf3 Computer Architecture - 2012-2013!

    flag = 0; … data = 10;

    flag = 1;

    flag = 0; …

    while (!flag) {}

    x = data * y;

    Producer (p1) Consumer (p2)

    If p2 sees the update to flag, will it see the update to data?

    fence

    fence

    9

  • Inf3 Computer Architecture - 2012-2013 10

    The Cache Coherence Problem!

    CPU

    Main memory

    CPU CPU

    Cache Cache Cache

    T0: A=1

    T0: A not cached T0: A not cached T0: A not cached T1: load A (A=1)

    T1: A=1

    T1: A not cached T1: A not cached T2: load A (A=1) T2: A not cached T2: A=1

    T2: A=1

    T3: store A (A=2) T3: A not cached T3: A=1

    T3: A=1

    stale

    stale

    T4: load A (A=1) T4: A=1 T4: A=2

    T4: A=1

    use old value

    T5: load A (A=1)

    use stale value!

  • Inf3 Computer Architecture - 2012-2013 11

    Maintaining Cache Coherence!

      Option 1: do not allow shared data to be cached if it can be modified (e.g., Cray T3D and T3E)!–  Difficult to program!–  Poor performance!

      Option 2: user program makes sure that data is always coherent!–  Very difficult to program!–  Poor performance!

      Option 3: hardware enforces cache coherence!

  • Inf3 Computer Architecture - 2012-2013!

    Cache Coherence Protocols!  Idea:!

    –  Keep track of what processors have copies of what data!

    –  Enforce that at any given time a single value of every data exists:!  By getting rid of copies of the data with old values → invalidate protocols!  By updating everyone’s copy of the data → update protocols!

      In practice:!–  Guarantee that old values are eventually invalidated/updated (write

    propagation)! (recall that without synchronization there is no guarantee that a load

    will return the new value anyway)!

    –  Guarantee that only a single processor is allowed to modify a certain datum at any given time (write serialization)!

    –  Must appear as if no caches were present!

    12

  • Inf3 Computer Architecture - 2012-2013 13

    Write-invalidate Example!

    CPU

    Main memory

    CPU CPU

    Cache Cache Cache

    T1: load A (A=1)

    T1: A=1

    T1: A not cached T1: A not cached T2: load A (A=1) T2: A not cached T2: A=1

    T2: A=1

    T3: store A (A=2) T3: A not cached T3: A not cached

    T3: A=1

    invalidate

    stale

    T4: load A (A=2) T4: A not cached T4: A=2

    T4: A=1

    new value T5: load A (A=2)

    new value

  • Inf3 Computer Architecture - 2012-2013!

    Write-update Example!

    CPU

    Main memory

    CPU CPU

    Cache Cache Cache

    T1: load A (A=1)

    T1: A=1

    T1: A not cached T1: A not cached T2: load A (A=1) T2: A not cached T2: A=1

    T2: A=1

    T3: store A (A=2) T3: A not cached T3: A = 2

    T3: A=2

    update

    update

    T4: load A (A=2) T4: A = 2 T4: A=2

    T4: A=2

    new value

    T5: load A (A=2)

    14

  • Inf3 Computer Architecture - 2012-2013!

    Cache Coherence Protocols!

      Add state bits to cache lines to track state of the line!–  Most common: Modified, Owned, Exclusive, Shared, Invalid !–  Protocols usually named after the states supported!

      Cache lines transition between states upon load/store operations from the local processor and by remote processors!

      These state transitions must guarantee the invariant: no two cache copies can be simultaneously modified !–  SWMR: Single writer multiple readers!

    15

  • Inf3 Computer Architecture - 2012-2013!

    Example: MSI Protocol!

      States:!–  Modified (M): block is cached only in this cache and has been

    modified!–  Shared (S): block is cached in this cache and possibly in other

    caches (no cache can modify the block)!–  Invalid (I): block is not cached!

    16

  • Inf3 Computer Architecture - 2012-2013!

    Example: MSI Protocol!

      Transactions originated at this CPU:!

    Invalid Shared

    Modified

    CPU read miss

    CPU read hit

    CPU write miss

    CPU write

    CPU write hit CPU read hit

    17

  • Inf3 Computer Architecture - 2012-2013!

    Example: MSI Protocol!

      Transactions originated at other CPU:!

    Invalid Shared

    Modified

    CPU read miss

    CPU read hit

    CPU write miss

    CPU write

    CPU write hit CPU read hit

    Remote write miss

    Remote write miss Remote read miss

    Remote read miss

    18

  • Inf3 Computer Architecture - 2012-2013!

    Possible Implementations!

      Two ways of implementing coherence protocols in hardware!

    –  Snooping: all cache controllers monitor all other caches’ activities and maintain the state of their lines!  Commonly used with buses and in many CMP’s today!

    –  Directory: a central control device directly handles all cache activities and tells the caches what transitions to make!  Can be of two types: centralized and distributed!  Commonly used with scalable interconnects and in many CMP’s

    today!

    19