Optimizing for MP:s Erik Hagersten Uppsala University, Sweden [email protected]

Post on 08-Feb-2021


  • Optimizing for MP:s

    Erik Hagersten, Uppsala University, Sweden

    [email protected]

  • Dept of Information Technology| www.it.uu.se © Erik Hagersten| user.it.uu.se/~ehOPT 2

    AVDARK2012

    Cache Waste

    /* Unoptimized */
    for (s = 0; s < ITERATIONS; s++) {
        for (j = 0; j < HUGE; j++)
            x[j] = x[j+1];   /* will hog the cache but not benefit */
        for (i = 0; i < SMALLER_THAN_CACHE; i++)
            y[i] = y[i+1];   /* will be evicted between usages */
    }

    /* Optimized */
    for (s = 0; s < ITERATIONS; s++) {
        for (j = 0; j < HUGE; j++) {
            PREFETCH_NTA x[j+1];   /* pseudo-instruction: will be installed in L1, but not L3 (AMD) */
            x[j] = x[j+1];
        }
        for (i = 0; i < SMALLER_THAN_CACHE; i++)
            y[i] = y[i+1];   /* will always hit in the cache */
    }

    Also important for single-threaded applications if they

    are co-scheduled and share cache with other applications.
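
The optimized loop above can be sketched in compilable C. This is a minimal sketch, assuming a GCC or Clang compiler: `__builtin_prefetch(addr, rw, locality)` is their builtin, and locality 0 requests a non-temporal prefetch (typically emitted as `prefetchnta` on x86). The function name `shift_streaming` and the prefetch distance are illustrative assumptions, not part of the slide.

```c
#include <assert.h>
#include <stddef.h>

enum { PREFETCH_AHEAD = 16 };   /* assumed distance in elements; tune per machine */

/* Shift x left by one element while hinting that x has no temporal reuse. */
static void shift_streaming(double *x, size_t n)
{
    assert(x != NULL && n >= 1);
    for (size_t j = 0; j + 1 < n; j++) {
        /* Only prefetch addresses inside the array. */
        if (j + PREFETCH_AHEAD < n)
            __builtin_prefetch(&x[j + PREFETCH_AHEAD], 0, 0);
        x[j] = x[j + 1];
    }
}
```

Whether the hint actually keeps the data out of the shared cache level depends on the processor, which is why the slide qualifies its comment with "(AMD)".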

  • UART Research: Hints to avoid cache pollution (non-temporal prefetches)

    The larger cache, the better

    [Figure: cache misses vs. cache size, marking the miss rate at the actual cache size and the 2x miss rate at actual/4. Hint: "Don't allocate!"]

    [Figure: throughput (0 to 3) for one instance and four instances, comparing Orig with Hint (lim = actual/4). Labels: Original, Lim=1.7MB. Annotation: 40% faster.]

  • Categorizing and avoiding cache waste

    [Figure: miss rate vs. $-size, marking the ∆ benefit of additional cache and the region of cache hogging, next to diagrams of the CPU, L1, L2 and Mem hierarchy.]

    No point in caching! Per-instruction cache avoidance (prefetch.nta).

    Application classification by hogging and ∆ benefit: Don't care, Slows others, Slowed by others, Slows & slowed. [Figure: bzip, LBM and LQ placed in this classification.]

    Automatic "taming" of the hoggers

    [Figure: performance (0 to 1.2) on AMD Opteron for bzip2, Libquantum, LBM and their geometric mean, run individually, in mix, and in mix patched. Annotation: 25%.]

    Andreas Sandberg, David Eklov and Erik Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. In Proceedings of Supercomputing (SC), New Orleans, LA, USA, November 2010.

  • Coherence traffic

    ORIG:

    Thread 0:
    int a = 0, total;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;            /* a is written by both threads: its cache line bounces between the cores */
    }
    join()
    total = a;

    Child:
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;
    }

    OPT:

    Thread 0:
    int a = 0, total = 0;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;
    }
    join()
    total += a;

    Child:
    int b = 0;          /* private counter */
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        b++;
    }
    total += b;         /* one shared update instead of HUGE */
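
The OPT variant can be sketched with POSIX threads. This is a minimal sketch under assumptions: `HUGE_N`, `child` and `count_in_two_threads` are illustrative names, and the "work" is reduced to the counter update so that only the communication pattern is visible.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { HUGE_N = 1000000 };   /* hypothetical stand-in for HUGE */

/* Child: count in a private variable, publish the result once at the end. */
static void *child(void *arg)
{
    long *out = arg;
    long b = 0;
    for (int i = 0; i < HUGE_N; i++)
        b++;                 /* private: no coherence traffic per iteration */
    *out = b;                /* single write to memory the parent will read */
    return NULL;
}

static long count_in_two_threads(void)
{
    pthread_t tid;
    long b_result = 0;
    long a = 0;
    int rc = pthread_create(&tid, NULL, child, &b_result);
    assert(rc == 0);
    for (int i = 0; i < HUGE_N; i++)
        a++;
    rc = pthread_join(tid, NULL);   /* join orders the child's write before our read */
    assert(rc == 0);
    (void)rc;
    return a + b_result;
}
```

Build with `-pthread` on toolchains that require it. The ORIG variant would additionally need an atomic increment to be correct, which makes each shared update even more expensive.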

  • False sharing

    ORIG:

    Thread 0:
    int a = 0, b = 0;   /* a and b are likely to share one cache line */
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        ...
        a++;
    }
    join()
    total = a + b;

    Child:
    for (int i = 0; i < HUGE; i++) {
        ...
        b++;            /* different variable, same cache line as a */
    }

    OPT:

    Thread 0:
    int a = 0;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        ...
        a++;
    }
    join()
    total += a;

    Child:
    int b = 0;          /* private to the child */
    for (int i = 0; i < HUGE; i++) {
        ...
        b++;
    }
    total += b;
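
When the two counters have to stay in one shared structure, a common alternative to the slide's thread-private variables is to pad them onto separate cache lines. The sketch below assumes a 64-byte line (`CACHE_LINE` is an assumption to check against the target CPU) and uses C11 `alignas`; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* ORIG-style layout: a and b are adjacent in memory. */
struct counters_shared {
    int a;
    int b;
};

/* Padded layout: each counter starts on its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE) int a;
    alignas(CACHE_LINE) int b;
};

static_assert(offsetof(struct counters_padded, b) >= CACHE_LINE,
              "a and b must not share a cache line");

/* True if the two addresses fall into the same cache line. */
static int same_line(const void *p, const void *q)
{
    return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
}
```

Padding trades memory for isolation: the padded struct occupies two full lines instead of eight bytes.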

  • Coherence Utilization

    struct vec_type {
        int a; int b; int c; int d; int e; int f;
    };

    ORIG:

    Thread 0:
    vec_type x[HUGE];
    for (int i = 0; i < HUGE; i++) {
        ...
        x[i].a++;
    }
    spawn_child()
    ...
    join()

    Child (Thread 1):
    for (int i = 0; i < HUGE; i++) {
        y[i] = x[i].a;
    }

    [Figure: memory layout of x[0], x[1], x[2], ..., each holding the fields a to f; the child reads only field a.]
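
The transcript shows only the ORIG version of this slide. One possible remedy, given here as an assumption rather than as the lecture's own solution, is to store the communicated field `a` in its own dense array so that every transferred cache line carries only useful data. `vec_hot`, `vec_cold` and `line_utilization` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct vec_type { int a, b, c, d, e, f; };   /* layout from the slide */

/* Hypothetical split: the communicated field lives in its own array. */
struct vec_hot  { int a; };
struct vec_cold { int b, c, d, e, f; };

/* Fraction of the transferred bytes the consumer actually reads when it
 * touches one int per element of the given size. */
static double line_utilization(size_t elem_size)
{
    assert(elem_size >= sizeof(int));
    return (double)sizeof(int) / (double)elem_size;
}
```

With the original layout the child uses one sixth of the data moved between the caches; with the split layout it uses all of it.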

  • A Bad Example: "POUNDING"

    proc lock(lock_variable) {
        while (TAS[lock_variable] == 1) {}   /* bang on the lock until free */
    }

    proc unlock(lock_variable) {
        lock_variable := 0
    }

    Assume: The function TAS (test and set) returns the current memory value and atomically writes the busy pattern "1" to the memory.

    Generates too much traffic!! Every spinning thread keeps writing to the lock, so spinning produces coherence traffic.
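
The pounding lock maps directly onto C11 atomics. A minimal sketch, assuming `<stdatomic.h>` is available; the type and function names (`tas_lock`, `tas_try`, ...) are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { atomic_flag busy; } tas_lock;

static void tas_init(tas_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->busy);
}

/* One test-and-set: returns true if the lock was free and is now ours. */
static bool tas_try(tas_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire);
}

static void tas_acquire(tas_lock *l)
{
    while (!tas_try(l)) { }   /* every failed attempt is still a write */
}

static void tas_release(tas_lock *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}
```

Because test-and-set always writes, each spinning thread repeatedly pulls the lock's cache line into exclusive state, which is the traffic the slide warns about.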

  • Optimistic Test&Set Lock "spinlock"

    proc lock(lock_variable) {
        while true {
            if (TAS[lock_variable] == 0) break;   /* bang on the lock once, done if TAS==0 */
            while (lock_variable != 0) {}          /* spin locally in your cache until "0" observed */
        }
    }

    proc unlock(lock_variable) {
        lock_variable := 0
    }

    Much less coherence traffic!! But there is still lots of traffic at lock handover.
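
The optimistic lock can be sketched the same way: one atomic exchange, then read-only spinning. This is a sketch assuming C11 atomics; `ttas_lock` and its functions are illustrative names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { atomic_int word; } ttas_lock;   /* 0 = free, 1 = busy */

static void ttas_init(ttas_lock *l)
{
    assert(l != NULL);
    atomic_init(&l->word, 0);
}

/* One test-and-set: returns true if the lock was free and is now ours. */
static bool ttas_try(ttas_lock *l)
{
    return atomic_exchange_explicit(&l->word, 1, memory_order_acquire) == 0;
}

static void ttas_acquire(ttas_lock *l)
{
    for (;;) {
        if (ttas_try(l))
            return;
        /* Read-only spin: the line stays shared in the local cache. */
        while (atomic_load_explicit(&l->word, memory_order_relaxed) != 0) { }
    }
}

static void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->word, 0, memory_order_release);
}
```

At release, all waiters observe the 0 and issue their exchange at once, which explains the remaining burst of traffic at lock handover.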

  • Uppsala Programming for Multicore Architecture Center

    62 MSEK grant / 10 years [$9M/10y], plus related additional grants at UU = 130 MSEK

    Research areas:
          • Performance modeling
          • New parallel algorithms
          • Scheduling of threads and resources
          • Testing & verification
          • Language technology
          • MC in wireless and sensors

  • StatCache: Insight and Efficiency

    Slowdown 10% (for long-running applications)

    Online sampling (SparseSampler): randomly select accesses to monitor.

    Address stream: 1: read A, 2: read B, 3: read C, 4: write C, 5: read B, 6: read D, 7: read A, 8: read E, 9: read B

    Application fingerprint: 5, 3, … (ReuseDistance=5, ReuseDistance=3)

    Offline "Insight Technology" (ThreadSpotter): a probabilistic cache model combines the application fingerprint with the architectural parameters of the target architecture and produces the modeled behavior and advice.

    [Figure: host computer and target architecture, each drawn with cores, L1 and L2 caches, and mem.]
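
The two fingerprint values can be reproduced from the address stream above if a reuse distance is taken to be the number of accesses strictly between two accesses to the same cache line. That definition is an assumption inferred from the slide's numbers: A at positions 1 and 7 gives 5, and B at positions 5 and 9 gives 3. The sketch computes every reuse distance exhaustively, whereas the sampler only measures a few; `reuse_distances` and `NO_REUSE` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

#define NO_REUSE ((size_t)-1)   /* the line is never touched again */

/* rd[k] = number of accesses strictly between access k and the next access
 * to the same line. Each char in trace names one cache line. */
static void reuse_distances(const char *trace, size_t n, size_t *rd)
{
    assert(trace != NULL && rd != NULL);
    for (size_t k = 0; k < n; k++) {
        rd[k] = NO_REUSE;
        for (size_t j = k + 1; j < n; j++) {
            if (trace[j] == trace[k]) {
                rd[k] = j - k - 1;
                break;
            }
        }
    }
}
```

For the stream `ABCCBDAEB` the first access to A gets distance 5 and the second access to B gets distance 3.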

  • UART: Efficient sparse sampling

    [Figure: access stream A B D B E B A F D B ... at positions 1 ... N, with watchpoint traps at the sampled accesses.]

    1. Use HW counter overflow to randomly select accesses to sample (e.g., on average every 1,000,000th access)

    2. Set a watchpoint for the data cacheline they touch

    3. Use HW counters to count #memory accesses until watchpoint trap

    Sampling overhead ~17% (10% at Acumem for long-running apps)

    (Modeling with math < 100ms)

  • Fingerprint ≈ Sparse reuse distance histogram

    [Figure: histogram h(d) over reuse distance d.]

  • Modeling random caches with math (Assumption: "constant" MissRatio)

    [Figure: access stream A B D B E B A F D B ... at positions 1 ... N, with a sampled reuse of A at $rd_i = 5$.]

    Miss? $p_{miss} = m(\#repl)$, where $\#repl \approx 5 \cdot MissRatio$

    [Figure: the miss equation $m$, plotting $p_{miss}$ against $\#repl$.]

  • The cacheline A is in a cache with L cachelines

    After 1 replacement: $(1 - 1/L)$ chance that A survives

    After R replacements: $(1 - 1/L)^R$ chance that A survives

    Assuming a fully associative cache

  • Modeling random caches with math (Assumption: "constant" MissRatio)

    [Figure: access stream A B D B E B A F D B ... at positions 1 ... N, with a sampled reuse at $rd_i = 5$.]

    Miss? $\#repl \approx 5 \cdot MissRatio$ gives $p_{miss} = m(5 \cdot MissRatio)$; $\#repl \approx 3 \cdot MissRatio$ gives $p_{miss} = m(3 \cdot MissRatio)$

    Miss equation: $m(repl) = 1 - (1 - 1/L)^{repl}$

    n samples: $MissRatio \cdot n = \sum_{i=0}^{n} m(rd(i) \cdot MissRatio)$

    Can be solved in a "fraction of a second" for different L:s

    [Figure: the miss equation $m$, plotting $p_{miss}$ against $\#repl$.]

  • Accuracy: Simulation vs. "math" (Random replacement)

    [Figure: miss ratio (%) vs. cache size (bytes) for vpr, gzip and ammp.]

    Comparing simulation (w/ slowdown 100x) and math ("fractions of a second")

  • Modeling LRU Caches: Stack distance...

    [Figure: access stream at positions 1 ... N with the sampled reuse pair A-A, $rd_i = 5$, Start=2, End=6.]

    Stack distance: how many unique data objects? Answer: 3

    If we know all reuses: how many of the reuses 2-6 go beyond End? Answer: 3

    $Stack\_distance = \sum_{k=Start}^{End} [d(k) > (End - k + 2)]$

    Foreach sample: if (Stack_distance > L) miss++ else hit++
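
The counting argument can be checked on a small trace: every unique line inside the window has exactly one last access there, and that access is the one whose next reuse goes beyond End. The sketch uses 0-based positions and its own index convention, stated in the comments, which may differ by a constant from the offsets in the slide's formula. `stack_distance` and `lru_miss` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Stack distance of the reuse pair (start, end), where trace[end] is the next
 * access to the line touched at trace[start]. Counts the accesses strictly
 * inside the window whose own next reuse falls beyond end (or never occurs). */
static size_t stack_distance(const char *trace, size_t n, size_t start, size_t end)
{
    assert(start < end && end < n && trace[start] == trace[end]);
    size_t count = 0;
    for (size_t k = start + 1; k < end; k++) {
        size_t next = k + 1;
        while (next < n && trace[next] != trace[k])
            next++;
        if (next > end)
            count++;
    }
    return count;
}

/* Hit/miss rule with the threshold as written on the slide. */
static int lru_miss(size_t stack_dist, size_t cache_lines)
{
    return stack_dist > cache_lines;
}
```

For the stream `ABCCBDAEB`, the pair of A accesses has reuse distance 5 but stack distance 3, since only B, C and D are touched in between.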

  • But we only know a few reuse distances...

    [Figure: the access stream at positions 1 ... N with sampled reuse distances d(1), d(2), d(3), and the histogram h(d) over reuse distance d.]

    Estimate: how many of the reuses 2-6 go beyond End? Answer: Est_SD

    $Est\_SD = \sum_{k=Start}^{End} p[d(k) > (End - k)]$

    Assume that the distribution (aka histogram) of sampled reuses is representative of all accesses in that "time window".
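
The estimate replaces each exact test with a probability taken from the sampled reuse distances. A minimal sketch, assuming reuse distances count the accesses strictly between two touches of a line, so that an access j positions before the end of the window escapes it when its reuse distance is at least j. `frac_at_least` and `est_stack_distance` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NO_REUSE ((size_t)-1)   /* sample whose line was never touched again */

/* Fraction of the sampled reuse distances that are at least j. */
static double frac_at_least(const size_t *samples, size_t n, size_t j)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (samples[i] >= j)
            hits++;
    return (double)hits / (double)n;
}

/* Expected number of unique lines in a window of rd accesses: sum, over the
 * window positions, of the probability that the reuse leaves the window. */
static double est_stack_distance(const size_t *samples, size_t n, size_t rd)
{
    assert(samples != NULL && n > 0);
    double est = 0.0;
    for (size_t j = 0; j < rd; j++)
        est += frac_at_least(samples, n, j);
    return est;
}
```

Two sanity checks follow from the definition: if no sampled line is ever reused, a window of rd accesses contains rd unique lines, and if every line is reused immediately, it contains one.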

  • All SPEC 2006

  • Modeling coherence

    [Figure: sampled access streams of Thread A (rA B rD B E B rA F rD B ...) and Thread B (B E B wA F rD B ...), each with watchpoint traps.]

          • Record coherence-related interaction at runtime (architecture independent)
          • Model coherence effects off-line
          • Can model different topologies and thread bindings off-line

  • (Need to be Efficient)³: Our Approach

    Stages: Machine-independent runtime information; Efficient modeling; Draw conclusions, build tools

    Steps: Gather runtime info; Solve equations; Add heuristics

    $MissRatio \cdot n = \sum_{i} \left(1 - (1 - 1/L)^{rd(i) \cdot MissRatio}\right)$

    1. Capture data locality information
    2. Measure impact of resource allocations
    3. Capture code usage information?
    4. Capture power properties?

    Clustering, K-means...

    Predict (for many options):
          • Cache statistics
          • Bandwidth requirement
          • Performance
          • Power consumption
          • Phase behavior ...

    Find "best":
          • Core type
          • Cache size
          • Thread scheduling
          • Frequency
          • Code optimizations …

  • Achievements

    The World's "best":

    1. Cache locality samplers & cache "simulator" (OH ~20%)
          • Cache hitrate model for data and instructions (~10ms)
          • Multi-threading model [a.k.a. Coherence model] (~10ms)
          • Cache sharing model (~10ms)
    2. Cache/BW quantitative measurements (OH ~5%)
          • Cache sharing model (~10ms)
          • Performance prediction & BW requirement (~10ms)
          • Cache sharing model (CPI & BW) (~10ms)
    3. DVFS models & run-time (power) management
    4. On-line phase detection tool (OH ~2%)
          • Phase-guided sampling
          • Phase-guided power management
    5. Simplest coherence protocol VIPS
          • Two states, self-invalidation, no directory

    [Figures: modelled [MB] vs. simulated [MB] cache allocation on multicore; misses vs. $ size [MB]; performance, bandwidth, CPI, BW and misses vs. $ size on real HW; phases over time; DVFS: Performance 98%, Energy 50%.]

  • Multi-threaded Case Study: Gauss-Seidel on Multicores

    From Wallin et al, ICS 2006

  • Criteria for HPC Algorithms

    Past:
          • Minimize communication
          • Maximize scalability (1000s of CPUs)

    Optimize for Multicore chip:
          • On-chip communication is "for free"
          • Scalability is limited to ~10 threads
          • The caches are tiny
          • Memory bandwidth is the bottleneck

    Data locality is key!

  • Selected HPC Wire Articles

          • More Than 16 Cores May Well Be Pointless. Sandia Labs, Dec 07 2008
          • Up Against the Memory Wall: "Never mind the cores. Just hand over the cache". Michael Feltman, Dec 11 2008
          • HPC@Intel: When to Say No to Parallelism. Sanjiv Shah, Intel. January 14 2009
          • Finding a Door in the Memory Wall. Erik Hagersten, Acumem. Feb-April 2009

  • Example: Gauss Seidel

    [Figure: grid of points labeled with iteration numbers 1 and 2, showing how far each sweep has progressed.]

    LOOP:
        UPDATE ALL POINTS
        IF (convergence_test)

    (Longer explanation: Finding a Door in the Memory Wall @ HPCWire)

    Mission: "Maximize the parallelism and minimize the inter-thread communication"
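
The loop above can be sketched in C. The transcript does not show the stencil, so the 5-point average on a square grid with fixed boundary values is an assumption, as are the names `gs_sweep` and `gs_solve` and the convergence test on the largest per-point change.

```c
#include <assert.h>
#include <stddef.h>

/* One Gauss-Seidel sweep over the interior of an n x n row-major grid.
 * Points are updated in place, so each update already sees the new values of
 * its upper and left neighbours. Returns the largest change made. */
static double gs_sweep(double *u, size_t n)
{
    assert(u != NULL && n >= 3);
    double max_change = 0.0;
    for (size_t i = 1; i + 1 < n; i++) {
        for (size_t j = 1; j + 1 < n; j++) {
            double old_val = u[i * n + j];
            double new_val = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                   + u[i * n + j - 1] + u[i * n + j + 1]);
            u[i * n + j] = new_val;
            double change = new_val > old_val ? new_val - old_val : old_val - new_val;
            if (change > max_change)
                max_change = change;
        }
    }
    return max_change;
}

/* LOOP: update all points, stop when the convergence test passes.
 * Returns the number of sweeps performed. */
static int gs_solve(double *u, size_t n, double tol, int max_sweeps)
{
    int sweeps = 0;
    while (sweeps < max_sweeps) {
        sweeps++;
        if (gs_sweep(u, n) < tol)
            break;
    }
    return sweeps;
}
```

The in-place update is what creates the dependence between neighbouring points that the following slides work around.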

  • State-of-the-art: Removing Dependence: Red/Black

    [Figure: grid of points in a checkerboard pattern, labeled with iteration numbers 1 and 2.]

    LOOP:
        UPDATE ALL RED POINTS
        UPDATE ALL BLACK POINTS
        IF (convergence_test)
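
A red/black sweep can be sketched as two half-sweeps over a checkerboard colouring. As before, the 5-point stencil and the names `rb_half_sweep` and `rb_sweep` are assumptions for illustration; red is taken to be the points where i + j is even.

```c
#include <assert.h>
#include <stddef.h>

/* Update all interior points of one colour (0 = red: i + j even, 1 = black).
 * Every neighbour of a point has the other colour, so the updates within one
 * half-sweep are independent of each other. */
static void rb_half_sweep(double *u, size_t n, size_t color)
{
    assert(u != NULL && n >= 3 && color <= 1);
    for (size_t i = 1; i + 1 < n; i++) {
        /* First interior column in row i that has the requested colour. */
        size_t j0 = 1 + ((i + 1 + color) % 2);
        for (size_t j = j0; j + 1 < n; j += 2)
            u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                 + u[i * n + j - 1] + u[i * n + j + 1]);
    }
}

static void rb_sweep(double *u, size_t n)
{
    rb_half_sweep(u, n, 0);   /* all red points */
    rb_half_sweep(u, n, 1);   /* all black points */
}
```

The independence within a half-sweep is what makes the scheme easy to parallelize, at the price of streaming the whole grid twice per iteration.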

  • State-of-the-art: Red/Black, Parallelism = $N^2/2$

    [Figure: the checkerboard grid split between Core 0 and Core 1, labeled with iteration numbers 1 and 2.]

    LOOP:
        IN PARALLEL: UPDATE ALL RED POINTS
        IN PARALLEL: UPDATE ALL BLACK POINTS
        IF (convergence_test)

    Limited communication. $N^2/2$ parallelism. Done! Only one problem…

  • Only One Problem: Performance

    [Figure: speedup (0 to 2) vs. # cores (0 to 8).]

  • Back to the drawing board: Temporal blocking for seq. code

    [Figure: grid with an active region sliding down. Legend: active region, current point, sweep path, data dependence, cacheline layout; 1,2,3,4 = iteration number.]

    LOOP:
        LOOP:
            UPDATE ALL POINTS IN ACTIVE REGION
            SLIDE DOWN THE REGION
        IF (convergence_test)

    Communication is "for free" and moderate parallelism is OK. Priority 1: limit bandwidth needs!

  • Back to the drawing board: Temporal blocking for seq. code

    [Figure: grid with an active region sliding down. Legend: active region, current point, sweep path, data dependence, cacheline layout; 1,2,3,4 = iteration number.]

    LOOP:
        LOOP:
            UPDATE ALL POINTS IN ACTIVE REGION
            SLIDE DOWN THE REGION
        IF (convergence_test)

    4 iterations in one sweep!

    Communication is "for free" and moderate parallelism is OK. Priority 1: limit bandwidth needs!
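
One way to realize several iterations in one sweep is a row-wise skew: iteration t may update row i as soon as iteration t-1 has updated row i+1, so T iterations can trail each other one row apart. The sketch below is an illustration under assumptions (5-point stencil, row granularity, hypothetical names `gs_plain` and `gs_blocked`), not the code from the cited paper. It performs the same updates in a dependence-preserving order, so it produces the same result as T plain sweeps.

```c
#include <assert.h>
#include <stddef.h>

/* In-place Gauss-Seidel update of interior row i of an n x n row-major grid. */
static void gs_update_row(double *u, size_t n, size_t i)
{
    for (size_t j = 1; j + 1 < n; j++)
        u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                             + u[i * n + j - 1] + u[i * n + j + 1]);
}

/* Reference: T full sweeps, each streaming the whole grid from memory. */
static void gs_plain(double *u, size_t n, size_t T)
{
    for (size_t t = 0; t < T; t++)
        for (size_t i = 1; i + 1 < n; i++)
            gs_update_row(u, n, i);
}

/* Temporally blocked: T iterations in one pass over the grid. At step p,
 * iteration t updates row p - t, so the active region spans about T rows
 * and slides down one row per step. */
static void gs_blocked(double *u, size_t n, size_t T)
{
    assert(u != NULL && n >= 3 && T >= 1);
    size_t last = n - 2;                    /* last interior row */
    for (size_t p = 1; p <= last + T - 1; p++) {
        for (size_t t = 0; t < T; t++) {
            if (p < t + 1)
                break;                      /* row p - t is above the interior */
            size_t i = p - t;
            if (i <= last)
                gs_update_row(u, n, i);
        }
    }
}
```

If the active region fits in the cache, each row is fetched from DRAM once per T iterations instead of once per iteration, which is the bandwidth saving the slide prioritizes.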

  • DRAM_traffic(cache_size)

    Fetch rate, i.e., fraction of mem_ops generating DRAM traffic

    [Figure: fetch rate (0% to 3%) vs. cache size (256k to 512M) for Red/Black and Block=1, Block=2, Block=4, Block=8, Block=16.]

  • G-S, temp block: Parallelism = N

    [Figure: the grid divided among Core 0, Core 1, Core 2 and Core 3, with synchronization flags 0 1 2 3 between them. Legend: active region, current point, sweep path, data dependence, cacheline layout; 1,2,3,4 = iteration number; flag value = sync flag iteration no.]

    Wait until "lefty" is done. Lots of communication:
          • Producer/Consumer Flag
          • Sharing of data values

    Only N-fold parallelism

  • Problems we ran into, 1 (2)

    512 elements = 64 cache lines

    [Figure: the grid divided among Core 0 to Core 3, 512 elements each, showing where each core's data indexes into the L2 $. Annotation: 16 cache lines.]

  • Problems we ran into, 2 (2)

    We had a loop nesting problem that the compiler optimized away ... sometimes

  • Running on a Multisocket

    [Figure: multisocket system with I/F blocks and DRAM. Annotation: 100.]

    Coherence = Non-Uniform Coherence

  • Example: G-S, temp blocking

    [Figure: the grid divided among Core 0 to Core 3 with synchronization flags, now with PADDING inserted. Legend: active region, current point, sweep path, data dependence, cacheline layout; 1,2,3,4 = iteration number; flag value = sync flag iteration no.]

    PADDING
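
The transcript does not preserve the details of the indexing problem, but the general effect of padding can be illustrated: when each core's block starts a power-of-two number of bytes after the previous one, the blocks can map to the same cache sets, and padding each block by one cache line spreads them out. All parameters below (8-byte elements, 64-byte lines, 64 sets, array base aligned to the set span) and the name `set_index` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum {
    LINE_BYTES = 64,   /* assumed cache line size */
    ELEM_BYTES = 8,    /* assumed element size: 512 elements = 64 lines */
    NUM_SETS   = 64    /* hypothetical number of cache sets */
};

/* Cache set that element `elem` of block `block` maps to, when each block
 * holds block_elems elements and the array base is set-aligned. */
static size_t set_index(size_t block, size_t elem, size_t block_elems)
{
    assert(block_elems > 0 && elem < block_elems);
    size_t byte_offset = (block * block_elems + elem) * ELEM_BYTES;
    return (byte_offset / LINE_BYTES) % NUM_SETS;
}
```

With 512-element blocks, the first element of every block lands in set 0; with 520-element blocks (512 plus one line of padding), blocks 0 to 3 start in sets 0, 1, 2 and 3.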

  • Lessons Learned: Optimize cache usage BEFORE parallelizing

    [Figure: performance (0 to 7) vs. # cores (0 to 8) for Red/Black and Block=8. Annotation: 3x.]

    [Wallin, Löf, Holmgren, Hagersten @ ICS 2006]

    Demo Time!

    G-S: DanW:s code, Optimized