
Optimizing for MP:s

Erik Hagersten, Uppsala University, Sweden


Dept of Information Technology| www.it.uu.se © Erik Hagersten| user.it.uu.se/~ehOPT 2

AVDARK2012

Cache Waste

/* Unoptimized */
for (s = 0; s < ITERATIONS; s++) {
    for (j = 0; j < HUGE; j++)
        x[j] = x[j+1];   /* will hog the cache but not benefit */
    for (i = 0; i < SMALLER_THAN_CACHE; i++)
        y[i] = y[i+1];   /* will be evicted between usages */
}

/* Optimized */
for (s = 0; s < ITERATIONS; s++) {
    for (j = 0; j < HUGE; j++) {
        PREFETCH_NTA x[j+1];   /* will be installed in L1, but not L3 (AMD) */
        x[j] = x[j+1];
    }
    for (i = 0; i < SMALLER_THAN_CACHE; i++)
        y[i] = y[i+1];   /* will always hit in the cache */
}

Also important for single-threaded applications if they

are co-scheduled and share cache with other applications.
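A minimal sketch of the streaming loop in C, assuming GCC or Clang. The builtin `__builtin_prefetch` with locality 0 stands in for the slide's PREFETCH_NTA, and the function name `shift_left_nta` is hypothetical. The prefetch is only a hint: whether the line really stays out of the outer cache levels depends on the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Shift x[] left by one element while hinting that the streamed data has
 * no temporal locality. The third argument 0 asks for the lowest temporal
 * locality; on x86 this is typically emitted as prefetchnta. The hint never
 * changes the result of the loop. */
void shift_left_nta(double *x, size_t n)
{
    assert(x != NULL);
    for (size_t j = 0; j + 1 < n; j++) {
        __builtin_prefetch(&x[j + 1], 0 /* read */, 0 /* non-temporal */);
        x[j] = x[j + 1];
    }
}
```

Issuing one hint per element mirrors the slide. In practice one hint per cacheline is enough.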


UART Research: Hints to avoid cache pollution (non-temporal prefetches)

[Figure: cache misses vs. cache size, marking the actual cache size and actual/4, and the miss rate and 2x miss rate. The larger the cache, the better. Hint: "Don't allocate!"]

[Figure: throughput (0 to 3) for one instance and four instances, comparing the original code (limit = 1.7 MB) with the hinted code (limit = actual/4). Annotation: 40% faster.]


Categorizing and avoiding cache waste

[Figure: miss rate vs. cache size ($-size) curves with "cache hogging" and "∆ benefit" marked, next to a CPU / L1 / L2 / Mem hierarchy.]

No point in caching! Use per-instruction cache avoidance (prefetch.nta).

Application classification (hogging vs. ∆ benefit): don't care, slows others, slowed by others, slows & slowed.

[Figure: bzip, LBM and LQ placed in this classification.]

Automatic "taming" of the hoggers

[Figure: performance (0 to 1.2) on AMD Opteron for bzip2, Libquantum, LBM and the geometric mean; bars: individually, in mix, in mix patched. Annotation: 25%.]

Andreas Sandberg, David Eklov and Erik Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. In Proceedings of Supercomputing (SC), New Orleans, LA, USA, November 2010.


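Coherence traffic

ORIG:

Thread 0:
    int a, total;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;               /* a is written by both threads */
    }
    join()
    total = a;

Child:
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;
    }

OPT:

Thread 0:
    int a, total;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;
    }
    join()
    total += a;

Child:
    int b;                 /* private to the child */
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        b++;
    }
    total += b;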


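False sharing

ORIG:

Thread 0:
    int a, b;              /* a and b share a cacheline */
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        ...
        a++;
    }
    join()
    total = a + b;

Child:
    for (int i = 0; i < HUGE; i++) {
        ...
        b++;
    }

OPT:

Thread 0:
    int a;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        ...
        a++;
    }
    join()
    total += a;

Child:
    int b;                 /* private to the child */
    for (int i = 0; i < HUGE; i++) {
        ...
        b++;
    }
    total += b;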
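When the counters cannot simply be made thread-private locals, padding is a common alternative. The sketch below assumes 64-byte cachelines and C11. All names (`padded_counter`, `count_event`, `total_events`) and the thread count are hypothetical. Thread creation is left out, so the code shows only the data layout.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; check the target CPU */
#define NTHREADS   4    /* hypothetical thread count */

/* One counter per thread, each alone on its own cacheline. */
struct padded_counter {
    _Alignas(CACHE_LINE) long value;
};

static struct padded_counter counters[NTHREADS];

/* Called by thread `tid` inside its work loop: touches only its own line. */
void count_event(int tid)
{
    assert(tid >= 0 && tid < NTHREADS);
    counters[tid].value++;
}

/* Called once after all threads have joined: the only cross-thread reads. */
long total_events(void)
{
    long total = 0;
    for (int t = 0; t < NTHREADS; t++)
        total += counters[t].value;
    return total;
}
```

The alignment requirement rounds the struct size up to a full line, so neighbouring array elements never share one.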


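Coherence Utilization

ORIG:

Thread 0:
    vec_type x[HUGE];
    for (i = 0; i < HUGE; i++) {
        ...
        x[i].a++;
    }
    spawn_child()
    ...
    join()

Child (Thread 1):
    for (i = 0; i < HUGE; i++) {
        y[i] = x[i].a;     /* uses only field a of each element */
    }

Memory layout: x[0] = a b c d e f, x[1] = a b c d e f, x[2] = a b ...

struct vec_type {
    int a; int b; int c; int d; int e; int f;
};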
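With 4-byte ints, each element above is 24 bytes, so only 4 of every 24 bytes moved between the threads (about 17%) are actually used by the child. One possible remedy, which is an assumption here and not shown on the slide, is to store the communicated field densely in its own array. The names `vec_soa`, `gather_a_aos` and `gather_a_soa` and the size `N` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024   /* hypothetical element count */

/* Layout from the slide: six ints per element, consumer reads only `a`. */
struct vec_type { int a, b, c, d, e, f; };

/* Alternative layout: each field in its own dense array, so every
 * cacheline of a[] holds nothing but values the consumer needs. */
struct vec_soa {
    int a[N];
    int b[N], c[N], d[N], e[N], f[N];
};

/* Consumer loop on the original layout: strides 24 bytes per useful int. */
void gather_a_aos(const struct vec_type *x, int *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i].a;
}

/* Consumer loop on the dense layout: reads consecutive ints. */
void gather_a_soa(const struct vec_soa *x, int *y, size_t n)
{
    assert(n <= N);
    for (size_t i = 0; i < n; i++)
        y[i] = x->a[i];
}
```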


A Bad Example: "POUNDING"

proc lock(lock_variable) {
    while (TAS[lock_variable] == 1) {}   /* bang on the lock until free */
}

proc unlock(lock_variable) {
    lock_variable := 0
}

Assume: The function TAS (test and set)
-- returns the current memory value and atomically writes the busy pattern "1" to the memory

Generates too much traffic!!
-- spinning threads produce traffic!


Optimistic Test&Set Lock "spinlock"

proc lock(lock_variable) {
    while true {
        if (TAS[lock_variable] == 0) break;   /* bang on the lock once, done if TAS==0 */
        while (lock_variable != 0) {}         /* spin locally in your cache until "0" observed */
    }
}

proc unlock(lock_variable) {
    lock_variable := 0
}

Much less coherence traffic!!
-- still lots of traffic at lock handover!
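The optimistic lock can be sketched with C11 atomics, where `atomic_exchange` plays the role of TAS. The type and function names (`tas_lock_t`, `tas_trylock`, `tas_lock`, `tas_unlock`) are hypothetical, and the memory orders are one reasonable choice for a lock, not something the slide specifies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef atomic_int tas_lock_t;   /* 0 = free, 1 = busy */

/* One TAS: atomically write 1 and report whether the lock was free. */
bool tas_trylock(tas_lock_t *l)
{
    assert(l != NULL);
    return atomic_exchange_explicit(l, 1, memory_order_acquire) == 0;
}

void tas_lock(tas_lock_t *l)
{
    for (;;) {
        if (tas_trylock(l))          /* bang on the lock once */
            return;
        /* Spin with plain loads: these hit in the local cache and cause
         * no coherence traffic until the holder writes 0. */
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;
    }
}

void tas_unlock(tas_lock_t *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}
```

Dropping the inner `while` loop gives the "pounding" variant, where every spin iteration is an atomic write.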


Uppsala Programming for Multicore Architecture Center

62 MSEK grant / 10 years [$9M/10y]
+ related additional grants at UU = 130 MSEK

Research areas:
• Performance modeling
• New parallel algorithms
• Scheduling of threads and resources
• Testing & verification
• Language technology
• MC in wireless and sensors

Erik:


StatCache: Insight and Efficiency
Slowdown 10% (for long-running applications)

[Figure: StatCache flow. On the host computer, online sampling (Sparse Sampler) randomly selects accesses to monitor in the address stream (1: read A, 2: read B, 3: read C, 4: write C, 5: read B, 6: read D, 7: read A, 8: read E, 9: read B) and records their reuse distances (5, 3, ...) as the application fingerprint. Offline, the "Insight Technology" (ThreadSpotter) feeds the fingerprint and the architectural parameters of the target architecture (cores, L1, L2, mem) into a probabilistic cache model, producing modeled behavior and advice.]


UART: Efficient sparse sampling

[Figure: access stream A B D B E B A F D B ..., numbered 1 to N, with watchpoint traps at the reuses of the sampled accesses.]

1. Use HW counter overflow to randomly select accesses to sample (e.g. on average every 1,000,000th access)
2. Set a watchpoint for the data cacheline they touch
3. Use HW counters to count #memory accesses until watchpoint trap

Sampling overhead ~17% (10% at Acumem for long-running apps)
(Modeling with math < 100 ms)
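What one sample measures can be illustrated on a recorded trace. This sketch is not the sampler itself, which uses hardware counters and watchpoints. It walks the slide's access stream, with one letter per cacheline, and counts the accesses between a sampled access and the next touch of the same line. The function name `reuse_distance` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of accesses strictly between trace[start] and the next access to
 * the same cacheline, or -1 if the line is never touched again. */
long reuse_distance(const char *trace, size_t n, size_t start)
{
    assert(start < n);
    for (size_t k = start + 1; k < n; k++)
        if (trace[k] == trace[start])
            return (long)(k - start - 1);
    return -1;
}
```

For the stream A B D B E B A F D B, sampling the first A gives a reuse distance of 5, and sampling the third B gives 3.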


Fingerprint ≈ Sparse reuse distance histogram

[Figure: histogram h(d) over reuse distance d.]


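Modeling random caches with math
(Assumption: "Constant" MissRatio)

[Figure: access stream A B D B E B A F D B ..., numbered 1 to N; the sampled reuse pair A ... A has reuse distance rd_i = 5.]

Miss? $p_{miss} = m(\#repl)$, where $\#repl \approx 5 \cdot MissRatio$

[Figure: the miss equation m, plotting p_miss against #repl.]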


The cacheline A is in a cache with L cachelines
(Assuming a fully associative cache)

After 1 replacement: $(1 - 1/L)$ chance that A survives

After R replacements: $(1 - 1/L)^R$ chance that A survives
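As a worked example with hypothetical numbers: a cache of L = 1000 lines that sees R = 1000 random replacements keeps A a little more than a third of the time.

```latex
\[
P(\text{A survives } R \text{ replacements}) = \left(1 - \frac{1}{L}\right)^{R}
\]
\[
L = 1000,\ R = 1000:\qquad
\left(1 - \frac{1}{1000}\right)^{1000} \approx e^{-1} \approx 0.368,
\qquad
P(\text{A evicted}) \approx 1 - 0.368 = 0.632
\]
```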


Modeling random caches with math
(Assumption: "Constant" MissRatio)

[Figure: access stream A B D B E B A F D B ..., numbered 1 to N, with sampled reuse pairs of reuse distance 5 and 3.]

Miss? For $rd_i = 5$: $\#repl \approx 5 \cdot MissRatio$, so $p_{miss} = m(5 \cdot MissRatio)$
For $rd_i = 3$: $\#repl \approx 3 \cdot MissRatio$, so $p_{miss} = m(3 \cdot MissRatio)$

Miss equation: $m(repl) = 1 - (1 - 1/L)^{repl}$

n samples: $MissRatio \cdot n = \sum_{i=0}^{n} m(rd(i) \cdot MissRatio)$

Can be solved in a "fraction of a second" for different L:s


Accuracy: Simulation vs. "math" (Random replacement)

[Figure: miss ratio (%) vs. cache size (bytes) for vpr, gzip and ammp.]

Comparing simulation (w/ slowdown 100x) and math ("fractions of a second")


Modeling LRU Caches: Stack distance...

[Figure: access stream numbered 1 to N with the sampled reuse pair A-A; Start = 2, End = 6, rd_i = 5.]

Stack distance: How many unique data objects? Answer: 3

If we know all reuses: How many of the reuses 2-6 go beyond End? Answer: 3

$Stack\_distance = \sum_{k=Start}^{End} [d(i) > (End - k + 2)]$

Foreach sample: if (Stack_distance > L) miss++ else hit++
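The counting argument can be checked on a small trace. The sketch below uses a hypothetical stream A B C C B D A, not the one in the figure, with one letter per cacheline. It counts an access inside the window only if that line is not touched again before End, which yields the number of unique lines. The names `stack_distance` and `lru_miss` are hypothetical. The miss test copies the slide's comparison; some definitions count the reused line itself and shift the threshold by one.

```c
#include <assert.h>
#include <stddef.h>

/* Stack distance of the reuse pair (first, last): number of distinct lines
 * accessed in trace[first+1 .. last-1]. Each access contributes 1 only if
 * its own next reuse lies beyond the end of the window. */
size_t stack_distance(const char *trace, size_t first, size_t last)
{
    assert(first < last && trace[first] == trace[last]);
    size_t unique = 0;
    for (size_t k = first + 1; k < last; k++) {
        int reused_inside = 0;
        for (size_t j = k + 1; j < last; j++)
            if (trace[j] == trace[k]) { reused_inside = 1; break; }
        if (!reused_inside)
            unique++;
    }
    return unique;
}

/* LRU cache with `lines` cachelines, using the slide's comparison. */
int lru_miss(const char *trace, size_t first, size_t last, size_t lines)
{
    return stack_distance(trace, first, last) > lines;
}
```

In the hypothetical stream the pair A-A has reuse distance 5 but stack distance 3, because B and C are each touched twice inside the window.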


But we only know a few reuse distances...

[Figure: the same access stream with sampled reuse distances d(1), d(2), d(3), and their histogram h(d) over d.]

Estimate: How many of the reuses 2-6 go beyond End? Answer: Est_SD

$Est\_SD = \sum_{k=Start}^{End} p[d(i) > (End - k)]$

Assume that the distribution (aka histogram) of sampled reuses is representative for all accesses in that "time window"
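A minimal sketch of the estimate, assuming the fingerprint is stored as an array `hist` where `hist[d]` counts the sampled reuse pairs with reuse distance d. The function names and the array representation are hypothetical. For a sample with reuse distance rd, the term End - k takes the values rd-1 down to 0, so the sum runs over those gaps.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of sampled reuse distances that are longer than x,
 * i.e. the histogram's estimate of p[d > x]. */
double frac_longer(const unsigned long *hist, size_t nbins, size_t x)
{
    unsigned long total = 0, longer = 0;
    for (size_t d = 0; d < nbins; d++) {
        total += hist[d];
        if (d > x)
            longer += hist[d];
    }
    assert(total > 0);
    return (double)longer / (double)total;
}

/* Estimated stack distance for one sample with reuse distance rd:
 * each of the rd intervening accesses counts with the probability
 * that its own reuse reaches beyond the end of the window. */
double est_stack_distance(const unsigned long *hist, size_t nbins, size_t rd)
{
    double est = 0.0;
    for (size_t gap = 0; gap < rd; gap++)
        est += frac_longer(hist, nbins, gap);
    return est;
}
```

If half of the sampled reuses are immediate and half are long, a sample with reuse distance 4 gets an estimated stack distance of 2.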



All SPEC 2006


Modeling coherence

[Figure: access streams of Thread A (rA B rD B E B rA F rD B ...) and Thread B (B E B wA F rD B ...), with watchpoint traps on the sampled accesses.]

• Record coherence-related interaction at runtime (architecture independent)
• Model coherence effects off-line
• Can model different topologies and thread bindings off-line


Our Approach

Machine-independent runtime information: gather runtime info
Efficient modeling (need to be efficient): solve equations
Draw conclusions, build tools: add heuristics

Runtime information:
1. Capture data locality information
2. Measure impact of resource allocations
3. Capture code usage information?
4. Capture power properties?

Modeling: $MissRatio \cdot n = \sum_{i=0}^{n} \left(1 - (1 - 1/L)^{rd(i) \cdot MissRatio}\right)$; clustering, K-means...

Predict (for many options):
• Cache statistics
• Bandwidth requirement
• Performance
• Power consumption
• Phase behavior
...

Find "best":
• Core type
• Cache size
• Thread scheduling
• Frequency
• Code optimizations
…


Achievements

The World's "best":

1. Cache locality samplers & cache "simulator" (OH ~20%)
   • Cache hitrate model for data and instructions (~10ms)
   • Multi-threading model [a.k.a. Coherence model] (~10ms)
   • Cache sharing model (~10ms)
2. Cache/BW quantitative measurements (OH ~5%)
   • Cache sharing model (~10ms)
   • Performance prediction & BW requirement (~10ms)
   • Cache sharing model (CPI & BW) (~10ms)
3. DVFS models & run-time (power) management
4. On-line phase detection tool (OH ~2%)
   • Phase-guided sampling
   • Phase-guided power management
5. Simplest coherence protocol VIPS
   • Two states, self-invalidation, no directory

[Figures: modelled vs. simulated cache allocation on multicore [MB]; misses vs. $ size [MB]; performance and bandwidth vs. $ size on real HW; CPI, BW and misses over time, with phases. DVFS: performance 98%, energy 50%.]

Multi-threaded Case Study: Gauss-Seidel on Multicores

From Wallin et al, ICS 2006


Criteria for HPC Algorithms

Past:
• Minimize communication
• Maximize scalability (1000s of CPUs)

Optimize for Multicore chip:
• On-chip communication is "for free"
• Scalability is limited to ~10 threads
• The caches are tiny
• Memory bandwidth is the bottleneck

Data locality is key!


Selected HPC Wire Articles

• More Than 16 Cores May Well Be Pointless. Sandia Labs, Dec 07 2008
• Up Against the Memory Wall: "Never mind the cores. Just hand over the cache". Michael Feltman, Dec 11 2008
• HPC@Intel: When to Say No to Parallelism. Sanjiv Shah, Intel. January 14 2009
• Finding a Door in the Memory Wall. Erik Hagersten, Acumem. Feb-April 2009


Example: Gauss Seidel

[Figure: grid of points labelled with iteration numbers 1 and 2, showing how far each iteration has progressed.]

LOOP:
    UPDATE ALL POINTS
    IF (convergence_test)

(Longer explanation: Finding a Door in the Memory Wall @ HPCWire)

Mission: "Maximize the parallelism and minimize the inter-thread communication"


State-of-the-art: Removing Dependence: Red/Black

[Figure: grid of points coloured red and black in a checkerboard pattern, labelled with iteration numbers 1 and 2.]

LOOP:
    UPDATE ALL RED POINTS
    UPDATE ALL BLACK POINTS
    IF (convergence_test)
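The two orderings can be compared in a small C sketch. The slides do not state the stencil, so the 5-point average used here is an assumption, as are the grid size `N` and all function names. In `gs_sweep` each point depends on the freshly updated upper and left neighbours. In `colour_sweep` all points of one colour depend only on points of the other colour, so they can be updated in any order, or in parallel.

```c
#include <assert.h>

#define N 4   /* hypothetical grid size, including the fixed boundary */

/* Plain Gauss-Seidel sweep: points must be visited in order, because each
 * update reads values written earlier in the same sweep. */
void gs_sweep(double u[N][N])
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
}

/* Update all interior points of one colour:
 * colour 0 = red ((i + j) even), colour 1 = black ((i + j) odd). */
void colour_sweep(double u[N][N], int colour)
{
    assert(colour == 0 || colour == 1);
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            if ((i + j) % 2 == colour)
                u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
}

/* One red/black iteration: all red points, then all black points. */
void red_black_sweep(double u[N][N])
{
    colour_sweep(u, 0);
    colour_sweep(u, 1);
}
```

Note that the two orderings produce different intermediate values, and that a red/black iteration passes over the whole grid twice.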


State-of-the-art: Red/Black, Parallelism = N^2/2

[Figure: the red/black grid split between Core 0 and Core 1, labelled with iteration numbers 1 and 2.]

LOOP:
    IN PARALLEL: UPDATE ALL RED POINTS
    IN PARALLEL: UPDATE ALL BLACK POINTS
    IF (convergence_test)

• Limited communication
• N^2/2 parallelism
• Done!

Only one problem…


Only One Problem: Performance

[Figure: speedup (0 to 2) vs. # cores (0 to 8).]


Back to the drawing board: Temporal blocking for seq. code

[Figure: grid with a sliding active region; points are labelled with iteration numbers 1, 2, 3, 4. Legend: active region, current, sweep path, data dependence, iteration number, cacheline layout.]

LOOP:
    LOOP:
        UPDATE ALL POINTS IN ACTIVE REGION
        SLIDE DOWN THE REGION

Communication is "for free" and moderate parallelism is OK
Priority 1: limit bandwidth needs!


Back to the drawing board: Temporal blocking for seq. code

[Figure: the active region further down the grid; points are labelled with iteration numbers 1 to 4. Legend: active region, current, sweep path, data dependence, cacheline layout.]

LOOP:
    LOOP:
        UPDATE ALL POINTS IN ACTIVE REGION
        SLIDE DOWN THE REGION

4 iterations in one sweep!

Communication is "for free" and moderate parallelism is OK
Priority 1: limit bandwidth needs!
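The idea can be sketched on a 1-D grid, which is a simplification of the 2-D case on the slides. All names are hypothetical and the two-point average is an assumed stencil. `gs_blocked` makes a single pass over the data. When the front is at position p, it applies iteration t to point p - t, so a point receives all its iterations while it is still in the cache. The dependences are respected: the left neighbour has already received iteration t, and the right neighbour has received exactly iteration t - 1.

```c
#include <assert.h>
#include <stddef.h>

/* One in-place Gauss-Seidel update of interior point r. */
static void update(double *u, size_t r)
{
    u[r] = 0.5 * (u[r - 1] + u[r + 1]);
}

/* Reference: `iters` full sweeps, each streaming the whole array. */
void gs_plain(double *u, size_t n, int iters)
{
    assert(n >= 3);
    for (int t = 0; t < iters; t++)
        for (size_t r = 1; r + 1 < n; r++)
            update(u, r);
}

/* Temporally blocked: one sweep performs all `iters` iterations. */
void gs_blocked(double *u, size_t n, int iters)
{
    assert(n >= 3 && iters >= 1);
    size_t last = n - 2;                       /* last interior point */
    for (size_t p = 1; p <= last + (size_t)(iters - 1); p++)
        for (int t = 0; t < iters; t++) {
            if (p < (size_t)t + 1)
                break;                         /* p - t would leave the grid */
            size_t r = p - (size_t)t;
            if (r <= last)
                update(u, r);                  /* iteration t of point r */
        }
}
```

Both functions perform the same updates with the same operands, only in a different order, so they produce identical results. The blocked version touches a window of about `iters` points at a time instead of streaming the array `iters` times.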


DRAM_traffic(cache_size)

[Figure: fetch rate (0% to 3%) vs. cache size (256k to 512M) for Red/Black and for Block = 1, 2, 4, 8, 16.]

Fetch rate, i.e., fraction of mem_ops generating DRAM traffic


G-S, temp block, Parallelism = N

[Figure: the temporally blocked grid split column-wise across Core 0 to Core 3, with synchronization flags between neighbouring cores. Legend: active region, current, sweep path, data dependence, cacheline layout, sync flag iteration no.]

Wait until "lefty" is done: lots of communication
• Producer/Consumer Flag
• Sharing of data values

Only N-fold parallelism


Problems we ran into 1 (2)

[Figure: the temporally blocked grid, 512 elements = 64 cache lines wide, split across four cores (16 cache lines each), showing how the regions of Core 0 to Core 3 index into the L2 cache.]


Problems we ran into 2 (2)

We had a loop nesting problem that the compiler optimized away ... sometimes


Running on a Multisocket

[Figure: two sockets connected through interfaces (I/F), each with its own DRAM. Annotation: 100.]

Coherence = Non-Uniform Coherence


Example: G-S, temp blocking

[Figure: the temporally blocked grid split across four cores, with PADDING inserted. Legend: active region, current, sweep path, data dependence, cacheline layout, sync flag iteration no.]


[Figure: performance (0 to 7) vs. # cores (0 to 8) for Red/Black and Block=8. Annotation: 3x.]

Lessons Learned: Optimize cache usage BEFORE parallelizing

[Wallin, Löf, Holmgren, Hagersten @ ICS 2006]

Demo Time!

G-S: DanW:s code
Optimized
