Reducing OLTP Instruction Misses with Thread Migration
Islam Atta, Pınar Tözün, Anastasia Ailamaki, Andreas Moshovos
University of Toronto / École Polytechnique Fédérale de Lausanne
2
OLTP on an Intel Xeon X5660 (Shore-MT, hyper-threading disabled)
IPC < 1 on a 4-issue machine
[Chart: Instructions per Cycle for TPC-C and TPC-E (higher is better)]
[Chart: Breakdown of Core Stalls for TPC-C and TPC-E: Resource (includes data) vs. Instructions]
70-80% of stalls are instruction stalls
3
OLTP L1 Instruction Cache Misses
Trace simulation, 4-way L1-I cache, Shore-MT
[Chart: misses per k-instruction vs. cache size (16KB-1024KB) for TPC-C and TPC-E (lower is better); annotation at the small-cache end: "Most common today!"]
~512KB is enough for the OLTP instruction footprint
4
Reducing Instruction Stalls at the hardware level
• Larger L1-I cache size: higher access latency
• Different replacement policies: do not really affect OLTP workloads
• Advanced prefetching: too much space overhead (40KB per core)
• Simultaneous multi-threading: increases IPC per hardware context, but pollutes the cache
5
Alternative: Thread Migration
• Enables usage of aggregate L1-I capacity
  – Large cache size without increased latency
• Can exploit instruction commonality
  – Localizes common transaction instructions
• Dynamic hardware solution
  – More general purpose
6
Transactions Running in Parallel
[Diagram: threads T1, T2, T3, each divided into instruction parts that can fit into the L1-I; common instructions among the concurrent threads are highlighted]
7
Scheduling Threads
[Diagram: threads T1-T3 assigned to cores 0-3 over time, traditional scheduling vs. TMi; the per-step L1-I miss counts add up to a much lower total under TMi]
8
TMi
[Diagram: transactions A (T1, T2) and B (T3, T4) sharing cores 0 and 1 over time]
• Group threads
• Wait till the L1-I is almost full
  – Count misses
  – Record the last N misses
  – Misses > threshold => migrate
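The trigger condition above can be sketched in software. This is a minimal toy model, not the hardware mechanism: the class name and method names are invented for illustration, while the default threshold (256) and miss-history length (6) mirror the experimental-setup slide.

```python
from collections import deque

class L1ITriggerModel:
    """Toy model of TMi's migration trigger for one core's L1-I cache.

    Counts misses since the last migration, remembers the addresses of
    the last N misses, and signals a migration once the miss count
    exceeds a threshold (i.e., the cache is treated as almost full of
    this thread's instruction footprint).
    """

    def __init__(self, miss_threshold=256, last_n=6):
        self.miss_threshold = miss_threshold
        self.recent_misses = deque(maxlen=last_n)  # last N miss addresses
        self.miss_count = 0

    def on_instruction_fetch(self, addr, hit):
        """Record one fetch; return True when the thread should migrate."""
        if hit:
            return False
        self.miss_count += 1
        self.recent_misses.append(addr)
        return self.miss_count > self.miss_threshold

    def reset(self):
        """Clear the miss counter after a migration decision is taken."""
        self.miss_count = 0
```

The recorded `recent_misses` are what the next slide consults when choosing a destination core.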
9
TMi
[Diagram: T1 and T2 of transaction A migrating between cores 0 and 1 over time]
Where to migrate? Check the last N misses recorded in other caches:
1) No matching cache => move to an idle core if one exists
2) Matching cache => move to that core
3) None of the above => do not move
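The destination policy above can be written out as a small function. This is an illustrative sketch only: the function and field names are invented, and treating "matching" as a set overlap between a thread's needed instruction blocks and a cache's recorded misses is an assumption about what the hardware comparison does.

```python
def pick_destination(needed_blocks, cores, current_core):
    """Choose where a migrating thread should go, per the TMi policy.

    `needed_blocks` is the set of instruction-block addresses the thread
    recently missed on; `cores` maps core id -> {"idle": bool,
    "recent_misses": set of block addresses recorded at that core's L1-I}.
    """
    # Case 2: a matching cache => move to that core, since the needed
    # instructions are likely already resident there.
    for cid, state in cores.items():
        if cid != current_core and state["recent_misses"] & needed_blocks:
            return cid
    # Case 1: no matching cache => move to an idle core if one exists.
    for cid, state in cores.items():
        if cid != current_core and state["idle"]:
            return cid
    # Case 3: none of the above => do not move.
    return current_core
```

Checking for a match before falling back to an idle core reflects the intent of the slide: landing on a core that already holds the needed instructions is what turns would-be misses into hits.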
10
Experimental Setup
• Trace simulation
  – PIN to extract instruction & data accesses per transaction
  – 16-core system
  – 32KB 8-way set-associative L1 caches
  – Miss threshold is 256
  – Last 6 misses are kept
• Shore-MT as the storage manager
  – Workloads: TPC-C, TPC-E
11
Impact on L1-I Misses
[Chart: instruction misses per k-instruction for No Migration, TMi, and TMi Blind, on TPC-C and TPC-E (lower is better)]
Instruction misses reduced by half
12
Impact on L1-D Misses
[Chart: misses per k-instruction, split into Instruction, Read Data, and Write Data, for No Migration, TMi, and TMi Blind, on TPC-C and TPC-E (lower is better)]
Cannot ignore increased data misses
13
TMi's Challenges
• Dealing with the data left behind
  – Prefetching
• Depends on thread identification
  – Software assisted
  – Hardware detection
• OS support needed
  – Disabling OS control over thread scheduling
14
Conclusion
• ~50% of the time OLTP stalls on instructions
• Spread computation through thread migration
• TMi
  – Halves L1-I misses
  – Time-wise, ~30% expected improvement
  – Data misses should be handled
Thank you!
15
BACKUP
16
L1-I Misses per K-Instruction
[Chart: instruction MPKI split into Capacity, Conflict, and Compulsory misses, for cache sizes 16KB-1MB at 2-way, 4-way, 8-way, and full associativity, on TPC-C and TPC-E]
17
L1-D Misses per K-Instruction
[Chart: data MPKI split into Capacity, Conflict, and Compulsory misses, for cache sizes 16KB-1MB at 8-way and full associativity, on TPC-C and TPC-E]
18
Replacement Policies
[Chart: instruction and data MPKI under LRU, LIP, BIP, and DIP, on TPC-C and TPC-E]
19
Experimental Setup
Intel Xeon X5660 Server:
  #Sockets: 2
  #Cores in a Socket: 6 (OoO)
  #HW Contexts: 24
  Clock Speed: 2.80GHz
  Memory: 48GB
  LLC (L3): 12MB
  L2 (per core): 256KB
  L1 (per core): 32KB (both I and D)
  Hyper-Threading: Enabled
  OS: Ubuntu 10.04 with Linux kernel 2.6.32
• Intel VTune 2011
  – Interface for hardware counters
• Working set fits in RAM
• Log flushed to RAM
• Each run:
  – Starts with the initial database
  – Each worker executes 1000 xcts before VTune starts collecting numbers for 60 secs
20
Formulas
• IPC = INST_RETIRED.ANY_P / CPU_CLK_UNHALTED.THREAD
• Data Stalls = RESOURCE_STALLS.ANY
• Instruction Stalls = UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY
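Plugging sample values into these formulas makes the earlier headline numbers concrete. The counter names below are the event names from this slide; the raw values are made up for illustration and are not measurements from the paper.

```python
# Hypothetical raw counter readings from one measurement interval.
INST_RETIRED_ANY_P = 3_000_000_000
CPU_CLK_UNHALTED_THREAD = 6_000_000_000
RESOURCE_STALLS_ANY = 1_000_000_000            # data-side (resource) stalls
UOPS_ISSUED_CORE_STALL_CYCLES = 4_500_000_000  # all issue-stall cycles

# The three formulas from the slide, applied directly.
ipc = INST_RETIRED_ANY_P / CPU_CLK_UNHALTED_THREAD
data_stalls = RESOURCE_STALLS_ANY
instruction_stalls = UOPS_ISSUED_CORE_STALL_CYCLES - RESOURCE_STALLS_ANY

print(ipc)  # 0.5, well under 1 on a 4-issue machine
# Instruction stalls as a fraction of all stall cycles:
print(instruction_stalls / UOPS_ISSUED_CORE_STALL_CYCLES)
```

With these sample numbers, instruction stalls account for roughly 78% of stall cycles, in line with the 70-80% breakdown shown on the earlier slide.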
21
OLTP L1 Instruction Cache Misses (capacity misses only)
Trace simulation, 4-way L1-I cache, Shore-MT
[Chart: capacity misses per k-instruction vs. cache size (16KB-1MB) for TPC-C and TPC-E (lower is better); annotation at the small-cache end: "Most common today!"]
~512KB is enough for the OLTP instruction footprint