Reducing OLTP Instruction Misses with Thread Migration

Post on 24-Feb-2016


TRANSCRIPT

Reducing OLTP Instruction Misses with Thread Migration

Islam Atta, Pınar Tözün, Anastasia Ailamaki, Andreas Moshovos

University of Toronto, École Polytechnique Fédérale de Lausanne

2

OLTP on an Intel Xeon X5660
Shore-MT, Hyper-threading disabled

[Figure: Instructions per Cycle (0-0.9) for TPC-C and TPC-E, and Breakdown of Core Stalls (0-100%) into Resource (includes data) and Instructions; higher IPC and fewer stalls are better. IPC < 1 on a 4-issue machine.]

70-80% of stalls are instruction stalls

3

OLTP L1 Instruction Cache Misses
Trace Simulation, 4-way L1-I Cache, Shore-MT

[Figure: Misses per k-Instruction (0-60) vs. L1-I cache size (16KB-1024KB) for TPC-C and TPC-E; lower is better. Annotation: "Most common today!"]

~512KB is enough for OLTP instruction footprint

4

Reducing Instruction Stalls at the Hardware Level

• Larger L1-I cache size → Higher access latency
• Different replacement policies → Do not really affect OLTP workloads
• Advanced prefetching → Too much space overhead (40KB per core)
• Simultaneous multi-threading → Increases IPC per hardware context, but pollutes the cache

5

Alternative: Thread Migration

• Enables usage of aggregate L1-I capacity
  – Large cache size without increased latency
• Can exploit instruction commonality
  – Localizes common transaction instructions
• Dynamic hardware solution
  – More general purpose

6

Transactions Running in Parallel

[Figure: Threads T1, T2, and T3 each execute a transaction whose instruction stream is divided into parts that can fit into the L1-I, with common instructions shared among the concurrent threads.]

7

Scheduling Threads

[Figure: Timelines of threads T1-T3 on cores 0-3 under traditional scheduling vs. TMi, comparing the total L1-I misses of each scheme; under TMi the threads visit multiple cores over time instead of each refilling a single core's L1-I.]

8

TMi

[Figure: Two cores (0 and 1) over time, running threads T1/T2 of Transaction A and T3/T4 of Transaction B.]

• Group threads
• Wait till L1-I is almost full
  – Count misses
  – Record last N misses
  – Misses > threshold => Migrate
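The trigger logic above can be sketched in software. This is an illustrative model of the described counters, not the actual hardware design: the `CoreMissTracker` and `record_miss` names are invented here, and resetting the counter after a migration triggers is an assumption. Defaults follow the talk's setup (threshold 256, last 6 misses).

```python
from collections import deque

class CoreMissTracker:
    """Per-core L1-I miss bookkeeping for a TMi-style migration trigger."""

    def __init__(self, miss_threshold=256, history_size=6):
        self.miss_threshold = miss_threshold       # misses before L1-I counts as "almost full"
        self.history = deque(maxlen=history_size)  # last N miss block addresses
        self.miss_count = 0

    def record_miss(self, block_addr):
        """Record one L1-I miss; return True when the thread should migrate."""
        self.miss_count += 1
        self.history.append(block_addr)
        if self.miss_count > self.miss_threshold:
            self.miss_count = 0                    # assumed: counter restarts after a migration
            return True
        return False
```

With a toy threshold of 3, the fourth recorded miss would trigger a migration, and the history keeps only the most recent addresses.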

9

TMi

[Figure: Threads T1/T2 of Transaction A migrating between cores 0 and 1 over time.]

Where to migrate? Check the last N misses recorded in other caches:
1) No matching cache => Move to an idle core, if one exists
2) Matching cache => Move to that core
3) None of the above => Do not move
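The three-way rule above can be read as the following decision function. This is a hypothetical sketch: the `choose_migration_target` name and the set-overlap test for what counts as a "matching cache" are assumptions, not details given on the slide.

```python
def choose_migration_target(my_recent_misses, core_histories, idle_cores, current_core):
    """Pick a core for a migrating thread.

    my_recent_misses: this thread's last-N L1-I miss block addresses
    core_histories:   {core_id: set of last-N miss block addresses seen on that core}
    idle_cores:       list of currently idle core ids
    Returns the core to migrate to, or current_core to stay put.
    """
    my_blocks = set(my_recent_misses)
    # A cache whose recorded misses overlap ours likely already holds
    # the instructions this thread needs -> move to that core.
    for core, hist in core_histories.items():
        if core != current_core and my_blocks & hist:
            return core
    # No matching cache: an idle core at least offers a fresh L1-I to fill.
    if idle_cores:
        return idle_cores[0]
    # Neither: migrating would not help, so do not move.
    return current_core
```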

10

Experimental Setup

• Trace Simulation
  – PIN to extract instruction & data accesses per transaction
  – 16-core system
  – 32KB 8-way set-associative L1 caches
  – Miss threshold is 256
  – Last 6 misses are kept
• Shore-MT as the storage manager
  – Workloads: TPC-C, TPC-E
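A minimal trace-driven cache model in the spirit of this setup (32KB, 8-way set-associative, LRU replacement) might look as follows; the 64-byte line size and the class/method names are assumptions for the sketch, not details from the talk.

```python
from collections import OrderedDict

class SetAssocCache:
    """Tiny set-associative cache model with LRU replacement."""

    def __init__(self, size_bytes=32 * 1024, ways=8, line_bytes=64):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        # One OrderedDict per set: insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        """Simulate one access; return True on hit, False on miss."""
        block = addr // self.line_bytes
        idx, tag = block % self.num_sets, block // self.num_sets
        s = self.sets[idx]
        if tag in s:
            s.move_to_end(tag)        # refresh LRU position
            self.hits += 1
            return True
        self.misses += 1
        if len(s) >= self.ways:
            s.popitem(last=False)     # evict least-recently-used line
        s[tag] = None
        return False

    def mpki(self, instructions):
        """Misses per thousand instructions for a trace of the given length."""
        return 1000 * self.misses / instructions
```

Replaying a PIN-style instruction-address trace through `access` and reading off `mpki` reproduces the kind of MPKI numbers reported on the following slides.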

11

Impact on L1-I Misses

[Figure: Instruction Misses per k-Instruction (0-45) for No Migration, TMi, and TMi Blind on TPC-C and TPC-E; lower is better.]

Instruction misses reduced by half

12

Impact on L1-D Misses

[Figure: Misses per k-Instruction (0-45), broken down into Instruction, Read Data, and Write Data, for No Migration, TMi, and TMi Blind on TPC-C and TPC-E; lower is better.]

Cannot ignore increased data misses

13

TMi's Challenges

• Dealing with the data left behind
  – Prefetching
• Depends on thread identification
  – Software assisted
  – Hardware detection
• OS support needed
  – Disabling OS control over thread scheduling

14

Conclusion

• ~50% of the time OLTP stalls on instructions
• Spread computation through thread migration
• TMi
  – Halves L1-I misses
  – Time-wise, ~30% expected improvement
  – Data misses should be handled

Thank you!

15

BACKUP

16

L1-I Misses per k-Instruction

[Figure: Instruction MPKI (0-50) vs. cache size (16KB-1M) for 2-way, 4-way, 8-way, and fully-associative (FA) caches on TPC-C and TPC-E, broken down into Capacity, Conflict, and Compulsory misses.]

17

L1-D Misses per k-Instruction

[Figure: Data MPKI (0-10) vs. cache size (16KB-1M) for 8-way and fully-associative (FA) caches on TPC-C and TPC-E, broken down into Capacity, Conflict, and Compulsory misses.]

18

Replacement Policies

[Figure: I-MPKI and D-MPKI (0-30) under LRU, LIP, BIP, and DIP for TPC-C and TPC-E.]

19

Experimental Setup

Intel Xeon X5660 Server
  #Sockets: 2
  #Cores in a Socket: 6 (OoO)
  #HW Contexts: 24
  Clock Speed: 2.80GHz
  Memory: 48GB
  LLC (L3): 12MB
  L2 (per core): 256KB
  L1 (per core): 32KB (both I and D)
  Hyper-Threading: Enabled
  OS: Ubuntu 10.04 with Linux kernel 2.6.32

• Intel VTune 2011
  – Interface for hardware counters
• Working set fits in RAM
• Log flushed to RAM
• Each run:
  – Starts with initial database
  – Each worker executes 1000 xcts before VTune starts collecting numbers for 60 secs

20

Formulas

• IPC = INST_RETIRED.ANY_P / CPU_CLK_UNHALTED.THREAD
• Data Stalls = RESOURCE_STALLS.ANY
• Instruction Stalls = UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY
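A worked example of these formulas on made-up counter values (the numbers are illustrative, not measurements from the paper; the function name is invented):

```python
def derive_metrics(inst_retired, unhalted_cycles, resource_stalls, issue_stall_cycles):
    """Derive IPC and the stall breakdown from raw performance-counter values."""
    ipc = inst_retired / unhalted_cycles                  # INST_RETIRED.ANY_P / CPU_CLK_UNHALTED.THREAD
    data_stalls = resource_stalls                         # RESOURCE_STALLS.ANY
    inst_stalls = issue_stall_cycles - resource_stalls    # UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY
    return ipc, data_stalls, inst_stalls

# e.g. 800 retired instructions over 1000 unhalted cycles gives IPC = 0.8,
# and 700 stall cycles of which 100 are resource stalls leaves 600 instruction stalls.
```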

21

OLTP L1 Instruction Cache Misses
Trace Simulation, 4-way L1-I Cache, Shore-MT

[Figure: Capacity Misses per k-Instruction (0-50) vs. cache size (16KB-1M) for TPC-C and TPC-E; lower is better. Annotation: "Most common today!"]

~512KB is enough for OLTP instruction footprint
