Reducing OLTP Instruction Misses with Thread Migration
Islam Atta, Pınar Tözün, Anastasia Ailamaki, Andreas Moshovos
University of Toronto / École Polytechnique Fédérale de Lausanne
2
OLTP on an Intel Xeon X5660 (Shore-MT, hyper-threading disabled)
IPC < 1 on a 4-issue machine
[Chart: Instructions per Cycle for TPC-C and TPC-E (higher is better)]
[Chart: Breakdown of Core Stalls for TPC-C and TPC-E: Resource (includes data) vs. Instructions]
70-80% of stalls are instruction stalls
3
OLTP L1 Instruction Cache Misses
Trace simulation, 4-way L1-I cache, Shore-MT
[Chart: misses per k-instruction vs. cache size (16KB-1024KB) for TPC-C and TPC-E (lower is better); annotation at the small-cache end: "Most common today!"]
~512KB is enough for the OLTP instruction footprint
4
Reducing Instruction Stalls at the hardware level
• Larger L1-I cache size: higher access latency
• Different replacement policies: do not really affect OLTP workloads
• Advanced prefetching: too much space overhead (40KB per core)
• Simultaneous multi-threading: increases IPC per hardware context, but pollutes the cache
5
Alternative: Thread Migration
• Enables usage of aggregate L1-I capacity
  – Large cache size without increased latency
• Can exploit instruction commonality
  – Localizes common transaction instructions
• Dynamic hardware solution
  – More general purpose
6
Transactions Running in Parallel
[Diagram: threads T1, T2, T3, each divided into instruction parts that can fit into the L1-I; common instructions among the concurrent threads are highlighted]
7
Scheduling Threads
[Diagram: threads T1-T3 assigned to cores 0-3 over time, traditional scheduling vs. TMi; the per-step L1-I miss counts add up to a much lower total under TMi]
8
TMi
[Diagram: transactions A (T1, T2) and B (T3, T4) sharing cores 0 and 1 over time]
• Group threads
• Wait till the L1-I is almost full
  – Count misses
  – Record the last N misses
  – Misses > threshold => migrate
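The trigger condition above can be sketched in software. This is a minimal toy model, not the hardware mechanism: the class name and method names are invented for illustration, while the default threshold (256) and miss-history length (6) mirror the experimental-setup slide.

```python
from collections import deque

class L1ITriggerModel:
    """Toy model of TMi's migration trigger for one core's L1-I cache.

    Counts misses since the last migration, remembers the addresses of
    the last N misses, and signals a migration once the miss count
    exceeds a threshold (i.e., the cache is treated as almost full of
    this thread's instruction footprint).
    """

    def __init__(self, miss_threshold=256, last_n=6):
        self.miss_threshold = miss_threshold
        self.recent_misses = deque(maxlen=last_n)  # last N miss addresses
        self.miss_count = 0

    def on_instruction_fetch(self, addr, hit):
        """Record one fetch; return True when the thread should migrate."""
        if hit:
            return False
        self.miss_count += 1
        self.recent_misses.append(addr)
        return self.miss_count > self.miss_threshold

    def reset(self):
        """Clear the miss counter after a migration decision is taken."""
        self.miss_count = 0
```

The recorded `recent_misses` are what the next slide consults when choosing a destination core.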
9
TMi
[Diagram: T1 and T2 of transaction A migrating between cores 0 and 1 over time]
Where to migrate? Check the last N misses recorded in other caches:
1) No matching cache => move to an idle core if one exists
2) Matching cache => move to that core
3) None of the above => do not move
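The destination policy above can be written out as a small function. This is an illustrative sketch only: the function and field names are invented, and treating "matching" as a set overlap between a thread's needed instruction blocks and a cache's recorded misses is an assumption about what the hardware comparison does.

```python
def pick_destination(needed_blocks, cores, current_core):
    """Choose where a migrating thread should go, per the TMi policy.

    `needed_blocks` is the set of instruction-block addresses the thread
    recently missed on; `cores` maps core id -> {"idle": bool,
    "recent_misses": set of block addresses recorded at that core's L1-I}.
    """
    # Case 2: a matching cache => move to that core, since the needed
    # instructions are likely already resident there.
    for cid, state in cores.items():
        if cid != current_core and state["recent_misses"] & needed_blocks:
            return cid
    # Case 1: no matching cache => move to an idle core if one exists.
    for cid, state in cores.items():
        if cid != current_core and state["idle"]:
            return cid
    # Case 3: none of the above => do not move.
    return current_core
```

Checking for a match before falling back to an idle core reflects the intent of the slide: landing on a core that already holds the needed instructions is what turns would-be misses into hits.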
10
Experimental Setup
• Trace simulation
  – PIN to extract instruction & data accesses per transaction
  – 16-core system
  – 32KB 8-way set-associative L1 caches
  – Miss threshold is 256
  – Last 6 misses are kept
• Shore-MT as the storage manager
  – Workloads: TPC-C, TPC-E
11
Impact on L1-I Misses
[Chart: instruction misses per k-instruction for No Migration, TMi, and TMi Blind, on TPC-C and TPC-E (lower is better)]
Instruction misses reduced by half
12
Impact on L1-D Misses
[Chart: misses per k-instruction, split into Instruction, Read Data, and Write Data, for No Migration, TMi, and TMi Blind, on TPC-C and TPC-E (lower is better)]
Cannot ignore increased data misses
13
TMi's Challenges
• Dealing with the data left behind
  – Prefetching
• Depends on thread identification
  – Software assisted
  – Hardware detection
• OS support needed
  – Disabling OS control over thread scheduling
14
Conclusion
• ~50% of the time OLTP stalls on instructions
• Spread computation through thread migration
• TMi
  – Halves L1-I misses
  – Time-wise, ~30% expected improvement
  – Data misses should be handled
Thank you!
15
BACKUP
16
L1-I Misses per K-Instruction
[Chart: instruction MPKI split into Capacity, Conflict, and Compulsory misses, for cache sizes 16KB-1MB at 2-way, 4-way, 8-way, and full associativity, on TPC-C and TPC-E]
17
L1-D Misses per K-Instruction
[Chart: data MPKI split into Capacity, Conflict, and Compulsory misses, for cache sizes 16KB-1MB at 8-way and full associativity, on TPC-C and TPC-E]
18
Replacement Policies
[Chart: instruction and data MPKI under LRU, LIP, BIP, and DIP, on TPC-C and TPC-E]
19
Experimental Setup
Intel Xeon X5660 Server:
  #Sockets: 2
  #Cores in a Socket: 6 (OoO)
  #HW Contexts: 24
  Clock Speed: 2.80GHz
  Memory: 48GB
  LLC (L3): 12MB
  L2 (per core): 256KB
  L1 (per core): 32KB (both I and D)
  Hyper-Threading: Enabled
  OS: Ubuntu 10.04 with Linux kernel 2.6.32
• Intel VTune 2011
  – Interface for hardware counters
• Working set fits in RAM
• Log flushed to RAM
• Each run:
  – Starts with the initial database
  – Each worker executes 1000 xcts before VTune starts collecting numbers for 60 secs
20
Formulas
• IPC = INST_RETIRED.ANY_P / CPU_CLK_UNHALTED.THREAD
• Data Stalls = RESOURCE_STALLS.ANY
• Instruction Stalls = UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY
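Plugging sample values into these formulas makes the earlier headline numbers concrete. The counter names below are the event names from this slide; the raw values are made up for illustration and are not measurements from the paper.

```python
# Hypothetical raw counter readings from one measurement interval.
INST_RETIRED_ANY_P = 3_000_000_000
CPU_CLK_UNHALTED_THREAD = 6_000_000_000
RESOURCE_STALLS_ANY = 1_000_000_000            # data-side (resource) stalls
UOPS_ISSUED_CORE_STALL_CYCLES = 4_500_000_000  # all issue-stall cycles

# The three formulas from the slide, applied directly.
ipc = INST_RETIRED_ANY_P / CPU_CLK_UNHALTED_THREAD
data_stalls = RESOURCE_STALLS_ANY
instruction_stalls = UOPS_ISSUED_CORE_STALL_CYCLES - RESOURCE_STALLS_ANY

print(ipc)  # 0.5, well under 1 on a 4-issue machine
# Instruction stalls as a fraction of all stall cycles:
print(instruction_stalls / UOPS_ISSUED_CORE_STALL_CYCLES)
```

With these sample numbers, instruction stalls account for roughly 78% of stall cycles, in line with the 70-80% breakdown shown on the earlier slide.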
21
OLTP L1 Instruction Cache Misses (capacity misses only)
Trace simulation, 4-way L1-I cache, Shore-MT
[Chart: capacity misses per k-instruction vs. cache size (16KB-1MB) for TPC-C and TPC-E (lower is better); annotation at the small-cache end: "Most common today!"]
~512KB is enough for the OLTP instruction footprint