eliminating giant vm lock in ruby through hardware transactional memory (rubykaigi 2014)
DESCRIPTION
This deck is about eliminating Ruby's giant VM lock (GVL) through hardware transactional memory (HTM) of the Intel Haswell processor.TRANSCRIPT
![Page 1: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/1.jpg)
© 2014 IBM Corporation
Eliminating Giant VM Lock in RubythroughHardware Transactional Memory
Rei Odaira (Twitter: @ReiOdaira)IBM Research – Tokyo
Jose G. CastanosIBM Research – Watson Research Center
Hisanobu TomariUniversity of Tokyo
![Page 2: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/2.jpg)
© 2014 IBM Corporation2
IBM Research – Tokyo
Giant VM Locks in Scripting Languages
Scripting languages (Ruby, Python, etc.) everywhere. Increasing demands for speed.
Multi-thread speed restricted by Giant VM locks.
– Only one thread can execute interpreter at one time.
– Not required by language specification.
☺Simplified implementation in interpreter and libraries.
☹ No scalability on multi cores.
– CRuby, CPython, and PyPy
![Page 3: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/3.jpg)
© 2014 IBM Corporation3
IBM Research – Tokyo
Hardware Transactional Memory (HTM) Coming into the Market
Improve performance by simply replacing locks with TM?
Lower overhead than software TM via hardware.
Blue Gene/Q2012
zEC122012
4th Generation Core Processor
2013
Intel
POWER82014
![Page 4: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/4.jpg)
© 2014 IBM Corporation4
IBM Research – Tokyo
Our Goal
What will realistic applications perform if we replace GVL with HTM?
– Giant VM Lock (GVL)
What modifications are needed in the interpreter?
![Page 5: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/5.jpg)
© 2014 IBM Corporation5
IBM Research – Tokyo
Our Goal
What will realistic applications perform if we replace GVL with HTM?
– Giant VM Lock (GVL)
What modifications are needed in the interpreter?
atomic { }
Eliminate GVL in CRuby through HTM on Xeon E3-1275 v3 (aka Haswell) Evaluate WEBrick HTTP server and Ruby on Rails
XeonE3-1275 v3
Intel
![Page 6: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/6.jpg)
© 2014 IBM Corporation6
IBM Research – Tokyo
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
Experimental Results
Conclusion
![Page 7: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/7.jpg)
© 2014 IBM Corporation7
IBM Research – Tokyo
Transactional Memory
At programming time
– Enclose critical sections with transaction begin/end operations.
At execution time
– Memory operations within a transaction observed as one step by other threads.
– Multiple transactions executed in parallel as long as their memory operations do not conflict. Higher parallelism than locks.
xbegin();a->count++;xend();
xbegin();b->count++;xend();
xbegin();a->count++;xend(); xbegin();
a->count++;xend();
Thread X
Thread Y
lock();a->count++;unlock();
xbegin();a->count++;xend();
![Page 8: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/8.jpg)
© 2014 IBM Corporation8
IBM Research – Tokyo
HTM Instruction set (Intel TSX)
– XBEGIN: Begin a transaction
– XEND: End a transaction
– XABORT, etc.
Micro-architecture
– Read and write sets held in CPU caches
– Conflict detection using CPU cache coherence protocol
Abort in four cases:
– Read set and write set conflict
– Read set and write set overflow
– Restricted instructions (e.g. system calls)
– External interruptions, etc.
XBEGIN abort_handler......XEND
abort_handler:...
![Page 9: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/9.jpg)
© 2014 IBM Corporation9
IBM Research – Tokyo
Ruby Multi-thread Programming and GVL based on 1.9.3
Ruby language
– Program with Thread, Mutex, and ConditionVariable classes
Ruby virtual machine
– Ruby program compiled into bytecode sequences.
– Ruby threads assigned to native threads (Pthread)
– Only one thread can execute the interpreter at any give time due to the GVL.
GVL acquired/released when a thread starts/finishes.
GVL yielded during a blocking operation, such as I/O.
GVL yielded also at pre-defined yield-point bytecode ops.
– Conditional/unconditional jumps, method/block exits, etc.
![Page 10: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/10.jpg)
© 2014 IBM Corporation10
IBM Research – Tokyo
How GVL is Yielded in Ruby It is too costly to yield GVL at every yield point.
Yield GVL every 250 msec using a timer thread.
Timer thread Application threads
Wa
ke u
p e
very
25
0 m
sec
ReleaseGVL
Set flag
Set flag
AcquireGVL
if (flag is true){ gvl_release(); sched_yield(); gvl_acquire();}
Check the flag at every yield point.
Actual implementation is more complex to ensure fairness.
![Page 11: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/11.jpg)
© 2014 IBM Corporation11
IBM Research – Tokyo
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
– Our Algorithm
– Conflict Removal
Experimental Results
Conclusion
![Page 12: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/12.jpg)
© 2014 IBM Corporation12
IBM Research – Tokyo
Eliminating GVL through HTM Execute as a transaction first.
– Begins/ends at the same points as GVL’s yield points.
Acquire GVL after consecutive aborts in a transaction.
Begin transaction
End transaction Conflict Retry if abort
Acquire GVL in case of consecutive aborts
Release GVLWait for GVL release
No timer thread needed.
![Page 13: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/13.jpg)
© 2014 IBM Corporation13
IBM Research – Tokyo
Where to Begin and End Transactions? Should be the same as GVL’s acquisition/release/yield points.
– Guaranteed as critical section boundaries.
☹ However, the original yield points are too coarse-grained.
– Cause many transaction overflows.
Bytecode boundaries are supposed to be safe critical section boundaries.
– Bytecode can be generated in arbitrary orders.
– Therefore, an interpreter is not supposed to have a critical section that straddles a bytecode boundary.
☹ Ruby programs that are not correctly synchronized can change behavior.
Added the following bytecode instructions as transaction yield points.– getinstancevariable, getclassvariable, getlocal, send, opt_plus, opt_minus, opt_mult, opt_aref
– Criteria: they appear frequently or are complex.
![Page 14: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/14.jpg)
© 2014 IBM Corporation14
IBM Research – Tokyo
5 Sources of Conflicts and How to Remove Them (1/2)
Global variables pointing to the current running thread
– Cause conflicts because they are written every time transactions yield
Moved to Pthread’s thread-local storage
Updates to inline caches at the time of misses
– Cache the result of hash table access for method invocation and instance-field access
– Cause conflicts because caches are shared among threads.
Implement miss reduction techniques.
Source code needs to be changed for higher performance, but each change is limited to only a few dozen lines of Ruby interpreter’s source code.
![Page 15: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/15.jpg)
© 2014 IBM Corporation15
IBM Research – Tokyo
5 Sources of Conflicts and How to Remove Them (2/2) Manipulation of global free list when allocating objects
Introduce thread-local free lists
• When a thread-local free list becomes empty, take 256 objects in bulk from the global free list.
Garbage collection
– Cause conflicts when multiple threads try to do GC.
– Cause overflows even when a single thread performs GC.
Reduce GC frequency by increasing the Ruby heap.
False sharing in Ruby’s thread structures (rb_thread_t)
– Add many fields to store thread-local information. Conflict with other structures on the same cache line.
Assign each thread structure to a dedicated cache line.
![Page 16: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/16.jpg)
© 2014 IBM Corporation16
IBM Research – Tokyo
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
Experimental Results
Conclusion
![Page 17: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/17.jpg)
© 2014 IBM Corporation17
IBM Research – Tokyo
Experimental Environment
Implemented in Ruby 1.9.3-p547.
~400 MB Ruby heap
Xeon E3-1275 v3 (aka Haswell)
– 4-core 2-SMT 3.5-GHz Xeon E3-1275 v3
– 6-MB read set and 19-KB write set
– Linux 3.10.5
![Page 18: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/18.jpg)
© 2014 IBM Corporation18
IBM Research – Tokyo
Benchmark Programs
WEBrick: HTTP server included in Ruby distribution
– Spawn 1 thread to handle 1 request.
Ruby on Rails running on WEBrick, Thin, and Puma.
– Measured a Web application to fetch a book list from SQLite3.
– Thin executed in the multi-threaded mode.
HTTP client and server running on the same machine.
– To measure maximum speed-ups possible.
![Page 19: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/19.jpg)
© 2014 IBM Corporation19
IBM Research – Tokyo
Throughput of WEBrick
GET a ~50-byte HTML file.
Normalized to 1-thread GVL
HTM was faster than GVL by 57%.WEBrick
0
0.5
1
1.5
2
2.5
0 1 2 3 4 5 6 7Number of clients
Thr
ough
put
(1 =
1-t
hrea
d G
VL)
GVLHTM
![Page 20: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/20.jpg)
© 2014 IBM Corporation20
IBM Research – Tokyo
Throughput of Ruby on Rails / WEBrick HTM improved the throughput by 24% over GVL.
Many aborts due to transaction overflows.
– Regular expression libraries and method invocations.
Need to split into multiple transactions?
Ruby on Rails/WEBrick
0
0.5
1
1.5
0 1 2 3 4 5 6 7Number of clients
Th
rou
gh
pu
t(1
= 1
-th
rea
d G
VL
)
GVL
HTM
![Page 21: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/21.jpg)
© 2014 IBM Corporation21
IBM Research – Tokyo
Throughput of Ruby on Rails / Thin and Puma
HTM improved the throughput by 43% with Thin and 58% with Puma over GVL.
Many aborts due to transaction overflows.
Ruby on Rails/Thin
0
0.5
1
1.5
2
0 1 2 3 4 5 6 7Number of clients
Th
rou
gh
pu
t(1
= 1
-th
rea
d G
VL
)
GVL
HTM
Ruby on Rails/Puma
0
0.5
1
1.5
2
0 1 2 3 4 5 6 7Number of clients
Th
rou
gh
pu
t(1
= 1
-th
rea
d G
VL
)
GVL
HTM
![Page 22: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/22.jpg)
© 2014 IBM Corporation22
IBM Research – Tokyo
Conclusion
What was the performance of realistic applications when GVL was replaced with HTM?
– 57% speed-up in WEBrick and 58% in Ruby on Rails.
What was required for good scalability?
– Removed 5 sources of conflicts.
Code patch available!
– http://researcher.watson.ibm.com/researcher/ view_person_subpage.php?id=5206
Using HTM is an effective way to outperform the GVL without fine-grain locking.
![Page 23: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/23.jpg)
© 2014 IBM Corporation
Backup
![Page 24: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/24.jpg)
© 2014 IBM Corporation24
IBM Research – Tokyo
Begin
Tradeoff in Transaction Lengths No need to end and begin transactions at every yield point.
= Variable transaction lengths at the granularity of yield points.
Longer transaction = Smaller relative overhead to begin/end transaction.
Shorter transaction = Smaller abort overhead
– Smaller amount of wasteful work rolled-back at aborts
– Smaller probability of transaction overflows
Shorter transactionBegin BeginEnd
Wasteful work
BeginEnd End
Longer transactionBegin BeginEnd Begin
![Page 25: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/25.jpg)
© 2014 IBM Corporation25
IBM Research – Tokyo
Dynamic Adjustment of Transaction Lengths
Transaction length = How many yield points to skip
Adjust transaction lengths on a per-yield-point basis.
1. Initialize with a long length (255).
2. Calculate abort ratio at each yield point Number of aborts / Number of transactions
3. If the abort ratio exceeds a threshold (6% for Xeon), shorten the transaction length (x 0.75) and return to Step 2.
4. If a pre-defined number (300) of transactions started before the abort ratio exceeds the threshold, stop adjusting that yield point.
![Page 26: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/26.jpg)
© 2014 IBM Corporation26
IBM Research – Tokyo
Single-thread Overhead
Single-thread speed is important too.
18-35% single-thread overhead
– Optimization: with 1 thread, use the GVL instead of HTM. 5-14% overhead in micro-benchmarks even with this optimization.
Sources of the overhead:
– Checks at the yield points Adaptive insertion of yield points
– Disabled method-invocation inline caches Concurrent inline caches
![Page 27: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/27.jpg)
© 2014 IBM Corporation27
IBM Research – Tokyo
Beginning a Transaction
Persistent aborts
– Overflow
– Restricted instructions
Otherwise, transient aborts
– Read-set and write-set conflicts, etc.
Abort reason reported by CPU
– Using a specified memory area
if (TBEGIN()) { /* Transaction */ if (GVL.acquired) TABORT();} else { /* Abort */ if (GVL.acquired) { if (Retried 16 times) Acquire GVL; else { Retry after GVL release; } else if (Persistent abort) { Acquire GVL; } else { /* Transient abort */ if (Retried 3 times) Acquire GVL ; else Retry;}}Execute Ruby code;
![Page 28: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/28.jpg)
© 2014 IBM Corporation28
IBM Research – Tokyo
Ending and Yielding a Transaction
if (GVL.acquired) Release GVL;else TEND();
Ending a transaction
if (--yield_counter == 0) { End a transaction; Begin a transaction;}
Yielding a transaction(transaction boundary)
Dynamic adjustment of a transaction length
![Page 29: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/29.jpg)
© 2014 IBM Corporation29
IBM Research – Tokyo
Platform-Specific Optimizations
Spin-wait for a while at GVL contention.
– Original GVL is acquired and released every ~250 msec.
– GVL is acquired and released far more frequently when used as a fallback path for HTM. Parallelism severely damaged if a thread sleeps in OS every time GVL is contended.
– Some environments already include this optimization in Pthread implementation.
![Page 30: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/30.jpg)
© 2014 IBM Corporation30
IBM Research – Tokyo
HTM in Intel Xeon E3-1275 v3
XBEGIN, XEND, and XABORT
– Offset to abort handler specified by XBEGIN.
– EAX register tells abort reason.
Flat nesting
Implementation details not disclosed yet.
Conflict detected at the granularity of 64-byte cache line
Read and write sets tracked by CPU caches
– 32-KB L1D, 256-KB L2, and 8-MB shared L3.
Read- and write-set sizes depend on the models, according to our experiments.
– Xeon E3-1275v3: ~6MB read and ~19KB write sets
– Core i7 4770S: ~4MB read and ~22KB write sets
![Page 31: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/31.jpg)
© 2014 IBM Corporation31
IBM Research – Tokyo
FT
00.5
11.5
22.5
33.5
44.5
5
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)
GIL
HTM-1
HTM-16
HTM-256
HTM-dynamic
FT
0
0.5
1
1.5
2
2.5
0 1 2 3 4 5 6 7 8 9
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)
GIL
HTM-1
HTM-16
HTM-256
HTM-dynamic
Throughput of FT in Ruby NPB
HTM-dynamic was the best.
HTM-1 suffered high overhead.
HTM-256 incurred high abort ratios.
zEC12 and Xeon showed similar speed-ups up to 4 threads.
SMT did not help.
zEC12
Xeon E3-1275 v3
SMT
![Page 32: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/32.jpg)
© 2014 IBM Corporation32
IBM Research – Tokyo
Throughput of Ruby NPB on Xeon E3-1275 v3 (1/2)BT
0
0.5
1
1.5
2
0 1 2 3 4 5 6 7 8 9
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)
CG
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
0 1 2 3 4 5 6 7 8 9
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)
FT
0
0.5
1
1.5
2
2.5
0 1 2 3 4 5 6 7 8 9
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)
GIL
HTM-1
HTM-16
HTM-256
HTM-dynamic
Showed the same scalability as zEC12.
SMT did not help.
HTM-dynamic was not the best.
– Due to learning in HTM?
![Page 33: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)](https://reader033.vdocument.in/reader033/viewer/2022060110/556157aed8b42a780d8b5497/html5/thumbnails/33.jpg)
© 2014 IBM Corporation33
IBM Research – Tokyo
Throughput of Ruby NPB on Xeon E3-1275 v3 (2/2)IS
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
0 1 2 3 4 5 6 7 8 9
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)
LU
00.20.40.60.8
11.21.41.61.8
0 1 2 3 4 5 6 7 8 9
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)
MG
00.20.40.60.8
11.21.41.61.8
0 1 2 3 4 5 6 7 8 9
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)
SP
00.20.40.60.8
11.21.41.61.8
0 1 2 3 4 5 6 7 8 9
Number of threads
Thr
ough
put
(1 =
1 t
hrea
d G
IL)