© 2014 IBM Corporation
Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory
Rei Odaira, IBM Research – Tokyo
Jose G. Castanos, IBM Research – Watson Research Center
Hisanobu Tomari, University of Tokyo
Global Interpreter Locks in Scripting Languages
• Scripting languages (Ruby, Python, etc.) are everywhere.
• Increasing demands for speed.
• Multi-thread speed is restricted by global interpreter locks.
– Only one thread can execute the interpreter at a time.
– Not required by the language specification.
☺ Simplified implementation in the interpreter and libraries.
☹ No scalability on multi cores.
– CRuby, CPython, and PyPy
Hardware Transactional Memory (HTM) Coming into the Market
• Improve performance by simply replacing locks with TM?
• Lower overhead than software TM via hardware.
[Figure: HTM-capable processors: Blue Gene/Q (2012), zEC12 (2012), Intel 4th Generation Core Processor (2013), POWER8]
Our Goal
• How will realistic applications perform if we replace the GIL with HTM?
– GIL: Global Interpreter Lock
• What modifications and new techniques are needed?
• Eliminate the GIL in Ruby through HTM on zEC12 and Xeon E3-1275 v3 (aka Haswell).
• Evaluate the Ruby NAS Parallel Benchmarks, the WEBrick HTTP server, and Ruby on Rails.
Related Work
• Eliminate Python's GIL with HTM
– Micro-benchmarks on a non-cycle-accurate simulator [Riley et al., 2006]
– Micro-benchmarks on a cycle-accurate simulator [Blundell et al., 2010]
– Micro-benchmarks on Rock's restricted HTM [Tabba, 2010]
• Eliminate Ruby's and Python's GILs with fine-grain locks
– JRuby, IronRuby, Jython, IronPython, etc.
– Huge implementation effort
– Remaining locks and incompatibility in class libraries
• Eliminate the GIL in Ruby through less restricted HTM on real machines, and measure non-micro-benchmarks.
– Concurrency and no incompatibility in the class libraries
Outline
• Motivation
• Transactional Memory
• GIL in Ruby
• Eliminating GIL through HTM
• Experimental Results
• Conclusion
Transactional Memory
• At programming time
– Enclose critical sections with transaction begin/end operations.
• At execution time
– Memory operations within a transaction are observed as one step by other threads.
– Multiple transactions are executed in parallel as long as their memory operations do not conflict.
• Higher parallelism than locks.
[Figure: Thread X and Thread Y executing critical sections such as tbegin(); a->count++; tend(); and tbegin(); b->count++; tend(); in place of lock(); a->count++; unlock();]
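The begin/end semantics above can be illustrated with a small software simulation. This is a sketch in Python, not the hardware mechanism: the `Memory` and `Transaction` classes and the version-counter validation are assumptions made for illustration only. It shows two transactions that update different counters both committing, and two that update the same counter conflicting, so that one of them has to retry.

```python
class Memory:
    """Shared memory: a value and a version counter per location (hypothetical model)."""

    def __init__(self):
        self.values = {}
        self.versions = {}


class Transaction:
    """Buffers writes privately and validates its reads when it commits."""

    def __init__(self, memory):
        self.memory = memory
        self.read_versions = {}   # location -> version seen at first read
        self.write_buffer = {}    # location -> value to publish at commit

    def read(self, key):
        if key in self.write_buffer:
            return self.write_buffer[key]
        self.read_versions.setdefault(key, self.memory.versions.get(key, 0))
        return self.memory.values.get(key, 0)

    def write(self, key, value):
        self.write_buffer[key] = value

    def commit(self):
        # Abort (return False) if any location read has changed since it was read.
        for key, seen in self.read_versions.items():
            if self.memory.versions.get(key, 0) != seen:
                return False
        # Otherwise publish all buffered writes as one step.
        for key, value in self.write_buffer.items():
            self.memory.values[key] = value
            self.memory.versions[key] = self.memory.versions.get(key, 0) + 1
        return True


def increment(tx, key):
    tx.write(key, tx.read(key) + 1)


memory = Memory()

# Thread X updates a.count, Thread Y updates b.count: no conflict.
tx_x, tx_y = Transaction(memory), Transaction(memory)
increment(tx_x, "a.count")
increment(tx_y, "b.count")
disjoint_results = (tx_x.commit(), tx_y.commit())

# Both threads update a.count: the second commit detects the conflict.
tx_x, tx_y = Transaction(memory), Transaction(memory)
increment(tx_x, "a.count")
increment(tx_y, "a.count")
conflict_results = (tx_x.commit(), tx_y.commit())

# The aborted transaction is simply re-executed.
retry = Transaction(memory)
increment(retry, "a.count")
retry_committed = retry.commit()
```

With a lock, all four increments would be serialized; here only the pair that touches the same location pays for a retry.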
HTM
• Instruction set (zEC12)
– TBEGIN: Begin a transaction
– TEND: End a transaction
– TABORT, etc.
• Micro-architecture
– Read and write sets held in caches
– Conflict detection using the cache coherence protocol
• Abort in four cases:
– Read-set and write-set conflicts
– Read-set and write-set overflows
– Restricted instructions (e.g., system calls)
– External interruptions, etc.
    TBEGIN
    if (cc != 0)
        goto abort handler
    ...
    ...
    TEND
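The four abort cases can be expressed as a small classification function. This is a sketch in Python under stated assumptions: read and write sets are modeled as sets of cache-line numbers, the capacities are hypothetical parameters counted in lines, and the order in which the cases are checked is arbitrary. The names `AbortCause` and `abort_cause` are invented for illustration and do not correspond to any hardware interface.

```python
from enum import Enum


class AbortCause(Enum):
    NONE = "none"
    CONFLICT = "conflict"
    OVERFLOW = "overflow"
    RESTRICTED_INSTRUCTION = "restricted instruction"
    INTERRUPTION = "external interruption"


def abort_cause(read_set, write_set, other_reads, other_writes,
                read_capacity, write_capacity,
                restricted=False, interrupted=False):
    """Classify why a transaction would abort (hypothetical model).

    read_set / write_set: cache lines touched by this transaction.
    other_reads / other_writes: cache lines touched concurrently by other threads.
    """
    if restricted:
        return AbortCause.RESTRICTED_INSTRUCTION
    if interrupted:
        return AbortCause.INTERRUPTION
    if len(read_set) > read_capacity or len(write_set) > write_capacity:
        return AbortCause.OVERFLOW
    # A conflict needs at least one writer on a shared line; read-read sharing is fine.
    if (read_set | write_set) & other_writes or write_set & other_reads:
        return AbortCause.CONFLICT
    return AbortCause.NONE


conflict = abort_cause({1, 2}, {3}, set(), {2}, read_capacity=4, write_capacity=2)
overflow = abort_cause({1, 2, 3, 4, 5}, set(), set(), set(), read_capacity=4, write_capacity=2)
shared_read = abort_cause({1}, {2}, {1}, set(), read_capacity=4, write_capacity=2)
syscall = abort_cause({1}, set(), set(), set(), read_capacity=4, write_capacity=2, restricted=True)
```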
Ruby Multi-thread Programming and GIL (based on Ruby 1.9.3-p194)
• Ruby language
– Program with the Thread, Mutex, and ConditionVariable classes
• Ruby virtual machine
– Ruby threads are assigned to native threads (Pthread).
– Only one thread can execute the interpreter at any given time due to the GIL.
– The GIL is acquired/released when a thread starts/finishes.
– The GIL is yielded during a blocking operation, such as I/O.
– The GIL is also yielded at pre-defined yield-point bytecode ops: conditional/unconditional jumps, method/block exits, etc.
How GIL is Yielded in Ruby
• It is too costly to yield the GIL at every yield point.
• Yield the GIL every 250 msec using a timer thread.
– Application threads check the flag at every yield point:
    if (flag is true) {
        gil_release();
        sched_yield();
        gil_acquire();
    }
[Figure: the timer thread wakes up every 250 msec and sets a flag; the running application thread releases the GIL and another application thread acquires it.]
• The actual implementation is more complex to ensure fairness.
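The flag-based yielding described above can be sketched as follows. This is a simplified Python model, not CRuby's code: the class `TimerYieldGil` and its method names are hypothetical, the timer thread is replaced by an explicit `timer_tick()` call so that the behavior is deterministic, and `time.sleep(0)` stands in for `sched_yield()`. The fairness logic that the slide mentions is left out.

```python
import threading
import time


class TimerYieldGil:
    """Hypothetical model of a GIL that is yielded only when a timer has requested it."""

    def __init__(self):
        self.lock = threading.Lock()
        self.yield_requested = False
        self.yields = 0

    def timer_tick(self):
        # In the slides, a timer thread does this every 250 msec.
        self.yield_requested = True

    def yield_point(self):
        # Cheap check executed at every yield point.
        if not self.yield_requested:
            return False
        self.yield_requested = False
        self.lock.release()   # gil_release()
        time.sleep(0)         # stand-in for sched_yield()
        self.lock.acquire()   # gil_acquire()
        self.yields += 1
        return True


gil = TimerYieldGil()
gil.lock.acquire()                                 # the running thread holds the GIL
results = [gil.yield_point() for _ in range(3)]    # no tick yet: nothing happens
gil.timer_tick()
results += [gil.yield_point() for _ in range(3)]   # only the first check yields
gil.lock.release()
```

The point of the design is that the common path at a yield point is a single flag test; the expensive release/acquire pair runs only once per timer period.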
Outline
• Motivation
• Transactional Memory
• GIL in Ruby
• Eliminating GIL through HTM
– Basic Algorithm
– Dynamic Adjustment of Transaction Lengths
– Conflict Removal
• Experimental Results
• Conclusion
Eliminating GIL through HTM
• Execute as a transaction first.
– Begins/ends at the same points as the GIL's yield points.
• Acquire the GIL after consecutive aborts in a transaction.
• No timer thread needed.
[Figure: a thread begins and ends a transaction; on a conflict it retries if aborted, and acquires the GIL in case of consecutive aborts. Other threads wait for the GIL release.]
Tradeoff in Transaction Lengths
• No need to end and begin transactions at every yield point.
= Variable transaction lengths at the granularity of yield points.
• Longer transaction = smaller relative overhead to begin/end a transaction.
• Shorter transaction = smaller abort overhead
– Smaller amount of wasteful work rolled back at aborts
– Smaller probability of transaction overflows
[Figure: timelines of shorter and longer transactions between Begin and End markers, showing the wasteful work rolled back at an abort.]
Dynamic Adjustment of Transaction Lengths
• Transaction length = how many yield points to skip
• Adjust transaction lengths on a per-yield-point basis.
1. Initialize with a long length (255).
2. Calculate the abort ratio at each yield point: number of aborts / number of transactions.
3. If the abort ratio exceeds a threshold (1% for zEC12 and 6% for Xeon), shorten the transaction length (× 0.75) and return to Step 2.
4. If a pre-defined number (300) of transactions have started before the abort ratio exceeds the threshold, stop adjusting that yield point.
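Steps 1 to 4 can be sketched as a per-yield-point profile. This is one possible reading in Python, not the authors' code: the slides do not say how the ratio is evaluated over a short window, so the sketch assumes the abort count is compared against threshold × 300 (for example, more than 3 aborts for the 1% threshold), and that the counters restart after each shortening. The class and field names are hypothetical.

```python
from dataclasses import dataclass

INITIAL_LENGTH = 255      # Step 1
SHRINK_FACTOR = 0.75      # Step 3
PROFILING_WINDOW = 300    # Step 4


@dataclass
class YieldPointProfile:
    """Adjustment state kept for one yield point (hypothetical layout)."""
    length: int = INITIAL_LENGTH
    transactions: int = 0
    aborts: int = 0
    adjusting: bool = True

    def record(self, aborted: bool, threshold: float) -> None:
        """Account for one transaction that started at this yield point."""
        if not self.adjusting:
            return
        self.transactions += 1
        if aborted:
            self.aborts += 1
        # Assumption: the ratio is judged against the full 300-transaction window.
        if self.aborts > threshold * PROFILING_WINDOW:
            self.length = max(1, int(self.length * SHRINK_FACTOR))
            self.transactions = 0   # return to Step 2 with fresh counters
            self.aborts = 0
        elif self.transactions >= PROFILING_WINDOW:
            self.adjusting = False  # Step 4: this yield point is settled


profile = YieldPointProfile()
for _ in range(4):                     # four aborts exceed 1% of 300
    profile.record(aborted=True, threshold=0.01)
length_after_shrink = profile.length

for _ in range(300):                   # a clean window ends the adjustment
    profile.record(aborted=False, threshold=0.01)
```

Starting long and only ever shortening means a yield point converges after a bounded number of steps, at the cost of some aborted work early in the run.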
5 Sources of Conflicts and How to Remove Them
• Refer to our paper for the details.
– Global variables pointing to the currently running thread
– Manipulation of the global free list when allocating objects
– Conflicts and overflows in garbage collection
– Updates to inline caches at the time of misses
– False sharing in thread structures
• Source code needs to be changed for higher performance, but each change is limited to only a few dozen lines of the interpreter's source code.
Thread-Local Free Lists to Avoid Conflicts
[Figure: with a single global free list, object allocations by Thread 1 and Thread 2 take free objects from the same list. With thread-local free lists, each thread allocates from its own list, and free objects are moved from the global list in bulk.]
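The thread-local free-list idea can be sketched as follows. This is an illustrative Python model, not the interpreter's allocator: the class names, the batch size of 256, and the `global_accesses` counter are assumptions introduced to make the effect visible. Each allocator touches the shared list only when its local list runs dry, and then moves a whole batch at once.

```python
class GlobalFreeList:
    """Shared pool of free objects (hypothetical model; objects are just integers)."""

    def __init__(self, n_free, batch_size):
        self.free = list(range(n_free))
        self.batch_size = batch_size
        self.global_accesses = 0   # how often the shared list was touched

    def take_batch(self):
        """Bulk movement: hand out up to batch_size free objects at once."""
        self.global_accesses += 1
        batch = self.free[-self.batch_size:]
        del self.free[-self.batch_size:]
        return batch


class ThreadLocalAllocator:
    """Per-thread free list that is refilled from the global list in bulk."""

    def __init__(self, heap):
        self.heap = heap
        self.local_free = []

    def allocate(self):
        if not self.local_free:
            self.local_free = self.heap.take_batch()
            if not self.local_free:
                raise MemoryError("no free objects; a real VM would run GC here")
        return self.local_free.pop()


heap = GlobalFreeList(n_free=1000, batch_size=256)
thread1 = ThreadLocalAllocator(heap)
thread2 = ThreadLocalAllocator(heap)
objects1 = [thread1.allocate() for _ in range(100)]
objects2 = [thread2.allocate() for _ in range(100)]
```

In this run, 200 allocations touch the shared list only twice, which is the property that reduces transaction conflicts on the free-list head.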
Experimental Environment
• Implemented in Ruby 1.9.3-p194.
• ~400 MB Ruby heap
• zEC12
– 12-core 5.5-GHz zEC12 (1 hardware thread per core)
– 1-MB read set and 8-KB write set
– Ported to z/OS 1.13 UNIX System Services.
• Xeon E3-1275 v3 (aka Haswell)
– 4-core 2-SMT 3.5-GHz Xeon E3-1275 v3
– 6-MB read set and 19-KB write set
– Linux 3.10.5
Benchmark Programs
• Ruby NAS Parallel Benchmarks (NPB) [Nose et al., 2012]
– Class S for BT, CG, FT, LU, SP, and Class W for IS and MG
• WEBrick: HTTP server included in the Ruby distribution
– Spawns 1 thread to handle 1 request.
• Ruby on Rails running on WEBrick
– Executed a Web application to fetch a book list from SQLite3.
– Measured only on Xeon E3-1275 v3.
GIL and HTM Configurations
• GIL: Original Ruby
• HTM-n (n = 1, 16, 256): Fixed transaction length (skip n-1 yield points)
• HTM-dynamic: Dynamically adapted transaction length
Throughput of FT in Ruby NPB
• HTM-dynamic was the best.
• HTM-1 suffered high overhead.
• HTM-256 incurred high abort ratios.
• zEC12 and Xeon showed similar speed-ups up to 4 threads.
• SMT did not help.
[Figure: throughput of FT (1 = 1-thread GIL) vs. number of threads, on zEC12 and on Xeon E3-1275 v3 (with the SMT region marked), for GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
Throughput of WEBrick
• HTM was faster than GIL by 57% on Xeon E3-1275 v3.
• Scalability on zEC12 was limited by conflicts at malloc().
• HTM-dynamic was among the best.
• HTM-1 was better than HTM-16 and HTM-256.
– Frequent complex operations in WEBrick, such as method invocations
[Figure: WEBrick throughput (1 = 1-thread GIL) vs. number of clients, on zEC12 and on Xeon E3-1275 v3, for GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
Ruby on Rails on Xeon E3-1275 v3 and Abort Ratios
• HTM improved the throughput by 24% over GIL.
• High abort ratios due to transaction overflows.
– Regular expression libraries and method invocations.
– Need to split into multiple transactions?
[Figure: abort ratios (%) of HTM-dynamic vs. number of clients, for WEBrick / zEC12, WEBrick / Xeon, and Rails / Xeon.]
[Figure: Ruby on Rails throughput (1 = 1-thread GIL) vs. number of clients, for GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
Single-thread Overhead
• Single-thread speed is important too.
• 18-35% single-thread overhead
– Optimization: with 1 thread, use the GIL instead of HTM.
• 5-14% overhead in micro-benchmarks even with this optimization.
• Sources of the overhead:
– Checks at the yield points
→ Adaptive insertion of yield points
– Disabled method-invocation inline caches
→ Concurrent inline caches
Conclusion
• What was the performance of realistic applications when the GIL was replaced with HTM?
– Up to 4.4-fold speed-up with 12 threads in Ruby NPB.
– 57% speed-up in WEBrick and 24% in Ruby on Rails.
• What was required for good scalability?
– Proposed dynamic transaction-length adjustment.
– Removed 5 sources of conflicts.
• Using HTM is an effective way to outperform the GIL without fine-grain locking.
Backups
Beginning a Transaction
• Persistent aborts
– Overflow
– Restricted instructions
• Otherwise, transient aborts
– Read-set and write-set conflicts, etc.
• Abort reason reported by the CPU
– Using a specified memory area
    if (TBEGIN()) {
        /* Transaction */
        if (GIL.acquired)
            TABORT();
    } else {
        /* Abort */
        if (GIL.acquired) {
            if (Retried 16 times)
                Acquire GIL;
            else
                Retry after GIL release;
        } else if (Persistent abort) {
            Acquire GIL;
        } else { /* Transient abort */
            if (Retried 3 times)
                Acquire GIL;
            else
                Retry;
        }
    }
    Execute Ruby code;
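The abort-handling branch of the pseudocode above can be restated as a pure decision function, which makes the retry limits easy to check. This is a sketch in Python: the limits 16 and 3 come from the slide, while the `Action` enum, the function name `on_abort`, and the way retry counts are passed in are assumptions made for illustration.

```python
from enum import Enum

GIL_RETRY_LIMIT = 16        # retries allowed while another thread holds the GIL
TRANSIENT_RETRY_LIMIT = 3   # retries allowed after transient aborts


class Action(Enum):
    ACQUIRE_GIL = "acquire GIL"
    RETRY_AFTER_GIL_RELEASE = "retry after GIL release"
    RETRY = "retry"


def on_abort(gil_acquired, persistent, gil_retries, transient_retries):
    """Decide what to do after a transaction abort (hypothetical helper).

    gil_acquired: another thread currently holds the GIL.
    persistent: the CPU reported a persistent abort (overflow, restricted instruction).
    gil_retries / transient_retries: retries already made for each reason.
    """
    if gil_acquired:
        if gil_retries >= GIL_RETRY_LIMIT:
            return Action.ACQUIRE_GIL
        return Action.RETRY_AFTER_GIL_RELEASE
    if persistent:
        # Retrying cannot help, so fall back to the GIL immediately.
        return Action.ACQUIRE_GIL
    if transient_retries >= TRANSIENT_RETRY_LIMIT:
        return Action.ACQUIRE_GIL
    return Action.RETRY


first_conflict = on_abort(gil_acquired=False, persistent=False, gil_retries=0, transient_retries=0)
overflowed = on_abort(gil_acquired=False, persistent=True, gil_retries=0, transient_retries=0)
gil_busy = on_abort(gil_acquired=True, persistent=False, gil_retries=2, transient_retries=0)
```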
Where to Begin and End Transactions?
• Should be the same as the GIL's acquisition/release/yield points.
– Guaranteed as critical section boundaries.
• However, the original yield points are too coarse-grained.
– Cause many transaction overflows.
• Bytecode boundaries are supposed to be safe critical section boundaries.
– Bytecode can be generated in arbitrary orders.
– Therefore, an interpreter is not supposed to have a critical section that straddles a bytecode boundary.
– Ruby programs that are not correctly synchronized can change behavior.
• Added the following bytecode instructions as transaction yield points.
– getinstancevariable, getclassvariable, getlocal, send, opt_plus, opt_minus, opt_mult, opt_aref
– Criteria: they appear frequently or are complex.
Ending and Yielding a Transaction
Ending a transaction:
    if (GIL.acquired)
        Release GIL;
    else
        TEND();
Yielding a transaction (transaction boundary):
    if (--yield_counter == 0) {  /* dynamic adjustment of a transaction length */
        End a transaction;
        Begin a transaction;
    }
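The interplay of the yield counter with transaction boundaries can be sketched as follows. This is a Python model under stated assumptions: the slides do not show where `yield_counter` is reset, so the sketch resets it to the current transaction length whenever a transaction begins. The class `TransactionRunner` and its event log are hypothetical and exist only to make the boundaries observable.

```python
class TransactionRunner:
    """Hypothetical model of one thread skipping yield points between transaction boundaries."""

    def __init__(self, transaction_length, gil_acquired=False):
        self.transaction_length = transaction_length
        self.gil_acquired = gil_acquired
        self.events = []
        self.begin_transaction()

    def begin_transaction(self):
        # Assumption: the counter restarts from the current transaction length.
        self.yield_counter = self.transaction_length
        self.events.append("begin")

    def end_transaction(self):
        if self.gil_acquired:
            self.gil_acquired = False
            self.events.append("release_gil")   # fallback path: drop the GIL
        else:
            self.events.append("tend")          # normal path: commit the transaction

    def yield_point(self):
        self.yield_counter -= 1
        if self.yield_counter == 0:
            self.end_transaction()
            self.begin_transaction()


runner = TransactionRunner(transaction_length=16)
for _ in range(64):
    runner.yield_point()
commits = runner.events.count("tend")

fallback = TransactionRunner(transaction_length=1, gil_acquired=True)
fallback.yield_point()
```

With a length of 16, only every 16th yield point pays for a transaction end and begin, which is the relative-overhead saving that longer transactions provide.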
Abort Ratios on zEC12
• Transaction lengths were well adjusted by HTM-dynamic with 1% as the target abort ratio.
• No correlation to the scalabilities.
[Figure: abort ratios (%) of HTM-dynamic vs. number of threads, for BT, CG, FT, IS, LU, MG, and SP.]
Cycle Breakdowns on zEC12
• No correlation to the scalabilities.
– The result of IS is not reliable because 79% of its execution was spent in initialization, outside of the measurement period.
[Figure: cycle breakdowns of Ruby NPB (BT, CG, FT, IS, LU, MG, SP) into transaction begin/end, successful transactions, GIL acquired, aborted transactions, and waiting for GIL release.]
Categorization by Functions Aborted by Fetch Conflicts
• Half of the aborts occurred in manipulating the global free list (rb_newobj) and in lazy sweep in GC (gc_lazy_sweep).
– A lot of Float objects allocated.
– To be fixed in Ruby 2.0 with Flonum?
[Figure: abort categorization by functions (HTM-dynamic / 12 threads / cache fetch-related) for BT, CG, FT, IS, LU, MG, and SP: gc_lazy_sweep, libc, rb_mutex_lock, rb_mutex_sleep..., rb_newobj, tend, vm_exec_core, vm_getivar, and others.]
Comparing Scalabilities of CRuby, JRuby, and Java
• CRuby (HTM) was similar to Java (almost no VM-internal bottleneck).
• Scalability saturation in CRuby (HTM) is inherent in the applications.
• JRuby (fine-grain locking) behaved differently from CRuby (HTM).
• On average, HTM achieved the same scalability as fine-grain locking.
[Figure: throughput (1 = 1 thread) vs. number of threads for BT, CG, FT, IS, LU, MG, and SP, in three panels: scalability of HTM-dynamic / CRuby, scalability of JRuby (12x Intel Xeon), and scalability of Java NPB (12x Xeon).]