eliminating giant vm lock in ruby through hardware transactional memory (rubykaigi 2014)

33
© 2014 IBM Corporation Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory Rei Odaira (Twitter: @ReiOdaira) IBM Research – Tokyo Jose G. Castanos IBM Research – Watson Research Center Hisanobu Tomari University of Tokyo

Upload: rei-odaira

Post on 24-May-2015

942 views

Category:

Software


2 download

DESCRIPTION

This deck is about eliminating Ruby's giant VM lock (GVL) through hardware transactional memory (HTM) of the Intel Haswell processor.

TRANSCRIPT

Page 1: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation

Eliminating Giant VM Lock in RubythroughHardware Transactional Memory

Rei Odaira (Twitter: @ReiOdaira)IBM Research – Tokyo

Jose G. CastanosIBM Research – Watson Research Center

Hisanobu TomariUniversity of Tokyo

Page 2: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation2

IBM Research – Tokyo

Giant VM Locks in Scripting Languages

Scripting languages (Ruby, Python, etc.) everywhere. Increasing demands for speed.

Multi-thread speed restricted by Giant VM locks.

– Only one thread can execute interpreter at one time.

– Not required by language specification.

☺Simplified implementation in interpreter and libraries.

☹ No scalability on multi cores.

– CRuby, CPython, and PyPy

Page 3: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation3

IBM Research – Tokyo

Hardware Transactional Memory (HTM) Coming into the Market

Improve performance by simply replacing locks with TM?

Lower overhead than software TM via hardware.

Blue Gene/Q2012

zEC122012

4th Generation Core Processor

2013

Intel

POWER82014

Page 4: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation4

IBM Research – Tokyo

Our Goal

What will realistic applications perform if we replace GVL with HTM?

– Giant VM Lock (GVL)

What modifications are needed in the interpreter?

Page 5: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation5

IBM Research – Tokyo

Our Goal

What will realistic applications perform if we replace GVL with HTM?

– Giant VM Lock (GVL)

What modifications are needed in the interpreter?

atomic { }

Eliminate GVL in CRuby through HTM on Xeon E3-1275 v3 (aka Haswell) Evaluate WEBrick HTTP server and Ruby on Rails

XeonE3-1275 v3

Intel

Page 6: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation6

IBM Research – Tokyo

Outline

Motivation

Transactional Memory

GVL in Ruby

Eliminating GVL through HTM

Experimental Results

Conclusion

Page 7: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation7

IBM Research – Tokyo

Transactional Memory

At programming time

– Enclose critical sections with transaction begin/end operations.

At execution time

– Memory operations within a transaction observed as one step by other threads.

– Multiple transactions executed in parallel as long as their memory operations do not conflict. Higher parallelism than locks.

xbegin();a->count++;xend();

xbegin();b->count++;xend();

xbegin();a->count++;xend(); xbegin();

a->count++;xend();

Thread X

Thread Y

lock();a->count++;unlock();

xbegin();a->count++;xend();

Page 8: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation8

IBM Research – Tokyo

HTM Instruction set (Intel TSX)

– XBEGIN: Begin a transaction

– XEND: End a transaction

– XABORT, etc.

Micro-architecture

– Read and write sets held in CPU caches

– Conflict detection using CPU cache coherence protocol

Abort in four cases:

– Read set and write set conflict

– Read set and write set overflow

– Restricted instructions (e.g. system calls)

– External interruptions, etc.

XBEGIN abort_handler......XEND

abort_handler:...

Page 9: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation9

IBM Research – Tokyo

Ruby Multi-thread Programming and GVL based on 1.9.3

Ruby language

– Program with Thread, Mutex, and ConditionVariable classes

Ruby virtual machine

– Ruby program compiled into bytecode sequences.

– Ruby threads assigned to native threads (Pthread)

– Only one thread can execute the interpreter at any give time due to the GVL.

GVL acquired/released when a thread starts/finishes.

GVL yielded during a blocking operation, such as I/O.

GVL yielded also at pre-defined yield-point bytecode ops.

– Conditional/unconditional jumps, method/block exits, etc.

Page 10: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation10

IBM Research – Tokyo

How GVL is Yielded in Ruby It is too costly to yield GVL at every yield point.

Yield GVL every 250 msec using a timer thread.

Timer thread Application threads

Wa

ke u

p e

very

25

0 m

sec

ReleaseGVL

Set flag

Set flag

AcquireGVL

if (flag is true){ gvl_release(); sched_yield(); gvl_acquire();}

Check the flag at every yield point.

Actual implementation is more complex to ensure fairness.

Page 11: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation11

IBM Research – Tokyo

Outline

Motivation

Transactional Memory

GVL in Ruby

Eliminating GVL through HTM

– Our Algorithm

– Conflict Removal

Experimental Results

Conclusion

Page 12: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation12

IBM Research – Tokyo

Eliminating GVL through HTM Execute as a transaction first.

– Begins/ends at the same points as GVL’s yield points.

Acquire GVL after consecutive aborts in a transaction.

Begin transaction

End transaction Conflict Retry if abort

Acquire GVL in case of consecutive aborts

Release GVLWait for GVL release

No timer thread needed.

Page 13: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation13

IBM Research – Tokyo

Where to Begin and End Transactions? Should be the same as GVL’s acquisition/release/yield points.

– Guaranteed as critical section boundaries.

☹ However, the original yield points are too coarse-grained.

– Cause many transaction overflows.

Bytecode boundaries are supposed to be safe critical section boundaries.

– Bytecode can be generated in arbitrary orders.

– Therefore, an interpreter is not supposed to have a critical section that straddles a bytecode boundary.

☹ Ruby programs that are not correctly synchronized can change behavior.

Added the following bytecode instructions as transaction yield points.– getinstancevariable, getclassvariable, getlocal, send, opt_plus, opt_minus, opt_mult, opt_aref

– Criteria: they appear frequently or are complex.

Page 14: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation14

IBM Research – Tokyo

5 Sources of Conflicts and How to Remove Them (1/2)

Global variables pointing to the current running thread

– Cause conflicts because they are written every time transactions yield

Moved to Pthread’s thread-local storage

Updates to inline caches at the time of misses

– Cache the result of hash table access for method invocation and instance-field access

– Cause conflicts because caches are shared among threads.

Implement miss reduction techniques.

Source code needs to be changed for higher performance, but each change is limited to only a few dozen lines of Ruby interpreter’s source code.

Page 15: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation15

IBM Research – Tokyo

5 Sources of Conflicts and How to Remove Them (2/2) Manipulation of global free list when allocating objects

Introduce thread-local free lists

• When a thread-local free list becomes empty, take 256 objects in bulk from the global free list.

Garbage collection

– Cause conflicts when multiple threads try to do GC.

– Cause overflows even when a single thread performs GC.

Reduce GC frequency by increasing the Ruby heap.

False sharing in Ruby’s thread structures (rb_thread_t)

– Add many fields to store thread-local information. Conflict with other structures on the same cache line.

Assign each thread structure to a dedicated cache line.

Page 16: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation16

IBM Research – Tokyo

Outline

Motivation

Transactional Memory

GVL in Ruby

Eliminating GVL through HTM

Experimental Results

Conclusion

Page 17: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation17

IBM Research – Tokyo

Experimental Environment

Implemented in Ruby 1.9.3-p547.

~400 MB Ruby heap

Xeon E3-1275 v3 (aka Haswell)

– 4-core 2-SMT 3.5-GHz Xeon E3-1275 v3

– 6-MB read set and 19-KB write set

– Linux 3.10.5

Page 18: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation18

IBM Research – Tokyo

Benchmark Programs

WEBrick: HTTP server included in Ruby distribution

– Spawn 1 thread to handle 1 request.

Ruby on Rails running on WEBrick, Thin, and Puma.

– Measured a Web application to fetch a book list from SQLite3.

– Thin executed in the multi-threaded mode.

HTTP client and server running on the same machine.

– To measure maximum speed-ups possible.

Page 19: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation19

IBM Research – Tokyo

Throughput of WEBrick

GET a ~50-byte HTML file.

Normalized to 1-thread GVL

HTM was faster than GVL by 57%.WEBrick

0

0.5

1

1.5

2

2.5

0 1 2 3 4 5 6 7Number of clients

Thr

ough

put

(1 =

1-t

hrea

d G

VL)

GVLHTM

Page 20: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation20

IBM Research – Tokyo

Throughput of Ruby on Rails / WEBrick HTM improved the throughput by 24% over GVL.

Many aborts due to transaction overflows.

– Regular expression libraries and method invocations.

Need to split into multiple transactions?

Ruby on Rails/WEBrick

0

0.5

1

1.5

0 1 2 3 4 5 6 7Number of clients

Th

rou

gh

pu

t(1

= 1

-th

rea

d G

VL

)

GVL

HTM

Page 21: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation21

IBM Research – Tokyo

Throughput of Ruby on Rails / Thin and Puma

HTM improved the throughput by 43% with Thin and 58% with Puma over GVL.

Many aborts due to transaction overflows.

Ruby on Rails/Thin

0

0.5

1

1.5

2

0 1 2 3 4 5 6 7Number of clients

Th

rou

gh

pu

t(1

= 1

-th

rea

d G

VL

)

GVL

HTM

Ruby on Rails/Puma

0

0.5

1

1.5

2

0 1 2 3 4 5 6 7Number of clients

Th

rou

gh

pu

t(1

= 1

-th

rea

d G

VL

)

GVL

HTM

Page 22: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation22

IBM Research – Tokyo

Conclusion

What was the performance of realistic applications when GVL was replaced with HTM?

– 57% speed-up in WEBrick and 58% in Ruby on Rails.

What was required for good scalability?

– Removed 5 sources of conflicts.

Code patch available!

– http://researcher.watson.ibm.com/researcher/ view_person_subpage.php?id=5206

Using HTM is an effective way to outperform the GVL without fine-grain locking.

Page 23: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation

Backup

Page 24: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation24

IBM Research – Tokyo

Begin

Tradeoff in Transaction Lengths No need to end and begin transactions at every yield point.

= Variable transaction lengths at the granularity of yield points.

Longer transaction = Smaller relative overhead to begin/end transaction.

Shorter transaction = Smaller abort overhead

– Smaller amount of wasteful work rolled-back at aborts

– Smaller probability of transaction overflows

Shorter transactionBegin BeginEnd

Wasteful work

BeginEnd End

Longer transactionBegin BeginEnd Begin

Page 25: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation25

IBM Research – Tokyo

Dynamic Adjustment of Transaction Lengths

Transaction length = How many yield points to skip

Adjust transaction lengths on a per-yield-point basis.

1. Initialize with a long length (255).

2. Calculate abort ratio at each yield point Number of aborts / Number of transactions

3. If the abort ratio exceeds a threshold (6% for Xeon), shorten the transaction length (x 0.75) and return to Step 2.

4. If a pre-defined number (300) of transactions started before the abort ratio exceeds the threshold, stop adjusting that yield point.

Page 26: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation26

IBM Research – Tokyo

Single-thread Overhead

Single-thread speed is important too.

18-35% single-thread overhead

– Optimization: with 1 thread, use the GVL instead of HTM. 5-14% overhead in micro-benchmarks even with this optimization.

Sources of the overhead:

– Checks at the yield points Adaptive insertion of yield points

– Disabled method-invocation inline caches Concurrent inline caches

Page 27: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation27

IBM Research – Tokyo

Beginning a Transaction

Persistent aborts

– Overflow

– Restricted instructions

Otherwise, transient aborts

– Read-set and write-set conflicts, etc.

Abort reason reported by CPU

– Using a specified memory area

if (TBEGIN()) { /* Transaction */ if (GVL.acquired) TABORT();} else { /* Abort */ if (GVL.acquired) { if (Retried 16 times) Acquire GVL; else { Retry after GVL release; } else if (Persistent abort) { Acquire GVL; } else { /* Transient abort */ if (Retried 3 times) Acquire GVL ; else Retry;}}Execute Ruby code;

Page 28: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation28

IBM Research – Tokyo

Ending and Yielding a Transaction

if (GVL.acquired) Release GVL;else TEND();

Ending a transaction

if (--yield_counter == 0) { End a transaction; Begin a transaction;}

Yielding a transaction(transaction boundary)

Dynamic adjustment of a transaction length

Page 29: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation29

IBM Research – Tokyo

Platform-Specific Optimizations

Spin-wait for a while at GVL contention.

– Original GVL is acquired and released every ~250 msec.

– GVL is acquired and released far more frequently when used as a fallback path for HTM. Parallelism severely damaged if a thread sleeps in OS every time GVL is contended.

– Some environments already include this optimization in Pthread implementation.

Page 30: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation30

IBM Research – Tokyo

HTM in Intel Xeon E3-1275 v3

XBEGIN, XEND, and XABORT

– Offset to abort handler specified by XBEGIN.

– EAX register tells abort reason.

Flat nesting

Implementation details not disclosed yet.

Conflict detected at the granularity of 64-byte cache line

Read and write sets tracked by CPU caches

– 32-KB L1D, 256-KB L2, and 8-MB shared L3.

Read- and write-set sizes depend on the models, according to our experiments.

– Xeon E3-1275v3: ~6MB read and ~19KB write sets

– Core i7 4770S: ~4MB read and ~22KB write sets

Page 31: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation31

IBM Research – Tokyo

FT

00.5

11.5

22.5

33.5

44.5

5

0 1 2 3 4 5 6 7 8 9 10 11 12 13

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)

GIL

HTM-1

HTM-16

HTM-256

HTM-dynamic

FT

0

0.5

1

1.5

2

2.5

0 1 2 3 4 5 6 7 8 9

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)

GIL

HTM-1

HTM-16

HTM-256

HTM-dynamic

Throughput of FT in Ruby NPB

HTM-dynamic was the best.

HTM-1 suffered high overhead.

HTM-256 incurred high abort ratios.

zEC12 and Xeon showed similar speed-ups up to 4 threads.

SMT did not help.

zEC12

Xeon E3-1275 v3

SMT

Page 32: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation32

IBM Research – Tokyo

Throughput of Ruby NPB on Xeon E3-1275 v3 (1/2)BT

0

0.5

1

1.5

2

0 1 2 3 4 5 6 7 8 9

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)

CG

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

0 1 2 3 4 5 6 7 8 9

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)

FT

0

0.5

1

1.5

2

2.5

0 1 2 3 4 5 6 7 8 9

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)

GIL

HTM-1

HTM-16

HTM-256

HTM-dynamic

Showed the same scalability as zEC12.

SMT did not help.

HTM-dynamic was not the best.

– Due to learning in HTM?

Page 33: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation33

IBM Research – Tokyo

Throughput of Ruby NPB on Xeon E3-1275 v3 (2/2)IS

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

0 1 2 3 4 5 6 7 8 9

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)

LU

00.20.40.60.8

11.21.41.61.8

0 1 2 3 4 5 6 7 8 9

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)

MG

00.20.40.60.8

11.21.41.61.8

0 1 2 3 4 5 6 7 8 9

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)

SP

00.20.40.60.8

11.21.41.61.8

0 1 2 3 4 5 6 7 8 9

Number of threads

Thr

ough

put

(1 =

1 t

hrea

d G

IL)