Optimizing Native Code for Erlang
Steve Vinoski
Basho [email protected] @stevevinoski
Monday, September 22, 14
INTEGRATION, ERLANG STYLE
• External: OS processes separate from the Erlang VM
  • Ports
  • C Nodes
  • Jinterface
  • TCP/UDP/SCTP networking
• Internal: statically or dynamically linked into the Erlang VM
  • Erlang Built-in Functions (BIFs)
  • Port Drivers
  • Native Implemented Functions (NIFs)
INTEGRATION EXAMPLES
• rebar uses ports for external commands like git, grep, rsync
• Erlang's inet_drv port driver
  • written in C
  • supports TCP, UDP, SCTP for Erlang applications
• Riak's eleveldb persistence backend is a C++ NIF
NIF DETAILS
• Start with a regular Erlang module
• Functions can either be stubbed out to raise errors, or have default implementations
• Corresponding NIFs live in a dynamically loaded library
• Module typically specifies a NIF loading function via -on_load
• NIFs replace Erlang functions of the same name/arity at module load time
NIF EXAMPLE
• Example module: bitwise
• Provides a function exor/2 that takes a binary and a value
• exor/2 computes an exclusive or of each byte of the binary with the argument value
• Find the code here: https://github.com/vinoski/bitwise.git
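As a rough illustration of what exor/2 computes, here is a plain C sketch with no erl_nif.h dependency. The name `xor_bytes` and its signature are hypothetical and are not taken from the bitwise repository; a real NIF would also have to inspect the Erlang binary argument and build the result term.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the bitwise repository's code: XOR every byte
 * of `in` with `value` and store the result in `out`.
 * `in` and `out` may point to the same buffer. */
void xor_bytes(const uint8_t *in, uint8_t *out, size_t len, uint8_t value)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        out[i] = in[i] ^ value;
    }
}
```

The loop is linear in the size of the input, so the running time of a single call grows with the binary it is given.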
EXOR/2 NIF
NOW FOR SOME BIG DATA
• 2 billion bytes
LET'S TIME OUR NIF
• Nearly 6 seconds!
• This is bad.
ERLANG PROCESS ARCHITECTURE
[Diagram: CPU cores 1..N; OS + kernel threads; Erlang VM with SMP scheduler threads 1..N (one per core), run queues, and processes]
SCHEDULING A PROCESS
• A scheduler takes a process from its run queue
• It executes the process until it uses up 2000 reductions (roughly, function calls), blocks waiting for a message, or hits an emulator trap
• The process then gets scheduled out and another one chosen
• See Jesper Louis Andersen's scheduling description: http://jlouisramblings.blogspot.com/2013/01/how-erlang-does-scheduling.html
THREAD PROGRESS
• Scheduler threads share some data structures
• But using traditional locks or ref counts to protect them scales poorly
• Instead, schedulers report their progress frequently to other schedulers
• Schedulers use their knowledge of other schedulers' progress to know when certain operations are safe
• For more details see https://github.com/erlang/otp/blob/master/erts/emulator/internal_doc/ThreadProgress.md
BLOCKED SCHEDULERS
• Blocking a scheduler prevents thread progress, making other schedulers wait
• Blocking a scheduler also makes it unavailable to run other processes
• A NIF shouldn't occupy a scheduler for more than 1-2 ms
• NIF reductions should also be counted properly
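One way to count NIF work is to convert measured elapsed time into a percentage of a timeslice, which is the unit that enif_consume_timeslice accepts (an integer from 1 to 100). The sketch below is standalone C and does not call the NIF API. The helper name `timeslice_percent` and the 1 ms timeslice length are assumptions made for illustration.

```c
#include <assert.h>

/* Assumption for this sketch: one timeslice is about 1 ms (1000 microseconds). */
#define TIMESLICE_US 1000L

/* Hypothetical helper: convert elapsed microseconds into a percentage of
 * a timeslice, clamped to the range 1..100. */
int timeslice_percent(long elapsed_us)
{
    assert(elapsed_us >= 0);
    long pct = elapsed_us * 100 / TIMESLICE_US;
    if (pct < 1) {
        pct = 1;
    }
    if (pct > 100) {
        pct = 100;
    }
    return (int)pct;
}
```

A NIF that measured 250 microseconds of work would report 25 percent under this assumption, and anything at or beyond 1 ms is capped at 100.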
SCHEDULER COLLAPSE
• With Riak we've seen problems in production where schedulers go to sleep and stop executing processes
• Caused by misbehaving NIFs in Riak's storage backends interfering with normal scheduler operations
• Can also be caused by misbehaving standard Erlang functions
• See Scott Fritchie's nifwait repository, md5 branch: https://github.com/slfritchie/nifwait.git
LET'S COUNT REDUCTIONS
A MISBEHAVING NIF
• Blocked a scheduler thread for 5.86 seconds
• And counted only 4 reductions
WORKAROUNDS
• Break the data into chunks
• Call exor_bad/2 repeatedly, once for each chunk
• Combine the resulting chunks into a final result
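The three steps above can be sketched in plain C. This only illustrates the idea: in the talk the splitting and recombining are done around repeated exor_bad/2 calls, whereas the hypothetical `xor_in_chunks` below loops inside C, with `process_chunk` standing in for one per-chunk NIF call. Both names are assumptions, not code from the bitwise repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one exor_bad/2 call on a single chunk. */
void process_chunk(const uint8_t *in, uint8_t *out, size_t len, uint8_t value)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = in[i] ^ value;
    }
}

/* Split `in` into pieces of at most `chunk_size` bytes, process each piece
 * separately, and place each result at the matching offset of `out`.
 * Returns the number of chunks processed. */
size_t xor_in_chunks(const uint8_t *in, uint8_t *out, size_t len,
                     uint8_t value, size_t chunk_size)
{
    assert(chunk_size > 0);
    size_t chunks = 0;
    size_t offset = 0;
    while (offset < len) {
        size_t remaining = len - offset;
        size_t this_len = remaining < chunk_size ? remaining : chunk_size;
        process_chunk(in + offset, out + offset, this_len, value);
        offset += this_len;
        chunks++;
    }
    return chunks;
}
```

In the Erlang version, control returns to the VM between chunk calls, which is what gives the scheduler a chance to run other processes.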
CHUNKING
• Problem: how to determine optimal chunk size?
• Here, we arbitrarily chose 4MB chunks
CHUNKING RESULTS
• 476 chunks processed
• Much better reduction count of 1445
• Scheduler was never blocked (probably, anyway)
• But a longer execution time of 7.87 seconds
A BETTER APPROACH
• For Erlang/OTP 17.3 (released 17 Sep 2014) I added a new NIF API function: enif_schedule_nif
• Takes a name and function pointer for a NIF, and an array of arguments to pass to it
• Schedules the argument NIF for future invocation with the specified arguments
• Allows the calling NIF to yield the scheduler
EXOR2/6
• exor2/6 is an "internal NIF" not visible to Erlang
• Works through as much of the binary as it can before its timeslice runs out
• Reports reductions using enif_consume_timeslice
• When its timeslice is up, reschedules itself via enif_schedule_nif
• Adjusts the chunk size for the next iteration based on the progress made in each iteration
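The behaviour described above can be approximated with a standalone simulation. This is a sketch under stated assumptions, not the actual exor2/6 code. It does not call enif_consume_timeslice or enif_schedule_nif; instead, a hypothetical `bytes_per_pct` field models how many bytes fit into 1% of a timeslice, and returning 0 stands in for rescheduling. The names `xor_job` and `xor_run_slice` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State carried across invocations. A real NIF would pass such values as
 * arguments to the rescheduled call; this struct is a hypothetical stand-in. */
typedef struct {
    size_t offset;         /* next byte to process */
    size_t chunk_size;     /* bytes to process between timeslice checks */
    size_t bytes_per_pct;  /* simulated speed: bytes per 1% of a timeslice */
    int yields;            /* how many times the job gave up the scheduler */
} xor_job;

/* Process as much as fits in one simulated timeslice.
 * Returns 1 when the whole input is done, 0 when the job should run again. */
int xor_run_slice(xor_job *job, const uint8_t *in, uint8_t *out,
                  size_t len, uint8_t value)
{
    assert(job != NULL && job->chunk_size > 0 && job->bytes_per_pct > 0);
    size_t start = job->offset;
    size_t used_pct = 0;  /* percent of this timeslice consumed so far */

    while (job->offset < len) {
        size_t remaining = len - job->offset;
        size_t n = remaining < job->chunk_size ? remaining : job->chunk_size;
        for (size_t i = 0; i < n; i++) {
            out[job->offset + i] = in[job->offset + i] ^ value;
        }
        job->offset += n;
        if (job->offset == len) {
            break;
        }
        /* Account for this chunk; a real NIF would derive the percentage
         * from measured elapsed time rather than from a fixed speed. */
        size_t pct = n / job->bytes_per_pct;
        used_pct += pct > 0 ? pct : 1;
        if (used_pct >= 100) {
            /* Timeslice exhausted: size the next chunk to roughly what one
             * full timeslice handled, then yield. */
            size_t done = job->offset - start;
            size_t next = (size_t)((uint64_t)done * 100 / used_pct);
            job->chunk_size = next > 0 ? next : 1;
            job->yields++;
            return 0;
        }
    }
    return 1;
}
```

A caller keeps invoking `xor_run_slice` until it returns 1. Starting from a small chunk size, the job settles on a chunk that costs about one timeslice, so the timeslice check happens less often without overrunning the slice.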
A YIELDING NIF
• 5.36 seconds, fastest so far
• At over 10000 reductions, much more accurate accounting
• We yielded the scheduler 5 times
ANOTHER APPROACH: DIRTY SCHEDULERS
38Monday, September 22, 14
![Page 72: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/72.jpg)
Erlang VM
Run Queues
OS + kernel threadsCPU
Core 1. . . . . . CPU
Core N
N1
SMPScheduler Threads
(one per core)
DIRTY SCHEDULERS
39Monday, September 22, 14
![Page 73: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/73.jpg)
OS + kernel threadsCPU
Core 1. . . . . . . . . . . . . CPU
Core N
N1
DIRTY SCHEDULERS
40Monday, September 22, 14
![Page 74: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/74.jpg)
OS + kernel threadsCPU
Core 1. . . . . . . . . . . . . CPU
Core N
N1
DIRTY SCHEDULERS
. . . . . . . . . . . . .DC1 DCN
DC: Dirty CPU Scheduler41Monday, September 22, 14
![Page 75: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/75.jpg)
OS + kernel threadsCPU
Core 1. . . . . . . . . . . . . CPU
Core N
N1
DIRTY SCHEDULERS
. . . . . . . . . . . . .DC1 DCN
DC: Dirty CPU Scheduler
Shared DC Run Queue
41Monday, September 22, 14
![Page 77: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/77.jpg)
DIRTY SCHEDULERS
[Diagram: regular scheduler threads 1..N and dirty CPU scheduler threads DC1..DCN sit above CPU cores 1..N, with the DC schedulers fed from a shared DC run queue; dirty I/O scheduler threads DI/O 1..DI/O N are fed from a shared DI/O run queue and run on OS kernel threads.]
DI/O: Dirty I/O Scheduler
![Page 79: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/79.jpg)
ENABLING DIRTY SCHEDULERS
• Build Erlang/OTP with ./configure --enable-dirty-schedulers
• Your Erlang shell will then print something like the following system version line:
Erlang/OTP 17 [erts-6.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [kernel-poll:false]
• The ds field reports the dirty schedulers: 8 dirty CPU schedulers, 8 of them online, and 10 dirty I/O schedulers
![Page 80: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/80.jpg)
USING DIRTY SCHEDULERS
• Either schedule a dirty NIF via enif_schedule_nif
• Pass a flag to indicate dirty CPU or dirty I/O scheduling
• Or mark a NIF as dirty in your ErlNifFunc array
• Both are new in Erlang 17.3 and replace the old experimental dirty NIF API
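A minimal sketch of the second option, marking a NIF as dirty in the ErlNifFunc array. It assumes exor/2 XORs every byte of a binary with a single byte value and lives in a module named bitwise; the helper exor_bytes and the BUILD_AS_NIF guard are hypothetical names used only so the byte loop can be compiled and checked without erl_nif.h. The NIF part requires a VM built with dirty scheduler support.

```c
#include <stddef.h>

/* Hypothetical helper: XOR each byte of src with `byte`, writing to dst. */
static void exor_bytes(const unsigned char *src, unsigned char *dst,
                       size_t len, unsigned char byte)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ byte;
}

#ifdef BUILD_AS_NIF /* assumption: defined only when building against erl_nif.h */
#include <erl_nif.h>

/* exor(Binary, Byte): runs entirely on a dirty CPU scheduler thread. */
static ERL_NIF_TERM exor(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    ErlNifBinary in;
    unsigned int byte;
    ERL_NIF_TERM out;
    unsigned char *buf;

    if (argc != 2 || !enif_inspect_binary(env, argv[0], &in) ||
        !enif_get_uint(env, argv[1], &byte) || byte > 255)
        return enif_make_badarg(env);

    buf = enif_make_new_binary(env, in.size, &out);
    exor_bytes(in.data, buf, in.size, (unsigned char)byte);
    return out;
}

/* The fourth field is the flags field: this marks exor/2 as a dirty CPU job. */
static ErlNifFunc funcs[] = {
    {"exor", 2, exor, ERL_NIF_DIRTY_JOB_CPU_BOUND}
};

ERL_NIF_INIT(bitwise, funcs, NULL, NULL, NULL, NULL)
#endif
```

For work that mostly blocks on I/O, the flag would be ERL_NIF_DIRTY_JOB_IO_BOUND instead.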
![Page 81: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/81.jpg)
USING DIRTY SCHEDULERS
![Page 85: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/85.jpg)
A DIRTY EXOR/2
• 5.95 seconds on a dirty scheduler thread
• 8 reductions and 0 yields
• But it was (almost) never on a regular scheduler
• Regular schedulers were running other jobs normally
![Page 91: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/91.jpg)
SCHEDULE IT DIRTY
• No chunking or yielding needed for dirty exor/2
• But dirty schedulers are finite resources
• Evil dirty NIFs can completely occupy all dirty schedulers and prevent other dirty jobs from running
• A dirty NIF can use enif_schedule_nif to reschedule, yielding to allow other dirty jobs to execute
• A NIF can use enif_schedule_nif to flip itself between regular mode and dirty mode
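The last bullet can be sketched as follows: a NIF that starts on a regular scheduler and uses enif_schedule_nif to flip itself to a dirty CPU scheduler only for large inputs. The 64 KB cutoff, the helper names (should_run_dirty, exor_bytes, exor_work), the module name bitwise, and the BUILD_AS_NIF guard are all assumptions made for illustration, not taken from the talk's code.

```c
#include <stddef.h>

/* Hypothetical cutoff: inputs at or above this size go to a dirty scheduler. */
#define DIRTY_THRESHOLD ((size_t)64 * 1024)

static int should_run_dirty(size_t len)
{
    return len >= DIRTY_THRESHOLD;
}

/* Hypothetical helper: XOR each byte of src with `byte`, writing to dst. */
static void exor_bytes(const unsigned char *src, unsigned char *dst,
                       size_t len, unsigned char byte)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ byte;
}

#ifdef BUILD_AS_NIF /* assumption: defined only when building against erl_nif.h */
#include <erl_nif.h>

/* Does the work on whichever kind of scheduler it was scheduled on. */
static ERL_NIF_TERM exor_work(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    ErlNifBinary in;
    unsigned int byte;
    ERL_NIF_TERM out;
    unsigned char *buf;

    if (argc != 2 || !enif_inspect_binary(env, argv[0], &in) ||
        !enif_get_uint(env, argv[1], &byte) || byte > 255)
        return enif_make_badarg(env);

    buf = enif_make_new_binary(env, in.size, &out);
    exor_bytes(in.data, buf, in.size, (unsigned char)byte);
    return out;
}

/* Entry point: begins on a regular scheduler. */
static ERL_NIF_TERM exor(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    ErlNifBinary in;

    if (argc != 2 || !enif_inspect_binary(env, argv[0], &in))
        return enif_make_badarg(env);

    /* Large input: flip to dirty mode by scheduling exor_work as a dirty CPU job. */
    if (should_run_dirty(in.size))
        return enif_schedule_nif(env, "exor_work", ERL_NIF_DIRTY_JOB_CPU_BOUND,
                                 exor_work, argc, argv);

    /* Small input: cheap enough to finish on the regular scheduler. */
    return exor_work(env, argc, argv);
}

static ErlNifFunc funcs[] = {
    {"exor", 2, exor, 0} /* flags 0: a regular (non-dirty) entry point */
};

ERL_NIF_INIT(bitwise, funcs, NULL, NULL, NULL, NULL)
#endif
```

Calling enif_schedule_nif with flags 0 from inside a dirty NIF goes the other way, back to a regular scheduler; calling it with a dirty flag from a dirty NIF reschedules the job so other dirty jobs get a turn.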
![Page 92: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/92.jpg)
NEXT STEPS
• Dirty drivers already in progress
• Native processes?
• See Rickard Green's original 2011 presentation on these topics: http://www.erlang-factory.com/upload/presentations/377/RickardGreen-NativeInterface.pdf
![Page 93: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/93.jpg)
ACKNOWLEDGEMENTS
• A huge thanks to Rickard Green of the Ericsson OTP team, who has patiently guided me in this work
• Also thanks to Sverker Eriksson of the OTP team
• And thanks to Anthony Ramine for mentioning "NIF traps" one day in the #erlang IRC channel, where I got the idea for enif_schedule_nif
![Page 94: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving](https://reader036.vdocument.in/reader036/viewer/2022062415/5fd937dd01e16018ef5b736b/html5/thumbnails/94.jpg)
THANKS
http://shop.oreilly.com/product/0636920024149.do#