THE OS SCHEDULER: A PERFORMANCE-CRITICAL COMPONENT
IN LINUX CLUSTER ENVIRONMENTS
By Jean-Pierre Lozi
Oracle Labs
KEYNOTE FOR BPOE-9 @ASPLOS2018
THE NINTH WORKSHOP ON BIG DATA BENCHMARKS,
PERFORMANCE, OPTIMIZATION AND EMERGING HARDWARE
CLUSTER COMPUTING
Multicore servers with dozens of cores
Common for, e.g., a Hadoop cluster, a distributed graph analytics engine, multiple apps...
High cost of infrastructure, high energy consumption
Linux-based software stack
Low (license) cost, yet high reliability
Challenge: don’t waste cycles!
Reduces infrastructure and energy costs
Improves bandwidth and latency
WHERE TO HUNT FOR CYCLES?
Applications, libraries: often main focus
Storage: optimized for decades! E.g., many filesystems, RDBMSes bypassing the OS
Network stack, NICs, reducing network usage (e.g. HDFS): common optimizations
NUMA, bus usage: placement, replication, interleaving, many recent papers
IS THE SCHEDULER WORKING IN YOUR CLUSTER?
It must be! 15 years ago, Linus Torvalds was already saying:
“And you have to realize that there are not very many things that have aged as well as the scheduler. Which is just another proof that scheduling is easy.”
Since then, people have been running applications on their multicore machines all the time, and they run, CPU usage is high, everything seems fine.
But would you notice if some cores remained idle intermittently, when they shouldn’t?
Do you keep monitoring tools (htop) running all the time?
Even if you do, would you be able to distinguish faulty behavior from normal noise?
Would you ever suspect the scheduler?
THIS TALK
Over the past few years of working on various projects, we sometimes saw strange, hard-to-explain performance results.
An example: when running a TPC-H benchmark on a 64-core machine, the benchmark ran much faster when we pinned threads to cores than when we let the Linux scheduler do its job.
Memory locality issue? Impossible, hardware counters showed no difference in the % of remote memory accesses, in cache misses, etc.
Contention over some resource (spinlock, etc.)? We investigated this for a long time, but couldn’t find anything that looked off.
Overhead of context switches? Threads moved a lot but we proved that the overhead was negligible.
We ended up suspecting the core behavior of the scheduler.
We implemented high-resolution tracing tools and saw that some cores were idle while others overloaded...
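For readers unfamiliar with pinning, here is a minimal sketch of what "pinning to cores" can look like from user space. It is not the setup used in the talk: it assumes Python on Linux, where `os.sched_setaffinity` is available, and the helper names `pin_to_one_cpu` and `unpin` are hypothetical.

```python
import os


def pin_to_one_cpu():
    """Pin the current process to a single allowed CPU; return the new CPU set.

    Returns None on platforms without os.sched_setaffinity (it is Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # CPUs the scheduler may currently use
    target = {min(allowed)}            # hypothetical choice: lowest-numbered allowed CPU
    os.sched_setaffinity(0, target)    # from now on the scheduler cannot migrate us
    return os.sched_getaffinity(0)


def unpin(cpus):
    """Restore a previously saved CPU set (no-op where affinity is unsupported)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)
```

Once pinned, placement decisions are taken out of the scheduler's hands, which is why comparing pinned and unpinned runs can point at the scheduler itself.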
THIS TALK
This is how we found our first performance bug. Which made us investigate more...
In the end: four Linux scheduler performance bugs that we found and analyzed
Always the same symptom: idle cores while others are overloaded
The bug-hunting was tough, and led us to develop our own tools
Performance impact of some of the bugs:
12-23% performance improvement on a popular database with TPC-H
137× performance improvement on HPC workloads
Not always possible to provide a simple, working fix...
Intrinsic problems with the design of the scheduler?
THIS TALK
Main takeaway of our analysis: more research must be directed towards implementing an efficient scheduler for multicore architectures, because contrary to what a lot of us think, this is *not* a solved problem!
Need convincing? Let’s go through it together...
...starting with a bit of background...
THE COMPLETELY FAIR SCHEDULER (CFS): CONCEPT
[Figure: Core 0, Core 1, Core 2, Core 3 and one runqueue holding threads with R = 103, R = 82, R = 24, R = 18, R = 12]
One runqueue where threads are globally sorted by runtime
When a thread is done running for its timeslice: enqueued again (R = 112 in the figure)
Some tasks have a lower niceness and thus have a longer timeslice (allowed to run longer)
Cores get their next task from the global runqueue
Of course, cannot work with a single runqueue because of contention
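The concept above can be sketched in a few lines. This is a toy model, not kernel code: it assumes a single core, a min-heap standing in for the sorted runqueue, and one fixed `timeslice` for every thread (niceness is ignored). The function name and thread names are hypothetical; the R values are the ones from the figure.

```python
import heapq


def simulate_global_runqueue(runtimes, timeslice, steps):
    """Repeatedly run the thread with the smallest runtime, then re-enqueue it.

    runtimes: dict mapping thread name -> accumulated runtime (the R values).
    Returns the order in which threads were picked.
    """
    queue = [(r, name) for name, r in runtimes.items()]
    heapq.heapify(queue)  # smallest runtime first
    picked = []
    for _ in range(steps):
        runtime, name = heapq.heappop(queue)                 # next task to run
        picked.append(name)
        heapq.heappush(queue, (runtime + timeslice, name))   # enqueued again
    return picked


order = simulate_global_runqueue(
    {"a": 103, "b": 82, "c": 24, "d": 18, "e": 12}, timeslice=10, steps=5
)
print(order)  # ['e', 'd', 'e', 'c', 'd']
```

With several cores, every pop and push would hit this one shared structure, which is the contention problem the slide points out.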
CFS: IN PRACTICE
One runqueue per core to avoid contention
CFS periodically balances “loads”:
load(task) = weight¹ × % CPU use²
¹ The lower the niceness, the higher the weight
² We don’t want a high-priority thread that sleeps a lot to take a whole CPU for itself and then mostly sleep!
Since there can be many cores: hierarchical approach!
[Figure: runqueues of Core 0 and Core 1; one thread with W=6 and six threads with W=1]
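As a rough illustration of the load formula, the sketch below computes per-runqueue loads. It assumes, as the figure seems to suggest, that the W=6 thread sits alone on one core and the six W=1 threads share the other, and that every thread is always runnable (CPU use of 1.0). The function names are hypothetical, and the real kernel computation uses more inputs than this.

```python
def task_load(weight, cpu_fraction):
    """load(task) = weight x fraction of CPU time the task actually uses."""
    return weight * cpu_fraction


def runqueue_load(tasks):
    """Sum the loads of (weight, cpu_fraction) pairs on one runqueue."""
    return sum(task_load(w, f) for w, f in tasks)


core0 = [(6, 1.0)]      # assumed placement: one high-weight thread
core1 = [(1, 1.0)] * 6  # assumed placement: six low-weight threads

print(runqueue_load(core0), runqueue_load(core1))  # 6.0 6.0
print(task_load(6, 0.25))  # 1.5: a mostly sleeping high-weight thread counts for less
```

Under these assumptions the two runqueues carry the same load even though one holds six times as many threads, and the second print shows why CPU use is part of the formula.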
CFS IN PRACTICE: HIERARCHICAL LOAD BALANCING
[Figure, animated: Core 0, Core 1, Core 2, Core 3, with runqueues holding threads of load L=1000 or L=3000]
Initial per-core loads: L=3000, L=2000, L=6000, L=1000
Balancing within each pair of cores: L=3000 and L=2000 stay as they are, L=6000 and L=1000 become L=4000 and L=3000. Balanced! Balanced!
Balancing between the two groups: AVG(L)=2500 and AVG(L)=3500 become AVG(L)=3000 and AVG(L)=3000, with L=3000 on every core. Balanced!
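The two-level balancing in the figure can be mimicked with a toy model. This sketch is not how the kernel implements it: the assignment of threads to cores is an assumption chosen to match the per-core loads in the figure, the move rule ("move a thread only if it narrows the gap") is a simplification, and all function names are hypothetical.

```python
def load(core):
    """Load of a core = sum of the loads of the threads on its runqueue."""
    return sum(core)


def group_load(group):
    return sum(load(core) for core in group)


def balance_pair(a, b):
    """Lower level: move threads from the busier core while the gap narrows."""
    src, dst = (a, b) if load(a) >= load(b) else (b, a)
    while True:
        gap = load(src) - load(dst)
        candidates = [t for t in src if 0 < t < gap]  # moves that narrow the gap
        if not candidates:
            break
        t = min(candidates)
        src.remove(t)
        dst.append(t)


def balance_groups(g1, g2):
    """Higher level: compare whole groups, move from the heavier to the lighter."""
    heavy, light = (g1, g2) if group_load(g1) >= group_load(g2) else (g2, g1)
    while True:
        gap = group_load(heavy) - group_load(light)
        src = max(heavy, key=load)  # busiest core of the heavier group
        dst = min(light, key=load)  # idlest core of the lighter group
        candidates = [t for t in src if 0 < t < gap]
        if not candidates:
            break
        t = min(candidates)
        src.remove(t)
        dst.append(t)


# Assumed placement matching the figure's loads 3000, 2000, 6000, 1000.
cores = [[3000], [1000, 1000], [1000] * 6, [1000]]
groups = [[cores[0], cores[1]], [cores[2], cores[3]]]

for a, b in groups:
    balance_pair(a, b)
after_level1 = [load(c) for c in cores]

balance_groups(groups[0], groups[1])
after_level2 = [load(c) for c in cores]

print(after_level1)  # [3000, 2000, 4000, 3000]
print(after_level2)  # [3000, 3000, 3000, 3000]
```

Comparing whole groups is equivalent to comparing their averages here because both groups have the same number of cores.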
CFS IN PRACTICE : HIERARCHICAL LOAD BALANCING
Note that only the average load of groups is considered
If for some reason the lower-level load-balancing fails, nothing happens at a higher level:
[Figure: four cores in two groups. Core 0: L=0; Core 1: L=6000; Core 2: L=3000; Core 3: L=3000. Both groups have AVG(L)=3000, so the higher level reports "Balanced!" even though Core 0 is idle (!!!).]
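The situation in the figure can be reproduced in a few lines of Python. This is only a minimal sketch, not the kernel's code: the function names and the strict equality test on group averages are assumptions made for illustration, and the core loads are the ones shown on the slide.

```python
from statistics import mean

def group_average(loads):
    """Average load of the cores in one scheduling group."""
    return mean(loads)

def higher_level_sees_imbalance(group_a, group_b):
    """Simplified higher-level check: only the group averages are compared."""
    return group_average(group_a) != group_average(group_b)

# Core loads from the slide: (Core 0, Core 1) and (Core 2, Core 3).
group_a = [0, 6000]
group_b = [3000, 3000]

print(group_average(group_a), group_average(group_b))
print(higher_level_sees_imbalance(group_a, group_b))
# Core 0 is idle, yet the higher level has no reason to migrate anything.
print(min(group_a) == 0)
```

Both averages come out at 3000, so the higher-level check reports no imbalance while one core sits idle next to an overloaded one.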
CFS IN PRACTICE: MORE HEURISTICS
Load calculations are actually more complicated and use more heuristics.
One of them aims to increase fairness between “sessions”.
Objective: make sure that launching lots of threads from one terminal doesn’t prevent other processes on the machine (potentially from other users) from running.
Otherwise, a user could easily take more resources than other users by spawning many threads...
[Figure: two sessions, Session (tty) 1 and Session (tty) 2, with five threads in total, each with L=1000. With per-thread balancing, one session gets 50% of a CPU and the other gets 150%.]
Solution: divide the load of a task by the number of threads in its tty...
[Figure: the same two sessions after the fix. The single thread of one session has L=1000 and the four threads of the other session have L=250 each, so each session gets 100% of a CPU.]
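The effect of this heuristic on fairness can be checked with a small sketch. It is an illustration under stated assumptions, not the scheduler's implementation: the session names "A" and "B", the helper names, the placement of threads on two cores, and the rule that a core's time is split in proportion to load are all hypothetical choices made to mirror the figures.

```python
from collections import defaultdict

def per_thread_load(base_load, threads_in_session):
    """Heuristic from the slide: divide a task's load by its session's thread count."""
    return base_load / threads_in_session

def session_cpu_shares(cores):
    """Return the CPU share (in % of one CPU) that each session receives.

    `cores` is a list of run queues; each run queue is a non-empty list of
    (session, load) pairs. A core's time is split in proportion to load.
    """
    shares = defaultdict(float)
    for run_queue in cores:
        total = sum(load for _, load in run_queue)
        for session, load in run_queue:
            shares[session] += 100.0 * load / total
    return dict(shares)

# Without the heuristic: five threads with L=1000, spread 2 + 3 over two cores
# (assumed placement; session A has one thread, session B has four).
before = [
    [("A", 1000), ("B", 1000)],
    [("B", 1000), ("B", 1000), ("B", 1000)],
]

# With the heuristic: per-thread loads become 1000 and 250, so both cores
# carry the same total load when each session has a core to itself.
load_a = per_thread_load(1000, 1)
load_b = per_thread_load(1000, 4)
after = [
    [("A", load_a)],
    [("B", load_b)] * 4,
]

print(session_cpu_shares(before))
print(session_cpu_shares(after))
```

With these inputs, session A receives about 50% of a CPU and session B about 150% before the heuristic, and each session receives 100% after it.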
BUG 1/4: GROUP IMBALANCE
One session (tty) runs a single thread:
Load(thread) = %cpu × weight / #threads = 100 × 10 / 1 = 1000
The other session (tty) runs eight threads:
Load(thread) = %cpu × weight / #threads = 100 × 10 / 8 = 125
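The arithmetic above can be written as a short worked example. The function and variable names are hypothetical, and the placement of threads on four cores is an assumption chosen to show how an idle core can coexist with equal-looking loads.

```python
def thread_load(cpu_percent, weight, threads_in_session):
    """Load(thread) = %cpu x weight / #threads, as written on the slide."""
    return cpu_percent * weight / threads_in_session

heavy = thread_load(100, 10, 1)  # the session with a single thread
light = thread_load(100, 10, 8)  # the session with eight threads

# Assumed placement: Core 0 idle, Core 1 runs the heavy thread,
# Cores 2 and 3 run four light threads each.
cores = [[], [heavy], [light] * 4, [light] * 4]
core_loads = [sum(run_queue) for run_queue in cores]

print(heavy, light)
print(core_loads)
```

The eight light threads together weigh exactly as much as the single heavy thread, which is why a core running four of them looks only half as loaded as the core running one thread.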
[Figure: four cores in two groups. Core 0: L=0; Core 1: L=1000 (the thread with L=1000); Core 2: L=500; Core 3: L=500 (threads with L=125). Each pair of cores reports "Balanced!", and both groups have AVG(L)=500, so the higher level reports "Balanced!" too, even though Core 0 is idle (!!!).]
Another example, on a 64-core machine, with load balancing:
First between pairs of cores (Bulldozer architecture, a bit like hyperthreading)
Then between NUMA nodes
User 1 launches: ssh <machine> R & ssh <machine> R &
User 2 launches: ssh <machine> make -j 64 kernel
The bug happens at two levels:
The other core of a pair of cores is idle
Other cores on the NUMA node are less busy...
A simple solution: balance the minimum load of groups instead of the average
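[Figure: the same four cores, balanced on the minimum load of each group. Initially MIN(L)=0 for the group (Core 0, Core 1) and MIN(L)=500 for the group (Core 2, Core 3), so the imbalance is detected and Core 0 receives threads. Final state: Core 0 L=250, Core 1 L=1000, Core 2 and Core 3 labeled L=325 each on the slide (three threads with L=125 add up to 375), with MIN(L)=250 and MIN(L)=325 for the two groups; every level reports "Balanced!".]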
After the fix, make runs 13% faster, and R is not impacted.
A simple solution, but is it ideal? The minimum load is more volatile than the average...
This may cause lots of unnecessary rebalancing. Do the load calculations need revamping?
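To make the difference between the two metrics concrete, the sketch below compares the groups on their average and on their minimum, then migrates threads while that narrows the gap between group minimums. The migration loop is an assumption made for illustration (the slides only state which metric is balanced), and all names are hypothetical.

```python
from statistics import mean

def core_load(cores, core):
    return sum(cores[core])

def group_metric(cores, group, metric):
    """Apply `metric` (for example mean or min) to the core loads of a group."""
    return metric([core_load(cores, c) for c in group])

def balance_on_minimum(cores, dst_group, src_group):
    """Move threads one by one while that narrows the gap between group minimums."""
    while True:
        dst = min(dst_group, key=lambda c: core_load(cores, c))
        src = max(src_group, key=lambda c: core_load(cores, c))
        if not cores[src]:
            break
        gap = abs(group_metric(cores, src_group, min)
                  - group_metric(cores, dst_group, min))
        cores[dst].append(cores[src].pop())
        new_gap = abs(group_metric(cores, src_group, min)
                      - group_metric(cores, dst_group, min))
        if new_gap >= gap:
            # The move did not help: undo it and stop.
            cores[src].append(cores[dst].pop())
            break

# Thread loads per core, as in the group imbalance example.
cores = {0: [], 1: [1000], 2: [125] * 4, 3: [125] * 4}
group_a, group_b = (0, 1), (2, 3)

# Average-based comparison: both groups look identical.
print(group_metric(cores, group_a, mean), group_metric(cores, group_b, mean))
# Minimum-based comparison: the idle core becomes visible.
print(group_metric(cores, group_a, min), group_metric(cores, group_b, min))

balance_on_minimum(cores, dst_group=group_a, src_group=group_b)
final_loads = [core_load(cores, c) for c in sorted(cores)]
print(final_loads)
```

Under these assumptions the idle core ends up with two light threads and the other two cores keep three each. The sketch says nothing about the volatility concern raised above, which depends on how loads change over time.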
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
Hierarchical load balancing is based on groups of cores named scheduling domains
Based on affinity, i.e., pairs of cores, dies, CPUs, NUMA nodes, etc.
Each scheduling domain contains groups that are the lower-level scheduling domains
For instance, on our 64-core AMD Bulldozer machine:
At level 1, each pair of cores (scheduling domain) contains cores (scheduling groups)
At level 2, each CPU (s.d.) contains pairs of cores (s.g.)
At level 3, each group of directly connected CPUs (s.d.) contains CPUs (s.g.)
At level 4, the whole machine (s.d.) contains groups of directly connected CPUs (s.g.)
![Page 121: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/121.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
Bulldozer 64-core: eight CPUs with 8 cores each, non-complete interconnect graph!
![Page 122: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/122.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
At the first level, the first core balances load with the other core of the same pair (because they share resources: high affinity)
![Page 123: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/123.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
At the 2nd level, the first pair balances load with the other pairs on the same CPU
![Page 124: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/124.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
At the 3rd level, the first CPU balances load with directly connected CPUs
![Page 125: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/125.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
At the 4th level, the first group of directly connected CPUs balances load with the other groups of directly connected CPUs
![Page 126: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/126.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
Groups of CPUs are built by: (1) picking the first CPU and looking for all directly connected CPUs
![Page 127: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/127.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
Groups of CPUs are built by: (2) picking the first CPU not in a group and looking for all directly connected CPUs
![Page 128: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/128.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
And then stop, because all CPUs are in a group
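The construction steps above can be sketched in a few lines of Python. The interconnect graph below is hypothetical (it is not the real Bulldozer wiring), CPUs are numbered from 0, and the function name is made up for the example. The sketch only mirrors the procedure described on the slides.

```python
# Hypothetical interconnect graph (NOT the real Bulldozer wiring):
# INTERCONNECT[c] is the set of CPUs directly connected to CPU c.
INTERCONNECT = {
    0: {1, 2, 4, 6},
    1: {0, 3, 5, 7},
    2: {0, 3, 4, 6},
    3: {1, 2, 4, 5, 7},
    4: {0, 2, 3, 5},
    5: {1, 3, 4, 7},
    6: {0, 2, 7},
    7: {1, 3, 5, 6},
}


def build_groups_first_uncovered(graph):
    """Build a group around the first CPU, then around each first CPU not yet covered."""
    groups = []
    covered = set()
    for cpu in sorted(graph):
        if cpu in covered:
            continue  # already in some group: no group is built around it
        group = {cpu} | graph[cpu]
        groups.append(group)
        covered |= group
    return groups


groups = build_groups_first_uncovered(INTERCONNECT)
for group in groups:
    print(sorted(group))
print("CPUs in both groups:", sorted(groups[0] & groups[1]))
```

With this graph the procedure stops after two groups, and several CPUs belong to both of them. That overlap is what the following slides exploit.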
![Page 130: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/130.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
Suppose we taskset an application on these two CPUs, two hops apart (16 threads)
![Page 131: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/131.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
And threads are created on this core
![Page 132: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/132.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
Load gets correctly balanced on the pair of cores
![Page 133: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/133.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
Load gets correctly balanced on the CPU
![Page 134: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/134.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
No stealing at level 3, because the two nodes are not directly connected (they are two hops apart)
![Page 135: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/135.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
At level 4, stealing between the red and green groups... Overloaded node in both groups!
![Page 138: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/138.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
load(red) = 16 * load(thread)
load(green) = 16 * load(thread)
The two groups look perfectly balanced, so no thread is stolen!
![Page 142: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/142.jpg)
BUG 2/4: SCHEDULING GROUP CONSTRUCTION
Fix: build the domains by creating one “directly connected” group for every CPU
Instead of only for the first CPU and the first one not “covered” by a group
Performance improvement of NAS applications on two nodes:

| Application | With bug (s) | After fix (s) | Improvement |
|-------------|--------------|---------------|-------------|
| BT          | 99           | 56            | 1.75x       |
| CG          | 42           | 15            | 2.73x       |
| EP          | 73           | 36            | 2x          |
| LU          | 1040         | 38            | 27x         |

Very large improvement for LU: when it can’t use all 16 cores, it runs more threads than cores
The fix solves the resulting spinlock issues (incl. potential convoys)
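To see why one group per CPU makes the imbalance visible, here is a small Python sketch that compares group loads under both constructions. The interconnect graph is hypothetical (not the real Bulldozer wiring), CPUs are numbered from 0, and the pinned CPUs are assumed to be CPUs 1 and 2. All function names are invented for the example, and the kernel's actual fix may differ in its details.

```python
# Hypothetical interconnect graph (NOT the real Bulldozer wiring):
# INTERCONNECT[c] is the set of CPUs directly connected to CPU c.
INTERCONNECT = {
    0: {1, 2, 4, 6},
    1: {0, 3, 5, 7},
    2: {0, 3, 4, 6},
    3: {1, 2, 4, 5, 7},
    4: {0, 2, 3, 5},
    5: {1, 3, 4, 7},
    6: {0, 2, 7},
    7: {1, 3, 5, 6},
}


def groups_first_uncovered(graph):
    """Original construction: first CPU, then each first CPU not yet covered."""
    groups, covered = [], set()
    for cpu in sorted(graph):
        if cpu not in covered:
            groups.append({cpu} | graph[cpu])
            covered |= groups[-1]
    return groups


def groups_per_cpu(graph):
    """Fixed construction: one "directly connected" group for every CPU."""
    return [{cpu} | graph[cpu] for cpu in sorted(graph)]


def group_load(group, threads_per_cpu):
    """Number of threads running on the CPUs of a group."""
    return sum(threads_per_cpu.get(cpu, 0) for cpu in group)


# 16 threads pinned to CPUs 1 and 2 (two hops apart), all created on CPU 1.
threads_per_cpu = {1: 16, 2: 0}

buggy_loads = [group_load(g, threads_per_cpu) for g in groups_first_uncovered(INTERCONNECT)]
fixed_loads = [group_load(g, threads_per_cpu) for g in groups_per_cpu(INTERCONNECT)]

print("with bug :", buggy_loads)  # both groups contain CPUs 1 and 2
print("after fix:", fixed_loads)  # index = CPU the group was built around
```

In this toy setup both original groups report the same load. After the fix, the group built around CPU 2 no longer contains CPU 1, so it reports a load of 0 against 16 for the group built around CPU 1, and the imbalance can be detected.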
![Page 150: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/150.jpg)
BUG 3/4: MISSING SCHEDULING DOMAINS
In addition to this, when the scheduling domains are rebuilt, levels 3 and 4 are not rebuilt...
I.e., no balancing between CPUs that are directly connected or two hops apart (i.e., between any CPUs)
Happens, for instance, when disabling and re-enabling a core
Launch an application: the first thread is created on CPU 1
The first thread stays on CPU 1, and the next threads are also created on CPU 1 (default Linux behavior)
All the threads will be on CPU 1 forever!

| Application | With bug (s) | After fix (s) | Improvement |
|-------------|--------------|---------------|-------------|
| BT          | 122          | 23            | 5.2x        |
| CG          | 134          | 5.4           | 25x         |
| EP          | 72           | 18            | 4x          |
| LU          | 2196         | 16            | 137x        |
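The effect of the missing levels can be illustrated with a toy model. This is not how the kernel balances load: the sketch below simply evens out thread counts inside every domain of each enabled level, and it collapses levels 3 and 4 into a single machine-wide level. The core numbering and all names are assumptions made for the example.

```python
NUM_CORES = 64
PAIR, CPU, MACHINE = 2, 8, 64  # domain sizes in cores (levels 3 and 4 merged)


def rebalance(run_queues, domain_sizes):
    """Even out thread counts inside every domain, smallest domains first."""
    queues = list(run_queues)
    for size in sorted(domain_sizes):
        for start in range(0, len(queues), size):
            total = sum(queues[start:start + size])
            share, extra = divmod(total, size)
            # Spread the domain's threads as evenly as possible over its cores.
            queues[start:start + size] = [
                share + (1 if i < extra else 0) for i in range(size)
            ]
    return queues


# 64 threads, all created on the first core of the first CPU.
initial = [64] + [0] * (NUM_CORES - 1)

with_bug = rebalance(initial, [PAIR, CPU])            # cross-CPU levels missing
after_fix = rebalance(initial, [PAIR, CPU, MACHINE])  # all levels present

print("busy cores with bug :", sum(1 for n in with_bug if n))
print("busy cores after fix:", sum(1 for n in after_fix if n))
```

Without a level that spans several CPUs, the threads in this model never leave the CPU on which they were created, whatever happens on the other 56 cores.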
![Page 155: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/155.jpg)
BUG 4/4: OVERLOAD-ON-WAKEUP
Until now, we analyzed the behavior of the periodic (buggy) hierarchical load balancing that uses (buggy) scheduling domains
But there is another way load is balanced: threads get to pick the core on which they are woken up when they are done blocking (after a lock acquisition, an I/O, etc.)
Here is how it works: when a thread wakes up, it looks for non-busy cores on the same CPU in order to decide on which core it should wake up.
Only cores on the same CPU are considered, in order to improve data locality...
![Page 159: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/159.jpg)
BUG 4/4: OVERLOAD-ON-WAKEUP
Commercial DB with TPC-H, 64 threads on 64 cores, nothing else on the machine.
With threads pinned to cores, it works fine. With Linux scheduling, execution is much slower: there are phases with overloaded cores while other cores stay idle for a long time!
![Page 166: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/166.jpg)
BUG 4/4
Beginning: 8 threads per CPU, all cores busy
Occasionally, one DB thread is migrated to another CPU, because a transient thread appeared during rebalancing and looked like an imbalance (only instantaneous loads are considered)
Now there are 9 threads on one CPU and 7 on another. The CPU with 9 threads is slow, and it slows down the whole execution because all threads wait for each other (barriers), i.e., idle cores everywhere...
Barriers: threads keep sleeping and waking up, but the extra thread never wakes up on an idle core, because the wakeup algorithm only considers the local CPU!
Periodic rebalancing can’t rebalance the load most of the time: with many idle cores, it is hard to see an imbalance between the 9-thread CPU and the 7-thread CPU...
“Solution”: wake up on the core that has been idle for the longest time (not great for energy)
[Figure annotations: “9 threads”, “7 threads”, “Idle (long)”, “Slowed down execution”]
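The two wakeup policies described above can be contrasted with a short Python sketch. It is a simplified model, not kernel code: the core numbering, the `run_queues` and `idle_since` structures, and both function names are assumptions made for the example.

```python
CORES_PER_CPU = 8


def wake_local(prev_core, run_queues):
    """Policy described above: only look for an idle core on the local CPU."""
    first = (prev_core // CORES_PER_CPU) * CORES_PER_CPU
    idle = [c for c in range(first, first + CORES_PER_CPU) if run_queues[c] == 0]
    return idle[0] if idle else prev_core


def wake_longest_idle(prev_core, run_queues, idle_since):
    """Slide's "solution": pick the core that has been idle the longest, on any CPU."""
    idle = [c for c in range(len(run_queues)) if run_queues[c] == 0]
    if not idle:
        return prev_core
    return min(idle, key=lambda c: idle_since[c])


# Two CPUs (cores 0-7 and 8-15). The waking thread last ran on core 0.
# run_queues counts the OTHER runnable threads on each core:
# CPU 0 is full (9 threads including the waking one), CPU 1 has 7 threads.
run_queues = [1] * 8 + [1] * 7 + [0]
idle_since = {15: 12.5}  # time at which each idle core became idle

print("local-only policy  -> core", wake_local(0, run_queues))
print("longest-idle policy -> core", wake_longest_idle(0, run_queues, idle_since))
```

In this scenario the local-only policy sends the thread back to an already busy core, while the longest-idle policy finds the idle core on the other CPU, at the cost of waking a core that could otherwise have stayed in a low-power state.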
![Page 174: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/174.jpg)
WHERE DO WE GO FROM HERE?
Load balancing on a multicore machine is usually considered a solved problem
To recap, on Linux, load balancing works this way:
Hierarchical rebalancing uses a metric named load,
↑ Found fundamental issue here
to periodically balance threads between scheduling domains.
↑ Found fundamental issue here
In addition to this, threads balance load by selecting the core on which they wake up.
↑ Found fundamental issue here
Wait, was anything working at all?
![Page 178: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/178.jpg)
WHERE DO WE GO FROM HERE?
Many major issues went unnoticed for years in the scheduler... How can we prevent this from happening again?
Code testing
No clear fault (no crash, no deadlock, etc.)
Existing tools don’t target these bugs
Performance regression testing
Usually done with 1 app on a machine to avoid interactions
Insufficient coverage
Model checking, formal proofs
Complex, parallel code: so far, nobody knows how to do it...
![Page 182: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/182.jpg)
WHERE DO WE GO FROM HERE? Idea 1 (short-term hack): we implemented a sanity checker
Flowchart of the sanity checker:
Every second: is a core idle while another core is overloaded?
If yes: monitor thread migrations, creations and destructions for 100 ms
If the imbalance is not fixed: report a bug
Not an assertion/watchdog: a violation might not be a bug, so the situation has to last for a long time
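The checking loop on this slide can be sketched as a small user-space simulation. This is only an illustration under stated assumptions, not the checker presented in the talk: the names `is_violated`, `sanity_check` and `replay` are hypothetical, "overloaded" is taken to mean more than one runnable thread, and a fixed number of samples stands in for the monitoring window.

```python
from typing import Callable, List


def is_violated(runqueues: List[int]) -> bool:
    """True if some core is idle while another has more than one runnable thread.

    Assumption: "overloaded" means more than one runnable thread on a core.
    """
    return any(n == 0 for n in runqueues) and any(n > 1 for n in runqueues)


def sanity_check(snapshot: Callable[[], List[int]], monitor_samples: int = 10) -> bool:
    """One periodic check. Returns True if a bug should be reported.

    `snapshot` (hypothetical) returns the number of runnable threads per core.
    `monitor_samples` stands in for the monitoring window of the slide.
    """
    if not is_violated(snapshot()):
        return False  # invariant holds, wait for the next periodic check
    # Invariant violated: keep watching instead of failing right away,
    # because a short-lived imbalance is not necessarily a bug.
    for _ in range(monitor_samples):
        if not is_violated(snapshot()):
            return False  # the scheduler fixed the imbalance in time
    return True  # imbalance lasted for the whole window


def replay(states: List[List[int]]) -> Callable[[], List[int]]:
    """Test helper: replays recorded states, then repeats the last one."""
    it = iter(states)
    last = states[-1]
    return lambda: next(it, last)
```

For example, `sanity_check(replay([[0, 2], [0, 2], [1, 1]]))` reports nothing because the imbalance disappears during monitoring, whereas a state that stays at `[0, 3, 1]` is reported.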
![Page 185: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/185.jpg)
WHERE DO WE GO FROM HERE? Idea 2: fine-grained tracers!
Built a simple one, turned out to be the only way to really understand what happens
Aggregate metrics (CPI, cache misses, etc.) not precise enough
Could really be improved!
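A toy example of why aggregate metrics can hide what a fine-grained trace reveals. This is a sketch, not the tracer built by the authors: the event format `(timestamp, core, runnable threads)` and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# (timestamp, core, number of runnable threads on that core after the event)
Event = Tuple[float, int, int]


def violation_time(events: List[Event], n_cores: int, end: float) -> float:
    """Total time during which one core is idle while another is overloaded."""
    state = [0] * n_cores
    total, prev = 0.0, 0.0
    for ts, core, nr in sorted(events):
        if min(state) == 0 and max(state) > 1:
            total += ts - prev  # the state before this event was a violation
        state[core] = nr
        prev = ts
    if min(state) == 0 and max(state) > 1:
        total += end - prev
    return total


def average_load(events: List[Event], n_cores: int, end: float) -> List[float]:
    """Time-weighted average runqueue length per core (an aggregate metric)."""
    state = [0] * n_cores
    acc = [0.0] * n_cores
    prev = 0.0
    for ts, core, nr in sorted(events):
        for c in range(n_cores):
            acc[c] += state[c] * (ts - prev)
        state[core] = nr
        prev = ts
    for c in range(n_cores):
        acc[c] += state[c] * (end - prev)
    return [a / end for a in acc]


# Two threads sit on core 0 for 5 time units, then both sit on core 1.
trace = [(0.0, 0, 2), (0.0, 1, 0), (5.0, 0, 0), (5.0, 1, 2)]
```

On this trace the average load is identical on both cores, which looks balanced, while the event-level view shows a core idle next to an overloaded one for the entire run.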
![Page 189: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/189.jpg)
WHERE DO WE GO FROM HERE? Idea 3: produce a dedicated profiler!
Lack of tools!
Possible to detect if slowdown comes from scheduler or application?
Would avoid a lot of wasted time!
Follow threads, and see if they are often on overloaded cores when they shouldn’t be?
Detect if threads are unnecessarily moved to a core/node that leads to many cache misses?
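The two questions above could translate into per-thread metrics along the following lines. This is a hypothetical sketch, not an existing tool: the sample format, the function names and the fixed `cores_per_node` layout are all assumptions, and cross-node migrations are used only as a rough proxy for moves that may hurt cache locality.

```python
from typing import List, Sequence, Tuple

# (core the followed thread is on, runnable threads per core at that instant)
Sample = Tuple[int, Sequence[int]]


def avoidable_sharing_fraction(samples: List[Sample]) -> float:
    """Fraction of samples where the thread shares its core although a core is idle."""
    if not samples:
        return 0.0
    bad = sum(1 for core, rq in samples if rq[core] > 1 and 0 in rq)
    return bad / len(samples)


def cross_node_migrations(samples: List[Sample], cores_per_node: int) -> int:
    """Number of times the thread changes NUMA node between consecutive samples.

    Assumption: cores are numbered so that core // cores_per_node is the node.
    """
    nodes = [core // cores_per_node for core, _ in samples]
    return sum(1 for a, b in zip(nodes, nodes[1:]) if a != b)


# Four samples of one thread on a machine with 2 nodes of 2 cores each.
samples = [
    (0, [2, 0, 1, 1]),
    (0, [2, 0, 1, 1]),
    (1, [1, 1, 1, 1]),
    (2, [1, 0, 2, 1]),
]
```

A high first metric would point at the scheduler rather than the application; the second only flags candidates that need cache-miss counters to confirm.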
![Page 193: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/193.jpg)
WHERE DO WE GO FROM HERE? Idea 4: produce good scheduler benchmarks!
Really needed, and virtually nonexistent!
Not an easy problem: extremely broad coverage needed!
Using a combination of many real applications: a configuration nightmare!
Simulated workloads?
Have to do elaborate work: spinning and sleeping not efficient
Have to be representative of reality, have to cover corner cases
Use machine learning? Genetic algorithms?
![Page 198: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/198.jpg)
WHERE DO WE GO FROM HERE?
Idea 5: switch to simpler schedulers, easier to reason about!
Let’s take a step back: *why* did we end up in this situation?
Linux used for many classes of applications (big data, n-tier, cloud, interactive, DB, HPC...)
Multicore architectures increasingly diverse and complex!
Result: very complex monolithic scheduler supposed to work in all situations!
Many heuristics interact in complex, unpredictable ways
Some features (tasksets, cgroups/autogroups...) greatly complicate, e.g., load balancing
Keeps getting worse!
E.g., task_struct: 163 fields in Linux 3.0 (07/2011), 215 fields in 4.6 (05/2016)
20,000 lines of C!
![Page 199: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/199.jpg)
WHERE DO WE GO FROM HERE?
Idea 5: switch to simpler schedulers, easier to reason about!
[Figure: three plots over 2009-2017: # lines of code, # functions, # variables]
![Page 201: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/201.jpg)
WHERE DO WE GO FROM HERE?
Idea 5: switch to simpler schedulers, easier to reason about!
Proving the scheduler implementation correct: not doable!
Way too much code for current technology
We’d need to detect high-level abstractions from low-level C: a challenge!
Even if we managed that, how do we keep up with updates?
Code keeps evolving with new architectures and application needs...
We need another approach...
![Page 204: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/204.jpg)
WHERE DO WE GO FROM HERE?
Idea 5: switch to simpler schedulers, easier to reason about!
Write simple schedulers with proven properties!
A scheduler is tailored to a (class of) parallel application(s)
Specific thread election criterion, load balancing criterion, state machine with events...
Machine partitioned into sets of cores that run different schedulers
Scheduler deployed together with (an) application(s) on a partition
Through a DSL, for two reasons:
Much easier, safer and less bug-prone than writing low-level C kernel code!
Clear abstractions, possible to reason about them and prove properties
Work conservation, load balancing live and in a finite # of rounds, valid hierarchy...
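To give a feel for what "work conservation in a finite number of rounds" means on a simple policy, here is a toy model. It is not the DSL from the talk and not a proof: the policy (each idle core steals one thread from the busiest core, cores processed one after another) and all names are assumptions, and the property is only checked by exhaustive enumeration of small states.

```python
from itertools import product
from typing import List


def work_conserving(rq: List[int]) -> bool:
    """No core is idle while another core has more than one runnable thread."""
    return not (0 in rq and max(rq) > 1)


def balance_round(rq: List[int]) -> List[int]:
    """One round: each idle core steals one thread from the currently busiest core."""
    rq = list(rq)
    for core in range(len(rq)):
        if rq[core] == 0:
            busiest = max(range(len(rq)), key=rq.__getitem__)
            if rq[busiest] > 1:  # never steal a core's only thread
                rq[busiest] -= 1
                rq[core] += 1
    return rq


def rounds_to_converge(rq: List[int], max_rounds: int) -> int:
    """Rounds needed to become work-conserving; raises if the bound is exceeded."""
    for n in range(max_rounds + 1):
        if work_conserving(rq):
            return n
        rq = balance_round(rq)
    raise AssertionError("not work-conserving within the bound")


def check_all_states(n_cores: int, max_threads: int, max_rounds: int) -> bool:
    """Check every small state: threads are preserved and the bound holds."""
    for state in product(range(max_threads + 1), repeat=n_cores):
        rq = list(state)
        assert sum(balance_round(rq)) == sum(rq)  # no thread lost or duplicated
        rounds_to_converge(rq, max_rounds)
    return True
```

In this sequential model a single round is enough, which `check_all_states(3, 4, 1)` confirms for 3 cores with up to 4 threads each. With concurrent steals that can fail, the bound is the kind of property one would want to prove rather than enumerate.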
![Page 205: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/205.jpg)
WHERE DO WE GO FROM HERE?
Idea 6: ???
Any other ideas?
![Page 213: The OS Scheduler: a Performance-Critical Component in](https://reader031.vdocument.in/reader031/viewer/2022020916/61b32804ac1a5d77ad26e876/html5/thumbnails/213.jpg)
CONCLUSION
Scheduling (as in dividing CPU cycles among threads) was thought to be a solved problem.
Analysis: fundamental issues in the load metric, scheduling domains, scheduling choices...
Very bug-prone implementation following years of adapting to hardware
Can’t ensure simple “invariant”: no idle cores while overloaded cores
Proposed fixes: not always satisfactory
What can we do? Many things to explore!
Our takeaway: more research must be directed towards implementing efficient and reliable schedulers for multicore architectures!