User-Level Process towards Exascale Systems
Akio Shimada[1], Atsushi Hori[1], Yutaka Ishikawa[1], Pavan Balaji[2]
[1] RIKEN AICS, [2] Argonne National Laboratory
Background
• MPI processes running on an HPC cluster communicate with each other to exchange data for parallel computation
  – An MPI process must wait for the completion of a communication
• Latency hiding is an important issue towards Exascale systems
  – The network of an HPC cluster will be larger
Methods for Latency Hiding
• Non-blocking communication
  – Overlapping communication and computation
• Oversubscription
  – Binding multiple processes to one CPU core
  – Switching to another process when a process is blocked waiting for the completion of a communication
Problem
• Process context switch is slow
  – The overhead of the process context switch spoils the benefit of process oversubscription in some cases [Iancu et al., IPDPS 2010]
    • The overhead of jumping into the kernel context
    • The overhead of address space switching
Conventional Approach
• Oversubscription using user-level threads (e.g. FG-MPI)
  – Invoking multiple user-level threads within a process
  – Assigning the role of an MPI process to each user-level thread
• Pros and cons
  – Pros
    • Fast context switch
      – The context switch between user-level threads can be performed in user-space
      – The context switch between user-level threads does not require address space switching
  – Cons
    • Modification to the application is required
      – Program code (text) and data (data, bss and heap) are shared among the user-level threads playing the role of MPI processes
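Why shared data forces application changes can be shown with a minimal sketch. It assumes Linux with glibc and uses the ucontext calls (getcontext, makecontext, swapcontext) as a stand-in for a user-level thread library; it is not how FG-MPI is implemented. The names my_rank, rank_body and run_shared_global_demo are hypothetical. Two user-level threads each store their rank in the same global, as an unmodified MPI program might, and the first thread finds its value overwritten after being resumed.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

/* A global, as an unmodified MPI program might declare it (hypothetical). */
static int my_rank;

static ucontext_t main_ctx, ctx[2];
static char stacks[2][STACK_SIZE];
static int observed[2];   /* rank each "process" reads back after yielding */

/* Body of one user-level thread playing the role of an MPI process. */
static void rank_body(int id)
{
    my_rank = id;                        /* store "my" rank in the shared global */
    swapcontext(&ctx[id], &main_ctx);    /* yield, as if blocked on communication */
    observed[id] = my_rank;              /* read it back after being resumed */
}

/* Runs both user-level threads on the calling kernel thread and returns
 * the rank that thread 0 reads back after thread 1 has run. */
int run_shared_global_demo(void)
{
    for (int i = 0; i < 2; i++) {
        getcontext(&ctx[i]);
        ctx[i].uc_stack.ss_sp = stacks[i];
        ctx[i].uc_stack.ss_size = STACK_SIZE;
        ctx[i].uc_link = &main_ctx;      /* resume the caller when rank_body ends */
        makecontext(&ctx[i], (void (*)(void))rank_body, 1, i);
    }
    swapcontext(&main_ctx, &ctx[0]);     /* thread 0 sets my_rank = 0, yields */
    swapcontext(&main_ctx, &ctx[1]);     /* thread 1 sets my_rank = 1, yields */
    swapcontext(&main_ctx, &ctx[0]);     /* thread 0 resumes and reads my_rank */
    swapcontext(&main_ctx, &ctx[1]);     /* thread 1 resumes and reads my_rank */
    return observed[0];
}
```

Calling run_shared_global_demo() from main returns 1 rather than 0: thread 0 sees the rank written by thread 1. With separate processes, each would have kept its own copy of the global.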
Our Solution
• User-level process (ULP)
  – A ULP is a “process” that can be scheduled in user-space
    • The ULP has the beneficial features of the user-level thread
    • The ULP has its own program code and data (therefore, we regard the ULP as a “process”)
  – Capability of the ULP
    • The ULP enables low-overhead process oversubscription
    • Modification to the application is not required
                                 Kernel-level Process   User-level Thread   User-level Process
Context switch                   Slow                   Fast                Fast
Modification to the application  Not required           Required            Not required
Overview of User-level Process
[Figure: (a) Kernel-level Process, (b) User-level Process, (c) Kernel-level Thread, (d) User-level Thread. Each panel shows the text, data, bss, heap and stack regions of the execution contexts, the address space boundaries, the task scheduler (kernel-space or user-space) and the CPU cores (C).]
• The ULP can be scheduled in user-space
  – Low-overhead oversubscription can be achieved by avoiding the overhead of the process context switch
• The ULP has its own program code and data
  – Modification to the application is not required
Address Space Design
[Figure: address space layouts, from low to high addresses, of a Process (TEXT, DATA&BSS, HEAP, STACK, KERNEL), a User-level Thread (TEXT, DATA&BSS, HEAP, STACK 0 to STACK N-1, KERNEL) and a User-level Process (ULP 0 to ULP N-1, each with its own TEXT, DATA&BSS, HEAP and STACK, followed by KERNEL).]
Context Switch
[Figure: Context switch from ULP 0 to ULP 1. The address space (low to high addresses) holds a partition for ULP 0 and a partition for ULP 1, each with text, data & bss, heap and stack; the registers of the CPU core are switched between them.]
① save context of user-level process 0
② load context of user-level process 1
• Segment registers must be considered on x86_64 architectures
  – Segment registers are not accessible from user-space
  – The fs register is used for implementing Thread Local Storage (TLS)
  – Thread-safe functions must be built without using TLS
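The TLS restriction can be illustrated with a minimal sketch. It assumes Linux with glibc, uses the ucontext calls as a stand-in for a user-space context switch, and is not the ULP implementation. The names tls_value, task_body and tls_is_shared_across_user_contexts are hypothetical. Because the user-space switch is not expected to change the TLS base that fs points to, two contexts running on the same kernel thread end up reading and writing the same thread-local variable.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

/* Thread-local variable (GCC/Clang __thread); on x86_64 Linux it is
 * addressed relative to the fs register. */
static __thread int tls_value;

static ucontext_t sched_ctx, task_ctx[2];
static char task_stack[2][STACK_SIZE];
static int seen[2];   /* value each task finds in tls_value after resuming */

static void task_body(int id)
{
    tls_value = 100 + id;                      /* each task writes "its" TLS slot */
    swapcontext(&task_ctx[id], &sched_ctx);    /* user-space switch; TLS base unchanged */
    seen[id] = tls_value;                      /* what the task finds afterwards */
}

/* Returns 1 if both user-level contexts observed the same TLS value,
 * i.e. the user-space switch did not give them separate TLS. */
int tls_is_shared_across_user_contexts(void)
{
    for (int i = 0; i < 2; i++) {
        getcontext(&task_ctx[i]);
        task_ctx[i].uc_stack.ss_sp = task_stack[i];
        task_ctx[i].uc_stack.ss_size = STACK_SIZE;
        task_ctx[i].uc_link = &sched_ctx;      /* resume the caller when the task ends */
        makecontext(&task_ctx[i], (void (*)(void))task_body, 1, i);
    }
    swapcontext(&sched_ctx, &task_ctx[0]);     /* task 0 writes 100, yields */
    swapcontext(&sched_ctx, &task_ctx[1]);     /* task 1 writes 101, yields */
    swapcontext(&sched_ctx, &task_ctx[0]);     /* task 0 resumes and reads */
    swapcontext(&sched_ctx, &task_ctx[1]);     /* task 1 resumes and reads */
    return seen[0] == seen[1];
}
```

Both tasks read back the value written last, so a library routine that keeps per-"process" state in TLS would mix up the state of ULPs sharing a kernel thread.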
ULP API
• int pvas_ulp_create(int *pvd)
  – pvas_ulp_create creates an address space for ULPs
• int pvas_ulp_destroy(int pvd)
  – pvas_ulp_destroy destroys a created address space
• int pvas_ulp_spawn(int pvd, int pvid, char *filename, char **argv, char **environ)
  – pvas_ulp_spawn spawns a kernel-level process with a ULP
• int pvas_ulp_exec(int pvid, char *filename, char **argv, char **environ)
  – pvas_ulp_exec creates and executes a new ULP
• int pvas_ulp_switch(int pvid)
  – pvas_ulp_switch performs a context switch from the current ULP to the indicated ULP
Preliminary Evaluation (context switch performance)
• Benchmark
  – Invoking multiple parallel processes on a single CPU core
  – A parallel process may be a kernel-level process, a kernel-level thread, a user-level thread or a user-level process
  – Measuring the time elapsed until every parallel process has performed 1000 context switches
• The performance of the ULP is competitive with that of the user-level thread
Environment
  CPU: Intel Xeon X5670 2.93 GHz
  OS: Linux 2.6.32-el6 for x86_64
[Figure: benchmark results; lower is better.]
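The shape of such a measurement can be sketched for the user-level thread case only. This is a minimal sketch assuming Linux with glibc, with the ucontext calls standing in for a user-level thread library; it is not the benchmark used for the evaluation. The names bench_task and run_switch_benchmark are hypothetical. Each task yields to the next one in round-robin order until it has switched the requested number of times, and the elapsed time is taken with clock_gettime.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes CLOCK_MONOTONIC under strict -std=c99/c11 */
#endif
#include <assert.h>
#include <time.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)
#define MAX_TASKS 8

static ucontext_t bench_main, bench_ctx[MAX_TASKS];
static char bench_stack[MAX_TASKS][STACK_SIZE];
static int n_tasks, n_rounds;
static long switch_count;

/* Each task switches to the next task, round-robin, n_rounds times. */
static void bench_task(int id)
{
    int next = (id + 1) % n_tasks;
    for (int r = 0; r < n_rounds; r++) {
        switch_count++;
        swapcontext(&bench_ctx[id], &bench_ctx[next]);
    }
    /* Task 0 is the first to finish; returning resumes bench_main via uc_link. */
}

/* Runs the round-robin switching and returns the number of switches performed;
 * the elapsed time in nanoseconds is stored in *elapsed_ns. */
long run_switch_benchmark(int tasks, int rounds, long long *elapsed_ns)
{
    struct timespec t0, t1;

    assert(tasks >= 2 && tasks <= MAX_TASKS && rounds >= 0);
    n_tasks = tasks;
    n_rounds = rounds;
    switch_count = 0;
    for (int i = 0; i < tasks; i++) {
        getcontext(&bench_ctx[i]);
        bench_ctx[i].uc_stack.ss_sp = bench_stack[i];
        bench_ctx[i].uc_stack.ss_size = STACK_SIZE;
        bench_ctx[i].uc_link = &bench_main;
        makecontext(&bench_ctx[i], (void (*)(void))bench_task, 1, i);
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    swapcontext(&bench_main, &bench_ctx[0]);   /* start task 0; returns when it ends */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                + (t1.tv_nsec - t0.tv_nsec);
    return switch_count;
}
```

For example, run_switch_benchmark(4, 1000, &ns) performs 4000 switches in total. On common glibc builds swapcontext also saves and restores the signal mask, which is believed to involve a system call, so absolute timings from this sketch should not be compared with the results reported here.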
Summary and Future Work
• Summary
  – The ULP enables low-overhead oversubscription by avoiding the overhead of the process context switch
  – Oversubscription using ULPs does not require any modification to the application
• Future work
  – Embed the capability of the ULP in MPI runtimes and evaluate it