
User-Level Process towards Exascale Systems

Akio Shimada[1], Atsushi Hori[1], Yutaka Ishikawa[1], Pavan Balaji[2]

[1]RIKEN AICS, [2]Argonne National Laboratory

Background

• MPI processes running on an HPC cluster communicate with each other to exchange data for parallel computation
  – An MPI process must wait for the completion of a communication
• Latency hiding is an important issue on the way to exascale systems
  – The network of an HPC cluster will become larger

Methods for Latency Hiding

• Non-blocking communication
  – Overlapping communication and computation
• Oversubscription
  – Binding multiple processes to one CPU core
  – Switching to another process when a process blocks waiting for the completion of a communication

Problem

• Process context switching is slow
  – The overhead of process context switching spoils the benefit of process oversubscription in some cases [Iancu et al., IPDPS 2010]
    • The overhead of entering the kernel context
    • The overhead of address space switching

Conventional Approach

• Oversubscription using user-level threads (e.g., FG-MPI)
  – Multiple user-level threads are invoked within a process
  – Each user-level thread is assigned the role of an MPI process
• Pros and cons
  – Pros
    • Fast context switch
      – The context switch between user-level threads can be performed in user space
      – The context switch between user-level threads does not require address space switching
  – Cons
    • Modification to the application is required
      – Program code (text) and data (data, bss, and heap) are shared among the user-level threads, each of which plays the role of an MPI process
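To make the "modification required" point concrete, here is a minimal sketch of the kind of source change it implies. It is an illustration only: the names g_rank, rank_state, and the two helper functions are hypothetical and are not taken from FG-MPI or any MPI program. Two "ranks" are initialised one after the other in the same address space, which is what happens when user-level threads take over the role of MPI processes.

```c
/* Original style: one global per MPI process. This is fine when every rank
 * is a separate process, but the variable becomes shared as soon as ranks
 * are user-level threads inside a single process. (Hypothetical example.) */
static int g_rank;

static void init_shared(int rank) { g_rank = rank; }
static int  read_shared(void)     { return g_rank; }

/* Modified style: the state is moved into a per-rank structure that each
 * user-level thread carries around explicitly. */
struct rank_state { int rank; };

static void init_private(struct rank_state *s, int rank) { s->rank = rank; }
static int  read_private(const struct rank_state *s)     { return s->rank; }

/* Rank 0 and rank 1 run in the same address space; returns what rank 0
 * observes after rank 1 has also initialised itself. */
int rank0_view_shared(void)
{
    init_shared(0);        /* rank 0 runs */
    init_shared(1);        /* rank 1 runs and overwrites the global */
    return read_shared();  /* rank 0 resumes and sees rank 1's value */
}

int rank0_view_private(void)
{
    struct rank_state r0, r1;
    init_private(&r0, 0);
    init_private(&r1, 1);
    return read_private(&r0);  /* rank 0 still sees its own value */
}
```

With the shared global, rank 0 ends up reading rank 1's value; with privatised state it does not. Rewriting every global and static variable in this way is the application change that the user-level process is meant to avoid.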

Our Solution

• User-level process (ULP)
  – A ULP is a “process” that can be scheduled in user space
    • The ULP has the beneficial features of the user-level thread
    • The ULP has its own program code and data (therefore, we regard the ULP as a “process”)
  – Capability of the ULP
    • The ULP enables low-overhead process oversubscription
    • Modification to the application is not required

                                  Kernel-level Process   User-level Thread   User-level Process
Context switch                    Slow                   Fast                Fast
Modification to the application   Not required           Required            Not required

Overview of User-level Process

[Figure: comparison of execution models. Panels: (a) Kernel-level Process, (b) User-level Process, (c) Kernel-level Thread, (d) User-level Thread. Each panel shows the text, data, bss, heap, and stack regions, the address space boundaries, the execution contexts, the CPU cores (C), and whether the task scheduler runs in kernel space or in user space.]

• The ULP can be scheduled in user space
  – Low-overhead oversubscription is achieved by avoiding the overhead of the process context switch
• The ULP has its own program code and data
  – Modification to the application is not required

Address Space Design

[Figure: address space layouts, from low to high address, for three cases: Process, User-level Thread, and User-level Process. Region labels: TEXT, DATA&BSS, HEAP, STACK, KERNEL; the multi-task cases are labelled STACK 0 ... STACK N-1 and ULP 0 ... ULP N-1.]

Context Switch

[Figure: context switch from ULP 0 to ULP 1. The address space (low to high) contains a partition for ULP 0 and a partition for ULP 1, each with its own text, data & bss, heap, and stack. On the CPU core, (1) the context (registers) of user-level process 0 is saved and (2) the context of user-level process 1 is loaded.]
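The save/load pattern can be sketched with the POSIX ucontext routines. This is only an analogy under stated assumptions: it switches between two user-level contexts that share one program image, so it behaves like user-level threads rather than ULPs, and it does not use the PVAS API. The task functions, the trace buffer, and all other names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, ctx0, ctx1;
static char stack0[STACK_SIZE], stack1[STACK_SIZE];
static int trace[8];     /* order in which the tasks ran */
static int trace_len;

static void record(int id) { if (trace_len < 8) trace[trace_len++] = id; }

static void task0(void)
{
    record(0);
    swapcontext(&ctx0, &ctx1);  /* save task 0's registers, load task 1's */
    record(0);
    swapcontext(&ctx0, &ctx1);  /* task 0 stays suspended after this */
}

static void task1(void)
{
    record(1);
    swapcontext(&ctx1, &ctx0);  /* save task 1's registers, load task 0's */
    record(1);
    /* returning resumes uc_link, i.e. main_ctx */
}

static int prepare(ucontext_t *ctx, char *stack, void (*fn)(void))
{
    if (getcontext(ctx) != 0)
        return -1;
    ctx->uc_stack.ss_sp = stack;     /* each context gets its own stack */
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = &main_ctx;        /* where to go when fn returns */
    makecontext(ctx, fn, 0);
    return 0;
}

/* Runs the ping-pong and returns the number of recorded steps, or -1. */
int run_ping_pong(void)
{
    trace_len = 0;
    if (prepare(&ctx0, stack0, task0) != 0 || prepare(&ctx1, stack1, task1) != 0)
        return -1;
    if (swapcontext(&main_ctx, &ctx0) != 0)
        return -1;
    return trace_len;
}

int trace_at(int i) { return trace[i]; }
```

After run_ping_pong() the trace reads 0, 1, 0, 1: each swapcontext call saves the registers of the running task and loads those of the other one. Note that glibc's swapcontext also saves and restores the signal mask, so it is not a purely user-space operation.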

• Segment registers must be considered on x86_64 architectures
  – Segment registers are not accessible from user space
  – The fs register is used to implement Thread Local Storage (TLS)
  – Thread-safe functions must be built without using TLS
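The TLS issue can be observed with ordinary user-level contexts. The sketch below assumes Linux with glibc and uses the POSIX ucontext routines, not the PVAS API; all names are hypothetical. Because a user-space context switch leaves the thread pointer (the fs base on x86_64) untouched, two contexts running on the same kernel-level thread see the same TLS variable.

```c
#define _XOPEN_SOURCE 700
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

/* One copy per kernel-level thread, located through the thread pointer. */
static _Thread_local int tls_value;

static ucontext_t main_ctx, writer_ctx, reader_ctx;
static char writer_stack[STACK_SIZE], reader_stack[STACK_SIZE];
static int seen_by_reader;

static void writer(void) { tls_value = 42; }              /* context A */
static void reader(void) { seen_by_reader = tls_value; }  /* context B */

static int prepare(ucontext_t *ctx, char *stack, void (*fn)(void))
{
    if (getcontext(ctx) != 0)
        return -1;
    ctx->uc_stack.ss_sp = stack;
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = &main_ctx;   /* return to the caller when fn finishes */
    makecontext(ctx, fn, 0);
    return 0;
}

/* Returns the TLS value observed by the second context, or -1 on error. */
int tls_seen_by_other_context(void)
{
    tls_value = 0;
    seen_by_reader = -1;
    if (prepare(&writer_ctx, writer_stack, writer) != 0 ||
        prepare(&reader_ctx, reader_stack, reader) != 0)
        return -1;
    if (swapcontext(&main_ctx, &writer_ctx) != 0)  /* run context A */
        return -1;
    if (swapcontext(&main_ctx, &reader_ctx) != 0)  /* run context B */
        return -1;
    return seen_by_reader;
}
```

The reader context observes the value written by the writer context, so TLS is not private to a user-level context. A ULP that is switched in user space without changing fs would likewise see another ULP's TLS, which is consistent with the restriction stated above.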

ULP API

• int pvas_ulp_create(int *pvd)
  – pvas_ulp_create creates an address space for ULPs
• int pvas_ulp_destroy(int pvd)
  – pvas_ulp_destroy destroys a created address space
• int pvas_ulp_spawn(int pvd, int pvid, char *filename, char **argv, char **environ)
  – pvas_ulp_spawn spawns a kernel-level process with a ULP
• int pvas_ulp_exec(int pvid, char *filename, char **argv, char **environ)
  – pvas_ulp_exec creates and executes a new ULP
• int pvas_ulp_switch(int pvid)
  – pvas_ulp_switch performs a context switch from the current ULP to the indicated ULP

Preliminary Evaluation (context switch performance)

• Benchmark
  – Multiple parallel processes are invoked on a single CPU core
  – A parallel process may be a kernel-level process, a kernel-level thread, a user-level thread, or a user-level process
  – The time elapsed until every parallel process has performed 1000 context switches is measured
• The performance of the ULP is competitive with that of the user-level thread
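A benchmark of this shape can be sketched for the user-level thread case with the POSIX ucontext routines. This is not the authors' benchmark code: the task count (NTASKS), the round-robin order, and all names are assumptions made for illustration. Every task hands control to the next one 1000 times, and the elapsed wall-clock time is measured around the whole run.

```c
#define _XOPEN_SOURCE 700
#include <time.h>
#include <ucontext.h>

#define NTASKS     4       /* assumed number of parallel tasks */
#define NSWITCH    1000    /* context switches per task, as on the slide */
#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, task_ctx[NTASKS];
static char stacks[NTASKS][STACK_SIZE];
static long switches_done[NTASKS];

/* Each task passes control to its successor NSWITCH times (round robin). */
static void task(int me)
{
    int next = (me + 1) % NTASKS;
    for (int i = 0; i < NSWITCH; i++) {
        switches_done[me]++;
        swapcontext(&task_ctx[me], &task_ctx[next]);
    }
    /* The first task to leave its loop returns to main_ctx through uc_link;
     * by then every task has performed NSWITCH switches. */
}

/* Returns the elapsed time in seconds, or -1.0 on error. */
double run_benchmark(void)
{
    struct timespec t0, t1;

    for (int i = 0; i < NTASKS; i++) {
        switches_done[i] = 0;
        if (getcontext(&task_ctx[i]) != 0)
            return -1.0;
        task_ctx[i].uc_stack.ss_sp = stacks[i];
        task_ctx[i].uc_stack.ss_size = STACK_SIZE;
        task_ctx[i].uc_link = &main_ctx;
        makecontext(&task_ctx[i], (void (*)(void))task, 1, i);
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (swapcontext(&main_ctx, &task_ctx[0]) != 0)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

long switches_of(int i) { return switches_done[i]; }
```

Calling run_benchmark() yields the time for NTASKS x NSWITCH switches. Since glibc's swapcontext issues a system call to handle the signal mask, the absolute numbers from this sketch would likely overstate the cost of a purely user-space switch; a hand-written register save/restore would be closer to what a user-level thread library does.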

Environment
CPU: Intel Xeon X5670 2.93 GHz
OS: Linux 2.6.32-el6 for x86_64

[Figure: benchmark results (lower is better)]

Summary and Future Work

• Summary
  – The ULP enables low-overhead oversubscription by avoiding the overhead of the process context switch
  – Oversubscription using ULPs does not require any modification to the application
• Future work
  – Embed the capability of the ULP in MPI runtimes and evaluate it
