
Page 1:

Dagstuhl, May 2017

echofs: Enabling Transparent Access to Node-local

NVM Burst Buffers for Legacy Applications

[…with the scheduler’s collaboration]

Alberto Miranda, PhD

Researcher in HPC I/O

[email protected]

www.bsc.es

The NEXTGenIO project has received funding from the European Union’s Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951

Page 2:

• Petascale already struggles with I/O…

– Extreme parallelism w/ millions of threads

– Job’s input data read from external PFS

– Checkpoints: periodic writes to external PFS

– Job’s output data written to external PFS

• HPC & data-intensive systems merging

– Modelling, simulation, and analytics workloads increasing…

• And it will only get worse at Exascale…

I/O -> a fundamental challenge;

[Diagram: compute nodes connected over a high-performance network to an external filesystem]


Page 3:

• Fast storage devices that temporarily store application data before sending it to the PFS

– Goal: Absorb peak I/O to avoid overtaxing the PFS

– Cray DataWarp, DDN IME

• Growing interest in adding them to next-gen HPC architectures

– NERSC’s Cori, LLNL’s Sierra, ANL’s Aurora, …

– Typically a separate resource from the PFS

– Usage, allocation, data movement, etc. become the user’s responsibility

Burst Buffers -> remote;

[Diagram: compute nodes connected over a high-performance network to a remote burst filesystem sitting in front of the external filesystem]

Page 4:

• Non-volatile memory coming to the node

– Argonne’s Theta has 128 GB SSDs in each compute node

Burst Buffers -> on-node;

[Diagram: compute nodes with on-node burst buffers, connected over a high-performance network to the external filesystem]

Page 5:

• Node-local, high-density NVRAM becomes a fundamental component

– Intel’s 3D XPoint™

– Capacity much larger than DRAM

– Slightly slower than DRAM but significantly faster than SSDs

– DIMM form factor -> standard memory controller

– No refresh -> no/low energy leakage

NEXTGenIO EU Project [http://www.nextgenio.eu]

[Diagram: the I/O stack grows from cache / memory / storage to cache / memory / nvram / fast storage / slow storage]

1. How do we manage access to these layers?

2. How can we bring the benefits from these layers to legacy code?


Page 6:

OUR SOLUTION: MANAGING ACCESS THROUGH A USER-LEVEL FILESYSTEM

Page 7:

• First goal: Allow legacy applications to transparently benefit from new storage layers

– All accessible storage layers under a single mount point

– Make new layers readily available to applications

– I/O stack complexity hidden from applications

– Allows for automatic management of data location

– POSIX interface [sorry]

echofs -> objectives;

[Diagram: storage layers (cache, memory, nvram, SSD, Lustre) exposed under one mount point; the PFS namespace is “echoed”, so /mnt/PFS/User/App appears as /mnt/ECHOFS/User/App]
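As a rough illustration of the “echoed namespace” idea, here is a minimal sketch of the 1:1 path mapping between the PFS mount point and the echofs mount point. The mount points come from the diagram above; the function name and the C++ code itself are hypothetical, not part of echofs.

#include <iostream>
#include <string>

const std::string pfs_root    = "/mnt/PFS";
const std::string echofs_root = "/mnt/ECHOFS";

// Return the "echoed" path if 'path' lives under the PFS root;
// otherwise leave it unchanged (such accesses bypass echofs).
std::string echo_path(const std::string& path) {
    if (path.rfind(pfs_root, 0) == 0) {  // i.e. path starts with pfs_root
        return echofs_root + path.substr(pfs_root.size());
    }
    return path;
}

int main() {
    // Prints: /mnt/ECHOFS/User/App/input.dat
    std::cout << echo_path("/mnt/PFS/User/App/input.dat") << "\n";
}

Because the mapping is purely syntactic, a legacy application only needs its paths re-pointed at the echofs mount point to pick up the new storage layers.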

Page 8:

• Second goal: construct a collaborative burst buffer by joining the NVM regions assigned to a batch job by the scheduler [SLURM]

– Filesystem’s lifetime linked to batch job’s lifetime

– Input files staged into NVM before job starts

– Allow HPC jobs to perform collaborative NVM I/O

– Output files staged out to PFS when job ends

echofs -> objectives;

[Diagram: parallel processes issue POSIX reads/writes to echofs, which joins the compute nodes’ NVM into a collaborative burst buffer and stages data in/out of the external filesystem]

Page 9:

• User provides job I/O requirements through SLURM

– Nodes required, files accessed, type of access [in|out|inout], expected lifetime [temporary|persistent], expected “survivability”, required POSIX semantics [?], …

• SLURM allocates nodes and mounts echofs across them

– Also forwards the I/O requirements through an API

• echofs builds the CBB [collaborative burst buffer] and fills it with input files

– When finished, SLURM starts the batch job

We can’t expect optimization details from users, but maybe we can expect them to offer us enough hints… [one possible encoding of these hints is sketched below]

echofs -> intended workflow;

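To make the hint set above concrete, here is one hypothetical way the per-job I/O requirements could be encoded when forwarded from SLURM to echofs. The field names and types are invented for illustration; the slides do not fix a concrete format.

#include <string>
#include <vector>

enum class AccessType { In, Out, InOut };          // type of access
enum class Lifetime   { Temporary, Persistent };   // expected lifetime

struct FileRequirement {
    std::string path;                  // file accessed by the job
    AccessType  access;                // in | out | inout
    Lifetime    lifetime;              // temporary | persistent
    bool        survive_node_failure;  // expected "survivability"
};

struct JobIORequirements {
    unsigned nodes_required;           // nodes required
    bool     strict_posix;             // required POSIX semantics [?]
    std::vector<FileRequirement> files;
};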


Page 10:

• Job I/O absorbed by the collaborative burst buffer

– Non-CBB open()s forwarded to the PFS (throttled to limit PFS congestion; see the sketch below)

– Temporary files do not need to make it to the PFS (e.g. checkpoints)

– Metadata attributes for temporary files cached in a distributed key-value store

• When the job completes, the future of its files is managed by echofs

– Persistent files eventually synced to the PFS

– Decision orchestrated by SLURM & the Data Scheduler component depending on the requirements of upcoming jobs

echofs -> intended workflow;


If some other job reuses these files, we can leave them “as is”
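The throttling of forwarded open()s can be pictured as a simple concurrency cap. The sketch below (C++20) limits in-flight PFS opens with a counting semaphore; the limit of 4 and the function name are arbitrary illustrations, since the slides do not describe echofs’s actual policy.

#include <fcntl.h>     // ::open
#include <semaphore>   // std::counting_semaphore (C++20)

std::counting_semaphore<> pfs_slots{4};  // at most 4 concurrent PFS opens

// Forward a non-CBB open() to the PFS without overtaxing it.
int forward_open_to_pfs(const char* pfs_path, int flags) {
    pfs_slots.acquire();               // wait until a slot is free
    int fd = ::open(pfs_path, flags);  // the actual PFS access
    pfs_slots.release();               // free the slot for other clients
    return fd;
}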

Page 11:

• Distributed data servers

– Job’s data space partitioned across compute nodes

– Each node acts as data server for its partition

– Each node acts as data client for other partitions

• Pseudo-random file segment distribution

– No replication ⇒ avoid coherence mechanisms

– Resiliency through erasure codes (eventually)

– Each node acts as lock manager for its partition

echofs -> data distribution;

[Diagram: a shared file’s data space is partitioned into segments ([0-8MB), [8-16MB), …) that a hash maps onto the NVM partitions of the compute nodes]
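A minimal sketch of the decentralized part of the lookup: with a fixed segment size, any client can compute locally which segment a byte offset falls into, with no metadata request. The 8 MiB segment size is read off the diagram; the code is illustrative, not echofs’s implementation. The hash-based placement itself is sketched on the next page.

#include <cstdint>

constexpr std::uint64_t kSegmentSize = 8ull << 20;  // 8 MiB, per the diagram

// Segment index that a byte offset of a file falls into.
std::uint64_t segment_of(std::uint64_t offset) {
    return offset / kSegmentSize;  // e.g. offset at 10 MiB -> segment 1
}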

Page 12:

• Why pseudo-random?

– Efficient & decentralized segment lookup [no metadata request to look up a segment]

– Balances workload w.r.t. partition size

– Allows for collaborative I/O

– Guarantees minimal movement of data if the node allocation changes [future research on elasticity]

echofs -> data distribution;

[Diagram: as the job scheduler grows and shrinks the allocation across job phases (+1 node, +1 node, -2 nodes), each change triggers only a small data transfer between NVM partitions]

Other strategies would be possible depending on job semantics
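One concrete scheme with exactly these properties is rendezvous (highest-random-weight) hashing, sketched below: every client deterministically scores each node for a given (file, segment) pair and picks the highest, so lookups are local, and adding or removing a node only remaps the segments whose best node changed. This is offered as a plausible realization; the slides do not name the exact hash scheme echofs uses.

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Pick the node that owns a given segment of a file. Every client
// computes the same answer locally, with no metadata server involved.
std::string node_for_segment(const std::string& file,
                             std::uint64_t segment_index,
                             const std::vector<std::string>& nodes) {
    std::hash<std::string> h;
    std::size_t best_score = 0;
    std::string best_node;
    for (const auto& node : nodes) {
        // Same (file, segment, node) triple always hashes the same.
        std::size_t score =
            h(file + "#" + std::to_string(segment_index) + "@" + node);
        if (score >= best_score) { best_score = score; best_node = node; }
    }
    return best_node;
}

Weighting each node’s score by the size of its NVM partition (weighted rendezvous hashing) would also give the “balance workload w.r.t. partition size” property from the list above.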


Page 13:

• Data Scheduler daemon external to echofs

– Interfaces SLURM & echofs

Allows SLURM to send requests to echofs

Allows echofs to ACK these requests

– Offers an API to [non-legacy] applications willing to send I/O hints to echofs

– In the future, will coordinate w/ SLURM to decide when different echofs instances should access the PFS [data-aware job scheduling]

echofs -> integration with batch scheduler;

[Diagram: SLURM forwards static I/O requirements and applications forward dynamic I/O requirements to the data scheduler, which issues asynchronous stage-in/stage-out requests to echofs]
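From the application side, sending a dynamic I/O hint might look like the stub below. The slides only state that such an API exists for non-legacy applications; the function name, its parameter, and the early-stage-out hint itself are invented for illustration.

#include <string>

// Hypothetical hint: ask the data scheduler to begin staging a
// persistent output file out to the PFS before the job ends, so the
// transfer overlaps with the remaining computation.
void hint_early_stage_out(const std::string& echofs_path) {
    // A real client would send an asynchronous request to the data
    // scheduler daemon here (e.g. over a local socket); stubbed out.
    (void)echofs_path;
}

int main() {
    hint_early_stage_out("/mnt/ECHOFS/User/App/output.dat");
    // ... keep computing while the stage-out proceeds ...
}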

Page 14:

• Main features:

– Ephemeral filesystem linked to job lifetime

– Allows legacy applications to benefit from newer storage technologies

– Provides aggregate I/O for applications

• Research goals:

– Improve coordination w/ job scheduler and other HPC management infrastructure

– Investigate ad-hoc data distributions tailored to each job’s I/O

– Scheduler-triggered optimizations for specific jobs/files

Summary;


Page 15:

• POSIX compliance is hard…

– But maybe we don’t need FULL COMPLIANCE for ALL jobs…

• Adding I/O-awareness to the scheduler is important…

– Avoids wasting I/O work already done…

– … but requires user/developer collaboration (tricky…)

• User-level filesystems/libraries solve very specific I/O problems…

– Can we reuse/integrate these efforts? Can we learn what works for a specific application, characterize it & automatically run similar ones in a “best fit” FS?

Food for thought;


Page 16:

Thank you!

For further information, please contact

[email protected], [email protected],

[email protected]

www.bsc.es