System Software for Big Data and Post Petascale Computing
Osamu Tatebe University of Tsukuba
The Japanese Extreme Big Data Workshop February 26, 2014
I/O performance requirements for exascale applications
• Computational Science (Climate, CFD, …)
  – Read initial data (100TB~PB)
  – Write snapshot data (100TB~PB) periodically
• Data Intensive Science (Particle Physics, Astrophysics, Life Science, …)
  – Data analysis of 10PB~EB experiment data
Scalable performance requirement for Parallel File System
Year  FLOPS  #cores  IO BW     IOPS     Systems
2008  1P     100K    100 GB/s  O(1K)    Jaguar, BG/P
2011  10P    1M      1 TB/s    O(10K)   K, BG/Q
2016  100P   10M     10 TB/s   O(100K)
2020  1E     100M    100 TB/s  O(1M)
IO BW and IOPS are expected to scale out with the number of cores or nodes
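The scale-out claim can be checked with back-of-the-envelope arithmetic (my sketch, using only the figures in the table above): the target aggregate bandwidth works out to a constant ~1 MB/s per core in every generation.

```python
# Per-core I/O bandwidth implied by the performance targets above.
# (Quick arithmetic check, not from the original slides.)
targets = [
    # (year, cores, aggregate IO bandwidth in GB/s)
    (2008, 100_000, 100),
    (2011, 1_000_000, 1_000),
    (2016, 10_000_000, 10_000),
    (2020, 100_000_000, 100_000),
]

for year, cores, bw_gbs in targets:
    per_core_mbs = bw_gbs * 1_000 / cores  # GB/s -> MB/s, divided by cores
    print(year, per_core_mbs)              # 1.0 MB/s per core in every row
```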
Performance target
Technology trend
• HDD performance will not improve much ☹
  – 300 MB/s, 5 W in 2020
  – 100 TB/s would need ~330K drives, i.e. O(2M) W
• Flash, storage class memory ☺
  – 1 GB/s, 0.1 W in 2020
  – But cost and a limited number of updates ☹
• Interconnects
  – 62 GB/s (InfiniBand 4x HDR)
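The O(2M) W figure follows directly from the device numbers above. A small sketch (my arithmetic, based on the slide's 2020 projections) compares HDD and flash power for the 100 TB/s exascale target:

```python
# Power needed to sustain 100 TB/s with the projected 2020 devices.
# (My arithmetic from the per-device figures on the slide.)
TARGET_BW = 100e12  # bytes/s, the 1 EFLOPS-era target from the table

# HDD: 300 MB/s and 5 W per drive
hdd_drives = TARGET_BW / 300e6     # ~333,334 drives
hdd_watts = hdd_drives * 5         # ~1.7 MW, the slide's O(2M) W

# Flash / storage class memory: 1 GB/s and 0.1 W per device
flash_devices = TARGET_BW / 1e9    # 100,000 devices
flash_watts = flash_devices * 0.1  # ~10 kW, two orders of magnitude less
```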
Current parallel file system
• Central storage array
• Separate installation of compute nodes and storage
• Network BW between compute nodes and storage needs to be scaled up to scale out the I/O performance
[Figure: compute nodes (clients) access a central storage array and an MDS over the network; the network BW is the limitation]
Remember memory architecture
[Figure: shared memory (CPUs sharing one memory) vs. distributed memory (each CPU with its own memory)]
Scaled-out parallel file system
• Distributed storage in compute nodes
• I/O performance would be scaled out by accessing near storage, unless metadata performance is the bottleneck
  – Access to near storage mitigates the network BW requirement
  – The performance may be non-uniform
[Figure: compute nodes (clients), each with local storage, plus an MDS cluster]
Example of Scale-out Storage Architecture
• A snapshot of 3 years later
• Non-uniform but scale-out storage
• R&D of system software stacks is required to achieve maximum I/O performance for data-intensive science
• 5,000 IO nodes, 10 MDSs
[Figure: each node has a CPU (2 sockets x 2.0 GHz x 16 cores x 32 FPU), memory, chipset, and 1 TB local storage on 12 Gbps SAS; x16 nodes give 19.2 GB/s and 16 TB; x500 give 9.6 TB/s and 8 PB; x10 give 96 TB/s and 80 PB; metadata servers x10; InfiniBand HDR interconnect at 62 GB/s]
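The aggregate figures in the sketch can be reproduced from the per-node numbers (my arithmetic; I assume 12 Gbps SAS delivers ~1.2 GB/s per node):

```python
# Reproducing the aggregate bandwidth/capacity figures of the architecture
# sketch from the per-node numbers (12 Gbps SAS ~= 1.2 GB/s assumed).
node_bw = 1.2    # GB/s per node
node_cap = 1     # TB per node

group_bw = node_bw * 16           # 19.2 GB/s
group_cap = node_cap * 16         # 16 TB

mid_bw = group_bw * 500 / 1000    # 9.6 TB/s
mid_cap = group_cap * 500 / 1000  # 8 PB

total_bw = mid_bw * 10            # 96 TB/s
total_cap = mid_cap * 10          # 80 PB
```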
Challenge
• File system (object store)
  – Central storage cluster to distributed storage cluster
  – Scaled-out parallel file system up to O(1M) clients
  – Scaled-out MDS performance
• Compute node OS
  – Reduction of OS noise
  – Cooperative cache
• Runtime system
  – Optimization for non-uniform storage access "NUSA"
• Global storage for sharing exabyte-scale data among machines
Scaled-out parallel file system
• Federate local storage in compute nodes
  – Special purpose
    • Google File System [SOSP'03]
    • Hadoop file system (HDFS)
  – POSIX(-like)
    • Gfarm file system [CCGrid'02, NGC'10]
Scaled-out MDS
• GIGA+ [Swapnil Patil et al., FAST'11]
  – Incremental directory partitioning
  – Independent locking in each partition
• skyFS [Jing Xing et al., SC'09]
  – Performance improvement during directory partitioning in GIGA+
• Lustre
  – MT scalability in 2.X
  – Proposed clustered MDS
• PPMDS [our JST CREST R&D]
  – Shared-nothing KV stores
  – Nonblocking software transactional memory (no locks)

System      IOPS (file creates/sec)  #MDS (#cores)
GIGA+       98K                      32 (256)
skyFS       100K                     32 (512)
Lustre 2.4  80K                      1 (16)
PPMDS       270K                     15 (240)
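GIGA+-style incremental directory partitioning can be sketched in a few lines (a toy model, not the FAST'11 implementation; the split threshold and hashing are my simplifications): a directory starts as a single partition, and an overflowing partition splits by one more hash bit, so new partitions (and the servers holding them) are added incrementally, and each split touches only one partition, which is what permits independent per-partition locking.

```python
import hashlib

SPLIT_THRESHOLD = 8  # toy value; real systems split at thousands of entries

def name_hash(name: str) -> int:
    """Stable hash of a file name (its low bits select the partition)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "big")

class GigaPlusDirectory:
    """One logical directory split incrementally over hash partitions."""

    def __init__(self):
        self.entries = {0: set()}  # partition id -> entries it holds
        self.depth = {0: 0}        # partition id -> number of hash bits used

    def _partition_of(self, name: str) -> int:
        h = name_hash(name)
        # Try the deepest mask first; a name belongs to the unique existing
        # partition whose depth matches the mask that found it.
        for bits in range(max(self.depth.values()), -1, -1):
            p = h & ((1 << bits) - 1)
            if p in self.entries and self.depth[p] == bits:
                return p
        raise AssertionError("unreachable: partition 0 always exists")

    def create(self, name: str) -> None:
        p = self._partition_of(name)
        self.entries[p].add(name)
        if len(self.entries[p]) > SPLIT_THRESHOLD:
            self._split(p)

    def _split(self, p: int) -> None:
        # Split p by one more hash bit: entries whose next bit is 1 move to
        # the new "buddy" partition; no other partition is touched.
        d = self.depth[p]
        buddy = p | (1 << d)
        self.depth[p] = self.depth[buddy] = d + 1
        movers = {n for n in self.entries[p] if (name_hash(n) >> d) & 1}
        self.entries[p] -= movers
        self.entries[buddy] = movers

    def lookup(self, name: str) -> bool:
        return name in self.entries[self._partition_of(name)]
```

With a few hundred creates the directory grows to many partitions, while every lookup still resolves with one hash computation and no global lock.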
Development of Pwrake: Data-intensive Workflow System
IO-aware task scheduling:
• Locality-aware scheduling
  – Selection of compute nodes by input files
• Buffer-cache-aware scheduling
  – Modified LIFO to ease the trailing-task problem
[Chart: workflow elapsed time (sec) with 900 GB of I/O files on 10 nodes. Locality-aware scheduling gives a 42% speedup over the naive schedule; adding cache-aware scheduling gives a further 23% speedup]
[Figure: Pwrake launches processes over SSH on the nodes holding file1, file2, and file3 in the Gfarm file system]
Pwrake = Workflow System based on Rake (Ruby make)
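A minimal sketch of the locality-aware part (my simplification, not Pwrake's actual scheduler; the helper and its inputs are hypothetical): each ready task goes to the node that already stores the most bytes of its input files, so most reads hit local storage instead of the network.

```python
# Locality-aware node selection: a toy version of "selection of compute
# nodes by input files". All names and data structures are illustrative.

def pick_node(task_inputs, file_node, file_size, node_load):
    """Choose the node holding the most input bytes; break ties by
    preferring the node with the fewest queued tasks.

    task_inputs: list of input file names for the task
    file_node:   file name -> node storing the file (its Gfarm replica)
    file_size:   file name -> size in bytes
    node_load:   node -> number of tasks already queued there
    """
    local_bytes = {}
    for f in task_inputs:
        n = file_node[f]
        local_bytes[n] = local_bytes.get(n, 0) + file_size[f]
    return max(local_bytes, key=lambda n: (local_bytes[n], -node_load[n]))

# Example mirroring the figure: file1/file3 live on node0, file2 on node1.
file_node = {"file1": "node0", "file2": "node1", "file3": "node0"}
file_size = {"file1": 100, "file2": 300, "file3": 50}
node_load = {"node0": 0, "node1": 0}

best = pick_node(["file1", "file2", "file3"], file_node, file_size, node_load)
# node1 holds 300 of the 450 input bytes, so the task runs there
```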
Maximize Locality using Multi-Constraint Graph Partitioning [Tanaka, CCGrid 2012]
• Task scheduling based on MCGP can minimize data movement
• Applied to the Pwrake workflow system and evaluated on the Montage workflow
  – Data movement reduced by 86%; execution time improved by 31%
[Figure: simple graph partitioning vs. multi-constraint graph partitioning; with simple partitioning, parallel tasks are unbalanced among nodes]
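A toy illustration (my construction, not the algorithm in [Tanaka, CCGrid 2012]) of what multi-constraint partitioning buys: each task gets one weight per workflow phase, and a partition is scored on edge cut (data movement) under the constraint that every phase is balanced separately. A single-constraint split can balance total work yet leave one phase's parallel tasks all on one node.

```python
# Toy illustration of the metric multi-constraint graph partitioning (MCGP)
# optimizes. Tasks carry one weight per workflow phase; a partition is
# judged by its edge cut (inter-node data movement) subject to balancing
# *every* phase, not just the total load. The workflow here is invented.

# task -> (weight in phase 0, weight in phase 1)
tasks = {"a": (2, 0), "b": (2, 0), "c": (0, 2), "d": (0, 2)}
# (producer, consumer, bytes transferred between them)
edges = [("a", "c", 10), ("b", "d", 10), ("a", "b", 1), ("c", "d", 1)]

def edge_cut(assign):
    """Bytes moved between nodes for a task -> node assignment."""
    return sum(w for u, v, w in edges if assign[u] != assign[v])

def worst_phase_imbalance(assign, n_phases=2, n_nodes=2):
    """Worst ratio of max node load to ideal load across all phases."""
    worst = 0.0
    for p in range(n_phases):
        loads = [0] * n_nodes
        for t, weights in tasks.items():
            loads[assign[t]] += weights[p]
        worst = max(worst, max(loads) / (sum(loads) / n_nodes))
    return worst

# Single-constraint: total load is balanced (weight 4 per node), but the
# phase-0 tasks a, b sit on one node, so parallel tasks are unbalanced.
single = {"a": 0, "b": 0, "c": 1, "d": 1}
# Multi-constraint: each phase is balanced AND the heavy a->c, b->d
# data-flow edges stay node-local, so data movement drops as well.
multi = {"a": 0, "c": 0, "b": 1, "d": 1}
```

Here the edge cut falls from 20 to 2 and the per-phase imbalance from 2.0 to 1.0; tools such as METIS search for such partitions automatically via their multi-constraint mode.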
HPCI Shared Storage
• HPCI: High Performance Computing Infrastructure
  – "K", Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, RIKEN, JAMSTEC, AIST
• A 20 PB Gfarm distributed file system consisting of East and West sites
• Grid Security Infrastructure (GSI) for user IDs
• Parallel file replication among sites
• Parallel file staging to/from each center
[Figure: West site (AICS), 11.5 PB on 60 servers, and East site (U Tokyo), 10 PB on 40 servers, each with MDSs, connected by a 10 (~40) Gbps network. Picture courtesy of Hiroshi Harada (U Tokyo)]
Storage structure of HPCI Shared Storage
• Temporary space: local file system (Lustre, Panasas, GPFS, …). Goal: I/O performance; no backup. Used via mv/cp and file staging.
• Persistent storage: global file system. Goals: capacity and reliability; a backup copy is kept on tape or disk. Used via mv/cp and file staging.
• Data sharing: wide-area distributed file system (Gfarm file system, i.e. the HPCI Shared Storage). Goals: capacity and reliability, secured communication, fair share and ease of use; no backup, but files can be replicated. Remote clients use a Web I/F.
Initial Performance Result
[Chart: file-copy performance of 300 x 1 GB files to the HPCI Shared Storage, I/O bandwidth in MB/s: Hokkaido 898, Kyoto 847, Tokyo 1,107, AICS 1,073]
Related System
• XSEDE-Wide File System (GPFS)
  – Planned, but not in operation yet
• DEISA Global File System
  – Multicluster GPFS (RZG, LRZ, BSC, JSC, EPSS, HLRS, …)
  – Site name included in the path name: no location transparency, and files cannot be replicated across sites
• PRACE does not provide a global file system
  – Limitation on the operating systems that can mount it
  – PRACE does not assume the use of multiple sites
Summary
• Application I/O requirements
  – Computational Science: scaled-out I/O performance up to O(1M) nodes (100 TB to 1 PB per hour)
  – Data Intensive Science: data processing for 10 PB to 1 EB of data (>100 TB/sec)
• File system, object store, OS, and runtime R&D for a scale-out storage architecture
  – Central storage cluster to distributed storage cluster
    • Network-wide RAID
    • Scaled-out MDS
  – Runtime system for non-uniform storage access "NUSA"
    • Locality-aware process scheduling
• Global file system