System Software for Big Data and Post Petascale Computing
Osamu Tatebe University of Tsukuba
The Japanese Extreme Big Data Workshop February 26, 2014
I/O performance requirement for exascale applications
• Computational Science (Climate, CFD, …)
  – Read initial data (100 TB~PB)
  – Write snapshot data (100 TB~PB) periodically
• Data Intensive Science (Particle Physics, Astrophysics, Life Science, …)
  – Data analysis of 10 PB~EB experiment data
Scalable performance requirement for Parallel File System
Year   FLOPS   #cores   IO BW      IOPS     Systems
2008   1P      100K     100 GB/s   O(1K)    Jaguar, BG/P
2011   10P     1M       1 TB/s     O(10K)   K, BG/Q
2016   100P    10M      10 TB/s    O(100K)
2020   1E      100M     100 TB/s   O(1M)
Performance target: IO BW and IOPS are expected to scale out with the number of cores or nodes
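A quick check against the table: the required bandwidth per core stays roughly constant across generations, so the target can be met if per-node I/O bandwidth is preserved while the node count grows.

```latex
\frac{100~\mathrm{GB/s}}{10^{5}~\text{cores}} \approx
\frac{1~\mathrm{TB/s}}{10^{6}~\text{cores}} \approx
\frac{100~\mathrm{TB/s}}{10^{8}~\text{cores}} \approx 1~\mathrm{MB/s}~\text{per core}
```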
Technology trend
• HDD: performance does not improve much
  – 300 MB/s, 5 W in 2020
  – 100 TB/s means O(2M) W
• Flash, storage class memory
  – 1 GB/s, 0.1 W in 2020
  – Cost, limited number of updates
• Interconnects
  – 62 GB/s (InfiniBand 4x HDR)
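A rough back-of-the-envelope power estimate, using the per-device figures quoted above for 2020, shows where the O(2M) W number comes from and why flash looks attractive:

```latex
% HDD: 100 TB/s at 300 MB/s and 5 W per drive
N_{\mathrm{HDD}} \approx \frac{100~\mathrm{TB/s}}{300~\mathrm{MB/s}} \approx 3.3\times10^{5}
\;\Rightarrow\; P_{\mathrm{HDD}} \approx 3.3\times10^{5}\times 5~\mathrm{W} \approx 1.7~\mathrm{MW} = O(2\mathrm{M})~\mathrm{W}

% Flash / storage class memory: 100 TB/s at 1 GB/s and 0.1 W per device
N_{\mathrm{flash}} = \frac{100~\mathrm{TB/s}}{1~\mathrm{GB/s}} = 10^{5}
\;\Rightarrow\; P_{\mathrm{flash}} \approx 10^{5}\times 0.1~\mathrm{W} = 10~\mathrm{kW}
```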
Current parallel file system
• Central storage array
• Separate installation of compute nodes and storage
• Network BW between compute nodes and storage needs to be scaled up to scale out the I/O performance
[Figure: compute nodes (clients) access a central storage array through an MDS; the network BW between clients and storage is the limitation]
Remember memory architecture
[Figure: shared memory vs. distributed memory architectures, each CPU with its memory]
Scaled-out parallel file system
• Distributed storage in compute nodes
• I/O performance would be scaled out by accessing near storage, unless metadata performance is the bottleneck
  – Access to near storage mitigates the network BW requirement
  – The performance may be non-uniform
[Figure: compute nodes (clients) with local storage, served by an MDS cluster]
Example of Scale-out Storage Architecture
• A snapshot three years from now
• Non-uniform but scale-out storage
• R&D of system software stacks is required to achieve maximum I/O performance for data-intensive science
[Figure: example node and system configuration]
• Per node: CPU (2 sockets x 2.0 GHz x 16 cores x 32 FPU), memory, chipset, 16 x 1 TB local storage over 12 Gbps SAS x 16 → 19.2 GB/s, 16 TB
• x 500 nodes → 9.6 TB/s, 8 PB; x 10 → 96 TB/s, 80 PB
• Interconnect: InfiniBand HDR 62 GB/s
• 5,000 IO nodes, 10 MDSs (metadata servers)
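The aggregate numbers follow from the per-node figures, reading the x500 and x10 factors as 500 nodes per group and 10 groups (my reading of the diagram), and taking a 12 Gbps SAS link as roughly 1.2 GB/s:

```latex
16 \times 1.2~\mathrm{GB/s} = 19.2~\mathrm{GB/s}, \qquad 16 \times 1~\mathrm{TB} = 16~\mathrm{TB} \quad \text{(per node)}

500 \times 19.2~\mathrm{GB/s} = 9.6~\mathrm{TB/s}, \qquad 500 \times 16~\mathrm{TB} = 8~\mathrm{PB} \quad \text{(per 500-node group)}

10 \times 9.6~\mathrm{TB/s} = 96~\mathrm{TB/s}, \qquad 10 \times 8~\mathrm{PB} = 80~\mathrm{PB} \quad \text{(5{,}000 IO nodes in total)}
```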
Challenge
• File system (object store)
  – Central storage cluster to distributed storage cluster
  – Scaled-out parallel file system up to O(1M) clients
    • Scaled-out MDS performance
• Compute node OS
  – Reduction of OS noise
  – Cooperative cache
• Runtime system
  – Optimization for non-uniform storage access "NUSA"
• Global storage for data sharing of exabyte-scale data among machines
Scaled out parallel file system
• Federate local storage in compute nodes
  – Special purpose
    • Google file system [SOSP'03]
    • Hadoop file system (HDFS)
  – POSIX(-like)
    • Gfarm file system [CCGrid'02, NGC'10]
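A minimal sketch of the federation idea above, assuming a "write-local, read-near" placement policy. All class and variable names here are hypothetical illustrations; Gfarm and HDFS implement this differently in detail.

```python
# Sketch of "write-local" file placement over federated node-local storage.
class FederatedFS:
    def __init__(self, node_names):
        self.storage = {n: {} for n in node_names}   # node -> {filename: data}
        self.location = {}                           # filename -> owning node
        self.remote_bytes = 0                        # bytes moved over the network

    def write(self, client, filename, data):
        # Write to the local storage of the writing node: no network transfer,
        # so aggregate write bandwidth grows with the number of nodes.
        self.storage[client][filename] = data
        self.location[filename] = client

    def read(self, client, filename):
        owner = self.location[filename]
        data = self.storage[owner][filename]
        if owner != client:
            self.remote_bytes += len(data)           # only non-local reads use the network
        return data

    def locate(self, filename):
        # A locality-aware scheduler can use this to run tasks near their inputs.
        return self.location[filename]

# Example: node "n0" writes a snapshot; a task scheduled on "n0" reads it locally.
fs = FederatedFS(["n0", "n1"])
fs.write("n0", "snapshot.dat", b"x" * 1024)
fs.read("n0", "snapshot.dat")        # local read; remote_bytes stays 0
```

Only when a task cannot run on the node holding its input does the network get used, which is exactly the non-uniform ("NUSA") access the runtime system has to manage.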
Scaled-out MDS
• GIGA+ [Swapnil Patil et al., FAST'11]
  – Incremental directory partitioning
  – Independent locking in each partition
• skyFS [Jing Xing et al., SC'09]
  – Performance improvement during directory partitioning in GIGA+
• Lustre
  – MT scalability in 2.X
  – Proposed clustered MDS
• PPMDS [Our JST CREST R&D]
  – Shared-nothing KV stores
  – Nonblocking software transactional memory (no lock)
System       IOPS (file creates/sec)   #MDS (#cores)
GIGA+        98K                       32 (256)
skyFS        100K                      32 (512)
Lustre 2.4   80K                       1 (16)
PPMDS        270K                      15 (240)
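A minimal sketch of the lock-free idea behind PPMDS, assuming a key-value store with an atomic compare-and-swap primitive. This is only an illustration of optimistic, retry-based metadata updates, not the actual PPMDS implementation, which uses nonblocking software transactional memory over shared-nothing KV stores.

```python
import threading

class KVStore:
    """Toy versioned key-value store with an atomic compare-and-swap."""
    def __init__(self):
        self._data = {}                      # key -> (version, value)
        self._mutex = threading.Lock()       # stands in for a hardware CAS

    def get(self, key):
        return self._data.get(key, (0, None))

    def compare_and_swap(self, key, expected_version, new_value):
        with self._mutex:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False                 # someone else updated first
            self._data[key] = (version + 1, new_value)
            return True

def create_file(store, dirname, filename):
    """Optimistically insert a directory entry; retry on conflict, never hold a lock."""
    while True:
        version, entries = store.get(dirname)
        new_entries = (entries or set()) | {filename}
        if store.compare_and_swap(dirname, version, new_entries):
            return                           # commit succeeded
        # Conflict: another client updated the directory; retry with fresh state.

store = KVStore()
create_file(store, "/data", "file0001")
create_file(store, "/data", "file0002")
```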
Development of Pwrake: Data-intensive Workflow System
IO-aware Task Scheduling:
• Locality-aware scheduling
  – Selection of compute nodes by input files
• Buffer cache-aware scheduling
  – Modified LIFO to ease the trailing task problem
[Chart: workflow elapsed time (sec) with I/O file size 900 GB on 10 nodes, comparing naïve, locality-aware, and locality- plus cache-aware scheduling; locality-aware scheduling gives a 42% speedup, cache-aware scheduling a further 23% speedup]
[Figure: Pwrake launches processes on compute nodes over SSH; files (file1, file2, file3) are stored in the Gfarm file system]
Pwrake = Workflow System based on Rake (Ruby make)
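A minimal sketch of the two scheduling ideas above, in hypothetical Python rather than Pwrake's actual Ruby/Rake code: tasks whose input files reside on a node are queued for that node, and each per-node queue is served in (modified) LIFO order so freshly written intermediate files are consumed while still in the buffer cache.

```python
from collections import defaultdict

class LocalityScheduler:
    """Toy locality- and cache-aware task scheduler in the spirit of Pwrake."""

    def __init__(self, locate):
        # locate(filename) -> node name holding the file (e.g. queried from Gfarm)
        self.locate = locate
        self.queues = defaultdict(list)    # node -> stack of tasks (LIFO)
        self.anywhere = []                 # tasks with no single preferred node

    def submit(self, task, input_files):
        nodes = {self.locate(f) for f in input_files}
        if len(nodes) == 1:
            # All inputs on one node: prefer running there (locality-aware).
            self.queues[nodes.pop()].append(task)
        else:
            self.anywhere.append(task)

    def next_task(self, idle_node):
        # Cache-aware: take the most recently submitted local task (LIFO),
        # so a just-produced intermediate file is still in the buffer cache.
        if self.queues[idle_node]:
            return self.queues[idle_node].pop()
        if self.anywhere:
            return self.anywhere.pop(0)
        # Steal the oldest task from the longest remote queue to ease the
        # trailing-task problem (the "modified" part of modified LIFO).
        victim = max(self.queues, key=lambda n: len(self.queues[n]), default=None)
        if victim and self.queues[victim]:
            return self.queues[victim].pop(0)
        return None

# Usage: file locations would come from the file system; a dict stands in here.
loc = {"img1.fits": "n0", "img2.fits": "n1"}
sched = LocalityScheduler(loc.get)
sched.submit("project img1", ["img1.fits"])
sched.submit("project img2", ["img2.fits"])
print(sched.next_task("n0"))   # -> "project img1", run where its input resides
```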
Maximize Locality using Multi-Constraint Graph Partitioning
[Tanaka, CCGrid 2012]
• Task scheduling based on MCGP can minimize data movement
• Applied to the Pwrake workflow system and evaluated on the Montage workflow
  – Data movement reduced by 86%; execution time improved by 31%
[Figure: simple graph partitioning vs. multi-constraint graph partitioning; with simple partitioning, parallel tasks are unbalanced among nodes]
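A minimal sketch of how the multi-constraint formulation differs from plain partitioning. The helper and the final `partition` call are hypothetical; the paper uses a real multi-constraint partitioner. The idea: each task gets a weight vector with one component per parallel stage of the workflow, so every stage is balanced across nodes while edge cuts (file transfers) are minimized.

```python
# Build multi-constraint vertex weights for a 2-stage workflow task graph.
def build_multiconstraint_weights(tasks, num_stages):
    """Give each task a weight vector: 1 in its own stage, 0 elsewhere.

    With a single scalar weight, a partitioner may balance total work yet
    place all stage-1 tasks on one node (unbalanced parallelism per stage).
    With one constraint per stage, every stage is balanced separately.
    """
    weights = []
    for task in tasks:
        w = [0] * num_stages
        w[task["stage"]] = 1
        weights.append(w)
    return weights

# Example: 4 stage-0 tasks feeding 2 stage-1 tasks, to be split over 2 nodes.
tasks = [
    {"name": "proj0", "stage": 0}, {"name": "proj1", "stage": 0},
    {"name": "proj2", "stage": 0}, {"name": "proj3", "stage": 0},
    {"name": "add0",  "stage": 1}, {"name": "add1",  "stage": 1},
]
edges = [(0, 4), (1, 4), (2, 5), (3, 5)]   # data-flow edges = potential file transfers
vertex_weights = build_multiconstraint_weights(tasks, num_stages=2)
# parts = partition(edges, vertex_weights, nparts=2)   # hypothetical multi-constraint partitioner
```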
HPCI Shared Storage
• HPCI – High Performance Computing Infrastructure
  – "K", Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, RIKEN, JAMSTEC, AIST
• A 20 PB Gfarm distributed file system consisting of East and West sites
• Grid Security Infrastructure (GSI) for user ID
• Parallel file replication among sites
• Parallel file staging to/from each center
[Figure: two-site configuration connected at 10 (~40) Gbps, West site (AICS) and East site (U Tokyo), providing 11.5 PB (60 servers) and 10 PB (40 servers) with MDSs at each site; picture courtesy of Hiroshi Harada (U Tokyo)]
Storage structure of HPCI Shared Storage
Objective / File system / How to use, for each storage layer:
• Local file system
  – Objective: temporal space (I/O performance; no backup)
  – File system: Lustre, Panasas, GPFS, …
  – How to use: mv/cp, file staging
• Global file system
  – Objective: persistent storage (capacity and reliability; backup copy will be on tape or disk)
  – How to use: mv/cp, file staging
• Wide-area distributed file system (HPCI Shared Storage)
  – Objective: data sharing (capacity and reliability; secured communication; fair share and easy to use; no backup, but files can be replicated)
  – File system: Gfarm file system
  – How to use: Web I/F, remote clients
Initial Performance Result
[Chart: file copy performance of 300 x 1 GB files to HPCI Shared Storage, I/O bandwidth in MB/sec per site: Hokkaido 898, Kyoto 847, Tokyo 1,107, AICS 1,073]
Related System
• XSEDE-Wide File System (GPFS)
  – Planned, but not in operation yet
• DEISA Global File System
  – Multicluster GPFS
    • RZG, LRZ, BSC, JSC, EPSS, HLRS, …
  – Site name included in the path name: no location transparency; files cannot be replicated across sites
  – PRACE does not provide a global file system
    • Limitation on the operating systems that can mount it
    • PRACE does not assume use of multiple sites
Summary
• App IO requirement
  – Computational Science
    • Scaled-out IO performance up to O(1M) nodes (100 TB to 1 PB per hour)
  – Data Intensive Science
    • Data processing for 10 PB to 1 EB data (>100 TB/sec)
• File system, object store, OS and runtime R&D for scale-out storage architecture
  – Central storage cluster to distributed storage cluster
    • Network-wide RAID
    • Scaled-out MDS
  – Runtime system for non-uniform storage access "NUSA"
    • Locality-aware process scheduling
• Global file system