System Software for Big Data and Post Petascale Computing
Osamu Tatebe University of Tsukuba
The Japanese Extreme Big Data Workshop February 26, 2014
I/O performance requirement for exascale applications
• Computational Science (Climate, CFD, …)
  – Read initial data (100 TB~PB)
  – Write snapshot data (100 TB~PB) periodically
• Data Intensive Science (Particle Physics, Astrophysics, Life Science, …)
  – Data analysis of 10 PB~EB experiment data
Scalable performance requirement for Parallel File System
Year   FLOPS   #cores   IO BW      IOPS     Systems
2008   1P      100K     100 GB/s   O(1K)    Jaguar, BG/P
2011   10P     1M       1 TB/s     O(10K)   K, BG/Q
2016   100P    10M      10 TB/s    O(100K)
2020   1E      100M     100 TB/s   O(1M)
Performance target: IO BW and IOPS are expected to scale out with the number of cores or nodes
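A quick check against the table: the required bandwidth per core stays roughly constant across generations, so the target can be met if per-node I/O bandwidth is preserved while the node count grows.

```latex
\frac{100~\mathrm{GB/s}}{10^{5}~\text{cores}} \approx
\frac{1~\mathrm{TB/s}}{10^{6}~\text{cores}} \approx
\frac{100~\mathrm{TB/s}}{10^{8}~\text{cores}} \approx 1~\mathrm{MB/s}~\text{per core}
```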
Technology trend
• HDD: performance does not improve much
  – 300 MB/s, 5 W in 2020
  – 100 TB/s means O(2M) W
• Flash, storage class memory
  – 1 GB/s, 0.1 W in 2020
  – Cost, limited number of updates
• Interconnects
  – 62 GB/s (InfiniBand 4x HDR)
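A rough back-of-the-envelope power estimate, using the per-device figures quoted above for 2020, shows where the O(2M) W number comes from and why flash looks attractive:

```latex
% HDD: 100 TB/s at 300 MB/s and 5 W per drive
N_{\mathrm{HDD}} \approx \frac{100~\mathrm{TB/s}}{300~\mathrm{MB/s}} \approx 3.3\times10^{5}
\;\Rightarrow\; P_{\mathrm{HDD}} \approx 3.3\times10^{5}\times 5~\mathrm{W} \approx 1.7~\mathrm{MW} = O(2\mathrm{M})~\mathrm{W}

% Flash / storage class memory: 100 TB/s at 1 GB/s and 0.1 W per device
N_{\mathrm{flash}} = \frac{100~\mathrm{TB/s}}{1~\mathrm{GB/s}} = 10^{5}
\;\Rightarrow\; P_{\mathrm{flash}} \approx 10^{5}\times 0.1~\mathrm{W} = 10~\mathrm{kW}
```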
Current parallel file system
• Central storage array
• Separate installation of compute nodes and storage
• Network BW between compute nodes and storage needs to be scaled up to scale out the I/O performance
[Figure: compute nodes (clients) access a central storage array through an MDS; the network BW between clients and storage is the limitation]
Remember memory architecture
[Figure: shared memory vs. distributed memory architectures, each CPU with its memory]
Scaled-out parallel file system
• Distributed storage in compute nodes
• I/O performance would be scaled out by accessing near storage, unless metadata performance is the bottleneck
  – Access to near storage mitigates the network BW requirement
  – The performance may be non-uniform
[Figure: compute nodes (clients) with local storage, served by an MDS cluster]
Example of Scale-out Storage Architecture
• A snapshot three years from now
• Non-uniform but scale-out storage
• R&D of system software stacks is required to achieve maximum I/O performance for data-intensive science
[Figure: example node and system configuration]
• Per node: CPU (2 sockets x 2.0 GHz x 16 cores x 32 FPU), memory, chipset, 16 x 1 TB local storage over 12 Gbps SAS x 16 → 19.2 GB/s, 16 TB
• x 500 nodes → 9.6 TB/s, 8 PB; x 10 → 96 TB/s, 80 PB
• Interconnect: InfiniBand HDR 62 GB/s
• 5,000 IO nodes, 10 MDSs (metadata servers)
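The aggregate numbers follow from the per-node figures, reading the x500 and x10 factors as 500 nodes per group and 10 groups (my reading of the diagram), and taking a 12 Gbps SAS link as roughly 1.2 GB/s:

```latex
16 \times 1.2~\mathrm{GB/s} = 19.2~\mathrm{GB/s}, \qquad 16 \times 1~\mathrm{TB} = 16~\mathrm{TB} \quad \text{(per node)}

500 \times 19.2~\mathrm{GB/s} = 9.6~\mathrm{TB/s}, \qquad 500 \times 16~\mathrm{TB} = 8~\mathrm{PB} \quad \text{(per 500-node group)}

10 \times 9.6~\mathrm{TB/s} = 96~\mathrm{TB/s}, \qquad 10 \times 8~\mathrm{PB} = 80~\mathrm{PB} \quad \text{(5{,}000 IO nodes in total)}
```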
Challenge
• File system (object store)
  – Central storage cluster to distributed storage cluster
  – Scaled-out parallel file system up to O(1M) clients
    • Scaled-out MDS performance
• Compute node OS
  – Reduction of OS noise
  – Cooperative cache
• Runtime system
  – Optimization for non-uniform storage access "NUSA"
• Global storage for data sharing of exabyte-scale data among machines
Scaled out parallel file system
• Federate local storage in compute nodes
  – Special purpose
    • Google file system [SOSP'03]
    • Hadoop file system (HDFS)
  – POSIX(-like)
    • Gfarm file system [CCGrid'02, NGC'10]
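A minimal sketch of the federation idea above, assuming a "write-local, read-near" placement policy. All class and variable names here are hypothetical illustrations; Gfarm and HDFS implement this differently in detail.

```python
# Sketch of "write-local" file placement over federated node-local storage.
class FederatedFS:
    def __init__(self, node_names):
        self.storage = {n: {} for n in node_names}   # node -> {filename: data}
        self.location = {}                           # filename -> owning node
        self.remote_bytes = 0                        # bytes moved over the network

    def write(self, client, filename, data):
        # Write to the local storage of the writing node: no network transfer,
        # so aggregate write bandwidth grows with the number of nodes.
        self.storage[client][filename] = data
        self.location[filename] = client

    def read(self, client, filename):
        owner = self.location[filename]
        data = self.storage[owner][filename]
        if owner != client:
            self.remote_bytes += len(data)           # only non-local reads use the network
        return data

    def locate(self, filename):
        # A locality-aware scheduler can use this to run tasks near their inputs.
        return self.location[filename]

# Example: node "n0" writes a snapshot; a task scheduled on "n0" reads it locally.
fs = FederatedFS(["n0", "n1"])
fs.write("n0", "snapshot.dat", b"x" * 1024)
fs.read("n0", "snapshot.dat")        # local read; remote_bytes stays 0
```

Only when a task cannot run on the node holding its input does the network get used, which is exactly the non-uniform ("NUSA") access the runtime system has to manage.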
Scaled-out MDS
• GIGA+ [Swapnil Patil et al., FAST'11]
  – Incremental directory partitioning
  – Independent locking in each partition
• skyFS [Jing Xing et al., SC'09]
  – Performance improvement during directory partitioning in GIGA+
• Lustre
  – MT scalability in 2.X
  – Proposed clustered MDS
• PPMDS [Our JST CREST R&D]
  – Shared-nothing KV stores
  – Nonblocking software transactional memory (no lock)
System       IOPS (file creates/sec)   #MDS (#cores)
GIGA+        98K                       32 (256)
skyFS        100K                      32 (512)
Lustre 2.4   80K                       1 (16)
PPMDS        270K                      15 (240)
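A minimal sketch of the lock-free idea behind PPMDS, assuming a key-value store with an atomic compare-and-swap primitive. This is only an illustration of optimistic, retry-based metadata updates, not the actual PPMDS implementation, which uses nonblocking software transactional memory over shared-nothing KV stores.

```python
import threading

class KVStore:
    """Toy versioned key-value store with an atomic compare-and-swap."""
    def __init__(self):
        self._data = {}                      # key -> (version, value)
        self._mutex = threading.Lock()       # stands in for a hardware CAS

    def get(self, key):
        return self._data.get(key, (0, None))

    def compare_and_swap(self, key, expected_version, new_value):
        with self._mutex:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False                 # someone else updated first
            self._data[key] = (version + 1, new_value)
            return True

def create_file(store, dirname, filename):
    """Optimistically insert a directory entry; retry on conflict, never hold a lock."""
    while True:
        version, entries = store.get(dirname)
        new_entries = (entries or set()) | {filename}
        if store.compare_and_swap(dirname, version, new_entries):
            return                           # commit succeeded
        # Conflict: another client updated the directory; retry with fresh state.

store = KVStore()
create_file(store, "/data", "file0001")
create_file(store, "/data", "file0002")
```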
Development of Pwrake: Data-intensive Workflow System
IO-aware Task Scheduling:
• Locality-aware scheduling
  – Selection of compute nodes by input files
• Buffer cache-aware scheduling
  – Modified LIFO to ease the trailing task problem
[Chart: workflow elapsed time (sec) with I/O file size 900 GB on 10 nodes, comparing naïve, locality-aware, and locality- plus cache-aware scheduling; locality-aware scheduling gives a 42% speedup, cache-aware scheduling a further 23% speedup]
[Figure: Pwrake launches processes on compute nodes over SSH; files (file1, file2, file3) are stored in the Gfarm file system]
Pwrake = Workflow System based on Rake (Ruby make)
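A minimal sketch of the two scheduling ideas above, in hypothetical Python rather than Pwrake's actual Ruby/Rake code: tasks whose input files reside on a node are queued for that node, and each per-node queue is served in (modified) LIFO order so freshly written intermediate files are consumed while still in the buffer cache.

```python
from collections import defaultdict

class LocalityScheduler:
    """Toy locality- and cache-aware task scheduler in the spirit of Pwrake."""

    def __init__(self, locate):
        # locate(filename) -> node name holding the file (e.g. queried from Gfarm)
        self.locate = locate
        self.queues = defaultdict(list)    # node -> stack of tasks (LIFO)
        self.anywhere = []                 # tasks with no single preferred node

    def submit(self, task, input_files):
        nodes = {self.locate(f) for f in input_files}
        if len(nodes) == 1:
            # All inputs on one node: prefer running there (locality-aware).
            self.queues[nodes.pop()].append(task)
        else:
            self.anywhere.append(task)

    def next_task(self, idle_node):
        # Cache-aware: take the most recently submitted local task (LIFO),
        # so a just-produced intermediate file is still in the buffer cache.
        if self.queues[idle_node]:
            return self.queues[idle_node].pop()
        if self.anywhere:
            return self.anywhere.pop(0)
        # Steal the oldest task from the longest remote queue to ease the
        # trailing-task problem (the "modified" part of modified LIFO).
        victim = max(self.queues, key=lambda n: len(self.queues[n]), default=None)
        if victim and self.queues[victim]:
            return self.queues[victim].pop(0)
        return None

# Usage: file locations would come from the file system; a dict stands in here.
loc = {"img1.fits": "n0", "img2.fits": "n1"}
sched = LocalityScheduler(loc.get)
sched.submit("project img1", ["img1.fits"])
sched.submit("project img2", ["img2.fits"])
print(sched.next_task("n0"))   # -> "project img1", run where its input resides
```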
Maximize Locality using Multi-Constraint Graph Partitioning
[Tanaka, CCGrid 2012]
• Task scheduling based on MCGP can minimize data movement
• Applied to the Pwrake workflow system and evaluated on the Montage workflow
  – Data movement reduced by 86%; execution time improved by 31%
[Figure: simple graph partitioning vs. multi-constraint graph partitioning; with simple partitioning, parallel tasks are unbalanced among nodes]
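A minimal sketch of how the multi-constraint formulation differs from plain partitioning. The helper and the final `partition` call are hypothetical; the paper uses a real multi-constraint partitioner. The idea: each task gets a weight vector with one component per parallel stage of the workflow, so every stage is balanced across nodes while edge cuts (file transfers) are minimized.

```python
# Build multi-constraint vertex weights for a 2-stage workflow task graph.
def build_multiconstraint_weights(tasks, num_stages):
    """Give each task a weight vector: 1 in its own stage, 0 elsewhere.

    With a single scalar weight, a partitioner may balance total work yet
    place all stage-1 tasks on one node (unbalanced parallelism per stage).
    With one constraint per stage, every stage is balanced separately.
    """
    weights = []
    for task in tasks:
        w = [0] * num_stages
        w[task["stage"]] = 1
        weights.append(w)
    return weights

# Example: 4 stage-0 tasks feeding 2 stage-1 tasks, to be split over 2 nodes.
tasks = [
    {"name": "proj0", "stage": 0}, {"name": "proj1", "stage": 0},
    {"name": "proj2", "stage": 0}, {"name": "proj3", "stage": 0},
    {"name": "add0",  "stage": 1}, {"name": "add1",  "stage": 1},
]
edges = [(0, 4), (1, 4), (2, 5), (3, 5)]   # data-flow edges = potential file transfers
vertex_weights = build_multiconstraint_weights(tasks, num_stages=2)
# parts = partition(edges, vertex_weights, nparts=2)   # hypothetical multi-constraint partitioner
```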
HPCI Shared Storage
• HPCI – High Performance Computing Infrastructure
  – "K", Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, RIKEN, JAMSTEC, AIST
• A 20 PB Gfarm distributed file system consisting of East and West sites
• Grid Security Infrastructure (GSI) for user ID
• Parallel file replication among sites
• Parallel file staging to/from each center
[Figure: two-site configuration connected at 10 (~40) Gbps, West site (AICS) and East site (U Tokyo), providing 11.5 PB (60 servers) and 10 PB (40 servers) with MDSs at each site; picture courtesy of Hiroshi Harada (U Tokyo)]
Storage structure of HPCI Shared Storage
Objective / File system / How to use, for each storage layer:
• Local file system
  – Objective: temporal space (I/O performance; no backup)
  – File system: Lustre, Panasas, GPFS, …
  – How to use: mv/cp, file staging
• Global file system
  – Objective: persistent storage (capacity and reliability; backup copy will be on tape or disk)
  – How to use: mv/cp, file staging
• Wide-area distributed file system (HPCI Shared Storage)
  – Objective: data sharing (capacity and reliability; secured communication; fair share and easy to use; no backup, but files can be replicated)
  – File system: Gfarm file system
  – How to use: Web I/F, remote clients
Initial Performance Result
[Chart: file copy performance of 300 x 1 GB files to HPCI Shared Storage, I/O bandwidth in MB/sec per site: Hokkaido 898, Kyoto 847, Tokyo 1,107, AICS 1,073]
Related System
• XSEDE-Wide File System (GPFS)
  – Planned, but not in operation yet
• DEISA Global File System
  – Multicluster GPFS
    • RZG, LRZ, BSC, JSC, EPSS, HLRS, …
  – Site name included in the path name: no location transparency; files cannot be replicated across sites
  – PRACE does not provide a global file system
    • Limitation on the operating systems that can mount it
    • PRACE does not assume use of multiple sites
Summary
• App IO requirement
  – Computational Science
    • Scaled-out IO performance up to O(1M) nodes (100 TB to 1 PB per hour)
  – Data Intensive Science
    • Data processing for 10 PB to 1 EB data (>100 TB/sec)
• File system, object store, OS and runtime R&D for scale-out storage architecture
  – Central storage cluster to distributed storage cluster
    • Network-wide RAID
    • Scaled-out MDS
  – Runtime system for non-uniform storage access "NUSA"
    • Locality-aware process scheduling
• Global file system