System Software for Big Data and Post Petascale Computing
Osamu Tatebe University of Tsukuba
The Japanese Extreme Big Data Workshop February 26, 2014
I/O performance requirements for exascale applications
• Computational Science (Climate, CFD, …)
  – Read initial data (100TB~PB)
  – Write snapshot data (100TB~PB) periodically
• Data Intensive Science (Particle Physics, Astrophysics, Life Science, …)
  – Data analysis of 10PB~EB experiment data
Scalable performance requirement for Parallel File System
Year  FLOPS  #cores  IO BW     IOPS     Systems
2008  1P     100K    100 GB/s  O(1K)    Jaguar, BG/P
2011  10P    1M      1 TB/s    O(10K)   K, BG/Q
2016  100P   10M     10 TB/s   O(100K)
2020  1E     100M    100 TB/s  O(1M)
IO BW and IOPS are expected to scale out with the number of cores or nodes
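The scale-out claim can be checked with back-of-the-envelope arithmetic (my sketch, using only the figures in the table above): the target aggregate bandwidth works out to a constant ~1 MB/s per core in every generation.

```python
# Per-core I/O bandwidth implied by the performance targets above.
# (Quick arithmetic check, not from the original slides.)
targets = [
    # (year, cores, aggregate IO bandwidth in GB/s)
    (2008, 100_000, 100),
    (2011, 1_000_000, 1_000),
    (2016, 10_000_000, 10_000),
    (2020, 100_000_000, 100_000),
]

for year, cores, bw_gbs in targets:
    per_core_mbs = bw_gbs * 1_000 / cores  # GB/s -> MB/s, divided by cores
    print(year, per_core_mbs)              # 1.0 MB/s per core in every row
```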
Performance target
Technology trend
• HDD performance will not improve much ☹
  – 300 MB/s, 5 W in 2020
  – 100 TB/s would need ~330K drives, i.e. O(2M) W
• Flash, storage class memory ☺
  – 1 GB/s, 0.1 W in 2020
  – But cost and a limited number of updates ☹
• Interconnects
  – 62 GB/s (InfiniBand 4x HDR)
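The O(2M) W figure follows directly from the device numbers above. A small sketch (my arithmetic, based on the slide's 2020 projections) compares HDD and flash power for the 100 TB/s exascale target:

```python
# Power needed to sustain 100 TB/s with the projected 2020 devices.
# (My arithmetic from the per-device figures on the slide.)
TARGET_BW = 100e12  # bytes/s, the 1 EFLOPS-era target from the table

# HDD: 300 MB/s and 5 W per drive
hdd_drives = TARGET_BW / 300e6     # ~333,334 drives
hdd_watts = hdd_drives * 5         # ~1.7 MW, the slide's O(2M) W

# Flash / storage class memory: 1 GB/s and 0.1 W per device
flash_devices = TARGET_BW / 1e9    # 100,000 devices
flash_watts = flash_devices * 0.1  # ~10 kW, two orders of magnitude less
```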
Current parallel file system
• Central storage array
• Separate installation of compute nodes and storage
• Network BW between compute nodes and storage needs to be scaled up to scale out the I/O performance
[Figure: compute nodes (clients) access a central storage array and an MDS over the network; the network BW is the limitation]
Remember memory architecture
[Figure: shared memory (CPUs sharing one memory) vs. distributed memory (each CPU with its own memory)]
Scaled-out parallel file system
• Distributed storage in compute nodes
• I/O performance would be scaled out by accessing near storage, unless metadata performance is the bottleneck
  – Access to near storage mitigates the network BW requirement
  – The performance may be non-uniform
[Figure: compute nodes (clients), each with local storage, plus an MDS cluster]
Example of Scale-out Storage Architecture
• A snapshot of 3 years later
• Non-uniform but scale-out storage
• R&D of system software stacks is required to achieve maximum I/O performance for data-intensive science
• 5,000 IO nodes, 10 MDSs
[Figure: each node has a CPU (2 sockets x 2.0 GHz x 16 cores x 32 FPU), memory, chipset, and 1 TB local storage on 12 Gbps SAS; x16 nodes give 19.2 GB/s and 16 TB; x500 give 9.6 TB/s and 8 PB; x10 give 96 TB/s and 80 PB; metadata servers x10; InfiniBand HDR interconnect at 62 GB/s]
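The aggregate figures in the sketch can be reproduced from the per-node numbers (my arithmetic; I assume 12 Gbps SAS delivers ~1.2 GB/s per node):

```python
# Reproducing the aggregate bandwidth/capacity figures of the architecture
# sketch from the per-node numbers (12 Gbps SAS ~= 1.2 GB/s assumed).
node_bw = 1.2    # GB/s per node
node_cap = 1     # TB per node

group_bw = node_bw * 16           # 19.2 GB/s
group_cap = node_cap * 16         # 16 TB

mid_bw = group_bw * 500 / 1000    # 9.6 TB/s
mid_cap = group_cap * 500 / 1000  # 8 PB

total_bw = mid_bw * 10            # 96 TB/s
total_cap = mid_cap * 10          # 80 PB
```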
Challenge
• File system (object store)
  – Central storage cluster to distributed storage cluster
  – Scaled-out parallel file system up to O(1M) clients
  – Scaled-out MDS performance
• Compute node OS
  – Reduction of OS noise
  – Cooperative cache
• Runtime system
  – Optimization for non-uniform storage access "NUSA"
• Global storage for sharing exabyte-scale data among machines
Scaled-out parallel file system
• Federate local storage in compute nodes
  – Special purpose
    • Google File System [SOSP'03]
    • Hadoop file system (HDFS)
  – POSIX(-like)
    • Gfarm file system [CCGrid'02, NGC'10]
Scaled-out MDS
• GIGA+ [Swapnil Patil et al., FAST'11]
  – Incremental directory partitioning
  – Independent locking in each partition
• skyFS [Jing Xing et al., SC'09]
  – Performance improvement during directory partitioning in GIGA+
• Lustre
  – MT scalability in 2.X
  – Proposed clustered MDS
• PPMDS [our JST CREST R&D]
  – Shared-nothing KV stores
  – Nonblocking software transactional memory (no locks)

System      IOPS (file creates/sec)  #MDS (#cores)
GIGA+       98K                      32 (256)
skyFS       100K                     32 (512)
Lustre 2.4  80K                      1 (16)
PPMDS       270K                     15 (240)
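GIGA+-style incremental directory partitioning can be sketched in a few lines (a toy model, not the FAST'11 implementation; the split threshold and hashing are my simplifications): a directory starts as a single partition, and an overflowing partition splits by one more hash bit, so new partitions (and the servers holding them) are added incrementally, and each split touches only one partition, which is what permits independent per-partition locking.

```python
import hashlib

SPLIT_THRESHOLD = 8  # toy value; real systems split at thousands of entries

def name_hash(name: str) -> int:
    """Stable hash of a file name (its low bits select the partition)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "big")

class GigaPlusDirectory:
    """One logical directory split incrementally over hash partitions."""

    def __init__(self):
        self.entries = {0: set()}  # partition id -> entries it holds
        self.depth = {0: 0}        # partition id -> number of hash bits used

    def _partition_of(self, name: str) -> int:
        h = name_hash(name)
        # Try the deepest mask first; a name belongs to the unique existing
        # partition whose depth matches the mask that found it.
        for bits in range(max(self.depth.values()), -1, -1):
            p = h & ((1 << bits) - 1)
            if p in self.entries and self.depth[p] == bits:
                return p
        raise AssertionError("unreachable: partition 0 always exists")

    def create(self, name: str) -> None:
        p = self._partition_of(name)
        self.entries[p].add(name)
        if len(self.entries[p]) > SPLIT_THRESHOLD:
            self._split(p)

    def _split(self, p: int) -> None:
        # Split p by one more hash bit: entries whose next bit is 1 move to
        # the new "buddy" partition; no other partition is touched.
        d = self.depth[p]
        buddy = p | (1 << d)
        self.depth[p] = self.depth[buddy] = d + 1
        movers = {n for n in self.entries[p] if (name_hash(n) >> d) & 1}
        self.entries[p] -= movers
        self.entries[buddy] = movers

    def lookup(self, name: str) -> bool:
        return name in self.entries[self._partition_of(name)]
```

With a few hundred creates the directory grows to many partitions, while every lookup still resolves with one hash computation and no global lock.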
Development of Pwrake: Data-intensive Workflow System
IO-aware task scheduling:
• Locality-aware scheduling
  – Selection of compute nodes by input files
• Buffer-cache-aware scheduling
  – Modified LIFO to ease the trailing-task problem
[Chart: workflow elapsed time (sec) with 900 GB of I/O files on 10 nodes. Locality-aware scheduling gives a 42% speedup over the naive schedule; adding cache-aware scheduling gives a further 23% speedup]
[Figure: Pwrake launches processes over SSH on the nodes holding file1, file2, and file3 in the Gfarm file system]
Pwrake = Workflow System based on Rake (Ruby make)
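A minimal sketch of the locality-aware part (my simplification, not Pwrake's actual scheduler; the helper and its inputs are hypothetical): each ready task goes to the node that already stores the most bytes of its input files, so most reads hit local storage instead of the network.

```python
# Locality-aware node selection: a toy version of "selection of compute
# nodes by input files". All names and data structures are illustrative.

def pick_node(task_inputs, file_node, file_size, node_load):
    """Choose the node holding the most input bytes; break ties by
    preferring the node with the fewest queued tasks.

    task_inputs: list of input file names for the task
    file_node:   file name -> node storing the file (its Gfarm replica)
    file_size:   file name -> size in bytes
    node_load:   node -> number of tasks already queued there
    """
    local_bytes = {}
    for f in task_inputs:
        n = file_node[f]
        local_bytes[n] = local_bytes.get(n, 0) + file_size[f]
    return max(local_bytes, key=lambda n: (local_bytes[n], -node_load[n]))

# Example mirroring the figure: file1/file3 live on node0, file2 on node1.
file_node = {"file1": "node0", "file2": "node1", "file3": "node0"}
file_size = {"file1": 100, "file2": 300, "file3": 50}
node_load = {"node0": 0, "node1": 0}

best = pick_node(["file1", "file2", "file3"], file_node, file_size, node_load)
# node1 holds 300 of the 450 input bytes, so the task runs there
```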
Maximize Locality using Multi-Constraint Graph Partitioning [Tanaka, CCGrid 2012]
• Task scheduling based on MCGP can minimize data movement
• Applied to the Pwrake workflow system and evaluated on the Montage workflow
  – Data movement reduced by 86%; execution time improved by 31%
[Figure: simple graph partitioning vs. multi-constraint graph partitioning; with simple partitioning, parallel tasks are unbalanced among nodes]
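A toy illustration (my construction, not the algorithm in [Tanaka, CCGrid 2012]) of what multi-constraint partitioning buys: each task gets one weight per workflow phase, and a partition is scored on edge cut (data movement) under the constraint that every phase is balanced separately. A single-constraint split can balance total work yet leave one phase's parallel tasks all on one node.

```python
# Toy illustration of the metric multi-constraint graph partitioning (MCGP)
# optimizes. Tasks carry one weight per workflow phase; a partition is
# judged by its edge cut (inter-node data movement) subject to balancing
# *every* phase, not just the total load. The workflow here is invented.

# task -> (weight in phase 0, weight in phase 1)
tasks = {"a": (2, 0), "b": (2, 0), "c": (0, 2), "d": (0, 2)}
# (producer, consumer, bytes transferred between them)
edges = [("a", "c", 10), ("b", "d", 10), ("a", "b", 1), ("c", "d", 1)]

def edge_cut(assign):
    """Bytes moved between nodes for a task -> node assignment."""
    return sum(w for u, v, w in edges if assign[u] != assign[v])

def worst_phase_imbalance(assign, n_phases=2, n_nodes=2):
    """Worst ratio of max node load to ideal load across all phases."""
    worst = 0.0
    for p in range(n_phases):
        loads = [0] * n_nodes
        for t, weights in tasks.items():
            loads[assign[t]] += weights[p]
        worst = max(worst, max(loads) / (sum(loads) / n_nodes))
    return worst

# Single-constraint: total load is balanced (weight 4 per node), but the
# phase-0 tasks a, b sit on one node, so parallel tasks are unbalanced.
single = {"a": 0, "b": 0, "c": 1, "d": 1}
# Multi-constraint: each phase is balanced AND the heavy a->c, b->d
# data-flow edges stay node-local, so data movement drops as well.
multi = {"a": 0, "c": 0, "b": 1, "d": 1}
```

Here the edge cut falls from 20 to 2 and the per-phase imbalance from 2.0 to 1.0; tools such as METIS search for such partitions automatically via their multi-constraint mode.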
HPCI Shared Storage
• HPCI: High Performance Computing Infrastructure
  – "K", Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, RIKEN, JAMSTEC, AIST
• A 20 PB Gfarm distributed file system consisting of East and West sites
• Grid Security Infrastructure (GSI) for user IDs
• Parallel file replication among sites
• Parallel file staging to/from each center
[Figure: West site (AICS), 11.5 PB on 60 servers, and East site (U Tokyo), 10 PB on 40 servers, each with MDSs, connected by a 10 (~40) Gbps network. Picture courtesy of Hiroshi Harada (U Tokyo)]
Storage structure of HPCI Shared Storage
• Temporary space: local file system (Lustre, Panasas, GPFS, …). Goal: I/O performance; no backup. Used via mv/cp and file staging.
• Persistent storage: global file system. Goals: capacity and reliability; a backup copy is kept on tape or disk. Used via mv/cp and file staging.
• Data sharing: wide-area distributed file system (Gfarm file system, i.e. the HPCI Shared Storage). Goals: capacity and reliability, secured communication, fair share and ease of use; no backup, but files can be replicated. Remote clients use a Web I/F.
Initial Performance Result
[Chart: file-copy performance of 300 x 1 GB files to the HPCI Shared Storage, I/O bandwidth in MB/s: Hokkaido 898, Kyoto 847, Tokyo 1,107, AICS 1,073]
Related System
• XSEDE-Wide File System (GPFS)
  – Planned, but not in operation yet
• DEISA Global File System
  – Multicluster GPFS (RZG, LRZ, BSC, JSC, EPSS, HLRS, …)
  – Site name included in the path name: no location transparency, and files cannot be replicated across sites
• PRACE does not provide a global file system
  – Limitation on the operating systems that can mount it
  – PRACE does not assume the use of multiple sites
Summary
• Application I/O requirements
  – Computational Science: scaled-out I/O performance up to O(1M) nodes (100 TB to 1 PB per hour)
  – Data Intensive Science: data processing for 10 PB to 1 EB of data (>100 TB/sec)
• File system, object store, OS, and runtime R&D for a scale-out storage architecture
  – Central storage cluster to distributed storage cluster
    • Network-wide RAID
    • Scaled-out MDS
  – Runtime system for non-uniform storage access "NUSA"
    • Locality-aware process scheduling
• Global file system