-
• Segregated storage and compute
  – NFS, GPFS, PVFS, Lustre
  – Batch-scheduled systems: Clusters, Grids, and Supercomputers
  – Programming paradigm: HPC, MTC, and HTC
• Co-located storage and compute
  – HDFS, GFS
  – Data centers at Google, Yahoo, and others
  – Programming paradigm: MapReduce
  – Others from academia: Sector, MosaStore, Chirp
2
-
6
[Figure: throughput per processor core (MB/s, log scale 0.1 to 1000) for Local Disk, Cluster, and Supercomputer, comparing 2002-2004 systems with today's, annotated with ratios of 2.2X, 15X, 99X, and 438X.]
• Local Disk:
  – 2002-2004: ANL/UC TG Site (70GB SCSI)
  – Today: PADS (RAID-0, 6 drives, 750GB SATA)
• Cluster:
  – 2002-2004: ANL/UC TG Site (GPFS, 8 servers, 1Gb/s each)
  – Today: PADS (GPFS, SAN)
• Supercomputer:
  – 2002-2004: IBM Blue Gene/L (GPFS)
  – Today: IBM Blue Gene/P (GPFS)
-
What if we could combine the scientific community’s existing programming paradigms, yet still exploit the data locality that naturally occurs in scientific workloads?
7
-
8
-
9
[Figure: the problem space plotted as Input Data Size (Low/Med/Hi) vs. Number of Tasks (1, 1K, 1M), with four regions: HPC (heroic MPI tasks), HTC/MTC (many loosely coupled tasks), MapReduce/MTC (data analysis, mining), and MTC (big data and many tasks).]
[MTAGS08] “Many-Task Computing for Grids and Supercomputers”
-
10
• Important concepts related to the hypothesis
  – Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions
  – Data locality is crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications
  – Allocate computational and caching storage resources, co-scheduled to optimize workload performance

“Significant performance improvements can be obtained in the analysis of large datasets by leveraging information about data analysis workloads rather than individual data analysis tasks.”
-
11
[Figure: data diffusion architecture: a task dispatcher with a data-aware scheduler, backed by persistent storage and a shared file system, provisions idle resources into transient compute-and-cache resources on demand.]
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
• Resources acquired in response to demand
• Data diffuses from archival storage to newly acquired transient resources
• Resource “caching” allows faster responses to subsequent requests (a minimal caching sketch follows this list)
• Resources are released when demand drops
• Optimizes performance by co-scheduling data and computation
• Decreases dependency on shared/parallel file systems
• Critical to support data-intensive MTC
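To make the caching behavior concrete, here is a minimal sketch (assumed names and structure, not Falkon's actual code) of how a compute node might diffuse a file from a peer cache or persistent storage on a miss, evicting least-recently-used entries to stay within its local space budget:

import java.util.*;

// Hypothetical per-node cache, for illustration only.
class DiffusionCache {
    private final long capacityBytes;                      // local cache budget
    private long usedBytes = 0;
    private final LinkedHashMap<String, Long> lru =        // file -> size, access order
            new LinkedHashMap<String, Long>(16, 0.75f, true);

    DiffusionCache(long capacityBytes) { this.capacityBytes = capacityBytes; }

    // Return a local path for 'file', pulling it from a peer cache or from
    // persistent storage (the "diffusion" step) if it is not already cached.
    String acquire(String file, long sizeBytes, Set<String> peersCaching) {
        if (!lru.containsKey(file)) {
            while (usedBytes + sizeBytes > capacityBytes && !lru.isEmpty())
                evictLeastRecentlyUsed();
            if (!peersCaching.isEmpty())
                copyFromPeer(file, peersCaching.iterator().next()); // cache-to-cache
            else
                copyFromPersistentStorage(file);                    // e.g. archival / GPFS
            lru.put(file, sizeBytes);
            usedBytes += sizeBytes;
        }
        lru.get(file);                        // touch entry to refresh LRU order
        return "/local/cache/" + file;
    }

    private void evictLeastRecentlyUsed() {
        Map.Entry<String, Long> oldest = lru.entrySet().iterator().next();
        usedBytes -= oldest.getValue();
        lru.remove(oldest.getKey());
    }

    private void copyFromPeer(String file, String peer) { /* network copy, omitted */ }
    private void copyFromPersistentStorage(String file) { /* shared FS read, omitted */ }
}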
-
12
[SC07] “Falkon: a Fast and Light-weight tasK executiON framework”
• What would data diffusion look like in practice?
• Extend the Falkon framework
-
13
• FA: first-available
  – simple load balancing
• MCH: max-cache-hit
  – maximize cache hits
• MCU: max-compute-util
  – maximize processor utilization
• GCC: good-cache-compute
  – maximize both cache hits and processor utilization at the same time (see the dispatch sketch below)
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
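For illustration only (the Worker structure and utilization threshold are assumptions, not Falkon's actual scheduler code), a data-aware dispatcher might realize the FA, MCH, and GCC policies roughly as follows:

import java.util.*;

// Hypothetical dispatch decisions for the first-available, max-cache-hit,
// and good-cache-compute policies.
class DataAwareScheduler {
    static class Worker {
        Set<String> cachedFiles = new HashSet<String>(); // files in this node's cache
        int runningTasks = 0;
        int slots = 2;                                    // e.g. a dual-CPU node
        double utilization() { return (double) runningTasks / slots; }
    }

    // FA: ignore data location, pick any idle worker.
    Worker firstAvailable(List<Worker> workers) {
        for (Worker w : workers)
            if (w.runningTasks < w.slots) return w;
        return null;                                      // none idle: task waits in queue
    }

    // MCH: insist on a worker that already caches the task's input file.
    Worker maxCacheHit(List<Worker> workers, String inputFile) {
        for (Worker w : workers)
            if (w.cachedFiles.contains(inputFile)) return w;   // may be busy: task waits
        return null;                                           // miss: diffuse file to some worker
    }

    // GCC: prefer a caching worker, but only while it is not saturated;
    // otherwise fall back to first-available so processors stay busy.
    Worker goodCacheCompute(List<Worker> workers, String inputFile, double maxUtil) {
        Worker cached = maxCacheHit(workers, inputFile);
        if (cached != null && cached.utilization() < maxUtil) return cached;
        return firstAvailable(workers);
    }
}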
-
14
[Figure: scheduler overhead and throughput for first-available (without and with I/O), max-compute-util, max-cache-hit, and good-cache-compute: CPU time per task (0-5 ms) broken into task submit, notification of task availability, task dispatch and task results in the data-aware scheduler, notification of task results, and WS communication, overlaid with achieved throughput (0-5000 tasks/sec).]
[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
• 3GHz dual CPUs
• ANL/UC TG with 128 processors
• Scheduling window: 2500 tasks
• Dataset: 100K files, 1 byte each
• Tasks: read 1 file, write 1 file
-
15
• Monotonically Increasing Workload
  – Emphasizes increasing loads
• Sine-Wave Workload
  – Emphasizes varying loads
• All-Pairs Workload
  – Compare to the best-case model of active storage
• Image Stacking Workload (Astronomy)
  – Evaluate data diffusion on a real large-scale data-intensive application from the astronomy domain
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
-
16
• 250K tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Increment function: CEILING(previous rate * 1.3) (replayed in the sketch after the figure)
  – Max: 1000 tasks/sec
• 128 processors
• Ideal case:
  – 1415 sec
  – 80Gb/s peak throughput
[Figure: monotonically increasing workload: arrival rate (1 to 1000 tasks/sec) and cumulative tasks completed (up to 250K) over time (sec).]
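A minimal sketch of replaying this arrival schedule (illustrative only; the 60-second increment interval is an assumption, chosen because it reproduces the stated ideal completion time of about 1415 sec for 250K tasks):

// Generate the monotonically increasing arrival schedule:
// start at 1 task/sec, multiply by 1.3 and round up at each interval,
// cap at 1000 tasks/sec, stop once 250K tasks have been submitted.
public class IncreasingWorkload {
    public static void main(String[] args) {
        final int TOTAL_TASKS = 250000, MAX_RATE = 1000, INTERVAL_SEC = 60;
        int rate = 1, submitted = 0;
        double seconds = 0;
        while (submitted < TOTAL_TASKS) {
            int burst = Math.min(rate * INTERVAL_SEC, TOTAL_TASKS - submitted);
            submitted += burst;
            seconds += (double) burst / rate;                        // time spent at this rate
            rate = Math.min((int) Math.ceil(rate * 1.3), MAX_RATE);  // CEILING(rate * 1.3)
        }
        System.out.printf("%d tasks submitted in ~%.0f sec%n", submitted, seconds);
    }
}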
-
17
• GPFS vs. ideal: 5011 sec vs. 1415 sec
[Figure: GPFS baseline run: throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and nodes allocated (0-100) over time (sec).]
-
18
Max-compute-util vs. max-cache-hit
[Figure: two panels showing cache hit/miss percentages, throughput and demand (Gb/s), wait queue length (x1K), and nodes allocated (0-100) over time for the max-compute-util and max-cache-hit policies; one panel additionally shows CPU utilization.]
-
19
Per-node cache sizes: 1GB, 1.5GB, 2GB, 4GB
[Figure: four panels, one per cache size, showing cache hit/miss percentages, throughput and demand (Gb/s), wait queue length (x1K), and nodes allocated (0-100) over time.]
-
20
• Data Diffusion vs. ideal: 1436 sec vs 1415 sec
[Figure: data diffusion run: cache hit/miss percentages, throughput and demand (Gb/s), wait queue length (x1K), and nodes allocated (0-100) over time (sec).]
-
21
• Throughput:
  – Average: 14Gb/s vs. 4Gb/s
  – Peak: 81Gb/s vs. 6Gb/s
• Response time:
  – 3 sec vs. 1569 sec (a 506X improvement)
[Figure: left, throughput (Gb/s) split into local worker caches, remote worker caches, and GPFS for Ideal, FA, GCC 1GB/1.5GB/2GB/4GB, MCH 4GB, and MCU 4GB; right, average response time (sec) per policy, from a few seconds for the best GCC configurations up to 1569 sec for FA.]
-
22
• Performance index
  – 34X higher
• Speedup
  – 3.5X faster than GPFS
[Figure: performance index (0-1) and speedup relative to first-available on LAN GPFS for FA, GCC 1GB/1.5GB/2GB/4GB, GCC 4GB SRP, MCH 4GB, and MCU 4GB.]
-
23
[Figure: sine-wave workload: arrival rate (0-1000 tasks/sec) and cumulative tasks completed (up to 2M) over time (0-6600 sec).]
• 2M tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Arrival rate function (replayed in the sketch below):
    A = (sin(sqrt(time + 0.11) * 2.859678) + 1) * sqrt(time + 0.11) * 5.705
  – Max: 1000 tasks/sec
• 200 processors
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput
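A minimal sketch (illustrative only; 1-second evaluation steps are an assumption) that replays this arrival function and confirms it yields roughly 2M tasks over roughly 6500 seconds:

// Evaluate the sine-wave arrival rate once per second and accumulate tasks.
public class SineWaveWorkload {
    static double arrivalRate(double time) {
        double s = Math.sqrt(time + 0.11);
        return (Math.sin(s * 2.859678) + 1) * s * 5.705;   // tasks per second, peaks near 1000
    }

    public static void main(String[] args) {
        double tasks = 0;
        int t = 0;
        while (tasks < 2000000) {
            tasks += arrivalRate(t);    // tasks submitted during second t
            t++;
        }
        System.out.printf("~%.0f tasks submitted after %d sec%n", tasks, t);
    }
}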
-
• GPFS 5.7 hrs, ~8Gb/s, 1138 CPU hrs
24
[Figure: sine-wave workload on GPFS: throughput and demand (Gb/s), wait queue length (x1K), and nodes allocated (0-100) over time (sec).]
-
25
• GPFS 5.7 hrs, ~8Gb/s, 1138 CPU hrs• GCC+SRP 1.8 hrs, ~25Gb/s, 361 CPU hrs
[Figure: sine-wave workload with GCC+SRP: cache hit/miss percentages, throughput and demand (Gb/s), wait queue length (x1K), and nodes allocated (0-100) over time (sec).]
-
26
• GPFS 5.7 hrs, ~8Gb/s, 1138 CPU hrs• GCC+SRP 1.8 hrs, ~25Gb/s, 361 CPU hrs• GCC+DRP 1.86 hrs, ~24Gb/s, 253 CPU hrs
[Figure: sine-wave workload with GCC+DRP: cache hit/miss percentages, throughput and demand (Gb/s), wait queue length (x1K), and nodes allocated (0-100) over time (sec).]
-
• All-Pairs( set A, set B, function F ) returns matrix M:
• Compare all elements of set A to all elements of set B via function F, yielding matrix M, such that M[i,j] = F(A[i],B[j])
27
foreach $i in A
  foreach $j in B
    submit_job F $i $j
  end
end
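The same semantics, written as a small in-process sketch for clarity (illustrative only; in the actual workload each F(A[i], B[j]) invocation is submitted as an independent task, as in the pseudocode above, rather than run in a local loop):

import java.util.function.BiFunction;

// All-Pairs semantics: M[i][j] = F(A[i], B[j]) for all i, j.
class AllPairs {
    static <A, B, R> R[][] allPairs(A[] a, B[] b, BiFunction<A, B, R> f, R[][] m) {
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                m[i][j] = f.apply(a[i], b[j]);
        return m;
    }
}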
• 500x500
  – 250K tasks
  – 24MB reads
  – 100ms compute
  – 200 CPUs
• 1000x1000
  – 1M tasks
  – 24MB reads
  – 4 sec compute
  – 4096 CPUs
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput
[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
-
[Figure: All-Pairs run: cache hit/miss percentages and throughput (0-100 Gb/s) over time, with data diffusion throughput plotted against the maximum GPFS and local-disk throughput lines.]
28
Efficiency: 75%
-
[Figure: All-Pairs run: cache hit/miss percentages and throughput (0-200 Gb/s) over time, with data diffusion throughput plotted against the maximum GPFS and local-memory throughput lines.]
29
Efficiency: 86%
[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
-
• Pull vs. Push
  – Data Diffusion:
    • Pulls the task working set
    • Incremental spanning forest
  – Active Storage:
    • Pushes the workload working set to all nodes
    • Static spanning tree
30
[Figure: efficiency (0-100%) of best case (active storage), Falkon (data diffusion), and best case (parallel file system) for the four experiments listed in the table below.]
Experiment                    Approach                     Local Disk/Memory (GB)   Network node-to-node (GB)   Shared File System (GB)
500x500, 200 CPUs, 1 sec      Best Case (active storage)   6000                     1536                        12
                              Falkon (data diffusion)      6000                     1698                        34
500x500, 200 CPUs, 0.1 sec    Best Case (active storage)   6000                     1536                        12
                              Falkon (data diffusion)      6000                     1528                        62
1000x1000, 4096 CPUs, 4 sec   Best Case (active storage)   24000                    12288                       24
                              Falkon (data diffusion)      24000                    4676                        384
1000x1000, 5832 CPUs, 4 sec   Best Case (active storage)   24000                    12288                       24
                              Falkon (data diffusion)      24000                    3867                        906
Christopher Moretti, Douglas Thain, University of Notre Dame
[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
-
• Best to use active storage if:
  – Slow data source
  – Workload working set fits on local node storage
• Best to use data diffusion if:
  – Medium to fast data source
  – Task working set
-
32
• Purpose
  – On-demand “stacks” of random locations within a ~10TB dataset
• Challenge
  – Processing costs: O(100ms) per object
  – Data intensive: 40MB : 1 sec
  – Rapid access to 10-10K “random” files
  – Time-varying load
[Figure: AstroPortal stacking of Sloan data: many image cutouts (“+”) combined into a single stacked image (“=”).]
Locality (avg. objects per file)   Number of Objects   Number of Files
1                                  111700              111700
1.38                               154345              111699
2                                  97999               49000
3                                  88857               29620
4                                  76575               19145
5                                  60590               12120
10                                 46480               4650
20                                 40460               2025
30                                 23695               790
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”[TG06] “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”
-
33
[Figure: time (ms) per stack operation for GPFS GZ, LOCAL GZ, GPFS FIT, and LOCAL FIT, broken into open, radec2xy, readHDU+getTile+curl+convertArray, calibration+interpolation+doStacking, and writeStacking.]
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
-
34
• Low data locality
  – Similar (but better) performance compared to GPFS
• High data locality
  – Near-perfect scalability
[Figure: time (ms) per stack per CPU vs. number of CPUs (2-128) for Data Diffusion (GZ/FIT) and GPFS (GZ/FIT), for the low- and high-locality cases.]
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
-
35
• Aggregate throughput:
  – 39Gb/s
  – 10X higher than GPFS
• Reduced load on GPFS:
  – 0.49Gb/s
  – 1/10 of the original load
• Big performance gains as locality increases
[Figure: left, aggregate throughput (Gb/s) vs. locality (1 to 30), split into data diffusion local, cache-to-cache, and GPFS traffic, compared with GPFS throughput for FIT and GZ; right, time (ms) per stack per CPU vs. locality (1 to 30, plus Ideal) for Data Diffusion and GPFS in GZ and FIT formats.]
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
-
• Data access patterns: write once, read many
• Task definition must include input/output file metadata (a hypothetical descriptor is sketched below)
• Per-task working set must fit in local storage
• Needs IP connectivity between hosts
• Needs local storage (disk, memory, etc.)
• Needs Java 1.4+
36
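For illustration only (an assumed structure, not Falkon's actual task format), a task descriptor carrying the required input/output file metadata might look like this:

import java.util.List;

// Hypothetical task descriptor: each task declares the files it reads and writes
// so the data-aware scheduler can reason about locality and cache placement.
class TaskDescriptor {
    String executable;          // black-box application to run
    List<String> arguments;     // command-line arguments
    List<String> inputFiles;    // logical names of files read (write-once data)
    List<String> outputFiles;   // logical names of files produced
    long workingSetBytes;       // must fit within a node's local storage

    TaskDescriptor(String executable, List<String> arguments,
                   List<String> inputFiles, List<String> outputFiles,
                   long workingSetBytes) {
        this.executable = executable;
        this.arguments = arguments;
        this.inputFiles = inputFiles;
        this.outputFiles = outputFiles;
        this.workingSetBytes = workingSetBytes;
    }
}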
-
• [Ghemawat03, Dean04]: MapReduce + GFS
• [Bialecki05]: Hadoop + HDFS
• [Gu06]: Sphere + Sector
• [Tatebe04]: Gfarm
• [Chervenak04]: RLS, DRS
• [Kosar06]: Stork
• Conclusions
  – None focus on co-locating storage and generic black-box computation with data-aware scheduling while operating in a dynamic, elastic environment
  – Swift + Falkon + Data Diffusion is arguably a more generic and powerful solution than MapReduce
37
-
38
• Identified that data locality is crucial to the efficient use of large-scale distributed systems for data-intensive applications
• Data Diffusion
  – Integrated streamlined task dispatching with data-aware scheduling policies
  – Heuristics to maximize real-world performance
  – Suitable for varying, data-intensive workloads
  – Proof of O(NM)-competitive caching
-
39
• Falkon is a real system
  – Late 2005: initial prototype, AstroPortal
  – January 2007: Falkon v0
  – November 2007: Globus Incubator project v0.1
    • http://dev.globus.org/wiki/Incubator/Falkon
  – February 2009: Globus Incubator project v0.9
• Implemented in Java (~20K lines of code) and C (~1K lines of code)
  – Open source: svn co https://svn.globus.org/repos/falkon
• Source code contributors (besides myself)
  – Yong Zhao, Zhao Zhang, Ben Clifford, Mihael Hategan
[Globus07] “Falkon: A Proposal for Project Globus Incubation”
-
40
• Workload
  – 160K CPUs
  – 1M tasks
  – 60 sec per task
• 2 CPU-years of work completed in 453 sec
• Throughput: 2312 tasks/sec
• 85% efficiency
[TPDS09] “Middleware Support for Many-Task Computing”, under preparation
-
41
[TPDS09] “Middleware Support for Many-Task Computing”, under preparation
-
42
ACM MTAGS09 Workshop @ SC09
Due Date: August 1st, 2009
-
43
IEEE TPDS Journal, Special Issue on MTC
Due Date: December 1st, 2009
-
44
• More information:
  – Other publications: http://people.cs.uchicago.edu/~iraicu/
  – Falkon: http://dev.globus.org/wiki/Incubator/Falkon
  – Swift: http://www.ci.uchicago.edu/swift/index.php
• Funding:
  – NASA: Ames Research Center, GSRP
  – DOE: Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy
  – NSF: TeraGrid
• Relevant activities:
  – ACM MTAGS09 Workshop at Supercomputing 2009
    • http://dsl.cs.uchicago.edu/MTAGS09/
  – Special Issue on MTC in IEEE TPDS Journal
    • http://dsl.cs.uchicago.edu/TPDS_MTC/