Big Data Analytics on Object Stores: A Performance Study
Lukas Rupprecht*, Rui Zhang, Dean Hildebrand
Imperial College London, IBM Research - Almaden
* Work done during internship at IBM Research - Almaden
How Does The Swift Overhead Scale?
Scaling data size: Mapper overhead stays constant; reducer overhead increases linearly. Larger output files cause longer copies during rename.
Scaling mappers: Mapper overhead increases linearly; reducer overhead stays constant. More mappers cause more authentication requests.
[Figure: job time difference and map time difference (Swift minus HDFS, seconds) vs. input data size (GB).]
[Figure: job time difference and map time difference (Swift minus HDFS, seconds) vs. number of mappers (4-256).]
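The scaling results above note that more mappers cause more authentication requests. One common mitigation (an illustration, not a technique evaluated in this study) is to cache the auth token so repeated storage calls reuse it instead of re-authenticating. A minimal sketch, with a fake authentication backend standing in for the real Keystone/Swift auth service:

```python
import time

class TokenCache:
    """Cache a Swift auth token so that repeated file system calls
    (e.g. one per mapper) reuse it instead of re-authenticating."""

    def __init__(self, authenticate, ttl_seconds=3600):
        self._authenticate = authenticate  # function returning a token string
        self._ttl = ttl_seconds
        self._token = None
        self._expires_at = 0.0

    def get_token(self):
        # Only hit the auth service when no valid token is cached.
        if self._token is None or time.time() >= self._expires_at:
            self._token = self._authenticate()
            self._expires_at = time.time() + self._ttl
        return self._token

# Simulate the per-mapper cost: with a shared cache, 8 "mappers"
# trigger a single authentication round trip instead of 8.
auth_calls = 0
def fake_authenticate():
    global auth_calls
    auth_calls += 1
    return "AUTH_tk%d" % auth_calls

cache = TokenCache(fake_authenticate)
tokens = [cache.get_token() for _ in range(8)]
print(auth_calls)  # 1 authentication instead of 8
```

Whether the Hadoop-Swift connector of [1] shares a token across tasks depends on its configuration; the sketch only shows why the per-mapper overhead would otherwise grow linearly.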
How To Run Them Together Efficiently?
What are common problems when running analytics directly on Object Stores and how can they be overcome?
To answer these questions, we compare the performance of running Apache Hadoop MapReduce on top of OpenStack Swift (Object Store) and on Apache Hadoop HDFS, using a recently developed connector [1].
Can we leverage certain capabilities of Object Stores?
- Fast file lookups
- Fast metadata access
- Rich metadata
File Access Times: Swift vs. HDFS
Heavy outliers for Swift: the initial call requires authentication, adding an overhead of ~1s (HDFS requires no authentication).
Even without authentication, Swift is 2-9x slower. Reason: HTTP overhead (the file system talks to an HTTP server, request processing, ...); there is no benefit from constant-time lookups.
Authentication and additional HTTP communication in Swift cause calls to be slower.
[Figure: completion time (ms) of file system calls (create, glob, ls-dir, ls, open) for HDFS vs. Swift; y-axis up to 900ms.]
[Figure: completion time (ms) of the same calls for HDFS vs. Swift; y-axis up to 50ms.]
MapReduce: Swift vs. HDFS
How much time do mappers and reducers spend in storage system calls?
Setup: Small (3-node) private cluster running MapReduce, HDFS, and Swift colocated. Workload: batch-oriented wordcount job on a 12GB dataset (HiBench generated).
Mappers: Swift overhead of ~2s, i.e. ~10x slower. Overhead due to HTTP + authentication. Stable behavior.
Reducers: Swift overhead of up to 33s, i.e. ~150x slower. Overhead due to rename: a rename causes a (remote) data copy.
[Figure: per-mapper time (s) in storage calls for Swift vs. HDFS (maps 1-8); y-axis up to 3.5s.]
[Figure: per-reducer time (s) in storage calls for Swift vs. HDFS (reducers 1-8); y-axis up to 35s.]
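The reducer overhead stems from rename: Swift has no native rename operation, so a connector has to implement it as a server-side copy of the object (a PUT with an X-Copy-From header in the Swift API) followed by a delete of the original, and the copy still moves the object's data. A minimal sketch that only builds the two HTTP requests (the account URL, container, and object names are hypothetical; nothing is sent):

```python
def build_rename_requests(account_url, container, src_name, dst_name, token):
    """Build the HTTP requests needed to 'rename' a Swift object:
    a server-side copy (PUT with X-Copy-From) to the new name,
    then a DELETE of the original. The copy moves the object's
    data, which is what makes renames of large outputs slow."""
    copy_req = {
        "method": "PUT",
        "url": "%s/%s/%s" % (account_url, container, dst_name),
        "headers": {
            "X-Auth-Token": token,
            "X-Copy-From": "/%s/%s" % (container, src_name),
            "Content-Length": "0",
        },
    }
    delete_req = {
        "method": "DELETE",
        "url": "%s/%s/%s" % (account_url, container, src_name),
        "headers": {"X-Auth-Token": token},
    }
    return [copy_req, delete_req]

# A reducer committing its temporary output to the final name:
reqs = build_rename_requests(
    "https://swift.example.com/v1/AUTH_test", "out",
    "_temporary/part-00000", "part-00000", "AUTH_tk1")
for r in reqs:
    print(r["method"], r["url"])
```

On HDFS the same commit is a metadata-only operation on the NameNode, which is why the gap grows with output size.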
Why Analytics on Object Stores?
Analytics engines typically run on top of distributed file systems such as HDFS. If Object Stores are used for archiving, running analytics on archived data (active archive) requires an extra load/store step and an additional analytics cluster. By running directly on the Object Store, the extra step is avoided.
Why Object Stores?
[Figure: geo-distributed storage system organized into regions and zones, accessed via HTTP GET/POST requests.]
Object Stores (e.g. OpenStack Swift) are highly scalable storage systems that cheaply store billions of objects. They provide a RESTful API and fine-grained security via role-based authentication and HTTPS. They enable cheap, convenient global sharing/distribution of scientific data and serve as scalable long-term archive stores.
OpenStack Swift + Hadoop MapReduce
[Figure: mappers (M) and reducers (R) issue read/write calls for an object o through the Swift FS Connector, which translates them to HTTP GET/POST requests to the Swift Proxy; the proxy locates o via hash(o) on a consistent hashing ring.]
The connector is a Hadoop FileSystem implementation. File system calls are translated to HTTP requests to the Swift Proxy. The Swift Proxy stores/retrieves objects via name-based consistent hashing.
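The name-based placement can be sketched as a consistent hashing ring: hash the object name onto a ring of positions and walk clockwise to the first node. This is a simplified illustration only; Swift's real ring additionally uses partitions, zones, and replicas.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing ring: each node owns many virtual
    positions; an object is placed on the first node at or after
    hash(name), wrapping around the ring."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash("%s-%d" % (node, i)), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, object_name):
        # First ring position at or after hash(o), wrapping around.
        idx = bisect.bisect(self._positions, self._hash(object_name))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["storage1", "storage2", "storage3"])
node = ring.get_node("container/wordcount/part-00000")

# Adding a node remaps only a fraction of objects -- the property
# that makes the scheme attractive for large object stores:
ring2 = ConsistentHashRing(["storage1", "storage2", "storage3", "storage4"])
moved = sum(
    ring.get_node("obj-%d" % i) != ring2.get_node("obj-%d" % i)
    for i in range(1000))
```

The node names here are placeholders; the point is that lookup is a pure hash computation, with no central directory.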
Conclusion and Future Work
We presented a first performance study of MapReduce running on top of Swift and identified problems with such a setup. A major problem is that objects cannot be renamed without a data copy. This can have a significant impact on job completion time for large input data and for latency-sensitive jobs.
In the future, we plan to increase the scale of our setup and look at different workloads, replication, and the implications of object size for data locality.
[1] http://docs.openstack.org/developer/sahara/userdoc/hadoop-swift.html