

Big Data Analytics on Object Stores: A Performance Study

Lukas Rupprecht*,1, Rui Zhang, Dean Hildebrand
*Imperial College London, IBM Research - Almaden

1 Work done during internship at IBM Research - Almaden

How Does The Swift Overhead Scale?

Scaling Datasize
Mapper overhead stays constant
Reducer overhead increases linearly
Larger output files cause longer copies during rename

Scaling Mappers
Mapper overhead increases linearly
Reducer overhead stays constant
More mappers cause more authentication requests

[Figure: job time difference and map time difference (s) vs. input data (GB)]

[Figure: job time difference and map time difference (s) vs. number of mappers (4 to 256)]

How To Run Them Together Efficiently?

What are common problems when running analytics directly on Object Stores and how can they be overcome?

To answer these questions we compare the performance of running Apache Hadoop MapReduce on top of OpenStack Swift (Object Store) and Apache Hadoop HDFS using a recently developed connector [1].

Can we leverage certain capabilities of Object Stores?
Fast file lookups
Fast metadata access
Rich metadata

File Access Times: Swift vs. HDFS

Heavy outliers for Swift
Initial call requires authentication
Overhead of ~1s (as no authentication for HDFS)

Without authentication
Swift is 2-9x slower
Reason: HTTP overhead (FS talks to HTTP server, request processing, ...)
No benefit of constant-time lookups

Authentication and additional HTTP communication in Swift cause calls to be slower.
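The authentication pattern can be illustrated with a toy model. This is a minimal sketch, not the connector's code: `ToyAuthService`, `ToyStoreClient`, and `run_mappers` are hypothetical names, and the model assumes that each mapper uses its own client, which authenticates on its first call and then reuses the token. Under that assumption, the number of authentication requests grows with the number of mappers, not with the number of calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyAuthService:
    """Stand-in for the identity service; it only counts token requests."""
    token_requests: int = 0

    def issue_token(self) -> str:
        self.token_requests += 1
        return f"token-{self.token_requests}"


@dataclass
class ToyStoreClient:
    """Hypothetical per-task client: authenticates lazily, then reuses the token."""
    auth: ToyAuthService
    token: Optional[str] = None
    calls: int = 0

    def call(self, op: str, path: str) -> None:
        if self.token is None:
            # Only the first call of a client pays for authentication.
            self.token = self.auth.issue_token()
        self.calls += 1


def run_mappers(num_mappers: int, calls_per_mapper: int) -> int:
    """Assumption: every mapper owns a separate client (separate task process)."""
    auth = ToyAuthService()
    for m in range(num_mappers):
        client = ToyStoreClient(auth)
        for i in range(calls_per_mapper):
            client.call("open", f"/input/part-{m}-{i}")
    return auth.token_requests
```

In this model, `run_mappers(4, 100)` and `run_mappers(4, 1)` both trigger four token requests, while doubling the mapper count doubles them.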

[Figure: completion time (ms) of create, glob, ls-dir, ls, and open calls, HDFS vs. Swift; one panel on a 0-900 ms scale and one on a 0-50 ms scale]

MapReduce: Swift vs. HDFS

How much time do mappers and reducers spend in storage system calls?

Setup: Small (3 nodes) private cluster running MapReduce, HDFS, and Swift colocated
Workload: Batch-oriented wordcount job on 12GB dataset (HiBench generated)

Mappers
Swift overhead of ~2s, i.e. ~10x slower
Overhead due to HTTP + authentication
Stable behavior

Reducers
Swift overhead of up to 33s, i.e. ~150x slower
Overhead due to rename
Rename causes (remote) data copy
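The rename cost can be made concrete with a small in-memory model. This is a sketch under stated assumptions: `ToyObjectStore` and `ToyNamespaceFS` are hypothetical classes, not Swift or HDFS code. The object store model treats rename as copy plus delete, since an object's name determines where it is stored. The file system model treats rename as a metadata update. Hadoop output committers typically write task output to a temporary path and rename it on commit, which is where reducers would hit this path.

```python
class ToyObjectStore:
    """Flat name -> bytes map with no in-place rename (hypothetical model)."""

    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0

    def put(self, name, data):
        self.objects[name] = data

    def rename(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = bytes(data)   # new object under the new name
        self.bytes_copied += len(data)    # cost grows with object size
        del self.objects[src]             # then the old object is deleted


class ToyNamespaceFS:
    """File-system-like model: rename only relinks a name (hypothetical model)."""

    def __init__(self):
        self.names = {}
        self.bytes_copied = 0

    def put(self, path, data):
        self.names[path] = data

    def rename(self, src, dst):
        self.names[dst] = self.names.pop(src)  # metadata only, no bytes move


def commit_output(store, size_bytes):
    """Write reducer output to a temporary name, then rename it into place."""
    store.put("_temporary/part-r-00000", b"x" * size_bytes)
    store.rename("_temporary/part-r-00000", "output/part-r-00000")
    return store.bytes_copied
```

In this model the bytes copied by the object store grow linearly with the output size, which matches the linear reducer overhead reported for larger datasets, while the file system model copies nothing.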

[Figure: time (s) spent in storage system calls by mappers 1-8, Swift vs. HDFS]

[Figure: time (s) spent in storage system calls by reducers 1-8, Swift vs. HDFS]

Why Analytics on Object Stores?

Analytics engines are typically running on top of distributed file systems such as HDFS. If Object Stores are used for archiving, running analytics on archived data (active archive) requires an extra load/store step and an additional analytics cluster. By directly running on the Object Store, the extra step is avoided.

Why Object Stores?

[Figure: geo-distributed storage system with regions and zones, accessed via GET and POST requests]

Object Stores (e.g. OpenStack Swift) are highly scalable storage systems for cheaply storing billions of objects.
They provide a RESTful API and fine-grained security via role-based authentication and HTTPS.
Cheap and convenient global sharing/distribution of scientific data.
Scalable long-term archive store.

OpenStack Swift + Hadoop MapReduce

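[Figure: mappers (M) and reducers (R) access Swift through the Swift FS Connector. "read o" becomes "HTTP GET o" and "write o" becomes "HTTP POST o" at the Swift Proxy, which locates object o via hash(o) on a consistent hashing ring of storage nodes 0-5]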

Connector is Hadoop FileSystem implementation
File System calls are translated to HTTP requests to the Swift Proxy
Swift Proxy stores/retrieves objects via name-based consistent hashing
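Name-based placement can be sketched with a textbook consistent hashing ring. This is an illustration of the general technique, not Swift's actual ring implementation: `ToyRing`, the node names, and the number of virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


class ToyRing:
    """Textbook consistent hashing ring with virtual points per node."""

    def __init__(self, nodes, vnodes=64):
        points = []
        for node in nodes:
            for v in range(vnodes):
                # Each node owns several points to even out the load.
                points.append((self._hash(f"{node}#{v}"), node))
        points.sort()
        self._keys = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(name):
        digest = hashlib.md5(name.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, object_name):
        """Return the node owning the first ring point after hash(object_name)."""
        i = bisect.bisect_right(self._keys, self._hash(object_name))
        return self._nodes[i % len(self._keys)]  # wrap around the ring
```

Because the location is derived from the name alone, no lookup table is needed to find an object. The same property explains the rename problem: a new name hashes to a different ring position, so the data generally has to be written again under that name.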

Conclusion and Future Work

We presented a first performance study of MapReduce running on top of Swift and identified problems for such a setup. A major problem is that objects cannot be renamed without a data copy. This can have significant impact on job completion time for large input data and latency-sensitive jobs.

In the future, we plan to increase the scale of our setup and look at different workloads, replication, and the implications of object size to data locality.

[1] http://docs.openstack.org/developer/sahara/userdoc/hadoop-swift.html