ImpalaToGo - design explained David Gruzman

Upload: david-groozman

Post on 27-Jul-2015


Page 1: ImpalaToGo design explained

ImpalaToGo - design explained
David Gruzman

Page 2: ImpalaToGo design explained

What is ImpalaToGo
ImpalaToGo is a fork of Cloudera Impala with a proxy layer between the query engine and a generic FS/DFS (not strictly HDFS). This proxy acts as a cache layer.

Page 3: ImpalaToGo design explained

Why a caching layer?
We want ImpalaToGo to work efficiently with remote storage, especially cloud object storage such as S3.
Why not another storage system, such as GlusterFS? We believe that a caching layer is inherently simpler and more elastic than a storage layer.

Page 4: ImpalaToGo design explained

Storage volume vs speed

We believe that it is optimal to:
- store big data volumes on slow & cheap HDDs, which are part of object storage.
- store the hot data set on local SSD drives.
Impala needs about 100 MB/sec of bandwidth per CPU. SSDs are ideal for this purpose.

Page 5: ImpalaToGo design explained

Optimize local drive space
On AWS, ephemeral disk space is a scarce resource.
A cache can store all data without replication, thus maximizing use of that space.
A storage layer would require redundancy, thus wasting space.

Page 6: ImpalaToGo design explained

Cache layer design
When designing a distributed cache system, we have to answer the following questions:
● How are files distributed among the nodes?
● How are files stored on individual drives?
● How is data locality ensured?

Page 7: ImpalaToGo design explained

File distribution
We use a consistent hash over full file names to map files to cluster nodes.
Benefits:
● No metadata storage required
● Efficient resize
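
The mapping described on this slide can be illustrated with a minimal consistent-hash ring. This is only a sketch under stated assumptions, not the ImpalaToGo implementation (which is a C++ module): the class name HashRing, the use of SHA-256, the virtual-node count, and the node names are all hypothetical choices made for illustration.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a 64-bit position on the ring (hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring mapping full file names to nodes."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes  # virtual points per node, smooths the distribution
        self._ring = []        # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_point(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, file_name: str) -> str:
        """Return the node owning the first ring point at or after the file's hash."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_left(self._ring, (_point(file_name),))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("s3://someBucket/SomeDir/SomeFile")
```

Because the owner is computed from the file name alone, no metadata has to be stored, and adding a node only moves the files whose ring position now falls to the new node, which is what makes a resize cheap.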

Page 8: ImpalaToGo design explained

File storage on nodes
We store files in a single directory, under the same structure as in the DFS.
For instance, a file in
S3://someBucket/SomeDir/SomeFile
will be stored in
/var/cache/impalaToGo/someBucket/SomeDir/SomeFile
Since files are distributed by consistent hash, each file will be stored on exactly one node.
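
The path mapping in the example above can be sketched as a small function. This is an illustration only: the function name local_cache_path and the validation rules are assumptions, and only the cache root and the example URI come from the slide.

```python
from pathlib import PurePosixPath

# Cache root taken from the slide's example.
CACHE_ROOT = PurePosixPath("/var/cache/impalaToGo")


def local_cache_path(remote_uri: str, cache_root: PurePosixPath = CACHE_ROOT) -> PurePosixPath:
    """Mirror a remote URI such as S3://bucket/dir/file under the local cache root.

    Hypothetical helper: drops the scheme and keeps bucket and key as the
    relative directory structure.
    """
    scheme, sep, rest = remote_uri.partition("://")
    if not scheme or not sep or not rest or rest.startswith("/"):
        raise ValueError(f"expected scheme://bucket/key, got {remote_uri!r}")
    relative = PurePosixPath(rest)
    if ".." in relative.parts:
        # Refuse keys that would escape the cache root.
        raise ValueError(f"path traversal in {remote_uri!r}")
    return cache_root / relative


print(local_cache_path("S3://someBucket/SomeDir/SomeFile"))
# prints /var/cache/impalaToGo/someBucket/SomeDir/SomeFile
```

Since the local path is a pure function of the remote name, any process on the node can locate a cached file without consulting an index.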

Page 9: ImpalaToGo design explained

File storage - assessment
Files are easy to find in the local path, since we can predict the cache structure.
DevOps is left to choose how multiple drives are organized together.
We store whole files, not blocks, so a single huge file cannot be processed by several nodes.

Page 10: ImpalaToGo design explained

Cache eviction
Currently, we have implemented a simple LRU algorithm. If nothing can be evicted, the cache is bypassed.
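
A minimal sketch of LRU eviction with a bypass path follows. It is not the ImpalaToGo code; the class LruFileCache, the byte-based capacity, and the idea that a file is non-evictable while "pinned" by a reader are assumptions introduced to show how "nothing can be evicted" might arise.

```python
from collections import OrderedDict


class LruFileCache:
    """Hypothetical size-bounded LRU cache of whole files with a bypass path."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self._entries = OrderedDict()  # name -> [size, pin_count], oldest first

    def __contains__(self, name):
        return name in self._entries

    def pin(self, name):
        """Mark a cached file as in use (assumption: pinned files cannot be evicted)."""
        self._entries[name][1] += 1

    def unpin(self, name):
        self._entries[name][1] -= 1

    def _make_room(self, needed: int) -> bool:
        """Evict least recently used unpinned files until `needed` bytes fit."""
        free = self.capacity - self.used
        victims = []
        for name, (size, pins) in self._entries.items():
            if free >= needed:
                break
            if pins == 0:
                victims.append(name)
                free += size
        if free < needed:
            return False  # nothing (or not enough) can be evicted
        for name in victims:
            self.used -= self._entries.pop(name)[0]
        return True

    def access(self, name: str, size: int) -> str:
        """Return 'hit', 'cached' (fetched and stored) or 'bypass' (read remotely)."""
        if name in self._entries:
            self._entries.move_to_end(name)  # mark as most recently used
            return "hit"
        if not self._make_room(size):
            return "bypass"
        self._entries[name] = [size, 0]
        self.used += size
        return "cached"
```

In this sketch a bypassed read leaves the cache unchanged, so the caller would read directly from remote storage for that request.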

Page 11: ImpalaToGo design explained

Roadmap - Pre-fetch
Configurable pre-fetch capability. We want users to be able to specify rules for which data should be pre-fetched into the cache when it is written to the remote storage.

Page 12: ImpalaToGo design explained

Roadmap - Tachyon
We are working on Tachyon integration as one of the possible caching layers for ImpalaToGo.

Page 13: ImpalaToGo design explained

Implementation
Each node contains our C++ module, which is capable of downloading and serving multiple files concurrently.