impalatogo design explained

ImpalaToGo - design explained
David Gruzman

What is ImpalaToGo
ImpalaToGo is a fork of Cloudera Impala with a proxy layer between Impala and a generic fs/dfs, not strictly HDFS. This proxy acts as a cache layer.

Why a caching layer?
We want ImpalaToGo to work efficiently with remote storage, especially cloud object storage such as S3.
Why not another storage system, such as GlusterFS? We believe that a caching layer is inherently simpler and more elastic than a storage layer.

Storage volume vs speed

We believe that it is optimal to:
- store big data volumes on slow & cheap HDDs, which are part of object storage.
- store the hot data set on local SSD drives.
Impala needs about 100 MB/sec of bandwidth per CPU. SSDs are ideal for this purpose.
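As a rough illustration of the sizing arithmetic implied by the per-CPU figure, the sketch below multiplies it out for a whole node. The function names, the 16-core node, and the 500 MB/sec per-drive throughput are hypothetical placeholders, not measurements from ImpalaToGo; substitute numbers measured on your own hardware.

```python
import math


def required_read_bandwidth_mb_s(cpu_cores: int, per_core_mb_s: float = 100.0) -> float:
    """Aggregate read bandwidth a node needs, given the ~100 MB/sec per CPU figure."""
    return cpu_cores * per_core_mb_s


def drives_needed(required_mb_s: float, per_drive_mb_s: float) -> int:
    """Number of local drives needed to sustain the required bandwidth.

    per_drive_mb_s is an assumption supplied by the caller (measure it on real hardware).
    """
    return math.ceil(required_mb_s / per_drive_mb_s)


# Hypothetical node: 16 cores, drives assumed to sustain 500 MB/sec each.
needed = required_read_bandwidth_mb_s(16)
drives = drives_needed(needed, per_drive_mb_s=500.0)
```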

Optimize local drive space
On AWS, ephemeral disk space is a scarce resource.
A cache can store all data without replication, thus maximizing use of that space.
A storage layer would require redundancy, thus wasting space.

Cache layer design
When designing a distributed cache system, we have to answer the following questions:
● How are files distributed among the nodes?
● How are files stored on individual drives?
● How do we ensure data locality?

File distribution
We use a consistent hash over full file names to map files to cluster nodes.
Benefits:
● No metadata storage required
● Efficient resize
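To make the idea concrete, here is a minimal sketch of a consistent-hash ring keyed on full file names. It is not the ImpalaToGo implementation: the class name, the use of MD5, the virtual-node count, and the node names are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node contributes several points so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, file_name: str) -> str:
        """Return the node owning the first ring point at or after the file's hash."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_left(self._ring, (_hash(file_name), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
owner = ring.node_for("s3://someBucket/SomeDir/SomeFile")
```

Because the owner is computed from the file name alone, no metadata store is needed. When a node is added, only the files whose hashes fall just before the new node's ring points change owner, which is what keeps a resize cheap.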

File storage on nodes
We store files in a single directory, under the same structure as in DFS.
For instance, a file in S3://someBucket/SomeDir/SomeFile
will be stored in /var/cache/impalaToGo/someBucket/SomeDir/SomeFile
Since files are distributed by consistent hash, each file will be stored on exactly one node.
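The mapping from a remote URI to a local cache path can be sketched as follows. The cache root comes from the example above; the function name and the validation rules are assumptions for illustration, not the actual ImpalaToGo code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

CACHE_ROOT = PurePosixPath("/var/cache/impalaToGo")  # root taken from the example above


def local_cache_path(remote_uri: str, cache_root: PurePosixPath = CACHE_ROOT) -> PurePosixPath:
    """Mirror <scheme>://<bucket>/<path> under the cache root (hypothetical helper).

    Note: urlsplit treats '?' and '#' as query/fragment separators, so object
    names containing those characters would need extra handling.
    """
    parts = urlsplit(remote_uri)
    if not parts.netloc:
        raise ValueError(f"URI has no bucket/host component: {remote_uri!r}")
    candidate = cache_root / parts.netloc / parts.path.lstrip("/")
    if ".." in candidate.parts:
        # Refuse paths that could escape the cache directory.
        raise ValueError(f"URI must not contain '..': {remote_uri!r}")
    return candidate


path = local_cache_path("S3://someBucket/SomeDir/SomeFile")
```

Since the local path is a pure function of the remote name, any component can locate a cached file without consulting an index.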

File storage - assessment
Files are easy to find in the local path, since we can predict the cache structure.
DevOps is left to choose how multiple drives are organized together.
We store files, not blocks, so a single huge file cannot be processed by several nodes.

Cache eviction
Currently, we have implemented a simple LRU algorithm. If nothing can be evicted, the cache is bypassed.
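A minimal sketch of LRU eviction with a bypass path is shown below. It assumes that a file cannot be evicted while it is "pinned" (in use by a reader); that pinning rule, the class name, and the byte-based capacity are assumptions for illustration, since the slides do not say what makes an entry non-evictable.

```python
from collections import OrderedDict


class LruFileCache:
    """Toy LRU cache of whole files, sized in bytes (hypothetical, for illustration)."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self._entries = OrderedDict()  # name -> [size, pin_count], least recent first

    def __contains__(self, name: str) -> bool:
        return name in self._entries

    def pin(self, name: str) -> None:
        self._entries[name][1] += 1  # assumption: pinned files cannot be evicted

    def unpin(self, name: str) -> None:
        self._entries[name][1] -= 1

    def access(self, name: str, size: int) -> bool:
        """Return True if the file is (now) cached, False if the cache is bypassed."""
        if name in self._entries:
            self._entries.move_to_end(name)  # mark as most recently used
            return True
        if not self._make_room(size):
            return False  # nothing can be evicted: caller reads from remote storage
        self._entries[name] = [size, 0]
        self.used += size
        return True

    def _make_room(self, size: int) -> bool:
        evictable = [n for n, (_, pins) in self._entries.items() if pins == 0]
        freeable = sum(self._entries[n][0] for n in evictable)
        if self.used - freeable + size > self.capacity:
            return False  # even evicting everything evictable would not be enough
        for name in evictable:  # iterates from least to most recently used
            if self.used + size <= self.capacity:
                break
            self.used -= self._entries.pop(name)[0]
        return True
```

The room check runs before any eviction, so a request that would end up bypassing the cache does not throw away useful entries first.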

Roadmap - Pre-fetch
Configurable pre-fetch capability. We want users to be able to specify rules for which data should be pre-fetched into the cache when it is written to the remote storage.

Roadmap - Tachyon
We are working on Tachyon integration as one of the possible caching layers for ImpalaToGo.

Implementation
Each node contains our C++ module, which is capable of downloading and serving multiple files concurrently.
