ImpalaToGo and Tachyon: integration and performance

By ImpalaToGo team

http://impala2go.info



What is ImpalaToGo

ImpalaToGo is a fork of Cloudera Impala, separated from Hadoop.

It is optimized to work with S3 storage by caching data locally.


What is Tachyon

Tachyon is a memory-centric distributed storage system.

http://tachyon-project.org/

In our case, the important capability is that Tachyon can access and cache data from the underlying file system.


Tachyon HDFS similarity

Tachyon’s architecture resembles that of HDFS: it runs a worker on each node, and a master manages metadata centrally.

Tachyon HDFS differences

● Tachyon has a hierarchy of storage tiers, which includes RAM. This enables efficient use of RAM, SSDs and HDDs to store data.

● Tachyon has a notion of an underlying DFS where data can be persisted, or checkpointed.

● Tachyon has a notion of lineage. It enables asynchronous saving of data without risk of data loss: data owned by an inaccessible worker is recoverable.
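The idea of a storage hierarchy can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Tachyon's actual implementation or API: the class name TieredStore, the tier names, the per-tier capacities counted in blocks, and the oldest-first demotion policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class TieredStore:
    """Toy storage hierarchy; illustrative only, not Tachyon's implementation."""

    def __init__(self, capacities):
        # capacities maps tier name -> max number of blocks, fastest tier first.
        self.order = list(capacities)
        self.capacities = dict(capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def put(self, block_id, data):
        """Write a block to the fastest tier, dropping any older copy first."""
        for tier in self.tiers.values():
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def _insert(self, level, block_id, data):
        if level == len(self.order):
            return  # past the slowest tier: this toy model drops the block
        name = self.order[level]
        tier = self.tiers[name]
        tier[block_id] = data
        if len(tier) > self.capacities[name]:
            # Demote the oldest block of this tier to the next, slower tier.
            old_id, old_data = tier.popitem(last=False)
            self._insert(level + 1, old_id, old_data)

    def locate(self, block_id):
        """Return the name of the tier that holds the block, or None."""
        for name in self.order:
            if block_id in self.tiers[name]:
                return name
        return None


# Hypothetical tiers holding two blocks each.
store = TieredStore({"RAM": 2, "SSD": 2, "HDD": 2})
for i in range(5):
    store.put(f"block-{i}", b"payload")

for i in range(5):
    print(f"block-{i}", "->", store.locate(f"block-{i}"))
```

After five writes into tiers of two blocks each, the two newest blocks are still in RAM, while the oldest block has been demoted all the way to HDD.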

Why together

For efficient work with S3 or other remote storage, ImpalaToGo requires a caching layer.

Tachyon is a full-fledged caching layer.

Its storage hierarchy, as well as its integration with other Hadoop-family products, makes it a good choice.
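The role of a caching layer in front of remote storage can be sketched as a read-through cache: the first read of a file goes to the remote store, and later reads are served from the local copy. The sketch below is a minimal illustration, not ImpalaToGo or Tachyon code; the class ReadThroughCache, the fake_remote dictionary standing in for S3, and the object path are hypothetical.

```python
class ReadThroughCache:
    """Minimal read-through cache; a sketch, not ImpalaToGo or Tachyon code."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote  # callable: path -> bytes
        self.local = {}                   # locally cached copies
        self.remote_reads = 0             # how many reads hit the remote store

    def read(self, path):
        if path not in self.local:
            # Cache miss: fetch from the remote store and keep a local copy.
            self.local[path] = self.fetch_remote(path)
            self.remote_reads += 1
        return self.local[path]


# Hypothetical stand-in for S3: a dict keyed by object path.
fake_remote = {"web_sales/part-0.csv": b"1,2,3\n4,5,6\n"}

cache = ReadThroughCache(fake_remote.__getitem__)
first = cache.read("web_sales/part-0.csv")   # goes to the remote store
second = cache.read("web_sales/part-0.csv")  # served from the local copy
print("remote reads:", cache.remote_reads)
```

Both reads return the same bytes, but only the first one touches the remote store.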

ImpalaToGo native caching

ImpalaToGo has its own caching layer.

The main difference from Tachyon is that it has only a single storage tier: local disk.

It is written in C++ and optimized for concurrent access and for caching a large number of files simultaneously (as requested by parallel clients).

Integration

ImpalaToGo is capable of working with any DFS in

Hadoop’s sense.

ImpalaToGo can be deployed on the same host as Tachyon, and the two can be configured to work together.

Tachyon, in turn, should be configured to access and cache data from S3 or another remote DFS.

ImpalaToGo over Tachyon configuration how-to:

https://github.com/ImpalaToGo/ImpalaToGo/wiki/To-run-ImpalaToGo-over-Tachyon

Deployment

[Diagram showing S3 or other remote DFS and four cluster nodes, with labeled components: Tachyon master, Tachyon worker, ImpalaToGo worker.]

Performance - setup

We measured access to the web_sales table from the TPC-DS data set, stored in S3 in CSV format.

We intentionally selected the simplest query, “select count(*) from web_sales”, so as to measure mostly storage access rather than processing or parsing capabilities.

We set up two m3.large servers and generated data at scale factor 10.

Tachyon configuration

We configured Tachyon to have enough RAM to cache all the data.

In our case this means 1 GB per server for the RAM storage tier.
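The sizing reasoning can be checked with simple arithmetic. The sketch below assumes the data is spread across the servers and ignores any per-block overhead; the helper name fits_in_ram_tier is hypothetical, and the 1.5 GB data size is the approximate figure reported in the analysis of the results.

```python
def fits_in_ram_tier(data_gb, servers, ram_tier_gb_per_server):
    """Return True if the combined RAM tier can hold the whole data set."""
    return servers * ram_tier_gb_per_server >= data_gb


# Two servers with 1 GB of RAM tier each, about 1.5 GB of data.
print(fits_in_ram_tier(data_gb=1.5, servers=2, ram_tier_gb_per_server=1.0))
```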

Results

We compared the first and second runs. The first run actually reads from S3, while the second run reads data cached by Tachyon.

Query: select count(*) from web_sales

+----------+

| count(*) |

+----------+

| 7197566  |

+----------+

Fetched 1 row(s) in 26.26s

Query: select count(*) from web_sales

+----------+

| count(*) |

+----------+

| 7197566  |

+----------+

Fetched 1 row(s) in 3.28s

Results - analysis

Data size is about 1.5 GB.

Reading from S3 ran at about 28 MB/sec per server, which is a typical speed for S3 reads. We can conclude that in this case the system is I/O bound.

Reading from Tachyon ran at about 230 MB/sec per server, or about 115 MB/sec per core. This is more or less the maximum speed for ImpalaToGo. We can conclude that in this case the system is CPU bound.
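These rates can be reproduced from the reported data size and query times. The sketch below makes two assumptions that are not stated on the slides: "about 1.5 GB" is taken as 1500 MB, and each m3.large server is counted as having 2 cores (vCPUs). The function name throughput_mb_per_sec is hypothetical.

```python
def throughput_mb_per_sec(data_mb, seconds, servers, cores_per_server):
    """Return (total, per-server, per-core) throughput in MB/sec."""
    total = data_mb / seconds
    per_server = total / servers
    per_core = per_server / cores_per_server
    return total, per_server, per_core


DATA_MB = 1500         # assumption: "about 1.5 GB" taken as 1500 MB
SERVERS = 2            # two m3.large servers, as in the setup
CORES_PER_SERVER = 2   # assumption: 2 vCPUs per m3.large

s3_run = throughput_mb_per_sec(DATA_MB, 26.26, SERVERS, CORES_PER_SERVER)
cached_run = throughput_mb_per_sec(DATA_MB, 3.28, SERVERS, CORES_PER_SERVER)

print("S3 run, MB/sec per server:     %.1f" % s3_run[1])
print("Cached run, MB/sec per server: %.1f" % cached_run[1])
print("Cached run, MB/sec per core:   %.1f" % cached_run[2])
print("Speedup of the cached run:     %.1fx" % (26.26 / 3.28))
```

Under these assumptions the per-server and per-core figures land close to the 28, 230 and 115 MB/sec quoted above, and the cached run is roughly eight times faster than the run that reads from S3.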

Conclusion

Tachyon gives ImpalaToGo users more

flexibility to use memory and / or different

types of local drives to cache remote data.

Tachyon enables ImpalaToGo to reach its maximum speed when data is cached.

Tachyon enables ImpalaToGo to share data

with other Hadoop-family products.
