impalatogo and tachyon integration
ImpalaToGo and Tachyon
Integration and performance
By ImpalaToGo team
http://impala2go.info
What is ImpalaToGo
ImpalaToGo is a fork of Cloudera Impala, separated from
Hadoop.
It is optimized to work with S3 storage by caching data
locally.
http://impala2go.info/
What is Tachyon
Tachyon is a memory-centric distributed storage system
http://tachyon-project.org/
In our case Tachyon’s capability to access and
cache data from the underlying file system is
important.
Tachyon HDFS similarity
Tachyon’s architecture resembles HDFS: it
has a worker on each node and manages
metadata centrally, through a master.
Tachyon HDFS differences
● Tachyon has a hierarchy of storage tiers, which includes
RAM. This enables efficient use of RAM, SSDs and
HDDs to store data.
● Tachyon has a notion of an underlying DFS where data
can be persisted, or checkpointed.
● Tachyon has a notion of lineage. This enables asynchronous
saving of data without risk of data loss. Data held by
an inaccessible worker is recoverable.
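The first two points above can be illustrated with a toy model. This is only a sketch of the general idea of a tiered read-through cache, not Tachyon's implementation or API: the class name `TieredCache`, the tier names, the LRU eviction policy and the dictionary standing in for the underlying DFS are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredCache:
    """Toy read-through cache with ordered tiers (fastest first).

    Hypothetical model for illustration only; it does not reflect
    Tachyon's real eviction or placement policies.
    """

    def __init__(self, tier_capacities, under_fs):
        # tier_capacities: list of (name, max_bytes), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in tier_capacities]
        # under_fs: dict standing in for S3 or another remote DFS.
        self.under_fs = under_fs

    def read(self, path):
        """Return (data, source), where source names the tier that served it."""
        # Look in each tier, fastest first.
        for name, _cap, blocks in self.tiers:
            if path in blocks:
                blocks.move_to_end(path)  # mark as most recently used
                return blocks[path], name
        # Miss in every tier: fetch from the underlying storage and cache it.
        data = self.under_fs[path]
        self._put(0, path, data)
        return data, "under_fs"

    def _put(self, level, path, data):
        if level >= len(self.tiers):
            return  # dropped from the cache; the underlying storage still has it
        _name, cap, blocks = self.tiers[level]
        if len(data) > cap:
            self._put(level + 1, path, data)  # too big for this tier
            return
        blocks[path] = data
        # Demote least recently used entries to the next tier until we fit.
        while sum(len(v) for v in blocks.values()) > cap:
            old_path, old_data = blocks.popitem(last=False)
            self._put(level + 1, old_path, old_data)


remote = {"a": b"x" * 6, "b": b"y" * 6}
cache = TieredCache([("ram", 10), ("ssd", 10)], remote)
first = cache.read("a")[1]   # served by the underlying storage
second = cache.read("a")[1]  # served by the "ram" tier
cache.read("b")              # pushes "a" down to the "ssd" tier
third = cache.read("a")[1]   # served by the "ssd" tier
```

The first read of a file pays the cost of the remote storage, and later reads are served from whichever local tier currently holds it.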
Why together
To work efficiently with S3 or other remote
storage, ImpalaToGo requires a caching layer.
Tachyon is a full-fledged caching layer.
Its storage hierarchy and its integration with
other Hadoop-family products make it a good
choice.
ImpalaToGo native caching
ImpalaToGo has its own caching layer.
The main difference from Tachyon is that it has only
a single storage tier: local disk.
It is written in C++ and optimized for concurrent
access and for caching a large number of files
simultaneously (as requested by parallel clients).
Integration
ImpalaToGo is capable of working with any DFS in
Hadoop’s sense.
ImpalaToGo can be deployed on the same host node as
Tachyon and configured to work together.
Tachyon, in turn, should be configured to access and
cache data from S3 or another remote DFS.
ImpalaToGo over Tachyon configuration how-to:
https://github.com/ImpalaToGo/ImpalaToGo/wiki/To-run-ImpalaToGo-over-Tachyon
Deployment
[Diagram: S3 or other remote DFS above a set of cluster nodes; the components shown are the Tachyon master, Tachyon workers and ImpalaToGo workers.]
Performance - setup
We measured access to the web_sales table from the
TPC-DS data set, stored in S3 in CSV format.
We intentionally chose the simplest query, “select count(*) from web_sales”,
so that the measurement reflects mostly storage access rather than
processing or parsing capabilities.
We set up two m3.large servers and generated data at
scale factor 10.
Tachyon configuration
We configured Tachyon with enough
RAM to cache all the data.
In our case that means
1 GB per server for the RAM storage tier.
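As a quick sanity check of this sizing, the sketch below compares the combined RAM tier with the data size. It assumes the figures reported elsewhere in this deck (two servers, about 1.5 GB of data) and assumes the cached data spreads evenly across workers; the function name `fits_in_ram_tier` is hypothetical.

```python
def fits_in_ram_tier(data_gb, servers, ram_gb_per_server):
    """Return (fits, total_ram_gb) for a cluster-wide RAM tier.

    Assumes cached data is spread evenly across the workers.
    """
    total_ram_gb = servers * ram_gb_per_server
    return total_ram_gb >= data_gb, total_ram_gb


# Two servers with 1 GB each against about 1.5 GB of data.
fits, total_ram_gb = fits_in_ram_tier(data_gb=1.5, servers=2, ram_gb_per_server=1.0)
print(fits, total_ram_gb)
```

With 2 GB of RAM tier in total and about 1.5 GB of data, the whole table can stay in memory after the first read.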
Results
We compared the first and second runs. The first run actually reads from S3, while the
second run reads data cached by Tachyon.
Query: select count(*) from web_sales
+----------+
| count(*) |
+----------+
| 7197566 |
+----------+
Fetched 1 row(s) in 26.26s
Query: select count(*) from web_sales
+----------+
| count(*) |
+----------+
| 7197566 |
+----------+
Fetched 1 row(s) in 3.28s
Results - analysis
The data size is about 1.5 GB.
Processing from S3 ran at about 28 MB/sec per server. This
is a typical speed for reading from S3. We can conclude that
the system is I/O bound in this case.
Processing from Tachyon ran at about 230 MB/sec per server, or
about 115 MB/sec per core. This is more or less the maximum
speed for ImpalaToGo. We can conclude that the system is
CPU bound.
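The arithmetic behind these figures can be reproduced with a short script. It is a sketch under stated assumptions: 1.5 GB is taken as 1500 MB, the two servers are assumed to share the work evenly, and each m3.large is taken to have 2 vCPUs (the core count is not stated in the slides). The function name `throughput` is chosen here for illustration.

```python
def throughput(data_mb, seconds, servers, cores_per_server):
    """Return throughput in MB/sec: total, per server and per core."""
    total = data_mb / seconds
    per_server = total / servers
    per_core = per_server / cores_per_server
    return {"total": total, "per_server": per_server, "per_core": per_core}


DATA_MB = 1500        # about 1.5 GB, assuming 1 GB = 1000 MB
SERVERS = 2           # two m3.large servers
CORES_PER_SERVER = 2  # assumption: 2 vCPUs per m3.large

s3_run = throughput(DATA_MB, 26.26, SERVERS, CORES_PER_SERVER)
tachyon_run = throughput(DATA_MB, 3.28, SERVERS, CORES_PER_SERVER)
speedup = 26.26 / 3.28

print(f"S3:      {s3_run['per_server']:.1f} MB/sec per server")
print(f"Tachyon: {tachyon_run['per_server']:.1f} MB/sec per server, "
      f"{tachyon_run['per_core']:.1f} MB/sec per core")
print(f"Speedup: {speedup:.1f}x")
```

Under these assumptions the cached run is roughly eight times faster than the run that reads from S3.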
Conclusion
Tachyon gives ImpalaToGo users more
flexibility to use memory and / or different
types of local drives to cache remote data.
Tachyon enables ImpalaToGo to reach its
maximum speed when data is cached.
Tachyon enables ImpalaToGo to share data
with other Hadoop-family products.