ImpalaToGo introduction
ImpalaToGo architecture
By David Gruzman, BigDataCraft.com
About Cloudera Impala
Cloudera Impala is a high-performance MPP
(massively parallel processing) engine.
It is written in C++ and uses LLVM in the
performance-critical parts.
It is open source.
“Cloudera Impala” is a trademark of Cloudera.
A word of thanks
We want to say a big thank you to Cloudera, who
developed this beautiful codebase.
It is one of the best solutions for analyzing data
inside a Hadoop cluster.
Why ImpalaToGo
The main incentive: we want to free Impala
from the storage layer.
- Hadoop hardware usually has a lot of HDDs
but only modest CPU and RAM.
- Impala likes machines with high IO, a lot of
RAM and a lot of CPU.
- We want to make it possible to run Impala on
the best possible hardware.
Some hardware examples
Typical Hadoop server: 32-48 GB RAM, 4 HDDs,
dual six-core CPUs.
Machine optimal for Impala: 128-256 GB RAM,
4 SSDs or 12 HDDs, dual 12-core CPUs.
Architecture - diagram
Hadoop cluster or S3
Caching layer on local SSD drives
ImpalaToGo cluster
Architecture - in words
Cloudera Impala currently works with HDFS
only.
ImpalaToGo has its own caching layer on local
drives and works with any DFS, such as S3.
Caching algorithm - LRU
We write to local drives as long as there is
space.
When space is about to run out, we delete the
files that have gone unused for the longest time.
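The eviction rule above can be sketched in a few lines of Python. This is only an illustration of size-bounded LRU bookkeeping, not ImpalaToGo's actual cache code; the class name `LruFileCache`, its methods and the byte sizes are all hypothetical.

```python
from collections import OrderedDict


class LruFileCache:
    """Tracks cached files by size and evicts the least recently used ones.

    Hypothetical sketch: it only does the bookkeeping and returns the paths
    that a real cache would then delete from the local drive.
    """

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._files = OrderedDict()  # path -> size, least recently used first

    def touch(self, path):
        """Record a read of a cached file. Returns False on a cache miss."""
        if path not in self._files:
            return False
        self._files.move_to_end(path)  # mark as most recently used
        return True

    def add(self, path, size):
        """Register a newly written file, evicting old files to make room."""
        if size > self.capacity_bytes:
            raise ValueError("file is larger than the whole cache")
        if path in self._files:
            self.used_bytes -= self._files.pop(path)
        evicted = []
        while self.used_bytes + size > self.capacity_bytes:
            old_path, old_size = self._files.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_path)
        self._files[path] = size
        self.used_bytes += size
        return evicted


cache = LruFileCache(capacity_bytes=100)
cache.add("a.parquet", 40)
cache.add("b.parquet", 40)
cache.touch("a.parquet")              # "a" becomes the most recently used
evicted = cache.add("c.parquet", 40)  # no room left, so "b" is evicted
print(evicted)
```

Here the file that was read most recently survives, and the one that has gone unused the longest is removed first.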
When it is applicable
Assume you have data on hardware that is not ready
for Impala because:
- It is not yours (S3).
- There is not enough RAM (old Hadoop machines).
- RAM is occupied (MapReduce already uses it).
- Your Hadoop version does not support Impala.
- You do not want to risk running anything alongside your
critical processes.
ImpalaToGo solution
You get hold of a bunch of good machines and
run ImpalaToGo on them.
ImpalaToGo will access data from S3 or from
remote HDFS.
At the same time, it keeps hot data on local
drives, thus improving performance and reducing
load on the storage cluster as much as possible.
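The access pattern described here is a read-through cache: serve a file from the local drive if it is there, otherwise fetch it from remote storage and keep a copy. The Python sketch below illustrates the idea only. `ReadThroughCache` and `fetch_remote` are hypothetical names, the remote store is faked with a dictionary, and eviction is left out.

```python
import hashlib
import os
import tempfile


class ReadThroughCache:
    """Serves files from a local directory, fetching from remote on a miss."""

    def __init__(self, cache_dir, fetch_remote):
        self.cache_dir = cache_dir
        self.fetch_remote = fetch_remote  # callable: remote path -> bytes
        self.remote_reads = 0             # how often remote storage was hit

    def _local_path(self, remote_path):
        # Hash the remote path to get a flat, collision-resistant file name.
        name = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name)

    def read(self, remote_path):
        local = self._local_path(remote_path)
        if os.path.exists(local):         # hot data: no remote access needed
            with open(local, "rb") as f:
                return f.read()
        data = self.fetch_remote(remote_path)
        self.remote_reads += 1
        with open(local, "wb") as f:      # keep a copy for later queries
            f.write(data)
        return data


# A dictionary stands in for S3 or a remote HDFS cluster.
remote_store = {"s3://bucket/table/part-0": b"rows"}

with tempfile.TemporaryDirectory() as cache_dir:
    cache = ReadThroughCache(cache_dir, remote_store.__getitem__)
    first = cache.read("s3://bucket/table/part-0")   # fetched remotely
    second = cache.read("s3://bucket/table/part-0")  # served from local disk
    remote_reads = cache.remote_reads

print(remote_reads)
```

Only the first read touches the remote store, which is how repeated queries over the same data reduce load on the storage cluster.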
What hardware to use?
We suggest using the same hardware as
recommended for Cloudera Impala, with one
change.
Instead of using a lot of HDDs to get both
space and bandwidth, you can install a few SSDs.
You will get even better bandwidth, and
ImpalaToGo will keep hot data there.
How much does it differ from Cloudera
Impala
We didn’t change much.
We replaced HDFS access with our caching
layer.
We replaced the data locality information obtained
from the NameNode with consistent hashing.
So 99% of the code is original Cloudera Impala
code.
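To illustrate how consistent hashing can stand in for NameNode locality, the Python sketch below maps each file path to a worker on a hash ring. Every worker then knows which node should hold a file in its local cache without asking a central service. This is a generic sketch, not ImpalaToGo's implementation; `HashRing`, the worker names and the virtual-node count are assumptions.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


files = [f"/warehouse/table/part-{i}" for i in range(1000)]
before = HashRing(["worker1", "worker2", "worker3"])
after = HashRing(["worker1", "worker2", "worker3", "worker4"])

# Files whose owner changes when a fourth worker joins the cluster.
moved = [f for f in files if before.node_for(f) != after.node_for(f)]
print(len(moved), "of", len(files), "files change owner")
```

When a worker joins, the only files that change owner are those that move to the new worker, so the cached data on the existing workers stays valid.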