ImpalaToGo introduction

ImpalaToGo architecture

By David Gruzman, BigDataCraft.com

About Cloudera Impala

Cloudera Impala is a high-performance MPP (massively parallel processing) engine.

It is written in C++ and uses LLVM in performance-critical parts.

It is open source.

“Cloudera Impala” is a trademark of Cloudera.

A word of thanks ...

We want to say a big thank you to Cloudera, who developed this beautiful codebase.

It is one of the best solutions for analyzing data inside a Hadoop cluster.

Why ImpalaToGo

The main incentive: we want to free Impala from the storage layer.

- Hadoop hardware usually has a lot of HDD drives, but only some CPU and RAM.

- Impala likes machines with high I/O throughput, a lot of RAM and a lot of CPU.

- We want to enable running Impala on the best possible hardware.

Some hardware examples

Typical Hadoop server: 32-48 GB RAM, 4 HDD drives, dual six-core CPUs.

Machine optimal for Impala: 128-256 GB RAM, 4 SSDs or 12 HDDs, dual 12-core CPUs.

Architecture - diagram

Hadoop cluster or S3

Caching layer on local SSD drives

ImpalaToGo cluster

Architecture - in words

Cloudera Impala currently works with HDFS only.

ImpalaToGo has its own caching layer on local drives and works with any DFS, such as S3.

Caching algorithm - LRU

We write to local drives as long as there is space.

When space is about to run out, we delete the files that have gone unused for the longest time.
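The eviction rule above can be sketched in a few lines of Python. This is only an illustration of LRU bookkeeping, not ImpalaToGo's actual implementation (which is C++); the class name `LruFileCache` and its methods are hypothetical, and real file deletion is left to the caller.

```python
from collections import OrderedDict


class LruFileCache:
    """Tracks cached files by size and picks least recently used ones to evict."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._files = OrderedDict()  # path -> size, least recently used first

    def touch(self, path):
        """Record a read of a cached file; return False on a cache miss."""
        if path not in self._files:
            return False
        self._files.move_to_end(path)  # now the most recently used entry
        return True

    def add(self, path, size):
        """Register a newly cached file; return the paths evicted to make room."""
        if size > self.capacity_bytes:
            raise ValueError("file is larger than the whole cache")
        if path in self._files:
            # Re-adding a file replaces its old entry.
            self.used_bytes -= self._files.pop(path)
        evicted = []
        while self.used_bytes + size > self.capacity_bytes:
            old_path, old_size = self._files.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_path)
        self._files[path] = size
        self.used_bytes += size
        return evicted
```

The caller would delete every path returned by `add` from the local drive; files that were read recently (via `touch`) are the last ones to go.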

When is it applicable

ImpalaToGo applies when your data sits on hardware that is not ready for Impala because:

- It is not yours (S3).

- There is not enough RAM (old Hadoop machines).

- RAM is occupied (MapReduce already uses it).

- Your Hadoop version does not support Impala.

- You do not want to risk running anything alongside your critical processes.

ImpalaToGo solution

You get hold of a bunch of good machines and run ImpalaToGo on them.

ImpalaToGo will access data from S3 or from a remote HDFS.

At the same time, it keeps hot data on local drives, thus improving performance and reducing load on the storage cluster as much as possible.
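The effect on the storage cluster can be illustrated with a minimal read-through cache sketch. It is an assumption-laden toy, not ImpalaToGo code: `ReadThroughCache` and `fetch_remote` are hypothetical names, and an in-memory dict stands in for files on local drives.

```python
class ReadThroughCache:
    """Serves reads locally when possible and goes to remote storage only on a miss."""

    def __init__(self, fetch_remote):
        # fetch_remote is a hypothetical callable: path -> bytes (e.g. an S3 or HDFS read).
        self._fetch_remote = fetch_remote
        self._local = {}  # stands in for files kept on local drives
        self.remote_reads = 0

    def read(self, path):
        if path not in self._local:
            # Cache miss: one trip to the storage cluster, then keep a local copy.
            self._local[path] = self._fetch_remote(path)
            self.remote_reads += 1
        return self._local[path]
```

Repeated queries over the same hot files then hit the remote storage once per file instead of once per read.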

What hardware to use?

We suggest using the same hardware as recommended for Cloudera Impala, with one change.

Instead of using a lot of HDD drives to get both space and bandwidth, you can install a few SSDs.

You will get even better bandwidth, and ImpalaToGo will keep hot data there.

How much it differs from Cloudera Impala

We didn’t change too much.

We replaced HDFS access with our caching layer.

We replaced the data locality information that was received from the NameNode with consistent hashing.

So 99% of the code is original Cloudera Impala code.
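To show how consistent hashing can stand in for NameNode locality, here is a minimal hash ring sketch. It assumes each file path is assigned to the node that should cache it; the `HashRing` class, the `replicas` parameter and the use of SHA-256 are illustrative choices, not details taken from ImpalaToGo.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Assigns each path to a node; adding a node moves only a fraction of paths."""

    def __init__(self, nodes, replicas=100):
        # Each node gets several virtual points so the load spreads more evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, path):
        # The first ring point clockwise from the path's hash owns the path.
        index = bisect.bisect_right(self._points, _hash(path)) % len(self._ring)
        return self._ring[index][1]
```

Every node can compute `node_for(path)` on its own and get the same answer, so no central service is needed to say where a file's cached copy lives. When a node is added, the only paths that change owner are the ones that move to the new node, so most of the cache stays valid.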

Any questions?

Write to me: [email protected]

Thank you for your attention.

David Gruzman
