ImpalaToGo introduction

ImpalaToGo architecture By David Gruzman BigDataCraft.com

Upload: david-groozman

Post on 15-Jul-2015

TRANSCRIPT

Page 1: ImpalaToGo introduction

ImpalaToGo architecture

By David Gruzman BigDataCraft.com

Page 2: ImpalaToGo introduction

About Cloudera Impala

Cloudera Impala is a high-performance MPP engine.

It is written in C++ and uses LLVM in performance-critical parts.

It is open source.

“Cloudera Impala” is a trademark of Cloudera.

Page 3: ImpalaToGo introduction

A word of thanks

We want to say a big thank-you to Cloudera, who developed this beautiful codebase.

It is one of the best solutions for analyzing data inside a Hadoop cluster.

Page 4: ImpalaToGo introduction

Why ImpalaToGo

The main incentive: we want to free Impala from the storage layer.

- Hadoop hardware usually has a lot of HDD drives and only modest CPU and RAM.

- Impala likes machines with high IO, a lot of RAM and a lot of CPU.

- We want to make it possible to run Impala on the best possible hardware.

Page 5: ImpalaToGo introduction

Some hardware examples

Typical Hadoop server: 32-48 GB RAM, 4 HDD drives, dual six-core CPU.

Machine optimal for Impala: 128-256 GB RAM, 4 SSD or 12 HDD drives, dual 12-core CPU.

Page 6: ImpalaToGo introduction

Architecture - diagram

Hadoop cluster or S3

Caching layer on local SSD drives

ImpalaToGo cluster

Page 7: ImpalaToGo introduction

Architecture - in words

Current Cloudera Impala works with HDFS only.

ImpalaToGo has its own caching layer on local drives and works with any DFS, such as S3.

Page 8: ImpalaToGo introduction

Caching algorithm - LRU

We write to local drives as long as there is space.

When space is about to run out, we delete the files that have gone unused for the longest time.
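
The eviction rule on this slide can be illustrated with a small Python model. This is only a sketch and not the ImpalaToGo implementation: the class name LruFileCache, its methods and the byte-based capacity are hypothetical, and it evicts at the moment a new file would not fit, which is one possible reading of "space is about to run out".

```python
from collections import OrderedDict


class LruFileCache:
    """Toy model of an LRU file cache with a fixed byte capacity (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # Maps path -> size; entries are ordered from least to most recently used.
        self._files = OrderedDict()

    def touch(self, path):
        """Record a read of a cached file. Returns False on a cache miss."""
        if path not in self._files:
            return False
        self._files.move_to_end(path)  # now the most recently used entry
        return True

    def add(self, path, size):
        """Cache a file and return the list of paths evicted to make room."""
        evicted = []
        if size > self.capacity_bytes:
            return evicted  # can never fit; leave the cache untouched
        if path in self._files:
            # Re-adding a file replaces its old entry.
            self.used_bytes -= self._files.pop(path)
        while self.used_bytes + size > self.capacity_bytes:
            # Drop the file that has gone unused for the longest time.
            old_path, old_size = self._files.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_path)
        self._files[path] = size
        self.used_bytes += size
        return evicted


cache = LruFileCache(capacity_bytes=100)
cache.add("a.parquet", 40)
cache.add("b.parquet", 40)
cache.touch("a.parquet")            # "a" was just read, so "b" is now the oldest
evicted = cache.add("c.parquet", 40)
print(evicted)
```

In this run the cache has room for only two of the three files, so the file that was not read recently is the one that gets deleted.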

Page 9: ImpalaToGo introduction

When is it applicable?

Assume you have data on hardware that is not ready for Impala because:

- It is not yours (S3).

- There is not enough RAM (old Hadoop machines).

- RAM is occupied (MapReduce already uses it).

- Your Hadoop version does not support Impala.

- You do not want to risk running anything alongside your critical processes.

Page 10: ImpalaToGo introduction

ImpalaToGo solution

You get hold of a bunch of good machines and run ImpalaToGo on them.

ImpalaToGo will access data from S3 or from a remote HDFS.

At the same time, it keeps hot data on local drives, thus improving performance and reducing the load on the storage cluster as much as possible.

Page 11: ImpalaToGo introduction

What hardware to use?

We suggest using the same hardware as recommended for Cloudera Impala, with one change.

Instead of using a lot of HDD drives to get both space and bandwidth, you can install a few SSDs.

You will get even better bandwidth, and ImpalaToGo will keep hot data there.

Page 12: ImpalaToGo introduction

How much it differs from Cloudera Impala

We didn’t change much.

We replaced HDFS access with our caching layer.

We replaced the data locality information that was received from the NameNode with consistent hashing.

So 99% of the code is original Cloudera Impala code.
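
To show why consistent hashing can stand in for locality information from the NameNode, here is a minimal Python sketch of a hash ring. It is an illustration under assumptions, not ImpalaToGo code: the HashRing class, the node names, the file paths, the MD5-based hash and the number of virtual nodes are all hypothetical choices.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring that assigns each file path to one node."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at several points (virtual nodes)
        # so that files spread more evenly across nodes.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, path):
        """Return the node that owns the first ring point after the path's hash."""
        idx = bisect.bisect_right(self._points, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


paths = [f"/warehouse/t/part-{i:05d}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Files whose owner changed after one node joined the cluster.
moved = [p for p in paths if before.node_for(p) != after.node_for(p)]
print(len(moved), "of", len(paths), "files changed owner")
```

Every node computes the same owner for a given file without asking a central service, and when a node joins, only the files that move to the new node change owner, so the other nodes keep their cached hot data.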

Page 13: ImpalaToGo introduction

Any questions?

Write to me : [email protected]

Thank you for your attention.

David Gruzman