Grégory Letribot - Druid at Criteo - NoSQL matters 2015

Druid Elixir For analytics Grégory Letribot @ar_rabbit

Upload: nosqlmatters

Post on 15-Jul-2015



The right tool for the job

Performance Advertising

unique internet users per month

Ads displayed per day

10 000 servers

1 300 Hadoop nodes

Data

Analytics the old way

CPOP

Loading…

Once upon a time, in an SQL Galaxy..

« Guys, database whatever_DB contains 3B rows

Disks are full, it needs a purge

Server will be reinstalled and will host only the last 30 days »

« Well, talk with product »

SQL limit is reached

Product working, but infrastructure falling apart

We want more!

More dimensions

User centric

Interactive

Realtime

NoSQL?

Precomputing gives fast queries

Not flexible!

Scales exponentially with the number of dimensions
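To make the "scales exponentially" point concrete, here is a minimal sketch that counts how many group-by combinations a precompute-everything approach would have to materialize. The dimension names and cardinalities are hypothetical and only serve as an illustration.

```python
from itertools import combinations


def count_precomputed_views(dimensions):
    """Count the distinct group-by combinations (subsets) of the dimensions."""
    return sum(
        1
        for k in range(len(dimensions) + 1)
        for _ in combinations(dimensions, k)
    )


def count_precomputed_cells(cardinalities):
    """Upper bound on stored cells when every combination is materialized.

    Each dimension contributes its distinct values plus one "all values" slot.
    """
    total = 1
    for cardinality in cardinalities.values():
        total *= cardinality + 1
    return total


# Hypothetical dimensions, not taken from the talk.
three_dims = ["device", "country", "advertiser"]
views_for_three = count_precomputed_views(three_dims)
views_for_ten = count_precomputed_views([f"dim{i}" for i in range(10)])
cells = count_precomputed_cells({"device": 3, "country": 2})
```

Every added dimension doubles the number of views, which is why precomputing stops being practical once users ask for more dimensions.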

Summing up

Food for BI

Interactive, sub-second insights

Arbitrarily drill into data

Scalability, availability…

Metamarkets

Built for analytics

Scalable and Available

Real-time ingestion and Queries

Read-oriented store

Column oriented

In Memory

Fast Filtering

Segment based

Distributed

Real time

REST API

Cold storage

[Architecture diagram: queries, broker nodes, data feed, real-time nodes, hand off, historical nodes, coordinator nodes, ZooKeeper, metadata]

Back to real life

Data workflow

Real-time or Batch ingestion

Lambda architecture!

Drill, baby, drill!

Columnar store

Only reads relevant data

iPhone   Google  Computer  0.1€  08:12:37
Android  Yahoo   Cloth     0.2€  08:12:38

Select sum(cost) where Device = iPhone
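A minimal sketch of why a columnar layout only reads relevant data: the query on the slide touches just the device and cost columns. The column names (`time`, `device`, `source`, `category`, `cost`) are assumptions, since the slide shows only values, and this is plain Python rather than Druid's storage format.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_where(columns, metric, dimension, wanted):
    """Sum `metric` over rows where `dimension == wanted`.

    Only two columns are read; the other columns are never touched.
    """
    dimension_column = columns[dimension]
    metric_column = columns[metric]
    return sum(m for d, m in zip(dimension_column, metric_column) if d == wanted)


# The two rows from the slide, with assumed column names.
rows = [
    {"time": "08:12:37", "device": "iPhone", "source": "Google",
     "category": "Computer", "cost": 0.1},
    {"time": "08:12:38", "device": "Android", "source": "Yahoo",
     "category": "Cloth", "cost": 0.2},
]
columns = to_columns(rows)
iphone_cost = sum_where(columns, "cost", "device", "iPhone")
```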

High compression

Dictionary encoding & LZF

Values: Wacken, Hellfest, Fall of Summer, Wacken, Hellfest

Encoded: 1, 2, 3, 1, 2

Metadata: Wacken => 1, Hellfest => 2, Fall of Summer => 3
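The dictionary encoding step can be sketched as follows. This is an illustrative implementation, not Druid's code; ids start at 1 to match the slide, and the LZF compression applied on top of the encoded column is not shown.

```python
def dictionary_encode(values):
    """Replace each string with a small integer id, assigned in order of first appearance."""
    dictionary = {}
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1  # ids start at 1, as on the slide
        encoded.append(dictionary[value])
    return dictionary, encoded


def dictionary_decode(dictionary, encoded):
    """Map integer ids back to the original strings."""
    reverse = {identifier: value for value, identifier in dictionary.items()}
    return [reverse[identifier] for identifier in encoded]


festivals = ["Wacken", "Hellfest", "Fall of Summer", "Wacken", "Hellfest"]
dictionary, encoded = dictionary_encode(festivals)
```

Repeated strings are stored once in the dictionary, and the column itself becomes a list of small integers that compresses well.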

Inverted index

Values: Wacken, Hellfest, Fall of Summer, Wacken, Hellfest

Wacken 1,0,0,1,0

Hellfest 0,1,0,0,1

Fall of Summer 0,0,1,0,0

Fast binary operations
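A minimal sketch of an inverted bitmap index, using Python integers as bitmaps (bit `i` set means row `i` holds the value). The `days` column is hypothetical and is only there to show an AND across two dimensions; real implementations use compressed bitmaps rather than plain integers.

```python
def build_bitmap_index(values):
    """Map each distinct value to a bitmap of the rows that contain it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap, n_rows):
    """List the row numbers whose bit is set in the bitmap."""
    return [row for row in range(n_rows) if (bitmap >> row) & 1]


festivals = ["Wacken", "Hellfest", "Fall of Summer", "Wacken", "Hellfest"]
days = ["Fri", "Sat", "Fri", "Sat", "Sat"]  # hypothetical second dimension

festival_index = build_bitmap_index(festivals)
day_index = build_bitmap_index(days)

# festival = Wacken OR festival = Hellfest: one binary OR
wacken_or_hellfest = festival_index["Wacken"] | festival_index["Hellfest"]

# festival = Wacken AND day = Sat: one binary AND
wacken_on_saturday = festival_index["Wacken"] & day_index["Sat"]
```

Filters on exact values reduce to a handful of bitwise operations, which is what makes them fast.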

Don’t!

« … WHERE value >= 10 »

Cardinality 1500

1490 binary ORs…
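The cost of a range predicate on a bitmap index can be sketched as below. The data is an assumption chosen to match the slide's numbers: 1500 distinct values, 1 to 1500, one row each. With a bitmap per distinct value, `value >= 10` has to OR together the bitmap of every matching value.

```python
def range_filter(index, predicate):
    """OR the bitmaps of all values satisfying the predicate.

    Returns the resulting bitmap and the number of OR operations performed.
    """
    matching = [bitmap for value, bitmap in index.items() if predicate(value)]
    if not matching:
        return 0, 0
    result = matching[0]
    or_operations = 0
    for bitmap in matching[1:]:
        result |= bitmap
        or_operations += 1
    return result, or_operations


# Assumed data: values 1..1500, value v stored in row v - 1.
index = {value: 1 << (value - 1) for value in range(1, 1501)}

result, or_operations = range_filter(index, lambda value: value >= 10)
matching_rows = bin(result).count("1")
```

The number of ORs grows with the number of distinct values in the range, so range filters on high-cardinality dimensions are best avoided.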

Userflow

Unique user count ain’t only an interview exercise!

Sketching algorithms

HyperLogLog

Approximate unique count

Extreme storage reduction

Constant time computation
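To illustrate how a sketch gives an approximate unique count in fixed memory, here is a textbook-style HyperLogLog in Python. It is a simplified illustration, not Druid's implementation; the precision `p = 12` (4096 registers) and the use of SHA-1 as the hash are assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m   # fixed memory, whatever the input size

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        width = 64 - self.p
        register = h >> width                        # first p bits
        rest = h & ((1 << width) - 1)                # remaining bits
        rank = width - rest.bit_length() + 1         # position of the leftmost 1-bit
        self.registers[register] = max(self.registers[register], rank)

    def merge(self, other):
        """Union of two sketches: keep the maximum of each register pair."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


sketch = HyperLogLog()
for i in range(50000):
    sketch.add(f"user-{i}")
estimate = sketch.count()
```

Because sketches merge by taking register maxima, per-segment sketches can be combined at query time without revisiting the raw events.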

Live Demo

400B rows

2B aggregated rows

7 months of data

12 nodes
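The difference between rows and aggregated rows comes from rolling up events at ingestion: events that share a truncated timestamp and the same dimension values are summed into one row. A minimal sketch, assuming hourly granularity and hypothetical field names (`timestamp`, `device`, `displays`, `clicks`):

```python
from collections import defaultdict


def rollup(events, dimensions, metrics, truncate):
    """Sum metrics over events sharing a truncated timestamp and dimension values."""
    aggregated = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for event in events:
        key = (truncate(event["timestamp"]),) + tuple(event[d] for d in dimensions)
        for metric in metrics:
            aggregated[key][metric] += event[metric]
    return dict(aggregated)


def to_hour(timestamp):
    """Truncate an ISO-8601 timestamp string to the hour."""
    return timestamp[:13]


# Hypothetical events, not data from the talk.
events = [
    {"timestamp": "2015-07-15T08:12:37", "device": "iPhone", "displays": 1, "clicks": 0},
    {"timestamp": "2015-07-15T08:45:10", "device": "iPhone", "displays": 1, "clicks": 1},
    {"timestamp": "2015-07-15T08:12:38", "device": "Android", "displays": 1, "clicks": 0},
]
aggregated = rollup(events, ["device"], ["displays", "clicks"], to_hour)
```

Three events become two aggregated rows here; the coarser the granularity and the fewer the dimensions, the stronger the reduction.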

Performance

No downtime in 6 months

Aggregate displays, clicks, sales & revenue generated for our biggest advertiser, grouped by device, over 7 months = 197 ms
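A query like this one could be expressed as a Druid native groupBy query sent to a broker over the REST API. The sketch below only builds the request and does not send it. The datasource name, column names, advertiser value, interval, and broker host are all hypothetical; the talk does not show the actual query.

```python
import json
import urllib.request


def build_group_by_query(datasource, advertiser, interval):
    """Build a native groupBy query: sum four metrics per device for one advertiser."""
    long_metrics = ("displays", "clicks", "sales")
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "dimensions": ["device"],
        "filter": {"type": "selector", "dimension": "advertiser", "value": advertiser},
        "aggregations": [
            {"type": "longSum", "name": name, "fieldName": name}
            for name in long_metrics
        ] + [{"type": "doubleSum", "name": "revenue", "fieldName": "revenue"}],
        "intervals": [interval],
    }


def build_request(broker_url, query):
    """Wrap the query in an HTTP POST request (not sent here)."""
    return urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical names and dates, for illustration only.
query = build_group_by_query("ad_events", "example-advertiser", "2015-01-01/2015-08-01")
request = build_request("http://localhost:8082/druid/v2/", query)
```

Calling `urllib.request.urlopen(request)` against a running broker would return the per-device aggregates as JSON.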

Performance

According to Metamarkets:

33M rows per second per core

Scaled up to 26B rows per second

10k events per second ingested per node

What’s wrong?

Be careful with your data model

Immutable is... immutable!

No joins, no full SQL capabilities

A couple of bugs... but a very active and friendly team!

Happy!

So far

http://www.criteo.com/careers/

Looking for talent