splout sql - web latency sql views for hadoop

35
Splout SQL When Big Data Output is also Big Data Iván de Prado Alonso CEO of Datasalt www.datasalt.es @ivanprado @datasalt

Upload: datasalt

Post on 01-Jul-2015

15.571 views

Category:

Technology


1 download

DESCRIPTION

There are many Big Data problems whose output is also Big Data. In this presentation we will show Splout SQL, which allows serving an arbitrarily big dataset by partitioning it. Splout serves partitioned SQL views which are generated and indexed by Hadoop. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Hadoop is nowadays the de-facto open-source solution for Big Data batch-processing. When the output of a Hadoop process is big, there isn`t a satisfying solution for serving it. Think of pre-computed recommendations, for example, where the whole dataset may vary from one day to another. Splout decouples database creation from database serving and makes it efficient and safe to deploy Hadoop-generated datasets. There are many databases that allow serving Big Data such as NoSQL solutions, but they don`t have a rich query language like SQL. You generally can`t aggregate data in real-time like you would do with a GROUP BY clause. Because you can`t precompute everything, SQL is a very convenient feature to have in a Big Data serving solution. Splout is not a “fast analytics” engine. Splout is made for demanding web or mobile applications where query performance is critical. Arbitrary real-time aggregations should be done in less than 200 milliseconds under high traffic load. On top of that, Splout is scalable, flexible, RESTful & open-source.

TRANSCRIPT

Page 1: Splout SQL - Web latency SQL views for Hadoop

Splout SQLWhen Big Data Output is also Big

Data

Iván de Prado Alonso – CEO of Datasaltwww.datasalt.es@ivanprado@datasalt

Page 2: Splout SQL - Web latency SQL views for Hadoop
Page 3: Splout SQL - Web latency SQL views for Hadoop

Full SQL*

For Big Data

Web latency & throughput

Unlike NoSQL

Unlike RDBMS

Unlike Impala, Apache Drill, etc.

* Within each partition

Page 4: Splout SQL - Web latency SQL views for Hadoop

How does it work?

Isolation between generation and serving

Page 5: Splout SQL - Web latency SQL views for Hadoop

Table CLIENTS

CID Name

U20 Doug

U21 Ted

Table SALES

SID CID Amount

S100 U20 102

S101 U20 60

Tablespace CLIENTS_INFOTable CLIENTS

CID Name

U20 Doug

U21 Ted

U40 John

Table SALES

SID CID Amount

S100 U20 102

S101 U20 60

S223 U40 99

table CLIENTS

table SALES

Generate tablespace CLIENTS_INFO with partitioned by CID

partitioned by CID

Partition U10 – U35

Partition U36 – U60

Table CLIENTS

CID Name

U40 John

Table SALES

SID CID Amount

S223 U40 99

Generation

Page 6: Splout SQL - Web latency SQL views for Hadoop

Table CLIENTS

CID Name

U20 Doug

U21 Ted

Table SALES

SID CID Amount

S100 U20 102

S101 U20 60

SELECT Name, sum(Amount) FROM CLIENTS c, SALES s WHERE c.CID = s.CID AND CID = ‘U20’;

For key = ‘U20’, tablespace=‘CLIENTS_INFO’

Partition U10 – U35

Serving

Partition U36 – U60

Table CLIENTS

CID Name

U40 John

Table SALES

SID CID Amount

S223 U40 99

Page 7: Splout SQL - Web latency SQL views for Hadoop

SELECT Name, sum(Amount) FROM CLIENTS c, SALES s WHERE c.CID = s.CID AND CID = ‘U40’;

For key = ‘U40’, tablespace=‘CLIENTS_INFO’

Serving

Partition U36 – U60

Table CLIENTS

CID Name

U40 John

Table SALES

SID CID Amount

S223 U40 99

Table CLIENTS

CID Name

U20 Doug

U21 Ted

Table SALES

SID CID Amount

S100 U20 102

S101 U20 60

Partition U10 – U35

Page 8: Splout SQL - Web latency SQL views for Hadoop

Why does it scale?

Data is partitioned

Partitions are distributed across nodes

Adding more nodes increases capacity

Queries restricted to a single partition

Generation does not impact serving

Page 9: Splout SQL - Web latency SQL views for Hadoop

Ok, so what is Splout SQLuseful for?

Page 10: Splout SQL - Web latency SQL views for Hadoop

Big Data Analytics

Manageable output

Page 11: Splout SQL - Web latency SQL views for Hadoop

Big Data Analytics

Sometimes Big Data output is also Big Data

Page 12: Splout SQL - Web latency SQL views for Hadoop

Splout SQL allowsto serve

Big Data results

Page 13: Splout SQL - Web latency SQL views for Hadoop

Let’s see an example …

Page 14: Splout SQL - Web latency SQL views for Hadoop

Building a Google Analytics

Imagine that one crazy day you decide to build some kind of Google Analytics…

Millions of domains

Zillions of events

Individual panel per domain

Page 15: Splout SQL - Web latency SQL views for Hadoop

Time-based charts (day/hour aggregations)

Flexible dimension breakdown

Per page, per browserPer country, per language …

Requirements

Page 16: Splout SQL - Web latency SQL views for Hadoop

With Splout SQL

Page 17: Splout SQL - Web latency SQL views for Hadoop

Splout SQL provides SQL consolidated views for Hadoop

data

Page 18: Splout SQL - Web latency SQL views for Hadoop

Let’s see more details about

Splout SQL

Page 19: Splout SQL - Web latency SQL views for Hadoop

Splout SQL Architecture

Page 20: Splout SQL - Web latency SQL views for Hadoop

Each partition is …

Backed by SQLite

Generated on Hadoop

Distributed on Splout SQL cluster

Including any indexes needed

With replication for failover

Data can be sorted before insertion to minimize disk seeks at query time

Pre-sampling for balancing partition size

Page 21: Splout SQL - Web latency SQL views for Hadoop

Atomicity

A tablespace is a set of tables that share the same partitioning schema

Tablespaces are versioned

Several tablespaces can be deployedat once

Only one version served at a time

All-or-nothing semantics (atomicity)

Rollback support

Page 22: Splout SQL - Web latency SQL views for Hadoop

Characteristics

Ensured ms latencies

Controlled by the developer selecting the proper:

Even when queries hit disk

- Cluster topology- Partitioning- Indexes- Data collocation (insertion order)

Page 23: Splout SQL - Web latency SQL views for Hadoop

Characteristics (II)

100% SQL

But restricted to a single partition

Real-time aggregations

Joins

ScalabilityIn data capacity

In performance

Page 24: Splout SQL - Web latency SQL views for Hadoop

Characteristics (III)

Atomicity

New data replaces old data all at once

High availability

Through the use of replication

Open Source

Page 25: Splout SQL - Web latency SQL views for Hadoop

Characteristics (IV)

Easy to manage

Read only

Data is updated in batches

Changing the size of the cluster can be done without any downtime

Updates come from new tablespacedeployments

Page 26: Splout SQL - Web latency SQL views for Hadoop

Characteristics (V)

Native connectorsHive

Pig

Cascading

Page 27: Splout SQL - Web latency SQL views for Hadoop

API - Generation

Command line

Loading CSV files

Java API

Connectors

$ hadoop jar splout-*-hadoop.jar generate …

Page 28: Splout SQL - Web latency SQL views for Hadoop

API - Service

Rest API

JSON response

Page 29: Splout SQL - Web latency SQL views for Hadoop

API - Console

Page 30: Splout SQL - Web latency SQL views for Hadoop

Benchmark

350 GB Wikipedia logs

2-machines cluster900 queries/second, 80 ms/query, 80 threads

Aggregation queries impacting 15 rows in average

Page 31: Splout SQL - Web latency SQL views for Hadoop

Benchmark (II)

4-machines cluster3150 queries/second, 40 ms/query, 160 threads

More info:http://sploutsql.com/performance.html

Page 32: Splout SQL - Web latency SQL views for Hadoop

Web latency

SQL

Consolidated Views

For Hadoop

“A good candidate for the serving layer of a lambda architecture”

Page 33: Splout SQL - Web latency SQL views for Hadoop

www.SploutCloud.com - Splout SQL as a service

Page 34: Splout SQL - Web latency SQL views for Hadoop

Future work

Growing the community

Do you want to collaborate?

Automatic rebalancing on failover

Almost done

Some read/write capabilities

Enabling Splout SQL to become the speed layer on lambda architectures

Page 35: Splout SQL - Web latency SQL views for Hadoop

Iván de Prado Alonso – CEO of Datasaltwww.datasalt.es@ivanprado@datasalt

Questions?