norikra: stream processing with sql

Norikra: Stream Processing With SQL 2014/09/13 HadoopCon 2014 Taiwan Satoshi Tagomori (@tagomoris)

Upload: satoshi-tagomori

Post on 24-Jan-2015

DESCRIPTION

HadoopCon 2014 Taiwan Tech Talk * Stream processing overview * Using SQL as DSL for stream processing * Details of Norikra * Norikra queries * Use cases

TRANSCRIPT

Page 1: Norikra: Stream Processing with SQL

Norikra: Stream Processing With SQL

2014/09/13 HadoopCon 2014 Taiwan

Satoshi Tagomori (@tagomoris)

Page 2: Norikra: Stream Processing with SQL

Satoshi Tagomori (@tagomoris)

LINE Corporation

Analytics Platform Team

Page 3: Norikra: Stream Processing with SQL

THE ONE THING YOU MUST LEARN TODAY IS

Page 4: Norikra: Stream Processing with SQL

Norikra

Page 5: Norikra: Stream Processing with SQL

Norikra IS NOT

Norika

Page 6: Norikra: Stream Processing with SQL

Topics

Basics of stream processing

Stream processing with SQL

Norikra overview

Norikra queries

Use cases in production

Page 7: Norikra: Stream Processing with SQL

Stream Processing

Less latency

Less computing power

No query schedule management

Page 8: Norikra: Stream Processing with SQL

Data Flow And Latency

Batch: data window, then query execution

Stream: incremental query execution

Page 9: Norikra: Stream Processing with SQL

Query For Stored Data

[Figure: a table of stored rows, each with fields v1,v2,v3,v4,v5,v6]

At first, all data MUST be stored.

Page 10: Norikra: Stream Processing with SQL

Query For Stored Data

[Figure: a table of stored rows, each with fields v1,v2,v3,v4,v5,v6]

SELECT v1, v2, COUNT(*)
FROM table
WHERE v3='x'
GROUP BY v1, v2

Page 11: Norikra: Stream Processing with SQL

Query For Stored Data

[Figure: a table of stored rows, each with fields v1,v2,v3,v4,v5,v6]

SELECT v1, v2, COUNT(*)
FROM table
WHERE v3='x'
GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table
WHERE v1 AND v2
GROUP BY v4

Page 12: Norikra: Stream Processing with SQL

Query For Stored Data

[Figure: a table of stored rows, each with fields v1,v2,v3,v4,v5,v6]

SELECT v1, v2, COUNT(*)
FROM table
WHERE v3='x'
GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table
WHERE v1 AND v2
GROUP BY v4

"All data" means "data that will not be used".

Page 13: Norikra: Stream Processing with SQL

Query For Stream Data

[Figure: events with fields v1,v2,v3,v4,v5,v6 arrive one by one on a stream]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table.win:xxx
WHERE v1 AND v2
GROUP BY v4

Page 14: Norikra: Stream Processing with SQL

Query For Stream Data

[Figure: events with fields v1,v2,v3,v4,v5,v6 arrive on a stream; the first query receives v1,v2,v3 and the second receives v1,v2,v4]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table.win:xxx
WHERE v1 AND v2
GROUP BY v4

Page 15: Norikra: Stream Processing with SQL

Query For Stream Data

[Figure: events with fields v1,v2,v3,v4,v5,v6 arrive on a stream; the first query receives v1,v2,v3 and the second receives v1,v2,v4]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table.win:xxx
WHERE v1 AND v2
GROUP BY v4

Page 16: Norikra: Stream Processing with SQL

Query For Stream Data

[Figure: events with fields v1,v2,v3,v4,v5,v6 arrive on a stream; the first query receives v1,v2,v3 and the second receives v1,v2,v4]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table.win:xxx
WHERE v1 AND v2
GROUP BY v4

All data will be discarded right after insertion.

(Bye-bye storage system maintenance!)
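The field-subset idea on this slide can be sketched in a few lines of Python. This is only an illustration, not Norikra's implementation: `project`, `dispatch`, and the `QUERY_FIELDS` mapping are hypothetical names, and the field lists are read off the two queries above.

```python
# Hypothetical mapping: which fields each query on the slide refers to.
QUERY_FIELDS = {
    "query_a": ("v1", "v2", "v3"),  # SELECT v1,v2,COUNT(*) ... WHERE v3='x'
    "query_b": ("v1", "v2", "v4"),  # SELECT v4,COUNT(*) ... WHERE v1 AND v2
}


def project(event, fields):
    """Keep only the fields a query refers to."""
    return {name: event[name] for name in fields}


def dispatch(event):
    """Hand each query its own subset; the full event is not kept anywhere."""
    return {query: project(event, fields) for query, fields in QUERY_FIELDS.items()}


event = {"v1": True, "v2": False, "v3": "x", "v4": "a", "v5": 10, "v6": 20}
subsets = dispatch(event)
print(subsets["query_a"])
print(subsets["query_b"])
```

Fields that no query mentions (v5 and v6 here) never reach a query, which is why nothing needs to be stored after the event has been processed.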

Page 17: Norikra: Stream Processing with SQL

Incremental Calculation

[Figure: events with fields v1,v2,v3,v4,v5,v6 arrive on a stream]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

internal data (memory)

v1     v2     COUNT
TRUE   TRUE   0
TRUE   FALSE  1
FALSE  TRUE   33
FALSE  FALSE  2

Page 18: Norikra: Stream Processing with SQL

Incremental Calculation

[Figure: events with fields v1,v2,v3,v4,v5,v6 arrive on a stream]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

internal data (memory)

v1     v2     COUNT
TRUE   TRUE   1
TRUE   FALSE  1
FALSE  TRUE   33
FALSE  FALSE  2

Page 19: Norikra: Stream Processing with SQL

Incremental Calculation

[Figure: events with fields v1,v2,v3,v4,v5,v6 arrive on a stream]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

internal data (memory)

v1     v2     COUNT
TRUE   TRUE   1
TRUE   FALSE  1
FALSE  TRUE   34
FALSE  FALSE  2

Page 20: Norikra: Stream Processing with SQL

Incremental Calculation

[Figure: events with fields v1,v2,v3,v4,v5,v6 arrive on a stream]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

internal data (memory)

v1     v2     COUNT
TRUE   TRUE   1
TRUE   FALSE  2
FALSE  TRUE   37
FALSE  FALSE  3

Memory can store internal data.
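The counter updates shown across these slides can be reproduced with a small Python sketch. It assumes the engine keeps one counter per (v1, v2) group; `apply_event` is a hypothetical helper, and the starting values are the ones in the first "internal data" table.

```python
# Starting state taken from the first "internal data (memory)" table.
state = {
    (True, True): 0,
    (True, False): 1,
    (False, True): 33,
    (False, False): 2,
}


def apply_event(state, event):
    """Update the in-memory counters for one event; the event is then dropped."""
    if event["v3"] == "x":  # WHERE v3='x'
        key = (event["v1"], event["v2"])  # GROUP BY v1, v2
        state[key] = state.get(key, 0) + 1
    return state


apply_event(state, {"v1": True, "v2": True, "v3": "x"})
after_first = dict(state)  # snapshot: (TRUE, TRUE) goes 0 -> 1

apply_event(state, {"v1": False, "v2": True, "v3": "x"})
after_second = dict(state)  # snapshot: (FALSE, TRUE) goes 33 -> 34

apply_event(state, {"v1": False, "v2": False, "v3": "y"})  # filtered out, no change
print(after_first, after_second, state, sep="\n")
```

Only the four counters stay in memory, regardless of how many events have passed through.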

Page 21: Norikra: Stream Processing with SQL

Data Window

Target time (or size) range of queries

Batch

FROM-TO:
WHERE dt >= '2014-09-13 13:30:00'
AND dt < '2014-09-13 14:20:00'

Stream

"Calculate this query every 50 minutes"

Extended SQL required:

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x'
GROUP BY v1, v2

Page 22: Norikra: Stream Processing with SQL

Stream Processing With SQL

Esper: Java library to process streams

Needs to be implemented in Java daemon code

With schema for data/query

OSS under GPLv2

http://esper.codehaus.org/

Page 23: Norikra: Stream Processing with SQL

Esper EPL

SELECT height, weight
FROM tbl
WHERE age > 30

Select values of height and weight for all events with age larger than 30

Page 24: Norikra: Stream Processing with SQL

Esper EPL

SELECT height, COUNT(*) AS c
FROM tbl
WHERE age > 30
GROUP BY height

Count records grouped by height value for events with age larger than 30

This query doesn't ever produce results

Page 25: Norikra: Stream Processing with SQL

Esper EPL

SELECT height, COUNT(*) AS c
FROM tbl.win:time_batch(1 hour)
WHERE age > 30
GROUP BY height

Count records grouped by height value for events with age larger than 30, once every 1 hour
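To make the effect of the time window concrete, here is a rough Python simulation of the query above. It is a simplification and not how Esper works internally: it buckets timestamps into fixed one-hour slots after the fact, whereas a real engine emits results on a timer as each batch closes. `time_batch_counts` and the sample events are made up for illustration.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 3600  # mirrors time_batch(1 hour)


def time_batch_counts(events, window_seconds=WINDOW_SECONDS):
    """Count events per height within each fixed window.

    `events` is an iterable of (timestamp_in_seconds, event_dict) pairs.
    Returns {window_index: Counter({height: count})}.
    """
    windows = defaultdict(Counter)
    for ts, event in events:
        if event["age"] > 30:  # WHERE age > 30
            window_index = int(ts // window_seconds)
            windows[window_index][event["height"]] += 1  # GROUP BY height
    return dict(windows)


events = [
    (10, {"height": 170, "age": 34}),
    (20, {"height": 170, "age": 41}),
    (30, {"height": 165, "age": 25}),    # filtered out by age > 30
    (3700, {"height": 170, "age": 50}),  # falls into the next one-hour batch
]
result = time_batch_counts(events)
for window_index, counts in sorted(result.items()):
    print(window_index, dict(counts))
```

Without the window there is no point at which a count is "finished", which is why the unwindowed version of the query never produces results.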

Page 26: Norikra: Stream Processing with SQL

With/without Schema

Schema-full data:

strict schema: predefined fields w/ types (or reject)

schema on read: try to read known fields (or ignore)

Schema-less data:

Any field (or ignore), any type (implicit/explicit conversion)

fit for services under development:

All internet services including us!

Page 27: Norikra: Stream Processing with SQL

Stream Processing & Schema

Queries first, data second, for all stream processing

Queries automatically know what fields to query

[Figure: a schema-less (mixed) data stream carries events from the billing service, events from the API endpoint, and events of service X (labeled "TO BE"); query A and query B each receive their own subset of fields]

Page 28: Norikra: Stream Processing with SQL
Page 29: Norikra: Stream Processing with SQL

break.

Page 30: Norikra: Stream Processing with SQL
Page 31: Norikra: Stream Processing with SQL

Norikra: Schema-less Stream Processing with SQL

Server software, runs on JVM

Open source software (GPLv2)

http://norikra.github.io/

https://github.com/norikra/norikra

Page 32: Norikra: Stream Processing with SQL

Norikra:

Schema-less event stream:
Add/Remove data fields whenever you want

SQL:
No more restarts to add/remove queries
w/ JOINs, w/ SubQueries
w/ UDF (in Java/Ruby from rubygem)

Truly Complex events:
Nested Hash/Array, accessible directly from SQL

HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)

Page 33: Norikra: Stream Processing with SQL

How To Setup Norikra:

Install JRuby

download jruby.tar.gz, extract it and export $PATH

use rbenv

rbenv install jruby-1.7.xx

rbenv shell jruby-..

Install Norikra

gem install norikra

Execute Norikra server

norikra start

Page 34: Norikra: Stream Processing with SQL

Norikra Interface:

Command line: norikra-client

norikra-client target open ...

norikra-client query add ...

tail -f ... | norikra-client event send ...

WebUI

show status

show/add/remove queries

HTTP API

JSON, MessagePack

Page 35: Norikra: Stream Processing with SQL

Norikra Queries: (1)

SELECT name, age
FROM events

("events" is the target)

Page 36: Norikra: Stream Processing with SQL

Norikra Queries: (1)

SELECT name, age
FROM events

Input event:
{"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}

Output:
{"name":"tagomoris","age":34}

Page 37: Norikra: Stream Processing with SQL

Norikra Queries: (1)

SELECT name, age
FROM events

Input event (without "age"):
{"name":"tagomoris", "address":"Tokyo", "corp":"LINE", "current":"Taipei"}

Output:
nothing

Page 38: Norikra: Stream Processing with SQL

Norikra Queries: (2)

SELECT name, age
FROM events
WHERE current="Taipei"

Input event:
{"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}

Output:
{"name":"tagomoris","age":34}

Page 39: Norikra: Stream Processing with SQL

Norikra Queries: (2)

SELECT name, age
FROM events
WHERE current="Taipei"

Input event:
{"name":"hadoop", "age":99, "address":"Somewhere", "corp":"ASF", "current":"Elsewhere"}

Output:
nothing
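The behaviour shown for queries (1) and (2) can be summarised in a short Python sketch. It is an assumption-laden model of what the slides show, not Norikra's code: `run_query` is a hypothetical helper that returns nothing when a referenced field is missing or the WHERE condition fails, and otherwise returns only the selected fields.

```python
def run_query(event, select_fields, where=None):
    """Return the projected row, or None when the event does not match.

    `where` is a dict of field -> required value (equality conditions only).
    """
    where = where or {}
    needed = set(select_fields) | set(where)
    if not needed.issubset(event):
        return None  # a field the query refers to is missing from the event
    if any(event[field] != value for field, value in where.items()):
        return None  # WHERE condition not satisfied
    return {field: event[field] for field in select_fields}


full = {"name": "tagomoris", "age": 34, "address": "Tokyo",
        "corp": "LINE", "current": "Taipei"}
no_age = {"name": "tagomoris", "address": "Tokyo",
          "corp": "LINE", "current": "Taipei"}
other = {"name": "hadoop", "age": 99, "address": "Somewhere",
         "corp": "ASF", "current": "Elsewhere"}

print(run_query(full, ["name", "age"]))                         # query (1)
print(run_query(no_age, ["name", "age"]))                       # query (1), no "age"
print(run_query(full, ["name", "age"], {"current": "Taipei"}))  # query (2)
print(run_query(other, ["name", "age"], {"current": "Taipei"})) # query (2), no match
```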

Page 40: Norikra: Stream Processing with SQL

Norikra Queries: (3)

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

Page 41: Norikra: Stream Processing with SQL

Norikra Queries: (3)

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

Input event:
{"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}

Output, every 5 mins:
{"age":34,"cnt":3}, {"age":33,"cnt":1}, ...

Page 42: Norikra: Stream Processing with SQL

Norikra Queries: (4)

Input event:
{"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

Output, every 5 mins:
{"age":34,"cnt":3}, {"age":33,"cnt":1}, ...

SELECT max(age) AS max
FROM events.win:time_batch(5 mins)

Output, every 5 mins:
{"max":51}

Page 43: Norikra: Stream Processing with SQL

Norikra Queries: (5)

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

Input event (nested):
{"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Taipei", "speaker":true, "attend":[true,true,false, ...]}

Page 44: Norikra: Stream Processing with SQL

Norikra Queries: (5)

SELECT user.age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY user.age

Input event (nested):
{"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Taipei", "speaker":true, "attend":[true,true,false, ...]}

Page 45: Norikra: Stream Processing with SQL

Norikra Queries: (5)

SELECT user.age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
WHERE current="Taipei" AND attend.$0 AND attend.$1
GROUP BY user.age

Input event (nested):
{"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Taipei", "speaker":true, "attend":[true,true,false, ...]}
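The nested-field syntax used in these queries (`user.age` for a hash key, `attend.$0` for an array index) can be mimicked with a small path resolver. This is a sketch of the idea only; `get_path` is a hypothetical function, and returning `None` for a missing path is an assumption made for the example.

```python
def get_path(event, path):
    """Resolve a dotted path such as 'user.age' or 'attend.$0'.

    A part like '$0' is treated as an array index; anything else is a hash key.
    Returns None when the path does not exist in the event.
    """
    current = event
    for part in path.split("."):
        if part.startswith("$") and part[1:].isdigit():
            index = int(part[1:])
            if not isinstance(current, list) or index >= len(current):
                return None
            current = current[index]
        elif isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return None
    return current


event = {
    "name": "tagomoris",
    "user": {"age": 34, "corp": "LINE", "address": "Tokyo"},
    "current": "Taipei",
    "speaker": True,
    "attend": [True, True, False],
}

# WHERE current="Taipei" AND attend.$0 AND attend.$1
matches = bool(
    get_path(event, "current") == "Taipei"
    and get_path(event, "attend.$0")
    and get_path(event, "attend.$1")
)
print(get_path(event, "user.age"), matches)
```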

Page 46: Norikra: Stream Processing with SQL

break.

next: use cases

Page 47: Norikra: Stream Processing with SQL

Use case 1: External API call reports for partners (LINE)

External API call for LINE Business Connect

LINE backend sends requests to partner’s API endpoint using users’ messages

http://developers.linecorp.com/blog/?p=3386

Page 48: Norikra: Stream Processing with SQL

Use case 1: External API call reports for partners (LINE)

API error response summaries

http://developers.linecorp.com/blog/?p=3386

Page 49: Norikra: Stream Processing with SQL

Use case 1: External API call reports for partners (LINE)

[Figure labels: channel gateway, partner's server, logs, query results, MySQL, Mail]

SELECT
  channelId AS channel_id,
  reason,
  detail,
  count(*) AS error_count,
  min(timestamp) AS first_timestamp,
  max(timestamp) AS last_timestamp
FROM api_error_log.win:time_batch(60 sec)
GROUP BY channelId, reason, detail
HAVING count(*) > 0

http://developers.linecorp.com/blog/?p=3386

Page 50: Norikra: Stream Processing with SQL

Use case 2: Prompt reports for Ad service console

Prompt reports with Norikra + Fixed reports with Hive

[Figure labels: app servers, impression logs, Fluentd, HDFS, console service, fetch query results (frequently), execute hive query (daily)]

Page 51: Norikra: Stream Processing with SQL

Use case 2: Prompt reports for Ad service console

Hive query for fixed reports

SELECT
  yyyymmdd, hh, campaign_id, region, lang,
  COUNT(*) AS click,
  COUNT(DISTINCT member_id) AS uu
FROM (
  SELECT yyyymmdd, hh,
    get_json_object(log, '$.campaign.id') AS campaign_id,
    get_json_object(log, '$.member.region') AS region,
    get_json_object(log, '$.member.lang') AS lang,
    get_json_object(log, '$.member.id') AS member_id
  FROM applog
  WHERE service='myservice'
    AND yyyymmdd='20140913'
    AND get_json_object(log, '$.type')='click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang

Page 52: Norikra: Stream Processing with SQL

Use case 2: Prompt reports for Ad service console

Norikra query for prompt reports

SELECT
  campaign.id AS campaign_id,
  member.region AS region,
  member.lang AS lang,
  COUNT(*) AS click,
  COUNT(DISTINCT member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region, member.lang
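What the prompt-report query computes for one batch of events can be sketched in Python as follows. The sketch covers a single window only and invents its own sample events; `click_report` is a hypothetical helper, with `click` standing for COUNT(*) and `uu` for COUNT(DISTINCT member.id).

```python
from collections import defaultdict


def click_report(events):
    """Clicks and unique users per (campaign id, region, lang) for one batch."""
    clicks = defaultdict(int)
    members = defaultdict(set)
    for event in events:
        if event["type"] != "click":  # WHERE type="click"
            continue
        key = (event["campaign"]["id"],
               event["member"]["region"],
               event["member"]["lang"])
        clicks[key] += 1                        # COUNT(*)
        members[key].add(event["member"]["id"])  # COUNT(DISTINCT member.id)
    return {key: {"click": clicks[key], "uu": len(members[key])} for key in clicks}


# Made-up events for illustration.
batch = [
    {"type": "click", "campaign": {"id": 1}, "member": {"id": 10, "region": "TW", "lang": "zh"}},
    {"type": "click", "campaign": {"id": 1}, "member": {"id": 10, "region": "TW", "lang": "zh"}},
    {"type": "click", "campaign": {"id": 1}, "member": {"id": 11, "region": "TW", "lang": "zh"}},
    {"type": "impression", "campaign": {"id": 1}, "member": {"id": 12, "region": "TW", "lang": "zh"}},
]
report = click_report(batch)
print(report)
```

Note that a distinct count needs the set of member ids seen in the window, so its memory use grows with the number of unique users per group rather than staying constant.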

Page 53: Norikra: Stream Processing with SQL

Use case 3: Realtime access dashboard on Google Platform

Access log visualization

Count using Norikra (2-step), Store on Google BigQuery

Dashboard on Google Spreadsheet + Apps Script

https://www.youtube.com/watch?v=EZkw5TDcCGw

http://qiita.com/kazunori279/items/6329df57635799405547

Page 54: Norikra: Stream Processing with SQL

Use case 3: Realtime access dashboard on Google Platform

https://www.youtube.com/watch?v=EZkw5TDcCGw

http://qiita.com/kazunori279/items/6329df57635799405547

[Figure labels: Server, nginx, access log, Fluentd, access logs to BigQuery, norikra query to aggregate locally, norikra query results to aggregate node]

Page 55: Norikra: Stream Processing with SQL

Use case 3: Realtime access dashboard on Google Platform

https://www.youtube.com/watch?v=EZkw5TDcCGw

http://qiita.com/kazunori279/items/6329df57635799405547

70 servers, 120,000 requests/sec (or more!)

[Figure labels: nginx servers with Fluentd, counts per host, total count, logs to store, Google BigQuery, Google Spreadsheet + Apps Script]
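The 2-step counting in this use case (count locally on each server, then sum on an aggregate node) can be illustrated with a toy Python sketch. The function names and the sample host data are assumptions made for the example; in the deployment described on the slide each step is a Norikra query fed by Fluentd.

```python
def count_locally(host, requests):
    """Step 1, on each server: reduce raw access-log lines to one small record."""
    return {"host": host, "count": len(requests)}


def aggregate(per_host_results):
    """Step 2, on the aggregate node: sum the per-host counts."""
    return {"total": sum(result["count"] for result in per_host_results)}


# Made-up access logs for three hosts within one time window.
logs_by_host = {
    "web01": ["GET /", "GET /a", "GET /b"],
    "web02": ["GET /", "POST /c"],
    "web03": [],
}
per_host = [count_locally(host, lines) for host, lines in logs_by_host.items()]
total = aggregate(per_host)
print(per_host, total)
```

The aggregate node only receives one record per host per window, so its load depends on the number of servers rather than on the request rate.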

Page 56: Norikra: Stream Processing with SQL

More queries, more simplicity and less latency.

Thanks!

photo: by my co-workers

Page 58: Norikra: Stream Processing with SQL

HA? Distributed?

NO!

I have some ideas, but I have no time to implement them

There is no need for HA/Distributed processing

Page 59: Norikra: Stream Processing with SQL

Data flow & API?

Use Fluentd!

Page 60: Norikra: Stream Processing with SQL

Scalability?

10,000 - 100,000 events/sec

on a 2-CPU, 8-core server

Page 61: Norikra: Stream Processing with SQL

Storm or Norikra?

Simple and fixed workload for huge traffic

Use Storm!

Complex and fragile workload for non-huge traffic

Use Norikra!