Norikra: Stream Processing with SQL

DESCRIPTION
HadoopCon 2014 Taiwan tech talk: stream processing overview; using SQL as a DSL for stream processing; details of Norikra; Norikra queries; use cases.

TRANSCRIPT
Norikra: Stream Processing With SQL
2014/09/13 HadoopCon 2014 Taiwan
Satoshi Tagomori (@tagomoris)
Satoshi Tagomori (@tagomoris)
LINE Corporation
Analytics Platform Team
THE ONE THING
WHAT YOU MUST LEARN TODAY IS
Norikra
Norikra
IS NOT
Norika
Topics
Basics of stream processing
Stream processing with SQL
Norikra overview
Norikra queries
Use cases in production
Stream Processing
Lower latency
Less computing power
No query schedule management
Data Flow And Latency

(diagram: batch — a data window is collected first, then query execution; stream — incremental query execution as data arrives)
Query For Stored Data

(diagram: a table filled with stored rows of v1,v2,v3,v4,v5,v6)

At first, all data MUST be stored.

SELECT v1,v2,COUNT(*) FROM table
WHERE v3='x' GROUP BY v1,v2

SELECT v4,COUNT(*) FROM table
WHERE v1 AND v2 GROUP BY v4

"All data" means "data that will not be used".
Query For Stream Data

(diagram: events of v1,v2,v3,v4,v5,v6 flow through the stream; each query reads only its own subset of fields, e.g. v1,v2,v3 or v1,v2,v4)

SELECT v1,v2,COUNT(*) FROM table.win:xxx
WHERE v3='x' GROUP BY v1,v2

SELECT v4,COUNT(*) FROM table.win:xxx
WHERE v1 AND v2 GROUP BY v4

All data will be discarded right after insertion.
(Bye-bye storage system maintenance!)
Incremental Calculation

SELECT v1,v2,COUNT(*) FROM table.win:xxx
WHERE v3='x' GROUP BY v1,v2

(diagram: events arrive on the stream one by one; each event updates the internal data held in memory)

internal data (memory), before an event arrives:

  v1     v2     COUNT
  TRUE   TRUE       0
  TRUE   FALSE      1
  FALSE  TRUE      33
  FALSE  FALSE      2

After one matching event with v1=TRUE, v2=TRUE:

  v1     v2     COUNT
  TRUE   TRUE       1
  TRUE   FALSE      1
  FALSE  TRUE      33
  FALSE  FALSE      2

After more events, only the affected counters change:

  v1     v2     COUNT
  TRUE   TRUE       1
  TRUE   FALSE      2
  FALSE  TRUE      37
  FALSE  FALSE      3

Memory only has to hold the internal data, not the events themselves.
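The incremental model above can be sketched in a few lines of Ruby. This is an illustration of the idea only, not Norikra's or Esper's internals; the field names v1..v3 follow the slide's query. Each arriving event updates exactly one counter selected by its group keys, and the event itself is then discarded.

```ruby
# Illustrative sketch of incremental GROUP BY counting over a stream.
# Mirrors: SELECT v1,v2,COUNT(*) ... WHERE v3='x' GROUP BY v1,v2
class IncrementalCounter
  def initialize
    # internal data (memory): counters keyed by [v1, v2]
    @counts = Hash.new(0)
  end

  # Each event touches exactly one counter; the raw event is not stored.
  def push(event)
    return unless event[:v3] == 'x'   # WHERE v3='x'
    @counts[[event[:v1], event[:v2]]] += 1
  end

  def result
    @counts
  end
end

counter = IncrementalCounter.new
counter.push(v1: true,  v2: false, v3: 'x')
counter.push(v1: false, v2: true,  v3: 'x')
counter.push(v1: false, v2: true,  v3: 'y')  # filtered out by WHERE
counter.push(v1: false, v2: true,  v3: 'x')
p counter.result  # two groups: [true,false] counted once, [false,true] twice
```

Memory usage here is bounded by the number of distinct groups, not by the number of events, which is the point of the slide.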
Data Window: target time (or size) range of queries

Batch:
  FROM-TO: WHERE dt >= '2014-09-13 13:30:00'
           AND dt < '2014-09-13 14:20:00'

Stream:
  "Calculate this query every 50 minutes"
  Extended SQL required:
    SELECT v1,v2,COUNT(*) FROM table.win:xxx
    WHERE v3='x' GROUP BY v1,v2
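What a time-based data window computes can be mimicked by bucketing events into fixed 50-minute intervals and emitting one result per bucket. A minimal sketch under an assumption: events carry an explicit `ts` epoch field here, whereas real Esper/Norikra windows are driven by arrival time.

```ruby
# Sketch of a time-batch data window: bucket events into fixed
# 50-minute windows by timestamp and count the matching ones per window.
WINDOW = 50 * 60  # window size in seconds

# Returns { window_start_epoch => count } for events passing the filter.
def windowed_counts(events)
  counts = Hash.new(0)
  events.each do |ev|
    next unless ev[:v3] == 'x'            # WHERE v3='x'
    bucket = (ev[:ts] / WINDOW) * WINDOW  # start of this event's data window
    counts[bucket] += 1
  end
  counts
end

events = [
  { ts: 100,  v3: 'x' },
  { ts: 200,  v3: 'x' },
  { ts: 3100, v3: 'x' },  # falls into the next 50-minute window
  { ts: 50,   v3: 'y' },  # filtered out by WHERE
]
p windowed_counts(events)  # two events in the first window, one in the second
```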
Stream Processing With SQL
Esper: Java library to process streams
  needs to be implemented in Java daemon code
  with schema for data/query
  OSS under GPLv2
  http://esper.codehaus.org/
Esper EPL

SELECT height, weight FROM tbl WHERE age > 30

Select values of height and weight for all events with age larger than 30.
Esper EPL

SELECT height, COUNT(*) AS c FROM tbl
WHERE age > 30 GROUP BY height

Count records grouped by height value, for events with age larger than 30.
This query doesn't ever produce results (it has no data window).
Esper EPL

SELECT height, COUNT(*) AS c
FROM tbl.win:time_batch(1 hour)
WHERE age > 30 GROUP BY height

Count records grouped by height value, for events with age larger than 30, every 1 hour.
With/Without Schema

Schema-full data:
  strict schema: predefined fields w/ types (or reject)
  schema on read: try to read known fields (or ignore)

Schema-less data:
  any field (or ignore), any type (implicit/explicit conversion)
  fit for services under development: all internet services, including us!
Stream Processing & Schema

Queries first, data second, for all stream processing:
queries automatically know what fields to query.

(diagram: a schema-less (mixed) data stream carrying events from a billing service, events from an API endpoint, and events of a service X to be; query A and query B each read their own subset of fields)
break.
Norikra: Schema-less Stream Processing with SQL
Server software, runs on JVM
Open source software (GPLv2)
http://norikra.github.io/
https://github.com/norikra/norikra
Norikra:

Schema-less event stream:
  add/remove data fields whenever you want

SQL:
  no more restarts to add/remove queries
  w/ JOINs, w/ SubQueries
  w/ UDFs (in Java/Ruby, from rubygems)

Truly complex events:
  nested Hash/Array, accessible directly from SQL

HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)
How To Setup Norikra:

Install JRuby:
  download jruby.tar.gz, extract it and export $PATH
  or use rbenv:
    rbenv install jruby-1.7.xx
    rbenv shell jruby-..

Install Norikra:
  gem install norikra

Execute Norikra server:
  norikra start
Norikra Interface:

Command line: norikra-client
  norikra-client target open ...
  norikra-client query add ...
  tail -f ... | norikra-client event send ...

WebUI:
  show status
  show/add/remove queries

HTTP API:
  JSON, MessagePack
Norikra Queries: (1)

SELECT name, age FROM events
("events" is the target)

input:  {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
output: {"name":"tagomoris","age":34}

input (without "age"):  {"name":"tagomoris", "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
output: nothing
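The matching behavior above (a projection is emitted only when the event carries every field the query references, and nothing otherwise) can be sketched like this. `project` is a hypothetical helper illustrating the semantics, not Norikra's API:

```ruby
# Sketch of schema-less field matching: a query names its fields,
# and only events carrying all of them produce output.
def project(event, fields)
  return nil unless fields.all? { |f| event.key?(f) }  # missing field => no output
  event.select { |k, _| fields.include?(k) }           # keep only queried fields
end

query_fields = %w[name age]  # SELECT name, age FROM events

full = { 'name' => 'tagomoris', 'age' => 34, 'address' => 'Tokyo',
         'corp' => 'LINE', 'current' => 'Taipei' }
p project(full, query_fields)    # only "name" and "age" survive

no_age = { 'name' => 'tagomoris', 'address' => 'Tokyo' }
p project(no_age, query_fields)  # nil: the event lacks "age", so nothing is output
```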
Norikra Queries: (2)

SELECT name, age FROM events
WHERE current="Taipei"

input:  {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
output: {"name":"tagomoris","age":34}

input:  {"name":"hadoop", "age":99, "address":"Somewhere", "corp":"ASF", "current":"Elsewhere"}
output: nothing
Norikra Queries: (3)

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

input:  {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
output (every 5 mins): {"age":34,"cnt":3}, {"age":33,"cnt":1}, ...
Norikra Queries: (4)

Two queries on the same target:

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
output: {"age":34,"cnt":3}, {"age":33,"cnt":1}, ...

SELECT max(age) AS max
FROM events.win:time_batch(5 mins)
output: {"max":51}

input:  {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
(results every 5 mins)
Norikra Queries: (5)

input: {"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Taipei", "speaker":true, "attend":[true,true,false, ...]}

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
(a top-level "age" field does not exist in this event)

SELECT user.age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY user.age

SELECT user.age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
WHERE current="Taipei" AND attend.$0 AND attend.$1
GROUP BY user.age
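Norikra reaches into nested hashes and arrays through dotted names such as user.age and attend.$0. The naming convention can be illustrated by flattening a nested event into those dotted field names (a sketch based on the slide's notation, not the actual implementation):

```ruby
# Flatten a nested event into Norikra-style field names:
# hash children join with ".", array elements become ".$<index>".
def flatten_fields(value, prefix = nil, out = {})
  case value
  when Hash
    value.each { |k, v| flatten_fields(v, prefix ? "#{prefix}.#{k}" : k.to_s, out) }
  when Array
    value.each_with_index { |v, i| flatten_fields(v, "#{prefix}.$#{i}", out) }
  else
    out[prefix] = value  # leaf value: record it under its dotted path
  end
  out
end

event = {
  'name'    => 'tagomoris',
  'user'    => { 'age' => 34, 'corp' => 'LINE' },
  'current' => 'Taipei',
  'attend'  => [true, true, false],
}
flat = flatten_fields(event)
p flat['user.age']   # the nested age, addressable as user.age
p flat['attend.$0']  # the first array element, addressable as attend.$0
```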
break. next: use cases
Use case 1: External API call reports for partners (LINE)
External API call for LINE Business Connect
LINE backend sends requests to partner’s API endpoint using users’ messages
http://developers.linecorp.com/blog/?p=3386
Use case 1: External API call reports for partners (LINE)
API error response summaries
http://developers.linecorp.com/blog/?p=3386
Use case 1: External API call reports for partners (LINE)

(diagram: the channel gateway calls the partner's server; its logs feed the query below, and query results go to MySQL and Mail)

SELECT channelId AS channel_id, reason, detail,
  count(*) AS error_count,
  min(timestamp) AS first_timestamp,
  max(timestamp) AS last_timestamp
FROM api_error_log.win:time_batch(60 sec)
GROUP BY channelId, reason, detail
HAVING count(*) > 0

http://developers.linecorp.com/blog/?p=3386
Use case 2: Prompt reports for Ad service console

Prompt reports with Norikra + fixed reports with Hive

(diagram: app servers send impression logs via Fluentd to both Norikra and HDFS; the console service fetches Norikra query results frequently and executes a Hive query daily)

Hive query for fixed reports:

SELECT yyyymmdd, hh, campaign_id, region, lang,
  COUNT(*) AS click, COUNT(DISTINCT member_id) AS uu
FROM (
  SELECT yyyymmdd, hh,
    get_json_object(log, '$.campaign.id') AS campaign_id,
    get_json_object(log, '$.member.region') AS region,
    get_json_object(log, '$.member.lang') AS lang,
    get_json_object(log, '$.member.id') AS member_id
  FROM applog
  WHERE service='myservice' AND yyyymmdd='20140913'
    AND get_json_object(log, '$.type')='click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang
Norikra query for prompt reports:

SELECT campaign.id AS campaign_id,
  member.region AS region, member.lang AS lang,
  COUNT(*) AS click, COUNT(DISTINCT member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region, member.lang
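What the prompt-report query computes per window, COUNT(*) plus COUNT(DISTINCT member.id) per group, can be mimicked with a hash of counters and sets. This is an illustration of the aggregation only, with assumed event shapes, not how Norikra executes it:

```ruby
require 'set'

# Per-window aggregation mirroring:
#   COUNT(*) AS click, COUNT(DISTINCT member.id) AS uu
#   WHERE type="click" GROUP BY campaign.id, member.region, member.lang
def prompt_report(events)
  groups = Hash.new { |h, k| h[k] = { click: 0, members: Set.new } }
  events.each do |ev|
    next unless ev['type'] == 'click'  # WHERE type="click"
    key = [ev.dig('campaign', 'id'), ev.dig('member', 'region'), ev.dig('member', 'lang')]
    g = groups[key]
    g[:click] += 1                       # COUNT(*)
    g[:members] << ev.dig('member', 'id')  # set membership gives COUNT(DISTINCT)
  end
  groups.map do |(campaign_id, region, lang), g|
    { 'campaign_id' => campaign_id, 'region' => region, 'lang' => lang,
      'click' => g[:click], 'uu' => g[:members].size }
  end
end

events = [
  { 'type' => 'click', 'campaign' => { 'id' => 1 },
    'member' => { 'id' => 'a', 'region' => 'tw', 'lang' => 'zh' } },
  { 'type' => 'click', 'campaign' => { 'id' => 1 },
    'member' => { 'id' => 'a', 'region' => 'tw', 'lang' => 'zh' } },
  { 'type' => 'view',  'campaign' => { 'id' => 1 },
    'member' => { 'id' => 'b', 'region' => 'tw', 'lang' => 'zh' } },
]
p prompt_report(events)  # one group: 2 clicks from 1 unique member
```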
Use case 3: Realtime access dashboard on Google Platform

Access log visualization:
  count using Norikra (2-step), store on Google BigQuery,
  dashboard on Google Spreadsheet + Apps Script

(diagram: on each server, nginx access logs go into Fluentd; Fluentd sends access logs to BigQuery and Norikra query results to an aggregate node; a Norikra query aggregates locally first)

https://www.youtube.com/watch?v=EZkw5TDcCGw
http://qiita.com/kazunori279/items/6329df57635799405547
(diagram: 70 servers, 120,000 requests/sec (or more!); nginx + Fluentd on every server; counts per host and logs to store flow to Google BigQuery; the total count goes to Google Spreadsheet + Apps Script)
More queries, more simplicity, and less latency.
Thanks!
photo: by my co-workers
See also: http://norikra.github.io/
"Stream processing and Norikra"
  http://www.slideshare.net/tagomoris/stream-processing-and-norikra
"Batch processing and Stream processing by SQL"
  http://www.slideshare.net/tagomoris/hcj2014-sql
"Log analysis systems and its designs in LINE Corp 2014 Early"
  http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early
"Norikra in Action"
  http://www.slideshare.net/tagomoris/norikra-in-action-ver-2014-spring
HA? Distributed?
NO!
I have some ideas, but no time to implement them
There is no need for HA/distributed processing
Data flow & API?
Use Fluentd!
Scalability?
10,000 - 100,000 events/sec
on a 2-CPU, 8-core server
Storm or Norikra?
Simple and fixed workload for huge traffic
Use Storm!
Complex and fragile workload for non-huge traffic
Use Norikra!