Norikra: Stream Processing with SQL

DESCRIPTION
HadoopCon 2014 Taiwan tech talk: stream processing overview; using SQL as a DSL for stream processing; details of Norikra; Norikra queries; use cases.

TRANSCRIPT
Norikra: Stream Processing With SQL
2014/09/13 HadoopCon 2014 Taiwan
Satoshi Tagomori (@tagomoris)
Satoshi Tagomori (@tagomoris)
LINE Corporation
Analytics Platform Team
THE ONE THING
WHAT YOU MUST LEARN TODAY IS
Norikra
Norikra
IS NOT
Norika
Topics
Basics of stream processing
Stream processing with SQL
Norikra overview
Norikra queries
Use cases in production
Stream Processing
Lower latency
Less computing power
No query schedule management
Data Flow And Latency

(diagram: batch — a data window is collected first, then query execution; stream — incremental query execution as data arrives)
Query For Stored Data

(diagram: a table filled with stored rows of v1,v2,v3,v4,v5,v6)

At first, all data MUST be stored.

SELECT v1,v2,COUNT(*) FROM table
WHERE v3='x' GROUP BY v1,v2

SELECT v4,COUNT(*) FROM table
WHERE v1 AND v2 GROUP BY v4

"All data" means "data that will not be used".
Query For Stream Data

(diagram: events of v1,v2,v3,v4,v5,v6 flow through the stream; each query reads only its own subset of fields, e.g. v1,v2,v3 or v1,v2,v4)

SELECT v1,v2,COUNT(*) FROM table.win:xxx
WHERE v3='x' GROUP BY v1,v2

SELECT v4,COUNT(*) FROM table.win:xxx
WHERE v1 AND v2 GROUP BY v4

All data will be discarded right after insertion.
(Bye-bye storage system maintenance!)
Incremental Calculation

SELECT v1,v2,COUNT(*) FROM table.win:xxx
WHERE v3='x' GROUP BY v1,v2

(diagram: events arrive on the stream one by one; each event updates the internal data held in memory)

internal data (memory), before an event arrives:

  v1     v2     COUNT
  TRUE   TRUE       0
  TRUE   FALSE      1
  FALSE  TRUE      33
  FALSE  FALSE      2

After one matching event with v1=TRUE, v2=TRUE:

  v1     v2     COUNT
  TRUE   TRUE       1
  TRUE   FALSE      1
  FALSE  TRUE      33
  FALSE  FALSE      2

After more events, only the affected counters change:

  v1     v2     COUNT
  TRUE   TRUE       1
  TRUE   FALSE      2
  FALSE  TRUE      37
  FALSE  FALSE      3

Memory only has to hold the internal data, not the events themselves.
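The incremental model above can be sketched in a few lines of Ruby. This is an illustration of the idea only, not Norikra's or Esper's internals; the field names v1..v3 follow the slide's query. Each arriving event updates exactly one counter selected by its group keys, and the event itself is then discarded.

```ruby
# Illustrative sketch of incremental GROUP BY counting over a stream.
# Mirrors: SELECT v1,v2,COUNT(*) ... WHERE v3='x' GROUP BY v1,v2
class IncrementalCounter
  def initialize
    # internal data (memory): counters keyed by [v1, v2]
    @counts = Hash.new(0)
  end

  # Each event touches exactly one counter; the raw event is not stored.
  def push(event)
    return unless event[:v3] == 'x'   # WHERE v3='x'
    @counts[[event[:v1], event[:v2]]] += 1
  end

  def result
    @counts
  end
end

counter = IncrementalCounter.new
counter.push(v1: true,  v2: false, v3: 'x')
counter.push(v1: false, v2: true,  v3: 'x')
counter.push(v1: false, v2: true,  v3: 'y')  # filtered out by WHERE
counter.push(v1: false, v2: true,  v3: 'x')
p counter.result  # two groups: [true,false] counted once, [false,true] twice
```

Memory usage here is bounded by the number of distinct groups, not by the number of events, which is the point of the slide.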
Data Window: target time (or size) range of queries

Batch:
  FROM-TO: WHERE dt >= '2014-09-13 13:30:00'
           AND dt < '2014-09-13 14:20:00'

Stream:
  "Calculate this query every 50 minutes"
  Extended SQL required:
    SELECT v1,v2,COUNT(*) FROM table.win:xxx
    WHERE v3='x' GROUP BY v1,v2
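What a time-based data window computes can be mimicked by bucketing events into fixed 50-minute intervals and emitting one result per bucket. A minimal sketch under an assumption: events carry an explicit `ts` epoch field here, whereas real Esper/Norikra windows are driven by arrival time.

```ruby
# Sketch of a time-batch data window: bucket events into fixed
# 50-minute windows by timestamp and count the matching ones per window.
WINDOW = 50 * 60  # window size in seconds

# Returns { window_start_epoch => count } for events passing the filter.
def windowed_counts(events)
  counts = Hash.new(0)
  events.each do |ev|
    next unless ev[:v3] == 'x'            # WHERE v3='x'
    bucket = (ev[:ts] / WINDOW) * WINDOW  # start of this event's data window
    counts[bucket] += 1
  end
  counts
end

events = [
  { ts: 100,  v3: 'x' },
  { ts: 200,  v3: 'x' },
  { ts: 3100, v3: 'x' },  # falls into the next 50-minute window
  { ts: 50,   v3: 'y' },  # filtered out by WHERE
]
p windowed_counts(events)  # two events in the first window, one in the second
```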
Stream Processing With SQL
Esper: Java library to process streams
  needs to be implemented in Java daemon code
  with schema for data/query
  OSS under GPLv2
  http://esper.codehaus.org/
Esper EPL

SELECT height, weight FROM tbl WHERE age > 30

Select values of height and weight for all events with age larger than 30.
Esper EPL

SELECT height, COUNT(*) AS c FROM tbl
WHERE age > 30 GROUP BY height

Count records grouped by height value, for events with age larger than 30.
This query doesn't ever produce results (it has no data window).
Esper EPL

SELECT height, COUNT(*) AS c
FROM tbl.win:time_batch(1 hour)
WHERE age > 30 GROUP BY height

Count records grouped by height value, for events with age larger than 30, every 1 hour.
With/Without Schema

Schema-full data:
  strict schema: predefined fields w/ types (or reject)
  schema on read: try to read known fields (or ignore)

Schema-less data:
  any field (or ignore), any type (implicit/explicit conversion)
  fit for services under development: all internet services, including us!
Stream Processing & Schema

Queries first, data second, for all stream processing:
queries automatically know what fields to query.

(diagram: a schema-less (mixed) data stream carrying events from a billing service, events from an API endpoint, and events of a service X to be; query A and query B each read their own subset of fields)
break.
Norikra: Schema-less Stream Processing with SQL
Server software, runs on JVM
Open source software (GPLv2)
http://norikra.github.io/
https://github.com/norikra/norikra
Norikra:

Schema-less event stream:
  add/remove data fields whenever you want

SQL:
  no more restarts to add/remove queries
  w/ JOINs, w/ SubQueries
  w/ UDFs (in Java/Ruby, from rubygems)

Truly complex events:
  nested Hash/Array, accessible directly from SQL

HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)
How To Setup Norikra:

Install JRuby:
  download jruby.tar.gz, extract it and export $PATH
  or use rbenv:
    rbenv install jruby-1.7.xx
    rbenv shell jruby-..

Install Norikra:
  gem install norikra

Execute Norikra server:
  norikra start
Norikra Interface:

Command line: norikra-client
  norikra-client target open ...
  norikra-client query add ...
  tail -f ... | norikra-client event send ...

WebUI:
  show status
  show/add/remove queries

HTTP API:
  JSON, MessagePack
Norikra Queries: (1)

SELECT name, age FROM events
("events" is the target)

input:  {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
output: {"name":"tagomoris","age":34}

input (without "age"):  {"name":"tagomoris", "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
output: nothing
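The matching behavior above (a projection is emitted only when the event carries every field the query references, and nothing otherwise) can be sketched like this. `project` is a hypothetical helper illustrating the semantics, not Norikra's API:

```ruby
# Sketch of schema-less field matching: a query names its fields,
# and only events carrying all of them produce output.
def project(event, fields)
  return nil unless fields.all? { |f| event.key?(f) }  # missing field => no output
  event.select { |k, _| fields.include?(k) }           # keep only queried fields
end

query_fields = %w[name age]  # SELECT name, age FROM events

full = { 'name' => 'tagomoris', 'age' => 34, 'address' => 'Tokyo',
         'corp' => 'LINE', 'current' => 'Taipei' }
p project(full, query_fields)    # only "name" and "age" survive

no_age = { 'name' => 'tagomoris', 'address' => 'Tokyo' }
p project(no_age, query_fields)  # nil: the event lacks "age", so nothing is output
```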
Norikra Queries: (2)

SELECT name, age FROM events
WHERE current="Taipei"

input:  {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
output: {"name":"tagomoris","age":34}

input:  {"name":"hadoop", "age":99, "address":"Somewhere", "corp":"ASF", "current":"Elsewhere"}
output: nothing
Norikra Queries: (3)

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

input:  {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
output (every 5 mins): {"age":34,"cnt":3}, {"age":33,"cnt":1}, ...
Norikra Queries: (4)

Two queries on the same target:

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
output: {"age":34,"cnt":3}, {"age":33,"cnt":1}, ...

SELECT max(age) AS max
FROM events.win:time_batch(5 mins)
output: {"max":51}

input:  {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"}
(results every 5 mins)
Norikra Queries: (5)

input: {"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Taipei", "speaker":true, "attend":[true,true,false, ...]}

SELECT age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
(a top-level "age" field does not exist in this event)

SELECT user.age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
GROUP BY user.age

SELECT user.age, COUNT(*) AS cnt
FROM events.win:time_batch(5 mins)
WHERE current="Taipei" AND attend.$0 AND attend.$1
GROUP BY user.age
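Norikra reaches into nested hashes and arrays through dotted names such as user.age and attend.$0. The naming convention can be illustrated by flattening a nested event into those dotted field names (a sketch based on the slide's notation, not the actual implementation):

```ruby
# Flatten a nested event into Norikra-style field names:
# hash children join with ".", array elements become ".$<index>".
def flatten_fields(value, prefix = nil, out = {})
  case value
  when Hash
    value.each { |k, v| flatten_fields(v, prefix ? "#{prefix}.#{k}" : k.to_s, out) }
  when Array
    value.each_with_index { |v, i| flatten_fields(v, "#{prefix}.$#{i}", out) }
  else
    out[prefix] = value  # leaf value: record it under its dotted path
  end
  out
end

event = {
  'name'    => 'tagomoris',
  'user'    => { 'age' => 34, 'corp' => 'LINE' },
  'current' => 'Taipei',
  'attend'  => [true, true, false],
}
flat = flatten_fields(event)
p flat['user.age']   # the nested age, addressable as user.age
p flat['attend.$0']  # the first array element, addressable as attend.$0
```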
break. next: use cases
Use case 1: External API call reports for partners (LINE)
External API call for LINE Business Connect
LINE backend sends requests to partner’s API endpoint using users’ messages
http://developers.linecorp.com/blog/?p=3386
Use case 1: External API call reports for partners (LINE)
API error response summaries
http://developers.linecorp.com/blog/?p=3386
Use case 1: External API call reports for partners (LINE)

(diagram: the channel gateway calls the partner's server; its logs feed the query below, and query results go to MySQL and Mail)

SELECT channelId AS channel_id, reason, detail,
  count(*) AS error_count,
  min(timestamp) AS first_timestamp,
  max(timestamp) AS last_timestamp
FROM api_error_log.win:time_batch(60 sec)
GROUP BY channelId, reason, detail
HAVING count(*) > 0

http://developers.linecorp.com/blog/?p=3386
Use case 2: Prompt reports for Ad service console

Prompt reports with Norikra + fixed reports with Hive

(diagram: app servers send impression logs via Fluentd to both Norikra and HDFS; the console service fetches Norikra query results frequently and executes a Hive query daily)

Hive query for fixed reports:

SELECT yyyymmdd, hh, campaign_id, region, lang,
  COUNT(*) AS click, COUNT(DISTINCT member_id) AS uu
FROM (
  SELECT yyyymmdd, hh,
    get_json_object(log, '$.campaign.id') AS campaign_id,
    get_json_object(log, '$.member.region') AS region,
    get_json_object(log, '$.member.lang') AS lang,
    get_json_object(log, '$.member.id') AS member_id
  FROM applog
  WHERE service='myservice' AND yyyymmdd='20140913'
    AND get_json_object(log, '$.type')='click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang
Norikra query for prompt reports:

SELECT campaign.id AS campaign_id,
  member.region AS region, member.lang AS lang,
  COUNT(*) AS click, COUNT(DISTINCT member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region, member.lang
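What the prompt-report query computes per window, COUNT(*) plus COUNT(DISTINCT member.id) per group, can be mimicked with a hash of counters and sets. This is an illustration of the aggregation only, with assumed event shapes, not how Norikra executes it:

```ruby
require 'set'

# Per-window aggregation mirroring:
#   COUNT(*) AS click, COUNT(DISTINCT member.id) AS uu
#   WHERE type="click" GROUP BY campaign.id, member.region, member.lang
def prompt_report(events)
  groups = Hash.new { |h, k| h[k] = { click: 0, members: Set.new } }
  events.each do |ev|
    next unless ev['type'] == 'click'  # WHERE type="click"
    key = [ev.dig('campaign', 'id'), ev.dig('member', 'region'), ev.dig('member', 'lang')]
    g = groups[key]
    g[:click] += 1                       # COUNT(*)
    g[:members] << ev.dig('member', 'id')  # set membership gives COUNT(DISTINCT)
  end
  groups.map do |(campaign_id, region, lang), g|
    { 'campaign_id' => campaign_id, 'region' => region, 'lang' => lang,
      'click' => g[:click], 'uu' => g[:members].size }
  end
end

events = [
  { 'type' => 'click', 'campaign' => { 'id' => 1 },
    'member' => { 'id' => 'a', 'region' => 'tw', 'lang' => 'zh' } },
  { 'type' => 'click', 'campaign' => { 'id' => 1 },
    'member' => { 'id' => 'a', 'region' => 'tw', 'lang' => 'zh' } },
  { 'type' => 'view',  'campaign' => { 'id' => 1 },
    'member' => { 'id' => 'b', 'region' => 'tw', 'lang' => 'zh' } },
]
p prompt_report(events)  # one group: 2 clicks from 1 unique member
```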
Use case 3: Realtime access dashboard on Google Platform

Access log visualization:
  count using Norikra (2-step), store on Google BigQuery,
  dashboard on Google Spreadsheet + Apps Script

(diagram: on each server, nginx access logs go into Fluentd; Fluentd sends access logs to BigQuery and Norikra query results to an aggregate node; a Norikra query aggregates locally first)

https://www.youtube.com/watch?v=EZkw5TDcCGw
http://qiita.com/kazunori279/items/6329df57635799405547
(diagram: 70 servers, 120,000 requests/sec (or more!); nginx + Fluentd on every server; counts per host and logs to store flow to Google BigQuery; the total count goes to Google Spreadsheet + Apps Script)
More queries, more simplicity, and less latency.
Thanks!
photo: by my co-workers
See also: http://norikra.github.io/
"Stream processing and Norikra"
  http://www.slideshare.net/tagomoris/stream-processing-and-norikra
"Batch processing and Stream processing by SQL"
  http://www.slideshare.net/tagomoris/hcj2014-sql
"Log analysis systems and its designs in LINE Corp 2014 Early"
  http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early
"Norikra in Action"
  http://www.slideshare.net/tagomoris/norikra-in-action-ver-2014-spring
HA? Distributed?
NO!
I have some ideas, but no time to implement them
There is no need for HA/distributed processing
Data flow & API?
Use Fluentd!
Scalability?
10,000 - 100,000 events/sec
on a 2-CPU, 8-core server
Storm or Norikra?
Simple and fixed workload for huge traffic
Use Storm!
Complex and fragile workload for non-huge traffic
Use Norikra!