Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013
DESCRIPTION
In the world of real-time bidding (RTB), it is crucial to get performance metrics as soon as possible. This is why AdGear built its own real-time analytics system. In this talk, Louis-Philippe shares what he has learned building this system and introduces Swirl, AdGear's lightweight distributed stream processor. He also gives some pointers on how to build a subset of SQL to power your distributed jobs.
Talk objectives:
- Introduce Swirl, a lightweight distributed stream processor
- Implement a subset of SQL (lexer + parser + boolean logic)
- Demo a real-time graphing web interface powered by Swirl, Cowboy, Bullet and D3.js
TRANSCRIPT
![Page 1: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/1.jpg)
Building “sexy” real-time analytics systems
![Page 2: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/2.jpg)
AdGear is a full-stack ad platform for publishers and advertisers, with advanced analytics, attribution measurement, ad serving, and real-time bidding technology.
![Page 3: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/3.jpg)
Real-time bidding (RTB)
![Page 4: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/4.jpg)
Real-time reporting... why?
• help clients make informed decisions
• should I increase the bid price?
• should I bid on exchange X?
• inventory control (brand safety)
• debugging (bot detection, creative audits)
![Page 5: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/5.jpg)
“Sexy” real-time analytics systems
![Page 6: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/6.jpg)
“Sexy”?
• elegant backend
• beautiful user interface
![Page 7: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/7.jpg)
Architecture #1 (3 years ago)
• ssh
• node.js
• socket.io
![Page 8: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/8.jpg)
Problems
• no SMP support
• each process needs to be monitored
• requires load-balancing (nginx)
• duplicated state (per process)
• duplicated work (de-serialization)
• bad error handling (event loop explodes)
• callbacks...
![Page 9: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/9.jpg)
* promise construct
![Page 10: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/10.jpg)
Architecture #2 (1.5 years ago)
• ssh_channel *
• gproc (pub sub)
• ETS counters
• bullet (cowboy)
* https://gist.github.com/lpgauth/6529807
![Page 11: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/11.jpg)
Architecture #2
1. receive buffered events, split and de-serialize
2. each event is sent to a collector process (3) using gproc (pubsub) for filtering
3. collector (gen_server) aggregates messages using ETS counters and flushes every second
4. bullet handler serializes the aggregates (tab2list to json)
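The four steps above can be sketched in a few lines. This is an illustrative Python model, not the Erlang code from the talk: the newline-delimited JSON wire format, the `Collector` class and the lambda filters (standing in for gproc subscriptions and match specs) are all assumptions made for the example.

```python
import json
from collections import Counter


def split_events(buffer):
    """Step 1: split a buffered chunk and de-serialize each event.

    Assumption: events arrive as newline-delimited JSON.
    """
    return [json.loads(line) for line in buffer.splitlines() if line.strip()]


class Collector:
    """Step 3: aggregates the events it receives into counters."""

    def __init__(self, event_filter):
        self.event_filter = event_filter  # predicate standing in for a match spec
        self.counters = Counter()

    def handle(self, event):
        self.counters[event["event"]] += 1

    def flush(self):
        """Step 4: serialize the aggregates to JSON and reset the counters."""
        payload = json.dumps(dict(self.counters), sort_keys=True)
        self.counters.clear()
        return payload


def publish(events, collectors):
    """Step 2: send each event to every collector whose filter matches.

    Returns the number of messages sent, which grows with the client count.
    """
    sent = 0
    for event in events:
        for collector in collectors:
            if collector.event_filter(event):
                collector.handle(event)
                sent += 1
    return sent


buffer = (b'{"event": "impression", "buyer_id": 3}\n'
          b'{"event": "click", "buyer_id": 3}\n'
          b'{"event": "impression", "buyer_id": 5}\n')
all_events = Collector(lambda event: True)
buyer_3 = Collector(lambda event: event["buyer_id"] == 3)
messages_sent = publish(split_events(buffer), [all_events, buyer_3])
all_json = all_events.flush()
buyer_3_json = buyer_3.flush()
```

Three events and two subscribed clients already produce five messages, which hints at why the message count became a problem as clients were added.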
![Page 12: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/12.jpg)
Problems
• ssh_channel process and collector process are bottlenecks
• number of messages increases with the number of clients
• requires lots of bandwidth for large streams
• limited filtering (match specs)
![Page 13: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/13.jpg)
Improvements... (6 months ago)
• optimize collector’s msg loop (gen_server to proc_lib)
• use ssh compression
• added support for openssh zlib compression *
• R16B02
* https://github.com/lpgauth/otp/tree/openssh_zlib
![Page 14: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/14.jpg)
This worked for a while...
![Page 15: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/15.jpg)
“Hey man, it would be very cool if you could show in real-time the number of bid requests per domain for Friday’s demo... Can you do it?” - boss
![Page 16: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/16.jpg)
Sure.
![Page 17: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/17.jpg)
What did I just agree to...
• I only have 3 days to build this...
• bid requests stream is too large to aggregate in a central location (1+ Gbit/s - 80K+/s)
![Page 18: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/18.jpg)
Strategy for demo
1. move aggregation upstream
2. use ETS match select to find table ids (filtering)
3. increment counters in process (no message!)
4. periodically flush aggregates via message to collector node
5. collector node increments local counters and periodically flushes aggregates to bullet handler
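A rough Python model of this strategy, to show why it cuts traffic: each upstream node counts in-process and sends one message per flush instead of one per event. The class names, the `watched_domains` set (standing in for the ETS select of step 2) and the sample domains are hypothetical; the real system does this in Erlang with ETS.

```python
from collections import Counter


class UpstreamNode:
    """Runs next to the bidder: counts locally instead of sending events."""

    def __init__(self, watched_domains):
        # Stand-in for step 2: decides which events have a counter to update.
        self.watched_domains = set(watched_domains)
        self.counters = Counter()

    def emit(self, bid_request):
        domain = bid_request["domain"]
        if domain in self.watched_domains:
            self.counters[domain] += 1  # step 3: in-process, no message

    def flush(self, collector):
        """Step 4: send one message holding all aggregates, then reset."""
        if self.counters:
            collector.receive(dict(self.counters))
            self.counters.clear()


class CollectorNode:
    """Step 5: merges the aggregates received from every upstream node."""

    def __init__(self):
        self.counters = Counter()
        self.messages_received = 0

    def receive(self, aggregates):
        self.messages_received += 1
        self.counters.update(aggregates)

    def flush(self):
        """Hand the merged totals to the web handler and reset."""
        totals = dict(self.counters)
        self.counters.clear()
        return totals


collector = CollectorNode()
nodes = [UpstreamNode({"example.com", "example.org"}) for _ in range(2)]
for node in nodes:
    for i in range(1000):
        # Every fourth request is for a domain nobody is watching.
        node.emit({"domain": "example.com" if i % 4 else "example.net"})
    node.flush(collector)
totals = collector.flush()
```

Here 2,000 bid requests reach the collector as two messages, so the central node's load depends on the flush interval and the number of nodes rather than on the size of the stream.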
![Page 19: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/19.jpg)
Success!
![Page 20: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/20.jpg)
Introducing swirl! “lightweight distributed stream processor”
![Page 21: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/21.jpg)
Swirl components
• “dynamic” streams (swirl_stream)
• simple behavior that implements a map-reduce like interface (swirl_flow)
• powerful filtering language (swirl_ql)
• process registry (swirl_tracker)
![Page 22: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/22.jpg)
Streams
![Page 23: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/23.jpg)
Flows
* application:start(swirl).
![Page 24: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/24.jpg)
swirl_flow behavior
![Page 25: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/25.jpg)
Mapper Node
1. process “emits” event
2. lookup in ETS if there’s a flow that matches the stream name and filter
3. if there’s a match, call flow_mod:map/4
4. if map returns counters, increment in ETS
5. swirl_mapper periodically flushes aggregates to reducer node
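The mapper steps can be modelled as follows. This is a Python sketch under assumptions, not Swirl's API: the slides do not show the arguments of `flow_mod:map/4`, so the one-argument `map` callback, the `DomainFlow` flow and the counter key shape are all hypothetical.

```python
from collections import Counter


class DomainFlow:
    """Hypothetical flow: counts bid requests per domain on one exchange."""

    stream_name = "bid_requests"

    @staticmethod
    def matches(event):
        # Stand-in for the flow's filter.
        return event.get("exchange") == "x"

    @staticmethod
    def map(event):
        # Returns (counter key, increment) pairs, or None to count nothing.
        return [(("requests", event["domain"]), 1)]


class Mapper:
    def __init__(self, flows, send_to_reducer):
        self.flows = flows              # stand-in for the ETS flow table
        self.counters = Counter()       # stand-in for ETS counters
        self.send_to_reducer = send_to_reducer

    def emit(self, stream_name, event):
        """Steps 1-4: find matching flows, call map, bump the counters."""
        for flow in self.flows:
            if flow.stream_name == stream_name and flow.matches(event):
                for key, increment in flow.map(event) or []:
                    self.counters[key] += increment

    def flush(self):
        """Step 5: ship the aggregates to the reducer node and reset."""
        if self.counters:
            self.send_to_reducer(dict(self.counters))
            self.counters.clear()


sent = []
mapper = Mapper([DomainFlow], sent.append)
mapper.emit("bid_requests", {"exchange": "x", "domain": "a.com"})
mapper.emit("bid_requests", {"exchange": "x", "domain": "a.com"})
mapper.emit("bid_requests", {"exchange": "y", "domain": "b.com"})  # filtered out
mapper.emit("clicks", {"exchange": "x", "domain": "a.com"})  # other stream
mapper.flush()
```

Only events that pass both the stream-name and filter checks reach the map callback, so unmatched traffic costs one lookup and nothing more.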
![Page 26: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/26.jpg)
Reducer Node
1. swirl_tracker receives mapper aggregates and forwards them to reducer
2. reducer increments counters in ETS
3. reducer flushes counters to flow_mod:reduce/4
3. reducer flushes counters to flow_mod:reduce/4
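On the reducer side, the same idea in a Python sketch: a registry routes incoming aggregates to the right reducer, which merges them and hands totals to a reduce callback on flush. The `Tracker` and `Reducer` classes, the flow identifiers and the choice to drop aggregates for unknown flows are assumptions for illustration; the real `flow_mod:reduce/4` arguments are not shown in the slides.

```python
from collections import Counter


class Reducer:
    def __init__(self, reduce_callback):
        self.counters = Counter()  # stand-in for ETS counters
        self.reduce_callback = reduce_callback

    def merge(self, aggregates):
        """Step 2: add one mapper's aggregates to the local counters."""
        self.counters.update(aggregates)

    def flush(self):
        """Step 3: pass the totals to the flow's reduce callback and reset."""
        totals = dict(self.counters)
        self.counters.clear()
        self.reduce_callback(totals)


class Tracker:
    """Step 1: registry that forwards aggregates to the matching reducer."""

    def __init__(self):
        self.reducers = {}

    def register(self, flow_id, reducer):
        self.reducers[flow_id] = reducer

    def deliver(self, flow_id, aggregates):
        reducer = self.reducers.get(flow_id)
        if reducer is None:
            return False  # assumption: aggregates for unknown flows are dropped
        reducer.merge(aggregates)
        return True


results = []
tracker = Tracker()
reducer = Reducer(results.append)
tracker.register("flow-1", reducer)
tracker.deliver("flow-1", {"a.com": 2})              # from mapper node A
tracker.deliver("flow-1", {"a.com": 3, "b.com": 1})  # from mapper node B
delivered = tracker.deliver("flow-2", {"a.com": 9})  # no such flow
reducer.flush()
```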
![Page 27: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/27.jpg)
Swirl-ql
• SQL WHERE-clause-like syntax
• supported operators:
• AND / OR
• <, <=, =, >, <>
• IN (x, y) / NOT IN (x, y, z)
• IS NULL / IS NOT NULL (undefined)
* https://github.com/lpgauth/swirl-ql
![Page 28: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/28.jpg)
Swirl-ql
• examples:
• “event IN (‘impression’, ‘click’)”
• “buyer_id IS NOT NULL AND buyer_id <> 3”
• “event = ‘impressions’ AND (buyer_id IN (3, 5) OR buyer_id IS NULL)”
![Page 29: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/29.jpg)
Swirl-ql
• leex / yecc for parsing (use lex / yacc doc)
• pattern match ftw!
• use hipe (~200% speed gain in micro benchmarks)
• 0.286 vs 0.097 microseconds *
• experimenting with dynamic compilation
* http://theory.stanford.edu/~sergei/papers/sigmod10-index.pdf
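To make the lexer + parser + boolean logic pipeline concrete, here is a small Python sketch that handles the operators and example queries listed above. It is not swirl_ql's implementation (which uses leex and yecc in Erlang): the hand-written recursive-descent parser, the tuple shapes of the parse tree, the use of straight quotes for strings, and the NULL semantics (a missing field reads as NULL, and comparing NULL is false) are all assumptions.

```python
import operator
import re

COMPARATORS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
               ">": operator.gt, "<>": operator.ne}
KEYWORDS = {"AND", "OR", "IN", "NOT", "IS", "NULL"}
TOKEN_RE = re.compile(
    r"\s*(?:(?P<number>\d+)|'(?P<string>[^']*)'|(?P<symbol><=|<>|[<>=(),])"
    r"|(?P<word>[A-Za-z_]\w*))")


def tokenize(text):
    """Lexer: turn the query string into (kind, value) tokens."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        pos, kind = match.end(), match.lastgroup
        value = match.group(kind)
        if kind == "word" and value.upper() not in KEYWORDS:
            tokens.append(("field", value))
        elif kind in ("number", "string"):
            tokens.append(("value", int(value) if kind == "number" else value))
        else:  # keyword or symbol: its (upper-cased) text is its kind
            tokens.append((value.upper(), None))
    return tokens


class Parser:
    """Recursive-descent parser; AND binds tighter than OR."""

    def __init__(self, tokens):
        self.tokens = tokens + [("end", None)]  # sentinel for lookahead
        self.pos = 0

    def accept(self, kind):
        if self.tokens[self.pos][0] != kind:
            return False
        self.pos += 1
        return True

    def take(self, kind):
        if not self.accept(kind):
            raise SyntaxError(f"expected {kind}, got {self.tokens[self.pos][0]}")
        return self.tokens[self.pos - 1][1]

    def expr(self, keyword="OR"):
        # expr := term (OR term)* ; term := factor (AND factor)*
        operand = self.factor if keyword == "AND" else lambda: self.expr("AND")
        node = operand()
        while self.accept(keyword):
            node = (keyword.lower(), node, operand())
        return node

    def factor(self):
        if self.accept("("):
            node = self.expr()
            self.take(")")
            return node
        field = self.take("field")
        for op in COMPARATORS:
            if self.accept(op):
                return ("cmp", field, op, self.take("value"))
        if self.accept("IS"):
            tag = "notnull" if self.accept("NOT") else "null"
            self.take("NULL")
            return (tag, field)
        tag = "notin" if self.accept("NOT") else "in"
        self.take("IN")
        self.take("(")
        values = [self.take("value")]
        while self.accept(","):
            values.append(self.take("value"))
        self.take(")")
        return (tag, field, values)


def parse(text):
    parser = Parser(tokenize(text))
    node = parser.expr()
    parser.take("end")  # reject trailing tokens
    return node


def evaluate(node, event):
    """Pattern-match on the node tag and test one event (a dict)."""
    tag, *args = node
    if tag in ("and", "or"):
        left, right = (evaluate(arg, event) for arg in args)
        return (left and right) if tag == "and" else (left or right)
    actual = event.get(args[0])  # a missing field reads as None (NULL)
    if tag == "cmp":  # assumption: comparing NULL with anything is false
        return actual is not None and COMPARATORS[args[1]](actual, args[2])
    if tag in ("in", "notin"):
        return (actual in args[1]) == (tag == "in")
    return (actual is None) == (tag == "null")  # "null" or "notnull"
```

For example, `evaluate(parse("buyer_id IS NOT NULL AND buyer_id <> 3"), {"buyer_id": 5})` is true, while the same query against `{"buyer_id": 3}` or `{}` is false. Parsing once and evaluating the resulting tree per event keeps the per-event cost to a walk over a few tuples.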
![Page 30: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/30.jpg)
Swirl limitations
• best-effort (hard problem!)
• netsplits
• crash
• in-memory only
![Page 31: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/31.jpg)
Todo
• node discovery
• code distribution
• resource limitation
• better documentation!
![Page 32: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/32.jpg)
Architecture #3 (now!)
• swirl
• bullet (cowboy)
![Page 34: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013](https://reader033.vdocument.in/reader033/viewer/2022052600/557fb817d8b42a40118b48fe/html5/thumbnails/34.jpg)
pssst: we’re hiring!
Thank You!
twitter: lpgauth github: lpgauth