Drinking from the Firehose: Real-Time Metrics

Samantha Quiñones

Posted on 15-Jul-2015
Category: Technology

TRANSCRIPT

Drinking from the Firehose Real-Time Metrics

Samantha Quiñones

@ieatkillerbees http://samanthaquinones.com

“How would you let editors test how well

different headlines perform for the same

piece of content?”

Measuring User Behavior

• Application path

• Use patterns

• Mouse & attention tracking

Multivariate Testing

• Sort all users into groups

• 1 control group receives unaltered content

• 1 or more groups receive altered content

• Measure behavioral statistics (CTR, abandon rate, time on page, scroll depth) for each group
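The sorting step above can be sketched as deterministic bucketing. This is a minimal illustration, not the production assignment logic; the function name and hash are invented for the sketch.

```javascript
// Minimal sketch of multivariate-test bucketing (illustrative, hypothetical
// helper): hash the user ID so each user lands deterministically in one of
// numGroups groups — group 0 is the control, the rest see altered content.
function assignGroup(userId, numGroups) {
    var hash = 0;
    for (var i = 0; i < userId.length; i++) {
        hash = (hash * 31 + userId.charCodeAt(i)) >>> 0; // keep unsigned 32-bit
    }
    return hash % numGroups;
}

// The same user always gets the same group, so their experience stays stable
// across page views while behavioral stats (CTR, time on page) accumulate.
var group = assignGroup('user-1234', 3);
console.log(group === assignGroup('user-1234', 3)); // true
```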

State Monitoring

• Debugging

• Load Monitoring

And then…?

• Augmented intelligence for content creators

• Quality prediction

What if content could change itself based on the weather?

Managing Big Data

How big is big?

• ~1,300,000,000 events per DAY

• ~40 datapoints per EVENT

• ~15,000 events per SECOND

• Containing ~600,000 datapoints, at a rate of up to 25 megabytes / second
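The per-day and per-second figures can be sanity-checked against each other. This sketch assumes ~1.3 billion events/day, which is what ~15,000 events/second works out to:

```javascript
// Back-of-envelope check of the pipeline's scale figures (assumed from the
// slides): ~1.3 billion events/day, ~40 datapoints/event.
var eventsPerDay = 1300000000;
var datapointsPerEvent = 40;

var secondsPerDay = 24 * 60 * 60; // 86,400
var eventsPerSecond = Math.round(eventsPerDay / secondsPerDay);
var datapointsPerSecond = eventsPerSecond * datapointsPerEvent;

console.log(eventsPerSecond);     // ~15,000 events/second
console.log(datapointsPerSecond); // ~600,000 datapoints/second
```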

[Diagram: Collectors → RabbitMQ Farm → Hadoop]

Hadoop

• Framework for distributed storage and processing of data

• Designed to make managing very large datasets simple with…

• Well-documented, open-source, common libraries

• Optimizing for commodity hardware

Hadoop Distributed File System

• Modeled after Google File System

• Stores logical files across multiple systems

• Rack-aware

• No read-write concurrency

MapReduce

• Framework for massively parallel data processing tasks

Map

<?php
$document = "I'm a little teapot short and stout here is my handle here is my spout";

/**
 * Outputs: [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
 */
function map($target_word, $document) {
    return array_map(
        function ($word) use ($target_word) {
            if ($word === $target_word) {
                return 1;
            }
            return 0;
        },
        preg_split('/\s+/', $document)
    );
}

echo json_encode(map("is", $document)) . PHP_EOL;

Reduce

<?php
$data = [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0];

/**
 * Outputs: 2
 */
function reduce($data) {
    return array_reduce(
        $data,
        function ($count, $value) {
            return $count + $value;
        }
    );
}

echo reduce($data) . PHP_EOL;

Hadoop Limitations

• Hadoop jobs are batched and take significant time to run

• Data may not be available for 1+ hours after collection

“How would you let editors test how well

different headlines perform for the same

piece of content?”

Consider Shelf-life

• Most articles are relevant for < 24 hours

• Interest peaks < 3 hours

Real-Time Pipelines

[Diagram: Collectors → RabbitMQ Farm → Streamers]

Version 1 (PoC)

[Diagram: Streamer → Receiver → StatsD Cluster → ElasticSearch]

this.visit = function (record) {
    if (record.userAgent) {
        var parser = new UAParser();
        parser.setUA(record.userAgent);
        var user_agent = parser.getResult();
        return { user_agent: user_agent };
    }
    return {};
};

Findings

• Max throughput per collector: 300 events/second

• ~70 receivers needed for prod

• StatsD key format creates data redundancy and reduces data richness

Version 1 (PoC)

[Diagram: Streamer → Receiver → StatsD Cluster → ElasticSearch]

Transits & Terminals

• Transits - Short-term, in-memory, volatile storage for data with a life-span up to a few seconds

• Terminals - Destinations for data that either store, abandon, or transmit it

An efficient real-time data pathway consists of a network of transits and terminals, where no single node acts as both a transit and a terminal at the same time.

StatsD

• Acts as a transit, taking data and passing it along…

• BUT

• Acts as a terminal, aggregating keys in memory and becoming a transit after a time or buffer threshold.
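The dual behavior above can be sketched in a few lines. This is a toy model with invented names, not the real StatsD implementation: counters are aggregated in memory (terminal-like), then flushed downstream once a threshold is reached (transit-like), discarding the individual events.

```javascript
// Toy model of StatsD-style aggregate-then-flush behavior (hypothetical
// names, not the real StatsD code).
function MiniStatsD(flushThreshold, downstream) {
    this.counters = {};   // in-memory aggregation per key (acting as a terminal)
    this.buffered = 0;
    this.flushThreshold = flushThreshold;
    this.downstream = downstream; // callback receiving aggregates (acting as a transit)
}

MiniStatsD.prototype.increment = function (key) {
    this.counters[key] = (this.counters[key] || 0) + 1;
    this.buffered += 1;
    if (this.buffered >= this.flushThreshold) {
        this.flush();
    }
};

MiniStatsD.prototype.flush = function () {
    this.downstream(this.counters); // only aggregates survive; raw events are gone
    this.counters = {};
    this.buffered = 0;
};

// Usage: three raw events collapse into one aggregate — data richness is lost.
var flushes = [];
var statsd = new MiniStatsD(3, function (agg) { flushes.push(agg); });
statsd.increment('pageview');
statsd.increment('pageview');
statsd.increment('click');
console.log(flushes); // [ { pageview: 2, click: 1 } ]
```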

Version 2

[Diagram: Streamer → Receiver → RabbitMQ → ElasticSearch]

RabbitMQ

• Lightweight message broker

• Allows complex message routing without application-level logic

• Can buffer 90-120 seconds of traffic
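The "complex message routing" point rests on topic exchanges: a binding pattern is matched word-by-word against each message's routing key, so routing lives in broker configuration rather than application code. A sketch of the matching rules (illustrative function, not the broker's code):

```javascript
// Sketch of RabbitMQ-style topic matching (illustrative, not RabbitMQ's own
// code): in a binding pattern, "*" matches exactly one dot-separated word and
// "#" matches zero or more words.
function topicMatch(pattern, routingKey) {
    var p = pattern.split('.');
    var k = routingKey.split('.');

    function match(pi, ki) {
        if (pi === p.length) return ki === k.length;
        if (p[pi] === '#') {
            // '#' may consume zero or more remaining words
            for (var skip = ki; skip <= k.length; skip++) {
                if (match(pi + 1, skip)) return true;
            }
            return false;
        }
        if (ki === k.length) return false;
        return (p[pi] === '*' || p[pi] === k[ki]) && match(pi + 1, ki + 1);
    }
    return match(0, 0);
}

console.log(topicMatch('metrics.*.clicks', 'metrics.us.clicks')); // true
console.log(topicMatch('metrics.#', 'metrics.us.east.clicks'));   // true
console.log(topicMatch('metrics.*', 'metrics.us.clicks'));        // false
```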

Version 2

• Eliminated eventing and improved performance

• Replaced StatsD with RabbitMQ

• Data records are kept together

• No longer works with Kibana (sadface)

while (buffer.length > 0) {
    var char = buffer.shift();
    if ('\n' === char) {
        queue.push(new Buffer(outbuf.join('')));
        continue;
    }
    outbuf.push(char);
}

var i = 0;
var tBuf = buffer.slice();
while (i < buffer.length) {
    var char = tBuf[i++];
    if ('\n' === char) {
        queue.push(new Buffer(outbuf.join('')));
    }
    outbuf.push(char);
}

Findings

• Max throughput per collector: 600 events/second

• ~35 receivers needed for prod

• Micro-optimized code became increasingly brittle and hard to maintain as custom logic was needed for every edge case

Version 2

[Diagram: Streamer → Receiver → RabbitMQ → ElasticSearch]

Need to Get Serious

• Very high throughput

• Multi-threaded worker pool with large memory buffers

• Static & dynamic optimization

• Efficient memory management for extremely volatile in-memory data

• Eliminate any processing overhead. Receiver must be a Transit

And also…

• Not GoLang (because no one on the team is familiar with it)

• Not Rust (because no one on the team wants to be familiar with it)

• Not C (because C)

mfw java :(

Why Java?

• Solid static & dynamic analysis and optimizations in the S2BC & JIT compilers

• Clients for the stuff I needed to talk to

• Well-supported within AOL & within my team

Version 3

[Diagram: Streamer → Receiver → Processor/Router → RabbitMQ → ElasticSearch]

public class StreamReader {
    private static final Logger logger = Logger.getLogger(StreamReader.class.getName());
    private StreamerQueue queue = new StreamerQueue();
    private StreamProcessor processor;
    private List<StreamReader.BeaconWorkerThread> workerThreads = new ArrayList();
    private RtStreamerClient client;

    public StreamReader(String streamerURI, AmqpClient amqpClient, String appID,
                        String tpcFltrs, String rfFltrs, String bt) {
        ArrayList queueList = new ArrayList();
        this.processor = new StreamProcessor(amqpClient);
        byte numThreads = 8;

        for (int i = 0; i < numThreads; ++i) {
            StreamReader.BeaconWorkerThread worker = new StreamReader.BeaconWorkerThread();
            this.workerThreads.add(worker);
            worker.start();
        }

        queueList.add(this.queue);
        this.client = new RtStreamerClient(streamerURI, appID, tpcFltrs, rfFltrs, bt, queueList);
    }
}

public class StreamProcessor {
    private static final Logger logger = Logger.getLogger(StreamProcessor.class.getName());
    private AmqpClient amqpClient;

    public StreamProcessor(AmqpClient amqpClient) {
        this.amqpClient = amqpClient;
    }

    public void send(String data) throws Exception {
        this.amqpClient.send(data.getBytes());
        logger.debug("Sent event " + data + " to AMQP");
    }
}

Linked List Queues

[Diagram: Network Input → linked-list queues → Network Output]
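The queues above matter because of how they dequeue: Array#shift is O(n) (it reindexes the whole array), while a linked-list queue dequeues in O(1), which matters when a receiver drains hundreds of thousands of buffered events per second. A minimal sketch (illustrative names, not the production class):

```javascript
// Minimal linked-list queue: O(1) enqueue and dequeue, unlike Array#shift.
function LinkedQueue() {
    this.head = null;
    this.tail = null;
    this.length = 0;
}

LinkedQueue.prototype.enqueue = function (value) {
    var node = { value: value, next: null };
    if (this.tail) {
        this.tail.next = node; // append after current tail
    } else {
        this.head = node;      // queue was empty
    }
    this.tail = node;
    this.length++;
};

LinkedQueue.prototype.dequeue = function () {
    if (!this.head) return undefined;
    var value = this.head.value;
    this.head = this.head.next;
    if (!this.head) this.tail = null; // queue drained
    this.length--;
    return value;
};

// Usage
var q = new LinkedQueue();
q.enqueue('event-1');
q.enqueue('event-2');
console.log(q.dequeue()); // 'event-1'
console.log(q.length);    // 1
```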

Findings

• Max throughput per collector: 2600 events/second

• ~10 receivers needed for prod

Why ElasticSearch

• Open-source Lucene search engine

• Highly-distributed storage engine

• Clusters nicely

• Built-in aggregations like whoa

Aggregations

• Geographic Boxing & Radius Grouping

• Time-Series

• Histograms

• Min/Max/Avg Statistical Evaluation

• MapReduce (coming soon!)

• How many users viewed my post on an android tablet in portrait mode within 10 miles of Denton, TX?

• What is the average time from start of page-load to first click for readers on linux desktops between 3am and 5am?

• Given two sets of link texts, which has the higher CTR for a randomized sample of readers on tablet devices?
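Questions like these map directly onto ElasticSearch's query and aggregation DSL. A sketch of a query body for the CTR comparison, with hypothetical field names ("device", "link_text", "ctr" depend entirely on how events are mapped in the index):

```javascript
// Sketch of an ElasticSearch query body (hypothetical field names): filter to
// tablet readers, then compute average CTR per link text with a terms + avg
// aggregation.
var query = {
    size: 0, // we only want aggregates, not individual event documents
    query: {
        bool: {
            filter: [{ term: { device: 'tablet' } }]
        }
    },
    aggs: {
        by_link_text: {
            terms: { field: 'link_text' },
            aggs: {
                avg_ctr: { avg: { field: 'ctr' } }
            }
        }
    }
};

console.log(JSON.stringify(query, null, 2));
```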

Browser to Browser in < 5 seconds

But wait…

Is that “real-time”?

Real-Time for Real

• Live analysis of data as it is collected

• Active visualization of very short-term trends in data

Potential Problems

• Small sample sizes for new datasets / small analysis windows

• Data volumes too high for end-user comprehension

• Data volumes too high for end-user hardware/network connections

Version 4

[Diagram: Streamer → Receiver → Processor/Router → RabbitMQ → ElasticSearch & Websocket Server]

D3JS

• Open-source data visualization library written in JavaScript

function plot(point) {
    var points = svg.selectAll("circle")
        .data([point], function (d) { return d.id; });

    points.enter()
        .append("circle")
        .attr("cx", function (d) {
            return projection([parseInt(d.location.geopoint.lon),
                               parseInt(d.location.geopoint.lat)])[0];
        })
        .attr("cy", function (d) {
            return projection([parseInt(d.location.geopoint.lon),
                               parseInt(d.location.geopoint.lat)])[1];
        })
        .attr("r", function (d) { return 1; })
        .style('fill', 'red')
        .style('fill-opacity', 1)
        .style('stroke', 'red')
        .style('stroke-width', '0.5px')
        .style('stroke-opacity', 1)
        .transition()
        .duration(10000)
        .style('fill-opacity', 0)
        .style('stroke-opacity', 0)
        .attr('r', '32px')
        .remove();
}

var buffer = [];
var socket = io();
socket.on('geopoint', function (point) {
    if (point.location.geopoint) {
        plot(point);
    }
});

By the way…

xn = x + (r * cos(2π * n / v))
yn = y + (r * sin(2π * n / v))

where n = ordinal of vertex, v = number of vertices, and

x,y = center of the polygon
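The vertex formula can be sketched directly (cos for x, sin for y), computing vertex n of a regular v-gon of radius r centered at (x, y):

```javascript
// Vertices of a regular polygon: vertex n of a v-gon of radius r at (x, y).
function polygonVertices(x, y, r, v) {
    var vertices = [];
    for (var n = 0; n < v; n++) {
        var angle = (2 * Math.PI * n) / v;
        vertices.push({
            x: x + r * Math.cos(angle),
            y: y + r * Math.sin(angle)
        });
    }
    return vertices;
}

// Usage: a square (v = 4) of radius 1 centered at the origin, giving
// vertices at approximately (1,0), (0,1), (-1,0), (0,-1).
var square = polygonVertices(0, 0, 1, 4);
console.log(square.length); // 4
```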

var views = 0;
var socket = io();
socket.on('pageview', function (point) {
    views++;
});

function tick() {
    data.push(views);
    views = 0;

    path.attr("d", line)
        .attr("transform", null)
        .transition()
        .duration(500)
        .ease("linear")
        .attr("transform", "translate(" + x(0) + ",0)")
        .each("end", tick);

    data.shift();
}

tick();

Pageview Heartbeat

Real-Time Profiling

Receiver Layer

Receiver Buffer/Transit

Processing & Routing Layer

Processing & Routing Transit

Storage Engine End-User Consumable Queues

Layers are

• Geographically decoupled

• Capable of independent scaling

• Fully encapsulated with no cross-layer dependencies

Interfaces

Input Stream (Java)

Routing (node.js)

Filtering (node.js)

Aggregation (PHP)

Visualization (D3JS)

MV Testing (PHP)

Languages & Tools

RabbitMQ

Hadoop

Elastic Search

PHP

JS (node)

JS (D3)

Java

MySQL

Where are we Now?

• It took 6 months to build a rock-solid data pipeline

• Entry points from:

• User data collectors

• Application code

That was the easy part.

What’s next?

• Live debugging & runtime profiling

• Embeddable visualizations

• On-demand stream filters

• Predictive performance analysis

• Real-time sentiment analysis

???

@ieatkillerbees http://samanthaquinones.com

https://joind.in/13742