Divolte Collector overview


Upload: godatadriven

Post on 16-Apr-2017


Page 1: Divolte collector overview

GoDataDriven PROUDLY PART OF THE XEBIA GROUP

@asnare / @fzk / @godatadriven [email protected]

Divolte Collector

Andrew Snare / Friso van Vollenhoven

Because life’s too short for log file parsing

Page 2: Divolte collector overview

99% of all data in Hadoop

156.68.7.63 - - [28/Jul/1995:11:53:28 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669
137.244.160.140 - - [28/Jul/1995:11:53:29 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0
163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 4324
163.205.160.5 - - [28/Jul/1995:11:53:40 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 200 46573
140.229.50.189 - - [28/Jul/1995:11:53:54 -0400] "GET /shuttle/missions/sts-67/images/images.html HTTP/1.0" 200 4464
163.206.89.4 - - [28/Jul/1995:11:54:02 -0400] "GET /shuttle/technology/sts-newsref/sts-mps.html HTTP/1.0" 200 215409
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891
131.110.53.48 - - [28/Jul/1995:11:54:07 -0400] "GET /shuttle/technology/sts-newsref/stsref-toc.html HTTP/1.0" 200 84905
163.205.160.5 - - [28/Jul/1995:11:54:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
130.160.196.81 - - [28/Jul/1995:11:54:15 -0400] "GET /shuttle/resources/orbiters/challenger.html HTTP/1.0" 200 8089
131.110.53.48 - - [28/Jul/1995:11:54:16 -0400] "GET /images/shuttle-patch-small.gif HTTP/1.0" 200 4179
137.244.160.140 - - [28/Jul/1995:11:54:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10136
131.110.53.48 - - [28/Jul/1995:11:54:18 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
131.110.53.48 - - [28/Jul/1995:11:54:19 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713
130.160.196.81 - - [28/Jul/1995:11:54:19 -0400] "GET /shuttle/resources/orbiters/challenger-logo.gif HTTP/1.0" 200 4179
163.205.160.5 - - [28/Jul/1995:11:54:25 -0400] "GET /shuttle/missions/sts-70/images/images.html HTTP/1.0" 200 8657
130.181.4.158 - - [28/Jul/1995:11:54:26 -0400] "GET /history/rocket-history.txt HTTP/1.0" 200 26990
137.244.160.140 - - [28/Jul/1995:11:54:30 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 304 0
137.244.160.140 - - [28/Jul/1995:11:54:31 -0400] "GET /images/launch-logo.gif HTTP/1.0" 304 0
137.244.160.140 - - [28/Jul/1995:11:54:38 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 304 0
168.178.17.149 - - [28/Jul/1995:11:54:48 -0400] "GET /shuttle/missions/sts-65/mission-sts-65.html HTTP/1.0" 200 131165
140.229.50.189 - - [28/Jul/1995:11:54:53 -0400] "GET /shuttle/missions/sts-67/images/KSC-95EC-0390.jpg HTTP/1.0" 200 128881
131.110.53.48 - - [28/Jul/1995:11:54:58 -0400] "GET /shuttle/missions/missions.html HTTP/1.0" 200 8677
131.110.53.48 - - [28/Jul/1995:11:55:02 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853
131.110.53.48 - - [28/Jul/1995:11:55:05 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
128.159.111.141 - - [28/Jul/1995:11:55:09 -0400] "GET /procurement/procurement.html HTTP/1.0" 200 3499
128.159.111.141 - - [28/Jul/1995:11:55:10 -0400] "GET /images/op-logo-small.gif HTTP/1.0" 200 14915
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
192.213.154.220 - - [28/Jul/1995:11:55:15 -0400] "GET /shuttle/countdown/tour.html HTTP/1.0" 200 4347
192.213.154.220 - - [28/Jul/1995:11:55:15 -0400] "GET /images/KSC-94EC-412-small.gif HTTP/1.0" 200 20484
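
To make the cost of log file parsing concrete, here is a minimal sketch of a parser for access-log lines in the format shown on this slide. The pattern and the `parse_clf_line` name are illustrative assumptions, not something from the talk, and real logs tend to need more cases than this handles.

```python
import re
from datetime import datetime

# Assumed pattern for the Common Log Format lines shown on the slide.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 28/Jul/1995:11:53:31 -0400.
    record['time'] = datetime.strptime(record['time'], '%d/%b/%Y:%H:%M:%S %z')
    record['status'] = int(record['status'])
    # Some servers log "-" when no body was sent.
    record['size'] = 0 if record['size'] == '-' else int(record['size'])
    return record
```

Even this small parser already encodes assumptions about field order, quoting and timestamp format, which is the maintenance burden the talk is pointing at.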

Page 3: Divolte collector overview

How do we use our data?

• Ad hoc

• Batch

• Streaming

Page 4: Divolte collector overview

Typical web optimization architecture

[Diagram labels]

USER

HTTP request: /org/apache/hadoop/io/IOUtils.html

log transport service

log event: 2012-07-01T06:00:02.500Z /org/apache/hadoop/io/IOUtils.html

transport logs to compute cluster

offline analytics / model training

batch update model state

streaming log processing

streaming update model state

serve model result (e.g. recommendations)

Page 5: Divolte collector overview

Parse HTTP server logs

access.log

Page 6: Divolte collector overview

How did it get there?

Option 1: parse HTTP server logs

• Ship log files on a schedule

• Parse using MapReduce jobs

• Batch analytics jobs feed online systems

Page 7: Divolte collector overview

HTTP server log parsing

• Inherently batch oriented

• Schema-less (URL format is the schema)

• Initial job to parse logs into structured format

• Usually multiple versions of parsers required

• Requires sessionizing

• Logs usually contain more than you ask for (bots, image requests, spiders, health checks, etc.)
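
As a rough illustration of what "requires sessionizing" means for server logs, the sketch below groups requests into sessions by inactivity gap. The 30-minute timeout, the use of the client host as the user identity, and the `sessionize` name are all assumptions made for the example.

```python
from collections import defaultdict

SESSION_TIMEOUT_SECONDS = 30 * 60  # assumed inactivity threshold


def sessionize(requests, timeout=SESSION_TIMEOUT_SECONDS):
    """Group (host, unix_timestamp, path) tuples into per-host sessions.

    A new session starts whenever a host has been silent for longer
    than `timeout` seconds.
    """
    sessions = defaultdict(list)  # host -> list of sessions (lists of paths)
    last_seen = {}
    for host, timestamp, path in sorted(requests, key=lambda r: r[1]):
        if host not in last_seen or timestamp - last_seen[host] > timeout:
            sessions[host].append([])
        sessions[host][-1].append(path)
        last_seen[host] = timestamp
    return dict(sessions)
```

Using the host as identity breaks down behind proxies and shared IP addresses, which is one reason the cookie-based approach later in the talk is attractive.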

Page 8: Divolte collector overview

Stream HTTP server logs

[Diagram: access.log is read with tail -F and fed into a Message Queue or Event Transport (Kafka, Flume, etc.), which delivers EVENTS to OTHER CONSUMERS]

Page 9: Divolte collector overview

How did it get there?

Option 2: stream HTTP server logs

• tail -F logfiles

• Use a queue for transport (e.g. Flume or Kafka)

• Parse logs on the fly

• Or write semi-schema’d logs, like JSON

• Parse again for the batch workload
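
The streaming option can be sketched as follows: follow a growing log file and hand each new line to a publisher as a small JSON envelope. The `follow` and `stream_events` names and the `publish` callback are hypothetical; a real setup would publish to something like Kafka or Flume, and unlike `tail -F` this sketch does not handle log rotation or partially written lines.

```python
import json
import time


def follow(handle, poll_interval=0.5, max_idle_polls=None):
    """Yield lines appended to an open file, roughly like `tail -f`.

    `max_idle_polls` bounds the number of consecutive empty reads;
    None means keep polling forever.
    """
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        line = handle.readline()
        if line:
            idle = 0
            yield line.rstrip('\n')
        else:
            idle += 1
            time.sleep(poll_interval)


def stream_events(handle, publish, **follow_args):
    """Wrap each raw line in a JSON envelope and pass it to `publish`."""
    for line in follow(handle, **follow_args):
        publish(json.dumps({'raw': line}))
```

Note that the raw line still has to be parsed by every consumer, which is the "still requires parser logic" point made on the next slide.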

Page 10: Divolte collector overview

Stream HTTP server logs

• Allows for near real-time event handling when consuming from queues

• Sessionizing? Duplicates? Bots?

• Still requires parser logic

• No schema

Page 11: Divolte collector overview

Tagging

[Diagram: index.html and script.js are served by the web server (web page traffic, logged in access.log); asynchronous tracking traffic goes to a tracking server, which sends structured events to a Message Queue or Event Transport (Kafka, Flume, etc.); structured EVENTS flow on to OTHER CONSUMERS]

Page 12: Divolte collector overview

How did it get there?

Option 3: tagging

• Instrument pages with special ‘tag’, i.e. special JavaScript or image just for logging the request

• Create special endpoint that handles the tag request in a structured way

• Tag endpoint handles logging the events
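
A tag endpoint essentially turns a tracking request into a structured event. The sketch below shows that step in isolation, assuming a hypothetical wire format with `event`, `page` and `vw` query parameters; Divolte Collector's actual request format is not described in these slides.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter names; a real tag defines its own wire format.
REQUIRED_PARAMS = ('event', 'page')


def handle_tag_request(url):
    """Turn a tag request URL into a structured event, or None if malformed."""
    query = parse_qs(urlsplit(url).query)
    if any(name not in query for name in REQUIRED_PARAMS):
        return None
    viewport = query.get('vw', [''])[0]
    return {
        'eventType': query['event'][0],
        'location': query['page'][0],
        # Optional field: only accept plain digits.
        'viewportWidth': int(viewport) if viewport.isdigit() else None,
    }
```

Because the tag controls which parameters are sent, the endpoint can reject malformed requests up front instead of leaving that to downstream parsers.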

Page 13: Divolte collector overview

Tagging

• Not a new idea (Google Analytics, Omniture, etc.)

• Less garbage traffic, because a browser is required to evaluate the tag

• Event logging is asynchronous

• Easier to do in-flight processing (apply a schema, add enrichments, etc.)

• Allows for custom events (other than page view)

Page 14: Divolte collector overview

Also…

• Manage session through cookies on the client side

• Incoming data is already sessionized

• Extract additional information from clients

  • Screen resolution

  • Viewport size

  • Timezone

Page 15: Divolte collector overview

Looks familiar?

<script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-40578233-2', 'godatadriven.com'); ga('send', 'pageview');

</script>

Page 16: Divolte collector overview

Divolte Collector

Click stream data collection for Hadoop and Kafka.

Page 17: Divolte collector overview

Divolte Collector

[Diagram: index.html and script.js are served by the web server (web page traffic, logged in access.log); asynchronous tracking traffic goes to a tracking server, which sends structured events to a Message Queue or Event Transport (Kafka, Flume, etc.); structured EVENTS flow on to OTHER CONSUMERS]

Page 18: Divolte collector overview

Divolte Collector : Vision

• Focus purely on collection

• Processing is a separate concern

• Minimal on the fly enrichment

• The Hadoop tools ecosystem evolves too fast to compete (SQL solutions, streaming, machine learning, etc.)

• Just provide data

• Data source for custom data science solutions

• Not a web analytics solution per se; descriptive web analytics is a side effect

• Use cases will vary; try not to make too many assumptions about users’ needs

Page 19: Divolte collector overview

Divolte Collector : Vision

• Solve the web-specific tricky parts

  • ID generation on client side (JavaScript)

  • In-stream duplicate detection

• Schema!

  • Data will be written in a schema-evolution-friendly open format (Apache Avro)

  • No arbitrary (JSON) objects
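
One way to picture in-stream duplicate detection is a bounded memory of recently seen event IDs. This is only a sketch under that assumption; the `RecentEventFilter` class is hypothetical and the slides do not say how Divolte Collector implements the check.

```python
from collections import OrderedDict


class RecentEventFilter:
    """Remember the last `capacity` event IDs and flag repeats."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, event_id):
        if event_id in self._seen:
            # Refresh the ID so that it is evicted later.
            self._seen.move_to_end(event_id)
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return False
```

The bounded capacity keeps memory use fixed at the cost of missing duplicates that arrive after their ID has been evicted, so this depends on IDs being generated on the client side as listed above.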

Page 20: Divolte collector overview

JavaScript based tag

<body>
  <!-- Your page content here. -->

  <!-- Include Divolte Collector just before the closing body tag -->
  <script src="//example.com/divolte.js" defer async></script>
</body>

Page 21: Divolte collector overview

Effectively stateless

Page 22: Divolte collector overview

Data with a schema in Avro

{
  "namespace": "com.example.record",
  "type": "record",
  "name": "MyEventRecord",
  "fields": [
    { "name": "location", "type": "string" },
    { "name": "pageType", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}
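
To show what having a schema buys you, the sketch below checks a Python dict against the record schema on this slide using only the standard library. It is a simplified stand-in rather than the Avro library: the `conforms` helper is hypothetical, it only understands the two primitive types used here, and it does not check the 64-bit range of `long`.

```python
import json

SCHEMA = json.loads('''
{
  "namespace": "com.example.record",
  "type": "record",
  "name": "MyEventRecord",
  "fields": [
    {"name": "location", "type": "string"},
    {"name": "pageType", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
''')

# Only the primitive types that appear in the schema above.
PYTHON_TYPES = {'string': str, 'long': int}


def conforms(record, schema=SCHEMA):
    """Return True if `record` has exactly the schema's fields and types."""
    fields = schema['fields']
    if set(record) != {field['name'] for field in fields}:
        return False
    for field in fields:
        value = record[field['name']]
        expected = PYTHON_TYPES[field['type']]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True
```

Consumers that read Avro data get this guarantee from the file format itself, which is why no per-consumer parsing or validation code is needed.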

Page 23: Divolte collector overview

Map incoming data onto Avro records

mapping {
  map clientTimestamp() onto 'timestamp'
  map location() onto 'location'

  def u = parse location() to uri
  section {
    when u.path().equalTo('/checkout') apply {
      map 'checkout' onto 'pageType'
      exit()
    }
    map 'normal' onto 'pageType'
  }
}
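
For readers less familiar with the mapping DSL, the logic of this mapping can be paraphrased in Python. This is only a reading aid: the `map_event` function is hypothetical and does not reflect how Divolte Collector evaluates mappings internally.

```python
from urllib.parse import urlsplit


def map_event(location, client_timestamp):
    """Build a record with the same three fields the mapping above fills in."""
    record = {'timestamp': client_timestamp, 'location': location}
    path = urlsplit(location).path
    # The section exits early for the checkout page; everything else is 'normal'.
    record['pageType'] = 'checkout' if path == '/checkout' else 'normal'
    return record
```

For example, `map_event('http://shop.example.com/checkout?step=2', 1400000000000)` would yield a record whose `pageType` is `'checkout'`.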

Page 24: Divolte collector overview

User agent parsing

map userAgent().family() onto 'browserName'
map userAgent().osFamily() onto 'operatingSystemName'
map userAgent().osVersion() onto 'operatingSystemVersion'

// Etc... More fields available

Page 25: Divolte collector overview

IP to geolocation lookup

Page 26: Divolte collector overview

Useful performance

Requests per second:    14010.80 [#/sec] (mean)
Time per request:       0.571 [ms] (mean)
Time per request:       0.071 [ms] (mean, across all concurrent requests)
Transfer rate:          4516.55 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     0    0   0.2      0       3
Waiting:        0    0   0.2      0       3
Total:          0    1   0.2      1       3

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%      3 (longest request)
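
The two "Time per request" figures can be reconciled with a little arithmetic. Assuming this output comes from an ApacheBench-style tool, where the first figure is the second multiplied by the number of concurrent clients, the numbers imply a concurrency level of about 8. The slide does not state the concurrency, so this is an inference.

```python
requests_per_second = 14010.80
mean_time_per_request_ms = 0.571

# Mean time per request across all concurrent requests: 1000 ms / throughput.
time_across_concurrent_ms = 1000.0 / requests_per_second

# The per-client mean divided by the across-all mean gives the implied
# number of concurrent clients.
implied_concurrency = mean_time_per_request_ms / time_across_concurrent_ms

print(round(time_across_concurrent_ms, 3))  # matches the reported 0.071 ms
print(round(implied_concurrency))           # about 8 concurrent clients
```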

Page 27: Divolte collector overview

Custom events

In the page (JavaScript):

divolte.signal('addToBasket', {
  productId: 309125,
  count: 1
})

In the mapping (Groovy):

map eventParameter('productId') onto 'basketProductId'
map eventParameter('count') onto 'basketNumProducts'

Page 28: Divolte collector overview

Avro data, use any tool

Page 29: Divolte collector overview

Divolte Collector

• http://divolte.io

• Apache License, Version 2.0

Page 30: Divolte collector overview

Examples

Page 31: Divolte collector overview

Ad hoc

Page 32: Divolte collector overview

Batch

Page 33: Divolte collector overview

Online

Page 34: Divolte collector overview

Example

Page 35: Divolte collector overview

Example

Page 36: Divolte collector overview

Approach

1. Pick n images randomly

2. Optimise displayed image using bandit optimisation

3. After X iterations:

  • Pick n / 2 new images randomly

  • Select n / 2 images from existing set using learned distribution

  • Construct new set of images using half of existing set and newly selected random images

4. Go to step 2

Page 37: Divolte collector overview

Bayesian Bandits

• For each image, keep track of:

  • Number of impressions

  • Number of clicks

• When serving an image:

  • Draw a random number from a Beta distribution with parameters alpha = # of clicks, beta = # of impressions, for each image

  • Show image where sample value is largest
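
The serving rule above fits in a few lines using only the standard library. This sketch follows the slide's parameterization (alpha = clicks, beta = impressions) and assumes both counts are at least 1; the `pick_image` name and the `stats` layout are illustrative. A common textbook variant uses the number of non-clicked impressions for beta instead.

```python
import random


def pick_image(stats, rng=random):
    """Pick the image whose Beta sample is largest.

    `stats` maps image ID -> (clicks, impressions); both counts must be > 0.
    """
    best_id, best_sample = None, -1.0
    for image_id, (clicks, impressions) in stats.items():
        # One draw per image from Beta(alpha=clicks, beta=impressions).
        sample = rng.betavariate(clicks, impressions)
        if sample > best_sample:
            best_id, best_sample = image_id, sample
    return best_id
```

Because each choice is a random draw, images with few observations still get shown occasionally, while images with a strong click record win most of the time.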

Page 38: Divolte collector overview

Bayesian Bandits

• https://en.wikipedia.org/wiki/Multi-armed_bandit

• http://tdunning.blogspot.nl/2012/02/bayesian-bandits.html

• https://www.chrisstucchio.com/blog/2013/bayesian_bandit.html

Page 39: Divolte collector overview

Prototype UI

from tornado.gen import coroutine

class HomepageHandler(ShopHandler):
    @coroutine
    def get(self):
        # Hard-coded ID for a pretty flower.
        # Later this ID will be decided by the bandit optimization.
        winner = '15442023790'

        # Grab the item details from our catalog service.
        top_item = yield self._get_json('catalog/item/%s' % winner)

        # Render the homepage.
        self.render(
            'index.html',
            top_item=top_item)

Page 40: Divolte collector overview

Prototype UI

<div class="col-md-6">
  <h4>Top pick:</h4>
  <p>
    <!-- Link to the product page with a source identifier for tracking -->
    <a href="/product/{{ top_item['id'] }}/#/?source=top_pick">
      <img class="img-responsive img-rounded" src="{{ top_item['variants']['Medium']['img_source'] }}">
      <!-- Signal that we served an impression of this image -->
      <script>divolte.signal('impression', { source: 'top_pick', productId: '{{ top_item['id'] }}'})</script>
    </a>
  </p>
  <p>
    Photo by {{ top_item['owner']['real_name'] or top_item['owner']['user_name']}}
  </p>
</div>

Page 41: Divolte collector overview

Data collection in Divolte Collector

{
  "name": "source",
  "type": ["null", "string"],
  "default": null
}

def locationUri = parse location() to uri

when eventType().equalTo('pageView') apply {
  def fragmentUri = parse locationUri.rawFragment() to uri
  map fragmentUri.query().value('source') onto 'source'
}

when eventType().equalTo('impression') apply {
  map eventParameters().value('productId') onto 'productId'
  map eventParameters().value('source') onto 'source'
}

Page 42: Divolte collector overview

Keep counts in Redis

{
  'c|14502147379': '2',
  'c|15106342717': '2',
  'c|15624953471': '1',
  'c|9609633287': '1',
  'i|14502147379': '2',
  'i|15106342717': '3',
  'i|15624953471': '2',
  'i|9609633287': '3'
}
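
The hash above appears to store clicks under a `c|` prefix and impressions under an `i|` prefix, with the counts as strings. Under that reading, a small helper can regroup it per item; the `split_counts` name is hypothetical and not part of the talk's code.

```python
def split_counts(item_hash):
    """Regroup a flat {'c|id': n, 'i|id': n} hash into id -> (clicks, impressions)."""
    stats = {}
    for key, value in item_hash.items():
        kind, item_id = key.split('|', 1)
        clicks, impressions = stats.get(item_id, (0, 0))
        if kind == 'c':
            clicks = int(value)
        elif kind == 'i':
            impressions = int(value)
        stats[item_id] = (clicks, impressions)
    return stats
```

Applied to the hash on this slide, item `15106342717` would come out as 2 clicks and 3 impressions.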

Page 43: Divolte collector overview

Consuming Kafka in Python

import avro.io
import avro.schema
from kafka import KafkaConsumer

def start_consumer(args):
    # Load the Avro schema used for serialization.
    schema = avro.schema.Parse(open(args.schema).read())

    # Create a Kafka consumer and Avro reader. Note that
    # it is trivially possible to create a multi process
    # consumer.
    consumer = KafkaConsumer(args.topic,
                             client_id=args.client,
                             group_id=args.group,
                             metadata_broker_list=args.brokers)
    reader = avro.io.DatumReader(schema)

    # Consume messages.
    for message in consumer:
        handle_event(message, reader)

Page 44: Divolte collector overview

Consuming Kafka in Python

import io

import avro.io

def handle_event(message, reader):
    # Decode Avro bytes into a Python dictionary.
    message_bytes = io.BytesIO(message.value)
    decoder = avro.io.BinaryDecoder(message_bytes)
    event = reader.read(decoder)

    # Event logic.
    if 'top_pick' == event['source'] and 'pageView' == event['eventType']:
        # Register a click.
        redis_client.hincrby(
            ITEM_HASH_KEY,
            CLICK_KEY_PREFIX + ascii_bytes(event['productId']),
            1)
    elif 'top_pick' == event['source'] and 'impression' == event['eventType']:
        # Register an impression and increment experiment count.
        p = redis_client.pipeline()
        p.incr(EXPERIMENT_COUNT_KEY)
        p.hincrby(
            ITEM_HASH_KEY,
            IMPRESSION_KEY_PREFIX + ascii_bytes(event['productId']),
            1)
        experiment_count, ignored = p.execute()

        if experiment_count == REFRESH_INTERVAL:
            refresh_items()

Page 45: Divolte collector overview

import numpy

def refresh_items():
    # Fetch current model state. We convert everything to str.
    current_item_dict = redis_client.hgetall(ITEM_HASH_KEY)
    current_items = numpy.unique([k[2:] for k in current_item_dict.keys()])

    # Fetch random items from ElasticSearch. Note we fetch more than we need,
    # but we filter out items already present in the current set and truncate
    # the list to the desired size afterwards.
    random_items = [
        ascii_bytes(item)
        for item in random_item_set(NUM_ITEMS + NUM_ITEMS - len(current_items) // 2)
        if item not in current_items][:NUM_ITEMS - len(current_items) // 2]

    # Draw random samples.
    samples = [
        numpy.random.beta(
            int(current_item_dict[CLICK_KEY_PREFIX + item]),
            int(current_item_dict[IMPRESSION_KEY_PREFIX + item]))
        for item in current_items]

    # Select top half by sample values. current_items is conveniently
    # a NumPy array here.
    survivors = current_items[numpy.argsort(samples)[len(current_items) // 2:]]

    # New item set is survivors plus the random ones.
    new_items = numpy.concatenate([survivors, random_items])

    # Update model state to reflect new item set. This operation is atomic
    # in Redis.
    p = redis_client.pipeline(transaction=True)
    p.set(EXPERIMENT_COUNT_KEY, 1)
    p.delete(ITEM_HASH_KEY)
    for item in new_items:
        p.hincrby(ITEM_HASH_KEY, CLICK_KEY_PREFIX + item, 1)
        p.hincrby(ITEM_HASH_KEY, IMPRESSION_KEY_PREFIX + item, 1)
    p.execute()

Page 46: Divolte collector overview

Serving a recommendation

import numpy
from tornado import gen, web

class BanditHandler(web.RequestHandler):
    redis_client = None

    def initialize(self, redis_client):
        self.redis_client = redis_client

    @gen.coroutine
    def get(self):
        # Fetch model state.
        item_dict = yield gen.Task(self.redis_client.hgetall, ITEM_HASH_KEY)
        items = numpy.unique([k[2:] for k in item_dict.keys()])

        # Draw random samples.
        samples = [
            numpy.random.beta(
                int(item_dict[CLICK_KEY_PREFIX + item]),
                int(item_dict[IMPRESSION_KEY_PREFIX + item]))
            for item in items]

        # Select item with largest sample value.
        winner = items[numpy.argmax(samples)]

        self.write(winner)

Page 47: Divolte collector overview

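Integrate

from tornado.escape import json_decode
from tornado.gen import coroutine
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

class HomepageHandler(ShopHandler):
    @coroutine
    def get(self):
        # Ask the bandit service which item to show.
        http = AsyncHTTPClient()
        request = HTTPRequest(url='http://localhost:8989/item', method='GET')
        response = yield http.fetch(request)
        winner = json_decode(response.body)

        # Grab the item details from our catalog service.
        top_item = yield self._get_json('catalog/item/%s' % winner)

        self.render(
            'index.html',
            top_item=top_item)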

Page 48: Divolte collector overview

Roadmap

Page 49: Divolte collector overview

Server side - short term

• Allow multiple sources / sink channels

• With different input → schema mappings

• Server side events

• Support for server side event logging (JSON endpoint)

• Enabler for mobile SDKs

• Trivial to add a pixel based endpoint (server managed cookies)

Page 50: Divolte collector overview

Client side

• Specific browser related bug fixes (IE9)

• Allow for setting session scoped parameters

• JavaScript Data Layer

Page 51: Divolte collector overview

Collector next steps

• Integrate with Planout (https://facebook.github.io/planout/)

• Allow definition of online experiments in one place

• All event logging automatically includes random parameters generated for experiment selection

• Single solution for data collection for online experimentation / optimization

Page 52: Divolte collector overview

References

• http://blog.godatadriven.com/rapid-prototyping-online-machine-learning-divolte-collector.html

• http://divolte.io

• https://github.com/divolte/divolte-collector

• https://github.com/divolte/divolte-examples

Page 53: Divolte collector overview

We’re hiring / Questions? / Thank you!

@asnare / @fzk / @godatadriven [email protected]

Andrew Snare / Friso van Vollenhoven