mark harwood - building entity centric indexes - nosql matters dublin 2015

Entity-Centric Indexing Mark Harwood @elasticmark 4/6/2015

Upload: nosqlmatters

Post on 05-Aug-2015


Category:

Data & Analytics





Entity-centric indexes

(or “when aggregations don’t cut it”)

www.elastic.co
Copyright Elastic 2015. Copying, publishing and/or distributing without written permission is strictly prohibited.

A typical “event-centric” deployment

[Diagram: an event stream feeding time-based event indexes]


Problem: some aggregations are expensive

Using web server log data, answer the question:

“How long on average do customers spend on my site?”

To answer it, we need to join all event-level data together at query time.


How to cripple Elasticsearch with a bucket explosion:

1. Ask a question about values that need to be derived from multiple documents (e.g. deriving a web session’s duration)

2. Make the joining key a high-cardinality field, e.g. something like “IP address”

3. Extra points if you use no routing of your documents, so that related content is spray-gunned across multiple shards
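As a rough illustration of the recipe above, the sketch below builds the kind of query-time request it describes: one bucket per visitor key, with the first and last hit recorded in each bucket, leaving the caller to subtract and average. The field names (`clientip`, `@timestamp`) and the bucket limit are assumptions for illustration, not details from the talk.

```python
def session_duration_request(key_field="clientip", time_field="@timestamp",
                             max_buckets=1_000_000):
    """Build a search body that derives per-visitor durations at query time.

    Field names are hypothetical; adjust them to your own log mapping.
    """
    return {
        "size": 0,  # only aggregation results are wanted, not hits
        "aggs": {
            "visitors": {
                # One bucket per distinct key: with a high-cardinality key
                # this is where the bucket explosion happens.
                "terms": {"field": key_field, "size": max_buckets},
                "aggs": {
                    "first_hit": {"min": {"field": time_field}},
                    "last_hit": {"max": {"field": time_field}},
                },
            }
        },
    }


def average_duration_ms(buckets):
    """Average (last_hit - first_hit) over buckets shaped like a terms response."""
    durations = [b["last_hit"]["value"] - b["first_hit"]["value"] for b in buckets]
    return sum(durations) / len(durations) if durations else 0.0
```

Every distinct key costs a bucket with two metrics, and without routing each shard can only return partial buckets that must be merged centrally, which is why this approach scales poorly.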


Solution

A “pay-as-you-go” model for the costs of fusing data


Solution: an “entity-centric” model

[Diagram: the usual stream of events feeds time-based event indexes; periodic extracts, sorted by entity ID and time, feed entity-based summary indexes]
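A minimal sketch of the idea, in plain Python rather than against a cluster: because the extract is sorted by entity ID and then time, each entity's events arrive together and can be collapsed into one summary in a single pass. The event tuple layout and the summary field names are assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter


def build_session_entities(events):
    """Collapse (session_id, timestamp, url) events into one summary per session.

    `events` must already be sorted by session_id and then timestamp,
    mirroring the periodic extracts sorted by entity ID and time.
    """
    entities = {}
    for session_id, group in groupby(events, key=itemgetter(0)):
        hits = list(group)
        entities[session_id] = {
            "start": hits[0][1],
            "end": hits[-1][1],
            "duration": hits[-1][1] - hits[0][1],
            "num_events": len(hits),
            "entry_page": hits[0][2],
            "exit_page": hits[-1][2],
        }
    return entities


events = sorted([
    ("s1", 100, "/home"), ("s2", 110, "/search"),
    ("s1", 160, "/cars"), ("s2", 130, "/contact"), ("s1", 220, "/checkout"),
])
sessions = build_session_entities(events)
average = sum(s["duration"] for s in sessions.values()) / len(sessions)
```

Once the summaries exist, "average session duration" becomes a simple average over one field per entity instead of a join across every event.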


Entity-centric queries

• Web sessions
  • “How long on average do my customers spend on my site?”
  • “Which users behave like bots?”
  • “What is the most common exit page?”

• Bank accounts
  • “Does this new payment match the typical spending behaviour of bank account X?”


Entity-centric queries

• Buyers
  • “What do the users who bought product X also buy?”
  • “Which buyers behave like ‘shills’ and who are they promoting?”

• Cars
  • “Which cars drove long distances after failing a roadworthiness test?”


Use case

Web log analytics


Use case: GFORCES

• Analyses website traffic for retailers and manufacturers in the automotive industry

• Summarises many behaviours over time, e.g.
  • unique numbers of visitors per month
  • engagement: average session durations

• Faced scaling issues producing some results from raw events


Results of moving to entity-centric indexing

• Data store contains 150m events generated by 26m user sessions
• Event-centric aggregations were taking ~25 seconds
• Equivalent entity-centric aggregations take <50ms
• Simplified queries for common entry pages, common exit pages etc.


Worked example

Amazon marketplace reviews: building profiles for reviewers

Play along! Code + data here: bit.ly/entcent


An “entity-centric” model

AmazonReviews (an event-centric index)

Loaded from reviews.csv by loadEvents.sh.

Review event fields:
• rating
• seller
• reviewer
• date

AmazonReviewers (an entity-centric index)

Built by buildEntities.sh, which:
• Drops and creates the reviewers index.
• Uses the Python client to query and scroll the list of reviews sorted by reviewerId and time.
• Has Python push _update requests, using the bulk indexing API, to ~400k “Reviewer” documents, each request containing a bundle of that reviewer's recent reviews.
• Relies on a shard-side Groovy script to collapse the multiple reviews into a single reviewer JSON document summarising behaviour.

Reviewer entity fields:
• positivity
• number of sellers reviewed
• last 50 reviews
• profile (“newbie”, “fanboy” etc.)
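The "push _update requests" step might look roughly like the sketch below, which turns a sorted list of reviews into newline-delimited bulk lines, one scripted update per reviewer. This is not the code from the talk's repository: the index name, the script id `reviewer_profile` and the `events` parameter are hypothetical, and the request body follows the shape used by recent Elasticsearch releases, which may differ in older ones.

```python
import json
from itertools import groupby


def bulk_update_lines(reviews, index="reviewers", script_id="reviewer_profile"):
    """Yield bulk-API lines: one scripted update per reviewer.

    `reviews` are dicts sorted by reviewerId and then date. The index name,
    script id and parameter name are illustrative assumptions.
    """
    for reviewer_id, group in groupby(reviews, key=lambda r: r["reviewerId"]):
        # Action line: which document to update.
        yield json.dumps({"update": {"_index": index, "_id": reviewer_id}})
        # Payload line: the bundle of events handed to the shard-side script.
        yield json.dumps({
            "script": {"id": script_id, "params": {"events": list(group)}},
            "scripted_upsert": True,  # run the script even for new reviewers
            "upsert": {},
        })


reviews = [
    {"reviewerId": "r1", "seller": "187", "rating": 5, "date": "2015-01-02"},
    {"reviewerId": "r1", "seller": "187", "rating": 5, "date": "2015-01-09"},
    {"reviewerId": "r2", "seller": "42", "rating": 2, "date": "2015-01-03"},
]
lines = list(bulk_update_lines(reviews))
```

Bundling all of a reviewer's events into one update means each entity document is rewritten once per batch rather than once per review.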


Anatomy of an entity indexing Groovy script

[Annotated script listing; the callouts read:]
• Initialize if new document
• Loop to consolidate latest events
• Re-run risk profile logic
• Load stored state

Store the script in ES_HOME/config/scripts/foo.groovy
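The script itself is not reproduced in this transcript. As a stand-in, here is a Python sketch of the same four stages applied to a reviewer document. It is not the Groovy script from the talk: the field names, the "regular" profile label and the thresholds are all made up for illustration.

```python
def update_reviewer(source, events, max_reviews=50):
    """Python stand-in for a shard-side entity update script.

    `source` is the stored document (an empty dict for a new reviewer);
    `events` is the bundle of new reviews sent with the update request.
    """
    # Initialize if new document
    if "reviews" not in source:
        source.update({"reviews": [], "sellers": [], "positivity": 0.0,
                       "profile": "newbie"})

    # Load stored state
    reviews = source["reviews"]
    sellers = set(source["sellers"])

    # Loop to consolidate latest events
    for event in events:
        reviews.append({"seller": event["seller"], "rating": event["rating"],
                        "date": event["date"]})
        sellers.add(event["seller"])
    del reviews[:-max_reviews]  # keep only the most recent reviews
    source["sellers"] = sorted(sellers)
    source["num_sellers"] = len(sellers)

    # Re-run profile logic (thresholds are hypothetical)
    ratings = [r["rating"] for r in reviews]
    source["positivity"] = sum(ratings) / len(ratings) if ratings else 0.0
    if len(reviews) < 5:
        source["profile"] = "newbie"
    elif source["positivity"] >= 4.5 and len(sellers) == 1:
        source["profile"] = "fanboy"
    else:
        source["profile"] = "regular"
    return source
```

For example, six five-star reviews of a single seller would be classified as "fanboy" under these invented rules, while a reviewer with one review stays a "newbie".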


Insight: which sellers have a lot of fanboys?

Seller #187 has more than his fair share of “fanboy” reviewers …


Drilling down into seller #187’s fanboys

Suspiciously synchronised behaviour


Worked example

UK 2013 car roadworthiness tests


Example background

• In the UK all vehicles must pass an annual roadworthiness test, called an MOT (named after the Ministry of Transport)

• It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre)

• Taxis and other forms of public transport have to be tested more frequently: every 6 months

• All data is freely available from data.gov.uk but with anonymised vehicle IDs and inexact test locations


Example background

MOTs index

Loaded from mots.csv by loadMOTs.sh, which:
• Drops and creates the mots index.
• Uses the Python client to bulk load all 37m roadworthiness test results for 2013 (data source: http://data.gov.uk/).

MOT event fields:
• result (pass/fail)
• vehicle ID
• make + model + age
• mileage
• test date
• test location

Cars index

Built by buildEntities.sh, which:
• Drops and creates the cars index.
• Registers CarProfileUpdater.groovy as a stored script.
• Uses the Python client to query and scroll the list of MOT test results sorted by vehicle ID and time.
• Has Python push _update requests, using the bulk indexing API, to ~27m “Car” documents, each request containing a bundle of related MOT test results.
• Relies on a shard-side Groovy script to collapse the multiple tests into a single summary JSON document for a car, deriving summaries such as the car entity fields below.

Car entity fields:
• make + model + age
• last test result, date, location
• miles driven while failed
• days between fail and fix
• complete test history
• suspected bad mileometer readings


Data fusion logic

Car attributes derived from 3 test result documents

[Chart: mile-o-meter reading (y-axis) against test date (x-axis) for test results 1, 2 and 3, annotated with the derived attributes daysForFix, badReading?, milesDrivenAfterFailure and mile-o-meterRewind]
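The chart's derived attributes can be approximated with a single pass over one car's test history. The sketch below is a guess at that logic rather than the contents of CarProfileUpdater.groovy: the input field names and the rule that any backwards mileage counts as a bad reading are assumptions, and `mileometerRewind` is spelled without hyphens for convenience.

```python
from datetime import date


def summarise_car(tests):
    """Derive car-level attributes from MOT tests sorted by test date.

    Each test is a dict with `date` (datetime.date), `mileage` (int) and
    `passed` (bool). Attribute names follow the slide; the rules are guesses.
    """
    summary = {"daysForFix": None, "milesDrivenAfterFailure": 0,
               "badReading": False, "mileometerRewind": 0}
    previous = None
    failed_at = None  # first test in the current run of failures
    for test in tests:
        if previous is not None:
            delta = test["mileage"] - previous["mileage"]
            if delta < 0:
                # Mileage went backwards: a suspicious reading.
                summary["badReading"] = True
                summary["mileometerRewind"] += -delta
            elif not previous["passed"]:
                # Distance covered since a failed test.
                summary["milesDrivenAfterFailure"] += delta
        if not test["passed"] and failed_at is None:
            failed_at = test
        elif test["passed"] and failed_at is not None:
            summary["daysForFix"] = (test["date"] - failed_at["date"]).days
            failed_at = None
        previous = test
    return summary


history = [
    {"date": date(2013, 1, 10), "mileage": 50000, "passed": False},
    {"date": date(2013, 1, 24), "mileage": 50300, "passed": True},
    {"date": date(2013, 12, 1), "mileage": 49000, "passed": True},
]
car = summarise_car(history)
```

With this three-test history the car took 14 days to fix, covered 300 miles after failing, and later shows a 1,300-mile drop in its reading.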


Recycling user behaviours

A user-centric index as a recommendation engine


Example background

Movielens data

• A public dataset* of 10m movie ratings made by 71k users
• One Elasticsearch document per user with a list of their movie ratings

* http://files.grouplens.org/datasets/movielens/ml-10m-README.html


“Uncommonly common” user behaviours
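One way to read “uncommonly common” is: items that appear far more often among users who share a behaviour than among all users. The sketch below scores movies that way over per-user documents. It is a deliberately naive ratio of foreground share to background share, offered as an illustration of the idea and not as Elasticsearch's own significance scoring; the data and the `min_count` cut-off are invented.

```python
from collections import Counter


def uncommonly_common(user_movies, liked_movie, min_count=2):
    """Rank movies over-represented among users who rated `liked_movie`.

    `user_movies` maps user id -> set of movie titles (one "document" per
    user). Score = foreground share / background share.
    """
    background = Counter(m for movies in user_movies.values() for m in movies)
    fans = [movies for movies in user_movies.values() if liked_movie in movies]
    foreground = Counter(m for movies in fans for m in movies if m != liked_movie)
    scores = {}
    for movie, count in foreground.items():
        if count < min_count:
            continue  # ignore one-off co-occurrences
        fg_share = count / len(fans)
        bg_share = background[movie] / len(user_movies)
        scores[movie] = fg_share / bg_share
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


users = {
    "u1": {"Alien", "Aliens", "Titanic"},
    "u2": {"Alien", "Aliens", "Titanic"},
    "u3": {"Alien", "Aliens"},
    "u4": {"Titanic", "Notting Hill"},
    "u5": {"Titanic"},
    "u6": {"Titanic", "Notting Hill"},
}
recommendations = uncommonly_common(users, "Alien")
```

Here "Titanic" is popular with everyone, so it scores below 1 for fans of "Alien", while "Aliens" is twice as common among those fans as in the population at large.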


Conclusions


Entity-centric indexing: advantages

• Efficient and simple queries
• Advanced analytics/insights
• Can provide a cheaper data retention policy (daily -> weekly -> monthly roll-ups)
• Can reuse existing Elasticsearch APIs or build entity documents using external technologies


Entity-centric indexing: tips

• Avoid “fat entities”
  • Use forgetful collections: priority queues, circular buffers, HyperLogLog

• Avoid pointless updates
  • Use ctx.op = "none" to avoid writes of insignificant changes

• Consider options for reducing event volumes:
  • Use aggregations when gathering events
  • Reduce related events in the event-gathering script that issues updates

• Parallelise the pull of event information
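Two of these tips can be sketched together in Python: a circular buffer that forgets old items, and a check that skips the write when the summary has not changed. The returned op string is only an analogue of setting ctx.op in an update script; the field names, thresholds and the choice of what counts as "insignificant" are assumptions.

```python
from collections import deque


def apply_ratings(source, ratings, max_reviews=50):
    """Return (document, op); op is "none" when nothing significant changed."""
    # Circular buffer: once full, a deque with maxlen drops its oldest items.
    recent = deque(source.get("recent", []), maxlen=max_reviews)
    recent.extend(ratings)
    positivity = round(sum(recent) / len(recent), 1) if recent else 0.0
    profile = "fanboy" if len(recent) >= 5 and positivity >= 4.5 else "newbie"
    if (positivity, profile) == (source.get("positivity"), source.get("profile")):
        # Analogous to ctx.op = "none": keep the stored document, skip the write.
        return source, "none"
    return {"recent": list(recent), "positivity": positivity,
            "profile": profile}, "index"


doc, op = apply_ratings({}, [5, 5, 5, 5, 5])
same_doc, second_op = apply_ratings(doc, [5])
```

The trade-off in this sketch is that a skipped write also discards the new ratings, so it only suits cases where the summary, not the raw buffer, is what queries depend on.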


Entity-centric indexing

• Incremental entity updates can be achieved by querying all events since the timestamp of the last run
• Data integrity: implement policies for:
  • handling any failures in performing entity updates
  • retiring old entities (use of TTL?)
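An incremental run could select its events with a range filter on the event timestamp, as in the sketch below. The field names are hypothetical, and where the last-run timestamp is stored is left to the caller.

```python
def events_since(last_run_iso, entity_field="vehicle_id", time_field="test_date"):
    """Build a search body selecting events newer than the previous run.

    Field names are hypothetical. Results are sorted by entity and then time
    so that each entity's events arrive together and in order.
    """
    return {
        "query": {"range": {time_field: {"gt": last_run_iso}}},
        "sort": [{entity_field: "asc"}, {time_field: "asc"}],
    }


body = events_since("2015-06-04T00:00:00Z")
```

Recording the new high-water mark only after a batch of updates succeeds is one simple policy for the failure-handling point above, since a failed run is then retried from the old timestamp.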


@elasticmark

Questions?