Designing The Right Schema To Power Heap (PGConf Silicon Valley 2016)

Designing The Right Schema To Power Heap Dan Robinson CTO, Heap


Post on 16-Apr-2017


TRANSCRIPT

Page 1: Designing The Right Schema To Power Heap (PGConf Silicon Valley 2016)

Designing The Right Schema To Power Heap

Dan Robinson CTO, Heap

Page 2

whoami

• Joined as Heap's first hire in July 2013

• Previously a backend engineer at Palantir

• Stanford '11 in Math and CS

Page 3

Overview

• What is Heap?

• Why is what we're building such a difficult data problem?

• Four different ways we've tried to solve it.

Page 6

    bookHotelButton.addEventListener("click", function() {
      Analytics.track('Booked Hotel');
    });

Page 8

    listingDetailPage.addEventListener("load", function() {
      Analytics.track('Viewed A Listing');
    });

    ...

    if (signInAttempt.isSuccessful) {
      Analytics.track('Signed In');
    }

    ...

    submitCreditCardButton.addEventListener("click", function() {
      Analytics.track('Entered Credit Card');
    });

Page 10

Analytics is fundamentally iterative.

Page 11

Capture everything that happens.

Analyze the data retroactively.
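
One way to picture the contrast with the explicit Analytics.track calls on the earlier slides: if every raw interaction is stored, an event such as "Booked Hotel" can be defined later as a predicate and applied to data that already exists. The following is a minimal Python sketch of that idea only. The field names (type, target_text, time) and the sample rows are assumptions made for illustration, not Heap's capture format.

```python
# Hypothetical raw interactions, captured before anyone decided
# which of them matter.
raw_events = [
    {"time": 1, "type": "pageview", "path": "/listing/42"},
    {"time": 2, "type": "click", "target_text": "Book Hotel"},
    {"time": 3, "type": "click", "target_text": "Cancel"},
    {"time": 4, "type": "click", "target_text": "Book Hotel"},
]


def booked_hotel(event):
    """An event definition is a predicate, so it can be written after the fact."""
    return event["type"] == "click" and event.get("target_text") == "Book Hotel"


def apply_definition(definition, events):
    """Return every stored event that satisfies the definition."""
    return [e for e in events if definition(e)]


matches = apply_definition(booked_hotel, raw_events)
print(len(matches))  # 2
```

Because the definition is applied at query time, changing it needs no new tracking code and no wait for fresh data.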

Page 23

Challenges

1. Capturing 10x to 100x as much data. Will never care about 95% of it.

2. Funnels, retention, behavioral cohorts, grouping, filtering... can't pre-aggregate.

3. Queries need to reflect data within a few minutes of real-time.

Page 25

Possibly Useful Observations

1. Data is mostly write-once, never update.

2. Queries map nicely to relational model.

3. Events have a natural ordering (time) which is mostly monotonic.

4. Analyses are always in terms of defined events.

Page 27

Attempt #1: Vanilla Boyce-Codd

Page 28

    CREATE TABLE "user" (  -- quoted because user is a reserved word
        customer_id BIGINT,
        user_id     BIGINT,
        properties  JSONB NOT NULL
    );
    -- Slide brace marks the PRIMARY KEY, presumably (customer_id, user_id).

Page 29

    CREATE TABLE session (
        customer_id BIGINT,
        user_id     BIGINT,
        session_id  BIGINT,
        time        BIGINT NOT NULL,
        properties  JSONB NOT NULL
    );
    -- Slide braces mark the PRIMARY KEY and FOREIGN KEY ("user") columns, presumably the *_id columns.

Page 30

    CREATE TABLE pageview (
        customer_id BIGINT,
        user_id     BIGINT,
        session_id  BIGINT,
        pageview_id BIGINT,
        time        BIGINT NOT NULL,
        properties  JSONB NOT NULL
    );
    -- Slide braces mark the PRIMARY KEY and FOREIGN KEY (session) columns, presumably the *_id columns.

Page 31

    CREATE TABLE event (
        customer_id BIGINT,
        user_id     BIGINT,
        session_id  BIGINT,
        pageview_id BIGINT,
        event_id    BIGINT,
        time        BIGINT NOT NULL,
        properties  JSONB NOT NULL
    );
    -- Slide braces mark the PRIMARY KEY and FOREIGN KEY (pageview) columns, presumably the *_id columns.

Page 35

Pros Of Schema #1

1. Simple, easy to understand.

2. Can express basically all analysis in plain old SQL. Plays nicely with ORMs. Just works.

3. Not much surface area for data inconsistencies.

You should basically always start here.

Page 36

Pro: got us to launch!

Con: too many joins, even for simple analyses. Queries too slow for large customers.

Page 37

Possibly Useful Observations

1. Data is mostly write-once, never update.

2. Queries map nicely to relational model.

3. Events have a natural ordering (time) which is mostly monotonic.

4. Analyses are always in terms of defined events.

5. Aggregations partition cleanly at the user level.

Page 38

Attempt #2: Denormalize Everything Onto The User

Page 39

    CREATE TABLE user_events (
        customer_id     BIGINT,
        user_id         BIGINT,
        time_first_seen BIGINT NOT NULL,
        properties      JSONB NOT NULL,
        events          JSONB[] NOT NULL
    );
    -- Slide brace marks the PRIMARY KEY, presumably (customer_id, user_id).

Page 40

    funnel_events(events JSONB[], pattern_array TEXT[]) RETURNS int[]
    -- Returns an array with 1s corresponding to steps completed
    -- in the funnel, 0s in the other positions

    count_events(events JSONB[], pattern TEXT) RETURNS int
    -- Returns the number of elements in `events` that
    -- match `pattern`.

Page 42

    SELECT funnel_events(
        ARRAY[
            '{"foo": "bar", "baz": 10}',               -- first event
            '{"foo": "abc", "baz": 30}',               -- second event
            '{"foo": "dog", "city": "san francisco"}'  -- third event
        ],
        ARRAY[
            '"foo"=>"abc"',            -- matches second event
            '"city"=>like "%ancisco"'  -- matches third event
        ]
    );

    --------> emits {1, 1}

Page 43

    SELECT funnel_events(
        ARRAY[
            '{"foo": "bar", "baz": 10}',               -- first event
            '{"foo": "abc", "baz": 30}',               -- second event
            '{"foo": "dog", "city": "san francisco"}'  -- third event
        ],
        ARRAY[
            '"city"=>like "%ancisco"',  -- matches third event
            '"foo"=>"abc"'              -- nothing to match after third event
        ]
    );

    --------> emits {1, 0}
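
The ordering behaviour in the two calls above can be modelled outside the database. The following Python sketch is not Heap's extension. It assumes each funnel step is a set of field matchers, that steps must be satisfied in order, and that one event can advance the funnel by at most one step. The helper names (like, equals, matches) are hypothetical.

```python
import re


def like(pattern):
    """Compile a SQL LIKE pattern (% and _ wildcards) into a matcher."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    compiled = re.compile(regex, re.DOTALL)
    return lambda value: isinstance(value, str) and compiled.fullmatch(value) is not None


def equals(expected):
    return lambda value: value == expected


def matches(event, step):
    """A step is a dict of field -> matcher; every field must match."""
    return all(key in event and test(event[key]) for key, test in step.items())


def funnel_events(events, steps):
    """Return 1 for each funnel step completed in order, 0 otherwise."""
    result = [0] * len(steps)
    position = 0  # index of the next step still to be matched
    for event in events:
        if position < len(steps) and matches(event, steps[position]):
            result[position] = 1
            position += 1
    return result


events = [
    {"foo": "bar", "baz": 10},
    {"foo": "abc", "baz": 30},
    {"foo": "dog", "city": "san francisco"},
]
print(funnel_events(events, [{"foo": equals("abc")}, {"city": like("%ancisco")}]))  # [1, 1]
print(funnel_events(events, [{"city": like("%ancisco")}, {"foo": equals("abc")}]))  # [1, 0]
```

Swapping the two steps changes the result because the only event matching the "foo" step occurs before the event that matches the "city" step.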

Page 44

    SELECT sum(
        funnel_events(
            events,
            ARRAY['"type"=>"pageview","path"=>"/signup.html"',
                  '"type"=>"submit","hierarchy"=>like "%@form;#signup;%"']
        )
    )
    FROM user_events
    WHERE customer_id = 12345

    --------> emits something like {110, 20}
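
The aggregate in this query adds the per-user arrays position by position, so each slot ends up holding the number of users who reached that funnel step. PostgreSQL's built-in sum does not accept int[] arguments, so the sum on the slide is presumably a custom aggregate. A minimal Python sketch of the element-wise behaviour, with made-up per-user results:

```python
def sum_funnels(per_user_results):
    """Element-wise sum of equal-length 0/1 arrays, one array per user."""
    if not per_user_results:
        return []
    totals = [0] * len(per_user_results[0])
    for result in per_user_results:
        for i, reached in enumerate(result):
            totals[i] += reached
    return totals


# Hypothetical funnel_events output for four users of a two-step funnel.
per_user = [[1, 1], [1, 0], [0, 0], [1, 0]]
print(sum_funnels(per_user))  # [3, 1]
```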

Page 45

Pros Of Schema #2

1. No joins, just aggregations.

2. Can run pretty sophisticated analysis via extensions like funnel_events.

3. Easy to distribute.

4. Event arrays are TOASTed, which saves lots of disk space and I/O.

Page 46

Limitations Of Schema #2

1. Can't index for defined events, or even event fields.

2. Can't index for event times in any meaningful sense.

3. Arrays keep growing and growing...

Page 47

    CREATE TABLE user_events (
        customer_id      BIGINT,
        user_id          BIGINT,
        properties       JSONB NOT NULL,
        time_first_seen  BIGINT NOT NULL,
        time_last_seen   BIGINT NOT NULL,
        events           JSONB[] NOT NULL,
        events_last_week JSONB[] NOT NULL
    );
    -- Slide brace marks the PRIMARY KEY, presumably (customer_id, user_id).

Page 48

    SELECT sum(
        funnel_events(
            events_last_week,
            ARRAY['"type"=>"pageview","path"=>"/signup.html"',
                  '"type"=>"submit","hierarchy"=>like "%@form;#signup;%"']
        )
    )
    FROM user_events
    WHERE customer_id = 12345
      AND time_first_seen < query_timerange_end
      AND time_last_seen > query_timerange_begin
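
The two time conditions in this query amount to an interval-overlap test. A user is kept only if the span between time_first_seen and time_last_seen intersects the query range. A small Python sketch of that check, with hypothetical users and timestamps:

```python
def user_overlaps_range(time_first_seen, time_last_seen, range_begin, range_end):
    """True when the user's activity span intersects the query time range."""
    return time_first_seen < range_end and time_last_seen > range_begin


# (time_first_seen, time_last_seen) for three made-up users.
users = {"a": (100, 200), "b": (250, 400), "c": (500, 600)}
query_begin, query_end = 150, 300

candidates = [
    user for user, (first, last) in users.items()
    if user_overlaps_range(first, last, query_begin, query_end)
]
print(candidates)  # ['a', 'b']
```

The filter only rules out users who cannot have events in the range. The events of the remaining users still have to be scanned.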

Page 49

Limitations Of Schema #2

1. Can't index for defined events, or even event fields.

2. Can't index for event times in any meaningful sense.

3. Arrays keep growing and growing...

4. Write path is very painful.

Page 51

Write Path Of Schema #2

1. Adding one event to a user requires rewriting the whole user. (Cost over time is quadratic in size of user!)

2. Schema bloats like crazy, requires maxing out autovacuum.

3. Simple maintenance is expensive.
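
The quadratic claim follows from simple arithmetic. If the k-th append rewrites an array of k events, then n appends write 1 + 2 + ... + n = n(n + 1) / 2 events in total. A short Python sketch that counts events written, as a simplified cost model that ignores page layout and compression:

```python
def rewrite_cost(num_events):
    """Total events written when every append rewrites the whole array."""
    total = 0
    for size_after_append in range(1, num_events + 1):
        total += size_after_append  # the full array is written again
    return total


def append_only_cost(num_events):
    """Total events written when each event is stored as its own row."""
    return num_events


for n in (10, 100, 1000):
    print(n, rewrite_cost(n), append_only_cost(n))
# 10 55 10
# 100 5050 100
# 1000 500500 1000
```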

Page 52

About 500 GB of bloat!

VACUUM FULL Friday night

Page 53

Possibly Useful Observations

1. Data is mostly write-once, never update.

2. Queries map nicely to relational model.

3. Events have a natural ordering (time) which is mostly monotonic.

4. Analyses are always in terms of defined events which are very sparse.

5. Aggregations partition cleanly at the user level.

Page 54

Attempt #3: Denormalized Events, Split Out From Users

Page 55

    CREATE TABLE "user" (  -- quoted because user is a reserved word
        customer_id BIGINT,
        user_id     BIGINT,
        properties  JSONB NOT NULL
    );
    -- Slide brace marks the PRIMARY KEY, presumably (customer_id, user_id).

Page 56

    CREATE TABLE event (
        customer_id BIGINT,
        user_id     BIGINT,
        event_id    BIGINT,
        time        BIGINT,
        data        JSONB NOT NULL
    );
    -- Slide braces mark the PRIMARY KEY and FOREIGN KEY ("user") columns, presumably the *_id columns.

Page 58

    CREATE INDEX confirmed_checkout_idx ON event (time)
    WHERE (data ->> 'path') = '/checkout'
      AND (data ->> 'action') = 'click'
      AND (data ->> 'css_hierarchy') LIKE '%div.checkout_modal%a.btn'
      AND (data ->> 'target_text') = 'Confirm Order'

    ...

    SELECT COUNT(*) AS value,
           date_trunc('month', to_timestamp(time / 1000) AT TIME ZONE 'UTC') AS time_bucket
    FROM event
    WHERE customer_id = 135
      AND time BETWEEN 1424437200000 AND 1429531200000
      AND (data ->> 'path') = '/checkout'
      AND (data ->> 'action') = 'click'
      AND (data ->> 'css_hierarchy') LIKE '%div.checkout_modal%a.btn'
      AND (data ->> 'target_text') = 'Confirm Order'
    GROUP BY time_bucket

Page 59

Partial Index Strategy

• Structure the event table such that every event definition is a row-level predicate on it.

• Under the hood, Heap maintains one partial index for each of those predicates.

• The variety of events that Heap captures is massive, so any individual event definition is very selective.

• Fits perfectly into our "retroactive" analytics framework.
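
As a rough mental model, a partial index on (time) behaves like a sorted list of timestamps that contains only the rows satisfying the event definition. The Python sketch below illustrates that model and is not how PostgreSQL stores indexes. The PartialIndex class and the sample rows are assumptions, and the css_hierarchy LIKE clause from the slide is left out for brevity.

```python
import bisect


class PartialIndex:
    """Sorted `time` values for the rows that satisfy a predicate."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.times = []

    def insert(self, row):
        # Rows that fail the predicate never enter the index.
        if self.predicate(row):
            bisect.insort(self.times, row["time"])

    def count_between(self, begin, end):
        """Count indexed rows with begin <= time <= end, like SQL BETWEEN."""
        lo = bisect.bisect_left(self.times, begin)
        hi = bisect.bisect_right(self.times, end)
        return hi - lo


def confirmed_checkout(row):
    data = row["data"]
    return (data.get("path") == "/checkout"
            and data.get("action") == "click"
            and data.get("target_text") == "Confirm Order")


checkout = {"path": "/checkout", "action": "click", "target_text": "Confirm Order"}
rows = [
    {"time": 10, "data": checkout},
    {"time": 20, "data": {"path": "/home", "action": "click", "target_text": "Sign In"}},
    {"time": 30, "data": checkout},
    {"time": 40, "data": checkout},
]

index = PartialIndex(confirmed_checkout)
for row in rows:
    index.insert(row)

print(len(index.times))             # 3
print(index.count_between(15, 40))  # 2
```

Since most captured rows match no definition at all, each such index stays small relative to the table, which is the selectivity point made on the slide.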

Page 60

General Read-Path Strategy

• All analyses shard cleanly by (customer_id, user_id), and every query is built from a sparse set of events.

• Simple meta-formula for most analysis queries:

1. Build up an array of relevant events for each user

2. Pass the array to a custom UDF

3. Join arbitrarily for more filtering, grouping, etc.
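
The three-step meta-formula can be sketched in Python as follows. This is an illustration under stated assumptions rather than Heap's query generator. The function names, the plan property, and the sample rows are all hypothetical, and count_events here is a simplified stand-in for a custom UDF.

```python
from collections import defaultdict


def events_per_user(rows, is_relevant):
    """Step 1: collect each user's relevant events, ordered by time."""
    grouped = defaultdict(list)
    for row in rows:
        if is_relevant(row):
            grouped[(row["customer_id"], row["user_id"])].append(row)
    for events in grouped.values():
        events.sort(key=lambda r: r["time"])
    return grouped


def count_events(events):
    """Step 2: stand-in for a custom UDF applied to one user's array."""
    return len(events)


def analyze(rows, is_relevant, user_properties, keep_user):
    """Step 3: join per-user results against user rows for filtering."""
    results = {}
    for key, events in events_per_user(rows, is_relevant).items():
        if keep_user(user_properties.get(key, {})):
            results[key] = count_events(events)
    return results


rows = [
    {"customer_id": 1, "user_id": 7, "time": 3, "type": "click"},
    {"customer_id": 1, "user_id": 7, "time": 1, "type": "click"},
    {"customer_id": 1, "user_id": 8, "time": 2, "type": "pageview"},
    {"customer_id": 1, "user_id": 9, "time": 5, "type": "click"},
]
user_properties = {(1, 7): {"plan": "paid"}, (1, 9): {"plan": "free"}}

print(analyze(rows, lambda r: r["type"] == "click",
              user_properties, lambda p: p.get("plan") == "paid"))
# {(1, 7): 2}
```

Grouping by (customer_id, user_id) first means each user's array can be processed independently, which is why the approach shards cleanly.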

Page 61

Pros Of Schema #3

1. Excellent read performance, with a few caveats.

2. Flexible event-level indexing and query tuning makes it easier to make new analyses fast.

3. Much, much less write-time I/O cost.

4. PostgreSQL manages a lot of complexity for us.

Page 62

Limitations Of Schema #3

1. Expensive to maintain all those indexes!

2. Lack of meaningful statistics for the query planner.

3. Bigger disk footprint by ~2.5x.

4. Some of the assumptions are a bit restrictive / don't degrade gracefully.

Page 63

Possibly Useful Observations

1. Data is mostly write-once, never update.

2. Queries map nicely to relational model.

3. Events have a natural ordering (time) which is mostly monotonic.

4. Analyses are always in terms of defined events which are very sparse and predictable to a degree.

5. Aggregations partition cleanly at the user level.

Page 64

Attempt #4: Denormalized Events, Common Fields Extracted

Page 65

    CREATE TABLE event (
        customer_id BIGINT,
        user_id     BIGINT,
        event_id    BIGINT,
        time        BIGINT,
        data        JSONB NOT NULL
    );
    -- Slide braces mark the PRIMARY KEY and FOREIGN KEY ("user") columns, presumably the *_id columns.

Page 66

    CREATE TABLE event (
        customer_id BIGINT,
        user_id     BIGINT,
        event_id    BIGINT,
        time        BIGINT,
        type        TEXT,
        hierarchy   TEXT,
        target_text TEXT,
        ...
        data        JSONB NOT NULL
    );
    -- Slide braces mark the PRIMARY KEY and FOREIGN KEY ("user") columns, presumably the *_id columns.
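
Moving from the JSONB-only table to this one implies a write-time step that pulls the common fields out of each captured event. A minimal Python sketch of such a split, assuming only the three columns named on the slide (the elided columns are left out) and a made-up sample event:

```python
# Columns named on the slide; the slide's "..." hides others.
COMMON_FIELDS = ("type", "hierarchy", "target_text")


def split_event(raw):
    """Move common fields into columns and keep the rest as the JSON blob."""
    columns = {name: raw.get(name) for name in COMMON_FIELDS}  # None maps to NULL
    data = {k: v for k, v in raw.items() if k not in COMMON_FIELDS}
    return columns, data


raw = {
    "type": "click",
    "hierarchy": "div.checkout_modal;a.btn",
    "target_text": "Confirm Order",
    "utm_source": "newsletter",
}
columns, data = split_event(raw)
print(columns["type"], data)  # click {'utm_source': 'newsletter'}
```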

Page 68

    CREATE TABLE event (
        customer_id BIGINT,
        user_id     BIGINT,
        event_id    BIGINT,
        time        BIGINT,
        type        TEXT,  -- btree
        hierarchy   TEXT,  -- gin
        target_text TEXT,
        ...                -- more btrees in here
        data        JSONB NOT NULL
    );
    -- Slide braces mark the PRIMARY KEY and FOREIGN KEY ("user") columns, presumably the *_id columns.

Can now combine indexes on these!
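
PostgreSQL can combine several indexes within one query by building a bitmap of candidate rows from each index and ANDing the bitmaps. The Python sketch below models that idea with sets of row ids. It is a simplification for illustration, and the ColumnIndex class and sample rows are assumptions.

```python
from collections import defaultdict


class ColumnIndex:
    """Maps a column value to the set of row ids that hold it."""

    def __init__(self, column):
        self.column = column
        self.row_ids = defaultdict(set)

    def insert(self, row_id, row):
        self.row_ids[row[self.column]].add(row_id)

    def lookup(self, value):
        return self.row_ids.get(value, set())


rows = {
    0: {"type": "click", "target_text": "Confirm Order"},
    1: {"type": "click", "target_text": "Cancel"},
    2: {"type": "submit", "target_text": "Confirm Order"},
    3: {"type": "click", "target_text": "Confirm Order"},
}

type_index = ColumnIndex("type")
text_index = ColumnIndex("target_text")
for row_id, row in rows.items():
    type_index.insert(row_id, row)
    text_index.insert(row_id, row)

# Intersect the two candidate sets instead of scanning every row.
combined = type_index.lookup("click") & text_index.lookup("Confirm Order")
print(sorted(combined))  # [0, 3]
```

A handful of per-column indexes can serve many event definitions this way, where the previous schema needed one partial index per definition.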

Page 69

Pros Of Schema #4

1. Dataset is ~30% smaller on disk.

2. Query planner has much more information to work with, can use it in more ambitious ways.

3. Can get rid of ~60% of partial indexes and replace them with a small set of simpler indexes.

Page 70

Tradeoffs From Mixed Indexing Strategy

1. Costs ~50% less CPU on write.

2. Costs ~50% more I/O on write.

3. Eliminates a lot of edge cases, degrades more gracefully.

Page 71

    CREATE TABLE "user" (  -- quoted because user is a reserved word
        customer_id BIGINT,
        user_id     BIGINT,
        properties  JSONB NOT NULL,
        identity    TEXT
    );
    -- Slide brace marks the PRIMARY KEY, presumably (customer_id, user_id).

How do you represent user moves?

Page 72

Future Work

• Partitioning the events table, many options here.

• Supporting a much more heterogeneous dataset.

• New analysis paradigms.

• Many, many others. (Did I mention we're hiring?)

Page 73

PostgreSQL Wishlist

• Ability to move table data with indexes.

• Partial indexes and composite types have lots of gotchas if you want index-only scans.

• Better ability to keep the visibility map up to date, without constant VACUUMing.

• Distributed systems features.

Page 74

Questions? Or, ask me on Twitter: @danlovesproofs