Caching is Your Friend: Creating Tons of Maps on the Fly and Auto-Updating Some
FOSS4G NA 2015
Outline
● Empower Engine Overview
● Tons of Maps, Tons of Tiles
● Caching 1: Map Cache Tables
● Caching 2: Daily Updating Data
● Store stuff well: PostGIS tips & tricks
Who We Are
Julie Goldberg
● 15+ years as a software engineer
● Democratic Politics & Campaign Software since 2003-2004 (Howard Dean’s presidential campaign)
Noah Glusenkamp
● Democratic Politics since Obama in Iowa in 2007
● Made many one-off maps for many campaigns he worked on
● UI and product design focused
Campaigns Organize Geographically
Organizers each have their own turf
Local variation matters.
Marriage Example
Scope = Washington State
Scope = Seattle
Overall Problem
● Any Scope
● Any Data Layer
● Any Granularity
All Within the Request/Response Cycle
DEMO….
Our Stack
● Postgres/PostGIS
● Tilestache
● Mapnik
● S3
● Django
● Leaflet
● Celery
Maps Empower & Motivate
● Maps tell a story. Spreadsheets don’t.
● Maps motivate organizers, volunteers and even candidates.
● Organizers don’t know what a shapefile is, and they shouldn’t need to.
User Testimonial
With Empower Engine I can visually show vols where the densest part of my turf is and provide them with reasoning why I want them to go canvass in those areas. Even if [it] means that they have to drive 20 mins or so. Also vols get so stoked when I provide them with inside SCIENCE on how we are going to win this election.
- Kathleen Austad, Field Organizer 2014
Empower Engine: What it Does
● Very easy to create or edit choropleth maps.
● Campaign manager can make a set of important “distribution maps” and distribute them to each organizer at their turf.
● Rescales/reclassifies each map for each user to their turf.
● Some data layers are static.
● Most interesting ones update daily (doors in last week, universe density, early votes).
Outline
● Empower Engine Overview
● Tons of Maps, Tons of Tiles
● Caching 1: Map Cache Tables
● Caching 2: Daily Updating Data
● Store stuff well: PostGIS tips & tricks
Dynamic Maps & Tiles in Tilestache: The Problem
● Tilestache normally has a config file with one configuration per map.
● We can’t change the config file every time someone changes a map or makes a new one.
Tilestache Out of the Box
● Tilestache expects you to pre-generate your tiles or generate them on demand using Mapnik and/or a custom provider.
● You can cache tiles in many ways, including S3.
● This works for both image tiles and UTFGrid tiles.
Custom Tilestache Wrapper
https://gist.github.com/JulieGoldberg/6926274
● Specify parameter names on start-up.
● Passes parameters to each tile request.
● MapId tells our custom provider (based on Mapnik) what data layers, colors, etc. to use.
● Our custom cache inserts /MapId/#/Version/# into our S3 path. Editing a map changes the version, which invalidates the old tiles.
S3 Path and Tiles
/map_id/5086/version/0/district_value/11/329/714.png
/map_id/5086/version/1/district_value/11/329/714.png
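The versioned path scheme can be sketched in a few lines of Python. This is not the code from the gist: `tile_cache_key` and its parameters are hypothetical names, and the layout simply mirrors the two example paths above.

```python
def tile_cache_key(map_id, version, layer, zoom, x, y, ext="png"):
    """Build a cache key that embeds the map id and the map version.

    Hypothetical helper: the layout copies the example S3 paths on the
    slide, not the actual wrapper code.
    """
    return f"/map_id/{map_id}/version/{version}/{layer}/{zoom}/{x}/{y}.{ext}"


# The same tile before and after the map is edited once.
before_edit = tile_cache_key(5086, 0, "district_value", 11, 329, 714)
after_edit = tile_cache_key(5086, 1, "district_value", 11, 329, 714)
```

Because the version is part of the key, tiles rendered for version 0 simply stop being requested once the map moves to version 1.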
Outline
● Empower Engine Overview
● Tons of Maps, Tons of Tiles
● Caching 1: Map Cache Tables
● Caching 2: Daily Updating Data
● Store stuff well: Miscellaneous PostGIS tips & tricks
Why Cache?
● Space is cheap.
● Geometric queries are slow.
● Indexed lookups are fast. Queries with no JOINs or WHERE clauses are even faster.
● We want to make tiles on the fly.
● We want to generate new maps in the web request/response cycle.
● We want to update lots of data every day as quickly as possible.
Size of our tables (just for WA)
Districts: 6,226,475
Hexagon, Precinct, County, Census Block, etc.
District Attribute Values: 27,246,617
# Marriage Votes in Precinct 12345, # Doors Attempted yesterday in Hexagon 54, etc.
Only cache tables easily be recreated....
Making Tiles Efficiently: Cache Table
● Mapnik does geospatial queries to find the relevant districts and associate the data.
● Both our district_shapes table and our values table are huge.
● The smaller the set of data, the better it performs.
● Don’t JOIN any tables inside Mapnik.
● Make a cache table per map with all data values and shapes.
Sample Map Cache Table Definition
CREATE TABLE map_caches.map_2270 (
    district_id integer,
    attribute_15230 double precision,
    attribute_15231 double precision,
    attribute_15232 double precision,
    attribute_15233 double precision,
    geometry geometry(MultiPolygon,3857),
    short_name character varying(12),
    long_name character varying(100),
    sq_miles double precision,
    colorized_attribute_id integer, -- important for plurality maps
    colorized_value double precision,
    author_provided_id character varying(60)
);
Make Cache Tables Efficiently
● Don’t do any geometric queries to determine what data will be on a map.
● Pre-calculate all possible granularities with all possible scopes.
  ○ Medium Hexagons in Puyallup? Here’s the list.
  ○ 2012 Precincts in Washington’s 3rd CD? Here you go.
● When asked to pull any data layer in any scope at any granularity, we join on indexed integer IDs.
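As a rough illustration of the integer-ID join, the sketch below fills a per-map cache table from a precomputed scope/granularity membership table. It uses SQLite through Python's `sqlite3` module purely so it runs anywhere. The table names (`scope_members`, `district_values`, `map_cache_<id>`), the columns, and the sample rows are all assumptions, not the real Empower Engine schema, and geometry is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Precomputed offline: which districts of each granularity fall in each scope.
    CREATE TABLE scope_members (
        scope_id INTEGER, granularity_id INTEGER, district_id INTEGER
    );
    CREATE INDEX scope_members_idx ON scope_members (scope_id, granularity_id);

    CREATE TABLE district_values (
        district_id INTEGER, attribute_id INTEGER, value REAL
    );
    CREATE INDEX district_values_idx ON district_values (district_id, attribute_id);
""")
conn.executemany("INSERT INTO scope_members VALUES (?, ?, ?)",
                 [(1, 10, 100), (1, 10, 101), (2, 10, 102)])
conn.executemany("INSERT INTO district_values VALUES (?, ?, ?)",
                 [(100, 15230, 0.42), (101, 15230, 0.58), (102, 15230, 0.77)])


def build_map_cache(conn, map_id, scope_id, granularity_id, attribute_id):
    """Recreate one map's cache table using only integer-ID joins."""
    table = f"map_cache_{int(map_id)}"  # hypothetical naming scheme
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (district_id INTEGER, colorized_value REAL)")
    conn.execute(f"""
        INSERT INTO {table} (district_id, colorized_value)
        SELECT m.district_id, v.value
        FROM scope_members AS m
        JOIN district_values AS v ON v.district_id = m.district_id
        WHERE m.scope_id = ? AND m.granularity_id = ? AND v.attribute_id = ?
    """, (scope_id, granularity_id, attribute_id))
    return table


table = build_map_cache(conn, 2270, scope_id=1, granularity_id=10, attribute_id=15230)
rows = conn.execute(
    f"SELECT district_id, colorized_value FROM {table} ORDER BY district_id"
).fetchall()
```

No spatial predicate appears anywhere in the query: the expensive intersection work was done when `scope_members` was populated.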
Pre-Computing Takes Time
● It takes hours or sometimes days to pre-compute granularity-scope intersections when adding districts.
● Not in the web request/response cycle.
● We have scripts that compute them and populate the cache when we add districts.
● We’ve optimized this some, but it’s a one-time cost whenever we add districts.
Outline
● Empower Engine Overview
● Tons of Maps, Tons of Tiles
● Caching 1: Map Cache Tables
● Caching 2: Daily Updating Data
● Store stuff well: PostGIS tips & tricks
What sort of data do we have?
Universe = People Campaign Currently Wants to Contact
● Universe Density
● Universe Penetration
  ○ What % of your universe in each area has the campaign attempted or contacted?
  ○ In specified day range or since specified date.
● Early voters in universe
● All Door/Phone Attempts/Contacts in specified day range
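Universe penetration as described above is a simple ratio. The sketch below is one way to compute it for a single area. The function name, the `(person_id, day)` contact tuples, and the sample dates are assumptions made for illustration.

```python
from datetime import date


def universe_penetration(universe_ids, contacts, since):
    """Percent of a universe attempted or contacted on or after `since`.

    `contacts` is an iterable of (person_id, day) pairs. A person who was
    reached several times is counted once.
    """
    universe = set(universe_ids)
    if not universe:
        return 0.0
    reached = {pid for pid, day in contacts if day >= since and pid in universe}
    return 100.0 * len(reached) / len(universe)


contacts = [
    (1, date(2014, 10, 30)),  # too early for the window below
    (2, date(2014, 11, 1)),
    (2, date(2014, 11, 2)),   # same person again, still counted once
    (9, date(2014, 11, 2)),   # not in the universe
]
pct = universe_penetration([1, 2, 3, 4], contacts, since=date(2014, 11, 1))
```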
Refreshing the Data Daily
● Download daily from the DNC’s data warehouse.
● Voters in campaign’s universes are likely to be voters contacted. The same voters may be in multiple universes.
● Download the voter id and geocode for all the voters we care about.
● Download all the data we want to aggregate per voter.
Aggregate All Points to Polygons
Avoid Overplotting
Typical GOTV Day in Washington State:
3,894,333 Voters
692,744 Contacts or Attempts
6,305,129 People in All Universes
1,404,251 Penetration Into Universes
Daily Person-District Lookups
● Need a person->district lookup for every district group (small hexagons, current precincts, etc.).
● Recreate these cache tables every day.
● All daily updated per-voter data can be joined to this lookup and grouped by district.
Person-District Mappings Change (but not much each day)
● Jennifer Smith (person id 444) moves in with Amanda Jones (person id 987) at geocode (47.6717833,-122.3814195)
  ○ Medium Hexagon: 423, district 12345.
  ○ 2014 Precincts: SEA 36-1324, district 98765
● If Amanda lived there already, we may know the districts.
● If they rented a new house and the prior resident was a non-citizen, we should district the new geocode once.
● They’re unlikely to move again this year.
People Share Geocodes
● Couples, families, apartment buildings
● Lookup table of PersonID, Latitude, Longitude, DistrictGroupID -> DistrictID
What large hexagon is Jennifer in, now that she lives at (47.6717833,-122.3814195)?
● If it’s not there, look for the geocode without the PersonID before trying to compute it.
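The two-step lookup can be sketched with plain dictionaries standing in for the lookup table. Everything named here (`find_district`, the dictionary layout, district group id 7, the `compute` callback) is hypothetical. In the real system these would be indexed queries against Postgres.

```python
def find_district(person_id, lat, lon, group_id, by_person, by_geocode, compute):
    """Resolve a district id, trying the cheapest lookup first."""
    person_key = (person_id, lat, lon, group_id)
    if person_key in by_person:
        return by_person[person_key]

    # Someone else may already live at this geocode.
    geo_key = (lat, lon, group_id)
    if geo_key not in by_geocode:
        # Last resort: the expensive point-in-polygon computation.
        by_geocode[geo_key] = compute(lat, lon, group_id)

    district_id = by_geocode[geo_key]
    by_person[person_key] = district_id  # keep the row for tomorrow
    return district_id


GEOCODE = (47.6717833, -122.3814195)
GROUP = 7  # assumed id for one district group

# Amanda (987) already lives here, so both lookups know this geocode.
by_person = {(987, *GEOCODE, GROUP): 12345}
by_geocode = {(*GEOCODE, GROUP): 12345}

compute_calls = []


def compute(lat, lon, group_id):
    """Stand-in for the spatial query; records each call."""
    compute_calls.append((lat, lon, group_id))
    return 222


# Jennifer (444) has no row yet, but her new geocode is already known.
district = find_district(444, *GEOCODE, GROUP, by_person, by_geocode, compute)
```

Jennifer's district is resolved from Amanda's geocode row, so the spatial computation never runs for her.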
Districting New Geocodes
● Pull DISTINCT list of geocodes to district.
  ○ Speeds things up by factor of 2 or 3 (only district Jennifer & Amanda’s new house once).
● Store a cache table of shapes for each district group.
● Having no JOINs or WHERE clauses improves performance.
Keep Newly Districted Geocodes
● Add new person - geocode - district group -> district rows to our lookup for each person at the new geocode.
● Keep the new lookup rows.
● We’ll probably be making maps about the same voters tomorrow.
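A minimal sketch of districting each distinct geocode once and then fanning the result out to everyone who lives there. The function name, the `locate` callback, and the sample people are assumptions. `locate` stands in for whatever spatial query finds the containing district.

```python
from collections import defaultdict


def district_new_people(people, group_id, locate):
    """Return lookup rows for people who have no district row yet.

    `people` is an iterable of (person_id, lat, lon). `locate` is called
    once per distinct geocode, not once per person.
    """
    residents = defaultdict(list)
    for person_id, lat, lon in people:
        residents[(lat, lon)].append(person_id)

    rows = []
    for (lat, lon), person_ids in residents.items():
        district_id = locate(lat, lon, group_id)
        for person_id in person_ids:
            rows.append((person_id, lat, lon, group_id, district_id))
    return rows


locate_calls = []


def fake_locate(lat, lon, group_id):
    """Stand-in for the real point-in-polygon query; records each call."""
    locate_calls.append((lat, lon))
    return 12345 if lat > 47.5 else 222


new_people = [
    (444, 47.6717833, -122.3814195),
    (987, 47.6717833, -122.3814195),  # same house as 444
    (555, 47.0, -122.0),
]
new_rows = district_new_people(new_people, group_id=7, locate=fake_locate)
```

Three people produce three lookup rows but only two spatial queries.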
Updating Data Layers
● Make our daily person-district cache tables from our person-geocode-districtGroup -> district lookup.
● For each data set, join to person-district lookup, GROUP BY district ID.
● We pre-compute and store square miles of all districts, so # people per square mile is as easy as # people.
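The join-and-group step might look like the sketch below, again using SQLite through `sqlite3` so it is runnable. The table and column names (`person_district`, `universe_members`, `districts.sq_miles`) and the sample rows are assumptions, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_district (person_id INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE districts (district_id INTEGER PRIMARY KEY, sq_miles REAL);
    CREATE TABLE universe_members (person_id INTEGER);
""")
conn.executemany("INSERT INTO person_district VALUES (?, ?)",
                 [(444, 12345), (987, 12345), (555, 98765)])
conn.executemany("INSERT INTO districts VALUES (?, ?)",
                 [(12345, 0.5), (98765, 2.0)])
conn.executemany("INSERT INTO universe_members VALUES (?)",
                 [(444,), (987,), (555,)])

# Aggregate points to polygons: count people per district, then divide by
# the stored area to get density without touching any geometry.
density_rows = conn.execute("""
    SELECT pd.district_id,
           COUNT(*) AS people,
           COUNT(*) / d.sq_miles AS people_per_sq_mile
    FROM universe_members AS u
    JOIN person_district AS pd ON pd.person_id = u.person_id
    JOIN districts AS d ON d.district_id = pd.district_id
    GROUP BY pd.district_id, d.sq_miles
    ORDER BY pd.district_id
""").fetchall()
```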
Updating Maps
● Store last viewed on map.
● Store last updated on data layers.
● If the data is newer than the map:
  ○ Recreate the cache table.
  ○ Recompute bins.
  ○ Increment the version.
● New version invalidates the tile cache.
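The staleness check can be sketched as follows. `MapState`, `refresh_if_stale`, and the two callbacks are hypothetical names. The sketch assumes the map's "last viewed" timestamp is what gets compared against the data layer's "last updated" timestamp, as the bullets above suggest.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MapState:
    map_id: int
    version: int
    last_viewed: datetime


def refresh_if_stale(state, layer_updated_at, now, rebuild_cache, recompute_bins):
    """Rebuild a map only when its data layer changed since the last view."""
    stale = layer_updated_at > state.last_viewed
    if stale:
        rebuild_cache(state.map_id)
        recompute_bins(state.map_id)
        state.version += 1  # a new version means new tile cache paths
    state.last_viewed = now
    return stale


steps = []
state = MapState(map_id=5086, version=0, last_viewed=datetime(2014, 11, 1, 8, 0))
layer_updated_at = datetime(2014, 11, 2, 3, 0)  # nightly refresh

first = refresh_if_stale(
    state, layer_updated_at, datetime(2014, 11, 2, 9, 0),
    rebuild_cache=lambda map_id: steps.append("cache"),
    recompute_bins=lambda map_id: steps.append("bins"),
)
second = refresh_if_stale(
    state, layer_updated_at, datetime(2014, 11, 2, 9, 5),
    rebuild_cache=lambda map_id: steps.append("cache"),
    recompute_bins=lambda map_id: steps.append("bins"),
)
```

The first view after the nightly refresh pays for the rebuild. The second view a few minutes later finds nothing newer and reuses the cache table and tiles.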
Reminder: Overall Problem
● Any Scope
● Any Data Layer
● Any Granularity
All Within the Request/Response Cycle
Any Scope
City of Puyallup
Legislative District 45
Any Data Layer
% Early Voted
# People Contacted at their Door
Any Granularity
2014 Primary Precincts
Medium Hexagon
Outline
● Empower Engine Overview
● Tons of Maps, Tons of Tiles
● Caching 1: Map Cache Tables
● Caching 2: Daily Updating Data
● Store stuff well: PostGIS tips & tricks
Shapes Table
SELECT *
FROM districts
LIMIT 1
Order of magnitude faster if there are no geometries in the table.
● Postgres knows the size of integer, varchar(12), etc.
● Postgres doesn’t know how big a geometry is.
● Geometry columns hold pointers to shapes.
● Create separate district_shapes table.
● Store geometries in shapes table.
Indexes are Your Friend
● Indexes always speed up searches, but some are better than others.
  ○ Integer best
  ○ Float second choice
  ○ Geometry worst
● Use person-latitude-longitude index before latitude-longitude index
● Store latitude and longitude separately, because Postgres can index floats better than geometries.
Searching Districts
● Auto-complete on name for selecting boundary scopes.
● Millions of granularity-only districts
  ○ precincts, hexagons, census blocks
● Thousands of boundary scopes
  ○ CDs, cities, states
● Washington State:
  ○ 768 boundary scopes
  ○ 6 million granularities
ScopeableDistricts Materialized View
● Make a materialized view for districts that can be boundary scopes.
● View is refreshed whenever we add such districts (on back end).
● Could add a text index on the view if we need.
Questions?
Contact: [email protected]