MongoDB San Francisco 2013: Geo Searches for Healthcare Pricing Data, presented by Robert Stewart,...
DESCRIPTION
This talk covers the MongoDB deployment architecture used at Castlight Health to support very low latency spatial searches against our database of hundreds of millions of healthcare prices. The Geo haystack index in MongoDB and SSDs turned out to be the perfect solution for our problem. A strategy of replica set flipping also enables Castlight to swap in very large changes to the pricing data with no impact on the running application.
TRANSCRIPT
CONFIDENTIAL
Geo Searches for Health Care Pricing Data
Robert Stewart
Senior Architect, Castlight Health
@wombatnation
1
Castlight Health
The Business and Technical Problems
Initial Solution
MongoDB, Geo Haystack Index and SSDs
Replica Set Flipping
2
3
Castlight Health
Hosted web and mobile applications providing unbiased information on health care cost and quality
Customers are employers and health plans
Founded in 2008, raised $181 million in VC funding
#1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011
Hiring!
4
Home Page
5
Search Results
6
Business Problem
Support searches for
Prices for a procedure performed by any in-network provider in a geographical area
Prices for all procedures performed by a single provider
Sub-second response, even if returning data on thousands of prices
7
Technical Problems
Need a very fast geo index
Rate count doubled in last 3 months to 600 million
Major rate updates monthly
Difficult to index data to ensure sequential reads
Sometimes lots of random reads
8
Pricing Retrieval Architecture
9
Initial Solution
Store pricing data in MySQL
When Pricing Service starts, create two in-memory indexes and cache most of the rates
55 GB JVM Heap with lots of GC tuning
20-minute service startup time to build indexes
3 hours for background caching of most rates
Trouble Brewing:
Total rates growing quickly
Rolling restart becoming unacceptably slow
If rates not in Java or MySQL cache, retrieval was very slow
Enter the Mongo
10
11
Geo Indexes
Tried standard geo 2D indexes in MongoDB
Too slow for my use case
Geo Haystack index
Conceptually similar
From docs.mongodb.org: “A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”
12
Mercator Projection with 10 degree grid
13
Geo Haystack
We chose degrees long-lat for x-y coordinate system
25 miles is our default search radius
Roughly 0.5 degrees in middle of the US
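// Haystack index on loc with 0.5 degree buckets; pm is the one extra filter field
db.priceables_1.ensureIndex(
  { loc: "geoHaystack", pm: 1 },
  { bucketSize: 0.5 })

// Search within 0.5 degrees of [longitude, latitude], filtered on pm
db.runCommand(
  { geoSearch: "priceables_1",
    near: [-122.4, 37.79],
    maxDistance: 0.5,
    search: { pm: 6757 },
    limit: 50000 })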
maxDistance calculated using great circle algorithm
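The "25 miles is roughly 0.5 degrees" figure can be checked with a little arithmetic. The sketch below assumes a spherical Earth with a mean radius of 3958.8 miles; the function names are made up for illustration and are not part of MongoDB or the Pricing Service.

```javascript
// Approximate mean Earth radius in miles (spherical-Earth assumption).
const EARTH_RADIUS_MILES = 3958.8;

// Miles covered by one degree of latitude (constant on a sphere).
function milesPerDegreeLat() {
  return (Math.PI / 180) * EARTH_RADIUS_MILES;
}

// Miles covered by one degree of longitude at a given latitude.
function milesPerDegreeLng(latDegrees) {
  return milesPerDegreeLat() * Math.cos((latDegrees * Math.PI) / 180);
}

// Convert a radius in miles to degrees along each axis at a latitude.
function radiusInDegrees(miles, latDegrees) {
  return {
    lat: miles / milesPerDegreeLat(),
    lng: miles / milesPerDegreeLng(latDegrees),
  };
}

// 25 mile radius at the latitude used in the geoSearch example.
const deg = radiusInDegrees(25, 37.79);
console.log(deg.lat.toFixed(2), deg.lng.toFixed(2));
```

At that latitude, 25 miles works out to about 0.36 degrees of latitude and about 0.46 degrees of longitude, so 0.5 is a reasonable round number for both bucketSize and maxDistance.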
14
Geo Haystack Pros
Very fast when retrieving many documents in a relatively small search radius
Great when you also need to apply a secondary filter
Compound 2dsphere index in MongoDB 2.4 has even better support
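To see why bucketing helps with many documents in a small radius plus a secondary filter, here is a toy in-memory model. It is only an illustration of the idea and is not MongoDB's implementation; the class name, method names, and sample prices are all made up.

```javascript
// Toy model of a haystack-style index: points are grouped into square
// buckets of side bucketSize, keyed together with one extra field (pm).
// This is an illustration only, not MongoDB's implementation.
class ToyHaystack {
  constructor(bucketSize) {
    this.bucketSize = bucketSize;
    this.buckets = new Map(); // "bx:by:pm" -> array of documents
  }

  key(bx, by, pm) {
    return `${bx}:${by}:${pm}`;
  }

  bucketOf(v) {
    return Math.floor(v / this.bucketSize);
  }

  insert(doc) {
    const [x, y] = doc.loc;
    const k = this.key(this.bucketOf(x), this.bucketOf(y), doc.pm);
    if (!this.buckets.has(k)) this.buckets.set(k, []);
    this.buckets.get(k).push(doc);
  }

  // Scan only the buckets overlapping the search square, then keep
  // documents within maxDistance using flat (Euclidean) geometry.
  search(near, maxDistance, pm) {
    const [nx, ny] = near;
    const results = [];
    const minBx = this.bucketOf(nx - maxDistance);
    const maxBx = this.bucketOf(nx + maxDistance);
    const minBy = this.bucketOf(ny - maxDistance);
    const maxBy = this.bucketOf(ny + maxDistance);
    for (let bx = minBx; bx <= maxBx; bx++) {
      for (let by = minBy; by <= maxBy; by++) {
        for (const doc of this.buckets.get(this.key(bx, by, pm)) || []) {
          const dist = Math.hypot(doc.loc[0] - nx, doc.loc[1] - ny);
          if (dist <= maxDistance) results.push(doc);
        }
      }
    }
    return results;
  }
}

// Made-up sample documents: two near the search point, one far away.
const index = new ToyHaystack(0.5);
index.insert({ loc: [-122.41, 37.78], pm: 6757, price: 120 });
index.insert({ loc: [-122.41, 37.78], pm: 1234, price: 80 });
index.insert({ loc: [-121.0, 37.0], pm: 6757, price: 95 });
const hits = index.search([-122.4, 37.79], 0.5, 6757);
console.log(hits.length);
```

Because the secondary field is part of the bucket key, documents with a different pm are never touched, and distant buckets are never scanned.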
15
Geo Haystack Cons
Supports only one extra filter in index (SERVER-2979)
Bug: an unindexed query on only the second part of the key fails with an assertion (SERVER-8645)
> db.priceables_1.find({pm: 6757})
error: { "$err" : "assertion src/mongo/db/geo/haystack.cpp:178" }
Second part of index can’t have an array value
Location part of key can’t be null
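One way to live with the last two restrictions is to check documents before loading them. The following is a minimal sketch, assuming the field names loc and pm from the index definition; the function name is hypothetical, and treating a missing loc the same as a null one is an assumption.

```javascript
// Check a document against the two haystack restrictions listed above:
// the location must not be null and the secondary key must not be an array.
// Treating a missing loc like a null one is an assumption of this sketch.
function haystackProblems(doc) {
  const problems = [];
  if (doc.loc === null || doc.loc === undefined) {
    problems.push("loc is null or missing");
  }
  if (Array.isArray(doc.pm)) {
    problems.push("pm is an array");
  }
  return problems;
}

const ok = haystackProblems({ loc: [-122.4, 37.79], pm: 6757 });
const bad = haystackProblems({ loc: null, pm: [6757, 6758] });
console.log(ok, bad);
```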
16
SSDs
For uncached data on HDD, Geo Haystack was twice as fast as custom Java geo index and MySQL
Still close to 1 minute for big queries with full data set
Death by random read
Tested with a $200 Samsung SSD
Typical query dropped to 20 ms
Big query only about 150 ms
17
Mongoperf on SSDs
Random 4k block reads, 5 GB file, 16 threads
Env     SSD                          Read Ops/s    Read MB/s
Prod    Samsung 200GB SLC            74k           288
QA VM   Samsung 200GB SLC            30k           117
Dev     Samsung 830 256GB SATA MLC   47k           183
Sequential write of the 5 GB file
Env     SSD                          Write Ops/s   Write MB/s
Prod    Samsung 200GB SLC            1074          289
QA VM   Samsung 200GB SLC            405           196
Dev     Samsung 830 256GB SATA MLC   438           210
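The read figures are consistent with each other: ops per second times the 4k block size gives the throughput column. The sketch below assumes that "4k" means 4096 bytes and that MB/s is reported in units of 1024 × 1024 bytes; the function name is made up.

```javascript
// Convert random-read ops/s at a fixed block size into MiB/s.
function readThroughputMiB(opsPerSecond, blockBytes = 4096) {
  return (opsPerSecond * blockBytes) / (1024 * 1024);
}

// Read ops/s figures from the table above (rounded to thousands there).
const readOps = { prod: 74000, qaVm: 30000, dev: 47000 };
for (const [env, ops] of Object.entries(readOps)) {
  console.log(env, readThroughputMiB(ops).toFixed(0));
}
```

Under those assumptions the computed values land within about 1 MB/s of the reported 288, 117, and 183, with the small gaps explained by the ops/s figures being rounded to the nearest thousand.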
18
Low Impact Pricing Updates
Requirements
Major price updates monthly
Minor updates more frequently
Huge bulk loads with no impact on active replica set
I/O bound, not CPU bound
19
Replica Set Flipping Solution
Two replica sets
Lowered cost with two SSDs on each pricing server
scp compressed files from QA to passive replica set
Protip: compress and uncompress with pigz
tar cvf - pricing | pigz > ~/pricing.tgz
pigz -dc pricing.tgz | tar xvf -
Page in index and data
db.runCommand({ touch: "priceables_1", index: true, data: true })
Pricing Service operation to atomically flip
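The flip itself can be pictured as swapping which replica set the service treats as active. This is a hypothetical sketch in JavaScript, not the actual Pricing Service operation (which the slides describe as a JVM application); the class name is invented, and pairing port 28001 with prodpricing1 and 28002 with prodpricing2 is an assumption.

```javascript
// Hypothetical sketch of the flip inside a pricing service: two replica
// sets are configured, one is active, and flip() swaps the roles.
class ReplicaSetSwitch {
  constructor(activeUri, passiveUri) {
    this.active = activeUri;
    this.passive = passiveUri;
  }

  // URI that new queries should use.
  currentUri() {
    return this.active;
  }

  // Swap roles in one synchronous step, so no caller observes a
  // half-updated state in single-threaded JavaScript.
  flip() {
    [this.active, this.passive] = [this.passive, this.active];
    return this.active;
  }
}

// Assumed mapping of ports to replica set names.
const sets = new ReplicaSetSwitch(
  "mongodb://pricing1:28001,pricing2:28001/?replicaSet=prodpricing1",
  "mongodb://pricing1:28002,pricing2:28002/?replicaSet=prodpricing2"
);
const before = sets.currentUri();
sets.flip(); // only after the passive set has been loaded and touched
const after = sets.currentUri();
console.log(before, after);
```

In a multi-threaded service the same idea would need an atomic reference or a lock around the swap.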
20
Replica Set Architecture
Physical Servers
ReplicaSets
prodpricing1
prodpricing2
Server pricing1
mongod 28001 primary
mongod 28002 secondary
Server pricing2
mongod 28001 secondary
mongod 28002 primary
Server db1
mongod 28001 arbiter
Server db2
mongod 28002 arbiter
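The layout above could be written as two replica set configuration documents in the shape that rs.initiate() accepts. This is a sketch under assumptions: the slides do not say which port belongs to which set, so pairing 28001 with prodpricing1 and 28002 with prodpricing2 is a guess, and the helper function is hypothetical. Which data member becomes primary is left to election.

```javascript
// Build a replica set configuration document in the shape accepted by
// rs.initiate(): the data-bearing members first, then one arbiter.
function replicaSetConfig(name, port, dataHosts, arbiterHost) {
  const members = dataHosts.map((host, i) => ({ _id: i, host: `${host}:${port}` }));
  members.push({ _id: members.length, host: `${arbiterHost}:${port}`, arbiterOnly: true });
  return { _id: name, members };
}

// Assumption: prodpricing1 runs on port 28001 and prodpricing2 on 28002.
const rs1 = replicaSetConfig("prodpricing1", 28001, ["pricing1", "pricing2"], "db1");
const rs2 = replicaSetConfig("prodpricing2", 28002, ["pricing1", "pricing2"], "db2");
console.log(JSON.stringify(rs1, null, 2));
```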
21
Replica Set Flipping Drawbacks
Obviously, increased cost, but only for SSDs
Recently added caching of remote pricing lookups
TTL collections
Cache is lost during a flip
But, usually flip late at night
Cache eviction time is only a few hours
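A TTL collection is an ordinary collection with a single-field index on a date field and an expireAfterSeconds option. The sketch below builds such an index definition; the collection name remote_price_cache, the field name cachedAt, and the 3 hour lifetime are all hypothetical stand-ins for "a few hours".

```javascript
// Build the key and options for a TTL index that expires cached lookups.
function ttlIndexSpec(field, hours) {
  return {
    keys: { [field]: 1 },
    options: { expireAfterSeconds: hours * 3600 },
  };
}

// Hypothetical: cached lookups carry a cachedAt date and expire after 3 hours.
const spec = ttlIndexSpec("cachedAt", 3);

// In the mongo shell this could be applied as:
//   db.remote_price_cache.ensureIndex(spec.keys, spec.options)
// Each cached document then needs a Date in the indexed field.
const cachedDoc = { key: "example-lookup", cachedAt: new Date() };
console.log(JSON.stringify(spec));
```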
22
Overall Results
Geo search speed with cold cache acceptable
Geo search speed with warm cache awesome
Pricing Service startup down to a few seconds
No production impact for major rate updates
Lowered risk for minor rate updates
23
Summary
Geo Haystack Index great for …
Retrieving lots of documents in a constrained search area
Geo searches with a secondary filter
SSDs great for …
Random reads
Reducing need for lots of complex indexes
Replica set flipping great for …
Instant swap of large amounts of data
Primarily, if not solely, read only
Trading cost for operational flexibility
Q & A
24