MongoDB San Francisco 2013: Geo Searches for Healthcare Pricing Data, presented by Robert Stewart,...

Geo Searches for Health Care Pricing Data

Robert Stewart

Senior Architect, Castlight Health

@wombatnation


Castlight Health

The Business and Technical Problems

Initial Solution

MongoDB, Geo Haystack Index and SSDs

Replica Set Flipping

Castlight Health

Hosted web and mobile applications providing unbiased information on health care cost and quality

Customers are employers and health plans

Founded in 2008, raised $181 million in VC funding

#1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011

Hiring!

Home Page

Search Results

Business Problem

Support searches for

Prices for a procedure performed by any in-network provider in a geographical area

Prices for all procedures performed by a single provider

Sub-second response, even if returning data on thousands of prices

Technical Problems

Need a very fast geo index

Rate count doubled in last 3 months to 600 million

Major rate updates monthly

Difficult to index data to ensure sequential reads

Sometimes lots of random reads

Pricing Retrieval Architecture

Initial Solution

Store pricing data in MySQL

When Pricing Service starts, create two in-memory indexes and cache most of the rates

55 GB JVM Heap with lots of GC tuning

20-minute service startup time to build indexes

3 hours for background caching of most rates

Trouble Brewing:

Total rates growing quickly

Rolling restart becoming unacceptably slow

If rates not in Java or MySQL cache, retrieval was very slow


Enter the Mongo

Geo Indexes

Tried standard geo 2D indexes in MongoDB

Too slow for my use case

Geo Haystack index

Conceptually similar

From docs.mongodb.org: “A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”
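
To make the bucketing idea concrete, here is a minimal sketch of a grid-bucketed index in plain JavaScript. It illustrates the general technique only and is not MongoDB's implementation. The class name HaystackSketch, its methods, and the price field are hypothetical; loc and pm follow the field names used later in this deck. Points are hashed into square buckets of bucketSize degrees together with the secondary key, and a search scans only the buckets that overlap the search box, then filters by flat (Euclidean) distance.

```javascript
// Illustrative only: a toy grid-bucket index, not MongoDB's haystack code.
class HaystackSketch {
  constructor(bucketSize) {
    this.bucketSize = bucketSize;
    this.buckets = new Map(); // "bx:by:secondaryKey" -> array of docs
  }

  // Map a coordinate (in degrees) to its integer bucket number.
  bucketOf(value) {
    return Math.floor(value / this.bucketSize);
  }

  key(bx, by, secondary) {
    return bx + ":" + by + ":" + secondary;
  }

  insert(doc) {
    const [x, y] = doc.loc;
    const k = this.key(this.bucketOf(x), this.bucketOf(y), doc.pm);
    if (!this.buckets.has(k)) this.buckets.set(k, []);
    this.buckets.get(k).push(doc);
  }

  // Return docs with the given secondary key within maxDistance (flat geometry).
  search(near, maxDistance, pm) {
    const [x, y] = near;
    const results = [];
    const minBx = this.bucketOf(x - maxDistance);
    const maxBx = this.bucketOf(x + maxDistance);
    const minBy = this.bucketOf(y - maxDistance);
    const maxBy = this.bucketOf(y + maxDistance);
    for (let bx = minBx; bx <= maxBx; bx++) {
      for (let by = minBy; by <= maxBy; by++) {
        const docs = this.buckets.get(this.key(bx, by, pm)) || [];
        for (const doc of docs) {
          const dx = doc.loc[0] - x;
          const dy = doc.loc[1] - y;
          if (Math.sqrt(dx * dx + dy * dy) <= maxDistance) results.push(doc);
        }
      }
    }
    return results;
  }
}

// Made-up sample data: two nearby matches, one wrong pm, one far away.
const index = new HaystackSketch(0.5);
index.insert({ loc: [-122.4, 37.79], pm: 6757, price: 100 });
index.insert({ loc: [-122.1, 37.6], pm: 6757, price: 120 });
index.insert({ loc: [-122.4, 37.79], pm: 1111, price: 80 });
index.insert({ loc: [-118.2, 34.05], pm: 6757, price: 90 });
const hits = index.search([-122.4, 37.79], 0.5, 6757);
```

Because the secondary key is part of the bucket key, documents for other procedures are never touched, which is the property that makes a small search radius plus one extra filter cheap.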

Mercator Projection with 10 degree grid

Geo Haystack

We chose degrees long-lat for x-y coordinate system

25 miles is our default search radius

Roughly 0.5 degrees in middle of the US

db.priceables_1.ensureIndex(
  { loc: "geoHaystack", pm: 1 },
  { bucketSize: 0.5 })

db.runCommand(
  { geoSearch: "priceables_1",
    near: [-122.4, 37.79],
    maxDistance: 0.5,
    search: { pm: 6757 },
    limit: 50000 })

maxDistance calculated using great circle algorithm
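
The "roughly 0.5 degrees" figure can be sanity checked with great-circle arithmetic. The sketch below assumes a spherical Earth with a mean radius of 3958.8 miles, and the function names are hypothetical; it is one plausible way to turn a radius in miles into degrees, not necessarily the calculation used in the Pricing Service. A degree of latitude spans about 69 miles everywhere, while a degree of longitude shrinks with the cosine of the latitude.

```javascript
// Assumption: spherical Earth, mean radius in miles.
const EARTH_RADIUS_MILES = 3958.8;
const MILES_PER_DEGREE = (2 * Math.PI * EARTH_RADIUS_MILES) / 360; // about 69.1

// Degrees of latitude covered by a north-south distance in miles.
function milesToLatDegrees(miles) {
  return miles / MILES_PER_DEGREE;
}

// Degrees of longitude covered by an east-west distance at a given latitude.
function milesToLonDegrees(miles, latitudeDegrees) {
  const latRad = (latitudeDegrees * Math.PI) / 180;
  return miles / (MILES_PER_DEGREE * Math.cos(latRad));
}

// 25 mile default radius at the latitude used in the geoSearch example.
const latDeg = milesToLatDegrees(25);
const lonDeg = milesToLonDegrees(25, 37.79);
console.log(latDeg.toFixed(2), lonDeg.toFixed(2));
```

At this latitude 25 miles works out to roughly 0.36 degrees of latitude and 0.46 degrees of longitude, so a flat maxDistance of 0.5 is a slightly generous bound.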

Geo Haystack Pros

Very fast when retrieving many documents in a relatively small search radius

Great when you also need to apply a secondary filter

Compound 2dsphere index in Mongo 2.4 has even better support

Geo Haystack Cons

Supports only one extra filter in the index (SERVER-2979)

Bug: an unindexed query on only the second part of the key fails with an assertion (SERVER-8645)

> db.priceables_1.find({pm: 6757})

error: { "$err" : "assertion src/mongo/db/geo/haystack.cpp:178" }

Second part of index can’t have an array value

Location part of key can’t be null
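
The last two restrictions can be checked before documents are written. The helper below is a minimal sketch: the name haystackProblems is hypothetical, it assumes the location is stored as a [longitude, latitude] array, and it uses the loc and pm field names from the index definition earlier in the deck.

```javascript
// Hypothetical pre-insert check for the two document-shape restrictions.
function haystackProblems(doc) {
  const problems = [];
  const loc = doc.loc;
  // Location part of the key must be present: here, a pair of finite numbers.
  const locOk =
    Array.isArray(loc) &&
    loc.length === 2 &&
    loc.every((v) => typeof v === "number" && Number.isFinite(v));
  if (!locOk) problems.push("loc must be a [longitude, latitude] pair of numbers");
  // Second part of the index key must be a single value, not an array.
  if (Array.isArray(doc.pm)) problems.push("pm must not be an array");
  return problems;
}

const good = haystackProblems({ loc: [-122.4, 37.79], pm: 6757 });
const bad = haystackProblems({ loc: null, pm: [6757, 6758] });
console.log(good.length, bad.length);
```

Rejecting such documents in the loader keeps the failure close to the bad record instead of surfacing later as an index error.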

SSDs

For uncached data on HDD, Geo Haystack was twice as fast as custom Java geo index and MySQL

Still close to 1 minute for big queries with full data set

Death by random read

Tested with a $200 Samsung SSD

Typical query dropped to 20 millis

Big query only about 150 millis

Mongoperf on SSDs

Random 4k block reads, 5 GB file, 16 threads

Env     SSD                          Read Ops/s   Read MB/s
Prod    Samsung 200GB SLC            74k          288
QA VM   Samsung 200GB SLC            30k          117
Dev     Samsung 830 256GB SATA MLC   47k          183

Sequential write of the 5 GB file

Env     SSD                          Write Ops/s  Write MB/s
Prod    Samsung 200GB SLC            1074         289
QA VM   Samsung 200GB SLC            405          196
Dev     Samsung 830 256GB SATA MLC   438          210
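
The two columns of the random read results are tied together by the 4k block size: throughput is operations per second times bytes per operation. The sketch below reproduces that arithmetic, assuming "MB" here means 2^20 bytes, which is the interpretation that matches the reported numbers; the function name is hypothetical.

```javascript
const BLOCK_BYTES = 4 * 1024;     // 4k random read block
const BYTES_PER_MB = 1024 * 1024; // assumption: MB reported as 2^20 bytes

// Throughput implied by a random-read rate at the 4k block size.
function readThroughputMB(opsPerSecond) {
  return (opsPerSecond * BLOCK_BYTES) / BYTES_PER_MB;
}

// Read Ops/s from the table, as rounded to the nearest thousand on the slide.
const reads = { Prod: 74000, "QA VM": 30000, Dev: 47000 };
for (const [env, ops] of Object.entries(reads)) {
  console.log(env, readThroughputMB(ops).toFixed(0) + " MB/s");
}
```

The computed values land within about 1 MB/s of the table's 288, 117 and 183, which is the slack expected from the rounded Ops/s figures.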

Low Impact Pricing Updates

Requirements

Major price updates monthly

Minor updates more frequently

Huge bulk loads with no impact on active replica set

I/O bound, not CPU bound

Replica Set Flipping Solution

Two replica sets

Lowered cost with two SSDs on each pricing server

scp compressed files from QA to passive replica set

Protip: to compress and uncompress

tar cvf - pricing | pigz > ~/pricing.tgz

pigz -dc pricing.tgz | tar xvf -

Page in index and data

db.runCommand({ touch: "priceables_1", index: true, data: true })

Pricing Service operation to atomically flip
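
The flip itself can be as small as swapping which replica set the service reads from. The following is a minimal sketch under stated assumptions: the class ReplicaSetFlipper and its methods are hypothetical, it tracks replica set names only (a real service would hold driver connections), and it assumes a single-threaded runtime where replacing one reference is atomic from the readers' point of view. The names prodpricing1 and prodpricing2 come from the architecture slide.

```javascript
// Hypothetical sketch: track which replica set is active and swap in one step.
class ReplicaSetFlipper {
  constructor(activeName, passiveName) {
    // Both roles live in one frozen object so a flip replaces a single reference.
    this.state = Object.freeze({ active: activeName, passive: passiveName });
  }

  // Replica set that serves pricing queries.
  active() {
    return this.state.active;
  }

  // Replica set that receives the next bulk load.
  passive() {
    return this.state.passive;
  }

  // Swap roles; readers see either the old pair or the new pair, never a mix.
  flip() {
    const { active, passive } = this.state;
    this.state = Object.freeze({ active: passive, passive: active });
    return this.state.active;
  }
}

const flipper = new ReplicaSetFlipper("prodpricing1", "prodpricing2");
// Load and page in the new data on flipper.passive(), then switch over:
const nowActive = flipper.flip();
console.log("serving from", nowActive);
```

Keeping the old set untouched after the flip also gives a quick way back: flipping again returns to the previous data.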

Replica Set Architecture

Replica sets: prodpricing1, prodpricing2

Physical servers:

Server     mongod port   Role
pricing1   28001         primary
pricing1   28002         secondary
pricing2   28001         secondary
pricing2   28002         primary
db1        28001         arbiter
db2        28002         arbiter

Replica Set Flipping Drawbacks

Obviously, increased cost, but only for SSDs

Recently added caching of remote pricing lookups

TTL collections

Cache is lost during a flip

But, usually flip late at night

Cache eviction time is only a few hours
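
A TTL collection is driven by an index on a date field with the expireAfterSeconds option. The sketch below only builds the index arguments so it can run anywhere; the collection name remote_price_cache, the field name cachedAt, and the 3 hour lifetime are assumptions made for illustration, since the deck says only "a few hours".

```javascript
// Hypothetical names: collection "remote_price_cache", date field "cachedAt".
function ttlIndexSpec(hours) {
  return {
    keys: { cachedAt: 1 },                         // TTL indexes need a date field
    options: { expireAfterSeconds: hours * 3600 }, // documents expire after this age
  };
}

const spec = ttlIndexSpec(3); // assumption: "a few hours" taken as 3
console.log(JSON.stringify(spec));

// In the mongo shell the same arguments would be passed as:
//   db.remote_price_cache.ensureIndex(spec.keys, spec.options)
```

With a lifetime this short, a cache emptied by a late-night flip refills within the same window it would have turned over anyway.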

Overall Results

Geo search speed with cold cache acceptable

Geo search speed with warm cache awesome

Pricing Service startup down to a few seconds

No production impact for major rate updates

Lowered risk for minor rate updates

Summary

Geo Haystack Index great for …

Retrieving lots of documents in a constrained search area

Geo searches with a secondary filter

SSDs great for …

Random reads

Reducing need for lots of complex indexes

Replica set flipping great for …

Instant swap of large amounts of data

Primarily, if not solely, read only

Trading cost for operational flexibility


Q & A
