MongoDB, our Swiss Army Knife database
Our Swiss Army Knife Database
MongoDB at fotopedia
• Context
• Wikipedia data storage
• Metacache
Fotopedia
• Fotonauts, an American/French company
• Photo — Encyclopedia
• Heavily interconnected system: Flickr, Facebook, Wikipedia, Picasa, Twitter…
• MongoDB in production since last October
• main store lives in MySQL… for now
First contact
• Wikipedia imported data
Wikipedia queries
• wikilinks from one article
• links to one article
• geo coordinates
• redirect
• why not use the Wikipedia API ?
Download ~5.7GB gzipped XML → Geo / Redirect / Backlink / Related → ~12GB tabular data (document-shape sketch below)
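The four query types listed above all boil down to key/value lookups on a per-article record. A purely illustrative sketch of the shape such a record could take once the dump is processed (field names are invented, not the actual fotopedia schema):

    # Hypothetical per-article record; every query from the list above
    # becomes a single-key lookup on a document like this.
    article = {
        "_id": "Paris",                                # article title as the key
        "links": ["Seine", "France", "Louvre"],        # wikilinks from this article
        "backlinks": ["2nd_arrondissement_of_Paris"],  # links to this article
        "geo": {"lat": 48.8567, "lon": 2.3508},        # geo coordinates
        "redirect": None,                              # or the canonical title
    }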
Problem
Load ~12GB into a K/V store
CouchDB 0.9 attempt
• CouchDB had no dedicated import tool
• need to go through the HTTP/REST API
“DATA LOADING”
LOADING!
(obviously hijacked from xkcd.com)
Problem, rephrased
Load ~12GB into any K/V store
in hours, not days
Hadoop HBase ?
• as we were already using Hadoop Map/Reduce for preparation
• bulk load was just emerging at that time, requiring us to code against HBase private APIs, generate the data in an ad-hoc binary format, ...
photo by neural.it on Flickr
Problem, re-rephrased
Load ~12GB into any K/V store
in hours, not days
without wasting a week on development
and another week on setup
and several months on tuning
please ?
MongoDB attempt
• Transforming the tabular data into a JSON form: about half an hour of code, 45 minutes of Hadoop parallel processing (see the sketch after this list)
• setup mongo server : 15 minutes
• mongoimport : 3 minutes to start it, 90 minutes to run
• plug RoR app on mongo : minutes
• prototype was done in a day
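A minimal sketch of that pipeline, assuming a tab-separated input and a made-up field layout; the real transform ran as a Hadoop job, and the load used stock mongoimport:

    # Emit one JSON document per line, the format mongoimport consumes
    # directly (file and field names here are assumptions).
    import json

    with open("wikilinks.tsv") as tsv, open("wikilinks.json", "w") as out:
        for line in tsv:
            source, target = line.rstrip("\n").split("\t")
            out.write(json.dumps({"from": source, "to": target}) + "\n")

    # then, on the idle MongoDB instance:
    #   mongoimport --db wikipedia --collection wikilinks --file wikilinks.json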
Download ~5.7GB gzip → Geo / Redirect / Backlink / Related → ~12GB, 12M docs → batch synchronous import → Ruby on Rails
Hot swap ?
• Indexing was locking everything.
• Just run two instances of MongoDB.
• One instance is servicing the web app
• One instance is asleep or loading data
• A third instance knows the status of the other two (sketch below).
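A hedged sketch of how such a swap could be coordinated, assuming three local mongod instances on made-up ports; database and field names are also invented, not the actual setup:

    # Two data-bearing mongod instances plus a tiny "status" instance that
    # only records which one the web app should talk to.
    from pymongo import MongoClient

    STATUS = MongoClient("mongodb://localhost:27019")["status"]["instances"]

    def active_port():
        # the third instance knows which of the two data instances is live
        return STATUS.find_one({"_id": "wikipedia"})["active"]

    def swap_to(port):
        # called once mongoimport and indexing have finished on the idle instance
        STATUS.update_one({"_id": "wikipedia"}, {"$set": {"active": port}})

    # web app side: connect to whichever instance is currently serving
    db = MongoClient("mongodb://localhost:%d" % active_port())["wikipedia"]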
We loved:
• JSON import format
• efficiency of mongoimport
• simple and flexible installation
• just one cumbersome dependency
• easy to start (we use runit)
• easy to have several instances on one box
Second contact
• itʼs just all about graphs, anyway.
• wikilinks
• people following people
• related community albums
• and soon, interlanguage links
all about graphs...
• ... and itʼs also all about cache.
• The application needs to “feel” faster, letʼs cache more.
• The application needs to “feel” right, so letʼs cache less.
• or — big sigh — invalidate.
Page fragment caching
RoR application
Varnish HTTP cache
Nginx SSI
photo by Mykl Roventine on Flickr
photo by Aires Dos Santos
photo by Leslie Chatfield on Flickr
There are only two hard things in Computer Science: cache invalidation and naming things.
Phil Karlton
Haiku ?
Naming things
• REST has been a strong design principle at fotopedia since the early days, and the effort is paying off.
/en/2nd_arrondissement_of_Paris
/en/Paris/fragment/left_col
/en/Paris/fragment/related
/users/john/fragment/contrib
Invalidating
• REST allows us to invalidate by URL prefix.
• When the Paris album changes, we have to invalidate /en/Paris.*
Varnish invalidation
• The Varnish built-in regexp-based invalidation is not designed for intensive, fine-grained invalidation.
• We need to invalidate URLs individually.
/en/Paris.*
/en/Paris
/en/Paris/fragment/left_col
/en/Paris/photos.json?skip=0&number=20
/en/Paris/photos.json?skip=13&number=27
Metacache workflow
RoR application
Varnish HTTP cache
Nginx SSI
metacache feeder
varnish log
invalidation worker
/en/Paris
/en/Paris/fragment/left_col
/en/Paris/photos.json?skip=0&number=20
/en/Paris/photos.json?skip=13&number=27
/en/Paris/fragment/left_col
/en/Paris.*
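A sketch of the two moving parts in that workflow, under stated assumptions (collection name, log field position and the Varnish purge setup are all made up; this only illustrates the mechanism): the feeder remembers every URL Varnish serves, and the worker turns a prefix like /en/Paris.* into individual purges.

    import re
    import requests
    from pymongo import MongoClient

    urls = MongoClient()["metacache"]["urls"]

    def feed(logline):
        # metacache feeder: record each URL seen in the varnish log
        # (column index depends on the actual log format)
        url = logline.split()[6]
        urls.replace_one({"_id": url}, {"_id": url}, upsert=True)

    def invalidate(prefix):
        # invalidation worker: expand "/en/Paris.*" into per-URL purges,
        # assuming Varnish is configured to accept PURGE requests
        pattern = "^" + re.escape(prefix.rstrip(".*"))
        for doc in urls.find({"_id": {"$regex": pattern}}):
            requests.request("PURGE", "http://varnish.internal" + doc["_id"])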
Wow.
• This time we are actually using MongoDB as a BTree. Impressive.
• The metacache has been running fine for several months, and we want to go further.
Invalidate less
• We need to be more specific as to what we invalidate.
• Today, if somebody votes on a photo in the Paris album, we invalidate the whole /en/Paris prefix, even though most of it is unchanged.
• We will move towards a more clever metacache.
Metacache reloaded
• Pub/Sub metacache
• Have the backend send a specific header, to be caught by the metacache-feeder, containing a “subscribe” message.
• This header will be a JSON document, to be pushed to the metacache.
• The purge commands will be mongo search queries.
{url:/en/Paris, observe:[summary,links]}
{url:/en/Paris/fragment/left_col, observe: [cover]}
{url:/en/Paris/photos.json?skip=0&number=20, observe:[photos]}
{url:/en/Paris/photos.json?skip=13&number=27, observe:[photos]}
when somebody votes → { url: /en/Paris.*, observe: photos }
when the summary changes → { url: /en/Paris.*, observe: summary }
when a new link is created → { url: /en/Paris.*, observe: links }
(see the query sketch below)
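Under those assumptions, a purge command really is just a mongo search query over the stored subscription documents; a minimal sketch (collection name assumed):

    import re
    from pymongo import MongoClient

    subs = MongoClient()["metacache"]["subscriptions"]

    def urls_to_purge(prefix, topic):
        # e.g. a vote triggers urls_to_purge("/en/Paris", "photos"):
        # only subscriptions observing "photos" match, so /en/Paris itself
        # (observing summary and links) is left untouched
        query = {"url": {"$regex": "^" + re.escape(prefix)}, "observe": topic}
        return [doc["url"] for doc in subs.find(query)]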
Other use cases
• Timeline activities storage: just one more BTree usage.
• Moderation workflow data: tiny dataset, but more complex queries and map/reduce (sketch below).
• Suspended experimentation around log collection and analysis
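For the moderation data, a hedged sketch of what a map/reduce over such a collection could look like, in the pymongo style of that era (collection and field names are invented):

    from pymongo import MongoClient
    from bson.code import Code

    moderation = MongoClient()["fotopedia"]["moderation"]

    mapper = Code("function () { emit(this.state, 1); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    # count items per workflow state; results land in a small output collection
    counts = moderation.map_reduce(mapper, reducer, "moderation_counts")
    for doc in counts.find():
        print(doc["_id"], doc["value"])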
Current situation
• MySQL: main data store
• CouchDB: old timelines (+ chef)
• MongoDB: metacache, wikipedia, moderation, new timelines
• Redis: raw data cache for counters, recent activity (+ resque)
What about the main store ?
• albums are a good fit for documents (see the sketch below)
• votes and score may be more tricky
• recent introduction of resque
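A purely illustrative sketch of why albums map well onto documents while votes and scores are trickier: the album is one self-contained document, but a vote has to be an atomic in-place update (names and fields below are invented, not the actual schema):

    from pymongo import MongoClient

    albums = MongoClient()["fotopedia"]["albums"]

    # the album as a single document, photos embedded
    albums.insert_one({
        "_id": "en/Paris",
        "title": "Paris",
        "photos": [{"id": 123, "score": 42}, {"id": 456, "score": 7}],
    })

    # a vote becomes an atomic positional increment instead of a full rewrite
    albums.update_one(
        {"_id": "en/Paris", "photos.id": 123},
        {"$inc": {"photos.$.score": 1}},
    )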
In short
• Simple, fast.
• Hackable: in a language most can read.
• Clear roadmap.
• Very helpful and efficient team.
• Designed with application developer needs in mind.