Flowdock's full-text search with MongoDB

Full-text search with MongoDB. Otto Hilska, @mutru, @flowdock. Thursday, July 7, 2011

DESCRIPTION

Otto Hilska's presentation about Flowdock's full-text search with MongoDB. San Francisco MongoDB meetup in June 2011.

TRANSCRIPT

Page 1: Flowdock's full-text search with MongoDB

Full-text search with MongoDB
Otto Hilska, @mutru

@flowdock

Page 2: Flowdock's full-text search with MongoDB

APIdock.com is one of the services we’ve created for the Ruby community: a social documentation site.

Page 3: Flowdock's full-text search with MongoDB

- We did some “research” about the real-time web back in 2008.
- At the same time, we did software consulting for large companies.
- Flowdock is a product spinoff from our consulting company. It’s Google Wave done right, with a focus on technical teams.

Page 4: Flowdock's full-text search with MongoDB

Flowdock combines a group chat (on the right) with a shared team inbox (on the left).

Our promise: Teams stay up-to-date, react in seconds instead of hours, and never forget anything.

Page 5: Flowdock's full-text search with MongoDB

Flowdock gets messages from various external sources (like JIRA, Twitter, Github, Pivotal Tracker, emails, RSS feeds) and from the Flowdock users themselves.

Page 6: Flowdock's full-text search with MongoDB

All of the highlighted areas are objects in the “messages” collection. MongoDB’s document model is perfect for our use case, where various data formats (tweets, emails, ...) are stored inside the same collection.

Page 11: Flowdock's full-text search with MongoDB

{
  "_id": ObjectId("4de92cd0097580e29ca5b6c2"),
  "id": NumberLong(45967),
  "app": "chat",
  "flow": "demo:demoflow",
  "event": "comment",
  "sent": NumberLong("1307126992832"),
  "attachments": [],
  "_keywords": ["good", "point", ...],
  "uuid": "hC4-09hFcULvCyiU",
  "user": "1",
  "content": {
    "text": "Good point, I'll mark it as deprecated.",
    "title": "Updated JIRA integration API"
  },
  "tags": ["influx:45958"]
}

This is what a typical message looks like.

Page 12: Flowdock's full-text search with MongoDB

Browser: jQuery (+UI), Comet impl., MVC impl.

Rails app: Website, Admin, Payments, Account mgmt

Scala backend: Messages, Who’s online, API, RSS feeds, SMTP server, Twitter feed

PostgreSQL, MongoDB

An overview of the Flowdock architecture: most of the code is JavaScript and runs inside the browser.

The Scala (+Akka) backend does all the heavy lifting (mostly related to messages and online presence), and the Ruby on Rails application handles all the easy stuff (public website, account management, administration, payments etc).

We used PostgreSQL in the beginning, and migrated messages to MongoDB. Otherwise there is no particular reason why we couldn’t use MongoDB for everything.

Page 15: Flowdock's full-text search with MongoDB

db.messages.ensureIndex({flow: 1, tags: 1, id: -1});

db.messages.find({flow: 123, tags: {$all: ["production"]}}).sort({id: -1});

One of the key features in Flowdock is tagging. For example, if I’m making changes to our production environment, I mention it in the chat and tag it as #production. Production deployments are automatically tagged with the same tag, so I can easily get a log of everything that’s happened.

It’s very easy to implement with MongoDB, since we just index the “tags” array and search using it.
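
The talk does not show how a tag such as #production ends up in the "tags" array. A minimal sketch, assuming tags are typed as #word in the chat text; the helper name extractTags and the set of characters allowed in a tag are hypothetical:

```javascript
// Hypothetical helper: collect #hashtags from a chat message into a tags array.
function extractTags(text) {
  const matches = text.match(/#[A-Za-z0-9_-]+/g) || [];
  // Drop the leading "#", lowercase, and de-duplicate while keeping order.
  return [...new Set(matches.map((tag) => tag.slice(1).toLowerCase()))];
}

const message = {
  flow: "demo:demoflow",
  content: { text: "Restarting the app servers #production #Deploy" },
};
message.tags = extractTags(message.content.text);
console.log(message.tags); // [ 'production', 'deploy' ]
```

Storing the tags as a plain array of strings is what lets the compound index on flow, tags and id serve the $all query shown on the slide.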

Page 16: Flowdock's full-text search with MongoDB

https://jira.mongodb.org/browse/SERVER-380

There’s a JIRA ticket about full-text search for MongoDB. Users have built lots of their own implementations, but the discussion continues.

Page 17: Flowdock's full-text search with MongoDB

Library support

• Stemming

• Ranked probabilistic search

• Synonyms

• Spelling corrections

• Boolean, phrase, word proximity queries

These are some of the features you might see in an advanced full-text search implementation. There are libraries to do some parts of this (like libraries specific to stemming), and more advanced search libraries like Lucene and Xapian.

Lucene is a Java library (also ported to C++ etc.), and Xapian is a C++ library.

Many of these are hackable with MongoDB by expanding the query.
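
As one example of expanding the query, synonyms can be handled by turning each search term into a list of alternatives. This is a minimal sketch, assuming a keyword array field like the "_keywords" field in the sample message. The synonym table and the expandQuery helper are hypothetical, and the $and operator it relies on was, as far as I know, only added in MongoDB 2.0, after this talk:

```javascript
// Hypothetical synonym table; a real one would come from a dictionary or config.
const SYNONYMS = new Map([
  ["deploy", ["deployment", "release"]],
  ["bug", ["defect", "issue"]],
]);

// Expand every search term into "term OR any of its synonyms",
// then require all of the expanded terms to match.
function expandQuery(terms) {
  return {
    $and: terms.map((term) => ({
      _keywords: { $in: [term, ...(SYNONYMS.get(term) || [])] },
    })),
  };
}

console.log(JSON.stringify(expandQuery(["deploy", "production"])));
```

Stemming and spelling corrections could be approached the same way, by adding the stemmed or corrected forms to each $in list.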

Page 18: Flowdock's full-text search with MongoDB

• Standalone server, Lucene based, rich document support, result highlighting, distributed

• Standalone server, Lucene queries, REST/JSON API, real-time indexing, distributed

• Standalone server, MySQL integration, real-time indexing, distributed searching

You can use the libraries directly, but they don’t do anything to guarantee replication & scaling.

Standalone implementations usually handle clustering, query processing and some more advanced features.

Page 19: Flowdock's full-text search with MongoDB

Things to consider

• Data access patterns

• Technology stack

• Data duplication

• Use cases: need to search Word documents? Need to support boolean queries? ...

When choosing your solution, keep it simple. Consider how write-heavy your app is, which special features you need, and whether you can afford to store the data three times in a MongoDB replica set plus two times in a search server.

Page 20: Flowdock's full-text search with MongoDB

Real-time search
Performance

There are tons of use cases where search doesn’t need to be real-time. It’s a requirement that will heavily impact your application.

Page 22: Flowdock's full-text search with MongoDB

KISS
Even Facebook does.

As a lean startup, we can’t afford to spend a lot of time on technology adventures. We need to measure what customers want. Many of the features are possible to achieve with MongoDB. Facebook’s message search also matches only exact words (= it sucks), and people don’t complain.

So we built a minimal implementation with MongoDB. No stemming or anything, just a keyword search, but it needs to be real-time.

Page 23: Flowdock's full-text search with MongoDB

“Good point. I’ll mark it as deprecated.”

_keywords: ["good", "point", "mark", "deprecated"]

You need client-side code for this transformation.
What’s possible: stemming, search by beginning of the word.
What’s not possible: intelligent ranking on the DB side (which is ok for us, since we want to sort results by time anyway).
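
A minimal sketch of such a client-side transformation, assuming keywords are lowercased, split on punctuation and filtered against a stop-word list. The talk does not say which words Flowdock drops, so the STOP_WORDS list and the extractKeywords name here are hypothetical:

```javascript
// Hypothetical stop-word list, chosen so the slide's example comes out the same.
const STOP_WORDS = new Set(["i", "i'll", "it", "as", "a", "an", "the", "to", "of", "and"]);

// Turn message text into a de-duplicated list of lowercase keywords.
// Only ASCII letters and digits are treated as word characters in this sketch.
function extractKeywords(text) {
  const words = text
    .toLowerCase()
    .replace(/[\u2018\u2019]/g, "'") // normalize curly apostrophes
    .split(/[^a-z0-9']+/)            // split on anything that is not a word character
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
  return [...new Set(words)];
}

console.log(extractKeywords("Good point. I'll mark it as deprecated."));
// [ 'good', 'point', 'mark', 'deprecated' ]
```

The result would be stored in the message's _keywords field when the message is inserted. A stemmer could be applied to each word before de-duplication.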

Page 24: Flowdock's full-text search with MongoDB

db.messages.ensureIndex({flow: 1, _keywords: 1, id: -1});

Simply build the _keywords index the same way we already had our tags indexed.

Page 25: Flowdock's full-text search with MongoDB

db.messages.find({flow: 123, _keywords: {$all: ["hello", "world"]}}).sort({id: -1});

Search is also trivial to implement. As said, our users want the messages in chronological order, which makes this a lot easier.
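
A minimal sketch of how the filter for this search could be built from what a user types, assuming the search terms are normalized the same way as the stored keywords. The helper names buildSearchQuery and buildPrefixQuery are hypothetical. The prefix variant illustrates the "search by beginning of the word" idea with an anchored regular expression:

```javascript
// Build a filter for an AND search over the _keywords array.
function buildSearchQuery(flow, searchText) {
  const terms = searchText
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 0);
  return { flow: flow, _keywords: { $all: terms } };
}

// Prefix variant: match keywords that start with the given text.
function buildPrefixQuery(flow, prefix) {
  // Escape regular expression metacharacters so the prefix is matched literally.
  const escaped = prefix.toLowerCase().replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return { flow: flow, _keywords: new RegExp("^" + escaped) };
}

console.log(JSON.stringify(buildSearchQuery("demo:demoflow", "Hello world")));
console.log(buildPrefixQuery("demo:demoflow", "depr"));
```

Either object would be passed to db.messages.find(...) and followed by .sort({id: -1}), as on the slide.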

Page 26: Flowdock's full-text search with MongoDB

That’s it! Let’s take it to production.

A minimal search implementation is the easy part. We faced quite a few operational issues when deploying it to production.

Page 27: Flowdock's full-text search with MongoDB

Index size:

2500 MB per 1M messages

As it turns out, the _keywords index is pretty big.

Page 28: Flowdock's full-text search with MongoDB

[Bar chart: 10M messages, size in gigabytes (axis from 0 to 20). Series: Messages, Index: Keywords, Index: Tags, Index: Others]

It would be great to fit the indices in memory. Right now they obviously don’t fit. Stemming would reduce the index size. The index size also has implications for, for example, insert/update performance.

Page 30: Flowdock's full-text search with MongoDB

Option #1: Just generate _keywords and build the index in the background.

The naive solution: try to do it with no downtime. Didn’t work, site slowed down too much.

Page 31: Flowdock's full-text search with MongoDB

Option #2: Try to do it during a 6-hour service break.

It worked much faster when our users weren’t constantly accessing the data. But 6 hours during a weekend wasn’t enough, and we had to cancel the migration.

Page 32: Flowdock's full-text search with MongoDB

Option #3: Delete _keywords, build the index and re-generate keywords in the background.

Generating an index is much faster when there is no data to index. Building the index was fine, but generating keywords was very slow and took the site down.

Page 33: Flowdock's full-text search with MongoDB

Option #4: As previously, but add sleep(5).

You know you’re a great programmer when you’re adding sleep()s to your production code.

Page 34: Flowdock's full-text search with MongoDB

Option #5: As previously, but add Write Concerns.

Let the queries block, so that if MongoDB slows down, the migration script doesn’t flood the server.

Yeah, it would’ve taken a month, or it would’ve slowed down the service.

Page 35: Flowdock's full-text search with MongoDB

Option #6: Shard.

Would have been a solution, but we didn’t want to host all that data in-memory, since it’s not accessed that often.

Page 36: Flowdock's full-text search with MongoDB

Option #7: SSD!

We had the chance to try it on an SSD.

This is not a viable option for AWS users, but they could temporarily shard their data across a large number of high-memory instances.

Page 39: Flowdock's full-text search with MongoDB

My reactions to using SSD. Decided to benchmark it.

Page 40: Flowdock's full-text search with MongoDB

Messages
10M messages in 100 “flows”, 100k each
Total size 19.67 GB

Indices
_id: 1
flow: 1, app: 1, id: -1
flow: 1, event: 1, id: -1
flow: 1, id: -1
flow: 1, tags: 1, id: -1
flow: 1, _keywords: 1, id: -1
Total size 22.03 GB

This is the starting point for my next benchmark. Wanted to test it with a real-size database, instead of starting from scratch.

Page 41: Flowdock's full-text search with MongoDB

[Bar chart: mongorestore time in minutes, SSD vs. SATA (axis from 0 to 300)]

First I used mongorestore to populate the test database. It took 133 vs. 235 minutes. Index generation is mostly CPU-bound, so it doesn’t really benefit from the faster seek times.

Page 42: Flowdock's full-text search with MongoDB

Insert performance test

A total of 100 workspaces
And 3 workers each accessing 30 workspaces
Performing 1000 inserts to each

= 90 000 inserts, as quickly as possible

Page 43: Flowdock's full-text search with MongoDB

[Bar chart: insert benchmark, time in minutes, SSD vs. SATA (axis from 0 to 200)]

4.25 vs. 155. That’s 4 minutes vs. 2.5 hours.

Page 44: Flowdock's full-text search with MongoDB

9.67 inserts/sec vs. 352.94 inserts/sec

The same numbers as inserts/sec.
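
The arithmetic behind these figures can be reproduced from the benchmark setup (3 workers, 30 workspaces each, 1000 inserts per workspace) and the measured times. A small sketch, with variable names of my own choosing; note that the slide's 9.67 looks truncated rather than rounded:

```javascript
// 3 workers x 30 workspaces x 1000 inserts each
const totalInserts = 3 * 30 * 1000;

// Measured wall-clock times from the insert benchmark, in minutes.
const minutes = { ssd: 4.25, sata: 155 };

function insertsPerSecond(durationMinutes) {
  return totalInserts / (durationMinutes * 60);
}

console.log(insertsPerSecond(minutes.sata).toFixed(2)); // 9.68
console.log(insertsPerSecond(minutes.ssd).toFixed(2));  // 352.94
console.log((minutes.sata / minutes.ssd).toFixed(1));   // 36.5
```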

Page 45: Flowdock's full-text search with MongoDB

36x

36x performance improvement with SSD. So we ended up using it in production.

Page 46: Flowdock's full-text search with MongoDB

Works well, searches from all kinds of content (here Git commit messages and deployment emails), queries typically take only tens of milliseconds max.

Page 47: Flowdock's full-text search with MongoDB

Questions / Comments?

@flowdock / [email protected]

This was a very specific full-text search implementation. The fact that we didn’t need to rank search results made it trivial.

I’m happy to discuss other use cases. Please share your thoughts and experiences.