Wanderu – Lessons from Building a Travel Site with Neo4j - Eddy Wong @ GraphConnect NY 2013
DESCRIPTION
Wanderu is a consumer-focused search engine for buses and trains. Eddy will recount the architectural, modeling and other technical “lessons learned” and “lessons unlearned” in implementing our geospatial and search features using Neo4j in the context of a NoSQL polyglot solution.
TRANSCRIPT
Wanderu: Lessons Learned
Lessons Learned and Unlearned from Building a Travel Site with Graphs and Neo4j
Eddy Wong
CTO, Wanderu.com
@eddywongch
About Wanderu.com
Search Engine for (Intercity) Buses and Trains
Demo
From pt A to pt B
Nomenclature: Stations, Trips
A: NYC
B: DC
Philly
BOLT, $13, 11/07/2013
MEG, $9, 11/07/2013
MEG, $4, 11/07/2013
A Shortest Path Problem as a function of depart, arrive, price, duration, date times
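A minimal sketch of this framing, using price as the only edge weight. The slide lists the fares BOLT $13, MEG $9 and MEG $4 but does not say which leg each belongs to, so the assignment below is an assumption, as are the names `TRIPS` and `cheapest_path`. With this assumed assignment both routes happen to cost $13.

```python
import heapq

# Assumed assignment of the slide's fares to legs (not confirmed by the talk).
TRIPS = {
    "NYC": [("DC", "BOLT", 13), ("PHL", "MEG", 9)],
    "PHL": [("DC", "MEG", 4)],
    "DC": [],
}

def cheapest_path(trips, start, goal):
    """Dijkstra's algorithm with the ticket price as the edge weight."""
    queue = [(0, start, [start])]   # (total price so far, station, path taken)
    settled = set()
    while queue:
        cost, station, path = heapq.heappop(queue)
        if station == goal:
            return cost, path
        if station in settled:
            continue
        settled.add(station)
        for nxt, _carrier, price in trips.get(station, []):
            heapq.heappush(queue, (cost + price, nxt, path + [nxt]))
    return None  # goal not reachable

result = cheapest_path(TRIPS, "NYC", "DC")
print(result)
```

A real search would weigh duration and departure/arrival times as well; swapping the weight for a combined score keeps the same structure.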
Lessons
Learned
Unlearned
Idea
• Architectural
• Modeling
• Geo
Our Story
• 2 yr startup, Tech started about 1+ yr ago
• Beta in Mar 2013, Launch in Aug 2013
• Knew nothing about Neo4j when we started (Jun 2012)
• Did not like the relational model: wanted schema-less and no self-joins
• Wanted a graph model
Workflow
Store
Scraping JSON
Bus Websites Non-uniform Data
Uniform Data
Server
Architectural Lessons
Art: MC Escher
Our Situation
• Data is written only in one direction
• Users search for paths, then segments
• Searches are done by date
• Needed online capability
• Trip info (price/avail) could change on some
Solution
Scraping JSON
Bus Websites Non-uniform Data
Uniform Data
MongoDB
Neo4j
MongoConn
Nodes & Edges
Replica Mechanism
MongoConnector
• MongoDB Lab project, open source, unsupported
• Uses Replica Mechanism: Oplog
• Eventually Consistent (not real time)
• Written in Python
• Main methods: Upserts and Deletes, passes doc
• Implement DocMgr->Neo4jDocMgr->py2neo
• We can add new properties easily on the fly
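The talk does not show the Neo4jDocMgr code. The following is a minimal sketch of the idea only, assuming a doc manager that exposes upsert and delete operations and is handed the whole document, as the slide describes. The class names, the `depart`/`arrive` fields, and the in-memory graph that stands in for Neo4j and py2neo are all hypothetical, so the sketch runs without a server.

```python
class InMemoryGraph:
    """Stand-in for Neo4j (assumption): plain dicts instead of a graph server."""
    def __init__(self):
        self.nodes = {}   # station key -> properties
        self.edges = {}   # trip key -> (depart, arrive, properties)

class GraphDocManager:
    """Hypothetical doc manager that mirrors replicated documents into a graph."""
    def __init__(self, graph):
        self.graph = graph

    def upsert(self, doc):
        key = doc["_id"]
        # Copy every other field as a property, so new fields need no schema change.
        props = {k: v for k, v in doc.items() if k not in ("_id", "depart", "arrive")}
        if "depart" in doc and "arrive" in doc:
            # A trip document becomes an edge; make sure both endpoints exist.
            for station in (doc["depart"], doc["arrive"]):
                self.graph.nodes.setdefault(station, {})
            self.graph.edges[key] = (doc["depart"], doc["arrive"], props)
        else:
            # Any other document is treated as a station node.
            self.graph.nodes.setdefault(key, {}).update(props)

    def remove(self, doc):
        key = doc["_id"]
        self.graph.edges.pop(key, None)
        self.graph.nodes.pop(key, None)

graph = InMemoryGraph()
manager = GraphDocManager(graph)
manager.upsert({"_id": "BOS", "name": "Boston"})
manager.upsert({"_id": "BOS-NYC", "depart": "BOS", "arrive": "NYC", "carrier": "BOLT"})
manager.upsert({"_id": "BOS", "name": "Boston", "wifi": True})  # new property on the fly
manager.remove({"_id": "BOS-NYC"})
```

Because updates arrive through the replication log, the graph lags MongoDB slightly, which matches the "eventually consistent" bullet above.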
Polyglot Arch
Scraping JSON
Bus Websites Non-uniform Data
MongoDB
Neo4j
MongoConn
Nodes & Edges
Replica Mechanism
REST Server
BOS, NYC
BOS, PHL
NYC, DC
NYC, PHL
Modeling Lessons
Art: MC Escher
Our Story
• We tried to “dump” all data into Neo4j
• Edges had dates -> too many Edges -> “Super Node Problem”
• Query perf was terrible (1+ mins) and got worse as # edges increased
• Tried Gremlin -> No improvements
• Needed range queries on Edges
“Dehydrate”
• Don’t store everything in Neo4j, only metadata
• Use Neo4j as a “connection index”
• Don’t store entities in Nodes, only keys
• Don’t store heavy properties in Edges
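One way to picture "dehydrating" is to split each full trip record into a light edge that carries only keys and a heavy document that keeps dates and prices. This is a sketch under assumptions: the field names, the sample prices, and the `dehydrate` helper are hypothetical, and plain Python containers stand in for Neo4j and MongoDB.

```python
def dehydrate(trip):
    """Split a full trip record into a key-only edge and a heavy document."""
    key = f"{trip['depart']}-{trip['arrive']}"
    edge = {"key": key, "depart": trip["depart"], "arrive": trip["arrive"]}
    document = dict(trip, edge_key=key)   # dates and prices stay out of the graph
    return edge, document

# Hypothetical sample data: two dated trips share the BOS-NYC connection.
trips = [
    {"depart": "BOS", "arrive": "NYC", "date": "2013-11-07", "price": 15},
    {"depart": "BOS", "arrive": "NYC", "date": "2013-11-08", "price": 12},
    {"depart": "NYC", "arrive": "DC", "date": "2013-11-07", "price": 13},
]

edges = {}       # stands in for the graph: one edge per station pair
documents = []   # stands in for the document store: one record per dated trip
for trip in trips:
    edge, document = dehydrate(trip)
    edges[edge["key"]] = edge
    documents.append(document)

print(len(edges), "edges,", len(documents), "documents")
```

The number of edges now grows with the number of connected station pairs rather than with the number of dated departures, which is what keeps stations from turning into super nodes.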
Neo4j Model
source: Wes Freeman, Tobias Lindaaker
Our Solution
• Serve paths from Neo4j
• Segments from MongoDB (with date constraints)
• Back to “Joins”
• “Join” across Neo4j + MongoDB:
1 != 525d9031e6c9236072114387
Joins across DBs
MongoDB: Stations    Neo4j: Nodes
BOS                  BOS
NYC                  NYC
DC                   DC
...                  ...
MongoDB: Trips       Neo4j: Edges
BOS-NYC              BOS-NYC
BOS-DC               BOS-DC
NYC-DC               NYC-DC
...                  ...
• Forget seq id generated by dbs
• Use a human-created “UUID” string for id
• Convert pair into id: depart-arrive
• For example: BOS-NYC
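The cross-database "join" can be sketched as two steps: get candidate paths from the connection index, then look up dated segments for each leg by its depart-arrive key. In this sketch a dict of adjacency lists stands in for Neo4j and a dict of lists stands in for MongoDB; the function names, dates, and prices are hypothetical.

```python
from datetime import date

# Stand-in for the Neo4j connection index: station keys only.
GRAPH = {"BOS": ["NYC", "DC"], "NYC": ["DC"], "DC": []}

# Stand-in for the MongoDB trips collection, keyed by "depart-arrive".
SEGMENTS = {
    "BOS-NYC": [{"date": date(2013, 11, 7), "price": 15},
                {"date": date(2013, 11, 8), "price": 12}],
    "NYC-DC": [{"date": date(2013, 11, 7), "price": 13}],
    "BOS-DC": [{"date": date(2013, 11, 8), "price": 30}],
}

def find_paths(graph, start, goal, path=None):
    """Enumerate all simple paths from start to goal (depth-first)."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:
            paths.extend(find_paths(graph, nxt, goal, path))
    return paths

def join(graph, segments, start, goal, day):
    """Keep only the paths where every leg has a segment on the given day."""
    results = []
    for path in find_paths(graph, start, goal):
        keys = [f"{a}-{b}" for a, b in zip(path, path[1:])]
        legs = [[s for s in segments.get(k, []) if s["date"] == day] for k in keys]
        if all(legs):
            results.append({"path": path, "legs": dict(zip(keys, legs))})
    return results

results = join(GRAPH, SEGMENTS, "BOS", "DC", date(2013, 11, 7))
print(results)
```

The shared string key is what makes the join possible: a database-generated id such as a node number or an ObjectId exists in only one of the two stores.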
Geo Lessons
Art: MC Escher
Hybrid Solution
• Google Autocomplete
• Google Maps
• MongoDB station geo lookup
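The station geo lookup amounts to finding the station nearest to a point. The talk does not show the query, so this is a plain-Python stand-in rather than the MongoDB query itself: the coordinates are approximate and illustrative, and `STATIONS`, `haversine_km`, and `nearest_station` are hypothetical names.

```python
import math

# Approximate, illustrative coordinates (latitude, longitude) for three stations.
STATIONS = {
    "BOS": (42.352, -71.055),
    "NYC": (40.750, -73.993),
    "DC": (38.897, -77.006),
}

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def nearest_station(point, stations=STATIONS):
    """Return the key of the station closest to the given (lat, lon) point."""
    return min(stations, key=lambda code: haversine_km(point, stations[code]))

# A point in Cambridge, MA should resolve to the Boston station.
nearest = nearest_station((42.37, -71.11))
print(nearest)
```

In production a geospatial index in the database would do this search server-side; the linear scan here only illustrates the lookup.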
Lessons of Lessons
• Really understand the Neo4j Runtime Model
• Pick universal human-generated ids
• Join across dbs better than RDBMS: 10s paths x 100s segments vs. 500k x 500k
• Glad to have picked Neo4j: doing content gen and more geo features now
Useful Links
• Neo4j Internals
slideshare.net/thobe/an-overview-of-neo4j-internals
• Aseem’s Lessons Learned with Neo4j
http://aseemk.com/talks/neo4j-lessons-learned#/14
• Wes Freeman, Neo4j Internals
http://wes.skeweredrook.com/graphdb-meetup-may-2013.pdf
• MongoConnector
blog.mongodb.org/post/29127828146/introducing-mongo-connector