no more sql

1. No More SQL
A chronicle of moving a data repository from a traditional relational database to MongoDB
Glenn Street, Database Architect, Copyright Clearance Center

2. Who am I?
- Database Architect at Copyright Clearance Center
- Oracle Certified Professional
- Many years of database development and administration
- Learning to embrace polyglot persistence
- Been working with MongoDB since version 1.6

3. What is Copyright Clearance Center?
"Copyright Clearance Center (CCC), the rights licensing expert, is a global rights broker for the world's most sought-after books, journals, blogs, movies and more. Founded in 1978 as a not-for-profit organization, CCC provides smart solutions that simplify the access and licensing of content. These solutions let businesses and academic institutions quickly get permission to share copyright-protected materials, while compensating publishers and creators for the use of their works." www.copyright.com

4. What I want to talk about today
- Not application design, but data management issues
- Our experience in moving away from the "legacy" relational way of doing things
- These experiences come from one large project

5. What do I mean by data management?
- Topics like naming conventions, data element definitions
- Data modeling
- Data integration
- Talking to legacy (relational) databases
- Archive, purge, retention, backups

6. Where we started
- 200+ tables in a relational database
- A smaller core set of tables, but many supporting tables
- 2.5 TB total (including TEMP space, etc.)
- Many PL/SQL packages and procedures
- Solr for search

7. Today
- We use MongoDB in several products
- The one I'll talk about today is our largest MongoDB database (> 2 TB)
- Live in production end of September

8. What options did we have in the past for horizontal scaling?
- At the database layer, few
  - Clustering ($$)
- So, we emphasized scaling at the application tier
- We wanted to be able to scale out the database tier in a low-cost way

9. What kind of data?
- "Work" data, primarily books, articles, journals
- Associated metadata
  - Publisher, author, etc.

10. Application characteristics
- Most queries are reads via the Solr index
- Database access is needed for additional metadata not stored in Solr
- Custom matching algorithms for data loads
- Database updates are done in bulk (loading)
- Loads of data come from third-party providers
- On top of this we've built many reports, canned and ad hoc

11. Here's what the core data model looked like: highly normalized

12. Where we are today
- MongoDB database: 12 shards x 200 GB (2.4 TB)
- Replica sets, including hidden members for backup (more about that later)
- GridFS for data to be loaded
- MMS for monitoring
- JEE application (no stored procedure code)
- Solr for search

13. What motivated us?
- Downtime every time we made even the simplest database schema update
- The data model was not appropriate for our use case
  - Bulk loading (very poor performance)
  - Read-mostly (few updates)
  - We want to be able to see most of a "work's" metadata at once
  - This led to many joins, given our normalized data model

14. More motivators
- Every data loader required custom coding
- The business users wanted more control over adding data to the data model on the fly (e.g., a new data provider with added metadata)
- This would be nearly impossible using a relational database
- MongoDB's flexible schema model is perfect for this use!

15. What were our constraints?
- Originally, we wanted to revamp the nature of how we represent a work
- Our idea was to construct a work made up of varying data sources, a canonical work
- But, as so often happens, time the avenger was not on our side

16. We needed to reverse-engineer functionality
- This meant we needed to translate the relational structures
- We probably didn't take full advantage of a document-oriented database
- The entire team was more familiar with the relational model
- Lesson: Help your entire team get into the polyglot persistence mindset

17. We came up with a single JSON document
- We weighed the usual issues: embedding vs. linking
- Several books touch on this topic, as does the MongoDB manual
- One excellent one: MongoDB Applied Design Patterns by Rick Copeland, O'Reilly Media

18. We favored embedding
- "Child" tables became "child" documents
- This seemed the most natural translation of relational to document
- But this led to larger documents
- Lesson: We could have used linking more

19. Example: one-to-one relationship

20. In MongoDB
    work...
    "publicationCountry" : {
        "country_code" : "CHE",
        "country_description" : "Switzerland"
    }

21. Example: one-to-many relationship

22. In MongoDB
An array of work contributors
    "work_contributor" : [
        { "contributorName" : "Ballauri, Jorgji S.", "contributorRoleDescr" : "Author" },
        { "contributorName" : "Maxwell, William", "contributorRoleDescr" : "Editor" },
        ...
    ]

23. When embedding...
- Consider the resulting size of your documents
- Embedding is akin to denormalization in the relational world
- Denormalization is not always the answer (even for RDBMS)!

24. Data migration from our relational database
- Wrote a custom series of ETL processes
- Combined Talend Data Integration and custom-built code
- Also leveraged our new loader program

25. But...we still had to talk to a relational database
- The legacy relational database became a reporting and batch-process database (at least for now)
- Data from our new MongoDB system of record needed to be synced with the relational database
- Wrote a custom process to transform the JSON structure back to relational tables
- Lesson: Consider relational constraints when syncing from MongoDB to a relational database
  - We had to account for some discrepancies in field lengths (MongoDB is more flexible)

26. More Lessons Learned
- Document size is key!
- The data management practices you're used to from the relational world must be adapted; example: key names
  - In the relational world, we favor longer names
  - We found that long key names were causing us pain
  - We're not the first: see the blog post "On shortened field names in MongoDB"
  - But this goes against good relational database naming practices (e.g., longer column names are self-documenting)

27. More Lessons Learned
- Our way of using Spring Data introduced its own problems
  - Scaffolding
- Nesting of keys for flexibility was painful
  - Example: workItemValues.work_createdUser.rawValue

28. Backups at this scale are challenging!
- mongodump and mongoexport were too slow for our needs
- Decided on hidden replica set members on AWS
- Using filesystem snapshots for backups
- Looking into MMS Backup service

29. Another Lesson: Non/Semi-Technical Users
- For example, business analysts, product owners
- Many know and like SQL
- Many don't understand a document-oriented database
- Engineering spent a lot of time and effort in raising the comfort level
  - This was not universally successful
- An interesting project: SQL4NoSQL

30. How to communicate structure?

31. Communicating Structure
- Mind map was helpful initially
- Difficult to maintain

32. JSON Schema
    {
      "$schema": "http://json-schema.org/draft-03/schema",
      "title": "Phase I Schema",
      "description": "Describes the structure of the MongoDB database for Phase I",
      "type": "object",
      "id": "http://jsonschema.net",
      "required": false,
      "properties": {
        "_id": { "type": "string", "required": false },
        ...

33. JSON Schema for communicating structure
- I created a JSON Schema representation of the work document
- JSON Schema, JSON Schema.net
- Was used by QA and other teams for supporting tools
- JSON Schema was also useful, but also cumbersome to maintain

34. Next Steps/Challenges
- Investigating on-disk (file system) compression
  - Very promising so far
- Can we be more "document-oriented"?
  - Remove vestiges of relational data models
- Implement an archiving and purging strategy
- Investigating MMS Backup

35. Vote for these JIRA Items!
- Option to store data compressed
- Bulk insert is slow in sharded environment
- Tokenize the field names
- Increase max document size to at least 64mb
- Collection level locking

36. Thanks!
- Twitter: @GlennRStreet
- Blog: http://glennstreet.net/
- LinkedIn: http://www.linkedin.com/in/glennrstreet/
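
Note on slide 26 (key names): MongoDB stores every field name inside every document, so a long key is paid for once per document rather than once per table. A rough sketch of that effect follows. It renames keys recursively and compares compact JSON sizes as a stand-in for BSON sizes (BSON differs in detail, so the numbers are only indicative). The abbreviations in `KEY_MAP` and the helper names are hypothetical and are not taken from the project.

```python
import json

# Hypothetical mapping from descriptive key names to short ones.
KEY_MAP = {
    "work_contributor": "wc",
    "contributorName": "n",
    "contributorRoleDescr": "r",
}


def shorten_keys(value, key_map):
    """Return a copy of value with dict keys renamed according to key_map."""
    if isinstance(value, dict):
        return {key_map.get(k, k): shorten_keys(v, key_map) for k, v in value.items()}
    if isinstance(value, list):
        return [shorten_keys(item, key_map) for item in value]
    return value


def encoded_size(doc):
    """Size in bytes of the compact JSON encoding (a proxy for BSON size)."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


# Sample data taken from the contributor example on slide 22.
work = {
    "work_contributor": [
        {"contributorName": "Ballauri, Jorgji S.", "contributorRoleDescr": "Author"},
        {"contributorName": "Maxwell, William", "contributorRoleDescr": "Editor"},
    ]
}

short_work = shorten_keys(work, KEY_MAP)
saved = encoded_size(work) - encoded_size(short_work)
print(f"long keys: {encoded_size(work)} bytes, "
      f"short keys: {encoded_size(short_work)} bytes, saved: {saved}")
```

The saving grows with the number of array elements and documents, which is why it matters at multi-terabyte scale. The cost is the one the slide names: short keys are no longer self-documenting, so a mapping like `KEY_MAP` has to be maintained somewhere.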
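
Note on slide 25 (syncing back to relational tables): the slides do not show the custom sync process, so the following is only a minimal sketch of the idea. An embedded array is flattened into child rows keyed by the parent id, and each value is checked against the target column length before loading. The column names and the limits in `MAX_LEN` are assumptions made for illustration.

```python
# Hypothetical column length limits in the relational target.
MAX_LEN = {"contributor_name": 18, "role_descr": 10}


def flatten_work(work_id, doc):
    """Turn the embedded contributor array into child rows keyed by the parent id."""
    rows = []
    for position, contributor in enumerate(doc.get("work_contributor", [])):
        rows.append({
            "work_id": work_id,
            "seq": position,
            "contributor_name": contributor.get("contributorName"),
            "role_descr": contributor.get("contributorRoleDescr"),
        })
    return rows


def length_violations(row):
    """List the columns whose values would not fit the relational column."""
    return [
        column
        for column, limit in MAX_LEN.items()
        if row.get(column) is not None and len(row[column]) > limit
    ]


# Sample data taken from the contributor example on slide 22.
doc = {
    "work_contributor": [
        {"contributorName": "Ballauri, Jorgji S.", "contributorRoleDescr": "Author"},
        {"contributorName": "Maxwell, William", "contributorRoleDescr": "Editor"},
    ]
}

rows = flatten_work("work-1", doc)
for row in rows:
    problems = length_violations(row)
    status = "too long: " + ", ".join(problems) if problems else "ok"
    print(row["seq"], row["contributor_name"], status)
```

Reporting violations instead of truncating silently keeps the decision (widen the column, truncate, or reject the row) with whoever owns the relational schema.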
