Introduction to NoSQL at Silicon Valley Code Camp

© COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED. An Introduction to NoSQL Jim Driscoll, MarkLogic

Upload: marklogic

Post on 02-Dec-2014

DESCRIPTION

This presentation was given at Silicon Valley Code Camp in October 2014 by Jim Driscoll, Sr. Sales Engineer at MarkLogic. Jim has been programming in Java since 1996, and has worked on such projects as the first version of servlets, and the first release of J2EE. About this presentation: Before they were called cars, they were "horseless carriages". That's what NoSQL is: something better than the past but still defined in terms of the past. For decades we had one way of storing data, the relational model. Now databases come with new data models, new index schemes, and new scaling architectures. They're also much more cloud-focused than previous systems. It's an exciting time again to be a database administrator, or a programmer, and this presentation gets you oriented. The presentation covers how we got here, where we're at, and where we're going. The discussion includes data models (key-value, document, graph), licensing (open source, commercial), and transaction models (ACID vs BASE), including examples of working code.

TRANSCRIPT

  • 1. COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED. An Introduction to NoSQL Jim Driscoll, MarkLogic
  • 2. Agenda: History of NoSQL; NoSQL Terminology; Taxonomy
  • 3. HISTORY OF NOSQL
  • 4. A Short History of Data. Application-Specific Databases: size is paramount. Relational Databases: size matters, but they break from the application silo and provide data integrity. NoSQL Databases: agility, scalability, speed.
  • 5. RELATIONAL DOESN'T MEAN WHAT YOU THINK IT DOES
  • 6. What's wrong with Relational? Nothing, it's perfect for square data... where you know the relationships in advance; where the schema doesn't change often; where all the data can fit on one machine; where a separate disk seek for every join isn't an issue
  • 7. SEEK AND YOU WILL FIND IN ABOUT 10 MS
  • 8. The Rise of NoSQL: 1998 - Carlo Strozzi coins term; 2001 - MarkLogic founded; 2003 - LiveJournal's Memcached; 2003 - Google File System paper; 2004 - Google MapReduce paper; 2006 - Google Bigtable paper; 2006 - Hadoop released to Apache; 2007 - Amazon Dynamo paper; 2009 - Eric Evans popularizes term; 2010 - MongoDB (first production release)
  • 9. MEMCACHED
  • 10. Memcached: Developed at LiveJournal as a frontend cache for websites; first released in 2003; keep disk access at a minimum, pool memory on many machines; so useful it found wide popularity, and is still under active development
  • 11. Memcached: "High Performance, Distributed Memory Object Caching System". Distributed: runs across many computers. Memory: runs without touching disk. Object cache: designed to hold small lumps of data. High performance: because it never touches disk, and the objects are small, it's optimized for speed.
  • 12. Memcached: Client-server system; servers are unaware of each other; clients determine which server to use via hashing; servers keep content as an LRU cache, so all data is transitory
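The Memcached design on the slides above (clients pick a server by hashing the key, and each server is an independent LRU cache) can be sketched in a few lines. This is a toy illustration, not the memcached implementation or protocol: the class names `LRUServer` and `Client` are made up for this sketch, and the simple modulo hashing is an assumption (real client libraries often use consistent hashing instead).

```python
import hashlib
from collections import OrderedDict


class LRUServer:
    """Toy stand-in for one cache server: a fixed-size LRU cache (hypothetical class)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)          # mark as most recently used
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry

    def get(self, key):
        if key not in self.items:
            return None                      # a miss: the caller falls back to the database
        self.items.move_to_end(key)          # a read also counts as a use
        return self.items[key]


class Client:
    """The client, not the servers, decides where each key lives (hypothetical class)."""

    def __init__(self, servers):
        self.servers = servers

    def _pick(self, key):
        # Assumption: plain modulo hashing, chosen here only for brevity.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def set(self, key, value):
        self._pick(key).set(key, value)

    def get(self, key):
        return self._pick(key).get(key)
```

Because the servers never talk to each other, adding capacity is just a matter of giving the client a longer server list; the price is that any entry can be evicted at any time, which is why the data has to be treated as transitory.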
  • 13. SHARDING
  • 14. Sharding to Scale Out
  • 15. BIGTABLE
  • 16. Bigtable: Created by Google in 2004 to store massive amounts of data; made public in a famous 2006 paper; used throughout Google (Gmail, Google Maps, YouTube, web indexing, etc.), reportedly by over 100 internal projects; never shipped externally as a product, but available for public use as part of the App Engine hosting API
  • 17. Bigtable: Rows are composed of columns, which in turn belong to column families; column families are essentially typing, validation and expiration info; every cell is versioned via timestamp; system is robust and crash resistant, and can survive the crash of any machine, including the master; scale-out architecture
  • 18. MAP / REDUCE
  • 19. Map / Reduce: Massively distributed processes. Map: sort, filter, transform data. Reduce: summarize data.
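The two phases named on the Map / Reduce slide can be illustrated with the usual word-count example. This is a single-process sketch under obvious assumptions: the function names are invented for illustration, and the `shuffle` step stands in for the grouping that a real framework performs across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: transform one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all the values seen for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, then reduce each group of values."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a framework spread the calls over a large cluster.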
  • 20. HADOOP
  • 21. Hadoop: First envisioned as Nutch at the Internet Archive in 2002; there were 100s of millions of webpages to index; early versions heavily influenced by the Google File System and MapReduce papers; goal: perform work on large datasets using commodity machines; development moved to Yahoo in 2006; open sourced to Apache, as Hadoop: a file system (HDFS), a task runner (MapReduce), a task manager (YARN). Note: not a database.
  • 22. Hadoop: Really good at batch processing on incredibly large data sets. Not-so-good parts: latency, updates, usability.
  • 23. DYNAMO
  • 24. Amazon Dynamo: Created to power Amazon's web store; writing with low latency more important than consistency; techniques first made public in a 2007 paper; never externally shipped, but a huge influence on the market; used for a variety of critical portions of Amazon's site (shopping cart, user session); succeeded by DynamoDB: similar name, but whole new architecture (with better consistency)
  • 25. Amazon Dynamo: Distributed key-value store; "always writable"; low-latency reads and writes, at the expense of consistency; asynchronous replication on put() operations means that get() may return a stale value; updates during a network partition can result in conflicts, and the application must handle them
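The stale-read behaviour described on the Dynamo slide can be reproduced with a small simulation. This is a sketch only, not Dynamo's design: the class `EventuallyConsistentStore`, its `deliver_pending` method, and the choice to acknowledge a write after a single replica are all assumptions made for illustration.

```python
class Replica:
    """One copy of the data (hypothetical class)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """put() is acknowledged after one replica; the others catch up later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []   # replication messages not yet delivered

    def put(self, key, value):
        self.replicas[0].data[key] = value               # write accepted immediately
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, value))   # replicated asynchronously

    def get(self, key, replica_index):
        # A read may be served by any replica, including one that is behind.
        return self.replicas[replica_index].data.get(key)

    def deliver_pending(self):
        """Stand-in for background replication catching up."""
        for replica, key, value in self.pending:
            replica.data[key] = value
        self.pending.clear()
```

A read from a lagging replica between `put()` and `deliver_pending()` returns the old value; once replication catches up, every replica agrees again, which is the "eventually" in eventually consistent.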
  • 26. TERMINOLOGY
  • 27. What is(n't) NoSQL: No SQL; Schema-less; Open Source; BASE (Eventually Consistent)
  • 28. ACID. Atomicity: everything in a transaction either succeeds or fails together. Consistency: nothing is saved unless it passes consistency rules. Isolation: no two processes can interfere with each other. Durability: once saved, data cannot be lost due to system failure.
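Atomicity and consistency from the ACID slide are easy to see with Python's built-in `sqlite3` module, used here only because it is readily available, not because the talk used it. The `accounts` table, the helper names, and the "balance may not go negative" rule are assumptions for this sketch.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Consistency rule (an assumption of this sketch): no overdrafts.
            raise ValueError("insufficient funds")


def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the rule is violated, the exception causes the connection to roll back both updates, so a failed transfer leaves no half-finished state behind.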
  • 29. BASE: Basically Available; Soft state; Eventually consistent
  • 30. What happens without consistency? Absolute fastest performance at lowest hardware cost; highest global data availability at lowest hardware cost; working with one document or row at a time; writing advanced code to create your own consistency model; eventually consistent data; some inconsistent data that can't be reconciled; some missing data that can't be recovered; some inconsistent query results
  • 31. What is NoSQL? Database; non-relational; schema on read; scale-out architecture; cluster friendly / cloud ready
  • 32. SCALE UP VS SCALE OUT
  • 33. CAP Theorem. Consistent: always get correct answers, when you get answers. Available: always get answers. Partition Tolerant: system can be divided in two. Pick two.
  • 34. CA Database
  • 35. CP Database
  • 36. AP Database
  • 37. SOFTWARE TAXONOMY
  • 38. Taxonomy: Key Value stores; Document Databases; Column Databases; Graph Databases
  • 39. Key Value Taxonomy: KV Cache; KV Store; Eventually Consistent Store; Ordered Store; Data Structure Server
  • 40. KEY VALUE STORES
  • 41. MemcacheDB: Very early KV implementation (2008); KV store based on Memcached source, with a Berkeley DB persistent store; speaks the memcached protocol; development stopped (2009), but still quite popular; for when you like Memcached, but want persistence
  • 42. REDIS
  • 43. Redis: First released in 2009; sponsored by VMware, then Pivotal; name means Remote Dictionary Server; fully in-memory key value store; whole db must reside in memory of one machine, which limits scalability, to the benefit of performance; often used as a front-end cache for other NoSQL databases
  • 44. Redis: Not just strings as values: lists of strings; sets of strings (collections of non-repeating unsorted elements); sorted sets of strings (collections of non-repeating elements ordered by a floating-point number called score); hashes where keys and values are strings
  • 45. Redis: Master / slave replication (a slave may be master to another slave, allowing tree replication); also a publish/subscribe API; slaves may be updated separately from master, which allows inconsistencies (!); persistent store: append-only journal, flushed every 2 seconds by default
  • 46. RIAK
  • 47. Riak: Basho was developing sales force automation software, came up with Riak, and decided to make that their business instead; first release 2011; key value store with CAP tunability and predictable latency, but even Strong Consistency mode is not ACID; write conflicts handled at read time; handles JSON natively (via Search); provides a Map/Reduce engine in JavaScript; pluggable backend data store, which can be an in-memory implementation; search via Solr (though not realtime); rental cloud service (Riak CS)
  • 48. DOCUMENT STORES
  • 49. Document vs Key Value Stores: Extension of Key Value (the value is a document), but also structurally aware, with indexed searches. Document formats: CouchDB - JSON; MongoDB - BSON; MarkLogic - JSON, XML
  • 50. MARKLOGIC
  • 51. MarkLogic: The Only Enterprise NoSQL Database. Founded in 2001 by search engine experts. Multi-model database with built-in search and application services. Flexible data model: XML, JSON, RDF, geospatial, text and binaries. Scalable and elastic: clusters on commodity hardware at massive scale. Trusted for mission-critical apps: ACID, HA/DR, security.
  • 52. MarkLogic Features: Built-in Search; Scalability and Elasticity; ACID Transactions; Government-grade Security; HA/DR; Cloud Deployment; Hadoop-ready
  • 53. MarkLogic Architecture: Universal Index (indexes words, elements, and the relationships of words and elements); many indexes (automatically) used at once; in-memory range indexes work like column indexes; a native triple store that supports SPARQL; search on ranges, free text, field values, and more; shared-nothing architecture, with automatic partitioning and balancing; can also use a range index for tiered partitioning; MarkLogic Connector for Hadoop
  • 54. XML vs JSON. XML is more expressive and granular: namespaces; intermixed elements and text; standard query languages (XPath, XQuery). JSON is more compact and simpler (for better and worse): programmatically accessible via standard JavaScript libraries.
  • 55. MONGODB
  • 56. MongoDB: Development began in 2007 by 10gen; name from "humongous"; originally wanted to create a Google App Engine-like system; 1.4 considered first production-ready release, 2010; horizontally scaling
  • 57. MongoDB: Stores data in proprietary format BSON, similar to JSON with more data types; search on field, on range, or on regex; single index per query (secondary index optional); replication of databases as master/slave, with (tunable) eventual consistency; sharding handled via a shard key, splitting by range (be sure the key is evenly distributed); client APIs in many (40+!) languages
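The warning about shard key distribution is easier to appreciate with a small model of range-based sharding. This sketch does not use MongoDB or its API; the class `RangeShardedCollection` and its split points are hypothetical, and the routing is a plain binary search over range boundaries.

```python
import bisect


class RangeShardedCollection:
    """Routes each document to a shard by comparing its shard key to range boundaries."""

    def __init__(self, split_points):
        # Split points ["h", "p"] give three shards:
        # key < "h", "h" <= key < "p", and key >= "p".
        self.split_points = sorted(split_points)
        self.shards = [dict() for _ in range(len(self.split_points) + 1)]

    def shard_for(self, shard_key):
        """Index of the shard whose range contains the key."""
        return bisect.bisect_right(self.split_points, shard_key)

    def insert(self, shard_key, document):
        self.shards[self.shard_for(shard_key)][shard_key] = document

    def shard_sizes(self):
        return [len(shard) for shard in self.shards]
```

With keys such as "alice", "amy" and "andy", every document lands on the first shard and the other two sit idle, which is the hot-spot problem the slide warns about; keys spread across the ranges fill the shards evenly.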
  • 58. WIDE COLUMN STORES
  • 59. Column Stores: Descended from the Bigtable approach; excellent for sparse data; column families need to be specified up front, but data is still stored sparsely; no way to list all the columns in the database; append only: updates via timestamp, deletes via tombstone marker
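The append-only behaviour on the Column Stores slide (updates are newer timestamped cells, deletes are tombstones) can be modelled with a simple log. This is an illustrative sketch, not the storage format of Bigtable or any product: the class `ColumnFamilyStore` is hypothetical, and a counter stands in for real timestamps.

```python
import itertools

TOMBSTONE = object()   # marker written in place of a value to record a delete


class ColumnFamilyStore:
    """Append-only cell log with column families declared up front."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.log = []                        # cells are only ever appended
        self._clock = itertools.count(1)     # stand-in for timestamps

    def _append(self, row, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.log.append((row, family, column, next(self._clock), value))

    def put(self, row, family, column, value):
        self._append(row, family, column, value)       # an update is just a newer cell

    def delete(self, row, family, column):
        self._append(row, family, column, TOMBSTONE)   # nothing is erased in place

    def get(self, row, family, column):
        cells = [c for c in self.log if c[:3] == (row, family, column)]
        if not cells:
            return None                                # sparse: absent cells cost nothing
        latest = max(cells, key=lambda c: c[3])        # highest timestamp wins
        return None if latest[4] is TOMBSTONE else latest[4]
```

Rows only hold the columns that were actually written, so the data stays sparse, and old versions remain in the log until some later cleanup step (not modelled here) removes them.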
  • 60. CASSANDRA
  • 61. Cassandra: Developed at Facebook, 2008, donated to Apache; descended from Bigtable and Dynamo (one of the primary Dynamo developers helped create Cassandra); focused on maximum throughput: write lots of data, fast, but at the expense of consistency (tunable); used by Twitter, Reddit, Netflix, but not Facebook
  • 62. Cassandra: Shared-nothing architecture; partitioned via hash (multiple strategies), so be careful choosing your Row Key!; async masterless replication; CAP tunable, from "writes never fail" to "wait until persisted on all slaves"; query with range queries, column family, CQL; Hadoop support (replaces HDFS)
  • 63. GRAPH DATABASES
  • 64. Nodes (Vertices) and Edges
  • 65. NEO4J
  • 66. Neo4j: Released in 2010; written in Java, APIs are Java-centric; most popular graph database; powers the recommendation engines of Glassdoor, Walmart
  • 67. Neo4j: Whole graph in memory, scales to millions of relationships, but does persist to disk; transactional; replicated for performance and robustness, master/slave; proprietary graph query language (Cypher); Enterprise version adds clustering, sharding
  • 68. SEMANTIC WEB
  • 69. Semantics: A New Way to Organize Data. Data is stored in triples, expressed as Subject : Predicate : Object, for example John Smith : livesIn : London and London : isIn : England. Query with SPARQL, which gives us simple lookup... and more! Find people who live in (a place that's in) England. [Diagram: graph of "John Smith", "London" and "England" with edges labeled livesIn, isIn and livesIn]
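The triple model and the "people who live in (a place that's in) England" query can be sketched without a real triple store. The code below is a plain Python illustration, not SPARQL and not MarkLogic's API: the function names are invented, the Paris facts are hypothetical additions to the slide's two triples, and the SPARQL in the comment uses an assumed `:` prefix.

```python
# Facts as (subject, predicate, object) triples. The first two come from the
# slide; the Paris ones are hypothetical additions for contrast.
TRIPLES = {
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("Jane Doe", "livesIn", "Paris"),
    ("Paris", "isIn", "France"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def people_living_in(triples, region):
    """People who live in the region itself or in a place that is in it.

    Roughly what this SPARQL pattern expresses (prefix assumed):
        SELECT ?person WHERE { ?person :livesIn ?place . ?place :isIn :England }
    """
    places = {region} | {s for (s, _, _) in match(triples, predicate="isIn", obj=region)}
    return sorted(s for (s, _, o) in match(triples, predicate="livesIn") if o in places)
```

Nothing in the data says that John Smith lives in England; the answer falls out of joining two triples on the shared value "London", which is the "and more" beyond simple lookup.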
  • 70. Context from the World at Large. Linked Open Data: facts that are freely available, in a form that's easily consumed. DBpedia (Wikipedia as structured information): Einstein was born in Germany; Ireland's currency is the Euro. GeoNames: Doha is the capital of Qatar; Doha has these lat/long coordinates. [Image: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/]
  • 71. IN CONCLUSION
  • 72. The Future: Feature convergence; growth of public / private cloud use; new uses, increased adoption
  • 73. Don't Design Your System Like It's 1979
  • 74. Further: http://www.nosqlfordummies.com
  • 75. ANY QUESTIONS?