NoSQL overview
Implementation free
Benoit Perroud
21 January 2011, 01 March 2011
Disclaimer
Any views or opinions presented in this presentation are solely those of the author and do not necessarily represent
those of Verisign.
Outline
• Introduction
• Scalability and Availability
• Difficulties to scale
• CAP Theorem
• NoSQL Goals
• NoSQL Taxonomy
• Concepts and Patterns
• Existing implementations
• NoSQL in the Real World
• Conclusion
Introduction
NoSQL Term
• Mandatory name disambiguation :
NoSQL stands for Not Only SQL.
• The term NoSQL is more or less attributed to Eric Evans, a Rackspace employee, who used it in early 2009 when Johan Oskarsson, a Last.fm employee, wanted to organize an event to discuss open-source distributed databases.
Wikipedia Definition
• [Wikipedia] NoSQL is a term used to designate database management systems that differ from classic relational database management systems (RDBMS) in some way. These data stores may not require fixed table schemas, usually avoid join operations, do not attempt to provide ACID (atomicity, consistency, isolation, durability) properties and typically scale horizontally.
Tag Cloud
Why NoSQL ?
• Is there really a problem with SQL and RDBMS?
No, there isn't! SQL is powerful, ACID (atomicity, consistency, isolation, durability) properties are well established, and developers and DBAs have mastered it.
• So why the hell do we need NoSQL?
• What motivated Google and Amazon to invest huge amounts in research around NoSQL?
• Why are "social sites" so hard to scale?
Scalability and Availability
Scalability definition
• [Wikipedia] Scalability is a desirable property of a system, a network, or a process, which indicates its ability to either handle growing amounts of work in a graceful manner or to be readily enlarged.
In summary : handle load and peaks.
• Scalability in two dimensions:
  • Scale up → scale vertically (increase RAM in an existing node)
  • Scale out → scale horizontally (add a node to the cluster)
Availability definition
• [Wikipedia] Availability refers to the ability of the users to access and use the system. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
In summary : minimize downtime.
| Availability % | Downtime per year | Downtime per month | Downtime per week |
|---|---|---|---|
| 90% ("one nine") | 36.5 days | 72 hours | 16.8 hours |
| 95% | 18.25 days | 36 hours | 8.4 hours |
| 99% ("two nines") | 3.65 days | 7.20 hours | 1.68 hours |
| 99.9% ("three nines") | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.99% ("four nines") | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% ("five nines") | 5.26 minutes | 25.9 seconds | 6.05 seconds |
| 99.9999% ("six nines") | 31.5 seconds | 2.59 seconds | 0.605 seconds |
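The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year, a 30-day month and a 7-day week (the function name is illustrative, not from the slides):

```python
# Illustrative helper: downtime budget implied by an availability target.
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_days: float) -> float:
    """Return the allowed downtime, in seconds, over a period of `period_days` days."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed period lengths: 365-day year, 30-day month, 7-day week.
    for target in (90.0, 99.0, 99.9, 99.99, 99.999):
        per_year_h = allowed_downtime_seconds(target, 365) / 3600
        per_month_min = allowed_downtime_seconds(target, 30) / 60
        per_week_min = allowed_downtime_seconds(target, 7) / 60
        print(f"{target}%: {per_year_h:.2f} h/year, "
              f"{per_month_min:.2f} min/month, {per_week_min:.2f} min/week")
```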
Difficulties to Scale
RDBMS Scalability
• RDBMSs are hard to scale, where "hard" should be understood as "costly": license, hardware, DBA and operational costs grow non-linearly with the load.
• An RDBMS has either:
  • a single point of failure (SPOF) → the master DB
  • or replication latency → distributed transactions: two-phase commit, Paxos algorithm.
Hardware Scalability
• Commodity hardware and appliances are I/O bound
  • But network throughput is cheaper than hard disk throughput.
• Hard disks are (were?) the main bottleneck in today's computing.
  • Random access has a high latency (disk seek)
  • Throughput has increased, but not proportionally with storage capacity
→ Distributing data across a network of small computers (and applying the data locality concept) scales better (cheaper) than a huge appliance.
Pioneers in Scaling The Big
• Google: huge amounts of data (hundreds of petabytes, which do not fit on a single appliance)
  • data needs to be partitioned; the failure rate grows with the number of machines (the MTBF of the cluster shrinks accordingly).
  → BigTable.
• Amazon: high availability (99.999%)
  • needs redundancy and write scalability
  → Dynamo
• Facebook, Twitter, "social sites"
  • no natural clusters of data to partition on,
  • long tail (old data is still accessed from time to time),
  • lots of users connected at the same time,
  • user-specific content (predictions are hard to achieve, few static pages)
CAP Theorem
Reminder of the CAP theorem
• Consistency : all nodes see the same data at the same time
• Availability : node failures do not prevent survivors from continuing to operate
• Partition Tolerance : the system continues to operate despite arbitrary message loss
According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.
NoSQL trade-offs
• NoSQL datastores typically make different CAP trade-offs than RDBMSs
  • Most of them gave up the "C" of the theorem, and with it the ACID properties.
• NoSQL datastores also have a simpler data access pattern
  • value = get(key)
  • put(key, value)
  • remove(key)
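The three operations above can be sketched as a tiny in-memory store. This is only an illustration of the access pattern: the class name is hypothetical, and a real NoSQL datastore would add partitioning, replication and persistence behind the same interface.

```python
from typing import Dict, Optional


class InMemoryKeyValueStore:
    """Hypothetical single-node store exposing only get/put/remove."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        # Overwrites any previous value stored under the same key.
        self._data[key] = value

    def remove(self, key: str) -> None:
        # Removing a missing key is a no-op.
        self._data.pop(key, None)
```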
NoSQL Goals
NoSQL promises
• The goal of NoSQL is to achieve (horizontal) scalability and high availability.
• Business goal: keep costs growing proportionally with the load (tight provisioning).
• Operational goals:
  • Scale the system by simply adding (or removing) nodes.
  • The system runs on commodity hardware.
NoSQL Taxonomy
Taxonomy
The most common NoSQL types are:
• Document store
  • stores documents whose structure can be explored.
• Key/value store
  • simple hash map data access pattern.
• Column-oriented store
  • something between a simple key/value store and a complex document store.
• Graph database
  • stores nodes and edges; data is accessed by walking through the graph.
• Object database
Classification
Picture from http://blog.nahurst.com/visual-guide-to-nosql-systems
Concepts and Patterns
Implementation side : Consistent Hashing
• [Wikipedia] Consistent hashing is a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots.
Picture from http://www.lexemetech.com/2007/11/consistent-hashing.html
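A minimal sketch of a consistent hash ring, to make the "addition of one slot does not significantly change the mapping" property concrete. The class name, the use of SHA-256 and the virtual-node count are assumptions made for illustration; real datastores differ in the details.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _position(value: str) -> int:
    """Map a string to a position on the ring (SHA-256 is an arbitrary choice)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Illustrative ring: each node owns several virtual points on the ring."""

    def __init__(self, nodes: Iterable[str], virtual_nodes: int = 100) -> None:
        self._virtual_nodes = virtual_nodes
        self._ring: List[Tuple[int, str]] = []  # kept sorted by position
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key."""
        if not self._ring:
            raise ValueError("ring is empty")
        index = bisect.bisect_right(self._ring, (_position(key), ""))
        return self._ring[index % len(self._ring)][1]
```

When a node is added, a key either keeps its owner or moves to the new node; keys never shuffle between the existing nodes.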
Implementation side : Bloom Filter
• [Wikipedia] Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
Picture from http://en.wikipedia.org/wiki/File:Bloom_filter.svg
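A minimal Bloom filter sketch. The sizes, the number of hash functions and the way the hashes are derived (salting SHA-256 with a seed) are illustrative assumptions. The key property is that a lookup may return a false positive but never a false negative.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Illustrative Bloom filter over strings."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self._size = size_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several hash functions by salting one hash with a seed.
        for seed in range(self._num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self._size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```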
Implementation side : Quorum
• [Wikipedia] A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system.
• Quorum: R + W > N, where N is the number of replicas, R the number of nodes read and W the number of nodes written. Typical choices:
  • R = 1, W = N
  • R = N, W = 1
  • R = W = ⌊N/2⌋ + 1 (a majority of the replicas)
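The condition R + W > N guarantees that every read set overlaps every write set in at least one replica, so a read always reaches a node holding the latest write. A small sketch of the check (function names are illustrative):

```python
def is_strict_quorum(n: int, r: int, w: int) -> bool:
    """True when any R replicas read must overlap any W replicas written."""
    return r + w > n


def majority(n: int) -> int:
    """Smallest replica count that is more than half of N."""
    return n // 2 + 1


if __name__ == "__main__":
    n = 3
    for r, w in ((1, n), (n, 1), (majority(n), majority(n)), (1, 1)):
        print(f"N={n} R={r} W={w} -> quorum: {is_strict_quorum(n, r, w)}")
```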
Implementation side : Vector Clocks
• [Wikipedia] Vector Clocks is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations.
Picture from http://en.wikipedia.org/wiki/File:Vector_Clock.svg
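A minimal vector clock sketch, representing a clock as a mapping from node name to counter (a missing node counts as 0). The function names are illustrative. Two versions are concurrent, and thus potentially conflicting, when neither happened before the other.

```python
from typing import Dict

VectorClock = Dict[str, int]


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, used when a node receives another node's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True when a <= b on every node and a < b on at least one."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True when each clock is ahead of the other on some node."""
    nodes = a.keys() | b.keys()
    return (any(a.get(n, 0) > b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))
```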
Implementation side : other Common Concepts
• Replication
  • Multi-master (gossip, agents, P2P)
  • Master-slave (+ failover)
• Merkle tree
  • The main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered.
• Multiversion concurrency control
  • MVCC is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
• SEDA
  • Staged Event-Driven Architecture
• LSM-tree (Log-Structured Merge-tree)
  • An efficient alternative to the B-tree that requires fewer disk seeks on writes.
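As an illustration of the Merkle tree idea, the sketch below computes a root hash over a list of data blocks. Two peers holding identical blocks obtain the same root, and any altered block changes it. Duplicating the last hash on odd-sized levels is one common convention, assumed here; the function name is illustrative.

```python
import hashlib
from typing import List


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Return the root hash of a binary hash tree built over `blocks`."""
    if not blocks:
        return _sha256(b"")
    level = [_sha256(block) for block in blocks]  # leaf hashes
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention: duplicate the last hash
        # Each parent is the hash of its two children concatenated.
        level = [_sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```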
User side : Map Reduce
• [Wikipedia] MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes). It is inspired by the map and reduce functions commonly used in functional programming.
Picture from http://code.google.com/edu/parallel/mapreduce-tutorial.html
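A single-process sketch of the map, shuffle and reduce phases, using the classic word count. It only illustrates the programming model; a real framework distributes each phase across many nodes. All names are illustrative.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    inputs: Iterable[str],
    mapper: Callable[[str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in inputs:                  # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line: str) -> Iterator[Tuple[str, int]]:
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word: str, counts: List[int]) -> int:
    return sum(counts)
```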
User side : Inverted Indexes
• [Wikipedia] An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database (data denormalization).
Picture from http://developer.apple.com/library/mac/
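A minimal inverted index sketch: the index is built once, when documents are added, so that a query becomes a few set lookups instead of a scan over every document. Tokenizing by lowercasing and splitting on whitespace is a simplifying assumption; the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of ids of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```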
User side : other Patterns
• Idempotent updates
  • Applying the same update twice does not lead to inconsistent data.
• Offline (asynchronous) processing
  • Push on change (vs. pull on demand), batches
• SOA and service isolation
  • Resilience to service failure. The product is a mash-up of back-end services.
• Schemaless (ColumnFamily-based data model)
  • Schema updates are no longer required
• Data locality
  • Processing is sent to the data instead of data being sent to the workers.
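One way to obtain idempotent updates, sketched below, is to tag every update with a unique operation id and ignore ids that were already applied, so a retried or duplicated message has no additional effect. The class and parameter names are hypothetical, and this is only one of several possible techniques.

```python
from typing import Set


class IdempotentCounter:
    """Illustrative counter whose updates can safely be replayed."""

    def __init__(self) -> None:
        self.value = 0
        self._applied: Set[str] = set()  # ids of operations already applied

    def apply(self, operation_id: str, delta: int) -> None:
        if operation_id in self._applied:
            return  # duplicate delivery: ignore
        self._applied.add(operation_id)
        self.value += delta
```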
Existing Implementations
Non-exhaustive list of Existing Implementations
• Google BigTable, Megastore
• Amazon Dynamo,
• Apache HBase,
• Apache Cassandra,
• Riak,
• Project Voldemort,
• FlockDB,
• Neo4j,
• Hypertable,
• Jackrabbit,
• Solr, ElasticSearch, ...
• CouchDB,
• MongoDB,
• Tokyo Cabinet,
• SimpleDB,
• Redis,
• Memcached,
• Infinispan,
• Scalaris,
• Terrastore,
• MySQL HandlerSocket,
• Almost a new project every week ...
NoSQL in the Real World
Real World Example
• Facebook's new messaging system
  • Based on HBase (simpler consistency model than Cassandra's)
• Amazon shopping cart
  • Based on Dynamo. Rows are never deleted; only deltas (+1, +2, -1) are added to the cart.
• Twitter
  • Uses Cassandra for real-time analytics
• Google Megastore
  • ACID within partitions, lower consistency across partitions
  • Synchronous replication with the Paxos algorithm
Real World Example (2)
• Digg, Last.fm, NYT, Guardian, Yahoo, Flickr, NetFlix, Adobe, Mozilla, Github, Linkedin, StumbleUpon, ...
• MapReduce usage is increasing daily
  • Google and Amazon built their dominance by being the first to master big data.
• Can your business jump into the NoSQL wave ?
Conclusion
Conclusion
• NoSQL is not a general-purpose datastore!
  • The eventually consistent model can be tricky, and can hold nasty surprises if not used carefully.
  • MapReduce search jobs have high latency.
• SQL and NoSQL are complementary
  • Knowing both lets you put the right technology in the right place.
• NoSQL has challenged RDBMS supremacy
  • New ideas for RDBMS are emerging
  • See HandlerSocket, a NoSQL plugin for MySQL.