noSQL DBMS: Concepts, Techniques and Patterns

Iztok Savnik, FAMNIT, 2017

Contents

•Requirements
•Consistency model
•Partitioning
•Storage Layout
•Query Models

Requirements

•Requirements of the new web applications
•ACID vs. BASE
•CAP theorem

ACID vs. BASE

•ACID: Atomicity, Consistency, Isolation, Durability
•Characteristics
  ● Strong consistency
  ● Focus on “commit”
  ● Nested transactions
  ● Availability?
  ● Conservative (pessimistic)
  ● Difficult evolution

•BASE: Basically Available, Soft-state, Eventual consistency
•Characteristics
  ● Weak consistency
  ● Availability first
  ● Approximate answers OK
  ● Aggressive (optimistic)
  ● Simpler!
  ● Faster
  ● Easier evolution

CAP theorem

•“Towards Robust Distributed Systems” at ACM’s PODC 2000, Eric Brewer

•CAP theorem
  ● Consistency: the system is in a consistent state after the execution of an operation. After an update by some writer, all readers see that update in the shared data source.
  ● Availability: the system is designed and implemented in a way that allows it to continue operation if, e.g., nodes in a cluster crash or some hardware or software parts are down due to upgrades.
  ● Partition tolerance: the ability of the system to continue operation in the presence of network partitions.

•Theorem: You can have at most two of these properties for any shared-data system

•Widely adopted today by large web companies

CAP theorem

Forfeit Partitions

•Examples: single-site databases, cluster databases, LDAP, xFS file system
•Traits: 2-phase commit, cache validation protocols

Forfeit Availability

•Examples: distributed databases, distributed locking, majority protocols
•Traits: pessimistic locking, make minority partitions unavailable

Forfeit Consistency

•Examples: Coda, Web caching, DNS
•Traits: expirations/leases, conflict resolution, optimistic

Consistency Models

•A consistency model determines rules for visibility and apparent order of updates.

System is in a consistent state after the execution of an operation

•Example:
  ● Row X is replicated on nodes M and N
  ● Client A writes row X to node N
  ● Some period of time t elapses
  ● Client B reads row X from node M
  ● Does client B see the write from client A?

•Consistency is a continuum with tradeoffs
•Consistency models
  ● Timestamps
  ● Optimistic Locking
  ● Vector Clocks
  ● Multiversion Storage

Strict Consistency

•All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to

•Implies either:
  ● All operations for a given row go to the same node (replication for availability), or
  ● Nodes employ some kind of distributed transaction protocol (e.g. 2 Phase Commit or Paxos)

•CAP Theorem: Strict Consistency can’t be achieved at the same time as availability and partition-tolerance.

Eventual Consistency

•As t → ∞, readers will see writes.
•In a steady state, the system is guaranteed to eventually return the last written value
•For example: DNS, or MySQL slave replication (log shipping)
•Special cases of eventual consistency:
  ● Read-your-own-writes (RYOW) consistency (“sent mail” box)
  ● Causal consistency (if you write Y after reading X, anyone who reads Y sees X)
  ● Gmail has RYOW but not causal!

Timestamps and Vector Clocks

•Determining a history of a row
•Eventual consistency relies on deciding what value a row will eventually converge to
•In the case of two writers writing at “the same” time, this is difficult
•Timestamps are one solution, but rely on synchronized clocks and don’t capture causality
•Vector clocks are an alternative method of capturing order in a distributed system

Vector Clocks

•A vector clock is a tuple $\{t_1, t_2, \ldots, t_n\}$ of clock values from each node

•$v_1 < v_2$ if:
  ● For all $i$, $v_{1i} \le v_{2i}$
  ● For at least one $i$, $v_{1i} < v_{2i}$

•$v_1 < v_2$ implies global time ordering of events

•When data is written from node $i$, it sets $t_i$ to its clock value.

•This allows eventual consistency to resolve consistency between writes on multiple replicas.

Optimistic Concurrency Control

•Locking is a conservative approach that avoids conflicts
  ● Additional code for locking
  ● Checking for deadlock
  ● Lock saturation for all objects that are frequently used

•If conflicts are rare, we can get concurrent transactions by not locking
  ● Instead, we check for conflicts before committing the transaction
  ● Optimistic control is faster than locking if there are not a lot of conflicts (worse when there are many conflicts)

Kung-Robinson model

•A transaction has 3 phases:
  ● READ: the transaction reads from the database but makes changes to a local copy.
  ● VALIDATE: validate the changes.
  ● WRITE: make the local changes public.

[Figure labels: ROOT, old, new, changed objects]

•Validation
  ● Check conditions that are sufficient to guarantee that no conflict occurred
  ● Every transaction gets a numerical timestamp (its transaction ID)
  ● Transaction IDs are assigned at the end of the READ phase, just before validation
  ● ReadSet(Ti): set of objects read by transaction Ti
  ● WriteSet(Ti): set of objects changed by Ti
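
The slide does not spell out the validation conditions, so the following is only a minimal sketch under one common assumption: a transaction passes validation if no transaction that committed after it began has written an object in its read set (serial, backward validation). The names `Transaction`, `Validator`, `begin` and `validate_and_commit` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    name: str
    start_tn: int                                  # last transaction number issued when READ began
    read_set: set = field(default_factory=set)     # ReadSet(Ti)
    write_set: set = field(default_factory=set)    # WriteSet(Ti)


class Validator:
    """Serial backward validation over read and write sets (illustrative only)."""

    def __init__(self):
        self.next_tn = 1
        self.committed = []                        # (transaction number, write set)

    def begin(self, name):
        return Transaction(name, start_tn=self.next_tn - 1)

    def validate_and_commit(self, txn):
        # The transaction ID is assigned at the end of READ, just before validation.
        tn = self.next_tn
        self.next_tn += 1
        for other_tn, other_writes in self.committed:
            # Conflict: a transaction that committed after txn began
            # wrote an object that txn has read.
            if other_tn > txn.start_tn and other_writes & txn.read_set:
                return False                       # abort; the caller may restart txn
        self.committed.append((tn, frozenset(txn.write_set)))   # WRITE phase
        return True


v = Validator()
t1, t2 = v.begin("T1"), v.begin("T2")
t1.read_set, t1.write_set = {"x"}, {"x"}
t2.read_set, t2.write_set = {"x", "y"}, {"y"}
ok1 = v.validate_and_commit(t1)    # nothing committed yet, so T1 passes
ok2 = v.validate_and_commit(t2)    # T1 wrote "x", which T2 read, so T2 fails
```

Here T2 must be restarted because it may have read a stale value of `x`. Validation and the WRITE phase are treated as one critical section, which is a simplification.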

Optimistic Locking

•A unique counter or clock value is saved for each piece of data.
  ● On update, the client provides the counter/clock value of the revision it wants to update

•An optimistic locking scheme needs a total order of version numbers
  ● To allow causality reasoning on versions (e.g. which revision is considered the most recent by a node on which an update was issued), a lot of history has to be saved

•Downside of this procedure (Project Voldemort development team)
  ● … does not work well in a distributed and dynamic scenario where servers show up and go away often and without prior notice
  ● … such a total order easily gets disrupted in a distributed and dynamic setting of nodes, the Project Voldemort team argues

Optimistic Locking

•Couchbase optimistic locking
  ● We optimistically speculate that there will be no conflict
  ● Before commit we check if there are conflicts
  ● Commit is issued if there are no conflicts
  ● Handling one read/write operation is easier than a complex transaction
  ● Example:

Let’s assume we are building an online Wikipedia-like application using Couchbase Server: users can update an article and add newer articles. Let’s assume Alice is using this application to edit an article on ‘bicycles’ to correct some information. Alice opens up the article and makes those changes, but before she hits save, she gets distracted and walks away from her desk. In the meantime, let’s assume Joe notices the same error in the bicycle article and wants to correct the mistake.

If optimistic locking is used in the application, Joe can edit the article and save his changes. When Alice returns and wants to save her changes, either Alice or the application will want to handle the latest updates before allowing Alice’s action to change the document.

Optimistic locking takes the “optimistic” view that data conflicts due to concurrent edits occur rarely, so it’s more important to allow concurrent edits.
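
The Alice and Joe scenario can be sketched with a version counter per document and a compare-on-write check. This is an in-memory illustration, not the Couchbase API: `DocumentStore`, `VersionConflict` and the `expected_version` parameter are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the document changed since the caller read it."""


class DocumentStore:
    def __init__(self):
        self._docs = {}                            # key -> (version, value)

    def get(self, key):
        return self._docs[key]

    def put(self, key, value, expected_version=None):
        current_version = self._docs.get(key, (0, None))[0]
        # Check for a conflict only at write time; nothing is locked while editing.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(f"{key}: expected {expected_version}, found {current_version}")
        self._docs[key] = (current_version + 1, value)
        return current_version + 1


store = DocumentStore()
store.put("bicycles", "original text")             # stored as version 1

alice_version, _ = store.get("bicycles")           # Alice opens the article
joe_version, _ = store.get("bicycles")             # Joe opens it as well
store.put("bicycles", "Joe's fix", expected_version=joe_version)

try:
    store.put("bicycles", "Alice's fix", expected_version=alice_version)
    alice_conflict = False
except VersionConflict:
    alice_conflict = True
    # Alice (or the application) re-reads, merges and retries.
    latest_version, latest_text = store.get("bicycles")
    store.put("bicycles", latest_text + " + Alice's fix", expected_version=latest_version)
```

Joe's save succeeds immediately, and Alice's first save is rejected because her version is stale. How the two edits are merged is left to the application; string concatenation here is only a placeholder.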

Optimistic Locking

•Couchbase pessimistic locking
  ● The application simply locks the object and unlocks it after commit
  ● Example:

Now let’s assume that your business process requires exclusive access to one or more documents or a graph of documents.

Referring to our previous example, when Alice is editing the document she does not want any other user to edit the same document. If Joe tries to open the page, he will have to wait until Alice has released the lock.

With pessimistic locking, the application will need to explicitly get a lock on the document to guarantee exclusive user access. When the user is done accessing the document, the locks can be removed either manually or using a timeout.
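
A minimal sketch of explicit locking with manual release and timeout-based expiry follows. It is an in-memory illustration, not the Couchbase API: `LockManager`, `acquire` and `release` are hypothetical names, and the current time is passed in explicitly so the example is deterministic.

```python
class LockManager:
    def __init__(self):
        self._locks = {}                           # key -> (owner, expires_at)

    def acquire(self, key, owner, now, timeout):
        holder = self._locks.get(key)
        # Refuse if somebody else holds a lock that has not yet expired.
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False
        self._locks[key] = (owner, now + timeout)
        return True

    def release(self, key, owner):
        # Only the current owner may remove the lock manually.
        if key in self._locks and self._locks[key][0] == owner:
            del self._locks[key]


locks = LockManager()
alice_got_it = locks.acquire("bicycles", "alice", now=0, timeout=30)
joe_blocked = not locks.acquire("bicycles", "joe", now=10, timeout=30)

locks.release("bicycles", "alice")                 # Alice is done: manual unlock
joe_got_it = locks.acquire("bicycles", "joe", now=11, timeout=30)

# Joe never releases, so his lock simply expires at time 41.
alice_too_early = not locks.acquire("bicycles", "alice", now=20, timeout=30)
alice_after_timeout = locks.acquire("bicycles", "alice", now=50, timeout=30)
```

The timeout guards against clients that crash or walk away while holding a lock, at the cost of possibly expiring a lock that is still in use.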

Multiversion Concurrency Control

•MVCC is a concurrency control method commonly used by database management systems to provide concurrent access to the database

•Isolation is the property that provides guarantees for concurrent accesses to data
  ● Isolation is implemented by means of a concurrency control protocol

•MVCC aims at solving the problem by keeping multiple copies of each data item
  ● On update, MVCC creates a newer version of the data item
  ● The isolation level commonly implemented with MVCC is snapshot isolation
  ● All reads made in a transaction see a consistent snapshot of the database
  ● A transaction reads the last committed values that existed at the time it started
  ● How to remove versions that become obsolete?
  ● PostgreSQL handles this with its VACUUM process

Multiversion Concurrency Control

•Each table cell has a timestamp
•Timestamps don’t necessarily need to correspond to real life
•Multiple versions can exist concurrently for a given row
•Reads may return “most recent”, “most recent before T”, etc. (free snapshots)
•System may provide optimistic concurrency control with compare-and-swap on timestamps
  ● SQL Anywhere, InterBase, Firebird, Oracle, PostgreSQL, MongoDB and Microsoft SQL Server (2005 and later)
  ● Not the same as serializability
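
The ideas of timestamped versions, snapshot reads and compare-and-swap on timestamps can be sketched as follows. This is a toy in-memory store, not the mechanism of any listed product; `MultiVersionStore` and its methods are hypothetical names, the timestamps are a logical counter, and `as_of` is read as "at or before T".

```python
import bisect


class MultiVersionStore:
    def __init__(self):
        self._clock = 0                            # logical timestamps, not wall-clock time
        self._versions = {}                        # key -> ([timestamps], [values])

    def write(self, key, value):
        self._clock += 1
        timestamps, values = self._versions.setdefault(key, ([], []))
        timestamps.append(self._clock)             # older versions are kept
        values.append(value)
        return self._clock

    def read(self, key, as_of=None):
        timestamps, values = self._versions[key]
        if as_of is None:
            return values[-1]                      # "most recent"
        i = bisect.bisect_right(timestamps, as_of) # versions written at or before as_of
        if i == 0:
            raise KeyError(f"{key} did not exist at timestamp {as_of}")
        return values[i - 1]

    def compare_and_swap(self, key, expected_ts, value):
        timestamps, _ = self._versions[key]
        if timestamps[-1] != expected_ts:
            return False                           # somebody wrote a newer version
        self.write(key, value)
        return True


store = MultiVersionStore()
t1 = store.write("x", "a")
snapshot = t1                                      # a reader remembers when it started
t2 = store.write("x", "b")

latest = store.read("x")                           # newest version
old = store.read("x", as_of=snapshot)              # version visible in the snapshot
stale_cas = store.compare_and_swap("x", t1, "c")   # based on an outdated timestamp
fresh_cas = store.compare_and_swap("x", t2, "c")
```

The sketch never removes old versions; a real system needs a cleanup step for versions that no snapshot can still see.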

Vector Clock - Details

•A vector clock is defined as a tuple $V[0], V[1], \ldots, V[n]$ of clock values from each node.

•In a distributed scenario, node $i$ maintains such a tuple of clock values, which represents its own state and the state of the other (replica) nodes as far as it is aware of them at a given time
  ● $V_i[0]$ for the clock value of the first node,
  ● $V_i[1]$ for the clock value of the second node, . . .
  ● $V_i[i]$ for itself, . . .
  ● $V_i[n]$ for the clock value of the last node.

•Clock values may be real timestamps derived from a node’s local clock, version/revision numbers or some other ordinal values.

Vector Clocks

•Vector clocks are updated in a way defined by the following rules
  ● If an internal operation happens at node $i$, this node increments its clock $V_i[i]$. This means that internal updates are seen immediately by the executing node.
  ● If node $i$ sends a message to node $k$, it first advances its own clock value $V_i[i]$ and attaches the vector clock $V_i$ to the message to node $k$. Thereby, it tells the receiving node about its internal state and its view of the other nodes at the time the message is sent.
  ● If node $i$ receives a message from node $j$, it first advances its own clock value $V_i[i]$ and then merges its own vector clock with the vector clock $V_{message}$ attached to the message from node $j$, so that $V_i = \max(V_i, V_{message})$, taken element-wise.
  ● To compare two vector clocks $V_i$ and $V_j$ in order to derive a partial ordering, the following rule is applied: $V_i > V_j$ if $\forall k\; V_i[k] \ge V_j[k]$ and $\exists l\; V_i[l] > V_j[l]$.
  ● If neither $V_i > V_j$ nor $V_i < V_j$ applies, a conflict caused by concurrent updates has occurred and needs to be resolved by e.g. a client application.
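
The update and comparison rules above can be sketched in a few lines. This is an illustration with a fixed number of nodes and integer counters; `VectorClock`, `tick`, `send`, `receive` and `compare` are hypothetical names, and `compare` returns a string label rather than defining operators.

```python
class VectorClock:
    def __init__(self, n, node_id):
        self.v = [0] * n                           # one clock value per node
        self.i = node_id                           # index of the owning node

    def tick(self):
        """Internal operation at node i: advance V_i[i]."""
        self.v[self.i] += 1

    def send(self):
        """Advance the own clock, then attach a copy of V_i to the message."""
        self.tick()
        return list(self.v)

    def receive(self, message_clock):
        """Advance the own clock, then merge element-wise with the message clock."""
        self.tick()
        self.v = [max(a, b) for a, b in zip(self.v, message_clock)]


def compare(a, b):
    """Partial order on two clocks: 'after', 'before', 'equal' or 'concurrent'."""
    ge = all(x >= y for x, y in zip(a, b))
    le = all(x <= y for x, y in zip(a, b))
    if ge and le:
        return "equal"
    if ge:
        return "after"                             # a > b
    if le:
        return "before"                            # a < b
    return "concurrent"                            # conflict, e.g. resolved by the client


n0, n1, n2 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
n0.tick()                  # n0: [1, 0, 0]
msg = n0.send()            # n0: [2, 0, 0]
n1.receive(msg)            # n1: [2, 1, 0]
n2.tick()                  # n2: [0, 0, 1]

causal = compare(n1.v, n0.v)       # n1 has seen everything n0 has done
conflict = compare(n1.v, n2.v)     # neither clock dominates the other
```

The state of `n1` is causally after `n0`, while `n1` and `n2` are concurrent and would need conflict resolution.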

Vector Clocks

•Causal reasoning between updates
•Clients participate in the vector clock scenario
  ● They keep a vector clock of the last replica node they have talked to and use this vector clock depending on the client consistency model that is required
  ● For monotonic read consistency, a client attaches the last vector clock it received to its requests, and the contacted replica node makes sure that the vector clock of its response is greater than the vector clock the client submitted
  ● Clients can thus be sure to see only newer versions of a piece of data

•Advantages of vector clocks are
  ● No dependence on synchronized clocks
  ● No total ordering of revision numbers required for causal reasoning
  ● No need to store and maintain multiple revisions of a piece of data on all nodes

Gossip

•Gossip is one method to propagate a view of cluster status.
•Every t seconds, on each node:
  ● The node selects some other node to chat with.
  ● The node reconciles its view of the cluster with its gossip buddy.
  ● Each node maintains a “timestamp” for itself and for the most recent information it has from every other node.

• Information about cluster state spreads in O(log n) rounds (eventual consistency)

•Scalable and no SPOF, but state is only eventually consistent
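
A small simulation can illustrate how views converge. It assumes that both partners adopt the merged view (a push-pull exchange) and that the "timestamp" is a heartbeat counter; both are assumptions of this sketch, as are the names `gossip_round` and `simulate`.

```python
import random


def gossip_round(views, rng):
    """One round: every node bumps its own timestamp and reconciles with one random peer."""
    nodes = list(views)
    for node in nodes:
        views[node][node] += 1                     # own heartbeat "timestamp"
        peer = rng.choice([p for p in nodes if p != node])
        known = views[node].keys() | views[peer].keys()
        # Reconcile: for every known node keep the most recent timestamp.
        merged = {k: max(views[node].get(k, 0), views[peer].get(k, 0)) for k in known}
        views[node] = dict(merged)
        views[peer] = dict(merged)


def simulate(n, seed=0, max_rounds=1000):
    rng = random.Random(seed)
    views = {i: {i: 0} for i in range(n)}          # initially each node knows only itself
    rounds = 0
    while any(len(view) < n for view in views.values()) and rounds < max_rounds:
        gossip_round(views, rng)
        rounds += 1
    return rounds, views


rounds, views = simulate(32)
everyone_knows_everyone = all(len(view) == 32 for view in views.values())
```

Running `simulate` with different values of `n` shows the number of rounds growing slowly compared with `n`. The timestamps held by different nodes may still differ slightly at any moment, which is the "only eventually consistent" part.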

Gossip

4 rounds of the protocol

Propagate State via Gossip

•Gossip protocol operates in either a state or an operation transfer model to handle read and update operations as well as replication among database nodes

•State Transfer Model
  ● Data or deltas of data are exchanged between clients and servers as well as among servers
  ● Database server nodes maintain vector clocks for their data and also state version trees for conflicting versions, i.e. versions whose vector clocks cannot be brought into a $V_A < V_B$ or $V_A > V_B$ relation
  ● Clients maintain vector clocks for pieces of data they have already requested or updated
  ● Vector clocks are exchanged and processed in the following manner
    ● Query Processing: when a client queries for data, it sends its vector clock of the requested data along with the request. The server node responds with the part of its state tree for the piece of data that precedes the vector clock attached to the client request, and with the server’s vector clock. The client has to resolve potential version conflicts.
    ● Update Processing
    ● Internode Gossiping

Propagate State via Gossip

•Operation Transfer Model
  ● Operations applicable to locally maintained data are communicated among nodes
  ● Less bandwidth is consumed to interchange operations than actual data or deltas of data
  ● It is important to apply the operations in the correct order on each node
  ● A replica node first has to determine a causal relationship between operations (by vector clock comparison) before applying them to its data
  ● In this model a replica node maintains the following vector clocks
    ● $V_{state}$: the vector clock corresponding to the last updated state of its data.
    ● $V_i$: a vector clock for itself where, compared to $V_{state}$, merges with received vector clocks may already have happened.
    ● $V_j$: a vector clock received with the last gossip message of replica node $j$ (one for each of those replica nodes).
  ● Vector clocks are exchanged and processed in read, update and internode messages
  ● Complex protocols handle the queue of (deferred) operations, which are applied in the correct (causal) order reasoned from the vector clocks attached to these operations

Partitioning

•Data in large-scale systems exceeds the capacity of a single machine and should also be replicated to ensure reliability and to allow scaling measures such as load-balancing

•Ways of partitioning the data of such a system therefore have to be thought about
  ● The choice depends on the size of the system and on other factors like dynamism (e.g. how often and dynamically storage nodes may join and leave)

•There are different approaches to this issue:
  ● Memory Caches
  ● Clustering
  ● Master(s)/slaves model
  ● Sharding
  ● Consistent Hashing

Memory Caches

•Memory caches like memcached (memcached.org/) are partitioned (though transient) in-memory databases
  ● They replicate the most frequently requested parts of a database to main memory
  ● They rapidly deliver data to clients and disburden database servers significantly
  ● A memcached installation consists of an array of processes with an assigned amount of memory, launched on several machines in a network and made known to an application via configuration.

•memcached protocol
  ● Implementations are available in different programming languages
  ● Client applications are provided with a simple key-/value-store API
  ● It stores objects placed under a key into the cache by hashing that key against the configured memcached instances.
  ● If a memcached process does not respond, most API implementations ignore the non-answering node and use the responding nodes instead
  ● This leads to an implicit rehashing of cache objects, as part of them get hashed to a different memcached server after a cache miss

Memory Caches

•When the formerly non-answering node joins the memcached server array again, keys for part of the data get hashed to it again after a cache miss, and objects now dedicated to that node will implicitly leave the memory of the other memcached server they were hashed to while the node was down
•memcached applies an LRU strategy for cache cleaning and allows timeouts to be specified for cache objects

•Logic Half in Client, Half in Server
  ● Clients understand how to choose which server to read or write to for an item, and what to do when they cannot contact a server.
  ● The servers understand how to store and fetch items. They also manage when to evict or reuse memory.

•O(1)
  ● All commands are implemented to be as fast and lock-friendly as possible.
  ● This allows near-deterministic query speeds for all use cases.
  ● Queries on slow machines should run in well under 1ms. High-end servers can serve millions of keys per second in throughput.

•Other implementations of memory caches are available e. g. for application servers like JBoss

Database clusters

•High Availability
  ● All components of a system, from end to end, provide an uninterrupted service
  ● High availability can be achieved through redundancy or very fast recovery.

•Clustering of database servers is another approach to partition data
  ● It strives for transparency towards clients, who should not notice that they are talking to a cluster of database servers instead of a single server
  ● It can scale the persistence layer of a system to a certain degree
  ● Many criticize that clustering features have only been added on top of DBMSs that were not originally designed for distribution

•MySQL cluster
  ● Shared-nothing clustering and auto-sharding
  ● Designed to provide high availability and high throughput with low latency, while allowing for near linear scalability

Database clusters

•Clustering uses multiple servers to work with a shared database system
  ● Each server within a cluster is called a node.
  ● If the primary node fails, then the virtual database service will become active on one of the secondary nodes
  ● Low disk and network latency is essential

•Some clustering systems operate at the operating system level, not the database level, e.g. MS-SQL

•MySQL InnoDB
  ● InnoDB cluster is a collection of products that work together
  ● A group of MySQL servers can be configured in a cluster using MySQL Shell
  ● The cluster has a single master (the primary), which is the read-write master
  ● Multiple secondary servers are replicas of the master
  ● A minimum of three servers is required to create a high availability cluster
  ● A client application is connected to the primary via MySQL Router

Database clusters

•MySQL InnoDB
  ● InnoDB is a general-purpose storage engine that balances high reliability and high performance.
  ● Key advantages of InnoDB include:
    ● Its DML operations follow the ACID model, with transactions featuring commit, rollback, and crash-recovery capabilities to protect user data.
    ● Row-level locking and Oracle-style consistent reads increase multi-user concurrency and performance.
    ● Data is arranged on disk so as to optimize queries based on primary keys.
    ● To maintain data integrity, InnoDB supports FOREIGN KEY constraints

Master(s)/Slaves models

•Each data partition has a single master and multiple slaves
  ● Write operations for all or parts of the data are routed to the master(s)
  ● A number of replica servers (slaves) satisfy read requests

•Read/write protocol
  ● If the master replicates to its slaves asynchronously, there are no write lags
  ● If the master crashes before completing replication to at least one slave, the write operation is lost
  ● If the master replicates writes synchronously to at least one slave, the update is not lost, but writes incur a lag
  ● Read requests can go to any replica if the client can tolerate some degree of data staleness

•"Mastership" happens at the virtual node level
  ● There is NO single particular physical node that plays the role of the master
  ● Therefore, the write workload is also distributed across different physical nodes

•When a physical node crashes
  ● The masters of certain partitions will be lost
  ● The most up-to-date slave will be nominated to become the new master


•The master/slave model works very well in general when the application has a high read/write ratio
  ● It also works very well when updates are spread evenly over the key range
  ● So it is the predominant model of data replication

•2 ways in which the master propagates updates to the slaves
  ● State transfer and operation transfer
  ● The state transfer model is more robust against message loss
  ● Deltas are often used (hash trees of the objects)

•The multi-master model allows updates to happen at any replica
  ● A single master/slaves model is unable to spread the workload evenly
  ● Problems to retain consistency


•Consistency of Master(s) models
  ● Strict consistency can be achieved by 2PC
    ● Bring all N replicas to the same state at every update
    ● "Prepare" phase: the coordinator asks every replica to confirm whether it is ready to perform the update
    ● After all N replicas have responded positively → "commit" phase: ask every replica to commit
    ● If any one of the replicas crashes, the update will be unsuccessful
  ● Quorum-based 2PC
    ● The coordinator only needs to update W < N replicas
    ● Write to all N replicas but only wait for positive acknowledgment from any W
    ● On read we need to access R replicas (W + R > N) and return the latest value (by timestamp)

•If the client can tolerate a more relaxed consistency model
  ● We don't need to use the 2PC commit or quorum-based protocol
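
The quorum rule W + R > N can be sketched with in-memory replicas. The sketch leaves out the prepare/commit phases and failure handling, and uses a logical counter as timestamp; `Replica`, `QuorumStore` and the `reachable` parameter (the indices of replicas that currently respond) are hypothetical names.

```python
class Replica:
    def __init__(self):
        self.data = {}                             # key -> (timestamp, value)

    def write(self, key, ts, value):
        if ts > self.data.get(key, (0, None))[0]:  # keep only the newest version
            self.data[key] = (ts, value)
        return True                                # positive acknowledgment

    def read(self, key):
        return self.data.get(key, (0, None))


class QuorumStore:
    def __init__(self, n, w, r):
        if w + r <= n:
            raise ValueError("need W + R > N so that read and write sets overlap")
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.ts = 0                                # logical timestamp issued by the coordinator

    def write(self, key, value, reachable):
        self.ts += 1
        acks = sum(self.replicas[i].write(key, self.ts, value) for i in reachable)
        return acks >= self.w                      # success once W replicas acknowledged

    def read(self, key, reachable):
        if len(reachable) < self.r:
            raise RuntimeError("fewer than R replicas reachable")
        answers = [self.replicas[i].read(key) for i in reachable[:self.r]]
        return max(answers, key=lambda a: a[0])[1] # value with the latest timestamp


store = QuorumStore(n=3, w=2, r=2)
first_ok = store.write("x", "v1", reachable=[0, 1, 2])
second_ok = store.write("x", "v2", reachable=[0, 1])   # replica 2 misses this update
value = store.read("x", reachable=[1, 2])              # replica 1 has v2, replica 2 only v1
```

Because W + R > N, every set of R replicas overlaps every set of W replicas, so the read sees at least one copy of the latest acknowledged write.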

Sharding

•Partition the data in such a way that data typically requested and updated together resides on the same node
  ● There are different interpretations of sharding
  ● Horizontal partitioning of the data store (not a single table!)
  ● Each shard may have the same schema
  ● Usually some subset of attributes is used to define the partitioning
  ● These attributes form the shard key (sometimes referred to as the partition key)

•Sharding physically organizes the data
  ● Load and storage volume are evenly distributed among the servers
  ● Data shards may also be replicated for reasons of reliability and load-balancing
  ● Writes may be allowed either to a dedicated replica only or to all replicas maintaining a partition of the data.

Sharding

•When an application stores and retrieves data
  ● Sharding logic directs the application to the appropriate shard
  ● This sharding logic can be implemented as part of the data access code in the application, or it could be implemented by the data storage system

•Benefits of sharding
  ● You can scale the system out by adding further shards running on additional storage nodes.
  ● A system can use off-the-shelf hardware rather than specialized and expensive computers for each storage node.
  ● You can reduce contention and improve performance by balancing the workload across shards.
  ● In the cloud, shards can be located physically close to the users that'll access the data.

•Sharding strategies (MS-SQL)
  ● Three strategies are commonly used when selecting the shard key and deciding how to distribute data across shards

Sharding

•Lookup strategy
  ● The sharding logic implements a map that routes a request for data to the shard that contains that data, using the shard key
  ● Virtual shards are mapped to physical partitions

Sharding

•Range strategy
  ● This strategy groups related items together in the same shard, and orders them by shard key; the shard keys are sequential
  ● Data is usually held in row key order in the shard
  ● Items that are subject to range queries and need to be grouped together can use a shard key that has the same value for the partition key but a unique value for the row key.

Sharding

•Hash strategy
  ● The sharding logic computes the shard to store an item in based on a hash of one or more attributes of the data
  ● The purpose of this strategy is to reduce the chance of hotspots
  ● Achieves a balance between the size of each shard and the average load that each shard will encounter

Sharding

•Advantages and considerations
  ● Lookup.
    ● This offers more control over the way that shards are configured and used
    ● Using virtual shards reduces the impact when rebalancing data
    ● New physical partitions can be added to even out the workload.
    ● The mapping between a virtual shard and the physical partitions can be modified without affecting application code
  ● Range.
    ● This is easy to implement and works well with range queries
    ● They can often fetch multiple data items from a single shard in a single operation.
    ● This strategy offers easier data management
    ● For example, users in the same region are in the same shard
    ● This strategy doesn't provide optimal balancing between shards.
    ● Rebalancing shards is difficult and might not resolve the problem of uneven load
  ● Hash.
    ● This strategy offers a better chance of more even data and load distribution.
    ● Request routing can be accomplished directly by using the hash function.
    ● There's no need to maintain a map.
    ● Rebalancing shards is difficult.
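
The three routing strategies can be compared side by side in a short sketch. The shard names, the tenant map and the range boundaries are made up for illustration, and SHA-256 is just one possible stable hash; none of this is taken from the referenced Azure pattern.

```python
import bisect
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]         # hypothetical physical partitions

# Lookup strategy: an explicit map from (virtual) shard key to physical shard.
TENANT_TO_SHARD = {"tenant-a": "shard-0", "tenant-b": "shard-1", "tenant-c": "shard-1"}


def lookup_shard(tenant):
    return TENANT_TO_SHARD[tenant]


# Range strategy: sorted boundaries; keys < "h" go to shard-0,
# keys < "p" go to shard-1, everything else goes to shard-2.
RANGE_BOUNDS = ["h", "p"]


def range_shard(key):
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, key)]


# Hash strategy: a stable hash of the shard key, reduced modulo the shard count.
def hash_shard(key):
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


by_lookup = lookup_shard("tenant-b")
by_range = [range_shard(k) for k in ("alice", "harry", "zoe")]
by_hash = hash_shard("user-42")
```

Moving `tenant-c` to another shard only requires changing one map entry, while changing the number of shards in `hash_shard` reassigns most keys. That matches the rebalancing trade-offs listed above.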

Consistent Hashing

•The topic is “modern”
  ● It is motivated by issues in present-day systems: Web caching, peer-to-peer systems, database partitioning
  ● It’s general and flexible enough to be useful for other problems

•Web caching was the original motivation for consistent hashing, 1997
  ● Store requested pages in the local cache
  ● Next time there is no need to retrieve the page from the remote source
  ● Obvious benefits: less Internet traffic, faster access
  ● The amount of data needed to store pages cached for 100 machines can be substantial
  ● Where to store them? One machine? Cluster?

•Use a hash function to map URLs to caches
  ● First intuition of the computer scientist?
  ● Nice properties of hash functions: a good hash function behaves like a random function, spreading data out evenly and without noticeable correlation across the possible buckets
  ● Designing good hash functions is not easy

Consistent Hashing

•Say there are n caches, named {0, 1, 2, ... , n−1}.
  ● We store the Web page with URL x at the cache server h(x) mod n
  ● h(x) is way bigger than n
  ● The solution works well if you do not change n
  ● All elements would have to be re-hashed if n changes
  ● n changes if a cache is added or a cache fails
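
The cost of changing n can be measured directly. The sketch below assumes MD5 as a stand-in for h and uses made-up URLs. For a uniform hash, a key keeps its cache when going from n = 4 to n = 5 only if h(x) mod 20 is 0, 1, 2 or 3, so about four out of five keys should move.

```python
import hashlib


def h(x):
    # Stand-in for the hash function h; any well-mixing hash would do.
    return int.from_bytes(hashlib.md5(x.encode("utf-8")).digest(), "big")


def cache_for(url, n):
    return h(url) % n


urls = [f"http://example.com/page/{i}" for i in range(10_000)]

# Compare the assignment with 4 caches against the assignment with 5 caches.
moved = sum(cache_for(u, 4) != cache_for(u, 5) for u in urls)
fraction_moved = moved / len(urls)
print(f"{fraction_moved:.0%} of the URLs change cache when n goes from 4 to 5")
```

Almost every cached page ends up on a different server after adding a single cache, which is the problem consistent hashing addresses.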

•Consistent hashing
  ● What do we want?
    ● Hash table-type functionality, where changing n does not affect hashing
    ● Almost all objects stay assigned to the same cache when n changes
  ● Leading idea: hash server names and URLs with the same hash function
  ● Which objects are assigned to which caches?

Consistent Hashing

•Inserting a new object x to the database
  ● Given an object x that hashes to the bucket h(x)
  ● Imagine we have already hashed all the cache server names
  ● We scan buckets to the right of h(x), wrapping around if necessary, until we find a bucket h(s) to which the name of some cache s hashes, and assign x to s

•Expected load on each of the n cache servers is exactly a 1/n fraction of the objects
•Suppose we add a new cache server s: which objects have to move?
  ● Only the objects stored at s.
  ● Adding the nth cache causes only a 1/n fraction of the objects to relocate

•This approach to consistent hashing can also be visualized on a circle

Consistent Hashing

•Standard hash table operations Lookup and Insert?
  ● Efficient implementation of the rightward/clockwise scan for the cache server s that minimizes h(s) subject to h(s) ≥ h(x)
  ● We want a data structure for storing the cache names, with the corresponding hash values as keys, that supports a fast Successor operation
  ● Which data structure? Given a key we need the next key and its value.
  ● Not a hash table or a heap. How about a binary search tree? With a balanced tree the complexity is O(log n)

•Reducing the variance
  ● Expected load of each cache server is a 1/n fraction of the objects
  ● The realized load of each cache will vary
  ● An easy way to decrease this variance is to make k “virtual copies” of each cache s, hashing with k different hash functions to get $h_1(s), \ldots, h_k(s)$
  ● Each server now has k copies, all of them labeled s
  ● Objects are assigned as before: from h(x), we scan clockwise until we encounter one of the hash values of some cache s
  ● Choosing $k \approx \log_2 n$ is large enough to obtain reasonably balanced loads

Consistent Hashing

•Virtual copies are also useful for dealing with heterogeneous caches that have different capacities. Make the number of virtual copies of a cache server proportional to the server’s capacity.
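
A consistent hash ring with virtual copies can be sketched as follows. Instead of a binary search tree it keeps the hash values in a sorted list and uses binary search for the Successor operation. The k hash functions are simulated by hashing the server name with a copy index appended. Both choices, as well as the names `ConsistentHashRing`, `copies` and `weight`, are assumptions of this sketch.

```python
import bisect
import hashlib


def h(x):
    # 32-bit position on the circle; the same function is used for servers and keys.
    return int.from_bytes(hashlib.md5(x.encode("utf-8")).digest()[:4], "big")


class ConsistentHashRing:
    def __init__(self, copies=8):
        self.copies = copies                       # virtual copies per unit of weight
        self._points = []                          # sorted hash values on the circle
        self._owner = {}                           # hash value -> server name

    def add(self, server, weight=1):
        # A server with a larger capacity gets proportionally more virtual copies.
        for i in range(self.copies * weight):
            p = h(f"{server}#{i}")                 # stands in for the i-th hash function
            if p not in self._owner:               # skip the rare collision
                bisect.insort(self._points, p)
                self._owner[p] = server

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        if not self._points:
            raise LookupError("no cache servers on the ring")
        # Successor: the first point with h(s) >= h(x), wrapping around the circle.
        i = bisect.bisect_left(self._points, h(key))
        if i == len(self._points):
            i = 0
        return self._owner[self._points[i]]


ring = ConsistentHashRing(copies=8)
for name in ("cache-0", "cache-1", "cache-2", "cache-3"):
    ring.add(name)

keys = [f"url-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("cache-4")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
moved_elsewhere = [k for k in moved if after[k] != "cache-4"]
```

Every key that changes its assignment moves to the newly added server, so `moved_elsewhere` stays empty. Lookups take O(log n) time through binary search, while inserting into a Python list is linear; a balanced tree would avoid that.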

Literature

•Christof Strauch, NoSQL Databases. www.christof-strauch.de/nosqldbs.pdf
•Ricky Ho, NOSQL Patterns, November 2009. horicky.blogspot.com/2009/11/nosql-patterns.html
•Todd Lipcon, Design Patterns for Distributed Non-Relational Databases, Presentation, 2009. static.last.fm/johan/nosql-20090611/intro_nosql.pdf
•David Karger et al., Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, ACM Symposium on Theory of Computing, 1997.
•Microsoft Azure, Sharding pattern, Jun 2017. docs.microsoft.com/en-us/azure/architecture/patterns/sharding
•Tim Roughgarden, Gregory Valiant, CS168: The Modern Algorithmic Toolbox, Lecture #1: Introduction and Consistent Hashing, Stanford 2017.
•Couchbase, Optimistic or pessimistic locking – Which one should you pick? https://blog.couchbase.com/optimistic-or-pessimistic-locking-which-one-should-you-pick/
