NoSQL Databases
Not Only SQL
More than one way to store data
• Not using the relational model
• Performs well on clusters
• No fixed schema
• Generally, open source
• Specialized for websites and big data
Why NoSQL?
• Developers are frustrated by the impedance mismatch between the relational model and the application model
• Most projects use object-relational mapping (ORM) tools, which have a steep learning curve
• Relational databases do not run well on clusters. It would be very difficult to run large-scale websites (Twitter, Facebook) using only a traditional single-server RDBMS
Aggregate Data Models
• Modern applications need data grouped into units for ACID purposes
• Aggregate data is easy to manage over a cluster
• Inter-aggregate relationships are typically handled via map-reduce computations
• Many NoSQL databases also offer precomputed materialized views
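As a rough illustration of an aggregate, the sketch below groups an order with its shipping address and line items into one unit that could be stored and retrieved together. All field names and values are hypothetical, not taken from any particular database.

```python
# A hypothetical "order" aggregate: everything needed to process the order
# travels together as one unit, so it can live on a single cluster node.
order_aggregate = {
    "order_id": 1001,
    "customer_id": 42,
    "shipping_address": {"city": "Chicago", "zip": "60601"},
    "line_items": [
        {"sku": "A-1", "quantity": 2, "unit_price": 9.50},
        {"sku": "B-7", "quantity": 1, "unit_price": 20.00},
    ],
}


def order_total(order):
    """Compute the total using only data inside the aggregate."""
    return sum(item["quantity"] * item["unit_price"] for item in order["line_items"])


total = order_total(order_aggregate)
```

Because the total needs nothing outside the aggregate, no cross-node lookup is required to compute it.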
Data Distribution
• Two models are used for distributing data across a cluster
• Sharding
segments the data by primary key into shards, each stored on a different server
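One common way to assign keys to shards is to hash the primary key. The sketch below assumes hash-based placement and three hypothetical server names; range-based placement is another option, and real systems differ in the details.

```python
import hashlib

# Hypothetical server names; a real cluster would discover these dynamically.
SHARDS = ["server-0", "server-1", "server-2"]


def shard_for(primary_key, shards=SHARDS):
    """Map a primary key to one shard using a stable hash."""
    digest = hashlib.sha256(str(primary_key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def distribute(records, shards=SHARDS):
    """Group records by the shard that owns each record's primary key."""
    placement = {name: [] for name in shards}
    for key, value in records.items():
        placement[shard_for(key, shards)].append((key, value))
    return placement
```

Simple modulo placement like this remaps most keys when the number of shards changes, which is why consistent hashing is a common alternative.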
Data Distribution
• Replication
copies all the data to multiple servers. Two forms:
• Master-slave
• Peer-to-peer
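A minimal sketch of master-slave replication, assuming synchronous copying and in-memory dictionaries as storage. The class names are hypothetical and chosen only for illustration.

```python
class Replica:
    """A copy of the data that serves reads."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Master(Replica):
    """Accepts all writes and copies each one to every slave."""

    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value
        for slave in self.slaves:  # synchronous replication, for simplicity
            slave.data[key] = value
```

In the peer-to-peer form there is no single master: every node accepts writes, so conflicting updates to the same key have to be detected and resolved.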
CAP Theorem
Choose any two:
• Consistency
• Availability
• Partition Tolerance
CAP Theorem
Consistency
All database clients will read the same value for the same query, even given concurrent updates
CAP Theorem
Availability
All database clients will always be able to read and write data
CAP Theorem
Partition Tolerance
The database can be split across multiple servers and will continue to function if the network connection between them is interrupted
CAP is the Key
• SQL databases have a fixed capability (the standard) and you can’t choose one feature over another
• NoSQL databases allow a choice of the best features for a particular system
• NoSQL requires knowledge to make a good choice
3 Faces of CAP
CA (Consistency and Availability)
• Uses 2-phase commit
• Blocks system, so only works well in a single data center
3 Faces of CAP
CP (Consistency and Partition Tolerance)
• Uses shards to scale
• Data is consistent, but there is still a risk of some data becoming unavailable if a node fails
3 Faces of CAP
AP (Availability and Partition Tolerance)
• System always available
• May return inaccurate data
• Similar to DNS
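The difference between CP and AP behavior during a network partition can be sketched with a toy two-replica store. This is a heavily simplified model, not how any real database is implemented; the class, the `mode` flag, and the `partitioned` switch are all hypothetical.

```python
class PartitionedStore:
    """Two replicas of one value; `partitioned` simulates a broken network link."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": 0, "b": 0}
        self.partitioned = False

    def write(self, node, value):
        if not self.partitioned:
            # Healthy network: every replica receives the update.
            for name in self.replicas:
                self.replicas[name] = value
        elif self.mode == "CP":
            # Refuse the write rather than let the replicas diverge.
            raise RuntimeError("unavailable during partition")
        else:
            # AP: accept the write locally; the other replica becomes stale.
            self.replicas[node] = value

    def read(self, node):
        return self.replicas[node]
```

In AP mode a client reading from the other replica during the partition gets the old value, which is the "may return inaccurate data" trade-off. In CP mode the write is rejected, which is the loss of availability.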
NoSQL Database Types
• Key-Value
• Document
• Column-family
• Graph
Key-Value Databases
• Simplest API: get, put, delete
• Data is just a blob
• Might not be persistent
• Examples: Riak, Memcached, Redis
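The three-operation API can be sketched as a thin wrapper over an in-memory dictionary. This is an illustrative class, not the client API of Riak, Memcached, or Redis.

```python
class KeyValueStore:
    """In-memory store exposing only get, put, and delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        # The value is an opaque blob; the store never inspects it.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)
```

Since this sketch keeps everything in memory, its contents are lost when the process exits, which mirrors the "might not be persistent" point above.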
Document Databases
• Similar to key-value except the value is in a known format and is examinable
• Data is structured (most likely JSON, BSON, or XML)
• Might not be persistent
• Examples: CouchDB, MongoDB
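Because the value is in a known format, the store can look inside it. The sketch below assumes JSON documents and a hypothetical `find` helper that matches on field values; it does not reproduce the query syntax of CouchDB or MongoDB.

```python
import json

# Hypothetical documents keyed by id; note that the two have different fields.
documents = {
    "u1": json.loads('{"name": "Ada", "city": "London", "tags": ["admin"]}'),
    "u2": json.loads('{"name": "Linus", "city": "Helsinki"}'),
}


def find(docs, **criteria):
    """Return the ids of documents whose fields match every criterion."""
    return sorted(
        doc_id
        for doc_id, doc in docs.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )
```

For example, `find(documents, city="London")` selects documents by content, which a plain key-value store cannot do because it treats the value as a blob.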
Column Family Databases
• Store rows that have many columns associated with a row key
• Column families are groups of related data that are often accessed together (a customer’s profile would be one family, their orders another)
• Examples: Cassandra, HBase
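The row key, column family, and column layout can be pictured as nested maps. The sketch below uses the customer example from the slide; the row key, column names, and values are hypothetical, and this is not the data model API of Cassandra or HBase.

```python
# row key -> column family -> column name -> value
rows = {
    "customer:42": {
        "profile": {"name": "Ada", "email": "ada@example.com"},
        "orders": {"2024-01-05": "order:1001", "2024-02-11": "order:1017"},
    },
}


def read_family(table, row_key, family):
    """Fetch one column family for a row without touching the others."""
    return table.get(row_key, {}).get(family, {})


profile = read_family(rows, "customer:42", "profile")
```

Reading a single family at a time reflects the idea that data in one family is usually accessed together.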
Graph Databases
• Store entities and the relationships between them
• Entities have properties
• Edges (relations) can have properties and have directional significance
• Allows easy traversal of relationships
• Examples: Neo4j, InfiniteGraph
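A minimal sketch of directed edges with properties and a traversal over them, using plain Python data structures. The node names, relation types, and helper functions are hypothetical; graph databases provide their own query languages for this.

```python
from collections import deque

# Directed edges: (source, relation, target, properties)
edges = [
    ("alice", "FRIEND", "bob", {"since": 2019}),
    ("bob", "FRIEND", "carol", {"since": 2021}),
    ("alice", "LIKES", "graph-book", {}),
]


def neighbors(node, relation):
    """Follow outgoing edges of one relation type from a node."""
    return [t for s, rel, t, _ in edges if s == node and rel == relation]


def reachable(start, relation, max_depth):
    """Breadth-first traversal along one relation type, up to max_depth hops."""
    seen, queue, found = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in neighbors(node, relation):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found
```

Because edges are directional, traversing from "carol" along FRIEND finds nothing, while traversing from "alice" reaches friends of friends.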
Why Choose NoSQL?
• To improve programmer productivity
• To improve data access performance by handling larger data sets, reducing latency, and/or improving throughput
Selecting a NoSQL Database
• Key-value databases are used for session information, preferences, profiles, and shopping carts: normally data without relationships.
• Document databases are used for content management, real-time analytics, and e-commerce. Avoid them when aggregate data is often needed.
• Column family databases are useful for write-heavy operations like logging.
Selecting a NoSQL Database
• Graph databases are optimal for social networks, spatial data, and recommendation engines