-
Introduction to NoSQL
University of Toronto, Computer Science Department
Presenter: Suprio Ray
-
How will this class improve your CV?
-
NoSQL
"The whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for."
Eric Evans
What does it mean?
No SQL (Eric Evans)
Not Only SQL (Emil Eifrem)
New SQL?
Src: Mark Madsen
-
Overview
Why NoSQL
What is NoSQL
NoSQL categories
-
Motives behind NoSQL
Big data, different application domains
Scalability and performance
Graceful failure recovery
Data format, manageability
-
Motivation: Big data; one size does not fit all
OLTP: Amazon: 42 TB; typical OLTP databases: less than a TB
Data warehouse: Yahoo: 2 PB; eBay: 1.4 PB
Search engines (text): Google: 850 TB; YouTube: 76 PB of video data/year
Scientific: US Department of Energy (NERSC): 3.5 PB
New application domains: stream processing, social media
-
Motivation: need for scale and performance
Scaling up
Issues with scaling up when the dataset is just too big
RDBMS were not designed to be distributed
Cost effective strategy: scaling out or horizontal scaling
Some applications need very few database features, but need high scalability when a traffic spike happens
SQL may be too heavy-weight
Does not need fancy indexing.
Just fast lookup by primary key
-
IT World Prediction
-
Super Bowl traffic spike
[Figure: 1,800% traffic spike when the commercial airs; performance stays stable]
-
Motivation: graceful failure recovery
Dependence on Web services
We are addicted to Googling, Gmail, Google Maps, YouTube, Facebook, Twitter, BlackBerry
Graceful failure recovery: need to continue to provide service
Cost of downtime
-
The Cost of downtime
Facebook was down for ~3 hours in Sep, 2010: $1 million in lost ad revenues
Rackspace was down due to a power failure in Jun, 2009: it was forced to pay ~$3.5 million in service credits to customers
PayPal was down due to a network hardware failure in Aug, 2009: $7.2 million in lost transactions in 4.5 hours
Google outages: Search, Gmail, YouTube, and Google News were down for 14% of users in May, 2009; Google App Engine applications were down in May, 2010
RIM had two BlackBerry service outages in a week in Dec, 2009: the second one lasted more than 8 hours. Cost?
-
Motivation: need for flexible schema
Relational databases define the schema at design time
Rigid, no way to change dynamically
Need a DBA
Stop the world to make any change
Many applications don't have any fixed schema
Log processing
Stream processing
Graph processing
Data model should not restrict data access
-
Motivations summary: avoid RDBMS/SQL limitations
Harder to scale. Expensive
Joins across multiple nodes? Hard
How does an RDBMS handle data growth? Hard
Rigid schema design. Not manageable
Need for a DBA. Expensive
-
Overview
Why NoSQL
What is NoSQL
NoSQL categories
-
NoSQL Definition
From www.nosql-database.org:
Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontal scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge data amount, and more.
-
NoSQL Distinguishing Characteristics
Can handle large data volumes: Google's "big data"
Scalable replication and distribution: potentially thousands of machines, distributed around the world
Queries need to return answers quickly
Schema-less
ACID transaction properties are not needed: BASE
CAP Theorem
-
Recap: RDBMS/SQL Characteristics
Data stored in tables
Relationships represented by data row
Data Manipulation Language (DML)
Data Definition Language (DDL)
Transactions (ACID properties)
-
Recap: Data Definition Language (DDL)
Schema defined at the start: Create Table (Column1 Datatype1, Column2 Datatype2, ...)
Constraints to define and enforce relationships: Primary Key, Foreign Key
Triggers to respond to Insert, Update, & Delete
Stored Modules
Alter
Drop
-
Recap: Data Manipulation Language (DML)
Data manipulated with Select, Insert, Update, & Delete statements
Select T1.Column1, T2.Column2 From Table1 T1, Table2 T2 Where T1.Column1 = T2.Column1
Data Aggregation
Compound statements
Functions and Procedures
-
Recap: Transactions ACID Properties
Atomic: all of the work in a transaction completes (commit) or none of it completes
Consistent: a transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints
Isolated: the results of any changes made during a transaction are not visible until the transaction has committed
Durable: the results of a committed transaction survive failures
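The atomicity and consistency properties above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names, and the amounts are assumptions made up for this example; the CHECK constraint plays the role of the consistency rule.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts inside a single transaction."""
    # `with conn:` commits if the block succeeds and rolls back if it raises.
    with conn:
        # Credit first, so a later failure has something to undo.
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
# Consistency rule (hypothetical): no account may have a negative balance.
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
with conn:
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])

transfer(conn, "alice", "bob", 30)        # succeeds and commits
try:
    transfer(conn, "alice", "bob", 500)   # the debit violates the CHECK constraint
    failed = False
except sqlite3.IntegrityError:
    failed = True                         # the credit to bob was rolled back as well

final = balances(conn)
print(final)
```

The second transfer fails on the debit, and the credit that had already been applied inside the same transaction is undone, so the database moves only between consistent states.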
-
The cost of locking
[Figure: breakdown of OLTP overhead; labels: Buffer Pool, Locking, Recovery, Real Work; values: 31%, 31%, 26%, 12%]
Source: OLTP Through the Looking Glass, and What We Found There. SIGMOD '08, pp. 981-992, 2008.
-
CAP Theorem
CAP Theorem: satisfying all three at the same time is impossible
To scale out, you have to partition
Many nodes; each node contains replicas of partitions of data
Consistency: all replicas contain the same version of data
Availability: system remains operational on failing nodes
Partition tolerance: multiple entry points; system remains operational on system split
[Figure: diagram of the three properties C, A, and P]
-
ACID vs. BASE
Pritchett, D.: BASE: An Acid Alternative (queue.acm.org/detail.cfm?id=1394128)
Relational
Atomicity, Consistency, Isolation, Durability
NoSQL
Basically Available (CP)
Soft-state, Eventually consistent (AP)
-
BASE Transactions
Acronym contrived to be the opposite of ACID: Basically Available, Soft state, Eventually Consistent
Characteristics
Availability first
Best effort
Weak consistency: stale data OK
Approximate answers OK
Simpler and faster
-
NoSQL advantages
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
Down nodes easily replaced
No single point of failure
Can scale up and down
Doesn't require a schema
-
What am I giving up?
Joins
ACID transactions
SQL, as a sometimes frustrating but still powerful query language
Easy integration with other applications that support SQL
-
Overview
Why NoSQL
What is NoSQL
NoSQL categories
-
NoSQL categories
-
Complexity
-
NoSQL categories
-
Overview
Why NoSQL
What is NoSQL
NoSQL categories
Key-value store
-
Key-Value store
Very simple interface
Data model: (key, value) pairs
Operations: Put(key, value); value = Get(key)
Implementation: efficiency, scalability, fault-tolerance
Records distributed to nodes based on key
Replication
Examples
Redis, Memcached, Riak
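A minimal sketch of the Put/Get interface with records distributed to nodes by key and simple replication. The class name KeyValueCluster, the node count, and the "next node holds the copy" placement rule are assumptions for illustration, not how Redis, Memcached, or Riak are implemented.

```python
import hashlib

class KeyValueCluster:
    """Toy key-value store: records are placed on nodes by hashing the key."""

    def __init__(self, num_nodes=4, replicas=2):
        self.nodes = [dict() for _ in range(num_nodes)]  # one dict per node
        self.replicas = replicas

    def _owners(self, key):
        # Stable hash, so placement does not change between runs.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.nodes)
        # The primary node plus the next (replicas - 1) nodes hold copies.
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for n in self._owners(key):
            self.nodes[n][key] = value

    def get(self, key, down=()):
        # Read from the first owner that is not marked as down.
        for n in self._owners(key):
            if n not in down:
                return self.nodes[n].get(key)
        raise RuntimeError("all replicas are unavailable")

store = KeyValueCluster()
store.put("foo", "bar")
primary = store._owners("foo")[0]
value_normal = store.get("foo")
value_failover = store.get("foo", down={primary})  # served by the replica
print(value_normal, value_failover)
```

Because the value is written to two nodes, the read still succeeds when the primary node for the key is treated as failed.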
-
Redis
History
Started in early 2009 by Salvatore Sanfilippo, an Italian developer
He was working on a real-time web analytics solution and found that MySQL could not provide the necessary performance
Distributed data structure server
Simple API
Automatic data partitioning across multiple nodes (in-progress)
-
Distributed data structure
Distributed hash table (DHT)
Decentralized hash lookup service
(key, value) pairs are stored in DHT and any participating node can retrieve the value given a key
-
Logical data model
Key
Printable ASCII
Value
Primitives: Strings
Containers (of strings): Hashes, Lists, Sets, Sorted Sets
-
Redis-cli
API: primitive
SET foo bar
GET foo => bar
API: list
LPUSH mylist a // now mylist holds 'a'
LPUSH mylist b // now mylist holds 'b','a'
LPUSH mylist c // now mylist holds 'c','b','a'
LRANGE mylist 0 1 => c,b
-
Redis-cli
API: hash
HMSET myuser name Salvatore surname Filippo country Italy
HGET myuser surname => Filippo
API: set
SADD myset a
SADD myset b
SADD myset foo
SADD myset bar
SMEMBERS myset => bar,a,foo,b
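The semantics of the Redis commands above can be mimicked with plain Python containers. This is only a sketch of the behaviour on the slides: it does not talk to a Redis server, and the lrange helper is a simplified stand-in that handles non-negative indexes only.

```python
from collections import deque

store = {}

# Primitive: SET foo bar / GET foo
store["foo"] = "bar"
got = store["foo"]

# List: LPUSH pushes onto the head, so the newest element comes first.
mylist = deque()
for item in ["a", "b", "c"]:
    mylist.appendleft(item)

def lrange(lst, start, stop):
    # Simplified LRANGE: non-negative indexes only; `stop` is inclusive.
    return list(lst)[start:stop + 1]

first_two = lrange(mylist, 0, 1)

# Hash: HMSET myuser ... / HGET myuser surname
myuser = {"name": "Salvatore", "surname": "Filippo", "country": "Italy"}
surname = myuser["surname"]

# Set: SADD ignores duplicates; members come back in no particular order.
myset = set()
for member in ["a", "b", "foo", "bar", "a"]:
    myset.add(member)

print(got, first_two, surname, sorted(myset))
```

Note that the end index of LRANGE is inclusive, which is why the range 0 to 1 yields two elements.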
-
Overview
Why NoSQL
What is NoSQL
NoSQL categories
Key-value store
Column store
-
Column (family) store
Not to be confused with the relational-db version of this
Sybase-IQ etc
Multi-dimensional map
Not all entries are relevant each time
Column families
Examples
Cassandra
HBase
Amazon SimpleDB
-
Cassandra
History: initially developed at Facebook for their Inbox Search feature
Released as an open source project in July 2008
Decentralized; no single point of failure
Incremental scalability
Uses consistent hashing
Tunable consistency
-
Consistent hashing
Cassandra's partitioning scheme is based on consistent hashing
[Figure: basic hash function (inconsistent hashing) compared with consistent hashing]
Consistent hashing: only a small number of keys are remapped
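To see why a basic hash function is a problem, the sketch below counts how many keys change node under a plain "hash modulo number of nodes" placement when a cluster grows from 4 to 5 nodes. The key names and node counts are assumptions chosen for the illustration.

```python
import hashlib

def stable_hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def node_for(key, num_nodes):
    # Basic scheme: node index = hash(key) mod number of nodes.
    return stable_hash(key) % num_nodes

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys change node when going from 4 to 5 nodes")
```

A key stays put only when its hash gives the same remainder modulo 4 and modulo 5, which is the case for about one hash in five, so roughly 80% of the keys move. Consistent hashing aims to move only the keys that the new node takes over.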
-
Key space partitioning
Based on consistent hashing
Keys hashed to a point on a fixed circular ring
Nodes are positioned at their hash values on the circle
A key is hashed to find its location
A key is stored in the following N (clockwise successor) nodes
-
Key space partitioning
Consistent hashing
If a node goes down, its keys are stored on the next node
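The ring described above can be sketched in a few lines. This is a simplified model, not Cassandra's implementation: node names are hypothetical, each node takes a single position on the ring (no virtual nodes), and a key is placed on the N nodes that follow its position clockwise.

```python
import bisect
import hashlib

def ring_position(name):
    # Hash a node name or key to a point on the circle [0, 2**32).
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, n_replicas=2):
        self.n_replicas = n_replicas
        self.points = sorted((ring_position(n), n) for n in nodes)

    def remove(self, node):
        self.points = [(p, n) for p, n in self.points if n != node]

    def nodes_for(self, key):
        # Walk clockwise from the key's position and take the next N nodes.
        positions = [p for p, _ in self.points]
        start = bisect.bisect_right(positions, ring_position(key))
        count = min(self.n_replicas, len(self.points))
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]

ring = Ring(["node-A", "node-B", "node-C", "node-D"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.nodes_for(k) for k in keys}

ring.remove("node-B")  # simulate a node going down
after = {k: ring.nodes_for(k) for k in keys}

# Keys whose first node was not node-B keep the same first node.
unaffected = [k for k in keys if before[k][0] != "node-B"]
same_primary = all(after[k][0] == before[k][0] for k in unaffected)
# Keys whose first node was node-B move to its clockwise successor.
inherited = all(after[k][0] == before[k][1] for k in keys if before[k][0] == "node-B")
print(same_primary, inherited)
```

Only the keys that belonged to the failed node are reassigned, and they land on the node that already held their second copy.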
-
Cassandra and Consistency
Cassandra has tunable read/write consistency
One: Return from the first node that responds
Quorum: Query from all nodes and respond with the one that has latest timestamp once a majority of nodes responded
All: Query from all nodes and respond with the one that has latest timestamp once all nodes responded.
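The three levels can be compared with a small simulation. It is a sketch under stated assumptions: three replicas, a write that has reached only two of them, replies arriving in random order, and a hypothetical read function that returns the reply with the latest timestamp among those it waited for.

```python
import random

# Each replica holds (value, timestamp); the third replica is stale.
replicas = [("v2", 200), ("v2", 200), ("v1", 100)]

def read(replicas, level, rng):
    """Return the newest value among the replies required by `level`."""
    required = {"ONE": 1, "QUORUM": len(replicas) // 2 + 1, "ALL": len(replicas)}[level]
    # Replicas answer in arbitrary order; wait for the first `required` replies.
    order = list(range(len(replicas)))
    rng.shuffle(order)
    replies = [replicas[i] for i in order[:required]]
    return max(replies, key=lambda reply: reply[1])[0]

rng = random.Random(0)
one_results = {read(replicas, "ONE", rng) for _ in range(50)}
quorum_results = {read(replicas, "QUORUM", rng) for _ in range(50)}
all_results = {read(replicas, "ALL", rng) for _ in range(50)}
print(one_results, quorum_results, all_results)
```

With the write present on a majority of replicas, any two replies include at least one fresh copy, so QUORUM and ALL always return the new value, while ONE may return stale data in exchange for a faster answer.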
-
Relational model
Schema: tabular, fixed
-
Column store model
Schema: flexible, dynamic
-
Keyspace
Close to relational database
But, does not stipulate any concrete structure
Basic attributes
Replication factor
Replica placement strategy
Column families
-
Column family
Container for a collection of rows
Think of them as a map of a map
SortedMap
-
Column
Smallest increment of data
A name, value and timestamp
Timestamps used to determine the most recent update to a column
Columns can be indexed
-
Super column
Adds another level of nesting to the regular columns
Comprised of a (super) column name and an ordered map of sub-columns
-
Column family vs table summary
Columns are not strictly defined in column family
Any column can be added to a row at any time
A column family can hold columns or super columns
Column family has a comparator attribute that indicates how columns will be sorted in query results
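The column family model (a map of maps, with a timestamp on every column and a comparator for ordering) can be sketched as follows. The ColumnFamily class, its method names, and the email column are assumptions for illustration; this is not the Cassandra API.

```python
class ColumnFamily:
    """Toy model: row key -> {column name -> (value, timestamp)}."""

    def __init__(self, comparator=None):
        self.rows = {}
        # The comparator decides column order in results (default: by name).
        self.comparator = comparator or (lambda name: name)

    def set(self, row_key, column, value, timestamp):
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(column)
        # Keep the value with the most recent timestamp.
        if current is None or timestamp >= current[1]:
            columns[column] = (value, timestamp)

    def get(self, row_key):
        columns = self.rows.get(row_key, {})
        return [(name, *columns[name]) for name in sorted(columns, key=self.comparator)]

users = ColumnFamily()
users.set("bob", "last", "Jones", timestamp=3)
users.set("bob", "first", "Robert", timestamp=1)
users.set("bob", "age", "35", timestamp=2)
users.set("bob", "age", "34", timestamp=1)                 # older write, ignored
users.set("Lin", "first", "Linda", timestamp=4)
users.set("Lin", "email", "lin@example.com", timestamp=5)  # a column only this row has

bob = users.get("bob")
lin = users.get("Lin")
print(bob)
print(lin)
```

Rows in the same column family carry different sets of columns, and a read returns the columns in comparator order with the most recent value for each.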
-
Comparator types
Several built-in types
AsciiType
UTF8Type (text)
IntegerType
LongType
UUIDType
DateType
BooleanType
FloatType
DoubleType
-
Cassandra-cli
Creating a keyspace
CREATE KEYSPACE demo with placement_strategy = 'SimpleStrategy' and strategy_options = {replication_factor:1};
use demo;
Creating a column family
create column family users with comparator = 'UTF8Type';
assume users keys as utf8;
update column family users with column_metadata = [
{column_name: first, validation_class: UTF8Type},
{column_name: last, validation_class: UTF8Type},
{column_name: age, validation_class: UTF8Type}
];
-
Cassandra-cli
Inserting records
SET users['bob']['first']='Robert';
SET users['bob']['last']='Jones';
SET users['bob']['age']='35';
SET users['Lin']['first']='Linda';
SET users['Lin']['last']='Smith';
SET users['Lin']['age']='32';
SET users['Jane']['first']='Jane';
SET users['Jane']['last']='Smith';
SET users['Jane']['age']='26';
-
Cassandra-cli
Read record by row key
GET users['bob'];
=> (name=age, value=35, timestamp=1416010677679000)
=> (name=first, value=Robert, timestamp=1416010669480000)
=> (name=last, value=Jones, timestamp=1416010676760000)
Read record by column key
GET users where last='Smith';
=> No indexed columns present in index clause with operator EQ
-
Cassandra-cli
Create index on the column
UPDATE COLUMN FAMILY users WITH comparator = UTF8Type AND column_metadata = [{column_name: last, validation_class: UTF8Type, index_type: KEYS}];
Read record by column key
GET users where last='Smith';
RowKey: Lin
=> (name=age, value=3332, timestamp=1416010957625000)
=> (name=first, value=4c696e6461, timestamp=1416010957620000)
=> (name=last, value=Smith, timestamp=1416010957623000)
RowKey: Jane
=> (name=age, value=3236, timestamp=1416010965199000)
=> (name=first, value=4a616e65, timestamp=1416010963840000)
=> (name=last, value=Smith, timestamp=1416010963843000)
-
Some statistics
Facebook Search
MySQL > 50 GB Data
Writes Average : ~300 ms
Reads Average : ~350 ms
Rewritten with Cassandra > 50 GB Data
Writes Average : 0.12 ms
Reads Average : 15 ms
-
Overview
Why NoSQL
What is NoSQL
NoSQL categories
Key-value store
Column store
Document store
-
Document store
Key-document store
The document can be seen as a value, so a document store can be considered a super-set of a key-value store
Big difference from a key-value store
In document stores one can also query on the document, i.e. the document portion is structured (not just a blob of data)
Examples
MongoDB
CouchDB
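The difference from a key-value store can be made concrete with a minimal sketch of querying on the document itself. The matches and find helpers are hypothetical names written for this example, not the MongoDB or CouchDB API, and the third document is invented to show that documents need not share fields.

```python
def matches(document, query):
    """True if every field in `query` equals the document's field.

    A query value also matches when the document field is a list containing it.
    """
    for field, expected in query.items():
        if field not in document:
            return False
        actual = document[field]
        if actual != expected and not (isinstance(actual, list) and expected in actual):
            return False
    return True

def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]

people = [
    {"_id": 1, "first": "John", "last": "Doe", "age": 39, "interests": ["Mountain Biking"]},
    {"_id": 2, "first": "Caroline", "last": "Smith", "age": 32, "interests": ["Reading", "Yoga"]},
    {"_id": 3, "first": "Ann"},  # hypothetical document with fewer fields
]

smiths = find(people, {"last": "Smith"})
yogis = find(people, {"interests": "Yoga"})
print([d["_id"] for d in smiths], [d["_id"] for d in yogis])
```

A plain key-value store could only return these documents by their key; here the store looks inside the value to answer the query.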
-
MongoDB
A document-oriented database
documents encapsulate and encode data
Uses BSON/JSON format
Schema-less
No more configuring database columns with types
No transactions
No joins
-
MongoDB basics
A MongoDB instance may have zero or more databases
A database may have zero or more collections
Can be thought of as the relation (table) in a DBMS, but with many differences
A collection may have zero or more documents
Docs in the same collection don't even need to have the same fields
Docs are the records in an RDBMS
Docs can embed other documents
Documents are addressed in the database via a unique key
A document may have one or more fields
MongoDB indexes are much like their RDBMS counterparts
-
MongoDB vs RDBMS
RDBMS          MongoDB
Database       Database
Table, View    Collection
Row            Document (JSON, BSON)
Column         Field
Example document:
{
"_id" : ObjectId("5114e0bd42"),
"first" : "John",
"last" : "Doe",
"age" : 39,
"interests" : ["Mountain Biking"]
}
-
Collection example
{
"_id" : ObjectId("5114e0bd42"),
"first" : "John",
"last" : "Doe",
"age" : 39,
"interests" : ["Mountain Biking"]
},
{
"_id" : ObjectId("4a14e0f361"),
"first" : "Caroline",
"last" : "Smith",
"age" : 32,
"interests" : ["Reading", "Yoga"]
}
"_id": obligatory, and automatically generated by MongoDB
-
Overview
Why NoSQL
What is NoSQL
NoSQL categories
Key-value store
Column store
Document store
Graph store
-
Graph store
Based on Graph Theory
Scale vertically
You can use graph algorithms easily
Example, Neo4j
-
Relational vs. Graph: data model
Finding friends
-
Relational vs. Graph: data model
Finding friends
Bob's friends
SELECT p1.Person
FROM Person p1
JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
JOIN Person p2 ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'
-
Relational vs. Graph: data model
Finding friends
Bob's friends-of-friends
SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
FROM PersonFriend pf1
JOIN Person p1 ON pf1.PersonID = p1.ID
JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
JOIN Person p2 ON pf2.FriendID = p2.ID
WHERE p1.Person = 'Bob' AND pf2.FriendID <> p1.ID
-
Relational vs. Graph: data model
Finding friends
Bob's friends-of-friends-of-...
Join complexity increases with each additional depth
-
Relational model and connected data
Relational model deals with connected data by means of join
Join tables add complexity; they mix business data with foreign key metadata
Foreign key constraints add additional development and maintenance overhead just to make the database work
Things get more complex and more expensive the deeper we go into the network
-
Enter, property graph model...
Nodes
contain properties
Relationships
connect nodes
have a start node and an end node
always have a direction
have a label
Properties
keys are strings and the values are arbitrary data types
-
Property graph model
[Figure: three nodes with properties]
name: Alice, age: 32
name: Bob, age: 35
name: James, age: 27
-
Finding relations is easy!
-
Advantages of property graph model
Flexibility
Allows us to add new nodes and new relationships without compromising the existing network or migrating data
Original data and its intent remain intact
Expressive power
We can see who LOVES whom (and whether that love is requited!)
We can see who's MARRIED_TO someone else
We can see who is a COLLEAGUE_OF whom and who is BOSS_OF them all
Performance
-
Relational vs. Graph: performance
Finding friends-of-friends in a social network
Maximum depth 5
1 million people, each with approximately 50 friends
-
Cypher: graph query language of Neo4j
Declarative graph pattern matching language
SQL for graphs
Tabular results
Cypher is evolving steadily
Syntax changes between releases
Supports queries
Including aggregation, ordering and limits
Mutating operations in product roadmap
-
Two nodes, one relationship
(a) --> (b)
-
Two nodes, one relationship
START a=node(*)
MATCH (a)-->(b)
RETURN a, b;
-
Pattern matching
START a=node(*)
MATCH (a)-->(b)
RETURN a, b;
[Figure: the pattern (a)-->(b) matched at several places in a graph]
-
Two nodes, one relationship
START a=node(*)
MATCH (a)-[r:ACTED_IN]->(m)
RETURN a.name, r.roles, m.title;
-
Paths
(a)-->(b)-->(c)
-
Pattern matching
[Figure: the path pattern (a)-->(b)-->(c) matched at several places in a graph]
-
START a=node(*)
MATCH (a)-[:ACTED_IN]->(m)
-
Constraints on properties
START tom=node:node_auto_index(name="Tom Hanks")
MATCH (tom)-[:ACTED_IN]->(movie)
WHERE movie.released < 1992
RETURN DISTINCT movie.title;
(Movies in which Tom Hanks acted that were released before 1992)
-
Variable length paths
(a)-[*1..3]->(b)
[Figure: paths from a to b of length 1, 2, and 3]
-
Friends-of-Friends
START keanu=node:node_auto_index(name="Keanu Reeves")
MATCH (keanu)-[:KNOWS*2]->(fof)
RETURN DISTINCT fof.name;
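For comparison with the SQL joins and the Cypher pattern above, here is a minimal sketch of the same traversal over an in-memory adjacency list. The function name friends_at_depth and the sample KNOWS data are assumptions for illustration; a graph database would run this kind of traversal over its own storage.

```python
def friends_at_depth(graph, start, depth):
    """Names reachable from `start` by following exactly `depth` KNOWS edges."""
    frontier = {start}
    for _ in range(depth):
        # One hop: replace each person in the frontier with the people they know.
        frontier = {friend for person in frontier for friend in graph.get(person, ())}
    return frontier

# Directed KNOWS relationships; the data is hypothetical.
knows = {
    "Bob": ["Alice", "Zach"],
    "Alice": ["James", "Bob"],
    "Zach": ["James", "Dana"],
    "James": [],
}

friends = friends_at_depth(knows, "Bob", 1)
friends_of_friends = friends_at_depth(knows, "Bob", 2)
print(sorted(friends), sorted(friends_of_friends))
```

Going one level deeper only changes the depth argument, whereas the relational version needs an additional pair of joins per level. Bob appears among his own friends-of-friends here because Alice knows him back; the SQL version filters that case out explicitly.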
-
NoSQL summary
NoSQL databases reject:
Overhead of ACID transactions
Complexity of SQL
Burden of up-front schema design
Programmer responsible for
Determining the consistency level
Navigating access path
-
Should I be using NoSQL Databases?
NoSQL data storage systems make sense for applications that need to deal with very large semi-structured data
Log analysis
Social networking feeds
Most of us work on organizational databases, which are not that large and have low update/query rates
Regular relational databases are the correct solution for such applications
-
References
I. Robinson, J. Webber, E. Eifrem. Graph Databases. O'Reilly, 2013
Neo4J intro tutorial.
NoSQL. Dr. Kristie Hawkey. Dalhousie University
NoSQL. Perry Hoekstra. Perficient, Inc.
NoSQL. Akmal Chaudhri
Massively Parallel Cloud Data Storage Systems. S. Sudarshan. IIT Bombay
NoSQL Theory, Implementations, an introduction. Firat Atagun
http://www.datastax.com/docs/1.0/ddl/column_family
http://redis.io/topics/twitter-clone
REDIS. REmote DIctionary Server. Chris Keith and James Tavares
Advanced Topics in Database Management. Stan Zdonik. Brown University
An introduction to MongoDB. Rácz Gábor
MongoDB. Mohamed Zahran. NYU
Handling an 1,800 Percent Traffic Spike During Super Bowl XLVI. Jim Houska and Jim Houska
-
Thanks
-
CRUD
Create
db.collection.insert( )
db.collection.save( )
db.collection.update( , , { upsert: true } )
Read
db.collection.find( , )
db.collection.findOne( , )
Update
db.collection.update( , , )
Delete
db.collection.remove( , )
-
mongo>
Actors database
Insert records
db.actors.insert({ first: 'matthew', last: 'setter', dob: '21/04/1978', gender: 'm', hair_colour: 'brown', occupation: 'developer', nationality: 'australian' });
db.actors.insert({ first: 'james', last: 'caan', dob: '26/03/1940', gender: 'm', hair_colour: 'brown', occupation: 'actor', nationality: 'american' }); . . . . .
-
mongo>
Actors database
Query: show all actors
> db.actors.find()
Query: show all actors that are female
> db.actors.find({gender: 'f'});
{ "_id" : ObjectId("546e5363440266a4f135a37a"), "first" : "jamie lee", "last" : "curtis", "dob" : "22/11/1958", "gender" : "f", "hair_colour" : "brown", "occupation" : "actor", "nationality" : "american" }
{ "_id" : ObjectId("546e5363440266a4f135a37c"), "first" : "judi", "last" : "dench", "dob" : "09/12/1934", "gender" : "f", "hair_colour" : "white", "occupation" : "actress", "nationality" : "english" }
Query: show all male actors who are English
> db.actors.find({gender: 'm', $or: [{nationality: 'english'}]});
{ "_id" : ObjectId("546e5363440266a4f135a37b"), "first" : "michael", "last" : "caine", "dob" : "14/03/1933", "gender" : "m", "hair_colour" : "brown", "occupation" : "actor", "nationality" : "english" }
-
mongo>
Actors database
Update: update the record for James Caan to show that his hair is grey
> db.actors.update({first: 'james', last: 'caan'}, {$set: {hair_colour: 'grey'}});
> db.actors.find({first: 'james', last: 'caan'});
{ "_id" : ObjectId("546e5363440266a4f135a377"), "first" : "james", "last" : "caan", "dob" : "26/03/1940", "gender" : "m", "hair_colour" : "grey", "occupation" : "actor", "nationality" : "american" }
Delete
> db.actors.remove({first: 'james', last: 'caan'});
-
Tech Trend: Connectedness
[Figure: axis "Information connectivity"; labels: Text Documents, Hypertext, Feeds, Blogs, Wikis, UGC, Tagging, RDFa, Social networks]
-
Consistent Hashing
Partition using consistent hashing
Keys hash to a point on a fixed circular space
Ring is partitioned into a set of ordered slots and servers and keys hashed over these slots
Nodes take positions on the circle.
A, B, and D exist.
B responsible for AB range.
D responsible for BD range.
A responsible for DA range.
C joins.
B, D split ranges.
C gets BC from D.
[Figure: ring with the labels A, H, D, B, M, V, S, R, C]