extra nosql intro

Introduction to NoSQL

University of Toronto, Computer Science Department

Presenter: Suprio Ray

How will this class improve your CV?

NoSQL

the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for.

Eric Evans

What does it mean? No SQL (Eric Evans)

Not Only SQL (Emil Eifrem)

New SQL?

Src: Mark Madsen

Overview

Why NoSQL

What is NoSQL

NoSQL categories

Motives behind NoSQL

Big data, different application domains

Scalability and performance

Graceful failure recovery

Data format, manageability

Motivation: Big data; one size does not fit all

OLTP: Amazon: 42 TB; typical OLTP databases: less than a TB

Data warehouse: Yahoo: 2 PB; eBay: 1.4 PB

Search engines (text): Google: 850 TB; YouTube: 76 PB of video data/year

Scientific: US Department of Energy (NERSC): 3.5 PB

New application domains: stream processing, social media

Motivation: need for scale and performance

Scaling up

Issues with scaling up when the dataset is just too big

RDBMS were not designed to be distributed

Cost effective strategy: scaling out or horizontal scaling

Some applications need very few database features, but need high scalability when a traffic spike happens

SQL may be too heavy-weight

Does not need fancy indexing.

Just fast lookup by primary key

IT World Prediction

Super Bowl traffic spike

[Figure: 1,800% traffic spike when the commercial airs, with stable performance]

Motivation: graceful failure recovery

Dependence on Web services

We are addicted to Googling, Gmail, Google Maps, YouTube, Facebook, Twitter, BlackBerry

Graceful failure recovery: need to continue to provide service

Cost of downtime

The Cost of downtime

Facebook was down for ~3 hours in Sep, 2010: $1 million in lost ad revenues

Rackspace was down due to a power failure in Jun, 2009: forced to pay ~$3.5 million in service credits to customers

PayPal was down due to a network hardware failure in Aug, 2009: $7.2 million in lost transactions in 4.5 hours

Google outages: Search, Gmail, YouTube, and Google News were down for 14% of users in May, 2009; Google App Engine applications were down in May, 2010

RIM had two BlackBerry service outages in a week in Dec, 2009; the second one lasted more than 8 hours. Cost?

Motivation: need for flexible schema

Relational databases define the schema at design time

Rigid, no way to change dynamically

Need a DBA

Stop the world to make any change

Many applications don't have any fixed schema

Log processing

Stream processing

Graph processing

Data model should not restrict data access

Motivations summary: avoid RDBMS/SQL limitations

Harder to scale. Expensive

Joins across multiple nodes? Hard

How does an RDBMS handle data growth? Hard

Rigid schema design. Not manageable

Need for a DBA. Expensive

Overview

Why NoSQL

What is NoSQL

NoSQL categories

NoSQL Definition

From www.nosql-database.org:

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge data amount, and more.

NoSQL Distinguishing Characteristics

Can handle large data volumes: Google's "big data"

Scalable replication and distribution: potentially thousands of machines, distributed around the world

Queries need to return answers quickly

Schema-less

ACID transaction properties are not needed: BASE instead

CAP Theorem

Recap: RDBMS/SQL Characteristics

Data stored in tables

Relationships represented by data row

Data Manipulation Language (DML)

Data Definition Language (DDL)

Transactions (ACID properties)

Recap: Data Definition Language (DDL)

Schema defined at the start: Create Table (Column1 Datatype1, Column2 Datatype2, …)

Constraints to define and enforce relationships: Primary Key, Foreign Key

Triggers to respond to Insert, Update, & Delete

Stored Modules

Alter

Drop

Recap: Data Manipulation Language (DML)

Data manipulated with Select, Insert, Update, & Delete statements

Select T1.Column1, T2.Column2 From Table1 T1, Table2 T2 Where T1.Column1 = T2.Column1

Data Aggregation

Compound statements

Functions and Procedures

Recap: Transactions ACID Properties

Atomic All of the work in a transaction completes (commit) or none of it completes

Consistent A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints

Isolated The results of any changes made during a transaction are not visible until the transaction has committed

Durable The results of a committed transaction survive failures
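The atomicity and consistency properties above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the `account` table, the `transfer` helper, and the balances are hypothetical and exist only for this illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one transaction."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception escapes it.
    with conn:
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src))

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: no negative balances.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))")
with conn:
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])

transfer(conn, "alice", "bob", 30)          # succeeds and commits
after_ok = balances(conn)

try:
    transfer(conn, "bob", "alice", 500)     # debit would violate the CHECK
except sqlite3.IntegrityError:
    pass                                    # the earlier credit is rolled back
after_failed = balances(conn)
```

In the failed transfer the credit to alice is applied first and then undone, so the database ends in the same consistent state it was in before the transaction started.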

OLTP Through the Looking Glass, and What We Found There. SIGMOD '08, pp. 981-992, 2008.

[Figure: breakdown of OLTP processing cost into buffer pool, locking, recovery, and real work; the chart segments are labelled 31%, 31%, 26%, and 12%]

The cost of locking

CAP Theorem

CAP Theorem: satisfying all three at the same time is impossible

To scale out, you have to partition

Many nodes; each node contains replicas of partitions of data

Consistency (C)

all replicas contain the same version of data

Availability (A)

system remains operational on failing nodes

Partition tolerance (P)

multiple entry points

system remains operational on system split

ACID vs. BASE

Pritchett, D.: BASE: An Acid Alternative (queue.acm.org/detail.cfm?id=1394128)

Relational

Atomicity Consistency Isolation Durability

NoSQL

Basically Available (CP)

Soft-state Eventually consistent (AP)

BASE Transactions

Acronym contrived to be the opposite of ACID: Basically Available, Soft state, Eventually Consistent

Characteristics: availability first; best effort; weak consistency (stale data OK); approximate answers OK; simpler and faster

NoSQL advantages

Cheap, easy to implement (open source)

Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned

Down nodes easily replaced

No single point of failure

Can scale up and down

Doesn't require a schema

What am I giving up?

Joins

ACID transactions

SQL, as a sometimes frustrating but still powerful query language

Easy integration with other applications that support SQL

Overview

Why NoSQL

What is NoSQL

NoSQL categories

NoSQL categories

[Figure: NoSQL categories plotted against complexity]

Overview

Why NoSQL

What is NoSQL

NoSQL categories

Key-value store

Very simple interface

Data model: (key, value) pairs

Operations: Put(key,value)

value = Get(key)

Implementation: efficiency, scalability, fault-tolerance

Records distributed to nodes based on key

Replication

Examples

Redis, Memcached, Riak
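A minimal sketch of the key-value interface described above, assuming a toy in-process "cluster" in which each node is a Python dict. The class name `TinyKVCluster`, the hash choice, and the replica count are illustrative assumptions, not the design of Redis, Memcached, or Riak.

```python
import hashlib

class TinyKVCluster:
    """Toy key-value store: records are placed on nodes by key hash
    and copied to `replicas` consecutive nodes."""

    def __init__(self, node_count=4, replicas=2):
        self.nodes = [dict() for _ in range(node_count)]
        self.replicas = replicas

    def _owners(self, key):
        # Hash the key to pick the first node, then take the next ones as replicas.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for index in self._owners(key):
            self.nodes[index][key] = value

    def get(self, key, down=()):
        # `down` lists node indexes that should be treated as failed.
        for index in self._owners(key):
            if index not in down and key in self.nodes[index]:
                return self.nodes[index][key]
        return None

cluster = TinyKVCluster()
cluster.put("foo", "bar")
primary = cluster._owners("foo")[0]
copies = sum("foo" in node for node in cluster.nodes)
value_after_failure = cluster.get("foo", down={primary})
```

Because every record lives on two nodes, the lookup still succeeds when the first owner is marked as down. The modulo placement used here remaps most keys whenever the node count changes, which is the problem consistent hashing addresses.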

Key-Value store

Redis

History: started in early 2009 by Salvatore Sanfilippo, an Italian developer

He was working on a real-time web analytics solution and found that MySQL could not provide the necessary performance

Distributed data structure server

Simple API

Automatic data partitioning across multiple nodes (in-progress)

Distributed data structure

Distributed hash table (DHT)

Decentralized hash lookup service

(key, value) pairs are stored in DHT and any participating node can retrieve the value given a key

Logical data model

Key

Printable ASCII

Value

Primitives Strings

Containers (of strings) Hashes

Lists

Sets

Sorted Sets

API: primitive

SET foo bar

GET foo
=> bar

API: list

LPUSH mylist a // now mylist holds 'a'
LPUSH mylist b // now mylist holds 'b','a'
LPUSH mylist c // now mylist holds 'c','b','a'

LRANGE mylist 0 1 => c,b

Redis-cli

API: hash

HMSET myuser name Salvatore surname Filippo country Italy
HGET myuser surname
=> Filippo

API: set

SADD myset a
SADD myset b
SADD myset foo
SADD myset bar
SMEMBERS myset
=> bar,a,foo,b
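The semantics of the commands above can be modelled in a few lines of plain Python. This is a rough in-memory model written for illustration, not the Redis client library; the function names simply mirror the commands, and negative list indexes are not handled.

```python
from collections import deque

store = {}  # key -> deque (list), dict (hash), or set

def lpush(key, value):
    """Prepend `value` to the list at `key`, creating the list if needed."""
    store.setdefault(key, deque()).appendleft(value)
    return len(store[key])

def lrange(key, start, stop):
    """Return elements start..stop, with `stop` inclusive as in Redis."""
    return list(store.get(key, ()))[start:stop + 1]

def hmset(key, **fields):
    store.setdefault(key, {}).update(fields)

def hget(key, field):
    return store.get(key, {}).get(field)

def sadd(key, member):
    """Add `member` to the set; return 1 if it was new, else 0."""
    members = store.setdefault(key, set())
    added = member not in members
    members.add(member)
    return int(added)

def smembers(key):
    return set(store.get(key, ()))

for item in ("a", "b", "c"):
    lpush("mylist", item)
hmset("myuser", name="Salvatore", surname="Filippo", country="Italy")
for member in ("a", "b", "foo", "bar", "a"):   # the repeated "a" is ignored
    sadd("myset", member)
```

Sets are unordered, which is why SMEMBERS may return the members in any order.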

Redis-cli

Overview

Why NoSQL

What is NoSQL

NoSQL categories

Key-value store

Column store

Column (family) store

Not to be confused with the relational-db version of this

Sybase-IQ etc

Multi-dimensional map

Not all entries are relevant each time

Column families

Examples

Cassandra

HBase

Amazon SimpleDB

Cassandra

History: initially developed at Facebook for their Inbox Search feature

Released as an open source project in July 2008

Decentralized; no single point of failure

Incremental scalability

Uses consistent hashing

Tunable consistency

Consistent hashing

Cassandra's partitioning scheme is based on consistent hashing

Basic hash function

Inconsistent hashing

Consistent hashing

Only a small number of keys are remapped

Consistent hashing

Key space partitioning

Based on consistent hashing

Keys hashed to a point on a fixed circular ring

Nodes are positioned at their hash values on the circle

A key is hashed to find its location

A key is stored in the following N (clockwise successor) nodes

Key space partitioning

Consistent hashing

If a node goes down, its keys are stored on the next node
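The ring described above can be sketched as follows. This is a simplified illustration with one position per node and no virtual nodes; the `HashRing` class, the node names, and the use of SHA-256 are assumptions made for the example, not Cassandra's implementation.

```python
import bisect
import hashlib

def ring_position(name):
    """Map a node name or key to a point on the ring."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    def __init__(self, nodes=()):
        self._points = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self._points, (ring_position(node), node))

    def remove(self, node):
        self._points.remove((ring_position(node), node))

    def owners(self, key, n=1):
        """Return the n nodes found clockwise from the key's position."""
        if not self._points:
            return []
        start = bisect.bisect_right(self._points, (ring_position(key), ""))
        count = min(n, len(self._points))
        return [self._points[(start + i) % len(self._points)][1]
                for i in range(count)]

keys = [f"user{i}" for i in range(1000)]

# Adding a node: every key that moves, moves to the new node.
ring = HashRing(["A", "B", "C"])
before = {k: ring.owners(k)[0] for k in keys}
ring.add("D")
after = {k: ring.owners(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Removing a node: its keys fall to the next node clockwise.
pairs = {k: ring.owners(k, 2) for k in keys}
ring.remove("B")
failover = {k: ring.owners(k)[0] for k in keys}
```

Keys outside the arc taken over by the new node keep their owner, which is the sense in which only a small number of keys are remapped.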

Cassandra and Consistency

Cassandra has programmable (tunable) read/write consistency

One: Return from the first node that responds

Quorum: Query from all nodes and respond with the one that has latest timestamp once a majority of nodes responded

All: Query from all nodes and respond with the one that has latest timestamp once all nodes responded.
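The three read levels can be sketched as a small function. This is a simplified model of the rules stated above, assuming the replica responses are given in arrival order and that `None` stands for a replica that never answers; the names `Versioned` and `read` are hypothetical, and mechanisms such as read repair are left out.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: int

# How many replica responses each level waits for, given n replicas.
REQUIRED = {
    "ONE": lambda n: 1,
    "QUORUM": lambda n: n // 2 + 1,
    "ALL": lambda n: n,
}

def read(responses, level):
    """Return the newest value among the responses the level waits for."""
    needed = REQUIRED[level](len(responses))
    answered = [r for r in responses if r is not None][:needed]
    if len(answered) < needed:
        raise TimeoutError(f"{level} needs {needed} responses, got {len(answered)}")
    return max(answered, key=lambda r: r.timestamp).value

# Three replicas: the fastest one is stale, one is current, one is down.
responses = [Versioned("old", 1), Versioned("new", 2), None]
one_result = read(responses, "ONE")
quorum_result = read(responses, "QUORUM")
try:
    read(responses, "ALL")
    all_failed = False
except TimeoutError:
    all_failed = True
```

The example shows the trade-off: ONE answers fastest but may return stale data, QUORUM sees the newer write, and ALL cannot answer while any replica is down.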

Relational model

Schema: tabular, fixed

Column store model

Schema: flexible, dynamic

Keyspace

Close to relational database

But, does not stipulate any concrete structure

Basic attributes

Replication factor

Replica placement strategy

Column families

Column family

Container for a collection of rows

Think of them as a map of a map

SortedMap

Column

Smallest increment of data

A name, value and timestamp

Timestamps used to determine the most recent update to a column

Columns can be indexed

Super column

Adds another level of nesting to the regular columns

Comprised of a (super) column name and an ordered map of sub-columns
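The "map of a map" idea can be sketched with nested dictionaries. This is only an illustration of the data model, assuming a last-write-wins rule based on the column timestamp; the `ColumnFamily` class and its method names are hypothetical, and super columns are not modelled.

```python
class ColumnFamily:
    """Row key -> {column name -> (value, timestamp)}."""

    def __init__(self):
        self.rows = {}

    def set(self, row_key, column, value, timestamp):
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(column)
        # The timestamp decides which update is the most recent one.
        if current is None or timestamp >= current[1]:
            columns[column] = (value, timestamp)

    def get(self, row_key):
        """Return (name, value, timestamp) tuples sorted by column name."""
        columns = self.rows.get(row_key, {})
        return [(name, *columns[name]) for name in sorted(columns)]

users = ColumnFamily()
users.set("bob", "first", "Robert", timestamp=1)
users.set("bob", "last", "Jones", timestamp=2)
users.set("bob", "age", "35", timestamp=3)
users.set("bob", "age", "34", timestamp=2)    # older write, ignored
users.set("Lin", "first", "Linda", timestamp=4)  # rows need not share columns
```

Sorting by column name stands in for the comparator that a column family uses to order its columns.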

Column family vs table summary

Columns are not strictly defined in column family

Any column can be added to a row at any time

A column family can hold columns or super columns

Column family has a comparator attribute that indicates how columns will be sorted in query results

Comparator types

Several built-in types

AsciiType

UTF8Type (text)

IntegerType

LongType

UUIDType

DateType

BooleanType

FloatType

DoubleType

Cassandra-cli

Creating a keyspace

CREATE KEYSPACE demo with placement_strategy = 'SimpleStrategy' and strategy_options = {replication_factor:1};

use demo;

Creating a column family

create column family users with comparator = 'UTF8Type';

assume users keys as utf8;

update column family users with column_metadata = [{column_name: first, validation_class: UTF8Type}, {column_name: last, validation_class: UTF8Type}, {column_name: age, validation_class: UTF8Type}];

Cassandra-cli

Inserting records

SET users['bob']['first']='Robert';

SET users['bob']['last']='Jones';

SET users['bob']['age']='35';

SET users['Lin']['first']='Linda';

SET users['Lin']['last']='Smith';

SET users['Lin']['age']='32';

SET users['Jane']['first']='Jane';

SET users['Jane']['last']='Smith';

SET users['Jane']['age']='26';

Cassandra-cli

Read record by row key

GET users['bob'];

=> (name=age, value=35, timestamp=1416010677679000)

=> (name=first, value=Robert, timestamp=1416010669480000)

=> (name=last, value=Jones, timestamp=1416010676760000)

Read record by column key

GET users where last='Smith';

=> No indexed columns present in index clause with operator EQ

Cassandra-cli

Create index on the column

UPDATE COLUMN FAMILY users WITH comparator = UTF8Type AND column_metadata = [{column_name: last, validation_class: UTF8Type, index_type: KEYS}];

Read record by column key

GET users where last='Smith';

RowKey: Lin
=> (name=age, value=3332, timestamp=1416010957625000)
=> (name=first, value=4c696e6461, timestamp=1416010957620000)
=> (name=last, value=Smith, timestamp=1416010957623000)

RowKey: Jane
=> (name=age, value=3236, timestamp=1416010965199000)
=> (name=first, value=4a616e65, timestamp=1416010963840000)
=> (name=last, value=Smith, timestamp=1416010963843000)
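The `age` and `first` values in this output appear as hexadecimal bytes, presumably because the updated column metadata declares a validation class only for `last`. A small sketch, assuming the bytes are UTF-8 text, decodes them; the helper name `decode_cli_value` is made up for this example.

```python
def decode_cli_value(hex_text):
    """Turn a hex string such as '4c696e6461' back into readable text."""
    return bytes.fromhex(hex_text).decode("utf-8")

decoded = {
    "Lin": (decode_cli_value("4c696e6461"), decode_cli_value("3332")),
    "Jane": (decode_cli_value("4a616e65"), decode_cli_value("3236")),
}
```

The decoded values match the records that were inserted earlier: Linda, 32 and Jane, 26.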

Some statistics

Facebook Search

MySQL > 50 GB Data

Writes Average : ~300 ms

Reads Average : ~350 ms

Rewritten with Cassandra > 50 GB Data

Writes Average : 0.12 ms

Reads Average : 15 ms

Overview

Why NoSQL

What is NoSQL

NoSQL categories

Key-value store

Column store

Document store

Document store

Key-document store

The document can be seen as a value, so a document store can be considered a superset of a key-value store

Big difference from a key-value store

In a document store one can also query on the document contents, i.e. the document portion is structured (not just a blob of data)

Examples

MongoDB

CouchDB

MongoDB

A document-oriented database

documents encapsulate and encode data

Uses BSON/JSON format

Schema-less

No more configuring database columns with types

No transactions

No joins

MongoDB basics

A MongoDB instance may have zero or more databases

A database may have zero or more collections

Can be thought of as the relation (table) in DBMS, but with many differences

A collection may have zero or more documents

Docs in the same collection don't even need to have the same fields

Docs are the records in RDBMS

Docs can embed other documents

Documents are addressed in the database via a unique key

A document may have one or more fields

MongoDB indexes are much like their RDBMS counterparts

MongoDB vs RDBMS

RDBMS         MongoDB
Database      Database
Table, View   Collection
Row           Document (JSON, BSON)
Column        Field


MongoDB vs RDBMS

{
  "_id" : ObjectId("5114e0bd42"),
  "first" : "John",
  "last" : "Doe",
  "age" : 39,
  "interests" : ["Mountain Biking"]
}

Collection example

{
  "_id" : ObjectId("5114e0bd42"),
  "first" : "John",
  "last" : "Doe",
  "age" : 39,
  "interests" : ["Mountain Biking"]
},
{
  "_id" : ObjectId("4a14e0f361"),
  "first" : "Caroline",
  "last" : "Smith",
  "age" : 32,
  "interests" : ["Reading", "Yoga"]
}

The "_id" field is obligatory and automatically generated by MongoDB
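What distinguishes a document store from a plain key-value store is that the fields inside a document can be queried. A rough sketch of that idea with plain Python dictionaries follows; `matches` and `find` are hypothetical helpers that imitate simple equality queries, not the MongoDB driver, and the `_id` values are kept as plain strings.

```python
people = [
    {"_id": "5114e0bd42", "first": "John", "last": "Doe", "age": 39,
     "interests": ["Mountain Biking"]},
    {"_id": "4a14e0f361", "first": "Caroline", "last": "Smith", "age": 32,
     "interests": ["Reading", "Yoga"]},
]

def matches(document, query):
    """True if every queried field equals the document's value, or is
    contained in it when the document's value is a list."""
    for field, wanted in query.items():
        actual = document.get(field)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True

def find(collection, query=None):
    """Return the documents matching `query`; an empty query matches all."""
    query = query or {}
    return [doc for doc in collection if matches(doc, query)]

smiths = find(people, {"last": "Smith"})
yoga_fans = find(people, {"interests": "Yoga"})
everyone = find(people)
```

Documents in the same collection are independent dictionaries, so one of them could carry extra fields without any change to the others.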

Overview

Why NoSQL

What is NoSQL

NoSQL categories

Key-value store

Column store

Document store

Graph store

Graph store

Based on Graph Theory

Scale vertically

You can use graph algorithms easily

Example: Neo4j

Relational vs. Graph: data model

Finding friends


Finding friends

Bob's friends

SELECT p1.Person
FROM Person p1
JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
JOIN Person p2 ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'


Finding friends

Bob's friends-of-friends

SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
FROM PersonFriend pf1
JOIN Person p1 ON pf1.PersonID = p1.ID
JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
JOIN Person p2 ON pf2.FriendID = p2.ID
WHERE p1.Person = 'Bob' AND pf2.FriendID <> p1.ID


Finding friends

Bob's friends-of-friends-of-....

SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
FROM PersonFriend pf1
JOIN Person p1 ON pf1.PersonID = p1.ID
JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
JOIN Person p2 ON pf2.FriendID = p2.ID
WHERE p1.Person = 'Bob' AND pf2.FriendID <> p1.ID

Join complexity increases with each additional depth
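The friends-of-friends join can be tried out with Python's built-in sqlite3 module. The table layout follows the queries on these slides, while the people and friendships inserted here are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
    CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
    INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (4, 'Dana'), (99, 'Zach');
    -- Each row means "PersonID counts FriendID as a friend".
    INSERT INTO PersonFriend VALUES (1, 2), (1, 4), (2, 1), (2, 99), (99, 1);
""")

FRIENDS_OF_FRIENDS = """
    SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
    FROM PersonFriend pf1
    JOIN Person p1 ON pf1.PersonID = p1.ID
    JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
    JOIN Person p2 ON pf2.FriendID = p2.ID
    WHERE p1.Person = ? AND pf2.FriendID <> p1.ID
"""

# Two hops already need two passes over the join table; every further
# level of depth would add another PersonFriend join to the query.
rows = sorted(conn.execute(FRIENDS_OF_FRIENDS, ("Bob",)).fetchall())
```

With this data Bob reaches Dana through Alice and Alice through Zach, while the path that leads back to Bob himself is filtered out by the final condition.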

Relational model and connected data

Relational model deals with connected data by means of joins

Join tables add complexity; they mix business data with foreign key metadata

Foreign key constraints add additional development and maintenance overhead just to make the database work

Things get more complex and more expensive the deeper we go into the network

Enter, property graph model...

Node

contains properties

Relationship

connects nodes

has a start node and an end node

always has a direction

has a label

Properties

keys are strings and the values are arbitrary data types

Property graph model

Node: name: Alice, age: 32

Node: name: Bob, age: 35

Node: name: James, age: 27

Finding relations is easy!
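A minimal sketch of the property graph model, using the three people shown above as nodes. The `Node` and `Relationship` classes are illustrative, and the relationships and their properties are hypothetical, since the figure's edges are not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: Node          # every relationship has a direction: start -> end
    end: Node
    label: str
    properties: dict = field(default_factory=dict)

alice = Node({"name": "Alice", "age": 32})
bob = Node({"name": "Bob", "age": 35})
james = Node({"name": "James", "age": 27})

# Hypothetical relationships, invented for this example.
relationships = [
    Relationship(alice, bob, "KNOWS", {"since": 2010}),
    Relationship(bob, james, "KNOWS"),
    Relationship(james, alice, "BOSS_OF"),
]

def outgoing(node, label):
    """Follow relationships with the given label that start at `node`."""
    return [r.end for r in relationships if r.start is node and r.label == label]

alice_knows = [n.properties["name"] for n in outgoing(alice, "KNOWS")]
james_bosses = [n.properties["name"] for n in outgoing(james, "BOSS_OF")]
```

Finding related nodes is a matter of following relationships from a start node, with no join table involved.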

Advantages of property graph model

Flexibility

Allows us to add new nodes and new relationships without compromising the existing network or migrating data

Original data and its intent remain intact

Expressive power

We can see who LOVES whom (and whether that love is requited!)

We can see who's MARRIED_TO someone else

We can see who is a COLLEAGUE_OF whom and who is BOSS_OF them all

Performance

Relational vs. Graph: performance

Finding friends-of-friends in a social network

Maximum depth 5

1 million people, each with approximately 50 friends

Cypher: graph query language of Neo4j

Declarative graph pattern matching language

SQL for graphs

Tabular results

Cypher is evolving steadily

Syntax changes between releases

Supports queries

Including aggregation, ordering and limits

Mutating operations in product roadmap

(a) --> (b)

Two nodes, one relationship

START a=node(*)
MATCH (a)-->(b)
RETURN a, b;

Pattern matching


START a=node(*)
MATCH (a)-[r:ACTED_IN]->(m)
RETURN a.name, r.roles, m.title;

Paths

(a)-->(b)-->(c)


Pattern matching

START a=node(*)
MATCH (a)-[:ACTED_IN]->(m)

Constraints on properties

START tom=node:node_auto_index(name="Tom Hanks")

MATCH (tom)-[:ACTED_IN]->(movie)

WHERE movie.released < 1992

RETURN DISTINCT movie.title;

(Movies in which Tom Hanks acted, that were released before 1992)

Variable length paths

(a)-[*1..3]->(b)


Friends-of-Friends

START keanu=node:node_auto_index(name="Keanu Reeves")

MATCH (keanu)-[:KNOWS*2]->(fof)

RETURN DISTINCT fof.name;
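The idea behind a fixed or variable number of hops can be sketched in Python over an adjacency map. The `knows` graph and its names are invented, and `reachable` is a hypothetical helper. Unlike Cypher, this sketch does not enforce that a relationship is used at most once per path, so results can differ on longer paths.

```python
def reachable(graph, start, min_hops, max_hops):
    """Nodes reachable from `start` by following between min_hops and
    max_hops directed relationships."""
    found = set()
    frontier = {start}
    for depth in range(1, max_hops + 1):
        # Advance every node on the frontier by one relationship.
        frontier = {nbr for node in frontier for nbr in graph.get(node, ())}
        if depth >= min_hops:
            found |= frontier
    return found

# Hypothetical KNOWS relationships: person -> people they know.
knows = {
    "keanu": ["carrie", "laurence"],
    "carrie": ["hugo"],
    "laurence": ["hugo", "keanu"],
    "hugo": [],
}

exactly_two = reachable(knows, "keanu", 2, 2)       # like [:KNOWS*2]
one_to_three = reachable(knows, "keanu", 1, 3)      # like [:KNOWS*1..3]
```

Returning a set plays the role of DISTINCT: hugo is reached through two different paths but is reported once.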

NoSQL databases reject:

Overhead of ACID transactions

Complexity of SQL

Burden of up-front schema design

Programmer responsible for

Determining the consistency level

Navigating access path

NoSQL summary

Should I be using NoSQL Databases?

NoSQL data storage systems make sense for applications that need to deal with very large volumes of semi-structured data

Log Analysis

Social Networking Feeds

Most of us work on organizational databases, which are not that large and have low update/query rates

regular relational databases are the correct solution for such applications

References

I. Robinson, J. Webber, E. Eifrem. Graph Databases. O'Reilly, 2013

Neo4J intro tutorial.

NoSQL. Dr. Kristie Hawkey. Dalhousie University

NoSQL. Perry Hoekstra. Perficient, Inc.

NoSQL. Akmal Chaudhri

Massively Parallel Cloud Data Storage Systems. S. Sudarshan. IIT Bombay

NoSQL Theory, Implementations, an introduction. Firat Atagun

http://www.datastax.com/docs/1.0/ddl/column_family

http://redis.io/topics/twitter-clone

REDIS. REmote DIctionary Server. Chris Keith and James Tavares

Advanced Topics in Database Management. Stan Zdonik. Brown University

An introduction to MongoDB. Rácz Gábor

MongoDB. Mohamed Zahran. NYU

Handling an 1,800 Percent Traffic Spike During Super Bowl XLVI. Jim Houska and Jim Houska

Thanks

CRUD

Create: db.collection.insert( document ); db.collection.save( document ); db.collection.update( query, update, { upsert: true } )

Read: db.collection.find( query, projection ); db.collection.findOne( query, projection )

Update: db.collection.update( query, update, options )

Delete: db.collection.remove( query, justOne )

mongo>

Actors database: insert records

db.actors.insert({ first: 'matthew', last: 'setter', dob: '21/04/1978', gender: 'm', hair_colour: 'brown', occupation: 'developer', nationality: 'australian' });

db.actors.insert({ first: 'james', last: 'caan', dob: '26/03/1940', gender: 'm', hair_colour: 'brown', occupation: 'actor', nationality: 'american' }); . . . . .

mongo>

Actors database

Query: show all actors

> db.actors.find()

Query: show all actors that are female

> db.actors.find({gender: 'f'});
{ "_id" : ObjectId("546e5363440266a4f135a37a"), "first" : "jamie lee", "last" : "curtis", "dob" : "22/11/1958", "gender" : "f", "hair_colour" : "brown", "occupation" : "actor", "nationality" : "american" }
{ "_id" : ObjectId("546e5363440266a4f135a37c"), "first" : "judi", "last" : "dench", "dob" : "09/12/1934", "gender" : "f", "hair_colour" : "white", "occupation" : "actress", "nationality" : "english" }

Query: show all male actors who are English

> db.actors.find({gender: 'm', $or: [{nationality: 'english'}]});
{ "_id" : ObjectId("546e5363440266a4f135a37b"), "first" : "michael", "last" : "caine", "dob" : "14/03/1933", "gender" : "m", "hair_colour" : "brown", "occupation" : "actor", "nationality" : "english" }

mongo>

Actors database

Update: update the record for James Caan to show that his hair is grey

> db.actors.update({first: 'james', last: 'caan'}, {$set: {hair_colour: 'grey'}});

> db.actors.find({first: 'james', last: 'caan'});
{ "_id" : ObjectId("546e5363440266a4f135a377"), "first" : "james", "last" : "caan", "dob" : "26/03/1940", "gender" : "m", "hair_colour" : "grey", "occupation" : "actor", "nationality" : "american" }

Delete

> db.actors.remove({first: 'james', last: 'caan'});

Tech Trend: Connectedness

Information connectivity

Text Documents

Hypertext

Feeds

Blogs

Wikis

UGC

Tagging

RDFa

Social networks

Consistent Hashing

Partition using consistent hashing

Keys hash to a point on a fixed circular space

Ring is partitioned into a set of ordered slots and servers and keys hashed over these slots

Nodes take positions on the circle.

A, B, and D exist.

B responsible for AB range.

D responsible for BD range.

A responsible for DA range.

C joins.

B, D split ranges.

C gets BC from D.
