Introduction to NoSQL. University of Toronto, Computer Science Department. Presenter: Suprio Ray


  • Introduction to NoSQL

    University of Toronto, Computer Science Department

    Presenter: Suprio Ray

  • How will this class improve your CV

  • NoSQL

    "The whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for." (Eric Evans)

    What does it mean?

      No SQL (Eric Evans)

      Not Only SQL (Emil Eifrem)

      New SQL?

    Src: Mark Madsen

  • Overview

    Why NoSQL

    What is NoSQL

    NoSQL categories

  • Motives behind NoSQL

    Big data, different application domains

    Scalability and performance

    Graceful failure recovery

    Data format, manageability

  • Motivation: Big data; one size does not fit all

    OLTP: Amazon 42 TB; typical OLTP databases less than a TB

    Data warehouse: Yahoo 2 PB; eBay 1.4 PB

    Search engines (text): Google 850 TB; YouTube 76 PB of video data per year

    Scientific: US Department of Energy (NERSC) 3.5 PB

    New application domains: stream processing, social media

  • Motivation: need for scale and performance

    Scaling up

    Issues with scaling up when the dataset is just too big

    RDBMS were not designed to be distributed

    Cost effective strategy: scaling out or horizontal scaling

    Some applications need very few database features, but need high scalability when a traffic spike happens

    SQL may be too heavy-weight

    Does not need fancy indexing.

    Just fast lookup by primary key

  • IT World Prediction

  • Super Bowl traffic spike

    1,800% traffic spike

    Stable Performance

    Commercial Airs

  • Motivation: graceful failure recovery

    Dependence on Web services

    We are addicted to Googling, Gmail, Google Maps, YouTube, Facebook, Twitter, BlackBerry

    Graceful failure recovery: need to continue to provide service

    Cost of downtime

  • The Cost of downtime

    Facebook was down for ~3 hours in Sep 2010: $1 million in lost ad revenue

    Rackspace was down due to a power failure in Jun 2009: it was forced to pay ~$3.5 million in service credits to customers

    PayPal was down due to a network hardware failure in Aug 2009: $7.2 million in lost transactions in 4.5 hours

    Google outages: Search, Gmail, YouTube, and Google News were down for 14% of users in May 2009; Google App Engine applications were down in May 2010

    RIM had two BlackBerry service outages in a week in Dec 2009; the second one lasted more than 8 hours. Cost?

  • Motivation: need for flexible schema

    Relational databases define the schema at design time

    Rigid, no way to change dynamically

    Need a DBA

    Stop the world to make any change

    Many applications don't have any fixed schema: log processing, stream processing, graph processing

    Data model should not restrict data access

  • Motivations summary: avoid RDBMS/SQL limitations

    Harder to scale. Expensive

    Joins across multiple nodes? Hard

    How does an RDBMS handle data growth? Hard

    Rigid schema design. Not manageable

    Need for a DBA. Expensive

  • Overview

    Why NoSQL

    What is NoSQL

    NoSQL categories

  • NoSQL Definition

    From www.nosql-database.org:

    "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data, and more."

  • NoSQL Distinguishing Characteristics

    Can handle large data volumes: Google's "big data"

    Scalable replication and distribution: potentially thousands of machines, distributed around the world

    Queries need to return answers quickly

    Schema-less

    ACID transaction properties are not needed: BASE

    CAP Theorem

  • Recap: RDBMS/SQL Characteristics

    Data stored in tables

    Relationships represented by data row

    Data Manipulation Language (DML)

    Data Definition Language (DDL)

    Transactions (ACID properties)

  • Recap: Data Definition Language (DDL)

    Schema defined at the start: CREATE TABLE (Column1 Datatype1, Column2 Datatype2, ...)

    Constraints to define and enforce relationships: primary key, foreign key

    Triggers to respond to Insert, Update, and Delete

    Stored Modules

    Alter

    Drop

  • Recap: Data Manipulation Language (DML)

    Data manipulated with Select, Insert, Update, & Delete statements

    Select T1.Column1, T2.Column2 From Table1 T1, Table2 T2 Where T1.Column1 = T2.Column1

    Data Aggregation

    Compound statements

    Functions and Procedures

  • Recap: Transactions ACID Properties

    Atomic: all of the work in a transaction completes (commit) or none of it completes

    Consistent: a transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints

    Isolated: the results of any changes made during a transaction are not visible until the transaction has committed

    Durable: the results of a committed transaction survive failures
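
    As a rough illustration of atomicity and consistency, the sketch below uses Python's built-in sqlite3 module. The accounts table, the names, and the "no negative balance" rule are assumptions made up for this example; they are not part of the slides.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            # Consistency rule assumed for this sketch: no negative balances.
            (lowest,) = conn.execute(
                "SELECT MIN(balance) FROM accounts").fetchone()
            if lowest < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates commit together
failed = transfer(conn, "alice", "bob", 500)  # both updates are rolled back
print(ok, failed, balances(conn))
```

    The failed transfer leaves no partial update behind: either both UPDATE statements take effect or neither does.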

  • The cost of locking

    Source: "OLTP Through the Looking Glass, and What We Found There", SIGMOD '08, pp. 981-992, 2008.

    [Figure: chart with segments labeled Buffer Pool, Locking, Recovery, and Real Work; the values shown are 31%, 31%, 26%, and 12%]

  • CAP Theorem

    CAP Theorem: satisfying all three at the same time is impossible

    To scale out, you have to partition

    Many nodes; each node contains replicas of partitions of data

    Consistency: all replicas contain the same version of data

    Availability: system remains operational on failing nodes

    Partition tolerance: multiple entry points; system remains operational on system split

    [Figure: diagram labeled C, A, and P]

  • ACID vs. BASE

    Pritchett, D.: BASE: An Acid Alternative (queue.acm.org/detail.cfm?id=1394128)

    Relational

    Atomicity Consistency Isolation Durability

    NoSQL

    Basically Available (CP)

    Soft-state, Eventually consistent (AP)

  • BASE Transactions

    Acronym contrived to be the opposite of ACID: Basically Available, Soft state, Eventually Consistent

    Characteristics

      Availability first

      Best effort

      Weak consistency: stale data OK

      Approximate answers OK

      Simpler and faster
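
    To make "stale data OK" concrete, here is a minimal, single-process sketch of eventual consistency. The class names, the version counter, and the explicit sync() step are all assumptions for illustration; real systems propagate updates over a network and resolve conflicts in more elaborate ways.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Keep only the newest version (last writer wins).
        if key not in self.data or self.data[key][0] < version:
            self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))[1]


class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # queued updates: (replica index, key, version, value)
        self.version = 0

    def write(self, key, value):
        # Acknowledge after updating one replica; the others catch up later.
        self.version += 1
        self.replicas[0].apply(key, self.version, value)
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, self.version, value))

    def read(self, key, replica=0):
        return self.replicas[replica].read(key)

    def sync(self):
        # Background propagation: deliver every queued update.
        for i, key, version, value in self.pending:
            self.replicas[i].apply(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart", ["book"])
stale = store.read("cart", replica=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("cart", replica=2)  # replicas have now converged
print(stale, fresh)
```

    The write returns quickly (availability first), a read from a lagging replica may return old data, and all replicas agree once propagation finishes.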

  • NoSQL advantages

    Cheap, easy to implement (open source)

    Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned

      Down nodes easily replaced

    No single point of failure

    Can scale up and down

    Doesn't require a schema

  • What am I giving up?

    Joins

    ACID transactions

    SQL, as a sometimes frustrating but still powerful query language

    Easy integration with other applications that support SQL

  • Overview

    Why NoSQL

    What is NoSQL

    NoSQL categories

  • NoSQL categories

  • Complexity

  • NoSQL categories

  • Overview

    Why NoSQL

    What is NoSQL

    NoSQL categories

    Key-value store

  • Key-value store

    Very simple interface

    Data model: (key, value) pairs

    Operations: Put(key, value); value = Get(key)

    Implementation: efficiency, scalability, fault-tolerance

    Records distributed to nodes based on key

    Replication

    Examples: Redis, Memcached, Riak
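
    The Put/Get interface, key-based distribution, and replication can be sketched in a few lines. Everything here (the class name, modulo placement, storing copies on neighbouring nodes, the `down` parameter) is an assumption chosen for brevity, not how any particular product works.

```python
import hashlib


class KeyValueStore:
    def __init__(self, n_nodes=4, replicas=2):
        self.nodes = [dict() for _ in range(n_nodes)]  # one dict per "node"
        self.replicas = replicas

    def _owners(self, key):
        # Use a stable hash so placement does not change between runs.
        digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        first = digest % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write the pair to every node responsible for the key.
        for n in self._owners(key):
            self.nodes[n][key] = value

    def get(self, key, down=()):
        # Read from the first responsible node that is not marked as down.
        for n in self._owners(key):
            if n not in down and key in self.nodes[n]:
                return self.nodes[n][key]
        return None


kv = KeyValueStore()
kv.put("user:1", "Alice")
primary = kv._owners("user:1")[0]
print(kv.get("user:1"), kv.get("user:1", down={primary}))
```

    Because each pair lives on two nodes, a lookup still succeeds when the first responsible node is unavailable.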

  • Redis

    History: started in early 2009 by Salvatore Sanfilippo, an Italian developer, who was working on a real-time web analytics solution and found that MySQL could not provide the necessary performance

    Distributed data structure server

    Simple API

    Automatic data partitioning across multiple nodes (in-progress)

  • Distributed data structure

    Distributed hash table (DHT)

    Decentralized hash lookup service

    (key, value) pairs are stored in DHT and any participating node can retrieve the value given a key

  • Logical data model

    Key: printable ASCII

    Value

      Primitives: strings

      Containers (of strings): hashes, lists, sets, sorted sets

  • Redis-cli

    API: primitive

    SET foo bar

    GET foo

    => bar

    API: list

    LPUSH mylist a   // now mylist holds 'a'

    LPUSH mylist b   // now mylist holds 'b','a'

    LPUSH mylist c   // now mylist holds 'c','b','a'

    LRANGE mylist 0 1

    => c,b

  • Redis-cli

    API: hash

    HMSET myuser name Salvatore surname Filippo country Italy

    HGET myuser surname

    => Filippo

    API: set

    SADD myset a

    SADD myset b

    SADD myset foo

    SADD myset bar

    SMEMBERS myset

    => bar,a,foo,b
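
    The semantics of these commands can be imitated with ordinary Python containers. The MiniRedis class below is a hypothetical in-memory stand-in written for this note; it is not the Redis client library, talks to no server, and only handles non-negative LRANGE indices.

```python
from collections import deque


class MiniRedis:
    """Toy emulation of a few Redis commands on top of a Python dict."""

    def __init__(self):
        self.db = {}

    def set(self, key, value):
        self.db[key] = value

    def get(self, key):
        return self.db.get(key)

    def lpush(self, key, value):
        # LPUSH inserts at the head of the list and returns the new length.
        self.db.setdefault(key, deque()).appendleft(value)
        return len(self.db[key])

    def lrange(self, key, start, stop):
        # Redis treats the stop index as inclusive.
        return list(self.db.get(key, ()))[start:stop + 1]

    def hmset(self, key, mapping):
        self.db.setdefault(key, {}).update(mapping)

    def hget(self, key, field):
        return self.db.get(key, {}).get(field)

    def sadd(self, key, member):
        # Returns 1 if the member was new, 0 if it was already present.
        members = self.db.setdefault(key, set())
        added = member not in members
        members.add(member)
        return int(added)

    def smembers(self, key):
        return set(self.db.get(key, set()))


r = MiniRedis()
r.set("foo", "bar")
for item in ("a", "b", "c"):
    r.lpush("mylist", item)
r.hmset("myuser", {"name": "Salvatore", "surname": "Filippo", "country": "Italy"})
for member in ("a", "b", "foo", "bar"):
    r.sadd("myset", member)
print(r.get("foo"), r.lrange("mylist", 0, 1), r.hget("myuser", "surname"))
```

    Sets are unordered, which is why SMEMBERS may list members in a different order from the one in which they were added.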

  • Overview

    Why NoSQL

    What is NoSQL

    NoSQL categories

    Key-value store

    Column store

  • Column (family) store

    Not to be confused with the relational-db version of this (Sybase IQ, etc.)

    Multi-dimensional map

    Not all entries are relevant each time

    Column families

    Examples: Cassandra, HBase, Amazon SimpleDB

  • Cassandra

    History: initially developed at Facebook to power their Inbox Search feature

    Released as an open source project in July 2008

    Decentralized; no single point of failure

    Incremental scalability

    Uses consistent hashing

    Tunable consistency

  • Consistent hashing

    Cassandra's partitioning scheme is based on consistent hashing

    Basic hash function

    Inconsistent hashing

    Consistent hashing: only a small number of keys are remapped
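
    The difference between a basic hash function and consistent hashing shows up when a node is added. The sketch below compares the two on synthetic keys. The node names, the MD5-based hash, and the use of 100 virtual points per node are assumptions for this example and are not taken from the slides or from Cassandra's implementation.

```python
import bisect
import hashlib


def h(text):
    """Stable integer hash of a string."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def mod_owner(key, nodes):
    # Basic scheme: hash modulo the number of nodes.
    return nodes[h(key) % len(nodes)]


class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed at several points on the ring.
        self.points = sorted((h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.hashes = [point for point, _ in self.points]

    def owner(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect_right(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]


def moved_fraction(owner_before, owner_after, keys):
    return sum(owner_before(k) != owner_after(k) for k in keys) / len(keys)


keys = [f"key{i}" for i in range(10000)]
old_nodes = ["n1", "n2", "n3", "n4"]
new_nodes = old_nodes + ["n5"]

mod_moved = moved_fraction(lambda k: mod_owner(k, old_nodes),
                           lambda k: mod_owner(k, new_nodes), keys)
old_ring, new_ring = Ring(old_nodes), Ring(new_nodes)
ring_moved = moved_fraction(old_ring.owner, new_ring.owner, keys)
print(f"modulo: {mod_moved:.0%} moved, ring: {ring_moved:.0%} moved")
```

    With modulo placement most keys change owner when the fifth node arrives; on the ring only the keys that now fall to the new node move, roughly its fair share of the data.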

  • Key space partitioning

    Based on consistent hashing

    Keys hashed to a point on a fixed circular ring

    Nodes are positioned at their hash values on the circle

    A key is hashed to find its location

    A key is stored in the following N (clockwise successor) nodes

  • Key space partitioning

    Consistent hashing

    If a node goes down, its keys are stored in the next node
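
    Placement on the ring, including what happens when a node is down, might look like the following sketch. The class name, the five single-letter node names, the MD5 hash, and the `down` parameter are hypothetical choices for illustration.

```python
import bisect
import hashlib


def ring_hash(text):
    """Stable position on the ring for a node name or a key."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class ReplicatedRing:
    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def replicas_for(self, key, down=()):
        """Walk clockwise from the key's position, skipping nodes that are down."""
        start = bisect.bisect_right(self.positions, ring_hash(key))
        chosen = []
        for step in range(len(self.ring)):
            node = self.ring[(start + step) % len(self.ring)][1]
            if node not in down:
                chosen.append(node)
            if len(chosen) == self.n_replicas:
                break
        return chosen


ring = ReplicatedRing(["A", "B", "C", "D", "E"], n_replicas=3)
normal = ring.replicas_for("bob")
degraded = ring.replicas_for("bob", down={normal[0]})
print(normal, degraded)
```

    When the first successor is down, the remaining replicas keep their roles and the next node clockwise takes over the missing copy.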

  • Cassandra and Consistency

    Cassandra has programmable read/write consistency

    One: Return from the first node that responds

    Quorum: Query from all nodes and respond with the one that has latest timestamp once a majority of nodes responded

    All: Query from all nodes and respond with the one that has latest timestamp once all nodes responded.
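
    The three read levels can be sketched as "wait for enough replies, then pick the newest". The function names, the (value, timestamp) tuples, and the use of TimeoutError are assumptions for this example; it models the idea only, not Cassandra's actual read path.

```python
def required_responses(level, n_replicas):
    """Number of replica replies needed before answering the client."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return n_replicas // 2 + 1  # a strict majority
    if level == "ALL":
        return n_replicas
    raise ValueError(f"unknown consistency level: {level}")


def read(responses, level, n_replicas):
    """Pick a value from replica replies given in arrival order.

    `responses` is a list of (value, timestamp) tuples.
    """
    need = required_responses(level, n_replicas)
    if len(responses) < need:
        raise TimeoutError("not enough replicas responded")
    considered = responses[:need]
    # The reply with the latest timestamp wins.
    return max(considered, key=lambda reply: reply[1])[0]


# The fastest replica happens to hold an older value.
replies = [("Jones", 10), ("Smith", 20), ("Jones", 10)]
print(read(replies, "ONE", 3), read(replies, "QUORUM", 3), read(replies, "ALL", 3))
```

    Lower levels answer sooner but may return stale data; higher levels wait for more replicas and are more likely to see the latest write.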

  • Relational model

    Schema: tabular, fixed

  • Column store model

    Schema: flexible, dynamic

  • Keyspace

    Close to relational database

    But, does not stipulate any concrete structure

    Basic attributes

    Replication factor

    Replica placement strategy

    Column families

  • Column family

    Container for a collection of rows

    Think of them as a map of a map

    SortedMap

  • Column

    Smallest increment of data

    A name, value and timestamp

    Timestamps used to determine the most recent update to a column

    Columns can be indexed

  • Super column

    Adds another level of nesting to the regular columns

    Comprised of a (super) column name and an ordered map of sub-columns

  • Column family vs table summary

    Columns are not strictly defined in column family

    Any column can be added to a row at any time

    A column family can hold columns or super columns

    Column family has a comparator attribute that indicates how columns will be sorted in query results
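
    The "map of a map" view of a column family, with each column carrying a name, value, and timestamp, can be modelled directly. The class below is a hypothetical sketch; sorting by column name stands in for a comparator, and the sample rows are invented.

```python
from collections import namedtuple

# A column is the smallest unit of data: name, value, and timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])


class ColumnFamily:
    def __init__(self):
        self.rows = {}  # row key -> {column name -> Column}

    def set(self, row_key, name, value, timestamp):
        row = self.rows.setdefault(row_key, {})
        current = row.get(name)
        # The write with the most recent timestamp wins.
        if current is None or timestamp >= current.timestamp:
            row[name] = Column(name, value, timestamp)

    def get(self, row_key):
        # Columns come back ordered by name, as a comparator would order them.
        row = self.rows.get(row_key, {})
        return [row[name] for name in sorted(row)]


users = ColumnFamily()
users.set("bob", "first", "Robert", timestamp=1)
users.set("bob", "age", "35", timestamp=2)
users.set("bob", "age", "34", timestamp=1)  # older write, ignored
users.set("Lin", "email", "lin@example.com", timestamp=3)  # column only Lin has
print(users.get("bob"))
```

    Rows in the same column family do not need the same columns, and a late-arriving older write does not overwrite a newer value.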

  • Comparator types

    Several built-in types

    AsciiType

    UTF8Typetext

    IntegerType

    LongType

    UUIDDateType

    BooleanType

    FloatType

    DoubleType

  • Cassandra-cli

    Creating a keyspace

    CREATE KEYSPACE demo with placement_strategy = 'SimpleStrategy' and strategy_options = {replication_factor:1};

    use demo;

    Creating a column family

    create column family users with comparator = 'UTF8Type';

    assume users keys as utf8;

    update column family users with column_metadata = [

      {column_name: first, validation_class: UTF8Type},

      {column_name: last, validation_class: UTF8Type},

      {column_name: age, validation_class: UTF8Type}

    ];

  • Cassandra-cli

    Inserting record

    SET users['bob']['first']='Robert';

    SET users['bob']['last']='Jones';

    SET users['bob']['age']='35';

    SET users['Lin']['first']='Linda';

    SET users['Lin']['last']='Smith';

    SET users['Lin']['age']='32';

    SET users['Jane']['first']='Jane';

    SET users['Jane']['last']='Smith';

    SET users['Jane']['age']='26';

  • Cassandra-cli

    Read record by row key

    GET users['bob'];

    => (name=age, value=35, timestamp=1416010677679000)

    => (name=first, value=Robert, timestamp=1416010669480000)

    => (name=last, value=Jones, timestamp=1416010676760000)

    Read record by column key

    GET users where last='Smith';

    => No indexed columns present in index clause with operator EQ

  • Cassandra-cli

    Create index on the column

    UPDATE COLUMN FAMILY users WITH comparator = UTF8Type AND column_metadata = [{column_name: last, validation_class: UTF8Type, index_type: KEYS}];

    Read record by column key

    GET users where last='Smith';

    RowKey: Lin

    => (name=age, value=3332, timestamp=1416010957625000)

    => (name=first, value=4c696e6461, timestamp=1416010957620000)

    => (name=last, value=Smith, timestamp=1416010957623000)

    RowKey: Jane

    => (name=age, value=3236, timestamp=1416010965199000)

    => (name=first, value=4a616e65, timestamp=1416010963840000)

    => (name=last, value=Smith, timestamp=1416010963843000)
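
    Why the lookup by column value fails before the index exists and succeeds afterwards can be shown with a small model. This is a hypothetical in-memory sketch (the class, its methods, and the LookupError are assumptions), not a description of Cassandra's index internals.

```python
from collections import defaultdict


class IndexedColumnFamily:
    def __init__(self):
        self.rows = {}     # row key -> {column name -> value}
        self.indexes = {}  # column name -> {value -> set of row keys}

    def create_index(self, column):
        # Build the index from rows that already exist.
        index = defaultdict(set)
        for row_key, row in self.rows.items():
            if column in row:
                index[row[column]].add(row_key)
        self.indexes[column] = index

    def set(self, row_key, column, value):
        row = self.rows.setdefault(row_key, {})
        if column in self.indexes:
            # Keep the index in step with the data.
            if column in row:
                self.indexes[column][row[column]].discard(row_key)
            self.indexes[column][value].add(row_key)
        row[column] = value

    def get_where(self, column, value):
        if column not in self.indexes:
            raise LookupError("no index on column " + column)
        return sorted(self.indexes[column].get(value, ()))


users = IndexedColumnFamily()
for key, last in [("bob", "Jones"), ("Lin", "Smith"), ("Jane", "Smith")]:
    users.set(key, "last", last)

try:
    users.get_where("last", "Smith")
    unindexed_lookup_failed = False
except LookupError:
    unindexed_lookup_failed = True

users.create_index("last")
smiths = users.get_where("last", "Smith")
print(unindexed_lookup_failed, smiths)
```

    Lookups by row key need no index because rows are located by their key; lookups by column value need a separate structure that maps values back to row keys.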

  • Some statistics

    Facebook Search

    MySQL > 50 GB Data

    Writes Average : ~300 ms

    Reads Average : ~350 ms

    Rewritten with Cassandra > 50 GB Data

    Writes Average : 0.12 ms

    Reads Average : 15 ms

  • Overview

    Why NoSQL

    What is NoSQL

    NoSQL categories

    Key-value store

    Column store

    Document store

  • Document store

    Key-document store

    The document can be seen as a value, so a document store can be considered a superset of a key-value store

    Big difference from a key-value store: in a document store one can also query on the document's contents, i.e. the document portion is structured (not just a blob of data)

    Examples

    MongoDB

    CouchDB

  • MongoDB

    A document-oriented database

    documents encapsulate and encode data

    Uses BSON/JSON format

    Schema-less

    No more configuring database columns with types

    No transactions

    No joins

  • MongoDB basics

    A MongoDB instance may have zero or more databases

    A database may have zero or more collections

      A collection can be thought of as the relation (table) in an RDBMS, but with many differences

    A collection may have zero or more documents

      Docs in the same collection don't even need to have the same fields

      Docs correspond to the records in an RDBMS

      Docs can embed other documents

      Documents are addressed in the database via a unique key

    A document may have one or more fields

    MongoDB indexes are much like their RDBMS counterparts

  • MongoDB vs RDBMS

    RDBMS          MongoDB

    Database       Database

    Table, View    Collection

    Row            Document (JSON, BSON)

    Column         Field

  • MongoDB vs RDBMS

    {

      "_id" : ObjectId("5114e0bd42"),

      "first" : "John",

      "last" : "Doe",

      "age" : 39,

      "interests" : ["Mountain Biking"]

    }

  • Collection example

    {

      "_id" : ObjectId("5114e0bd42"),

      "first" : "John",

      "last" : "Doe",

      "age" : 39,

      "interests" : ["Mountain Biking"]

    },

    {

      "_id" : ObjectId("4a14e0f361"),

      "first" : "Caroline",

      "last" : "Smith",

      "age" : 32,

      "interests" : ["Reading", "Yoga"]

    }

    The "_id" field is obligatory, and automatically generated by MongoDB
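
    Because documents are structured, a store can match on their fields. The sketch below imitates that with plain Python dicts. The `find` and `matches` helpers, the integer `_id` values, and the third document are assumptions for illustration; this is not the MongoDB driver.

```python
def matches(doc, query):
    """True if every field in `query` matches the document."""
    for field, wanted in query.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # A scalar matches an array field if the array contains it.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def find(collection, query=None):
    """Return all documents matching the query (all documents if no query)."""
    return [doc for doc in collection if matches(doc, query or {})]


people = [
    {"_id": 1, "first": "John", "last": "Doe", "age": 39,
     "interests": ["Mountain Biking"]},
    {"_id": 2, "first": "Caroline", "last": "Smith", "age": 32,
     "interests": ["Reading", "Yoga"]},
    {"_id": 3, "first": "Ana", "nickname": "ace"},  # different fields, same collection
]

yoga_fans = find(people, {"interests": "Yoga"})
print([doc["first"] for doc in yoga_fans])
```

    A plain key-value store could only fetch these documents by key; here the query looks inside them, and documents with different fields coexist in one collection.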

  • Overview

    Why NoSQL

    What is NoSQL

    NoSQL categories

    Key-value store

    Column store

    Document store

    Graph store

  • Graph store

    Based on Graph Theory

    Scale vertically

    You can use graph algorithms easily

    Example, Neo4j

  • Relational vs. Graph: data model

    Finding friends

  • Relational vs. Graph: data model

    Finding friends

    Bob's friends

    SELECT p1.Person

    FROM Person p1

    JOIN PersonFriend ON PersonFriend.FriendID = p1.ID

    JOIN Person p2 ON PersonFriend.PersonID = p2.ID

    WHERE p2.Person = 'Bob'

  • Relational vs. Graph: data model

    Finding friends

    Bob's friends-of-friends

    SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND

    FROM PersonFriend pf1

    JOIN Person p1 ON pf1.PersonID = p1.ID

    JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID

    JOIN Person p2 ON pf2.FriendID = p2.ID

    WHERE p1.Person = 'Bob' AND pf2.FriendID <> p1.ID
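
    Both relational queries can be tried against an in-memory SQLite database. The table definitions and the five sample people below are assumptions made for this sketch; only the table and column names come from the queries on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
    CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
    INSERT INTO Person VALUES
        (1, 'Bob'), (2, 'Alice'), (3, 'Zach'), (4, 'Carol'), (5, 'Dave');
    -- Bob -> Alice, Bob -> Zach, Alice -> Bob, Alice -> Carol, Zach -> Dave
    INSERT INTO PersonFriend VALUES (1, 2), (1, 3), (2, 1), (2, 4), (3, 5);
""")

FRIENDS = """
    SELECT p1.Person
    FROM Person p1
    JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
    JOIN Person p2 ON PersonFriend.PersonID = p2.ID
    WHERE p2.Person = 'Bob'
"""

FRIENDS_OF_FRIENDS = """
    SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
    FROM PersonFriend pf1
    JOIN Person p1 ON pf1.PersonID = p1.ID
    JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
    JOIN Person p2 ON pf2.FriendID = p2.ID
    WHERE p1.Person = 'Bob' AND pf2.FriendID <> p1.ID
"""

# Sets are used because SQL does not guarantee row order without ORDER BY.
friends = {row[0] for row in conn.execute(FRIENDS)}
friends_of_friends = {row[1] for row in conn.execute(FRIENDS_OF_FRIENDS)}
print(friends, friends_of_friends)
```

    Each extra level of friendship adds another pair of joins to the statement, which is the growth in complexity the relational slides point at.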

  • Relational vs. Graph: data model

    Finding friends

    Bob's friends-of-friends-of-...

    Join complexity increases with each additional depth

  • Relational model and connected data

    Relational model deals with connected data by means of join

    Join tables add complexity; they mix business data with foreign key metadata

    Foreign key constraints add additional development and maintenance overhead just to make the database work

    Things get more complex and more expensive the deeper we go into the network

  • Enter, property graph model...

    Nodes

      contain properties

    Relationships

      connect nodes

      have a start node and an end node

      always have a direction

      have a label

    Properties

      keys are strings and the values are arbitrary data types
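
    The property graph model maps naturally onto two small record types. The dataclasses, the KNOWS label, the `since` property, and the `outgoing` helper below are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    properties: dict = field(default_factory=dict)  # string keys, arbitrary values


@dataclass
class Relationship:
    start: Node  # relationships are directed: start -> end
    end: Node
    label: str
    properties: dict = field(default_factory=dict)


def outgoing(node, relationships, label):
    """Nodes reached from `node` by following relationships with `label`."""
    return [r.end for r in relationships
            if r.start is node and r.label == label]


alice = Node({"name": "Alice", "age": 32})
bob = Node({"name": "Bob", "age": 35})
relationships = [Relationship(alice, bob, "KNOWS", {"since": 2010})]

known_by_alice = [n.properties["name"]
                  for n in outgoing(alice, relationships, "KNOWS")]
known_by_bob = [n.properties["name"]
                for n in outgoing(bob, relationships, "KNOWS")]
print(known_by_alice, known_by_bob)
```

    Because a relationship has a direction, following KNOWS from Alice reaches Bob, while following it from Bob reaches nobody in this tiny graph.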

  • Property graph model

    [Figure: three nodes with properties {name: Alice, age: 32}, {name: Bob, age: 35}, and {name: James, age: 27}]

  • Finding relations is easy!

  • Advantages of property graph model

    Flexibility

      Allows us to add new nodes and new relationships without compromising the existing network or migrating data

      Original data and its intent remain intact

    Expressive power

      We can see who LOVES whom (and whether that love is requited!)

      We can see who's MARRIED_TO someone else

      We can see who is a COLLEAGUE_OF whom and who is BOSS_OF them all

    Performance

  • Relational vs. Graph: performance

    Finding friends-of-friends in a social network

    Maximum depth 5

    1 million people, each with approximately 50 friends

  • Cypher: graph query language of Neo4j

    Declarative graph pattern matching language

    SQL for graphs

    Tabular results

    Cypher is evolving steadily

    Syntax changes between releases

    Supports queries

    Including aggregation, ordering and limits

    Mutating operations in product roadmap

  • Two nodes, one relationship

    (a)-->(b)

    START a=node(*)

    MATCH (a)-->(b)

    RETURN a, b;

  • Pattern matching

    [Figure: the pattern (a)-->(b) matched at several places in a graph]

  • Two nodes, one relationship

    START a=node(*)

    MATCH (a)-[r:ACTED_IN]->(m)

    RETURN a.name, r.roles, m.title;

    [Figure: node a connected to node m by an ACTED_IN relationship]

  • Paths

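    (a)-->(b)-->(c)

    [Figure: nodes a, b, and c connected in a path]

  • Pattern matching

    [Figure: the path pattern (a)-->(b)-->(c) matched at several places in a graph]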

  • START a=node(*)

    MATCH (a)-[:ACTED_IN]->(m)

  • Constraints on properties

    START tom=node:node_auto_index(name="Tom Hanks")

    MATCH (tom)-[:ACTED_IN]->(movie)

    WHERE movie.released < 1992

    RETURN DISTINCT movie.title;

    (Movies in which Tom Hanks acted, that were released before 1992)

  • Variable length paths

    (a)-[*1..3]->(b)

    [Figure: a connected to b by paths of one, two, and three relationships]

  • Friends-of-Friends

    START keanu=node:node_auto_index(name="Keanu Reeves")

    MATCH (keanu)-[:KNOWS*2]->(fof)

    RETURN DISTINCT fof.name;
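
    What a fixed-depth or variable-length pattern asks for can be approximated with a breadth-first expansion over an adjacency map. The graph data and the `reachable` helper are assumptions for this sketch. It follows walks hop by hop, which is simpler than Cypher's path matching but gives the same answer on this small acyclic example.

```python
def reachable(graph, start, min_hops, max_hops):
    """Nodes reachable from `start` in min_hops..max_hops directed hops."""
    frontier = {start}
    found = set()
    for hops in range(1, max_hops + 1):
        # Expand one hop: every neighbour of every node in the frontier.
        frontier = {nbr for node in frontier for nbr in graph.get(node, ())}
        if hops >= min_hops:
            found |= frontier
    return found


# Hypothetical KNOWS relationships: person -> people they know.
knows = {
    "Keanu": ["Carrie", "Laurence"],
    "Carrie": ["Hugo"],
    "Laurence": ["Hugo"],
    "Hugo": ["Lana"],
}

friends_of_friends = reachable(knows, "Keanu", 2, 2)  # like [:KNOWS*2]
within_three = reachable(knows, "Keanu", 1, 3)        # like [*1..3]
print(friends_of_friends, within_three)
```

    Going one level deeper only changes a number here, whereas the relational version needs additional joins for every extra level.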

  • NoSQL summary

    NoSQL databases reject:

      Overhead of ACID transactions

      Complexity of SQL

      Burden of up-front schema design

    Programmer responsible for:

      Determining the consistency level

      Navigating access path

  • Should I be using NoSQL Databases?

    NoSQL data storage systems make sense for applications that need to deal with very large semi-structured data

      Log analysis

      Social networking feeds

    Most of us work on organizational databases, which are not that large and have low update/query rates

      Regular relational databases are the correct solution for such applications

  • References

    I. Robinson, J. Webber, E. Eifrem. Graph Databases. O'Reilly, 2013

    Neo4J intro tutorial.

    NoSQL. Dr. Kristie Hawkey. Dalhousie University

    NoSQL. Perry Hoekstra. Perficient, Inc.

    NoSQL. Akmal Chaudhri

    Massively Parallel Cloud Data Storage Systems. S. Sudarshan. IIT Bombay

    NoSQL Theory, Implementations, an introduction. Firat Atagun

    http://www.datastax.com/docs/1.0/ddl/column_family

    http://redis.io/topics/twitter-clone

    REDIS. REmote DIctionary Server. Chris Keith and James Tavares

    Advanced Topics in Database Management. Stan Zdonik. Brown University

    An introduction to MongoDB. Rácz Gábor

    MongoDB. Mohamed Zahran. NYU

    Handling an 1,800 Percent Traffic Spike During Super Bowl XLVI. Jim Houska and Jim Houska

  • Thanks

  • CRUD

    Create

      db.collection.insert( )

      db.collection.save( )

      db.collection.update( , , { upsert: true } )

    Read

      db.collection.find( , )

      db.collection.findOne( , )

    Update

      db.collection.update( , , )

    Delete

      db.collection.remove( , )

  • mongo>

    Actors database

    Insert records

    db.actors.insert({ first: 'matthew', last: 'setter', dob: '21/04/1978', gender: 'm', hair_colour: 'brown', occupation: 'developer', nationality: 'australian' });

    db.actors.insert({ first: 'james', last: 'caan', dob: '26/03/1940', gender: 'm', hair_colour: 'brown', occupation: 'actor', nationality: 'american' });

    . . . . .

  • mongo>

    Actors database

    Query: show all actors

    > db.actors.find()

    Query: show all actors that are female

    > db.actors.find({gender: 'f'});

    { "_id" : ObjectId("546e5363440266a4f135a37a"), "first" : "jamie lee", "last" : "curtis", "dob" : "22/11/1958", "gender" : "f", "hair_colour" : "brown", "occupation" : "actor", "nationality" : "american" }

    { "_id" : ObjectId("546e5363440266a4f135a37c"), "first" : "judi", "last" : "dench", "dob" : "09/12/1934", "gender" : "f", "hair_colour" : "white", "occupation" : "actress", "nationality" : "english" }

    Query: show all male actors who are English

    > db.actors.find({gender: 'm', $or: [{nationality: 'english'}]});

    { "_id" : ObjectId("546e5363440266a4f135a37b"), "first" : "michael", "last" : "caine", "dob" : "14/03/1933", "gender" : "m", "hair_colour" : "brown", "occupation" : "actor", "nationality" : "english" }

  • mongo>

    Actors database

    Update: update the record for James Caan to show that his hair is grey

    > db.actors.update({first: 'james', last: 'caan'}, {$set: {hair_colour: 'grey'}});

    > db.actors.find({first: 'james', last: 'caan'});

    { "_id" : ObjectId("546e5363440266a4f135a377"), "first" : "james", "last" : "caan", "dob" : "26/03/1940", "gender" : "m", "hair_colour" : "grey", "occupation" : "actor", "nationality" : "american" }

    Delete

    > db.actors.remove({first: 'james', last: 'caan'});

  • Tech Trend: Connectedness

    Information connectivity

    Text Documents

    Hypertext

    Feeds

    Blogs

    Wikis

    UGC

    Tagging

    RDFa

    Social networks

  • Consistent Hashing

    Partition using consistent hashing

    Keys hash to a point on a fixed circular space

    Ring is partitioned into a set of ordered slots and servers and keys hashed over these slots

    Nodes take positions on the circle.

    A, B, and D exist.

    B responsible for AB range.

    D responsible for BD range.

    A responsible for DA range.

    C joins.

    B, D split ranges.

    C gets BC from D.

    [Figure: hash ring with points labeled A, H, D, B, M, V, S, R, C]