big data nosql 1017

Copyright © 2011 LOGTEL

NoSQL (big data)

Samuel [email protected]


Agenda

Big Data / No SQL – technology aspect How big is BIG ?

The motivation behind NoSQL

CAP theorem

Partitions / fragmentation

The different NoSQL models

Key Value

Column-Based

Document store

Big Table

Graph

The NoSQL way of thinking (using graphs)

Big Data - Applicative (what can we do with it)

2


It’s a hype (!)

3


Big Data Definition

No single standard definition…

“Big Data” is data whose scale, diversity, and

complexity require new architecture,

techniques, algorithms, and analytics to

manage it and extract value and hidden

knowledge from it…

4


NoSQL humor

5

http://geekandpoke.typepad.com/

http://geekandpoke.typepad.com/


10GB ? 10TB ? 10 PB ?

How big is BIG ?

6


The 4 V’s

7

Characteristics of Big Data:

1-Scale (Volume)

• Data Volume

• 44x increase from 2009 2020

• From 0.8 zettabytes to 35zb

• Data volume is increasing exponentially

8

Exponential increase in

collected/generated data


The 4 V’s

9


2-Complexity (Varity)

• Various formats, types, and structures

• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…

• Static data vs. streaming data

• A single application can be generating/collecting many types of data

10


The 4 V’s

11


3-Speed (Velocity)

• Data is begin generated fast and need to be processed

fast

• Online Data Analytics

• Late decisions missing opportunities

• Examples

• E-Promotions: Based on your current location, your purchase

history, what you like send promotions right now for store next to

you

• Healthcare monitoring: sensors monitoring your activities and body

any abnormal measurements require immediate reaction12


The 4 V’s

13

Who’s Generating Big Data

Social media and networks

(all of us are generating data)Scientific instruments

(collecting all sorts of data)

Mobile devices

(tracking all objects all the time)

Sensor technology and

networks

(measuring all kinds of data)

• The progress and innovation is no longer hindered by the ability to collect data

• But, by the ability to manage, analyze, summarize, visualize, and discover

knowledge from the collected data in a timely manner and in a scalable

fashion

14

The Model Has

Changed…• The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming

data

15


NoSQL motivation

16


Why Now?

Explosion of social media sites (Facebook, Twitter) with large data needs

Explosion of storage needs in large web sites such as Google, Yahoo

Much of the data is not files

Rise of cloud-based solutions such as Amazon S3 (simple storage solution)

Shift to dynamically-typed data with frequent schema changes

Open-source community


Parallel Databases and Data Stores

Relational Databases – mainstay of business

Web-based applications caused spikes

Especially true for public-facing e-Commerce sites

Many application servers, one database

Easy to parallelize application servers to 1000s of servers, harder to parallelize databases to same scale

First solution: memcache (in-memory) or other caching mechanisms to reduce database access


Scaling Up

What if the dataset is huge, and very high number of transactions per second

Use multiple servers to host database

‘scaling out’ or ‘horizontal scaling’

Parallel databases have been around for a while

But expensive, and designed for decision support

not OLTP (Online Transaction Processing)


Scaling RDBMS – Master/Slave

Master-Slave

All writes are written to the master. All reads performed against the replicated slave databases

Good for mostly read, very few update applications

Critical reads may be incorrect as writes may not have been propagated down

Large data sets can pose problems as master needs to duplicate data to slaves


Scaling RDBMS - Partitioning

Partitioning

Divide the database across many machines

E.g. hash or range partitioning

Handled transparently by parallel databases but they are expensive

“Sharding” Divide data amongst many cheap databases

(MySQL/PostgreSQL)

Manage parallel access in the application

Scales well for both reads and writes

Not transparent, application needs to be partition-aware


What is NoSQL?

Stands for Not Only SQL

Class of non-relational data storage systems

E.g. BigTable, Dynamo, PNUTS/Sherpa, ..

Usually do not require a fixed table schema nor do

they use the concept of joins

All NoSQL offerings relax one or more of the ACID

properties (will talk about the CAP theorem)

Not a backlash/rebellion against RDBMS

SQL is a rich query language that cannot be rivaled

by the current list of NoSQL offerings


NoSQL Data Storage: Classification

NoSQL solutions fall into 4 major areas:

Uninterpreted key/value or ‘the big hash table’. Amazon S3 (Dynamo)

Voldemort

Scalaris

Column-based, with interpreted keys Cassandra, BigTable, HBase, Sherpa/PNuts

Others CouchDB (document-based)

Neo4J (graph-based)


NoSQL ecosystem

24


25


Big Data Landscape 2015

26


Architecture

28


MapReduce

29


ACID

Atomic: Either the whole process of a transaction is done or none is.

Consistency: Database constraints (application-specific) are preserved.

Isolation: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see on another’s transaction while “in flight”.)

Durability: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)


CAP Theorem

Three properties of a system

Consistency (all copies have same value)

Availability (system can run even if parts have failed)

Partitions (network can break into two or more parts, each

with active systems that can’t talk to other parts)

Brewer’s CAP “Theorem”: You can have at most two

of these three properties for any system

Very large systems will partition at some point

Choose one of consistency or availability

Traditional database choose consistency

Most Web applications choose availability

Except for specific parts such as order processing


The reminder

Dial 1-800-remind ......

Available , Consist – not portioned

Not available ...

Available , Partitioned – not Consistent

Consistent, Partitioned – not Available

32


The proof…

33


CAP Theorem

Three properties of a system

Consistency (all copies have same value)

Availability (system can run even if parts have failed)

Partitions (network can break into two or more parts, each

with active systems that can’t talk to other parts)

Brewer’s CAP “Theorem”: You can have at most two

of these three properties for any system

Very large systems will partition at some point

Choose one of consistency or availability

Traditional database choose consistency

Most Web applications choose availability

Except for specific parts such as order processing


Availability

Traditionally, thought of as the server/process available five 9’s (99.999 %).

However, for large node system, at almost any point in time there’s a good chance that a node is either down or there is a network disruption among the nodes.

Want a system that is resilient in the face of network disruption


Eventual Consistency

When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent

For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service

Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID

Soft state: copies of a data item may be inconsistent

Eventually Consistent – copies becomes consistent at some later time if there are no more updates to that data item

BASE in Cassandra

Query

Closest replica

Cassandra Cluster

Replica A

Result

Replica B Replica C

Digest QueryDigest Response Digest Response

Result

Client

Read repair if digests differ


Common Advantages

Cheap, easy to implement (open source)

Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned

When data is written, the latest version is on at least one node and then replicated to other nodes

Down nodes easily replaced

No single point of failure

Easy to distribute

Don't require a schema


What am I giving up?

joins

group by

order by

ACID transactions

SQL as a sometimes frustrating but still powerful query language

easy integration with other applications that support SQL


Distributed Key-Value Data Stores

Distributed key-value data storage systems allow key-value pairs to be stored (and retrieved on key) in a massively parallel system E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon

Dynamo, ..

Partitioning, high availability etc. completely transparent to application

Sharding systems and key-value stores don’t support many relational features

No join operations (except within partition)

No referential integrity constraints across partitions

etc.


Flexible Data Model

Rockets

Key Value

1

2

3

Name Value

tooninventoryQtybrakes

Rocket-Powered Roller SkatesReady, Set, Zoom5false

name

Name Value

tooninventoryQtybrakes

Little Giant Do-It-Yourself Rocket-Sled KitBeep Prepared4false

Name Value

tooninventoryQtywheels

Acme Jet Propelled UnicycleHot Rod and Reel11

name

name


HBase

46


Google

Tables are sorted by Row

Table schema only define its column families . Each family consists of any number of columns

Each column consists of any number of versions

Columns only exist when inserted, NULLs are free.

Columns within a family are sorted and stored together

Everything except table names are byte[]

(Row, Family: Column, Timestamp) Value

Row key

Column Family

valueTimeStamp


Splunk – Document base

48


Splunk – log analysis

49


PNUTS Data Storage Architecture


01

1/2

F

E

D

C

B

A N=3

h(key2)

h(key1)

52

Partitioning And Replication


Should I be using NoSQL Databases?

For almost all of us, regular relational databases are THE correct solution

NoSQL Data storage systems makes sense for applications that need to deal with very large semi-structured data

Log Analysis

Social Networking Feeds


Graph in practice(thanks to Luca Garulli)

54


...how to think «graphically» with

one of the most common domains

in the enterprise world:

The old-classic CRM* domain

* today in 99% of the cases a RDBMS is used

Lets take a real example - CRM


Every developer knows

the Relational Model(?),

but who knows the

Graph one?


Back to school:

Graph Theory crash course


SamNoSQL

lectureLikes

Basic Graph


Sam

name: Samuel

surname: Dratwa

company: SADOT

NoSQL

Lecture

editions: [Comverse, Tel-Aviv]

Likes

since: 2012

Vertices and Edges can have properties



Vertices are directed

* https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model

Property Graph Model*


SamNoSQL

lecture

An Edge connects 2 vertices: use multiple

vertices to represents 1-N and N-M relationships

Edges - Arcs


Likes

Avital

Sam

FriendOf

NoSQL

lecture

Doron

Joins


Compliments,

this is your diploma in

«Graph Theory»


Customer Address

Order Stock

Registry system

Order system

Domain: minimal CRM


Stock

Registry system

Order

Order system

Customer Address

How doesRelational DBMS

manage relationships?


JOIN Customer.Address -> Address.Id

Customer

Id Name Address

10 Samuel 34

11 Katja 44

34 Sylvia 54

56 Mark 66

88 Steve 68

Address

Id Location

34 Rome, London

44 Cologne

54 Rome

66 New Mexico

68 Palo Alto

Relational World: 1-1 Relationships


Inverse JOIN Address.Customer -> Customer.Id

Customer

Id Name

10 Samuel

11 Katja

34 Sylvia

56 Mark

88 Steve

Address

Id Customer Location

24 10 Rome

33 10 London

44 34 Rome

66 11 Cologne

68 88 Palo Alto

Relational World: 1-N Relationships


Additional table with 2 JOINs

(1) CustomerAddress.Id -> Customer.Id and

(2) CustomerAddress.Address -> Address.Id

Customer

Id Name

10 Samuel

11 Katja

34 Sylvia

56 Mark

88 Steve

Address

Id Location

24 Rome

33 London

44 Rome

66 Cologne

68 Palo Alto

CustomerAddress

Id Address

10 24

10 33

34 24

Relational World: N-M Relationships


What’s wrong with the

Relational Model?


These are all JOINs executedeverytime you traverse a

relationship

Customer

Id Name

10 Samuel

11 Katja

34 Sylvia

56 Mark

88 Steve

Address

Id Location

24 Rome

33 London

44 Rome

66 Cologne

68 Palo Alto


relationship


relationship


relationship!

CustomerAddress

Id Address

10 24

10 33

34 24

The JOIN is the evil!


Why not JOIN

• A JOIN means searching for a key in another table

• The first rule to improve performance is

indexing all the keys

• Index speeds up searches but slows down insert, updates and deletes

• So in the best case a JOIN is a lookup into in an

index

• This is done per single join!

• If you traverse hundreds of relationships

you’re executing hundreds of JOINs


Index Lookup

it is really that fast?


A-Z

A-L M-Z

A-L

A-D E-L

M-Z

M-R S-Z

A-D

A-B C-D

E-L

E-G H-L

E-G

E-F G

H-L

H-J K-L

Index algorithms are all similar and based on

balanced trees

Index Lookup: how does it works?

Think to an Address Book

where we have to find Samuel’s phone number


A-Z

A-L M-Z

A-L

A-D E-L

M-Z

M-R S-Z

A-D

A-B C-D

E-L

E-G H-L

E-G

E-F G

H-L

H-J K-L

Found! Each lookup takes X steps, where Xgrows with the

index size!


An index lookup is executed

for each JOIN

Querying more tables can easily

produce millions of JOINs/Lookups!

Here the rule: more entries

= more lookup steps = slower JOIN


Is there a better way to

manage relationships?


How does GraphDB manage

index-free relationships?


an Open Source (Apache 2)

document-graph NoSQL dbms

supports: transactions, extended-SQL,Multi-Master replication, etc


SamLives

out : [#14:54]

label : ‘Customer’

name : ‘Sam’

out: [#13:35]

in: [#13:100]

Label : ‘Lives’

RID =

#13:35

RID =

#14:54

RID =

#13:100

in: [#14:54]

label = ‘Address’

name = ‘Rome’

The Record ID (RID)

is a Physical position

Rome

OrientDB: traverse a relationship


GraphDB handles relationships as a

physical LINK to the record

assigned when the edge is created

on the other side

RDBMS computes the

relationship every time you query a database

Is not that crazy?!


This means jumping from a

O(log N) algorithm to a near O(1)

traversing cost is not more affected

by database size!

This is huge in the BigData age


$luca> cd bin

$luca> ./console.sh

OrientDB console v.1.2.0-SNAPSHOT (www.orientdb.org)

Type 'help' to display all the commands supported.

orientdb> create vertex V set name = ‘Sam’, label = ‘Customer’

Created vertex #13:35 in 0.03 secs

orientdb> create vertex V set name = ‘Rome’, label = ‘Address’

Created vertex #13:100 in 0.02 secs

orientdb> create edge E from #13:35 to #13:100 set label = ‘Lives’

Created edge #14:54 in 0.02 secs

Create the graph in SQL


OGraphDatabase graph = new OGraphDatabase("local:/tmp/db/graph”);

ODocument sam= graph.createVertex();

sam.field(“name", “Sam");

sam.field(“label", “Customer");

ODocument rome = graph.createVertex();

rome.field(“name", “Rome”);

rome.field(“label", “Address”);

ODocument edge = graph.createEdge(sam, rome).field(“label”, “Lives”);

edge.save();

graph.close();

Create the graph in Java


orientdb> select in[label=‘Lives’].out from V where

label = ‘Address’ and name = ‘Rome’

---+--------+--------------------+--------------------+--------------------+

#| REC ID |label |out |in |

---+--------+--------------------+--------------------+--------------------+

0| 13:35|Sam |[#14:54] | |

---+--------+--------------------+--------------------+--------------------+

1 item(s) found. Query executed in 0.007 sec(s).

orientdb> select * from V where label = ‘Address’ AND

in[label=‘Lives’].size() > 0

---+--------+--------------------+--------------------+--------------------+

#| REC ID |label |out |in |

---+--------+--------------------+--------------------+--------------------+

0| 13:100| Rome | |[#14:54] |

---+--------+--------------------+--------------------+--------------------+

1 item(s) found. Query executed in 0.007 sec(s).

Query the graph in SQL


OGraphDatabase graph = new

OGraphDatabase("local:/tmp/db/graph”);

// GET ALL THE THE CUSTOMER FROM ROME, ITALY

List<ODocument> result = graph.command( new OCommandSQL (

“select in[label=‘Lives’].out from V where label = ‘Address’

and name = ?”)

).execute( “Rome”);

for( ODocument v : result ) {

System.out.println(“Result: “ + v.field(“label”) );

}

-----------------------------------------------------------------------------------

----Result: Sam

Query the graph in Java


Query vs. traversal

Once you’ve a well connected database in the form of a Super Graph you can cross records instead of query them!

All you need is some root vertices where to start to traverse


Customers

Sam John Sylvia

Order

2332

Order

8834

White

Soap

StocksSpecial

Customers

This is a

root

vertex

Query vs. traversal


Supposing that the root node #30:0 links all the

Customer vertices

Get all the customers:

orientdb> select out.in from #30:0

Get all the customers who bought at least one ‘White Soap’

product:

orientdb> select * from ( select out.in from #30:0) where

out.in.out[label=‘Bought’].in.name = ‘White Soap’

Customers

#30:0

Query the graph in SQL


Demo time!

http://studio.nuvolabase.com/db/free/demo/GratefulDeadConcerts/studio/?user=reader&database=/db/free/demo/GratefulDeadConcerts&password=reader

http://studio.nuvolabase.com/db/free/demo/GratefulDeadConcerts/studio/?user=reader&database=/db/free/demo/GratefulDeadConcerts&password=reader


Should I be using NoSQL Databases?

For almost all of us, regular relational databases are THE correct solution

NoSQL Data storage systems makes sense for applications that need to deal with very verylarge semi-structured data

Log Analysis

Social Networking Feeds


WHAT CAN WE DO WITH BIG DATA ?

90

What’s driving Big Data

- Ad-hoc querying and reporting

- Data mining techniques

- Structured data, typical sources

- Small to mid-size datasets

- Optimizations and predictive analytics

- Complex statistical analysis

- All types of data, and many sources

- Very large datasets

- More of a real-time

91

Value of Big Data Analytics

• Big data is more real-time in

nature than traditional DW

applications

• Traditional DW architectures (e.g.

Exadata, Teradata) are not well-

suited for big data apps

• Shared nothing, massively parallel

processing, scale out

architectures are well-suited for

big data apps

92


USN with related technical areas

94

What is collecting all this data?Web Browsers Search Engines

Microsoft’s

Internet Explorer

Mozilla’s FireFox

Google’s Chrome

Apple’s Safari

Google’s

Microsoft’s

Yahoo’s

IAC Search’s

Time-Warner’s AOL

Explorer

(Non-profit foundation,

used to be Netscape)

What is collecting all this data?

Smartphones & Apps

Apple’s iPhone

(Apple O/S)

Samsung, HTC.

Nokia, Motorola

(Android O/S)

RIM Corp’s Blackberry

(BlackBerry O/S)

Tablet Computers & Apps

Apple’s iPad

Samsung’s Galaxy

Amazon’s Kindle Fire


Hospitals & Other Medical Systems Banking & Phone Systems

Can you hear me now?

(Heh heh heh!)

Pharmacies

Laboratories

Imaging Centers

Emergency Medical Services (EMS)

Hospital Information Systems

Doc-in-a-Box

Electronic Medical Records

Blood Banks

Birth & Death Records


A real pain in the apps! What are they collecting?• Restaurant reservations

(Open Table)

• Weather in L.A. in 3 days (Weather+)

• Side effects of medications (MedWatcher)

• 3-star hotels in New Orleans (Priceline)

• Which PC should I buy and where (PriceCheck)

Big Brother Needs Big DataIn March 2012, the Obama Administration announced the Big Data Research and Development Initiative, $200 million in new R&D investments, which will explore how Big Data could be used to address important problems facing the government. The initiative was composed of 84 different Big Data programs spread across six departments.

http://tinyurl.com/85oytkj

The U.S. Federal Government owns six of the ten most powerful supercomputers in the world.

http://tinyurl.com/85oytkj

How Companies Like Use BigData To Make You Love Them

Last month, I talked to Amazon customer service about my malfunctioning Kindle, and it was great. Thirty seconds after putting in a service request on Amazon’s website, my phone rang, and the woman on the other end--let’s call her Barbara--greeted me by name and said, "I understand that you have a problem with your Kindle." We resolved my problem in under two minutes, we got to skip the part where I carefully spell out my last name and address, and she didn’t try to upsell me on anything. After nearly a decade of ordering stuff from Amazon, I never loved the company as much as I did at that moment.

The fact is, Amazon has been collecting my information for years--not just addresses and payment information but the identity of everything I’ve ever bought or even looked at. And while dozens of other companies do that, too, Amazon’s doing something remarkable with theirs. They’re using that data to build our relationship.

Article by Sean Madden, May 2012, an expert in service design and innovation strategy.

How Can You Avoid Big Data?

• Pay cash for everything!

• Never go online!

• Don’t use a telephone!

• Don’t use Kroger or Harris Teeter cards!

• Don’t fill any prescriptions!

• Never leave your house!

Key concept of Big Data

• Store everything

• Don’t delete anything

• Schema is a bottleneck

• Think always on parallel

• Remember the CAP theorem

Thank You!!!

…and please fill the evaluation form

103

big data nosql 1017

Data & Analytics