big data nosql 1017
TRANSCRIPT
Copyright © 2011 LOGTEL
Agenda
Big Data / No SQL – technology aspect How big is BIG ?
The motivation behind NoSQL
CAP theorem
Partitions / fragmentation
The different NoSQL models
Key Value
Column-Based
Document store
Big Table
Graph
The NoSQL way of thinking (using graphs)
Big Data - Applicative (what can we do with it)
2
Copyright © 2011 LOGTEL
It’s a hype (!)
3
Copyright © 2011 LOGTEL
Big Data Definition
No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
4
Copyright © 2011 LOGTEL
NoSQL humor
5
http://geekandpoke.typepad.com/
Copyright © 2011 LOGTEL
10GB ? 10TB ? 10 PB ?
How big is BIG ?
6
Copyright © 2011 LOGTEL
The 4 V’s
7
Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 2020
• From 0.8 zettabytes to 35zb
• Data volume is increasing exponentially
8
Exponential increase in
collected/generated data
Copyright © 2011 LOGTEL
The 4 V’s
9
Characteristics of Big Data:
2-Complexity (Varity)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
10
Copyright © 2011 LOGTEL
The 4 V’s
11
Characteristics of Big Data:
3-Speed (Velocity)
• Data is begin generated fast and need to be processed
fast
• Online Data Analytics
• Late decisions missing opportunities
• Examples
• E-Promotions: Based on your current location, your purchase
history, what you like send promotions right now for store next to
you
• Healthcare monitoring: sensors monitoring your activities and body
any abnormal measurements require immediate reaction12
Copyright © 2011 LOGTEL
The 4 V’s
13
Who’s Generating Big Data
Social media and networks
(all of us are generating data)Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and
networks
(measuring all kinds of data)
• The progress and innovation is no longer hindered by the ability to collect data
• But, by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable
fashion
14
The Model Has
Changed…• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming
data
15
Copyright © 2011 LOGTEL
NoSQL motivation
16
Copyright © 2011 LOGTEL
Why Now?
Explosion of social media sites (Facebook, Twitter) with large data needs
Explosion of storage needs in large web sites such as Google, Yahoo
Much of the data is not files
Rise of cloud-based solutions such as Amazon S3 (simple storage solution)
Shift to dynamically-typed data with frequent schema changes
Open-source community
Copyright © 2011 LOGTEL
Parallel Databases and Data Stores
Relational Databases – mainstay of business
Web-based applications caused spikes
Especially true for public-facing e-Commerce sites
Many application servers, one database
Easy to parallelize application servers to 1000s of servers, harder to parallelize databases to same scale
First solution: memcache (in-memory) or other caching mechanisms to reduce database access
Copyright © 2011 LOGTEL
Scaling Up
What if the dataset is huge, and very high number of transactions per second
Use multiple servers to host database
‘scaling out’ or ‘horizontal scaling’
Parallel databases have been around for a while
But expensive, and designed for decision support
not OLTP (Online Transaction Processing)
Copyright © 2011 LOGTEL
Scaling RDBMS – Master/Slave
Master-Slave
All writes are written to the master. All reads performed against the replicated slave databases
Good for mostly read, very few update applications
Critical reads may be incorrect as writes may not have been propagated down
Large data sets can pose problems as master needs to duplicate data to slaves
Copyright © 2011 LOGTEL
Scaling RDBMS - Partitioning
Partitioning
Divide the database across many machines
E.g. hash or range partitioning
Handled transparently by parallel databases but they are expensive
“Sharding” Divide data amongst many cheap databases
(MySQL/PostgreSQL)
Manage parallel access in the application
Scales well for both reads and writes
Not transparent, application needs to be partition-aware
Copyright © 2011 LOGTEL
What is NoSQL?
Stands for Not Only SQL
Class of non-relational data storage systems
E.g. BigTable, Dynamo, PNUTS/Sherpa, ..
Usually do not require a fixed table schema nor do
they use the concept of joins
All NoSQL offerings relax one or more of the ACID
properties (will talk about the CAP theorem)
Not a backlash/rebellion against RDBMS
SQL is a rich query language that cannot be rivaled
by the current list of NoSQL offerings
Copyright © 2011 LOGTEL
NoSQL Data Storage: Classification
NoSQL solutions fall into 4 major areas:
Uninterpreted key/value or ‘the big hash table’. Amazon S3 (Dynamo)
Voldemort
Scalaris
Column-based, with interpreted keys Cassandra, BigTable, HBase, Sherpa/PNuts
Others CouchDB (document-based)
Neo4J (graph-based)
Copyright © 2011 LOGTEL
NoSQL ecosystem
24
Copyright © 2011 LOGTEL
25
Copyright © 2011 LOGTEL
Big Data Landscape 2015
26
Copyright © 2011 LOGTEL 27
Copyright © 2011 LOGTEL
Architecture
28
Copyright © 2011 LOGTEL
MapReduce
29
Copyright © 2011 LOGTEL
ACID
Atomic: Either the whole process of a transaction is done or none is.
Consistency: Database constraints (application-specific) are preserved.
Isolation: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see on another’s transaction while “in flight”.)
Durability: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)
Copyright © 2011 LOGTEL
CAP Theorem
Three properties of a system
Consistency (all copies have same value)
Availability (system can run even if parts have failed)
Partitions (network can break into two or more parts, each
with active systems that can’t talk to other parts)
Brewer’s CAP “Theorem”: You can have at most two
of these three properties for any system
Very large systems will partition at some point
Choose one of consistency or availability
Traditional database choose consistency
Most Web applications choose availability
Except for specific parts such as order processing
Copyright © 2011 LOGTEL
The reminder
Dial 1-800-remind ......
Available , Consist – not portioned
Not available ...
Available , Partitioned – not Consistent
Consistent, Partitioned – not Available
32
Copyright © 2011 LOGTEL
The proof…
33
Copyright © 2011 LOGTEL
CAP Theorem
Three properties of a system
Consistency (all copies have same value)
Availability (system can run even if parts have failed)
Partitions (network can break into two or more parts, each
with active systems that can’t talk to other parts)
Brewer’s CAP “Theorem”: You can have at most two
of these three properties for any system
Very large systems will partition at some point
Choose one of consistency or availability
Traditional database choose consistency
Most Web applications choose availability
Except for specific parts such as order processing
Copyright © 2011 LOGTEL
Availability
Traditionally, thought of as the server/process available five 9’s (99.999 %).
However, for large node system, at almost any point in time there’s a good chance that a node is either down or there is a network disruption among the nodes.
Want a system that is resilient in the face of network disruption
Copyright © 2011 LOGTEL
Eventual Consistency
When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
Soft state: copies of a data item may be inconsistent
Eventually Consistent – copies becomes consistent at some later time if there are no more updates to that data item
BASE in Cassandra
Query
Closest replica
Cassandra Cluster
Replica A
Result
Replica B Replica C
Digest QueryDigest Response Digest Response
Result
Client
Read repair if digests differ
Copyright © 2011 LOGTEL
Common Advantages
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
When data is written, the latest version is on at least one node and then replicated to other nodes
Down nodes easily replaced
No single point of failure
Easy to distribute
Don't require a schema
Copyright © 2011 LOGTEL
What am I giving up?
joins
group by
order by
ACID transactions
SQL as a sometimes frustrating but still powerful query language
easy integration with other applications that support SQL
Copyright © 2011 LOGTEL
Distributed Key-Value Data Stores
Distributed key-value data storage systems allow key-value pairs to be stored (and retrieved on key) in a massively parallel system E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon
Dynamo, ..
Partitioning, high availability etc. completely transparent to application
Sharding systems and key-value stores don’t support many relational features
No join operations (except within partition)
No referential integrity constraints across partitions
etc.
Copyright © 2011 LOGTEL
Flexible Data Model
Rockets
Key Value
1
2
3
Name Value
tooninventoryQtybrakes
Rocket-Powered Roller SkatesReady, Set, Zoom5false
name
Name Value
tooninventoryQtybrakes
Little Giant Do-It-Yourself Rocket-Sled KitBeep Prepared4false
Name Value
tooninventoryQtywheels
Acme Jet Propelled UnicycleHot Rod and Reel11
name
name
Copyright © 2011 LOGTEL
HBase
46
Copyright © 2011 LOGTEL
Tables are sorted by Row
Table schema only define its column families . Each family consists of any number of columns
Each column consists of any number of versions
Columns only exist when inserted, NULLs are free.
Columns within a family are sorted and stored together
Everything except table names are byte[]
(Row, Family: Column, Timestamp) Value
Row key
Column Family
valueTimeStamp
Copyright © 2011 LOGTEL
Splunk – Document base
48
Copyright © 2011 LOGTEL
Splunk – log analysis
49
Copyright © 2011 LOGTEL
PNUTS Data Storage Architecture
Copyright © 2011 LOGTEL
01
1/2
F
E
D
C
B
A N=3
h(key2)
h(key1)
52
Partitioning And Replication
Copyright © 2011 LOGTEL
Should I be using NoSQL Databases?
For almost all of us, regular relational databases are THE correct solution
NoSQL Data storage systems makes sense for applications that need to deal with very large semi-structured data
Log Analysis
Social Networking Feeds
Copyright © 2011 LOGTEL
Graph in practice(thanks to Luca Garulli)
54
Copyright © 2011 LOGTEL
...how to think «graphically» with
one of the most common domains
in the enterprise world:
The old-classic CRM* domain
* today in 99% of the cases a RDBMS is used
Lets take a real example - CRM
Copyright © 2011 LOGTEL
Every developer knows
the Relational Model(?),
but who knows the
Graph one?
Copyright © 2011 LOGTEL
Back to school:
Graph Theory crash course
Copyright © 2011 LOGTEL
SamNoSQL
lectureLikes
Basic Graph
Copyright © 2011 LOGTEL
Sam
name: Samuel
surname: Dratwa
company: SADOT
NoSQL
Lecture
editions: [Comverse, Tel-Aviv]
Likes
since: 2012
Vertices and Edges can have properties
Vertices and Edges can have properties
Vertices and Edges can have properties
Vertices are directed
* https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model
Property Graph Model*
Copyright © 2011 LOGTEL
SamNoSQL
lecture
An Edge connects 2 vertices: use multiple
vertices to represents 1-N and N-M relationships
Edges - Arcs
Copyright © 2011 LOGTEL
Likes
Avital
Sam
FriendOf
NoSQL
lecture
Doron
Joins
Copyright © 2011 LOGTEL
Compliments,
this is your diploma in
«Graph Theory»
Copyright © 2011 LOGTEL
Customer Address
Order Stock
Registry system
Order system
Domain: minimal CRM
Copyright © 2011 LOGTEL
Stock
Registry system
Order
Order system
Customer Address
How doesRelational DBMS
manage relationships?
Copyright © 2011 LOGTEL
JOIN Customer.Address -> Address.Id
Customer
Id Name Address
10 Samuel 34
11 Katja 44
34 Sylvia 54
56 Mark 66
88 Steve 68
Address
Id Location
34 Rome, London
44 Cologne
54 Rome
66 New Mexico
68 Palo Alto
Relational World: 1-1 Relationships
Copyright © 2011 LOGTEL
Inverse JOIN Address.Customer -> Customer.Id
Customer
Id Name
10 Samuel
11 Katja
34 Sylvia
56 Mark
88 Steve
Address
Id Customer Location
24 10 Rome
33 10 London
44 34 Rome
66 11 Cologne
68 88 Palo Alto
Relational World: 1-N Relationships
Copyright © 2011 LOGTEL
Additional table with 2 JOINs
(1) CustomerAddress.Id -> Customer.Id and
(2) CustomerAddress.Address -> Address.Id
Customer
Id Name
10 Samuel
11 Katja
34 Sylvia
56 Mark
88 Steve
Address
Id Location
24 Rome
33 London
44 Rome
66 Cologne
68 Palo Alto
CustomerAddress
Id Address
10 24
10 33
34 24
Relational World: N-M Relationships
Copyright © 2011 LOGTEL
What’s wrong with the
Relational Model?
Copyright © 2011 LOGTEL
These are all JOINs executedeverytime you traverse a
relationship
Customer
Id Name
10 Samuel
11 Katja
34 Sylvia
56 Mark
88 Steve
Address
Id Location
24 Rome
33 London
44 Rome
66 Cologne
68 Palo Alto
These are all JOINs executedeverytime you traverse a
relationship
These are all JOINs executedeverytime you traverse a
relationship
These are all JOINs executedeverytime you traverse a
relationship!
CustomerAddress
Id Address
10 24
10 33
34 24
The JOIN is the evil!
Copyright © 2011 LOGTEL
Why not JOIN
• A JOIN means searching for a key in another table
• The first rule to improve performance is
indexing all the keys
• Index speeds up searches but slows down insert, updates and deletes
• So in the best case a JOIN is a lookup into in an
index
• This is done per single join!
• If you traverse hundreds of relationships
you’re executing hundreds of JOINs
Copyright © 2011 LOGTEL
Index Lookup
it is really that fast?
Copyright © 2011 LOGTEL
A-Z
A-L M-Z
A-L
A-D E-L
M-Z
M-R S-Z
A-D
A-B C-D
E-L
E-G H-L
E-G
E-F G
H-L
H-J K-L
Index algorithms are all similar and based on
balanced trees
Index Lookup: how does it works?
Think to an Address Book
where we have to find Samuel’s phone number
Copyright © 2011 LOGTEL
A-Z
A-L M-Z
A-L
A-D E-L
M-Z
M-R S-Z
A-D
A-B C-D
E-L
E-G H-L
E-G
E-F G
H-L
H-J K-L
Found! Each lookup takes X steps, where Xgrows with the
index size!
Copyright © 2011 LOGTEL
An index lookup is executed
for each JOIN
Querying more tables can easily
produce millions of JOINs/Lookups!
Here the rule: more entries
= more lookup steps = slower JOIN
Copyright © 2011 LOGTEL
Is there a better way to
manage relationships?
Copyright © 2011 LOGTEL
How does GraphDB manage
index-free relationships?
Copyright © 2011 LOGTEL
an Open Source (Apache 2)
document-graph NoSQL dbms
supports: transactions, extended-SQL,Multi-Master replication, etc
Copyright © 2011 LOGTEL
SamLives
out : [#14:54]
label : ‘Customer’
name : ‘Sam’
out: [#13:35]
in: [#13:100]
Label : ‘Lives’
RID =
#13:35
RID =
#14:54
RID =
#13:100
in: [#14:54]
label = ‘Address’
name = ‘Rome’
The Record ID (RID)
is a Physical position
Rome
OrientDB: traverse a relationship
Copyright © 2011 LOGTEL
GraphDB handles relationships as a
physical LINK to the record
assigned when the edge is created
on the other side
RDBMS computes the
relationship every time you query a database
Is not that crazy?!
Copyright © 2011 LOGTEL
This means jumping from a
O(log N) algorithm to a near O(1)
traversing cost is not more affected
by database size!
This is huge in the BigData age
Copyright © 2011 LOGTEL
$luca> cd bin
$luca> ./console.sh
OrientDB console v.1.2.0-SNAPSHOT (www.orientdb.org)
Type 'help' to display all the commands supported.
orientdb> create vertex V set name = ‘Sam’, label = ‘Customer’
Created vertex #13:35 in 0.03 secs
orientdb> create vertex V set name = ‘Rome’, label = ‘Address’
Created vertex #13:100 in 0.02 secs
orientdb> create edge E from #13:35 to #13:100 set label = ‘Lives’
Created edge #14:54 in 0.02 secs
Create the graph in SQL
Copyright © 2011 LOGTEL
OGraphDatabase graph = new OGraphDatabase("local:/tmp/db/graph”);
ODocument sam= graph.createVertex();
sam.field(“name", “Sam");
sam.field(“label", “Customer");
ODocument rome = graph.createVertex();
rome.field(“name", “Rome”);
rome.field(“label", “Address”);
ODocument edge = graph.createEdge(sam, rome).field(“label”, “Lives”);
edge.save();
graph.close();
Create the graph in Java
Copyright © 2011 LOGTEL
orientdb> select in[label=‘Lives’].out from V where
label = ‘Address’ and name = ‘Rome’
---+--------+--------------------+--------------------+--------------------+
#| REC ID |label |out |in |
---+--------+--------------------+--------------------+--------------------+
0| 13:35|Sam |[#14:54] | |
---+--------+--------------------+--------------------+--------------------+
1 item(s) found. Query executed in 0.007 sec(s).
orientdb> select * from V where label = ‘Address’ AND
in[label=‘Lives’].size() > 0
---+--------+--------------------+--------------------+--------------------+
#| REC ID |label |out |in |
---+--------+--------------------+--------------------+--------------------+
0| 13:100| Rome | |[#14:54] |
---+--------+--------------------+--------------------+--------------------+
1 item(s) found. Query executed in 0.007 sec(s).
Query the graph in SQL
Copyright © 2011 LOGTEL
OGraphDatabase graph = new
OGraphDatabase("local:/tmp/db/graph”);
// GET ALL THE THE CUSTOMER FROM ROME, ITALY
List<ODocument> result = graph.command( new OCommandSQL (
“select in[label=‘Lives’].out from V where label = ‘Address’
and name = ?”)
).execute( “Rome”);
for( ODocument v : result ) {
System.out.println(“Result: “ + v.field(“label”) );
}
-----------------------------------------------------------------------------------
----Result: Sam
Query the graph in Java
Copyright © 2011 LOGTEL
Query vs. traversal
Once you’ve a well connected database in the form of a Super Graph you can cross records instead of query them!
All you need is some root vertices where to start to traverse
Copyright © 2011 LOGTEL
Customers
Sam John Sylvia
Order
2332
Order
8834
White
Soap
StocksSpecial
Customers
This is a
root
vertex
Query vs. traversal
Copyright © 2011 LOGTEL
Supposing that the root node #30:0 links all the
Customer vertices
Get all the customers:
orientdb> select out.in from #30:0
Get all the customers who bought at least one ‘White Soap’
product:
orientdb> select * from ( select out.in from #30:0) where
out.in.out[label=‘Bought’].in.name = ‘White Soap’
Customers
#30:0
Query the graph in SQL
Copyright © 2011 LOGTEL
Demo time!
Copyright © 2011 LOGTEL
Should I be using NoSQL Databases?
For almost all of us, regular relational databases are THE correct solution
NoSQL Data storage systems makes sense for applications that need to deal with very verylarge semi-structured data
Log Analysis
Social Networking Feeds
Copyright © 2011 LOGTEL
WHAT CAN WE DO WITH BIG DATA ?
90
What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
91
Value of Big Data Analytics
• Big data is more real-time in
nature than traditional DW
applications
• Traditional DW architectures (e.g.
Exadata, Teradata) are not well-
suited for big data apps
• Shared nothing, massively parallel
processing, scale out
architectures are well-suited for
big data apps
92
Copyright © 2011 LOGTEL 93
Copyright © 2011 LOGTEL
USN with related technical areas
94
What is collecting all this data?Web Browsers Search Engines
Microsoft’s
Internet Explorer
Mozilla’s FireFox
Google’s Chrome
Apple’s Safari
Google’s
Microsoft’s
Yahoo’s
IAC Search’s
Time-Warner’s AOL
Explorer
(Non-profit foundation,
used to be Netscape)
What is collecting all this data?
Smartphones & Apps
Apple’s iPhone
(Apple O/S)
Samsung, HTC.
Nokia, Motorola
(Android O/S)
RIM Corp’s Blackberry
(BlackBerry O/S)
Tablet Computers & Apps
Apple’s iPad
Samsung’s Galaxy
Amazon’s Kindle Fire
What is collecting all this data?
Hospitals & Other Medical Systems Banking & Phone Systems
Can you hear me now?
(Heh heh heh!)
Pharmacies
Laboratories
Imaging Centers
Emergency Medical Services (EMS)
Hospital Information Systems
Doc-in-a-Box
Electronic Medical Records
Blood Banks
Birth & Death Records
What is collecting all this data?
A real pain in the apps! What are they collecting?• Restaurant reservations
(Open Table)
• Weather in L.A. in 3 days (Weather+)
• Side effects of medications (MedWatcher)
• 3-star hotels in New Orleans (Priceline)
• Which PC should I buy and where (PriceCheck)
Big Brother Needs Big DataIn March 2012, the Obama Administration announced the Big Data Research and Development Initiative, $200 million in new R&D investments, which will explore how Big Data could be used to address important problems facing the government. The initiative was composed of 84 different Big Data programs spread across six departments.
http://tinyurl.com/85oytkj
The U.S. Federal Government owns six of the ten most powerful supercomputers in the world.
How Companies Like Use BigData To Make You Love Them
Last month, I talked to Amazon customer service about my malfunctioning Kindle, and it was great. Thirty seconds after putting in a service request on Amazon’s website, my phone rang, and the woman on the other end--let’s call her Barbara--greeted me by name and said, "I understand that you have a problem with your Kindle." We resolved my problem in under two minutes, we got to skip the part where I carefully spell out my last name and address, and she didn’t try to upsell me on anything. After nearly a decade of ordering stuff from Amazon, I never loved the company as much as I did at that moment.
The fact is, Amazon has been collecting my information for years--not just addresses and payment information but the identity of everything I’ve ever bought or even looked at. And while dozens of other companies do that, too, Amazon’s doing something remarkable with theirs. They’re using that data to build our relationship.
Article by Sean Madden, May 2012, an expert in service design and innovation strategy.
How Can You Avoid Big Data?
• Pay cash for everything!
• Never go online!
• Don’t use a telephone!
• Don’t use Kroger or Harris Teeter cards!
• Don’t fill any prescriptions!
• Never leave your house!
Key concept of Big Data
• Store everything
• Don’t delete anything
• Schema is a bottleneck
• Think always on parallel
• Remember the CAP theorem
Thank You!!!
…and please fill the evaluation form
103