
An Overview of Data Management Paradigms:

Relational, Document, and Graph

Marko A. Rodriguez
T-5, Center for Nonlinear Studies
Los Alamos National Laboratory

http://markorodriguez.com

February 15, 2010

Page 2: An Overview of Data Management Paradigms: Relational, Document, and Graph

Relational, Document, and Graph Database Data Models

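[Figure: the three data models side by side, with example systems.]

Relational Database (tables): MySQL, PostgreSQL, Oracle

Document Database (collections of { data } documents): MongoDB, CouchDB

Graph Database (vertices joined by edges): Neo4j, AllegroGraph, HyperGraphDB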

Database models are optimized for solving particular types of problems. This is why different database models exist: there are many types of problems in the world.

Data Management Workshop – Albuquerque, New Mexico – February 15, 2010


Finding the Right Solution to Your Problem

1. Come to terms with your problem.

• “I have metadata for a massive number of objects and I don’t know how to get at my data.”

2. Identify the solution to your problem.

• “I need to be able to find objects based on their metadata.”

3. Identify the type of database that is optimized for that type of solution.

• “A document database scales, stores metadata, and can be queried.”

4. Identify the database of that type that best meets your particular needs.

• “CouchDB has a REST web interface and all my developers are good with REST.”


Relational Databases

• Relational databases have been the de facto data management solution for many years.

MySQL is available at http://www.mysql.com

PostgreSQL is available at http://www.postgresql.org

Oracle is available at http://www.oracle.com


Relational Databases: The Relational Structure

• Relational databases require a schema before data can be inserted.

• Relational databases organize data according to relations, or tables.

[Figure: a table whose columns are attributes/properties and whose rows are tuples/objects; the cell at row i and column j holds the value x.]

Object i has the value x for property j.


Relational Databases: Creating a Table

• Relational databases organize data according to relations, or tables.

• Relational databases require a schema before data can be inserted.

• Let's create a table for Grateful Dead songs.

mysql> CREATE TABLE songs (
    ->   name VARCHAR(255) PRIMARY KEY,
    ->   performances INT,
    ->   song_type VARCHAR(20));

Query OK, 0 rows affected (0.40 sec)


Relational Databases: Viewing the Table Schema

• It's possible to look at the defined structure (schema) of your newly created table.

mysql> DESCRIBE songs;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| name         | varchar(255) | NO   | PRI | NULL    |       |
| performances | int(11)      | YES  |     | NULL    |       |
| song_type    | varchar(20)  | YES  |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)


Relational Databases: Inserting Rows into a Table

• Let's insert song names, the number of times they were played in concert, and whether they were an original or a cover.

mysql> INSERT INTO songs VALUES ("DARK STAR", 219, "original");

Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO songs VALUES ("FRIEND OF THE DEVIL", 304, "original");

Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO songs VALUES ("MONKEY AND THE ENGINEER", 32, "cover");

Query OK, 1 row affected (0.00 sec)


Relational Databases: Searching a Table

• Let's look at the entire songs table.

mysql> SELECT * FROM songs;
+-------------------------+--------------+-----------+
| name                    | performances | song_type |
+-------------------------+--------------+-----------+
| DARK STAR               |          219 | original  |
| FRIEND OF THE DEVIL     |          304 | original  |
| MONKEY AND THE ENGINEER |           32 | cover     |
+-------------------------+--------------+-----------+
3 rows in set (0.00 sec)


Relational Databases: Searching a Table

• Let's look at all original songs.

mysql> SELECT * FROM songs WHERE song_type="original";
+---------------------+--------------+-----------+
| name                | performances | song_type |
+---------------------+--------------+-----------+
| DARK STAR           |          219 | original  |
| FRIEND OF THE DEVIL |          304 | original  |
+---------------------+--------------+-----------+
2 rows in set (0.00 sec)


Relational Databases: Searching a Table

• Let's look at only the names of the original songs.

mysql> SELECT name FROM songs WHERE song_type="original";
+---------------------+
| name                |
+---------------------+
| DARK STAR           |
| FRIEND OF THE DEVIL |
+---------------------+
2 rows in set (0.00 sec)
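The MySQL session above can be tried without a database server. What follows is a minimal sketch that swaps in Python's built-in sqlite3 module for MySQL (an assumption made for portability; SQLite enforces column types more loosely than MySQL does). The table, rows, and queries are the ones from the slides; the `ORDER BY` clause is added only to make the result order predictable.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")

# The schema has to exist before any row can be inserted.
conn.execute(
    "CREATE TABLE songs ("
    "name VARCHAR(255) PRIMARY KEY, "
    "performances INT, "
    "song_type VARCHAR(20))"
)

rows = [
    ("DARK STAR", 219, "original"),
    ("FRIEND OF THE DEVIL", 304, "original"),
    ("MONKEY AND THE ENGINEER", 32, "cover"),
]
conn.executemany("INSERT INTO songs VALUES (?, ?, ?)", rows)

# All columns of the original songs.
originals = conn.execute(
    "SELECT * FROM songs WHERE song_type = ? ORDER BY name", ("original",)
).fetchall()

# Only the names of the original songs.
original_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM songs WHERE song_type = ? ORDER BY name", ("original",)
    )
]

print(originals)
print(original_names)
```

The `?` placeholders keep the query text and the values separate, which is the usual way to pass parameters with sqlite3.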


Document Databases

• Document databases store structured documents. Usually these documents are organized according to a standard (e.g. JavaScript Object Notation (JSON), XML, etc.).

• Document databases tend to be schema-less. That is, they do not require the database engineer to specify a priori the structure of the data to be held in the database.

MongoDB is available at http://mongodb.org and CouchDB is available at http://couchdb.org


Document Databases: JavaScript Object Notation

• A JSON document is a collection of key/value pairs, where a value can be yet another collection of key/value pairs.

  * string: a string value (e.g. "marko", "rodriguez").
  * number: a numeric value (e.g. 1234, 67.012).
  * boolean: a true/false value (e.g. true, false).
  * null: a nonexistent value.
  * array: an array of values (e.g. [1, "marko", true]).
  * object: a key/value map (e.g. { "key" : 123 }).

The JSON specification is very simple and can be found at http://www.json.org/.
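As a small illustration of the six value types listed above, the following sketch parses one value of each type with Python's standard json module and reports the Python type it becomes. The key names are made up for the example.

```python
import json

# One value of each JSON type, wrapped in an object (keys are illustrative).
text = """
{
  "string": "marko",
  "number": 67.012,
  "boolean": true,
  "null": null,
  "array": [1, "marko", true],
  "object": {"key": 123}
}
"""

doc = json.loads(text)

# Name of the Python type the json module chose for each JSON value.
types = {key: type(value).__name__ for key, value in doc.items()}
print(types)
```

An object becomes a dictionary and an array becomes a list, so a parsed document can be handled with ordinary language constructs.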


Document Databases: JavaScript Object Notation

{
  _id : "D0DC29E9-51AE-4A8C-8769-541501246737",
  name : "Marko A. Rodriguez",
  homepage : "http://markorodriguez.com",
  age : 30,
  location : {
    country : "United States",
    state : "New Mexico",
    city : "Santa Fe",
    zipcode : 87501
  },
  interests : ["graphs", "hockey", "motorcycles"]
}


Document Databases: Handling JSON Documents

• Use object-oriented “dot notation” to access components.

> marko = eval({_id : "D0DC29E9...", name : "Marko..."})
> marko._id
D0DC29E9-51AE-4A8C-8769-541501246737
> marko.location.city
Santa Fe
> marko.interests[0]
graphs

All document database examples presented are using MongoDB [http://mongodb.org].
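The same access pattern carries over to other languages once the document is parsed. The sketch below uses Python's json module rather than the MongoDB shell, so keys are quoted (strict JSON) and nested values are reached with brackets instead of dots.

```python
import json

# The person document from the slides, written as strict JSON (quoted keys).
marko = json.loads("""
{
  "_id": "D0DC29E9-51AE-4A8C-8769-541501246737",
  "name": "Marko A. Rodriguez",
  "homepage": "http://markorodriguez.com",
  "age": 30,
  "location": {
    "country": "United States",
    "state": "New Mexico",
    "city": "Santa Fe",
    "zipcode": 87501
  },
  "interests": ["graphs", "hockey", "motorcycles"]
}
""")

# Bracket access in Python plays the role of dot notation in the shell.
doc_id = marko["_id"]
city = marko["location"]["city"]
first_interest = marko["interests"][0]

print(doc_id)
print(city)
print(first_interest)
```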


Document Databases: Inserting JSON Documents

• Let's insert a Grateful Dead document into the database.

> db.songs.insert({
    _id : "91",
    properties : {
      name : "TERRAPIN STATION",
      song_type : "original",
      performances : 302
    }
  })


Document Databases: Finding JSON Documents

• Searching is based on creating a “subset” document and pattern matching it against the documents in the database.

• Find all songs where properties.name equals TERRAPIN STATION.

> db.songs.find({"properties.name" : "TERRAPIN STATION"})
{ "_id" : "91", "properties" : { "name" : "TERRAPIN STATION", "song_type" : "original", "performances" : 302 }}

>


Document Databases: Finding JSON Documents

• You can also do comparison-type operations.

• Find all songs where properties.performances is greater than 200.

> db.songs.find({"properties.performances" : { $gt : 200 }})
{ "_id" : "104", "properties" : { "name" : "FRIEND OF THE DEVIL", "song_type" : "original", "performances" : 304 }}
{ "_id" : "122", "properties" : { "name" : "CASEY JONES", "song_type" : "original", "performances" : 312 }}
has more
>
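To make the idea of matching a query document against stored documents concrete, here is a toy matcher in Python. It is a sketch of the concept only, not MongoDB's implementation: it supports just exact equality and the `$gt` operator on dotted keys. The helper names (`get_path`, `matches`, `find`) and the `_id` given to the cover song are hypothetical.

```python
def get_path(doc, dotted_key):
    """Follow a dotted key such as "properties.name" into nested dicts."""
    current = doc
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(doc, query):
    """Return True if doc satisfies every condition in the query document."""
    for dotted_key, condition in query.items():
        value = get_path(doc, dotted_key)
        if isinstance(condition, dict) and "$gt" in condition:
            # Comparison form: {"$gt": bound}
            if value is None or not value > condition["$gt"]:
                return False
        elif value != condition:
            # Equality form: the stored value must equal the condition.
            return False
    return True


def find(collection, query):
    """Return the documents in collection that match the query document."""
    return [doc for doc in collection if matches(doc, query)]


songs = [
    {"_id": "91", "properties": {"name": "TERRAPIN STATION", "song_type": "original", "performances": 302}},
    {"_id": "104", "properties": {"name": "FRIEND OF THE DEVIL", "song_type": "original", "performances": 304}},
    {"_id": "122", "properties": {"name": "CASEY JONES", "song_type": "original", "performances": 312}},
    # Hypothetical _id; the values come from the relational example.
    {"_id": "x1", "properties": {"name": "MONKEY AND THE ENGINEER", "song_type": "cover", "performances": 32}},
]

by_name = find(songs, {"properties.name": "TERRAPIN STATION"})
popular = find(songs, {"properties.performances": {"$gt": 200}})

print([doc["_id"] for doc in by_name])
print([doc["_id"] for doc in popular])
```

A real document database would answer such queries from indexes rather than by scanning every document as this sketch does.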


Document Databases: Processing JSON Documents

• Sharding is the process of distributing a database’s data across multiple machines. Each partition of the data is known as a shard.

• Document databases shard easily because there are no explicit references between documents.

[Figure: a client application talks to a communication service, which sits in front of four shards, each holding its own set of { _id : ... } documents.]
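One simple way to decide which shard holds a document is to hash its `_id`. The sketch below assumes hash-based placement, which is only one possible strategy and is not claimed to be what any particular database does. The function names and the twelve sample documents are illustrative.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document _id to a shard index using a stable hash."""
    digest = hashlib.sha256(str(doc_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(documents, num_shards):
    """Place each document in the shard chosen by its _id."""
    shards = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[shard_for(doc["_id"], num_shards)].append(doc)
    return shards


# Twelve placeholder documents spread over four shards.
documents = [{"_id": str(i)} for i in range(12)]
shards = distribute(documents, 4)

for index, shard in enumerate(shards):
    print(index, [doc["_id"] for doc in shard])
```

Because no document refers to another, each one can be placed on its own, and a lookup by `_id` only has to contact the single shard that `shard_for` returns.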


Document Databases: Processing JSON Documents

• Most document databases come with a Map/Reduce feature to allow for the parallel processing of all documents in the database.

  * Map function: apply a function to every document in the database.
  * Reduce function: apply a function to the grouped results of the map.

$M : D \to (K, V)$,

where $D$ is the space of documents, $K$ is the space of keys, and $V$ is the space of values.

$R : (K, V^n) \to (K, V)$,

where $V^n$ is the space of all possible combinations of values.


Document Databases: Processing JSON Documents

• Create a distribution of the Grateful Dead original song performances.

> map = function() {
    if (this.properties.song_type == "original")
      emit(this.properties.performances, 1);
  };

> reduce = function(key, values) {
    var sum = 0;
    for (var i in values) {
      sum = sum + values[i];
    }
    return sum;
  };


Document Databases: Processing JSON Documents

> results = db.songs.mapReduce(map, reduce)
{
  "result" : "tmp.mr.mapreduce_1266016122_8",
  "timeMillis" : 72,
  "counts" : {
    "input" : 809,
    "emit" : 184,
    "output" : 119
  },
  "ok" : 1,
}


[Figure: data flow of the map and reduce functions over three example documents.]

Documents:

{ _id : 91, properties : { name : "TERRAP...", performances : 302 }}
{ _id : 100, properties : { name : "PLAYIN...", performances : 312 }}
{ _id : 122, properties : { name : "CASEY ...", performances : 312 }}

map emits one key : value pair per original song:

302 : 1, 312 : 1, 312 : 1, ...

The emitted values are grouped by key (key : values):

312 : [1,1], 302 : [1], ...

reduce returns one key : value pair per key:

{ 312 : 2, 302 : 1, ... }
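The data flow above can be traced in a few lines of Python. This is a single-process sketch of the map, group, and reduce steps, not MongoDB's engine (which may, for instance, call reduce more than once per key on partial results). The `song_type` of document 100 is not shown in the slides and is assumed here to be "original".

```python
from collections import defaultdict

documents = [
    {"_id": 91, "properties": {"name": "TERRAP...", "song_type": "original", "performances": 302}},
    # song_type assumed; the slides only show name and performances for this one.
    {"_id": 100, "properties": {"name": "PLAYIN...", "song_type": "original", "performances": 312}},
    {"_id": 122, "properties": {"name": "CASEY ...", "song_type": "original", "performances": 312}},
]


def map_fn(doc):
    """Emit (performances, 1) for every original song."""
    props = doc["properties"]
    if props["song_type"] == "original":
        yield props["performances"], 1


def reduce_fn(key, values):
    """Sum the values that were emitted under one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    # Map step: collect every emitted value under its key.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    # Reduce step: collapse each key's list of values to a single value.
    return {key: reducer(key, values) for key, values in grouped.items()}


distribution = map_reduce(documents, map_fn, reduce_fn)
print(distribution)
```

Each key of the result is a performance count and each value is the number of original songs performed that many times.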


Document Databases: Processing JSON Documents

> db[results.result].find()
{ "_id" : 0, "value" : 11 }
{ "_id" : 1, "value" : 14 }
{ "_id" : 2, "value" : 5 }
{ "_id" : 3, "value" : 8 }
{ "_id" : 4, "value" : 3 }
{ "_id" : 5, "value" : 4 }
...
{ "_id" : 554, "value" : 1 }
{ "_id" : 582, "value" : 1 }
{ "_id" : 583, "value" : 1 }
{ "_id" : 594, "value" : 1 }
{ "_id" : 1386, "value" : 1 }


Graph Databases

• Graph databases store objects (vertices) and their relationships to one another (edges). Usually these relationships are typed/labeled and directed.

• Graph databases tend to be optimized for graph-based traversal algorithms.

Neo4j is available at http://neo4j.org

AllegroGraph is available at http://www.franz.com/agraph/allegrograph

HyperGraphDB is available at http://www.kobrix.com/hgdb.jsp


Graph Databases: Property Graph Model

[Figure: a property graph with six vertices and six directed, labeled edges.]

Vertices: 1 (name = "marko", age = 29), 2 (name = "vadas", age = 27), 3 (name = "lop", lang = "java"), 4 (name = "josh", age = 32), 5 (name = "ripple", lang = "java"), 6 (name = "peter", age = 35).

Edges: ids 7 through 12, each labeled knows or created and carrying a weight property (values shown: 1.0, 0.5, 0.4, 0.4, 1.0, 0.2).

Graph data models vary. This section will use the data model popularized by Neo4j.


Graph Databases: Handling Property Graphs

• Gremlin is a graph-based programming language that can be used to interact with graph databases.

• However, graph databases also come with their own APIs.

[Logo: Gremlin, G = (V, E)]

Gremlin is available at http://gremlin.tinkerpop.com.

All the examples in this section are using Gremlin and Neo4j.


Graph Databases: Moving Around a Graph in Gremlin

gremlin> $_ := g:key('name','marko')
==>v[1]
gremlin> ./outE
==>e[7][1-knows->2]
==>e[9][1-created->3]
==>e[8][1-knows->4]
gremlin> ./outE/inV
==>v[2]
==>v[3]
==>v[4]
gremlin> ./outE/inV/@name
==>vadas
==>lop
==>josh
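For readers without Gremlin at hand, the same three steps (outgoing edges, their head vertices, a property of those vertices) can be sketched over plain Python dictionaries. This is an illustration of the traversal idea, not how Gremlin or Neo4j work internally. Only the vertices and edges reachable from marko in the session above are included, and the helper names are made up.

```python
# Vertices keyed by id, each with a map of properties.
vertices = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    3: {"name": "lop", "lang": "java"},
    4: {"name": "josh", "age": 32},
}

# Edges keyed by id: (out vertex, label, in vertex), in the order Gremlin listed them.
edges = {
    7: (1, "knows", 2),
    9: (1, "created", 3),
    8: (1, "knows", 4),
}


def out_e(vertex_id):
    """Ids of the edges leaving a vertex (Gremlin's outE step)."""
    return [edge_id for edge_id, (tail, _, _) in edges.items() if tail == vertex_id]


def in_v(edge_id):
    """Id of the vertex an edge points to (Gremlin's inV step)."""
    return edges[edge_id][2]


# g:key('name','marko'): look the start vertex up by property value.
start = next(vid for vid, props in vertices.items() if props["name"] == "marko")

out_edges = out_e(start)                            # ./outE
neighbors = [in_v(edge_id) for edge_id in out_edges]  # ./outE/inV
names = [vertices[vid]["name"] for vid in neighbors]  # ./outE/inV/@name

print(start, out_edges, neighbors, names)
```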


Graph Databases: Inserting Vertices and Edges

• Let's create a Grateful Dead graph.

gremlin> $_g := neo4j:open('/tmp/grateful-dead')
==>neo4jgraph[/tmp/grateful-dead]
gremlin> $v := g:add-v(g:map('name','TERRAPIN STATION'))
==>v[0]
gremlin> $u := g:add-v(g:map('name','TRUCKIN'))
==>v[1]
gremlin> $e := g:add-e(g:map('weight',1),$v,'followed_by',$u)
==>e[2][0-followed_by->1]

You can batch load graph data as well: g:load('data/grateful-dead.xml') using the GraphML specification [http://graphml.graphdrawing.org/]


Graph Databases: Inserting Vertices and Edges

• When all the data is in, you have a directed, weighted graph of the concert behavior of the Grateful Dead. A song is followed by another song if the second song was played next in concert. The weight of the edge denotes the number of times this happened in concert over the 30 years that the Grateful Dead performed.
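The edge weights described above amount to counting consecutive song pairs. The sketch below shows one way such weights could be derived; the three setlists are invented for illustration and are not the Grateful Dead data set used in the talk.

```python
from collections import Counter

# Hypothetical setlists: one list of songs per concert, in play order.
setlists = [
    ["TERRAPIN STATION", "DRUMS", "MORNING DEW"],
    ["TERRAPIN STATION", "DRUMS"],
    ["ESTIMATED PROPHET", "TERRAPIN STATION", "PLAYING IN THE BAND"],
]


def followed_by_weights(concerts):
    """Count how often each song was immediately followed by another."""
    weights = Counter()
    for songs in concerts:
        # Pair every song with the one played right after it.
        for first, second in zip(songs, songs[1:]):
            weights[(first, second)] += 1
    return weights


weights = followed_by_weights(setlists)

for (first, second), weight in weights.items():
    print(f"{first} -followed_by-> {second} (weight = {weight})")
```

Each key of `weights` corresponds to one directed followed_by edge, and its count is that edge's weight property.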


Graph Databases: Finding Vertices

• Find the vertex with the name TERRAPIN STATION.

• Find the name of all the songs that followed TERRAPIN STATION in concert more than 3 times.

gremlin> $_ := g:key('name','TERRAPIN STATION')
==>v[0]
gremlin> ./outE[@weight > 3]/inV/@name
==>DRUMS
==>MORNING DEW
==>DONT NEED LOVE
==>ESTIMATED PROPHET
==>PLAYING IN THE BAND


Graph Databases: Processing Graphs

• Most graph algorithms are aimed at traversing a graph in some manner.

• The traverser makes use of vertex and edge properties in order to guide its walk through the graph.


Graph Databases: Processing Graphs

• Find all songs related to TERRAPIN STATION according to concert behavior.

$e := 1.0
$scores := g:map()
repeat 75
  $_ := (./outE[@label='followed_by']/inV)[g:rand-nat()]
  if $_ != null()
    g:op-value('+',$scores,$_/@name,$e)
    $e := $e * 0.85
  else
    $_ := g:key('name','TERRAPIN STATION')
    $e := 1.0
  end
end
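The loop above is a random walk whose contribution decays by a factor of 0.85 per step and resets when the walker hits a dead end. A rough Python rendering follows, under stated assumptions: the three-song graph is invented, the next song is chosen uniformly at random (edge weights are ignored), and the walk is assumed to start at TERRAPIN STATION. The function and parameter names are hypothetical.

```python
import random

# Hypothetical miniature followed_by graph: song -> songs played right after it.
followed_by = {
    "TERRAPIN STATION": ["DRUMS", "PLAYING IN THE BAND"],
    "PLAYING IN THE BAND": ["DRUMS", "TERRAPIN STATION"],
    "DRUMS": [],  # dead end: the walker restarts from here
}


def related_songs(graph, start, steps=75, decay=0.85, rng=None):
    """Score songs by how much walker energy reaches them from start."""
    if rng is None:
        rng = random.Random()
    scores = {}
    current = start
    energy = 1.0
    for _ in range(steps):
        neighbors = graph.get(current, [])
        if neighbors:
            # Step to a random next song and credit it with the current energy.
            current = rng.choice(neighbors)
            scores[current] = scores.get(current, 0.0) + energy
            energy *= decay
        else:
            # Dead end: jump back to the start with full energy.
            current = start
            energy = 1.0
    return scores


scores = related_songs(followed_by, "TERRAPIN STATION", rng=random.Random(0))

# Highest score first, as g:sort($scores,'value',true()) does in the slides.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, score)
```

Songs reached early and often in walks from the start vertex collect the most energy, which is what makes the score a measure of relatedness.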


Graph Databases: Processing Graphs

gremlin> g:sort($scores,'value',true())
==>PLAYING IN THE BAND=1.9949905250390623
==>THE MUSIC NEVER STOPPED=0.85
==>MEXICALI BLUES=0.5220420095726453
==>DARK STAR=0.3645706137191774
==>SAINT OF CIRCUMSTANCE=0.20585176856988666
==>ALTHEA=0.16745479118927242
==>ITS ALL OVER NOW=0.14224175713617204
==>ESTIMATED PROPHET=0.12657286655816163
...


Conclusions

• Relational Databases

  * Stable, solid technology that has been used in production for decades.
  * Good for storing inter-linked tables of data and querying within and across tables.
  * They do not scale horizontally due to the interconnectivity of table keys and the cost of joins.

• Document Databases

  * For JSON documents, there exists a one-to-one mapping from document to programming object.
  * They scale horizontally and allow for parallel processing due to forced sharding at the document level.
  * Performing complicated queries requires relatively sophisticated programming skills.

• Graph Databases

  * Optimized for graph traversal algorithms and local neighborhood searches.
  * Low impedance mismatch between a graph in a database and a graph of objects in object-oriented programming.
  * They do not scale well horizontally due to the interconnectivity of vertices.




Fin.

Thank you for your time...

• My homepage: http://markorodriguez.com

• TinkerPop: http://tinkerpop.com

Acknowledgements: Peter Neubauer (Neo Technology) for comments and review.
