TRANSCRIPT
NoSQL and NoSQL: Beginning with O-R Databases
First, CRUD…
• Global notions of managing persistent data, regardless of the model or system
• Create, Read, Update, Delete
• But there is also DDL
• And there are implementation issues, like sorts and indices
Why not standard tables?
• Extreme data structuring conflict between host language and database language:
• Impedance mismatch
• Atomic values are the only common data type
• To retrieve all of an object requires lots of joins
• Difficult to look for objects that are similar but have some different attributes
• Difficult to retrieve an attribute that is a collection
• You have to program with two programming languages
• We have value-based semantics – it is difficult to know whether two people have the same mother, or just mothers with the same name/ID, and this forces us to make inferences
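The impedance mismatch can be sketched in Python. The Person class, the table rows, and the load_person helper below are hypothetical illustrations: reassembling one object from atomic-valued tables takes a lookup (a join, in SQL) per multi-valued attribute.

```python
# A hypothetical in-memory object: attributes can be collections and references.
class Person:
    def __init__(self, ssn, name, phones, children):
        self.ssn = ssn            # atomic value
        self.name = name          # atomic value
        self.phones = phones      # a collection -- no atomic relational analogue
        self.children = children  # references to other Person objects

# The same data flattened into relational tables: atomic values only, so the
# collection and the references each become rows in a separate table.
person_rows = [("111-22-3333", "Joe Public")]
phone_rows  = [("111-22-3333", "516-123-4567"),
               ("111-22-3333", "516-345-6789")]
child_rows  = [("111-22-3333", "222-33-4444")]

def load_person(ssn):
    """Rebuild the object: one scan (join) per multi-valued attribute."""
    name = next(n for s, n in person_rows if s == ssn)
    phones = {p for s, p in phone_rows if s == ssn}
    children = [c for s, c in child_rows if s == ssn]
    return Person(ssn, name, phones, children)
```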
What is o-o?
• A relation is a set of tuples
• Objects are arranged in sets of objects
• In a relation, a tuple’s components are primitive (int, string)
• The components of an object can be complex types (sets, tuples, other objects)
• SQL: programs are global
• Object: programs are local
Key concept: Object Id’s
• Every object has a unique Id: different objects have different Ids
• Immutable: does not change as the object changes
• Different from a primary key!
• Like a key, it identifies an object uniquely
• But key values can change – oids cannot
• And there are inferences based on values
Objects and Values
• An object is a pair: (oid, value)
• Example: Joe Public’s object
(#32, [ SSN: 111-22-3333,
        Name: “Joe Public”,
        PhoneN: {“516-123-4567”, “516-345-6789”},
        Child: {#445, #73} ] )
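A rough Python rendering of the same (oid, value) pair, using plain integers for the hypothetical oids #32, #445, and #73:

```python
# An object is an (oid, value) pair; the oids in Child reference other objects.
joe = (32, {
    "SSN": "111-22-3333",
    "Name": "Joe Public",
    "PhoneN": {"516-123-4567", "516-345-6789"},  # a set-valued attribute
    "Child": [445, 73],                          # oids of other objects
})

oid, value = joe  # the oid is identity; the value may change, the oid may not
```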
Classes
• Class: set of semantically similar objects (e.g., people, students, cars, motorcycles)
• A class has:
• Type: describes the common structure of all objects in the class (semantically similar objects are also structurally similar)
• Method signatures: declarations of the operations that can be applied to all objects in the class
• Extent: the set of all objects in the class
• Classes are organized in a class hierarchy
• The extent of a class contains the extent of any of its subclasses
The ODMG Standard
• ODMG 3.0 was released in 2000
• Includes the data model (more or less)
• ODL: The object definition language
• OQL: The object query language
• A transaction specification mechanism
• Language bindings: How to access an ODMG database from C++, Smalltalk, and Java (expect C# to be added to the mix)
Main Idea: Host Language = Data Language
• Objects in the host language are mapped directly to database objects
• Some objects in the host program are persistent.
The Structure of an ODMG Application
Objects in SQL
• Object-relational extension of SQL-92
• Includes the legacy relational model
• SQL database = a finite set of relations
• relation = a set of tuples (extends legacy relations) OR a set of objects (completely new)
• object = (oid, tuple-value)
• tuple = tuple-value
• tuple-value = [Attr1: v1, …, Attrn: vn]
• multiset-value = {v1, …, vn}
Path expressions
SELECT T.Student.Name, T.Grade
FROM TRANSCRIPT T
WHERE T.Student.Address.Street = 'Main St.'
PostgreSQL vs. MySQL
• PostgreSQL is a generation newer
• It has nice UDT capabilities
• There are libraries of UDTs that can be imported and used
• Both PostgreSQL and MySQL
• Full text search
• XML data types
• To some degree free
• MySQL
• Never underestimate the value of a widely understood piece of software
• Lots of stacks and development environments come configured to work with it (but to a lesser extent, this is true of PostgreSQL, too)
• It is a “core” SQL database, in that we can move pretty much to any other server-based DBMS if we start with MySQL
Triggers in PostgreSQL
• Triggers automatically fire stored procedures when some event happens, like an insert or update. They allow the database to enforce some required behavior in response to changing data.
• PL/pgSQL – Procedural Language of PostgreSQL
Example
CREATE TABLE logs (
  event_id integer,
  old_title varchar(255),
  old_starts timestamp,
  old_ends timestamp,
  logged_at timestamp DEFAULT current_timestamp
);
A logs table
Continued
CREATE OR REPLACE FUNCTION log_event() RETURNS trigger AS $$
BEGIN
  INSERT INTO logs (event_id, old_title, old_starts, old_ends)
  VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends);
  RAISE NOTICE 'Someone just changed event #%', OLD.event_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
A function to insert old data into the log
Continued… a trigger
CREATE TRIGGER log_events
AFTER UPDATE ON events
FOR EACH ROW EXECUTE PROCEDURE log_event();
Logs changes after any row is updated
Rules
• A RULE is a description of how to alter the parsed query tree.
• Every time Postgres runs an SQL statement, it parses the statement into a query tree (generally called an abstract syntax tree).
Back to PostgreSQL
Page 32 of Seven Databases, onward
Fuzzy searches, full text searching
Postgres and spatial data
• For manipulating 2D/3D spatial data
• Points, lines, and polygons formed from points and lines
• Can perform union and intersection operations
• Can project shapes into 2D areas
• Has a 3D geometry type (relatively new)
• Can calculate accurate distances in meters
• Works with an open source server that allows folks to share geospatial data
• Command line interface
• Also supports some forms of raster data
• Provides spatial indices
• Has a notion of a geometric column
Queries
SELECT superhero.name FROM city, superhero WHERE ST_Contains(city.geom, superhero.geom) and city.name = 'Gotham';
SELECT ST_AsBinary(r.the_geom) AS wkb_geometry FROM river AS r, state AS s WHERE ST_Intersects(r.the_geom, s.the_geom);
Mapnik
• Mapnik is an open source system for rendering maps
• Used for OSM (Open Street Map) data
• Used to design maps
• Written in C++
• It renders maps from PostGIS databases
Next: full text and approximate text search
• But first, not to be confused with the LIKE operator
• Uses % as the wildcard
• Or with regular expressions for character string comparison
Full text search
• First, you index the words in a document and create an array of lexemes
• Second, specify a boolean phrase using and, or, not, and parens
• We typically don’t index “stop” words like and, or, the, etc.
• Dictionaries are used to find roots of related words, like dead and dying
• Thesaurus dictionaries are used for recognition of domain-specific and similar words
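A toy Python sketch of the idea above, with a hypothetical stop-word list and a deliberately crude stemmer standing in for real dictionaries; this is not how PostgreSQL's tsvector machinery actually works:

```python
# Toy full-text indexing: remove stop words, stem words to crude "lexemes",
# then evaluate a boolean AND query against the resulting index.
STOP_WORDS = {"and", "or", "the", "a", "to"}

def lexemes(text):
    """Index a document: lowercase, drop stop words, strip a plural 's'."""
    words = text.lower().split()
    return {w.rstrip("s") for w in words if w not in STOP_WORDS}

def matches_all(doc, terms):
    """Boolean AND search, in the spirit of to_tsvector(doc) @@ to_tsquery('a & b')."""
    return set(terms) <= lexemes(doc)

doc = "How to create the tables and indexes"
```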
Documents
• A document is a text attribute in a row of a table
• Often we use part of a document or concatenate various parts of documents
Details: dictionaries
• Define stop words that should not be indexed
• Map synonyms to a single word.
• Map phrases to a single word using a thesaurus.
• Map different variations of a word to a canonical form
Searching
• Uses a match operator - @@
• Basic search asks how a query vector of words relates to a given document, which is also a vector
• The vector can have and, or, etc. in it
• tsvector – the document, as normalized lexemes
• tsquery – the query
Examples
SELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;
Selects the ten most recent documents that contain “create” and “table” in the title or body
Results can be ranked
Recent addition: fuzziness
• soundex(text) returns text
• Converts a string to its Soundex code
• Based on pronunciation
• difference(text, text) returns int
• Converts two strings to their Soundex codes and then reports the number of matching code positions
• 0 means no match
• 4 means a full match
• Def: a phonetic coding system intended to suppress spelling variation and to determine the relationship between two (similar) words
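A simplified Python implementation of soundex and difference, assuming the standard American coding table. It glosses over some H/W edge cases, so it may disagree with PostgreSQL's fuzzystrmatch module on rare inputs:

```python
def soundex(word):
    """Standard American Soundex: first letter + up to three digit codes."""
    word = word.upper()
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    digits = []
    prev = codes.get(word[0], "")
    for c in word[1:]:
        d = codes.get(c, "")
        if d and d != prev:      # skip letters coded like the previous one
            digits.append(d)
        if c not in "HW":        # H and W do not break a run of equal codes
            prev = d             # vowels reset prev, so codes can repeat
    return (word[0] + "".join(digits) + "000")[:4]

def difference(a, b):
    """Number of matching Soundex code positions, 0 (none) to 4 (all)."""
    return sum(x == y for x, y in zip(soundex(a), soundex(b)))
```

For example, soundex("Robert") and soundex("Rupert") both come out R163, which is the point: spelling variation is suppressed.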
Levenshtein
• Levenshtein distance is a metric for evaluating the difference between two sequences, in particular, words
• E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL'); – returns 2
• E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10
• Used in particular, to detect nicknames
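Levenshtein distance is easy to state as a dynamic program; a compact Python version (a sketch, not PostgreSQL's C implementation):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute (or match)
        prev = cur
    return prev[-1]
```

GUMBO becomes GAMBOL with one substitution (U to A) and one insertion (L), so the distance is 2.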
Metaphone
• E.g., metaphone(text source, int max_output_length) returns text
• Similar to soundex
• Used to classify words according to their English pronunciation
• Apparently better for non-English languages, compared to Soundex
• E.g.: SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2) should detect the similarity to the name William
Class project, beginning
• Build an application on top of
• PostgreSQL or MySQL
• And one of the other NoSQL databases in the 3 books
• But pick from only key/value, key/document, or column-based databases
• The application is written in a language of your choice
• Each of the databases must be used to manage the kind of data it is intended for
• Traditional relational table data
• And nontraditional data
Final grades
• Choice 1: each of exam 1, exam 2, project are 1/3 of your final grade
• Choice 2: if you build a web app for your project, I will use the best two of three grades
NoSQL DBs
• Why?
Relational DBs
• SQL - Fixed schema, row oriented & optimized
• SQL - Rigid two-phase transactions, with locking bottleneck
• SQL - Set theoretic
• SQL - Centralized distribution
• SQL - Computational, not navigational/inter-connected & set-oriented
• SQL - Poor support for heterogeneity & compression
No SQL - no or not only
• Column-oriented - HBase (uses column families and no schema, has versioning and consistent transactions)
• Key/value pairs - Amazon Dynamo
• Graph like - Neo4J
• Document based - MongoDB (cluster based for huge scale, supports nested docs, and uses JavaScript for queries, and no schema)
But remember -
• Categories not distinct - take each one for what it is
• Heterogeneous structure & polyglot language environment is common
• NoSQL DBs tend to be unsupported with funky GUIs - but there are very active volunteer user bases maintaining and evolving them
• NoSQL DBs also tend to use programming languages for queries
When do you want non-2P transactions and no SQL?
• Interactive, zillion user apps where user fixes errors via some form of compensation
• Minimal interconnectedness
• Individual data values are not mission-critical
• Read-heavy environments
• Cloud -based environments
• Queries are not set-oriented & are computational and imperative, and perhaps long
• Real time apps
SQL is here to stay...
• Formal & unambiguous semantics
• Declarative language with clean separation of application and queries
• Consistent
• Flexible
• Black boxed, tested, and supported - and very well understood, with many thousands of trained programmers - SQL is a basic language, like Java, JavaScript, PHP, C#, etc.
• Great GUIs that are very rich and debugged
And importantly...
• Lots of apps need clean, well understood stacks, not speed or the cloud
• In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates
• Relational DBs scale reasonably well, too, at least in non-cloud environments
Again…
• The classification of the various nosql databases is imprecise, semi-controversial, and we have to be careful about reading too much into it.
• Rather than focusing on categorizing dbs, we should be concerned with what they do, how they relate to each other with respect to functionality, and how they compare to sql databases.
Key-value and key-document DBs
• Databases that access aggregate data
• Key-value dbs know nothing about the structure of the aggregate
• Key-document databases do know, but the interpretation of these aggregates happens outside the db
• Keep in mind that these two categories of databases overlap in practice
• Importantly, both of these database categories focus on storing and retrieving individual aggregates, and not on interrelating (horizontally) multiple aggregates
• There is something similar to this in SQL DBs – and that is highly un-normalized tables
Important notions…
• It can be a difficult problem to represent some domains as key-value or key-document databases, as the boundaries of aggregates might not be easy to determine.
• This basic data modeling issue has a lot of influence on the sort of database you should use.
• Relational databases don’t manipulate aggregates, but they are aggregate neutral for the most part, leaving the construction of aggregates to run time … but we might have hidden, un-normalized tables that make some commonly used aggregates much faster to materialize
Key-value vs. key-document
• In key-value databases, we can only retrieve data via a key
• In key-document databases, we may be able to ask questions about the content of documents – but again, we are not cross-associating them
• Mongo is perhaps the most talked about key-document system, and so we will start there
Installing Mongo
• Mongo
• http://docs.mongodb.org/manual/installation
• A GUI
• http://www.mongodb.org/display/DOCS/Admin+UIs
GUIs for Mongo
• There are a few GUIs that seem pretty good
• Mongo-vision: http://code.google.com/p/mongo-vision/ (web page)
• Needs Prudence as a web server
• MongoVue: http://mongovue.com, but Windows only
• RockMongo (web based): http://rockmongo.com/ (web page)
• Needs an Apache web server
• Mongo itself is very easy to install, just download it
• http://docs.mongodb.org/manual/installation
Getting an Apache web server
• XAMPP for Windows (the Mac version is way out of date)
• MAMP for Macs (on the App Store)
• WAMP for Windows (bitnami.org)
• All of these give you PHP and MySQL as well. If we have time, we will look at MySQL full text search.
• You might want to install PostgreSQL, too. There is a bitnami stack. If there is time, we will look at PostgreSQL UDTs and full text search.
Mongo overview
• Document based
• Focuses on clusters for extremely large scaling
• Supports nested documents
• Uses JavaScript for queries
• No schema
Terminology
• A database consists of collections
• Collections are made up of documents
• A document is made up of fields
• There are also indices
• There are also cursors
When to use Mongo
• Medical records and other large document systems
• Read heavy environments like analytics and mining
• Partnered with relational databases
• Relational for live data
• Mongo for huge, largely read-only archives
• Online applications
• Massively wide e-commerce
Mongo documents and queries
• Documents
• Self-defining, with hierarchical structure
• Like XML
• Or JSON, which uses JavaScript notation to define docs in a human-readable form
• Documents can vary in structure, even in the same collection
• You can add attributes to new documents in a collection without having to change the existing ones in the collection
• Queries: db.order.find({"customerId": "99"})
Consistency and transactions
• There is a tailorable consistency command that can be used to set the level you want for updating replicas of documents
• No multi-document atomic transactions are supported
• CAP theorem, which basically says there is a tradeoff between availability and consistency
• Consistent, available, partition tolerant
• The C and A are the big ones
• Eventual enforcement of consistency is the key
• You can embed references to other documents in a document, but this tends to create a “join effect”
• DBRef is the command
Selectors
• Used for finding, counting, updating, and removing docs from collections
• {} is the null search and matches all documents
• We could run: {gender: 'f'}
• {field1: value1, field2: value2} creates an ‘and’ operation
• Also, less than, greater than, etc. (e.g., $gt)
• $exists, $or
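The selector semantics above can be mimicked with a toy in-memory matcher in Python. The matches helper is hypothetical and supports only equality, $gt, $exists, and $or; it is not how Mongo evaluates selectors internally:

```python
def matches(doc, selector):
    """Return True if a document dict satisfies a Mongo-style selector dict."""
    if "$or" in selector:
        return any(matches(doc, s) for s in selector["$or"])
    for field, cond in selector.items():
        if isinstance(cond, dict):               # operator form, e.g. {"$gt": 30}
            if "$exists" in cond and (field in doc) != cond["$exists"]:
                return False
            if "$gt" in cond and not (field in doc and doc[field] > cond["$gt"]):
                return False
        elif doc.get(field) != cond:             # plain equality
            return False
    return True                                  # {} matches every document

people = [{"name": "Ann", "gender": "f", "age": 31},
          {"name": "Bob", "gender": "m", "age": 25}]

everyone = [p for p in people if matches(p, {})]                 # null search
women    = [p for p in people if matches(p, {"gender": "f"})]
over_30  = [p for p in people if matches(p, {"age": {"$gt": 30}})]
```

Note that multiple fields in one selector act as an 'and', exactly as on the slide.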
Another document DB: CouchDB
• Major focus: surviving network problems
• Engineered for web use
• No ad hoc querying, searching is via map reduce-based indices
• We will get back to CouchDB
Map Reduce
• Focus is on performing data operations on parallel hardware
• This is a paradigm, not a specific programmatic technique
• Each map reduce process has two phases
• Convert a list into a desired sort of list with the map operator
• Convert the new list into a small number of atomic values via a reduce operator
• This allows us to spread a process across a wide array of servers, with each server performing an independent map reduce process
Map-Reduce examples
So, what is it?
• A two phase process geared toward optimizing broad, widely distributed parallel computing platforms
• Apache Hadoop is an open-source implementation of MapReduce.
• MapReduce is Google’s version (and it is proprietary).
• Phases
• 1. Take a series of keys and transform them into a different series of values, generally ones that have some semantic context
• 2. Perform a second pass where the new series of values is compressed into far fewer values
In its strictest sense…
• Map-reduce is a two phase operation
• First, convert a list of data into a list of a different kind of data
• Second, turn the second list into a single scalar value or a list of scalar values, often the cardinality of the items created in the first step
Relevant computing/data application types
• For aggregate database processing, and not so much for set-oriented, and certainly not for object-based querying
• Fits well with cluster-based environments, where there are lots of opportunities for parallel processing
• Fits query patterns that calculate the cardinality of sets and the removal of duplicates
Strategy for M-R
• We try to do the computing on the machines where the data sits
• So we try to engineer the storage of data so that it accommodates the chaining of M-R operations
The key bottom line concept
• In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated
• In a database that supports M-R, we try to screen (and sometimes aggregate) data on the server where it sits
• We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data.
Another way of looking at this…
• We have seen the tradeoff between moving data and moving processing logic in the context of homogeneous distributed data
• Often, in distributed databases, it is far cheaper to ship processing logic instead of data, even if it causes extra processing to have to happen
• This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data
Example
• 1. We start with a set of person keys and map each of these to the names of the people.
• Key 1 -> Harry
• Key 2 -> Harry
• Key 3 -> Tommy
• 2. We aggregate the list of people names by counting how many unique names are in the list.
• Harry, Harry, Tommy -> 2
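The two phases of that example, written out in Python:

```python
# The slide's example: map person keys to names, then reduce the name list
# to a single scalar -- the count of unique names.
people = {1: "Harry", 2: "Harry", 3: "Tommy"}

# Map phase: transform each key into a name.
names = [people[k] for k in sorted(people)]   # ['Harry', 'Harry', 'Tommy']

# Reduce phase: collapse the list into one atomic value.
unique_names = len(set(names))                # 2
```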
What actually happens?
• Informally:
• Each key leads to a name field.
• Then, the names are isolated.
• Then, each is passed to a “mapper”, which returns the name, along with a 1.
• Then, a “reducer” takes each name and makes a list of 1’s. The reducer adds up the 1’s for each name and returns a list of (name, count) pairs.
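That informal pipeline, sketched in Python; the mapper/shuffle/reducer names are illustrative, and a real framework would distribute these steps across servers:

```python
from collections import defaultdict

def mapper(name):
    """Mapper: emit a (name, 1) pair for each input record."""
    return (name, 1)

def shuffle(pairs):
    """Group the 1's by name, as the framework does between the two phases."""
    groups = defaultdict(list)
    for name, one in pairs:
        groups[name].append(one)
    return groups

def reducer(name, ones):
    """Reducer: sum one name's list of 1's into a (name, count) pair."""
    return (name, sum(ones))

names = ["Harry", "Harry", "Tommy"]
pairs = [mapper(n) for n in names]                              # [('Harry',1),...]
counts = dict(reducer(n, ones) for n, ones in shuffle(pairs).items())
```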
From NoSQL Distilled: 1. Creating a list with a map
2. Aggregating with a reduce
3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers
4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format
5. A combiner that removes duplicate product-customer pairs
6. Concatenating a combining and a reduce (counting) operation
7. Maintaining the 1’s counts in the mapping phase
8. Adding temporal information to the map/reduce process
9. Using a reduce operator to create product per month totals
10. A second mapper that creates base year by year comparisons
11. A reduce operation combines records for a given year
Complaints
• M-R is low level.
• It is rigid.
• It exists to optimize the distributed cluster model – only.
• It demands that an application fit perfectly into the paradigm.
• It takes careful planning, and knowledge of exactly how the data will be used, to structure the database so that it optimally serves a series of map/reduce operations
• It thus does not accommodate on-the-fly browsing