TRANSCRIPT
NoSQL and NoSQL: Beginning with O-R Databases
First, CRUD…
• Global notions of managing persistent data, regardless of the model or system
• Create, Read, Update, Delete
• But there is also DDL
• And there are implementation issues, like sorts and indices
Why not standard tables?
• Extreme data structuring conflict between host language and database language:
• Impedance mismatch
• Atomic values are the only common data type
• To retrieve all of an object requires lots of joins
• Difficult to look for objects that are similar but have some different attributes
• Difficult to retrieve an attribute that is a collection
• You have to program with two programming languages
• We have value-based semantics – it is difficult to know whether two people have the same mother, or just mothers with the same name/ID, and this forces us to make inferences
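The impedance mismatch can be sketched in Python. The Person class, the table rows, and the load_person helper below are hypothetical illustrations: reassembling one object from atomic-valued tables takes a lookup (a join, in SQL) per multi-valued attribute.

```python
# A hypothetical in-memory object: attributes can be collections and references.
class Person:
    def __init__(self, ssn, name, phones, children):
        self.ssn = ssn            # atomic value
        self.name = name          # atomic value
        self.phones = phones      # a collection -- no atomic relational analogue
        self.children = children  # references to other Person objects

# The same data flattened into relational tables: atomic values only, so the
# collection and the references each become rows in a separate table.
person_rows = [("111-22-3333", "Joe Public")]
phone_rows  = [("111-22-3333", "516-123-4567"),
               ("111-22-3333", "516-345-6789")]
child_rows  = [("111-22-3333", "222-33-4444")]

def load_person(ssn):
    """Rebuild the object: one scan (join) per multi-valued attribute."""
    name = next(n for s, n in person_rows if s == ssn)
    phones = {p for s, p in phone_rows if s == ssn}
    children = [c for s, c in child_rows if s == ssn]
    return Person(ssn, name, phones, children)
```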
What is o-o?
• A relation is a set of tuples
• Objects are arranged in sets of objects
• In a relation, a tuple’s components are primitive (int, string)
• The components of an object can be complex types (sets, tuples, other objects)
• SQL: programs are global
• Object: programs are local
Key concept: Object Id’s
• Every object has a unique Id: different objects have different Ids
• Immutable: does not change as the object changes
• Different from a primary key!
• Like a key, it identifies an object uniquely
• But key values can change – oids cannot
• And there are inferences based on values
Objects and Values
• An object is a pair: (oid, value)
• Example: Joe Public’s object
(#32, [ SSN: 111-22-3333,
        Name: “Joe Public”,
        PhoneN: {“516-123-4567”, “516-345-6789”},
        Child: {#445, #73} ] )
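A rough Python rendering of the same (oid, value) pair, using plain integers for the hypothetical oids #32, #445, and #73:

```python
# An object is an (oid, value) pair; the oids in Child reference other objects.
joe = (32, {
    "SSN": "111-22-3333",
    "Name": "Joe Public",
    "PhoneN": {"516-123-4567", "516-345-6789"},  # a set-valued attribute
    "Child": [445, 73],                          # oids of other objects
})

oid, value = joe  # the oid is identity; the value may change, the oid may not
```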
Classes
• Class: set of semantically similar objects (e.g., people, students, cars, motorcycles)
• A class has:
• Type: describes the common structure of all objects in the class (semantically similar objects are also structurally similar)
• Method signatures: declarations of the operations that can be applied to all objects in the class
• Extent: the set of all objects in the class
• Classes are organized in a class hierarchy
• The extent of a class contains the extent of any of its subclasses
The ODMG Standard
• ODMG 3.0 was released in 2000
• Includes the data model (more or less)
• ODL: The object definition language
• OQL: The object query language
• A transaction specification mechanism
• Language bindings: How to access an ODMG database from C++, Smalltalk, and Java (expect C# to be added to the mix)
Main Idea: Host Language = Data Language
• Objects in the host language are mapped directly to database objects
• Some objects in the host program are persistent.
The Structure of an ODMG Application
Objects in SQL
• Object-relational extension of SQL-92
• Includes the legacy relational model
• SQL database = a finite set of relations
• relation = a set of tuples (extends legacy relations) OR a set of objects (completely new)
• object = (oid, tuple-value)
• tuple = tuple-value
• tuple-value = [Attr1: v1, …, Attrn: vn]
• multiset-value = {v1, …, vn}
Path expressions
SELECT T.Student.Name, T.Grade
FROM TRANSCRIPT T
WHERE T.Student.Address.Street = 'Main St.'
PostgreSQL vs. MySQL
• PostgreSQL is a generation newer
• It has nice UDT capabilities
• There are libraries of UDTs that can be imported and used
• Both PostgreSQL and MySQL
• Full text search
• XML data types
• To some degree free
• MySQL
• Never underestimate the value of a widely understood piece of software
• Lots of stacks and development environments come configured to work with it (but to a lesser extent, this is true of PostgreSQL, too)
• It is a “core” SQL database, in that we can move pretty much to any other server-based DBMS if we start with MySQL
Triggers in PostgreSQL
• Triggers automatically fire stored procedures when some event happens, like an insert or update. They allow the database to enforce some required behavior in response to changing data.
• PL/pgSQL – Procedural Language of PostgreSQL
Example
CREATE TABLE logs (
  event_id integer,
  old_title varchar(255),
  old_starts timestamp,
  old_ends timestamp,
  logged_at timestamp DEFAULT current_timestamp
);
A logs table
Continued
CREATE OR REPLACE FUNCTION log_event() RETURNS trigger AS $$
BEGIN
  INSERT INTO logs (event_id, old_title, old_starts, old_ends)
  VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends);
  RAISE NOTICE 'Someone just changed event #%', OLD.event_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
A function to insert old data into the log
Continued… a trigger
CREATE TRIGGER log_events
AFTER UPDATE ON events
FOR EACH ROW EXECUTE PROCEDURE log_event();
Logs changes after any row is updated
Rules
• A RULE is a description of how to alter the parsed query tree.
• Every time Postgres runs an SQL statement, it parses the statement into a query tree (generally called an abstract syntax tree).
Back to PostgreSQL
Page 32 of Seven Databases, onward
Fuzzy searches, full text searching
Postgres and spatial data
• For manipulating 2D/3D spatial data
• Points, lines, and polygons formed from points and lines
• Can perform union and intersection operations
• Can project shapes into 2D areas
• Has a 3D geometry type (relatively new)
• Can calculate accurate distances in meters
• Works with an open source server that allows folks to share geospatial data
• Command line interface
• Also supports some forms of raster data
• Provides spatial indices
• Has a notion of a geometric column
Queries
SELECT superhero.name FROM city, superhero WHERE ST_Contains(city.geom, superhero.geom) and city.name = 'Gotham';
SELECT ST_AsBinary(r.the_geom) AS wkb_geometry FROM river AS r, state AS s WHERE ST_Intersects(r.the_geom, s.the_geom);
Mapnik
• Mapnik is an open source system for rendering maps
• Used for OSM (Open Street Map) data
• Used to design maps
• Written in C++
• It renders maps from PostGIS databases
Next: full text and approximate text search
• But first, not to be confused with the LIKE operator
• Uses % as the wildcard
• Or with regular expressions for character string comparison
Full text search
• First, you index the words in a document and create an array of lexemes
• Second, specify a boolean phrase using and, or, not, and parens
• We typically don’t index “stop” words like and, or, the, etc.
• Dictionaries are used to find roots of related words, like dead and dying
• Thesaurus dictionaries are used for recognition of domain-specific and similar words
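A toy Python sketch of the idea above, with a hypothetical stop-word list and a deliberately crude stemmer standing in for real dictionaries; this is not how PostgreSQL's tsvector machinery actually works:

```python
# Toy full-text indexing: remove stop words, stem words to crude "lexemes",
# then evaluate a boolean AND query against the resulting index.
STOP_WORDS = {"and", "or", "the", "a", "to"}

def lexemes(text):
    """Index a document: lowercase, drop stop words, strip a plural 's'."""
    words = text.lower().split()
    return {w.rstrip("s") for w in words if w not in STOP_WORDS}

def matches_all(doc, terms):
    """Boolean AND search, in the spirit of to_tsvector(doc) @@ to_tsquery('a & b')."""
    return set(terms) <= lexemes(doc)

doc = "How to create the tables and indexes"
```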
Documents
• A document is a text attribute in a row of a table
• Often we use part of a document or concatenate various parts of documents
Details: dictionaries
• Define stop words that should not be indexed
• Map synonyms to a single word.
• Map phrases to a single word using a thesaurus.
• Map different variations of a word to a canonical form
Searching
• Uses a match operator - @@
• Basic search asks how a query vector of words relates to a given document, which is also a vector
• The vector can have and, or, etc. in it
• tsvector – the document, as normalized lexemes
• tsquery – the query
Examples
SELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;
Selects the ten most recent documents that contain “create” and “table” in the title or body
Results can be ranked
Recent addition: fuzziness
• soundex(text) returns text
• Converts a string to its Soundex code
• Based on pronunciation
• difference(text, text) returns int
• Converts two strings to their Soundex codes and then reports the number of matching code positions
• 0 means no match
• 4 means a full match
• Def: a phonetic coding system intended to suppress spelling variation and to determine the relationship between two (similar) words
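A simplified Python implementation of soundex and difference, assuming the standard American coding table. It glosses over some H/W edge cases, so it may disagree with PostgreSQL's fuzzystrmatch module on rare inputs:

```python
def soundex(word):
    """Standard American Soundex: first letter + up to three digit codes."""
    word = word.upper()
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    digits = []
    prev = codes.get(word[0], "")
    for c in word[1:]:
        d = codes.get(c, "")
        if d and d != prev:      # skip letters coded like the previous one
            digits.append(d)
        if c not in "HW":        # H and W do not break a run of equal codes
            prev = d             # vowels reset prev, so codes can repeat
    return (word[0] + "".join(digits) + "000")[:4]

def difference(a, b):
    """Number of matching Soundex code positions, 0 (none) to 4 (all)."""
    return sum(x == y for x, y in zip(soundex(a), soundex(b)))
```

For example, soundex("Robert") and soundex("Rupert") both come out R163, which is the point: spelling variation is suppressed.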
Levenshtein
• Levenshtein distance is a metric for evaluating the difference between two sequences, in particular, words
• E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL'); – returns 2
• E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10
• Used in particular, to detect nicknames
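Levenshtein distance is easy to state as a dynamic program; a compact Python version (a sketch, not PostgreSQL's C implementation):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute (or match)
        prev = cur
    return prev[-1]
```

GUMBO becomes GAMBOL with one substitution (U to A) and one insertion (L), so the distance is 2.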
Metaphone
• E.g., metaphone(text source, int max_output_length) returns text
• Similar to soundex
• Used to classify words according to their English pronunciation
• Apparently better for non-English languages, compared to Soundex
• E.g.: SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2) should detect the similarity to the name William
Class project, beginning
• Build an application on top of
• PostgreSQL or MySQL
• And one of the other NoSQL databases in the 3 books
• But pick from only key/value, key/document, or column-based databases
• The application is written in a language of your choice
• Each of the databases must be used to manage the kind of data it is intended for
• Traditional relational table data
• And nontraditional data
Final grades
• Choice 1: each of exam 1, exam 2, project are 1/3 of your final grade
• Choice 2: if you build a web app for your project, I will use the best two of three grades
NoSQL DBs
• Why?
Relational DBs
• SQL - Fixed schema, row oriented & optimized
• SQL - Rigid two-phase transactions, with locking bottleneck
• SQL - Set theoretic
• SQL - Centralized distribution
• SQL - Computational, not navigational/inter-connected & set-oriented
• SQL - Poor support for heterogeneity & compression
No SQL - no or not only
• Column-oriented - HBase (uses column families and no schema, has versioning and consistent transactions)
• Key/value pairs - Amazon Dynamo
• Graph like - Neo4J
• Document based - MongoDB (cluster based for huge scale, supports nested docs, and uses JavaScript for queries, and no schema)
But remember -
• Categories not distinct - take each one for what it is
• Heterogeneous structure & polyglot language environment is common
• NoSQL DBs tend to be unsupported with funky GUIs - but there are very active volunteer user bases maintaining and evolving them
• NoSQL DBs also tend to use programming languages for queries
When do you want non-2P transactions and no SQL?
• Interactive, zillion user apps where user fixes errors via some form of compensation
• Minimal interconnectedness
• Individual data values are not mission-critical
• Read-heavy environments
• Cloud -based environments
• Queries are not set-oriented & are computational and imperative, and perhaps long
• Real time apps
SQL is here to stay...
• Formal & unambiguous semantics
• Declarative language with clean separation of application and queries
• Consistent
• Flexible
• Black boxed, tested, and supported - and very well understood, with many thousands of trained programmers - SQL is a basic language, like Java, JavaScript, PHP, C#, etc.
• Great GUIs that are very rich and debugged
And importantly...
• Lots of apps need clean, well understood stacks, not speed or the cloud
• In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates
• Relational DBs scale reasonably well, too, at least in non-cloud environments
Again…
• The classification of the various nosql databases is imprecise, semi-controversial, and we have to be careful about reading too much into it.
• Rather than focusing on categorizing dbs, we should be concerned with what they do, how they relate to each other with respect to functionality, and how they compare to sql databases.
Key-value and key-document DBs
• Databases that access aggregate data
• Key-value dbs know nothing about the structure of the aggregate
• Key-document databases do know, but the interpretation of these aggregates happens outside the db
• Keep in mind that these two categories of databases overlap in practice
• Importantly, both of these database categories focus on storing and retrieving individual aggregates, and not on interrelating (horizontally) multiple aggregates
• There is something similar to this in SQL DBs – and that is highly un-normalized tables
Important notions…
• It can be a difficult problem to represent some domains as key-value or key-document databases, as the boundaries of aggregates might not be easy to determine.
• This basic data modeling issue has a lot of influence on the sort of database you should use.
• Relational databases don’t manipulate aggregates, but they are aggregate neutral for the most part, leaving the construction of aggregates to run time … but we might have hidden, un-normalized tables that make some commonly used aggregates much faster to materialize
Key-value vs. key-document
• In key-value databases, we can only retrieve data via a key
• In key-document databases, we may be able to ask questions about the content of documents – but again, we are not cross-associating them
• Mongo is perhaps the most talked about key-document system, and so we will start there
Installing Mongo
• Mongo
• http://docs.mongodb.org/manual/installation
• A GUI
• http://www.mongodb.org/display/DOCS/Admin+UIs
GUIs for Mongo
• There are a few GUIs that seem pretty good
• Mongo-vision: http://code.google.com/p/mongo-vision/ (web page)
• Needs Prudence as a web server
• MongoVue: http://mongovue.com, but Windows only
• RockMongo (web based): http://rockmongo.com/ (web page)
• Needs an Apache web server
• Mongo itself is very easy to install, just download it
• http://docs.mongodb.org/manual/installation
Getting an Apache web server
• XAMPP for Windows (the Mac version is way out of date)
• MAMP for Macs (on the App Store)
• WAMP for Windows (bitnami.org)
• All of these give you PHP and MySQL as well. If we have time, we will look at MySQL full text search.
• You might want to install PostgreSQL, too. There is a bitnami stack. If there is time, we will look at PostgreSQL UDTs and full text search.
Mongo overview
• Document based
• Focuses on clusters for extremely large scaling
• Supports nested documents
• Uses JavaScript for queries
• No schema
Terminology
• A database consists of collections
• Collections are made up of documents
• A document is made up of fields
• There are also indices
• There are also cursors
When to use Mongo
• Medical records and other large document systems
• Read heavy environments like analytics and mining
• Partnered with relational databases
• Relational for live data
• Mongo for huge, largely read-only archives
• Online applications
• Massively wide e-commerce
Mongo documents and queries
• Documents
• Self-defining, with hierarchical structure
• Like XML
• Or JSON, which uses JavaScript notation to define docs in a human-readable form
• Documents can vary in structure, even in the same collection
• You can add attributes to new documents in a collection without having to change the existing ones in the collection
• Queries: db.order.find({"customerId": "99"})
Consistency and transactions
• There is a tailorable consistency command that can be used to set the level you want for updating replicas of documents
• No multi-document atomic transactions are supported
• CAP theorem, which basically says there is a tradeoff between availability and consistency
• Consistent, available, partition tolerant
• The C and A are the big ones
• Eventual enforcement of consistency is the key
• You can embed references to other documents in a document, but this tends to create a “join effect”
• DBRef is the command
Selectors
• Used for finding, counting, updating, and removing docs from collections
• {} is the null search and matches all documents
• We could run: {gender: 'f'}
• {field1: value1, field2: value2} creates an ‘and’ operation
• Also, less than, greater than, etc. (e.g., $gt)
• $exists, $or
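The selector semantics above can be mimicked with a toy in-memory matcher in Python. The matches helper is hypothetical and supports only equality, $gt, $exists, and $or; it is not how Mongo evaluates selectors internally:

```python
def matches(doc, selector):
    """Return True if a document dict satisfies a Mongo-style selector dict."""
    if "$or" in selector:
        return any(matches(doc, s) for s in selector["$or"])
    for field, cond in selector.items():
        if isinstance(cond, dict):               # operator form, e.g. {"$gt": 30}
            if "$exists" in cond and (field in doc) != cond["$exists"]:
                return False
            if "$gt" in cond and not (field in doc and doc[field] > cond["$gt"]):
                return False
        elif doc.get(field) != cond:             # plain equality
            return False
    return True                                  # {} matches every document

people = [{"name": "Ann", "gender": "f", "age": 31},
          {"name": "Bob", "gender": "m", "age": 25}]

everyone = [p for p in people if matches(p, {})]                 # null search
women    = [p for p in people if matches(p, {"gender": "f"})]
over_30  = [p for p in people if matches(p, {"age": {"$gt": 30}})]
```

Note that multiple fields in one selector act as an 'and', exactly as on the slide.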
Another document DB: CouchDB
• Major focus: surviving network problems
• Engineered for web use
• No ad hoc querying, searching is via map reduce-based indices
• We will get back to CouchDB
Map Reduce
• Focus is on performing data operations on parallel hardware
• This is a paradigm, not a specific programmatic technique
• Each map reduce process has two phases
• Convert a list into a desired sort of list with the map operator
• Convert the new list into a small number of atomic values via a reduce operator
• This allows us to spread a process across a wide array of servers, with each server performing an independent map reduce process
Map-Reduce examples
So, what is it?
• A two phase process geared toward optimizing broad, widely distributed parallel computing platforms
• Apache Hadoop is an open-source implementation of MapReduce.
• MapReduce is Google’s version (and it is proprietary).
• Phases
• 1. Take a series of keys and transform them into a different series of values, generally ones that have some semantic context
• 2. Perform a second pass where the new series of values is compressed into far fewer values
In its strictest sense…
• Map-reduce is a two phase operation
• First, convert a list of data into a list of a different kind of data
• Second, turn the second list into a single scalar value or a list of scalar values, often the cardinality of the items created in the first step
Relevant computing/data application types
• For aggregate database processing, and not so much for set-oriented, and certainly not for object-based querying
• Fits well with cluster-based environments, where there are lots of opportunities for parallel processing
• Fits query patterns that calculate the cardinality of sets and the removal of duplicates
Strategy for M-R
• We try to do the computing on the machines where the data sits
• So we try to engineer the storage of data so that it accommodates the chaining of M-R operations
The key bottom line concept
• In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated
• In a database that supports M-R, we try to screen (and sometimes aggregate) data on the server where it sits
• We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data.
Another way of looking at this…
• We have seen the tradeoff between moving data and moving processing logic in the context of homogeneous distributed data
• Often, in distributed databases, it is far cheaper to ship processing logic instead of data, even if it causes extra processing to have to happen
• This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data
Example
• 1. We start with a set of person keys and map each of these to the names of the people.
• Key 1 -> Harry
• Key 2 -> Harry
• Key 3 -> Tommy
• 2. We aggregate the list of people names by counting how many unique names are in the list.
• Harry, Harry, Tommy -> 2
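The two phases of that example, written out in Python:

```python
# The slide's example: map person keys to names, then reduce the name list
# to a single scalar -- the count of unique names.
people = {1: "Harry", 2: "Harry", 3: "Tommy"}

# Map phase: transform each key into a name.
names = [people[k] for k in sorted(people)]   # ['Harry', 'Harry', 'Tommy']

# Reduce phase: collapse the list into one atomic value.
unique_names = len(set(names))                # 2
```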
What actually happens?
• Informally:
• Each key leads to a name field.
• Then, the names are isolated.
• Then, each is passed to a “mapper”, which returns the name, along with a 1.
• Then, a “reducer” takes each name and makes a list of 1’s. The reducer adds up the 1’s for each name and returns a list of (name, count) pairs.
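That informal pipeline, sketched in Python; the mapper/shuffle/reducer names are illustrative, and a real framework would distribute these steps across servers:

```python
from collections import defaultdict

def mapper(name):
    """Mapper: emit a (name, 1) pair for each input record."""
    return (name, 1)

def shuffle(pairs):
    """Group the 1's by name, as the framework does between the two phases."""
    groups = defaultdict(list)
    for name, one in pairs:
        groups[name].append(one)
    return groups

def reducer(name, ones):
    """Reducer: sum one name's list of 1's into a (name, count) pair."""
    return (name, sum(ones))

names = ["Harry", "Harry", "Tommy"]
pairs = [mapper(n) for n in names]                              # [('Harry',1),...]
counts = dict(reducer(n, ones) for n, ones in shuffle(pairs).items())
```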
From NoSQL Distilled: 1. Creating a list with a map
2. Aggregating with a reduce
3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers
4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format
5. A combiner that removes duplicate product-customer pairs
6. Concatenating a combining and a reduce (counting) operation
7. Maintaining the 1’s counts in the mapping phase
8. Adding temporal information to the map/reduce process
9. Using a reduce operator to create product per month totals
10. A second mapper that creates base year by year comparisons
11. A reduce operation combines records for a given year
Complaints
• M-R is low level.
• It is rigid.
• It exists to optimize the distributed cluster model – only.
• It demands that an application fit perfectly into the paradigm.
• It takes careful planning, and knowledge of exactly how the data will be used, to structure the database so that it optimally serves a series of map/reduce operations
• It thus does not accommodate on-the-fly browsing