nosql and nosql beginning with o-r databases. first, crud… global notions of managing persistent...

NoSQL and NOSQL Beginning with o-r databases

Upload: walter-flynn

Post on 27-Dec-2015


Page 1: NoSQL and NOSQL Beginning with o-r databases. First, CRUD… Global notions of managing persistent data, regardless of the model or system Create, Read,

NoSQL and NOSQL Beginning with o-r databases


First, CRUD…

• Global notions of managing persistent data, regardless of the model or system

• Create, Read, Update, Delete

• But there is also DDL

• And there are implementation issues, like sorts and indices


Why not standard tables?

• Extreme data structuring conflict between host language and database language:
  • Impedance mismatch
  • Atomic values are the only common data type

• To retrieve all of an object requires lots of joins

• Difficult to look for objects that are similar but have some different attributes

• Difficult to retrieve an attribute that is a collection

• You have to program with two programming languages

• We have value-based semantics: it is difficult to know whether two people have the same mother or just mothers with the same name/ID, so we are forced to make inferences


What is o-o?

• Relation is a set of tuples
  • Objects are arranged in sets of objects

• In a relation, a tuple’s components are primitive (int, string)
  • The components of an object can be complex types (sets, tuples, other objects)

• SQL: programs are global
  • Object: programs are local


Key concept: Object Id’s

• Every object has a unique Id: different objects have different Ids

• Immutable: does not change as the object changes

• Different from primary key!
  • Like a key, identifies an object uniquely
  • But key values can change – oids cannot
  • And with keys, identity rests on inferences based on values


Objects and Values

• An object is a pair: (oid, value)

• Example: Joe Public’s object

(#32, [ SSN: 111-22-3333,
        Name: “Joe Public”,
        PhoneN: {“516-123-4567”, “516-345-6789”},
        Child: {#445, #73} ] )
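The difference between identity (oid) and value can be sketched in Python. This is only an illustration under stated assumptions: the `DbObject` class, the `_next_oid` counter, and the sample people are hypothetical and not part of any ODMG binding.

```python
from dataclasses import dataclass, field
from itertools import count

_next_oid = count(1)  # hypothetical oid generator, for illustration only


@dataclass
class DbObject:
    """An object is a pair (oid, value); the oid is assigned once and never changes."""
    value: dict
    oid: int = field(default_factory=lambda: next(_next_oid), init=False)

    def same_object(self, other: "DbObject") -> bool:
        # Identity comparison: looks only at the oid
        return self.oid == other.oid

    def same_value(self, other: "DbObject") -> bool:
        # Value comparison: looks only at the attribute values
        return self.value == other.value


mother_a = DbObject({"Name": "Ann Public"})
mother_b = DbObject({"Name": "Ann Public"})  # same name, different person

joe = DbObject({"Name": "Joe Public", "Mother": mother_a.oid})
sue = DbObject({"Name": "Sue Public", "Mother": mother_a.oid})

# Joe and Sue reference the same oid, so they share a mother; no inference needed
shares_mother = joe.value["Mother"] == sue.value["Mother"]

# Updating a value leaves the oid untouched
old_oid = joe.oid
joe.value["Name"] = "Joseph Public"
```

Here the two mothers are equal by value but are still different objects, which is exactly the case a value-based key cannot distinguish.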


Classes

• Class: set of semantically similar objects (e.g., people, students, cars, motorcycles)

• A class has:
  • Type: describes common structure of all objects in the class (semantically similar objects are also structurally similar)
  • Method signatures: declarations of the operations that can be applied to all objects in the class
  • Extent: the set of all objects in the class

• Classes are organized in a class hierarchy
  • The extent of a class contains the extent of any of its subclasses
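A small Python sketch of the extent rule, assuming a hypothetical `Person`/`Student` hierarchy and a simple in-memory registry (an object database would maintain extents itself):

```python
class Person:
    _instances = []  # every Person, including subclass instances, is recorded here

    def __init__(self, name):
        self.name = name
        Person._instances.append(self)

    @classmethod
    def extent(cls):
        # The extent of a class: all objects that are instances of it,
        # which automatically includes instances of its subclasses
        return [obj for obj in Person._instances if isinstance(obj, cls)]


class Student(Person):
    def __init__(self, name, major):
        super().__init__(name)
        self.major = major


joe = Person("Joe")
ann = Student("Ann", "CS")

person_names = [p.name for p in Person.extent()]    # Joe and Ann
student_names = [s.name for s in Student.extent()]  # Ann only
```

Every student is also a person, so the extent of `Student` is a subset of the extent of `Person`.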


The ODMG Standard

• ODMG 3.0 was released in 2000

• Includes the data model (more or less)

• ODL: The object definition language

• OQL: The object query language

• A transaction specification mechanism

• Language bindings: How to access an ODMG database from C++, Smalltalk, and Java (expect C# to be added to the mix)


Main Idea: Host Language = Data Language

• Objects in the host language are mapped directly to database objects

• Some objects in the host program are persistent.


The Structure of an ODMG Application


Objects in SQL

• Object-relational extension of SQL-92

• Includes the legacy relational model

• SQL database = a finite set of relations

• relation = a set of tuples (extends legacy relations) OR a set of objects (completely new)

• object = (oid, tuple-value)

• tuple = tuple-value

• tuple-value = [Attr1: v1, …, Attrn: vn]

• multiset-value = {v1, …, vn }


Path expressions

SELECT T.Student.Name, T.Grade
FROM TRANSCRIPT T
WHERE T.Student.Address.Street = 'Main St.'
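The same navigation can be sketched in a host language. This is a loose Python analogy, not object-relational SQL; the `Address`, `Student`, and `TranscriptRow` classes and the sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Address:
    street: str


@dataclass
class Student:
    name: str
    address: Address


@dataclass
class TranscriptRow:
    student: Student  # a reference to an object, not a foreign-key value
    grade: str


transcript = [
    TranscriptRow(Student("Ann", Address("Main St.")), "A"),
    TranscriptRow(Student("Bob", Address("Oak Ave.")), "B"),
]

# Follow the path student -> address -> street without writing any joins
result = [
    (t.student.name, t.grade)
    for t in transcript
    if t.student.address.street == "Main St."
]
```

In a purely relational schema the same question would need joins across the transcript, student, and address tables.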


PostgreSQL vs. MySQL

• PostgreSQL is a generation newer
  • It has nice UDT capabilities
  • There are libraries of UDTs that can be imported and used

• Both PostgreSQL and MySQL
  • Full text search
  • XML data types
  • To some degree free

• MySQL
  • Never underestimate the value of a heavily understood piece of software
  • Lots of stacks and development environments come configured to work with it (but to a lesser extent, this is true of PostgreSQL, too).
  • It is a “core” SQL database, in that we can move pretty much to any other server-based DBMS if we start with MySQL


Triggers in PostgreSQL

• Triggers automatically fire stored procedures when some event happens, like an insert or update. They allow the database to enforce some required behavior in response to changing data.

• PL/pgSQL – Procedural Language of PostgreSQL


Example

CREATE TABLE logs (
  event_id integer,
  old_title varchar(255),
  old_starts timestamp,
  old_ends timestamp,
  logged_at timestamp DEFAULT current_timestamp
);

A logs table


Continued

CREATE OR REPLACE FUNCTION log_event() RETURNS trigger AS $$
DECLARE
BEGIN
  INSERT INTO logs (event_id, old_title, old_starts, old_ends)
  VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends);
  RAISE NOTICE 'Someone just changed event #%', OLD.event_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

A function to insert old data into the log


Continued… a trigger

CREATE TRIGGER log_events
  AFTER UPDATE ON events
  FOR EACH ROW EXECUTE PROCEDURE log_event();

Logs changes after any row is updated
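To try the idea without a PostgreSQL server, here is a rough analog using Python's built-in `sqlite3` module. It is a sketch, not the PostgreSQL mechanism: SQLite has no PL/pgSQL, so the trigger body is written inline, and the `events` table layout is an assumption inferred from the `OLD.*` references above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed layout of the events table (not shown in the slides)
CREATE TABLE events (event_id INTEGER PRIMARY KEY, title TEXT, starts TEXT, ends TEXT);

CREATE TABLE logs (
    event_id   INTEGER,
    old_title  TEXT,
    old_starts TEXT,
    old_ends   TEXT,
    logged_at  TEXT DEFAULT CURRENT_TIMESTAMP
);

-- SQLite puts the trigger body inline instead of calling a stored procedure
CREATE TRIGGER log_events AFTER UPDATE ON events
FOR EACH ROW
BEGIN
    INSERT INTO logs (event_id, old_title, old_starts, old_ends)
    VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends);
END;
""")

conn.execute(
    "INSERT INTO events VALUES (1, 'Party', '2012-02-15 17:30', '2012-02-15 19:30')"
)
conn.execute("UPDATE events SET title = 'Big Party' WHERE event_id = 1")

# The update fired the trigger, so the old title was captured in logs
logged = conn.execute("SELECT event_id, old_title FROM logs").fetchall()
```

The application code only issues the `UPDATE`; the logging happens inside the database, which is the point of a trigger.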


Rules

• A RULE is a description of how to alter the parsed query tree.

• Every time Postgres runs an SQL statement, it parses the statement into a query tree (generally called an abstract syntax tree).


Back to PostgreSQL

Page 32 of Seven Databases, onward

Fuzzy searches, full text searching


Postgres and spatial data

• For manipulating 2D/3D spatial data
• Points, lines, and polygons formed from points and lines
• Can perform union and intersection operations
• Can project shapes into 2D areas
• Has a 3D geometry type (relatively new)
• Can calculate accurate distances in meters
• Works with an open source server that allows folks to share geospatial data
• Command line interface
• Also supports some forms of raster data
• Provides spatial indices
• Has a notion of a geometric column


Queries

SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom) AND city.name = 'Gotham';

SELECT ST_AsBinary(r.the_geom) AS wkb_geometry
FROM river AS r, state AS s
WHERE ST_Intersects(r.the_geom, s.the_geom);


Mapnik

• Used for OSM (OpenStreetMap) data and uses PostGIS
• Mapnik is an open source system for rendering maps
• Used to design maps
• Written in C++
• It renders maps from PostGIS databases


Next: full text and approximate text search

• But first, not to be confused with the LIKE operator
  • Uses % as the wildcard

• Or with regular expressions for character string comparison


Full text search

• First, you index the words in a document and create an array of lexemes

• Second, specify a boolean phrase using and, or, not, and parens

• We typically don’t index “stop” words like and, or, the, etc.

• Dictionaries are used to find roots of related words, like dead and dying

• Thesaurus dictionaries are used to recognize domain-specific and similar words


Documents

• A document is a text attribute in a row of a table

• Often we use part of a document or concatenate various parts of documents


Details: dictionaries

• Define stop words that should not be indexed

• Map synonyms to a single word.

• Map phrases to a single word using a thesaurus.

• Map different variations of a word to a canonical form


Searching

• Uses a match operator - @@

• A basic search asks how a vector of words (the query) relates to a given document, which is also a vector
  • The query vector can have and, or, etc. in it
  • tsvector – document – normalized lexemes
  • tsquery – query


Examples

SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;

Select the ten most recent documents that contain create and table in the title or body

Results can be ranked
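The two steps (normalize a document into lexemes, then test a boolean query against them) can be sketched in plain Python. This is a toy stand-in, not how PostgreSQL implements `to_tsvector` or `@@`: the stop-word list, the naive plural stripping, and the helper names `to_lexemes` and `matches_all` are all assumptions made for illustration.

```python
import re

STOP_WORDS = {"a", "an", "and", "or", "the", "to", "in", "of"}  # tiny illustrative list


def to_lexemes(text):
    """Crude stand-in for to_tsvector: lowercase, split, drop stop words, strip a plural 's'."""
    lexemes = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]  # naive stemming, only for illustration
        lexemes.add(word)
    return lexemes


def matches_all(document, query_terms):
    """Crude stand-in for an AND query such as 'create & table': every term must be present."""
    doc_lexemes = to_lexemes(document)
    return all(to_lexemes(term) <= doc_lexemes for term in query_terms)


docs = [
    "How to create tables in PostgreSQL",
    "Dropping a table",
    "Create an index",
]

hits = [d for d in docs if matches_all(d, ["create", "table"])]
```

Only the first document matches: "tables" is normalized to the lexeme "table", while the other two documents each lack one of the query terms.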


Recent addition: fuzziness

• soundex(text) returns text
  • Converts a string to its Soundex code
  • Based on pronunciation

• difference(text, text) returns int
  • Converts two strings to their Soundex codes and then reports the number of matching code positions
  • 0 is no match
  • 4 is a full match

• Def: Soundex is a phonetic coding system intended to suppress spelling variation and to determine the relationship between two (similar) words
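To make the coding concrete, here is a sketch of the classic American Soundex rules in Python, with a `difference` helper that counts matching code positions. It illustrates the technique under the usual textbook rules and is not PostgreSQL's own implementation, which may treat some edge cases differently.

```python
# Letter -> digit table of classic Soundex; vowels, H, W, and Y have no code
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of `word` (empty string if no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # but its code still suppresses a repeat
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":
            prev = code                 # vowels reset the previous code; H and W do not
    return (result + "000")[:4]         # pad with zeros, truncate to 4 characters


def difference(a, b):
    """Number of positions (0 to 4) at which the two Soundex codes agree."""
    return sum(1 for x, y in zip(soundex(a), soundex(b)) if x == y)


robert = soundex("Robert")            # "R163"
rupert = soundex("Rupert")            # "R163"
close = difference("Anne", "Andrew")  # codes A500 and A536 agree in 2 positions
```

Names that sound alike collapse to the same code, so an equality test on codes (or a high `difference` score) finds them even when the spellings differ.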


Levenshtein

• Levenshtein distance is a metric for evaluating the difference between two sequences, in particular, words

• E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL');

• E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10

• Used, in particular, to detect nicknames
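The metric itself is short enough to write out. The following Python sketch uses the standard dynamic-programming formulation (insert, delete, and substitute each cost 1); it is an illustration of the definition, not the code PostgreSQL runs.

```python
def levenshtein(s, t):
    """Minimum number of single-character inserts, deletes, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                  # turning s[:i] into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


gumbo = levenshtein("GUMBO", "GAMBOL")  # 2: substitute U -> A, then append L

# Filtering candidates by a distance threshold, as in the WHERE clause above
names = ["William", "Willem", "Wilma", "Robert"]
near_willem = [n for n in names if levenshtein(n, "Willem") <= 3]
```

The threshold plays the same role as `<= 3` in the SQL example: smaller values give stricter matching.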


Metaphone

• E.g., metaphone(text source, int max_output_length) returns text

• Similar to soundex

• Used to classify words according to their English pronunciation

• Apparently better for non-English languages, compared to soundex

• E.g.: SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2) should detect similarity to the name William


Class project, beginning

• Build an application on top of
  • PostgreSQL or MySQL
  • And one of the other NoSQL databases in the 3 books
  • But pick from only key/value, key/document, or column-based databases

• The application is written in a language of your choice

• Each of the databases must be used to manage the kind of data it is intended for
  • Traditional relational table data
  • And nontraditional data


Final grades

• Choice 1: each of exam 1, exam 2, project are 1/3 of your final grade

• Choice 2: if you build a web app for your project, I will use the best two of three grades


NoSQL DBs

• Why?


Relational DBs

• SQL - Fixed schema, row oriented & optimized

• SQL - Rigid two-phase transactions, with locking bottleneck

• SQL - Set theoretic

• SQL - Centralized distribution

• SQL - Computational, not navigational/inter-connected & set-oriented

• SQL - Poor support for heterogeneity & compression


No SQL - no or not only

• Column-oriented - HBase (uses column families and no schema, has versioning and consistent transactions)

• Key/value pairs - Amazon Dynamo

• Graph like - Neo4J

• Document based - MongoDB (cluster based for huge scale, supports nested docs, and uses JavaScript for queries, and no schema)


But remember -

• Categories not distinct - take each one for what it is

• Heterogeneous structure & polyglot language environment is common

• NoSQL DBs tend to be unsupported with funky GUIs - but there are very active volunteer user bases maintaining and evolving them

• NoSQL DBs also tend to use programming languages for queries


When do you want non-2P transactions and no SQL?

• Interactive, zillion user apps where user fixes errors via some form of compensation

• Minimal interconnectedness

• Individual data values are not mission-critical

• Read-heavy environments

• Cloud-based environments

• Queries are not set-oriented & are computational and imperative, and perhaps long

• Real time apps


SQL is here to stay...

• Formal & unambiguous semantics

• Declarative language with clean separation of application and queries

• Consistent

• Flexible

• Black boxed, tested, and supported - and very well understood with many thousands of trained programmers - SQL is a basic language, like Java, JavaScript, PHP, C#, etc.

• Great GUIs that are very rich and debugged


And importantly...

• Lots of apps need clean, well understood stacks, not speed or the cloud

• In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates

• Relational DBs scale reasonably well, too, at least in non-cloud environments


Again…

• The classification of the various nosql databases is imprecise, semi-controversial, and we have to be careful about reading too much into it.

• Rather than focusing on categorizing dbs, we should be concerned with what they do, how they relate to each other with respect to functionality, and how they compare to sql databases.


Key-value and key-document DBs

• Databases that access aggregate data
  • Key-value dbs know nothing about the structure of the aggregate
  • Key-document databases do know, but the interpretation of these aggregates happens outside the db
  • Keep in mind that these two categories of databases overlap in practice

• Importantly, both of these database system categories focus on storing and retrieving individual aggregates, and not on interrelating (horizontally) multiple aggregates

• There is something similar to this in SQL DBs – and that is highly un-normalized tables


Important notions…

• It can be a difficult problem to represent some domains as key-value or key-document databases, as the boundaries of aggregates might not be easy to determine.

• This basic data modeling issue has a lot of influence on the sort of database you should use.

• Relational databases don’t manipulate aggregates, but they are aggregate neutral for the most part, leaving the construction of aggregates to run time … but we might have hidden, un-normalized tables that make some commonly used aggregates much faster to materialize


Key-value vs. key-document

• In key-value databases, we can only retrieve data via a key

• In key-document databases, we may be able to ask questions about the content of documents – but again, we are not cross-associating them

• Mongo is perhaps the most talked about key-document system, and so we will start there


Installing Mongo

• Mongo
  • http://docs.mongodb.org/manual/installation

• A GUI
  • http://www.mongodb.org/display/DOCS/Admin+UIs


GUIs for Mongo

• There are a few GUIs that seem pretty good
  • Mongo-vision: http://code.google.com/p/mongo-vision/ (web page)
    • Needs Prudence as a web server
  • MongoVue: http://mongovue.com, but Windows only
  • RockMongo (web based): http://rockmongo.com/ (web page)
    • Needs an Apache web server

• Very easy to install, just download
  • http://docs.mongodb.org/manual/installation


Getting an Apache web server

• XAMPP for Windows (Mac version is way out of date)

• MAMP for Macs (on the app store)

• WAMP for Windows (bitnami.org)

• All of these give you PHP and MySQL as well. If we have time, we will look at MySQL full text search.

• You might want to install PostgreSQL, too. There is a bitnami stack. If there is time, we will look at PostgreSQL UDTs and full text search.


Mongo overview

• Document based

• Focuses on clusters for extremely large scaling

• Supports nested documents

• Uses JavaScript for queries

• No schema


Terminology

• A database consists of collections

• Collections are made up of documents

• A document is made up of fields

• There are also indices

• There are also cursors


When to use Mongo

• Medical records and other large document systems

• Read heavy environments like analytics and mining

• Partnered with relational databases
  • Relational for live data
  • Mongo for huge largely read only archives

• Online applications

• Massively wide e-commerce


Mongo documents and queries

• Documents
  • Self-defining, with hierarchical structure
  • Like XML
  • Or JSON, which uses JavaScript syntax to define docs in a human-readable form

• Documents can vary in structure, even in the same collection

• You can add attributes to new documents in a collection without having to change the existing ones in the collection

• Queries: db.order.find({"customerId": "99"})


Consistency and transactions

• There is a tailorable consistency command that can be used to set the level you want for updating replicas of documents

• No multi-document atomic transactions are supported (true when these slides were written; MongoDB 4.0 and later added them)

• CAP theorem, which basically says there is a tradeoff between availability and consistency
  • Consistent, available, partition tolerant
  • The C and A are the big ones
  • Eventual enforcement of consistency is the key

• You can embed references to other documents in a document, but this tends to create a “join effect”
  • DBRef is the convention for this


Selectors

• Used for finding, counting, updating, and removing docs from collections

• {} is the null search and matches all documents

• We could run: {gender: 'f'}

• {field1: value1, field2: value2} creates an ‘and’ operation

• Also, less than, greater than, etc. (e.g., $gt)

• $exists, $or
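How a selector is evaluated against a document can be sketched in Python. This is a toy matcher covering only the operators named above plus `$lt`; the `matches` and `find` helpers and the sample `people` collection are hypothetical, and real MongoDB selectors support far more.

```python
def matches(doc, selector):
    """Return True if `doc` satisfies a (small subset of a) Mongo-style selector."""
    for key, cond in selector.items():
        if key == "$or":
            # At least one of the sub-selectors must match
            if not any(matches(doc, sub) for sub in cond):
                return False
        elif isinstance(cond, dict):
            # Operator form, e.g. {"age": {"$gt": 40}}
            for op, arg in cond.items():
                if op == "$exists":
                    if (key in doc) != arg:
                        return False
                elif op == "$gt":
                    if key not in doc or not doc[key] > arg:
                        return False
                elif op == "$lt":
                    if key not in doc or not doc[key] < arg:
                        return False
                else:
                    raise ValueError(f"unsupported operator: {op}")
        elif doc.get(key) != cond:
            # Plain form, e.g. {"gender": "f"}; several fields are implicitly AND-ed
            return False
    return True


people = [
    {"name": "Ann", "gender": "f", "age": 31},
    {"name": "Bob", "gender": "m", "age": 45},
    {"name": "Cy", "age": 19},  # documents in a collection may differ in structure
]


def find(selector):
    return [d["name"] for d in people if matches(d, selector)]


everyone = find({})
women = find({"gender": "f"})
older_men = find({"gender": "m", "age": {"$gt": 40}})
no_gender = find({"gender": {"$exists": False}})
either = find({"$or": [{"gender": "f"}, {"age": {"$lt": 20}}]})
```

The empty selector matches every document because there is no condition that can fail.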


Another document DB: CouchDB

• Major focus: surviving network problems

• Engineered for web use

• No ad hoc querying, searching is via map reduce-based indices

• We will get back to CouchDB


Map Reduce

• Focus is on performing data operations on parallel hardware

• This is a paradigm, not a specific programmatic technique

• Each map reduce process has two phases
  • Convert a list into a desired sort of list with the map operator
  • Convert the new list into a small number of atomic values via a reduce operator

• This allows us to spread a process across a wide array of servers, with each server performing an independent map reduce process


Map-Reduce examples


So, what is it?

• A two phase process geared toward optimizing broad, widely distributed parallel computing platforms

• Apache Hadoop is an open source MapReduce framework with its own distributed file system (HDFS).

• MapReduce is Google’s version (and it is proprietary).

• Phases
  • 1. Take a series of keys and transform them into a different series of values, generally, ones that have some semantic context
  • 2. Perform a second pass where the new series of values are compressed into far fewer values


In its strictest sense…

• Map-reduce is a two phase operation
  • First, convert a list of data into a list of a different kind of data
  • Second, turn the second list into a single scalar value or a list of scalar values, often the cardinality of the items created in the first step


Relevant computing/data application types

• For aggregate database processing, and not so much for set-oriented, and certainly not for object-based querying

• Fits well with cluster-based environments, where there are lots of opportunities for parallel processing

• Fits query patterns that calculate the cardinality of sets and the removal of duplicates


Strategy for M-R

• We try to do the computing on the machines where the data sits

• So we try to engineer the storage of data so that it accommodates the chaining of M-R operations


The key bottom line concept

• In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated

• In a database that supports M-R, we try to screen (and sometimes aggregate) data on the server where it sits

• We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data.


Another way of looking at this…

• We have seen the tradeoff between moving data and moving processing logic in the context of homogeneous distributed databases

• Often, in distributed databases, it is far cheaper to ship processing logic instead of data, even if it causes extra processing to have to happen

• This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data


Example

• 1. We start with a set of person keys and map each of these to the names of the people.
  • Key 1 -> Harry
  • Key 2 -> Harry
  • Key 3 -> Tommy

• 2. We aggregate the list of people names by counting how many unique names are in the list.
  • Harry, Harry, Tommy -> 2


What actually happens?

• Informally:
  • Each key leads to a name field.
  • Then, the names are isolated.
  • Then, each is passed to a “mapper”, which returns the name, along with a 1.
  • Then, a “reducer” takes each name and makes a list of 1’s. The reducer adds up the 1’s for each name and returns a list of (name, count) pairs.
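The informal steps can be traced in a few lines of Python. This is a single-process sketch of the Harry/Harry/Tommy example; the `mapper`, `shuffle`, and `reducer` function names are illustrative, and a real framework would run the mappers and reducers on different machines.

```python
from collections import defaultdict

people = {1: "Harry", 2: "Harry", 3: "Tommy"}  # person key -> name, as in the example


def mapper(key):
    # Emit (name, 1) for the record behind this key
    return (people[key], 1)


def shuffle(pairs):
    # Group the emitted 1's by name, as the framework does between the two phases
    groups = defaultdict(list)
    for name, one in pairs:
        groups[name].append(one)
    return groups


def reducer(name, ones):
    # Add up the 1's for a single name
    return (name, sum(ones))


mapped = [mapper(k) for k in people]
grouped = shuffle(mapped)
counts = sorted(reducer(name, ones) for name, ones in grouped.items())

# The number of (name, count) pairs is the number of unique names
unique_names = len(counts)
```

Each reducer call looks at only one name, which is what allows the work to be split across servers.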


From NoSQL Distilled: 1. Creating a list with a map


2. Aggregating with a reduce


3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers


4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format


5. A combiner that removes duplicate product-customer pairs


6. Concatenating a combining and a reduce (counting) operation
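Steps 5 and 6 can be sketched in Python as follows. The figures from the book are not reproduced in this transcript, so this is a guess at their intent rather than the book's own code: the two mapper nodes, the sample pairs, and the function names `combine` and `reduce_count` are all hypothetical.

```python
from collections import defaultdict

# (product, customer) pairs emitted by the mappers on two hypothetical nodes
node_outputs = [
    [("tea", "ann"), ("tea", "ann"), ("coffee", "bob")],
    [("tea", "bob"), ("tea", "ann"), ("coffee", "bob")],
]


def combine(pairs):
    # Runs on each node: drop duplicate pairs before they cross the network.
    # The output has the same (product, customer) shape as the input.
    return sorted(set(pairs))


def reduce_count(pairs):
    # Runs after the combined pairs are gathered: count distinct customers per product
    customers = defaultdict(set)
    for product, customer in pairs:
        customers[product].add(customer)
    return {product: len(names) for product, names in customers.items()}


combined = [combine(pairs) for pairs in node_outputs]
shipped = [pair for pairs in combined for pair in pairs]  # what actually moves between nodes
customers_per_product = reduce_count(shipped)
```

The combiner shrinks what each node ships (five pairs instead of six here), and because its output has the same shape as its input, the reducer works the same whether or not a combiner ran.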


7. Maintaining the 1’s counts in the mapping phase


8. Adding temporal information to the map/reduce process


9. Using a reduce operator to create product per month totals


10. A second mapper that creates base year by year comparisons


11. A reduce operation combines records for a given year


Complaints

• M-R is low level.

• It is rigid.

• It exists to optimize the distributed cluster model – only.

• It demands that an application fit perfectly into the paradigm.

• It takes careful planning and knowledge of exactly how the data will be used to structure the database to optimally serve a series of map/reduce operations

• It thus does not accommodate on-the-fly browsing