Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases Lecture 12 PD Dr. Andreas Behrend [email protected] Acknowledgements I am indebted to Prof. Dr.-Ing. Sebastian Michel, Prof. Johan Gamper, and Dr. Holubova for providing me slides.


Post on 13-Jul-2020


Page 1: Big Data Management and NoSQL Databases Data...What is Big Data? IBM: Depending on the industry and organization, Big Data encompasses information from internal and external sources

Big Data Management and NoSQL Databases Lecture 12 PD Dr. Andreas Behrend [email protected]

Acknowledgements

I am indebted to Prof. Dr.-Ing. Sebastian Michel, Prof. Johan Gamper, and Dr. Holubova for providing me with slides.


What is Big Data?

buzzword? bubble? gold rush? revolution?

“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”

Dan Ariely


What is Big Data?

• No standard definition
• First occurrence of the term: High Performance Computing (HPC)

Gartner: “Big Data” is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.

3 (4, 5) Vs of Big Data: Volume, Variety, Velocity


What is Big Data?

IBM: Depending on the industry and organization, Big Data encompasses information from internal and external sources such as transactions, social media, enterprise content, sensors, and mobile devices. Companies can leverage data to adapt their products and services to better meet customer needs, optimize operations and infrastructure, and find new sources of revenue.

http://www.ibmbigdatahub.com/

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)


Big Data Characteristics: Volume (Scale)

http://www.ibmbigdatahub.com/

Data volume is increasing exponentially, not linearly

[Figure: data volume scale with labels 10^9, 10^12, 10^18, 10^21]


Big Data Characteristics: Variety (Complexity)

http://www.ibmbigdatahub.com/

• Various formats, types, and structures (from semi-structured XML to unstructured multimedia)
• Static data vs. streaming data


Big Data Characteristics: Velocity (Speed)

http://www.ibmbigdatahub.com/

Data is being generated fast and needs to be processed fast

Online Data Analytics


Big Data Characteristics: Veracity (Uncertainty)

http://www.ibmbigdatahub.com/

Uncertainty due to inconsistency, incompleteness, latency, ambiguities, or approximations.


Some Numbers as of 2015

Estimated Size of Data
• Google: 15 000 PB (= 15 Exabytes)
• Facebook: 300 PB
• Ebay: 90 PB
• Spotify: 10 PB

Data Processed per Day
• Google: 100 PB
• Ebay: 100 PB
• NSA: 29 PB
• Facebook: 600 TB
• Twitter: 100 TB
• Spotify: 2.2 TB

MB = 10^6 Bytes
GB = 10^9 Bytes
TB (Terabyte) = 10^12 Bytes
PB (Petabyte) = 10^15 Bytes
EB (Exabyte) = 10^18 Bytes


What Does the Data Look Like?
• Not necessarily what you are used to from database lectures: usually not nicely structured (BCNF or 3NF) relations with known schema information.
• But:
– Twitter Tweets
– Server Access Logs
– Web Pages
– Web Graph
– Huge CSV files in general (e.g., holding a “relation”)


{"created_at":"Wed Jan 21 15:21:04 +0000 2015","id":557920823764586496,"id_str":"557920823764586496","text":"#TulsaAirport #Oklahoma Jan 21 08:53 Temperature 37\u00b0F clouds Wind NW 7 km\/h Humidity 85% .. http:\/\/t.co\/SnC8ST3gQC","source":"\u003ca href=\"http:\/\/www.woweather.com\/USA\/TulsaIAP.htm\" rel=\"nofollow\"\u003eupdate weather tulsa\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":255167921,"id_str":"255167921","name":"Weather Tulsa","screen_name":"wo_tulsa","location":"Tulsa","url":"http:\/\/itunes.apple.com\/app\/weatheronline\/id299504833?mt=8","description":"Weather Tulsa\n\nhttp:\/\/www.woweather.com\/USA\/Tulsa.htm","protected":false,"verified":false,"followers_count":111,"friends_count":60,"listed_count":5,"favourites_count":0,"statuses_count":33805,"created_at":"Sun Feb 20 20:31:42 +0000 2011","utc_offset":7200,"time_zone":"Athens","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1249942071\/WO-20px-linien_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1249942071\/WO-20px-linien_normal.png","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"TulsaAirport","indices":[0,13]},{"text":"Oklahoma","indices":[14,23]}],"trends":[],"urls":[{"url":"http:\/\/t.co\/SnC8ST3gQC","expanded_url":"http:\/\/bit.ly\/188eNcw","display_url":"bit.ly\/188eNcw","indices":[93,115]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1421853664710"}
{"created_at":"Wed Jan 21 15:21:04 +0000 2015","id":557920823877464064,"id_str":"557920823877464064","text":"Anime episode updated: Kyoukai no

How to store or analyse such Data?
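One possible first step for analysing such records is to parse each line as JSON and keep only the fields of interest. The following is a minimal sketch using Python's standard json module. The shortened record and the helper name summarize are illustrative assumptions; only the field names are taken from the tweet shown above.

```python
import json

# Shortened version of the tweet shown above; only a few fields are kept.
raw = (
    '{"created_at":"Wed Jan 21 15:21:04 +0000 2015",'
    '"id":557920823764586496,'
    '"user":{"screen_name":"wo_tulsa","followers_count":111},'
    '"entities":{"hashtags":[{"text":"TulsaAirport"},{"text":"Oklahoma"}]},'
    '"lang":"en"}'
)

def summarize(line):
    """Parse one JSON-encoded tweet and keep a flat subset of its fields."""
    tweet = json.loads(line)
    return {
        "id": tweet["id"],
        "user": tweet["user"]["screen_name"],
        "hashtags": [h["text"] for h in tweet["entities"]["hashtags"]],
        "lang": tweet.get("lang"),
    }

summary = summarize(raw)
print(summary)
```

The nested "user" and "entities" objects are the reason such data does not map directly onto a single flat relation.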


Processing Big Data
• OLTP: Online Transaction Processing (DBMSs)
– Database applications
– Storing, querying, multiuser access
• OLAP: Online Analytical Processing (Data Warehousing)
– Answer multi-dimensional analytical queries
– Financial/marketing reporting, budgeting, forecasting, …
• RTAP: Real-Time Analytic Processing (Big Data Architecture & Technology)
– Data gathered & processed in a real-time streaming fashion
– Real-time data queried and presented in an online fashion
– Real-time and history data combined and mined interactively


Key Big Data-Related Technologies
• Distributed file systems
• NoSQL databases
• Grid computing, cloud computing
• MapReduce and other new paradigms
• Large scale machine learning

http://e-theses.imtlucca.it/34/


Relational Database Management Systems (RDBMSs)
• Predominant technology for storing structured data
– Established query languages, e.g. SQL, RA (relational algebra)
• Often thought of as the only alternative for data storage
– Persistence, concurrency control, consistency control, …
• Alternatives: Object databases or XML stores
– Never gained the same adoption and market share


Why Distributed File Systems?

• Assume you have 10 TB of data on disk
• Now, do some analysis of it
• With a 100 MB/s disk, reading alone takes
– 100 000 seconds
– 1666 minutes
– 27 hours
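The arithmetic behind these numbers can be checked with a short sketch. The function name and the 100-disk scenario are assumptions for illustration; the parallel case assumes the data is split evenly across disks and ignores any coordination or network overhead.

```python
def read_time_seconds(data_bytes, bytes_per_second, disks=1):
    """Time to read data_bytes sequentially, split evenly over `disks` disks."""
    return data_bytes / (bytes_per_second * disks)

TB = 10**12
MB = 10**6

# Single disk: 10 TB at 100 MB/s, as on the slide.
single = read_time_seconds(10 * TB, 100 * MB)
print(single, single / 60, single / 3600)   # seconds, minutes, hours

# Assumption for illustration: 100 disks read in parallel, perfect split.
parallel = read_time_seconds(10 * TB, 100 * MB, disks=100)
print(parallel / 60)                        # minutes
```

Under these idealized assumptions the read time drops from more than a day to under 17 minutes, which is the motivation for spreading data over many machines.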


Need to do something about it

http://flickr.com/photos/jurvetson/157722937/

http://www.google.com/about/datacenter


Scale-up vs Scale-out

Scale-Up (vertical scaling):
• More RAM
• More CPU
• More HDD

Scale-Out (horizontal scaling):
• Same Hardware
• Connected by network


Data Centers

source: http://www.google.com/about/datacenters/inside/index.html


Hardware Failures
• Lots of machines (commodity hardware): failure is not an exception but very common
• P[machine fails today] = 1/365
• n machines: P[failure of at least 1 machine] = 1-(1-P[machine fails today])^n
– for n=1: 0.0027
– for n=10: 0.02706
– for n=100: 0.239
– for n=1000: 0.9356
– for n=10 000: ~ 1.0
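The values above follow directly from the formula and can be recomputed with a few lines of code. This is a minimal sketch that assumes machines fail independently of each other; the function name is illustrative.

```python
def p_any_failure(n, p_single=1 / 365):
    """Probability that at least one of n independent machines fails today."""
    return 1 - (1 - p_single) ** n

for n in (1, 10, 100, 1000, 10_000):
    print(n, round(p_any_failure(n), 4))
```

Already with 1000 machines, a day without any failure is the rare case.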

source: google.com


Fallacies of Distributed Computing
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

source: Peter Deutsch and others at Sun


Failure Handling & Recovery

• Hardware failures happen virtually at any time

• Algorithms/Infrastructures have to compensate

• Issues in distributed computing:
– Replication of data
– Logging of state
– Redundancy in task execution


„NoSQL“
• 1998: first used for a relational database that omitted the use of SQL
– Carlo Strozzi
• 2009: used for conferences of advocates of non-relational databases
– Eric Evans (blogger, developer at Rackspace)
• NoSQL movement = “the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for”


„NoSQL“

• Not „no to SQL“
– Another option, not the only one
• Not „not only SQL“
– Oracle DB or PostgreSQL would fit the definition

„Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. Often more characteristics apply as: schema-free, easy replication support, simple API, eventually consistent (BASE, not ACID), a huge data amount, and more“

http://nosql-database.org/


The End of Relational Databases?

• Relational databases will not disappear
– Compelling arguments for most projects
– Familiarity, stability, feature set, and available support
• We should see relational databases as one option for data storage
– Polyglot persistence: using different data stores in different circumstances
– Search for optimal storage for a particular application


Motivation for NoSQL Databases

• Huge amounts of data are now handled in real-time
• Both data and use cases are getting more and more dynamic
• Social networks (relying on graph data) have gained impressive momentum
– Special type of NoSQL databases: graph databases
• Full-texts have always been treated shabbily by RDBMS


Example: Facebook
http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/

Statistics from 2010
• 500 million users
• 570 billion page views per month
• 3 billion photos uploaded per month
• 1.2 million photos saved per second
• 25 billion pieces of content (updates, comments) shared every month
• 50 million server-side operations per second
• 2008: 10,000 servers; 2009: 30,000 servers; …

→ One RDBMS may not be enough to keep this going!

And even newer numbers: https://research.facebook.com/blog/facebook-s-top-open-data-problems/


Example: Facebook Architecture from 2010
• Cassandra
– NoSQL distributed storage system with no single point of failure
– For searching your inbox messages
• Hadoop/Hive
– An open source MapReduce implementation
– Enables calculations on massive amounts of data
– Hive enables SQL queries against Hadoop


Example: Facebook Architecture from 2010 and later
• Memcached
– Distributed memory caching system
– Caching layer between the web servers and MySQL servers
– Since database access is relatively slow
• HBase
– Hadoop database, used for e-mails, instant messaging and SMS
– Has recently replaced MySQL, Cassandra and a few others
– Built on Google’s BigTable model


NoSQL Databases Five Advantages

1. Elastic scaling
– “Classical” database administrators scale up: buy bigger servers as database load increases
– Scaling out: distributing the database across multiple hosts as load increases
2. Big Data
– Volumes of data that are being stored have increased massively
– Opens new dimensions that cannot be handled with RDBMS

http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772


NoSQL Databases Five Advantages

3. Goodbye DBAs (see you later?)
– Automatic repair, distribution, tuning, … vs. expensive, highly trained DBAs of RDBMS
4. Economics
– Based on cheap commodity servers → lower cost per transaction/second
5. Flexible Data Models
– Non-existing/relaxed data schema → structural changes cause no overhead


NoSQL Databases Five Challenges

1. Maturity
– Still in pre-production phase
– Key features yet to be implemented
2. Support
– Mostly open source, result from start-ups
– Enables fast development
– Limited resources or credibility
3. Administration
– Requires a lot of skill to install and effort to maintain


NoSQL Databases Five Challenges

4. Analytics and Business Intelligence
– Focused on web app scenarios
– Modern Web 2.0 applications
– Insert-read-update-delete
– Limited ad-hoc querying
– Even a simple query requires significant programming expertise
5. Expertise
– Few NoSQL experts available in the market


Data Assumptions

RDBMS                                | NoSQL
integrity is mission-critical        | OK as long as most data is correct
data format consistent, well-defined | data format unknown or inconsistent
data is of long-term value           | data is expected to be replaced
data updates are frequent            | write-once, read multiple (no updates, or at least not often)
predictable, linear growth           | unpredictable growth (exponential)
non-programmers writing queries      | only programmers writing queries
regular backup                       | replication
access through master server         | sharding across multiple nodes


NoSQL Data Model Aggregates
• Data model = the model by which the database organizes data
• Each NoSQL solution has a different model
– Key-value, document, column-family, graph
– The first three are aggregate-oriented
• Aggregate
– A data unit with a complex structure
– Not just a set of tuples like in RDBMS
– Domain-Driven Design: “an aggregate is a collection of related objects that we wish to treat as a unit”
– A unit for data manipulation and management of consistency
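The difference between flat tuples and an aggregate can be made concrete with a small sketch. All table contents, field names, and the helper build_aggregate are hypothetical and chosen only for illustration.

```python
# Relational view: the same order spread over flat tuples in three tables.
customers = [(1, "Martin")]                      # (customer_id, name)
orders = [(99, 1)]                               # (order_id, customer_id)
order_items = [(99, "NoSQL Distilled", 32.45),   # (order_id, product, price)
               (99, "Refactoring", 40.00)]

# Aggregate view: one nested unit that is read and written as a whole.
order_aggregate = {
    "order_id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "items": [
        {"product": "NoSQL Distilled", "price": 32.45},
        {"product": "Refactoring", "price": 40.00},
    ],
}

def build_aggregate(order_id):
    """Assemble the aggregate from the flat tuples (the join an RDBMS would do)."""
    cust_id = next(c for o, c in orders if o == order_id)
    name = next(n for i, n in customers if i == cust_id)
    items = [{"product": p, "price": pr}
             for o, p, pr in order_items if o == order_id]
    return {"order_id": order_id,
            "customer": {"id": cust_id, "name": name},
            "items": items}

print(build_aggregate(99) == order_aggregate)
```

An aggregate-oriented store would keep order_aggregate under a single key, so no join is needed when the order is read.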


NoSQL Data Model Aggregates – aggregate-ignorant
• There is no universal strategy for how to draw aggregate boundaries
– Depends on how we manipulate the data
• RDBMS and graph databases are aggregate-ignorant
– It is not a bad thing, it is a feature
– Allows us to easily look at the data in different ways
– Better choice when we do not have a primary structure for manipulating data


NoSQL Data Model Aggregates – aggregate-oriented
• Aggregate orientation
– Aggregates give the database information about which bits of data will be manipulated together
– Which should live on the same node
• Helps greatly with running on a cluster
– We need to minimize the number of nodes we need to query when we are gathering data
• Consequence for transactions
– NoSQL databases support atomic manipulation of a single aggregate at a time
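Why aggregates help on a cluster can be sketched with a simple placement rule. The node names and the modulo-hash scheme below are assumptions for illustration and not the placement strategy of any particular system.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # hypothetical cluster

def node_for(aggregate_key, nodes=NODES):
    """Map an aggregate key to one node using a stable hash of the key."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# The whole aggregate is stored under one key, so it lands on one node
# and can be fetched (or updated atomically) by contacting that node only.
print(node_for("order:99"))
```

Because the order, its customer data, and its line items share one key, gathering them never requires querying a second node.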


NoSQL Databases Materialized Views
• Disadvantage: the aggregated structure is given, other types of aggregations cannot be done easily
– RDBMSs: lack of aggregate structure → support for accessing data in different ways (using views)
• Solution: materialized views
– Pre-computed and cached queries
• Strategies:
– Update materialized view when we update the base data (for more frequent reads of the view than writes)
– Run batch jobs to update the materialized views at regular intervals
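Both strategies can be illustrated with an in-memory sketch. The base data, the view revenue_by_customer, and the function names are hypothetical; a real system would persist both and schedule the batch job.

```python
from collections import defaultdict

orders = []                               # base data: (customer, amount)
revenue_by_customer = defaultdict(float)  # materialized view, updated eagerly

def add_order(customer, amount):
    """Strategy 1: update the view in the same step as the base data."""
    orders.append((customer, amount))
    revenue_by_customer[customer] += amount

def rebuild_view():
    """Strategy 2: recompute the whole view from the base data (batch job)."""
    view = defaultdict(float)
    for customer, amount in orders:
        view[customer] += amount
    return view

add_order("alice", 10.0)
add_order("bob", 5.0)
add_order("alice", 2.5)
print(dict(revenue_by_customer))
```

The eager variant makes every write slightly more expensive but keeps reads of the view cheap and current; the batch variant keeps writes cheap at the price of a view that may be stale between runs.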


NoSQL Databases Schemalessness
• When we want to store data in an RDBMS, we need to define a schema
• Advocates of schemalessness rejoice in freedom and flexibility
– Allows us to easily change our data storage as we learn more about the project
– Easier to deal with non-uniform data
• Fact: there is usually an implicit schema present
– The program working with the data must know its structure
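The implicit schema shows up as soon as code reads the data. In the following sketch the records and field names are made up for illustration; the point is that the reader, not the store, encodes which fields exist and what a missing field means.

```python
# Two records stored without a declared schema; the second lacks "qty".
records = [
    {"product": "x", "price": 10.0, "qty": 3},
    {"product": "y", "price": 4.0},
]

def line_total(record):
    """The field names and the default for a missing 'qty' are the implicit schema."""
    return record["price"] * record.get("qty", 1)

totals = [line_total(r) for r in records]
print(totals)
```

If a later version of the application renames "price", the store accepts the new records without complaint, but line_total fails on them.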


Types of NoSQL Databases
• Core:
– Key-value stores (databases)
– Document databases
– Column-family (column-oriented/columnar) stores (in contrast to relational columnar DBs like MonetDB)
– Graph databases
• Non-core:
– Object databases
– XML databases
– …

http://nosql-database.org/


Key-value store Basic characteristics
• The simplest NoSQL data stores
• A simple hash table (map), primarily used when all access to the database is via primary key
• A table in RDBMS with two columns, such as ID and NAME
– ID column being the key
– NAME column storing the value (a BLOB that the data store just stores)
• Basic operations:
– Get the value for the key
– Put a value for a key
– Delete a key from the data store
• Simple → great performance, easily scaled
• Simple → not for complex queries, aggregation needs
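The three basic operations map directly onto a hash table. The class below is a minimal in-memory sketch, not the API of any particular product; the class and method names are assumptions.

```python
class KeyValueStore:
    """In-memory key-value store; values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)     # None if the key is absent

    def delete(self, key):
        self._data.pop(key, None)      # deleting a missing key is a no-op

store = KeyValueStore()
store.put(42, b"Alice")
print(store.get(42))
store.delete(42)
print(store.get(42))
```

The store never looks inside the value, which is why lookups by key are fast and queries on the value are not supported.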


Key-Value Stores
• Data model: (key) -> value
• Interface: CRUD (Create, Read, Update, Delete)

Key              | Value (an opaque blob)
users:2:friends  | {23, 76, 233, 11}
users:2:inbox    | [234, 3466, 86, 55]
users:2:settings | Theme → "dark", cookies → "false"


Key-value store Representatives MemcachedDB

not open-source

Project Voldemort

open-source version


Key-value store Suitable Use Cases

• Storing Session Information
– Every web session is assigned a unique session_id value
– Everything about the session can be stored by a single PUT request or retrieved using a single GET
– Fast, everything is stored in a single object
• User Profiles, Preferences
– Every user has a unique user_id, user_name + preferences such as language, colour, time zone, which products the user has access to, …
– As in the previous case: fast, single object, single GET/PUT
• Shopping Cart Data
– Similar to the previous cases
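The session use case can be sketched as follows. A plain dictionary stands in for the key-value store, and the session fields and function names are hypothetical.

```python
import json
import uuid

sessions = {}   # stands in for a key-value store: session_id -> opaque blob

def save_session(session_id, data):
    """Single PUT: the whole session is serialized into one value."""
    sessions[session_id] = json.dumps(data).encode("utf-8")

def load_session(session_id):
    """Single GET: fetch and deserialize the whole session."""
    blob = sessions.get(session_id)
    return json.loads(blob) if blob is not None else None

session_id = str(uuid.uuid4())
save_session(session_id, {"user_id": 2, "language": "en", "cart": [234, 86]})
print(load_session(session_id))
```

Serializing the session into one blob is what makes the single PUT and single GET possible; the trade-off is that changing one field means rewriting the whole value.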


Key-value store When Not to Use

• Relationships among Data
– Relationships between different sets of data
– Some key-value stores provide link-walking features (not usual)
• Multioperation Transactions
– Saving multiple keys: failure to save any one of them → revert or roll back the rest of the operations
• Query by Data
– Search the keys based on something found in the value part
• Operations by Sets
– Operations are limited to one key at a time
– No way to operate upon multiple keys at the same time


Column-Family Stores Basic Characteristics

• Also “columnar” or “column-oriented”
• Column families = rows that have many columns associated with a row key
• Column families are groups of related data that are often accessed together
– e.g., for a customer we access all profile information at the same time, but not orders


Wide-Column Stores
• Data model: (rowkey, column, timestamp) -> value
• Interface: CRUD, Scan
• Examples: Cassandra (AP), Google BigTable (CP), HBase (CP)

Row key: com.cnn.www
Columns:
– crawled: …
– content: "<html>…" (three timestamped versions)
– title: "CNN"
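The data model (rowkey, column, timestamp) -> value can be mimicked with nested dictionaries. This is a toy sketch, not the API of BigTable, HBase, or Cassandra; the function names and the integer timestamps are assumptions.

```python
import time

table = {}   # {rowkey: {column: [(timestamp, value), ...]}}

def put(rowkey, column, value, timestamp=None):
    """Store a new timestamped version of a cell."""
    ts = time.time() if timestamp is None else timestamp
    table.setdefault(rowkey, {}).setdefault(column, []).append((ts, value))

def get(rowkey, column):
    """Return the value with the largest timestamp, or None."""
    versions = table.get(rowkey, {}).get(column, [])
    return max(versions, key=lambda v: v[0])[1] if versions else None

def scan(prefix):
    """Return row keys starting with prefix, in sorted order."""
    return [k for k in sorted(table) if k.startswith(prefix)]

put("com.cnn.www", "content", "<html>v1", timestamp=1)
put("com.cnn.www", "content", "<html>v2", timestamp=2)
put("com.cnn.www", "title", "CNN", timestamp=1)
print(get("com.cnn.www", "content"), scan("com.cnn"))
```

Keeping row keys sorted is what makes Scan cheap for key ranges, which is one reason the example uses the reversed domain name as the row key.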


Column-Family Stores Representatives

Google’s BigTable


Column-Family Stores Suitable Use Cases

• Event Logging
– Ability to store any data structures → good choice to store event information
• Content Management Systems, Blogging Platforms
– We can store blog entries with tags, categories, links, and trackbacks in different columns
– Comments can be either stored in the same row or moved to a different keyspace
– Blog users and the actual blogs can be put into different column families


Column-Family Stores When Not to Use

• Systems that Require ACID Transactions
– Column-family stores are not just a special kind of RDBMS with a variable set of columns!
• Aggregation of the Data Using Queries (such as SUM or AVG)
– Has to be done on the client side
• For Early Prototypes
– We are not sure how the query patterns may change
– As the query patterns change, we have to change the column family design


Document Databases Basic Characteristics

• Documents are the main concept
– Stored and retrieved
– XML, JSON, …
• Documents are
– Self-describing
– Hierarchical tree data structures
– Can consist of maps, collections (lists, sets, …), scalar values, nested documents, …
• Documents in a collection are expected to be similar
– Their schema can differ
• Document databases store documents in the value part of the key-value store
– Key-value stores where the value is examinable


Document Stores
• Data model: (collection, key) -> document
• Interface: CRUD, Queries, Map-Reduce
• Examples: CouchDB (AP), Amazon SimpleDB (AP), MongoDB (CP)

ID/Key: order-12338
JSON Document:
{
  order-id: 23,
  customer: { name : "Felix Gessert", age : 25 },
  line-items : [ {product-name : "x", …} , …]
}
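What distinguishes a document store from a plain key-value store is that the value can be examined by queries. The sketch below mimics this with dictionaries; the second document and the helper find are hypothetical, and real systems offer far richer query languages.

```python
collection = {
    "order-12338": {
        "order-id": 23,
        "customer": {"name": "Felix Gessert", "age": 25},
        "line-items": [{"product-name": "x"}],
    },
    "order-12339": {   # hypothetical second document with a different shape
        "order-id": 24,
        "customer": {"name": "Ada"},
        "line-items": [],
    },
}

def find(coll, path, value):
    """Return keys of documents whose nested field at `path` equals `value`."""
    hits = []
    for key, doc in coll.items():
        node = doc
        for part in path:
            node = node.get(part) if isinstance(node, dict) else None
        if node == value:
            hits.append(key)
    return hits

print(find(collection, ["customer", "age"], 25))
```

The second document has no age field and is simply skipped, which illustrates that documents in one collection may differ in schema.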


Document Databases Representatives

Lotus Notes Storage Facility


Document Databases Suitable Use Cases
• Event Logging
– Many different applications want to log events
– Type of data being captured keeps changing
– Events can be sharded (i.e. divided) by the name of the application or type of event
• Content Management Systems, Blogging Platforms
– Managing user comments, user registrations, profiles, web-facing documents, …
• Web Analytics or Real-Time Analytics
– Parts of the document can be updated
– New metrics can be easily added without schema changes (e.g. adding a member of a list, set, …)
• E-Commerce Applications
– Flexible schema for products and orders
– Evolving data models without expensive data migration


Document Databases When Not to Use

• Complex Transactions Spanning Different Operations
– Atomic cross-document operations
– Some document databases do support them (e.g., RavenDB)
• Queries against Varying Aggregate Structure
– Design of aggregate is constantly changing → we need to save the aggregates at the lowest level of granularity, i.e. to normalize the data


Graph Databases Basic Characteristics

To store entities and relationships between these entities
  Node is an instance of an object
  Nodes have properties
    e.g., name
  Edges have directional significance
  Edges have types
    e.g., likes, friend, …
Nodes are organized by relationships
  Allow us to find interesting patterns
    e.g., “Get all nodes employed by Big Co that like NoSQL Distilled”
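
The pattern query above can be sketched over a tiny property graph. This is an illustration under assumptions, not how a graph database evaluates queries: the sample nodes, the edge type names (`employee`, `likes`, `friend`) and the helper `sources` are all hypothetical.

```python
# Nodes: id -> properties (hypothetical sample data).
nodes = {
    "anna": {"name": "Anna"},
    "barbara": {"name": "Barbara"},
    "carol": {"name": "Carol"},
    "bigco": {"name": "Big Co"},
    "nosql_distilled": {"name": "NoSQL Distilled"},
}

# Edges are directed and typed: (source, type, target).
edges = [
    ("anna", "employee", "bigco"),
    ("barbara", "employee", "bigco"),
    ("anna", "likes", "nosql_distilled"),
    ("carol", "likes", "nosql_distilled"),
    ("anna", "friend", "barbara"),
]

def sources(edge_type, target):
    """Return the set of nodes with an outgoing edge of edge_type to target."""
    return {src for (src, etype, dst) in edges
            if etype == edge_type and dst == target}

# "Get all nodes employed by Big Co that like NoSQL Distilled"
result = sources("employee", "bigco") & sources("likes", "nosql_distilled")
print(sorted(nodes[n]["name"] for n in result))  # prints ['Anna']
```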


Example:


Graph Databases RDBMS vs. Graph Databases

When we store a graph-like structure in an RDBMS, it is for a single type of relationship
  “Who is my manager”
Adding another relationship usually means a lot of schema changes
In an RDBMS we model the graph beforehand based on the traversal we want
  If the traversal changes, the data will have to change
In graph databases the relationship is not calculated at query time but persisted
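
A rough sketch of the contrast, assuming simplified in-memory structures in place of a real RDBMS or graph database. The table layout, the `Node` class and the relationship names are hypothetical. In the relational style the link is resolved at query time by matching key values; in the graph style the relationship is stored with the node and simply followed.

```python
# Relational style: one table built for one relationship type; the link
# is resolved at query time by matching key values (a join or scan).
employees = [
    {"id": 1, "name": "Anna", "manager_id": 2},
    {"id": 2, "name": "Barbara", "manager_id": None},
]

def manager_relational(name):
    """Look up the manager's name by matching manager_id against id."""
    emp = next(e for e in employees if e["name"] == name)
    return next((m["name"] for m in employees
                 if m["id"] == emp["manager_id"]), None)

# Graph style: relationships are persisted on the node itself.
class Node:
    def __init__(self, name):
        self.name = name
        self.out = {}  # edge type -> list of target nodes

    def relate(self, edge_type, target):
        self.out.setdefault(edge_type, []).append(target)

anna, barbara = Node("Anna"), Node("Barbara")
anna.relate("managed_by", barbara)
anna.relate("friend", barbara)  # a new relationship type needs no schema change
```

Adding the `friend` relationship to the relational version would require a new column or a new table, which is the schema change the slide refers to.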


Graph Databases Representatives

FlockDB


Graph Databases Suitable Use Cases

Connected Data
  Social networks
  Any link-rich domain is well suited for graph databases
Routing, Dispatch, and Location-Based Services
  Node = location or address that has a delivery
  Graph = nodes where a delivery has to be made
  Relationships = distance
Recommendation Engines
  “your friends also bought this product”
  “when invoicing this item, these other items are usually invoiced”
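
For the routing use case, a minimal sketch with locations as nodes and distances as edge weights. The slides do not name an algorithm; Dijkstra's algorithm is used here as one common choice, and the graph `distances` and function `shortest_distance` are hypothetical.

```python
import heapq

# Hypothetical delivery graph: node = location, edge weight = distance in km.
distances = {
    "depot": {"a": 4, "b": 1},
    "a": {"depot": 4, "b": 2, "c": 5},
    "b": {"depot": 1, "a": 2, "c": 8},
    "c": {"a": 5, "b": 8},
}

def shortest_distance(graph, start, goal):
    """Dijkstra's algorithm; returns the minimal total distance or None."""
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None  # goal not reachable from start

print(shortest_distance(distances, "depot", "c"))  # prints 8 (depot -> b -> a -> c)
```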


Graph Databases When Not to Use

When we want to update all or a subset of entities
  Changing a property on all the nodes is not a straightforward operation
  e.g., analytics solution where all entities may need to be updated with a changed property
Some graph databases may be unable to handle lots of data
  Distribution of a graph is difficult or impossible


NoSQL Data Model Aggregates and NoSQL databases

Key-value database
  Aggregate = some big blob of mostly meaningless bits
    But we can store anything
  We can only access an aggregate by lookup based on its key
Document database
  Enables us to see a structure in the aggregate
    But we are limited by the structure when storing (similarity)
  We can submit queries to the database based on the fields in the aggregate
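
The difference in access paths can be sketched as follows, assuming in-memory stand-ins for both kinds of store. The names `kv_store`, `documents` and `find` are hypothetical and do not correspond to any particular product's API.

```python
import json

# Key-value store: the value is an opaque blob; only lookup by key is possible.
kv_store = {}
kv_store["customer:1"] = json.dumps({"name": "Anna", "city": "Bonn"}).encode("utf-8")
blob = kv_store["customer:1"]  # the store itself cannot look inside these bytes

# Document store: the database sees the fields and can query on them.
documents = [
    {"_id": 1, "name": "Anna", "city": "Bonn"},
    {"_id": 2, "name": "Barbara", "city": "Prague", "vip": True},
]

def find(collection, **criteria):
    """Return the documents whose fields equal all given criteria."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]

print([d["name"] for d in find(documents, city="Prague")])  # prints ['Barbara']
```

With the key-value store, finding all customers in Prague would mean fetching and decoding every blob in application code.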


NoSQL Data Model Aggregates and NoSQL databases

Column-family stores
  A two-level aggregate structure
    The first key is a row identifier, picking up the aggregate of interest
    The second-level values are referred to as columns
  Ways to think about how the data is structured:
    Row-oriented: each row is an aggregate with column families representing useful chunks of data (profile, order history)
    Column-oriented: each column family defines a record type (e.g., customer profiles) with rows for each of the records; a row is the join of records in all column families
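
The two-level structure and both ways of viewing it can be sketched with nested dictionaries. This is an illustration only; the sample data and the names `store`, `row` and `profiles` are hypothetical, and real column-family stores add sorting, versioning and distribution on top.

```python
# row key -> column family -> column -> value (hypothetical sample data)
store = {
    "customer:1": {
        "profile": {"name": "Anna", "city": "Bonn"},
        "orders": {"2013-01-05": "book", "2013-02-11": "lamp"},
    },
    "customer:2": {
        "profile": {"name": "Barbara"},
        "orders": {},
    },
}

# Row-oriented view: the row key picks the whole aggregate,
# and each column family is a useful chunk of it.
row = store["customer:1"]

# Column-oriented view: one column family across all rows
# acts as a record type (here: customer profiles).
profiles = {key: families["profile"] for key, families in store.items()}
```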


History

[Timeline figure, 2003–2013; systems shown: Google File System, MapReduce, CouchDB, MongoDB, Dynamo, Cassandra, Riak, MegaStore, F1, Redis, HyperDex, Spanner, Couchbase, Dremel, Hadoop & HDFS, HBase, BigTable]


References

http://nosql-database.org/
Pramod J. Sadalage – Martin Fowler: NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence
Eric Redmond – Jim R. Wilson: Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement
Sherif Sakr – Eric Pardede: Graph Data Management: Techniques and Applications
Shashank Tiwari: Professional NoSQL