big data managementbig data managementement motivation t a manag “without data you are just...

35
ement ta Manag Big Dat Big Data Management Big Data Management December 2015 Alberto Abelló & Oscar Romero 1

Upload: others

Post on 08-Feb-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

emen

tta

Man

agBig

Dat

Big Data ManagementBig Data Management

December 2015 Alberto Abelló & Oscar Romero 1

emen

tta

Man

agBig

Dat

MOTIVATION

December 2015 Alberto Abelló & Oscar Romero 2

emen

tMotivation

ta M

anag

“Without data you are just another person with an opinion ”

Big

Dat an opinion.”

William Edwards Deming

“It is a capital mistake to theorize before one has data ”has data.

Sherlock Holmes (A Study in Scarlett)

Prescriptive

Descriptive

Predictive

December 2015 Alberto Abelló & Oscar Romero 3

emen

tta

Man

agBig

Dat

December 2015 4

emen

tta

Man

agBig

Dat

December 2015 5

Bain & Company

emen

tta

Man

agBig

Dat

DEFINITION

December 2015 Alberto Abelló & Oscar Romero 6

emen

tBig Data facets

ta M

anag

The Original

Big

Dat

as Technology as Data Distinctions as Data Distinctions as Signals

O t it as Opportunity as Metaphor as New Term for Old Stuff

December 2015 Alberto Abelló & Oscar Romero 7

Timo Elliott

emen

tBig Data definition

ta M

anag

Velocity

Big

Dat

Volume Variety Variety …

V i bilit Variability Validity/Veracity Value

From IBM “Understanding Big Data”

erab

ytes

Mill

ions

of

T

December 2015 Alberto Abelló & Oscar Romero 8

year

emen

tBigbench

ta M

anag

Big

Dat

December 2015 Alberto Abelló & Oscar Romero 9

emen

tTypes of Big Data Analyzed in Industry

ta M

anag

Big

Dat

December 2015 Alberto Abelló & Oscar Romero 10

emen

tBig Data sources

ta M

anag

Structured

Big

Dat

Created (i.e., business data) Provoked (e.g., customer feedback) Transacted Compiled (e.g., demographics) Experimental (e.g., sampling customers)

Unstructured Unstructured Captured (e.g., search words) User-generated (e g social networks) User generated (e.g., social networks)

December 2015 Alberto Abelló & Oscar Romero 11

emen

tBig Data related areas

ta M

anag

Volume and Velocity Declarative querying Query optimization

Big

Dat Query optimization

Variety and Variability Data quality Data integration Web and text mining Web and text mining Information retrieval

Validity/Veracity Data consistency UncertaintyUncertainty Statistical reasoning Data linkage (provenance)

Value Analyticsy

Data mining Algorithmics Automatic learning Simulation Privacy

Biologists Linguistics Chemists Sociologists

Engineers Engineers

December 2015 Alberto Abelló & Oscar Romero 12

emen

tta

Man

agBig

Dat

VOLUME AND VELOCITY MANAGEMENT

December 2015 Alberto Abelló & Oscar Romero 13

emen

tKey features of cloud databases

ta M

anag

a) Quick/Cheap set upf

Big

Dat b) Simple call level interface or protocol

No declarative query languageW k d l th ACIDc) Weaker concurrency model than ACID

d) Flexible schemaAbilit t d i ll dd tt ib t Ability to dynamically add new attributes

e) Multi-tenancyf) Ability to horizontally scale f) Ability to horizontally scale g) Ability to replicate & distribute (fragmentation)h) Efficient use of distributed indexes and RAMh) Efficient use of distributed indexes and RAM

December 2015 Alberto Abelló & Oscar Romero 14

emen

tDistributed Database

ta M

anag

A distributed database (DDB) is a database where d t t i di t ib t d l

Big

Dat data management is distributed over several

nodes in a network. Each node is a database itself Each node is a database itself

Potential heterogeneity Nodes communicate through the network

December 2015 Alberto Abelló & Oscar Romero 15

emen

tParallel database architectures

ta M

anag

Big

Dat

D. DeWitt & J. Gray, “Parallel Database Systems: h f f h f b ” 992

December 2015 Alberto Abelló & Oscar Romero 16

The future of High Performance Database Processing”, 1992Figure from D. Abady

emen

tPerformance comparison

ta M

anag

Big

Dat

December 2015 Alberto Abelló & Oscar Romero 17

emen

tta

Man

agBig

Dat

VARIETY MANAGEMENT

December 2015 Alberto Abelló & Oscar Romero 18

emen

tSchemaless Databases (I)

ta M

anag

( )CREATE TABLE Student ( id int,

name varchar2(50), h 2(50)

Big

Dat surname varchar2(50),

enrolment date);Insert into Student (1, ‘Oscar’, ‘Romero’, ‘01/01/2012’, true, ‘Lleida’); Insert into Student (1, ‘Oscar’, ‘Romero’, NULL); I t i t St d t (1 ‘O ’ ‘R ’ ‘01/01/2012’)

WRONGOK

OKInsert into Student (1, ‘Oscar’, ‘Romero’, ‘01/01/2012’);

Consequences

OK

Consequences Gain flexibility May reduce the impedance mismatch

Coupled with HLLs (e g Java) Coupled with HLLs (e.g., Java) Lose semantics (also consistency)Insert into Student (1, {‘Oscar’, ‘Romero’, ‘01/01/2012’});

Th d t i d d i i l i l t (!)Th d t i d d i i l i l t (!) The data independence principle is lost (!)The data independence principle is lost (!) The ANSI / SPARC architecture is not followed Applications can access and manipulate the database

internal structuresinternal structuresDecember 2015 Alberto Abelló & Oscar Romero 19

emen

tSchemaless Databases (II)

ta M

anag

NOSQL solution for the impedance

Big

Dat mismatch

Several new data models were introduced Graph data model Document-oriented databases Key-value (~ hash tables) Streams (~ vectors and matrixes)Streams ( vectors and matrixes)

These new models lack of an explicit schema (defined by the user)schema (defined by the user) However, an implicit schema remains

December 2015 Alberto Abelló & Oscar Romero 20

emen

tImpedance Mismatch

ta M

anag

Big

Dat

December 2015 Alberto Abelló & Oscar Romero 21

Petra Selmer, Advances in Data Management 2012

emen

tImpedance Mismatch

ta M

anag

Big

Dat

December 2015 Alberto Abelló & Oscar Romero 22

Petra Selmer, Advances in Data Management 2012

emen

tta

Man

agBig

Dat

POLYGLOT SYSTEMS

December 2015 Alberto Abelló & Oscar Romero 23

emen

tDifferent applications

ta M

anag

Not Only SQL (different problems entail different solutions)

Big

Dat OLTP

Object-Relational Distributed databases Parallel databases Parallel databases

Scientific databases and other Big Data repositories Key-value stores

Data Warehousing & OLAP MOLAP Column stores Multidimensional features

Text / documents Text / documents Document databases

XML/JSON databases Stream processing

St Stream processor Semantic Web and Open Data

Graph databases

December 2015 Alberto Abelló & Oscar Romero 24

emen

tDatabases landscape

ta M

anag

Big

Dat

25December 2015 Alberto Abelló & Oscar Romero

emen

tSpecific platforms

ta M

anag

Google BigTable Published in 2006

Big

Dat Published in 2006

Implemented by Hbase Also Dynamo and Cassandra

Google MapReduce Published in 2004 Implemented by Hadoop Implemented by Hadoop

MongoDB Published in 2007

Neo4J/Sparksee Published in 2010/2008SAP HANA SAP HANA Published in 2011 Prototyped in SanssouciDB Prototyped in SanssouciDB

December 2015 Alberto Abelló & Oscar Romero 26

emen

tPolyglot Systems

ta M

anag

Federate different kinds of storage systems

Big

Dat

Martin Fowlerhttp://martinfowler.com/bliki/PolyglotPersistence.html

27December 2015 Alberto Abelló & Oscar Romero

emen

tInternal Structures

ta M

anag

Big

Dat

Ben Stopford

December 2015 Alberto Abelló & Oscar Romero 28

pProgscon & JAX Finance 2015

emen

tThe Problem is Not SQL

ta M

anag

Q Relational systems are too generic…

OLTP: stored procedures and simple queries

Big

Dat OLTP: stored procedures and simple queries

OLAP: ad-hoc complex queries Documents: large objects Streams: time windows with volatile data Scientific: uncertainty and heterogeneity

But the overhead of RDBMS has nothing to … But the overhead of RDBMS has nothing to do with SQL Low-level, record-at-a-time interface is not the

lsolution No standard No ACID

Michael StonebrakerSQL Databases vS. NoSQL DatabasesCommunications of the ACM, 53(4), 2010

December 2015 Alberto Abelló & Oscar Romero 29

emen

tta

Man

agBig

Dat

APACHE TOOLSET

December 2015 Alberto Abelló & Oscar Romero 30

emen

tHadoop ecosystem

ta M

anag

Big

Dat

Small Analytics Big Analytics

Warehousing

Data Lake

By Víctor Herrero.

December 2015 Alberto Abelló & Oscar Romero 31

By Víctor Herrero.Big Data Management & Analytics (UPC School)

emen

tBrewery or bottled beer?

ta M

anag

D It Y lf

Big

Dat Do It Yourself

• Expensive• Ad hoc development

Off the Shelf• Economies of scale• Concrete functionalities

December 2015 Alberto Abelló & Oscar Romero 32

Florian Waas analogy

emen

tConclusions

ta M

anag

Big Data is not only volume

Big

Dat

Cloud databases are not simply databases in the cloud

Distributed Database principles Variety matters Variety matters

December 2015 Alberto Abelló & Oscar Romero 33

emen

tBibliography

ta M

anag

M. T. Özsu and P. Valduriez. Principles of

Big

Dat Distributed Database Systems, 3rd Ed.

Springer, 2011 A. Ghazal et al. BigBench: towards an

industry standard benchmark for big data y ganalytics. SIGMOD Conference, 2013

R. Cattell. Scalable SQL and NoSQL Data R. Cattell. Scalable SQL and NoSQL Data Stores. SIGMOD Record 39(4), 2010

L Liu M T Özsu (Eds ) Encyclopedia of L. Liu, M.T. Özsu (Eds.). Encyclopedia of Database Systems. Springer, 2009

December 2015 Alberto Abelló & Oscar Romero 34

emen

tta

Man

agBig

Dat

December 2015 Alberto Abelló & Oscar Romero 35