1. introduction to the course "designing data bases with advanced data models (nosql &...

Course Introduction: Designing Data Bases with Advanced Data Models

Dr. Fabio Fumarola

Enterprise computing evolution

•  We’ve spent several years in the world of enterprise computing

•  We’ve seen many things change in:
–  Languages, Architectures,
–  Platforms, and Processes.

•  But one thing has stayed constant: “relational databases store the data”

•  The data storage question for architects was: “which relational database to use”

The stability of the reign

•  Why?

•  An organization’s data lasts much longer than its programs (COBOL?)

•  It’s valuable to have stable data storage which is accessible from many applications

•  In the last decades RDBMSs have been successful in solving problems related to storing, serving and processing data.

FROM BUSINESS TO DECISION SUPPORT

Before going deep into the course topics, let’s understand the evolution of decision support systems.

From business to decision support: the ’60s

•  Starting from the ’60s, data were stored on magnetic disks.

•  The supported analyses were static, only aggregated and pretty limited

•  For instance, it was possible to extract the total amount of last month’s sales


•  With relational databases and SQL, data analysis started to become somewhat dynamic.

•  SQL allows us to extract data at both detailed and aggregated levels

•  Transactional activities are stored in Online Transaction Processing (OLTP) databases

•  OLTP databases are used in several applications such as orders, salaries, invoices…
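To make the difference between detailed and aggregated extraction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns and the sample rows are assumptions made up for illustration; they do not come from the course material.

```python
import sqlite3

# In-memory database standing in for an OLTP store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 30.0)],
)

# Detailed level: every single order of one customer.
detail = conn.execute(
    "SELECT id, amount FROM orders WHERE customer = ? ORDER BY id", ("alice",)
).fetchall()

# Aggregated level: total amount per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(detail)
print(totals)
```

The same table answers both questions; only the query changes, which is what made analysis more dynamic than the fixed reports of the earlier period.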


•  In the best case, the described modules are included in Enterprise Resource Planning (ERP) software

•  Examples of such vendors are SAP, Microsoft, HP and Oracle.

•  Normally, what happens is that each module is implemented as ad hoc software with its own database.

•  Cons: Data representation and integration.

OLTP design

•  Such databases are designed to be:
–  Strongly normalized,
–  Fast at inserting data.

•  However, data normalization:
–  Does not favor reading huge quantities of data,
–  Increases the number of tables used to store records.

Foster decision support

•  In order to support “Decisions” we need to extract de-normalized data via several JOINs.

•  Moreover, operational databases offer no or limited visibility on historical data.

Considerations:
•  These factors make data analysis on OLTP databases hard…
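The cost of normalization for analysis can be sketched as follows. The four tables and their contents are hypothetical; the point is only that a simple business question ("revenue per customer") already needs three JOINs on a normalized schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical normalized OLTP schema: each fact lives in exactly one table.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (id INTEGER PRIMARY KEY, title TEXT, price REAL);
CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(id));
CREATE TABLE order_lines (
    order_id   INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    quantity   INTEGER
);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
INSERT INTO products  VALUES (1, 'book', 10.0), (2, 'pen', 2.0);
INSERT INTO orders    VALUES (1, 1), (2, 2);
INSERT INTO order_lines VALUES (1, 1, 2), (1, 2, 5), (2, 2, 1);
""")

# De-normalizing for a decision-support question: revenue per customer.
revenue = conn.execute("""
    SELECT c.name, SUM(p.price * l.quantity)
    FROM order_lines AS l
    JOIN orders    AS o ON o.id = l.order_id
    JOIN customers AS c ON c.id = o.customer_id
    JOIN products  AS p ON p.id = l.product_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(revenue)
```

Inserting a new order touches only a couple of small rows, which is what OLTP is optimized for; reading across the whole history is where the JOINs start to hurt.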


•  Thus, in the ’90s, databases designed to support analysis of data coming from OLTP databases were born.

•  This is the rise of Data Warehouses (DWs)

•  DWs are central repositories of integrated data from one or more disparate sources.

•  They store current and historical data and are used for creating trend reports for management, such as annual and quarterly comparisons.

Types of DW systems

•  Data Mart
–  is a simple form of a data warehouse that is focused on a single subject (or functional area), such as sales, finance or marketing.

•  Online analytical processing (OLAP):
–  is characterized by a relatively low volume of transactions.
–  Queries are often very complex and involve aggregations.
–  OLAP databases store aggregated, historical data in multi-dimensional schemas.

Types of DW systems

•  Predictive analysis

–  Predictive analysis is about finding and quantifying hidden patterns in the data using complex mathematical models that can be used to predict future outcomes.

–  Predictive analysis is different from OLAP in that OLAP focuses on historical data analysis and is reactive in nature, while predictive analysis focuses on the future.

Data Warehouse

•  Data stored in DWs are the starting point for Business Intelligence (BI).

•  Def. “Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes”.

•  With the evolution of BI systems we moved from SQL-based analysis to visual instruments.

Example DW Cube


OLAP features

•  OLAP systems support data exploration through drill-down, drill-up, slicing and dicing operations.

•  However, we still have a historical point of view of what happened in the business, but not of what is happening now.

•  We cannot make predictions about the future
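The four OLAP operations can be sketched on a tiny in-memory cube. Everything here (the dimension names, the facts, and the helper functions aggregate and select) is hypothetical and only meant to show what each operation does; real OLAP engines precompute and index these aggregations.

```python
from collections import defaultdict

# Fact table of a tiny sales cube: (year, quarter, region, product, amount).
facts = [
    (2013, "Q1", "north", "book", 100),
    (2013, "Q1", "south", "pen", 40),
    (2013, "Q2", "north", "pen", 60),
    (2014, "Q1", "north", "book", 90),
    (2014, "Q1", "south", "book", 50),
]
DIMS = ("year", "quarter", "region", "product")


def aggregate(rows, by):
    """Sum the amount, grouping by the given dimension names."""
    idx = [DIMS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def select(rows, **fixed):
    """Keep only the rows whose dimensions match the fixed values."""
    return [
        r for r in rows
        if all(r[DIMS.index(d)] == v for d, v in fixed.items())
    ]


# Drill-up (roll-up): coarse view, totals per year.
by_year = aggregate(facts, ["year"])
# Drill-down: add the quarter to see more detail.
by_quarter = aggregate(facts, ["year", "quarter"])
# Slice: fix one dimension (region = north).
north = aggregate(select(facts, region="north"), ["year"])
# Dice: fix several dimensions to get a sub-cube.
diced = aggregate(select(facts, region="north", year=2013), ["product"])

print(by_year, by_quarter, north, diced, sep="\n")
```

All four results describe sales that already happened, which is the limitation the slide points out.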


•  Starting from 2000, the need for predictive analysis arose.

•  The techniques in this scenario belong to the field of Data Mining

•  Data Mining is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.


•  The overall goal of the data mining process is to extract novel information from a data set and transform it into understandable knowledge.

•  Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

From business to decision support: Recap

•  In the last decades RDBMS have been successful in solving problems related to storing, serving and processing data.


From business to decision support: Recap

•  Vendors such as Oracle, Vertica, Teradata, Microsoft and IBM proposed their solutions based on Relational Algebra and SQL.

•  With Data Mining we are able to extract knowledge from data which can be used to support predictive analysis (Data Mining is not only prediction!!!).

Challenges of Scale Differ

But THERE IS SOMETHING THAT DOES NOT WORK!

1. Scaling Up Databases

“A question I’m often asked about Heroku is: ‘How do you scale the SQL database?’ There’s a lot of things I can say about using caching, sharding, and other techniques to take load off the database. But the actual answer is: we don’t. SQL databases are fundamentally non-scalable, and there is no magical pixie dust that we, or anyone, can sprinkle on them to suddenly make them scale.”
Adam Wiggins, Heroku

Patterson, David; Fox, Armando (2012-07-11). Engineering Long-Lasting Software: An Agile Approach Using SaaS and Cloud Computing, Alpha Edition (Kindle Locations 1285-1288). Strawberry Canyon LLC. Kindle Edition.

2. Data Variety

•  RDBMSs have problems with unstructured and semi-structured data (varied data)

3. Connectivity

4. P2P Knowledge

5. Concurrency

6. Concurrency

6. Diversity

7. Cloud

What is the problem with RDBMSs

Caching, Master/Slave, Master/Master, Cluster, Table Partitioning, Federated Tables, Sharding, Distributed DBs

http://codefutures.com/database-sharding/

What is the problem with RDBMSs

•  RDBMSs can somehow deal with these aspects, but they have issues related to:
–  expensive licensing,
–  requiring complex application logic,
–  dealing with evolving data models

•  There was a need for systems that could:
–  work with different kinds of data formats,
–  do without a strict schema,
–  and scale easily.

NOSQL: THE NEW CHALLENGER! Help!!!


NoSQL

•  It is born out of a need to handle large data volumes

•  It forces a fundamental shift to building large hardware platforms through clusters of commodity servers.

•  This need also arises from the difficulties of making application code play well with relational databases

NoSQL

•  The term “NoSQL” is very ill-defined.

•  It’s generally applied to a number of recent non-relational databases: Cassandra, Mongo, Neo4j, HBase, Redis, …

•  They
–  embrace schemaless data,
–  run on a cluster,
–  and have the ability to trade off traditional consistency for other useful properties

Why are NoSQL Databases Interesting

1.  Application development productivity:

–  A lot of application development effort is spent on mapping data between in-memory data structures and relational databases

–  A NoSQL database may provide a data model that can simplify that interaction, resulting in less code to write, debug, and evolve.

Why are NoSQL Databases Interesting

2.  Large-scale data:

–  Organizations are finding it valuable to capture more data and process it more quickly.

–  They are finding it expensive to do so with relational databases.

–  NoSQL databases are more economical when run on large clusters of many smaller and cheaper machines.

–  Many NoSQL databases are designed to run on clusters, so they are a better fit for Big Data scenarios.

WHY NOSQL

[Figure. Axis label: Connectedness. Items: Internet Hypertext, RSS, wikis, blogs, tagging, user generated content, RDF, ontologies]

The Value of Relational Databases

•  Relational databases have become a deeply embedded part of our computing culture.

•  What are the benefits they provide?

Getting at Persistent Data

•  The most obvious value of a database is keeping large amounts of persistent data

•  Two areas of memory:
–  Main memory: fast, volatile, limited in space; it loses its data when it loses power

–  Backing store: larger but slower, commonly seen as a disk


Getting at Persistent Data

•  The backing store can be organized in all sorts of ways.

•  For many productivity applications (such as word processors) it is a file in the file system.

•  For most enterprise applications, however, the backing store is a database.

•  A database allows more flexibility than a file system.

Concurrency

•  Concurrency is notoriously difficult to get right.

•  Object orientation is not the right programming model to deal with concurrency.

•  Since enterprise applications can have a lot of concurrent users, there is a lot of room for bad things to happen.

•  Relational databases have transactions that help mitigate this problem, but…

Concurrency

•  You still have to deal with a transactional error when you try to book a room that has just been taken.

•  The reality is that the transactional mechanism has worked well to contain the complexity of concurrency.

•  Transactions with rollback allow us to deal with errors.
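The room-booking case can be sketched with sqlite3 transactions. The rooms and payments tables and the book function are assumptions introduced for this illustration. The sketch relies on the documented behavior of using a sqlite3 connection as a context manager: the transaction is committed if the block succeeds and rolled back if an exception escapes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one table of rooms, one table of recorded payments.
conn.execute("CREATE TABLE rooms (number INTEGER PRIMARY KEY, guest TEXT)")
conn.execute("CREATE TABLE payments (guest TEXT, room INTEGER)")
conn.execute("INSERT INTO rooms VALUES (101, NULL)")
conn.commit()


def book(conn, number, guest):
    """Record a payment and assign the room as one atomic unit."""
    try:
        with conn:  # commit on success, rollback if an exception escapes
            conn.execute("INSERT INTO payments VALUES (?, ?)", (guest, number))
            cur = conn.execute(
                "UPDATE rooms SET guest = ? WHERE number = ? AND guest IS NULL",
                (guest, number),
            )
            if cur.rowcount != 1:
                # The room is gone: abort, which also undoes the payment row.
                raise RuntimeError("room is no longer available")
        return True
    except RuntimeError:
        return False


first = book(conn, 101, "alice")
second = book(conn, 101, "bob")
payments = conn.execute("SELECT guest, room FROM payments").fetchall()
print(first, second, payments)
```

The second booking fails, and the rollback removes the payment row that had already been written inside the same transaction, so the application still sees an error but never a half-finished booking.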

Integration

•  Enterprise applications live in a rich ecosystem

•  Multiple applications written by different teams need to collaborate in order to get things done

•  This collaboration is done via data sharing.

•  A common way to do this is shared database integration [Hohpe and Woolf], where multiple applications store their data in a single database

•  Using a single database allows all the applications to share data easily, while the database’s concurrency control handles multiple applications in the same way as it handles multiple users.

A (Mostly) Standard Model

•  Relational databases have succeeded because they have a standard model

•  As a result, developers and database professionals can apply the same knowledge across several projects.

•  Although there are differences between RDBMSs, the core mechanisms remain the same.

Impedance Mismatch

•  It is the difference between the relational model and in-memory data structures.

•  The relational model organizes data into tables and rows (relations and tuples).
–  A tuple is a set of name-value pairs
–  A relation is a set of tuples.

•  All SQL operations consume and return relations.

Impedance Mismatch: Example


Impedance Mismatch

•  Tuples and relations provide elegance and simplicity, but they also introduce limitations.

•  In particular, the values in a relational tuple have to be simple.

•  They cannot contain any structure, such as a nested record or a list.

•  This limitation does not apply to in-memory data structures.

Impedance Mismatch

•  As a result, if we want to use a richer in-memory data structure, we have to translate it to a relational representation to store it on disk.

•  While object-oriented languages succeeded, object-oriented databases faded into obscurity.

•  Impedance mismatch has been made much easier to deal with by Object-Relational Mapping (ORM) frameworks such as EclipseLink, Hibernate and other JPA implementations.
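The translation work behind the impedance mismatch can be sketched by hand, without an ORM. The order structure, the two tables and the save and load helpers are hypothetical; an ORM generates roughly this kind of mapping code for us.

```python
import sqlite3

# Rich in-memory structure: one order with a nested list of line items.
order = {
    "id": 1,
    "customer": "alice",
    "lines": [
        {"product": "book", "quantity": 2},
        {"product": "pen", "quantity": 5},
    ],
}

conn = sqlite3.connect(":memory:")
# Tuples only hold simple values, so the nested list needs its own table.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE order_lines "
    "(order_id INTEGER, position INTEGER, product TEXT, quantity INTEGER)"
)


def save(conn, order):
    """Flatten the nested structure into rows of two relations."""
    with conn:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"])
        )
        conn.executemany(
            "INSERT INTO order_lines VALUES (?, ?, ?, ?)",
            [
                (order["id"], pos, line["product"], line["quantity"])
                for pos, line in enumerate(order["lines"])
            ],
        )


def load(conn, order_id):
    """Rebuild the nested structure from the two relations."""
    oid, customer = conn.execute(
        "SELECT id, customer FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT product, quantity FROM order_lines "
        "WHERE order_id = ? ORDER BY position",
        (order_id,),
    ).fetchall()
    lines = [{"product": p, "quantity": q} for p, q in rows]
    return {"id": oid, "customer": customer, "lines": lines}


save(conn, order)
restored = load(conn, 1)
print(restored == order)
```

One in-memory object becomes three rows in two tables, and reading it back needs two queries plus reassembly code.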

Impedance Mismatch

•  ORMs remove a lot of work, but can become a problem when people try to ignore the database, and query performance suffers

•  This is where NoSQL databases work well. Why?

Application and Integration DBs

•  This happens often in software projects.

•  In this scenario, the database acts as an integration database.

Application and Integration DBs

•  The downsides of a shared database are:

–  Its structure tends to be more complex than any single application needs,

–  If an application wants to make changes to its data storage, it needs to coordinate with all the other applications,

–  Performance degradation due to the huge number of accesses
–  Errors in database usage, since it is accessed by applications written by different teams.

•  This is different from single-application databases

Application and Integration DBs

•  Interoperability concerns can now shift to the interfaces of the application, allowing interaction over HTTP.

•  This is what happens with micro-services (http://www.tikalk.com/java/micro-services/)

•  Micro-services and Web services in general enable a form of communication based on data.

•  Data is represented as documents, formerly in XML and now in JSON format.

Application and Integration DBs

•  If you are going to use service integration, using text over HTTP is the way to go.

•  However, if performance is a concern, there are binary protocols.
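As a small illustration of data exchanged as text documents, the sketch below serializes a nested structure to JSON and parses it back with Python's standard json module. The order document is made up for the example, and the HTTP transport itself is left out.

```python
import json

# Hypothetical document a service could return in an HTTP response body.
order = {
    "id": 1,
    "customer": "alice",
    "lines": [{"product": "book", "quantity": 2}],
}

# The sender turns the nested structure into text...
body = json.dumps(order, sort_keys=True)

# ...and the receiver, possibly written in another language, parses it back.
received = json.loads(body)

print(body)
```

Unlike a relational tuple, the document keeps the nested list as it is, so services can exchange rich structures without agreeing on a shared table layout.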

Attack of the Clusters

•  In the 2000s several large web properties dramatically increased in scale:
–  Websites started tracking activity and structure in detail (analytics)

–  Large sets of data appeared: links, social networks, activity in logs, mapping data.

–  With this growth in data came a growth in users

•  Coping with this increase in data and traffic required more computing resources.

Attack of the Clusters

•  There were two choices:

–  Scaling up
–  Scaling out

•  Scaling up implies bigger machines, with more processors, disk storage and memory ($$$).

•  Scaling out was the alternative:
–  Use a lot of small machines in a cluster.
–  It is cheaper and also more resilient (to individual failures)

Attack of the Clusters

•  This revealed a new problem: relational databases are not designed to run on clusters.

•  Clustered relational databases (e.g. Oracle RAC or Microsoft SQL Server) work on the concept of a shared disk subsystem.

•  RDBMSs can also run on separate servers with different sets of data (sharding)

Attack of the Clusters

•  However, sharding needs an application to control the sharded database.

•  Also, we lose querying, referential integrity, transactions and consistency controls across shards.

•  These technical issues are exacerbated by licensing costs.

•  This mismatch between DBs and clusters led some organizations to consider different solutions
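A minimal sketch of application-controlled sharding, assuming three independent SQLite databases as stand-ins for three servers and a hash of the key as the routing rule. The function names and the users table are hypothetical; real deployments also have to handle resharding, replication and failures.

```python
import hashlib
import sqlite3

# Three independent databases stand in for three shard servers.
shards = [sqlite3.connect(":memory:") for _ in range(3)]
for db in shards:
    db.execute("CREATE TABLE users (name TEXT PRIMARY KEY, city TEXT)")


def shard_for(key):
    """Pick a shard from a stable hash of the key (application logic)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def put(name, city):
    db = shard_for(name)
    with db:
        db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (name, city))


def get(name):
    row = shard_for(name).execute(
        "SELECT city FROM users WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def count_by_city(city):
    """Cross-shard query: the application asks every shard and merges."""
    return sum(
        db.execute(
            "SELECT COUNT(*) FROM users WHERE city = ?", (city,)
        ).fetchone()[0]
        for db in shards
    )


for name, city in [("alice", "Bari"), ("bob", "Rome"),
                   ("carol", "Bari"), ("dave", "Bari")]:
    put(name, city)

print(get("alice"), count_by_city("Bari"))
```

Lookups by key stay cheap, but anything that spans shards (the count above, a JOIN, a transaction touching two users) has to be reimplemented in the application, which is the loss described in the slide.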

The emergence of NoSQL

•  Two companies in particular, Google and Amazon, have been very influential.

•  They were capturing large amounts of data, and their business is built on data management

•  Both companies produced influential papers:
–  BigTable from Google
–  Dynamo from Amazon

The emergence of NoSQL

•  As part of the innovation in data management systems, several new technologies were built:
–  2003 - Google File System,
–  2004 - MapReduce,
–  2006 - BigTable,
–  2007 - Amazon Dynamo,
–  2012 - Google Compute Engine

•  Each solved different use cases and had a different set of assumptions.

•  All these mark the beginning of a different way of thinking about data management.

The emergence of NoSQL

•  It is ironic that the term “NoSQL” first appeared in the late ’90s as the name of a relational database made by Carlo Strozzi.

•  The name comes from the fact that it did not use SQL as its query language.

•  However, the usage of “NoSQL” as we know it today comes from a meetup held in 2009 in San Francisco.

•  They wanted a term that could be used as a Twitter hashtag: #NoSQL

NoSQL Characteristics

1.  They don’t use SQL. (HBase, Cassandra, Redis…)
2.  They are generally open-source projects.
3.  Most of them are designed to run on clusters.
4.  RDBMSs use ACID transactions to handle consistency across the whole database. NoSQL databases resort to other options (CAP theorem).
5.  Not all are cluster oriented (Graph DBs)
6.  NoSQL databases operate without a schema. (schema free)
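What "schema free" means in practice can be sketched with plain Python dictionaries standing in for a document collection. The people collection, its fields and the helper functions are hypothetical and not the API of any specific NoSQL product.

```python
# A "collection" of documents: plain dicts with no declared schema.
people = {}


def put(key, document):
    """Store any document under a key; no structure is enforced."""
    people[key] = document


# Two documents in the same collection with different fields.
put("alice", {"name": "Alice", "email": "alice@example.org"})
put("bob", {"name": "Bob",
            "phones": ["555-0100", "555-0101"],
            "address": {"city": "Bari"}})


def cities():
    """Readers must cope with fields that some documents do not have."""
    return sorted(
        doc["address"]["city"] for doc in people.values() if "address" in doc
    )


print(cities())
```

Adding a new field needs no migration, but the knowledge of which fields exist moves from the database into the application code that reads the data.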

The emergence of NoSQL

•  NoSQL does not stand for Not-Only SQL.

•  It is better to think of NoSQL as a movement rather than a technology.

•  RDBMSs are not going away.

•  The change is that relational databases are now one option among many

•  This point of view is often referred to as polyglot persistence

The emergence of NoSQL

•  Instead of just picking a relational database, we need to understand:
1.  The nature of the data we are storing, and
2.  How we want to manipulate it.

•  In order to deal with this change most organizations need to shift from integration databases to application databases.

•  In this course we concentrate on Big Data running on clusters.

The emergence of NoSQL

•  The Big Data concerns have created an opportunity for people to think freshly about their storage needs.

•  NoSQL helps developer productivity by simplifying database access, even if they have no need to scale beyond a single machine.
