1. introduction to the course "designing data bases with advanced data models (nosql &...
TRANSCRIPT
Enterprise computing evolution • We’ve spent several years in the world of enterprise computing
• We’ve seen many things change in: – Languages, Architectures, – Platforms, and Processes.
• But one thing stayed constant: “relational databases stored the data”
• The data storage question for architects was: “which relational database to use”
1
The stability of the reign • Why? • An organization’s data lasts much longer than its programs (COBOL?)
• It’s valuable to have stable data storage which is accessible from many applications
• In the last decades RDBMSs have been successful in solving problems related to storing, serving and processing data.
2
FROM BUSINESS TO DECISION SUPPORT
Before going deep into the course topics, let’s understand the evolution of decision support systems.
3
From business to decision support: ‘60
• Starting from the ’60s, data were stored on magnetic disks.
• Supported analyses were static, only aggregated and pretty limited
• For instance, it was possible to extract the total amount of last month’s sales
4
From business to decision support: ‘80
• With relational databases and SQL, data analysis started to become somewhat dynamic.
• SQL allows us to extract data at both detailed and aggregated levels
• Transactional activities are stored in Online Transaction Processing (OLTP) databases
• OLTP databases are used in several applications such as orders, salaries, invoices…
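As a minimal sketch of the two levels of extraction mentioned above, the following uses Python's built-in sqlite3 module. The `orders` table, its columns and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical OLTP-style table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, month, amount) VALUES (?, ?, ?)",
    [("alice", "2015-01", 120.0), ("bob", "2015-01", 80.0), ("alice", "2015-02", 50.0)],
)

# Detailed level: every order placed by one customer.
detail = conn.execute(
    "SELECT month, amount FROM orders WHERE customer = ? ORDER BY id", ("alice",)
).fetchall()

# Aggregated level: total sales per month.
totals = conn.execute(
    "SELECT month, SUM(amount) FROM orders GROUP BY month ORDER BY month"
).fetchall()
```

The same table answers both questions; only the query changes, which is what made analysis more dynamic than the fixed reports of the earlier era.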
5
From business to decision support: ‘80
• In the best case, the described modules are included in Enterprise Resource Planning (ERP) software
• Examples of such vendors are SAP, Microsoft, HP and Oracle.
• Normally, what happens is that each module is implemented as ad-hoc software with its own database.
• Cons: Data representation and integration.
6
OLTP design
• Such databases are designed to be: – Strongly normalized, – Fast at inserting data.
• However, data normalization: – Does not favor reading huge quantities of data, – Increases the number of tables used to store records.
7
Foster decision support • In order to support “Decisions” we need to extract de-normalized data via several JOINs.
• Moreover, operational databases offer no or limited visibility into historical data.
Considerations: • These factors make data analysis on OLTP databases hard…
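To make the cost of normalization concrete, here is a small sketch using Python's sqlite3 module. The four-table schema and its data are assumptions made for this example; the point is only that one de-normalized report row needs three JOINs.

```python
import sqlite3

# Hypothetical normalized schema: each fact lives in exactly one table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (id INTEGER PRIMARY KEY, title TEXT, price REAL);
CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(id));
CREATE TABLE order_lines (order_id INTEGER REFERENCES orders(id),
                          product_id INTEGER REFERENCES products(id),
                          quantity INTEGER);
INSERT INTO customers VALUES (1, 'alice');
INSERT INTO products VALUES (1, 'book', 10.0), (2, 'pen', 2.0);
INSERT INTO orders VALUES (1, 1);
INSERT INTO order_lines VALUES (1, 1, 2), (1, 2, 5);
""")

# A single de-normalized view for analysis has to join all four tables.
report = conn.execute("""
SELECT c.name, p.title, l.quantity * p.price AS revenue
FROM order_lines AS l
JOIN orders    AS o ON o.id = l.order_id
JOIN customers AS c ON c.id = o.customer_id
JOIN products  AS p ON p.id = l.product_id
ORDER BY p.title
""").fetchall()
```

Inserting an order line touches one small table, which is why writes are fast, but every analytical read pays for the joins.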
8
From business to decision support: ‘90
• Thus, the ’90s saw the birth of databases designed to support analysis of data coming from OLTP databases.
• This is the rise of Data Warehouses (DWs) • DWs are central repositories of integrated data from one or more disparate sources.
• They store current and historical data and are used for creating trending reports for management reporting such as annual and quarterly comparisons.
9
Types of DW systems • Data Mart
– is a simple form of a data warehouse that is focused on a single subject (or functional area), such as sales, finance or marketing.
• Online analytical processing (OLAP): – is characterized by a relatively low volume of transactions. – Queries are often very complex and involve aggregations. – OLAP databases store aggregated, historical data in multi-dimensional schemas.
10
Types of DW systems • Predictive analysis
– Predictive analysis is about finding and quantifying hidden patterns in the data using complex mathematical models that can be used to predict future outcomes.
– Predictive analysis is different from OLAP in that OLAP focuses on historical data analysis and is reactive in nature, while predictive analysis focuses on the future.
11
Data Warehouse • Data stored in DWs are the starting point for Business Intelligence (BI).
• Def. “Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes”.
• With the evolution of BI systems we moved from SQL-based analysis to visual instruments.
12
OLAP features • OLAP systems support data exploration through drill-down, drill-up, slicing and dicing operations.
• However, we still have a historical point of view of what happened in the business, but not of what is happening now.
• We cannot make predictions about the future
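The exploration operations named above can be sketched in plain Python over a tiny fact table. The rows, the dimension names and the helper functions are hypothetical; a real OLAP engine would precompute and index these aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales).
facts = [
    (2014, "Q1", "EU", "book", 100),
    (2014, "Q2", "EU", "pen", 40),
    (2014, "Q1", "US", "book", 70),
    (2015, "Q1", "EU", "book", 90),
]
DIMS = {"year": 0, "quarter": 1, "region": 2, "product": 3}


def aggregate(rows, dims):
    """Sum sales grouped by the given list of dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[4]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[DIMS[dim]] == value]


def dice(rows, **allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [r for r in rows if all(r[DIMS[d]] in vals for d, vals in allowed.items())]


by_year = aggregate(facts, ["year"])                      # drill-up (coarser view)
by_year_quarter = aggregate(facts, ["year", "quarter"])   # drill-down (finer view)
eu_by_year = aggregate(slice_(facts, "region", "EU"), ["year"])
eu_books = aggregate(dice(facts, region={"EU"}, product={"book"}), ["year"])
```

All four results describe sales that already happened, which matches the limitation noted above: the operations explore history and do not predict.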
14
From business to decision support: ‘00
• Starting from 2000, the need for predictive analysis arose.
• The techniques used in this scenario belong to the field of Data Mining
• Data Mining is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
15
From business to decision support: ‘00
• The overall goal of the data mining process is to extract novel information from a data set and transform it into understandable knowledge.
• Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
16
From business to decision support: Recap
• In the last decades RDBMSs have been successful in solving problems related to storing, serving and processing data.
17
From business to decision support: Recap
• Vendors such as Oracle, Vertica, Teradata, Microsoft and IBM proposed their solutions based on Relational Algebra and SQL.
• With Data Mining we are able to extract knowledge from data which can be used to support predictive analysis (Data Mining is not only prediction!!!).
18
1. Scaling Up Databases
A question I’m often asked about Heroku is: “How do you scale the SQL database?” There’s a lot of things I can say about using caching, sharding, and other techniques to take load off the database. But the actual answer is: we don’t. SQL databases are fundamentally non-scalable, and there is no magical pixie dust that we, or anyone, can sprinkle on them to suddenly make them scale.
Adam Wiggins, Heroku
Patterson, David; Fox, Armando (2012-07-11). Engineering Long-Lasting Software: An Agile Approach Using SaaS and Cloud Computing, Alpha Edition (Kindle Locations 1285-1288). Strawberry Canyon LLC. Kindle Edition.
21
What is the problem with RDBMSs • Caching • Master/Slave • Master/Master • Cluster • Table Partitioning • Federated Tables • Sharding • Distributed DBs
29
http://codefutures.com/database-sharding/
What is the problem with RDBMSs • RDBMSs can somehow deal with these aspects, but they have issues related to: – expensive licensing, – the need for complex application logic, – dealing with evolving data models
• There was a need for systems that: – work with different kinds of data formats, – do not require a strict schema, – and are easily scalable.
30
NoSQL • It was born out of a need to handle large data volumes • It forces a fundamental shift to building large hardware platforms through clusters of commodity servers.
• This need also arises from the difficulties of making application code play well with relational databases
32
NoSQL • The term “NoSQL” is very ill-defined. • It’s generally applied to a number of recent non-relational databases: Cassandra, Mongo, Neo4j, HBase, Redis,…
• They: – embrace schemaless data, – run on clusters, – and have the ability to trade off traditional consistency for other useful properties
33
Why are NoSQL Databases Interesting 1. Application development productivity:
– A lot of application development effort is spent on mapping data between in-memory data structures and relational databases
– A NoSQL database may provide a data model that can simplify that interaction, resulting in less code to write, debug, and evolve.
34
Why are NoSQL Databases Interesting 2. Large-scale data:
– Organizations are finding it valuable to capture more data and process it more quickly.
– They are finding it expensive to do so with relational databases.
– NoSQL databases are more economical when run on large clusters of many smaller and cheaper machines.
– Many NoSQL databases are designed to run on clusters, so they are a better fit for Big Data scenarios.
35
WHY NOSQL
[Figure: axis label “Connectedness”; labels: Internet Hypertext, RSS, wikis, blogs, tagging, user generated content, RDF, ontologies]
36
The Value of Relational Databases
• Relational databases have become such an embedded part of our computing culture.
• What are the benefits they provide?
38
Getting at Persistent Data • The most obvious value of a database is keeping large amounts of persistent data
• Two areas of memory: – Main memory: fast, volatile, limited in space, and loses its data when power is lost
– Backing store: larger but slower, commonly seen as a disk
39
Getting at Persistent Data • The backing store can be organized in all sorts of ways.
• For many productivity applications (such as word processors) it is a file in the file system.
• For most enterprise applications, however, the backing store is a database.
• A database allows more flexibility than a file system.
41
Concurrency • Concurrency is notoriously difficult to get right. • Object orientation is not the right programming model to deal with concurrency.
• Since enterprise applications can have a lot of concurrent users, there is a lot of room for bad things to happen.
• Relational databases have transactions that help mitigate this problem, but….
42
Concurrency • You still have to deal with a transactional error when you try to book a room that has just gone.
• The reality is that the transactional mechanism has worked well to contain the complexity of concurrency.
• Transactions with rollback allow us to deal with errors.
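The room-booking case can be sketched with Python's sqlite3 module, whose connection object commits or rolls back when used as a context manager. The `rooms` and `charges` tables, the `RoomTaken` exception and the `book_room` helper are hypothetical names chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (id INTEGER PRIMARY KEY, booked_by TEXT)")
conn.execute("CREATE TABLE charges (room_id INTEGER, guest TEXT)")
conn.execute("INSERT INTO rooms (id, booked_by) VALUES (101, NULL)")
conn.commit()


class RoomTaken(Exception):
    """Raised when the requested room was booked by someone else first."""


def book_room(conn, room_id, guest):
    """Charge the guest and mark the room as booked, as one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO charges (room_id, guest) VALUES (?, ?)", (room_id, guest)
        )
        cur = conn.execute(
            "UPDATE rooms SET booked_by = ? WHERE id = ? AND booked_by IS NULL",
            (guest, room_id),
        )
        if cur.rowcount == 0:
            # The room is gone: the rollback also undoes the charge above.
            raise RoomTaken(f"room {room_id} is no longer available")


book_room(conn, 101, "alice")
try:
    book_room(conn, 101, "bob")
    outcome = "booked"
except RoomTaken:
    outcome = "rejected"

owner = conn.execute("SELECT booked_by FROM rooms WHERE id = 101").fetchone()[0]
charged = [g for (g,) in conn.execute("SELECT guest FROM charges ORDER BY rowid")]
```

The application still has to handle the error for the second guest, but the rollback guarantees that no half-finished booking (a charge without a room) is left behind.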
43
Integration • Enterprise applications live in a rich ecosystem • Multiple applications written by different teams need to collaborate in order to get things done • This collaboration is done via data sharing. • A common way to do this is shared database integration [Hohpe and Woolf], where multiple applications store their data in a single database
• Using a single database allows all the applications to share data easily, while the database’s concurrency control handles multiple applications in the same way as it handles multiple users.
44
A (Mostly) Standard Model • Relational databases have succeeded because they have a standard model
• As a result, developers and database professionals can apply the same knowledge in several projects.
• Although there are differences between RDBMSs, the core mechanisms remain the same.
45
Impedance Mismatch • It is the difference between the relational model and the in-memory data structures.
• The relational model organizes data into tables and rows (relations and tuples). – A tuple is a set of name-value pairs – A relation is a set of tuples.
• All SQL operations consume and return relations.
46
Impedance Mismatch • Tuples and relations provide elegance and simplicity, but they also introduce limitations.
• In particular, the values in a relational tuple have to be simple.
• They cannot contain any structure, such as a nested record or a list.
• This limitation does not apply to in-memory data structures.
48
Impedance Mismatch • As a result, if we want to use a richer in-memory data structure, we have to translate it to a relational representation to store it on disk.
• While object-oriented languages succeeded, object-oriented databases faded into obscurity.
• Impedance mismatch has been made much easier to deal with by Object-Relational Mapping (ORM) frameworks such as EclipseLink, Hibernate and other JPA implementations.
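The translation step can be sketched by hand in Python with sqlite3. The `Order` and `LineItem` classes, the two tables and the `save` and `load` helpers are hypothetical; an ORM generates this kind of mapping code automatically.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class LineItem:
    product: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    items: list = field(default_factory=list)  # nested list: not a simple value


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE line_items (order_id INTEGER REFERENCES orders(id),
                         product TEXT, quantity INTEGER);
""")


def save(conn, order):
    """Flatten one nested object into rows of two tables."""
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order.order_id, order.customer))
        conn.executemany(
            "INSERT INTO line_items VALUES (?, ?, ?)",
            [(order.order_id, i.product, i.quantity) for i in order.items],
        )


def load(conn, order_id):
    """Reassemble the nested object from its flat rows."""
    (customer,) = conn.execute(
        "SELECT customer FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT product, quantity FROM line_items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return Order(order_id, customer, [LineItem(p, q) for p, q in rows])


original = Order(1, "alice", [LineItem("book", 2), LineItem("pen", 5)])
save(conn, original)
restored = load(conn, 1)
```

One object in memory becomes three rows in two tables, and every read has to stitch them back together; that mapping effort is the impedance mismatch.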
49
Impedance Mismatch • ORMs remove a lot of work, but can become a problem when people try to ignore the database and query performance suffers
• This is where NoSQL databases work well. Why?
50
Application and Integration DBs • Sharing one database among applications happens many times in software projects.
• In this scenario, the database acts as an integration database.
• Sharing a database has downsides.
51
Application and Integration DBs • The downsides of sharing a database are:
– Its structure tends to be more complex than any single application needs,
– If an application wants to make changes to its data storage, it needs to coordinate with all the other applications,
– Performance degradation due to the huge number of accesses, – Errors in database usage, since it is accessed by applications written by different teams.
• This is different from single-application databases
52
Application and Integration DBs • Interoperability concerns can now shift to the interfaces of the application, allowing interaction over HTTP.
• This is what happens with micro-services (http://www.tikalk.com/java/micro-services/)
• Micro-services and Web services in general enable a form of communication based on data.
• Data is represented as documents, formerly in XML and now in JSON.
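As a small sketch of data exchanged as a document, the following serializes a nested structure to JSON text and parses it back using Python's json module. The `order` contents are hypothetical, and no real service or endpoint is implied.

```python
import json

# Hypothetical order document: nesting is allowed, unlike in a relational tuple.
order = {
    "id": 1,
    "customer": "alice",
    "items": [
        {"product": "book", "quantity": 2},
        {"product": "pen", "quantity": 5},
    ],
}

payload = json.dumps(order)      # text that one service would send over HTTP
received = json.loads(payload)   # what the consuming service reconstructs
```

The consumer gets the whole nested structure in one document and does not need to know how the producer stores it.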
53
Application and Integration DBs • If you are going to use service integration, text over HTTP is the way to go.
• However, when performance is a concern, binary protocols are available.
54
Attack of the Clusters • In the 2000s several large web properties dramatically increased in scale: – Websites started tracking activity and structure in detail (analytics)
– Large sets of data appeared: links, social networks, activity in logs, mapping data.
– With this growth in data came a growth in users
• Coping with this increase in data and traffic required more computing resources.
55
Attack of the Clusters • There were two choices:
– Scaling up – Scaling out
• Scaling up implies bigger machines with more processors, disk storage and memory ($$$).
• Scaling out was the alternative: – Use a lot of small machines in a cluster. – It is cheaper and also more resilient (the cluster survives individual failures)
56
Attack of the Clusters • This revealed a new problem: relational databases are not designed to run on clusters.
• Clustered relational databases (e.g. Oracle RAC or Microsoft SQL Server) work on the concept of a shared disk subsystem.
• RDBMSs can also run on separate servers with different sets of data (sharding)
57
Attack of the Clusters • However, sharding needs an application to control the sharded database.
• Also, we lose querying, referential integrity, transactions and consistency control across shards.
• These technical issues are exacerbated by licensing costs.
• This mismatch between DBs and clusters led some organizations to consider different solutions
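The sketch below illustrates why sharding pushes work into the application. Three in-memory SQLite databases stand in for separate servers; the hash-based routing, the `users` table and all function names are assumptions made for this example, not a description of any particular product.

```python
import hashlib
import sqlite3

NUM_SHARDS = 3
# Each in-memory database stands in for a separate database server.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE users (name TEXT PRIMARY KEY, visits INTEGER)")


def shard_for(key):
    """Application-side routing: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % NUM_SHARDS]


def put(name, visits):
    db = shard_for(name)
    with db:  # a transaction covers this one shard only
        db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (name, visits))


def get(name):
    row = shard_for(name).execute(
        "SELECT visits FROM users WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def total_visits():
    """Cross-shard query: no single SQL statement spans shards,
    so the application queries each one and combines the results."""
    return sum(
        db.execute("SELECT COALESCE(SUM(visits), 0) FROM users").fetchone()[0]
        for db in shards
    )


for name, visits in [("alice", 3), ("bob", 5), ("carol", 7)]:
    put(name, visits)
```

Single-key reads and writes stay simple, but any query, constraint or transaction that touches more than one shard has to be rebuilt in application code.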
58
The emergence of NoSQL • Two companies in particular, Google and Amazon, have been very influential.
• They were capturing large amounts of data, and their business depends on data management
• Both companies produced influential papers: – BigTable from Google – Dynamo from Amazon
59
The emergence of NoSQL • As part of the innovation in data management systems, several new technologies were built: – 2003 - Google File System, – 2004 - MapReduce, – 2006 - BigTable, – 2007 - Amazon Dynamo, – 2012 - Google Compute Engine
• Each solved different use cases and had a different set of assumptions.
• All these mark the beginning of a different way of thinking about data management.
60
The emergence of NoSQL • It is ironic that the term “NoSQL” first appeared in the late ’90s as the name of a relational database made by Carlo Strozzi.
• The name comes from the fact that it did not use SQL as its query language.
• However, the usage of “NoSQL” as we understand it today comes from a meetup held in 2009 in San Francisco.
• The organizers wanted a term that could be used as a Twitter hashtag: #NoSQL
61
NoSQL Characteristics 1. They don’t use SQL (HBase, Cassandra, Redis…). 2. They are generally open-source projects. 3. Most of them are designed to run on clusters. 4. RDBMSs use ACID transactions to handle consistency across the whole database. NoSQL databases resort to other options (CAP theorem).
5. Not all are cluster oriented (Graph DBs). 6. NoSQL databases operate without a schema (schema free).
62
The emergence of NoSQL • NoSQL does not stand for Not-Only SQL. • It is better to think of NoSQL as a movement rather than a technology.
• RDBMSs are not going away. • The change is that relational databases are now one option among others • This point of view is often referred to as polyglot persistence
63
The emergence of NoSQL • Instead of just picking a relational database, we need to understand: 1. The nature of the data we are storing, and 2. How we want to manipulate it.
• In order to deal with this change most organizations need to shift from integration databases to application databases.
• In this course we concentrate on Big Data running on clusters.
64
The emergence of NoSQL • The Big Data concerns have created an opportunity for people to think freshly about their storage needs.
• NoSQL databases help developer productivity by simplifying database access, even when there is no need to scale beyond a single machine.
65