cmsc424: database design

62
CMSC424: Database Design Instructor: Amol Deshpande [email protected]

Upload: aerona

Post on 24-Feb-2016

47 views

Category:

Documents


0 download

DESCRIPTION

CMSC424: Database Design. Instructor: Amol Deshpande [email protected]. Today. Motivation: Why study databases ? Syllabus Administrivia Workload etc Data management challenges in a very simple application - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CMSC424: Database Design

CMSC424: Database Design

Instructor: Amol Deshpande [email protected]

Page 2: CMSC424: Database Design

Motivation: Why study databases ? Syllabus Administrivia

◦ Workload etc Data management challenges in a very simple

application

We will also discuss some interesting open problems/research directions

No laptop use allowed in the class !!

Today

Page 3: CMSC424: Database Design

There is a *HUGE* amount of data in this world Everywhere you see… Personal

◦ Emails, data on your computer Enterprise

◦ Banks, supermarkets, universities, airlines etc etc Scientific

◦ Biological, astronomical World wide web

◦ Social networks etc…

Motivation: Data Overload

Page 4: CMSC424: Database Design

Much more is produced every day

IBM: “… in 2005, the amount of data will grow from 3.2 million exabytes to 43 million exabytes”

Wal-mart: 583 terabytes of sales and inventory dataAdds a billion rows every day“we know how many 2.4 ounces of tubes of toothpastes sold yesterday and what was sold with them”

Neilsen Media Research: 20 GB a day; total 80-100 TB 12000 households or personal meters

Extending to iPods and TiVos in recent years

Motivation: Data Overload

Page 5: CMSC424: Database Design

Motivation: Data Overload Scientific data is literally astronomical on scale“Wellcome Trust Sanger Institute's World Trace Archive

database of DNA sequences hit one billion entries..”

Stores all sequence data produced and published by the world scientific community

22 Tbytes and doubling every 10 months

"Scanning the whole dataset for a single genetic sequence… like searching for a single sentence in the contents of the British Library”

Page 6: CMSC424: Database Design

Automatically generated data through instrumentation

“Britain to log vehicle movements through cameras. 35 million reads per day.”

Wireless sensor networks are becoming ubiquitous.

RFID: Possible to track every single piece of product throughout its life (“Gillette boycott”)

Motivation: Data Overload

Page 7: CMSC424: Database Design

Motivation: Data Overload

WIRED MAGAZINE: ISSUE 16.07

Page 8: CMSC424: Database Design

Motivation: Data Overload

WIRED MAGAZINE: ISSUE 16.07

Page 9: CMSC424: Database Design

How do we do anything with this data?

Where and how do we store it ?◦ Disks are doubling every 18 months or so -- not

enough◦ In many cases, the data is not actually recorded as

it is; summarized first

What if the disks crash ?◦ Very common, especially with 1000’s of disks for

each application

Motivation: Data Overload

Page 10: CMSC424: Database Design

What to do with the data ?◦ text search ?

◦ “find the stores with the maximum increase in sales in last month”

◦ “how much time from here to Pittsburgh if I start at 2pm ?” Data is there; more will be soon (GPS, live traffic data) Requires predictive capabilities

◦ “how many DVDs should we order?” (Netflix)

◦ …

Motivation: Data Overload

Page 11: CMSC424: Database Design

What to do with the data ?◦ Find videos with this incident

Not even clear “how” to do this

◦ Mine the “blogs” to detect “buzz”

◦ More and more need to convert “information” to “knowledge” “Data mining”

◦ Most of these are open problems; we won’t discuss them much

Motivation: Data Overload

Page 12: CMSC424: Database Design

Speed !! ◦ With TB’s of data, just finding something (even if you know

what), is not easy Reading a file with TB of data can take hours

◦ Imagine a bank and millions of ATMs How much time does it take you to do a withdrawal ? The data is not local

How do we ensure “correctness” ?◦ Can’t have money disappearing◦ Harder than you might think

Motivation: Data Overload

Page 13: CMSC424: Database Design

How do we guarantee the data will be there 10 years from now ?

Privacy and security !!!◦ Every other day we see some database leaked on the web

Data integration (e.g. Web) New kinds of data

◦ Scientific/biological, Image, Audio/Video, Sensor data etc Out of scope

◦ Interesting research challenges !

More questions…

Page 14: CMSC424: Database Design

Provide a systematic way to answer many of these questions…

Aim is to allow easy management of data◦ Store it ◦ Update it◦ Query it

Massively successful for structured data◦ Structured ?

DBMS to the Rescue

Page 15: CMSC424: Database Design

A lot of the data we encounter is structured◦ Some have very simple structures ◦ E.g. Data that can be represented in tabular forms◦ Significantly easier to deal with◦ We will focus on such data for much of the class

Structured vs Unstructured

Accountbname acct_no balanceDowntown

MianusPerryR.H

A-101A-215A-102A-305

500700400350

Customercname cstreet ccity

JonesSmithHayesCurry

Lindsay

MainNorthMainNorthPark

HarrisonRye

HarrisonRye

Pittsfield

Page 16: CMSC424: Database Design

Some data has a little more complicated structure◦ E.g graph structures

Map data, social networks data, the web link structure etc◦ In many cases, can convert to tabular forms (for storing)◦ Slightly harder to deal with

Queries require dealing with the graph structure

Structured vs Unstructured

Page 17: CMSC424: Database Design

Collaborations Graph

Query: Find my Erdos Number.

Page 18: CMSC424: Database Design

Increasing amount of data in a semi-structured format◦ XML – Self-describing tags (HTML ?)◦ Complicates a lot of things◦ We will discuss this toward the end

Structured vs Unstructured

Page 19: CMSC424: Database Design

A huge amount of data is unfortunately unstructured◦ Books, WWW ◦ Amenable to pretty much only text search… so far

Information Retreival deals with this topic

◦ What about Google ? Google is mainly successful because it uses the structure

Video ? Music ?

Structured vs Unstructured

Page 20: CMSC424: Database Design

Provide a systematic way to answer most of these questions…◦ … for structured data◦ … increasing for semi-structured data

XML database systems have been coming up

Solving the same problems for truly unstructured data remains an open problem◦ Much research in Information Retrieval community

DBMS to the Rescue

Page 21: CMSC424: Database Design

They are everywhere !! Enterprises

◦ Banks, airlines, universities Internet

◦ Searchsystems.net lists 35568 37220 46838 public records DBs◦ Amazon, Ebay, IMDB

Blogs, social networks… Your computer (emails especially) …

DBMS to the Rescue

Page 22: CMSC424: Database Design

Provide a systematic way to answer most of these questions…◦ … for structured data◦ … increasing for semi-structured data

XML database systems have been coming up

Solving the same problems for truly unstructured data remains an open problem◦ Much research in Information Retrieval community

What about cloud computing ? MapReduce ?◦ More generally: “web scale data management”

What about large-scale data analytics ? Statistics ?

DBMS to the Rescue

Page 23: CMSC424: Database Design

Ongoing debate/issue ◦ Cloud computing seems to eschew DBMSs in favor of homegrown solutions◦ E.g. Google, Facebook, Amazon etc…

MapReduce: A paradigm for large-scale data analysis◦ Hadoop: An open source implementation

Why ? ◦ DBMSs can’t scale to the needs, not fault-tolerant enough

These apps don’t need things like transactions, that complicate DBMSs (???)◦ Mapreduce favors Unix-style programming, doesn’t require SQL

Try writing SVMs or decision trees in SQL◦ Cost

Companies like Teradata may charge $100,000 per TB of data managed Related talk on Thursday, Sept 3, 4pm, AVW 3258

Web Scale Data Management, Analysis

Page 24: CMSC424: Database Design

representing information◦ data modeling

languages and systems for querying data◦ complex queries & query semantics◦ over massive data sets

concurrency control for data manipulation◦ controlling concurrent access ◦ ensuring transactional semantics

reliable data storage◦ maintain data semantics even if you pull the plug

What we will cover…

Page 25: CMSC424: Database Design

We will see…◦Algorithms and cost analyses◦System architecture and implementation◦Resource management and scheduling◦Computer language design, semantics and

optimization◦Applications of AI topics including logic and planning◦Statistical modeling of data

What we will cover…

Page 26: CMSC424: Database Design

We will mainly discuss structured data◦ That can be represented in tabular forms (called Relational data)◦ We will spend some time on XML◦ We will also spend some time on Mapreduce-like stuff

Still the biggest and most important business (?)◦ Well defined problem with really good solutions that work

Contrast XQuery for XML vs SQL for relational ◦ Solid technological foundations

Many of the basic techniques however are directly applicable◦ E.g. reliable data storage etc.◦ Cf. Many recent attempts to add SQL-like capabilities, transactions to

Mapreduce and related technologies

What we will cover…

Page 27: CMSC424: Database Design

Instructor: Amol Deshpande◦ 3221 AV Williams Bldg◦ [email protected]◦ Class Webpage:

Off of http://www.cs.umd.edu/~amol, Or http://www.cs.umd.edu/class

Email to me: write CMSC424 in the title.

TA: Walaa Eldin Moustafa

Administrivia Break

Page 28: CMSC424: Database Design

Textbook:◦ Database System Concepts

Fifth Edition Abraham Silberschatz, Henry F. Korth, S. Sudarshan

Lecture notes will be posted on the webpage◦ Videos also once I figure out where to host them

http://forum.cs.umd.edu◦ We will use this in place of a newsgroup◦ First resort for any questions◦ General announcements will be posted there◦ Register today !

Administrivia Break

Page 29: CMSC424: Database Design

Workload:◦ 4 homeworks (10%)◦ 2 Mid-terms, Final (50%)◦ An SQL assignment (10%)◦ A programming assignment (10%)

Likely based on BerkeleyDB

◦ An application development project (20%) Front-loaded

Schedule on the webpage First assignment out next week, due a week later

Administrivia Break

Page 30: CMSC424: Database Design

Grading◦ Fixed◦ 80+: A◦ 70+: B◦ 60+: C◦ 60-: D/F

Most had 40+ on non-exams last two times (out of 50)◦ Exams are usually somewhat harder

Administrivia Break

Page 31: CMSC424: Database Design

Relational DBMSs (also known as “they are still around?”)◦ Oracle, IBM DB2, Microsoft SQL Server, Sybase

Open source alternatives◦ MySQL, PostgreSQL, Apache Derby, BerkeleyDB (mainly a storage engine –

no SQL), neo4j (graph data) …

Data Warehousing Solutions◦ Geared towards very large volumes of data and on analyzing them◦ Long list: Teradata, Oracle Exadata, Netezza (based on FPGAs), Aster Data

(founded 2005), Vertica (column-based), Kickfire, Xtremedata (released 2009), Sybase IQ, Greenplum (eBay, Fox Networks use them)

◦ Usually sell package/services and charge per TB of managed data◦ Many (especially recent ones) start with MySQL or PostgreSQL and make

them parallel/faster etc..◦ Many have recently added MapReduce-like framework on top

Current Industry Outlook

Page 32: CMSC424: Database Design

Bigtable-like◦ Called “key-value stores”◦ Think highly distributed hash tables◦ Allow some transactional capabilities – still evolving area◦ PNUTS (Yahoo), Cassandra (Facebook), Dynamo (Amazon)

Current Industry Outlook

Page 33: CMSC424: Database Design

Mapreduce-like◦ Hadoop (open source), Pig (@Yahoo), Dryad (@Microsoft)◦ Amazon EC2 Framework◦ Not really a database – but increasing declarative SQL-like capabilities

are being added (e.g. HIVE at Facebook)

Current Industry Outlook

Page 34: CMSC424: Database Design

Mapreduce-like◦ Hadoop (open source), Pig (@Yahoo), Dryad (@Microsoft)◦ Amazon EC2 Framework◦ Not really a database – but increasing declarative SQL-like capabilities

are being added (e.g. HIVE at Facebook)

Much ongoing research in industry and academia◦ University of Maryland a major player◦ UMD Cloud Computing Center

Exciting times ahead !!

Current Industry Outlook

Page 35: CMSC424: Database Design

Data management challenges in a very simple application◦ Why we can’t use a file system to do database management

Data Modelling◦ Going from conceptual requirements of a application to a

concrete data model

Next..

Page 36: CMSC424: Database Design

Simple Banking Application◦ Need to store information about:

Accounts Customers

◦ Need to support: ATM transactions Queries about the data

Instructive to see how a naïve solution will work

Example

Page 37: CMSC424: Database Design

Data stored in files in ASCII format ◦ #-seperated files in /usr/db directory◦ /usr/db/accounts

Account Number # Balance 101 # 900 102 # 700 …

◦ /usr/db/customers Customer Name # Customer Address # Account Number Johnson # 101 University Blvd # 101 Smith # 1300 K St # 102 Johnson # 101 University Blvd # 103 …

A file-system based solution

Page 38: CMSC424: Database Design

Write application programs to support the operations◦ In your favorite programming language

◦ Withdrawals by a customer for amount $X from account Y Scan /usr/db/accounts, and look for Y in the 1st field Subtract $X from the 2nd field, and rewrite the file

◦ To support finding names of all customers on street Z Scan /usr/db/customers, and look for (partial) matches for Z

in the addess field

◦ …

A file-system based solution

Page 39: CMSC424: Database Design

1. Data redundancy and inconsistency◦ No control of redundancy

Customer Name # Customer Address # Account Number Johnson # 101 University Blvd # 101 Smith # 1300 K St # 102 Johnson # 101 University Blvd # 103 …Especially true as programs/data organization evolve over time

◦ Inconsistencies Data in different files may not agree Very critical issue

What’s wrong with this solution ?

Page 40: CMSC424: Database Design

2. Evolution of the database is hard◦ Delete an account

Will have to rewrite the entire file◦ Add a new field to the accounts file, or split the customers file in two parts:

Rewriting the entire file least of the worries Will probably have to rewrite all the application programs

What’s wrong with this solution ?

Page 41: CMSC424: Database Design

3. Difficulties in Data Retrieval◦ No sophisticated tools for selective data access

Access only the data for customer X Inefficient to scan the entire file

◦ Limited reuse Find customers who live in area code 301 Unfortunately, no application program already written Write a new program every time ?

What’s wrong with this solution ?

Page 42: CMSC424: Database Design

4. Semantic constraints◦ Semantic integrity constraints become part of program code

Balance should not fall below 0 Every program that modifies the balance will have to

enforce this constraint◦ Hard to add new constraints or change existing ones

Balance should not fall below 0 unless overdraft-protection enabled

Now what? Rewrite every program that modifies the balance ?

What’s wrong with this solution ?

Page 43: CMSC424: Database Design

5. Atomicity problems because of failures

What’s wrong with this solution ?

Jim transfers $100 from Acct #55 to Acct #376

1. Get balance for acct #55

2. If balance55 > $100 then

a. balance55 := balance55 - 100

b. update balance55 on disk

c. get balance from database for acct #376

d. balance376 := balance376 + 100

e. update balance376 on disk

CRASH

Must be atomic

Do all the operations or none of the operations

Page 44: CMSC424: Database Design

6. Durability problems because of failures

What’s wrong with this solution ?

Jim transfers $100 from Acct #55 to Acct #376

1. Get balance for acct #55

2. If balance55 > $100 then

a. balance55 := balance55 - 100

b. update balance55 on disk

c. get balance from database for acct #376

d. balance376 := balance376 + 100

e. update balance376 on disk

f. print receipt

CRASHAfter reporting success to the user, the changes

better be there when he checks tomorrow

Page 45: CMSC424: Database Design

7. Concurrent access anomalies

What’s wrong with this solution ?

Joe@ATM1: Withdraws $100 from Acct #55

1. Get balance for acct #55

2. If balance55 > $100 then

a. balance55 := balance55 – 100

b. dispense cash

c. update balance55 Jane@ATM2: Withdraws $50 from Acct #55

1. Get balance for acct #55

2. If balance55 > $50 then

a. balance55 := balance55 – 50

b. dispense cash

c. update balance55

Page 46: CMSC424: Database Design

What’s wrong with this solution ?

Joe@ATM1: Withdraws $100 from Acct #55

1. Get balance for acct #55

2. If balance55 > $100 then

a. balance55 := balance55 – 100

b. dispense cash

c. update balance55

Jane@ATM2: Withdraws $50 from Acct #55

1. Get balance for acct #55

2. If balance55 > $50 then

a. balance55 := balance55 – 50

b. dispense cash

c. update balance55

Balance would only reflect one of the two operations

Bank loses money

7. Concurrent access anomalies

Page 47: CMSC424: Database Design

8. Security Issues◦Need fine grained control on who sees what Only the manager should have access to

accounts with balance more than $100,000 How do you enforce that if there is only one

accounts file ?

What’s wrong with this solution ?

Page 48: CMSC424: Database Design

Summary: What’s wrong with this solution ?

◦Hard to control redundancy◦Hard to evolve the structure◦Data retrieval requires writing application programs◦Semantic constraints all over the place◦Not fast enough !◦Data consistency issues Disk crashes etc

◦Security

Database management provide an end-to-end

solution to all of these problems

Page 49: CMSC424: Database Design

The key insight is whats called data abstraction

How ?

Page 50: CMSC424: Database Design

Probably the most important purpose of a DBMS Goal: Hiding low-level details from the users of the

system

Through use of logical abstractions

Data Abstraction

Page 51: CMSC424: Database Design

Data Abstraction

Logical

Level

Physical

Level

View Level

View 1 View 2 View n…

How data is actually stored ?

e.g. are we using disks ? Which

file system ?

What data is stored ?

describe data properties such as

data semantics, data relationships

What data users and application

programs see ?

Page 52: CMSC424: Database Design

Logical level:◦ Provide an abstraction of tables◦ Two tables can be accessed:

accounts Columns: account number, balance

customers Columns: name, address, account number

View level:◦ A teller (non-manager) can only see a part of the accounts table

Not containing high balance accounts

Data Abstraction: Banking Example

Page 53: CMSC424: Database Design

Physical Level:◦ Each table is stored in a separate ASCII file◦ # separated fields

Identical to what we had before ?◦ BUT the users are not aware of this

They only see the tables The application programs are written over the tables abstraction

◦ Can change the physical level without affecting users

◦ In fact, can even change the logical level without affecting the teller

Data Abstraction: Banking Example

Page 54: CMSC424: Database Design

1. Data Modeling

2. Data Retrieval

3. Data Storage

4. Data Integrity

DBMS at a Glance

Page 55: CMSC424: Database Design

A data model is a collection of concepts for describing data properties and domain knowledge:◦ Data relationships◦ Data semantics◦ Data constraints

We will discuss two models extensively in this class◦ Entity-relationship Model◦ Relational Model

Probably discuss XML as well

Data Modeling

Page 56: CMSC424: Database Design

Query = Declarative data retrieval program◦ describes what data to acquire, not how to acquire it◦ Non-declarative:

scan the accounts file look for number 55 in the 2nd field subtract $50 from the 3rd field

◦ Declarative (posed against the tables abstraction): Subtract $50 from the column named balance for the row

corresponding to account number 55 in the accounts table How to do it is not specified.

Why ?◦ Easier to write◦ More efficient to execute (why ?)

Data Retrieval

Page 57: CMSC424: Database Design

Where and how to store data ?◦ Main memory ?

What if the database larger than memory size ?◦ Disks ?

How to move data between memory and disk ?◦ Etc etc…

Data Storage

Page 58: CMSC424: Database Design

Manage concurrency and crashes◦ Transaction: A sequence of database actions enclosed within

special tags◦ Properties:

Atomicity: Entire transaction or nothing Consistency: Transaction, executed completely, take database from one

consistent state to another Isolation: Concurrent transactions appear to run in isolation Durability: Effects of committed transactions are not lost

◦ Consistency: Transaction programmer needs to guarantee that DBMS can do a few things, e.g., enforce constraints on the data

◦ Rest: DBMS guarantees

Data Integrity

Page 59: CMSC424: Database Design

Semantic constraints◦ Typically specified at the logical level◦ E.g. balance > 0

Data Integrity

Page 60: CMSC424: Database Design

Data Models◦ Conceptual representation of the data

Data Retrieval◦ How to ask questions of the database◦ How to answer those questions

Data Storage◦ How/where to store data, how to access it

Data Integrity◦ Manage crashes, concurrency◦ Manage semantic inconsistencies

Not fully disjoint categorization !!

DBMS at a glance

Page 61: CMSC424: Database Design

Why study databases ?◦ Shift from computation to information

Always true in corporate domains Increasing true for personal and scientific domains

◦ Need has exploded in recent years Data is growing at a very fast rate

◦ Solving the data management problems is going to be a key

Summary

Page 62: CMSC424: Database Design

Database Management Systems provide ◦ Data abstraction

Key in evolving systems◦ Guarantees about data integrity

In presence of concurrent access, failures…◦ Speed !!

Summary