CS186 Class Wrap-Up: R&G Chapters 1-28, Lecture 28

Upload: phuong

Post on 08-Jan-2016

Page 1: CS186 Class Wrap-Up

CS186 Class Wrap-Up

R&G Chapters 1-28, Lecture 28

Administrivia

• Final Exam
  – Friday 12/12, 5pm – 8pm, Room 4 LeConte
  – You may have 2 pages of notes, both sides
  – The exam is cumulative

• Final Exam Review
  – Tuesday 12/9, 1pm-3pm, 306 Soda Hall

• Homework 5
  – Due Monday, 12/8

News

• Winter Consulting’s 2003 survey of Largest DBs
  – http://mxtest.wintercorp.com/vldb/2003_TopTen_Survey/TopTenWinners.asp
  – The largest single database is 29,232 GB!
  – That’s a single database at France Telecom
  – Many companies have TBs of data, but usually spread out among multiple databases, file systems, etc.

• In 2001, largest DB was ~10TB

News (cont.) – Top Transaction Processing DBs

1. Land Registry, 18.3 terabytes
2. BT plc, 11.7 terabytes
3. United Parcel Service, 9.0 terabytes
4. Caixa Econômica Federal, 6.9 terabytes
5. US Patent and Trademark Office, 5.4 terabytes
6. Verizon Communications, 5.3 terabytes
7. Bureau of Customs and Border Protection, 4.1 terabytes
8. Hewlett Packard, 3.2 terabytes
9. Boeing, 3.1 terabytes
10. CheckFree Corp, 2.9 terabytes

News (cont.) – Top Decision Support DBs

1. France Telecom, 29.2 terabytes
2. AT&T, 26.3 terabytes
3. SBC, 24.8 terabytes
4. Anonymous, 16.2 terabytes
5. Amazon.com, 13.0 terabytes
6. Kmart, 12.6 terabytes
7. Claria Corp., 12.1 terabytes
8. HIRA, 11.9 terabytes
9. FedEx Services, 10.0 terabytes
10. Vodafone, 9.1 terabytes

Lessons? (from the survey and this course)

• DBs are a huge part of business today

• Companies have *lots* of data
  – (imagine tuning UPS’s database with 41 billion rows!)

• DBs are based on theory of data modelling, with lots of practical data management on top
  – nice mix of theoretical and practical

• In most jobs, useful to understand how DBs work

Today

• What topics did we cover?

• What topics did we *not* cover?

First, what topics did we not cover?

• In the book:
  – Chapter 21 – Security and Authorization
  – Chapter 22 – Parallel and Distributed DBs
  – Chapter 23 – Object-Database Systems
  – Chapter 24 – Deductive Databases
  – Chapter 25 – Data Warehousing and Decision Support
  – Chapter 27 – XML Data
  – Chapter 28 – Spatial Data Management

• Not in the book
  – Federated Databases...

And what topics did we cover?

• Chapters 1-20, and 26

Database and Data Model basics (1-3)         4   16%
Query Languages (4-5)                        4   16%
Integrating DBs with other systems (6-7)     2    8%
Storing data in memory and disk (8-9)        2    8%
Tree and Hash Indexes (10-11)                2    8%
Join/Sort cost, Query Optimization (12-15)   3   12%
Concurrency Control & Recovery (16-18)       5   20%
Normal Forms, Database Design (19)           2    8%
Database Tuning (20)                         1    4%
Data Mining (26)                             1    4%

1. Overview of Database Systems

• What is a Database? A Database System?

• What are the useful characteristics of DBs?

• When should you use a database? When is the file system better?

2. Database Design/ER Models

• Databases support many levels of abstraction
  – possible to design at abstract level in one form, store data in very different form

• The E-R Model
  – Useful for design, easier for human to understand
  – Specify entities, attributes, relationships
  – Possible to convert ER schemas to Relational Schemas

3. The Relational Model

• Most common data model for databases
• Based on tables: rows and columns
• Tables connected using key/foreign keys
• Integrity Constraints
  – Domain constraints for field values
  – Referential integrity for keys/foreign keys
  – Other constraints specified by real world
    • e.g. 0.0 <= gpa <= 4.0
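
The three kinds of integrity constraint can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the `Students` and `Enrolled` tables and their rows are illustrative assumptions, and SQLite stands in for whatever DBMS is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Students (
    sid  INTEGER PRIMARY KEY,                      -- key constraint
    name TEXT NOT NULL,                            -- domain constraint
    gpa  REAL CHECK (gpa >= 0.0 AND gpa <= 4.0)    -- real-world constraint
);
CREATE TABLE Enrolled (
    sid INTEGER REFERENCES Students(sid),          -- referential integrity
    cid TEXT,
    PRIMARY KEY (sid, cid)
);
""")
conn.execute("INSERT INTO Students VALUES (1, 'Ann', 3.5)")


def try_insert(sql):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


ok_enroll = try_insert("INSERT INTO Enrolled VALUES (1, 'CS186')")
bad_gpa = try_insert("INSERT INTO Students VALUES (2, 'Bob', 4.7)")   # violates CHECK
bad_fk = try_insert("INSERT INTO Enrolled VALUES (99, 'CS186')")      # no such student
```

The valid enrollment is accepted, while the out-of-range gpa and the dangling foreign key are both rejected by the database itself rather than by application code.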

4. Relational Algebra and Calculus

• Relational algebra
  – Operators that act on sets of tuples
  – σ, Π, –, etc.
  – “procedural”
  – e.g. $\pi_{sname,\,rating}(\sigma_{rating > 8}(S2))$

• Relational Calculus
  – Uses first-order logic to describe query result
  – Does not describe how to get result, i.e. declarative
  – Studied Tuple Relational Calculus, where variables are tuples
  – e.g. $\{S \mid S \in Sailors \land S.rating > 7\}$
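
The procedural versus declarative contrast can be sketched in Python. The rows in `S2` and the helper names `select` and `project` are assumptions made for illustration; the algebra version spells out the operator order, while the comprehension only states the condition the result must satisfy.

```python
# Illustrative Sailors instance (assumed rows, for demonstration only).
S2 = [
    {"sid": 28, "sname": "yuppy", "rating": 9, "age": 35.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 44, "sname": "guppy", "rating": 5, "age": 35.0},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]


def select(rows, predicate):
    """Selection (sigma): keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection (pi): keep the named columns and drop duplicate tuples."""
    seen, result = set(), []
    for row in rows:
        values = tuple(row[c] for c in columns)
        if values not in seen:
            seen.add(values)
            result.append(values)
    return result


# Algebra style: first select on rating > 8, then project sname and rating.
algebra_result = project(select(S2, lambda s: s["rating"] > 8), ["sname", "rating"])

# Calculus style: describe the tuples wanted, not the steps to find them.
calculus_result = [s for s in S2 if s["rating"] > 7]
```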

5. SQL: Queries, Constraints, Triggers

• Data Definition Language (DDL)
  – Create Table
  – Constraints & Triggers

• Data Manipulation Language (DML)
    SELECT [DISTINCT] target-list
    FROM relation-list
    WHERE qualification
    GROUP BY grouping-list
    HAVING group-qualification

• Set Operations, subqueries, etc.
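
A query that uses every clause of the SELECT template might look like the sketch below, run through Python's `sqlite3` module. The `Sailors` rows are assumed sample data, and the `ORDER BY` clause is an addition that is not part of the template.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, sname TEXT, rating INTEGER, age REAL)"
)
conn.executemany(
    "INSERT INTO Sailors VALUES (?, ?, ?, ?)",
    [
        (22, "dustin", 7, 45.0),
        (31, "lubber", 8, 55.5),
        (58, "rusty", 10, 35.0),
        (64, "horatio", 7, 35.0),
        (71, "zorba", 10, 16.0),
        (74, "horatio", 9, 35.0),
    ],
)

# For each rating with more than one adult sailor, find the youngest adult's age.
query = """
SELECT rating, MIN(age) AS youngest     -- target-list
FROM Sailors                            -- relation-list
WHERE age >= 18                         -- qualification (filters rows)
GROUP BY rating                         -- grouping-list
HAVING COUNT(*) > 1                     -- group-qualification (filters groups)
ORDER BY rating
"""
rows = conn.execute(query).fetchall()
```

`WHERE` removes the 16-year-old before grouping, so only rating 7 still has two sailors when `HAVING` is applied.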

6. Database Applications

• How to access DBs from programs
  – Embedded SQL, SQLJ
  – Dynamic APIs: ODBC, JDBC
  – Cursors: a way to iterate over relations
  – Stored procedures in database language

• Accessing other programs from databases
  – Extending Postgres with C code
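
Python's DB-API plays a role similar to ODBC and JDBC, so it can serve as a rough sketch of both directions described above. The `Boats` table and the `shout` function are assumptions for illustration; `create_function` is SQLite's counterpart to extending a server with external code, not the Postgres mechanism itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Boats (bid INTEGER PRIMARY KEY, bname TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO Boats VALUES (?, ?, ?)",
    [
        (101, "Interlake", "blue"),
        (102, "Interlake", "red"),
        (103, "Clipper", "green"),
        (104, "Marine", "red"),
    ],
)


def shout(text):
    """Hypothetical user-defined function made callable from SQL."""
    return text.upper()


# Database calling program code: register shout() as a one-argument SQL function.
conn.create_function("shout", 1, shout)

# Program calling the database: a parameterized query read through a cursor.
cur = conn.cursor()
cur.execute("SELECT bid, shout(bname) FROM Boats WHERE color = ? ORDER BY bid", ("red",))
red_boats = []
for bid, name in cur:          # the cursor yields one row at a time
    red_boats.append((bid, name))
```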

7. Internet Applications

• Internet basics: URIs, HTTP stateless protocol
• Web data formats: XML, HTML, DTD
• Different architectures
  – Single-tier
  – Client-server (thick or thin client)
  – Three-tier architecture
    • Web browser/thin client
    • App server running business logic
    • Database maintaining data

8. Storage and Indexing

• Different file organizations
  – Heap Files (unordered)
  – Sorted Files
  – Clustered Files
  – Unclustered Tree
  – Unclustered Hash

• Tradeoffs in I/O costs for various operations

9. Storing Data: Disks and Files

• Hierarchy of storage
• Keeping data in files on disk
  – How to arrange fields into records
  – How to arrange records into pages
  – How to arrange pages into files

• Managing disk and memory
  – Buffer management
  – LRU, MRU, Clock, etc.
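
As a rough sketch of LRU buffer replacement, the class below keeps a fixed number of frames and evicts the least recently used page on a miss. The class name, the `read_page` callback, and the access trace are assumptions; pin counts and dirty-page write-back, which a real buffer manager needs, are left out.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Toy buffer pool: fixed number of frames, least-recently-used eviction."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page      # hypothetical callback: page_id -> contents
        self.frames = OrderedDict()     # page_id -> contents, oldest use first
        self.hits = 0
        self.misses = 0

    def get_page(self, page_id):
        if page_id in self.frames:
            self.hits += 1
            self.frames.move_to_end(page_id)    # mark as most recently used
            return self.frames[page_id]
        self.misses += 1
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)     # evict the least recently used page
        self.frames[page_id] = self.read_page(page_id)
        return self.frames[page_id]


pool = LRUBufferPool(3, read_page=lambda pid: f"contents of page {pid}")
for pid in [1, 2, 3, 1, 4, 2]:
    pool.get_page(pid)
```

In this trace the second request for page 1 is the only hit; pages 2 and 3 are evicted in turn, leaving pages 1, 4 and 2 in the pool.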

10. Tree-Structured Indexes

• Trees best for range queries, o.k. for equality
• ISAM
  – less common, usually best for data that doesn’t change
  – index doesn’t adjust, instead uses overflow pages if leaves fill

• B-Trees
  – present in virtually all databases
  – tree adjusts index to stay balanced
  – you should understand these pretty well after Hw4
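
Why ordered structures suit range queries can be seen with a sorted list standing in for the leaf level of a tree index. This is a simplification, not a B-tree: `leaf_keys` and the function names are illustrative, and binary search via `bisect` plays the part of descending from the root.

```python
import bisect

# Sorted keys, standing in for the leaf level of a tree index.
leaf_keys = [5, 12, 17, 23, 31, 42, 58]


def equality_search(keys, target):
    """Find the position where target would sit, then check it is there."""
    i = bisect.bisect_left(keys, target)
    return i < len(keys) and keys[i] == target


def range_search(keys, low, high):
    """Locate the first key >= low, then read sequentially up to high."""
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return keys[start:end]


found = equality_search(leaf_keys, 23)
in_range = range_search(leaf_keys, 15, 40)
```

A range query costs one search plus a sequential read of neighbouring keys, which is the property a hash-based structure cannot offer because it scatters adjacent keys.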

11. Hash-Based Indexes

• Hash indexes best for equality, useless for range queries

• Static hashing
  – only good when data doesn’t change
  – uses overflow buckets

• Extendible hashing
  – uses directory of buckets; on overflow, split the bucket, doubling the directory if needed
  – never needs overflow buckets (except when too many duplicate keys land in one bucket)

• Linear hashing
  – no directory, just a number indicating which buckets have split
  – may need overflow buckets, but doesn’t need directory
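
Extendible hashing can be sketched in a few dozen lines. The version below is a toy under stated assumptions: keys are distinct non-negative integers (so `hash(key)` is the key itself), directory slots are chosen by the low-order bits, and the class and attribute names are made up for illustration.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # number of hash bits this bucket depends on
        self.keys = []


class ExtendibleHash:
    """Toy extendible hash index for distinct integer keys."""

    def __init__(self, bucket_size=2):
        self.bucket_size = bucket_size
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, key):
        # Use the low-order global_depth bits of the hash as the directory slot.
        return hash(key) & ((1 << self.global_depth) - 1)

    def contains(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if key in bucket.keys:
            return
        if len(bucket.keys) < self.bucket_size:
            bucket.keys.append(key)
            return
        # Bucket is full. Double the directory only if this bucket already
        # uses as many bits as the directory does.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the bucket on one more bit and repoint half of its slots.
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = new_bucket
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._slot(k)].keys.append(k)
        self.insert(key)   # retry; a skewed split may trigger another split


index = ExtendibleHash(bucket_size=2)
inserted = [1, 4, 5, 7, 10, 12, 16, 19, 21, 32]
for k in inserted:
    index.insert(k)
```

Only the overflowing bucket is split, so most inserts touch a single bucket and the directory grows just when the splitting bucket has run out of distinguishing bits.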

12. Overview of Query Evaluation

• System catalogs – info about all tables
  – includes statistics about field values

• Access paths – how to get at tuples
  – file scan, indexes

• Query plan – tree of relational operators

13. External Sorting

• Database can sort any amount of info, even if it doesn’t fit in memory

• Sort runs that fit in memory, then merge sorted runs together

• Used in Hw5
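
The two phases, sorting runs and then merging them, can be sketched as follows. Several simplifications are assumed: the input arrives as a Python list, `memory_limit` counts values rather than pages, runs are temporary text files, and all runs are merged in a single pass instead of B-1 at a time.

```python
import heapq
import os
import tempfile


def external_sort(values, memory_limit):
    """Sort integers using sorted runs of at most memory_limit values each."""
    run_paths = []
    # Pass 0: sort one memory-sized chunk at a time and write it out as a run.
    for start in range(0, len(values), memory_limit):
        run = sorted(values[start:start + memory_limit])
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in run)
        run_paths.append(path)

    # Merge pass: stream every run and repeatedly emit the smallest head value.
    files = [open(path) for path in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        merged = list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)
    return merged


sorted_values = external_sort([9, 3, 7, 1, 8, 2, 6], memory_limit=3)
```

During the merge only one value per run has to be held in memory at a time, which is what lets the technique scale past the size of main memory.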

14. Evaluating Relational Operators

• How to implement:
  – Selection
  – Projection
  – Join Algorithms:
    • Nested Loops
    • Indexed Nested Loops
    • Sort-Merge Join
    • Hash-Join
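
Two of the join algorithms can be compared on a tiny in-memory example. The `sailors` and `reserves` tuples are assumed sample data, the join column is taken to be the first field of each tuple, and the hash join skips the partitioning phase a disk-based version would need.

```python
from collections import defaultdict

# Illustrative relations: (sid, sname) and (sid, bid).
sailors = [(22, "dustin"), (31, "lubber"), (58, "rusty")]
reserves = [(22, 101), (58, 103), (22, 102), (95, 104)]


def nested_loops_join(outer, inner):
    """Compare every outer tuple with every inner tuple."""
    return [o + i for o in outer for i in inner if o[0] == i[0]]


def hash_join(build, probe):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for b in build:
        table[b[0]].append(b)
    return [b + p for p in probe for b in table.get(p[0], [])]


nlj_result = nested_loops_join(sailors, reserves)
hj_result = hash_join(sailors, reserves)
```

Both produce the same three matching pairs, possibly in a different order; the nested loops version does len(outer) * len(inner) comparisons, while the hash join reads each input once.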

15. A Typical Relational Query Optimizer

• Break query into query blocks
• Enumerate possible query plans
• Evaluate cost for each, choose cheapest
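
The enumerate-and-cost loop can be sketched with a toy model. Everything numeric here is an assumption: the cardinalities, the join selectivities, and the cost measure (the sum of intermediate result sizes) are made up for illustration, and only left-deep join orders are considered.

```python
from itertools import permutations

# Hypothetical catalog statistics (Boats is assumed to be already filtered).
cardinality = {"Sailors": 40_000, "Reserves": 100_000, "Boats": 10}
join_selectivity = {
    frozenset({"Sailors", "Reserves"}): 1 / 40_000,
    frozenset({"Reserves", "Boats"}): 1 / 100,
}


def plan_cost(order):
    """Toy cost of a left-deep plan: total size of its intermediate results."""
    rows = cardinality[order[0]]
    joined = [order[0]]
    cost = 0.0
    for table in order[1:]:
        selectivity = 1.0
        for other in joined:
            # Table pairs without a join predicate form a cross product.
            selectivity *= join_selectivity.get(frozenset({other, table}), 1.0)
        rows = rows * cardinality[table] * selectivity
        cost += rows
        joined.append(table)
    return cost


plans = list(permutations(cardinality))     # enumerate all join orders
best_plan = min(plans, key=plan_cost)       # evaluate each, keep the cheapest
```

Under these assumed statistics the cheapest orders join the small Boats input with Reserves first and bring in Sailors last, while orders that start with a Sailors and Boats cross product come out far more expensive.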

16. Overview of Transactions

• Transactions, unit of atomicity
• ACID properties
• Anomalies with concurrent execution
• Introduction to logging

17. Concurrency Control

• Anomalies
• Precedence Graphs
• Schedule Characteristics
  – Serializable, View Serializable, Conflict Serializable, Recoverable, Avoids Cascading Abort, Strict

• Locking approaches: 2PL, strict 2PL
  – dealing with deadlock
  – Hierarchical locking
  – Locking in B-Trees

• Non-locking approaches
  – Optimistic CC
  – Timestamp CC
  – Multiversion CC
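
A precedence graph test for conflict serializability is short enough to sketch. The schedule encoding, a list of (transaction, action, item) triples with actions "R" and "W", is an assumption chosen for the example.

```python
def precedence_graph(schedule):
    """Add an edge Ti -> Tj when an action of Ti conflicts with a later action of Tj."""
    edges = set()
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            # Conflict: different transactions, same item, at least one write.
            if t1 != t2 and x1 == x2 and "W" in (a1, a2):
                edges.add((t1, t2))
    return edges


def is_conflict_serializable(schedule):
    """A schedule is conflict serializable iff its precedence graph is acyclic."""
    edges = precedence_graph(schedule)
    nodes = {txn for txn, _, _ in schedule}
    while nodes:
        # Transactions with no incoming edge from a remaining node can go first.
        free = {n for n in nodes
                if not any(dst == n and src in nodes for src, dst in edges)}
        if not free:
            return False        # every remaining node is on a cycle
        nodes -= free
    return True


serial_like = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
interleaved = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
```

In `interleaved`, T1 reads A before T2 writes it and T2 writes A before T1 does, so the graph has edges in both directions and the schedule is not conflict serializable.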

18. Crash Recovery

• Effects of Buffer Management on recovery
• Write-ahead log
• Transaction abort
• Checkpointing
• ARIES algorithm
  – Analysis phase
  – Redo phase
  – Undo phase
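
The three phases can be illustrated with a heavily simplified log replay. This is not ARIES itself: the log record format is made up, and LSNs, pageLSNs, compensation log records, checkpoints and the dirty page table are all omitted.

```python
def recover(log, disk):
    """Toy recovery over a log of ("update", txn, page, old, new) and ("commit", txn)."""
    # Analysis: work out which transactions committed before the crash.
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Redo: repeat history by reapplying every logged update in order.
    for rec in log:
        if rec[0] == "update":
            _, txn, page, old, new = rec
            disk[page] = new

    # Undo: roll back uncommitted transactions, newest update first.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, txn, page, old, new = rec
            disk[page] = old
    return disk


log = [
    ("update", "T1", "A", 10, 20),
    ("update", "T2", "B", 5, 7),
    ("commit", "T1"),
    ("update", "T2", "A", 20, 30),
]
# State on disk at the crash: T2's write to B was flushed, T1's write to A was not.
disk_after = recover(log, {"A": 10, "B": 7})
```

After recovery the committed T1's update survives (A is 20) even though it never reached disk, and the uncommitted T2's flushed write is reversed (B is back to 5).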

19. Schema Refinement & Normal Forms

• Functional dependencies
  – A → B: whenever A is the same, B must be the same

• FDs allow us to determine candidate keys, normal forms, qualities of decomposition

• Tradeoffs between data replication, dependency preservation

• Always must have lossless join decompositions
• BCNF has little replication, may need to join to check FDs
• 3NF may have replication, but can preserve FDs
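
How FDs determine keys can be sketched with an attribute closure computation. The relation R(A, B, C, D) and its FDs are assumed for the example, and attributes are single characters so that a string such as "AB" can stand for the set {A, B}.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, so is the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(attrs, relation, fds):
    """attrs is a superkey if its closure covers the whole relation."""
    return closure(attrs, fds) >= set(relation)


relation = "ABCD"
fds = [("A", "B"), ("B", "C")]      # A -> B, B -> C
a_closure = closure("A", fds)
```

Here A determines B and C but not D, so A alone is not a key while AD is; the same closure routine is the basic step in testing for BCNF and 3NF violations.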

20. Physical Database Design and Tuning

• Once a DB is running, many changes may improve performance

• First need to understand workload
  – What are typical queries? Which queries are most important?

• Indexes – what will improve queries
• Schema Changes
  – denormalize to reduce joins
  – supernormalize to reduce table size, contention

• Rewriting Queries
  – avoid queries that the optimizer will do poorly on

26. Data Mining

• What is Data Mining?

• Process of Data Mining

• Different classes of DM Algorithms
  – Supervised
  – Unsupervised

Summary

• Databases are highly important today

• DB Design based on theoretical foundation

• Numerous practical/implementation issues addressed to make them run efficiently

• This course covered enough practical and theoretical material that you can use and understand DBs