Data Warehouse Notes

Upload: torsten-markku

Post on 30-Dec-2015

Page 1: Data Warehouse Notes

Data Warehouse Notes

Page 2: Data Warehouse Notes

OLTP vs OLAP

OLTP: Online Transaction Processing

Most common type of database for data input

ER design approach

Database design looks like “real world”

Throughput and response time are big concerns

Exercise: Examples of transactional databases

Page 3: Data Warehouse Notes

Physical Data Storage and Usage in OLTP System:

Data stored on disk in data pages (4k, 8k, 16k, 32k in size)

Data IO is also in data pages

Memory image of database table is also data pages

To read even one row of one table, the system reads an entire data page.

Rows per Page ~ Page Size/Bytes per Row
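The rows-per-page estimate can be tried out with a few lines of Python. This is a rough sketch only: it assumes "4k" means 4096 bytes (and so on), ignores page headers and row overhead, and uses a made-up row width of 200 bytes and a made-up table of one million rows.

```python
def rows_per_page(page_size_bytes: int, bytes_per_row: int) -> int:
    """Approximate number of whole rows that fit on one data page."""
    return page_size_bytes // bytes_per_row


def pages_for_table(row_count: int, page_size_bytes: int, bytes_per_row: int) -> int:
    """Approximate number of data pages needed to hold row_count rows."""
    per_page = rows_per_page(page_size_bytes, bytes_per_row)
    return -(-row_count // per_page)  # ceiling division


# Hypothetical table: 1,000,000 rows of 200 bytes each.
for page_size in (4096, 8192, 16384, 32768):
    print(page_size, rows_per_page(page_size, 200), pages_for_table(1_000_000, page_size, 200))
```

Larger pages hold more rows, so a full scan touches fewer pages, but fetching a single row still costs one whole page read.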

[Figure: physical storage of the rows of a single table in a chain of data pages; four rows are labeled A, B, C, and D]

Exercise: Cost of Retrieval (A,B) vs (C,D)

Page 4: Data Warehouse Notes

Data Input Strategies

All transactions get added to the end of the last data page in the chain of data pages that make up this table.

single point of contention slows everyone down

most reporting is looking at older data on different data pages, so it is not blocked by data input on newly created data pages

[Figure: chain of data pages; new rows are added at the end of the chain, and reports access the earlier pages]

Page 5: Data Warehouse Notes

Data Input Strategies

Transactions with similar key values are placed on the same data pages (e.g., orders from the same state)

multiple insertion points so less chance of insertions running into each other

reporting now runs into problems, because reports read from the same pages where data is being entered

[Figure: data pages grouped by state (MA, RI, CT), each group with its own insertion point; data insertion and reporting use the same pages]
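The trade-off between the two data input strategies can be illustrated with a toy model. The sketch below is not how any particular database engine places rows; the page size of 4 rows, the row counts, and the function names are all assumptions chosen to make the effect visible. It counts how many insertion points each strategy has and how many of them are shared with pages that reports read.

```python
ROWS_PER_PAGE = 4  # hypothetical, kept tiny so the effect is visible


def append_pages(existing_rows: int, new_rows: int) -> tuple[set[int], set[int]]:
    """Strategy 1: new rows go to the end of the page chain.

    Returns (pages holding older rows, pages receiving inserts).
    """
    old_pages = {row // ROWS_PER_PAGE for row in range(existing_rows)}
    insert_pages = {(existing_rows + i) // ROWS_PER_PAGE for i in range(new_rows)}
    return old_pages, insert_pages


def clustered_pages(old_keys: list[str], new_keys: list[str]) -> tuple[set[str], set[str]]:
    """Strategy 2: rows are placed by key value.

    Each key (a state code here) stands in for that key's group of pages.
    """
    return set(old_keys), set(new_keys)


old, ins = append_pages(existing_rows=1000, new_rows=3)
print("append: insertion points =", len(ins), "shared with reports =", len(old & ins))

old_k, ins_k = clustered_pages(["MA", "RI", "CT"] * 100, ["MA", "RI", "CT", "MA", "CT"])
print("clustered: insertion points =", len(ins_k), "shared with reports =", len(old_k & ins_k))
```

In this model the append strategy has a single insertion point that no report touches, while the key-based strategy has three insertion points, all of which sit on pages that reports also read.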

Page 6: Data Warehouse Notes

Solution 1

Have OLTP and OLAP run on different copies of the same database – live and day-old.

Each night, copy all new transactions to the day-old database

Or mirror data insertion in a duplicate database

Problems:

Insertion and Reporting are “orthogonal” so the structure that makes one task easy makes the other difficult.

Insertion typically restricted to one external view at a time

Reporting is often across many external views

Insertion easy to design and designed only once

Reports hard to design and new reports all the time

Page 7: Data Warehouse Notes

Solution 2:

OLAP database has a different structure than the OLTP database.

Most common structure is called the Star Schema.

[Figure: star schema, with a central fact table joined to the surrounding dimension tables]

Example dimension table: Address = {address, country, state, region}
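A star schema of this shape might be sketched as follows, using SQLite through Python's standard `sqlite3` module. The `address_dim` columns follow the Address attributes above; the fact table `sales_fact`, its `amount` measure, and all sample rows are hypothetical and exist only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table, using the Address attributes from the slide.
    CREATE TABLE address_dim (
        id      INTEGER PRIMARY KEY,
        address TEXT, country TEXT, state TEXT, region TEXT
    );
    -- Hypothetical fact table: a foreign key to the dimension plus one numeric measure.
    CREATE TABLE sales_fact (
        id         INTEGER PRIMARY KEY,
        address_id INTEGER REFERENCES address_dim(id),
        amount     REAL
    );
    INSERT INTO address_dim VALUES
        (1, '1 Main St', 'US', 'MA', 'Northeast'),
        (2, '9 Elm St',  'US', 'RI', 'Northeast'),
        (3, '5 Oak Ave', 'US', 'TX', 'South');
    INSERT INTO sales_fact (address_id, amount)
        VALUES (1, 10.0), (2, 5.0), (3, 7.5), (1, 2.5);
""")

# A single join from fact to dimension is enough to roll up by any Address attribute.
rows = conn.execute("""
    SELECT a.region, SUM(f.amount)
    FROM sales_fact f JOIN address_dim a ON f.address_id = a.id
    GROUP BY a.region
    ORDER BY a.region
""").fetchall()
print(rows)
```

Because every descriptive attribute lives directly in the dimension table, changing the report from region to state only changes the column named in the query, not the joins.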

Page 8: Data Warehouse Notes

Alternative to Star Schema:

OLAP database has a different structure than the OLTP database.

Another common structure is called the Snowflake Schema.

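[Figure: snowflake schema, with a fact table joined to dimension tables that are themselves normalized into further tables]

Address = {address, county_id}

County = {id, state_id}

State = {id, country_id}

Country = {id, ...}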
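For comparison, here is a minimal snowflake sketch in SQLite via Python. It assumes each level of the address hierarchy references its parent (address to county, county to state, state to country); the `name` columns and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE state   (id INTEGER PRIMARY KEY, name TEXT,
                          country_id INTEGER REFERENCES country(id));
    CREATE TABLE county  (id INTEGER PRIMARY KEY, name TEXT,
                          state_id INTEGER REFERENCES state(id));
    CREATE TABLE address (id INTEGER PRIMARY KEY, address TEXT,
                          county_id INTEGER REFERENCES county(id));
    INSERT INTO country VALUES (1, 'US');
    INSERT INTO state   VALUES (1, 'MA', 1), (2, 'RI', 1);
    INSERT INTO county  VALUES (1, 'Suffolk', 1), (2, 'Providence', 2);
    INSERT INTO address VALUES (1, '1 Main St', 1), (2, '9 Elm St', 2), (3, '3 Pine Rd', 1);
""")

# Rolling addresses up to the state level now takes a chain of joins.
rows = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM address a
    JOIN county c ON a.county_id = c.id
    JOIN state  s ON c.state_id  = s.id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(rows)
```

The normalized dimension stores each state and country name once, at the cost of extra joins in every report that groups by them.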

Page 9: Data Warehouse Notes

Fact Table:

FactTable = {ID, D1_ID, D2_ID, D3_ID, ..., m1, m2, m3, ...}

D1 = {ID, ...}

D2 = {ID, ...}

D3 = {ID, ...}

A fact table schema consists of numeric facts, called measures, categorized by dimensions.

Dimensions are qualitative properties of the data, and measures are quantitative properties.

Page 10: Data Warehouse Notes

Example Fact Table

HMO_CLAIM_HISTORY = {claim_type, patient_state_code, claim_user_id, patient_claim_id, prmy_procedure_id, prmy_diagnosis_id, claim_from_date, ..., claim_pymt_amt, ddctbl_amt, coinsurance_amt}

[Figure: data cube with axes state, claim_type, and proc_id; each cell holds (cp_amt, dd_amt, coin_amt)]
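One way to picture the data cube is as a mapping from a (state, claim type, procedure) coordinate to the three summed measures. The sketch below builds such a mapping with SQLite from Python. It uses a subset of the HMO_CLAIM_HISTORY column names; the claim type values, procedure ids, and amounts are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hmo_claim_history (
        patient_state_code TEXT, claim_type TEXT, prmy_procedure_id INTEGER,
        claim_pymt_amt REAL, ddctbl_amt REAL, coinsurance_amt REAL
    )
""")
conn.executemany(
    "INSERT INTO hmo_claim_history VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("MA", "inpatient", 101, 100.0, 20.0, 10.0),
        ("MA", "inpatient", 101, 50.0, 0.0, 5.0),
        ("MA", "outpatient", 202, 30.0, 10.0, 0.0),
        ("RI", "inpatient", 101, 80.0, 20.0, 8.0),
    ],
)

# Each cube cell is addressed by three dimension values and holds three summed measures.
cube = {
    (state, ctype, proc): (cp, dd, coin)
    for state, ctype, proc, cp, dd, coin in conn.execute("""
        SELECT patient_state_code, claim_type, prmy_procedure_id,
               SUM(claim_pymt_amt), SUM(ddctbl_amt), SUM(coinsurance_amt)
        FROM hmo_claim_history
        GROUP BY patient_state_code, claim_type, prmy_procedure_id
    """)
}
print(cube[("MA", "inpatient", 101)])
```

Dropping a column from the `GROUP BY` collapses the cube along that axis, which is why each report is a small variation on the same query.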

Page 11: Data Warehouse Notes

Improvement?

All transactions fit into a single table

Every report is a slight modification of every other report

select new dimensions

select new measures

Problems:

Fact table can get very large (terabytes)

New business decisions lead to new dimensions, so the table has to be restructured

Smaller cubes are better for reporting since they are more efficient.

Page 12: Data Warehouse Notes

Fact Table and daily, materialized report tables:

FactTable = {ID, D1_ID, D2_ID, D3_ID, ..., m1, m2, m3, ...}

D1 = {ID, ...}

D2 = {ID, ...}

D3 = {ID, ...}

Add additional summary tables that are updated as the main fact table is updated.

Report #1 = {D1.Name, D2.Name, ..., AggD3ImpactOn_m1, ...}

select d1.Name, d2.Name, sum(f.m1) as AggD3ImpactOn_m1
from FactTable f, D1 d1, D2 d2
where f.D1_ID = d1.ID and f.D2_ID = d2.ID
group by f.D1_ID, f.D2_ID, d1.Name, d2.Name
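A summary table that stays current as the fact table grows could be maintained with a trigger. The following is a sketch under several assumptions: it uses SQLite, keys the summary by D1_ID and D2_ID instead of joining to the dimension names, handles inserts only (not updates or deletes), and names the summary table `Report1` purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FactTable (
        ID INTEGER PRIMARY KEY, D1_ID INTEGER, D2_ID INTEGER, D3_ID INTEGER, m1 REAL
    );
    -- Summary table: one row per (D1, D2) combination, aggregated over D3.
    CREATE TABLE Report1 (
        D1_ID INTEGER, D2_ID INTEGER,
        AggD3ImpactOn_m1 REAL NOT NULL DEFAULT 0,
        PRIMARY KEY (D1_ID, D2_ID)
    );
    -- Keep the summary in step with every insert into the fact table.
    CREATE TRIGGER fact_insert AFTER INSERT ON FactTable
    BEGIN
        -- Create the summary row the first time this (D1, D2) pair appears.
        INSERT INTO Report1 (D1_ID, D2_ID)
        SELECT NEW.D1_ID, NEW.D2_ID
        WHERE NOT EXISTS (
            SELECT 1 FROM Report1 WHERE D1_ID = NEW.D1_ID AND D2_ID = NEW.D2_ID
        );
        -- Add the new measure value to the running total.
        UPDATE Report1
        SET AggD3ImpactOn_m1 = AggD3ImpactOn_m1 + NEW.m1
        WHERE D1_ID = NEW.D1_ID AND D2_ID = NEW.D2_ID;
    END;
""")

conn.executemany(
    "INSERT INTO FactTable (D1_ID, D2_ID, D3_ID, m1) VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 10.0), (1, 1, 2, 5.0), (1, 2, 1, 7.0), (2, 1, 3, 1.5)],
)

summary = conn.execute(
    "SELECT D1_ID, D2_ID, AggD3ImpactOn_m1 FROM Report1 ORDER BY D1_ID, D2_ID"
).fetchall()
recomputed = conn.execute(
    "SELECT D1_ID, D2_ID, SUM(m1) FROM FactTable GROUP BY D1_ID, D2_ID ORDER BY D1_ID, D2_ID"
).fetchall()
print(summary == recomputed, summary)
```

Reports then read the small `Report1` table instead of scanning and grouping the full fact table each time.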

Page 13: Data Warehouse Notes

Exercise:

Suppose that the Library database is used in several branch libraries of a large system.

Suppose that when a book is returned, a permanent record of the loan is put into a central fact table in a central data warehouse.

What kind of data might be put in the new fact table? What are the dimensions? Any measures?

Page 14: Data Warehouse Notes

Library OLAP Schema:

Fact table = {Library_ID, Borrower_ID, Accession_No, isbn, loan_date, return_date, return_code, days_on_loan}

Dimension tables: LibraryBranch, Cardholder, Book, Copy

Page 15: Data Warehouse Notes

ETL:

Extraction-Transformation-Load

Process by which data is copied from OLTP source databases to a single OLAP report database/data warehouse
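The three ETL steps can be sketched end to end with two in-memory SQLite databases standing in for a branch OLTP system and the central warehouse. The source table `loan`, the target table `loan_fact`, and the sample rows are hypothetical; the column names borrow from the Library OLAP schema, and the rule of extracting only returned books follows the library exercise.

```python
import sqlite3
from datetime import date

# Source: a hypothetical branch OLTP database with a loan table.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE loan (library_id INTEGER, borrower_id INTEGER, "
    "accession_no INTEGER, loan_date TEXT, return_date TEXT)"
)
oltp.executemany(
    "INSERT INTO loan VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 500, "2015-12-01", "2015-12-15"),
        (1, 11, 501, "2015-12-10", None),  # still on loan, so not extracted
        (2, 12, 502, "2015-12-20", "2015-12-23"),
    ],
)

# Target: the warehouse fact table.
olap = sqlite3.connect(":memory:")
olap.execute(
    "CREATE TABLE loan_fact (library_id INTEGER, borrower_id INTEGER, "
    "accession_no INTEGER, loan_date TEXT, return_date TEXT, days_on_loan INTEGER)"
)


def extract(conn):
    """Extract: read completed loans from the source database."""
    return conn.execute(
        "SELECT library_id, borrower_id, accession_no, loan_date, return_date "
        "FROM loan WHERE return_date IS NOT NULL"
    ).fetchall()


def transform(rows):
    """Transform: derive the days_on_loan measure from the two dates."""
    out = []
    for lib, borrower, acc, loaned, returned in rows:
        days = (date.fromisoformat(returned) - date.fromisoformat(loaned)).days
        out.append((lib, borrower, acc, loaned, returned, days))
    return out


def load(conn, rows):
    """Load: append the transformed rows to the warehouse fact table."""
    conn.executemany("INSERT INTO loan_fact VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


load(olap, transform(extract(oltp)))
result = olap.execute(
    "SELECT accession_no, days_on_loan FROM loan_fact ORDER BY accession_no"
).fetchall()
print(result)
```

In a real system the extract step would run against each branch database in turn, and the load step would also have to avoid copying the same loan twice.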