chapter 8 newer database topics based on g. post, dbms: designing & building business...

20
Chapter 8 Newer Database Topics Based on G. Post, DBMS: Designing & Building Business Applications University of Manitoba Asper School of Business 3500 DBMS Bob Travica Updated 2010

Upload: fay-snow

Post on 29-Dec-2015

222 views

Category:

Documents


2 download

TRANSCRIPT

Chapter 8

Newer Database Topics

Based on G. Post, DBMS: Designing & Building Business Applications

University of ManitobaAsper School of Business

3500 DBMSBob Travica

Updated 2010

DDBB

SSYYSSTTEEMMSS

2 of 20

OLAP & Data Warehouse

Online TransactionProcessing (OLTP):Querying Databaseswith 3NF tables

Operations’data

Predefinedreports

Online Analytical Processing (OLAP);Data warehousing;Data Mining.Usually denormalized data.

Periodicaltransfers

Interactivedata analysis

Flat files

MIS 3500

DDBB

SSYYSSTTEEMMSS

3 of 20

OLTP vs. OLAP

DDBB

SSYYSSTTEEMMSS

4 of 20

Warehousing Goals

Integrate data from different sources to get a larger picture of

business Data aggregations (summaries on different dimensions) Ad hoc queries (support non-routine decision making) Statistical analysis (test hypotheses on relationships between

pieces of data) Discover new relationships (data mining)

DDBB

SSYYSSTTEEMMSS

5 of 20

Extraction, Transformation, and Transportation

Data warehouse:All data must be consistent.

Customers

Convert “Client” to “Customer”

Apply standard product numbers

Convert currencies

Fix region codes

Transaction data from diverse systems.

• Preparations performed on data

Extract

Transform Transport

DDBB

SSYYSSTTEEMMSS

6 of 20

Three-Dimensional View of Data: Cube

Sale Date

CustomerLocation

Categ

ory

Similar ideas usedin crosstab query andpivot table.

DDBB

SSYYSSTTEEMMSS

7 of 20

Data Hierarchy

Year

Quarter

Month

Week

Day

Levels Roll-upTo get higher-level totals

Drill-downTo get lower-level details

DDBB

SSYYSSTTEEMMSS

8 of 20

Star Design

Amount=SalePrice*Quantity

Fact Table

SaleSaleDateSalePriceQuantity

Dimension Table

Measures

Amounts broken down by product category, period, and customer location.

ProductCategory

CustomerLocation

Dimension Table

Dimension Table

Hierarchical: Dimension tables can link only via fact table.

DDBB

SSYYSSTTEEMMSS

9 of 20

Snowflake Design

SaleIDItemIDQuantitySalePriceAmount

OLAPItems

ItemIDDescriptionQuantityOnHandListPriceCategory

Merchandise

SaleIDSaleDateEmployeeIDCustomerIDSalesTax

Sale

CustomerIDPhoneFirstNameLastNameAddressZipCodeCityID

Customer

CityIDZipCodeCityState

City

Network-like design: Dimension tables can link

directly.

DDBB

SSYYSSTTEEMMSS

10 of 20

Excel Pivot Table Reports

Can place data in rows or columns.By grouping months, can instantly get quarterly or monthly totals.

Quarter MonthQuarter 1 Quarter 2 Quarter 3 Quarter 4 Grand Total

LastName EmployeeIDDataCarpenter 8 Sum of Animal 1,668.91 606.97 426.39 7.20 2,709.47

Sum of Merchandise 324.90 78.30 99.00 128.70 630.90Eaton 6 Sum of Animal 522.37 341.85 562.50 1,426.72

Sum of Merchandise 30.60 54.90 107.10 192.60Farris 7 Sum of Animal 5,043.36 1,059.70 796.47 6,899.53

Sum of Merchandise 826.92 188.10 306.00 1,321.02Gibson 2 Sum of Animal 4,983.51 1,549.83 2,556.10 9,089.44

Sum of Merchandise 668.25 238.50 450.90 1,357.65Hopkins 4 Sum of Animal 3,747.96 1,194.88 372.65 128.41 5,443.90

Sum of Merchandise 476.91 252.90 121.50 7.20 858.51James 5 Sum of Animal 3,282.77 2,373.08 437.88 150.11 6,243.84

Sum of Merchandise 505.89 693.45 99.00 99.00 1,397.34O'Connor 9 Sum of Animal 2,643.69 180.91 510.12 3,334.72

Sum of Merchandise 263.70 83.70 55.80 403.20Reasoner 3 Sum of Animal 4,577.43 625.74 589.68 2,500.24 8,293.09

Sum of Merchandise 762.30 89.10 116.80 396.90 1,365.10Reeves 1 Sum of Animal 1,120.93 1,120.93

Sum of Merchandise 263.88 263.88Shields 10 Sum of Animal 1,008.76 162.15 1,170.91

Sum of Merchandise 62.10 22.50 84.60Total Sum of Animal 28,599.69 7,591.11 2,840.72 6,701.03 45,732.55Total Sum of Merchandise 4,185.45 1,624.05 569.50 1,495.80 7,874.80

DDBB

SSYYSSTTEEMMSS

11 of 20

CUBE Option (SQL 99)

Bird 1 135.00 0 0Bird 2 45.00 0 0…Bird (null) 32.00 0 0Bird (null) 607.50 1 0Cat 1 396.00 0 0Cat 2 113.85 0 0…Cat (null) 1293.30 1 0(null) 1 1358.8 0 1(null) 2 1508.94 0 1(null) 3 2362.68 0 1…(null) (null) 8451.79 1 1

Category Month Amount Gc Gm

SELECT Category, Month, Sum, GROUPING (Category) AS Gc, GROUPING (Month) AS Gm

FROM …GROUP BY CUBE (Category, Month...)

DDBB

SSYYSSTTEEMMSS

12 of 20

GROUPING SETS: Hiding Details

Bird (null) 607.50Cat (null) 1293.30…(null) 1 1358.8(null) 2 1508.94(null) 3 2362.68…(null) (null) 8451.79

Category Month Amount

SELECT Category, Month, SumFROM …GROUP BY GROUPING SETS ( ROLLUP (Category),

ROLLUP (Month),( ) )

DDBB

SSYYSSTTEEMMSS

13 of 20

SQL RANK FunctionsSELECT Employee, SalesValue RANK() OVER (ORDER BY SalesValue DESC) AS rankDENSE_RANK() OVER (ORDER BY SalesValue DESC) AS denseFROM SalesORDER BY SalesValue DESC, Employee;

Employee SalesValue rank dense

Jones 18,000 1 1

Smith 16,000 2 2

Black 16,000 2 2

White 14,000 4 3

DENSE_RANK does not skip numbers

• Therefore, advances in SQL motivate DBMS vendors to support OLAP and data warehousing.

DDBB

SSYYSSTTEEMMSS

14 of 20

Data Mining

Goal: To discover unknown relationships in the data that can be used to make better decisions. Exploratory analysis. A bottom-up approach that scans the data to find relationships Some statistical routines, but they are not sufficient

Statistics relies on averages

Sometimes the important data lies in more detailed pairs

Supervised by developer vs. unsupervised (self-organizing artificial neural networks)

DDBB

SSYYSSTTEEMMSS

15 of 20

Common Techniques

1. Classification/Prediction

2. Association Rules/Market Basket Analysis

3. Clustering

DDBB

SSYYSSTTEEMMSS

16 of 20

1. Classification(Prediction)

Purpose: “Classify” things that are causes and those that are effects.

Examples Which borrowers/loans are most likely to be successful?

Which customers are most likely to want a new item?

Which companies are likely to file bankruptcy?

Which workers are likely to quit in the next six months?

Which startup companies are likely to succeed?

Which tax returns are fraudulent?

DDBB

SSYYSSTTEEMMSS

17 of 20

Classification Process

Clearly identify the outcome/dependent variable. Identify potential variables that might affect the outcome. Use sample data to test and validate the model. Regression/correlation analysis, decision tables and trees,

etc.

Income Credit History Job Stability Credit Success

50000 Good Good Yes

75000 Mixed Bad No

DDBB

SSYYSSTTEEMMSS

18 of 20

2. Association/Market Basket

Purpose: Determine what events or items go together/co-occur.

Examples: What items are customers likely to buy together?

(Business use: Consider putting the two together to

increase cross-selling.)

DDBB

SSYYSSTTEEMMSS

19 of 20

Association Challenges

If an item is rarely purchased, any other item bought with it seems important. So combine items into categories.

Some relationships are obvious. Burger and fries.

Some relationships are puzzling/meaningless. Hardware store found that toilet rings sell well only when a new

store first opened. But what does it mean?

DDBB

SSYYSSTTEEMMSS

20 of 20

3. Cluster Analysis Purpose: Determine groups of people or some entities. Examples

Are there groups of customers? (If so, we could target them; market segmentation)

Do the locations for our stores have elements in common? (If so, we can search for similar clusters for new locations.)

Do employees have common characteristics? (If so, we can hire similar, or dissimilar, people.)

Small intracluster distance

Large intercluster distance