Page 1: Data Warehousing Logical Design

Data warehousing logical design

Mirjana Mazuran

November 27, 2008

Page 2: Data Warehousing Logical Design

Outline

- Data Warehouse logical design
- ROLAP model
  - star schema
  - snowflake schema
- Exercise 1: wine company
- Exercise 2: real estate agency
- Exam 9/1/07: travel agency

Page 3: Data Warehousing Logical Design

Introduction
Logical design

- starting from the conceptual design, it is necessary to determine the logical schema of the data
- we use the ROLAP (Relational On-Line Analytical Processing) model to represent multidimensional data
- ROLAP uses the relational data model, which means that data is stored in relations
- given the DFM (Dimensional Fact Model) representation of multidimensional data, two schemas are used:
  - star schema
  - snowflake schema

Page 4: Data Warehousing Logical Design

ROLAP
Star schema

- Each dimension is represented by a relation such that:
  - the primary key of the relation is the primary key of the dimension
  - the attributes of the relation describe all aggregation levels of the dimension
- A fact is represented by a relation such that:
  - the primary key of the relation is the set of primary keys imported from all the dimension tables
  - the attributes of the relation are the measures of the fact

Pros and Cons
- few joins are needed during query execution
- dimension tables are denormalized
- denormalization introduces redundancy
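The star-schema rules above can be sketched with Python's built-in sqlite3 module. The Product, Day and Sales tables, their columns and their rows are assumptions made up for this illustration; they do not come from the exercises.

```python
import sqlite3

# Hypothetical star schema: Product and Day are denormalized dimension tables,
# Sales is the fact table. All names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (
    ProductID  INTEGER PRIMARY KEY,
    Name       TEXT,
    Category   TEXT,   -- coarser aggregation level kept in the same row
    Department TEXT    -- coarsest level, repeated for every product
);
CREATE TABLE Day (
    DayID INTEGER PRIMARY KEY,
    Date  TEXT,
    Month TEXT,
    Year  INTEGER
);
CREATE TABLE Sales (
    ProductID INTEGER REFERENCES Product(ProductID),
    DayID     INTEGER REFERENCES Day(DayID),
    Quantity  INTEGER,               -- the measure
    PRIMARY KEY (ProductID, DayID)   -- keys imported from all dimensions
);
INSERT INTO Product VALUES
    (1, 'Pen',  'Stationery', 'Office'),
    (2, 'Ink',  'Stationery', 'Office'),
    (3, 'Desk', 'Furniture',  'Office');
INSERT INTO Day VALUES
    (1, '2008-11-26', '2008-11', 2008),
    (2, '2008-11-27', '2008-11', 2008);
INSERT INTO Sales VALUES (1, 1, 10), (2, 1, 5), (3, 2, 1), (1, 2, 4);
""")

# Rolling up to the Category level needs a single join, because the level
# is stored directly in the dimension table.
star_result = conn.execute("""
    SELECT P.Category, SUM(S.Quantity)
    FROM Sales S JOIN Product P ON S.ProductID = P.ProductID
    GROUP BY P.Category
    ORDER BY P.Category
""").fetchall()
print(star_result)
```

The redundancy is visible in the sample rows: 'Stationery' and 'Office' are repeated for every product that belongs to them.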

Page 5: Data Warehousing Logical Design

ROLAP
Snowflake schema

- Each (primary) dimension is represented by a relation:
  - the primary key of the relation is the primary key of the dimension
  - the attributes of the relation depend directly on the primary key
  - a set of foreign keys is used to access information at different levels of aggregation. Such information is part of the secondary dimensions and is stored in dedicated relations
- A fact is represented by a relation such that:
  - the primary key of the relation is the set of primary keys imported from all and only the primary dimension tables
  - the attributes of the relation are the measures of the fact

Pros and Cons
- denormalization is reduced
- less memory space is required
- many joins may be required when queries involve attributes in secondary dimension tables
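A minimal sketch of the snowflake rules, again using Python's built-in sqlite3 module. The Product, Category and Sales tables and their rows are hypothetical; Category plays the role of a secondary dimension reached through a foreign key.

```python
import sqlite3

# Hypothetical snowflake schema: the category level is moved out of Product
# into its own (secondary dimension) table. Names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Category (
    CategoryID INTEGER PRIMARY KEY,
    Name       TEXT,
    Department TEXT
);
CREATE TABLE Product (
    ProductID  INTEGER PRIMARY KEY,
    Name       TEXT,
    CategoryID INTEGER REFERENCES Category(CategoryID)  -- link to next level
);
CREATE TABLE Sales (
    ProductID INTEGER REFERENCES Product(ProductID),
    DayID     INTEGER,
    Quantity  INTEGER,
    PRIMARY KEY (ProductID, DayID)  -- only primary dimension keys
);
INSERT INTO Category VALUES (1, 'Stationery', 'Office'), (2, 'Furniture', 'Office');
INSERT INTO Product VALUES (1, 'Pen', 1), (2, 'Ink', 1), (3, 'Desk', 2);
INSERT INTO Sales VALUES (1, 1, 10), (2, 1, 5), (3, 2, 1), (1, 2, 4);
""")

# The same roll-up now needs two joins: fact -> primary -> secondary dimension.
snowflake_result = conn.execute("""
    SELECT C.Name, SUM(S.Quantity)
    FROM Sales S
    JOIN Product P  ON S.ProductID = P.ProductID
    JOIN Category C ON P.CategoryID = C.CategoryID
    GROUP BY C.Name
    ORDER BY C.Name
""").fetchall()
print(snowflake_result)
```

Each category name is stored once, at the price of one extra join whenever a query groups or filters on it.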

Page 6: Data Warehousing Logical Design

Exercise 1
Wine company

An online wine-ordering company requires the design of a snowflake schema to record the quantity and sales of its wines to its customers. Part of the original database is composed of the following tables:

CUSTOMER (Code, Name, Address, Phone)
WINE (Code, Name, Vintage, BottlePrice, CasePrice)
CLASS (Code, Name, Region)
AREA (Code, Description)
TIME (TimeStamp, Date, Year)

Note that the tables represent the main entities of the ER schema, thus it is necessary to derive the significant relationships among them in order to correctly design the data warehouse.

Page 7: Data Warehousing Logical Design

Exercise 1: A possible solution
Snowflake schema

FACT Sales
MEASURES Quantity, Cost
DIMENSIONS Customer, Area, Time, Wine → Class

SalesTable (WineCode, CustomerCode, OrderTimeCode, AreaCode, Quantity, Cost)
Customer (CustomerCode, Name, Address, Phone)
Time (TimeCode, Date, Year)
Area (AreaCode, Description)
Wine (WineCode, ClassCode, Name, Vintage, BottlePrice, CasePrice)
Class (ClassCode, Name, Region)
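As a rough check of the Wine → Class hierarchy, the sketch below builds only the SalesTable, Wine and Class relations in an in-memory SQLite database and aggregates sales by region. The sample rows and the function name sales_by_region are invented for the illustration; the Customer, Time and Area tables are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Class (ClassCode INTEGER PRIMARY KEY, Name TEXT, Region TEXT);
CREATE TABLE Wine (
    WineCode    INTEGER PRIMARY KEY,
    ClassCode   INTEGER REFERENCES Class(ClassCode),
    Name        TEXT,
    Vintage     INTEGER,
    BottlePrice REAL,
    CasePrice   REAL
);
CREATE TABLE SalesTable (
    WineCode      INTEGER REFERENCES Wine(WineCode),
    CustomerCode  INTEGER,
    OrderTimeCode INTEGER,
    AreaCode      INTEGER,
    Quantity      INTEGER,
    Cost          REAL,
    PRIMARY KEY (WineCode, CustomerCode, OrderTimeCode, AreaCode)
);
-- Invented sample data
INSERT INTO Class VALUES (1, 'Red', 'Tuscany'), (2, 'White', 'Veneto');
INSERT INTO Wine VALUES
    (10, 1, 'Chianti',  2005, 12.0, 130.0),
    (11, 1, 'Brunello', 2003, 40.0, 450.0),
    (20, 2, 'Soave',    2007,  8.0,  90.0);
INSERT INTO SalesTable VALUES
    (10, 1, 1, 1,  6, 72.0),
    (11, 1, 1, 1,  2, 80.0),
    (20, 2, 1, 1, 12, 96.0),
    (10, 2, 2, 1,  3, 36.0);
""")

def sales_by_region(connection):
    """Total quantity and cost per wine-class region (fact -> Wine -> Class)."""
    return connection.execute("""
        SELECT C.Region, SUM(S.Quantity), SUM(S.Cost)
        FROM SalesTable S, Wine W, Class C
        WHERE S.WineCode = W.WineCode AND W.ClassCode = C.ClassCode
        GROUP BY C.Region
        ORDER BY C.Region
    """).fetchall()

region_totals = sales_by_region(conn)
print(region_totals)
```

Because Class is a secondary dimension, the fact table carries no ClassCode and the region is reachable only through Wine.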

Page 8: Data Warehousing Logical Design

Exercise 2
Real estate agency

Let us consider the case of a real estate agency whose database is composed of the following tables:

OWNER (IDOwner, Name, Surname, Address, City, Phone)
ESTATE (IDEstate, IDOwner, Category, Area, City, Province, Rooms, Bedrooms, Garage, Meters)
CUSTOMER (IDCust, Name, Surname, Budget, Address, City, Phone)
AGENT (IDAgent, Name, Surname, Office, Address, City, Phone)
AGENDA (IDAgent, Data, Hour, IDEstate, ClientName)
VISIT (IDEstate, IDAgent, IDCust, Date, Duration)
SALE (IDEstate, IDAgent, IDCust, Date, AgreedPrice, Status)
RENT (IDEstate, IDAgent, IDCust, Date, Price, Status, Time)

Page 9: Data Warehousing Logical Design

Exercise 2
Real estate agency

- Goal:
  - Provide a supervisor with an overview of the situation. The supervisor must have a global view of the business, in terms of the estates the agency deals with and of the agents' work.
- Questions:
  1. Design a conceptual schema for the DW.
  2. What facts and dimensions do you consider?
  3. Design a Star Schema or Snowflake Schema for the DW.
- Write the following SQL queries:
  - How many customers have visited properties of at least 3 different categories?
  - What is the average duration of visits per property category?
  - Who has paid the highest price among the customers that have viewed properties of at least 3 different categories?
  - Who has bought a flat for the highest price w.r.t. each month?
  - What kind of property sold for the highest price w.r.t. each city and month?

Page 10: Data Warehousing Logical Design

Exercise 2: A possible solution
Facts and dimensions

Points 1 and 2 are left as homework. In particular, you are required to discuss the facts of interest (from your point of view) and then define, for each fact, the attribute tree, dimensions and measures with the corresponding glossary.
The following ideas will be used in the solution of the exercise:

- supervisors should be able to monitor the sales of the agency
    FACT Sales
    MEASURES OfferPrice, AgreedPrice, Status
    DIMENSIONS EstateID, OwnerID, CustomerID, AgentID, TimeID
- supervisors should be able to monitor the work of the agents by analyzing the visits to the estates that the agents are in charge of
    FACT Viewing
    MEASURES Duration
    DIMENSIONS EstateID, CustomerID, AgentID, TimeID

Page 11: Data Warehousing Logical Design

Exercise 2: A possible solution
Star schema

SalesTable (TimeID, EstateID, OwnerID, CustomerID, AgentID, OfferPrice, AgreedPrice, Status)
ViewingTable (TimeID, EstateID, CustomerID, AgentID, Duration)
Customer (CustomerID, Name, Surname, Budget, Address, City, Phone)
Agent (AgentID, Name, Surname, Office, Address, City, Phone)
Owner (OwnerID, Name, Surname, Address, City, Phone)
Time (TimeID, Day, Month, Year)
Estate (EstateID, Category, Area, City, Province, Rooms, Bedrooms, Garage, Meters, Sheet, Map)

Page 12: Data Warehousing Logical Design

Exercise 2: A possible solution
Snowflake schema

SalesTable (AgentID, EstateID, TimeID, CustomerID, OwnerID, OfferPrice, AgreedPrice, Status)
ViewingTable (AgentID, TimeID, EstateID, CustomerID, Duration)
Customer (CustomerID, Name, Surname, Budget, Address, PlaceID, Phone)
Agent (AgentID, Name, Surname, Office, Address, PlaceID, Phone)
Owner (OwnerID, Name, Surname, Address, PlaceID, Phone)
Time (TimeID, Day, Month, Year)
Estate (EstateID, PlaceID, Category, Rooms, Bedrooms, Garage, Meters, Sheet, Map)
Place (PlaceID, City, Province, Area)

Page 13: Data Warehousing Logical Design

Exercise 2: A possible solution
SQL queries w.r.t. the star schema

- How many customers have visited properties of at least 3 different categories?

    SELECT COUNT(*)
    FROM (SELECT V.CustomerID
          FROM ViewingTable V, Estate E
          WHERE V.EstateID = E.EstateID
          GROUP BY V.CustomerID
          HAVING COUNT(DISTINCT E.Category) >= 3) AS C

- Average duration of visits per property category

    SELECT E.Category, AVG(V.Duration)
    FROM ViewingTable V, Estate E
    WHERE V.EstateID = E.EstateID
    GROUP BY E.Category
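A grouped query with HAVING returns one row per qualifying customer, so counting the customers takes an outer COUNT over that result. The sketch below checks both forms with Python's built-in sqlite3 module; the sample rows are invented and only the columns needed by the query are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Estate (EstateID INTEGER PRIMARY KEY, Category TEXT);
CREATE TABLE ViewingTable (
    TimeID INTEGER, EstateID INTEGER, CustomerID INTEGER, AgentID INTEGER,
    Duration INTEGER,
    PRIMARY KEY (TimeID, EstateID, CustomerID, AgentID)
);
-- Invented sample data: customers 1 and 2 see three categories, customer 3 one.
INSERT INTO Estate VALUES
    (1, 'flat'), (2, 'villa'), (3, 'office'), (4, 'garage'), (5, 'flat');
INSERT INTO ViewingTable VALUES
    (1, 1, 1, 1, 30), (1, 2, 1, 1, 45), (2, 3, 1, 1, 20), (2, 5, 1, 1, 25),
    (1, 1, 2, 1, 15), (2, 2, 2, 1, 35), (3, 4, 2, 1, 40),
    (1, 5, 3, 1, 10), (2, 1, 3, 1, 50);
""")

# Grouped form: one row per qualifying customer, holding that customer's
# number of visits (not the number of customers).
visits_per_customer = conn.execute("""
    SELECT COUNT(*)
    FROM ViewingTable V, Estate E
    WHERE V.EstateID = E.EstateID
    GROUP BY V.CustomerID
    HAVING COUNT(DISTINCT E.Category) >= 3
    ORDER BY V.CustomerID
""").fetchall()

# Wrapped form: count the rows of the grouped result to get the customers.
customer_count = conn.execute("""
    SELECT COUNT(*)
    FROM (SELECT V.CustomerID
          FROM ViewingTable V, Estate E
          WHERE V.EstateID = E.EstateID
          GROUP BY V.CustomerID
          HAVING COUNT(DISTINCT E.Category) >= 3) AS C
""").fetchone()[0]

print(visits_per_customer, customer_count)
```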

Page 14: Data Warehousing Logical Design

Exercise 2: A possible solution
SQL queries w.r.t. the star schema

- Who has paid the highest price among the customers that have viewed properties of at least 3 different categories?

    CREATE VIEW Cust3Cat AS
    SELECT V.CustomerID
    FROM ViewingTable V, Estate E
    WHERE V.EstateID = E.EstateID
    GROUP BY V.CustomerID
    HAVING COUNT(DISTINCT E.Category) >= 3;

    SELECT C.CustomerID
    FROM Cust3Cat C, SalesTable S
    WHERE C.CustomerID = S.CustomerID AND S.AgreedPrice IN
      (SELECT MAX(S1.AgreedPrice)
       FROM Cust3Cat C1, SalesTable S1
       WHERE C1.CustomerID = S1.CustomerID)

Page 15: Data Warehousing Logical Design

Exercise 2: A possible solution
SQL queries w.r.t. the star schema

- Who has bought a flat for the highest price w.r.t. each month?

    SELECT S.CustomerID, T.Month, T.Year, S.AgreedPrice
    FROM SalesTable S, Estate E, Time T
    WHERE S.EstateID = E.EstateID AND S.TimeID = T.TimeID
      AND E.Category = 'flat'
      AND (T.Month, T.Year, S.AgreedPrice) IN (
        SELECT T1.Month, T1.Year, MAX(S1.AgreedPrice)
        FROM SalesTable S1, Estate E1, Time T1
        WHERE S1.EstateID = E1.EstateID AND S1.TimeID = T1.TimeID
          AND E1.Category = 'flat'
        GROUP BY T1.Month, T1.Year)
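The "top sale per month" pattern can be tried out on a small in-memory SQLite database. Note the swap: instead of the row-value IN test, this sketch joins against a derived table of monthly maxima, an equivalent formulation chosen here only because not every SQL engine accepts row-value comparisons. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (TimeID INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE Estate (EstateID INTEGER PRIMARY KEY, Category TEXT);
CREATE TABLE SalesTable (
    TimeID INTEGER, EstateID INTEGER, OwnerID INTEGER,
    CustomerID INTEGER, AgentID INTEGER,
    OfferPrice INTEGER, AgreedPrice INTEGER, Status TEXT,
    PRIMARY KEY (TimeID, EstateID, OwnerID, CustomerID, AgentID)
);
-- Invented sample data
INSERT INTO Time VALUES (1, 5, 11, 2008), (2, 20, 11, 2008), (3, 3, 12, 2008);
INSERT INTO Estate VALUES (1, 'flat'), (2, 'flat'), (3, 'villa'), (4, 'flat');
INSERT INTO SalesTable VALUES
    (1, 1, 1, 100, 1, 210000, 200000, 'closed'),
    (2, 2, 1, 101, 1, 260000, 250000, 'closed'),
    (2, 3, 2, 102, 1, 900000, 850000, 'closed'),
    (3, 4, 2, 100, 1, 180000, 175000, 'closed');
""")

# M holds the highest flat price of each (Month, Year); joining back on all
# three columns keeps only the sales that reach that maximum.
top_flat_buyers = conn.execute("""
    SELECT S.CustomerID, T.Month, T.Year, S.AgreedPrice
    FROM SalesTable S
    JOIN Estate E ON S.EstateID = E.EstateID
    JOIN Time T   ON S.TimeID = T.TimeID
    JOIN (SELECT T1.Month AS Month, T1.Year AS Year,
                 MAX(S1.AgreedPrice) AS TopPrice
          FROM SalesTable S1
          JOIN Estate E1 ON S1.EstateID = E1.EstateID
          JOIN Time T1   ON S1.TimeID = T1.TimeID
          WHERE E1.Category = 'flat'
          GROUP BY T1.Month, T1.Year) M
      ON M.Month = T.Month AND M.Year = T.Year AND M.TopPrice = S.AgreedPrice
    WHERE E.Category = 'flat'
    ORDER BY T.Year, T.Month
""").fetchall()
print(top_flat_buyers)
```

The villa sold in November has a higher price than any flat, but it is excluded by the category filter in both the inner and the outer query.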

Page 16: Data Warehousing Logical Design

Exercise 2: A possible solution
SQL queries w.r.t. the star schema

- What kind of property sold for the highest price w.r.t. each city and month?

    SELECT E.Category, E.City, T.Month, T.Year, S.AgreedPrice
    FROM SalesTable S, Time T, Estate E
    WHERE S.TimeID = T.TimeID AND E.EstateID = S.EstateID
      AND (S.AgreedPrice, E.City, T.Month, T.Year) IN (
        SELECT MAX(S1.AgreedPrice), E1.City, T1.Month, T1.Year
        FROM SalesTable S1, Time T1, Estate E1
        WHERE S1.TimeID = T1.TimeID AND E1.EstateID = S1.EstateID
        GROUP BY T1.Month, T1.Year, E1.City)

Page 17: Data Warehousing Logical Design

Exam 9/1/07
Travel agency

A travel agency organizes guided trips for tourists of different nationalities. The agency wants to know the main trends about trips, both with respect to the visited places and to the type of participant. The following relational schema contains the initial database:

TRIP (Code, Destination, Category, Guide, Duration, Date)
DESTINATIONS (Code, Name, Description, Type, Nation)
GROUP (Trip, Participant, Price)
PARTICIPANT (Code, Name, Surname, Address, Birthday, Nation)
NATION (Code, Name, Continent)
GUIDE (Code, Name, Surname, Address, Sex, Birthday, Nation)
TYPE (Code, Description)
CATEGORY (Code, Description)

Page 18: Data Warehousing Logical Design

Exam 9/1/07: A possible solution
Reverse engineering

[Figure: ER schema obtained by reverse engineering, with the entities CATEGORY, TRIP, GUIDE, DESTINATION, TYPE, PARTICIPANT and NATION, their attributes, and the cardinalities of the relationships among them.]

Page 19: Data Warehousing Logical Design

Exam 9/1/07: A possible solution
Facts, measures, dimensions, attribute tree

FACT Trip
MEASURES PartecipantNr, Duration, Income
DIMENSIONS Partecipant, Place, Guide, Time, Category

[Figure: attribute tree for the Trip fact, annotated with the pruning and grafting operations.]

Page 20: Data Warehousing Logical Design

Exam 9/1/07: A possible solution
Attribute tree, fact schema

[Figure: attribute tree after pruning and grafting.]

[Figure: fact schema for Trip, with measures Duration, Income and PartecipNr.]

Page 21: Data Warehousing Logical Design

Exam 9/1/07: A possible solution
Snowflake schema

Trip (IdCategory, IdTime, IdDestin, IdPartecipant, IdGuide, Duration, Income, PartecipNr)
Category (IdCategory, Description)
Time (IdTime, Day, Month, Year)
Destination (IdDestin, Name, Type, IdNation)
Nation (IdNation, Name, Continent)
Guide (IdGuide, Sex, Birthday, IdNation)
Partecipant (IdPartecipant, Birthday, IdNation)

Page 22: Data Warehousing Logical Design

Exam 9/1/07: A possible solution
SQL queries

- Average trip duration for a given place

    SELECT AVG(T.Duration)
    FROM Trip T, Destination D
    WHERE T.IdDestin = D.IdDestin AND D.Name = 'Place'

- Average trip duration for a given trip category and month

    SELECT AVG(T.Duration)
    FROM Trip T, Time Ti
    WHERE T.IdTime = Ti.IdTime AND T.IdCategory = 'Category'
      AND Ti.Month = 'Month'

- Average trip price w.r.t. duration, type of place and year

    SELECT T.Duration, D.Type, Ti.Year, AVG(T.Income)
    FROM Trip T, Destination D, Time Ti
    WHERE T.IdTime = Ti.IdTime AND T.IdDestin = D.IdDestin
    GROUP BY T.Duration, D.Type, Ti.Year
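In the queries for "a given place" or "a given category and month", the quoted values stand for whatever the user supplies. One way to run them, sketched here with Python's built-in sqlite3 module, is to bind those values as parameters. The function names and the sample rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (IdTime INTEGER PRIMARY KEY, Day INTEGER, Month TEXT, Year INTEGER);
CREATE TABLE Destination (IdDestin INTEGER PRIMARY KEY, Name TEXT, Type TEXT, IdNation INTEGER);
CREATE TABLE Trip (
    IdCategory TEXT, IdTime INTEGER, IdDestin INTEGER,
    IdPartecipant INTEGER, IdGuide INTEGER,
    Duration INTEGER, Income REAL, PartecipNr INTEGER,
    PRIMARY KEY (IdCategory, IdTime, IdDestin, IdPartecipant, IdGuide)
);
-- Invented sample data
INSERT INTO Time VALUES (1, 10, 'May', 2007), (2, 22, 'May', 2007), (3, 5, 'June', 2007);
INSERT INTO Destination VALUES (1, 'Rome', 'city', 1), (2, 'Alps', 'mountain', 1);
INSERT INTO Trip VALUES
    ('culture', 1, 1, 1, 1, 3, 1500.0, 10),
    ('culture', 2, 1, 2, 1, 5, 2500.0, 12),
    ('nature',  3, 2, 1, 2, 7, 4000.0, 8),
    ('culture', 3, 2, 3, 2, 6, 3000.0, 9);
""")

def avg_duration_for_place(connection, place):
    """Average trip duration for the destination called `place` (None if no trips)."""
    row = connection.execute("""
        SELECT AVG(T.Duration)
        FROM Trip T, Destination D
        WHERE T.IdDestin = D.IdDestin AND D.Name = ?
    """, (place,)).fetchone()
    return row[0]

def avg_duration_for_category_month(connection, category, month):
    """Average trip duration for one trip category in one month."""
    row = connection.execute("""
        SELECT AVG(T.Duration)
        FROM Trip T, Time Ti
        WHERE T.IdTime = Ti.IdTime AND T.IdCategory = ? AND Ti.Month = ?
    """, (category, month)).fetchone()
    return row[0]

rome_avg = avg_duration_for_place(conn, "Rome")
culture_june_avg = avg_duration_for_category_month(conn, "culture", "June")
print(rome_avg, culture_june_avg)
```

Binding the values with `?` keeps the SQL text fixed and avoids building the query by string concatenation.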

Page 23: Data Warehousing Logical Design

Exam 9/1/07: A possible solution
SQL queries

- Number of trips w.r.t. type of place, month and guide's nationality

    SELECT D.Type, Ti.Month, G.IdNation, COUNT(*)
    FROM Trip T, Destination D, Time Ti, Guide G
    WHERE T.IdDestin = D.IdDestin AND T.IdTime = Ti.IdTime
      AND T.IdGuide = G.IdGuide
    GROUP BY D.Type, Ti.Month, G.IdNation

- Average number of participants w.r.t. trip category and continent

    SELECT T.IdCategory, N.Continent, AVG(T.PartecipNr)
    FROM Trip T, Destination D, Nation N
    WHERE T.IdDestin = D.IdDestin AND D.IdNation = N.IdNation
    GROUP BY T.IdCategory, N.Continent
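The continent query walks the snowflake chain Trip → Destination → Nation. A small sketch with Python's built-in sqlite3 module and invented sample rows shows the two joins at work; only the tables the query touches are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Nation (IdNation INTEGER PRIMARY KEY, Name TEXT, Continent TEXT);
CREATE TABLE Destination (
    IdDestin INTEGER PRIMARY KEY, Name TEXT, Type TEXT,
    IdNation INTEGER REFERENCES Nation(IdNation)
);
CREATE TABLE Trip (
    IdCategory TEXT, IdTime INTEGER,
    IdDestin INTEGER REFERENCES Destination(IdDestin),
    IdPartecipant INTEGER, IdGuide INTEGER,
    Duration INTEGER, Income REAL, PartecipNr INTEGER,
    PRIMARY KEY (IdCategory, IdTime, IdDestin, IdPartecipant, IdGuide)
);
-- Invented sample data
INSERT INTO Nation VALUES (1, 'Italy', 'Europe'), (2, 'France', 'Europe'), (3, 'Japan', 'Asia');
INSERT INTO Destination VALUES
    (1, 'Rome', 'city', 1), (2, 'Paris', 'city', 2), (3, 'Kyoto', 'city', 3);
INSERT INTO Trip VALUES
    ('culture', 1, 1, 1, 1, 3, 1500.0, 10),
    ('culture', 2, 2, 1, 1, 4, 2400.0, 20),
    ('culture', 3, 3, 2, 2, 7, 5000.0, 8),
    ('nature',  4, 3, 2, 2, 5, 3000.0, 12);
""")

# Continent is an attribute of the secondary dimension Nation, so the fact
# table reaches it through Destination.
avg_participants = conn.execute("""
    SELECT T.IdCategory, N.Continent, AVG(T.PartecipNr)
    FROM Trip T, Destination D, Nation N
    WHERE T.IdDestin = D.IdDestin AND D.IdNation = N.IdNation
    GROUP BY T.IdCategory, N.Continent
    ORDER BY T.IdCategory, N.Continent
""").fetchall()
print(avg_participants)
```

The ORDER BY clause is added here only to make the printed result deterministic.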