Design of DW - ULisboa


Upload: lamkiet

Post on 23-Nov-2018


Design of DW

Remember: item, city, year, and sales_in_Euro

Lattice of cuboids:
()
(item) (city) (year)
(city, item) (city, year) (item, year)
(city, item, year)

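2 levels of hierarchy for each dimension:
Item(part, color): i1, i2
City(downtown, suburb): c1, c2
Year(good_year, bad_year): y1, y2

For a 3-dimensional data cube, where L_i is the number of levels of dimension i (here L_1 = L_2 = L_3 = 2), the total number of cuboids that can be generated is

T = \prod_{i=1}^{3} (L_i + 1) = (2 + 1)^3 = 3^3 = 27

The +1 accounts for the virtual top level "all", at which the dimension is aggregated away.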

{(), (c1),(c2),(i1),(i2),(y1),(y2),

(c1,i1),(c1,i2),(c2,i1),(c2,i2), (c1,y1),(c1,y2),(c2,y1),(c2,y2), (i1,y1),(i1,y2),(i2,y1),(i2,y2),

(c1,i1,y1),(c1,i1,y2),(c1,i2,y1),(c1,i2,y2), (c2,i1,y1),(c2,i1,y2),(c2,i2,y1),(c2,i2,y2)}
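The count of 27 can be checked by brute force. The following is a minimal sketch, not part of the original slides: for each dimension it picks either one of the two levels or nothing (the level "all"), and collects every combination. The function name `enumerate_cuboids` and the dictionary layout are illustrative assumptions.

```python
from itertools import product

# Two hierarchy levels per dimension, labelled as on the slide.
levels = {
    "city": ["c1", "c2"],
    "item": ["i1", "i2"],
    "year": ["y1", "y2"],
}

def enumerate_cuboids(levels):
    """Return every cuboid as a tuple of the levels it groups by.

    For each dimension we choose one of its levels or None, where None
    stands for the virtual level 'all' (the dimension is aggregated away).
    """
    choices = [[None] + dim_levels for dim_levels in levels.values()]
    cuboids = []
    for combo in product(*choices):
        cuboids.append(tuple(level for level in combo if level is not None))
    return cuboids

cuboids = enumerate_cuboids(levels)
print(len(cuboids))  # 27, i.e. (2 + 1) ** 3
```

Each dimension contributes L_i + 1 choices, which is exactly where the product in the formula comes from.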

DMQL: Data Mining Query Language
Relational database schema
Translation into SQL query
Example: star schema and relational database
MDX
Multifeature cubes
Design of a Data Warehouse
Lifecycle models
Data Warehouse models

DMQL

DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)

Data warehouses and data marts can be defined by cube definition and dimension definition

DMQL

Create and manipulate data mining models through a SQL-based interface ("command-driven" data mining)

Abstract away the data mining particulars

Data mining should be performed on data in the database (there should be no need to export to a special-purpose environment)

Approaches differ on what kinds of models should be created, and what operations we should be able to perform

Cube Definition Syntax in DMQL

Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Example of Star Schema

Sales Fact Table
    keys: time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales

Dimension tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city, state_or_province, country

Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

The star schema contains two measures, dollars_sold and units_sold

How are the DMQL commands interpreted to generate a specified data cube?

Relational database schema
    time(time_key, day_of_week, month, quarter, year)
    item(item_key, item_name, brand, type, supplier_type)
    branch(branch_key, branch_name, branch_type)
    location(location_key, street, city, province_or_state, country)
    sales(time_key, item_key, branch_key, location_key, number_of_units_sold, price)

The DMQL specification is translated into the following SQL query, which generates the base cuboid

    SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
           SUM(s.number_of_units_sold * s.price),
           SUM(s.number_of_units_sold)
    FROM time t, item i, branch b, location l, sales s
    WHERE s.time_key = t.time_key
      AND s.item_key = i.item_key
      AND s.branch_key = b.branch_key
      AND s.location_key = l.location_key
    GROUP BY (s.time_key, s.item_key, s.branch_key, s.location_key);

The granularity (resolution) of each dimension is at the join key level

A join key is the key that links a fact table and a dimension table

The fact table associated with a base cuboid is sometimes referred to as the base fact table
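To see the base cuboid query run end to end, here is a small sketch using Python's built-in sqlite3 module. The table layout follows the relational schema given in the slides; the sample rows are invented for illustration, and the parentheses around the GROUP BY list are dropped because SQLite would read them as a row value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day_of_week TEXT, month INTEGER,
                       quarter INTEGER, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
CREATE TABLE sales    (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                       location_key INTEGER, number_of_units_sold INTEGER, price REAL);

-- Invented sample rows, for illustration only.
INSERT INTO time VALUES (1, 'Mon', 1, 1, 2005), (2, 'Tue', 1, 1, 2005);
INSERT INTO item VALUES (1, 'pen', 'A', 'office', 'wholesale');
INSERT INTO branch VALUES (1, 'B1', 'city');
INSERT INTO location VALUES (1, 'Main St', 'Lisboa', 'Lisboa', 'Portugal');
INSERT INTO sales VALUES (1, 1, 1, 1, 2, 1.5), (1, 1, 1, 1, 3, 1.5), (2, 1, 1, 1, 4, 2.0);
""")

# Base cuboid: one row per combination of join keys.
base_cuboid = conn.execute("""
SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
       SUM(s.number_of_units_sold * s.price) AS dollars_sold,
       SUM(s.number_of_units_sold)           AS units_sold
FROM time t, item i, branch b, location l, sales s
WHERE s.time_key = t.time_key AND s.item_key = i.item_key
  AND s.branch_key = b.branch_key AND s.location_key = l.location_key
GROUP BY s.time_key, s.item_key, s.branch_key, s.location_key
ORDER BY s.time_key
""").fetchall()

print(base_cuboid)  # [(1, 1, 1, 1, 7.5, 5), (2, 1, 1, 1, 8.0, 4)]
```

The two sales rows that share the same four keys collapse into a single cell, which illustrates that the base cuboid's granularity is the join key level.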

By changing the GROUP BY clause we may generate other cuboids

The apex cuboid, representing the total sum of dollars_sold and the total count of units_sold, is generated by GROUP BY ();

Other cuboids may be generated by applying selection and projection operations on the base cuboid

To generate the whole data cube we may as well use

    GROUP BY CUBE (s.time_key, s.item_key, s.branch_key, s.location_key);
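Where GROUP BY CUBE is not available (SQLite, used here, does not support it), the same set of cuboids can be produced by issuing one GROUP BY per subset of the dimensions. The sketch below assumes a single pre-joined sales table with invented rows; `compute_cube` is a hypothetical helper name, and the column names are interpolated into the SQL only because they come from a fixed list.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (time_key INTEGER, item_key INTEGER,
                branch_key INTEGER, location_key INTEGER,
                number_of_units_sold INTEGER, price REAL)""")
# Invented sample rows, for illustration only.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 2, 1.5),
    (1, 2, 1, 1, 3, 2.0),
    (2, 1, 1, 2, 4, 1.5),
])

DIMENSIONS = ["time_key", "item_key", "branch_key", "location_key"]

def compute_cube(conn, dims):
    """Run one GROUP BY query per subset of dims; return {subset: rows}."""
    cube = {}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            select_cols = "".join(col + ", " for col in subset)
            # The empty subset has no GROUP BY clause: that is the apex cuboid.
            group_by = ("GROUP BY " + ", ".join(subset)) if subset else ""
            sql = (f"SELECT {select_cols}"
                   "SUM(number_of_units_sold * price), SUM(number_of_units_sold) "
                   f"FROM sales {group_by}")
            cube[subset] = conn.execute(sql).fetchall()
    return cube

cube = compute_cube(conn, DIMENSIONS)
print(len(cube))   # 16, i.e. 2 ** 4 cuboids
print(cube[()])    # [(15.0, 9)], the apex cuboid with the grand totals
```

With four dimensions and no hierarchies there are 2^4 = 16 group-bys; the full key tuple gives the base cuboid and the empty tuple gives the apex cuboid.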

Defining Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Example (Exercise 7)

Consider a data warehouse with the following four dimensions: Data (date), Espectador (spectator), Localização (location), and Jogo (game), and the fact "preço" (price), a value in euros

The price is the amount paid by a spectator to watch a given game on a given date

Spectators can be students, adults, or seniors, and each of these categories has its own ticket price

The Data dimension contains the day, month, and year; the Localização dimension contains the name of the stadium; and the Jogo dimension contains the names of the two teams playing each other

Star schema for this DW

    jogo(jogoId, equipa1, equipa2)
    data(dataId, dia, mes, ano)
    espectador(espId, nome, categoria)
    localizacao(localId, estadio)
    factos(jogoId, dataId, espId, localId, preco)

Write the SQL query that returns the total price paid by student spectators to watch the game played at Estádio da Luz on 1 March 2005

The corresponding SQL query

    SELECT SUM(preco)
    FROM Factos F, Data D, Localizacao L, Espectador E
    WHERE F.dataId = D.dataId
      AND F.espId = E.espId
      AND F.localId = L.localId
      AND D.dia = 1
      AND D.mes = 3
      AND D.ano = 2005
      AND L.estadio = 'Estadio da Luz'
      AND E.categoria = 'Estudante';

A relational schema that models the same information, and an SQL query that returns the same information:

    jogo(jogoId, equipa1, equipa2, localId, data)
    localizacao(localId, estadio)
    espectador(espId, nome, categoriaId)
    categoria(categoriaId, nomeC, preco)
    jogoEspectador(jogoId, espId)

The corresponding SQL query

    SELECT SUM(C.preco)
    FROM Categoria C, Espectador E, JogoEspectador JE, Jogo J, Localizacao L
    WHERE C.nomeC = 'Estudante'
      AND J.data = DATE '2005-03-01'
      AND L.estadio = 'Estadio da Luz'
      AND C.categoriaId = E.categoriaId
      AND E.espId = JE.espId
      AND JE.jogoId = J.jogoId
      AND J.localId = L.localId;

Difference between both approaches

With the star schema there are three joins: the Factos table is joined with each of the 3 relevant dimensions

With the relational schema there are 4 joins

The star schema needs fewer joins, one per relevant dimension

In the multidimensional model the base cuboid is already precomputed
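Both formulations of Exercise 7 can be tried side by side with Python's built-in sqlite3 module. This is a sketch under stated assumptions: table and column names follow the two schemas from the exercise, the sample rows (two students and one adult at one game) are invented, and the game date in the normalized schema is stored as ISO text because SQLite has no dedicated date type.

```python
import sqlite3

# Star schema: one fact table, joined to three of its four dimensions.
star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE jogo        (jogoId INTEGER PRIMARY KEY, equipa1 TEXT, equipa2 TEXT);
CREATE TABLE data        (dataId INTEGER PRIMARY KEY, dia INTEGER, mes INTEGER, ano INTEGER);
CREATE TABLE espectador  (espId INTEGER PRIMARY KEY, nome TEXT, categoria TEXT);
CREATE TABLE localizacao (localId INTEGER PRIMARY KEY, estadio TEXT);
CREATE TABLE factos      (jogoId INTEGER, dataId INTEGER, espId INTEGER,
                          localId INTEGER, preco REAL);
INSERT INTO jogo VALUES (1, 'Equipa A', 'Equipa B');
INSERT INTO data VALUES (1, 1, 3, 2005), (2, 2, 3, 2005);
INSERT INTO espectador VALUES (1, 'Ana', 'Estudante'), (2, 'Rui', 'Estudante'),
                              (3, 'Eva', 'Adulto');
INSERT INTO localizacao VALUES (1, 'Estadio da Luz');
INSERT INTO factos VALUES (1, 1, 1, 1, 10.0), (1, 1, 2, 1, 10.0), (1, 1, 3, 1, 20.0);
""")
star_total = star.execute("""
SELECT SUM(F.preco)
FROM factos F, data D, localizacao L, espectador E
WHERE F.dataId = D.dataId AND F.espId = E.espId AND F.localId = L.localId
  AND D.dia = 1 AND D.mes = 3 AND D.ano = 2005
  AND L.estadio = 'Estadio da Luz' AND E.categoria = 'Estudante'
""").fetchone()[0]

# Normalized relational schema: the price lives in the category table.
norm = sqlite3.connect(":memory:")
norm.executescript("""
CREATE TABLE localizacao    (localId INTEGER PRIMARY KEY, estadio TEXT);
CREATE TABLE categoria      (categoriaId INTEGER PRIMARY KEY, nomeC TEXT, preco REAL);
CREATE TABLE espectador     (espId INTEGER PRIMARY KEY, nome TEXT, categoriaId INTEGER);
CREATE TABLE jogo           (jogoId INTEGER PRIMARY KEY, equipa1 TEXT, equipa2 TEXT,
                             localId INTEGER, data TEXT);
CREATE TABLE jogoEspectador (jogoId INTEGER, espId INTEGER);
INSERT INTO localizacao VALUES (1, 'Estadio da Luz');
INSERT INTO categoria VALUES (1, 'Estudante', 10.0), (2, 'Adulto', 20.0);
INSERT INTO espectador VALUES (1, 'Ana', 1), (2, 'Rui', 1), (3, 'Eva', 2);
INSERT INTO jogo VALUES (1, 'Equipa A', 'Equipa B', 1, '2005-03-01');
INSERT INTO jogoEspectador VALUES (1, 1), (1, 2), (1, 3);
""")
norm_total = norm.execute("""
SELECT SUM(C.preco)
FROM categoria C, espectador E, jogoEspectador JE, jogo J, localizacao L
WHERE C.nomeC = 'Estudante' AND J.data = '2005-03-01'
  AND L.estadio = 'Estadio da Luz'
  AND C.categoriaId = E.categoriaId AND E.espId = JE.espId
  AND JE.jogoId = J.jogoId AND J.localId = L.localId
""").fetchone()[0]

print(star_total, norm_total)  # 20.0 20.0 with the sample rows above
```

Both queries return the same total; the star-schema version reaches it with three join conditions, the normalized version with four.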

MDX

Multidimensional Expressions (MDX) as a Language

MDX emerged circa 1998, when it first began to appear in commercial applications. MDX was created to query OLAP databases, and has become widely adopted within the realm of analytical applications

Example: provide the total sales and total cost amounts for the years 1997 and 1998 individually for all USA-based stores (including all products)

We are asked, moreover, to provide the information in a two-dimensional grid, with the sales and cost amounts (called measures in our data warehouse) in the rows and the years (1997 and 1998) in the columns

    SELECT
        {[Time].[1997], [Time].[1998]} ON COLUMNS,
        {[Measures].[Warehouse Sales], [Measures].[Warehouse Cost]} ON ROWS
    FROM Warehouse
    WHERE ([Store].[All Stores].[USA])

The cube that is targeted by the query (the query scope) appears in the FROM clause of the query

The FROM clause in MDX works much as it does in SQL, where it stipulates the tables used as sources for the query

The query syntax also uses other keywords that are common in SQL, such as SELECT and WHERE

An important difference is that the output of an MDX query, which uses a cuboid as a data source, is another cuboid, whereas the output of an SQL query (which uses a columnar table as a source) is typically columnar

A query has one or more dimensions (axes). The query above has two. The first three axes found in MDX queries are known as columns, rows, and pages

    SELECT
        {[Time].[1997].[Q1], [Time].[1997].[Q2]} ON COLUMNS,
        {[Warehouse].[All Warehouses].[USA]} ON ROWS
    FROM Warehouse
    WHERE ([Measures].[Warehouse Sales])

Curly braces "{}" are used in MDX to represent a set of members of a dimension or group of dimensions
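MDX itself needs an OLAP server, so the following is only a plain-Python analogue of the shape of the first query's result: years on the column axis, measures on the row axis, and the USA member acting as a slicer that filters cells without appearing on an axis. The fact rows and the function name `query_grid` are invented for illustration and are not part of any MDX API.

```python
from collections import defaultdict

# Invented fact rows: (country, year, warehouse_sales, warehouse_cost).
facts = [
    ("USA", 1997, 100.0, 60.0),
    ("USA", 1997, 50.0, 30.0),
    ("USA", 1998, 80.0, 50.0),
    ("Canada", 1997, 70.0, 40.0),
]

def query_grid(facts, years, slicer_country):
    """Return {measure: {year: total}}: measures on rows, years on columns."""
    grid = {"Warehouse Sales": defaultdict(float),
            "Warehouse Cost": defaultdict(float)}
    for country, year, sales, cost in facts:
        # The slicer restricts which cells contribute; it is not an axis.
        if country == slicer_country and year in years:
            grid["Warehouse Sales"][year] += sales
            grid["Warehouse Cost"][year] += cost
    # Materialize every requested column, even if no fact row matched it.
    return {measure: {y: cells[y] for y in years} for measure, cells in grid.items()}

grid = query_grid(facts, years=[1997, 1998], slicer_country="USA")
for measure, cells in grid.items():
    print(measure, cells)
```

The result is itself a small two-axis cuboid rather than a flat list of rows, which is the difference from SQL output that the slides point out.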

Complex Aggregation at Multiple Granularities

Multifeature cubes compute complex queries involving multiple aggregates at multiple granularities (resolutions)

Example
An item is purchased in a sales region on a business day (year, month, day)
The shelf life in months of a given item is stored in shelf
The item price and sales are stored in price and sales

Find the total sales in 2000, broken down by item, region, and month, with subtotals for each dimension

A data cube is constructed:
{(item, region, month), (item, region), (item, month), (month, region), (item), (month), (region), ()}

This is a simple data cube, since it does not involve any dependent aggregates

What are dependent aggregates?

Example

Grouping by all subsets (cuboids) of {item, region, month} (= data cube)

Find the maximum price for each group (cuboid) in 2000

Among the maximum price tuples, find the minimum and maximum shelf lives

Multifeature cube graph for the example query

    R0 (cube) -> R1 {= MAX(price)}
    R1 -> R2 {= MIN(R1.shelf)}
    R1 -> R3 {= MAX(R1.shelf)}

The multifeature graph illustrates the aggregate dependencies

R0, R1, R2, R3 are the grouping variables

The grouping variables R2 and R3 are dependent on R1

In extended SQL: R2 in R1, R3 in R1

Query in extended SQL

    SELECT item, region, month, MAX(price), MIN(R1.shelf), MAX(R1.shelf)
    FROM Purchases
    WHERE year = 2000
    CUBE BY item, region, month: R1, R2, R3
    SUCH THAT R1.price = MAX(price) AND
              R2 IN R1 AND R2.shelf = MIN(R1.shelf) AND
              R3 IN R1 AND R3.shelf = MAX(R1.shelf);
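The extended SQL above is not available in common database engines, so the dependent aggregates can be illustrated procedurally instead. The sketch below, in plain Python, is one possible reading of the query: for each cuboid and each group it first finds the maximum price, then restricts to the tuples that attain it (R1), and only then takes the minimum and maximum shelf life (R2, R3). The Purchases rows and the function name `multifeature_cube` are invented for illustration.

```python
from itertools import combinations

# Invented Purchases rows: item, region, month, year, price, shelf (months).
purchases = [
    {"item": "tv",  "region": "north", "month": 1, "year": 2000, "price": 500, "shelf": 24},
    {"item": "tv",  "region": "north", "month": 1, "year": 2000, "price": 500, "shelf": 36},
    {"item": "tv",  "region": "south", "month": 2, "year": 2000, "price": 450, "shelf": 12},
    {"item": "pen", "region": "north", "month": 1, "year": 2000, "price": 2,   "shelf": 60},
    {"item": "tv",  "region": "north", "month": 1, "year": 1999, "price": 900, "shelf": 6},
]

DIMS = ("item", "region", "month")

def multifeature_cube(rows, dims):
    """Map (cuboid, group key) to (MAX(price), MIN(R1.shelf), MAX(R1.shelf))."""
    rows = [r for r in rows if r["year"] == 2000]        # WHERE year = 2000
    result = {}
    for k in range(len(dims) + 1):
        for cuboid in combinations(dims, k):             # CUBE BY item, region, month
            groups = {}
            for r in rows:
                groups.setdefault(tuple(r[d] for d in cuboid), []).append(r)
            for key, members in groups.items():
                max_price = max(r["price"] for r in members)
                # R1: the tuples of this group that attain the maximum price.
                r1 = [r for r in members if r["price"] == max_price]
                shelves = [r["shelf"] for r in r1]
                # R2 and R3 are computed over R1 only, not over the whole group.
                result[(cuboid, key)] = (max_price, min(shelves), max(shelves))
    return result

cube = multifeature_cube(purchases, DIMS)
print(cube[((), ())])                   # (500, 24, 36) for the apex cuboid
print(cube[(("item",), ("pen",))])      # (2, 60, 60)
```

The shelf aggregates depend on the result of the price aggregate, which is what makes them dependent aggregates: computing MIN(shelf) over the whole group would give 12 for the apex cuboid rather than 24.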

Design of Data Warehouse: A Business Analysis Framework

Four views regarding the design of a data warehouse
Top-down view
  • allows selection of the relevant information necessary for the data warehouse
Data source view
  • exposes the information being captured, stored, and managed by operational systems
Data warehouse view
  • consists of fact tables and dimension tables
Business query view
  • sees the perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process

Top-down, bottom-up approaches, or a combination of both
  • Top-down: starts with overall design and planning (mature)
  • Bottom-up: starts with experiments and prototypes (rapid)
From a software engineering point of view
  • Waterfall: structured and systematic analysis at each step before proceeding to the next
  • Spiral: rapid generation of increasingly functional systems, with short turnaround times

Lifecycle planning

Translation from user requirements into software requirements
Transformation of the software requirements into software design
Implementation of the design in programming code
The sequence of these steps is defined by the lifecycle model

A software lifecycle model must be defined for every project!

The lifecycle model you choose has as much influence over your project's success as any other planning decision you make!

Pure Waterfall model

Figure (waterfall diagram), phases in order: Software Concept, Requirements Analysis, Architectural Design, Detailed Design, Coding and Debugging, System Testing

Pure Waterfall model

A document-driven model, which means that the main work products carried from phase to phase are documents

The disadvantages of the pure waterfall model arise from the difficulty of fully specifying requirements at the beginning of the project, before any design work has been done and before any code has been written

Salmon model

Figure (waterfall diagram), phases: Software Concept, Requirements Analysis, Architectural Design, Detailed Design, Coding and Debugging, System Testing

Code-and-Fix I

Figure labels: System Specification (maybe), Code-and-Fix, Release

Code-and-Fix II

Advantages

No overhead: you don't spend any time on planning, documentation, quality assurance, standards enforcement, or any activities other than pure coding

Since you jump right into coding, you can show signs of progress immediately

It requires little expertise

Code-and-Fix III

Maintainability and reliability decrease with complexity and over time

For any kind of project other than a tiny project, this model is dangerous. It might have no overhead, but it also provides no means of assessing progress; you just code until you're done

Spiral I

Figure (spiral diagram, quadrants I to IV), labels: Start; Determine objectives, alternatives, and constraints; Risk analysis; Prototypes; Simulation models; Evaluate; Requirements; Development plan; Code; Test; Release; Review

Spiral II

The basic idea behind the diagram is that you start on a small scale in the middle of the spiral, explore the risks, make a plan to handle the risks, and then commit to an approach for the next iteration. Each iteration moves your project to a larger scale

Spiral III

The spiral model is a risk-oriented lifecycle model that breaks a software project up into mini projects. Each mini project addresses one or more major risks until all the major risks have been addressed

Spiral IV

Determine objectives and constraints
Identify and resolve risks
Evaluate alternatives
Develop the deliverables for that iteration, and verify that they are correct
Plan the next iteration
Commit to an approach for the next iteration

One of the most important advantages of the spiral model is that as costs increase, risks decrease. The more time and money you spend, the less risk you're taking

Data Warehouse: A Multi-Tiered Architecture

Figure (architecture diagram), tiers from left to right:
Data Sources: Operational DBs, other sources
Extract, Transform, Load, Refresh (with Monitor & Integrator)
Data Storage: Data Warehouse, Data Marts, Metadata
Serve
OLAP Engine: OLAP Server
Front-End Tools: Analysis, Query, Reports, Data mining

Three Data Warehouse Models

Enterprise warehouse
  • collects all of the information about subjects spanning the entire organization
Data mart
  • a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  • Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
  • A set of views over operational databases
  • Only some of the possible summary views may be materialized

Data Warehouse Usage

Three kinds of data warehouse applications
Information processing
  • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
Analytical processing
  • multidimensional analysis of data warehouse data
  • supports basic OLAP operations: slice-dice, drilling, pivoting
Data mining
  • knowledge discovery from hidden patterns
  • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

Data Warehouse Back-End Tools and Utilities

Data extraction
  • get data from multiple, heterogeneous, and external sources
Data cleaning
  • detect errors in the data and rectify them when possible
Data transformation
  • convert data from legacy or host format to warehouse format
Load
  • sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
  • propagate the updates from the data sources to the warehouse


Data Cleaning

(De)normalization(?)
Missing Values
...