Design of DW - ULisboa

Remember: item, city, year, and sales_in_Euro
()
(city) (item) (year)
(city, item) (city, year) (item, year)
(city, item, year)
2 levels of hierarchy for each dimension
  Item(part, color): i1, i2
  City(downtown, suburb): c1, c2
  Year(good_year, bad_year): y1, y2
For a 3-dimensional data cube, where $L_i$ is the number of levels of dimension $i$ (here $L_1 = L_2 = L_3 = 2$), the total number of cuboids that can be generated is
$\prod_{i=1}^{3} (L_i + 1) = (2 + 1)^3 = 3^3 = 27$
{(), (c1),(c2),(i1),(i2),(y1),(y2),
(c1,i1),(c1,i2),(c2,i1),(c2,i2), (c1,y1),(c1,y2),(c2,y1),(c2,y2), (i1,y1),(i1,y2),(i2,y1),(i2,y2),
(c1,i1,y1),(c1,i1,y2),(c1,i2,y1),(c1,i2,y2), (c2,i1,y1),(c2,i1,y2),(c2,i2,y1),(c2,i2,y2)}
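The count above can be checked by enumeration. The following is a minimal sketch in Python, assuming each dimension either contributes exactly one of its levels or is aggregated away; the dictionary layout and the function name `enumerate_cuboids` are illustrative choices, not part of the slides.

```python
from itertools import product

# Level labels from the slide: two levels per dimension.
levels = {
    "city": ["c1", "c2"],
    "item": ["i1", "i2"],
    "year": ["y1", "y2"],
}

def enumerate_cuboids(levels):
    """Return every cuboid as a tuple of the levels it keeps.

    Each dimension offers L_i levels plus the option of being
    aggregated away (None), so there are prod(L_i + 1) cuboids.
    """
    choices = [[None] + lvls for lvls in levels.values()]
    cuboids = []
    for combo in product(*choices):
        cuboids.append(tuple(level for level in combo if level is not None))
    return cuboids

cuboids = enumerate_cuboids(levels)
print(len(cuboids))  # 27
```

The empty tuple `()` in the result corresponds to the apex cuboid, and the eight 3-tuples correspond to the most detailed cuboids in the list above.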
DMQL: Data Mining Query Language
Relational database schema
Translation into SQL query
Example, star schema, and relational database
MDX
Multifeature cubes
Design of a Data Warehouse
Lifecycle models
Data Warehouse models
DMQL
DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
Data warehouses and data marts can be defined by cube definition and dimension definition
DMQL
Create and manipulate data mining models through a SQL-based interface (“Command-driven” data mining)
Abstract away the data mining particulars
Data mining should be performed on data in the database (should not need to export to a special-purpose environment)
Approaches differ on what kinds of models should be created, and what operations we should be able to perform
Cube Definition Syntax in DMQL
Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Example of Star Schema
Sales Fact Table
    keys: time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, state_or_province, country
Defining Star Schema in DMQL
    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
The star schema contains two measures, dollars_sold and units_sold
How are the DMQL commands interpreted to generate a specified data cube?
Relational database schema
    time(time_key, day, day_of_week, month, quarter, year)
    item(item_key, item_name, brand, type, supplier_type)
    branch(branch_key, branch_name, branch_type)
    location(location_key, street, city, province_or_state, country)
    sales(time_key, item_key, branch_key, location_key, number_of_units_sold, price)
The DMQL specification is translated into the following SQL query, which generates the base cuboid
    SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
           SUM(s.number_of_units_sold * s.price),
           SUM(s.number_of_units_sold)
    FROM time t, item i, branch b, location l, sales s
    WHERE s.time_key = t.time_key
      AND s.item_key = i.item_key
      AND s.branch_key = b.branch_key
      AND s.location_key = l.location_key
    GROUP BY s.time_key, s.item_key, s.branch_key, s.location_key;
The granularity (resolution) of each dimension is at the join key level
A join key is the key that links a fact table and a dimension table
The fact table associated with a base cuboid is sometimes referred to as the base fact table
By changing GROUP BY we may generate other cuboids
The apex cuboid, representing the total sum of dollars_sold and the total count of units_sold, is generated by GROUP BY ();
Other cuboids may be generated by applying selection and projection operations on the base cuboid
To generate a data cube we may as well use
    GROUP BY CUBE (s.time_key, s.item_key, s.branch_key, s.location_key);
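What GROUP BY CUBE computes can be imitated by running one GROUP BY per subset of the join keys. The sketch below uses Python's built-in sqlite3 module, which has no CUBE operator, so the subsets are enumerated by hand. The fact rows are invented for illustration, the dimension tables are omitted because the aggregation only needs the fact table, and the helper name `cuboid` is hypothetical.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (time_key INT, item_key INT, branch_key INT, "
    "location_key INT, number_of_units_sold INT, price REAL)"
)
# Hypothetical fact rows, invented for illustration only.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, 100, 1000, 2, 5.0),
        (1, 10, 100, 1000, 1, 5.0),
        (1, 11, 100, 1000, 3, 2.0),
        (2, 10, 101, 1001, 4, 5.0),
    ],
)

KEYS = ["time_key", "item_key", "branch_key", "location_key"]

def cuboid(group_by):
    """Aggregate the fact table over one subset of the join keys."""
    cols = ", ".join(group_by)
    select = (cols + ", ") if group_by else ""
    sql = (
        f"SELECT {select}SUM(number_of_units_sold * price), "
        f"SUM(number_of_units_sold) FROM sales"
    )
    if group_by:
        sql += f" GROUP BY {cols}"
    return conn.execute(sql).fetchall()

# One query per subset of the keys: 2^4 = 16 cuboids in total.
cube = {
    subset: cuboid(subset)
    for r in range(len(KEYS) + 1)
    for subset in combinations(KEYS, r)
}

print(len(cube))   # 16
print(cube[()])    # apex cuboid: [(41.0, 10)]
```

The entry `cube[tuple(KEYS)]` is the base cuboid, and `cube[()]` is the apex cuboid with the two grand totals. A database that supports CUBE would produce all of these groupings in a single statement.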
Defining Snowflake Schema in DMQL
    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city(city_key, province_or_state, country))
Defining Fact Constellation in DMQL
    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
    define cube shipping [time, item, shipper, from_location, to_location]:
        dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales
Example (Exercise 7)
Suppose a data warehouse that contains the following four dimensions: Date (Data), Spectator (Espectador), Location (Localização), and Game (Jogo), and the fact "price" (preço), a value in euros
The price is the amount paid by a spectator when attending a given game on a given date
Spectators can be students, adults, or seniors, and each of these categories has its own ticket price
The Date dimension contains the day, month, and year; the Location dimension contains the name of the stadium; and the Game dimension contains the names of the two teams playing each other
Star schema for this DW
    jogo(jogoId, equipa1, equipa2)
    data(dataId, dia, mes, ano)
    espectador(espId, nome, categoria)
    localizacao(localId, estadio)
    factos(jogoId, dataId, espId, localId, preco)
Write in SQL the query that returns the total price paid by student spectators to attend the game that took place at Estádio da Luz on 1 March 2005
The corresponding SQL query
    SELECT SUM(preco)
    FROM Factos F, Data D, Localizacao L, Espectador E
    WHERE F.dataId = D.dataId
      AND F.espId = E.espId
      AND F.localId = L.localId
      AND D.dia = 1
      AND D.mes = 3
      AND D.ano = 2005
      AND L.estadio = 'Estadio da Luz'
      AND E.categoria = 'Estudante';
A relational schema that models the same information, and an SQL query that returns the same information:
    jogo(jogoId, equipa1, equipa2, localId, data)
    localizacao(localId, estadio)
    espectador(espId, nome, categoriaId)
    categoria(categoriaId, nomeC, preco)
    jogoEspectador(jogoId, espId)
The corresponding SQL query
    SELECT SUM(C.preco)
    FROM Categoria C, Espectador E, JogoEspectador JE, Jogo J, Localizacao L
    WHERE C.nomeC = 'Estudante'
      AND J.data = '1/3/2005'
      AND L.estadio = 'Estadio da Luz'
      AND C.categoriaId = E.categoriaId
      AND E.espId = JE.espId
      AND JE.jogoId = J.jogoId
      AND J.localId = L.localId;
Difference between both approaches
With the star schema there are three joins, each between the Factos table and one of the 3 relevant dimensions
With the relational schema there are 4 joins
The star schema needs fewer joins, one per relevant dimension
In the multidimensional model the base cuboid is already precomputed
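The comparison can be tried out on a toy instance. The sketch below loads both schemas into separate in-memory SQLite databases and runs the two queries; all rows are invented for illustration, and the relational variant assumes that `jogo.data` stores dates as ISO text such as '2005-03-01', which is a choice made for this sketch rather than something the exercise specifies.

```python
import sqlite3

# Star schema with hypothetical toy rows.
star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE data (dataId INT, dia INT, mes INT, ano INT);
CREATE TABLE espectador (espId INT, nome TEXT, categoria TEXT);
CREATE TABLE localizacao (localId INT, estadio TEXT);
CREATE TABLE factos (jogoId INT, dataId INT, espId INT, localId INT, preco REAL);
INSERT INTO data VALUES (1, 1, 3, 2005), (2, 2, 3, 2005);
INSERT INTO espectador VALUES
    (1, 'Ana', 'Estudante'), (2, 'Rui', 'Estudante'), (3, 'Eva', 'Adulto');
INSERT INTO localizacao VALUES (1, 'Estadio da Luz'), (2, 'Outro');
INSERT INTO factos VALUES
    (1, 1, 1, 1, 10.0), (1, 1, 2, 1, 10.0), (1, 1, 3, 1, 25.0), (2, 2, 1, 2, 10.0);
""")
# Three joins: the fact table with each relevant dimension.
star_total = star.execute("""
SELECT SUM(F.preco)
FROM factos F, data D, localizacao L, espectador E
WHERE F.dataId = D.dataId AND F.espId = E.espId AND F.localId = L.localId
  AND D.dia = 1 AND D.mes = 3 AND D.ano = 2005
  AND L.estadio = 'Estadio da Luz' AND E.categoria = 'Estudante'
""").fetchone()[0]

# Relational schema holding the same toy information.
rel = sqlite3.connect(":memory:")
rel.executescript("""
CREATE TABLE jogo (jogoId INT, equipa1 TEXT, equipa2 TEXT, localId INT, data TEXT);
CREATE TABLE localizacao (localId INT, estadio TEXT);
CREATE TABLE categoria (categoriaId INT, nomeC TEXT, preco REAL);
CREATE TABLE espectador (espId INT, nome TEXT, categoriaId INT);
CREATE TABLE jogoEspectador (jogoId INT, espId INT);
INSERT INTO jogo VALUES (1, 'A', 'B', 1, '2005-03-01'), (2, 'C', 'D', 2, '2005-03-02');
INSERT INTO localizacao VALUES (1, 'Estadio da Luz'), (2, 'Outro');
INSERT INTO categoria VALUES (1, 'Estudante', 10.0), (2, 'Adulto', 25.0);
INSERT INTO espectador VALUES (1, 'Ana', 1), (2, 'Rui', 1), (3, 'Eva', 2);
INSERT INTO jogoEspectador VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")
# Four joins along the chain categoria - espectador - jogoEspectador - jogo - localizacao.
rel_total = rel.execute("""
SELECT SUM(C.preco)
FROM categoria C, espectador E, jogoEspectador JE, jogo J, localizacao L
WHERE C.nomeC = 'Estudante' AND J.data = '2005-03-01'
  AND L.estadio = 'Estadio da Luz'
  AND C.categoriaId = E.categoriaId AND E.espId = JE.espId
  AND JE.jogoId = J.jogoId AND J.localId = L.localId
""").fetchone()[0]

print(star_total, rel_total)  # 20.0 20.0
```

On this toy data both queries return the same total, since two students attended the game at Estadio da Luz on that date and each paid the student price.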
MDX
Multidimensional Expressions (MDX) as a Language
MDX emerged circa 1998, when it first began to appear in commercial applications. MDX was created to query OLAP databases, and has become widely adopted within the realm of analytical applications
Provide the total sales and total cost amounts for the years 1997 and 1998 individually for all USA-based stores (including all products)
We are asked, moreover, to provide the information in a two-dimensional grid, with the sales and cost amounts (called measures in our data warehouse) in the rows and the years (1997 and 1998) in the columns
    SELECT
        {[Time].[1997], [Time].[1998]} ON COLUMNS,
        {[Measures].[Warehouse Sales], [Measures].[Warehouse Cost]} ON ROWS
    FROM Warehouse
    WHERE ([Store].[All Stores].[USA])
The cube that is targeted by the query (the query scope) appears in the FROM clause of the query
The FROM clause in MDX works much as it does in SQL, where it stipulates the tables used as sources for the query
The query syntax also uses other keywords that are common in SQL, such as SELECT and WHERE.
An important difference is that the output of an MDX query, which uses a cuboid as a data source, is another cuboid, whereas the output of an SQL query (which uses a columnar table as a source) is typically columnar
A query has one or more dimensions. The query above has two. (The first three dimensions (= axes) found in MDX queries are known as rows, columns, and pages)
    SELECT
        {[Time].[1997].[Q1], [Time].[1997].[Q2]} ON COLUMNS,
        {[Warehouse].[All Warehouses].[USA]} ON ROWS
    FROM Warehouse
    WHERE ([Measures].[Warehouse Sales])
Curly brackets "{}" are used in MDX to represent a set of members of a dimension or group of dimensions
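The two-axis result of the first MDX query (years on columns, measures on rows, USA as the slicer) can be mimicked in plain Python to make the grid shape concrete. This is only an analogy, not how an OLAP server evaluates MDX; the fact rows and the function name `mdx_like_grid` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, country, warehouse_sales, warehouse_cost).
facts = [
    (1997, "USA", 100.0, 60.0),
    (1997, "USA", 50.0, 30.0),
    (1998, "USA", 80.0, 40.0),
    (1998, "Canada", 70.0, 35.0),
]

def mdx_like_grid(rows, years, measures, country):
    """Build {measure: {year: total}} for one slicer value.

    `years` plays the role of the COLUMNS axis, `measures` the ROWS
    axis, and `country` the WHERE slicer.
    """
    index = {"Warehouse Sales": 2, "Warehouse Cost": 3}
    grid = {m: defaultdict(float) for m in measures}
    for row in rows:
        if row[1] != country or row[0] not in years:
            continue
        for m in measures:
            grid[m][row[0]] += row[index[m]]
    return {m: dict(cells) for m, cells in grid.items()}

grid = mdx_like_grid(
    facts, [1997, 1998], ["Warehouse Sales", "Warehouse Cost"], "USA"
)
for measure, cells in grid.items():
    print(measure, cells)
```

Each outer key is one row of the grid and each inner key one column, so the result is itself a small two-dimensional cuboid rather than a flat table.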
Complex Aggregation at Multiple Granularities
Multifeature cubes compute complex queries involving multiple aggregates at multiple granularities (resolutions)
Example
    An item is purchased in a sales region on a business day (year, month, day)
    The shelf life in months of a given item is stored in shelf
    The item price and sales are stored in price and sales
Find the total sales in 2000, broken down by item, region, and month, with subtotals for each dimension
A data cube is constructed
    {(item, region, month), (item, region), (item, month), (month, region), (item), (month), (region), ()}
This is a simple data cube, since it does not involve any dependent aggregates
What are dependent aggregates?
Example
Grouping by all subsets (cuboids) of {item, region, month} (= data cube)
Find the maximum price for each group (cuboid) in 2000
Among the maximum-price tuples, find the minimum and maximum shelf lives
Multifeature cube graph for the example query
    R0: cube
    R1: cube {= MAX(price)}
    R2: cube {= MIN(R1.shelf)}    R3: cube {= MAX(R1.shelf)}
The multifeature graph illustrates the aggregate dependencies
R0, R1, R2, R3 are the grouping variables
The grouping variables R2, R3 are dependent on R1
In extended SQL
    R2 in R1
    R3 in R1
Query in extended SQL
    SELECT item, region, month, MAX(price), MIN(R1.shelf), MAX(R1.shelf)
    FROM Purchases
    WHERE year = 2000
    CUBE BY item, region, month: R1, R2, R3
    SUCH THAT R1.price = MAX(price) AND
              R2 IN R1 AND R2.shelf = MIN(R1.shelf) AND
              R3 IN R1 AND R3.shelf = MAX(R1.shelf);
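The dependent aggregates in the query above can be sketched in plain Python: for every group of every cuboid, R1 selects the maximum-price tuples, and the minimum and maximum shelf lives are then taken over R1 only. The purchase tuples and the function name `multifeature_cube` are invented for illustration, and the sketch makes no attempt at the efficient evaluation a real engine would use.

```python
from itertools import combinations

# Hypothetical purchase tuples: (item, region, month, year, price, shelf).
purchases = [
    ("tv", "north", 1, 2000, 500.0, 24),
    ("tv", "north", 1, 2000, 500.0, 36),
    ("tv", "south", 2, 2000, 450.0, 12),
    ("pc", "north", 1, 2000, 900.0, 18),
    ("pc", "north", 1, 1999, 999.0, 60),  # filtered out by the year condition
]
DIMS = {"item": 0, "region": 1, "month": 2}

def multifeature_cube(rows, year):
    """Map (cuboid, group key) -> (max price, min shelf, max shelf)."""
    rows = [r for r in rows if r[3] == year]
    names = list(DIMS)
    result = {}
    for k in range(len(names) + 1):
        for dims in combinations(names, k):
            # Partition the tuples by the dimensions kept in this cuboid.
            groups = {}
            for r in rows:
                key = tuple(r[DIMS[d]] for d in dims)
                groups.setdefault(key, []).append(r)
            for key, members in groups.items():
                max_price = max(r[4] for r in members)
                r1 = [r for r in members if r[4] == max_price]  # R1
                shelves = [r[5] for r in r1]
                # R2 and R3 are computed over R1, not over the whole group.
                result[(dims, key)] = (max_price, min(shelves), max(shelves))
    return result

cube = multifeature_cube(purchases, 2000)
print(cube[((), ())])                # (900.0, 18, 18)
print(cube[(("item",), ("tv",))])    # (500.0, 24, 36)
```

For the group item = tv, the tuple priced at 450.0 has the shortest shelf life (12) but is not among the maximum-price tuples, so it does not affect the result. That is the sense in which R2 and R3 depend on R1.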
Design of Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse
    Top-down view
        • allows selection of the relevant information necessary for the data warehouse
    Data source view
        • exposes the information being captured, stored, and managed by operational systems
    Data warehouse view
        • consists of fact tables and dimension tables
    Business query view
        • sees the perspectives of data in the warehouse from the view of the end user
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both
    Top-down: Starts with overall design and planning (mature)
    Bottom-up: Starts with experiments and prototypes (rapid)
From a software engineering point of view
    Waterfall: structured and systematic analysis at each step before proceeding to the next
    Spiral: rapid generation of increasingly functional systems, with short turnaround time
Lifecycle planning
Translation from user requirements into software requirements
Transformation of the software requirements into software design
Implementation of the design into programming code
The sequence of these steps is defined by the lifecycle model
A software lifecycle model must be defined for every project!
The lifecycle model you choose has as much influence over your project’s success as any other planning decision you make!
Pure Waterfall model
[Figure: Software Concept -> Requirements Analysis -> Architectural Design -> Detailed Design -> Coding and Debugging -> System Testing]
Pure Waterfall model
A document-driven model, which means that the main work products that are carried from phase to phase are documents
The disadvantages of the pure waterfall model arise from the difficulty of fully specifying requirements at the beginning of the project, before any design work has been done and before any code has been written
Salmon model
[Figure: Software Concept, Requirements Analysis, Architectural Design, Detailed Design, Coding and Debugging, System Testing]
Code-and-Fix I
[Figure: System Specification (maybe) -> Code-and-Fix -> Release]
Code-and-Fix II
Advantages
    No overhead: you don’t spend any time on planning, documentation, quality assurance, enforcement, or activities other than pure coding
    Since you jump right into coding, you can show signs of progress immediately
    It requires little expertise
Code-and-Fix III
Maintainability and reliability decrease with complexity and time
For any kind of project other than a tiny project, this model is dangerous. It might have no overhead, but it also provides no means of assessing progress; you just code until you’re done
Spiral I
[Figure: spiral model diagram with quadrants I to IV. Labels: Start; Determine objectives, alternatives, and constraints; Risk analysis; Prototypes; Simulation models; Requirements; Development plan; Code, test, release; Evaluate; Review]
Spiral II
The basic idea behind the diagram is that you start on a small scale in the middle of the spiral, explore the risks, make a plan to handle the risks, and then commit to an approach for the next iteration. Each iteration moves your project to a larger scale
Spiral III
The spiral model is a risk-oriented lifecycle model that breaks a software project up into mini projects. Each mini project addresses one or more major risks until all the major risks have been addressed
Spiral IV
Determine objectives and constraints
Identify and resolve risks
Evaluate alternatives
Develop the deliverables for that iteration, and verify that they are correct
Plan the next iteration
Commit to an approach for the next iteration
One of the most important advantages of the spiral model is that as costs increase, risks decrease. The more time and money you spend, the less risk you’re taking
Data Warehouse: A Multi-Tiered Architecture
[Figure: Data Sources (operational DBs, other sources) -> Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata) -> Data Storage (data warehouse, data marts) -> Serve -> OLAP Server (OLAP engine) -> Front-End Tools (analysis, query, reports, data mining)]
Three Data Warehouse Models
Enterprise warehouse
    collects all of the information about subjects spanning the entire organization
Data Mart
    a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
    • Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
    A set of views over operational databases
    Only some of the possible summary views may be materialized
Data Warehouse Usage
Three kinds of data warehouse applications
    Information processing
        • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
    Analytical processing
        • multidimensional analysis of data warehouse data
        • supports basic OLAP operations, slice-dice, drilling, pivoting
    Data mining
        • knowledge discovery from hidden patterns
        • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
Data Warehouse Back-End Tools and Utilities
Data extraction
    get data from multiple, heterogeneous, and external sources
Data cleaning
    detect errors in the data and rectify them when possible
Data transformation
    convert data from legacy or host format to warehouse format
Load
    sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
    propagate the updates from the data sources to the warehouse