8/20/2015 1 data warehousing and olap. 2 data warehousing & olap defined in many different ways,...

30
03/25/22 1 Data Warehousing and OLAP Data Warehousing and OLAP

Upload: milton-joseph

Post on 24-Dec-2015

218 views

Category:

Documents


1 download

TRANSCRIPT

04/19/23 1

Data Warehousing and Data Warehousing and OLAPOLAP

2

Data Warehousing & OLAPData Warehousing & OLAP

Defined in many different ways, but not rigorously.Defined in many different ways, but not rigorously. A decision support database that is maintained A decision support database that is maintained separatelyseparately

from the organization’s operational databasefrom the organization’s operational database Support Support information processinginformation processing by providing a solid by providing a solid

platform of consolidated, historical data for analysis.platform of consolidated, historical data for analysis. ““A data warehouse is aA data warehouse is a subject-orientedsubject-oriented,, integrated integrated,,

time-varianttime-variant, , and and nonvolatilenonvolatile collection of data in collection of data in support of management’s decision-making support of management’s decision-making process.”—W. H. Inmonprocess.”—W. H. Inmon

Data warehousing:Data warehousing: The process of constructing and using data warehousesThe process of constructing and using data warehouses

What is a Data Warehouse?

3

Data Warehouse—Subject-Data Warehouse—Subject-OrientedOriented

Organized around major Organized around major subjectssubjects, such as , such as

customer, product, sales.customer, product, sales.

Focusing on the modeling and analysis of data for Focusing on the modeling and analysis of data for

decisiondecision makersmakers, not on daily operations or , not on daily operations or

transaction processing.transaction processing.

Provide Provide a simple and concisea simple and concise view around view around

particular subject issues by particular subject issues by excluding data that excluding data that

are not useful in the decision support processare not useful in the decision support process..

4

Data Warehouse—Data Warehouse—IntegratedIntegrated

Constructed by integrating Constructed by integrating multiple, multiple, heterogeneousheterogeneous data sources data sources relational databases, flat files, on-line transaction relational databases, flat files, on-line transaction

recordsrecords Data cleaning and data integration techniques Data cleaning and data integration techniques

are applied.are applied. Ensure consistency in naming conventions, encoding Ensure consistency in naming conventions, encoding

structures, attribute measures, etc. among different structures, attribute measures, etc. among different data sourcesdata sources

E.g., Hotel price: currency, tax, breakfast covered, etc.E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. When data is moved to the warehouse, it is converted.

5

Data Warehouse—Time Data Warehouse—Time VariantVariant

The time horizon for the data warehouse is The time horizon for the data warehouse is

significantly significantly longerlonger than that of operational than that of operational

systems.systems. Operational database: current value data.Operational database: current value data. Data warehouse data: provide information from a Data warehouse data: provide information from a

historical perspective (e.g., past 5-10 years)historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouseEvery key structure in the data warehouse Contains an element of time, explicitly or implicitlyContains an element of time, explicitly or implicitly But the key of operational data may or may not But the key of operational data may or may not

contain “time element”.contain “time element”.

6

Data Warehouse—Non-VolatileData Warehouse—Non-Volatile

A A physically separate storephysically separate store of data transformed of data transformed

from the operational environment.from the operational environment.

Operational Operational update of data does not occurupdate of data does not occur in the in the

data warehouse environment.data warehouse environment. Does not require transaction processing, recovery, and Does not require transaction processing, recovery, and

concurrency control mechanismsconcurrency control mechanisms

Requires only two operations in data accessing: Requires only two operations in data accessing:

initial loading of datainitial loading of data and and access of dataaccess of data..

7

Three Data Warehouse Three Data Warehouse ModelsModels

Enterprise warehouseEnterprise warehouse collects all of the information about subjects collects all of the information about subjects

spanning the entire organizationspanning the entire organization Data MartData Mart

a subset of corporate-wide data that is of value to a a subset of corporate-wide data that is of value to a specific groupsspecific groups of users. Its scope is confined to of users. Its scope is confined to specific, selected groups, such as marketing data specific, selected groups, such as marketing data martmart

Virtual warehouseVirtual warehouse A set of views over operational databasesA set of views over operational databases Only some of the possible summary views may be Only some of the possible summary views may be

materializedmaterialized

8

Data Warehouse vs. Data Warehouse vs. Operational DBMSOperational DBMS

OLTP (on-line transaction processing)OLTP (on-line transaction processing) Major task of traditional relational DBMSMajor task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc.manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)OLAP (on-line analytical processing) Major task of data warehouse systemMajor task of data warehouse system Data analysis and decision makingData analysis and decision making

Distinct features (OLTP vs. OLAP):Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. marketUser and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidatedData contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subjectDatabase design: ER + application vs. star + subject View: current, local vs. evolutionary, integratedView: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queriesAccess patterns: update vs. read-only but complex queries

9

OLTP vs. OLAPOLTP vs. OLAP OLTP OLAP

users clerk, IT professional knowledge worker

function day to day operations decision support

DB design application-oriented subject-oriented

data current, up-to-date detailed, flat relational isolated

historical, summarized, multidimensional integrated, consolidated

usage repetitive ad-hoc

access read/write index/hash on prim. key

lots of scans

unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

DB size 100MB-GB 100GB-TB

metric transaction throughput query throughput, response

10

Why Separate Data Why Separate Data Warehouse?Warehouse?

High performance for both systemsHigh performance for both systems DBMS— tuned for OLTP: access methods, indexing, DBMS— tuned for OLTP: access methods, indexing,

concurrency control, recoveryconcurrency control, recovery Warehouse—tuned for OLAP: complex OLAP queries, Warehouse—tuned for OLAP: complex OLAP queries,

multidimensional view, consolidation.multidimensional view, consolidation.

Different functions and different data:Different functions and different data: missing datamissing data: Decision support requires historical data : Decision support requires historical data

which operational DBs do not typically maintainwhich operational DBs do not typically maintain data consolidationdata consolidation: DS requires consolidation : DS requires consolidation

(aggregation, summarization) of data from (aggregation, summarization) of data from heterogeneous sourcesheterogeneous sources

data qualitydata quality: different sources typically use inconsistent : different sources typically use inconsistent data representations, codes and formats which have to data representations, codes and formats which have to be reconciledbe reconciled

11

Conceptual Modeling of Conceptual Modeling of Data WarehousesData Warehouses

Modeling data warehouses: dimensions & measuresModeling data warehouses: dimensions & measures Star schemaStar schema: : A fact table in the middle connected to a A fact table in the middle connected to a

set of dimension tables set of dimension tables Snowflake schemaSnowflake schema: : A refinement of star schema where A refinement of star schema where

some dimensional hierarchy is some dimensional hierarchy is normalizednormalized into a set of into a set of

smaller dimension tablessmaller dimension tables, forming a shape similar to , forming a shape similar to

snowflakesnowflake

Fact constellationsFact constellations: : Multiple Multiple fact tablesfact tables share dimension share dimension

tablestables, viewed as a collection of stars, therefore called , viewed as a collection of stars, therefore called

galaxy schemagalaxy schema or fact constellation or fact constellation

12

Example of Star SchemaExample of Star Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Dimension Tables

13

From Tables and From Tables and Spreadsheets to Data Spreadsheets to Data

CubesCubes A data warehouse is based on a A data warehouse is based on a multidimensional multidimensional

datadata modelmodel which views data in the form of a which views data in the form of a

data cubedata cube

A data cube, such as A data cube, such as salessales, allows data to be , allows data to be

modeled and viewed in multiple dimensionsmodeled and viewed in multiple dimensions Dimension tables, such as Dimension tables, such as item (item_name, brand, type), item (item_name, brand, type),

oror time(day, week, month, quarter, year) time(day, week, month, quarter, year)

Fact table contains measures (such as Fact table contains measures (such as dollars_solddollars_sold) and ) and

keys to each of the related dimension tableskeys to each of the related dimension tables

14

Data CubesData Cubes

In data warehousing literature, an n-D In data warehousing literature, an n-D

base cube is called a base cube is called a base cuboidbase cuboid. The top . The top

most 0-D cuboid, which holds the highest-most 0-D cuboid, which holds the highest-

level of summarization, is called the level of summarization, is called the apex apex

cuboidcuboid. The lattice of cuboids forms a . The lattice of cuboids forms a

data cubedata cube..

15

Typical OLAP OperationsTypical OLAP Operations Slice and dice: Slice and dice:

project and selectproject and select Pivot (rotate): Pivot (rotate):

reorient the cube, visualization, 3D to series of reorient the cube, visualization, 3D to series of 2D planes.2D planes.

Roll up (drill-up):Roll up (drill-up): summarize data summarize data

by climbing up hierarchy or by dimension by climbing up hierarchy or by dimension reductionreduction

Drill down (roll down):Drill down (roll down): reverse of roll-up reverse of roll-up

from higher level summary to lower level from higher level summary to lower level summary or detailed data, or introducing new summary or detailed data, or introducing new dimensionsdimensions

16

PC

A Sample Data CubeA Sample Data Cube

Date

Produ

ct

Cou

ntr

y

TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

17

PC

Slice and DiceSlice and Dice

Date

Produ

ct

Cou

ntr

y

TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

Sales in USA

Sales of PC

Sales of 3Qtr

Sales of 3Qtr in USA Sales of PC of 3Qtr in USA

18

PC

Pivot (rotate):Pivot (rotate):

Date

Produ

ct

Cou

ntr

y

TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

19

Roll up (drill-up) Roll up (drill-up) ((dimension reduction dimension reduction

example)example)Date

Produ

ct

Cou

ntr

ysum

TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

Group by date, country

- reduced product dimension

Group by country

- reduced date dimension

Group by nothing

- reduced all dimensions

20

Roll up (drill-up)Roll up (drill-up)

Group by product, countryDate

Produ

ct

Cou

ntr

ysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Gro

up b

y da

te, c

ount

ry

Group by product

Group by dateALL

Group by country

21

Drill down (roll-down)Drill down (roll-down)

Date

Produ

ct

Cou

ntr

y

TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

Gro

up b

y da

te, c

ount

ry

ALL

Group by country

22

Cuboids Corresponding to Cuboids Corresponding to the Cubethe Cube

all

product date country

product,date product,country date, country

product, date, country

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D(base) cuboid

23

A Concept Hierarchy: Spatial A Concept Hierarchy: Spatial DimensionDimension

all

Europe North_America

MexicoCanadaSpainGermany

Vancouver

M. WindL. Chan

...

......

... ...

...

all

region

office

country

TorontoFrankfurtcity

24

Example of Star SchemaExample of Star Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Dimension Tables

25

Multidimensional DataMultidimensional Data

Sales volume as a function of product, Sales volume as a function of product, month, and regionmonth, and region

Prod

uct

Regio

n

MonthDimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

26

Measures: Three Measures: Three CategoriesCategories

distributivedistributive: if the result derived by applying the function : if the result derived by applying the function to to n n aggregate values is the same as that derived by aggregate values is the same as that derived by applying the function on all the data without partitioning.applying the function on all the data without partitioning.

E.g., count(), sum(), min(), max().E.g., count(), sum(), min(), max().

algebraicalgebraic:: if it can be computed by an algebraic function if it can be computed by an algebraic function with with MM arguments (where arguments (where M M is a bounded integer), each of is a bounded integer), each of which is obtained by applying a distributive aggregate which is obtained by applying a distributive aggregate function.function.

E.g.,E.g., avg(), min_N(), standard_deviation().avg(), min_N(), standard_deviation().

holisticholistic:: if there is no constant bound on the storage size if there is no constant bound on the storage size needed to describe a subaggregate.needed to describe a subaggregate.

E.g., median(), mode(), rank().E.g., median(), mode(), rank().

27

DistributiveDistributive Sum of sales group by product, region, Sum of sales group by product, region,

monthmonth Sum of sales group by category, region, Sum of sales group by category, region,

monthmonth Others: count, min, maxOthers: count, min, max

Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

28

AlgebraicAlgebraic

Average sales group by product, region, monthAverage sales group by product, region, month Average sales group by category, region, monthAverage sales group by category, region, month

We will keep We will keep the sum of sales group by product, region, monththe sum of sales group by product, region, month the count group by product, region, monththe count group by product, region, month

Then calculate the averageThen calculate the average Others: Others:

min_N(), standard_deviation().min_N(), standard_deviation().

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

29

HolisticHolistic

The rank of sales group by product, region, The rank of sales group by product, region, monthmonth

The rank of sales group by category, The rank of sales group by category, region, monthregion, month

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

30

SummarySummary Data warehouse Data warehouse

A A subject-orientedsubject-oriented,, integrated integrated, , time-varianttime-variant, and , and nonvolatilenonvolatile collection of data in support of management’s decision-making collection of data in support of management’s decision-making processprocess

A A multi-dimensional modelmulti-dimensional model of a data of a data warehousewarehouse Star schema, snowflake schema, fact constellationsStar schema, snowflake schema, fact constellations A data cube consists of dimensions & measuresA data cube consists of dimensions & measures

OLAPOLAP operations: drilling, rolling, slicing, dicing operations: drilling, rolling, slicing, dicing and pivotingand pivoting

Efficient computation of data cubesEfficient computation of data cubes Partial vs. full vs. no materializationPartial vs. full vs. no materialization Bitmap indexBitmap index