Data Warehousing and OLAP

Upload: brigid

Post on 13-Jan-2016


Page 1: Data Warehousing and OLAP

Data Warehousing and OLAP

Page 2: Data Warehousing and OLAP

Warehousing
► Growing industry: $8 billion in 1998
► Range from desktop to huge:
    Walmart: 900-CPU, 2,700 disk, 23TB Teradata system
► Lots of buzzwords, hype
    slice & dice, rollup, MOLAP, pivot, ...

Page 3: Data Warehousing and OLAP

Outline
► What is a data warehouse?
► Why a warehouse?
► Models & operations
► Implementing a warehouse
► Future directions

Page 4: Data Warehousing and OLAP

What is a Warehouse?
► Collection of diverse data
    subject oriented
    aimed at executive, decision maker
    often a copy of operational data
    with value-added data (e.g., summaries, history)
    integrated
    time-varying
    non-volatile
    more

Page 5: Data Warehousing and OLAP

What is a Warehouse?
► Collection of tools
    gathering data
    cleansing, integrating, ...
    querying, reporting, analysis
    data mining
    monitoring, administering warehouse

Page 6: Data Warehousing and OLAP

Warehouse Architecture

[Diagram: Clients at the top connect through a Query & Analysis layer to the Warehouse; an Integration layer feeds the Warehouse from several Sources at the bottom; Metadata sits alongside.]

Page 7: Data Warehousing and OLAP

Why a Warehouse?
► Two Approaches:
    Query-Driven (Lazy)
    Warehouse (Eager)

[Diagram labels: Source, Source, ?]

Page 8: Data Warehousing and OLAP

Query-Driven Approach

[Diagram: Clients query a Mediator, which reaches each Source through its own Wrapper.]

Page 9: Data Warehousing and OLAP

Advantages of Warehousing
► High query performance
► Queries not visible outside warehouse
► Local processing at sources unaffected
► Can operate when sources unavailable
► Can query data not stored in a DBMS
► Extra information at warehouse
    Modify, summarize (store aggregates)
    Add historical information

Page 10: Data Warehousing and OLAP

Advantages of Query-Driven
► No need to copy data
    less storage
    no need to purchase data
► More up-to-date data
► Query needs can be unknown
► Only query interface needed at sources
► May be less draining on sources

Page 11: Data Warehousing and OLAP

OLTP vs. OLAP
► OLTP: On Line Transaction Processing
    Describes processing at operational sites
► OLAP: On Line Analytical Processing
    Describes processing at warehouse

Page 12: Data Warehousing and OLAP

OLTP vs. OLAP

OLTP
► Mostly updates
► Many small transactions
► MB-TB of data
► Raw data
► Clerical users
► Up-to-date data
► Consistency, recoverability critical

OLAP
► Mostly reads
► Queries long, complex
► GB-TB of data
► Summarized, consolidated data
► Decision-makers, analysts as users

Page 13: Data Warehousing and OLAP

Data Marts
► Smaller warehouses
► Spans part of organization
    e.g., marketing (customers, products, sales)
► Do not require enterprise-wide consensus
    but long term integration problems?

Page 14: Data Warehousing and OLAP

Warehouse Models & Operators
► Data Models
    relations
    stars & snowflakes
    cubes
► Operators
    slice & dice
    roll-up, drill down
    pivoting
    other

Page 15: Data Warehousing and OLAP

Star

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Page 16: Data Warehousing and OLAP

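Star Schema

[Diagram: the sale table at the center, linked to customer, product, and store.]
  sale (orderId, date, custId, prodId, storeId, qty, amt)
  customer (custId, name, address, city)
  product (prodId, name, price)
  store (storeId, city)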

Page 17: Data Warehousing and OLAP

Terms
► Fact table
► Dimension tables
► Measures

[Diagram: the sale table at the center, linked to customer, product, and store.]
  sale (orderId, date, custId, prodId, storeId, qty, amt)
  customer (custId, name, address, city)
  product (prodId, name, price)
  store (storeId, city)
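To make the three terms concrete, here is a minimal Python sketch. It assumes an in-memory SQLite database; the use of `sqlite3`, the column types, and the query are illustrative choices, not part of the slides. It loads the star tables from the Star slide, treating `sale` as the fact table, `customer`, `product`, and `store` as dimension tables, and `qty` and `amt` as measures, then sums one measure by a dimension attribute.

```python
import sqlite3

# In-memory database holding the star schema from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- Fact table: keys into each dimension plus the measures qty and amt.
CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                   storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo');
INSERT INTO customer VALUES (81, 'fred', '12 main', 'sfo');
INSERT INTO customer VALUES (111, 'sally', '80 willow', 'la');
INSERT INTO product VALUES ('p1', 'bolt', 10);
INSERT INTO product VALUES ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc');
INSERT INTO store VALUES ('c2', 'sfo');
INSERT INTO store VALUES ('c3', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12);
INSERT INTO sale VALUES ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11);
INSERT INTO sale VALUES ('105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Sum the amt measure per store city by joining the fact table to a dimension.
amt_by_store_city = conn.execute("""
    SELECT store.city, SUM(sale.amt)
    FROM sale JOIN store ON sale.storeId = store.storeId
    GROUP BY store.city
    ORDER BY store.city
""").fetchall()
print(amt_by_store_city)  # [('la', 50), ('nyc', 23)]
```

The same pattern (join the fact table to whichever dimensions the question mentions, then aggregate a measure) covers most star-schema queries.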

Page 18: Data Warehousing and OLAP

Dimension Hierarchies

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

[Diagram: store links to sType and to city; city links to region.]

snowflake schema
constellations

Page 19: Data Warehousing and OLAP

Cube

Fact table view:
sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube:
      c1   c2   c3
p1    12   -    50
p2    11   8    -

dimensions = 2

Page 20: Data Warehousing and OLAP

3-D Cube

Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube:
day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -

day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

dimensions = 3
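The relationship between the two views can be sketched in a few lines of Python. This is only an illustration under the assumption that a cube is stored sparsely as a dictionary keyed by coordinates; the names `cube` and `cell` are hypothetical.

```python
# Fact table from the slide: (prodId, storeId, date, amt).
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Sparse 3-D cube: one entry per (prodId, storeId, date) coordinate.
cube = {(prod, store, day): amt for prod, store, day, amt in sale}

def cell(prod, store, day):
    """Return the amount stored in a cell, or None when the cell is empty."""
    return cube.get((prod, store, day))

print(cell("p1", "c1", 2))  # 44
print(cell("p2", "c3", 1))  # None: no such sale, so the cell is empty
```

Each fact-table row becomes one cell, and the dimension columns become the cell's coordinates.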

Page 21: Data Warehousing and OLAP

ROLAP vs. MOLAP
► ROLAP: Relational On-Line Analytical Processing
► MOLAP: Multi-Dimensional On-Line Analytical Processing

Page 22: Data Warehousing and OLAP

Aggregates

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

Result: 81

Page 23: Data Warehousing and OLAP

Aggregates

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans
date  sum
1     81
2     48

Page 24: Data Warehousing and OLAP

Another Example

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

[Arrows between this result and the by-day result: drill-down (toward more detail), rollup (toward less detail).]
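The three aggregate queries on these slides can be checked with a short Python sketch. It assumes an in-memory SQLite database; `sqlite3`, the column types, and the `ORDER BY` clauses (added only to make the row order predictable) are not from the slides. The last query also selects `prodId` so that each output row identifies its product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
     ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)],
)

# Add up amounts for day 1.
total_day1 = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day (the rolled-up view).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product (the drilled-down view).
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(total_day1)       # 81
print(by_day)           # [(1, 81), (2, 48)]
print(by_day_product)   # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
```

Dropping `prodId` from the `GROUP BY` list is the rollup; adding it back is the drill-down.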

Page 25: Data Warehousing and OLAP

Aggregates
► Operators: sum, count, max, min, median, ave
► “Having” clause
► Using dimension hierarchy
    average by region (within store)
    maximum by month (within date)

Page 26: Data Warehousing and OLAP

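Cube Aggregation

Example: computing sums

day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -

day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

Summed over days:
      c1   c2   c3
p1    56   4    50
p2    11   8    -

Summed over days and products:
      c1   c2   c3
sum   67   12   50

Summed over days and stores:
      sum
p1    110
p2    19

Grand total: 129

. . .

[Arrows: rollup (toward the totals), drill-down (toward the detailed cube).]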
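The sums in this example can be reproduced with a small roll-up helper. This is a sketch, assuming the cube is held as a list of fact tuples; the function name `rollup` and its `keep` parameter are hypothetical.

```python
from collections import defaultdict

# Fact table from the slides: (prodId, storeId, date, amt).
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def rollup(rows, keep):
    """Sum amt over every dimension whose position is not listed in keep."""
    totals = defaultdict(int)
    for *coords, amt in rows:
        totals[tuple(coords[i] for i in keep)] += amt
    return dict(totals)

by_prod_store = rollup(sale, keep=(0, 1))  # aggregate away the date
by_store = rollup(sale, keep=(1,))         # aggregate away date and product
by_prod = rollup(sale, keep=(0,))          # aggregate away date and store
grand_total = rollup(sale, keep=())        # aggregate away everything

print(by_store)     # {('c1',): 67, ('c3',): 50, ('c2',): 12}
print(by_prod)      # {('p1',): 110, ('p2',): 19}
print(grand_total)  # {(): 129}
```

Rolling up removes positions from `keep`; drilling down puts them back.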

Page 27: Data Warehousing and OLAP

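Cube Operators

day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -

day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

Summed over days:
      c1   c2   c3
p1    56   4    50
p2    11   8    -

Summed over days and products:
      c1   c2   c3
sum   67   12   50

Summed over days and stores:
      sum
p1    110
p2    19

Grand total: 129

. . .

Cells named in the figure:
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129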

Page 28: Data Warehousing and OLAP

Extended Cube

day 1
      c1   c2   c3   *
p1    12   -    50   62
p2    11   8    -    19
*     23   8    50   81

day 2
      c1   c2   c3   *
p1    44   4    -    48
p2    -    -    -    -
*     44   4    -    48

* (all days)
      c1   c2   c3   *
p1    56   4    50   110
p2    11   8    -    19
*     67   12   50   129

sale(*,p2,*)
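One way to picture the extended cube is as a dictionary in which any coordinate may be the wildcard "*". The sketch below is an assumption-laden illustration: the dictionary layout and the name `extended` are hypothetical, and the key order (store, product, date) follows the sale(c1,*,*) notation on the slides.

```python
from collections import defaultdict
from itertools import product as cartesian

# Fact table from the slides: (prodId, storeId, date, amt).
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

ALL = "*"
extended = defaultdict(int)
for prod, store, day, amt in sale:
    # A fact contributes to every cell whose coordinates are either the
    # fact's own value or the wildcard, i.e. 2 * 2 * 2 = 8 cells.
    for key in cartesian((store, ALL), (prod, ALL), (day, ALL)):
        extended[key] += amt

print(extended[(ALL, "p2", ALL)])  # 19
print(extended[("c1", ALL, ALL)])  # 67
print(extended[(ALL, ALL, ALL)])   # 129
```

Precomputing every wildcard cell trades storage for lookup speed, which is the trade-off the later materialization slides discuss.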

Page 29: Data Warehousing and OLAP

Aggregation Using Hierarchies

day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -

day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

Summed over days, grouped by region:
      region A   region B
p1    56         54
p2    11         8

[Diagram: hierarchy customer → region → country.]

(customer c1 in Region A; customers c2, c3 in Region B)
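Aggregating along a hierarchy amounts to replacing each coordinate by its parent before summing. A minimal sketch, assuming the hierarchy is given as a lookup table; the names `region_of` and `by_prod_region` are hypothetical, and the mapping itself comes from the note on the slide.

```python
from collections import defaultdict

# Fact table from the slides: (prodId, c-coordinate, date, amt).
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# One level of the hierarchy: c1 in Region A; c2 and c3 in Region B.
region_of = {"c1": "A", "c2": "B", "c3": "B"}

by_prod_region = defaultdict(int)
for prod, c, _day, amt in sale:
    # Replace the fine-grained coordinate by its parent, then sum.
    by_prod_region[(prod, region_of[c])] += amt

print(dict(by_prod_region))
# {('p1', 'A'): 56, ('p2', 'A'): 11, ('p1', 'B'): 54, ('p2', 'B'): 8}
```

A second lookup table from region to country would give the next level up in the same way.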

Page 30: Data Warehousing and OLAP

Pivoting

Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube:
day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -

day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

Pivoted to product by store:
      c1   c2   c3
p1    56   4    50
p2    11   8    -
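Pivoting can be sketched as choosing which two dimensions become the rows and columns of a grid and summing over the rest. The helper below is illustrative only; the names `pivot` and `DIMS` are hypothetical.

```python
from collections import defaultdict

# Position of each dimension inside a fact tuple (prodId, storeId, date, amt).
DIMS = {"prodId": 0, "storeId": 1, "date": 2}

sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def pivot(rows, row_dim, col_dim):
    """Lay facts out as a 2-D grid, summing amt over the remaining dimension."""
    grid = defaultdict(lambda: defaultdict(int))
    for *coords, amt in rows:
        grid[coords[DIMS[row_dim]]][coords[DIMS[col_dim]]] += amt
    return {row: dict(cols) for row, cols in grid.items()}

by_product_store = pivot(sale, "prodId", "storeId")
by_store_date = pivot(sale, "storeId", "date")

print(by_product_store["p1"])  # {'c1': 56, 'c3': 50, 'c2': 4}
print(by_store_date["c1"])     # {1: 23, 2: 44}
```

Calling `pivot` with a different pair of dimensions rotates the same facts into a different presentation without changing the underlying data.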

Page 31: Data Warehousing and OLAP

Implementing a Warehouse
► Monitoring: Sending data from sources
► Integrating: Loading, cleansing, ...
► Processing: Query processing, indexing, ...
► Managing: Metadata, Design, ...

Page 32: Data Warehousing and OLAP

Monitoring
► Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
► Incremental vs. Refresh

customer
id   name   address    city
53   joe    10 main    sfo
81   fred   12 main    sfo
111  sally  80 willow  la     (new)

Page 33: Data Warehousing and OLAP

Monitoring Techniques
► Periodic snapshots
► Database triggers
► Log shipping
► Data shipping (replication service)
► Transaction shipping
► Polling (queries to source)
► Screen scraping
► Application level monitoring

Advantages & Disadvantages!!
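As an illustration of the first technique, periodic snapshots can be compared to derive the incremental changes to send to the warehouse. This is a minimal sketch, assuming each snapshot fits in memory as a dictionary keyed by customer id; the function name `snapshot_diff` and the variable names are hypothetical, and the rows come from the customer table on the Monitoring slide.

```python
def snapshot_diff(old, new):
    """Compare two snapshots keyed by id; return (inserted, deleted, updated)."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserted, deleted, updated

# Earlier snapshot of the source table: id -> (name, address, city).
yesterday = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
}
# Later snapshot: sally is the new row.
today = {**yesterday, 111: ("sally", "80 willow", "la")}

inserted, deleted, updated = snapshot_diff(yesterday, today)
print(inserted)  # {111: ('sally', '80 willow', 'la')}
print(deleted)   # {}
print(updated)   # {}
```

Shipping only the three change sets is the incremental option; shipping `today` in full is the refresh option.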

Page 34: Data Warehousing and OLAP

Monitoring Issues
► Frequency
    periodic: daily, weekly, …
    triggered: on “big” change, lots of changes, ...
► Data transformation
    convert data to uniform format
    remove & add fields (e.g., add date to get history)
► Standards (e.g., ODBC)
► Gateways

Page 35: Data Warehousing and OLAP

Integration
► Data Cleaning
► Data Loading
► Derived Data

[Diagram: the warehouse architecture from Page 6 (Clients, Query & Analysis, Warehouse, Metadata, Integration, Sources).]

Page 36: Data Warehousing and OLAP

Data Cleaning
► Migration (e.g., yen → dollars)
► Scrubbing: use domain-specific knowledge (e.g., social security numbers)
► Fusion (e.g., mail list, customer merging)
► Auditing: discover rules & relationships (like data mining)

[Diagram: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe).]

Page 37: Data Warehousing and OLAP

Loading Data
► Incremental vs. refresh
► Off-line vs. on-line
► Frequency of loading
    At night, 1x a week/month, continuously
► Parallel/Partitioned load

Page 38: Data Warehousing and OLAP

Derived Data
► Derived Warehouse Data
    indexes
    aggregates
    materialized views (next slide)
► When to update derived data?
► Incremental vs. refresh

Page 39: Data Warehousing and OLAP

Materialized Views
► Define new warehouse relations using SQL expressions

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
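A materialized view such as joinTb can be approximated in Python with SQLite. SQLite has no `CREATE MATERIALIZED VIEW` statement, so this sketch stores the query result in an ordinary table and recomputes it on demand; the function name `refresh_join_tb` and the column types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE product (id TEXT, name TEXT, price INTEGER);
INSERT INTO sale VALUES ('p1', 'c1', 1, 12);
INSERT INTO sale VALUES ('p2', 'c1', 1, 11);
INSERT INTO sale VALUES ('p1', 'c3', 1, 50);
INSERT INTO sale VALUES ('p2', 'c2', 1, 8);
INSERT INTO sale VALUES ('p1', 'c1', 2, 44);
INSERT INTO sale VALUES ('p1', 'c2', 2, 4);
INSERT INTO product VALUES ('p1', 'bolt', 10);
INSERT INTO product VALUES ('p2', 'nut', 5);
""")

def refresh_join_tb(connection):
    """Recompute joinTb from scratch (a full refresh, not an incremental one)."""
    connection.executescript("""
    DROP TABLE IF EXISTS joinTb;
    CREATE TABLE joinTb AS
        SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
        FROM sale AS s JOIN product AS p ON s.prodId = p.id;
    """)

refresh_join_tb(conn)
rows = conn.execute(
    "SELECT * FROM joinTb ORDER BY date, storeId, prodId"
).fetchall()
print(len(rows))  # 6
print(rows[0])    # ('p1', 'bolt', 10, 'c1', 1, 12)
```

Queries against `joinTb` no longer pay for the join, but the table goes stale whenever `sale` or `product` changes, which is why the Derived Data slide asks when to update and whether to do so incrementally.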

Page 40: Data Warehousing and OLAP

Processing
► ROLAP servers vs. MOLAP servers
► Index Structures
► What to Materialize?
► Algorithms

[Diagram: the warehouse architecture from Page 6 (Clients, Query & Analysis, Warehouse, Metadata, Integration, Sources).]

Page 41: Data Warehousing and OLAP

ROLAP Server
► Relational OLAP Server

[Diagram labels: tools, utilities, ROLAP server, relational DBMS.]

sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

Special indices, tuning;
Schema is “denormalized”

Page 42: Data Warehousing and OLAP

MOLAP Server
► Multi-Dimensional OLAP Server

[Diagram labels: M.D. tools, utilities, multi-dimensional server (could also sit on relational DBMS).]

[Diagram: Sales cube with axes Product (milk, soda, eggs, soap), City (A, B), and Date (1, 2, 3, 4).]

Page 43: Data Warehousing and OLAP

Index Structures
► Traditional Access Methods
    B-trees, hash tables, R-trees, grids, …
► Popular in Warehouses
    inverted lists
    bit map indexes
    join indexes
    text indexes

Page 44: Data Warehousing and OLAP

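Inverted Lists

[Diagram: an age index, drawn as a tree with root keys 20, 23 and leaves (18, 19), (20, 21, 22), (23, 25, 26), points to inverted lists, which point to data records.]

inverted lists
age 20: r4, r18, r34, r35
age 21: r5, r19, r37, r40

data records
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...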

Page 45: Data Warehousing and OLAP

Using Inverted Lists
► Query:
    Get people with age = 20 and name = “fred”
► List for age = 20: r4, r18, r34, r35
► List for name = “fred”: r18, r52
► Answer is intersection: r18
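The query above can be sketched in Python by building one inverted list per attribute value and intersecting the two relevant lists. The sketch uses only the eight records visible on the Inverted Lists slide, so the list for “fred” contains r18 alone (r52 belongs to the records the slide elides); the names `age_index` and `name_index` are hypothetical.

```python
from collections import defaultdict

# Data records visible on the slide: rId -> (name, age).
records = {
    "r4": ("joe", 20), "r18": ("fred", 20), "r19": ("sally", 21),
    "r34": ("nancy", 20), "r35": ("tom", 20), "r36": ("pat", 25),
    "r5": ("dave", 21), "r41": ("jeff", 26),
}

# Build one inverted list (a list of record ids) per attribute value.
age_index = defaultdict(list)
name_index = defaultdict(list)
for rid, (name, age) in records.items():
    age_index[age].append(rid)
    name_index[name].append(rid)

# age = 20 AND name = "fred": intersect the two lists of record ids.
matches = sorted(set(age_index[20]) & set(name_index["fred"]))

print(age_index[20])  # ['r4', 'r18', 'r34', 'r35']
print(matches)        # ['r18']
```

Only the records in the intersection need to be fetched, so the data file is touched once per answer rather than once per candidate.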

Page 46: Data Warehousing and OLAP

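Bit Maps

[Diagram: an age index, drawn as a tree with root keys 20, 23 and leaves (18, 19), (20, 21, 22), (23, 25, 26), points to bit maps, whose positions correspond to data records.]

bit maps
age 20: 110110000
age 21: 0010001011

data records
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...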

Page 47: Data Warehousing and OLAP

Using Bit Maps
► Query:
    Get people with age = 20 and name = “fred”
► List for age = 20: 1101100000
► List for name = “fred”: 0100000001
► Answer is intersection: 0100000000

Good if domain cardinality small
Bit vectors can be compressed
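The bitmap version of the same query reduces to a bitwise AND. Below is a minimal sketch that stores each bitmap in a Python integer, with record position `pos` mapped to bit `pos`; it covers only the eight records shown on the Bit Maps slide, so its bit strings are eight characters long. The helper names `bitmap` and `to_string` are hypothetical.

```python
# Data records from the Bit Maps slide, in id order 1..8: (name, age).
people = [
    ("joe", 20), ("fred", 20), ("sally", 21), ("nancy", 20),
    ("tom", 20), ("pat", 25), ("dave", 21), ("jeff", 26),
]

def bitmap(predicate):
    """Return an int whose bit pos is set when people[pos] satisfies predicate."""
    bits = 0
    for pos, row in enumerate(people):
        if predicate(row):
            bits |= 1 << pos
    return bits

def to_string(bits):
    """Render a bitmap left to right in record order, as on the slides."""
    return "".join("1" if (bits >> pos) & 1 else "0" for pos in range(len(people)))

age_20 = bitmap(lambda row: row[1] == 20)
name_fred = bitmap(lambda row: row[0] == "fred")
both = age_20 & name_fred  # intersection is a single bitwise AND

matching_ids = [pos + 1 for pos in range(len(people)) if (both >> pos) & 1]

print(to_string(age_20))     # 11011000
print(to_string(name_fred))  # 01000000
print(to_string(both))       # 01000000
print(matching_ids)          # [2]
```

One bitmap is needed per distinct value, which is why the technique suits attributes with small domain cardinality.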

Page 48: Data Warehousing and OLAP

Join

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodId = PRODUCT.id

joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4

Page 49: Data Warehousing and OLAP

Join Indexes

product
id  name  price  jIndex
p1  bolt  10     r1,r3,r5,r6
p2  nut   5      r2,r4

sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4

[Arrows from each jIndex entry to the matching sale rows, labeled: join index.]
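The jIndex column can be derived from the sale table in one pass. The sketch below assumes both tables fit in memory as dictionaries; the names `join_index` and `sales_of` are hypothetical.

```python
from collections import defaultdict

# sale rows keyed by record id: rId -> (prodId, storeId, date, amt).
sale = {
    "r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50), "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4),
}
product = {"p1": ("bolt", 10), "p2": ("nut", 5)}

# Join index: for each product id, the ids of the sale rows that join with it.
join_index = defaultdict(list)
for rid, (prod_id, *_rest) in sale.items():
    join_index[prod_id].append(rid)

def sales_of(prod_id):
    """Follow the join index instead of scanning the whole sale table."""
    return [sale[rid] for rid in join_index[prod_id]]

print(dict(join_index))  # {'p1': ['r1', 'r3', 'r5', 'r6'], 'p2': ['r2', 'r4']}
print(sales_of("p2"))    # [('p2', 'c1', 1, 11), ('p2', 'c2', 1, 8)]
```

The index precomputes which rows join, so answering a join query for one product touches only that product's sale rows.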

Page 50: Data Warehousing and OLAP

What to Materialize?
► Store in warehouse results useful for common queries
► Example:

day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -

day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

Summed over days:
      c1   c2   c3
p1    56   4    50
p2    11   8    -

Summed over days and products:
      c1   c2   c3
sum   67   12   50

Summed over days and stores:
      sum
p1    110
p2    19

total sales: 129

. . .

[Arrow label: materialize.]

Page 51: Data Warehousing and OLAP

Materialization Factors
► Type/frequency of queries
► Query response time
► Storage cost
► Update cost

Page 52: Data Warehousing and OLAP

Cube Aggregates Lattice

[Lattice diagram, most detailed level first:]
city, product, date
city, product    city, date    product, date
city    product    date
all

day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -

day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

Summed over days:
      c1   c2   c3
p1    56   4    50
p2    11   8    -

Summed over days and products:
      c1   c2   c3
sum   67   12   50

Grand total: 129

use greedy algorithm to decide what to materialize
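The lattice itself is easy to enumerate: every subset of the dimensions is one group-by. The sketch below lists the eight nodes for the running example and computes each view with its size (number of cells), since view size is the kind of cost a selection algorithm would weigh. It does not implement the greedy algorithm the slide mentions; the names `view`, `lattice`, and `sizes` are hypothetical.

```python
from itertools import combinations

DIMS = ("city", "product", "date")

# Running example, reordered to (city, product, date, amt).
sale = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]

def view(group_by):
    """Aggregate amt over all dimensions that are not in group_by."""
    positions = [DIMS.index(dim) for dim in group_by]
    cells = {}
    for row in sale:
        key = tuple(row[i] for i in positions)
        cells[key] = cells.get(key, 0) + row[3]
    return cells

# One lattice node per subset of dimensions, most detailed first.
lattice = {
    group_by: view(group_by)
    for k in range(len(DIMS), -1, -1)
    for group_by in combinations(DIMS, k)
}
sizes = {group_by: len(cells) for group_by, cells in lattice.items()}

for group_by, size in sizes.items():
    print(", ".join(group_by) or "all", size)
```

Any node can be computed from any of its ancestors, so materializing a small, frequently used ancestor can serve several coarser queries.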

Page 53: Data Warehousing and OLAP

Dimension Hierarchies

[Diagram: hierarchy city → state → all.]

cities
city  state
c1    CA
c2    NY

Page 54: Data Warehousing and OLAP

Dimension Hierarchies

[Lattice diagram extended with the state level. Nodes:]
city, product, date
city, product    city, date    product, date
city    product    date
state, product, date
state, product    state, date
state
all

not all arcs shown...

Page 55: Data Warehousing and OLAP

Interesting Hierarchy

[Diagram: days roll up to weeks, and separately to months → quarters → years; both paths end at all.]

conceptual dimension table
time
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000
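What makes this hierarchy interesting is that weeks do not nest inside months, so week totals cannot be rolled up into month totals. The sketch below checks this with Python's `datetime` module. It assumes ISO week numbering, which the slides do not specify, and the helper name `months_in_iso_week` is hypothetical; `date.fromisocalendar` requires Python 3.8 or later.

```python
from datetime import date, timedelta

def months_in_iso_week(year, week):
    """Return the set of calendar months touched by the given ISO week."""
    monday = date.fromisocalendar(year, week, 1)
    return {(monday + timedelta(days=offset)).month for offset in range(7)}

# ISO week 5 of 2000 runs from Monday 31 January to Sunday 6 February.
print(months_in_iso_week(2000, 5))  # {1, 2}
# ISO week 2 of 2000 lies entirely within January.
print(months_in_iso_week(2000, 2))  # {1}
```

Because a week can straddle two months, days must be aggregated to weeks and to months along separate paths.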

Page 56: Data Warehousing and OLAP

Design
► What data is needed?
► Where does it come from?
► How to clean data?
► How to represent in warehouse (schema)?
► What to summarize?
► What to materialize?
► What to index?

Page 57: Data Warehousing and OLAP

Tools
► Development
    design & edit: schemas, views, scripts, rules, queries, reports
► Planning & Analysis
    what-if scenarios (schema changes, refresh rates), capacity planning
► Warehouse Management
    performance monitoring, usage patterns, exception reporting
► System & Network Management
    measure traffic (sources, warehouse, clients)
► Workflow Management
    “reliable scripts” for cleaning & analyzing data

Page 58: Data Warehousing and OLAP

Current State of Industry
► Extraction and integration done off-line
    Usually in large, time-consuming, batches
► Everything copied at warehouse
    Not selective about what is stored
    Query benefit vs storage & update cost
► Query optimization aimed at OLTP
    High throughput instead of fast response
    Process whole query before displaying anything