
CS 345: Topics in Data Warehousing

Thursday, November 4, 2004

Review of Tuesday’s Class

• Pre-computed aggregates
  – Materialized views
  – Aggregate navigation
  – Dimension and fact aggregates

• Selection of aggregates
  – Manual selection
  – Greedy algorithm
  – Limitations of greedy approach

Outline of Today’s Class

• Index Selection

• Selecting Views and Indexes Together

• Storage Systems
  – Mirroring, Striping, and Parity
  – RAID Levels

Index Selection Problem

• Similar problem to selecting aggregate tables
  – Select column sets to include / exclude
• Additional degrees of freedom
  – What type of index (B-tree, hash, bitmap, join index)
  – Ordering of columns in index search key
  – Clustered vs. non-clustered
• Additional restrictions
  – Columns chosen from a single table
    • Except for special case of join index
• Interaction between indexes can be important
  – Less of an issue with aggregate tables
  – Examples:
    • index intersection
    • index-based merge join without sorting

Heuristics for Manual Selection

• Always include single-column indexes on:
  – dimension primary keys
  – fact foreign keys
• Mixture of wide and thin indexes
  – Build multi-column indexes on fact & dimension tables
    • Covering indexes allow index-only plans
    • Coverage vs. speed-up trade-off
      – More columns → useful for a greater variety of queries
      – Fewer columns → smaller index → greater speed-up
  – Build single-column indexes on important dimension columns
    • Particularly on attributes with high filtering power
      – Product Name, Brand, etc.
    • Bitmap indexes for low- and medium-cardinality columns
    • B-tree indexes for high-cardinality columns
• Fact tables often clustered on Date
  – Most queries reference Date dimension
  – Little or no reorganization necessary as data appended

Automatic Index Selection

• AutoAdmin project
  – Research project at Microsoft
  – Developed tools for index & materialized view selection
  – Similar tools now available from all major vendors
• Papers we’ll cover
  – “An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server”
    • by Chaudhuri and Narasayya, 1997
  – “Automated Selection of Materialized Views and Indexes for SQL Databases”
    • by Agrawal, Chaudhuri, and Narasayya, 2000

Guiding Principles

• Workload-driven approach
  – Which indexes are good depends on which queries are asked
• Incorporate the query optimizer
  – Indexes are only useful if the optimizer chooses to use them
  – Optimizer’s cost estimation model is well-developed, accurate
• Limit search space heuristically
  – Indexes that are good in combination are also good by themselves
  – Leading term of good multi-column index is a good single-column index
  – Indexes that are good for an entire workload are the best possible choice for some query in the workload
  – Heuristics speed up the selection process considerably, at the cost of missing some good index combinations

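Index Selection Architecture

[Figure: architecture diagram. The Index Recommender takes the Workload as input and outputs the Final Indexes. Its components are Identify Candidate Indexes, Enumerate Configurations, and Generate Multi-Column Indexes; it relies on Simulated Index Creation and Cost Estimation provided by the Database Management System (Query Optimizer).]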

“What-If” Index Analysis

• Query optimizer estimates cost of query plan based on statistics
  – Sizes of relations and indexes
  – Number of distinct values
  – Frequency of occurrence of each value
• Generating statistics for an index is cheaper than actually building the index
  – Statistics can be estimated from a sample of the data
• Simulated / “what-if” index analysis
  – Ask the optimizer to optimize a query
  – Record cost estimate for best query plan
  – Update statistics to trick optimizer into thinking that an extra index exists
  – Ask the optimizer to optimize the query again
  – Record new cost estimate for best query plan
  – Compare before/after estimates to quantify impact of index
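The before/after comparison can be sketched with a toy stand-in for the optimizer. Everything here is an assumption made for illustration: the `estimate_cost` formula, the statistics layout, and the table and index figures are invented, and no real DBMS interface is being modeled.

```python
# Toy stand-in for a query optimizer: all names and cost formulas here are
# illustrative assumptions, not a real DBMS interface.

def estimate_cost(query, stats):
    """Return the cheapest estimated plan cost for a single-table filter query."""
    table = stats["tables"][query["table"]]
    # Plan 1: full table scan reads every page.
    best = table["pages"]
    # Plan 2: use any (real or simulated) index on the filter column.
    for index in stats["indexes"]:
        if index["table"] == query["table"] and index["column"] == query["column"]:
            matching_rows = table["rows"] / index["distinct_values"]
            # Assumed cost: index traversal plus one page fetch per matching row.
            best = min(best, index["height"] + matching_rows)
    return best


def what_if(query, stats, hypothetical_index):
    """Compare cost estimates before and after pretending an index exists."""
    before = estimate_cost(query, stats)
    # Only the statistics change; no index is actually built.
    simulated = dict(stats, indexes=stats["indexes"] + [hypothetical_index])
    after = estimate_cost(query, simulated)
    return before, after


stats = {
    "tables": {"Sales": {"rows": 1_000_000, "pages": 10_000}},
    "indexes": [],
}
query = {"table": "Sales", "column": "Product_key"}
candidate = {"table": "Sales", "column": "Product_key",
             "distinct_values": 5_000, "height": 3}

before, after = what_if(query, stats, candidate)
print(before, after)
```

The original statistics are left untouched, which mirrors the point of the technique: the index is evaluated without being created.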

Estimating Workload Cost

• Configuration = set of indexes
• Atomic configuration = set of indexes that are all used together to answer some query
• Many possible configurations, fewer atomic configurations
  – Most query plans use only a small number of indexes
  – Example:
    • 50 possible indexes, choose best 10
    • No query uses more than 3 indexes
    • # of configurations = C(50,10) ≈ 10 billion
    • # of atomic configurations = 20876 (subsets of at most 3 indexes, counting the empty set)
• Only need to consider atomic configurations when estimating costs
  – Cost(Q,I) = cost of query Q with index set I
  – Let A ⊆ I be an atomic configuration contained in I
  – Cost(Q,I) = min[ Cost(Q,A) ]
  – Minimum taken over all atomic configurations A contained in I
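The counts in the example and the min-over-atomic-configurations rule can be checked with a short sketch. The index names and cost values in `atomic_costs` are hypothetical, and treating a missing entry as "not an atomic configuration" is an assumption of this sketch.

```python
from itertools import combinations
from math import comb

# Counts from the example: 50 candidate indexes, build 10, at most 3 per query.
num_configurations = comb(50, 10)
# Subsets of size 0..3; the empty set (no index used) is counted as well.
num_atomic = sum(comb(50, size) for size in range(4))


def workload_query_cost(atomic_costs, configuration, max_size=3):
    """Cost(Q, I): minimum of Cost(Q, A) over atomic configurations A within I.

    atomic_costs maps frozensets of index names to optimizer estimates for
    one query; entries that are missing are treated as "not atomic".
    """
    best = atomic_costs[frozenset()]  # cost with no indexes at all
    for size in range(1, max_size + 1):
        for subset in combinations(sorted(configuration), size):
            cost = atomic_costs.get(frozenset(subset))
            if cost is not None:
                best = min(best, cost)
    return best


# Hypothetical estimates for one query.
atomic_costs = {
    frozenset(): 100.0,
    frozenset({"idx_a"}): 40.0,
    frozenset({"idx_b"}): 60.0,
    frozenset({"idx_a", "idx_b"}): 25.0,
}
print(num_configurations, num_atomic)
print(workload_query_cost(atomic_costs, {"idx_a", "idx_b", "idx_c"}))
```

Because only atomic configurations need optimizer calls, the cost of any larger configuration is obtained by lookup rather than by re-optimizing.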

Identifying Atomic Configurations

• Query syntax can be used
  – Leading term of index = column mentioned in WHERE, GROUP BY, or ORDER BY clause
  – Trailing term of index = column mentioned anywhere in query
• Heuristics for reducing number of atomic configurations
  – Number of atomic configs. can be large for complex queries
  – Too many atomic configurations → index selection is very slow
  – Trade off index selection time vs. quality of recommendations
  – Single-join heuristic: only consider atomic configurations which involve ≤ 2 tables and ≤ 2 indexes per table
  – Adaptively identify index interactions
    • Compare “cost of query Q with indexes I” vs. “cost of query Q with best proper subset of I”
    • If the two costs are equal or close, then I is not an atomic configuration

Identify Candidate Indexes

• For each query in the workload, determine the best atomic configuration
  – Enumerate relevant atomic configurations for each query based on query syntax
  – Simulate each configuration by modifying statistics
  – Calculate estimated execution cost using query optimizer
• Candidate index set = union of best atomic configuration for each query in workload
• Some indexes from optimal index set may be omitted
  – Suppose index I is second-best index for 10 queries but best for no query
  – Index I is likely to be part of the optimal configuration
  – However, index I will not be in the candidate set
  – This choice of candidate set is a time-saving heuristic
  – Considering all reasonable indexes would be too expensive
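A small sketch of candidate selection shows how a consistently second-best index gets dropped. The query names, index names, and cost figures are invented, and each atomic configuration is simplified to a single index.

```python
# Hypothetical optimizer estimates: cost of each query under each
# single-index atomic configuration (names are made up for illustration).
estimated_cost = {
    "Q1": {"idx_date": 10, "idx_product": 12, "idx_store": 50},
    "Q2": {"idx_customer": 8, "idx_product": 9, "idx_date": 40},
    "Q3": {"idx_store": 5, "idx_product": 6, "idx_customer": 30},
}


def candidate_indexes(estimated_cost):
    """Union of the best atomic configuration for each query."""
    candidates = set()
    for per_config in estimated_cost.values():
        best_config = min(per_config, key=per_config.get)
        candidates.add(best_config)
    return candidates


candidates = candidate_indexes(estimated_cost)
print(sorted(candidates))
```

Here `idx_product` is a close second for every query, so it might well belong in a good final configuration, yet it never enters the candidate set.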

Enumerate Configurations

• Among all candidate indexes, which k indexes should we build?
• One approach: Greedy algorithm
  – Similar to the one discussed last class
  – Add indexes one at a time
  – Always choose the index that will decrease workload cost by the greatest amount
• Greedy approach fails to capture index interactions
  – An index may be useless by itself but useful in conjunction with a second index
  – Such combinations will be missed by greedy selection
• Greedy(m,k) algorithm
  – Exhaustively consider all configurations of ≤ m indexes
  – Select the best such configuration
  – Greedily add (k-m) additional indexes
• Choice of m trades off search time vs. result quality
  – Greedy(0,k) = pure greedy approach (fast)
  – Greedy(k,k) = exhaustive search (accurate)
  – Other values of m are in between [m=2 seems good in practice]
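A minimal sketch of Greedy(m,k), assuming the caller supplies a `workload_cost` function that plays the role of the optimizer-based estimate. The toy cost function and index names are invented, and stopping early when no remaining index lowers the cost is an addition of this sketch rather than something stated on the slide.

```python
from itertools import combinations


def greedy_m_k(candidates, workload_cost, m, k):
    """Greedy(m, k): exhaustive seed of at most m indexes, then greedy growth.

    workload_cost is a caller-supplied function mapping a frozenset of index
    names to an estimated total workload cost (lower is better).
    """
    candidates = sorted(candidates)
    # Phase 1: exhaustively evaluate every configuration with <= m indexes.
    best = frozenset()
    for size in range(1, min(m, k) + 1):
        for subset in combinations(candidates, size):
            config = frozenset(subset)
            if workload_cost(config) < workload_cost(best):
                best = config
    # Phase 2: add one index at a time, always taking the largest cost drop.
    while len(best) < k:
        remaining = [c for c in candidates if c not in best]
        if not remaining:
            break
        choice = min(remaining, key=lambda c: workload_cost(best | {c}))
        if workload_cost(best | {choice}) >= workload_cost(best):
            break  # assumption of this sketch: stop when nothing helps
        best = best | {choice}
    return best


def toy_cost(config):
    """Made-up costs: idx_a and idx_b only pay off when used together."""
    cost = 100
    if "idx_c" in config:
        cost -= 10
    if {"idx_a", "idx_b"} <= config:
        cost -= 60
    return cost


indexes = ["idx_a", "idx_b", "idx_c"]
print(sorted(greedy_m_k(indexes, toy_cost, m=0, k=2)))
print(sorted(greedy_m_k(indexes, toy_cost, m=2, k=2)))
```

With m=0 the search never discovers the `idx_a` + `idx_b` pair, because neither index helps on its own; with m=2 the exhaustive seed finds it.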

Generate Multi-Column Indexes

• Another heuristic to reduce optimization time
• Initially consider only narrow indexes, and iteratively widen them
• First iteration:
  – When building atomic configurations, consider only single-column indexes
• Second iteration:
  – Include the best indexes chosen in Iteration 1
  – Also consider two-column “expansions” of the single-column indexes chosen in Iteration 1
• Third iteration:
  – Include the best indexes from Iteration 2
  – Also consider three-column “expansions” of the two-column indexes chosen in Iteration 2
• Generalizes to as many iterations as desired
  – Cache results of optimizer evaluations
  – Only cost for new atomic configs. need be computed in each iteration
• Experimental results indicate that little loss in quality occurs
  – As compared to the non-iterative solution
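One way to picture the widening step is a function that takes the indexes chosen in one iteration and produces the candidates for the next. This is only a sketch: the column names are hypothetical, all columns are assumed to belong to one table, and an index is represented as a tuple of columns in search-key order.

```python
def expand_indexes(chosen, columns):
    """Candidates for the next iteration: the chosen indexes plus every
    one-column "expansion" that appends a new trailing column."""
    candidates = set(chosen)
    for index in chosen:
        for column in columns:
            if column not in index:
                candidates.add(index + (column,))
    return candidates


# Hypothetical fact-table columns and iteration-1 winners.
columns = ["Date_key", "Product_key", "Store_key"]
iteration_1_best = [("Date_key",), ("Product_key",)]

iteration_2_candidates = expand_indexes(iteration_1_best, columns)
print(sorted(iteration_2_candidates))
```

Note that `("Store_key",)` was not chosen in the first iteration, so no index led by `Store_key` is ever considered; this is the search-space restriction the heuristic relies on.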

Selecting Indexes and Views

• Indexes and aggregate tables each serve to speed up queries

• There are interactions between them
  – Indexes can be built on aggregate tables
  – Constructing an aggregate table can decrease the usefulness of a related index (or vice versa)

• Selecting them together can deliver better results than selecting them independently

• How to combine the two?

Candidate Identification for Views

• Materialized views considered by AutoAdmin
  – Join of several tables
  – With or without aggregation
  – Optionally including filters
  – (More general than the aggregate tables we’ve discussed)
• Restricting the space of views considered
  – First identify “interesting table-subsets”
  – Idea: Materialized views over large tables are most useful
  – A table-subset is a set of tables
  – Table-subsets that are referenced in < C% of queries (weighted by cost) are not interesting
  – TS-Cost(T) = Sum [Cost(Q) * (size of tables in T) / (size of tables in Q)]
    • Sum over all queries Q that reference every table in table-subset T
  – Table-subsets with TS-Cost < C% of total cost are not interesting
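The TS-Cost formula can be evaluated on a tiny workload as follows. The table sizes, query costs, and the 10% threshold are all made-up values chosen so the arithmetic is easy to follow.

```python
def ts_cost(table_subset, queries, table_sizes):
    """TS-Cost(T): sum over queries referencing every table in T of
    Cost(Q) * (size of tables in T) / (size of tables in Q)."""
    subset_size = sum(table_sizes[t] for t in table_subset)
    total = 0.0
    for query in queries:
        if table_subset <= query["tables"]:
            query_size = sum(table_sizes[t] for t in query["tables"])
            total += query["cost"] * subset_size / query_size
    return total


# Hypothetical sizes (in rows) and per-query optimizer cost estimates.
table_sizes = {"Sales": 900, "Customer": 90, "Store": 10}
queries = [
    {"tables": {"Sales", "Customer"}, "cost": 99.0},
    {"tables": {"Sales", "Store"}, "cost": 91.0},
    {"tables": {"Customer"}, "cost": 10.0},
]
total_cost = sum(q["cost"] for q in queries)
threshold_pct = 10  # the "C%" cutoff; value chosen arbitrarily here

for subset in [{"Sales"}, {"Sales", "Customer"}, {"Customer"}]:
    score = ts_cost(subset, queries, table_sizes)
    interesting = score >= total_cost * threshold_pct / 100
    print(sorted(subset), round(score, 2), interesting)
```

In this toy workload the small `Customer` table falls just under the cutoff even though two queries reference it, which reflects the idea that views over large tables are the most useful.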

Candidate Identification

• For each query in the workload, determine best atomic configuration

• Atomic configuration made up of:
  – Indexes
  – Materialized views over interesting table-subsets
  – Indexes on materialized views over interesting table-subsets

• Candidate set = union of best atomic configurations across all queries

View Merging

• View merging is like multi-column index generation
• Combine two views to create a more generic view
  – Move up the data cube lattice
• Merge(V1,V2)
  – Group by union of V1, V2 grouping columns
  – Filter by intersection of V1, V2 filters
  – Filters that are in one of V1,V2 but not the other become grouping columns
• Example:
  – SELECT Income, SUM(Quantity)
    FROM Sales, Customer
    WHERE Sales.Customer_key = Customer.Customer_key
    AND Customer.State = 'CA'
    GROUP BY Income
  – SELECT Age, SUM(Quantity)
    FROM Sales, Customer
    WHERE Sales.Customer_key = Customer.Customer_key
    GROUP BY Age
  – Merged view:
    SELECT Income, Age, State, SUM(Quantity)
    FROM Sales, Customer
    WHERE Sales.Customer_key = Customer.Customer_key
    GROUP BY Income, Age, State
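The three Merge(V1,V2) rules can be sketched over a simple view description. Representing a view as a set of grouping columns plus a dict of equality filters is an assumption of this sketch, not the representation used by AutoAdmin; the join condition, which is the same in both views, is left out.

```python
def merge_views(v1, v2):
    """Merge(V1, V2) following the three rules on the slide.

    Each view is a dict with a set of grouping columns and a dict of
    equality filters (column -> constant); this representation is an
    assumption made for the sketch.
    """
    shared_filters = {
        col: val for col, val in v1["filters"].items()
        if v2["filters"].get(col) == val
    }
    # Filters present in only one view turn into grouping columns.
    dropped = (set(v1["filters"]) | set(v2["filters"])) - set(shared_filters)
    return {
        "group_by": v1["group_by"] | v2["group_by"] | dropped,
        "filters": shared_filters,
    }


v1 = {"group_by": {"Income"}, "filters": {"State": "CA"}}
v2 = {"group_by": {"Age"}, "filters": {}}
merged = merge_views(v1, v2)
print(sorted(merged["group_by"]), merged["filters"])
```

Applied to the two example views, the `State = 'CA'` filter appears in only one of them, so `State` becomes a grouping column and the merged view can answer both original queries.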

Storage

• Data analysis queries touch lots of data

• Data warehouses are often very large

• Reading the data from disk is usually the bottleneck

• What can be done to improve performance?

• Add more disks and benefit from parallelism

RAID

• Redundant Arrays of Inexpensive Disks
  – Using lots of disk drives improves:
    • Performance
    • Reliability
  – Alternative: Specialized, high-performance hardware
  – RAID delivers better price/performance than high-end disks
• Performance
  – Read data from n disks at once → reads are up to n times faster
• Reliability
  – Store multiple copies of data
  – If one disk fails, no data is lost and the system continues to run
• Three main concepts
  – Mirroring
  – Striping
  – Parity

Mirroring

• Use two disks that are identical copies of each other
  – Primary goal: fault-tolerance
    • If one disk fails, use the other one
  – Writes must be done to both disks at once
  – Improved random read performance
    • Can do two random reads at one time
  – Sequential read performance mostly unaffected

[Figure: two disks, each holding an identical copy of the text “This Is What Mirroring Looks Like”.]

Striping

• Spread data across n disks
• First disk gets blocks 1, n+1, 2n+1, etc.
• Second disk gets blocks 2, n+2, 2n+2, etc.
• Improved random read performance
  – Can do as many as n reads at the same time
  – But each read must go to a specific disk
  – Thus multiple reads can conflict if unlucky
• Sequential reads are very fast
  – Especially for long reads (many blocks from each disk)
  – Read in parallel from all disks
• Each write goes to a single disk

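[Figure: the sentence “This Is How You Would Stripe A Sentence Across Three Disk Drives!” striped word by word across three disks.]
  Disk 1: This, You, A, Three
  Disk 2: Is, Would, Sentence, Disk
  Disk 3: How, Stripe, Across, Drives!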
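The round-robin placement rule for striping can be written as a one-line mapping from block number to disk. The sketch below uses 1-based block and disk numbers to match the slide and reuses the sentence from the striping illustration, with each word standing in for one block; the function names are made up.

```python
def disk_for_block(block, n):
    """Return the 1-based disk holding a 1-based block number under
    round-robin striping across n disks."""
    return (block - 1) % n + 1


def stripe(items, n):
    """Distribute items round-robin across n disks."""
    disks = [[] for _ in range(n)]
    for position, item in enumerate(items, start=1):
        disks[disk_for_block(position, n) - 1].append(item)
    return disks


words = "This Is How You Would Stripe A Sentence Across Three Disk Drives!".split()
for number, contents in enumerate(stripe(words, 3), start=1):
    print(f"Disk {number}: {' '.join(contents)}")
```

Since the disk is a fixed function of the block number, two random reads conflict exactly when their block numbers are congruent modulo n.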

Parity

• Mirroring delivers fault-tolerance through redundancy
• Storage utilization is rather poor
  – Only 50% of disk capacity is useful
  – The other 50% is overhead for fault tolerance
• Parity checks deliver fault-tolerance with less redundancy
  – Use n+1 disks
  – Store data on n of the disks
  – Last disk contains parity data
    • XOR of other n disks
    • Compare ith bit on each disk
    • Even number of 1s → ith parity bit is 0
    • Odd number of 1s → ith parity bit is 1
  – Any one disk fails → no data is lost

Parity Example

• Three servers + 1 parity server
  – Server 1 stores “110011”
  – Server 2 stores “011011”
  – Server 3 stores “110101”
  – Server P stores “011101”
    • Number of 1s = 2,3,1,1,2,3
    • Even, Odd, Odd, Odd, Even, Odd
• Suppose Server 2 fails
  – “110011”, “??????”, “110101”, “011101”
  – Take XOR of remaining servers to reconstruct
    • Number of 1s = 2,3,1,2,1,3
    • Even, Odd, Odd, Even, Odd, Odd
    • 011011
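The parity example can be reproduced with bitwise XOR. This sketch treats each server's contents as a short bit string, as in the slide; the helper name `xor_bits` is made up.

```python
from functools import reduce


def xor_bits(blocks):
    """Bitwise XOR of equal-length bit strings such as "110011"."""
    width = len(blocks[0])
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(value, f"0{width}b")


data = ["110011", "011011", "110101"]
parity = xor_bits(data)

# Server 2 fails: XOR the survivors (including parity) to rebuild it.
rebuilt = xor_bits([data[0], data[2], parity])
print(parity, rebuilt)
```

The same function serves for both computing parity and reconstructing a lost block, because XOR-ing a value in twice cancels it out.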

RAID Levels

• RAID 0
  – Striping (without parity)
  – Pros:
    • Good performance
    • No redundancy (no wasted capacity)
  – Cons:
    • Poor fault-tolerance (worse than no RAID!)
• RAID 1
  – Mirroring
  – Pros:
    • Good fault-tolerance
    • Very fast recovery
  – Cons:
    • Wastes storage capacity
    • Performance not as good as other RAID levels

RAID Levels

• RAID 2: Not used.
• RAID 3 and 4:
  – Striping with dedicated parity disk
  – Stripe size = byte for RAID 3, block for RAID 4
  – Pros:
    • Good performance
    • Good fault-tolerance with little redundancy
    • Reasonably fast recovery
  – Cons:
    • Parity disk is a bottleneck for writes

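[Figure: the sentence “This Is How You Would Store Data Using RAID Level Four.” striped across three data disks, with a dedicated fourth parity disk.]
  Disk 1: This, You, Data, Level
  Disk 2: Is, Would, Using, Four.
  Disk 3: How, Store, RAID
  Disk 4: (Parity 1), (Parity 2), (Parity 3), (Parity 4)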

RAID Levels

• RAID 5
  – Striping with distributed parity
  – Servers “take turns” being the parity server
  – Pros and Cons similar to RAID 3 and 4
    • Avoids write bottleneck associated with RAID 3 and 4
    • Performance degrades following disk failure

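[Figure: the sentence “This Is How You Would Store Data Using RAID Level Five.” striped across four disks, with the parity block rotating to a different disk for each stripe.]
  Disk 1: This, You, Data, (Parity 4)
  Disk 2: Is, Would, (Parity 3), Level
  Disk 3: How, (Parity 2), Using, Five.
  Disk 4: (Parity 1), Store, RAID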
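The "take turns" placement can be sketched as a layout function. The rotation rule used here (parity moves one disk to the left with each stripe, starting from the last disk) is chosen to reproduce the slide's illustration; it is an assumption of this sketch, and real controllers use several different rotation schemes.

```python
def raid5_layout(blocks, n_disks):
    """Lay out data blocks with one rotating parity block per stripe.

    Parity for the stripe numbered i (0-based) is placed on disk
    (n_disks - 1 - i) mod n_disks, which reproduces the rotation in the
    slide's illustration. Each returned row lists one stripe, disk by disk;
    None marks an unused slot in a partially filled last stripe.
    """
    data_per_stripe = n_disks - 1
    stripes = []
    for start in range(0, len(blocks), data_per_stripe):
        chunk = list(blocks[start:start + data_per_stripe])
        stripe_no = start // data_per_stripe
        parity_disk = (n_disks - 1 - stripe_no) % n_disks
        row = chunk + [None] * (data_per_stripe - len(chunk))
        row.insert(parity_disk, f"(Parity {stripe_no + 1})")
        stripes.append(row)
    return stripes


words = "This Is How You Would Store Data Using RAID Level Five.".split()
for row in raid5_layout(words, 4):
    print(row)
```

Because the parity block lands on a different disk for each stripe, parity writes are spread over all disks instead of queuing on one dedicated parity disk.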

Multi-Level RAID

• The RAID ideas can be hierarchically combined
• Most common combinations are:
  – RAID 1+0 – stripes of mirrors
  – RAID 0+1 – mirror of stripes

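[Figure: two diagrams labeled RAID 1+0 and RAID 0+1. In each, a sentence (“This Is How RAID 1+0 Works” or “This Is How RAID 0+1 Works”) is striped across three disks as This/RAID, Is/1+0 (or 0+1), How/Works, and every block is stored on two disks.]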

RAID 1+0 vs. RAID 0+1

• Difference is what happens when a disk fails
  – RAID 1+0
    • One mirrored pair becomes unmirrored
    • Failure of the other disk in that pair leads to data loss
  – RAID 0+1
    • One copy of the stripe set (one side of the mirror) becomes invalid
    • Failure of any disk in the other stripe set leads to data loss
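The difference in exposure can be quantified by enumerating every pair of failed disks. The sketch assumes six disks (three-disk stripes, as in the slide's illustration) and a disk numbering invented for the example; it also assumes a RAID 0+1 stripe set is unusable as soon as any one of its disks fails.

```python
from itertools import combinations


def survives_raid10(failed, n_pairs):
    """RAID 1+0: disks 2i and 2i+1 mirror each other; data is striped over
    the pairs. Data survives unless both disks of some pair have failed."""
    return all(not {2 * i, 2 * i + 1} <= failed for i in range(n_pairs))


def survives_raid01(failed, stripe_width):
    """RAID 0+1: disks 0..w-1 form one stripe set and w..2w-1 its mirror.
    Data survives if at least one stripe set has no failed disk."""
    first = set(range(stripe_width))
    second = set(range(stripe_width, 2 * stripe_width))
    return not (failed & first) or not (failed & second)


def fatal_double_failures(survives, n_disks, *args):
    """Count the two-disk failure combinations that lose data."""
    return sum(
        not survives(set(pair), *args)
        for pair in combinations(range(n_disks), 2)
    )


# Six disks, matching the three-disk stripes in the slide's illustration.
print(fatal_double_failures(survives_raid10, 6, 3))
print(fatal_double_failures(survives_raid01, 6, 3))
```

Out of the 15 possible two-disk failures, far fewer are fatal under RAID 1+0 than under RAID 0+1, which is the practical consequence of the distinction drawn on the slide.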

