olap and data warehousing slides courtesy of: julia stoyanovitch columbia university surajit...
Post on 21-Dec-2015
216 Views
Preview:
TRANSCRIPT
OLAP and Data WarehousingSlides courtesy of:Julia StoyanovitchColumbia University
Surajit ChaudhuriMicrosoft Research, Redmond, WA, USA
surajitc@microsoft.com
Umeshwar DayalHewlett-Packard Labs., Palo Alto, CA, USA
dayal@hpl.hp.com
23© Surajit Chaudhuri, Umeshwar Dayal
What is OLAP? On-Line Analytical Processing Information technology to help the knowledge
worker (executive, manager, analyst) make faster and better decisions.
OLAP is an element of decision support systems (DSS).
3
Running Example: Car Sales Cars: carId, make, model, color
Dealers: dealerId, city, state
Time of Sale: tid, year, month, day
Sales: carId, dealerId, tid, price
4
OLTP Queries: Examples
create a new sales record that indicates that a red VW Golf was sold in Boston, MA
see how many black and silver VW Passats were sold at dealership #123 on April 11th 2005
5
OLAP Queries: Examples Analyze comparative sales of the different colors
of VW Golf by state
See which months are particularly favorable to the sale of different VW models and colors
Rank VW dealerships by revenue, displaying a ranked list of dealerships and % differences in sales between each dealership and the one ranked 1 place higher
6
OLAP vs. OLTP OLAP Knowledge worker Decision support Subject-oriented (Star, snowflake) Historical, Consolidated Summarized,
Multidimensional Ad hoc Complex query Read mostly Lots of scans Millions Hundreds 100 GB - TB Query throughput, response
OLTP Clerk, IT professional Day to day operations Application-oriented (E-R based) Current, Isolated Detailed, Flat relational Structured, Repetitive Short, simple transaction
Read/write Index/hash on prim. key Tens Thousands 100 MB - GB Trans. throughput
User Function DB design
Data View Usage Unit of work Access Operations # Records accessed # Users Db size Metric
© Surajit Chaudhuri, Umeshwar Dayal
7
OLAP Queries: Challenges Many AND, OR in the WHERE clause Self-join, nested sub-queries
» Last year’s sales vs this year’s sales for each product» Show reps for whom every sale has been more than $15000
Extensive use of aggregation, often on related datasets Aggregation over time periods Ranking Use of statistical functions Very large datasets Expectation of an interactive response time
8© Surajit Chaudhuri, Umeshwar Dayal
OLAP Query Tools Goal of OLAP is to support ad-hoc querying for
the business analyst (Power user) Business analysts are familiar with spreadsheets Extend spreadsheet analysis model to work with
warehouse data» Large data set» Semantically enriched to understand business terms
(e.g., time, geography)» Combined with reporting features
Multidimensional view of data is the foundation of OLAP.
9© Surajit Chaudhuri, Umeshwar Dayal
Multidimensional Data Model Database is a set of facts (points) in a
multidimensional space A fact has a measure dimension
» quantity that is analyzed, e.g., sale amount, budget A set of dimensions with respect to which data is
analyzed» e.g., store, product, date associated with a sale amount
Dimensions form a sparsely populated coordinate system
Each dimension has a set of attributes» e.g., owner, city and county of store
10
Attribute Hierarchies Attributes of a dimension may be related An m:1 dependency is most common Dependency graph may be:
» Hierarchy: e.g., city -> state -> country
» Lattice: date -> month -> yeardate -> week -> year
Hierarchies are most common Dependencies influence choice of operations and
data representation
© Surajit Chaudhuri, Umeshwar Dayal
11© Surajit Chaudhuri, Umeshwar Dayal
State
State
Col
orC
olor
DateDate
Fact data: Sales volume in $100Sales volume in $100
RedGreenBlue
WhiteSilver Black
1 2 3 4 76 5
WI CA
15
10
12
20
50
10
DimensionsColor, State, Date
Attributesdate (year, month, day)
Attribute Hierarchies and Lattice Industry Country Year
Category State Quarter
Product City Month Week
NY
Date
Multidimensional Data Sales volume as a function of product, time, geography
12© Surajit Chaudhuri, Umeshwar Dayal
ROLAP and MOLAP Relational OLAP (ROLAP)
» Relational and Specialized Relational DBMS to store and manage warehouse data
» OLAP middleware to support missing pieces– Optimize for each DBMS backend – Aggregation Navigation Logic– Additional tools and services
Multidimensional OLAP (MOLAP) » Array-based storage structures» Direct access to array data structures
13
Multiple Aggregations Create a 2-dimensional spreadsheet that shows
sum of sales by year as well as by model of car Each subtotal requires a separate aggregate query
YEAR
STATE
Sum By State
Sum by
Year
© Surajit Chaudhuri, Umeshwar Dayal
14
Example: Multiple Aggregations
WI CA Total
2003 63 81 144
2004 38 107 145
2005 75 35 110
Total 176 223 399
15
Generalization: The Data Cube Base tuples Aggregate tuples:
» one aggregation for each subset of dimensions (powerset)
» exponential number of subsets, but can optimize the computation
Example» N = 3 dimensions
– model = {Golf, Jetta}– color = {red, black, white}– state = {NY, CA, WI}
» How many aggregate tuples in the data cube?– face – 1D agg; edge – 2D agg; corner – 3D agg
16
Operations on Multidimensional Data Model
Aggregation (roll-up) of detailed data to create summary data
Navigation to detailed data (drill-down) from summary Selection (slice) defines a subcube
– Project the cube on fewer dimensions by specifying coordinates of remaining dimensions
– e.g., sales where state = NY and month = Jan
Calculation– Within a dimension, e.g., (sales - expense) by state– Across dimensions
Ranking – top 3% of states by average sales
Window Queries
© Surajit Chaudhuri, Umeshwar Dayal
17
Roll-up and Drill-Down Roll-Up: Use of aggregation
» dimension reduction: – e.g., total sales by state by color– e.g., total sales by state
» navigating attribute hierarchy: – e.g., sales by city -> total sales by state -> total sales by
country– e.g., total sales by city and year -> total sales by state and year
-> total sales by country
Drill-Down: Inverse operation of roll-up» Provides the data set that was aggregated
– e.g., show “base” data for total sales figure for CA state
© Surajit Chaudhuri, Umeshwar Dayal
18
Slice and Dice What colors of Golf are not doing so well?
Select color, sum(price) From SALES Where model = ‘Golf’ slicing Group By color dicing
Keep slicing if results are uniform
19
More ExamplesQ: Given a query, which values from the CUBE do
we need to retrieve?
A: To answer a query Q use tuples T s.t.» If Q groups by A, T must have a non-* value in its
component for A» If Q slices by A = b, T must have the value b (not * or
any other value) in its component for A» If Q neither groups nor slices by A, then T has to have
* in its component for A
20© Surajit Chaudhuri, Umeshwar Dayal
Cit
yC
ity
ProductProductM
onth
Mon
th
Pro
duct
Pro
duct
MonthMonthFact data: Sales volume in $100Sales volume in $100
JuiceColaMilk
CreamToothpaste
Soap
1 2 3 4 76 5
LASF
NY
15
10
12
20
50
10
CitCit
yyPivot (Rotate)
Result: cross tabulation
21© Surajit Chaudhuri, Umeshwar Dayal
Warehouse Database Schema Entity-Relationship design techniques not
appropriate Design should reflect multidimensional view Typical schemas:
» Star Schema» Snowflake Schema» Fact Constellation Schema
22© Surajit Chaudhuri, Umeshwar Dayal
Example of a Star Schema
Fact table
Order OrderNoOrderDate
Customer
CustomerNoCustomerNameCustomerAddressCity
SalespersonSalespersonIDSalespersonNameCityQuota
ProdNoProdNameProdDescrCategoryCategoryDescrUnitPriceQOH
City
CityNameStateCountry
Date
DateKeyDateMonthYear
OrderNoSalespersonIDCustomerNoProdNoDateKeyCityNameQuantityTotalPrice
Product
23© Surajit Chaudhuri, Umeshwar Dayal
Star Schema and Variants A single fact table and a single table for each
dimension Generated keys are used for performance and
maintenance reasons Fact constellation: Multiple Fact tables that share
common dimension tables» Example: ProjectedExpense and ActualExpense may
share dimensional tables Snowflake Schema: Represents dimensional
hierarchy by normalization
24© Surajit Chaudhuri, Umeshwar Dayal
Example of a Snowflake Schema
Fact table
OrderNoSalespersonIDCustomerNoDateKeyCityNameProdNoQuantityTotalPrice
Order OrderNoOrderDate
Customer
CustomerNoCustomerNameCustomerAddressCity
SalespersonSalespersonIDSalespesonNameCityQuota
ProductProdNoProdNameProdDescrCategoryUnitPriceQOH
CategoryCategoryNameCategoryDescr
City
CityNameState
State
StateNameCountry
Date
DateKeyDateMonth
Month
MonthYear
Year
Year
25
Performance Considerations Normalization for dimension tables
» Read-only data, so no update anomalies» Fewer joins – better performance
Pre-computation of summary tables» Re-use can speed up performance» How can we use pre-computed results effectively?
Data is very large, dimension data often sparse» Crucial to use indexes effectively » Need for new indexing techniques: bitmap indexes, join
indexes
26© Surajit Chaudhuri, Umeshwar Dayal
Bit Map Index An alternative representation of RID-list Comparison, join and aggregation operations are
reduced to bit arithmetic Specially advantageous for low-cardinality
domains» Significant reduction in space and I/O (30:1)» Adapted for higher cardinality domains » Compression (e.g., run-length encoding) exploited » Upper Bound of 2R words for any bitmap over R rows
[Hasan & Sinha, 1997]
27
Bitmap Index Example
custid name gender rating
112 Joe M 3
115 Ram M 5
119 Sue F 5
116 Woo M 4
M F
1 0
1 0
0 1
1 0
1 2 3 4 5
0 0 1 0 0
0 0 0 0 1
0 0 0 0 1
0 0 0 1 0
28© Surajit Chaudhuri, Umeshwar Dayal
Join Index Traditional index maps the value in a column to a
list of rows with that value Join index maintain relationships between
attribute value of a dimension and the matching rows in the fact table
Join index may span multiple dimensions (composite join index)» Use join index to identify regions of cartesian product
that are of interest» Few people in Southern California may buy umbrellas
29© Surajit Chaudhuri, Umeshwar Dayal
Algorithm Using Bitmapped Join Indexes
[O’Neil&Graefe95] Maintain bit mapped join indexes between each
dimension table and the fact table To answer a query over multiple dimensions
» Take intersection of join indexes until the set of candidate fact tuples is small
» Do foreign key joins with rest of the dimension tables» Look up the fact table
30© Surajit Chaudhuri, Umeshwar Dayal
Join Index over Star Schema
Fact table
Order OrderNoOrderDate
Customer
CustomerNoCustomerNameCustomerAddressCity
SalespersonSalespersonIDSalespesonNameCityQuota
ProdNoProdNameProdDescrCategoryCategoryDescrUnitPriceQOH
City
CityNameStateCountry
DateDateKeyDateMonthYear
OrderNoSalespersonIDCustomerNoProdNoDateKeyCityNameQuantityTotalPrice
Product
31© Surajit Chaudhuri, Umeshwar Dayal
ROLAP:Handling of Aggregate Views
Important component for ROLAP Servers Choice of aggregate views to materialize Physical representation of Materialized Views in
the star schema Logic for Aggregation Navigation
» make optimum use of materialized aggregates to answer a query
32© Surajit Chaudhuri, Umeshwar Dayal
ROLAP: Choice of Aggregate Views to Materialize
Storage can increase dramatically if precomputed views are not chosen properly
Must take into account queries in the workload, their frequencies and their costs
The decision must be taken in the broader context of physical database design» e.g., should take into account the choice of indexes
Heuristic approaches adopted in products
33© Surajit Chaudhuri, Umeshwar Dayal
ROLAP: Using Materialized Views Through Selection
A query can use a view through a selection if » Each selection condition C on each dimension d
in the query is » Logically implies a condition C’ on dimension
d in the view Example: A view has sum(sales) by product and
by year for products introduced after 1991» OK to use for sum(sales) by product for
products introduced after 1992 » CANNOT use for sum(sales) for products
introduced after 1989
34© Surajit Chaudhuri, Umeshwar Dayal
Using Materialized Views through Group By (Roll Up)
The view V may be applicable via roll-up if for every grouping attribute g of the query Q:» Q has Group By a1,..,g, an» V has Group By a1,..,h, an» Attribute g is higher than h in the attribute
hierarchy » Aggregation functions are distributive
Example: Compute “sum(sales) by category” from the view “sum(sales) by product”
35© Surajit Chaudhuri, Umeshwar Dayal
-- W.H. Inmon, Building the Data Warehouse, 1992.
Data Warehouse A decision support database that is maintained
separately from the organization’s operational databases.
A data warehouse is a – subject-oriented, – integrated,– time-varying, – non-volatile
collection of data that is used primarily in organizational decision making.
36© Surajit Chaudhuri, Umeshwar Dayal
Why Separate Data Warehouse Performance
» Op dbs designed & tuned for known trans. workloads.» Complex OLAP queries would degrade performance
for operational transactions. » Special data organization, access & implementation
methods needed for multidimensional views & queries. Function
» Missing data: Decision support requires historical data, which op dbs do not typically maintain.
» Data consolidation: Decision support requires data consolidation (aggregation, summarization) from many heterogeneous sources: op dbs, external sources.
» Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
37© Surajit Chaudhuri, Umeshwar Dayal
Data Warehousing Architecture
Data sources
External sources Extract
TransformTransport
Data Warehouse
Data Marts
OLAPOLAP Servers
Data Mining
Query/Reporting
Metadata Repository
Monitoring & Administration
Front-End Tools
ServeOperational dbs
38© Surajit Chaudhuri, Umeshwar Dayal
Data Warehouse vs. Data Marts Enterprise data warehouse: collects all information
about subjects (customers, products, sales, assets, personnel) that span the entire organization. » Requires extensive business modeling. » May take years to design and build.
Data Marts: Departmental subsets that focus on selected subjects. » Marketing data mart: customer, products, sales. » Faster roll out, but complex integration in the long run.
Virtual warehouse: views over operational dbs» materialize some summary views for efficient query
processing » easier to build» requisite excess capacity on operational db servers.
39© Surajit Chaudhuri, Umeshwar Dayal
Three-Tier Architecture Warehouse database server
» almost always a relational DBMS; rarely flat files. OLAP servers
» Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.
» Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.
Clients» Query and reporting tools.» Analysis tools.» Data mining tools.
40© Surajit Chaudhuri, Umeshwar Dayal
Populating & Refreshing the Warehouse
Data extraction Data cleaning Data transformation
» Convert from legacy/host format to warehouse format Load
» Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
Refresh» Propagate updates from sources to the warehouse.
41© Surajit Chaudhuri, Umeshwar Dayal
Data Cleaning Why?
» Data warehouse contains data that is analyzed for business decisions
» More data and mulitple sources could mean more errors » Results in incorrect analysis
Detecting data anomalies and rectifying them early has huge payoffs
Important to identify tools that work together well Long Term Solution
» Change business practices and data entry tools» Repository for metadata
42© Surajit Chaudhuri, Umeshwar Dayal
Load Issues:
» huge volumes of data to be loaded» small time window (usually at night) when the
warehouse can be taken off-line» when to build indexes and summary tables» allow system administrator to monitor status, cancel
suspend, resume load, or change load rate» restart after failure with no loss of data integrity.
Techniques:» batch load utility: sort input records on clustering key
and use sequential I/O; build indexes and derived tables» sequential loads still too long (~100 days for TB)» use parallelism and incremental techniques.
43© Surajit Chaudhuri, Umeshwar Dayal
Parallel Load
Pipelined and partitioned parallelism
[Barclay, Barnes, Gray, Sundaresan: Loading Databases Using Dataflow Parallelism]
Source tables Sort runs Merge runs Table insert Target tables
Build index record
Scan
Sort runs Merge runs Index insert Target index
44© Surajit Chaudhuri, Umeshwar Dayal
Incremental Load Full load may still take too long.
» entire load is a (long) batch transaction» replace old table with new after transaction commits» use periodic checkpoints; after failure, restart from last
checkpoint. Use incremental loads during refresh to reduce data
volume » insert only updated tuples» now, incremental load conflicts with queries» break into sequence of shorter transactions (every
~1000 records, every few seconds)» coordinate this sequence of transactions: must ensure
consistency between base tables and derived tables & indices.
45© Surajit Chaudhuri, Umeshwar Dayal
Refresh Issues:
» when to refresh– on every update: too expensive, only necessary if
OLAP queries need current data (e.g., up-to-the-minute stock quotes)
– periodically (e.g., every 24 hours, every week) or after “significant” events
– refresh policy set by administrator based on user needs and traffic
– possibly different policies for different sources.» how to refresh.
46© Surajit Chaudhuri, Umeshwar Dayal
Refresh Techniques Full extract from base tables
» read entire source table or database: expensive» may be the only choice for legacy databases or files.
Incremental techniques (related to work on active dbs)» detect & propagate changes on base tables: replication
servers – snapshots & triggers (Oracle)– transaction shipping (Sybase)
» logical correctness– computing changes to star tables – computing changes to derived and summary tables– optimization: only significant changes
» transactional correctness: incremental load.
top related