CS346: Advanced Databases
Graham Cormode [email protected]

Data Warehousing and OLAP

Outline

Chapter: “Overview of Data Warehousing and OLAP” in Elmasri and Navathe

• What is a data warehouse and what is it for?
• The multidimensional data model and common schema designs
• Special indexes: bitmap and join indexes

Why?
• Another model of data to study and contrast with RDBMS
• A different perspective on using data for insight
• A relatively recent development (1990s): still developing


Data Warehouses

• Data Warehouses were introduced to handle large data stores
  – Typically, historical data for business analytic purposes
  – Separate from the organization’s “live” operational database

• “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data” (W. Inmon, “father of data warehouse”)
  – Subject-oriented: focused on one topic (e.g. all sales records)
  – Integrated: data brought together from many sources, cleaned
  – Time-variant: covers a long history of data (e.g. last decade)
  – Non-volatile: only periodically updated, not “live” data

• Data warehouse products from Oracle, IBM, Microsoft, Teradata


OLAP, OLTP, DSS

• OLAP: Online Analytical Processing
  – Analysis of complex data stored in a data warehouse
  – Often using distributed storage and processing (Hive…)

• In contrast to Online Transaction Processing (OLTP)
  – Insertions, updates, deletions and queries

• Decision Support Systems (DSS) or Enterprise Information Systems
  – Allow an organization’s leaders to make complex strategic decisions
  – Support data mining / machine learning for knowledge discovery
  – “Business Intelligence”


Data Warehouse Characteristics

• Data Warehouses adopt a different data model to RDBMS
  – Typically a multidimensional data model

• Warehouses often store integrated data from many sources
  – In contrast to a DBMS, which encourages multiple disjoint DBs

• Warehouses typically support time-series, trend analysis
  – Need more historical data, not just the current values

• Warehouses are typically nonvolatile
  – Data is added only periodically. No need for transactions!

• Warehouses typically handle very large amounts of data
  – Often two orders of magnitude (100x) larger than “live” databases
  – May be terabytes to petabytes in size


ETL: Extract, Transform, Load

• Putting data into a data warehouse is a complex process
  – Denoted ETL: Extract, Transform, Load

• Extract: pull data out of whatever system it is stored in
  – Via appropriate interchange format: XML, flat files, etc.

• Transform: put data into a usable format
  – Pick which attributes, harmonize formats, sort and join as needed
  – Format for consistency: names of entities should agree
  – Clean the data: identify errors and fill in missing values
    • Return cleaned data to update original source: backflushing
  – Fit the data to the model of the warehouse: ensure it fits schema


• Load: store in an appropriate format
  – Many warehouses use simple structures, e.g. sorted flat files
  – Refresh policy: How up to date is the data? Can it be offline?
  – How long does it take to load into the warehouse?

• Store metadata on the data as well: metadata repository
  – Technical metadata: how data was processed, stored, updated
  – Business metadata: relevant business rules and organization details
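
The three ETL stages can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the CSV layout, the column names, the policy of replacing a missing amount with 0, and the use of a sorted list as the "warehouse". A real ETL pipeline would be far more involved.

```python
import csv
import io

# Hypothetical raw extract: a flat file exported from an operational system.
RAW_EXTRACT = """store,product,amount
Coventry ,tv,300
coventry,DVD,
LEEDS,tv,450
"""

def extract(text):
    """Extract: read rows out of the interchange format (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: harmonize names, fill in missing values, fit the schema."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "store": row["store"].strip().title(),      # names of entities should agree
            "product": row["product"].strip().upper(),  # one spelling per product
            "amount": float(row["amount"] or 0.0),      # assumed policy: missing -> 0
        })
    return cleaned

def load(rows):
    """Load: store in a simple structure, here a list sorted by (store, product)."""
    return sorted(rows, key=lambda r: (r["store"], r["product"]))

warehouse = load(transform(extract(RAW_EXTRACT)))
for record in warehouse:
    print(record)
```

After the transform step, "Coventry " and "coventry" refer to the same store, which is the kind of consistency the Transform slide asks for.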


Characteristics of Data Warehouses

• A few key properties of data warehouses (DW):
  – Multidimensional: allow many levels of aggregation
  – Support multiple users via client-server architecture
  – Should be intuitive and responsive to use

• Many variations of the central concept:
  – Enterprise-wide DW: corral everything about an organization
  – Virtual DWs: provide a materialized view of an operational DB
  – Data marts: DWs restricted to a subset of an organization

• Two common architectures for warehouses:
  – Distributed: must handle replication, partitioning, consistency
  – Federated: collection of autonomous warehouses (data marts)


OLAP and Data Cubes

• Warehouses often support Online Analytical Processing (OLAP)
  – A multidimensional view of data
  – Represents data as a data cube
  – Explored by aggregating or refining dimensions in the data


Aggregating Multidimensional Data

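• E.g. Sales volume as a function of product, month, and region

[Figure: data cube with axes Product, Region, Month]

• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  – Product: Industry → Category → Product
  – Location: Region → Country → City → Office
  – Time: Year → Quarter → Month → Day, with Week as an alternative level above Day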


A Sample Data Cube

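[Figure: data cube of sales with three dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, DVD, PC, sum), and Country (U.S.A., Canada, Mexico, sum). One labelled cell holds the total annual sales of TVs in U.S.A.; * (all) marks aggregation over every value of a dimension.]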
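
The idea of a cube with "sum" cells along each dimension can be sketched in a few lines of Python. The sales figures below are invented for illustration, and the `ALL` marker and `build_cube` function are hypothetical names rather than part of any OLAP product. Each fact is added to every cell it contributes to, including the cells where one or more dimensions are replaced by `*` (all).

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact records: (quarter, item, country, sales amount).
SALES = [
    ("1Qtr", "TV", "U.S.A.", 100),
    ("2Qtr", "TV", "U.S.A.", 150),
    ("1Qtr", "PC", "Canada", 80),
    ("3Qtr", "DVD", "Mexico", 40),
    ("4Qtr", "TV", "Canada", 60),
]

ALL = "*"  # marker meaning "summed over every value of this dimension"

def build_cube(facts):
    """Sum the measure into every cell, including the '*' (all) cells."""
    cube = defaultdict(int)
    for quarter, item, country, amount in facts:
        # Each fact contributes to 2^3 = 8 cells: every dimension is either
        # kept at its own value or replaced by ALL.
        for cell in product((quarter, ALL), (item, ALL), (country, ALL)):
            cube[cell] += amount
    return dict(cube)

cube = build_cube(SALES)
tv_usa_annual = cube[(ALL, "TV", "U.S.A.")]  # total annual sales of TVs in U.S.A.
first_quarter = cube[("1Qtr", ALL, ALL)]     # all products, all countries, 1Qtr
grand_total = cube[(ALL, ALL, ALL)]          # the fully aggregated cell
print(tv_usa_annual, first_quarter, grand_total)
```

Precomputing every cell like this trades storage for lookup speed; real systems typically materialize only some of the aggregates.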

OLAP Operations

• Roll up (drill-up): summarize data
  – by climbing up hierarchy or by dimension reduction

• Drill down (roll down): inverse of roll-up
  – from higher level summary to lower level summary or detailed data, or introducing new dimensions

• Slice and dice: project and select
  – Zoom in on particular value, or drop some attributes

• Apply aggregation: on a given dimension
  – Count, Sum, Min, Max, Average, Variance, Median, Mode
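
The operations above can be sketched on a tiny in-memory fact list. This is only an illustration under stated assumptions: the facts, the `QUARTER_OF` mapping, and the helper names `aggregate`, `roll_up_time` and `slice_` are all hypothetical, and a real OLAP engine would work on precomputed aggregates rather than rescanning the facts.

```python
from collections import defaultdict

# Hypothetical detailed facts at the (month, city, product) level.
FACTS = [
    {"month": "Jan", "city": "Coventry", "product": "TV", "sales": 10},
    {"month": "Feb", "city": "Coventry", "product": "PC", "sales": 20},
    {"month": "Apr", "city": "Leeds", "product": "TV", "sales": 30},
    {"month": "Apr", "city": "Toronto", "product": "TV", "sales": 40},
]

# One step of the Time hierarchy (Month -> Quarter), for the months used here.
QUARTER_OF = {"Jan": "Q1", "Feb": "Q1", "Apr": "Q2"}

def aggregate(facts, dims, agg=sum):
    """Group facts by the given dimensions and aggregate the sales measure."""
    groups = defaultdict(list)
    for f in facts:
        groups[tuple(f[d] for d in dims)].append(f["sales"])
    return {key: agg(values) for key, values in groups.items()}

def roll_up_time(facts):
    """Climb the hierarchy: annotate each fact with the quarter of its month."""
    return [{**f, "quarter": QUARTER_OF[f["month"]]} for f in facts]

def slice_(facts, dim, value):
    """Slice: select the facts with one particular value of a dimension."""
    return [f for f in facts if f[dim] == value]

# Detailed view: the level one would drill down to.
by_month_city = aggregate(FACTS, ["month", "city"])
# Roll up: climb Month -> Quarter and drop the city dimension.
by_quarter = aggregate(roll_up_time(FACTS), ["quarter"])
# Slice on product = TV, then summarize by city.
tv_by_city = aggregate(slice_(FACTS, "product", "TV"), ["city"])
# Apply a different aggregate (Max instead of Sum) on the city dimension.
max_by_city = aggregate(FACTS, ["city"], agg=max)
print(by_quarter, tv_by_city, max_by_city)
```

Drill down is the reverse direction: moving from `by_quarter` back to `by_month_city` requires the more detailed data to still be available.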


Multidimensional Storage Model

• The DW multidimensional storage model has two table types:
  – Dimension tables and fact tables

• Fact table: many tuples, 1 per stored fact, pointing to dimensions
  – E.g. sale of an item: which product, which store, which customer

• Dimension table: tuples of attributes of the dimension
  – E.g. details of the product, of the store, of the customer


Data Warehouse Schemas

• Star schema: fact table with a single table for each dimension

• Snowflake schema: variation of a star schema
  – Dimension tables are arranged hierarchically after normalization
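
A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is only an illustration: the table names, columns and rows are invented, and SQLite stands in for a warehouse product. It builds one fact table and two dimension tables, then runs a typical query that joins the fact table to its dimensions and aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per product / store, with descriptive attributes.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE store   (store_id   INTEGER PRIMARY KEY, city TEXT, country TEXT);
    -- Fact table: one row per sale, pointing at the dimensions by foreign key.
    CREATE TABLE sales (
        product_id INTEGER REFERENCES product(product_id),
        store_id   INTEGER REFERENCES store(store_id),
        amount     REAL
    );
    INSERT INTO product VALUES (1, 'TV', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO store   VALUES (1, 'Coventry', 'UK'), (2, 'Toronto', 'Canada');
    INSERT INTO sales   VALUES (1, 1, 300.0), (1, 2, 250.0), (2, 1, 120.0);
""")

# Star join: join the fact table to each dimension, then aggregate the measure.
rows = conn.execute("""
    SELECT p.category, s.country, SUM(f.amount)
    FROM sales f
    JOIN product p ON p.product_id = f.product_id
    JOIN store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.country
    ORDER BY p.category, s.country
""").fetchall()
for row in rows:
    print(row)
```

In a snowflake variant of this sketch, the `category` attribute would move out of `product` into its own table referenced by a key, at the cost of one more join per query.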


Fact constellations

• Fact constellation: a set of fact tables that share some dimension tables


Bitmap Indexes

• Bitmap indexes are used to support high-performance access
  – One of various techniques used in the database

• Takes the form of a bit vector for each distinct value of the indexed column
  – Bit i is set to 1 if row i has that value, 0 if it does not

• Can be quite compact if the domain size is small
  – E.g. 1M rows and domain size of 4: 4 bit vectors of 1M bits each, so bitmap index size 0.5MB
  – Efficient to check conjunctive conditions: intersect (AND) bitmaps
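
A bitmap index is easy to sketch with Python integers used as bit vectors. The columns and values below are made up for illustration, and `build_bitmap_index` and `matching_rows` are hypothetical helper names. The sketch builds one bitmap per distinct value, answers a conjunctive condition with a single bitwise AND, and repeats the size arithmetic from the slide.

```python
# Hypothetical indexed columns; position i in each list belongs to the same row.
REGION = ["N", "S", "E", "N", "W", "S", "N", "E"]
GENDER = ["F", "M", "F", "F", "M", "F", "M", "F"]

def build_bitmap_index(column):
    """Return {value: bitmap}, where bit i of the bitmap is 1 iff row i has that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into the list of row ids whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

region_index = build_bitmap_index(REGION)
gender_index = build_bitmap_index(GENDER)

# Conjunctive condition region = 'N' AND gender = 'F': intersect (AND) the bitmaps.
hits = matching_rows(region_index["N"] & gender_index["F"])
print(hits)

# Size estimate from the slide: 1M rows, domain size 4, one bit per (row, value).
index_bytes = 1_000_000 * 4 // 8
print(index_bytes)  # 500000 bytes, i.e. 0.5MB
```

The size grows linearly with the domain size, which is why the slide restricts the compactness claim to small domains.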


Join indexing

• A join index connects dimension data to tuples in a fact table
  – Assuming a star schema

• A join index is a traditional index linking primary and foreign keys
  – Lists all the keys that meet the (equi)join condition

• E.g. consider a sales fact table that has city as one dimension
  – Join index on city: list of sales tuple ids for each different city

• Can implement a join index as a bitmap index
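
The city example can be sketched as follows. The dimension contents, fact rows and helper names (`build_join_index`, `as_bitmaps`) are assumptions for illustration only; the point is that the join between the fact table and the dimension is computed once and stored, first as lists of tuple ids and then as one bitmap per city.

```python
from collections import defaultdict

# Hypothetical star schema fragment: a city dimension and a sales fact table.
CITY_DIM = {1: "Coventry", 2: "Leeds", 3: "Toronto"}  # city_id -> city name
SALES_FACT = [                                        # position in list = tuple id
    {"city_id": 1, "amount": 300},
    {"city_id": 3, "amount": 250},
    {"city_id": 1, "amount": 120},
    {"city_id": 2, "amount": 90},
]

def build_join_index(dimension, facts, key):
    """For each dimension value, list the ids of fact tuples that join with it."""
    index = defaultdict(list)
    for tuple_id, fact in enumerate(facts):
        index[dimension[fact[key]]].append(tuple_id)
    return dict(index)

def as_bitmaps(join_index):
    """The same join index stored as one bitmap per city (bit i = fact tuple i)."""
    return {city: sum(1 << t for t in ids) for city, ids in join_index.items()}

join_index = build_join_index(CITY_DIM, SALES_FACT, "city_id")
bitmap_join_index = as_bitmaps(join_index)

# Answer "total sales in Coventry" without scanning or joining the whole fact table.
coventry_total = sum(SALES_FACT[t]["amount"] for t in join_index["Coventry"])
print(join_index, bitmap_join_index, coventry_total)
```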


Data Warehouse versus Views

• Recall views: result of a (stored) query on a database
  – Could achieve warehouse functionality via (materialized) views

• Data warehouses are more than just views:
  – Warehouses are stored persistently, not materialized on demand
  – Different data model: multidimensional, not relational
  – Data warehouses can be indexed (views cannot be indexed independently of the underlying database)
  – Warehouses support various analysis tasks (mining, time series)
  – Warehouses typically contain more (historic) data than one DB


Data Warehouses: Pros and Cons

• Data warehouses have many strengths for data analysis:
  – Support fast exploration and aggregation of data
  – Designed to handle very large data sets (TBs / billions of records)
  – Software supports analytics (data mining/machine learning) on top
    • Clustering, Regression, Classification, Rule mining

• However, they have their limitations:
  – A big undertaking: bringing together all an organization’s data
  – Need a thorough understanding of the organizational structure
  – Can be costly to maintain (time-consuming to clean and load data)
  – As underlying data organization changes, so must the warehouse


Summary


• What is a data warehouse and what is it for?
  – Storing and querying all the data of a large organization

• The multidimensional data model and common schema designs
  – Roll up, drill down, slice & dice; star and snowflake schemas

• Special indexes: bitmap and join indexes

• Chapter: “Overview of Data Warehousing and OLAP” in Elmasri and Navathe