Page 1: Slides

Data Warehousing

Dr. Anol Bhattacherjee
University of South Florida

E-mail: [email protected]
Web: http://coba.usf.edu/ABhatt/

Telephone: (813) 974-6760

Copyright 2006 Anol Bhattacherjee

Page 2: Slides

Agenda

Data warehouse:
Production databases versus data warehouse.
Characteristics.
Structure: Dimensions, hierarchies.
Design: Star schema.
Building: ETL methodology.

Online analytical processing (OLAP):
Examples.
Techniques.

Data mining:
Techniques.

Data warehouse administrator.

Page 3: Slides

Two Categories of Business Systems

Transaction processing systems (TPS):
Application systems used by company employees for everyday operational tasks, such as sales, manufacturing, and customer support.
Employ production databases.

Decision support systems (DSS):
Systems specifically designed to aid managers in decision-making tasks, such as budgeting, forecasting, and planning.
Employ data warehouses and/or data marts.
Require analytical capabilities, such as OLAP and data mining tools.
Also called business intelligence (BI) systems.

Page 4: Slides

Data Warehouse

What it is:
A subject-oriented, integrated, time-variant enterprise-wide repository of historical data designed to support executive decision-making.
Data is aggregated by business dimensions (e.g., region, year, product line), and can be analyzed along these dimensions.
Allows trend analysis, planning, etc. without complex SQL queries.

Example: Wal-Mart’s RetailLink system:
Gives suppliers full access to WM’s sales and inventory data in real time for collaborative planning, forecasting, and replenishment (CPFR).
Powered by NCR’s Teradata servers:
Runs 30+ business applications.
Supports 18,000+ users (WM managers).
Handles 120,000 queries/week.
Receives 8.4 million updates/minute (transactions) at peak time.

Page 5: Slides

Data Warehouse versus Databases

Production Databases | Data Warehouse
Supports transaction processing systems used in everyday business operations | Supports decision support systems used for managerial decision making
Data stored in relational format | Data stored in multidimensional format
MB/GB in size | Terabytes in size
No specific support for time-series | Supports time-series/periodicity
Good for data input/output, but poor for computation (e.g., aggregate) | Poor for data input/output, but uses vector arithmetic for fast computation
Exists independently | Aggregated from production databases
No special analytical operations supported | Supports special analytical operations such as drill-down and slice and dice

Page 6: Slides

Characteristics of DW Data

Subject-oriented:
Data is organized around subjects or business dimensions, such as sales, customers, orders, claims, accounts, employees, etc.

Integrated:
Data is collected from several transactional databases, and integrated in a way to provide a unified picture of each subject over time.
Data from different databases is transformed into a common schema, measurement, code, data type.

Aggregated:
Data stored is not transaction-level, but aggregated by products, regions, months/years, or some other business dimension.
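The aggregation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample rows, the (region, year) grain, and the function name `aggregate_sales` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical transaction-level rows: (region, product_line, date, amount)
transactions = [
    ("Northeast", "Shoes", "1995-01-14", 120.0),
    ("Northeast", "Shoes", "1995-02-03", 80.0),
    ("Northeast", "Hats", "1996-05-21", 40.0),
    ("Midwest", "Shoes", "1995-07-09", 60.0),
]

def aggregate_sales(rows):
    """Sum sales amounts by (region, year), discarding transaction detail."""
    totals = defaultdict(float)
    for region, _product_line, date, amount in rows:
        year = int(date[:4])  # the year is the first four characters of an ISO date
        totals[(region, year)] += amount
    return dict(totals)

summary = aggregate_sales(transactions)
for (region, year), total in sorted(summary.items()):
    print(region, year, total)
```

Only the summarized rows would be stored in the warehouse; the individual transactions stay in the production database.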

Page 7: Slides

Characteristics of DW Data

Historical:
Data updated at some time interval: weekly, monthly, etc.
Data stored by weeks, months, etc. for historical comparison and trend analysis.

Time variant:
Data always includes a timestamp (e.g., sales by weeks, months, quarters, or years).

Non-volatile:
Data is historical, and does not change with time.

Denormalized:
Denormalized data is used to improve query performance, though it also increases update time and introduces data integrity problems.
Works because historic data in the data warehouse is rarely updated.

Page 8: Slides

Data Warehouse versus Data Marts

Enterprise data warehouse (EDW):
Large-scale data repository that incorporates aggregated historical data for an entire company, division, or business unit.
Built around many subjects, can support a wide range of decision tasks.

Data marts:
Small-scale data repository serving the needs of one department.
Based on a limited number of subjects (sometimes one).
Constructed from few transactional databases or a subset of EDW data.
Provides a buffer between managers and EDW: managers work with DM data, so that even if the DM data is corrupted, EDW data is unchanged.

Which is done first:
Top-down development: EDW is created first, from which data is extracted to create one or more DMs.
Bottom-up approach: Build independent DMs as needed, overall EDW built later from existing DMs.

Page 9: Slides

Dimensions of a Data Warehouse

Two-dimensional data warehouse Three-dimensional data warehouse

Data warehouses can have four or more dimensions

Page 10: Slides

Dimensions and Hierarchies

Two key characteristics of a DW:
Subject orientation of data.
Temporal nature of data (time dimension).

Multidimensional databases:
Each DW subject reflects a separate business dimension (e.g., product line, sales area, year, etc.), hence DWs are often called multidimensional databases.
Multidimensional databases are not yet mature or sophisticated; hence multidimensional data is often stored in relational databases.

Hierarchies:
Business dimensions can be organized into hierarchies:
Sales area by city, county, state, region, etc.
Time grouped by day, month, quarter, year, etc.
Drill-down analysis: Extracting data from higher to lower hierarchy.
Slice and dice: Extracting data from two hierarchies.

Page 11: Slides

Hierarchies

Products (hierarchy):
• By product lines
• By responsibility centers
• By work centers

Sales

Sales area (hierarchy):
• Region: Northeast
    • State: NY
        • Area: NYC
        • Area: Albany
        • Area: Buffalo
        • Area: Long Island
    • State: NJ
    • State: PA
• Region: Midwest
• Region: West

Time (hierarchy):
• Year: 1995
    • Quarter: Q1
        • Month: January
            • Day: 01
            • Day: 02
        • Month: February
    • Quarter: Q2
• Year: 1996
• Year: 1997

Drill-down: Overall sales figures for NY vs. sales figures for NYC, Albany, Buffalo, etc.

Slice and dice: Sales of individual product lines in NYC vs. Albany, vs. Buffalo, etc.
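The drill-down on the sales-area hierarchy can be sketched as follows. The sales figures, the extra areas (Newark, Chicago), and the helper names `roll_up` and `drill_down` are hypothetical, chosen only to make the idea concrete.

```python
# Hypothetical sales facts keyed by the sales-area hierarchy: (region, state, area, sales)
sales = [
    ("Northeast", "NY", "NYC", 500),
    ("Northeast", "NY", "Albany", 120),
    ("Northeast", "NY", "Buffalo", 90),
    ("Northeast", "NJ", "Newark", 200),
    ("Midwest", "IL", "Chicago", 300),
]

LEVELS = ("region", "state", "area")

def roll_up(rows, level):
    """Total sales at one hierarchy level; level must be a name in LEVELS."""
    depth = LEVELS.index(level) + 1
    totals = {}
    for row in rows:
        key = row[:depth]  # keep only the hierarchy columns down to this level
        totals[key] = totals.get(key, 0) + row[3]
    return totals

def drill_down(rows, parent_key):
    """Break one member, e.g. ("Northeast", "NY"), into its children one level down."""
    depth = len(parent_key)
    children = [r for r in rows if r[:depth] == parent_key]
    return roll_up(children, LEVELS[depth])

by_state = roll_up(sales, "state")
ny_detail = drill_down(sales, ("Northeast", "NY"))
print(by_state[("Northeast", "NY")])
print(ny_detail)
```

The state-level total for NY is the sum of the area-level figures that the drill-down exposes.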

Page 12: Slides

Designing a Data Warehouse

Star schema:
Design technique used to create multidimensional tables using a relational database.
Two components: Fact table: Sales. Dimension tables: Time Period, Salesperson, Products.

Snowflake design:
One dimension table (Car) leads to another dimension table (Manufacturer).

Lucky Rent-A-Car Data Warehouse Design

Page 13: Slides

Star Schema Example

Fact table provides sales statistics broken down by product, period, and store dimensions.

Dimension tables contain descriptions about subjects of the business.

1:N relationship from each dimension table to the fact table.

Page 14: Slides

Star Schema With Sample Data

Page 15: Slides

Design Considerations in Star Schema

Fact table:
Should contain quantitative time-period data.
Granularity: what level of detail should you store in fact table?
Transactional grain (finest level) versus aggregated grain (summarized).
Finer grain provides better analysis capability, but requires more rows in dimension and fact tables and hence gives slower performance.

Dimension table:
Keys must be time-invariant (i.e., non-business dependent).
Should be denormalized to maximize performance.

Relationship: 1:N from each dimension table to the fact table (one dimension row, many fact rows).
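A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below assumes product, period, and store dimensions around a sales fact table; all table names, column names, and sample rows are hypothetical, not taken from the slides' figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three dimension tables with surrogate keys, and one fact table that references them.
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, product_line TEXT);
CREATE TABLE dim_period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    period_key  INTEGER REFERENCES dim_period(period_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")

conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
conn.executemany("INSERT INTO dim_period VALUES (?, ?, ?)",
                 [(1, 1995, "Q1"), (2, 1995, "Q2")])
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)",
                 [(1, "NYC", "NY"), (2, "Albany", "NY")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 10, 100.0), (1, 2, 1, 5, 50.0), (2, 1, 2, 3, 90.0)])

# Revenue by city and quarter: join the fact table to two of its dimensions.
rows = conn.execute("""
    SELECT s.city, p.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_store  s ON s.store_key  = f.store_key
    JOIN dim_period p ON p.period_key = f.period_key
    GROUP BY s.city, p.quarter
    ORDER BY s.city, p.quarter
""").fetchall()
print(rows)
```

Each dimension key appears once in its dimension table and many times in `fact_sales`, which is the 1:N relationship noted above.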

Page 16: Slides

Building a Data Warehouse

[Figure: ETL data flow. Labels: API, flat files, Oracle, 3rd party feeds, VSAM (sources); Extract, Transform, Load; temporary data hub; data warehouse; data marts.]

ETL Methodology:
Extract data.
Transform data.
Load data.

Page 17: Slides

ETL Methodology

Data extraction:
Process of copying relevant data from a variety of transactional databases for inclusion in a DW.
May occur at regular intervals (e.g., weekly, monthly) to add new data.
Data from incompatible databases, flat files, text documents, etc. must be filtered through appropriate APIs (application programming interfaces) as needed.

Data transformation:
Next slide.

Data loading:
Extracted, cleaned, and transformed data is loaded into DW at a predetermined data refresh frequency.
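The three ETL steps can be sketched as three small Python functions. This is a toy illustration, assuming a flat-file source and an in-memory dictionary standing in for the warehouse; the column names, sample data, and function names are hypothetical.

```python
import csv
import io

# Hypothetical flat-file extract from a transactional system.
RAW_CSV = """order_id,region,order_date,amount
1,northeast,1995-01-14,120.00
2,Northeast,1995-01-20,80.00
3,MIDWEST,1995-02-02,60.00
"""

def extract(text):
    """Extract: read raw rows from a flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: standardize codes and types, then aggregate by (region, month)."""
    totals = {}
    for row in rows:
        region = row["region"].strip().title()  # reconcile inconsistent capitalization
        month = row["order_date"][:7]           # "YYYY-MM"
        key = (region, month)
        totals[key] = totals.get(key, 0.0) + float(row["amount"])
    return totals

def load(warehouse, totals):
    """Load: add the aggregated figures to the warehouse store."""
    for key, amount in totals.items():
        warehouse[key] = warehouse.get(key, 0.0) + amount
    return warehouse

warehouse = load({}, transform(extract(RAW_CSV)))
print(warehouse)
```

Running the same pipeline at each refresh interval with a new extract would append that period's figures to the existing store.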

Page 18: Slides

Building a Data Warehouse

Data transformation/cleaning:
Data extracted from transactional databases must be cleaned (“scrubbed”) and transformed before loading into a DW.
Format differences across different tables/databases must be reconciled.
Missing or misspelled data values must be resolved.
Erroneous data are identified using application programs, and scrutinized/corrected by DW analysts using system-generated exception reports.
Transaction-level data is aggregated by business dimensions.
Key step in DW construction since DW is very sensitive to data errors.

Challenges of Data Reconciliation:
Life Insurance Database: PK: SS# (123-45-6789), Name (Robert G. Smith)
Auto Insurance Database: PK: DL# (FL-B12345678), Name (Bob Smith)
Home Insurance Database: PK: Acc# (12345678905), Name (R. G. Smith)

Page 19: Slides

Data Cleaning Example

Good Reading Bookstores

Questionable data: Is book quantity correct?

Out-of-range data: A single book can’t cost $3,200.99

Referential integrity problem: Customer#12738 does not exist in Customers table

Possible misspelling: Do rows 3 & 8 refer to the same person?

Missing data: City is blank.

Questionable data: State for rows 2 & 6 could be the same
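Some of the checks listed above can be automated to produce an exception report for an analyst. The sketch below is illustrative only: the order rows, the customer list, and the `MAX_BOOK_PRICE` threshold are assumptions, not data from the slide.

```python
# Hypothetical customer master list and order rows.
customers = {10001, 10002}
orders = [
    {"row": 1, "customer": 10001, "city": "Tampa", "price": 24.99, "quantity": 2},
    {"row": 2, "customer": 12738, "city": "Miami", "price": 19.95, "quantity": 1},
    {"row": 3, "customer": 10002, "city": "", "price": 3200.99, "quantity": 1},
]

MAX_BOOK_PRICE = 500.00  # assumed plausibility threshold for a single book

def exception_report(rows, known_customers):
    """Return (row number, problem) pairs for an analyst to review."""
    problems = []
    for r in rows:
        if r["customer"] not in known_customers:
            problems.append((r["row"], "referential integrity: unknown customer"))
        if not r["city"].strip():
            problems.append((r["row"], "missing data: city is blank"))
        if not 0 < r["price"] <= MAX_BOOK_PRICE:
            problems.append((r["row"], "out-of-range data: price"))
    return problems

report = exception_report(orders, customers)
for row, problem in report:
    print(row, problem)
```

Possible misspellings and questionable quantities are harder to detect with fixed rules and usually still need human review.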

Page 20: Slides

Using a Data Warehouse: OLAP

Online analytical processing (OLAP):
A decision support approach based on viewing data by dimensions.
Well suited for multidimensional data hierarchies in a DW.

OLAP techniques:
Drill-down: Retrieving finer levels of data detail.
Slice: Data subset based on a single value of one dimension.
Pivot or Rotation: Interchanging data dimensions in a slice.

Slice operation
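Slice and pivot can be sketched on a tiny three-dimensional cube held in a Python dictionary. The cube contents, the dimension names in `DIMS`, and the helper names are hypothetical.

```python
# Hypothetical cube cells: (product, region, year) -> sales
cube = {
    ("Shoes", "Northeast", 1995): 100,
    ("Shoes", "Midwest", 1995): 70,
    ("Hats", "Northeast", 1995): 30,
    ("Shoes", "Northeast", 1996): 110,
}
DIMS = ("product", "region", "year")

def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single value and drop it from the keys."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cells.items() if k[i] == value}

def pivot(table):
    """Pivot: swap the two remaining dimensions of a two-dimensional slice."""
    return {(b, a): v for (a, b), v in table.items()}

slice_1995 = slice_cube(cube, "year", 1995)  # keys are (product, region)
rotated = pivot(slice_1995)                  # keys are (region, product)
print(slice_1995)
print(rotated)
```

The slice keeps every cell for one year; the pivot holds the same numbers but presents regions as the leading dimension.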

Page 21: Slides

Drill-down by Package Size Drill-down by Package Size and Color

Drill Down

Page 22: Slides

OLAP Reports: Yahoo Stores

Yahoo Page Stats: Last 365 Days [vitanet]
50/16315 entries shown.
Sort: [By Hits] [By Count of Items Sold] [By Count of Orders] [By Revenue]

Hits | Items Sold | Revenue | Page
329,215 | | | Vitanet Front Page
79,640 | 9,373 | 211567.34 | Xenadrine RFA-1, 120 capsules,
24,790 | | | Ind
23,147 | | | Rate Us
16,626 | | | Shop by 100s of Vitamin/Mineral
12,645 | 856 | 19776.90 | Ripped Fuel 200 capsules Twinlab
11,885 | | | TwinLab Products
7,471 | 11 | 172.45 | Free Samples Drawing Win 6000 grams
7,446 | | | Vita-net Nutritional Products
7,231 | | | Creatine Monohydrate
6,917 | | | Aphrodisiacs
6,896 | | | Androstenedione
6,162 | | | On Sale Items
6,149 | 55 | 1207.25 | Natural Sex Woman
5,859 | | | Growth Hormone
5,843 | 1,070 | 37115.60 | Hydroxycut 240 Capsules MuscleTech
5,756 | 247 | 9345.35 | Creatine Monohydrate 2000 grams

Page 23: Slides

OLAP Examples

Sears’ Strategic Performance Reporting System (SPRS):
Goal: Daily tuning of buying, merchandising, and marketing strategies.
Tracks real-time sales; inventory in stores, transit, and distribution centers; promotion outcomes by item, location, promotion, etc.
Analytics: Price-reduction modeling to “move” products; inventory analysis; customer profitability analysis; store reconfiguration.
1.7 TB data warehouse, replacing 18 prior databases.

British Telecom’s Interactive and Reporting Information System (IRIS):
Goal: Tracking 10,000+ ongoing service projects using key performance indicators such as project cost, status, etc.
Permits precanned reports, modeling, forecasting, analytics, etc.
Implemented using SAP’s Strategic Enterprise Management suite using real-time data feed from SAP’s Business Warehouse application.

Page 24: Slides

Using a Data Warehouse: Data Mining

Data mining:
Searching for hidden patterns or knowledge in a company’s data using a blend of statistical, AI, and computer graphics techniques.
Goal is to discover new knowledge or explain observed events.

Applications:
Identify patterns of credit card fraud.
Identify patterns of consumer purchases.

Data mining techniques:
Decision trees.
Case-based reasoning.
Neural networks.
Genetic algorithms.

Page 25: Slides

Data Warehouse Administrator

Specialized personnel in charge of the DW maintenance/upgrade.
Needs three kinds of expertise:

Business expertise:
Company’s business processes and transactional data/databases.
Company’s business goals to know what data should be stored in DW.

Data expertise:
Transactional data/databases for selection and integration into DW.
Designing and overseeing data cleaning/transformation efforts.

Technical expertise:
Principles of data warehouse design.
Knowledge of OLAP and data mining techniques.
Experience with handling very large databases with unique needs for security, backup and recovery, data distribution, etc.

Page 26: Slides

Challenges in Data Warehousing

Data cleaning and finding more “dirty” data than expected.
Coordinating the regular appending of new data from transactional databases to the data warehouse.
Managing very large databases.
Building and maintaining the data dictionary.

