Data Warehousing Dr. Anol Bhattacherjee University of South Florida E-mail: [email protected] Web: http://coba.usf.edu/ABhatt/ Telephone: (813) 974-6760 Copyright 2006 Anol Bhattacherjee

Author: tommy96

Post on 13-Nov-2014


1. Data Warehousing

  • Dr. Anol Bhattacherjee, University of South Florida
  • E-mail: [email protected]
  • Web: http://coba.usf.edu/ABhatt/
  • Telephone: (813) 974-6760
  • Copyright 2006 Anol Bhattacherjee

2. Agenda

  • Data warehouse:
    • Production databases versus data warehouse.
    • Characteristics.
    • Structure: Dimensions, hierarchies.
    • Design: Star schema.
    • Building: ETL methodology.
  • Online analytical processing (OLAP):
    • Examples.
    • Techniques.
  • Data mining:
    • Techniques.
  • Data warehouse administrator.

3. Two Categories of Business Systems

  • Transaction processing systems (TPS):
    • Application systems used by company employees for everyday operational tasks, such as sales, manufacturing, and customer support.
    • Employ production databases.
  • Decision support systems (DSS):
    • Systems specifically designed to aid managers in decision-making tasks, such as budgeting, forecasting, and planning.
    • Employ data warehouses and/or data marts.
    • Require analytical capabilities, such as OLAP and data mining tools.
    • Also called business intelligence (BI) systems.

4. Data Warehouse

  • What it is:
    • A subject-oriented, integrated, time-variant enterprise-wide repository of historical data designed to support executive decision-making.
    • Data is aggregated by business dimensions (e.g., region, year, product line), and can be analyzed along these dimensions.
    • Allows trend analysis, planning, etc. without complex SQL queries.
  • Example: Wal-Mart's RetailLink system:
    • Gives suppliers full access to WM's sales and inventory data in real time for collaborative planning, forecasting, and replenishment (CPFR).
    • Powered by NCR's Teradata servers:
      • Runs 30+ business applications.
      • Supports 18,000+ users (WM managers).
      • Handles 120,000 queries/week.
      • Receives 8.4 million updates/minute (transactions) at peak-time.

5. Data Warehouse versus Databases

| Data Warehouse | Production Databases |
|---|---|
| Supports decision support systems used for managerial decision making | Supports transaction processing systems used in everyday business operations |
| Terabytes in size | MB/GB in size |
| Supports special analytical operations such as drill-down and slice and dice | No special analytical operations supported |
| Aggregated from production databases | Exists independently |
| Poor for data input/output, but uses vector arithmetic for fast computation | Good for data input/output, but poor for computation (e.g., aggregate) |
| Supports time-series/periodicity | No specific support for time-series |
| Data stored in multidimensional format | Data stored in relational format |

6. Characteristics of DW Data

  • Subject-oriented:
    • Data is organized around subjects or business dimensions, such as sales, customers, orders, claims, accounts, employees, etc.
  • Integrated:
    • Data is collected from several transactional databases, and integrated in a way to provide a unified picture of each subject over time.
    • Data from different databases is transformed into a common schema, measurement, code, and data type.
  • Aggregated:
    • Data stored is not transaction-level, but aggregated by products, regions, months/years, or some other business dimension.
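
The aggregation step described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the transaction rows, the field layout, and the choice of month, region, and product line as business dimensions are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical transaction-level rows from a production database:
# (date "YYYY-MM-DD", region, product_line, amount).
transactions = [
    ("2006-01-03", "Northeast", "Tools", 120.0),
    ("2006-01-17", "Northeast", "Tools", 80.0),
    ("2006-01-20", "Midwest", "Toys", 50.0),
    ("2006-02-02", "Northeast", "Tools", 200.0),
]


def aggregate_by_month(rows):
    """Sum amounts per (month, region, product_line) cell."""
    totals = defaultdict(float)
    for date, region, product_line, amount in rows:
        month = date[:7]  # "YYYY-MM" is the time grain kept in the warehouse
        totals[(month, region, product_line)] += amount
    return dict(totals)


monthly_sales = aggregate_by_month(transactions)
for key in sorted(monthly_sales):
    print(key, monthly_sales[key])
```

The warehouse then stores one row per (month, region, product line) cell rather than one row per transaction.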

7. Characteristics of DW Data

  • Historical:
    • Data updated at some time interval: weekly, monthly, etc.
    • Data stored by weeks, months, etc. for historical comparison and trend analysis.
  • Time variant:
    • Data always includes a timestamp (e.g., sales by weeks, months, quarters, or years).
  • Non-volatile:
    • Data is historical, and does not change with time.
  • Denormalized:
    • Denormalized data is used to improve query performance, though it also increases update time and introduces data integrity problems.
    • Works because historic data in the data warehouse is rarely updated.

8. Data Warehouse versus Data Marts

  • Enterprise data warehouse (EDW):
    • Large-scale data repository that incorporates aggregated historical data for an entire company, division, or business unit.
    • Built around many subjects, can support a wide range of decision tasks.
  • Data marts:
    • Small-scale data repository serving the needs of one department.
    • Based on a limited number of subjects (sometimes one).
    • Constructed from a few transactional databases or a subset of EDW data.
    • Provides a buffer between managers and EDW: managers work with DM data, so that even if the DM data is corrupted, EDW data is unchanged.
  • Which is done first:
    • Top-down development: EDW is created first, from which data is extracted to create one or more DMs.
    • Bottom-up approach: Build independent DMs as needed, overall EDW built later from existing DMs.

9. Dimensions of a Data Warehouse

  • Two-dimensional data warehouse.
  • Three-dimensional data warehouse.
  • Data warehouses can have four or more dimensions.

10. Dimensions and Hierarchies

  • Two key characteristics of a DW:
    • Subject orientation of data.
    • Temporal nature of data (time dimension).
  • Multidimensional databases:
    • Each DW subject reflects a separate business dimension (e.g., product line, sales area, year, etc.), hence DWs are often called multidimensional databases.
    • Multidimensional databases are not yet mature or sophisticated; hence multidimensional data is often stored in relational databases.
  • Hierarchies:
    • Business dimensions can be organized into hierarchies:
      • Sales area by city, county, state, region, etc.
      • Time grouped by day, month, quarter, year, etc.
    • Drill-down analysis: Moving from a higher to a lower level of a hierarchy to extract more detailed data.
    • Slice and dice: Extracting data across two hierarchies (e.g., product lines by sales area).

11. Hierarchies

  • Products (hierarchy):
    • By product lines
    • By responsibility centers
    • By work centers

Sales

  • Sales area (hierarchy):
    • Region: Northeast
      • State: NY
        • Area: NYC
        • Area: Albany
        • Area: Buffalo
        • Area: Long Island
      • State: NJ
      • State: PA
    • Region: Midwest
    • Region: West
  • Time (hierarchy):
    • Year: 1995
      • Quarter: Q1
        • Month: January
          • Day: 01
          • Day: 02
        • Month: February
      • Quarter: Q2
    • Year: 1996
    • Year: 1997
  • Drill-down: Overall sales figures for NY vs. sales figures for NYC, Albany, Buffalo, etc.
  • Slice and dice: Sales of individual product lines in NYC vs. Albany vs. Buffalo, etc.
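
The drill-down and slice-and-dice examples above can be sketched in a few lines of Python. The sales rows, the product lines, the amounts, and the Newark area are hypothetical, and `total_by` is an illustrative helper rather than part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, state, area, product_line, amount).
sales = [
    ("Northeast", "NY", "NYC", "Tools", 500.0),
    ("Northeast", "NY", "NYC", "Toys", 300.0),
    ("Northeast", "NY", "Albany", "Tools", 120.0),
    ("Northeast", "NY", "Buffalo", "Toys", 80.0),
    ("Northeast", "NJ", "Newark", "Tools", 200.0),
]

# Position of each hierarchy level (or dimension) inside a row.
LEVELS = {"region": 0, "state": 1, "area": 2, "product_line": 3}


def total_by(rows, group_levels, where=None):
    """Sum amounts grouped by the given levels, after an optional filter."""
    where = where or {}
    totals = defaultdict(float)
    for row in rows:
        if all(row[LEVELS[name]] == value for name, value in where.items()):
            key = tuple(row[LEVELS[level]] for level in group_levels)
            totals[key] += row[4]
    return dict(totals)


# Higher level of the sales-area hierarchy: one figure per state.
by_state = total_by(sales, ["state"])
# Drill-down: the NY figure broken out by area.
ny_by_area = total_by(sales, ["area"], where={"state": "NY"})
# Slice and dice: product lines compared across NY areas.
ny_area_by_product = total_by(sales, ["area", "product_line"], where={"state": "NY"})

print(by_state)
print(ny_by_area)
print(ny_area_by_product)
```

The area-level figures for NY add up to the state-level NY figure, which is what makes drill-down results consistent with the summary they came from.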

12. Designing a Data Warehouse

  • Star schema:
    • Design technique used to create multidimensional tables using a relational database.
  • Two components:
    • Fact table: Sales.
    • Dimension tables: Time Period, Salesperson, Products.
  • Snowflake design:
    • One dimension table (Car) leads to another dimension table (Manufacturer).
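
A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The table names follow the fact and dimension tables named above; the column names and sample rows are assumptions made for illustration and are not taken from the Lucky Rent-A-Car design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE time_period (period_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE salesperson (salesperson_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product     (product_id INTEGER PRIMARY KEY, product_line TEXT);

-- Fact table: foreign keys to each dimension plus a numeric measure.
CREATE TABLE sales (
    period_id      INTEGER REFERENCES time_period(period_id),
    salesperson_id INTEGER REFERENCES salesperson(salesperson_id),
    product_id     INTEGER REFERENCES product(product_id),
    amount         REAL
);

-- Hypothetical sample data.
INSERT INTO time_period VALUES (1, 2005, 'Q1'), (2, 2005, 'Q2');
INSERT INTO salesperson VALUES (1, 'A. Lee'), (2, 'B. Cruz');
INSERT INTO product     VALUES (1, 'Sedan'), (2, 'SUV');
INSERT INTO sales VALUES (1, 1, 1, 100.0), (1, 2, 2, 250.0), (2, 1, 2, 300.0);
""")

# Join the fact table to two of its dimensions and total the measure.
query = """
SELECT t.quarter, p.product_line, SUM(s.amount)
FROM sales s
JOIN time_period t ON t.period_id = s.period_id
JOIN product p     ON p.product_id = s.product_id
GROUP BY t.quarter, p.product_line
ORDER BY t.quarter, p.product_line
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every analytical query follows the same pattern: start from the fact table, join outward to the dimensions of interest, and group by their attributes.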

[Figure: Lucky Rent-A-Car Data Warehouse Design]

13. Star Schema Example

  • Fact table provides sales statistics broken down by product, period, and store dimensions.
  • Dimension tables contain descriptions about subjects of the business.
  • 1:N relationship between fact and dimension tables.

14. Star Schema With Sample Data

15. Design Considerations in Star Schema

  • Fact table:
    • Should contain quantitative time-period data.
    • Granularity: what level of detail should you store in fact table?
    • Transactional grain (finest level) versus aggregated grain (summarized).
    • Finer grain provides better analysis capability, but requires more rows in the dimension and fact tables and hence yields slower performance.
  • Dimension table:
    • Keys must be time-invariant (i.e., not dependent on business data that may change).
    • Should be denormalized to maximize performance.
  • Relationship:
    • 1:N relationship between dimension and fact tables: each dimension row relates to many fact rows.
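
The granularity trade-off can be made concrete with a little arithmetic. The sketch below computes an upper bound on fact-table rows as the product of the dimension sizes; the figures (100 stores, 5,000 products, one year of history) are hypothetical, and real fact tables are usually sparser because not every combination occurs.

```python
def max_fact_rows(*dimension_sizes):
    """Upper bound on fact rows: one row per combination of dimension values."""
    rows = 1
    for size in dimension_sizes:
        rows *= size
    return rows


# Hypothetical sizes: 100 stores, 5,000 products, one year of history.
daily_grain = max_fact_rows(365, 100, 5000)   # time dimension kept at day level
monthly_grain = max_fact_rows(12, 100, 5000)  # time dimension kept at month level

print(f"daily grain:   up to {daily_grain:,} rows")
print(f"monthly grain: up to {monthly_grain:,} rows")
```

Under these assumptions, moving from daily to monthly grain cuts the bound by a factor of about 30, at the cost of no longer being able to answer day-level questions.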

16. Building a Data Warehouse

  • ETL methodology:
    • Extract data
    • Transform data
    • Load data

[Figure: ETL flow. Sources: API, flat files, Oracle, 3rd-party feeds, VSAM. Steps: Extract, Transform, Load. Stores: temporary data hub, data warehouse, data marts.]

17. ETL Methodology

  • Data extraction:
    • Process of copying relevant data from a variety of transactional databases for inclusion in a DW.
    • May occur at regular intervals (e.g., weekly, monthly) to add new data.
    • Data from incompatible databases, flat files, text documents, etc. must be filtered through appropriate APIs (application programming interfaces) as needed.
  • Data transformation:
    • Next slide.
  • Data loading:
    • Extracted, cleaned, and transformed data is loaded into DW at a predetermined data refresh frequency.
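
The three ETL steps can be sketched end to end with the standard library. Everything specific here is an assumption for illustration: the two source formats, their field names, the common schema, and the use of an in-memory SQLite database as a stand-in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sources: a flat file (CSV text) and rows from another system.
FLAT_FILE = (
    "sale_date,store,amount_usd\n"
    "01/15/2006,NYC,100.50\n"
    "01/16/2006,Albany,20.00\n"
)
OTHER_SYSTEM = [{"date": "2006-01-15", "store_code": "nyc", "cents": 2550}]


def extract():
    """Copy raw records out of each source, tagged with their origin."""
    for rec in csv.DictReader(io.StringIO(FLAT_FILE)):
        yield "flat_file", rec
    for rec in OTHER_SYSTEM:
        yield "other_system", rec


def transform(source, rec):
    """Map a source record to the common schema (iso_date, store, amount)."""
    if source == "flat_file":
        month, day, year = rec["sale_date"].split("/")
        return (f"{year}-{month}-{day}", rec["store"].upper(), float(rec["amount_usd"]))
    return (rec["date"], rec["store_code"].upper(), rec["cents"] / 100)


def load(conn, rows):
    """Append transformed rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, store TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [transform(source, rec) for source, rec in extract()])
loaded = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(loaded)
```

Running `load` again on a later extract appends new rows, which mirrors the periodic refresh described above.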

18. Building a Data Warehouse

  • Data transformation/cleaning:
    • Data extracted from transactional databases must be cleaned (scrubbed) and transformed before loading into a DW.
    • Format differences across different tables/databases must be reconciled.
    • Missing or misspelled data values must be resolved.
    • Erroneous data are identified using application programs, and scrutinized/corrected by DW analysts using system-generated exception reports.
    • Transaction-level data is aggregated by business dimensions.
    • Key step in DW construction since DW is very sensitive to data errors.
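
A system-generated exception report of the kind mentioned above might be produced along these lines. This is a sketch under stated assumptions: the order rows, the set of known customers, and the `MAX_PRICE` plausibility threshold are hypothetical, and real cleaning rules would be set by the DW analysts.

```python
# Hypothetical reference data and extracted rows.
customers = {10001, 10002}
orders = [
    {"row": 1, "customer": 10001, "city": "Tampa", "price": 24.99, "qty": 1},
    {"row": 2, "customer": 12738, "city": "Miami", "price": 19.99, "qty": 2},
    {"row": 3, "customer": 10002, "city": "", "price": 3200.99, "qty": 1},
]
MAX_PRICE = 500.0  # assumed plausibility threshold for a single item


def exception_report(rows, known_customers):
    """Return (row number, problem) pairs for an analyst to review."""
    problems = []
    for r in rows:
        if r["customer"] not in known_customers:
            problems.append((r["row"], "referential integrity: unknown customer"))
        if not r["city"].strip():
            problems.append((r["row"], "missing data: city is blank"))
        if not 0 < r["price"] <= MAX_PRICE:
            problems.append((r["row"], "out-of-range data: price"))
    return problems


report = exception_report(orders, customers)
flagged = {row for row, _ in report}
clean_rows = [r for r in orders if r["row"] not in flagged]

for row, problem in report:
    print(f"row {row}: {problem}")
```

Only rows with no reported problem go on to the load step; flagged rows wait for an analyst to correct or reject them.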

Challenges of Data Reconciliation:

  • Life Insurance Database: PK: SS# (123-45-6789), Name (Robert G. Smith)
  • Auto Insurance Database: PK: DL# (FL-B12345678), Name (Bob Smith)
  • Home Insurance Database: PK: Acc# (12345678905), Name (R. G. Smith)

19. Data Cleaning Example

Good Reading Bookstores

  • Questionable data: Is book quantity correct?
  • Out-of-range data: A single book can't cost $3,200.99.
  • Referential integrity problem: Customer# 12738 does not exist in Customers table.
  • Possible misspelling: Do rows 3 & 8 refer to the same person?
  • Missing data: City is blank.
  • Questionable data: State for rows 2 & 6 could be the same.

20. Using a Data Warehouse: OLAP

  • Online analytical processing (OLAP):
    • A decision support approach based on viewing data by dimensions.
    • Well suited for multidimensional data hierarchies in a DW.
  • OLAP techniques:
    • Drill-down: Retrieving finer levels of data detail.
    • Slice: Data subset based on a single value of one dimension.
    • Pivot or Rotation: Interchanging data dimensions in a slice.
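
Slice and pivot can be illustrated on a tiny in-memory cube. The cell values and the dimension members below are hypothetical, and a dictionary keyed by (product, region, year) is only a stand-in for how a real OLAP engine stores its data.

```python
# Hypothetical cube cells: (product, region, year) -> sales.
cube = {
    ("Tools", "Northeast", 1995): 10,
    ("Tools", "Midwest", 1995): 7,
    ("Toys", "Northeast", 1995): 4,
    ("Toys", "Midwest", 1995): 9,
    ("Tools", "Northeast", 1996): 12,
    ("Toys", "Midwest", 1996): 11,
}


def slice_cube(cells, year):
    """Slice: fix one value of the time dimension, leaving a 2-D table."""
    return {(p, r): v for (p, r, y), v in cells.items() if y == year}


def pivot(table):
    """Pivot: interchange the two dimensions of a slice."""
    return {(b, a): v for (a, b), v in table.items()}


slice_1995 = slice_cube(cube, 1995)  # product x region for 1995
rotated = pivot(slice_1995)          # region x product for 1995

print(slice_1995)
print(rotated)
```

The pivot changes only how the slice is laid out (which dimension runs down the rows and which across the columns); the cell values are untouched.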

[Figure: Slice operation]

21. Drill Down

  • Drill-down by Package Size.
  • Drill-down by Package Size and Color.

22. OLAP Reports: Yahoo Stores

Yahoo Page Stats: Last 365 Days [vitanet]. 50/16315 entries shown.

| Hits | Items Sold | Revenue | Page |
|---|---|---|---|
| 329,215 | | | Vitanet Front Page |
| 79,640 | 9,373 | 211567.34 | Xenadrine RFA-1, 120 capsules, |
| 24,790 | | | Ind |
| 23,147 | | | Rate Us |
| 16,626 | | | Shop by 100s of Vitamin/Mineral |
| 12,645 | 856 | 19776.90 | Ripped Fuel 200 capsules Twinlab |
| 11,885 | | | TwinLab Products |
| 7,471 | 11 | 172.45 | Free Samples Drawing Win 6000 grams |
| 7,446 | | | Vita-net Nutritional Products |
| 7,231 | | | Creatine Monohydrate |
| 6,917 | | | Aphrodisiacs |
| 6,896 | | | Androstenedione |
| 6,162 | | | On Sale Items |
| 6,149 | 55 | 1207.25 | Natural Sex Woman |
| 5,859 | | | Growth Hormone |
| 5,843 | 1,070 | 37115.60 | Hydroxycut 240 Capsules MuscleTech |
| 5,756 | 247 | 9345.35 | Creatine Monohydrate 2000 grams |

23. OLAP Examples

  • Sears Strategic Performance Reporting System (SPRS):
    • Goal: Daily tuning of buying, merchandising, and marketing strategies.
    • Tracks real-time sales; inventory in stores, transit, and distribution centers; promotion outcomes by item, location, promotion, etc.
    • Analytics: Price-reduction modeling to move products; inventory analysis; customer profitability analysis; store reconfiguration.
    • 1.7 TB data warehouse, replacing 18 prior databases.
  • British Telecom's Interactive and Reporting Information System (IRIS):
    • Goal: Tracking 10,000+ ongoing service projects using key performance indicators such as project cost, status, etc.
    • Permits precanned reports, modeling, forecasting, analytics, etc.
    • Implemented using SAP's Strategic Enterprise Management suite using real-time data feed from SAP's Business Warehouse application.

24. Using a Data Warehouse: Data Mining

  • Data mining:
    • Searching for hidden patterns or knowledge in a company's data using a blend of statistical, AI, and computer graphics techniques.
    • Goal is to discover new knowledge or explain observed events.
  • Applications:
    • Identify patterns of credit card fraud.
    • Identify patterns of consumer purchases.
  • Data mining techniques:
    • Decision trees.
    • Case-based reasoning.
    • Neural networks.
    • Genetic algorithms.
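
As a small taste of the first technique, the sketch below fits a one-split decision tree (a "decision stump") to flag suspicious credit card transactions by amount. It is a deliberately reduced stand-in for full decision-tree learners, not a method taken from the slides, and the labeled history is hypothetical.

```python
# Hypothetical labeled transactions: (amount, is_fraud).
history = [
    (12.0, False), (30.0, False), (45.0, False), (60.0, False),
    (900.0, True), (1200.0, True), (75.0, False), (1500.0, True),
]


def fit_stump(samples):
    """Pick the amount threshold with the fewest misclassifications.

    The rule learned is: predict fraud when amount >= threshold.
    """
    best_threshold, best_errors = None, len(samples) + 1
    for threshold in sorted({amount for amount, _ in samples}):
        errors = sum((amount >= threshold) != label for amount, label in samples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


threshold, errors = fit_stump(history)


def predict(amount):
    """Apply the learned single-split rule to a new transaction."""
    return amount >= threshold


print(f"flag transactions of {threshold} or more ({errors} training errors)")
```

A full decision tree repeats this split search recursively on each resulting subset and over many attributes, not just the amount.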

25. Data Warehouse Administrator

  • Specialized personnel in charge of the DW maintenance/upgrade.
  • Needs three kinds of expertise:
    • Business expertise:
      • Company's business processes and transactional data/databases.
      • Company's business goals to know what data should be stored in DW.
    • Data expertise:
      • Transactional data/databases for selection and integration into DW.
      • Designing and overseeing data cleaning/transformation efforts.
    • Technical expertise:
      • Principles of data warehouse design.
      • Knowledge of OLAP and data mining techniques.
      • Experience with handling very large databases with unique needs for security, backup and recovery, data distribution, etc.

26. Challenges in Data Warehousing

  • Data cleaning and finding more dirty data than expected.
  • Coordinating the regular appending of new data from transactional databases to the data warehouse.
  • Managing very large databases.
  • Building and maintaining the data dictionary.