1. Data Warehousing
- Dr. Anol Bhattacherjee, University of South Florida
- E-mail: [email protected]
- Web: http://coba.usf.edu/ABhatt/
- Telephone: (813) 974-6760
- Copyright 2006 Anol Bhattacherjee
2. Agenda
- Data warehouse:
-
- Production databases versus data warehouse.
-
- Characteristics.
-
- Structure: Dimensions, hierarchies.
-
- Design: Star schema.
-
- Building: ETL methodology.
- Online analytical processing (OLAP):
-
- Examples.
-
- Techniques.
- Data mining:
-
- Techniques.
- Data warehouse administrator.
3. Two Categories of Business Systems
- Transaction processing systems (TPS):
-
- Application systems used by company employees for everyday operational tasks, such as sales, manufacturing, and customer support.
-
- Employ production databases.
- Decision support systems (DSS):
-
- Systems specifically designed to aid managers in decision-making tasks, such as budgeting, forecasting, and planning.
-
- Employ data warehouses and/or data marts.
-
- Require analytical capabilities, such as OLAP and data mining tools.
-
- Also called business intelligence (BI) systems.
4. Data Warehouse
- What it is:
-
- A subject-oriented, integrated, time-variant, enterprise-wide repository of historical data designed to support executive decision-making.
-
- Data is aggregated by business dimensions (e.g., region, year, product line), and can be analyzed along these dimensions.
-
- Allows trend analysis, planning, etc. without complex SQL queries.
- Example: Wal-Mart's RetailLink system:
-
- Gives suppliers full access to WM's sales and inventory data in real time for collaborative planning, forecasting, and replenishment (CPFR).
-
- Powered by NCR's Teradata servers:
-
-
- Runs 30+ business applications.
-
-
-
- Supports 18,000+ users (WM managers).
-
-
-
- Handles 120,000 queries/week.
-
-
-
- Receives 8.4 million updates/minute (transactions) at peak time.
-
5. Data Warehouse versus Databases
| Data Warehouse | Production Databases |
| --- | --- |
| Supports decision support systems used for managerial decision making | Supports transaction processing systems used in everyday business operations |
| Terabytes in size | MB/GB in size |
| Supports special analytical operations such as drill-down and slice and dice | No special analytical operations supported |
| Aggregated from production databases | Exists independently |
| Poor for data input/output, but uses vector arithmetic for fast computation | Good for data input/output, but poor for computation (e.g., aggregate) |
| Supports time-series/periodicity | No specific support for time-series |
| Data stored in multidimensional format | Data stored in relational format |
6. Characteristics of DW Data
- Subject-oriented:
-
- Data is organized around subjects or business dimensions, such as sales, customers, orders, claims, accounts, employees, etc.
- Integrated:
-
- Data is collected from several transactional databases, and integrated in a way to provide a unified picture of each subject over time.
-
- Data from different databases is transformed into a common schema, measurement, code, and data type.
- Aggregated:
-
- Data stored is not transaction-level, but aggregated by products, regions, months/years, or some other business dimension.
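The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transaction rows, field names, and the `aggregate_by_dimensions` helper are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical transaction-level rows, as they might sit in a production database.
transactions = [
    {"date": "2006-01-03", "region": "Northeast", "product_line": "Books", "amount": 20.0},
    {"date": "2006-01-17", "region": "Northeast", "product_line": "Books", "amount": 35.0},
    {"date": "2006-01-09", "region": "Midwest", "product_line": "Books", "amount": 15.0},
    {"date": "2006-02-02", "region": "Northeast", "product_line": "Music", "amount": 12.5},
]

def aggregate_by_dimensions(rows):
    """Sum amounts by (month, region, product line); transaction detail is dropped."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM" serves as the time stamp of the aggregate
        totals[(month, row["region"], row["product_line"])] += row["amount"]
    return dict(totals)

warehouse_rows = aggregate_by_dimensions(transactions)
for key, total in sorted(warehouse_rows.items()):
    print(key, total)
```

Four transactions collapse into three warehouse rows, each carrying its time period, which is the subject-oriented, aggregated, time-stamped shape the slide describes.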
7. Characteristics of DW Data
- Historical:
-
- Data updated at some time interval: weekly, monthly, etc.
-
- Data stored by weeks, months, etc. for historical comparison and trend analysis.
- Time variant:
-
- Data always includes a timestamp (e.g., sales by weeks, months, quarters, or years).
- Non-volatile:
-
- Data is historical, and does not change with time.
- Denormalized:
-
- Denormalized data is used to improve query performance, though it also increases update time and introduces data integrity problems.
-
- Works because historic data in the data warehouse is rarely updated.
8. Data Warehouse versus Data Marts
- Enterprise data warehouse (EDW):
-
- Large-scale data repository that incorporates aggregated historical data for an entire company, division, or business unit.
-
- Built around many subjects, can support a wide range of decision tasks.
- Data marts:
-
- Small-scale data repository serving the needs of one department.
-
- Based on a limited number of subjects (sometimes one).
-
- Constructed from a few transactional databases or a subset of EDW data.
-
- Provides a buffer between managers and EDW: managers work with DM data, so that even if the DM data is corrupted, EDW data is unchanged.
- Which is done first:
-
- Top-down development: EDW is created first, from which data is extracted to create one or more DMs.
-
- Bottom-up approach: Build independent DMs as needed, overall EDW built later from existing DMs.
9. Dimensions of a Data Warehouse
[Figures: Two-dimensional data warehouse; Three-dimensional data warehouse]
- Data warehouses can have four or more dimensions.
10. Dimensions and Hierarchies
- Two key characteristics of a DW:
-
- Subject orientation of data.
-
- Temporal nature of data (time dimension).
- Multidimensional databases:
-
- Each DW subject reflects a separate business dimension (e.g., product line, sales area, year, etc.), hence DWs are often called multidimensional databases.
-
- Multidimensional databases are not yet mature or sophisticated; hence multidimensional data is often stored in relational databases.
- Hierarchies:
-
- Business dimensions can be organized into hierarchies:
-
-
- Sales area by city, county, state, region, etc.
-
-
-
- Time grouped by day, month, quarter, year, etc.
-
-
- Drill-down analysis: Moving from a higher to a lower level of a hierarchy to retrieve more detailed data.
-
- Slice and dice: Extracting data across two hierarchies (e.g., product line by sales area).
11. Hierarchies
[Figure: Sales, organized by product, sales area, and time hierarchies]
- Products (hierarchy):
  - By product lines
  - By responsibility centers
  - By work centers
- Sales area (hierarchy):
  - Region: Northeast
    - State: NY
      - Area: NYC
      - Area: Albany
      - Area: Buffalo
      - Area: Long Island
    - State: NJ
    - State: PA
  - Region: Midwest
  - Region: West
- Time (hierarchy):
  - Year: 1995
    - Quarter: Q1
      - Month: January
        - Day: 01
        - Day: 02
      - Month: February
    - Quarter: Q2
  - Year: 1996
  - Year: 1997
- Drill-down: Overall sales figures for NY vs. sales figures for NYC, Albany, Buffalo, etc.
- Slice and dice: Sales of individual product lines in NYC vs. Albany vs. Buffalo, etc.
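As a rough sketch of how drill-down and slice and dice operate on the sales-area hierarchy, the Python below totals sales at one level and then breaks a state into its areas. The sales figures, the `Newark` area, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, state, area, product_line, amount).
sales = [
    ("Northeast", "NY", "NYC", "Books", 100.0),
    ("Northeast", "NY", "NYC", "Music", 40.0),
    ("Northeast", "NY", "Albany", "Books", 30.0),
    ("Northeast", "NY", "Buffalo", "Music", 25.0),
    ("Northeast", "NJ", "Newark", "Books", 50.0),
]

LEVELS = {"region": 0, "state": 1, "area": 2}  # positions in the sales-area hierarchy

def total_by(level, rows):
    """Total sales grouped at one level of the sales-area hierarchy."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[LEVELS[level]]] += row[4]
    return dict(totals)

def drill_down(state, rows):
    """Break one state's total into its areas (one level lower in the hierarchy)."""
    return total_by("area", [r for r in rows if r[1] == state])

def product_by_area(rows):
    """Slice-and-dice style view: product line crossed with sales area."""
    totals = defaultdict(float)
    for _, _, area, product_line, amount in rows:
        totals[(product_line, area)] += amount
    return dict(totals)

print(total_by("state", sales))   # state-level totals
print(drill_down("NY", sales))    # NY broken down by area
print(product_by_area(sales))     # product lines compared across areas
```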
12. Designing a Data Warehouse
- Star schema:
-
- Design technique used to create multidimensional tables using a relational database.
- Two components:
-
- Fact table: Sales.
-
- Dimension tables: Time Period, Salesperson, Products.
- Snowflake design:
-
- One dimension table (Car) leads to another dimension table (Manufacturer).
[Figure: Lucky Rent-A-Car Data Warehouse Design]
13. Star Schema Example
- Fact table provides sales statistics broken down by product, period, and store dimensions.
- Dimension tables contain descriptions about subjects of the business.
- 1:N relationship between fact and dimension tables.
14. Star Schema With Sample Data
15. Design Considerations in Star Schema
- Fact table:
-
- Should contain quantitative time-period data.
-
- Granularity: what level of detail should you store in fact table?
-
- Transactional grain (finest level) versus aggregated grain (summarized).
-
- Finer grain provides better analysis capability, but requires more rows in dimension and fact tables and hence gives slower performance.
- Dimension table:
-
- Keys must be time-invariant (i.e., not dependent on business data that can change).
-
- Should be denormalized to maximize performance.
- Relationship:
-
- 1:N relationship between each dimension table and the fact table (one dimension row relates to many fact rows).
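A minimal star schema can be sketched with Python's built-in `sqlite3` module. The table names, columns, and sample rows below are hypothetical, chosen to mirror a sales fact table with period, product, and store dimensions; they are not the schema shown in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_period  (period_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, product_line TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
-- Fact table: one row per (period, product, store) with quantitative measures.
CREATE TABLE fact_sales (
    period_key  INTEGER REFERENCES dim_period(period_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")
conn.executemany("INSERT INTO dim_period VALUES (?, ?, ?)", [(1, 2005, "Q1"), (2, 2005, "Q2")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Novel", "Books"), (2, "Album", "Music")])
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)",
                 [(1, "Tampa", "FL"), (2, "Albany", "NY")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 150.0),
    (1, 2, 1, 4, 60.0),
    (2, 1, 2, 7, 105.0),
    (2, 1, 1, 3, 45.0),
])

# Join the fact table to two of its dimensions and aggregate revenue.
revenue_by_line_and_quarter = conn.execute("""
    SELECT p.product_line, t.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_period  t ON t.period_key  = f.period_key
    GROUP BY p.product_line, t.quarter
    ORDER BY p.product_line, t.quarter
""").fetchall()
print(revenue_by_line_and_quarter)
```

The dimension keys here are plain integers with no business meaning, so they stay stable even if a product is renamed or a store changes city. Choosing a coarser grain would mean storing fewer, pre-summed fact rows.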
16. Building a Data Warehouse
- ETL methodology:
- Extract data
- Transform data
- Load data
[Figure: ETL data flow. Sources: API, flat files, Oracle, 3rd-party feeds, VSAM. Stages: Extract, Transform, Load. Stores: temporary data hub, data warehouse, data marts.]
17. ETL Methodology
- Data extraction:
-
- Process of copying relevant data from a variety of transactional databases for inclusion in a DW.
-
- May occur at regular intervals (e.g., weekly, monthly) to add new data.
-
- Data from incompatible databases, flat files, text documents, etc. must be filtered through appropriate APIs (application programming interfaces) as needed.
- Data transformation:
-
- Covered on the next slide.
- Data loading:
-
- Extracted, cleaned, and transformed data is loaded into DW at a predetermined data refresh frequency.
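The three ETL steps can be sketched as three small Python functions. Everything here is a simplified assumption: the two source layouts, the field names, and the in-memory `warehouse` dictionary stand in for real source systems and a real DW.

```python
from collections import defaultdict

# Hypothetical source systems with incompatible layouts.
orders_db = [  # relational source: ISO dates, amounts in dollars
    {"order_date": "2006-03-01", "state": "FL", "amount_usd": 120.0},
    {"order_date": "2006-03-15", "state": "NY", "amount_usd": 80.0},
]
legacy_flat_file = [  # flat-file source: MM/DD/YYYY dates, amounts in cents
    "03/20/2006|fl|4500",
    "04/02/2006|ny|9900",
]

def extract():
    """Copy the relevant records out of each source, unchanged."""
    return list(orders_db), list(legacy_flat_file)

def transform(db_rows, flat_lines):
    """Convert both sources to a common schema: (month, state, amount in dollars)."""
    unified = [(r["order_date"][:7], r["state"], r["amount_usd"]) for r in db_rows]
    for line in flat_lines:
        date, state, cents = line.split("|")
        month, _day, year = date.split("/")
        unified.append((f"{year}-{month}", state.upper(), int(cents) / 100))
    return unified

def load(warehouse, rows):
    """Append the batch to the warehouse, aggregated by month and state."""
    for month, state, amount in rows:
        warehouse[(month, state)] += amount

warehouse = defaultdict(float)
load(warehouse, transform(*extract()))
print(dict(warehouse))
```

In practice the same `extract`, `transform`, `load` sequence would run at each refresh interval (e.g., weekly or monthly), adding a new batch to the existing warehouse contents.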
18. Building a Data Warehouse
- Data transformation/cleaning:
-
- Data extracted from transactional databases must be cleaned (scrubbed) and transformed before loading into a DW.
-
- Format differences across different tables/databases must be reconciled.
-
- Missing or misspelled data values must be resolved.
-
- Erroneous data are identified using application programs, and scrutinized/corrected by DW analysts using system-generated exception reports.
-
- Transaction-level data is aggregated by business dimensions.
-
- Key step in DW construction since DW is very sensitive to data errors.
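A system-generated exception report of the kind mentioned above might be produced by checks like the following. This is a hedged sketch: the rows, the customer list, and the `MAX_BOOK_PRICE` limit are hypothetical values picked for illustration, and real cleaning rules would be set by DW analysts.

```python
# Hypothetical reference data; the price limit is an assumed business rule.
customers = {10001, 10002}
MAX_BOOK_PRICE = 500.00

orders = [
    {"row": 1, "customer_id": 10001, "city": "Tampa", "quantity": 2, "price": 24.99},
    {"row": 2, "customer_id": 12738, "city": "Miami", "quantity": 1, "price": 15.00},
    {"row": 3, "customer_id": 10002, "city": "", "quantity": 1, "price": 3200.99},
]

def find_exceptions(rows, known_customers):
    """Return (row number, problem) pairs for an analyst to review."""
    report = []
    for r in rows:
        if r["customer_id"] not in known_customers:
            report.append((r["row"], "referential integrity: unknown customer"))
        if not r["city"].strip():
            report.append((r["row"], "missing data: city is blank"))
        if not 0 < r["price"] <= MAX_BOOK_PRICE:
            report.append((r["row"], "out-of-range data: price"))
    return report

exception_report = find_exceptions(orders, customers)
for row_number, problem in exception_report:
    print(row_number, problem)
```

The checks only flag suspect rows; deciding whether a flagged value is actually wrong, and how to correct it, is left to a person.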
[Figure: Challenges of Data Reconciliation]
- Life Insurance Database: PK: SS# (123-45-6789), Name (Robert G. Smith)
- Auto Insurance Database: PK: DL# (FL-B12345678), Name (Bob Smith)
- Home Insurance Database: PK: Acc# (12345678905), Name (R. G. Smith)
19. Data Cleaning Example
[Figure: Good Reading Bookstores sample data]
- Questionable data: Is book quantity correct?
- Out-of-range data: A single book can't cost $3,200.99.
- Referential integrity problem: Customer# 12738 does not exist in Customers table.
- Possible misspelling: Do rows 3 & 8 refer to the same person?
- Missing data: City is blank.
- Questionable data: State for rows 2 & 6 could be the same.
20. Using a Data Warehouse: OLAP
- Online analytical processing (OLAP):
-
- A decision support approach based on viewing data by dimensions.
-
- Well suited for multidimensional data hierarchies in a DW.
- OLAP techniques:
-
- Drill-down: Retrieving finer levels of data detail.
-
- Slice: Data subset based on a single value of one dimension.
-
- Pivot or Rotation: Interchanging data dimensions in a slice.
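Slice and pivot can be illustrated on a tiny in-memory cube. The sketch below assumes a hypothetical cube keyed by (product, region, year); the cell values and function names are invented for the example.

```python
# Hypothetical cube cells keyed by (product, region, year) -> sales.
cube = {
    ("Books", "Northeast", 1995): 100, ("Books", "Midwest", 1995): 60,
    ("Music", "Northeast", 1995): 40,  ("Music", "Midwest", 1995): 30,
    ("Books", "Northeast", 1996): 120, ("Music", "Midwest", 1996): 35,
}

def slice_cube(cells, year):
    """Slice: fix one value of the year dimension, keeping a product x region table."""
    return {(p, r): v for (p, r, y), v in cells.items() if y == year}

def pivot(table):
    """Pivot (rotation): swap the row and column dimensions of a two-dimensional slice."""
    return {(col, row): v for (row, col), v in table.items()}

slice_1995 = slice_cube(cube, 1995)
rotated = pivot(slice_1995)
print(slice_1995)  # keyed by (product, region)
print(rotated)     # keyed by (region, product)
```

The slice reduces the three-dimensional cube to two dimensions, and the pivot changes only how that slice is laid out, not the values in it.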
[Figure: Slice operation]
21. Drill Down
[Figures: Drill-down by Package Size; Drill-down by Package Size and Color]
22. OLAP Reports: Yahoo Stores
Yahoo Page Stats: Last 365 Days [vitanet]. 50/16315 entries shown. Sort options: By Hits, By Count of Items Sold, By Count of Orders, By Revenue.
| Hits | Items Sold | Revenue | Page |
| --- | --- | --- | --- |
| 329,215 | | | Vitanet Front Page |
| 79,640 | 9,373 | 211567.34 | Xenadrine RFA-1, 120 capsules |
| 24,790 | | | Ind |
| 23,147 | | | Rate Us |
| 16,626 | | | Shop by 100s of Vitamin/Mineral |
| 12,645 | 856 | 19776.90 | Ripped Fuel 200 capsules Twinlab |
| 11,885 | | | TwinLab Products |
| 7,471 | 11 | 172.45 | Free Samples Drawing Win 6000 grams |
| 7,446 | | | Vita-net Nutritional Products |
| 7,231 | | | Creatine Monohydrate |
| 6,917 | | | Aphrodisiacs |
| 6,896 | | | Androstenedione |
| 6,162 | | | On Sale Items |
| 6,149 | 55 | 1207.25 | Natural Sex Woman |
| 5,859 | | | Growth Hormone |
| 5,843 | 1,070 | 37115.60 | Hydroxycut 240 Capsules MuscleTech |
| 5,756 | 247 | 9345.35 | Creatine Monohydrate 2000 grams |
23. OLAP Examples
- Sears Strategic Performance Reporting System (SPRS):
-
- Goal: Daily tuning of buying, merchandising, and marketing strategies.
-
- Tracks real-time sales; inventory in stores, transit, and distribution centers; promotion outcomes by item, location, promotion, etc.
-
- Analytics: Price-reduction modeling to move products; inventory analysis; customer profitability analysis; store reconfiguration.
-
- 1.7 TB data warehouse, replacing 18 prior databases.
- British Telecom's Interactive and Reporting Information System (IRIS):
-
- Goal: Tracking 10,000+ ongoing service projects using key performance indicators such as project cost, status, etc.
-
- Permits precanned reports, modeling, forecasting, analytics, etc.
-
- Implemented with SAP's Strategic Enterprise Management suite, using a real-time data feed from SAP's Business Warehouse application.
24. Using a Data Warehouse: Data Mining
- Data mining:
-
- Searching for hidden patterns or knowledge in a company's data using a blend of statistical, AI, and computer graphics techniques.
-
- Goal is to discover new knowledge or explain observed events.
- Applications:
-
- Identify patterns of credit card fraud.
-
- Identify patterns of consumer purchases.
- Data mining techniques:
-
- Decision trees.
-
- Case-based reasoning.
-
- Neural networks.
-
- Genetic algorithms.
25. Data Warehouse Administrator
- Specialized personnel in charge of the DW maintenance/upgrade.
- Needs three kinds of expertise:
-
- Business expertise:
-
-
- Company's business processes and transactional data/databases.
-
-
-
- Company's business goals to know what data should be stored in DW.
-
-
- Data expertise:
-
-
- Transactional data/databases for selection and integration into DW.
-
-
-
- Designing and overseeing data cleaning/transformation efforts.
-
-
- Technical expertise:
-
-
- Principles of data warehouse design.
-
-
-
- Knowledge of OLAP and data mining techniques.
-
-
-
- Experience with handling very large databases with unique needs for security, backup and recovery, data distribution, etc.
-
26. Challenges in Data Warehousing
- Cleaning data, and finding more dirty data than expected.
- Coordinating the regular appending of new data from transactional databases to the data warehouse.
- Managing very large databases.
- Building and maintaining the data dictionary.