Post on 15-Jan-2016
1
ACCTG 6910: Building Enterprise & Business Intelligence Systems (e.bis)
Data Staging
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
2
The Business Dimensional Lifecycle
[Diagram of the lifecycle stages:]
• Project Planning
• Business Requirement Definition
• Technical Architecture Design
• Product Selection & Installation
• Dimensional Modeling
• Physical Design
• Data Staging Design & Development
• End-User Application Specification
• End-User Application Development
• Deployment
• Maintenance and Growth
• Project Management
3
Data Staging
[Diagram: source systems (DB2, Access, Excel, Legacy System) feed Data Staging, which loads the Data Warehouse (Oracle)]
4
Data Staging
• Extraction
• Data Cleansing
• Data Integration
• Transformation
• Transportation (Loading)
• Maintenance
5
Extraction
• Extract source data from legacy systems and place it in a staging area.
• To reduce the impact on the performance of legacy systems, source data is extracted without any cleansing, integration and transformation operations.
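The raw-copy idea above can be sketched as follows. This is a minimal illustration only: it uses in-memory SQLite databases as stand-ins for a legacy source and a staging area, and the `customer` table, its columns, and the `extract_to_staging` helper are hypothetical names, not part of the course material.

```python
import sqlite3

def extract_to_staging(source, staging, table):
    """Copy every row of `table` from source to staging, unchanged."""
    cur = source.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # No cleansing, integration or transformation happens here:
    # the staging table simply mirrors the source columns.
    staging.execute(f"CREATE TABLE IF NOT EXISTS stg_{table} ({col_list})")
    staging.executemany(
        f"INSERT INTO stg_{table} ({col_list}) VALUES ({placeholders})", cur
    )
    staging.commit()
    return staging.execute(f"SELECT COUNT(*) FROM stg_{table}").fetchone()[0]

# Hypothetical legacy source containing deliberately dirty data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customer (cid INTEGER, cname TEXT, gender TEXT)")
source.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "F"), (2, None, "m"), (3, "Bob", "M")],
)

staging = sqlite3.connect(":memory:")
rows_copied = extract_to_staging(source, staging, "customer")
```

The missing name and the lower-case gender code arrive in staging exactly as they were in the source; fixing them is left to the later cleansing steps, which keeps the load on the legacy system to a single read.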
6
Extraction
• A variety of file formats exist in legacy systems
– Relational database: DB2, Oracle, SQL Server, Informix, Access …
– Flat file: Excel file, text file
• Commercial data extraction tools are very helpful in data extraction.
– Ex: Oracle Data Mart Builder
7
Data Preparation (Cleansing)
It’s all about data quality!!!
8
Outline
• Measures for data quality
• Causes for data errors
• Common types of data errors
• Common error checks
• Correcting missing values
• Timing for error checks and corrections
• Steps of data preparation
9
Measures for Data Quality
• Correctness/Accuracy – w.r.t. the real data
• Consistency/Uniqueness – data values, references, measures and interpretations
• Completeness – scope of data & values
• Relevancy – w.r.t. the requirements
• Current data – relevant to the requirements
10
Causes for Data Errors
• Data entry errors
• Correct data not available at the time of data entry
• Entries made by different users at the same time, or by the same users over time
– Inconsistent or incorrect use of “codes”
– Inconsistent or incorrect interpretation of “fields”
• Transaction processing errors
• System and recovery errors
• Data extract/transformation errors
11
Common Data Errors
• Missing (null) values
• Incorrect use of default values (e.g., zero)
• Data domain integrity violation (e.g., 0/1)
• Data value (dependency) integrity violation (e.g., if mm=02 then DD<30)
• Data referential integrity violation (e.g., a customer’s order record cannot exist unless the customer record already exists)
12
Common Data Errors, Cont’d
• Data retention integrity violation (e.g., old inventory snapshots should not be stored)
• Data derivation/transformation/aggregation integrity violation (e.g., profit ≠ sales – costs)
• Inconsistent data values of the same data (M versus m for male)
• Inconsistent use of the same data value (DM for Data Mining and Data Marts)
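Two of the error types above lend themselves to a short illustration: a derivation that no longer holds (profit versus sales minus costs) and the same value coded inconsistently (M versus m). The sketch below assumes hypothetical record fields (`id`, `gender`, `sales`, `costs`, `profit`) and a small rounding tolerance; it is one possible way to detect and standardize such values, not a prescribed method.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "gender": "M", "sales": 100.0, "costs": 60.0, "profit": 40.0},
    {"id": 2, "gender": "m", "sales": 80.0, "costs": 50.0, "profit": 35.0},
    {"id": 3, "gender": "F", "sales": 50.0, "costs": 20.0, "profit": 30.0},
]

def derivation_violations(rows, tolerance=0.005):
    """Return ids of rows where profit differs from sales - costs."""
    return [
        r["id"]
        for r in rows
        if abs(r["profit"] - (r["sales"] - r["costs"])) > tolerance
    ]

def standardize_gender(rows):
    """Map case variants of the same code to one upper-case value."""
    return [{**r, "gender": r["gender"].strip().upper()} for r in rows]

bad_ids = derivation_violations(records)   # record 2: 35 is not 80 - 50
clean = standardize_gender(records)        # "m" becomes "M"
```

Upper-casing only resolves case variants of one code. A code used for two different meanings, such as DM for both Data Mining and Data Marts, needs a lookup table agreed with domain experts instead.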
13
Error Checks
• Domain value validation
• Value dependency validation
• Referential integrity validation
• Identify missing-value or default-value records
• Identify outliers
• Cross-footing: check aggregates and derivations across different levels and against common sense
• Eyeballs!
• Process validation
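Several of these checks can be expressed as small predicates over staged records. The following is a rough sketch under assumed names: the `orders` records, the `flag` field with domain 0/1, and the helper functions are all hypothetical, and the standard-library `calendar.monthrange` is used to get the number of days in a month for the dependency check.

```python
import calendar

customers = {101, 102}  # known customer ids (hypothetical)
orders = [
    {"order_id": 1, "cust_id": 101, "flag": 1, "year": 2015, "month": 2, "day": 28},
    {"order_id": 2, "cust_id": 999, "flag": 0, "year": 2015, "month": 2, "day": 30},
    {"order_id": 3, "cust_id": 102, "flag": 7, "year": 2016, "month": 2, "day": 29},
]

def check_order(order, known_customers):
    """Return the list of validation failures for one order record."""
    errors = []
    # Domain value validation: flag must be 0 or 1.
    if order["flag"] not in (0, 1):
        errors.append("domain")
    # Value dependency validation: the day must exist in that month and year.
    days_in_month = calendar.monthrange(order["year"], order["month"])[1]
    if not 1 <= order["day"] <= days_in_month:
        errors.append("dependency")
    # Referential integrity validation: the customer must already exist.
    if order["cust_id"] not in known_customers:
        errors.append("referential")
    return errors

def cross_foot(detail_amounts, reported_total, tolerance=0.005):
    """Cross-footing: detail rows should add up to the reported aggregate."""
    return abs(sum(detail_amounts) - reported_total) <= tolerance

report = {o["order_id"]: check_order(o, customers) for o in orders}
totals_agree = cross_foot([10.0, 20.0, 30.0], 60.0)
```

Returning a list of failure labels per record, rather than raising on the first problem, makes it easier to count defects by type before deciding how to correct them.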
14
Data Cleaning: Missing Values
1. Exclude the record
2. Exclude the attribute/field
3. Replace with a global constant
4. Replace with the attribute mean
5. Replace with the most probable value
6. Apply 3–5 by class/segment of records
7. Manual correction
8. Application-specific algorithm
Options 1–6 are less practical for OLAP-bound data
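As an illustration of the replacement strategies in this list, the sketch below fills missing values with a global constant, with the attribute mean, and with the mean computed per class or segment. The `rows` data, the `segment` and `amount` fields, and the function names are assumptions made for the example.

```python
from statistics import mean

# Hypothetical staged rows; None marks a missing value.
rows = [
    {"segment": "retail", "amount": 10.0},
    {"segment": "retail", "amount": None},
    {"segment": "retail", "amount": 30.0},
    {"segment": "wholesale", "amount": 100.0},
    {"segment": "wholesale", "amount": None},
]

def fill_constant(rows, field, constant):
    """Replace missing values with a global constant."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]

def fill_mean(rows, field):
    """Replace missing values with the mean of the non-missing values."""
    avg = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: avg if r[field] is None else r[field]} for r in rows]

def fill_mean_by_class(rows, field, class_field):
    """Replace missing values with the mean of the record's own class."""
    groups = {}
    for r in rows:
        if r[field] is not None:
            groups.setdefault(r[class_field], []).append(r[field])
    means = {k: mean(v) for k, v in groups.items()}
    # A class with no observed values keeps None (means.get returns None).
    return [
        {**r, field: means.get(r[class_field]) if r[field] is None else r[field]}
        for r in rows
    ]

by_constant = fill_constant(rows, "amount", 0.0)
by_mean = fill_mean(rows, "amount")
by_class = fill_mean_by_class(rows, "amount", "segment")
```

In this sample the overall mean pulls the missing retail amount toward the much larger wholesale figure, while the per-class mean keeps it within the retail range. Either way the filled value is an estimate that will be summed and drilled into like real data, which is one reason such replacements sit poorly with OLAP-bound data.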
15
Timing for Error Checking
• During Data Staging
• During Data Loading
• Others
– Before data extraction (data entries, transaction processing, recovery, audits, etc.)
– After data loading
16
Steps of Data Preparation
• Identify data sources
• Extract and analyze source data
• Standardize data
• Correct and complete data
• Match and consolidate data
• Analyze data defect types
• Transform and enhance data into target
• Calculate derivations and summary data
• Audit and control data extract, transformation and loading
17
Data Integration
• Data from different data sources with different formats needs to be integrated into one data warehouse
– Ex: three customer tables, from the sales department, the marketing department and an acquired company
Customer (cid, cname, city …)
Customer (customerid, customername, city…)
Customer (custid, custname, cname,…)
18
Data Integration
• Same attribute with different names: cid, customerid, custid
• Different attributes with the same name:
– cname -> customer name
– cname -> city name
• Same attribute with different formats
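The naming conflicts above are typically resolved with an explicit per-source column mapping. The sketch below is one way to write that down: the source labels (`sales`, `marketing`, `acquired`), the target column names (`customer_id`, `customer_name`, `city_name`), and the `to_warehouse` helper are hypothetical, while the source column names come from the three Customer tables in the example.

```python
# Hypothetical mapping from each source's columns to one warehouse schema.
COLUMN_MAPS = {
    "sales": {"cid": "customer_id", "cname": "customer_name", "city": "city_name"},
    "marketing": {
        "customerid": "customer_id",
        "customername": "customer_name",
        "city": "city_name",
    },
    # In the acquired company's table, cname holds the city name.
    "acquired": {"custid": "customer_id", "custname": "customer_name", "cname": "city_name"},
}

def to_warehouse(source, row):
    """Rename one source row's columns to the warehouse column names."""
    mapping = COLUMN_MAPS[source]
    renamed = {mapping[col]: value for col, value in row.items() if col in mapping}
    # Keep the origin so records can be traced and later matched.
    return {"source": source, **renamed}

from_sales = to_warehouse("sales", {"cid": 7, "cname": "Acme", "city": "Provo"})
from_acquired = to_warehouse("acquired", {"custid": 7, "custname": "Acme", "cname": "Provo"})
```

Keying the mapping by source is what lets `cname` mean customer name in one table and city name in another without ambiguity.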
19
Data Integration
• How to integrate
– Get the schemas of all data sources
– Get the schema of the data warehouse
– Integrate source schemas with the help of commercial tools and domain experts
20
Transformation
• Prepare data for loading into the data warehouse
– Change the data format
– Create derived attributes and tables
– Aggregate
– Create warehouse keys
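The four transformation tasks above can be sketched on a handful of staged rows. Everything named here is an assumption for illustration: the `staged` rows, the MM/DD/YYYY source date format, the `profit` derivation, and the simple in-memory counter used to hand out warehouse (surrogate) keys.

```python
from collections import defaultdict
from datetime import datetime
from itertools import count

# Hypothetical staged rows with a source-specific date format.
staged = [
    {"custid": "C-17", "order_date": "01/15/2016", "sales": 100.0, "costs": 60.0},
    {"custid": "C-17", "order_date": "01/20/2016", "sales": 50.0, "costs": 20.0},
    {"custid": "C-42", "order_date": "02/03/2016", "sales": 80.0, "costs": 50.0},
]

_next_key = count(1)
customer_keys = {}

def warehouse_key(natural_id):
    """Create warehouse keys: one integer per distinct source identifier."""
    if natural_id not in customer_keys:
        customer_keys[natural_id] = next(_next_key)
    return customer_keys[natural_id]

def transform(row):
    date = datetime.strptime(row["order_date"], "%m/%d/%Y").date()
    return {
        "customer_key": warehouse_key(row["custid"]),
        "order_date": date.isoformat(),          # change the data format
        "month": date.strftime("%Y-%m"),
        "sales": row["sales"],
        "profit": row["sales"] - row["costs"],   # derived attribute
    }

facts = [transform(r) for r in staged]

# Aggregate: monthly sales per customer.
monthly_sales = defaultdict(float)
for f in facts:
    monthly_sales[(f["customer_key"], f["month"])] += f["sales"]
```

A real warehouse would persist the key mapping between loads so that the same customer keeps the same warehouse key; the dictionary here only lives for one run.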
21
Transportation
• Use bulk load tools, such as Oracle SQL*Loader, instead of SQL commands
• Create indexes
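The load-then-index pattern can be imitated on a small scale. The sketch below is not SQL*Loader: it uses SQLite's `executemany` inside a single transaction as a stand-in for a bulk load tool, and the `sales_fact` table, its columns, and the index name are hypothetical. It only shows the ordering of the two steps, loading all rows in one batch and creating the index afterwards.

```python
import sqlite3

def bulk_load(conn, rows):
    """Insert all rows in one transaction, then build the index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(customer_key INTEGER, month TEXT, sales REAL)"
    )
    with conn:  # commits once, after the whole batch is inserted
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    # Indexing after the load avoids updating the index on every insert.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_sales_customer "
        "ON sales_fact (customer_key)"
    )
    conn.commit()

dw = sqlite3.connect(":memory:")
bulk_load(dw, [(1, "2016-01", 150.0), (2, "2016-02", 80.0)])

loaded = dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
index_names = [r[1] for r in dw.execute("PRAGMA index_list('sales_fact')")]
```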
22
Maintenance
• Maintenance frequency: daily, weekly, monthly
• Identify change records and new records in legacy systems
– Create timestamps for changes and new records in legacy systems
– Compare data between legacy systems and DW
• Load changes and new records into DW
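Both ways of identifying changed and new records can be sketched briefly. The example assumes a hypothetical `updated_at` timestamp column in the legacy system, a `cid` key, and a single compared field (`city`); the function names are illustrative.

```python
from datetime import datetime

# Hypothetical legacy rows carrying a last-modified timestamp.
source_rows = [
    {"cid": 1, "city": "Salt Lake City", "updated_at": datetime(2016, 1, 10)},
    {"cid": 2, "city": "Provo", "updated_at": datetime(2016, 1, 14)},
    {"cid": 3, "city": "Ogden", "updated_at": datetime(2016, 1, 15)},
]
# Hypothetical current warehouse contents.
warehouse_rows = [
    {"cid": 1, "city": "Salt Lake City"},
    {"cid": 2, "city": "Orem"},
]

def changed_since(rows, last_load):
    """Timestamp approach: rows created or modified after the last load."""
    return [r["cid"] for r in rows if r["updated_at"] > last_load]

def diff_against_warehouse(rows, current_rows, key, fields):
    """Comparison approach: find keys that are new or whose fields differ."""
    current = {r[key]: r for r in current_rows}
    new, changed = [], []
    for r in rows:
        old = current.get(r[key])
        if old is None:
            new.append(r[key])
        elif any(r[f] != old[f] for f in fields):
            changed.append(r[key])
    return new, changed

delta_ids = changed_since(source_rows, last_load=datetime(2016, 1, 12))
new_ids, changed_ids = diff_against_warehouse(
    source_rows, warehouse_rows, key="cid", fields=("city",)
)
```

The timestamp approach only reads recent rows but depends on the legacy system maintaining the timestamp reliably; the comparison approach needs no change to the legacy system but has to read both sides in full.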