Agenda 02/21/2013
Discuss exercise
Answer questions in task #1
Put up your sample databases for tasks #2 and #3
Define ETL in more depth by the activities performed.
Discuss the “controversy” in ETL activities
Discussed in prior classes...
Lots of data.
Traditional transaction processing systems
Non-traditional transaction processing: call center; click-stream; loyalty card; warranty cards/product registration information
External data from government and commercial entities.
Lots of poor quality data for lots of reasons that can be traced back to lots of people.
Populating the data warehouse
Extract: Take data from source systems. May require middleware to gather all necessary data.
Transformation: Put data into consistent format and content. Validate data by checking accuracy and consistency against pre-defined and agreed-upon business rules. Convert data as necessary.
Load: Use a batch (bulk) update operation that keeps track of what is loaded, where, when and how. Keep a detailed load log to audit updates to the data warehouse.
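The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `sales` and `load_log` table names, and the single business rule (amount must be numeric) are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical source rows, standing in for data pulled from a source system.
SOURCE_ROWS = [
    {"id": "1", "state": "il", "amount": "19.50"},
    {"id": "2", "state": "IL", "amount": "not-a-number"},  # fails validation
    {"id": "3", "state": "wi", "amount": "7"},
]

def extract():
    """Extract: take data from the source system."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transform: validate against a business rule and convert formats."""
    good, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])  # assumed rule: amount must be numeric
        except ValueError:
            rejected.append(row)
            continue
        # Convert to a consistent format: integer id, upper-case state code.
        good.append((int(row["id"]), row["state"].upper(), amount))
    return good, rejected

def load(conn, rows, rejected_count):
    """Load: bulk insert, then log what was loaded, where, and when."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(id INTEGER, state TEXT, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS load_log "
                 "(target TEXT, loaded_at TEXT, "
                 "rows_loaded INTEGER, rows_rejected INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("INSERT INTO load_log VALUES (?, ?, ?, ?)",
                 ("sales", datetime.now(timezone.utc).isoformat(),
                  len(rows), rejected_count))
    conn.commit()

def run_etl(conn):
    good, rejected = transform(extract())
    load(conn, good, len(rejected))
    return good, rejected
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads the two valid rows and writes one row to `load_log`, which is the audit trail the Load step asks for.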
Data Cleansing
Source systems contain “dirty data” that must be cleansed
ETL software contains rudimentary to very sophisticated data cleansing capabilities
Industry-specific data cleansing software is often used. It is especially important for name and address correction.
Leading data cleansing vendors include general hardware/software vendors such as IBM, Oracle, SAP, and Microsoft, and specialty vendors such as Information Builders (DataMigrator), Harte-Hanks (Trillium), CloverETL, Talend, and BusinessObjects (Centric).
Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Examples include parsing the first, middle, and last name; street number and street name; and city and state.
Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL
Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
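A minimal sketch of this parsing step, assuming the source record always arrives as the five lines shown in the example (name and title, firm, location, street line, city and state). The function name `parse_record` is hypothetical, and real parsers handle far messier layouts.

```python
def parse_record(lines):
    """Split a five-line name/address block into named fields.

    Assumes the fixed layout of the slide's example: name and title,
    firm, location, street line, then city and state.
    """
    name_part, title = [s.strip() for s in lines[0].split(",", 1)]
    first, middle, last = name_part.split()  # assumes exactly three name tokens
    number, street = lines[3].strip().split(" ", 1)
    city, state = [s.strip() for s in lines[4].split(",", 1)]
    return {
        "First Name": first,
        "Middle Name": middle,
        "Last Name": last,
        "Title": title,
        "Firm": lines[1].strip(),
        "Location": lines[2].strip(),
        "Number": number,
        "Street": street,
        "City": city,
        "State": state,
    }

source = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]
parsed = parse_record(source)
```

Each element of the source block ends up isolated under its own key, mirroring the target file above.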
Correcting
Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
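One way to picture the use of a secondary data source is a lookup against an address reference table. The sketch below assumes a hypothetical in-memory table, `ADDRESS_REFERENCE`, whose single entry simply mirrors the slide's example. Commercial tools rely on postal databases and fuzzier algorithms than an exact-key lookup.

```python
# Hypothetical secondary data source: a tiny address reference table keyed by
# (number, street, city, state). The one entry mirrors the slide's example.
ADDRESS_REFERENCE = {
    ("12800", "Lake Calumet", "Hedgewisch", "IL"): {
        "Street": "South Butler Drive",
        "City": "Chicago",
        "Zip": "60633",
        "Zip+Four": "2398",
    },
}

def correct_address(record):
    """Return a copy of the record with address fields corrected, if known."""
    key = (record["Number"], record["Street"], record["City"], record["State"])
    fix = ADDRESS_REFERENCE.get(key)
    if fix is None:
        return dict(record)  # no reference match: leave the record unchanged
    return {**record, **fix}

parsed = {
    "First Name": "Beth", "Middle Name": "Christine", "Last Name": "Parker",
    "Title": "SLS MGR", "Firm": "Regional Port Authority",
    "Location": "Federal Building", "Number": "12800",
    "Street": "Lake Calumet", "City": "Hedgewisch", "State": "IL",
}
corrected = correct_address(parsed)
```

The original record is left untouched, so the before and after versions can both be kept for auditing.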
Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Standardized Data
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
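Conversion routines of this kind are often table-driven. The sketch below assumes two hypothetical lookup tables, `TITLE_STANDARDS` and `STREET_WORD_STANDARDS`, containing only the conversions visible in the example. An organization's real standard and custom business rules would be much larger.

```python
# Hypothetical conversion tables holding only the rules seen in the example.
TITLE_STANDARDS = {"SLS MGR": "Sales Mgr."}
STREET_WORD_STANDARDS = {"South": "S.", "Drive": "Dr."}

def standardize(record):
    """Return a copy of the record with title and street in preferred form."""
    out = dict(record)
    # Whole-value lookup for titles; unknown titles pass through unchanged.
    out["Title"] = TITLE_STANDARDS.get(record["Title"], record["Title"])
    # Word-by-word lookup for street names.
    out["Street"] = " ".join(
        STREET_WORD_STANDARDS.get(word, word)
        for word in record["Street"].split()
    )
    return out

corrected = {"Title": "SLS MGR", "Street": "South Butler Drive", "City": "Chicago"}
standardized = standardize(corrected)
```

Values that have no entry in either table are passed through as-is, so the routine is safe to run on records that are already in standard form.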
Matching
Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
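A matching rule for the two records above might look like the following sketch. The rule itself is an assumption made for illustration: records match when firm, street number, and zip agree, the first names share a match standard, and the last names share a hyphen-separated part. The key `Match Standards` is a shortened, hypothetical label for the "1st Name Match Standards" field.

```python
def name_variants(record):
    """All first-name forms a record may be known by, lower-cased."""
    variants = {record["First Name"].lower()}
    variants.update(name.lower() for name in record["Match Standards"])
    return variants

def last_names_compatible(a, b):
    """Treat 'Parker' and 'Parker-Lewis' as compatible (shared hyphen part)."""
    return bool(set(a.lower().split("-")) & set(b.lower().split("-")))

def is_match(r1, r2):
    """Assumed business rule: same place, overlapping first names,
    compatible last names."""
    same_place = all(r1[k] == r2[k] for k in ("Firm", "Number", "Zip"))
    first_ok = bool(name_variants(r1) & name_variants(r2))
    last_ok = last_names_compatible(r1["Last Name"], r2["Last Name"])
    return same_place and first_ok and last_ok

source_1 = {
    "First Name": "Beth", "Match Standards": ["Elizabeth", "Bethany", "Bethel"],
    "Last Name": "Parker", "Firm": "Regional Port Authority",
    "Number": "12800", "Zip": "60633",
}
source_2 = {
    "First Name": "Elizabeth", "Match Standards": ["Beth", "Bethany", "Bethel"],
    "Last Name": "Parker-Lewis", "Firm": "Regional Port Authority",
    "Number": "12800", "Zip": "60633",
}
duplicate = is_match(source_1, source_2)
```

The match standards do the real work here: "Beth" and "Elizabeth" never compare equal as strings, but each appears in the other record's list of variants.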
Consolidating
Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
Corrected Data (Data Source #1) and Corrected Data (Data Source #2) are merged into:
Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2 Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
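Merging matched records needs a rule for which value survives in each field. The sketch below assumes one simple, hypothetical rule: keep the longer (more complete) value, and keep the first record's value on a tie. Note that this rule keeps a single first name, so it does not reproduce the "Beth (Elizabeth)" form shown in the consolidated record.

```python
def consolidate(r1, r2):
    """Merge two matched records into one representation.

    Assumed survivorship rule: for each field keep the longer value,
    preferring the first record when both are equally long.
    """
    merged = {}
    fields = list(r1) + [f for f in r2 if f not in r1]
    for field in fields:
        a, b = r1.get(field, ""), r2.get(field, "")
        merged[field] = a if len(a) >= len(b) else b
    return merged

source_1 = {
    "First Name": "Beth", "Last Name": "Parker", "Title": "Sales Mgr.",
    "Street": "S. Butler Dr.",
}
source_2 = {
    "First Name": "Elizabeth", "Last Name": "Parker-Lewis", "Title": "",
    "Street": "S. Butler Dr., Suite 2",
    "Phone": "708-555-1234", "Fax": "708-555-5678",
}
consolidated = consolidate(source_1, source_2)
```

The title comes from the first source, while the suite number, hyphenated last name, phone, and fax come from the second, giving one record that is more complete than either input.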
ETL Products
SQL Server 2012 Integration Services from Microsoft
PowerMart/PowerCenter from Informatica
Warehouse Builder from Oracle
Teradata Warehouse Builder from Teradata
DataMigrator from Information Builders
SAS System from SAS Institute
Connectivity Solutions from OpenText
Ab Initio
What about unstructured data?
What is unstructured data?
What percentage of data in organizations is considered to be “unstructured”?
Examples
Why store it in a data warehouse?
Does it do any good in large text fields?
Special ETL for unstructured data
Unstructured Data Example
Notes about post-service of a product:
The hub bent when the bicycle hit a large pothole.
The plane takes off sluggishly during high-altitude departures.
The product won’t allow entry of a 1098-T when the person is declared as a dependent.
“Text analytics” is used to transform the data.
Text analytics
Parses text and extracts facts (complaints, problems, issues) about key entities (customers, products, locations).
Uses natural language processing (NLP). NLP converts human language into more formal representations that are easier for a computer program to manipulate.
Combination of computational linguistics and artificial intelligence.
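To make the idea of extracting facts about key entities concrete, here is a deliberately simplified sketch. It is a keyword lookup, not real NLP: the vocabularies `PRODUCT_TERMS` and `PROBLEM_TERMS` are hypothetical and cover only two of the example notes, whereas actual text analytics tools rely on linguistic models.

```python
import re

# Hypothetical vocabularies; a real system would not rely on fixed word lists.
PRODUCT_TERMS = {"bicycle", "plane"}
PROBLEM_TERMS = {"bent", "sluggishly"}

def extract_facts(note):
    """Turn one free-text note into a structured row of entities and problems."""
    tokens = set(re.findall(r"[a-z0-9-]+", note.lower()))
    return {
        "note": note,
        "products": sorted(tokens & PRODUCT_TERMS),
        "problems": sorted(tokens & PROBLEM_TERMS),
    }

notes = [
    "The hub bent when the bicycle hit a large pothole.",
    "The plane takes off sluggishly during high-altitude departures.",
]
facts = [extract_facts(note) for note in notes]
```

Each resulting dictionary has fixed keys, so it could be stored as a row in a relational table alongside the original note.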
Goal of ETL
Structured and unstructured data stored in a relational database.
Data is complete, accurate, consistent, and in conformance with the business rules of the organization.