Agenda 03/27/2014 Review first test. Discuss internal data project. Review characteristics of data quality. Types of data. Data quality. Data governance. Define ETL activities. Discuss database analyst/programmer responsibilities for data evaluation.

Upload: marylou-simpson

Post on 30-Dec-2015

Agenda 03/27/2014

Review first test.

Discuss internal data project.

Review characteristics of data quality.

Types of data.

Data quality.

Data governance.

Define ETL activities.

Discuss database analyst/programmer responsibilities for data evaluation.

Answers to Multiple Choice Questions

Question  Answer    Question  Answer    Question  Answer    Question  Answer
1.        B         8.        D         15.       A         22.       B
2.        A         9.        C         16.       D         23.       D
3.        C         10.       D         17.       D         24.       D
4.        B         11.       B         18.       A         25.       A
5.        A         12.       C         19.       C
6.        C         13.       B         20.       A
7.        E         14.       A         21.       C

Discussed in prior classes...

Lots of data.

· Traditional transaction processing systems

· Non-traditional data: call center; click-stream; loyalty card; warranty cards/product registration information; email; Twitter; Facebook

· External data from government and commercial entities

General classification of data

· Transaction data

· Referential data/master data

· Metadata

Data quality

What is good quality data?

Correct

Accurate

Consistent

Complete

Available

Accessible

Timely

Relevant
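Some of these characteristics can be measured mechanically. The sketch below is one possible way to check completeness, consistency and timeliness; the rows, field names, allowed state codes and the 365-day threshold are all made up for illustration.

```python
from datetime import date

# Hypothetical customer rows; field names and values are illustrative only.
rows = [
    {"id": 1, "state": "IL", "zip": "60633", "updated": date(2014, 3, 1)},
    {"id": 2, "state": "il", "zip": "",      "updated": date(2012, 6, 15)},
    {"id": 3, "state": "IL", "zip": "60601", "updated": date(2014, 2, 20)},
]

def completeness(rows, field):
    """Complete: share of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def inconsistent(rows, field, allowed):
    """Consistent: ids of rows whose value is not in the agreed set of codes."""
    return [r["id"] for r in rows if r[field] not in allowed]

def stale(rows, as_of, max_age_days):
    """Timely: ids of rows not updated within max_age_days of as_of."""
    return [r["id"] for r in rows if (as_of - r["updated"]).days > max_age_days]

zip_completeness = completeness(rows, "zip")
bad_states = inconsistent(rows, "state", {"IL", "IN", "WI"})
stale_ids = stale(rows, date(2014, 3, 27), 365)
```

Characteristics such as "relevant" or "accurate" usually need a human or a trusted reference source and are not covered by checks like these.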

How does data “go bad”?

Does all “bad” data have to be fixed?

Data governance

Policies, processes and procedures aimed at managing the data in an organization.

Usually high-level cross-department committees that oversee data management across the organization.

Responsible for defining what data is necessary to gather.

Responsible for defining the source and store of data.

Responsible for security policies, processes, procedures.

Responsible for creating the policies, processes and procedures.

Responsible for assigning blame.

Responsible for enforcing policies.

Data quality in data warehouses

Is it more important than data quality in source transaction and reference data?

How is better quality data achieved?

Automated ETL processes to populate the data warehouse

Spot checking programmatically
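A programmatic spot check can be as simple as drawing a random sample of source rows and comparing each one with its copy in the warehouse. The sketch below assumes both sides can be read into dictionaries keyed by primary key; `spot_check` and the sample data are hypothetical.

```python
import random

def spot_check(source, warehouse, sample_size, seed=0):
    """Compare a random sample of source rows with their warehouse copies.

    source and warehouse map a primary key to a row (dict).
    Returns the sampled keys whose warehouse row is missing or different.
    """
    rng = random.Random(seed)          # fixed seed makes the check repeatable
    keys = sorted(source)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    return sorted(k for k in sample if warehouse.get(k) != source[k])

# Hypothetical data: key 3 was altered and key 4 was never loaded.
source = {1: {"amt": 10}, 2: {"amt": 20}, 3: {"amt": 30}, 4: {"amt": 40}}
warehouse = {1: {"amt": 10}, 2: {"amt": 20}, 3: {"amt": 99}}

mismatches = spot_check(source, warehouse, sample_size=4)
```

A small sample only estimates the error rate; it does not prove the remaining rows are clean.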

Populating the data warehouse

Extract: Take data from source systems. May require middleware to gather all necessary data.

Transformation: Put data into consistent format and content. Validate data – check for accuracy, consistency using pre-defined and agreed-upon business rules. Convert data as necessary.

Load: Use a batch (bulk) update operation that keeps track of what is loaded, where, when and how. Keep a detailed load log to audit updates to the data warehouse.
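The three steps can be sketched end to end with the Python standard library. This is a minimal illustration, not a production pipeline: the CSV text, the `orders` and `load_log` tables, and the single "amount must be numeric" business rule are all assumptions made for the example.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone

# Hypothetical source extract; a real one would come from a source system.
SOURCE_CSV = """order_id,state,amount
1,il,10.50
2,IL,abc
3,Wi,7.25
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: validate against a simple rule and convert types."""
    good, rejected = [], []
    for r in rows:
        try:
            good.append((int(r["order_id"]), r["state"].strip().upper(), float(r["amount"])))
        except ValueError:
            rejected.append(r)          # fails the "amount is numeric" rule
    return good, rejected

def load(conn, rows, source_name):
    """Load: bulk insert, then log what was loaded, where, and when."""
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.execute(
        "INSERT INTO load_log VALUES (?, ?, ?, ?)",
        (source_name, "orders", len(rows), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, state TEXT, amount REAL)")
conn.execute("CREATE TABLE load_log (source TEXT, target TEXT, row_count INTEGER, loaded_at TEXT)")

good, rejected = transform(extract(SOURCE_CSV))
load(conn, good, "orders.csv")
```

Rejected rows are kept rather than silently dropped so that someone can decide whether they need to be fixed at the source.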

Data Cleansing

Source systems contain “dirty data” that must be cleansed

ETL software contains rudimentary to very sophisticated data cleansing capabilities

Industry-specific data cleansing software is often used. Important for performing name and address correction

Leading data cleansing vendors include general hardware/software vendors such as IBM, Oracle, SAP, Microsoft and specialty vendors Informatica, Information Builders (DataMigrator), Harte-Hanks (Trillium), CloverETL, Talend, and BusinessObjects (SAP-AG)

Steps in data cleansing

· Parsing

· Correcting

· Standardizing

· Matching

· Consolidating

Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.

Examples include parsing the first, middle, and last name; street number and street name; and city and state.

Parsing

Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
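The parsing example can be reproduced with a few string splits, provided the input always has the same five-line layout. That fixed layout is an assumption of this sketch; commercial cleansing tools handle far messier input. `parse_contact` is a hypothetical helper name.

```python
def parse_contact(block):
    """Split a five-line contact block into named fields.

    Assumes the fixed layout of the slide example:
    name and title / firm / location / number and street / city and state.
    """
    name_line, firm, location, street_line, city_line = [
        line.strip() for line in block.strip().splitlines()
    ]
    name, title = [part.strip() for part in name_line.split(",", 1)]
    first, middle, last = name.split()
    number, street = street_line.split(" ", 1)
    city, state = [part.strip() for part in city_line.split(",", 1)]
    return {
        "First Name": first,
        "Middle Name": middle,
        "Last Name": last,
        "Title": title,
        "Firm": firm,
        "Location": location,
        "Number": number,
        "Street": street,
        "City": city,
        "State": state,
    }

source_text = """Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL"""

parsed = parse_contact(source_text)
```

Names without exactly three parts, or blocks with a missing line, would raise a `ValueError` here, which is a reasonable way to route such records to manual review.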

Correcting

Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.

Correcting

Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL

Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
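Correction against a secondary data source can be pictured as a lookup. In the sketch below, `ADDRESS_REFERENCE` is a hypothetical in-memory table standing in for a postal reference file; its single entry simply reproduces the slide example and is not real postal data.

```python
# Hypothetical reference table standing in for a postal address file.
# The single entry reproduces the slide example; it is not real postal data.
ADDRESS_REFERENCE = {
    ("12800", "LAKE CALUMET", "HEDGEWISCH", "IL"): {
        "Street": "South Butler Drive",
        "City": "Chicago",
        "Zip": "60633",
        "Zip+Four": "2398",
    },
}

def correct_address(record, reference=ADDRESS_REFERENCE):
    """Return a copy of record with address fields replaced from the reference.

    Records with no reference entry are returned unchanged.
    """
    key = tuple(record[f].upper() for f in ("Number", "Street", "City", "State"))
    corrected = dict(record)
    corrected.update(reference.get(key, {}))
    return corrected

parsed = {
    "First Name": "Beth", "Middle Name": "Christine", "Last Name": "Parker",
    "Title": "SLS MGR", "Firm": "Regional Port Authority",
    "Location": "Federal Building", "Number": "12800",
    "Street": "Lake Calumet", "City": "Hedgewisch", "State": "IL",
}
corrected = correct_address(parsed)
```

The function returns a new dictionary so the parsed record stays available for auditing what was changed.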

Standardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

Standardizing

Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Standardized Data
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
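Standardizing is mostly table-driven conversion. The sketch below applies three small rule tables to the record from the example; the tables contain only the entries needed to reproduce the slide and are assumptions, as is the helper name `standardize`. Deriving the pre-name ("Ms.") is left out because the slide does not say which rule produced it.

```python
# Illustrative rule tables; real ones would be agreed with the business.
TITLE_RULES = {"SLS MGR": "Sales Mgr."}
STREET_RULES = {"South": "S.", "Drive": "Dr."}
NAME_MATCH_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}

def standardize(record):
    """Return a copy of record with title and street in the preferred format,
    plus the list of first names this one should be allowed to match."""
    out = dict(record)
    out["Title"] = TITLE_RULES.get(record["Title"], record["Title"])
    out["Street"] = " ".join(STREET_RULES.get(w, w) for w in record["Street"].split())
    out["1st Name Match Standards"] = NAME_MATCH_STANDARDS.get(record["First Name"], [])
    return out

corrected = {
    "First Name": "Beth", "Middle Name": "Christine", "Last Name": "Parker",
    "Title": "SLS MGR", "Firm": "Regional Port Authority",
    "Location": "Federal Building", "Number": "12800",
    "Street": "South Butler Drive", "City": "Chicago", "State": "IL",
    "Zip": "60633", "Zip+Four": "2398",
}
standardized = standardize(corrected)
```

Values with no matching rule pass through unchanged, so new abbreviations can be added to the tables without touching the code.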

Matching

Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.

Matching

Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
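One possible set of matching rules for the two records above is sketched here. The rules themselves (same firm, street number and zip; first names linked through the match standards; last names sharing a component) are assumptions chosen so that the example records match; the slides do not state the actual rules.

```python
def first_names_match(a, b):
    """First names match if equal or listed in each other's match standards."""
    return (
        a["First Name"] == b["First Name"]
        or a["First Name"] in b["1st Name Match Standards"]
        or b["First Name"] in a["1st Name Match Standards"]
    )

def last_names_match(a, b):
    """Last names match if they share a component, e.g. Parker and Parker-Lewis."""
    return bool(set(a["Last Name"].split("-")) & set(b["Last Name"].split("-")))

def is_match(a, b):
    """Apply all rules; every one must hold for the records to be duplicates."""
    same_place = all(a[f] == b[f] for f in ("Firm", "Number", "Zip"))
    return same_place and first_names_match(a, b) and last_names_match(a, b)

source_1 = {
    "First Name": "Beth",
    "1st Name Match Standards": ["Elizabeth", "Bethany", "Bethel"],
    "Last Name": "Parker", "Firm": "Regional Port Authority",
    "Number": "12800", "Zip": "60633",
}
source_2 = {
    "First Name": "Elizabeth",
    "1st Name Match Standards": ["Beth", "Bethany", "Bethel"],
    "Last Name": "Parker-Lewis", "Firm": "Regional Port Authority",
    "Number": "12800", "Zip": "60633",
}

matched = is_match(source_1, source_2)
```

Loose rules like the shared-surname test reduce missed duplicates but raise the risk of merging two different people, which is why the rules are normally agreed with the business beforehand.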

Consolidating

Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.

Consolidating

Corrected Data (Data Source #1)

Corrected Data (Data Source #2)

Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2 Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
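Consolidation needs a survivorship rule that decides which value wins for each field. The sketch below uses a deliberately simple one, "keep the longest non-empty value", which is an assumption and not the rule behind the slide: it picks "Elizabeth" as the first name, whereas the slide keeps both "Beth" and "Elizabeth".

```python
def consolidate(records):
    """Merge matched records, keeping the longest non-empty value per field.

    On a tie the value from the earlier record is kept.
    """
    merged = {}
    for record in records:
        for field, value in record.items():
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged

source_1 = {
    "First Name": "Beth", "Last Name": "Parker", "Title": "Sales Mgr.",
    "Firm": "Regional Port Authority", "Street": "S. Butler Dr.",
}
source_2 = {
    "First Name": "Elizabeth", "Last Name": "Parker-Lewis", "Title": "",
    "Firm": "Regional Port Authority", "Street": "S. Butler Dr., Suite 2",
    "Phone": "708-555-1234", "Fax": "708-555-5678",
}

consolidated = consolidate([source_1, source_2])
```

Other common survivorship rules prefer the most recently updated record or the most trusted source system.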

Source system view – 3 clients

Policy No. ME309451-2

Account # 1238891

Transaction B498/97

The reality – ONE client

Account # 1238891

Policy No. ME309451-2

Transaction B498/97

Consolidating whole groups

William Lewis

Beth Parker

Karen Parker-Lewis

William Parker-Lewis, Jr.
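The slide does not say how these four people are recognized as one group. As one hypothetical rule, the sketch below links any two names that share a surname component and then collects everything that is connected, directly or through someone else, into a single group. A real rule would almost certainly also require a shared address or account.

```python
def surname_parts(full_name):
    """Surname components, ignoring a trailing suffix such as ', Jr.'."""
    name = full_name.split(",")[0]
    return set(name.split()[-1].split("-"))

def group_by_shared_surname(names):
    """Group names linked, directly or transitively, by a surname component."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # shorten the path as we go
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if surname_parts(a) & surname_parts(b):
                parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

names = ["William Lewis", "Beth Parker", "Karen Parker-Lewis", "William Parker-Lewis, Jr."]
groups = group_by_shared_surname(names)
```

William Lewis and Beth Parker share no surname component, but both are linked to the Parker-Lewis records, so all four names end up in one group.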

ETL Products

SQL Server 2012 Integration Services from Microsoft

PowerMart/PowerCenter/PowerExchange from Informatica

Warehouse Builder from Oracle

Teradata Warehouse Builder from Teradata

DataMigrator from Information Builders

SAS System from SAS Institute

Connectivity Solutions from OpenText

Ab Initio

ETL Goal: Data is complete, accurate, consistent, and in conformance with the business rules of the organization.

Questions:

• Is ETL really necessary?

• Has the advent of big data changed our need for ETL?

• ETL vs. ELT

• Does the use of Hadoop eliminate the need for ETL software???

• Does it matter if the data is stored in the “cloud”?
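As a starting point for the ETL vs. ELT question, the sketch below shows the ELT ordering: raw values are loaded into a staging table first, and the transformation then runs as SQL inside the target database. SQLite stands in for the warehouse, and the table names, sample rows and numeric-amount rule are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load first: raw values land in a staging table exactly as extracted.
conn.execute("CREATE TABLE staging_orders (order_id TEXT, state TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?, ?)",
    [("1", " il", "10.50"), ("2", "IL", "abc"), ("3", "Wi ", "7.25")],
)

# Transform afterwards, inside the target database, using SQL.
# The WHERE clause keeps only amounts made of digits and dots.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           UPPER(TRIM(state))        AS state,
           CAST(amount AS REAL)      AS amount
    FROM staging_orders
    WHERE amount GLOB '[0-9]*' AND amount NOT GLOB '*[^0-9.]*'
    """
)
conn.commit()

rows = conn.execute(
    "SELECT order_id, state, amount FROM orders ORDER BY order_id"
).fetchall()
```

The cleansing work has not gone away; it has moved from a separate ETL tool into the database engine, and the untouched staging rows remain available for reprocessing.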