etl the process of updating the data warehouse.. recent developments in data warehousing: a tutorial...
TRANSCRIPT
![Page 1: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/1.jpg)
ETL
The process of updating the data warehouse.
![Page 2: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/2.jpg)
Recent Developments in Data Warehousing: A Tutorial
Hugh J. WatsonTerry College of BusinessUniversity of [email protected]://www.terry.uga.edu/~hwatson/dw_tutorial.ppt
![Page 3: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/3.jpg)
Two Data Warehousing Strategies
Enterprise-wide warehouse, top down, the Inmon methodologyData mart, bottom up, the Kimball methodologyWhen properly executed, both result in an enterprise-wide data warehouse
![Page 4: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/4.jpg)
The Data Mart Strategy
The most common approachBegins with a single mart and architected marts are added over time for more subject areas Relatively inexpensive and easy to implementCan be used as a proof of concept for data warehousingCan perpetuate the “silos of information” problemCan postpone difficult decisions and activitiesRequires an overall integration plan
![Page 5: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/5.jpg)
The Enterprise-wide StrategyA comprehensive warehouse is built initiallyAn initial dependent data mart is built using a subset of the data in the warehouseAdditional data marts are built using subsets of the data in the warehouseLike all complex projects, it is expensive, time consuming, and prone to failureWhen successful, it results in an integrated, scalable warehouse
![Page 6: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/6.jpg)
Data Sources and Types
Primarily from legacy, operational systemsAlmost exclusively numerical data at the present timeExternal data may be included, often purchased from third-party sourcesTechnology exists for storing unstructured data and expect this to become more important over time
![Page 7: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/7.jpg)
Extraction, Transformation, and Loading (ETL) Processes
The “plumbing” work of data warehousingData are moved from source to target data basesA very costly, time consuming part of data warehousing
![Page 8: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/8.jpg)
Recent Development:More Frequent Updates
Updates can be done in bulk and trickle modesBusiness requirements, such as trading partner access to a Web site, requires current dataFor international firms, there is no good time to load the warehouse
![Page 9: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/9.jpg)
Recent Development: Clickstream Data
Results from clicks at web sitesA dialog manager handles user interactions. An ODS (operational data store in the data staging area) helps to custom tailor the dialogThe clickstream data is filtered and parsed and sent to a data warehouse where it is analyzedSoftware is available to analyze the clickstream data
![Page 10: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/10.jpg)
Data Extraction
Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data)Sometimes source data is copied to the target database using the replication capabilities of standard RDMS (not recommended because of “dirty data” in the source systems) Increasing performed by specialized ETL software
![Page 11: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/11.jpg)
Sample ETL Tools
Teradata Warehouse Builder from TeradataDataStage from Ascential SoftwareSAS System from SAS InstitutePower Mart/Power Center from InformaticaSagent Solution from Sagent SoftwareHummingbird Genio Suite from Hummingbird Communications
![Page 12: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/12.jpg)
Reasons for “Dirty” Data Dummy Values Absence of Data Multipurpose Fields Cryptic Data Contradicting Data Inappropriate Use of Address Lines Violation of Business Rules Reused Primary Keys, Non-Unique Identifiers Data Integration Problems
![Page 13: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/13.jpg)
Data Cleansing
Source systems contain “dirty data” that must be cleansedETL software contains rudimentary data cleansing capabilitiesSpecialized data cleansing software is often used. Important for performing name and address correction and householding functionsLeading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
![Page 14: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/14.jpg)
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
![Page 15: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/15.jpg)
Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.Examples include parsing the first, middle, and last name; street number and street name; and city and state.
![Page 16: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/16.jpg)
Correcting
Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.Example include replacing a vanity address and adding a zip code.
![Page 17: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/17.jpg)
Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.Examples include adding a pre name, replacing a nickname, and using a preferred street name.
![Page 18: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/18.jpg)
Matching
Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.Examples include identifying similar names and addresses.
![Page 19: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/19.jpg)
Consolidating
Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
![Page 20: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/20.jpg)
Data Staging
Often used as an interim step between data extraction and later stepsAccumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processesAt a predefined cutoff time, data in the staging file is transformed and loaded to the warehouseThere is usually no end user access to the staging fileAn operational data store may be used for data staging
![Page 21: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/21.jpg)
Data Transformation
Transforms the data in accordance with the business rules and standards that have been establishedExample include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
![Page 22: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/22.jpg)
Data Loading
Data are physically moved to the data warehouseThe loading takes place within a “load window” The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications
![Page 23: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/23.jpg)
Meta Data
Data about dataNeeded by both information technology personnel and usersIT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.
![Page 24: ETL The process of updating the data warehouse.. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University](https://reader036.vdocument.in/reader036/viewer/2022062421/56649dc45503460f94ab7148/html5/thumbnails/24.jpg)
Recent Development:Meta Data Integration
A growing realization that meta data is critical to data warehousing success Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their toolsVendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings