data transformation overview

Upload: sambit-das

Post on 13-Apr-2018


Category:

Documents



TRANSCRIPT

  • 7/27/2019 Data Transformation Overview

    1/7

    1

    Data Transformation Overview

    Contents

    1. Data Transformation
       a. Introduction
       b. Input files
       c. Source Systems
       d. Mapping file
       e. Output file
       f. Output file structure

    2. Process Steps
       a. Initial data assessment
       b. Pre-transformation
       c. Transformation
       d. Post-transformation
       e. Quality checks
       f. Common issues and challenges

    3. Cartographer
       a. Introduction
       b. Current status
       c. Requested features

    1. Data Transformation


    a) Introduction

    Data transformation is the process of converting data from one format to another. It includes
    cleansing, reformatting, standardization, joining data from multiple master files and applying
    business rules, if any. The data is transformed from the client-specified format to the format
    that can run on the Spend Visibility platform.

    Purpose
    The purpose of data transformation is to convert the client data to the standard format so that
    the data is ready for data enrichment.

    End User
    The end user of the transformed data is the Data Enrichment team.

    b) Input files

    The input files are raw (flat) files that the customer extracts from its operational data
    sources (OLTP databases).

    Types of input file formats: .xls, .xlsx, .csv, .txt, .dat

    Denormalized data:
    a) Has multiple source systems in a single raw file
    b) Facts and dimensions are created from the single raw file

    Normalized data:
    Fact and dimension tables are created from different raw files
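    As a rough illustration of the denormalized case, the sketch below splits one raw file into a
    fact table and a dimension table. The column names (invoice_id, supplier_id, supplier_name,
    amount) and the sample rows are hypothetical and are not taken from the Data Acquisition
    Schema; only the idea of deriving both tables from a single file comes from the text above.

```python
import csv
import io

# Hypothetical denormalized extract: supplier attributes repeat on every invoice row.
RAW = """invoice_id,supplier_id,supplier_name,amount
INV-1,S01,Acme Corp,100.50
INV-2,S01,Acme Corp,20.00
INV-3,S02,Globex,75.25
"""


def split_fact_and_dimension(raw_text):
    """Split one denormalized raw file into an invoice fact and a supplier dimension."""
    facts = []
    suppliers = {}  # supplier_id -> supplier_name, one entry per distinct supplier
    for row in csv.DictReader(io.StringIO(raw_text)):
        # The fact keeps the transaction plus the key that links to the dimension.
        facts.append({
            "invoice_id": row["invoice_id"],
            "supplier_id": row["supplier_id"],
            "amount": row["amount"],
        })
        # The dimension keeps the descriptive attribute once per key.
        suppliers.setdefault(row["supplier_id"], row["supplier_name"])
    dimension = [{"supplier_id": k, "supplier_name": v} for k, v in suppliers.items()]
    return facts, dimension


facts, supplier_dim = split_fact_and_dimension(RAW)
```

    With normalized data the same two tables would instead be read from separate raw files, so no
    split of this kind is needed.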

    c) Source Systems


    Source systems are the individual systems in which the transactional data is stored. We
    transform the data source system by source system, each with a separate mapping file.

    d) Mapping File

    Data mapping maps the data from the raw file to the Standard Schema tables. The mapping file
    is a document provided by the customer, and the transformation is performed by applying the
    business rules it specifies.

    e) Output Files

    The output should be the transformed fact and dimension files, as per the standard Data
    Acquisition Schema:

    Data Acquisition Schema Guide_10s2.xls

    Data_Acquisition_Schema_Guide_10s3.xls

    f) Output file structure

    The transformed files contain normalized fact and dimension tables, encoded as UTF-8.

    Fact Tables: The fact table is the master table that contains all primary key data or main data

    Fact Tables

    Invoice2/ Invoice3

    PO2/ PO3

    Dimension Tables: Dimension tables contain attributes that describe the fact records in the
    fact table

    Dimension Tables

    Account

    Companysite

    Contract

    CostCenter

    CostCenterMgmt

    ERPCommodity

    FlexDimension1

    FlexDimension2

    FlexDimension3

    FlexDimension4

    FlexDimension5


    FlexDimension6

    FlexDimension7

    FlexDimension8

    FlexDimension9

    FlexDimension10

    FlexDimension11

    FlexDimension12

    FlexDimension13

    FlexDimension14

    Part

    User

    Supplier

    2. Process Steps

    The PM sends a mail to the DT team about the availability of the raw files. Raw files are
    generally made available in *** Analysis. The PM also sends a data worksheet with the raw file
    names, and the spend amount and record count in each file.
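    The data worksheet makes a first reconciliation possible before any transformation starts. The
    following is a minimal sketch of that comparison, assuming the raw files are delimited text
    with a header row; the function names, the amount column name and the worksheet layout (file
    name mapped to expected record count and amount) are illustrative assumptions.

```python
import csv
import io
from decimal import Decimal


def file_stats(raw_text, amount_column):
    """Return (record count, total amount) for one raw file given as text."""
    count = 0
    total = Decimal("0")
    for row in csv.DictReader(io.StringIO(raw_text)):
        count += 1
        total += Decimal(row[amount_column])
    return count, total


def reconcile(worksheet, raw_files, amount_column="amount"):
    """Compare each raw file with the worksheet; return a list of discrepancies."""
    issues = []
    for name, (expected_count, expected_amount) in worksheet.items():
        if name not in raw_files:
            issues.append(f"{name}: file missing")
            continue
        count, total = file_stats(raw_files[name], amount_column)
        if count != expected_count:
            issues.append(f"{name}: record count {count}, expected {expected_count}")
        if total != expected_amount:
            issues.append(f"{name}: amount {total}, expected {expected_amount}")
    return issues


# Hypothetical worksheet: file name -> (record count, spend amount).
worksheet = {"invoices.csv": (2, Decimal("120.50")), "po.csv": (1, Decimal("10"))}
raw_files = {"invoices.csv": "invoice_id,amount\nINV-1,100.50\nINV-2,20.00\n"}
issues = reconcile(worksheet, raw_files)
```

    Decimal is used instead of float so that the totals compare exactly against the worksheet
    figures.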

    a) Initial data assessment

    In this phase we assess the quality of the raw data provided, identify which tables need to be
    created during transformation, estimate the time needed for transformation, and report the
    issues found during the assessment to the customer at this initial stage.

    We also check for:

    Discrepancy in number of raw files, record count and amount
    Reference to the correct version of the mapping file
    Suitable encoding for special characters in data files
    Mappings in line with the raw files
    Sufficient data in lookup tables
    Raw data maximum length vs SV Schema field limit
    Format of the Amount field
    Data availability in mandatory fields
    Duplicate records
    Invalid date data
    Non-numeric data in numeric fields
    Accuracy of currency rates (only if the file is available)

    b) Pre-Transformation


    In this phase, we update the raw files to make them ready for transformation, i.e. format the
    dates, pull data from the different lookup/master tables, update the fields as per the
    customer rules, etc.
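    Two of the pre-transformation updates mentioned above, date formatting and pulling values from
    a lookup/master table, could look roughly like this. The accepted input formats, the ISO
    output format and all field names are assumptions made for the example; the actual target
    date format is whatever the Analysis standards require.

```python
from datetime import datetime

# Assumed input date formats, tried in order; not taken from the source document.
KNOWN_FORMATS = ("%d-%b-%Y", "%m/%d/%Y", "%Y%m%d")


def standardize_date(value, output_format="%Y-%m-%d"):
    """Return the date in one output format, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime(output_format)
        except ValueError:
            continue
    return None


def enrich_from_lookup(rows, lookup, key, target):
    """Fill `target` on each row from a lookup (master) table keyed by `key`."""
    for row in rows:
        # Rows without a match get an empty value so they can be reported later.
        row[target] = lookup.get(row[key], "")
    return rows


# Hypothetical raw rows and supplier master table.
rows = [
    {"invoice_date": "13-Apr-2018", "supplier_id": "S01"},
    {"invoice_date": "04/13/2018", "supplier_id": "S99"},
]
supplier_master = {"S01": "Acme Corp"}

for row in rows:
    row["invoice_date"] = standardize_date(row["invoice_date"])
enrich_from_lookup(rows, supplier_master, "supplier_id", "supplier_name")
```

    Returning None for an unparseable date, rather than guessing, keeps invalid dates visible so
    they can be raised with the customer.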

    c) Transformation

    In this phase, we proceed with transformation for the source systems whose data is clean and
    eligible for transformation, i.e. those that do not have data issues. We create the necessary
    fact and dimension tables according to the mapping file.

    We will also apply the following rules on Fact and Dimension files:

    Refer to the previous cycle's data issues file for recurrence (only for recurring cycles)
    Amount field format
    Negative spend representation
    Field Id & Field Name gaps filling
    Check duplicate records in both Facts and Dimensions
    Check broken links between Facts and Dimensions
    Check null records for mandatory fields
    Date format as per Analysis standards
    Default values
    Apply all special instructions

    d) Post Transformation

    Fix the duplicates in both Facts and Dimensions, if any
    Fix the referential integrity (data integrity) between Facts and Dimensions, if any
    Fix the hierarchy issues, if any

    e) Quality Checks

    Duplicate checks make sure that we do not have any data duplication in facts and dimensions.

    The referential integrity violation check ensures data integrity between facts and dimensions.

    Additional checks:
    o Check if leading zeroes are intact
    o Make sure that all the records have data for the Accounting Date column in the Invoice
      table and the Ordered Date column in the PO table
    o Check for incompatible data in date and amount columns
    o Check for null AmountCurrency
    o Check if the final and initial stats match
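    The duplicate, referential integrity and mandatory field checks above can be sketched with
    plain Python collections. This is only an illustration under assumed table and column names
    (invoice_id, supplier_id, accounting_date); it does not show how the checks are run in the
    actual tooling.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return key values that occur more than once in a table."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def broken_links(fact_rows, dim_rows, key):
    """Return fact key values that have no matching dimension record."""
    known = {row[key] for row in dim_rows}
    return sorted({row[key] for row in fact_rows} - known)


def missing_mandatory(rows, field):
    """Return the rows whose mandatory field is empty or absent."""
    return [row for row in rows if not (row.get(field) or "").strip()]


# Hypothetical transformed tables; keys stay strings so leading zeroes survive.
invoices = [
    {"invoice_id": "INV-1", "supplier_id": "007", "accounting_date": "2018-04-13"},
    {"invoice_id": "INV-2", "supplier_id": "S02", "accounting_date": ""},
    {"invoice_id": "INV-2", "supplier_id": "S03", "accounting_date": "2018-04-14"},
]
suppliers = [{"supplier_id": "007"}, {"supplier_id": "S02"}]

dupes = duplicate_keys(invoices, "invoice_id")
orphans = broken_links(invoices, suppliers, "supplier_id")
undated = missing_mandatory(invoices, "accounting_date")
```

    Keeping identifiers as text throughout is what preserves values such as "007"; converting
    them to numbers at any point would drop the leading zeroes.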

    f) Common Issues and challenges


    Some of the common issues and challenges faced during the entire process

    Issues during loading of data

    We face issues like

    Escape Character for double Quotes

    Truncation errors

    Header in more than one row

    Length of a field more than 255

    Number data type in first few records and later alphanumeric

    Duplicate Field Names

    One record spanning more than one row

    Problem with special characters and encoding

    Data getting split into more than one field

    We need to fix these issues manually and import the file without any error
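    Some of these loading issues, in particular escaped double quotes and records that span more
    than one row, are handled by a CSV parser that honours quoting. The sketch below uses Python's
    standard csv module on a made-up two-record sample; it assumes the extract doubles its quote
    characters and quotes any field that contains a line break, which may not hold for every raw
    file.

```python
import csv
import io

# Hypothetical raw extract: the second record has an embedded line break and
# doubled double quotes inside a quoted field.
RAW = 'id,description\r\n1,Plain text\r\n2,"12"" pipe,\nstainless"\r\n'


def load_records(raw_text):
    """Parse delimited text, honouring quoted fields that span rows."""
    # newline="" keeps embedded line breaks intact for the csv module.
    reader = csv.reader(io.StringIO(raw_text, newline=""), doublequote=True)
    header = next(reader)
    records = []
    for fields in reader:
        # A wrong field count usually means data was split into extra fields.
        if len(fields) != len(header):
            raise ValueError(
                f"line {reader.line_num}: expected {len(header)} fields, got {len(fields)}"
            )
        records.append(dict(zip(header, fields)))
    return records


records = load_records(RAW)
```

    If a source system escapes quotes with a backslash instead of doubling them, the reader can
    be created with doublequote=False and escapechar="\\". Files that do not quote such fields at
    all still need the manual fixes described above.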

    Issues during Transformation

    Some of the issues we face during transformation are

    Empty PK records

    Amount in exponential

    Empty records in mandatory fields

    Amount in parenthesis
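    Amounts in exponential notation or in parentheses can be normalized before they are written
    to the fact tables. The sketch below assumes that parentheses mark a negative amount (the
    usual accounting convention) and that commas are thousands separators; both are assumptions,
    and the customer's rule for negative spend representation takes precedence.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Convert a raw amount string to Decimal; return None if it is not numeric."""
    value = text.strip().replace(",", "")  # assumed thousands separator
    negative = value.startswith("(") and value.endswith(")")
    if negative:
        value = value[1:-1]  # drop the surrounding parentheses
    try:
        amount = Decimal(value)  # Decimal also accepts exponent notation
    except InvalidOperation:
        return None
    if not amount.is_finite():
        return None  # reject NaN and Infinity, which Decimal would otherwise accept
    return -amount if negative else amount


def to_plain_string(amount):
    """Render a Decimal without exponent notation."""
    return format(amount, "f")
```

    For example, parse_amount("(1,234.50)") gives Decimal("-1234.50"), and
    to_plain_string(parse_amount("1.5E+3")) gives "1500". Empty or non-numeric input returns None
    so that such records can be reported instead of silently loaded.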

    3. Cartographer

    a) Introduction


    Cartographer is a data mapping tool developed by Tim Pittman that is used to map the customer
    fields to the schema tables. Cartographer connects to MS SQL Server 2008 to perform the
    transformation. MS Access, the tool currently used for transformation, can hold only up to
    2 GB of data, so we came up with the idea of a transformation tool that can support huge
    data sets.

    Version: 1.0.6.0