dwh basics

Upload: 1raju1234

Post on 20-Oct-2015

Page 1: DWH Basics

ETL TESTING

This guide provides the following sections:

1.   Data warehouse concepts
2.   ETL development life cycle
3.   ETL test plan
4.   ETL testing life cycle (or) ETL test process
5.   Types of ETL testing
6.   Types of ETL bugs
7.   Bug reporting
8.   Testing templates (test case, bug reporting, etc.)
9.   ETL performance testing
10. ETL interview questions
11. Project with example
12. SQL
13. Unix

1.   Data warehouse concepts

A data warehouse is a relational database that holds a subject-oriented, integrated, time-variant and non-volatile collection of data used to support the strategic decision-making process.

Data warehouse  Architecture: 

2.   Etl development life cycle

To learn ETL testing, SQL is mandatory and you should also have some knowledge of Unix. Both are covered in the last sections of this guide.

ETL Testing:- ETL testing is similar to manual testing: the tester performs it by hand, with human interaction.

After the ETL developer inserts or updates data in the data mart, we test that data mart before it is loaded into the centralized data warehouse. This test is called ETL testing.

ETL development life cycle:

REQUIREMENT ANALYSIS
HIGH LEVEL DESIGN
LOW LEVEL DESIGN
DEVELOPMENT
SIT (system integration testing)
REVIEW
TESTING --------->> ETL testing life cycle
UAT (user acceptance testing)
PRODUCTION

3.   ETL test plan

Test plan for banking project

Introduction                     Banking
Background                       Informatica, Oracle 10g
Test items                       Fixed deposit, withdrawals
Features to be tested            e.g. password; non-secure fields are to be tested
Features not to be tested        Secure fields, tables
Approach                         Types of ETL testing
Testing levels                   Sanity, smoke
Features pass & fail criteria    How many test cases pass, how many fail
Suspension criteria              Rules set by the company
Test environment                 Staging server, client server (alpha), production server (beta), live server
Test deliverables                Test cases, bug logging, test procedure
Scheduled tasks                  Timetable of the project or module
Staff & training                 Required persons
Risk and mitigation              General holidays, sick leave
Sign off                         Higher authority

4.   ETL testing life cycle (or) ETL test process

ETL TESTING LIFE CYCLE:-

5.   Types of etl testing

1)       Constraint Testing:

In the constraint testing phase, the test engineer checks whether the data has been mapped from source to target with its constraints intact.

The test engineer covers the following constraint scenarios in the ETL testing process.

a)      NOT NULL

b)      UNIQUE

c)       Primary Key

d)      Foreign key

e)      Check

f)       Default

g)       NULL
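As a minimal sketch of two of the scenarios above (NOT NULL and UNIQUE), the check can be written as queries that count violating rows. The example assumes an in-memory SQLite database as a stand-in for the project database, and the table tgt_emp with columns eno, ename, sal is hypothetical.

```python
import sqlite3

def constraint_violations(conn):
    """Count NOT NULL and UNIQUE violations in the hypothetical tgt_emp table."""
    null_names = conn.execute(
        "SELECT COUNT(*) FROM tgt_emp WHERE ename IS NULL"
    ).fetchone()[0]
    duplicate_keys = conn.execute(
        "SELECT COUNT(*) FROM ("
        " SELECT eno FROM tgt_emp GROUP BY eno HAVING COUNT(*) > 1"
        ") AS d"
    ).fetchone()[0]
    return {"null_ename": null_names, "duplicate_eno": duplicate_keys}

conn = sqlite3.connect(":memory:")
# Created without constraints on purpose, so the bad rows can be loaded.
conn.execute("CREATE TABLE tgt_emp (eno INTEGER, ename TEXT, sal REAL)")
conn.executemany(
    "INSERT INTO tgt_emp VALUES (?, ?, ?)",
    [(1, "SMITH", 800.0), (2, None, 1600.0), (2, "WARD", 1250.0)],
)
violations = constraint_violations(conn)
print(violations)
```

A non-zero count for either key would be reported as a constraint bug.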

   2)      Source to Target Count Testing:

This test checks only whether the number of rows in the source matches the number of rows in the target. The order of the rows (ascending or descending) does not matter; only the count is required.

A tester can fall back on this type of testing when time is short.
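The count comparison can be sketched as follows. This is only an illustration: it assumes an in-memory SQLite database, and the table names src_emp and tgt_emp are hypothetical.

```python
import sqlite3

def compare_counts(conn, source, target):
    """Return (source_count, target_count, counts_match)."""
    # Table names are interpolated directly: pass trusted names only.
    src = conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
    return src, tgt, src == tgt

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_emp (eno INTEGER, ename TEXT)")
conn.execute("CREATE TABLE tgt_emp (eno INTEGER, ename TEXT)")
conn.executemany(
    "INSERT INTO src_emp VALUES (?, ?)",
    [(1, "SMITH"), (2, "ALLEN"), (3, "WARD")],
)
# One row is deliberately missing from the target.
conn.executemany(
    "INSERT INTO tgt_emp VALUES (?, ?)",
    [(1, "SMITH"), (2, "ALLEN")],
)
counts = compare_counts(conn, "src_emp", "tgt_emp")
print(counts)
```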

3)      Source to Target Data Validation Testing:

In this testing, the tester validates every value of the target data against the source.

In most financial projects, the tester pays particular attention to decimal values.
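One way to sketch this value-by-value comparison is a set difference between source and target. The example below assumes an in-memory SQLite database and hypothetical tables src_emp and tgt_emp; SQLite spells the operator EXCEPT, while Oracle's equivalent set operator is MINUS.

```python
import sqlite3

def rows_missing_in_target(conn):
    """Return source rows that have no identical row in the target."""
    return conn.execute(
        "SELECT eno, ename, sal FROM src_emp "
        "EXCEPT "
        "SELECT eno, ename, sal FROM tgt_emp "
        "ORDER BY eno"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_emp (eno INTEGER, ename TEXT, sal REAL)")
conn.execute("CREATE TABLE tgt_emp (eno INTEGER, ename TEXT, sal REAL)")
conn.executemany(
    "INSERT INTO src_emp VALUES (?, ?, ?)",
    [(1, "SMITH", 800.50), (2, "ALLEN", 1600.75)],
)
# The second salary lost its decimal part during the load.
conn.executemany(
    "INSERT INTO tgt_emp VALUES (?, ?, ?)",
    [(1, "SMITH", 800.50), (2, "ALLEN", 1600.00)],
)
mismatches = rows_missing_in_target(conn)
print(mismatches)
```

Running the same query with the two tables swapped would list rows that exist only in the target.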

4)      Threshold/Data Integrated Testing:

In this testing, the test engineer checks the ranges of the data, for example in population calculations, share-market data and business finance analysis (quarterly, half-yearly, yearly).

MIN     MAX     RANGE
4       10      6
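The MIN, MAX and RANGE values above can be computed with a single aggregate query. This is a small sketch under assumptions: an in-memory SQLite database and a hypothetical table measures with one numeric column val.

```python
import sqlite3

def value_range(conn):
    """Return (minimum, maximum, range) of the hypothetical measures.val column."""
    low, high = conn.execute("SELECT MIN(val), MAX(val) FROM measures").fetchone()
    return low, high, high - low

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measures (val INTEGER)")
conn.executemany("INSERT INTO measures VALUES (?)", [(4,), (7,), (10,)])
threshold = value_range(conn)
print(threshold)
```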

                                            

5)      Field to Field Testing:

In field-to-field testing, the test engineer checks how much space each field occupies in the database and that the data fits the datatypes of the table.

NOTE: Also check the order of the columns and that each source column maps to the correct target column.
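A rough sketch of the column order and datatype comparison follows. It relies on SQLite's PRAGMA table_info, which is specific to SQLite (Oracle exposes the same information through its data dictionary instead); the tables src_emp and tgt_emp are hypothetical.

```python
import sqlite3

def column_layout(conn, table):
    """Return [(column_name, declared_type), ...] in column order."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_emp (eno INTEGER, ename VARCHAR(20), sal NUMERIC(7,2))")
# The target declares a shorter ename field than the source.
conn.execute("CREATE TABLE tgt_emp (eno INTEGER, ename VARCHAR(10), sal NUMERIC(7,2))")

source_cols = column_layout(conn, "src_emp")
target_cols = column_layout(conn, "tgt_emp")
differences = [(s, t) for s, t in zip(source_cols, target_cols) if s != t]
print(differences)
```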

6)      Duplicate Check Testing:

Duplicate values turn up very frequently in this phase of ETL testing. Because the source and target tables hold huge amounts of data, the tester finds them with database queries:

SELECT ENO, ENAME, SAL, COUNT(*) FROM EMP GROUP BY ENO, ENAME, SAL HAVING COUNT(*) > 1;

Note:

1)      If the primary key is defined incorrectly, or no primary key is allotted, duplicates may arise.

2)      A developer can make mistakes while transferring the data from source to target; duplicates may arise then.

3)      Environment mistakes (for example improper plugins in the tool) can also cause duplicates.

7)      Error/Exception Logical Testing:

1)      Delimiter is available in Valid Tables

2)      Delimiter is not available in invalid tables(Exception Tables)

8)      Incremental and Historical Process Testing:

During an incremental load, the historical data must not be corrupted. If the historical data is corrupted, a bug is raised.
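One possible way to check this is to snapshot the target before the incremental load and confirm that every historical row is still present afterwards. The sketch below assumes an in-memory SQLite database, hypothetical tables src_emp and tgt_emp, and a simple insert-only incremental load.

```python
import sqlite3

def snapshot(conn):
    """Return all target rows in a stable order."""
    return conn.execute("SELECT eno, ename, sal FROM tgt_emp ORDER BY eno").fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_emp (eno INTEGER, ename TEXT, sal REAL)")
conn.execute("CREATE TABLE tgt_emp (eno INTEGER, ename TEXT, sal REAL)")
conn.executemany(
    "INSERT INTO src_emp VALUES (?, ?, ?)",
    [(1, "SMITH", 800.0), (2, "ALLEN", 1600.0), (3, "WARD", 1250.0)],
)
conn.executemany(
    "INSERT INTO tgt_emp VALUES (?, ?, ?)",
    [(1, "SMITH", 800.0), (2, "ALLEN", 1600.0)],
)

before = snapshot(conn)
# Incremental load: bring over only the rows the target does not have yet.
conn.execute(
    "INSERT INTO tgt_emp "
    "SELECT eno, ename, sal FROM src_emp "
    "WHERE eno NOT IN (SELECT eno FROM tgt_emp)"
)
after = snapshot(conn)

history_intact = all(row in after for row in before)
new_rows = [row for row in after if row not in before]
print(history_intact, new_rows)
```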

9)      Control Columns and Defect Values Testing:

This is introduced by IBM

10)   Navigation Testing:

Navigation testing is testing from the end user's point of view. If an end user cannot find their way through the application easily, the navigation is called bad or poor navigation.

During testing, the tester identifies such navigation scenarios so that unnecessary navigation can be avoided.

11)   Initialization testing:

Testing the combination of hardware and software installed on the platform is called initialization testing.

12)    Transformation Testing:

If a transformation does not follow the mapping condition when data is mapped from the source table to the target table, the test engineer raises a bug.
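As an illustration only, suppose the mapping document says the target column full_name must be the upper-cased first and last name joined by a space. That rule, the tables src_emp and tgt_emp, and the in-memory SQLite database are all assumptions made for this sketch.

```python
import sqlite3

def transformation_failures(conn):
    """Return target rows whose full_name breaks the assumed mapping rule."""
    return conn.execute(
        "SELECT s.eno, t.full_name "
        "FROM src_emp s JOIN tgt_emp t ON t.eno = s.eno "
        "WHERE t.full_name <> UPPER(s.first_name || ' ' || s.last_name) "
        "ORDER BY s.eno"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_emp (eno INTEGER, first_name TEXT, last_name TEXT)")
conn.execute("CREATE TABLE tgt_emp (eno INTEGER, full_name TEXT)")
conn.executemany(
    "INSERT INTO src_emp VALUES (?, ?, ?)",
    [(1, "john", "smith"), (2, "mary", "allen")],
)
# The second row was loaded without the upper-case transformation.
conn.executemany(
    "INSERT INTO tgt_emp VALUES (?, ?)",
    [(1, "JOHN SMITH"), (2, "Mary Allen")],
)
failures = transformation_failures(conn)
print(failures)
```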

13)   Regression Testing:

Modifying code to fix a bug or to implement new functionality can introduce new errors.

These introduced errors are called regressions. Checking for regression effects is called regression testing.

14)   Retesting:

Re-executing the failed test cases after the bug has been fixed.

15)     System Integration Testing:

Integration testing: After the programming process is complete, the developer integrates the modules. There are 3 models:

a)      Top Down

b)      Bottom Up

c)       Hybrid

6.   Types of etl bugs

1. User interface bugs/cosmetic bugs:-

     Related to the GUI of the application

     Navigation, spelling mistakes, font style, font size, colors, alignment.

2. BVA Related bug:-

     Minimum and maximum values

3. ECP Related bug:-

     Valid and invalid type

4. Input/output bugs:-

     Valid values not accepted

     Invalid values accepted

5. Calculation bugs:-

     Mathematical errors

     Final output is wrong 

 6. Load condition bugs:-

     Does not allow multiple users

     Does not allow the customer's expected load

 7. Race condition bugs:-

     System crash & hang

     System cannot run on client platforms

 8. Version control bugs:-

     No logo matching

     No version information available

     This occurs usually in regression testing 

 9. H/W bugs:-

     Device is not responding to the application 

10. Source bugs:-

     Mistakes in help documents

7.   Bug reporting

  Bug Life Cycle (or) Defect Tracking Process

DETECT DEFECT
REPRODUCE DEFECT
REPORT DEFECT
BUG FIXING / BUG RESOLVING
BUG CLOSING

8.   Testing templates

      1.       Issue log/Clarification template

      2.       Test case template

      3.       Bug reporting template

      4.       Metrics template

Issue log/Clarification template:-

Reference (Doc Name) | Issue Description | Clarification provider | Status | Raised date | Clarified date | Clarified by | Remarks

Test case template:-

S.NO | TC_ID | Description | Expected Result | Status | Query | Comment

Bug reporting template:-

Defect_ID | Description | Build_ID | Version_ID | Severity | Priority | Status | Assigned to | Detected By

Metrics template:-

Date | No. of test cases designed | No. of test cases executed | No. of test cases failed | No. of test cases on hold | No. of defects logged | Comments

9.   ETL performance testing

ETL Performance Tuning:

In the ETL performance testing phase, the tester can work at the database level or the core database level. Database testers and ETL testers can both be involved in performance tuning as well. Performance tuning is server-side work.

What is Performance Testing?

Performance testing tests the server response under different user loads. Its purpose is to find bottlenecks in the application.

What is a Bottleneck?

A bottleneck is the break point at which the server is at its peak, that is, the point at which the server is fully busy with user requests.

ETL Performance Life cycle:

•       Workflow requirements

•       Performance objective

•       Performance testing

•       Performance measurements

•       Performance tuning

•   ETL workflow requirements:-

In the workflow requirements phase, the ETL tester identifies the performance scenarios: how the database connects to the server, which environment supports the performance testing, and which front-end and back-end environments, batch jobs, data merging, file system components and, finally, reporting events need to be checked.

•   Performance objective:

The performance objective sets the goal for starting end-to-end performance testing. Most of the time it is decided by the client.

•   Performance testing:

To measure the speed of the project, the ETL tester tests at the database level: is the target being loaded properly or not? When the ETL developer does not load the data under the proper conditions, the performance of the system suffers.

•   Performance measurements:

During performance test execution, we need to measure the metrics below.

    1. Client-side metrics
    2. Hits/sec
    3. Throughput
    4. Memory allocation
    5. Process resources
    6. Database statistics (database user conditions)
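Of the metrics above, throughput is the easiest to sketch: time a load and divide the rows loaded by the elapsed seconds. The example below is only an illustration on an in-memory SQLite database with a hypothetical tgt_emp table; real measurements would come from the actual ETL tool and server.

```python
import sqlite3
import time

def timed_load(rows):
    """Load rows into a fresh table and return (rows_loaded, seconds, rows_per_second)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tgt_emp (eno INTEGER, sal REAL)")
    start = time.perf_counter()
    conn.executemany("INSERT INTO tgt_emp VALUES (?, ?)", rows)
    conn.commit()
    elapsed = time.perf_counter() - start
    loaded = conn.execute("SELECT COUNT(*) FROM tgt_emp").fetchone()[0]
    # Guard against a zero reading from a very coarse clock.
    throughput = loaded / elapsed if elapsed > 0 else float("inf")
    return loaded, elapsed, throughput

loaded, elapsed, throughput = timed_load([(i, 1000.0 + i) for i in range(10_000)])
print(f"{loaded} rows in {elapsed:.4f} s ({throughput:.0f} rows/s)")
```

Repeating the measurement with growing row counts gives a first idea of where the load stops scaling.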

•   Performance tuning:

Performance tuning is the mechanism for getting performance-related issues fixed. As performance testers, we give recommendations to whoever is responsible for tuning at each level:

Code level      ---------- Developer
Database level  ---------- DBA
Network level   ---------- Administrator
System level    ---------- S/A
Server level    ---------- Server-side people

10. ETL interview questions

1. What is the difference between OLAP and OLTP?
2. Tell me about your ETL workflow process.
3. What is the difference between an operational database and a warehouse?
4. What type of approach do you follow in your project?
5. What is the difference between a data mart and a data warehouse?
6. Which type of database are you using in your project, and how much space does it use?
7. Explain the test case template.
8. What is the difference between severity and priority?
9. What is the difference between SDLC and STLC?
10. What is the difference between an issue log and a clarification log?
11. What type of bugs have you faced in your project?
12. What is banking?
13. What are the types of banking?
14. What is the difference between a dimension table and a fact table?
15. Explain SCDs and their types. How are they used?
16. Explain bug reporting.
17. Are you using any models in SDLC?
18. Which process is used in ETL testing?
19. What is unit testing? Who does it?
20. What is the difference between incremental load and initial load?
21. Which document did you use to carry out your project?
22. Are you using the Requirement tab in QC?

11. Project with example

Here I take the emp table as an example and write test scenarios and test cases for it; in other words, we are testing the emp table.

 Check List or Test Scenarios:-

    1. To validate the data in the table (emp)
    2. To validate the table structure
    3. To validate the null values of the table
    4. To validate the null values of every attribute
    5. To check the duplicate values of the table
    6. To check the duplicate values of each attribute of the table
    7. To check the field value or space (length of the field size)
    8. To check the constraints (foreign key, primary key)
    9. To check the names of the employees who have not earned any commission
    10. To check all employees who work in a given dept no (Accounts dept, Sales dept)
    11. To check the row count of each attribute
    12. To check the row count of the table
    13. To check the max salary from the emp table
    14. To check the min salary from the emp table
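A few of the scenarios above (row count, null check, employees without commission, max and min salary) can be sketched as queries. The emp layout used here (eno, ename, sal, comm, deptno) and its sample rows are assumptions for illustration, and SQLite stands in for the project database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (eno INTEGER PRIMARY KEY, ename TEXT, sal REAL, comm REAL, deptno INTEGER)"
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?, ?)",
    [
        (7369, "SMITH", 800.0, None, 20),
        (7499, "ALLEN", 1600.0, 300.0, 30),
        (7521, "WARD", 1250.0, 500.0, 30),
        (7839, "KING", 5000.0, None, 10),
    ],
)

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

checks = {
    # Scenario 12: row count of the table.
    "row_count": scalar("SELECT COUNT(*) FROM emp"),
    # Scenario 4 (one attribute): null values in ename.
    "null_ename": scalar("SELECT COUNT(*) FROM emp WHERE ename IS NULL"),
    # Scenario 9: employees who have not earned any commission.
    "no_commission": [
        row[0]
        for row in conn.execute(
            "SELECT ename FROM emp WHERE comm IS NULL OR comm = 0 ORDER BY ename"
        )
    ],
    # Scenarios 13 and 14: max and min salary.
    "max_sal": scalar("SELECT MAX(sal) FROM emp"),
    "min_sal": scalar("SELECT MIN(sal) FROM emp"),
}
print(checks)
```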

Introduction to database:-

Data: The properties of anything are called data.

Ex:- Meaningful facts, text, graphics, images, sound, video segments

Information: Data processed to be useful in decision making

Ex:- A student got 1st rank.

Database: A place to store the information

In earlier days, information was stored in flat file systems such as:

1. Spread sheets

2. Folders

3. Ledgers

4. List

The storage methods mentioned above are called flat file systems.

Disadvantages:-

Data redundancy

Limited data sharing

Excessive program maintenance

File System Approach Access:

A separate file has to be maintained for each program.

To avoid these drawbacks, "RDBMS" came into the picture.

RDBMS:

It is an advanced version of DBMS with relationships. It is also used to store and manage data, in a more efficient way than DBMS.

RDBMS Approach

You can't connect directly to the database; it won't allow that. So we use an RDBMS.

12. SQL

SQL stands for Structured Query Language. Its purpose is to store and manage information in a relational database.

SQL is a set of standards maintained by ANSI.

Installation procedures for Oracle 10g, 11g:

Installation of Oracle 10g in Windows XP

Installation of Oracle 11g in Windows 7

After installing the SQL environment, prepare the content below and practice it as you read.

DATAWAREHOUSE-BASICS

What is a Data warehouse? Why do we need a Data warehouse?

According to Ralph Kimball:

A data warehouse is a relational database that is designed for querying and analyzing the business, not for transaction processing. It usually contains historical data derived from transactional data (different source systems).

According to W.H. Inmon:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data used to support the strategic decision-making process.

Characteristic features of a Data warehouse:

1. Subject Oriented
2. Integrated
3. Non-volatile
4. Time Variant

Note: The first data warehousing system was implemented in 1987 by W.H. Inmon.

Subject Oriented: Data warehouses are designed to be subject oriented: they are used to analyze the business by top-level management, middle-level management, or an individual department in an enterprise.

[Figure: Process Oriented (Transactional Storage) vs. Subject Oriented (Data Warehouse Storage)]

For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated: A data warehouse is an integrated database which contains the business information collected from various operational data sources.

 

 

[Figure: Integration of data from Transactional Storage into Data Warehouse Storage]

Aspect                 Appl. A              Appl. B                Appl. C            Data warehouse
Encoding               M, F                 1, 0                   X, Y               M, F
Unit of attributes     pipeline cm          pipeline inches        pipeline mcf       pipeline cm
Physical attributes    balance dec(13,2)    balance PIC 9(9)V99    balance float      balance dec(13,2)
Naming conventions     bal-on-hand          current_balance        balance            balance
Data consistency       date (Julian)        date (yymmdd)          date (absolute)    date (Julian)
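The integration shown in the figure can be sketched for two of its aspects, encoding and unit of attributes. This is an illustration under assumptions: which source code stands for which gender is guessed from the order in the figure, and only the inches to cm conversion is included because the source gives no conversion for mcf.

```python
# Per-application gender codes mapped to the warehouse standard (M, F).
# Which source code means which gender is an assumption for illustration.
GENDER_CODES = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}

# Factor that converts each application's pipeline unit to centimetres.
# Appl. C (mcf) is left out: no conversion is given in the source.
TO_CM = {"A": 1.0, "B": 2.54}

def integrate(app, gender, pipeline):
    """Return (gender, pipeline_cm) in the warehouse's standard form."""
    return GENDER_CODES[app][gender], round(pipeline * TO_CM[app], 2)

record = integrate("B", "1", 10.0)
print(record)
```

The same lookup-table approach extends to the naming and date-format rows of the figure.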

Time Variant: A data warehouse is a time-variant database which allows you to analyze and compare the business with respect to various time periods (year, quarter, month, week, day), because it maintains historical data.

[Figure: Current Data (Transactional Storage) vs. Historical Data (Data Warehouse Storage)]

Non-volatile: A data warehouse is a non-volatile database. That means once data has entered the data warehouse, it cannot change. It does not reflect changes that take place in the operational database. Hence the data is static.

[Figure: Volatile vs. Non-volatile]

According to Babcock:

A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support.

Why do we need a Data warehouse?

1. To store large volumes of historical detail data from mission-critical applications
2. Better business intelligence for end users
3. Data security: to prevent unauthorized access to sensitive data
4. Replacement of older, less-responsive decision support systems
5. Reduction in time to locate, access, and analyze information

Evolution:

1. 60's: Batch reports
   1. hard to find and analyze information
   2. inflexible and expensive, reprogram every new request
2. 70's: Terminal-based DSS and EIS (executive information systems)
   1. still inflexible, not integrated with desktop tools
3. 80's: Desktop data access and analysis tools
   1. query tools, spreadsheets, GUIs
   2. easier to use, but only access operational databases
4. 90's: Data warehousing with integrated OLAP engines and tools

What is an Operational System? OR What is OLTP?

1. Operational systems are the systems that help us run the day-to-day enterprise operations.

2. On-Line Transactional Processing (OLTP) systems are not built to hold history data.
3. These systems hold current data only.
4. The data in these systems is maintained in 3NF. The data is used for running the business, not for analyzing it.
5. Examples are online reservations, credit-card authorizations, ATM withdrawals, etc.

Difference between OLTP and Data warehouse (OLAP)

In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

Operational System (OLTP)                      Data warehouse (OLAP)
Designed to support business                   Designed to support the
transactional processing                       decision-making process
Application-oriented data                      Subject-oriented data
Current data                                   Historical data
Detailed data                                  Summary data
Volatile data                                  Non-volatile data
Less history (3-6 months)                      More history (5-10 years)
Normalized data                                De-normalized data
Designed for running the business              Designed for analyzing the business
Supports E-R modeling                          Supports dimensional modeling
Clerical users can access this data            Knowledge users can access this data
DB size: 100MB-GB                              DB size: 100GB-TB
Few indexes                                    Many indexes
Many joins                                     Some joins

Advantages of Data Warehousing:

1. High query performance
2. Queries not visible outside warehouse
3. Can operate when sources unavailable
4. Can query data not stored in a DBMS
5. Extra information at warehouse
   1. Modify, summarize (store aggregates)
   2. Add historical information
6. Improves the quality and accessibility of data.
7. Reduces the requirements of users to access operational data.
8. Allows new reports and studies to be introduced without disrupting operational systems.
9. Increases the amount of information available to users

Types of Data warehouse: There are three types of data warehouses.

Centralized data warehouse: A centralized DW is one in which data is stored in a single, large primary database. This database can be queried directly or used to feed data marts.

Functional data warehouse: A functional DW is dedicated to a subset of the business, such as a marketing or finance business function.

1. Separate DWs for different business capabilities
2. Easier to build initially