Master's thesis Two years
International Master's Degree in Computer Engineering Delivering Business Intelligence Performance by Data Warehouse and ETL Tuning AV level, 30hp
Ghazal Tashakor
Mid Sweden University
The Department of Information Technology and Media (ITM)
Author: Ghazal Tashakor
Address: [email protected]
Study programme: M.Sc. in Engineering – Computer Engineering, 120hp
Examiner: Tingting Zhang, [email protected]
Tutors: Fredrik Håkansson, Katarina Lundqvist
Scope: 9071 words
Date: 2013-10-21
M.Sc. Thesis within
Computer Engineering AV
30 ECTS
Delivering Business Intelligence Performance
by Data Warehouse and ETL Tuning
Ghazal Tashakor
Abstract

The aim of this thesis is to show how organizations such as CGI attempt
to introduce BI solutions through IT and other operational methods in
order to serve large companies that want to strengthen their competitive
market position. This aim is achieved by a gap analysis of the BI
roadmap and the available data warehouses, based on one of the
company projects handed over to CGI from Lithuania.
The fundamentals in achieving BI solutions through IT, on which the
thesis methodology is built, are data warehousing, content analytics and
performance management, data movement (Extract, Transform and
Load), the CGI BI methodology, business process management, the
TeliaSonera Maintenance Management Model (TSM3) and, at a high
level, the AM model of CGI.
The practical part of the thesis requires research and hands-on work in
Informatica PowerCenter and Microsoft SQL Server Management
Studio, together with low-level details such as database tuning, DBMS
tuning implementation and ETL workflow optimization.
Keywords: BI, ETL, DW, DBMS, TSM3, AM, Gap Analysis
Acknowledgements
I would like to thank my parents, whose support gave me the
opportunity to continue my academic pursuit and especially to complete
this thesis, and Mehdi Bonvari, who has always helped me and believed
that I could do it. I want to express my gratitude to my teacher,
Professor Tingting Zhang at Mid Sweden University, whose guidance
helped me in the writing of this thesis.
Table of Contents

Abstract
Acknowledgements
Terminology / Notation
1 Introduction
1.1 Data warehouse / Business Intelligence (DWBI) Background
1.2 Application performance management Background
1.2.2 CGI Outsource Projects Maintenance Process
1.3 Problem Motivation
1.4 Overall aim and verifiable goals
1.5 Restrictions
1.6 Scope
1.7 Outline
2 Theory / Related work
2.1 Business Intelligence Analytical Frameworks and Method
2.2 Business and IT solution Management
2.2.1 TeliaSonera Application Maintenance Management
2.2.2 TSM3
2.2.3 CGI Application Management Model
2.2.4 CGI Pyramid of BI solutions
2.3 Business Intelligence Development Studio-SQL Server
2.3.1 Monitoring and Management with SQL Server Management Studio (SSMS)
2.3.2 Data management Query Processing
2.3.3 Microsoft BIDW Integration Scenarios
2.3.4 Troubleshooting DW Performance problems in SQL Server 2008
2.4 Database Tuning
2.5 ETL
2.5.1 Informatica PowerCenter
2.5.2 ETL Performance measures and Tuning
2.6 Lithuania BIDW overview
2.7 Lithuania ETL overview
3 Methodology / Model
3.1 Experimental Methodology
3.2 Implementation Scenarios
3.2.1 BIDW Performance Logical Method and Tools
3.3 Theoretical Data
3.4 Practical Task
4 Implementation
5 Results
5.1 DMV results before analyzing query performance
5.2 Analyzing Query Performance results for Where, Join and Union
5.2.1 Where Queries
5.3 Query Optimization outputs of Lithuania ETL Data Flows
5.4 Disk Usage results based on new Index types
6 Conclusions
References
Appendix A: Shredding an XML document via T-SQL for inserting its data into a SQL Server table
Appendix B: User manual BIDW Operational Guidelines Data Mart and ETL for Lithuania
Figure B.1: Database structure of Omnital
Figure B.2: Lithuania ETL Informatica DB Schedule
Figure B.3: Lithuania ETL rough overview
Terminology / Notation

Acronyms / Abbreviations
AM Application Management Model
APM Application Performance Management
BI Business Intelligence
BICC Business Intelligence Competency Center
CRT Customer Recommendation Tool
DDL Data Definition Language
DML Data Manipulation Language
DMV Dynamic Management Views and Functions
DTA Database Engine Tuning Advisor
DW Data Warehouse
ETL Extract, Transform, Load
JOQR Join Ordering and Query Rewrite
OLAP Online Analytical Processing
PDW Parallel Data Warehouse
SSAS SQL Server Analysis Services
SSMS SQL Server Management Studio
TSM3 TeliaSonera Maintenance Management Model
WMS Warehouse Management System
1 Introduction

Business Intelligence (BI) application analysis and performance
management are used within the IT departments of companies. BI
consists of a set of management and analytic processes, supported by
technology, which enable a business to define strategic goals and to
measure performance against those goals. [10]
DWBI (Data Warehousing & Business Intelligence) is a complex process
with huge benefits to the overall cost and action time in relation to the
business process. It was developed to bring some balance to the constant
battle between performance and productivity by aggregating various
enterprise data to a warehouse, where business queries run in order to
retrieve intelligent information. [10]
The business intelligence (BI) roadmap shows that Data Warehouse and
Business Intelligence (DWBI) project teams first attempt to develop
applications based on a BI methodology, after which the project is
handed over to the maintenance and performance management
consultant team for evaluation and ongoing support. [1][10]
1.1 Data warehouse/ Business Intelligence (DWBI) Background

Data warehouse/business intelligence (DWBI) application services are
divided into DWBI development services and DWBI application
management support. [6][1]
DWBI development services contain data integration, analysis and
reporting, which are important parts of the BI implementation life-cycle.
Business intelligence implementation is sometimes delayed because
projects lack BI tools and technologies within data warehouses (DW),
extract, transform and load (ETL) or other BI core components.
A gap analysis is required in order to bridge these BI implementation
gaps and, consequently, to eliminate data flow process bottlenecks.
The known ways of utilizing gap analysis are: gathering and
understanding current and future business objects based on an
application management model; checking a number of key services such
as BI architecture design, data modelling, and extract, transform and
load (ETL) analytical reporting; and using existing resources and
solutions to fill these gaps and configure the system properly for BI.
Deep expertise in all of the following services and technologies is
required.

Business Intelligence & Information Delivery Technologies: MicroStrategy, Cognos, Business Objects
Relational Database Management Systems: Oracle, SQL Server, MySQL, MS Access
Extract, Transform & Load (ETL) Technologies: Pentaho, Informatica, Oracle Data Integrator, SAP

Table 1 - BI/DW Technologies Table
Since business intelligence is not a short-term endeavour, DWBI
consultants usually offer customers an ongoing IT support plan (IT
solutions) for the entire application maintenance organization. This
support plan is called a business solution, and in this manner BI
application management support creates a balance that controls costs.
Consulting services and technologies provide business intelligence
solutions to customers based on a delivery model for application
management.
1.2 Application performance management Background
The monitoring and management of the performance of software
applications, and of application performance problems, in the
information technology and systems management fields is called
application performance management (APM). APM translates IT
metrics into business metrics.
1.2.2 CGI Outsource Projects Maintenance Process
Transition is an application maintenance process that executes the
delivery on both the business and technical levels. The process is
normally carried out in the form of a project and the delivery time is
clearly defined.
It includes moving an existing service from a project to CRT maintenance.
Part of the CRT duties at CGI is determining solutions for small
transition projects handed over by TeliaSonera, including the Lithuania
project, which served as the practical grounds for this thesis.
1.3 Problem Motivation
For all cases, CGI has one set of transition checklists and approved
processes, which are not the same as the TeliaSonera Maintenance
Management Model (TSM3) checklists. The first problem concerns these
differences during the transition/handover phase, which produces the
requirement for business solutions.
The second problem, which underlies the thesis and calls for an IT
solution, presents itself as missing links in the data warehouse and ETL
process of the above-mentioned handover project from Lithuania to
CGI. Some of the problems that occurred within the interacting systems
during the observation phase are:
Latency or non-deliveries during the process of loading into
the file structure data mart (S1).
Long loading times and instability in the loading locks
during the process of transforming information; for
example, some workflows would deadlock while data was
loading to targets, or database errors occurred while
writing to the database.
Locked log files, locked event files.
Duplicate records and indexes during the process of
transforming data in the final data mart (S3).
1.4 Overall aim and verifiable goals
The overall aim is to produce an improvement in the transition process.
To achieve this overall aim, the most important part is a gap analysis
comparing the desired performance with the actual performance of the
CGI transition processes, which are built on SQL Server 2008 data marts,
with ETL written in Informatica PowerCenter based on daily source-
system flat files. The purpose of the gap analysis is to use self-study and
a practical part to identify the most obvious gaps in the IT system of
CRT.
The verifiable goals are:
Gap analysis of the TeliaSonera business intelligence
roadmap and the available data warehouse, based on one
of the company projects handed over to CGI from
Lithuania, in order to bridge and provide IT solutions for
gaps in the CGI business intelligence implementation
checklists.
Developing new analyses, such as a new indexing strategy
and new ETL execution workflows.
1.5 Restrictions
The thesis plan focuses on the TeliaSonera global organization for
Lithuania, which is separate from those of Sweden, Norway and
Denmark.
Although there are other important tools used in the TeliaSonera global
organizations, such as Portrait Dialogue and HP BTO, the IT solutions
here are applied solely in SSMS and CRT PowerCenter.
It should be noted that administrative access to the SQL-data mart was
also limited, which proved problematic in the practical phase.
In the physical design process, it is usual to use OLAP but, here in this
methodology, this is not of concern, which of course doesn’t negate its
importance, only that it falls outside the scope of work of this project.
1.6 Scope
This study takes up the problem of data warehouse performance
techniques and focuses on improving the overall delivery of the
data-to-information process. The TeliaSonera SQL data marts were
surveyed for design errors, and tests were performed to determine the
efficiency of the design and of the ETL process applied in the
day-to-day functions of the system.
The thesis is distinguished by the evaluation of design efficiency and
ETL implementation, and by the practical design improvements which
have resulted from the survey and gap analysis. The survey's
conclusions are generally valid for every large database, and in
particular for data warehouses in similar companies in the telecom
industry, where database performance is tied to query service
performance and the ETL process.
1.7 Outline
The survey begins with a concise background on Data Warehouses /
Business Intelligence (DWBI) and other significant concepts required in
relation to the problem, including TeliaSonera application maintenance
management (AM) and the CGI outsource project maintenance process.
Chapter 2 contains a description of the system and of the business
processes and activities which are the actual subject of the gap analysis.
The theoretical background and problem survey, plus the tests,
examinations and experiments performed, are also covered in this
chapter.

The practical and theoretical work represented in chapter 2 directly
results in the development of a model used to design the methodology,
which is described in chapter 3 of this document. The implementation
scenario and a description of the steps performed complete the
discussion in chapter 3.

A full report of the methodology implementation is explained and
illustrated in chapter 4, and the results produced, including query
analysis results and query optimization reports, are documented in
chapter 5. Finally, chapter 6 contains the conclusions which have
resulted from the observation of the problem, the tests and the
implementation results.
2 Theory / Related work

The theory is based on business intelligence analytical frameworks and
method, as well as business and IT solution management. In this part,
Business Intelligence Development Studio for SQL Server is used as the
tool for both monitoring and management, plus query processing.
Database tuning and ETL also form parts of the theoretical work, which
leads to an overview of the Lithuanian BIDW and ETL processes.
2.1 Business Intelligence Analytical Frameworks and Method
Delivering and managing information for the monitoring and
management of an organisation and its business processes and activities,
with a set of products and technologies, is called business intelligence. [10]
CGI has one of the strongest capabilities in the IT industry for delivering
BI projects.
Their solutions cover audit, design, build and operation. Customers can
outsource their entire BI function to one of CGI's three offshore
competence centers, or their BI systems can be run locally as a managed
service. [1]

They focus not only on the business and IT aspects of the solution, but
also on the people issues that frequently cause BI projects to fail. Their
services extend to the establishment of BI Competence Centres (BICCs)
in the customer's organization, based on the proven BICC model which
CGI itself uses. [1]
Through the management consultancy, they bring vertical expertise to
the customer in areas such as finance, public sector, telecoms and manu-
facturing. They ensure that the BI project generates the right information
at the right time and place, to the right people, so that customers can
improve their business decision making. [1]
CGI BI framework is a proven methodology which enables the system to
deliver a complete end-to-end solution with a consistent, repeatable and
global approach. [1]
The framework, as illustrated in Figure 1, covers project, process,
implementation and people issues. On the right side of the diagram are
project-driven programmes, and on the left are day-to-day
processes. [1]
CGI has 2,000 BI consultants and the specialist skills listed in Table 2.
Country Specialisation
France SAP Business Objects, IBM Cognos, Oracle
Germany SAP Business Objects, IBM Cognos, Infor PM 10
Netherlands Oracle, Informatica, SAS
Nordics Microsoft
United Kingdom Predictive Analytics
Table 2 - BI specialization [1]
CGI has extensive experience working across the complete spectrum of
third party vendor products which assist them in adopting a business
perspective and use this implementation knowledge and expertise to
design the best solution. See Figure 1. [1]
Figure 1 - CGI’s BI Management Framework [1]
2.2 Business and IT solution Management
This section firstly describes the solution management services used in
CGI and TeliaSonera after which the methods and processes used in the
theoretical part of the project will be described. Following this, there is a
brief description of Business Intelligence Development Studio-SQL
Server and its usage.
2.2.1 TeliaSonera Application Maintenance Management
Business intelligence solutions at CGI manage the maintenance roles
and processes, which are described in maintenance objects based on the
TeliaSonera maintenance management model (TSM3). TSM3 was
delivered in December 2005. Its components are illustrated in Figure 2.
Figure 2 - TSM3 Components Relations
2.2.2 TSM3
The TeliaSonera Maintenance Management Model (TSM3) is a model
for organizing maintenance in order to conduct a system in a more
business-oriented manner. The model is based on developing
maintenance objects in which IT systems are included. [1]
The TSM3 model is based on research and practical experience and it
has three principles to create a maintenance assignment.
1. Identify the maintenance object
2. Specify the maintenance assignment
3. Clarify the roles of responsibility
The maintenance object supports the customer recommendation
solutions in both IT and business activities. IT activities are divided into
three main areas, namely development, maintenance and operations.
The IT environment is hosted on TeliaSonera global organization
servers located in Sweden, and on other servers in Norway, Denmark,
Finland and Lithuania. See Figure 3. [1]
Figure 3 - Maintenance object [1]
The customer recommendations solution consists of the Customer
Recommendation Tool (CRT), which gathers all the required customer
information in the CRT data mart for recommendations, campaign
presentation and reports. ETL is used to populate this CRT data mart.
The data environment is located and maintained in Sweden. [1]
The task of the CRT maintenance is the daily monitoring and checking
so that all the scheduled tasks run daily, weekly and monthly, plus
checking log files and making reports. Error handling is conducted
when a job has not run as scheduled, and solving the related problems
is another maintenance activity. [1]
The major part is identifying and solving common problems that occur
in the managed BI system of CGI by finding IT solutions. [1]

CGI manages many different types of systems for its customers. In some
cases, CGI utilizes its own transition processes, and in other cases it uses
the customer's. [1]
2.2.3 CGI Application Management Model
Within commitments which include outsourcing, this process covers the
transformation from the present situation to CGI's delivery format for
management and/or operation, either for a completely new client or as
an addition for an existing client. This typically includes moving an
existing service from the client to CGI, adapting collaboration forms
and streamlining administration methods. [1]
The process is normally carried out in project format, where the re-
quirements regarding results and delivery time are clearly defined. In
many cases, the result of the process is taken over by a day-to-day
delivery in the form of operation and management. In outsourcing
projects, which often include a takeover of several services/functions
from a client or from the client's supplier, many different groupings
become involved in the transition process. [1]
The Lithuania project, which is the subject of this thesis, was one of
those handover projects, based on the application management model
shown in Figure 4.
Figure 4 - CGI Application Management Model [1]
2.2.4 CGI Pyramid of BI solutions
CGI designs and delivers a pyramid of BI solutions that addresses the
strategic and operational business intelligence requirements of an
organization using the CGI BI framework. Each layer in the framework
is an abstracted view representing a grouping of tools, technologies and
processes that cohesively provide a subset of functionality. See Figure 5.
[11]
Figure 5 - BI Solutions Pyramid [11]
The pyramid shows the transformation of the data whereby it becomes
actionable information. [11]
The following descriptions are the chosen data flow steps, from the
bottom of the framework upwards, which are related to the thesis. [11]

Data sources: structured data sources, data warehouses/data
marts, third-party data feeds, flat files and others. [11]
Integration: data-level and message-oriented integration to
facilitate virtualized data access. In computing, extract,
transform and load (ETL) refers to this process. [11]
Analytics: formulaic processing of data, including data mining.
Workflow: coordination of multiple triggered processes. [11]
2.3 Business Intelligence Development Studio-SQL Server
The foundation of every business intelligence solutions pyramid is a
strong data storage and aggregation strategy, which implements
advanced data warehousing and data mining techniques.

Data warehousing is one of the most important stages of BI
development.

Parallel Data Warehouse integrates with the SQL Server Business
Intelligence (BI) components, which are: [9]
1. Integration Services (SSIS)
2. Reporting Services (SSRS)
3. SQL Server Analysis Services (SSAS)
2.3.1 Monitoring and Management with SQL Server Management Studio (SSMS)
To manage database objects, it is necessary to query the tables and view
the objects. [9]

Parallel Data Warehouse includes the Admin Console, a web-based
application with possibilities for monitoring the query execution status
and other useful information for tuning user queries, indexes and the
relational database system. [9]

SQL Server Management Studio (SSMS) in SQL Server 2008 R2 is used
for managing, exploring objects and running queries interactively. [9]
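As a minimal illustration of such object exploration, the catalog views that SSMS itself presents can be queried directly; this sketch runs against any SQL Server 2008 R2 (or later) user database:

```sql
-- List user tables together with their indexes and index types,
-- using the standard catalog views sys.tables and sys.indexes.
SELECT t.name      AS table_name,
       i.name      AS index_name,
       i.type_desc AS index_type
FROM sys.tables AS t
LEFT JOIN sys.indexes AS i
       ON i.object_id = t.object_id
ORDER BY t.name, i.index_id;
```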
2.3.2 Data management Query Processing
Query processing must manage high availability, parallelization, and
data movement between nodes. The parallel data warehouse makes a
copy of the tables on each compute node and the control node follows
these steps to process a query as shown in Figure 6: [9]
1. Parse the SQL statement.
2. Validate and authorize the objects.
3. Build a distributed execution plan.
4. Run the execution plan.
5. Aggregate query results.
6. Send results to the client application.
Figure 6 - Query Processing steps [9]
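On a standard (non-PDW) SQL Server instance, the plan built in step 3 can be inspected without executing the query. A minimal sketch, in which the Customer table is hypothetical:

```sql
-- Ask SQL Server to return the estimated execution plan as XML
-- instead of running the statement (available since SQL Server 2005).
SET SHOWPLAN_XML ON;
GO

-- The returned plan shows the chosen operators (scans, seeks, join
-- types) without touching the data. dbo.Customer is a hypothetical table.
SELECT CustomerId, CustomerName
FROM dbo.Customer
WHERE CustomerId = 42;
GO

SET SHOWPLAN_XML OFF;
GO
```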
2.3.3 Microsoft BIDW Integration Scenarios
The integration services are the ETL component of SQL Server. This is
used to extract and merge data from several sources, then filter and
clean the data before loading it into the data warehouse. [9]
SQL Server 2008 R2 Parallel Data Warehouse (PDW) concerns the
querying and analysis of the stored data. SSIS packages load data into
PDW. [9]
2.3.4 Troubleshooting DW Performance problem in SQL Server 2008
There are many reasons for slowdowns in SQL Server. Some keys to
identifying the problems are: [8]
Resource bottlenecks: these consist of CPU, memory and I/O
bottlenecks. For example, a memory bottleneck can lead to
excessive paging.
Tempdb bottlenecks: because there is only one tempdb for
each SQL Server instance, if an application overloads tempdb
through DDL or DML operations, a bottleneck will occur.
A slow-running user query: the technical concern of this thesis is,
in general, slow-running queries, which can occur because of the
index strategy of the database.
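Slow-running user queries can be located from the cached execution statistics. A sketch, assuming VIEW SERVER STATE permission (which was restricted in the practical part of this project):

```sql
-- Top 5 cached statements by average elapsed time, from the
-- execution-statistics DMV available since SQL Server 2005.
SELECT TOP (5)
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
       qs.execution_count,
       SUBSTRING(st.text,
                 (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                     WHEN -1 THEN DATALENGTH(st.text)
                     ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_us DESC;
```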
Parallel query optimization has the goal of finding a parallel plan that
delivers the query result in minimal time. [8]
Most of the solutions are related to:
Underlying execution mechanisms
Machine architecture
Variations in the cost metric being minimized
Query language expressive power
To schedule and understand the optimization problem, it must be
broken into two phases, based on parallel query optimization
analysis. [8]
The first phase, JOQR (join ordering and query rewrite), produces a
query tree that fixes the join order and the method of computing each
join so as to minimize the total cost. [7]
The second phase is parallelization, which passes the query tree to a
query optimizer, which will break down the query for analysis and
generate an execution plan which will minimize the response time. [1]
Query analysis: analysing each query to identify all search arguments
(WHERE clauses), OR clauses, and UNION and JOIN clauses, which are
used in index selection. [7]
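A small illustration of such search arguments (the Orders, ArchiveOrders and Customer tables are hypothetical); the WHERE and JOIN columns marked below are the candidates that index selection would consider:

```sql
-- Each branch of the UNION is analysed separately for search arguments.
SELECT c.CustomerName, o.OrderDate, o.Amount
FROM dbo.Orders AS o
JOIN dbo.Customer AS c
  ON c.CustomerId = o.CustomerId        -- JOIN clause: index candidate
WHERE o.OrderDate >= '2013-01-01'       -- WHERE clause: index candidate
UNION ALL
SELECT c.CustomerName, o.OrderDate, o.Amount
FROM dbo.ArchiveOrders AS o
JOIN dbo.Customer AS c
  ON c.CustomerId = o.CustomerId
WHERE o.OrderDate >= '2013-01-01';
```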
The execution plan argument is always about choosing between
relational table join techniques, such as a nested loops join or a hash
join, and between processing technique options, which are divided into
stream aggregate and hash aggregate operators. Scheduling machine
resources is another part of the second phase. [7]
SQL Server 2008 introduced new features and tools that are useful for
monitoring and troubleshooting performance problems. One or more of
the following tools can be used: [12] [7]

Performance Monitor: this thesis uses a third-party tool
(PowerCenter) and ETL for performance monitoring. [7] [12]
DBCC commands: in this thesis, there was no access to these
administrative commands because of restrictions. [7] [12]
DMVs: dynamic management views and functions (Transact-SQL)
are among the most useful tools in this thesis for finding out about
the index strategy, such as missing indexes and the database index
fragmentation level. In this thesis, DMV results are used for index
optimization, and they are useful for identifying locking and
blocking issues on the database instances. [7] [12]
Extended Events: tracing events, including the deadlock graph
and showplan XML events, which have been used in this
thesis. [7] [12]
Data collector and the management data warehouse (MDW):
collecting metrics on the servers over time for performance
troubleshooting. [7] [12]
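Two DMV queries of the kind used for this index analysis can be sketched as follows (standard SQL Server 2008 DMVs; run in the context of the database under study):

```sql
-- 1) Missing-index suggestions recorded by the optimizer since the
--    last restart, ordered by their estimated impact.
SELECT mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.avg_user_impact,
       migs.user_seeks
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
  ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
  ON migs.group_handle = mig.index_group_handle
ORDER BY migs.avg_user_impact * migs.user_seeks DESC;

-- 2) Fragmentation level of every index in the current database.
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
ORDER BY ips.avg_fragmentation_in_percent DESC;
```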
2.4 Database Tuning
Database tuning is the activity of obtaining higher throughput and
lower response times so that the database application runs more
quickly. Database tuning is mostly based on index tuning and on tuning
the relational system of a given database. [6]

Index tuning concerns, such as search keys, data structure performance
(B+ trees), index clustering, index maintenance under updates
(insertions and deletions) and covering indexes, form the most
important part of index implementation for the SQL Server. [12]
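The covering-index and maintenance techniques named above can be sketched as follows (the dbo.Orders table and its columns are hypothetical):

```sql
-- A covering nonclustered index: the key serves the WHERE/JOIN
-- column and INCLUDE carries the remaining selected columns, so a
-- query on OrderDate can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate_Covering
ON dbo.Orders (OrderDate)
INCLUDE (CustomerId, Amount);

-- Index maintenance: rebuild for high fragmentation (roughly > 30 %),
-- reorganize for moderate fragmentation (roughly 5-30 %).
ALTER INDEX IX_Orders_OrderDate_Covering ON dbo.Orders REBUILD;
-- ALTER INDEX IX_Orders_OrderDate_Covering ON dbo.Orders REORGANIZE;
```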
Figure 7 - Database Tuner Model [6]
For tuning the relational system of a given database, schema tuning and
query tuning are required. Denormalization and vertical partitioning
are suggested for schema tuning, while query rewriting tunes a
relational system so that it runs faster. [6]
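A small query-rewrite sketch (hypothetical table): the rewrite preserves the result set but makes the predicate sargable, i.e. usable by an index on OrderDate:

```sql
-- Before: the function applied to the column prevents an index seek.
SELECT OrderId, Amount
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2013;

-- After: an equivalent range predicate that an index on OrderDate
-- can serve with a seek instead of a scan.
SELECT OrderId, Amount
FROM dbo.Orders
WHERE OrderDate >= '2013-01-01'
  AND OrderDate <  '2014-01-01';
```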
2.5 ETL
The majority of data management and business intelligence project
expenditure results from incorrect decisions about employing
architectures that may not be suitable for the design and development
of data warehouses. [5]
Extracting (reading from a source database), transforming (converting the form of the data) and loading (writing into a target database) are the three database functions implemented and customized in the tool named ETL, which takes its name from the acronym Extract-Transform-Load. [5]
The ETL standard is a traditional approach to data warehouse development, which has been improved over time by the development and use of effective toolsets. Re-engineering IT solutions with ETL routines decreases execution cost, processing time and risk. [5]
The final goal of ETL can be summarized as "to retrieve data from one database and place it into another". Since the source and destination database formats, and sometimes engines, are not always the same, converting and combining the data, i.e. transforming it, is required during this transfer. [5]
2.5.1 Informatica PowerCenter
The Informatica PowerCenter processing engine has the following services: [3]
Integration Service, which implements the ETL logic.
Repository Service, which manages the metadata repositories and the mapping and workflow definitions.
Repository Service Process, a multi-threaded process that retrieves, inserts and updates repository metadata.
Repository database, which contains the ETL metadata.
All clients access the repository database tables through the Repository Service. [3]
Figure 8 - Informatica PowerCenter Architecture [3]
Client applications are used to create transformations, manage metadata, execute ETL processes and monitor them. [3]
There are four applications available as desktop tools:
1. Designer, for developing ETL mappings from database tables.
2. Workflow Manager, to create and start workflows (a workflow is a set of instructions, divided into tasks, which the Integration Service uses to extract, transform and load data).
3. Workflow Monitor, used for monitoring running workflows.
4. Repository Manager, a tool for managing source/target connections, objects and users.
2.5.2 ETL Performance measures and Tuning
The aim of performance tuning is to optimize session performance by eliminating performance bottlenecks. [4]
Tuning a session starts by identifying one performance bottleneck, eliminating it, and then identifying the next one. This continues until a satisfactory session performance is achieved. [4]
ETL performance also improves by creating multiple pipeline partitions for parallel data processing in a single session with the PowerCenter partitioning tool. Increasing the number of partitions increases the number of threads, so the Integration Service can create multiple connections to the sources through the increased number of transformation threads. [4]
To improve the ETL further, adding a hash auto-keys partition to the Sorter/Aggregator transformation, and grouping rows of data among partitions based on the data warehouse's parallel query execution plan, are suggested to offer better performance. [4]
Identifying the performance bottlenecks in the target database, and in the mapping, is the other performance tuning option that can be implemented. [4]
2.6 Lithuania BIDW overview
Figure 9 explains how the Omnitel data mart is built in SQL Server 2008, with the ETL written in Informatica PowerCenter, and also shows how the daily source system files flow into the final data mart. [1]
Omnitel first transfers all source flat files to Clearinghouse, which then sends them to the file share. Clearinghouse sends a "kvitto" (receipt) file for each source flat file as soon as the files are received in the file share, and these "kvitto" files trigger the start of the ETL. [1]
ETL-Validate moves the source files to the Source Files folder, reads each source file, validates that the date in the header row is the expected date, counts all data rows, and compares the count to the value in the footer row.
ETL-Capture truncates the S1 tables, reads the source files, adds a unique Delivery_ID and a timestamp for each source, and inserts all rows into the S1 tables. [1]
ETL-Transform truncates the S2 tables, reads the S1 tables, fetches keys from the S3 (data mart) tables and calculates an MD5 key before inserting rows into the S2 tables. Some ETL transformations insert directly into the S3 (data mart); this applies to the COUNT and MONETARY files, because they contain new values at every delivery and therefore do not require a True Delta.
ETL-True Delta reads the S2 tables, compares the data against the S3 (data mart) by the unique keys and the MD5 key, and performs an insert for new data or an update if the data has changed. [1]
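The True Delta step described above can be sketched in T-SQL as an upsert. In the actual project this logic runs inside PowerCenter mappings rather than as a stored query, and the table and column names below are illustrative only:

```sql
-- Sketch of ETL-True Delta: compare staging (S2) against the data mart
-- (S3) on the unique business key, use the MD5 key to detect changed
-- rows, insert new rows and update changed ones.
MERGE dbo.S3_PARTY AS tgt
USING dbo.S2_PARTY AS src
      ON tgt.party_no = src.party_no            -- unique business key
WHEN MATCHED AND tgt.md5_key <> src.md5_key     -- row exists but changed
    THEN UPDATE SET tgt.md5_key    = src.md5_key,
                    tgt.updated_ts = GETDATE()
WHEN NOT MATCHED BY TARGET                      -- new row
    THEN INSERT (party_no, md5_key, updated_ts)
         VALUES (src.party_no, src.md5_key, GETDATE());
```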
Figure 9 - Lithuania BIDW overview [1]
ETL-Tech starts a stored procedure which creates a snapshot for PB to
use. [1]
ETL-Apply reads the PARTY table in the S3 (data mart) for all parties inactivated at the present time and, through a stored procedure, performs an update in one PB table.
2.7 Lithuania ETL overview
Omnitel DW delivers sequential source files to the final data mart by way of the Clearinghouse. The Pitney Bowes campaign and CRT marketing tools use the data mart. Files must reach the CRT PowerCenter between 10.00 and 12.00 a.m. daily, weekly and monthly. [1]
If files do not arrive, CRT maintenance has an agreement with Omnitel for handling the situation. [1] For all of the mentioned business and IT solutions, the following local IT systems and technical platform are required in Lithuania.
Figure 10 - Lithuania ETL overview [1]
3 Methodology / Model
Identifying and evaluating a BIDW solution, and developing it as an ongoing supportive plan, means that the physical and ETL design from each project map must be realized.
All stages are required, from project preparation up to the hand-over to maintenance after the final preparation for evaluation. These stages are available in manual documents and checklists, which were used to find an implementation methodology that minimizes the time from the occurrence of a business event.
3.1 Experimental Methodology
Project management comparison is the first step of this methodology: comparing the traditional TeliaSonera model with the Agile methodology, which is an alternative to traditional project management.
To solve the problems, an investigation of TeliaSonera and CGI project management activities, roles and processes, and a comparison between them, will assist in discovering the useless and redundant parts.
The second step is realization. There are two different but related angles to the realization which this methodology must survey: one is the physical database design and the other is the ETL design. The physical design process consists of database indexes, a partitioning plan and an aggregation plan. The ETL modelling process consists of developing a strategy for extracting data and updating the database, and producing a high-level event-driven architecture (ETL workflows).
Technical methods of realization consist of:
• Monitoring the event-driven architectures (ETL workflows) by increasing the number of refresh cycles of an existing data warehouse and updating the data more frequently.
• DMV results and analytical query processing.
• Database tuning (new performance optimization strategy: indexing).
• ETL tuning (new performance optimization strategy).
After the realization and the study of the available blueprints, implementing the method is the final step: making the changes, obtaining the new performance results and comparing them with the past performance results.
3.2 Implementation Scenarios
The implementation scenario is divided into two parts. The first part involves upgrading the ETL architecture from batch load (daily load of the data sources) to real-time processes which can support on-demand services or web services. This concept is known as real-time business intelligence, which proposes that information should be delivered as soon as it is required, but not necessarily in real time. This part required a data latency analysis of the traditional business intelligence with a third-party tool. Upgrading the data integration architecture involves ETL performance tuning by identifying the performance bottlenecks, which can occur in the target, the source, the mapping and the sessions.
The second part of the scenario is shown as a flowchart here. It uses a set of queries resolved by the system as the workload, selects candidate indexes from those queries, then reconstructs the candidate indexes into a final configuration and surveys the indexing strategy in the data marts.
Figure 11 – Database Scenario Flowchart
The logical methods and tools are described below, and their impact on the theoretical and practical work is elaborated.
3.2.1 BIDW Performance Logical Method and Tools
Optimization and database engine tuning, to achieve better performance from the database and make the ETL process faster, is the key to the success of a parallel database system, which combines data management and parallel processing techniques.
The goal of parallel query optimization is to find a low-cost parallel execution plan that delivers the query results in minimal time.
To perform ETL operations, it is necessary to select a suitable third-party tool which can be coordinated with a specific relational database management system for monitoring and maintenance, because SQL optimization and troubleshooting require auxiliary representations such as workflow diagrams.
The methodology of this thesis for optimization and database engine tuning begins with analyzing query performance and rewriting queries in order to obtain an optimized query execution plan with the data access operators (Scan and Seek).
Analyzing query performance for the Lithuania project began with monitoring the daily data integration workflows in the Informatica PowerCenter Workflow Monitor, to capture the source qualifier override queries and the slow-running queries for tuning. After testing all these queries in the old database, the three most common patterns (Where, Join, Union) were selected for comparing the results after executing them in both the old and the newly designed databases.
The BIDW performance method scenario is to obtain the performance results by investigating database partitioning techniques, surveying the indexing strategy in the physical database, and comparing them with the new results after the implementation scenario.
Best practice for analyzing database performance involves index maintenance, to update the requirements for indexes even after optimization.
Using the Informatica PowerCenter Workflow Manager for collecting performance data, and checking the filter transformations, are the techniques for this method.
Identifying the common problems and errors which occur in the ETL process and these IT solutions, in the form of code optimization (T-SQL), and preventing deadlocks through the optimization of the workflow mappings (PowerCenter), are also considered parts of this methodology.
3.3 Theoretical Data
Each company has its own intelligence, so it is necessary to know the CGI business analytics method and the BI framework based on the internal manuals. Gap analysis and processing of the data warehouse and ETL are both necessary for the implementation of the thesis. The theoretical part requires study and research of the BI tools and the gap analysis process, the BI methodology, the business objectives matrix and solution landscapes, ETL and DW, and the application performance management guide.
The result will be based on the following research questions to improve the data warehouse and ETL functionality of the existing system:
• How to minimize the time spent from the occurrence of the business event to the collection and storage of the raw data where it can be accessed and analyzed, i.e. the data latency.
• How to maintain information consistency.
So as to achieve the following goals:
• Minimize latency and eliminate non-deliveries.
• Stabilize the loading and transformation processes.
• Improve the lock and unlock processes.
3.4 Practical Task
The practical tasks required working with Microsoft SQL Server 2008 Management Studio, Microsoft SQL Server Analysis Services (SSAS), SQL Server Integration Services (SSIS) and the Informatica PowerCenter tool, with the related database on Oracle. Scripting in XML and T-SQL is used in this project for creating and shredding XML documents from the old tables and inserting all data into the new SQL Server tables in the CGI test environment by means of a bulk loading technique. Interviews with members of the maintenance team at CGI, and also on the TeliaSonera side, proved to be of assistance.
Three weeks of practical work were involved in learning how the PowerCenter Integration Service moves data daily from the Lithuania source flat files to the target TeliaSonera data marts, based on the workflows and mapping metadata stored in the PowerCenter repository, and an additional week was spent learning how errors should be handled when sessions fail.
4 Implementation
Performance data for the old indexes is required before the new indexing performance optimization. The following subjects were investigated in the analysis of the old index performance:
• Whether the combination of indexes is right or not.
• How the indexes are being used.
• Missing indexes.
• The ways to keep the indexes healthy and optimal.
This indexing strategy enabled the removal of unused indexes and the verification of the health of the existing indexes. Additional indexes were added in order to improve the poor performance of the queries. This is a long-term pragmatic approach, which leads from the old design to the newly evaluated one by means of analysis.
Three major combined techniques have been used to implement the new indexing, to optimize the performance of the Omnitel relational data warehouse and to reduce the query processing cost.
The following index partitioning method offers better performance, based on the query analysis and the experimental results.
The best way to define a partition key for large tables is by splitting off the primary key (non-clustered) and defining a separate unique clustered index based on the primary key and a natural key. This kind of covering index will be much faster than an index seek plus key lookup, or even a clustered index seek.
In addition, tuning the index parameters and join indexes optimized the total cost and was the reason for changing the index properties from integer to uniqueidentifier, to reduce the number of bytes in the index key size and to keep the keys as narrow as possible.
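The described splitting of the primary key can be sketched as follows. The table is modelled loosely on the PARTY data mart table, but the column names are illustrative:

```sql
-- The surrogate primary key stays non-clustered; a separate unique
-- clustered index combines the natural key with the primary key, so
-- range queries on the natural key become covering clustered seeks.
CREATE TABLE dbo.PARTY_SKETCH (
    party_key UNIQUEIDENTIFIER NOT NULL,
    party_no  BIGINT           NOT NULL,  -- natural key from the source system
    CONSTRAINT PK_PARTY_SKETCH PRIMARY KEY NONCLUSTERED (party_key)
);

CREATE UNIQUE CLUSTERED INDEX CIX_PARTY_SKETCH
    ON dbo.PARTY_SKETCH (party_no, party_key);
```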
Figure 12 - New index partitioning for the 7 large tables which were being locked most of the time.
Figure 13 - Old indexing report
The suggested uniqueidentifier keys have an additional non-leaf level in the B-tree structure. Boosting performance by reducing the number of I/O operations the disk performs is the other positive reason. All the other index improvements, based on compressing tables and partitions, are used in this implementation.
The report in Figure 12 shows the new index partitioning for 7 large
tables which were being locked for the majority of the time.
The report in Figure 13 shows the old indexing in comparison to this
new design.
5 Results
The analysis performed in the preparation of the thesis interviews and practical tasks shows that the Omnitel parallel data warehouse is established as an MS SQL Server 2008 instance, whose server cluster consists of 3 nodes with multiple MS SQL Server instances installed. This led the system to be categorized as a hybrid architecture: a shared-nothing system in which each node is a shared-memory multiprocessor.
Omnitel is a dimensional-model data warehouse in the form of a third normal form (3NF) schema, making it especially suitable for the execution of long-running queries. However, the ETL outcome in the PowerCenter data flows, and the SQL Server DMV (dynamic management views) results from the master database, showed some deadlocks in the shared workspace. In other words, the deadlocks in the master database propagated into flaws and produced erroneous results in PowerCenter and in the SQL Server DMVs.
5.1 DMV results before analyzing query performance
SQL Server DMV (dynamic management views and functions with Transact-SQL) server state information was monitored and gathered. Using these results, it became possible to identify and remove unused indexes, to obtain the list of tables which are subject to locks and, last but not least, to locate and list the most expensive queries as well as the missing indexes.
The result from the index usage statistics (sys.dm_db_index_usage_stats) for verifying index usage is used in this thesis particularly to track seeks, scans, lookups, updates and the last run time of these operations.
The following table shows this result from S3 (the final data mart), which is called CR_CRTDATAMARTSTAGING_LT_OMNITEL.
Figure 14- Old indexing report
Figure 15 - Sys.dm_db_index_physical_stats report
The operations that directly access the database include scans (reading an entire structure) and seeks (not reading the entire structure); a table scan, even on filtered indexes, could be slow and would require extra I/O.
To determine the fragmentation level of a given database for rebuilding indexes, sys.dm_db_index_physical_stats has been used. The results of this operation are shown in Figure 15.
Where the index fragmentation was greater than 40 percent, the indexes were rebuilt, and the indexes were reorganized when the fragmentation factor varied from 10 to 40 percent.
On each occasion of a query performance analysis operation, the above-mentioned actions were used to verify the health status of the existing indexes.
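The fragmentation check and the 10/40 percent thresholds above can be expressed directly against sys.dm_db_index_physical_stats. The query below is a sketch (the column aliases and the CASE labels are illustrative):

```sql
-- Report fragmentation for every index in the current database and map
-- it onto the thresholds used in this thesis: rebuild above 40 %,
-- reorganize between 10 % and 40 %, otherwise leave the index alone.
SELECT OBJECT_NAME(ps.[object_id])     AS table_name,
       i.[name]                        AS index_name,
       ps.avg_fragmentation_in_percent AS frag_pct,
       CASE
           WHEN ps.avg_fragmentation_in_percent > 40  THEN 'REBUILD'
           WHEN ps.avg_fragmentation_in_percent >= 10 THEN 'REORGANIZE'
           ELSE 'OK'
       END                             AS suggested_action
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
       JOIN sys.indexes AS i
         ON i.[object_id] = ps.[object_id]
        AND i.index_id    = ps.index_id
WHERE  ps.index_id > 0;  -- skip heaps
```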
5.2 Analyzing Query Performance Results for Where, Join and Union
Using SARGable operators, checking the execution plan (XML showplan and graphical display) of the selected queries, and attempting to reduce the two most important costs, the estimated I/O and CPU costs, while increasing the estimated operator cost, prepared the output for query optimization and for the realization of this thesis implementation. The query optimization method in this thesis is based on the selected Where, Join and Union queries.
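A SARGable predicate, in the sense used here, keeps the indexed column bare on one side of the comparison so that the optimizer can seek on it. The sketch below uses the PARTY table, but the column names are illustrative:

```sql
-- Non-SARGable: wrapping the column in a function hides it from any
-- index on updated_ts, forcing a scan.
SELECT party_no
FROM   dbo.PARTY
WHERE  YEAR(updated_ts) = 2012;

-- SARGable rewrite: a range predicate on the bare column lets the
-- optimizer seek on an index over updated_ts.
SELECT party_no
FROM   dbo.PARTY
WHERE  updated_ts >= '2012-01-01'
  AND  updated_ts <  '2013-01-01';
```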
5.2.1 Where Queries
The graphical execution plan in Figure 16 shows a 2% better performance (actual execution plans) in the clustered index Scan/Seek cost when the operator used the Where clause of the queries on the tables. After the implementation, the new index strategy helps to improve the estimated operator cost, although the estimated I/O cost increases a little. Figure 16 shows the old index strategy, which leads the query to a 97% clustered index scan cost, and Figure 17 shows a 99% clustered index scan cost after the index optimization.
Figure 16 – Old Index Strategy- Cost: 97%
Figure 17– New index strategy- Cost: 99%
5.2.2 Join
Join phrases in a query increase the cost of the query and lower its performance. Hence, the optimal manner selected to minimize the cost of the optimized query is to use clustered index Scan/Seek more often and to remove bookmark lookups as much as possible. Although this requires extra I/O operations, and the estimated I/O cost may increase for some queries, since there is no need to scan each and every row of the table, the query will eventually execute more rapidly.
The query optimizer's reasons for choosing a merge join or a hash join to read and compare the data are related to the index optimization strategy. With the new indexing, the attempt is to lead the execution plan to choose a merge join instead of a hash join, because practice shows that a hash join uses more resources and returns too much data. A merge join processes its sorted inputs one row at a time and is much more efficient performance-wise. The graphical execution plans in Figures 18 and 19, on clustered indexes, and in Figures 20 and 21, on non-clustered indexes, for the same join query show the results and the differences between their performances.
Figure 18 shows the query execution plan on the old tables with a 1% clustered index scan cost, and Figure 19 shows the 10% clustered index scan cost improvement and the reduction in the estimated CPU cost with the new index strategy design. Additionally, the merge join sorts the data before joining it.
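Whether the optimizer picks a hash or a merge join can also be tested directly with a join hint, which is one way to reproduce the comparison shown in the figures. The join columns below are illustrative, not taken from the project's actual queries:

```sql
-- Force a merge join for the comparison; without the hint, the optimizer
-- chooses hash or merge based on the available indexes and row estimates.
SELECT p.party_key,
       r.count_balance_type
FROM   dbo.PARTY AS p
       INNER MERGE JOIN dbo.PARTY_COUNT_BALANCE_REL AS r
         ON r.party_key = p.party_key;
```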
Figure 18– Old index strategy-Cluster Index Scan
Figure 19– New index strategy- Cluster Index Scan
Figure 20- Old index strategy- Non-Cluster Index Scan
Figure 21- New index strategy- Non-Cluster Index Scan
The second aim was to reduce the estimated operator costs. This improvement is also achieved, although it uses extra I/O operations. A comparison between Figures 20 and 21 shows that using merge joins increases the estimated operator cost from 30% to 53% in a non-clustered index scan.
Although a key lookup may add a pre-fetch argument when instructing the execution engine, it can work better together with a join or a lookup; so for slow join queries or non-covering indexes, a seek plus a lookup works better than the seek operation alone. Figure 22 shows the result after using a FORCESEEK query hint based on the new index strategy.
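The FORCESEEK hint mentioned above is applied per table; a minimal sketch, with an illustrative predicate and variable, might look as follows:

```sql
-- SQL Server 2008 FORCESEEK table hint: the optimizer must use a seek
-- on dbo.PARTY even where it would otherwise pick a scan, so that
-- seek-plus-lookup plans can be compared against scan plans.
DECLARE @party_no BIGINT = 4711;  -- illustrative natural key value

SELECT p.party_key
FROM   dbo.PARTY AS p WITH (FORCESEEK)
WHERE  p.party_no = @party_no;
```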
Figure 22- New index strategy- FORCESEEK
5.2.3 Union
A hash join is irrelevant for use on a UNION clause in queries; these operations are only used to find duplicates, and an attempt has been made to reduce those costs in the Union queries and to free the resources.
Figure 23- Old index strategy-Hash join
Figure 24- New index strategy-Merge join
Comparing the two figures (23 and 24) shows that the merge join is chosen automatically by the execution plan after implementing the new indexing strategy, and that it reduces the cost to 16%, whereas the cost was 64% when the hash join was selected.
The estimated operator cost on the new covering indexes increases from 36% to 81% on the PARTY_COUNT_BALANCE_REL table, with a 2% improvement on the major parent PARTY table. The estimated I/O and CPU costs have fallen sharply in the new design, which is very encouraging.
5.3 Query Optimization outputs of Lithuania ETL Data Flows
Operating the PowerCenter client at CGI, and using the Workflow Monitor to monitor the scheduled and running workflows every day during extracting, transforming and loading into the S1, S2 and S3 data marts for TeliaSonera, the PowerCenter server gave away some transformation errors. These led to changes in the indexing and partitioning in S3 (the final data mart), which is called CR_CRTDATAMART_LT_OMNITEL. The following error samples showed some problems during the transformation and update of the data in the data warehouse.
Error Sample 1:
Severity: ERROR   Timestamp: 2012-12-18 15:05:01   Node: node_sehan5763astcr4
Thread: TRANSF_1_1_1   Message Code: TT_11019
Message: There is an error in the port [RETURN_VALUE]:
The default value for the port is set to: ERROR(<<Expression Error>>
[ERROR]: transformation error
... nl:ERROR(u:'transformation error')).
The loading error results show the lack of an index strategy in S1 (the file structure), which is called CR_CRTDATAMARTSTAGING_LT_OMNITEL.
After checking the sessions and mappings in the capture and transform parts of the ETL execution workflows, the following error sample showed some problems during data loading into the data warehouse.
Error Sample 2:
Severity: ERROR   Timestamp: 2012-12-19 13:04:43   Node: node_sehan5763astcr4
Thread: WRITER_1_*_1   Message Code: WRT_8164
Message: Error loading into target [PARTY_COUNT_BALANCE_REL]: Bad rows exceeded Session Threshold [1]
Figure 25- ETL Transform Source qualifier Query
Running the ETL source qualifier override queries in the data warehouse shows non-filtering indexes, which lead the execution plan to choose a table scan. These heap tables were the reason for the slow performance while loading data into S1. Figure 25 shows the result for one of the ETL transform source qualifier queries; the estimated operator cost is 13%, which is not satisfactory.
5.4 Disk Usage Results Based on the New Index Type
The disk usage results were produced after changing the index type from integer to GUID (uniqueidentifier) in some tables.
Although a Microsoft SQL Server GUID (uniqueidentifier) primary key is the heaviest data type to use, it can sometimes be the most capable. It can be compared directly, which is very useful when comparing data between the pre-data mart and the final data mart in the loading part of the ETL. It is possible to obtain sequential uniqueidentifier values by assigning NEWSEQUENTIALID(), which may assist in tuning the ETL workflow mappings in the other data marts which are not final data marts.
A GUID can be generated outside the database and be unique without blocking, so there is no need to worry about duplicate keys while the ETL plan merges and updates data in the final data mart. It can also be synchronized without overlap between different data stores. A GUID usually does not perform as well as a filtered primary key, so the best way to use it is to keep the primary keys non-clustered.
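A sequential GUID of the kind described above is obtained with NEWSEQUENTIALID(), which SQL Server only accepts as a column default. The table below is an illustrative sketch, not part of the Omnitel schema:

```sql
-- NEWSEQUENTIALID() generates ever-increasing GUIDs on a given machine,
-- avoiding the page splits caused by random NEWID() values; the primary
-- key is kept non-clustered, as recommended above.
CREATE TABLE dbo.DELIVERY_SKETCH (
    delivery_id UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_DELIVERY_SKETCH DEFAULT NEWSEQUENTIALID(),
    delivery_ts DATETIME NOT NULL,
    CONSTRAINT PK_DELIVERY_SKETCH PRIMARY KEY NONCLUSTERED (delivery_id)
);
```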
The image in Figure 26 shows the report of the disk usage by the top tables of the new database, which is called CR_CRTDATAMART_LT_OMNITELGH. The reserved space has obviously grown in comparison to the top tables of the original database (CR_CRTDATAMART_LT_OMNITEL).
Figure 26 – Disk usage results for GUID
[Figure 26 is a "Disk Usage by Top Tables" report for [CR_CRTDATAMART_LT_OMNITELGH] on SEHAN5677CLV001\SQLTEST1 at 2013-05-06 13:53:57, providing detailed data on the utilization of disk space (# Records, Reserved (KB), Used (KB)) by index and by partition within the database, for dbo.PARTY (IX_PARTY_01, IX_PARTY_02, PK_PARTY_KEY), dbo.PARTY_COUNT_BALANCE_REL (PARTY_COUNT_BALANCE_REL_IX1, PK_PARTY_COUNT_BALANCE_REL), dbo.PARTY_FLAG_REL (PARTY_FLAG_REL_IX1, PK_PARTY_FLAG_REL), dbo.PARTY_MONETARY_BALANCE_REL (IX_PARTY_MONETARY_BALANCE_REL_01, PK_PARTY_MONETARY_BALANCE_REL), dbo.FLAG_TYPE, dbo.COUNT_BALANCE_TYPE and dbo.MONETARY_BALANCE_TYPE. The largest tables hold roughly 3.9, 22 and 69 million records.]
Figure 27 – Disk usage results for Integer
6 Conclusions
Recreating database indexes with different data types can assist in increasing the performance of basic database functionality. Although this may appear to depend somewhat on the type of the data, the basics remain the same: recreating and rebuilding indexes using the results from the tests optimizes query performance. Therefore, implementing a new indexing strategy on the final database (S3) results in better performance during the update phase and during normal database operations, including Seek, Join, Union, Select, etc., and can dramatically reduce the I/O and CPU operational costs of the host machine or array.
A comparison between the test results on the old and new models suggests that all filtering and joining should be implemented in the actual database instead of in the ETL workflows, and that indexes should be created in the database instead of using the Sequence Generator in the ETL workflows. This would leave to the ETL only its actual functions, which was the basic philosophy behind designing any ETL process. It shows that, as the data volume grows larger and the complexity of the data mart increases, modularity becomes more useful in the tasks assigned to the different parts of the process, clean and independent interfaces become more important and, finally, errors are contained and do not propagate to the whole of the BI process.
Bridging the gaps in the TeliaSonera business intelligence implementation by means of gap analysis, and comparing it with the Agile method, shows that in order for TeliaSonera to attain a more efficient system and an improved performance, a switch to Agile methodologies is necessary. The comparisons show explicit and obvious advantages of Agile methodologies over the current implementation.
References
[1] TeliaSonera AB, Håkan Ivars, 2010: "BIDW Operational Guidelines CRT Data Mart and ETL for Lithuania", Internal Manual
[2] TeliaSonera AB, 2012: "Maintenance Specification for CUSTOMER RECOMMENDATION TOOL", Internal Manual
[3] IBM Software Group, 2004: "Introduction to the BI Architecture Framework and Methods", Internal Manual
[4] Nicola, M., "Storage Layout and I/O Performance Tuning for IBM Red Brick Data Warehouse", IBM DB2 Developer Domain, Informix Zone, October 2002
[5] INFORMATICA: PowerCenter 8.x/9.0 Level I Developer Lab Guide
[6] INFORMATICA: PowerCenter (Version 9.0.1): Advanced Workflow Guide
[7] Timos K. Sellis and Alkis Simitsis, "ETL Workflows: From Formal Specification to Optimization", Advances in Databases and Information Systems: 11th East European Conference, ADBIS 2007, Varna, Bulgaria, September 29 - October 3, 2007
[8] Lynn Langit, Kevin S. Goff, Davide Mauri, Sahil Malik and John C. Welch, Smart Business Intelligence Solutions with Microsoft SQL Server 2008, Microsoft Press
[9] Benjamin Nevarez, Inside the SQL Server Query Optimizer, Red Gate Books, 2011
[10] Glenn Berry, Louis Davidson and Tim Ford, SQL Server DMV Starter Pack, Red Gate Books, June 2010
[11] Microsoft, Troubleshooting Performance Problems in SQL Server 2008, Technical Article, March 2009
Delivering Business Intelligence
Performance
Ghazal Tashakor
Terminology / Notation
2013-10-21
[12] Microsoft Corporation, 2010: Introducing Microsoft SQL
Server 2008 r2, Ross Mistry and Stacia Misner
[13] Microsoft Corporation, May 2009 : Business Intelligence,
Data Warehousing, and the ISV, BI-DW ISV Adoption
Roadmap
[14] Itzik Ben-Gan, Microsoft SQL Server 2012 T-SQL Funda-
mentals, 1st Edition, Microsoft Press July 12, 2012
[15] Itzik Ben-Gan, Training Kit (Exam 70-463): Implementing a
Data Warehouse with Microsoft SQL..., 1st Edition, Micro-
soft Press ,2012
[16] Wee-Hyong Tok, Rakesh Parida, Matt Masson, Xiaoning
Ding and Kaarthik Sivashanmugam, Microsoft SQL Serv-
er 2012 Integration Services, 1st Edition, Microsoft Press
,2012
[17] Itzik Ben-Gan, Dejan Sarka and Ron Talmage, Training
Kit (Exam 70-461): Querying Microsoft SQL Server 2012,
1st Edition, Microsoft Press, 2012
[18] Dejan Sarka, Matija Lah and Grega Jerkic, Training Kit
(Exam 70-463): Implementing a Data Warehouse with Mi-
crosoft SQL..., 1st Edition, Microsoft Press , 2012
[19] CGI,2004 : Business Intelligence, Enabling Transparency
across the Enterprise
[20] Patrick O’Neil, DalIan Quass, "Improved Query Perfor-
mance with Variant Indexes", IEEE, 1996
[21] Herodotos Herodotou, Nedyalko Borisov, Shivnath
Babu, "Query Optimization Techniques for Partitioned
Tables", IEEE, 2011
[22] Stephane Azefack1, Kamel Aouiche2 and Jerome Dar-
mont1, "Dynamic index selection in data warehouses", IEEE,
2008
[23] Holger Märtens, Erhard Rahm, and Thomas Stöhr, “Dy-
namic Query Scheduling in Parallel Data Warehouses” Uni-
versity of Leipzig, Germany,2003
Delivering Business Intelligence
Performance
Ghazal Tashakor
Terminology / Notation
2013-10-21
[24] Yan shi, Xiangjun Lu, "The Role of Business Intelligence in
Business Performance Management", 3rd International
Conference on Information Management, Innovation
Management and Industrial Engineering, IEEE, 2010
[25] D.P. Du PLESSIS, "Implementing Business Intelligence
Processes in a Telecommunications Company in South Af-
rica", Vaal Triangle Campus of the North-West Univer-
sity, May 2012
[26] Stefan Krompass, Andreas Scholz, Martina-Cezara Al-
butiu, Harumi Kuno, Janet Wiener, Umeshwar Dayal,
Alfons Kemper, "Quality of Service-EnabledManagement of
DatabaseWorkloads", IEEE, 2008
[27] Kurt Stockinger, Kesheng Wu, Arie Shoshani, "Strategies
for Processing ad hoc Queries on Large Data Warehouses",
IEEE 2002
[28] Gregory Malecha Greg Morrisett Avraham Shinnar
Ryan Wisnesky, " Toward a Verified Relational Database
Management System ", Harvard University, Cambridge,
MA, USA, IEEE 2010
[29] Patrick Valduriez , Daniela Florescu, Waqar Hasan, "
Open Issues in Parallel Query Optimization", INRIA,
France, IEEE September 1996
[30] Ladjel Bellatreche1, Michel Schneider2, Herv´e Lorin-
quer2, and Mukesh Mohania3, " Bringing Together Parti-
tioning, Materialized Views and Indexes to Optimize Perfor-
mance of Relational Data Warehouses", Springer-Verlag
Berlin Heidelberg 2004
[31] C. Chee-Yong, “Indexing techniques in decision support
Systems”, Ph.D. Thesis, University, of Wisconsin, Madi-
son, 1999.
[32] Holger Märtens, Erhard Rahm, and Thomas Stöhr, "
Dynamic Query Scheduling in Parallel Data Warehouses
",University of Leipzig, Germany, Springer-Verlag Ber-
lin Heidelberg 2002
Appendix A: Shredding an XML document via T-SQL for inserting its data into a SQL Server table
Figure A.1: Bulk insert via T-SQL in SQL Server, loading an exported XML mapping data file as binary nodes.
Appendix B: User manual BIDW Operational Guidelines Data Mart and ETL for Lithuania
Figure B.1: Database structure of Omnitel
Physical Data Model
Model: CRT_GLOBAL
Package:
Diagram: CRT Common Model
Author: Thomas Haag, Håkan Ivars & Stefan Sollenhag
Date: 2012-03-14
Version: 1.3

PARTY
    PAR_PARTY_KEY                         integer         <pk>
    PAR_PARENT_PARTY_KEY                  integer         <fk> <i1>
    PAR_PARTY_ID                          nvarchar(50)
    PAR_PARTY_ID_TP                       nvarchar(4)
    PAR_PARTY_NAME                        nvarchar(70)
    PAR_PARTY_SEC_ID                      nvarchar(50)
    PAR_PARTY_SEC_ID_TP                   nvarchar(4)
    PAR_PARTY_BIRTH_DATE                  datetime
    PAR_AGE                               integer
    PAR_PARTY_CONTACT_EMAIL_1             nvarchar(256)
    PAR_PARTY_CONTACT_EMAIL_2             nvarchar(256)
    PAR_PARTY_CONTACT_EMAIL_3             nvarchar(256)
    PAR_PARTY_CONTACT_MOBILE_PHONE_NBR    nvarchar(16)
    PAR_CREATION_TIMESTAMP                datetime
    PAR_POPN_TIMESTAMP                    datetime
    PAR_IS_DELETED                        tinyint
    PAR_CREATION_SOURCE                   nvarchar(100)
    PAR_POPN_SOURCE                       nvarchar(100)
    PAR_DELIVERY_ID_OUT                   integer
    PAR_ROW_NUMBER                        integer
    PAR_MD5_KEY                           nvarchar(32)
    Other index markers in the diagram: <i2>, <i2>, <i3>, <i3>

PARTY_COUNT_BALANCE_REL
    CBR_PARTY_KEY             integer         <pk,fk2>
    CBR_COUNT_BALANCE_KEY     integer         <pk,fk3>
    CBR_PERIOD_REL_KEY        integer         <pk,fk4>
    CBR_PERIOD_ABS_KEY        integer         <fk1>
    CBR_OCCURRENCE_COUNT      integer
    CBR_CREATION_TIMESTAMP    datetime
    CBR_POPN_TIMESTAMP        datetime
    CBR_IS_DELETED            tinyint
    CBR_CREATION_SOURCE       nvarchar(100)
    CBR_POPN_SOURCE           nvarchar(100)
    CBR_DELIVERY_ID_OUT       integer
    CBR_ROW_NUMBER            integer
    CBR_MD5_KEY               nvarchar(32)
    Other index markers in the diagram: <i1>, <i1>, <i1,i3>, <i2>, <i1>

COUNT_BALANCE_TYPE
    CBT_COUNT_BALANCE_KEY     integer         <pk>
    CBT_COUNT_BALANCE_TP      nvarchar(4)
    CBT_COUNT_BALANCE_DESC    nvarchar(100)
    CBT_COUNT_BALANCE_GROUP   nvarchar(100)
    CBT_CREATION_TIMESTAMP    datetime
    CBT_POPN_TIMESTAMP        datetime
    CBT_IS_DELETED            tinyint
    CBT_CREATION_SOURCE       nvarchar(100)
    CBT_POPN_SOURCE           nvarchar(100)
    CBT_DELIVERY_ID_OUT       integer
    CBT_ROW_NUMBER            integer
    CBT_MD5_KEY               nvarchar(32)

PERIOD_RELATIVE_TYPE
    PERR_PERIOD_REL_KEY       integer         <pk>
    PERR_PERIOD_REL_TP        nvarchar(4)
    PERR_PERIOD_REL_DESC      nvarchar(100)
    PERR_CREATION_TIMESTAMP   datetime
    PERR_POPN_TIMESTAMP       datetime
    PERR_IS_DELETED           tinyint
    PERR_CREATION_SOURCE      nvarchar(100)
    PERR_POPN_SOURCE          nvarchar(100)
    PERR_DELIVERY_ID_OUT      integer
    PERR_ROW_NUMBER           integer
    PERR_MD5_KEY              nvarchar(32)

PARTY_MONETARY_BALANCE_REL
    MBR_PARTY_KEY             integer         <pk,fk2>
    MBR_MONETARY_BALANCE_KEY  integer         <pk,fk4>
    MBR_PERIOD_REL_KEY        integer         <pk,fk3>
    MBR_PERIOD_ABS_KEY        integer         <fk1>
    MBR_MONETARY_BALANCE_AMT  decimal(17,3)
    MBR_ISO_CURRENCY_CD       char(3)
    MBR_CREATION_TIMESTAMP    datetime
    MBR_POPN_TIMESTAMP        datetime
    MBR_IS_DELETED            tinyint
    MBR_CREATION_SOURCE       nvarchar(100)
    MBR_POPN_SOURCE           nvarchar(100)
    MBR_DELIVERY_ID_OUT       integer
    MBR_ROW_NUMBER            integer
    MBR_MD5_KEY               nvarchar(32)
    Other index markers in the diagram: <i1>, <i1>, <i1,i3>, <i2>, <i1>

MONETARY_BALANCE_TYPE
    MBT_MONETARY_BALANCE_KEY   integer        <pk>
    MBT_MONETARY_BALANCE_TP    nvarchar(4)
    MBT_MONETARY_BALANCE_DESC  nvarchar(100)
    MBT_MONETARY_BALANCE_GROUP nvarchar(100)
    MBT_CREATION_TIMESTAMP     datetime
    MBT_POPN_TIMESTAMP         datetime
    MBT_IS_DELETED             tinyint
    MBT_CREATION_SOURCE        nvarchar(100)
    MBT_POPN_SOURCE            nvarchar(100)
    MBT_DELIVERY_ID_OUT        integer
    MBT_ROW_NUMBER             integer
    MBT_MD5_KEY                nvarchar(32)

PROPERTY_TYPE
    PT_PROPERTY_KEY           integer         <pk>
    PT_PROPERTY_TP            nvarchar(4)
    PT_PROPERTY_DESC          nvarchar(100)
    PT_PROPERTY_GROUP         nvarchar(100)
    PT_VALUE_TP               nvarchar(20)
    PT_VALUE_MASK             nvarchar(256)
    PT_CREATION_TIMESTAMP     datetime
    PT_POPN_TIMESTAMP         datetime
    PT_IS_DELETED             tinyint
    PT_CREATION_SOURCE        nvarchar(100)
    PT_POPN_SOURCE            nvarchar(100)
    PT_DELIVERY_ID_OUT        integer
    PT_ROW_NUMBER             integer
    PT_MD5_KEY                nvarchar(32)

PARTY_PROPERTY_REL
    PR_PARTY_KEY              integer         <pk,fk2>
    PR_PROPERTY_KEY           integer         <pk,fk1>
    PR_SEQUENCE_NUMBER        integer         <pk>
    PR_PROPERTY_VALUE         nvarchar(256)
    PR_CREATION_TIMESTAMP     datetime
    PR_POPN_TIMESTAMP         datetime
    PR_IS_DELETED             tinyint
    PR_CREATION_SOURCE        nvarchar(100)
    PR_POPN_SOURCE            nvarchar(100)
    PR_DELIVERY_ID_OUT        integer
    PR_ROW_NUMBER             integer
    PR_MD5_KEY                nvarchar(32)
    Other index markers in the diagram: <i>, <i>, <i>, <i>

PARTY_CONTACT_ADDRESS
    CA_PARTY_KEY              integer         <pk,fk>
    CA_ADDRESS_RANK           integer         <pk>
    CA_ADDRESS_DOMAIN         nvarchar(4)
    CA_PARTY_STREET_ADDRESS   nvarchar(35)
    CA_PARTY_POSTAL_CD        nvarchar(9)
    CA_PARTY_CITY             nvarchar(26)
    CA_ADDRESS_COUNTRY        nvarchar(35)
    CA_CREATION_TIMESTAMP     datetime
    CA_POPN_TIMESTAMP         datetime
    CA_IS_DELETED             tinyint
    CA_CREATION_SOURCE        nvarchar(100)
    CA_POPN_SOURCE            nvarchar(100)
    CA_DELIVERY_ID_OUT        integer
    CA_ROW_NUMBER             integer
    CA_MD5_KEY                nvarchar(32)

PARTY_CONTACT_EMAIL
    CE_PARTY_KEY              integer         <pk,fk>
    CE_EMAIL_ADDRESS_RANK     integer         <pk>
    CE_EMAIL_ADDRESS_DOMAIN   nvarchar(4)
    CE_PARTY_EMAIL_ADDRESS    nvarchar(256)
    CE_CREATION_TIMESTAMP     datetime
    CE_POPN_TIMESTAMP         datetime
    CE_IS_DELETED             tinyint
    CE_CREATION_SOURCE        nvarchar(100)
    CE_POPN_SOURCE            nvarchar(100)
    CE_DELIVERY_ID_OUT        integer
    CE_ROW_NUMBER             integer
    CE_MD5_KEY                nvarchar(32)

PARTY_CONTACT_TELEPHONE
    CT_PARTY_KEY              integer         <pk,fk>
    CT_PHONE_NUMBER           nvarchar(16)    <pk>
    CT_PHONE_NUMBER_TP        nvarchar(4)
    CT_CREATION_TIMESTAMP     datetime
    CT_POPN_TIMESTAMP         datetime
    CT_IS_DELETED             tinyint
    CT_CREATION_SOURCE        nvarchar(100)
    CT_POPN_SOURCE            nvarchar(100)
    CT_DELIVERY_ID_OUT        integer
    CT_ROW_NUMBER             integer
    CT_MD5_KEY                nvarchar(32)

PROPERTY_TYPE_DOMAIN_VALUES
    PTV_PROPERTY_KEY          integer         <pk,fk>
    PTV_PROPERTY_VALUE        nvarchar(256)   <pk>
    PTV_PROPERTY_VALUE_DESC   nvarchar(256)
    PTV_CREATION_TIMESTAMP    datetime
    PTV_POPN_TIMESTAMP        datetime
    PTV_IS_DELETED            tinyint
    PTV_CREATION_SOURCE       nvarchar(100)
    PTV_POPN_SOURCE           nvarchar(100)
    PTV_DELIVERY_ID_OUT       integer
    PTV_ROW_NUMBER            integer
    PTV_MD5_KEY               nvarchar(32)

FLAG_TYPE
    FT_FLAG_KEY               integer         <pk>
    FT_FLAG_TP                nvarchar(4)
    FT_FLAG_DESC              nvarchar(100)
    FT_FLAG_GROUP             nvarchar(100)
    FT_CREATION_TIMESTAMP     datetime
    FT_POPN_TIMESTAMP         datetime
    FT_IS_DELETED             tinyint
    FT_CREATION_SOURCE        nvarchar(100)
    FT_POPN_SOURCE            nvarchar(100)
    FT_DELIVERY_ID_OUT        integer
    FT_ROW_NUMBER             integer
    FT_MD5_KEY                nvarchar(32)

PARTY_FLAG_REL
    FR_PARTY_KEY              integer         <pk,fk2>
    FR_FLAG_KEY               integer         <pk,fk1>
    FR_FLAG_VALUE             tinyint
    FR_CREATION_TIMESTAMP     datetime
    FR_POPN_TIMESTAMP         datetime
    FR_IS_DELETED             tinyint
    FR_CREATION_SOURCE        nvarchar(100)
    FR_POPN_SOURCE            nvarchar(100)
    FR_DELIVERY_ID_OUT        integer
    FR_ROW_NUMBER             integer
    FR_MD5_KEY                nvarchar(32)
    Other index markers in the diagram: <i>, <i>, <i>

PERIOD_CURRENT
    PERC_SEGMENT_NAME         nvarchar(10)    <pk>
    PERC_CURRENT_DATE         DATE
    PERC_CURRENT_WEEK         nvarchar(7)
    PERC_CURRENT_MONTH        nvarchar(7)
    PERC_CURRENT_QUARTER      nvarchar(7)
    PERC_CURRENT_SECOND_FOUR  nvarchar(7)
    PERC_CURRENT_SIX_MONTH    nvarchar(7)
    PERC_CURRENT_YEAR         dec(4)

PARTY_PROPERTY_DATE_REL
    PDR_PARTY_KEY             integer         <pk,fk1>
    PDR_PROPERTY_KEY          integer         <pk,fk2>
    PDR_SEQUENCE_NUMBER       integer         <pk>
    PDR_DATE_PROPERTY_VALUE   datetime
    PDR_CREATION_TIMESTAMP    datetime
    PDR_POPN_TIMESTAMP        datetime
    PDR_IS_DELETED            tinyint
    PDR_CREATION_SOURCE       nvarchar(100)
    PDR_POPN_SOURCE           nvarchar(100)
    PDR_DELIVERY_ID_OUT       integer
    PDR_ROW_NUMBER            integer
    PDR_MD5_KEY               nvarchar(32)
    Other index markers in the diagram: <i>, <i>, <i>, <i>

PERIOD_ABSOLUTE_TYPE
    PERA_PERIOD_ABS_KEY       integer         <pk>
    PERA_PERIOD_ABS_TP        nvarchar(4)
    PERA_PERIOD_ABS_DESC      nvarchar(100)
    PERA_CREATION_TIMESTAMP   datetime
    PERA_POPN_TIMESTAMP       datetime
    PERA_IS_DELETED           tinyint
    PERA_CREATION_SOURCE      nvarchar(100)
    PERA_POPN_SOURCE          nvarchar(100)
    PERA_DELIVERY_ID_OUT      integer
    PERA_ROW_NUMBER           integer
    PERA_MD5_KEY              nvarchar(32)
Figure B.2: Lithuania ETL Informatica DB schedule
Figure B.3: Lithuania ETL rough overview