Business Intelligence, Data Warehousing & ETL Concepts

Upload: jeshocarme

Post on 15-Jul-2015

TRANSCRIPT

Page 1: Dw & etl concepts

Business Intelligence, Data Warehousing & ETL Concepts

Page 2: Dw & etl concepts

Business Intelligence

Page 3: Dw & etl concepts

Business Intelligence

How intelligent can you make your business processes?

What insight can you gain into your business?

How integrated can your business processes be?

How much more interactive can your business be with customers, partners, employees and managers?

Page 4: Dw & etl concepts

What is Business Intelligence (BI)?

Business Intelligence is a generalized term applied to a broad category of applications and technologies for gathering, storing, analyzing and providing access to data to help enterprise users make better business decisions.

Business Intelligence applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.

An alternative way of describing BI is: the technology required to turn raw data into information to support decision-making within corporations and business processes.

Page 5: Dw & etl concepts

Why BI?

BI technologies help bring decision-makers the data in a form they can quickly digest and apply to their decision making.

BI turns data into information for managers, executives and, more generally, anyone making decisions in a company.

Companies want to use technology tactically to make their operations more effective and more efficient. Business intelligence can be the catalyst for that efficiency and effectiveness.

Page 6: Dw & etl concepts

Benefits

The benefits of a well-planned BI implementation are going to be closely tied to the business objectives driving the project.

Identify trends and anomalies in business operations more quickly, allowing for more accurate and timelier decisions.

Deliver actionable insight and information to the right place with less effort.

Identify and operate based on a single version of the truth, allowing all analysis to be completed on a core foundation with confidence.

Page 7: Dw & etl concepts

Business Intelligence Components

[Diagram: Operational Data -> Extract -> Transform -> Load -> Data Warehouse -> OLAP and Data Mining]

Page 8: Dw & etl concepts

Business Intelligence Architecture

Page 9: Dw & etl concepts

Business Intelligence Technologies

[Diagram: layered view of BI technologies, with increasing potential to support business decisions from Data Sources up to Decision Making]

Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Data Warehouses / Data Marts

Data Exploration: OLAP, DSS, EIS, Querying and Reporting

Data Mining: Information discovery

Data Presentation: Visualization Techniques

Decision Making

Roles shown: End User, Business Analyst, Data Analyst, DB Admin

Page 10: Dw & etl concepts

Data Warehousing

Page 11: Dw & etl concepts

What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data.

A data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

It is a series of processes, procedures and tools (hardware and software) that help the enterprise understand more about itself, its products, its customers and the market it serves.

Page 12: Dw & etl concepts

Why Data Warehousing?

The need for intelligent information in a competitive market

Who are the potential customers?

Which products are sold the most?

What are the region-wise preferences?

What are the competitor products?

What are the projected sales?

What if you sell a larger quantity of a particular product?

What will be the impact on revenue?

What are the results of the promotion schemes introduced?

Page 13: Dw & etl concepts

OLTP vs. Data Warehouse

OLTP systems are tuned for known transactions and workloads, whereas the workload of a data warehouse is not known in advance.

Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries),

e.g., the average amount spent on phone calls between 9AM and 5PM in Pune during the month of December.
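A minimal sketch of the example query above, using Python's built-in sqlite3 module. The table name calls, its columns and the sample rows are assumptions made for illustration; a real warehouse would hold this data in fact and dimension tables.

```python
import sqlite3

# Hypothetical call-detail table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2014-12-03 10:15:00", 12.0),    # matches all constraints
        ("Pune", "2014-12-10 16:45:00", 8.0),     # matches all constraints
        ("Pune", "2014-12-10 20:30:00", 30.0),    # outside 9AM-5PM
        ("Pune", "2014-11-20 11:00:00", 50.0),    # not December
        ("Mumbai", "2014-12-05 12:00:00", 99.0),  # other city
    ],
)

# Constrain on three dimensions (location, month, time of day), then aggregate.
# Start hours 09 through 16 cover calls that begin between 9:00 and 16:59.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND strftime('%H', started_at) BETWEEN '09' AND '16'
    """
).fetchone()[0]
print(avg_amount)
```

The query touches every matching row and returns one aggregate, which is the access pattern that distinguishes warehouse workloads from the single-record lookups typical of OLTP.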

Page 14: Dw & etl concepts

OLTP vs. Data Warehouse

OLTP | WAREHOUSE (DSS)
Application Oriented | Subject Oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current up to date | Snapshot data
Isolated Data | Integrated Data
Repetitive access | Ad-hoc access
Clerical User | Knowledge User (Manager)

Page 15: Dw & etl concepts

OLTP vs Data Warehouse

OLTP | DATA WAREHOUSE
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/Update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes

Page 16: Dw & etl concepts

OLTP vs Data Warehouse

OLTP | DATA WAREHOUSE
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets

Page 17: Dw & etl concepts

Data Warehouse Architectures

Centralized

In a centralized architecture, there is only one data warehouse, which stores all the data necessary for business analysis. As already shown in the previous section, the disadvantage is a loss of performance compared with distributed approaches.

Central Architecture

Page 18: Dw & etl concepts

Data Warehouse Architectures (contd.)

Tiered:

A tiered architecture is a distributed data approach. Integration cannot be done in one step because many sources have to be integrated into the warehouse.

At the first level, the data of all branches in one region is collected; at the second level, the data from the regions is integrated into one data warehouse.

Advantages:

Faster response time, because the data is located closer to the client applications.

Reduced volume of data to be searched.

Tiered Architecture

Page 19: Dw & etl concepts

Complete Warehouse Solution Architecture

[Diagram: Data sources (Operational Data, Legacy Data, and External Data Sources such as The Post and VISA) pass through Extract, Transform, Load into the organizationally structured Enterprise Data Warehouse, which feeds departmentally structured Data Marts (Sales, Inventory, Purchase). Metadata spans the three stages: Data Sources, Data Management and Access. Data becomes Information and then Knowledge as the flow moves from Asset Assembly (and Management) to Asset Exploitation.]

Page 20: Dw & etl concepts

Introduction To Data Marts

What is a Data Mart?

From the data warehouse, atomic data flows to various departments for their customized needs. If this data is periodically extracted from the data warehouse and loaded into a local database, it becomes a data mart. The data in a data mart has a different level of granularity than that of the data warehouse. Since the data in data marts is highly customized and lightly summarized, the departments can do whatever they want without worrying about resource utilization. The departments can also use the analytical software they find convenient, and the cost of processing becomes very low.

Page 21: Dw & etl concepts

Data Mart Overview

Data marts satisfy 80% of the local end-users' requests.

[Diagram: the Data Warehouse feeds the departmental data marts DM Sales, DM HR, DM Marketing and DM Finance. Users shown: Sales Representatives and Analysts; Human Resources; Financial Analysts, Strategic Planners, and Executives.]

Page 22: Dw & etl concepts

From The Data Warehouse To Data Marts

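[Diagram: from the organizationally structured Data Warehouse to departmentally structured and individually structured data marts. A More/Less scale is shown for History, Normalized and Detailed, and the flow is labelled from Data to Information.]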

Page 23: Dw & etl concepts

Modeling Fundamentals: What is a Data Model?

A data model is a conceptual representation of the data structures (tables) required for a database and is very powerful in expressing and communicating business requirements. It is an abstract model that describes how data is represented and used.

The term data model has two generally accepted meanings:

A data model theory, i.e. a formal description of how data may be structured and used.

A data model instance, i.e. the application of a data model theory to create a practical data model for some particular application.

Page 24: Dw & etl concepts

Modeling Fundamentals: Types of Data Modeling

Logical Data Model (LDM) - A logical design is conceptual and abstract. The process of logical design involves arranging data into a series of logical relationships called entities and attributes.

A logical data model includes all required entities, attributes, key groups, and relationships that represent business information and define business rules.

[Diagram: Logical Data Model]

Page 25: Dw & etl concepts

Modeling Fundamentals: Types of Data Modeling

Physical Data Model (PDM) - A physical data model is a representation of a data design which takes into account the facilities and constraints of a given database management system.

A complete physical data model will include all the database artifacts required to create relationships between tables or achieve performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters.

[Diagram: Physical Data Model]

Page 26: Dw & etl concepts

Modeling Fundamentals: Types of Data Modeling

Entity relationship diagram (ERD) - A data model utilizing several notations to depict data in terms of the entities and relationships described by that data.

Databases are used to store structured data. The structure of this data, together with other constraints, can be designed using a variety of techniques, one of which is called entity-relationship modeling or ERM.

[Diagram: ERD]

Page 27: Dw & etl concepts

Modeling Fundamentals: Types of Data Modeling

Important Terminologies

Entity - The principal data object about which information is to be collected: a class of persons, places, objects, events, or concepts about which we need to capture and store data.

• Persons: agency, contractor, customer, department, division, employee, instructor, student, supplier.
• Places: sales region, building, room, branch office, campus.
• Objects: book, machine, part, product, raw material, software license, software package, tool, vehicle model, vehicle.
• Events: application, award, cancellation, class, flight, invoice, order, registration, renewal, requisition, reservation, sale, trip.
• Concepts: account, block of time, bond, course, fund, qualification, stock.

Page 28: Dw & etl concepts

Modeling Fundamentals: Types of Data Modeling

Relationship - A natural business association that exists between one or more entities. The relationship may represent an event that links the entities or merely a logical affinity that exists between the entities.

Examples of relationships:
• Employees are assigned to projects
• Students enroll in a curriculum
• Projects have subtasks
• Departments manage one or more projects

[Diagram: STUDENT is enrolled in CURRICULUM; CURRICULUM is being studied by STUDENT]

Page 29: Dw & etl concepts

Modeling Fundamentals: Types of Data Modeling

Dimensional Data Modeling (DDM) - Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouse.

It is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access. It adheres to a discipline that uses the relational model with some important restrictions.

Every dimensional model is composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables.

Components of a dimensional model:
• Fact table
• Dimension table
• Attributes

Good examples of dimensions are location, product, time, promotion and organization. Dimension tables store records related to that particular dimension; no facts (measures) are stored in these tables.

A fact (measure) table contains measures (sales gross value, total units sold) and dimension columns. These dimension columns are foreign keys to the respective dimension tables.

Page 30: Dw & etl concepts

Types of Data Modeling: Why Dimensional Modeling?

End users cannot understand or navigate ER models.

Software cannot usefully query an ER model.

Use of ER modeling techniques defeats intuitive and high-performance retrieval of data.

When the designer places understandability and performance as the highest goals, dimensional modeling is the natural approach.

Page 31: Dw & etl concepts

What is a Star Schema?

Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multi-part key in the fact table. This characteristic "star-like" structure is often called a star-schema.

Page 32: Dw & etl concepts

What is a Star Schema? (contd.)

The star schema model is essentially a method to store data that is multi-dimensional in nature in a relational database. It consists of a single "fact table" with a compound primary key, with one segment for each "dimension", and with additional columns of additive, numeric facts.

[Diagram: SALES fact table at the centre, linked to the Customer, Organization, Time, Product and Channel dimensions]

The star schema makes multi-dimensional database (MDDB) functionality possible using a traditional relational database.
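The structure described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: it uses two of the five dimensions from the diagram, and every table name, column name and row is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: single-part primary keys, descriptive attributes.
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);

    -- Fact table: compound primary key with one segment per dimension,
    -- plus additive numeric facts.
    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES product_dim(product_key),
        time_key INTEGER REFERENCES time_dim(time_key),
        sales_dollars REAL,
        sales_units INTEGER,
        PRIMARY KEY (product_key, time_key)
    );

    INSERT INTO product_dim VALUES
        (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Lamp', 'Home');
    INSERT INTO time_dim VALUES (1, 'Jan', 2015), (2, 'Feb', 2015);
    INSERT INTO sales_fact VALUES
        (1, 1, 100.0, 50), (1, 2, 120.0, 60),
        (2, 1, 200.0, 40), (3, 2, 300.0, 10);
    """
)

# A typical star join: constrain and group by dimension attributes,
# then sum the additive facts.
rows = conn.execute(
    """
    SELECT p.category, t.month, SUM(f.sales_dollars)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim t ON t.time_key = f.time_key
    GROUP BY p.category, t.month
    ORDER BY p.category, t.month
    """
).fetchall()
for row in rows:
    print(row)
```

The row headers of the result (category, month) come from the dimension tables, while the numbers come from the fact table.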

Page 33: Dw & etl concepts

Fact Tables

A fact table, because it has a multi-part primary key made up of two or more foreign keys, always expresses a many-to-many relationship.

The most useful fact tables also contain one or more numerical measures, or "facts," that occur for the combination of keys that define each record.

The most useful facts in a fact table are numeric. Numeric addition is crucial because data warehouse applications rarely retrieve a single fact table record. Rather, they retrieve hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up.

Page 34: Dw & etl concepts

Defining Fact Table Structure

Fact Table Structure

[Diagram: fact table (labelled "Fact Item Day Store") linked to the Item, Store and Week dimensions]

Fact Columns

ITEM_ID
WEEK_ID
STORE_ID
SALES_DOLLARS
SALES_UNITS
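As a rough sketch, the fact columns listed above could be declared as follows with Python's sqlite3 module. The table name sales_fact and the sample rows are assumptions; only the five column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_fact (
        ITEM_ID INTEGER, WEEK_ID INTEGER, STORE_ID INTEGER,
        SALES_DOLLARS REAL, SALES_UNITS INTEGER,
        PRIMARY KEY (ITEM_ID, WEEK_ID, STORE_ID)
    )
    """
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 10, 25.0, 5), (1, 1, 20, 40.0, 8), (2, 1, 10, 15.0, 3), (1, 2, 10, 30.0, 6)],
)

# The multi-part key allows one row per (item, week, store) combination,
# so a repeated combination is rejected.
try:
    conn.execute("INSERT INTO sales_fact VALUES (1, 1, 10, 99.0, 9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Additive facts can be summed along any dimension, here per store.
units_by_store = dict(
    conn.execute("SELECT STORE_ID, SUM(SALES_UNITS) FROM sales_fact GROUP BY STORE_ID")
)
print(duplicate_rejected, units_by_store)
```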

Page 35: Dw & etl concepts

What is a Dimension?

A Data Warehouse is:
• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile

[Diagram labels: Subject, Dimension]

In a dimensional model, the context of the measurements is represented in dimension tables.

The dimension attributes are the various columns in a dimension table.

Page 36: Dw & etl concepts

What are Slowly Changing Dimensions?

Slowly changing dimensions are dimensions where a "constant" actually evolves slowly and asynchronously.

"Dimensions have been assumed to be independent of time."

In the real world this is not strictly true.

Examples: people change their names, get married or get divorced.

Page 37: Dw & etl concepts

Three Methods…

The three choices for dealing with slowly changing dimensions are:

Approach | Results
Type 1: Overwriting the old values in the dimension record | Losing the ability to track the old history
Type 2: Creating an additional dimension record at the time of the change with the new attribute values | Segmenting history very accurately between the old description and the new description
Type 3: Creating new "current" fields | Describe history both

Page 38: Dw & etl concepts

Type one

Implementing Type 1:
• Overwrite the field with the new value
• No effect anywhere else in the database

Scenarios where applicable:
• When the original data was in error
• When there is no value in keeping the old description/attribute

Advantages
• Easy to implement
• No key affected

Disadvantages
• History is lost
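A minimal sketch of a Type 1 change, using Python's sqlite3 module. The customer_dim table, its columns and the sample customer are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_key INTEGER PRIMARY KEY, name TEXT, marital_status TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (1, 'Asha Rao', 'Single')")

# Type 1: overwrite the attribute in place. The key is untouched,
# nothing else in the database changes, and the old value is gone.
conn.execute(
    "UPDATE customer_dim SET marital_status = 'Married' WHERE customer_key = 1"
)

rows = conn.execute("SELECT * FROM customer_dim").fetchall()
print(rows)
```

After the update there is still a single row for the customer, so fact rows recorded before the change now appear under the new value.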

Page 39: Dw & etl concepts

Type two

Implementing Type 2:
• Create a new record with a unique key
• Generalize the dimension key by adding 2 or 3 version digits to the end of the key

Scenarios where applicable:
• Most commonly used where history is of importance

Advantages
• Automatically partitions history
• No time constraints required

Disadvantages
• Abrupt point-of-time constraints are not effective
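A minimal sketch of a Type 2 change, assuming the key is generalized by appending two version digits to a hypothetical natural customer_id. The table, the add_version helper and the sample data are illustrative and not taken from any particular tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_key INTEGER PRIMARY KEY, customer_id INTEGER, name TEXT)"
)

def add_version(conn, customer_id, name):
    """Insert a new dimension row whose key is customer_id plus two version digits."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM customer_dim WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    customer_key = customer_id * 100 + count + 1  # e.g. 4201, then 4202
    conn.execute(
        "INSERT INTO customer_dim VALUES (?, ?, ?)", (customer_key, customer_id, name)
    )
    return customer_key

first_key = add_version(conn, 42, "Asha Rao")
second_key = add_version(conn, 42, "Asha Mehta")  # name change: the old row is kept

history = conn.execute(
    "SELECT customer_key, name FROM customer_dim "
    "WHERE customer_id = 42 ORDER BY customer_key"
).fetchall()
print(history)
```

Fact rows loaded before the change would carry the first key and later rows the second, which is how this approach partitions history without an explicit time constraint.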

Page 40: Dw & etl concepts

Dimension Tables

Dimension tables most often contain descriptive textual information.

Dimension attributes are used as the source of most of the interesting constraints in data warehouse queries, and they are virtually always the source of the row headers in the SQL answer set.

It should be obvious that the power of the data warehouse is proportional to the quality and depth of the dimension tables.

Page 41: Dw & etl concepts

Attributes in a Dimension Table

Attributes allow users to constrain data by one or more attributes and to define aggregation levels for data.

DEPT: Dept 1, Dept 2

CLASS | SALES
Class 101 | 1000
Class 120 | 1100
Class 133 | 1900
Class 127 | 2100
Class 141 | 1500
Class 145 | 1800

• Present Classes by Departments
• Aggregate by Class
• Qualify by Department

Page 42: Dw & etl concepts

Basic Dimensional Model

Page 43: Dw & etl concepts

ETL Concepts

Page 44: Dw & etl concepts

ETL !!!

ETL (Extract, Transform, Load) refers to the methods involved in accessing and manipulating source data and loading it into a target database. During the ETL process, data is most often extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.
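The three stages can be sketched end to end in a few lines of Python with sqlite3. Both databases, the orders and order_fact tables and the cleaning rules are assumptions chosen to keep the example small.

```python
import sqlite3

# Hypothetical OLTP source and warehouse target, both in memory.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount_cents INTEGER)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", 1250), (2, "BOB", 300)],
)

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE order_fact (order_id INTEGER, customer TEXT, amount REAL)"
)

def extract(conn):
    # Copy rows out; the original data stays in the source system.
    return conn.execute(
        "SELECT order_id, customer, amount_cents FROM orders"
    ).fetchall()

def transform(rows):
    # Reshape to the warehouse schema: tidy names, convert cents to currency units.
    return [(oid, name.strip().title(), cents / 100) for oid, name, cents in rows]

def load(conn, rows):
    with conn:  # commit once for the whole batch
        conn.executemany("INSERT INTO order_fact VALUES (?, ?, ?)", rows)

load(warehouse, transform(extract(source)))
loaded = warehouse.execute("SELECT * FROM order_fact ORDER BY order_id").fetchall()
print(loaded)
```

Keeping each stage in its own function mirrors the separation of the extraction, transformation and loading elements.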

Page 45: Dw & etl concepts

45

WHAT IS ETL?

E - EXTRACT: Extract data from disparate sources

T - TRANSFORM: Transform data

L - LOAD: Load data where we want to

Page 46: Dw & etl concepts

EXTRACTION (Data Capturing)

The ETL extraction element is responsible for extracting data from the source system. During extraction, data may be removed from the source system or a copy made and the original data retained in the source system.

Page 47: Dw & etl concepts

EXTRACTION (Data Transmission)

Legacy systems may require too much effort to implement such offload processes, so legacy data is often copied into the data warehouse, leaving the original data in place. Extracted data is loaded into the data warehouse staging area (a relational database usually separate from the data warehouse database), for manipulation by the remaining ETL processes.

Page 48: Dw & etl concepts

EXTRACTION (Cleansing Process)

Data extraction is generally performed within the source system itself.

Data extraction processes can be implemented using Transact-SQL stored procedures, Data Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.

Page 49: Dw & etl concepts

TRANSFORMATION

The ETL transformation element is responsible for data validation, data accuracy, data type conversion, and business rule application. An ETL system that uses inline transformations during extraction is less robust and flexible than one that confines transformations to the transformation element. Transformations performed in the OLTP system impose a performance burden on the OLTP database.

Page 50: Dw & etl concepts

TRANSFORMATION (contd.)

Data Validation: Check that all rows in the fact table match rows in dimension tables to enforce data integrity.

Data Accuracy: Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.

Data Type Conversion: Ensure that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system. For example, if one source system stores "off" or "on" in its status field and another source system stores "0" or "1" in its status field, then a data type conversion transformation converts the content of one or both of the fields to a specified common value such as "off" or "on".

Business Rule Application: Ensure that the rules of the business are enforced on the data stored in the warehouse. For example, check that all customer records contain values for both FirstName and LastName fields.
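The four kinds of transformation above can be sketched as plain Python checks. The staged rows, the product_key field and the set of known dimension keys are hypothetical; the status values and the FirstName and LastName fields follow the examples in the text.

```python
# Hypothetical staged rows and dimension keys, for illustration only.
product_keys = {101, 102}
staged = [
    {"product_key": 101, "status": "1", "FirstName": "Asha", "LastName": "Rao"},
    {"product_key": 999, "status": "on", "FirstName": "Ravi", "LastName": "Shah"},
    {"product_key": 102, "status": "maybe", "FirstName": "Meera", "LastName": "Iyer"},
    {"product_key": 102, "status": "0", "FirstName": "Dev", "LastName": ""},
]

# Every accepted source spelling maps to one common stored value.
STATUS_MAP = {"0": "off", "1": "on", "off": "off", "on": "on"}

def check(row):
    """Return the list of rule violations for one row (empty if the row is clean)."""
    problems = []
    if row["product_key"] not in product_keys:      # data validation
        problems.append("no matching dimension row")
    if row["status"] not in STATUS_MAP:             # data accuracy
        problems.append("invalid status")
    if not (row["FirstName"] and row["LastName"]):  # business rule
        problems.append("missing name")
    return problems

clean, rejected = [], []
for row in staged:
    problems = check(row)
    if problems:
        rejected.append((row, problems))
    else:
        # Data type conversion: store status one way regardless of source.
        clean.append({**row, "status": STATUS_MAP[row["status"]]})

print(len(clean), [p for _, p in rejected])
```

Rejected rows are kept alongside their reasons rather than dropped, so they can be reported or corrected before the next load.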

Page 51: Dw & etl concepts

LOADING

The ETL loading element is responsible for loading transformed data into the data warehouse database.

Data warehouses are usually updated periodically rather than continuously, and large numbers of records are often loaded to multiple tables in a single data load.

The data warehouse is often taken offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy API are the best tools for data loading operations.

The design of the loading element should focus on efficiency and performance to minimize the data warehouse offline time.

Page 52: Dw & etl concepts

ETL Tools

What are ETL Tools?

ETL tools are meant to extract, transform and load data into the data warehouse for decision making. Before the evolution of ETL tools, the ETL process described above was done manually, using SQL code written by programmers. This task was tedious and cumbersome in many cases, since it involved many resources, complex coding and more work hours. On top of that, maintaining the code posed a great challenge for the programmers.

Selecting an appropriate ETL tool is the most important decision that has to be made when choosing the components of a data warehousing application. The ETL tool operates at the heart of the data warehouse, extracting data from multiple data sources, transforming the data to make it accessible to business analysis, and loading multiple target databases.

Page 53: Dw & etl concepts

Features of ETL Tools

ETL tools can extract data from various sources such as RDBMS, DB2, COBOL data files and flat files at scheduled intervals, apply the required transformations and load the data into a data warehouse that resides on an RDBMS. ETL tools can connect to an RDBMS and get the list of tables and their attributes. The general steps for designing an ETL process are:

Define the structure of the source data

Define the structure of the destination data

Map elements of the source data to elements of the destination data

Define the transformations required, such as changing values or summing

Schedule the execution of the process

Once executed, the process generates a log showing the status of the process, the number of records inserted, etc. Various reports about processes are available, which can form the metadata.
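The design steps listed above, taken in order, can be sketched in plain Python as a declarative mapping plus a run log. The field names, the mapping and the run_etl helper are hypothetical and do not represent the interface of any particular ETL tool.

```python
# Step 1: structure of the source data (hypothetical field names).
source_rows = [
    {"cust_nm": "asha", "qty": "2", "unit_price": "10.5"},
    {"cust_nm": "ravi", "qty": "x", "unit_price": "4.0"},  # bad quantity
    {"cust_nm": "meera", "qty": "1", "unit_price": "7.0"},
]

# Steps 2-4: each destination field is mapped to a transformation of the source row.
mapping = {
    "customer_name": lambda r: r["cust_nm"].title(),
    "quantity": lambda r: int(r["qty"]),
    "total": lambda r: int(r["qty"]) * float(r["unit_price"]),
}

def run_etl(rows, mapping):
    """Apply the mapping to every row; return the loaded rows and a run log."""
    target = []
    log = {"status": "ok", "inserted": 0, "rejected": 0}
    for row in rows:
        try:
            target.append({dest: fn(row) for dest, fn in mapping.items()})
            log["inserted"] += 1
        except (KeyError, ValueError):
            log["rejected"] += 1
            log["status"] = "completed with errors"
    return target, log

# Step 5: a scheduler would call run_etl at the chosen intervals.
target, log = run_etl(source_rows, mapping)
print(target)
print(log)
```

The returned log plays the role of the process log described above: it records the run status and the number of records inserted.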

Page 54: Dw & etl concepts
