dw architecture & best practices

Data Warehouse Architecture

Upload: api-19730613

Post on 18-Nov-2014


TRANSCRIPT

Page 1: DW architecture & best practices

Data Warehouse Architecture

Page 2: DW architecture & best practices

Define Data Warehouse Architecture
Define Data Warehouse and Data Mart
Present a Data Warehouse Architectural Framework
Demo – Data Enterprise Integration Server

Objectives

Page 3: DW architecture & best practices

Information Systems Architecture is the process of making the key choices that are essential to the development of an information system. Architecture includes:
◦ Guiding principles
◦ Approaches/philosophies
◦ “Logical” representations of a system
◦ Hardware/operating system
◦ Computing model: client/server vs. traditional vs. Web-based
◦ Tools and technologies

It is key, when making these choices, that they are:
◦ Requirements driven
◦ Made with operational, technical and financial feasibility taken into consideration
◦ Made within an architectural framework

Information Systems Architecture

Page 4: DW architecture & best practices

There are many drivers of architecture:
◦ Business plan
◦ Corporate politics
◦ System qualities
◦ Current systems
◦ End user requirements
◦ Emerging technologies

Architecture Drivers

Page 5: DW architecture & best practices

It's not really different: architecture can be considered ‘high-level’ design

Architecture includes those aspects of the design that are essential to the information system

Architecture example:
◦ Users must be able to self-serve (guiding principle)
◦ We will use a “hub and spoke” design where data will be placed in a central data warehouse, then be propagated to one or more data marts (approach)
◦ We will normalize data in the central warehouse and use a dimensional design in the data marts (approach)
◦ We will use Oracle 8i as our DBMS (technical architecture)

How is Architecture Different from Design?

Page 6: DW architecture & best practices

Not Architecture:
◦ The Order subject area will be composed of the following tables: order_fact, customer_dim, product_dim and time_dim
◦ The customer_dim table will have the following attributes…

Architecture vs Design
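The table-level decision above is design rather than architecture, and it can be sketched concretely. The following is a minimal illustration using Python's built-in sqlite3 module. The four table names come from the slide; every column name is a hypothetical placeholder, since the slide does not list the attributes.

```python
import sqlite3

# Table names are from the slide; all columns are hypothetical placeholders.
DDL = """
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE order_fact (
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    time_key     INTEGER REFERENCES time_dim(time_key),
    order_amount REAL
);
"""

def build_order_subject_area() -> sqlite3.Connection:
    """Create the Order subject area in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn

def table_names(conn: sqlite3.Connection) -> list:
    """List the tables that were created, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]
```

Nothing here depends on the architectural choices (hub and spoke, DBMS vendor); those would be settled before any table is drawn up.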

Page 7: DW architecture & best practices

Communication:
◦ To business sponsors and business users
◦ Between members of the project team

Planning:
◦ Cross-check for project plan
◦ Ensure that all important components of the data warehouse are accounted for

Flexibility and Growth
◦ Thinking about overall architecture will reduce risk associated with the ‘success’ of the data warehouse

Learning

Productivity and Reuse

The Value of Architecture

Page 8: DW architecture & best practices

Transaction processing systems – growth is (relatively) predictable

Example:
◦ A company uses SAP for order processing
◦ They are opening a new retail store
◦ They predict (based on experience) 2000 transactions per week
◦ To process this volume, we need 3 workstations to capture the transactions
◦ Peak time each day is 11-2, when 50% of transactions occur

What’s different about DW Architecture?
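The predictability of the transaction-processing example can be shown with simple arithmetic. The sketch below takes the slide's figures (2000 transactions per week, 3 workstations, 50% of volume in the three-hour 11-2 peak) and works out the peak hourly load per workstation. The number of trading days per week is not given on the slide, so `days_open_per_week` is an assumption.

```python
def peak_load_per_workstation(
    weekly_transactions: int = 2000,   # from the slide
    workstations: int = 3,             # from the slide
    peak_share: float = 0.5,           # 50% of transactions fall in the peak
    peak_hours: float = 3.0,           # 11-2 is a three-hour window
    days_open_per_week: int = 7,       # assumption: not stated on the slide
) -> float:
    """Transactions each workstation must capture per hour during the peak."""
    daily = weekly_transactions / days_open_per_week
    peak_per_hour = daily * peak_share / peak_hours
    return peak_per_hour / workstations
```

With seven trading days this comes to roughly 16 transactions per workstation per hour at peak. A data warehouse offers no comparably stable inputs to plug into such a formula.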

Page 9: DW architecture & best practices

Success drives explosive growth
◦ More users
◦ More (complex) queries
◦ More data

Performance is unpredictable
◦ Unpredictable queries
◦ Unpredictable use patterns

What’s Different About Data Warehouse Architecture?

[Figure: growth over time for Siebel, SAP R/3 and the Data Warehouse]

Page 10: DW architecture & best practices

Bill Inmon: “The enterprise data warehouse”

Ralph Kimball: “data marts”

The compromise: “Hub and Spoke” or “Federated” models

The Great Data Warehouse Architecture Debate

If you build it, They will come

Page 11: DW architecture & best practices

A data mart is a collection of subject areas organized for decision support based on the specific needs of a given user group.

Each mart may differ widely from the others (as we will see)

Typically, data marts are built on the dimensional data model:
◦ Facts – things that the organization wants to measure: revenue, orders, shipments, purchases, etc.
◦ Dimensions – the means by which the organization wants to analyze the measures (facts) – by customer, by time, by product – BY ANY COMBINATION!!

What is a Data Mart?
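The "by any combination" idea can be sketched in a few lines of plain Python. The fact rows, dimension names and the single `revenue` measure below are all hypothetical, chosen only to show one measure being summed along any subset of dimensions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: dimension values plus one measure (revenue).
FACTS = [
    {"customer": "Acme", "product": "Widget", "month": "Jan", "revenue": 100.0},
    {"customer": "Acme", "product": "Gadget", "month": "Jan", "revenue": 50.0},
    {"customer": "Bolt", "product": "Widget", "month": "Feb", "revenue": 70.0},
]
DIMENSIONS = ("customer", "product", "month")

def total_by(facts, dims):
    """Sum the revenue measure grouped by the given tuple of dimensions."""
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row["revenue"]
    return dict(totals)

def all_groupings(dimensions):
    """Every non-empty combination of dimensions a user could analyze by."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]
```

For example, `total_by(FACTS, ("customer",))` totals revenue per customer, while `total_by(FACTS, ("product", "month"))` answers a different question from the same facts.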

Page 12: DW architecture & best practices

There are two kinds of data marts--dependent and independent.

A dependent data mart is one whose source is a data warehouse.

An independent data mart is one whose source is the legacy applications environment. All dependent data marts are fed by the same source--the data warehouse. Each independent data mart is fed uniquely and separately by the legacy applications environment.

Dependent data marts are architecturally and structurally sound.

Independent data marts have a number of significant issues

What is a Data Mart?

Page 13: DW architecture & best practices

Data Warehouse vs. Data Marts

What comes first

Page 14: DW architecture & best practices

From the Data Warehouse to Data Marts

[Figure: Data Warehouse (organizationally structured), then departmentally structured and individually structured levels; scale labels: Less/More (History, Normalized, Detailed) and Data/Information]

Page 15: DW architecture & best practices

Data Warehouse and Data Marts

Data mart (OLAP): lightly summarized, departmentally structured

Data warehouse: organizationally structured, atomic, detailed data warehouse data

Page 16: DW architecture & best practices

Data Mart Centric

Data Marts

Data Sources

Data Warehouse

Page 17: DW architecture & best practices

Problems with Data Mart Centric Solution

If you end up creating multiple warehouses, integrating them is a problem

Page 18: DW architecture & best practices

True Warehouse

Data Marts

Data Sources

Data Warehouse

Page 19: DW architecture & best practices

Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Logical Data Mart and Real-Time Data Warehouse
Three-Layer Architecture

Data Warehouse Architectures

All involve some form of extraction, transformation and loading (ETL)

Page 20: DW architecture & best practices

Generic two-level data warehousing architecture

ETL

One, company-wide warehouse

Periodic extraction: data is not completely current in warehouse

Page 21: DW architecture & best practices

Independent data mart data warehousing architecture

Data marts: mini-warehouses, limited in scope

ETL

Separate ETL for each independent data mart

Data access complexity due to multiple data marts

Page 22: DW architecture & best practices

Dependent data mart with operational data store: a three-level architecture

ETL

Single ETL for enterprise data warehouse (EDW)

Simpler data access

ODS provides option for obtaining current data

Dependent data marts loaded from EDW

Page 23: DW architecture & best practices

ETL

Near real-time ETL for Data Warehouse

ODS and data warehouse are one and the same

Data marts are NOT separate databases, but logical views of the data warehouse

Easier to create new data marts

Logical data mart and real time warehouse architecture
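A logical data mart can be illustrated as nothing more than a view over the warehouse. The sketch below uses Python's built-in sqlite3 module; the `sales` table, its columns and the `marketing_mart` view are hypothetical names invented for the example.

```python
import sqlite3

def build_warehouse_with_logical_mart() -> sqlite3.Connection:
    """Create a tiny warehouse table and a data mart defined as a view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical warehouse table holding detailed data.
        CREATE TABLE sales (region TEXT, department TEXT, amount REAL);
        INSERT INTO sales VALUES
            ('East', 'Marketing', 10.0),
            ('East', 'Finance',   20.0),
            ('West', 'Marketing', 30.0);
        -- The data mart is only a view: no separate database, no extra ETL.
        CREATE VIEW marketing_mart AS
            SELECT region, SUM(amount) AS total_amount
            FROM sales
            WHERE department = 'Marketing'
            GROUP BY region;
        """
    )
    return conn

def read_mart(conn):
    """Query the logical mart exactly as if it were a table."""
    return conn.execute(
        "SELECT region, total_amount FROM marketing_mart ORDER BY region"
    ).fetchall()
```

Because the mart is a view, a row added to the warehouse is visible through the mart immediately, and creating another mart is one more `CREATE VIEW` statement.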

Page 24: DW architecture & best practices


Three-layer data architecture for a data warehouse

Page 25: DW architecture & best practices

Independent data marts
Hub and spoke architecture
Data mart bus architecture
Federated data warehouse

The Major Data Warehouse Architectures

Page 26: DW architecture & best practices

Independent data mart architecture

• Developed independently.

• No conformed dimensions (i.e., the data marts do not share the same categories and labels for data elements, which would allow data across data marts to be combined).

• Built for a business unit or functional area.

[Figure: data sources, data staging, independent data marts, end user access/applications]

Page 27: DW architecture & best practices

Hub and spoke architecture

• Key spokesperson: Bill Inmon (1992, 1998, 2001).

• Detailed, enterprise-oriented view of data.

• Built in an iterative manner, subject area by subject area.

• Dependent data marts to support user needs for dimensional data.

[Figure: data sources, data staging, central data store, ODS, dependent data marts, end user access/applications]

Page 28: DW architecture & best practices

Data mart bus architecture

• Key spokesperson: Ralph Kimball (1996, 1999).

• First data mart built as proof of concept.

• Built sequentially according to master suite of conformed dimensions and fact tables, resulting in logically integrated marts.

• Conformed dimensions provide capability to access data across architected marts.

[Figure: data sources, data staging, architected data marts, end user access/applications]

Page 29: DW architecture & best practices

Federated architecture

• Key spokesperson: Doug Hackney (2000, 2002).

• Combines data in an organization’s existing data warehousing environment.

• Characterized by combining key metrics and measures in existing data marts, data warehouses and legacy systems.

[Figure: data warehouse, data mart, data stores, data staging, federated data store, end user access/applications]

Page 30: DW architecture & best practices

Data Warehouse Architecture Selection

Page 31: DW architecture & best practices

Architecture selection

Data warehouse architectures
◦ Independent data mart
◦ Data mart bus architecture
◦ Hub and spoke architecture
◦ Federated

Architecture selection factors
◦ Information interdependence
◦ Upper management’s information needs
◦ Urgency of need
◦ View of the data warehouse
◦ Compatibility with existing systems
◦ Nature of end user tasks
◦ Resource constraints
◦ Perceived ability of the IT staff
◦ Source of sponsorship
◦ Expert influence

Page 32: DW architecture & best practices

April 8, 2023 | DW Architecture Best Practices

Use a data model that is optimized for information retrieval
◦ dimensional model
◦ denormalized
◦ hybrid approach

Best Practice #1

Page 33: DW architecture & best practices

Extract Transform Load (ETL)
◦ the process of unloading or copying data from the source systems, transforming it into the format and data model required in the BI environment, and loading it to the DW
◦ also, a software development tool for building ETL processes (an ETL tool)
◦ many production DWs use COBOL or other general-purpose programming languages to implement ETL

◦ many production DWs use COBOL or other general-purpose programming languages to implement ETL

Data Acquisition Processes

Page 34: DW architecture & best practices

Capture/Extract
Scrub or data cleansing
Transform
Load and Index

The ETL Process

ETL = Extract, transform, and load

Page 35: DW architecture & best practices


Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract

Capture/Extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Steps in data reconciliation
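The difference between a static and an incremental extract can be sketched as follows. This is a minimal illustration, assuming the source rows carry a change timestamp; the `updated_at` column and the sample rows are hypothetical, and real sources may expose changes through logs or triggers instead.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
SOURCE = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2023, 1, 5)},
    {"id": 3, "name": "Cy", "updated_at": datetime(2023, 1, 9)},
]

def static_extract(source):
    """Snapshot of the whole chosen subset at a point in time."""
    return [dict(row) for row in source]

def incremental_extract(source, last_extract_time):
    """Only the rows that changed since the previous extract."""
    return [dict(row) for row in source if row["updated_at"] > last_extract_time]
```

A typical pattern is one static extract for the initial load, followed by incremental extracts that each record the time they ran for use as the next `last_extract_time`.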

Page 36: DW architecture & best practices


Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Steps in data reconciliation

(cont.)
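A few of the cleansing tasks listed above can be sketched with simple rules. Note that this sketch uses plain rule-based checks, not the pattern recognition or AI techniques the slide mentions, and the record layout (`name`, `order_date`) is a hypothetical example.

```python
from datetime import datetime

def scrub(records):
    """Return (clean_rows, rejected_rows) after basic rule-based cleansing."""
    clean, rejected, seen = [], [], set()
    for rec in records:
        # Reformatting: collapse stray whitespace and normalize capitalization.
        name = " ".join(str(rec.get("name") or "").split()).title()
        try:
            raw_date = str(rec.get("order_date") or "")
            order_date = datetime.strptime(raw_date, "%Y-%m-%d").date()
        except ValueError:
            rejected.append((rec, "erroneous date"))  # error detection/logging
            continue
        if not name:
            rejected.append((rec, "missing name"))    # missing data
            continue
        key = (name, order_date)
        if key in seen:
            rejected.append((rec, "duplicate"))       # duplicate data
            continue
        seen.add(key)
        clean.append({"name": name, "order_date": order_date})
    return clean, rejected
```

Keeping the rejected rows together with a reason, rather than silently dropping them, is what makes the error logging step possible.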

Page 37: DW architecture & best practices

Transform = convert data from format of operational system to format of data warehouse

Record-level:
◦ Selection: data partitioning
◦ Joining: data combining
◦ Aggregation: data summarization

Field-level:
◦ Single-field: from one field to one field
◦ Multi-field: from many fields to one, or one field to many

Steps in data reconciliation

(cont.)
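The record-level and field-level transformations named above can be shown together in one short function. The order and customer data below are hypothetical; the comments mark which line corresponds to which kind of transformation.

```python
from collections import defaultdict

ORDERS = [  # hypothetical operational rows
    {"order_id": 1, "cust_id": 10, "status": "shipped", "amount": 40.0},
    {"order_id": 2, "cust_id": 10, "status": "cancelled", "amount": 15.0},
    {"order_id": 3, "cust_id": 11, "status": "shipped", "amount": 25.0},
]
CUSTOMERS = {
    10: {"first": "Ann", "last": "Lee"},
    11: {"first": "Bob", "last": "Ray"},
}

def transform(orders, customers):
    """Apply selection, joining, a multi-field mapping, then aggregation."""
    # Record-level selection: keep only the partition of shipped orders.
    shipped = [o for o in orders if o["status"] == "shipped"]
    # Record-level joining: combine each order with its customer record.
    joined = [{**o, **customers[o["cust_id"]]} for o in shipped]
    # Field-level, multi-field: many fields (first, last) to one (full_name).
    for row in joined:
        row["full_name"] = f"{row['first']} {row['last']}"
    # Record-level aggregation: summarize amount per customer.
    totals = defaultdict(float)
    for row in joined:
        totals[row["full_name"]] += row["amount"]
    return dict(totals)
```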

Page 38: DW architecture & best practices

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse

Steps in data reconciliation

(cont.)
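The two load modes can be contrasted in a small sketch using Python's built-in sqlite3 module. The `customer` target table and its index are hypothetical, and `INSERT OR REPLACE` is SQLite's spelling of an upsert; other databases use different syntax.

```python
import sqlite3

def make_target() -> sqlite3.Connection:
    """Create a hypothetical warehouse target table with one index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE INDEX idx_customer_name ON customer (name)")
    return conn

def refresh_load(conn, rows):
    """Refresh mode: bulk rewrite of the whole target table."""
    with conn:  # one transaction: readers never see a half-empty table
        conn.execute("DELETE FROM customer")
        conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", rows)

def update_load(conn, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO customer (id, name) VALUES (?, ?)",
            changed_rows,
        )

def contents(conn):
    """Current target rows, ordered by key."""
    return conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
```

Refresh mode is simpler and self-correcting but rewrites everything; update mode touches less data but depends on reliably knowing what changed.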

Page 39: DW architecture & best practices

data cleansing
◦ the process of validating and enriching the data as it is published to the DW
◦ also, a software development tool for building data cleansing processes (a data cleansing tool)
◦ many production DWs have only very rudimentary data quality assurance processes

Data Quality Assurance

Page 40: DW architecture & best practices

getting data loaded efficiently and correctly is critical to the success of your DW
◦ implementation of data acquisition & cleansing processes represents from 50 to 80% of effort on typical DW projects
◦ inaccurate data content can be ‘the kiss of death’ for user acceptance

Data Acquisition & Cleansing

Page 41: DW architecture & best practices

Carefully design the data acquisition and cleansing processes for your DW
◦ Ensure the data is processed efficiently and accurately
◦ Consider acquiring ETL and Data Cleansing tools
◦ Use them well!

Best Practice #2

Page 42: DW architecture & best practices


Already discussed the benefits of a dimensional model

No matter whether dimensional modeling or any other design approach is used, the data model must be documented

Data Model

Page 43: DW architecture & best practices

The best practice is to use some kind of data modeling tool
◦ CA ERwin
◦ Sybase PowerDesigner
◦ Oracle Designer
◦ IBM Rational Rose
◦ Etc.

Different tools support different modeling notations, but they are more or less equivalent anyway

Most tools allow sharing of their metadata with an ETL tool

Documenting the Data Model

Page 44: DW architecture & best practices

data model standards appropriate for the environment and tools chosen in your data warehouse should be adopted

consideration should be given to data access tool(s) and integration with overall enterprise standards

standards must be documented and enforced within the DW team
◦ someone must ‘own’ the data model

to ensure a quality data model, all changes should be reviewed through some formal process

Data Model Standards

Page 45: DW architecture & best practices

Business definitions should be recorded for every field (unless they are technical fields only)

Domain of data should be recorded

Sample values should be included

As more metadata is populated into the modeling tool it becomes increasingly important to be able to share this data across ETL and Data Access tools

Data Model Metadata
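The metadata items above (business definition, domain, sample values) map naturally onto a small record type. The sketch below is one hypothetical way to hold and share them; the `FieldMetadata` class and the JSON export format are illustrative assumptions, not the interchange format of any particular modeling or ETL tool.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class FieldMetadata:
    """Hypothetical metadata record for one column in the data model."""
    table: str
    column: str
    business_definition: str
    domain: str
    sample_values: list = field(default_factory=list)

def export_metadata(fields) -> str:
    """Serialize metadata to JSON so that other tools could import it."""
    return json.dumps([asdict(f) for f in fields], indent=2)

def undocumented(fields):
    """Columns whose business definition is still missing."""
    return [
        f"{f.table}.{f.column}"
        for f in fields
        if not f.business_definition.strip()
    ]
```

A check like `undocumented` is one way a formal review could verify that every non-technical field has a recorded definition.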

Page 46: DW architecture & best practices


The strategy for sharing data model and other metadata should be formalized and documented

Metadata management tools should be considered & the overall metadata architecture should be carefully planned

Metadata Architecture

Page 47: DW architecture & best practices


Design a metadata architecture that allows sharing of metadata between components of your DW

Best Practice #3

Page 48: DW architecture & best practices

Bill Inmon: “Corporate Information Factory”
Hub and Spoke philosophy
“JBOC” – just a bunch of cubes
Let it evolve naturally

Alternative Architecture Approaches

Page 49: DW architecture & best practices


In most cases, business and IT agree that the data warehouse should provide a ‘single version of the truth’

Any approach that can result in disparate data marts or cubes is undesirable

This is known as data silos or…

What We Want (Architectural Principle)

Page 50: DW architecture & best practices

how to design an enterprise data warehouse and ensure a ‘single version of the truth’?

according to Kimball:
◦ start with an overall data architecture phase
◦ use “Data Warehouse Bus” design to integrate multiple data marts
◦ use incremental approach by building one data mart at a time

Enterprise DW Architecture

Page 51: DW architecture & best practices

named for the bus in a computer
◦ standard interface that allows you to plug in CD-ROM, disk drive, etc.
◦ these peripherals work together smoothly

provides framework for data marts to fit together

allows separate data marts to be implemented by different groups, even at different times

Data Warehouse Bus Architecture

Page 52: DW architecture & best practices

data mart is a complete subset of the overall data warehouse
◦ a single business process OR
◦ a group of related business processes

think of a data mart as a collection of related fact tables sharing conformed dimensions, aka a ‘fact constellation’

Data Mart Definition

Page 53: DW architecture & best practices

determine which dimensions will be shared across multiple data marts

conform the shared dimensions
◦ produce a master suite of shared dimensions

determine which facts will be shared across data marts

conform the facts
◦ standardize the definitions of facts

Designing The DW Bus
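The first design step, finding the dimensions shared across data marts, can be sketched with a simple bus matrix. The marts and dimensions listed below are hypothetical examples; only the idea of tabulating marts against dimensions comes from the bus approach.

```python
from collections import Counter

# Hypothetical bus matrix: which dimensions each planned data mart uses.
BUS_MATRIX = {
    "orders": {"customer", "product", "time"},
    "shipments": {"customer", "product", "time", "carrier"},
    "billing": {"customer", "time", "account"},
}

def shared_dimensions(matrix):
    """Dimensions used by more than one mart: the candidates to conform."""
    counts = Counter(dim for dims in matrix.values() for dim in dims)
    return sorted(dim for dim, n in counts.items() if n > 1)
```

Here `customer`, `product` and `time` would go into the master suite of shared dimensions, while `carrier` and `account` can stay local to their marts.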

Page 54: DW architecture & best practices

conformed dimensions will usually be granular
◦ makes it easy to integrate with various base level fact tables
◦ easy to extend fact table by adding new facts
◦ no need to drop or reload fact tables, and no keys have to be changed

Dimension Granularity

Page 55: DW architecture & best practices

by adhering to standards, the separate data marts can be plugged together
◦ e.g. customer, product, time

they can even share data usefully, for example in a drill across report

ensures reports or queries from different data marts share the same context

Conforming Dimensions
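A drill-across report can be sketched as a merge of two marts' results on a conformed dimension key. The customer keys and the two result sets below are hypothetical; the point is that the merge only works because both marts identify customers the same way.

```python
# Hypothetical results, already aggregated per conformed customer key
# by two separate data marts.
ORDERS_BY_CUSTOMER = {"C1": 500.0, "C2": 120.0}
SHIPMENTS_BY_CUSTOMER = {"C1": 450.0, "C3": 80.0}

def drill_across(ordered, shipped):
    """Merge two marts' measures row by row on the shared dimension key."""
    report = {}
    for key in sorted(set(ordered) | set(shipped)):
        report[key] = {
            "ordered": ordered.get(key, 0.0),
            "shipped": shipped.get(key, 0.0),
        }
    return report
```

If the two marts used different customer identifiers or labels, this merge would need a mapping step, which is exactly the cost that conforming the dimension avoids.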

Page 56: DW architecture & best practices

a current trend in BI/DW is ‘data consolidation’

from a software vendor perspective, it is tempting to simplify this:
◦ ‘we can keep all the tables for all your disparate applications in one physical database’

Data Consolidation

Page 57: DW architecture & best practices

To truly achieve ‘a single version of the truth’, you must do more than simply consolidate application databases

You must integrate data models and establish common terms of reference

Data Integration

Page 58: DW architecture & best practices

Take an approach that consolidates data into ‘a single version of the truth’
◦ Data Warehouse Bus (conformed dimensions & facts)
◦ OR?

Best Practice #4

Page 59: DW architecture & best practices


a single point of integration for disparate operational systems

contains integrated data at the most detailed level (transactional)

may be loaded in ‘near real time’ or periodically

can be used for centralized operational reporting

Operational Data Store (ODS)

Page 60: DW architecture & best practices

Consider implementing an ODS only when information retrieval requirements are near the bottom of the data abstraction pyramid and/or when there are multiple operational sources that need to be accessed
◦ Must ensure that the data model is integrated, not just consolidated
◦ May consider 3NF data model
◦ Avoid at all costs a ‘data dumping ground’

Best Practice #5

Page 61: DW architecture & best practices

DW workloads are typically very demanding, especially for I/O capacity

Successful implementations tend to grow very quickly, both in number of users and data volume

Rules of thumb do exist for sizing the hardware platform to provide adequate initial performance
◦ typically based on estimated ‘raw’ data size of proposed database, e.g. 100-150 GB per modern CPU

Capacity Planning
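The sizing rule of thumb can be turned into a small calculation. This sketch interprets the slide's figure as 100-150 gigabytes of raw data per CPU, which is an assumption about the unit; the function names are hypothetical, and the result is only a starting point for initial sizing.

```python
import math

def cpus_needed(raw_data_gb: float, gb_per_cpu: float) -> int:
    """Initial CPU count from the raw-data-size rule of thumb."""
    if raw_data_gb < 0 or gb_per_cpu <= 0:
        raise ValueError("raw size must be >= 0 and GB per CPU must be > 0")
    return math.ceil(raw_data_gb / gb_per_cpu)

def cpu_range(raw_data_gb: float, low_gb_per_cpu=100.0, high_gb_per_cpu=150.0):
    """(fewest, most) CPUs implied by the 100-150 GB per CPU range."""
    fewest = cpus_needed(raw_data_gb, high_gb_per_cpu)
    most = cpus_needed(raw_data_gb, low_gb_per_cpu)
    return fewest, most
```

For an estimated 1200 GB of raw data, `cpu_range(1200)` gives between 8 and 12 CPUs. Since successful warehouses grow quickly, the estimate would need revisiting as data volume and user counts rise.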

Page 62: DW architecture & best practices

Scaling performance within a single SMP server is referred to as ‘scale up’

Database benchmarks suggest Windows scalability is near that of Linux

IBM claims near-linear scalability for Linux (on commodity hardware) up to about 4 processors
◦ Probably not cost effective to scale up Linux much beyond 4 processors

IBM claims near-linear scalability for AIX on POWER5 up to about 8 processors

SMP Server Scale Up

Page 63: DW architecture & best practices

To obtain the total number of processors required for the estimated DW workload, you must plan either to scale up or scale out

Both options are viable but, all other things being equal, scaling up is less disruptive to end users and requires less work to implement
◦ scaling up can offer lower hardware investment, if practical
◦ however, network bandwidth or latency issues can limit effectiveness of parallelism

Scale Up vs. Scale Out

Page 64: DW architecture & best practices

Create a capacity plan for your BI application & monitor it carefully

Consider future additional performance demands
◦ Establish standard performance benchmark queries and regularly run them
◦ Implement capacity monitoring tools
◦ Build scalability into your architecture
◦ May need to allow for scaling both up and out!

Best Practice #6
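Running standard benchmark queries regularly can be sketched as a small timing harness. The version below uses Python's built-in sqlite3 and time modules purely for illustration; the function names and the 1.5x regression tolerance are hypothetical choices, and a production setup would run against the actual warehouse and store results over time.

```python
import sqlite3
import time

def run_benchmarks(conn, queries, repeats=3):
    """Return the best wall-clock time in seconds for each named query."""
    timings = {}
    for name, sql in queries.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()  # fetch all rows so the query completes
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings

def regressions(baseline, current, tolerance=1.5):
    """Queries now running more than `tolerance` times slower than baseline."""
    return sorted(
        name
        for name, seconds in current.items()
        if seconds > baseline.get(name, seconds) * tolerance
    )
```

Comparing each run against a stored baseline turns the benchmark queries into an early warning that growth is outpacing capacity.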

Page 65: DW architecture & best practices

Another emerging trend in IT generally is to utilize Open Source software running on commodity hardware
◦ this is expected to offer lower total cost of ownership
◦ certainly, GNU/Linux and other Open Source initiatives do provide very good functionality and quality for minimal cost

This trend also applies to BI & DW:
◦ most traditional RDBMSs are now supported on Linux
◦ however, open source RDBMSs lag behind on providing good performance for DW queries

Open Source Affordability

Page 66: DW architecture & best practices


DW appliances, consisting of packaged solutions providing all required software and hardware, are beginning to offer very promising price/performance

production experience is limited so far, so this is not yet a ‘best practice’

DW Appliances

Page 67: DW architecture & best practices

In the case where an ODS is a necessary component of the overall DW, it should be carefully integrated into the overall architecture

Can also be used for:
◦ Staging area
◦ Master/reference data management
◦ Etc…

Role of an ODS in DW Architecture