Data Warehousing 101 (transcript)
2
Databases vs Data Warehousing
Often mistaken for each other, but vastly different.
A database supports data storage and retrieval for an application or specific purpose. It is operational in nature; don't bog it down with informational reporting. (Would your app still perform after your next acquisition, when you have grown your customer base, have more users, and produce even more reports?)
A data warehouse is used for informational purposes: to facilitate business reporting and analysis. It is not operational.
3
Definition of a Data Warehouse
Data warehouse: a subject-oriented, integrated, time-variant, nonvolatile collection of data for management's decisions.
…data warehouses are granular. They contain the bedrock data that forms the single source for all Decision Support System/Executive Information System processing. With a data warehouse there is reconcilability of information when there are differences of opinion. The atomic data found in the warehouse can be shaped in many ways, satisfying both known requirements and standing ready to satisfy unknown requirements.
http://www.itquestionbank.com/types-of-data-warehouse.html
4
DW Project Components
Business Requirements
Physical (hw/sw) environment setup
Data Modeling
ETL
OLAP or ROLAP cube design
Report Development
Query Optimization
Data Quality Assurance
Promote to Production
Maintenance / Enhancement
5
Key milestones in the early years of data warehousing:
1960: General Mills & Dartmouth College research project coins the terms DIMENSIONS and FACTS
1967: Edward Yourdon, "Real-Time Systems Design"
1970: ACNielsen and IRI provide dimensional data marts for retail sales
1979: Tom DeMarco, "Structured Analysis and System Specification"
1988: Barry Devlin and Paul Murphy publish "An architecture for a business and information system" in the IBM Systems Journal, coining the term "business data warehouse"
1991: Bill Inmon publishes "Building the Data Warehouse"
1995: The Data Warehouse Institute (TDWI) is founded (for profit)
1996: Ralph Kimball publishes "The Data Warehouse Toolkit"
2000: Wayne Eckerson, "Data Quality and the Bottom Line" report from TDWI
2004: IBM states their main competitors are Oracle and Teradata
6
The beginnings
Commercial viability came with a drop in disk storage prices.
Then came the BI vendors, the ETL vendors, the data modelers... and the database vendors fought.
7
History of Data Warehousing
Data warehouses became a distinct type of computer database during the late 1980s and early 1990s. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems:
The extra processing load of reporting reduced the response time of the operational systems
Developing reports in operational systems required writing specific SQL queries, which put a heavy load on the system
Separate databases began to be built that were specifically designed to support management information and analysis purposes.
Data warehouses were able to bring in data from a range of different sources: mainframe computers, minicomputers, personal computers, and office automation software such as spreadsheets.
Data warehouses integrate this information in a single place.
User-friendly reporting tools and freedom from operational impacts have led to the growth of data warehousing systems.
http://www.dedupe.com/history.php
8
History of Data Warehousing
As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:
Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.
Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.
The real next generation of warehousing (not yet widely done):
Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.
http://www.dedupe.com/history.php
9
Data Warehouse Architecture
The term data warehouse architecture describes the overall structure of the system.
Historical terms include decision support system (DSS) and management information system (MIS); newer terms include business intelligence competency center (BICC).
The data warehouse architecture describes the overall system components: infrastructure, data, and processes.
The infrastructure (technology stack) perspective determines the hardware and software products needed to implement the components of the system.
The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related.
The process perspective is primarily concerned with communicating the flow of data from the originating source system through the loading of the data warehouse, and often the process that client products use to access and extract data from the warehouse.
Architecture facilitates the structure, function, and interrelationships of each component.
10
Advantages to DW
Enables end-user access to a wide variety of data
Increased data consistency
Additional documentation of the data (published data models, data dictionaries)
Lower overall computing costs and increased productivity
An area to combine related data from separate sources
A flexible, easy-to-change computing infrastructure to support data changes in application systems and business structures/hierarchies
Empowers end users to perform ad hoc queries and reports without impacting the performance of the operational systems
An enabler of commercial business applications, most notably customer relationship management (CRM), e.g. through feedback loops
11
Data Integration
Data integration is the practice of combining diverse sources and giving the user a unified view of the data.
This important problem emerges in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories).
Data integration appears with increasing frequency as the volume of data, and the need to share it, explode.
It has been the focus of extensive theoretical work, and numerous open problems remain to be solved.
In practice, data integration is frequently called Enterprise Information Integration.
12
Data Warehousing Toolsets
Data modeling diagrams: ERD, etc.
Data Dictionary
ETL Tools
Database of choice: Oracle, SQL Server, DB2, Teradata, Netezza, ...
SQL and its tools
Data Validation
Bug trackers / issue trackers: Testing
13
Types of Data Warehouses
Not data marts
Operational Data Store (ODS)
Data warehouse (enterprise data warehouse, EDW)
Exploration data warehouse
Decision Support System (aka Management Information System, MIS)
14
Brief Description of Terms
Operational Systems are the internal and external core systems that support the day-to-day business operations. They are accessed through application program interfaces (APIs) and are the source of data for the data warehouse and operational data store. (Encompasses all operational systems, including ERP, relational, and legacy.)
Data Acquisition is the set of processes that capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse and operational data store. Data reengineering is the process of investigating, standardizing and providing clean consolidated data.
The Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data.
Primary Storage Management consists of the processes that manage data within and across the data warehouse and operational data store. It includes processes for backup and recovery, partitioning, summarization, aggregation, and archival and retrieval of data to and from alternative storage.
Alternative Storage is the set of devices used to cost-effectively store data warehouse and exploration warehouse data that is needed but not frequently accessed. These devices are less expensive than disks and still provide adequate performance when the data is needed.
Data Delivery is the set of processes that enable end users and their supporting IS group to build and manage views of the data warehouse within their data marts. It involves a three-step process consisting of filtering, formatting and delivering data from the data warehouse to the data marts.
The Data Mart is customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. The data mart may or may not be on the same server or location as the data warehouse.
15
Description of Terms, cont'd
The Operational Data Store (ODS) is a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise. It is the central point of data integration for business management, delivering a common view of enterprise data.
Meta Data Management is the process for managing information needed to promote data legibility, use and administration. Contents are described in terms of data about data, activity and knowledge.
The Exploration Warehouse is a DSS architectural structure whose purpose is to provide a safe haven for exploratory and ad hoc processing. An exploration warehouse utilizes data compression to provide fast response times with the ability to access the entire database.
The Data Mining Warehouse is an environment created so analysts may test their hypotheses, assertions and assumptions developed in the exploration warehouse. Specialized data mining tools containing intelligent agents are used to perform these tasks.
Activities are the events captured by the enterprise legacy and/or ERP systems as well as external transactions such as Internet interactions.
Statistical Applications are set up to perform complex, difficult statistical analyses such as exception, means, average and pattern analyses. The data warehouse is the source of data for these analyses. These applications analyze massive amounts of detailed data and require a reasonably performing environment.
Analytic Applications are pre-designed, ready-to-install decision support applications. They generally require some customization to fit the specific requirements of the enterprise. The source of data is the data warehouse. Examples of these applications are risk analysis, database marketing (CRM) analyses, vertical industry "data marts in a box," etc.
External Data is any data outside the normal data collected through an enterprise's internal applications. There can be any number of sources of external data such as demographic, credit, competitor and financial information. Generally, external data is purchased by the enterprise from a vendor of such information.
16
Bill Inmon
Recognized as the founder of the data warehouse (wrote the first book, offered the first conference with Arnie Barnett, wrote the first magazine column (IBM Journal), offered the first classes)
Created the accepted definition of what a DW is: a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management's decisions
Approach is top-down
1991: Founded Prism Solutions, later took it public
1995: Founded Pine Cone Systems, later renamed Ambeo
1999: Created the Corporate Information Factory website to educate professionals
17
http://www.inmoncif.com/library/cif/
18
Ralph Kimball
One of the original architects of data warehousing. A DW must be understandable and FAST.
Developed Dimensional Modeling (the Kimball method), the standard in decision support.
Bottom-up approach.
1986: Founded Red Brick Systems (used indexes for performance gains); later acquired by Informix, now owned by IBM.
Co-inventor of the Xerox Star workstation (the first commercial product to use mice, icons, and windows).
19
Data Models
Provide definition and format of data
Represent information areas of interest, or Subject Areas
Modeling methodologies:
Bottom-up model design: start with existing structures
Top-down model design: created fresh (by SMEs) as a reference point/template
20
Data Normalization: what is it?
Normalization is a relational database modeling process where the relations or tables are progressively decomposed into smaller relations, to the point where all attributes in a relation are very tightly coupled with the primary key of the relation. Most data modelers try to achieve Third Normal Form with all of the relations before they denormalize for performance, ease of query, or other reasons.
First Normal Form: A relation is said to be in First Normal Form if it describes a single entity and it contains no arrays or repeating attributes. For example, an order table or relation with multiple line items would not be in First Normal Form because it would have repeating sets of attributes for each line item. The relational theory would call for separate tables for order and line items.
Second Normal Form: A relation is said to be in Second Normal Form if in addition to the First Normal Form properties, all attributes are fully dependent on the primary key for the relation.
Third Normal Form: A relation is in Third Normal Form if, in addition to Second Normal Form, all non-key attributes are completely independent of each other.
http://www.sserve.com/ftp/dwintro.doc
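The First Normal Form example above (splitting an order's repeating line items into their own table) can be sketched with SQLite; the table and column names here are illustrative, not from the slides:

```python
import sqlite3

# Hypothetical order data: a denormalized layout would repeat line-item
# columns (item1, qty1, item2, qty2, ...) in a single order row.
# First Normal Form splits the repeating group into its own table,
# keyed back to the order by ORDER_ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDERS (
        ORDER_ID INTEGER PRIMARY KEY,
        CUSTOMER TEXT,
        ORDER_DT TEXT
    );
    CREATE TABLE ORDER_LINE (
        ORDER_ID INTEGER REFERENCES ORDERS(ORDER_ID),
        LINE_NUM INTEGER,
        PRODUCT  TEXT,
        QTY      INTEGER,
        PRIMARY KEY (ORDER_ID, LINE_NUM)
    );
""")
conn.execute("INSERT INTO ORDERS VALUES (1, 'Acme', '2010-06-01')")
conn.executemany("INSERT INTO ORDER_LINE VALUES (?, ?, ?, ?)",
                 [(1, 1, 'Widget', 10), (1, 2, 'Gadget', 3)])

# One order, many line items: no repeating attributes in either table.
rows = conn.execute("""
    SELECT O.CUSTOMER, L.PRODUCT, L.QTY
    FROM ORDERS O JOIN ORDER_LINE L ON O.ORDER_ID = L.ORDER_ID
    ORDER BY L.LINE_NUM
""").fetchall()
print(rows)   # [('Acme', 'Widget', 10), ('Acme', 'Gadget', 3)]
```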
21
Entity Relationship Diagram: example (3rd normal form)
[ERD figure: a 3rd-normal-form loan data model. The fact tables LOAN_FACT and SUPER_REPLINE_FACT are keyed by LOAN_ID / SUPER_REPLINE_ID plus AS_OF_DT; the LOAN table carries origination attributes. Supporting tables include ARM_INDEX, DELQ_STAGE, RATE_CHANGE_PERIOD, MATURITY_PERIOD, COLLATERAL, DOC_TYPE, LOAN_CONVERSION, LOAN_DESIGNATION, MODIFICATION_LOANS, PIF_LOANS, PRODUCT, PURPOSE, RATE_CODE, REO_LOANS, RISK_RATINGS, SHORT_SALE_LOANS, STATE, and SUPER_REPLINE, each with a surrogate primary key and, for the lookup tables, a code and description. Full column lists omitted.]
22
Star Schema (facts and dimensions)
The facts that the data warehouse helps analyze are classified along different dimensions:
The FACT table houses the main data and includes a large amount of aggregated data (e.g. price, units sold)
DIMENSION tables off the FACT include attributes that describe the FACT
Star schemas provide simplicity for users
23
Star Schema example (Sales db)
24
SQL to select from Star Schema
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact.Sales F
JOIN Dim.Date D ON F.Date_FK = D.Date_PK
JOIN Dim.Store S ON F.Store_FK = S.Store_PK
JOIN Dim.Product P ON F.Product_FK = P.Product_PK
WHERE D.[Year] = 2010 AND P.Product_Category = 'TV'
GROUP BY P.Brand, S.Country
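A query like this can be tried end to end with SQLite (via Python's built-in sqlite3). The underscore table names and toy rows below are illustrative, adapted from the slide's schema:

```python
import sqlite3

# Minimal star schema: one fact table keyed directly to three dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dim_Date    (Date_PK INTEGER PRIMARY KEY, Year INTEGER);
    CREATE TABLE Dim_Store   (Store_PK INTEGER PRIMARY KEY, Country TEXT);
    CREATE TABLE Dim_Product (Product_PK INTEGER PRIMARY KEY,
                              Brand TEXT, Product_Category TEXT);
    CREATE TABLE Fact_Sales  (Date_FK INTEGER, Store_FK INTEGER,
                              Product_FK INTEGER, Units_Sold INTEGER);
""")
conn.execute("INSERT INTO Dim_Date VALUES (1, 2010)")
conn.execute("INSERT INTO Dim_Store VALUES (1, 'USA')")
conn.execute("INSERT INTO Dim_Product VALUES (1, 'Acme', 'TV')")
conn.executemany("INSERT INTO Fact_Sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 5), (1, 1, 1, 7)])

# The star query: every dimension joins straight to the fact table.
rows = conn.execute("""
    SELECT P.Brand, S.Country, SUM(F.Units_Sold)
    FROM Fact_Sales F
    JOIN Dim_Date    D ON F.Date_FK    = D.Date_PK
    JOIN Dim_Store   S ON F.Store_FK   = S.Store_PK
    JOIN Dim_Product P ON F.Product_FK = P.Product_PK
    WHERE D.Year = 2010 AND P.Product_Category = 'TV'
    GROUP BY P.Brand, S.Country
""").fetchall()
print(rows)   # [('Acme', 'USA', 12)]
```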
25
SnowFlake Schema
Central FACT connected to multiple DIMENSIONS, which are NORMALIZED into related tables
Snowflaking affects DIMs and never the FACT
Used in data warehouses and data marts when storage efficiency is more important than speed/ease of data selection
Needed for many BI OLAP tools
Stores less data
26
Snowflake Schema example (Sales db)
27
SQL to select from SnowFlake
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F (NOLOCK)
INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id
INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country
28
Comparison of SQL Star vs SnowFlake
Star:
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact.Sales F
JOIN Dim.Date D ON F.Date_FK = D.Date_PK
JOIN Dim.Store S ON F.Store_FK = S.Store_PK
JOIN Dim.Product P ON F.Product_FK = P.Product_PK
WHERE D.[Year] = 2010 AND P.Product_Category = 'TV'
GROUP BY P.Brand, S.Country

SnowFlake:
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F (NOLOCK)
INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id
INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country
29
Basic EDW Data Model Design
Party
Account
Product & Service
Event
Each represents a subject area in the model, with third-normal-form tables to accommodate the data and its relationships and hierarchy
30
Account, Customer & Address Relationships
[Diagram: Account, Party, and Address entities, with Account Contact, Account Party link, and Party Address link associative tables]
Account Information loaded from ALL Source Systems
ETL process builds the relationship between Accounts and Customers (Party) based on the relationship file from CUSTOMER CRM SYSTEM
31
Architecture for an EDW or other large Data Warehouse
How do you get from where you are to implementing an actual system?
Start with defining your requirements
Then modeling
Budget $$$
Hire staff
Engage partners
DO IT YOURSELF, DO NOT RELY ON THE EXPERTS: staff-augment and hire the talent internally
32
BREAK --- Modeling exercise
1. ACCOUNT TEAM
2. CUSTOMER/PARTY TEAM
3. PRODUCT TEAM
4. EVENT (TRANSACTION) TEAM
5. CRM TEAM (who want information about customers to be able to market to them and to know what new products customers want)
6. EXECUTIVE TEAM, who want information about how the business is doing (what is selling, what is not, what is profitable...)
DIVIDE INTO THE ABOVE 6 TEAMS.
ASK EACH TEAM TO BRAINSTORM WHAT INFORMATION THEY NEED.
SEPARATELY, ASK 1 LEADER FROM EACH TEAM TO PRESENT THEIR DATA NEEDS: IN A DIAGRAM FOR TEAMS 1-4, AND A LIST/SPREADSHEET FOR TEAMS 5 AND 6.
DISCUSS HOW THE NEEDS OF THE CRM AND EXECUTIVE TEAMS CAN BE MET FROM THE DIAGRAMMED DATA.
33
Some Interesting Info:
http://www.itquestionbank.com/an-introduction-to-data-warehousing.html
http://www.ralphkimball.com/ http://www.inmoncif.com/home/
34
SECTION 2
Data Warehouse Architecture
35
Where we are / Next steps
Have identified data needs
Designed a model to fit those needs
Now we need to identify how we will set up the architecture:
Physical hardware
Software
People: who do we need on the project team
Processes
36
State of many mature companies
37
[Diagram: EDW process state. Source systems (including CPS, MANTAS, CRDB, MKTG, FIN, SALES, AFS, and others) feed a Staging Area via data profiling, data cleansing, and sync & sort steps; the EDW feeds data marts (DM) and BI, underpinned by Metadata, Data Governance, and Data Management.]
38
Information Factory Concept
39
Moore’s Law (yes, it applies here too)
Sharply increasing power of computer hardware
With the increase in power comes a decrease in price (the capacity of a microprocessor doubles roughly every 18 months); this also holds true for other computer components
Desktop power is increasing, as are server power requirements (where GO GREEN comes from)
40
Explosion in innovation
BI software can now be deployed on an intranet vs hard-to-maintain thick-client apps (thick clients are still used by developers)
Web server, application server, database server: allows offloading of processing to the correct tier
More power for everyone
41
Change in Business
The global economy changed the needs of organizations worldwide
Global markets and mergers and acquisitions all increase data needs
More tech-savvy end users (demand more data, more tools...)
More information-demanding executives (facilitates sponsorship of the DW)
42
DW Evolving
Care should be taken, e.g. with vendor claims
Size is not a factor
Operational vs informational: operational is pre-defined; informational is more ad hoc in nature
Performance
Volatile vs non-volatile data: the DW saves data for longer periods than transactional/operational systems (trending analysis, where I was vs where I am...)
Real-time DW vs point-in-time
43
The DW needs to be extensible and align with business structure
[Figure 4. Extensible data warehouse: Orders and Product subject areas feed the Enterprise Data Warehouse, with room for Future sources.]
•Set up a framework for the Enterprise data warehouse
•Start with a few of the most valuable source applications
•Add additional applications as the business case can be made
http://www.sserve.com/ftp/dwintro.doc
44
Data Marts and OLAP
[Diagram: Enterprise Data Solution. Source Systems feed the Enterprise Data Warehouse and a Master / Reference Data Store (maintained by a Master Data Management application); the warehouse serves Reporting, Data Mining, OLAP Analysis, Dashboards, and Scorecards.]
45
EDW - Objective
[Diagram: Source Files → CDC/SyncSort Process → Staging Source Data (Delta) → ETL Load 1 → EDW Target → ETL Load 2 → Data marts]
Follow the process methodology to achieve these architectural aspects: Metadata, Security, Scalability, Reliability and Supportability
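The CDC (change data capture) step ahead of staging produces only the delta: records that were inserted, changed, or removed since the last extract. A minimal sketch of that comparison; the keyed-snapshot record layout is a hypothetical stand-in for the real SyncSort file compare:

```python
# Sketch of the CDC step: compare today's source extract against
# yesterday's snapshot to produce a delta for the staging load.
# The account keys and attribute dicts below are hypothetical.

def cdc_delta(previous, current):
    """Return (inserts, updates, deletes) between two keyed snapshots."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = {k: v for k, v in previous.items() if k not in current}
    return inserts, updates, deletes

yesterday = {"ACCT-1": {"bal": 100}, "ACCT-2": {"bal": 250}}
today     = {"ACCT-1": {"bal": 100}, "ACCT-2": {"bal": 300},
             "ACCT-3": {"bal": 75}}

ins, upd, dele = cdc_delta(yesterday, today)
print(ins)    # {'ACCT-3': {'bal': 75}}
print(upd)    # {'ACCT-2': {'bal': 300}}
print(dele)   # {}
```

Only `ins` and `upd` (and delete markers) move forward to ETL Load 1, which keeps the staging load small relative to a full refresh.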
46
EDW – Data Model Design
Party
Account
Product & Service
Event
Each represents a subject area we have in the model, with third-normal-form tables to accommodate the data and its relationships and hierarchy
47
Account, Customer & Address Relationships
[Diagram: Account, Party, and Address entities, with Account Contact, Account Party link, and Party Address link associative tables]
Account Information loaded from ALL Source Systems
Customer Information Loaded
EDW ETL process builds the relationship between Account and Customers based on the relationship file from RM
48
Single definition of a data element
The DW brings in data from multiple sources and conforms it so that it can be viewed together
Multiple systems have their own customers/addresses, but the warehouse gives a single view of the customer and all the systems they are in
Helping move from product-centric systems to customer-centric systems
49
Business view of data
A DW is only successful if it provides the view the business needs of its data.
"A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis."
Vivek R. Gupta, Senior Consultant, System Services Corporation, Chicago, Illinois ([email protected])
http://www.system-services.com
50
Example of conforming data for business view:
Figure 8. Physical transformation of application data
•Uniform business terms
•Single physical definition of an attribute
•Consistent use of entity attributes
•Default and missing values
[Diagram: Operational Systems A and B feed a Transformation step into the Data Warehouse System's detailed and summarized data. Example rules:]
cust, cust_id, borrower >> customer ID
"1" >> "M"
"2" >> "F"
Missing >> "........"
http://www.sserve.com/ftp/dwintro.doc
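The conforming rules in the figure (rename source-specific column names to one business term, recode coded values, default what's missing) can be sketched as a small transform. The source layouts and the "UNKNOWN" default are assumptions for illustration:

```python
# Conform records from different operational systems to one business view.
# COLUMN_MAP and GENDER_MAP follow the figure's rules; the "UNKNOWN"
# default and the record layouts are assumptions.

COLUMN_MAP = {"cust": "customer_id", "cust_id": "customer_id",
              "borrower": "customer_id"}
GENDER_MAP = {"1": "M", "2": "F"}

def conform(record):
    out = {}
    for col, val in record.items():
        # Rename source-specific columns to the single business term.
        out[COLUMN_MAP.get(col, col)] = val
    # Recode gender codes and default anything missing, per the figure.
    out["gender"] = GENDER_MAP.get(out.get("gender"), "UNKNOWN")
    return out

print(conform({"cust": 42, "gender": "1"}))
# {'customer_id': 42, 'gender': 'M'}
print(conform({"borrower": 7}))
# {'customer_id': 7, 'gender': 'UNKNOWN'}
```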
51
Business use of DW
The business should use a data mart created off the data warehouse
Business users want to use existing tools/methods (replicate queries, Excel, extract to Access) against the DW and validate the data between the existing systems and the DW
Over time the LoB gains confidence in the DW, and then begins to explore new possibilities of data use and tool use
52
EDW – Process Flow
[Diagram: Source Systems (mainframe) deliver Source Files to a Landing Zone (Source Data Layer); the SyncSort server runs the CDC process to produce the Source Data (Delta); the Informatica server runs ETL from source to a Staging Schema, from Stage to the EDW (Oracle DB server), and from the EDW to the Data Mart (FDM) on a second Oracle DB server. Layers: Source Data Layer / Landing Zone, File Processing / ETL Layer, Data Layer.]
53
EDW ETL Design
Source to Stage Mapping (For AFS)
Stage to EDW Mapping (for AFS)
EDW to FDM Mapping (for FACT)
54
ETL Tools are prolific
Ab Initio
DMExpress 6.5 – SyncSort
Oracle Warehouse Builder (OWB) 11gR1 – Oracle
Data Integrator & Data Services XI 3.0 – SAP Business Objects
IBM Information Server (DataStage) 8.1 – IBM
SAS Data Integration Studio 4.2 – SAS Institute
PowerCenter 9.0 – Informatica
Elixir Repertoire 7.2.2 – Elixir
Data Migrator 7.6 – Information Builders
SQL Server Integration Services 10 – Microsoft
Talend Open Studio & Integration Suite 4.0 – Talend
DataFlow Manager 6.5 – Pitney Bowes Business Insight
Data Integrator 9.2 – Pervasive
Open Text Integration Center 7.1 – Open Text
Transformation Manager 5.2.2 – ETL Solutions Ltd.
Data Manager/Decision Stream 8.2 – IBM (Cognos)
Clover ETL 2.9.2 – Javlin
ETL4ALL 4.2 – IKAN
DB2 Warehouse Edition 9.1 – IBM
Pentaho Data Integration 3.0 – Pentaho
Adeptia Integration Suite 5.1 – Adeptia
Expressor
SeeBeyond ETL integrator – Sun
55
Commonly used toolsets:
Commercial ETL tools:
IBM InfoSphere DataStage
Informatica PowerCenter
Oracle Warehouse Builder (OWB)
Oracle Data Integrator (ODI)
SAS ETL Studio
Business Objects Data Integrator (BODI)
Microsoft SQL Server Integration Services (SSIS)
Ab Initio
Freeware, open source ETL tools:
Pentaho Data Integration (Kettle)
Talend Integrator Suite
CloverETL
Jasper ETL
56
ETL: Extract, Transform, Load
Created to improve and facilitate data warehousing
EXTRACT: data brought in from external sources
TRANSFORM: data fit to standards
LOAD: converted data loaded into the target DW
Steps: initiate, build reference data, extract from sources, validate, transform, load into staging tables, audit reports, publish, archive cleanup
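The extract, validate, transform, and load steps above can be sketched as one pipeline. Every stage here is a stub with hypothetical record layouts; real jobs would hit the source systems and the warehouse database:

```python
# Minimal ETL pipeline sketch. The field names and the validation and
# transform rules are assumptions for illustration.

def extract(sources):
    """EXTRACT: pull raw rows from external sources into one stream."""
    return [row for src in sources for row in src]

def validate(rows):
    """Reject rows that fail basic checks (here: a required 'id')."""
    return [r for r in rows if r.get("id") is not None]

def transform(rows):
    """TRANSFORM: fit data to warehouse standards (trim, uppercase)."""
    return [{**r, "name": r["name"].strip().upper()} for r in rows]

def load(rows, staging):
    """LOAD: write converted rows into the staging table."""
    staging.extend(rows)
    return len(rows)

source_a = [{"id": 1, "name": " alice "}, {"id": None, "name": "bad"}]
source_b = [{"id": 2, "name": "bob"}]

staging_table = []
loaded = load(transform(validate(extract([source_a, source_b]))),
              staging_table)
print(loaded)         # 2
print(staging_table)  # [{'id': 1, 'name': 'ALICE'}, {'id': 2, 'name': 'BOB'}]
```

Audit reports, publish, and archive cleanup would follow the load as separate jobs.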
57
Reconciliation Overview (EDW-data mart)
[Diagram: EDW load process and data mart load process (CDC → ETL → Staging Source Data (Delta) → EDW → FDM/FDBR), with validation checkpoints:]
Checkpoint 1: validate measures between Balance files and Source Data Files
Checkpoint 2: validate measures between Stage tables and Source Data Files
Checkpoint 3: validate measures between Balance files and EDW tables
Checkpoint 4: validate measures between EDW tables and mart tables
Also: validate measures between EDW tables and the General Ledger (TM1), via extract views
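Each checkpoint is the same idea: a control total from one side of the load compared against the sum actually landed on the other side. A sketch of one such check; the measure name, tolerance, and row layout are assumptions:

```python
# One reconciliation checkpoint: compare a control total from the
# balance file against the total actually loaded into an EDW table.
# The CURR_UPB measure name and the 0.01 tolerance are assumptions.

def reconcile(balance_total, loaded_rows, measure, tolerance=0.01):
    """Return (passed, difference) for one checkpoint."""
    loaded_total = sum(r[measure] for r in loaded_rows)
    diff = abs(balance_total - loaded_total)
    return diff <= tolerance, diff

# The balance file says the day's unpaid principal should total 1,250.00.
edw_rows = [{"CURR_UPB": 500.00}, {"CURR_UPB": 750.00}]
ok, diff = reconcile(1250.00, edw_rows, "CURR_UPB")
print(ok, diff)   # True 0.0
```

A failed checkpoint would halt the downstream load rather than let a bad total propagate into the marts.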
58
EDW Data Flow
File Transfer
·All source data files should originate from the server/host
·File transfers should be set up as jobs via CA-Scheduler
File Maintenance
·Both Input and Output mount points are shared between the SyncSort and Informatica servers
·Files should land on the Input mount point, with a job to archive them (if needed)
Security
·FTP login accounts are/will be enabled only from certain hosts/servers
·Separate folders under Input/Output/Archive need to be maintained for isolation
Capacity
·Need a capacity estimate for any new feeds into the Landing Zone
DB Schemas (various schemas represent different logical groups of data/processes in the EDW)
·EDW_OWNER – all EDW objects
·EDW_STG – staging/temp objects for ETL loads
·ETL_CNTL – ETL process control objects
·FDM_OWNER – data mart related objects
·INF – Informatica repository
·MM_OWNER – ModelMart repository
[Diagram: the mainframe (MVSPROD) and other sources FTP source files to a shared mount point (file landing zone /input); SyncSort (edwss1) runs CDC, archiving files to /ssiarc and writing CDC output to /output; Informatica (edwetl1) runs ETL into Oracle, with the EDW and INF schemas on edwora3/4 and the FDM schema on edwora1/2 (ASM deployed); output/reports land in /edw_output.]
59
EDW – Security Scheme
Database users/schemas:
EDW_OWNER – all EDW tables/objects
EDW_STG – all staging/temp tables and objects, for ETL and batch jobs
EDW_JOBS – ETL login account with read/write access to both staging and EDW tables
EDW_USER – end-user/reporting login account with read-only access to all EDW tables
Unix users:
App Admin – application-server admin accounts (info – Informatica, oracle – Oracle DB, ssort – SyncSort)
App Execution – logins to execute the jobs (shell and other jobs): info_work/etl_jobs – Informatica, ss_jobs – SyncSort
Developer – individual developer access
App Support – production support logins for the support group, with read-only access to all applications and logs; can only change parameters (info_sup – Informatica support, sort_sup – SyncSort support)
60
EDW - Infrastructure (Development and Production Environments)
[Diagram: development and production environments, each with Informatica ETL, Oracle (ORA1–ORA4), SyncSort, VIO, and NIM hosts on dedicated EDW DB and APP VLANs, connected through the corporate network to the mainframe, backup, AD/mail, NAS, and the Unix/WS admin, DBA, and developer workstations. Firewall rules open only the TCP/UDP ports each service requires, e.g. FTP (20/21), SSH (22), DNS (53), NTP (123), LDAP (389), HTTPS (443), Oracle (1521), plus NFS, NIM, TSM, OEM, NetIQ, and SOX monitoring traffic.]
61
EDW – Development (SDLC): Initiate, Plan, Design, Build, Test, Implement
Initial Scope
Source Data Analysis
Initial Estimate
Final Scope
Work Plan
Architecture & Design
Source/Target Mapping
Report/Data Requirements
Data Modeling
Test, Integration and Deployment Plan
Transition and Support Plan
Finalize Source/Target Mapping
Build ETL and other processes
Unit test cases and results
Defect fixes and support
Functional test cases and results
Integration test cases and results
Data validation
Load / Performance Testing
Auto processing (CA-Scheduler)
UAT and Sign-off
Data creation/Setup
Production Migration
User Training
Support Documentation and training
62
EDW Development Project Cycle (New Source to EDW)
Phases: System Design, Build, Test, Implement
Initial Scope
Source Data Analysis
Initial Estimate
Final Scope
Work Plan
Architecture & Design
Source/Target Mapping
Report/Data Requirements
Data Modeling
Test, Integration and Deployment Plan
Transition and Support Plan
Finalize Source/Target Mapping
Build ETL and other processes
Unit test cases and results
Defect fixes and support
Functional test cases and results
Operations Procedure testing and review
Data validation
Load / Performance Testing
Integration Testing (CA-Scheduler)
UAT and Sign-off
Initial Data creation/Setup
Production Migration
User Training
Support Documentation and training
Requirement Specs (Initiation)
Project Sponsor Approval
Management Approval
Peer review
IT Approval
IT / Business Distribution
Peer review
IT Approval
Peer / Lead review
Performance, Capacity and Guidelines for project
Groups involved in the various phases of the project: Business Users, Project Management, IT Planning & Systems, Data Analyst Team, EDW Development Team, Operations & Support
Major Tasks and Deliverables for the project
63
EDW – Support – Escalation Procedure
CA - Scheduler
Mail Notification
User Call Help Desk Alert
Ops (Preliminary Checks?)
Unix Admin
Oracle Admin
Informatica Admin
ETL Related Issue ?
Database Related Issue ? Storage Admin
Hardware Admin
IBM Support
EMC Support
Oracle Support
Informatica Support
EDW Support
Issue
NetIQ Alerts
IBM Support
64
EDW – Support Process
Issue(Operational Error)
SLA
Process Management
Guidelines
Known DB Issues
Known Application Server Issues
Known OS issues
Input Issue Management Process Output
Record & Classify Issue
Open Remedy Ticket
RFC Required ?
Change Management
Investigation and Diagnosis
Resolution Found ?
RFC Required ?
Change Management
Provision for Temporary Solution to continue the Process
Record the nature of the issue and update the User group (& BA)
Update/Close Remedy Ticket
Provide Solution and continue the Process
65
EDW - Roadmap
EDW (Accounts and Customers)
Multiple Source System Financial Mart
Master Data Management
mart1
Risk Mart
mart2
Transaction
Customer Analytics
Source System
mart3
Management Architecture (Metadata, Data Security, Systems Management)
66
Architecture Exercise (1 of 2)
Identify needs in the following categories:
Physical hardware: CPU, Memory, Disk
For disk – how much? Use your model and calculate the size needed for the database & tools
Will data be behind a firewall?
Software: Database, ETL, BI tools, Application and web server
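Sizing from the model can be a back-of-the-envelope calculation: rows times average row width per table, plus multipliers for indexes, work space, and growth. All figures and factor values below are illustrative assumptions, not from the slides:

```python
# Hypothetical warehouse sizing: table row counts and average row
# widths from the data model, with multipliers for indexes,
# staging/temp space, and a few years of growth headroom.

tables = {  # table -> (rows, avg_row_bytes) -- illustrative numbers
    "sales_fact": (500_000_000, 120),
    "customer_dim": (10_000_000, 400),
    "product_dim": (200_000, 600),
}
INDEX_FACTOR = 1.5      # indexes add roughly half again the data size
WORK_FACTOR = 1.3       # staging, temp, and sort space
GROWTH_YEARS = 3
ANNUAL_GROWTH = 0.25    # 25% data growth per year

raw_bytes = sum(rows * width for rows, width in tables.values())
sized = raw_bytes * INDEX_FACTOR * WORK_FACTOR * (1 + ANNUAL_GROWTH) ** GROWTH_YEARS
sized_tb = sized / 1024**4
```

The multipliers are the part worth debating with your DBA; raw data size alone routinely understates the disk you actually need by a factor of two or more.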
67
Architecture Exercise (2 of 2)
Break into the former 6 teams
Ask each team to consider what they will need to build a DW: Hardware, Software, People, Processes, Support and Operations
Allow 30 minutes to brainstorm, then discuss as a class
A volunteer team presents what they came up with; needs are listed on the board, and each progressive team adds what they have to the list
Group discussion on what they have uncovered
68
SECTION 3
What is Data Quality? I can’t tell you what’s important, but your users can.
Look for the fields that can identify potential problems with the data.
What is Master Data Management (MDM)?
69
Data Quality
Data doesn’t stay the same (sometimes it does)
Considerations: What happens to the warehouse when the data changes? When needs change?
70
Roadmap to DQ
Data profiling
Establishing metrics/measures
Design and implement the rules
Deploy the plan
Review errors/exceptions
Monitor the results
71
Data Profiling
What’s in the data? Analyze the columns in the tables
Provides metadata
Allows for good specifications for programmers
Reduces project risk (as data is now known)
How many rows, number of distinct values in a column, how many nulls, data type identification
Shows the data pattern
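The bullets above reduce to a small piece of code. A minimal column-profiling sketch (the sample data is illustrative):

```python
# Profile each column of a record set: row count, distinct values,
# null count, and the Python types observed in the non-null values.

def profile(rows):
    """Profile a list of dicts; returns per-column stats."""
    stats = {}
    columns = {k for r in rows for k in r}
    for col in columns:
        values = [r.get(col) for r in rows]
        nonnull = [v for v in values if v is not None]
        stats[col] = {
            "rows": len(values),
            "distinct": len(set(nonnull)),
            "nulls": values.count(None),
            "types": sorted({type(v).__name__ for v in nonnull}),
        }
    return stats

data = [
    {"zip": "35244", "age": 34},
    {"zip": "3524B", "age": None},
    {"zip": "35244", "age": 41},
]
report = profile(data)
```

Even this toy report surfaces the profiling payoffs named above: the null in `age` and the distinct values of `zip` (one of which, 3524B, looks suspect) become visible before any ETL is written.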
72
Data Profiling Example
73
Data Quality is measured as the degree of superiority, or excellence, of the various data that we use to create information products.
“Reason #1 for the failure of CRM projects: Data is ignored. Enterprise must have a detailed understanding of the quality of their data. How to clean it up, how to keep it clean, where to source it, and what 3rd-party data is required. Action item: Have a data quality strategy. Devote ½ of the total timeline of the CRM project to data elements.” - Gartner
74
Data Quality Tools (Gartner Magic Quadrant)
75
Dimensions of Quality
Informatica.com
76
Data Quality Measures
Definition, Accuracy, Completeness, Coverage, Timeliness, Validity
77
Definition
Conformance: The degree to which data values are consistent with their agreed upon definitions.
A detailed definition must first exist before this can be measured.
Information quality begins with a comprehensive understanding of the data inventory. The information about the data is as important as the data itself.
A Data Dictionary must exist! An organized, authoritative collection of attributes is equivalent to the old “Card Catalog” in a library, or the “Parts and List Description” section of an inventory system. It must contain all the known usage rules and an acceptable list of values. All known caveats and anomalies must be described.
78
Accuracy
The degree to which a piece of data is correct and believable. The value can be compared to the original source for correctness, but it can still be unbelievable. Conformed values can be compared to lists of reference values.
Zip code 35244 is correct and believable.
Zip code 3524B is incorrect and unbelievable.
Zip code 35290 is incorrect but believable (it looks right, but does not exist).
AL is a correct and believable state code (compared to the list of valid state codes).
A1 is an incorrect and unbelievable state code (compared to the list of valid state codes).
AA is an incorrect but believable state code (compared to the list of valid state codes).
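The correct/believable/unbelievable distinction above can be coded as a reference-list lookup plus a format check. A sketch, with deliberately abbreviated reference lists (real ones would come from authoritative sources):

```python
# Classify values against a reference list: "correct" if on the list,
# "believable" if well-formed but absent, "unbelievable" otherwise.

VALID_STATES = {"AL", "AK", "AZ"}    # abbreviated reference list
VALID_ZIPS = {"35244", "48108"}      # abbreviated reference list

def classify_state(code):
    if code in VALID_STATES:
        return "correct"
    # looks like a state code (two letters) but isn't on the list
    return "believable" if code.isalpha() and len(code) == 2 else "unbelievable"

def classify_zip(z):
    if z in VALID_ZIPS:
        return "correct"
    return "believable" if z.isdigit() and len(z) == 5 else "unbelievable"
```

The "believable" class is the dangerous one: those values pass format edits and only a reference-list comparison catches them.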
79
Completeness
The degree to which all information expected is received. This is measured in two ways:
Do we have all the records that were sent to us? Counts from the provider can be compared against counts of data received.
Did the provider send us all the records that they have, or just some of them? This is difficult to measure without auditing and trending the source. How would we know that the provider had a ‘glitch’ in their system and records were missing from our feed?
80
Measures of Completeness
The following questions can be answered for counts:
How many records per batch by provider?
How do this batch’s counts compare to the previous month’s average?
How do this batch’s counts compare to the same time period last year?
How do this batch’s counts compare to a 12-month average?
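Those comparisons can be sketched in a few lines. The monthly history and the 20% threshold below are illustrative assumptions:

```python
# Trend a batch's record count against history: prior month, same
# month last year, and a trailing 12-month average.

history = [1000, 1020, 980, 1010, 990, 1005,     # monthly record counts,
           1015, 995, 1025, 1000, 1010, 1020]    # oldest first
this_batch = 700

prior_month = history[-1]
same_month_last_year = history[0]
twelve_month_avg = sum(history) / len(history)

# Flag the batch if it deviates more than 20% from the trailing average.
deviation = abs(this_batch - twelve_month_avg) / twelve_month_avg
suspect = deviation > 0.20
```

This is exactly the kind of check that catches a provider-side "glitch": the batch arrives, loads cleanly, and only the count trend reveals that records are missing.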
81
Coverage
The degree to which all fields are populated with data. Columns of data can be measured for % of missing values and compared to the expected % missing. E.g., Sale Type Code is expected to be populated 100% by all sources for sales documents.
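The coverage measure reduces to a populated-percentage per column compared against an expected threshold. A sketch (column names, thresholds, and sample rows are illustrative):

```python
# Measure coverage: fraction of rows where a column is populated,
# compared against the expected population rate per column.

def coverage(rows, column):
    """Fraction of rows where the column is populated (non-null, non-empty)."""
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)

expected = {"sale_type_code": 1.0, "promo_code": 0.4}
docs = [
    {"sale_type_code": "S", "promo_code": "P1"},
    {"sale_type_code": "R", "promo_code": None},
    {"sale_type_code": "", "promo_code": None},
    {"sale_type_code": "P2", "promo_code": "P2"},
]
failures = {col: coverage(docs, col)
            for col, pct in expected.items()
            if coverage(docs, col) < pct}
```

Note that the thresholds differ by column: a mandatory code must hit 100%, while an optional field only needs to stay near its historical fill rate.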
82
Timeliness
The degree to which provider files are received, processed, and made available for assembly into data marts. Expected receipt times are compared to actual receipt times.
Late or missing files are flagged and reported on.
Proactive alerts trigger communication with the provider contact.
Proactive communication can alert the assembly processes.
Excessive lag times can be reported to providers in order to request earlier delivery.
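Comparing expected to actual receipt times is a simple schedule check. A sketch, with an illustrative file schedule and grace period:

```python
# Flag late and missing provider files by comparing an expected
# delivery schedule against actual receipt times.
from datetime import datetime, timedelta

expected = {  # file -> expected receipt time (illustrative schedule)
    "sales.dat": datetime(2024, 3, 1, 6, 0),
    "returns.dat": datetime(2024, 3, 1, 6, 0),
    "inventory.dat": datetime(2024, 3, 1, 7, 0),
}
received = {  # file -> actual receipt time (absent = not received)
    "sales.dat": datetime(2024, 3, 1, 5, 45),
    "returns.dat": datetime(2024, 3, 1, 9, 30),
}

GRACE = timedelta(hours=1)
late = [f for f, t in expected.items()
        if f in received and received[f] > t + GRACE]
missing = [f for f in expected if f not in received]
```

The `late` and `missing` lists are what drive the proactive alerts described above, before downstream assembly jobs stall waiting on a feed.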
83
Validity
The degree to which the relationships between different data are valid. Zip code 48108 is accurate. State code AL is accurate. But zip code 48108 is invalid for the state of AL.
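The point above is that each field can be accurate on its own while the combination is invalid, so validity needs a cross-field rule. A sketch with an abbreviated (illustrative) zip-to-state map:

```python
# Cross-field validity: a zip code and state code may each be accurate
# individually, yet the combination can still be invalid.

ZIP_TO_STATE = {"35244": "AL", "48108": "MI"}   # abbreviated reference map

def valid_combination(zip_code, state):
    """True when the zip code actually belongs to the given state."""
    return ZIP_TO_STATE.get(zip_code) == state
```

Single-column edits would pass both values in the failing case; only the relationship check catches the mismatch.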
84
Data Quality Measures
How do you know if your data is of high quality? Agree upon the measures that are important to the organization and consistently report them out.
Use the data measures to communicate and inform.
85
Measurement
Informatica.com
86
Exercise: Changing the Data Warehouse (1 of 2)
So, you need to add a new source
Or, you need to receive additional data from an existing source
Could be that data quality is an issue
Could be that the business rules weren’t defined adequately
87
Brainstorming Group Exercise
(2 of 2)
The data changed due to DQ measures – what do we have to do in the DW?
What has to change?
Estimate the change
Implement the change
How do we make sure it doesn’t happen again? What DQ measure can help?
88
MDM – Master Data Management
The newest ‘buzz word’
89
Exercise:
What processes need to be put in place for MDM?
Who needs to be involved?
Who owns it?
90
SECTION 4
BI Tools, BICC, Jobs, Certifications
91
SECTION 4
What is business intelligence?
What are BI tools?
What is a business intelligence competency center (BICC)?
What jobs are available?
Certifications
92
BI Tools
93
BICC
94
Jobs in Data Warehousing
95
Certifications in DW
96
References
Data Management and Integration Topic, Gartner, http://www.gartner.com/it/products/research/asset_137953_2395.jsp
Articles: Key Issues for Implementing an Enterprise wide Data Quality Improvement Project, 2008, Key Issues for Enterprise Information Management Initiatives, 2008, Key Issues for Establishing Information Governance Policies, Processes and Organization, 2008
Data Quality Management, The Most Critical Initiative You Can Implement, J. G. Geiger, http://www2.sas.com/proceedings/sugi29/098-29.pdf
Information Management, How to Measure and Monitor the Quality of Master Data, http://www.information-management.com/issues/2007_58/master_data_management_mdm_quality-10015358-1.html?ET=informationmgmt:e963:2046487a:&st=email
Data Management Assn of Michigan Bits & Bytes, Critical Data Quality Controls, D Jeffries, Fall 2006 http://dama-michigan.org/2%20Newsletter.pdf