building an effective data warehouse architecture
DESCRIPTION
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.TRANSCRIPT
October 15-18, 2013Charlotte, NC
Building an Effective Data Warehouse Architecture
James Serra, Business Intelligence ConsultantSolidQ
October 15-18, 2013 | Charlotte, NC
Please silence cell phones
3
Explore Everything PASS Has to Offer
Free SQL Server and BI Web Events Free 1-day Training Events Regional Event
Local User Groups Around the World
Free Online Technical Training
This is Community Business Analytics Training
Session Recordings PASS Newsletter
4
Session Evaluations
ways to access
Go to passsummit/evals
Download the GuideBook App and search: PASS Summit 2013
Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
Submit by 5pmFriday Oct. 18 toWIN prizes
Your feedback is important and valuable.
5
About Me
Business Intelligence Consultant, in IT for 28 years Owner of Serra Consulting Services, specializing in end-to-
end Business Intelligence and Data Warehouse solutions using the Microsoft BI stack
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer
Been perm, contractor, consultant, business owner Presenter at PASS Business Analytics Conference and PASS
Summit MCSE for SQL Server 2012: Data Platform and BI SME for SQL Server 2012 certs Contributing writer for SQL Server Pro magazine Blog at JamesSerra.com SQL Server MVP
6
Agenda
What a Data Warehouse is not What is a Data Warehouse and why use one? Fast Track Data Warehouse (FTDW) Appliances Data Warehouse vs Data Mart Kimball and Inmon Methodologies Populating a Data Warehouse ETL vs ELT Surrogate Keys SSAS Cubes SQL Server 2012 Tabular Model End-User Microsoft BI Tools
7
What a Data Warehouse is not
• A data warehouse is not a copy of a source database with the name prefixed with “DW”
• It is not a copy of multiple tables (i.e. customer) from various sources systems unioned together in a view
• It is not a dumping ground for tables from various sources with not much design put into it
8
Data Warehouse Maturity Model
Courtesy of Wayne Eckerson
9
What is a Data Warehouse and why use one?All these reasons are for data warehouses only (not OLTP): Reduce stress on production system Optimized for read access, sequential disk scans Integrate many sources of data Keep historical records (no need to save hardcopy reports) Restructure/rename tables and fields, model data Protect against source system upgrades Use Master Data Management, including hierarchies No IT involvement needed for users to create reports Improve data quality and plugs holes in source systems One version of the truth Easy to create BI solutions on top of it (i.e. SSAS Cubes)
10
Why use a Data Warehouse?
Legacy applications + databases = chaos
Production Control
MRP
InventoryControl
Parts Management
Logistics
Shipping
Raw Goods
Order Control
Purchasing
Marketing
Finance
Sales
Accounting
Management Reporting
Engineering
Actuarial
Human Resources
ContinuityConsolidationControlComplianceCollaboration
Enterprise data warehouse = order
Single version of the truth
Enterprise DataWarehouse
Every question = decision
Two purposes of data warehouse: 1) save time building reports; 2) slice in dice in ways you could not do before
11
Hardware Solutions
Fast Track Data Warehouse - A reference configuration optimized for data warehousing. This saves an organization from having to commit resources to configure and build the server hardware. Fast Track Data Warehouse hardware is tested for data warehousing which eliminates guesswork and is designed to save you months of configuration, setup, testing and tuning. You just need to install the OS and SQL Server
Appliances - Microsoft has made available SQL Server appliances that allow customers to deploy data warehouse (DW), business intelligence (BI) and database consolidation solutions in a very short time, with all the components pre-configured and pre-optimized. These appliances include all the hardware, software and services for a complete, ready-to-run, out-of-the-box, high performance, energy-efficient solutions
12
Fast Track Data Warehouse
FT Version 4.0
Benefits:- Pre-Balanced
Architectures- Choice of HW platforms- Lower TCO- High Scale- Reduced Risk- Taking the guess work
out of building a server
13
Appliances
Dell Quickstart Data Warehouse Appliance 1000 (FT 4.0, 5TB) Dell Quickstart Data Warehouse Appliance 2000 (FT 4.0,
12TB) Dell Parallel Data Warehouse Appliance (v2, SQL 2012, 15TB-
6PB) IBM Fast Track Data Warehouse (FT 4.0, 3 versions: 24TB,
60TB, 112TB) HP Enterprise Data Warehouse Appliance (v2, SQL 2012,
15TB-6PB) HP Business Data Warehouse Appliance (FT 3.0, 5TB) HP Business Decision Appliance (BI, SharePoint 2010, SQL
Server 2008R2, PP) HP Database Consolidation Appliance (virtual environment,
Windows2008R2)
14
Data Warehouse vs Data Mart
Data Warehouse: A single organizational repository of enterprise wide data across many or all subject areas Holds multiple subject areas Holds very detailed information Works to integrate all data sources Feeds data mart
Data Mart: Subset of the data warehouse that is usually oriented to specific subject (finance, marketing, sales)• The logical combination of all the data marts is a data
warehouse
In short, a data warehouse as contains many subject areas, and a data mart contains just one of those subject areas
15
Kimball and Inmon Methodologies
Two approaches for building data warehouses
16
Kimball and Inmon Myths
Myth: Kimball is a bottom-up approach without enterprise focus Really top-down: BUS matrix (choose business
processes/data sources), conformed dimensions, MDM Myth: Inmon requires a ton of up-front design that takes a
long time Inmon says to build DW iteratively, not big bang
approach (p. 91 BDW, p. 21 Imhoff) Myth: Star schema data marts are not allowed in Inmon’s
model Inmon says they are good for direct end-user access of
data (p. 365 BDW), good for data marts (p. 12 TTA) Myth: Very few companies use the Inmon method
Survey had 39% Inmon vs 26% Kimball. Many have a EDW
Myth: The Kimball and Inmon architectures are incompatible They can work together to provide a better solution
17
Kimball and Inmon Methodologies
Relational (Inmon) vs Dimensional (Kimball) Relational Modeling:
Entity-Relationship (ER) model Normalization rules Many tables using joins History tables, natural keys Good for indirect end-user access of data
Dimensional Modeling: Facts and dimensions, star schema Less tables but have duplicate data (de-normalized) Easier for user to understand (but strange for IT people used to
relational) Slowly changing dimensions, surrogate keys Good for direct end-user access of data
18
Relational Model vs Dimensional Model
Relational Model
Dimensional Model
If you are a business user, which model is easier to use?
19
Kimball and Inmon Methodologies
• Kimball:• Logical data warehouse (BUS), made up of subject areas (data
marts)• Business driven, users have active participation• Decentralized data marts (not required to be a separate physical
data store)• Independent dimensional data marts optimized for
reporting/analytics• Integrated via Conformed Dimensions (provides consistency
across data sources)• 2-tier (data mart, cube), less ETL, no data duplication
• Inmon:• Enterprise data model (CIF) that is a enterprise data warehouse
(EDW)• IT Driven, users have passive participation• Centralized atomic normalized tables (off limit to end users)• Later create dependent data marts that are separate physical
subsets of data and can be used for multiple purposes• Integration via enterprise data model• 3-tier (data warehouse, data mart, cube), duplication of data
20
Kimball Model
Why staging: Limit source contention (ELT), Recoverability, Backup, Auditing
Data Warehouse (star schema subject areas)
StagingArea 1
StagingArea 2
StagingArea 3
Multi-dimensional
Reporting Layer
OLTP Data Sources
SSIS
SSIS
SSIS
SSIS
SSIS
Dimensionalized View
Staging
Multi-dimensional
Cube Process
DW Bus Architecture
Data Mart 1
Data Mart 2
Atomic Data
21
Inmon Model
StagingArea 1
StagingArea 2
StagingArea 3
Multi-dimensional
Reporting Layer
OLTP Data Sources
SSIS
SSIS
SSIS
SSIS
SSIS
SSIS
Staging
Tabular
DimensionalizedView
Cube Process
Data Warehouse(Normalized)
Data Mart 1(Normalized)SSIS
Corporate Information Factory (CIF)
Data Mart 2(Normalized)SSIS
Atomic Data
22
Reasons to add a Enterprise Data Warehouse Single version of the truth May make building dimensions easier using lightly
denormalized tables in EDW instead of going directly from the OLTP source
Normalized EDW results in enterprise-wide consistency which makes it easier to spawn-off the data marts at the expense of duplicated data
Less daily ETL refresh and reconciliation if have many sources and many data marts in multiple databases
One place to control data (no duplication of effort and data) Reason not to: If have a few sources that need reporting
quickly
23
Which model to use?
Models are not that different, having become similar over the years, and can compliment each other
Boils down to Inmon creates a normalized DW before creating a dimensional data mart and Kimball skips the normalized DW
With tweaks to each model, they look very similar (adding a normalized EDW to Kimball, dimensionally structured data marts to Inmon)
Bottom line: Understand both approaches and pick parts from both for your situation – no need to just choose just one approach
BUT, no solution will be effective unless you possess soft skills (leadership, communication, planning, and interpersonal relationships)
24
Hybrid Model
Advice: Use SQL Server Views to interface between each level in the model
In the DW Bus Architecture each data mart could be a schema (broken out by business process subject areas), all in one database. Some companies will have one database and then build separate data mart tables from it
Data Warehouse (star schema subject areas)
StagingArea 1
StagingArea 2
StagingArea 3
Multi-dimensional
Reporting Layer
Mirror OLTP
(subset)
SSIS
SSIS
SSIS
SSIS
SSIS
SSIS
Staging
Data Mart 1
Tabular
Cube Process
Cube Process
DW Bus Architecture
Data Warehouse(Normalized)
SSIS
OLTP Data Sources SSIS
Data Mart 2
Atomic Data
Atomic Data
Corporate Information Factory (CIF)
EDW
25
Kimball Methodology
From: Kimball’s The Microsoft Data Warehouse Toolkit
Kimball defines a development lifecycle, where Inmon is just about the data warehouse (not “how” used)
26
Populating a Data Warehouse
Determine frequency of data pull (daily, weekly, etc) Full Extraction – All data (usually dimension tables) Incremental Extraction – Only data changed from last run
(fact tables) How to determine data that has changed
Timestamp - Last Updated Change Data Capture (CDC) Partitioning by date Triggers on tables MERGE SQL Statement Column DEFAULT value populated with date
Online Extraction – Data from source. First create copy of source:
Replication Database Snapshot Availability Groups
Offline Extraction – Data from flat file
27
ETL vs ELT
• Extract, Transform, and Load (ETL)• Transform while hitting source system• No staging tables• Processing done by ETL tools (SSIS)
• Extract, Load, Transform (ELT)• Uses staging tables• Processing done by target database engine (SSIS: Execute T-SQL
Statement task instead of Data Flow Transform tasks)• Use for big volumes of data• Use when source and target databases are the same• Use with Parallel Data Warehouse (PDW)
ELT is better since database engine is more efficient than SSIS• Best use of database engine: Transformations• Best use of SSIS: Data pipeline and workflow management
28
Surrogate Keys
Surrogate Keys – Unique identifier not derived from source system• Embedded in fact tables as foreign keys to dimension
tables• Allows integrating data from multiple source systems• Protect from source key changes in the source system• Allows for slowly changing dimensions• Allows you to create rows in the dimension that don’t exist
in the source (-1 in fact table for unassigned)• Improves performance (joins) and database size by using
integer type instead of text• Implemented via identity column on dimension tables
29
SSAS Cubes
Reasons to use instead of data warehouse: Aggregating (Summarizing) the data for performance Multidimensional analysis – slice, dice, drilldown Hierarchies Advanced time-calculations – i.e. 12-month rolling average Easily use Excel to view data via Pivot Tables Slowly Changing Dimensions (SCD)
30
Data Warehouse Architecture
31
SQL Server 2012 Tabular Model
New xVelocity in-memory database in SSAS Build model in Power Pivot or SQL Server Data Tools (SSDT) Uses existing relational model No star schema, no extra SSIS Uses DAX (Excel-like formula language to create custom
calculations) Faster and easier to use than multidimensional model In Hybrid Model, DW Bus Architecture goes away, but only if
you don’t need surrogate keys, SCD, conformed dimensions, and all reporting will be off of the cubes
32
End-User Microsoft Tools
Many options on Microsoft tools to use to go against the DW/BI: Excel PivotTables (SSAS Cube and Relational) SQL Server Reporting Services (SSRS) (SSAS Cube and
Relational) Report Builder (SSAS Cube and Relational) PowerPivot (Relational) PerformancePoint Services (PPS) (SSAS Cube) Power View (SSAS Cube) Data Mining (SSAS Cube)
33
Resources Data Warehouse Architecture – Kimball and Inmon methodologies:
http://bit.ly/SrzNHy SQL Server 2012: Multidimensional vs tabular: http://bit.ly/SrzX1x Data Warehouse vs Data Mart: http://bit.ly/SrAi4p Fast Track Data Warehouse Reference Guide for SQL Server 2012:
http://bit.ly/SrAwsj Complex reporting off a SSAS cube: http://bit.ly/SrAEYw Surrogate Keys: http://bit.ly/SrAIrp Normalizing Your Database: http://bit.ly/SrAHnc Difference between ETL and ELT: http://bit.ly/SrAKQa Microsoft’s Data Warehouse offerings: http://bit.ly/xAZy9h Microsoft SQL Server Reference Architecture and Appliances:
http://bit.ly/y7bXY5 Methods for populating a data warehouse: http://bit.ly/SrARuZ Great white paper: Microsoft EDW Architecture, Guidance and Deployment
Best Practices: http://bit.ly/SrAZug End-User Microsoft BI Tools – Clearing up the confusion: http://bit.ly/SrBMLT Microsoft Appliances: http://bit.ly/YQIXzM Why You Need a Data Warehouse: http://bit.ly/1fwEq0j Data Warehouse Maturity Model: http://bit.ly/xl4mGM
October 15-18, 2013 | Charlotte, NC
Questions?James Serra, Business Intelligence ConsultantEmail me at: [email protected] me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck will be posted)
34
JOIN US for our second annual event to get the best learning for analyzing, managing, and sharing business information and insights through the Microsoft Data Platform of technologies.
October 15-18, 2013 | Charlotte, NC
Thank youfor attending this session and the 2013 PASS Summit in Charlotte, NC
36