the it perspective: data warehousing, management, and ... · 2 2 objectives explain the basics of:...
TRANSCRIPT
1 1
The IT Perspective: Data Warehousing, Management, and Analytical Structures
Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd [email protected]
2 2
Objectives Explain the basics of: 1. Master Data Management
2. Data Warehousing
3. ETL
4. OLAP/Multidimensional Data
The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation. Portions © 2010 Project Botticelli Ltd & entire material © 2010 Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors, as individually attributed or as already covered by Microsoft Copyright ownerships. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.
This seminar is based on a number of sources including a few dozen of Microsoft-owned presentations, used with permission. Thank you to Chris Dial, Tara Seppa, Aydin Gencler, Ivan Kosyakov, Bryan Bredehoeft, Marin Bezic, and Donald Farmer with his entire team for all the support.
3 3
4
Master Data Management
5 5
MDM
Ensures consistency of data across all organisational uses
Impacts overall data quality
Processes and tools for: Collection, aggregation, matching, distribution, and persistence of data
Consistently
Related to Federated Data Management
Key to MDM: Modelling
6 6
Why MDM? It’s About Evolution of Enterprise Architecture
7 7
MDM Processes
• Batched Acquisition from Staging Tables
• Members, Attributes, Parent-Child Relationships
• SQL Integration Services
Import & Integration
• Versioning Changes
• Auditing
• Compliance
• Tracking of Instances
Modeling • Subscription Views
• Export to:
• Operational Systems
• Data Warehouses
• BI Analytics
• Reporting Tools
Export & Subscription
8 8
Microsoft Master Data Services SQL 2008 R2 Enterprise, Datacenter, Developer
Tools: Master Data Manager
Primary tool for managing your master data
MDS Configuration Manager
IT Pro tool
MDS Web Service For developers wanting to extend MDS
Concepts: Models
Entities
Attributes
Members
Hierarchies
Collections
Versions
Database
9 9
Modelling Master Data
Model organises data at highest level Allowing versioning of changes to data
There are typically four categories of models: People (Customers, Staff)
Places (Geographies, Cities, Countries)
Things (Products)
Concepts (Accounts, Behaviours, Transactions)
10 10
Example: Product MDM Model
Product (model)
Product (entity)
Name (free-form attr)
Code (free-form attr)
Subcategory (domain-
based attr)
Name (free-form attr)
Code (free-form attr)
Category (domain-
based attr)
Name (free-form attr)
Code (free-form attr)
StandardCost (free-form
attr)
ListPrice (free-form
attr)
Photo (file attr)
11
1. Reviewing a Data Model Using Master Data Services
12
Data Warehouse
13 13
OLE DB
ODBC
DB2 Oracle
XML
SQL Server
Analysis Services
SQL Server
Report Server Models
SQL Server
Data Mining Models
SQL Server
Integration Services
MySAP
Hyperion Essbase
SAP
NetWeaver BI SQL Server
Teradata
Rich Connectivity Data Providers
14 14
Let’s Store the Intelligence: DW
SQL Server Analysis Services server is a logical endpoint for data
Analytical data physically rests in another relational database called a Data Warehouse
Modelling of DW is a key BI function Integration with MDM is desirable
15 15
Star Schema
16 16
Star Schema Benefits
Simple, not-so-normalized model
High-performance queries Especially with Star Join Query Optimization
Mature and widely supported
Low-maintenance
17 17
Snowflake Dimension Tables Define hierarchies using multiple dimension tables
Support fact tables with varying granularity
Simplify consolidation of heterogeneous data
Potential for slower query performance in relational reporting
No difference in performance in Analysis Services database
18 18
Fact Table Fundamentals
Collection of measurements associated with a specific business process
Specific column types Foreign keys to dimensions
Measures – numeric and additive
Metadata and lineage
Consistent granularity – the most atomic level by which the facts can be defined
19 19
Fact Table Examples
Day Grain
Quarter Grain
Reseller sales data by:
•Product
•Order Date
•Reseller
•Employee
•Sales Territory
Sales quota data by:
•Employee
•Time
20 20
Date Dimension Table
Most common dimension used in analysis (aka Time dimension)
Use consistently with all facts
Useful common attributes – Year, Quarter, Month, Day
Time series analysis support
Navigation and summarization enabled with hierarchies, such as calendar or fiscal
Single table design (typically not snowflake design)
Tip: Format the key of the dimension as yyyymmdd (e.g. 20100115) to make it readily understandable
21 21
Parent-Child Hierarchy
A dimension that contains a parent attribute
A parent attribute describes a self-referencing relationship, or a self-join, within a dimension table
Common examples Organizational charts
General Ledger structures
Bill of Materials
22 22
Parent-Child Hierarchy Example
Brian
Amy
Stacia
Stephen
Shu Michael
Peter
José
Syed
23 23
Slowly Changing Dimensions
Maintain historical context as dimension data changes
Three common ways (there are more): Type 1: Overwrite the existing dimension record
Type 2: Insert a new ‘versioned’ dimension record
Type 3: Track limited history with attributes
24 24
SCD Type 1
Existing record is updated
History is not preserved
25 25
SCD Type 2
Existing record is ‘expired’ and new record inserted
History is preserved
Most common form of SCD
26 26
SCD Type 3
Existing record is updated
Limited history is preserved
Implementation is rare
SalesTerritoryKey update to 10
27 27
Let’s Get the Data
We would like to populate facts and dimensions in our Data Warehouse from OLTP data...
28
Integration and ETL
29 29
Let’s do ETL with SSIS
SQL Server Integration Services (SSIS) service
SSIS object model
Two distinct runtime engines: Control flow
Data flow
32-bit and 64-bit editions
30 30
The Package
The basic unit of work, deployment, and execution
An organized collection of: Connection managers
Control flow components
Data flow components
Variables
Event handlers
Configurations
Can be designed graphically or built programmatically
Saved in XML format to the file system or SQL Server
31 31
Control Flow
Control flow is a process-oriented workflow engine
A package contains a single control flow
Control flow elements Containers
Tasks
Precedence constraints
Variables
32 32
Data Flow
The Data Flow Task
Performs traditional ETL and more
Fast and scalable
Data Flow Components
Extract data from Sources
Load data into Destinations
Modify data with Transformations
Service Paths
Connect data flow components
Create the pipeline
33
1. Using SQL Server Integration Services for Aggregating and Deriving Data
34
OLAP/Multidimensional Data
35 35
SQL Server 2008 Analysis Services
Multidimensional Analysis Aggregates and organizes data from business data sources
Performs calculations difficult to perform using relational queries
Supports advanced business intelligence, such as Key Performance Indicators
Data Mining Discovers patterns in both relational and OLAP data
Enhances the OLAP component with discovered results
36 36
Cube = Unified Dimensional Model
Multidimensional data
Combination of measures and dimensions as one conceptual model
Measures are sourced from fact tables
Dimensions are sourced from dimension tables
37 37
Dimensions
Members from tables/views in a data source view (based on a Data Warehouse)
Contain attributes matching dimension columns
Organize attributes as hierarchies One All level and one leaf level
User hierarchies are multi-level combinations of attributes
Can be placed in display folders
Used for slicing and dicing by attribute
38 38
Hierarchies
Benefits View of data at different levels of summarization
Path to drill down or drill up
Implementation Denormalized star schema dimension
Normalized snowflake dimension
Self-referencing relationship
39 39
Hierarchy
Defined in Analysis Services
Ordered collection of attributes into levels
Navigation path through dimensional space
Very important to get right!
Customers by Geography
Country
State
City
Customer
Customers by Demographics
Marital
Gender
Customer
40 40
Measure Group
Group of measures with same dimensionality
Analogous to a fact table
Cube can contain more than one measure group
E.g. Sales, Inventory, Finance
Defined by dimension relationships
41 41
Sales Inventory Finance
Customers X
Products X X
Time X X X
Promotions X
Warehouse X
Department X
Account X
Scenario X
Measure Group
Measure Group D
ime
ns
ion
42 42
Dimension Relationships
Define interaction between dimensions and measure groups
Relationship types Regular
Reference
Fact (Degenerate)
Many-to-many
Data mining
43 43
Customer
City
State
Country
Gender Marital
Country
State
City
Customer
Gender
Customer
Marital
Gender
Customer
Customer
City
State
Country
Gender
Marital
Attributes Hierarchies
Age
Dimension Model
44 44
Calculations
Expressions evaluated at query time for values that cannot be stored in fact table
Types of calculations Calculated members
Named sets
Scoped assignments
Calculations are defined using MDX
MDX = MultiDimensional EXpressions
45
1. Using BIDS to Review Dimension Design
2. Cube Design and Functionality
46 46
Summary
As a platform for enterprise Business Intelligence you should consider four services:
Data Warehouse (can be relational)
Process for Data Management (MDS)
Process for Data Integration (ETL)
Analysis (OLAP, Data Mining, Columnar)
= SQL Server 2008 R2
47 47
© 2010 Microsoft Corporation & Project Botticelli Ltd. All rights reserved. The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation. Portions © 2010 Project Botticelli Ltd & entire material © 2010 Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors, as individually attributed or as already covered by Microsoft Copyright ownerships. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.