Introduction to Data Vault Modeling

Introduction to Data Vault Modeling Kent Graziano Data Vault Master and Oracle ACE TrueBridge Resources OOW 2011 Session #05923

Author: kent-graziano

Post on 11-May-2015






Not to be confused with Oracle Database Vault (a commercial database security product), Data Vault Modeling is a specific data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for the last 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with a detailed introduction to the technical components of the Data Vault data model: what they are for and how to build them. The examples give attendees the basics of how to design and build structures using the Data Vault modeling technique. The target audience is anyone wishing to explore implementing a Data Vault style data model for an Enterprise Data Warehouse, Operational Data Warehouse, or Dynamic Data Integration Store. See more content like this by following my blog, or follow me on Twitter @kentgraziano.


  • 1. Introduction to Data Vault Modeling. Kent Graziano, Data Vault Master and Oracle ACE, TrueBridge Resources. OOW 2011 Session #05923

2. My Bio
   Kent Graziano
   - Certified Data Vault Master
   - Oracle ACE (BI/DW)
   - Data Architecture and Data Warehouse Specialist
   - 30 years in IT
   - 20 years of Oracle-related work
   - 15+ years of data warehousing experience
   - Co-Author of The Business of Data Vault Modeling (2008), The Data Model Resource Book (1st Edition), and Oracle Designer: A Template for Developing an Enterprise Standards Document
   - Past-President of Oracle Development Tools User Group (ODTUG) and Rocky Mountain Oracle User Group
   - Co-Chair BIDW SIG for ODTUG
   (C) Kent Graziano

3. Membership Special: Join by October 15 to become a member for only $99!

4. What Is a Data Warehouse?
   "A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process." (W.H. Inmon)
   "The data warehouse is where we publish used data." (Ralph Kimball)

5. Inmon's Definition
   - Subject oriented: Developed around logical data groupings (subject areas), not business functions
   - Integrated: Common definitions and formats from multiple systems
   - Time-variant: Contains historical view of data
   - Non-volatile: Does not change over time; no updates

6. Data Vault Definition
   "The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today's enterprise data warehouses." (Dan Linstedt: Defining the Data Vault Article)

7. Why Bother With Something New?
   Old Chinese proverb: "Unless you change direction, you're apt to end up where you're headed."

8. Why do we need it?
   We have seen issues in constructing (and managing) an enterprise data warehouse model using 3rd normal form, or Star Schema.
   - 3NF: Complex PKs when cascading snapshot dates (time-driven PKs)
   - Star: Difficult to re-engineer fact tables for granularity changes
   These issues lead to breakdowns in flexibility, adaptability, and even scalability.

9. Data Vault Time Line (1960 to 2000)
   - E.F. Codd invented relational modeling
   - Chris Date and Hugh Darwen maintained and refined modeling
   - Mid 60s: Dimension & Fact modeling presented by General Mills and Dartmouth University
   - Early 70s: Bill Inmon began discussing data warehousing
   - Mid 70s: AC Nielsen popularized Dimension & Fact terms
   - 1976: Dr Peter Chen created E-R diagramming
   - Mid 80s: Bill Inmon popularizes data warehousing
   - Mid to late 80s: Dr Kimball popularizes Star Schema
   - Late 80s: Barry Devlin and Dr Kimball release "Business Data Warehouse"
   - 1990: Dan Linstedt begins R&D on Data Vault Modeling
   - 2000: Dan Linstedt releases first 5 articles on Data Vault Modeling

10. Data Vault Evolution
   - The work on the Data Vault approach began in the early 1990s and was completed around 1999.
   - Throughout 1999, 2000, and 2001, the Data Vault design was tested, refined, and deployed into specific customer sites.
   - In 2002, the industry thought leaders were asked to review the architecture. This is when I attended my first DV seminar in Denver and met Dan!
   - In 2003, Dan began teaching the modeling techniques to the mass public.

11. Data Vault Modeling

12. Where does a Data Vault Fit?

13. Where does a Data Vault Fit?
   Oracle's Next Generation Data Warehouse Reference Architecture: Data Vault goes here. (C) Oracle Corp

14. 3 Simple Structures

15. Hub and Spoke = Scalability
   Nature uses Hub & Spoke, why shouldn't we? Genetics scale to billions of cells; the Data Vault scales to billions of records.

16. Hubs = Neurons
   Very similar to a neural network, the Hubs create the base structure.

17. Links = Dendrite + Synapse
   In neural networks, Dendrites & Synapses fire to pass messages. The Links dictate associations, connections.

18. 
Satellites = Memories
   Perception, understanding and processing: these all describe the memory. Satellites house descriptors that can change over time.

19. National Drug Codes + Orange Book of Drug Patent Applications: A WORKING EXAMPLE

20. 1. Hub = Business Keys
   Hubs in the example: Product Number, Drug Label Code, NDA Application #, Firm Name, Dose Form Code, Drug Listing, Patent Number, Patent Use Code
   Hubs = Unique Lists of Business Keys
   Business Keys are used to TRACK and IDENTIFY key information.

21. Business Keys = Ontology
   Business Keys should be arranged in an ontology in order to learn the dependencies of the data set.
   Keys shown: Firm Name, Drug Listing, Product Number, Dose Form Code, NDA Application #, Drug Label Code, Patent Number, Patent Use Code
   NOTE: Different Ontologies represent different views of the data!

22. Hub Entity
   A Hub is a list of unique business keys.

   Hub Structure                   Hub Product
   Primary Key                     Product Sequence ID
   Unique Index (Primary Index)    Product Number
   Load DTS                        Product Load DTS
   Record Source                   Prod Record Source

   Note:
   - A Hub's Business Key is a unique index.
   - A Hub's Load Date represents the FIRST TIME the EDW saw the data.
   - A Hub's Record Source represents: first, the Master data source (on collisions); if not available, it holds the origination source of the actual key.

23. Business Keys
   What exactly are Business Keys?
   Example 1:
   - Siebel has a system generated customer key.
   - Oracle Financials has a system generated customer key.
   These are not business keys. These are keys used by each respective system to track records.
   Example 2:
   - Siebel tracks customer name and address as unique elements.
   - Oracle Financials tracks name and address as unique elements.
   These are business keys.
   What we want in the hub are sets of natural business keys that uniquely identify the data across systems. Stay away from system generated keys if possible. System generated keys will cause damage in the integration cycle if they are not unique across the enterprise.

24. Hub Definition
   What Makes a Hub Key?
   - A Hub is based on an identifiable business key.
   - An identifiable business key is an attribute that is used in the source systems to locate data.
   - The business key has a very low propensity to change, and usually is not editable on the source systems.
   - The business key has the same semantic meaning and the same granularity across the company, but not necessarily the same format.
   Attributes and Ordering
   - All attributes are mandatory.
   - Sequence ID 1st, Business Key 2nd, Load Date 3rd, Record Source last (4th).
   - All attributes in the Business Key form a UNIQUE Index.

25. The technical objective of the Hub is to:
   - Uniquely list all possible business keys, good, bad, or indifferent, wherever they originated.
   - Tie the business keys in a 1:1 ratio with surrogate keys (giving meaning to the surrogate generated sequences).
   - Provide a consolidation and attribution layer for clear horizontal definition of the business functionality.
   - Track the arrival of data, the first time it appears in the warehouse.
   - Provide right-time / real-time systems the ability to load transactions without descriptive data.

26. Hub Table Structures
   SQN  = Sequence (insertion order)
   LDTS = Load Date (when the Warehouse first sees the data)
   RSRC = Record Source (System + App where the data ORIGINATED)

27. Sample Hub Product

   ID   PRODUCT #       LOAD DTS     RCRD SRC
   1    MFG-PRD123456   6-1-2000     MANUFACT
   2    P1235           6-2-2000     CONTRACTS
   3    *P1235          2-15-2001    CONTRACTS
   4    MFG-1235        5-17-2001    MANUFACT
   5    1235-MFG        7-14-2001    FINANCE
   6    1235            10-13-2001   FINANCE
   7    PRD128582       4-12-2002    MANUFACT
   8    PRD125826       4-12-2002    MANUFACT
   9    PRD128256       4-12-2002    MANUFACT
   10   PRD929929-*     4-12-2002    MANUFACT

   PRODUCT # carries the Unique Index.
   Notes:
   - ID is the surrogate sequence number (Primary Key).
   - What does the load date tell you?
   - Do you notice any overloaded uses for the product number?
   - Are there similar keys from different systems?
   - Can you spot entry errors?
   - Are any patterns visually present?

28. 2. 
Links = Associations
   Links in the example: Firms Generate Labels; Firms Generate Product Listings; Listings Contain Labeler Codes; Firms Manufacture Products; Listings for Products are in NDA Applications
   Links = Transactions and Associations
   They are used to hook together multiple sets of information (i.e., Hubs).

29. Associations = Ontological Hooks
   Diagram: the business keys Firm Name, Drug Listing, Product Number, and NDA Application # are connected by the links Firms Generate Product Listings, Firms Manufacture Products, and Listings for Products are in NDA Applications.
   Business Keys are associated by many linking factors; these links comprise the associations in the hierarchy.

30. Link Definitions
   What Makes a Link?
   - A Link is based on identifiable business element relationships.
   - Otherwise known as a foreign key, AKA a business event or transaction between business keys.
   - The relationship shouldn't change over time. It is established as a fact that occurred at a specific point in time and will remain that way forever.
   - The link table may also represent a hierarchy.
   Attributes
   - All attributes are mandatory.

31. Link Entity
   A Link is an intersection of business keys. It can contain Hub Keys and Other Link Keys.

   Link Structure                                            Link Line-Item
   Primary Key                                               Link Line Item Sequence ID
   Unique Index {Hub Surrogate Keys 1..N} (Primary Index)    Hub Product Sequence ID, Hub Order Sequence ID
   Load DTS                                                  Load DTS
   Record Source                                             Record Source

   Note:
   - A Link's Business Key is a Composite Unique Index.
   - A Link's Load Date represents the FIRST TIME the EDW saw the relationship.
   - A Link's Record Source represents: first, the Master data source (on collisions); if not available, it holds the origination source of the actual key.

32. Modeling Links - 1:1 or 1:M?
   - Today: Relationship is a 1:1, so why model a Link?
   - Tomorrow: The business rule can change to a 1:M. You discover new data later.
   - With a Link in the Data Vault: No need to change the EDW structure. Existing data is fine. New data is added.

33. 
Link Table Structures
   SQN  = Sequence (insertion order)
   LDTS = Load Date (when the Warehouse first sees the data)
   RSRC = Record Source (System + App where the data ORIGINATED)

34. Sample Link Entity - Relationship

   Hub Customer
   CSID   CUST #      LOAD DTS     RCRD SRC
   1      ABC123456   10-12-2000   MFG
   2      DKEF        1-25-2001    CONTRACTS

   Hub Order
   OrdID   ORDER #   LOAD DTS     RCRD SRC
   1       ORD0001   10-12-2000   MFG
   2       ORD0002   10-2-2000    CONTRACTS

   Link Cust Order
   LSEQID   CSID   OrdID   LOAD DTS     RCRD SRC
   1000     1      1       10-14-2000   FINANCE
   1001     1      2       10-14-2000   FINANCE

   Link Order-Details
   LSEQID   OrdID   PID   LIT   LOAD DTS     RCRD SRC
   1000     1       100   1     10-14-2000   FINANCE
   1001     1       101   2     10-14-2000   FINANCE

   Hub Product
   PID   PRODUCT #   LOAD DTS     RCRD SRC
   100   PRD128582   10-14-2000   MFG
   101   PRD128256   10-14-2000   MFG

   The diagram also shows an Order Satellite, an Order Details Satellite, and a Product Satellite.

35. Sample Link Entity - Hierarchy

   Hub Customer
   ID   CUSTOMER #    LOAD DTS     RCRD SRC
   1    ABC123456     10-12-2000   MANUFACT
   2    ABC925_24FN   10-22-2000   CONTRACTS
   3    DKEF          1-25-2001    CONTRACTS
   4    KKO92854_dd   3-7-2001     CONTRACTS
   5    LLOA_82J5J    6-4-2001     SALES
   6    HUJI_BFIOQ    8-3-2001     SALES
   7    PPRU_3259     2-2-2002     FINANCE
   8    PAFJG2895     2-2-2002     CONTRACTS
   9    929ABC2985    2-2-2002     CONTRACTS
   10   93KFLLA       2-2-2002     CONTRACTS

   Link Customer Rollup
   From CSID   To CSID   LOAD DTS     RCRD SRC
   1           NULL      10-14-2000   FINANCE
   2           1         10-22-2000   FINANCE
   3           1         2-15-2001    FINANCE
   4           2         4-3-2001     HR
   5           2         6-4-2001     SALES

   Note: If you have logic, you can roll together customers, or companies, or sub-assemblies, bill of materials, etc. We do not want to disturb the facts (underlying data in the hub), but we do want to re-arrange hierarchies at different points over time.

36. Link To Link (Link Sale Component)
   Diagram entities: Hub Invoice (Sat Totals, Sat Dates); Hub Product (Sat Product Desc., Link Product Hierarchy); Hub Customer (Sat Cust Active, Sat Address); Link Sale Line Item (Sat Quantity); Link Sale Component (Sat Sub-Totals)
   Note: Link Sale Component provides a shift in grain. Link Sale Component allows for configurable options of products tracked on a single line-item product sold.
   Link Sale Component provides for sub-assembly tracking.

37. 3. Satellites = Descriptors
   Satellites in the example: Firm Locations, Patent Expiration Info, Listing Formulation, Listing Medication Dosages, Product Ingredients, Drug Packaging Types
   Satellites = Descriptors
   These data provide context for the keys (Hubs) and for the associations (Links).

38. Satellite Definitions
   What Makes a Satellite?
   - A Satellite is based on non-identifying business elements: attributes that are descriptive data, often known in the source systems as descriptions, free-form entry, or computed elements.
   - The Satellite data changes, sometimes rapidly, sometimes slowly.
   - The Satellites are separated by type of information and rate of change.
   - The Satellite is dependent on the Hub or Link key as a parent. Satellites are never dependent on more than one parent table.
   - The Satellite is never a parent table to any other table (no snowflaking).
   Attributes and Ordering
   - All attributes are mandatory EXCEPT END DATE.
   - Parent ID 1st, Load Date 2nd, Load End Date 3rd, Record Source last.

39. Descriptors = Context
   Diagram: Firm Locations describe Firm Name; Listing Formulation describes Drug Listing; Product Ingredients describe Product Number; Start & End of manufacturing describes the link Firms Manufacture Products. The link Firms Generate Product Listings is also shown.
   Context: the specific point-in-time warehousing portion.

40. Satellite Entity
   A Satellite is a time-dimensional table housing detailed information about the Hub's or Link's business keys.

   Satellite Structure      Example
   Hub Primary Key          Customer #
   Load DTS                 Load DTS
   Extract DTS              Extract DTS
   Load End Date            Load End Date
   Detail Business Data     Customer Name, Customer Addr1, Customer Addr2
   {Update User}            {Update User}
   {Update DTS}             {Update DTS}
   Record Source            Record Source

   Satellites are defined by TYPE of data and RATE OF CHANGE. Mathematically this reduces redundancy and decreases storage requirements over time (compared to a Star Schema).

41. 
Satellite Entity - Details
   - A Satellite has only 1 foreign key; it is dependent on the parent table (Hub or Link).
   - A Satellite may or may not have an Item Numbering attribute.
   - A Satellite's Load Date represents the date the EDW saw the data (must be a delta set). This is not the Effective Date from the Source!
   - A Satellite's Record Source represents the actual source of the row (unit of work).
   - To avoid Outer Joins, you must ensure that every satellite has at least 1 entry for every Hub Key.

42. Satellite Table Structures
   SQN   = Sequence (parent identity number)
   LDTS  = Load Date (when the Warehouse first sees the data)
   LEDTS = End of lifecycle for superseded record
   RSRC  = Record Source (System + App where the data ORIGINATED)

43. Satellite Entity - Hub Related

   Hub Customer
   ID   CUSTOMER #    LOAD DTS     RCRD SRC
   0    N/A           10-12-2000   SYSTEM
   1    ABC123456     10-12-2000   MANUFACT
   2    ABC925_24FN   10-2-2000    CONTRACTS
   3    ABC5525-25    10-1-2000    FINANCE

   Customer Name Satellite
   CSID   LOAD DTS     NAME                           RCRD SRC
   0      10-12-2000   N/A                            SYSTEM
   1      10-12-2000   ABC Suppliers                  MANUFACT
   1      10-14-2000   ABC Suppliers, Inc             MANUFACT
   1      10-31-2000   ABC Worldwide Suppliers, Inc   MANUFACT
   1      12-2-2000    ABC DEF Incorporated           CONTRACTS
   2      10-2-2000    WorldPart                      CONTRACTS
   2      10-14-2000   Worldwide Suppliers Inc        CONTRACTS
   3      10-1-2000    N/A                            FINANCE

   The dummy satellite record eliminates the need for outer joins during extract.

44. Satellite Entity - Link Related

   Link Order Details
   ID   Product ID   OrdID   LOAD DTS     RCRD SRC
   0    0            0       10-12-2000   SYSTEM
   1    PRD102       1       10-12-2000   MANUFACT
   2    PRD103       1       10-2-2000    CONTRACTS

   Satellite Order Totals
   ID   LOAD DTS     Tax    Total   RCRD SRC
   0    10-12-2000                  SYSTEM
   1    10-12-2000   3.00   0.00    MANUFACT
   1    10-14-2000   4.00   12.00   MANUFACT
   1    10-31-2000   3.69   14.02   MANUFACT
   1    12-2-2000    4.69   13.69   CONTRACTS
   2    10-2-2000    2.45   10.00   CONTRACTS
   2    10-14-2000   1.22   14.00   CONTRACTS

   The dummy satellite record eliminates the need for outer joins during extract.

45. 
Satellite Splits - Type of Information

   Hub Customer
   ID   CUSTOMER #    LOAD DTS     RCRD SRC
   0    N/A           10-12-2000   SYSTEM
   1    ABC123456     10-12-2000   MANUFACT
   2    ABC925_24FN   10-2-2000    CONTRACTS
   3    ABC5525-25    10-1-2000    FINANCE

   Customer Satellite
   CSID   LOAD DTS     NAME                           Contact   Sales Rgn   Cust Score   RCRD SRC
   0      10-12-2000   N/A                            N/A       N/A         0            SYSTEM
   1      10-12-2000   ABC Suppliers                  Jen F.    SE          102          MANUFACT
   1      10-14-2000   ABC Suppliers, Inc             Jen F.    SE          120          MANUFACT
   1      10-31-2000   ABC Worldwide Suppliers, Inc   Jen F.    SE          130          MANUFACT
   1      12-2-2000    ABC DEF Incorporated           Jack J.   SC          85           CONTRACTS
   2      10-2-2000    WorldPart                      Jenny     SE          99           CONTRACTS
   2      10-14-2000   Worldwide Suppliers Inc        Jenny     SE          102          CONTRACTS
   3      10-1-2000    N/A                            N/A       N/A         0            FINANCE

46. Satellite Splits - Type of Information
   The same Hub Customer now feeds two satellites: Customer Name Satellite (name info) and Customer Sales Satellite (sales info).
   - Because the type of information is different, we split the logical groups into multiple Satellites.
   - This provides sheer flexibility in representation of the information.
   - We may have one more problem with Rate Of Change.

47. Satellite Splits - Rate of Change
   The slide repeats the Hub Customer and Customer Satellite tables from slide 45; note how the Cust Score column changes from row to row.

48. 
Satellite Splits - Rate of Change

   Hub Customer
   ID   CUSTOMER #    LOAD DTS     RCRD SRC
   0    N/A           10-12-2000   SYSTEM
   1    ABC123456     10-12-2000   MANUFACT
   2    ABC925_24FN   10-2-2000    CONTRACTS
   3    ABC5525-25    10-1-2000    FINANCE

   Satellites on Hub Customer: Customer Name Satellite (name info), Customer Sales Satellite (sales info), Customer Scoring Satellite
   - Assume the data to score customers begins arriving in the warehouse every 5 minutes. We then separate the scoring information from the rest of the satellites.
   - IF we end up with data that (over time) doesn't change as much as we thought, we can always re-combine Satellites to eliminate joins.

49. Satellites Split By Source System

   SAT_SALES_CUST: PARENT SEQUENCE, LOAD DATE, Name, Phone Number, Best time of day to reach, Do Not Call Flag
   SAT_FINANCE_CUST: PARENT SEQUENCE, LOAD DATE, First Name, Last Name, Guardian Full Name, Co-Signer Full Name, Phone Number, Address, City, State/Province, Zip Code
   SAT_CONTRACTS_CUST: PARENT SEQUENCE, LOAD DATE, Contact Name, Contact Email, Contact Phone Number

   Satellite Structure
   Primary Key: PARENT SEQUENCE, LOAD DATE
   {user defined descriptive data} {or temporal based timelines}
   (C) TeachDataVault.com

50. World's Smallest Data Vault

   Hub Customer: Hub_Cust_Seq_ID, Hub_Cust_Num, Hub_Cust_Load_DTS, Hub_Cust_Rec_Src
   Satellite Customer Name: Hub_Cust_Seq_ID, Sat_Cust_Load_DTS, Sat_Cust_Load_End_DTS, Sat_Cust_Name, Sat_Cust_Rec_Src

   - The Data Vault doesn't have to be BIG.
   - A Data Vault can be built incrementally.
   - Reverse engineering one component of the existing models is not uncommon.
   - Building one part of the Data Vault, then changing the marts to feed from that vault, is a best practice.
   - The smallest Enterprise Data Warehouse consists of two tables: One Hub, One Satellite.

51. 
Top 10 Rules for DV Modeling
   Business keys with a low propensity for change become Hub keys. Transactions and integrated keys become Link tables. Descriptive data always fits in a Satellite.
   1. A Hub table always migrates its primary key outwards.
   2. Hub to Hub relationships are allowed only through a link structure.
   3. Recursive relationships are resolved through a link table.
   4. A Link structure must have at least 2 FK relationships.
   5. A Link structure can have a surrogate key representation.
   6. A Link structure has no limit to the number of hubs it integrates.
   7. A Link to Link relationship is allowed.
   8. A Satellite can be dependent on a link table.
   9. A Satellite can only have one parent table.
   10. A Satellite cannot have any foreign key relationships except the primary key to the parent table (hub or link).

52. NOTE: Automating the Build
   - DV is a repeatable methodology with rules and standards.
   - Standard templates exist for: loading DV tables; extracting data from DV tables.
   - RapidAce (now Open Source): software that applies these rules to convert 3NF models to DV and to convert DV to Star Schema.
   - This could save us lots of time and $$.

53. In Review
   Data Vault is:
   - A Data Warehouse Modeling Technique (& Methodology)
   - Hub and Spoke Design
   - Simple, Easy, Repeatable Structures
   - Comprised of Standards, Rules & Procedures
   - Made up of Ontological Metadata
   - AUTOMATABLE!!!
   Hubs = Business Keys
   Links = Associations / Transactions
   Satellites = Descriptors

54. The Experts Say
   "The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework." (Bill Inmon)
   "The Data Vault is foundationally strong and exceptionally scalable architecture." (Stephen Brobst)
   "The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing...." (Doug Laney)

55. 
More Notables
   "This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before." (Howard Dresner)
   "[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from." (Scott Ambler)

56. Who's Using It?

57. Growing Adoption
   The number of Data Vault users in the US surpassed 500 in 2010 and grows rapidly.

58. Conclusion?
   Changing the direction of the river takes less effort than stopping the flow of water.

59. Where To Learn More
   The Technical Modeling Book: http://LearnDataVault.com
   On YouTube:
   Facebook:
   Dan's Blog: www.danlinstedt.com
   The Discussion Forums: Data Vault Discussions
   World wide User Group (Free): http://dvusergroup.com
   The Business of Data Vault Modeling by Dan Linstedt, Kent Graziano, Hans Hultgren (available at )

60. Contact Information
   Kent Graziano [email protected]
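
Worked example (not part of the original deck). The Hub, Link, and Satellite rules from the slides can be tried out in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module, not the author's implementation. The hub_customer and sat_customer_name columns follow the "World's Smallest Data Vault" slide; the hub_order and link_cust_order tables, all function names, the ISO date strings, and the choice of SQLite are assumptions made for illustration.

```python
import sqlite3

# hub_customer and sat_customer_name use the column names from the
# "World's Smallest Data Vault" slide; hub_order and link_cust_order are assumed.
DDL = """
CREATE TABLE hub_customer (
    hub_cust_seq_id   INTEGER PRIMARY KEY,   -- surrogate sequence ID
    hub_cust_num      TEXT NOT NULL UNIQUE,  -- business key
    hub_cust_load_dts TEXT NOT NULL,         -- first time the EDW saw the key
    hub_cust_rec_src  TEXT NOT NULL
);
CREATE TABLE hub_order (
    hub_ord_seq_id   INTEGER PRIMARY KEY,
    hub_ord_num      TEXT NOT NULL UNIQUE,
    hub_ord_load_dts TEXT NOT NULL,
    hub_ord_rec_src  TEXT NOT NULL
);
CREATE TABLE link_cust_order (
    link_seq_id     INTEGER PRIMARY KEY,
    hub_cust_seq_id INTEGER NOT NULL REFERENCES hub_customer (hub_cust_seq_id),
    hub_ord_seq_id  INTEGER NOT NULL REFERENCES hub_order (hub_ord_seq_id),
    link_load_dts   TEXT NOT NULL,
    link_rec_src    TEXT NOT NULL,
    UNIQUE (hub_cust_seq_id, hub_ord_seq_id)  -- composite unique index
);
CREATE TABLE sat_customer_name (
    hub_cust_seq_id       INTEGER NOT NULL REFERENCES hub_customer (hub_cust_seq_id),
    sat_cust_load_dts     TEXT NOT NULL,
    sat_cust_load_end_dts TEXT,              -- the only optional attribute
    sat_cust_name         TEXT NOT NULL,
    sat_cust_rec_src      TEXT NOT NULL,
    PRIMARY KEY (hub_cust_seq_id, sat_cust_load_dts)
);
"""


def load_hub(conn, table, prefix, business_key, load_dts, rec_src):
    """Insert a business key only if it is new; return its surrogate ID."""
    conn.execute(
        f"INSERT OR IGNORE INTO {table} "
        f"({prefix}_num, {prefix}_load_dts, {prefix}_rec_src) VALUES (?, ?, ?)",
        (business_key, load_dts, rec_src),
    )
    row = conn.execute(
        f"SELECT {prefix}_seq_id FROM {table} WHERE {prefix}_num = ?",
        (business_key,),
    ).fetchone()
    return row[0]


def load_link(conn, cust_id, ord_id, load_dts, rec_src):
    """Record the customer-order relationship once; repeats are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO link_cust_order "
        "(hub_cust_seq_id, hub_ord_seq_id, link_load_dts, link_rec_src) "
        "VALUES (?, ?, ?, ?)",
        (cust_id, ord_id, load_dts, rec_src),
    )


def load_sat_name(conn, cust_id, name, load_dts, rec_src):
    """Delta load: add a row only when the name changed; end-date the old row."""
    current = conn.execute(
        "SELECT sat_cust_name FROM sat_customer_name "
        "WHERE hub_cust_seq_id = ? AND sat_cust_load_end_dts IS NULL",
        (cust_id,),
    ).fetchone()
    if current is not None and current[0] == name:
        return False  # nothing changed, so no new satellite row
    conn.execute(
        "UPDATE sat_customer_name SET sat_cust_load_end_dts = ? "
        "WHERE hub_cust_seq_id = ? AND sat_cust_load_end_dts IS NULL",
        (load_dts, cust_id),
    )
    conn.execute(
        "INSERT INTO sat_customer_name VALUES (?, ?, NULL, ?, ?)",
        (cust_id, load_dts, name, rec_src),
    )
    return True


def demo():
    """Load a few rows shaped like the slides' samples (dates rewritten as ISO)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    cust = load_hub(conn, "hub_customer", "hub_cust", "ABC123456", "2000-10-12", "MANUFACT")
    same = load_hub(conn, "hub_customer", "hub_cust", "ABC123456", "2000-10-14", "FINANCE")
    order = load_hub(conn, "hub_order", "hub_ord", "ORD0001", "2000-10-12", "MFG")
    load_link(conn, cust, order, "2000-10-14", "FINANCE")
    load_link(conn, cust, order, "2000-10-15", "FINANCE")  # repeat: ignored
    load_sat_name(conn, cust, "ABC Suppliers", "2000-10-12", "MANUFACT")
    load_sat_name(conn, cust, "ABC Suppliers", "2000-10-13", "MANUFACT")  # unchanged
    load_sat_name(conn, cust, "ABC Suppliers, Inc", "2000-10-14", "MANUFACT")
    return conn, cust, same
```

In this sketch the hub keeps the load date and record source of the first arrival of a key, the link is inserted once per hub-key combination, and the satellite only grows when the descriptive value actually changes. End-dating the superseded row uses Load End Date, the one attribute the slides allow to be empty.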