warehouse f08

MIS 385/MBA 664
Systems Implementation with DBMS/Database Management
Dave Salisbury
[email protected] (email)
http://www.davesalisbury.com/ (web site)

Upload: rudie-buzz

Post on 25-Sep-2015



Objectives
- Definition of terms
- Reasons for information gap between information needs and availability
- Reasons for need of data warehousing
- Describe three levels of data warehouse architectures
- Describe two components of star schema
- Estimate fact table size
- Design a data mart
- Develop requirements for a data mart

Definition
Data Warehouse: A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
- Subject-oriented: e.g. customers, patients, students, products
- Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
- Time-variant: can study trends and changes
- Non-updatable: read-only, periodically refreshed
Data Mart: A data warehouse that is limited in scope

History Leading to Data Warehousing
- Improvement in database technologies, especially relational DBMSs
- Advances in computer hardware, including mass storage and parallel architectures
- Emergence of end-user computing with powerful interfaces and tools
- Advances in middleware, enabling heterogeneous database connectivity
- Recognition of difference between operational and informational systems

Need for Data Warehousing
- Integrated, company-wide view of high-quality information (from disparate databases)
- Separation of operational and informational systems and data (for improved performance)

Data warehouse versus Data mart

Issues with Company-Wide View
- Inconsistent key structures
- Synonyms
- Free-form vs. structured fields
- Inconsistent data values
- Missing data
cf. Figure 11.1

Figure: Examples of heterogeneous data

Organizational Trends Motivating Data Warehouses
- No single system of records
- Multiple systems not synchronized
- Organizational need to analyze activities in a balanced way
- Customer relationship management
- Supplier relationship management

Data Warehouse Architectures
- Generic Two-Level Architecture
- Independent Data Mart
- Dependent Data Mart and Operational Data Store
- Logical Data Mart and Real-Time Data Warehouse
- Three-Layer architecture
All involve some form of extraction, transformation and loading (ETL)
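A minimal sketch of what the ETL step common to these architectures might look like. The two source systems, their field names, and the `GENDER_CODES` mapping are hypothetical, invented here only to show inconsistent source formats being brought to one naming and encoding scheme before loading.

```python
# Hypothetical extracts from two operational systems with inconsistent formats.
orders_system = [{"cust_no": "0042", "gender": "F", "amt": "19.99"}]
billing_system = [{"customer_id": 42, "sex": "female", "amount": 5.0}]

# Hypothetical mapping onto a single encoding for one attribute.
GENDER_CODES = {"F": "F", "female": "F", "M": "M", "male": "M"}

def transform(record, id_field, gender_field, amount_field):
    """Map a source record onto one consistent naming and encoding scheme."""
    return {
        "customer_id": int(record[id_field]),
        "gender": GENDER_CODES[record[gender_field]],
        "amount": float(record[amount_field]),
    }

# Load: append the transformed rows to the (in-memory) warehouse table.
warehouse = []
warehouse += [transform(r, "cust_no", "gender", "amt") for r in orders_system]
warehouse += [transform(r, "customer_id", "sex", "amount") for r in billing_system]

for row in warehouse:
    print(row)
```

Both source rows end up under the same key structure and value encoding, which is the "integrated" property from the definition of a data warehouse.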

Figure: Generic two-level data warehousing architecture
- ETL
- One, company-wide warehouse
- Periodic extraction, so data is not completely current in warehouse

Figure: Independent data mart data warehousing architecture
- Data marts: mini-warehouses, limited in scope
- ETL: separate ETL for each independent data mart
- Data access complexity due to multiple data marts

Figure: Dependent data mart with operational data store: a three-level architecture
- ETL: single ETL for enterprise data warehouse (EDW)
- Simpler data access
- ODS provides option for obtaining current data
- Dependent data marts loaded from EDW

Figure: Logical data mart and real-time warehouse architecture
- ETL: near real-time ETL for data warehouse
- ODS and data warehouse are one and the same
- Data marts are NOT separate databases, but logical views of the data warehouse
- Easier to create new data marts
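The point that a logical data mart is a view rather than a separate database can be sketched with Python's built-in sqlite3 module. The table `warehouse_sales`, the view `west_region_mart`, and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database standing in for the (hypothetical) data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("West", "Widget", 100.0), ("East", "Widget", 80.0), ("West", "Gadget", 50.0)],
)

# The "data mart" is only a view definition: no data is copied.
conn.execute(
    "CREATE VIEW west_region_mart AS "
    "SELECT product, amount FROM warehouse_sales WHERE region = 'West'"
)

# A row added to the warehouse is visible through the view immediately.
conn.execute("INSERT INTO warehouse_sales VALUES ('West', 'Gizmo', 25.0)")
mart_rows = conn.execute(
    "SELECT product, amount FROM west_region_mart ORDER BY product"
).fetchall()
print(mart_rows)
```

Creating another mart under this approach is one more `CREATE VIEW` statement, which is why new data marts are easier to create in this architecture.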

Figure: Three-layer data architecture for a data warehouse

Data Characteristics: Status vs. Event Data
- Status: the state of the data before and after an event (labeled twice in the figure)
- Event = a database action (create/update/delete) that results from a transaction

With transient data, changes to existing records are written over previous records, thus destroying the previous data content

Data Characteristics: Transient vs. Periodic Data
Periodic data are never physically altered or deleted once they have been added to the store
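The difference between transient and periodic data can be sketched in a few lines of Python. The customer key, city names, and dates are hypothetical sample values.

```python
from datetime import date

# Transient store: one slot per key, so an update overwrites the old value.
transient = {}
transient["cust-1"] = {"city": "Dayton"}
transient["cust-1"] = {"city": "Columbus"}  # previous content is destroyed

# Periodic store: append-only, so each change adds a new dated record.
periodic = []
periodic.append({"key": "cust-1", "city": "Dayton", "as_of": date(2008, 1, 15)})
periodic.append({"key": "cust-1", "city": "Columbus", "as_of": date(2008, 6, 1)})

# Only the periodic store can still answer "where was this customer before?"
history = [r["city"] for r in periodic if r["key"] == "cust-1"]
print(transient["cust-1"]["city"])
print(history)
```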

Other Data Warehouse Changes
- New descriptive attributes
- New business activity attributes
- New classes of descriptive attributes
- Descriptive attributes become more refined
- Descriptive data are related to one another
- New source of data

Derived Data
Objectives:
- Ease of use for decision support applications
- Fast response to predefined user queries
- Customized data for particular target audiences
- Ad-hoc query support
- Data mining capabilities
Characteristics:
- Detailed (mostly periodic) data
- Aggregate (for summary)
- Distributed (to departmental servers)

Star schema
- Most common data model for data marts (also called dimensional model)
- Fact tables contain factual or quantitative data
- Dimension tables contain descriptions about the subjects of the business
- Dimension tables are denormalized to maximize performance
- 1:N relationship between dimension tables and fact tables
- Excellent for ad-hoc queries, but bad for online transaction processing

Figure: Star schema components
- Fact tables contain factual or quantitative data
- Dimension tables contain descriptions about the subjects of the business
- 1:N relationship between dimension tables and fact tables
- Dimension tables are denormalized to maximize performance

Figure: Star schema example
Fact table provides statistics for sales broken down by product, period and store dimensions
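A star schema along these lines can be sketched with Python's built-in sqlite3 module. Only the three dimensions (product, period, store) and the sales fact come from the slide; every column name and sample row below is a hypothetical stand-in, not the contents of the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive data, one row per business subject.
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: quantitative data plus one foreign key per dimension (1:N).
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (product_key, period_key, store_key)
);

INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO period  VALUES (1, 2008, 1), (2, 2008, 2);
INSERT INTO store   VALUES (1, 'Dayton'), (2, 'Columbus');
INSERT INTO sales   VALUES (1, 1, 1, 10, 100.0), (1, 2, 1, 5, 50.0), (2, 1, 2, 3, 90.0);
""")

# Ad-hoc query: join the fact table to one dimension and aggregate.
totals = conn.execute("""
    SELECT p.description, SUM(s.dollars_sold)
    FROM sales s JOIN product p ON p.product_key = s.product_key
    GROUP BY p.description
    ORDER BY p.description
""").fetchall()
print(totals)
```

Swapping `product` for `store` or `period` in the join gives the same breakdown along a different dimension, which is what makes the layout convenient for ad-hoc queries.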

Figure: Star schema with sample data

Issues Regarding Star Schema
- Dimension table keys must be surrogate (non-intelligent and non-business related), because:
  - Keys may change over time
  - Length/format consistency
- Granularity of fact table: what level of detail do you want?
  - Transactional grain: finest level
  - Aggregated grain: more summarized
  - Finer grain gives better market basket analysis capability
  - Finer grain means more dimension tables and more rows in the fact table
- Duration of the database: how much history should be kept?
  - Natural duration: 13 months or 5 quarters
  - Financial institutions may need longer duration
  - Older data is more difficult to source and cleanse

Fact table can get huge (monstrous)
- Depends on the number of dimensions and the grain of the fact table
- Number of rows = product of number of possible values for each dimension associated with the fact table
- For example, take Figure 11.11
- Assuming only half the products record sales for a given month, the total rows would be calculated as:
  1000 stores X 5000 active products X 24 months = 120,000,000 rows (yikes!)
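The row estimate above is the product of the dimension cardinalities, which can be checked in a few lines of Python. The dictionary keys are illustrative labels; the three numbers are the ones given in the example.

```python
from math import prod

# Possible values per dimension, as given in the example above.
dimension_values = {"stores": 1000, "active_products": 5000, "months": 24}

# Estimated fact table rows = product of the dimension cardinalities.
estimated_rows = prod(dimension_values.values())
print(f"{estimated_rows:,} rows")
```

Because the estimate is a product, doubling the values in any one dimension, or adding a dimension, multiplies the row count rather than adding to it.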

Fact tables contain time-period data, so date dimensions are important

Figure: Modeling dates

Variations of the Star Schema
- Multiple Fact Tables
  - Can improve performance
  - Often used to store facts for different combinations of dimensions
  - Conformed dimensions
- Factless Fact Tables
  - No nonkey data, but foreign keys for associated dimensions
  - Used for:
    - Tracking events
    - Inventory coverage

Normalizing Dimension Tables
- Multivalued Dimensions
  - Facts qualified by a set of values for the same business subject
  - Normalization involves creating a table for an associative entity between dimensions
- Hierarchies
  - Sometimes a dimension forms a natural, fixed-depth hierarchy
  - Design options:
    - Include all information for each level in a single denormalized table
    - Normalize the dimension into a nested set of 1:M table relationships
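A factless fact table used for event tracking might look like the following sqlite3 sketch. The attendance scenario, table names, and sample rows are all hypothetical; the only property taken from the slide is that the fact table holds foreign keys and no nonkey data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (course_key  INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, day TEXT);

-- Factless fact table: foreign keys only, no measures.
CREATE TABLE attendance (
    student_key INTEGER REFERENCES student(student_key),
    course_key  INTEGER REFERENCES course(course_key),
    period_key  INTEGER REFERENCES period(period_key),
    PRIMARY KEY (student_key, course_key, period_key)
);

INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO course  VALUES (1, 'Database Management');
INSERT INTO period  VALUES (1, '2008-09-02'), (2, '2008-09-04');
INSERT INTO attendance VALUES (1, 1, 1), (2, 1, 1), (1, 1, 2);
""")

# The "fact" is the existence of a row, so analysis is done by counting rows.
attendance_by_day = conn.execute("""
    SELECT pd.day, COUNT(*)
    FROM attendance a JOIN period pd ON pd.period_key = a.period_key
    GROUP BY pd.day
    ORDER BY pd.day
""").fetchall()
print(attendance_by_day)
```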

Slowly Changing Dimensions (SCD)
- Need to maintain knowledge of the past
- One option: for each changing attribute, create a current value field and many old-valued fields (multivalued)
- Better option: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change
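The "better option" above, one new dimension row per change, can be sketched in plain Python. The customer attributes, the `effective` date field, and the helper names `record_change` and `key_as_of` are assumptions made for illustration; the slide does not prescribe how rows are dated or looked up.

```python
from datetime import date
from itertools import count

next_key = count(1)   # generator of surrogate keys
customer_dim = []     # dimension table as a list of rows

def record_change(customer_id, attributes, effective):
    """Append a new row holding all attributes as of the time of change."""
    row = {"customer_key": next(next_key), "customer_id": customer_id,
           "effective": effective, **attributes}
    customer_dim.append(row)
    return row["customer_key"]

def key_as_of(customer_id, when):
    """Return the surrogate key of the row in effect on a given date."""
    rows = [r for r in customer_dim
            if r["customer_id"] == customer_id and r["effective"] <= when]
    return max(rows, key=lambda r: r["effective"])["customer_key"]

# The same business key gets a second row when an attribute changes.
k1 = record_change("C100", {"city": "Dayton", "segment": "Retail"}, date(2008, 1, 1))
k2 = record_change("C100", {"city": "Columbus", "segment": "Retail"}, date(2008, 7, 1))

print(key_as_of("C100", date(2008, 3, 1)), key_as_of("C100", date(2008, 8, 1)))
```

Facts recorded in March would carry the first surrogate key and facts recorded in August the second, so older facts stay tied to the dimension values that were true at the time.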

The User Interface
Metadata (data catalog)
- Identify subjects of the data mart
- Identify dimensions and facts
- Indicate how data is derived from enterprise data warehouses, including derivation rules
- Indicate how data is derived from operational data store, including derivation rules
- Identify available reports and predefined queries
- Identify data analysis techniques (e.g. drill-down)
- Identify responsible people

On-Line Analytical Processing Tools
- The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
- Relational OLAP (ROLAP): traditional relational representation
- Multidimensional OLAP (MOLAP): cube structure
- OLAP Operations
  - Cube slicing: come up with 2-D view of data
  - Drill-down: going from summary to more detailed views

Figure: Slicing a data cube
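The two OLAP operations named above, cube slicing and drill-down, can be sketched on a tiny in-memory cube. The dimensions (product, region, month), the cell values, and the helper names `slice_cube` and `roll_up` are hypothetical; real OLAP tools expose these operations through their own interfaces.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, region, month) -> units sold.
cube = {
    ("Shoes", "East", "Jan"): 10, ("Shoes", "West", "Jan"): 7,
    ("Shoes", "East", "Feb"): 12, ("Hats", "East", "Jan"): 4,
    ("Hats", "West", "Feb"): 6,
}

def slice_cube(cells, product):
    """Fix one dimension value to get a 2-D (region, month) view."""
    return {(r, m): v for (p, r, m), v in cells.items() if p == product}

def roll_up(cells, axis):
    """Total the cells along one key position (0=product, 1=region, 2=month)."""
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[key[axis]] += value
    return dict(totals)

# Slice: the 2-D view for a single product.
shoes_view = slice_cube(cube, "Shoes")

# Drill-down: start from totals by product, then break one product out by region.
summary = roll_up(cube, 0)
shoes_cells = {k: v for k, v in cube.items() if k[0] == "Shoes"}
detail = roll_up(shoes_cells, 1)

print(shoes_view)
print(summary)
print(detail)
```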

Figure: Example of drill-down
- Summary report
- Drill-down with color added
- Starting with summary data, users can obtain details for particular cells

Data mining & visualization
- Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
- Goals:
  - Explain observed events or conditions
  - Confirm hypotheses
  - Explore data for new or unexpected relationships
- Techniques:
  - Statistical regression
  - Decision tree induction
  - Clustering and signal processing
  - Affinity
  - Sequence association
  - Case-based reasoning
  - Rule discovery
  - Neural nets
  - Fractals
- Data visualization: representing data in graphical/multimedia formats for analysis