TRANSCRIPT
Cash Registers & Satellites Briefing to the 2006 NOAATech Conference
November 2, 2005
Stan Cutler
301-457-5210 x 163
Mitretek Systems/NESDIS/OSD
2
Purpose
Improve communication between NOAA’s developers and the wider community of data management professionals
– Introduce vocabulary
– Identify NOAA applications that can be described using common vocabulary
3
Agenda
Universal Data Management Challenges
Notional Data Warehouse Architecture
Data Modeling Approaches
– Relational
– Dimensional
4
I. Universal Data Management Challenges
5
Data Mining Example: “Market Basket Analysis”
Decisions:
1) Move the beer display closer to the diaper display
2) On Thursdays, sell beer & diapers at full price
Rationale:
1) When men bought diapers on Thursdays and Saturdays, they also tended to buy beer
2) Men typically did their weekly grocery shopping on Saturdays
3) On Thursdays, they only bought a few items
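The beer-and-diapers finding is usually expressed with three association-rule measures: support, confidence, and lift. The sketch below computes them for a handful of made-up baskets. The transactions and the function name `rule_metrics` are illustrative assumptions, not data or code from the briefing.

```python
# Hypothetical Thursday baskets, invented purely for illustration.
baskets = [
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "beer", "chips"}),
    frozenset({"diapers", "milk"}),
    frozenset({"beer", "chips"}),
    frozenset({"milk", "bread"}),
    frozenset({"diapers", "beer", "bread"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent in t)
    n_consequent = sum(1 for t in transactions if consequent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)

    support = n_both / n  # share of all baskets containing both items
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    # Lift above 1 means the items appear together more often than chance predicts.
    lift = confidence / (n_consequent / n) if n_consequent else 0.0
    return support, confidence, lift


support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift well above 1 on real sales data is the kind of result that would justify moving the two displays closer together.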
6
Many Disciplines Mine Their Data
Law Enforcement – Optimal Deployment
Health Care – Coverage Risks
E-Commerce – Pop-up/Link Selection
Medicine – Gene/Disease Associations
Etc.
Data Management Goal
Develop systems in which the data and procedures are configured to answer questions that are important to the enterprise
7
NOAA’s Future
Integrating Global (Environmental Observations) and Data Management
Ensuring Sound, State-of-the-Art (Research)
Developing, Valuing, and Sustaining a World-Class Workforce
We are not unique. Any enterprise that collects large amounts of data has the same kinds of challenges and goals.
8
Ask the same kinds of questions as those challenged with similar problems
Understand the constructs and vocabulary
– Architectures
– Data Modeling
We can find valuable expertise outside the NOAA community
9
II. Notional Data Warehouse Architecture
10
“Hub and Spoke Architecture”
Diagram labels:
– External Data
– Internal Data
– Data Staging Area
– Transform & “Cleanse”
– Data Warehouse
– Application Neutral
– Application Specific “Data Marts” use “OLAP” Technologies
“ETL” = Extract, Transform and Load
“OLAP” = Online Analytical Processing
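The flow through a hub-and-spoke warehouse can be sketched as extract, transform and cleanse, load, and then a data mart derived from the application-neutral hub. The code below is a minimal sketch under stated assumptions: the source records, the cleansing rule (skip rows whose amount cannot be parsed), and all table and function names are hypothetical.

```python
import sqlite3

# Hypothetical raw records from two sources (names and values are invented).
internal_sales = [
    {"store": " Store-01 ", "product": "Beer", "amount": "12.50"},
    {"store": "store-01", "product": "DIAPERS", "amount": "8.00"},
    {"store": "Store-02", "product": "beer", "amount": "n/a"},  # unusable value
]
external_sales = [
    {"store": "STORE-02", "product": "Diapers", "amount": "9.25"},
]


def extract():
    """Extract: pull records from every source into the staging area."""
    return internal_sales + external_sales


def transform(staged):
    """Transform and cleanse: normalize text, drop rows with unparseable amounts."""
    clean = []
    for row in staged:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleansing rule assumed for this sketch
        clean.append((row["store"].strip().lower(), row["product"].strip().lower(), amount))
    return clean


def load(conn, rows):
    """Load: write cleansed rows into the application-neutral warehouse table."""
    conn.execute("CREATE TABLE warehouse_sales (store TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))

# An application-specific "data mart": sales totals per product, e.g. for marketing.
marketing_mart = dict(
    conn.execute("SELECT product, SUM(amount) FROM warehouse_sales GROUP BY product").fetchall()
)
print(marketing_mart)
```

The warehouse table holds no application logic; each data mart is just a query or summary built on top of it.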
11
Retail Application: Hub and Spoke Architecture
Diagram labels:
– External
– Customer Lists
– Sales Data
– Data Staging Area
– Transform & Cleanse
– Data Warehouse
– Application Neutral
– OLAP Data Marts (Application Specific): Marketing, Floor Management, Human Resources, Real Estate, Accounting
12
Notional NOAA Hub and Spoke Architecture
Diagram labels:
– Other Satellite Archives
– CLASS
– ESPC
– Data Centers
– Data Staging Area (Rich Inventory?)
– Transform & Cleanse
– Data Warehouse
– Application Neutral
– NOAA Applications (Data Marts using OLAP): Climate Prediction, Weather Forecast, Ecosystems Management, Commerce & Transportation, External Customers
13
III. Data Modeling Approaches
14
“Relational” Vocabulary
“Relational” technologies
– Relational Data Base Management Systems (RDBMS)
• COTS Products (INFORMIX, DB2, ORACLE, MS/SS, etc.)
• Proprietary data management/manipulation software
– RDBMS Extensions (Most COTS products built on an RDBMS)
• GUIs, CASE Tools, COOP, Application Generators, Security, etc.
“Relational” Data Models - Evolutionary approach to data base design
• Conceptual Entity Relationship Diagrams (ERD) used to identify data requirements, relationships, rules
– Diagrams
– Data Dictionaries
• Logical ERDs used to normalize (eliminate redundancies)
• Physical models are the Table Schema entered into the RDBMS
Online Transaction Processing (OLTP) – e.g., CLASS
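To illustrate the step from a normalized logical model to a physical table schema, the sketch below loads a few redundant order records into two related tables so that each customer is stored once. The records, table names, and column names are invented for illustration, and SQLite stands in for whichever RDBMS is actually used.

```python
import sqlite3

# Un-normalized order records: the customer's name and city repeat on every row.
flat_orders = [
    (1, "Ann Lee", "Silver Spring", "2005-11-01", 19.99),
    (2, "Ann Lee", "Silver Spring", "2005-11-02", 5.00),
    (3, "Bob Ray", "Suitland", "2005-11-02", 42.10),
]

conn = sqlite3.connect(":memory:")
# Physical model: the table schema entered into the RDBMS.
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city TEXT NOT NULL,
        UNIQUE (name, city)
    );
    CREATE TABLE customer_order (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date TEXT NOT NULL,
        total REAL NOT NULL
    );
""")

for order_id, name, city, order_date, total in flat_orders:
    # Store each customer once; later orders reuse the same key.
    conn.execute("INSERT OR IGNORE INTO customer (name, city) VALUES (?, ?)", (name, city))
    (customer_id,) = conn.execute(
        "SELECT customer_id FROM customer WHERE name = ? AND city = ?", (name, city)
    ).fetchone()
    conn.execute(
        "INSERT INTO customer_order VALUES (?, ?, ?, ?)",
        (order_id, customer_id, order_date, total),
    )

customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
print(customer_count, order_count)
```

The one-to-many relationship between customer and order is carried by the `customer_id` foreign key, which is what an ERD's relationship line and cardinality marks describe.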
15
Entity Relationship Diagram (ERD)
[Diagram: entity boxes, each listing a key and its attributes, connected by relationship lines. Callouts: Entity Class, Entity, Relationship, Attributes, Cardinality (1, Many, or 0)]
The foundation of all OLTP systems, such as CLASS
Attributes, entities, and relationships are described in the data dictionary
16
Object Models “inherit” ERD constructs
[Diagram: entity boxes, each listing a key and its attributes, alongside an Object Class box that lists a key, attributes, and Behavior]
17
Pros & Cons of systems based on Relational models
Strengths
– Referential integrity
– Data locking
– Fast Look-up and Retrieval
– GUIs
Weaknesses
– Entity proliferation
– Users don’t understand them
– Complex code must be written to accumulate multiple instances (Hard to use for Data Mining)
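Referential integrity, the first strength listed, means the RDBMS refuses a row whose foreign key points at nothing. A minimal sketch using SQLite follows; the `store` and `sale` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (
        sale_id INTEGER PRIMARY KEY,
        store_id INTEGER NOT NULL REFERENCES store(store_id),
        amount REAL
    );
    INSERT INTO store VALUES (1, 'Main Street');
    INSERT INTO sale VALUES (100, 1, 12.50);
""")

try:
    # Store 99 does not exist, so this sale would be an orphan.
    conn.execute("INSERT INTO sale VALUES (101, 99, 3.00)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

sale_count = conn.execute("SELECT COUNT(*) FROM sale").fetchone()[0]
print(orphan_rejected, sale_count)
```

The same guarantee contributes to the weaknesses on this slide: every extra entity adds another join that analysis code has to write.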
18
Dimensional Data Models
Fact
– An instance of numeric data
Dimension
– Foreign key
Fact Table
– Key is a concatenation of foreign keys (dimensions)
– An instance can have dozens of foreign keys
– Millions of instances (rows) often required
Programmers’ revenge on Data Base Administrators
– Break many relational “rules”
– Re-invented often
19
A “Dimensional” Data Model for Retailing
Who (buys, sells)
– Customer (age, gender, marital status, occupation, etc.)
– Sales person (age, gender, training, etc.)
– Cash Register
What (products)
– Brand, color, size, type, etc.
When
– Time of day, day of week, season
Where
– Store (location, size, type), Shelf
Why
– Promotions, advertising, discounts, economic trends
How much (was spent)
– Per product, per total sale
20
Classical Star Schema: Point of Sale
FACT: Time_key, Customer_key, Store_key, Clerk_key, Promo_key, Product_key, Register_key, Dollars Sold, Units Sold, Dollars Cost
Clerk Dimension: Clerk_key, ClerkName, JobGrade, Etc.
Register Dimension: Register_key, Location, Type, Etc.
Promo Dimension: Promo_key, PromoName, PriceType, AdType, Etc.
Product Dimension: Product_key, Description, Brand, Sub Category, Category, Dept, Flavor, Package Type
Time Dimension: Time_key, DayofWeek, Fiscal period
Customer Dimension: Customer_key, CustomerName, Purchase Profile, Etc.
Store Dimension: Store_key, StoreName, Address, FloorType, Etc.
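A cut-down version of this star schema can be sketched in SQLite to show how a fact table keyed by dimension foreign keys answers a "dollars sold by day of week and category" question. Only three of the seven dimensions are included, and the snake_case table names, column names, and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day_of_week TEXT, fiscal_period TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
    CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT);
    -- Fact table: the key is the concatenation of the dimension foreign keys.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        store_key INTEGER REFERENCES store_dim(store_key),
        dollars_sold REAL,
        units_sold INTEGER,
        PRIMARY KEY (time_key, product_key, store_key)
    );
    INSERT INTO time_dim VALUES (1, 'Thursday', 'FY06-Q1'), (2, 'Saturday', 'FY06-Q1');
    INSERT INTO product_dim VALUES (10, 'Lager 6-pack', 'Beer'), (20, 'Diapers size 3', 'Baby');
    INSERT INTO store_dim VALUES (7, 'Store 7');
    INSERT INTO sales_fact VALUES
        (1, 10, 7, 30.0, 4),
        (1, 20, 7, 45.0, 3),
        (2, 10, 7, 90.0, 12),
        (2, 20, 7, 60.0, 4);
""")

# Every dimension is one join away from the fact table, whatever the question.
rows = conn.execute("""
    SELECT t.day_of_week, p.category, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY t.day_of_week, p.category
    ORDER BY t.day_of_week, p.category
""").fetchall()
for row in rows:
    print(row)
```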
21
Snowflake Schema: Point of Sale
FACT: Time_key, Customer_key, Store_key, Clerk_key, Promo_key, Product_key, Register_key, Dollars Sold, Units Sold, Dollars Cost
Register Dimension: Register_key, Location, Type, Etc.
Clerk Dimension: Clerk_key, ClerkName, JobGrade, Etc.
Promo Dimension: Promo_key, PromoName, PriceType, AdType, Etc.
Product Dimension: Product_Type_PK, Product_Type_Desc
Product Dimension outlying tables (boxes repeated on the slide):
– Sub-Type_PK, Sub-Type-Desc
– Model-Num_PK, Model-Desc
– Brand-ID_PK, Maker-Desc
Time Dimension: Time_key, DayofWeek, Fiscal period
Customer Dimension: Customer_key, CustomerName, Purchase Profile, Etc.
Store Dimension: Store_key, StoreName, Address, FloorType, Etc.
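In a snowflake schema a dimension is itself normalized into further tables, so a roll-up needs one more join per level. The sketch below assumes a simple hierarchy in which each sub-type belongs to one product type. That hierarchy, the table names, and the data are illustrative guesses, since the slide does not spell out how its boxes connect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_type (product_type_pk INTEGER PRIMARY KEY, product_type_desc TEXT);
    -- Assumed hierarchy for this sketch: each sub-type belongs to one product type.
    CREATE TABLE sub_type (
        sub_type_pk INTEGER PRIMARY KEY,
        sub_type_desc TEXT,
        product_type_pk INTEGER REFERENCES product_type(product_type_pk)
    );
    CREATE TABLE sales_fact (
        sub_type_pk INTEGER REFERENCES sub_type(sub_type_pk),
        dollars_sold REAL
    );
    INSERT INTO product_type VALUES (1, 'Beverage'), (2, 'Baby care');
    INSERT INTO sub_type VALUES (11, 'Beer', 1), (12, 'Soda', 1), (21, 'Diapers', 2);
    INSERT INTO sales_fact VALUES (11, 30.0), (12, 10.0), (21, 45.0), (11, 20.0);
""")

# Rolling up to product type walks fact -> sub_type -> product_type.
totals = dict(conn.execute("""
    SELECT pt.product_type_desc, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN sub_type AS st ON st.sub_type_pk = f.sub_type_pk
    JOIN product_type AS pt ON pt.product_type_pk = st.product_type_pk
    GROUP BY pt.product_type_desc
""").fetchall())
print(totals)
```

Compared with the star schema, the descriptions are stored once per level instead of once per product row, at the price of the extra join.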
22
Metadata in Dimensional Modeling
NOAA usage:
– If it’s not a fact
– If it’s not a key
– It’s metadata
Conventional Dimensional usage:
– If it’s not a fact
– If it’s not a key
– It’s documentation
BUT
– If it’s a key
– It’s metadata (because it describes the fact)
23
Dimensional Models for NOAA
Which
– Satellite
– Instrument
When
– Orbit, UTC, Season, decade, epoch, etc.
Where
– Geospatial coordinates
Who
– User affiliation
– Developer affiliation
FACT: How much?
– Temperature, moisture, radiance, color, etc.
24
A NOAA Star Schema?
FACT TABLE: Time_key (fk), Location_key (fk), Altitude_key (fk), Product_key (fk), Satellite_key (fk), Instrument_key (fk), Temperature
Altitude Dimension: Altitude_key, Distance above SL, Etc.
Satellite Dimension: Satellite_key, Name, Position
Instrument Dimension: Instrument_key, Name, Description
Product Dimension: Product_key, Product Name, Description, System, Sub System, Etc.
Time Dimension: Time_key, UTC of Obs’n, UTC of receipt, LocalT of Obs’n, Orbit_Id, Etc.
Location Dimension: Location_key, Geo-Coordinates of Obs’n, Etc.
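The same pattern can be sketched for the notional NOAA schema: a temperature fact keyed by time, satellite, and instrument, then averaged per satellite. Only three dimensions are modeled. The satellite and instrument names (`SAT-A`, `SAT-B`, `IMAGER`) are placeholders and the temperature values are invented, not real observations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE satellite_dim (satellite_key INTEGER PRIMARY KEY, name TEXT, position TEXT);
    CREATE TABLE instrument_dim (instrument_key INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, utc_of_obs TEXT, orbit_id INTEGER);
    -- Fact table: one temperature per (time, satellite, instrument) combination.
    CREATE TABLE observation_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        satellite_key INTEGER REFERENCES satellite_dim(satellite_key),
        instrument_key INTEGER REFERENCES instrument_dim(instrument_key),
        temperature REAL,
        PRIMARY KEY (time_key, satellite_key, instrument_key)
    );
    INSERT INTO satellite_dim VALUES (1, 'SAT-A', 'east'), (2, 'SAT-B', 'west');
    INSERT INTO instrument_dim VALUES (1, 'IMAGER', 'placeholder instrument');
    INSERT INTO time_dim VALUES (1, '2005-11-02T00:00Z', 100), (2, '2005-11-02T06:00Z', 101);
    INSERT INTO observation_fact VALUES
        (1, 1, 1, 280.0),
        (2, 1, 1, 284.0),
        (1, 2, 1, 290.0),
        (2, 2, 1, 291.0);
""")

# "Which satellite, how much?": average temperature per satellite.
avg_by_satellite = conn.execute("""
    SELECT s.name, AVG(f.temperature)
    FROM observation_fact AS f
    JOIN satellite_dim AS s ON s.satellite_key = f.satellite_key
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(avg_by_satellite)
```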
25
Pros & Cons of systems based on dimensional models
Strengths
– Very few “entity types” needed
– Decision Support Systems (DSS)
• End-Users construct complex queries by selecting dimensions from a GUI
• Statistical analysis of very large data bases
– Artificial Intelligence (AI)
• Automated scheduling of continuous executions
• System identifies (“discovers”) new relationships
• Discoveries shape successive execution
Weaknesses
– Development Cost
– Storage
– Operational Cost - Requires much “care and feeding”
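The decision-support idea, in which an end user picks dimensions and the system writes the query, can be sketched as a small query builder. Everything here is an assumption for illustration: the `DIMENSIONS` whitelist, the `build_query` function, and the single pre-joined `sales` table that stands in for a full star schema.

```python
import sqlite3

# Whitelist mapping a user-facing dimension name to a column (hypothetical names).
DIMENSIONS = {"day": "day_of_week", "store": "store_name", "category": "category"}


def build_query(selected):
    """Build a GROUP BY query over the dimensions the user selected."""
    if not selected:
        raise ValueError("select at least one dimension")
    # Looking names up in the whitelist keeps arbitrary text out of the SQL.
    columns = [DIMENSIONS[name] for name in selected]
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}, SUM(dollars_sold) FROM sales "
        f"GROUP BY {column_list} ORDER BY {column_list}"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Pre-joined stand-in for a star schema, kept flat for brevity.
    CREATE TABLE sales (day_of_week TEXT, store_name TEXT, category TEXT, dollars_sold REAL);
    INSERT INTO sales VALUES
        ('Thursday', 'Store 7', 'Beer', 30.0),
        ('Thursday', 'Store 7', 'Baby', 45.0),
        ('Saturday', 'Store 7', 'Beer', 90.0),
        ('Saturday', 'Store 9', 'Beer', 10.0);
""")

by_day = conn.execute(build_query(["day"])).fetchall()
by_day_and_category = conn.execute(build_query(["day", "category"])).fetchall()
print(by_day)
print(by_day_and_category)
```

A GUI would supply the `selected` list from check boxes; the user never sees the SQL.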
26
False Dichotomy: Relational “vs.” Dimensional
Relational and dimensional systems are not mutually exclusive
– Data warehouses usually extract fact tables from relational data bases
– Data warehouse capabilities are extensions in RDBMSs
Depends on the business
– Feasibility: Is the application data good enough for ETL?
– ROI: Does the business benefit outweigh the cost?
27
SUMMARY:
NOAA’s data mining challenge is similar to that of other enterprises
A world-wide community of IT professionals uses a particular vocabulary to address the challenge
Relational technologies & models are the essential first step
Dimensional technologies & models come next