3900355fsfvb
TRANSCRIPT
-
8/7/2019 3900355fsfvb
1/35
Khalid Ahmad
1
-
8/7/2019 3900355fsfvb
2/35
Chapter 11
Definition of terms Reasons for information gap between
information needs and availability Reasons for need of data warehousing Describe three levels of data warehouse
architectures List four steps of data reconciliation Describe two components of star schema Estimate fact table size Design a data mart
2
-
8/7/2019 3900355fsfvb
3/35
Chapter 11
Data Warehouse: A subject-oriented, integrated, time-variant, non-
updatable collection of data used in support ofmanagement decision-making processes
Subject-oriented: e.g. customers, patients, students,products
Integrated: Consistent naming conventions, formats,encoding structures; from multiple data sources
Time-variant: Can study trends and changes Nonupdatable: Read-only, periodically refreshed
Data Mart: A data warehouse that is limited in scope
3
-
8/7/2019 3900355fsfvb
4/35
Chapter 11
Integrated, company-wide view of high-quality
information (from disparate databases) Separation ofoperational and informationalsystems and data (for improved performance)
4
-
8/7/2019 3900355fsfvb
5/35
5
Source: adapted from Strange (1997).
-
8/7/2019 3900355fsfvb
6/35
Chapter 11
Generic Two-Level Architecture Independent Data MartDependent Data Mart and
Operational Data StoreLogical Data Mart and @ctive
Warehouse
Three-Layer architecture
6
All involve some form ofextraction, transformation and loading(ETLETL)
-
8/7/2019 3900355fsfvb
7/357
Figure 11-2: Generic two-level architecture
E
T
L
One,company-
wide
warehouse
Periodic extraction data is not completely current in warehouse
-
8/7/2019 3900355fsfvb
8/358
Figure 11-3: Independent data martData marts:Data marts:Mini-warehouses, limited in scope
E
T
L
Separate ETL for each
independentdata mart
Data access complexity
due to multiple data marts
-
8/7/2019 3900355fsfvb
9/359
Figure 11-4:Dependentdata mart with operational data store
E
T
L
Single ETL for
enterprise data warehouse
(EDW)(EDW)
Simpler data access
ODSODS provides option for
obtaining currentdata
Dependentdata marts
loaded from EDW
-
8/7/2019 3900355fsfvb
10/3510
E
T
L
Near real-time ETL for
@active Data Warehouse@active Data Warehouse
ODSODS and data warehousedata warehouseare one and the same
Data marts are NOT separate databases,
but logical views of the data warehouse
Easier to create new data marts
-
8/7/2019 3900355fsfvb
11/35
-
8/7/2019 3900355fsfvb
12/35
Chapter 1112
Status
Status
Event = a database action(create/update/delete) that
results from a transaction
-
8/7/2019 3900355fsfvb
13/35
Chapter 1113
Data CharacteristicsData CharacteristicsTransient vs. Periodic DataTransient vs. Periodic Data
Changes to existing
records are written
over previous
records, thus
destroying the
previous data content
Data are never
physically altered or
deleted once they
have been added to
the store
-
8/7/2019 3900355fsfvb
14/35
Chapter 11
Typical operational data is: Transient not historical
Not normalized (perhaps due to denormalization forperformance)
Restricted in scope not comprehensive
Sometimes poor quality inconsistencies and errors
After ETL, data should be: Detailed not summarized yet Historical periodic
Normalized 3rd normal form or higher
Comprehensive enterprise-wide perspective
Timely data should be current enough to assist decision-
making Quality controlled accurate with full integrity
14
-
8/7/2019 3900355fsfvb
15/35
Chapter 11
Capture/Extract
Scrub or data cleansingTransformLoad and Index
15
ETL = Extract, transform, and load
-
8/7/2019 3900355fsfvb
16/35
16
Static extractStatic extract = capturing a
snapshot of the source data at
a point in time
Incremental extractIncremental extract =
capturing changes that have
occurred since the last static
extract
Capture/Extractobtaining a
snapshot of a chosen subset of
the source data for loading
into the data warehouse
-
8/7/2019 3900355fsfvb
17/35
17
Scrub/Cleanseuses pattern
recognition and AI techniques to
upgrade data quality
Fixing errors:Fixing errors: misspellings,erroneous dates, incorrect field usage,
mismatched addresses, missing data,
duplicate data, inconsistencies
Also:Also: decoding, reformatting, timestamping, conversion, key
generation, merging, error
detection/logging, locating missing
data
-
8/7/2019 3900355fsfvb
18/35
18
Transform = convert data from format
of operational system to format of data
warehouse
Record-level:Record-level:Selection data partitioning
Joining data combining
Aggregation data summarization
Field-level:Field-level:single-field from one field to one field
multi-field from many fields to one, or
one field to many
-
8/7/2019 3900355fsfvb
19/35
19
Load/Index= place transformed data
into the warehouse and create indexes
Refresh mode:Refresh mode: bulk rewriting oftarget data at periodic intervals
Update mode:Update mode: only changes insource data are written to data
warehouse
-
8/7/2019 3900355fsfvb
20/35
Chapter 1120
Figure 11-11: Single-field transformation
In general some transformation function
translates data from old form to new form
Algorithmic transformation uses aformula or logical expression
Tablelookup anotherapproach, uses a separate table
keyed by source record code
-
8/7/2019 3900355fsfvb
21/35
Chapter 1121
Figure 11-12: Multifield transformation
M:1 from many source
fields to one target field
1:M from one
source field to
many target fields
-
8/7/2019 3900355fsfvb
22/35
Chapter 11
Objectives Ease of use for decision support applications Fast response to predefined user queries
Customized data for particular target audiences
Ad-hoc query support
Data mining capabilities Characteristics Detailed (mostly periodic) data
Aggregate (for summary)
Distributed (to departmental servers)
22
Most common data model = star schemastar schema(also called dimensional model)
-
8/7/2019 3900355fsfvb
23/35
23
Figure 11-13: Components of a star schemastar schema
Fact tables containfactual or quantitative
data
Dimension tables containdescriptions about thesubjects of the business
1:N relationship
between dimensiontables and fact tables
Excellent for ad-hoc queries,
but bad for online transaction processing
Dimension tables are
denormalized to
maximizeperformance
-
8/7/2019 3900355fsfvb
24/35
24
Figure 11-14: Star schema example
Fact table provides statistics for salesbroken down by product, period and store
dimensions
-
8/7/2019 3900355fsfvb
25/35
25
-
8/7/2019 3900355fsfvb
26/35
-
8/7/2019 3900355fsfvb
27/35
Chapter 11 27
Figure 11-16: Modeling dates
Fact tables contain time-period data
Date dimensions are important
-
8/7/2019 3900355fsfvb
28/35
Chapter 11 28
Helper table simplifiesrepresentation of hierarchies indata warehouses
-
8/7/2019 3900355fsfvb
29/35
Chapter 11
Identify subjects of the data mart Identify dimensions and facts Indicate how data is derived from enterprise dat
warehouses, including derivation rules Indicate how data is derived from operational
data store, including derivation rules Identify available reports and predefined queries
Identify data analysis techniques (e.g. drill-down Identify responsible people
29
-
8/7/2019 3900355fsfvb
30/35
Chapter 11
The use of a set of graphical tools that provides
users with multidimensional views of their dataand allows them to analyze the data using simplewindowing techniques
Relational OLAP (ROLAP) Traditional relational representation
Multidimensional OLAP (MOLAP) Cube structure
OLAP Operations Cube slicing come up with 2-D view of data
Drill-down going from summary to more detailedviews
30
-
8/7/2019 3900355fsfvb
31/35
31
Figure 11-22: Slicing a data cube
-
8/7/2019 3900355fsfvb
32/35
Chapter 11 32
Figure 11-23:
Example of drill-down
Summary report
Drill-down with
color added
Starting withsummary data,users can obtain
details for particularcells
-
8/7/2019 3900355fsfvb
33/35
Chapter 11
Knowledge discovery using a blend of statistical, AI, and computergraphics techniques
Goals: Explain observed events or conditions
Confirm hypotheses
Explore data for new or unexpected relationships
Techniques Statistical regression
Decision tree induction
Clustering and signal processing
Affinity
Sequence association
Case-based reasoning
Rule discovery
Neural nets
Fractals
Data visualization representing data in graphical/multimedia formats foranalysis
33
-
8/7/2019 3900355fsfvb
34/35
Chapter 11 34
OLTP: On Line Transaction Processing Describes processing at operational sites
OLAP: On Line Analytical Processing
Describes processing at warehouse
-
8/7/2019 3900355fsfvb
35/35
Chapter 11
Standard DB (OLTP) Mostly updates Many small
transactions Mb - Gb of data Current snapshot Index/hash on p.k. Raw data
Thousands of users(e.g., clerical users)
Warehouse (OLAP) Mostly reads
Queries are long and
complex
Gb - Tb of data
History
Lots of scans
Summarized, reconcileddata
Hundreds of users (e.g.,
decision makers anal sts)