data warehouses 1. what is a data warehouse? a multi-dimensional data model data warehouse...
TRANSCRIPT
1
Data MiningData Warehouses
2
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
From data warehousing to data mining
Data Warehousing and OLAP Technology: An Overview
3
Defined in many different ways, but not rigorously.
◦ A decision support database that is maintained separately
from the organization’s operational database
◦ Support information processing by providing a solid
platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
Data warehousing:
◦ The process of constructing and using data warehouses
What is Data Warehouse?
4
Organized around major subjects, such as
customer, product, sales
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process
Data Warehouse—Subject-Oriented
5
Constructed by integrating multiple, heterogeneous data sources◦ relational databases, flat files, on-line
transaction records Data cleaning and data integration techniques
are applied.◦ Ensure consistency in naming conventions,
encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc.
◦ When data is moved to the warehouse, it is converted.
Data Warehouse—Integrated
6
The time horizon for the data warehouse is significantly longer than that of operational systems
◦ Operational database: current value data
◦ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
◦ Contains an element of time, explicitly or implicitly
◦ But the key of operational data may or may not contain “time element”
Data Warehouse—Time Variant
7
A physically separate store of data transformed
from the operational environment
Operational update of data does not occur in the
data warehouse environment
◦ Does not require transaction processing,
recovery, and concurrency control mechanisms
◦ Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse—Nonvolatile
8
OLTP (on-line transaction processing)◦ Major task of traditional relational DBMS◦ Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing)
◦ Major task of data warehouse system◦ Data analysis and decision making
Distinct features (OLTP vs. OLAP):◦ User and system orientation: customer vs. market◦ Data contents: current, detailed vs. historical, consolidated◦ Database design: ER + application vs. star + subject◦ View: current, local vs. evolutionary, integrated◦ Access patterns: update vs. read-only but complex queries
Data Warehouse vs. Operational DBMS
9
OLTP vs. OLAP
OLTP OLAP
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date detailed, flat relational isolated
historical, summarized, multidimensional integrated, consolidated
access read/write index/hash on prim. key
lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
10
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
From data warehousing to data mining
Data Warehousing and OLAP Technology: An Overview
11
A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and
viewed in multiple dimensions
◦ Dimension tables, such as item (item_name, brand, type),
or time(day, week, month, quarter, year)
◦ Fact table contains measures (such as dollars_sold) and
keys to each of the related dimension tables
From Tables and Spreadsheets to Data Cubes
12
Modeling data warehouses: dimensions &
measures
◦ Star schema: A fact table in the middle
connected to a set of dimension tables
◦ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized
into a set of smaller dimension tables, forming a
shape similar to snowflake
Conceptual Modeling of Data Warehouses
13
Example of Star Schema
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcitystate_or_provincecountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_keybranch_namebranch_type
branch
14
Example of Star Schema
15
Example of Snowflake Schema
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcity_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_key
item
branch_keybranch_namebranch_type
branch
supplier_keysupplier_type
supplier
city_keycitystate_or_provincecountry
city
Example of Snowflake Schema
17
Allows data to be modeled and viewed in multiple dimensions
Another representation like star schema Dimensions are entities which you want to
keep records◦ Time, item, branch, location…
Each dimension has a table called dimension table
Data Cubes
18
3-D Data Cube
• 4-D cubes can be visualized as a series of 3-D cubesSupplier 1 Supplier 2 Supplier 3
19
A data cube is often referred as a cuboid Can generate a cubiod for all possible
subsets of dimensions◦ Provide different level of summarization
N-D cube base cuboid◦ Lowest level of summary
0-D cube apex cuboid◦ Highest level of summary◦ Summary over all dimensions
Cuboid
20
Cube: A Lattice of Cuboids
time,item
time,item,location
time, item, location, supplier
all
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
21
Start with time-product (2-D) table that shows sale amounts
Ex. Data Cube Gen
TV PC VCR
1st Qtr 1000 850 350
2nd Qtr 1352 940 298
3rd Qtr 1450 658 314
4th Qtr 1500 965 365
USA
22
TV PC VCR TV PC VCR TV PC VCR
1st Q 1000 850 350 2600 750 425 1300 850 350
2nd Q 1352 940 298 1752 860 236 1200 1000 400
3rd Q 1450 658 314 1055 458 520 1150 555 510
4th Q 1500 965 365 1350 1065 390 900 750 425
Ex. Data Cube Gen (3D)
USA Canada Mexico
23
A Sample Data Cube
Total annual salesof TV in U.S.A.Date
Produ
ct
Cou
ntr
ysum
sum TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
sum
24
Cuboids Corresponding to the Cube
all
product date country
product,date product,country date, country
product, date, country
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D(base) cuboid
25
Sales volume as a function of product, month, and region
Multidimensional DataP
rodu
ctReg
ion
Month
Dimensions: Product, Location, TimeHierarchical summarization paths
Industry Region Year
Category Country Quarter
Product City Month Week
Office Day
26
A Concept Hierarchy: Dimension (location)
all
Europe North_America
MexicoCanadaSpainGermany
Vancouver
M. WindL. Chan
...
......
... ...
...
all
region
office
country
TorontoFrankfurtcity
27
Roll up (drill-up): summarize data◦ by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up◦ from higher level summary to lower level summary
or detailed data, or introducing new dimensions Slice and dice:
◦ project and select Pivot (rotate):
◦ reorient the cube, visualization, 3D to series of 2D planes
Typical OLAP Operations
28
29
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
From data warehousing to data mining
Chapter 3: Data Warehousing and OLAP Technology: An Overview
Data Warehouse: A Multi-Tiered Architecture
DataWarehouse
ExtractTransformLoadRefresh
OLAP Engine
AnalysisQueryReportsData mining
Monitor&
IntegratorMetadata
Data Sources Front-End Tools
Serve
Data Marts
Operational DBs
Othersources
Data Storage
OLAP Server
31
Data Mart◦ a subset of corporate-wide data that is of value to a specific
groups of users. Its scope is confined to specific, selected groups, such as marketing data mart
Meta data is the data defining warehouse objects. It stores:◦ Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, data mart locations and contents
◦ Monitoring information warehouse usage statistics, error reports
◦ Algorithms used for summarization
Warehouse Architecture
32
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
From data warehousing to data mining
Chapter 3: Data Warehousing and OLAP Technology: An Overview
33
Three kinds of data warehouse applications
◦ Information processing supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs
◦ Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling,
pivoting
◦ Data mining knowledge discovery from hidden patterns supports associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization tools
Data Warehouse Usage
34
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
From data warehousing to data mining
Summary
Chapter 3: Data Warehousing and OLAP Technology: An Overview
35
Why data warehousing? A multi-dimensional model of a data warehouse
◦ Star schema, snowflake schema
◦ A data cube consists of dimensions & measures OLAP operations: drilling, rolling, slicing, dicing and pivoting Data warehouse architecture
Summary: Data Warehouse and OLAP Technology