acctg 6910 building enterprise & business intelligence systems (e.bis)
DESCRIPTION
ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis). Physical Data Warehouse Design. Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of Business. It’s all about trading storage for speed!. Fundamentals Aggregates (Ch. 16, pp. 356 - 357) - PowerPoint PPT PresentationTRANSCRIPT
1
ACCTG 6910Building Enterprise &
Business Intelligence Systems(e.bis)
ACCTG 6910Building Enterprise &
Business Intelligence Systems(e.bis)
Physical Data Warehouse Design
Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business
Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business
2
It’s all about trading storage for speed!
• Fundamentals• Aggregates (Ch. 16, pp. 356 - 357)• Indexes (Ch. 16, p. 357)
3
Fundamentals: the Storage Hierarchy
CPU
Cache
Memory
Disk
Storage Capacity
Small
Large
Access Speed
Slow
Fast
10-8 second
10-7 second
10-2 second
500-1000 MIPS
512 KB
512 MB
512 GB
4
Fundamentals: the Storage Hierarchy
CPU
Memory Disk
Disk Drive (I/O Channel)
Cache
Bus
How long does it take to query sales by city?How large is the Sales Fact table?How long does it take to access the Sales Fact table?
5
FundamentalsHow large is the fact table?e.g., 1 million records/day, 0.2KB/record 0.2
GB/day
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
6
Fundamentals
How long does it take to access all the fact records?
E.g., the small fact table is 1 Terabyte in size!
– 0.01s*1012=325 years LONG!!!!!!!!!!!!!
7
Fundamentals: the Storage Hierarchy
CPU
Memory Disk
Disk Drive (I/O Channel)
Cache
Bus
The logical unit of data transferred between disk and memory is block (e.g., 4k bytes)
8
Fundamentals
How long does it take to access all the fact records?
E.g., the small fact table is 1 Terabyte in size!
– Number of blocks: 2.5 millions– Access time = 0.01s*2500000= < 7
hrs!!!
9
Aggregate
• In data warehouse design, we choose the gain of fact table to be the possible lowest level.
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
Grain: orderline
10
Aggregate
• The reasons to choose the lowest level of fact: – (X) Analysts want to query on single
record
– (O) Analysts want to flexibly cut and group records.
11
Aggregate
• However, keeping the most detailed fact records could result in
– huge-size fact table: TeraBytes?! (1 million records/day, 256 Bytes/record
-> 0.2 GB/day)
– slow query
12
Aggregate
• To keep s data warehouse flexible, fact tables need to store facts in their lowest levels of detail.
• To improve query performance, another type of fact table which stores pre-computed summaries of detailed facts helps.
• Reduced to a logical design solution
13
Aggregate
• An aggregate fact table is a fact table that summarizes base-level fact table records
along one or several dimensions.• An aggregate dimension table is a
dimension table that summarizes base-level dimension table records.
• E.g., marketing managers check daily product sales by city --- aggregate by city in customer dimension
14
Aggregate
CUST_CITY# CITY_KEY* CITY* STATE
SALES_BY_CITY# TIME_KEY# PRODUCT_KEY# CITY_KEY* AVERAGE_PRICE_BY_CITY* TOTAL_QUANTITY_BY_CITY* TOTAL_SALES_BY_CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_NUMBER_IN_MONTH
ref
ref
ref
ref
ref
refAggregate fact table
Aggregate dimension table
15
Aggregate
SALES_BY_CITY# TIME_KEY# PRODUCT_KEY# CITY_KEY* AVERAGE_PRICE_BY_CITY* TOTAL_QUANTITY_BY_CITY* TOTAL_SALES_BY_CITY
CUST_CITY# CITY_KEY* CITY* STATE
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
CUSTOMER# CUSTOMER_KEY* CID* CNAME* CITY* STATE
TIME# TIME_KEY* ORDERDATE* DAY_NUMBER_IN_MONTH
ref
ref
ref
ref
ref
ref
ref
ref
refref
ref
ref
16
Indexes
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
How long does it take to
find out the total purchase
Amt by Tom Jones?
17
Indexes
• Customer table– 1M records, each record 0.200 Kbytes
long– Block is 4K size, block access time is 0.01s– Number of records/block: 4/0.2=20 – Number of blocks: 1M/20=50K
• Sequential search– Time: 25K*0.01s=250s=4min.
18
Indexes
• Binary search– Time: log(50K)*0.01s=16*0.01s=0.16s
• B+ tree index– Create index pn on customer(cname);– If each node (block) in B+ tree has 117 keys, then
• # of access to indexes: log117(1M)=3 (i.e.height of the tree)
• # of access to Customer Dimension: 1• Total time = 4*0.01 = 0.04s
19
...(11 key values, 12 pointers)
...
B+-trees - P=12
Indexes to customer records
……….
Indexes to indexes
20
Indexes
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
How long does it take to
find out the total sales of
Desktop computers?
21
Performance Improvement
• Suppose there are only 4 product categories for 1M products
• Create a B+ tree index???– Suppose the size of product category
and block ID is 10 bytes– Size of index = 1M * 10 = 10 M bytes
22
Performance Improvement
• A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.
23
Bitmaps
Desktop 1 0 1
Notebook 0 1 0
Server 0 0 0
Accessory
0 0 0
Product record 1 record 2 record 3
A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.
24
Performance Improvement
• Bitmap index is suitable for low cardinality attribute.– Cardinality(A) = # of possible values for A/#of records
• Compared with B+ tree index, bitmap index has the following advantages for low cardinality attributes– Storage space saving (1M*4/8=500K bytes)– Efficient for boolean operations
• CREATE BITMAP INDEX bitpc ON PRODUCT (PCNAME);