data analysis
DESCRIPTION
Data Analysis. OLTP and OLAP Data Warehouse SQL for Data Analysis Data Mining. Data Processing. Data processing types OLTP ( On Line Transaction Processing ) OLAP ( On-line Analytical Processing ) Database types Transactional Numerous users Dynamic ( flickering ) - PowerPoint PPT PresentationTRANSCRIPT
Bogdan Shishedjiev Data Analysis 1
Data Analysis
OLTP and OLAP
Data Warehouse
SQL for Data Analysis
Data Mining
Bogdan Shishedjiev Data Analysis 2
Data Processing
• Data processing types– OLTP (On Line Transaction Processing) – OLAP (On-line Analytical Processing )
• Database types– Transactional
• Numerous users• Dynamic (flickering)• Always maintaining current state• Critical (very loaded)
– Warehouses• Aa few users (analyzers)• Relatively stable• Maintaining the data history (all states in the time)• Not loaded
Bogdan Shishedjiev Data Analysis 3
Architecture
Bogdan Shishedjiev Data Analysis 4
Architecture
• Source compnents– Filter – separate and assure the coherency of data to be
exported– Export – do the transfer of data portions in precise
moments of time.
• Warehouse components– Loader – Initial loading and preparing the warehouse.– Refresh – loads the portions– достъп– data mining– Export – to other warehouses. This creates an hierarchy
of warehouses
Bogdan Shishedjiev Data Analysis 5
Relational Schemes
• Star
Bogdan Shishedjiev Data Analysis 6
Relational Schemes
• Snowflake
Bogdan Shishedjiev Data Analysis 7
Data Warehouse Design
• Stages– Choose the activity processes to model
– Choose the granularity of the activity procesus
– Choose the dimensions that can be applied to every record of the fact table.
– Choose the facts that must be recorded in the fact table
Bogdan Shishedjiev Data Analysis 8
Data Warehouse Design
• Fact types – the most valuable are numerical continuous values– Additive – they can be added along all dimensions
(Money amounts)
– Semi-additive – they can be added along some of dimensions (Precipitations, Product quantities)
– Non-additive – they cannot be added along any dimension (Wind speed, wind direction)
Bogdan Shishedjiev Data Analysis 9
Data Warehouse Design
• Recommendations– Use continuous additive numerical values
– The fact table is highly normalized
– Don’t normalize the dimensions. The gain is < 1%
– Design thoroughly the dimension attributes. Most often they are textual and discrete. They are used as headings and constraint sources in the answers to users
Bogdan Shishedjiev Data Analysis 10
Example – Hypermarket Chain
time_keyday_of_ week
day_no_in_monthday_no_overall
week_no_ln_year week_no_overall
monthmonth_no_overall
quarterfiscal_periodholiday_flag
weekday_flaglast_day_in_month_flag
seasonevent
Dimension time
promotion_keypromotion_name
price__reduction_ typead_type
dlsplay_ typecoupon_type
ad_media_namedisplay_provider
promo_costpromo_begin_datepromo_end-date
..and others
Promotion
product_keySKU(stock keeping units )_description
SKU_numberpackage_size
brand subcategorycategory
departementpackage_ type
diet_typeweight
weight_unit_of_mesureunits_per_retail_case
units_per_shipping_casecases_per_pallet
shelf_widthshelf_heightshelf_depth..and others
Dimension product
stoie_keystore_ name
store_numberstore_street_address
store_citystore_countystore_statestore_zip
store__managerstore_phonestore_FАX
floor_plan_ typephoto_processing_typefinance_services_type
first_opened_datelast_remodel_date
store_surfacegrocery_surface
frozen-surface ..and others
Dimension shop
Time_keyproduct_keyStore_key
promotion keydollar_salesunits_salesdollar_cost
customer-count
Facts - Sales
Bogdan Shishedjiev Data Analysis 11
Hypermarket Chain
• Fact table– Granularity – each sell of a product (SKU – Stock
Keeping Unit)
– Values – Total cost (additive), SKU quantity (semi-additive), price (non-additive), customer count (non additive)
• Dimensions– Time
– Product
– Store
– Promotion
Bogdan Shishedjiev Data Analysis 12
Hypermarket Chain
• Calculation of disk space needed – Dimension time : 2 years x 365 days = 730 days
– Dimension shop : 300 shops, everyday records
– Dimension product : 30.000 products in each shop; 3000 are sold every day in each shop.
– Dimension promotion : An article can participate in only one promotion in a shop during one day.
– Elementary fact records 300 x 730 x 3000 x 1 = 657 .106 records
– Key field number 4; Value field number 4 ; Total number osf fields =8
– Fact table size - 657 .106 x 8 fields x 4B = 21 GB
Bogdan Shishedjiev Data Analysis 13
Data Operations for Data Analysis
• General form of a SQL statement
select D1.C1, ... Dn.Cn, Aggr1(F,Cl),…, Aggrn(F,Cn)from Fact as F, Dimension1 as D1,... DimensionN as Dn where join-condition (F, D1) and... and join-condition (F, Dn) and selection-condition group by D1.C1, ... Dn.Cn order by D1.C1, ... Dn.C
Bogdan Shishedjiev Data Analysis 14
Data Operations for Data Analysis
• Exampleselect Time.Month, Product.Name, sum(Qty) from Sale, Time, Product, Promotion where Sale.TimeCode = Time.TimeCode and Sale.ProductCode = Product.ProductCode and Sale.PromoCode = Promotion.PromoCode and (Product. Name = ' Pasta' or Product.Name = 'Oil') and Time.Month between 'Feb' and 'Apr' and Promotion.Name = 'SuperSaver' group by Time.Month, Product.Name order by Time.Month, Product.Name pivot Time.Month
Feb Mar Apr
Oil 5K 5K 7K
Pasta 45K 50K 51K
Bogdan Shishedjiev Data Analysis 15
Data CubeThe cube is used to represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database and the cells in the data cube represent the measure of interest.
Bogdan Shishedjiev Data Analysis 16
Data Cube
• Data cube representation
Combination Count
{P1, Calgary, Vance} 2
{P2, Calgary, Vance} 4
{P3, Calgary, Vance} 1
{P1, Toronto, Vance} 5
{P3, Toronto, Vance} 8
{P5, Toronto, Vance} 2
{P5, Montreal, Vance} 5
{P1, Vancouver, Bob} 3
{P3, Vancouver, Bob} 5
{P5, Vancouver, Bob} 1
{P1, Montreal, Bob} 3
{P3, Montreal, Bob} 8
{P4, Montreal, Bob} 7
{P5, Montreal, Bob} 3
{P2, Vancouver, Richard} 11
Combination Count
{P3, Vancouver, Richard} 9
{P4, Vancouver, Richard} 2
{P5, Vancouver, Richard} 9
{P1, Calgary, Richard} 2
{P2, Calgary, Richard} 1
{P3, Calgary, Richard} 4
{P2, Calgary, Allison} 2
{P3, Calgary, Allison} 1
{P1, Toronto, Allison} 2
{P2, Toronto, Allison} 3
{P3, Toronto, Allison} 6
{P4, Toronto, Allison} 2
Bogdan Shishedjiev Data Analysis 17
Data Cube
• Totals - the value ANY or ALL or NULL
Bogdan Shishedjiev Data Analysis 18
Data Cube
• Drill down – adding a dimension for more detailed resultsTime. Month Product.Name sum(Qty)
Feb Pasta 48K
Mar Pasta 50K
Apr Pasta 51K
TIme.Monih Product.Name Zone sum(Qty)
Feb Pasta North 18K
Feb Pasta Centre 18K
Feb Pasta South 12K
Mar Pasta North 18K
Mar Pasta Centre 18K
Mar Pasta South 14K
Apr Pasta North 18K
Apr Pasta Centre 17K
Apr Pasta South 16K
Bogdan Shishedjiev Data Analysis 19
Data Cube
• Roll-up - removing dimensionTIme.Monih Product.Name Zone sum(Qty)
Feb Pasta North 18K
Feb Pasta Centre 18K
Feb Pasta South 12K
Mar Pasta North 18K
Mar Pasta Centre 18K
Mar Pasta South 14K
Apr Pasta North 18K
Apr Pasta Centre 17K
Apr Pasta South 16K
Product.Name Zone sum(Qty)
Pasta North 54K
Pasta Centre 53K
Pasta South 42K
Bogdan Shishedjiev Data Analysis 20
Data Cube
TIme.Monih Product.Name Zone sum(Qty)
Feb Pasta North 18K
Feb Pasta Centre 18K
Feb Pasta South 12K
Mar Pasta North 18K
Mar Pasta Centre 18K
Mar Pasta South 14K
Apr Pasta North 18K
Apr Pasta Centre 17K
Apr Pasta South 16KALL Pasta North 54KALL Pasta Centre 53KALL Pasta South 42KFeb Pasta ALL 48KMar Pasta ALL 50KApr Pasta ALL 51KALL Pasta ALL 149KALL ALL ALL 149K
•The whole data cube