Naeem Ahmed
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]
Data Warehouse and Data Mining Lecture No. 08-15
OLAP and Multi-Dimensional Data
On-Line Analytical Processing
• A decision support system (DSS) that supports ad-hoc querying, i.e., it enables managers and analysts to manipulate data interactively
• Analysis of information in a database for the purpose of making management decisions
• The idea is to let users easily and quickly manipulate and visualize data through multidimensional views (i.e., different perspectives)
• OLAP analyzes historical data (terabytes) using complex queries
On-Line Analytical Processing
• OLAP Council definition:
– A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user
• OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity.
On-Line Analytical Processing
• OLAP primarily involves aggregating large amounts of diverse data
• OLAP functionality provides dynamic multi-dimensional analysis, supporting analytical and navigational activities
• OLAP functionality is provided by the OLAP Server
• OLAP Council defines OLAP Server as:
– ‘A high capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures.’
Data Dimensionality
Data Dimensionality: Cube
[Figure: sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (TV, VCR, PC, sum) and Country (Pakistan, China, India, sum). Labelled cells: 1st Qtr sales of TV in Pakistan; total annual sales of TV in Pakistan; total annual sales of TV, PC & VCR in India.]
Cube: A group of data cells arranged by the dimensions of the data.
Data Dimensionality: Possible Views of Sale
• How many Products were sold at a specific Time to specific Customer(s)?
• How many Customers bought the Product(s) at a specific Time?
• At which Time(s) did the Customer(s) buy the specific Product(s)?
[Figure: Sale at the centre, connected to the dimensions Products, Time and Customers.]
Multi-dimensional Data
• Measures - numerical data being tracked
• Dimensions - business parameters that define a transaction
• Example: An analyst may want to view sales data (measure) by geography, by time, and by product (dimensions)
• Dimensional modeling is a technique for structuring data around business concepts
• ER models describe “entities” and “relationships”
• Dimensional models describe “measures” and “dimensions”
Multi-dimensional Model
Example queries: “Sales by product line over the past six months”, “Sales by store between 1990 and 1995”
[Figure: fact table for measures with columns Prod Code, Time Code, Store Code (key columns joining the fact table to the dimension tables) and Sales Qty (numerical measure); dimension tables Store Info, Product Info, Time Info, . . .]
Multi-dimensional Model
• Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables
• This forms a ‘star-like’ structure, which is called a star schema or star join
• Dimensions are organized into hierarchies
– E.g., Time dimension: days → weeks → quarters
– E.g., Product dimension: product → product line → brand
• Dimensions have attributes
– e.g., owner city and county of store
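The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (product_dim, store_dim, time_dim, sales), the column names and all rows are assumptions, not taken from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical names and attributes).
CREATE TABLE product_dim (prod_code TEXT PRIMARY KEY, product_line TEXT);
CREATE TABLE store_dim   (store_code TEXT PRIMARY KEY, city TEXT);
CREATE TABLE time_dim    (time_code INTEGER PRIMARY KEY, quarter TEXT);

-- Fact table: composite primary key built from the dimension keys,
-- plus one numerical measure.
CREATE TABLE sales (
    prod_code  TEXT    REFERENCES product_dim(prod_code),
    time_code  INTEGER REFERENCES time_dim(time_code),
    store_code TEXT    REFERENCES store_dim(store_code),
    sales_qty  INTEGER,
    PRIMARY KEY (prod_code, time_code, store_code)
);

INSERT INTO product_dim VALUES ('p1', 'Beverages'), ('p2', 'Dairy');
INSERT INTO store_dim   VALUES ('s1', 'Karachi'), ('s2', 'Jamshoro');
INSERT INTO time_dim    VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO sales VALUES ('p1', 1, 's1', 10), ('p1', 2, 's1', 5),
                         ('p2', 1, 's2', 7),  ('p1', 1, 's2', 3);
""")

# Star join: the fact table joined to two of its dimensions, then aggregated.
star_join = """
SELECT p.product_line, t.quarter, SUM(f.sales_qty)
FROM sales f
JOIN product_dim p ON p.prod_code = f.prod_code
JOIN time_dim t    ON t.time_code = f.time_code
GROUP BY p.product_line, t.quarter
ORDER BY p.product_line, t.quarter
"""
sales_by_line_and_quarter = conn.execute(star_join).fetchall()
print(sales_by_line_and_quarter)
```

The query answers a question of the form "sales by product line per quarter" by joining only the dimensions it needs.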
Dimension Hierarchies
[Figure: Store dimension: Stores → District → Region → Total. Product dimension: Products → Brand → Manufacturer → Total.]
Operations in Multidimensional Data Model
• Aggregation (roll-up)
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total sales by city and year → total sales by region and by year
• Selection (slice) defines a sub-cube
– e.g., sales where city = Palo Alto and date = 20/1/2014
• Navigation to detailed data (drill-down)
– e.g., (sales - expense) by city, top 3% of cities by average income
• Visualization Operations (e.g., Pivot)
A Visual Operation: Pivot
[Figure: cube with axes Product (Juice, Cola, Milk, Cream), Date (3/1, 3/2, 3/3, 3/4) and Region; sample cell values 10, 47, 30, 12.]
A pivot is a two-dimensional layout of the summary data.
The x and y axes are the dimensions, and the intersection cell for any two dimension values contains the value of the measure.
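A pivot as described above can be sketched in plain Python. The rows below are hypothetical; the figure's values cannot be mapped to specific cells, so they are not reproduced here.

```python
from collections import defaultdict

# Hypothetical summary rows: (product, date, units sold).
rows = [
    ("Juice", "3/1", 10),
    ("Cola",  "3/1", 47),
    ("Milk",  "3/2", 30),
    ("Juice", "3/2", 12),
    ("Juice", "3/1", 5),
]

def pivot(triples):
    """Lay out (row_key, col_key, value) triples as a 2-D table,
    summing the values that fall into the same cell."""
    table = defaultdict(lambda: defaultdict(int))
    for row_key, col_key, value in triples:
        table[row_key][col_key] += value
    return {r: dict(cols) for r, cols in table.items()}

# Products on the rows, dates on the columns.
by_product = pivot(rows)
# Pivoted view: swap the two axes.
by_date = pivot([(d, p, v) for p, d, v in rows])

print(by_product)
print(by_date)
```

Both tables hold the same measure values; only the assignment of dimensions to axes changes.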
Drill-Down and Roll-Up
Multi-dimensionality: Cube
Multi-dimensionality
On-Line Analytical Processing
• OLAP tools are market driven; that is, no standards exist, either academic or from a standards organization
• A common modeling approach is to use Star or Snowflake database schemata (common in data warehouse modeling)
• End users look for the following characteristics, independent of tool architecture or vendor:
On-Line Analytical Processing
• Interactivity – How easily can the end user interact with the tool?
• Customization – How easily can the end user change the data representation provided by the tool?
• Security – How well does the tool prevent the end user from accessing unauthorized data?
• Visualization – How easily can the tool provide multi-dimensional graphical representations?
OLAP Servers
• Two possibilities for OLAP servers
– Relational OLAP (ROLAP)
  • Relational and specialized relational DBMS to store and manage warehouse data
  • OLAP middleware to support missing pieces
– Multidimensional OLAP (MOLAP)
  • Array-based storage structures
  • Direct access to array data structures
  • No SQL (Structured Query Language)
– Special language provided by the vendor (e.g., Multidimensional Expressions (MDX) from Microsoft)
OLAP Taxonomy
• Multi-dimensional OLAP (MOLAP)
– ‘A k-dimensional matrix based on a non relational storage structure.’ Agrawal et al.
• Relational OLAP (ROLAP)
– ‘A relational back-end wherein operations of the data are translated to relational queries.’ Agrawal et al.
• Hybrid OLAP (HOLAP)
– Integration of MOLAP and ROLAP
• Desktop OLAP (DOLAP)
– Provides a specific cube for analysis. Simplified version of MOLAP or ROLAP
Multi-dimensional OLAP
• Multi-dimensional data management in Multi-Dimensional Database Management Systems (MDDBMS)
• A special-purpose server that directly implements multidimensional data and operations
• Advantages: Fast data access, many dimensions, performance
• Further research: storage techniques and realization of transactional concepts
MOLAP: Dimensional Modeling Using the Multi Dimensional Model
• MDDB: a special-purpose data model, MOLAP = “Cubes”
• Facts stored in multi-dimensional arrays
• The database system builds most of the aggregates within a non-relational data store
• Dimensions used to index the array
• Sometimes on top of a relational DB
• Products: Pilot, Arbor Essbase, Gentia
• Limitations: Memory
Relational OLAP
• A multi-dimensional user view on relational data storage using Star or Snowflake database schemata
[Figure: Star Schema: Sales fact table surrounded by the Product, Time, Region and Customer dimensions. Snowflake Schema: Sales fact table surrounded by the Product, Year, Country and Customer dimensions, with the further tables Product Kind, Month, Region and Customer Characteristics.]
Relational OLAP
• An extended relational DBMS that maps operations on multidimensional data to standard relational operators (i.e., iterators such as joins, loops, nested joins, etc.)
• Fact tables are often too big to query directly, so ROLAP incorporates aggregate tables
– Aggregate tables are built by running summarizing queries that join the fact table with one or more dimensions and saving the result set
– Users don’t need to specify the aggregate table; vendors provide automatic support for aggregate tables in the data warehouse
• Advantages: Easy to understand, easy to model, easy to implement
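Building an aggregate table as described above can be sketched with sqlite3. The table and column names (sale, store, agg_sale_by_region) and the sample rows are illustrative assumptions; real ROLAP products manage such tables automatically, which this sketch does not attempt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Small stand-in for a (normally very large) fact table.
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO sale VALUES ('p1','s1',1,12), ('p2','s1',1,11), ('p1','s3',1,50),
                        ('p2','s2',1,8),  ('p1','s1',2,44), ('p1','s2',2,4);

-- Store dimension carrying a region attribute.
CREATE TABLE store (storeId TEXT PRIMARY KEY, region TEXT);
INSERT INTO store VALUES ('s1','A'), ('s2','B'), ('s3','B');

-- Aggregate table: a summarizing query that joins the fact table
-- with a dimension, with the result set saved as a table.
CREATE TABLE agg_sale_by_region AS
    SELECT s.region AS region, f.prodId AS prodId, SUM(f.amt) AS total_amt
    FROM sale f JOIN store s ON s.storeId = f.storeId
    GROUP BY s.region, f.prodId;
""")

# Later queries read the small aggregate table instead of the fact table.
agg_rows = conn.execute(
    "SELECT region, prodId, total_amt FROM agg_sale_by_region "
    "ORDER BY region, prodId"
).fetchall()
print(agg_rows)
```

The trade-off is the usual one for precomputation: faster reads, at the cost of keeping the aggregate in step with the fact table.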
ROLAP: Dimensional Modeling Using Relational DBMS
• Special schema design: star, snowflake
• Special indexes: bitmap, multi-table join
• Special tuning: maximize query throughput
• Proven technology (relational model, DBMS); tends to outperform specialized MDDBs, especially on large data sets
• Products – IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
• Limitations: Maintenance, Performance
Hybrid OLAP
• ‘A system, which supports (and integrates) multi-dimensional and relational storage for data in an equivalent manner in order to benefit from the corresponding characteristics and optimization techniques.’ Dinter et al.
• Advantages: use of the best techniques introduced in MOLAP and ROLAP, transparency between MOLAP and ROLAP systems
• Further research: storage systems, a global multi-dimensional schema, a common interface and mutual integration of MOLAP and ROLAP
Desktop OLAP (DOLAP)
• All processing work is done on the desktop
– E.g., bring data into Excel and build a pivot table
• DOLAP can be inexpensive, easy and fast to set up, but only on small data sets (thousands of rows)
OLTP versus OLAP
OLTP                               | OLAP
Operational processing             | Informational processing
Transaction-oriented               | Analysis-oriented
For operational staff              | For managers, executives & analysts
Daily operations                   | Decision support
Current, up-to-date data           | Historical data
Primitive, highly detailed data    | Summarized, consolidated data
Detailed, flat relational views    | Summarized, multi-dimensional views
Short, simple transactions         | Complex aggregate queries
Read/write                         | Mostly read only
Index on keys                      | Many scans
Many users                         | Small number of users
Large databases                    | Very large databases
OLTP versus OLAP
• On-Line Transaction Processing
– Transfer $100 from my savings account to my checking account
• On-Line Analytical Processing
– What is the average balance of accounts by customer groups, account types, areas, account managers, and their combinations?
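The two workloads above can be contrasted in a small sqlite3 sketch. The account table, its columns and its rows are hypothetical, and only two of the grouping attributes from the slide are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (
    id INTEGER PRIMARY KEY,
    customer_group TEXT,
    account_type TEXT,
    balance REAL
);
INSERT INTO account VALUES
    (1, 'retail',    'savings',  500),
    (2, 'retail',    'checking', 100),
    (3, 'corporate', 'checking', 900);
""")

# OLTP: a short read/write transaction touching two rows by key.
with conn:  # commits on success, rolls back on an exception
    conn.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE id = 2")

# OLAP: a read-only aggregate query that scans and groups many rows.
avg_balances = conn.execute("""
    SELECT customer_group, account_type, AVG(balance)
    FROM account
    GROUP BY customer_group, account_type
    ORDER BY customer_group, account_type
""").fetchall()
print(avg_balances)
```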
Aggregate
• A whole formed or calculated by the combination of many separate units or items, i.e., a total
• Operators: sum, count, max, min, median, avg
– Example: Add up amounts by day
– Example in SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

ans
date  sum
1     81
2     48
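Aggregate
• Add up amounts by day, product
• SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

ans
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

Roll-up: from the sums by (product, day) to the sums by day.
Drill-down: from the sums by day back to the sums by (product, day).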
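The two GROUP BY aggregations above can be checked against the sample sale table with sqlite3. The ORDER BY clauses are additions that make the row order deterministic; the second query selects prodId as well so that each output row can be identified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "s1", 1, 12),
        ("p2", "s1", 1, 11),
        ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8),
        ("p1", "s1", 2, 44),
        ("p1", "s2", 2, 4),
    ],
)

# Add up amounts by day (the rolled-up view).
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product (the drilled-down view).
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(by_day)
print(by_day_product)
```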
MOLAP Cube
dimensions = 2

Fact table view:
sale
prodId  storeId  amt
p1      s1       12
p2      s1       11
p1      s3       50
p2      s2       8

Multi-dimensional cube:
      s1   s2   s3
p1    12   -    50
p2    11   8    -

dimensions = 3

Fact table view:
sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Multi-dimensional cube:
day 1
      s1   s2   s3
p1    12   -    50
p2    11   8    -
day 2
      s1   s2   s3
p1    44   4    -
p2    -    -    -
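Going from the fact table view to an array-based cube can be sketched as follows. Nested Python lists stand in for the multi-dimensional array storage of a MOLAP server; the index maps and the cell helper are illustrative names, not part of any product.

```python
# Fact table rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

# Each dimension maps its members to array positions.
products, stores, days = ["p1", "p2"], ["s1", "s2", "s3"], [1, 2]
p_idx = {p: i for i, p in enumerate(products)}
s_idx = {s: i for i, s in enumerate(stores)}
d_idx = {d: i for i, d in enumerate(days)}

# 3-D array indexed by [product][store][day]; None marks an empty cell.
cube = [[[None for _ in days] for _ in stores] for _ in products]
for prod, store, day, amt in facts:
    cube[p_idx[prod]][s_idx[store]][d_idx[day]] = amt

def cell(prod, store, day):
    """Direct array access: no search over the fact rows is needed."""
    return cube[p_idx[prod]][s_idx[store]][d_idx[day]]

print(cell("p1", "s3", 1))
print(cell("p2", "s3", 2))
```

The array reserves space for every combination of members, including empty cells, which is one way to read the "Limitations: Memory" remark made about MOLAP.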
Example: Cube
[Figure: cube with axes Time (M, T, W, Th, F, S, S), Product (Juice, Milk, Coke, Cream, Soap, Bread) and Store (NY, SF, LA); visible values for LA: 10, 34, 56, 32, 12, 56. Labelled cell: 56 units of bread sold in LA on M.]
Dimensions: Time, Product, Store
Attributes: Product (upc, price, …), Store …, …
Hierarchies: Product → Brand → …; Day → Week → Quarter; Store → Region → Country
Roll-ups along the hierarchies: roll-up to week, roll-up to brand, roll-up to region
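Cube Aggregation
Example: computing sums

day 1
      s1   s2   s3
p1    12   -    50
p2    11   8    -
day 2
      s1   s2   s3
p1    44   4    -
p2    -    -    -

Sum over days:
      s1   s2   s3
p1    56   4    50
p2    11   8    -

Sum over days and products:
      s1   s2   s3
sum   67   12   50

Sum over days and stores:
      sum
p1    110
p2    19

Sum over all dimensions: 129
. . .
Roll-up: moving toward the summed views. Drill-down: moving back toward the detailed cube.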
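The sums above can be reproduced with a small roll-up helper in plain Python. The function name roll_up and its keep parameter are illustrative choices, not an established API.

```python
from collections import defaultdict

# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

def roll_up(rows, keep):
    """Sum amt over every dimension not listed in keep.
    keep holds column positions: 0 = prodId, 1 = storeId, 2 = date."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in keep)
        totals[key] += row[3]
    return dict(totals)

by_prod_store = roll_up(facts, (0, 1))  # sum over days
by_store = roll_up(facts, (1,))         # sum over days and products
by_prod = roll_up(facts, (0,))          # sum over days and stores
grand_total = roll_up(facts, ())        # sum over all dimensions

print(by_prod_store)
print(by_store)
print(by_prod)
print(grand_total)
```

Each call drops dimensions, so it is a roll-up; drill-down corresponds to calling the helper again with more dimensions kept.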
Aggregation Using Hierarchies
Hierarchy: store → region → country
(store s1 in Region A; stores s2, s3 in Region B)

day 1
      s1   s2   s3
p1    12   -    50
p2    11   8    -
day 2
      s1   s2   s3
p1    44   4    -
p2    -    -    -

Rolled up to region (summed over days):
      region A   region B
p1    56         54
p2    11         8
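The roll-up from store to region shown above can be sketched by mapping each store to its parent in the hierarchy before summing. The dictionary store_to_region encodes the assignment stated in the slide (s1 in Region A; s2 and s3 in Region B); the variable names are illustrative.

```python
from collections import defaultdict

# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

# One level of the store hierarchy: store -> region.
store_to_region = {"s1": "A", "s2": "B", "s3": "B"}

# Replace each store by its region, then sum over days.
by_region = defaultdict(int)
for prod, store, _day, amt in facts:
    by_region[(prod, store_to_region[store])] += amt
by_region = dict(by_region)

print(by_region)
```

A further level (region → country) would be handled the same way, with a second mapping applied to the region keys.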
Slicing
• Slicing means taking a slice out of a cube, given a certain set of selected dimension values
– e.g., sales where city = ‘Karachi’ and date = ‘20/1/2014’

Cube:
day 1
      s1   s2   s3
p1    12   -    50
p2    11   8    -
day 2
      s1   s2   s3
p1    44   4    -
p2    -    -    -

Slice for TIME = day 1:
      s1   s2   s3
p1    12   -    50
p2    11   8    -
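The slice for TIME = day 1 can be sketched as a filter that fixes one dimension and keeps the remaining two. The helper name slice_by_day is an illustrative choice.

```python
# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

def slice_by_day(rows, day):
    """Fix the time dimension to one value; the result is a
    2-D sub-cube keyed by (prodId, storeId)."""
    return {(p, s): amt for p, s, d, amt in rows if d == day}

day1 = slice_by_day(facts, 1)
print(day1)
```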
Dicing
• Dicing means viewing the slices from different angles.
– Example: revenue for different products within a given state, or revenue for different states for a given product
• Dicing is more of a zoom feature: it selects a subset over all the dimensions, but for specific values of each dimension
• One form of slicing and dicing is called pivoting
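Dicing in the sense of selecting a subset over all the dimensions can be sketched as a filter with one set of allowed values per dimension. The helper name dice and the chosen value sets are illustrative; the rows are a small stand-in fact table of (prodId, storeId, date, amt).

```python
# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

def dice(rows, products, stores, days):
    """Keep only the cells whose coordinates fall inside the given
    value sets; every dimension is still present in the result."""
    return [
        row for row in rows
        if row[0] in products and row[1] in stores and row[2] in days
    ]

# Sub-cube: product p1, stores s1 and s2, both days.
sub_cube = dice(facts, {"p1"}, {"s1", "s2"}, {1, 2})
print(sub_cube)
```

Unlike a slice, which removes a dimension by fixing it to a single value, the diced result keeps all three dimensions and only narrows the range of each.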
Dicing