OLAP Basics and Fundamentals by Bharat Kalia
DESCRIPTION
OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
TRANSCRIPT
ON-LINE ANALYTICAL PROCESSING
By: Bharat Kalia
2
OBJECTIVES
• What is OLAP
• Need for OLAP
• Features & functions of OLAP
• Different OLAP models
• OLAP implementations
3
DEMAND FOR OLAP
• There are three approaches to developing data marts (DM)
• In all approaches, data marts rest on the dimensional model
• Data marts are sufficient for basic data analysis
• Users need to go beyond such basic analysis
4
DEMAND FOR OLAP
• Need for Multidimensional Analysis
• Fast Access & Powerful Calculations
• Limitations of other analysis methods like:
o SQL
o Spreadsheets
o Report Writers
5
DEMAND FOR OLAP
• Traditional tools (report writers, query products, spreadsheets, and language interfaces) do not meet user expectations for multidimensional analysis with complex calculations.
• Tools used with OLTP and basic DW environments are not up to the task.
6
OLAP IS THE ANSWER!
OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
7
Why is OLAP useful?
• Facilitates multidimensional data analysis by pre-computing aggregates across many sets of dimensions
• Provides for:
o Greater speed and responsiveness
o Improved user interactivity
8
DATA WAREHOUSES
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube
• A data cube allows data to be modeled and viewed in multiple dimensions
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
9
LATTICE OF CUBOIDS
0-D (apex) cuboid: all
1-D cuboids: time | item | location | supplier
2-D cuboids: time,item | time,location | time,supplier | item,location | item,supplier | location,supplier
3-D cuboids: time,item,location | time,item,supplier | time,location,supplier | item,location,supplier
4-D (base) cuboid: time,item,location,supplier
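The lattice above can be enumerated mechanically: each cuboid is one subset of the dimension list. The following is a minimal sketch; the function name cuboid_lattice is an assumption made for illustration and does not come from the slides.

```python
from itertools import combinations

# Dimension names taken from the lattice slide.
dimensions = ["time", "item", "location", "supplier"]


def cuboid_lattice(dims):
    """Map k -> list of k-dimensional cuboids (each cuboid is a tuple of dimensions)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(dimensions)
for k, cuboids in lattice.items():
    # The empty tuple is the apex cuboid, shown as "all" on the slide.
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids: {names}")

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print("total:", total_cuboids)
```

With four dimensions the levels hold 1, 4, 6, 4, and 1 cuboids, which matches the five rows of the lattice.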
10
CUBE
Fact table view:
sale   prodId   storeId   date   amt
       p1       c1        1      12
       p2       c1        1      11
       p1       c3        1      50
       p2       c2        1      8
       p1       c1        2      44
       p1       c2        2      4
Multi-dimensional cube (dimensions = 3):
day 1     c1    c2    c3
   p1     12          50
   p2     11    8
day 2     c1    c2    c3
   p1     44    4
   p2
11
AGGREGATES
Fact table: sale(prodId, storeId, date, amt), as on the CUBE slide.
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
• Result: 81
12
AGGREGATES
Fact table: sale(prodId, storeId, date, amt), as on the CUBE slide.
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans    date   sum
       1      81
       2      48
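The two queries above can be checked against the six-row fact table. This sketch uses Python's built-in sqlite3 module with an in-memory database; the choice of SQLite and the column types are assumptions, since the slides do not name a database.

```python
import sqlite3

# In-memory database holding the sale fact table from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day (ORDER BY added so the row order is deterministic).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)
print(by_day)
```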
13
Aggregates
• Operators: sum, count, max, min, median, avg
• “Having” clause
• Using dimension hierarchy:
o average by region (within store)
o maximum by month (within date)
14
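CUBE AGGREGATION
Example: computing sums
day 1     c1    c2    c3
   p1     12          50
   p2     11    8
day 2     c1    c2    c3
   p1     44    4
   p2
Rollup over date:
          c1    c2    c3
   p1     56    4     50
   p2     11    8
Rollup over date and product:
          c1    c2    c3
   sum    67    12    50
Rollup over date and store:
          sum
   p1     110
   p2     19
Rollup over everything: 129
. . .
Rollup moves toward the totals; drill-down moves back toward the detailed cells.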
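The sums in this example can be reproduced with a small roll-up helper that keeps some dimensions and sums over the rest. This is a sketch only: roll_up and its keep parameter are hypothetical names, not part of any OLAP product.

```python
from collections import defaultdict

# Fact rows from the slides: (prodId, storeId, date, amt).
sale = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def roll_up(facts, keep):
    """Sum amt over all dimensions except the positions listed in keep."""
    totals = defaultdict(int)
    for *dims, amt in facts:
        key = tuple(dims[i] for i in keep)
        totals[key] += amt
    return dict(totals)


by_prod_store = roll_up(sale, keep=(0, 1))  # date summed away
by_store = roll_up(sale, keep=(1,))         # date and product summed away
by_prod = roll_up(sale, keep=(0,))          # date and store summed away
grand_total = roll_up(sale, keep=())        # everything summed away

print(by_prod_store)
print(by_store)
print(by_prod)
print(grand_total)
```

Drill-down is the reverse direction: going from by_store back to by_prod_store requires the finer-grained data, which is why it cannot be computed from the totals alone.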
15
CUBE OPERATORS
Cells of the aggregated cube are addressed as sale(store, product, date), where * stands for all values of that dimension:
• sale(c1,*,*) = 67
• sale(*,p1,*) = 110
• sale(c2,p2,*) = 8
• sale(*,*,*) = 129
. . .
16
EXTENDED CUBE
Each dimension is extended with the value * (all):
day 1     c1    c2    c3    *
   p1     12          50    62
   p2     11    8           19
   *      23    8     50    81
day 2     c1    c2    c3    *
   p1     44    4           48
   p2
   *      44    4           48
*         c1    c2    c3    *
   p1     56    4     50    110
   p2     11    8           19
   *      67    12    50    129
Example cell: sale(*,p2,*) = 19
17
AGGREGATION USING HIERARCHIES
Hierarchy: customer, region, country
(customer c1 in Region A; customers c2, c3 in Region B)
day 1     c1    c2    c3
   p1     12          50
   p2     11    8
day 2     c1    c2    c3
   p1     44    4
   p2
Aggregated by region (summed over date):
          region A   region B
   p1     56         54
   p2     11         8
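Aggregating along a hierarchy amounts to mapping each member to its parent before summing. The sketch below applies the customer-to-region mapping stated on the slide; the names REGION_OF and sales_by_region are assumptions made for illustration.

```python
from collections import defaultdict

# Fact rows from the slides: (prodId, storeId, date, amt).
sale = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# One level of the hierarchy: customer c1 in Region A; c2 and c3 in Region B.
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}


def sales_by_region(facts, region_of):
    """Sum amt per (product, region), rolling customers up to their region."""
    totals = defaultdict(int)
    for prod, store, _date, amt in facts:
        totals[(prod, region_of[store])] += amt
    return dict(totals)


by_region = sales_by_region(sale, REGION_OF)
print(by_region)
```

A further level (region to country) would be one more mapping applied in the same way.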
18
PIVOTING
Fact table view: sale(prodId, storeId, date, amt), as on the CUBE slide.
Multi-dimensional cube: one product-by-store table per day (day 1, day 2).
Pivoted 2-D view, product by store, summed over date:
          c1    c2    c3
   p1     56    4     50
   p2     11    8
19
CUBE AGGREGATES LATTICE
city, product, date
city, product | city, date | product, date
city | product | date
all
Example aggregates along one path of the lattice:
• city, product, date: the per-day tables (day 1, day 2)
• city, product: p1 = (c1 56, c2 4, c3 50); p2 = (c1 11, c2 8)
• city: c1 67, c2 12, c3 50
• all: 129
Use a greedy algorithm to decide what to materialize.
20
DIMENSION HIERARCHIES
Hierarchy (top to bottom): all, state, city
cities   city   state
         c1     CA
         c2     NY
21
DIMENSION HIERARCHIES
Lattice extended with the state level (not all arcs shown):
city, product, date
city, product | city, date | product, date
city | product | date
state, product, date
state, product | state, date
state
all
22
INTERESTING HIERARCHY
Levels: all, years, quarters, months, weeks, days
Conceptual dimension table:
time   day   week   month   quarter   year
       1     1      1       1         2000
       2     1      1       1         2000
       3     1      1       1         2000
       4     1      1       1         2000
       5     1      1       1         2000
       6     1      1       1         2000
       7     1      1       1         2000
       8     2      1       1         2000
23
SAMPLE CUBE
[Figure: 3-D sales cube. Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum. Product axis: TV, VCR, PC, sum. Country axis: U.S.A, Canada, Mexico, sum. Labeled cells: total annual sales of TV in U.S.A.; total annual sales of PC in U.S.A.; total annual sales of VCR in U.S.A.; total Q1 sales in U.S.A, in Canada, in Mexico, and in all countries; total Q2 sales in all countries; total sales in U.S.A, in Canada, and in Mexico; TOTAL SALES.]
24
OLAP Operations
• Roll-Up
• Drill-Down
• Slice & Dice
• Pivot
• Drill-Across
• Drill-Through
25
OLAP Operations
26
Slicing
27
Dicing (Sub-cube)
28
ROLL-UP
29
DRILL-DOWN
30
OTHER OLAP OPERATIONS
o Drill-Across: Queries involving more than one fact table
o Drill-Through: Makes use of SQL to drill through the bottom level of a data cube down to its back-end relational tables
o Pivot (rotate): Pivot (also called "rotate") is a visualization operation which rotates the data axes in view in order to provide an alternative presentation of the data. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube into a series of 2-D planes.
31
OTHER OLAP OPERATIONS
o Moving Averages
o Growth Rates
o Depreciation
o Currency Conversion
o Statistical Functions
o Top N or Bottom N queries
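Three of the operations listed above are simple enough to sketch directly. The helper names and the sample figures in the usage lines are hypothetical; only the p1 and p2 totals come from the earlier slides.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def growth_rates(values):
    """Period-over-period growth, as a fraction of the previous period."""
    return [(cur - prev) / prev for prev, cur in zip(values, values[1:])]


def top_n(totals, n):
    """The n largest (key, total) pairs, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Made-up quarterly sales figures, used only to exercise the helpers.
quarterly = [10, 20, 30, 40]
print(moving_average(quarterly, window=2))
print(growth_rates(quarterly))

# Product totals from the cube aggregation example.
print(top_n({"p1": 110, "p2": 19}, n=1))
```

A Bottom N query is the same sort with reverse=False.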
32
Conceptual vs. Actual
• The “cube” is a logical way of visualizing the data in an OLAP setting
• Not how the data is actually represented on disk
• Two ways of storing data:
o ROLAP: Relational OLAP
o MOLAP: Multidimensional OLAP
33
OLAP & CUBE
• Construction of the data cube is key to the operation of OLAP
• The computation process creates a set of aggregates on the various dimensions of the data
• The CUBE operator
34
AN EXAMPLE OF THE CUBE OPERATOR
35
The CUBE Operator
• Proposed by Gray et al.*
• Effectively involves a series of GROUP-BY operations to aggregate data
• Creates the power set on all attributes according to:
o A measure
o An aggregator function
*J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
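To make the power-set idea concrete, the sketch below runs one aggregation per subset of dimensions over the sale fact table from the earlier slides, marking each aggregated-away dimension with *. It is a naive illustration with sum as the aggregator function, not the algorithm from the paper; the function name cube is an assumption.

```python
from itertools import combinations

# Fact rows from the slides: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def cube(facts, n_dims):
    """Sum the measure for every subset of dimensions (2**n_dims group-bys)."""
    result = {}
    for k in range(n_dims + 1):
        for keep in combinations(range(n_dims), k):
            # One GROUP BY over the dimensions in `keep`.
            for *dims, measure in facts:
                key = tuple(
                    dims[i] if i in keep else "*" for i in range(n_dims)
                )
                result[key] = result.get(key, 0) + measure
    return result


sale_cube = cube(SALE, n_dims=3)
print(sale_cube[("*", "*", "*")])   # grand total
print(sale_cube[("p1", "*", "*")])  # all sales of p1
print(sale_cube[("p1", "c1", "*")]) # p1 at c1, all days
```

Keys follow the column order of the fact table (product, store, date), which differs from the sale(store, product, date) notation used on the cube slides.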
36
CUBING Problem
• Problem: this generates a lot of data and work (2^n sets in total, where n is the number of dimensions)
• Solution: optimized algorithms to run faster, consume less memory, and perform fewer I/Os.
37
Efficient Computation of Data Cubes
o ROLAP-based cubing algorithms (Agarwal et al. '96)
o Array-based cubing algorithm (Zhao et al. '97)
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan and S. Sarawagi. On the computation of multidimensional aggregates. In VLDB'96.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD'97.
38
Efficient Computation of Data Cubes
o How many cuboids in a cube with 3 dimensions?
o Answer: $2^3 = 8$
o The same number of group-by operations is needed.
o That count assumes no hierarchies are involved!
o With hierarchies, the total is $\prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels associated with dimension $i$
o 10 dimensions & 4 levels for each dimension: Total Cuboids = $5^{10}$
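The cuboid count is a one-line product, sketched here to show how quickly it grows. The function name total_cuboids is an assumption; the no-hierarchy case is modeled as one level per dimension, so each factor is 2.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Product of (L_i + 1) over all dimensions; the +1 is the 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension)


# No hierarchies: one level per dimension, so 3 dimensions give 2**3 cuboids.
print(total_cuboids([1, 1, 1]))

# 10 dimensions with 4 levels each, as on the slide: 5**10 cuboids.
print(total_cuboids([4] * 10))
```

The second case comes to nearly ten million cuboids, which is why full materialization is rarely practical.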
39
Approaches to OLAP Servers
• It is all about which DBMS you choose to store your data warehouse data
• RDBMS: ROLAP
• MDDB: MOLAP
• Both: HOLAP
40
Approaches to OLAP Servers
Three possibilities for OLAP servers
(1) Relational OLAP (ROLAP)
o Relational and specialized relational DBMS to store and manage warehouse data
o OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
o Array-based storage structures
o Direct access to array data structures
(3) Hybrid OLAP (HOLAP)
o Storing detailed data in RDBMS
o Storing aggregated data in MDBMS
o User access via MOLAP tools
41
ROLAP
• Special schema design: star, snowflake
• Special indexes: bitmap, multi-table join
• Proven technology (relational model, DBMS); tends to outperform specialized MDDB, especially on large data sets
• Products: IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
42
Defines complex, multi-dimensional data with simple model
Reduces the number of joins a query has to process
Allows the data warehouse to evolve with relatively low maintenance
Can contain both detailed and summarized data.
ROLAP is based on familiar, proven, and already selected technologies.
BUT!!! SQL must be used for multidimensional manipulation and calculations, a task it is poorly suited to.
ROLAP
43
MOLAP
• MDDB: a special-purpose data model
• Facts stored in multi-dimensional arrays
• Dimensions used to index array
• Sometimes on top of relational DB
• Products: Pilot, Arbor Essbase, Gentia
44
MOLAP
• Pre-calculating or pre-consolidating transactional data improves speed.
• BUT: to fully pre-consolidate incoming data, MDDs require an enormous amount of overhead, both in processing time and in storage. An input file of 200MB can easily expand to 5GB.
• MDDBs are great candidates for departmental data marts under 100GB.
• With MDDs, application design is essentially the definition of dimensions and calculation rules, while the RDBMS requires that the database schema be a star or snowflake.
45
OLAP Needs
User Needs
• Multidimensional View
• Excellent Performance
• Analytical Flexibility
• Real-Time Data Access
• High Data Capacity
MIS Needs
• Leverages Data Warehouse
• Easy Development
• Low Structure Maintenance
• Low Aggregate Maintenance
46
OLAP Needs: User Needs
Multidimensional View
• All true OLAP tools, whether they work with an MDDB or an RDBMS, provide a multidimensional view of data.
• For example, decision makers may view sales by office, quarter, representative, product, etc. This perspective on data, which mirrors the way business professionals think, allows for more intuitive and more powerful analysis.
47
OLAP Needs: User Needs
Excellent Performance
• The performance of your decision support tool directly depends on the way it manages aggregates.
• RDBMS:
o Calculate aggregates on the fly (response time suffers)
o DBA creates summary tables to store aggregates (enormous amount of disk space)
48
OLAP Needs: User Needs
Excellent Performance
• For example, suppose you have a Sales indicator with six dimensions: Representatives, Products, Customers, Regions, Months, and Years.
• MOLAP tools will store a given aggregate, such as the November 1997 government sales of product A504 by representative 1040 in New York, in 1 cell of the MDDB.
• In contrast, ROLAP tools consume 600% more space, because they require a record of seven values (six foreign keys and the actual aggregate) in a relational summary table.
49
Excellent Performance
OLAP Needs: User Needs
50
Excellent Performance
OLAP Needs: User Needs
RDBMSs must use several summary tables to store the aggregates that a MOLAP could store in just one cube. For example, consider a Sales indicator with three dimensions: Months, Regions, and Products. The indicator cube will contain seven sets of aggregates:
• Sales by month
• Sales by product
• Sales by region
• Sales by month and product
• Sales by month and region
• Sales by product and region
• Sales by product, month, and region
To store these aggregates in an RDBMS, you’d have to create seven summary tables, one for each aggregate set.
HOW MANY SUMMARY TABLES FOR 6 DIMENSIONS? (Separate fact table and shrunken dimension table approach for storing aggregates)
51
Excellent Performance
OLAP Needs: User Needs
• Huge amounts of extra storage space are required (even if there is no sparsity failure)
• Maintenance costs are high
• A lot of statistical analysis needs to be done to decide which aggregates are to be precomputed
• DBA must keep the cost/performance ratio in check
52
Excellent Performance
OLAP Needs: User Needs
• In contrast, we’ve seen that multidimensional databases store aggregates in a very compact structure that consumes very little disk space and requires very little maintenance
• All levels of consolidation can therefore be precomputed and stored in MDDB
• As a result, fast response time is not limited to the most frequently accessed queries; all aggregates can be accessed with lightning speed.
53
OLAP Needs: User Needs
Analytical Flexibility
• Both ROLAP & MOLAP tools offer comparable performance for:
o Comparative Analysis
o Roll-up and Drill-down
o Slicing & Dicing
• Only MOLAP tools offer ‘what-if’ analysis
54
OLAP Needs: User Needs
Real-Time Data Access
• MOLAP tools load data into the multidimensional cubes. Consequently, the data being accessed is only as recent as the last load.
• Some applications require real-time data access
• Continually refreshing the data raises the cost of operating a MOLAP system
• Some MOLAP tools offer reach-through functionality to access volatile data stored outside the MDDB
• Unfortunately, users must be aware of the underlying database structure
• Relational data access is too complex for the typical user
55
OLAP Needs: User Needs
Real-Time Data Access
• ROLAP tools maintain a constant link to the operational RDBMS, which provides users with up-to-the-minute, accurate data (Real-Time Data Warehousing)
• Industries & organizations with highly volatile data particularly benefit from this access to live, operational data.
56
OLAP Needs: User Needs
High Capacity Data
• MOLAP products are limited by the size of the cube defined by the multidimensional view. When dimension elements are predefined, the scope of available data is limited at the outset.
• ROLAP tools circumvent this barrier. Dynamic dimensions are not stored in the predefined multidimensional model, but fetched at run time from the RDBMS.
57
High Capacity Data
OLAP Needs: User Needs
o In MOLAP, only aggregates are stored in the cube. Atomic, operational data are forced out of the user’s analytical realm.
o ROLAP systems can access extremely detailed operational data, as well as aggregated data stored in summary tables.
58
OLAP Needs
MIS Needs
• Administrators should be able to leverage their existing relational databases without devoting large amounts of time and effort to intricate development, fine tuning, or intensive maintenance.
59
OLAP Needs: MIS Needs
Leveraging Data Warehouse
• Both the finance and the MIS departments of your organization will appreciate a decision support tool that leverages existing investments in data warehousing.
• MIS staff who opt for a MOLAP tool must duplicate data in the tool's own proprietary MDDB.
• MIS staff who choose a ROLAP tool will be able to access the data warehouse directly.
60
OLAP Needs: MIS Needs
Easy Development
• MOLAP development is straightforward: it requires no fine tuning and creates its own aggregates.
• ROLAP tools, on the other hand, require a specific schema for the relational database.
• Skilled DBAs must provide the appropriate schema (star or snowflake schema), tune the database, and create the appropriate summary tables.
• However, many ROLAP tools are metadata-driven, which means the multidimensional view is generated and maintained more easily.
61
OLAP Needs: MIS Needs
Low Structure Maintenance
• The structure of a MOLAP tool’s underlying MDDB greatly depends on each of its dimensions. When one dimension changes, the entire MDDB must be re-structured.
• Multi-matrix MDDBs reduce the maintenance burden
• ROLAP systems do not store data in a proprietary structure.
• They build and maintain a constant link between the multidimensional view and the underlying RDBMS using the metadata.
• No database restructuring is required.
62
OLAP Needs: MIS Needs
Low Aggregate Maintenance
• MOLAP tools automatically create high-level aggregates based on your lower-level MDDB data and aggregate definitions.
• When data is updated, the aggregates are automatically updated and stored in the MDDB.
• With ROLAP tools, MIS staff must continually monitor the use of summary tables to keep their cost/performance ratio in check.
• DBAs inevitably use sophisticated statistics to isolate only the most frequently accessed aggregates, and store them in summary tables.
• These tables leave ROLAP administrators with a heavy maintenance burden.
63
ROLAP vs. MOLAP
64
ROLAP vs. MOLAP
1) Performance:
• How fast will the system appear to the end-user?
• MDD server vendors believe this is a key point in their favor.
2) Data volume and scalability:
• While MDD servers can handle up to 100GB of storage, RDBMS servers can handle hundreds of gigabytes and terabytes.
65
o Best of both worlds
o Storing detailed data in RDBMS
o Storing aggregated data in MDBMS
o User access via MOLAP tools
Hybrid OLAP - HOLAP
66
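HOLAP
[Figure: HOLAP architecture. The client has a multidimensional viewer and a relational viewer. An MDBMS server holds multidimensional (derived) data and is reached through multidimensional access. An RDBMS server holds user data and metadata and is reached through SQL-Read and SQL-Reach Through.]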
67
ROLAP, MOLAP, or HOLAP
IF
A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is 60-90 days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You’re developing a general-purpose application for inventory movement or assets management
THEN
Consider an MDD/MOLAP solution for your data mart.
IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN
Consider an RDBMS/ROLAP solution for your data mart.
IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN
Consider a HOLAP solution for your data mart.
68
Conclusions
• ROLAP: RDBMS -> star/snowflake schema
• MOLAP: MDDB -> Cube structures
• ROLAP or MOLAP: the data models used play a major role in performance differences
• MOLAP: for summarized and relatively smaller volumes of data (100GB)
• ROLAP: for detailed and larger volumes of data
• Both storage methods have strengths and weaknesses
• The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.
• HOLAP is emerging as the OLAP server of choice