March 30 2001 DGRC FedStats Visit
Aggregation in Main Memory
Kenneth A. Ross
Columbia University
Research Experience
• Complex query processing
• Data Warehousing
• Main memory databases
Students: Kazi Zaman, Junyan Ding
Scenario A
[Diagram labels: User; Query; Mediator; Unified Results; Main-Memory DBMS; Traditional DBMS; ...]
Scenario B
[Diagram labels: User; Sequence of Interactive Queries; Main Memory DB; Data Request; Mediator; Unified Results; Web; Traditional DBMS; ...]
Scenario C
[Diagram labels: User; Graphical User Interface; Dynamic Query; Main Memory DB; Data Request; Mediator; Unified Results; Web; Traditional DBMS; ...]
Outline
• Introduction to Datacubes
• Frameworks for querying cubes
• The Main Memory based framework
• Experimental Results
• Conclusions and Plan
The CUBE BY Operator
Input relation:
State  Year  Grade    Sales
CA     1997  Regular  90
NY     1997  Premium  70
CA     1998  Premium  65
NY     1998  Premium  95
CUBE BY (sum Sales)
Result (the original records plus additional records containing ALL):
State  Year  Grade    Sales
CA     1997  Regular  90
CA     1997  ALL      90
ALL    1997  Regular  90
CA     ALL   Regular  90
ALL    1997  ALL      160
ALL    ALL   Regular  90
CA     ALL   ALL      155
…
ALL    ALL   ALL      320
Large increase in total size, especially with many dimensions.
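The effect of CUBE BY on the four input records can be sketched in a few lines of Python. This is only an illustration of the operator's semantics, not the implementation discussed in the talk; the function name `cube_by_sum` and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"

# The four input records from the slide: (State, Year, Grade, Sales).
records = [
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
]

def cube_by_sum(rows, num_dims):
    """Return {key: sum} for every combination of concrete values and ALL."""
    cube = defaultdict(int)
    for row in rows:
        dims, measure = row[:num_dims], row[num_dims]
        # Each record contributes to 2**num_dims cube records: every way of
        # keeping a dimension value or replacing it with ALL.
        for mask in product((False, True), repeat=num_dims):
            key = tuple(ALL if hide else value for value, hide in zip(dims, mask))
            cube[key] += measure
    return dict(cube)

cube = cube_by_sum(records, 3)
print(cube[("CA", ALL, ALL)])   # the slide's CA ALL ALL record
print(cube[(ALL, 1997, ALL)])   # the slide's ALL 1997 ALL record
print(cube[(ALL, ALL, ALL)])    # the grand total
print(len(records), "input records ->", len(cube), "cube records")
```

The last line makes the size increase visible: every input record feeds 2^3 = 8 cube records, although records that share values collapse into the same key.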
Lattice Representation
State, Year, Grade
State, Year | State, Grade | Year, Grade
State | Year | Grade
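The lattice levels can be enumerated mechanically. The sketch below is an illustration only; the helper names `lattice_levels` and `parents` are made up for the example. It also lists the empty grouping (the single ALL, ALL, ALL record), which sits below the three levels shown on the slide.

```python
from itertools import combinations

dimensions = ("State", "Year", "Grade")

def lattice_levels(dims):
    """List the group-bys level by level, from finest to coarsest."""
    return [list(combinations(dims, size)) for size in range(len(dims), -1, -1)]

def parents(cuboid, dims):
    """Group-bys one level finer, from which `cuboid` can be computed."""
    return [tuple(d for d in dims if d in cuboid or d == extra)
            for extra in dims if extra not in cuboid]

for level in lattice_levels(dimensions):
    print(" | ".join(", ".join(c) if c else "(no grouping)" for c in level))

print(parents(("State",), dimensions))
```

Any group-by can be computed from any of its ancestors in the lattice, and in particular from the finest one (State, Year, Grade).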
Modeling Queries
Slice queries ask for a single aggregate record:
    SELECT State, Year, SUM(Sales)
    FROM "BLS-12345"
    GROUP BY State, Year
    HAVING State = 'NY' AND Year = '1998'
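In cube terms, the slice query above asks for the one record with key (NY, 1998, ALL). A minimal sketch, assuming the four sample records from the CUBE BY slide stand in for the queried table (its real contents are not given), and with `slice_query` as a made-up helper name:

```python
ALL = "ALL"

# Hypothetical stand-in for the queried table: (State, Year, Grade, Sales).
records = [
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
]

def slice_query(rows, key):
    """Sum Sales over the rows matching `key`; ALL matches any value.

    Returns None when no row matches, i.e. the requested group is empty.
    """
    matches = [row[-1] for row in rows
               if all(k == ALL or k == v for k, v in zip(key, row))]
    return sum(matches) if matches else None

# State = NY, Year = 1998, Grade aggregated away.
print(slice_query(records, ("NY", 1998, ALL)))
```

This version scans every record; the frameworks on the following slides are about avoiding that scan.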
Existing Frameworks
State, Year, Grade
State, Year | State, Grade | Year, Grade
State | Year | Grade
Choose a subset of the cube to materialize based on workload. Materialize on disk.
Appropriate record recovered or computed for incoming slice query.
Drawbacks: Ignores clustering of the relation on disk. Smallest unit of materialization is too big.
Our approach
State, Year, Grade
State, Year | State, Grade | Year, Grade
State | Year | Grade
The full cube is often larger than available memory, but ...
The finest granularity aggregate may fit.
Any record can be computed without having to go to disk.
How should the finest granularity be organized?
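A rough upper bound shows why the full cube can outgrow memory while the finest granularity fits. The sketch assumes the dataset size quoted on the Experimental Results slide (10^6 records, 8 dimensions); the function name is made up, and the bound is a worst case, since records that share values collapse into fewer cube records.

```python
def cube_size_upper_bound(num_records, num_dims):
    """Each base record feeds 2**num_dims group-bys, so the full cube
    holds at most num_records * 2**num_dims records."""
    return num_records * 2 ** num_dims

finest = 10**6                          # at most one record per input record
full = cube_size_upper_bound(finest, 8)
print(full // finest)                   # worst-case blow-up factor
```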
Framework
Query q
Level-1 Store: selected coarse records in hash table
Level-2 Store: finest granularity cuboid; slot directory; records in linked lists
The Level-1 Store
Records are <Key,Value> pairs stored in a hash table.
Records can contain ALL values.
Given query Q, form composite key and check level-1 store (constant time).
If not found, use level-2 store.
Key  Value
a1   55
b2   34
c2   12
…    ...
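The level-1 lookup can be sketched with a Python dict playing the role of the hash table. The stored records below are hypothetical picks from the sales example, and `lookup_level1` is a made-up name; which coarse records the real store holds is not specified here.

```python
ALL = "ALL"

# Hypothetical level-1 contents: a few coarse records chosen for illustration.
level1 = {
    ("CA", ALL, ALL): 155,
    (ALL, 1997, ALL): 160,
    (ALL, ALL, ALL): 320,
}

def lookup_level1(store, query):
    """Form the composite key and probe the hash table.

    Returns (found, value); when found is False the caller falls back
    to the level-2 store.
    """
    key = tuple(query)
    if key in store:
        return True, store[key]
    return False, None

print(lookup_level1(level1, ("CA", ALL, ALL)))
print(lookup_level1(level1, ("NY", 1998, ALL)))
```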
The Level-2 Store
Holds the finest granularity cuboid: a slot directory with records in linked lists.
Slot directory is organized as a multidimensional array: level2[sz1][sz2][sz3][sz4]
Each slot points to a linked list of elements.
Records placed according to set of mapping functions H.
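Record placement can be sketched as follows, under several assumptions: three dimensions instead of four, one mapping function per dimension, Python lists standing in for the linked lists, and a toy mapping function `slot_of`. The actual functions H and slot counts are not given on the slide.

```python
def make_directory(sizes):
    """Build a nested list level2[sz1][sz2]... whose leaves are empty lists."""
    if not sizes:
        return []  # a leaf: the record list of one slot
    return [make_directory(sizes[1:]) for _ in range(sizes[0])]

def slot_of(value, size):
    """Hypothetical mapping function: deterministic value -> [0, size)."""
    return sum(map(ord, str(value))) % size

def bucket_for(directory, sizes, dims):
    """Follow one slot per dimension down to the record list."""
    node = directory
    for value, size in zip(dims, sizes):
        node = node[slot_of(value, size)]
    return node

def insert(directory, sizes, dims, measure):
    """Add a record, merging with an existing record of the same key."""
    bucket = bucket_for(directory, sizes, dims)
    for entry in bucket:
        if entry[0] == dims:
            entry[1] += measure
            return
    bucket.append([dims, measure])

sizes = (2, 2, 2)  # hypothetical slot counts sz1, sz2, sz3
level2 = make_directory(sizes)
for state, year, grade, sales in [
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
]:
    insert(level2, sizes, (state, year, grade), sales)

print(bucket_for(level2, sizes, ("CA", 1997, "Regular")))
```

Several attribute values may share a slot, so a list can hold records with different keys; lookups therefore still compare the stored key.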
Using the Level-2 store
Query Q without ALL values
Attribute value:  a3      b4      c2      d5
Mapped slot:      Slot 4  Slot 3  Slot 7  Slot 1
Access the list denoted by level2[4][3][7][1]; aggregate the records matching (a3,b4,c2,d5).
Using the Level-2 store
Query Q with ALL values
Attribute value:  a3      ALL            c2      ALL
Mapped slot:      Slot 4  List of Slots  Slot 7  List of Slots
Access the lists matching level2[4][*][7][*]; aggregate the records matching (a3,*,c2,*).
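Both cases, with and without ALL values, can be sketched in one lookup routine. Everything here is illustrative: the directory is a dict keyed by slot tuples rather than a true multidimensional array, `slot_of` is a toy mapping function, the slot counts are made up, and the input rows are assumed to be at the finest granularity already.

```python
from itertools import product

ALL = "ALL"
SIZES = (2, 2, 2)  # hypothetical slot counts per dimension

def slot_of(value, size):
    """Hypothetical mapping function from an attribute value to a slot."""
    return sum(map(ord, str(value))) % size

def build_level2(rows):
    """Place finest-granularity records into lists keyed by slot tuple."""
    level2 = {}
    for *dims, measure in rows:
        slots = tuple(slot_of(v, s) for v, s in zip(dims, SIZES))
        level2.setdefault(slots, []).append((tuple(dims), measure))
    return level2

def answer(query, level1, level2):
    """Check level 1 first; otherwise scan the matching level-2 lists."""
    if query in level1:
        return level1[query]
    # One slot for a concrete value, every slot for ALL.
    choices = [range(size) if v == ALL else [slot_of(v, size)]
               for v, size in zip(query, SIZES)]
    total, found = 0, False
    for slots in product(*choices):
        for dims, measure in level2.get(slots, []):
            # Several values can share a slot, so re-check the values.
            if all(q == ALL or q == d for q, d in zip(query, dims)):
                total += measure
                found = True
    return total if found else None

level2 = build_level2([
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
])
print(answer(("CA", 1997, "Regular"), {}, level2))  # no ALL: one list
print(answer(("CA", ALL, ALL), {}, level2))         # ALL: several lists
```

The more ALL values a query has, the more lists it must visit, which is why caching selected coarse records in the level-1 store pays off.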
Demo
Shows multidimensional dataset (subset of columns of 5% Census sample for NY in 1990).
User asks queries: fast answers.
Future: User Interface asks many queries, with display changing interactively.
Experimental Results
Query Processing Time vs Additional Memory Used (real dataset, 10^6 records, 8 dimensions)
[Plot: Query Cost; x-axis: Additional Memory Used in MB (0 to 80); y-axis: Average time per query in milliseconds (0 to 15)]
Scanning all records takes 194 ms.
Importance of Work
• Aggregation is fundamental to analysis.
• Make analysis interactive, even for many dimensions.
• Make a variety of aggregate granularities available, where possible.
Contributions
A Main Memory based framework for answering datacube queries efficiently.
Query performance in the 2-4 ms range, which is faster than going to disk.
Plan
• Integrate with user interface to generate dynamic queries.
• Self-tuning capability.
• Multiple data sets.
• Work with agencies to generate value
  – For intra-agency analysis
  – For enhanced data dissemination