adbms seminar report
Iceberg Query evaluation using bitmap indexes
PUNE INSTITUTE OF COMPUTER TECHNOLOGY, DEPARTMENT OF COMPUTER ENGINEERING
A Seminar Report
on
EFFICIENT ICEBERG QUERY EVALUATION USING BITMAP INDICES
By
Student Name: Om Pawar
Roll No: 3253
Class: TE
Guided By
Internal Guide Name: Prof. A. Phakatkar
Computer Engineering Department
Academic Year: 2012-2013
P: F-SMR-UG/08/R0
CERTIFICATE
This is to certify that Mr./Miss. Om Dilip Pawar, Roll No. 3253, a
student of T.E. (Computer Engineering Department), Batch
2012-13, has satisfactorily completed a seminar report on
“Efficient Iceberg Query Evaluation Using Compressed Bitmap
Index” under the guidance of Prof. A. Phakatkar towards the
partial fulfillment of the Third Year Computer Engineering,
Semester II of the Pune University.
------------------          ----------------------
Internal Guide              Head of Department,
                            Computer Engineering
Date:-
Place:-
Abstract:
Decision support and knowledge discovery systems often compute aggregate
values of interesting attributes by processing a huge amount of data in very large
databases and/or warehouses. An iceberg query is a special type of aggregation query that
computes aggregate values above a user-provided threshold. Most existing iceberg
query processing algorithms do not take advantage of the small-result-set property and
rely heavily on the tuple-scan-based approach. This incurs intensive disk accesses and
computation, resulting in long processing times, especially when the data size is large.
The bitmap index, which builds one bitmap vector for each attribute value, has been
gaining popularity in both column-oriented and row-oriented databases in recent years.
It occupies less space than the raw data and gives opportunities for more efficient query
processing. Bitmap indices can also leverage the antimonotone property of iceberg
queries to enable aggressive index pruning strategies. The index-pruning-based
approach introduced in this paper eliminates the need to scan and process the entire
data set (table) and thus speeds up iceberg query processing significantly. Experiments
show that this approach is much more efficient than existing algorithms commonly used
in row-oriented and column-oriented databases.
Keywords:
Iceberg query, Bitmap index, Dynamic Pruning
INTRODUCTION
Business insight and knowledge discovery from operational data are powerful
weapons for gaining competitive advantages in the modern business world. To discover
business insights, analysts often compute aggregate values over one or more attributes in
large databases (warehouses). An iceberg query [4] is a special class of aggregation query
that computes aggregate values above a given threshold. It is of special interest to users, as
high-frequency events or high aggregate values often carry more important information.
The general form of an iceberg query on a relation R(C1, C2, ..., Cn) is:
SELECT Ci, Cj, ..., Cm, AGG(*)
FROM R
GROUP BY Ci, Cj, ..., Cm
HAVING AGG(*) >= T
Such queries are called iceberg queries because the number of results above the
threshold is often very small (the tip of an iceberg) relative to the large amount of input
data (the iceberg): with the threshold constraint, an iceberg query usually returns only a
very small percentage of the distinct groups. Because of the small result set, iceberg
queries can potentially be answered quickly even over a very large data set. However,
current database systems and approaches do not fully take advantage of this property.
Current relational database systems use general aggregation algorithms to answer
iceberg queries: they first aggregate all tuples and then evaluate the HAVING clause to
select the iceberg result. For large data sets, multipass aggregation algorithms are used
when the full aggregate result cannot fit in memory (even when the final iceberg result is
small). Most existing query optimization techniques for processing iceberg queries [4]
can be categorized as the tuple-scan-based approach, which requires at least one table scan
to read data from disk.
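The aggregate-then-filter strategy described above can be sketched as follows. This is only a minimal in-memory illustration, not any particular database system's implementation. The function name iceberg_by_scan is hypothetical, AGG is assumed to be COUNT, and the rows are the (A, B) columns of the example table R used later in this report.

```python
from collections import Counter

def iceberg_by_scan(rows, threshold):
    """Tuple-scan baseline: aggregate every group, then apply the HAVING filter."""
    counts = Counter(rows)  # one counter per distinct group, however rare it is
    return {group: n for group, n in counts.items() if n >= threshold}

# (A, B) columns of the example table R, one tuple per row.
rows = [("A2", "B2"), ("A1", "B3"), ("A2", "B1"), ("A2", "B2"),
        ("A1", "B3"), ("A2", "B1"), ("A2", "B2"), ("A2", "B1"),
        ("A1", "B3"), ("A2", "B2"), ("A3", "B1"), ("A3", "B1")]

result = iceberg_by_scan(rows, threshold=3)
print(result)
```

Every tuple is read and every group is counted, including groups such as (A3, B1) that can never reach the threshold. This is the cost that the index-pruning approach tries to avoid.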
Iceberg queries can be evaluated efficiently using bitmap indices. Bitmap indices
provide a vertical organization of a column using bitmap vectors, and they operate on bits
rather than real tuple values. Bitwise operations are very fast to execute and can often be
accelerated by hardware.
RELATED WORK
Iceberg query processing was first defined and studied by Fang et al. in 1998 [4].
They proposed the Hybrid and Multibuckets algorithms by extending previously proposed
probabilistic techniques. A sampling/bucketing method is used to predict valid groups, with
possible false positives and false negatives. Further strategies then correct these false
positives and false negatives to retrieve the exact result.
Studies in data warehousing have addressed computing the iceberg cube, which computes and
materializes the cells of a data cube that satisfy a specified condition. These works focus on
selecting a proper order of computing aggregations over all combinations of aggregate
attributes, to maximize sharing of the computation. The focus of answering iceberg queries is
to speed up the processing of a single iceberg query, whereas the focus of computing iceberg
cubes is to maximize the shared computation and so shorten the cube generation time.
Efficient iceberg query answering algorithms are therefore still necessary, and they can in
turn be leveraged to generate iceberg cubes more efficiently.
Bitmap indices are known to be efficient, especially for read-mostly or append-only data,
and are commonly used in data warehousing applications and column stores. Various
compression schemes for bitmap indices have been developed. Word-Aligned Hybrid (WAH) [5]
and Byte-aligned Bitmap Code (BBC) are two important compression schemes that can be
applied to any column and be used in query processing without decompression.
Model 204 was the first commercial product to make extensive use of the bitmap
index. Early bitmap indices were used to implement inverted files. In data warehouse
applications, bitmap indices have been shown to perform better than tree-based index
schemes, such as the variants of the B-tree or R-tree. Compressed bitmap indices are widely
used in column-oriented databases such as C-Store, which contributes to the performance
gain of column databases over row-oriented databases.
The development of bitmap compression methods and encoding strategies has further
broadened the applicability of the bitmap index. Nowadays, it can be applied to all types of
attributes (e.g., high-cardinality categorical attributes, numeric attributes and text
attributes). However, the bitmap index is not effectively leveraged in existing works to
process iceberg queries. In this paper, a novel iceberg query processing algorithm using
bitmap indices is introduced and shown to be highly effective.
PROGRAMMER’S DESIGN
BITMAP INDEX AND ITS COMPRESSION
A bitmap for an attribute (column) of a table can be viewed as a v × r matrix, where v
is the number of distinct values of the column and r is the number of tuples (rows) in the
table. Each value in the column corresponds to a bitmap vector of length r, in which the kth
position of the vector is 1 if this value appears in the kth row and 0 otherwise.
e.g.:

A    B    C
A2   B2   1.23
A1   B3   2.34
A2   B1   5.36
A2   B2   8.36
A1   B3   3.27
A2   B1   9.45
A2   B2   6.23
A2   B1   1.98
A1   B3   8.23
A2   B2   0.11
A3   B1   3.44
A3   B1   2.08
(a) Table R

A1   A2   A3        B1   B2   B3
0    1    0         0    1    0
1    0    0         0    0    1
0    1    0         1    0    0
0    1    0         0    1    0
1    0    0         0    0    1
0    1    0         1    0    0
0    1    0         0    1    0
0    1    0         1    0    0
1    0    0         0    0    1
0    1    0         0    1    0
0    0    1         1    0    0
0    0    1         1    0    0
(b) Bitmap indices of A, B
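As a rough illustration of the v × r matrix view, the following sketch builds an uncompressed bitmap index for column A of table R. It assumes each bitmap vector is stored as a Python integer whose bit k stands for row k (rows counted from 0). The names build_bitmap_index and as_bits are hypothetical helpers, not part of any library.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit k is 1 iff row k holds that value."""
    index = {}
    for k, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << k)
    return index

def as_bits(vector, r):
    """Render a bitmap vector of length r as a string, row 0 first."""
    return "".join("1" if (vector >> k) & 1 else "0" for k in range(r))

# Column A of table R, top row first.
col_a = ["A2", "A1", "A2", "A2", "A1", "A2", "A2", "A2", "A1", "A2", "A3", "A3"]
index_a = build_bitmap_index(col_a)

for value in sorted(index_a):
    print(value, as_bits(index_a[value], len(col_a)))
```

Each printed string is one column of the bitmap index of A read from top to bottom, and the number of 1 bits in a vector is the number of rows holding that value.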
DYNAMIC PRUNING
With bitmap indices, it is easy to calculate the total occurrences of a single value
(using its bitmap vector) without accessing other data. The antimonotone property can be
leveraged to quickly prune bitmap vectors that will not produce valid iceberg results.
First, we introduce a new bitwise-AND operation, which carries out the following
three actions in one bitwise-AND operation between vectors X and Y:
Z = X AND Y
X = X XOR Z
Y = Y XOR Z
Besides generating the resulting vector Z of the bitwise-AND operation, the operation also
sets the 1 bit in the original vectors to 0, if the corresponding bit in the resulting vector is 1.
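The three actions can be sketched on integer-backed bit vectors as below. This is a minimal illustration under the assumption that vectors are uncompressed Python integers. The names pruning_and and from_rows are hypothetical, and the function returns the updated vectors instead of modifying them in place.

```python
def pruning_and(x, y):
    """Return (Z, X', Y') where Z = X AND Y, X' = X XOR Z and Y' = Y XOR Z."""
    z = x & y
    return z, x ^ z, y ^ z

def from_rows(rows):
    """Hypothetical helper: build a bit vector with a 1 at each listed row."""
    vector = 0
    for k in rows:
        vector |= 1 << k
    return vector

a2 = from_rows([0, 2, 3, 5, 6, 7, 9])  # vector of A2 in table R (rows counted from 0)
b2 = from_rows([0, 3, 6, 9])           # vector of B2 in table R
z, a2_rest, b2_rest = pruning_and(a2, b2)
```

Here z holds the four rows of group (A2, B2), a2_rest keeps only the three rows of A2 that have not been matched yet, and b2_rest is empty, so B2 needs no further AND operations.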
After each bitwise-AND operation, the dynamic pruning strategy adds an extra pruning
step of monitoring the number of remaining 1s in both bitmap vectors involved. If the
number of 1 bits of a modified vector becomes smaller than the iceberg threshold, this vector
can be pruned. That is, no further AND operation is necessary for this vector. With dynamic
pruning, the number of AND operations can be reduced effectively, since the iceberg
threshold is usually large.
The dynamic pruning strategy works fine for attributes with a relatively small number
of unique values. However, its performance degrades severely for attributes with many
unique values because of the empty bitwise-AND results problem: with the dynamic index
pruning strategy alone, many of the bitwise-AND operations produce empty results, that is,
the resulting bitmap vector contains no 1 bits. Such bitwise-AND operations are fruitless in
two aspects:
1) They do not produce a valid iceberg result.
2) They do not reduce the number of 1 bits in the original vectors for index pruning purposes.
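The following sketch shows what dynamic pruning alone might look like for two attributes with COUNT as the aggregate, and counts how many AND operations come back empty. It is an illustrative reading of the strategy, not the authors' code. The names iceberg_dp and build_bitmap_index are hypothetical, and vectors are assumed to be uncompressed Python integers.

```python
def popcount(v):
    return bin(v).count("1")

def build_bitmap_index(column):
    index = {}
    for k, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << k)
    return index

def iceberg_dp(index_a, index_b, threshold):
    """Try all vector pairs, pruning any vector whose remaining 1 bits drop below T."""
    vec_a = {k: v for k, v in index_a.items() if popcount(v) >= threshold}
    vec_b = {k: v for k, v in index_b.items() if popcount(v) >= threshold}
    results, and_ops, empty_ops = {}, 0, 0
    for a in vec_a:
        for b in vec_b:
            if popcount(vec_a[a]) < threshold:
                break        # a has been pruned: no further ANDs for it
            if popcount(vec_b[b]) < threshold:
                continue     # b has been pruned
            z = vec_a[a] & vec_b[b]
            and_ops += 1
            if z == 0:
                empty_ops += 1   # fruitless: no result, no bits removed
                continue
            vec_a[a] ^= z
            vec_b[b] ^= z
            if popcount(z) >= threshold:
                results[(a, b)] = popcount(z)
    return results, and_ops, empty_ops

col_a = ["A2", "A1", "A2", "A2", "A1", "A2", "A2", "A2", "A1", "A2", "A3", "A3"]
col_b = ["B2", "B3", "B1", "B2", "B3", "B1", "B2", "B1", "B3", "B2", "B1", "B1"]
results, and_ops, empty_ops = iceberg_dp(
    build_bitmap_index(col_a), build_bitmap_index(col_b), threshold=3)
```

On table R with threshold 3, this sketch performs four AND operations, one of which (A2 with B3) is empty. With only three distinct values per attribute the waste is small, but the number of candidate pairs grows with the product of the numbers of distinct values.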
VECTOR ALIGNMENT
To overcome the empty bitwise-AND results problem, the vector alignment algorithm
was developed. For the dynamic pruning algorithm, the worst-case bound on the number of
bitwise-AND operations needed equals the product of the numbers of distinct values of all
aggregate attributes, which can be much larger than the number of tuples.
Definition:
First 1-bit position: It refers to the position of the first 1-bit in a bitmap vector.
Definition:
Vector alignment: Two bitmap vectors are aligned if their first 1-bit positions are the same.
If two vectors are aligned, their bitwise-AND result will not be empty, because they
have at least one overlapping position.
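The two definitions can be sketched as follows, assuming again that a bitmap vector is an uncompressed Python integer with bit k standing for row k. The names first_one_bit and aligned are hypothetical.

```python
def first_one_bit(vector, start=0):
    """Position of the first 1 bit at or after `start`, or None if there is none."""
    shifted = vector >> start
    if shifted == 0:
        return None
    lowest = shifted & -shifted          # isolates the lowest set bit
    return start + lowest.bit_length() - 1

def aligned(x, y):
    """Two vectors are aligned if their first 1-bit positions are the same."""
    px, py = first_one_bit(x), first_one_bit(y)
    return px is not None and px == py

x, y = 0b1100, 0b0100   # both have their first 1 bit at position 2
w = 0b1000              # first 1 bit at position 3
```

Because aligned vectors share at least their first 1-bit position, x & y is guaranteed to be non-zero whenever aligned(x, y) holds.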
1. For each aggregate attribute, a priority queue of its bitmap vectors, prioritized by
their first 1-bit positions, is built. Then the top bitmap vector of each priority queue is
taken and the two are checked for alignment. If they are aligned, the result of the
bitwise-AND operation between them will not be empty, so a bitwise-AND operation is
carried out and the dynamic pruning strategy is applied.
2. The above process is repeated until at least one queue is empty.
3. If the two top bitmap vectors are not aligned (because a bitmap vector in one of the
attributes may already have been pruned), the vector with the smaller first 1-bit position is
selected, and all of its 1 bits at positions smaller than the first 1-bit position of the other
vector are reset. These bits can safely be reset because they have no matching bits in the
remaining vectors of the other queue. The first 1-bit position of the selected vector is then
recomputed.
Let S be the Set representing the System
S= {I, O, P, Sc, Fc}
Where I=input
O=output
P=Processes
Sc=Success Case
Fc=Failure case.
I= {R, Q}
Where R=Relation R(C1,C2,…….,Cn)
Q=Query
O= {Iceberg Results}
P= {Calculate query results according to conditions}
Sc= {Proper Iceberg Results}
Fc= {Improper Iceberg Results}
Algorithm 1: Iceberg Processing with Vector Alignment and Dynamic Pruning
icebergPQ(attribute A, attribute B, threshold T)
Output: iceberg results
    PQA.clear, PQB.clear
    for each vector a of attribute A do
        a.count = BIT1_COUNT(a)
        if a.count >= T then
            a.next1 = first1BitPosition(a, 0)
            PQA.push(a)
    for each vector b of attribute B do
        b.count = BIT1_COUNT(b)
        if b.count >= T then
            b.next1 = first1BitPosition(b, 0)
            PQB.push(b)
    R = ∅
    a, b = nextAlignedVectors(PQA, PQB, T)
    while a ≠ null and b ≠ null do
        PQA.pop
        PQB.pop
        r = BITWISE_AND(a, b)
        if r.count >= T then
            Add iceberg result (a.value, b.value, r.count) into R
        a.count = a.count - r.count
        b.count = b.count - r.count
        if a.count >= T then
            a.next1 = first1BitPosition(a, a.next1 + 1)
            if a.next1 ≠ null then
                PQA.push(a)
        if b.count >= T then
            b.next1 = first1BitPosition(b, b.next1 + 1)
            if b.next1 ≠ null then
                PQB.push(b)
        a, b = nextAlignedVectors(PQA, PQB, T)
    return R
Algorithm 2: Computing the first 1-bit position
first1BitPosition(bitmap vector vec, start position pos)
Output: the position of the first 1 bit in vec, starting from position pos
    len = 0
    for each word w in vector vec do
        if w is a literal word then
            if len <= pos AND len + 31 > pos then
                for p = pos to len + 30 do
                    if position p is 1 then
                        return p
            else if len > pos then
                for p = len to len + 30 do
                    if position p is 1 then
                        return p
            len += 31
        else if w is a 0 fill word then
            fillLength = length of this fill word
            len += fillLength * 31
        else
            fillLength = length of this fill word
            start = len
            len += fillLength * 31
            if len > pos then
                return max(pos, start)
    return null
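A hedged sketch of the same scan over a word-aligned compressed vector is given below. The word representation is an assumption made purely for illustration and is not the actual WAH bit layout: a literal word is ("lit", bits) covering 31 positions, with bit i of bits standing for offset i in the word, and a fill word is ("fill", bit, n_words) covering n_words × 31 identical bits. The name first_one_bit_compressed is hypothetical.

```python
WORD = 31  # payload bits per word, as in Algorithm 2

def first_one_bit_compressed(words, pos):
    """First 1-bit position at or after pos, scanning word by word without decompressing."""
    start = 0  # position of the first bit covered by the current word
    for word in words:
        if word[0] == "lit":
            bits = word[1]
            # Scan only the part of this literal word that lies at or after pos.
            for p in range(max(pos, start), start + WORD):
                if (bits >> (p - start)) & 1:
                    return p
            start += WORD
        else:
            _, bit, n_words = word
            end = start + n_words * WORD
            if bit == 1 and end > pos:
                # Inside a 1-fill every position is 1; skip to its start if pos is earlier.
                return max(pos, start)
            start = end  # 0-fills, and 1-fills before pos, are skipped in one step
    return None

# Positions 0..61 are zero, position 65 is set in the literal word, positions 93..123 are one.
words = [("fill", 0, 2), ("lit", 0b1000), ("fill", 1, 1)]
```

Fill words are skipped in a single step, which is why the first 1-bit position can be found without decompressing the vector.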
Algorithm 3: Finding the next aligned vectors
nextAlignedVectors(priority queue PQA, priority queue PQB, threshold T)
Output: two aligned vectors a ∈ PQA, b ∈ PQB
    while PQA is not empty and PQB is not empty do
        a = PQA.top
        b = PQB.top
        if a.next1 = b.next1 then
            return a, b
        if a.next1 > b.next1 then
            PQB.pop
            b.next1, skip = first1BitPositionWithSkip(b, a.next1)
            b.count = b.count - skip
            if b.next1 ≠ null AND b.count >= T then
                PQB.push(b)
        else
            PQA.pop
            a.next1, skip = first1BitPositionWithSkip(a, b.next1)
            a.count = a.count - skip
            if a.next1 ≠ null AND a.count >= T then
                PQA.push(a)
    return null, null
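Algorithms 1 and 3 can be combined into a small runnable sketch. It is only an approximation of the pseudocode under stated assumptions: vectors are uncompressed Python integers rather than WAH-compressed words, the aggregate is COUNT with T >= 1, and the names Vec, push, skip_to, next_aligned and iceberg_pq are hypothetical. The skip_to helper plays the role of first1BitPositionWithSkip.

```python
import heapq
from itertools import count

def popcount(v):
    return bin(v).count("1")

def first_one_bit(v, start=0):
    v >>= start
    return None if v == 0 else start + (v & -v).bit_length() - 1

class Vec:
    def __init__(self, value, bits):
        self.value, self.bits = value, bits
        self.count = popcount(bits)
        self.next1 = first_one_bit(bits)

_tie = count()  # tie-breaker so heapq never has to compare Vec objects

def push(pq, vec):
    heapq.heappush(pq, (vec.next1, next(_tie), vec))

def skip_to(vec, pos):
    """Reset all 1 bits below pos; they cannot match any vector left in the other queue."""
    kept = (vec.bits >> pos) << pos
    vec.count -= popcount(vec.bits ^ kept)
    vec.bits = kept
    vec.next1 = first_one_bit(kept, pos)

def next_aligned(pqa, pqb, t):
    while pqa and pqb:
        a, b = pqa[0][2], pqb[0][2]
        if a.next1 == b.next1:
            return a, b
        if a.next1 > b.next1:
            pq, lagging, target = pqb, b, a.next1
        else:
            pq, lagging, target = pqa, a, b.next1
        heapq.heappop(pq)
        skip_to(lagging, target)
        if lagging.next1 is not None and lagging.count >= t:
            push(pq, lagging)
    return None, None

def iceberg_pq(index_a, index_b, t):
    pqa, pqb = [], []
    for pq, index in ((pqa, index_a), (pqb, index_b)):
        for value, bits in index.items():
            vec = Vec(value, bits)
            if vec.count >= t:            # initial pruning
                push(pq, vec)
    results, and_ops = {}, 0
    a, b = next_aligned(pqa, pqb, t)
    while a is not None and b is not None:
        heapq.heappop(pqa)
        heapq.heappop(pqb)
        z = a.bits & b.bits               # never empty: a and b are aligned
        and_ops += 1
        n = popcount(z)
        if n >= t:
            results[(a.value, b.value)] = n
        for pq, vec in ((pqa, a), (pqb, b)):
            vec.bits ^= z                 # clear the matched rows
            vec.count -= n
            if vec.count >= t:            # dynamic pruning
                vec.next1 = first_one_bit(vec.bits, vec.next1 + 1)
                if vec.next1 is not None:
                    push(pq, vec)
        a, b = next_aligned(pqa, pqb, t)
    return results, and_ops

def build_bitmap_index(column):
    index = {}
    for k, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << k)
    return index

col_a = ["A2", "A1", "A2", "A2", "A1", "A2", "A2", "A2", "A1", "A2", "A3", "A3"]
col_b = ["B2", "B3", "B1", "B2", "B3", "B1", "B2", "B1", "B3", "B2", "B1", "B1"]
results, and_ops = iceberg_pq(build_bitmap_index(col_a), build_bitmap_index(col_b), 3)
```

On table R with T = 3 the sketch finds the three groups (A2, B2), (A1, B3) and (A2, B1) with three AND operations, none of them empty.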
Generalization
It is easy to extend the icebergPQ algorithm to support iceberg queries on more than two
attributes, because iceberg queries have the antimonotone property. When there are
multiple aggregate attributes, two attributes can be processed at a time.
The icebergPQ algorithm can also be generalized to support other aggregation
functions that have the antimonotone property. For example, to support the SUM function,
rather than counting the 1 bits of each vector, the sum of the values corresponding to the
1 bits in the resulting bitmap vector is computed. When index pruning is conducted, a
vector is pruned by the sum of all values corresponding to the 1 bits left in the vector,
rather than by the number of 1 bits. The other parts of the icebergPQ algorithm are kept
the same. Because the antimonotone property of iceberg queries is still valid for SUM (over
nonnegative values), the algorithm is still correct. For the MIN (MAX) functions the
modification is similar, since MIN (MAX) also operates on numeric values, as SUM does.
The minor difference is that after each bitwise-AND operation the min (max) value is
computed rather than the sum, and this min (max) value is then used for index pruning.
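For SUM, the only piece that changes is how a vector is scored. A minimal sketch follows, assuming uncompressed integer vectors and nonnegative values, so that clearing bits can never increase the sum. The names masked_sum and from_rows are hypothetical.

```python
def masked_sum(vector, values):
    """Sum of values[k] over every row k whose bit is 1 in `vector`."""
    total, k = 0.0, 0
    while vector:
        if vector & 1:
            total += values[k]
        vector >>= 1
        k += 1
    return total

def from_rows(rows):
    vector = 0
    for k in rows:
        vector |= 1 << k
    return vector

# Column C of table R, top row first.
col_c = [1.23, 2.34, 5.36, 8.36, 3.27, 9.45, 6.23, 1.98, 8.23, 0.11, 3.44, 2.08]

a2_and_b2 = from_rows([0, 3, 6, 9])  # rows of group (A2, B2)
a3 = from_rows([10, 11])             # all rows of value A3

group_sum = masked_sum(a2_and_b2, col_c)  # 1.23 + 8.36 + 6.23 + 0.11
a3_sum = masked_sum(a3, col_c)            # 3.44 + 2.08
```

With a hypothetical threshold of SUM(C) >= 10, group (A2, B2) qualifies, while the whole A3 vector can be pruned up front because even the sum of all its rows stays below the threshold.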
PERFORMANCE ANALYSIS OF VECTOR ALIGNMENT
Compared with the dynamic pruning algorithm, icebergPQ is much more efficient.
Consider a table R(A, B) with n tuples. Suppose A has s unique values, B has t unique
values, and the group-by operation on A, B forms g groups, where g is the number of valid
groups that appear at least once in the relation.
It is clear that
s ≤ g ≤ n
t ≤ g ≤ n.
Theoretically, in the worst case the dynamic pruning algorithm needs to compare all pairs of
vectors in the two attributes, if no dynamic pruning is effective. Hence the worst case of
the dynamic pruning algorithm is s × t AND operations, which could be much slower than
scanning the table itself.
icebergPQ, in contrast, performs AND operations only on aligned vectors. That is, each
AND operation corresponds to a real group on A, B. Therefore the worst case of icebergPQ
is g AND operations, which is often much smaller than the table size n.
The effect of pruning becomes quite significant in icebergPQ, since it makes the number of
AND operations much smaller than g in practice. Optimization strategies can further reduce
the execution time of AND operations.
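As a small worked check of these bounds, the quantities s, t, g and n can be computed for the example table R. The variable names below are illustrative only.

```python
# (A, B) columns of table R, one tuple per row.
rows = [("A2", "B2"), ("A1", "B3"), ("A2", "B1"), ("A2", "B2"),
        ("A1", "B3"), ("A2", "B1"), ("A2", "B2"), ("A2", "B1"),
        ("A1", "B3"), ("A2", "B2"), ("A3", "B1"), ("A3", "B1")]

s = len({a for a, _ in rows})  # unique values of A
t = len({b for _, b in rows})  # unique values of B
g = len(set(rows))             # groups that appear at least once
n = len(rows)                  # tuples

worst_case_dp = s * t  # all vector pairs, if no pruning is effective
worst_case_pq = g      # one AND per real group
print(s, t, g, n, worst_case_dp, worst_case_pq)
```

For this table s = 3, t = 3, g = 4 and n = 12, so the worst case is 9 AND operations for dynamic pruning alone and 4 for icebergPQ.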
EXPERIMENTAL EVALUATION
The experiments were conducted on a machine with a 3.6 GHz Pentium 4 single-core
processor, 2.0 GB of main memory and a 7,200 rpm IDE hard drive, running Ubuntu 9.10 with
the ext4 file system. Experiments were carried out with both a synthetic data set and a real
patent data set. The bitmap indexes of the aggregation attributes are assumed to have been
built offline. This is a reasonable assumption since, other than iceberg queries, bitmap
indexes are useful for many other tasks, especially in column-oriented databases. In this
suite of experiments, icebergDP (dynamic pruning alone) and icebergPQ were tested on data
sets with zipfian distribution, with the data size varied from 1 to 8 million tuples.
icebergPQ is one to two orders of magnitude faster than icebergDP, which demonstrates the
severity of the empty bitwise-AND results problem discussed before. With 1 million tuples,
icebergPQ needs only 0.404 seconds to finish processing, while icebergDP needs 10.688
seconds. icebergPQ also scales well as the data size increases: it takes only 11.36 seconds
with 8 million tuples, while icebergDP takes more than 18 minutes. The performance of
icebergDP is unacceptable for practical data sizes.
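The speedups implied by these timings can be worked out directly. The snippet below uses only the figures quoted above. The 8-million-tuple figure for icebergDP is reported as "more than 18 minutes", so the second ratio is only a lower bound.

```python
# Timings reported in the text, in seconds.
pq_1m, dp_1m = 0.404, 10.688
pq_8m = 11.36
dp_8m_lower_bound = 18 * 60  # "more than 18 minutes"

speedup_1m = dp_1m / pq_1m
speedup_8m_at_least = dp_8m_lower_bound / pq_8m
print(f"1M tuples: {speedup_1m:.1f}x, 8M tuples: more than {speedup_8m_at_least:.0f}x")
```

The gap widens from roughly 26 times at 1 million tuples to more than 95 times at 8 million tuples.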
Fig.: Performance of icebergDP and icebergPQ
Fig. (b): Normal distribution
CONCLUSION
This paper presents an efficient algorithm for iceberg query processing using
compressed bitmap indices. This algorithm demonstrates superior performance over existing
schemes and it does not depend on any particular compression method. It has been observed
that bitmap index has three attractive advantages:
1) Saving disk access by avoiding tuple-scan on a table with a lot of attributes,
2) Saving computation time by conducting bitwise operations, and
3) Leveraging the antimonotone property of iceberg queries to develop aggressive
pruning strategies.
The problem of massive bitwise-AND operations was solved by vector alignment.
Both analysis and experiments verify the effectiveness of this approach and show that this
algorithm can outperform the state-of-the-art algorithms for iceberg query processing.
This algorithm is not sensitive to the number of distinct values, the number of attributes
in the relation or the length of individual attributes. It works well on data sets with zipfian
distribution. The performance of this algorithm is better when the query is more
“iceberg-like,” that is, when the threshold of the iceberg query is relatively large (which
means the percentage of the iceberg results is relatively small). It also works better when
the number of aggregation attributes is relatively small.
REFERENCES
1. Bin He, Hui-I Hsiao, Ziyang Liu, Yu Huang, and Yi Chen, “Iceberg Query Evaluation Using Bitmap Index,” 2012.
2. F. Deliège and T.B. Pedersen, “Position List Word Aligned Hybrid: Optimizing Space and Performance for Compressed Bitmaps,” Proc. Int’l Conf. Extending Database Technology (EDBT), pp. 228-239, 2010.
3. A. Ferro, R. Giugno, P.L. Puglisi, and A. Pulvirenti, “BitCube: A Bottom-Up Cubing Engineering,” Proc. Int’l Conf. Data Warehousing and Knowledge Discovery (DaWaK), pp. 189-203, 2009.
4. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J.D. Ullman, “Computing Iceberg Queries Efficiently,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 299-310, 1998.
5. K. Wu, E.J. Otoo, and A. Shoshani, “Optimizing Bitmap Indices with Efficient Compression,” ACM Trans. Database Systems, vol. 31, no. 1, pp. 1-38, 2006.
STC PROGRESS REPORT
Roll No:        Name:        Class:
Sr. No.    Date    Topic of Discussion    Remarks of Guide    Guide’s sign
P: F-SMR-UG/08/R0
PUNE INSTITUTE OF COMPUTER TECHNOLOGY, DEPARTMENT OF COMPUTER ENGINEERING
P: F-SMR-UG/08/R0