September, 2002 Efficient Bitmap Indexes for Very Large Datasets John Wu Ekow Otoo Arie Shoshani Lawrence Berkeley National Laboratory

Efficient Bitmap Indexes for Very Large Datasets
John Wu
Ekow Otoo
Arie Shoshani
Task: range queries on high-dimensional data
Approach: bitmap index
New compression scheme
Improve CPU efficiency: 10 X
Compressed bitmap index
Applying bitmaps for a feature tracking problem
September, 2002
Generate summary data (done once): 10-100 attributes per event
Access data according to summary attributes (performed by many scientists):
20001015<=Run & 200<Energy<300 …
Selected attributes of STAR summary data (tags). Actual size (January 2002): 20 million objects, 502 attributes
Attributes shown: OID, Run, Event, NLb, tpc, Tracks, Particles, Vertex
Characteristics of data
Appends in batches
Known solutions
Sequential scan
Bitmap index is faster in some cases
Basic Bitmap Index
Bitmap index is efficient for processing range queries on read-only data (P. O’Neil, 1987).
[Figure: example bitmaps (columns of 0/1 bits, one bit per row) for the attributes eventTime and NLb]
Main operations are bitwise logical operations and they are fast
Index sizes are small for categorical attributes with low cardinality
Each individual bitmap is small and frequently used ones can be cached in memory
Scientific datasets have mostly non-categorical attributes
Index size may be large
Query processing may be slow
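The basic index described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one bitmap per distinct value (equality encoding), uses Python integers as uncompressed bitmaps, and the attribute values and helper names (build_equality_index, range_query, rows_of) are made up for the example.

```python
from collections import defaultdict

def build_equality_index(values):
    """Return {value: bitmap}; bit i of a bitmap is set when values[i] == value."""
    index = defaultdict(int)
    for row, v in enumerate(values):
        index[v] |= 1 << row
    return dict(index)

def range_query(index, lo, hi):
    """Rows with lo <= value < hi: OR together the bitmaps of all matching values."""
    result = 0
    for v, bitmap in index.items():
        if lo <= v < hi:
            result |= bitmap
    return result

def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical attribute columns, one entry per event (row).
nlb = [0, 1, 5, 3, 1, 2, 0, 4]
energy = [250, 100, 210, 400, 299, 150, 260, 205]

nlb_index = build_equality_index(nlb)
energy_index = build_equality_index(energy)

# Condition "1 <= NLb < 4 & 200 < Energy < 300": one AND combines the two attributes.
hits = range_query(nlb_index, 1, 4) & range_query(energy_index, 201, 300)
print(rows_of(hits))
```

The sketch also shows why high-cardinality attributes are a concern: the number of bitmaps grows with the number of distinct values, and so does the number of OR operations per range query.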
Binning: reduce the number of bitmaps
Say 0 <= NLb < 4000; we can use 20 equal-size bins: [0,200), [200,400), [400,600), …
Encoding: reduce the number of bitmaps or reduce the number of operations
Basic: equality encoding: generates one bitmap for each bin (shown above)
Other: range encoding, interval encoding, …
Compression: reduce the size of each bitmap; may also speed up the logical operations
Find an efficient compression scheme to reduce query processing time
This talk only addresses the issue of compression
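Binning can be illustrated with the NLb example above. The sketch below assumes 20 bins of width 200 and equality encoding per bin. Bins that only partly overlap the query range yield candidates that must be checked against the raw values, which is one plausible reading of the exact versus approximate answers mentioned later in the talk. All names and data are hypothetical.

```python
BIN_WIDTH = 200
NUM_BINS = 20   # covers 0 <= NLb < 4000

def build_binned_index(values):
    """One bitmap per bin; bit i is set when values[i] falls into that bin."""
    bitmaps = [0] * NUM_BINS
    for row, v in enumerate(values):
        bitmaps[v // BIN_WIDTH] |= 1 << row
    return bitmaps

def binned_range_query(bitmaps, values, lo, hi):
    """Rows with lo <= value < hi, assuming 0 <= lo < hi <= 4000."""
    first, last = lo // BIN_WIDTH, (hi - 1) // BIN_WIDTH
    sure = 0    # rows from bins that lie entirely inside the range
    maybe = 0   # rows from edge bins that straddle a range boundary
    for b in range(first, last + 1):
        bin_lo, bin_hi = b * BIN_WIDTH, (b + 1) * BIN_WIDTH
        if lo <= bin_lo and bin_hi <= hi:
            sure |= bitmaps[b]
        else:
            maybe |= bitmaps[b]
    # Candidate check: only rows in edge bins are compared with the raw data.
    row = 0
    while maybe >> row:
        if (maybe >> row) & 1 and lo <= values[row] < hi:
            sure |= 1 << row
        row += 1
    return sure

nlb = [10, 250, 399, 450, 799, 850, 950, 3999]   # hypothetical values
bitmaps = build_binned_index(nlb)
result = binned_range_query(bitmaps, nlb, 250, 900)
print(bin(result))
```

Skipping the candidate check and returning the union of all touched bins would give a fast superset of the answer instead of the exact one.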
Efficient Compression Schemes
Best known compression scheme for bitmap indexes --- byte-aligned bitmap code (BBC)
Uses run-length encoding
Compresses nearly as well as LZ77 (gzip)
Bitwise logical operations can be performed on compressed bitmaps directly
Operations are usually faster compared to other compression schemes, e.g., ExpGol, …
Even faster than operating on uncompressed bitmaps in some cases
Used in ORACLE
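To illustrate how a logical operation can run on run-length-encoded bitmaps without decompressing them, here is a sketch using a plain list of (bit, run length) pairs. This is deliberately not the BBC format, whose byte-level layout is not described in these slides; it only shows the general idea that long runs are consumed in one step.

```python
def rle_and(a, b):
    """AND two run lists [(bit, length), ...] that describe bitmaps of equal length.

    Run lengths are assumed to be positive.
    """
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0   # bits remaining in the current run of a
    left_b = b[0][1] if b else 0   # bits remaining in the current run of b
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)           # consume the overlap of both runs at once
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)   # merge with the previous run
        else:
            out.append((bit, step))
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            if i < len(a):
                left_a = a[i][1]
        if left_b == 0:
            j += 1
            if j < len(b):
                left_b = b[j][1]
    return out

a = [(0, 5), (1, 10), (0, 5)]
b = [(1, 8), (0, 4), (1, 8)]
anded = rle_and(a, b)
print(anded)
```

The cost depends on the number of runs rather than the number of bits, which is why operating on compressed bitmaps can beat operating on uncompressed ones.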
Bitwise logical operations on BBC compressed bitmaps are CPU bound
Reduce CPU time
CPU time is about 80% of total time on a system with 20 MB/s disk suite
Two independent implementations of BBC show similar behavior
Operation measured: read two files from disk and perform one logical operation in memory
Encode / decode bitmaps in word size chunks
Designed for minimal decoding to gain speed
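A word-size encoding of this kind can be sketched as follows. The layout used here is an assumption based on the commonly published description of word-aligned hybrid (WAH) coding with 32-bit words: a literal word has its top bit clear and carries 31 bitmap bits; a fill word has its top bit set, the next bit gives the fill value, and the low 30 bits count how many 31-bit groups the fill spans. The slides confirm only the word-size chunks and the 31-bit groups, so treat the rest as illustrative.

```python
GROUP = 31                      # bitmap bits carried per 32-bit word
FILL_FLAG = 1 << 31             # top bit set marks a fill word
FILL_BIT = 1 << 30              # second bit holds the fill value
MAX_COUNT = (1 << 30) - 1       # group counter lives in the low 30 bits
ALL_ONES = (1 << GROUP) - 1

def wah_encode(bits):
    """Encode a list of 0/1 values; return (words, tail).

    tail is the final partial group (fewer than 31 bits), kept uncompressed.
    """
    words = []
    full = len(bits) - len(bits) % GROUP
    for start in range(0, full, GROUP):
        group = 0
        for b in bits[start:start + GROUP]:
            group = (group << 1) | b
        if group in (0, ALL_ONES):
            fill = FILL_FLAG | (FILL_BIT if group else 0)
            prev = words[-1] if words else 0
            # Extend the previous fill word if it has the same fill value and room left.
            if (prev & ~MAX_COUNT) == fill and (prev & MAX_COUNT) < MAX_COUNT:
                words[-1] = prev + 1
            else:
                words.append(fill | 1)
        else:
            words.append(group)          # literal word: top bit is 0
    return words, bits[full:]

def wah_decode(words, tail):
    """Inverse of wah_encode."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            value = 1 if w & FILL_BIT else 0
            bits.extend([value] * (GROUP * (w & MAX_COUNT)))
        else:
            bits.extend((w >> (GROUP - 1 - i)) & 1 for i in range(GROUP))
    return bits + list(tail)

bits = [0] * 62 + [1] + [0] * 30 + [1] * 93 + [1, 0, 1]
words, tail = wah_encode(bits)
print([hex(w) for w in words], tail)
```

Because every word is either a whole literal or a whole fill, a logical operation can inspect one word at a time with a single test of the top bit, which fits the stated goal of minimal decoding.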
[Figure: word-aligned encoding example; labels: 31 bits, 01000…]
Hardware and system
VERITAS volume manager (striped disk) – measured IO speed 20 MB/s
Real application data from STAR
About 2.2 million records, 500 attributes
Synthetic data
Terms
Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
Times reported are wall-clock times in seconds
WAH spends smaller fraction of time in CPU
Logical operation (on in-core bit vectors) times only
Bitmap index setup:
Equality encoding
high cardinality attributes from STAR
1 sec 100 MB
2 attributes per query
5 attributes per query
WAH compressed indexes are 10X faster than ORACLE, 5X faster than our BBC
P scan means scanning a vertical projection of the data table, the simplest option for processing partial range queries on high-dimensional data
Queries on 12 most queried attributes, average cardinality 222,000
WAH vs. BBC
Our bitmap index can be 100 X faster than ORACLE:
10 X due to compression scheme, 10 X due to binning
Exact answers
Approximate answers
Indexing Method
Adapting Compressed Bitmaps to Operations Outside of the Bitmap Index
Direct numerical simulation of auto-ignition process (solution of complex partial differential equations – data computed once but never modified)
A simple model has 12 variables per cell, a realistic model may have hundreds
Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
Time steps: 100 >>> 1000s
Task: identify features and track them across time steps
Identify cells with values satisfying specified conditions
Typically a partial range query, like “600 < temp < 700 & HO2 > 10^-7”
Region growing (feature identification)
Feature tracking
Identify common cells in connected regions from different time steps
Basic Approach
Cell identification
Solution is represented as a list of cell IDs
Region growing
For each cell in the above list, search all its neighbors
Each region is a list of cell IDs
Feature tracking
Sort cell IDs of each region and match cell IDs to identify common cells
Use bounding boxes to reduce unnecessary operations
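The list-based approach above can be sketched as follows. It assumes a 2D grid whose cell ID is y * width + x and 4-connected neighbors; both are assumptions made for the example, and the bounding-box shortcut is left out to keep the sketch short.

```python
from collections import deque

def grow_regions(cells, width):
    """Group cell IDs (id = y * width + x) into 4-connected regions."""
    remaining = set(cells)
    regions = []
    while remaining:
        seed = remaining.pop()
        region, queue = [seed], deque([seed])
        while queue:
            cid = queue.popleft()
            x = cid % width
            neighbors = [cid - width, cid + width]   # up and down
            if x > 0:
                neighbors.append(cid - 1)            # left, same row only
            if x < width - 1:
                neighbors.append(cid + 1)            # right, same row only
            for n in neighbors:
                if n in remaining:
                    remaining.remove(n)
                    region.append(n)
                    queue.append(n)
        regions.append(sorted(region))
    return sorted(regions)

def common_cells(a, b):
    """Merge-style intersection of two sorted cell-ID lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical 5 x 5 grid; these cells satisfied the query condition.
regions = grow_regions([0, 1, 5, 13, 14, 24], width=5)
print(regions)
print(common_cells([0, 1, 5], [1, 5, 6]))
```

Every step here touches individual cell IDs, so the cost grows with the number of selected cells.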
Our Approach
Cell identification
Region growing
Connect neighboring line segments into regions
Convert each region into a compressed bitmap
Feature tracking
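Once each region is a bitmap, finding common cells between time steps reduces to a bitwise AND. The sketch below is one way this could look, not the authors' code: Python integers stand in for compressed bitmaps, and the regions and helper names are hypothetical.

```python
def region_to_bitmap(cell_ids):
    """Set one bit per cell ID."""
    bitmap = 0
    for cid in cell_ids:
        bitmap |= 1 << cid
    return bitmap

def track(regions_t0, regions_t1):
    """Return (i, j, shared_cells) for each pair of regions that overlap."""
    links = []
    for i, a in enumerate(regions_t0):
        for j, b in enumerate(regions_t1):
            shared = a & b                     # common cells in one operation
            if shared:
                links.append((i, j, bin(shared).count("1")))
    return links

# Hypothetical regions from two consecutive time steps.
step0 = [region_to_bitmap(r) for r in ([0, 1, 5], [13, 14])]
step1 = [region_to_bitmap(r) for r in ([1, 5, 6], [18, 19], [9, 14])]
links = track(step0, step1)
print(links)
```

With a compressed representation, the AND would run over words or runs instead of individual cells, which is consistent with the short bitmap-operation times reported in the talk.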
69 time steps, 600 X 600 grid, condition HO2 > 10^-7
Compressed bitmaps can be efficiently used for feature tracking
Cell identification
Feature tracking
Bitmap operations 0.2 seconds
Summary
The size of WAH compressed bitmap index is modest even in the worst case
For most high cardinality attributes with N records, the index size is about 2N words. Never more than 4N words
The WAH compressed index is efficient on attributes of any cardinality
On range queries, it is faster than uncompressed bitmap index (3X), BBC compressed index (2~20X), B+-tree index (20~200X), and scanning vertically partitioned table (4~50X)
Compressed bitmaps can also be efficiently used for feature tracking
10^8 records
B+-tree size
WAH compressed index is not larger than B+-tree
Summary of Tests on STAR Data (I)
Compressed bitmap index is more efficient for range queries than B+-tree or no index (p scan)
A WAH compressed index uses more space than a BBC compressed index, but is more efficient
Bitmap index
B+-tree
P scan
2 attributes per query
5 attributes per query
WAH compressed indexes are faster than BBC compressed indexes (3X) and uncompressed indexes (3X)
Query box is the relative volume of the box formed by the query condition
12 lowest cardinality attributes of STAR, average attribute cardinality 26
Bottom line on queries
Encoding scheme determines number of operations
Range & interval – only one logical operation over 2 bitmaps
Equality – many operations depending on number of bins
But, space may be a consideration
What is the trade-off?
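The difference in operation counts can be made concrete with a small sketch. It assumes the usual definition of range encoding, where bitmap j marks every row whose bin number is at most j. Interval encoding is not shown. Bin numbers and helper names are invented for the example.

```python
def equality_bitmaps(bins, num_bins):
    """E[j] marks rows whose bin number equals j."""
    out = [0] * num_bins
    for row, b in enumerate(bins):
        out[b] |= 1 << row
    return out

def range_bitmaps(bins, num_bins):
    """R[j] marks rows whose bin number is <= j (cumulative OR of E[0..j])."""
    out, acc = [], 0
    for e in equality_bitmaps(bins, num_bins):
        acc |= e
        out.append(acc)
    return out

def query_equality(eq, a, b):
    """Rows with a <= bin <= b; returns (bitmap, number of logical operations)."""
    result, ops = eq[a], 0
    for j in range(a + 1, b + 1):
        result |= eq[j]
        ops += 1
    return result, ops

def query_range(rng, a, b):
    """Same query on range-encoded bitmaps: at most one operation on two bitmaps."""
    if a == 0:
        return rng[b], 0
    return rng[b] ^ rng[a - 1], 1   # R[a-1] is a subset of R[b], so XOR removes it

bins = [0, 3, 1, 2, 3, 1, 0, 2]     # hypothetical bin number per row
eq = equality_bitmaps(bins, 4)
rng = range_bitmaps(bins, 4)
eq_answer = query_equality(eq, 1, 3)
rng_answer = query_range(rng, 1, 3)
print(eq_answer, rng_answer)
```

The space side of the trade-off shows up in the bitmaps themselves: each range-encoded bitmap has many more set bits than its equality-encoded counterpart, so it tends to compress less well.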
(WAH Compression)
Bins.
Average time for random range queries
The compressed interval scheme uses 40% of the space of the uncompressed one (U I)
It runs at about the same speed given the same number of bins
If using the same amount of space, the compressed one is faster because there are fewer events to be scanned in the final filtering step
BMI – store bitmaps in Objectivity
IBIS – store bitmaps in files
IBIS answers queries about 4 times faster than BMI using WAH
BMI with WAH is up to ten times faster than BMI with BBC
Joint work with Kurt Stockinger (CERN)