March, 2002
Efficient Bitmap Indexing Techniques for Very Large Datasets
Kesheng John Wu
Ekow Otoo
Arie Shoshani
Problem Statement
• Main objective: map logical requests to qualified objects
— A logical request:
• 20001015 <= eventTime & 200 < energy < 300 …
— Objects:
• Set of object ids
• Set of files containing the objects
• Offsets within the files, …
Application: STAR

OID  dst     hist    mEventNumber  mEventTime       mRunNumber  NLb
0    159625  159627  2635          20000827.011759  1239029     1341
1    159625  159627  2636          20000827.011759  1239029     1470
2    159625  159627  2637          20000827.011759  1239029     1663

OID  n_clus_tpc_in[13]  numberOfPrimaryTracks  ChargedParticles_Means[1]  PrimaryVertexX  qxb[2]  zdc2Energy
0    909                1228                   266                        .56             -26.40  48
1    1243               1415                   317                        .46             -29.08  53
2    1285               1533                   281                        .53             -6.754  8

A portion of the STAR tag dataset: 3 events with 12 attributes from millions of events with 502 attributes.
Application: Combustion
• Direct numerical simulation of the auto-ignition process (solution of complex partial differential equations)
• A dozen or more variables are computed at each time step and each grid point
• Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
• Time steps: 100 >>> 1000s
• Data size: 1 GB >>> 10 TB
• Task: identify features and track them across time steps
• E.g., find the flame front across time
— Find “600<temp<700” for 1 billion points per time step, and discover overlap between time steps
• Use compressed bitmaps to accelerate both feature extraction and feature tracking
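The tracking task above can be sketched in a few lines. This is only an illustration under stated assumptions: the six temperature values per time step are made up, the name `feature_bitmap` is hypothetical, and an uncompressed Python integer stands in for a compressed bitmap.

```python
def feature_bitmap(temps, low=600.0, high=700.0):
    """Return an integer whose bit i is 1 iff low < temps[i] < high."""
    bitmap = 0
    for i, t in enumerate(temps):
        if low < t < high:
            bitmap |= 1 << i
    return bitmap

# Hypothetical temperatures at the same six grid points in two time steps.
temp_step0 = [550.0, 610.0, 650.0, 690.0, 720.0, 800.0]
temp_step1 = [590.0, 640.0, 705.0, 660.0, 680.0, 750.0]

front0 = feature_bitmap(temp_step0)   # feature extraction, step 0
front1 = feature_bitmap(temp_step1)   # feature extraction, step 1
overlap = front0 & front1             # feature tracking: one bitwise AND

# Grid points that belong to the feature in both time steps.
overlap_points = [i for i in range(len(temp_step0)) if (overlap >> i) & 1]
print(overlap_points)
```

Once the per-step bitmaps exist, the overlap between time steps costs a single AND, with no need to revisit the raw temperature values.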
Building a Bitmap Index
1. Partition each property into bins (binning)
— e.g. for 0<NLb<4000, 20 equal size bins: [0, 200) [200, 400) …
2. Generate a bit vector for each bin (encoding)
— Bit i of bit vector j is 1 iff NLb[i] is in bin j
3. Compress each bit vector
[Figure: columns of bit vectors, one bit vector per bin, shown for property 1, property 2, . . . , property n]
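Steps 1 and 2 can be sketched as follows, using the three NLb values from the STAR sample and the 20 equal-size bins over [0, 4000). The function name `build_bitmap_index` and the clamping of out-of-range values are assumptions for illustration; step 3 (compression) is left out, and Python integers serve as uncompressed bit vectors.

```python
def build_bitmap_index(values, low, high, n_bins):
    """Equality-encoded index: bit i of bitmaps[j] is 1 iff values[i] is in bin j."""
    width = (high - low) / n_bins
    bitmaps = [0] * n_bins
    for i, v in enumerate(values):
        j = int((v - low) // width)        # step 1: find the bin of value i
        j = min(max(j, 0), n_bins - 1)     # assumption: clamp out-of-range values
        bitmaps[j] |= 1 << i               # step 2: set bit i in bit vector j
    return bitmaps

nlb = [1341, 1470, 1663]                   # NLb of the three sample events
index = build_bitmap_index(nlb, 0, 4000, 20)

for j, bm in enumerate(index):
    if bm:
        print(f"bin {j} [{j * 200}, {(j + 1) * 200}): {bm:03b}")
```

Each event sets exactly one bit across the 20 bit vectors, which is why most bit vectors are mostly zeros and compress well.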
Advantages of Bitmap Index
• Bitmap index: a specialized index that takes advantage of
— Read-mostly data: data produced from scientific experiments can be appended in large groups
• Fast operations
— “Predicate queries” can be performed with bitwise logical operations
• Predicate ops: =, <, >, <=, >=, range
• Logical ops: AND, OR, XOR, NOT
— They are well supported by hardware
• Easy to compress, potentially small index size
• Each individual bitmap is small, and frequently used ones can be cached in memory
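As a rough illustration of a predicate query answered with bitwise operations, the sketch below combines a range condition on one attribute with a condition on another. The eight-row index, the bin layout, and the helper `or_bins` are hypothetical; bitmaps are plain Python integers, and the query bounds are assumed to fall on bin boundaries.

```python
def or_bins(bitmaps, first, last):
    """OR together the bitmaps of bins first..last (inclusive)."""
    result = 0
    for j in range(first, last + 1):
        result |= bitmaps[j]
    return result

N_ROWS = 8
ALL_ROWS = (1 << N_ROWS) - 1

# Hypothetical index: energy bins [0,100), [100,200), [200,300), [300,400).
energy_bins = [0b00000001, 0b00000010, 0b01100100, 0b10011000]
# Hypothetical index: eventTime before 20001015, and on or after it.
time_bins = [0b00001111, 0b11110000]

energy_hit = or_bins(energy_bins, 1, 2)   # 100 <= energy < 300
time_hit = or_bins(time_bins, 1, 1)       # 20001015 <= eventTime
hits = energy_hit & time_hit              # both predicates: AND
misses = ~hits & ALL_ROWS                 # NOT, masked to the 8 rows

rows = [i for i in range(N_ROWS) if (hits >> i) & 1]
print(rows, bin(hits).count("1"))
```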
Operation-efficient Compression Methods
• Best known: byte-aligned bitmap code (BBC)
— Uses run-length encoding (next slide)
— Byte alignment, optimized for space efficiency
— Decoding on bit level, not optimal for operations
— Used in Oracle
• We developed a new word-aligned scheme: WAH
— Uses run-length encoding
— Word alignment
— Designed for minimal decoding to gain speed
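To make the idea of word-aligned run-length encoding concrete, here is a simplified sketch in the spirit of WAH. The word layout is an assumption made for this illustration, not taken from the slides: 32-bit words, where a literal word has its top bit 0 and carries 31 raw bits, and a fill word has its top bit 1, the next bit as the fill value, and the low 30 bits as a count of 31-bit groups. The function names are hypothetical.

```python
GROUP = 31                        # payload bits per 32-bit word (assumed layout)
COUNT_MASK = (1 << 30) - 1        # low 30 bits of a fill word hold the run count
ALL_ONES = (1 << GROUP) - 1

def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit words."""
    padded = bits + [0] * (-len(bits) % GROUP)    # pad to a whole group
    words = []
    for start in range(0, len(padded), GROUP):
        value = int("".join(map(str, padded[start:start + GROUP])), 2)
        if value in (0, ALL_ONES):
            header = (1 << 31) | ((1 if value else 0) << 30)
            # Extend the previous fill word if it has the same fill bit.
            if (words and words[-1] >> 30 == header >> 30
                    and words[-1] & COUNT_MASK < COUNT_MASK):
                words[-1] += 1
            else:
                words.append(header | 1)
        else:
            words.append(value)                   # literal word, top bit 0
    return words

def wah_decode(words, n_bits):
    """Expand the words back into the first n_bits bits."""
    bits = []
    for w in words:
        if w >> 31:                               # fill word
            bits.extend([(w >> 30) & 1] * (GROUP * (w & COUNT_MASK)))
        else:                                     # literal word
            bits.extend(int(c) for c in format(w, "031b"))
    return bits[:n_bits]

bits = [0] * 12 + [1] * 4 + [0] * 1000 + [1] + [0] * 8
words = wah_encode(bits)
print(len(bits), "bits ->", len(words), "words")
```

Because every word is either a whole literal group or a whole run of groups, a logical operation can proceed word by word without unpacking individual bits, which is the property the slide attributes to word alignment.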
Operation-efficient Compression Methods
Uncompressed: 0000000000001111000000000 ...... 0000001000000001111111100000000 .... 000000
Compressed: 12, 4, 1000, 1, 8, 1000
Store very short sequences as-is
Advantage:
Can perform: AND, OR, COUNT operations on compressed data
Based on variations of Run Length Compression
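The claim that AND and COUNT can run directly on compressed data can be illustrated with plain run-length encoding. This sketch is neither BBC nor WAH; it stores each bitmap as a list of (bit, run length) pairs, and the names `rle_encode`, `rle_and`, and `rle_count` are hypothetical.

```python
def rle_encode(bits):
    """Turn a list of 0/1 values into a list of (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]

def rle_and(a, b):
    """AND two run lists of equal total length without decompressing them."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)              # bits both current runs cover
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

def rle_count(runs):
    """Number of 1 bits, computed from the runs alone."""
    return sum(length for bit, length in runs if bit)

x = [0] * 12 + [1] * 4 + [0] * 1000 + [1]
y = [1] * 14 + [0] * 1003
result = rle_and(rle_encode(x), rle_encode(y))
print(rle_encode(x), result, rle_count(result))
```

The work done by `rle_and` depends on the number of runs rather than the number of bits, so the long run of 1000 zeros is handled in one step.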
Trade-off of Compression Schemes
[Figure: space versus speed trade-off, with an arrow marking the “better” direction; schemes plotted: uncompressed, WAH, gzip, BBC, ExpGol, PacBits]
Information About the Test Machines
• Hardware and system
— Sun Enterprise 450 (UltraSPARC II 400 MHz)
— 4 GB RAM
— VERITAS Volume Manager (striped disk)
• Real application data from STAR
— Above 2 million objects, 12 attributes
• Synthetic data
— 100 million objects, 10 attributes
• Terms
— Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
— Times reported are wall-clock times in seconds
Logical Operation Time (Synthetic Data)
10X improvement
Logical Operation Time (STAR Data)
Also 10X improvement
Encoding Schemes – Main Idea
[Figure: equality encoding, range encoding, and interval encoding illustrated over 12 bins, numbered 1 to 12]
Interval, Range encoding: operates on 2 bins only!
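The difference in the number of bitmaps touched can be sketched for equality and range encoding; interval encoding is not shown here. The definitions used are assumptions for illustration: an equality bitmap marks rows whose value falls in exactly one bin, and range bitmap j marks rows whose bin is at most j. The bin assignments and function names are hypothetical, and bins are numbered from 0.

```python
def equality_bitmaps(bin_ids, n_bins):
    """bitmaps[j] has bit i set iff row i falls in bin j."""
    bitmaps = [0] * n_bins
    for row, b in enumerate(bin_ids):
        bitmaps[b] |= 1 << row
    return bitmaps

def range_bitmaps(bin_ids, n_bins):
    """bitmaps[j] has bit i set iff row i falls in a bin <= j."""
    out, acc = [], 0
    for bm in equality_bitmaps(bin_ids, n_bins):
        acc |= bm
        out.append(acc)
    return out

def query_equality(eq, lo, hi):
    """Rows in bins lo..hi; returns (bitmap, number of bitmaps read)."""
    result = 0
    for j in range(lo, hi + 1):
        result |= eq[j]
    return result, hi - lo + 1

def query_range(rng, lo, hi):
    """Same query on range-encoded bitmaps: at most two bitmaps are read."""
    if lo == 0:
        return rng[hi], 1
    return rng[hi] & ~rng[lo - 1], 2

bin_ids = [3, 0, 11, 5, 7, 5, 2, 9]           # hypothetical bin of each row
eq = equality_bitmaps(bin_ids, 12)
rng = range_bitmaps(bin_ids, 12)
print(query_equality(eq, 2, 7), query_range(rng, 2, 7))
```

The price of range encoding is that each of its bitmaps is much denser than an equality bitmap, which ties into the space consideration raised on the next slide.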
Total Effect of Compression and Encoding Schemes
• Bottom line on queries
— Compression scheme determines efficiency of logical operations
— Encoding scheme determines number of operations
• Range & interval – only one logical operation over 2 bitmaps
• Equality – many operations depending on number of bins
— But, space may be a consideration
• What is the trade-off?
Interval Encoding Is Better Overall (WAH Compression)
Points on the graphs represent 10, 20, 30, 50, and 100 bins.
Average time for random range queries
Timing Results
Method                       Index (X data)   Time (sec)   Speed
ORACLE
  Scan                       0                6            0.1
  B-tree                     3.6              0.95         0.6
Native vertical partition
  Scan                       0                0.57         1
  20 bins                    0.18             0.11         5
  50 bins                    0.43             0.07         8
  100 bins                   0.90             0.05         11
Summary
• Compressed bitmap indices are effective for range queries
• Better compression scheme
— 50% more space, but 12 times faster!
• Among the different encoding schemes
— The interval encoding is the overall winner
Future Work
• Support NULL values and categorical values
• On-line update: add new data and update the index without interrupting request processing
• Recovery mechanism for robustness
• Potential new applications: climate, astrophysics, biology (microarrays)
• Study non-uniform binning strategies
• Study more encoding schemes
• Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end
How Many Bins for Continuous Domains?
[Figure: a two-dimensional query region spanning Range(x) and Range(y), with the edge bins marked at its boundaries]
• More bins: fewer objects in edge bins
• Searching edge bins: skip-scan over “attribute vertical partition”
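The edge-bin handling can be sketched for a single attribute. Bins that lie fully inside the query range are answered from the bitmaps alone, while rows in the partially overlapping edge bins are checked against the raw values. The data, bin edges, and function names are hypothetical, and a plain Python list stands in for the attribute's vertical partition.

```python
from bisect import bisect_right

def build_bins(values, edges):
    """bitmaps[j] marks rows with edges[j] <= value < edges[j + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        bitmaps[bisect_right(edges, v) - 1] |= 1 << i
    return bitmaps

def range_query(values, bitmaps, edges, lo, hi):
    """Rows with lo <= value < hi; returns (hit bitmap, raw values checked)."""
    hits = 0
    candidates = 0
    for j, bm in enumerate(bitmaps):
        b_lo, b_hi = edges[j], edges[j + 1]
        if lo <= b_lo and b_hi <= hi:
            hits |= bm                   # bin fully inside: no raw data needed
        elif b_lo < hi and lo < b_hi:
            candidates |= bm             # edge bin: rows must be verified
    checked = 0
    i = 0
    while candidates >> i:               # visit only rows flagged in edge bins
        if (candidates >> i) & 1:
            checked += 1
            if lo <= values[i] < hi:
                hits |= 1 << i
        i += 1
    return hits, checked

values = [5, 12, 18, 25, 31, 38, 44, 7]
edges = [0, 10, 20, 30, 40, 50]
bitmaps = build_bins(values, edges)
hits, checked = range_query(values, bitmaps, edges, 15, 40)
print([i for i in range(len(values)) if (hits >> i) & 1], checked)
```

With narrower bins, fewer rows land in the edge bins, so fewer raw values have to be read; the cost is a larger number of bitmaps.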