FastBit for Allele Data
Dave Matthews
USDA-ARS, Cornell University
Ithaca, NY
10 April 2012
A Lightning-Fast Index Drives Massive Data Analysis
http://www.scidacreview.org/0904/html/fastbit.html
FastBit significantly improves the speed of a searching operation on both high- and low-cardinality values with a number of techniques, including a vertical data organization, an innovative bitmap compression technique, and several new bitmap encoding methods... The ability to index high-cardinality data is unique to FastBit and is not supported by other bitmap indexing methods.
Allele Data Variables
Allele = f(Marker, Line, Experiment)

              Allele   Marker   Line    Experiment
Size:         10^9     10^4     10^4    10^1
Cardinality:  2        =        =       =
Bitmap Indexing
The FastBit Technologies
1. vertical data organization
Also called 'vertical partitioning': each partition holds only a few of the
(hundreds of) variables.
2. bitmap compression: Word-Aligned Hybrid Compression
3. two-level bitmap encoding
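The first item, vertical data organization, can be illustrated with a minimal sketch. It is not FastBit code: the records, the column names and the helper `to_columns` are made up for illustration, and the only point is that a query on one variable reads one column rather than every record.

```python
# Hypothetical allele records, stored row by row (one dict per observation).
records = [
    {"marker": "m1", "line": "l1", "experiment": "e1", "allele": "A"},
    {"marker": "m1", "line": "l2", "experiment": "e1", "allele": "B"},
    {"marker": "m2", "line": "l1", "experiment": "e1", "allele": "A"},
    {"marker": "m2", "line": "l2", "experiment": "e1", "allele": "B"},
]


def to_columns(rows):
    """Regroup row-oriented records into one list per variable (vertical layout)."""
    columns = {}
    for rec in rows:
        for name, value in rec.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(records)

# A query on 'allele' now scans a single list; the other variables are untouched.
count_b = columns["allele"].count("B")
print(sorted(columns), count_b)
```

In a layout like this each column can also be indexed and compressed on its own, which is what the remaining two items rely on.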
Word-aligned Hybrid Compression
• run-length encoding
• 31-bit groups
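The two bullets above can be combined in a small sketch. It assumes the commonly described WAH layout: each 32-bit word is either a literal (top bit 0, then 31 raw bits) or a fill (top bit 1, next bit the repeated value, low 30 bits the number of 31-bit groups in the run). The function names are hypothetical, and padding the last group with zeros is a simplification of this sketch, not necessarily what FastBit does.

```python
GROUP_BITS = 31                       # payload bits per 32-bit word
FILL_FLAG = 1 << 31                   # top bit set: the word is a fill
FILL_VALUE = 1 << 30                  # second bit: the bit value being repeated
MAX_RUN = (1 << 30) - 1               # largest group count one fill word can hold
ALL_ONES = (1 << GROUP_BITS) - 1


def wah_encode(bits):
    """Compress a list of 0/1 ints into a list of 32-bit WAH-style words."""
    # Simplification: pad the tail with zeros up to a whole 31-bit group.
    padded = list(bits) + [0] * (-len(bits) % GROUP_BITS)
    words = []
    for start in range(0, len(padded), GROUP_BITS):
        group = 0
        for b in padded[start:start + GROUP_BITS]:
            group = (group << 1) | b
        if group in (0, ALL_ONES):
            fill = FILL_FLAG | (FILL_VALUE if group else 0)
            last = words[-1] if words else None
            # Run-length step: extend the previous fill if it repeats the same bit.
            if last is not None and (last & ~MAX_RUN) == fill and (last & MAX_RUN) < MAX_RUN:
                words[-1] = last + 1
            else:
                words.append(fill | 1)
        else:
            words.append(group)       # literal word, top bit stays 0
    return words


def wah_decode(words, n_bits):
    """Expand WAH-style words back into the first n_bits bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            value = 1 if w & FILL_VALUE else 0
            bits.extend([value] * (GROUP_BITS * (w & MAX_RUN)))
        else:
            bits.extend((w >> s) & 1 for s in range(GROUP_BITS - 1, -1, -1))
    return bits[:n_bits]


# A sparse bitmap: 310 bits (10 groups) with a single 1.
sparse = [0] * 310
sparse[5] = 1
encoded = wah_encode(sparse)
print(len(encoded), [hex(w) for w in encoded])
```

Here the first group becomes one literal word and the nine all-zero groups collapse into a single fill word, so 310 bits are stored in two words.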
Two-level Bitmap Encoding
• Approximate solution, then refine.
• Bin the values into groups, e.g. A to G, H to P, Q to Z.
• Encode the bin identifiers as bitmap.
• Encodings: equality, range, interval.
  – Interval has half the number of bitmap indexes.
• Multicomponent encoding: Bin the bins to reduce number of bitmap indexes.
• Multi-level encoding: hierarchy of bins, coarse to fine. Use interval encoding for coarse, equality for fine.
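The "approximate solution, then refine" idea in the list above can be sketched as follows, using the A to G, H to P, Q to Z bins from the example. This is an illustration under assumptions, not FastBit's implementation: bitmaps are plain Python integers, the sample values are made up, and `bin_of`, `build_bin_bitmaps` and `query_equal` are hypothetical helpers.

```python
BINS = [("A", "G"), ("H", "P"), ("Q", "Z")]   # bins from the example above


def bin_of(value):
    """Return the index of the bin whose letter range contains the first letter."""
    first = value[0].upper()
    for i, (lo, hi) in enumerate(BINS):
        if lo <= first <= hi:
            return i
    raise ValueError(f"no bin for {value!r}")


def build_bin_bitmaps(values):
    """One bitmap (a Python int) per bin; bit r is set when row r falls in that bin."""
    bitmaps = [0] * len(BINS)
    for row, v in enumerate(values):
        bitmaps[bin_of(v)] |= 1 << row
    return bitmaps


def query_equal(values, bitmaps, target):
    """Coarse step: rows in the target's bin. Fine step: check the raw values."""
    candidates = bitmaps[bin_of(target)]
    candidate_rows = [r for r in range(len(values)) if (candidates >> r) & 1]
    return [r for r in candidate_rows if values[r] == target]


values = ["K", "A", "J", "S", "K", "Z"]        # made-up sample column
bitmaps = build_bin_bitmaps(values)
hits = query_equal(values, bitmaps, "K")
print([bin(b) for b in bitmaps], hits)
```

The coarse step narrows the search to rows 0, 2 and 4 (the H to P bin); only those three raw values are read to refine the answer to rows 0 and 4.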
Indexing Bin Identifiers
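As a rough illustration of how bin identifiers might be indexed under the three encodings named earlier, the sketch below follows the usual textbook definitions: equality keeps one bitmap per bin, range keeps one per "bin id <= b", and interval keeps one per window of consecutive bins. The function names and sample bin ids are made up, and FastBit's own layout may differ.

```python
def equality_encode(bin_ids, n_bins):
    """Bitmap b marks rows whose bin id equals b: n_bins bitmaps."""
    bitmaps = [0] * n_bins
    for row, b in enumerate(bin_ids):
        bitmaps[b] |= 1 << row
    return bitmaps


def range_encode(bin_ids, n_bins):
    """Bitmap b marks rows whose bin id is <= b. The last bitmap would be
    all ones, so only n_bins - 1 bitmaps are stored."""
    bitmaps = [0] * (n_bins - 1)
    for row, b in enumerate(bin_ids):
        for k in range(b, n_bins - 1):
            bitmaps[k] |= 1 << row
    return bitmaps


def interval_encode(bin_ids, n_bins):
    """Bitmap j marks rows whose bin id lies in [j, j + m], with
    m = n_bins // 2 - 1: about half as many bitmaps. Assumes n_bins >= 2."""
    m = n_bins // 2 - 1
    n_maps = (n_bins + 1) // 2
    bitmaps = [0] * n_maps
    for row, b in enumerate(bin_ids):
        for j in range(max(0, b - m), min(b, n_maps - 1) + 1):
            bitmaps[j] |= 1 << row
    return bitmaps


bin_ids = [0, 3, 1, 5, 2, 4]          # made-up bin id for each of six rows
eq = equality_encode(bin_ids, 6)
rg = range_encode(bin_ids, 6)
iv = interval_encode(bin_ids, 6)
print(len(eq), len(rg), len(iv))

# "bin id <= 2" is a single lookup under range encoding...
assert rg[2] == eq[0] | eq[1] | eq[2]
# ...and "bin id in 1..3" is a single lookup under interval encoding.
assert iv[1] == eq[1] | eq[2] | eq[3]
```

With six bins the sketch stores 6, 5 and 3 bitmaps respectively, which matches the note that interval encoding needs about half as many.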
Querying on more than one variable
FastBit performs extremely well on multi-variable queries because the intersection between the search results on each variable is a simple AND operation over the resulting bitmaps.
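A minimal sketch of that AND step, assuming one equality bitmap per distinct value and using Python integers as uncompressed bitmaps. The columns and the helpers `equality_index` and `rows_of` are hypothetical; FastBit itself operates on compressed bitmaps.

```python
def equality_index(column):
    """Map each distinct value to a bitmap (Python int) of the rows that hold it."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Made-up columns for six observations.
marker = ["m1", "m1", "m2", "m2", "m1", "m2"]
line = ["l1", "l2", "l1", "l2", "l1", "l1"]
allele = ["A", "B", "A", "A", "B", "A"]

marker_idx = equality_index(marker)
line_idx = equality_index(line)
allele_idx = equality_index(allele)

# marker = m1 AND line = l1 AND allele = B: one bitwise AND per extra variable.
result = marker_idx["m1"] & line_idx["l1"] & allele_idx["B"]
print(rows_of(result))
```

Each condition is answered from its own index, and combining them costs one bitwise AND per additional variable regardless of how the conditions are ordered.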
Performance
Instructions
http://crd-legacy.lbl.gov/~kewu/fastbit/doc/quickstart.html