March, 2002 Efficient Bitmap Indexing Techniques for Very Large Datasets Kesheng John Wu Ekow Otoo Arie Shoshani

Problem Statement

• Main objective: map logical requests to qualified objects
— A logical request:
  • 20001015 <= eventTime & 200 < energy < 300 …
— Objects:
  • Set of object ids
  • Set of files containing the objects
  • Offsets within the files, …
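As a point of reference, the mapping from a logical request to object ids can be sketched as a plain scan, which is the baseline an index has to beat. This is a minimal sketch: the records and the function name `qualified_objects` are hypothetical and do not come from the slides.

```python
# Hypothetical records: (object id, eventTime, energy). The values are made up.
events = [
    (0, 20001014, 250.0),
    (1, 20001015, 210.0),
    (2, 20001020, 305.0),
    (3, 20001101, 299.5),
]

def qualified_objects(records):
    """Return ids of objects with 20001015 <= eventTime and 200 < energy < 300."""
    return [oid for oid, event_time, energy in records
            if 20001015 <= event_time and 200 < energy < 300]

result = qualified_objects(events)
```

The scan reads every record for every request; the bitmap index described in the following slides avoids that.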

Application: STAR

OID  dst     hist    mEventNumber  mEventTime       mRunNumber  NLb
0    159625  159627  2635          20000827.011759  1239029     1341
1    159625  159627  2636          20000827.011759  1239029     1470
2    159625  159627  2637          20000827.011759  1239029     1663

OID  n_clus_tpc_in[13]  numberOfPrimaryTracks  ChargedParticles_Means[1]  PrimaryVertexX  qxb[2]  zdc2Energy
0    909                1228                   266                        .56             -26.40  48
1    1243               1415                   317                        .46             -29.08  53
2    1285               1533                   281                        .53             -6.754  8

A portion of the STAR tag dataset: 3 events with 12 attributes from millions of events with 502 attributes.

Application: Combustion

• Direct numerical simulation of auto-ignition process (solution of complex partial differential equations)

• A dozen or more variables are computed at each time step and each grid point

• Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000

• Time steps: 100 >>> 1000s
• Data size: 1 GB >>> 10 TB
• Task: identify features and track them across time steps
— E.g. find flame front across time
— Find “600<temp<700” for 1 billion points per time step, and discover overlap between time steps
• Use compressed bitmaps to accelerate both feature extraction and feature tracking
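The feature extraction and tracking task can be sketched with one bitmap per time step. This is a minimal sketch, assuming uncompressed bitmaps stored as Python integers; the temperature values and the helper names `bitmap_where` and `set_bits` are hypothetical.

```python
def bitmap_where(values, lo, hi):
    """Bit i of the returned int is 1 iff lo < values[i] < hi."""
    bits = 0
    for i, v in enumerate(values):
        if lo < v < hi:
            bits |= 1 << i
    return bits

def set_bits(bits):
    """List the positions of the 1 bits (the grid point ids)."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Hypothetical temperatures at 8 grid points for two consecutive time steps.
temp_t0 = [550, 610, 650, 690, 720, 580, 640, 700]
temp_t1 = [560, 590, 660, 695, 680, 600, 610, 650]

front_t0 = bitmap_where(temp_t0, 600, 700)   # feature extraction at step 0
front_t1 = bitmap_where(temp_t1, 600, 700)   # feature extraction at step 1
overlap = front_t0 & front_t1                # feature tracking: one bitwise AND
```

Once the per-step bitmaps exist, the overlap between time steps costs a single AND and never touches the raw temperature data.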

Building a Bitmap Index

1. Partition each property into bins (binning)
— e.g. for 0<NLb<4000, 20 equal-size bins: [0, 200), [200, 400), …
2. Generate a bit vector for each bin (encoding)
— Bit i of bit vector j is 1 iff NLb[i] is in bin j
3. Compress each bit vector

[Figure: one group of bit vectors per property (property 1, property 2, …, property n), with one bit vector per bin]
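Steps 1 and 2 above (binning and encoding) can be sketched as follows. This is a minimal sketch, assuming equal-width bins and uncompressed bitmaps held as Python integers; the function name `build_bitmap_index` and the clamping of out-of-range values are assumptions. Step 3 (compression) is left out.

```python
def build_bitmap_index(values, lo, hi, n_bins):
    """Equal-width binning over [lo, hi); returns one bitmap (Python int) per bin.

    Bit i of bitmaps[j] is 1 iff values[i] falls in bin j.
    """
    width = (hi - lo) / n_bins
    bitmaps = [0] * n_bins
    for i, v in enumerate(values):
        j = int((v - lo) // width)
        j = min(max(j, 0), n_bins - 1)   # assumption: clamp stray values into the end bins
        bitmaps[j] |= 1 << i
    return bitmaps

nlb = [1341, 1470, 1663]                  # NLb of the three sample STAR events
index = build_bitmap_index(nlb, 0, 4000, 20)
```

With 20 bins of width 200, the three sample events fall into bins [1200, 1400), [1400, 1600) and [1600, 1800), so exactly three of the 20 bit vectors are non-zero.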

Advantages of Bitmap Index

• Bitmap index: specialized index that takes advantage of
— Read-mostly data: data produced from scientific experiments can be appended in large groups
• Fast operations
— “Predicate queries” can be performed with bitwise logical operations
  • Predicate ops: =, <, >, <=, >=, range
  • Logical ops: AND, OR, XOR, NOT
— They are well supported by hardware
• Easy to compress, potentially small index size
• Each individual bitmap is small and frequently used ones can be cached in memory
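How a predicate query reduces to bitwise logical operations can be sketched as below. This is a minimal sketch, assuming one uncompressed bitmap per bin and query boundaries that coincide with bin boundaries; the bin assignments and helper names are hypothetical.

```python
from functools import reduce
from operator import or_

def equality_bitmaps(bin_ids, n_bins):
    """bitmaps[j] has bit i set iff object i falls in bin j."""
    bitmaps = [0] * n_bins
    for i, j in enumerate(bin_ids):
        bitmaps[j] |= 1 << i
    return bitmaps

def bins_between(bitmaps, first, last):
    """Range predicate: OR together the bitmaps of bins first..last (inclusive)."""
    return reduce(or_, bitmaps[first:last + 1], 0)

# Hypothetical bin numbers for 6 objects on two attributes.
energy_bins = [0, 2, 3, 1, 2, 3]
time_bins = [1, 1, 0, 2, 2, 2]
energy_bm = equality_bitmaps(energy_bins, 4)
time_bm = equality_bitmaps(time_bins, 3)

# "energy in bins 2..3 AND time in bins 1..2"
hits = bins_between(energy_bm, 2, 3) & bins_between(time_bm, 1, 2)

# NOT has to be masked to the number of objects.
n_objects = 6
misses = ~hits & ((1 << n_objects) - 1)
```

Each range predicate becomes a chain of ORs, and the conjunction across attributes becomes a single AND.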

Operation-efficient Compression Methods

• Best known: byte-aligned bitmap code (BBC)
— Uses run-length encoding (next slide)
— Byte alignment, optimized for space efficiency
— Decoding on bit level, not optimal for operations
— Used in Oracle
• We developed a new word-aligned scheme: WAH
— Uses run-length encoding
— Word alignment
— Designed for minimal decoding to gain speed
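The word-aligned idea can be sketched as follows. The slides do not spell out the word layout, so this sketch assumes the commonly described one for 32-bit words: a literal word has its top bit 0 and carries 31 bitmap bits; a fill word has its top bit 1, the next bit holds the fill value, and the low 30 bits count how many 31-bit groups the fill spans. Zero-padding the tail is a simplification made here, and the function names are hypothetical.

```python
GROUP = 31                       # bitmap bits carried per 32-bit word
FILL_FLAG = 1 << 31              # top bit set marks a fill word
FILL_ONES = 1 << 30              # next bit holds the fill value
MAX_COUNT = (1 << 30) - 1        # low 30 bits count the groups in a fill
ALL_ONES = (1 << GROUP) - 1

def wah_encode(bits):
    """Encode a list of 0/1 ints as a list of 32-bit words (simplified WAH)."""
    padded = bits + [0] * (-len(bits) % GROUP)    # zero-pad the tail (simplification)
    words = []
    for start in range(0, len(padded), GROUP):
        group = int("".join(map(str, padded[start:start + GROUP])), 2)
        if group in (0, ALL_ONES):
            fill = FILL_FLAG | (FILL_ONES if group else 0)
            same_fill = words and (words[-1] & ~MAX_COUNT) == fill
            if same_fill and (words[-1] & MAX_COUNT) < MAX_COUNT:
                words[-1] += 1                    # extend the previous fill word
            else:
                words.append(fill | 1)            # start a new fill of one group
        else:
            words.append(group)                   # literal word, top bit is 0
    return words

def wah_decode(words, n_bits):
    """Expand the words back into the first n_bits bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            value = 1 if w & FILL_ONES else 0
            bits.extend([value] * (GROUP * (w & MAX_COUNT)))
        else:
            bits.extend(int(c) for c in format(w, "031b"))
    return bits[:n_bits]

bits = [0] * 62 + [1, 0, 1] + [0] * 28 + [1] * 93
words = wah_encode(bits)         # 186 bits become 3 words
```

Because every word is either a whole literal or a whole fill, a logical operation can work one machine word at a time without bit-level decoding, which is the speed argument made above.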

Operation-efficient Compression Methods

Uncompressed: 0000000000001111000000000 ...... 0000001000000001111111100000000 .... 000000

Compressed: 12, 4, 1000, 1, 8, 1000

Store very short sequences as-is

Advantage: can perform AND, OR, COUNT operations on compressed data

Based on variations of Run Length Compression
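Operating directly on run-length compressed data can be sketched as below. This is a minimal sketch of plain run-length encoding with explicit [bit, run_length] pairs; it is not the BBC or WAH format, and the function names are hypothetical.

```python
def rle_encode(bits):
    """Run-length encode a list of 0/1 ints as [bit, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_and(a, b):
    """AND two run lists of equal total length without expanding them."""
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0
    left_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)            # stretch where both runs are constant
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step                # merge with the previous output run
        else:
            out.append([bit, step])
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return out

def rle_count(runs):
    """COUNT: number of 1 bits, read off the run lengths."""
    return sum(length for bit, length in runs if bit)

x = rle_encode([0] * 12 + [1] * 4 + [0] * 1000)
y = rle_encode([0] * 10 + [1] * 4 + [0] * 1002)
both = rle_and(x, y)
```

The AND visits one pair of runs per step rather than one bit per step, so its cost follows the compressed size, not the 1016-bit uncompressed length.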

Trade-off of Compression Schemes

[Figure: compression schemes plotted by space and speed, with an arrow marking the better direction: uncompressed, WAH, gzip, BBC, ExpGol, PacBits]

Information About the Test Machines

• Hardware and system
— Sun Enterprise 450 (UltraSPARC II 400 MHz)
— 4 GB RAM
— VERITAS Volume Manager (striped disk)
• Real application data from STAR
— Above 2 million objects, 12 attributes
• Synthetic data
— 100 million objects, 10 attributes
• Terms
— Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
— Times reported are wall clock times in seconds

Logical Operation Time (Synthetic Data)

10X improvement

Logical Operation Time (STAR Data)

Also 10X improvement

Encoding Schemes – Main Idea

[Figure: equality encoding, range encoding, and interval encoding illustrated over 12 bins (1 to 12)]

Interval, Range encoding: operates on 2 bins only!
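The difference between the encodings can be sketched for equality and range encoding. This is a minimal sketch with hypothetical bin assignments and function names; bins are numbered from 0 here, not from 1 as on the slide, and interval encoding is omitted for brevity.

```python
def equality_encode(bin_ids, n_bins):
    """E[j] has bit i set iff object i is in bin j."""
    bitmaps = [0] * n_bins
    for i, j in enumerate(bin_ids):
        bitmaps[j] |= 1 << i
    return bitmaps

def range_encode(bin_ids, n_bins):
    """R[j] has bit i set iff object i is in some bin <= j."""
    bitmaps = [0] * n_bins
    for i, b in enumerate(bin_ids):
        for j in range(b, n_bins):
            bitmaps[j] |= 1 << i
    return bitmaps

def query_equality(E, lo, hi):
    """Bins lo..hi under equality encoding: OR of hi - lo + 1 bitmaps."""
    result = 0
    for j in range(lo, hi + 1):
        result |= E[j]
    return result

def query_range(R, lo, hi):
    """Bins lo..hi under range encoding: at most 2 bitmaps, R[hi] AND NOT R[lo - 1]."""
    below = R[lo - 1] if lo > 0 else 0
    return R[hi] & ~below

# Hypothetical bin numbers for 8 objects over 12 bins.
bin_ids = [0, 5, 11, 3, 7, 5, 9, 2]
E = equality_encode(bin_ids, 12)
R = range_encode(bin_ids, 12)
answer_eq = query_equality(E, 3, 7)    # reads 5 bitmaps
answer_rg = query_range(R, 3, 7)       # reads 2 bitmaps
```

Both calls return the same bitmap; the encodings differ only in how many bitmaps have to be read and combined to get there.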

Total Effect of Compression and Encoding Schemes

• Bottom line on queries
— Compression scheme determines efficiency of logical operations
— Encoding scheme determines number of operations
  • Range & interval – only one logical operation over 2 bitmaps
  • Equality – many operations depending on number of bins
— But, space may be a consideration
• What is the trade-off?

Interval Encoding Is Better Overall (WAH Compression)

Points on the graphs represent 10, 20, 30, 50, and 100 bins.

Average time for random range queries

Timing Results

Method                               Index (X data)  Time (sec)  Speed
ORACLE                     Scan      0               6           0.1
                           B-tree    3.6             0.95        0.6
Native vertical partition  Scan      0               0.57        1
                           20 bins   0.18            0.11        5
                           50 bins   0.43            0.07        8
                           100 bins  0.90            0.05        11

Summary

• Compressed bitmap indices are effective for range queries

• Better compression scheme
— 50% more space, but 12 times faster!
• Among the different encoding schemes
— The interval encoding is the overall winner

Future Work

• Support NULL values and categorical values
• On-line update: add new data and update index without interrupting request processing
• Recovery mechanism for robustness
• Potential new applications: climate, astrophysics, biology (microarrays)
• Study non-uniform binning strategies
• Study more encoding schemes
• Integrate with conventional database system: to better handle metadata, to provide more versatile front-end

How Many Bins for Continuous Domains?

[Figure: a query region over Range(x) and Range(y), with the edge bins along its boundary marked]

More bins: fewer objects in edge bins

Searching edge bins: skip-scan over “attribute vertical partition”
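The effect of the number of bins on the edge-bin search can be sketched in one dimension. This is a minimal sketch: lists of object ids stand in for the bitmaps, a direct look at `values` stands in for the skip-scan over the vertical partition, and the data and function names are hypothetical.

```python
import bisect

def build_bins(values, edges):
    """bins[j] lists the ids i with edges[j] <= values[i] < edges[j + 1]."""
    bins = [[] for _ in range(len(edges) - 1)]
    for i, v in enumerate(values):
        bins[bisect.bisect_right(edges, v) - 1].append(i)
    return bins

def range_query(values, edges, bins, lo, hi):
    """Return (ids with lo < value < hi, number of raw values that were read)."""
    hits, checked = [], 0
    for j, ids in enumerate(bins):
        left, right = edges[j], edges[j + 1]
        if right <= lo or left >= hi:
            continue                       # bin lies entirely outside the query
        if lo < left and right <= hi:
            hits.extend(ids)               # bin lies entirely inside: index only
        else:
            checked += len(ids)            # edge bin: the raw values must be read
            hits.extend(i for i in ids if lo < values[i] < hi)
    return sorted(hits), checked

# Hypothetical attribute values, all within [0, 80).
values = [5, 12, 18, 25, 33, 37, 41, 48, 55, 62, 71, 79]
coarse_edges = [0, 40, 80]                 # 2 bins
fine_edges = list(range(0, 81, 10))        # 8 bins

coarse = range_query(values, coarse_edges, build_bins(values, coarse_edges), 15, 65)
fine = range_query(values, fine_edges, build_bins(values, fine_edges), 15, 65)
```

Both calls return the same ids, but the 2-bin index has to read all 12 raw values while the 8-bin index reads only the 3 that sit in its two edge bins.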