specialized indexing for nosql databases like accumulo and hbase


Upload: jim-klucar

Post on 13-Jul-2015

TRANSCRIPT

Page 1: Specialized indexing for NoSQL Databases like Accumulo and HBase

Providing Practical Software Solutions


Specialized Indexing

Jim Klucar

[email protected]

Accumulo Meetup

10/16/2012

Page 2: Specialized indexing for NoSQL Databases like Accumulo and HBase

Overview

Accumulo sorts key/value pairs

Data is retrieved by iterating over ranges of keys

Keys sort lexicographically, so some data types need special encoding:

Integers: 1, 11, 2

Floats: 2.3e4, 2.3e-4

Multi-dimensional data: (-76.76, 39.12)??

Page 3: Specialized indexing for NoSQL Databases like Accumulo and HBase

Integer Indexing

Zero-pad integers to maintain sort order:

001

002

003

Reverse the sort order by subtracting from a larger number before encoding the integer:

(1000 - 003) = 997

(1000 - 002) = 998

(1000 - 001) = 999

Accumulo uses the reverse technique on timestamps so that the most recent keys sort first
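A minimal sketch of both integer encodings, assuming a fixed width of three digits as in the slide. The function names `encode_int` and `encode_int_reversed` are hypothetical and not part of any Accumulo API.

```python
def encode_int(n, width=3):
    """Zero-pad n so that string order matches numeric order."""
    if not 0 <= n < 10 ** width:
        raise ValueError("n does not fit in the chosen width")
    return str(n).zfill(width)


def encode_int_reversed(n, width=3):
    """Subtract from 10**width so that larger numbers sort first."""
    if not 0 < n <= 10 ** width:
        raise ValueError("n does not fit in the chosen width")
    return str(10 ** width - n).zfill(width)


# Unpadded strings sort in the wrong order; padded ones do not.
unpadded = sorted(str(n) for n in [1, 11, 2])
padded = sorted(encode_int(n) for n in [1, 11, 2])
reversed_keys = [encode_int_reversed(n) for n in (3, 2, 1)]
```

The width has to be chosen up front: a value that needs more digits than the width allows cannot be added later without rewriting the existing keys.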

Page 4: Specialized indexing for NoSQL Databases like Accumulo and HBase

IEEE-754 Floating Point Indexing

Binary representation: [sign] [exponent] [mantissa]

If the sign bit is set, flip all bits

If the sign bit isn't set, flip just the sign bit

Flipping bits is safely done with XOR (32-bit masks shown)

If the sign bit is set, XOR with 0xFFFFFFFF

If the sign bit is not set, XOR with 0x80000000

Note: the exponent is biased, not two's complement
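The XOR rule can be sketched for 32-bit floats as follows. This is an illustration using Python's `struct` module, with big-endian byte order assumed so that byte-wise comparison of the keys matches numeric order; the function names are hypothetical.

```python
import struct


def encode_float32(x):
    """Return 4 bytes whose unsigned byte order matches the order of x."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    if bits & 0x80000000:
        bits ^= 0xFFFFFFFF   # negative: flip every bit
    else:
        bits ^= 0x80000000   # non-negative: flip only the sign bit
    return struct.pack(">I", bits)


def decode_float32(key):
    """Invert encode_float32."""
    (bits,) = struct.unpack(">I", key)
    if bits & 0x80000000:
        bits ^= 0x80000000   # was non-negative: restore the sign bit
    else:
        bits ^= 0xFFFFFFFF   # was negative: flip every bit back
    return struct.unpack(">f", struct.pack(">I", bits))[0]


values = [2.5e4, -1.0, 2.5e-4, 0.0, -2.5e4, 1.0, -2.5e-4]
by_key = sorted(values, key=encode_float32)
```

NaN values are not handled here; where they should sort is a separate design decision.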

Page 5: Specialized indexing for NoSQL Databases like Accumulo and HBase

But Binary is so 1970's…

The same procedure can be performed for base-10 numbers, with caveats:

Adjust mantissa and exponent so the decimal point is at the same location (32.4e010 -> 3.24e011)

Apply the same logic to the exponent

Move both signs to the first characters

Flipping a bit is subtracting the digit from 9

-3.24e-9 becomes ++6.7500000e990

Be sure to print enough digits:

Single (32-bit) 8 decimal digits, 3 exponent

Double (64-bit) 15 decimal digits, 4 exponent

Quad (128-bit) 35 decimal digits, 5 exponent
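One way to apply the same idea to decimal strings is sketched below. The key layout is an assumption made for this example and differs from the slide's notation: a leading class digit (`0` negative, `1` zero, `2` positive), then the exponent field, then the mantissa digits. Only finite values are handled, and `encode_decimal` is a hypothetical name.

```python
from decimal import Decimal

# Maps each digit d to 9 - d, the decimal analogue of flipping a bit.
NINES_COMPLEMENT = str.maketrans("0123456789", "9876543210")


def encode_decimal(text, digits=8, exp_digits=3):
    """Return a string key whose lexicographic order matches numeric order."""
    d = Decimal(text)
    if d.is_zero():
        return "1"                       # sorts between negatives and positives
    sign, coeff, _ = d.as_tuple()
    exponent = d.adjusted()              # normalizes 32.4e10 to 3.24e11
    if abs(exponent) >= 10 ** exp_digits:
        raise ValueError("exponent does not fit in exp_digits")
    # Mantissa digits, right-padded (or truncated) to a fixed width.
    mantissa = "".join(map(str, coeff))[:digits].ljust(digits, "0")
    # Exponent field: "1" + digits if non-negative, "0" + flipped digits if negative.
    exp_text = str(abs(exponent)).zfill(exp_digits)
    if exponent >= 0:
        exp_field = "1" + exp_text
    else:
        exp_field = "0" + exp_text.translate(NINES_COMPLEMENT)
    body = exp_field + mantissa
    if sign:                             # negative value: flip every digit
        return "0" + body.translate(NINES_COMPLEMENT)
    return "2" + body


values = ["-3.24e11", "-5", "-1", "-3.24e-9", "-3.24e-10", "0",
          "3.24e-10", "3.24e-9", "1", "9.99", "32.4e10"]
by_key = sorted(values, key=encode_decimal)
```

Padding happens before the digits are flipped, so the trailing zeros of a negative mantissa become nines; padding after flipping would break the order between, say, -3.24 and -3.241.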

Page 6: Specialized indexing for NoSQL Databases like Accumulo and HBase

Encoding Multi-Dimensional Data


http://xkcd.com/426/

Page 7: Specialized indexing for NoSQL Databases like Accumulo and HBase

The Problem

Sorted key/value tables are one-dimensional

Keys are either greater than or less than each other

Need a method to map higher-dimensional data (e.g. 2-D lon/lat) to a lower dimension

Maintain locality of higher-dimensional data in the lower-dimensional space

Maintain 2-D query performance

Page 8: Specialized indexing for NoSQL Databases like Accumulo and HBase

Solution 1

Concatenate latitude and longitude

Bad idea! Must search the entire range of one dimension to find results
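A small illustration of why concatenation scans too much. It assumes a hypothetical 10 x 10 grid of non-negative integer coordinates, with a sorted Python list standing in for the table.

```python
from bisect import bisect_left, bisect_right


def concat_key(lat, lon):
    """Concatenate zero-padded latitude and longitude (latitude first)."""
    return f"{lat:02d}{lon:02d}"


# One row per grid cell, sorted by key as a key/value table would be.
table = sorted((concat_key(lat, lon), (lat, lon))
               for lat in range(10) for lon in range(10))
keys = [k for k, _ in table]

# Query box: latitude 3..4, longitude 3..4.
start = bisect_left(keys, concat_key(3, 3))
stop = bisect_right(keys, concat_key(4, 4))
scanned = table[start:stop]
hits = [p for _, p in scanned if 3 <= p[0] <= 4 and 3 <= p[1] <= 4]
```

Here 12 rows are scanned to find 4 hits, because every longitude after 3 on the first latitude row and every longitude up to 4 on the second falls inside the key range. The waste grows with the extent of the second dimension.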

Page 9: Specialized indexing for NoSQL Databases like Accumulo and HBase

Solution 2: Z-Ordering

http://en.wikipedia.org/wiki/Z-order_curve

[Figure: Z-order numbering of grid cells, 1-4 and 1-8]

Easiest of the space-filling curves

Implemented by interleaving digits from alternating dimensions

Lon -76.8, Lat 35.4 -> -+736584
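The digit interleaving can be sketched as below. The defaults of two integer digits and one fractional digit are chosen only to reproduce the slide's example; real longitudes need three integer digits. The name `z_order_key` is hypothetical.

```python
def z_order_key(lon, lat, int_digits=2, frac_digits=1):
    """Interleave the decimal digits of lon and lat, longitude first."""
    width = int_digits + frac_digits

    def split(value):
        sign = "-" if value < 0 else "+"
        scaled = round(abs(value) * 10 ** frac_digits)
        digits = str(scaled).zfill(width)
        if len(digits) > width:
            raise ValueError("value does not fit in the chosen digit count")
        return sign, digits

    lon_sign, lon_digits = split(lon)
    lat_sign, lat_digits = split(lat)
    interleaved = "".join(a + b for a, b in zip(lon_digits, lat_digits))
    return lon_sign + lat_sign + interleaved


key = z_order_key(-76.8, 35.4)
```

This only shows the interleaving step. A sign character followed by the magnitude does not sort negative values in numeric order, so a production key would also need something like the reversal trick from the integer slide.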

Page 10: Specialized indexing for NoSQL Databases like Accumulo and HBase

Range Search with Z-order

Find the z-order value of two opposite corners of the search box. Scan from one to the other.

Avoids a full table scan / one-dimension scan

Must filter out results that aren't actually inside the range (Accumulo Iterator!)
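A sketch of the scan-then-filter pattern, under several assumptions: coordinates are non-negative integers on a hypothetical 8 x 8 grid, the z-value is built by interleaving bits rather than decimal digits, a sorted Python list stands in for the table, and a list comprehension stands in for the server-side filter.

```python
from bisect import bisect_left, bisect_right


def interleave_bits(x, y, bits=16):
    """Morton code: bit i of x goes to position 2i+1, bit i of y to 2i."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


table = sorted((interleave_bits(x, y), (x, y))
               for x in range(8) for y in range(8))
keys = [k for k, _ in table]


def box_query(x_min, y_min, x_max, y_max):
    """Scan between the z-values of the two corners, then filter."""
    lo = interleave_bits(x_min, y_min)
    hi = interleave_bits(x_max, y_max)
    scanned = table[bisect_left(keys, lo):bisect_right(keys, hi)]
    hits = [p for _, p in scanned
            if x_min <= p[0] <= x_max and y_min <= p[1] <= y_max]
    return hits, len(scanned)


hits, scanned_count = box_query(2, 2, 5, 5)
```

The scan covers more rows than the box contains, which is why the filter step is needed, but it still touches fewer rows than a full table scan.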

Page 11: Specialized indexing for NoSQL Databases like Accumulo and HBase

Z-Order on Lon / Lat


Page 12: Specialized indexing for NoSQL Databases like Accumulo and HBase

How Many Digits of Latitude?

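Example latitude: 39.123092. Approximate distance covered by each digit position:

Tens of degrees (3): 690 miles
Degrees (9): 69 miles
0.1 degree (1): 6.9 miles
0.01 degree (2): 1200 yards
0.001 degree (3): 120 yards
0.0001 degree (0): 36 feet
0.00001 degree (9): 3.6 feet
0.000001 degree (2): 4 inches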
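The figures above follow from roughly 69 miles per degree of latitude. A short calculation, assuming that constant and standard unit conversions:

```python
MILES_PER_DEGREE = 69.0    # approximate length of one degree of latitude
YARDS_PER_MILE = 1760
FEET_PER_MILE = 5280
INCHES_PER_MILE = 63360


def resolution_miles(decimal_places):
    """Distance covered by one unit of the digit at this decimal place."""
    return MILES_PER_DEGREE / 10 ** decimal_places


yards_at_2 = resolution_miles(2) * YARDS_PER_MILE     # 0.01 degree
feet_at_4 = resolution_miles(4) * FEET_PER_MILE       # 0.0001 degree
inches_at_6 = resolution_miles(6) * INCHES_PER_MILE   # 0.000001 degree
```

The same reasoning applies to longitude only at the equator; a degree of longitude shrinks toward the poles.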

Page 13: Specialized indexing for NoSQL Databases like Accumulo and HBase

Solution 3: Geohash

Developed for http://geohash.org

Concatenating Lon/Lat in a URL is patented!

A geohash defines an area from successive bisections of areas of the earth.

Longitude example:

Page 14: Specialized indexing for NoSQL Databases like Accumulo and HBase

Longitude Geohash Example

Bisect the earth, choose which side of the bisection the point lies on, and use that bit as the first bit of the longitude geocode (1 in this case)

[Figure: the two halves of the bisected region, labeled 1 and 0]

Page 15: Specialized indexing for NoSQL Databases like Accumulo and HBase

Longitude Geohash Example

Bisect the section containing the point; rinse, repeat.

Code is now: 10

[Figure: the two halves of the bisected section, labeled 1 and 0]

Page 16: Specialized indexing for NoSQL Databases like Accumulo and HBase

Longitude Geohash Example

Bisect the section containing the point; rinse, repeat.

Code is now: 101

[Figure: the two halves of the bisected section, labeled 1 and 0]
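The repeated bisection can be sketched as a loop. The function name `bisect_bits` is hypothetical, and the sketch assumes the usual geohash convention that the upper half of each interval is labeled 1. The longitude 60.0 is an arbitrary point chosen because it yields the code 101; the slides do not state which point they use.

```python
def bisect_bits(value, lo, hi, count):
    """Return `count` bits locating value in [lo, hi) by repeated bisection."""
    bits = []
    for _ in range(count):
        mid = (lo + hi) / 2
        if value >= mid:
            bits.append("1")   # upper half: keep [mid, hi)
            lo = mid
        else:
            bits.append("0")   # lower half: keep [lo, mid)
            hi = mid
    return "".join(bits)


three_bits = bisect_bits(60.0, -180.0, 180.0, 3)
ten_bits = bisect_bits(10.40744, -180.0, 180.0, 10)
```

Each additional bit halves the width of the interval, so precision is set simply by how many bits are kept.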

Page 17: Specialized indexing for NoSQL Databases like Accumulo and HBase

Final geocode

Perform the same operation with latitude.

This defines a box around the point to arbitrary precision.

The point of interest is somewhere in the box, not necessarily at its center

Bits are interleaved, longitude first, then base-32 encoded

(57.64911, 10.40744) = u4pruydqqvj
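Putting the pieces together, a minimal geohash encoder might look like this. It is a sketch and not a reference implementation; it uses the standard geohash base-32 alphabet, and `geohash_encode` is a hypothetical name.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, length=11):
    """Interleave longitude and latitude bisection bits, then base-32 encode."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True                      # longitude supplies the first bit
    while len(bits) < 5 * length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):    # 5 bits per base-32 character
        value = 0
        for bit in bits[i:i + 5]:
            value = value * 2 + bit
        chars.append(BASE32[value])
    return "".join(chars)


code = geohash_encode(57.64911, 10.40744)
```

Truncating a geohash from the right gives the enclosing, larger cell, which is what makes prefix range scans on a sorted table possible.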

Page 18: Specialized indexing for NoSQL Databases like Accumulo and HBase

Google Earth Visualization

See Wikipedia for a worked example: http://en.wikipedia.org/wiki/Geohash

Google Earth KML demo: http://api.prezzibenzina.it/geohash.kml

Page 19: Specialized indexing for NoSQL Databases like Accumulo and HBase

Range Search Can Be The Same


Performance difference varies by location and search area

Page 20: Specialized indexing for NoSQL Databases like Accumulo and HBase

Range / Area Search Alternative

Can calculate the geohashes of the areas surrounding a point deterministically

Search those areas independently

[Figure: the cell containing the point dqcqzkd and its eight neighbor cells: dqcrn, dqcrp, dqcx0, dqcqy, dqcwb, dqcqw, dqcqx, dqcw8]

Point: dqcqzkd

Find parent geohash: dqcqz

Find neighbor hashes: dqcrn, dqcrp, etc.

Get search ranges:

dqcrn00 - dqcrnzz

dqcrp00 - dqcrpzz

…etc

Crank up a batch scanner
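The neighbor lookup and range construction can be sketched as follows. The approach here, decoding the parent cell and re-encoding the centers of the adjacent cells, is one simple option and an assumption of this example. All function names are hypothetical, and wrap-around at the poles and the antimeridian is not handled.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def encode(lat, lon, length):
    """Minimal geohash encoder (longitude bit first)."""
    box = {"lat": [-90.0, 90.0], "lon": [-180.0, 180.0]}
    value = {"lat": lat, "lon": lon}
    bits, axis = [], "lon"
    while len(bits) < 5 * length:
        lo, hi = box[axis]
        mid = (lo + hi) / 2
        bit = 1 if value[axis] >= mid else 0
        box[axis][1 - bit] = mid        # bit 1 raises lo, bit 0 lowers hi
        bits.append(bit)
        axis = "lat" if axis == "lon" else "lon"
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5))


def bounds(geohash):
    """Return (lat_lo, lat_hi, lon_lo, lon_hi) of a geohash cell."""
    box = {"lat": [-90.0, 90.0], "lon": [-180.0, 180.0]}
    axis = "lon"
    for ch in geohash:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            box[axis][1 - bit] = (box[axis][0] + box[axis][1]) / 2
            axis = "lat" if axis == "lon" else "lon"
    return box["lat"][0], box["lat"][1], box["lon"][0], box["lon"][1]


def neighbors(geohash):
    """Geohashes of the eight cells surrounding the given cell."""
    lat_lo, lat_hi, lon_lo, lon_hi = bounds(geohash)
    lat_c, lon_c = (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2
    d_lat, d_lon = lat_hi - lat_lo, lon_hi - lon_lo
    return [encode(lat_c + dy * d_lat, lon_c + dx * d_lon, len(geohash))
            for dy in (1, 0, -1) for dx in (-1, 0, 1)
            if (dx, dy) != (0, 0)]


def search_ranges(point_hash, parent_len=5):
    """Scan ranges for the parent cell of a point and its neighbors."""
    parent = point_hash[:parent_len]
    pad = len(point_hash) - parent_len
    return [(cell + "0" * pad, cell + "z" * pad)
            for cell in [parent] + neighbors(parent)]


ranges = search_ranges("dqcqzkd")
```

Each tuple in `ranges` is a start and end row, which is the shape of input a batch scanner takes; the parent cell itself is included as the first range.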

Page 21: Specialized indexing for NoSQL Databases like Accumulo and HBase

Search Expands From Center Outward

Page 22: Specialized indexing for NoSQL Databases like Accumulo and HBase

Key Differences

Geohashes are areas; z-order lon/lat values are points

Geohash is in the public domain

Geohash makes proximity search easier

Page 23: Specialized indexing for NoSQL Databases like Accumulo and HBase

What to watch

Watch border regions in range searches:

Across the equator

Across the prime meridian

On top of the poles

Across base-32 encoding edges