s. nishimura (nec service platforms labs.), s. das, d. agrawal, a. abbadi (university of california,...

S. Nishimura (NEC Service Platforms Labs.), S. Das, D. Agrawal, A. Abbadi

(University of California, Santa Barbara)

MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services

Presenter: Zhuo Liu

Overview

▐ A Motivating Story

▐ Existing Technologies

▐ Our proposal

▐ Evaluation

▐ Conclusion

Motivating Scenario: Mobile Coupon Distribution

[Figure labels: Coupon, Current Location, Mobile Coupon Distributor]

Distribution Policy
• Area
• # of coupons

Motivating Scenario: Mobile Coupon Distribution

[Figure labels: many users, each with a Current Location; Coupons]

Distribution Policy
• Area
• # of coupons

System Scalability
• Large amounts of data
• High throughput

Efficient Complex Queries
• Multi-dimensional query
• Nearest neighbors query

125,000,000 subscribers in Japan

Existing Technologies

Criteria: Multi-dimensional Queries, Scalability

• Relational DBs
• Spatial DBs
• Commercial products, but expensive
• Open source products
• Key-Value Stores
• What We Want, at a reasonable price

Ordered Key-Value Stores

[Figure: an Index over keys key00, key11, …, keynn points to Buckets. Each bucket holds key-value pairs sorted by key: key00/value00, key01/value01, …, key0X/value0X; key11/value11, key12/value12, …, key1Y/value1Y; …; keynn/valuenn.]

Good at 1-D Range Query

But our target is multi-dimensional… (Longitude, Latitude, Time)

ex. BigTable, HBase

Naïve Solution: Linearization

Projects n-D space to 1-D space

[Figure: index and sorted buckets, as on the previous slide]

Simple, but problematic…

Apply a Z-ordering curve…

5 7 13 15

4 6 12 14

1 3 9 11

0 2 8 10
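The grid above can be reproduced with a short sketch. It assumes the bit order implied by the slides (the x bit precedes the y bit in each pair, most significant bits first); the function names `z_encode` and `z_grid` are illustrative and not from the original.

```python
def z_encode(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y (x bit first, most significant bit first)."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)  # next bit of x
        z = (z << 1) | ((y >> i) & 1)  # next bit of y
    return z


def z_grid(bits: int = 2):
    """Return the Z-order values laid out as on the slide: top row is the largest y."""
    side = 1 << bits
    return [[z_encode(x, y, bits) for x in range(side)]
            for y in reversed(range(side))]


for row in z_grid():
    print(*row)
```

With `bits=2` this prints the same four rows as the slide, from `5 7 13 15` down to `0 2 8 10`.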

Problem: False positive scans

▐ MD-query on linearized space
• Translate an MD-query to a linearized range query.
• Ex. Query from 2 to 9.
• Scan the queried linearized range.
• Filter out the points outside the queried area.
• Ex. blue-hatched area (4 to 7)
• Requires the boundary information of the original space.

5 7 13 15
4 6 12 14
1 3 9 11
0 2 8 10

[Figure: the queried area is marked by cells 2 and 9.]
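The false positives in this example can be counted with a small sketch. It assumes the query is the rectangle whose lower-left and upper-right cells are 2 and 9, and it uses the same x-first bit interleaving as the grid; `z_decode` and `scan_with_filter` are hypothetical helper names.

```python
def z_decode(z: int, bits: int = 2):
    """Split a Z-order value back into (x, y); x bits sit at the odd positions."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def scan_with_filter(z_lo: int, z_hi: int, bits: int = 2):
    """Scan the linearized range [z_lo, z_hi] and separate hits from false positives."""
    x_lo, y_lo = z_decode(z_lo, bits)  # lower-left corner of the queried area
    x_hi, y_hi = z_decode(z_hi, bits)  # upper-right corner of the queried area
    hits, false_positives = [], []
    for z in range(z_lo, z_hi + 1):
        x, y = z_decode(z, bits)
        inside = x_lo <= x <= x_hi and y_lo <= y <= y_hi
        (hits if inside else false_positives).append(z)
    return hits, false_positives


hits, wasted = scan_with_filter(2, 9)
print("in query:", hits)
print("scanned but filtered out:", wasted)
```

Half of the scanned cells (4 to 7) lie outside the queried area, which is the wasted work the slide points at.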

Our Approach: MD-HBase

Build a Multi-dimensional Index Layer on top of an Ordered Key-Value store

[Figure: layered stack]
• MD-HBase: Multi-Dimensional Index
• Ordered Key-Value Store (ex. BigTable, HBase, …): Single Dimensional Index

Introduce Multi-dimensional Index

▐ Multi-dimensional Index (ex. the K-d tree, the Quad tree)
• Divide a space into subspaces containing almost the same # of points
• Organize the subspaces as a tree

Efficient subspace pruning → to avoid false positive scans

Space Partition By the K-d tree

Binary Z-ordering space (bitwise interleaving, ex. x=00, y=11 → 0101)

y=11: 0101 0111 1101 1111
y=10: 0100 0110 1100 1110
y=01: 0001 0011 1001 1011
y=00: 0000 0010 1000 1010
(columns: x = 00, 01, 10, 11)

[Figure: the same space partitioned by the K-d tree]

How do we represent these subspaces?

Key Idea: The longest common prefix naming scheme

y=11: 0101 0111 1101 1111
y=10: 0100 0110 1100 1110
y=01: 0001 0011 1001 1011
y=00: 0000 0010 1000 1010
(columns: x = 00, 01, 10, 11)

Subspaces are represented as the longest common prefix of keys! (ex. 000*, 1***)

Remarkable Property
• Preserves boundary information of the original space

Ex. 1***
• Left-bottom corner: * → 0 gives 1000, i.e. (x, y) = (10, 00)
• Right-top corner: * → 1 gives 1111, i.e. (x, y) = (11, 11)
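The corner reconstruction for 1*** can be written as a few lines of code. This is a minimal sketch assuming prefixes are strings of `0`, `1`, and `*` and the x-first interleaving shown earlier; `prefix_bounds` is an illustrative name, not part of MD-HBase.

```python
def z_decode(z: int, bits: int = 2):
    """Split a Z-order value back into (x, y); x bits sit at the odd positions."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str, bits: int = 2):
    """Recover the (left-bottom, right-top) corners of a subspace from its prefix name."""
    lo = int(prefix.replace("*", "0"), 2)  # smallest key with this prefix
    hi = int(prefix.replace("*", "1"), 2)  # largest key with this prefix
    return z_decode(lo, bits), z_decode(hi, bits)


print(prefix_bounds("1***"))  # corners of the right half of the space
print(prefix_bounds("000*"))
```

For `1***` the corners come out as x=10, y=00 and x=11, y=11 in binary, matching the slide.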

Build an index with the longest common prefix of keys

y=11: 0101 0111 1101 1111
y=10: 0100 0110 1100 1110
y=01: 0001 0011 1001 1011
y=00: 0000 0010 1000 1010
(columns: x = 00, 01, 10, 11)

Subspaces: 000*, 001*, 01**, 1***

Index: 000*, 001*, 01**, 1***

Buckets: allocated per subspace (000*, 001*, 01**, 1***)
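Because the subspace names sort in key order, finding the bucket for a point is a one-dimensional lookup. The sketch below is an assumption about how such a lookup could work on any sorted index (here a Python list searched with `bisect`), not a description of the HBase implementation; `find_subspace` is a hypothetical name.

```python
from bisect import bisect_right

# Index entries from the slide, in key order.
INDEX = ["000*", "001*", "01**", "1***"]

# Smallest key of each subspace: replace every wildcard with 0.
STARTS = [int(p.replace("*", "0"), 2) for p in INDEX]


def find_subspace(z: int) -> str:
    """Return the subspace whose key range contains the Z-order value z."""
    # The owning subspace is the last entry whose smallest key is <= z.
    return INDEX[bisect_right(STARTS, z) - 1]


print(find_subspace(0b0111))
print(find_subspace(0b1001))
```

This works because the subspaces cover the whole key space without overlap, so each key falls into exactly one entry.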

Multi-dimensional Range Query

y=11: 0101 0111 1101 1111
y=10: 0100 0110 1100 1110
y=01: 0001 0011 1001 1011
y=00: 0000 0010 1000 1010
(columns: x = 00, 01, 10, 11)

Index: 000*, 001*, 01**, 10**, 11**

• Scan 0010 - 1001 on the index: 001*, 01**, 10**
• Filter (subspace pruning): reconstruct the boundary info. and check whether each subspace intersects the queried area
• Scan the remaining buckets: 001*, 10**
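The scan, filter, and scan steps can be sketched end to end. The code assumes the five-entry index from this slide and a query rectangle with corners (x, y) = (01, 00) and (10, 01), which linearizes to 0010 - 1001; all function names are illustrative, and the index is a plain list rather than an HBase table.

```python
def z_encode(x, y, bits=2):
    """Interleave bits of x and y, x bit first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return z


def z_decode(z, bits=2):
    """Inverse of z_encode."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


INDEX = ["000*", "001*", "01**", "10**", "11**"]


def range_query_buckets(q_lo, q_hi, index=INDEX, bits=2):
    """Return (index entries scanned, entries kept after subspace pruning)."""
    z_lo, z_hi = z_encode(*q_lo, bits), z_encode(*q_hi, bits)
    scanned, kept = [], []
    for prefix in index:
        p_lo = int(prefix.replace("*", "0"), 2)
        p_hi = int(prefix.replace("*", "1"), 2)
        if p_hi < z_lo or p_lo > z_hi:
            continue  # outside the linearized range, never touched
        scanned.append(prefix)
        # Reconstruct the subspace boundary and test rectangle intersection.
        (bx0, by0), (bx1, by1) = z_decode(p_lo, bits), z_decode(p_hi, bits)
        if bx0 <= q_hi[0] and q_lo[0] <= bx1 and by0 <= q_hi[1] and q_lo[1] <= by1:
            kept.append(prefix)
    return scanned, kept


scanned, kept = range_query_buckets((1, 0), (2, 1))
print("index scan:", scanned)
print("buckets to scan:", kept)
```

Only the index entry `01**` is pruned here, but it covers keys 4 to 7, which are exactly the false positives of the plain Z-order scan.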

K Nearest Neighbors Query

▐ The best-first algorithm can be applied.
• The most efficient technique in practical cases

▐ See our paper for the details.

[Figure: regions labeled 1 to 5]
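As a rough illustration of the best-first idea, the sketch below runs a generic best-first k-nearest-neighbors search over buckets with known bounding boxes. It is not the algorithm from the paper: the bucket contents are made-up sample points, the boxes correspond to the subspaces 000*, 001*, 01**, and 1***, and `knn` and `min_dist2` are hypothetical names.

```python
import heapq


def min_dist2(q, box):
    """Squared distance from point q to the nearest point of an axis-aligned box."""
    (x0, y0), (x1, y1) = box
    dx = max(x0 - q[0], 0, q[0] - x1)
    dy = max(y0 - q[1], 0, q[1] - y1)
    return dx * dx + dy * dy


def knn(q, buckets, k):
    """Best-first search: always expand the closest unvisited bucket or point."""
    # Heap items are (squared distance, kind, payload); kind 0 = bucket, 1 = point.
    heap = [(min_dist2(q, box), 0, box) for box in buckets]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == 1:
            result.append(item)  # nothing left in the heap can be closer
        else:
            for p in buckets[item]:  # open the bucket and queue its points
                d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                heapq.heappush(heap, (d2, 1, p))
    return result


# Hypothetical sample data: bounding box -> points stored in that bucket.
BUCKETS = {
    ((0, 0), (0, 1)): [(0, 0), (0, 1)],          # 000*
    ((1, 0), (1, 1)): [(1, 1)],                  # 001*
    ((0, 2), (1, 3)): [(0, 3), (1, 2)],          # 01**
    ((2, 0), (3, 3)): [(2, 0), (3, 3), (2, 2)],  # 1***
}

print(knn((2, 1), BUCKETS, 3))
```

In this run the buckets for 000* and 01** are never opened, because three points at least as close as those buckets are found first.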

Variations of Storage Layer

▐ Table Share Model
• Uses a single table
• Maintains bucket boundaries
• Most space-efficient
• Bucket co-location may cause disk access congestion

▐ Table per Bucket Model
• Allocates a table per bucket
• Most flexible mapping: one-to-one, one-to-many, many-to-one
• Bucket split is expensive: all points are copied to the new buckets

▐ Region per Bucket Model
• Allocates a region per bucket
• Most efficient bucket split: asynchronous bucket split
• Requires modification of HBase

Experimental Results: Multi-dimensional Range Query

• Dataset: 400,000,000 points
• Queries: select objects within MD ranges, varying the selectivity
• Cluster size: 4 nodes
• MD-HBase responds 10 to 100 times faster than the others, and its response time is proportional to selectivity.

[Figure: Response Time (sec), axis ticks 1 to 1000, versus Selectivity (%), axis ticks 0.01 to 10, for MD-HBase, HBase (ZOrder), and MapReduce]

Experimental Results: k Nearest Neighbors Query

• Dataset: 400,000,000 points
• Queries: choose a point and vary the number of neighbors
• Cluster size: 4 nodes
• MD-HBase responds in 1.5 sec when k ≤ 100, and in 11 sec even when k = 10,000

[Figure: Response Time (sec), axis ticks 0 to 12, versus k: Number of Neighbors, axis ticks 1 to 10000]

Experimental Results: Insert

• Dataset: spatially skewed data generated by a Zipfian distribution
• MD-HBase shows good scalability without significant overhead.

[Figure: Throughput (records/sec), axis ticks 0 to 250,000, versus Number of nodes, axis ticks 0 to 20, for MD-HBase and HBase (ZOrder)]

Conclusions

▐ Designed a scalable multi-dimensional data store.
• Scalability & efficient multi-dimensional queries
• Key Idea: indexing the longest common prefix of keys
• Easily extends general ordered key-value stores.

▐ Demonstrated scalable insert throughput and excellent query performance.
• Range Query: 10-100 times faster than existing technologies.
• kNN Query: 1.5 s when k ≤ 100.
• Insert: 220K inserts/sec on a 16-node cluster without significant overhead

Thank you. Any Questions?
