TRANSCRIPT
S. Nishimura (NEC Service Platforms Labs.), S. Das, D. Agrawal, A. Abbadi (University of California, Santa Barbara)
MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services
Presenter: Zhuo Liu
Page 2
Overview
▐ A Motivating Story
▐ Existing Technologies
▐ Our proposal
▐ Evaluation
▐ Conclusion
Page 3
Motivating Scenario: Mobile Coupon Distribution
[Figure: users with their Current Location, a Coupon, and the Mobile Coupon Distributer]
Distribution Policy
• Area
• # of coupons
Page 4
Motivating Scenario: Mobile Coupon Distribution
[Figure: many users, each with a Current Location, and the Coupons sent to them]
Distribution Policy
• Area
• # of coupons
System Scalability: large amounts of data, high throughput
Efficient Complex Queries: multi-dimensional query, nearest neighbors query
125,000,000 subscribers in Japan
Page 5
Existing Technologies
Compared on: Multi-dimensional Queries, Scalability
Relational DBs
Spatial DBs
Commercial products, but expensive
Open source products
Key-Value Stores
What We Want: multi-dimensional queries and scalability at a reasonable price
Page 6
Ordered Key-Value Stores
Index: key00, key11, …, keynn
Buckets (sorted by key):
  key00, key01, …, key0X → value00, value01, …, value0X
  key11, key12, …, key1Y → value11, value12, …, value1Y
  keynn → valuenn
Good at 1-D Range Query
But, our target is multi-dimensional… (Longitude, Latitude, Time)
ex. BigTable, HBase
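Why sorted keys make a 1-D range query cheap can be illustrated with a minimal in-memory sketch. The `OrderedStore` class and its `put`/`scan` methods are hypothetical names chosen for this illustration; they are not the BigTable or HBase API.

```python
import bisect


class OrderedStore:
    """Toy ordered key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._keys = []    # kept sorted at all times
        self._values = []  # _values[i] belongs to _keys[i]

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # overwrite an existing key
        else:
            self._keys.insert(i, key)        # insert while keeping order
            self._values.insert(i, value)

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key <= stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, stop)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


store = OrderedStore()
for k in ["key03", "key01", "key10", "key02"]:
    store.put(k, "value" + k[3:])
print(store.scan("key01", "key03"))
```

Because the keys are sorted, the range is located with two binary searches and read as one contiguous slice, which is the property the slide relies on.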
Page 7
Naïve Solution: Linearization
[Figure: index and sorted buckets of the ordered key-value store, as on the previous slide]
Projects n-D space to 1-D space
Simple, but problematic…
Apply a Z-ordering curve…
5 7 13 15
4 6 12 14
1 3 9 11
0 2 8 10
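The grid above can be reproduced with a short sketch of Z-ordering by bit interleaving. It assumes 2 bits per dimension and that the x bit precedes the y bit at each level, which is what the numbers in the grid imply; the function names `z_value` and `z_grid` are made up for this example.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of x and y, x bit first, most significant first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_grid(bits=2):
    """Rows are listed top to bottom (largest y first), like the slide."""
    size = 1 << bits
    return [[z_value(x, y, bits) for x in range(size)]
            for y in reversed(range(size))]


for row in z_grid():
    print(*row)
```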
Page 8
Problem: False positive scans
▐ MD-query on linearized space
  Translate an MD-query to a linearized range query.
  • Ex. Query from 2 to 9.
  Scan the queried linearized range.
  Filter out points outside the queried area.
  • ex. blue-hatched area (4 to 7)
  Requires the boundary information of the original space.
5 7 13 15
4 6 12 14
1 3 9 11
0 2 8 10
[Figure: queried area with corners at cells 2 and 9]
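The false positives in this example can be counted with a small sketch. It assumes the same 4x4 grid and x-first bit interleaving as the slides; `z_value`, `z_decode`, and `linearized_range_scan` are hypothetical helper names.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of x and y, x bit first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z, bits=2):
    """Inverse of z_value: recover (x, y) from the interleaved value."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def linearized_range_scan(x_lo, y_lo, x_hi, y_hi, bits=2):
    """Scan every z between the two corners, then filter by the real bounds."""
    z_lo, z_hi = z_value(x_lo, y_lo, bits), z_value(x_hi, y_hi, bits)
    hits, false_positives = [], []
    for z in range(z_lo, z_hi + 1):
        x, y = z_decode(z, bits)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append(z)
        else:
            false_positives.append(z)  # scanned, but outside the queried area
    return hits, false_positives


hits, false_positives = linearized_range_scan(1, 0, 2, 1)  # cells 2 to 9
print(hits, false_positives)
```

Half of the scanned cells (4 to 7) are discarded by the filter, and the filter itself needs the original coordinates of each point.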
Page 9
Our Approach: MD-HBase
Build a Multi-dimensional Index Layer on top of an Ordered Key-Value store
[Figure: layer stack]
  MD-HBase: Multi-Dimensional Index
  Ordered Key-Value Store (ex. BigTable, HBase, …): Single Dimensional Index
Page 10
Introduce Multi-dimensional Index
▐ Multi-dimensional Index (ex. the K-d tree, the Quad tree)
  Divide a space into subspaces containing almost the same # of points
  Organize subspaces as a tree
  Efficient subspace pruning → to avoid false positive scans
[Figure: a space divided into subspaces and organized as a tree]
Page 11
Space Partition By the K-d tree
Binary Z-ordering space (x: 00 01 10 11, left to right; y: 11 10 01 00, top to bottom)
0101 0111 1101 1111
0100 0110 1100 1110
0001 0011 1001 1011
0000 0010 1000 1010
Bitwise interleaving, ex. x=00, y=11 → 0101
[Figure: the same space partitioned by the K-d tree]
How do we represent these subspaces?
Page 12
Key Idea: The longest common prefix naming scheme
0101 0111 1101 1111
0100 0110 1100 1110
0001 0011 1001 1011
0000 0010 1000 1010
(x: 00 01 10 11, left to right; y: 11 10 01 00, top to bottom)
Example subspaces: 000*, 1***
Subspaces represented as the longest common prefix of keys!
Remarkable Property
• Preserves boundary information of the original space
Example: 1***
  Left-bottom corner: * → 0 gives 1000, i.e. (10, 00)
  Right-top corner: * → 1 gives 1111, i.e. (11, 11)
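The naming scheme and the boundary reconstruction can be sketched in a few lines. Keys are treated as fixed-length binary strings with x bits at even positions and y bits at odd positions, as in the slides; `subspace_name` and `subspace_bounds` are hypothetical names for this illustration.

```python
def subspace_name(keys):
    """Longest common prefix of equal-length binary keys, padded with '*'."""
    first, last = min(keys), max(keys)
    n = 0
    while n < len(first) and first[n] == last[n]:
        n += 1
    return first[:n] + "*" * (len(first) - n)


def subspace_bounds(prefix):
    """Return ((x_lo, y_lo), (x_hi, y_hi)) as binary strings."""
    low = prefix.replace("*", "0")    # left-bottom corner key
    high = prefix.replace("*", "1")   # right-top corner key

    def deinterleave(key):
        return key[0::2], key[1::2]   # x bits, y bits

    return deinterleave(low), deinterleave(high)


print(subspace_name(["1000", "1010", "1101", "1111"]))
print(subspace_bounds("1***"))
```

No extra metadata is stored: the corners come from the name alone, which is why the index can later prune subspaces without reading their points.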
Page 13
Build an index with the longest common prefix of keys
0101 0111 1101 1111
0100 0110 1100 1110
0001 0011 1001 1011
0000 0010 1000 1010
(x: 00 01 10 11, left to right; y: 11 10 01 00, top to bottom)
Subspaces: 000*, 001*, 01**, 1***
Index: 000*, 001*, 01**, 1***
Buckets (allocate per subspace): 000*, 001*, 01**, 1***
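One way to look up the bucket for a key in such an index is a floor search over the sorted subspace names. This is a minimal sketch that assumes the four subspaces shown on this slide; `SUBSPACES`, `LOW_KEYS`, `find_subspace`, and `insert` are illustrative names, not MD-HBase code.

```python
import bisect

# Subspace names from the slide, already in key order.
SUBSPACES = ["000*", "001*", "01**", "1***"]
# Smallest key of each subspace (replace every '*' with '0').
LOW_KEYS = [name.replace("*", "0") for name in SUBSPACES]

buckets = {name: [] for name in SUBSPACES}  # one bucket per subspace


def find_subspace(key):
    """Return the subspace whose smallest key is the greatest one <= key."""
    i = bisect.bisect_right(LOW_KEYS, key) - 1
    return SUBSPACES[i]


def insert(key):
    buckets[find_subspace(key)].append(key)


for key in ["0110", "1011", "0000", "0011"]:
    insert(key)
print(buckets)
```

Because the subspaces tile the whole key space and their names sort in key order, a single ordered lookup is enough to route a point.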
Page 14
Multi-dimensional Range Query
0101 0111 1101 1111
0100 0110 1100 1110
0001 0011 1001 1011
0000 0010 1000 1010
(x: 00 01 10 11, left to right; y: 11 10 01 00, top to bottom)
Index: 000*, 001*, 01**, 10**, 11**
1. Scan 0010 - 1001 on the index
2. Filter (subspace pruning): reconstruct the boundary info. and check whether each subspace intersects the queried area
3. Scan the buckets of the remaining subspaces
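The index scan and pruning steps for this example can be sketched as follows. The sketch assumes the five-entry index from this slide and a query whose corners are 0010 and 1001; the helper names are hypothetical and the code is an illustration, not the MD-HBase implementation.

```python
import bisect

INDEX = ["000*", "001*", "01**", "10**", "11**"]  # sorted subspace names


def interleave(x, y, bits=2):
    """Build the binary key string, x bit first at each level."""
    xb, yb = format(x, "0{}b".format(bits)), format(y, "0{}b".format(bits))
    return "".join(a + b for a, b in zip(xb, yb))


def bounds(prefix):
    """Reconstruct ((x_lo, y_lo), (x_hi, y_hi)) as integers from the name."""
    lo, hi = prefix.replace("*", "0"), prefix.replace("*", "1")
    return ((int(lo[0::2], 2), int(lo[1::2], 2)),
            (int(hi[0::2], 2), int(hi[1::2], 2)))


def candidate_subspaces(x_lo, y_lo, x_hi, y_hi, bits=2):
    """Return (index entries scanned, entries left after pruning)."""
    lows = [name.replace("*", "0") for name in INDEX]
    start = bisect.bisect_right(lows, interleave(x_lo, y_lo, bits)) - 1
    stop = bisect.bisect_right(lows, interleave(x_hi, y_hi, bits))
    scanned = INDEX[start:stop]
    kept = []
    for name in scanned:
        (sx_lo, sy_lo), (sx_hi, sy_hi) = bounds(name)
        # Keep the subspace only if its rectangle overlaps the query.
        if sx_lo <= x_hi and x_lo <= sx_hi and sy_lo <= y_hi and y_lo <= sy_hi:
            kept.append(name)
    return scanned, kept


scanned, kept = candidate_subspaces(1, 0, 2, 1)
print(scanned, kept)
```

In this run the index scan touches three entries and pruning drops 01**, so only two buckets are read instead of every key between 0010 and 1001.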
Page 15
K Nearest Neighbors Query
▐ The best-first algorithm can be applied.
  The most efficient technique in practical cases
▐ Check the details in our paper
[Figure: regions labeled 1 to 5]
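For readers unfamiliar with best-first search, here is a generic textbook sketch over rectangular buckets. It is not the algorithm from the paper, whose details the slide defers to; the bucket layout, the sample points, and the names `min_dist2` and `knn_best_first` are assumptions made for this illustration.

```python
import heapq


def min_dist2(q, lo, hi):
    """Squared distance from q to the closest point of the box [lo, hi]."""
    return sum(max(l - c, 0, c - h) ** 2 for c, l, h in zip(q, lo, hi))


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_best_first(q, buckets, k):
    """buckets: list of ((lo, hi), points). Returns up to k nearest points."""
    # Heap entries are (distance, kind, payload); kind 0 = bucket, 1 = point.
    heap = [(min_dist2(q, lo, hi), 0, i)
            for i, ((lo, hi), _) in enumerate(buckets)]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, kind, payload = heapq.heappop(heap)
        if kind == 1:
            result.append(payload)       # nothing left can be closer
        else:
            for p in buckets[payload][1]:  # open the bucket lazily
                heapq.heappush(heap, (dist2(p, q), 1, p))
    return result


# Hypothetical data: boxes matching subspaces 000*, 001*, 01**, 1***.
buckets = [
    (((0, 0), (0, 1)), [(0, 0), (0, 1)]),
    (((1, 0), (1, 1)), [(1, 1)]),
    (((0, 2), (1, 3)), [(0, 3), (1, 2)]),
    (((2, 0), (3, 3)), [(2, 1), (3, 3), (2, 2)]),
]
query = (1, 1)
print(knn_best_first(query, buckets, 3))
```

A bucket is opened only when its closest possible point is nearer than everything still queued, so distant buckets are never read for small k.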
Page 16
Variations of Storage Layer
Table Share Model
  Uses a single table; maintains bucket boundaries
  Most space efficient
  Bucket co-location may cause disk access congestion
Table per Bucket Model
  Allocates a table per bucket
  Most flexible mapping
    One-to-one, one-to-many, many-to-one
  Bucket split is expensive
    Copy all points to the new buckets.
Region per Bucket Model
  Allocates a region per bucket
  Most efficient bucket split
    Asynchronous bucket split
  Requires modification of HBase
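The "copy all points to the new buckets" step of a bucket split can be pictured with a small sketch. It assumes a split on the first wildcard bit of the subspace name, which is a simplification chosen here for illustration; the slides do not specify the split rule, and `split_bucket` is a hypothetical name.

```python
def split_bucket(prefix, keys):
    """Split the subspace `prefix` on its first '*' and redistribute keys.

    Assumption for this sketch: the split point is the next undetermined
    bit of the name, so the two children are named by fixing it to 0 or 1.
    """
    i = prefix.index("*")  # raises ValueError if there is nothing to split
    children = {prefix[:i] + bit + prefix[i + 1:]: [] for bit in "01"}
    for key in keys:
        # Every point is copied into the child that matches its next bit.
        children[prefix[:i] + key[i] + prefix[i + 1:]].append(key)
    return children


print(split_bucket("01**", ["0100", "0111", "0110"]))
```

Every key of the old bucket is touched once, which is why the split cost grows with bucket size when each bucket lives in its own table.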
Page 17
Experimental Results: Multi-dimensional Range Query
Dataset: 400,000,000 points
Queries: select objects within MD ranges and change selectivity
Cluster size: 4 nodes
MD-HBase responds 10~100 times faster than the others, and its response time is proportional to selectivity.
[Figure: Response Time (sec, log scale, 1 to 1000) vs. Selectivity (%, 0.01 to 10) for MD-HBase, HBase (ZOrder), and MapReduce]
Page 18
Experimental Results: k Nearest Neighbors Query
Dataset: 400,000,000 points
Queries: choose a point and change the number of neighbors
Cluster size: 4 nodes
MD-HBase responds in 1.5 sec where k ≦ 100, and in 11 sec even if k = 10,000
[Figure: Response Time (sec, 0 to 12) vs. k: Number of Neighbors (1 to 10000)]
Page 19
Experimental Results: Insert
Dataset: spatially skewed data generated by a Zipfian distribution
MD-HBase shows good scalability without significant overhead.
[Figure: Throughput (records/sec, 0 to 250,000) vs. Number of nodes (0 to 20) for MD-HBase and HBase (ZOrder)]
Page 20
Conclusions
Designed a scalable multi-dimensional data store.
  Scalability & efficient multi-dimensional queries
  Key Idea: indexing the longest common prefix of keys
  Easily extends general ordered key-value stores.
Demonstrated scalable insert throughput and excellent query performance.
  Range Query: 10-100 times faster than existing technologies.
  kNN Query: 1.5 s when k ≦ 100.
  Insert: 220K inserts/sec on a 16-node cluster without overhead
Thank you. Any Questions?