
Information Technology

Influence Computation in Spatial Databases

Muhammad Aamir Cheema
Faculty of Information Technology
Monash University, Australia

aamir.cheema@monash.edu
www.aamircheema.com

Outline

• Introduction
• Reverse k Nearest Neighbors Queries
• Reverse Top-k Queries
• Reverse Skyline Queries
• Other work

Introduction: Influence Set

In a data set consisting of facilities and users, a facility f influences a user u if u considers f as one of its most “important” facilities

The set of users influenced by f is called the influence set of f

Influence

Influence Set

Influence Set of Coles

A facility f is important for a user u if it is one of the top-k facilities for u considering her preferences, e.g., distance, rating, and price

Important facility?

Who are my potential customers ?

Significance

• Important for identifying potential users/customers
• Used in various applications such as marketing, cluster and outlier analysis, and decision support systems

• Reverse k Nearest Neighbors
• Reverse Top-k
• Reverse Skyline

Outline: Reverse k Nearest Neighbors

• Introduction
• Pre-computation based approach
• On-the-fly algorithms
  – Six-regions [2000]
  – TPL [2004]
  – FINCH [2008]
  – Influence Zone [2011]
  – SLICE [2014]
  – TPL++ [2015]
• Comparison of RkNN algorithms

Reverse k Nearest Neighbors (RkNN)

• Definition of importance
  – A facility f is important to a user u if f is one of its k closest facilities
• Reverse k Nearest Neighbors
  – Find every user u for which the query facility q is one of its k closest facilities

Influence set of f1 is {u1,u2}

Influence set of f2 is {u3}
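The RkNN definition can be checked directly with a linear scan. The following is a minimal sketch, not one of the algorithms discussed in this talk; the function name, the sample coordinates, and the rule that ties are resolved in favour of q are assumptions made for illustration.

```python
from math import dist


def rknn_brute_force(q, facilities, users, k):
    """Return the users for which q is one of the k closest facilities.

    `facilities` holds the competitors of q (q itself is not in the list).
    Assumption of this sketch: ties are resolved in favour of q.
    """
    result = []
    for u in users:
        d_q = dist(u, q)
        # Count the facilities that are strictly closer to u than q is.
        closer = sum(1 for f in facilities if dist(u, f) < d_q)
        if closer < k:
            result.append(u)
    return result


# Hypothetical data: one query facility, two competitors, four users.
q = (0.0, 0.0)
facilities = [(4.0, 0.0), (0.0, 5.0)]
users = [(1.0, 0.0), (3.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
print(rknn_brute_force(q, facilities, users, k=1))
```

This costs one distance computation per user-facility pair, which is what the pruning techniques in the rest of the section try to avoid.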


Pre-computation based approach [F. Korn et al., SIGMOD 2000]

• Pre-computation
  – For each user u, draw a circle centered at u containing its k closest facilities
  – Index these circles using an R-tree
• Query processing
  – Find the circles that contain q
• Problems
  – Arbitrary k?
  – Data updates?
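A minimal sketch of the pre-computation idea, assuming a linear scan over the circles in place of the R-tree; the function names are hypothetical. It also makes the two problems visible: the radii depend on a fixed k, and any change to the facilities invalidates them.

```python
from math import dist


def precompute_knn_radii(users, facilities, k):
    """For each user, the radius of the circle that contains its k closest facilities.

    Assumes len(facilities) >= k.
    """
    radii = []
    for u in users:
        distances = sorted(dist(u, f) for f in facilities)
        radii.append(distances[k - 1])
    return radii


def rknn_from_circles(q, users, radii):
    """Users whose pre-computed circle contains q (linear scan instead of an R-tree)."""
    return [u for u, r in zip(users, radii) if dist(u, q) <= r]


# Hypothetical data; the radii are computed once and reused for every query point.
facilities = [(4.0, 0.0), (0.0, 5.0)]
users = [(1.0, 0.0), (3.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
radii = precompute_knn_radii(users, facilities, k=1)
print(rknn_from_circles((0.0, 0.0), users, radii))
```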

On-the-fly RkNN Algorithms

Data indexed by R-trees

Pruning
• Prune the search space using nearby facilities of q

Verification
• Find the users that lie in the unpruned space
• For each such user, check whether it is an RkNN of q or not

On-the-fly RkNN Algorithms: Pruning and Verification

• Half-space pruning: TPL (VLDB 2004), TPL++ (PVLDB 2015), FINCH (PVLDB 2008), InfZone (ICDE 2011)
• Region-based pruning: Six-regions (SIGMOD 2000), SLICE (ICDE 2014)

Six-regions: Pruning [I. Stanoi et al., SIGMOD Workshop 2000]

1. Divide the whole space centred at the query q into six equal regions, each of 60°
2. Let f be a facility in a partition P
3. Let u be a user in P for which dist(u,q) > dist(q,f)
4. q cannot be the closest facility of u

Proof sketch:
• ∠fqu ≤ 60° and ∠ufq > 60°
• ∠ufq > ∠fqu ⇒ dist(u,q) > dist(u,f)

Six-regions: Pruning [I. Stanoi et al., SIGMOD Workshop 2000]

1. Divide the whole space centred at the query q into six equal regions
2. Find the k-th nearest facility of q in each partition
3. The k-th nearest facility of q in each region defines the area that can be pruned
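The six-regions pruning rule can be sketched as follows, assuming 2d points and a linear scan over the facilities instead of an R-tree. The function names and the sample points are hypothetical.

```python
from math import atan2, dist, inf, pi


def partition_of(p, q, n_regions=6):
    """Index of the equal-angle region around q that contains point p."""
    angle = atan2(p[1] - q[1], p[0] - q[0]) % (2 * pi)
    # min() guards against rounding at the 2*pi boundary.
    return min(int(angle / (2 * pi / n_regions)), n_regions - 1)


def pruning_radii(q, facilities, k):
    """Distance from q to its k-th nearest facility in each of the six regions."""
    per_region = [[] for _ in range(6)]
    for f in facilities:
        per_region[partition_of(f, q)].append(dist(f, q))
    # A region with fewer than k facilities cannot prune anything.
    return [sorted(d)[k - 1] if len(d) >= k else inf for d in per_region]


def is_pruned(u, q, radii):
    """A user is pruned if it is farther from q than the pruning radius of its region."""
    return dist(u, q) > radii[partition_of(u, q)]


q = (0.0, 0.0)
radii = pruning_radii(q, [(2.0, 1.0)], k=1)
print(is_pruned((5.0, 1.0), q, radii), is_pruned((1.0, 0.5), q, radii))
```

Users that are not pruned are only candidates and still need verification.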

Six-regions: Verification [I. Stanoi et al., SIGMOD Workshop 2000]

• Access the users R-tree and prune the entries that lie in the pruned area
• For each unpruned user u
  – Issue a boolean range query to check whether u is an RkNN or not

Disadvantage: Requires a boolean range query for each candidate user

TPL: Pruning [Y. Tao et al., VLDB 2004]

• Half-space pruning
  – q cannot be the closest facility of u if u lies in the half-space of a facility f, i.e., on f's side of the bisector between q and f
  – q cannot be among the k closest facilities of u if u lies in k half-spaces
• Pruning algorithm
  1. Find the nearest unseen facility f in the unpruned area
  2. Draw the bisector between q and f to prune the search space
  3. Go to step 1 unless all facilities in the unpruned area have been accessed
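The pruning loop can be sketched for points as below. This is a simplification under stated assumptions: facilities are visited in sorted order of their distance from q rather than through a best-first R-tree traversal, and the function names are hypothetical.

```python
from math import dist


def in_half_space(p, q, f):
    """True if p lies on f's side of the bisector between q and f."""
    return dist(p, f) < dist(p, q)


def is_pruned(p, q, pruners, k):
    """p is pruned if it lies in at least k half-spaces."""
    return sum(in_half_space(p, q, f) for f in pruners) >= k


def select_pruning_facilities(q, facilities, k):
    """Visit facilities by increasing distance from q and keep those in the unpruned area."""
    pruners = []
    for f in sorted(facilities, key=lambda f: dist(f, q)):
        if not is_pruned(f, q, pruners, k):
            pruners.append(f)  # its bisector is used for further pruning
    return pruners


q = (0.0, 0.0)
pruners = select_pruning_facilities(q, [(2.0, 0.0), (5.0, 0.0), (0.0, 3.0)], k=1)
print(pruners, is_pruned((4.0, 0.0), q, pruners, k=1))
```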

Advantage: Prunes more space than six-regions
Disadvantage: Pruning is more expensive, especially when k is not small

Find the k half-spaces that contain the user
• Checking every possible subset of k out of m half-spaces requires using $\binom{m}{k} = \frac{m!}{k!\,(m-k)!}$ subsets

Solution: TPL does not use all possible subsets
1. Sort facilities by Hilbert values
2. Consider only the subsets consisting of k consecutive facilities

• Considers m subsets
• Some pruning power is lost

TPL: Verification [Y. Tao et al., VLDB 2004]

• Prune the user R-tree entries using the k half-spaces approach
• Determine the candidate users
• Issue a bulk boolean range query to verify all candidate users

FINCH: Pruning [W. Wu et al., PVLDB 2008]

Key idea: Approximate the unpruned area by a convex polygon

Advantage: Pruning is more efficient (e.g., point containment in logarithmic time)

Computing the polygon
• Get the intersection points of the half-spaces and the boundary of the data space
• For each intersection point
  – Compute a counter that denotes the number of half-spaces that contain it
  – Remove the intersections with counter ≥ k
• Compute the convex hull of the remaining intersection points

Pruning algorithm
1. Initialize the whole space as the convex polygon
2. Find the nearest facility that lies inside the convex polygon
3. Draw its half-space, compute the new intersections and their counters, and update the convex polygon
4. Go to step 2 while there is an un-accessed facility inside the polygon

FINCH: Verification [W. Wu et al., PVLDB 2008]

• Prune the user R-tree entries that lie outside the convex polygon
• For each user that lies inside the polygon
  – Issue a boolean range query to check whether it is an RkNN or not

Influence Zone (InfZone): Motivation [M. Cheema et al., ICDE 2011]

Pruning
• Prune the search space using nearby facilities of q

Verification
• Find the users that lie in the unpruned space
• For each such user, issue a boolean range query to verify it

Influence Zone is an area such that a user u is an RkNN if and only if u is inside this area
• Compute the influence zone using nearby facilities
• Find the users that lie in the influence zone

Influence Zone (InfZone): Challenges [M. Cheema et al., ICDE 2011]

The influence zone corresponds to the unpruned polygon when the bisectors of all the facilities have been considered for pruning.

Challenges:
• How to compute the unpruned polygon?
• Using all facilities for pruning will be very expensive

Influence Zone (InfZone): Construction [M. Cheema et al., ICDE 2011]

Challenge 1: Constructing the polygon
• Like FINCH, compute the counters of all intersections
• Remove the intersections with counter ≥ k
• Keep only the intersections that either lie on the boundary of the data space or have a counter equal to k-1 or k-2
• Keep only the extreme intersections on each boundary
• Sort the intersections according to their angles with q
• Connect the intersections in the sorted order

Challenge 2: Avoid accessing all facilities
• Let Cv denote the circle centered at a vertex v with radius dist(v,q)
• A facility f can be ignored if it lies outside Cv for every vertex v of the current influence zone
• An entry e of the facility R-tree can be ignored if it lies outside Cv for every vertex v of the current influence zone

Influence Zone Construction Algorithm
• Initialize InfZone as the whole data space
• Insert the root of the facility R-tree into a heap
• While the heap is not empty
  – De-heap an entry e
  – If e lies outside every Cv
    • Ignore e
  – Else
    • If e is an intermediate node
      – Insert the children of e into the heap
    • Else
      – Draw the bisector between q and e and update the current influence zone
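The test "lies outside Cv for every vertex v" can be sketched as below for a facility point and for an axis-aligned R-tree entry. The function names, the rectangle representation as a (lower-left, upper-right) pair, and the square zone in the example are assumptions for illustration; constructing the zone itself is not shown.

```python
from math import dist, hypot


def can_ignore_facility(f, q, zone_vertices):
    """True if f lies outside C_v (circle centred at v with radius dist(v, q)) for every vertex v."""
    return all(dist(f, v) > dist(v, q) for v in zone_vertices)


def can_ignore_entry(rect, q, zone_vertices):
    """Same test for a rectangle, using the minimum distance from each vertex to the rectangle."""
    (x1, y1), (x2, y2) = rect

    def mindist(v):
        dx = max(x1 - v[0], 0.0, v[0] - x2)
        dy = max(y1 - v[1], 0.0, v[1] - y2)
        return hypot(dx, dy)

    return all(mindist(v) > dist(v, q) for v in zone_vertices)


# Hypothetical current influence zone: a square around q.
q = (0.0, 0.0)
zone = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
print(can_ignore_facility((5.0, 5.0), q, zone), can_ignore_facility((1.5, 0.0), q, zone))
```

A facility outside every Cv is farther from each vertex than q is, so its bisector cannot cut the current polygon.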

Influence Zone (InfZone): Verification [M. Cheema et al., ICDE 2011]

• Prune the user R-tree entries that lie outside the influence zone
• Return the users that lie inside the influence zone

Point containment can be done in logarithmic time, O(log m)

Rectangle containment takes linear time, O(m)

SLICE: Motivation[S. Yang et al., ICDE 2014]

Regions-based (Six-regions)

Half-space

(InfZone)

Range query

Pruning CostO(m log k) O(km2

Pruning Power

Verification Cost

Low High

O(log m)

O(m log m)

m is the # of facilities considered for pruning

SLICE: Key Idea [S. Yang et al., ICDE 2014]

1. Divide the whole space centred at the query q into t equal regions
2. Draw arcs for each facility
3. The k-th arc in each partition defines the pruning region

Pruning requires checking only one distance

SLICE: Comparison with six-regions [S. Yang et al., ICDE 2014]

Six-regions vs. SLICE (comparison figure): partitions pruned, number of partitions, and area pruned

Area pruned: dist(f,q) for Six-regions vs. $\frac{dist(f,q)}{2\cos(\theta_{max})}$ for SLICE
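The SLICE radius dist(f,q) / (2 cos(θmax)) can be computed as sketched below. The sketch assumes that θmax is the largest angle, measured at q, between the direction of f and the two boundary rays of the partition, and that a facility with θmax ≥ 90° yields no arc for that partition. The function names and the representation of a partition as a pair of angles (in radians, width below 180°) are hypothetical.

```python
from math import atan2, cos, dist, inf, pi


def angular_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * pi)
    return min(d, 2 * pi - d)


def upper_arc_radius(f, q, start, end):
    """Radius r such that every point of the partition [start, end] farther than r
    from q is closer to f than to q."""
    phi = atan2(f[1] - q[1], f[0] - q[0])
    theta_max = max(angular_diff(phi, start), angular_diff(phi, end))
    if theta_max >= pi / 2:
        return inf  # assumption: f provides no arc for this partition
    return dist(f, q) / (2 * cos(theta_max))


def pruning_radius(facilities, q, start, end, k):
    """Radius of the k-th smallest arc in the partition."""
    radii = sorted(upper_arc_radius(f, q, start, end) for f in facilities)
    return radii[k - 1] if len(radii) >= k else inf


q = (0.0, 0.0)
print(upper_arc_radius((2.0, 0.0), q, 0.0, pi / 3))
```

A user in the partition is pruned when its distance from q exceeds `pruning_radius`, which is the single distance check mentioned on the key-idea slide. Unlike six-regions, a facility can contribute an arc to a partition it does not lie in.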

SLICE: Verification [S. Yang et al., ICDE 2014]

• Bounding arc: the k-th arc in each partition is called the bounding arc
• Significant facility: a facility f that prunes at least one point p ∈ P lying inside the bounding arc of P
• An insignificant facility cannot prune any candidate user

Verification for a candidate
• Issuing a range query for each candidate: high I/O and CPU cost
• SLICE: access significant facilities during pruning and use them to verify each candidate, O(k)

Regions-based

TPL++: Optimization 1 [S. Yang et al., PVLDB 2015]

TPL:
1. Sort facilities by Hilbert values
2. Consider only the subsets consisting of k consecutive facilities
• Considers m subsets
• Some pruning power is lost

TPL++:
1. Initialize a counter to 0
2. Access facilities one by one
3. Increment the counter whenever a facility prunes the user u
4. Prune u when counter ≥ k
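A minimal sketch of the counter-based check for a single user point; handling R-tree entries (rectangles) is omitted, and the function name is hypothetical.

```python
from math import dist


def pruned_by_counter(u, q, facilities, k):
    """Access facilities one by one and stop as soon as k of them prune u."""
    counter = 0
    d_q = dist(u, q)
    for f in facilities:
        if dist(u, f) < d_q:  # u lies in the half-space of f
            counter += 1
            if counter >= k:
                return True
    return False


q = (0.0, 0.0)
facilities = [(2.0, 0.0), (0.0, 3.0), (5.0, 0.0)]
print(pruned_by_counter((4.0, 0.0), q, facilities, k=2))
```

Because every facility is tested, no pruning power is lost, and each user costs at most one pass over the m facilities.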

Pruning power: TPL vs TPL++[S. Yang et al., PVLDB 2015]

TPL++: Optimization 2 [S. Yang et al., PVLDB 2015]

TPL:
• A facility entry e or a facility point that lies in the pruned space is ignored

TPL++:
• A facility entry e that lies in the pruned space is ignored
• A facility point is used for pruning even if it lies in the pruned space

TPL vs TPL++

[Figure: two panels comparing TPL and TPL++, I/O cost and CPU cost (ms), over the axis values 2, 5, 10, 15, 20, 25. Annotations: 2 times better, 20 times better]

Comparison of RkNN Algorithms

                 Six-regions  TPL       TPL++     FINCH     InfZone   SLICE
Pruning
  node           O(1)         O(km)     O(m)      O(m)      O(m)      O(1)
  point          O(1)         O(km)     O(m)      O(log m)  O(m)      O(1)
  adding f       O(log k)     O(log m)  O(log m)  O(m^2)    O(m^2)    O(log m)
Verification
  node           O(1)         O(km)     O(m)      O(m)      O(m)      O(1)
  point          O(1)         O(km)     O(m)      O(log m)  O(log m)
  #candidates    Large        Large     Small     Medium    Minimal   Small
  verifying u    Range query; bulk range query; range query; O(log m)

Experimental Comparison [Yang et al., PVLDB 2015]

• Setup
  – Intel Xeon 2.66 GHz CPU, 4GB memory, and hard disk
  – Index: R*-tree
  – 100 buffers
  – I/O cost and CPU cost
  – Average cost per query
• Data sets
  – Three real data sets (up to 25M points): CA, LA and NA
  – Synthetic data sets following different distributions (up to 20M points)

Source code and data sets are available online

Ranking (1st to 6th; algorithms separated by commas share a rank)

• I/O (no buffer): TPL++, InfZone > SLICE > TPL > FINCH > SIX
• I/O (small buffer): TPL++, InfZone > FINCH > SLICE > TPL, SIX
• CPU (k<10): SLICE > InfZone > TPL++ > FINCH > SIX, TPL
• CPU (10<k<25): SLICE > InfZone, TPL++ > FINCH > SIX > TPL
• CPU (25<k<200): SLICE > TPL++ > SIX > FINCH > InfZone > TPL
• Implementation: SIX, SLICE > TPL, TPL++ > FINCH, InfZone

Outline

• Introduction
• Reverse k Nearest Neighbors Queries
• Reverse Top-k Queries
  – Introduction
  – Monochromatic algorithms (2d)
  – Bichromatic algorithms (≥2d)
• Reverse Skyline Queries
• Other work

Reverse Top-k (RTk) Queries
Introduced by [Vlachou et al., ICDE 2010]

• Definition of importance (top-k queries)
  – Each user u has a preference function
  – Score of a facility is score(f) = w[1]*f[1] + … + w[d]*f[d]
  – A facility f is important to a user u if f is one of the top-k facilities for u
• Bichromatic Reverse Top-k Query (RTk)
  – Find every user u for which the query facility q is one of her top-k facilities

Example (from [Vlachou et al., ICDE 2010]):
• Score(p2) = 0.2×3 + 0.8×2 = 2.2
• Tom and Max are the reverse top-1 users of p2
• Bob is not a reverse top-1 user of p2

Reverse Top-k (RTk) Queries: Types
Introduced by [Vlachou et al., ICDE 2010]

Examples are from [Vlachou et al., ICDE 2010]; q = p2, k = 1

• Bichromatic RTk queries
  – Find every user u for which the query facility q is one of her top-k facilities (e.g., the result is {Tom, Max})
• Monochromatic RTk queries
  – Find every weighting vector for which q is one of the top-k facilities
  – Result: line segment where w[price] ∈ [1/7, 5/6]


Monochromatic Reverse Top-k Algorithms [Vlachou et al., ICDE 2010]

• Score(q) is the projection of q on the vector w (illustrated with w = [0.5, 0.5])
• Rank(q) w.r.t. w: number of facilities below the red line
• Rank(q) < Rank(f) for every w if q dominates f
• Ignore facilities that are dominated by q
• Result is empty if k facilities dominate q
• The relative rank of q and f depends on the rotation of the red line

Algorithm
• Start with a vertical line
• Rank(q): count the number of facilities on the left
• Rotate the line counter-clockwise
• Update Rank(q) when the line intersects a facility
• Report the weighting vectors for which Rank(q) ≤ k

Rank(q) = 21

RTk using k-lower envelope (2d) [Cheema et al., EDBT 2014]

• Given a point a=(u,v) and a weighting vector W=(w1,w2), a.score = u*w1 + v*w2
• A point a=(u,v) is mapped to a line a*: y = ux + v in the dual space
• The weighting vector W=(w1,w2) is mapped to a vertical line W*: x = w1/w2
• The intersection of a* and W* is the point where y = u(w1/w2) + v = (u*w1 + v*w2)/w2 = a.score/w2
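The mapping can be checked numerically: the height at which a* meets W* equals a.score/w2, so ordering the lines by height along W* gives the same ranking as ordering the points by score. This is a minimal sketch assuming w2 > 0; the function names and sample points are hypothetical.

```python
from math import isclose


def score(point, w):
    """Primal score: u*w1 + v*w2."""
    return point[0] * w[0] + point[1] * w[1]


def dual_height(point, w):
    """Height at which the dual line y = u*x + v meets the vertical line x = w1/w2."""
    u, v = point
    x = w[0] / w[1]  # assumes w2 > 0
    return u * x + v


def top_k_dual(points, w, k):
    """The k lowest lines along W*, i.e. the k points with the smallest scores."""
    return sorted(points, key=lambda p: dual_height(p, w))[:k]


a, w = (3.0, 2.0), (0.2, 0.8)
print(isclose(dual_height(a, w), score(a, w) / w[1]))
print(top_k_dual([(3.0, 2.0), (1.0, 4.0), (5.0, 1.0)], w, k=2))
```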

W*: x = w1/ w2

ya= a.score/w2

yb= b.score/w2

Primal Dual

• Query: Given a weighting vector W=(w1,w2), return the k objects with the smallest scores
• Solution:
  – Map W and all the objects to the dual space
  – Return the k lowest lines intersecting W*

[Figure: primal and dual space with the vertical lines W*: x = w1/w2 and W*: x = w3/w4 and the rankings 1. a, 2. b, 3. c, 4. d and 1. d, 2. b, 3. a, 4. c]

• Given a set of lines L, the mass of a point p is the number of lines that lie strictly below p
• The k-lower envelope consists of every point p that lies on one of the lines in L and has mass equal to k-1 (figure: 2-lower envelope)
• Map all facilities to the dual space and compute the k-lower envelope
• Map the query point to the dual space
• Return the weighting vectors where the query line is below the k-lower envelope
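The answer of a monochromatic query in 2d can also be obtained without materialising the envelope, because the rank of the query line changes only where it crosses another line. The sketch below is not the EDBT 2014 algorithm; it tests the midpoint of every interval between consecutive crossings. The function name and the bounds `x_lo`/`x_hi` on x = w1/w2 are assumptions.

```python
def monochromatic_rtk_2d(q, facilities, k, x_lo=0.0, x_hi=10.0):
    """Intervals of x = w1/w2 in [x_lo, x_hi] where fewer than k dual lines lie below q's line."""
    uq, vq = q
    cuts = {x_lo, x_hi}
    for uf, vf in facilities:
        if uf != uq:  # parallel lines never cross
            x = (vf - vq) / (uq - uf)
            if x_lo < x < x_hi:
                cuts.add(x)
    cuts = sorted(cuts)
    result = []
    for a, b in zip(cuts, cuts[1:]):
        mid = (a + b) / 2
        y_q = uq * mid + vq
        # Mass of the query line at mid: number of lines strictly below it.
        below = sum(1 for uf, vf in facilities if uf * mid + vf < y_q)
        if below < k:
            if result and result[-1][1] == a:
                result[-1] = (result[-1][0], b)  # merge with the previous interval
            else:
                result.append((a, b))
    return result


print(monochromatic_rtk_2d((3, 2), [(1, 4), (5, 1)], k=1))
```

If the weights are normalised so that w1 + w2 = 1, a value x corresponds to w1 = x / (1 + x).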


Computing k-lower envelope (2d) [Cheema et al., EDBT 2014]

• Start from the leftmost point on the k-lower envelope (always move towards the right)
  – Line with the k-th largest slope, i.e., the point in the primal with the k-th largest x-value (a point (u,v) in the primal is mapped to the line y = ux + v)
• Upon reaching an intersection, make a turn (i.e., leave the current road)
• The path travelled is the k-lower envelope


Bichromatic Reverse Top-k (≥2d) [Vlachou et al., ICDE 2010]

Given a set of facilities F and a set of weighting vectors W, return every weighting vector for which q is one of the top-k facilities

Brute force algorithm:
• For each vector w in W
  – Compute the top-k facilities
  – Return w if q is among the top-k facilities
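A minimal sketch of the brute-force algorithm, assuming that smaller scores are better and that ties are resolved in favour of q. The function names, the user labels, and the data are hypothetical and are not the example from [Vlachou et al., ICDE 2010].

```python
def score(p, w):
    """Weighted sum w[1]*p[1] + ... + w[d]*p[d]."""
    return sum(wi * pi for wi, pi in zip(w, p))


def reverse_top_k_brute_force(q, facilities, weights, k):
    """Names of the weighting vectors for which q is one of the top-k facilities.

    `facilities` holds the competitors of q; `weights` maps a user name to her vector.
    """
    result = []
    for name, w in weights.items():
        s_q = score(q, w)
        better = sum(1 for f in facilities if score(f, w) < s_q)
        if better < k:
            result.append(name)
    return result


weights = {"w1": (0.2, 0.8), "w2": (0.5, 0.5), "w3": (0.8, 0.2)}
print(reverse_top_k_brute_force((3, 2), [(1, 4), (5, 1)], weights, k=1))
```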

Threshold-based algorithm (RTA)
• Sort the weighting vectors by their pair-wise similarity (similar vectors have similar top-k results)
• Evaluate the first top-k query and calculate a threshold
• For each weighting vector
  – Try to prune using the threshold
  – Refine the threshold

Example (from [Vlachou et al., ICDE 2010]): W = [w1, w2, w3], buffer: p1, p2
• Evaluate the top-2 query for w1
• Set the threshold based on w2
• score(q) for w2 > threshold ⇒ discard w2
• Compute the top-k for w3 and update the buffer
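The buffer-and-threshold idea can be sketched as follows. This is a simplified reading of RTA under stated assumptions: smaller scores are better, the weighting vectors are assumed to arrive already sorted by similarity, and each top-k query is answered by sorting all facilities. The function names and data are hypothetical.

```python
def score(p, w):
    return sum(wi * pi for wi, pi in zip(w, p))


def top_k(facilities, w, k):
    """The k facilities with the smallest scores for w, best first."""
    return sorted(facilities, key=lambda p: score(p, w))[:k]


def rta(q, facilities, weights, k):
    """Return (qualifying weighting vectors, number of top-k queries evaluated)."""
    result, buffer, evaluated = [], [], 0
    for w in weights:  # assumed to be pre-sorted by pair-wise similarity
        # Threshold: the worst score of the buffered facilities under the new vector w.
        if len(buffer) == k and score(q, w) > max(score(p, w) for p in buffer):
            continue  # k facilities already beat q, so w is discarded without a top-k query
        buffer = top_k(facilities, w, k)
        evaluated += 1
        if len(buffer) < k or score(q, w) <= score(buffer[-1], w):
            result.append(w)
    return result, evaluated


weights = [(0.5, 0.5), (0.8, 0.2), (0.9, 0.1), (0.2, 0.8)]
print(rta((3, 2), [(1, 4), (5, 1), (4, 4)], weights, k=1))
```

In this run only two of the four top-k queries are evaluated; the other two vectors are discarded by the threshold.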

Bichromatic Reverse Top-k (≥2d) [Vlachou et al., SIGMOD 2013]

Branch-and-bound algorithm: Key idea
• Weighting vectors and facilities are indexed (e.g., by R-trees)
• Compute upper and lower bounds
• Prune using the bounds
• Process unpruned entries

Outline

• Introduction
• Reverse k Nearest Neighbors Queries
• Reverse Top-k Queries
• Reverse Skyline Queries
  – Introduction
  – Pre-computation based approach
  – On-the-fly algorithm
• Other work

Reverse Skyline [Dellis et al., VLDB 2007]

Dominance
• A point x dominates y if x is at least as good as y on all the dimensions and x is better than y on at least one dimension

Skyline
• Return every point that is not dominated by any other point

Distance

Reverse Skyline [Dellis et al., VLDB 2007]

Dynamic Dominance
• A user u gives her ideal point
• A point x dynamically dominates y if its difference from u is not larger than y's difference on each dimension and is smaller on at least one dimension

Dynamic Skyline
• Return every point that is not dynamically dominated by any other point
• Transform each x[i] to |u[i] – x[i]|
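A minimal sketch of the dynamic skyline computed through the transformation |u[i] – x[i]|, using a quadratic pairwise comparison. The function names and sample points are hypothetical, and smaller values are assumed to be better.

```python
def dominates(x, y):
    """x is at least as good as y on every dimension and better on at least one."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def transform(p, u):
    """Map each coordinate p[i] to |u[i] - p[i]|."""
    return tuple(abs(ui - pi) for ui, pi in zip(u, p))


def dynamic_skyline(points, u):
    """Points that are not dynamically dominated w.r.t. the ideal point u."""
    t = {p: transform(p, u) for p in points}
    return [p for p in points if not any(dominates(t[o], t[p]) for o in points)]


points = [(1, 1), (2, 3), (4, 4), (5, 2)]
print(dynamic_skyline(points, u=(4, 3)))
```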

Distance from airport

y` a`z`

Reverse Skyline [Dellis et al., VLDB 2007]

Definition of Importance
• A user u considers a facility f to be important if f is in the dynamic skyline of u

Reverse Skyline
• Return every user u for which the query facility is in its dynamic skyline

y` a`z`


Pre-computation based approach [Dellis et al., VLDB 2007]

Pre-computation
• For each user u
  – Compute and store its dynamic skyline

Query processing
• u is not an answer if q is dominated by its pre-computed skyline
• u is an answer if q is not dominated by its pre-computed skyline

y` a`z`

Pre-computation based approach [Dellis et al., VLDB 2007]

Reducing storage requirement
• For each user u
  – Store only k of its dynamic skyline points

Query processing
• u is not an answer if q is dominated by any of the k stored points
• u is guaranteed to be an answer if q dominates any of the k stored points
• Otherwise, call verification to check if u is an answer


On-the-fly Algorithm [Dellis et al., VLDB 2007]

• The window of a user u is a rectangle centered at u with q at one of its corners
• A user u is an answer iff its window is empty

Key idea
• Divide the space around q into 2^d partitions
• Compute the skyline for each partition
• Any user dominated by these skylines cannot be the answer
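The window test can be sketched with a linear scan: a facility lies in the window of u exactly when it dynamically dominates q with respect to u. The partition-based pruning is not shown; the function names and data are hypothetical.

```python
def in_window(p, u, q):
    """True if p dynamically dominates q w.r.t. u, i.e. p falls in the window of u."""
    dp = [abs(ui - pi) for ui, pi in zip(u, p)]
    dq = [abs(ui - qi) for ui, qi in zip(u, q)]
    return all(a <= b for a, b in zip(dp, dq)) and any(a < b for a, b in zip(dp, dq))


def reverse_skyline(q, facilities, users):
    """Users whose window contains no facility."""
    return [u for u in users if not any(in_window(f, u, q) for f in facilities)]


facilities = [(2, 3), (4, 4), (1, 1)]
users = [(4, 3), (6, 1), (5, 5)]
print(reverse_skyline((5, 2), facilities, users))
```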


Other work on reverse spatial queries

• Uncertain data
• Continuous monitoring (e.g., moving objects, data streams)
• Influence maximization
• Other spaces (e.g., road network, general metric space, non-metric space, obstructed space)
• Spatial keyword queries
• …

Open problems on reverse spatial queries

• Location-based reverse top-k queries
• Location-based reverse skyline queries

Location-based Reverse Top-k

• Definition of importance
  – Each user u has a preference function
  – A facility f is important to a user u if f is one of the top-k facilities for u
• Reverse Top-k Query (RTk)
  – Find every user u for which the query facility q is one of her top-k facilities

Influence set of f1 is {u2}

Price=1

Price=22

0.9*price + 0.1*distance

0.5*price + 0.5*distance

1*distance

Location-based Reverse Skyline

• Dominance
  – A facility x dominates another facility y w.r.t. a user u if, for every attribute, u prefers x over y
• Definition of importance
  – A facility f is important to a user u if f is not dominated by any other facility
• Reverse Skyline
  – Find every user u for which the query facility q is not dominated by any other facility

Influence set of f2 is {u1,u2,u3}

Price=1

Price=2

References

1. Flip Korn, S. Muthukrishnan: Influence Sets Based on Reverse Nearest Neighbor Queries. SIGMOD 2000:201-212

2. Ioana Stanoi, Divyakant Agrawal, Amr El Abbadi: Reverse Nearest Neighbor Queries for Dynamic Databases. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000:44-53

3. Yufei Tao, Dimitris Papadias, Xiang Lian: Reverse kNN Search in Arbitrary Dimensionality. VLDB 2004:744-755

4. Evangelos Dellis, Bernhard Seeger: Efficient Computation of Reverse Skyline Queries. VLDB 2007:291-302

5. Wei Wu, Fei Yang, Chee Yong Chan, Kian-Lee Tan: FINCH: evaluating reverse k-Nearest-Neighbor queries on location data. PVLDB 1(1):1056-1067 (2008)

6. Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, Kjetil Nørvåg: Reverse top-k queries. ICDE 2010:365-376

7. Muhammad Aamir Cheema, Xuemin Lin, Wenjie Zhang, Ying Zhang: Influence zone: Efficiently processing reverse k nearest neighbors queries. ICDE 2011:577-588

8. Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, Kjetil Nørvåg: Monochromatic and Bichromatic Reverse Top-k Queries. IEEE Trans. Knowl. Data Eng. (TKDE) 23(8):1215-1229 (2011)

9. Muhammad Aamir Cheema, Wenjie Zhang, Xuemin Lin, Ying Zhang: Efficiently processing snapshot and continuous reverse k nearest neighbors queries. VLDB J. (VLDB) 21(5):703-728 (2012)

10. Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Yannis Kotidis: Branch-and-bound algorithm for reverse top-k queries. SIGMOD 2013:481-492

11. Shiyu Yang, Muhammad Aamir Cheema, Xuemin Lin, Ying Zhang: SLICE: Reviving regions-based pruning for reverse k nearest neighbors queries. ICDE 2014:760-771

12. Muhammad Aamir Cheema, Zhitao Shen, Xuemin Lin, Wenjie Zhang: A Unified Framework for Efficiently Processing Ranking Related Queries. EDBT 2014:427-438

13. Shiyu Yang, Muhammad Aamir Cheema, Xuemin Lin, Wei Wang: Reverse k Nearest Neighbors Query Processing: Experiments and Analysis. PVLDB 8(5):605-616 (2015)

Thanks
