Post on 15-Jan-2016
www.ntnu.no
Efficient Processing of Top-k Spatial Keyword Queries
João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg
SSTD 2011 - Minneapolis, Minnesota, USA
Outline
• Top-k spatial keyword queries
• Current approaches
• Spatial inverted index
• Single-keyword queries
• Multiple-keyword queries
• Experimental evaluation
• Conclusion
Motivation
• More and more documents on the Internet are associated with a spatial location
– Ex.: tweets, images (Flickr), Wikipedia sites, OpenStreetMap objects, …
• Most of these geotagged objects are also associated with text (a description)
Top-k spatial keyword queries
• Query
– Spatial location
– Query keywords (e.g., "Italian food")
• Returns the k best spatio-textual objects ranked in terms of both
– Spatial distance to the query location
– Textual relevance to the query keywords
Another example…
[Figure: objects and the query location q, with the distance between q and an object marked]
Ranking objects
• Score: $\tau(p,q) = \alpha \cdot \delta(p,q) + (1-\alpha) \cdot \theta(p,q)$
• The spatial proximity (δ) is the normalized Euclidean distance between p and q
• The textual relevance (θ) is the cosine similarity between the description of p and the query keywords
• The query preference parameter (α) defines the importance of one measure over the other
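As a rough illustration of the scoring function, the following Java sketch combines a spatial proximity and a textual relevance using α. The slides do not spell out the normalization, so the sketch assumes δ = 1 − d/dMax (with `dMax` a hypothetical maximum distance in the data space) so that nearby objects score higher, and it takes θ as an already computed value in [0, 1]. The class and method names are illustrative only.

```java
// Sketch of the ranking function tau(p,q) = alpha*delta + (1-alpha)*theta.
// Assumption (not stated on the slide): delta = 1 - d/dMax, so closer objects
// get a higher proximity value.
class SpatialTextScore {

    // Euclidean distance between the object (px, py) and the query (qx, qy).
    static double distance(double px, double py, double qx, double qy) {
        return Math.hypot(px - qx, py - qy);
    }

    // Spatial proximity in [0, 1]; dMax is a hypothetical normalization constant.
    static double proximity(double dist, double dMax) {
        return 1.0 - Math.min(dist, dMax) / dMax;
    }

    // Combined score; alpha in [0, 1] weights proximity against textual relevance.
    static double score(double alpha, double delta, double theta) {
        return alpha * delta + (1.0 - alpha) * theta;
    }

    public static void main(String[] args) {
        double d = distance(0, 0, 3, 4);
        double delta = proximity(d, 10.0);
        System.out.println(score(0.5, delta, 0.8));
    }
}
```

With α close to 1 the ranking is dominated by distance; with α close to 0 it is dominated by textual relevance.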
Current approaches
• Employ a modified R-tree [1,2]
– Each node keeps an abstract document representing all documents in the node sub-tree
• Abstract document
– Pairs (term, weight), one pair per term
– The weight permits computing an upper-bound score for the objects in the node sub-tree
[1] Cong, G., Jensen, C.S., Wu, D.: “Efficient retrieval of the top-k most relevant spatial web objects”, VLDB, 2009.
[2] Li, Z., Lee, K.C., Zheng, B., Lee, W., Lee, D., Wang, X.: “IR-tree: an efficient index for geographic document search”, TKDE, 2010.
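To make the idea of an abstract document more concrete, here is a minimal Java sketch. It assumes that a node's abstract document keeps, for each term, the maximum weight found among its children, and that term weights are plain frequencies; both are simplifying assumptions for illustration, and the class and method names are hypothetical rather than taken from [1] or [2].

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an abstract document: one (term, weight) pair per term,
// where the weight is the maximum weight of that term in the sub-tree.
class AbstractDocument {

    // Merge the documents (or abstract documents) of the children of a node.
    static Map<String, Integer> merge(List<Map<String, Integer>> children) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> child : children) {
            for (Map.Entry<String, Integer> e : child.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::max);
            }
        }
        return merged;
    }

    // Upper bound on the summed query-term weight of any object below the node.
    static int upperBound(Map<String, Integer> abstractDoc, List<String> queryTerms) {
        int bound = 0;
        for (String term : queryTerms) {
            bound += abstractDoc.getOrDefault(term, 0);
        }
        return bound;
    }
}
```

Because the bound is built from per-term maxima, no single object in the sub-tree can exceed it, which is what allows a search to skip whole sub-trees.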
Example
[Figure: example of a modified R-tree with root entries e1, e2, and e3 over the objects p1–p7; each node stores an abstract document of term:weight pairs over the terms bar, pop, pub, rock, and samba]
For simplicity, we assume that the impact of a term is defined by its frequency
Current approaches
• There are several variations
– Incorporating document similarity
– Clustering the nodes
• Main problems
– Frequent and infrequent terms are stored in the same way (have the same cost)
– Several nodes are accessed due to the high dimensionality of text
– Complex management of inverted files and/or vectors, one per node
Spatial inverted index (S2I)
• Similar to an inverted index, S2I maps each term to the objects that contain it
– The most frequent terms are stored in aggregated R-trees (aR-trees)
– The less frequent terms are stored in blocks in a file
• The aR-tree permits accessing the objects in decreasing order of term relevance
• The blocks permit storing the less frequent terms efficiently
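The term-to-storage mapping can be sketched in Java as follows. This is only an illustration of the idea: both storage kinds are modelled by an in-memory list, no actual aR-tree or file block is implemented, and the rule that a term counts as frequent once its postings exceed `blockCapacity` is an assumption of this sketch. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the S2I idea: each term maps to the objects that contain it.
class S2ISketch {

    static class Posting {
        final int objectId;
        final double x, y;     // spatial location of the object
        final double impact;   // impact of the term in the object's description

        Posting(int objectId, double x, double y, double impact) {
            this.objectId = objectId;
            this.x = x;
            this.y = y;
            this.impact = impact;
        }
    }

    private final int blockCapacity;  // assumed number of postings that fit in one block
    private final Map<String, List<Posting>> postings = new HashMap<>();

    S2ISketch(int blockCapacity) {
        this.blockCapacity = blockCapacity;
    }

    // Index one object under each of its terms.
    void insert(int objectId, double x, double y, Map<String, Double> termImpacts) {
        for (Map.Entry<String, Double> e : termImpacts.entrySet()) {
            postings.computeIfAbsent(e.getKey(), t -> new ArrayList<>())
                    .add(new Posting(objectId, x, y, e.getValue()));
        }
    }

    // In this sketch a term is "frequent" (tree-stored) once it overflows a block.
    boolean storedInTree(String term) {
        List<Posting> list = postings.get(term);
        return list != null && list.size() > blockCapacity;
    }

    // All postings of a term; empty if the term is not indexed.
    List<Posting> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}
```

A query for a single term then touches exactly one storage unit, either the block or the tree of that term.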
Distribution of terms
• The distribution of terms is highly skewed
• A few hundred terms account for 50% of the text
[Figure: distribution of terms; x-axis: terms, y-axis: frequency]
Example
Aggregated R-tree (max) for frequent terms (e.g., pub)
• Only relevant objects are evaluated
• The objects are accessed in decreasing order of score
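[Figure: aR-tree for the term "pub". Root e0 holds the entries e1 (max=1) and e2 (max=2); the objects are p1(1), p2(1), p5(2), p6(2), and p7(1). The number in parentheses is the term impact for an object and the maximum value in the sub-tree for a node entry.]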
Single-keyword queries
• Only a single block or tree is accessed
• Block
– All the objects are read and the k best are reported
• Tree
– The nodes are accessed in decreasing order of score
– The algorithm terminates when the score of the k-th object is higher than the upper-bound score of any unvisited node
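The tree case can be sketched as a best-first traversal driven by a max-heap. The Java code below is an illustrative sketch, not the authors' implementation: it assumes the proximity δ = 1 − d/dMax, impacts normalized to [0, 1], and a hypothetical `Entry` type that stands for both objects and inner nodes (an entry without children is treated as an object).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Best-first top-k search over a tree whose nodes store the maximum term impact.
class TopKSingleKeyword {

    static class Entry {
        final String id;
        final double minX, minY, maxX, maxY;  // bounding box; a point has min == max
        final double maxImpact;               // impact (object) or max impact in sub-tree (node)
        final List<Entry> children = new ArrayList<>();

        Entry(String id, double minX, double minY, double maxX, double maxY, double maxImpact) {
            this.id = id;
            this.minX = minX; this.minY = minY;
            this.maxX = maxX; this.maxY = maxY;
            this.maxImpact = maxImpact;
        }

        static Entry object(String id, double x, double y, double impact) {
            return new Entry(id, x, y, x, y, impact);
        }

        boolean isObject() { return children.isEmpty(); }
    }

    // Smallest Euclidean distance from the query point to the entry's bounding box.
    static double minDist(Entry e, double qx, double qy) {
        double dx = Math.max(0, Math.max(e.minX - qx, qx - e.maxX));
        double dy = Math.max(0, Math.max(e.minY - qy, qy - e.maxY));
        return Math.hypot(dx, dy);
    }

    // Exact score for an object, upper-bound score for a node.
    static double score(Entry e, double qx, double qy, double alpha, double dMax) {
        double delta = 1.0 - Math.min(minDist(e, qx, qy), dMax) / dMax;
        return alpha * delta + (1.0 - alpha) * e.maxImpact;
    }

    static List<String> topK(Entry root, double qx, double qy, int k, double alpha, double dMax) {
        // Max-heap: the entry with the highest (bound on the) score comes first.
        PriorityQueue<Entry> heap = new PriorityQueue<>(
            (a, b) -> Double.compare(score(b, qx, qy, alpha, dMax), score(a, qx, qy, alpha, dMax)));
        heap.add(root);
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < k) {
            Entry best = heap.poll();
            if (best.isObject()) {
                result.add(best.id);          // no unvisited entry can score higher
            } else {
                heap.addAll(best.children);   // expand the most promising node
            }
        }
        return result;
    }
}
```

An object is reported only when it reaches the top of the heap, so at that moment its score is at least as high as the bound of every unvisited node, which is the termination condition described above.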
Example, processing top-1
[Figure: top-1 processing on the aR-tree for "pub" (root e0; e1, max=1; e2, max=2; objects p1(1), p2(1), p5(2), p6(2), p7(1)). Labels: minimum distance, top-1. Max-heap states shown: <e1>, <e2, e1>, <p5, p6, e1, p7>.]
Multiple-keyword queries
• Requires aggregating the partial scores of the objects for each term t of the query keywords
• Similar to Fagin’s algorithm (NRA)
– Different bounds
• Score: $\tau(p,q) = \sum_{t \in q.d} \tau_t(p,q)$, where $\tau_t(p,q)$ is the partial score of p for term t
Multiple-keyword algorithm
• For each term t in q, access the objects p in S2I in decreasing order of partial score
– The objects are retrieved from a tree or block
• Update the lower-bound score of p
– Sum of the known partial scores plus the lowest possible partial score (using the spatial distance)
• Update the upper-bound score of the visited objects
• Return the objects whose lower-bound score cannot be exceeded by the remaining objects
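The bound maintenance can be sketched in Java in the style of NRA. This is a simplified illustration under stated assumptions: the per-term inputs are already sorted lists of partial scores (standing in for the tree or block of each term), and an unknown partial score contributes 0 to the lower bound rather than the spatial-distance-based minimum mentioned above. All names are hypothetical, and k is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// NRA-style aggregation of partial scores with lower and upper bounds.
class MultiKeywordSketch {

    static class Partial {
        final String id;
        final double score;
        Partial(String id, double score) { this.id = id; this.score = score; }
    }

    // lists.get(i): partial scores for query term i, sorted in decreasing order.
    // Returns the ids of the top-k objects, ordered by their lower bounds.
    static List<String> topK(List<List<Partial>> lists, int k) {
        int m = lists.size();
        Map<String, double[]> seen = new HashMap<>(); // id -> partial score per term (NaN = unknown)
        double[] frontier = new double[m];            // cap on any score still unread in list i
        for (int depth = 0; ; depth++) {
            boolean exhausted = true;
            for (int i = 0; i < m; i++) {             // one sorted access per term and round
                List<Partial> list = lists.get(i);
                if (depth < list.size()) {
                    Partial p = list.get(depth);
                    double[] scores = seen.get(p.id);
                    if (scores == null) {
                        scores = new double[m];
                        Arrays.fill(scores, Double.NaN);
                        seen.put(p.id, scores);
                    }
                    scores[i] = p.score;
                    frontier[i] = p.score;
                    exhausted = false;
                } else {
                    frontier[i] = 0.0;                // nothing left to read for this term
                }
            }
            List<String> ranked = new ArrayList<>(seen.keySet());
            ranked.sort((a, b) -> Double.compare(lower(seen.get(b)), lower(seen.get(a))));
            if (exhausted) {
                return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
            }
            if (ranked.size() < k) continue;
            double threshold = lower(seen.get(ranked.get(k - 1)));
            double best = 0.0;                        // upper bound of a completely unseen object
            for (double f : frontier) best += f;
            for (String id : ranked.subList(k, ranked.size())) {
                best = Math.max(best, upper(seen.get(id), frontier));
            }
            if (best <= threshold) return new ArrayList<>(ranked.subList(0, k));
        }
    }

    // Lower bound: sum of the known partial scores (unknown ones count as 0 here).
    static double lower(double[] scores) {
        double sum = 0.0;
        for (double s : scores) if (!Double.isNaN(s)) sum += s;
        return sum;
    }

    // Upper bound: an unknown partial score is capped by the last score read for that term.
    static double upper(double[] scores, double[] frontier) {
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            sum += Double.isNaN(scores[i]) ? frontier[i] : scores[i];
        }
        return sum;
    }
}
```

The search stops as soon as the k-th lower bound is at least the best upper bound of any other object, seen or unseen, so the lists are usually not read to the end.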
Experimental evaluation
• We compare our approach (S2I) with the DIR-tree proposed by Cong et al. [1]
• Both approaches are implemented in Java
• Measures: response time, I/O, update time, and index size
• Size of tree nodes and blocks: 4 KB
[1] Cong, G., Jensen, C.S., Wu, D.: “Efficient retrieval of the top-k most relevant spatial web objects”, VLDB, 2009.
Datasets
| Dataset | Total no. of objects | Avg. no. of unique terms per object | Total no. of terms |
|---|---|---|---|
| Twitter1 | 1M | 11.94 | 12.5M |
| Twitter2 | 2M | 12.00 | 25M |
| Twitter3 | 3M | 12.26 | 38.6M |
| Twitter4 | 4M | 12.27 | 51.6M |
| Data1 | 0.1M | 131.70 | 32.6M |
| Wikipedia | 0.4M | 163.65 | 169.4M |
| Flickr | 1.4M | 14.49 | 25.4M |
| OpenStreetMap | 3M | 8.76 | 31.5M |
Variables studied
• Number of results (k)
– 10, 20, 30, 40, 50
• Number of query keywords
– 1, 2, 3, 4, 5
• Query preference rate (α)
– 0.1, 0.3, 0.5, 0.7, 0.9
• Scalability (Twitter datasets)
– 1M, 2M, 3M, 4M
Number of results (k)
• The response time of S2I is one order of magnitude lower due to fewer disk accesses
– The DIR-tree reads several nodes before finding the top-k due to text dimensionality
Number of query keywords
• S2I is one order of magnitude better than the DIR-tree in both I/O and response time
Insertion time and index size
• S2I requires neither updating inverted files (and vectors) nor computing document similarity
• S2I requires more space
Conclusions
• Top-k spatial keyword queries are intuitive and have several applications
• We propose a new index
– Terms with different frequencies are stored differently
• We propose algorithms for single- and multiple-keyword queries
• The efficiency of our approach is verified through experiments on synthetic and real datasets
More information…
João B. Rocha-Junior
[email protected]
www.idi.ntnu.no/~joao
Thanks!
Scalability
• The improvement of S2I over the DIR-tree increases with the cardinality of the dataset
Different datasets
• The advantage of S2I over the DIR-tree is higher for datasets with few terms per document
Terms removal
• Terms with length = 1
• Terms that have no letter character
– !Character.isLetter(token.charAt(i))
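The two removal rules could be applied as in the following Java sketch. It reads the second rule as "drop a token when none of its characters is a letter", which is an interpretation of the slide; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the term-removal rules: drop one-character terms and terms without letters.
class TermFilter {

    // True if at least one character of the token is a letter.
    static boolean hasLetter(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetter(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Keep only tokens longer than one character that contain a letter.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() > 1 && hasLetter(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Under this reading a token such as "r2" is kept because it contains a letter, while "2011" and "!!" are removed.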