


EXPERIMENTAL EVALUATION OF SKETCHING TECHNIQUES FOR BIG SPATIAL DATA

A. B. SIDDIQUE, AHMED ELDAWY
[msidd005,eldawy]@ucr.edu

Department of Computer Science and Engineering, University of California, Riverside.

MOTIVATION

• Swift growth of data
  – 2.5 exabytes of data are produced daily, of which 60–80% is geo-referenced.
  – Space telescopes broadcast about 140 GB of data weekly.
• New scalable query processing techniques are the need of the hour.
• Sketching techniques, excluding sampling, are not well studied because of two challenges:
  – It is hard to compare their performance.
  – The algorithms might require some tweaks to work.
• A comprehensive evaluation is needed to understand the trade-offs among the different sketching techniques for big spatial data.

OVERVIEW

• Three-phase sketching-based framework for big data processing.

[Figure: framework pipeline. Labels: Big Dataset, Spark Cluster, Sketched data, Single Machine, Local Operations (Selectivity Estimation, Clustering, Partitioning, ...), Partial Result, Spark Cluster, Final Result.]

• Data is sketched only once for all future local operations.
• To make the sketching methods comparable, a parameter B is used.
• The local operations phase allows existing algorithm(s) to be reused with minimal changes.
• The optional generalization phase is merely a parallel scan of the whole dataset.
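The three phases listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the poster runs the sketching and generalization phases on a Spark cluster, whereas here ordinary lists stand in for the distributed dataset. The function names, the uniform grid sketch, and the "densest cell" local operation are all hypothetical, and treating the cell budget `cols * rows` as the parameter B is an assumption as well.

```python
from collections import Counter

def cell_of(point, bounds, cols, rows):
    """Map a point to its grid cell (hypothetical helper)."""
    x, y = point
    x0, y0, x1, y1 = bounds
    # Clamp so that points on the upper boundary fall into the last cell.
    cx = min(int((x - x0) / (x1 - x0) * cols), cols - 1)
    cy = min(int((y - y0) / (y1 - y0) * rows), rows - 1)
    return cx, cy

def sketch(points, bounds, cols, rows):
    """Phase 1 (sketching): summarize the dataset once as per-cell counts.
    Assumption: cols * rows plays the role of the size parameter B."""
    return Counter(cell_of(p, bounds, cols, rows) for p in points)

def local_operation(hist):
    """Phase 2 (local operation): works on the small sketch only.
    Here, as a stand-in, it returns the densest cell."""
    return max(hist, key=hist.get)

def generalize(points, bounds, cols, rows, cell):
    """Phase 3 (optional generalization): one scan of the full dataset that
    applies the partial result to every record."""
    return [p for p in points if cell_of(p, bounds, cols, rows) == cell]

points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (0.15, 0.05)]
bounds = (0.0, 0.0, 1.0, 1.0)
hist = sketch(points, bounds, cols=2, rows=2)
partial = local_operation(hist)
final = generalize(points, bounds, 2, 2, partial)
```

Because `hist` is computed once, any number of further local operations could reuse it without touching `points` again.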

SELECTIVITY ESTIMATION

Example grid of cell counts:

  0   0   2  65  41  46  17   0
 11  16  44 192 268 374 130   0
 58  46  74 184 287 355 301  49
 63  64  51 121 130  65  12  39

Prefix Sum

After horizontal aggregation (running sum along each row):

  0    0    2   67  108  154  171  171
 11   27   71  263  531  905 1035 1035
 58  104  178  362  649 1004 1305 1354
 63  127  178  299  429  494  506  545

After vertical aggregation (running sum along each column, starting from the bottom row):

132  258  429  991 1717 2557 3017 3105
132  258  427  924 1609 2403 2846 2934
121  231  356  661 1078 1498 1811 1899
 63  127  178  299  429  494  506  545
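The two aggregation passes can be reproduced with a short Python sketch. The grid values are taken from the figure, with rows listed starting from the row that begins with 63, which is where the vertical aggregation starts. The function names and the `range_sum` helper are assumptions added for illustration; answering a rectangular range sum with four lookups is the usual reason for keeping a prefix sum, though the poster does not spell out the query procedure.

```python
from itertools import accumulate

# Cell counts from the example grid, starting with the row that begins with 63.
grid = [
    [63, 64, 51, 121, 130, 65, 12, 39],
    [58, 46, 74, 184, 287, 355, 301, 49],
    [11, 16, 44, 192, 268, 374, 130, 0],
    [0, 0, 2, 65, 41, 46, 17, 0],
]

def prefix_sum_2d(cells):
    """Horizontal pass (running sum per row), then vertical pass
    (running sum per column)."""
    horizontal = [list(accumulate(row)) for row in cells]
    prefix = []
    running = [0] * len(cells[0])
    for row in horizontal:
        running = [a + b for a, b in zip(running, row)]
        prefix.append(running)
    return prefix

def range_sum(prefix, r0, c0, r1, c1):
    """Sum of the cells in rows r0..r1 and columns c0..c1 (inclusive),
    using at most four lookups in the prefix-sum array."""
    def at(r, c):
        return prefix[r][c] if r >= 0 and c >= 0 else 0
    return at(r1, c1) - at(r0 - 1, c1) - at(r1, c0 - 1) + at(r0 - 1, c0 - 1)

prefix = prefix_sum_2d(grid)
total = prefix[-1][-1]                   # sum of all cells
window = range_sum(prefix, 1, 2, 2, 4)   # rows 1..2, columns 2..4
```

The last row of `prefix` matches the final row of the vertical aggregation in the figure, ending in the grand total 3105.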

Euler Histogram

[Figure: four panels, each showing a query Q' with dimension labels such as w, w'1, w'2, and r/2.]

R1: Partial sum of C1 for the top-left cell
R2: Partial sum of C2 for the top cell(s)
R3: Partial sum of C3 for the left cell(s)
R4: Partial sum of C4 for the cell(s)
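For readers unfamiliar with the Euler histogram, the following minimal Python sketch shows the basic idea: counts are kept not only for grid cells (faces) but also for interior edges and vertices, so that an object spanning several cells is counted once by the formula faces − edges + vertices. It does not reproduce the R1–R4 partial sums from the figure. The function names are hypothetical, and objects are assumed to be already mapped to inclusive cell-index ranges.

```python
def build_euler_histogram(rects, cols, rows):
    """rects: (cx0, cy0, cx1, cy1) inclusive cell-index ranges per object.
    Returns a (2*rows - 1) x (2*cols - 1) array in which even/even entries
    are cells (faces), mixed-parity entries are interior edges, and odd/odd
    entries are interior vertices."""
    hist = [[0] * (2 * cols - 1) for _ in range(2 * rows - 1)]
    for cx0, cy0, cx1, cy1 in rects:
        for j in range(2 * cy0, 2 * cy1 + 1):
            for i in range(2 * cx0, 2 * cx1 + 1):
                hist[j][i] += 1
    return hist

def count_intersecting(hist, qx0, qy0, qx1, qy1):
    """Number of objects intersecting a grid-aligned query given as an
    inclusive cell range: faces - edges + vertices inside the query."""
    total = 0
    for j in range(2 * qy0, 2 * qy1 + 1):
        for i in range(2 * qx0, 2 * qx1 + 1):
            sign = -1 if (i % 2 + j % 2) == 1 else 1  # edges are subtracted
            total += sign * hist[j][i]
    return total

# Three rectangles on an 8 x 4 grid (hypothetical data).
rects = [(0, 0, 1, 1), (1, 1, 3, 2), (5, 0, 6, 0)]
euler = build_euler_histogram(rects, cols=8, rows=4)
hits = count_intersecting(euler, 1, 1, 2, 2)
```

Summing only the face counts inside the same query would count the second rectangle four times, once per overlapped cell; the edge and vertex terms cancel that double counting.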

CLUSTERING

[Figure: Clustering. Contains the same 4 × 8 grid of cell counts as the SELECTIVITY ESTIMATION figure, a second 3 × 7 array of values (below), and the label "K Cluster Centers".]

137 152 164 237 194 248 300
179 157 140 174 159 115 178
121  49  34  55  49  77 186
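One way an existing clustering algorithm could be reused on a histogram sketch with minimal changes is to treat each cell center as a point weighted by its count. The sketch below applies plain Lloyd's k-means with weights. This is an assumption for illustration, not the poster's stated method: the function name, the choice of cell centers as representatives, and the initial centers are all hypothetical.

```python
def weighted_kmeans(points, weights, centers, iterations=10):
    """Lloyd's k-means in which each point counts `weight` times."""
    centers = list(centers)
    for _ in range(iterations):
        # Per cluster: [weighted sum of x, weighted sum of y, total weight].
        sums = [[0.0, 0.0, 0.0] for _ in centers]
        for (x, y), w in zip(points, weights):
            nearest = min(
                range(len(centers)),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
            sums[nearest][0] += w * x
            sums[nearest][1] += w * y
            sums[nearest][2] += w
        # Move each center to the weighted mean; keep it if the cluster is empty.
        centers = [
            (sx / sw, sy / sw) if sw > 0 else centers[k]
            for k, (sx, sy, sw) in enumerate(sums)
        ]
    return centers

# Cell counts from the example grid; each cell is represented by its center.
grid = [
    [0, 0, 2, 65, 41, 46, 17, 0],
    [11, 16, 44, 192, 268, 374, 130, 0],
    [58, 46, 74, 184, 287, 355, 301, 49],
    [63, 64, 51, 121, 130, 65, 12, 39],
]
cell_centers = [(c + 0.5, r + 0.5) for r in range(4) for c in range(8)]
cell_counts = [grid[r][c] for r in range(4) for c in range(8)]
k_centers = weighted_kmeans(cell_centers, cell_counts, [(1.5, 1.5), (5.5, 1.5)])
```

The only change relative to unweighted k-means is the `w` factor in the accumulation step, which keeps the cost proportional to the number of cells rather than the number of records.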

PARTITIONING

[Figure: Partitioning. Contains the same 4 × 8 grid of cell counts and the same 3 × 7 array of values as the Clustering figure.]

EXPERIMENTAL EVALUATION

• Selectivity Estimation
• Clustering
• Partitioning
