
Role of Spatial in Benchmarking Big Data

2012 NSF Workshop on Big Data Benchmarking (San Jose, CA)

Shashi Shekhar McKnight Distinguished University Professor

Department of Computer Science and Engineering

University of Minnesota

www.cs.umn.edu/~shekhar

For more details:

1. S. Shekhar et al., Identifying patterns in spatial information: A survey of Methods, Wiley Interdisciplinary Reviews in Data Mining and Knowledge Discovery, Volume 1, May/June 2011.

2. S. Shekhar et al., Spatial Databases: Accomplishments and Research Needs, IEEE Transactions on Knowledge and Data Eng., 11(1), Jan./Feb. 1999. (Updated version in Wiley Encyclopedia of Computer Science (Ed. Benjamin Wah), 2009.)


Emerging SBD: Geo-social Media, Device2Device


Why include Spatial Workload in Big Data Benchmark?

• Spatial Computing is critical to many societal grand challenges

• It is critical in the cell-phone era of computing

• Decrease Map-Reduce Bias
  – High cost of the Reduce step favors non-iterative workloads
  – MPI, OpenMP provide lightweight synchronization needed for data analytics
  – Spatial workloads are iterative and counter the map-reduce bias

• Beyond Pre-Big-Data Computing Assumptions
  – Beyond Sorting assumption in Relational DBMS
    • Numbers, Character-Strings → Points, Line-Strings, Polygons, Routes, Graphs
    • Equi-Join → Spatial-distance Join, Nearest Neighbor
  – Beyond I.I.D. assumption in Statistics, Machine Learning, …
    • Independent Samples → Auto-correlation
    • Identical Distribution → Heterogeneous, Non-stationary


How to include Spatial Workload in Big Data Benchmark?

• Table Schema
  – Add home-address, work-address, cell-phone columns (for a customer)
  – Derive addresses by reverse-geocoding spatial locations
  – Generate locations from a mixture of point processes for
    • Hot-spots (auto-correlation)
    • Urban, suburban, rural (geographic heterogeneity)
  – Generate trajectories for cell-phones
    • Generate routes between home and work using shortest-path algorithms
    • Add a temporal schedule based on daily routine
    • Add more points of interest beyond home and work, e.g. city simulators

• Spatial Queries
  – Nearest Neighbor queries generated for home, work and points on commute routes
  – Shortest paths between points of interest (home, work, …)
  – Hotspots
  – Trend, Change-points, …

• Metrics
  – Footprint scale: local, regional, country, continent, global
  – Mobile device interactions per second
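
As a rough illustration of the "mixture of point processes" idea, the sketch below draws synthetic customer locations from two components: Gaussian scatter around hot-spot centers (auto-correlation) and a uniform background. The function name, the 70/30 split, the spread, and the unit-square bounding box are all assumptions made for this example, not part of the talk.

```python
import random

def generate_locations(n, hotspots, hotspot_fraction=0.7, spread=0.02,
                       bbox=(0.0, 0.0, 1.0, 1.0), seed=0):
    """Draw n (x, y) points from a two-component mixture (hypothetical generator).

    With probability hotspot_fraction a point is scattered around a randomly
    chosen hot-spot center; otherwise it is uniform over the bounding box.
    """
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bbox
    points = []
    for _ in range(n):
        if hotspots and rng.random() < hotspot_fraction:
            # Clustered component: Gaussian scatter around a hot-spot center.
            cx, cy = rng.choice(hotspots)
            x, y = rng.gauss(cx, spread), rng.gauss(cy, spread)
            # Clamp to the bounding box so every point stays in the study area.
            x = min(max(x, xmin), xmax)
            y = min(max(y, ymin), ymax)
        else:
            # Background component: uniform over the bounding box.
            x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        points.append((x, y))
    return points

# Example: two illustrative hot-spot centers in the unit square.
hs = [(0.3, 0.3), (0.7, 0.6)]
pts = generate_locations(500, hs)
```

Geographic heterogeneity (urban, suburban, rural) could be approximated in the same way by giving each region its own `hotspot_fraction` and `spread`.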


Spatial Databases: Representative Projects

[Figures: Evacuation Route Planning (map legend: only in old plan, only in new plan, in both plans); Parallelize Range Queries; Storing graphs in disk blocks; Shortest Paths]


Spatial Data Mining : Representative Projects

[Figures: Location prediction: nesting sites (maps: nest locations, distance to open water, vegetation durability, water depth); Spatial outliers: sensor (#9) on I-35; Co-location patterns; Teleconnections]


Motivation for Spatial Computing

• Societal:
  – Google Earth, Google Maps, Navigation, location-based services
  – Global Challenges facing humanity: many are geo-spatial!
  – Future of Computer Science (CS) is to address societal challenges!


Traditional Spatial Data

• Spatial attribute:

– Neighborhood and extent

– Geo-Reference: longitude, latitude, elevation

• Spatial data genre

– Raster: geo-images e.g., Google Earth

– Vector: point, line, polygons

– Graph, e.g., roadmap: node, edge, path

Raster Data for UMN Campus

Courtesy: UMN

Vector Data for UMN Campus

Courtesy: MapQuest

Graph Data for UMN Campus

Courtesy: Bing


Traditional SBD: Raster

• Example Data Sets:

– Google Earth, Bing, NASA Worldwind

– Satellite Imagery (periodic scan)

– Climate simulation outputs for next century

– Geo-videos from UAVs, security cameras

• Example use case

– Change detection

– Feature extraction

– Urban terrain

– …

Raster Data for UMN Campus (Courtesy: UMN)

Visualizing the Urban Terrain

Automated Change Detection

Automatic Feature Extraction

Average Monthly Temperature (Courtesy: NASA, Prof. V. Kumar)


Traditional SBD: Vector

• Vector data sub-genre

– Point, e.g., street addresses, …

– Line-strings, e.g., road center line

– Polygons, e.g., zipcode boundaries, …

– Collections of above types

• Common use cases

– Distance from a point, line or polygon

• Geo-Buffers around geo-features

• Nearest gas-station, store, hospital, …

– Topological queries:

• Overlap: whose jurisdiction?

• Range query – subset inside a polygon

– Aggregation:

• Hot-spots, emerging hot-spots of crime, disease, …

• Spatial auto-correlation measures

• Spatial auto-regression

Vector Data for UMN Campus

Courtesy: MapQuest
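
Two of the vector use cases above, the range query (subset inside a polygon) and the nearest-neighbor query, can be sketched in a few lines. This is a brute-force illustration under simple assumptions (planar coordinates, a simple polygon given as a vertex list, no spatial index); the function names are made up for the example.

```python
import math

def point_in_polygon(pt, polygon):
    """Ray-casting test: True if pt is inside the simple polygon (vertex list)."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def range_query(points, polygon):
    """Return the points that fall inside the polygon."""
    return [p for p in points if point_in_polygon(p, polygon)]

def nearest_neighbor(query, points):
    """Return the point closest to query (linear scan, Euclidean distance)."""
    return min(points, key=lambda p: math.dist(query, p))

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
candidates = [(1, 1), (5, 5), (2, 3), (-1, 2)]
inside = range_query(candidates, square)
closest = nearest_neighbor((0, 0), [(3, 4), (1, 1), (-2, 0)])
```

A spatial DBMS would answer the same queries through an index such as an R-tree instead of scanning every point.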


Traditional SBD: Spatial Graphs

• Spatial Graph Examples

– Roadmaps, rail-road networks, air-routes

– Electric grid, Gas pipelines, supply chains, …

• Graph data sub-genre

– Nodes, Edges, Routes, …

– Flow networks with capacity constraints

• Use cases:

– Geo-code, Map-matching, …

– Connectivity, Shortest paths, …

– Travel-time based nearest store, hospital, …

– Logistics, supply-chain management, …

Graph Data for UMN Campus

Courtesy: Bing
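
The shortest-path use case on a spatial graph can be illustrated with Dijkstra's algorithm. The sketch assumes non-negative edge weights (for example travel time) and an adjacency-list dictionary; the tiny road network and node names are invented for the example.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm on {node: [(neighbor, weight), ...]}.

    Returns (cost, path), or (inf, []) when target is unreachable.
    Weights are assumed to be non-negative.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale heap entry
        visited.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk predecessor links back from the target to rebuild the route.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

# Hypothetical road network with travel times as weights.
roads = {
    "home": [("a", 4), ("b", 1)],
    "b": [("a", 2), ("work", 6)],
    "a": [("work", 1)],
    "work": [],
}
cost, route = shortest_path(roads, "home", "work")
```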


Emerging SBD: Geo-social Media, Device2Device


Emerging Use-Case: Eco-Routing

U.P.S. Embraces High-Tech Delivery Methods (July 12, 2007)

“The research at U.P.S. is paying off. … saving roughly three million gallons of fuel in good part by mapping routes that minimize left turns.”

• Minimize fuel consumption and GHG emissions

– rather than proxies, e.g. distance, travel-time

– avoid congestion, idling at red-lights, turns and elevation changes, etc.


Emerging SBD: Mobile Device2Device


• Mobile Device Examples

– Cell-phones, …

– Check-ins, location API in HTML5, Twitter, …

– Vehicles: cars, trucks, airplanes, …

– RFID-tags, bar-codes, GPS-collars, …

• Trajectory & Measurements sub-genre

– Receiver: GPS tracks, …

– System: Cameras, RFID readers, …

• Use cases:

– Tracking, Tracing,

• Improve service, deter theft …

– Geo-fencing, Identify nearby friends

– Eco-routing
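
Geo-fencing and "identify nearby friends" both reduce to a distance test between reported positions. The sketch below uses the haversine great-circle distance on a spherical Earth; the radius constant, function names, and all coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def inside_geofence(position, center, radius_km):
    """True if a device position lies within a circular geo-fence."""
    return haversine_km(*position, *center) <= radius_km

def nearby_friends(me, friends, radius_km):
    """Names of friends whose last known position is within radius_km of me."""
    return sorted(name for name, pos in friends.items()
                  if haversine_km(*me, *pos) <= radius_km)

# Arbitrary illustrative coordinates.
fence_center = (44.9740, -93.2277)
friends = {"ann": (44.9750, -93.2277), "bob": (45.1000, -93.2277)}
close_by = nearby_friends(fence_center, friends, radius_km=0.5)
```

At benchmark scale the pairwise scan would be replaced by a spatial index or grid partitioning, since comparing every pair of devices grows quadratically.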


Emerging SBD: Geo-Sensor Networks


• Geo-Sensor Network Examples

– Urban roads

– Cameras in cities (Millions)

– Electricity distribution grids, …

– Weather sensors networks, …

– Robot with sensors, …

• Sensor Network sub-genre

– Fixed, reasonably resourced: traffic sensors

– Ad-hoc, resource poor: wireless sensor networks

• Use cases:

– Monitoring

• Anomalies, e.g., accidents, …

– Real-time event detection

• Congestion, emerging hotspots, …

– Feed-back control

– Predictive, anticipatory planning


SBD Metrics


Data Type | Representation | Operations | Potential Metrics
Raster | Geo-Matrix | Geo-registration, Feature Extraction, Change Detection, spatial auto-regression | Raster operations per second
Vector | Points, Lines, Polygons | Nearest Neighbor, Point Query, Range Query (e.g., Buffer), Spatial Join, Hotspot detection, etc. | Vector operations per second
Network | Graphs (nodes, edges) | Shortest Path, Map matching, Geo-coding, Max Flow, Evacuation, etc. | Shortest-Paths per second
Mobile Device2Device | Check-ins, Trajectories, Measurements | Check-in, identify close-by friends, eco-routes, Track, trace | Mobile device2device interactions per second


Relational DBMS to Spatial DBMS

• 1980s: Relational DBMS
  – Relational Algebra
  – Query Processing, e.g. sort-merge equi-join algorithm, …
  – B+ Tree index

• Spatial customers (e.g. NASA, USPS) became interested

• But faced challenges
  – Semantic Gap
    • Spatial concepts: distance, direction, overlap, inside, shortest paths, …
    • SQL representation was quite verbose
    • Relational algebra cannot represent transitive closure
  – Performance challenge due to linearity assumption
    • Is B+ tree appropriate for geographic data?
    • Is sorting natural in geographic space?

• New ideas emerged in 1990s
  – Spatial data types and operations (e.g. OGIS Simple Features)
  – R-tree, Spatial-Join-Index, space partitioning, …
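
The transitive-closure point can be made concrete. A relational-algebra expression contains a fixed number of joins, while reachability in a road network may need arbitrarily many, so the join has to be repeated until nothing new appears. The sketch below is one minimal way to show that loop, with an invented edge set; it is not how any particular DBMS implements recursion.

```python
def transitive_closure(edges):
    """All reachable (start, end) pairs, computed by iterating a join to a fixpoint."""
    closure = set(edges)
    while True:
        # One join step: (a, b) in closure and (b, c) in edges yields (a, c).
        new_pairs = {(a, c) for a, b in closure for b2, c in edges if b == b2}
        if new_pairs <= closure:
            return closure  # fixpoint reached: no new pairs
        closure |= new_pairs

# Hypothetical one-way road segments.
segments = {("A", "B"), ("B", "C"), ("C", "D")}
reachable = transitive_closure(segments)
```

The number of iterations depends on the data (here, the length of the longest path), which is exactly what a fixed algebra expression cannot anticipate.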


Data Mining to Spatial Data Mining

• 1990s: Data Mining
  – Scale up traditional models to large relational databases
    • Linear regression, Decision Trees, …
  – New pattern families
    • Association rules
    • Which items are bought together? E.g. (Diaper, beer)

• Spatial customers
  – Walmart
    • Which items are bought just before/after events, e.g. hurricanes?
    • Where is (diaper-beer) pattern prevalent?
  – Global climate change

• But faced challenges
  – Independence Assumption
  – Transactions, i.e. disjoint partitioning of data


Spatial Prediction

[Figure panels: Nest locations; Distance to open water; Vegetation durability; Water depth]


Mental Model: Spatial Autocorrelation (SA)

• First Law of Geography

– “All things are related, but nearby things are more related than distant things.” [Tobler, 1970]

• Autocorrelation

– Traditional i.i.d. assumption is not valid

– Measures: K-function, Moran’s I, Variogram, …

[Figure panels: Pixel property with independent identical distribution; Vegetation Durability with SA]
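
Of the measures listed, Moran's I is the simplest to compute directly: I = (n / S0) · Σ_ij w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², where S0 is the sum of all weights. The sketch below assumes a dense neighbor matrix and uses a one-dimensional chain of cells as a toy spatial framework; both helper names are made up for the example.

```python
def morans_i(values, weights):
    """Moran's I for values x_i and an n-by-n neighbor weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total weight
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

def chain_weights(n):
    """Binary neighbor matrix for n cells in a line: adjacent cells are neighbors."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

w = chain_weights(6)
smooth = morans_i([1, 2, 3, 4, 5, 6], w)           # neighbors are similar
checkerboard = morans_i([1, -1, 1, -1, 1, -1], w)  # neighbors are opposite
```

A smooth gradient gives a positive value (nearby cells are alike) and an alternating pattern gives a negative one, while values near zero are what the i.i.d. assumption would predict.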


Ex. 3: Hardest to Parallelize

Name | Model
Classical Linear Regression | $y = x\beta + \varepsilon$
Spatial Auto-Regression | $y = \rho W y + x\beta + \varepsilon$

$\rho$: the spatial auto-regression (auto-correlation) parameter

$W$: $n$-by-$n$ neighborhood matrix over spatial framework

• Maximum Likelihood Estimation

$$\ln(L) = \ln\lvert I - \rho W \rvert - \frac{n \ln(2\pi)}{2} - \frac{n \ln(\sigma^2)}{2} - SSE$$

• Need cloud computing to scale up to large spatial dataset.

• However,

• Map-reduce is too slow for iterative computations!

• Computing determinant of large matrix is an open problem!


Clustering

• Clustering: Find groups of tuples

• Statistical Significance

– Complete spatial randomness, cluster, and de-cluster

[Figure panels: Inputs: Complete Spatial Randomness (CSR), Cluster, Decluster; Classical Clustering (K-means); Spatial Clustering]


Spatial Outliers

• Spatial Outliers

– Traffic Data in Twin Cities

– Abnormal Sensor Detections

– Spatial and Temporal Outliers

• Spatial Join Based Tests


Association Patterns

• Association rule e.g. (Diaper in T => Beer in T)

– Support: probability (Diaper and Beer in T) = 2/5

– Confidence: probability (Beer in T | Diaper in T) = 2/2

• Algorithm Apriori [Agrawal, Srikant, VLDB94]

– Support based pruning using monotonicity

• Note: Transaction is a core concept!

Transaction Items Bought

1 {socks, , milk, , beef, egg, …}

2 {pillow, , toothbrush, ice-cream, muffin, …}

3 { , , pacifier, formula, blanket, …}

… …

n {battery, juice, beef, egg, chicken, …}
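
Support and confidence follow directly from counting transactions. The five transactions below are hypothetical, chosen only so that the rule (Diaper => Beer) reproduces the 2/5 and 2/2 figures quoted above; they are not the contents of the original table.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent in T | antecedent in T)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets (illustrative data only).
baskets = [
    {"socks", "diaper", "milk", "beer"},
    {"pillow", "toothbrush", "muffin"},
    {"diaper", "beer", "pacifier"},
    {"juice", "egg"},
    {"battery", "beef", "chicken"},
]
s = support(baskets, {"diaper", "beer"})       # 2 of 5 baskets
c = confidence(baskets, {"diaper"}, {"beer"})  # 2 of the 2 diaper baskets
```

The monotonicity that Apriori exploits is visible here: an itemset can never have higher support than any of its subsets, so infrequent itemsets need not be extended.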


Pattern Family 4: Co-locations/Co-occurrence

• Given: A collection of different types of spatial events

• Find: Co-located subsets of event types

• Challenge: No Transactions

• New Approaches

– Spatial Join Based


Parallelizing Spatial Big Data on Cloud Computing

• Parallelizing Spatial Computing

• Case 1: Simpler to parallelize: compute spatial autocorrelation
  – Map-reduce is okay
  – Should it provide spatial de-clustering services?
  – Can a query compiler generate map-reduce parallel code?

• Case 2: Harder: parallelize range query on polygon maps
  – Need dynamic load balancing beyond map-reduce
  – But, local processing is cheaper than sending it to another node!
  – MPI or OpenMP is better!

• Case 3: Estimate spatial auto-regression parameters, routing
  – Map-reduce is inefficient for iterative computations!
  – MPI or OpenMP is essential!
  – Golden section search, determinant of large matrix
  – Eco-routing algorithms, evacuation route planning
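
Golden section search shows why Case 3 is iterative: each step needs the result of the previous one, and each step evaluates the objective once. In spatial auto-regression that objective would be the log-likelihood as a function of ρ, which involves the determinant term; the sketch below substitutes a cheap quadratic as a stand-in, so the objective and its peak at 0.3 are assumptions for illustration only.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-6):
    """Locate the maximizer of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # Maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

# Stand-in objective with a single peak at rho = 0.3 (not a real likelihood).
def toy_objective(rho):
    return -(rho - 0.3) ** 2

best_rho = golden_section_max(toy_objective, 0.0, 1.0)
```

Only one new function evaluation is needed per iteration, which matters when every evaluation means a large determinant computation, but the iterations themselves cannot run in parallel.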
