role of spatial in benchmarking big datashekhar/talk/2012/why... · 2012-05-11 · 1 role of...
TRANSCRIPT
1
Role of Spatial in Benchmarking Big Data
2012 NSF Workshop on Big Data Benchmarking (San Jose, CA)
Shashi Shekhar McKnight Distinguished University Professor
Department of Computer Science and Engineering
University of Minnesota
www.cs.umn.edu/~shekhar
For more details:
1. S. Shekhar et al., Identifying patterns in spatial information: A survey of Methods, Wiley Interdisciplinary Reviews in Data Mining and Knowledge Discovery, Volume 1, May/June 2011.
2. S. Shekhar et al., Spatial Databases: Accomplishments and Research Needs, IEEE Transactions on Knowledge and Data Eng., 11(1), Jan./Feb. 1999. (Updated version in Wiley Encyclopedia of Computer Science (Ed. Benjamin Wah) , 2009.)
2
Emerging SBD: Geo-social Media, Device2Device
3
Why include Spatial Workload in Big Data Benchmark?
• Spatial computing is critical to many societal grand challenges
• It is critical in the cell-phone era of computing
• Decrease Map-Reduce bias
  – High cost of the Reduce step favors non-iterative workloads
  – MPI and OpenMP provide the lightweight synchronization needed for data analytics
  – Spatial provides iterative workloads to counter Map-Reduce bias
• Beyond pre-Big-Data computing assumptions
  – Beyond the sorting assumption in relational DBMS
    • Numbers, character strings → points, line-strings, polygons, routes, graphs
    • Equi-join → spatial-distance join, nearest neighbor
  – Beyond the i.i.d. assumption in statistics, machine learning, …
    • Independent samples → auto-correlation
    • Identical distribution → heterogeneous, non-stationary
4
How to include Spatial Workload in Big Data Benchmark?
• Table schema
  – Add home-address, work-address, cell-phone columns (for a customer)
  – Derive addresses by reverse-geocoding spatial locations
  – Generate locations from a mixture of point processes for
    • Hot-spots (auto-correlation)
    • Urban, suburban, rural (geographic heterogeneity)
  – Generate trajectories for cell-phones
    • Generate routes between home and work using shortest-path algorithms
    • Add a temporal schedule using routine
    • Add more points of interest beyond home and work, e.g. city simulators
• Spatial queries
  – Nearest-neighbor queries generated for home, work and points on commute routes
  – Shortest paths between points of interest (home, work, …)
  – Hotspots
  – Trend, change-points, …
• Metrics
  – Footprint scale: local, regional, country, continent, global
  – Mobile device interactions per second
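A minimal Python sketch of the location-generation idea on this slide, assuming a planar bounding box and Gaussian hot-spots mixed with a uniform background. The function name `generate_locations`, the `(center_x, center_y, spread)` hot-spot tuples, and every parameter value are illustrative assumptions, not part of the proposal itself.

```python
import random


def generate_locations(n, hotspots, hotspot_share=0.7,
                       bbox=(0.0, 0.0, 100.0, 100.0), seed=42):
    """Draw n synthetic (x, y) locations from a mixture of point processes.

    hotspots: list of (center_x, center_y, spread) tuples (hypothetical format).
    A small spread mimics a dense urban core, a larger one a suburb; the
    uniform background stands in for sparse rural locations.
    """
    rng = random.Random(seed)  # seeded so a benchmark run is repeatable
    min_x, min_y, max_x, max_y = bbox
    points = []
    for _ in range(n):
        if hotspots and rng.random() < hotspot_share:
            # Clustered component: Gaussian scatter around a hot-spot center.
            cx, cy, spread = rng.choice(hotspots)
            x = rng.gauss(cx, spread)
            y = rng.gauss(cy, spread)
            # Clamp to the bounding box so every point is a valid location.
            x = min(max(x, min_x), max_x)
            y = min(max(y, min_y), max_y)
        else:
            # Background component: complete spatial randomness.
            x = rng.uniform(min_x, max_x)
            y = rng.uniform(min_y, max_y)
        points.append((x, y))
    return points


homes = generate_locations(200, [(20.0, 20.0, 2.0), (70.0, 60.0, 8.0)])
```

The generated coordinates would still need reverse-geocoding to become address columns; that step depends on an external gazetteer and is left out here.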
5
Spatial Databases: Representative Projects
[Figure: Evacuation Route Planning. Legend: only in old plan; only in new plan; in both plans]
[Figure: Parallelize Range Queries]
[Figure: Storing graphs in disk blocks; Shortest Paths]
6
Spatial Data Mining : Representative Projects
[Figure panels: Nest locations; Distance to open water; Vegetation durability; Water depth]
Location prediction: nesting sites
Spatial outliers: sensor (#9) on I-35
Co-location Patterns
Teleconnections
7
Motivation for Spatial Computing
• Societal:
  – Google Earth, Google Maps, navigation, location-based services
• Global Challenges facing humanity – many are geo-spatial!
• Future of Computer Science (CS) is to address societal challenges!
8
Spatial Computing: Recent Trends
[Logos: Smarter Planet; SIGSPATIAL]
9
Traditional Spatial Data
• Spatial attribute:
– Neighborhood and extent
– Geo-Reference: longitude, latitude, elevation
• Spatial data genre
– Raster: geo-images e.g., Google Earth
– Vector: point, line, polygons
– Graph, e.g., roadmap: node, edge, path
Raster Data for UMN Campus
Courtesy: UMN
Vector Data for UMN Campus
Courtesy: MapQuest
Graph Data for UMN Campus
Courtesy: Bing
10
Traditional SBD: Raster
• Example Data Sets:
– Google Earth, Bing, NASA Worldwind
– Satellite Imagery (periodic scan)
– Climate simulation outputs for next century
– Geo-videos from UAVs, security cameras
• Example use case
– Change detection
– Feature extraction
– Urban terrain
– …
Raster Data for UMN Campus
Courtesy: UMN
Visualizing the Urban Terrain
Automated Change Detection
Automatic Feature Extraction
Average Monthly Temperature (Courtesy: NASA, Prof. V. Kumar)
11
Traditional SBD: Vector
• Vector data sub-genre
– Point, e.g., street addresses, …
– Line-strings, e.g., road center line
– Polygons, e.g., zipcode boundaries, …
– Collections of above types
• Common use cases
– Distance from a point, line or polygon
• Geo-Buffers around geo-features
• Nearest gas-station, store, hospital, …
– Topological queries
• Overlapping whose jurisdiction?
• Range query – subset inside a polygon
– Aggregation:
• Hot-spots, emerging hot-spots of crime, disease, …
• Spatial auto-correlation measures
• Spatial auto-regression
Vector Data for UMN Campus
Courtesy: MapQuest
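Two of the vector use cases listed on this slide, nearest neighbor and range query inside a polygon, can be sketched with brute-force scans. This is an illustration only, assuming planar coordinates and simple polygons; the function names are hypothetical, and a real spatial DBMS would use an index rather than a linear scan.

```python
import math


def nearest_neighbor(query, points):
    """Return the point closest to `query` by Euclidean distance."""
    return min(points, key=lambda p: math.dist(query, p))


def point_in_polygon(pt, polygon):
    """Ray-casting test: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def range_query(points, polygon):
    """Return the subset of points that lie inside the polygon."""
    return [p for p in points if point_in_polygon(p, polygon)]


stations = [(1.0, 1.0), (5.0, 5.0), (2.0, 3.0)]
zone = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
closest = nearest_neighbor((0.0, 0.0), stations)
in_zone = range_query(stations, zone)
```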
12
Traditional SBD: Spatial Graphs
• Spatial Graph Examples
– Roadmaps, rail-road networks, air-routes
– Electric grid, Gas pipelines, supply chains, …
• Graph data sub-genre
– Nodes, Edges, Routes, …
– Flow networks with capacity constraints
• Use cases:
– Geo-code, Map-matching, …
– Connectivity, Shortest paths, …
– Travel-time based nearest store, hospital, …
– Logistics, supply-chain management, …
Graph Data for UMN Campus
Courtesy: Bing
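The shortest-path use case can be illustrated with Dijkstra's algorithm over an adjacency dictionary. This is a sketch under assumptions: the graph layout, node names, and edge weights are invented for illustration, and the slide does not prescribe any particular algorithm.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict {node: {neighbor: non-negative weight}}.

    Returns (total_cost, path); cost is infinity and path is empty when the
    target cannot be reached.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale heap entry
        visited.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk predecessor links back from the target to rebuild the route.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical road network; weights could be distance, travel time, or fuel.
roads = {
    "home": {"a": 2, "b": 5},
    "a": {"b": 1, "work": 6},
    "b": {"work": 2},
    "work": {},
}
cost, route = shortest_path(roads, "home", "work")
```

Swapping the edge weights from distance to travel time turns the same routine into the travel-time based nearest-store query mentioned on the slide.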
13
Emerging SBD: Geo-social Media, Device2Device
14
Emerging Use-Case: Eco-Routing
U.P.S. Embraces High-Tech Delivery Methods (July 12, 2007)
By “The research at U.P.S. is paying off. ……..— saving roughly three million
gallons of fuel in good part by mapping routes that minimize left turns.”
• Minimize fuel consumption and greenhouse gas (GHG) emissions
– rather than proxies, e.g. distance, travel-time
– avoid congestion, idling at red-lights, turns and elevation changes, etc.
15
Emerging SBD: Mobile Device2Device
• Mobile Device Examples
– Cell-phones, …
– Check-ins, location API in HTML5, Twitter, …
– Vehicles: cars, trucks, airplanes, …
– RFID-tags, bar-codes, GPS-collars, …
• Trajectory & Measurements sub-genre
– Receiver: GPS tracks, …
– System: Cameras, RFID readers, …
• Use cases:
– Tracking, tracing
• Improve service, deter theft …
– Geo-fencing, Identify nearby friends
– Eco-routing
16
Emerging SBD: Geo-Sensor Networks
• Geo-Sensor Network Examples
– Urban roads
– Cameras in cities (Millions)
– Electricity distribution grids, …
– Weather sensors networks, …
– Robot with sensors, …
• Sensor Network sub-genre
– Fixed reasonable resource: traffic sensors
– Ad-hoc, resource poor: wireless sensor networks
• Use cases:
– Monitoring
• Anomalies, e.g., accidents, …
– Real-time event detection
• Congestion, emerging hotspots, …
– Feed-back control
– Predictive, anticipatory planning
17
SBD Metrics
Data Type | Representation | Operations | Potential Metrics
Raster | Geo-Matrix | Geo-registration, Feature Extraction, Change Detection, spatial auto-regression | Raster operations per second
Vector | Points, Lines, Polygons | Nearest Neighbor, Point Query, Range Query (e.g., Buffer), Spatial Join, Hotspot detection, etc. | Vector operations per second
Network | Graphs (nodes, edges) | Shortest Path, Map matching, Geo-coding, Max Flow, Evacuation, etc. | Shortest paths per second
Mobile Device2Device | Check-ins, Trajectories, Measurements | Check-in, identify close-by friends, eco-routes, track, trace | Mobile device2device interactions per second
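An "operations per second" metric of the kind listed in the table could be measured with a small timing harness. The sketch below is an assumption about how such a measurement might look; the helper name `operations_per_second` and the sample workload are hypothetical.

```python
import time


def operations_per_second(operation, inputs):
    """Apply `operation` to every input and report throughput."""
    start = time.perf_counter()
    for item in inputs:
        operation(item)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(inputs) / max(elapsed, 1e-9)


# Hypothetical vector workload: squared distance from the origin.
sample_points = [(float(i), float(i)) for i in range(1000)]
rate = operations_per_second(lambda p: p[0] ** 2 + p[1] ** 2, sample_points)
```

A real benchmark would also fix the data scale (local to global footprint) and report it alongside the rate.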
18
Relational DBMS to Spatial DBMS
• 1980s: Relational DBMS
  – Relational algebra
  – Query processing, e.g. sort-merge equi-join algorithm, …
  – B+ tree index
• Spatial customers (e.g. NASA, USPS) got interested
• But faced challenges
  – Semantic gap
    • Spatial concepts: distance, direction, overlap, inside, shortest paths, …
    • SQL representation was quite verbose
    • Relational algebra cannot represent transitive closure
  – Performance challenge due to linearity assumption
    • Is B+ tree appropriate for geographic data?
    • Is sorting natural in geographic space?
• New ideas emerged in 1990s
  – Spatial data types and operations (e.g. OGIS Simple Features)
  – R-tree, spatial-join-index, space partitioning, …
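The question of whether sorting is natural in geographic space can be illustrated with a Z-order (Morton) key, one common way to linearize two-dimensional cells so that a one-dimensional index can store them. Z-order is an assumption chosen for illustration; it is not named on the slide. The sketch shows that two adjacent grid cells can end up far apart in the sorted order.

```python
def morton_code(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y into one sort key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return code


# Cells (3, 3) and (4, 3) touch each other on the grid ...
left = morton_code(3, 3)
right = morton_code(4, 3)
# ... but their one-dimensional keys are 15 and 26, with ten other cells
# sorted in between.
gap = right - left
```

No single linear order keeps all spatial neighbors adjacent, which is one motivation for multi-dimensional structures such as the R-tree.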
19
Data Mining to Spatial Data Mining
• 1990s: Data Mining
  – Scale up traditional models to large relational databases
    • Linear regression, decision trees, …
  – New pattern families
    • Association rules
    • Which items are bought together? E.g. (Diaper, beer)
• Spatial customers
  – Walmart
    • Which items are bought just before/after events, e.g. hurricanes?
    • Where is (diaper-beer) pattern prevalent?
  – Global climate change
• But faced challenges
  – Independence assumption
  – Transactions, i.e. disjoint partitioning of data
20
Spatial Prediction
[Figure panels: Nest locations; Distance to open water; Vegetation durability; Water depth]
21
Mental Model: Spatial Autocorrelation (SA)
• First Law of Geography
– “All things are related, but nearby things are more related than distant things.” [Tobler, 1970]
• Autocorrelation
– Traditional i.i.d. assumption is not valid
– Measures: K-function, Moran’s I, Variogram, …
Pixel property with independent identical distribution
Vegetation Durability with SA
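Moran's I, one of the autocorrelation measures named on this slide, can be computed directly from values and a neighbor list. The sketch assumes binary (0/1) neighbor weights and uses the standard formula I = (n / S0) · Σ w_ij (x_i − x̄)(x_j − x̄) / Σ (x_i − x̄)², where S0 is the sum of all weights; the function name and toy data are hypothetical.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights.

    values: list of observations, one per location.
    neighbors: dict {i: [j, ...]} listing the neighbors of each location.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    cross = 0.0
    weight_sum = 0
    for i, nbrs in neighbors.items():
        for j in nbrs:
            cross += dev[i] * dev[j]
            weight_sum += 1
    if denom == 0 or weight_sum == 0:
        raise ValueError("Moran's I is undefined for constant data or no neighbors")
    return (n / weight_sum) * (cross / denom)


# Four locations in a row; each one neighbors the locations next to it.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)         # positive: similar neighbors
checkerboard = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # negative: alternating
```

Values near zero would be expected under the i.i.d. assumption, while spatially smooth data such as vegetation durability tends toward positive values.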
22
Ex. 3: Hardest to Parallelize
Name | Model
Classical Linear Regression | $y = x\beta + \varepsilon$
Spatial Auto-Regression | $y = \rho W y + x\beta + \varepsilon$
$\rho$: the spatial auto-regression (auto-correlation) parameter
$W$: $n$-by-$n$ neighborhood matrix over spatial framework
• Maximum Likelihood Estimation
$\ln(L) = \ln\lvert I - \rho W\rvert - \frac{n \ln(2\pi)}{2} - \frac{n \ln(\sigma^2)}{2} - SSE$
• Need cloud computing to scale up to large spatial dataset.
• However,
  – Map-reduce is too slow for iterative computations!
  – Computing the determinant of a large matrix is an open problem!
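The log-determinant term ln|I − ρW| of the spatial auto-regression likelihood is the step that dominates the cost. A minimal sketch, assuming a small dense W and plain Gaussian elimination with partial pivoting (cubic in n); the function names are hypothetical, and large problems would need sparse or approximate methods instead.

```python
import math


def log_abs_det(matrix):
    """ln|det(A)| via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    total = 0.0
    for col in range(n):
        # Pick the largest pivot in this column for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return float("-inf")  # singular matrix
        a[col], a[pivot] = a[pivot], a[col]
        total += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return total


def sar_log_det_term(rho, w):
    """Evaluate ln|I - rho * W| for a dense neighborhood matrix W."""
    n = len(w)
    a = [[(1.0 if i == j else 0.0) - rho * w[i][j] for j in range(n)]
         for i in range(n)]
    return log_abs_det(a)


# Two locations that are each other's only neighbor.
w_small = [[0.0, 1.0], [1.0, 0.0]]
term = sar_log_det_term(0.5, w_small)  # ln(1 - 0.25) = ln(0.75)
```

Because ρ changes on every step of the likelihood search, this term is re-evaluated many times, which is where the iterative cost comes from.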
23
Clustering
• Clustering: Find groups of tuples
• Statistical Significance
– Complete spatial randomness, cluster, and de-cluster
Inputs: Complete Spatial Randomness (CSR), Cluster, Decluster
Classical Clustering (K-means)
Spatial Clustering
24
Spatial Outliers
• Spatial Outliers
– Traffic Data in Twin Cities
– Abnormal Sensor Detections
– Spatial and Temporal Outliers
• Spatial Join Based Tests
25
Association Patterns
• Association rule e.g. (Diaper in T => Beer in T)
– Support: probability (Diaper and Beer in T) = 2/5
– Confidence: probability (Beer in T | Diaper in T) = 2/2
• Algorithm Apriori [Agrawal, Srikant, VLDB94]
– Support based pruning using monotonicity
• Note: Transaction is a core concept!
Transaction Items Bought
1 {socks, , milk, , beef, egg, …}
2 {pillow, , toothbrush, ice-cream, muffin, …}
3 { , , pacifier, formula, blanket, …}
… …
n {battery, juice, beef, egg, chicken, …}
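Support and confidence as defined on this slide can be computed in a few lines. The five baskets below are invented for illustration (they are not the slide's table) and are chosen so that the results match the slide's 2/5 and 2/2; the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    lhs = set(antecedent)
    both = lhs | set(consequent)
    with_lhs = [t for t in transactions if lhs <= set(t)]
    if not with_lhs:
        return 0.0
    return sum(1 for t in with_lhs if both <= set(t)) / len(with_lhs)


# Hypothetical baskets, for illustration only.
baskets = [
    {"socks", "diaper", "milk", "beer"},
    {"pillow", "toothbrush", "ice-cream"},
    {"diaper", "beer", "pacifier"},
    {"juice", "muffin"},
    {"battery", "juice", "beef", "egg"},
]
s = support(baskets, {"diaper", "beer"})         # 2 of 5 baskets
c = confidence(baskets, {"diaper"}, {"beer"})    # 2 of 2 diaper baskets
```

Both functions iterate over transactions, which is why the slide stresses that the transaction is a core concept for this pattern family.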
26
Pattern Family 4: Co-locations/Co-occurrence
• Given: A collection of different types of spatial events
• Find: Co-located subsets of event types
• Challenge: No transactions
• New approaches
  – Spatial join based
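Without transactions, a distance-based spatial join can play the role of "bought together". The sketch below is a simplified illustration, assuming planar coordinates and a brute-force join; the function names and the "share with a neighbor" measure are hypothetical stand-ins, not the interest measure used in the co-location literature.

```python
import math


def neighbor_pairs(events_a, events_b, max_dist):
    """Spatial distance join: index pairs (i, j) with dist(a_i, b_j) <= max_dist."""
    return [
        (i, j)
        for i, a in enumerate(events_a)
        for j, b in enumerate(events_b)
        if math.dist(a, b) <= max_dist
    ]


def share_with_neighbor(events_a, events_b, max_dist):
    """Fraction of type-A events that have at least one type-B event nearby."""
    if not events_a:
        return 0.0
    matched = {i for i, _ in neighbor_pairs(events_a, events_b, max_dist)}
    return len(matched) / len(events_a)


# Hypothetical event locations for two event types.
type_a = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
type_b = [(0.0, 1.0), (10.0, 11.0), (50.0, 50.0)]
pairs = neighbor_pairs(type_a, type_b, 1.5)
share = share_with_neighbor(type_a, type_b, 1.5)
```

The neighborhood distance takes over the job that basket membership does in association-rule mining.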
27
Parallelizing Spatial Big Data on Cloud Computing
• Parallelizing Spatial Computing
• Case 1: Compute Spatial-Autocorrelation Simpler to Parallelize – Map-reduce is okay
– Should it provide spatial de-clustering services?
– Can query-compiler generate map-reduce parallel code?
• Case 2: Harder : Parallelize Range Query on Polygon Maps – Need dynamic load balancing beyond map-reduce
– But, local processing is cheaper than sending it to another node!
– MPI or OpenMP is better!
• Case 3: Estimate Spatial Auto-Regression Parameters, Routing – Map-reduce is inefficient for iterative computations!
– MPI or OpenMP is essential!
– Golden section search, Determinant of large matrix
– Eco-routing algorithms, Evacuation route planning
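Golden section search, mentioned under Case 3, shows why the workload is iterative: each step needs a fresh evaluation of the objective, and each evaluation depends on the previous bracket. A minimal sketch for maximizing a unimodal function on an interval; the toy objective stands in for the spatial auto-regression log-likelihood as a function of ρ and is an assumption for illustration.

```python
import math


def golden_section_max(f, lo, hi, tol=1e-8):
    """Locate the maximizer of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # Maximum lies in [a, d]; reuse c as the new right probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Maximum lies in [c, b]; reuse d as the new left probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


# Toy objective with its peak at rho = 0.3.
best_rho = golden_section_max(lambda rho: -(rho - 0.3) ** 2, 0.0, 1.0)
```

Only one new function evaluation is needed per iteration, but the iterations are strictly sequential, which suits MPI or OpenMP style synchronization better than repeated map-reduce jobs.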
28
Why include Spatial Workload in Big Data Benchmark?
• Spatial computing is critical to many societal grand challenges
• It is critical in the cell-phone era of computing
• Decrease Map-Reduce bias
  – High cost of the Reduce step favors non-iterative workloads
  – MPI and OpenMP provide the lightweight synchronization needed for data analytics
  – Spatial provides iterative workloads to counter Map-Reduce bias
• Beyond pre-Big-Data computing assumptions
  – Beyond the sorting assumption in relational DBMS
    • Numbers, character strings → points, line-strings, polygons, routes, graphs
    • Equi-join → spatial-distance join, nearest neighbor
  – Beyond the i.i.d. assumption in statistics, machine learning, …
    • Independent samples → auto-correlation
    • Identical distribution → heterogeneous, non-stationary
29
How to include Spatial Workload in Big Data Benchmark?
• Table schema
  – Add home-address, work-address, cell-phone columns (for a customer)
  – Derive addresses by reverse-geocoding spatial locations
  – Generate locations from a mixture of point processes for
    • Hot-spots (auto-correlation)
    • Urban, suburban, rural (geographic heterogeneity)
  – Generate trajectories for cell-phones
    • Generate routes between home and work using shortest-path algorithms
    • Add a temporal schedule using routine
    • Add more points of interest beyond home and work, e.g. city simulators
• Spatial queries
  – Nearest-neighbor queries generated for home, work and points on commute routes
  – Shortest paths between points of interest (home, work, …)
  – Hotspots
  – Trend, change-points, …
• Metrics
  – Footprint scale: local, regional, country, continent, global
  – Mobile device interactions per second