Sensor Data Management: Challenges and (Some) Solutions
Amol Deshpande, University of Maryland
Motivation
Unprecedented, and rapidly increasing, instrumentation of our everyday world
Wireless sensor networks
RFID
Distributed measurement networks (e.g. GPS)
Industrial Monitoring
Sensor Data Processing: Now
Database
time id temp
10am 1 20
10am 2 21
.. .. …
10am 7 29
Table raw-data
Sensor Network
1. Extract all readings into a file
2. Run MATLAB/R/other data processing tools
3. Write output to a file/back to the database
4. Write data processing tools to process/aggregate the output (maybe using DB)
5. Decide new data to acquire
User
Repeat
Sensor Data Processing: What we want
Database
time id temp
10am 1 20
10am 2 21
.. .. …
10am 7 29
Table raw-data
Sensor Network
Models to be applied to data in real-time (at least simple ones)
User
time id temp
10am 1 20
10am 2 21
.. .. …
10am 7 29
Table processed-data
Tasks
Data
Continuous (standing) queries, e.g. alert monitoring
Results to continuous queries
Ad hoc queries (possibly against processed, modeled data)
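A continuous (standing) query for alert monitoring can be pictured as a filter that stays registered while readings arrive. The sketch below is only an illustration: the `Reading` type, the `standing_query` helper, and the 25-degree threshold are hypothetical names and values, not part of any system described in these slides. The sample rows mirror the raw-data table.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Reading:
    time: str
    sensor_id: int
    temp: float


def standing_query(
    stream: Iterable[Reading], predicate: Callable[[Reading], bool]
) -> Iterator[Reading]:
    """Yield each reading that satisfies the predicate as it arrives."""
    for reading in stream:
        if predicate(reading):
            yield reading


# Sample rows from the raw-data table; the alert threshold is an assumption.
readings = [Reading("10am", 1, 20), Reading("10am", 2, 21), Reading("10am", 7, 29)]
alerts = list(standing_query(readings, lambda r: r.temp > 25))
```

Because the helper is a generator, results are produced one reading at a time, so the same code works when `readings` is an unbounded stream rather than a list.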
Data Management Challenges
Very, very large scale
Spatio-temporal querying essential
Need new indexing techniques, data description formats, techniques for “data ingest” (cleaning the data etc.)
Much work in scientific data management, e.g. SkyServer
Data is typically imprecise, unreliable, or incomplete (data quality)
Measurement noise, failures in sensor/GPS data
High message loss rate in wireless/RFID
Balazinska et al; Data Management in the Worldwide Sensor Web; IEEE Pervasive, 2007.
Data Management Challenges
Data is generated continuously and must be processed in real-time (distributed data streams)
Need different query processing paradigms
Typically very high data rates
Must be able to handle a large number of continuous queries efficiently
Much recent work on “Data Streams”
Research systems: TelegraphCQ [Berkeley], STREAM [Stanford], Aurora [Brown/MIT/Brandeis] etc.
Commercial systems: Streambase, TruViso, …
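One way stream processing differs from querying stored tables is that aggregates are computed incrementally over a window of recent values. The following is a minimal sketch of a sliding-window average. The function name and the window size are assumptions made for illustration and do not correspond to the API of any of the systems listed above.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_average(values: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` values after each new arrival."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent: Deque[float] = deque()
    total = 0.0
    for value in values:
        recent.append(value)
        total += value
        if len(recent) > window:
            total -= recent.popleft()  # drop the value that left the window
        yield total / len(recent)


averages = list(sliding_average([20, 21, 29, 22], window=2))
```

Each arrival costs constant time and the memory use is bounded by the window size, which is what makes this style workable at high data rates.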
Data Management Challenges
Need for real-time statistical modeling of data
Eliminate spatial/temporal biases, handle missing data through extrapolation (e.g. regression, interpolation models)
Filter measurement noise (e.g. Kalman Filters)
Infer hidden variables, pattern recognition (e.g. HMMs)
Fault or anomaly detection
Forecasting/prediction (e.g. ARIMA)
Regression/interpolation models
Temperature monitoring
Kalman Filters …
GPS Data
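To make the noise-filtering item concrete, here is a minimal scalar Kalman filter sketch. It assumes a random-walk state model (the true value drifts slowly) observed directly with additive noise. The function name, the variances `q` and `r`, and the sample measurements are illustrative assumptions, not values from the talk.

```python
from typing import List, Sequence


def kalman_1d(
    measurements: Sequence[float], q: float, r: float, x0: float, p0: float
) -> List[float]:
    """Filter scalar measurements under a random-walk model.

    q: process noise variance, r: measurement noise variance,
    x0/p0: initial state estimate and its variance.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                # predict: uncertainty grows by process noise
        gain = p / (p + r)       # Kalman gain: how much to trust the measurement
        x = x + gain * (z - x)   # update the estimate toward the measurement
        p = (1.0 - gain) * p     # uncertainty shrinks after the update
        estimates.append(x)
    return estimates


# Hypothetical temperature trace with one outlying reading (35).
estimates = kalman_1d([20.0, 22.0, 21.0, 35.0, 21.0], q=0.01, r=4.0, x0=20.0, p0=1.0)
```

With a large `r` relative to `q`, the filter trusts its running estimate more than any single reading, so the outlier moves the estimate only part of the way toward 35.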
Data Management Challenges
The applications have strong acquisitional aspects
Data has to be actively acquired as needed
Typically high data acquisition costs (e.g. energy consumption in battery-powered devices)
Data provenance
Being able to trace something back to its origins
Data exploration and visualization
Data interoperability
Data security and privacy
…
My Research Interests
Managing imprecise and incomplete data
Support statistical modeling and querying of sensor data in relational databases
Clean, declarative abstractions
Real-time processing of streaming data
Probabilistic databases
Store and query data annotated with probabilities
Energy-efficient algorithms for wireless sensornets
Data acquisition, target monitoring, data compression ..
In-network query processing
MauveDB
Written using Apache Derby, an open-source Java DBMS
Supports an abstraction called model-based views
Declarative specification of models to be applied
Can query the output of the models using SQL
Models kept updated as new data/measurements arrive
A. Deshpande, S. Madden; MauveDB: Supporting Model-based User Views in Database Systems; SIGMOD 2006
B. Kanagal, A. Deshpande; Online Filtering, Smoothing and Probabilistic Modeling of Streaming data; ICDE 2008
Status: Support for regression- and interpolation-based views
Currently building support for views based on Dynamic Bayesian networks (Kalman Filters, HMMs etc.)
Ongoing work: Query processing and optimization, continuous queries
APIs for arbitrary models …
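The idea of a regression-based view can be sketched outside a DBMS: fit a model to the raw readings, then expose the model's output on a regular grid as if it were a table that can be filtered. This is not MauveDB's syntax or implementation. The helper names, the use of the sensor id as a position, and the grid are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (position, temperature)


def fit_line(points: Sequence[Point]) -> Tuple[float, float]:
    """Least-squares fit of temp = intercept + slope * position."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct positions")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_view(points: Sequence[Point], grid: Sequence[float]) -> List[Point]:
    """Materialize the fitted model on a grid of positions."""
    intercept, slope = fit_line(points)
    return [(x, intercept + slope * x) for x in grid]


# Raw rows (id used as a stand-in position, temp) taken from the sample table.
raw = [(1.0, 20.0), (2.0, 21.0), (7.0, 29.0)]
view = regression_view(raw, [float(x) for x in range(1, 8)])
# An ad hoc query against the modeled data rather than the raw readings.
hot = [(x, t) for x, t in view if t > 25.0]
```

The view has a row for every grid position, including positions 3 to 6 where no sensor reported, which is the sense in which the model fills gaps in the raw data.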
Probabilistic Databases
Motivation: Increasing amounts of uncertain data
From sensor networks
Imprecise data, data with confidence/accuracy bounds
Human-observed data
Statistical modeling/machine learning
Many models provide a distribution over a set of labels (e.g. HMMs)
Information extraction from text
Social networks
How to manage and query such data in relational databases?
Different types of uncertainties
Complex correlation patterns
Much work in database community over last few years
P. Sen, A. Deshpande; Representing and Querying Correlated Tuples in Probabilistic Databases; ICDE 2007
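Querying data annotated with probabilities can be illustrated with possible-worlds semantics: enumerate every subset of tuples that might exist, weight each by its probability, and sum the weights of the worlds where the query holds. The sketch below assumes the tuples are independent, which is a simplification. The cited work addresses correlated tuples, which this code does not model. All names and probabilities are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]


def query_probability(
    tuples: Sequence[Tuple[Row, float]], query: Callable[[List[Row]], bool]
) -> float:
    """Probability that `query` holds, assuming independent tuples.

    Enumerates all 2^n possible worlds, so it is only usable for tiny inputs.
    """
    total = 0.0
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        world = []
        for (row, prob), keep in zip(tuples, present):
            weight *= prob if keep else 1.0 - prob
            if keep:
                world.append(row)
        if query(world):
            total += weight
    return total


# Each reading carries a hypothetical probability that it is correct.
uncertain = [
    ({"id": 1, "temp": 20}, 0.9),
    ({"id": 7, "temp": 29}, 0.6),
    ({"id": 8, "temp": 30}, 0.5),
]
p_hot = query_probability(uncertain, lambda rows: any(r["temp"] > 25 for r in rows))
```

Here the query fails only when both hot readings are absent, so the result is 1 - (0.4 × 0.5) = 0.8. Exhaustive enumeration is exponential in the number of tuples, which is one reason efficient query evaluation is a research problem in this area.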