Yanlei Diao, University of Massachusetts Amherst
Capturing Data Uncertainty in High-Volume Stream Processing
Yanlei Diao, Boduo Li, Anna Liu, Liping Peng, Charles Sutton†, Thanh Tran, Michael Zink
University of Massachusetts, Amherst; †University of California, Berkeley
Uncertain Data Streams
Uncertain data streams
• Environmental monitoring sensor networks
• Radio Frequency Identification (RFID) networks
• GPS systems
• Radar sensor networks
Data: incomplete, imprecise, misleading
Results: unknown quality
Scope of Our Problem
Data modeled as continuous random variables
• Many types of sensor data. More examples later…
High-volume data streams
• In contrast to probabilistic databases
An end-to-end solution
• Uncertainty of raw data
• Uncertainty of query processing results
Object Tracking and Monitoring
Mobile RFID readers
• Handheld, robot-mounted
Incomplete, noisy data
• Environmental factors
• Orientation of reading
Not directly queryable
• Raw data: <tag id, reader id, ts>
• Data needed for querying: e.g., precise object locations
Fire Monitoring Application
Display of solid merchandise shall not exceed 200 pounds per square foot of shelf area.
R: (time, tag_id, (x,y,z)^p)
  SELECT RSTREAM(area(R.(x,y,z)^p), sum(R.weight))
  FROM R [PARTITION BY R.tag_id ROW 1]
  GROUP BY area(R.(x,y,z)^p)
  HAVING sum(R.weight) > 200
What is the quality of the alert returned by this query?
Fire Monitoring Application
Alert when a flammable object is exposed to a high temperature.
RFIDStream: (time, tag_id, (x,y,z)^p)
TempStream: (time, (x,y,z), temp^p)
  SELECT RSTREAM(R.tag_id, R.(x,y,z)^p, T.temp^p)
  FROM RFIDStream [RANGE 3 seconds] AS R,
       TempStream [RANGE 3 seconds] AS T
  WHERE object_type(R.tag_id) = 'flammable' AND
        T.temp^p > 60°C AND
        location_equals(R.(x,y,z)^p, T.(x,y,z))
What is the quality of the alert returned by this query?
Severe Weather Monitoring
[Figure: radar data processing pipeline. Stage labels: Sensing, Transformation & Averaging, wireless transmission, Merging, Detection/Prediction, Task Generation.]
High-Volume, Uncertain Raw Data
High volume: 1.66 million data items, 205 Mb/sec per radar
Uncertainty:
• Environmental noise
• Device noise
• Transmit frequency
• System clock
• Positioner
• Antenna
[Figure: raw pulse data from the Sensing stage. Axes: pulses 1-7 (time) vs. gates (distance).]
Averaged Moment Data
[Figure: moment data (velocity, reflectivity, …) after the Transformation & Averaging stage. Axes: pulses 1-7 (time) vs. gates (distance).]
Uncertainty: what is the effect of averaging over uncertain data?
Merged Data
[Figure: radar data processing pipeline. Stage labels: Sensing, Transformation & Averaging, wireless transmission, Merging, Detection/Prediction.]
Uncertainty: uneven distribution of data density
What is the quality of the detection result?
[Figure: images dated May 8, 2007 (© KSWO TV, © Patrick Marsh), labeled 7:21pm, 8:15pm, 9:54pm, 11:00pm.]
Series of low-level circulations.
NWS Tornado Warnings: 7:16pm, 7:39pm, 8:29pm
Effect of Averaging of Uncertain Data
Averaging size   Moment data size (MB)   Detection running time (sec)   Reported tornados   False negatives
40               41.49                   27                             3.75                0
60               27.68                   23                             1.5                 2.25
80               20.79                   21                             0.5                 3.25
100              16.65                   21                             0.25                3.75
500              3.42                    20                             0                   3.75
1000             1.76                    20                             0                   3.75
Results of a 38-second trace at 8:10 pm on May 8, 2007. Averaging size 40 is used to represent detection results on fine-grained data.
Challenges
Raw data is inherently incomplete and noisy
Raw data is not directly queryable
• RFID: <ts, tag_id, reader_id> is captured; <ts, tag_id, (x,y,z)> is needed
• Radar: <ts, gate, (I,Q)_{h|v}> is captured; <ts, gate, (reflectivity, velocity, …)> is needed
High-volume raw data streams
• RFID: hundreds of readings per second per reader
• Radar: 1.66 million data items per second per radar
Sophisticated query processing
System Overview
[Figure: system architecture with operators T1-T3, A1-A4, and J1. Labels: tuples w. lineage; archived tuples; confidence region; mean, variance, bounds.]
Data Capture and Transformation
Transform raw streams into tuple streams with quantified uncertainty, i.e., compute p(X|O):
• Output: continuous random variables X, hidden
• Input: random variables O, observed
Existing work
• Statistical machine learning
• Sensor stream cleaning and processing
Our goal: choose appropriate statistical models, optimize for high-volume streams
RFID Streams: Modeling
A generative model characterizes how data is generated: p(X,O)
• X: true object location (x,y,z)
• O: boolean for RFID readings
• How the state of the world changes
  • Object movement, reader motion
• How sensing generates data from the state of the world
Probabilistic Inference over RFID Streams in Mobile Environments. T. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. Shenoy. ICDE 2009.
RFID Streams: Inference
Probabilistic inference over streams: p(X|O)
• Sampling-based inference
• Key to performance: using a small number of samples
               Standard sampling-based inference    Our optimizations
Accuracy       0.6 - 0.8 foot                       0.1 - 0.5 foot
Performance    0.1 reading/sec for 20 objects       > 1000 readings/sec for 20,000 objects
7 orders of magnitude improvement (4 orders in readings/sec times 3 orders in number of objects)!
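To make "sampling-based inference" concrete, here is a minimal sketch of a plain particle filter that estimates p(X|O) for one tag along a one-dimensional shelf. It is the textbook technique, not the optimized method summarized above. The sensor model `read_prob`, the shelf length, the motion noise, and the particle count are all hypothetical values chosen for illustration.

```python
import random


def read_prob(distance, major_range=2.0):
    """Hypothetical sensor model: probability that a tag at this distance is read."""
    return 0.9 if distance <= major_range else 0.05


def particle_filter(observations, n_particles=500, shelf_len=10.0,
                    motion_sd=0.05, seed=0):
    """observations: list of (reader_position, was_read) pairs.

    Returns (mean, variance) of the sampled posterior over the tag location.
    """
    rng = random.Random(seed)
    # Prior: the tag may be anywhere on the shelf.
    particles = [rng.uniform(0.0, shelf_len) for _ in range(n_particles)]
    for reader_pos, was_read in observations:
        # Prediction: objects rarely move, modeled here as small Gaussian jitter.
        particles = [min(shelf_len, max(0.0, x + rng.gauss(0.0, motion_sd)))
                     for x in particles]
        # Weighting: likelihood of the boolean reading under each hypothesis.
        weights = []
        for x in particles:
            p = read_prob(abs(x - reader_pos))
            weights.append(p if was_read else 1.0 - p)
        # Resampling: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    mean = sum(particles) / n_particles
    var = sum((x - mean) ** 2 for x in particles) / n_particles
    return mean, var


# A reader sweeps the shelf; the tag (true location 6.0) answers only within range.
true_loc = 6.0
obs = [(pos / 2.0, abs(pos / 2.0 - true_loc) <= 2.0) for pos in range(0, 21)]
mean, var = particle_filter(obs)
print(f"estimated location {mean:.2f}, variance {var:.3f}")
```

The returned mean and variance are the kind of quantified uncertainty a tuple would carry downstream. The cost grows with particles times objects, which is why the number of samples is the key to performance.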
Radar Streams: Modeling
Again, a generative model p(X,O)?
• O: raw pulse data
• X: velocity, reflectivity, …
• Highly complex sensing process
• Extremely high volume, 1.66 million data items/sec
Sources of uncertainty in the sensing process:
• Environmental noise
• Device noise
• Transmit frequency
• System clock
• Positioner
• Antenna
• …
[Figure: raw pulse data. Axes: pulses 1-7 (time) vs. gates (distance).]
Radar Streams: Model Fitting
Make output data X observable: p(X)
• Deterministic heuristic algorithm for O-X transformation
Fit a known model directly
• Moving Average (MA) model for p(X1, …, Xn)
Key to performance: model fitting at stream speed
• Identify sequences obeying MA at 1.66 million items/sec
[Figure: MA model diagram relating X1-X7 to E1-E7.]
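As a reminder of what an MA(q) model implies, the sketch below simulates a series in which each value is a mean plus the current noise term and a weighted sum of the previous q noise terms, then checks the standard signature used to recognize such sequences: the sample autocorrelation is clearly nonzero up to lag q and near zero beyond it. The coefficient 0.6, the Gaussian noise, and the series length are assumptions for illustration. This is not the stream-speed fitting procedure from the talk.

```python
import random


def simulate_ma(thetas, n, mu=0.0, sigma=1.0, seed=0):
    """Simulate X_t = mu + E_t + sum_i thetas[i-1] * E_{t-i} with Gaussian noise E_t."""
    rng = random.Random(seed)
    q = len(thetas)
    noise = [rng.gauss(0.0, sigma) for _ in range(n + q)]
    series = []
    for t in range(q, n + q):
        x = mu + noise[t] + sum(th * noise[t - i]
                                for i, th in enumerate(thetas, start=1))
        series.append(x)
    return series


def sample_autocorr(series, lag):
    """Sample autocorrelation of the series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series) / n
    ck = sum((series[t] - mean) * (series[t + lag] - mean)
             for t in range(n - lag)) / n
    return ck / c0


xs = simulate_ma([0.6], n=20000)          # an MA(1) series with a hypothetical theta
acf = [sample_autocorr(xs, k) for k in range(1, 5)]
print([round(r, 3) for r in acf])         # lag 1 is large, lags 2-4 are close to zero
```

For MA(1) with coefficient theta, the theoretical lag-1 autocorrelation is theta / (1 + theta^2), and all higher lags are zero.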
Initial Result of MA Fitting
[Figure: MA sequence length vs. distance from radar, for MA(5).]
Dynamically decide MA sequences for averaging
Efficiently compute distribution of averaging over MA sequences
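One way to see why the distribution of an average over an MA sequence can be computed cheaply: if the noise terms are Gaussian (an assumption here), the average of n consecutive values is itself Gaussian, and its variance needs only the first q autocovariances. The sketch below uses the standard closed form Var = (1/n^2) [n γ(0) + 2 Σ_{k=1..q} (n − k) γ(k)]. Function names and coefficient values are illustrative and not taken from the talk.

```python
def ma_autocov(thetas, sigma, lag):
    """Theoretical autocovariance of an MA(q) process at the given lag."""
    coeffs = [1.0] + list(thetas)   # theta_0 = 1 multiplies the current noise term
    lag = abs(lag)
    if lag >= len(coeffs):
        return 0.0                  # values more than q steps apart are uncorrelated
    return sigma ** 2 * sum(coeffs[j] * coeffs[j + lag]
                            for j in range(len(coeffs) - lag))


def mean_distribution(mu, thetas, sigma, n):
    """(mean, variance) of the average of n consecutive values of a Gaussian MA(q) process."""
    q = len(thetas)
    total = n * ma_autocov(thetas, sigma, 0)
    # Only lags 1..q contribute, so the cost is O(q^2) regardless of n.
    for k in range(1, min(q, n - 1) + 1):
        total += 2 * (n - k) * ma_autocov(thetas, sigma, k)
    return mu, total / n ** 2


# Hypothetical MA(5) coefficients, averaged over 40 consecutive values.
avg_mean, avg_var = mean_distribution(mu=12.0, thetas=[0.5] * 5, sigma=1.0, n=40)
print(avg_mean, avg_var)
```

With no MA terms the formula reduces to the familiar sigma^2 / n. Positive coefficients make the average noisier than that independent-sample value, which is the effect that averaging over uncertain, correlated data has to account for.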
Relational Processing under Uncertainty
A relational paradigm for data processing after initial data capture and transformation
• Support relational operators and aggregation
• Compute a distribution for each result, modeled as a continuous random variable
Integral-based approach [Cheng et al., SIGMOD 2003]
• Exact, but too slow for stream processing
Sampling-based approach [Ge & Zdonik, ICDE 2008]
• Speed-accuracy tradeoff?
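The two approaches can be contrasted on the simplest case, a selection predicate such as T.temp^p > 60 over a Gaussian-distributed temperature. The sketch below is not the algorithm of either cited paper. It only shows the general idea: the exact answer is an integral of the density (closed-form for a Gaussian), while the sampling answer trades accuracy for a tunable number of samples. The Gaussian assumption and all parameter values are illustrative.

```python
import math
import random


def exact_prob_above(mean, sd, threshold):
    """P(T > threshold) for T ~ Normal(mean, sd), i.e. the integral of the density above threshold."""
    z = (threshold - mean) / (sd * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def sampled_prob_above(mean, sd, threshold, n_samples, seed=0):
    """Monte Carlo estimate of the same probability from n_samples draws."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if rng.gauss(mean, sd) > threshold)
    return hits / n_samples


# Hypothetical reading: temperature 62 degrees with standard deviation 2, threshold 60.
exact = exact_prob_above(62.0, 2.0, 60.0)
for n in (10, 100, 10000):
    approx = sampled_prob_above(62.0, 2.0, 60.0, n)
    print(f"n={n:6d}  estimate={approx:.3f}  abs error={abs(approx - exact):.3f}")
```

The estimation error shrinks roughly with the square root of the sample count, which is the speed-accuracy trade-off in miniature: each extra digit of accuracy costs about a hundred times more samples per tuple.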
Research Issues
Techniques for exact derivation that are natural for continuous random variables
Approximation
• Achieving speed-accuracy tradeoff more effectively
Correlated intermediate results
• When do they occur with relational operators and aggregation?
• Optimizations: avoid intermediate pdfs
• Complex function
• Lineage …
Much Work Lies Ahead…
Your comments are welcome.
[Image © KSWO TV]
RFID Streams: Speed vs. Accuracy
Performance tradeoff
[Figure: MA sequence length vs. distance from radar, for MA(5) and MA(20).]
Dynamically decide MA sequences for averaging
Aggregation: Speed vs. Accuracy
Algorithm      Throughput   Variance Distance [0,1]
Histogram      3382         0.083
CF (exact)     466          0
CF (approx)    10593        0.012