SEBD Tutorial, June 2006 1
Monitoring Distributed Streams
Joint work with Tsachi Scharfman and Daniel Keren

Sources
  A Geometric Approach to Monitoring Distributed Data Streams, SIGMOD 06 (Honorable Mention)
  Aggregate Threshold Queries in Sensor Networks, submitted to SENSYS 06
  Monitoring Many Features in Distributed Data Streams, in preparation for ICDM 06

Problem Definition
A set of distributed data streams
  Mirrored web site
  Distributed spam filtering system
  A sensor network
A data vector is collected from each stream
  Stream is infinite
  Sliding/jumping windows
Given: a function over the average of the data vectors
Given: a predetermined threshold
Question: did the function value cross the threshold?

Example 1: Web Page Frequency Counts
Mirrored web site
Each mirror maintains the frequency with which each page was accessed in the last 5 min.
We would like to constantly maintain a list of the most frequently accessed web pages (as defined by a threshold)

Example 2: Air Quality Monitoring
Sensors monitoring the concentration of air pollutants.
Each sensor holds a data vector comprising the measured concentrations of various pollutants (CO2, SO2, O3, etc.).
A function on the average data vector determines the Air Quality Index (AQI)
Alert in case the AQI exceeds a given threshold.

Example 3: Variance Alert
Sensors monitoring the temperature in a server room (machine room, conference room, etc.)
Ensure uniform temp.: monitor variance of readings
Alert in case variance exceeds a threshold
Temperature readings by n sensors: $x_1, \dots, x_n$
Each sensor holds a data vector $v_i = (x_i^2, x_i)^T$
The average data vector is $v = \left( \frac{1}{n}\sum_{i=1}^{n} x_i^2,\ \frac{1}{n}\sum_{i=1}^{n} x_i \right)^T$
$Var(\text{all sensors}) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n}\sum_{i=1}^{n} x_i \right)^2$

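The variance alert works because the variance is a function of the average of the vectors (x_i^2, x_i) alone. A minimal sketch of that identity, assuming plain Python lists; the function names and the sample readings are illustrative and not taken from the slides:

```python
def local_vector(x):
    """Data vector held by one sensor: (x^2, x)."""
    return (x * x, x)

def average_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(len(vectors[0])))

def variance_from_average(avg):
    """Population variance: mean of squares minus square of the mean."""
    mean_sq, mean = avg
    return mean_sq - mean * mean

readings = [20.0, 22.0, 21.0, 25.0]  # hypothetical temperatures
avg = average_vector([local_vector(x) for x in readings])
print(variance_from_average(avg))  # 3.5
```

No sensor needs to see another sensor's raw reading; only the two-component average vector matters.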
Example 4 (running example): Distributed Feature Selection
A distributed spam mail filtering system.
A mail server receives a stream of positive and negative examples.
Select a set of features (words) to be used in order to build a spam classifier.
A feature is good if its information gain is above a threshold.
[Figure: Information Gain vs. Document Index. x-axis: Document Index (0 to 800,000); y-axis: Information Gain (0 to 0.008); series: bosnia, ipo, febru]

Distributed Calculation of Information Gain
Each server maintains a contingency table for each feature.
We would like to determine, for each feature, whether the information gain on the average contingency table is above the threshold.
$C_{i,j}$ =
          Spam   ¬Spam
    f     0.1    0.2
    ¬f    0.2    0.5
$$IG(C) = \sum_{i \in \{1,2\}} \sum_{j \in \{1,2\}} c_{i,j} \log \frac{c_{i,j}}{(c_{i,1} + c_{i,2}) \cdot (c_{1,j} + c_{2,j})}$$

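The information gain of a 2x2 contingency table can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name is made up here, the logarithm is taken base 2, and cells equal to zero are treated as contributing nothing (the usual 0 log 0 = 0 convention).

```python
import math

def information_gain(c):
    """Information gain of a 2x2 contingency table of relative frequencies.

    c[i][j] is the fraction of examples in row i (feature present / absent)
    and column j (spam / not spam). Base-2 logarithm is an assumption.
    """
    row = [c[i][0] + c[i][1] for i in range(2)]   # c_{i,1} + c_{i,2}
    col = [c[0][j] + c[1][j] for j in range(2)]   # c_{1,j} + c_{2,j}
    total = 0.0
    for i in range(2):
        for j in range(2):
            if c[i][j] > 0:  # zero cells contribute nothing (assumed convention)
                total += c[i][j] * math.log2(c[i][j] / (row[i] * col[j]))
    return total

table = [[0.1, 0.2],
         [0.2, 0.5]]
print(information_gain(table))
```

For the table shown on the slide the value is small but positive, since the feature and the spam label are only weakly dependent.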
Distributed Calculation of Information Gain (continued)
Note that the information gain on the average contingency table cannot be derived from the information gain on each individual contingency table!
$$C_1 = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix}, \qquad C_2 = \begin{pmatrix} 0 & 0.5 \\ 0.5 & 0 \end{pmatrix}$$
$$IG(C_1) = 1, \qquad IG(C_2) = 1, \qquad IG\left(\frac{C_1 + C_2}{2}\right) = 0$$

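A short worked calculation shows why the average table scores zero, assuming base-2 logarithms (which is what makes the individual tables score exactly 1). Every cell of the average table is 0.25 and every row and column sum is 0.5, so each term of the sum vanishes:

```latex
\frac{C_1 + C_2}{2} = \begin{pmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{pmatrix},
\qquad
IG\!\left(\frac{C_1 + C_2}{2}\right)
  = \sum_{i,j \in \{1,2\}} 0.25 \,\log_2 \frac{0.25}{0.5 \cdot 0.5}
  = 4 \cdot 0.25 \cdot \log_2 1 = 0 .
```

Both servers see a perfectly informative feature, but the feature points in opposite directions on the two servers, so on the combined data it carries no information.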
Previous Work
Focused on linear functions (e.g., sum, average):
  M. Dilman and D. Raz. Efficient reactive monitoring. In INFOCOM, pages 1012–1019, 2001.
Previous solutions for arbitrary functions included only naïve algorithms:
  All data is moved to a central place
    Communication overhead
    CPU overhead
    Power overhead
    Privacy issues

Novel Geometric Approach
Geometric interpretation:
  Each node holds a statistics vector
  Coloring the vector space
    Grey: function > threshold
    White: function <= threshold
Goal: determine the color of the global data vector (average).

Geometric Approach: Bounding the Convex Hull
Observation: the average is in the convex hull of the drift vectors
If the convex hull is monochromatic, then the average has the same color

Drift Vectors
Rather than bounding the convex hull of the statistics vectors:
  Periodically calculate an estimate vector: the current global value
  Each node maintains a drift vector: the change in the local statistics vector since the last time an estimate vector was calculated (in relation to the estimate vector)
  The global statistics vector is the average of the drift vectors
$$Avg(v_i) = \frac{\sum_{i=1}^{n} v_i}{n} = \frac{\sum_{i=1}^{n} v_i^{known}}{n} + \frac{\sum_{i=1}^{n} \Delta v_i}{n} = e + \frac{\sum_{i=1}^{n} \Delta v_i}{n} = \frac{\sum_{i=1}^{n} (e + \Delta v_i)}{n}$$
Here $v_i^{known}$ is the local statistics vector of node i at the last estimate calculation, $\Delta v_i = v_i - v_i^{known}$, and $e$ is the estimate vector.

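The claim that the global statistics vector equals the average of the drift vectors can be checked on a toy example. A minimal sketch, assuming each drift vector is the estimate vector plus the local change since the last synchronization; the helper names and the numbers are hypothetical:

```python
def average(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def drift_vector(estimate, known, current):
    """Estimate vector shifted by the local change (current - known)."""
    return [e + (c - k) for e, k, c in zip(estimate, known, current)]

# Hypothetical local statistics vectors at the last synchronization ...
known = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
# ... and the same vectors now, after each stream has changed.
current = [[2.0, 2.0], [3.0, 7.0], [4.0, 1.0]]

estimate = average(known)  # the estimate vector e
drifts = [drift_vector(estimate, k, c) for k, c in zip(known, current)]

print(average(drifts))   # matches the line below
print(average(current))  # the true global statistics vector
```

Each node can compute its own drift vector locally, because it only needs the shared estimate vector and its own two local vectors.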
Distributively Bounding the Convex Hull
A reference point is known to all nodes
Each node constructs a ball
Theorem: the convex hull is bounded by the union of the balls

Basic Algorithm
An initial estimate vector is calculated
Nodes check the color of their drift sphere
  The segment from the estimate vector to the drift vector is a diameter of the drift ball
If any ball is not monochromatic, synchronize the nodes

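The local check each node performs can be sketched as below. To keep the example exact, the monitored function is assumed to be the Euclidean norm f(v) = ||v||, whose minimum and maximum over a ball have a closed form; this choice, and all function names, are assumptions of the sketch. For a general function the monochromaticity test needs the distance to the threshold surface instead.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def drift_ball(estimate, drift):
    """Ball whose diameter is the segment from the estimate vector to the drift vector."""
    center = [(e + u) / 2 for e, u in zip(estimate, drift)]
    radius = norm([u - e for e, u in zip(estimate, drift)]) / 2
    return center, radius

def is_monochromatic(estimate, drift, threshold):
    """True if f(v) = ||v|| stays on one side of the threshold over the whole ball."""
    center, radius = drift_ball(estimate, drift)
    f_min = max(0.0, norm(center) - radius)  # point of the ball closest to the origin
    f_max = norm(center) + radius            # point of the ball farthest from the origin
    all_white = f_max <= threshold           # white: f <= threshold
    all_grey = f_min > threshold             # grey:  f >  threshold
    return all_white or all_grey

def nodes_needing_sync(estimate, drifts, threshold):
    """Indices of nodes whose drift ball is not monochromatic."""
    return [i for i, u in enumerate(drifts)
            if not is_monochromatic(estimate, u, threshold)]

estimate = [1.0, 0.0]
drifts = [[3.0, 0.0], [1.0, 0.5]]
print(nodes_needing_sync(estimate, drifts, threshold=2.0))  # [0]
```

Node 0's ball spans norms from 1 to 3 and therefore straddles the threshold 2, so it triggers a synchronization; node 1's ball is entirely white.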
Reuters Corpus (RCV1-v2)
800,000+ news stories
Aug 20 1996 -- Aug 19 1997
Corporate/Industrial tagging simulates spam
n=10
[Figure: Information Gain vs. Document Index. x-axis: Document Index (0 to 800,000); y-axis: Information Gain (0 to 0.008); series: bosnia, ipo, febru]
[Figure: Broadcast Messages vs. Threshold. x-axis: Threshold (0 to 0.006); y-axis: Broadcast Messages (x1000), 0 to 800; series: bosnia, ipo, febru, Naive Alg.]

Trade-off: Accuracy vs. Performance
Inefficiency: the value of the function on the average is close to the threshold
Performance can be enhanced at the cost of a less accurate result:
  Set an error margin around the threshold value
[Figure: Broadcast Messages vs. Error Margin. x-axis: Error Margin (0% to 50%); y-axis: Broadcast Messages (x1000), 0 to 250; series: bosnia, ipo, febru]

Scalability
[Figure: Broadcast Messages vs. Number of Nodes (0.003). x-axis: Number of Nodes (0 to 100); y-axis: Broadcast Messages (x1000), 0 to 800; series: bosnia, ipo, febru, naive]
[Figure: Broadcast Messages vs. Number of Nodes (0.006). x-axis: Number of Nodes (0 to 100); y-axis: Broadcast Messages (x1000), 0 to 800; series: bosnia, ipo, febru, naive]
The number of messages per node is constant.

Balancing
Globally calculating the average is costly
Often it is possible to average only some of the data vectors.
$$Avg(v_i) = \frac{\sum_{i=1}^{n} (e + \Delta v_i)}{n}$$
$$\sum_{i=1}^{n} \delta_i = 0 \;\Rightarrow\; Avg(v_i) = \frac{\sum_{i=1}^{n} (e + \Delta v_i + \delta_i)}{n}$$

Computational Complexity of Calculating Distance from Zero Surface
Closed-form solutions (variance alert)
Numerical methods
Offline computations and caching

Performance Analysis
[Figure: two plots, both axes ranging from -10 to 10]

Performance Analysis (continued)
[Figure: two plots, both axes ranging from -10 to 10. Labels: expected global statistics vector E[v]; distance sphere; D_global; local statistics vector v_i; ball B(E[v(t)], v_i(t))]

Upper Bounds on Probability of Constraint Violation
$$\Pr\nolimits_{violation} \le \frac{\sum_{i=1}^{d} V[X_i]}{N \, (D_{global})^2}$$
[Equation: a second bound of exponential form on Pr_violation, in terms of D_global, d, N and the variances V[X_i]]

Tiered Sensor Networks
Network comprised of two types of sensors: Macro-Nodes and Motes
Motes:
  Simple, inexpensive sensing units
  Based on 8-bit processors
Macro-Nodes:
  Less resource constrained
  Based on 32-bit processors; support more advanced OS and development tools

Monitoring Sensor Networks (1)
A spanning tree is constructed over the connectivity graph
The initial measurement vector is aggregated over the tree and flooded to all Motes
Motes use the aggregated vector as the estimate vector
An attempt is made to balance constraint violations within the cluster (intra-cluster balancing):
  The Cluster Head iteratively selects motes and requests their drift vectors
  Balancing succeeds if the average of the drift vectors collected from the motes creates a monochromatic ball with the estimate vector

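Intra-cluster balancing as described above can be sketched as a loop run by the Cluster Head. This is an illustration under assumptions rather than the authors' implementation: the monitored function is taken to be the Euclidean norm so that the monochromaticity test is exact, motes are contacted in the order given, and all names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def ball_is_monochromatic(a, b, threshold):
    """Ball with diameter a-b, colored by f(v) = ||v|| against the threshold."""
    center = [(p + q) / 2 for p, q in zip(a, b)]
    radius = norm([q - p for p, q in zip(a, b)]) / 2
    f_min = max(0.0, norm(center) - radius)
    f_max = norm(center) + radius
    return f_max <= threshold or f_min > threshold

def balance_cluster(estimate, violating_drift, other_drifts, threshold):
    """Request drift vectors one at a time until their average balances the violation.

    Returns the number of additional motes contacted, or None if balancing failed.
    """
    collected = [violating_drift]
    for contacted, drift in enumerate(other_drifts, start=1):
        collected.append(drift)
        n = len(collected)
        avg = [sum(v[k] for v in collected) / n for k in range(len(estimate))]
        if ball_is_monochromatic(estimate, avg, threshold):
            return contacted
    return None

estimate = [1.0, 0.0]
print(balance_cluster(estimate, [3.0, 0.0], [[0.0, 0.0], [1.0, 0.0]], threshold=2.0))  # 1
```

In the example a single extra drift vector pulls the average back to (1.5, 0), whose ball with the estimate vector lies entirely below the threshold, so only one additional mote is contacted.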
Monitoring Sensor Networks (2)
If intra-cluster balancing fails, an attempt is made to balance the constraint violation by passing a token among the Cluster Heads (extra-cluster balancing):
  The token consists of the average of the drift vectors held by the motes in the clusters the token has visited
  Upon receipt of the token, the Cluster Head collects drift vectors from its motes and adds them to the token
If extra-cluster balancing fails, the vector held by the token is flooded to the motes, which use it as the new estimate vector

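One way a Cluster Head could add its motes' drift vectors to the token is to keep a running average together with a count. The slides only say that the token holds an average; carrying the count alongside it, and the function name, are assumptions of this sketch.

```python
def add_to_token(token, drifts):
    """Fold a cluster's drift vectors into the token's running average.

    token is a pair (average_vector, count), where count is the number of
    drift vectors averaged so far (the count is an assumption of this sketch).
    """
    avg, count = token
    if not drifts:
        return token
    total = [a * count for a in avg]  # undo the division to recover the sum
    for d in drifts:
        total = [t + x for t, x in zip(total, d)]
    count += len(drifts)
    return ([t / count for t in total], count)

token = ([0.0, 0.0], 0)                                # empty token
token = add_to_token(token, [[2.0, 4.0], [4.0, 0.0]])  # first cluster visited
token = add_to_token(token, [[0.0, 2.0]])              # second cluster visited
print(token)  # ([2.0, 2.0], 3)
```

With the count available, two tokens that meet could be combined in the same way, by weighting each average by its count.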
Monitoring Sensor Networks (3)
Token traversal is implemented as a depth-first search (DFS)
Several tokens may simultaneously traverse the network, in which case they may be required to merge

Data Set
144x36 data points of temperature readings in the northern hemisphere
Readings are taken every 6h for a period of a year
Strong spatial and temporal correlation among data readings
Average temperature ranges from -3.15 to 15 degrees Centigrade

Experimental Results - Threshold
[Figure: Mote Messages vs. Threshold. x-axis: Threshold (°C), -30 to 50; y-axis: Mote Messages (x1000), 0 to 8000; series: Decentralized Alg., Naive Alg.]
[Figure: Cluster Head Messages vs. Threshold. x-axis: Threshold (°C), -30 to 50; y-axis: Cluster Head Msg. (x1000), 0 to 8000; series: Decentralized Alg., Naive Alg.]

Experimental Results - Error Margin
[Figure: Cluster Head Messages vs. Error Margin. x-axis: Error Margin (°C), 0 to 7; y-axis: Cluster Head Msgs. (x1000), 0 to 300]
[Figure: Mote Messages vs. Error Margin. x-axis: Error Margin (°C), 0 to 7; y-axis: Mote Messages (x1000), 0 to 1800]

Experimental Results - Cluster Size
[Figure: Cluster Head Messages vs. Cluster Size. x-axis: Cluster Size (0 to 5000); y-axis: Cluster Head Messages, 0 to 9000]
[Figure: Mote Messages vs. Cluster Size. x-axis: Cluster Size (0 to 5000); y-axis: Mote Messages (x1000), 0 to 2000]

Window Size
[Figure: Broadcast Messages vs. Window Size. x-axis: Window size (6700 to 13400); y-axis: Broadcast Messages (x1000), 0 to 250; series: bosnia, ipo, febru]

Simultaneous Features
[Figure: Broadcast Messages vs. Simultaneous Features. x-axis: Num. of Monitored Features (0 to 5000); y-axis: Broadcast Messages (x1000), 0 to 300; series: Decentralized Alg., Naive Alg.]

Future Work
Efficiently monitoring multiple objects
  Exploiting correlations among objects
Monitoring top-k objects
Improving spherical bounds
Large scale networks

Chi-Square
$$ChiS(A) = \frac{(x_1 x_4 - x_2 x_3)^2}{(x_1 + x_2)(x_1 + x_3)(x_2 + x_4)(x_3 + x_4)}$$
A =
          Spam   ¬Spam
    f     x1     x2
    ¬f    x3     x4
$$x_4 = 1 - x_1 - x_2 - x_3, \qquad y_1 = x_1 + x_3, \qquad y_2 = x_1 + x_2, \qquad y_3 = x_2 + x_3$$

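The chi-square score of a 2x2 table of relative frequencies, (x1 x4 - x2 x3)^2 divided by the product of the four row and column sums, can be sketched as below. The function name and the handling of an empty row or column (returning 0) are assumptions made for this illustration.

```python
def chi_square(a):
    """Chi-square score of a 2x2 table of relative frequencies [[x1, x2], [x3, x4]]."""
    (x1, x2), (x3, x4) = a
    denom = (x1 + x2) * (x1 + x3) * (x2 + x4) * (x3 + x4)
    if denom == 0:
        return 0.0  # an empty row or column; defined as 0 in this sketch
    return (x1 * x4 - x2 * x3) ** 2 / denom

print(chi_square([[0.5, 0.0], [0.0, 0.5]]))      # 1.0, feature determines the class
print(chi_square([[0.25, 0.25], [0.25, 0.25]]))  # 0.0, feature is independent of the class
```

Like information gain, the score is a nonlinear function of the table entries, so it has to be evaluated on the average table rather than averaged across servers.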