space-efficient online approximation of time series data: streams

Space-efficient Online Approximation of Time Series Data: Streams, Amnesia, and Out-of-orderStreams, Amnesia, and Out-of-order
ICDE 2010
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 1 / 21
Outline
Time Series & Approximation
stream model: items arrive in order of timestamp ti
I hypothesis will be relaxed in the out-of-order model
Goal: maintain succinct representation of S online
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
Bucket Approximation
called buckets.
buckets approximated by a single value (PC), a line segment (PL), or other function segments.
Error of the approximation: Lp norm
Ep = ( ∑n
1/p
called buckets.
Ep = ( ∑n
1/p
called buckets.
Ep = ( ∑n
1/p
(α,β)-approximation
a way to find the optimal approximation within a bucket
Goal: find the partition of S into B buckets that minimizes the approximation error
Online B-bucket partition is suboptimal
We content ourselves with a partition that:
achieves no more than α times the optimal error with B buckets
requires no more than βB buckets (α,β > 1)
Example
If the optimal partition for 10 buckets achieves error=3, a (1.2, 1.5)-approximated partition achieves error 6 3.6 using B ′ 6 15 buckets.
Example
Example
Example
Example
Outline
Summary of Contribution
A Generic Algorithmic Framework
variation of well known greedy technique, but with provable memory-error bounds
achieves (2, 4) approximation for a broad class of error metrics.
query-ready (no post-processing needed)
easy to implement, fast in update (O(logB)) and O(B) storage for a broad class of error metrics.
adaptable to other stream models: I amnesic (error tolerance depends on time) I out of order ((ti, vi) might arrive after (tj, vj), even if tj > ti)
Outline
Related Work I
Optimal Offline Algorithm
dynamic programming, given in Jagadish et al. [1998], with time complexity O(n2B) and space O(nB)
improved in Guha [2008] to O(n2B) time and O(n) space.
Related Work II
Online Algorithms
the best worst-case bounds due to Guha and his colleagues Gilbert et al. [2002], Guha [2008, 2009], Guha et al. [2001, 2006]
for L2 metric, (1 + ε, 1) approximation in O(n+ B3ε−2 log2 n) time with O(Bε−2 logn log ε−1) space Guha [2008].
improved in Guha [2009] to remove logn from space bounds.
in Buragohain et al. [2007], lightweight algorithms for query-ready approximations, for L∞ error metric only
survey in Guha [2008] for existing techniques.
Outline
Well-Behaved Error Metric
e(S ′,b) > 0, S ′ ⊆ S, b > 1 (1)
2 Monotonicity: The error of approximating S1 ∪ S2 using one bucket is > the error in approximating S1 and S2 separately with one bucket each.
e(S1 ∪ S2, 1) > e(S1, 1) + e(S2, 1) (2)
S1 S2
S3
3 Additivity: The error of a B-bucket approximation is the sum of the errors of the individual buckets. (works also for max)
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
b1 b2 b3
uba bbbmb3
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
{bm}
Budget B = 1
b1 b2 b3
uba bbbmb3
Min-Merge Property
{bm}
Budget B = 1
b1 b2 b3
ba bbbmb3
Min-Merge Property
{bm}
Budget B = 1
b1 b2 b3
ba bbbmb3
Min-Merge Property
{bm}
Budget B = 1
Min-Merge Property
{bm}
Budget B = 1
Min-Merge Property
{bm}
Budget B = 1
Min-Merge Property
{bm}
Budget B = 1
Min-Merge Property
Analysis of GM (memory-error tradeoff)
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)∑
O e(O) 6 E∗
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
O e(O) 6 E∗
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
O e(O) 6 E∗
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
O e(O) 6 E∗
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)
∑ O e(O) 6 E∗
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
O e(O) 6 E∗
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
O e(O) 6 E∗
Analysis of GM (runtime)
{ba,bb} = MINERR-ADJPAIR(B) bm = MERGE(ba,bb)
runtime depends on Minerr-Adjpair and Merge
Requires computing the error for the union of two buckets without explicit knowledge of the values in them
for L2 error each bucket I stores the least squares line fitting the set of values (ti, vi) in the
bucket I slope and intercept is maintained in O(1) space, Minerr-Adjpair
and Merge are O(1) time.
Theorem
We can compute a streaming ( √
2, 4) approximation of its B-bucket PC or PL histogram under the L2 norm using O(B) space and O(logB) update time for each value.
Theorem
Theorem
Theorem
Amnesic model
recent data are approximated more faithfully than older data.
the approximation error is allowed to grow with the age of the data.
proposed in Palpanas et al. [2004], very good results in practice, no worst-case guarantees
ε1
ε2
ε3
Theorem
The Amnesic-Merge algorithm achieves a (1 + ε, 3) approximation, uses O(ε−1/2B∗ log(1/ε)) space and O(logB∗ + ε−1/2 log(1/ε)) update time for each value, where B∗ is the number of buckets in the optimal solution.
Amnesic model
ε1
ε2
ε3
Theorem
Amnesic model
ε1
ε2
ε3
Theorem
Out-of-Order Stream Model
motivated by the asynchrony of distributed data processing applications: Cormode et al. [2008], Li et al. [2008, 2007]
must be tolerant of a small number of gaps in the arrival sequence.
GAPS
FRAGMENTS
many heuristics proposed in the literature: Babcock et al. [2002], Tucker et al. [2003], Busch and Tirthapura [2007], Cormode et al. [2008].
GAPS
FRAGMENTS
GAPS
FRAGMENTS
GAPS
FRAGMENTS
MinMerge for Out-of-Order Stream Model
GM can be modified to perform 0,1, or 2 merges to deal with gaps being filled.
Summary of Results
Error metric approximation update working memory
PL L∞ (1 + ε, 3) O(log B + ε−1/2 log(1/ε)) O(ε−1/2B log(1/ε)) PC L∞ (1, 3) O(log B) O(B)
PC,PL L2 ( √
MinMerge for Out-of-Order Stream Model
GM can be modified to perform 0,1, or 2 merges to deal with gaps being filled.
Summary of Results
Error metric approximation update working memory
PL L∞ (1 + ε, 3) O(log B + ε−1/2 log(1/ε)) O(ε−1/2B log(1/ε)) PC L∞ (1, 3) O(log B) O(B)
PC,PL L2 ( √
Outline
Simulations I
0 500 1000 1500 2000 2500 3000 3500 4000 -40
-30
-20
-10
0
10
20
30
40
50
60
0 500 1000 1500 2000 2500 3000 3500 4000 40
50
60
70
80
90
100
110
0 500 1000 1500 2000 2500 3000 3500 4000 24
26
28
30
32
34
36
38
40
42
0 500 1000 1500 2000 2500 3000 3500 4000 10
15
20
25
30
35
40
45
50
55
60
0 500 1000 1500 2000 2500 3000 3500 4000
STEAMGEN (Keogh et al. [2006]): The set has 9600 measurements of steam generator drum pressure
RANDOM (Keogh et al. [2006]): random walk data, 65536 values.
DJIA: Dow Jones Industrial Average index, 25737 values.
HUM: 69652 values, representing humidity measurements every 30 seconds.
TEMP: This set of 93861 temperature readings.
PRECIP: 1M annual gage-corrected precipitation recorded worldwide
Simulations II
source code available at www.cs.ucsb.edu/˜foschini/files/ooodelay.zip
System: GNU/Linux 64bit Intel Core Duo Processor (@ 2.00GHz) equipped with 2GB RAM.
Algorithms Implemented
Optimal Amnesic Approximation
Results I
Approximation quality
50 75 100 125 150 175 200 # of Buckets (B)
0
5
10
15
20
25
50 75 100 125 150 175 200 # of Buckets (B)
0
5
10
15
20
25
OPT
4B-GM
B-GM
PC (left), PL (right) approximation of 4K data, HUM dataset B-GM refers to the B-bucket approximation, 4B-GM refers to the 4B-bucket
approximation (as needed for the theorem).
Results II
0
1e-4
2e-4
1000
5000
9000
Per-item processing time vs. the size of the time stream, for B = 1000, 5000, 9000 for the PRECIP dataset.
Results III
Running Time: GM vs. SpaceApxWaveHist
100 150 200 250 300 350 400 450 500 # of buckets (B)
50
100
150
200
250
300
B-GM
SpaceApxWaveHist
100 150 200 250 300 350 400 450 500 # of buckets (B)
50
0
50
100
150
200
250
300
350
e (
s)
B-GM
SpaceApxWaveHist
The approximation error (left) of B-bucket GM is comparable to the B-bucket SpaceApxWaveHist, with the latter given twice the working space.
GM is significantly faster, due to the O(B3) term in SpaceApxWaveHist.
Results IV
1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 Scale
50
100
150
200
250
300
# o
B-AM
OPT
40 60 80 100 120 140 160 180 200 # of Buckets (B)
600
700
800
900
1000
1100
1200
1300
1400
1500
Bursty
Ordered
Perturbation
40 60 80 100 120 140 160 180 200 # of Buckets (B)
10
15
20
25
30
35
40
L ∞
Amnesic-Merge vs. optimal, DJIA dataset.
Fragment-Merge in the Bursty, Perturbation and the Ordered stream models for L2 (left) and L∞ (right) error metrics
Outline
Conclusion
presented a generic framework for online approximation of time-series data for several models: data streams, amnesic, and out-of-order
proved space-quality approximation bounds for a popular greedy merge scheme for commonly used error metrics, such as L2 or L∞ highly practical implementations, require very small memory footprints, and run extremely fast
Open problems and future work
we show that GM algorithm achieves constant factor guarantees. One can show a 1 + ε approximation to optimal B bucket L2 error in O(B/ε2) space is also possible.
make GM efficient for other metrics such as L1
Conclusion
presented a generic framework for online approximation of time-series data for several models: data streams, amnesic, and out-of-order
proved space-quality approximation bounds for a popular greedy merge scheme for commonly used error metrics, such as L2 or L∞ highly practical implementations, require very small memory footprints, and run extremely fast
Open problems and future work
we show that GM algorithm achieves constant factor guarantees. One can show a 1 + ε approximation to optimal B bucket L2 error in O(B/ε2) space is also possible.
make GM efficient for other metrics such as L1
References I
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
C. Buragohain, N. Shrivastava, and S. Suri. Space efficient streaming algorithms for the maximum error histogram. In ICDE, 2007.
C. Busch and S. Tirthapura. A deterministic algorithm for summarizing asynchronous streams over a sliding window. In STACS, 2007.
G. Cormode, F. Korn, and S. Tirthapura. Time-decaying aggregates in out-of-order streams. In PODS, 2008.
A. Gilbert, Y. Kotidis, S. Guha, S. Muthukrishnan, et al. Fast small-space algorithms for approximate histogram maintenance. In STOC, 2002.
S. Guha. Tight results for clustering and summarizing data streams. In Departmental Papers, Department of Computer and Information Science, Univeristy of Pennsylvania, 2009. URL http://repository.upenn.edu/cis_papers/394/.
S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In STOC, 2001.
Sudipto Guha. On the space—time of optimal, approximate and streaming algorithms for synopsis construction problems. The VLDB Journal, 17(6):1509–1535, 2008. ISSN 1066-8888. doi: http://dx.doi.org/10.1007/s00778-007-0083-9.
Sudipto Guha, Nick Koudas, and Kyuseok Shim. Approximation and streaming algorithms for histogram construction problems. ACM TODS, 31(1):396–438, 2006.
H. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, et al. Optimal histograms with quality guarantees. In VLDB, 1998.
E. Keogh, X. Xi, L. Wei, and C. Ratanamahatana. The ucr time series classification/clustering homepage. 2006. URL www.cs.ucr.edu/~eamonn/time_series_data/.
J. Li, K. Tufte, V. Shkapenyuk, V. Papadimos, T. Johnson, and D. Maier. Out-of-order processing: a new architecture for high-performance stream systems. Proc. VLDB Endow., 1(1), 2008.
M. Li, M. Liu, L. Ding, E. Rundensteiner, and M. Mani. Event stream processing with out-of-order data arrival. In ICDCS Workshop, 2007.
Themistoklis Palpanas, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos, and Wagner Truppel. Online amnesic approximation of streaming time series. In ICDE, 2004.
P. Tucker, D. Maier, T. Sheard, and L. Fegaras. Exploiting punctuation semantics in continuous data streams. IEEE Trans. on Knowl. and Data Eng., 15(3), 2003.

space-efficient online approximation of time series data: streams

Documents