space-efficient online approximation of time series data: streams

65
Space-efficient Online Approximation of Time Series Data: Streams, Amnesia, and Out-of-order Luca Foschini joint work with Sorabh Gandhi and Subhash Suri University of California Santa Barbara ICDE 2010 Luca Foschini (UCSB) Time Series Approximation ICDE 2010 1 / 21

Upload: others

Post on 03-Feb-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Space-efficient Online Approximation of Time Series Data: Streams, Amnesia, and Out-of-orderStreams, Amnesia, and Out-of-order
ICDE 2010
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 1 / 21
Outline
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 2 / 21
Time Series & Approximation
stream model: items arrive in order of timestamp ti
I hypothesis will be relaxed in the out-of-order model
Goal: maintain succinct representation of S online
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 2 / 21
Time Series & Approximation
stream model: items arrive in order of timestamp ti
I hypothesis will be relaxed in the out-of-order model
Goal: maintain succinct representation of S online
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 2 / 21
Time Series & Approximation
stream model: items arrive in order of timestamp ti
I hypothesis will be relaxed in the out-of-order model
Goal: maintain succinct representation of S online
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 2 / 21
Time Series & Approximation
stream model: items arrive in order of timestamp ti
I hypothesis will be relaxed in the out-of-order model
Goal: maintain succinct representation of S online
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 2 / 21
Time Series & Approximation
stream model: items arrive in order of timestamp ti
I hypothesis will be relaxed in the out-of-order model
Goal: maintain succinct representation of S online
v1
t1
v2
t2
v3
t3
v4
t4
v5
t5
v6
t6
v7
t7
v8
t8
v9
t9
v10
t10
v10
t10
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 2 / 21
Bucket Approximation
called buckets.
buckets approximated by a single value (PC), a line segment (PL), or other function segments.
Error of the approximation: Lp norm
Ep = ( ∑n
1/p
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 3 / 21
Bucket Approximation
called buckets.
buckets approximated by a single value (PC), a line segment (PL), or other function segments.
Error of the approximation: Lp norm
Ep = ( ∑n
1/p
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 3 / 21
Bucket Approximation
called buckets.
buckets approximated by a single value (PC), a line segment (PL), or other function segments.
Error of the approximation: Lp norm
Ep = ( ∑n
1/p
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 3 / 21
(α,β)-approximation
a way to find the optimal approximation within a bucket
Goal: find the partition of S into B buckets that minimizes the approximation error
Online B-bucket partition is suboptimal
We content ourselves with a partition that:
achieves no more than α times the optimal error with B buckets
requires no more than βB buckets (α,β > 1)
Example
If the optimal partition for 10 buckets achieves error=3, a (1.2, 1.5)-approximated partition achieves error 6 3.6 using B ′ 6 15 buckets.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 4 / 21
(α,β)-approximation
a way to find the optimal approximation within a bucket
Goal: find the partition of S into B buckets that minimizes the approximation error
Online B-bucket partition is suboptimal
We content ourselves with a partition that:
achieves no more than α times the optimal error with B buckets
requires no more than βB buckets (α,β > 1)
Example
If the optimal partition for 10 buckets achieves error=3, a (1.2, 1.5)-approximated partition achieves error 6 3.6 using B ′ 6 15 buckets.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 4 / 21
(α,β)-approximation
a way to find the optimal approximation within a bucket
Goal: find the partition of S into B buckets that minimizes the approximation error
Online B-bucket partition is suboptimal
We content ourselves with a partition that:
achieves no more than α times the optimal error with B buckets
requires no more than βB buckets (α,β > 1)
Example
If the optimal partition for 10 buckets achieves error=3, a (1.2, 1.5)-approximated partition achieves error 6 3.6 using B ′ 6 15 buckets.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 4 / 21
(α,β)-approximation
a way to find the optimal approximation within a bucket
Goal: find the partition of S into B buckets that minimizes the approximation error
Online B-bucket partition is suboptimal
We content ourselves with a partition that:
achieves no more than α times the optimal error with B buckets
requires no more than βB buckets (α,β > 1)
Example
If the optimal partition for 10 buckets achieves error=3, a (1.2, 1.5)-approximated partition achieves error 6 3.6 using B ′ 6 15 buckets.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 4 / 21
(α,β)-approximation
a way to find the optimal approximation within a bucket
Goal: find the partition of S into B buckets that minimizes the approximation error
Online B-bucket partition is suboptimal
We content ourselves with a partition that:
achieves no more than α times the optimal error with B buckets
requires no more than βB buckets (α,β > 1)
Example
If the optimal partition for 10 buckets achieves error=3, a (1.2, 1.5)-approximated partition achieves error 6 3.6 using B ′ 6 15 buckets.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 4 / 21
Outline
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 5 / 21
Summary of Contribution
A Generic Algorithmic Framework
variation of well known greedy technique, but with provable memory-error bounds
achieves (2, 4) approximation for a broad class of error metrics.
query-ready (no post-processing needed)
easy to implement, fast in update (O(logB)) and O(B) storage for a broad class of error metrics.
adaptable to other stream models: I amnesic (error tolerance depends on time) I out of order ((ti, vi) might arrive after (tj, vj), even if tj > ti)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 5 / 21
Summary of Contribution
A Generic Algorithmic Framework
variation of well known greedy technique, but with provable memory-error bounds
achieves (2, 4) approximation for a broad class of error metrics.
query-ready (no post-processing needed)
easy to implement, fast in update (O(logB)) and O(B) storage for a broad class of error metrics.
adaptable to other stream models: I amnesic (error tolerance depends on time) I out of order ((ti, vi) might arrive after (tj, vj), even if tj > ti)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 5 / 21
Summary of Contribution
A Generic Algorithmic Framework
variation of well known greedy technique, but with provable memory-error bounds
achieves (2, 4) approximation for a broad class of error metrics.
query-ready (no post-processing needed)
easy to implement, fast in update (O(logB)) and O(B) storage for a broad class of error metrics.
adaptable to other stream models: I amnesic (error tolerance depends on time) I out of order ((ti, vi) might arrive after (tj, vj), even if tj > ti)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 5 / 21
Summary of Contribution
A Generic Algorithmic Framework
variation of well known greedy technique, but with provable memory-error bounds
achieves (2, 4) approximation for a broad class of error metrics.
query-ready (no post-processing needed)
easy to implement, fast in update (O(logB)) and O(B) storage for a broad class of error metrics.
adaptable to other stream models: I amnesic (error tolerance depends on time) I out of order ((ti, vi) might arrive after (tj, vj), even if tj > ti)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 5 / 21
Summary of Contribution
A Generic Algorithmic Framework
variation of well known greedy technique, but with provable memory-error bounds
achieves (2, 4) approximation for a broad class of error metrics.
query-ready (no post-processing needed)
easy to implement, fast in update (O(logB)) and O(B) storage for a broad class of error metrics.
adaptable to other stream models: I amnesic (error tolerance depends on time) I out of order ((ti, vi) might arrive after (tj, vj), even if tj > ti)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 5 / 21
Outline
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 6 / 21
Related Work I
Optimal Offline Algorithm
dynamic programming, given in Jagadish et al. [1998], with time complexity O(n2B) and space O(nB)
improved in Guha [2008] to O(n2B) time and O(n) space.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 6 / 21
Related Work II
Online Algorithms
the best worst-case bounds due to Guha and his colleagues Gilbert et al. [2002], Guha [2008, 2009], Guha et al. [2001, 2006]
for L2 metric, (1 + ε, 1) approximation in O(n+ B3ε−2 log2 n) time with O(Bε−2 logn log ε−1) space Guha [2008].
improved in Guha [2009] to remove logn from space bounds.
in Buragohain et al. [2007], lightweight algorithms for query-ready approximations, for L∞ error metric only
survey in Guha [2008] for existing techniques.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 7 / 21
Outline
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 8 / 21
Well-Behaved Error Metric
e(S ′,b) > 0, S ′ ⊆ S, b > 1 (1)
2 Monotonicity: The error of approximating S1 ∪ S2 using one bucket is > the error in approximating S1 and S2 separately with one bucket each.
e(S1 ∪ S2, 1) > e(S1, 1) + e(S2, 1) (2)
S1 S2
S3
3 Additivity: The error of a B-bucket approximation is the sum of the errors of the individual buckets. (works also for max)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 8 / 21
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
b1 b2 b3
uba bbbmb3
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 9 / 21
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
b1 b2 b3
uba bbbmb3
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 9 / 21
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
b1 b2 b3
ba bbbmb3
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 9 / 21
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
b1 b2 b3
ba bbbmb3
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 9 / 21
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 9 / 21
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 9 / 21
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 9 / 21
Generic-MinMerge (GM) Algorithm
Generic-MinMerge 1: for next vi ∈ S do 2: Allocate new bucket u 3: if |B| > 4B then 4: {ba,bb} = MINERR-ADJPAIR(B) 5: bm = MERGE(ba,bb) 6: B = (B \ {ba,bb})
{bm}
Budget B = 1
Min-Merge Property
B obeys the Min-Merge property if for adjacent buckets b1,b2, e(bm) = MERGE(b1,b2) > maxie(bi)
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 9 / 21
Analysis of GM (memory-error tradeoff)
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)∑
O e(O) 6 E∗
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 10 / 21
Analysis of GM (memory-error tradeoff)
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)∑
O e(O) 6 E∗
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 10 / 21
Analysis of GM (memory-error tradeoff)
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)∑
O e(O) 6 E∗
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 10 / 21
Analysis of GM (memory-error tradeoff)
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)∑
O e(O) 6 E∗
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 10 / 21
Analysis of GM (memory-error tradeoff)
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)
∑ O e(O) 6 E∗
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 10 / 21
Analysis of GM (memory-error tradeoff)
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)∑
O e(O) 6 E∗
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 10 / 21
Analysis of GM (memory-error tradeoff)
Theorem
Proof sketch
X X X X X X X X X X
∑ x e(X) 6 E∗
charge each O to adjacent Xs (3B Xs are sufficient)∑
O e(O) 6 E∗
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 10 / 21
Analysis of GM (runtime)
{ba,bb} = MINERR-ADJPAIR(B) bm = MERGE(ba,bb)
runtime depends on Minerr-Adjpair and Merge
Requires computing the error for the union of two buckets without explicit knowledge of the values in them
for L2 error each bucket I stores the least squares line fitting the set of values (ti, vi) in the
bucket I slope and intercept is maintained in O(1) space, Minerr-Adjpair
and Merge are O(1) time.
Theorem
We can compute a streaming ( √
2, 4) approximation of its B-bucket PC or PL histogram under the L2 norm using O(B) space and O(logB) update time for each value.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 11 / 21
Analysis of GM (runtime)
{ba,bb} = MINERR-ADJPAIR(B) bm = MERGE(ba,bb)
runtime depends on Minerr-Adjpair and Merge
Requires computing the error for the union of two buckets without explicit knowledge of the values in them
for L2 error each bucket I stores the least squares line fitting the set of values (ti, vi) in the
bucket I slope and intercept is maintained in O(1) space, Minerr-Adjpair
and Merge are O(1) time.
Theorem
We can compute a streaming ( √
2, 4) approximation of its B-bucket PC or PL histogram under the L2 norm using O(B) space and O(logB) update time for each value.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 11 / 21
Analysis of GM (runtime)
{ba,bb} = MINERR-ADJPAIR(B) bm = MERGE(ba,bb)
runtime depends on Minerr-Adjpair and Merge
Requires computing the error for the union of two buckets without explicit knowledge of the values in them
for L2 error each bucket I stores the least squares line fitting the set of values (ti, vi) in the
bucket I slope and intercept is maintained in O(1) space, Minerr-Adjpair
and Merge are O(1) time.
Theorem
We can compute a streaming ( √
2, 4) approximation of its B-bucket PC or PL histogram under the L2 norm using O(B) space and O(logB) update time for each value.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 11 / 21
Analysis of GM (runtime)
{ba,bb} = MINERR-ADJPAIR(B) bm = MERGE(ba,bb)
runtime depends on Minerr-Adjpair and Merge
Requires computing the error for the union of two buckets without explicit knowledge of the values in them
for L2 error each bucket I stores the least squares line fitting the set of values (ti, vi) in the
bucket I slope and intercept is maintained in O(1) space, Minerr-Adjpair
and Merge are O(1) time.
Theorem
We can compute a streaming ( √
2, 4) approximation of its B-bucket PC or PL histogram under the L2 norm using O(B) space and O(logB) update time for each value.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 11 / 21
Amnesic model
recent data are approximated more faithfully than older data.
the approximation error is allowed to grow with the age of the data.
proposed in Palpanas et al. [2004], very good results in practice, no worst-case guarantees
ε1
ε2
ε3
Theorem
The Amnesic-Merge algorithm achieves a (1 + ε, 3) approximation, uses O(ε−1/2B∗ log(1/ε)) space and O(logB∗ + ε−1/2 log(1/ε)) update time for each value, where B∗ is the number of buckets in the optimal solution.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 12 / 21
Amnesic model
recent data are approximated more faithfully than older data.
the approximation error is allowed to grow with the age of the data.
proposed in Palpanas et al. [2004], very good results in practice, no worst-case guarantees
ε1
ε2
ε3
Theorem
The Amnesic-Merge algorithm achieves a (1 + ε, 3) approximation, uses O(ε−1/2B∗ log(1/ε)) space and O(logB∗ + ε−1/2 log(1/ε)) update time for each value, where B∗ is the number of buckets in the optimal solution.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 12 / 21
Amnesic model
recent data are approximated more faithfully than older data.
the approximation error is allowed to grow with the age of the data.
proposed in Palpanas et al. [2004], very good results in practice, no worst-case guarantees
ε1
ε2
ε3
Theorem
The Amnesic-Merge algorithm achieves a (1 + ε, 3) approximation, uses O(ε−1/2B∗ log(1/ε)) space and O(logB∗ + ε−1/2 log(1/ε)) update time for each value, where B∗ is the number of buckets in the optimal solution.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 12 / 21
Out-of-Order Stream Model
motivated by the asynchrony of distributed data processing applications: Cormode et al. [2008], Li et al. [2008, 2007]
must be tolerant of a small number of gaps in the arrival sequence.
GAPS
FRAGMENTS
many heuristics proposed in the literature: Babcock et al. [2002], Tucker et al. [2003], Busch and Tirthapura [2007], Cormode et al. [2008].
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 13 / 21
Out-of-Order Stream Model
motivated by the asynchrony of distributed data processing applications: Cormode et al. [2008], Li et al. [2008, 2007]
must be tolerant of a small number of gaps in the arrival sequence.
GAPS
FRAGMENTS
many heuristics proposed in the literature: Babcock et al. [2002], Tucker et al. [2003], Busch and Tirthapura [2007], Cormode et al. [2008].
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 13 / 21
Out-of-Order Stream Model
motivated by the asynchrony of distributed data processing applications: Cormode et al. [2008], Li et al. [2008, 2007]
must be tolerant of a small number of gaps in the arrival sequence.
GAPS
FRAGMENTS
many heuristics proposed in the literature: Babcock et al. [2002], Tucker et al. [2003], Busch and Tirthapura [2007], Cormode et al. [2008].
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 13 / 21
Out-of-Order Stream Model
motivated by the asynchrony of distributed data processing applications: Cormode et al. [2008], Li et al. [2008, 2007]
must be tolerant of a small number of gaps in the arrival sequence.
GAPS
FRAGMENTS
many heuristics proposed in the literature: Babcock et al. [2002], Tucker et al. [2003], Busch and Tirthapura [2007], Cormode et al. [2008].
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 13 / 21
MinMerge for Out-of-Order Stream Model
GM can be modified to perform 0,1, or 2 merges to deal with gaps being filled.
Summary of Results
Error metric approximation update working memory
PL L∞ (1 + ε, 3) O(log B + ε−1/2 log(1/ε)) O(ε−1/2B log(1/ε)) PC L∞ (1, 3) O(log B) O(B)
PC,PL L2 ( √
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 14 / 21
MinMerge for Out-of-Order Stream Model
GM can be modified to perform 0,1, or 2 merges to deal with gaps being filled.
Summary of Results
Error metric approximation update working memory
PL L∞ (1 + ε, 3) O(log B + ε−1/2 log(1/ε)) O(ε−1/2B log(1/ε)) PC L∞ (1, 3) O(log B) O(B)
PC,PL L2 ( √
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 14 / 21
Outline
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 15 / 21
Simulations I
0 500 1000 1500 2000 2500 3000 3500 4000 -40
-30
-20
-10
0
10
20
30
40
50
60
0 500 1000 1500 2000 2500 3000 3500 4000 40
50
60
70
80
90
100
110
0 500 1000 1500 2000 2500 3000 3500 4000 24
26
28
30
32
34
36
38
40
42
0 500 1000 1500 2000 2500 3000 3500 4000 10
15
20
25
30
35
40
45
50
55
60
0 500 1000 1500 2000 2500 3000 3500 4000
STEAMGEN (Keogh et al. [2006]): The set has 9600 measurements of steam generator drum pressure
RANDOM (Keogh et al. [2006]): random walk data, 65536 values.
DJIA: Dow Jones Industrial Average index, 25737 values.
HUM: 69652 values, representing humidity measurements every 30 seconds.
TEMP: This set of 93861 temperature readings.
PRECIP: 1M annual gage-corrected precipitation recorded worldwide
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 15 / 21
Simulations II
source code available at www.cs.ucsb.edu/˜foschini/files/ooodelay.zip
System: GNU/Linux 64bit Intel Core Duo Processor (@ 2.00GHz) equipped with 2GB RAM.
Algorithms Implemented
Optimal Amnesic Approximation
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 16 / 21
Results I
Approximation quality
50 75 100 125 150 175 200 # of Buckets (B)
0
5
10
15
20
25
50 75 100 125 150 175 200 # of Buckets (B)
0
5
10
15
20
25
OPT
4B-GM
B-GM
PC (left), PL (right) approximation of 4K data, HUM dataset B-GM refers to the B-bucket approximation, 4B-GM refers to the 4B-bucket
approximation (as needed for the theorem).
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 17 / 21
Results II
0
1e-4
2e-4
1000
5000
9000
Per-item processing time vs. the size of the time stream, for B = 1000, 5000, 9000 for the PRECIP dataset.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 18 / 21
Results III
Running Time: GM vs. SpaceApxWaveHist
100 150 200 250 300 350 400 450 500 # of buckets (B)
50
100
150
200
250
300
B-GM
SpaceApxWaveHist
100 150 200 250 300 350 400 450 500 # of buckets (B)
50
0
50
100
150
200
250
300
350
e (
s)
B-GM
SpaceApxWaveHist
The approximation error (left) of B-bucket GM is comparable to the B-bucket SpaceApxWaveHist, with the latter given twice the working space.
GM is significantly faster, due to the O(B3) term in SpaceApxWaveHist.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 19 / 21
Results IV
1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 Scale
50
100
150
200
250
300
# o
B-AM
OPT
40 60 80 100 120 140 160 180 200 # of Buckets (B)
600
700
800
900
1000
1100
1200
1300
1400
1500
Bursty
Ordered
Perturbation
40 60 80 100 120 140 160 180 200 # of Buckets (B)
10
15
20
25
30
35
40
L ∞
Amnesic-Merge vs. optimal, DJIA dataset.
Fragment-Merge in the Bursty, Perturbation and the Ordered stream models for L2 (left) and L∞ (right) error metrics
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 20 / 21
Outline
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 21 / 21
Conclusion
presented a generic framework for online approximation of time-series data for several models: data streams, amnesic, and out-of-order
proved space-quality approximation bounds for a popular greedy merge scheme for commonly used error metrics, such as L2 or L∞ highly practical implementations, require very small memory footprints, and run extremely fast
Open problems and future work
we show that GM algorithm achieves constant factor guarantees. One can show a 1 + ε approximation to optimal B bucket L2 error in O(B/ε2) space is also possible.
make GM efficient for other metrics such as L1
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 21 / 21
Conclusion
presented a generic framework for online approximation of time-series data for several models: data streams, amnesic, and out-of-order
proved space-quality approximation bounds for a popular greedy merge scheme for commonly used error metrics, such as L2 or L∞ highly practical implementations, require very small memory footprints, and run extremely fast
Open problems and future work
we show that GM algorithm achieves constant factor guarantees. One can show a 1 + ε approximation to optimal B bucket L2 error in O(B/ε2) space is also possible.
make GM efficient for other metrics such as L1
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 21 / 21
References I
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
C. Buragohain, N. Shrivastava, and S. Suri. Space efficient streaming algorithms for the maximum error histogram. In ICDE, 2007.
C. Busch and S. Tirthapura. A deterministic algorithm for summarizing asynchronous streams over a sliding window. In STACS, 2007.
G. Cormode, F. Korn, and S. Tirthapura. Time-decaying aggregates in out-of-order streams. In PODS, 2008.
A. Gilbert, Y. Kotidis, S. Guha, S. Muthukrishnan, et al. Fast small-space algorithms for approximate histogram maintenance. In STOC, 2002.
S. Guha. Tight results for clustering and summarizing data streams. In Departmental Papers, Department of Computer and Information Science, Univeristy of Pennsylvania, 2009. URL http://repository.upenn.edu/cis_papers/394/.
S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In STOC, 2001.
Sudipto Guha. On the space—time of optimal, approximate and streaming algorithms for synopsis construction problems. The VLDB Journal, 17(6):1509–1535, 2008. ISSN 1066-8888. doi: http://dx.doi.org/10.1007/s00778-007-0083-9.
Sudipto Guha, Nick Koudas, and Kyuseok Shim. Approximation and streaming algorithms for histogram construction problems. ACM TODS, 31(1):396–438, 2006.
H. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, et al. Optimal histograms with quality guarantees. In VLDB, 1998.
E. Keogh, X. Xi, L. Wei, and C. Ratanamahatana. The ucr time series classification/clustering homepage. 2006. URL www.cs.ucr.edu/~eamonn/time_series_data/.
J. Li, K. Tufte, V. Shkapenyuk, V. Papadimos, T. Johnson, and D. Maier. Out-of-order processing: a new architecture for high-performance stream systems. Proc. VLDB Endow., 1(1), 2008.
M. Li, M. Liu, L. Ding, E. Rundensteiner, and M. Mani. Event stream processing with out-of-order data arrival. In ICDCS Workshop, 2007.
Themistoklis Palpanas, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos, and Wagner Truppel. Online amnesic approximation of streaming time series. In ICDE, 2004.
P. Tucker, D. Maier, T. Sheard, and L. Fegaras. Exploiting punctuation semantics in continuous data streams. IEEE Trans. on Knowl. and Data Eng., 15(3), 2003.
Luca Foschini (UCSB) Time Series Approximation ICDE 2010 22 / 21