
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Full text available at: http://dx.doi.org/10.1561/1900000004



Graham Cormode
AT&T Labs — Research, USA
[email protected]

Minos Garofalakis
Technical University of Crete, Greece
[email protected]

Peter J. Haas
IBM Almaden Research Center, USA
[email protected]

Chris Jermaine
Rice University, USA
[email protected]

Boston – Delft


Foundations and Trends® in Databases

Published, sold and distributed by:
now Publishers Inc.
PO Box 1024
Hanover, MA 02339
USA
Tel. [email protected]

Outside North America:
now Publishers Inc.
PO Box 179
2600 AD Delft
The Netherlands
Tel. +31-6-51115274

The preferred citation for this publication is G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, Foundations and Trends® in Databases, vol. 4, nos. 1–3, pp. 1–294, 2011.

ISBN: 978-1-60198-516-3
© 2012 G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers.

Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The 'services' for users can be found on the internet at: www.copyright.com

For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; [email protected]

now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail: [email protected]


Foundations and Trends® in Databases

Volume 4, Issues 1–3, 2011

Editorial Board

Editor-in-Chief:
Joseph M. Hellerstein
Computer Science Division
University of California, Berkeley
[email protected]

Editors
Anastasia Ailamaki (EPFL)
Michael Carey (UC Irvine)
Surajit Chaudhuri (Microsoft Research)
Ronald Fagin (IBM Research)
Minos Garofalakis (Yahoo! Research)
Johannes Gehrke (Cornell University)
Alon Halevy (Google)
Jeffrey Naughton (University of Wisconsin)
Christopher Olston (Yahoo! Research)
Jignesh Patel (University of Michigan)
Raghu Ramakrishnan (Yahoo! Research)
Gerhard Weikum (Max-Planck Institute)


Editorial Scope

Foundations and Trends® in Databases covers a breadth of topics relating to the management of large volumes of data. The journal targets the full scope of issues in data management, from theoretical foundations, to languages and modeling, to algorithms, system architecture, and applications. The list of topics below illustrates some of the intended coverage, though it is by no means exhaustive:

• Data Models and Query Languages

• Query Processing and Optimization

• Storage, Access Methods, and Indexing

• Transaction Management, Concurrency Control and Recovery

• Deductive Databases

• Parallel and Distributed Database Systems

• Database Design and Tuning

• Metadata Management

• Object Management

• Trigger Processing and Active Databases

• Data Mining and OLAP

• Approximate and Interactive Query Processing

• Data Warehousing

• Adaptive Query Processing

• Data Stream Management

• Search and Query Integration

• XML and Semi-Structured Data

• Web Services and Middleware

• Data Integration and Exchange

• Private and Secure Data Management

• Peer-to-Peer, Sensornet and Mobile Data Management

• Scientific and Spatial Data Management

• Data Brokering and Publish/Subscribe

• Data Cleaning and Information Extraction

• Probabilistic Data Management

Information for Librarians
Foundations and Trends® in Databases, 2011, Volume 4, 4 issues. ISSN paper version 1931-7883. ISSN online version 1931-7891. Also available as a combined paper and online subscription.


Foundations and Trends® in Databases

Vol. 4, Nos. 1–3 (2011) 1–294
© 2012 G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine

DOI: 10.1561/1900000004

Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Graham Cormode (1), Minos Garofalakis (2), Peter J. Haas (3), and Chris Jermaine (4)

(1) AT&T Labs — Research, 180 Park Avenue, Florham Park, NJ 07932, USA, [email protected]

(2) Technical University of Crete, University Campus — Kounoupidiana, Chania, 73100, Greece, [email protected]

(3) IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, USA, [email protected]

(4) Rice University, 6100 Main Street, Houston, TX 77005, USA, [email protected]

Abstract

Methods for Approximate Query Processing (AQP) are essential for dealing with massive data. They are often the only means of providing interactive response times when exploring massive datasets, and are also needed to handle high-speed data streams. These methods proceed by computing a lossy, compact synopsis of the data, and then executing the query of interest against the synopsis rather than the entire dataset. We describe basic principles and recent developments in AQP. We focus on four key synopses: random samples, histograms, wavelets, and sketches. We consider issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. We also discuss the trade-offs between the different synopsis types.


Contents

1 Introduction
  1.1 The Need for Synopses
  1.2 Survey Overview
  1.3 Outline

2 Sampling
  2.1 Introduction
  2.2 Some Simple Examples
  2.3 Advantages and Drawbacks of Sampling
  2.4 Mathematical Essentials of Sampling
  2.5 Different Flavors of Sampling
  2.6 Sampling and Database Queries
  2.7 Obtaining Samples from Databases
  2.8 Online Aggregation Via Sampling

3 Histograms
  3.1 Introduction
  3.2 One-Dimensional Histograms: Overview
  3.3 Estimation Schemes
  3.4 Bucketing Schemes
  3.5 Multi-Dimensional Histograms
  3.6 Approximate Processing of General Queries
  3.7 Additional Topics

4 Wavelets
  4.1 Introduction
  4.2 One-Dimensional Wavelets and Wavelet Synopses: Overview
  4.3 Wavelet Synopses for Non-L2 Error Metrics
  4.4 Multi-Dimensional Wavelets
  4.5 Approximate Processing of General Queries
  4.6 Additional Topics

5 Sketches
  5.1 Introduction
  5.2 Notation and Terminology
  5.3 Frequency Based Sketches
  5.4 Sketches for Distinct Value Queries
  5.5 Other Topics in Sketching

6 Conclusions and Future Research Directions
  6.1 Comparison Across Different Methods
  6.2 Approximate Query Processing in Systems
  6.3 Challenges and Future Directions for Synopses

Acknowledgments

References


1 Introduction

A synopsis of a massive dataset captures vital properties of the original data while typically occupying much less space. For example, suppose that our data consists of a large numeric time series. A simple summary allows us to compute the statistical variance of this series: we maintain the sum of all the values, the sum of the squares of the values, and the number of observations. Then the average is given by the ratio of the sum to the count, and the variance is the ratio of the sum of squares to the count, less the square of the average. An important property of this synopsis is that we can build it efficiently. Indeed, we can find the three summary values in a single pass through the data.
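The one-pass computation just described can be written out directly. The following minimal Python sketch (the class and variable names are ours, purely illustrative) maintains the three summary values and recovers the mean and population variance:

```python
class VarianceSynopsis:
    """Three-value synopsis: count, sum, and sum of squares."""

    def __init__(self):
        self.n = 0      # number of observations
        self.s = 0.0    # running sum of values
        self.q = 0.0    # running sum of squared values

    def add(self, x):
        # One pass over the data: each value updates all three counters.
        self.n += 1
        self.s += x
        self.q += x * x

    def mean(self):
        return self.s / self.n

    def variance(self):
        # Mean of the squares, less the square of the mean.
        m = self.mean()
        return self.q / self.n - m * m
```

For the series 1, 2, 3, 4 this yields a mean of 2.5 and a variance of 1.25, while storing only three numbers regardless of the length of the series.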

However, we may need to know more about the data than merely its variance: how many different values have been seen? How many times has the series exceeded a given threshold? What was the behavior in a given time period? To answer such queries, our three-value summary does not suffice, and synopses appropriate to each type of query are needed. In general, these synopses will not be as simple or easy to compute as the synopsis for variance. Indeed, for many of these questions, there is no synopsis that can provide the exact answer, as there is for variance. The reason is that for some classes of queries, the query


answers collectively describe the data in full, and so any synopsis would effectively have to store the entire dataset.

To overcome this problem, we must relax our requirements. In many cases, the key objective is not obtaining the exact answer to a query, but rather receiving an accurate estimate of the answer. For example, in many settings, receiving an answer that is within 0.1% of the true result is adequate for our needs; it might suffice to know that the true answer is roughly $5 million without knowing that the exact answer is $5,001,482.76. Thus we can tolerate approximation, and there are many synopses that provide approximate answers. This small relaxation can make a big difference. Although for some queries it is impossible to provide a small synopsis that provides exact answers, there are many synopses that provide a very accurate approximation for these queries while using very little space.

1.1 The Need for Synopses

The use of synopses is essential for managing the massive data that arises in modern information management scenarios. When handling large datasets, from gigabytes to petabytes in size, it is often impractical to operate on them in full. Instead, it is much more convenient to build a synopsis, and then use this synopsis to analyze the data. This approach captures a variety of use-cases:

• A search engine collects logs of every search made, amounting to billions of queries every day. It would be too slow, and energy-intensive, to look for trends and patterns on the full data. Instead, it is preferable to use a synopsis that is guaranteed to preserve most of the as-yet undiscovered patterns in the data.

• A team of analysts for a retail chain would like to study the impact of different promotions and pricing strategies on sales of different items. It is not cost-effective to give each analyst the resources needed to study the national sales data in full, but by working with synopses of the data, each analyst can perform their explorations on their own laptops.


• A large cellphone provider wants to track the health of its network by studying statistics of calls made in different regions, on hardware from different manufacturers, under different levels of contention, and so on. The volume of information is too large to retain in a database; instead, the provider can build a synopsis of the data as it is observed live, and then use the synopsis off-line for further analysis.

These examples expose a variety of settings. The full data may reside in a traditional data warehouse, where it is indexed and accessible, but is too costly to work on in full. In other cases, the data is stored as flat files in a distributed file system; or it may never be stored in full, but be accessible only as it is observed in a streaming fashion. Sometimes synopsis construction is a one-time process, and sometimes we need to update the synopsis as the base data is modified or as accuracy requirements change. In all cases though, being able to construct a high-quality synopsis enables much faster and more scalable data analysis.

From the 1990s through today, there has been an increasing demand for systems to query more and more data at ever faster speeds. Enterprise data requirements have been estimated [173] to grow at 60% per year through at least 2011, reaching 1,800 exabytes. On the other hand, users — weaned on Internet browsers, sophisticated analytics and simulation software with advanced GUIs, and computer games — have come to expect real-time or near-real-time answers to their queries. Indeed, it has been increasingly realized that extracting knowledge from data is usually an interactive process, with a user issuing a query, seeing the result, and using the result to formulate the next query, in an iterative fashion. Of course, parallel processing techniques can also help address these problems, but may not suffice on their own. Many queries, for example, are not embarrassingly parallel. Moreover, methods based purely on parallelism can be expensive. Indeed, under evolving models for cloud computing, specifically "platform as a service" fee models, users will pay costs that directly reflect the computing resources that they use. In this setting, use of Approximate Query Processing (AQP) techniques can lead to significant cost savings. Similarly, recent work [15] has pointed out that approximate processing


techniques can lead to energy savings and greener computing. Thus AQP techniques are essential for providing, in a cost-effective manner, interactive response times for exploratory queries over massive data.

Exacerbating the pressures on data management systems is the increasing need to query streaming data, such as real-time financial data or sensor feeds. Here the flood of high-speed data can easily overwhelm the often limited CPU and memory capacities of a stream processor unless AQP methods are used. Moreover, for purposes of network monitoring and many other applications, approximate answers suffice when trying to detect general patterns in the data, such as a denial-of-service attack. AQP techniques are thus well suited to streaming and network applications.

1.2 Survey Overview

In this survey, we describe basic principles and recent developments in building approximate synopses (i.e., lossy, compressed representations) of massive data. Such synopses enable AQP, in which the user's query is executed against the synopsis instead of the original data. We focus on the four main families of synopses: random samples, histograms, wavelets, and sketches.

A random sample comprises a "representative" subset of the data values of interest, obtained via a stochastic mechanism. Samples can be quick to obtain, and can be used to approximately answer a wide range of queries.

A histogram summarizes a dataset by grouping the data values into subsets, or "buckets," and then, for each bucket, computing a small set of summary statistics that can be used to approximately reconstruct the data in the bucket. Histograms have been extensively studied and have been incorporated into the query optimizers of virtually all commercial relational DBMSs.

Wavelet-based synopses were originally developed in the context of image and signal processing. The dataset is viewed as a set of M elements in a vector — that is, as a function defined on the set {0, 1, 2, ..., M − 1} — and the wavelet transform of this function is found as a weighted sum of wavelet "basis functions." The weights,


or coefficients, can then be "thresholded," for example, by eliminating coefficients that are close to zero in magnitude. The remaining small set of coefficients serves as the synopsis. Wavelets are good at capturing features of the dataset at various scales.

Sketch summaries are particularly well suited to streaming data. Linear sketches, for example, view a numerical dataset as a vector or matrix, and multiply the data by a fixed matrix. Such sketches are massively parallelizable. They can accommodate streams of transactions in which data is both inserted and removed. Sketches have also been used successfully to estimate the answer to COUNT DISTINCT queries, a notoriously hard problem.
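The parallelism and deletion-handling both follow from linearity: a linear sketch is just a matrix-vector product, so the sketches of two pieces of data add up to the sketch of their combined data. A toy illustration (the random ±1 matrix, dimensions, and data here are purely illustrative; practical sketches use carefully chosen hash families):

```python
import random

def make_sketch_matrix(rows, dim, seed=0):
    # A fixed random +/-1 matrix; all parties must share the seed so
    # that their sketches are comparable.
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(rows)]

def sketch(vec, mat):
    # The sketch of a data vector v is simply the product M * v.
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

M = make_sketch_matrix(rows=4, dim=8)
a = [1, 0, 2, 0, 0, 3, 0, 0]   # e.g., insertions seen at one site
b = [0, 5, 0, 0, 1, 0, 0, 2]   # insertions seen at another site
combined = sketch([x + y for x, y in zip(a, b)], M)
merged = [x + y for x, y in zip(sketch(a, M), sketch(b, M))]
assert combined == merged  # linearity: sketches can be merged after the fact
```

The same identity handles deletions: removing an item corresponds to adding a vector with a negative entry, which the sketch absorbs like any other update.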

Many questions arise when evaluating or using synopses.

• What is the class of queries that can be approximately answered?

• What is the approximation accuracy for a synopsis of a given size?

• What are the space and time requirements for constructing a synopsis of a given size, as well as the time required to approximately answer the query?

• How should one choose synopsis parameters such as the number of histogram buckets or the wavelet thresholding value? Is there an optimal, that is, most accurate, synopsis of a given size?

• When using a synopsis to approximately answer a query, is it possible to obtain error bounds on the approximate query answer?

• Can the synopsis be incrementally maintained in an efficient manner?

• Which type of synopsis is best for a given problem?

We explore these issues in subsequent chapters.

1.3 Outline

It is possible to read the discussion of each type of synopsis in isolation, to understand a particular summarization approach. We have tried to use common notation and terminology across all chapters, in order to


facilitate comparison of the different synopses. In more detail, the topics covered by the different chapters are given below.

1.3.1 Sampling

Random samples are perhaps the most fundamental synopses for AQP, and the most widely implemented. The simplicity of the idea — executing the desired query against a small representative subset of the data — belies centuries of research across many fields, with decades of effort in the database community alone. Many different methods of extracting and maintaining samples of data have been proposed, along with multiple ways to build an estimator for a given query. This chapter introduces the mathematical foundations for sampling, in terms of accuracy and precision, and discusses the key sampling schemes: Bernoulli sampling, stratified sampling, and simple random sampling with and without replacement.

For simple queries, such as basic SUM and AVERAGE queries, it is straightforward to build unbiased estimators from samples. The more general case — an arbitrary SQL query with nested subqueries — is more daunting, but can sometimes be solved quite naturally in a procedural way.
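As a concrete example of such an unbiased estimator, the sketch below (all names and parameters are ours, for illustration) draws a Bernoulli sample, in which each row is retained independently with probability p, and scales the sampled SUM up by 1/p:

```python
import random

def bernoulli_sample(data, p, seed=1):
    # Each row is included in the sample independently with probability p.
    rng = random.Random(seed)
    return [x for x in data if rng.random() < p]

def estimate_sum(sample, p):
    # Scale-up estimator: each sampled row "stands in" for 1/p rows,
    # which makes the estimate unbiased for the true SUM.
    return sum(sample) / p

data = list(range(100))   # true SUM is 4950
estimate = estimate_sum(bernoulli_sample(data, 0.1), 0.1)
```

Averaged over the sampling randomness, the estimate equals the true sum; any single estimate fluctuates around it, with variance shrinking as p grows.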

For small tables, drawing a sample can be done straightforwardly. For larger relations, which may not fit conveniently in memory, or may not even be stored on disk in full, more advanced techniques are needed to make the sampling process scalable. For disk-resident data, sampling methods that operate at the granularity of a block rather than a tuple may be preferred. Existing indices can also be leveraged to help the sampling. For large streams of data, considerable effort has been put into maintaining a uniform sample as new items arrive or existing items are deleted. Finally, "online aggregation" algorithms enhance interactive exploration of massive datasets by exploiting the fact that an imprecise sampling-based estimate of a query result can be incrementally improved simply by collecting more samples.
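For the insertion-only streaming case just mentioned, the classic technique is reservoir sampling (Algorithm R), which keeps a uniform sample of fixed size k without knowing the stream length in advance. A minimal sketch, with illustrative names:

```python
import random

def reservoir_sample(stream, k, seed=2):
    # Algorithm R: after n items have been seen, every item has
    # probability k/n of being in the reservoir.
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)    # fill the reservoir first
        else:
            j = rng.randrange(n)      # uniform in [0, n)
            if j < k:
                reservoir[j] = item   # replace a random slot
    return reservoir
```

Handling deletions requires more machinery than this, which the sampling chapter discusses.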

1.3.2 Histograms

The histogram is a fundamental object for summarizing the frequency distribution of an attribute or combination of attributes. The most


basic histograms are based on a fixed division of the domain (equi-width), or on quantiles (equi-depth), and simply keep statistics on the number of items from the input which fall in each such bucket. But many more complex methods have been designed, which aim to provide the most accurate summary possible within a limited space budget. Schemes differ in how the buckets are chosen, what statistics are stored, how estimates are extracted, and what classes of query are supported. They are quantified based on the space and time requirements used to build them, and the resulting accuracy guarantees that they provide.
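A minimal equi-width example, including the usual uniform-spread assumption applied when a query range partially overlaps a bucket (all names here are illustrative):

```python
def equiwidth_histogram(values, lo, hi, nbuckets):
    # Bucket counts over a fixed division of [lo, hi).
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        b = min(int((v - lo) / width), nbuckets - 1)
        counts[b] += 1
    return counts

def estimate_range_count(counts, lo, hi, a, b):
    # Estimate |{v : a <= v < b}|, assuming values are spread
    # uniformly within each bucket.
    width = (hi - lo) / len(counts)
    total = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += c * overlap / width
    return total
```

An equi-depth histogram instead places bucket boundaries at quantiles of the data, so each bucket holds roughly the same number of items.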

The one-dimensional case is at the heart of histogram construction, since higher dimensions are typically handled via extensions of one-dimensional ideas. Beyond equi-width and equi-depth, end-biased and high-biased, maxdiff and other generalizations have been proposed. For a variety of approximation-error metrics, dynamic programming (DP) methods can be used to find histograms — notably the "v-optimal histograms" — that minimize the error, subject to an upper bound on the allowable histogram size. Approximate methods can be used when the quadratic cost of DP is not practical. Many other constructions, both optimal and heuristic, are described, such as lattice histograms, STHoles, and maxdiff histograms. The extension of these methods to higher dimensions adds complexity. Even the two-dimensional case presents challenges in how to define the space of possible bucketings. The cost of these methods also rises exponentially with the dimensionality of the data, inspiring new approaches that combine sets of low-dimensional histograms with high-level statistical models.
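To make the DP concrete, here is an O(n²B) sketch for the v-optimal case, in which each bucket is represented by its mean and the objective is total sum-squared error. For brevity it returns only the optimal error, omitting recovery of the bucket boundaries; all names are ours:

```python
def v_optimal_error(values, B):
    # Minimum total sum-squared error of partitioning `values` into
    # exactly B contiguous buckets, each represented by its mean.
    n = len(values)
    # Prefix sums of x and x^2 give each bucket's error in O(1):
    # SSE(i, j) = sum(x^2) - (sum x)^2 / count, over values[i:j].
    ps = [0.0] * (n + 1)
    pq = [0.0] * (n + 1)
    for i, x in enumerate(values):
        ps[i + 1] = ps[i] + x
        pq[i + 1] = pq[i] + x * x

    def sse(i, j):
        s, q, c = ps[j] - ps[i], pq[j] - pq[i], j - i
        return q - s * s / c

    INF = float("inf")
    # err[b][j]: best error covering the first j values with b buckets.
    err = [[INF] * (n + 1) for _ in range(B + 1)]
    err[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(b, n + 1):
            err[b][j] = min(err[b - 1][i] + sse(i, j)
                            for i in range(b - 1, j))
    return err[B][n]
```

On the six values 1, 1, 1, 5, 5, 5, two buckets achieve zero error while one bucket incurs error 24; recording the argmin split points alongside err recovers the buckets themselves.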

Histograms most naturally answer range-sum queries — for example, "compute total sales between July and September for adults from age 25 through 40" — and their variations. They can also be used to approximate more general classes of queries, such as aggregations over joins. Various negative theoretical and empirical results indicate that one should not expect histograms to give accurate answers to arbitrary queries. Nevertheless, due to their conceptual simplicity, histograms can be effectively used for a broad variety of estimation tasks, including set-valued queries, real-valued data, and aggregate queries over predicates more complex than simple ranges.


1.3.3 Wavelets

The wavelet synopsis is conceptually close to the histogram summary. The central difference is that, whereas histograms primarily produce buckets that are subsets of the original data-attribute domain, wavelet representations transform the data and seek to represent the most significant features in a wavelet (i.e., "frequency") domain, and can capture combinations of high- and low-frequency information. The most widely discussed wavelet transformation is the Haar wavelet transform (HWT), which can, in general, be constructed in time linear in the size of the underlying data array. Picking the B largest HWT coefficients results in a synopsis that provides the optimal L2 (sum-squared) error for the reconstructed data. Extending from one-dimensional to multi-dimensional data, as with histograms, provides more definitional challenges. There are multiple plausible choices here, as well as algorithmic challenges in efficiently building the wavelet decomposition.
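The HWT can be computed by repeatedly replacing adjacent pairs with their average and half-difference. A sketch of the unnormalized version follows; note that strict L2 optimality of largest-magnitude thresholding holds for the normalized coefficients, so this version (and its names) is for illustration only:

```python
def haar_transform(data):
    # Unnormalized Haar wavelet transform of a length-2^k list:
    # each pass keeps pairwise averages and records half-differences.
    a = list(data)
    details = []
    while len(a) > 1:
        avgs = [(a[i] + a[i + 1]) / 2 for i in range(0, len(a), 2)]
        diffs = [(a[i] - a[i + 1]) / 2 for i in range(0, len(a), 2)]
        details = diffs + details   # coarser levels end up earlier
        a = avgs
    return a + details              # [overall average, coarse..fine details]

def top_b_coefficients(coeffs, B):
    # Keep the B coefficients of largest absolute value as the synopsis.
    keep = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                  reverse=True)[:B]
    return {i: coeffs[i] for i in sorted(keep)}
```

For the array [2, 2, 0, 2, 3, 5, 4, 4], the transform is [2.75, -1.25, 0.5, 0, 0, -1, -1, 0], and a three-coefficient synopsis retains the overall average 2.75 together with two further large-magnitude coefficients.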

The core AQP task for wavelet summaries is to estimate the answer to range sums. More general SPJ (select, project, join) queries can also be directly applied on relation summaries, to generate a summary of the resulting relation. This is made possible through an appropriately-defined AQP algebra that operates entirely in the domain of wavelet coefficients.

Recent research into wavelet representations has focused on error guarantees beyond L2. These include L1 (sum of errors) or L∞ (maximum error), as well as relative-error versions of these measures. A fundamental choice here is whether to restrict the possible coefficient values to those arising under the basic wavelet transform, or to allow other (unrestricted) coefficient values, specifically chosen to reduce the target error metric. The construction of such (restricted or unrestricted) wavelet synopses optimized for non-L2 error metrics is a challenging problem.

1.3.4 Sketches

Sketch techniques have undergone extensive development over the past few years. They are especially appropriate for streaming data, in which large quantities of data flow by and the sketch summary must


continually be updated quickly and compactly. Sketches, as presented here, are designed so that the update caused by each new piece of data is largely independent of the current state of the summary. This design choice makes them faster to process, and also easy to parallelize.

"Frequency-based sketches" are concerned with summarizing the observed frequency distribution of a dataset. From these sketches, accurate estimations of individual frequencies can be extracted. This leads to algorithms for finding approximate "heavy hitters" — items that account for a large fraction of the frequency mass — and quantiles such as the median and its generalizations. The same sketches can also be used to estimate the sizes of (equi)joins between relations, self-join sizes, and range queries. Such sketch summaries can be used as primitives within more complex mining operations, and to extract wavelet and histogram representations of streaming data.
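One widely used frequency-based sketch is the Count-Min sketch: d hash rows of width w, where a point query returns the minimum of an item's d counters and therefore never underestimates the true frequency. A simplified sketch, in which the salted use of Python's built-in hash is an illustrative stand-in for a proper pairwise-independent hash family:

```python
import random

class CountMinSketch:
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.random() for _ in range(depth)]  # one per row
        self.table = [[0] * width for _ in range(depth)]
        self.width = width

    def _cells(self, item):
        # One counter per row; hash((salt, item)) stands in for a
        # pairwise-independent hash function.
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Minimum over rows: an overestimate of the true frequency,
        # within eps * N with high probability for width about 2/eps.
        return min(self.table[row][col] for row, col in self._cells(item))
```

After five updates of one item and two of another, querying the first returns at least 5 and the second at least 2, with any overestimate bounded by hash collisions in the least-loaded row.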

A different style of sketch construction leads to sketches for "distinct-value" queries that count the number of distinct values in a given multiset. As mentioned above, using a sample to estimate the answer to a COUNT DISTINCT query may give highly inaccurate results. In contrast, sketching methods that make a pass over the entire dataset can provide guaranteed accuracy. Once built, these sketches estimate not only the cardinality of a given attribute or combination of attributes, but also the cardinality of various operations performed on them, such as set operations (union and difference), and selections based on arbitrary predicates.
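One simple sketch of this kind is the K-Minimum-Values (KMV) construction: hash every item into [0, 1) and keep only the k smallest distinct hash values seen. Duplicates hash identically, so they leave the sketch unchanged. A minimal sketch (the hash choice and names are illustrative):

```python
import hashlib

def kmv_distinct_estimate(items, k=64):
    mins = set()   # the k smallest distinct hash values seen so far
    for it in items:
        h = hashlib.sha1(str(it).encode()).hexdigest()
        v = int(h, 16) / 16 ** 40     # map the 160-bit hash into [0, 1)
        mins.add(v)
        if len(mins) > k:
            mins.discard(max(mins))   # keep only the k smallest
    if len(mins) < k:
        return len(mins)              # fewer than k distinct values: exact
    # If the k-th smallest of d uniform hashes is v, then d is about k/v;
    # (k - 1) / v is the commonly used estimator.
    return (k - 1) / max(mins)
```

Unlike a sample, every item passes through the hash, so heavy duplication cannot fool the estimate; the relative error shrinks roughly as 1/sqrt(k).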

In the final chapter, we compare the different synopsis methods. We also discuss the use of AQP within research systems, as well as challenges and future directions.

In our discussion, we often use terminology and examples that arise in classical database systems, such as SQL queries over relational databases. These artifacts partially reflect the original context of the results that we survey, and provide a convenient vocabulary for the various data and access models that are relevant to AQP. We emphasize that the techniques discussed here can be applied much more generally. Indeed, one of the key motivations behind this survey is the hope that these techniques — and their extensions — will become a fundamental component of tomorrow's information management systems.


References

[1] A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz, "HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads," PVLDB, vol. 2, no. 1, pp. 922–933, 2009.

[2] S. Acharya, P. B. Gibbons, and V. Poosala, "Aqua: A fast decision support system using approximate query answers," in International Conference on Very Large Data Bases, 1999.

[3] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy, "The Aqua approximate query answering system," in ACM SIGMOD International Conference on Management of Data, 1999.

[4] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy, "Join synopses for approximate query answering," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 275–286, New York, NY, USA, 1999.

[5] C. C. Aggarwal, "On biased reservoir sampling in the presence of stream evolution," in Proceedings of the International Conference on Very Large Data Bases, pp. 607–618, 2006.

[6] N. Alon, P. Gibbons, Y. Matias, and M. Szegedy, "Tracking join and self-join sizes in limited storage," in ACM Principles of Database Systems, 1999.

[7] N. Alon, Y. Matias, and M. Szegedy, "The space complexity of approximating the frequency moments," in ACM Symposium on Theory of Computing, 1996.

[8] A. Andoni, R. Krauthgamer, and K. Onak, "Streaming algorithms from precision sampling," CoRR, abs/1011.1263, 2010.

[9] G. Antoshenkov, "Random sampling from pseudo-ranked B+ trees," in Proceedings of the International Conference on Very Large Data Bases, pp. 375–382, 1992.


[10] P. M. Aoki, “Algorithms for index-assisted selectivity estimation,” in Proceedings of the International Conference on Data Engineering, p. 258, 1999.

[11] R. Avnur, J. M. Hellerstein, B. Lo, C. Olston, B. Raman, V. Raman, T. Roth, and K. Wylie, “CONTROL: Continuous output and navigation technology with refinement on-line,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 567–569, 1998.

[12] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” in ACM Principles of Database Systems, 2002.

[13] B. Babcock, S. Chaudhuri, and G. Das, “Dynamic sample selection for approximate query processing,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 539–550, New York, NY, USA, 2003.

[14] B. Babcock, M. Datar, and R. Motwani, “Sampling from a moving window over streaming data,” in SODA, pp. 633–634, 2002.

[15] W. Baek and T. Chilimbi, “Green: A framework for supporting energy-conscious programming using controlled approximation,” in Proceedings of PLDI, pp. 198–209, 2010.

[16] L. Baltrunas, A. Mazeika, and M. H. Bohlen, “Multi-dimensional histograms with tight bounds for the error,” in Proceedings of IDEAS, pp. 105–112, 2006.

[17] Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan, “Counting distinct elements in a data stream,” in Proceedings of RANDOM, 2002.

[18] R. Bellman, “On the approximation of curves by line segments using dynamic programming,” Communications of the ACM, vol. 4, no. 6, p. 284, 1961.

[19] K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla, “On synopses for distinct-value estimation under multiset operations,” in ACM SIGMOD International Conference on Management of Data, 2007.

[20] S. Bhattacharyya, A. Madeira, S. Muthukrishnan, and T. Ye, “How to scalably skip past streams,” in Scalable Stream Processing Systems (SSPS) Workshop with ICDE 2007, 2007.

[21] L. Bhuvanagiri, S. Ganguly, D. Kesh, and C. Saha, “Simpler algorithm for estimating frequency moments of data streams,” in ACM-SIAM Symposium on Discrete Algorithms, 2006.

[22] P. Billingsley, Probability and Measure. Wiley, third ed., 1999.

[23] B. Blohsfeld, D. Korus, and B. Seeger, “A comparison of selectivity estimators for range queries on metric attributes,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 239–250, 1999.

[24] B. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, July 1970.

[25] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” in ACM Symposium on Theory of Computing, 1998.

[26] A. Z. Broder and M. Mitzenmacher, “Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1, no. 4, 2003.

[27] P. G. Brown and P. J. Haas, “Techniques for warehousing of sample data,” in Proceedings of the International Conference on Data Engineering, p. 6, Washington, DC, USA, 2006.

Full text available at: http://dx.doi.org/10.1561/1900000004

[28] B. Bru, “The estimates of Laplace. An example: Research concerning the population of a large empire, 1785–1812,” Journal de la Societe de statistique de Paris, vol. 129, pp. 6–45, 1988.

[29] N. Bruno and S. Chaudhuri, “Exploiting statistics on query expressions for optimization,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 263–274, 2002.

[30] N. Bruno, S. Chaudhuri, and L. Gravano, “STHoles: A multidimensional workload-aware histogram,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 211–222, 2001.

[31] T. Bu, J. Cao, A. Chen, and P. P. C. Lee, “A fast and compact method for unveiling significant patterns in high speed networks,” in IEEE INFOCOM, 2007.

[32] F. Buccafurri, F. Furfaro, G. Lax, and D. Sacca, “Binary-tree histograms with tree indices,” in Proceedings of Database and Expert Systems Applications, pp. 861–870, 2002.

[33] F. Buccafurri, F. Furfaro, and D. Sacca, “Estimating range queries using aggregate data with integrity constraints: A probabilistic approach,” in Proceedings of the International Conference on Database Theory, pp. 390–404, 2001.

[34] F. Buccafurri, F. Furfaro, D. Sacca, and C. Sirangelo, “A quad-tree based multiresolution approach for two-dimensional summary data,” in Proceedings of the International Conference on Scientific and Statistical Database Management, pp. 127–137, 2003.

[35] F. Buccafurri and G. Lax, “Fast range query estimation by n-level tree histograms,” Data & Knowledge Engineering, vol. 51, no. 2, pp. 257–275, 2004.

[36] F. Buccafurri and G. Lax, “Reducing data stream sliding windows by cyclic tree-like histograms,” in PKDD, pp. 75–86, 2004.

[37] F. Buccafurri, G. Lax, D. Sacca, L. Pontieri, and D. Rosaci, “Enhancing histograms by tree-like bucket indices,” VLDB Journal, vol. 17, no. 5, pp. 1041–1061, 2008.

[38] C. Buragohain, N. Shrivastava, and S. Suri, “Space efficient streaming algorithms for the maximum error histogram,” in Proceedings of the International Conference on Data Engineering, pp. 1026–1035, 2007.

[39] J. L. Carter and M. N. Wegman, “Universal classes of hash functions,” Journal of Computer and System Sciences, vol. 18, no. 2, pp. 143–154, 1979.

[40] A. Chakrabarti, G. Cormode, and A. McGregor, “A near-optimal algorithm for computing the entropy of a stream,” in ACM-SIAM Symposium on Discrete Algorithms, 2007.

[41] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim, “Approximate query processing using wavelets,” in Proceedings of the International Conference on Very Large Data Bases, pp. 111–122, Cairo, Egypt, September 2000.

[42] K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim, “Approximate query processing using wavelets,” The VLDB Journal, vol. 10, no. 2–3, pp. 199–223, September 2001. (Best of VLDB’2000 Special Issue).

[43] D. Chamberlin, A Complete Guide to DB2 Universal Database. Morgan Kaufmann, 1998.

[44] M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya, “Towards estimation error guarantees for distinct values,” in Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 268–279, 2000.

[45] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in International Colloquium on Automata, Languages and Programming, 2002.

[46] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” Theoretical Computer Science, vol. 312, no. 1, pp. 3–15, 2004.

[47] S. Chaudhuri, G. Das, and V. R. Narasayya, “A robust, optimization-based approach for approximate answering of aggregate queries,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 295–306, 2001.

[48] S. Chaudhuri, G. Das, and V. R. Narasayya, “Optimized stratified sampling for approximate query processing,” ACM Transactions on Database Systems, vol. 32, no. 2, p. 9, 2007.

[49] S. Chaudhuri, G. Das, and U. Srivastava, “Effective use of block-level sampling in statistics estimation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 287–298, 2004.

[50] S. Chaudhuri, R. Motwani, and V. Narasayya, “On random sampling over joins,” SIGMOD Record, vol. 28, no. 2, pp. 263–274, 1999.

[51] S. Chaudhuri, R. Motwani, and V. R. Narasayya, “Random sampling for histogram construction: How much is enough?,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 436–447, 1998.

[52] K. K. Chen, “Influence query optimization with optimization profiles and statistical views in DB2 9: Optimal query performance in DB2 9 for Linux, UNIX, and Windows,” Available at www.ibm.com/developerworks/db2/library/techarticle/dm-0612chen, 2006.

[53] W. Cheney and W. Light, A Course in Approximation Theory. Brooks/Cole, Pacific Grove, CA, 2000.

[54] E. Cohen, G. Cormode, and N. G. Duffield, “Structure-aware sampling on data streams,” in SIGMETRICS, pp. 197–208, 2011.

[55] E. Cohen, N. Grossaug, and H. Kaplan, “Processing top-k queries from samples,” in Proceedings of CoNext, p. 7, 2006.

[56] E. Cohen and H. Kaplan, “Summarizing data using bottom-k sketches,” in ACM Conference on Principles of Distributed Computing (PODC), 2007.

[57] S. Cohen and Y. Matias, “Spectral Bloom filters,” in ACM SIGMOD International Conference on Management of Data, 2003.

[58] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears, “MapReduce online,” in NSDI, 2010.

[59] J. Considine, M. Hadjieleftheriou, F. Li, J. W. Byers, and G. Kollios, “Robust approximate aggregation in sensor data management systems,” ACM Transactions on Database Systems, vol. 34, no. 1, April 2009.

[60] J. Considine, F. Li, G. Kollios, and J. Byers, “Approximate aggregation techniques for sensor databases,” in IEEE International Conference on Data Engineering, 2004.

[61] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan, “Comparing data streams using Hamming norms,” in International Conference on Very Large Data Bases, 2002.

[62] G. Cormode, A. Deligiannakis, M. N. Garofalakis, and A. McGregor, “Probabilistic histograms for probabilistic data,” PVLDB, vol. 2, no. 1, pp. 526–537, 2009.

[63] G. Cormode and M. Garofalakis, “Sketching streams through the net: Distributed approximate query tracking,” in International Conference on Very Large Data Bases, 2005.

[64] G. Cormode, M. Garofalakis, and D. Sacharidis, “Fast approximate wavelet tracking on streams,” in Proceedings of the International Conference on Extending Database Technology, Munich, Germany, March 2006.

[65] G. Cormode and M. Hadjieleftheriou, “Finding frequent items in data streams,” in International Conference on Very Large Data Bases, 2008.

[66] G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishnan, “Fast mining of tabular data via approximate distance computations,” in IEEE International Conference on Data Engineering, 2002.

[67] G. Cormode, F. Korn, S. Muthukrishnan, T. Johnson, O. Spatscheck, and D. Srivastava, “Holistic UDAFs at streaming speeds,” in ACM SIGMOD International Conference on Management of Data, pp. 35–46, 2004.

[68] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava, “Space- and time-efficient deterministic algorithms for biased quantiles over data streams,” in Proceedings of ACM Principles of Database Systems, pp. 263–272, 2006.

[69] G. Cormode and S. Muthukrishnan, “What’s new: Finding significant differences in network data streams,” in Proceedings of IEEE Infocom, 2004.

[70] G. Cormode and S. Muthukrishnan, “An improved data stream summary: The count-min sketch and its applications,” Journal of Algorithms, vol. 55, no. 1, pp. 58–75, 2005.

[71] G. Cormode and S. Muthukrishnan, “Summarizing and mining skewed data streams,” in SIAM Conference on Data Mining, 2005.

[72] G. Cormode, S. Muthukrishnan, and I. Rozenbaum, “Summarizing and mining inverse distributions on data streams via dynamic inverse sampling,” in International Conference on Very Large Data Bases, 2005.

[73] G. Cormode, S. Muthukrishnan, K. Yi, and Q. Zhang, “Optimal sampling from distributed streams,” in Proceedings of ACM Principles of Database Systems, pp. 77–86, 2010.

[74] C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk, “Gigascope: A stream database for network applications,” in ACM SIGMOD International Conference on Management of Data, 2003.

[75] N. N. Dalvi, C. Re, and D. Suciu, “Probabilistic databases: Diamonds in the dirt,” Communications of the ACM, vol. 52, no. 7, pp. 86–94, 2009.

[76] A. Das, J. Gehrke, and M. Riedewald, “Approximation techniques for spatial data,” in ACM SIGMOD International Conference on Management of Data, 2004.

[77] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining stream statistics over sliding windows,” in ACM-SIAM Symposium on Discrete Algorithms, 2002.

[78] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 1992.

[79] A. Deligiannakis, M. Garofalakis, and N. Roussopoulos, “An approximation scheme for probabilistic wavelet synopses,” in Proceedings of the International Conference on Scientific and Statistical Database Management, Santa Barbara, California, June 2005.

[80] A. Deligiannakis, M. Garofalakis, and N. Roussopoulos, “Extended wavelets for multiple measures,” ACM Transactions on Database Systems, vol. 32, no. 2, June 2007.

[81] A. Deligiannakis and N. Roussopoulos, “Extended wavelets for multiple measures,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, California, June 2003.

[82] F. Deng and D. Rafiei, “New estimation algorithms for streaming data: Count-min can do more,” http://www.cs.ualberta.ca/~fandeng/paper/cmm.pdf, 2007.

[83] A. Deshpande, M. N. Garofalakis, and R. Rastogi, “Independence is good: Dependency-based histogram synopses for high-dimensional data,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 199–210, 2001.

[84] R. A. DeVore, “Nonlinear approximation,” Acta Numerica, vol. 7, pp. 51–150, 1998.

[85] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. Springer, 2001.

[86] A. Dobra, “Histograms revisited: When are histograms the best approximation method for aggregates over joins?,” in Proceedings of ACM Principles of Database Systems, pp. 228–237, 2005.

[87] A. Dobra, M. Garofalakis, J. E. Gehrke, and R. Rastogi, “Processing complex aggregate queries over data streams,” in ACM SIGMOD International Conference on Management of Data, 2002.

[88] A. Dobra and F. Rusu, “Statistical analysis of sketch estimators,” ACM Transactions on Database Systems, vol. 33, no. 3, 2008.

[89] D. Donjerkovic, Y. E. Ioannidis, and R. Ramakrishnan, “Dynamic histograms: Capturing evolving data sets,” in Proceedings of the International Conference on Data Engineering, p. 86, 2000.

[90] D. Donjerkovic and R. Ramakrishnan, “Probabilistic optimization of top N queries,” in Proceedings of the International Conference on Very Large Data Bases, pp. 411–422, 1999.

[91] N. Duffield, C. Lund, and M. Thorup, “Estimating flow distributions from sampled flow statistics,” in ACM SIGCOMM, 2003.

[92] M. Durand and P. Flajolet, “Loglog counting of large cardinalities,” in European Symposium on Algorithms (ESA), 2003.

[93] C. Estan and G. Varghese, “New directions in traffic measurement and accounting,” in ACM SIGCOMM, 2002.

[94] C. T. Fan, M. E. Muller, and I. Rezucha, “Development of sampling plans by using sequential (item by item) selection techniques and digital computers,” Journal of the American Statistical Association, pp. 387–402, 1962.

[95] G. Fishman, Monte Carlo: Concepts, Algorithms and Applications. Springer, 1996.

[96] P. Flajolet, “On adaptive sampling,” Computing, vol. 43, no. 4, 1990.

[97] P. Flajolet and G. N. Martin, “Probabilistic counting algorithms for database applications,” Journal of Computer and System Sciences, vol. 31, pp. 182–209, 1985.

[98] G. Frahling, P. Indyk, and C. Sohler, “Sampling in dynamic data streams and applications,” in Symposium on Computational Geometry, June 2005.

[99] D. Fuchs, Z. He, and B. S. Lee, “Compressed histograms with arbitrary bucket layouts for selectivity estimation,” Information Sciences, vol. 177, no. 3, pp. 680–702, 2007.

[100] S. Ganguly, “Counting distinct items over update streams,” Theoretical Computer Science, vol. 378, no. 3, pp. 211–222, 2007.

[101] S. Ganguly, M. Garofalakis, and R. Rastogi, “Processing set expressions over continuous update streams,” in ACM SIGMOD International Conference on Management of Data, 2003.

[102] S. Ganguly, M. Garofalakis, and R. Rastogi, “Processing data-stream join aggregates using skimmed sketches,” in International Conference on Extending Database Technology, 2004.

[103] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz, “Bifocal sampling for skew-resistant join size estimation,” SIGMOD Record, vol. 25, no. 2, pp. 271–281, 1996.

[104] S. Ganguly and A. Majumder, “CR-precis: A deterministic summary structure for update data streams,” in ESCAPE, 2007.

[105] M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only get one look,” in ACM SIGMOD International Conference on Management of Data, 2002.

[106] M. Garofalakis, J. Gehrke, and R. Rastogi, eds., Data Stream Management: Processing High-Speed Data Streams. Springer, 2011.

[107] M. Garofalakis and P. B. Gibbons, “Approximate query processing: Taming the terabytes,” Tutorial in International Conference on Very Large Data Bases, Roma, Italy, September 2001.

[108] M. Garofalakis and P. B. Gibbons, “Wavelet synopses with error guarantees,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 476–487, Madison, Wisconsin, June 2002.

[109] M. Garofalakis and P. B. Gibbons, “Probabilistic wavelet synopses,” ACM Transactions on Database Systems, vol. 29, no. 1, March 2004. (SIGMOD/PODS’2002 Special Issue).

[110] M. Garofalakis and A. Kumar, “Deterministic wavelet thresholding for maximum-error metrics,” in Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Paris, France, June 2004.

[111] M. Garofalakis and A. Kumar, “Wavelet synopses for general error metrics,” ACM Transactions on Database Systems, vol. 30, no. 4, December 2005. (SIGMOD/PODS’2004 Special Issue).

[112] R. Gemulla, “Sampling algorithms for evolving datasets,” PhD Thesis, Technische Universitat Dresden, Available at http://nbn-resolving.de/urn:nbn:de:bsz:14-ds-1224861856184-11644, 2009.

[113] R. Gemulla and W. Lehner, “Deferred maintenance of disk-based random samples,” in Proceedings of International Conference on Extending Database Technology, Lecture Notes in Computer Science, pp. 423–441, 2006.

[114] R. Gemulla and W. Lehner, “Sampling time-based sliding windows in bounded space,” in SIGMOD Conference, pp. 379–392, 2008.

[115] R. Gemulla, W. Lehner, and P. J. Haas, “A dip in the reservoir: Maintaining sample synopses of evolving datasets,” in Proceedings of the International Conference on Very Large Data Bases, pp. 595–606, 2006.

[116] R. Gemulla, W. Lehner, and P. J. Haas, “Maintaining Bernoulli samples over evolving multisets,” in Proceedings of ACM Principles of Database Systems, pp. 93–102, 2007.

[117] R. Gemulla, W. Lehner, and P. J. Haas, “Maintaining bounded-size sample synopses of evolving datasets,” VLDB Journal, vol. 17, no. 2, pp. 173–202, 2008.

[118] J. E. Gentle, Random Number Generation and Monte Carlo Methods. Springer, second ed., 2003.

[119] L. Getoor, B. Taskar, and D. Koller, “Selectivity estimation using probabilistic models,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 461–472, 2001.

[120] C. Giannella and B. Sayrafi, “An information theoretic histogram for single dimensional selectivity estimation,” in Proceedings of the ACM Symposium on Applied Computing, pp. 676–677, 2005.

[121] P. Gibbons, “Distinct sampling for highly-accurate answers to distinct values queries and event reports,” in International Conference on Very Large Data Bases, 2001.

[122] P. Gibbons and S. Tirthapura, “Estimating simple functions on the union of data streams,” in ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2001.

[123] P. Gibbons and S. Tirthapura, “Distributed streams algorithms for sliding windows,” in ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2002.

[124] P. B. Gibbons and Y. Matias, “New sampling-based summary statistics for improving approximate query answers,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 331–342, New York, NY, USA, 1998.

[125] P. B. Gibbons, Y. Matias, and V. Poosala, “Aqua project white paper,” Technical report, Bell Laboratories, Murray Hill, NJ, 1997.

[126] P. B. Gibbons, Y. Matias, and V. Poosala, “Fast incremental maintenance of approximate histograms,” ACM Transactions on Database Systems, vol. 27, no. 3, pp. 261–298, 2002.

[127] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss, “Fast, small-space algorithms for approximate histogram maintenance,” in ACM Symposium on Theory of Computing, 2002.

[128] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss, “Surfing wavelets on streams: One-pass summaries for approximate aggregate queries,” in International Conference on Very Large Data Bases, 2001.

[129] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss, “How to summarize the universe: Dynamic maintenance of quantiles,” in International Conference on Very Large Data Bases, 2002.

[130] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss, “One-pass wavelet decomposition of data streams,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 541–554, May 2003.

[131] M. Greenwald and S. Khanna, “Space-efficient online computation of quantile summaries,” in ACM SIGMOD International Conference on Management of Data, 2001.

[132] S. Guha, “A note on wavelet optimization,” (Manuscript available from: http://www.cis.upenn.edu/~sudipto/note.html.), September 2004.

[133] S. Guha, “On the space-time of optimal, approximate and streaming algorithms for synopsis construction problems,” VLDB Journal, vol. 17, no. 6, pp. 1509–1535, 2008.

[134] S. Guha and B. Harb, “Wavelet synopsis for data streams: Minimizing non-Euclidean error,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, August 2005.

[135] S. Guha and B. Harb, “Approximation algorithms for wavelet transform coding of data streams,” in Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, Miami, Florida, January 2006.

[136] S. Guha, P. Indyk, S. Muthukrishnan, and M. J. Strauss, “Histogramming data streams with fast per-item processing,” in Proceedings of the International Colloquium on Automata, Languages, and Programming, Malaga, Spain, July 2002.

[137] S. Guha, C. Kim, and K. Shim, “XWAVE: Optimal and approximate extended wavelets for streaming data,” in Proceedings of the International Conference on Very Large Data Bases, Toronto, Canada, September 2004.

[138] S. Guha, N. Koudas, and K. Shim, “Approximation and streaming algorithms for histogram construction problems,” ACM Transactions on Database Systems, vol. 31, no. 1, pp. 396–438, 2006.

[139] S. Guha, N. Koudas, and D. Srivastava, “Fast algorithms for hierarchical range histogram construction,” in Proceedings of ACM Principles of Database Systems, pp. 180–187, 2002.

[140] S. Guha and K. Shim, “A note on linear time algorithms for maximum error histograms,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 7, pp. 993–997, 2007.

[141] S. Guha, K. Shim, and J. Woo, “REHIST: Relative error histogram construction algorithms,” in Proceedings of the International Conference on Very Large Data Bases, pp. 300–311, 2004.

[142] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi, “Approximating multi-dimensional aggregate range queries over real attributes,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 463–474, 2000.

[143] A. P. Gurajada and J. Srivastava, “Equidepth partitioning of a data set based on finding its medians,” in Proceedings of Applications of Computing, pp. 92–101, 1991.

[144] P. J. Haas, “Large-sample and deterministic confidence intervals for online aggregation,” in Proceedings of International Conference on Scientific and Statistical Database Management, pp. 51–63, 1997.

[145] P. J. Haas, “The need for speed: Speeding up DB2 UDB using sampling,” IDUG Solutions Journal, vol. 10, no. 2, pp. 32–34, 2003.

[146] P. J. Haas and J. M. Hellerstein, “Join algorithms for online aggregation,” IBM Research Report RJ 10126, 1998.

[147] P. J. Haas and J. M. Hellerstein, “Ripple joins for online aggregation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 287–298, 1999.

[148] P. J. Haas, I. F. Ilyas, G. M. Lohman, and V. Markl, “Discovering and exploiting statistical properties for query optimization in relational databases: A survey,” Statistical Analysis and Data Mining, vol. 1, no. 4, pp. 223–250, 2009.

[149] P. J. Haas and C. Konig, “A bi-level Bernoulli scheme for database sampling,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 275–286, 2004.

[150] P. J. Haas, Y. Liu, and L. Stokes, “An estimator of the number of species from quadrat sampling,” Biometrics, vol. 62, pp. 135–141, 2006.

[151] P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes, “Sampling-based estimation of the number of distinct values of an attribute,” in Proceedings of the International Conference on Very Large Data Bases, pp. 311–322, 1995.

[152] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N. Swami, “Fixed-precision estimation of join selectivity,” in Proceedings of ACM Principles of Database Systems, pp. 190–201, 1993.

[153] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N. Swami, “Selectivity and cost estimation for joins based on random sampling,” Journal of Computer and System Sciences, vol. 52, no. 3, pp. 550–569, 1996.

[154] P. J. Haas, J. F. Naughton, and A. N. Swami, “On the relative cost of sampling for join selectivity estimation,” in Proceedings of ACM Principles of Database Systems, pp. 14–24, 1994.

[155] P. J. Haas and L. Stokes, “Estimating the number of classes in a finite population,” Journal of the American Statistical Association, vol. 93, no. 444, pp. 1475–1487, 1998.

[156] P. J. Haas and A. N. Swami, “Sequential sampling procedures for query size estimation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 341–350, New York, NY, USA, 1992.

[157] T. Hagerup and C. Rub, “A guided tour of Chernoff bounds,” Information Processing Letters, vol. 33, no. 6, pp. 305–308, 1990.

[158] P. Hall and C. Heyde, Martingale Limit Theory and Its Application. Academic Press, 1980.

[159] M. Hansen, “Some history and reminiscences on survey sampling,” Statistical Science, vol. 2, pp. 180–190, 1987.

[160] B. Harb, “Algorithms for linear and nonlinear approximation of large data,” PhD Thesis, University of Pennsylvania, 2007.

[161] N. J. A. Harvey, J. Nelson, and K. Onak, “Sketching and streaming entropy via approximation theory,” in Proceedings of the IEEE Conference on Foundations of Computer Science, 2008.

[162] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[163] Z. He, B. S. Lee, and X. S. Wang, “Proactive and reactive multi-dimensional histogram maintenance for selectivity estimation,” Journal of Systems and Software, vol. 81, no. 3, pp. 414–430, 2008.

[164] J. M. Hellerstein, P. J. Haas, and H. J. Wang, “Online aggregation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 171–182, 1997.

[165] M. Henzinger, “Algorithmic challenges in search engines,” Internet Mathematics, vol. 1, no. 1, pp. 115–126, 2003.

[166] M. Henzinger, P. Raghavan, and S. Rajagopalan, “Computing on data streams,” Technical Report SRC 1998-011, DEC Systems Research Centre, 1998.

[167] J. Hershberger, N. Shrivastava, S. Suri, and C. D. Toth, “Adaptive spatial partitioning for multidimensional data streams,” in Proceedings of ISAAC, pp. 522–533, 2004.

[168] C. Hidber, “Online association rule mining,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 145–156, 1999.

[169] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.

[170] D. G. Horvitz and D. J. Thompson, “A generalization of sampling without replacement from a finite universe,” Journal of the American Statistical Association, vol. 47, pp. 663–695, 1952.

[171] W.-C. Hou, G. Ozsoyoglu, and B. K. Taneja, “Statistical estimators for relational algebra expressions,” in Proceedings of ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 276–287, 1988.

[172] W.-C. Hou, G. Ozsoyoglu, and B. K. Taneja, “Processing aggregate relational queries with hard time constraints,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 68–77, 1989.

[173] IDC, “The diverse and exploding digital universe,” IDC White Paper, March 2008.

[174] P. Indyk, “Stable distributions, pseudorandom generators, embeddings and data stream computation,” in IEEE Conference on Foundations of Computer Science, 2000.

[175] P. Indyk and D. Woodruff, “Tight lower bounds for the distinct elements problem,” in IEEE Conference on Foundations of Computer Science, 2003.

[176] P. Indyk and D. P. Woodruff, “Optimal approximations of the frequency moments of data streams,” in ACM Symposium on Theory of Computing, 2005.

[177] Y. E. Ioannidis, “Universality of serial histograms,” in Proceedings of the International Conference on Very Large Data Bases, pp. 256–267, 1993.

[178] Y. E. Ioannidis, “Approximations in database systems,” in Proceedings of the International Conference on Database Theory, pp. 16–30, 2003.

[179] Y. E. Ioannidis, “The history of histograms (abridged),” in Proceedings of the International Conference on Very Large Data Bases, pp. 19–30, 2003.

[180] Y. E. Ioannidis and S. Christodoulakis, “On the propagation of errors in the size of join results,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 268–277, 1991.

[181] Y. E. Ioannidis and S. Christodoulakis, “Optimal histograms for limiting worst-case error propagation in the size of join results,” ACM Transactions on Database Systems, vol. 18, no. 4, 1993.

[182] Y. E. Ioannidis and V. Poosala, “Balancing histogram optimality and practicality for query result size estimation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 233–244, 1995.

[183] Y. E. Ioannidis and V. Poosala, “Histogram-based approximation of set-valued query-answers,” in Proceedings of the International Conference on Very Large Data Bases, pp. 174–185, 1999.

[184] H. V. Jagadish, H. Jin, B. C. Ooi, and K.-L. Tan, “Global optimization of histograms,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 223–234, 2001.

[185] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel, “Optimal histograms with quality guarantees,” in Proceedings of the International Conference on Very Large Data Bases, pp. 275–286, 1998.

[186] B. Jawerth and W. Sweldens, “An overview of wavelet based multiresolution analyses,” SIAM Review, vol. 36, no. 3, pp. 377–412, 1994.

[187] T. S. Jayram, R. Kumar, and D. Sivakumar, “The one-way communication complexity of gap Hamming distance,” http://www.madalgo.au.dk/img/SumSchoo2007_Lecture_20slides/Bibliography/p14_Jayram_07_Manusc_ghd.pdf, 2007.

[188] C. Jermaine, S. Arumugam, A. Pol, and A. Dobra, “Scalable approximate query processing with the DBO engine,” in ACM SIGMOD International Conference on Management of Data, 2007.

[189] C. Jermaine, A. Dobra, S. Arumugam, S. Joshi, and A. Pol, “The sort-merge-shrink join,” ACM Transactions on Database Systems, vol. 31, no. 4, pp. 1382–1416, 2006.

[190] C. Jermaine, A. Dobra, A. Pol, and S. Joshi, “Online estimation for subset-based SQL queries,” in Proceedings of the International Conference on Very Large Data Bases, pp. 745–756, 2005.

[191] C. Jermaine, A. Pol, and S. Arumugam, “Online maintenance of very large random samples,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 299–310, 2004.

[192] C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou, “Dynamically maintaining frequent items over a data stream,” in Proceedings of the ACM Conference on Information and Knowledge Management, 2003.

[193] R. Jin, L. Glimcher, C. Jermaine, and G. Agrawal, “New sampling-based estimators for OLAP queries,” in Proceedings of the International Conference on Data Engineering, p. 18, Washington, DC, USA, 2006.

[194] W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.

[195] S. Joshi and C. Jermaine, "Sampling-based estimators for subset-based queries," VLDB Journal, accepted for publication, 2008.

[196] H. Jowhari, M. Saglam, and G. Tardos, "Tight bounds for lp samplers, finding duplicates in streams, and related problems," in ACM Principles of Database Systems, 2011.

[197] C.-C. Kanne and G. Moerkotte, "Histograms reloaded: The merits of bucket diversity," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 663–674, 2010.

[198] P. Karras, "Optimality and scalability in lattice histogram construction," PVLDB, vol. 2, no. 1, pp. 670–681, 2009.

[199] P. Karras and N. Mamoulis, "Hierarchical synopses with optimal error guarantees," ACM Transactions on Database Systems, vol. 33, no. 3, August 2008.

[200] P. Karras and N. Mamoulis, "Lattice histograms: A resilient synopsis structure," in Proceedings of the International Conference on Data Engineering, pp. 247–256, 2008.

[201] P. Karras and N. Mamoulis, "One-pass wavelet synopses for maximum-error metrics," in Proceedings of the International Conference on Very Large Data Bases, Trondheim, Norway, September 2005.

[202] P. Karras and N. Mamoulis, "The Haar+ tree: A refined synopsis data structure," in Proceedings of the International Conference on Data Engineering, Istanbul, Turkey, April 2007.

[203] P. Karras, D. Sacharidis, and N. Mamoulis, "Exploiting duality in summarization with deterministic guarantees," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, August 2007.

[204] R. Kaushik, J. F. Naughton, R. Ramakrishnan, and V. T. Chakaravarthy, "Synopses for query optimization: A space-complexity perspective," ACM Transactions on Database Systems, vol. 30, no. 4, pp. 1102–1127, 2005.

[205] R. Kaushik and D. Suciu, "Consistent histograms in the presence of distinct value counts," PVLDB, vol. 2, no. 1, pp. 850–861, 2009.

[206] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," in Proceedings of the IEEE Conference on Foundations of Computer Science, pp. 482–491, 2003.

[207] S. Khanna, S. Muthukrishnan, and S. Skiena, "Efficient array partitioning," in Proceedings of the International Colloquium on Automata, Languages and Programming, pp. 616–626, 1997.

[208] A. C. Konig and G. Weikum, "Combining histograms and parametric curve fitting for feedback-driven query result-size estimation," in Proceedings of the International Conference on Very Large Data Bases, pp. 423–434, 1999.

[209] F. Korn, T. Johnson, and H. V. Jagadish, "Range selectivity estimation for continuous attributes," in Proceedings of the International Conference on Scientific and Statistical Database Management, pp. 244–253, 1999.

[210] E. Kushilevitz and N. Nisan, Communication Complexity. Cambridge University Press, 1997.

[211] Y.-K. Lai and G. T. Byrd, "High-throughput sketch update on a low-power stream processor," in Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2006.

[212] M. R. Leadbetter, G. Lindgren, and H. Rootzen, Extremes and Related Properties of Random Sequences and Processes: Springer Series in Statistics. Springer, 1983.

[213] G. M. Lee, H. Liu, Y. Yoon, and Y. Zhang, "Improving sketch reconstruction accuracy using linear least squares method," in Internet Measurement Conference (IMC), 2005.

[214] L. Lim, M. Wang, and J. S. Vitter, "SASH: A self-adaptive histogram set for dynamically changing workloads," in Proceedings of the International Conference on Very Large Data Bases, pp. 369–380, 2003.

[215] L. Lim, M. Wang, and J. S. Vitter, "CXHist: An on-line classification-based histogram for XML string selectivity estimation," in Proceedings of the International Conference on Very Large Data Bases, pp. 1187–1198, 2005.

[216] X. Lin and Q. Zhang, "Error minimization for approximate computation of range aggregate," in Proceedings of the International Conference on Database Systems for Advanced Applications, pp. 165–172, 2003.

[217] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani, "Counter braids: A novel counter architecture for per-flow measurement," in SIGMETRICS, 2008.

[218] G. Luo, C. J. Ellmann, P. J. Haas, and J. F. Naughton, "A scalable hash ripple join algorithm," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 252–262, New York, NY, USA, 2002.

[219] J. Luo, X. Zhou, Y. Zhang, H. T. Shen, and J. Li, "Selectivity estimation by batch-query based histogram and parametric method," in Proceedings of the Australasian Database Conference (ADC), pp. 93–102, 2007.

[220] MADlib library for scalable analytics, http://madlib.net.

[221] S. Mallat, A Wavelet Tour of Signal Processing. Academic Press, second ed., 1999.

[222] Y. Matias and D. Urieli, "Optimal workload-based weighted wavelet synopses," in Proceedings of the International Conference on Database Theory, Edinburgh, Scotland, January 2005.

[223] Y. Matias and D. Urieli, "Optimal workload-based weighted wavelet synopses," Theoretical Computer Science, vol. 371, no. 3, pp. 227–246, 2007.

[224] Y. Matias, J. S. Vitter, and M. Wang, "Wavelet-based histograms for selectivity estimation," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 448–459, Seattle, Washington, June 1998.

[225] Y. Matias, J. S. Vitter, and M. Wang, "Dynamic maintenance of wavelet-based histograms," in Proceedings of the International Conference on Very Large Data Bases, Cairo, Egypt, September 2000.

[226] A. Metwally, D. Agrawal, and A. E. Abbadi, "Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic," in International Conference on Extending Database Technology, 2008.

[227] Microsoft, "Microsoft StreamInsight," http://msdn.microsoft.com/en-us/library/ee362541.aspx, 2008.

[228] J. Misra and D. Gries, "Finding repeated elements," Science of Computer Programming, vol. 2, pp. 143–152, 1982.

[229] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

[230] G. Moerkotte, T. Neumann, and G. Steidl, "Preventing bad plans by bounding the impact of cardinality estimation errors," PVLDB, vol. 2, no. 1, pp. 982–993, 2009.

[231] M. Monemizadeh and D. P. Woodruff, "1-pass relative-error lp-sampling with applications," in ACM-SIAM Symposium on Discrete Algorithms, 2010.

[232] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge University Press, 1995.

[233] J. I. Munro and M. S. Paterson, "Selection and sorting with limited storage," Theoretical Computer Science, vol. 12, pp. 315–323, 1980.

[234] M. Muralikrishna and D. J. DeWitt, "Equi-depth histograms for estimating selectivity factors for multi-dimensional queries," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 28–36, 1988.

[235] S. Muthukrishnan, "Subquadratic algorithms for workload-aware Haar wavelet synopses," in Proceedings of the International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), Hyderabad, India, December 2005.

[236] S. Muthukrishnan, "Data streams: Algorithms and applications," Foundations and Trends in Theoretical Computer Science, vol. 1, no. 2, pp. 117–236, 2005.

[237] S. Muthukrishnan, V. Poosala, and T. Suel, "On rectangular partitionings in two dimensions: Algorithms, complexity, and applications," in Proceedings of the International Conference on Database Theory, pp. 236–256, 1999.

[238] S. Muthukrishnan and M. Strauss, "Maintenance of multidimensional histograms," in Proceedings of the International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), vol. 2914 of Lecture Notes in Computer Science, pp. 352–362, Springer, 2003.

[239] S. Muthukrishnan, M. Strauss, and X. Zheng, "Workload-optimal histograms on streams," in Proceedings of ESA, pp. 734–745, 2005.

[240] J. Neyman, "On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection," Journal of the Royal Statistical Society, vol. 97, pp. 558–625, 1934.

[241] A. O'Hagan and J. J. Forster, Bayesian Inference. Volume 2B of Kendall's Advanced Theory of Statistics. Arnold, second ed., 2004.

[242] F. Olken, "Random sampling from databases," Technical Report LBL-32883, Lawrence Berkeley National Laboratory, 1993.

[243] F. Olken and D. Rotem, "Simple random sampling from relational databases," in Proceedings of the International Conference on Very Large Data Bases, pp. 160–169, 1986.

[244] F. Olken and D. Rotem, "Random sampling from B+ trees," in Proceedings of the International Conference on Very Large Data Bases, pp. 269–277, 1989.

[245] F. Olken and D. Rotem, "Maintenance of materialized views of sampling queries," in Proceedings of the International Conference on Data Engineering, pp. 632–641, 1992.

[246] F. Olken, D. Rotem, and P. Xu, "Random sampling from hash files," SIGMOD Record, vol. 19, no. 2, pp. 375–386, 1990.

[247] C. Pang, Q. Zhang, D. Hansen, and A. Maeder, "Unrestricted wavelet synopses under maximum error bound," in Proceedings of the International Conference on Extending Database Technology, Saint Petersburg, Russia, March 2009.

[248] N. Pansare, V. Borkar, C. Jermaine, and T. Condie, "Online aggregation for large MapReduce jobs," PVLDB, vol. 5, 2011. (To appear).

[249] A. Pavan and S. Tirthapura, "Range-efficient counting of distinct elements in a massive data stream," SIAM Journal on Computing, vol. 37, no. 2, pp. 359–379, 2007.

[250] H. T. A. Pham and K. C. Sevcik, "Structure choices for two-dimensional histogram construction," in Proceedings of CASCON, pp. 13–27, 2004.

[251] G. Piatetsky-Shapiro and C. Connell, "Accurate estimation of the number of tuples satisfying a condition," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 256–276, 1984.

[252] N. Polyzotis and M. N. Garofalakis, "XCluster synopses for structured XML content," in Proceedings of the International Conference on Data Engineering, p. 63, 2006.

[253] N. Polyzotis and M. N. Garofalakis, "XSKETCH synopses for XML data graphs," ACM Transactions on Database Systems, vol. 31, no. 3, pp. 1014–1063, 2006.

[254] V. Poosala and V. Ganti, "Fast approximate answers to aggregate queries on a data cube," in Proceedings of the International Conference on Scientific and Statistical Database Management, pp. 24–33, 1999.

[255] V. Poosala, V. Ganti, and Y. E. Ioannidis, "Approximate query answering using histograms," IEEE Data Engineering Bulletin, vol. 22, no. 4, pp. 5–14, 1999.

[256] V. Poosala and Y. E. Ioannidis, "Selectivity estimation without the attribute value independence assumption," in Proceedings of the International Conference on Very Large Data Bases, pp. 486–495, 1997.

[257] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita, "Improved histograms for selectivity estimation of range predicates," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 294–305, 1996.

[258] L. Qiao, D. Agrawal, and A. E. Abbadi, "RHist: Adaptive summarization over continuous data streams," in Proceedings of the ACM Conference on Information and Knowledge Management, pp. 469–476, 2002.

[259] V. Raman and J. M. Hellerstein, "Potter's Wheel: An interactive data cleaning system," in Proceedings of the International Conference on Very Large Data Bases, pp. 381–390, 2001.

[260] V. Raman, B. Raman, and J. M. Hellerstein, "Online dynamic reordering," VLDB Journal, vol. 9, no. 3, pp. 247–260, 2000.

[261] V. Raman, G. Swart, L. Qiao, F. Reiss, V. Dialani, D. Kossmann, I. Narang, and R. Sidle, "Constant-time query processing," in Proceedings of the International Conference on Data Engineering, pp. 60–69, 2008.

[262] R. L. Read, D. S. Fussell, and A. Silberschatz, "Computing bounds on aggregate operations over uncertain sets using histograms," in Proceedings of Post-ILPS '94 Workshop on Uncertainty in Databases and Deductive Systems, pp. 107–117, 1994.

[263] F. Reiss, M. N. Garofalakis, and J. M. Hellerstein, "Compact histograms for hierarchical identifiers," in Proceedings of the International Conference on Very Large Data Bases, pp. 870–881, 2006.

[264] J. Rissanen, "Stochastic complexity and modeling," Annals of Statistics, vol. 14, no. 3, pp. 1080–1100, 1986.

[265] F. Rusu and A. Dobra, "Sketches for size of join estimation," ACM Transactions on Database Systems, vol. 33, no. 3, 2008.

[266] D. Sacharidis, A. Deligiannakis, and T. Sellis, "Hierarchically-compressed wavelet synopses," The VLDB Journal, vol. 18, no. 1, pp. 203–231, January 2009.

[267] A. D. Sarma, O. Benjelloun, A. Y. Halevy, S. U. Nabar, and J. Widom, "Representing uncertain data: Models, properties, and algorithms," VLDB Journal, vol. 18, no. 5, pp. 989–1019, 2009.

[268] C.-E. Sarndal, B. Swensson, and J. Wretman, Model-Assisted Survey Sampling. Springer, second ed., 1992.

[269] R. T. Schweller, Z. Li, Y. Chen, Y. Gao, A. Gupta, Y. Zhang, P. A. Dinda, M.-Y. Kao, and G. Memik, "Reversible sketches: Enabling monitoring and analysis over high-speed data streams," IEEE/ACM Transactions on Networking, vol. 15, no. 5, 2007.

[270] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, 1992.

[271] S. Seshadri, "Probabilistic methods in query processing," PhD Thesis, University of Wisconsin at Madison, Madison, WI, USA, 1992.

[272] J. Shao and D. Tu, The Jackknife and Bootstrap. Springer Series in Statistics, 1995.

[273] B. W. Silverman, Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

[274] J. Spiegel and N. Polyzotis, "TuG synopses for approximate query answering," ACM Transactions on Database Systems, vol. 34, no. 1, 2009.

[275] U. Srivastava, P. J. Haas, V. Markl, and N. Megiddo, "ISOMER: Consistent histogram construction using query feedback," in Proceedings of the International Conference on Data Engineering, 2006.

[276] E. J. Stollnitz, T. D. DeRose, and D. H. Salesin, Wavelets for Computer Graphics — Theory and Applications. San Francisco, CA: Morgan Kaufmann Publishers, 1996.

[277] M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik, "Requirements for science data bases and SciDB," in Proceedings of CIDR, 2009.

[278] N. Thaper, P. Indyk, S. Guha, and N. Koudas, "Dynamic multidimensional histograms," in ACM SIGMOD International Conference on Management of Data, 2002.

[279] D. Thomas, R. Bordawekar, C. C. Aggarwal, and P. S. Yu, "On efficient query processing of stream counts on the Cell processor," in IEEE International Conference on Data Engineering, 2009.

[280] M. Thorup, "Even strongly universal hashing is pretty fast," in ACM-SIAM Symposium on Discrete Algorithms, 2000.

[281] M. Thorup and Y. Zhang, "Tabulation based 4-universal hashing with applications to second moment estimation," in ACM-SIAM Symposium on Discrete Algorithms, 2004.

[282] S. Venkataraman, D. X. Song, P. B. Gibbons, and A. Blum, "New streaming algorithms for fast detection of superspreaders," in Network and Distributed System Security Symposium (NDSS), 2005.

[283] J. S. Vitter, "Random sampling with a reservoir," ACM Transactions on Mathematical Software, vol. 11, no. 1, pp. 37–57, 1985.

[284] J. S. Vitter and M. Wang, "Approximate computation of multidimensional aggregates of sparse data using wavelets," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania, May 1999.

[285] M. P. Wand and M. C. Jones, Kernel Smoothing. Chapman and Hall, 1995.

[286] H. Wang and K. C. Sevcik, "Utilizing histogram information," in Proceedings of CASCON, p. 16, 2001.

[287] H. Wang and K. C. Sevcik, "A multi-dimensional histogram for selectivity estimation and fast approximate query answering," in Proceedings of CASCON, pp. 328–342, 2003.

[288] H. Wang and K. C. Sevcik, "Histograms based on the minimum description length principle," VLDB Journal, vol. 17, no. 3, 2008.

[289] K. Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, "A linear-time probabilistic counting algorithm for database applications," ACM Transactions on Database Systems, vol. 15, no. 2, p. 208, 1990.

[290] J. R. Wieland and B. L. Nelson, "How simulation languages should report results: A modest proposal," in Proceedings of Winter Simulation Conference, pp. 709–715, 2009.

[291] R. Winter and K. Auerbach, "The big time: 1998 winter VLDB survey," Database Programming and Design, August 1998.

[292] D. Woodruff, "Optimal space lower bounds for all frequency moments," in ACM-SIAM Symposium on Discrete Algorithms, 2004.

[293] M. Wu and C. Jermaine, "A Bayesian method for guessing the extreme values in a data set," in Proceedings of the International Conference on Very Large Data Bases, pp. 471–482, 2007.

[294] K. Yi, F. Li, M. Hadjieleftheriou, G. Kollios, and D. Srivastava, "Randomized synopses for query assurance on data streams," in IEEE International Conference on Data Engineering, 2008.

[295] Q. Zhang and X. Lin, "On linear-spline based histograms," in Proceedings of the International Conference on Web-Age Information Management, pp. 354–366, 2002.

[296] Q. Zhang and W. Wang, "A fast algorithm for approximate quantiles in high speed data streams," in Proceedings of the International Conference on Scientific and Statistical Database Management, p. 29, 2007.

[297] X. Zhang, M. L. King, and R. J. Hyndman, "Bandwidth selection for multivariate kernel density estimation using MCMC," in Econometric Society 2004 Australasian Meetings, 2004.

[298] Q. Zhu and P.-A. Larson, "Building regression cost models for multidatabase systems," in Proceedings of the International Conference on Parallel and Distributed Information Systems, pp. 220–231, 1996.

[299] V. M. Zolotarev, One Dimensional Stable Distributions, volume 65 of Translations of Mathematical Monographs. American Mathematical Society, 1983.
