Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
By Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum
Anthony Okorodudu
CSE 6392
2006-4-25
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
Outline
- Introduction
- Motivation
- Related Work
- Distributed Hash Tables (DHT)
- Hash Sketches
- Distributed Hash Sketches (DHS)
- Counting with DHS
- Conclusion
Introduction
- Peer-to-peer (P2P) started as a way of sharing files and CPU cycles among end users
- It has evolved into the cutting-edge networks of today
- Distributed Hash Tables (DHTs) made this feasible
  - Probabilistic guarantees for the degree of efficiency, fault tolerance, and availability
  - Data management systems of huge scale
Motivation
- Need for distributed counting mechanisms
  - File-sharing P2P systems: total number of documents shared by users
  - Sensor networks: compute aggregates in a duplicate-insensitive manner
  - Internet-scale DB systems: build histograms for query access plans
Central Goals
1. Efficiency: number of nodes contacted for counting must be small
2. Scalability and availability: large numbers of nodes may need to add elements to a (multi-)set
3. Access and storage load balancing: counting and related overheads should be fairly distributed across all nodes
Central Goals (continued)
4. Accuracy: tunable, robust, and highly accurate cardinality estimation
5. Simplicity and ease of integration: special, solution-based indexing structures should be avoided
6. Duplicate (in)sensitivity: count total number of items as well as the number of unique items in multi-sets
Distributed Counting Protocols
- One-node-per-counter protocols
- Gossip-based protocols
- Broadcast/convergecast-type protocols
- Sampling-based protocols
One-node-per-counter
- Select a node in the overlay of the DHT and use it to maintain the counter value
- Poor scalability
- Resembles a centralized system
Gossip-based
- Provide weak probabilistic semantics of “eventual consistency” for the outcome
- Every node exchanges information with a set of nodes
- Low bandwidth
- Not efficient in terms of the number of nodes to be contacted
- Low accuracy
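To make the gossip idea concrete, here is a minimal sketch of one well-known gossip aggregation scheme, push-sum averaging. It is an illustration chosen for this summary, not necessarily one of the protocols the paper surveys; the function name `push_sum` and the synchronous-round simulation are assumptions.

```python
import random

def push_sum(values, rounds, rng):
    """Simulate synchronous push-sum gossip; return each node's estimate of the average."""
    n = len(values)
    s = [float(v) for v in values]  # running sums held by each node
    w = [1.0] * n                   # running weights held by each node
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            j = rng.randrange(n)    # random gossip partner (may be the node itself)
            # Keep half of the local (sum, weight) pair and push the other half to j.
            for target in (i, j):
                new_s[target] += s[i] / 2
                new_w[target] += w[i] / 2
        s, w = new_s, new_w
    return [si / wi for si, wi in zip(s, w)]

# Hypothetical local counts held by 16 nodes.
local_counts = list(range(16))
estimates = push_sum(local_counts, rounds=100, rng=random.Random(1))
print(min(estimates), max(estimates))
```

Every node takes part in every round, which matches the slide's point that gossip is not efficient in terms of the number of nodes contacted.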
Broadcast/Convergecast-type
1. Broadcast phase
   - Querying node broadcasts the query through the network, creating a tree of nodes as the query propagates through the overlay
2. Convergecast phase
   - Each node sends its local part of the answer, along with the answers received from nodes deeper down the tree, to its “parent” node
- Similar to gossip-based protocols
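The two phases can be sketched as follows. This is a toy simulation under stated assumptions: the overlay is a small hypothetical adjacency map, the broadcast is a breadth-first flood in which the first sender becomes the parent, and the names `broadcast` and `convergecast` are illustrative.

```python
from collections import deque

def broadcast(overlay, root):
    """Broadcast phase: flood the query from root; the first sender becomes the parent."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in overlay[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

def convergecast(parent, local_counts):
    """Convergecast phase: each node adds its local count to its children's subtotals."""
    children = {node: [] for node in parent}
    root = None
    for node, p in parent.items():
        if p is None:
            root = node
        else:
            children[p].append(node)

    def subtotal(node):
        return local_counts[node] + sum(subtotal(c) for c in children[node])

    return subtotal(root)

# Hypothetical overlay links and per-node local counts.
overlay = {"q": ["a", "b"], "a": ["q", "b", "c"], "b": ["q", "a", "d"],
           "c": ["a", "d"], "d": ["b", "c"]}
local_counts = {"q": 3, "a": 5, "b": 2, "c": 7, "d": 1}
tree = broadcast(overlay, "q")
print(convergecast(tree, local_counts))  # prints 18
```

Note that the query reaches every node in the overlay, and that plain summation counts an item once per node holding it.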
Sampling-based
- Estimate the value of the counter by selectively querying a set of nodes in the network
- Sampling-based techniques suffer from accuracy issues
  - Larger samples lead to higher accuracy, but more nodes need to be contacted
- Sampling-based techniques are usually duplicate-sensitive
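A minimal sketch of the sampling idea, assuming uniform random node sampling and a simple scale-up estimator (both are illustrative choices, not details from the paper). The second part shows why summing per-node counts is duplicate-sensitive; all names and data are hypothetical.

```python
import random

def sampled_count(local_counts, sample_size, rng):
    """Estimate the global total by scaling up the mean of a uniform node sample."""
    sample = rng.sample(local_counts, sample_size)
    return len(local_counts) * sum(sample) / sample_size

def exact_distinct(node_items):
    """Ground truth: number of distinct items across all nodes."""
    return len(set().union(*node_items))

# Hypothetical network of 100 nodes, each holding 0 to 20 items.
rng = random.Random(42)
counts = [rng.randint(0, 20) for _ in range(100)]
print(sum(counts), sampled_count(counts, 10, rng), sampled_count(counts, 50, rng))

# Duplicate sensitivity: items replicated on several nodes are counted repeatedly.
node_items = [{"x", "y"}, {"y", "z"}, {"z"}]
print(sum(len(items) for items in node_items), exact_distinct(node_items))
```

A larger `sample_size` tightens the estimate but means contacting more nodes, which is the trade-off the slide describes.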
Distributed Hash Tables (DHT)
- Family of structured P2P network overlays exposing a hash-table-like interface
  1. insert(key, value)
  2. lookup(key)
- Highly efficient for point queries
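The interface above can be illustrated with a toy, single-process stand-in for a DHT. It assumes a consistent-hashing ring where each key is stored at the first node whose ID follows the key's ID; the class name `ToyDHT`, the helper `ring_id`, and the 32-bit ID space are hypothetical. A real DHT routes each request over the overlay in multiple hops, whereas this sketch has a global view of all nodes.

```python
import bisect
import hashlib

def ring_id(name, bits=32):
    """Map a node name or key to a position on the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

class ToyDHT:
    def __init__(self, node_names):
        self.ring = sorted((ring_id(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {name: {} for name in node_names}  # per-node local storage

    def owner(self, key):
        """Return the node responsible for key: its successor on the ring."""
        idx = bisect.bisect_left(self.ids, ring_id(key)) % len(self.ring)
        return self.ring[idx][1]

    def insert(self, key, value):
        self.store[self.owner(key)][key] = value

    def lookup(self, key):
        return self.store[self.owner(key)].get(key)

dht = ToyDHT([f"node-{i}" for i in range(8)])
dht.insert("song.mp3", "peer-17")
print(dht.owner("song.mp3"), dht.lookup("song.mp3"))
```

Because the owner is a pure function of the key, a point query touches exactly one node's storage.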
Hash Sketches
- First proposed as a means of estimating the cardinality of a multi-set in a database
- Used in many application domains for counting distinct elements in multi-sets
  - Approximate query answering in very large DBs, data mining on the internet graph, stream processing
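As a rough illustration of how a hash sketch counts distinct elements, here is a minimal single-bitmap sketch in the style of Flajolet and Martin's probabilistic counting. The hash choice (truncated SHA-1), the 32-bit width, and the function names are assumptions made for this example; the correction constant 0.77351 comes from the original probabilistic counting analysis.

```python
import hashlib

BITS = 32
PHI = 0.77351  # correction constant from probabilistic counting

def h(item):
    """Hash an item to a BITS-bit integer (truncated SHA-1, an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(str(item).encode("utf-8")).digest()[:4], "big")

def rho(x):
    """0-based position of the least-significant 1 bit of x."""
    return (x & -x).bit_length() - 1 if x else BITS - 1

def insert(bitmap, item):
    """Set the bit chosen by the item's hash; duplicates hit the same bit."""
    return bitmap | (1 << rho(h(item)))

def estimate(bitmap):
    """Estimate the distinct count from the position of the lowest unset bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI

sketch = 0
for item in list(range(1000)) * 3:  # every item appears three times
    sketch = insert(sketch, item)
print(bin(sketch), estimate(sketch))
```

Two properties matter for distributed use: re-inserting an item never changes the bitmap, and the sketch of a union of multi-sets is the bitwise OR of their sketches. A single bitmap gives only a coarse estimate.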
Hash Sketches (continued)
- The PCSA (Probabilistic Counting with Stochastic Averaging) algorithm assumes the existence of a pseudo-uniform hash function
- The super-LogLog algorithm relaxes the pseudo-uniform hash function constraints of PCSA
Distributed Hash Sketches (DHS)
- Fully decentralized, scalable, and efficient mechanism capable of providing estimates of the cardinality of multi-sets
- Satisfies all the central goals
- Implemented using PCSA (DHS-PCSA) or super-LogLog (DHS-sLL) hash sketches
DHS
- O(log N) cost to insert an object in an N-node DHS
- O(b * log N) bandwidth consumption if the size of the data is b bytes
- Data items are deleted if not updated within their time-to-live, so deleting an item incurs no extra cost
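The time-to-live behaviour can be sketched as node-local soft state. The slides do not say how DHS stores or checks expiry, so everything here is an assumption: the class `SoftStateStore`, its methods, and the explicit `now` argument (used instead of a wall clock to keep the example deterministic).

```python
class SoftStateStore:
    """Node-local store whose entries vanish unless refreshed within ttl."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expiry = {}  # key -> time at which the entry expires

    def put(self, key, now):
        """Insert a new entry or refresh an existing one."""
        self.expiry[key] = now + self.ttl

    def live_keys(self, now):
        """Drop expired entries, then return the keys that are still alive."""
        self.expiry = {k: t for k, t in self.expiry.items() if t > now}
        return set(self.expiry)

store = SoftStateStore(ttl=10)
store.put("a", now=0)
store.put("b", now=0)
store.put("a", now=8)            # refresh "a" before it expires
print(store.live_keys(now=12))   # prints {'a'}
```

Deletion needs no message: the owner simply stops refreshing the item and it ages out.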
DHS (continued)
- Accuracy of hash sketches increases with multiple bitmap vectors
- Either the PCSA or the super-LogLog algorithm is applied for counting
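The effect of multiple bitmap vectors can be sketched with a centralized PCSA estimator. This is an illustration of stochastic averaging only, not of how DHS distributes the bitmaps across nodes; the hash choice (truncated SHA-1) and the function names are assumptions. With m bitmaps, the standard error of PCSA is roughly 0.78/sqrt(m).

```python
import hashlib

PHI = 0.77351  # PCSA correction constant

def rho(x, width=64):
    """0-based position of the least-significant 1 bit of x."""
    return (x & -x).bit_length() - 1 if x else width - 1

def pcsa_sketch(items, m=64):
    """Build m bitmaps; each item updates exactly one of them (stochastic averaging)."""
    bitmaps = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        j, rest = x % m, x // m  # j selects the bitmap, rest selects the bit
        bitmaps[j] |= 1 << rho(rest)
    return bitmaps

def pcsa_estimate(bitmaps):
    """Average the lowest-unset-bit positions over all bitmaps, then scale."""
    m = len(bitmaps)
    total = 0
    for bitmap in bitmaps:
        r = 0
        while bitmap & (1 << r):
            r += 1
        total += r
    return (m / PHI) * 2 ** (total / m)

items = list(range(10000)) * 2  # 10,000 distinct items, each inserted twice
print(pcsa_estimate(pcsa_sketch(items, m=64)))
```

Raising m tightens the estimate at the cost of more bitmaps to store and probe.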
Counting with DHS
Conclusion
- Distributed Hash Sketches (DHS) is a fully decentralized, scalable, and efficient mechanism for providing estimates of the cardinality of multi-sets in internet-scale information systems
- DHS can be implemented using either PCSA or super-LogLog hash sketches
- DHS histograms can introduce great performance savings during query optimization
References
- N. Ntarmos, P. Triantafillou, and G. Weikum. Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. ICDE 2006.
Thanks