Dealing with MASSIVE Data
Feifei Li ([email protected]), Dept. of Computer Science, FSU. Sep 9, 2008
Brief Bio
• B.A.S. in computer engineering from Nanyang Technological University in 2002
• Ph.D. in computer science from Boston University in 2007
• Research intern/visitor at AT&T Labs, IBM T. J. Watson Research Center, and Microsoft Research
• Now: Assistant Professor in the CS Department at FSU
Research Areas
• Algorithms and data structures: I/O-efficient algorithms, streaming algorithms, computational geometry, misc.
• Database applications: spatial databases, indexing, query processing, data security and privacy
• Related threads: geographic information systems, data streams, probabilistic data
Massive Data
• Massive datasets are being collected everywhere
• Storage management software is a billion-dollar industry

Examples (2002):
• Phone: AT&T 20TB phone call database, wireless tracking
• Consumer: WalMart 70TB database, buying patterns
• Web: web crawl of 200M pages and 2000M links, Google's huge indexes
• Geography: NASA satellites generate 1.2TB per day
Example: LIDAR Terrain Data
• Massive (irregular) point sets (1-10m resolution)
  – Becoming relatively cheap and easy to collect
• The Appalachian Mountains alone: between 50GB and 5TB
• Exceeds the memory limit and must be stored on disk
Example: Network Flow Data
• AT&T's IP backbone generates 500GB of flow data per day
• Gigascope: a data stream management system
  – Computes certain statistics on the fly
• Can we do the computation without storing the data?
Traditional Random Access Machine (RAM) Model
• Standard theoretical model of computation:
  – Infinite memory (how nice!)
  – Uniform access cost
• This simple model was crucial for the success of the computer industry
How to Deal with MASSIVE Data?
when there is not enough memory
Solution 1: Buy More Memory
• Expensive
• (Probably) not scalable
  – The growth rate of data is higher than the growth rate of memory
Solution 2: Cheat! (by random sampling)
• Provides approximate solutions for some problems
  – average, frequency of an element, etc.
• What if we want the exact result?
• Many problems can't be solved by sampling
  – maximum, and all problems mentioned later
Solution 3: Using the Right Computation Model
• External Memory Model
• Streaming Model
• Probabilistic Model (brief)
Computation Model for Massive Data (1): External Memory Model
Internal memory is limited but fast
External memory is unlimited but slow
Memory Hierarchy
• Modern machines have a complicated memory hierarchy
  – Levels get larger and slower further away from the CPU
  – Block sizes and memory sizes differ across levels!
• There have been attempts to model the full hierarchy, but none has been successful
  – They are too complicated!
[Diagram: memory hierarchy — L1 cache, L2 cache, RAM]
Slow I/O
• Disk systems amortize the large access time by transferring large contiguous blocks of data (8-16 KB)
• Important to store/access data so as to take advantage of blocks (locality)
• A disk access is about 10^6 times slower than a main-memory access

[Diagram: disk platter — track, magnetic surface, read/write arm, read/write head]

"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
Puzzle #1: Majority Counting
• A huge file of characters is stored on disk
• Question: is there a character that appears > 50% of the time?
• Solution 1: sort + scan
  – A few passes (O(log_{M/B} N)): we will come back to this later
• Solution 2: divide-and-conquer
  – Load a chunk into memory: N/M chunks
  – Count within each chunk and return its majority
  – The overall majority must be the majority in > 50% of the chunks
  – Iterate until the remaining data fits in memory (< M)
  – Very few passes (O(log_M N)), with geometrically decreasing input size
• Solution 3: O(1) memory, 2 passes (answer to be posted later; see the sketch below)
b a e c a d a a d a a e a b a a f a g b
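Solution 3 is left as a teaser in the slides; the classic answer (an assumption on my part, since the posted answer isn't in this transcript) is the Boyer-Moore majority vote. A minimal Python sketch:

```python
def majority(make_stream):
    """Boyer-Moore majority vote: O(1) memory, two passes.
    make_stream() returns a fresh iterator, standing in for
    re-scanning the file on disk."""
    # Pass 1: the only element that CAN be a majority survives as candidate.
    candidate, count = None, 0
    for x in make_stream():
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: verify the candidate actually exceeds 50%.
    total = hits = 0
    for x in make_stream():
        total += 1
        hits += (x == candidate)
    return candidate if 2 * hits > total else None

data = "b a e c a d a a d a a e a b a a f a g b".split()
print(majority(lambda: iter(data)))
# -> None here: 'a' appears exactly 10/20 times, which is not > 50%
```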
External Memory Model [AV88]
• N = # of items in the problem instance
• B = # of items per disk block
• M = # of items that fit in main memory
• I/O: move one block between memory and disk
• Performance measure: # of I/Os performed by the algorithm
• We assume (for convenience) that M > B^2

[Diagram: CPU (P) with main memory (M), connected to disk (D) by block I/O]
Sorting in External Memory
• Break all N elements into N/M chunks of size M each
• Sort each chunk individually in memory
• Merge the sorted chunks together
• Can merge < M/B sorted lists (queues) at once, keeping one block per list (M/B blocks) in main memory
Sorting in External Memory
• Merge sort (sketched below):
  – Create N/M memory-sized sorted lists
  – Repeatedly merge lists together, Θ(M/B) at a time
  – The number of lists shrinks per phase: N/M → (N/M)/(M/B) → (N/M)/(M/B)^2 → … → 1
  – O(log_{M/B}(N/M)) phases using O(N/B) I/Os each, i.e., O((N/B) log_{M/B}(N/B)) I/Os in total
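A toy Python sketch of the two phases, with in-memory lists standing in for disk-resident runs and illustrative values of M and B (real code would stream blocks to and from files):

```python
import heapq

def external_merge_sort(data, M=8, B=2):
    """Sketch of I/O-efficient merge sort.
    M = items that fit in memory, B = items per block (toy values)."""
    # Phase 1: create N/M sorted runs of length M each.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]

    # Phase 2: repeatedly merge fan_in = M/B runs at a time,
    # giving O(log_{M/B}(N/M)) passes over the data.
    fan_in = max(2, M // B)
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

print(external_merge_sort([6, 7, 1, 3, 2, 5, 10, 9, 4, 8]))
```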
External Searching: B-Tree
• Each node (except the root) has fan-out between B/2 and B
• Size: O(N/B) blocks on disk
• Search: O(log_B N) I/Os following a root-to-leaf path (see the sketch below)
• Insertion and deletion: O(log_B N) I/Os
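A minimal sketch of the search path, using a static B-tree built bottom-up; the node layout and construction here are illustrative stand-ins, not the full insertion/deletion machinery:

```python
from bisect import bisect_right

def build_btree(keys, B=8):
    """Build a static B-tree bottom-up over sorted keys.
    Nodes are (min_key, routing_keys, children); children is None at leaves."""
    level = [(chunk[0], chunk, None)
             for chunk in (keys[i:i + B] for i in range(0, len(keys), B))]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), B):
            group = level[i:i + B]
            routing = [child[0] for child in group[1:]]  # min key of each later child
            next_level.append((group[0][0], routing, group))
        level = next_level
    return level[0]

def search(root, key):
    """Follow a root-to-leaf path; each node visited models one I/O."""
    ios, node = 0, root
    while True:
        ios += 1
        _, node_keys, children = node
        if children is None:                 # reached a leaf
            return key in node_keys, ios
        node = children[bisect_right(node_keys, key)]

root = build_btree(list(range(0, 100000, 2)), B=64)
print(search(root, 4096))    # (True, ~log_B N node reads)
```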
Fundamental Bounds

| Operation | Internal | External |
|---|---|---|
| Scanning | N | N/B |
| Sorting | N log N | (N/B) log_{M/B}(N/B) |
| Searching | log_2 N | log_B N |

More Results

| Problem | Internal | External |
|---|---|---|
| List ranking | N | (N/B) log_{M/B}(N/B) |
| Minimal spanning tree | N log N | (N/B) log_{M/B}(N/B) · log log B |
| Offline union-find | N | (N/B) log_{M/B}(N/B) |
| Interval searching | log N + T | log_B N + T/B |
| Rectangle enclosure | log N + T | log N + T/B |
| R-tree search | — | sqrt(N/B) + T/B |
Does All the Theory Matter?
• Programs developed in the RAM model still run even when there is not enough memory
  – They run on large datasets because the OS moves blocks as needed
• The OS uses paging and prefetching strategies
  – But if the program makes scattered accesses, even a good OS cannot take advantage of block access

[Plot: running time vs. data size — once the data outgrows main memory (M) and spills to disk (D), running time explodes: thrashing!]
Toy Experiment: Permuting
• Problem:
  – Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8
    * Each element knows its correct position
  – Output: store them on disk in the right order
• Internal memory solution:
  – Just scan the original sequence and move every element to its right place!
  – O(N) time, O(N) I/Os
• External memory solution:
  – Use sorting
  – O(N log N) time, but only O((N/B) log_{M/B}(N/B)) I/Os — far fewer, as the arithmetic below shows
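Plugging in illustrative values (mine, not the slides': N = 2^30 items, B = 2^13 items per block, M = 2^25 items of memory) shows how lopsided the two I/O counts are:

```python
from math import ceil, log

N, B, M = 2**30, 2**13, 2**25       # items; illustrative sizes

naive_ios = N                        # one (random) I/O per element
passes = ceil(log(N / B, M / B))     # merge passes: log_{M/B}(N/B)
sort_ios = (N // B) * passes

print(f"move-each-element: {naive_ios:.1e} I/Os")
print(f"sort-based:        {sort_ios:.1e} I/Os")   # ~4000x fewer I/Os
```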
A Practical Example on Real Data
• Computing persistence on large terrain data
Takeaways
• Be very careful when your program's space usage exceeds the physical memory size
• If the program mostly makes highly localized accesses
  – Let the OS handle it automatically
• If the program makes many non-localized accesses
  – You need I/O-efficient techniques
• Three common techniques (recall the majority counting puzzle):
  – Convert to sort + scan
  – Divide-and-conquer
  – Other tricks
Want to know more about I/O-efficient algorithms?
A course on I/O-efficient algorithms is offered as CIS5930 (Advanced Topics in Data Management)
Computation Model for Massive Data (2): Streaming Model
You get to look at each element only once!
Cannot / don't want to / can't wait to store the data and do further processing.
Streaming Algorithms: Applications

[Diagram: streams from DSL/cable networks, enterprise networks, peers, and the PSTN pass through a Network Operations Center (NOC) into a back-end data warehouse and DBMS (Oracle, DB2) for off-line analysis — slow and expensive]

Example queries over such streams:
• Top-k query: "What are the top (most frequent) 1000 (source, dest) pairs seen over the last month?"
• SQL join query:
  SELECT COUNT (R1.source, R2.dest)
  FROM R1, R2
  WHERE R1.dest = R2.source
• Set-expression query: "How many distinct (source, dest) pairs have been seen?"

Other applications:
• Sensor networks
• Network security
• Financial applications
• Web logs and clickstreams
Puzzle #2: Find the Missing Tile
• How do you find the missing Mahjong tile by making one pass over everything?
  – Assuming you can't memorize everything (of course)
• Assign a number to each type of tile (e.g., the tiles pictured on the slide map to 8, 14, and 22)
• Compute the sum of all remaining tiles (one-pass sketch below)
  – (1+…+9 + 11+…+19 + 21+…+29)*4 – sum = the missing tile!

[Image: Mahjong tiles]
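The same trick in a few lines of Python, using the slide's tile encoding (three suits of 1-9 encoded as 1..9, 11..19, 21..29, four copies each):

```python
def missing_tile(observed_tiles, full_set):
    """One pass, O(1) extra memory: expected sum minus observed sum."""
    return sum(full_set) - sum(observed_tiles)

# Four copies of each tile in three suits.
full_set = [10 * suit + v for suit in range(3) for v in range(1, 10)] * 4
observed = full_set.copy()
observed.remove(14)                       # lose one tile
print(missing_tile(observed, full_set))   # -> 14
```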
A Research Problem: Count # Distinct Elements
• Unfortunately, there is a lower bound saying you can't do this exactly without using Ω(n) memory
• But if we allow some error, then we can approximate it well
b a e c a d a a d a a e a b a a f a g b
# distinct elements = 7
Solution: FM Sketch [FM85, AMS99]
• Take a (pseudo)random hash function h : {1,…,n} → {1,…,2^d}, where 2^d > n
• For each incoming element x, compute h(x)
  – e.g., h(5) = 10101100010000
  – Count how many trailing zeroes it has
  – Remember the maximum number of trailing zeroes seen in any h(x)
• Let Y be that maximum number of trailing zeroes (see the sketch below)
  – One can show E[2^Y] ≈ # distinct elements
    * With 2 distinct elements, "on average" there is one h(x) with 1 trailing zero
    * With 4 distinct elements, "on average" there is one h(x) with 2 trailing zeroes
    * With 8 distinct elements, "on average" there is one h(x) with 3 trailing zeroes
    * …
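A small Python sketch of a single FM sketch; the multiply-add hash is an illustrative stand-in for the pairwise-independent family the analysis assumes:

```python
import random

def trailing_zeros(x):
    """Index of the lowest set bit of x (x > 0)."""
    return (x & -x).bit_length() - 1

def fm_estimate(stream, d=32, seed=0):
    """Track Y = max # trailing zeros of h(x) over the stream; return 2**Y."""
    rng = random.Random(seed)
    a = rng.randrange(1, 2**d, 2)           # random odd multiplier
    b = rng.randrange(2**d)
    y = 0
    for x in stream:
        h = (a * hash(x) + b) % 2**d or 1   # keep h nonzero
        y = max(y, trailing_zeros(h))
    return 2 ** y

stream = "b a e c a d a a d a a e a b a a f a g b".split()
# A single sketch is noisy; take the median of several independent copies
# (the O(1/eps^2 * log(1/delta)) copies discussed two slides below).
estimates = sorted(fm_estimate(stream, seed=s) for s in range(9))
print(estimates[len(estimates) // 2])   # rough estimate of the 7 distinct items
```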
Counting Paintballs
• Imagine the following scenario:
  – A bag of n paintballs is emptied at the top of a long staircase
  – At each step, each paintball either bursts and marks the step, or bounces to the next step: 50/50 chance either way
• Looking only at the pattern of marked steps, what was n?
Counting Paintballs (cont.)
• What does the distribution of paintball bursts look like?
  – The number of bursts at each step follows a binomial distribution
  – The expected number of bursts drops geometrically: B(n, 1/2) at the 1st step, B(n, 1/4) at the 2nd, …, B(n, 1/2^Y) at the Y-th
  – Few bursts after log_2 n steps (a quick simulation follows)
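A quick simulation (mine, for illustration) confirms the picture: the deepest marked step concentrates around log_2 n.

```python
import random

def deepest_marked_step(n, rng):
    """Drop n paintballs; each bounces past a step with prob 1/2.
    Return the deepest step that gets marked by a burst."""
    deepest = 0
    for _ in range(n):
        step = 1
        while rng.random() < 0.5:   # bounce on; otherwise burst here
            step += 1
        deepest = max(deepest, step)
    return deepest

rng = random.Random(42)
n = 10_000                          # log2(n) ~ 13.3
print([deepest_marked_step(n, rng) for _ in range(5)])
# each trial's deepest step lands within a couple of steps of log2(n)
```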
Solution: FM Sketch [FM85, AMS99]
• So 2^Y is an unbiased estimator for the # distinct elements
• However, it has a large variance
  – Use O(1/ε^2 · log(1/δ)) copies to get an estimator that, with probability 1−δ, is within relative error ε
• Applications:
  – How many distinct IP addresses used a given link to send their traffic since the beginning of the day?
  – How many new IP addresses appeared today that didn't appear before?
Finding Heavy Hitters
• Which elements appeared in the stream more than 10% of the time?
• Applications:
  – Networking
    * Finding the IP addresses sending the most traffic
  – Databases
    * Iceberg queries
  – Data mining
    * Finding "hot" items (itemsets) in transaction data
• Solution
  – An exact one-pass solution is difficult
  – If we allow an approximation of ε
    * Use O(1/ε) space and O(1) time per element in the stream (see the sketch below)
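The slides don't name the algorithm; one standard way to get O(1/ε) space and O(1) amortized time per element is the Misra-Gries frequent-items summary, sketched here under that assumption:

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k-1 counters. Any element occurring
    more than N/k times is guaranteed to survive; stored counts
    underestimate true counts by at most N/k, so a second pass can
    compute exact frequencies for the surviving candidates."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):      # decrement everything
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = "b a e c a d a a d a a e a b a a f a g b".split()
print(misra_gries(stream, k=4))
# -> {'a': 7, 'b': 1}: only 'a' can exceed N/4 = 5; counts are underestimates
```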
Streaming in a Distributed World
• Large-scale querying/monitoring: inherently distributed!
  – Streams are physically distributed across remote sites, e.g., streams of UDP packets through a subset of edge routers
• The challenge is "holistic" querying/monitoring
  – Queries over the union of distributed streams: Q(S1 ∪ S2 ∪ …)
  – Streaming data is spread throughout the network

[Diagram: remote sites S1,…,S6 each observe a local bit stream; a query site at the Network Operations Center (NOC) wants Q(S1 ∪ S2 ∪ …)]
Streaming in a Distributed World
• We need timely, accurate, and efficient query answers
• Additional complexity over centralized data streaming!
• We need space/time- and communication-efficient solutions
  – Minimize network overhead
  – Maximize network lifetime (e.g., sensor battery life)
  – Cannot afford to "centralize" all streaming data
Want to know more about streaming algorithms?
A graduate-level course on streaming algorithms will be approximately offered in the next next next semester, with an error guarantee of 5%!
Or, talk to me tomorrow!
Top-k Queries
• Extremely useful in information retrieval
  – top-k sellers, popular movies, etc.

Base table:

| tuple | score |
|---|---|
| t1 | 65 |
| t2 | 30 |
| t3 | 100 |
| t4 | 80 |
| t5 | 87 |

Ranked by score (e.g., via the Threshold Algorithm, RankSQL):

| tuple | score |
|---|---|
| t3 | 100 |
| t5 | 87 |
| t4 | 80 |
| t1 | 65 |
| t2 | 30 |

top-2 = {t3, t5}
Top-k Queries on Uncertain Data

| tuple | score | confidence |
|---|---|---|
| t3 | 100 | 0.2 |
| t5 | 87 | 0.8 |
| t4 | 80 | 0.9 |
| t1 | 65 | 0.5 |
| t2 | 30 | 0.6 |

Where such (score, confidence) pairs come from: (sensor reading, reliability), (page rank, how well the page matches the query)

The top-k answer depends on the interplay between score and confidence.
Top-k Definition: U-Topk
The k tuples with the maximum probability of being the top-k.

On the table above:
• {t3, t5}: 0.2 * 0.8 = 0.16
• {t3, t4}: 0.2 * (1-0.8) * 0.9 = 0.036
• {t5, t4}: (1-0.2) * 0.8 * 0.9 = 0.576
• ...

Potential problem: the top-k could be very different from the top-(k+1). (See the sketch below.)
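A brute-force sketch of this definition, assuming tuple independence as in the basic model (real U-Topk algorithms avoid enumerating all k-subsets):

```python
from itertools import combinations

def u_topk_prob(tuples, subset):
    """P(`subset` is exactly the top-k): every member appears, and every
    non-member scoring above the lowest member does not appear."""
    min_score = min(s for name, s, p in tuples if name in subset)
    prob = 1.0
    for name, score, p in tuples:
        if name in subset:
            prob *= p
        elif score > min_score:
            prob *= 1 - p
    return prob

T = [("t3", 100, 0.2), ("t5", 87, 0.8), ("t4", 80, 0.9),
     ("t1", 65, 0.5), ("t2", 30, 0.6)]

k = 2
best = max((set(c) for c in combinations([name for name, _, _ in T], k)),
           key=lambda s: u_topk_prob(T, s))
print(best, u_topk_prob(T, best))   # {'t5', 't4'} with probability 0.576
```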
Top-k Definition: U-kRanks
The i-th tuple is the one with the maximum probability of being at rank i, for i = 1,…,k.

On the table above:
Rank 1:
• t3: 0.2
• t5: (1-0.2) * 0.8 = 0.64
• t4: (1-0.2) * (1-0.8) * 0.9 = 0.144
• ...

Rank 2:
• t3: 0
• t5: 0.2 * 0.8 = 0.16
• t4: 0.9 * (0.2*(1-0.8) + (1-0.2)*0.8) = 0.612

Potential problem: duplicated tuples in the top-k. (A sketch of the computation follows.)
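The rank-i probabilities above can be computed with a small dynamic program over the higher-scored tuples; a sketch:

```python
def rank_prob(ps, j, i):
    """P(the j-th tuple in decreasing-score order lands at rank i):
    tuple j appears AND exactly i-1 of the j higher-scored tuples appear.
    The inner loop is a standard Poisson-binomial DP."""
    dp = [1.0]                                   # dp[m] = P(m of them appear)
    for p in ps[:j]:
        nxt = [0.0] * (len(dp) + 1)
        for m, q in enumerate(dp):
            nxt[m] += q * (1 - p)
            nxt[m + 1] += q * p
        dp = nxt
    return ps[j] * (dp[i - 1] if i - 1 < len(dp) else 0.0)

names = ["t3", "t5", "t4", "t1", "t2"]          # sorted by score
ps = [0.2, 0.8, 0.9, 0.5, 0.6]                  # matching confidences
for i in (1, 2):                                # reproduces the slide's numbers
    probs = [rank_prob(ps, j, i) for j in range(len(ps))]
    best = max(range(len(ps)), key=probs.__getitem__)
    print(f"rank {i}: {names[best]} with probability {probs[best]:.3f}")
# -> rank 1: t5 (0.640), rank 2: t4 (0.612)
```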
Uncertain Data Models
• An uncertain data model represents a probability distribution over database instances (possible worlds)
• Basic model: mutual independence among all tuples
• Complete models: able to represent any distribution over possible worlds
  – Atomic independent random Boolean variables
  – Each tuple corresponds to a Boolean formula and appears iff the formula evaluates to true
  – Exponential complexity
Uncertain Data Model: x-relations
• Each x-tuple represents a discrete probability distribution over tuples (single-alternative or multi-alternative)
• x-tuples are mutually independent, and the alternatives within an x-tuple are disjoint

[Example x-relation shown on the slide; on it, U-Top2 = {t1, t2} and U-2Ranks = (t1, t3)]
Want to know more about uncertain data management?
A graduate-level course on uncertain data management will be (likely, probably) offered in the next next next next next semester.
Or, talk to me tomorrow!
Recap
• External memory model
  – Main memory is fast but limited
  – External memory is slow but unlimited
  – Aim to optimize I/O performance
• Streaming model
  – Main memory is fast but small
  – Can't store, not willing to store, or can't wait to store the data
  – Compute the desired answers in one pass
• Probabilistic data model
  – Can't store or query the exponentially many possible instances (possible worlds)
  – Compute the desired answers on the succinct representation of the probabilistic data (efficiently!! possibly allowing some error)
Thanks!
Questions?