Calculating frequency moments of Data Stream Asad Narayanan COMP 5703 1

Upload: gary-watkins

Post on 19-Jan-2016



Calculating frequency moments of Data Stream

Asad Narayanan

COMP 5703


Outline

• What is a data stream?
• Constraints of data streams
• Applications of data streams
• Frequency moments
• Calculating frequency moments
• Calculating F0 using FM-Sketch
• Calculating F0 using KMV
• Complexity of calculating F0
• Calculating Fk
• Complexity of calculating Fk


Data Stream

• A sequence of voluminous data arriving at high speed
• Elements can only be accessed one at a time
• Elements arrive in arbitrary order
• The data cannot be stored and processed later
• Example: network analysis, which may see around 1 million packets per second
• Our aim is to compute statistics on this data


Formal Definition

• A sequence of data A = (a1, a2, a3, …, am) of length m
• Each ai ∈ {1, 2, 3, …, N}, so there are at most N distinct elements
• mi = |{ j | aj = i, 1 ≤ j ≤ m }| is the number of occurrences of element i in A
• m and N are very large
• It is impossible to store A on local disk

[Figure: elements a1, a2, a3, …, am arriving along a time axis from 0 to T]


Applications

• Sensor networks
• Network monitoring systems
• Data stream mining
• Detecting credit card fraud
• Database systems


Limitations

• Recording all the data is impossible
• If we try to record every IP address seen on a network, we need space of the order 2^32
• Data must be processed in one pass
• Processing must use limited space and time
• Obtain a sketch of the data that can be reused to compute statistics


Frequency Moments

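• A powerful statistical tool for determining summary ("demographic") information about the data
• The k-th frequency moment of sequence A, for k ≥ 0, is defined as:

  $F_k = \sum_{i=1}^{N} m_i^k$

  where mi is the number of occurrences of element i (taking $0^0 = 0$, so elements that never occur do not count toward F0)
• F0 is the number of distinct elements in A
• F1 is the total number of elements in A, that is, the length m
• Fk for k ≥ 2 gives an idea of how skewed the data distribution is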
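As a point of reference, the definition can be evaluated exactly when the whole stream fits in memory. The following is only an illustrative sketch; the function name `frequency_moment` and the sample stream are assumptions, not part of the slides.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k: the sum of m_i**k over the distinct items i in the stream."""
    counts = Counter(stream)      # m_i for every distinct item (Omega(N) space)
    if k == 0:
        return len(counts)        # F_0 is the number of distinct items
    return sum(c ** k for c in counts.values())

stream = [1, 2, 1, 3, 1, 2]       # m_1 = 3, m_2 = 2, m_3 = 1
print([frequency_moment(stream, k) for k in range(3)])  # [3, 6, 14]
```

This exact version keeps one counter per distinct item, which is precisely the memory cost that streaming algorithms try to avoid.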


Calculating Frequency moments

• The direct approach requires Ω(N) memory to store mi for every distinct element ai ∈ {1, 2, 3, …, N}
• Memory is limited, so we need an algorithm that computes Fk in much less memory
• This can be achieved if we are ready to compromise on accuracy
• We look for an algorithm that computes an (ε, δ)-approximation of Fk, that is, a value F'k with $\Pr[\,|F'_k - F_k| \le \varepsilon F_k\,] \ge 1 - \delta$
• F'k is the (ε, δ)-approximated value of Fk
• ε is the approximation parameter and δ is the confidence parameter


Calculating F0

• F0 is the zeroth frequency moment
• It represents the number of distinct elements in the data sequence
• Main application of F0: query optimizers of large databases
• It gives the number of distinct values in a column without performing an expensive sort of the entire column
• The first algorithm to estimate F0 was developed by Flajolet and Martin in their paper "Probabilistic counting algorithms for database applications"
• Another major contribution was the K-Minimum Value (KMV) algorithm for estimating the number of distinct elements


Calculating F0 (FM-Sketch method)

• Inspired by Robert Morris's paper "Counting large numbers of events in small registers"
• Assumes there exists an ideal hash function that distributes the elements of the sequence uniformly over the hash space
• The hash space is represented by a bit string BITMAP[] of length L, initialized to 0
• The length L is assumed to be of the order of log(N)


FM-Sketch method (contd..)

• Let bit(y, k) be the k-th bit in the binary representation of y
• ρ(y), also written p(y), is the position of the least significant 1-bit in the binary representation of y
• Let A be the data stream sequence, of length m
• BITMAP[0…L−1] represents the hash space

bit:      0 1 1 1 0 1 0 0 0 0
position: 0 1 2 3 4 5 6 7 8 9

In the given example (position 0, the least significant bit, is written leftmost), ρ(y) = 1 and bit(y, 4) = 0.
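The two helper functions can be sketched in Python as follows. The names `bit` and `rho` follow the slide; returning `length` for y = 0 is an assumption added here so that the function is defined for every input.

```python
def bit(y, k):
    """k-th bit of y, where position 0 is the least significant bit."""
    return (y >> k) & 1

def rho(y, length=32):
    """Position of the least significant 1-bit of y (assumed `length` if y == 0)."""
    if y == 0:
        return length
    return (y & -y).bit_length() - 1   # y & -y isolates the lowest set bit

# The slide writes bit strings with position 0 on the left, so the string is
# reversed before being parsed as an ordinary binary literal.
y = int("0111010000"[::-1], 2)
print(rho(y), bit(y, 4))  # 1 0
```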


FM-Sketch Algorithm

For i := 0 to L−1: BITMAP[i] := 0
For all x in A, do:
    index := ρ(hash(x))        (ρ = position of the least significant 1-bit)
    If BITMAP[index] = 0, then BITMAP[index] := 1
    EndIf
EndFor
B := position of the leftmost 0-bit of BITMAP[]
Return 2^B
End
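A minimal Python sketch of the algorithm above. The slides do not name a concrete hash function, so `hash32` (32 bits taken from SHA-256) is an assumed stand-in for the ideal hash; `fm_estimate` and its parameters are likewise illustrative names. The function returns 2^B as on the slide; Flajolet and Martin's paper additionally divides by a correction constant of about 0.77351.

```python
import hashlib

def hash32(x):
    """Assumed stand-in for the ideal hash: 32 bits taken from SHA-256."""
    digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def rho(y, length):
    """Position of the least significant 1-bit of y (`length` if y == 0)."""
    return length if y == 0 else (y & -y).bit_length() - 1

def fm_estimate(stream, hash_fn=hash32, length=32):
    """Single-bitmap FM-Sketch estimate of the number of distinct items."""
    bitmap = [0] * length
    for x in stream:
        index = rho(hash_fn(x), length)
        if index < length:            # a hash value of 0 sets no bit
            bitmap[index] = 1
    # B is the lowest position that is still 0 (the "leftmost" 0 on the slide)
    b = bitmap.index(0) if 0 in bitmap else length
    return 2 ** b
```

Because only one bit per position is stored, repeated items never change the sketch, which is why the estimate depends on the distinct items only.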


FM-Sketch Example

• Let the data stream be: a1, a2, a3, a4
• Let the hashed values be (position 0 written leftmost):
  H(a1) = 011001
  H(a2) = 100101
  H(a3) = 101100
  H(a4) = 011011
• The first 1-bit is at position 1 for a1 and a4, and at position 0 for a2 and a3, so according to the algorithm BITMAP = 11000000
• The first occurrence of a 0-bit is at position 2
• F0 ≈ 2^2 = 4
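The example can be traced step by step. This is a small illustrative script working directly on the bit strings from the slide; it assumes, as the slide does, that position 0 is the leftmost character, so the position of the first '1' in each string is its ρ value.

```python
hashed = {"a1": "011001", "a2": "100101", "a3": "101100", "a4": "011011"}
L = 8                                   # bitmap length used on the slide

bitmap = ["0"] * L
for name, bits in hashed.items():
    index = bits.index("1")             # position of the first 1-bit from the left
    bitmap[index] = "1"

bitmap = "".join(bitmap)
b = bitmap.index("0")                   # first position that is still 0
print(bitmap, b, 2 ** b)  # 11000000 2 4
```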


FM-Sketch (Contd..)

• If there are N distinct elements in a data stream:
• If i >> log(N), then BITMAP[i] is almost certainly 0
• If i << log(N), then BITMAP[i] is almost certainly 1
• For i ≈ log(N), BITMAP[] shows a fringe of 0s and 1s
• The algorithm was tested on the online documentation of a UNIX system
• The text has 26692 lines in total
• 16405 lines were distinct
• After hashing the lines, the following BITMAP was obtained: BITMAP = 111111111111001100000000
• The leftmost 0 appeared at position 12 and the rightmost 1 at position 15
• 2^14 = 16384, close to the true count of 16405
• To improve accuracy, the algorithm is extended to use an array of bit strings instead of one, and the position of the leftmost 0 is averaged over them
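One way to realise the averaging idea is the stochastic-averaging variant described in Flajolet and Martin's paper: each element updates exactly one of several bitmaps, and the positions of the lowest 0 are averaged. The sketch below assumes that variant; `hash64`, `fm_averaged_estimate` and the default of 64 bitmaps are illustrative choices, and 0.77351 is the correction constant from the paper.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis

def hash64(x):
    """Assumed stand-in for the ideal hash: 64 bits taken from SHA-256."""
    digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def rho(y, length):
    """Position of the least significant 1-bit of y (`length` if y == 0)."""
    return length if y == 0 else (y & -y).bit_length() - 1

def fm_averaged_estimate(stream, num_bitmaps=64, length=32):
    """Average the lowest-0 position over several bitmaps, then rescale."""
    bitmaps = [[0] * length for _ in range(num_bitmaps)]
    for x in stream:
        h = hash64(x)
        which = h % num_bitmaps                 # bitmap chosen by the low bits
        index = rho(h // num_bitmaps, length)   # rho of the remaining bits
        if index < length:
            bitmaps[which][index] = 1
    positions = [bm.index(0) if 0 in bm else length for bm in bitmaps]
    mean_position = sum(positions) / num_bitmaps
    return num_bitmaps / PHI * 2 ** mean_position
```

With more bitmaps the estimate fluctuates less, at the cost of proportionally more memory.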


Calculating F0 (KMV Algorithm)

• The problem with algorithms based on FM-Sketch is that they assume an ideal hash function that distributes data uniformly over the hash space
• In practice it is difficult to obtain such a hash function
• Bar-Yossef et al. [4] introduced the k-minimum value (KMV) algorithm for estimating the number of distinct elements in a data stream
• It uses a similar hash function h, normalised to [0, 1]: h: [m] → [0, 1]


Calculating F0 (KMV Algorithm)

• A limit t is fixed on the number of values kept in the hash space
• t is assumed to be of the order $O(1/\varepsilon^2)$
• At any point, the hash space contains the t smallest hash values seen so far
• V = Max(KMV) is the maximum of these t hash values
• V is used to calculate F'0 using the formula $F'_0 = t / V$


Calculating F0 (KMV Algorithm)

Initialize KMV with the first t distinct hash values
for a in a1 to am do
    if h(a) < Max(KMV) and h(a) is not in KMV then
        Remove Max(KMV) from KMV
        Insert h(a) into KMV
    end if
end for
V = Max(KMV)
return t/V
end
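A possible Python rendering of the pseudocode above. It keeps the t smallest hash values in a max-heap (values are negated because `heapq` is a min-heap) plus a set for duplicate checks. `unit_hash` is an assumed SHA-256 based stand-in for h, and returning the exact count when fewer than t distinct values were seen is an added convention, not something stated on the slide.

```python
import hashlib
import heapq

def unit_hash(x):
    """Assumed hash h: items -> [0, 1), built from 64 bits of SHA-256."""
    digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def kmv_estimate(stream, t, hash_fn=unit_hash):
    """Estimate the number of distinct items as t / (t-th smallest hash)."""
    heap = []       # negated hash values, so heap[0] is minus the current maximum
    kept = set()    # the same hash values, for constant-time duplicate checks
    for a in stream:
        h = hash_fn(a)
        if h in kept:
            continue                              # already among the t smallest
        if len(heap) < t:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heapreplace(heap, -h)  # drop the current maximum
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < t:
        return len(heap)                          # fewer than t distinct values
    return t / -heap[0]
```

For instance, with t = 4 and eight hash values whose fourth smallest is 0.5, the function returns 4 / 0.5 = 8.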


Example(KMV Algorithm)

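• Let 8 distinct values of the stream be hashed to points in [0, 1], as shown in the figure
• Let t = 4; then we keep only the 4 smallest hashed values (highlighted in red)
• This means V = Max(4 smallest hashed values) ≈ 0.5
• F0 ≈ t/V = 4/0.5 = 8

[Figure: eight hash values plotted on the interval from 0 to 1; the four smallest are highlighted in red, the largest of them near 0.5]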


Complexity of algorithms

• Each hash value requires $O(\log m)$ bits of memory
• The number of hash values t is of the order $O(1/\varepsilon^2)$
• Therefore the KMV algorithm can be implemented in $O\left(\frac{1}{\varepsilon^2}\log m\right)$ bits of memory
• The access time can be reduced by storing the t hash values in a binary tree
• The time complexity per element is then reduced to $O\left(\log\frac{1}{\varepsilon}\cdot\log m\right)$


Calculating Fk

• Alon et al. estimate Fk by defining a random variable X that can be computed within the given space and time
• The approximate value of Fk is the expectation of the random variable X, E(X)
• Construct the random variable X as follows:
• Select a random member ap of the sequence A, at index p, with value ap = l ∈ {1, 2, …, N}
• Let $r = |\{q : q \ge p,\ a_q = l\}|$ be the number of occurrences of l among the members of A from position p onwards (including ap itself)
• Random variable $X = m\,(r^k - (r-1)^k)$


Calculating FK (Contd…)

• Let S1 be of the order $O(N^{1-1/k}/\varepsilon^2)$ and S2 be of the order $O(\log(1/\delta))$
• The algorithm takes S2 random variables Y1, Y2, …, YS2 and outputs their median Y
• Each Yi is the average of Xij, 1 ≤ j ≤ S1, where the Xij are independent copies of X
• Next we compute E(X), which equals Fk
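The construction of X and the median-of-averages step can be simulated offline as below. This is a sketch only: it keeps the stream in a list so that r can be recomputed for any sampled position, whereas a one-pass implementation would sample the positions in advance and maintain the counters while streaming. The names `ams_x` and `ams_estimate`, and the explicit `rng` argument, are illustrative assumptions.

```python
import random
from statistics import mean, median

def ams_x(stream, p, k):
    """X = m * (r^k - (r-1)^k), where r counts a_p among positions q >= p."""
    m = len(stream)
    r = sum(1 for q in range(p, m) if stream[q] == stream[p])
    return m * (r ** k - (r - 1) ** k)

def ams_estimate(stream, k, s1, s2, rng):
    """Median of s2 group averages, each over s1 independent copies of X."""
    m = len(stream)
    group_means = []
    for _ in range(s2):
        xs = [ams_x(stream, rng.randrange(m), k) for _ in range(s1)]
        group_means.append(mean(xs))
    return median(group_means)

stream = [1, 2, 1, 3, 1, 2]
# Averaging X over every position p gives F_k exactly (here F_2 = 14).
print(sum(ams_x(stream, p, 2) for p in range(len(stream))) / len(stream))  # 14.0
print(ams_estimate(stream, 2, s1=50, s2=5, rng=random.Random(0)))
```

Averaging reduces the variance of a single X, and taking the median of several averages boosts the confidence of the final answer.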


Calculating FK (Contd…)
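A sketch of the standard argument for why E(X) = Fk, written with r_p for the value of r when position p is selected (the subscript is notation introduced here). Position p is chosen uniformly, so each p has probability 1/m. Grouping positions by element value, the m_i occurrences of element i yield r = m_i, m_i − 1, …, 1, and the inner sum telescopes:

```latex
\begin{align*}
E(X) &= \sum_{p=1}^{m} \frac{1}{m}\, m \left( r_p^{k} - (r_p - 1)^{k} \right) \\
     &= \sum_{i=1}^{N} \sum_{j=1}^{m_i} \left( j^{k} - (j-1)^{k} \right) \\
     &= \sum_{i=1}^{N} m_i^{k} \;=\; F_k
\end{align*}
```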


Complexity of Fk

• Each random variable X stores ap and r
• So the space required for X is of the order O(log(m) + log(N))
• There are S1 × S2 random variables
• Hence the total space complexity of the algorithm is of the order $O\left(\frac{k\log(1/\delta)}{\varepsilon^2}\,N^{1-1/k}\,(\log N + \log m)\right)$


Calculating F2

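• Using the previously discussed algorithm, we can compute F2 in $O(\sqrt{N}\,(\log m + \log N))$ bits (for fixed ε and δ)
• Alon et al. simplified this algorithm in their paper using four-wise independent random variables
• The complexity of the algorithm is reduced to $O\left(\frac{\log(1/\delta)}{\varepsilon^2}\,(\log N + \log m)\right)$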
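The simplified F2 estimator can be sketched as follows: each copy keeps a single counter Z, the sum of a ±1 sign for every arriving element, and uses Z² as its estimate. The sign function here is an assumption: a random degree-3 polynomial over a prime field, with the parity of its value mapped to ±1, which is a common way to approximate four-wise independence. Stream items are assumed to be non-negative integers, and all names are illustrative.

```python
import random
from statistics import mean, median

PRIME = (1 << 61) - 1   # a Mersenne prime used as the field size

def make_sign_hash(rng):
    """Random degree-3 polynomial over Z_PRIME, mapped to {-1, +1} by parity."""
    a, b, c, d = (rng.randrange(PRIME) for _ in range(4))
    def sign(x):
        value = (((a * x + b) * x + c) * x + d) % PRIME
        return 1 if value & 1 else -1
    return sign

def f2_estimate(stream, s1, s2, rng):
    """Median over s2 groups of the mean of Z^2 over s1 counters."""
    signs = [[make_sign_hash(rng) for _ in range(s1)] for _ in range(s2)]
    z = [[0] * s1 for _ in range(s2)]
    for x in stream:                       # one pass; only the counters are stored
        for i in range(s2):
            for j in range(s1):
                z[i][j] += signs[i][j](x)
    return median(mean(v * v for v in row) for row in z)
```

Each copy needs only one counter and the four polynomial coefficients, which is where the logarithmic space per copy comes from.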


Reference

1. Alon, Noga, Yossi Matias, and Mario Szegedy. 'The Space Complexity of Approximating the Frequency Moments'. Journal of Computer and System Sciences 58.1 (1999): 137-147.
2. Woodruff, David. 'Frequency Moments'. (2005): 2-3.
3. Indyk, Piotr, and David Woodruff. 'Optimal Approximations of the Frequency Moments of Data Streams'. Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing - STOC '05 (2005): 202.
4. Bar-Yossef, Ziv, et al. 'Counting Distinct Elements in a Data Stream'. International Workshop on Randomization and Approximation Techniques 2483 (2002): 1-10.
5. Flajolet, Philippe, and G. Nigel Martin. 'Probabilistic Counting Algorithms for Database Applications'. Journal of Computer and System Sciences 31.2 (1985): 182-209.
6. Morris, Robert. 'Counting Large Numbers of Events in Small Registers'. Communications of the ACM 21.10 (1978): 840-842.
7. Flajolet, Philippe. 'Approximate Counting: A Detailed Analysis'. BIT 25.1 (1985): 113-134.


Thank you!