Log Files. O(n) Data Structure Exercises 16.1


Post on 18-Dec-2015


Page 1: Log Files. O(n) Data Structure Exercises 16.1

Log Files

9:00:12  May 6, 2004  231808  DS CITY WINNIPEG TAX/TAX  $480.00
9:01:34  May 6, 2004  452203  DS HYDR BPY/FAC           $101.71
9:02:45  May 6, 2004  764808  PR HOBBY HOBBY INC        $259.93
9:02:47  May 6, 2004  457221  DS ENBRIDGE BPY/FAC       $212.96
9:02:56  May 6, 2004  234621  IB 2146 PORTAGE           $300.00
9:04:01  May 6, 2004  111345  PR WAL-MART #2055         $183.00
9:04:23  May 6, 2004  457524  CK NO.110                 $53.15
9:04:25  May 6, 2004  234979  DS MTS BPY/FAC            $36.10

This is a dictionary of bank transactions.

How can we find the transaction for the account 111345?

Page 2: Log Files. O(n) Data Structure Exercises 16.1

log file: An implementation of a dictionary that uses an unordered vector, list, or sequence to store the key-element pairs. A log file is also called an audit trail.
Example: bank transactions

Page 3: Log Files. O(n) Data Structure Exercises 16.1

Computer log file

maydin    pts/22  wnpgmb11dc1-res-  Sat Apr 17 23:10
pzhou     pts/24  io.uwinnipeg.ca   Sat Apr 17 15:43
igwizon   pts/23  io.uwinnipeg.ca   Sat Apr 17 15:31
sliao     pts/22  wnpgmb02dc1-res-  Sat Apr 17 15:31
dbetanco  pts/22  io.uwinnipeg.ca   Sat Apr 17 14:32
dchiu     pts/22  h24-76-245-128.w  Sat Apr 17 13:55
igwizon   pts/32  io.uwinnipeg.ca   Sat Apr 17 13:53
swang4    pts/32  wnpgmb02dc1-180-  Sat Apr 17 09:40
jkwok     pts/32  io.uwinnipeg.ca   Sat Apr 17 00:41
clim1     pts/22  wnpgmb11dc1-res-  Fri Apr 16 17:42
sliao     pts/32  wnpgmb09dc1-65-8  Fri Apr 16 17:06
pzhou     pts/22  io.uwinnipeg.ca   Fri Apr 16 15:32
jpark3    pts/32  wnpgmb11dc1-res-  Fri Apr 16 15:30
sliao     pts/17  142.132.40.26     Fri Apr 16 14:33
jpark3    pts/32  slk-170-133-res.  Fri Apr 16 14:31
jpark3    pts/32  slk-170-133-res.  Fri Apr 16 13:51
maydin    pts/32  io.uwinnipeg.ca   Fri Apr 16 12:58
maydin    pts/32  wnpgmb11dc1-166-  Fri Apr 16 12:09
ttsukamo  pts/32  io.uwinnipeg.ca   Fri Apr 16 11:49

Page 4: Log Files. O(n) Data Structure Exercises 16.1

Formally, we say that a log file is an implementation of a dictionary D using a sequence S to store the items of D in arbitrary order (unordered sequence implementation).

[Figure: a sequence S holding keys in arbitrary order, with a marker labelled "Next insertion".]

Page 5: Log Files. O(n) Data Structure Exercises 16.1

Characteristics of a log file:
It is an unordered list.
Inserting an item is easy, while searching for an item with a given key takes some effort.
It is a good choice if we only need to search for items occasionally.

Assume the size of the log file is n.

Method                  Running time
insertItem(k,e)         fast
findElement(k)          O(n)
removeElement(k)        O(n)
findAllElements(k)      O(n)
removeAllElements(k)    O(n)
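The behaviour described above can be sketched in Java. This is a minimal illustration, not the course's implementation: the class name LogFile, the nested Item class, and the use of null in place of NO_SUCH_KEY are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a log file: an unordered list of (key, element) pairs.
class LogFile<K, E> {
    // One (key, element) pair of the dictionary.
    private static final class Item<K, E> {
        final K key;
        final E element;

        Item(K key, E element) {
            this.key = key;
            this.element = element;
        }
    }

    private final List<Item<K, E>> items = new ArrayList<>();

    // Insertion appends at the end of the list: no search is needed.
    void insertItem(K key, E element) {
        items.add(new Item<>(key, element));
    }

    // Search scans the list from the front: O(n) in the worst case.
    // Returns null (standing in for NO_SUCH_KEY) when the key is absent.
    E findElement(K key) {
        for (Item<K, E> item : items) {
            if (item.key.equals(key)) {
                return item.element;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LogFile<Integer, String> log = new LogFile<>();
        log.insertItem(231808, "DS CITY WINNIPEG TAX/TAX $480.00");
        log.insertItem(111345, "PR WAL-MART #2055 $183.00");
        System.out.println(log.findElement(111345));
    }
}
```

Finding the transaction for account 111345 therefore means walking the list until the key matches, which is why the search methods are O(n).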

Page 6: Log Files. O(n) Data Structure Exercises 16.1

Data Structure Exercises 16.1

Page 7: Log Files. O(n) Data Structure Exercises 16.1

Hash Tables

Recall that in Java, variables of objects of a working class are in fact references to the objects. What is stored in a variable is the memory location of the object. Therefore, a variable name, for example flower, is associated with an object through the corresponding memory address.

[Figure: the variable flower refers to the object “Rose”.]

Page 8: Log Files. O(n) Data Structure Exercises 16.1

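hash table: A structure that maps each key object to an integer in the range [0, N-1], where N is the capacity, that is, the number of key objects considered.

[Figure: the variable flower refers to the object “Rose” at address 0x2004BA0; the key “Bill Scott” is associated with the integer 46210 and with the record “Bill Scott”, BSc, History, 101, 250.]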

Page 9: Log Files. O(n) Data Structure Exercises 16.1

There are two components in a hash table: a bucket array and a hash function.

bucket array: An array A of size N, where each cell of A is thought of as a “bucket” (namely, a container of elements) and the integer N defines the capacity of the array, that is, the number of cells in A. (* Each cell may accommodate more than one item. *)

[Figure: a bucket array with cells 0 to 9.]

Page 10: Log Files. O(n) Data Structure Exercises 16.1

If the keys are integers and each key k is unique, we can access the items easily with this arrangement by placing the item in the bucket A[k].

[Figure: a bucket array with cells 0 to 9; the item (4,C) is stored in bucket 4.]

Page 11: Log Files. O(n) Data Structure Exercises 16.1

collision: More than one element has the same key. There are methods to handle collisions. However, we want to avoid collisions if we can.

[Figure: a bucket array with cells 0 to 9; the items (4,B) and (4,C) both map to bucket 4.]

Page 12: Log Files. O(n) Data Structure Exercises 16.1

Storing an item with an integer key k in A[k] seems very efficient. However, there are drawbacks with this approach. The first drawback is that the capacity of the array A may have to be much larger than what we need.

Example:
Department   Department number   Student number
History      200                 200000-200999
Physics      400                 400000-400999

If we use the student numbers as keys (integers), do we have to allocate an array of size N = 400,000 to accommodate at most 2,000 students?

Page 13: Log Files. O(n) Data Structure Exercises 16.1

The second drawback is that keys are often not integers.

Example: We want to implement an English dictionary.
Key      Element
depose   v. 1. To remove from office or a position of power. 2. To testify, esp. in writing.

How can we use a bucket array to store it?

Page 14: Log Files. O(n) Data Structure Exercises 16.1

The solution is to use a mapping function to map an arbitrary key to an integer. Examples:

1. A function h( n ) that returns integers in the range [0-2000] when n is in the range [0 - 400000].

2. A function h( word ) that returns integers in the range [0-10000] when word is any English word, “depose” for example.

hash function: A function that maps each key k in the dictionary to an integer in the range [0 - N-1], where N is the capacity of the bucket array for the hash table.

Page 15: Log Files. O(n) Data Structure Exercises 16.1

Now, rather than storing the item (k,e) in A[k], we store it in A[h(k)].

[Figure: a bucket array with cells 0 to 9; since h(200004) = 4, the item (200004,A) is stored in bucket 4.]

Page 16: Log Files. O(n) Data Structure Exercises 16.1

A good hash function is one that minimizes collisions and is easy to compute. In practice, a hash function usually consists of two steps. The first step is to map a key to an integer, called the hash code. The second step is to map the hash code to an integer within the range of indices of a bucket array; this step is called the compression map.

[Figure: arbitrary key -> hash code -> integer in (-∞, +∞) -> compression map -> index in [0, N-1] of the bucket array.]

Page 17: Log Files. O(n) Data Structure Exercises 16.1

hash code (hash value): The integer assigned to a key k.

1. Integers may be positive or negative.
2. The function to assign an integer to a key should avoid collisions as much as possible.
3. Equivalent keys should have the same integer assigned.

The Object class in Java has a hashCode() method that returns an integer. In practice, we usually need to override this function to make it suitable for our purposes. Let us consider a few of the approaches.

Page 18: Log Files. O(n) Data Structure Exercises 16.1

For those data types that can be automatically converted to an integer, such as byte, short, int, and char, we can obtain a good hash code simply by converting the value to an integer (int). For the float type, we can use the method Float.floatToIntBits(x) to convert a real number x to an integer.

Examples:
Type    Value        Hash code
char    'T'          84
byte    115          115
short   3020         3020
int     4089254562   4089254562
float   3.14159      1078530000
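A short Java sketch of these conversions follows. The class name BasicHashCodes and the overloaded static methods are assumptions for illustration; only Float.floatToIntBits is a library call.

```java
// Hypothetical helper class: hash codes for types that fit in an int.
class BasicHashCodes {
    // char, byte and short are widened to int automatically.
    static int hashCode(char c) {
        return c;
    }

    static int hashCode(byte b) {
        return b;
    }

    static int hashCode(short s) {
        return s;
    }

    // A float is reinterpreted through its 32-bit IEEE 754 representation.
    static int hashCode(float x) {
        return Float.floatToIntBits(x);
    }

    public static void main(String[] args) {
        System.out.println(hashCode('T'));
        System.out.println(hashCode((byte) 115));
        System.out.println(hashCode((short) 3020));
        System.out.println(hashCode(3.14159f));
    }
}
```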

Page 19: Log Files. O(n) Data Structure Exercises 16.1

Summing Components
Recall that long and double in Java cannot be converted to int automatically. Variables of these types require more storage than an int variable. In this case, the Summing Components approach is to split the storage into two 32-bit parts and calculate the sum of their integer representations.

static int hashCode( long i ) {
    return ( int )(( i >> 32 ) + ( int ) i );
}

Page 20: Log Files. O(n) Data Structure Exercises 16.1

Examples:

i                 i >> 32   (int)i       sum
12084             0         12084        12084
64533853369376    15025     1969746976   1969762001

For values of double type, the method Double.doubleToLongBits( x ) can be used to convert the value to a long first.

x         i                     i >> 32      (int)i       sum
1.0       4607182418800017408   1072693248   0            1072693248
3.14159   4614256650576692846   1074340345   4028335726   5102676071

Note: for 3.14159, the sum results in an overflow, which is removed automatically.
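The examples can be checked with a small Java sketch. The class name SummingComponents and the double overload are assumptions for illustration. In Java the low half (int)i is signed, so for 3.14159 the value that remains after the cast is 5102676071 − 2^32 = 807708775.

```java
// Hypothetical sketch of the Summing Components approach for long and double keys.
class SummingComponents {
    // Add the high 32 bits and the low 32 bits; the final cast drops any overflow.
    static int hashCode(long i) {
        return (int) ((i >> 32) + (int) i);
    }

    // A double is first converted to its 64-bit representation.
    static int hashCode(double x) {
        return hashCode(Double.doubleToLongBits(x));
    }

    public static void main(String[] args) {
        System.out.println(hashCode(12084L));
        System.out.println(hashCode(64533853369376L));
        System.out.println(hashCode(1.0));
        System.out.println(hashCode(4614256650576692846L));
    }
}
```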

Page 21: Log Files. O(n) Data Structure Exercises 16.1

In general, the approach of Summing Components can be extended to keys with m components. Let the key be k = (x0, x1, …, xm-1). We may use the following expression as its hash code:

$$\text{hash code} = \sum_{i=0}^{m-1} x_i$$

Page 22: Log Files. O(n) Data Structure Exercises 16.1

Examples: We decompose words into characters and compute the sum of their values, for example v(t) + v(e) + v(m) + v(p) + v(0) + v(1) for “temp01”.

Word     Hash code
temp01   535
temp10   535
stop     454
tops     454

Note that the words “temp01” and “temp10” have the same hash code, and so do “stop” and “tops”. Therefore this approach is not good for strings.
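The collisions can be reproduced with a few lines of Java. The class name SumHash is an assumption for illustration; v(c) is taken to be the character's numeric code.

```java
// Hypothetical sketch: hash a string by summing its character codes.
class SumHash {
    static int hashCode(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);   // the order of the characters is lost here
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hashCode("temp01") + " " + hashCode("temp10"));
        System.out.println(hashCode("stop") + " " + hashCode("tops"));
    }
}
```

Because addition is commutative, any two strings made of the same characters collide.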

Page 23: Log Files. O(n) Data Structure Exercises 16.1

Polynomial Hash Codes
In this approach, we choose a constant a ≠ 1 and calculate the integer value

$$\text{hash code} = x_0 a^{m-1} + x_1 a^{m-2} + \cdots + x_{m-2}\, a + x_{m-1},$$

where the key is k = (x0, x1, …, xm-2, xm-1).

Page 24: Log Files. O(n) Data Structure Exercises 16.1

Examples: With a = 9, we have the following values:

Word     Hash code
temp01   7601359
temp10   7601367
stop     94342
tops     94678

Notice that now the hash codes are no longer the same.

A carefully chosen value of the constant a can reduce the number of conflicts significantly. Good values include 33, 37, 39, and 41 according to some experimental studies.
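A polynomial hash code can be evaluated with Horner's rule, as in this sketch. The class name PolynomialHash and the parameter order are assumptions for illustration; with a = 9 it reproduces the values listed above.

```java
// Hypothetical sketch of a polynomial hash code evaluated with Horner's rule.
class PolynomialHash {
    // Computes x0*a^(m-1) + x1*a^(m-2) + ... + x(m-1); int overflow simply wraps around.
    static int hashCode(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = a * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hashCode("temp01", 9));
        System.out.println(hashCode("temp10", 9));
        System.out.println(hashCode("stop", 9));
        System.out.println(hashCode("tops", 9));
    }
}
```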

Page 25: Log Files. O(n) Data Structure Exercises 16.1

Cyclic Shift Hash Codes
Example: 5-bit cyclic shift

[Figure: the bit pattern of h = 320037908515, written as 10010 1010000011101111101111000000100011, is shifted cyclically by 5 bits; the leading bits 10010 move to the right end, giving 1010000011101111101111000000100011 10010.]

Page 26: Log Files. O(n) Data Structure Exercises 16.1

static int hashCode( String s ) {
    int h = 0;
    for( int i = 0; i < s.length(); i++ ) {
        // 5-bit cyclic shift; >>> shifts in zeros, so a negative h is handled correctly
        h = ( h << 5 ) | ( h >>> 27 );
        h += ( int )s.charAt( i );
    }
    return h;
}

Example: s = “two”

Statement                         i   h (binary)
h = ( h << 5 ) | ( h >>> 27 );    0   0
h += ( int )s.charAt( i );        0   1110100
h = ( h << 5 ) | ( h >>> 27 );    1   111010000000
h += ( int )s.charAt( i );        1   111011110111
h = ( h << 5 ) | ( h >>> 27 );    2   11101111011100000
h += ( int )s.charAt( i );        2   11101111101001111
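The trace for “two” can be reproduced with the sketch below. The class name CyclicShiftHash is an assumption for illustration. The sketch uses the unsigned shift >>> so that no sign bits are copied in when h is negative; the final value for “two” is binary 11101111101001111, which is 122703 in decimal.

```java
// Hypothetical runnable sketch of the 5-bit cyclic shift hash code.
class CyclicShiftHash {
    static int hashCode(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 5) | (h >>> 27);   // rotate the 32 bits left by 5 positions
            h += s.charAt(i);            // add in the next character
        }
        return h;
    }

    public static void main(String[] args) {
        int h = hashCode("two");
        System.out.println(Integer.toBinaryString(h));
    }
}
```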

Page 27: Log Files. O(n) Data Structure Exercises 16.1

The second step in a hash function is to map the hash code into the range [0, N-1]. There are two popular approaches: the division method and the multiply add and divide (MAD) method.

The Division Method
In this method, the compression map is given by

$$h(k) = |k| \bmod N$$

Page 28: Log Files. O(n) Data Structure Exercises 16.1

Examples: N = 100

Hash code   Compressed
200         0
205         5
430         30
500         0
505         5

Notice that there are collisions.

Page 29: Log Files. O(n) Data Structure Exercises 16.1

N = 101

Hash code   Compressed
200         99
205         3
430         26
500         96
505         0

In general, if N is a prime number, it helps reduce collisions.
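The two tables can be recomputed with a short Java sketch of the division method. The class name DivisionCompression is an assumption for illustration. Taking the absolute value after the remainder, rather than before, is a design choice that also avoids the corner case where Math.abs(Integer.MIN_VALUE) is still negative.

```java
// Hypothetical sketch of the division method: compress a hash code into [0, N-1].
class DivisionCompression {
    static int compress(int hashCode, int n) {
        // hashCode % n has the sign of hashCode and magnitude |hashCode| mod n.
        return Math.abs(hashCode % n);
    }

    public static void main(String[] args) {
        int[] codes = {200, 205, 430, 500, 505};
        for (int code : codes) {
            System.out.println(code + " " + compress(code, 100) + " " + compress(code, 101));
        }
    }
}
```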

Page 30: Log Files. O(n) Data Structure Exercises 16.1

The second approach is the multiply add and divide (MAD) method. In this method, the mapping is given by

$$h(k) = |ak + b| \bmod N$$

where N is a prime number, and a and b are nonnegative integers randomly chosen at the time when the compression function is determined so that a mod N ≠ 0. This method is more sophisticated and works better.
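A minimal sketch of the MAD method in Java follows. The class name MadCompression, the factory method random, and the choice of drawing a from [1, N-1] and b from [0, N-1] are assumptions for illustration. The arithmetic is done in long so that a*k + b cannot overflow.

```java
import java.util.Random;

// Hypothetical sketch of the multiply add and divide (MAD) compression map.
class MadCompression {
    private final int n;    // capacity of the bucket array, assumed to be prime
    private final long a;   // multiplier, must satisfy a mod n != 0
    private final long b;   // nonnegative offset

    MadCompression(int n, long a, long b) {
        if (a % n == 0) {
            throw new IllegalArgumentException("a mod N must not be 0");
        }
        this.n = n;
        this.a = a;
        this.b = b;
    }

    // Picks a and b at random once, when the compression function is created.
    static MadCompression random(int n, Random rng) {
        long a = 1 + rng.nextInt(n - 1);   // in [1, n-1], so a mod n != 0
        long b = rng.nextInt(n);           // in [0, n-1]
        return new MadCompression(n, a, b);
    }

    int compress(int k) {
        return (int) (Math.abs(a * k + b) % n);
    }

    public static void main(String[] args) {
        MadCompression mad = MadCompression.random(101, new Random());
        System.out.println(mad.compress(200));
    }
}
```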

Page 31: Log Files. O(n) Data Structure Exercises 16.1

Collision-Handling Schemes
Recall that if there is no collision, we can store the item (k, e) in the bucket array cell A[h(k)]. However, collisions do occur from time to time: two different keys k1 and k2 cause the hash function to return the same value, h(k1) = h(k2). In that case we cannot simply store the item directly in A[h(k)].

There are two schemes to handle collisions:
1. Separate Chaining
2. Open Addressing

Page 32: Log Files. O(n) Data Structure Exercises 16.1

Separate Chaining
In this approach, what is stored in A[h(k)] is a reference to a sequence Sk rather than the item itself. All the items whose keys have the same hash function value are stored in Sk. The sequence Sk can be implemented as a log file.

Page 33: Log Files. O(n) Data Structure Exercises 16.1

Algorithms for fundamental dictionary operations

Algorithm findElement(k):
    Assign the sequence A[h(k)] to a variable B
    if B is empty then
        return NO_SUCH_KEY
    else
        return B.findElement(k)

[Figure: a bucket array with cells 0 to 6; with h(k) = 4, B = A[h(k)] refers to the sequence S, which is searched for k.]

Page 34: Log Files. O(n) Data Structure Exercises 16.1

Algorithm insertItem(k,e):
    if A[h(k)] is empty then
        Create a new, initially empty, sequence-based dictionary B
        Assign B to A[h(k)]
    else
        Assign A[h(k)] to B
    B.insertItem(k, e)

[Figure: a bucket array with cells 0 to 6; with h(k) = 4, the item (k, e) is inserted into the sequence S referenced by B.]

Page 35: Log Files. O(n) Data Structure Exercises 16.1

Algorithm removeElement(k):
    Assign A[h(k)] to B
    if B is empty then
        return NO_SUCH_KEY
    return B.removeElement(k)

[Figure: a bucket array with cells 0 to 6; with h(k) = 4, B = A[h(k)] refers to the sequence S, from which k is removed.]

Page 36: Log Files. O(n) Data Structure Exercises 16.1

Example: Consider a dictionary of size 13. The hash function is h(k) = k mod 13. There are 10 items in the dictionary:

k    h(k)
10   10
12   12
18   5
25   12
28   2
36   10
38   12
41   2
54   2
90   12

Page 37: Log Files. O(n) Data Structure Exercises 16.1

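[Figure: the bucket array A with cells 0 to 12. Bucket 2 holds the keys 28, 41, and 54; bucket 5 holds 18; bucket 10 holds 10 and 36; bucket 12 holds 12, 25, 38, and 90.]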
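The separate chaining operations can be sketched in Java as follows. The class name ChainedHashTable, the nested Item class, the helper chainLength, and the use of null in place of NO_SUCH_KEY are assumptions for illustration; integer keys and String elements are used to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a hash table with separate chaining and h(k) = k mod N.
class ChainedHashTable {
    private static final class Item {
        final int key;
        final String element;

        Item(int key, String element) {
            this.key = key;
            this.element = element;
        }
    }

    private final List<List<Item>> buckets = new ArrayList<>();
    private final int capacity;

    ChainedHashTable(int capacity) {
        this.capacity = capacity;
        for (int i = 0; i < capacity; i++) {
            buckets.add(null);   // null marks a bucket that has no chain yet
        }
    }

    private int h(int k) {
        return Math.floorMod(k, capacity);
    }

    void insertItem(int k, String e) {
        List<Item> b = buckets.get(h(k));
        if (b == null) {
            b = new ArrayList<>();   // create the chain on first use
            buckets.set(h(k), b);
        }
        b.add(new Item(k, e));
    }

    String findElement(int k) {
        List<Item> b = buckets.get(h(k));
        if (b == null) {
            return null;
        }
        for (Item item : b) {
            if (item.key == k) {
                return item.element;
            }
        }
        return null;
    }

    String removeElement(int k) {
        List<Item> b = buckets.get(h(k));
        if (b == null) {
            return null;
        }
        for (int i = 0; i < b.size(); i++) {
            if (b.get(i).key == k) {
                return b.remove(i).element;
            }
        }
        return null;
    }

    // Number of items currently chained in the given bucket.
    int chainLength(int bucket) {
        List<Item> b = buckets.get(bucket);
        return b == null ? 0 : b.size();
    }
}
```

Inserting the ten keys of the example into a table of capacity 13 yields chains of length 3, 1, 2, and 4 in buckets 2, 5, 10, and 12.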

Page 38: Log Files. O(n) Data Structure Exercises 16.1

Data Structure Exercises 16.2

Page 39: Log Files. O(n) Data Structure Exercises 16.1

Open Addressing
In separate chaining, extra memory has to be allocated for the sequence of each bucket. To save memory space, items can be stored directly in the buckets while collisions are handled by other means. These are called open addressing schemes.

Page 40: Log Files. O(n) Data Structure Exercises 16.1

Linear Probing - A simple open addressing scheme
Assume that we want to insert an item (k, e) and i = h(k). The operation goes like this: if the bucket A[i] is not empty, we try the next bucket A[(i+1) mod N]. If this is not empty, then we try A[(i+2) mod N], and so on, until we find an empty bucket.

Example:

[Figure: a bucket array with cells 0 to 10 holding the keys 5, 13, 26, 37, 16, and 21, with h(k) = k mod 11. A new element with key = 15 is to be inserted.]

Page 41: Log Files. O(n) Data Structure Exercises 16.1

The findElement operation needs to search consecutive buckets, starting from A[h(k)], until either the item or an empty bucket is found.

[Figure: a bucket array with cells 0 to 10 holding the keys 5, 13, 26, 37, 16, 15, and 21, with h(k) = k mod 11. Find the item (15, 15).]

Page 42: Log Files. O(n) Data Structure Exercises 16.1

The removeElement operation is more complicated. When an item is removed, we have to shift items in the bucket array to fill the empty spot while leaving alone those that are in their correct location.

Example:

[Figure: a bucket array with cells 0 to 10 holding the keys 5, 13, 26, 37, 16, 15, and 21, with h(k) = k mod 11. Remove the item (37, 37).]

We have to shift (15,15) while leaving (18,18) alone:

Page 43: Log Files. O(n) Data Structure Exercises 16.1

[Figure: two snapshots of the bucket array, cells 0 to 10, with the keys 5, 13, 26, 21, 15, and 16, showing items being shifted after the removal.]

To avoid complications like this, we can use a special item called REMOVED_ITEM to replace the removed item.

Page 44: Log Files. O(n) Data Structure Exercises 16.1

[Figure: two snapshots of the bucket array, cells 0 to 10, with the keys 5, 13, 26, 21, 15, and 16; the bucket of the removed item holds REMOVED_ITEM.]

Page 45: Log Files. O(n) Data Structure Exercises 16.1

On the other hand, the findElement operation should skip this item when it is searching.

Example:

[Figure: a bucket array with cells 0 to 10 holding the keys 5, 13, 26, REMOVED_ITEM, 16, 15, and 21, with h(k) = k mod 11. Find the item (15, 15).]

Page 46: Log Files. O(n) Data Structure Exercises 16.1

And the insertItem operation should replace it with the new item.

Example:

[Figure: a bucket array with cells 0 to 10 holding the keys 5, 13, 26, REMOVED_ITEM, 16, and 21, with h(k) = k mod 11. A new element with key = 15 is to be inserted.]
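The three operations with linear probing and the REMOVED_ITEM marker can be sketched in Java as below. The class name LinearProbingTable, the use of array indices as return values, and storing only the key (as in the items (k, k) of the examples) are assumptions for illustration.

```java
// Hypothetical sketch of open addressing with linear probing and a REMOVED_ITEM marker.
class LinearProbingTable {
    private static final Object REMOVED_ITEM = new Object();
    private final Object[] cells;   // null means the bucket is empty

    LinearProbingTable(int capacity) {
        cells = new Object[capacity];
    }

    private int h(int k) {
        return Math.floorMod(k, cells.length);
    }

    // Stores k in the first empty or REMOVED_ITEM bucket; returns its index, or -1 if full.
    int insertItem(int k) {
        int i = h(k);
        for (int j = 0; j < cells.length; j++) {
            int idx = (i + j) % cells.length;
            if (cells[idx] == null || cells[idx] == REMOVED_ITEM) {
                cells[idx] = k;
                return idx;
            }
        }
        return -1;
    }

    // Probes consecutive buckets until k or an empty bucket is found; returns the index or -1.
    int findElement(int k) {
        int i = h(k);
        for (int j = 0; j < cells.length; j++) {
            int idx = (i + j) % cells.length;
            if (cells[idx] == null) {
                return -1;                       // an empty bucket ends the search
            }
            if (cells[idx] != REMOVED_ITEM && cells[idx].equals(k)) {
                return idx;                      // REMOVED_ITEM buckets are skipped
            }
        }
        return -1;
    }

    // Replaces the item with REMOVED_ITEM instead of shifting the items that follow it.
    boolean removeElement(int k) {
        int idx = findElement(k);
        if (idx < 0) {
            return false;
        }
        cells[idx] = REMOVED_ITEM;
        return true;
    }
}
```

With capacity 11, inserting 13, 26, 5, 37, 16, 21 in that (assumed) order and then 15 places 15 in bucket 8; after 37 is removed, a search for 15 still succeeds because the marker in bucket 6 is skipped rather than treated as empty.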

Page 47: Log Files. O(n) Data Structure Exercises 16.1

One of the disadvantages of linear probing is that it tends to cluster the items of the dictionary into contiguous runs, which slows searches down. To avoid this, we can use quadratic probing.

Quadratic Probing
Rather than searching the buckets A[(i + j) mod N] for j = 0, 1, 2, …, we search the buckets A[(i + j^2) mod N].

Page 48: Log Files. O(n) Data Structure Exercises 16.1

Example: An insertItem operation

[Figure: a bucket array with cells 0 to 10 holding the keys 5, 13, 26, 37, 16, and 21, with h(k) = k mod 11. A new element with key = 15 is to be inserted.]

Though quadratic probing avoids the clustering problems that occur with linear probing, it has its own clustering problem, called secondary clustering. Quadratic probing may also be unable to find an empty bucket even though empty buckets are available.
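The probe sequence can be examined with a small Java sketch. The class name QuadraticProbing and the boolean occupancy array are assumptions for illustration. For N = 11 the offsets j^2 mod 11 take only the values 0, 1, 3, 4, 5, and 9, so from a given start bucket only six distinct buckets are ever probed.

```java
// Hypothetical sketch of the quadratic probing sequence A[(i + j^2) mod N].
class QuadraticProbing {
    // Index of the j-th probe when the search starts at bucket i.
    static int probe(int i, int j, int n) {
        return (i + j * j) % n;
    }

    // Returns the first free bucket on the probe sequence of key k, or -1 if none is reached.
    static int findSlot(boolean[] occupied, int k) {
        int n = occupied.length;
        int i = Math.floorMod(k, n);
        for (int j = 0; j < n; j++) {
            int idx = probe(i, j, n);
            if (!occupied[idx]) {
                return idx;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[] occupied = new boolean[11];
        for (int cell : new int[] {2, 4, 5, 6, 7, 10}) {
            occupied[cell] = true;
        }
        System.out.println(findSlot(occupied, 15));
    }
}
```

If buckets 2, 4, 5, 7, 8, and 9 are occupied, a search for key 15 finds no free bucket even though five buckets are empty.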

Page 49: Log Files. O(n) Data Structure Exercises 16.1

Double Hashing
In this approach, we search the buckets A[(i + f(j)) mod N] for j = 0, 1, 2, …, where f(j) = j*h’(k) and h’(k) is the secondary hash function. The secondary hash function is not allowed to evaluate to zero. A common choice is

h’(k) = q – (k mod q)

where q < N is some prime number (a number divisible only by 1 and itself).

Page 50: Log Files. O(n) Data Structure Exercises 16.1

Example:

$$h(k) = k \bmod 11, \qquad f(j) = j \cdot h'(k), \qquad h'(k) = 7 - (k \bmod 7)$$

For k = 15, i = h(k) = 4 and h'(k) = 6.

j   f(j)   i + f(j)
0   0      4
1   6      10
2   12     16
3   18     22
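The example can be reproduced with the Java sketch below. The class name DoubleHashing and the method names are assumptions for illustration. Reducing i + f(j) modulo N = 11 gives the buckets 4, 10, 5, and 0.

```java
// Hypothetical sketch of the double hashing probe sequence.
class DoubleHashing {
    // Primary hash function h(k) = k mod N.
    static int h(int k, int n) {
        return Math.floorMod(k, n);
    }

    // Secondary hash function h'(k) = q - (k mod q); its value is always in [1, q].
    static int hPrime(int k, int q) {
        return q - Math.floorMod(k, q);
    }

    // Bucket examined at step j: (i + j * h'(k)) mod N.
    static int probe(int k, int j, int n, int q) {
        return (h(k, n) + j * hPrime(k, q)) % n;
    }

    public static void main(String[] args) {
        for (int j = 0; j < 4; j++) {
            System.out.println(j + " " + probe(15, j, 11, 7));
        }
    }
}
```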

Page 51: Log Files. O(n) Data Structure Exercises 16.1

Data Structure Exercises 17.1

Page 52: Log Files. O(n) Data Structure Exercises 16.1

The Ordered Dictionary ADT

Recall that keys in a dictionary may not have a total order relation. On the other hand, if a total order relation on the keys is defined, the dictionary is an ordered dictionary. Example: In an ordered dictionary, we have a few elements: {(1,E), (2,C), (5,A), (7,B), (8,D)} How can we find the element with a key that is closest to 3?

Page 53: Log Files. O(n) Data Structure Exercises 16.1

In addition to the methods of the dictionary abstract data type that we have already learned, such as findElement, insertItem and removeElement, an ordered dictionary also supports the following methods:

closestKeyBefore(k): Return the key of the item with the largest key less than or equal to k.
    Input: Object (key); Output: Object (key)

closestElemBefore(k): Return the element of the item with the largest key less than or equal to k.
    Input: Object (key); Output: Object (element)

closestKeyAfter(k): Return the key of the item with the smallest key greater than or equal to k.
    Input: Object (key); Output: Object (key)

closestElemAfter(k): Return the element of the item with the smallest key greater than or equal to k.
    Input: Object (key); Output: Object (element)

Page 54: Log Files. O(n) Data Structure Exercises 16.1

Example: The table shows the effect of a series of operations on an ordered dictionary with five elements: {(1,E), (2,C), (5,A), (7,B), (8,D)}

Operation              Output
closestKeyBefore(3)    2
closestKeyBefore(7)    7
closestKeyBefore(6)    5
closestElemBefore(2)   C
closestElemBefore(9)   D
closestElemBefore(0)   NO_SUCH_KEY
closestKeyAfter(3)     5
closestKeyAfter(7)     7
closestKeyAfter(6)     7
closestElemAfter(2)    C
closestElemAfter(9)    NO_SUCH_KEY
closestElemAfter(0)    E
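Java's standard library offers comparable operations on java.util.TreeMap, which can serve as a quick way to experiment with the table above. This is an analogy, not the ADT defined in these slides: floorKey and floorEntry play the role of closestKeyBefore and closestElemBefore, ceilingKey and ceilingEntry play the role of closestKeyAfter and closestElemAfter, and null is returned where the slides use NO_SUCH_KEY. The class name OrderedDictionaryDemo is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo: the ordered dictionary example expressed with java.util.TreeMap.
class OrderedDictionaryDemo {
    static TreeMap<Integer, String> sample() {
        TreeMap<Integer, String> d = new TreeMap<>();
        d.put(1, "E");
        d.put(2, "C");
        d.put(5, "A");
        d.put(7, "B");
        d.put(8, "D");
        return d;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> d = sample();
        System.out.println(d.floorKey(3));       // largest key <= 3
        System.out.println(d.ceilingKey(3));     // smallest key >= 3
        Map.Entry<Integer, String> before = d.floorEntry(0);
        System.out.println(before == null ? "NO_SUCH_KEY" : before.getValue());
    }
}
```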