
Page 1: Min-wise Independent Permutations

Min-wise Independent Permutations

Giorgos Tolias

Image, Video and Multimedia Systems Laboratory, National Technical University of Athens

October 2010

Page 2: Min-wise Independent Permutations

Outline

Min-wise Independent Permutations

Near Duplicate Detection with Min-Hash

Min-hash with TF-IDF

Image Clustering

Page 3: Min-wise Independent Permutations

Outline

Min-wise Independent Permutations

Near Duplicate Detection with Min-Hash

Min-hash with TF-IDF

Image Clustering

Page 4: Min-wise Independent Permutations

Min-wise Independent Permutations

[Figures from Broder: Fig. 1. The graph of P6,14,2(x) – linear scale. Fig. 2. The graph of P6,14,2(x) – logarithmic scale. Both plot probability of acceptance against resemblance.]

Broder – CPM 2000, Identifying and Filtering Near-Duplicate Documents

Page 5: Min-wise Independent Permutations

Shingles and fingerprints

We view each document as a sequence of tokens. We can take tokens to be letters, or words, or lines. We assume that we have a parser program that takes an arbitrary document and reduces it to a canonical sequence of tokens. (Here “canonical” means that any two documents that differ only in formatting or other information that we chose to ignore, for instance punctuation, formatting commands, capitalization, and so on, will be reduced to the same sequence.) So from now on a document means a canonical sequence of tokens.

A contiguous subsequence of w tokens contained in D is called a shingle. A shingle of length q is also known as a q-gram, particularly when the tokens are alphabet letters. Given a document D we can associate to it its w-shingling, defined as the set of all shingles of size w contained in D. So for instance the 4-shingling of

(a,rose,is,a,rose,is,a,rose)

is the set

{(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}

(It is possible to use alternative definitions, based on multisets. See [4] for details.)
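As a concrete illustration, here is a minimal Python sketch of the set-based w-shingling (the function name is ours):

```python
def shingles(tokens, w):
    """Return the w-shingling of a canonical token sequence:
    the set of all contiguous subsequences of w tokens."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

# The 4-shingling of the example above:
doc = ("a", "rose", "is", "a", "rose", "is", "a", "rose")
print(shingles(doc, 4))
# {('a','rose','is','a'), ('rose','is','a','rose'), ('is','a','rose','is')}
```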

Rather than deal with shingles directly, it is more convenient to associate to each shingle a numeric uid (unique id). This is done by fingerprinting the shingle. (Fingerprints are short tags for larger objects. They have the property that if two fingerprints are different then the corresponding objects are certainly different, and there is only a small probability that two different objects have the same fingerprint. This probability is typically exponentially small in the length of the fingerprint.)

For reasons explained in [4] it is particularly advantageous to use Rabin fingerprints [15], which have a very fast software implementation [3]. Rabin fingerprints are based on polynomial arithmetic and can be constructed in any length. It is important to choose the length of the fingerprints so that the probability of collisions (two distinct shingles getting the same fingerprint) is sufficiently low. (More about this below.) In practice 64-bit Rabin fingerprints are sufficient.

Hence from now on we associate to each document D a set of numbers S_D that is the result of fingerprinting the set of shingles in D. Note that the size of S_D is about equal to the number of words in D, and thus storing S_D on-line for every document in a large collection is infeasible.

The resemblance r(A, B) of two documents, A and B, is defined as

    r(A, B) = |S_A ∩ S_B| / |S_A ∪ S_B|.

Experiments seem to indicate that high resemblance (that is, close to 1) captures well the informal notion of “near-duplicate” or “roughly the same”. (There are analyses that relate the “q-gram distance” to the edit-distance – see [16].)
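The definition maps directly to code. In this hedged sketch, a truncated SHA-1 stands in for the 64-bit Rabin fingerprints used in the paper (an assumption for illustration only); it reuses shingles() from the sketch above:

```python
import hashlib

def fingerprint64(shingle):
    """Stand-in for a 64-bit Rabin fingerprint: any well-mixed 64-bit
    hash has comparable collision rates, though not Rabin's guarantees."""
    data = "\x1f".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def resemblance(tokens_a, tokens_b, w=4):
    """r(A, B) = |S_A ∩ S_B| / |S_A ∪ S_B| over fingerprinted shingles."""
    sa = {fingerprint64(s) for s in shingles(tokens_a, w)}
    sb = {fingerprint64(s) for s in shingles(tokens_b, w)}
    return len(sa & sb) / len(sa | sb)
```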

Our approach to determining syntactic similarity is related to the sampling approach developed independently by Heintze [8], though there are differences

Page 6: Min-wise Independent Permutations

Resemblance

The resemblance r(A, B) of two documents, A and B, is defined as

    r(A, B) = |S_A ∩ S_B| / |S_A ∪ S_B|.

Experiments seem to indicate that high resemblance (that is, close to 1) captures well the informal notion of “near-duplicate” or “roughly the same”. (There are analyses that relate the “q-gram distance” to the edit-distance – see [16].)

Page 7: Min-wise Independent Permutations

Sketches

in detail and in the precise definition of the measures used. Related sampling mechanisms for determining similarity were also developed by Manber [9] and within the Stanford SCAM project [1,11,12].

To compute the resemblance of two documents it suffices to keep for each document a relatively small, fixed-size sketch. The sketches can be computed fairly fast (linear in the size of the documents), and given two sketches the resemblance of the corresponding documents can be computed in linear time in the size of the sketches.

This is done as follows. Assume that for all documents of interest S_D ⊆ {0, . . . , n−1} =: [n]. (As noted, in practice n = 2^64.) Let π be chosen uniformly at random over S_n, the set of permutations of [n]. Then

    Pr(min{π(S_A)} = min{π(S_B)}) = |S_A ∩ S_B| / |S_A ∪ S_B| = r(A, B).    (1)

Proof. Since π is chosen uniformly at random, for any set X ⊆ [n] and any x ∈ X, we have

    Pr(min{π(X)} = π(x)) = 1/|X|.    (2)

In other words, all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π.

Let α be the smallest image in π(S_A ∪ S_B). Then min{π(S_A)} = min{π(S_B)} if and only if α is the image of an element in S_A ∩ S_B. Hence

    Pr(min{π(S_A)} = min{π(S_B)}) = Pr(π⁻¹(α) ∈ S_A ∩ S_B) = |S_A ∩ S_B| / |S_A ∪ S_B| = r(A, B).

Hence, we can choose, once and for all, a set of t independent random permutations π_1, . . . , π_t. (For instance we can take t = 100.) For each document A, we store a sketch, which is the list

    S̄_A = (min{π_1(S_A)}, min{π_2(S_A)}, . . . , min{π_t(S_A)}).

Then we can readily estimate the resemblance of A and B by computing how many corresponding elements in S̄_A and S̄_B are equal. (In [4] it is shown that in fact we can use a single random permutation, store the t smallest elements of its image, and then merge-sort the sketches. However, for the purposes of this paper independent permutations are necessary.)
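For small universes the scheme can be implemented literally, with explicit random permutations; a minimal sketch (names and parameters such as n = 1000, t = 100 are illustrative):

```python
import random

def make_permutation(n, seed):
    """An explicit uniformly random permutation of [n] = {0, ..., n-1}."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm                      # perm[x] is pi(x)

def sketch(s, perms):
    """The sketch of a set S: min of pi(S) under each permutation pi."""
    return [min(perm[x] for x in s) for perm in perms]

def estimated_resemblance(sk_a, sk_b):
    """Fraction of corresponding sketch elements that agree."""
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

perms = [make_permutation(1000, seed) for seed in range(100)]   # t = 100
print(estimated_resemblance(sketch({1, 2, 3}, perms),
                            sketch({2, 3, 4}, perms)))          # near 0.5
```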

In practice, we have to deal with the fact that it is impossible to choose and represent π uniformly at random in S_n for large n. We are thus led to consider smaller families of permutations that still satisfy the min-wise independence condition given by equation (2), since min-wise independence is necessary and sufficient for equation (1) to hold. This is further explored in [5], where it is shown that random linear transformations are likely to suffice in practice. See also [6] for an alternative implementation. We will ignore this issue in this paper.
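A common concrete instance of such a smaller family, in the spirit of the random linear transformations mentioned above, is h(x) = (a·x + b) mod p for random a, b and a fixed prime p. This is only approximately min-wise independent, and the constants here are illustrative:

```python
import random

P = (1 << 61) - 1        # a Mersenne prime, larger than any element id used

def random_linear_hash(rng):
    """h(x) = (a*x + b) mod P: an approximately min-wise independent map."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P

rng = random.Random(0)
hash_fns = [random_linear_hash(rng) for _ in range(100)]
min_hashes = [min(h(x) for x in {12, 345, 6789}) for h in hash_fns]
```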


Page 8: Min-wise Independent Permutations

Grouping resembling documents

So far we have seen how to estimate the resemblance of a pair of documents. For this purpose the shingle fingerprints can be quite short, since collisions have only a modest influence on our estimate if we first apply a random permutation to the shingles and then fingerprint the minimum value.

However, sketches allow us to group a collection of m documents into sets of closely resembling documents in time proportional to m log m rather than m², assuming that the clusters are well separated, which is the practical case.

We perform the clustering algorithm in four phases. In the first phase, we calculate a sketch for every document as explained. This step is linear in the total length of the documents.

To simplify the exposition of the next three phases, we'll say temporarily that each sketch is composed of shingles, rather than images of the fingerprints of shingles under random permutations of [n].

In the second phase, we produce a list of all the shingles and the documents they appear in, sorted by shingle value. To do this, the sketch for each document is expanded into a list of ⟨shingle value, document ID⟩ pairs. We simply sort this list. This step takes time O(m log m), where m is the number of documents.

In the third phase, we generate a list of all the pairs of documents that share any shingles, along with the number of shingles they have in common. To do this, we take the file of sorted ⟨shingle, ID⟩ pairs and expand it into a list of ⟨ID, ID, count of common shingles⟩ triplets, by taking each shingle that appears in multiple documents and generating the complete set of ⟨ID, ID, 1⟩ triplets for that shingle. We then apply a merge-sort procedure (adding the counts for matching ID–ID pairs) to produce a single file of all ⟨ID, ID, count⟩ triplets sorted by the first document ID. This phase requires the greatest amount of disk space, because the initial expansion of the document ID triplets is quadratic in the number of documents sharing a shingle, and initially produces many triplets with a count of 1. Because of this fact we must choose the length of the shingle fingerprints so that the number of collisions is small. To ensure this we can take it to be, say, 2 log₂ m + 20 bits. In practice 64-bit fingerprints suffice.

In the final phase, we produce the complete clustering. We examine each ⟨ID, ID, count⟩ triplet and decide if the document pair exceeds our threshold for resemblance. If it does, we add a link between the two documents in a union-find algorithm. The connected components output by the union-find algorithm form the final clusters.
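A compact in-memory sketch of phases two to four (the paper's version is disk-based and merge-sorted; this condensed variant is ours, for small collections):

```python
from collections import defaultdict

def cluster(sketches, threshold):
    """sketches: {doc_id: list of sketch elements}. Returns clusters of
    documents whose count of shared sketch elements meets the threshold."""
    # Phase 2: invert the sketches into shingle -> list of documents.
    postings = defaultdict(list)
    for doc, sk in sketches.items():
        for value in sk:
            postings[value].append(doc)
    # Phase 3: count shared sketch elements per document pair.
    counts = defaultdict(int)
    for docs in postings.values():
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                counts[(docs[i], docs[j])] += 1
    # Phase 4: union-find over pairs above the threshold.
    parent = {doc: doc for doc in sketches}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (a, b), c in counts.items():
        if c >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for doc in sketches:
        groups[find(doc)].add(doc)
    return list(groups.values())
```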


Page 9: Min-wise Independent Permutations

Probability of collision


3 Filtering Near-Duplicates

Consider two documents, A and B, that have resemblance ρ. If ρ is close to 1, then almost all the elements of the sketches S̄_A and S̄_B will be pairwise equal. The idea of duplicate filtering is to divide every sketch into k groups of s elements each. The probability that all the elements of a group are pairwise equal is simply ρ^s, and the probability that two sketches have r or more equal groups is

    P_{k,s,r} = Σ_{r ≤ i ≤ k} C(k, i) · ρ^{s·i} · (1 − ρ^s)^{k−i}.
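The acceptance probability transcribes directly into code (C(k, i) is the binomial coefficient; the sample values below are ours):

```python
from math import comb

def p_accept(rho, k, s, r):
    """P_{k,s,r}(rho): probability that at least r of the k groups of s
    sketch elements match entirely, given element match probability rho."""
    q = rho ** s                    # probability one group matches entirely
    return sum(comb(k, i) * q**i * (1 - q)**(k - i) for i in range(r, k + 1))

print(p_accept(0.98, 6, 14, 2))    # ~0.996: accepted almost surely
print(p_accept(0.50, 6, 14, 2))    # ~6e-8: rejected almost surely
```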

Page 10: Min-wise Independent Permutations

Probability of collision – figure

[Figures from Broder: Fig. 1. The graph of P6,14,2(x) – linear scale. Fig. 2. The graph of P6,14,2(x) – logarithmic scale. Both plot probability of acceptance against resemblance.]

Page 11: Min-wise Independent Permutations

Choosing the parameters


The remarkable fact is that for suitable choices of [k, s, r] the polynomial P_{k,s,r} behaves as a very sharp high-band pass filter even for small values of k. For instance, Figure 1 graphs P_{6,14,2}(x) on a linear scale and Figure 2 graphs it on a logarithmic scale. The sharp drop-off is obvious.

To use this fact, we first compute for each document D the sketch S̄_D as before, using k·s independent permutations. (We can now be arbitrarily generous with the length of the fingerprints used to create shingle uid's; however, 64 bits are plenty for our situation.) We then split S̄_D into k groups of s elements and fingerprint each group. (To avoid dependencies, we use a different irreducible polynomial for these fingerprints.) We can also concatenate to each group a group id number before fingerprinting.

Now all we need to store for each document is these k fingerprints, called “features”. Because fingerprints could collide, the probability that two features are equal is

    ρ^s + p_f,

where p_f is the collision probability. This would indicate that it suffices to use fingerprints long enough so that p_f is less than, say, 10⁻⁶. However, when applying the filtering mechanism to a large collection of documents, we again use the clustering process described above, and hence we must avoid spurious sharing of features. Nevertheless, for our problem 64-bit fingerprints are again sufficient.

It is particularly convenient, if possible, to choose the threshold r to be 1 or 2. If r = 2, then the third phase of the merging process becomes much simpler, since we don't need to keep track of how many features are shared by various pairs of documents: we simply keep a list of pairs known to share at least one feature. As soon as we discover that one of these pairs shares a second feature, we know that with high probability the two documents are near-duplicates, and thus one of them can be removed from further consideration. If r = 1 the third phase becomes moot. In general it is possible to avoid the third phase if we again group every r features into a super-feature, but this forces the number of features per document to become C(k, r).

4 Choosing the Parameters

As is often the case in filter design, choosing the parameters is half science, half black magic. It is useful to start from a target threshold resemblance ρ_0. Ideally

    P_{k,s,r}(ρ) = 1 for ρ ≥ ρ_0, and 0 otherwise.

Clearly, once s is chosen, r should be approximately k · ρ_0^s, and the larger k (and r) the sharper the filter. (Of course, we are restricted to integral values for k, s, and r.)

If we make the (unrealistic) assumption that resemblance is uniformly distributed between 0 and 1 within the set of pairs of documents to be checked, then the total error is proportional to

    ∫_0^{ρ_0} P_{k,s,r}(x) dx + ∫_{ρ_0}^1 (1 − P_{k,s,r}(x)) dx.

Differentiating with respect to ρ_0, we obtain that this is minimized when P_{k,s,r}(ρ_0) = 1/2. To continue with our example, we have P_{6,14,2}(x) = 1/2 for x = 0.909... .

A different approach is to choose s so that the slope of x^s at x = ρ_0 is maximized. This happens when

    ∂/∂s (s · ρ_0^{s−1}) = 0,    (3)

that is, s = 1/ln(1/ρ_0). For s = 14 the value that satisfies (3) is ρ_0 = 0.931... .

In practice these ideas give only a starting point for the search for a filter that provides the required trade-offs between error bounds, time, and space. It is necessary to graph the filter and do experimental determinations.
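Both starting points are easy to compute numerically; this sketch reuses p_accept from above together with simple bisection:

```python
from math import log

def half_power_point(k, s, r, tol=1e-9):
    """Bisection for the rho with P_{k,s,r}(rho) = 1/2 (P is increasing)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p_accept(mid, k, s, r) < 0.5:
            lo = mid
        else:
            hi = mid
    return lo

print(half_power_point(6, 14, 2))   # ~0.909, matching the text
print(1 / log(1 / 0.931))           # s maximizing the slope at rho_0 = 0.931: ~14
```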

5 Conclusion

We have presented a method that can eliminate near-duplicate documents from a collection of hundreds of millions of documents by computing independently for each document a vector of features less than 50 bytes long and comparing only these vectors rather than entire documents. The entire processing takes time O(m log m), where m is the size of the collection. The algorithm described here has been successfully implemented and is in current use in the context of the AltaVista search engine.

Acknowledgments

I wish to thank Chuck Thacker, who challenged me to find an efficient algorithm for filtering near-duplicates. Some essential ideas behind the resemblance definition and computation were developed in conversations with Greg Nelson. The prototype implementation for AltaVista was done in collaboration with Mike Burrows and Mark Manasse.

References

1. S. Brin, J. Davis, H. García-Molina. Copy Detection Mechanisms for Digital Documents. Proceedings of the ACM SIGMOD Annual Conference, May 1995.

2. K. Bharat and A. Z. Broder. A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines. In Proceedings of the Seventh International World Wide Web Conference, pages 379–388, 1998.

3. A. Z. Broder. Some applications of Rabin's fingerprinting method. In R. Capocelli, A. De Santis, and U. Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, pages 143–152. Springer-Verlag, 1993.

Page 12: Min-wise Independent Permutations

Outline

Min-wise Independent Permutations

Near Duplicate Detection with Min-Hash

Min-hash with TF-IDF

Image Clustering

Page 13: Min-wise Independent Permutations

Scalable Near Identical Image and Shot Detection

Ondrej Chum¹, James Philbin¹, Michael Isard², Andrew Zisserman¹
¹Department of Engineering Science, University of Oxford; ²Microsoft Research, Silicon Valley

ABSTRACT
This paper proposes and compares two novel schemes for near duplicate image and video-shot detection. The first approach is based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval. The second approach uses local feature descriptors (SIFT) and for retrieval exploits techniques used in the information retrieval community to compute approximate set intersections between documents using a min-Hash algorithm.

The requirements for near-duplicate images vary according to the application, and we address two types of near duplicate definition: (i) being perceptually identical (e.g. up to noise, discretization effects, small photometric distortions etc); and (ii) being images of the same 3D scene (so allowing for viewpoint changes and partial occlusion). We define two shots to be near-duplicates if they share a large percentage of near-duplicate frames.

We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. Both methods are designed so that only a small amount of data need be stored for each image. In the case of near-duplicate shot detection it is shown that a weak approximation to histogram matching, consuming substantially less storage, is sufficient for good results. We demonstrate our methods on the TRECVID 2006 data set, which contains approximately 165 hours of video (about 17.8M frames with 146K key frames), and also on feature films and pop videos.

Categories and Subject Descriptors: H.3.1 [Content Analysis and Indexing]: Indexing methods; I.4 [Image Processing and Computer Vision]: Image Representation – hierarchical, statistical; E.2 [Data Storage Representations]: Hash-table representations
General Terms: Algorithms, Theory
Keywords: Near duplicate detection, LSH, Min Hash, Large image databases

1. INTRODUCTION

An image is called a near-duplicate of a reference image if it is “close”, according to some defined measure, to the reference image.


Figure 1: First page of results for the query ‘flight of a bee’ using Google Images.

Figure 2: First page of results for the query ‘Munch Vampire’ using Google Images.

Near duplicate image detection (NDID) and retrieval is a vital component for many real-world applications.

Consider the following example. Searching for the phrase ‘flight of a bee’ in a popular internet image search engine (here, Google Images) gives the first page of results shown in figure 1. Many of these results show Salvador Dali's painting ‘Dream Caused by the Flight of a Bee Around a Pomegranate One Second Before Awakening,’ and are perceptually identical. A user might prefer the search engine to “collapse” these images of the painting into a set, represented visually by a single reference image, so that a greater diversity of images is initially displayed. If the user wants an image of the painting, he could then click on the reference example to explore the near-duplicate set. If the painting isn't desired, he doesn't have to view many near-duplicate occurrences of it. However, the images are not identical. They differ in size, color adjustment, compression level, etc. Therefore, exact duplicate detection (at the pixel level) will not be able to group all similar results together.

A second example of an image search result is shown in figure 2. There exist several different versions of the painting ‘The Vampire,’ by Edvard Munch, but this is not immediately apparent from the search results. Grouping all the near-duplicates together so that distinct versions appear as distinct groups is preferable.

Video processing is another area where NDID can prove extremely useful. Detection of identical frames or shots (sequences …

Chum, Philbin, Isard, Zisserman – CIVR 2007, Scalable Near Identical Image and Shot Detection

Page 14: Min-wise Independent Permutations

Visual word representation

…tuple of projections (although we can improve run-time speed by re-using some of these projections).

The time complexity for querying every hash table is constant, and this returns a set of candidate points which lie near to the query point in space. In practice, a proportion of these points will be at a greater distance than R from the query point. The experiments in section 5 show that for some applications these “false matches” can be tolerated, in which case the histograms themselves do not need to be consulted at query time, and so they need not be stored. This leads to a storage cost of 5.3 bytes per hash table per image.

If pruning is required, we need to explicitly compute the distance to each of the returned points. The total number of points, and therefore the number of distances to compute, grows as O(n^ρ), where ρ = ln(1/p₁) / ln(1/p₂), p₁ is a lower bound on the probability that two points within R will hash to the same bucket, and p₂ is an upper bound on the probability that two points not within R will hash to the same bucket. Clearly if p₁ ≫ p₂, then ρ will be very small. Thus the complexity of enumerating duplicates is close to linear in the number of duplicates. When this pruning is needed, we must store an additional 384 bytes per image at query time.
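For intuition, the exponent is a one-liner (the probability values below are illustrative, not taken from the paper):

```python
from math import log

def lsh_exponent(p1, p2):
    """rho = ln(1/p1) / ln(1/p2); LSH query cost grows as O(n**rho)."""
    return log(1 / p1) / log(1 / p2)

print(lsh_exponent(0.9, 0.1))   # ~0.046: near-constant growth when p1 >> p2
```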

4. SPARSE FEATURES AND MIN-HASH

4.1 Image description

Local features and descriptors have been developed for image to image matching [11, 14]. These are designed to be invariant to illumination and geometric transformations varying from scale to a full affine transformation, as might arise from a viewpoint change. They have been used successfully in model based recognition systems [11]. Furthermore, by quantizing the descriptors into visual words, ‘bag-of-words’ representations have also been used successfully for matching images and scenes [15, 19]. We build on these approaches in our design of a sparse feature based near duplicate image detector.

The difference of Gaussians (DoG) [11] operator is used as a feature (region) detector. Each region is then represented by a SIFT [11] descriptor using the image intensity only. SIFT features have proven to be insensitive to small local geometric and photometric image distortions [13].

A ‘visual vocabulary’ [19] – a set of visual words V – is constructed by vector quantizing the SIFT descriptors of features from the training data using K-means. A random subset of the database can be used as the training data. The K-means cluster centres define the visual words. The SIFT features in every image are then assigned to the nearest cluster centre to give the visual word representation.

Assume a vocabulary V of size |V| where each visual word is encoded with a unique identifier from {1, . . . , |V|}. Each image is represented as a set A_i of words, A_i ⊂ V. Note that a set of words is a weaker representation than a bag of words, as it doesn't record the frequency of occurrence of visual words in the image.

The distance measure between two images is computed as the similarity of sets A₁ and A₂, which is defined as the ratio of the number of elements in the intersection of the representations over their union:

    sim(A₁, A₂) = |A₁ ∩ A₂| / |A₁ ∪ A₂|.    (1)

To efficiently retrieve NDID under this distance measure, a min-Hash algorithm is used. This allows us to approximately find all images whose similarity is above a threshold for a given query in constant time. We describe the search algorithm in the following sub-section.


Page 15: Min-wise Independent Permutations

Min hash


4.2 Min Hash review

In this section, we describe how we adapt a method originally developed for text near-duplicate detection [2] to near-duplicate detection of images. We describe it using textual words, and then explain the adaptation to visual words in the following sub-section.

Two documents are near duplicate if the similarity sim(A₁, A₂) is higher than a given threshold ρ. The goal is to retrieve all documents in the database that are similar to a query document. This section reviews an efficient randomized procedure that retrieves near duplicate documents in time proportional to the number of near duplicate documents (i.e. time complexity is independent of the size of the database). The outline of the algorithm is as follows. First, a list of min-hashes is extracted from each document. A min-hash is a single number having the property that two sets A₁ and A₂ have the same value of min-hash with probability equal to their similarity sim(A₁, A₂). For efficient retrieval the min-hashes are grouped into n-tuples called sketches. Identical sketches are then efficiently found using a hash table. Documents with at least m identical sketches (sketch hits) are considered as possible near duplicate candidates, and their similarity is then estimated using all available min-hashes.

min-Hash. First, a random permutation of word labels π is generated. For each document A_i a min-hash min π(A_i) is recorded. Consider the following example: vocabulary V = {A, B, C, D, E, F} and three sets {A,B,C}, {B,C,D}, and {A,E,F}. Four independent random permutations and the corresponding min-hashes follow in the table.

                 π (rank of each label)        min-hashes
                 A  B  C  D  E  F     {A,B,C}  {B,C,D}  {A,E,F}
          π1:    3  6  2  5  4  1        2        2        1
          π2:    1  2  6  3  5  4        1        2        1
          π3:    3  2  1  6  4  5        1        1        3
          π4:    4  3  5  6  1  2        3        3        1
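The table can be reproduced by encoding each permutation as the rank it assigns to each word label; a minimal sketch:

```python
V = ["A", "B", "C", "D", "E", "F"]
sets = [{"A", "B", "C"}, {"B", "C", "D"}, {"A", "E", "F"}]

# The four permutations from the table, as the rank of each word label.
perms = [
    dict(zip(V, [3, 6, 2, 5, 4, 1])),
    dict(zip(V, [1, 2, 6, 3, 5, 4])),
    dict(zip(V, [3, 2, 1, 6, 4, 5])),
    dict(zip(V, [4, 3, 5, 6, 1, 2])),
]

for rank in perms:
    print([min(rank[w] for w in s) for s in sets])
# [2, 2, 1], [1, 2, 1], [1, 1, 3], [3, 3, 1] -- matching the table
```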

The method is based on the fact, which we show later on, that the probability that min π(A₁) = min π(A₂) is

    P(min π(A₁) = min π(A₂)) = |A₁ ∩ A₂| / |A₁ ∪ A₂| = sim(A₁, A₂).

To estimate sim(A₁, A₂), N independent random permutations π_j are used. Let l be the number of permutations for which min π_j(A₁) = min π_j(A₂). We estimate sim(A₁, A₂) = l/N. In our example, the sets {A,B,C} and {B,C,D} have three identical min-hashes and the estimated similarity will be 0.75, while the exact similarity is 0.5. The sets {A,B,C} and {A,E,F} share one min-hash and their similarity estimate is 0.25 (0.2 is exact).


Page 16: Min-wise Independent Permutations

How does it work


are used. Let l be the number of how many times minπj(A1) =minπj(A2)). We estimate sim(A1,A2) = l/N . In our example,the sets {A,B,C} and {B,C,D} have three identical min-hashes andthe estimated similarity will be 0.75, while the exact similarity is0.5. The sets {A,B,C} and {A,E,F} share one min-hash and theirsimilarity estimate is 0.25 (0.2 is exact).How does it work? Consider drawing X = argminπ(A1 ∪A2).Since π is a random permutation, each element of A1 ∪ A2 hasthe same probability of being the least element. Therefore, we canthink of X as being drawn at random from A1 ∪ A2. If X is anelement of both A1 and A2, i.e. X ∈ A1∩A2, then minπ(A1) =minπ(A2) = π(X). If not, say X ∈ A1 \ A2, then π(X) <minπ(A2). Therefore, for random permutation π it follows

P (minπ(A1) = minπ(A2)) =|A1 ∩ A2||A1 ∪ A2|

. (2)
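The worked example can be checked mechanically; here is a sketch with the four permutations hard-coded to match the table above (in practice each row would be a fresh random permutation of the vocabulary):

    import numpy as np

    # Values of the four permutations for words A..F, matching the table above.
    perms = np.array([[3, 6, 2, 5, 4, 1],
                      [1, 2, 6, 3, 5, 4],
                      [3, 2, 1, 6, 4, 5],
                      [4, 3, 5, 6, 1, 2]])
    word_id = {w: i for i, w in enumerate("ABCDEF")}

    def min_hashes(A):
        """min-hash of set A under every permutation (one value per row)."""
        idx = [word_id[w] for w in A]
        return perms[:, idx].min(axis=1)

    def estimated_sim(A1, A2):
        """Fraction l/N of permutations on which the min-hashes agree."""
        return float(np.mean(min_hashes(A1) == min_hashes(A2)))

    print(estimated_sim({"A", "B", "C"}, {"B", "C", "D"}))   # 0.75 (exact: 0.5)
    print(estimated_sim({"A", "B", "C"}, {"A", "E", "F"}))   # 0.25 (exact: 0.2)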

Sketches. For efficiency of the retrieval, the min-hashes are grouped into n-tuples. Let Π be an n-tuple (π1, . . . , πn) of different independent random permutations of V. Let SΠ(A1) be the sketch (minπ1(A1), . . . , minπn(A1)). The probability that two sets A1 and A2 have identical sketches SΠ(A1) = SΠ(A2) is sim(A1,A2)^n, since the permutations in Π (and hence the min-hashes in the sketch) are independent. Grouping min-hashes significantly reduces the probability of false positive retrieval. The retrieval procedure then …
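A sketch of the grouping and hash-table lookup, reusing min_hashes from the previous snippet; the tuple size n = 2 here is only so that the toy example yields a hit, not a recommended setting:

    from collections import defaultdict

    def sketches(A, n=2):
        """Group consecutive min-hashes into non-overlapping n-tuples."""
        mh = min_hashes(A)
        return [tuple(mh[i:i + n]) for i in range(0, len(mh) - n + 1, n)]

    table = defaultdict(set)
    docs = {1: {"A", "B", "C"}, 2: {"B", "C", "D"}, 3: {"A", "E", "F"}}
    for doc_id, A in docs.items():
        for pos, sk in enumerate(sketches(A)):
            table[(pos, sk)].add(doc_id)     # identical sketches share a bin

    candidates = {frozenset(ids) for ids in table.values() if len(ids) > 1}
    print(candidates)                        # {frozenset({1, 2})}: one sketch hit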


Page 17: Min-wise Independent Permutations - NTUA

Histogram of true similarities

Figure 7: Histograms of image pair distances in the CH-LSH (top), and true similarity in the SF-mH method (bottom) from TRECVID data. Left to right: raw approximate similarity set, verified approximate set, and false negatives.

Figure 8: Examples of images from the TRECVID database. The query image is in the first column; the other columns show images and their distance for the colour histogram method.

We sampled random pairs of images and measured their similarity. The distance between a random pair of images is larger than 500 in 99.9% of cases. Figure 7 (top) shows histograms of distances, and demonstrates that for NDID, verifying the raw results returned from the LSH is necessary to ensure accurate retrieval. Figure 8 shows some example images from the TRECVID dataset along with a selection of retrieved images showing the typical perceptual similarity between images with varying histogram distances.

On a 2GHz commodity laptop, building the hash tables and finding all near-duplicates for the 150K TRECVID keyframes took only 15s + 15s = 30s.

5.2 Min Hash (SF-mH) retrieval
For this experiment, we detected DoG features [11] and vector-quantized their SIFT descriptors into a vocabulary of 2^16 words. We defined near duplicate images as images having similarity (1) above 35%. For retrieval we used k = 64 sketches of n = 3 min-hashes (taking 724 bytes per image – 384 for min-hashes and 5.3 bytes for each of 64 hash tables). Images were placed in the raw approximate similarity set given a single sketch hit. The verified approximate similarity set removed all images with estimated similarity lower than 35%. Using these parameters, 0.75% of images in the true similarity set were missing from the raw approximate similarity set. On average 5.9% of images in the verified approximate similarity set were not in the true similarity set, and 4.95% of the true similarity set was missing from the verified approximate set. The average ratio of sizes between the raw approximate similarity set and the verified approximate similarity set was 10.04.

The similarity between random image pairs is less than 5% in 99.9% of cases. Figure 7 (bottom row) shows histograms of similarities, and demonstrates that for the SF-mH method, verifying the raw results returned by sketches is necessary for accurate retrieval.

Figure 9 shows some example images from the TRECVID dataset along with a selection of retrieved images showing the typical perceptual similarity between images with varying feature similarities.

Figure 9: Examples of images from the TRECVID database. The query image is in the first column; the other columns show images and their similarity for the Min Hash method.

Figure 10: Images with 30-40 detected near duplicates: samples from images retrieved by both methods (common ND / colour histogram only / min-Hash only).

5.3 Near duplicate definition comparison
We have selected several sets of 30-40 near duplicate images and have compared the results of the two proposed methods on them. This experiment compares the ability of the two representations (colour histograms and image features) to encode the information necessary to detect near-duplicate images. Image samples from the sets are shown in figure 10. Three numbers are overlaid on the images: the number of images retrieved by both methods, the number of images retrieved using CH-LSH only, and the number of images retrieved by SF-mH only. Manual verification shows that all returned images are perceptually duplicates in this limited trial; in other words, we did not see any false positives. We have no ground truth to determine the exact number of false negatives, but as the figure shows, each method had a small number of false negatives compared with the other. No obvious trends emerged from these false negatives; however, we know that each method is sensitive to certain failure modes:

Occlusion. Since L2 is not a robust distance, local occlusions can cause a significant increase of the colour histogram distance, making CH-LSH sensitive to occlusion. For the min-hashes, occlusions typically insert and remove some visual words in the image representation. Therefore, SF-mH tolerates occlusions that preserve a sufficiently high percentage of visual words.

Noise and blur. The SF-mH method is heavily dependent on the quality of the feature detection. Therefore, any image deformation that affects the firing of the feature detector can alter the performance of the whole method. Such deformations include strong artifacts / noise (which increase the number of features) and image blur (which decreases the number of features). The CH-LSH method is fairly insensitive to these types of image deformations.

6. NEAR-DUPLICATE SHOT DETECTION
In addition to the keyframe experiment, we also performed shot-based near-duplicate detection for the TRECVID data on the full 17.8M frames, using CH-LSH. For this experiment, we used 36 random projections, combined into 8-tuples for 36 hash tables, and used the raw approximate similarity sets directly, which required only 190 bytes per image of storage. On a 2GHz machine, the method took approximately 90 minutes to search all 17.8M frames in the dataset.



Page 20: Min-wise Independent Permutations - NTUA

Scene repetition

Figure 11: (a) a near-duplicate shot matched from the frames, which would not have been found by examining the keyframes alone; (b) how the false positive duplicate rate changes with varying overlap threshold (x-axis: shot overlap threshold, y-axis: average shot false positive rate) on a small subset of 10 hours of TRECVID data.


Figure 11(a) demonstrates the need to perform near-duplicate shot detection as opposed to using the keyframes alone. This figure shows that, in general, matching shots on individual frames and matching shots on keyframes will give different results. In figure 11(b), we examine the ability of the shot-voting to act as a proxy for histogram verification on a small sample of 10 hours of TRECVID data. This shows that when the overlap threshold is high, histogram verification is not required to achieve low false positive rates for near-duplicate shot detection. For performing NDSD over the whole dataset, we use a shot overlap threshold of 0.7, which gives an average false positive rate of 9.4 × 10^-4.

7. SCENE REPETITION IN FILMS
In this section we consider a looser definition of NDID – that of identifying images of the same scene. In this case the images may differ perceptually. We take as our application detecting those frames in a video that were shot at the same location. Previous methods for this application have used the feature film ‘Run Lola Run’ [Tykwer, 1999] [18] and the music video ‘Come Into My World’ [Gondry, 2002] [16], since both videos contain a time loop. We use both these videos for the experimental evaluation.

Kylie Minogue: Come Into My World. The video contains four repeats of Kylie walking around a city area (with superposition), and a short fifth appearance at the end. A full description is given in [21]. For the experiments the video is represented as key frames by extracting every 15th frame, giving 423 frames. We compare the performance of the CH-LSH and SF-mH methods to the ground truth by computing frame similarity matrices.

For the SF-mH method a new vocabulary of 10,000 visual words is generated from the key frames. Each frame is then represented by 384 min-hashes, and sketches of 2 min-hashes are generated from these. It is necessary to use a smaller n (the number of min-hashes in a sketch) in this case to avoid false negatives. The CH-LSH method is unchanged from the TRECVID implementation.

Figure 12 top row shows similarity matrices for CH-LSH (distance threshold 450) and SF-mH (similarity threshold 15%) respectively. In all similarity matrices, self-matching frames (the diagonal) are not displayed. Matches along the diagonal are matches between consecutive frames. Both methods are successful in capturing the contiguous scene repetitions, as demonstrated by the parallel diagonal lines. Note that the similarity matrix for SF-mH is slightly denser, especially in the fourth repetition where the frames are more occluded. Also, the lines of repetition are thicker, since several consecutive frames are matched despite viewpoint change. The viewpoint change is captured in the samples from repeated frames in figure 12 (bottom). This demonstrates the tolerance to viewpoint change built into the SF-mH method by design.

Figure 12: Top row: the similarity matrices for key frames from ‘Come Into My World’ for CH-LSH (left, dist ≤ 450) and SF-mH (right, sim ≥ 15%). Bottom: samples of similar frames detected by SF-mH. Four repetitions (plus a few frames of a fifth) are evident (diagonal and off-diagonal lines). The frame number is overlaid on the frames.

Figure 13: Dependence of the scene repetition retrieval in ‘Come Into My World’ on varying thresholds (CH-LSH: dist ≤ 200, 300, 400, 650; SF-mH: sim ≥ 30%, 25%, 20%, 10%).


The dependence of the performance on the distance / similarity threshold for the two methods is shown in figure 13. For CH-LSH, the distance thresholds used were 200, 300, 400, and 650; for SF-mH, the similarity thresholds were 30%, 25%, 20%, and 10%. Note that there is a “sweet spot” threshold that reveals the story repetition for both methods. This fact suggests that our distance / similarity measure corresponds to some extent to a human’s perception of image similarity.

Run Lola Run. The story in the film repeats three times, with many of the repeated shots being of identical locations although the camera viewpoint can differ. The video is represented as key frames by extracting every 25th frame, giving 4285 frames.

For the SF-mH method, in order to give extra tolerance to viewpoint change, the feature detector is changed here from DoG to Hessian Affine [12, 17]. Again SIFT descriptors are taken and vector quantized into 10,000 visual words. The matching time over all pairs of near duplicate images (not including the feature detection) is less than 1 second for our Matlab implementation on a 2GHz machine. The CH-LSH implementation is unchanged from the TRECVID case.


Page 21: Min-wise Independent Permutations - NTUA

Outline

Min-wise Independent Permutations

Near Duplicate Detection with Min-Hash

Min-hash with TF-IDF

Image Clustering

Page 22: Min-wise Independent Permutations - NTUA

Min-hash with TF-IDF

                           documents considered                        top 4 score
               vocab 30k              vocab 100k            vocab 30k         vocab 100k
  mh   ske    sims    simw    simh    sims   simw   simh   sims simw simh   sims simw simh
  512   256   553.8   362.2   207.0   143.8   87.3   49.3  2.54 2.54 2.67   2.43 2.42 2.57
  512   512   908.6   664.1   394.6   281.3  181.1   94.4  2.70 2.72 2.85   2.65 2.68 2.80
  512  1024  1671.9  1200.1   730.9   543.0  340.2  178.7  2.74 2.79 2.94   2.80 2.85 2.97
  512  1536  2325.4  1626.8  1041.3   871.4  469.6  260.7  2.75 2.80 2.96   2.81 2.90 3.03
  640   320   657.4   434.3   255.6   177.0  107.2   60.4  2.65 2.65 2.77   2.54 2.53 2.67
  640   640  1141.9   810.2   488.3   340.2  206.5  117.7  2.76 2.81 2.93   2.73 2.77 2.89
  640  1280  1924.3  1443.4   889.4   642.9  396.5  225.5  2.80 2.86 3.01   2.84 2.92 3.04
  640  1920  2691.4  1949.0  1258.4   969.7  567.0  330.7  2.80 2.87 3.02   2.88 2.96 3.09
  768   384   748.5   520.5   302.8   215.4  127.7   72.0  2.71 2.73 2.84   2.62 2.62 2.74
  768   768  1362.3   957.0   578.2   419.9  244.7  140.3  2.83 2.86 2.99   2.81 2.85 2.95
  768  1536  2242.9  1669.1  1035.7   761.2  637.8  264.1  2.85 2.90 3.05   2.90 2.98 3.08
  768  2304  2978.1  2230.6  1423.1  1154.0  816.5  382.1  2.85 2.91 3.06   2.91 3.01 3.13
  896   448   979.0   595.2   352.5   251.2  145.5   83.6  2.77 2.79 2.90   2.69 2.68 2.80
  896   896  1578.5  1082.2   683.8   481.6  275.4  163.0  2.86 2.90 3.03   2.86 2.90 3.00
  896  1792  2743.1  1878.6  1371.7   869.5  515.1  318.8  2.88 2.93 3.08   2.94 3.02 3.13
  896  2688  3398.8  2496.4  1790.8  1238.7  734.9  452.8  2.87 2.93 3.09   2.96 3.05 3.17

Table 1: University of Kentucky data set. Number of min-Hashes (mh), number of sketches (ske), number of considered documents, and average number of correct images in the top 4 are shown for three similarity measures sims, simw, and simh. Lower is better for documents considered; higher is better for the top 4 score.

Figure 2: University of Kentucky data set: sample queries (left column) and results (three rows each) for sims (top row), simw (middle row), and simh (bottom row).

Chum Philbin Zisserman – BMVC 2008
Near Duplicate Image Detection: min-Hash and tf-idf Weighting

Page 23: Min-wise Independent Permutations - NTUA

Image Similarity

… Harris points and their descriptors).

Recently, attention has been drawn to hashing-based image retrieval. In [20] Torralba et al. proposed to learn short descriptors to retrieve similar images from a huge database. The method is based on a dense 128D global image descriptor, which limits the approach to no geometric / viewpoint invariance. Jain et al. [8] introduced a method for an efficient extension of the Locality Sensitive Hashing scheme [7] to the Mahalanobis distance. Both aforementioned approaches use bit strings as a fingerprint of the image. In such a representation, direct collision of similar images in a single bin of the hashing table is unlikely and a search over multiple bins has to be performed. This is feasible (or even advantageous) for approximate nearest neighbour or range search when the query example is given. However, for clustering tasks (such as finding all groups of near duplicate images in the database) the bit string representation is less suitable.

2 Image Representation and Similarity Measures
Recently, most of the successful image indexing approaches are based on the bag-of-visual-words representation [5, 9, 17, 18, 19]. In this framework, affine invariant interest regions are detected for each image in the data set. Popular choices are MSER [15], DoG (difference of Gaussians) [14] or multi-scale Hessian interest points [16]. Each detected feature determines an affine covariant measurement region, typically an ellipse defined by the second moment matrix of the region. An affine invariant descriptor is then extracted from the measurement regions. Often a 128-dimensional SIFT [14] descriptor is used.

A ‘visual vocabulary’ [19] is then constructed by vector quantization of the feature descriptors. Often, k-means or some variant is used to build the vocabulary [17, 18]. The image database or a random subset can be used as the training data for clustering. The k-means cluster centers define visual words and the SIFT features in every image are then assigned to the nearest cluster center to give a visual word representation.

Assume a vocabulary V of size |V| where each visual word is encoded with a unique identifier from {1, . . . , |V|}. A bag-of-visual-words approach represents an image by a vector of length |V|, where each element denotes the number of features in the image that are represented by the given visual word. A set Ai of words, Ai ⊂ V, is a weaker representation that does not store the number of features, only whether they are present or not.

We will discuss three different image similarity measures. Two measures use a set-of-visual-words image representation; the last one uses a bag-of-visual-words representation. All of them can be efficiently approximated by randomized algorithms. Note that the proposed similarity measures share some properties of the tf-idf scheme, which is known to perform well in image retrieval.

Set similarity. The distance measure between two images is computed as the similarity of sets A1 and A2, which is defined as the ratio of the number of elements in the intersection over the union:

sims(A1,A2) = |A1 ∩ A2| / |A1 ∪ A2|.   (1)

This similarity measure is used by text search engines [2] to detect near-duplicate text documents. In NDID, the method was used in [4]. The efficient algorithm for retrieving near duplicate documents, called min-Hash, is reviewed in section 3.

Weighted set similarity. The set similarity measure assumes that all words are equally important. Here we extend the definition of similarity to sets of words with differing importance. Let dw ≥ 0 be the importance of a visual word Xw. The similarity of two sets A1 and A2 is

simw(A1,A2) = ( ∑_{Xw∈A1∩A2} dw ) / ( ∑_{Xw∈A1∪A2} dw ).   (2)

The previous definition of similarity (1) is a special case of the new definition (2) for dw = 1. An efficient algorithm for retrieval using the simw similarity measure is derived in section 4.1.

Histogram intersection. Let ti be a vector of size |V| where each coordinate t^w_i is the number of visual words Xw present in the i-th document. The histogram intersection measure is defined as

simh0(A1,A2) = ( ∑_w min(t^w_1, t^w_2) ) / ( ∑_w max(t^w_1, t^w_2) ).   (3)

This measure can also be extended using word weightings to give:

simh(A1,A2) = ( ∑_w dw min(t^w_1, t^w_2) ) / ( ∑_w dw max(t^w_1, t^w_2) ).   (4)

This similarity measure (4) is closer to the tf-idf weighting scheme, while preserving the advantages of very fast retrieval of near identical documents using the min-Hash algorithm – see section 4.2 for details.
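For comparison, direct (non-hashed) implementations of the three measures; a sketch assuming bags of words are dicts mapping word id to count, and d maps word id to its weight (e.g. idf):

    def sims(A1: set, A2: set) -> float:
        """Set similarity, eqn (1)."""
        return len(A1 & A2) / len(A1 | A2)

    def simw(A1: set, A2: set, d: dict) -> float:
        """Weighted set similarity, eqn (2)."""
        return sum(d[w] for w in A1 & A2) / sum(d[w] for w in A1 | A2)

    def simh(t1: dict, t2: dict, d: dict) -> float:
        """Weighted histogram intersection, eqn (4); eqn (3) is the case d[w] == 1."""
        words = set(t1) | set(t2)
        num = sum(d[w] * min(t1.get(w, 0), t2.get(w, 0)) for w in words)
        den = sum(d[w] * max(t1.get(w, 0), t2.get(w, 0)) for w in words)
        return num / den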

3 Min Hash Background
In this section, we describe how a method originally developed for text near-duplicate detection [2] is adapted to near-duplicate detection of images.

Two documents are near duplicate if the similarity sims is higher than a given threshold ρ. The goal is to retrieve all documents in the database that are similar to a query document. This section reviews an efficient randomized hashing-based procedure that retrieves near duplicate documents in time proportional to the number of near duplicate documents. The outline of the algorithm is as follows: first, a list of min-Hashes is extracted from each document. A min-Hash is a single number having the property that two sets A1 and A2 have the same value of min-Hash with probability equal to their similarity sims(A1,A2). For efficient retrieval the min-Hashes are grouped into n-tuples called sketches. Identical sketches are then efficiently found using a hash table. Documents with at least h identical sketches (sketch hits) are considered as possible near duplicate candidates and their similarity is then estimated using all available min-Hashes.

min-Hash algorithm. A number of random hash functions f_j : V → R is given, assigning a real number to each visual word. Let Xa and Xb be different words from the vocabulary V. The random hash functions have to satisfy two conditions: f_j(Xa) ≠ f_j(Xb) and P(f_j(Xa) < f_j(Xb)) = 0.5. The functions f_j also have to be independent. For small vocabularies, the hash functions can be implemented as a look-up table, where each element of the table is generated by a random sample from Un(0,1).

Note that each function f_j infers an ordering on the set of visual words: Xa <_j Xb iff f_j(Xa) < f_j(Xb). We define a min-Hash as the smallest element of a set Ai under the ordering induced by the function f_j:

m(Ai, f_j) = arg min_{X∈Ai} f_j(X).
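A look-up-table implementation of the hash functions f_j for a small vocabulary might look as follows (sizes are illustrative, not the paper's):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, n_funcs = 1000, 192

    # f[j][w] ~ Un(0,1): one independent random value per (function, word).
    f = rng.random((n_funcs, vocab_size))

    def min_hash(A, j):
        """m(A, f_j): the element of A that is smallest under the ordering f_j."""
        words = np.fromiter(A, dtype=int)
        return int(words[np.argmin(f[j, words])])

    A = {3, 17, 42}
    sketch = tuple(min_hash(A, j) for j in range(3))   # one 3-tuple sketch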

Page 24: Min-wise Independent Permutations - NTUA

Word Weighting

… to a more efficient method based on generating random non-uniform hash functions.

For now, assume that the dw are positive integers. For each set Ai, we construct a set A′i as follows. Each element Xw ∈ Ai is represented by dw elements X^k_w, k = 1 . . . dw, in A′i. A min-Hash of A′i is obtained as described in the previous section: random hash functions f′_j(X^k_w) are used to define min-Hashes on the documents A′i. Each element X^k_w is assigned a different value by the hash function, but all X^k_w represent the same visual word Xw. Let X^k_w be the min-Hash of A′i; the min-Hash of Ai is then defined as Xw. Again, the probability that the min-Hashes of two sets A1 and A2 are identical is given by the ratio

|A′1 ∩ A′2| / |A′1 ∪ A′2| = ( ∑_{Xw∈A1∩A2} dw ) / ( ∑_{Xw∈A1∪A2} dw ).

The same result is obtained when the following hash function is used on the original vocabulary:

f_j(Xw) = min_{k=1...dw} f′_j(X^k_w).

In the rest of this section we derive how to generate the value of the hash function directly, without generating dw uniformly distributed random numbers for each word.

Let mw be a random variable mw = min_k r^k_w, where k = 1 . . . dw and r^k_w ∼ Un(0,1). The cumulative distribution of mw is given by

P(mw ≤ a) = 1 − (1 − a)^dw.   (7)

It follows that a uniformly distributed random variable x ∼ Un(0,1) can be transformed to a random variable with cumulative distribution function (7) as mw = 1 − (1 − x)^(1/dw). The expression can be further simplified using the fact that 1 − x ∼ Un(0,1) to give mw = 1 − x^(1/dw). Note that mw is also defined for real non-negative values of dw. Since for the purposes of the min-Hash algorithm only the ordering of the hashes is important, a further simplification can be obtained by applying a monotonic transformation:

f_j(Xw) = − log(x) / dw, where x ∼ Un(0,1).   (8)

4.2 Histogram intersection
In this section, the bag-of-words image representation will be used. We show how a new vocabulary can be constructed so that the min-Hash algorithm can be directly applied to approximate histogram intersection. Let ti be a vector of size |V| where each coordinate t^w_i is the number of visual words Xw present in the i-th document. Let yw denote the highest number of occurrences of visual word Xw in a document in the database, yw = max_i t^w_i. We can construct a new vocabulary V′ as follows. For each visual word Xw the vocabulary will contain yw different elements X^1_w, . . . , X^yw_w. The bag-of-words representation ti of a document can then be equivalently represented as a set A′i ⊂ V′, where the set A′i contains t^w_i elements representing visual word Xw: X^l_w ∈ A′i iff t^w_i ≥ l. For example, if an image contains two features represented by visual word Xw, elements X^1_w and X^2_w will be present in the set representation of that image.²

² Note the difference between expanding the vocabulary in section 4.1 and here. In simw the number of repeated elements representing one visual word was either 0 (if the visual word was not present in the image) or dw in all images, and the elements were indistinguishable. Here, each document can contain a different number of repeated elements, depending on how many instances of the visual word appear in the image. Also, each instance is unique: elements X^1_w and X^2_w are different.
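A sketch of equation (8); the draw x is seeded by the pair (function index, word) so that f_j is a fixed function of the word, as the algorithm requires (the seeding scheme is an illustrative choice, not the paper's):

    import numpy as np

    def f_j(w: int, j: int, d: dict) -> float:
        """Hash value of visual word w under the j-th function, eqn (8).

        x ~ Un(0,1) is seeded by (j, w), so the same word always receives
        the same draw; larger weights d[w] shrink the value, making the
        word more likely to become the min-hash.
        """
        x = np.random.default_rng([j, w]).random()
        return -np.log(x) / d[w]

    def weighted_min_hash(A: set, j: int, d: dict) -> int:
        """Word of A minimizing f_j; P(agreement) approximates simw, eqn (2)."""
        return min(A, key=lambda w: f_j(w, j, d))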



Page 27: Min-wise Independent Permutations - NTUA

Intersection

The min-Hash algorithm can be applied to the new set representation directly. The size of the set intersection |A′1 ∩ A′2| is equal to ∑_w min(t^w_1, t^w_2) and the size of the set union is |A′1 ∪ A′2| = ∑_w max(t^w_1, t^w_2). Applying these equalities to sims, eqn (3), we directly obtain eqn (4). The extension to weighted histogram intersection is straightforward.

5 Experimental Results
We demonstrate our method for NDID on two data sets: the TrecVid 2006 data set and the University of Kentucky data set.

It is difficult to evaluate near duplicate image retrieval, especially on large data sets. Labelling of large data sets is difficult in its own right, and the subjective definition of near duplicate images complicates things further. There is no ground truth available for the TrecVid data set, hence a precise comparison of the accuracy of the methods is not possible for this data. Therefore, in the first experiment we mainly focus on measuring the efficiency of the methods on the large (146k images) TrecVid data set.

To evaluate the quality of the retrieval, we present an extensive comparison of the original min-Hash method and the two proposed methods (word weighting and weighted histogram intersection) on an image retrieval database – the University of Kentucky database [17] – where the ground truth is available.

The idf weights were used in simw and simh as word weights in our experiments.

5.1 TrecVid 2006
The TrecVid [21] database consists of 146,588 JPEG keyframes automatically pre-selected from 165 hours (17.8M frames, 127 GB) of MPEG-1 news footage, recorded from different TV stations from around the world. Each frame is at a resolution of 352×240 pixels and normally of quite low quality. The frames suffer from compression artefacts, jitter and noise typically found in highly compressed video. In this experiment a vocabulary of 64K visual words, N = 192 min-Hashes, sketch size n = 3, and k = 64 sketches were used, as in [4].

We measured the number of sketch hits, i.e. how many pairs of documents were considered to be near duplicates. Figure 1 displays the number of sketch hits plotted against the similarity measures of the colliding documents. For document pairs with a high value of the similarity measure, the number of hits is roughly equal for sims and simw, and slightly higher for simh. This means that about the same number of near duplicate images will be recovered by the first two methods, and the histogram intersection detects a slightly higher number of near duplicates. The detected near duplicate results appear similar after visual inspection and no significant discrepancy can be observed between the results of the methods.

However, for document pairs with low similarity (pairs that are of no interest), using the simw and simh similarities significantly reduces the number of sketch hits. In the standard version of the algorithm, even uninformative visual words that are common to many images are equally likely to become a min-Hash. When this happens, a large number of images is represented by the same frequent min-Hash. In the proposed approach, common (non-informative) visual words are down-weighted by a low value of idf. As a result, a lower number of sketch collisions of documents with low similarity is observed.

The average number of documents examined per query is 8.5, 7.1, and 7.7 for sims, simw, and simh respectively. Compare this to the 43,997.3 considered documents (images …

Page 28: Min-wise Independent Permutations - NTUA

Outline

Min-wise Independent Permutations

Near Duplicate Detection with Min-Hash

Min-hash with TF-IDF

Image Clustering

Page 29: Min-wise Independent Permutations - NTUA

Image Clustering


Fig. 1. Visualization of a part of a cluster of spatially related images automatically discovered from a database of 100K images. Only part of the cluster is shown. Overall, there are 113 images in the cluster, all correctly assigned. A sample of geometrically verified correspondences is depicted as links between the images. Note that the images show the tower from opposite sides.

The above mentioned process starts with an image provided or selected by the user. However, 3D registration is still a slow process. In general, it is not possible to do it online, and an immediate response to the user requires that the 3D reconstruction is already available, computed off-line. The clustering method proposed in the paper is a suitable back-end for such a system, as it discovers sufficiently large sets of overlapping images suitable for automatic reconstruction. Moreover, it outputs inter-image correspondences that may bootstrap the 3D scene reconstruction process. Availability of a sufficient number of images is essential for the 3D reconstruction, and almost all sets that are usable for 3D reconstruction have a size at which our method retrieves the cluster almost certainly.

The rest of the paper is structured as follows. Section 2 reviews the work on unsupervised object and scene discovery, Section 3 describes the use of min-Hash for data mining purposes, and in Section 4 the method is experimentally verified on real image databases.

2 Related work on unsupervised object and scene discovery

The problem of matching (organizing) an unordered image set was first addressed by Schaffalitzky and Zisserman in [10]. Their objective was first automatic recovery of geometric relations between images from a spatially related set (of tens of images) and then 3D reconstruction. We are interested in a similar problem, but also in the discovery of multiple such sets in databases with a number of images several orders of magnitude higher.

Recently, the majority of image retrieval systems adopt the bag-of-words approach [11], which we also follow. First, regions of interest are detected [12] and described by an invariant descriptor [13]. The descriptors are then vector quantized into a vocabulary of visual words [11, 5, 6].

The approach closest to ours is [14] by Sivic and Zisserman, whose objective is unsupervised discovery of multiple instances of particular objects in feature films. Object hypotheses are instantiated on neighbourhoods centered around regions of interest. The neighbourhoods include a predefined number of other regions, and the hypothesized object is represented by a fixed number of visual words describing the regions. Each hypothesized object is used as a query against the database consisting of the key frames of the film. To reduce the number of similarity evaluations, each of which requires counting the number of common visual words, only neighbourhoods centered at the same visual word are compared.

The method requires ∑_{i=1}^{w} d_i² similarity evaluations, where w is the size of the vocabulary and d_i is the number of regions assigned to the i-th visual word. Let D be the number of documents and t the average number of features in an image, so that ∑_{i=1}^{w} d_i = tD.

Chum Matas – Technical Report 2008
Web Scale Image Clustering – Large Scale Discovery of Spatially Related Images

Page 30: Min-wise Independent Permutations - NTUA

Method


The lower bound on the complexity of the approach in [14] can then be written as

∑_{i=1}^{w} d_i² ≥ ∑_{i=1}^{w} (tD/w)² = (t²/w) D².   (1)

The asymptotic complexity of [14] is thus O(D²). The factor t²/w is a ratio of two constants independent of the size of the database. The size of the vocabulary commonly used is up to w = 1,000,000, and the average number of regions in an image for the database used in this paper is slightly over t = 2,800, leaving the value of the coefficient t²/w = 7.84 in the order of units. Hence, the algorithm would behave as quadratic in the number of images even for relatively small databases. The complexity of [14] is thus the same as the complexity of querying the whole database with each image in turn. Such complexity is prohibitive for large databases.

Methods for query speed-up [15, 7] proceed by pre-clustering documents into similar groups. For a query, a set of relevant document clusters is first retrieved sub-linearly and the query is only evaluated against images in the selected clusters. Such an approach trades off recall for up to a seven-fold speed-up [7]. A speed-up of this order is insufficient to allow querying by each image on large databases.

Approaches improving the accuracy of image retrieval [7, 9] are relevant to this paper despite not helping seed initialization, since they improve the second stage of our approach, the crawl for images visually connected to seed pairs. Accuracy improving techniques include learning a local inter-document distance measure based on the density in the document space [7] and selecting the most informative features for the vocabulary [9]. Note that the statistics used in those approaches might be difficult to update when new images, either related (changing the density in the document space) or completely unrelated (changing the relevance of the features), are inserted into the database.

Another class of methods tackling unsupervised object learning is based on topic discovery [16] via generative modeling like probabilistic Latent Semantic Analysis (pLSA) [17] and Latent Dirichlet Allocation (LDA) [18]. Object discovery based on topic analysis was further developed in [19], where multiple segmentations were used to hypothesize the locations and extent of possible objects.

The pLSA and LDA models are a favourite choice for (unsupervised) object / image category recognition due to their generalization power. However, the ability to generalize to a topic such as “building” is rather a disadvantage when particular objects are sought.

We consider topic analysis approaches not suitable for our problem for the following reasons: (i) Speed: these learning methods are slow, iterative and sequential (difficult or impossible to parallelize). (ii) Topics discovered by pLSA / LDA typically appear in a number of images proportional to the size of the dataset, while in this paper we aim at finding clusters of a certain size independent of the size of the database. (iii) When new images are inserted into the database and a new topic should be formed using both old and new data, the methods need to process the original (already processed) data again together with the new ones.

3 Data Mining with min-Hash

In this section, the proposed method for the discovery of clusters of spatially overlapping images is described. As the first step, pairs of images that are likely to be spatially overlapping, the so-called seeds, are found by a procedure exploiting properties of the min-Hash algorithm. Understanding the procedure requires at least a basic familiarity with min-Hash and we therefore review the algorithm in Sect. 3.1. Next, the four steps of the cluster discovery algorithm are detailed (a toy code sketch of steps 1-3 follows the list):

1. Hashing. Image descriptors are hashed into a hash table. In the experiments in the paper we use 2^51 different descriptor values. The probability of two images falling into the same bin (exact descriptor match) is proportional to their similarity – equation (2).

2. Similarity estimation. For all n-choose-2 pairs of the n images that have been hashed into the same bin, a similarity is estimated. Similarity estimation is fast and consists of comparing two vectors and counting the number of identical elements. In this work, the number of vector elements is 512. The similarity is then thresholded.

3. Spatial consistency. For each image pair that passed the similarity test, spatial consistency is verified. Image pairs that pass the spatial consistency test are the cluster seeds.

4. Seed growing. Once cluster seeds are generated, the seed images are used as visual queries and a query expansion technique is used to ‘crawl’ the images in the cluster.
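A toy sketch of steps 1-3, with the geometric verification of step 3 stubbed out; the data layout and the threshold are illustrative, not the paper's:

    from collections import defaultdict
    from itertools import combinations

    def spatially_consistent(i, j):
        """Placeholder for the geometric verification of step 3."""
        return True

    def find_seeds(doc_sketches, doc_minhashes, threshold=0.05):
        """doc_sketches[i]: list of hashable sketches of image i;
        doc_minhashes[i]: its full min-hash vector (e.g. 512 values)."""
        table = defaultdict(set)
        for i, sks in doc_sketches.items():              # step 1: hashing
            for pos, sk in enumerate(sks):
                table[(pos, sk)].add(i)

        seeds = set()
        for bin_docs in table.values():
            for i, j in combinations(sorted(bin_docs), 2):
                a, b = doc_minhashes[i], doc_minhashes[j]
                est = sum(x == y for x, y in zip(a, b)) / len(a)      # step 2
                if est >= threshold and spatially_consistent(i, j):  # step 3
                    seeds.add((i, j))
        return seeds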

Page 31: Min-wise Independent Permutations - NTUA

Collision probability


3.1 The min-Hash algorithm review

The min-Hash algorithm [20, 8] is a randomized method based on hashing that finds highly similar image pairs with probability close to one, unrelated images with probability close to zero, and similar image pairs (with low but non-negligible similarity, such as images of the same object) with a rather small probability. The low recall stops the min-Hash from being used directly as a general image retrieval method. However, in this paper we argue that it can be efficiently used for data mining purposes.

A brief review of the min-Hash algorithm follows; for a detailed description see [21, 20]. For the purpose of min-Hash, images are represented as sets of visual words. This is a weaker representation than a bag of visual words, since the frequency is reduced to binary information (present / absent). Similarity of two images sim(A1, A2) is measured as the set overlap (ratio of intersection over union) of their set representations

sim(A1, A2) = |A1 ∩ A2| / |A1 ∪ A2| ∈ 〈0, 1〉. (2)

A min-Hash is a function f that assigns a number to each set of visual words (each image representation). The function has the property that the probability of two sets having the same value of the min-Hash function is equal to their similarity

P(f(A1) = f(A2)) = sim(A1, A2).

To estimate the similarity of two images, multiple independent min-Hash functions fi are used. The fraction of the min-Hash functions that assign an identical value to the two sets gives an unbiased estimate of the similarity of the two images.
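This property suggests a direct implementation: each min-Hash function is the minimum over the set under a random permutation of the vocabulary. Below is a minimal sketch, with the permutation simulated by a random affine map modulo a prime (an approximately min-wise family); all names are ours, not the paper's.

```python
import random

random.seed(0)

def make_min_hash(p=2**31 - 1):
    """One min-Hash function over sets of visual-word ids; the random affine
    map stands in for a random permutation of the vocabulary."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda words: min((a * x + b) % p for x in words)

def similarity_estimate(words1, words2, n_hashes=512):
    """Fraction of independent min-Hash functions agreeing on the two sets."""
    fs = [make_min_hash() for _ in range(n_hashes)]
    return sum(f(words1) == f(words2) for f in fs) / n_hashes

# Two toy sets with set-overlap similarity 2/6 = 0.33; the estimate
# printed below should be close to that value.
print(similarity_estimate({1, 2, 3, 4}, {3, 4, 5, 6}))
```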

Retrieval with min-Hash. So far, a method to estimate the similarity of two images was discussed. To efficiently retrieve images with high similarity, the values of the min-Hash functions fi are grouped into s-tuples called sketches. Similar images have many values of the min-Hash function in common (from the definition of similarity) and hence have a high probability of having the same sketches. On the other hand, dissimilar images have a low chance of forming an identical sketch. Identical sketches are efficiently found by hashing.

The probability of two sets having at least one sketch out of k in common is

P(collision) = 1 − (1 − sim(A1, A2)^s)^k. (3)

The probability depends on the similarity of the two images and on the two parameters: s, the size of the sketch, and k, the number of (independent) sketches. These are the parameters of the method. Figure 2 visualizes the probability of collision plotted against the similarity of two images for fixed s = 3 and k = 512. Figure 3 shows different image pairs and their similarity.
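Equation (3) is easy to evaluate; the short check below (our code) reproduces a few points of the curve in figure 2 for the paper's parameters s = 3, k = 512, together with the sketch grouping itself.

```python
def p_collision(sim, s=3, k=512):
    """Equation (3): probability of at least one common sketch out of k."""
    return 1.0 - (1.0 - sim**s)**k

def group_into_sketches(minhash_values, s=3):
    """Group k*s min-Hash values into k s-tuples; identical tuples are then
    found cheaply by ordinary hashing."""
    return [tuple(minhash_values[i:i + s])
            for i in range(0, len(minhash_values), s)]

for sim in (0.9, 0.1, 0.05, 0.01):
    print(f"sim = {sim:4.2f}  ->  P(collision) = {p_collision(sim):.4f}")
# sim = 0.90 -> ~1.0000; 0.10 -> ~0.4009; 0.05 -> ~0.0620; 0.01 -> ~0.0005
```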

3.2 Cluster seed generation

In this section, a randomized procedure that generates seeds from possible clusters of images is described. Let us first look at the plot of the probability of sketch collision against the similarity of the images depicted in figure 2. The sigmoid-like shape of the curve is important for the near duplicate detection task [20]. Image pairs with high similarity are retrieved with a probability close to one. Then, the probability drops rapidly - through similar image pairs (typically images of the same object from a slightly different viewpoint) that are occasionally retrieved, to unrelated image pairs (with similarity below 1%) that have a close to zero probability of being retrieved.

Now, for the purpose of data mining, we focus on the bottom left corner of the graph. According to equation (3), an image pair with similarity sim = 0.05 has probability 6.2% of being retrieved (using 512 sketches of size 3). Such poor recall is certainly below an acceptable level for a retrieval system. However, we do not aim at retrieving all relevant images from the image clusters in a single step. The task is to quickly retrieve seeds from the clusters - it is sufficient to retrieve a single seed per cluster, and we are fortunate that the importance of a cluster is related to its size in the database.

The probability that not a single image pair (seed) is found by the min-Hash depends on two factors - the similarity of the images in the cluster and the number of image pairs that actually observe the same object. In the following analysis, which demonstrates a lower bound on this probability, we assume that a particular object or landmark is seen in v views and that all image pairs have the same (average) similarity; by equation (3), each pair then has the same probability ε of being retrieved. The probability that none of the v(v−1)/2 pairs of the v views is retrieved is

P(fail) = (1 − ε)^(v(v−1)/2).
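Composing equation (3) with the number of pairs gives a quick numeric check. The sketch below is our code, and it assumes, as the lower-bound analysis above does, that the pair retrievals are independent events.

```python
def p_fail(v, sim, s=3, k=512):
    """Probability that none of the v*(v-1)/2 image pairs of a cluster is
    retrieved, with the per-pair collision probability
    eps = 1 - (1 - sim^s)^k taken from equation (3)."""
    eps = 1.0 - (1.0 - sim**s)**k
    return (1.0 - eps) ** (v * (v - 1) // 2)

for v in (4, 8, 12, 18):
    print(f"v = {v:2d}: P(fail) = {p_fail(v, sim=0.06):.3f}")
# with 6% average similarity: v=4 -> ~0.52, v=8 -> ~0.05, v=12 -> ~0.001
```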


Fig. 2. The probability of at least one sketch collision for two documents plotted against their similarity; with k = 512 sketches, s = 3 min-Hashes per sketch. The right plot shows a close-up of the bottom left corner of the left plot. Note the logarithmic vertical axis.

Fig. 3. Examples of the relation of ‘visual similarity’ and the set overlap similarity; the image pairs shown have sim = 74.2%, 13.0%, 7.4% and 5.7%.

The plot in figure 4 shows that for popular places (i.e. those where photos are often taken) the probability of failure to retrieve an image pair vanishes. There are three plots, for similarities 5%, 6% and 7%. Since the similarity is defined as the ratio of the size of the intersection over the size of the union, the difference between similarity 6% and 5% is substantial. Going from 6% to 5% similarity means removing 17.5% of the elements that were in the intersection (removing x common elements from both sets changes the similarity to (|I| − x)/(|U| − x); solving (0.06|U| − x)/(|U| − x) = 0.05 gives x ≈ 0.175 |I|).

It is important to point out that the probability of finding a seed depends on the image similarities and the number of views and is completely independent of the size of the database. The v views have the same chance to be discovered in a database of 5000 images as in a database of several millions of images, without any need to change the method parameters or re-hash. This is not true for many topic discovery approaches.

Time complexity. The method is based on hashing with a fixed number M of bins. The number of bins is based on the size of the vocabulary, which cannot be increased indefinitely without splitting descriptors of the same physical region. Assuming a uniform distribution of the keys, the number C of keys that fall into the same bin is a random variable with a Poisson distribution where the expected number of occurrences is λ = D/M. The expected number of key pairs that fall into the same bin (summed over all bins) is

∑_{i=1}^{M} E(C²) = ∑_{i=1}^{M} (λ² + λ) = D²/M + D. (4)

The asymptotic time complexity is O(D²) as D, the size of the image database, approaches infinity. However, for finite databases of sizes up to D ≤ M, the method behaves as linear in the number of documents, since then D²/M + D ≤ 2D.
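The expectation in equation (4) can be checked numerically: throwing D keys uniformly into M bins and summing C² over the bins should come out close to D²/M + D. A quick simulation (ours, not the authors' analysis code) follows.

```python
import random
from collections import Counter

def sum_c_squared(D, M, seed=0):
    """Simulate D keys thrown uniformly into M bins; return the sum of C^2
    over all bins."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(M) for _ in range(D))
    return sum(c * c for c in counts.values())

D, M = 100_000, 1_000_000
print(sum_c_squared(D, M))   # simulated value, close to the expectation
print(D**2 / M + D)          # equation (4): 110000.0
```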


Page 33: Min-wise Independent Permutations - NTUA

Probability of finding seeds


Fig. 4. Probability of failure to generate a seed in a set of images depicting the same object using min-Hash with 512 sketches of size 3 (left) and 4 (right); note the different scales on the horizontal axes. The three curves show the dependence for different ‘average’ similarity equal to 7% (lowest curve), 6% (middle) and 5% (highest). The marker ’+’ on the left plot denotes the experimental result on the University of Kentucky dataset (cluster size 4, P(fail) = 1 − 0.469); see section 4.1.

In the min-Hash algorithm, the number of keys depends on the size of the vocabulary w and the size of the sketch s and is proportional to M = w^s. In the experiments in this paper, we used w = 2^17 and s = 3 or s = 4. This gives the number of different hash keys M = 2^51 and M = 2^68, respectively. We believe that this number is sufficient to conveniently deal with web scale databases.
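The hash-key counts quoted above follow directly from M = w^s; a two-line check making the exponent arithmetic explicit:

```python
w = 2**17                    # vocabulary size used in the experiments
for s in (3, 4):
    M = w**s                 # number of distinct hash keys, M = w^s
    print(f"s = {s}: M = 2^{17 * s} ~ {M:.2e}")
# s = 3: M = 2^51 ~ 2.25e+15;  s = 4: M = 2^68 ~ 2.95e+20
```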

3.3 Growing the seed

We build on the query expansion technique [8] to increase the recall. The idea is as follows: an original query is issued and the results are then used to issue new queries. Not all results are used, only those that have the same spatial feature layout (for more details on spatial verification see the following section). The spatial verification prevents the query expansion from so-called topic drift, where an unrelated image is used to expand the query.

In our approach, we combine the two types of query expansion methods suggested in [8] - transitive closure and average expansion. In the transitive closure, each previously unseen (spatially verified) result is used to issue a new query. This method is used to ‘crawl’ the scene. To improve the recall, each query is attempted to be expanded by an average expansion: result images in which a sufficient number of regions are related by a homography (or homographies) to the query image are selected. The homography is then used to back-project features from the result image to the query image (only features within a bounding box of the homography support are mapped). A new query is issued using the combination of the original features and the features gathered from the result images. For efficiency, each image is used at most once for an average query expansion.
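A schematic sketch of the combined crawl follows, assuming queries are represented as plain sets of features; features_of, retrieve, verified_homography and backproject are hypothetical stand-ins for the retrieval engine of [8] and the spatial verification of Sect. 3.4, not real APIs.

```python
def grow_seed(seed_images, features_of, retrieve, verified_homography,
              backproject):
    """Crawl a cluster from its seed images.
    features_of(img) -> set of query features of an image;
    retrieve(features) -> candidate image ids;
    verified_homography(features, img) -> homography H, or None if the
    candidate fails spatial verification;
    backproject(img, H) -> features of img mapped into the query frame."""
    cluster = set(seed_images)
    queue = [features_of(img) for img in seed_images]
    averaged = set()                 # images already used for averaging
    while queue:
        query = queue.pop()
        verified = []
        for img in retrieve(query):
            H = verified_homography(query, img)
            if H is not None:
                verified.append((img, H))
        # transitive closure: every unseen verified result spawns a new query
        for img, _ in verified:
            if img not in cluster:
                cluster.add(img)
                queue.append(features_of(img))
        # average expansion: enrich the query with back-projected features,
        # using each image at most once
        fresh = [(img, H) for img, H in verified if img not in averaged]
        if fresh:
            averaged.update(img for img, _ in fresh)
            extra = set().union(*(backproject(img, H) for img, H in fresh))
            queue.append(query | extra)
    return cluster
```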

If our data mining method is used for obtaining images for 3D reconstruction, a (partial) 3D model can be used for query expansion [8]. To retrieve images from unexplored viewpoints, synthetic views (not necessarily pixel-wise) could be generated and used as queries. This is beyond the scope of this paper.

3.4 Spatial verification

In spatial verification we build on the many-to-many RANSAC-like approach from [6]. Tentative correspondences are defined by common visual word ids. The geometric constraint is an affine transformation. This choice is convenient since a single ellipse-to-ellipse correspondence (plus a constraint on the gravity vector) is sufficient to instantiate the model. The affine transformation model with loose thresholds allows for detection of close-to-planar structures in the scene with no significant perspective distortion. Unlike in [6], we fit multiple such models. The global consistency of those models is then verified by a RANSAC fitting of an epipolar geometry or homography [22]. This final check is rapid – tentative correspondences for this stage are a union of inlier correspondences from the previous stage and a
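The single-correspondence hypothesize-and-verify stage can be sketched as follows. This is our stand-in code, with a hypothetical affine_from_match helper and correspondence format; the multi-model fitting and the final epipolar-geometry check of [22] are omitted.

```python
import numpy as np

def verify_pair(correspondences, affine_from_match, inlier_tol=20.0,
                min_inliers=10):
    """correspondences: list of (x1, x2, shape1, shape2), where x1, x2 are
    2D centres (np.ndarray) of regions sharing a visual word, and shape1/2
    are the local affine shapes of the ellipses.
    affine_from_match(...) -> (A, t), an affine transform instantiated from
    a single ellipse-to-ellipse correspondence."""
    best = []
    for x1, x2, s1, s2 in correspondences:     # one hypothesis per match
        A, t = affine_from_match(x1, x2, s1, s2)
        inliers = [(p1, p2) for p1, p2, _, _ in correspondences
                   if np.linalg.norm(A @ p1 + t - p2) < inlier_tol]
        if len(inliers) > len(best):
            best = inliers
    return best if len(best) >= min_inliers else None
```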


Page 36: Min-wise Independent Permutations - NTUA

Clusters example


Fig. 5. Selected images from selected clusters discovered in the 100K database including the Oxford Landmark dataset. Top - the largest cluster, containing the Radcliffe Camera and All Souls (404 images). Below - discovered clusters of sizes 53, 14, 51, 18, and 13 respectively, not in the ground truth annotation. The last cluster contains one false positive (the rightmost image); the other clusters are visually connected. The top four clusters were also discovered in the experiment with sketches of size four.

Page 37: Min-wise Independent Permutations - NTUA

Example – side effect


Fig. 6. A useful side-effect: a sample of near duplicate images detected in the database.

Fig. 7. Images with two different objects segmented: Radcliffe Camera and All Souls (left), Bridge of Sighs and Bodleian Library (right). The extent of the landmarks is shown with different markers and colours.

Landmark labelling. As mentioned before, a single cluster may contain more than one landmark. We can further factorize each cluster (using image matches and weak 3D constraints such as co-planarity and disparity) into sub-clusters containing a single landmark. Results of automatic landmark segmentation are shown in figure 7. Matches between landmark sub-clusters are shown in figure 8. Finally, common user annotations from the sub-clusters (if available) may serve as name tags, as shown in figure 9. The segments and the positions for the labels were discovered automatically; the correct textual annotations were added manually.

Full 3D Reconstruction. The discovered clusters were processed by a 3D reconstruction pipeline [24]. Sample results are shown in figures 10 and 11.

6 Conclusions

We have proposed a method for discovering spatially-related images in large scale image databases. Its speed depends on the size of the database and is close to linear for database sizes up to approximately 2^34 ≈ 10^10 images. The success rate of cluster discovery depends on the cluster size and average similarity and is independent of the size of the database. The properties and performance of the algorithm were demonstrated on datasets with 10^4 and 10^5 images. The proposed algorithm provides, as a side effect, state-of-the-art near duplicate image detection.

The desirable characteristics of the data mining method stem from an appropriate use of the min-Hash algorithm, which has so far been used only for near duplicate (high similarity) problems.

Acknowledgments. The authors would like to thank Daniel Martinec for the 3D reconstruction, Michal Perd’och for discussions and help, and James Philbin for providing the data and his implementation of the spatial verification [6]. We are grateful for support from EC grant 215078 DIPLECS.