Finding Similar Files in Large Document Repositories


Page 1: Finding Similar Files in Large Document Repositories

Finding Similar Files in Large Document Repositories

KDD’05, August 21-24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM.

George Forman, Hewlett-Packard Labs, [email protected]

Kave Eshghi, Hewlett-Packard Labs, [email protected]

Stephane Chiocchetti, Hewlett-Packard France, [email protected]

Page 2: Finding Similar Files in Large Document Repositories

Agenda

• Introduction

• Method

• Results

• Related work

• Conclusions

Page 3: Finding Similar Files in Large Document Repositories

Presented by Joyce Chen

Introduction

Millions of technical support documents, covering many different products, solutions, and phases of support.

The content in a new document may duplicate existing material. Authors prefer to copy rather than link to content by reference:

• to avoid the possibility of dead links

• because, by mistake or limited authorization, the referenced version may not be kept up to date

Solution:

• Chunking technology to break up each document into paragraph-like pieces.

• Detecting collisions among the hash signatures of these chunks.

• Efficiently determining which files are related in a large repository.

Page 4: Finding Similar Files in Large Document Repositories

Method

Step 1: Use a ‘content-based chunking algorithm’ to break up each file into a sequence of chunks.

Step 2: Compute the hash of each chunk.

Step 3: Find the files that share chunk hashes, reporting only those whose intersection is above some threshold.

Page 5: Finding Similar Files in Large Document Repositories

Hashing background

Use the ‘compare by hash’ method to compare chunks occurring in different files.

Hashes are short, fixed-size sequences, and it is almost impossible to find two different chunks that have the same hash.

Use the MD5 algorithm, which generates 128-bit hashes.

Two advantages of comparing hashes rather than the chunks themselves:

• Comparison time is shorter.

• Being short and of fixed size, hashes lend themselves to efficient data structures for lookup and comparison.
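
A minimal sketch of compare-by-hash in Python (the chunk contents are illustrative):

```python
import hashlib

def chunk_hash(chunk: bytes) -> bytes:
    """Return the 128-bit MD5 digest used as the chunk's signature."""
    return hashlib.md5(chunk).digest()

# Two chunks are treated as identical iff their 16-byte digests match;
# comparing digests is much cheaper than comparing the chunks themselves.
a = b"some paragraph-like chunk of a support document"
b = b"some paragraph-like chunk of a support document"
assert chunk_hash(a) == chunk_hash(b)
```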

Page 6: Finding Similar Files in Large Document Repositories

Chunking

Breaking a file into a sequence of chunks.

Chunk boundaries are determined by the local contents of the file.

Basic sliding window algorithm:

• A pair of pre-determined integers D and r, with r < D.

• A fixed-width sliding window of width W.

• F_k, the fingerprint of the window ending at position k.

• Position k is a chunk boundary if F_k mod D = r.
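
A minimal sketch of this boundary rule in Python. The paper computes F_k with a Rabin fingerprint; a simple polynomial rolling hash is substituted here, and W, D, and r are illustrative values, not the paper's settings:

```python
W, D, R = 48, 540, 7            # window width, divisor, remainder (r < D)
BASE, MOD = 257, (1 << 61) - 1  # rolling-hash parameters (assumed)

def chunk_boundaries(data: bytes):
    """Yield each position k where F_k mod D == r."""
    fk, top = 0, pow(BASE, W - 1, MOD)
    for k, byte in enumerate(data):
        if k >= W:                       # slide the window: drop oldest byte
            fk = (fk - data[k - W] * top) % MOD
        fk = (fk * BASE + byte) % MOD    # ...and take in the new byte
        if k + 1 >= W and fk % D == R:   # boundary test: F_k mod D == r
            yield k

def chunks(data: bytes):
    """Split data into chunks at the content-defined boundaries."""
    start = 0
    for k in chunk_boundaries(data):
        yield data[start:k + 1]
        start = k + 1
    if start < len(data):
        yield data[start:]               # trailing partial chunk
```

Because boundaries depend only on the local window contents, an insertion early in a file shifts at most the chunks near the edit, unlike fixed-size blocking.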

Page 7: Finding Similar Files in Large Document Repositories

Chunking and file similarity

The requirement for a content-based chunking algorithm: when two sequences R and R' share a contiguous sub-sequence larger than the average chunk size, there should be a good probability that at least one shared chunk falls within the shared sub-sequence. The basic algorithm gives no control over chunk sizes, which works against this property.

Use the TTTD (Two Thresholds, Two Divisors) algorithm to avoid this problem. Four parameters:

• D, the main divisor

• D', the backup divisor

• Tmin, the minimum chunk size threshold

• Tmax, the maximum chunk size threshold
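
A sketch of TTTD under assumed parameters. The same stand-in rolling hash is used; D' is chosen near D/2 so backup breakpoints occur roughly twice as often, and the thresholds are illustrative:

```python
W, BASE, MOD, R = 48, 257, (1 << 61) - 1, 7  # as in the previous sketch
D, D2 = 540, 270           # main divisor D and backup divisor D'
TMIN, TMAX = 460, 2800     # min/max chunk size thresholds (assumed)

def tttd_chunks(data: bytes):
    out, start, backup = [], 0, -1
    fk, top = 0, pow(BASE, W - 1, MOD)
    for k, byte in enumerate(data):
        if k >= W:
            fk = (fk - data[k - W] * top) % MOD
        fk = (fk * BASE + byte) % MOD
        if k - start + 1 < TMIN:       # below Tmin: ignore all breakpoints
            continue
        if fk % D2 == R:               # backup breakpoint (more frequent)
            backup = k
        if fk % D == R:                # main breakpoint: cut here
            cut = k
        elif k - start + 1 >= TMAX:    # hit Tmax: prefer the backup cut
            cut = backup if backup >= start else k
        else:
            continue
        out.append(data[start:cut + 1])
        start, backup = cut + 1, -1
    if start < len(data):
        out.append(data[start:])       # trailing partial chunk
    return out
```

The two thresholds bound every interior chunk's size within [Tmin, Tmax], while the backup divisor makes a forced cut at Tmax less likely to land at an arbitrary, content-independent position.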

Pages 8-10: Finding Similar Files in Large Document Repositories (no transcript text)
File similarity algorithm

Step 1: Break each file's content into chunks.

For each chunk, record its byte length and its hash code.

The bit-length of the hash code must be sufficiently long to avoid many accidental hash collisions among truly different chunks.

Page 11: Finding Similar Files in Large Document Repositories

File similarity algorithm (cont.)

Step 2 (optional, for scalability): Prune and partition the above metadata into independent sub-problems, each small enough to fit in memory.

Step 3: Construct a bipartite graph with an edge between a file vertex and a chunk vertex iff the chunk occurs in the file.

• File nodes are annotated with their file length.

• Chunk nodes are annotated with their chunk length.
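
A minimal sketch of the step 3 data structures as two hash maps (an inverted index; all names are illustrative):

```python
import hashlib
from collections import defaultdict

file_len = {}                    # file vertex annotation: total byte length
chunk_len = {}                   # chunk vertex annotation: chunk byte length
file_chunks = defaultdict(set)   # file -> hashes of its chunks
chunk_files = defaultdict(set)   # chunk hash -> files containing it

def add_file(name: str, chunk_list: list[bytes]) -> None:
    """Record a file's chunks; an edge between a file vertex and a
    chunk vertex exists iff the chunk occurs in the file."""
    file_len[name] = 0
    for c in chunk_list:
        h = hashlib.md5(c).digest()
        file_len[name] += len(c)
        chunk_len[h] = len(c)
        file_chunks[name].add(h)
        chunk_files[h].add(name)
```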

Page 12: Finding Similar Files in Large Document Repositories

File similarity algorithm (cont.)

Step 4: Construct a separate file-file similarity graph. For each file A:

(a) Look up the chunks AC that occur in file A.

(b) For each chunk in AC, look up the files it appears in, accumulating the set of other files BS that share any chunks with file A. (As an optimization due to symmetry, we exclude files that have previously been considered as file A in step 4.)

(c) For each file B in set BS, determine its chunks in common with file A, and add A-B to the file similarity graph if the total chunk bytes in common exceeds some threshold, or percentage of file length.
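
A sketch of step 4 over the maps from the step 3 sketch; the absolute-bytes threshold is shown, and the percentage-of-file-length variant would divide by file_len[a]:

```python
from collections import defaultdict

def similar_pairs(threshold_bytes: int):
    """Yield (A, B, shared_bytes) for file pairs whose chunk bytes in
    common exceed the threshold."""
    done = set()                         # files already treated as file A
    for a in file_chunks:
        shared = defaultdict(int)        # candidate file B -> bytes in common
        for h in file_chunks[a]:         # (a) chunks AC occurring in A
            for b in chunk_files[h]:     # (b) files sharing each chunk
                if b != a and b not in done:   # symmetry optimization
                    shared[b] += chunk_len[h]
        done.add(a)
        for b, nbytes in shared.items():       # (c) threshold the total
            if nbytes > threshold_bytes:
                yield a, b, nbytes
```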

Page 13: Finding Similar Files in Large Document Repositories

File similarity algorithm (cont.)

Step 5: Output the file-file similarity pairs as desired.

Use the union-find algorithm to determine clusters of interconnected files.
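
A compact union-find sketch for step 5, clustering the pairs from the step 4 sketch (the threshold value is assumed):

```python
parent = {}                     # node -> parent; roots point to themselves

def find(x):
    """Return x's cluster root, halving paths as it walks."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b, _ in similar_pairs(threshold_bytes=1000):  # assumed threshold
    union(a, b)     # interconnected files end up sharing one root
```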

Page 14: Finding Similar Files in Large Document Repositories

Handling identical files

Repositories often contain multiple files with identical content.

Use the same metadata with a small enhancement. While loading the file-chunk data:

• Compute a hash over all of a file's chunk hashes.

• Maintain a hash table that references file nodes by these unique content hashes.

• If a file's content has already been loaded, note the duplicate file name and avoid duplicating the chunk data in memory.
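
A minimal sketch of the enhancement (names are illustrative):

```python
import hashlib
from collections import defaultdict

content_index = {}                   # whole-content hash -> first file seen
duplicate_names = defaultdict(list)  # first file -> names of exact copies

def load_file(name: str, chunk_hashes: list[bytes]) -> None:
    """Hash the concatenation of a file's chunk hashes; if that content
    hash was already seen, record just the duplicate name instead of
    loading the chunk data into memory again."""
    whole = hashlib.md5(b"".join(chunk_hashes)).digest()
    if whole in content_index:
        duplicate_names[content_index[whole]].append(name)
        return
    content_index[whole] = name
    # ...otherwise load this file's chunk metadata as in step 1...
```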

Page 15: Finding Similar Files in Large Document Repositories

Handling identical files (cont.)

Page 16: Finding Similar Files in Large Document Repositories

Complexity analysis

The chunking of the files is linear in the total size N of the content.

The remainder of the algorithm is O(C log C), where C is the number of chunks in the repository, including duplicates.

Since C is linear in N, the overall cost is O(N log N).

Page 17: Finding Similar Files in Large Document Repositories

Results

• Implemented the chunking algorithm in C++ (~1200 lines of code).

• Used Perl to implement the similarity analysis algorithm (~500 LOC), the bipartite partitioning algorithm (~250 LOC), and a shared union-find module (~300 LOC).

The performance on a given repository ranges widely depending on the average chunk size (a controllable parameter).

Test repository: 52,125 technical support documents in 347 folders, comprising 327 MB of HTML content, on a 3 GHz Intel processor with 1 GB RAM.

• Chunk size set to 5000 bytes: took 25 minutes and generated 88,510 chunks.

• Chunk size set to 100 bytes: took 39 minutes and generated 3.8 million chunks.

Page 18: Finding Similar Files in Large Document Repositories

Related work

Brin et al., “Copy detection mechanisms for digital documents”:

• Maintains a large indexed database of existing documents.

• Detects whether a new document contains material that already exists in the database.

• It is a 1-vs-N document method, whereas this paper's method is all-to-all.

• Its chunk boundaries are based on the hash of ‘text units’ (paragraphs, sentences), which cannot handle technical documentation; this paper uses the TTTD chunking algorithm instead.

Page 19: Finding Similar Files in Large Document Repositories

Conclusions

The method identifies pieces of documents that may have been duplicated.

It relies on chunking technology rather than paragraph boundary detection.

The bottleneck is the human attention required to review the many results.

Future work:

• Reducing false alarms and missed detections.

• Making the human review process as productive as possible.