

Deduplication: Custodian vs. Case

Law Technology News (Online)

This article also appears in the following ALM publications:

Corporate Counsel

August 27, 2009 Thursday

Copyright 2009 ALM Media Properties, LLC All Rights Reserved Further duplication without permission is prohibited

Length: 1365 words

Byline: Alex K. Schiller, Special to the special to law

Body

Deduplication has become a mainstay of electronic data discovery (EDD) processing. Each document, such as a word-processing file or an e-mail message, is assigned an algorithmically calculated alphanumeric value (typically an MD5 hash), which is then compared with the values of all other electronic files in the data set. Documents with the same MD5 hash value are considered duplicates. As simple as this process seems, there are two different bases for deduplication: by custodian and by case. Both have their advantages and pitfalls.
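The hash comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular EDD application: the function names and the sample file names are hypothetical, and real processing tools typically normalize e-mail fields before hashing rather than hashing raw bytes.

```python
import hashlib
from collections import defaultdict

def md5_of(content: bytes) -> str:
    """Return the hexadecimal MD5 digest of a document's bytes."""
    return hashlib.md5(content).hexdigest()

def group_by_hash(documents: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each hash value to the names of the documents that share it."""
    groups = defaultdict(list)
    for name, content in documents.items():
        groups[md5_of(content)].append(name)
    return dict(groups)

# Hypothetical sample data: two byte-identical files and one that differs
# by a single character.
sample = {
    "memo_v1.doc": b"Quarterly results attached.",
    "memo_copy.doc": b"Quarterly results attached.",
    "memo_v2.doc": b"Quarterly results attached!",
}

# Any hash value shared by more than one document marks a set of duplicates.
duplicate_sets = [names for names in group_by_hash(sample).values() if len(names) > 1]
```

Only the two byte-identical memos land in the same group; a one-character change yields a different hash, so the third file is not treated as a duplicate.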

Deduplicating documents by custodian removes duplicates only within one person’s data set. A custodian is the owner of the electronic data harvested from that person’s hard drive, company network or e-mail account. If the data is collected only once, typically only a small number of duplicates exist. But if the custodian’s data is harvested on a rolling basis over time, the percentage of duplicate items will increase with successive collections. For example, a file containing one week of e-mail messages will contain a relatively small amount of new data compared to the previous week’s messages. Typical duplicates within a custodian’s data set are copies of e-mail messages created automatically by an "AutoArchive" rule established by the custodian.
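The effect of rolling collections can be sketched as follows. The function name and the weekly sample data are hypothetical, and each message is reduced to a short byte string for illustration.

```python
import hashlib

def dedupe_within_custodian(collections: list[list[bytes]]) -> tuple[list[bytes], int]:
    """Keep the first copy of each item across one custodian's collections.

    Returns the unique items and the number of duplicates removed.
    """
    seen = set()
    kept = []
    removed = 0
    for collection in collections:
        for item in collection:
            digest = hashlib.md5(item).hexdigest()
            if digest in seen:
                removed += 1          # already collected in an earlier harvest
            else:
                seen.add(digest)
                kept.append(item)
    return kept, removed

# Hypothetical rolling harvest: week 2 repeats all of week 1 plus one new message.
week1 = [b"msg-1", b"msg-2", b"msg-3"]
week2 = [b"msg-1", b"msg-2", b"msg-3", b"msg-4"]
kept, removed = dedupe_within_custodian([week1, week2])
```

In this sample, three of the four items in the second collection are duplicates of the first, which is why the duplicate share grows with each successive harvest.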

Deduplication by custodian is the basis preferred by vendors for several reasons. One obvious reason: deduplicating data sets by custodian removes fewer duplicates than deduplication by case, so more documents are generated for review. Vendors that offer to print data sets on demand can possibly earn the most income by deduplicating by custodian. A more subtle reason: custodian deduplication gives the EDD processing vendor the fewest headaches and worries, and makes it easier to explain to the law firm how data sets were deduped using the hash comparison explained above. Deduplication by case, or global deduplication, is not as easy to conduct or explain.

DEDUPLICATION BY CASE

Deduplication of documents on a case basis removes duplicate documents within the data set for the entire case. Duplicates are removed not only by comparing documents within each custodian’s data set, but also by a custodian-to-custodian comparison of documents. The advantage of this basis, often called global deduplication, is that it removes the greatest number of duplicates from the review database. As a result, the attorneys have the smallest number of documents to review. Some examples of global duplicates: documents stored on the comparably imaged computers of members of the same company department, and spam or e-mail sent to a group to which all or some of the case custodians belong.
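The difference between the two bases comes down to the scope of the hash comparison. The sketch below is an illustration under simple assumptions: the `dedupe` function, its `scope` parameter and the sample records are hypothetical, and documents are compared by the MD5 of their raw bytes.

```python
import hashlib

def dedupe(records, scope="custodian"):
    """Return the records kept after deduplication.

    records: list of (custodian, content) pairs in processing order.
    scope:   "custodian" compares hashes within each custodian only;
             "case" compares hashes across all custodians (global).
    """
    if scope not in ("custodian", "case"):
        raise ValueError("scope must be 'custodian' or 'case'")
    seen = set()
    kept = []
    for custodian, content in records:
        digest = hashlib.md5(content).hexdigest()
        # The only difference between the two bases is the comparison key.
        key = (custodian, digest) if scope == "custodian" else digest
        if key not in seen:
            seen.add(key)
            kept.append((custodian, content))
    return kept

# Hypothetical sample: a group announcement received by all three custodians.
records = [
    ("A", b"group announcement"),
    ("A", b"group announcement"),   # archived copy inside A's own data
    ("B", b"group announcement"),
    ("C", b"group announcement"),
    ("C", b"draft contract"),
]
by_custodian = dedupe(records, "custodian")   # one copy per custodian survives
by_case = dedupe(records, "case")             # one copy for the whole case survives
```

With this sample, custodian deduplication keeps four documents for review and global deduplication keeps two.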

Global deduplication complicates matters, especially when processing e-mail. The deduping process cannot distinguish which duplicate document is the original and which is the copy. Instead, duplicates are designated by the order in which the documents were processed: the first copy processed is kept and every later copy is treated as a duplicate. Therefore, a processor may remove the original e-mail that one of the case custodians sent to a group, keeping only a recipient’s copy.

For example, e-mails from three custodians, A, B and C, may be processed in that order. The EDD processing application considers custodian A’s e-mails the originals, and global duplicates will come mostly from B’s and C’s e-mails. The problem in this example is that the attorney may want an original message from B’s e-mail account, but that message was designated as a duplicate and removed from the review database. Custodian B was the original sender, yet the attorney has in the database only the copy that A received from B.
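The dependence on processing order in the A, B, C example can be reproduced in a short sketch. It assumes, as the example does, that the sender’s and the recipients’ copies hash to the same value; the function name, the "sent"/"received" labels and the message text are hypothetical.

```python
import hashlib

def global_dedupe(messages):
    """Keep the first message seen for each hash; later copies are dropped.

    messages: list of (custodian, role, content) tuples in processing order.
    Returns the (custodian, role) of each copy that survives.
    """
    kept = {}
    for custodian, role, content in messages:
        digest = hashlib.md5(content).hexdigest()
        kept.setdefault(digest, (custodian, role))   # first copy wins
    return list(kept.values())

# Hypothetical message that B sent to A and C.
body = b"From: B\nTo: A, C\n\nProject update for the team."
copies = {
    "A": ("A", "received", body),
    "B": ("B", "sent", body),
    "C": ("C", "received", body),
}

order_abc = global_dedupe([copies[c] for c in "ABC"])   # A processed first
order_bac = global_dedupe([copies[c] for c in "BAC"])   # B processed first
```

Processed in the order A, B, C, the only surviving copy is the one A received; the sender’s copy survives only if B happens to be processed first. The reordering is shown to illustrate the dependence, not as a practice the article prescribes.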

It may look awkward to the court that the attorney produced the message as delivered to a recipient and not the message from the original sender. The need to produce original documents from the custodian under investigation is particularly important in cases that require proof that the investigated custodian actively sent a certain message and did not passively receive it.

Another pitfall of global deduplication arises when a court orders the production of all documents from one custodian and only some documents from other custodians. To comply with such an order, it is recommended (if not necessary) that you include all duplicate documents from that custodian’s data set in the document set produced. Otherwise, as explained above, a processor could remove identified duplicate documents from the custodian named in the order while keeping the originals stored in another custodian’s data set.
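One way to picture the recommendation is a global deduplication pass that never drops documents belonging to the custodian named in the order. This is only a sketch under that assumption: the function name, the `full_custodian` parameter and the sample records are hypothetical, and actual EDD applications may expose such a rule differently or not at all.

```python
import hashlib

def dedupe_with_full_custodian(records, full_custodian):
    """Global deduplication that never drops documents from `full_custodian`.

    records: list of (custodian, content) pairs in processing order.
    """
    seen = set()
    kept = []
    for custodian, content in records:
        digest = hashlib.md5(content).hexdigest()
        # Documents of the named custodian are always retained; everyone
        # else's documents are kept only if the hash has not been seen yet.
        if custodian == full_custodian or digest not in seen:
            kept.append((custodian, content))
        seen.add(digest)
    return kept

# Hypothetical sample: the order requires all of B's documents.
records = [
    ("A", b"status report"),
    ("B", b"status report"),   # plain global deduplication would drop this
    ("B", b"status report"),   # second copy in B's own data, also retained
    ("C", b"status report"),
    ("C", b"budget memo"),
]
production = dedupe_with_full_custodian(records, full_custodian="B")
```

Both of B’s copies remain in the production set, while C’s copy of the same report is still removed as a global duplicate.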

It is imperative for the litigation support analyst and attorney to understand the deduplication process when electronic data is organized into a review database. Different EDD processing applications have different deduplication processing rules. Comparing e-mails is relatively easy, as explained above, but what about attachments?

DETECTING DUPLICATE E-MAIL ATTACHMENTS

Some EDD processing applications create a hash value for an attached file separate from its parent e-mail message. Using this method, duplicates can be found among attachments and loosely stored electronic files alike. This may be advantageous because a greater number of duplicates can be removed from the review database. Nonetheless, the recommended process for attachments is to apply no hash value or use the same hash value as the parent e-mail message. That way, an e-mail and its attachment are considered a set. Then, the deduplication process can compare and remove duplicate e-mail sets: the messages and their attachments, together.
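Treating a message and its attachments as a set can be sketched with a single hash computed over the whole set. How a given application derives the parent e-mail’s hash is not specified here, so the combination scheme below (hashing the concatenated digests of each part) is an assumption, as are the function name and the sample data.

```python
import hashlib

def family_hash(message: bytes, attachments: list[bytes]) -> str:
    """Hash an e-mail and its attachments as one unit.

    Each part is hashed on its own first, and the fixed-length part digests
    are then hashed together, so part boundaries cannot be confused.
    """
    outer = hashlib.md5()
    for part in [message, *attachments]:
        outer.update(hashlib.md5(part).digest())
    return outer.hexdigest()

report = b"contents of the attached report"   # hypothetical attachment bytes

email_1 = family_hash(b"See attached report.", [report])
email_2 = family_hash(b"See attached report.", [report])     # true duplicate set
email_3 = family_hash(b"Forwarding the report.", [report])   # same attachment, new message
loose_file = hashlib.md5(report).hexdigest()                 # same file stored loosely
```

Only the two identical message-plus-attachment sets match. The forwarded message and the loosely stored copy of the report keep their own hash values, so neither suppresses the e-mail set in the review database.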

Creating separate hash values for e-mail messages and their attachments can cause a pitfall for reviewers. For example, a litigation support analyst receives an e-mail account file that contains e-mails and attachments, as well as a DVD of electronic documents harvested from the client’s document management system and the custodian’s hard drive. The DVD contains a document that is also an attachment to an e-mail the custodian sent. These two documents may be considered duplicates by their calculated hash values. Let us say the deduping application assigns original status to the document from the DVD and duplicate status to the attachment. It may be imperative to the case to prove that the custodian disseminated that document via e-mail to recipients who allege that they never saw it. But in the review database, the attorney sees only the original document, the one from the DVD. The attorney may not know that the custodian e-mailed the document to several recipients.

Finally, it is imperative for the attorney to understand how the client’s or opponent’s document management system (DMS) stores e-mail and related attachments. A DMS may allow users to store an attachment separately from its e-mail message. But this separation can cause problems when it comes time to harvest data relevant to litigation or investigation. In many cases litigation hinges on document content, and knowledge of who sent what and when is not crucial. But for those cases where knowledge of document context is important, the separation of attachments from their parent e-mail messages can prove disastrous. For this reason alone, some DMS vendors remove the ability to save attachments separately from e-mail messages.

CONCLUSION


With the pros and cons of deduplicating by custodian or by case in mind, attorneys can decide which deduplication process best fits the case at hand. Attorneys may prefer casewide (global) deduplication because it reduces the data set to review the most. This choice is optimal for cases premised on document content, as opposed to document context, such as the knowledge of who had what, when and where. Alternatively, for cases revolving around document context, custodian-based deduplication or no deduplication becomes the safest choice.

Alex K. Schiller is a senior litigation support analyst in the Chicago office of Drinker Biddle & Reath. He focuses primarily on electronic data discovery processing and holds a Ph.D. from the University of Wisconsin-Madison in ancient Mediterranean history. Schiller can be reached via [email protected] and 312-569-1829.

Classification

Language: ENGLISH

Publication-Type: Magazine

Subject: ELECTRONIC DISCOVERY (78%); review database; e-mail messages; global deduplication; electronic data; data set; duplicate documents; parent e-mail; hash values; deduplication process; processing applications; document content; duplicate e-mail; document management system; document context; parent e-mail message; data sets; case custodians; e-mail account; electronic data discovery; electronic files; deduplication processing rules; DUPLICATE E-MAIL ATTACHMENTS; custodian deduplication; data discovery processing; duplicate e-mail sets; cases litigation hinges; parent e-mail messages; processing vendor; harvesting data; litigation support analyst; calculated hash values; e-mail account file; company department; custodian results; case basis; hash comparison; court order; company network; senior litigation; word-processing files; custodian-to-custodian comparison; identified duplicate documents; deduplicating data; duplicate attachments; duplicate status; deduping application; electronic documents; deduping process; comparable documents; calculated alphanumeric value; successive collections; rolling basis; safest choice; fewest headaches; obvious reason; subtle reason; custodian-based deduplication; investigated custodian; deduplicated items; attached file; understand the deduplication process; electronic data discovery processing; Networking, Storage, Content

Organization: UNIVERSITY OF WISCONSIN (59%); Drinker Biddle & Reath; University of Wisconsin-Madison

Industry: ELECTRONIC MAIL (90%); WORD PROCESSING SOFTWARE (78%); EMAIL MARKETING (73%); HARD DRIVES (54%)

Person: Alex K. Schiller

Geographic: Chicago

LAW FIRM: Drinker Biddle & Reath

Load-Date: September 19, 2011
