Deduplication: Custodian vs. Case
Law Technology News (Online)
This article also appears in the following ALM publications:
Corporate Counsel
August 27, 2009 Thursday
Copyright 2009 ALM Media Properties, LLC. All Rights Reserved. Further duplication without permission is prohibited.
Length: 1365 words
Byline: Alex K. Schiller
Body
Deduplication has become a mainstay of electronic data discovery processing where documents, such as word-processing
files and e-mail messages, are assigned an algorithmically calculated alphanumeric value (typically an MD5 hash) and
compared to all other electronic files in a data set. Documents with the same MD5 hash values are considered duplicates.
As simple as this process seems, there are two different bases for deduplication: by custodian and by case. Both have their
advantages and pitfalls.
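The hash comparison described above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular EDD application: the function names `md5_hex` and `split_duplicates` and the sample file names are hypothetical, and the sketch assumes each document is hashed on its raw bytes.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of a document's bytes as 32 hexadecimal characters."""
    return hashlib.md5(content).hexdigest()


def split_duplicates(documents):
    """Split (doc_id, content) pairs into kept IDs and a duplicate -> original map.

    The first document seen with a given hash is kept; later ones are duplicates.
    """
    first_seen = {}   # hash value -> doc_id of the first document with that hash
    kept = []
    duplicates = {}
    for doc_id, content in documents:
        digest = md5_hex(content)
        if digest in first_seen:
            duplicates[doc_id] = first_seen[digest]
        else:
            first_seen[digest] = doc_id
            kept.append(doc_id)
    return kept, duplicates


docs = [
    ("memo.doc", b"Quarterly results attached."),
    ("notes.txt", b"Call the client on Monday."),
    ("memo_copy.doc", b"Quarterly results attached."),
]
kept, duplicates = split_duplicates(docs)
print(kept)        # ['memo.doc', 'notes.txt']
print(duplicates)  # {'memo_copy.doc': 'memo.doc'}
```

Note that the sketch treats whichever copy it meets first as the one to keep, a point that matters for the pitfalls discussed later in the article.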
Deduplicating documents by custodian results only in the removal of duplicates within one person’s data set. A custodian
is the owner of the electronic data harvested from one person’s hard drive, company network or e-mail account. If the data
is collected only once, typically only a small number of duplicates exist. But if the custodian’s data is harvested on a rolling
basis over time, the percentage of duplicate items will increase with successive collections. For example, an e-mail file
collected one week will contain a relatively small amount of new data compared to the file collected the previous
week. Duplicates within one custodian’s data may include, for example, copies of e-mail messages created
automatically by an ″AutoArchive″ rule established by the custodian.
Deduplication by custodian is the basis preferred by vendors for several reasons. One obvious reason: deduplicating data
sets by custodian results in fewer duplicates than deduplication by case and thus more documents can be generated for
review -- vendors that offer to print data sets on demand can possibly earn the most income by deduplicating by custodian.
A more subtle reason: custodian deduplication causes the fewest headaches and worries for the EDD processing vendor
and makes it easier to communicate to the law firm how data sets were deduped using the hash comparison explained
above. Deduplication by case, or global deduplication, is not as easy to conduct or explain.
DEDUPLICATION BY CASE
Deduplication of documents on a case basis will result in the removal of duplicate documents within the data set for the
entire case. In this manner, duplicate documents are removed not only by comparing documents within each custodian’s
data set, but also by a custodian-to-custodian comparison of documents. The advantage to this basis, often called global
deduplication, is that it removes the greatest number of duplicates from the review database. As a result, the attorneys have
the smallest number of documents to review. Examples of global duplicates include documents stored on the similarly imaged
computers of members of the same company department, as well as spam and e-mail sent to a group to which all or some of the
case custodians belong.
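The difference between the two bases comes down to the scope of the comparison. A minimal sketch, assuming each record is a hypothetical `(custodian, doc_id, hash_value)` tuple and that the first copy processed is the one kept:

```python
def deduplicate(records, scope="case"):
    """Return the (custodian, doc_id) pairs that survive deduplication.

    scope="custodian": a hash is compared only within the same custodian's data.
    scope="case":      a hash is compared across every custodian (global).
    """
    if scope not in ("custodian", "case"):
        raise ValueError("scope must be 'custodian' or 'case'")
    seen = set()
    kept = []
    for custodian, doc_id, hash_value in records:
        # The comparison key is what distinguishes the two bases.
        key = (custodian, hash_value) if scope == "custodian" else hash_value
        if key not in seen:
            seen.add(key)
            kept.append((custodian, doc_id))
    return kept


records = [
    ("A", "a1", "h1"),
    ("A", "a2", "h1"),  # duplicate within custodian A
    ("B", "b1", "h1"),  # duplicate of A's document, held by custodian B
    ("B", "b2", "h2"),
]
print(deduplicate(records, scope="custodian"))  # [('A', 'a1'), ('B', 'b1'), ('B', 'b2')]
print(deduplicate(records, scope="case"))       # [('A', 'a1'), ('B', 'b2')]
```

With the same input, the case basis leaves two documents to review and the custodian basis leaves three, because B's copy of the shared document survives only when the comparison stops at the custodian boundary.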
Global deduplication complicates matters, especially when processing e-mail. The deduping process cannot distinguish
which duplicate document is the original and which is the copy. Duplicates are designated by the order in which the documents were
processed. Therefore, a processor may remove the original e-mail that one of the case custodians sent to a group and keep
only one of the recipients’ duplicates.
For example, e-mails from three custodians, A, B and C, may be processed in that order. The EDD processing application
considers custodian A’s e-mails the originals and global duplicates will come mostly from B’s and C’s e-mails. The
problem in this example is that the attorney may want an original message from B’s e-mail account. But that message was
designated as a duplicate and removed from the review database. Custodian B was the original sender, but the attorney has
in the database only A’s e-mail received from B.
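The effect of processing order in this example can be sketched as follows. The sketch assumes, as the example does, that the sender's copy and the recipients' copies of a message receive the same hash value; the function name `surviving_copies` and the record fields are hypothetical.

```python
def surviving_copies(messages, processing_order):
    """Map each hash value to the custodian whose copy is kept.

    Custodians are processed in the given order, and the first copy
    encountered for a hash value is treated as the original.
    """
    rank = {custodian: i for i, custodian in enumerate(processing_order)}
    kept = {}
    for msg in sorted(messages, key=lambda m: rank[m["custodian"]]):
        kept.setdefault(msg["hash"], msg["custodian"])
    return kept


# Custodian B sent one message to A and C; all three hold a copy of it.
messages = [
    {"custodian": "A", "role": "recipient", "hash": "h1"},
    {"custodian": "B", "role": "sender", "hash": "h1"},
    {"custodian": "C", "role": "recipient", "hash": "h1"},
]
print(surviving_copies(messages, ["A", "B", "C"]))  # {'h1': 'A'}
print(surviving_copies(messages, ["B", "A", "C"]))  # {'h1': 'B'}
```

Nothing in the hash says who sent the message, so which copy reaches the review database depends entirely on which custodian is processed first.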
It may look awkward to the court that the attorney produced a message that was delivered to a recipient and not the message
from the original sender. The need to produce original documents from the custodian under investigation is particularly
important for cases that require proof that the investigated custodian actively sent a certain message and did not passively
receive it.
Another pitfall of global deduplication comes from a court order to produce all documents from one custodian and only
some documents from other custodians. To comply with such an order, it is recommended (if not necessary) that you
include all duplicate documents from the one custodian’s data set in the document set produced. Otherwise, as explained
above, a processor could remove identified duplicate documents from the custodian described in the order, while keeping
original documents stored in another custodian’s data set.
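One way to make that recommendation workable is to record suppressed duplicates during global deduplication rather than discard them, so they can be added back for the custodian named in the order. The sketch below is an assumption about how such tracking might look, not a description of any specific EDD application; `global_dedupe`, `production_set` and the record layout are hypothetical.

```python
def global_dedupe(records):
    """Split (custodian, doc_id, hash_value) records into kept and suppressed lists."""
    seen = set()
    kept, suppressed = [], []
    for record in records:
        hash_value = record[2]
        if hash_value in seen:
            suppressed.append(record)  # logged, not thrown away
        else:
            seen.add(hash_value)
            kept.append(record)
    return kept, suppressed


def production_set(kept, suppressed, full_custodian, selected_other_ids):
    """All documents from full_custodian, plus selected kept documents from others."""
    docs = [doc_id for custodian, doc_id, _ in kept + suppressed
            if custodian == full_custodian]
    docs += [doc_id for custodian, doc_id, _ in kept
             if custodian != full_custodian and doc_id in selected_other_ids]
    return docs


records = [
    ("A", "a1", "h1"),
    ("B", "b1", "h1"),  # suppressed as a global duplicate of a1
    ("B", "b2", "h2"),
    ("C", "c1", "h3"),
]
kept, suppressed = global_dedupe(records)
print(production_set(kept, suppressed, "B", {"c1"}))  # ['b2', 'b1', 'c1']
```

Without the suppressed list, the production for custodian B would contain only `b2`, even though B also held a copy of the document that survived under custodian A.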
It is imperative for the litigation support analyst and attorney to understand the deduplication process when electronic data
is organized into a review database. Different EDD processing applications have different deduplication processing rules.
Comparing e-mails is relatively easy, as explained above, but what about attachments?
DETECTING DUPLICATE E-MAIL ATTACHMENTS
Some EDD processing applications create a hash value for an attached file separate from its parent e-mail message. Using
this method, duplicate attachments and loosely stored electronic files can be found. This may be advantageous because a
greater number of duplicates can be removed from the review database. Nonetheless, the recommended process for
attachments is to apply no hash value or use the same hash value as the parent e-mail message. That way, an e-mail and
its attachment are considered a set. Then, the deduplication process can compare and remove duplicate e-mail sets: the
messages and their attachments, together.
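Treating a message and its attachments as a set can be approximated by computing a single hash over the whole e-mail family. This is a hedged sketch of one possible approach: the function names `family_hash` and `loose_file_hash` are hypothetical, and real processing applications differ in which message fields they feed into the hash.

```python
import hashlib


def family_hash(message_body, attachments):
    """Return one MD5 value for an e-mail and its attachments taken together.

    The digest of the body and of each attachment (in order) is fed into a
    single outer hash, so the family is compared as one unit.
    """
    outer = hashlib.md5()
    outer.update(hashlib.md5(message_body).digest())
    for attachment in attachments:
        outer.update(hashlib.md5(attachment).digest())
    return outer.hexdigest()


def loose_file_hash(content):
    """Return the MD5 value for a stand-alone file that has no parent e-mail."""
    return hashlib.md5(content).hexdigest()


report = b"Final report text"
sent_copy = family_hash(b"Please see the attached report.", [report])
received_copy = family_hash(b"Please see the attached report.", [report])
other_email = family_hash(b"Forwarding this again.", [report])

print(sent_copy == received_copy)             # True: identical families match
print(sent_copy == other_email)               # False: same attachment, different message
print(sent_copy == loose_file_hash(report))   # False: the loose file is not a duplicate of the family
```

Because the attachment never receives a hash of its own, a loose copy of the same file elsewhere in the collection cannot displace it from its parent e-mail.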
Creating separate hash values for e-mail messages and their attachments can create a pitfall for reviewers. For example, a
litigation support analyst receives an e-mail account file that contains e-mails and attachments as well as a DVD of
electronic documents harvested from the client’s document management system and the custodian’s hard drive. The DVD
contains a document that is also an attachment to an e-mail the custodian sent. These two documents may be considered
duplicates by their calculated hash values. Let us say the deduping application assigns the original status to the document
from the DVD and duplicate status to the attachment. It may be imperative to the case to prove that the custodian
disseminated that document via e-mail to recipients who allege that they never saw it. But in the review database, the
attorney sees only the original document, the one from the DVD. The attorney may not know that the custodian e-mailed
the document to several recipients.
Finally, it is imperative for the attorney to understand how their client’s or opponent’s document management system stores
e-mail and related attachments. A DMS may allow users to store an attachment separately from its e-mail message. But this
separation can cause problems when it comes time to harvest data relevant to litigation or investigation. In many cases
litigation hinges on document content -- knowledge of who sent what and when is not crucial. But for those cases where
knowledge of document context is important, the separation of attachments from their parent e-mail messages can prove
disastrous. For this reason alone, some DMS vendors remove the ability to save attachments separate from e-mail messages.
CONCLUSION
With the pros and cons of deduplicating by custodian or by case in mind, attorneys can decide which deduplication process best
fits the case at hand. Attorneys may prefer casewide (global) deduplication because the data set to review is reduced the
most. This choice is optimal for cases premised on document content, as opposed to document context, such as the
knowledge of who had what, when and where. Alternatively, for cases revolving around document context, custodian-based
deduplication or no deduplication becomes the safest choice.
Alex K. Schiller is a senior litigation support analyst in the Chicago office of Drinker Biddle & Reath. He focuses primarily
on electronic data discovery processing and holds a Ph.D. from the University of Wisconsin-Madison in ancient
Mediterranean history. Schiller can be reached via [email protected] and 312-569-1829.
Classification
Language: ENGLISH
Publication-Type: Magazine
Subject: ELECTRONIC DISCOVERY (78%); review database; e-mail messages;
global deduplication; electronic data; data set; duplicate documents; parent e-mail; hash values; deduplication process;
processing applications; document content; duplicate e-mail; document management system; document context; parent
e-mail message; data sets; case custodians; e-mail account; electronic data discovery; electronic files; deduplication
processing rules; DUPLICATE E-MAIL ATTACHMENTS; custodian deduplication; data discovery processing; duplicate
e-mail sets; cases litigation hinges; parent e-mail messages; processing vendor; harvesting data; litigation support analyst;
calculated hash values; e-mail account file; company department; custodian results; case basis; hash comparison; court
order; company network; senior litigation; word-processing files; custodian-to-custodian comparison; identified duplicate
documents; deduplicating data; duplicate attachments; duplicate status; deduping application; electronic documents;
deduping process; comparable documents; calculated alphanumeric value; successive collections; rolling basis; safest
choice; fewest headaches; obvious reason; subtle reason; custodian-based deduplication; investigated custodian; deduplicated
items; attached file; understand the deduplication process; electronic data discovery processing; Networking, Storage,
Content
Organization: UNIVERSITY OF WISCONSIN (59%); Drinker Biddle & Reath; University of Wisconsin-Madison
Industry: ELECTRONIC MAIL (90%); WORD PROCESSING SOFTWARE (78%); EMAIL MARKETING (73%); HARD DRIVES (54%)
Person: Alex K. Schiller
Geographic: Chicago
LAW FIRM: Drinker Biddle & Reath
Load-Date: September 19, 2011