Even before either side does a first-pass review of its collected documents, the parties could easily identify which potentially discoverable documents both sides already have in common. This process would be fast, inexpensive, and easy, and would allow new kinds of cooperation between parties.
In 2009 and 2010, Patrick Oot, Joe Howie, and Anne Kershaw exposed a disturbing lack of custodial and cross-custodial deduplication in ediscovery at the time. They also considered the ethical implications of that lack. See, e.g., Patrick Oot, Joe Howie, and Anne Kershaw, “Ethics and Ediscovery Review,” ACC Docket Vol. 28, Issue 1 (Jan/Feb 2010): Pages 46-57, available at http://www.knowledgestrategysolutions.com/wp-content/uploads/ACC-Docket-Ethics-of-Edisc…pdf (“Ethics Review”) (last retrieved September 4, 2014).
The Ethics Review also observed that counsel should “[c]onsider consolidating duplicates across parties.” Ethics Review at 56 (emphasis added). However, although the last four years have seen great strides in the adoption of cross-custodial deduplication, it seems that deduplication between parties is not yet being done.
Immediately identifying those common documents should be done in many cases. Here are a few thoughts about why and how.
Identifying which documents are already in both parties’ possession would allow the parties to immediately begin discussing the responsiveness of specific documents, categories, and concepts, without exposing any confidential information. It would virtually eliminate the risk that any party’s valid interests could be compromised by those discussions.
For efficiency and objectivity, this protocol should be limited to exact duplicates. Exact duplicates could be easily and cheaply determined by comparing hash lists. To maximize the identification of duplicates, both parties should agree that all emails would be converted to the RFC 2822 format before being hashed. See Ethics Review at page 57 (addressing cross-custodian deduplication). More generally, where different collection or ingestion methods would result in non-comparable hash values, counsel should agree on using the same methods.
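As a rough illustration of comparing hash lists, here is a minimal Python sketch. It assumes both sides have agreed on SHA-256 and on hashing the same byte-for-byte representation of each document. The function names (`hash_bytes`, `hash_file`, `common_hashes`) are hypothetical and do not come from any vendor tool.

```python
import hashlib


def hash_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def hash_file(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large documents need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def common_hashes(hash_list_a, hash_list_b) -> set:
    """Return the hash values that appear on both parties' lists."""
    return set(hash_list_a) & set(hash_list_b)
```

Each party would run the hashing step on its own collection and exchange only the resulting lists. The intersection identifies the common documents, and nothing about the other documents is revealed beyond their hash values.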
One way to minimize differences between ingestion methods would be for the parties to agree to share a vendor. Of course, sharing vendors raises potential conflict issues. See, e.g., Gordon v. Kaleida Health, No. 08-CV-378S(F) (W.D.N.Y. May 21, 2013), available at http://scholar.google.com/scholar_case?case=4027097771033406737 (last retrieved September 3, 2014).
Such issues could be minimized or eliminated by the use of ediscovery neutrals. See, e.g., The United States District Court for the District of Kansas Guidelines for Cases Involving Electronically Stored Information, available at http://www.ksd.uscourts.gov/guidelines-for-esi/, at p. 5 ¶13:
13. Creation of a Shared Database and Use of One Search Protocol
In appropriate cases counsel may want to attempt to agree on the construction of a shared database, accessible and searchable by both parties. In such cases, they should consider both hiring a neutral vendor and/or using one search protocol with a goal of minimizing the costs of discovery for both sides.
Using a shared vendor for the common documents would also allow the parties to use exactly the same search and clustering methods on the database of duplicates. Comparing apples to apples could help them to agree on priorities for the larger universe of non-common documents.
One slight objection to this protocol is that a common document might be part of a privileged communication to one of the parties. However, there are many solutions to that problem, such as the use of an ediscovery neutral.
Of course, deduplication between parties could also result in smaller production volumes and could also provide a way to split the cost of objectively coding the common documents.
In light of these potential opportunities, agreeing on deduplication between parties may be ethically required in some cases.
I appreciate the desire to increase efficiencies and save costs. I will also say that I’m not an attorney, and that I am not familiar with, and did not read, the citations you noted. I do, however, have 20 years’ experience in litigation support, 15 of them involving electronic discovery and, now, cyber investigations.
From a technical perspective, the comparison process might be a bit more difficult in some situations, especially with email. The idea of using a common vendor is certainly a step in the right direction in that the vendor could use a common method and process to create a HASH or other value used to compare documents and identify duplicates.
It should be understood that different vendors and different software tools will generate HASH values differently and many don’t have or allow any control over the process. Using email for example, one vendor/tool may use all metadata fields and all text fields to generate a HASH value. There could be over 100 such metadata fields. If another vendor/tool doesn’t use the exact same data set, there will likely never be a match of exact HASH values for documents that a reasonable person would consider an exact duplicate. (I say “likely” because two different documents could have matching HASH values. It is not common, but possible mathematically. I won’t go there now.)
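For a sense of scale on accidental matches, here is a rough birthday-bound estimate. It assumes the hash behaves like a uniformly random function with a b-bit output, and it covers chance collisions only, not deliberately constructed ones. The figure of one billion documents is an illustrative assumption.

```latex
% Probability that at least two of n distinct documents share a b-bit hash value
p \;\approx\; \frac{n(n-1)}{2^{\,b+1}} \;\approx\; \frac{n^{2}}{2^{\,b+1}}

% Illustration with n = 10^{9} documents
\text{MD5 } (b = 128):\quad p \approx \frac{10^{18}}{2^{129}} \approx 1.5 \times 10^{-21}
\qquad
\text{SHA-256 } (b = 256):\quad p \approx \frac{10^{18}}{2^{257}} \approx 4.3 \times 10^{-60}
```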
Another consideration is different email systems. For example, suppose person A sends the same email to persons B and C, where person B receives email through Microsoft Exchange and person C through Google Mail. When collected from persons B and C, the email will be in a different format with different metadata. The Google Mail copy could be converted to a PST file (the format likely generated for collection from the Exchange system, and a very common practice); however, that converted file will still have different metadata and thus generate different HASH values.
To counter this issue, I suppose the input metadata could be limited, say to just the author, recipient, subject, and body of the text; however, this process will reduce the accuracy of identifying exact duplicates.
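To make the limited-metadata idea concrete, here is a minimal sketch using Python's standard `email` package. It assumes the parties have agreed to hash only the sender, the To/Cc recipients, the subject, and the plain-text body, after lowercasing addresses and collapsing whitespace. Those normalization choices and the function names are illustrative assumptions, not an established protocol, and attachments are ignored entirely.

```python
import hashlib
from email import message_from_string, policy
from email.utils import getaddresses


def normalize_addresses(header_values) -> str:
    """Reduce address headers to a sorted, lowercased list of bare addresses."""
    pairs = getaddresses([str(value) for value in header_values])
    return ",".join(sorted(addr.strip().lower() for _, addr in pairs if addr))


def limited_field_hash(raw_message: str) -> str:
    """Hash only author, recipients, subject, and plain-text body."""
    msg = message_from_string(raw_message, policy=policy.default)
    sender = normalize_addresses(msg.get_all("From", []))
    recipients = normalize_addresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    subject = " ".join(str(msg.get("Subject") or "").split())
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    body = " ".join(body.split())  # collapse whitespace differences
    canonical = "\n".join([sender, recipients, subject, body])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two copies of a message that differ only in system-added headers or in display-name formatting would produce the same value. The trade-off is that messages differing only in fields left out of the hash would also match.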
I wonder if the volume of common documents typically shared between parties prior to large-scale discovery would warrant such efforts. I also suspect that both sides would resist revealing such information at that stage in the process.
I think anything that can be done to rein in costs and inefficiencies in our industry is a step in the right direction and certainly worth exploring.
Robert,
Thank you for your thoughtful analysis. I agree with pretty much everything you said, and offer these thoughts to continue the exploration.
First, I should note a citation error in my original post, although the error does not change the analysis in that post at all. I cited to RFC 2822, a standard for internet email promulgated by the Internet Engineering Task Force (“IETF”). I failed to note that that standard has been superseded by RFC 5322, http://tools.ietf.org/html/rfc5322, which has been updated by RFC 6854, http://tools.ietf.org/html/rfc6854.
I used email as an example because email is more likely than other kinds of ESI to contain duplicates between parties. Perhaps most importantly, the parties use email when they want to have an objective written record, i.e., proof, that they had a particular communication. And, by its nature, email is more substantive than many other kinds of ESI; it contains relatively unique, time-sensitive communications. It’s also an accessible, relatively easily defined and collectible group of documents. And, where both sides already have a copy, there are no confidentiality or authentication issues.
For all of these reasons, determining which emails are duplicated between parties may be the fastest way for the parties to gain an objective view of their case. This could lead to earlier settlements. At the very least, it would allow counsel for both sides to use concrete, case-specific documents as examples in discussing the scope of discovery.
I agree that differences in email systems, vendor ingestion methods, metadata used, and hash tools will lead to different hash values. For example, some metadata fields will never match up, such as the optional “Return-Path” field. As another example, differences in local times need to be normalized, as do the formats of the time notations. Or the email could have been sent as HTML and saved by some recipients as plain text.
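As a small illustration of the time normalization mentioned above, the following sketch converts an email `Date` header to a single UTC notation using Python's standard library. The output format and the function name `normalize_sent_time` are assumptions made for this example.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def normalize_sent_time(date_header: str) -> str:
    """Convert an email Date header to a fixed UTC notation."""
    sent = parsedate_to_datetime(date_header)
    if sent.tzinfo is None:
        # A "-0000" offset means the local time zone is unknown.
        raise ValueError("Date header carries no usable UTC offset")
    return sent.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# The same instant, recorded in two different local time zones:
print(normalize_sent_time("Thu, 04 Sep 2014 09:30:00 -0500"))  # 2014-09-04T14:30:00Z
print(normalize_sent_time("Thu, 04 Sep 2014 16:30:00 +0200"))  # 2014-09-04T14:30:00Z
```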
For these reasons and others, I now see that my focus on hash values for emails was misplaced.
I think that all of the problems that exist in deduplicating between parties also exist in deduplicating within a single party’s documents. And vendors have been performing intra-party deduplication with high success rates for many years now.
A metadata field that should indicate duplicates with almost 100% reliability is the “message-id” field, which contains a globally unique identifier for each message sent. RFC 5322 at ¶ 3.6.4. This field is optional, although the IETF states that “Every message SHOULD have a ‘Message-ID:’ field.” Id. (emphasis in original). I believe that all business email software does generate and include a unique Message-ID in each email.
Alternatively, as you suggest, you could see which messages have the same author, recipient(s), and subject. Although you think that this could reduce the accuracy of the deduplication process, I think that that inaccuracy could be minimized by including the normalized send date and time. Also, a message with a BCC could not be a duplicate unless counsel agreed to redact the contents of the BCC field.
I also agree that the volume of shared documents may be low in certain cases. Even so, I believe that any shared documents would make the negotiations over the scope of discovery more concrete and therefore more fruitful.
Finally, I agree that both sides may resist such disclosures early on (particularly where the litigants just want to fight). However, their counsel have a duty, to their clients, to the courts, and to us taxpayers, to minimize such silly stuff. And even if both sides resist, the court has the power to order such measures in appropriate cases.
Thanks again.
Since the unique message-id of an email may contain domain and time information (RFC 5322 at ¶ 3.6.4 at p. 27), the parties would need to hash the message IDs before comparing them.
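A minimal sketch of that step might look like the following. It assumes both sides agree on SHA-256 and on trimming whitespace and the surrounding angle brackets before hashing. The function names and the sample identifiers are hypothetical.

```python
import hashlib


def normalize_message_id(message_id: str) -> str:
    """Trim whitespace and the surrounding angle brackets from a Message-ID."""
    return message_id.strip().strip("<>").strip()


def hash_message_id(message_id: str) -> str:
    """Return the SHA-256 hex digest of a normalized Message-ID."""
    normalized = normalize_message_id(message_id)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shared_message_ids(own_message_ids, other_side_hashes) -> list:
    """List our own Message-IDs whose hashes appear on the other side's list."""
    other = set(other_side_hashes)
    return sorted(mid for mid in own_message_ids if hash_message_id(mid) in other)


# Each side exchanges only hashes, then looks up matches in its own collection.
ours = ["<1234@mail.example.com>", "<5678@mail.example.com>"]
theirs_hashed = [hash_message_id(" <5678@mail.example.com> "),
                 hash_message_id("<9999@other.example.org>")]
print(shared_message_ids(ours, theirs_hashed))  # ['<5678@mail.example.com>']
```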