predictive coding legaltech

Predictive Coding 2.0: Making E-Discovery More Efficient and Cost Effective

John Tredennick, Jeremy Pickens, Jim Eidelman

How Many Do I Have to Check?

1. You have a bag with 1 million M&Ms
2. It contains mostly brown M&Ms
3. You cannot see into the bag
4. You have a scoop that will pull out 100 M&Ms at a time
5. Your hope is that there are no red M&Ms in the bag
6. You pull out a scoop and they are all brown

How many scoops do you need to review to be confident there are no red M&Ms?
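One way to reason about this question: if some fraction of the bag were red, how likely is it that every M&M you pull out is still brown? The sketch below assumes independent draws, which is a close approximation when the sample is tiny relative to a bag of 1 million. The function name and the 1% example rate are illustrative assumptions, not part of the original slides.

```python
def prob_all_brown(red_fraction: float, scoops: int, scoop_size: int = 100) -> float:
    """Chance that every M&M in `scoops` scoops is brown, given the red fraction.

    Assumes independent draws (sampling with replacement), which closely
    approximates the real situation when the sample is small relative to the bag.
    """
    draws = scoops * scoop_size
    return (1.0 - red_fraction) ** draws


# Hypothetical case: 1% of the bag is red. How often would we see only brown?
for scoops in (1, 3, 5, 10):
    print(scoops, "scoops:", round(prob_all_brown(0.01, scoops), 4))
```

Under that assumption, a single all-brown scoop would still occur about 37% of the time even if 1% of the bag were red, while five all-brown scoops would occur less than 1% of the time.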

Let’s Take a Poll

How many scoops?

1, 2, 3, 5, 10, or 20? 100? 500? 1,000?

How Confident Do You Need to Be?

How many errors can you tolerate?

§  Five out of a hundred?
§  One out of a hundred?
§  One percent = 10,000 M&Ms

Does 95% work?

At a 95% confidence level and 5% margin of error: 384 M&Ms

How about 99%?

At a 99% confidence level and 1% margin of error: 459 M&Ms

At a 100% confidence level and 0% margin of error: 1,000,000 M&Ms
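The sample sizes quoted on this slide can be checked against two textbook formulas. The sketch below assumes the 384 figure comes from the usual margin-of-error formula for a proportion (with p = 0.5 and a finite population correction for 1,000,000 M&Ms), and that the 459 figure comes from a zero-defect calculation: the number of all-brown M&Ms needed to be 99% confident that fewer than 1% are red. Both readings are assumptions, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_margin(confidence: float, margin: float, population: int, p: float = 0.5) -> int:
    """Sample size for estimating a proportion to within +/- `margin`.

    Normal-approximation formula with a finite population correction;
    p = 0.5 is the most conservative (largest sample) assumption.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z * z * p * (1 - p) / (margin * margin)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def sample_size_zero_defect(confidence: float, max_rate: float) -> int:
    """Smallest all-brown sample giving `confidence` that the red rate is below `max_rate`."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_rate))


print(sample_size_margin(0.95, 0.05, 1_000_000))
print(sample_size_zero_defect(0.99, 0.01))
```

Under these assumptions the first call returns 384 and the second returns 459. Applying the margin-of-error formula at 99% confidence and a 1% margin gives a far larger sample (on the order of 16,000), which is why the 459 figure appears to be a zero-defect number rather than a margin-of-error number.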

Predictive Coding

Does it Work?

What Have the Courts Said?


“Until there is a judicial opinion approving (or even critiquing) the use of predictive coding, counsel will just have to rely on this article as a sign of judicial approval. In my opinion, computer-assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ (Fed. R. Civ. P. 1) determination of cases in our e-discovery world.”

Magistrate Judge Andrew Peck

Predictive Coding 1.0

1. Assemble your corpus.
2. Assemble a seed set of documents.
3. Review the seed set.
4. Apply machine learning and automatically tag the remainder of the corpus.
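The four steps above can be sketched in a few lines. A tiny naive Bayes classifier stands in for the machine learning step here; the slides do not say which learner any particular product uses, so the classifier, the function names, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def train(seed_set: list[tuple[str, bool]]) -> dict:
    """Fit per-class word counts from the reviewed seed set (steps 2 and 3)."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, responsive in seed_set:
        counts[responsive].update(tokenize(text))
        docs[responsive] += 1
    vocab = set(counts[True]) | set(counts[False])
    return {"counts": counts, "docs": docs, "vocab": vocab}


def predict(model: dict, text: str) -> bool:
    """Tag one unreviewed document (step 4) with Laplace-smoothed naive Bayes."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label in (True, False):
        word_counts = model["counts"][label]
        total_words = sum(word_counts.values())
        score = math.log((model["docs"][label] + 1) / (total_docs + 2))
        for tok in tokenize(text):
            score += math.log((word_counts[tok] + 1) / (total_words + vocab_size))
        scores[label] = score
    return scores[True] > scores[False]


# Hypothetical seed set: (document text, reviewer's responsive call).
seed = [
    ("discuss pricing of components", True),
    ("pricing agreement with competitor", True),
    ("company picnic hamburger buns", False),
    ("football game tonight", False),
]
model = train(seed)
remainder = ["competitor pricing call", "picnic buns order"]
tags = [predict(model, doc) for doc in remainder]
print(tags)
```

Note that the model is trained once and then applied. Nothing in this workflow revisits the model when new documents or new coding calls arrive.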


§  Tremendous gains in review effectiveness
§  Substantial cost savings
§  It works. Often quite well

…when the corpus is complete.

533 matters, nearly 36,000 uploads across the matters.

67.5 uploads per case

This is driven by collection, not by loading limits.

166.3 days of loading per case

67 uploads

166 days

In which upload and on which day do your responsive documents show up?

Terms that do not appear early begin appearing later.
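To make the point about late-arriving terms concrete, the sketch below records the upload in which each term first appears and counts how many new terms each upload introduces. The uploads are made-up toy data and the function names are illustrative assumptions.

```python
def first_appearance(uploads: list[list[str]]) -> dict[str, int]:
    """Map each term to the 1-based upload number in which it first appears."""
    first_seen: dict[str, int] = {}
    for upload_number, documents in enumerate(uploads, start=1):
        for text in documents:
            for term in text.lower().split():
                first_seen.setdefault(term, upload_number)
    return first_seen


def new_terms_per_upload(uploads: list[list[str]]) -> list[int]:
    """Count how many previously unseen terms each upload introduces."""
    counts = [0] * len(uploads)
    for upload_number in first_appearance(uploads).values():
        counts[upload_number - 1] += 1
    return counts


# Hypothetical rolling uploads, each a list of document texts.
uploads = [
    ["company picnic buns", "picnic schedule"],
    ["buns pricing"],
    ["pricing of components", "components supplier"],
]
print(first_appearance(uploads)["pricing"])
print(new_terms_per_upload(uploads))
```

In this toy data, "pricing" does not exist in the first upload at all, so any model or term analysis frozen after upload 1 could not have known about it.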

Machine-Assisted Decision Making

Upload timeline of 6 TB case. When should machine-assisted decision making (e.g. early case assessment) begin?

Is it here?

Or here?

Example: Responsive Early, Junk Later

To: [email protected], [email protected]

From: [email protected]

Subject: Company Picnic

Bob, would you coordinate with Alice and make sure we have enough hamburger buns for the company picnic? Please try and find them at a reasonable price.

Responsive (early) → Junk (later)

Example: Junk Early, Responsive Later

To: [email protected], [email protected]

From: [email protected]

Subject: Get Together

Let’s get together at 7pm at the Sports Bar to discuss pricing of our components. The Broncos are playing and I really want to watch Tebow.

Junk (early) → Responsive (later)

Problems With Predictive Coding 1.0

The corpus is almost never complete
§  Continuous collection and rolling uploads
§  When does “Early Case Assessment” begin?

Changing Issues
§  Responsiveness is “bursty”

Shifting Concept Relationships
§  Due both to increasing corpus and changing issues
§  Exploration is extremely limited

Our Approach

Predictive Coding 2.0 necessitates the ability to deal with dynamic change and flux. We have developed a flexible analytics framework based on bipartite graphs. It is aware of changes in the corpus and in coding, enabling smart review and adaptive suggestion of related concepts as information pours in.

Goal: Continuous Case Assessment

Our Approach

Avoid the lock-in that arises from poor decisions made early in the matter, when the corpus (collection) and coding information are incomplete.

What Is Underneath?

A full bipartite graph of the documents and features (e.g. words, phrases, dates) that comprise those documents

Documents Terms
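A bipartite graph of this kind can be sketched as two adjacency maps, one from documents to terms and one from terms to documents. This is a minimal illustration only: the class name, the whitespace tokenization, and the restriction to single-word features are assumptions, not a description of the presenters' implementation.

```python
from collections import defaultdict


class BipartiteIndex:
    """Documents on one side, terms on the other, with an edge where a term occurs."""

    def __init__(self) -> None:
        self.doc_to_terms: dict[str, set[str]] = defaultdict(set)
        self.term_to_docs: dict[str, set[str]] = defaultdict(set)

    def add_document(self, doc_id: str, text: str) -> None:
        # Each (document, term) pair becomes one edge, stored in both directions.
        for term in text.lower().split():
            self.doc_to_terms[doc_id].add(term)
            self.term_to_docs[term].add(doc_id)

    def documents_with(self, term: str) -> set[str]:
        return set(self.term_to_docs.get(term, set()))

    def terms_in(self, doc_id: str) -> set[str]:
        return set(self.doc_to_terms.get(doc_id, set()))


index = BipartiteIndex()
index.add_document("d1", "fuel piping ventilation")
index.add_document("d2", "fuel benzene ethane")
print(sorted(index.documents_with("fuel")))
```

Storing the edges in both directions means the graph can be walked from a term to its documents and back to other terms, and adding a document later only adds edges without rebuilding anything.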

Feedback: Immediate and Continuous

Continuous feedback aids better decision making and predictive coding. Adapts to both:

§  New arrival of coding information
§  New arrival of documents and terms

Documents Terms
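The idea of adapting to both kinds of arrival can be sketched as a ranker that recomputes its ordering on demand, so that new documents and new (or revised) coding calls take effect immediately. The scoring rule here, +1 for each responsive document containing a term and -1 for each non-responsive one, is a deliberately simple stand-in chosen for illustration; the class and method names are hypothetical.

```python
from collections import Counter


class ContinuousRanker:
    """Re-ranks unreviewed documents whenever documents or coding calls arrive."""

    def __init__(self) -> None:
        self.docs: dict[str, set[str]] = {}
        self.coding: dict[str, bool] = {}

    def add_document(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = set(text.lower().split())

    def add_coding(self, doc_id: str, responsive: bool) -> None:
        self.coding[doc_id] = responsive  # a later call overrides an earlier one

    def term_weights(self) -> Counter:
        """+1 per responsive document containing a term, -1 per non-responsive one."""
        weights: Counter = Counter()
        for doc_id, responsive in self.coding.items():
            for term in self.docs.get(doc_id, set()):
                weights[term] += 1 if responsive else -1
        return weights

    def ranked_unreviewed(self) -> list[str]:
        """Unreviewed documents, highest score first, ties broken by document id."""
        weights = self.term_weights()
        unreviewed = [d for d in self.docs if d not in self.coding]
        return sorted(unreviewed, key=lambda d: (-sum(weights[t] for t in self.docs[d]), d))


ranker = ContinuousRanker()
ranker.add_document("d1", "discuss pricing of components")
ranker.add_document("d2", "company picnic buns")
ranker.add_document("d3", "pricing call tomorrow")
ranker.add_coding("d1", True)
first = ranker.ranked_unreviewed()

ranker.add_document("d4", "components pricing agreement")  # a new upload arrives
second = ranker.ranked_unreviewed()

ranker.add_coding("d1", False)  # the coding call is revised
third = ranker.ranked_unreviewed()
print(first, second, third)
```

The ranking shifts after the new document arrives and shifts again when the coding call on "d1" is reversed, without any separate retraining step.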


Feedback – and improvement – is iterative, continuous, amplified.

% of Docs Examined Manually

The more you review, the less you have to review

Term relationships change over time.

Using continuous improvement, decisions can be revised and refined as the matter proceeds.

Better Decisions As Understanding Improves

Time uncovers new relationships

Documents Terms

Looking at Concepts Over Time

Related terms for “fuel” at 20% and at 65% (both columns combined):

lube, fuels, piping, fob, battery, purity, ethane, mounted, petrochemicals, redundant, fin, batteries, paraxylene, compartments, cif, mixture, phy, airflow, fwd, ansi, swopt, ventilation, brent, partials, chargers, brg, stainless, locswap, rotor, benzene, bleed, diff, accessory, spd, plenum, liquids, detector, opt

Start with the key term “fuel”

At 20% these are the related terms

And at 65%
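The shift in related terms can be illustrated by computing co-occurring terms for a key term using only the portion of the collection loaded so far. This is a sketch under stated assumptions: plain co-occurrence counting is used as the relatedness measure, the function names are illustrative, and the ten-document corpus is made up (it borrows a few terms from the slide).

```python
from collections import Counter


def related_terms(documents: list[str], keyword: str, top_n: int = 2) -> list[str]:
    """Terms that most often share a document with `keyword`."""
    counts: Counter = Counter()
    for text in documents:
        terms = set(text.lower().split())
        if keyword in terms:
            counts.update(terms - {keyword})
    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def related_at(documents: list[str], keyword: str, percent: int, top_n: int = 2) -> list[str]:
    """Related terms using only the first `percent` of the documents, in load order."""
    cutoff = len(documents) * percent // 100
    return related_terms(documents[:cutoff], keyword, top_n)


# Hypothetical collection, listed in the order it was loaded.
corpus = [
    "fuel piping ventilation",
    "fuel piping rotor",
    "battery chargers",
    "fuel benzene ethane",
    "fuel benzene paraxylene",
    "fuel benzene brent",
    "fuel ethane petrochemicals",
    "ventilation airflow",
    "fuel benzene ethane",
    "rotor bleed",
]
early = related_at(corpus, "fuel", 20)
later = related_at(corpus, "fuel", 65)
print(early, later)
```

In the toy data the early view of "fuel" is mechanical (piping, rotor), while the later view is dominated by a petrochemical term that had not yet been loaded at the 20% mark.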

Related Terms Through Coding Filters

Documents Terms

Responsive

NonResponsive

TREC collection with many topics identified

Putting Related Concepts to Work

The whole corpus

Topic 203: …whether the Company had met, or could, would, or might meet its financial forecasts, models, projections, or plans…

Topic 205: …analyses, evaluations, projections, plans, and reports on the volume(s) or geographic location(s) of energy loads.

Model In the Whole Collection

Look at the keyword “model”

Scope is the whole collection

Term          Score
modeling      1000
equation       864
stochastic     706
variables      677
parameters     518
probability    365
simulation     337
assumption     325
returns        251
curves         211

Model In Topic 203

Scope: Topic 203 (meeting financial forecasts)

Term           Score
flows          1000
assumptions     913
gains           872
shares          864
liquidity       486
fluctuations    374
analysts        285
cents           254
whitewing       237
handles         166

Model In Topic 205

Scope: Topic 205 (analyzing energy volumes)

Term          Score
bids          1000
congestion     611
loads          455
constraints    354
clearing       292
zonal          194
signals        192
procure        190
dispatch       152
csc            120

Model In Comparison

Whole Corpus   Topic 203      Topic 205
modeling       flows          bids
equation       assumptions    congestion
stochastic     gains          loads
variables      shares         constraints
parameters     liquidity      clearing
probability    fluctuations   zonal
simulation     analysis       signal
assumption     cents          procure
returns        whitewing      dispatch
curves         handles        csc

Now, imagine this with batches and coding changes over time!

Note: Our system can accept any combination of coding and metadata filters to dynamically assess your data
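A coding filter of this kind can be sketched as a scope restriction applied before related terms are scored. In the sketch below, scores are scaled so the top term gets 1000, mirroring the scale shown in the tables above; the co-occurrence count behind the score, the function name, and the toy documents and topic codes are illustrative assumptions rather than the presenters' scoring method.

```python
from collections import Counter
from typing import Optional


def scoped_related_terms(
    documents: dict[str, str],
    coding: dict[str, set[str]],
    keyword: str,
    topic: Optional[str] = None,
) -> list[tuple[str, int]]:
    """Score terms co-occurring with `keyword`, optionally within one coded topic."""
    counts: Counter = Counter()
    for doc_id, text in documents.items():
        if topic is not None and topic not in coding.get(doc_id, set()):
            continue  # document falls outside the coding filter
        terms = set(text.lower().split())
        if keyword in terms:
            counts.update(terms - {keyword})
    if not counts:
        return []
    top = max(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Scale so the highest-scoring term is 1000.
    return [(term, round(1000 * n / top)) for term, n in ranked]


# Hypothetical documents and topic coding.
documents = {
    "d1": "model equation stochastic",
    "d2": "model equation variables",
    "d3": "model flows assumptions",
    "d4": "model flows gains",
    "d5": "model bids congestion",
    "d6": "model bids loads",
    "d7": "model equation parameters",
}
coding = {"d3": {"203"}, "d4": {"203"}, "d5": {"205"}, "d6": {"205"}}

whole = scoped_related_terms(documents, coding, "model")
topic_203 = scoped_related_terms(documents, coding, "model", topic="203")
topic_205 = scoped_related_terms(documents, coding, "model", topic="205")
print(whole[0], topic_203[0], topic_205[0])
```

The same keyword yields a different top term in each scope, which is the effect the comparison table shows. Metadata filters could be added the same way, as another condition that skips out-of-scope documents.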

Summary

Incomplete Collections

Changing Coding Calls

Havoc for Machine Coding


Problem: The corpus is almost never complete
Answer: Review algorithms that are iterative and continuous

Problem: Changing Issues
Answer: Review algorithms that are adaptive and continuous

Problem: Shifting Concept Relationships
Answer: Concept relationships that are calculated dynamically, on the fly, and coding-aware

Continuous Case Assessment

Analytics Consulting

§  Analytics consulting and predictive ranking for nearly 4 years
§  How it started -- before “Predictive Coding” became popular:
   “Can’t you predict what documents are probably relevant based on your review so far?” – Judge, SDNY
§  Predictive Ranking: Iterative search techniques + algorithms
§  Then off-the-shelf Predictive Coding 1.0 technologies
§  Catalyst’s research is exciting! We apply the research to real-world scenarios.

Applying Bipartite Analytics…

Smart Review with the Bipartite Analytics Technology

Advantages:

§  Accurate
§  Dynamic
§  Flexible
§  “Just in Time” suggestions

Smart Review Scenarios

1. “What happened” – examples: FCPA investigation, conspiracy ECA
2. Typical large scale litigation with lots of ESI – e.g., class action lawsuit
3. Highly complex litigation with multiple issues – e.g., patent and unfair competition claims

Scenario 1 – What happened?

Goal: Rapidly determine facts and resolve matter if possible

Applying the Technology: A small number of knowledgeable attorneys drill into documents using the fusion of advanced search features and flexible predictive coding.


§  Faster location of valuable “veins” of information due to search filters

§  Rapid learning and application of that learning through flexible, “just in time” Predictive Coding 2.0

§  “Choose your own adventure”

Scenario 2 – Large Scale Litigation

Goal: Minimize cost by learning across a large document set, increase quality with focused review, and maximize protection of privilege and trade secrets

Applying the Technology:

§  Prioritized review based on rapid, continuous learning
§  Large scale defensible culling
§  More accurate ranking of “potentially privileged” documents

Scenario 3 – Highly Complex Litigation

Goal: Review and produce with multiple and changing issues

Applying the Technology:

§  Rapid learning across multiple topics
§  Leverage ability to adjust for change in topics
§  Review quality improves because of focus
§  Explore otherwise hidden subjects with Concept Explorer
§  Leverage learning across narrow, focused lines of inquiry (e.g., emails between two people in a narrow time window)
§  Protect privileged documents
