predictive coding legaltech
Predictive Coding 2.0: Making E-Discovery More Efficient and Cost Effective
John Tredennick, Jeremy Pickens, Jim Eidelman
How Many Do I Have to Check?
1. You have a bag with 1 million M&Ms
2. It contains mostly brown M&Ms
3. You cannot see into the bag
4. You have a scoop that will pull out 100 M&Ms at a time
5. Your hope is that there are no red M&Ms in the bag
6. You pull out a scoop and they are all brown
How many scoops do you need to review to be confident there are no red M&Ms?
Let’s Take a Poll
How many scoops?
1? 2? 3? 5? 10? 20?
100? 500? 1,000?
How Confident Do You Need to Be?
How many errors can you tolerate?
§ Five out of a hundred?
§ One out of a hundred?
§ One percent = 10,000
Does 95% work?
At a 95% confidence level and 5% margin of error: 384 M&Ms
How about 99%?
At a 99% confidence level and 1% margin of error: 459 M&Ms
At a 100% confidence level and 0% margin of error: 1,000,000 M&Ms
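The slides do not say how the 384 and 459 figures were derived. As a rough sketch, 384 matches the standard margin-of-error sample size (worst-case proportion of 0.5, with a finite population correction for 1,000,000 items), while 459 matches the zero-defect question the M&M example poses: how many all-brown M&Ms must be drawn to be 99% confident that fewer than 1% are red. Both formulas, and the function names below, are assumptions used for illustration only.

```python
import math
from statistics import NormalDist


def margin_of_error_sample_size(confidence, margin, population, p=0.5):
    """Sample size to estimate a proportion within +/- margin (assumed formula)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    n0 = z * z * p * (1 - p) / (margin * margin)         # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                 # finite population correction
    return math.ceil(n)


def zero_defect_sample_size(confidence, max_rate):
    """Smallest n for which an all-clean sample of n items rules out a
    defect rate of max_rate or more at the given confidence (assumed formula)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_rate))


print(margin_of_error_sample_size(0.95, 0.05, 1_000_000))  # 384
print(zero_defect_sample_size(0.99, 0.01))                 # 459
```

Under these assumptions the two figures answer different questions, and neither formula can reach 100% confidence short of checking all 1,000,000 M&Ms.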
Predictive Coding
Does it Work?
What Have the Courts Said?
“Until there is a judicial opinion approving (or even critiquing) the use of predictive coding, counsel will just have to rely on this article as a sign of judicial approval. In my opinion, computer-assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ (Fed. R. Civ. P. 1) determination of cases in our e-discovery world.”
Magistrate Judge Andrew Peck
Predictive Coding 1.0
1. Assemble your corpus.
2. Assemble a seed set of documents.
3. Review the seed set.
4. Apply machine learning and automatically tag the remainder of the corpus.
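The four steps above can be sketched in a few lines. This is a minimal illustration, assuming a tiny naive Bayes text classifier as the "machine learning" step; the slides do not name the algorithm used, and the sample documents, labels, and function names here are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(seed_set):
    """Step 3 output in, model out. seed_set is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of reviewed documents
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Step 4: pick the label with the highest naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in the seed set are ignored
                # add-one smoothing so unseen words do not zero out a label
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Steps 1-3: a (hypothetical) corpus, with a small reviewed seed set.
seed_set = [
    ("pricing of our components", "responsive"),
    ("discuss component pricing tonight", "responsive"),
    ("company picnic hamburger buns", "junk"),
    ("watch the game at the sports bar", "junk"),
]
remainder = ["meeting about pricing", "picnic buns and the game"]

model = train(seed_set)
tags = [classify(doc, model) for doc in remainder]
print(tags)  # ['responsive', 'junk']
```

Note that the model is trained once, on the seed set only. Nothing in this workflow revisits the tags when new documents or new coding decisions arrive.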
Predictive Coding 1.0
§ Tremendous gains in review effectiveness
§ Substantial cost savings
§ It works. Often quite well
…when the corpus is complete.
533 matters, with nearly 36,000 uploads across them.
67.5 uploads per case
This is driven by collection, not by loading limits.
166.3 days of loading per case
67 uploads
166 days
In which upload and on which day do your responsive documents show up?
Terms that do not appear early begin appearing later.
Machine-Assisted Decision Making
Upload timeline of 6 TB case. When should machine-assisted decision making (e.g. early case assessment) begin?
Is it here?
Or here?
Example: Responsive Early, Junk Later
To: [email protected], [email protected]
From: [email protected]
Subject: Company Picnic
Bob, would you coordinate with Alice and make sure we have enough hamburger buns for the company picnic? Please try and find them at a reasonable price.
Responsive Junk
Example: Junk Early, Responsive Later
To: [email protected], [email protected]
From: [email protected]
Subject: Get Together
Let’s get together at 7pm at the Sports Bar to discuss pricing of our components. The Broncos are playing and I really want to watch Tebow.
Junk Responsive
Problems With Predictive Coding 1.0
The corpus is almost never complete
§ Continuous collection and rolling uploads
§ When does “Early Case Assessment” begin?
Changing Issues
§ Responsiveness is “bursty”
Shifting Concept Relationships
§ Due both to increasing corpus and changing issues
§ Exploration is extremely limited
Our Approach
Predictive Coding 2.0 necessitates the ability to deal with dynamic change and flux. We have developed a flexible analytics framework based on bipartite graphs. It is aware of changes in the corpus and in the coding, enabling smart review and adaptive related-concept suggestions as information pours in.
Goal: Continuous Case Assessment
Our Approach
Avoid the lock-in that arises from poor decisions made early in the matter, when the corpus (collection) and coding information are incomplete.
What Is Underneath?
A full bipartite graph of the documents and features (e.g. words, phrases, dates) that comprise those documents
Documents Terms
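As a minimal sketch of the idea, a bipartite graph can be stored as two adjacency maps, one from documents to terms and one from terms to documents. The class and method names below are hypothetical, features are limited to whitespace-separated words, and nothing here is taken from the presenters' actual implementation.

```python
from collections import defaultdict


class BipartiteIndex:
    """Documents on one side, terms on the other; edges mean 'document contains term'."""

    def __init__(self):
        self.doc_to_terms = defaultdict(set)
        self.term_to_docs = defaultdict(set)

    def add_document(self, doc_id, text):
        # Each distinct word becomes an edge between the document and the term.
        for term in text.lower().split():
            self.doc_to_terms[doc_id].add(term)
            self.term_to_docs[term].add(doc_id)

    def documents_for(self, term):
        return sorted(self.term_to_docs.get(term, ()))

    def terms_for(self, doc_id):
        return sorted(self.doc_to_terms.get(doc_id, ()))


index = BipartiteIndex()
index.add_document("d1", "company picnic buns")
index.add_document("d2", "pricing of components")
index.add_document("d3", "picnic pricing")

print(index.documents_for("picnic"))  # ['d1', 'd3']
print(index.terms_for("d3"))          # ['picnic', 'pricing']
```

Because every edge is stored in both directions, the same structure answers "which documents contain this term?" and "which terms describe this document?" without rescanning the corpus.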
Feedback: Immediate and Continuous
Continuous feedback aids better decision making and predictive coding. Adapts to both:
§ New arrival of coding information
§ New arrival of documents and terms
Documents Terms
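One way to picture continuous feedback is a ranker that updates itself on every new document and every new or revised coding call, rather than training once on a seed set. The sketch below uses a deliberately simple scheme (each term's weight is its responsive votes minus its non-responsive votes). That scoring rule, the class, and the sample documents are assumptions for illustration, not the presenters' algorithm.

```python
class ContinuousRanker:
    def __init__(self):
        self.docs = {}         # doc_id -> set of terms
        self.coding = {}       # doc_id -> True (responsive) or False
        self.term_weight = {}  # term -> responsive votes minus non-responsive votes

    def add_document(self, doc_id, text):
        """New documents can arrive at any time."""
        self.docs[doc_id] = set(text.lower().split())

    def add_coding(self, doc_id, responsive):
        """Record a coding call; an earlier call on the same document is undone first."""
        if doc_id in self.coding:
            self._apply(doc_id, -1 if self.coding[doc_id] else +1)
        self.coding[doc_id] = responsive
        self._apply(doc_id, +1 if responsive else -1)

    def _apply(self, doc_id, delta):
        for term in self.docs[doc_id]:
            self.term_weight[term] = self.term_weight.get(term, 0) + delta

    def ranked_unreviewed(self):
        """Unreviewed documents, most likely responsive first (ties broken by id)."""
        def score(doc_id):
            return sum(self.term_weight.get(t, 0) for t in self.docs[doc_id])
        unreviewed = [d for d in self.docs if d not in self.coding]
        return sorted(unreviewed, key=lambda d: (-score(d), d))


ranker = ContinuousRanker()
ranker.add_document("d1", "pricing of components")
ranker.add_document("d2", "company picnic buns")
ranker.add_document("d3", "component pricing meeting")
ranker.add_document("d4", "picnic at the park")
ranker.add_coding("d1", True)
ranker.add_coding("d2", False)
first = ranker.ranked_unreviewed()   # ['d3', 'd4']

# Later: a new upload arrives and the picnic email is re-coded as responsive.
ranker.add_document("d5", "picnic pricing talk")
ranker.add_coding("d2", True)
second = ranker.ranked_unreviewed()  # ['d5', 'd3', 'd4']
print(first, second)
```

The point of the design is that neither event requires starting over: a new upload or a reversed coding call simply adjusts the weights, and the next ranking reflects it.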
Predictive Coding 2.0
Feedback – and improvement – is iterative, continuous, amplified.
% of Docs Examined Manually
The more you review, the less you have to review
Term relationships change over time.
Using continuous improvement, decisions can be revised and refined as the matter proceeds.
Better Decisions As Understanding Improves
Time uncovers new relationships
Documents Terms
Looking at Concepts Over Time
[Figure: two groups of related terms, labeled 20% and 65%. Terms shown: lube, fuels, piping, fob, battery, purityethane, mounted, petrochemicals, redundant, fin, batteries, paraxylene, compartments, cif, mixture, phy, airflow, fwd, ansi, swopt, ventilation, brentpartials, chargers, brg, stainless, locswap, rotor, benzene, bleed, diff, accessory, spd, plenum, liquids, detector, opt]
Start with the key term “fuel”
At 20% these are the related terms
And at 65%
Related Terms Through Coding Filters
Documents Terms
Responsive
NonResponsive
TREC collection with many topics identified
Putting Related Concepts to Work
The whole corpus
Topic 203: …whether the Company had met, or could, would, or might meet its financial forecasts, models, projections, or plans…
Topic 205: …analyses, evaluations, projections, plans, and reports on the volume(s) or geographic location(s) of energy loads.
Term          Score
modeling      1000
equation      864
stochastic    706
variables     677
parameters    518
probability   365
simulation    337
assumption    325
returns       251
curves        211
Model In the Whole Collection
Scope is the whole collection
Look at the keyword “model”
Term          Score
flows         1000
assumptions   913
gains         872
shares        864
liquidity     486
fluctuations  374
analysts      285
cents         254
whitewing     237
handles       166
Model In Topic 203
Look at the keyword “model”
Scope: Topic 203
meeting financial forecasts
Term          Score
bids          1000
congestion    611
loads         455
constraints   354
clearing      292
zonal         194
signals       192
procure       190
dispatch      152
csc           120
Model In Topic 205
Look at the keyword “model”
Scope: Topic 205
analyzing energy volumes
Whole Corpus  Topic 203     Topic 205
modeling      flows         bids
equation      assumptions   congestion
stochastic    gains         loads
variables     shares        constraints
parameters    liquidity     clearing
probability   fluctuations  zonal
simulation    analysis      signal
assumption    cents         procure
returns       whitewing     dispatch
curves        handles       csc
Model In Comparison
Now, imagine this with batches and coding changes over time!
Note: Our system can accept any combination of coding and metadata filters to dynamically assess your data
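The scoped comparison above can be approximated with a small sketch: count which terms co-occur with a keyword, but only within the documents that pass a filter. The co-occurrence count and the scaling of the top score to 1000 are assumptions inferred from the tables, not the system's documented scoring. The function name, the `topic` field, and the sample documents are hypothetical.

```python
from collections import Counter


def related_terms(docs, keyword, doc_filter=None, top_n=3):
    """Rank terms that co-occur with keyword in docs passing doc_filter.

    docs is a list of dicts with a 'text' field plus any coding or metadata fields.
    Scores are scaled so that the strongest term gets 1000.
    """
    counts = Counter()
    for doc in docs:
        if doc_filter is not None and not doc_filter(doc):
            continue
        terms = set(doc["text"].lower().split())
        if keyword in terms:
            counts.update(terms - {keyword})
    if not counts:
        return []
    # Sort by descending count, then alphabetically so ties are deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    best = top[0][1]
    return [(term, round(1000 * n / best)) for term, n in top]


docs = [
    {"text": "model cash flows", "topic": 203},
    {"text": "model flows assumptions", "topic": 203},
    {"text": "model bids congestion", "topic": 205},
    {"text": "model bids loads", "topic": 205},
    {"text": "model bids", "topic": 205},
]

whole = related_terms(docs, "model")
topic_203 = related_terms(docs, "model", lambda d: d["topic"] == 203)
topic_205 = related_terms(docs, "model", lambda d: d["topic"] == 205)
print(whole)      # [('bids', 1000), ('flows', 667), ('assumptions', 333)]
print(topic_203)  # [('flows', 1000), ('assumptions', 500), ('cash', 500)]
print(topic_205)  # [('bids', 1000), ('congestion', 333), ('loads', 333)]
```

Because the filter is just a function over each document's fields, the same call works for any combination of coding and metadata conditions, and rerunning it after a new batch or a coding change gives the updated picture.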
Summary
Incomplete Collections
Changing Coding Calls
Havoc for Machine Coding
Predictive Coding 2.0
Problem: The corpus is almost never complete
Answer: Review Algorithms that are iterative and continuous
Problem: Changing Issues
Answer: Review Algorithms that are adaptive and continuous
Problem: Shifting Concept Relationships
Answer: Concept Relationships that are calculated dynamically, on-the-fly, and coding-aware.
Continuous Case Assessment
Analytics Consulting
§ Analytics consulting and predictive ranking for nearly 4 years
§ How it started -- Before “Predictive Coding” became popular:
“Can’t you predict what documents are probably relevant based on your review so far?” – Judge, SDNY
§ Predictive Ranking: Iterative search techniques + algorithms
§ Then off-the-shelf Predictive Coding 1.0 technologies
§ Catalyst’s research is exciting! We apply the research to real-world scenarios.
Applying Bipartite Analytics…
Smart Review with the Bipartite Analytics Technology
Advantages:
§ Accurate
§ Dynamic
§ Flexible
§ “Just in Time” suggestions
Smart Review Scenarios
1. “What happened” – examples: FCPA investigation, conspiracy ECA
2. Typical large scale litigation with lots of ESI – e.g., class action lawsuit
3. Highly complex litigation with multiple issues – e.g., patent and unfair competition claims
Scenario 1 – What happened?
Goal: Rapidly determine facts and resolve matter if possible
Applying the Technology
Small number of knowledgeable attorneys drill into documents using the fusion of advanced search features and flexible predictive coding.
§ Faster location of valuable “veins” of information due to search filters
§ Rapid learning and application of that learning through flexible, “just in time” predictive coding 2.0.
§ “Choose your own adventure”
Scenario 2 – Large Scale Litigation
Goal: Minimize cost by learning across a large document set, increase quality with focused review, and maximize protection of privilege and trade secrets
Applying the Technology:
§ Prioritized review based on rapid, continuous learning
§ Large scale defensible culling
§ More accurate ranking of “potentially privileged” documents
Scenario 3 – Highly Complex Litigation
Goal: Review and produce with multiple and changing issues
Applying the Technology
§ Rapid learning across multiple topics
§ Leverage ability to adjust for change in topics
§ Review quality improves because of focus
§ Explore otherwise hidden subjects with Concept Explorer
§ Leverage learning across narrow, focused lines of inquiry (e.g., emails between two people in a narrow time window)
§ Protect privileged documents