TRANSCRIPT
© 2008 Clearwell Systems, Inc. Confidential
Role of Enterprise Search in E-Discovery
June 18, 2008
Enterprise E-Discovery is a business process
Search is central to E-Discovery
[Figure: EDRM stages: Information Management, Identification, Preservation, Collection, Processing, Review, Analysis, Production, Presentation; axis labels: Volume, Relevance]
Electronic Discovery Reference Model (www.edrm.net)
Identification Search
• Custodians
• Meta-Data
• Date Range
• Media Type
• Data Type
Collection Search
• By Custodian
• By Operator
• By Data Type
• By keyword, phrase, concept
• By Project
Analysis/Review Search
• Responsiveness
• Privilege Determination
• Review Grouping
• Near-duplicates
• Quality Control
FRCP Rules governing E-Discovery
Rule            Summary Reading
Rule 16(b)      Outline plans for e-discovery and document production
Rule 26(f)      Procedures and Protocols to govern e-discovery
Rule 16(b)(5)   Courts to include scheduling orders
Rule 26(a)      Expansion on definition of ESI
Rule 26(b)(2)   E-Discovery Scope; Cost-Shifting arguments – Burden of reasonableness moving to Requesting Party
Rule 26(b)(5)   Inadvertently disclosed ESI and Privilege Claw-back agreements
Rule 34(b)      Specify forms of production (Native, Image, etc.)
Rule 37(f)      Disallow sanctions when ESI lost due to retention policy and good faith efforts
FRCP Rules and Their Impact on E-Discovery
• Emphasis on co-operation during E-Discovery
• Sedona Principles as a Guide for E-Discovery
• Early Discovery Planning Conferences
• No “Gaming” of E-Discovery
• Prepare for Meet and Confer
• Organizational Structure
• Information Assets and Data Map
• ILM Policies and Procedures
• Backup and Disaster Recovery Practices
• Preservation Hold/Legal Hold Policies and Actions
• Establish E-Discovery Scope
• Estimate Review Size from automated Search Results
• Raw Volume, Processed Volume, Review Volume
• Substantiate “Not Reasonably Accessible” Claims
• Move burden of “cost provability” to the Requesting Party
Enabling E-Discovery within an Enterprise
[Diagram labels: File Shares, Messaging Servers, CMS, Enterprise Intranet, Digital Asset Database, Organizational Data, Meta-Data Index, Keyword Index, Data Map, ECM/ILM Policies, Preservation Hold, Case Data, Analysis, Culling, Review, IT Personnel, Legal IT Personnel, Legal Search/Analysts]
E-Discovery Search Characteristics
Theme
• Produce Entire Results – not sufficient to only produce Top N
• No Estimates of Counts – Must provide accurate, actual counts
• Stability of Results
• Very large Result Sets
• Fast Query Response Time
• Provide Complete Hit Context
Relevance
• Activity Based Relevance – Responsiveness Search vs. Privilege Search
• Meta-Data based Relevance – Timeliness, People, Connection to other data
• Review-directed Relevance
• Traditional TF/IDF based Relevance
Results Management
• Complete Auditing of all Searches
• Document Hit Count Reports
• Tie back to original Document Meta-Data
• EDRM XML-2 Export to downstream processes
• Group Near-Duplicates, Concept Clusters for Review Efficiency
Data Types
• Many data formats – 10,000 formats
• New communication formats – Wiki, Blogs, SMS, IM, Unified Messaging
• ESI from old, legacy applications
• Incomplete and Corrupt data (Deleted Files, raw disk blocks)
• Handle Multi-language ESI
• Handle Low-fidelity documents – OCR-scanned images
Flexibility
• Advanced Search/Query Language
• Iterative Search and Search Refinement
• Guided Navigation, one-click Filtering
• Saving and Sharing Searches
• Remove impediments to search – ACLs, Encryption, Container Extraction
• Real-time updates for Tagging, Classifying Results
Workflow
• Incremental ESI Collections (Batches)
• Multi-level Review
• Multi-person Review
• Rolling Productions
• Activity Reports
• Outside Counsel, Opposing Counsel interactions
• Project Management
Search Effectiveness
Techniques to improve Precision and Recall
[Figure: precision vs. recall plot, both axes scaled from 0.0 to 1.0]
Precision
• Pre-filtering wildcard expansions
• Boolean Queries
• Proximity Specification
• Keyword Scope (Sentence, Paragraph)
• Meta-Data Context
• Entity based Search
Recall
• Misspellings/Fuzzy Search
• Wildcard Specifications
• Synonyms
• Related Terms
• Concept Search
• Bayesian Search
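As a rough illustration of the two lists above, the sketch below pairs one precision technique (proximity specification) with one recall technique (fuzzy matching for misspellings). The function names, the tokenizer, and the 0.8 similarity cutoff are assumptions made for this example; the slides do not say how any product implements these features.

```python
import re
from difflib import get_close_matches


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within_proximity(text, term_a, term_b, max_distance):
    """Precision aid: True if both terms occur within max_distance tokens of each other."""
    tokens = tokenize(text)
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(abs(i - j) <= max_distance for i in positions_a for j in positions_b)


def fuzzy_expand(term, vocabulary, cutoff=0.8):
    """Recall aid: return index terms similar to `term`, e.g. common misspellings.

    The cutoff is a hypothetical setting; a lower value finds more variants
    at the cost of more false positives.
    """
    return get_close_matches(term, vocabulary, n=10, cutoff=cutoff)


if __name__ == "__main__":
    doc = "The merger agreement was signed on Friday"
    print(within_proximity(doc, "merger", "signed", 3))
    print(fuzzy_expand("agreement", ["agreement", "agreemnt", "merger"]))
```

Tightening `max_distance` trades recall for precision, while loosening `cutoff` does the reverse, which is the tension the plot on this slide describes.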
E-Discovery Search: Typical measures and outcomes
Search Method        Retrieved Documents  Sample Size  Responsive in Sample  Estimate in Retrieved  Precision
Keyword Search       14846                1537         940                   9080                   0.61
Discussion Threads   16515                1537         1069                  11486                  0.70
Concept Search       18554                1537         1128                  13617                  0.73

Search Method        Unretrieved Documents  Sample Size  Responsive in Sample  Estimate in Unretrieved  Recall
Keyword Search       1076258                1537         29                    20307                    0.31
Discussion Threads   1074589                1537         28                    19576                    0.37
Concept Search       1072550                1537         26                    18143                    0.43
Number of truly responsive in Retrieved Collection:
Number of truly responsive documents in Un-Retrieved Collection
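The figures on this slide can be reproduced with simple sampling arithmetic. The sketch below assumes that each estimate is the collection size scaled by the responsive fraction of the sample, that precision is the responsive fraction of the retrieved sample, and that recall is the retrieved estimate divided by the sum of both estimates. These formulas are inferred from the numbers and are not stated on the slide; the function and variable names are illustrative.

```python
def estimate_responsive(collection_size, sample_size, responsive_in_sample):
    """Scale the responsive fraction observed in a sample up to the whole collection."""
    return round(collection_size * responsive_in_sample / sample_size)


def precision_recall(retrieved, unretrieved, sample_size,
                     responsive_retrieved_sample, responsive_unretrieved_sample):
    """Return (estimate_retrieved, estimate_unretrieved, precision, recall)."""
    est_retrieved = estimate_responsive(retrieved, sample_size, responsive_retrieved_sample)
    est_unretrieved = estimate_responsive(unretrieved, sample_size, responsive_unretrieved_sample)
    precision = responsive_retrieved_sample / sample_size
    recall = est_retrieved / (est_retrieved + est_unretrieved)
    return est_retrieved, est_unretrieved, round(precision, 2), round(recall, 2)


# Rows: (retrieved, unretrieved, sample size, responsive in retrieved sample,
#        responsive in unretrieved sample), copied from the slide.
ROWS = {
    "Keyword Search": (14846, 1076258, 1537, 940, 29),
    "Discussion Threads": (16515, 1074589, 1537, 1069, 28),
    "Concept Search": (18554, 1072550, 1537, 1128, 26),
}

if __name__ == "__main__":
    for method, row in ROWS.items():
        print(method, precision_recall(*row))
```

Each method splits the same collection of 1,091,104 documents into retrieved and unretrieved sets, so the rows are directly comparable.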
Interactive Search: Key to Search Efficiency
Interactive wildcard, stemming expansion selection
• Removes precision-recall tradeoff by enabling interactive review and removal of false positive expansions
• Saves thousands of dollars per search
Search Report
• Detailed, interactive keyword search report results for iterative large query execution
• Full transparency and auditing
• Significant time savings
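One way to picture interactive expansion selection: expand a wildcard against the terms in the index, show the reviewer each expansion with its hit count, and build the final query only from the expansions that were not rejected. The sketch below is a minimal illustration under that assumption; the term counts, the `trad*` pattern, and both function names are made up for the example.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, term_counts):
    """Return (term, hit_count) pairs for index terms matching a wildcard pattern.

    Results are sorted by descending hit count so the expansions that add the
    most documents (and the most review cost) are inspected first.
    """
    matches = [(term, count) for term, count in term_counts.items()
               if fnmatchcase(term, pattern)]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


def build_query(expansions, rejected):
    """Join the expansions the reviewer kept into a single OR query string."""
    kept = [term for term, _ in expansions if term not in rejected]
    return " OR ".join(kept)


# Hypothetical index vocabulary with per-term document hit counts.
TERM_COUNTS = {
    "trade": 120, "trader": 45, "trading": 80,
    "trademark": 300, "tradition": 210, "contract": 95,
}

if __name__ == "__main__":
    expansions = expand_wildcard("trad*", TERM_COUNTS)
    for term, count in expansions:
        print(f"{term:12s} {count}")
    # The reviewer rejects expansions that are off-topic for this matter.
    print(build_query(expansions, rejected={"trademark", "tradition"}))
```

In this made-up vocabulary the two rejected terms account for 510 of the 755 hits, which is the kind of saving the slide attributes to removing false positive expansions before review.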
E-Discovery is about extracting Relevant Content
[Figure: data volume funnel; stages: Archive and Store, Collect and Preserve (Preservation Store), Analyze and Review; volumes shown: 100-1000 TB, 50 TB, 500 GB, 1-2 GB]
Enterprise Case Study – Global Media Conglomerate
Case Data: 456,448; 208,628; 74,713
Data culling based on query permutations reduced data set by 99% to 417
Time = 2.5 days
Eliminating the need to process and review 456,000 documents saved $175,000
E-Discovery - Workflow
[Diagram labels: Sources, Source Scan, Full Text Indexer, Copy Engine, Processing, Case Mgmt, Deep Index (Full-Text), Meta-Data (Shallow Index), Processing Manifest, Case ESI Store, SQL Full Text]
Rate of Ingestion
• 1M files/hour
• 10K directory scans
• 1 TB/hour
Size of Index
• 1 TB
• 10 billion objects
Size of Index
• 0.2 TB
• 1 billion rows
• 10K/s Bulk-Load
Size of Index
• 1 TB (each partition)
• Up to 100 index partitions
• 10 billion objects
• 200-400 file types
• Includes meta-data
Rate of Indexing
• 100 K files/hour
• 10-20 GB/hour
Rate of Extraction
• 20 K files/hour
• 2-4 GB/hour
Rate of Processing
• 100 custodians
• 10K files/hour
• 1 GB PST/custodian
Size of Store
• 32 TB FC/SCSI
• 4 TB NTFS
• 300 GB/custodian
• 100 custodians
Size of Manifest
• 10 million items
E-Discovery Search: Collection Workflow
[Diagram labels: Sources, Source Scan, Meta-Data (Shallow Index), Search, Case Document Collection]
Search Scope
• Owners/SID
• Last Modification Date
• Creation Date
• Author/Title
• Department
Search Technology
• Keyword Search
• Parameterized Date Range
Case Document Collection
Copy of Original
• Maintain Original Locations
• Hash with Meta-Data for content and location integrity
• Hash without Meta-Data for content integrity
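The two hashes in the list above serve different checks: a content-only hash shows the bytes are unchanged, while a hash that also covers the meta-data shows the copy still carries its original location and attributes. The sketch below is one possible way to compute both. SHA-256, the JSON canonicalization, and the meta-data fields shown are assumptions for illustration; the slide does not name an algorithm or a field set.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Hash without meta-data: verifies content integrity only."""
    return hashlib.sha256(data).hexdigest()


def content_and_location_hash(data: bytes, metadata: dict) -> str:
    """Hash with meta-data: verifies content and location integrity together.

    The meta-data is serialized with sorted keys so the same fields always
    produce the same bytes regardless of dictionary ordering.
    """
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256()
    digest.update(canonical)
    digest.update(b"\0")  # separator between the meta-data and the content
    digest.update(data)
    return digest.hexdigest()


if __name__ == "__main__":
    body = b"Quarterly forecast draft"
    # Hypothetical meta-data for two copies of the same file.
    original = {"path": "/shares/finance/forecast.doc", "modified": "2008-05-30T10:15:00Z"}
    moved = {"path": "/shares/archive/forecast.doc", "modified": "2008-05-30T10:15:00Z"}
    print(content_hash(body))
    print(content_and_location_hash(body, original))
    print(content_and_location_hash(body, moved))
```

Two copies with identical bytes share the content-only hash, which also makes it usable for de-duplication, while their combined hashes differ as soon as the path or timestamp differs.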
E-Discovery Search: Analysis Workflow
[Diagram labels: Case Document Collection, Search, Potentially Responsive Documents, Document Review, Responsive Documents, Non-Responsive Documents, Sampling Engine, Sample Non-Responsive Documents, Responsive Misses (“Recall”), Potentially Privileged Documents, Privilege Review, Privilege “Misses” Review, Privileged Documents, Production Documents]
Search Scope
• Documents
• Emails
Search Technology
• Keywords
• Boolean Search
• Proximity Search
• Fuzzy Search
• Concept Search
• Tagged Search
Search Refinement
• Additional Keywords
• Additional Search Methods
Privilege Search
• Documents
• Emails
Quality Control
• Documents
• Emails
• Tags
Confidence Level  Sample Size
95                1537
99                66358
Reports
• Search Reports
• Activity Reports
• QC Reports
• Project Review Reports
• Privilege Log
• Exceptions Reports
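The two sample sizes in the Confidence Level table are consistent with the standard formula for estimating a proportion, n = z² · p(1 − p) / e², using the worst-case proportion p = 0.5. The slide does not state the margin of error, so the margins below (2.5% at 95% confidence and 0.5% at 99% confidence) are assumptions chosen because they reproduce the slide's numbers with the usual rounded z-scores.

```python
import math

# Conventional rounded z-scores for two-sided confidence levels.
Z_SCORES = {95: 1.96, 99: 2.576}


def sample_size(confidence_pct, margin_of_error, proportion=0.5):
    """Sample size needed to estimate a proportion within +/- margin_of_error.

    proportion=0.5 is the worst case (largest variance), so the result is a
    conservative size when the true responsive rate is unknown.
    """
    z = Z_SCORES[confidence_pct]
    return math.ceil(z * z * proportion * (1 - proportion) / margin_of_error ** 2)


if __name__ == "__main__":
    print(sample_size(95, 0.025))   # margin assumed, not given on the slide
    print(sample_size(99, 0.005))   # margin assumed, not given on the slide
```

Under these assumptions the 95% figure of 1537 is also the sample size used in the precision and recall measurements earlier in the deck.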