ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16
TRANSCRIPT
![Page 1: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/1.jpg)
ABBYY Technology Summit 2016
ABBYY NAHQ, 2016
Pierre van der Westhuizen
© ABBYY Confidential
![Page 2: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/2.jpg)
Introduction: Designing High Performance Systems
• Upscale FlexiCapture, identify bottlenecks, and optimize system performance
• Performance Metrics and testing
• FlexiCapture Performance at a glance
– Scaling
– Up to 3 million pages per day
– High Fault Tolerance and Availability
![Page 3: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/3.jpg)
Agenda: Designing High Performance Systems
• Introduction
• *Architecture of FlexiCapture
• Component Interaction Walkthrough
• Defining Performance Metrics
• Scaling of Systems (Demo, Medium, Large)
• *Optimizing: Processing Stations, Scanning Stations and Workflow
• Optimal Values of and Limitations on System Performance
• System Monitoring and Bottleneck Detection
• Performance testing
• *Improving your current system
![Page 4: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/4.jpg)
Architecture
![Page 5: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/5.jpg)
Architecture – Server Side
• Application Level
• Application Server
• Licensing Server
• Processing Level
• Processing Server
• Data Storage
• Database
• File Storage
![Page 6: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/6.jpg)
Architecture – Client Side
• User Stations
• Scanning
• Verification
• Processing Stations
• Administration/Monitoring Web Console
• Project Setup Station
![Page 7: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/7.jpg)
Component Interaction
![Page 8: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/8.jpg)
Component Interaction
![Page 9: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/9.jpg)
Performance Metrics
![Page 10: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/10.jpg)
Defining Performance Metrics
We measure performance in volumes processed per period of time.
Define target performance using performance metrics:
• The required processing time
• Processing volumes
![Page 11: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/11.jpg)
Parameters that shape workload
• Average batch size in pages
• Image color mode: color, grayscale, black-and-white
• Pages per day (i.e. 24 hours), average/peak
• Pages per hour, average/peak
• Average document size in pages
• Number of scanning operators
• Number of verification operators
• Document storage time
![Page 12: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/12.jpg)
SCALING
![Page 13: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/13.jpg)
Scaling
![Page 14: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/14.jpg)
Demo System
![Page 15: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/15.jpg)
Medium System
![Page 16: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/16.jpg)
Medium System: Application Server
1. Fast network
2. Fast connection to FileStorage and Database
3. Fast CPU
4. RAM
![Page 17: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/17.jpg)
Medium System: Processing, Licensing Servers
For redundancy, see the FlexiCapture System Administrator Guide.
![Page 18: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/18.jpg)
Medium System: Database Server
• More RAM
• Fast HDD
• Avoid Mirroring
• Separate Data and Logs
• Index Updates
![Page 19: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/19.jpg)
Medium System: File Storage
Read-write and capacity requirements depend on:
• Average and peak pages processed per day
Speed required for 10,000 pages/hr: 2.8 pages/sec = 2.8 x 3 MB/sec = 8.4 MB/sec
• Amount of time that documents are stored
e.g. 16 x 100,000 grayscale images x 3 MB (average file size for a grayscale image) = 4.8 TB of data
NOTE: We strongly recommend a RAID 10 array or an equivalent configuration
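The two File Storage estimates above are simple arithmetic and can be sketched in a few lines. This is only an illustration: the function names are hypothetical and not part of FlexiCapture, and the inputs are the figures quoted on the slide.

```python
def throughput_mb_per_sec(pages_per_hour: float, page_size_mb: float) -> float:
    """Sustained transfer rate needed to keep up with the given page volume."""
    return pages_per_hour / 3600 * page_size_mb


def capacity_tb(pages_stored: int, page_size_mb: float) -> float:
    """Storage needed for the retained pages, in decimal terabytes."""
    return pages_stored * page_size_mb / 1_000_000


# Figures from the slide: 10,000 pages/hr, 3 MB per grayscale page.
rate = throughput_mb_per_sec(10_000, 3.0)
# Figures from the slide: 16 x 100,000 grayscale images retained.
size = capacity_tb(16 * 100_000, 3.0)
print(f"{rate:.1f} MB/sec, {size:.1f} TB")
```

Without intermediate rounding the rate comes out at about 8.3 MB/sec; the slide's 8.4 MB/sec results from rounding to 2.8 pages/sec first.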
![Page 20: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/20.jpg)
Large System
![Page 21: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/21.jpg)
Optimizing Processing Stations
![Page 22: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/22.jpg)
Processing Stations
• Tune each station
• Add more stations
![Page 23: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/23.jpg)
Processing Station: Hardware
• 1 Process per core
• 16 cores max
• 1GB RAM per core
Processing speed greatly depends on the CPU speed and the Hard Disk read-write speed
![Page 24: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/24.jpg)
Processing Station: TEMP Folder
Scenario: 100-page batches on an 8-core Station
• 100 pages x 3 MB = 300 MB
• 8 Cores means 8 simultaneous executive processes
• TEMP Folder Space required is 300 MB x 8 = 2.4 GB
• Allocate 2 GB per Core + 2.4 GB for TEMP = 18.4 GB RAM
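The TEMP folder scenario can be reproduced with a short calculation. The sketch below assumes, as the slide does, that one batch is held in TEMP per core and that 1 GB is counted as 1000 MB; the function names are illustrative only.

```python
def temp_space_gb(pages_per_batch: int, page_size_mb: float, cores: int) -> float:
    """TEMP space needed when every core works on one batch at the same time."""
    return pages_per_batch * page_size_mb * cores / 1000


def station_ram_gb(cores: int, ram_per_core_gb: float, temp_gb: float) -> float:
    """RAM to allocate: a per-core share plus room for the TEMP folder."""
    return cores * ram_per_core_gb + temp_gb


# Scenario from the slide: 100-page batches, 3 MB per page, 8 cores, 2 GB per core.
temp_gb = temp_space_gb(100, 3.0, 8)
ram_gb = station_ram_gb(8, 2.0, temp_gb)
print(f"TEMP: {temp_gb:.1f} GB, RAM: {ram_gb:.1f} GB")
```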
![Page 25: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/25.jpg)
Calculate Number of Processing Stations
![Page 26: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/26.jpg)
Estimate the required number of processing cores
Measure how long it takes to process one batch for one core
8-core Processing Station
Process:
1. Create 24 copies of a typical batch
2. Put all batches in the FlexiCapture hotfolder
3. Start the timer when the first import task is created
4. Stop the timer after the last result has been exported to the backend
15 minutes elapsed
Each core has processed 3 batches
Time to process 1 batch is about 5 minutes
If a batch has 69 pages => it takes 4.35 seconds to process 1 page.
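The measurement above reduces to one formula: elapsed time divided by batches per core gives the time per batch, and dividing that by the pages in a batch gives the time per page. A minimal sketch, with a hypothetical function name and the slide's numbers as input:

```python
def seconds_per_page(elapsed_minutes: float, batches: int, cores: int,
                     pages_per_batch: int) -> float:
    """Per-core processing time for one page, derived from a timed test run."""
    batches_per_core = batches / cores
    seconds_per_batch = elapsed_minutes * 60 / batches_per_core
    return seconds_per_batch / pages_per_batch


# Test run from the slide: 24 batches of 69 pages on 8 cores in 15 minutes.
t = seconds_per_page(elapsed_minutes=15, batches=24, cores=8, pages_per_batch=69)
print(f"{t:.2f} seconds per page")
```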
![Page 27: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/27.jpg)
Estimate the required number of processing cores Cont’d
Estimate desired number of cores
Assume you need to process P pages in T time. We already know from the above that 1 core needs t time to process 1 page. Hence, you need N = (P x t) / T cores.
Example: 200,000 pages in 8 hours = 28,800 seconds
We know 1 core takes 4.35 seconds to process 1 page
200,000 x 4.35 / 28,800 ≈ 30.2, rounded up to 31 cores
=> 4 Processing Stations with 8 cores each (32 cores in total) will provide sufficient processing capacity.
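The formula N = (P x t) / T can be turned into a small helper that also rounds up to whole cores and whole stations. The names below are hypothetical; only the formula and the example figures come from the slide.

```python
import math


def cores_required(pages: int, seconds_per_page: float, window_seconds: float) -> int:
    """N = (P x t) / T, rounded up to a whole number of cores."""
    return math.ceil(pages * seconds_per_page / window_seconds)


def stations_required(cores: int, cores_per_station: int) -> int:
    """Number of identical stations needed to provide at least `cores` cores."""
    return math.ceil(cores / cores_per_station)


# Example from the slide: 200,000 pages in 8 hours at 4.35 seconds per page.
cores = cores_required(200_000, 4.35, 8 * 3600)
stations = stations_required(cores, 8)
print(cores, "cores on", stations, "stations")
```

Rounding up rather than to the nearest integer keeps the estimate on the safe side, since a fractional core cannot be provisioned.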
![Page 28: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/28.jpg)
Processing Cores – Limiting Factors
• The total load on the infrastructure that may result in bottlenecks
– Server Hardware
– Network
– Shared Resources (Database, External Services)
• The number of processing cores that can be served by the Processing Server
– Max 120 cores
Monitor Free Processing Cores on Processing Server
![Page 29: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/29.jpg)
Optimizing: Scanning Stations
![Page 30: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/30.jpg)
Scanning Stations
• Performance Limits
– Scanner Speed
– Data Transfer Bandwidth
• Separate Network Interface for Scanning
• Set up scan settings
– Color Mode
– Remove Blanks
• Schedule Image Uploads
![Page 31: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/31.jpg)
Optimizing: Workflow
![Page 32: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/32.jpg)
Workflow
• Avoid too many stages
• The slowest stage limits the performance
• Do not produce tasks that are too small when parallelizing processing at a stage
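The point that the slowest stage limits performance can be illustrated with a toy model: the throughput of a sequential workflow is the minimum of its stage throughputs. The stage names and rates below are invented for illustration and are not measurements from FlexiCapture.

```python
def bottleneck(stage_rates: dict[str, float]) -> tuple[str, float]:
    """Return the slowest stage and its rate, which caps the whole workflow."""
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


# Hypothetical per-stage throughput in pages per hour.
rates = {
    "import": 12_000.0,
    "recognition": 9_000.0,
    "verification": 4_000.0,
    "export": 15_000.0,
}
stage, pages_per_hour = bottleneck(rates)
print(f"Workflow is capped at {pages_per_hour:.0f} pages/hr by the {stage} stage")
```

In this model, speeding up any stage other than the bottleneck leaves the overall throughput unchanged.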
![Page 33: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/33.jpg)
Optimal Values and Limits
![Page 34: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/34.jpg)
Optimal Values of and Limitations on the System Performance 1
| Factor | Optimal values & limitations |
| --- | --- |
| System performance in pages per 24 hours: Demo | Able to process up to 20,000 black-and-white or up to 1,000 color pages per 24 hours |
| System performance in pages per 24 hours: Medium | Up to 1 million black-and-white or up to 300,000 color pages per 24 hours, using a farm of regular computers |
| System performance in pages per 24 hours: Large | Up to 3 million black-and-white or up to 1 million color pages per 24 hours |
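As a rough aid, the daily limits above can be encoded as a lookup that returns the smallest configuration able to carry a target volume. This is a sketch only: the function name is hypothetical, and real sizing depends on the other workload parameters listed earlier.

```python
from typing import Optional

# (configuration, max black-and-white pages/day, max color pages/day) from the slide.
LIMITS = [
    ("Demo", 20_000, 1_000),
    ("Medium", 1_000_000, 300_000),
    ("Large", 3_000_000, 1_000_000),
]


def smallest_configuration(pages_per_day: int, color: bool) -> Optional[str]:
    """Smallest configuration whose quoted daily limit covers the volume."""
    for name, bw_limit, color_limit in LIMITS:
        if pages_per_day <= (color_limit if color else bw_limit):
            return name
    return None  # Volume exceeds the quoted limit of the Large configuration.


print(smallest_configuration(200_000, color=False))
print(smallest_configuration(500_000, color=True))
```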
![Page 35: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/35.jpg)
Optimal Values of and Limitations on the System Performance 2
| Factor | Optimal values & limitations |
| --- | --- |
| Number of scanning operators | FlexiCapture is able to host 1,000 scanning operators. |
| Number of verification operators | FlexiCapture is able to host 300 verification operators. |
| Number of Processing Stations | We used up to 120 cores in total for all Processing Stations. |
| Number of cores per Processing Station | Regular disk drive: up to 8 cores. Fast disk drive: up to 16 cores. RAM drive: up to 32 cores. |
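The core limits quoted above lend themselves to a quick sanity check of a planned station farm. The sketch below is an assumption-laden illustration: the drive-type keys and the function name are made up, and the limits are the values from the slides, not enforced product constraints.

```python
# Per-station core limits by TEMP drive type, as quoted on the slide.
MAX_CORES_PER_STATION = {"regular": 8, "fast": 16, "ram": 32}
# Total cores across all Processing Stations, as quoted on the slide.
MAX_TOTAL_CORES = 120


def check_farm(stations: list[tuple[str, int]]) -> list[str]:
    """Return warnings for a list of (drive_type, cores) station definitions."""
    warnings = []
    for index, (drive, cores) in enumerate(stations, start=1):
        limit = MAX_CORES_PER_STATION[drive]
        if cores > limit:
            warnings.append(
                f"station {index}: {cores} cores exceeds {limit} for a {drive} drive"
            )
    total = sum(cores for _, cores in stations)
    if total > MAX_TOTAL_CORES:
        warnings.append(f"total of {total} cores exceeds {MAX_TOTAL_CORES}")
    return warnings


# Four 8-core stations on regular disks stay within both limits.
print(check_farm([("regular", 8)] * 4))
```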
![Page 36: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/36.jpg)
Optimal Values of and Limitations on the System Performance 3
| Factor | Optimal values & limitations |
| --- | --- |
| Number of pages in a Batch | Optimal value is from 10 to 1,000 pages in a batch |
| Number of pages in a Document | Optimal value is up to 100 pages in a document |
| Number of pages, documents, and batches in the system | This highly depends on the hardware used. For a Large configuration, up to 100,000 batches, or 1 million documents, or 10 million pages is normal. |
![Page 37: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/37.jpg)
Optimal Values of and Limitations on the System Performance 4
| Factor | Optimal values & limitations |
| --- | --- |
| Data storage time | Typically, pages, documents, batches, and event log records are stored in the system for up to 2 weeks. Statistics for reporting can be stored for years with no impact on performance. |
![Page 38: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/38.jpg)
Performance Testing
![Page 39: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/39.jpg)
Performance Testing
• Single Entry Point Project
– Import from Scanning Station
– Pre-processing
– Recognition
– Export
– Processed
![Page 40: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/40.jpg)
System Monitoring and Bottleneck Detection
![Page 41: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/41.jpg)
System Monitoring and Bottleneck Detection
• Document processing monitoring via the Administration and Monitoring Console
• Hardware monitoring for each FlexiCapture server component using various Windows Performance Monitor counters
– Memory
– CPU
– Hard Disk
– Network
– IIS
– SQL Server
![Page 42: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/42.jpg)
Setting up Performance Counters
• Monitor FlexiCapture state and search for bottlenecks
– Performance Monitor utility
• Recorded by Processing Server
• mmc /32 perfmon.msc
• Enable Counters
![Page 43: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/43.jpg)
Improving Your Current System
![Page 44: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/44.jpg)
Improving your current System
• Separate your Servers
– App Server
– Processing Server/License Server
– Processing Station
– Database Server
![Page 45: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/45.jpg)
Improving your current System Cont’d
• Database
– More RAM is better (at least 4 GB)
– Fast Drives
– Data and Logs on Separate Drives
– Autogrowth – 100MB Increments
– Simple Recovery Model
– Maintenance Plans
• Backups – Truncate Transaction Log
• Indexes
![Page 46: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/46.jpg)
Improving your current System Cont’d
• Application Server
– IIS Logs
– Caching
– Fast Hard Drive
– App Server Recycling Pool
– Number of Threads
– 2 NICs 1GB/s
Consider Load Balancing
![Page 47: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/47.jpg)
Improving your current System Cont’d
• File Storage
– Disable Search indexing and anti-virus scanning of FileStorage
– Do not store images in SQL database
– 1GB/s access
• Batches
– Purging (2 weeks or less)
– Limit Big batches (Less than 100 pages per batch)
• Input and Output
– Put input directory on Server
– Separate location for Export and Hotfolders
![Page 48: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/48.jpg)
Improving your current System Cont’d
• Networking
– Network Speed
– Switching
– VLANS
– Network Interfaces
![Page 49: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/49.jpg)
Final Summary
• Architecture
– Medium System
• How to Optimize Processing Stations
– RAM
• Improving your current system
– Separate your Servers!
![Page 50: ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16](https://reader031.vdocument.in/reader031/viewer/2022012310/589a2f1a1a28ab051f8b621b/html5/thumbnails/50.jpg)
Questions