whitepaper: alfresco document transformation engine 2 eng... · 2019. 2. 15. · 1.2 scalability...
Whitepaper: Alfresco Document Transformation Engine 2.0
it-novum.com
Table of Contents
Introduction
1. Document Transformation Engine Scalability
1.1 Architecture
1.2 Scalability Factors
1.2.1 Disc I/O bandwidth
1.2.2 CPU resources
1.2.3 Network connection blocking
1.3 Building a scalable Document Transformation Engine infrastructure
1.3.1 Overview
1.3.2 Sizing & Scaling up
1.3.3 Scaling out
1.3.4 Virtualization
2. Document Transformation Engine Benchmarking
2.1 Methodology
2.2 Hardware & Software Environment
2.3 Test Data
2.4 Results
2.4.1 Medium Document Set
2.4.2 Heavy Document Set
2.4.3 Very Heavy Document Set
2.5 Conclusions
it-novum Whitepaper | Alfresco Document Transformation Engine
Introduction
The Alfresco Document Transformation Engine is a stable, fast, and scalable product for high-quality transformations of Microsoft Office documents, PDF and PDF/A files, images and other file formats. It is an enterprise-scale and enterprise-quality alternative to OpenOffice and LibreOffice. In addition to the transformation services, it also provides basic OCR functionality and creates PDF/A-2A documents to fulfill compliance requirements for long-term archiving.
This document presents the results of the Alfresco Document Transformation Engine benchmarks and uses them to analyse its performance and scalability in a standard scenario. The benchmark also compares the performance of Document Transformation Engine 1.5 with Document Transformation Engine 2.0 and OpenOffice/LibreOffice. You can use this document as a reference for practical recommendations and best practices for successful sizing, architecture, and deployment of your Document Transformation Engine solutions.
These results apply to Document Transformation Engine 1.5 (and above) as well as Document Transformation Engine 2.0 (and above), which is the target platform on which the benchmarks were run. Benchmarks are a continuous effort. The results and considerations presented in this document will be updated regularly.
This document is primarily intended for a technical audience, but non-technical readers will also find the benchmark results interesting.
1. Document Transformation Engine Scalability
This chapter introduces the scalability and architecture of the Document Transformation Engine. Although scalability is often defined as a linear concept, it is quite solution-dependent. Some areas of scalability may affect the way the Document Transformation Engine performs and should be considered when designing your solution. The following section provides an examination of some major scalability factors.
1.1 Architecture
The Document Transformation Engine features a scalable open architecture that communicates with Alfresco using an HTTP REST API, which means that you can scale out by adding multiple instances of the server and connecting them through a standard HTTP network load balancer. As seen in the diagram, the Alfresco Document Transformation Engine uses genuine Microsoft Office software to transform Microsoft Word, Excel, and PowerPoint documents into PDF and SWF.
1.2 Scalability factors
The following issues are likely to present scalability bottlenecks.
1.2.1 Disc I/O bandwidth
Microsoft Office transformations are very I/O heavy, so on some systems I/O contention can be a performance bottleneck. As soon as multiple Word conversions happen in parallel, performance can suffer heavily from poor random read and write speeds.
Initially we were using an Amazon EC2 c3.2xlarge instance. I/O metrics were as follows:
• seq. read speed: 131 MB/s
• seq. write speed: 83 MB/s
• random QD32 read speed: 10.4 MB/s
• random QD32 write speed: 3.8 MB/s
For our tests of Document Transformation Engine 2.0 we used a Microsoft Azure “Standard DS4 v2” machine (8 cores, 28 GB RAM).
Our tests on this environment clearly indicated that I/O was the performance bottleneck, not the Document Transformation Engine software. Effectively we were measuring disk speed, not transformation performance. Switching to an SSD instance resulted in doc files transforming 1.2 times faster and docx files 1.7 times faster.
1.2.2 CPU resources
Most transformations are very CPU intensive. So, depending on which documents are to be transformed, 4 to 8 physical cores should be present, with a dedicated thread for each core. More than 8 threads will most likely not scale well. Hyper-Threading does not increase performance.
Make sure to configure the number of transformation threads with respect to your actual hardware:
• If you have 8 physical cores, the product will detect this and set the number of concurrent transformation threads to 8.
• This value can be overridden by setting “ratelimiter.loadFactor” to the desired number. This should be handled with care: it is only necessary if the number of physical cores cannot be detected automatically (this can happen in some virtual environments). The detected number is visible in the Web Console under “Active n/max”.
• Setting a lower value will not use the full capacity of your system.
• Setting a higher value will cause context switching on the CPU and degrade performance.
• Using more than 8 cores is possible, but inevitably your disk I/O subsystem will become a bottleneck at some point. If you have plenty of IOPS, you can use more cores.
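The sizing guidance in the list above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not the product's detection logic; the function names and the `cap` and `threads_per_core` parameters are hypothetical.

```python
def recommended_threads(physical_cores: int, cap: int = 8) -> int:
    """One transformation thread per physical core, capped at `cap`.

    The default cap of 8 mirrors the guidance that more than 8 threads
    usually do not scale well unless the disk subsystem has spare IOPS.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    return min(physical_cores, cap)


def physical_cores_from_logical(logical_cpus: int, threads_per_core: int = 2) -> int:
    """Estimate physical cores when Hyper-Threading inflates the logical count."""
    return max(1, logical_cpus // threads_per_core)
```

For a VM that reports 16 logical CPUs with Hyper-Threading enabled, this sketch suggests 8 threads. Raising `cap` only makes sense when disk I/O is known not to be the bottleneck.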
In Document Transformation Engine 2.0 the rate limiter has been extended with a new queueing mechanism. If the number of requested transformations is higher than the number of CPUs, the items to be transformed are queued up; in Document Transformation Engine 1.5 they were rejected. After the currently running transformations are processed, new elements are taken from the queue and transformed as well. With this mechanism, fewer items are rejected. By default, the queue is limited to 50 items as a safety mechanism for the system. If the maximum queue size is reached, new items will be rejected. The queue size can be raised or lowered depending on your machine and business needs.
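The run, queue or reject decision described above can be modelled with a small Python class. This is a toy model under stated assumptions, not the engine's implementation: the class and method names are hypothetical, and timeouts are left out.

```python
from collections import deque


class AdmissionQueue:
    """Toy model of the 2.0 rate limiter: run, queue, or reject a request."""

    def __init__(self, max_active: int, wait_limit: int = 50):
        self.max_active = max_active   # concurrent transformation threads
        self.wait_limit = wait_limit   # maximum number of waiting jobs
        self.active = 0
        self.waiting = deque()

    def submit(self, job: str) -> str:
        if self.active < self.max_active:
            self.active += 1
            return "running"
        if len(self.waiting) < self.wait_limit:
            self.waiting.append(job)
            return "queued"
        return "rejected"              # queue is full: fail early

    def finish_one(self):
        """A running job completed; promote the oldest waiting job, if any."""
        if self.active == 0:
            raise RuntimeError("no running job to finish")
        if self.waiting:
            return self.waiting.popleft()  # the freed slot is reused
        self.active -= 1
        return None
```

Setting `wait_limit=0` in this model gives the 1.5-style behaviour, where every request beyond the running ones is rejected.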
Additionally, a timeout mechanism is set to 300 seconds. If the transformation of an item takes longer than 300 seconds, it will be rejected as well. The timeout and the queue depth can be configured in the following properties:
• Timeout: transformer.timeout.default
• Queue depth: ratelimiter.wait.limit
Both properties can be set in the default-configuration.properties file.
# Maximum number of jobs that can wait for transformation. Any additional requests will fail early.
ratelimiter.wait.limit = 50
# transformer timeout in seconds
transformer.timeout.default = 300
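To check such settings before deploying them, the two values can be read back with a few lines of Python. This is a minimal sketch that handles only plain `key = value` lines and `#` comments; it is not a full Java properties parser (no escapes or line continuations), and the helper name `parse_properties` is hypothetical.

```python
def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines, ignoring blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


SAMPLE = """
# Maximum number of jobs that can wait for transformation.
ratelimiter.wait.limit = 50
# transformer timeout in seconds
transformer.timeout.default = 300
"""

settings = parse_properties(SAMPLE)
wait_limit = int(settings["ratelimiter.wait.limit"])
timeout_s = int(settings["transformer.timeout.default"])
```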
Please note that PowerPoint is not “re-entrant”, meaning that the Document Transformation Engine can’t execute multiple PowerPoint transformations in parallel, regardless of the max threads configuration. This is a limitation of the Microsoft product.
1.2.3 Network connection blocking
The Microsoft Office products automatically connect to a Microsoft server when the application is started. This causes a delay of a couple of seconds, which has a serious impact on a server product. A couple of configuration settings prevent this behaviour:
• The .NET root certificate update should be prevented.
• The firewall should be configured to block MS Office from accessing the Internet and wasting valuable time. Here is how:
1. Make sure Windows Firewall is on and blocks incoming connections from applications that are not on the list of allowed apps
2. Go to the list of allowed apps
3. Make sure to remove all Microsoft Office applications from the list
1.3 Building a scalable Document Transformation Engine infrastructure
1.3.1 Overview
This chapter gives you several recommendations for the sizing and network configuration of your Document Transformation Engine system landscape.
1.3.2 Sizing & Scaling up
The Document Transformation Engine doesn’t require a lot of hardware to run: a dual-core CPU with 6 GB of RAM is enough for acceptable performance.
The recommended sizing for the Document Transformation Engine is:
• 4 to 8 physical CPU cores (no hyper-threading) of a modern Intel CPU, e.g. 2.4 GHz Intel Xeon E5 family
• 8 GB RAM
Scaling up by adding more cores and RAM is possible, but usually disk I/O becomes the bottleneck in this case.
1.3.3 Scaling out
The product uses stateless HTTP(S) REST connections between the Alfresco repository and the Document Transformation Engine. Therefore, scaling out is extremely easy:
• Set up 2 or more identical Document Transformation Engine instances (e.g. multiple VMs)
• Connect the Document Transformation Engines to an HTTP load balancer
• Configure the load balancer IP or domain in your Document Transformation Engine URL settings on the Alfresco side.
Adding more Document Transformation Engine instances will improve throughput and scalability in a linear way.
One enterprise customer runs a 4-node Document Transformation Engine setup that handles the transformation needs of about 45,000 daily users (around 300K transformations per day).
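A rough capacity estimate for a scale-out setup can be sketched in Python as follows. It assumes the linear scaling stated above and a load spread evenly over the active hours, which real workloads rarely are, so treat the result as a lower bound. The function names are hypothetical.

```python
import math


def per_node_rate(daily_transformations: int, nodes: int,
                  active_hours: float = 24.0) -> float:
    """Average transformations per minute that each node must sustain."""
    return daily_transformations / nodes / (active_hours * 60)


def nodes_needed(daily_transformations: int, rate_per_node_per_min: float,
                 active_hours: float = 24.0) -> int:
    """Smallest node count whose combined throughput covers the daily load."""
    capacity_per_node = rate_per_node_per_min * active_hours * 60
    return math.ceil(daily_transformations / capacity_per_node)
```

For the customer example, `per_node_rate(300_000, 4)` is about 52 transformations per minute per node. With the 59.7 transformations per minute that section 2.4 reports for native mode 2.0 on the medium set, `nodes_needed(300_000, 59.7)` returns 4. If the same load arrives within an 8-hour working day, it returns 11.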
1.3.4 Virtualization
The Document Transformation Engine works very well in a virtual machine environment. In fact, more than 90% of all production instances are running on VMs. Always make sure you have reserved enough disk I/O bandwidth for the Document Transformation Engine VMs to ensure best performance.
2. Document Transformation Engine Benchmarking
This chapter introduces the primary goals, methodology, setup and results of the benchmarking test. The goals of the test have been defined as follows:
• Examine the stability of the Document Transformation Engine and prove it can run 24 hours without errors or decrease in performance.
• Objectively compare the performance of the Document Transformation Engine working in native mode and using Microsoft Office versus working with the alternative JODConverter transformer using OpenOffice.
• Examine how performance scales when the size and complexity of the transformed documents increase drastically.
• Prove that Document Transformation Engine 2.0 is even faster than Document Transformation Engine 1.5.
• Provide a baseline for future benchmarking efforts.
2.1 Methodology
• The tests were performed with a benchmark tool using JUnit, Dropwizard Metrics, Graphite and Grafana. It is now a part of the Document Transformation Engine project.
• The focus of the benchmarking was stability, so all tests ran for 24 hours without interruptions.
• During that time the tool was taking random files from a predefined set of representative documents and transforming them. This caused minimal randomness in the results because the transformations ran into the thousands.
• The output file format was PDF.
• The tests were performed using 3 different test data sets: medium, heavy and very heavy files.
• For Document Transformation Engine 2.0 only two of the test data sets were used: medium and heavy files.
• The tests were repeated using native mode 1.5, native mode 2.0 and the JODConverter transformer.
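The core loop of such a benchmark can be sketched in Python as follows. The actual tool is built on JUnit and Dropwizard Metrics, so this is only an illustration of the method. The `transform` callable, the function name and the injectable `clock` are assumptions introduced here.

```python
import random
import time


def run_benchmark(files, transform, duration_s, seed=0, clock=time.monotonic):
    """Transform randomly chosen files until duration_s has elapsed."""
    rng = random.Random(seed)          # fixed seed keeps runs comparable
    ok = failed = 0
    start = clock()
    while clock() - start < duration_s:
        path = rng.choice(files)
        try:
            transform(path)            # hypothetical call into the engine
            ok += 1
        except Exception:
            failed += 1
    minutes = duration_s / 60
    return {"ok": ok, "failed": failed, "per_minute": ok / minutes}
```

A 24-hour run corresponds to `duration_s=24 * 3600`. Passing a fake `clock` makes the loop testable without waiting.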
2.2 Hardware & Software Environment
The tests of Document Transformation Engine 1.5 versus JODConverter ran on an Amazon EC2 instance. The initial reasoning for this decision was to allow the customer the freedom to easily reproduce the benchmark setup.
The instance type was i2.xlarge (4 vCPUs and 30.5 GiB RAM). The environment also had the following software installed:
• Windows 2012 R2
• MS Office 2013 x86
• OpenOffice 4.1
• Java 8
The tests of Document Transformation Engine 2.0 ran on Microsoft Azure Standard DS4 v2 (8 cores, 28 GiB RAM), which means the hardware was nearly the same as in the 1.5 tests. The installed software was identical.
2.3 Test Data
The test data set we used for Document Transformation Engine 1.5 versus JODConverter was divided into 3 categories:
• Medium ▶ 86 files in total, all are smaller than 1 MB, average size is 350 KB
• Heavy ▶ 55 files in total, all are smaller than 5 MB, average size is 1.9 MB
• Very Heavy ▶ 6 files in total, all of them are nearly 90 MB each
For Document Transformation Engine 2.0 we used only 2 categories: Medium and Heavy.
The files in each category are equally distributed among the following file formats: xls, xlsx, ppt, pptx, doc and docx. All of them originated in their respective formats; no conversion or other manipulations were performed prior to testing.
NOTE: the medium set contains a higher percentage of ppt and pptx files compared to the heavy set. This fact, combined with the concurrency constraints for PowerPoint conversions, explains the higher throughput rate in the heavy set test results.
The files in the medium and heavy sets were taken from the following sources:
• Official Microsoft Office Templates
• http://www.entrepreneur.com/formnet (license: public domain)
• http://www.formsmax.com/terms-of-use.html (license: public domain)
• http://www.office.xerox.com/small-business-templates/enus.html (license: creative commons)
• https://www.business.gov.au/info/plan-and-start/templates-and-tools (license: creative commons)
The very heavy (“extreme”) set is composed exclusively of files with imported high-definition images. The images have a size of 1920x1080 pixels. Larger images did not produce larger files due to the way Microsoft Office optimizes image storage.
The images were taken from http://www.hdwallpapers.in/ and are property of their respective owners. All files from the 3 categories are available for download upon request.
Note: these test data represent a generic, broad use of the various transformation capabilities of the Document Transformation Engine. If your use case targets specific transformations only (e.g. PDF --> PDF/A-2A), you will have different results. In this case, please contact your Alfresco representative.
2.4 Results
The results contain performance and stability metrics as well as graphs that show the work distribution in time. Benchmarking can produce a huge amount of data. Only the relevant metrics are provided to avoid confusion and noise. Further metrics can be provided if requested.
While performing the tests, we noticed that the JODConverter transformer exhibits wildly varying results without changing the test conditions. Further examinations revealed its serious reliability problems: threads hung and refused to continue working seemingly at random. We ran 4 threads per test, causing huge randomness in the results. The goal of the exercise was to provide an objective comparison. To reach this goal we performed each test run 3 times for the JODConverter in order to minimize randomness. The results in the following paragraphs are the average of those test runs.
2.4.1 Medium Document Set
PERFORMANCE | Native Mode 2.0 | Native Mode 1.5 | JODConverter
Successful Transformations: | 86 500 | 66 114 | 32 085
Average transformations per minute (time is 24h): | 59.7 | 45.91 | 22.28
STABILITY | Native Mode 2.0 | Native Mode 1.5 | JODConverter
Failed Transformations: | 0.00 | 0.00 | 956
Percentage fails to all attempts: | 0.00 | 0.00 | 2.89
Productive hours per core: | 24.00 | 24.00 | 4.59
Throwaway hours per core: | 0.00 | 0.00 | 19.33
Percentage throwaway hours to all hours: | 0.00 | 0.00 | 80.81
2.4.2 Heavy Document Set
PERFORMANCE | Native Mode 2.0 | Native Mode 1.5 | JODConverter
Successful Transformations: | 101 800 | 21 489 | 4 371
Average transformations per minute (time is 24h): | 70.00 | 14.92 | 3.04
STABILITY | Native Mode 2.0 | Native Mode 1.5 | JODConverter
Failed Transformations: | 0.00 | 0.00 | 1 016
Percentage fails to all attempts: | 0.00 | 0.00 | 18.85
Productive hours per core: | 24.00 | 24.00 | 3.87
Throwaway hours per core: | 0.00 | 0.00 | 20.08
Percentage throwaway hours to all hours: | 0.00 | 0.00 | 83.83
2.4.3 Very Heavy Document Set
PERFORMANCE | Native Mode 1.5 | JODConverter
Successful Transformations: | 14 122 | 14
Average transformations per minute (time is 24h): | 9.81 | 0.01
MB successfully transformed: | 1 147 130.89 | 857.14
MB successfully transformed per minute (time is 24h): | 796.62 | 0.60
STABILITY | Native Mode 1.5 | JODConverter
Failed Transformations: | 0.00 | 1 155
Percentage fails to all attempts: | 0.00 | 98.80
Productive hours per core: | 24.00 | 0.12
Throwaway hours per core: | 0.00 | 23.82
Percentage throwaway hours to all hours: | 0.00 | 99.49
2.5 Conclusions
The results are consistent for all test data sets. The native mode of the Document Transformation Engine 1.5 as well as 2.0 has:
• a 100% transformation success rate, with no failed transformations at all
• performance that does not degrade over time
The behaviour of the JODConverter is just the opposite:
• numerous failed transformations
• a lot of the CPU time is wasted:
• Medium ▶ 80% of time is wasted
• Heavy ▶ 84% of time is wasted
• Very Heavy ▶ 99.5% of time is wasted
As processes hang up over time, this problem deepens. During the last hours of all tests, the JODConverter executed no successful transformations.
Please note: these failures are relatively evenly distributed among files. We don’t have a single file that consistently failed to transform. In the case of the very heavy set the JODConverter did just a few doc and docx transformations before giving up completely.
Stability of the Document Transformation Engine in native mode directly influences performance. It did drastically more work than the JODConverter in all test data sets:
• Medium ▶ 2.7 times more work done in native mode
• Heavy ▶ 23.3 times more work done in native mode
• Very Heavy ▶ 1000 times more work done in native mode
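These ratios, like the per-minute and failure figures in the result tables, follow from simple arithmetic on the transformation counts. The Python sketch below recomputes them from the reported counts. The helper names are illustrative only and are not part of the benchmark tool.

```python
MINUTES_PER_DAY = 24 * 60  # every test ran for 24 hours


def per_minute(successful: int) -> float:
    """Average successful transformations per minute over a 24h run."""
    return successful / MINUTES_PER_DAY


def failure_pct(successful: int, failed: int) -> float:
    """Failed transformations as a percentage of all attempts."""
    return 100 * failed / (successful + failed)


def speedup(native_ok: int, jod_ok: int) -> float:
    """How many times more work native mode completed than the JODConverter."""
    return native_ok / jod_ok
```

For the medium set, `speedup(86_500, 32_085)` is roughly 2.7, and for the heavy set `speedup(101_800, 4_371)` is roughly 23.3. The very heavy set compares native mode 1.5 with the JODConverter (14 122 versus 14 successful transformations), which gives about 1009 and is rounded to 1000 in the list above.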
The Conclusion:
The Document Transformation Engine has a transformation success rate of 100%. The pixel accuracy is nearly perfect and the performance meets the highest business standards of your company.