BJR © 2015 The Authors. Published by the British Institute of Radiology

Received: 9 July 2014. Revised: 26 November 2014. Accepted: 10 December 2014. doi: 10.1259/bjr.20140482

Cite this article as: Wolstenhulme S, Davies AG, Keeble C, Moore S, Evans JA. Agreement between objective and subjective assessment of image quality in ultrasound abdominal aortic aneurism screening. Br J Radiol 2015;88:20140482.

FULL PAPER

Agreement between objective and subjective assessment of image quality in ultrasound abdominal aortic aneurism screening

1 S WOLSTENHULME, DCR, MHSc, 2 A G DAVIES, BSc, MSc, 3 C KEEBLE, BSc, MSc, 4 S MOORE, HND, MSc and 2 J A EVANS, PhD, FIPEM

1 School of Healthcare, University of Leeds, Leeds, UK
2 Division of Medical Physics, University of Leeds, Leeds, UK
3 Division of Epidemiology and Biostatistics, University of Leeds, Leeds, UK
4 Department of Medical Physics, Leeds Teaching Hospitals, Leeds, UK

Address correspondence to: Mr Andrew Graham Davies
E-mail: [email protected]

Objective: To investigate agreement between objective and subjective assessment of image quality of ultrasound scanners used for abdominal aortic aneurysm (AAA) screening.

Methods: Nine ultrasound scanners were used to acquire longitudinal and transverse images of the abdominal aorta. 100 images were acquired per scanner, from which 5 longitudinal and 5 transverse images were randomly selected. 33 practitioners scored 90 images, blinded to the scanner type and subject characteristics, and were required to state whether or not the images were of adequate diagnostic quality. Odds ratios were used to rank the subjective image quality of the scanners. For objective testing, three standard test objects were used to assess penetration and resolution and to rank the scanners.

Results: The subjective diagnostic image quality was ten times greater for the highest ranked scanner than for the lowest ranked scanner.
It was greater at depths of <5.0 cm (odds ratio, 6.69; 95% confidence interval, 3.56, 12.57) than at depths of 15.1–20.0 cm. There was a larger range of odds ratios for transverse images than for longitudinal images. No relationship was seen between subjective scanner rankings and test object scores.

Conclusion: Large variation was seen in the image quality when evaluated both subjectively and objectively. Objective scores did not predict subjective scanner rankings. Further work is needed to investigate the utility of both subjective and objective image quality measurements.

Advances in knowledge: Ratings of clinical image quality and image quality measured using test objects did not agree, even in the limited scenario of AAA screening.

The quality of images produced by a medical imaging device is an important consideration when gauging its suitability for a specific clinical task; it is essential that the system produces images that are of sufficient fidelity for the clinical user. As such, image quality will form an important consideration in the selection of equipment and in the ongoing quality assurance procedures following installation.

The assessment of medical image quality can be performed in a number of ways, both subjectively (for example, using visual grading 1,2) and objectively using test phantoms specifically designed for that purpose. 3,4 Even for a specific imaging modality such as ultrasound, the level of agreement between these methods has not been thoroughly investigated, although there is some evidence of poor agreement between quality scores from test objects and the ratings of clinical users when asked to assess clinical images from the same scanner. 5

The need to provide more objective image quality assessment is highlighted when there are national programmes requiring common standards.
The breast cancer, foetal abnormalities and abdominal aortic aneurysm (AAA) detection programmes are good examples requiring ultrasound imaging of a uniform quality. It is critical that there is good agreement between clinical users as to what constitutes an acceptable image for these purposes. This will form the basis of a gold standard of performance against which the utility of any objective testing can be evaluated.

In this study, we have used the ultrasound-based aortic aneurysm screening programme as an exemplar. In the UK,


the National Abdominal Aortic Aneurysm Screening Programme (NAAASP) was implemented in 2013. 6 This programme is primarily community based, necessitating the use of portable ultrasound scanners to allow transportation to screening centres. Measurements of the anteroposterior (A-P) inner to inner (ITI) abdominal aortic diameter in longitudinal section (LS) and transverse section (TS) planes are taken.

The quality of images depends upon the skill of the practitioner, the habitus of the patient and the performance of the scanner. Together, these may influence the reliability and accuracy of measurements. 7,8 Small errors in measurements may impact on clinical decision making, for example, resulting in inappropriate enrolment into the surveillance programme, at the 30-mm threshold, or delayed referral for a vascular surgical opinion, at the 55-mm threshold.

Selection of the ultrasound scanner to carry out national screening is the responsibility of the service provider, although in the UK, some guidance on specification is available from the National Screening Committee. It is less clear what method providers should use to make their choice of scanner and whether this choice has any impact on the diagnostic image adequacy and the service provided. When faced with similar procurement decisions, providers have invited competing manufacturers to supply equipment for evaluation over a short time. The service providers commonly use subjective assessment of the image quality to make a decision, while recognizing that, in a small sample, differences between subjects, e.g. body habitus, may affect differences between scanners. 5,9 An alternative approach is to use one or more test objects to objectively assess image adequacy, thus removing intersubject variation.
Such objective measures also have the potential advantages that they are quick to perform, can be reproduced exactly at different centres and ought to be less affected by the subjective opinion of the operator. A variety of test objects have been described for the evaluation of ultrasound image quality, and each of these can be used to measure a range of different parameters. 4 However, there is a paucity of evidence as to how results from such tests relate to subjective assessment. We are not aware of any specific advice or publication aimed at evaluating portable AAA scanners.

The aim of this study was to investigate the level of agreement between the subjective assessment of aortic images from portable ultrasound scanners and objective assessments obtained using test objects. If the agreement is good, then the implication is that test objects could be used with confidence in the assessment of image quality, both for purposes of scanner selection and in monitoring ongoing performance. If the agreement is poor, then either the use of test objects as objective evaluators of performance should be seriously questioned or the assumption that clinical subjective performance is useful is called into question.

METHODS AND MATERIALS
This was a prospective study in which selected ultrasound scanners were used by the same operator in a routine screening environment, with later viewing by blinded observers.

Equipment
The following ultrasound scanners, nominated by their manufacturer as being suitable for aortic aneurysm screening, were made available for evaluation:

CX50 (Philips Healthcare, Bothell, WA)
LOGIQ book XP and LOGIQ e (GE Healthcare, Chalfont St Giles, UK)
Micromax, M-Turbo and Nanomax (SonoSite Inc., Bothell, WA)
SIUI CTS-900 (MIS Healthcare, London, UK)
Viamo (Toshiba Medical Systems, Tochigi, Japan)
z-One (Zonare Medical Systems Inc., Mountain View, CA)

These scanners are referred to, in no particular order, as scanners A–J. The rotation of the scanners through one local screening programme of the NAAASP was arranged by the Purchase and Supply Agency in negotiation with the manufacturers.
Each scanner was evaluated for 1 week within the local screening programme and was taken to at least two general practitioner practices. The transducers used were curvilinear arrays recommended by the scanner manufacturer for this application. For each scanner, the same transducer was used for both clinical image acquisition and objective testing.

Subjective evaluation of image quality
Acquisition of images
On the first day of each week, one screening technician and the scanner manufacturer's clinical application specialist worked together to achieve familiarization with the portable ultrasound scanner. The screening technician, with 5 years' post-certification experience of carrying out abdominal aorta ultrasound examinations, acquired all images for aortic diameter assessment. For each examination, the screening technician varied the operator's scanning position (sitting/standing) and the degree of tilt of the monitor. This variation depended on the height of both the examination couch and the scanner's monitor. The room lighting was dimmed when carrying out the examination. Scanner controls such as gain, compound and tissue harmonic imaging and depth of field were changed, as required, to obtain the perceived optimal ultrasound image. Each patient was examined using only one scanner. For each patient, four images of the abdominal aorta were acquired: one LS image and one TS image with measurements of the ITI diameter for NAAASP, and one LS image and one TS image without callipers. These images were stored in digital imaging and communications in medicine (DICOM) format on the scanner's hard drive and transferred to a secure hospital information technology server.

The subject's informed consent to have an ultrasound examination was obtained as per NAAASP Standard Operating Procedures. 6 Ethical approval was not required, as the images were routinely acquired and anonymized, and the practitioners who rated the images in the study were National Health Service employees.

The DICOM images without callipers were exported, without any image adjustment or enhancement, to a computer.
They were then cropped to remove subject name, hospital and ultrasound scanner manufacturer identity, but retained the vertical measurement scale data. A unique identification number was added to each image. The anonymized images allowed blinded scanner ranking. At the end of the clinical data collection phase, 900 anonymized images were stored in a database.

Image selection and scoring
Five LS and five TS images were randomly selected from each scanner, subject to the constraint that each LS and TS image set contained an image of an aorta with an A-P diameter subjectively >40 mm. This was to ensure that each set contained one aneurysmal aorta. 90 images (45 LS and 45 TS) were used for analysis; 90 images permitted each observer to complete the study in a realistic time scale. The reason for choosing the same 90 images, rather than providing a random set of 90 images from the 900 total, was to enable analysis of the same images to determine the variation in the scores. The ultrasound scanners' control settings likely to affect image quality (depth of field, compound imaging and tissue harmonic imaging) that were used for the 90 images were recorded. Readers unfamiliar with these ultrasound control settings are referred elsewhere. 10

33 practitioners completed a demographics questionnaire and undertook scoring of images using a web-based tool. The practitioners were from radiology or vascular departments in the UK and the six NAAASP early implementer sites. Each practitioner was given a unique identifier. The demographics requested were the practitioner's profession and level of experience (number of years they have been in their profession). The practitioners included a variety of professions: medical physicists (1), screening technicians (1), radiologists (1), ultrasound practitioners (12), vascular surgeons (3) and vascular technologists (15). Their mean (range) level of experience was 11.2 years (1–30 years).

All 33 observers were blinded to the scanner type and subject characteristics.
To achieve this, the alphanumeric text and logos were removed from the images prior to viewing. Since the operator acquiring the images was not involved in the image viewing, all of the observers were blinded to any patient data.

The web-based tool allowed the observers to view the 90 images in 1 session or to pause the session and complete it in stages at their own pace. This was performed on their own personal computer, accessing the custom-written web-based survey software. The observers were advised to score in dimmed lighting. At the beginning of each scoring session, the observers were presented with a challenge-response test to confirm that the monitor and viewing conditions offered sufficient viewing quality to make meaningful judgments for the study. The test involved reading low-contrast letters against differing background intensities. 11

The observers viewed one image at a time and were required to answer yes/no to the question: "Is this image of adequate diagnostic quality?" Each observer viewed the images in a random and different order. Images were resized for display purposes using a bilinear interpolation, such that all images were displayed at the same size. No images were minified (i.e. had their resolution reduced).

Objective evaluation of scanner performance
In the absence of clear guidelines for the objective evaluation of this type of scanner, a judgment was needed to decide which parameter(s) to evaluate. Given that the aorta is a relatively large organ, it was deemed unlikely that imaging it normally would be a challenge for any modern scanner. Consequently, traditional spatial resolution assessment was not carried out. However, the ability of the system to image the aorta at depth in large patients was regarded as critical, and therefore penetration-type measurements using tissue-equivalent test objects were adopted. Three such test objects were selected and used on all scanners.
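The display-side resizing mentioned above can be sketched in a few lines. This is an illustrative pure-Python bilinear interpolation over a 2-D greyscale array, not the study's actual web tool; a production system would use an image library's resampling routines.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D greyscale image (list of lists of floats) using
    bilinear interpolation; used here only to enlarge for display."""
    in_h, in_w = len(img), len(img[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Map the output pixel back into input coordinates
            y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            dy, dx = y - y0, x - x0
            # Weighted average of the four surrounding input pixels
            out[i][j] = (img[y0][x0] * (1 - dy) * (1 - dx)
                         + img[y0][x1] * (1 - dy) * dx
                         + img[y1][x0] * dy * (1 - dx)
                         + img[y1][x1] * dy * dx)
    return out
```

Because enlargement only interpolates between existing samples, no pixel data are discarded, which is why upscaling (unlike minification) does not reduce the resolution of the displayed image.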
The scanners were delivered in turn to the Medical Physics Department of the Leeds Teaching Hospitals Trust, Leeds, UK, and all measurements were undertaken by the same operator, experienced in ultrasound quality assurance (QA). The screen used in each case was that supplied with the scanner. It was not possible to blind the operator to the scanner's identity, but this was regarded as unimportant owing to the objective nature of the test. In each case, the preset recommended for AAA scanning by the manufacturer was selected, with tissue harmonic imaging turned off. The gain was set to maximum, unless that led to saturation, and the time gain compensation was adjusted to give a speckle display at the greatest possible depth.

The Cardiff resolution test object (RTO) is a rather old device that has been used extensively by many workers. Its primary purpose is to assess spatial resolution, but in our case, we used only sections that were free of resolution targets. The penetration value that was recorded was defined as the depth at which the speckle was judged to change into noise or base dark level.

The Edinburgh pipe test object (EPipe) was kindly supplied by the Department of Medical Physics, Edinburgh Royal Infirmary, Edinburgh, UK. It has a tissue-mimicking background but contains a number of small-diameter pipes that are scanned along their lengths. Two different measurements were made with this object. The penetration [EPipe(pen)] was recorded using a region of the test object that was free of pipes. The second measurement was the maximum depth at which the 6-mm pipe could be seen [EPipe(vis)]. The rationale for this is that the quality of the image is likely to relate to the ability to image a small object at depth.

The Gammex 408LE spherical lesion phantom was used (Gammex-RMI, Nottingham, UK). This device has a number of simulated spherical lesions at a range of depths. It was thought that the ability of the scanner to detect these lesions at depth would be similar to that found with the EPipe(vis) test.
The protocol used was the same as for the penetration measurement. This time, the maximum depth at which spherical lesions could be clearly seen was recorded.

The attenuation in the test objects was 0.86, 0.50 and 0.70 dB cm⁻¹ MHz⁻¹ for the RTO, EPipe and Gammex test objects, respectively.

Scanners were then ranked by object visibility (in millimetres), and the rankings were compared with the subjective scanner rankings using Spearman's rank correlation coefficient.

Statistical analysis
Summary statistics and logistic regression were used to generate odds ratios, with 95% confidence intervals (CIs), to rank the scanners in order of their odds of producing an image of diagnostic adequacy compared with the lowest ranked scanner, that is, how many times more likely an adequate diagnostic image would be from a given scanner compared with the least successful scanner. Three logistic regression models were used: one with LS images, one with TS images and one with all images. Analysis was carried out using Microsoft Excel (Microsoft, Redmond, WA) and the statistical software R. 12 The independent variables included in the logistic regression were the nine scanner types; the 33 practitioners; the depth categorized into four ranges (<5.0, 5.1–10.0, 10.1–15.0, 15.1–20.0 cm); compound imaging (on/off); and tissue harmonic imaging (on/off).

RESULTS
Scanner control settings
The scanner settings used and the depths at which the aortas were located are summarized in Table 1. The median depth of field was 10 cm (range, <5–20 cm), with the majority of images being obtained with the aorta at a depth in the 10 to 15-cm range. Eight of the nine scanners had compound imaging available, and it was used at least once in seven (77.8%). The use of compound imaging in these seven scanners ranged from 20% (scanner D) to 100% (scanners A and B).
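The odds-ratio ranking described under Statistical analysis can be illustrated with a deliberately simplified sketch: an unadjusted odds ratio for one scanner against the reference scanner, with a Wald 95% CI on the log-odds scale. The counts below are hypothetical; the study's actual estimates came from logistic regression models in R that also adjusted for observer, depth and control settings.

```python
import math

def odds_ratio_ci(adequate, total, adequate_ref, total_ref, z=1.96):
    """Unadjusted odds ratio of a scanner producing an adequate image
    versus the reference (lowest ranked) scanner, with a Wald 95% CI."""
    a, b = adequate, total - adequate                # scanner: adequate / not
    c, d = adequate_ref, total_ref - adequate_ref    # reference: adequate / not
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)    # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, (lo, hi)

# Hypothetical counts: 300/330 images adequate vs 150/330 for the reference
or_, (lo, hi) = odds_ratio_ci(300, 330, 150, 330)
print(f"OR = {or_:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

An odds ratio of 1.00 (as for the reference scanner J in Table 2) means no difference in the odds of adequacy; a CI excluding 1 indicates a statistically significant difference.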
For five scanners (55.6%), tissue harmonic imaging was selected at least once, with usage ranging from 20% (scanner D) to 100% (scanners C, E and H).

Subjective assessment
Overall, 70.9% of images were ranked as adequate. The ordering of scanner types, overall and for LS and TS separately, when ranked using the odds of producing an image of diagnostic adequacy compared with the least successful scanner, is shown in Table 2 and Figure 1. The combined LS and TS scores show that the highest ranked scanner (A) was 10.71 (95% CI, 6.48, 17.69) times more likely to produce an image of diagnostic adequacy than the least successful scanner (J). Less variation was shown when rating LS images (greatest odds ratio, 5.14) as adequate compared with the TS images (greatest odds ratio, 34.28). Two images from the study, both from the TS set, are shown in Figure 2. Neither image contains an aneurismal aorta.

The images where compound imaging was used had statistically significant lower odds of producing an adequate image (0.38; 95% CI, 0.27, 0.53), whereas those using tissue harmonic imaging had higher odds of producing an adequate image (1.77; 95% CI, 1.00, 3.11). These odds ratios were calculated allowing for the scanner type, observer and depth. The relationship between the depth of field and the odds (with 95% CI) of scoring an abdominal aorta ultrasound image as adequate, for the nine portable ultrasound scanners rated by all observers, is shown in Table 3. For all 90 images, as the depth of field increased, the odds of producing an image of diagnostic adequacy decreased.

Objective assessment
A summary of the test object measurements is shown in Table 4 and summarized in Figure 3. This shows variation in the measurements when using different test objects. Little agreement was seen between the order of the overall subjective ranking of the scanners and the objective test object rankings (Table 5). Spearman's rank correlation coefficient, r, was 0.00, 0.27, 0.10 and −0.27 between the combined subjective rank and the RTO, EPipe(pen), EPipe(vis) and Gammex test objects, respectively, indicating no strong correlations.
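The rank comparison above uses Spearman's coefficient, which for two rankings without ties reduces to the familiar sum-of-squared-rank-differences formula. A minimal sketch:

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for two equal-length rankings
    without ties: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Identical rankings give rho = 1; fully reversed rankings give rho = -1
print(spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))
```

With nine scanners ranked once subjectively and once by a test object, values near 0 (as found here) indicate no monotonic relationship between the two orderings.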
No significant or strong correlations were found when the LS and TS subjective ranks were similarly compared.

DISCUSSION
Our findings show that the observers regarded 70.9% of the images to be of diagnostic adequacy, which is in disagreement with the screening technician, who regarded all abdominal aorta images as optimal for screening purposes when acquired in real time. The screening technician would have considered the subject characteristics and the degree of difficulty in identifying the anatomical relationships and landmarks to measure the A-P abdominal aorta ITI diameter. We can only speculate on the reasons for the observers rating the 90 images differently to the screening technician. The diagnostic image adequacy may have been affected by the observers viewing the images on different computers at different light levels, although the bias associated with this was reduced by undertaking the challenge-response test. 11 The observers may also have been assessing different aspects of the image when scoring the images.

Table 1. Variation in the depth, compound imaging (CoI) and tissue harmonic imaging (THI) control settings used by one screening technician when the nine portable ultrasound scanners were used to examine the longitudinal and transverse sections of the abdominal aorta

Scanner | Depth (cm), median (minimum, maximum) | ≤5 | >5 to ≤10 | >10 to ≤15 | >15 to ≤20 | CoI on | THI on
A | 11 (6.6, 13) | 0 | 2 | 8 | 0 | 10 | 0
B | 7.8 (1, 13) | 1 | 5 | 4 | 0 | 10 | 0
C | 9 (5, 18) | 1 | 7 | 1 | 1 | 9 | 10
D | 10 (6, 14) | 0 | 6 | 4 | 0 | 2 | 2
E | 9 (7, 15) | 0 | 7 | 3 | 0 | 7 | 10
F | 12 (8, 19) | 0 | 3 | 4 | 3 | 3 | 0
G | 14 (8, 17) | 0 | 2 | 6 | 2 | 0 | 9
H | 13 (7, 20) | 0 | 3 | 3 | 4 | 8 | 10
J | 12 (5, 15) | 1 | 3 | 6 | 0 | 0 | 0

(The four middle columns count images by aorta depth range, in cm.)
The data collection was performed at a time when observers may have been using either NAAASP standard operating procedures to determine diagnostic image adequacy for control settings, anatomical relationships and landmarks to measure the A-P ITI diameter 6 or local guidelines to determine the position of the callipers on the aortic wall. 7,8 It has also been demonstrated that guidelines alone are not sufficient for agreement on what comprises an acceptable image. 13

The strengths of this study included the acquisition of the clinical images by one experienced screening technician, and all objective testing by a single experienced technologist, on all nine scanners, reducing variation in image acquisition. By using a web-based system, we were able to include assessment from a wide range of expert practitioners.

The effect of depth on image adequacy for all 90 images (Table 3) was likely to be owing to increased ultrasound beam divergence and attenuation. This leads to decreased spatial and contrast resolution, which could impact on the identification of anatomical structures and landmarks to measure the A-P abdominal aorta ITI diameter. The analysis of the subjective preference scores controlled for the influence of depth on a scanner's subjective image quality performance. When comparing the subjective quality of ultrasound scanners for AAA applications, it is important that a suitable range of depths is included in the image sets.

The use of compound imaging decreased the odds of diagnostic image adequacy, which is contrary to the predicted use of this control in practice. 4,14 This may be owing to the blurring of lateral borders. The use of tissue harmonic imaging, which might have been predicted to improve contrast resolution, 4,15 resulted in an increase in diagnostic image adequacy, but this was not statistically significant.

The data show that the variation in odds ratio from the lowest scoring scanner (J) to the highest scoring scanner (A) was wider for TS than LS images.
The scanner rankings show the 95% confidence intervals for the LS and TS sections to be wider than for the overall rankings, as their analyses use smaller data sets. This suggests observers may be more confident when determining adequate image quality with LS images. This may be owing to observers only assessing whether they could identify the aorta and determine the landmarks to measure the A-P diameter. When scoring the TS images, the observer was assessing the anatomical relationships of the aorta, such as the inferior vena cava, lumbar spine and bowel, and the landmarks to measure its A-P diameter. This may explain the differences in repeatability and reproducibility between LS and TS abdominal aorta A-P diameter measurements. 7,8,16–19

Table 2. Odds ratios (and 95% confidence intervals) of diagnostic image adequacy ratings

Scanner | Overall | Longitudinal section | Transverse section
A | 10.71 (6.48, 17.69) | 5.14 (2.55, 10.36) | 34.28 (14.81, 79.34)
B | 10.19 (6.14, 16.90) | 3.89 (1.93, 7.86) | 26.16 (11.53, 59.35)
C | 4.30 (2.24, 8.26) | 2.02 (0.81, 5.01) | 3.01 (0.72, 12.54)
D | 3.79 (2.56, 5.60) | 1.43 (0.79, 2.61) | 6.01 (3.48, 10.36)
E | 2.88 (1.53, 5.43) | 1.42 (0.60, 3.37) | 1.91 (0.46, 7.98)
F | 2.55 (1.77, 3.68) | 0.64 (0.37, 1.09) | 9.73 (5.31, 17.82)
G | 2.04 (1.09, 3.83) | 0.89 (0.40, 2.02) | 1.51 (0.32, 7.12)
H | 1.87 (0.98, 3.57) | 1.32 (0.48, 3.68) | 0.48 (12.09, 1.98)
J | 1.00 | 1.00 | 1.00

Figure 1. Subjective image quality scores: odds ratio of each scanner producing an image of acceptable quality.

Objective tests
All of the objective measurements were performed by a single person. This would have ensured consistency, although the testing was not blinded, which may have introduced bias. The operator was highly experienced in ultrasound QA and familiar with a range of equipment of differing manufacturers. Analysis of previous repeated measurements of penetration indicates that a variation of the larger of 2 mm or 5% would be expected for such measurements.

Given that the speed of sound and attenuation are claimed to be the same in all three test objects, it would be predicted that there would be a high level of agreement in the ranking of penetration values obtained in all three tests. This was clearly not the case. One possible explanation is that the relative contribution to the attenuation from scatter may differ between the three test objects. Data on these values were not available. This is important because it is the scattering component that is being measured in the image as a surrogate for real penetration. Furthermore, it is not known whether the scatter from normal tissue is similar to any of the three test objects, although all three test objects claim to mimic liver parenchyma.

An additional problem is that the rank order of the scanners was different for the three test objects. This suggests that some factor other than scatter is involved, since scattering differences alone would have been expected to change the magnitude of penetration values but not the rank order of the scanners. It can be speculated that the discrepancy lies in the different greyscale transfer curves and other image processing algorithms used by different manufacturers.

Comparison of objective and subjective image quality measures
Our selection of objective test to perform was based on the conjecture that penetration would be an important factor in predicting the quality of the clinical image, given the nature of the examination, and spatial resolution would be less critical, since the abdominal aorta is relatively large.
Conversely, the clarity with which the landmarks of the abdominal aorta A-P diameter are displayed is presumably important, and this should be related to the greyscale transfer curve and/or dynamic range of the scanner. Our choice was to measure penetration and the detection of spherical cystic targets with test objects. The variation between our subjective and objective image quality rankings may be owing to the subjective assessment of abdominal ultrasound images with varying depths, compared with the detection and resolution of the small (diameter, 4 mm) spherical cystic targets in the test objects at pre-defined depths. It is unknown whether either the observers' subjective or the test object objective rankings have any bearing on the precision and/or reproducibility of the abdominal aortic diameter measurement or, more importantly, patient outcome.

Limitations
At the patient image acquisition stage, the screening technician had greater familiarity with scanner E than with the other eight scanners. This may have affected the screening technician's confidence in manipulating the scanner control settings to obtain an optimal image. For this reason, and as we neither wished to identify the best and worst scanners nor infringe national procurement confidentiality, we anonymized the scanners in the findings. As the observers were blinded to the scanner on which each image was acquired, we believe the bias was reduced. The general practitioner (GP) examination rooms had different background light levels. In a room with excessive background lighting, the illumination of the screen would be increased. 20 To compensate for this, the screening technician may have increased the gain, in dynamic scanning, to visualize the abdominal anatomy. This could potentially impact on the diagnostic image adequacy of the static image. 21 The portable ultrasound scanners were used without a dedicated scanner stand. This meant that, between the GP practices, the scanners were placed on dressing trolleys of different heights, leading to discrepancies in the height of the scanner's monitor.
The screening technician needed to change scanning positions, the degree of tilt of the monitor and the resulting viewing angle to allow better visualization of the abdominal anatomy. This may have impacted on the diagnostic image adequacy by causing image distortion or anisotropy, leading to a change in contrast resolution.17 The scanners were not used randomly in the different rooms, and this may increase the risk of bias in the scanner rankings. The scanner presets were used as a starting point for both objective and subjective tests, and the operators were free to alter the settings as they felt appropriate. This may have resulted in different settings being used on the two image quality measures. However, this is likely to be the situation when such tests are carried out in hospital environments.

Figure 2. Two clinical transverse images from the subjective image comparison, showing (a) a highly rated image and (b) a poorly rated image.

Table 3. Odds ratios (95% confidence intervals) describing the relationship between depth of field and diagnostic image adequacy

Depth (cm)   Odds ratio
<5.0         6.69 (3.56, 12.57)
5.1-10.0     4.24 (3.03, 5.93)
10.1-15.0    3.13 (2.26, 4.34)
15.1-20.0    1.00 (NA)

It is possible that the images from patients used in this study may not be representative of the nine portable ultrasound scanners on which they were acquired, but the random selection of images should have reduced the bias, as the ten images per scanner are likely to represent a variety of patients. Since sample size calculations are not suitable for categorical predictors and binary outcomes (the scanner type being categorical and the outcome being yes/no), the required sample size cannot be calculated to achieve a target power. We enrolled 33 observers, each analysing the same 90 images, producing 2970 responses in total. We believe this number of observers and the large image dataset is sufficient to draw conclusions.
There were a number of professions and a range of experience among the observers in the subjective study.
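As an aside, the odds ratios in Table 3 were estimated from the full set of observer responses, which is not reproduced here, so the reported values cannot be recomputed; the underlying arithmetic can, however, be sketched. The following function is a minimal illustration, not the study's analysis code: it computes an odds ratio with a Wald 95% confidence interval from a hypothetical 2 x 2 table of adequate/inadequate counts (the counts shown are purely illustrative, not the study's data).

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table:
    a/b = adequate/inadequate counts at the depth of interest,
    c/d = adequate/inadequate counts at the reference depth.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for the Wald interval
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Purely illustrative counts (not the study's data):
or_, lo, hi = odds_ratio_ci(80, 20, 50, 50)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}, {hi:.2f})")
```

The 15.1-20.0 cm band in Table 3 acts as the reference category, which is why its odds ratio is exactly 1.00 with no confidence interval reported.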
It is entirely possible that both of these factors could affect the responses of the observer. More specifically, staff dealing with AAA screening may score differently to other users. Broadly stated profession and experience, as recorded in this study, are not likely to be good predictors of how often an individual routinely uses AAA ultrasound images. Future work to investigate the effect of profession and experience in AAA ultrasound screening on quality score responses would be useful in establishing how prescriptive future study design should be in terms of the background of participating observers.

Implications for image quality assessment of ultrasound scanners
Even in a screening setting, where there is a well-defined clinical task with specific criteria for the features that are required for an adequate image, we failed to select an objective test that could predict the subjective assessment of image quality by users. This does not mean that such objective testing is not useful, however, as such tests are likely to be sensitive to changes in scanner performance over time, and therefore should play a role in quality assurance programmes. Objective tests are also minimally affected by differences in subject variability.
Care must also be taken in drawing the conclusion that any objective test is not useful if it does not predict subjective image quality. It is not established that observer rating of image quality is able to predict diagnostic accuracy. In particular, in the context of AAA screening, it is the diameter measurement that is the purpose of the imaging and not, for instance, the detection of a lesion. For other clinical applications, better agreement between objective and subjective tests may be found.
It is not clear what criteria, if any, should be used when assessing the image quality performance of scanners in this AAA screening context. To answer such a question, it would be necessary to study the effect of scanner selection on patient outcomes, and such a study would be long and expensive.
It is likely that by the time the results of such a study were available, the scanners in the study would no longer be available on the market.

Table 4. Summary of the objective measurements of the nine portable ultrasound scanners

Scanner   Resolution test object (mm)   EPipe(pen) (mm)   EPipe(vis) (mm)   Gammex (mm)
A         130                           190               140               52
B         145                           180               110               45
C         115                           155               117               42
D         140                           200               145               36
E         135                           180               133               52
F         155                           200               146               50
G         135                           170               129               76
H         125                           180               120               40
J         135                           158               115               61

EPipe(pen), Edinburgh pipe test object (penetration); EPipe(vis), Edinburgh pipe test object (visibility). Gammex; Gammex-RMI, Nottingham, UK.

Figure 3. Objective image quality scores: test object measurements for each scanner. EPipe(pen), Edinburgh pipe test object (penetration); EPipe(vis), Edinburgh pipe test object (visibility); RTO, resolution test object. Gammex; Gammex-RMI, Nottingham, UK.

In lieu of such a study, we would encourage the development of task-specific test phantoms for image quality assessment, especially for common tasks such as those in screening programmes such as NAAASP. Given a phantom with anthropomorphic characteristics, it might be possible to combine the observer task of aortic diameter measurement with a subjective opinion on quality. For subjective ratings on clinical images, image selection should contain a number of challenging cases, with the aorta at greater depths within the patient. Care must be taken with the viewing conditions. It is unlikely to be practical for all observers to view all of the images on the scanner's own monitor; therefore, the viewing system must be controlled via methods such as the monitor quality check employed in this study.
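The disagreement between subjective and objective rankings can be verified directly from the ranks listed in Table 5. The sketch below, written in Python rather than the R used for the study's statistical analysis,12 implements a tie-aware Spearman rank correlation and reproduces the published coefficients from the tabulated ranks alone.

```python
def avg_ranks(xs):
    """Rank values, assigning tied values the average of their 1-based positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the tie-averaged ranks."""
    rx, ry = avg_ranks(x), avg_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Rankings from Table 5 (scanners A-J):
subjective = [1, 2, 3, 4, 5, 6, 7, 8, 9]
objective = {
    "RTO":        [7, 2, 9, 3, 4, 1, 4, 8, 4],
    "EPipe(pen)": [3, 4, 9, 1, 4, 1, 7, 4, 8],
    "EPipe(vis)": [3, 9, 7, 2, 4, 1, 5, 6, 8],
    "Gammex":     [3, 6, 7, 9, 3, 5, 1, 8, 2],
}
for name, ranks in objective.items():
    # Matches Table 5: 0.00, 0.27, 0.10, -0.27
    print(name, round(spearman(subjective, ranks), 2))
```

The tied objective ranks (e.g. three scanners sharing rank 4 on the resolution test object) are assigned their average position, which is why the simple no-ties Spearman formula would not reproduce all four published values.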
Careful selection of observers, so that they are drawn from the specific staff group likely to use the equipment, would be good practice, although given differences between observers, it may be difficult to recruit sufficient numbers of very tightly selected observers.

CONCLUSION
The study shows large variation in the performance of the nine portable ultrasound scanners evaluated for use in the primarily community-based NAAASP, when assessed both subjectively and objectively. Test object measures of image quality do not predict subjective scanner image quality rankings, and it is not clear which of these methods of assessment is better linked to clinical outcomes. Further development of task-specific test objects could be of great benefit in future quality assessments and in the understanding of the relationship between subjective and objective measurements of image quality.

FUNDING
The work was funded by the Department of Health National Service AAA Screening Programme. AGD receives a research grant from Philips Healthcare.

ACKNOWLEDGMENTS
We would like to thank the following: the ultrasound machine manufacturers for allowing their machines to be evaluated at the Leicester National Abdominal Aortic Aneurysm Screening Programme centre; the National Health Service Purchasing and Supply Agency for organizing for the ultrasound machines to be taken to Leicester for evaluation; Gillian Hussey for acquiring the abdominal aorta ultrasound images; Kari Dempsey for performing the in vitro analysis; the practitioners who rated the ultrasound images; and Professor David Brettle and Medipex for the use of the login verification tool on the web-based software.

REFERENCES
1. Båth M, Månsson LG. Visual grading characteristics (VGC) analysis: a non-parametric rank-invariant statistical method for image quality evaluation. Br J Radiol 2007; 80: 169-76.
2. Smedby O, Fredrikson M. Visual grading regression: analysing data from visual grading experiments with regression models. Br J Radiol 2010; 83: 767-75. doi: 10.1259/bjr/35254923
3.
Launders JH, McArdle S, Workman A, Cowen AR. Update on the recommended viewing protocol for FAXIL threshold contrast detail detectability test objects used in television fluoroscopy. Br J Radiol 1995; 68: 70-7.
4. Browne JE, Watson AJ, Gibson NM, Dudley NJ, Elliott AT. Objective measurements of image quality. Ultrasound Med Biol 2004; 30: 229-37.
5. Metcalfe SC, Evans JA. A study of the relationship between routine ultrasound quality assurance parameters and subjective operator image assessment. Br J Radiol 1992; 65: 570-5.
6. National Screening Programme Standard Operating Procedures and Workbook. [Cited 26 November 2014.] Available from: http://www.aaa.screening.nhs.uk

Table 5. The ranking of the nine ultrasound scanner scores for the subjective scores, compared with the objective test object scores

Scanner   Subjective (rated by practitioners)   Resolution test object   EPipe(pen)   EPipe(vis)   Gammex
A         1                                     7                        3            3            3
B         2                                     2                        4            9            6
C         3                                     9                        9            7            7
D         4                                     3                        1            2            9
E         5                                     4                        4            4            3
F         6                                     1                        1            1            5
G         7                                     4                        7            5            1
H         8                                     8                        4            6            8
J         9                                     4                        8            8            2
Spearman rank order correlation with subjective rank:   0.00   0.27   0.10   -0.27

EPipe(pen), Edinburgh pipe test object (penetration); EPipe(vis), Edinburgh pipe test object (visibility). Gammex; Gammex-RMI, Nottingham, UK. This shows that none of the test object scores helps to predict the subjective study results.

7. Beales L, Wolstenhulme S, Evans JA, West R, Scott DJ. Reproducibility of ultrasound measurement of the abdominal aorta. Br J Surg 2011; 98: 1517-25. doi: 10.1002/bjs.7628
8. Long A, Rouet L, Lindholt JS, Allaire E. Measuring the maximum diameter of native abdominal aortic aneurysms: review and critical analysis. Eur J Vasc Endovasc Surg 2012; 43: 515-24. doi: 10.1016/j.ejvs.2012.01.018
9. Tapiovaara MJ. Review of relationships between physical measurements and user evaluation of image quality. Radiat Prot Dosimetry 2008; 129: 244-8. doi: 10.1093/rpd/ncn009
10.
Hoskins PR, Martin K, Thrush A, eds. Diagnostic ultrasound: physics and equipment. Cambridge, NY: Cambridge University Press; 2010.
11. Brettle DS, Bacon SE. Short communication: a method for verified access when using softcopy display. Br J Radiol 2005; 78: 749-51.
12. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. Available from: http://www.r-project.org
13. Keeble C, Wolstenhulme S, Davies AG, Evans JA. Is there agreement on what makes a good ultrasound image? Ultrasound 2013; 21: 118-23.
14. Elliott ST. A user guide to compound imaging. Ultrasound 2005; 13: 112-17.
15. Shapiro RS, Wagreich J, Parsons RB, Stancato-Pasik A, Yeh HC, Lao R. Tissue harmonic imaging sonography: evaluation of image quality compared with conventional sonography. AJR Am J Roentgenol 1998; 171: 1203-6.
16. Stather PW, Dattani N, Bown MJ, Earnshaw JJ, Lees TA. International variations in AAA screening. Eur J Vasc Endovasc Surg 2013; 45: 231-4. doi: 10.1016/j.ejvs.2012.12.013
17. Hartshorne TC, McCollum CN, Earnshaw JJ, Morris J, Nasim A. Ultrasound measurement of aortic diameter in a national screening programme. Eur J Vasc Endovasc Surg 2011; 42: 195-9. doi: 10.1016/j.ejvs.2011.02.030
18. Thapar A, Cheal D, Hopkins T, Ward S, Shalhoub J, Yusuf SW. Internal or external wall diameter for abdominal aortic aneurysm screening? Ann R Coll Surg Engl 2010; 92: 503-5. doi: 10.1308/003588410X12699663903430
19. Bredahl K, Eldrup N, Meyer C, Eiberg JE, Sillesen H. Reproducibility of ECG-gated ultrasound diameter assessment of small abdominal aortic aneurysms. Eur J Vasc Endovasc Surg 2013; 45: 235-40. doi: 10.1016/j.ejvs.2012.12.010
20. Moore SC, Munnings CR, Brettle DS, Evans JA. Assessment of ultrasound monitor image display performance. Ultrasound Med Biol 2011; 37: 971-9. doi: 10.1016/j.ultrasmedbio.2011.02.018
21.
Oetjen S, Ziefle M. A visual ergonomic evaluation of different screen types and screen technologies with respect to discrimination performance. Appl Ergon 2009; 40: 69-81. doi: 10.1016/j.apergo.2008.01.008