de-identifying dicom encapsulated pdf’s · example, characters do not pixelate well.) an extreme...

De-Identifying DICOM Encapsulated PDF’s

Peter M. Kuzmak, MS, U.S. Department of Veterans Affairs; Charles Demosthenes, MD; April Maa, MD

Introduction

Atlanta VA Health Care System has Ophthalmic Optical Coherence Tomography, Humphrey Visual Field, and Heidelberg Retinal Tomography DICOM Encapsulated PDF objects that are to be sent to a research partner. These objects contain “burned-in” patient identification information that must be removed before they are exported or shared outside the VA. We used an automated method that converts the DICOM Encapsulated PDF reports to DICOM Secondary Capture objects and then applies pixel de-identification. The details of this technique are presented.

Hypothesis

Many clinical instruments generate patient PDF reports that are stored as DICOM objects. Adobe’s Portable Document Format (PDF) is an ISO standard that specifies a digital form for representing electronic documents to enable users to exchange and view electronic documents independent of the environment in which they were created or the environment in which they are viewed or printed [1]. The DICOM Encapsulated PDF Storage SOP Class is used for storing such PDF documents in DICOM objects. It was added to the DICOM Standard in 2004 with Supplement 104 [2], at the request of the VA. Before then, PDF documents were scanned and stored as Secondary Capture objects, which was inefficient storage-wise and difficult to convert (for example, characters do not pixelate well). An extreme case is the Humphrey Visual Field report, which consists almost entirely of ASCII characters (see Fig. 1). When it is stored as a PDF, it is about 35 kilobytes. One vendor was storing these reports as 24-megabyte DICOM Secondary Capture objects.
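The size gap can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes a US Letter page rendered at 300 pixels per inch as uncompressed 24-bit RGB. These rendering parameters are illustrative assumptions, not values reported for the vendor's objects, although they do yield a raster of roughly the size quoted above.

```python
# Assumed rendering parameters (hypothetical; not taken from the vendor's objects).
PAGE_WIDTH_IN = 8.5      # US Letter width in inches
PAGE_HEIGHT_IN = 11.0    # US Letter height in inches
DPI = 300                # pixels per inch
BYTES_PER_PIXEL = 3      # uncompressed 24-bit RGB


def uncompressed_size_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Return the size in bytes of an uncompressed raster of one page."""
    columns = round(width_in * dpi)
    rows = round(height_in * dpi)
    return rows * columns * bytes_per_pixel


raster_bytes = uncompressed_size_bytes(
    PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BYTES_PER_PIXEL
)
pdf_bytes = 35 * 1024    # "about 35 kilobytes", as stated in the text

print(f"raster: {raster_bytes / 2**20:.1f} MiB")
print(f"ratio : {raster_bytes / pdf_bytes:.0f}x the size of the PDF")
```

Under these assumptions a single rasterized page is several hundred times larger than the text-based PDF it was produced from.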

Fig. 1 – Humphrey Visual Field Report with fictitious patient information

Our prior paper [3] described the transfer of more than 100,000 retinal images (DICOM Ophthalmic Photography 8 Bit Image Storage SOP Class) to our research partner for testing machine-learned grading of diabetic retinopathy. In that paper we described the technical details of recording relevant clinical information in a JSON object and storing it in the DICOM header in the Text Value (0040,A160) data element. This paper describes the second phase of the project, which involves the transfer of additional kinds of eye care images and reports for the same set of patients: Humphrey Visual Field (HVF), Optical Coherence Tomography (OCT), and Heidelberg Retinal Tomography (HRT) reports. The HVF measures visual acuity, and the OCT and HRT provide cross-sectional analyses of the macula area of the retina. These examinations are routinely performed for patients undergoing treatment for diabetic retinopathy.

This second set of eye care reports is stored as DICOM Encapsulated PDF objects, which must be de-identified before they are sent to our research partner. (The auto-generated JSON data contains a matching pseudo-Patient ID and pseudo-Study Date that are used to associate the reports with the correct patients and studies of the previously sent retinal images.) The research partner will use these additional kinds of examinations to further test and validate its machine learning algorithms for grading diabetic retinopathy. The de-identified reports are intended to be visually reviewed by a researcher and compared with the machine-graded results as part of the algorithm refinement and validation process.

Methods

By definition, all patient PDF reports contain patient identification information (PII). This PII must be removed before the DICOM Encapsulated PDF objects are exported for research. We demonstrated and used a two-step approach to de-identify the DICOM objects that were exported for research. The DICOM Encapsulated PDF objects were first converted to DICOM Secondary Capture (SC) objects, and then DICOM data element and pixel de-identification were applied automatically. The resulting DICOM SC objects were stored to disk for later export. A manual process was used to create the DICOM SC objects. An enhanced Laurel Bridge Compass Router was used to perform both de-identification steps. We plan to automate the manual process later by combining the two steps inside a modified Compass router.

Step One: Converting DICOM Encapsulated PDF to DICOM Secondary Capture(s)

Converting PDF to JPEG(s) is straightforward.
Adobe Acrobat can store a PDF as a set of JPEG files, one for each page. The PDF to JPEG conversion parameters need to be specified (see Appendix A). Converting a DICOM Encapsulated PDF object to a set of DICOM Secondary Capture objects requires that the page number be stored in the DICOM Instance Number (0020,0013) data element, and that the Document Title (0042,0010) be copied from the DICOM Encapsulated PDF object to each Secondary Capture object. Both data elements are needed to select the appropriate pixel de-identification rules for a particular page of a particular report.

Step Two: Blotting out Burned-in Annotation

The locations of PII in the PDF report are instrument-dependent. The same instrument may produce different PDF reports with the PII stored in different places. Furthermore, the PII may be located in different places on different pages of the different PDF reports, and different software versions of the instrument may store PII in different locations. Pixel de-identification needs to consider all of these factors to determine the appropriate coordinates for the blot-out rectangles. The following DICOM data elements were used to select the proper set of pixel de-identification coordinates:

a. Manufacturer (0008,0070) *
b. Manufacturer’s Model Name (0008,1090) *
c. Software Version(s) (0018,1020) *
d. Document Title (0042,0010)
e. Instance Number (0020,0013)

* DICOM instruments that support the DICOM Enhanced General Equipment Module [4] provide this information.

The PII in the DICOM Secondary Capture image is blotted out by using the above information to select the corresponding pixel de-identification filter in Compass. Confirming the locations of the pixel-blanking rectangles is an iterative, trial-and-error manual process within the Compass administrative console.

The following table lists the DICOM Encapsulated PDF report types involved in this project:

Instrument                           Document Title                      Blotted-out PII Fields                                          Appendix
Zeiss Humphrey Visual Analyzer II-i  SFA                                 Name, ID, DOB, Exam Date & Time, Age                            B
Zeiss Cirrus HD-OCT 4000/5000        Angio Analysis                      Name, ID, DOB, Exam Date & Time                                 C
Zeiss Cirrus HD-OCT 4000/5000        GPA: Optic Disc                     Name, ID, DOB, Exam Date & Time, Other Exams, Doctor, Comments  D
Zeiss Cirrus HD-OCT 4000/5000        OD High Definition Images           Name, ID, DOB, Exam Date & Time                                 E
Zeiss Cirrus HD-OCT 4000/5000        OS Macular Thickness Analysis       Name, ID, DOB, Exam Date & Time                                 F
Zeiss Cirrus HD-OCT 4000/5000        Cirrus_OD_ONH and RNFL OU Analysis  Name, ID, DOB, Exam Date & Time                                 G
Heidelberg Engineering Spectralis    Overview Report                     Name, ID, DOB, Exam Date, Note Date                             H
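The two steps described in the Methods can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the Compass implementation: headers are modeled as plain dictionaries keyed by DICOM keyword, pixel data as a list of rows, and the rule table, the vendor and model strings, and the rectangle coordinates are all hypothetical placeholders. Raising an error when no rule matches is a design assumption of this sketch, so that a page with unknown layout is never exported unblotted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Rect = Tuple[int, int, int, int]


@dataclass(frozen=True)
class RuleKey:
    """The five data elements used to select a set of blot-out rectangles."""
    manufacturer: str        # Manufacturer (0008,0070)
    model_name: str          # Manufacturer's Model Name (0008,1090)
    software_versions: str   # Software Version(s) (0018,1020)
    document_title: str      # Document Title (0042,0010)
    instance_number: int     # Instance Number (0020,0013), i.e. the page number


def secondary_capture_headers(pdf_header: Dict[str, object],
                              page_count: int) -> List[Dict[str, object]]:
    """Step One (header part): build one header per rendered page."""
    headers = []
    for page in range(1, page_count + 1):
        header = dict(pdf_header)          # carries DocumentTitle and equipment data
        header["InstanceNumber"] = page    # page number identifies the page layout
        headers.append(header)
    return headers


def rule_key(header: Dict[str, object]) -> RuleKey:
    """Extract the rule-selection key from a Secondary Capture header."""
    return RuleKey(
        str(header["Manufacturer"]),
        str(header["ManufacturerModelName"]),
        str(header["SoftwareVersions"]),
        str(header["DocumentTitle"]),
        int(header["InstanceNumber"]),
    )


def blot_out(pixels: List[List[int]], rects: List[Rect], fill: int = 0) -> None:
    """Step Two: overwrite every pixel inside each rectangle, clipped to the image."""
    rows = len(pixels)
    cols = len(pixels[0]) if rows else 0
    for left, top, right, bottom in rects:
        for y in range(max(top, 0), min(bottom, rows)):
            for x in range(max(left, 0), min(right, cols)):
                pixels[y][x] = fill


def deidentify_page(header: Dict[str, object], pixels: List[List[int]],
                    rules: Dict[RuleKey, List[Rect]]) -> None:
    """Select the rectangles for this page and blot them out in place."""
    key = rule_key(header)
    if key not in rules:
        # Assumption of this sketch: refuse to proceed without a known layout.
        raise LookupError(f"no pixel de-identification rule for {key}")
    blot_out(pixels, rules[key], fill=0)


# Hypothetical rule table and a tiny 6x4 grayscale "page" for demonstration.
RULES = {RuleKey("Example Vendor", "Example Model", "1.0", "SFA", 1): [(0, 0, 4, 2)]}
pdf_header = {
    "Manufacturer": "Example Vendor",
    "ManufacturerModelName": "Example Model",
    "SoftwareVersions": "1.0",
    "DocumentTitle": "SFA",
}
page_headers = secondary_capture_headers(pdf_header, page_count=1)
page = [[255] * 6 for _ in range(4)]
deidentify_page(page_headers[0], page, RULES)
```

Keying the rules on all five elements mirrors the observation that PII locations can differ by instrument, software version, report type, and page.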

Results

This is a work in progress. We plan to have Laurel Bridge Software enhance the Compass Router to support automatic conversion of DICOM Encapsulated PDF objects to DICOM Secondary Capture objects. As a first step in testing the approach, we manually converted each type of DICOM Encapsulated PDF to a corresponding DICOM Secondary Capture object. We then used Compass to determine the coordinates required for the pixel de-identification rectangles. Once filters were defined to de-identify the pixel data, Compass was used to automatically de-identify the DICOM metadata and the corresponding pixel data, producing sample de-identified DICOM Secondary Capture objects for each type of report. Actual results will be included later. As part of this project, we expect to export on the order of 35,000 images.

The trial-and-error approach of manually determining the fixed coordinates of the pixel de-identification rectangles for each kind of DICOM Encapsulated PDF report works well in trial mode for a small, restricted set of object types. It does not scale well to a large set of different DICOM Encapsulated PDF object types, because the coordinates of the de-identification rectangles must be determined manually for each type. Fixed pixel de-identification coordinates also do not work for documents where the location of the PII varies and cannot be known in advance, for example, at the end of a variable-length computer-generated report or in a scanned document.

Appendix A – Adobe Acrobat [5] Options for Controlling PDF to JPEG Conversion

Adobe Acrobat provides a GUI for selecting the options that control PDF to JPEG conversion. The default settings are shown below the image in BOLD.

1) File Settings
   a. Grayscale (Quality: Minimum, Low, Medium, High, Maximum)
   b. Color (Quality: Minimum, Low, Medium, High, Maximum)
   c. Format (Baseline (Standard), Baseline (Optimized), Progressive (3 Scans, 4 Scans, 5 Scans))

2) Color Management
   a. RGB (Off, Embed profile)
   b. CMYK (Off, Embed profile)
   c. Grayscale (Off, Embed profile)

3) Conversion
   a. Colorspace (Determine Automatically, Color RGB, Color CMYK, Grayscale)
   b. Resolution (Determine Automatically, 72, 96, 150, 300, 600, 1200, 2400 pixels per inch)
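Because the blot-out rectangles are fixed pixel coordinates, they are only valid for the resolution at which they were determined. The sketch below shows one way a rectangle could be rescaled if the conversion resolution changed, assuming the page keeps the same physical size. The function name and the (left, top, right, bottom) layout with exclusive right and bottom edges are assumptions of this example, not part of Acrobat or Compass.

```python
from typing import Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Rect = Tuple[int, int, int, int]


def scale_rect(rect: Rect, from_dpi: int, to_dpi: int) -> Rect:
    """Rescale a blot-out rectangle from one rendering resolution to another.

    The left/top edges are rounded down and the right/bottom edges are
    rounded up, so the scaled rectangle never covers less of the page
    than the original did.
    """
    if from_dpi <= 0 or to_dpi <= 0:
        raise ValueError("resolutions must be positive")
    left, top, right, bottom = rect

    def floor_scale(value: int) -> int:
        return value * to_dpi // from_dpi

    def ceil_scale(value: int) -> int:
        # Ceiling division using only integer arithmetic.
        return -(-value * to_dpi // from_dpi)

    return (floor_scale(left), floor_scale(top),
            ceil_scale(right), ceil_scale(bottom))


# A rectangle measured on a 300 ppi rendering, reused on a 150 ppi rendering.
print(scale_rect((60, 30, 601, 91), from_dpi=300, to_dpi=150))
```

Rounding outward errs on the side of blotting slightly too much rather than leaving a sliver of PII visible.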

Appendix B – Humphrey Visual Field

Appendix C – Zeiss Cirrus Angiography Analysis

Appendix D – Zeiss GPA: Optic Disc Cube

Appendix E – Zeiss High Definition Images

Appendix F – Macula Thickness

Appendix G – ONH and RNFL OU Analysis

Appendix H – Heidelberg Retinal Tomography

Conclusion

We developed a method, which can be automated, for de-identifying large sets of well-known DICOM Encapsulated PDF reports. It first converts the report to a DICOM Secondary Capture object and then automatically applies DICOM tag de-identification and a previously determined pixel de-identification. We were successful at applying this de-identification method to a large set of DICOM Encapsulated PDF reports generated by three kinds of eye care instruments. The method is useful for dealing with a known set of different kinds of DICOM Encapsulated PDF documents. More automated techniques are needed to deal with an arbitrarily large set of different kinds of DICOM Encapsulated PDF objects, especially ones where the location of the PII varies with the content of the document.

Statement of Impact

Many medical instruments generate DICOM Encapsulated PDFs. This paper presents a successful method that can be used to de-identify them so that they can be used for research.

Keywords

DICOM encapsulated PDF, PDF, de-identification

[1] https://www.iso.org/standard/63534.html
[2] ftp://medical.nema.org/medical/dicom/final/sup104_ft.doc
[3] Kuzmak P., Demosthenes C., Maa A., Exporting Diabetic Retinopathy Images from VA VistA Imaging for Research, Journal of Digital Imaging, DOI 10.1007/s10278-018-0153-0
[4] DICOM PS3.3 2018b, Section C.7.5.2, Enhanced General Equipment Module
[5] Adobe Acrobat Pro DC, Version 2019.008.20081, Copyright © 1984-2018 Adobe Systems Incorporated.
