

MAWSON Final Report

MAWSON PROJECT - An Online Genetic Data Repository

Final Report March 2013

A project funded under the Australian Government’s Quality Use of Pathology Program


Content

Glossary
Background
  Need
  Clinical databases
  Benefits
Mawson project overview
  Aims of the project
  Phases of the project
Orientation phase
System design
Architecture of Mawson prototype
Testing the system
Functionality of the prototype
Outcomes
Achieving Objectives
Problems Encountered
Lessons learned
Benefits and disadvantages
Naming conventions
Future work
Appendix
  A - Data dictionary (Not for Publication)
  B - Data Capture Specifications (Not for Publication)
  C - Data Storage in Laboratories Survey (Not for Publication)
  D - Data Storage in Laboratories Example (Not for Publication)
  E - User Cases (Not for Publication)
  F - Mawson system user manual (Not for Publication)
  G - Mawson system models (Not for Publication)
  H - Source code (Not for Publication)
References


Glossary

BRCA1/2 The BRCA1 gene belongs to a class of genes known as tumour suppressor genes. Like many other tumour suppressors, the protein produced from the BRCA1 gene helps prevent cells from growing and dividing too rapidly or in an uncontrolled way.

Cash-Ensemble a cache system that allows greater system speed with the ability to handle large volumes of transactional data

Cytopathology is a branch of pathology that studies and diagnoses diseases on the cellular level.

Data Dictionary a data dictionary is a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format

Data Integrity Data integrity refers to maintaining and assuring the accuracy and consistency of data over its entire life-cycle, and is an important feature of a database

Delphi Style is a standard style for formatting Delphi code. Delphi is a development tool for Microsoft Windows applications and is a powerful and easy to use tool for generating stand-alone graphical user interface (GUI) programs

Diagnosis Diagnosis is the identification of the nature and cause of anything. Diagnosis is typically used to determine the causes of symptoms, mitigations for problems and solutions to issues.

Familial Mutations A familial mutation is a mutation in a gene (variant) known to have caused disease in the family. Genetic testing, or familial variant testing, allows a person to determine whether they have also inherited the variant(s) and are at risk of developing the disease found in their family.

Fragile X Syndrome Fragile X syndrome (FXS), Martin–Bell syndrome, is a genetic syndrome that is the most widespread single-gene cause of autism and inherited cause of mental retardation among boys.

Genomics / Variomics Genomics is a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism).

Hematologic Disorder Hematologic diseases are disorders which primarily affect the blood.


HVP (Database) The Human Variome Project database is a database that stores the data associated with the HVP.

Identifier An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique class of objects.

Inhomogeneous Inhomogeneous data assumes that the statistical properties of any one part of an overall dataset are not the same as any other part. Not homogeneous or uniform.

Long QT Interval the QT interval is a measure of the time between the start of the Q wave and the end of the T wave in the heart's electrical cycle.

LOVD The Leiden Open (source) Variation Database. LOVD's purpose is to provide a flexible, freely available tool for gene-centred collection and display of DNA variations.

Muscular Dystrophy Muscular Dystrophy (MD) is a group of muscle diseases that weaken the musculoskeletal system and hamper locomotion.

Mutation Surveyor Mutation Surveyor is a DNA Sequencing analysis software capable of performing variant analysis of up to 2000 Sanger sequencing files (.ab1,.RSD,.ESD & .scf) generated by Applied Biosystems Genetic Analyzers, MegaBACE as well as Beckman CEQ electrophoresis systems in 15 minutes

Mutation Validation Mutation validations are the verification of a specific mutation to determine their mutagenic potential.

Pathogenicity Pathogenicity is the potential capacity of certain species of microbes to cause a disease.

PCEHR A personally controlled eHealth record is a secure online summary of your health information.

Phenotype A phenotype is the composite of an organism's observable characteristics or traits.

Prototype A prototype is an early sample or model built to test a concept or process or to act as a thing to be replicated or learned from.

TP53 (Li-Fraumeni syndrome) Li–Fraumeni syndrome is an extremely rare autosomal dominant hereditary disorder that greatly increases susceptibility to cancer.

VPN (Virtual Private Network) A virtual private network (VPN) extends a private network across public networks like the Internet


Background

Genetic testing (also called DNA-based testing) is among the newest and most sophisticated techniques available to test for genetic disorders. It involves direct examination of the DNA molecule itself. Genetic testing can provide information about a person's genes and chromosomes throughout their life, giving valuable information to the clinician. Traditionally, genetic testing was often done as part of a genetic consultation in a hospital setting to determine the likelihood of hereditary diseases.

Available types of testing include: Newborn screening, Carrier testing, Prenatal testing, Pre-implantation genetic diagnosis, Predictive and pre-symptomatic testing (breast cancer), Parental testing, Research testing, Pharmacogenomics, Forensic testing and Diagnostic genetic testing.

Diagnostic genetic testing can be described as allowing a diagnosis for someone who has a medical condition or to determine the genetic basis of a disease.

Diagnostic genetic testing currently faces a number of challenges as it moves from the research laboratory into mainstream diagnostic medical testing. The enormous increase in demand, driven by the improving capability of such testing to inform diagnostic, prognostic, and therapeutic decisions and by the decreasing cost of applying these technologies, is pushing many diagnostic service laboratories beyond their capability to train and accredit the necessary staff members. This is compounded by the challenge of maintaining consistent, high-quality reporting while the knowledge base underpinning the interpretation of genetic tests continues to evolve rapidly [1].

The consistency in interpretation and reporting is particularly important in the case of testing for heritable mutations which can involve family members being assessed by different clinicians and tested by different laboratories.

Need

The maturation of diagnostic genetic testing as a mainstream discipline in pathology is based on a number of inter-related factors:

o the development of clear generic standards in the assessment of the clinical significance of genetic variants in different genes;

o the establishment of internationally accepted quality criteria, as is the case for other diagnostic tests;

o the implementation of mechanisms to ensure consistency of analytical and interpretive accuracy between laboratories;

o improved assessment of genotype-phenotype relationships involving individual genes or groups of genes in different populations, e.g. ethnicity, method of ascertainment;

o improved efficiency of workflow, reducing the time required of senior scientists to interpret variants of unknown significance;

o transition to a technical base, rather than a research base, for the delivery of common genetic tests;

o appropriate recognition of concerns about confidentiality regarding the results of tests for heritable genetic variants;

o provision of accredited data resources for assessing the clinical significance of genetic variants, viz. the need for the database resource(s) to undergo periodic review and to meet quality criteria consistent with those required of the laboratory performing the testing.

[1] Report of the Australian Genetic Testing Survey 2006, Royal College of Pathologists of Australasia 2008. http://www.rcpa.edu.au/static/File/Asset%20library/public%20documents/Media%20Releases/2006%20and%20older/AustralianGeneSurvey2006.pdf . Accessed 25 October 2010.

The availability of curatorial control to maintain accurate, annotated data about the variants in a gene represents a key element of most of these requirements. There are many local and international repositories of genetic variants, but their quality and consistency are variable [2], although the situation is improving with initiatives such as the Leiden Open (source) Variation database (LOVD) [3].

From the informatics point of view there are two major issues to be addressed: large patient-centred databases supporting clinical practice and standards/processes/infrastructure supporting knowledge management in the field.

Clinical databases

Most research databases cannot meet the stringent needs of diagnostic laboratories. The report generated by a diagnostic laboratory is patient-specific, requires absolute consistency in nomenclature, requires immediate access to current information regarding all variants, and is generated by laboratory staff who do not necessarily have the level of expertise expected of a researcher. Furthermore, accountability for the report is framed by legal obligations rather than by a relationship with expert peers. The main differences between a clinical (diagnostic) and a research database are summarised in Table 1.

[2] Claustres M, Horaitis O, Vanevski M, Cotton RG (2002) Time for a unified system of mutation description and reporting: a review of locus-specific mutation databases. Genome Res 12:680-688.

[3] http://www.lovd.nl/3.0/home . Accessed 10 March 2013

Table 1 Differences in databases

| Feature | Clinical | Research |
|---|---|---|
| Focus | Individual | Group; population |
| Purpose | Diagnostics; treatment | Knowledge production |
| Identifiers | At least some are necessary | Fully de-identified |
| Privacy | Strict | Less important |
| Consistency | Required | Desirable |
| Completeness | “All results” | “Interesting results” |
| Basis for practice | Directly | Indirectly |
| EHR | Included | Independent |
| Errors | Not acceptable | Acceptable to some level |
| Medico-legal implications | Possibly severe | Low risk |
| Informed consent | Not required (exemption: data used for diagnosis and treatment) | Required (data used for other than primary purpose of collection) |

In the South Australian population, approximately 15% of patients tested for a familial mutation in a breast cancer gene are found to have a variant of unknown clinical significance [4]; this ratio is similar to experience elsewhere [5]. The ratio can improve if clinicians have access to a large volume of already evaluated cases. Achieving case volumes of practical utility is very difficult unless this data is shared across a wide group of genetic testing providers.

[4] Suthers, G.; 2008, Unpublished data

[5] van Dijk S, Otten W, Timmermans DR, et al. What's the message? Interpretation of an uninformative BRCA1/2 test result for women at risk of familial breast cancer. Genet Med 2005; 7: 239–45


Benefits

For the wider community

Community expectation in relation to health outcomes continues to increase as technology improves. The recent trend toward DNA type testing has heightened the awareness of this for many pathology consumers. The use of the Mawson Database in the interpretation of genetic variants will benefit the wider community. The interpretation of genetic variants will be more consistent, and fewer variants will be unclassifiable. This will allow patients to be confident in the pathology results and the information and management provided by clinicians.

For participants

A geneticist/clinician will be able to see how other laboratories have interpreted a given variant, and discrepancies will be highlighted (consistency in reporting). The collation of this information has been shown to clarify the significance of variants (improved utility of testing). The collated information can be used to provide a structured synoptic report (improvement in report formulation). The collated information will more accurately reflect the national experience of the variant (prospective improvement in quality of interpretation) and potentially identify analytical inaccuracies in DNA sequencing (continuing audit of analytical quality).

Laboratories can be advised of a change in the interpretation of a variant and re-issue reports if necessary (retrospective improvement in interpretation). The collated data can be uploaded to international databases (international improvements in medical genetic reporting).

Patients/participants will benefit from increased consistency, reliability and overall quality of interpretation of their variants. Some decrease of waiting time for results can be expected as well.

Mawson project overview

Aims of the project

The Mawson project was set up to develop a prototype of the Mawson system. This system shall support clinicians in the interpretation of variants of selected genes by sharing data and interpretations across collaborating laboratories and by allowing clinicians to discuss the clinical interpretation (pathogenicity) of such variants, both in general and at the level of individual cases.

The Mawson system prototype is intended primarily to:

o collect all variants of a particular gene (or genes) as they are identified during diagnostic genetic testing by a group of laboratories;

o promote consistency among the member laboratories in the interpretation and reporting of variants by data sharing and discussion;

o facilitate consistent interpretation of variants of unknown significance by pooling and summarising variant information collected by the group of laboratories;

o protect patient confidentiality by presenting only de-identified data to all users other than the laboratory which contributed the data; and

o facilitate safe peer-to-peer communication and collaboration about the analytical and interpretative aspects of testing the gene involved.
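The confidentiality aim in the list above (identifiers visible only to the contributing laboratory) can be illustrated with a minimal sketch. The field names and the dictionary layout are hypothetical and are not taken from the Mawson data dictionary; the sketch only shows the rule that identifiers are dropped for every viewer other than the contributor.

```python
# Hypothetical field names; the report does not publish its record layout.
IDENTIFYING_FIELDS = frozenset({"patient_id", "sample_id", "family_id"})


def view_for(record: dict, viewer_lab: str) -> dict:
    """Return the record as it would be shown to viewer_lab.

    The contributing laboratory sees the full record; every other
    laboratory sees a copy with the identifying fields removed.
    """
    if viewer_lab == record["contributing_lab"]:
        return dict(record)
    return {key: value for key, value in record.items()
            if key not in IDENTIFYING_FIELDS}


record = {
    "contributing_lab": "Lab A",
    "patient_id": "P-001",
    "sample_id": "S-001",
    "family_id": None,
    "variant": "NM_12345.2(GeneAAA):c.111_112delAA",
    "interpretation": "unknown significance",
}
own_view = view_for(record, "Lab A")    # identifiers retained
peer_view = view_for(record, "Lab B")   # identifiers removed
```

Filtering at read time is only one possible design; the same effect could be achieved by never storing identifiers centrally.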


Phases of the project

The project started from a situation where very little was known about the processes of data acquisition, storage and reporting at individual laboratories. There was a lack of binding standards covering the domain of medical genetic testing, although drafts of such standards had been published [6].

Orientation phase

The development team started the project with an initial round of interviews in selected laboratories (see Data in laboratories) to explore the status quo in data collection and storage. At the same time a data dictionary (Appendix A) was developed to serve as a reference for the team and collaborating laboratories.

Data dictionary

A data dictionary is a collection of descriptions of the pieces of data used in a particular domain. Medical genetic testing is a highly dynamic field, and explicit definitions of terms and descriptions of data are important to support mutual understanding between collaborators, as well as between the designers and the users of the Mawson system.

The data dictionary is a result of consultations across a community of practitioners and experts in the field. The main purpose of the document is to support development of our system. As such, this document reflects our understanding of data and terminology in medical genetic testing and does not aspire to be a complete or standardised set of terms and definitions.

Data in laboratories

Collecting data from a diverse set of sources and integrating it into a single database poses several challenges. First, the laboratories produce and store the data in diverse formats; second, there are several phases in the workflow that could serve as points of data extraction.

Data format and scope

To address the data format and data scope issues we visited 12 laboratories to conduct interviews with experts. The selection of laboratories also included cytopathology, although this was not considered a primary focus of the Mawson project. We looked at the following main areas:

o laboratory focus (genes tested, diseases of interest)
o handling incoming proband data (identifier persistence, de-identification, storage)
o data standards (nomenclature, stability of standards)
o steps in testing and processing results
o information systems and other software used in the laboratory

Laboratory focus

While the prototype of the Mawson system was oriented towards the most frequently tested genes (BRCA1/2), the intended purpose of the system is to collect data on several (all) genes tested. To support this view, we interviewed 12 laboratories with a wide range of interests to gather information on practices beyond breast cancer-related genetic testing. The selection of laboratories covered a wide range of genes and conditions:

o Breast cancer (BRCA1; BRCA2)
o Eye diseases (glaucoma, retinoblastoma)
o Huntington’s disease
o Reproductive health (such as fertility disorders, stillbirths and malformations)
o Hearing loss and deafness (Connexin 26; GJB2 gene)
o Fragile X syndrome
o MEN (multiple endocrine neoplasia) 1, 2a, 2b
o TP53 (Li-Fraumeni syndrome)
o Muscular dystrophy
o Colon cancer (such as FAP, familial adenomatous polyposis; HNPCC, hereditary non-polyposis colorectal cancer; MYH)
o Long QT interval
o Hematologic disorders (chronic myeloid leukaemia; haemophilia; haemochromatosis)

Also included were some highly specialised areas:

o mucopolysaccharidosis
o other metabolic disorders
o mitochondrial disorders

Several domains of testing are unique in Australia; hence an Australia-wide system of collecting data is of little use unless it is tightly connected to international research and practitioner groups within the same domain. In such cases an easy way to upload data to international databases and knowledge management systems would be of value.

[6] A draft of a guideline was published in 2010: National Pathology Accreditation Advisory Council: Requirements for medical testing of human nucleic acids (Draft), Canberra, June 2010

Handling of the data

From the first preliminary discussions with our partners we expected a high diversity of data, as well as of data management processes. Results of interviews with specialists across the selected 12 laboratories supported this expectation.

Table 2 Identifiers

| Laboratory No. | Patient ID | Sample ID | Family ID | Comment |
|---|---|---|---|---|
| 1 | ??? | Yes | Yes | Name+date of birth+family number is collected; de-identified in subsequent process. Sample ID is assigned and becomes effectively patient ID (one sample for each patient assumed) |
| 2 | ??? | Yes | Yes (some) | Sample ID assigned by referring provider, another sample ID assigned by the laboratory. Kindred (family) ID is assigned in some cases. Patient ID may be given by the referring provider |
| 3 | ??? | Yes | ??? | Referring provider states name and date of birth. Sample ID added by the laboratory. Family relationships can be expressed informally (as text comment) |
| 4 | ??? | Yes | Yes | Incoming sample obtains sample ID and potential kindred ID |
| 5 | Yes | Yes | ??? | Name+date of birth+Medicare No is collected. Patient ID is assigned by the requesting provider. Laboratory systems assign request ID and sample ID. |
| 6 | Yes | Yes | No | Requests come with requestor’s patient ID, name+date of birth. Laboratory assigns Sample ID. Family information (Family ID) is not available |
| 7 | Yes (some) | Yes | Yes (some) | Sample ID (referred) varies (hospital ID and/or name+date of birth). Laboratory generates unique Sample ID (lab), only loosely linked to patient ID. Some samples come with family ID |
| 8 | Yes (some) | Yes | Yes (some) | Sample ID (referred) varies (hospital ID and/or name+date of birth). Laboratory generates unique Sample ID (lab) |
| 9 | Yes (some) | Yes | Yes (some) | Sample ID (referred) varies (hospital ID and/or name+date of birth). Laboratory generates unique Sample ID (lab) |
| 10 | Yes (some) | Yes | No | Sample ID (referred) varies (hospital ID and/or name+date of birth). Laboratory generates unique Sample ID (lab) linked to request via patient name+date of birth |
| 11 | Yes (some) | Yes | No | Patient is identified by name+date of birth (if referred by hospital: hospital patient ID). Some processes use Patient ID = Sample ID (patient tested only once), some allow for separate Patient ID and several Sample IDs |
| 12 | ??? | Yes | Yes | Referring provider includes patient name+date of birth and diagnosis. Laboratory assigns a sample ID (this is used as Patient ID) and family ID |

Table 2 shows how the collected data is identified by individual laboratories. Most laboratories do not generate a specific identifier for the patient, but rely either on external IDs (patient ID provided by the referring provider) or on search by name and date of birth. This approach appears to work well in most cases; however, it would be a problem for a system collecting one person’s data across several samples (such as testing of multiple genes or of more than one tissue, e.g. blood and tumour).


Nomenclatures and other standards

Most laboratories tend to use HGVS standards for naming variants. There is more variation in the choice of reference sequence for individual genes. The variant name is typically stored in a truncated format, i.e. without the reference sequence (e.g. NM_12345.2(GeneAAA):c.111_112delAA would be stored as c.111_112delAA).
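To make the difference between the full and the truncated form concrete, the following minimal sketch splits a full name into its reference sequence and variant description, and rebuilds a full name from a truncated one. It is an illustration only: the regular expression handles names shaped like the placeholder example above and is not a complete HGVS parser, and the function names are hypothetical.

```python
import re

# Simplified pattern for names shaped like the example in the text,
# e.g. "NM_12345.2(GeneAAA):c.111_112delAA". This is an assumption made
# for illustration; it does not cover the full HGVS nomenclature.
FULL_NAME = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r"(?:\((?P<gene>[^)]+)\))?"
    r":(?P<description>[cgnmpr]\..+)$"
)


def split_variant_name(name: str) -> dict:
    """Split a full variant name into reference sequence and description.

    Raises ValueError for truncated names, because the reference
    sequence cannot be recovered from the description alone.
    """
    match = FULL_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"no reference sequence found in {name!r}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    return parts


def restore_full_name(truncated: str, accession: str, version: int,
                      gene: str = "") -> str:
    """Prefix a truncated description with a known reference sequence."""
    gene_part = f"({gene})" if gene else ""
    return f"{accession}.{version}{gene_part}:{truncated}"


parts = split_variant_name("NM_12345.2(GeneAAA):c.111_112delAA")
full = restore_full_name("c.111_112delAA", "NM_12345", 2, "GeneAAA")
```

Restoring a full name is only safe when the laboratory records which reference sequence and version it used for the gene; the truncated name alone does not carry that information.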

Testing and analysis

From a data collection point of view, the process of testing and analysis is homogeneous across all laboratories and can be summarised as a sequence:

1. Sample preparation
2. Testing and analysis (produces raw results)
3. Result validation (includes naming any variants found)
4. Interpretation (includes evaluation of clinical significance of a variant)
5. Reporting

Several levels of detail in testing are used, ranging from extensive testing (typically in the first member of a family) to targeted tests for the presence of a particular variant. We did not analyse these approaches in more detail, as details on testing are not relevant to the current version of the Mawson system. Moreover, these approaches are currently widely discussed, especially in the context of new analytic methods available to the laboratories (such as NextGen) [7]. As a result we consider attempts to extract useful requirements from these discussions to be premature.

Software used in laboratories

Results of genetic testing are stored in a wide variety of systems. Typically (9/12), laboratories report having some kind of home-grown system (some of them based on Access or Excel, of highly variable sophistication and quality, and sometimes combined with an archive of MS Word files storing patient reports); the remaining (3/12) laboratories did not comment on the software used. Some laboratories (4/12) also store results in a Laboratory Information System (LIS).

While several laboratories indicated plans/interest in purchasing a new system for medical genetic testing results management, no detailed information on what this system might be was available at the time of the study.

Implications for Mawson project (and similar)

The survey results have several implications for the Mawson project (and possibly other similar projects):

o Sources of data are highly inhomogeneous, both in data content and in data formats.

o Sources of data are not ready for integration with any data collection/sharing system.

o Patient identification extractable from data sources is unreliable or absent (this information is typically assembled in the reports sent to the requesting clinicians; however, these do not have a format suitable for automatic data extraction).

o Sample identification extractable from data sources is not reliable (sample ID doubles as patient ID, only one sample ID is used, sample ID does not have a stable format, etc.).

o Nomenclature favours HGVS standards but is based on a wide variety of reference sequences. The lack of a standard in reference sequence use is a major barrier to efficient data and knowledge sharing (even the simple task of identifying that submissions from two different laboratories describe the same variant is difficult without a reliable way to translate names based on one reference sequence to names based on another). This issue would remain significant even if all laboratories were to use the current standard reference sequence: reference sequences are updated, and names based on an older version might not match names of the same variant based on a newer version.

o Information about the patient (such as age, sex, reason for analysis) is typically not available in a computer-usable format.

[7] Discussion at 3rd National Pathology Forum, Sydney Harbour Marriott Hotel, 13-14 December 2012, Sydney

Considering these findings, the following questions had to be resolved from the point of view of Mawson system prototype design:

Data collection from non-cooperative data sources (i.e. data sources without any communication interface for programmatic data export)

Individualised data collection from the laboratories (data has a different structure and format in each lab) with manual entry of missing information. The alternative would be a standard data submission model as used by international databases; however, manual extraction and reformatting of lab data into a clean submission is exactly the bottleneck we intend to avoid.

Data transformation and de-identification as part of the data preparation has to be done in the laboratory as sensitive data (such as patient name and date of birth) is frequently used in lieu of an identifier.
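As an illustration of de-identification performed inside the laboratory, the sketch below replaces the name and date of birth with a keyed hash before any data leaves the laboratory. This is one possible technique chosen for the example, not the method the Mawson prototype is stated to use; the function name, the key handling and the normalisation rules are all assumptions.

```python
import hashlib
import hmac
import unicodedata
from datetime import date


def pseudonym(name: str, date_of_birth: date, lab_key: bytes) -> str:
    """Derive a stable pseudonym from name and date of birth.

    lab_key is a secret that never leaves the laboratory, so the
    pseudonym cannot be recomputed by anyone else from a guessed
    name and date of birth.
    """
    # Collapse whitespace and ignore case so that trivially different
    # spellings of the same name map to the same pseudonym.
    collapsed = " ".join(name.split())
    normalised = unicodedata.normalize("NFKC", collapsed).casefold()
    message = f"{normalised}|{date_of_birth.isoformat()}".encode("utf-8")
    return hmac.new(lab_key, message, hashlib.sha256).hexdigest()


key = b"example-key-kept-inside-the-laboratory"
first = pseudonym("Jane  Citizen", date(1970, 1, 2), key)
second = pseudonym("jane citizen", date(1970, 1, 2), key)
```

Because the hash cannot be reversed, re-identification under this sketch would rely on the laboratory recomputing the pseudonym from its own records or keeping a local lookup table.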

Data details – follow-up study

A more detailed survey was carried out in 2012 to explore data and data formats in laboratories. The response rate was rather low: 5/20 laboratories provided answers with some details. Even allowing for the small sample size, the results suggest that the situation had not changed.

Tables 3, 4 and 5 summarise the findings. While there is some consensus on the data collected, there is still considerable inhomogeneity between laboratories. Data is stored in several formats, with textual entry prevailing. Most laboratories (4/5) use different versions of Mutation Surveyor to name and validate variants. While patient information and sample information is collected and stored in all (5/5) laboratories, inclusion of these identifiers in data output from Mutation Surveyor is highly idiosyncratic (e.g. conveyed via naming conventions of data files used by Mutation Surveyor) or absent. Automatic collection of data requires integration across a laboratory-specific set of tools (such as Filemaker, MS Access, MS Excel, and Alamut); however, some information remains inaccessible because it is recorded in laboratory logbooks and in patient reports (MS Word).

Table 3 Information collected on the Patient

| PATIENT | Collected | Stored | Accessible | Coded | Coding schema | Format | System |
|---|---|---|---|---|---|---|---|
| Patient ID | 5/5 | 5/5 | 5/5 | | | | Excel/Alamut; Filemaker [8]; MS Access |
| Patient date of birth | 5/5 | 5/5 | 5/5 | | | dd/mm/yyyy; dd/mm/yy | |
| Patient gender | 5/5 | 5/5 | 5/5 | | | M/F | |
| Family ID | 2/5 | 2/5 | 2/5 | | | | |
| Phenotype [9] | 5/5 | 5/5 | ? | 0/5 | | Text | Excel/Alamut; MS Access; Patient reports |

Table 4 Information collected on Samples

| SAMPLE | Collected | Stored | Accessible | System | Comment |
|---|---|---|---|---|---|
| Sample ID (assigned by [10]: host laboratory) | 5/5 | 5/5 | 5/5 | | Assigned on reception |
| Sample ID (assigned by: sender) | 2/5 | 2/5 | 1/5 | | |
| Sample material | 5/5 | 5/5 | 5/5 | Filemaker; MS Excel | No coding |
| Date collected | | | | MS Excel; Text (report) | |
| Testing method | | | | MS Excel; Text (report, lab notes) | No coding |
| Date tested | 4/5 | 4/5 | 4/5 | | 1/5 only “date reported” |
| Method of confirmation | 5/5 | 5/5 | 4/5 | Text (report; lab notes); Excel | |

[8] Filemaker is a database system

[9] The meaning of “phenotype” is not uniform across laboratories

[10] Who assigned the ID: sample collector; external laboratory which extracted the DNA; host laboratory on receipt of the sample

Table 5 Information collected on Results

| RESULT | Collected | Stored | Accessible | System | Comment |
|---|---|---|---|---|---|
| Result date | 5/5 | 5/5 | 5/5 | Filemaker; Excel | 1/5: same as date tested |
| Variant name | 5/5 | 5/5 | 5/5 | Filemaker+Alamut; Excel+Alamut | 4/5 use Mutation Surveyor |
| Reference sequence | 5/5 | 5/5 | 5/5 | Filemaker+Alamut; Excel; MS Word (in patient reports) | |
| Pathogenicity | 5/5 | 5/5 | 5/5 | Filemaker+Alamut; Excel; Alamut | 3/5 use Plon, S. [11] classification |
| Nomenclature system | | | | | 5/5 use HGVS nomenclature (but can differ in reference sequence). 1/5 uses additional descriptions (text). |

[11] Plon, S. et al: Sequence variant classification and reporting: recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum Mutat. 2008 November; 29(11): 1282–1291.

Current situation

The results from the five laboratories confirmed the need for manual entry of some of the information (such as patient and sample identifier continuity), which is present in the laboratory but is not accessible to fully automatic data harvesting.

System design

The design of the Mawson system prototype was based on the following assumptions:

1. the system will serve professional practice – only identified professionals will be able to get access

2. results from all laboratories will be collected

3. all participating laboratories will use consistent naming standards (and reference sequences) for each gene of interest

4. the prototype will need a high level of protection (same level as electronic health records)

Data collection from laboratories

The minimal data to be collected for the Mawson prototype comprises:

Patient ID – the laboratory should be able to re-identify the patient whenever needed (e.g. if a variant was re-classified)

Data about the patient: age, sex and possibly diagnosis (or reason for test, or phenotype in general). Relationship to a family can be collected if applicable (family ID).

Sample ID – to identify the sample, so that the laboratory can re-identify it if there is a need

Data about the sample: source (such as blood, tumour); date of sample collection (to discern several similar samples)

Result: variant name (including reference sequence)

Information about the result: method used to obtain the result (such as DNA sequencing); date of obtaining the result (to discern re-analysis of the same sample from analysis of a similar sample, as well as from duplicate entry of the same result); and possibly details on the result (such as the range of the DNA sequenced – available from Mutation Surveyor)

There are several possible entry points to capture results of medical genetic testing depending on the availability of data and/or control connection points. Each of these has its advantages and disadvantages.

Analyser

Capturing data directly from the analyser (DNA sequencer) can reduce the number of required interfaces, as the number of analysers is limited. On the other hand, raw data requires quality control and local validation in order to be reliable. Unless the software driving the analyser offers such functionality, connecting directly to the analyser is not considered a viable option, as the processes of validating raw data are outside the MAWSON project scope.

Figure 1 Connecting to the analyser [diagram components: Analyser (DNA sequencer); Local sequence-processing application; Local storage; LIS; HIS; MAWSON data capture; MAWSON server; legend: Data, Control]

Local data processing application

Raw data is typically processed and validated using a specific application (such as Mutation Surveyor). The output from such an application is considered to be approved by the laboratory (and hence reliable). There are two ways to capture data at this level.

The first option is to wait until the data is stored in a file or database, poll that storage space for the arrival of new data, and extract the data from the file or storage (Figure 2). The advantage of this approach is that the application does not need to be able to collaborate with the MAWSON data capture process. The disadvantage is the need for a polling process to be active all the time on the laboratory computer, or for manual handling of the data.
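The polling approach can be sketched as follows. This is a minimal illustration in Python and not the project's own implementation; the function names `poll_once` and `poll_forever`, the callback `handle` and the polling interval are assumptions made for the example.

```python
import time
from pathlib import Path


def poll_once(watch_dir: Path, seen: set) -> list:
    """Return files that appeared in watch_dir since the previous call.

    `seen` holds the file names already handled and is updated in place.
    """
    new_files = []
    for path in sorted(watch_dir.iterdir()):
        if path.is_file() and path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


def poll_forever(watch_dir: Path, handle, interval_s: float = 5.0) -> None:
    """Keep polling; this is the always-active process mentioned above."""
    seen = set()
    while True:
        for path in poll_once(watch_dir, seen):
            handle(path)  # hypothetical callback that starts data capture
        time.sleep(interval_s)
```

The sketch tracks files by name only; a file overwritten under the same name would not be picked up again, which a production polling process would need to handle.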

The second option is to set up the application to communicate with the MAWSON data capture process (Figure 3). The application then initiates data capture by the MAWSON process, which does not require the user's intervention. While this is the most automated option, the major disadvantage is the requirement for communication between both processes. Such a feature is typically not available.

Figure 2 Connect to data output from local application

Figure 3 Connect to local application

This entry point is also closest to the use of the Mawson prototype for variant evaluation. Given that the workflow of genetic testing follows the schema in Figure 7, the optimal point of data capture is just after the results have been validated, as this allows the system to have a role in variant interpretation and reporting.

Local storage

Most laboratories store the results in some kind of local database, so this is a viable point for data extraction. However, the major disadvantage of such a solution is that a typical laboratory database uses purpose-built software (in many cases built by the laboratory itself, which offers some advantage), so the diversity of data formats and functionality is very high.

Figure 4 Connecting to local storage data

Figure 5 Connecting to local storage application

The situation is in principle similar to connecting to the sequence-processing application: either the application is capable of communicating with the MAWSON data capture process (Figure 5) or the data is extracted via files/storage (Figure 4).

Higher level Information Systems

Connecting to the Laboratory Information System (Figure 6) would offer an advantage for future linking of genomics (variomics) data with other data obtained for the patient in the laboratory. The major disadvantage of this approach is the need to resolve very sensitive questions of privacy and ethics. However, such connections are not excluded in future, when other data will need to be considered in evaluating the clinical significance of an unknown variant.

Figure 6 Connecting to Laboratory Information System

A hospital information system would provide even more opportunities to link genetic information with phenotype for detailed interpretation but, as with linking to a Laboratory Information System, there is a high barrier to collecting such data due to the security and privacy issues arising from such connections.

Missing data

At any point of data collection, some of the information needed can be missing from the material available for automatic harvesting. The data collection Client is configured to collect such information manually (i.e. by opening a dialogue window and asking for data entry).
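The fallback to manual entry can be sketched as follows. The field names are assumptions loosely based on the minimal data set listed under "Data collection from laboratories", and `ask` is a hypothetical stand-in for the dialogue window; neither is taken from the actual Client configuration.

```python
# Illustrative field names only; the real Client configuration is
# laboratory-specific and is not reproduced here.
REQUIRED_FIELDS = ("patient_id", "sample_id", "sample_source",
                   "variant_name", "reference_sequence", "result_date")


def missing_fields(record: dict) -> list:
    """List required fields that are absent or empty in a harvested record."""
    return [f for f in REQUIRED_FIELDS
            if not str(record.get(f) or "").strip()]


def complete_record(record: dict, ask) -> dict:
    """Fill the gaps by calling ask(field_name) for each missing field."""
    completed = dict(record)  # leave the harvested record untouched
    for field_name in missing_fields(record):
        completed[field_name] = ask(field_name)
    return completed
```

In this sketch only the missing fields trigger a prompt, so a fully populated export would pass through without user interaction.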

Data interpretation

The concept of result interpretation assumes the workflow shown in Figure 7. Once the results are harvested from the laboratory, the clinical significance of individual variants is interpreted. Data and other information shared on the Mawson system support this process by showing how many times a particular variant was encountered and whether there is a consensus on the pathogenicity interpretation of the variant.

Pathogenicity – coded in an appropriate way (Plon, S. et al.11 for BRCA1) – is entered for each result as part of variant evaluation (see the workflow in Figure 7). Comments on either the individual cases of the variant or on the variant as such12 can be entered manually at this stage.
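The kind of summary that supports interpretation (how often a variant was seen and whether the evaluations agree) can be sketched as follows. This is an illustration only: the record layout is assumed, pathogenicity is represented as an integer class, and "consensus" is simplified to mean that all evaluated cases carry the same class.

```python
from collections import Counter


def summarise_variant(cases: list) -> dict:
    """Summarise the variant cases recorded for one variant name.

    Each case is a dict with 'laboratory' and 'pathogenicity' (an integer
    class, or None if the case has not been evaluated yet).
    """
    classes = Counter(c["pathogenicity"] for c in cases
                      if c["pathogenicity"] is not None)
    return {
        "times_seen": len(cases),
        "laboratories": sorted({c["laboratory"] for c in cases}),
        "classifications": dict(classes),
        # Simplified rule: consensus means one class among evaluated cases.
        "consensus": len(classes) == 1,
    }
```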

12 “Variant case” refers to a variant found in a specific patient. “Variant as such” refers to the variant of the gene without direct link to a particular person.


Figure 7 Laboratory workflow

Identifiers

Identifiers are an important part of housekeeping in databases such as the Mawson system prototype. In order to properly identify a sample, several pieces of information are needed: Patient ID, Sample ID, and Result ID. Figure 8 indicates the lifecycle of such identifiers, e.g. data should be linked to a Patient ID starting from the test request all the way to compiling the report and making decisions.

Figure 8 Identifiers

The Patient ID is needed to link all results to the right person. The Mawson system uses two identifiers for a person. The first is an internal sequential number assigned to the person as data arrives at the database; it is used to link results belonging to one person together. The second (external) identifier is the ID assigned by the submitting laboratory. This information allows results from one laboratory to be linked to one internal identifier of a person for reporting, and can also be used to alert the submitting laboratory whenever there is a change in the pathogenicity evaluation of one of the variants identified in this person. The external (laboratory-supplied) identifier is visible only to the submitting laboratory, and is not sufficient to reveal the identity of the patient, as additional data kept only in the laboratory is needed for any such re-identification. However, collecting external identifiers is, amongst others, the main reason for the high level of protection of the Mawson system database.
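The two-identifier scheme can be illustrated with a small sketch. The class name `PersonRegistry` and its methods are hypothetical and only mirror the behaviour described above: a sequential internal number per laboratory-supplied ID, with the external ID revealed only to the laboratory that submitted it.

```python
class PersonRegistry:
    """Maps (laboratory, laboratory-assigned patient ID) to an internal number."""

    def __init__(self):
        self._internal_by_external = {}
        self._next_id = 1

    def internal_id(self, laboratory: str, external_id: str) -> int:
        """Return the internal ID, assigning the next number on first sight."""
        key = (laboratory, external_id)
        if key not in self._internal_by_external:
            self._internal_by_external[key] = self._next_id
            self._next_id += 1
        return self._internal_by_external[key]

    def external_id_for(self, internal_id: int, requesting_laboratory: str):
        """Reveal the external ID only to the laboratory that submitted it."""
        for (lab, ext), internal in self._internal_by_external.items():
            if internal == internal_id and lab == requesting_laboratory:
                return ext
        return None
```

Note that in this sketch the same external ID submitted by two laboratories yields two internal identifiers; recognising the same person across laboratories is a separate problem that the sketch does not attempt.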

Sample ID (supplied by the laboratory) is needed to link results related to the same sample: there can be multiple analyses of the same sample, or the same result can be re-submitted.

In the current version of the Mawson system prototype the functionality to recognise duplicate entries was not implemented (out of scope). However, if data from several laboratories is to be shared, proper alignment of Patient IDs, Sample IDs and results (for diagnostic and reporting purposes) is important, as the chance of the same person being tested at different places, or of the same sample being re-tested, is expected to be higher than when data from only one laboratory is analysed.

Reporting

The Mawson system is designed to support report writing, not to replace it, i.e. it does not support the different report formats used by the laboratories. Also, the information provided by the system has to be put into context with the information known about the individual patient, which remains the responsibility of the specialist writing the report.

The Mawson system prototype report summarises the information held on the system. While the report is not intended to substitute for patient-specific reports produced by laboratories, it shows information on the frequency of individual variants and their interpretation (pathogenicity). This information can be copied into the final report. Viewing discussion entries may also help to understand possible disagreement in interpretation.

Security/confidentiality

An important design consideration is the security (including reliability, data integrity etc.) of the system. The system is designed to support clinical decision-making, so the information held on it should be as secure as possible. Moreover, the information has to be protected from leaks (confidentiality), as some of the data may be considered confidential, although no information stored on the Mawson system prototype identifies a person. While the laboratory can re-identify the patient by combining information provided by the Mawson system with additional data stored at the laboratory, users cannot gain access to the identity of a person from the Mawson prototype data alone. However, there is a strong perception of privacy concerns around results of genetic testing as such (there is a stream of opinion claiming that genetic information is itself an identifier). As a result, for building the Mawson prototype we selected a platform used for full electronic health records (Cache/Ensemble; InterSystems Inc.). This platform is used to build complex systems such as VistA (the electronic health record system used by the Department of Veteran Affairs, holding data of more than 5 million patients).

Architecture of Mawson prototype

The prototype of the Mawson system is composed of two separate components: the Client and the Server (Figure 9). The Client runs on the laboratory computer and provides a laboratory-specific interface for data capture. The Server holds the main functionality of the Mawson prototype and resides on a dedicated server (currently located at the University of South Australia).


Figure 9 Mawson system overview

The Client

The Client monitors a defined directory (folder) on the laboratory computer. Once a file is saved into this directory (by copy and paste, or by “Save” from another program such as Mutation Surveyor®), the data capture process is started. This process involves parsing the input file into data elements, collecting missing pieces of data from the user, sanitizing the data (i.e. removing sensitive/confidential pieces of information) and mapping the input data elements to the output data schema required by the Mawson system.

Any missing pieces of data (typically patient identifiers) are entered manually: the Client opens a user interface (Figure 10) in which the data can be entered or verified.
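The parsing, sanitizing and mapping steps can be sketched as follows. The input is assumed to be tab-separated text with a header row, and the column names, the set of sensitive columns and the output field names are all hypothetical; the actual export layout and Client configuration differ per laboratory and are not reproduced here.

```python
import csv
import io

# Hypothetical configuration: which input columns are dropped as sensitive
# and how the remaining ones are renamed to the output schema.
SENSITIVE_COLUMNS = {"Patient name"}
COLUMN_MAP = {"Sample": "sample_id", "Mutation": "variant_name",
              "Reference": "reference_sequence"}


def capture(text: str) -> list:
    """Parse tab-separated text, drop sensitive columns, rename the rest."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row = {COLUMN_MAP[k]: v for k, v in raw.items()
               if k not in SENSITIVE_COLUMNS and k in COLUMN_MAP}
        rows.append(row)
    return rows
```

Columns that are neither mapped nor listed as sensitive are also dropped in this sketch, so only explicitly configured data elements leave the laboratory computer.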

Once data entry and processing are completed, the data is packed (7Zip13) into one archive file. This file is then encrypted (GPG14) so that the data is protected while it travels through the Internet. Another option for protection would be to establish a VPN (Virtual Private Network) channel between the laboratory and the Mawson server; however, this may require re-configuration of the laboratory systems (such as opening ports, modifying security policies, modifying firewall settings), which can be an onerous task for some systems.
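The pack-then-encrypt step can be sketched with the command-line tools named above. This is an assumption-laden illustration: the report does not state how 7-Zip and GPG are invoked, so the executable names (`7z`, `gpg`), the use of public-key encryption and the recipient key name are assumptions. The helper functions only build the command lines; `pack_and_encrypt` runs them and requires both tools to be installed.

```python
import subprocess


def pack_command(archive: str, files: list) -> list:
    """7-Zip command line that adds `files` to `archive`."""
    return ["7z", "a", archive, *files]


def encrypt_command(archive: str, recipient: str) -> list:
    """GnuPG command line that encrypts `archive` for `recipient`."""
    return ["gpg", "--output", archive + ".gpg",
            "--encrypt", "--recipient", recipient, archive]


def pack_and_encrypt(archive: str, files: list, recipient: str) -> str:
    """Run both steps; needs 7z and gpg on PATH and the recipient's key."""
    subprocess.run(pack_command(archive, files), check=True)
    subprocess.run(encrypt_command(archive, recipient), check=True)
    return archive + ".gpg"
```

Encrypting the archive as a file, rather than relying on a protected channel, matches the reasoning above: it does not require any change to the laboratory network configuration.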

The Client logs on to the Server and uploads the data; part of this process can be logging on to the laboratory firewall to get access to the Internet. On success, a confirmation message is shown. In case of an error, messages are shown to indicate the type of the error (with the current version, detailed diagnostics of the error may need collaboration with the development team).

The Client is implemented in Java (to improve portability). Installation is simple: copy the Client files onto the laboratory computer and activate the tool (see the Mawson User Manual for details).

13 http://www.7-zip.org/

14 http://www.gnupg.org/

Figure 10 Client interface

The Server

When the submission data file is received by the Server, it is decrypted and unpacked, and the data is loaded onto the Mawson system. Before the data is stored, it is checked for errors and translated into the internal format of the Mawson system.
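The kind of checking and translation performed before storage can be sketched as follows. The internal format of the Mawson system is not described in this report, so the sketch only illustrates the idea using two details from the laboratory survey: dates recorded as dd/mm/yyyy or dd/mm/yy, and gender coded as M/F. The function and field names are assumptions.

```python
from datetime import date, datetime


def parse_lab_date(text: str) -> date:
    """Accept dd/mm/yyyy or dd/mm/yy and return a date object."""
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + repr(text))


def check_record(record: dict) -> list:
    """Return error messages; an empty list means the record passed."""
    errors = []
    if record.get("gender") not in ("M", "F"):
        errors.append("gender must be M or F")
    try:
        parse_lab_date(record.get("result_date", ""))
    except ValueError as exc:
        errors.append(str(exc))
    return errors
```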

Data model

A simplified data model (Figure 11) shows the essential data used by the Mawson system prototype.

Patient
- LaboratoryIDs: list
- Mawson ID: int
- Age: int
- Gender: int

Sample
- Patient's Mawson ID: int
- Date of sampling: Date
- Material: int
- LaboratoryID: int

Test
- Date of Test: Date
- LaboratoryID: int

Method
- Method name: char

Result
- Date of Result: Date
- List of Variant cases

Variant case
- Variant ID: int
- Pathogenicity: int
- Variant case name: char
- Reference sequence ID: int
- Laboratory ID: int
- Patient ID: int

Variant
- Variant ID: int
- Reference sequence ID: int
- Variant name: char
- Pathogenicity: int

Gene
- Gene name: int
- List of reference sequences

Reference sequence
- Reference Sequence ID: int
- Name: char
- Version: int

Figure 11 Data model

As can be seen from the diagram, several samples can be taken from a single patient, each sample can be subject to several tests and each test can produce several results. A result can contain several variant cases. We designed two separate classes to distinguish the general concept of a variant (Variant), holding the general knowledge about a particular variant (such as pathogenicity assigned by a curator or by group consensus), from the concept of a variant detected in an individual patient (Variant case), holding data relevant to the individual (including pathogenicity evaluated in the context of this individual). Each variant name is based on a versioned reference sequence, in compliance with the HGVS guidelines.
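The relationships described above can be sketched as Python data classes. This is a simplification for illustration, not the system's schema: the class `GeneticTest` stands for the Test class, the method is folded into it as a plain string, and containment (lists) is used instead of the ID references shown in the diagram.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Variant:
    """General knowledge about a variant, independent of any patient."""
    variant_id: int
    reference_sequence_id: int
    name: str
    pathogenicity: Optional[int] = None  # curated or consensus class


@dataclass
class VariantCase:
    """A variant as detected in one patient."""
    variant_id: int
    patient_id: int
    laboratory_id: int
    reference_sequence_id: int
    name: str
    pathogenicity: Optional[int] = None  # evaluated for this individual


@dataclass
class Result:
    date_of_result: date
    variant_cases: List[VariantCase] = field(default_factory=list)


@dataclass
class GeneticTest:
    date_of_test: date
    method: str
    results: List[Result] = field(default_factory=list)


@dataclass
class Sample:
    date_of_sampling: date
    material: int
    tests: List[GeneticTest] = field(default_factory=list)


@dataclass
class Patient:
    mawson_id: int
    age: int
    gender: int
    samples: List[Sample] = field(default_factory=list)
```

Keeping `pathogenicity` on both `Variant` and `VariantCase` reflects the distinction drawn above between the general assessment of a variant and its assessment in the context of one individual.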

A complete set of diagrams covering all data used by the system is out of the scope of this report and is available in a separate document.

Testing the system

The Mawson prototype was tested in three laboratories. The major focus of testing was on data collection, as uploading the results to a database is considered a major barrier to wider collaboration. Data capture was tested with:

SA Pathology (at Flinders Medical Centre), Flinders Drive, Bedford Park, South Australia

Familial Cancer Service, Westmead Hospital, Westmead NSW

Molecular Genetics Laboratory, Pathology Queensland, Royal Brisbane & Women's Hospital

Each laboratory provided the development team with an annotated mock-up sample of their data to assess structure of the input file. Based on this information, a specific configuration of the Client was prepared.

Steps of testing are summarised in Table 6.

Table 6 Testing

Communication | Mawson server can be reached from the laboratory computer | OK
The Client is running in the laboratory | Installation of the client was successful and upon start an icon appears in the Windows toolbar. | OK
Client is configured correctly | Copying/Saving a file into the monitored directory and Client User interface has a correct structure. “Correct” means matching the contents of the configuration file. Data displayed and collected is correct (“correct” means the configuration data is correct in terms of the laboratory – i.e. that the Client is collecting the right data). Output file is stored into the temporary directory (as listed in the configuration file) and holds correct data | OK
The Client submits data to the Server | The Client sends data (file) to the Server – this data (file) can be detected on server input | OK
The Server receiving the data | The file uploaded to the server is correctly decrypted, unpacked and parsed. Data is visible on the Mawson system. All errors introduced into the data are detected. | OK
Data received on Mawson system is correct | The data visible on Mawson system match the data in the original input file | OK
Users can access data | Users can log on Mawson system and see the data submitted | OK

Testing was done in person with SA Pathology; the other two centres were tested remotely (communication with the laboratory via Skype). All testing sites were able to successfully submit the test data to MAWSON (SA Pathology: 131 variant cases; Familial Cancer Service: 314 variant cases; Molecular Genetics Laboratory: 150 variant cases).

There were several issues encountered during testing. The main issue was, as expected, the highly restricted IT environment in the laboratories (security policies, firewalls). This was overcome by adding a possibility to enter specific login credentials for the firewall and designing the Client in a way that does not require additional privileges for installation. In one laboratory we encountered an incompatible version of Java which required re-compiling the Client.

Typically 2-3 rounds of tuning the Client configuration were needed. A specific version of the Client was developed, allowing detailed logging of the steps of the data capture process.

Data submitted by laboratories could not be merged as each laboratory is using a different reference sequence and the variant names could not be reconciled. As such, the prototype could be tested for functionality only (i.e. the system performs the functions as specified) but the benefits of data sharing could not be explored. An additional module translating the names of variants across different reference sequences is needed to convert the incoming data to refer to one system-wide reference sequence.

Functionality of the prototype

The Mawson system prototype collects all validated results of genetic tests from the laboratory and stores them on the server. After the user logs on to the server, the initial screen shows basic usage statistics (such as how many variants were submitted and how many variants were not evaluated) and a menu of options. At the core of the system is assessment of a variant case15 and reporting (details of working with the system are described in a separate document16). Assessment of a variant allows the user to select a variant submitted by the laboratory from a list and assign pathogenicity. Part of this process is an indication of whether this variant was already seen by any of the collaborating laboratories, and the possibility to add comments and start a discussion. (At this stage, the capability of the Mawson prototype to match variants is limited, as names based on different reference sequences cannot be automatically reconciled.)

15 Variant case = variant found in a specific patient, variant = the mutation itself, without direct relationship to a particular person

16 Mawson system manual, v.1, April 2013

Reporting on the Mawson system summarises all findings about a patient (as identified on the system), i.e. results from all current and previous tests from the laboratory are shown. For each variant a table is shown, indicating how many times this variant was captured on the Mawson system, along with pathogenicity (Figure 12 and Figure 13; pathogenicity in these examples is based on Plon, S. et al.11). Information in the report is not presented in a format specific to a particular laboratory. It is up to the user to choose which part of the report he/she wants to use and copy/paste it into the patient result report sent back to the requesting clinician. Offering a laboratory-specific report would require further configuration specific to each laboratory (adding technical complexity) as well as additional patient-specific sensitive data (which is not collected, to mitigate confidentiality concerns).

Figure 12 Reporting


Figure 13 Reporting

Outcomes

Achieving Objectives

The main objectives of the Mawson project included:

Establish MAWSON as an online repository to hold variant data, with capacity for manual and batch entry of data.

Design and implement interface to provide semi-automatic collection of data from different DNA sequencing platforms.

Design and implement module to support structured reporting on clinical significance of variants.

Interface to upload data to external databases.

Integration of 3 modules into the MAWSON system.

Usability study finalised in SA.

Nationwide feedback collected and analysed.

Security audit report obtained and major issues resolved.

Develop a business model for sustainable operation of MAWSON system.

Prepare prototype for pilot study.

Complete report from the pilot study with SA laboratories.

Report on consistency of interpretation of variants in all participating labs.

Complete analysis of reports before and after MAWSON was made available.

Organisational structure to manage MAWSON system completed and business model defined.

Most objectives were achieved, with the exception of the commercialisation component of the project.


Problems Encountered

There were a number of setbacks during the project, resulting in a substantial loss of time. This culminated in the project being approximately 20 months behind schedule. Some of the major issues encountered include:

During the project, commitments at the University intensified due to change management conditions, and insufficient time was spent on the project.

Not being able to use SA partners as originally expected. The negotiations of the contract ran for a long time, and SA Pathology integrated several labs into one, so we ended up having one partner in SA instead of the planned 3. As a result, we looked for partners interstate. This made it more difficult to set up the data harvesting, which had to be done through a series of remote consultations. We recruited 2 labs, in Melbourne and Sydney; however, the Melbourne laboratory dropped out after several weeks due to staffing changes, leaving us to find another partner. A partner was eventually found in Brisbane.

The data in the laboratories was much more variable than originally expected. Initially we expected to collect the data for Mawson from secondary systems in the laboratory which already hold all the required data, such as an Access system used at Flinders Medical Centre. This option was rejected at a later stage, as the data is typically entered into these systems manually and they do not contain all results; in addition, they are mainly research-driven. The data needs to be sourced from a system as close to the analyser as possible, so we collect data from the output of the systems on which the results are validated, such as Mutation Surveyor. The disadvantage of such data is that it does not contain some of the information mandatory for Mawson, such as patient ID, sample ID, sample source etc. This information needs to be added at some stage. Attempts to work together with the industry partners producing this commercial software failed, so we needed to work from data exported from such a system.

Another challenge we had to overcome (at least in part) was moving seamlessly across system firewalls. This proved to be a significant problem with both our Sydney and Brisbane partners.

The survey was a larger problem than expected. The team tried to send it directly to the laboratories, with no response. We then tried to distribute the survey via the Human Variome Project channels, with a similar result. The plan to overcome this was to contact the laboratory leaders in person and lobby them into answering our questions; however, this was only partially successful.

The commercialisation of the Mawson system proved to be extremely difficult. It was therefore decided to discontinue this aspect of the project. As an alternative, the Mawson server was hosted by the University of South Australia at no additional cost to DOHA or the laboratories.


Lessons learned

Benefits and disadvantages

The main benefit of working with the system is the ability to share data and, via reporting, to see a possible lack of consensus on the pathogenicity evaluation of some variants. This feature provides value to users even in the absence of a database custodian. Discussion comments can reveal the reasons behind differences in assessment. The discussion is not anonymous, so the possibility to discuss details of pathogenicity assessment with individual colleagues is always open.

An additional benefit lies in confidentiality protection: the Mawson system prototype moves information sharing and discussions from less secure media (such as e-mail) to a fully secured and protected platform with restricted access (the Mawson system is not open to the public and users of the Mawson system are identified professionals). This aspect is of particular importance, as the system was developed to be part of the clinical pathway (Figure 7) of genetic testing evaluation rather than a research platform, and so operates in the same league as other clinical systems such as LIS (laboratory information systems). Differences between research and clinical databases are summarised elsewhere in this report.

As such, opting out of having data stored on the system can be complicated (it requires editing the input file at the laboratory, and the laboratory typically does not have direct contact with the patient). The ethics committee assessing the project from the ethics and legal perspective17 confirmed the status of the Mawson prototype as a clinical system not requiring informed consent to capture and store data.

The disadvantage of the approach taken, as compared to research databases (such as HVP), is the need for manual entry of some of the data. However, without this data it is not possible to distinguish re-submission of the same data (unwanted duplication), re-testing of the same sample, analysis of a different sample from the same patient, and testing of different patients. Distinguishing these situations is important from the data quality point of view, offering a more precise and reliable view of the true occurrence of variants in the population of tested patients. Laboratory patient identifiers stored on the system allow the Mawson system to report back to the laboratory any messages relevant for that particular patient. This identifier may also allow the data on the Mawson system to be integrated into the PCEHR if there is a need for such integration in future (and all legal and ethical conditions are met to allow such integration).
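How the identifiers make these four situations distinguishable can be sketched as follows. The prototype does not implement duplicate recognition, so this is purely illustrative: the record layout and the matching rules (same date and variant name on the same sample counts as a re-submission) are assumptions.

```python
def classify_submission(new: dict, existing: list) -> str:
    """Compare a new result with results already stored for one laboratory.

    Records are dicts with 'patient_id', 'sample_id', 'result_date' and
    'variant_name'. The matching rules are illustrative assumptions.
    """
    same_patient = [r for r in existing
                    if r["patient_id"] == new["patient_id"]]
    if not same_patient:
        return "different patient"
    same_sample = [r for r in same_patient
                   if r["sample_id"] == new["sample_id"]]
    if not same_sample:
        return "different sample from the same patient"
    for r in same_sample:
        if (r["result_date"], r["variant_name"]) == (
                new["result_date"], new["variant_name"]):
            return "re-submission (duplicate)"
    return "re-test of the same sample"
```

Without the patient and sample identifiers, all four cases would look like independent observations of the variant, which is why the manually entered data matters for counting occurrences.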

Naming conventions

The current version of the Mawson system prototype initially assumed testing within South Australia only, in laboratories sharing the same standards (such as reference sequence use). However, during the initial phase of the project there was a major reorganisation of laboratories into SA Pathology, with only one remaining major centre for BRCA1/BRCA2 testing. This situation led to the need to recruit additional laboratories interstate for testing, leading to data entry based on different reference sequences and even different genes. Sharing data across different naming conventions requires name reconciliation; this was considered in the system design, but was out of the scope of the project building the current prototype. As a result, part of the testing of the prototype could not be done and the immediate benefit of sharing the data is diminished.

Currently we are building a module to reconcile names based on different reference sequences, and this will resolve the issue of sharing data from laboratories using a different basis for variant naming.

Future work

The main problem we perceive in the domain of clinical databases is curation and knowledge management. The logical next step in improving the Mawson system prototype is to analyse the possibilities of non-traditional forms of curation (such as a Delphi style of consensus building rather than engaging a formal curator, and the possibility of increasing the utility of the discussion tool) and to analyse the opportunity to automatically scan external sources of knowledge (other databases such as HVP, published literature etc.) to support the evaluation of unknown variants as well as the review of pathogenicity evaluations of already known variants.

17 Review by Children, Youth & Women’s Health Service (SA) Human Research Ethics Committee, REC2324/11/13

Appendix

A Data dictionary (Not for Publication)

The data dictionary was developed to support the design of MAWSON (repository for gene variants). This document contains a superset of data considered for inclusion in MAWSON.

B Data Capture Specifications (Not for Publication)

C Data Storage in Laboratories Survey (Not for Publication)

D Data Storage in Laboratories Example (Not for Publication)

E User Cases (Not for Publication)

F Mawson system user manual (Not for Publication)

The Mawson User Manual was developed to assist users to install the system and understand the functionality of the database. It provides step-by-step instructions on how to use the database as well as a quick start guide.

G Mawson system models (Not for Publication)

This document provides a complete overview of all element details.

H Source code (Not for Publication)

The source code is the original program language which contains variable declarations, instructions, functions, loops, and other statements that tell the program how to function.


References

1. Report of the Australian Genetic Testing Survey 2006, Royal College of Pathologists of Australasia 2008.

2. Claustres M, Horaitis O, Vanevski M, Cotton RG (2002) Time for a unified system of mutation description and reporting: a review of locus-specific mutation databases. Genome Res 12:680-688.

3. Leiden Open Variation database

4. Suthers, G.; 2008, Unpublished data

5. van Dijk S, Otten W, Timmermans DR, et al. What's the message? Interpretation of an uninformative BRCA1/2 test result for women at risk of familial breast cancer. Genet Med 2005; 7: 239–45

6. Draft of a guideline was published in 2010: National Pathology Accreditation Advisory Council: Requirements for medical testing of human nucleic acids (Draft), Canberra, June 2010

7. Discussion at 3rd National Pathology Forum, Sydney Harbour Marriott Hotel 13-14th of December 2012, Sydney

11. Plon, S. et al: Sequence variant classification and reporting: recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum Mutat. 2008 November; 29(11): 1282–1291.

12. “Variant case” refers to a variant found in a specific patient. “Variant as such” refers to the variant of the gene without direct link to a particular person.

13. 7-Zip, a file archiver with a high compression ratio. http://www.7-zip.org/

14. The GNU Privacy Guard. http://www.gnupg.org/

15. Variant case = variant found in a specific patient, variant = the mutation itself, without direct relationship to a particular person

16. Mawson system manual, v.1, April 2013

17. Review by Children, Youth & Women’s Health Service (SA) Human Research Ethics Committee, REC2324/11/13
