Improving Health Data: Establishing a Mechanism for the Assessment of Data Quality Hosted by: Ben Gordon Executive Director: Hubs & Data Improvement, HDR UK Trevor Heritage Senior Vice President: Cancer Informatics, Inspirata


Post on 30-Dec-2020


Page 1: Improving Health Data · (Data integration and ETL) Talend Open Studio For Big Data (FREE VERSION) ... data profiling and data transformation tool) Pentaho Kettle (COMMUNITY EDITION)

Improving Health Data: Establishing a Mechanism for the Assessment of Data Quality

Hosted by:

Ben Gordon, Executive Director: Hubs & Data Improvement, HDR UK

Trevor Heritage, Senior Vice President: Cancer Informatics, Inspirata


HDR UK’s mission is to unite the UK’s health data to enable discoveries that improve people’s lives

Our 20-year vision is for large scale data and advanced analytics to benefit every patient interaction, clinical trial, biomedical discovery and enhance public health.


Our strategy focuses on uniting, improving and using health data…


Making it easier to discover and request access to relevant datasets via the Innovation Gateway


https://healthdatagateway.org/


Uniting - The UK now has 488 health datasets discoverable for research and innovation

From across 33 health data organisations

Covering over 300 phenotypes:
• Abdominal hernia
• Abdominal aortic aneurysm
• Acne
• Actinic keratosis
• Acute kidney injury
• ...
• Vitiligo
• Volvulus

Sourced from hospitals, biobanks and national datasets:
• National datasets: 49%
• Biobanks: 22%
• Hubs: 15%
• NHS Hospitals: 6%
• Charities, cohorts & other: 8%

Covering all four nations of the UK and discoverable via a shared ‘front door’

[Chart: datasets by nation. Labels: England, Scotland, Wales, NI, International. Values: 52%, 15%, 19%, 9%, 5%]

Most accessed dataset: COVID-19 Symptom Tracker (59 access requests)


Improving – the data is becoming more accessible and useful

There have been improvements to metadata (Apr–Sep 20)

Percentage of metadata fields completed (across different categories). Metadata is information about a dataset, which allows users to understand what it contains.

[Chart: % filled in April (440 datasets) vs % filled in September (481 datasets). Categories of metadata: Technical, Attribution, Structure, Coverage, Administrative, Summary. Axis: 0 to 100]

And there is now a way of measuring data utility (Hub Baseline Utility)

Baseline data utility* scores for 43 initial Hub datasets. Highest rated dataset is Comprehensive Patient Records for Cancer Outcomes, scoring Gold across all categories.

[Chart: ratings (Platinum, Gold, Silver, Bronze, Not yet Bronze) for Technical Quality, Value and Interest, Documentation, Access and Provision, Coverage. Axis: 0% to 100%]

* Data Utility Evaluation Matrix available here


Using - health data is being used to enable discoveries that improve people’s lives

As a result, we now know a lot more about diseases such as COVID-19 because of health data research

Over 3,340* people, globally, have been discovering data through the UK’s health data research Gateway, with more than 6,100 searches/month

* unique users, 27th April – 8th September 2020

…and we have strategic use cases that will help to further increase and improve data discovery, access, and utility


Audience Poll:

From where does your interest in utilising health datasets stem? Check the top three most relevant answers from the list below:

• Population health, disease rates and statistics
• Understanding the causes of disease
• Drug/product usage information
• Translational medicine, e.g. biomarker identification
• Supporting clinical trials
• Improving public health and disease prevention
• Real world data and evidence
• Pandemic/disease outbreak monitoring and preparedness
• Treatment effectiveness and outcomes
• Drug safety / pharmacovigilance
• Other


Data Quality Assessment Project

HDR UK Webinar


Why Inspirata?

✓ Inspirata has worked with multiple complex healthcare organizations to deploy our solutions, helping them create value and insights from their data assets

✓ Deep expertise in clinical data curation and quality, including data sharing and associated PHI and data security issues

✓ Demonstrable, rigorous approach to creating and delivering data quality

✓ Vendor neutral: we utilize data cleansing, quality and profiling tools, but we do not supply them

✓ Understanding of potential downstream uses of clinical data (utility of data)

✓ Suite of tools for cancer reporting and registry, clinical decision support, trials matching, and data analytics, all of which depend on robust data quality

✓ Our R&D efforts have been focused on building software to support our clinical and administrative workflows to improve care, as well as research initiatives and collaborative data sharing programs

✓ We work directly with state registries and hospitals who use our NLP and AI software, enabling identification of reportable cancer cases and auto-abstraction and coding of those cases for registry purposes

Inspirata delivers infrastructure, domain expertise, out-of-the-box functionality and tool sets that improve data quality and processes today while being able to adapt to the future.


Inspirata’s Approach


Why is data quality an issue? Why is healthcare data challenging?

• Data Definitions: subjective based on source
• Location: where is your data? It’s not all digitized in healthcare
• Data Structure: structured vs non-structured
• Data Complexity: from humans to data warehouses
• Regulations & Requirements: keeping up with the government
• Data Format: php, asp, wcf, csv, pdf, java, ruby, xml, fastq, vcf, doc, html, bam, xsl, hl7, ppt, mov, jpg, png, wmv

[Illustrations: pathology report, prescription (RX)]


What is Meant by Data Quality?

• Metadata: Structured information that characterizes the dataset: what it is, where it is, assumptions made, etc.

• Data Quality: How “good” is the data [for its intended purpose]. We still need to define what is meant by “good”…

• Data Utility: A perfectly described/high quality dataset may be useful to you, but not to me. What use cases can the dataset support?


Importance of First-Class Metadata

HIGH-GRADE METADATA: WHO BENEFITS & HOW?

➢ ADVERTISES THE ORGANIZATION’S WILLINGNESS TO FACILITATE RESEARCH
➢ SAVE TIME AND RESOURCES
➢ UNDERSTAND THE DATASET
➢ FIND DATA: DETERMINE WHAT DATA EXISTS
➢ DETERMINE APPLICABILITY; DECIDE IF A DATASET MEETS NEED
➢ DISCOVER HOW TO ACQUIRE, PROCESS & USE THE DATASET
➢ AVOID DATA DUPLICATION
➢ SHARE RELIABLE INFORMATION
➢ OPTIMIZED INFRASTRUCTURE
➢ USE DATA AFTER INITIAL INTENDED PURPOSE
➢ TRACK DATA USE AND FACILITATE PUBLICATION

Organizations
➢ REMOVE RELIANCE ON SINGLE MEMBERS OF STAFF

A GOOD METADATA RECORD PROVIDES ALL THE CRITICAL INFORMATION FOR DISCOVERY, UNDERSTANDING, AND REUSE.


Metadata Analysis – Proven Improvement

• Evaluated metadata for >400 datasets published to the HDR UK Innovation Gateway

• Assessed each dataset by type/structure and content/data model

• Conducted multiple assessments – April and June

• Significant improvement in quality of metadata
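
The kind of completeness measure behind a before/after metadata comparison can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the assessment: the record layout, category names and field names below are hypothetical, and a field is assumed to count as "filled" whenever it is not empty.

```python
from collections import defaultdict


def completeness_by_category(records):
    """Return {category: percent of metadata fields filled} across all records."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        for category, fields in record.items():
            for value in fields.values():
                total[category] += 1
                # Assumption: None, empty string and empty list all mean "not filled".
                if value not in (None, "", []):
                    filled[category] += 1
    return {c: round(100 * filled[c] / total[c], 1) for c in total}


# Hypothetical metadata records for two datasets, grouped by category.
records = [
    {"Summary": {"title": "Dataset A", "abstract": "Example abstract"},
     "Coverage": {"start_date": "2019-01-01", "end_date": None}},
    {"Summary": {"title": "Dataset B", "abstract": ""},
     "Coverage": {"start_date": None, "end_date": None}},
]

scores = completeness_by_category(records)
print(scores)
```

Running the same function over two snapshots of the catalogue would give the per-category percentages needed for a side-by-side comparison.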


How is Data Quality Measured? DAMA Quality Dimensions


How is Data Quality Achieved? Across All Phases of Processing…


What Tools Were Assessed?

• KNIME (Data analytics, profiling, reporting and integration platform)
• Pandas Profiling (using Pandas I/O) (Python module for exploratory data analysis (EDA))
• Orange (Data visualization, machine learning, data profiling and mining toolkit)
• RapidMiner (FREE VERSION) (Integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics)
• WEKA (Machine learning software to solve data mining problems)
• Anonimatron (Pseudonymizes datasets)
• ARX Data Anonymization (Scalable data anonymization tool; supports multiple privacy models)
• WhiteRabbit (Tool to help prepare for ETLs of healthcare datasets)
• Aggregate Profiler (AP) (Data profiling and analysis tool)
• Talend Open Studio for Data Integration (FREE VERSION) (Data integration and ETL)
• Talend Open Studio for Big Data (FREE VERSION) (ETL for large and diverse data sets)
• Talend Open Studio for Data Quality (FREE VERSION) (Assesses accuracy and integrity of data; data profiling tool)
• Talend Open Studio for ESB (FREE VERSION)
• Talend Open Studio for MDM (FREE VERSION) (Key capabilities for data governance and master data management)
• OpenRefine (Tool for cleaning and transforming data)
• DataCleaner (COMMUNITY EDITION) (Data profiling, data cleaning, and data integration tool; offers integration with Pentaho)
• DataPreparator (Preprocessing: data cleaning, transformation, and exploration)
• Data Match (30-DAY FREE TRIAL) (Visual data cleansing application; a component of Data Ladder)
• DataMartist (30-DAY FREE TRIAL) (Visual data profiling and data transformation tool)
• Pentaho Kettle (COMMUNITY EDITION) (ETL tool; integrates with WEKA (data profiling))
• SQL Power Architect (COMMUNITY EDITION) (Data modeling & profiling tool)
• SQL Power DQguru (COMMUNITY EDITION) (Data cleansing & MDM tool)
• DQ Analyzer (COMMUNITY EDITION) (Data profiling tool)
• Pimcore (Data management, integration, PIM, MDM, DAM)
• Cytoscape (Software platform for visualizing molecular interaction networks and biological pathways)
• Anaconda (Data science platform)
• pyxplorer (A simple tool that allows interactive profiling of datasets)
• MobyDQ (Aims to automate data quality checks during data processing)


Data Quality Tools Assessment

• Assessed ~30 data quality tools against Gartner categorization of key capabilities
• Extensible inventory of tools, with detailed assessment criteria for >70 functionalities
• Ranking of open data quality tools in each category:
  • Profiling
  • Parsing, Standardizing & Cleansing
  • Interactive Visualization
  • Matching, Linking & Merging
  • Multi-domain Support
  • Workflow
  • Scalability/Performance
  • Usability

https://www.gartner.com/en/documents/3913549


Data Profiling - Deep Dive

• Detailed comparison of data profiling tools against multiple objective criteria relating to DAMA data quality dimensions:
  • Completeness
  • Consistency
  • Uniqueness
  • Validity
  • Accuracy
  • Timeliness
• Assessment against synthetic data (Synthea) for capability, performance and scalability
• Verification by Cystic Fibrosis and Neonatal Medicine Research Group
• Final recommendation for KNIME + Pandas profiling (or ORANGE + Pandas profiling)
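
To make the dimensions above concrete, here is a minimal sketch of how three of them (completeness, uniqueness and validity) might be computed for a single column. It uses only the Python standard library and is not how KNIME, Orange or Pandas Profiling implement their checks; the rows, field names and validity rules are hypothetical. Consistency, accuracy and timeliness are left out because they need cross-field rules, a trusted reference source or update timestamps.

```python
from datetime import date


def non_empty(rows, field):
    """Values of `field` that are present and not blank."""
    return [r[field] for r in rows if r.get(field) not in (None, "")]


def completeness(rows, field):
    """Share of rows where `field` is populated."""
    return len(non_empty(rows, field)) / len(rows)


def uniqueness(rows, field):
    """Share of populated values that are distinct (assumes at least one value)."""
    values = non_empty(rows, field)
    return len(set(values)) / len(values)


def validity(rows, field, is_valid):
    """Share of populated values that pass `is_valid` (assumes at least one value)."""
    values = non_empty(rows, field)
    return sum(1 for v in values if is_valid(v)) / len(values)


def is_iso_date(text):
    """True if `text` parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


# Hypothetical patient rows with a blank date, an impossible date,
# a duplicated identifier and an unexpected gender code.
rows = [
    {"patient_id": "P1", "birth_date": "1980-05-17", "gender": "F"},
    {"patient_id": "P2", "birth_date": "", "gender": "M"},
    {"patient_id": "P2", "birth_date": "1975-02-30", "gender": "X"},
    {"patient_id": "P4", "birth_date": "2001-11-03", "gender": "F"},
]

print("completeness(birth_date):", completeness(rows, "birth_date"))
print("uniqueness(patient_id):  ", uniqueness(rows, "patient_id"))
print("validity(birth_date):    ", validity(rows, "birth_date", is_iso_date))
print("validity(gender):        ", validity(rows, "gender", lambda g: g in {"F", "M"}))
```

Each function returns a fraction between 0 and 1, so the results can be reported per column and compared across datasets.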


Synthetic Reference Data Creation

• Synthetic data sets were created using the open source tool WhiteRabbit to generate 1,000 patient and related clinical data CSV files and a SQL database adhering to the OMOP data model.
• To evaluate the performance and scalability of each tool, an additional dataset of 1.3 million records was also generated.
• WhiteRabbit utilizes the application Rabbit-in-a-Hat, which in turn utilizes Synthea to generate patient data and map it into an OMOP model.
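
The idea of testing a profiling step against synthetic data of increasing size can be sketched as follows. This is only an illustration under stated assumptions: it does not use Synthea, WhiteRabbit or the OMOP model, the three columns are hypothetical, and the "profile" is a simple count of missing values standing in for a real tool.

```python
import random
import time


def make_patients(n, seed=0):
    """Generate n hypothetical patient rows; seeded so runs are repeatable."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        rows.append({
            "person_id": i + 1,
            "year_of_birth": rng.randint(1920, 2020),
            # Roughly 5% of rows get a blank gender so the profile has something to find.
            "gender": rng.choice(["F", "M"]) if rng.random() > 0.05 else "",
        })
    return rows


def missing_counts(rows):
    """Count blank or None values per field (a stand-in for a real profiler)."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            if value in ("", None):
                counts[field] = counts.get(field, 0) + 1
    return counts


def time_profile(n):
    """Return (seconds taken, missing-value counts) for a dataset of n rows."""
    rows = make_patients(n)
    start = time.perf_counter()
    counts = missing_counts(rows)
    return time.perf_counter() - start, counts


for n in (1_000, 100_000):
    elapsed, counts = time_profile(n)
    print(f"{n:>9,} rows: {elapsed:.4f}s, missing={counts}")
```

Comparing the timings at each size gives a rough view of how a tool scales; the same pattern applies when the profiling step is an external tool rather than a Python function.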


Data Profiling - Pandas Profiling


Willing Partners - Validation & Testing

• We would sincerely like to thank you for your willingness, effort and time taken to evaluate/run specified tools in order to provide the outputs and any feedback to HDR-UK and Inspirata

Kieran Earlam, Policy and Evidence Manager

Rebecca Cosgriff, Director of Data & Quality Improvement

Victor L Banda, Data Analyst - Neonatal, Imperial College London

Cystic Fibrosis Trust tested “Aggregate Profiler”, “Knime” and “MobyDQ”

Neonatal Medicine Research Group tested “DataCleaner”, “Orange” and “MobyDQ”

“The huge number of extensions and functionality meant that when searching for solutions I was given multiple routes to achieve my goal, which can be both an advantage and a disadvantage”

“The in-window explanations were very detailed and helpful, and given the days and weeks it would take to learn the tool I’m sure it would effectively interrogate most data.”

“Since data was too big to download to local memory on an SQL Server instance on SQL Table widget, we later migrated to postgres SQL. However to adequately interact with the Postgres database, an installation of two modules: quantile and sm_system_time was required.”

“Installing these on a windows machine is not fully supported documentation-wise, when compared to Mac OS and Linux. I believe this exercise would have been best carried via Linux machine.”


What about Data Utility?

• Data Acquisition, Processing and Curation
  - Innate Data Quality
    • Accuracy/Veracity
    • Credibility/Verifiability
    • Consistency/Idempotency
    • Auditability
  - Rational Data Quality
    • Lineage
    • Unique/Concise Representation
    • Interpretability/Ease of Understanding
    • Representational Consistency
• Data Use and Sharing
  - Germane Data Quality
    • Completeness
    • Validity
    • Currency/Timeliness
    • Usability/Relevancy/Value-Added
  - Interoperable Data Quality
    • Accessibility
    • Access Security

The extent to which data is modelled and presented in an intelligible and clear manner

The extent to which data elements in a data set conform with specified domain standards

The extent to which data meets business needs and enables user tasks

The extent to which data are available, discoverable and securely obtainable


Is this the end of the story? What about Data Utility?


Key Findings and Recommendations

Maintain focus on metadata improvements and continuous assessment of quality

Application of chosen Data Profiling tools (KNIME plus Pandas) to datasets on Gateway

Evaluation of data profiling outputs to determine which representations/visuals should be integrated onto the Gateway

Many data custodians do not have the skills to successfully cleanse or profile their data: offer data engineering resources and/or standard data profiling pipelines

Data profiling only tells you what the quality of the dataset is… by itself it does not improve the quality - provide guidance [based on profiling] as to how data quality could be improved

Data quality is an important factor for re-use of healthcare data; but quality needs to be assessed relative to intended [downstream] use cases (utility)


Virtual-Only Delivery Project Team

INSPIRATA PROJECT ROLES
• Relationship Management
• Inspirata Leadership
• Inspirata Project Lead
• Quality Tool Assessment, Profiling and Framework Development
• Developer

HDR-UK PROJECT ROLES
• Project Steering Committee
• Project Sponsors
• HDR UK Project Leads
• Team Members

[Organisation chart. Names shown: Trevor, Oenone, Enez, Ben, Caroline, Vishnu, Clara, Neil, Susheel, Peggy, Gerry, David, Rich]


What comes next?

(and poll results)


Poll Results

From where does your interest in utilising health datasets stem? (56 responses)

Check the top three most relevant answers from the list below:

• Population health, disease rates and statistics - 63% (35/56)

• Improving public health and disease prevention - 52% (29/56)

• Real world data and evidence - 50% (28/56)

• Understanding the causes of disease - 43% (24/56)

• Translational medicine e.g. biomarker identification - 23% (13/56)

• Supporting clinical trials - 23% (13/56)

• Other - 21% (12/56)

• Pandemic/disease outbreak monitoring and preparedness - 20% (11/56)

• Drug safety/pharmacovigilance - 7% (4/56)

• Drug/product usage information - 5% (3/56)


Activity areas improving the data

• Data Utility Matrix
• International links
• Hubs
• Data Officers Groups
• National Priorities
• SIGs (FHIR, Synthetic data, etc.)
• Position/consultation papers
• Projects (e.g. COVID-19)

APPLIED ANALYTICS

HUMAN PHENOME


Questions?


Scan the QR Code or Visit the Links to Learn More!
