Improving Health Data: Establishing a Mechanism for the Assessment of Data Quality
Hosted by:
Ben Gordon, Executive Director: Hubs & Data Improvement, HDR UK
Trevor Heritage, Senior Vice President: Cancer Informatics, Inspirata
HDR UK’s mission is to unite the UK’s health data to enable discoveries that improve people’s lives
Our 20-year vision is for large scale data and advanced analytics to benefit every
patient interaction, clinical trial, biomedical discovery and enhance public health.
Our strategy focuses on uniting, improving and using health data….
Making it easier to discover and request access to relevant datasets via the Innovation Gateway
https://healthdatagateway.org/
• Abdominal Hernia
• Abdominal aortic aneurysm
• Acne
• Actinic keratosis
• Acute kidney injury
• …
• Vitiligo
• Volvulus
Sourced from hospitals, biobanks and national datasets
Uniting - The UK now has 488 health datasets discoverable for research and innovation
Covering over 300 phenotypes
From across 33 health data organisations
Covering all four nations of the UK and discoverable via a shared ‘front door’
Datasets by source: National datasets 49%; Biobanks 22%; Hubs 15%; NHS Hospitals 6%; Charities, cohorts & other 8%
Datasets by nation (chart legend: England, Scotland, Wales, NI, International; extracted values: 52%, 15%, 19%, 9%, 5%)
Most accessed dataset: COVID-19 Symptom Tracker (59 access requests)
[Chart: percentage of datasets rated Platinum, Gold, Silver, Bronze or Not yet Bronze, by category: Technical Quality, Value and Interest, Documentation, Access and Provision, Coverage]
Improving – the data is becoming more accessible and useful
There have been improvements to metadata (Apr–Sep 20)
Percentage of metadata fields completed (across different categories)
Metadata is information about a dataset, which allows users to understand what it contains
[Chart: % of metadata fields filled in April (440 datasets) vs September (481 datasets), by category of metadata: Technical, Attribution, Structure, Coverage, Administrative, Summary]
And there is now a way of measuring data utility (Hub Baseline Utility)
Baseline data utility* scores for 43 initial Hub datasets
Highest rated dataset is Comprehensive Patient Records for Cancer Outcomes, scoring Gold across all categories
* Data Utility Evaluation Matrix available here
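As a rough illustration of how a "percentage of metadata fields completed" figure could be computed per category, here is a minimal sketch. The category names come from the chart above; the field names, the field-to-category mapping and the two example records are hypothetical and are not the Gateway's actual metadata schema.

```python
from collections import defaultdict

# Hypothetical mapping of metadata fields to the chart's categories.
# The field names are illustrative only, not the Gateway schema.
FIELD_CATEGORIES = {
    "title": "Summary",
    "abstract": "Summary",
    "publisher": "Administrative",
    "contact_point": "Administrative",
    "geographic_coverage": "Coverage",
    "data_model": "Structure",
    "citation": "Attribution",
    "file_format": "Technical",
}


def is_filled(value):
    """A field counts as filled if it is neither None nor blank."""
    return value is not None and str(value).strip() != ""


def completeness_by_category(records):
    """Return the percentage of filled fields per category across all records."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        for field, category in FIELD_CATEGORIES.items():
            total[category] += 1
            if is_filled(record.get(field)):
                filled[category] += 1
    return {c: round(100 * filled[c] / total[c], 1) for c in total}


# Two invented metadata records with some gaps.
datasets = [
    {"title": "Example A", "abstract": "Demo", "publisher": "Org 1",
     "contact_point": "", "geographic_coverage": "UK", "data_model": None,
     "citation": "", "file_format": "csv"},
    {"title": "Example B", "abstract": "", "publisher": "Org 2",
     "contact_point": "a@example.org", "geographic_coverage": "",
     "data_model": "OMOP", "citation": "", "file_format": ""},
]

scores = completeness_by_category(datasets)
print(scores)
```

Running the same function on two snapshots of the metadata (for example April and September) would give the kind of before/after comparison shown in the chart.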
Using - health data is being used to enable discoveries that improve people’s lives
As a result, we now know a lot more about diseases such as COVID-19 because of health data research
Over 3,340* people, globally, have been discovering data through the UK’s health data research Gateway – with
more than 6,100 searches/month
*unique users - 27th April – 8th September 2020
…and we have strategic use cases that will help to further increase and improve data discovery, access, and utility
Audience Poll:
From where does your interest in utilising health datasets stem?
Check the top three most relevant answers from the list below:
• Population health, disease rates and statistics
• Understanding the causes of disease.
• Drug/product usage information
• Translational medicine, e.g. biomarker identification
• Supporting clinical trials.
• Improving public health and disease prevention.
• Real world data and evidence
• Pandemic/disease outbreak monitoring and preparedness.
• Treatment effectiveness and outcomes
• Drug safety / pharmacovigilance
• Other
Data Quality Assessment Project
HDR UK Webinar
Why Inspirata?
✓ Inspirata has worked with multiple complex healthcare
organizations to deploy our solutions, helping them create
value and insights from their data assets
✓ Deep expertise in clinical data curation and quality, including
data sharing and associated PHI and data security issues
✓ Demonstrable, rigorous approach to creating and delivering
data quality
✓ Vendor neutral – we utilize data cleansing, quality and
profiling tools, but we do not supply them
✓ Understanding of potential downstream uses of clinical data
(utility of data)
✓ Suite of tools for cancer reporting and registry, clinical
decision support, trials matching, and data analytics – all of
which depend on robust data quality
✓ Our R&D efforts have been focused on building software to
support our clinical and administrative workflows to improve
care, as well as research initiatives and collaborative data
sharing programs
✓ We work directly with state registries and hospitals who use
our NLP and AI software enabling identification of reportable
cancer cases and auto-abstraction and coding those cases for
registry purposes
Inspirata delivers infrastructure, domain expertise, out-of-the-box functionality and tool sets that improve data quality and processes today while being able to adapt to the future.
Inspirata’s Approach
Why is data quality an issue? Why is healthcare data challenging?
Data Definitions: subjective, based on source
Location: where is your data? It’s not all digitized in healthcare
Data Structure: structured vs non-structured (e.g. a pathology report)
Regulations & Requirements: keeping up with the government
Data Complexity: from humans to data warehouses
Data Format: php, asp, wcf, csv, pdf, java, ruby, xml, fastq, vcf, doc, html, bam, xsl, hl7, ppt, mov, jpg, png, wmv
What is Meant by Data Quality?
Metadata: Structured information that characterizes the dataset – what it is, where it is, assumptions made, etc.
Data Quality: How “good” is the data [for its intended purpose]. We still need to define what is meant by “good”…
Data Utility: A perfectly described/high quality dataset may be useful to you, but not to me. What use cases can the dataset support?
Importance of First-Class Metadata
Who benefits & how? HIGH-GRADE METADATA
➢ ADVERTISES THE ORGANIZATION’S WILLINGNESS TO FACILITATE RESEARCH
➢ SAVE TIME AND RESOURCES
➢ UNDERSTAND THE DATASET
➢ FIND DATA: DETERMINE WHAT DATA EXISTS
➢ DETERMINE APPLICABILITY; DECIDE IF A DATASET MEETS NEED
➢ DISCOVER HOW TO ACQUIRE, PROCESS & USE THE DATASET
➢ AVOID DATA DUPLICATION
➢ SHARE RELIABLE INFORMATION
➢ OPTIMIZED INFRASTRUCTURE
➢ USE DATA AFTER INITIAL INTENDED PURPOSE
➢ TRACK DATA USE AND FACILITATE PUBLICATION
Organizations
➢ REMOVE RELIANCE ON SINGLE MEMBERS OF STAFF
A GOOD METADATA RECORD PROVIDES ALL THE CRITICAL INFORMATION FOR DISCOVERY, UNDERSTANDING, AND REUSE.
Metadata Analysis – Proven Improvement
• Evaluated metadata for >400 datasets published to the HDR UK Innovation Gateway
• Assessed each dataset by type/structure and content/data model
• Conducted multiple assessments – April and June
• Significant improvement in quality of metadata
How is Data Quality Measured? DAMA Quality Dimensions
How is Data Quality Achieved? Across All Phases of Processing…
What Tools Were Assessed?
KNIME (Data analytics, profiling, reporting and integration platform)
Pandas Profiling, using Pandas I/O (Python module for exploratory data analysis (EDA))
Orange (Data visualization, machine learning, data profiling and mining toolkit)
RapidMiner (FREE VERSION) (Integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics)
WEKA (Machine learning software to solve data mining problems)
Anonimatron (Pseudonymizes datasets)
ARX Data Anonymization (Scalable data anonymization tool; supports multiple privacy models)
WhiteRabbit (Tool to help prepare for ETLs of healthcare datasets)
Aggregate Profiler (AP) (Data profiling and analysis tool)
Talend Open Studio for Data Integration (FREE VERSION) (Data integration and ETL)
Talend Open Studio for Big Data (FREE VERSION) (ETL for large and diverse data sets)
Talend Open Studio for Data Quality (FREE VERSION) (Assesses accuracy and integrity of data; data profiling tool)
Talend Open Studio for ESB (FREE VERSION)
Talend Open Studio for MDM (FREE VERSION) (Key capabilities for data governance and master data management)
OpenRefine (Tool for cleaning and transforming data)
DataCleaner (COMMUNITY EDITION) (Data profiling, data cleaning, and data integration tool; offers integration with Pentaho)
DataPreparator (Preprocessing: data cleaning, transformation, and exploration)
Data Match (30-DAY FREE TRIAL) (Visual data cleansing application; a component of Data Ladder)
DataMartist (30-DAY FREE TRIAL) (Visual data profiling and data transformation tool)
Pentaho Kettle (COMMUNITY EDITION) (ETL tool; integrates with WEKA for data profiling)
SQL Power Architect (COMMUNITY EDITION) (Data modeling & profiling tool)
SQL Power Dqguru (COMMUNITY EDITION) (Data cleansing & MDM tool)
DQ Analyzer (COMMUNITY EDITION) (Data profiling tool)
Pimcore (Data management, integration, PIM, MDM, DAM)
CytoScape (Software platform for visualizing molecular interaction networks and biological pathways)
Anaconda (Data science platform)
pyxplorer (A simple tool that allows interactive profiling of datasets)
MobyDQ (Aims to automate data quality checks during data processing)
Data Quality Tools Assessment
• Assessed ~30 data quality tools against Gartner categorization of key capabilities (https://www.gartner.com/en/documents/3913549)
• Extensible inventory of tools, with detailed assessment criteria for >70 functionalities
• Ranking of open data quality tools in each category:
   • Profiling
   • Parsing, Standardizing & Cleansing
   • Interactive Visualization
   • Matching, Linking & Merging
   • Multi-domain Support
   • Workflow
   • Scalability/Performance
   • Usability
Data Profiling - Deep Dive
• Detailed comparison of data profiling tools against multiple objective criteria relating to DAMA data quality dimensions:
   • Completeness
   • Consistency
   • Uniqueness
   • Validity
   • Accuracy
   • Timeliness
• Assessment against synthetic data (Synthea) for capability, performance and scalability
• Verification by Cystic Fibrosis and Neonatal Medicine Research Group
• Final recommendation for KNIME + Pandas Profiling (or Orange + Pandas Profiling)
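To make a few of the DAMA dimensions concrete, the sketch below computes completeness, uniqueness and validity for a tiny, invented patient table using only the Python standard library. It is an illustration of the kind of check a profiling tool automates, not the method used in the assessment; the records, the allowed sex codes and the reference date are all assumptions.

```python
from datetime import date

# Invented records for illustration only.
records = [
    {"patient_id": "P001", "birth_date": "1980-04-12", "sex": "F"},
    {"patient_id": "P002", "birth_date": "", "sex": "M"},
    {"patient_id": "P002", "birth_date": "2090-01-01", "sex": "X"},
    {"patient_id": "P004", "birth_date": "1975-13-40", "sex": "F"},
]

ALLOWED_SEX = {"F", "M"}   # assumed code list
AS_OF = date(2020, 9, 8)   # assumed reference date for "not in the future"


def completeness(rows, field):
    """Share of rows where the field is present and non-blank."""
    filled = sum(1 for r in rows if (r.get(field) or "").strip())
    return filled / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among all values of the field."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)


def parse_date(text):
    """Return a date for a well-formed ISO string, otherwise None."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None


def validity(rows):
    """Share of rows with an allowed sex code and a real, non-future birth date.

    A blank birth date is counted as invalid here as well as incomplete.
    """
    def ok(row):
        parsed = parse_date(row["birth_date"])
        return row["sex"] in ALLOWED_SEX and parsed is not None and parsed <= AS_OF

    return sum(1 for r in rows if ok(r)) / len(rows)


profile = {
    "completeness_birth_date": completeness(records, "birth_date"),
    "uniqueness_patient_id": uniqueness(records, "patient_id"),
    "validity": validity(records),
}
print(profile)
```

Accuracy and timeliness are left out because they need an external reference (a trusted source or an expected refresh date) that a single table cannot supply.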
Synthetic Reference Data Creation
• Synthetic data sets were created using the open source tool WhiteRabbit to generate 1,000 patient and related clinical data CSV files and a SQL database adhering to the OMOP data model.
• To evaluate the performance and scalability of each tool, an additional dataset of 1.3 million records was also generated.
• WhiteRabbit utilizes the application Rabbit-in-a-Hat, which in turn utilizes Synthea to generate patient data and map it into an OMOP model.
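For readers who want a feel for what a synthetic reference file looks like, here is a minimal stand-in that does not use WhiteRabbit or Synthea. It generates 1,000 seeded random rows with three columns loosely modelled on the OMOP person table. The column selection and value ranges are assumptions for illustration, and the output has none of the clinical realism of Synthea data.

```python
import csv
import io
import random

# Columns loosely modelled on the OMOP person table (an illustrative subset).
FIELDNAMES = ["person_id", "gender_concept_id", "year_of_birth"]

# OMOP standard concept IDs for gender: 8507 = male, 8532 = female.
GENDER_CONCEPTS = [8507, 8532]


def generate_person_rows(n, seed=42):
    """Generate n synthetic person rows; the seed makes the output repeatable."""
    rng = random.Random(seed)
    return [
        {
            "person_id": person_id,
            "gender_concept_id": rng.choice(GENDER_CONCEPTS),
            "year_of_birth": rng.randint(1920, 2020),  # assumed range
        }
        for person_id in range(1, n + 1)
    ]


def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = generate_person_rows(1000)
csv_text = to_csv(rows)
print(csv_text.splitlines()[0])
```

Raising `n` to a much larger value gives a simple way to probe how a profiling tool scales, in the spirit of the 1.3 million record test above.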
Data Profiling - Pandas Profiling
Willing Partners - Validation & Testing
• We would sincerely like to thank you for your willingness, effort and time taken to evaluate/run specified tools in order to provide the outputs and any feedback to HDR-UK and Inspirata
Kieran Earlam, Policy and Evidence Manager
Rebecca Cosgriff, Director of Data & Quality Improvement
Victor L Banda, Data Analyst - Neonatal, Imperial College London
“The huge number of extensions and functionality meant that when searching for solutions I was given multiple routes to achieve my goal, which can be both an advantage and a disadvantage”
“The in-window explanations were very detailed and helpful, and given the days and weeks it would take to learn the tool I’m sure it would effectively interrogate most data.”
“Since data was too big to download to local memory on an SQL Server instance on SQL Table widget, we later migrated to postgres SQL. However to adequately interact with the Postgres database, an installation of two modules: quantile and sm_system_time was required.”
“Installing these on a windows machine is not fully supported documentation-wise, when compared to Mac OS and Linux. I believe this exercise would have been best carried via Linux machine.”
Cystic Fibrosis Trust tested “Aggregate Profiler”, “Knime” and “MobyDQ”
Neonatal Medicine Research Group tested “DataCleaner”, “Orange” and “MobyDQ”
What about Data Utility?
• Data Acquisition, Processing and Curation
   - Innate Data Quality: the extent to which data elements in data set conform with specified domain standards
      • Accuracy/Veracity
      • Credibility/Verifiability
      • Consistency/Idempotency
      • Auditability
   - Rational Data Quality: the extent to which data is modelled and presented in an intelligible and clear manner
      • Lineage
      • Unique/Concise Representation
      • Interpretability/Ease of Understanding
      • Representational Consistency
• Data Use and Sharing
   - Germane Data Quality: the extent to which data meets business needs and enables user tasks
      • Completeness
      • Validity
      • Currency/Timeliness
      • Usability/Relevancy/Value-Added
   - Interoperable Data Quality: the extent to which data are available, discoverable and securely obtainable
      • Accessibility
      • Access Security
Is this the end of the story? What about Data Utility?
Key Findings and Recommendations
Maintain focus on metadata improvements and continuous assessment of quality
Application of chosen Data Profiling tools (KNIME plus Pandas) to datasets on Gateway
Evaluation of data profiling outputs to determine which representations/visuals should be integrated onto the Gateway
Many data custodians do not have the skills to successfully cleanse or profile their data – offer data engineering resources and/or standard data profiling pipelines
Data profiling only tells you what the quality of the dataset is… by itself it does not improve the quality - provide guidance [based on profiling] as to how data quality could be improved
Data quality is an important factor for re-use of healthcare data; but quality needs to be assessed relative to intended [downstream] use cases (utility)
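The last point, that quality has to be judged against the intended use, can be sketched as a simple fitness check: the same profiling output passes for one use case and fails for another. The field names, completeness scores and thresholds below are hypothetical, chosen only to show the idea.

```python
def fit_for_use(completeness, requirements):
    """Compare per-field completeness with the minimums a use case needs.

    Returns (is_fit, shortfalls), where shortfalls maps each failing field
    to a (actual, required) pair. Fields missing from the profile count as 0.0.
    """
    shortfalls = {
        field: (completeness.get(field, 0.0), minimum)
        for field, minimum in requirements.items()
        if completeness.get(field, 0.0) < minimum
    }
    return (not shortfalls, shortfalls)


# Hypothetical profiling output for one dataset (share of non-missing values).
profile = {"diagnosis_code": 0.98, "postcode": 0.40, "biomarker_result": 0.10}

# Hypothetical minimum completeness per use case.
use_cases = {
    "population_statistics": {"diagnosis_code": 0.95},
    "biomarker_identification": {"diagnosis_code": 0.95, "biomarker_result": 0.80},
}

results = {name: fit_for_use(profile, needs) for name, needs in use_cases.items()}
for name, (is_fit, shortfalls) in results.items():
    print(name, "fit" if is_fit else "not fit", shortfalls)
```

Here the dataset is fit for population statistics but not for biomarker identification, and the shortfall list doubles as guidance on which fields a custodian would need to improve.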
Virtual-Only Delivery Project Team
[Organisation chart: INSPIRATA PROJECT ROLES and HDR-UK PROJECT ROLES]
Roles shown: Relationship Management; Project Steering Committee; Project Sponsors; HDR UK Project Leads; Team Members; Inspirata Leadership; Inspirata Project Lead; Quality Tool Assessment, Profiling and Framework Development; Developer
Names shown: Trevor, Oenone, Enez, Ben, Caroline, Vishnu, Clara, Neil, Susheel, Peggy, Gerry, David, Rich
What comes next?
(and poll results)
Poll Results
From where does your interest in utilising health datasets stem? (56 responses)
Check the top three most relevant answers from the list below:
• Population health, disease rates and statistics - 63% (35/56)
• Improving public health and disease prevention - 52% (29/56)
• Real world data and evidence - 50% (28/56)
• Understanding the causes of disease - 43% (24/56)
• Translational medicine e.g. biomarker identification - 23% (13/56)
• Supporting clinical trials - 23% (13/56)
• Other - 21% (12/56)
• Pandemic/disease outbreak monitoring and preparedness - 20% (11/56)
• Drug safety/pharmacovigilance - 7% (4/56)
• Drug/product usage information - 5% (3/56)
Activity areas improving the data
Data Utility Matrix
International links
Hubs
Data Officers Groups
National Priorities
SIGs (FHIR, Synthetic data, etc.)
Position/consultation papers
Projects (e.g. COVID-19)
APPLIED ANALYTICS
HUMAN PHENOME
Questions?
Scan the QR Code or Visit the Links to Learn More!