TRANSCRIPT
Architecture of the DIFUTURE Data Integration Centers
Jörg Peter
Oct 10, 2019
Introduction
meDIC – Concept
[Diagram: primary clinical systems hand data over via connectors; the Trust Center (TC) pseudonymizes it for the Data Lake; project-specific pseudonymization feeds project DWHs and e.g. the PHT.]
• Hand-over of clinical data via connectors; file- or REST-based: generic JSON as well as several standard formats (XML, HL7 v2, …)
• Pseudonymization of data for the DL and further processing; Trust Center (TC) as pseudonymization component
• Data Lake (DL): technical harmonization in JSON format in a PostgreSQL DB
• Structural and semantic integration on the Data Lake & during export
• Data provisioning for project-specific Data Warehouses (DWH)
Data Lake & Data Warehouse – why?
Classic Data Warehouse:
• Structural & semantic integration during data import
• Resources for integration are needed for all data to be imported
Data Lake:
• Only structural / technical integration during data import
• Semantic integration & transformation only for data to be used
• Secondary project DWHs, with recirculation of semantically integrated data into the lake
• Effort & resources focused on demand (driven by UCs & projects)
Software & Tools
Implementation with free & open-source software and tools:
• Java with Spring framework & extensions
  • Spring Boot, Spring Cloud, Spring Cloud Data Flow (SCDF)
  • Stream & microservice architecture with Apache Kafka as message broker
• Keycloak as central identity management (IDM)
  • Allows Single Sign-On (SSO)
• Dockerization is used for distribution of software components & Java artefacts
  • Cross-site provisioning of core components in Artifactories & Docker registries
Microservices, Streams & Spring Cloud Data Flow
[Diagram: microservices (Spring Boot applications) connected via message topics (e.g. within Apache Kafka): a Source ("get data") feeds Processors ("Transform to A", "Transform to B"), which feed Sinks ("Store in DB", "Store as File"). One such stage is sketched below.]
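As a minimal sketch of how one such processor stage might be written, assuming Spring Cloud Stream's functional programming model (the String payload and the upper-casing transformation are hypothetical placeholders; the function is bound to Kafka input/output topics via configuration, not code):

```java
// Sketch of a stream processor stage ("Transform to A") as a Spring Boot app.
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class TransformToAApplication {

    public static void main(String[] args) {
        SpringApplication.run(TransformToAApplication.class, args);
    }

    // Registered as a processor: consumes from an input topic and
    // publishes the transformed message to an output topic.
    @Bean
    public Function<String, String> transformToA() {
        return payload -> payload.toUpperCase(); // placeholder transformation
    }
}
```

In SCDF, such stages are then composed into a stream with the pipe DSL, e.g. `some-source | transform-to-a | jdbc`.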
Keycloak
• Single Sign-On solution (SSO)
• OAuth2 & OpenID Connect or SAML
• Authenticating clients and services
• Allows user federation & identity brokering: connection to clinical IDM / AD / LDAP (tested @ UKT)
• Integration in web frontends of portals and REST interfaces:
  • Warehouses, e.g. tranSMART / Glowing Bear
  • Data Lake components
  • Trust Center components
• Integration with Spring Security (see the sketch below)
• Interconnection with most of the developed & used software solutions planned or already established
https://www.keycloak.org/
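A minimal sketch of what the Spring Security integration might look like for a REST interface, assuming Spring Security's OAuth2 resource-server support with a Keycloak realm as token issuer (the deck does not show the actual configuration; the issuer URL in the comment is a hypothetical placeholder):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http
            // every REST call must carry a valid token
            .authorizeHttpRequests(auth -> auth.anyRequest().authenticated())
            // validate Keycloak-issued JWTs; the realm is configured via
            // spring.security.oauth2.resourceserver.jwt.issuer-uri, e.g.
            // https://keycloak.example.org/realms/medic (hypothetical URL)
            .oauth2ResourceServer(oauth2 -> oauth2.jwt(jwt -> {}));
        return http.build();
    }
}
```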
Technical details & Challenges
Status
First versions of the core components finalized:
• Development, test & documentation: Trust Center, Data Lake, Orchestration, Project-DWH
• Containerization (Docker) for core components & warehousing solutions
• Deployment & roll-out tests in productive environments of the meDICs at core consortium partners and at the roll-out partners
So far, the targeted technical milestones have been achieved successfully.
Import-Pipeline
[Diagram: connectors deliver data to a transfer point for integrated processing, either via direct connection to clinical systems (e.g. RESTful JSON or HL7 / raw) or via file import (HL7, CSV, JSON, …).]
Trust Center Software
• FHIR-based interface (see the sketch below)
• Configurable number of pseudonymization stages
• Encryption at rest, in transit & in use
• Keycloak integration
Example: Search in the entity list
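Since the deck only states that the interface is FHIR-based, here is a hedged sketch of what a pseudonymization call could look like with the HAPI FHIR client; the endpoint URL, the `$pseudonymize` operation name and the parameter names are invented for illustration:

```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.r4.model.Parameters;
import org.hl7.fhir.r4.model.StringType;

public class TrustCenterClientSketch {
    public static void main(String[] args) {
        FhirContext ctx = FhirContext.forR4();
        // Hypothetical Trust Center endpoint URL.
        IGenericClient client = ctx.newRestfulGenericClient("https://tc.example.org/fhir");

        // Hand over an identifier to pseudonymize; parameter and operation
        // names are placeholders, not the actual Trust Center API.
        Parameters input = new Parameters();
        input.addParameter().setName("identifier").setValue(new StringType("patient-123"));

        Parameters output = client.operation()
                .onServer()
                .named("$pseudonymize")
                .withParameters(input)
                .execute();
        System.out.println(output.getParameterFirstRep().getValue());
    }
}
```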
Data Lake
• Aggregation: each block in the DL contains information about a single patient only, so usage, information rights and delete requests can be handled in a patient-centric manner (consent & GDPR)
• Provenance: sources and processing steps are stored in the metadata of the data block (see the sketch below)
• Validation: validation of the processing steps and block types prior to writing into the Data Lake & during data export
• Automation: configurable triggers for block events (create, update, delete); (automated) processing pipelines (e.g. semantic integration of blocks, LOINC mapping, …)
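To make the block properties concrete, a hypothetical Java record combining the aggregation and provenance points above (the field names are illustrative, not the actual schema):

```java
import java.time.Instant;
import java.util.List;

// Hypothetical illustration of a Data Lake block; not the actual schema.
public record DataLakeBlock(
        String blockId,
        String blockType,             // checked against known block types on write
        String patientPseudonym,      // aggregation: exactly one patient per block
        List<String> sources,         // provenance: originating systems
        List<String> processingSteps, // provenance: transformations applied so far
        Instant createdAt,
        String payloadJson            // technically harmonized JSON payload
) {}
```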
Data Lake – Transformation Types
• 1:1 transform: convert a single block (trigger → transform → store); see the sketch below
• 1:n split: split one block into several (trigger → split block → store)
• n:1 accumulate, only for blocks of the same patient (trigger → get additional blocks → accumulate → store)
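The three types could be modelled roughly as follows (a hypothetical sketch reusing the DataLakeBlock record from above; in the actual system these operations run as stream processors):

```java
import java.util.List;

// Hypothetical interfaces for the three transformation types.
interface OneToOneTransform {
    // 1:1 transform: convert a single block
    DataLakeBlock transform(DataLakeBlock input);
}

interface OneToManySplit {
    // 1:n split: break one block into several
    List<DataLakeBlock> split(DataLakeBlock input);
}

interface ManyToOneAccumulate {
    // n:1 accumulate: all inputs must belong to the same patient
    DataLakeBlock accumulate(List<DataLakeBlock> inputsOfSamePatient);
}
```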
Data Lake – Trigger concept
When to perform a transform operation?
• Automatic: a trigger (INSERT, UPDATE, DELETE) fires for each PostgreSQL database event; process every new/updated block of type X
• Scheduled: process all (new) blocks of type X at time Y (sketched below)
• Manual: single-shot (convert all), retrospective
• Combinations: manual single-shot & automatic update, single-shot update, …
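The "Scheduled" variant might look like this in Spring (a hypothetical sketch; the cron expression, class and method names are placeholders, and @EnableScheduling is assumed on a configuration class):

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical sketch of the "Scheduled" trigger variant.
@Component
public class ScheduledBlockProcessor {

    // "Process all (new) blocks of type X at time Y": here, nightly at 02:00.
    @Scheduled(cron = "0 0 2 * * *")
    public void processNewBlocksOfTypeX() {
        // Fetch blocks of type X created since the last run and hand them
        // to the transformation pipeline (details omitted).
    }
}
```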
Data Lake – Transformation
• TRIGGER: detect create, update & delete events; manual actions, export requests, schedules
• PLANNING: match plans and plan the processing steps; define & store what to do, stored as a stream plan
• ASSEMBLY: analyze the triggered action, fetch the required processing steps, choose the required pipeline path, collect all required blocks (one or multiple)
• TRANSFORM: transform the collected data into one or multiple output blocks
• STORE: store the transformed outputs in the Data Lake, warehouses or files
Data Lake – ETL
[Diagram of the full pipeline IMPORT → TRIGGER → PLANNING → ASSEMBLY → TRANSFORM → STORE: automated database triggers or manual/scheduled triggers start the flow, the required data is fetched, the block is transformed, and the destination is determined (back to the lake or export). The routing step is sketched below.]
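The "determine destination" step could be sketched with Spring Cloud Stream's StreamBridge, which sends a payload to a binding chosen at runtime (the binding names and the routing flag are hypothetical; the DataLakeBlock type is the sketch from above):

```java
import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.stereotype.Component;

// Hypothetical sketch of the "determine destination" step: transformed
// blocks go either back into the lake or to an export binding.
@Component
public class DestinationRouter {

    private final StreamBridge streamBridge;

    public DestinationRouter(StreamBridge streamBridge) {
        this.streamBridge = streamBridge;
    }

    public void route(DataLakeBlock block, boolean export) {
        // Binding names "export-out" and "lake-out" are placeholders.
        streamBridge.send(export ? "export-out" : "lake-out", block);
    }
}
```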
Data Lake challenges
• Stream processing, orchestration & microservices
• Error handling & data routing: dealing with processing errors, "clogging" of the pipeline, rejection of messages for examination (see the DLQ sketch below)
• Orchestration & management: development of a simplified management & deployment portal
[Diagram: a stream (:in → Proc. 1 → Proc. 2 / Proc. 3 → Sink 1 / Sink 2 → :out) with a dead-letter queue (DLQ) for failed messages.]
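One hedged way to realize the DLQ box with the stack named above is the dead-letter support of Spring Cloud Stream's Kafka binder; a configuration sketch (the binding name "input" and the topic name are placeholders):

```properties
# Retry a failing message a few times, then publish it to a dead-letter
# topic instead of clogging the pipeline; rejected blocks can later be
# examined and replayed.
spring.cloud.stream.bindings.input.consumer.max-attempts=3
spring.cloud.stream.kafka.bindings.input.consumer.enableDlq=true
spring.cloud.stream.kafka.bindings.input.consumer.dlqName=datalake-dlq
```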
Warehousing & Portals
Provisioning of the integrated data in project-specific Data Warehouses
• Extension of tranSMART / Glowing Bear: open-source improvement; tender & project with The Hyve (NL)
  • Dockerization of project-specific data warehouses
  • HL7 FHIR-based import
  • Slicing & dicing of tranSMART instances
  • Filtering by genomic variants via connection to a Genomic Variant Store
• Usage of i2b2/tranSMART for the MS UC: Dockerized i2b2/tranSMART instances for MS-UC data
• Besides the DIFUTURE UCs: successful application of DW & Trust Center components in a microbiome profiling project (TUM)
Further challenges
• Improved monitoring of all services (e.g. Prometheus, Grafana, …); see the sketch below
  • Monitor pipelines & message throughput
  • Memory consumption & disk storage usage
  • …
• Improved provisioning & parametrization of the software bundles for individual sites (e.g. with Chef or Puppet)
  • Specific configurations for server settings, certificates
  • (Semi-)automated deployment of new instances
• Sustained productive operation & update processes
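A hedged sketch of how message throughput could be exposed to Prometheus via Micrometer (the metric name is a hypothetical placeholder; Spring Boot Actuator's Prometheus endpoint would serve the values for scraping):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

// Hypothetical sketch: count processed messages so Prometheus/Grafana
// can graph pipeline throughput.
@Component
public class ThroughputMetrics {

    private final Counter processedBlocks;

    public ThroughputMetrics(MeterRegistry registry) {
        this.processedBlocks = Counter.builder("datalake.blocks.processed")
                .description("Number of blocks processed by the pipeline")
                .register(registry);
    }

    public void onBlockProcessed() {
        processedBlocks.increment();
    }
}
```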
Further challenges (cont.)
• Extension to additional data sources
  • Doctors' letters / document archives
  • Radiology images / DICOM
  • Genetic information (SNPs, …)
  • E.g. metadata collected in the Data Lake with references to the data sources
• Natural Language Processing (NLP) / text mining
Thanks.
Contact:
Dr. rer. nat. Jörg Peter, M.Sc. Bioinformatics
University Hospital Tübingen, Department for IT & Applied Medical Informatics
Institute for Translational Bioinformatics, Hoppe-Seyler-Straße 9, 72076 Tübingen
Phone: +49 7071 29-84317
Funding codes: 01ZZ1603 [A-D] & 01ZZ1804 [A-I]