TRANSCRIPT
Architecture of the DIFUTURE Data Integration Centers
Jörg Peter
Oct 10, 2019
Introduction
meDIC – Concept
[Diagram: primary clinical systems hand data over via connectors; the Trust Center (TC) pseudonymizes it for the Data Lake; project-specific pseudonymization feeds project DWHs and e.g. the PHT.]
• Hand-over of clinical data via connectors; file- or REST-based: generic JSON as well as several standard formats (XML, HL7 v2, …)
• Pseudonymization of data for the DL and further processing; Trust Center (TC) as pseudonymization component
• Data Lake (DL): technical harmonization in JSON format in a PostgreSQL DB
• Structural and semantic integration on the Data Lake & during export
• Data provisioning for project-specific Data Warehouses (DWH)
Data Lake & Data Warehouse – why?
Classic Data Warehouse:
• Structural & semantic integration during data import
• Resources for integration are needed for all data to be imported
Data Lake:
• Only structural / technical integration during data import
• Semantic integration & transformation only for data to be used
• Secondary project DWHs, with recirculation of semantically integrated data into the lake
• Effort & resources focused on demand (driven by UCs & projects)
Software & Tools
Implementation with free & open-source software and tools:
• Java with Spring framework & extensions
  • Spring Boot, Spring Cloud, Spring Cloud Data Flow (SCDF)
  • Stream & microservice architecture with Apache Kafka as message broker
• Keycloak as central identity management (IDM)
  • Allows Single Sign-On (SSO)
• Dockerization is used for distribution of software components & Java artefacts
  • Cross-site provisioning of core components in Artifactories & Docker registries
Microservices, Streams & Spring Cloud Data Flow
[Diagram: microservices (Spring Boot applications) connected via message topics (e.g. within Apache Kafka): a Source ("get data") feeds Processors ("Transform to A", "Transform to B"), which feed Sinks ("Store in DB", "Store as File"). One such stage is sketched below.]
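As a minimal sketch of how one such processor stage might be written, assuming Spring Cloud Stream's functional programming model (the String payload and the upper-casing transformation are hypothetical placeholders; the function is bound to Kafka input/output topics via configuration, not code):

```java
// Sketch of a stream processor stage ("Transform to A") as a Spring Boot app.
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class TransformToAApplication {

    public static void main(String[] args) {
        SpringApplication.run(TransformToAApplication.class, args);
    }

    // Registered as a processor: consumes from an input topic and
    // publishes the transformed message to an output topic.
    @Bean
    public Function<String, String> transformToA() {
        return payload -> payload.toUpperCase(); // placeholder transformation
    }
}
```

In SCDF, such stages are then composed into a stream with the pipe DSL, e.g. `some-source | transform-to-a | jdbc`.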
Keycloak
• Single Sign-On solution (SSO)
• OAuth2 & OpenID Connect or SAML
• Authenticating clients and services
• Allows user federation & identity brokering: connection to clinical IDM / AD / LDAP (tested @ UKT)
• Integration in web frontends of portals and REST interfaces:
  • Warehouses, e.g. tranSMART / Glowing Bear
  • Data Lake components
  • Trust Center components
• Integration with Spring Security (see the sketch below)
• Interconnection with most of the developed & used software solutions planned or already established
https://www.keycloak.org/
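A minimal sketch of what the Spring Security integration might look like for a REST interface, assuming Spring Security's OAuth2 resource-server support with a Keycloak realm as token issuer (the deck does not show the actual configuration; the issuer URL in the comment is a hypothetical placeholder):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http
            // every REST call must carry a valid token
            .authorizeHttpRequests(auth -> auth.anyRequest().authenticated())
            // validate Keycloak-issued JWTs; the realm is configured via
            // spring.security.oauth2.resourceserver.jwt.issuer-uri, e.g.
            // https://keycloak.example.org/realms/medic (hypothetical URL)
            .oauth2ResourceServer(oauth2 -> oauth2.jwt(jwt -> {}));
        return http.build();
    }
}
```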
Technical details & Challenges
Status
First versions of the core components finalized:
• Development, test & documentation: Trust Center, Data Lake, Orchestration, Project-DWH
• Containerization (Docker) for core components & warehousing solutions
• Deployment & roll-out tests in productive environments of the meDICs at core consortium partners and at the roll-out partners
So far, the targeted technical milestones have been achieved successfully.
Import-Pipeline
[Diagram: connectors deliver data to a transfer point for integrated processing, either via direct connection to clinical systems (e.g. RESTful JSON or HL7 / raw) or via file import (HL7, CSV, JSON, …).]
Trust Center Software
• FHIR-based interface (see the sketch below)
• Configurable number of pseudonymization stages
• Encryption at rest, in transit & in use
• Keycloak integration
Example: Search in the entity list
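Since the deck only states that the interface is FHIR-based, here is a hedged sketch of what a pseudonymization call could look like with the HAPI FHIR client; the endpoint URL, the `$pseudonymize` operation name and the parameter names are invented for illustration:

```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.r4.model.Parameters;
import org.hl7.fhir.r4.model.StringType;

public class TrustCenterClientSketch {
    public static void main(String[] args) {
        FhirContext ctx = FhirContext.forR4();
        // Hypothetical Trust Center endpoint URL.
        IGenericClient client = ctx.newRestfulGenericClient("https://tc.example.org/fhir");

        // Hand over an identifier to pseudonymize; parameter and operation
        // names are placeholders, not the actual Trust Center API.
        Parameters input = new Parameters();
        input.addParameter().setName("identifier").setValue(new StringType("patient-123"));

        Parameters output = client.operation()
                .onServer()
                .named("$pseudonymize")
                .withParameters(input)
                .execute();
        System.out.println(output.getParameterFirstRep().getValue());
    }
}
```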
Data Lake
• Aggregation: each block in the DL contains information about a single patient only, so usage, information rights and delete requests can be handled in a patient-centric manner (consent & GDPR)
• Provenance: sources and processing steps are stored in the metadata of the data block (see the sketch below)
• Validation: validation of the processing steps and block types prior to writing into the Data Lake & during data export
• Automation: configurable triggers for block events (create, update, delete); (automated) processing pipelines (e.g. semantic integration of blocks, LOINC mapping, …)
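To make the block properties concrete, a hypothetical Java record combining the aggregation and provenance points above (the field names are illustrative, not the actual schema):

```java
import java.time.Instant;
import java.util.List;

// Hypothetical illustration of a Data Lake block; not the actual schema.
public record DataLakeBlock(
        String blockId,
        String blockType,             // checked against known block types on write
        String patientPseudonym,      // aggregation: exactly one patient per block
        List<String> sources,         // provenance: originating systems
        List<String> processingSteps, // provenance: transformations applied so far
        Instant createdAt,
        String payloadJson            // technically harmonized JSON payload
) {}
```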
Data Lake – Transformation Types
• 1:1 transform: convert a single block (trigger → transform → store); see the sketch below
• 1:n split: split one block into several (trigger → split block → store)
• n:1 accumulate, only for blocks of the same patient (trigger → get additional blocks → accumulate → store)
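The three types could be modelled roughly as follows (a hypothetical sketch reusing the DataLakeBlock record from above; in the actual system these operations run as stream processors):

```java
import java.util.List;

// Hypothetical interfaces for the three transformation types.
interface OneToOneTransform {
    // 1:1 transform: convert a single block
    DataLakeBlock transform(DataLakeBlock input);
}

interface OneToManySplit {
    // 1:n split: break one block into several
    List<DataLakeBlock> split(DataLakeBlock input);
}

interface ManyToOneAccumulate {
    // n:1 accumulate: all inputs must belong to the same patient
    DataLakeBlock accumulate(List<DataLakeBlock> inputsOfSamePatient);
}
```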
Data Lake – Trigger concept
When to perform a transform operation?
• Automatic: a trigger (INSERT, UPDATE, DELETE) fires for each PostgreSQL database event; process every new/updated block of type X
• Scheduled: process all (new) blocks of type X at time Y (sketched below)
• Manual: single-shot (convert all), retrospective
• Combinations: manual single-shot & automatic update, single-shot update, …
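The "Scheduled" variant might look like this in Spring (a hypothetical sketch; the cron expression, class and method names are placeholders, and @EnableScheduling is assumed on a configuration class):

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical sketch of the "Scheduled" trigger variant.
@Component
public class ScheduledBlockProcessor {

    // "Process all (new) blocks of type X at time Y": here, nightly at 02:00.
    @Scheduled(cron = "0 0 2 * * *")
    public void processNewBlocksOfTypeX() {
        // Fetch blocks of type X created since the last run and hand them
        // to the transformation pipeline (details omitted).
    }
}
```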
Data Lake – Transformation
• TRIGGER: detect create, update & delete events; manual actions, export requests, schedules
• PLANNING: match plans and plan the processing steps; define & store what to do, stored as a stream plan
• ASSEMBLY: analyze the triggered action, fetch the required processing steps, choose the required pipeline path, collect all required blocks (one or multiple)
• TRANSFORM: transform the collected data into one or multiple output blocks
• STORE: store the transformed outputs in the Data Lake, warehouses or files
Data Lake – ETL
[Diagram of the full pipeline IMPORT → TRIGGER → PLANNING → ASSEMBLY → TRANSFORM → STORE: automated database triggers or manual/scheduled triggers start the flow, the required data is fetched, the block is transformed, and the destination is determined (back to the lake or export). The routing step is sketched below.]
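The "determine destination" step could be sketched with Spring Cloud Stream's StreamBridge, which sends a payload to a binding chosen at runtime (the binding names and the routing flag are hypothetical; the DataLakeBlock type is the sketch from above):

```java
import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.stereotype.Component;

// Hypothetical sketch of the "determine destination" step: transformed
// blocks go either back into the lake or to an export binding.
@Component
public class DestinationRouter {

    private final StreamBridge streamBridge;

    public DestinationRouter(StreamBridge streamBridge) {
        this.streamBridge = streamBridge;
    }

    public void route(DataLakeBlock block, boolean export) {
        // Binding names "export-out" and "lake-out" are placeholders.
        streamBridge.send(export ? "export-out" : "lake-out", block);
    }
}
```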
Data Lake challenges
• Stream processing, orchestration & microservices
• Error handling & data routing: dealing with processing errors, "clogging" of the pipeline, rejection of messages for examination (see the DLQ sketch below)
• Orchestration & management: development of a simplified management & deployment portal
[Diagram: a stream (:in → Proc. 1 → Proc. 2 / Proc. 3 → Sink 1 / Sink 2 → :out) with a dead-letter queue (DLQ) for failed messages.]
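One hedged way to realize the DLQ box with the stack named above is the dead-letter support of Spring Cloud Stream's Kafka binder; a configuration sketch (the binding name "input" and the topic name are placeholders):

```properties
# Retry a failing message a few times, then publish it to a dead-letter
# topic instead of clogging the pipeline; rejected blocks can later be
# examined and replayed.
spring.cloud.stream.bindings.input.consumer.max-attempts=3
spring.cloud.stream.kafka.bindings.input.consumer.enableDlq=true
spring.cloud.stream.kafka.bindings.input.consumer.dlqName=datalake-dlq
```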
Warehousing & Portals
Provisioning of the integrated data in project-specific Data Warehouses
• Extension of tranSMART / Glowing Bear: open-source improvement; tender & project with The Hyve (NL)
  • Dockerization of project-specific data warehouses
  • HL7 FHIR-based import
  • Slicing & dicing of tranSMART instances
  • Filtering by genomic variants via connection to a Genomic Variant Store
• Usage of i2b2/tranSMART for the MS UC: Dockerized i2b2/tranSMART instances for MS-UC data
• Besides the DIFUTURE UCs: successful application of DW & Trust Center components in a microbiome profiling project (TUM)
Further challenges
• Improved monitoring of all services (e.g. Prometheus, Grafana, …); see the sketch below
  • Monitor pipelines & message throughput
  • Memory consumption & disk storage usage
  • …
• Improved provisioning & parametrization of the software bundles for individual sites (e.g. with Chef or Puppet)
  • Specific configurations for server settings, certificates
  • (Semi-)automated deployment of new instances
• Sustained productive operation & update processes
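A hedged sketch of how message throughput could be exposed to Prometheus via Micrometer (the metric name is a hypothetical placeholder; Spring Boot Actuator's Prometheus endpoint would serve the values for scraping):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

// Hypothetical sketch: count processed messages so Prometheus/Grafana
// can graph pipeline throughput.
@Component
public class ThroughputMetrics {

    private final Counter processedBlocks;

    public ThroughputMetrics(MeterRegistry registry) {
        this.processedBlocks = Counter.builder("datalake.blocks.processed")
                .description("Number of blocks processed by the pipeline")
                .register(registry);
    }

    public void onBlockProcessed() {
        processedBlocks.increment();
    }
}
```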
Further challenges (cont.)
• Extension to additional data sources
  • Doctors' letters / document archives
  • Radiology images / DICOM
  • Genetic information (SNPs, …)
  • E.g. metadata collected in the Data Lake with references to the data sources
• Natural Language Processing (NLP) / text mining
Thanks.
Contact:
Dr. rer. nat. Jörg Peter, M.Sc. Bioinformatics
University Hospital Tübingen, Department for IT & Applied Medical Informatics
Institute for Translational Bioinformatics, Hoppe-Seyler-Straße 9, 72076 Tübingen
Phone: +49 7071 29-84317
Funding codes: 01ZZ1603 [A-D] & 01ZZ1804 [A-I]