transmart community meeting 5-7 nov 13 - session 3: transmart-data

transmart-data: Management of tranSMART’s Environment. Gustavo Lopes, The Hyve B.V., November 6, 2013.

Upload: david-peyruc

Post on 19-Jan-2015

Page 1: tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data

transmart-data: Management of tranSMART’s Environment

Gustavo Lopes

The Hyve B.V.

November 6, 2013

Gustavo Lopes (The Hyve B.V.) transmart-data November 6, 2013 1 / 22

Outline

1 Problems
   Reproducibility
   Version Control
   Automation
   Why?!
   tranSMART Foundation’s Version

2 Solution: transmart-data
   General Description
   Configuration
   Database Schema Management
   Seed Data
   ETL
   RModules Analyses’
   Rserve
   Solr
   transmartApp Configuration

3 Limitations

Typical Branch Distribution

Grails Code

transmartApp (without full repo history, always with wrong ancestry information ⇒ merging quite difficult)

RModules (if you’re lucky), but analysis definitions in DB not provided

Database

SQL scripts on top of GPL 1.0 dump or later. Probably insufficient/won’t apply

Stored procedures for ETL. Overlapping definitions with yours, but no history ⇒ merging quite difficult

Manual fixups always required (even if just permissions/synonyms)

Typical Branch Distribution (II)

ETL

High variability in strategies

Instructions/sample data rarely provided

Kettle scripts are problematic

Solr/Rserve/Configuration

Solr schemas/dataimport.xml perpetually forgotten

Idem for information on R packages

Sample configuration rarely provided

Version Control

Version control used ONLY for Grails code...

But often squashed and with wrong ancestor information.

Forget about database, Solr, most of ETL.

Result

Merges are very difficult.

Changes cannot easily be tracked

The reasons behind changes are unknown

Regressions are introduced silently (no merge conflicts to flag them)

Collaboration is based on e-mail attachments

Automation

Even with all the pieces. . .

Setting up a new branch takes days; weeks for non-basic functionality

No reproducibility in the process!

Result

Devs driven away from fully local environment (too much work)

Robust environment for CI passed over (too much work)

Bugs cannot be reliably reproduced (see also: no consistent usage of VCS)

Time wasted with deployment-specific mistakes/inconsistencies

Why?!

Guillaume Duchenne (public domain)

The “source code” for a work means the preferred form of the work for making modifications to it.

— GPL v3, section 1

Is everyone holding back “source code”?

More likely explanation:

No appropriate tooling being used

Situation for tranSMART 1.1

The situation is much better!

Some problems remain, though.

The Good

Creating/populating the DB is easy

Most stuff is versioned

CI for builds

Image available

Public issue tracking

The Bad

No Oracle support

Changes to DB scripts/seed data are ad hoc (lax structure)

No mechanism to support/compare schemas with other branches

R analyses are JSON blobs in TSVs

No VCS for Solr or Rserve/images’ setup

Setting up Solr/Rserve is time-consuming

Population of DB with sample data is still time-consuming

Config changes required for dev

Description of transmart-data

We developed transmart-data to address most of these problems:

transmart-data is a set of

scripts for managing tranSMART’s environment and

certain application data (e.g. Solr schemas, DDL, seed data), which is used by the scripts and sometimes generated by them.

It has a Makefile-based interface.

transmart-data: Purposes

Purposes of transmart-data:

1 Allow setting up a complete dev environment quickly (< 30 min)

2 Bring versioning to the database schema and Solr files

3 Set up the Solr runtime

4 Invoke ETL pipelines

5 Set up Rserve

Target audience: Programmers

transmart-data: Non-purposes

Non-purposes of transmart-data:

1 Setting up a production environment (some components can be used)

2 Serving new users evaluating tranSMART (use a pre-built image)

3 Building transmartApp or its plugin dependencies (build them yourself or use artifacts from Bamboo/Nexus)

Configuration

Environment variable based configuration

cp vars.sample vars
vim vars #edit file
source vars

PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
PGUSER=$USER
PGPASSWORD=
TABLESPACES=$HOME/pg/tablespaces/
PGSQL_BIN=$HOME/pg/bin/
ORAHOST=localhost
ORAPORT=1521
ORASID=orcl
ORAUSER="sys as sysdba"
ORAPASSWORD=mypassword
ORACLE_MANAGE_TABLESPACES=0
#continues...
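Since every target depends on these variables being present in the shell, a quick check before calling make can save a confusing failure later. The following is a minimal sketch in POSIX sh; `missing_vars` is a hypothetical helper and not something transmart-data is known to ship. It treats a variable that is set but empty as present, because the sample leaves PGPASSWORD empty.

```shell
#!/bin/sh
# Hypothetical helper, not part of transmart-data: print the names of the
# given variables that are unset. Set-but-empty counts as set, since the
# sample configuration leaves PGPASSWORD empty.
missing_vars() {
    _missing=""
    for _name in "$@"; do
        # ${VAR+set} expands to "set" only when VAR is set; eval performs
        # the indirect lookup in plain POSIX sh. Pass trusted names only.
        eval "_is_set=\${$_name+set}"
        if [ -z "$_is_set" ]; then
            _missing="${_missing:+$_missing }$_name"
        fi
    done
    printf '%s\n' "$_missing"
}

# Values taken from the sample configuration on the slide
PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
PGPASSWORD=
unset ORAHOST

missing_vars PGHOST PGPORT PGDATABASE PGPASSWORD ORAHOST   # prints: ORAHOST
```

Run in the same shell that sourced the vars file, the function prints an empty line when nothing is missing, so its output can gate a later make call.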

Database Schema Management

Support for Oracle and Postgres

Postgres

Uses pg_dump(all)

Parses the dump files

#Dump
make -C postgres/ddl dump
make -C postgres/ddl/GLOBAL extensions.sql roles.sql

#Load
make -C postgres/ddl load

Oracle

Queries dba_* tables

Dumps DDL w/ DBMS_METADATA

#Dump
make -C oracle/ddl dump

#Load
make oracle

Seed Data

Only Postgres for now

#Dump
#Tables to dump in postgres/data/<schema>_lst
make -C postgres/data dump
make -C postgres/common minimize_diffs

#Load
make -C postgres/data load

#Load DDL and data
make postgres

Only for basic stuff with no ETL!

Pretty fast (DDL+data loaded in 10s)

ETL (I)

Unified interface for ETL

Prepare dataset

1 Prepare ETL-specific source files

2 Prepare file with ETL-specific params

3 Upload dataset to CDN (optional)

For each new ETL pipeline, support must be added

Load dataset

make -C samples/{oracle,postgres} load_<type>_<study id>

#Example:
make -C samples/postgres load_clinical_GSE8581

Everything is automated!
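The load invocation can be wrapped so that the backend, data type and study id are passed as arguments. This is a minimal sketch, assuming targets are named load_<type>_<study id> under samples/oracle and samples/postgres; `etl_load_cmd` is a hypothetical name, and it only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of transmart-data: build the make command
# for loading a sample dataset. It prints the command rather than running it.
etl_load_cmd() {
    _backend=$1
    _type=$2
    _study=$3
    case "$_backend" in
        oracle|postgres) ;;   # the two backends named on the slide
        *) echo "unknown backend: $_backend" >&2; return 1 ;;
    esac
    # Assumed target naming: load_<type>_<study id>
    printf 'make -C samples/%s load_%s_%s\n' "$_backend" "$_type" "$_study"
}

etl_load_cmd postgres clinical GSE8581
# prints: make -C samples/postgres load_clinical_GSE8581
```

Printing first makes it easy to review the command, or to pipe it to sh once the repository checkout and variables are in place.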

ETL (II)

Show TM_CZ logs:

$ make -C samples/postgres showdblog

make: Entering directory `/home/gustavo/repos/transmart-data/samples/postgres'

groovy -cp postgresql-9.2-1003.jdbc4.jar ../common/dump_audit.groovy postgres `tput cols`

Procedure | Description | Stat | Recs | Date | Time spent

------------------------------------------------------------------------------------------------------

alysis_data.kjb | GSE8581 | DONE | 1 | 2013-10-15 13:23:22. | 0.0

.load_ext_files | Drop null samples rows | Done | 0 | 2013-10-15 13:23:23. | 0.450529

.load_ext_files | Drop null cohorts rows | Done | 0 | 2013-10-15 13:23:23. | 0.043125

.load_ext_files | Drop null analysis rows | Done | 0 | 2013-10-15 13:23:23. | 0.066097

.load_ext_files | Read analysis file | Done | 1 | 2013-10-15 13:23:23. | 0.048055

.load_ext_files | Read cohort file | Done | 3 | 2013-10-15 13:23:23. | 0.085535

.load_ext_files | Read samples file | Done | 57 | 2013-10-15 13:23:23. | 0.049993

.load_ext_files | Write rwg_cohorts_ext | Done | 3 | 2013-10-15 13:23:23. | 0.099452

.load_ext_files | Write rwg_analysis_ext | Done | 1 | 2013-10-15 13:23:23. | 0.047331

.load_ext_files | Write rwg_samples_ext | Done | 57 | 2013-10-15 13:23:23. | 0.044567

.load_ext_files | Read analysis data file | Done | 436898 | 2013-10-15 13:23:27. | 3.911089

.load_ext_files | Drop null analysis_data rows | Done | 382223 | 2013-10-15 13:23:27. | 0.067765

.load_ext_files | Write rwg_analysis_data_ext | Done | 54675 | 2013-10-15 13:23:28. | 1.332746

IMPORT_FROM_EXT | Start FUNCTION | Done | 0 | 2013-10-15 13:23:29. | 0.117319

IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 0.035825

IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 6.26E-4

IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 4.84E-4

IMPORT_FROM_EXT | Insert records from TM_LZ.Rwg_An | Done | 1 | 2013-10-15 13:23:29. | 0.001079

IMPORT_FROM_EXT | Update bio_assay_analysis_id on | Done | 0 | 2013-10-15 13:23:29. | 0.030793

IMPORT_FROM_EXT | Insert records from TM_LZ.Rwg_Co | Done | 3 | 2013-10-15 13:23:29. | 8.28E-4

... (continues)

Errors are also shown (if any)

RModules Analyses’ (tsApp-DB)

Situation in transmartApp-DB:

update searchapp.plugin_module

set params='{"id":"survivalAnalysis","converter":{"R":["source(''||PLUGINSCRIPTDIRECTORY||Common/dataBuilders.R'')","source(''||PLUGINSCRIPTDIRECTORY||Common/ExtractConcepts.R'')","source(''||PLUGINSCRIPTDIRECTORY||Common/collapsingData.R'')","source(''||PLUGINSCRIPTDIRECTORY||Common/BinData.R'')","source(''||PLUGINSCRIPTDIRECTORY||Survival/BuildSurvivalData.R'')","\tSurvivalData.build(\n\tinput.dataFile = ''||TEMPFOLDERDIRECTORY||Clinical/clinical.i2b2trans'',\n\tconcept.time=''||TIME||'',\n\tconcept.category=''||CATEGORY||'',\n\tconcept.eventYes=''||EVENTYES||'',\n\tbinning.enabled=''||BINNING||'',\n\tbinning.bins=''||NUMBERBINS||'',\n\tbinning.type=''||BINNINGTYPE||'',\n\tbinning.manual=''||BINNINGMANUAL||'',\n\tbinning.binrangestring=''||BINNINGRANGESTRING||'',\n\tbinning.variabletype=''||BINNINGVARIABLETYPE||'',\n\tinput.gexFile = ''||TEMPFOLDERDIRECTORY||mRNA/Processed_Data/mRNA.trans'',\n\tinput.snpFile = ''||TEMPFOLDERDIRECTORY||SNP/snp.trans'',\n\tconcept.category.type = ''||TYPEDEP||'',\n\tgenes.category = ''||GENESDEP||'',\n\tgenes.category.aggregate = ''||AGGREGATEDEP||'',\n\tsample.category = ''||SAMPLEDEP||'',\n\ttime.category = ''||TIMEPOINTSDEP||'',\n\tsnptype.category = ''||SNPTYPEDEP||'')\n\t"]},"name":"Survival Analysis","dataFileInputMapping":{"CLINICAL.TXT":"TRUE","SNP.TXT":"snpData","MRNA_DETAILED.TXT":"mrnaData"},"dataTypes":{"subset1":["CLINICAL.TXT"]},"pivotData":false,"view":"SurvivalAnalysis","processor":{"R":["source(''||PLUGINSCRIPTDIRECTORY||Survival/CoxRegressionLoader.r'')","CoxRegression.loader(input.filename=''outputfile'')","source(''||PLUGINSCRIPTDIRECTORY||Survival/SurvivalCurveLoader.r'')","SurvivalCurve.loader(input.filename=''outputfile'',concept.time=''||TIME||'')"]},"renderer":{"GSP":"/survivalAnalysis/survivalAnalysisOutput"},... (goes on)'

where module_name = 'pgsurvivalAnalysis';

Not very nice...

RModules Analyses’ (transmart-data)

In transmart-data:

One file per analysis

Files can be generated from DB data

Sanely formatted

But we really want to remove this from the DB!

array (

'id' => 'heatmap',

'name' => 'Heatmap',

'dataTypes' =>

array (

'subset1' =>

array (

0 => 'CLINICAL.TXT',

),

),

'dataFileInputMapping' =>

array (

'CLINICAL.TXT' => 'FALSE',

'SNP.TXT' => 'snpData',

'MRNA_DETAILED.TXT' => 'TRUE',

),

'pivotData' => false,

...

Rserve

Targets for Rserve:

Download/build R

Install R packages

Start Rserve

Install System V initscript for Rserve

Idem for systemd

cd R
make -j8 bin/root/R
#some packages don't support concurrent builds
make install_packages
make start_Rserve
make start_Rserve.dbg
TRANSMART_USER=tomcat7 sudo -E make install_rserve_init
TRANSMART_USER=tomcat7 sudo -E make install_rserve_unit
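Which of the two install targets applies depends on the host's init system. The sketch below picks between them; both function names are hypothetical, and checking for systemctl on PATH is only a heuristic for detecting systemd. The command is printed rather than executed, since it needs sudo and a transmart-data checkout.

```shell
#!/bin/sh
# Hypothetical helper: guess the init system. Finding systemctl on PATH is
# a heuristic, not proof that systemd is actually managing the host.
detect_init_system() {
    if command -v systemctl >/dev/null 2>&1; then
        echo systemd
    else
        echo sysv
    fi
}

# Hypothetical helper: print (do not run) the install command for the
# given init system ("systemd" or "sysv") and service user.
rserve_install_cmd() {
    _init=$1
    _user=$2
    case "$_init" in
        systemd) _target=install_rserve_unit ;;   # systemd unit
        sysv)    _target=install_rserve_init ;;   # System V initscript
        *) echo "unknown init system: $_init" >&2; return 1 ;;
    esac
    printf 'TRANSMART_USER=%s sudo -E make %s\n' "$_user" "$_target"
}

rserve_install_cmd "$(detect_init_system)" tomcat7
```

The tomcat7 user comes from the slide; substitute whichever account should own the Rserve process.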

Solr

Solr (4.5.0) automatically downloaded and configured

Solr cores automatically created

User only needs to create a schema file and dataconfig.xml

#setup & solr (psql)

make start

#just configure

make solr_home

make <core>_full_import

make <core>_delta_import

make clean_cores

ORACLE=1 make start

transmartApp Configuration

Out-of-tree config management:

Targets for installing files

Zero configuration for dev!

Customization allowed without touching the target files

Only supports our branches

But a lot of configuration should be in-tree instead!

#install everything
#previous files are backed up
make install
#just one file:
make install_Config.groovy
make install_BuildConfig.groovy
make install_DataSource.groovy
#customizations in:
#Config-extra.php
#BuildConfig.groovy (limited)
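The slide says previous files are backed up on install but not how. As an illustration of the idea only, here is a minimal sketch of an install step that preserves the existing file first; `install_with_backup` and the .bak suffix are assumptions, not the mechanism transmart-data actually uses.

```shell
#!/bin/sh
# Hypothetical illustration of "install, backing up the previous file".
# The .bak suffix is an assumption; the real naming scheme is not shown
# on the slide.
install_with_backup() {
    _src=$1
    _dest=$2
    if [ -e "$_dest" ]; then
        # Preserve the existing file (and its timestamps) before overwriting
        cp -p "$_dest" "$_dest.bak" || return 1
    fi
    cp "$_src" "$_dest"
}

# Demonstration in a throwaway directory
_tmp=$(mktemp -d)
printf 'old settings\n' > "$_tmp/Config.groovy"
printf 'new settings\n' > "$_tmp/Config.groovy.new"
install_with_backup "$_tmp/Config.groovy.new" "$_tmp/Config.groovy"
cat "$_tmp/Config.groovy"       # prints: new settings
cat "$_tmp/Config.groovy.bak"   # prints: old settings
rm -rf "$_tmp"
```

A single .bak file is overwritten on each install; a timestamped suffix would keep a history at the cost of accumulating files.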

Current Limitations

© Joost J. Bakker, CC BY 2.0

DB upgrades not handled

Only a few ETL pipelines supported

Oracle support is behind PostgreSQL

Tooling shares repository with application data
