transmart community meeting 5-7 nov 13 - session 3: transmart-data

Post on 19-Jan-2015


DESCRIPTION

tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data: Management of tranSMART's Environment. Gustavo Lopes, The Hyve B.V.

TRANSCRIPT

transmart-data: Management of tranSMART’s Environment

Gustavo Lopes

The Hyve B.V.

November 6, 2013

Gustavo Lopes (The Hyve B.V.) transmart-data November 6, 2013 1 / 22

Outline

1 Problems: Reproducibility, Version Control, Automation, Why?!, tranSMART Foundation’s Version

2 Solution: transmart-data (General Description, Configuration, Database Schema Management, Seed Data, ETL, RModules Analyses, Rserve, Solr, transmartApp Configuration)

3 Limitations


Typical Branch Distribution

Grails Code

transmartApp (without full repo history, always with wrong ancestry information ⇒ merging quite difficult)

RModules (if you’re lucky), but analyses definitions in DB not provided

Database

SQL scripts on top of GPL 1.0 dump or later. Probably insufficient/won’t apply

Stored procedures for ETL. Overlapping definitions with yours, but no history ⇒ merging quite difficult

Manual fixups always required (even if just permissions/synonyms)


Typical Branch Distribution (II)

ETL

High variability in strategies

Instructions/sample data rarely provided

Kettle scripts are problematic

Solr/Rserve/Configuration

Solr schemas/dataimport.xml perpetually forgotten

Idem for information on R packages

Sample configuration rarely provided


Version Control

Version control used ONLY for Grails code... but often squashed and with wrong ancestor information. Forget about database, Solr, most of ETL.

Result

Merges are very difficult.

Changes cannot easily be tracked

The reasons behind changes are unknown

Regressions are introduced (no conflicts)

Collaboration is based on e-mail attachments


Automation

Even with all the pieces. . .

Setting up a new branch takes days; weeks for non-basic functionality

No reproducibility in the process!

Result

Devs driven away from a fully local environment (too much work)

A robust environment for CI is passed over (too much work)

Bugs cannot be reliably reproduced (see also: no consistent usage of VCS)

Time wasted on deployment-specific mistakes/inconsistencies


Why?!

Guillaume Duchenne (public domain)

The “source code” for a work means the preferred form of the work for making modifications to it.

— GPL v3, section 1

Is everyone holding back “source code”? More likely explanation:

No appropriate tooling being used


Situation for tranSMART 1.1

The situation is much better! Some problems remain, though.

The Good:

Create/populate DB is easy

Most stuff is versioned

CI for builds

Image available

Public issue tracking

The Bad:

No Oracle support

Changes to DB scripts/seed data are ad hoc (lax structure)

No mechanism to support/compare schemas with other branches

R analyses are JSON blobs in TSVs

No VCS for Solr or Rserve/images’ setup

Setting up Solr/Rserve is time-consuming

Population of DB with sample data is still time-consuming

Config changes required for dev


Description of transmart-data

We developed transmart-data to address most of these problems:

transmart-data is a set of scripts for managing tranSMART’s environment, plus certain application data (e.g. Solr schemas, DDL, seed data), which is used by the scripts and sometimes generated by them.

It has a makefile-based interface.
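As a toy illustration of what a make-based interface buys here (this mini-Makefile and its target names are invented for the example, not transmart-data's real ones), dependent setup steps chain automatically:

```shell
# Hypothetical mini-Makefile in the transmart-data style: `data` depends on `ddl`,
# so one invocation runs both steps in order (target names are illustrative)
printf 'ddl:\n\t@echo "loading DDL"\n\ndata: ddl\n\t@echo "loading seed data"\n' > /tmp/demo.mk
make -f /tmp/demo.mk data
# prints:
#   loading DDL
#   loading seed data
```

Invoking a high-level target this way is what lets a whole environment be rebuilt with a single command.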


transmart-data: Purposes

Purposes of transmart-data:

1 Allow setting up a complete dev environment quickly (< 30 min)

2 Bring versioning to the database schema and Solr files

3 Set up Solr runtime

4 Invoke ETL pipelines

5 Set up Rserve

Target audience: Programmers


transmart-data: Non-purposes

Non-purposes of transmart-data:

1 Set up a production environment (some components can be used)

2 New users evaluating tranSMART (use a pre-built image)

3 Building transmartApp or its plugin dependencies (build them yourself or use artifacts from Bamboo/Nexus)


Configuration

Environment variable based configuration

cp vars.sample vars
vim vars  # edit file
source vars

PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
PGUSER=$USER
PGPASSWORD=
TABLESPACES=$HOME/pg/tablespaces/
PGSQL_BIN=$HOME/pg/bin/
ORAHOST=localhost
ORAPORT=1521
ORASID=orcl
ORAUSER="sys as sysdba"
ORAPASSWORD=mypassword
ORACLE_MANAGE_TABLESPACES=0
# continues...
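A minimal sketch of how a script can consume this environment-variable-based configuration after `source vars` (the defaulting idiom is standard shell, but its use here is my illustration, not transmart-data's actual code):

```shell
# Fall back to safe defaults when a variable was not set in the vars file
: "${PGHOST:=/tmp}"          # Unix-socket directory by default
: "${PGPORT:=5432}"
: "${PGDATABASE:=transmart}"
echo "connecting to $PGDATABASE on $PGHOST:$PGPORT"
```

Because everything is plain environment variables, the same configuration works for make, psql, and ad hoc scripts alike.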


Database Schema Management

Support for Oracle and Postgres

Postgres

Uses pg_dump(all)

Parses the dump files

# Dump
make -C postgres/ddl dump
make -C postgres/ddl/GLOBAL extensions.sql roles.sql

# Load
make -C postgres/ddl load

Oracle

Queries dba_* tables

Dumps DDL w/ DBMS_METADATA

# Dump
make -C oracle/ddl dump

# Load
make oracle
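The "parses the dump files" step above is not detailed in the slides; a toy sketch of the underlying idea, splitting one monolithic pg_dump-style file into per-object files that can be versioned individually (file layout and awk logic are illustrative, not transmart-data's actual implementation):

```shell
# Toy dump standing in for pg_dump output
cat > /tmp/dump.sql <<'EOF'
CREATE TABLE patient (id integer);
CREATE TABLE observation (id integer);
EOF

mkdir -p /tmp/ddl
# Start a new output file at every CREATE TABLE, so each object
# gets its own diffable file under version control
awk '/^CREATE TABLE/ { name = $3; out = "/tmp/ddl/" name ".sql" } out { print > out }' /tmp/dump.sql

ls /tmp/ddl   # observation.sql  patient.sql
```

Per-object files are what make schema changes reviewable as ordinary diffs instead of opaque dump-to-dump comparisons.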


Seed Data

Only Postgres for now

# Dump
# Tables to dump listed in postgres/data/<schema>.lst
make -C postgres/data dump
make -C postgres/common minimize_diffs

# Load
make -C postgres/data load

# Load DDL and data
make postgres

Only for basic stuff with no ETL!

Pretty fast (DDL+data loaded in 10s)
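The per-schema table-list idea above can be sketched as follows (the list file name, schema, and table names are hypothetical, and the pg_dump command is only echoed rather than run, since this is an illustration, not transmart-data's dump script):

```shell
# Hypothetical list of seed-data tables for one schema
cat > /tmp/searchapp.lst <<'EOF'
plugin_module
search_auth_user
EOF

# Emit one data-only dump command per listed table
while read -r t; do
  echo "pg_dump --data-only --table=searchapp.$t"
done < /tmp/searchapp.lst
```

Driving the dump from an explicit list keeps ETL-produced tables out of the seed data, matching the "only for basic stuff with no ETL" restriction.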


ETL (I)

Unified interface for ETL

Prepare dataset

1 Prepare ETL-specific source files

2 Prepare file with ETL-specific params

3 Upload dataset to CDN (optional)

For each new ETL pipeline, support must be added

Load dataset

make -C samples/{oracle,postgres} load_<type>_<study_id>

# Example:
make -C samples/postgres load_clinical_GSE8581

Everything is automated!


ETL (II)

Show TM_CZ logs:

$ make -C samples/postgres showdblog

make: Entering directory `/home/gustavo/repos/transmart-data/samples/postgres'

groovy -cp postgresql-9.2-1003.jdbc4.jar ../common/dump_audit.groovy postgres `tput cols`

Procedure | Description | Stat | Recs | Date | Time spent

------------------------------------------------------------------------------------------------------

alysis_data.kjb | GSE8581 | DONE | 1 | 2013-10-15 13:23:22. | 0.0

.load_ext_files | Drop null samples rows | Done | 0 | 2013-10-15 13:23:23. | 0.450529

.load_ext_files | Drop null cohorts rows | Done | 0 | 2013-10-15 13:23:23. | 0.043125

.load_ext_files | Drop null analysis rows | Done | 0 | 2013-10-15 13:23:23. | 0.066097

.load_ext_files | Read analysis file | Done | 1 | 2013-10-15 13:23:23. | 0.048055

.load_ext_files | Read cohort file | Done | 3 | 2013-10-15 13:23:23. | 0.085535

.load_ext_files | Read samples file | Done | 57 | 2013-10-15 13:23:23. | 0.049993

.load_ext_files | Write rwg_cohorts_ext | Done | 3 | 2013-10-15 13:23:23. | 0.099452

.load_ext_files | Write rwg_analysis_ext | Done | 1 | 2013-10-15 13:23:23. | 0.047331

.load_ext_files | Write rwg_samples_ext | Done | 57 | 2013-10-15 13:23:23. | 0.044567

.load_ext_files | Read analysis data file | Done | 436898 | 2013-10-15 13:23:27. | 3.911089

.load_ext_files | Drop null analysis_data rows | Done | 382223 | 2013-10-15 13:23:27. | 0.067765

.load_ext_files | Write rwg_analysis_data_ext | Done | 54675 | 2013-10-15 13:23:28. | 1.332746

IMPORT_FROM_EXT | Start FUNCTION | Done | 0 | 2013-10-15 13:23:29. | 0.117319

IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 0.035825

IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 6.26E-4

IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 4.84E-4

IMPORT_FROM_EXT | Insert records from TM_LZ.Rwg_An | Done | 1 | 2013-10-15 13:23:29. | 0.001079

IMPORT_FROM_EXT | Update bio_assay_analysis_id on | Done | 0 | 2013-10-15 13:23:29. | 0.030793

IMPORT_FROM_EXT | Insert records from TM_LZ.Rwg_Co | Done | 3 | 2013-10-15 13:23:29. | 8.28E-4

... (continues)

Errors are also shown (if any)
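Spotting failures in such a dump can be sketched as a one-liner over the status column (the sample rows below are made up in the same pipe-separated shape as the log above; this filter is my illustration, not part of transmart-data):

```shell
# Toy audit-log excerpt in the same pipe-separated shape as the dump above
cat > /tmp/tm_cz_log.txt <<'EOF'
.load_ext_files | Read samples file | Done  | 57 | 2013-10-15
IMPORT_FROM_EXT | Insert records    | ERROR |  0 | 2013-10-15
EOF

# Keep only rows whose status column (field 3) is not Done/DONE
awk -F'|' '{ s = $3; gsub(/ /, "", s); if (tolower(s) != "done") print }' /tmp/tm_cz_log.txt
```

Only the ERROR row survives the filter, which is handy when a pipeline logs hundreds of successful steps.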


RModules Analyses (tsApp-DB)

Situation in transmartApp-DB:

update searchapp.plugin_module

set params='{"id":" survivalAnalysis ","converter ":{"R":[" source(''|| PLUGINSCRIPTDIRECTORY|| Common/dataBuilders.R'')","source(''|| PLUGINSCRIPTDIRECTORY || Common/ExtractConcepts.R'')","source(''|| PLUGINSCRIPTDIRECTORY || Common/collapsingData.R'')","source(''|| PLUGINSCRIPTDIRECTORY || Common/BinData.R'')","source(''||PLUGINSCRIPTDIRECTORY || Survival/BuildSurvivalData.R'')","\ tSurvivalData.build(\n\tinput.dataFile = ''|| TEMPFOLDERDIRECTORY || Clinical/clinical.i2b2trans '',\n\tconcept.time=''||TIME||'',\n\tconcept.category=''|| CATEGORY ||'',\n\tconcept.eventYes=''|| EVENTYES ||'',\n\tbinning.enabled=''|| BINNING ||'',\n\tbinning.bins=''||NUMBERBINS ||'',\n\tbinning.type=''|| BINNINGTYPE ||'',\n\tbinning.manual=''||BINNINGMANUAL ||'',\n\tbinning.binrangestring=''|| BINNINGRANGESTRING ||'',\n\tbinning.variabletype=''|| BINNINGVARIABLETYPE ||'',\n\tinput.gexFile = ''||TEMPFOLDERDIRECTORY ||mRNA/Processed_Data/mRNA.trans '',\n\tinput.snpFile = ''||TEMPFOLDERDIRECTORY ||SNP/snp.trans'',\n\tconcept.category.type = ''|| TYPEDEP ||'',\n\tgenes.category = ''|| GENESDEP ||'',\n\tgenes.category.aggregate = ''|| AGGREGATEDEP||'',\n\tsample.category = ''|| SAMPLEDEP ||'',\n\ttime.category = ''|| TIMEPOINTSDEP||'',\n\tsnptype.category = ''|| SNPTYPEDEP ||'')\n\t"]}," name ":" Survival Analysis","dataFileInputMapping ":{" CLINICAL.TXT":" TRUE","SNP.TXT ":" snpData"," MRNA_DETAILED.TXT

":" mrnaData "}," dataTypes ":{" subset1 ":[" CLINICAL.TXT"]}," pivotData ":false ,"view ":"

SurvivalAnalysis "," processor ":{"R":[" source(''|| PLUGINSCRIPTDIRECTORY || Survival/CoxRegressionLoader.r'')"," CoxRegression.loader(input.filename=''outputfile '')","source(''|| PLUGINSCRIPTDIRECTORY || Survival/SurvivalCurveLoader.r'')"," SurvivalCurve.loader(input.filename=''outputfile '',concept.time=''||TIME||'')"]}," renderer ":{"GSP ":"/ survivalAnalysis/survivalAnalysisOutput "} ,... (goes on)'

where module_name = 'pgsurvivalAnalysis ';

Not very nice...


RModules Analyses (transmart-data)

In transmart-data:

One file per analysis

Files can be generated from DB data

Sanely formatted

But we really want to remove this from the DB!

array (

'id' => 'heatmap',

'name' => 'Heatmap',

'dataTypes' =>

array (

'subset1' =>

array (

0 => 'CLINICAL.TXT',

),

),

'dataFileInputMapping' =>

array (

'CLINICAL.TXT' => 'FALSE',

'SNP.TXT' => 'snpData',

'MRNA_DETAILED.TXT' => 'TRUE',

),

'pivotData' => false,

...


Rserve

Targets for Rserve:

Download/build R

Install R packages

Start Rserve

Install a System V init script for Rserve

Idem for systemd

cd R
make -j8 bin/root/R

# some packages don't support concurrent builds
make install_packages

make start_Rserve
make start_Rserve.dbg

TRANSMART_USER=tomcat7 sudo -E make install_rserve_init
TRANSMART_USER=tomcat7 sudo -E make install_rserve_unit


Solr

Solr (4.5.0) automatically downloaded and configured

Solr cores automatically created

User only needs to create a schema file and dataconfig.xml

# setup & start Solr (psql)
make start

# just configure
make solr_home

make <core>_full_import
make <core>_delta_import
make clean_cores

ORACLE=1 make start


transmartApp Configuration

Out-of-tree config management:

Targets for installing files

Zero configuration for dev!

Customization allowed without touching the target files

Only supports our branches

But a lot of configuration should be in-tree instead!

# install everything
# previous files are backed up
make install

# just one file:
make install_Config.groovy
make install_BuildConfig.groovy
make install_DataSource.groovy

# customizations in:
# Config-extra.php
# BuildConfig.groovy (limited)


Current Limitations

© Joost J. Bakker, CC BY 2.0

DB upgrades not handled

Only a few ETL pipelines supported

Oracle support is behind PostgreSQL

Tooling shares a repository with application data

