tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
transmart-data: Management of tranSMART's Environment
Gustavo Lopes
The Hyve B.V.
November 6, 2013
Gustavo Lopes (The Hyve B.V.) transmart-data November 6, 2013 1 / 22
Outline
1 Problems
  Reproducibility, Versioning Control, Automation, Why?!, tranSMART Foundation's Version
2 Solution: transmart-data
  General Description, Configuration, Database Schema Management, Seed Data, ETL, RModules Analyses, Rserve, Solr, transmartApp Configuration
3 Limitations
Typical Branch Distribution
Grails Code
transmartApp (without full repo history, always with wrong ancestry information ⇒ merging quite difficult)
RModules (if you're lucky), but analyses definitions in DB not provided
Database
SQL scripts on top of GPL 1.0 dump or later. Probably insufficient/won't apply
Stored procedures for ETL. Overlapping definitions with yours, but no history ⇒ merging quite difficult
Manual fixups always required (even if just permissions/synonyms)
Typical Branch Distribution (II)
ETL
High variability in strategies
Instructions/sample data rarely provided
Kettle scripts are problematic
Solr/Rserve/Configuration
Solr schemas/dataimport.xml perpetually forgotten
Idem for information on R packages
Sample configuration rarely provided
Versioning Control
Version control used ONLY for Grails code...
But often squashed and with wrong ancestor information.
Forget about database, Solr, most of ETL.
Result
Merges are very difficult.
Changes cannot easily be tracked
The reasons behind changes are unknown
Regressions are introduced (no conflicts)
Collaboration is based on e-mail attachments
Automation
Even with all the pieces...
Setting up a new branch takes days; weeks for non-basic functionality
No reproducibility in the process!
Result
Devs driven away from a fully local environment (too much work)
Robust environment for CI passed over (too much work)
Bugs cannot be reliably reproduced (see also: no consistent usage of VCS)
Time wasted with deployment-specific mistakes/inconsistencies
Why?!
Guillaume Duchenne (public domain)
The “source code” for a work means the preferred form of the work for making modifications to it.
— GPL v3, section 1
Is everyone holding back “source code”?
More likely explanation:
No appropriate tooling being used
Situation for tranSMART 1.1
The situation is much better! Some problems remain, though.
The Good
Create/populate DB is easy
Most stuff is versioned
CI for builds
Image available
Public issue tracking
The Bad
No Oracle support
Changes to DB scripts/seed data are ad hoc (lax structure)
No mechanism to support/compare schemas with other branches
R analyses are JSON blobs in TSVs
No VCS for Solr or Rserve/images' setup
Setting up Solr/Rserve is time-consuming
Population of DB with sample data is still time-consuming
Config changes required for dev
Description of transmart-data
We developed transmart-data to address most of these problems:
transmart-data is a set of:
scripts for managing tranSMART's environment, and
certain application data (e.g. Solr schemas, DDL, seed data), which is used by the scripts and sometimes generated by them.
It has a Makefile-based interface.
transmart-data: Purposes
Purposes of transmart-data:
1 Allow setting up a complete dev environment quickly (< 30 min)
2 Bring versioning to the database schema and Solr files
3 Set up the Solr runtime
4 Invoke ETL pipelines
5 Set up Rserve
Target audience: Programmers
transmart-data: Non-purposes
Non-purposes of transmart-data:
1 Setting up a production environment (some components can be used)
2 New users evaluating tranSMART (use a pre-built image)
3 Building transmartApp or its plugin dependencies (build them yourself or use artifacts from Bamboo/Nexus)
Configuration
Environment variable based configuration
cp vars.sample vars
vim vars  # edit file
source vars

PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
PGUSER=$USER
PGPASSWORD=
TABLESPACES=$HOME/pg/tablespaces/
PGSQL_BIN=$HOME/pg/bin/
ORAHOST=localhost
ORAPORT=1521
ORASID=orcl
ORAUSER="sys as sysdba"
ORAPASSWORD=mypassword
ORACLE_MANAGE_TABLESPACES=0
# continues...
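A minimal sketch of this sourced-variables pattern, with illustrative values only (the export keyword and the /tmp path are assumptions here; check the shipped vars.sample for the real variable list):

```shell
# Illustrative vars file; values are examples, not transmart-data's defaults
cat > /tmp/vars <<'EOF'
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE=transmart
EOF

# Source it so subsequent make invocations inherit the settings
. /tmp/vars
echo "connecting to $PGDATABASE via $PGHOST:$PGPORT"
```

Because configuration lives in environment variables, every make target in the repository can read the same settings without extra plumbing.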
Database Schema Management
Support for Oracle and Postgres
Postgres
Uses pg_dump(all)
Parses the dump files
# Dump
make -C postgres/ddl dump
make -C postgres/ddl/GLOBAL extensions.sql roles.sql

# Load
make -C postgres/ddl load
Oracle
Queries dba_* tables
Dumps DDL with DBMS_METADATA

# Dump
make -C oracle/ddl dump

# Load
make oracle
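The "parses the dump files" step can be pictured as splitting a schema-only dump into one file per object, so each object can be versioned and diffed separately. The sketch below is a hypothetical simplification (paths, dump layout, and the awk logic are assumptions, not transmart-data's actual parser):

```shell
# Hypothetical sketch: split a schema-only SQL dump into per-table DDL files.
mkdir -p /tmp/ddl_split
cat > /tmp/dump.sql <<'EOF'
CREATE TABLE i2b2demodata.patient_dimension (patient_num int);
CREATE TABLE i2b2demodata.observation_fact (encounter_num int);
EOF

# One output file per CREATE TABLE statement, named after the table
awk '/^CREATE TABLE/ {
    split($3, parts, ".");
    file = "/tmp/ddl_split/" parts[2] ".sql"
}
file { print > file }
/;$/ { file = "" }' /tmp/dump.sql

ls /tmp/ddl_split
```

With per-object files in the repository, a schema change shows up as a small diff on one file instead of a rewrite of a monolithic dump.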
Seed Data
Only Postgres for now
# Dump
# Tables to dump are listed in postgres/data/<schema>.lst
make -C postgres/data dump
make -C postgres/common minimize_diffs

# Load
make -C postgres/data load

# Load DDL and data
make postgres
Only for basic stuff with no ETL!
Pretty fast (DDL+data loaded in 10s)
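One plausible reading of the minimize_diffs step: if dumped rows come out in a stable order, re-dumps diff cleanly against what is already in version control. A hypothetical illustration (file names and layout are invented; the real target may do more than sort):

```shell
# Hypothetical seed-data TSV; real names and layout differ
mkdir -p /tmp/seed/searchapp
printf 'b_row\t2\na_row\t1\n' > /tmp/seed/searchapp/plugin_module.tsv

# Stable row ordering keeps a fresh dump diffable against the committed one
for f in /tmp/seed/*/*.tsv; do
  sort -o "$f" "$f"
done

head -n 1 /tmp/seed/searchapp/plugin_module.tsv
```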
ETL (I)
Unified interface for ETL
Prepare dataset
1 Prepare ETL-specific source files
2 Prepare file with ETL-specific params
3 Upload dataset to CDN(optional)
For each new ETL pipeline,support must be added
Load dataset
make -C samples/{oracle,postgres} load_<type>_<study_id>

# Example:
make -C samples/postgres load_clinical_GSE8581
Everything is automated!
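The unified naming convention above can be illustrated by pulling the pipeline type and study id back out of a target name. A sketch in POSIX shell; the real samples/ Makefiles may resolve this differently (e.g. with pattern rules):

```shell
# Hypothetical sketch of parsing the load_<type>_<study_id> convention
target="load_clinical_GSE8581"
rest="${target#load_}"      # strips the load_ prefix -> clinical_GSE8581
type="${rest%%_*}"          # text before the first underscore -> clinical
study="${rest#*_}"          # text after the first underscore -> GSE8581
echo "pipeline=$type study=$study"
```

One target name thus selects both the pipeline to run and the dataset to load, which is what makes the interface uniform across ETL types.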
ETL (II)
Show TM_CZ logs:
$ make -C samples/postgres showdblog
make: Entering directory `/home/gustavo/repos/transmart-data/samples/postgres'
groovy -cp postgresql-9.2-1003.jdbc4.jar ../common/dump_audit.groovy postgres `tput cols`
Procedure | Description | Stat | Recs | Date | Time spent
------------------------------------------------------------------------------------------------------
alysis_data.kjb | GSE8581 | DONE | 1 | 2013-10-15 13:23:22. | 0.0
.load_ext_files | Drop null samples rows | Done | 0 | 2013-10-15 13:23:23. | 0.450529
.load_ext_files | Drop null cohorts rows | Done | 0 | 2013-10-15 13:23:23. | 0.043125
.load_ext_files | Drop null analysis rows | Done | 0 | 2013-10-15 13:23:23. | 0.066097
.load_ext_files | Read analysis file | Done | 1 | 2013-10-15 13:23:23. | 0.048055
.load_ext_files | Read cohort file | Done | 3 | 2013-10-15 13:23:23. | 0.085535
.load_ext_files | Read samples file | Done | 57 | 2013-10-15 13:23:23. | 0.049993
.load_ext_files | Write rwg_cohorts_ext | Done | 3 | 2013-10-15 13:23:23. | 0.099452
.load_ext_files | Write rwg_analysis_ext | Done | 1 | 2013-10-15 13:23:23. | 0.047331
.load_ext_files | Write rwg_samples_ext | Done | 57 | 2013-10-15 13:23:23. | 0.044567
.load_ext_files | Read analysis data file | Done | 436898 | 2013-10-15 13:23:27. | 3.911089
.load_ext_files | Drop null analysis_data rows | Done | 382223 | 2013-10-15 13:23:27. | 0.067765
.load_ext_files | Write rwg_analysis_data_ext | Done | 54675 | 2013-10-15 13:23:28. | 1.332746
IMPORT_FROM_EXT | Start FUNCTION | Done | 0 | 2013-10-15 13:23:29. | 0.117319
IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 0.035825
IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 6.26E-4
IMPORT_FROM_EXT | Delete existing records from TM_ | Done | 0 | 2013-10-15 13:23:29. | 4.84E-4
IMPORT_FROM_EXT | Insert records from TM_LZ.Rwg_An | Done | 1 | 2013-10-15 13:23:29. | 0.001079
IMPORT_FROM_EXT | Update bio_assay_analysis_id on | Done | 0 | 2013-10-15 13:23:29. | 0.030793
IMPORT_FROM_EXT | Insert records from TM_LZ.Rwg_Co | Done | 3 | 2013-10-15 13:23:29. | 8.28E-4
... (continues)
Errors are also shown (if any)
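Since error rows are interleaved with normal ones, a saved copy of this log can be filtered for anything whose status column is not Done. A hypothetical example over canned output (the column layout is assumed from the listing above):

```shell
# Hypothetical: filter saved showdblog output for non-Done status rows
cat > /tmp/dblog.txt <<'EOF'
.load_ext_files | Read samples file | Done  | 57 | 2013-10-15 | 0.05
IMPORT_FROM_EXT | Insert records    | ERROR |  0 | 2013-10-15 | 0.01
EOF

# Column 3 is the status; keep rows where it is not Done/DONE
awk -F'|' 'NF > 1 && tolower($3) !~ /done/' /tmp/dblog.txt
```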
RModules Analyses (tsApp-DB)
Situation in transmartApp-DB:
update searchapp.plugin_module
set params='{"id":" survivalAnalysis ","converter ":{"R":[" source(''|| PLUGINSCRIPTDIRECTORY|| Common/dataBuilders.R'')","source(''|| PLUGINSCRIPTDIRECTORY || Common/ExtractConcepts.R'')","source(''|| PLUGINSCRIPTDIRECTORY || Common/collapsingData.R'')","source(''|| PLUGINSCRIPTDIRECTORY || Common/BinData.R'')","source(''||PLUGINSCRIPTDIRECTORY || Survival/BuildSurvivalData.R'')","\tSurvivalData.build(\n\tinput.dataFile = ''|| TEMPFOLDERDIRECTORY || Clinical/clinical.i2b2trans '',\n\tconcept.time=''||TIME||'',\n\tconcept.category=''|| CATEGORY ||'',\n\tconcept.eventYes=''|| EVENTYES ||'',\n\tbinning.enabled=''|| BINNING ||'',\n\tbinning.bins=''||NUMBERBINS ||'',\n\tbinning.type=''|| BINNINGTYPE ||'',\n\tbinning.manual=''||BINNINGMANUAL ||'',\n\tbinning.binrangestring=''|| BINNINGRANGESTRING ||'',\n\tbinning.variabletype=''|| BINNINGVARIABLETYPE ||'',\n\tinput.gexFile = ''||TEMPFOLDERDIRECTORY ||mRNA/Processed_Data/mRNA.trans '',\n\tinput.snpFile = ''||TEMPFOLDERDIRECTORY ||SNP/snp.trans'',\n\tconcept.category.type = ''|| TYPEDEP ||'',\n\tgenes.category = ''|| GENESDEP ||'',\n\tgenes.category.aggregate = ''|| AGGREGATEDEP||'',\n\tsample.category = ''|| SAMPLEDEP ||'',\n\ttime.category = ''|| TIMEPOINTSDEP||'',\n\tsnptype.category = ''|| SNPTYPEDEP ||'')\n\t"]}," name ":" Survival Analysis","dataFileInputMapping ":{" CLINICAL.TXT":" TRUE","SNP.TXT ":" snpData"," MRNA_DETAILED.TXT
":" mrnaData "}," dataTypes ":{" subset1 ":[" CLINICAL.TXT"]}," pivotData ":false ,"view ":"
SurvivalAnalysis "," processor ":{"R":[" source(''|| PLUGINSCRIPTDIRECTORY || Survival/CoxRegressionLoader.r'')"," CoxRegression.loader(input.filename=''outputfile '')","source(''|| PLUGINSCRIPTDIRECTORY || Survival/SurvivalCurveLoader.r'')"," SurvivalCurve.loader(input.filename=''outputfile '',concept.time=''||TIME||'')"]}," renderer ":{"GSP ":"/ survivalAnalysis/survivalAnalysisOutput "} ,... (goes on)'
where module_name = 'pgsurvivalAnalysis ';
Not very nice...
RModules Analyses (transmart-data)
In transmart-data:
One file per analysis
Files can be generated from DB data
Sanely formatted
But we really want to remove this from the DB!

array (
'id' => 'heatmap',
'name' => 'Heatmap',
'dataTypes' =>
array (
'subset1' =>
array (
0 => 'CLINICAL.TXT',
),
),
'dataFileInputMapping' =>
array (
'CLINICAL.TXT' => 'FALSE',
'SNP.TXT' => 'snpData',
'MRNA_DETAILED.TXT' => 'TRUE',
),
'pivotData' => false,
...
Rserve
Targets for Rserve:
Download/build R
Install R packages
Start Rserve
Install System V init script for Rserve
Idem for systemd
cd R
make -j8 bin/root/R
# some packages don't support concurrent builds
make install_packages

make start_Rserve
make start_Rserve.dbg

TRANSMART_USER=tomcat7 sudo -E make install_rserve_init
TRANSMART_USER=tomcat7 sudo -E make install_rserve_unit
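The init script/unit targets install service definitions so Rserve survives reboots. A sketch of what such a systemd unit might contain; the user, R path, and service type are assumptions, not what install_rserve_unit actually writes:

```shell
# Hypothetical systemd unit resembling what install_rserve_unit might install
cat > /tmp/rserve.service <<'EOF'
[Unit]
Description=Rserve for tranSMART
After=network.target

[Service]
User=tomcat7
Type=forking
ExecStart=/opt/R/bin/R CMD Rserve --no-save

[Install]
WantedBy=multi-user.target
EOF

grep '^User=' /tmp/rserve.service
```

Running Rserve as the same user as the servlet container (TRANSMART_USER=tomcat7 above) keeps file permissions between transmartApp and R consistent.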
Solr
Solr (4.5.0) automatically downloaded and configured
Solr cores automatically created
User only needs to create a schema file and dataconfig.xml
# set up & start Solr (psql)
make start

# just configure
make solr_home

make <core>_full_import
make <core>_delta_import
make clean_cores

ORACLE=1 make start
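A sketch of the kind of dataconfig.xml a user would write for a core, following Solr's DataImportHandler format; the JDBC URL, table, and field names here are illustrative, not the actual tranSMART schema:

```shell
# Write a hypothetical minimal dataconfig.xml for a core
mkdir -p /tmp/solr_core/conf
cat > /tmp/solr_core/conf/dataconfig.xml <<'EOF'
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/transmart"/>
  <document>
    <entity name="doc" query="SELECT id, title FROM my_schema.my_table">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>
EOF

grep -c '<field' /tmp/solr_core/conf/dataconfig.xml
```

The <core>_full_import and <core>_delta_import targets then drive DataImportHandler against whatever query this file defines.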
transmartApp Configuration
Out-of-tree config management:
Targets for installing files
Zero configuration for dev!
Customization allowed without touching the target files
Only supports our branches
But a lot of configuration should be in-tree instead!
# install everything
# previous files are backed up
make install

# just one file:
make install_Config.groovy
make install_BuildConfig.groovy
make install_DataSource.groovy

# customizations in:
# Config-extra.php
# BuildConfig.groovy (limited)
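The "previous files are backed up" behavior can be sketched like this; the target path and .bak suffix are assumptions for illustration, not transmart-data's actual scheme:

```shell
# Sketch of install-with-backup; paths and suffix are hypothetical
target=/tmp/grails_conf/Config.groovy
mkdir -p "$(dirname "$target")"
printf 'old config\n' > "$target"

# back up any existing file, then install the new one
[ -f "$target" ] && cp "$target" "$target.bak"
printf 'new config\n' > "$target"

ls /tmp/grails_conf
```

Backing up before overwriting is what makes repeated make install runs safe for a dev who has local edits in the target files.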
Current Limitations
© Joost J. Bakker, CC BY 2.0
DB upgrades not handled
Only a few ETL pipelines supported
Oracle support is behind PostgreSQL
Tooling shares a repository with application data