tranSMART Community Meeting 5-7 Nov 13 - Session 3: Clinical Biomarker Discovery


tranSMART Community Meeting
Developer Stream, Nov 06, 2013

Charlotte Raillère (tranSMART Expert)
Claire Virenque (Project Manager)

TM4P
Translational Medicine for Patients

TM Data Hub Project
Implementation of a Translational Medicine Data Integration Platform

Content of the presentation

● Update on Sanofi’s latest achievements

1. IT security assessment of tranSMART

2. Improvement of SNP (subject level) data loading

● Update on work-in-progress

3. New release under development (‘RC2’)

4. tranSMART x MongoDB integration


Context – tranSMART at Sanofi

● Pilot experience with tranSMART from September 2011 to June 2012
  ● Evaluate tranSMART capabilities to support clinical biomarker research

● Implementation project launched in September 2012
  ● Identify the tranSMART improvements of highest value for Sanofi
  ● Implement these improvements through two successive tranSMART Release Candidates (RC)
    • RC1 has been available since March 2013; code base available on GitHub
    • RC2 build is in progress
    • RC2 is expected to move into production in Q2 2014

● Working version of tranSMART available for our early adopter business units
  ● Objective: meet their ongoing needs related to translational research data integration
  ● Support for data curation & loading is also provided


tranSMART IT security assessment

Feedback

Special thanks to Vincent Rossetto and the IS Security Team!

Part 1 – Scope and Context

● Objective of Security Risk Assessment: protect R&D information
  ● Mission of the R&D IS Security team: control and assess the risks to R&D information assets

● Risk assessment methodology
  ● ‘Ethical’ hacking: penetration testing
    • From vulnerability scans to exploitation
    • Using free tools (Nessus, BackTrack, Metasploit, SharePoint Perl script)
    • With no account on Sanofi systems and no Sanofi standard workstation
  ● Without an access account, try to gain high-level access (admin account, sensitive data)

● Risk classification: four grades
  ● From ‘High’: risk with important consequences on Sanofi activities; can happen or be caused easily
  ● To ‘Negligible’: risk with minor consequences; requires expert knowledge or a favorable context

● Recommendations: Remediation Action Plan
  ● With prioritization of the recommendations


● tranSMART strength overview
  ● No trivial system accounts found. No default database accounts found.
  ● Web servers are running under low privileges. User authentication cannot be bypassed.
    • Authentication through Sanofi’s Active Directory


Part 2 – tranSMART risk assessment results

● Impact
  ● Sensitive data disclosure
  ● Technical information disclosure
  ● Identity usurpation (impersonation)

● Main risks identified
  ● Credential disclosure (database, Tomcat, JBoss…)
  ● Session hijacking
  ● Privilege elevation
  ● Application malevolence (XSS)

Part 3 – Application Security weaknesses

● XSS attack: certain parameters (tags) are prone to stored cross-site scripting (XSS) attacks.
  ● This vulnerability can be exploited to take control of another administrator’s browser or, more probably, to mount phishing or viral spreading attacks
  ● Admin session hijacking XSS alert:
    • <script>alert(String.fromCharCode(88, 83, 83, 32, 97, 116, 116, 97, 99, 107, 32, 105, 110, 32, 112, 114, 111, 103, 114, 101, 115, 115))</script>
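The usual defence against stored XSS is to encode user-supplied values at output time. The following is a minimal sketch in Python, not tranSMART code (tranSMART itself is a Grails application); the function name `render_tag` and the surrounding markup are hypothetical. It also decodes the character codes used in the test payload above.

```python
import html

# Character codes from the proof-of-concept payload shown on the slide.
PAYLOAD_CODES = [88, 83, 83, 32, 97, 116, 116, 97, 99, 107, 32,
                 105, 110, 32, 112, 114, 111, 103, 114, 101, 115, 115]

def decode_payload(codes):
    """Mimic JavaScript's String.fromCharCode for a list of code points."""
    return "".join(chr(code) for code in codes)

def render_tag(tag_value):
    """Hypothetical view helper: HTML-encode a stored tag before output."""
    return '<span class="tag">' + html.escape(tag_value, quote=True) + "</span>"

message = decode_payload(PAYLOAD_CODES)
stored_tag = "<script>alert('" + message + "')</script>"
rendered = render_tag(stored_tag)

print(message)
print(rendered)
```

Once encoded, the browser displays the stored tag as text instead of executing it as a script.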


● Privilege escalation: basic users can access some administrative features
  ● The following URLs must not be accessible to users with a standard account:
    • /transmart/secureObjectAccess/manageAccess
    • /transmart/secureObjectAccess/manageAccessBySecObj
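The finding amounts to a missing server-side authorization check on these two URLs. A minimal sketch of such a check is given below, assuming a role-based model; the role names (`ROLE_ADMIN`, `ROLE_USER`) and the function `is_allowed` are hypothetical, and the actual fix in tranSMART would live in its own security configuration.

```python
# Paths taken from the assessment; role names are hypothetical.
ADMIN_ONLY_PATHS = (
    "/transmart/secureObjectAccess/manageAccess",
    "/transmart/secureObjectAccess/manageAccessBySecObj",
)

def is_allowed(path, roles):
    """Return True if a user holding `roles` may request `path`."""
    # Ignore the query string and a trailing slash before matching.
    normalized = path.split("?", 1)[0].rstrip("/")
    if normalized in ADMIN_ONLY_PATHS:
        return "ROLE_ADMIN" in roles
    return True

print(is_allowed("/transmart/secureObjectAccess/manageAccess", {"ROLE_USER"}))
print(is_allowed("/transmart/secureObjectAccess/manageAccess", {"ROLE_ADMIN"}))
```

The check has to run on the server for every request; hiding the links in the UI does not prevent a standard user from typing the URL directly.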


Part 4 – Recommendations and good practices

● Use good development practices to avoid XSS attacks and privilege escalation
  ● Based on development standards such as OWASP

● Ensure compliance of application accounts with the company’s password policy
  ● LDAP authentication using AD (preferred)
  ● Or set up a specific application password policy (password complexity, password expiration, timeout…)

● Encrypt tranSMART authentication (HTTPS)
  ● Avoid sniffing attacks and credential disclosure

● Avoid default or weak accounts
  ● Administrative consoles (JBoss, Tomcat, Axis2) must have complex and secret passwords
    • Risk: Exploit vulnerability to access admin areas and compromise the application (crafted application
    • Consequence: Can impact the application availability or the data confidentiality & integrity
  ● Database accounts (DBA, application) must have complex and secret passwords
    • Risk: Exploit vulnerability to access the Web application database
    • Consequence: Can impact the data confidentiality & integrity

● Raise user awareness of security topics
  ● Lock workstation or log off from tranSMART session to avoid unauthorized access

https://www.owasp.org/index.php/Category:OWASP_Top_Ten_Project
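Where AD/LDAP authentication is not available, an application-level password policy could be checked along the following lines. This is a sketch only: the thresholds (12 characters, 90 days) and the function name are hypothetical and do not reflect Sanofi’s actual policy.

```python
from datetime import date, timedelta
import string

# Hypothetical policy values, chosen for illustration only.
MIN_LENGTH = 12
MAX_AGE = timedelta(days=90)

def password_problems(password, last_changed, today):
    """Return the list of policy rules that the password violates."""
    problems = []
    if len(password) < MIN_LENGTH:
        problems.append("too short")
    if not any(c.islower() for c in password):
        problems.append("no lowercase letter")
    if not any(c.isupper() for c in password):
        problems.append("no uppercase letter")
    if not any(c.isdigit() for c in password):
        problems.append("no digit")
    if not any(c in string.punctuation for c in password):
        problems.append("no symbol")
    if today - last_changed > MAX_AGE:
        problems.append("expired")
    return problems

print(password_problems("admin", date(2012, 1, 1), date(2013, 11, 6)))
```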

Loading of SNP data

Latest achievements


Loading of SNP genotyping data

● Modification of loader.jar (from tranSMART-ETL repository)
  ● Correction of errors
  ● Loading sped up
    • Some inserts replaced by batch inserts
    • Parameters modified to insert/select data
  ● Fewer constraints on file format
    • Columns from the annotation file can be described in property files
    • New class to load SNP data from Illumina platform

● Loading of three studies with SNP data from Illumina platform (> 1 million SNPs)
  ● 4 patients → 40 minutes
  ● 30 patients → 5 hours
  ● 1500 patients → 80 hours

● Integration of SNP loading in ICE (tranSMART Curation & Loading Tool) done
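The gain from replacing single-row inserts with batch inserts can be illustrated with a short sketch. The real loader is a Java program (loader.jar) writing to the tranSMART database; here Python’s built-in `sqlite3` stands in, and the table name, column names, and batch size are hypothetical.

```python
import sqlite3

def load_genotypes(conn, rows, batch_size=1000):
    """Insert genotype rows in batches rather than one statement per row."""
    sql = "INSERT INTO snp_genotype (patient_id, snp_id, genotype) VALUES (?, ?, ?)"
    batch = []
    total = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)  # one call for the whole batch
            total += len(batch)
            batch.clear()
    if batch:  # flush the last, partially filled batch
        conn.executemany(sql, batch)
        total += len(batch)
    conn.commit()
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snp_genotype (patient_id TEXT, snp_id TEXT, genotype TEXT)")

# Synthetic data: 4 patients x 2500 SNPs, produced lazily by a generator.
rows = ((f"P{p}", f"rs{s}", "AA") for p in range(4) for s in range(2500))
loaded = load_genotypes(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM snp_genotype").fetchone()[0]
print(loaded, count)
```

Batching reduces the number of statement executions and commits, which is typically where row-by-row loaders spend most of their time.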


Estimation (on-going)

New tranSMART release under development (‘RC2’)

Improvements – New features


tranSMART RC2 – Scope outline

● Accommodate new data types
  ● miRNA data (qPCR and microarray)
  ● Proteomic data (RBM data, mass spec data)
  ● Metabolomic data
  ● RNA sequencing data

● Accommodate serial data (time courses, dose responses, etc.)

● Enable sequential loading of data for a study

● Enhance critical current analytics
  ● Box Plot, Line Graph, Correlation Analysis, Grid View
  ● Plus adaptation of analytics to new data types

● Enhance data export features


Developments in progress.

Partnership w/ Cognizant and The Hyve.

Completion of RC2 developments planned for January, 2014.

Developments will be contributed back to the community.


Further details on RC2 enhancements: see ‘tranSMART RC2 – Requirements’ in the additional slides

tranSMART RC2 – Key points

● RC2 is built ‘on top of’ the Sanofi RC1 release
  ● ETL: impact of changes = high (Kettle scripts converted into Groovy, new ETL pipelines, mapping files modified)
  ● Data model: impact = high (creation of new tables for new data types, etc.)
  ● UI: impact = low

● Our goal is to converge towards the GPL version
  ● RC1 was merged with ‘Core DB’ & ‘Core API’ enhancements (from GPL1.1)
    • Start of the modularization of tranSMART
  ● New data types are implemented in a modular fashion
    • This should help the future merging of RC2 with the open source code base


Maximally benefit from public tranSMART development efforts

Limit deviation from the open source code base

Contribute back all developments to the community

Do not duplicate efforts

tranSMART x MongoDB integration

Objective and timeline


MongoDB integration with tranSMART (1/2)

● MongoDB is a NoSQL document-oriented database

● Main need for tranSMART: physical storage of unstructured data (i.e., files)
  ● Any files that are uploaded and visible through the Browse tab of the Sanofi RC1 (raw data files, study-related documentation such as clinical protocol, etc.)
  ● Currently, files are stored on the tranSMART app server… Limited storage capacity.

Objective: Move storage of unstructured data from the tranSMART server to MongoDB

● Why MongoDB?
  ● Ability to store a huge volume of unstructured files
  ● Horizontal scalability
  ● Easy installation process
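MongoDB’s usual convention for storing files is GridFS, which keeps one metadata document per file plus a series of fixed-size chunk documents. The slides do not say whether the Sanofi integration uses GridFS, so this is an assumption. The sketch below only models that document layout in plain Python, without a MongoDB server or driver; the file name and the chunk size (255 KiB, the default in current drivers) are illustrative.

```python
import uuid

# Assumed chunk size for this sketch (255 KiB).
CHUNK_SIZE = 255 * 1024

def to_gridfs_like_docs(filename, data, chunk_size=CHUNK_SIZE):
    """Split a file into one metadata document and ordered chunk documents."""
    file_id = uuid.uuid4().hex
    file_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return file_doc, chunk_docs

def reassemble(chunk_docs):
    """Rebuild the original bytes by concatenating chunks in order of `n`."""
    ordered = sorted(chunk_docs, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)

protocol = b"x" * 600_000  # stand-in for an uploaded study document
file_doc, chunks = to_gridfs_like_docs("clinical_protocol.pdf", protocol)
print(file_doc["length"], len(chunks))
```

Because each chunk is an ordinary document, files larger than the single-document size limit can be stored and distributed across servers like any other collection.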


MongoDB integration with tranSMART (2/2)

● Timelines
  ● Integration with Sanofi RC2 release (backend + UI): Q4-2013
  ● Testing in Q1-2014


Conclusion

Any questions?

Thank you!

Acknowledgement: Sherry Cao, Jike Cui, Angelo DeCristofano, Christophe Gibault, Lars Greiffenberg, Manfred Hendlich, Rainer Kappes, Adam Palermo, Annick Peleraux, David Peyruc, Charlotte Raillère, Vincent Rossetto, Claire Virenque


Making a difference in Healthcare with Information Technologies.

Additional slides


tranSMART RC1 – Summary

● Released in March 2013
● Code base available on GitHub

● Main improvements delivered in tranSMART RC1:


Topic 1: Data Management

• Ability to organize data within a hierarchical structure (Program/Study/Assay) with new tagging capabilities
• Synonym management for several dictionaries (e.g. compounds, genes, diseases)
• New capabilities for posting, searching and exporting files
• New functionality to load gene expression analysis results
• Better support for time points/series
• Improvement of tranSMART curation and loading tool & pipelines

Topic 2: tranSMART User Interface

• Simplification of tranSMART UI:
  – All searching functionalities centralized
  – Synchronization of the browser and analysis modules

Topic 3: Data Searching and Analysis

• Improvement of data searching capabilities:
  – Integrated search / filter for querying any data available (levels 1 to 4)
  – More search / filter criteria
• Implementation of standard analytics from GPL1.0


RC1 – New organization of tranSMART UI

● Two main tabs – synchronized with each other:


Navigate within Programs > Studies > Assays , Analysis and File Folders (see next slide)

Search data using dictionaries

Create new Programs > Studies > Assays and Files Folders, and annotate (tag) them

Export files

Visualize gene expression analysis results

Global view of all the data available: from level 1 data (uncurated/raw files) to levels 3-4 data (analysis results, findings)

Run analysis on subject-level data (former Dataset Explorer)

Browse level 2 (processed) data – incl. clinical / preclinical / molecular data, etc.

Search subject-level data

Select data subsets (cohorts)

Run basic statistical and genomic analyses on those subsets (standard features from tranSMART v1.0)

Export out data subsets


tranSMART RC2 – Requirements (1/2)


Area | Req # | Requirement | Sprint #
Data loading / ETL pipelines | 1 | Optimize the clinical ETL pipeline to accelerate loading time for large clinical studies | 2
 | 2 | Enable incremental loading of data for a given study | 4
 | 3 | Enable loading of ‘serial’ high and low dimensional data (time course, dose response, different sampling conditions, etc.) | 2
 | 4 | Improve samples handling | 1
 | 5 | Enable loading of RBM subject-level data as high dimensional data | 3
 | 6 | Enable loading of microarray miRNA subject-level data as high dimensional data | 2
 | 7 | Enable loading of qPCR miRNA subject-level data | 2
 | 8 | Enable loading of mass spec proteomic subject-level data as high dimensional data | 3
 | 9 | Enable loading of metabolomic subject-level data as high dimensional data | 3
 | 10 | Improve SNP subject-level data loading – in particular, accelerate loading time | Done
 | 11 | Enable loading of RNA sequencing subject-level data (gene-level expression quantification) | 2
 | 12 | Optimize the management of annotation files for omic data | Done
Security | 13 | Set up user authentication through the company’s Active Directory | Done
 | 14 | Implement security rules and user permissions in Browse tab (RC1 feature) | 1
Analytics – Advanced Workflows | 15 | Allow better analysis of ‘serial’ high and low dimensional data using existing analytics | 2
 | 16 | Improve the Line Graph analytics: enable Line Graph to use high dimensional data; better handle x axis; add option to plot individual data in addition to group means or medians | 4
 | 17 | Improve sub categorization of high dimensional data (tissue, time points, etc.) in the high dimensional data node selection screen in Advanced Workflows – linked to req #3 | 2
 | 18 | Improve the Boxplot analytics – make individual box plots for each variable when dragging multiple nodes in field ‘Dependent Variable’, and present output in table format | 4
 | 19 | Improve the Correlation Analysis analytics | 4

tranSMART RC2 – Requirements (2/2)


Area | Req # | Requirement | Sprint #
Analytics – Advanced Workflows | 20 | Allow analysis of RBM data using existing analytics for high dimensional data | 3
 | 21 | Allow analysis of microarray miRNA data using existing analytics | 2
 | 22 | Allow analysis of qPCR miRNA and mRNA data using existing analytics | 2
 | 23 | Allow analysis of mass spectrometry subject-level data using existing analytics | 3
 | 24 | Allow analysis of metabolomic subject-level data using existing analytics | 3
 | 25 | Allow analysis of RNA sequencing data using existing analytics | 2
Analytics – Grid View | 26 | Improve Grid View: enable categorical variables in a single column; enable column deletion, row or column selection; enable export of selection; automatically include variables used in Advanced Workflows | 3
 | 27 | Display sample ID related to patient ID in Grid View | 1
Export | 28 | Improve export of data: improve performance (response time) when exporting large data volumes; add advanced filters to allow users to limit the exported data to a subset of clinical fields, genes…; add ability to better categorize the data available for a study (clinical, gene expression, etc.); harmonize with Grid View export capabilities | 2 + 4
 | 29 | Add ability to preview a file in browser (IE8 and Firefox) | 1
Tagging | 30 | Add dictionaries for miRNA, proteins, metabolites | 2
Gene sign. | 31 | In Gene Signature/List tab, add gene symbols – linked to req #12 | Done
UI | 32 | Improve consistency and synchronization of data trees in Browse (Program Explorer panel) and in Analyze (Navigate Terms panel) | 2
Search | 33 | Secure file indexing | Done
 | 34 | After running a free text search in Browse tab, when clicking on bold items in Program Explorer panel, highlight in right hand side Browse panel: string found in metadata (including in file names); files containing that string | 3

Risk Assessment methodology

