Data Integration vs Transparency: Tackling the Tension

Paul Groth
Elsevier Labs
@pgroth | pgroth.com

Data Integration & Transparency: Tackling the tension

University of Fribourg Informatics Colloquium, June 15, 2015



Outline

• Data integration for analysis
  – i.e. remixing data
• The need for transparency
• Provenance as a solution
• The downloads folder problem


60 % of time is spent on data preparation


@Open_PHACTS


Why?

Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing

Literature, PubChem, Genbank, Patents, Databases, Downloads

Data Integration, Data Analysis, Firewalled Databases

Repeat @ each company


Prioritised Research Questions

Number | sum | Nr of 1 | Question

15 | 12 | 9 | All oxidoreductase inhibitors active <100nM in both human and mouse

18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?

24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.

32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.

37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.

38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).

41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.

44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data

46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)

59 | 14 | 8 | Identify all known protein-protein interaction inhibitors

www.openphacts.org


Research question 15: All oxidoreductase inhibitors active < 100 nM in both human and mouse

From Mabel Loza - USC team


ChEMBL:

Search for target "Oxidoreductase": 481 targets from different species

Select all the oxidoreductases and filter bioactivities with the criterion IC50 < 100 (no units could be selected): 11497 results obtained

Table exported to an Excel spreadsheet and manually filtered
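The manual filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not the USC team's procedure: the rows, column names and unit labels are hypothetical stand-ins for an exported table, and the unit normalisation exists only because the search form could not restrict units.

```python
from collections import defaultdict

# Hypothetical rows standing in for the exported spreadsheet; the column
# names and values are illustrative, not a real ChEMBL export.
ROWS = [
    {"compound": "C1", "organism": "Homo sapiens", "type": "IC50", "value": 50.0, "units": "nM"},
    {"compound": "C1", "organism": "Mus musculus", "type": "IC50", "value": 80.0, "units": "nM"},
    {"compound": "C2", "organism": "Homo sapiens", "type": "IC50", "value": 20.0, "units": "nM"},
    {"compound": "C2", "organism": "Mus musculus", "type": "IC50", "value": 0.5, "units": "uM"},
    {"compound": "C3", "organism": "Homo sapiens", "type": "IC50", "value": 90.0, "units": ""},
    {"compound": "C3", "organism": "Mus musculus", "type": "IC50", "value": 10.0, "units": "nM"},
    {"compound": "C4", "organism": "Homo sapiens", "type": "IC50", "value": 0.05, "units": "uM"},
    {"compound": "C4", "organism": "Mus musculus", "type": "IC50", "value": 99.0, "units": "nM"},
]

# Conversion factors to nanomolar; rows with unknown units are dropped.
TO_NM = {"nM": 1.0, "uM": 1000.0}


def to_nm(value, units):
    """Return the value in nM, or None when the unit is missing or unknown."""
    factor = TO_NM.get(units)
    return None if factor is None else value * factor


def active_in_both(rows, threshold_nm=100.0,
                   species=("Homo sapiens", "Mus musculus")):
    """Compounds with an IC50 below the threshold in every listed species."""
    hits = defaultdict(set)
    for row in rows:
        if row["type"] != "IC50" or row["organism"] not in species:
            continue
        nm = to_nm(row["value"], row["units"])
        if nm is not None and nm < threshold_nm:
            hits[row["compound"]].add(row["organism"])
    return sorted(c for c, seen in hits.items() if seen == set(species))


print(active_in_both(ROWS))
```

Rows without a usable unit are discarded rather than guessed at, which is the kind of decision that goes unrecorded when the filtering is done by hand in a spreadsheet.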



5 people

Working 6 hours


Problem: Data Integration

Data warehouse: Data Sources → Extract, Transform, Load → Data Warehouse ← Queries

Mediator: Queries → Mediator → Query Reformulation → Data Sources
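The two architectures can be contrasted with a toy sketch. Everything here is hypothetical (the two sources, their field names, the search function); it only illustrates that a warehouse normalises data ahead of time while a mediator rewrites each query per source at query time.

```python
# Two hypothetical sources with different schemas (names are illustrative).
SOURCE_A = [{"id": "A1", "name": "Adenosine receptor"}]
SOURCE_B = [{"acc": "B1", "label": "Protein kinase C alpha"}]


def etl_warehouse():
    """Warehouse style: extract, transform to one schema, load up front."""
    warehouse = [{"id": r["id"], "name": r["name"]} for r in SOURCE_A]
    warehouse += [{"id": r["acc"], "name": r["label"]} for r in SOURCE_B]
    return warehouse


def query_warehouse(warehouse, term):
    """Queries run against the single, already-normalised copy."""
    return [r["id"] for r in warehouse if term.lower() in r["name"].lower()]


def query_mediator(term):
    """Mediator style: reformulate the query for each source when asked."""
    plans = [(SOURCE_A, "id", "name"), (SOURCE_B, "acc", "label")]
    hits = []
    for rows, id_field, name_field in plans:
        hits += [r[id_field] for r in rows
                 if term.lower() in r[name_field].lower()]
    return hits


print(query_warehouse(etl_warehouse(), "kinase"))
print(query_mediator("kinase"))
```

Both calls return the same answer; the difference is where the schema mapping lives and when it runs.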


Using the Power of Open PHACTS, London, 22-23 April 2013

Applications

Core Platform:
• Linked Data API (RDF/XML, TTL, JSON)
• Domain Specific Services
• Semantic Workflow Engine
• Data Cache (Virtuoso Triple Store)
• Identity Resolution Service
• Identifier Management Service
• Chemistry Registration, Normalisation & Q/C
• index

Identifiers shown: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”

Data sources: Db, published as RDF or Nanopub, with VoID

Public Content, Commercial, Public Ontologies, User Annotations


Open PHACTS Explorer


?


Credits: Curt Tilmes, Peter Fox

Tilmes, C.; Fox, P.; Ma, X.; McGuinness, D.L.; Privette, A.P.; Smith, A.; Waple, A.; Zednik, S.; Zheng, J.G., "Provenance Representation for the National Climate Assessment in the Global Change Information System," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5160-5168, Nov. 2013


Problem: I don’t trust your assessment. What is it based on?


Tension:

Integrated & Summarized Data

Transparency & Trust


Solution

Integrating and exposing provenance provided by multiple sources
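A minimal sketch of what integrating and exposing provenance from multiple sources could look like. The property names prov:wasDerivedFrom and prov:wasAttributedTo are real W3C PROV-O terms; the ex: resources, the two sources and the plain-tuple representation are assumptions made for illustration.

```python
WAS_DERIVED_FROM = "prov:wasDerivedFrom"
WAS_ATTRIBUTED_TO = "prov:wasAttributedTo"

# Hypothetical provenance statements published by two sources.
SOURCE_A = {("ex:recordA7", WAS_ATTRIBUTED_TO, "ex:sourceA")}
SOURCE_B = {("ex:recordB3", WAS_ATTRIBUTED_TO, "ex:sourceB")}

# Provenance added by the integration step itself.
INTEGRATION = {
    ("ex:summary", WAS_DERIVED_FROM, "ex:recordA7"),
    ("ex:summary", WAS_DERIVED_FROM, "ex:recordB3"),
}


def merge(*graphs):
    """Union the statement sets from every source."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def origins(graph, entity):
    """Walk prov:wasDerivedFrom back from entity and report attributions."""
    pending, ancestors = [entity], set()
    while pending:
        current = pending.pop()
        for s, p, o in graph:
            if s == current and p == WAS_DERIVED_FROM and o not in ancestors:
                ancestors.add(o)
                pending.append(o)
    return {(s, o) for s, p, o in graph
            if p == WAS_ATTRIBUTED_TO and s in ancestors}


graph = merge(SOURCE_A, SOURCE_B, INTEGRATION)
print(sorted(origins(graph, "ex:summary")))
```

The integrated value stays a single summary, but a user who does not trust it can ask which source records it rests on.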


provbook.org


National Climate Change Assessment Provenance


Tooling


http://asdf.readthedocs.org/en/latest/provenance.html


http://www.slideshare.net/soilandreyes/20130529-taverna-provenance


Towards Workflow Ecosystems Through Semantic and Standard Representations. Garijo, D.; Gil, Y.; and Corcho, O. In Proceedings of the Ninth Workshop on Workflows in Support of Large-Scale Science (WORKS), held in conjunction with the IEEE/ACM International Conference on High-Performance Computing (SC), New Orleans, LA, 2014.


https://github.com/pgroth/PROVTutorial


Great... but


Data integration is manual


Look to OS techniques

1. Taint Tracking
2. Record and Replay

Work with Manolis Stamatogiannakis & Herbert Bos VU University Amsterdam – Security Group
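The idea behind taint tracking can be shown with a toy model: each value carries labels naming the inputs that influenced it, and every operation propagates those labels. This is only a sketch at the level of Python objects; the Tainted class, the read_input helper and the file labels are hypothetical, and system-level tools apply the same idea far below the application.

```python
class Tainted:
    """A value plus the set of input labels that influenced it."""

    def __init__(self, value, labels):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # Combining two tainted values unions their labels.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value,
                           self.labels | other.labels)
        # Untainted operands contribute no labels.
        return Tainted(self.value + other, self.labels)


def read_input(label, value):
    """Hypothetical input source: tag the value with where it came from."""
    return Tainted(value, {label})


a = read_input("file:a.csv", 2)
b = read_input("file:b.csv", 3)
total = a + b + 10

print(total.value, sorted(total.labels))
```

The labels on the result are, in effect, provenance: they say which inputs the output depends on without the program having been written to record that.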


https://www.youtube.com/watch?v=BD0h6M5mVoo


http://www.androidreran.com


Use R&R for provenance

1. Execution Capture
2. Application of instrumentation
3. Provenance analysis
4. Selection and iteration

Implemented using plugins for Platform for Architecture-Neutral Dynamic Analysis (PANDA)
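The four steps can be illustrated with a toy record-and-replay loop in plain Python. It does not use PANDA or its plugin interface; the pipeline, record and replay functions are hypothetical names, and the only nondeterminism modelled is a random input.

```python
import random


def pipeline(read):
    """Toy computation whose result depends on nondeterministic input."""
    first, second = read(), read()
    return first + second


def record(run):
    """Step 1, execution capture: run once and log every input."""
    log = []

    def read():
        value = random.randint(0, 100)
        log.append(value)
        return value

    return run(read), log


def replay(run, log, hook=None):
    """Step 2, instrumentation: re-run from the log, observing via a hook."""
    inputs = iter(log)

    def read():
        value = next(inputs)
        if hook is not None:
            hook(value)
        return value

    return run(read)


result, log = record(pipeline)

# Step 3, provenance analysis: inspect what the hook saw during replay.
observed = []
replayed = replay(pipeline, log, hook=observed.append)

# Step 4, selection and iteration: replay again with a different hook.
count = []
replay(pipeline, log, hook=lambda v: count.append(1))

print(replayed == result, observed == log, len(count))
```

Because the recording is cheap and the analysis happens on replay, the instrumentation can be changed and rerun as often as needed without repeating the original execution.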


An Example (1)

<exe://pam-foreground-~3451> prov:endedAtTime 199090196 .
<exe://getent~3451> a prov:Activity .
<exe://getent~3451> rdf:type dt:getent .
<exe://cut~3452> a prov:Activity .
<exe://cut~3452> rdf:type dt:cut .
<file:/etc/nsswitch.conf> a prov:Entity .
<file:/etc/nsswitch.conf> rdfs:label "/etc/nsswitch.conf" .
<file:/etc/nsswitch.conf> rdf:type dt:Unknown .
<exe://getent~3451> prov:used <file:/etc/nsswitch.conf> .
# unused file:3477815296:getent~3451:/etc/passwd:r0:w0:f524288
<exe://getent~3451> prov:startedAtTime 199090196 .
<exe://getent~3451> prov:endedAtTime 200392668 .
<file:FD0_3452> a prov:Entity .
<file:FD0_3452> rdfs:label "FD0_3452"
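A small sketch of how statements like these could be read back and queried. The parser is a deliberate simplification that assumes one statement per line with a bracketed subject; it is not a Turtle parser, it keeps prefixed names such as prov:used as opaque strings, and the sample below copies only a few statements from the example.

```python
import re

# A few statements copied from the example, one per line.
STATEMENTS = """\
<exe://getent~3451> a prov:Activity .
<file:/etc/nsswitch.conf> a prov:Entity .
<exe://getent~3451> prov:used <file:/etc/nsswitch.conf> .
# unused file:3477815296:getent~3451:/etc/passwd:r0:w0:f524288
<exe://getent~3451> prov:startedAtTime 199090196 .
<exe://getent~3451> prov:endedAtTime 200392668 .
"""

# <subject> predicate object .  (simplified; not full Turtle)
LINE = re.compile(r"^<([^>]+)>\s+(\S+)\s+(.+?)\s*\.\s*$")


def parse(text):
    """Return (subject, predicate, object) tuples, skipping comment lines."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            subject, predicate, obj = match.groups()
            triples.append((subject, predicate, obj.strip("<>")))
    return triples


def used_files(triples):
    """Map each activity to the entities it used."""
    used = {}
    for subject, predicate, obj in triples:
        if predicate == "prov:used":
            used.setdefault(subject, []).append(obj)
    return used


def elapsed(triples, activity):
    """Difference between the recorded end and start values of an activity."""
    times = {p: int(o) for s, p, o in triples
             if s == activity and p in ("prov:startedAtTime", "prov:endedAtTime")}
    return times["prov:endedAtTime"] - times["prov:startedAtTime"]


triples = parse(STATEMENTS)
print(used_files(triples))
print(elapsed(triples, "exe://getent~3451"))
```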

An Example (2)

Example (3)

Example (4)

Conclusion

• Tension between:
  – putting stuff together; and
  – documenting what’s been done
• Provenance helps
• Issues in collection
• Standards + Stealth¹

¹ Hat tip: Carole Goble


Questions?

• More info:
  – openphacts.org
  – data2semantics.org
  – provbook.org
  – https://github.com/m000/dtracker
  – Manolis Stamatogiannakis, Paul Groth and Herbert Bos. Decoupling Provenance Capture and Analysis from Execution. Theory and Practice of Provenance 2015
  – Luc Moreau, Paul Groth, James Cheney, Timothy Lebo, Simon Miles, The rationale of PROV, Web Semantics: Science, Services and Agents on the World Wide Web, Available online 20 April 2015, http://dx.doi.org/10.1016/j.websem.2015.04.001
  – Paul Groth, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013
  – Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013