Algorithms, data structures and web computing for data mining in biomedicine

Jonas S Almeida

Dept Bioinformatics and Comp. Biol., Univ Texas MDAnderson Cancer Center

OECD Workshop on Knowledge Markets in the Life Sciences, 16-17 October 2008

Reference Papers on integrative infrastructure

Silva S, Gouveia-Oliveira R, Maretzek A, Carrico J, Gudnason T, Kristinsson KG, Ekdahl K, Brito-Avo A, Tomasz A, Sanches IS, de Lencastre H, Almeida JS (2003) EURISWEB - Web-based epidemiological surveillance of antibiotic-resistant pneumococci in Day Care Centers. BMC Med Inform Decis Mak. 2003 Jul 8;3(1):9. [PMID:12846930]

Wang X, R Gorlitsky, and JS Almeida (2005) From XML to RDF: How Semantic Web Technologies Will Change the Design of 'Omic' Standards. Nature Biotechnology, Sep;23(9):1099-103 [PMID:16151403].

Almeida JS, C Chen, R Gorlitsky, R Stanislaus, M Aires-de-Sousa, P Eleutério, JA Carriço, A Maretzek, A Bohn, A Chang, F Zhang, R Mitra, GB Mills, X Wang, HF Deus (2006) Data integration gets 'Sloppy'. Nature Biotechnology 24(9):1070-1071. [PMID:16964209].

Deus HF, R Stanislaus, DF Veiga, C Behrens, II Wistuba, JD Minna, HR Garner, SG Swisher, JA Roth, AM Correa, B Broom, K Coombes, A Chang, LH Vogel, JS Almeida (2008) A Semantic Web management model for integrative biomedical informatics. PLoS ONE. Aug 13;3(8):e2946 [PMID: 18698353].

not affordable

not traceable

not evolvable

not feasible

not manageable

This outcome was anticipated right at the onset of the Web [recall Tim Berners-Lee, “Weaving the Web”].

Desired key features of a web-based data management system:

1. Syntactic interoperability: ability to get the data once told where it is.

2. Semantic interoperability: ability to use the data for a purpose other than the one that dictated its generation.

The path backwards.

Model ID: models, transfer functions [ y = f(x) ]

Variable Selection: boosting, evolutionary algorithms, exhaustive search [ x X ]

Discovery: self-described structures, ontologies, RDF, Description Logic, S3DB [ x [X,Z] ]

Models ----------------------- Tools ---------------------------------- Software Environment

#14. Almeida, J.S., M.A.M.Reis, M.J.T.Carrondo (1997) A Novel Unifying Kinetic Model of Denitrification. J. Theor. Biol. 186:241-249. [doi:10.1006/jtbi.1996.0352]

#31. Wolf G. Almeida JS. Pinheiro C. Correia V. Rodrigues C. Reis MAM. Crespo JG. (2001) Two-dimensional fluorometry coupled with artificial neural networks: A novel method for on-line monitoring of complex biological processes. Biotechnology & Bioengineering. 72(3):297-306.[PMID:11135199]

#36. Almeida, JS (2002) Predictive non-linear modeling of complex data by artificial neural networks. Curr. Op. Biotech. 13(1) 72-76.[PMID:11849962]

#68. Mikhitarian, K., Gillanders, W.E., Almeida, J.S., Hebert Martin R., Varela J.C., Metcalf, J.S., Cole, D.J., and Mitas, M. (2005) An innovative microarray strategy identities informative molecular markers for the detection of micrometastatic breast cancer. Clinical Cancer Research 11(10):3697-704. [PMID:15897566]

#72. Almeida JS, DJ McKillen, YA Chen, PS Gross, RW Chapman, G Warr (2005) Design and Calibration of Microarrays as Universal Transcriptomic Environmental Biosensors. Comparative and Functional Genomics, 6(3):132-137(6). [doi:10.1002/cfg.466].

#77. Garcia SP, Almeida JS (2005) Multivariate phase space reconstruction by nearest neighbor embedding with different time delays. Physical Review E 72, 027205. [PMID:16196759].

#78. Oates JC, Varghese S, Bland AM, Taylor TP, Self SE, Stanislaus R, Almeida JS, Arthur JM (2005) Prediction of urinary protein markers in lupus nephritis. Kidney Int. Dec;68(6):2588-92 [PMID:16316334].

#86. Geli P, P Rolghamre, JS Almeida, K Ekdahl (2006) Modeling Pneumococcal Resistance to Penicillin in Southern Sweden Using Artificial Neural Networks. Microbial Drug Resistance 12(3):149-157. [PMID:17002540]

#95. Wolf G, JS Almeida, JG Crespo, MA Reis (2007) An improved method for two-dimensional fluorescence monitoring of complex bioreactors. J Biotechnol. 128(4):801-12. [PMID:17291616].

#103. Sá-Leão R, Nunes S, Brito-Avô A, Alves CR, Carriço JA, Saldanha J, Almeida JS, Santos-Sanches I, de Lencastre H. (2008) High rates of transmission of and colonization by Streptococcus pneumoniae and Haemophilus influenzae within a day care center revealed in a longitudinal study. J Clin Microbiol. Jan;46(1):225-34. [PMID: 18003797]

Model ID

Lesson learned: predictive independent variables are a needle in the haystack.

2/5

Model ID / Variable Selection

#63. Almeida JS, R Stanislaus, E Krug, J Arthur (2005) Normalization and Analysis of residual variation in 2D Gel Electrophoresis for quantitative differential proteomics. Proteomics 5(5):1242-9 [PMID:15732138].

#64. Mitas M, JS Almeida, K Mikhitarian, WE Gillanders, DN Lewin, DD Spyropoulos, L Hoover, A Graham, T Glenn, P King, DJ Cole, R Hawes, CE Reed, BJ Hoffman (2005) Accurate discrimination of Barrett’s esophagus and esophageal adenocarcinoma using a quantitative three-tiered algorithm and multi-marker real-time RT-PCR. Clin Cancer Res. 2005 Mar 15;11(6):2205-14 [PMID:15788668].

#83. Mueller M, Wagner CL, Annibale DJ, Knapp RG, Hulsey TC, Almeida JS (2006) Parameter selection for and implementation of a web-based decision-support tool to predict extubation outcome in premature infants. BMC Medical Informatics and Decision Making 6:11 [PMID:16509967].

#87. Almeida JS, Oates JC, Arthur JM. (2006) The need for concurrent calibration and discrimination statistics in predictive models. Kidney Int. 70(1):231-2. [doi:10.1038/sj.ki.5001519].

#89. Carrico JA, Silva-Costa C, Melo-Cristino J, Pinto FR, de Lencastre H, Almeida JS, Ramirez M. (2006) Illustration of a common framework for relating multiple typing methods by application to macrolide-resistant Streptococcus pyogenes. J Clin Microbiol. 44(7):2524-32. [PMID:16825375].

#91. Almeida, J.S., S.Vinga (2006) Computing distribution of scale independent motifs in biological sequences. Algorithms for Molecular Biology. 1:18. [PMID:17049089].

#96. Pinto FR, Carrico JA, Ramirez M, Almeida JS. (2007) Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement. BMC Bioinformatics 8(1):44. [PMID:17286861].

#102. Vinga S, Almeida JS. (2007) Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics. 2007 Oct 16;8(1):393. [PMID: 17939871]

Lesson learned: critical co-variables are often found in other haystacks.

3/5

Model ID / Variable Selection / Discovery

#72. Almeida JS, DJ McKillen, YA Chen, PS Gross, RW Chapman, G Warr (2005) Design and Calibration of Microarrays as Universal Transcriptomic Environmental Biosensors. Comparative and Functional Genomics, 6(3):132-137(6). [doi:10.1002/cfg.466].

#76. Wang X, R Gorlitsky, and JS Almeida (2005) From XML to RDF: How Semantic Web Technologies Will Change the Design of ‘Omic’ Standards. Nature Biotechnology, Sep;23(9):1099-103 [PMID:16151403].

#84. Karpievitch YV, Almeida JS (2006) mGrid: A parallel Matlab library for user code distribution. BMC Bioinformatics 7:139 [PMID:16539707].

#90. Almeida JS, C Chen, R Gorlitsky, R Stanislaus, M Aires-de-Sousa, P Eleutério, JA Carriço, A Maretzek, A Bohn, A Chang, F Zhang, R Mitra, GB Mills, X Wang, HF Deus (2006) Data integration gets 'Sloppy'. Nature Biotechnology 24(9):1070-1071. [PMID:16964209].

#101. Vilela M, Borges CC, Vinga S, Vasconcelos AT, Santos H, Voit EO, Almeida JS. (2007) Automated smoother for the numerical decoupling of dynamics models. BMC Bioinformatics 8(1):305. [PMID: 17711581]

#104. Stanislaus R, JM Arthur, B Rajagopalan, R Moerschell, B McGlothlen, JS Almeida (2008). An open-source representation for 2-DE-centric proteomics and support infrastructure for data storage and analysis, BMC Bioinformatics. Jan 7;9:4. [PMID: 18179696]

Lesson learned: more than domain specific models or tools, integrative research requires a Knowledge Engineering environment.

The critical characteristic of that environment is semantic interoperability for both data and tools. Lack of syntactic interoperability is inexcusable.

4/5

A brief history of data

[Figure: data shown first as nested XML elements (< > </ >), then as a set of Rules (rel0 to rel6) and the Statements that instantiate them.]

RDF - everything is a resource

Wang X, R Gorlitsky, and JS Almeida (2005) From XML to RDF: How Semantic Web Technologies Will Change the Design of 'Omic' Standards. Nature Biotechnology, Sep;23(9):1099-103 [PMID:16151403].

[Figure: ER diagram of the three tables Rules, Statements and Resources, leading to RDF. Recovered column labels: Subject, Relation, Object; Subject, Unique ID, Relation, Object, Value; Resource, Unique ID.]

S3DB – user and project tables

Multiple project management

(Wang 2005)

www.s3db.org

Functional considerations

Operational considerations

S3DB – 3 table single project

Almeida et al. (2006) Data integration gets 'Sloppy'. Nature Biotechnology 24(9):1070-1071.

S3DB:

Project: Shultz

Rules:

<V2><Person><has><Name>
<V3><Dog><has><Name>
<V4><Person><has><Dog>
<V5><Person><has><Age>

Statements

<S12><P1><R6><V2>"Charlie Brown"
<S13><P1><R6><V4><R7>
<S14><P1><R6><V5>"56 years old"
<S15><P1><R7><V3>"Snoopy"

Resources

<R6> "This is Charlie Brown"
<R7> "This is Snoopy, Charlie's Dog"

N3:

<P#1><s3:project>"Shultz".
<RC#8><s3:resource><P#1>,<s3:name>"Person".
<RC#9><s3:resource><P#1>,<s3:name>"Dog".
<V#2><s3:rule><P#1><s3:subject><RC#8>,<s3:verb>"has",<s3:object>"Name".
<V#3><s3:rule><P#1><s3:subject><RC#9>,<s3:verb>"has",<s3:object>"Name".
<V#4><s3:rule><P#1><s3:subject><RC#8>,<s3:verb>"has",<s3:object><RC#9>.
<V#5><s3:rule><P#1><s3:subject><RC#8>,<s3:verb>"has",<s3:object>"Age".
<R#6><s3:rsrcInstance><RC#8>,<s3:notes>"This is Charlie Brown".
<R#7><s3:rsrcInstance><RC#9>,<s3:notes>"This is Snoopy, Charlie's Dog".
<S#12><V#2>[<R#6>,"Charlie Brown"].
<S#13><V#3>[<R#7>,"Snoopy"].
<S#14><V#4>[<R#6>,<R#7>].
<S#15><V#5>[<R#6>,"56 years old"].
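The three-table layout of the example (Rules, Statements, Resources) can be sketched with plain Python dictionaries. This is only an illustration under stated assumptions: the dictionary layout and the describe() helper are hypothetical and are not part of S3DB or its API. The identifiers are those of the S3DB listing (V2 to V5, R6, R7, S12 to S15).

```python
# Illustrative sketch only: the dictionary layout and describe() are
# assumptions made for this example, not S3DB's actual storage or API.

# Rules: rule ID -> (subject, verb, object)
rules = {
    "V2": ("Person", "has", "Name"),
    "V3": ("Dog", "has", "Name"),
    "V4": ("Person", "has", "Dog"),
    "V5": ("Person", "has", "Age"),
}

# Resources: resource ID -> free-text note
resources = {
    "R6": "This is Charlie Brown",
    "R7": "This is Snoopy, Charlie's Dog",
}

# Statements: statement ID -> (project, resource, rule, value)
statements = {
    "S12": ("P1", "R6", "V2", "Charlie Brown"),
    "S13": ("P1", "R6", "V4", "R7"),
    "S14": ("P1", "R6", "V5", "56 years old"),
    "S15": ("P1", "R7", "V3", "Snoopy"),
}


def describe(resource_id):
    """Collect {rule object: value} over all statements about one resource."""
    description = {}
    for _project, resource, rule_id, value in statements.values():
        if resource == resource_id:
            _subject, _verb, obj = rules[rule_id]
            description[obj] = value
    return description


print(describe("R6"))
# {'Name': 'Charlie Brown', 'Dog': 'R7', 'Age': '56 years old'}
```

Note that the value of the "Dog" statement is itself a resource identifier (R7), which is what lets a client walk from Charlie Brown to Snoopy without any schema change.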

Flat text file (TXT)

XML structure (XML)

RDF triples (RDF)

A brief history of data

[Figure: S3DB core model. Entities: Deployment, Project, Collection, Item, Rule, Statement, User, Group. Entities are linked by rdfs:subClassOf; Rules and Statements are triples with rdf:subject, rdf:predicate and rdf:object. A Rule corresponds to an Attribute and a Statement to a Value.]

Rule: [Cid] [Iid] [Cid or L]

Statement: [Iid] [Rid] [Iid or L]

Unique Identifiers of entities:

Durl rdf:type s3db:Deployment
Pid rdf:type s3db:Project
Cid rdf:type s3db:Collection
Rid rdf:type s3db:Rule
Sid rdf:type s3db:Statement
Iid rdf:type s3db:Item
Uid rdf:type s3db:User
Gid rdf:type s3db:Group

Annotation of s3db entities {Dublin Core}:

dc:created_by Uid
dc:created_on date
dc:service {term of cv}
etc …

Legend:

S3DB Entity (annotated using DC)
Relationship (defined using RDFS)
Permission (defined by s3db:permission)

Needed only if sharing with a Project that is hosted by a distinct S3DB Deployment.

Almeida JS et al. (2006) Nature Biotechnology 24(9):1070-1071.
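As a rough sketch of how the identifier patterns in the core model constrain triples, the check below reads [Cid] [Iid] [Cid or L] as the shape of a Rule and [Iid] [Rid] [Iid or L] as the shape of a Statement, with L standing for a literal. That reading, the one-letter ID prefixes (C, I, R) and the function names are all assumptions made for illustration; none of them is taken from the S3DB code base.

```python
# Illustrative sketch only: ID prefixes and function names are hypothetical.

def kind(term):
    """Classify a term by an assumed one-letter ID prefix; anything else is a literal."""
    if isinstance(term, str) and len(term) > 1 and term[0] in "CIR" and term[1:].isdigit():
        return {"C": "Cid", "I": "Iid", "R": "Rid"}[term[0]]
    return "L"


# Expected kinds for (subject, predicate, object) of each entity type.
PATTERNS = {
    "rule": ("Cid", "Iid", ("Cid", "L")),
    "statement": ("Iid", "Rid", ("Iid", "L")),
}


def is_valid(entity, triple):
    """Check that a (subject, predicate, object) triple matches the pattern for its entity type."""
    subject, predicate, obj = triple
    want_subject, want_predicate, want_object = PATTERNS[entity]
    return (
        kind(subject) == want_subject
        and kind(predicate) == want_predicate
        and kind(obj) in want_object
    )


print(is_valid("rule", ("C8", "I20", "C9")))             # Collection, Item, Collection
print(is_valid("statement", ("I6", "R5", "56 years old")))  # Item, Rule, literal
print(is_valid("statement", ("C8", "R5", "I7")))          # subject is not an Item
```

Keeping the object slot open to either an identifier or a literal is what allows the same two patterns to express both links between entities and plain values.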

A brief history of integrative architectures

[Figure: three integrative architectures: document-centric (Web 1.0), user-centric (Web 2.0) and semantically aware (Web 3.0). Components shown: S3DB (GUI, API, DB, index) on the web server at IBL, ibl.mdanderson.org (I/O for machines); WebS3DB, the generic web-based GUI for S3DB; specialized stand-alone applications on the client machine (in the lab).]

Snapshots of interfaces using S3DB's API (Application Programming Interface). These applications exemplify why semantic web designs can be particularly effective at enabling generic tools to assist users in exploring data documenting very specific and very complex relationships. Snapshot A was taken from S3DB's web interface, which is included in the downloadable package. This interface was developed to assist in managing the database model and is therefore centered on the visualization and manipulation of the domain of discourse: its Collections of Items and the Rules defining the documentation of their relations. The application depicted in snapshots B-D is a document management tool, S3DBdoc, freely available as a Bioinformatics Station module (see Figure 6). Navigation starts from the Project (C), proceeds to the Collection (B) and ends with the editing of the Statements about an Item (D). Snapshot B illustrates an intermediate step in the navigation, where the list of Items (in this case samples assayed by tissue arrays, for which there is clinical information about the donor) is being trimmed according to the properties of a distant entity, Age at Diagnosis, which is a property of the Clinical Information Collection associated with the sample that originated the array results. This interaction would have been difficult and computationally intensive to manage using a relational architecture. The RDF-formatted query result produced by the API was also visualized using a commercial tool, Sentient Knowledge Explorer (IO-Informatics Inc), shown in snapshot E, and by Welkin (F), developed by the digital interoperability SIMILE project at the Massachusetts Institute of Technology. See text for discussion of graphic representations by these tools. To protect patient confidentiality, some values in snapshots B and D are scrambled, and numeric sample and patient identifiers elsewhere are altered.

#    Entity
104  exfoliatins
103  enterotoxins
102  ClfB
101  LN2 viability test
100  institution
97   antibiotic consumption
96   MRSE frequency
95   MRSA frequency
81   Plasmid analysis
74   mechanism and genes
73   target
63   name
62   number of children
61   DCC
60   bed size
59   specialty
58   category
57   SCCmec typing
56   Rep-PCR
55   Dot-blot
54   LN2 freezing
53   patient clinical data
52   Hospital
51   final classification
50   species and tests
49   code
48   indoor area
47   outdoor area
46   number of employees
45   number of rooms
44   country, city
43   country, state/province/county, city
42   -80 °C
41   isolate reference
40   susceptibility
39   ITQB isolate
38   MIC
37   alternative name
36   3-4 letter code
35   name
34   country, state/province/county, city
33   PCR genes amplification
32   Agr
31   susceptibility
30   beta-lactamase
29   isolates from same subject
28   MIC
27   setting, hospital/DCC/heard, service/room, ICU
26   project, period
25   collection date
24   disk inhibition
23   subject type
22   full name
21   class
20   abbreviation
19   Antibiotic
18   SmaI hybridization bands
17   Phagetyping
16   Ribotyping
15   other
14   hemolysins
13   leukocidins
12   project, station
11   disk inhibition
10   PFGE
9    ClaI-mecA::Tn554
8    MLST
7    patient (or subject) demographic data
6    patient admittance data
5    collection site
4    RAPD
3    monthly fee
2    Doubling time
1    Spa typing

[Figure: snapshots labelled Day 5, Day 17 and Day 365]

Ontology-centric web client

S3DB is equipped with a REST application programming interface (API); that is, client applications can be easily woven together by composing URL calls with variable values.
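A minimal sketch of what "composing URL calls with variable values" can look like from a client, using only the Python standard library. The base URL, the script name and the parameter names below are placeholders invented for this example; the actual S3DB endpoints and arguments are not given in these slides and should be taken from the S3DB documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical deployment URL: a placeholder, not a real S3DB server.
BASE_URL = "https://example.org/s3db/"


def compose_call(script, **params):
    """Build a REST call as a plain URL: base + script + URL-encoded query string."""
    # Sorting keeps the URL stable regardless of keyword order.
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}{script}?{query}" if query else f"{BASE_URL}{script}"


# Script and parameter names here are assumptions for illustration only.
url = compose_call("statements", key="MY_KEY", rule_id=5, format="rdf")
print(url)
# https://example.org/s3db/statements?format=rdf&key=MY_KEY&rule_id=5

# The variable values can be recovered from the URL on the receiving side.
print(parse_qs(urlsplit(url).query))
```

Because the whole request is a URL, the same call can be issued from a browser, a script or another web service, which is the property the slide relies on.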

A year in the life of a semantic database

[Figure: database usage over one year. Panels: Sessions, Rules, Users and Statements per rule, each plotted against Time (days). Snapshots labelled Day 25 and Day 152.]

• Seeding: The first stage of usage of the semantic database is characterized by a focus on the domain of discourse. In this seeding stage many Rules are inserted without validation by submission of actual data (Statements).

• Calibration: Once the submission of data triples (Statements) intensifies, the seed data model is reconsidered and significantly edited. This second stage is characterized by heavy activity, both in expanding or updating the domain of discourse and in submitting data. We found this to be the right time to engage the user community with training programs.

• Growth: This third pattern of usage is much longer than the previous two and corresponds to relatively light editing of the domain of discourse alongside intensified database access by the target community of users. This is distinct from the preceding Calibration stage, where data submission is frequently aided or even mediated by the database developers.

• Maturation: The end of the data acquisition program that motivated the creation of the database is sometimes associated with a decrease in the insertion of new data (Statements) and a near stop in the editing of the domain of discourse (Rules). This period of maturation therefore produces a stable data service that remains useful and is accessed regularly. We found this period to be ideal for harvesting: exporting the database schema for analysis of the knowledge domain, including the design of intuitive graphical user interfaces.

Document-centric clients

… and client-side applications can be easily developed, relying only on the REST protocol to interoperate with the S3DB DBMS service.

S3DB is being used for a variety of molecular epidemiology domains, for example, for Cancer Research:

S3DB TCGA portal

Other sources and usages

caBIG interoperable initiatives

a) Manual data input and retrieval

b) Automatic data submission by BiS applications at high-throughput screening facilities.

c) Daemon applications using S3DB as a web service. These are typically BiS modules, open source bioinformatics applications or R scripts.

d) Public data and web services, for example, at NCBI, Cancer Genome Atlas, etc.

Bioinformatics Station (BiS) server: available for download at http://bioinformaticstation.org

Semantic database (S3DB) server: available for download at http://S3DB.org

For the same functionality as web applications, see the prototype at docs.s3db.org

Code Distribution

BiS

SAAS Data Service

Client App.

Distributed Semantic DBMS

S3DB

Ontology-driven web-service oriented architecture

Composite web-based applications

Desired key features of a web-based information management system:

1. Syntactic interoperability: ability to get the data once told where it is.

2. Semantic interoperability: ability to use the data for a purpose other than the one that dictated its generation.

RESTful WOA

SPARQL endpoints (reified to native API exposed through REST)

Separation of domain of discourse from its instantiation

Permission migration built into the core data model

http://ibl.mdanderson.org