computations using pathways and networks nigam shah [email protected]


Post on 13-Dec-2015


Computations using pathways and networks

Nigam Shah [email protected]

THE GOAL = MAKING SENSE OF HIGH THROUGHPUT DATA

High throughput data

• “High throughput” is one of those fuzzy terms that is never really defined anywhere

• Genomics data is considered high throughput if:
  • You cannot “look” at your data to interpret it
  • Generally speaking, it means ~1000 or more genes and 20 or more samples.
  • There are about 40 different high throughput genomics data generation technologies.
  • DNA, mRNA, proteins, metabolites … all can be measured

How does ontology help?

• An ontology provides an organizing framework for creating “abstractions” of the high throughput data

• The simplest ontologies (i.e. terminologies, controlled vocabularies) provide the most bang-for-the-buck
  • Gene Ontology (GO) is the prime example

• More structured ontologies, such as those that represent pathways and higher-order biological concepts, still have to demonstrate real utility.

Gene Ontology to analyze microarray data

Using GO annotations

Descriptions built by connecting/linking ontology terms

Biologists interpret a list of genes and form a result statement such as:

The photosynthesis genes located in the chloroplast are repressed in response to ozone stress and have the ABRE binding site enriched in their promoters.

… more structure

?<link>?

<Some MF> in <Some BP>

OBOL

Relations Ontology


Between-ontology structure

… more structure [beyond GO]: PATO

The building blocks of phenotype descriptions: EQ

Entity (bearer) such as spermatocyte, wing

Quality (property, attribute)
- a kind of dependent continuant

Formally, an EQ description defines:
- a Quality which inheres_in a bearer entity

The building blocks are combined according to the Pheno-syntax

www.fruitfly.org/~cjm/formats
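The EQ building blocks can be pictured as a small record type. This is only a sketch: the class name, field names and the example quality label are assumptions made for illustration, and the output is not the actual Pheno-syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQDescription:
    """A phenotype description: a Quality that inheres_in a bearer Entity."""
    entity: str   # the bearer, e.g. an anatomy or cell-type term
    quality: str  # the property or attribute borne by the entity

    def as_statement(self) -> str:
        # Spell out the relation named on the slide: Quality inheres_in Entity.
        return f"{self.quality} inheres_in {self.entity}"


# "wing" comes from the slide; the quality label is made up for illustration.
eq = EQDescription(entity="wing", quality="reduced size")
print(eq.as_statement())
```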

Semantically structured annotations

1. Relationship ontology
2. Mouse Pathology ontology
3. Tissue/Organ
4. Gene ontology

mRNA of genes encoding proteins with mf in bp at cc is increased in sample-id which shows some pathology in some tissue in some organ

Basal layer of organ shows membranous staining

Queries enabled:
1. Identify all images with a specific pathology
2. Identify cases with pathology and some gene expression changes
3. Correlate changes in biological processes with change in morphology

Discovery enabled:
1. Classify samples in expression space and “look” for histological changes that correlate with it.
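A rough sketch of how annotations structured this way make the three queries straightforward. The record layout, function names, sample ids and ontology terms below are all invented for illustration; a real system would store ontology identifiers, not free-text labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    sample_id: str
    pathology: str                         # term from a pathology ontology
    tissue: str
    organ: str
    changed_process: Optional[str] = None  # GO biological process, if measured
    direction: Optional[str] = None        # e.g. "increased" or "decreased"


# Invented example records.
annotations = [
    Annotation("s1", "hyperplasia", "epithelium", "skin", "cell proliferation", "increased"),
    Annotation("s2", "hyperplasia", "epithelium", "skin"),
    Annotation("s3", "necrosis", "parenchyma", "liver", "apoptotic process", "increased"),
]


def samples_with_pathology(anns, pathology):
    """Query 1: every sample annotated with a specific pathology."""
    return [a.sample_id for a in anns if a.pathology == pathology]


def samples_with_pathology_and_expression(anns, pathology):
    """Query 2: samples with the pathology and a recorded expression change."""
    return [a.sample_id for a in anns
            if a.pathology == pathology and a.changed_process is not None]


def process_changes_by_pathology(anns):
    """Query 3: group biological process changes by the morphology they co-occur with."""
    table = {}
    for a in anns:
        if a.changed_process is not None:
            table.setdefault(a.pathology, set()).add((a.changed_process, a.direction))
    return table


print(samples_with_pathology(annotations, "hyperplasia"))
print(samples_with_pathology_and_expression(annotations, "hyperplasia"))
print(process_changes_by_pathology(annotations))
```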


WHY

HOW

Open Questions/Challenges

• Creation/acceptance of a systematic formalism for creating expressive annotations (e.g. associated_with, involves).

• A generic tool that uses ontologies and allows the user to compose terms and cross-ontology annotations
  • Easy term/annotation composition
  • Control the amount of alternative [compositional] statements allowed

Pathways to analyze array data

“Pathways” to analyze array data

• The notion of a cancer signaling pathway can serve as an organizing framework for interpreting microarray expression data.

• On examining a relatively small set of genes based on prior biological knowledge about a given pathway, the analysis becomes more specific.
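As a minimal sketch of that idea, the expression table can be cut down to the genes of one pathway before any further analysis. The gene names, values and the function name are placeholders, not data from a real pathway resource.

```python
def restrict_to_pathway(expression, pathway_genes):
    """Keep only the rows whose gene belongs to the pathway gene set."""
    return {gene: values for gene, values in expression.items()
            if gene in pathway_genes}


# Placeholder expression table: gene -> expression values across samples.
expression = {
    "gene1": [0.1, 0.3, 0.2],
    "gene2": [1.5, 1.7, 1.6],
    "gene3": [0.9, 0.8, 1.0],
    "gene4": [2.1, 0.4, 0.3],
}

# Placeholder pathway membership taken from prior knowledge.
pathway_genes = {"gene2", "gene4", "gene9"}

subset = restrict_to_pathway(expression, pathway_genes)
print(sorted(subset))
```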

Reactome’s sky painter

Operations on pathway resources

Operation | Custom code | RDF + SPARQL | OWL + SWRL
Verify a pathway resource | Proofreading Reactome [1] | In progress | In progress
Perform integrated querying of multiple pathway resources | Hard (“wrapper” approaches) | PKB [2] |
Verify multiple pathway resources | Too hard (there are ~200) | |
Merge and compare multiple pathway resources | | |
“Reason” over pathway resources | | |

[1] A case study in pathway knowledgebase verification, BMC Bioinformatics 2006, 7:196
[2] Pathway Knowledge Base: An Integrated pathway resource using BioPAX, Submitted to Applied Ontology

Merge and compare pathway resources

• Given a set of ‘nodes’ and some ‘links’ among them, query multiple pathway sources and fill in the most plausible interactions between the nodes.
  • Plausible = not contradicted by existing data and knowledge

• Current pathway resources [in BioPAX] cannot support this because the manner in which ‘nodes’ and ‘links’ are identified is arbitrary.
  • Reactome has started to connect the pathway steps with GO biological processes.

• BioPAX lets pathway sources “export” their nodes and links.
  • … but p53 in resource A is still different from P53 in resource B
  • … and Activate in resource A is still different from activates in resource B
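The identity problem can be made concrete with a toy comparison of two exported edge lists. The synonym tables and the function names below are hypothetical stand-ins for shared identifiers; they are not part of BioPAX or of any pathway resource.

```python
# Hypothetical lookup tables mapping each resource's labels to shared identifiers.
ENTITY_IDS = {"p53": "TP53", "P53": "TP53", "mdm2": "MDM2", "MDM2": "MDM2"}
RELATION_IDS = {"Activate": "activates", "activates": "activates",
                "Inhibit": "inhibits", "inhibits": "inhibits"}


def normalize(edge):
    """Rewrite a (source, relation, target) edge using the shared identifiers."""
    source, relation, target = edge
    return (ENTITY_IDS.get(source, source),
            RELATION_IDS.get(relation, relation),
            ENTITY_IDS.get(target, target))


def compare(resource_a, resource_b):
    """Split normalized edges into shared ones and those unique to each resource."""
    a = {normalize(e) for e in resource_a}
    b = {normalize(e) for e in resource_b}
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


resource_a = [("p53", "Activate", "mdm2")]
resource_b = [("P53", "activates", "MDM2"), ("MDM2", "inhibits", "P53")]

# Compared as exported, the two resources appear to share nothing.
print(set(resource_a) & set(resource_b))

# After mapping to shared identifiers, the common edge shows up.
print(compare(resource_a, resource_b))
```

The comparison is only as good as the lookup tables, which is exactly the part that exporting nodes and links to a common format does not supply.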

Problem

• I have no clue what a pathway is!
  • A set or series of interactions, often forming a network, which biologists have found useful to group together for organizational, historic, biophysical or other reasons.

• The complexity and abstraction represented in a pathway is decided by its author attempting to represent the interactions between a set of genes, proteins, and small molecules.

“Networks” to analyze high throughput genomic data

Building networks

• Take a high throughput dataset

• Define a notion of ‘relatedness’ depending on the dataset
  • Co-expression for microarray data
  • Co-occurrence for literature networks
  • …

• List [node]--<link>--[node] pairs

• Find a good graph drawing program!
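The recipe above can be sketched for the co-expression case. Pearson correlation and the 0.9 cutoff are assumptions chosen for illustration (the slides do not name a measure or a threshold), and the expression values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def coexpression_edges(expression, threshold=0.9):
    """List [node]--<link>--[node] triples for gene pairs with |r| >= threshold."""
    edges = []
    for gene_1, gene_2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[gene_1], expression[gene_2])) >= threshold:
            edges.append((gene_1, "co-expressed", gene_2))
    return edges


# Made-up expression profiles: gene -> values across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}

edges = coexpression_edges(expression)
print(edges)
```

The resulting edge list is what would be handed to a graph drawing program; with thousands of genes and a permissive threshold it grows quickly.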

Nice hairball but …

From Long et al, in Trends in Biochemical Sciences, vol 32, no 7.

Srinivasan B, Snow R, Shah N and Batzoglou S in Interactome Networks conference @ CSHL

From Srinivasan et al, in Briefings in Bioinformatics August 2007.

Hypotheses/Models to analyze high throughput genomic data

Events and Implicit claims

An hypothesis is a statement about relationships (among objects) within a biological system.

Protein P induces transcription of gene X

An ‘event’ is a relationship between two biological entities.

Implicit claims that can be tested:
1. P is a transcription factor.
2. P is a transcriptional activator.
3. P is localized to the nucleus.
4. P can bind to the promoter of gene X.

[Figure labels: P; promoter | gene X]
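One way to picture the step from an event to its implicit claims is a lookup from the operator to claim templates. The table below holds only the one example from the slide; the dictionary and function names are assumptions made for this sketch.

```python
# Claim templates per operator; only the slide's example operator is filled in.
IMPLICIT_CLAIMS = {
    "induces transcription of": [
        "{a} is a transcription factor",
        "{a} is a transcriptional activator",
        "{a} is localized to the nucleus",
        "{a} can bind to the promoter of {b}",
    ],
}


def implicit_claims(agent_a, operator, agent_b):
    """Expand an event into the testable claims it implies (empty if unknown)."""
    templates = IMPLICIT_CLAIMS.get(operator, [])
    return [t.format(a=agent_a, b=agent_b) for t in templates]


claims = implicit_claims("P", "induces transcription of", "gene X")
for claim in claims:
    print(claim)
```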

Representing Events Explicitly

A hypothesis consists of at least one event stream

An event stream is a sequence of one or more events or event streams with logical joints (or operators) between them.

An event has exactly one agent_a, exactly one agent_b and exactly one operator (i.e. a relationship between the two agents). It also has a physical location that denotes ‘where’ the event happened, the genetic context of the organism and associated experimental perturbations when the event happened.

A logical joint is the conjunction between two event streams.
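These definitions translate fairly directly into data structures. The sketch below is a reading of the slide text, not the project's actual schema: the class names, field names and default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Event:
    """Exactly one agent_a, one operator and one agent_b, plus context."""
    agent_a: str
    operator: str                        # the relationship between the two agents
    agent_b: str
    location: str = "unspecified"        # 'where' the event happened
    genetic_context: str = "wild type"   # assumed default, for illustration
    perturbations: Tuple[str, ...] = ()  # associated experimental perturbations


@dataclass
class EventStream:
    """One or more events or event streams, joined by logical joints."""
    items: List[Union[Event, "EventStream"]]
    joints: List[str] = field(default_factory=list)

    def __post_init__(self):
        if not self.items:
            raise ValueError("an event stream needs at least one event")
        if len(self.joints) != len(self.items) - 1:
            raise ValueError("need one logical joint between consecutive items")


@dataclass
class Hypothesis:
    """At least one event stream."""
    streams: List[EventStream]

    def __post_init__(self):
        if not self.streams:
            raise ValueError("a hypothesis needs at least one event stream")


event = Event("P", "induces transcription of", "gene X", location="nucleus")
hypothesis = Hypothesis([EventStream([event])])
print(hypothesis.streams[0].items[0].operator)
```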


User interfaces

Hypothesis described in natural language

Biological process described in a formal language

Evaluating an hypothesis


A. Representation of an hypothesis in terms of events (ev = event)

B. Holding the mouse on a neighboring hypothesis (b1) shows what event was replaced to create it

C. Plot of the support versus conflicts for submitted and neighboring hypotheses (n1, b1). Clicking on the n1 submits that hypothesis as ‘seed’
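A toy version of the support-versus-conflicts idea in panels A to C. The slides do not describe how support and conflict are actually computed, so the counting rule, the function names and the example events below are all assumptions made for illustration.

```python
def score(hypothesis, supported, contradicted):
    """Count events backed by known data (support) and events contradicted by it."""
    support = sum(1 for event in hypothesis if event in supported)
    conflicts = sum(1 for event in hypothesis if event in contradicted)
    return support, conflicts


def neighbors(hypothesis, alternatives):
    """Build hypotheses that differ from the seed by exactly one replaced event."""
    result = []
    for i, event in enumerate(hypothesis):
        for alt in alternatives.get(event, []):
            result.append(hypothesis[:i] + (alt,) + hypothesis[i + 1:])
    return result


# Events as (agent_a, operator, agent_b); a hypothesis is a tuple of events.
induces = ("P", "induces transcription of", "X")
represses = ("P", "represses transcription of", "X")
binds = ("X", "binds", "Y")

seed = (induces, binds)
supported = {binds, represses}        # made-up "existing knowledge"
contradicted = {induces}
alternatives = {induces: [represses]}

print(score(seed, supported, contradicted))
for neighbor in neighbors(seed, alternatives):
    print(neighbor, score(neighbor, supported, contradicted))
```

A neighbor that scores better than the seed is the kind of hypothesis one would resubmit as the new seed.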

HyBrow: lessons learnt

• The minimum requirements for a formal representation:
  • Ability to represent data, information, and knowledge
  • A language to unambiguously express your “thought experiment” (your model, hypothesis, theory, theorem, etc.)
  • A reasoning framework to evaluate the outcome/validity/accuracy of your thought experiment

• Project Home page: www.hybrow.org

Pathways as “models”?

• Pathways are assumed to be models representing biological processes, without actually knowing the modeling formalism in which the model is valid.

• The ‘language’ of writing out a pathway doesn’t really have a grammar and/or a logic

• Most pathways end up being lists of heterogeneous sets of “steps” (in terms of the time of execution, the place of execution, the abstraction level, the kind of ‘thing’ passed along etc…)

• Lots of discussion on the requirements of data providers; where are the users/consumers and their use cases?

Claims

• Pathways are useful only if they can serve as “models” [accurate representations] of a process
  • Hence whatever needs to be done to ensure that a pathway is a valid model in at least one formalism should be required of the pathway author.

• A pathway representation that doesn’t solve the problem of uniquely identifying entities doesn’t solve the problem of integrating pathways.

• We just end up with marked up, structured information from multiple providers, without actually integrating anything.

Success of projects in the Biomedical domain

[Figure axes: KR complexity, minimal to high; computational complexity, minimal to high]
