CSCI6904 Genomics and Biological Computing Lecture 3 – Conceptual Biology Cells, Gene circuits Conceptual Biology

Upload: aubrey-floyd

Post on 26-Dec-2015


CSCI6904

Genomics and Biological Computing

Lecture 3 – Conceptual Biology

Cells, Gene circuits

Conceptual Biology

Overview

Computing in Biological systems

Cells compute information and react programmatically to various situations. We will take a brief look at what a cell is and how cells “compute”.

Evolutionary emergence of Networks

These circuits of gene products arise in a stochastic manner. We will take a quick look at how this random walk results in a combinatorial strategy for evolving solutions.

Investigating Networks

None of these networks is visible; investigating the relationships in the physical world is a resource-consuming operation.

Building Knowledge models of cells using text mining

Present a test case called GeneWays.

Cells

Scope of molecular Biology

Molecular biology tries to organize a stochastically evolved system comprising hundreds of thousands of components.

None of these components can be seen, even under the most powerful microscopes.

They are usually present at the 10^-8 to 10^-12 gram scale.

They degrade in a matter of seconds to hours.

The bottom line is:

Everything we know about this system comes from fragments of information.

Many of these are going to be refuted over time.

Cells as processors

Scope of Biological research

Research is usually structured such that individual contributions can be pieced together into a “pathway”.

[Figure: pathway map; labels: sugar, essential oils (plants), vitamin K, bile, eye pigments, sexual hormones, amino acids]

Networks

How do they come into being?

Combinatorial assembly during a stochastic process.

What is done to understand the main pathways?

Grasping even the smallest fact about one edge in the graph is a feat.

Evolutionary Quandary

Intelligent design opposition to evolution of complex systems

A -> B -> C -> D

Useless metabolites

A -> D

Impossible

Therefore, the pathway A->D had to be designed by an intelligent entity which had the knowledge of the intended purpose of the pathway!

A closer look at high-level gene organization

A modular system

Proteins can be broken down into domains.

A combinatorial effect

Domains can assemble in a combinatorial fashion, bringing together a vast array of potential biological activities.

Proteins are made of domains

Proteins are organized into domains

Transcription factor eF1eF1/ (PDB: 1IJF)

http://www.ncbi.nlm.nih.gov

Domains have several interesting properties.

Domains fold onto themselves, so in most cases they can be expressed separately.

They are small relative to whole proteins, which may make it easier for them to fold rapidly into the right conformation.

They usually provide a biological function through binding or catalysis.

A stochastic process

A molecular network

[Figure: molecular network diagram; legend: (symbol) = an interaction]

Interfaces are expensive to evolve

Transcription factor eF1/ (PDB: 1IJF)

Interfaces are very sensitive to mutation as they must provide a perfect match.

Network of Metabolites

Metabolites essentially form a network with a scale-free property, which parallels the stochastic assembly of domains.

At least, this appears to be true given the data available so far.
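One common textbook way to see how a scale-free degree distribution can emerge from purely stochastic assembly is preferential attachment. The sketch below is only an illustration of that idea, not the analysis from the cited paper; the function names, the single-link growth rule, and the network size are all assumptions made for the example.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a network one node at a time. Each new node links to one
    existing node, chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # start from a single interaction
    endpoints = [0, 1]    # each node is listed once per edge it touches
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)  # degree-proportional choice
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

def degree_counts(edges):
    """Count how many edges touch each node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

edges = preferential_attachment(2000)
deg = degree_counts(edges)
histogram = Counter(deg.values())  # degree -> number of nodes with that degree
```

Plotting `histogram` on log-log axes should show most nodes with very few links and a small number of highly connected hubs, which is the qualitative signature the slide refers to.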

Rzhetsky and Gomez, 2001. Bioinformatics, 17:988-996

http://www.genego.com/about/products.shtml

Evolutionary Quandary

Back to our A to D problem.

A -> B -> C -> D

An observed pathway therefore is simply a path connecting an input molecule and a required output. Each edge can be seen as a gene product (protein).

Overall, the pathway offers some kind of advantage to the host organism.

With positive selection, the pathway gets better and looks as if it were designed for a specific purpose.
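The idea of a pathway as a path through a network, with each edge standing for a gene product, can be sketched as a plain graph search. The toy network, the molecule names beyond A to D, and the enzyme labels below are hypothetical; breadth-first search is just one convenient way to find a connecting path.

```python
from collections import deque

# Hypothetical toy network: molecule -> {product: gene product catalysing the step}
network = {
    "A": {"B": "enzyme1", "X": "enzyme4"},
    "B": {"C": "enzyme2"},
    "C": {"D": "enzyme3"},
    "X": {},
}

def find_pathway(network, source, target):
    """Breadth-first search for a shortest chain of conversions."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in network.get(path[-1], {}):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of existing gene products connects the two

path = find_pathway(network, "A", "D")
# Gene products used along the path, one per edge.
enzymes = [network[a][b] for a, b in zip(path, path[1:])]
```

In this picture nothing plans the route: the edges already exist for other reasons, and a path that happens to connect an available input to a useful output is what selection can then act on.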

Scope of Biological research

Density of knowledge-generating statements per article with respect to source journals

Where it becomes a bioinformatics problem:

Nature of the problem

Building a global model from plain English text sources.

Size

Complexity

What is done in the GeneWays project

The workflow of their integrated system

What I think it really means in the long run

The relationship between research and researchers

(The right information system will be the next big thing)

Motivation

Human limitations and data-heavy and knowledge-heavy disciplines

Synthesizing

Hypothesis building

Visualizing

Record keeping

Modeling Knowledge Streamlining

Structuring

(Directing)

(Changing the way research is communicated?)

Motivation

In knowledge-intensive fields, the connection between investigators and background information is thinning.

[Figure: diagram relating Data, Hypothesis, Experiment, Information (data, concepts), and Knowledge, with Bioinformatics and Computational Biology marked. Annotation: “This arrow does not scale up as quickly as the others.”]

Scope of GeneWays

Build a model for molecular biology from plain-English publications.

Allow a more holistic approach to hypothesis formulation.

Scope of GeneWays

~3 million statements

150,000 full-text articles

Scope of GeneWays

What are we looking for, ultimately?

protein A binds gene B

gene B regulates gene C

gene C expresses protein D

protein D inactivates protein A
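Statements of this kind map naturally onto (subject, verb, object) triples in a directed graph. The sketch below is an assumed representation for illustration, not the GeneWays data model; it stores the four statements and follows them to recover the feedback loop they form.

```python
statements = [
    ("protein A", "binds", "gene B"),
    ("gene B", "regulates", "gene C"),
    ("gene C", "expresses", "protein D"),
    ("protein D", "inactivates", "protein A"),
]

def build_graph(statements):
    """Map each term to its outgoing (verb, object) edges."""
    graph = {}
    for subject, verb, obj in statements:
        graph.setdefault(subject, []).append((verb, obj))
        graph.setdefault(obj, [])
    return graph

def find_cycle(graph, start):
    """Depth-first walk; return the node sequence of a loop back to start, or None."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for _verb, nxt in graph[node]:
            if nxt == start:
                return path + [start]
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None

graph = build_graph(statements)
loop = find_cycle(graph, "protein A")
```

Once individual statements are stored this way, circuit-level structure such as the loop from protein A back to itself can be read off the graph rather than from any single article.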

Scope of GeneWays

Doc Sorting

Term identification

Disambiguation

Information extraction

Ontology

Visualization

Details of GeneWays

Doc Sorting

From abstracts, using either clustering (unsupervised) or Naïve Bayes.

This system uses a mixture of methods to achieve the binary classification: relevant / irrelevant
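As a rough illustration of the Naïve Bayes half of this step, the sketch below trains a multinomial Naïve Bayes classifier with add-one smoothing on a handful of made-up abstracts. The training texts, the whitespace tokenizer, and the function names are assumptions for the example; the actual GeneWays classifier and its features are not described in these slides.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns per-label word counts and document counts."""
    word_counts = {}
    label_counts = Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the highest log posterior under add-one smoothing."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training abstracts.
training = [
    ("kinase phosphorylates substrate protein", "relevant"),
    ("gene expression regulates protein binding", "relevant"),
    ("hospital staffing survey results", "irrelevant"),
    ("survey of patient satisfaction", "irrelevant"),
]
model = train(training)
label = classify("protein kinase binding", *model)
```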

Details of GeneWays

Tagging terms

Especially hard in biology(?)

Morphological rules

Grammatical rules

Rules/dictionary methods

SVM

HMM

Naïve Bayes

Decision Trees

Recall in the 70-80% range

Details of GeneWays

Tagging terms

HTML -> XML-like format

Details of GeneWays

Tagging terms

Vertices:

Gene

Protein

Gene or protein

Process

Small molecules

Species

Complex

Disease

Domain (protein)
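A dictionary method for tagging terms with vertex types like those above can be sketched in a few lines. Everything specific here is hypothetical: the lexicon entries, the `<term type="...">` tag layout, and the function name are invented for illustration and are not the GeneWays markup.

```python
import re

# Hypothetical dictionary mapping surface terms to vertex types.
LEXICON = {
    "il2": "geneorprotein",
    "apoptosis": "process",
    "atp": "smallmolecule",
}

def tag_terms(text, lexicon=LEXICON):
    """Wrap every dictionary term found in text in an XML-like tag."""
    # Longest terms first, so longer names win over their substrings.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )

    def wrap(match):
        kind = lexicon[match.group(0).lower()]
        return f'<term type="{kind}">{match.group(0)}</term>'

    return pattern.sub(wrap, text)

tagged = tag_terms("IL2 induces apoptosis")
```

A pure dictionary lookup like this is exactly where ambiguous or unlisted names slip through, which is why the slides list rule-based and statistical taggers alongside it.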

Details of GeneWays

Tagging terms

Edges:

N-acylate

Acetylate

N-glycosylate

O-glycosylate

Bind

Degrade

(De-)methylate

(De-)phosphorylate

[Make|break] bond

Express

Transcribe

Release

Interact

Substitute

…

n = 125 (2001)

Details of GeneWays

Learning new verbs:

AVAD system

χ² (chi-squared) statistics of occurrence of terms before and after tagged items.

Log-likelihood test based on frequency of occurrence in corpus-specific literature.

“Co-localize” and “synergize” were discovered using AVAD.
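To make the chi-squared idea concrete, the sketch below computes the Pearson statistic for a 2x2 table of how often a candidate verb appears next to tagged items versus elsewhere. The counts are invented and the table layout is an assumption; the exact statistics used by AVAD are described in the cited paper, not here.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Hypothetical counts for one candidate verb:
#                   next to a tagged item   elsewhere
# candidate verb          a = 30              b = 10
# all other words         c = 70              d = 890
score = chi_squared_2x2(30, 10, 70, 890)

# With one degree of freedom, a score above 3.84 is significant at the 5% level.
is_candidate = score > 3.84
```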

Nomenclature

There are obscure ways to agree:

Protein kinase A phosphorylates protein B

Is the same as:

B + ATP -> B-P + ADP (reaction catalyzed by A)

Nomenclature

There are obscure ways, period:

Genes named:

“Forever Young” in Arabidopsis thaliana (mustard family)

“Mothers against decapentaplegic” in fruit fly

Never mind the jargon!

Fight fire with fire:

They developed a method that uses BLAST, a popular sequence database search algorithm, to mine for biological terms.

(Krauthammer et al., 2000. Gene. 259:245-252)

Never mind the jargon!

Fight fire with fire:

N-(2-Hydroxyethyl)piperazine-N'-(2-ethanesulfonic acid) (HEPES)

2-(N-Morpholino)ethanesulfonic acid (MES)

3-(N-Morpholino)propanesulfonic acid (MOPS)

N-tris[Hydroxymethyl]methyl-3-aminopropanesulfonic acid (TAPS)

tris(Hydroxymethyl)aminomethane (TRIS)

Details of GeneWays

Disambiguation

il2 and interleukin-2 can both be used to refer to the gene, the protein, or the mRNA.

Details of GeneWays

Disambiguation

Use canonical names as much as possible.

Learn Semantic classes

Details of GeneWays

Information extraction

Correlation methods

HMM

Formal grammar (lexicon)

GeneWays uses the NLP system GENIES.

It attempts complete parsing, then defaults to segmenting and partial parsing.

Details of the NLP system

GENIES (GENomics Information Extraction System)

Based on MedLEE (medical NLP system)

Term tagging component uses rules and external knowledge

Nested relationships; normalized and agentive forms of verbs (e.g. inhibit, inhibition, and inhibitor).

Details of GeneWays

Information simplification

Convert nested relationships into a collection of binary statements.
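One way to picture this simplification step: treat a parsed sentence as a nested (subject, verb, object) tuple and replace every inner statement with a generated identifier. This is a minimal sketch under that assumption; the tuple encoding and the "S1", "S2" identifiers are hypothetical and are not taken from the GeneWays simplifier.

```python
def simplify(nested):
    """Flatten a nested (subject, verb, object) tuple into a list of
    (id, subject, verb, object) statements that contain no nesting."""
    flat = []

    def visit(node):
        if not isinstance(node, tuple):
            return node  # a plain term, nothing to flatten
        subject, verb, obj = node
        subject, obj = visit(subject), visit(obj)
        ident = f"S{len(flat) + 1}"
        flat.append((ident, subject, verb, obj))
        return ident  # the enclosing statement refers to this one by id

    visit(nested)
    return flat

# "protein A inhibits [protein B phosphorylates protein C]"
nested = ("protein A", "inhibits", ("protein B", "phosphorylates", "protein C"))
binary = simplify(nested)
```

Inner statements are emitted before the statements that refer to them, so every identifier is defined before it is used.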

Details of GeneWays

Ontology

Knowledge Models

Uses for GeneWays

Visualization

Synthesis and querying facility

The only filter described at the time of publication is a filter based on the number of statements supporting an edge.
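A support-count filter of this kind is simple to sketch: count how many extracted statements back each edge and keep the edges at or above a threshold. The example statements, the threshold, and the function name are assumptions for illustration.

```python
from collections import Counter

def filter_by_support(statements, min_support):
    """Keep edges backed by at least min_support extracted statements."""
    support = Counter(statements)
    return {edge: n for edge, n in support.items() if n >= min_support}

# Hypothetical extracted statements; repeats come from different sentences or articles.
extracted = [
    ("protein A", "binds", "gene B"),
    ("protein A", "binds", "gene B"),
    ("protein A", "binds", "gene B"),
    ("gene C", "expresses", "protein D"),
]
kept = filter_by_support(extracted, min_support=2)
```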

Validation of GeneWays

Expert Review

125 of 2,500 statements were erroneous or “phantoms”.

Of these 125:

- 100 due to term identification.
- 12 NLP errors.
- 5 simplifier errors.
- 8 actually correct!

System’s precision: 95%

Expert’s precision: 93.5%

Such a system should be seen as a means to enrich
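The arithmetic behind the 95% figure can be checked directly from the counts on this slide. The last value, which re-counts the 8 flagged-but-correct statements as correct, is a derived illustration and not a number reported in the source.

```python
reviewed = 2500
flagged = 125
breakdown = {
    "term identification": 100,
    "NLP": 12,
    "simplifier": 5,
    "actually correct": 8,
}

# The four categories account for every flagged statement.
accounted_for = sum(breakdown.values())

# Precision as reported: statements not flagged / statements reviewed.
precision = (reviewed - flagged) / reviewed

# Derived illustration only: treat the 8 "actually correct" statements as correct.
true_errors = flagged - breakdown["actually correct"]
adjusted_precision = (reviewed - true_errors) / reviewed
```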

Validation of GeneWays

Redundancy

Redundant statements are not necessarily “more true”.

Redundancy due to indirect relationships.

Validation of GeneWays

A parser’s nightmare:

Statement : “mitogen-activated protein kinase kinase kinase (MAPKKK) phosphorylates protein B”

Interpretations:

1. Protein kinase [protein] is activated by the mitogen [complex]
2. MAPK [protein] phosphorylates MAPKK [protein]
3. MAPKK [protein] phosphorylates MAPKKK [protein]
4. MAPKKK [protein] phosphorylates B [protein]

Potential historical artifacts:

1. B [protein] is activated by the mitogen [complex]
2. MAPKK [wrongly thought to be MAPK] phosphorylates B [protein]
3. …

Perspective

References

Main: Rzhetsky et al., 2004. GeneWays: a system for extracting, analysing, visualizing, and integrating molecular pathway data. J. Biomed. Informatics, 37:43-53

Learning Verbs: Hatzivassiloglou, V., Weng, W. Learning Anchor Verbs for Biological Interaction Patterns from published text articles. www.cs.columbia.edu/nlp/papers/2002/hatzivassiloglou_weng_02.pdf

NLP processor: Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. 2001. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17:S74-S82

Acknowledgement: Aditya Aggarwal, the student who dug out this paper to present in CSCI 6904 (2004)