Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia

Indian-Russian Meeting, Novosibirsk, October 13-14, 2008

Combined network of transcription regulation and protein-protein interaction for inferring genome-wide functional linkages

Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia


Russian-Indian Collaborating Project

State Research Center of Genetics and Selection of Industrial Microorganisms, Moscow, Russia

Prof. Shekhar Mande

Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences

Prof. Mikhail Gelfand


Comparative genomics can show gene functional linkages

• Co-occurrence in known operons

• Minimal distance between a pair of genes in a genome (unknown operons)

• Phylogenetic profiling (similar behaviour of a gene pair in several genomes)

266 linear genomes make it possible to evaluate functional linkages between genes by statistical methods

Yellaboina et al. Genome Research, 2007 17: 527-535
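As a rough illustration of phylogenetic profiling, the sketch below scores how often two genes are present in the same genomes. The hypergeometric test and the function name `profile_overlap_pvalue` are assumptions chosen for this example; the slides only say that statistical methods are used and do not name the test.

```python
from math import comb


def profile_overlap_pvalue(profile_a, profile_b):
    """P-value for the co-occurrence of two genes across genomes.

    Each profile is a list of 0/1 flags, one per genome (1 = gene present).
    Returns the hypergeometric probability of seeing at least the observed
    number of shared genomes if the two profiles were independent.
    """
    n = len(profile_a)
    if n != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    a = sum(profile_a)  # genomes containing gene A
    b = sum(profile_b)  # genomes containing gene B
    k = sum(1 for x, y in zip(profile_a, profile_b) if x and y)  # shared
    total = comb(n, b)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    return sum(comb(a, j) * comb(n - a, b - j)
               for j in range(k, min(a, b) + 1)) / total


if __name__ == "__main__":
    gene_a = [1, 1, 1, 0, 0, 0]
    gene_b = [1, 1, 1, 0, 0, 0]
    print(profile_overlap_pvalue(gene_a, gene_b))
```

A small P-value means the two genes appear together in more genomes than expected by chance, which is the signal used as evidence of a functional linkage.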


What are the mechanisms behind “functionally related genes”?

Protein-protein interactions obtained by high-throughput experimental methods correlate well with the functional relatedness of genes inferred by bioinformatics.

Yellaboina et al. Genome Research, 2007 17: 527-535

Protein-protein interaction…

BUT …

Metabolic pathways

Transcription co-regulation

Or several simultaneously

Other extravagant mechanisms (direct interactions in genome)


Bioinformatics of transcription regulation of bacterial genes

• Specific promoters

• RNA-based switches

• Specific protein transcription regulatory factors (TF)

TF-mediated regulation is often responsible for regulation of complex processes

Cross-talk

Non-trivial concentration dependence (quorum sensing)


DNA-signals responsible for TF binding

Bacterial regulatory sites:

1. Usually long and divergent

2. Positioning relative to the promoter is often important

3. Sites for cross-talking proteins may overlap


Integrated database

Functionally related genes

Metabolically associated genes

Transcription associated genes

Protein interactions


Bioinformatics for hierarchy of organization levels of biosystems

12 program components integrated into a single system:

TandemSWAN, BASIO, ALEX, SeSiMCMC, STRUSWER, STRUDL, RNA-MBFS, Prophet, Oligomeasure, PSACR, Combinator, KMD

[Diagram: organization levels. DNA (sequence); RNA (sequence, structure); protein (sequence, structure, complex); variation between species and individuals in populations]


Some technical points


Two integrated databases

• Molecular entities

• Genome annotations

PathWay Studio, Ariadne Genomics, Inc.

Original database of genome annotation and transcription regulation


Integration of data on binding sites and genome annotations

• All experimental and predicted binding sites and other segment data are mapped onto the genome

• Filtering of multiple identical entries and obviously irrelevant sites in EcoCyc

• Site positioning relative to other genomic structures (repeats, genes)

• Motifs are represented as lists of allowed words

• Different experimental sources, as well as comparative genomics studies, are used for motif construction


Viewpoints

• A database that contains experimental data and computational predictions in an integrated manner

• XML format for organizing data flow

• Possible distributed computations

• Possible platform independence (Ruby & Java)


Unified storage for experimental data from different sources

SELEX

Comparative genomics

footprinting

Motif models

Genome

small-BiSMark: XML-based small language for Biological Sequence Markup

database engine

filtering identical and irrelevant motifs, preprocessing


Identification of optimal binding motifs using stochastic optimization

• SeSiMCMC – Gibbs sampler based algorithm for identification of binding motif

• Multiple local alignment of candidate genomic sequences

• Optimization of the motif length

• Modeling of dyads (palindromes and tandem repeats) in motif structures

• Priors for absent sites and sites at the forward and backward DNA strands
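To make the Gibbs sampling idea concrete, here is a deliberately minimal site sampler. It is a sketch, not SeSiMCMC itself: it assumes exactly one site per sequence, a fixed motif width, a uniform background and only the forward strand, so motif length optimization, dyad modeling and the priors listed above are all left out. The function name and parameters are hypothetical.

```python
import random


def gibbs_motif_search(seqs, width, iterations=200, seed=0, pseudocount=0.5):
    """Return one motif start position per sequence (upper-case ACGT only)."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from random site positions.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held in range(len(seqs)):
            # Build a count profile from the sites of all other sequences.
            counts = [{b: pseudocount for b in alphabet} for _ in range(width)]
            for idx, (other, start) in enumerate(zip(seqs, positions)):
                if idx == held:
                    continue
                for j in range(width):
                    counts[j][other[start + j]] += 1
            total = len(seqs) - 1 + len(alphabet) * pseudocount
            # Score every window of the held-out sequence as a likelihood
            # ratio of the profile against a uniform background (0.25).
            target = seqs[held]
            weights = []
            for start in range(len(target) - width + 1):
                w = 1.0
                for j in range(width):
                    w *= (counts[j][target[start + j]] / total) / 0.25
                weights.append(w)
            # Sample the new site position in proportion to its score.
            positions[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


if __name__ == "__main__":
    demo = ["ACGTTGACCAGT", "TTGACCAGGGCA", "CCATTGACCTTA", "GGGTTTGACCAC"]
    print(gibbs_motif_search(demo, width=6, iterations=50, seed=1))
```

Because the procedure is stochastic, the returned positions depend on the seed; in practice several independent runs are compared.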

SeSiMCMC result on a TRANSFAC dataset

Known binding site motif (SELEX, a sequence logo for Sp1 factor binding site)


SeSiMCMC sampler page


Identification of spaced and overlapping motifs


Regulatory regions: Different types of architecture

ArcA sites

Promoter

Homotypic Clusters

Clusters aligned with promoters

Overlapping and spaced binding sites


Statistical validation of selectivity and identification of optimal binding motifs

• AhoPro algorithm for calculation of P-value of site binding

• Comparison of different binding motifs

• Using different motif models

• Selection of the optimal motif

• Direct calculation of motif selectivity for different specificity levels

Motif models support includes

• Positional weight matrices

• Word lists

• IUPAC strings
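The three motif representations are closely related. The sketch below expands an IUPAC string into the word list it denotes and tabulates a word list into position-specific counts, the raw material of a weight matrix. The helper names are illustrative and are not part of AhoPro.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_words(pattern):
    """Expand an IUPAC string into the list of plain DNA words it allows."""
    choices = (IUPAC[ch] for ch in pattern.upper())
    return ["".join(letters) for letters in product(*choices)]


def words_to_counts(words):
    """Tabulate equal-length words into per-position nucleotide counts."""
    length = len(words[0])
    counts = [{b: 0 for b in "ACGT"} for _ in range(length)]
    for word in words:
        for i, ch in enumerate(word):
            counts[i][ch] += 1
    return counts


if __name__ == "__main__":
    words = iupac_to_words("AY")
    print(words)
    print(words_to_counts(words))
```

Note that the word list grows exponentially with the number of ambiguous positions, which is one practical reason to keep several motif models available.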


Aho-Corasick pattern matching automaton

H1 = {ACC, AACC, ACCTT}

H2 = {AT, CTAT, CT}

[Diagram: trie of the words of H1 and H2 growing from the root]

Each state at the i-th step is a class $C_i(r_1, r_2; q)$, where

• $i$ is the step (text length)

• $r_1$ is the number of occurrences of the first motif

• $r_2$ is the number of occurrences of the second motif

• $q$ is the longest suffix of the text in the prefix closure of $H_1 \cup H_2$

Probabilities to be at each state (probability transducer): $\Pr(C_{i+1}(r_1, r_2; q'))$

Aho-Pro algorithm: exact P-value calculation


We developed an algorithm for exact p-value calculation for multiple occurrences of multiple motifs.

Boeva, V., J. Clement, M. Regnier, M.A. Roytberg, and V.J. Makeev. 2007. Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms Mol Biol 2: 13.
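The state classes described on the automaton slide can be turned into a small dynamic program. The sketch below is a simplified reading of that idea, not the published AhoPro code: it assumes an i.i.d. random text, counts every occurrence of every word (overlaps included), and caps the counters at the requested thresholds. The function names are hypothetical.

```python
from collections import defaultdict


def build_automaton(h1, h2, alphabet):
    """States are all prefixes of the words; a state is the longest text
    suffix that is such a prefix. hits[q] gives how many words of each
    motif end when state q is entered."""
    words = set(h1) | set(h2)
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    step, hits = {}, {}
    for q in prefixes:
        hits[q] = (sum(q.endswith(w) for w in set(h1)),
                   sum(q.endswith(w) for w in set(h2)))
        for c in alphabet:
            t = q + c
            while t not in prefixes:  # drop letters until a known prefix
                t = t[1:]
            step[(q, c)] = t
    return step, hits


def cluster_pvalue(h1, h2, n, k1, k2, probs=None):
    """Probability that a random text of length n holds at least k1
    occurrences of motif h1 and at least k2 occurrences of motif h2."""
    probs = probs or {c: 0.25 for c in "ACGT"}
    step, hits = build_automaton(h1, h2, "".join(probs))
    dist = {("", 0, 0): 1.0}  # (state q, count r1, count r2) -> probability
    for _ in range(n):
        nxt = defaultdict(float)
        for (q, r1, r2), p in dist.items():
            for c, pc in probs.items():
                t = step[(q, c)]
                d1, d2 = hits[t]
                nxt[(t, min(k1, r1 + d1), min(k2, r2 + d2))] += p * pc
        dist = nxt
    return sum(p for (q, r1, r2), p in dist.items() if r1 >= k1 and r2 >= k2)


if __name__ == "__main__":
    H1 = ["ACC", "AACC", "ACCTT"]
    H2 = ["AT", "CTAT", "CT"]
    print(cluster_pvalue(H1, H2, n=50, k1=1, k2=2))
```

Capping the counters keeps the number of classes at most (number of states) × (k1 + 1) × (k2 + 1), so the cost grows linearly with the text length.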

AhoPro – p-value calculator!


A data flow for motif model construction

[Diagram: data flow from experimental sources to motif models]

• Footprinting results: genome-mapped, with correct flanking sequences

• ChIP-chip: raw long sequences

• SELEX: short sites or site parts; may be used as a mask, to be used as the initial mask

• SeSiMCMC (two runs in the diagram), each producing a motif model with additional motif length estimation

• Verification

• Example: Sp1 binding site


Obtaining clean data from specific sources

Using TRANSFAC as base data source for binding sites of a selected factor

[Diagram: extracting binding-site data from TRANSFAC]

• TRANSFAC entries for the factor's binding sites: footprinted sequence, nearest gene

• Filtering ambiguous entries

• Extracting the chromosome region containing the footprinted sequence, with 5000 bp flanks on each side

• Dataset of footprinted sequences with flanks, stored via the database engine (small-BiSMark)


A verification procedure for created motif model

New motif model

Testing sequence set

Wisely chosen set of motif-containing sequences

AhoPro

Choosing optimal motif specificity

Selectivity testing

Processed experimental data (via SeSiMCMC)

Newly discovered motif (by SeSiMCMC or ScanSeq)


Comparative motif analysis

Testing sequence set

Footprinting, ChIP-chip data,

Randomly generated set

Known motif model 1

New motif model

Known motif model 2

Known motif model 3

AhoPro

Selectivity testing

Comparative analysis

Selecting best motif model


Genome-wide motif distribution mapping

Known motif model 1

New motif model

Known motif model 2

Known motif model 3

Sites of different quality, positioned genome-wide on the chromosome

Possible clustering of sites:

• different models for one factor

• best models for different factors

Positioning within specific DNA regions: CRM, CpG islands, etc.


Distributed computations support

«Theatre manager»: possible management of multiple Opera Houses for grid computing support (request redirecting and resource balancing only)

«Opera House»: multi-process remote task execution control service on a single physical machine

«Opera libretto»: execution of a specified scenario

[Diagram: «Theatre manager», three «Opera House» nodes, physical machine, main database]


Overview of the technical realization of the complex

• Database level: MySQL

• Server level: Ruby-powered cross-platform DRb-based server

• Data-workflow level: Ruby-powered cross-platform scenario scripts

• Application level: SeSiMCMC and AhoPro, high-speed C++ code

• Web-interface level: Ruby-based CGI (Ruby on Rails in future)

• small-BiSMark processing: Ruby and Java-based tools (REXML, JAXP, SAXON)


THE END


Acknowledgments

• GosNIIgenetika group:

- Vsevolod Makeev

- Alexander Favorov

- Elizaveta Permina

- Valentina Boeva

- Ivan Kulakovsky

- Dmitry Malko

• Financial support:

- Russian Federation State Innovation Project

- Russian Foundation of Basic Research

- DST India


Biological data analysis components

• DNA analysis:

• Basio – large-scale sequence analysis: compositional segmentation

• TandemSWAN – tandem repeats in DNA sequences

• SeSiMCMC – DNA motif identification

• Oligomeasure – DNA structure from DNA sequence


TandemSWAN

• Tandem repeats with substitution but without indels with a control of repeat statistical significance

tttatttatttatttatttatttatttatttatttatttatttatttatttatttatttattta

Finds micro- and minisatellites with substitutions
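A rough way to see what "tandem repeats with substitutions but without indels" means computationally: compare the sequence with itself shifted by a candidate period and count mismatches. This sketch only illustrates that idea; it does not reproduce TandemSWAN's statistical significance control, and the function name is hypothetical.

```python
def best_period(seq, max_period=10):
    """Return (period, mismatch_fraction) for the shift with the fewest
    mismatches between seq[i] and seq[i + period]. Ties keep the
    shortest period."""
    best = None
    for period in range(1, max_period + 1):
        pairs = len(seq) - period
        if pairs <= 0:
            break
        mismatches = sum(seq[i] != seq[i + period] for i in range(pairs))
        fraction = mismatches / pairs
        if best is None or fraction < best[1]:
            best = (period, fraction)
    return best


if __name__ == "__main__":
    perfect = "ttta" * 16
    print(best_period(perfect))
    # One substitution still leaves period 4 as the best shift.
    mutated = perfect[:10] + "c" + perfect[11:]
    print(best_period(mutated))
```

Indels would shift the register of the repeat, so this shifted self-comparison only works for the substitution-only case named on the slide.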


BASIO: BAyesian Segmentation Information Optimizer

• Performs DNA parsing into segments with a uniform composition

• Uses Bayesian optimization over all possible segment configurations

• Uses Bayesian Information Criterion (BIC) to control segmentation resolution

[Diagram: processing pipeline]

Input sequence

Split: sequence preprocessing

Basio: basic segmentation algorithm

Filter: remove short or redundant segments

Report: select the appropriate output format; format the output

List of segments

atcatatca|ggcggcgcagccgcagcc|tctcttcttc
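The role of BIC in controlling segmentation resolution can be shown with a single boundary. The sketch below finds the best split of a sequence into two segments of different composition and accepts it only if the likelihood gain outweighs the BIC penalty for the extra parameters. It is a greedy, one-boundary simplification, whereas Basio is described as optimizing over all segment configurations; the parameter count used for the penalty is an assumption.

```python
from collections import Counter
from math import log


def log_likelihood(segment):
    """Maximum log-likelihood of a segment under its own letter frequencies."""
    n = len(segment)
    return sum(c * log(c / n) for c in Counter(segment).values())


def bic_split(seq, params_per_segment=3):
    """Return the best boundary position, or None if BIC rejects any split.

    A split adds one more composition (3 free frequencies for DNA) plus
    one boundary position, so it must gain more than
    0.5 * (params_per_segment + 1) * ln(n) in log-likelihood.
    """
    n = len(seq)
    base = log_likelihood(seq)
    best_pos, best_ll = None, base
    for pos in range(1, n):
        ll = log_likelihood(seq[:pos]) + log_likelihood(seq[pos:])
        if ll > best_ll:
            best_pos, best_ll = pos, ll
    penalty = 0.5 * (params_per_segment + 1) * log(n)
    if best_pos is not None and best_ll - base > penalty:
        return best_pos
    return None


if __name__ == "__main__":
    print(bic_split("at" * 20 + "gc" * 20))
    print(bic_split("acgt" * 10))
```

Applying the same test recursively to each accepted segment would give a simple multi-segment procedure, though not a globally optimal one.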


SeSiMCMC – Sequence Similarity Markov Chain Monte Carlo

• SeSiMCMC – Gibbs sampler based algorithm for identification of binding motif

• Multiple local alignment of candidate genomic sequences

• Optimal identification of the motif length

• Analysis of symmetries in motif structures

• Priors for absent sites and sites at the forward and backward DNA strands

SeSiMCMC result on a TRANSFAC dataset

Known binding site motif (SELEX, a sequence logo for Sp1 factor binding site)


ALEX – Alignment of Exons

Identifies exons in a genomic alignment

CTGACGCACAGACCCAAGTGACGACGAGGCCGA

CGGACGGACAGACCCAAGTGACGACGAGGCCGA

Columns: REG BEG M, REG END M, REG BEG H, REG END H, Glob, Best Exon Beg, Best Exon End, Best Exon Type, Best Start Exon Beg, Best Start Exon End, Best Inner Exon Beg, Best Inner Exon End, Best Stop Exon Beg, Best Stop Exon End, Best One Exon Beg, Best One Exon End

1425 2127 126 831 16.85 1779 1882 1 1779 1882 1752 1882 1752 1886 0 0

2373 2896 1202 1741 36.65 2452 2705 0 2398 2705 2452 2705 2849 2896 0 0

3279 3544 2238 2503 30.56 3279 3544 0 3512 3544 3279 3544 3279 3515 0 0

3827 4806 2815 3785 59.30 4002 4490 2 4017 4061 4002 4061 4002 4490 4017 4490

5051 5436 4059 4454 8.37 5051 5131 2 0 0 0 0 5051 5131 0 0


PROTEIN ANALYSIS

• Struswer – Smith Waterman aligner taking into account the secondary structure

• Prophet – Secondary structure predictor based on discriminant analysis

• PSIC – multiple alignment with homologs


1 N C 0.990 0.007 0.006

2 A C 0.900 0.040 0.044

3 K C 0.769 0.108 0.093

4 L C 0.666 0.315 0.074

5 K C 0.539 0.489 0.048

6 P H 0.405 0.639 0.033

7 V H 0.100 0.875 0.030

8 Y H 0.072 0.908 0.025

9 D H 0.055 0.926 0.012

10 S H 0.059 0.928 0.007

11 L H 0.069 0.919 0.005

12 D H 0.078 0.921 0.002

13 A H 0.051 0.946 0.003

14 V H 0.062 0.928 0.006

15 R H 0.096 0.880 0.010

16 R H 0.107 0.894 0.013

17 C H 0.104 0.899 0.015

18 A H 0.051 0.945 0.014

STRUSWER: STRUcture extension of Smith-Waterman alignER

STRUSWER – alignment of protein sequences with reference to their secondary structure

-------------------------------------------------------------------------------------------
1a04A.exp A <-> 1au7A.exp A
erd.vnqLtprerdi.lklIaqGlpnkmiarrLdites.......tvkvhvkh.....mlkkmklksrveAavwvhqErif.........
...gmraLeqfanefkvrrIklGytqtnvgeaL...aavhgsefsqtticrfenlqlsfknac....klkAilskwlEe..aeqkrrtti
LLL.HHHLLHHHHHH.HHHHHLLLLHHHHHHHHLLLHH.......HHHHHHHH.....HHHHHLLLLHHHHHHHHHHHLLL.........
...LHHHHHHHHHHHHHHHHHHLLLHHHHHHHH...HHLLLLLLLHHHHHHHHLLLLEHHHHH....HHHHHHHHHHHH..LLLLLLLLL
score: 326.000000 ID : 0.11

Output alignment format. Top: alignment of primary structures; bottom: alignment of secondary structures.
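One simple way to let secondary structure influence a Smith-Waterman alignment is to add a bonus whenever the aligned residues carry the same structure label. The sketch below does exactly that and returns only the best local score. The scoring values, the linear gap penalty and the function name are assumptions for illustration; the actual STRUSWER scoring scheme is not given in the slides.

```python
def sw_score(seq_a, ss_a, seq_b, ss_b,
             match=2, mismatch=-1, gap=-2, ss_bonus=1):
    """Best local alignment score of two proteins.

    seq_* are residue strings and ss_* are secondary-structure strings of
    the same lengths (for example H = helix, L = loop, E = strand).
    """
    if len(seq_a) != len(ss_a) or len(seq_b) != len(ss_b):
        raise ValueError("sequence and structure strings must match in length")
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if ss_a[i - 1] == ss_b[j - 1]:
                score += ss_bonus  # reward agreeing secondary structure
            table[i][j] = max(0,
                              table[i - 1][j - 1] + score,  # align the pair
                              table[i - 1][j] + gap,        # gap in seq_b
                              table[i][j - 1] + gap)        # gap in seq_a
            best = max(best, table[i][j])
    return best


if __name__ == "__main__":
    print(sw_score("ACD", "HHH", "ACD", "HHH"))
    print(sw_score("ACD", "HHH", "ACD", "LLL"))
```

With a structure bonus, distantly related sequences (the example above has sequence identity 0.11) can still be aligned along matching helices and loops.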


Protein Secondary Structure Prediction PROPHET


RNA-MBFS (RNA MultiBranch-Free Structures)

M is a multi-branched loop with branching degree 3. The substructures to the left of M (i.e. E-F-G-I-H) and above M (i.e. C-B-S-T) are non-branching.

[Diagram labels: S, B, C, M, E, F, G, H, I, T, 5' and 3' ends]

Fig. 3.1. RNA secondary structure.

Creates optimal RNA-structure without branching


Integration on the level of computation and data

• Easily accessible via web interface

• Integration at data level

• Cluster and local network distributed computation support

• Cross-platform


Building complex computational applications

• Possibility to create individual scenarios for any special task

• Pipelining support for computational conveyors

• Simple XML format for scenario and conveyor descriptors


Individual user spaces and profiles

• Individual result storage and file library

• Individual user account support


Easy remote administration via web interface


What’s under the hood

• Used technologies and program tools

– MySQL 5 database as result and user space storage

– JSP for web-interface

– Apache Tomcat 5 JSP/Servlet container

– Java 5 and RMI for distributed computations server and node-software


Acknowledgements

• Financial support of

- Russian Federation State Contract № 02.434.11.100 (Intellectual technologies 2). Prof. Tumanyan V.G.

- Russian Academy of Sciences project in Molecular and Cellular Biology

• Contributors

- Institute of Mathematical Problems of Molecular Biology (Moscow Region, Pushchino, Russia)

- Voronezh State University, Voronezh, Russia

- State Research Center of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, Moscow, Russia
