October 13-14, 2008, Novosibirsk, Indian-Russian Meeting
Combined network of transcription regulation and protein-protein interaction for inferring genome-wide functional linkages
Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia
Russian-Indian Collaborative Project
State Research Center of Genetics and Selection of Industrial Microorganisms, Moscow, Russia
Prof. Shekhar Mande
Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences
Prof. Mikhail Gelfand
Comparative genomics can show gene functional linkages
• Co-occurrence in known operons
• Minimal distance between a pair of genes in a genome (unknown operons)
• Phylogenetic profiling (similar behaviour of a gene pair across several genomes)
266 linear genomes make it possible to evaluate functional linkages between genes by statistical methods
Yellaboina et al. Genome Research, 2007 17: 527-535
What are the mechanisms behind “functionally related genes”?
Protein-protein interactions obtained by high-throughput experimental methods correlate well with the functional relatedness of genes obtained by bioinformatics.
Yellaboina et al. Genome Research, 2007 17: 527-535
Protein-protein interaction…
BUT …
Metabolic pathways
Transcription co-regulation
Or several simultaneously
Other extravagant mechanisms (direct interactions in genome)
Bioinformatics of transcription regulation of bacterial genes
• Specific promoters
• RNA-based switches
• Specific protein transcription regulatory factors (TF)
TF-mediated regulation is often responsible for the regulation of complex processes
Cross-talk
Non-trivial concentration dependence (quorum sensing)
DNA-signals responsible for TF binding
Bacterial regulatory sites:
1. Usually long and divergent
2. Position relative to the promoter is often important
3. Sites for cross-talking proteins may overlap
Integrated database
Functionally related genes
Metabolically associated genes
Transcriptionally associated genes
Protein interactions
Bioinformatics for hierarchy of organization levels of biosystems
12 program components integrated into a single system
[Diagram: organization levels DNA, RNA and protein, described by sequence, structure and complex, plus variation between species and individuals in populations]
Components: TandemSWAN, BASIO, ALEX, SeSiMCMC, STRUSWER, STRUDL, RNA-MBFS, Prophet, Oligomeasure, PSACR, Combinator, KMD
Some technical points
Two integrated databases
• Molecular entities: PathWay Studio, Ariadne Genomics, Inc.
• Genome annotations: original database of genome annotation and transcription regulation
Integration of data on binding sites and genome annotations
• All experimental and predicted binding sites and other segment data are mapped onto the genome.
• Multiple identical entries and obviously irrelevant sites in EcoCyc are filtered out.
• Sites are positioned relative to other genomic structures (repeats, genes).
• Motifs are represented as lists of allowed words.
• Different experimental sources, as well as comparative genomics studies, are used for motif construction.
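Mapping a word-list motif onto a genome can be illustrated with a short sketch. It is not the project's code: the function names `reverse_complement` and `map_word_list`, the toy motif and the toy sequence are all assumptions made for illustration. Both strands are scanned, and collecting hits in a set drops identical entries.

```python
# Hypothetical sketch: a motif is a list of allowed words, mapped onto both strands.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def map_word_list(genome, words):
    """Return sorted (start, strand, word) hits; start is 0-based on the forward strand."""
    hits = set()  # a set removes multiple identical entries
    for word in words:
        for strand, pattern in (("+", word), ("-", reverse_complement(word))):
            start = genome.find(pattern)
            while start != -1:
                hits.add((start, strand, word))
                start = genome.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)


genome = "TTGACATTTGTCAA"          # toy sequence (assumption)
motif_words = ["TGACA", "TGTCA"]   # toy word-list motif (assumption)
sites = map_word_list(genome, motif_words)
```

Each hit keeps its strand and the word that matched, so later steps can position sites relative to genes or repeats.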
Viewpoints
• A database that contains experimental data and computational predictions in an integrated manner
• XML format for organizing data flow
• Possible distributed computations
• Possible platform independence (Ruby & Java)
Unified storage for experimental data from different sources
Data sources: SELEX, comparative genomics, footprinting
Stored objects: motif models, genome
small-BiSMark: XML-based small language for Biological Sequence Markup
Database engine: filtering of identical and irrelevant motifs, preprocessing
Identification of optimal binding motifs using stochastic optimization
• SeSiMCMC – a Gibbs-sampler-based algorithm for identification of binding motifs
• Multiple local alignment of candidate genomic sequences
• Optimization of the motif length
• Modeling of dyads (palindromes and tandem repeats) in motif structures
• Priors for absent sites and for sites on the forward and backward DNA strands
SeSiMCMC result on a TRANSFAC dataset
Known binding site motif (SELEX, a sequence logo for Sp1 factor binding site)
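The core idea of a Gibbs site sampler can be sketched in a few lines. This is a deliberately reduced illustration and not SeSiMCMC itself: it assumes exactly one site per sequence, a fixed motif width, the forward strand only and a uniform background, so it covers none of the length optimization, dyad modeling or absent-site priors listed on the slide. The function name `gibbs_motif_search` and all parameters are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"


def gibbs_motif_search(seqs, width, n_iter=200, pseudo=0.5, seed=0):
    """Return one site start per sequence found by a basic Gibbs site sampler."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        for k, seq in enumerate(seqs):
            # Count matrix built from the current sites of all sequences except the k-th.
            counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j != k:
                    site = other[starts[j]:starts[j] + width]
                    for pos, base in enumerate(site):
                        counts[pos][base] += 1
            total = len(seqs) - 1 + pseudo * len(ALPHABET)
            # Weight every window of sequence k by its likelihood ratio
            # against a uniform background (0.25 per letter).
            weights = []
            for start in range(len(seq) - width + 1):
                log_ratio = sum(
                    math.log(counts[pos][seq[start + pos]] / total / 0.25)
                    for pos in range(width)
                )
                weights.append(math.exp(log_ratio))
            # Resample the site position of sequence k proportionally to the weights.
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy input (assumption): four short sequences sharing a planted word.
seqs = ["ACGTTGACGTCATT", "GGTGACGTCACCAA", "TTATGACGTCAGGC", "CATGACGTCATTTG"]
starts = gibbs_motif_search(seqs, width=8)
```

The sampler is stochastic, so the result depends on the seed; in practice several runs are compared.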
SeSiMCMC sampler page
Identification of spaced and overlapping motifs
Regulatory regions: Different types of architecture
ArcA sites
Promoter
Homotypic Clusters
Clusters aligned with promoters
Overlapping and spaced binding sites
Statistical validation of selectivity and identification of optimal binding motifs
• AhoPro algorithm for calculating the P-value of site binding
• Comparison of different binding motifs
• Use of different motif models
• Selection of the optimal motif
• Direct calculation of motif selectivity for different specificity levels
Supported motif models:
• Positional weight matrices
• Word lists
• IUPAC strings
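The three motif models are interchangeable in the sense that an IUPAC string or a thresholded weight matrix can be expanded into a word list. The sketch below illustrates that conversion under stated assumptions: the function names `iupac_to_words` and `pwm_to_words`, the toy matrix and the threshold are hypothetical, and the exhaustive enumeration is only practical for short motifs.

```python
from itertools import product

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_words(pattern):
    """Expand an IUPAC string into the list of words it allows."""
    return ["".join(p) for p in product(*(IUPAC[ch] for ch in pattern))]


def pwm_to_words(pwm, threshold):
    """List all words whose summed matrix score reaches the threshold."""
    words = []
    for letters in product("ACGT", repeat=len(pwm)):
        score = sum(column[base] for column, base in zip(pwm, letters))
        if score >= threshold:
            words.append("".join(letters))
    return words


# Toy two-column matrix with integer scores (assumption).
toy_pwm = [
    {"A": 2, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 1},
]
words_from_iupac = iupac_to_words("TGAY")
words_from_pwm = pwm_to_words(toy_pwm, threshold=3)
```

Raising the threshold shrinks the word list, which is one way to trade specificity against the number of predicted sites.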
AhoPro algorithm: exact P-value calculation
Aho-Corasick pattern matching automaton
H1 = {ACC, AACC, ACCTT}
H2 = {AT, CTAT, CT}
[Figure: Aho-Corasick trie growing from the root over the words of H1 and H2]
• Each state at the i-th step is a class $C_i(r_1, r_2; q)$, where
  i is the step (text length),
  r_1 is the number of occurrences of the first motif,
  r_2 is the number of occurrences of the second motif,
  q is the longest suffix in the prefix closure of $H_1 \cup H_2$
• Probabilities to be at each state (probability transducer): $\Pr(C_{i+1}(r_1, r_2; q'))$
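A minimal sketch of this kind of calculation is given below. It is not the AhoPro implementation: it assumes an i.i.d. (Bernoulli) text model, represents automaton states directly as strings from the prefix closure of H1 ∪ H2 instead of building explicit failure links, and counts overlapping occurrences by their end positions. The function name `cluster_pvalue` and its parameters are hypothetical.

```python
from collections import defaultdict


def cluster_pvalue(h1, h2, n1, n2, length, probs):
    """P(at least n1 occurrences of words of h1 and n2 of words of h2) in an i.i.d. text.

    h1 and h2 are lists of distinct words; probs maps each letter to its probability.
    """
    prefixes = {w[:k] for w in list(h1) + list(h2) for k in range(len(w) + 1)}

    def step(state, letter):
        # Next state: longest suffix of state + letter that lies in the prefix closure.
        s = state + letter
        while s not in prefixes:
            s = s[1:]
        return s

    # dist maps (state q, capped count r1, capped count r2) to its probability
    # after reading i letters; counts are capped at n1 and n2 to bound the state space.
    dist = {("", 0, 0): 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for (q, r1, r2), p in dist.items():
            for letter, p_letter in probs.items():
                q2 = step(q, letter)
                hits1 = sum(q2.endswith(w) for w in h1)  # words of h1 ending here
                hits2 = sum(q2.endswith(w) for w in h2)  # words of h2 ending here
                key = (q2, min(n1, r1 + hits1), min(n2, r2 + hits2))
                nxt[key] += p * p_letter
        dist = nxt
    return sum(p for (_, r1, r2), p in dist.items() if r1 >= n1 and r2 >= n2)


H1 = ["ACC", "AACC", "ACCTT"]   # word sets from the slide
H2 = ["AT", "CTAT", "CT"]
uniform = {b: 0.25 for b in "ACGT"}
pval = cluster_pvalue(H1, H2, 1, 1, 50, uniform)
```

Because the counts are capped, the number of states grows with the size of the prefix closure and with n1 and n2, not with the text length.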
We developed an algorithm for exact p-value calculation for multiple occurrences of multiple motifs.
Boeva, V., J. Clement, M. Regnier, M.A. Roytberg, and V.J. Makeev. 2007. Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms Mol Biol 2: 13.
AhoPro – p-value calculator!
A data flow for motif model construction
[Diagram: data flow for motif model construction]
Inputs:
• Footprinting results: genome-mapped, with correct flanking sequences
• ChIP-chip: raw long sequences
• SELEX: short sites or site parts (may be used as a mask or as an initial mask)
Processing: SeSiMCMC, with additional motif length estimation
Output: motif model (example: Sp1 binding site), then verification
Obtaining clean data from specific sources
Using TRANSFAC as the base data source for binding sites of a selected factor
[Diagram: processing of TRANSFAC entries]
1. Each TRANSFAC entry for the factor provides a footprinted sequence and the nearest gene.
2. Ambiguous entries are filtered out.
3. The chromosome region containing the footprinted sequence is extracted with 5000 bp flanks on each side.
4. The resulting dataset (flank, footprinted sequence, flank) is passed to the database engine as small-BiSMark.
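The extraction of a footprinted sequence together with its flanks can be sketched as follows. This is an illustration only: the helper names `locate_unique` and `extract_with_flanks` are hypothetical, and treating an entry as ambiguous when its footprint is missing or occurs more than once on the chromosome is an assumption about what the filtering step means.

```python
def locate_unique(chromosome, footprint):
    """Return the 0-based start of the footprint, or None if it is absent or not unique."""
    first = chromosome.find(footprint)
    if first == -1 or chromosome.find(footprint, first + 1) != -1:
        return None  # treated here as an ambiguous entry (assumption)
    return first


def extract_with_flanks(chromosome, site_start, site_end, flank=5000):
    """Return (left flank, site, right flank), clipped at the chromosome ends."""
    left = max(0, site_start - flank)
    right = min(len(chromosome), site_end + flank)
    return (
        chromosome[left:site_start],
        chromosome[site_start:site_end],
        chromosome[site_end:right],
    )


chrom = "AAAACCGGTTTTT"  # toy chromosome (assumption)
start = locate_unique(chrom, "CCGG")
region = extract_with_flanks(chrom, start, start + len("CCGG"), flank=2)
```

The default flank of 5000 bp matches the value shown on the slide; the toy call uses 2 bp only to keep the example readable.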
A verification procedure for created motif model
[Diagram: verification workflow]
• New motif model: newly discovered motif (by SeSiMCMC or ScanSeq)
• Testing sequence set: carefully chosen set of motif-containing sequences; processed experimental data (via SeSiMCMC)
• AhoPro: selectivity testing
• Choosing optimal motif specificity
Comparative motif analysis
[Diagram: comparative analysis workflow]
• Testing sequence set: footprinting, ChIP-chip data, randomly generated set
• Motif models compared: new motif model; known motif models 1, 2 and 3
• AhoPro: selectivity testing
• Comparative analysis: selecting the best motif model
Genome-wide motif distribution mapping
• Motif models: new motif model; known motif models 1, 2 and 3
• Sites of different quality are positioned genome-wide on the chromosome
• Possible clustering of sites: different models for one factor, best models for different factors
• Positioning within specific DNA regions: CRM, CpG islands, etc.
Distributed computations support
[Diagram: a «Theatre manager» above several «Opera Houses», each on its own physical machine, with a main database]
• «Theatre manager»: possible management of multiple Opera Houses for grid computing support (request redirecting and resource balancing only)
• «Opera House»: a service on a single physical machine that controls multi-process remote task execution
• Each Opera House executes a specified scenario («opera libretto»)
Overview of the technical realization of the complex
• Database level: MySQL
• Server level: Ruby-powered cross-platform DRb-based server
• Data-workflow level: Ruby-powered cross-platform scenario scripts
• Application level: SeSiMCMC and AhoPro, high-speed C++ code
• Web-interface level: Ruby-based CGI, Ruby on Rails in future
• small-BiSMark processing: Ruby and Java-based tools (REXML, JAXP, SAXON)
THE END
Acknowledgments
• GosNIIgenetika group:
- Vsevolod Makeev
- Alexander Favorov
- Elizaveta Permina
- Valentina Boeva
- Ivan Kulakovsky
- Dmitry Malko
• Financial support:
- Russian Federation State Innovation Project
- Russian Foundation of Basic Research
- DST India
Biological data analysis components
• DNA analysis:
•Basio – large-scale sequence analysis: compositional segmentation
•TandemSWAN – tandem repeats in DNA sequences
•SeSiMCMC – DNA motif identification
•Oligomeasure – DNA structure from DNA sequence
TandemSWAN
• Finds tandem repeats with substitutions but without indels, with control of the statistical significance of each repeat
tttatttatttatttatttatttatttatttatttatttatttatttatttatttatttattta
Finds micro- and minisatellites with substitutions
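The idea of an indel-free tandem repeat can be illustrated by comparing a sequence with itself shifted by a candidate period and counting substitutions. The sketch below is not the TandemSWAN algorithm and performs no significance test; the function name `best_period` and the toy sequences are assumptions.

```python
def best_period(seq, max_period=10):
    """Return (period, mismatch rate) for the shift with the fewest substitutions.

    Ties are resolved in favour of the shortest period.
    """
    best = None
    for period in range(1, max_period + 1):
        pairs = len(seq) - period
        if pairs <= 0:
            break
        # Without indels, position i must match position i + period.
        mismatches = sum(seq[i] != seq[i + period] for i in range(pairs))
        rate = mismatches / pairs
        if best is None or rate < best[1]:
            best = (period, rate)
    return best


perfect = "ttta" * 16                              # exact repeat with period 4
mutated = "ttta" * 4 + "ttca" + "ttta" * 4         # one substitution
perfect_result = best_period(perfect)
mutated_result = best_period(mutated)
```

A single substitution raises the mismatch rate of the true period only slightly, while wrong periods stay far from zero.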
BASIO – BAyesian Segmentation Information Optimizer
• Parses DNA into segments with uniform composition
• Uses Bayesian optimization over all possible segment configurations
• Uses the Bayesian Information Criterion (BIC) to control segmentation resolution
[Diagram: processing pipeline]
1. Input sequence
2. Split: sequence preprocessing
3. Basio: basic segmentation algorithm
4. Filter: remove short or redundant segments
5. Report: select the appropriate output format and format the output
6. Output: list of segments
atcatatca|ggcggcgcagccgcagcc|tctcttcttc
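Compositional segmentation with a BIC-style penalty can be sketched with dynamic programming over segment boundaries. This is a simplified stand-in and not the BASIO algorithm: it uses maximum-likelihood letter frequencies instead of Bayesian marginal likelihoods, and the penalty of (alphabet size / 2) · ln(n) per segment is an assumption. The function name `segment` is hypothetical.

```python
import math


def segment(seq, penalty=None, alphabet="acgt"):
    """Split seq into compositionally uniform segments; return a list of (start, end)."""
    n = len(seq)
    if penalty is None:
        # BIC-style cost per segment: 3 free frequencies + 1 boundary (assumption).
        penalty = 0.5 * len(alphabet) * math.log(n)
    index = {a: i for i, a in enumerate(alphabet)}
    # prefix[j][a] = number of letters a among the first j letters.
    prefix = [[0] * len(alphabet)]
    for ch in seq:
        row = prefix[-1][:]
        row[index[ch]] += 1
        prefix.append(row)

    def loglik(i, j):
        # Maximum log-likelihood of seq[i:j] under one composition.
        total = j - i
        ll = 0.0
        for a in range(len(alphabet)):
            c = prefix[j][a] - prefix[i][a]
            if c:
                ll += c * math.log(c / total)
        return ll

    best = [0.0] + [-math.inf] * n   # best[j]: best penalized score of seq[:j]
    back = [0] * (n + 1)             # back[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            score = best[i] + loglik(i, j) - penalty
            if score > best[j]:
                best[j], back[j] = score, i
    bounds = []
    j = n
    while j > 0:
        bounds.append((back[j], j))
        j = back[j]
    return bounds[::-1]


# The slide's example sequence with the boundary marks removed.
example = "atcatatca" + "ggcggcgcagccgcagcc" + "tctcttcttc"
example_segments = segment(example)
clear_case = segment("a" * 20 + "c" * 20)
```

A larger penalty gives a coarser segmentation, which is the role the slide assigns to BIC.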
SeSiMCMC – Sequence Similarity Markov Chain Monte Carlo
• SeSiMCMC – a Gibbs-sampler-based algorithm for identification of binding motifs
• Multiple local alignment of candidate genomic sequences
• Optimal identification of the motif length
• Analysis of symmetries in motif structures
• Priors for absent sites and sites at the forward and backward DNA strands
SeSiMCMC result on a TRANSFAC dataset
Known binding site motif (SELEX, a sequence logo for Sp1 factor binding site)
ALEX – Alignment of Exons
Identifies exons in a genomic alignment
CTGACGCACAGACCCAAGTGACGACGAGGCCGA
CGGACGGACAGACCCAAGTGACGACGAGGCCGA
REG    REG    REG    REG    Glob   Best   Best   Best   Best   Best   Best   Best   Best   Best   Best   Best
BEG    END    BEG    END           Exon   Exon   Exon   Start  Start  Inner  Inner  Stop   Stop   One    One
M      M      H      H             Beg    End    Type   Exon   Exon   Exon   Exon   Exon   Exon   Exon   Exon
                                                        Beg    End    Beg    End    Beg    End    Beg    End
1425   2127   126    831    16.85  1779   1882   1      1779   1882   1752   1882   1752   1886   0      0
2373   2896   1202   1741   36.65  2452   2705   0      2398   2705   2452   2705   2849   2896   0      0
3279   3544   2238   2503   30.56  3279   3544   0      3512   3544   3279   3544   3279   3515   0      0
3827   4806   2815   3785   59.30  4002   4490   2      4017   4061   4002   4061   4002   4490   4017   4490
5051   5436   4059   4454   8.37   5051   5131   2      0      0      0      0      5051   5131   0      0
PROTEIN ANALYSIS
• Struswer – Smith-Waterman aligner that takes secondary structure into account
• Prophet – secondary structure predictor based on discriminant analysis
• PSIC – multiple alignment with homologs
1 N C 0.990 0.007 0.006
2 A C 0.900 0.040 0.044
3 K C 0.769 0.108 0.093
4 L C 0.666 0.315 0.074
5 K C 0.539 0.489 0.048
6 P H 0.405 0.639 0.033
7 V H 0.100 0.875 0.030
8 Y H 0.072 0.908 0.025
9 D H 0.055 0.926 0.012
10 S H 0.059 0.928 0.007
11 L H 0.069 0.919 0.005
12 D H 0.078 0.921 0.002
13 A H 0.051 0.946 0.003
14 V H 0.062 0.928 0.006
15 R H 0.096 0.880 0.010
16 R H 0.107 0.894 0.013
17 C H 0.104 0.899 0.015
18 A H 0.051 0.945 0.014
STRUSWER – STRUcture extension of Smith-Waterman alignER
STRUSWER – alignment of protein sequences with reference to their secondary structure
------------------------------------------------------------------------------------------
1a04A.exp A <-> 1au7A.exp A
erd.vnqLtprerdi.lklIaqGlpnkmiarrLdites.......tvkvhvkh.....mlkkmklksrveAavwvhqErif.........
...gmraLeqfanefkvrrIklGytqtnvgeaL...aavhgsefsqtticrfenlqlsfknac....klkAilskwlEe..aeqkrrtti
LLL.HHHLLHHHHHH.HHHHHLLLLHHHHHHHHLLLHH.......HHHHHHHH.....HHHHHLLLLHHHHHHHHHHHLLL.........
...LHHHHHHHHHHHHHHHHHHLLLHHHHHHHH...HHLLLLLLLHHHHHHHHLLLLEHHHHH....HHHHHHHHHHHH..LLLLLLLLL
score: 326.000000 ID : 0.11
Output alignment format. Top: alignment of primary structures; bottom: alignment of secondary structures.
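One way to make a Smith-Waterman alignment aware of secondary structure is to add a bonus whenever the aligned residues share the same structural state. The sketch below shows only that idea; the actual STRUSWER scoring scheme is not described on the slides, so the function name `struct_smith_waterman`, the match/mismatch scores, the bonus and the linear gap penalty are all assumptions.

```python
def struct_smith_waterman(seq_a, ss_a, seq_b, ss_b,
                          match=2, mismatch=-1, ss_bonus=1, gap=-2):
    """Return the best local alignment score of two proteins with structure strings.

    ss_a and ss_b hold one secondary-structure letter (e.g. H, E, L) per residue.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if ss_a[i - 1] == ss_b[j - 1]:
                s += ss_bonus  # reward agreement of secondary structure
            score[i][j] = max(
                0,                          # local alignment may restart
                score[i - 1][j - 1] + s,    # align the two residues
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )
            best = max(best, score[i][j])
    return best


same_structure = struct_smith_waterman("AKL", "HHH", "AKL", "HHH")
different_structure = struct_smith_waterman("AKL", "HHH", "AKL", "LLL")
```

With these toy parameters, identical sequences score higher when their structure strings also agree.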
Protein Secondary Structure Prediction PROPHET
RNA-MBFS (RNA Multi-Branch-Free Structures)
Creates an optimal RNA structure without branching
[Fig. 3.1. RNA secondary structure, with elements labeled S, B, C, M, E, F, G, H, I, T and the 5' and 3' ends]
M is a multi-branched loop with branching degree 3. The substructures to the left of M (E-F-G-I-H) and above M (C-B-S-T) are non-branching.
Integration on the level of computation and data
• Easily accessible via web interface
• Integration at data level
• Support for distributed computation on clusters and local networks
• Cross-platform
Building complex computational applications
• Possibility to create individual scenarios for any special task
• Pipelining support for computational conveyors
• Simple XML format for scenario and conveyor descriptors
Individual user spaces and profiles
• Individual result storage and file library
• Individual user account support
Easy remote administration via web interface
What’s under the hood
• Technologies and program tools used:
– MySQL 5 database as result and user space storage
– JSP for web interface
– Apache Tomcat 5 JSP/Servlet container
– Java 5 and RMI for distributed computations server and node software
Acknowledgements
• Financial support:
- Russian Federation State Contract № 02.434.11.100 (Intellectual technologies 2), Prof. Tumanyan V.G.
- Russian Academy of Sciences project in Molecular and Cellular Biology
• Contributors:
- Institute of Mathematical Problems of Molecular Biology (Moscow Region, Pushchino, Russia)
- Voronezh State University, Voronezh, Russia
- State Research Center of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, Moscow, Russia