I.U. School of Informatics
Motif Discovery from Large Number of Sequences:
A Case Study with Disease Resistance Genes
in Arabidopsis thaliana
by Irfan Gunduz
04/25/04
Capstone Presentation
INTRODUCTION
Motifs
• Highly conserved regions across a subset of proteins that share the same function
• Motifs can be used to predict a molecule’s function, a structural feature, or family membership
>Seq A  YNEDSKH
>Seq B  YDDDSNH
>Seq C  YDNDSNH
>Seq D  YENDSKH
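To make the idea of a conserved region concrete, the sketch below derives a PROSITE-style pattern from the four example sequences. It assumes the sequences are the 7-residue strings YNEDSKH, YDDDSNH, YDNDSNH and YENDSKH, already aligned without gaps; the function name `consensus_pattern` is illustrative and not part of any tool named in this presentation.

```python
# Toy alignment: the four example sequences, assumed to be 7 residues each
# and already aligned column by column (no gaps).
aligned = {
    "Seq A": "YNEDSKH",
    "Seq B": "YDDDSNH",
    "Seq C": "YDNDSNH",
    "Seq D": "YENDSKH",
}


def consensus_pattern(sequences):
    """Build a PROSITE-style pattern from equal-length, gap-free sequences."""
    seqs = list(sequences)
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must all have the same length")
    elements = []
    for column in zip(*seqs):
        residues = sorted(set(column))
        if len(residues) == 1:
            # Fully conserved position: a single residue.
            elements.append(residues[0])
        else:
            # Variable position: list the observed alternatives in brackets.
            elements.append("[" + "".join(residues) + "]")
    return "-".join(elements)


pattern = consensus_pattern(aligned.values())
print(pattern)  # Y-[DEN]-[DEN]-D-S-[KN]-H
```

Four of the seven positions (Y, D, S, H) are identical in every sequence, which is the kind of signal a motif finder looks for.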
INTRODUCTION
Current motif-finding software:
• MEME
• PROSITE
• PRATT, etc.
Do they work with a large number of sequences?
• Pattern discovery relies on statistical or combinatorial techniques, looking for signals
• The signal-to-noise ratio becomes less clear as the number of sequences increases
What to do?
Objective
Develop a computational procedure to find functional motifs from a large number of sequences
Tools
• BLAST (sequence alignment tool)
• BAG (sequence clustering package)
• CLUSTAL W (multiple sequence alignment)
• HMMERII (HMM-based software)
• BLOCK MAKER (block/motif finder)
• LAMA (block comparison tool)
• PERL
COMPUTATIONAL PROCEDURE
1 - Collecting and Clustering Sequences
• Extract well-annotated sequences of interest from the genome of interest
• Run an all-against-all pairwise comparison using BLAST
• Estimate the best bit-score cutoff for clustering
• Cluster the sequences using BAG
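The clustering step can be illustrated with a minimal sketch. It is not BAG's algorithm: it is a simple single-linkage stand-in that joins two sequences whenever an all-against-all BLAST hit between them reaches a chosen bit-score cutoff. The input is assumed to be BLAST's 12-column tabular output with the bit score in the last column; the sequence IDs, scores, and the helper `hit` are hypothetical.

```python
from collections import defaultdict


def cluster_by_bit_score(blast_lines, cutoff):
    """Group sequences linked by at least one hit with bit score >= cutoff."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or comment lines
        query, subject, bit_score = fields[0], fields[1], float(fields[11])
        # Register both IDs so unlinked sequences become singleton clusters.
        root_q, root_s = find(query), find(subject)
        if query != subject and bit_score >= cutoff:
            parent[root_q] = root_s

    clusters = defaultdict(set)
    for seq_id in parent:
        clusters[find(seq_id)].add(seq_id)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


def hit(query, subject, bits):
    """Hypothetical helper: one 12-column tabular line, only 3 fields filled."""
    return "\t".join([query, subject] + ["0"] * 9 + [str(bits)])


hits = [hit("g1", "g1", 600), hit("g1", "g2", 250),
        hit("g2", "g3", 220), hit("g3", "g4", 40), hit("g4", "g4", 500)]

# Trying several cutoffs shows how the grouping changes with the threshold.
for cutoff in (30, 100, 230):
    print(cutoff, cluster_by_bit_score(hits, cutoff))
```

A low cutoff merges everything into one cluster and a high cutoff splits related sequences apart, which is why the cutoff has to be estimated before clustering.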
COMPUTATIONAL PROCEDURE
2 - ENRICHMENT
• Align the multiple sequences in each cluster
• Use HMM-based programs to build a profile for each cluster
• Search the genome of interest with the new profile and extract more sequences if available
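The enrichment step is iterative: build a profile, search, add new hits, and repeat until nothing new is found. The sketch below shows only that control flow. The functions `build_profile` and `search_genome` are hypothetical stand-ins that use shared 2-mers instead of a real profile HMM, and `GENOME` is made-up toy data; in the actual procedure these roles are played by the alignment and HMM tools listed earlier.

```python
# Hypothetical toy "genome": sequence ID -> sequence.
GENOME = {"a": "MKTL", "b": "TLQV", "c": "QVWW", "d": "GGGG"}


def kmers(seq, k=2):
    """Return the set of k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_profile(member_ids):
    """Stand-in for profile building: the union of the members' 2-mers."""
    profile = set()
    for seq_id in member_ids:
        profile |= kmers(GENOME[seq_id])
    return profile


def search_genome(profile):
    """Stand-in for a profile search: IDs sharing any 2-mer with the profile."""
    return [seq_id for seq_id, seq in GENOME.items() if kmers(seq) & profile]


def enrich(seed_ids, build_profile, search_genome, max_rounds=10):
    """Grow a cluster until a profile search adds no new sequence IDs."""
    members = set(seed_ids)
    for _ in range(max_rounds):
        profile = build_profile(members)
        new_hits = set(search_genome(profile)) - members
        if not new_hits:
            break  # converged: the search found nothing new
        members |= new_hits
    return members


enriched = enrich({"a"}, build_profile, search_genome)
print(sorted(enriched))  # ['a', 'b', 'c']
```

Starting from the single seed "a", the first round recruits "b", the second recruits "c", and the third finds nothing new; "d" shares nothing with the growing profile and is never added.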
COMPUTATIONAL PROCEDURE
3 - REFINEMENT
• Refine clusters by regrouping
4 - MOTIF FINDING
• Submit the sequences in each cluster to Block Maker
• Compare blocks using LAMA
• Cluster blocks using BAG
A Case Study with Disease Resistance Genes in Arabidopsis thaliana
Why Disease Resistance Genes?
Background, Disease Resistance Genes
Domain    Probable Function
TIR
CC
KIN
LRR       Recognition of specificity
NB        ATP and GTP binding
Case Study, Arabidopsis thaliana
• 116 sequences annotated as disease resistance protein or disease resistance protein-like were extracted from the Arabidopsis thaliana genome
• Clustered into 32 groups
• After refinement, four clusters were formed for further analysis

             # of Sequences
Cluster 1    96
Cluster 2    45
Cluster 3    641
Cluster 4    11

• 20 to 640 sequences were added to each cluster after HMM iterations
Case Study, Arabidopsis thaliana
PFAM Search

             Domains
Cluster 1    NB-ARC, TIR, Kin, LRR
Cluster 2    NB-ARC, Kin, LRR
Cluster 3    Ser/Thr Kin
Cluster 4    Kin
Case Study, Arabidopsis thaliana
Results, Block Maker
Cluster1
Cluster2
15218608  YDVFLSFRGVDTRQTIVSHL
15218618  YDVFLSFRGEDTRKNIVSHL
15220795  YDVFLSFRGEDTRKTIVSHL
Case Study, Arabidopsis thaliana
Results, LAMA and BAG
[Figure: clusters at the whole gene level (Cluster1, Cluster2) against clusters at the block level (Cluster1, Cluster2, Cluster3)]
Case Study, Arabidopsis thaliana
[Figure: clusters at the whole gene level (Cluster1, Cluster2) against clusters at the block level (Cluster1, Cluster2, Cluster3). Block labels: TIR-I, TIR-II, Kin1a, Kin2, NBS-B; Kin1a, Kin2, NBS-B, NBS-C, NBS-A, GLPL; LRR; LRR. Gene labels: RPP8, RPM1; RPS4, RPP1, RPP5]
Case Study, Arabidopsis thaliana
Number of Disease Resistance Gene Candidates on each Chromosome

             CHR-I   CHR-II   CHR-III   CHR-IV   CHR-V
Cluster 1    16      2        6         16       35
Cluster 2    20      0        6         4        9
Case Study, Arabidopsis thaliana
New Disease Resistance Gene Candidates
Cluster 1
GI 15236505
GI 15242136
GI 15233862
Cluster 2
GI 15221277
GI 15221280
GI 15217940
GI 15221744
Case Study, Arabidopsis thaliana
To test the effectiveness of the computational procedure:
792 unique sequences were merged and submitted to MEME and PRATT to detect functional motifs.
• Time: more than 9000 minutes (150 hours) on a Pentium IV 1.7 GHz machine running Linux
• Result: no known disease resistance gene motifs were detected
Case Study, Arabidopsis thaliana
CONCLUSIONS:
• A sensible combination of tools provides an excellent mechanism for motif detection
• Clustering helps to improve the performance of other well-known tools
ACKNOWLEDGEMENT
Motif Discovery from Large Number of Sequences: A Case Study with Disease Resistance Genes
in Arabidopsis thaliana
Irfan Gunduz, Sihui Zhao, Mehmet Dalkilic and Sun Kim
will be presented at
The 2003 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences
Case Study, Arabidopsis thaliana
Disease Resistance Mechanism
COMPUTATIONAL PROCEDURE
Refinement
[Figure: refinement diagram with clusters labeled A, B, C and D]