![Page 1: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/1.jpg)
PAT project
Advanced bioinformatics tools for analyzing the Arabidopsis genome
Proteins of Arabidopsis thaliana (PAT)
&
Gene Ontology (GO)
Hongyu Zhang, Ph.D.
![Page 2: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/2.jpg)
Sequence
Structure
Function
Bioinformatics
![Page 3: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/3.jpg)
PAT: Structure-aided function annotation
• PAT is a collaborative project between Ceres and the San Diego Supercomputer Center: http://pat.sdsc.edu
• Importance of structure-aided function annotation
– Structure carries more functional information than sequence, such as active sites and binding motifs.
– Structure is more conserved than sequence during evolution, so proteins can share similar structures even when no sequence similarity is clearly detectable. We therefore have a better chance of finding functional relationships through structure similarity than through sequence similarity, using advanced methods such as PSI-BLAST and threading algorithms.
– Structure prediction programs can also predict a range of structural features of proteins, such as transmembrane tendency, electrostatic potential distribution, or coiled-coil tendency. These features help biologists infer the possible functions of novel genes.
![Page 4: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/4.jpg)
Fold recognition
• Frequently implies biochemical function
[Figure: number of folds (y-axis, 0 to 600) versus number of different functions (x-axis, 1 to 4)]
![Page 5: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/5.jpg)
Highlights in PAT annotations
• Domain-based prediction
– Structure domain
  • PDB, SCOP
– Sequence domain
  • Pfam
• Predictions are strictly benchmarked
![Page 6: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/6.jpg)
Reliability categories
| Category | Reliability level | Benchmark |
|---|---|---|
| A | Certain | >99.9% |
| B | Reliable | >99% |
| C | Probable | >90% |
| D | Possible | >50% |
| E | Potential | >10% |
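The thresholds in the table can be read as a simple lookup. The following is a minimal sketch, assuming the benchmark value is a reliability expressed as a fraction between 0 and 1; the function name `reliability_category` and that interpretation are illustrative and not part of PAT itself.

```python
# Thresholds copied from the reliability table above. Treating the benchmark
# value as a fraction in [0, 1] is an assumption of this sketch.
CATEGORY_THRESHOLDS = [
    ("A", "Certain", 0.999),
    ("B", "Reliable", 0.99),
    ("C", "Probable", 0.90),
    ("D", "Possible", 0.50),
    ("E", "Potential", 0.10),
]


def reliability_category(benchmark_reliability):
    """Return (letter, label) for a reliability fraction.

    Returns None when the value does not exceed the lowest threshold (10%).
    """
    for letter, label, threshold in CATEGORY_THRESHOLDS:
        # The table uses strict ">" bounds, so a value exactly on a
        # threshold falls into the next lower category.
        if benchmark_reliability > threshold:
            return letter, label
    return None
```

For example, `reliability_category(0.995)` falls between the 99% and 99.9% thresholds and is therefore placed in category B.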
![Page 7: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/7.jpg)
Methods
• Programs
Protein sequences were analyzed with a range of programs covering structure prediction, function prediction, and feature annotation.
• Database
All results were organized and stored in an Oracle relational database for ease of data access and processing.
• Interface
A web-based interface convenient for both computational and non-computational biologists.
![Page 8: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/8.jpg)
Programs used in PAT pipeline
• Protein structure and function
– Homology modeling
BLAST, PSI-BLAST search against protein structure database
– Threading
123D+ search against a protein fold library
• Protein class and features
COILS, TMHMM, SignalP, PSIPRED, PSORT
![Page 9: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/9.jpg)
Protein sequences
[Flowchart: PAT annotation pipeline, starting from the protein sequences]
• Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG)
• Create PSI-BLAST profiles for the protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB
• Structural assignment of domains by 123D on FOLDLIB (only sequences without an A-prediction)
• Functional assignment by PFAM, NR, and PSIPRED assignments (only sequences without an A-prediction)
• Domain location prediction by sequence
• Store assigned regions in the DB
Inputs shown: FOLDLIB; NR, PFAM; SCOP, PDB; structure info; sequence info
Building FOLDLIB:
• Sources: PDB chains, SCOP domains, PDP domains, CE matches of PDB vs. SCOP
• Filters: 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
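The "only sequences w/out A-prediction" arrows suggest a cascade in which later, more expensive methods run only on sequences that no earlier method has annotated with category A. The sketch below illustrates that control flow under stated assumptions: the predictor callables, the dictionary layout of an assignment, and the names `run_cascade` and `has_a_prediction` are hypothetical stand-ins and do not reproduce the actual PAT code or the interfaces of PSI-BLAST or 123D.

```python
from typing import Callable, Dict, List

# A predictor takes a protein sequence and returns a list of assignments.
# Each assignment is a dict with at least a "category" key ("A" to "E").
# This data layout is an assumption made for illustration only.
Predictor = Callable[[str], List[dict]]


def has_a_prediction(assignments: List[dict]) -> bool:
    """True if any assignment already reaches reliability category A."""
    return any(a["category"] == "A" for a in assignments)


def run_cascade(sequences: Dict[str, str],
                stages: List[Predictor]) -> Dict[str, List[dict]]:
    """Run the stages in order, skipping sequences that already have an A-prediction."""
    results: Dict[str, List[dict]] = {seq_id: [] for seq_id in sequences}
    for stage in stages:
        for seq_id, sequence in sequences.items():
            if has_a_prediction(results[seq_id]):
                continue  # an earlier stage was already certain; skip this one
            results[seq_id].extend(stage(sequence))
    return results
```

In this reading, the stage list would hold a PSI-BLAST search, then threading, then functional assignment, so that each later stage only sees the sequences that remain without a category A result.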
![Page 10: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/10.jpg)
GUI: Top Level
![Page 11: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/11.jpg)
Example: P450 family
• Sequence relatives detected by an ordinary BLAST search
– 313 hits at an E-value cutoff of 0.001
– 324 hits at an E-value cutoff of 0.01
• Sequence relatives detected by PAT
– 367 hits with confidence greater than or equal to 99%
![Page 12: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/12.jpg)
Figure 2. SCOP results at the superfamily level. The figure shows the number of true positive predictions versus the number of false positive predictions for the SCOP test set. Two proteins are considered to have the same structure at the superfamily level if they share the first three fields of their SCOP sccs identifiers, e.g., d.126.1.1 and d.126.1.2. The results show that PSI-BLAST is superior to both NCBI-BLAST and WU-BLAST at picking up true positives.
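The superfamily criterion in the caption reduces to comparing the first three dot-separated fields of two sccs strings. A minimal sketch follows; the function names and the hit-list input are illustrative assumptions, not the benchmark code used for the figure.

```python
def same_superfamily(sccs_a, sccs_b):
    """True when two SCOP sccs strings share class, fold, and superfamily.

    For example, d.126.1.1 and d.126.1.2 differ only in the family field.
    """
    fields_a = sccs_a.split(".")
    fields_b = sccs_b.split(".")
    # Require at least class.fold.superfamily on both sides.
    if len(fields_a) < 3 or len(fields_b) < 3:
        return False
    return fields_a[:3] == fields_b[:3]


def count_true_false_positives(query_sccs, hit_sccs_list):
    """Count hits in the same superfamily as the query (true positives) and the rest."""
    true_positives = sum(same_superfamily(query_sccs, hit) for hit in hit_sccs_list)
    return true_positives, len(hit_sccs_list) - true_positives
```

Counting both numbers over a ranked hit list, for increasingly permissive score cutoffs, would give the kind of true-positive versus false-positive curve the caption describes.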
![Page 13: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/13.jpg)
Acknowledgement
• Dr. Nickolai Alexandrov
• Dr. Philip E. Bourne
• Dr. Wilfred W. Li
• Dr. Greg B. Quinn
• Dr. Ilya E. Shindyalov
![Page 14: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/14.jpg)
Gene Ontology (GO) project
• Gene Ontology Consortium (http://www.geneontology.org)
• Controlled vocabularies for the description of gene functions.
• Three dimensions
– Molecular Function
  • the tasks performed by individual gene products; examples are transcription factor and DNA helicase
– Biological Process
  • broad biological goals, such as purine metabolism or mitosis, that are accomplished by ordered assemblies of molecular functions
– Cellular Component
  • subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
![Page 15: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/15.jpg)
Three dimensions of GO
[Diagram: a gene product described along three dimensions: Molecular Function, Biological Process, and Cellular Component]
![Page 16: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/16.jpg)
Hierarchical structure of GO term tree
• GO:0003673 : Gene_Ontology
  • GO:0003674 : molecular_function
    • GO:0005488 : binding
      • GO:0003676 : nucleic acid binding
        • GO:0003677 : DNA binding
          • GO:0003700 : transcription factor
    • GO:0030528 : transcription regulator
      • GO:0003700 : transcription factor
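Because GO:0003700 appears twice in the listing, the structure is better modeled as a directed acyclic graph than as a strict tree. The sketch below encodes the terms from this slide and collects all ancestors of a term. The parent links are an assumption: they are inferred by reading the listing as a nested hierarchy, and the names `PARENTS` and `ancestors` are illustrative.

```python
# Parent links inferred from the slide's listing (an assumption of this
# sketch). A term may have more than one parent.
PARENTS = {
    "GO:0003673": [],                            # Gene_Ontology (root)
    "GO:0003674": ["GO:0003673"],                # molecular_function
    "GO:0005488": ["GO:0003674"],                # binding
    "GO:0003676": ["GO:0005488"],                # nucleic acid binding
    "GO:0003677": ["GO:0003676"],                # DNA binding
    "GO:0030528": ["GO:0003674"],                # transcription regulator
    "GO:0003700": ["GO:0003677", "GO:0030528"],  # transcription factor
}


def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```

Under these links, a gene product annotated as a transcription factor is implicitly also annotated with DNA binding and with transcription regulator, since both lie on a path to the root.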
![Page 17: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/17.jpg)
The evidence codes used in GO
• IC inferred by curator
• IDA inferred from direct assay
• IEA inferred from electronic annotation
• IEP inferred from expression pattern
• IGI inferred from genetic interaction
• IMP inferred from mutant phenotype
• IPI inferred from physical interaction
• ISS inferred from sequence or structural similarity
• NAS non-traceable author statement
• ND no biological data available
• TAS traceable author statement
• NR not recorded
![Page 18: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/18.jpg)
Process to annotate Ceres peptides
• Download GO annotations from the TAIR website (http://www.arabidopsis.org)
• Annotation method
– If the sequence of the Ceres peptide is the same as a GO database sequence, as judged by locus name, copy all annotations of the GO database sequence to the Ceres peptide.
– Otherwise, pick the peptide's best hit that has a TAIR annotation and copy that hit's annotations to the Ceres peptide.
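The two-branch rule above can be sketched as a small function. This is an illustration under assumptions, not the Ceres pipeline: the input layout (a locus name, a list of `(hit_locus, e_value)` pairs, and a dictionary of TAIR annotations) is hypothetical, and "best hit" is taken to mean the lowest E-value, which the slide does not specify.

```python
def annotate_peptide(locus, blast_hits, tair_annotations):
    """Transfer GO annotations to one peptide.

    locus            -- locus name of the peptide, or None if unknown
    blast_hits       -- list of (hit_locus, e_value) pairs (assumed layout)
    tair_annotations -- dict mapping locus name to a list of GO term ids

    Returns (go_terms, source), where source records which rule applied.
    """
    # Rule 1: the peptide matches an annotated database sequence by locus name.
    if tair_annotations.get(locus):
        return list(tair_annotations[locus]), "locus match"

    # Rule 2: walk the hits from best to worst (lowest E-value first, an
    # assumption) and take the first one that carries a TAIR annotation.
    for hit_locus, _e_value in sorted(blast_hits, key=lambda hit: hit[1]):
        if tair_annotations.get(hit_locus):
            return list(tair_annotations[hit_locus]), "best annotated hit"

    return [], "unannotated"
```

Returning the rule that fired alongside the terms keeps directly copied annotations distinguishable from those transferred by similarity, which is useful when assigning evidence codes afterwards.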
![Page 19: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/19.jpg)
Example: P450 family
• Sequence relatives detected by a simple BLAST search
– 313 hits at an E-value cutoff of 0.001
– 324 hits at an E-value cutoff of 0.01
• Sequence relatives detected by PAT
– 367 hits with confidence greater than or equal to 99%
• Sequence relatives annotated by GO
– 365 hits
– Number of hits by evidence code
  • 295 with ISS (inferred from sequence or structural similarity)
  • 67 with IEA (inferred from electronic annotation)
  • 2 with TAS (traceable author statement)
  • 1 with IDA (inferred from direct assay)
![Page 20: Advanced bioinformatics tools for analyzing the Arabidopsis genome](https://reader036.vdocument.in/reader036/viewer/2022082819/56813988550346895da11ba6/html5/thumbnails/20.jpg)
Acknowledgement
• Dr. Nickolai Alexandrov
• Mr. Eric Zetterbaum
• Dr. Richard Flavell
• etc.