Bio information: An Overview
TRANSCRIPT
-
8/2/2019 Bio information An Overview
What is Bioinformatics?
Bioinformatics is an emerging scientific discipline representing the combined power of biology, mathematics, and computers.
-
Bioinformatics includes
Sequence analysis used by geneticists, cell biologists, molecular biologists, etc.
Molecular modeling used by crystallographers, cell biologists, biochemists, etc.
Molecular phylogeny/evolution
Ecology and population studies
Medical informatics
-
Three important sub-disciplines within bioinformatics involving computational biology are:
1. The development and implementation of tools that enable efficient access to and management of different types of information
2. The analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures
3. The development of new algorithms and statistics with which to assess relationships among members of large data sets
http://www.library.csi.cuny.edu/~davis/Bioinformatics/Bioinformatics/dataanal.html
http://www.ncbi.nlm.nih.gov/Education/Bioinformatics/dataanal.html
http://www.library.csi.cuny.edu/~davis/Bioinformatics/Bioinformatics/datatypes.html
http://www.ncbi.nlm.nih.gov/Education/Bioinformatics/datatypes.html
-
GenBank Data
Year  Base Pairs  Sequences
1982      680338        606
1983     2274029       2427
1984     3368765       4175
1985     5204420       5700
1986     9615371       9978
1987    15514776      14584
1988    23800000      20579
1989    34762585      28791
1990    49179285      39533
1991    71947426      55627
1992   101008486      78608
1993   157152442     143492
1994   217102462     215273
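The growth in the GenBank table can be summarized with a few lines of Python. This is only a sketch: the function names `yearly_growth` and `doubling_time_years` are made up for illustration, and the doubling time assumes steady exponential growth between the first and last rows, which the data only roughly follow.

```python
from math import log

# (year, base pairs, sequences), copied from the GenBank table above
GENBANK = [
    (1982, 680338, 606), (1983, 2274029, 2427), (1984, 3368765, 4175),
    (1985, 5204420, 5700), (1986, 9615371, 9978), (1987, 15514776, 14584),
    (1988, 23800000, 20579), (1989, 34762585, 28791), (1990, 49179285, 39533),
    (1991, 71947426, 55627), (1992, 101008486, 78608),
    (1993, 157152442, 143492), (1994, 217102462, 215273),
]

def yearly_growth(rows):
    """Return (year, base-pair growth factor relative to the previous year)."""
    return [(y2, bp2 / bp1) for (_, bp1, _), (y2, bp2, _) in zip(rows, rows[1:])]

def doubling_time_years(rows):
    """Estimate the doubling time of base pairs, assuming exponential growth
    between the first and the last row (a simplifying assumption)."""
    (y0, bp0, _), (y1, bp1, _) = rows[0], rows[-1]
    rate = log(bp1 / bp0) / (y1 - y0)  # continuous growth rate per year
    return log(2) / rate

if __name__ == "__main__":
    for year, factor in yearly_growth(GENBANK):
        print(year, round(factor, 2))
    print("doubling time (years):", round(doubling_time_years(GENBANK), 2))
```

Under that assumption the number of base pairs doubles roughly every 1.4 years over 1982 to 1994.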
-
Analysis of sequence information: Computational Biology
Finding the genes in the DNA sequences of various organisms.
Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences.
Clustering protein sequences into families of related sequences and developing protein models.
Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.
-
The goals of bioinformatics and sequence analysis can be subdivided into:
1. Sequence entry, assembly, and management
2. Nucleotide sequence analysis
3. Protein sequence analysis
4. Multiple sequence analysis
5. Additional and integrated analyses
-
Sequence Entry and Editing
-
Sequence Assembly
-
Nucleotide Sequence Analysis
Sequence Similarity Analysis
Query: 298 CCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAA
|| | || | | |||| | || |||| ||| | |||||||||||||
CCCGGGAACCTGCGGTGGTCCGCCCGCCCAGCCCCAGTG
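An alignment like the one above, with a match line of `|` characters between two sequences, can be produced by a simple global alignment. The sketch below uses the Needleman-Wunsch dynamic programming method. It is not the search tool that generated the slide's output, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (top, midline, bottom, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    top, mid, bot = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bot.append(b[j - 1])
            mid.append("|" if a[i - 1] == b[j - 1] else " ")
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bot.append("-")
            mid.append(" ")
            i -= 1
        else:
            top.append("-")
            bot.append(b[j - 1])
            mid.append(" ")
            j -= 1
    return ("".join(reversed(top)), "".join(reversed(mid)),
            "".join(reversed(bot)), score[n][m])

if __name__ == "__main__":
    top, mid, bot, s = needleman_wunsch(
        "CCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAA",
        "CCCGGGAACCTGCGGTGGTCCGCCCGCCCAGCCCCAGTG")
    print(top, mid, bot, "score = %d" % s, sep="\n")
```

Database search programs use faster heuristic local alignment rather than this exhaustive global method, so their output for the same pair may differ.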
-
Gene discovery: coding regions, exon, and gene prediction
ORF List (C = complementary strand)
1 to 1647 length = 1647
3 to 80 length = 78
326 to 409 length = 84
1064 to 1174 length = 111
C 1650 to 1249 length = 402
C 1649 to 1509 length = 141
C 660 to 511 length = 150
C 584 to 507 length = 78
C 510 to 283 length = 228
C 452 to 321 length = 132
C 149 to 39 length = 111
C 135 to 4 length = 132
ORF Map
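A list in this style, with 1-based coordinates and a `C` flag for the complementary strand, can be generated by scanning all six reading frames. The sketch below assumes an ORF runs from an ATG start codon to the first in-frame stop codon (TAA, TAG, or TGA) and includes the stop in the reported length. The tool behind the slide's list may define ORFs differently, and the name `find_orfs` and the `min_len` cutoff are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_len):
    """Yield (start, end) as 0-based half-open spans of ATG..stop ORFs."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    yield start, i + 3
                start = None

def find_orfs(seq, min_len=75):
    """Return (strand, from, to, length) with 1-based forward coordinates.
    Complementary-strand ORFs are reported with from > to, as in the list."""
    seq = seq.upper()
    n = len(seq)
    found = [("+", s + 1, e, e - s) for s, e in orfs_in_strand(seq, min_len)]
    for s, e in orfs_in_strand(revcomp(seq), min_len):
        # Position q (1-based) on the reverse complement is n - q + 1 forward.
        found.append(("C", n - s, n - e + 1, e - s))
    return found

if __name__ == "__main__":
    for strand, a, b, length in find_orfs("CCATGAAATAGCC", min_len=9):
        flag = "C " if strand == "C" else ""
        print("%s%d to %d length = %d" % (flag, a, b, length))
```

Gene prediction in practice adds further evidence such as codon usage and splice signals, so an ORF scan like this is only a first pass.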
-
Protein Sequence Analysis
Sequence Similarity Analysis
Protein sequences can be analyzed in ways similar to nucleotide sequences. Some common types of analyses are database similarity searching (to identify protein sequence database entries similar to a given protein) and sequence comparison (for example, to align two protein sequences and identify common regions).
Compare sequence similarity of A and B below.
Query = sequence used to search the database
Sbjct = sequence aligned to in the database
Letters between the two are identities, and + = conservative amino acid substitution
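The Query/Sbjct convention described above can be illustrated with a small function that builds the middle line for two already aligned protein sequences. This is a sketch under a loud assumption: real search programs decide which substitutions count as conservative from a substitution matrix, whereas `SIMILAR_GROUPS` here is a small hand-picked set of chemically similar residues used only for illustration.

```python
# Hypothetical groups of similar residues; a stand-in for a real
# substitution matrix, not the rule any particular program uses.
SIMILAR_GROUPS = ("ILVM", "FYW", "KR", "DE", "ST")

def midline(query, sbjct):
    """Build the line shown between Query and Sbjct: the residue letter for
    an identity, '+' for a conservative substitution, a space otherwise."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for q, s in zip(query, sbjct):
        if q == s and q != "-":
            marks.append(q)
        elif any(q in group and s in group for group in SIMILAR_GROUPS):
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)

def summarize(query, sbjct):
    """Return (identities, positives), where positives include identities."""
    line = midline(query, sbjct)
    identities = sum(ch.isalpha() for ch in line)
    return identities, identities + line.count("+")

if __name__ == "__main__":
    q, s = "MKVLA", "MRILG"
    print("Query:", q)
    print("      ", midline(q, s))
    print("Sbjct:", s)
    print("identities, positives:", summarize(q, s))
```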
-
Prediction of protein properties
Predict molecular weight
Predict isoelectric point (pI)
Predict extinction coefficient
Protease recognition sites
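Two of the properties listed above can be estimated directly from the amino acid composition. The sketch below sums average residue masses for the molecular weight and uses the common approximation for the extinction coefficient at 280 nm (5500 per Trp, 1490 per Tyr, 125 per disulfide bond, in M^-1 cm^-1). The function names are illustrative, the number of disulfide bonds must be supplied by the user, and the isoelectric point is left out because it needs a set of pKa values that the slides do not give.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

def molecular_weight(seq):
    """Average molecular weight of a linear peptide, in daltons."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER

def extinction_coefficient_280(seq, disulfide_bonds=0):
    """Approximate molar extinction coefficient at 280 nm (M^-1 cm^-1)."""
    seq = seq.upper()
    return 5500 * seq.count("W") + 1490 * seq.count("Y") + 125 * disulfide_bonds

if __name__ == "__main__":
    peptide = "MKWVTFISLLFLFSSAYS"  # arbitrary example sequence
    print(round(molecular_weight(peptide), 1))
    print(extinction_coefficient_280(peptide))
```

These are composition-based estimates; they ignore post-translational modifications and the protein's folded environment.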
-
Search for Known Motifs
Motif searching is also very useful in protein sequences, to recognize specific amino acid patterns with functional significance. A number of databases of protein motifs, such as the PROSITE database, have been created either from literature surveys or directly from sequence databases, for the purpose of identifying proteins and domains or particular functional sites.
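PROSITE-style patterns can be searched with ordinary regular expressions once they are translated. The sketch below handles only the basic pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` as sequence anchors). The helper names are illustrative, and the example pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.lstrip("<").rstrip(">")
        # Split "core(n)" or "core(n,m)" into the core and its repeat counts.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_motif(pattern, sequence):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() + 1 for m in regex.finditer(sequence.upper())]

if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "AANGTAA"))
```

A pattern match only suggests a candidate site; short patterns like this one occur by chance in many sequences.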
-
Predict Secondary Structure
The function of a protein is strongly dependent on its three-dimensional structure.
Secondary structure prediction relies on the propensities of various amino acids (or stretches of amino acids) to form or break particular secondary structure elements.
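The idea of residue propensities can be sketched as a sliding-window average. The helix propensity values below are the figures commonly quoted for the Chou-Fasman method; treat them as illustrative and check them against a primary source before relying on them. The window size, the 1.03 cutoff, and the function names are simplifying assumptions, and this is not the full Chou-Fasman procedure.

```python
# Helix propensities as commonly quoted for the Chou-Fasman method
# (illustrative values; verify before real use).
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16, "F": 1.13,
    "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06, "D": 1.01, "H": 1.00,
    "R": 0.98, "T": 0.83, "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67,
    "P": 0.57, "G": 0.57,
}

def helix_profile(seq, window=5):
    """Average helix propensity of every window of consecutive residues."""
    scores = [HELIX_PROPENSITY[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def likely_helix_windows(seq, window=5, cutoff=1.03):
    """0-based start positions of windows whose average reaches the cutoff."""
    return [i for i, avg in enumerate(helix_profile(seq, window))
            if avg >= cutoff]

if __name__ == "__main__":
    seq = "EEEEEGPGPG"  # helix formers followed by helix breakers
    print([round(x, 2) for x in helix_profile(seq)])
    print(likely_helix_windows(seq))
```

In this toy sequence the glutamate-rich start scores as helix-prone, while the glycine and proline stretch does not.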
-
Protein tertiary structure prediction
Predicting the tertiary (three-dimensional) structure of a protein from its sequence is still far from a trivial task, and usually involves combining information from a range of sources: database searches, comparisons with similar sequences whose structure is known, motifs known to correspond to particular structural elements, and secondary structure information.
PDB files: 1tim, 2act, 3rn3, 1mbn, 1est
There are several approaches to building a three-dimensional model for a protein, including homology modeling, profiling, and threading (see supplement).
http://www.library.csi.cuny.edu/~davis/Bioinformatics/1timMono.pdb
http://www.library.csi.cuny.edu/~davis/Bioinformatics/2act.pdb
http://www.library.csi.cuny.edu/~davis/Bioinformatics/3rn3.pdb
http://www.library.csi.cuny.edu/~davis/Bioinformatics/1mbn.pdb
http://www.library.csi.cuny.edu/~davis/Bioinformatics/1est.pdb
http://www.library.csi.cuny.edu/~davis/Bioinformatics/suppl1.htm
-
Multiple Sequence Analysis
A whole new set of questions can be asked when the sequences of related genes from different organisms are available. For example, conserved regions can be identified, either as an indication of their functionality, or as targets for PCR experiments, or for designing probes for diagnostic tests.
The first step in multiple sequence analysis is to align the related sequences together into a multiple sequence alignment, that is, an alignment of more than 2 sequences.
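Once a multiple sequence alignment is available, conserved regions can be located column by column. The sketch below takes an existing alignment (it does not compute one) and reports runs of columns in which the most common residue reaches a chosen frequency. The function names, the threshold, and the minimum run length are illustrative assumptions.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column.
    Columns where a gap ('-') is the most common symbol score 0."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    fractions = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        fractions.append(0.0 if residue == "-" else count / len(column))
    return fractions

def conserved_regions(alignment, threshold=1.0, min_len=3):
    """Return (start, end) 0-based half-open runs of conserved columns."""
    regions, start = [], None
    # A trailing 0.0 closes any run that extends to the last column.
    for i, frac in enumerate(column_conservation(alignment) + [0.0]):
        if frac >= threshold and start is None:
            start = i
        elif frac < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions

if __name__ == "__main__":
    msa = ["ACGTAC",
           "ACGAAC",
           "ACG-AC"]
    print(conserved_regions(msa))
```

Runs found this way are candidates for the uses named above, such as PCR primer targets or diagnostic probes, subject to the usual experimental design checks.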
-
Challenges in bioinformatics
Explosion of information
Need for faster, automated analysis to process large amounts of data
Need for integration between different types of information (sequences, literature, annotations, protein levels, RNA levels, etc.)
Need for "smarter" software to identify interesting relationships in very large data sets
Lack of "bioinformaticians"
Software needs to be easier to access, use, and understand
Biologists need to learn about the software, its limitations, and how to interpret its results
-
Diagnostics
DNA probes for infectious disease
DNA probes for inherited disease
Analysis of gene expression
Analysis of protein expression
-
Therapeutics
Recombinant gene products
Novel drug targets
Rational drug design
Gene therapy
http://www.scitech.com.au/software/Scanalytics/IPLab.pdf