Dr. R. Sankar, BSE 633 (2020)
BSE 633A
Bioinformatics & Computational Biology
R. Sankararamakrishnan
References
Bioinformatics: Sequence and Genome Analysis. David W. Mount, Cold Spring Harbor Laboratory Press (2001)
Bioinformatics and Functional Genomics. Jonathan Pevsner, Wiley-Blackwell
Developing Bioinformatics Computer Skills. C. Gibas and P. Jambeck, O'Reilly (2001)
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Cambridge University Press (1998)
Journals: Bioinformatics, BMC Bioinformatics, Nucleic Acids Research, ISMB, J. Comp. Biol., PLoS Computational Biology
Instructors
Up to the mid-semester exam: Dr. R. Sankar
After the mid-semester exam: Dr. Hamim Zafar
Course evaluation
Quiz I (February, first week): 5%
Midsem: 30%
Quiz II (April, first week): 5%
Assignment/Exercise: 10%
Presentation: 5%
End-semester exam: 40%
Attendance: 5%
BSE633: Course Contents
Introduction to bioinformatics, biological databases and their growth
Concept of homology and definition of associated terms
Pairwise sequence alignment, dot-matrix plot, dynamic programming algorithm, global (Needleman-Wunsch) and local (Smith-Waterman) alignments, BLAST
Scoring matrices (PAM and BLOSUM families), gap penalty, statistical significance of alignment
Multiple sequence alignment, sum-of-pairs method, CLUSTAL W, genetic algorithm
Pattern finding in protein and DNA sequences, Gibbs sampler, hidden Markov model, profile construction and searching, PSI-BLAST
Introduction to phylogeny, maximum parsimony method, distance method (neighbor-joining), maximum-likelihood method
Gene prediction in prokaryotes and eukaryotes, homology and ab initio methods
Genome analysis and annotation, comparative genomics
PowerPoint presentations of each class and other course materials will be available at:
http://home.iitk.ac.in/~rsankar/courses/
What is Bioinformatics? - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
What is Computational Biology? - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
- NIH Definition http://www.bisti.nih.gov/
Nature (Oct. 2017)
The first protein was sequenced in 1953
Number of protein sequences in UniProt (11 Dec 2019)
Source: UniProt Database, www.uniprot.org
Swiss-Prot: 561,568 sequences
TrEMBL: 179,250,561 sequences
Myoglobin and Hemoglobin: First protein structures to be determined
Yearly growth of structures in PDB
http://www.pdb.org
159,140 structures in PDB (6 Jan 2020)
Genome Sequencing: Important milestones
1976: Bacteriophage MS2 (RNA virus), 3,569 bp; PhiX174 (DNA virus), 5,386 bp
1995: Haemophilus influenzae (bacterium), 1.8 million bp; Methanococcus jannaschii (archaeon), 1.7 million bp
1996: Baker's yeast, 12.1 million bp
1998: Caenorhabditis elegans, 100 million bp
2000: Arabidopsis thaliana, 119 million bp; Drosophila melanogaster, 165 million bp
2001: Homo sapiens, 3.2 billion bp
2002: Mouse, 3.48 billion bp
2003: Mosquito, 278 million bp; Japanese pufferfish, 390 million bp; Rice, 374 million bp
2004: Chicken, 1 billion bp
2005: Chimpanzee, 3.3 billion bp
2010: Western clawed frog, 1.5 billion bp
2013: Zebrafish, 1.5 billion bp
Source: http://www.yourgenome.org/facts/timeline-organisms-that-have-had-their-genomes-sequenced
Number of genome sequences
http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html
https://ark-invest.com/research/genome-sequencing
The genome sequencing market is in its infancy, poised to grow at rates difficult to comprehend. Sequencing is introducing deeper scientific knowledge into medical decision making, eliminating wasteful guess work, and moving us closer to a truly personalized healthcare system.
http://www.internationalgenome.org/
598 sequences from India
https://www.nlm.nih.gov/about/2020CJ.html
https://digitalworldbiology.com/blog/bio-databases-2018-how-do-they-taste
A sample record in FASTA format
>gi|388480089|ref|YP_492284.1| transporter [Escherichia coli str. K-12 substr. W3110]
MSGLKQELGLAQGIGLLSTSLLGTGVFAVPALAALVAGNNSLWAWPVLIILVFPIAIVFAILGRHYPSAG
GVAHFVGMAFGSRLERVTGWLFLSVIPVGLPAALQIAAGFGQAMFGWHSWQLLLAELGTLALVWYIGTRG
ASSSANLQTVIAGLIVALIVAIWWAGDIKPANIPFPAPGNIELTGLFAALSVMFWCFVGLEAFAHLASEF
KNPERDFPRALMIGLLLAGLVYWGCTVVVLHFDAYGEKMAAAASLPKIVVQLFGVGALWIACVIGYLACF
ASLNIYIQSFARLVWSQAQHNPDHYLARLSSRHIPNNALNAVLGCCVVSTLVIHALEINLDALIIYANGI
FIMIYLLCMLAGCKLLQGRYRLLAVVGGLLCVLLLAMVGWKSLYALIMLAGLWLLLPKRKTPENGITT
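A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below shows one way such records could be read in Python. The function name parse_fasta and the toy records are assumptions made for illustration; they are not part of the course material or of any library.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (hypothetical records, not real database entries).
example = """>toy_protein_1 hypothetical example
MSGLKQELGL
AQGIGLL
>toy_protein_2
MKV
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Joining the sequence lines matters because databases wrap long sequences across many lines, as in the sample record.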
Biological Data
Genomic sequences
Single Nucleotide Polymorphisms (SNPs)
Protein amino acid sequences
Protein 3D structures
Gene Expression
Protein function
Biomolecular interactions and networks
Literature
Emergence of ‘Omes’ – The new ‘era’ in Biology
Transcriptome: the mRNA complement of an entire organism, tissue type, or cell
Metabolome: the totality of metabolites in an organism
Lipidome: the totality of lipids
Glycome: the totality of glycans, carbohydrate structures of an organism, a cell or tissue type
Interactome: the totality of the molecular interactions in an organism
Spliceome: the totality of alternatively spliced protein isoforms
Kinome: the totality of protein kinases in a cell
Foldome: the totality of biological structures as skeletons
Dynome: Adding a 4th Dimension to the Protein Database by Terascale Simulation
Reactome: A knowledge base of biological processes
What is Bioinformatics?
A Proposed Definition and Overview of the Field N.M. Luscombe, D. Greenbaum and M. Gerstein
http://bioinfo.mbb.yale.edu/
What is Bioinformatics?
Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and applying “informatics techniques” (derived from applied maths, computer science and statistics) to understand and organize the information associated with these molecules on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.
Mark Gerstein, Yale University
Bioinformatics is an interdisciplinary field combining mathematical, statistical, and computer methods to analyze medical, biological, biochemical, and biophysical data.
Crystal Structure of ATP-gated P2X4 receptors
Nature July (2009)
What are we going to learn in this course?
How to compare two sequences?
How to compare many sequences?
How to evaluate an alignment?
What are the limitations?
Phylogenetic analysis
Prediction of genes from a given genomic sequence
Comparative genomics
https://courses.lumenlearning.com/wmopen-biology2/chapter/speciation/
Species & Speciation
Species: a group of populations whose members
have a similar appearance
successfully interbreed
are reproductively isolated from other species
Gene flow occurs within a species, which is genetically distinct and isolated from other species
Speciation: the formation of two groups of organisms that are reproductively isolated from each other and thus have no gene flow.
When there is no gene flow, the two groups accumulate more and more differences over time.
Gene Duplication
•A redundant duplicate of a gene may acquire divergent mutations and eventually emerge as a new gene.
•Gene duplication is one of the means by which a new gene can arise.
•It is one of only a few ways to increase the amount of genetic material.
•It is one of the means of creating a new function.
Why should we do sequence alignments?
Useful for discovering functional, structural and evolutionary information in biological sequences
Sequences that are very much alike probably have the same function and, in the case of proteins, the same 3D structure
If two sequences from different organisms are similar, there may have been a common ancestor sequence
The sequences are then defined as homologous
Hemoglobin
Orthologous and paralogous genes

                 _________ Rat_gene_1
        Rat     |
        ________X
       |        |_________ Rat_gene_2
       |
---( )
       |     _____________ Mouse_gene_1
       |    |
       |____X
      Mouse |_____________ Mouse_gene_2

Two genes are said to be orthologous if they diverged after a speciation event.
Two genes are said to be paralogous if they diverged after a duplication event.
http://www.icp.ucl.ac.be/~opperd/private/orthol.html
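The two definitions can be applied mechanically once each branching point of a gene tree is labelled with the event that caused it. The following is a minimal sketch under stated assumptions: the "( )" node of the diagram is read as the rat/mouse speciation and each "X" as a gene duplication, which is an interpretation of the figure. The tuple encoding of the tree and the function names are hypothetical.

```python
# Each internal node is (event, left_child, right_child); each leaf is a gene name.
# Assumed reading of the diagram: "( )" = speciation, "X" = duplication.
tree = (
    "speciation",
    ("duplication", "Rat_gene_1", "Rat_gene_2"),
    ("duplication", "Mouse_gene_1", "Mouse_gene_2"),
)


def leaves(node):
    """Return the set of gene names found below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(node, gene_a, gene_b):
    """Classify two genes by the event at their last common ancestor node."""
    if isinstance(node, str):
        raise ValueError("the two genes must be distinct leaves of the tree")
    event, left, right = node
    # If both genes sit under the same child, their common ancestor is deeper.
    for child in (left, right):
        below = leaves(child)
        if gene_a in below and gene_b in below:
            return relationship(child, gene_a, gene_b)
    below = leaves(node)
    if gene_a not in below or gene_b not in below:
        raise ValueError("gene not found in the tree")
    # The genes split at this node, so this node is their last common ancestor.
    return "orthologous" if event == "speciation" else "paralogous"


print(relationship(tree, "Rat_gene_1", "Rat_gene_2"))
print(relationship(tree, "Rat_gene_1", "Mouse_gene_2"))
```

With this tree, the two rat genes separate at a duplication node, while any rat gene and any mouse gene separate at the speciation node.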
Types of Homology
Chymotrypsin Subtilisin
Analogous genes
Similar regions in sequences may not have a common ancestor but may have arisen independently by evolutionary pathways converging on the same function
This is called convergent evolution
Such gene/protein sequences are referred to as analogous
Xenologues
Certain infectious agents, such as retroviruses, or species hybridization can introduce foreign DNA into the genome of an organism.
Once introduced, these sequences become part of the genome and are passed between generations, but they have their origins elsewhere.
Such sequences are called xenologues or xenologous sequences.
Sequence Alignment - Example
Sequence Alignment - Definition
•Procedure of comparing two or more sequences
•Search for a series of individual characters or character patterns that are in the same order in the sequences
•Two sequences are aligned by writing them across a page in two rows
•Identical/similar characters are placed in the same column
•Nonidentical characters are placed either in the same column as a mismatch or opposite a gap in the other sequence
•Optimal alignment: mismatches and gaps are placed so as to bring as many identical and similar characters as possible into the same columns
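The column-by-column view described above can be made concrete with a small counting routine. This is only an illustrative sketch: the function name alignment_summary, the use of "-" as the gap character, and the toy DNA rows are assumptions, and it counts identical characters only (similar characters would need a scoring matrix).

```python
def alignment_summary(row_a, row_b, gap="-"):
    """Count matches, mismatches and gap columns in a two-row alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gaps += 1          # a character opposite a gap
        elif a == b:
            matches += 1       # identical characters in the same column
        else:
            mismatches += 1    # nonidentical characters in the same column
    return matches, mismatches, gaps


# Toy alignment written in two rows (hypothetical sequences).
row_a = "GATTACA"
row_b = "GA-TGCA"
matches, mismatches, gaps = alignment_summary(row_a, row_b)
identity = 100.0 * matches / len(row_a)
print(matches, mismatches, gaps, round(identity, 1))
```

An optimal alignment is then the arrangement of gaps that makes a score built from such counts as large as possible.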
PRA isomerase: IGP synthase
>1PII:_|PDBID|CHAIN|SEQUENCE
MQTVLAKIVADKAIWVEARKQQQPLASFQNEVQPSTRHFYDALQGARTAFILECKKASPSKGVIRDDFDPARIAAIYKHYASAISVLTDEKYFQGSFNFLPIVSQIAPQPILCKDFIIDPYQIYLARYYQADACLLMLSVLDDDQYRQLAAVAHSLEMGVLTEVSNEEEQERAIALGAKVVGINNRDLRDLSIDLNRTRELAPKLGHNVTVISESGINTYAQVRELSHFANGFLIGSALMAHDDLHAAVRRVLLGENKVCGLTRGQDAKAAYDAGAIYGGLIFVATSPRCVNVEQAQEVMAAAPLQYVGVFRNHDIADVVDKAKVLSLAAVQLHGNEEQLYIDTLREALPAHVAIWKALSVGETLPAREFQHVDKYVLDNGQGGSGQRFDWSLLNGQSLGNVLLAGGLGADNCVEAAQTGCAGLDFNSAVESQPGIKDARLLASVFQTLRAY
>PRAI sequence
GENKVCGLTRGQDAKAAYDAGAIYGGLIFVATSPRCVNVEQAQEVMAAAPLQYVGVFRNHDIADVVDKAKVLSLAAVQLHGNEEQLYIDTLREALPAHVAIWKALSVGETLPAREFQHVDKYVLDNGQGGSGQRFDWSLLNGQSLGNVLLAGGLGADNCVEAAQTGCAGLDFNSAVESQPGIKDARLLASVFQTLRAY
Global and Local alignment
Global alignment: the entire sequences are aligned
Sequences that are quite similar and approximately the same length are suitable candidates
Local alignment: stretches of sequence with the highest density of matches are aligned
One or more islands of matches or subalignments are generated
More suitable for sequences
that are similar along some of their lengths and dissimilar in others
that differ in length
that share a conserved region or domain
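The difference between the two kinds of alignment can be seen by scoring the same pair of sequences both ways. The sketch below computes only the optimal scores (no traceback) with the dynamic programming recurrences of Needleman-Wunsch (global) and Smith-Waterman (local). The scoring scheme (match +1, mismatch -1, linear gap penalty -1), the function names and the toy sequences are assumptions chosen for illustration; real protein alignments would use a scoring matrix such as PAM or BLOSUM.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (Needleman-Wunsch recurrence)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per character.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal local alignment score (Smith-Waterman recurrence)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can start anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Toy sequences of different length that share one conserved stretch.
short_seq = "ACGT"
long_seq = "TTTTACGTTTTT"
print(global_score(short_seq, long_seq))  # -4: four matches, eight gap columns
print(local_score(short_seq, long_seq))   # 4: the shared ACGT island
```

The global score is dragged down by the gaps needed to span the longer sequence, while the local score reports only the shared island, which is why local alignment suits sequences that differ in length or share a single conserved region.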
Exercise 1
Get the UniProt (www.uniprot.org) accession ID for the protein whose PDB ID is 1BL8. Go to the corresponding UniProt entry and find out which databases are cross-referenced. What are the related databases from which you can extract information about this protein?