bioinformatics final

Click here to load reader

Upload: rainu-rajeev

Post on 11-Jan-2017

67 views

Category:

Science


0 download

TRANSCRIPT

Bioinformatics

Bioinformatics

Bioinformatics

The analysis of biological information using computers and statistical techniques.The science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research Marriage of Computer Science and BiologyOrganize, store, analyze, visualize genomic data Utilizes methods from Computer Science, Mathematics, Statistics and BiologyMargaret Oakley Dayhoff is the father & mother of Bioinformatics

At the convergence of two revolutions: the ultra-fast growth of biological data, and the information revolutionBioinformatics

3

Example 1Compare proteins with similar sequences (for instance kinases) and understand what the similarities and differences mean.

4

Example 2Look at the genome and predict where genes are (promoters; transcription binding sites; introns; exons)

5

Predict the 3-dimensional structure of a protein from its primary sequence

Example 3

6

Correlate between gene expression and diseaseExample 4

A gene chip quantifying gene expression in different tissues under different conditions

May be used for personalized medicine

7

GenomicsStudy of sequences, gene organization & mutations at the DNA level

the study of information flow within a cell

The Human Genome Project

3 billion bases 30,000 geneshttp://www.genome.gov/

Goals Established for the Human Genome Project

Began in 1990Identify all of the genes in human DNA.Determine the sequence of the 3 billion chemical nucleotide bases that make up human DNA.Store this information in data bases.Develop faster, more efficient sequencing technologies.Develop tools for data analysis.Address the ethical, legal, and social issues (ELSI) that may arise form the project.

PublishedThe International Human Genome Sequencing Consortium published their results in Nature, 409 (6822): 860-921, 2001.Initial Sequencing and Analysis of the Human Genome Celera Genomics published their results in Science, Vol 291(5507): 1304-1351, 2001.The Sequence of the Human Genome

How to characterize new diseases? What new treatments can be discovered?How do we treat individual patients? Tailoring treatments?

Impact of Genomics on Medicine

Implications for BiomedicinePhysicians will use genetic information to diagnose and treat disease.Virtually all medical conditions have a genetic componentFaster drug development research: (pharmacogenomics)Individualized drugsAll Biologists/Doctors will use gene sequence information in their daily work

Proteomics Uses information determined by biochemical/crystal structure methods Visualization of protein structure Make protein-protein comparisons Used to determine:conformation/foldingantibody binding sitesprotein-protein interactionscomputer aided drug design

TranscriptomicsThe term can be applied to the total set of transcripts in a given organism, or to the specific subset of transcripts present in a particular cell type

The transcriptome reflects the genes that are being actively expressed at any given time

Eg. DNA microarray technology

Database

A collection of data that needs to be:StructuredSearchableUpdated (periodically)Cross referenced

Challenge:To change meaningless data into useful information that can be accessed and analysed the best way possible.

Biological databasesLike any other databaseData organization for optimal analysis

Data is of different typesRaw data (DNA, RNA, protein sequences)Curated data (DNA, RNA and protein annotated sequences and structures, expression data)

Raw Biological dataNucleic Acids (DNA)

Raw Biological dataAmino acid residues (proteins)

Curated Biological Data

DNA, nucleotide sequences

Gene boundaries, topologyGene structure

Introns, exons, ORFs, splicingExpression data

Mass spectometry

Curated Biological data3D Structures, folds

Primary Seq dbs is collaborative- data exchanged daily

Databases 1: nucleotide sequenceThe main DNA sequence db are EMBL (Europe)/GenBank (USA) /DDBJ (Japan)There are also specialized databases for the different types of RNAs (i.e. tRNA, rRNA, mi RNA)3D structure (DNA and RNA)Others: Aberrant splicing db; Eucaryotic promoter db (EPD); RNA editing sites, Multimedia Telomere Resource

EMBL/GenBank/DDJBThese 3 db contain mainly the same informations within 2-3 days (few differences in the format and syntax)Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:Genome projects and sequencing centersIndividual scientists Patent offices (i.e. European Patent Office, EPO)Non-confidential data are exchanged daily Currently: 8.3 x106 sequences, over 9.7 x109 bp;Sequences from > 50000 different species;

GenBank entry: example LOCUS HSERPG 3398 bp DNA PRI 22-JUN-1993 DEFINITION Human gene for erythropoietin. ACCESSION X02158 VERSION X02158.1 GI:31224 KEYWORDS erythropoietin; glycoprotein hormone; hormone; signal peptide. SOURCE human. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 3398) AUTHORS Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J., Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F., Kawakita,M., Shimizu,T. and Miyake,T. TITLE Isolation and characterization of genomic and cDNA clones of human erythropoietin JOURNAL Nature 313 (6005), 806-810 (1985) MEDLINE 85137899 COMMENT Data kindly reviewed (24-FEB-1986) by K. Jacobs. FEATURES Location/Qualifiers source 1..3398 /organism="Homo sapiens" /db_xref="taxon:9606" mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327) exon 397..627 /number=1 sig_peptide join(615..627,1194..1261) CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763) /codon_start=1 /product="erythropoietin" /protein_id="CAA26095.1" /db_xref="GI:312304" /db_xref="SWISS-PROT:P01588" /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI

GenBank entry (cont.) TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR" intron 628..1193 /number=1 exon 1194..1339 /number=2 mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2760) /product="erythropoietin" intron 1340..1595 /number=2 exon 1596..1682 /number=3 intron 1683..2293 /number=3 exon 2294..2473 /number=4 intron 2474..2607 /number=4 exon 2608..3327 /note="3' untranslated region" /number=5 BASE COUNT 698 a 1034 c 991 g 675 t ORIGIN 1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag 61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat 121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg 181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc 241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc

Database 2: protein sequences SWISS-PROT: created in 1986 (A.Bairoch) http://www.expasy.org/sprot/TrEMBL: created in 1996; complement to SWISS-PROT; derived from EMBL CDS translations (proteomic version of EMBL)

PIR-PSD: Protein Information Resources http://pir.georgetown.edu/

Genpept: proteomic version of GenBank

Many specialized protein databases for specific families or groups of proteins.

Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system) YPD (Yeast) etc.

SWISS-PROTCollaboration between the SIB (CH) and EMBL/EBI (UK)Fully annotated (manually), non-redundant, cross-referenced, documented protein sequence database.~113000 sequences from more than 6800 different species; 70000 references (publications); 550000 cross-references (databases); ~200 Mb of annotations.Weekly releases; available from about 50 servers across the world, the main source being ExPASy

TrEMBL (Translation of EMBL)It is impossible to cope with the quantity of newly generated data AND to maintain the high quality of SWISS-PROT -> TrEMBL, created in 1996.

TrEMBL is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools.

Contains all what is not in SWISS-PROT. SWISS-PROT + TrEMBL = all known protein sequences.

Well-structured SWISS-PROT-like resource.

a Swiss-Prot entryoverview

sequence

Accession number

Entry name

To be updated (change the complete entry ?)

TrEMBL: example

Original TrEMBL entry which has been integrated into the SWISS-PROT EPO_HUMAN entry and thus which is not found in TrEMBL anymore.

33

PDBProtein Data Bank, managed by RCSBCurrently there are ~13000 structures for about 4000 different molecules, but far less protein family !There are also databases that contain data derived from PDB. Examples: HSSP (homology-derived secondary structure of proteins), SWISS-3DIMAGE (images)

Restriction enzyme

PDB: example

HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6REVDAT 1 15-OCT-92 12CA 0 12CA 7JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13REMARK 1 12CA 14REMARK 2 12CA 15REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16REMARK 3 12CA 17REMARK 3 REFINEMENT. 12CA 18REMARK 3 PROGRAM PROLSQ 12CA 19REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20REMARK 3 R VALUE 0.170 12CA 21REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23REMARK 4 12CA 24REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27

ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102.

37

38

Species-oriented databasesSGD: Saccharomyces Genome Database; molecular biology and genetics of the yeast Saccharomyces cerevisiae (baker's or budding yeast) FlyBase: A Database of the Drosophila Genome MGI: Mouse Genome Informatics RGD: Rat Genome DatabaseWormBase - database for the nematode Caenorhabditis elegans The Arabidopsis Information Resource (TAIR) - database for the brassica family plant Arabidopsis thaliana

Mouse Genome Informatics

39

40

Sequence Alignment ToolsDatabase Searching:

BLAST:NCBI, Web Interface: http://www.ncbi.nlm.nih.gov/BLAST/WuBLAST http://blast.wustl.eduFASTA: http://www.ebi.ac.uk/fasta3/Smith-WatermanPar-Align: http://dna.uio.no/search/

Multiple Sequence Alignment:CLUSTALW: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.htmlDiAlign, Web Interface: http://genomatix.gsf.de/cgi-bin/dialign/dialign.plMSA:http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.htmlWeb Interface: http://bioweb.pasteur.fr/seqanal/interfaces/msa-simple.html

42

Uses of BLASTQuery a database for sequences similar to an input sequence

Uses of BLASTIdentify previously characterized sequencesQuery a database for sequences similar to an input sequence

Uses of BLASTIdentify previously characterized sequences.Find phylogenetically related sequences.Query a database for sequences similar to an input sequence

Uses of BLASTIdentify previously characterized sequences.Find phylogenetically related sequences.Identify possible functions based on similarities to known sequences.Query a database for sequences similar to an input sequence

Sequence Homology SoftwareNCBI-BLASTRun by the National Center for Biotechnology InformationBLAST uses a heuristic algorithm based on the Smith-Waterman algorithmAlgorithm searches database for a small string within the query (default 11 for nucleotide searches), then when it detects a match, searches for shared nucleotides at each end of the seed to extend the matchGaps are taken into account, then the matches are presented in order of statistical significancehttp://www.ncbi.nlm.nih.gov/BLAST/

Different Types of BLASTNucleotide-nucleotide BLAST (BLASTN): Basic nucleutide sequence searchesThe BLAST that you used for your sequences Protein-protein BLAST (BLASTP): Similar technology used to search amino acid sequences Position-Specific Iterative BLAST (PSI-BLAST):A more advance protein BLAST useful for analyzing relationships between divergently evolved proteins.

Different Types of BLASTBLASTX and tBLASTN variants:Use six-frame translation for proteins and nucleotides, respectively, in the searchMegaBLAST:Used for BLASTing several sequences at once to cut down on processing load and server reporting-time

Types of BLAST:Graphic courtesy of Joel Graber.

Interpreting BLAST ResultsQuery CoverageThe percent of the query sequence matched by the database entryMax IdentThe percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g. deletions or additions reduce this value)

FASTAAnother method for local sequence alignment.Maintained by Dr. William Pearson at the University of Virginia. http://www.infobiogen.fr/doc/Fasta/docfasta.html

52

FASTA VersionsFASTA-nucleotide or protein sequence searching

FASTx/ FASTy compares a translated DNA query sequenceto a protein sequence database (forward or backward translation of the query)

tFASTx/ tFASTy -compares protein query sequence to DNA sequence database that has been translated into three forward and three reverse reading frames

53

Multiple Sequence Alignment SoftwareClustal (free) ClustalX SoftwareClustalW WebDNAStar ($$$)Functionality is similar, but difference is in interface, tools, and speed of algorithms http://www.ebi.ac.uk/clustalw/

Multiple Sequence Alignment (MSA)Multiple sequence alignment (MSA) can be seen as a generalization of Pairwise Sequence Alignment - instead of aligning two sequences, n sequences are aligned simultaneously, where n is > 2Definition: A multiple sequence alignment is an alignment of n > 2 sequences obtained by inserting gaps (-) into sequences

FastA FormatA sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.

56

Other Completed Genomes

Haemophilus influenzae Escherichia coli Bacillus subtilusHelicobacter pyloriBorrelia burgdorferi Streptococcus pneumoniaeSaccharomyces cerevisiae

Caenorhabditis elegans Arabidopsis thalianaArchaeoglobus fulgidusMethanobacterium thermoautotrophicumMethanococcus jannaschiiMycoplasma pneumoniaeMycoplasm genitaliuRickettsia prowazekiiMycobacterium tuberculosis

Treponema pallidumStaphylococcus aureusAnd more!

Completed Plant GenomesArabidopsis thaliana

Completed Insect GenomesDrosophila melanogaster

Completed Rodent GenomesMus musculus

Protien- Three Structure Levels

Beta Sheet

HelixLoop

PDB ID: 12as Primary structure: sequence of amino acidse.g., DRVYIHPF

Secondary structure: local folding patternse.g., alpha-helix, beta-sheet, loop

Tertiary structure: complete 3D fold

61This protein has two chainsThree different secondary structures introduce three structure levels

Different Levels of Protein Structure Prediction

Protein Structure PredictionRegular Secondary Structure Prediction (-helix -sheet)APSSP2: Highly accurate method for secondary structure predictionCombines memory based reasoning ( MBR) and ANN methods Irregular secondary structure prediction methods (Tight turns)Betatpred: Consensus method for -turns prediction Statistical methods combinedKaur and Raghava (2001) Bioinformatics Bteval : Benchmarking of -turns prediction Kaur and Raghava (2002) J. Bioinformatics and Computational Biology, 1:495:504BetaTpred2: Highly accurate method for predicting -turns (ANN)Multiple alignment and secondary structure informationKaur and Raghava (2003) Protein Sci 12:627-34 BetaTurns: Prediction of -turn types in proteins Evolutionary information Kaur and Raghava (2004) Bioinformatics 20:2751-8. AlphaPred: Prediction of -turns in proteinsKaur and Raghava (2004) Proteins: Structure, Function, and Genetics 55:83-90 GammaPred: Prediction of -turns in proteinsKaur and Raghava (2004) Protein Science; 12:923-929.

Protein Structure Prediction BhairPred: Prediction of Super secondary structure predictionPrediction of Beta HairpinsSecondary structure and surface accessibility used as inputManish et al. (2005) Nucleic Acids Research TBBpred: Prediction of outer membrane proteinsPrediction of trans membrane beta barrel proteinsPrediction of beta barrel regionsApplication of ANN and SVM + Evolutionary informationNatt et al. (2004) Proteins: 56:11-8 ARNHpred: Analysis and prediction side chain, backbone interactionsPrediction of aromatic NH interactionsKaur and Raghava (2004) FEBS Letters 564:47-57 . SARpred: Prediction of surface accessibility Multiple alignment (PSIBLAST) and Secondary structure information ANN: Two layered network (sequence-structure-structure)Garg et al., (2005) Proteins PepStr: Prediction of tertiary structure of Bioactive peptides

Homology Modeling

The Best MatchDRVYIHPFADRVYIHPFAQuery Sequence:Protein sequence classification database PSI-BLAST HMM Smith-Waterman algorithm

65

Phylogenetic analysisEvolution studiesSystematic biology Medical research and epidemiology Ecology

66(e.g. the out of Africa debate or evolutionary relationships between humans and other primates) (e.g. Tree of life)(e.g. origins of AIDS) follow changes in extremely rapidly changing organisms, even though the evolutionary period for HIV is less than a 100 years, there is already enough information to study it in phylogenetic analysis.Ecology, changes in bacteria colonies.CENTRAL Point is: we are not just comparing different species, we can compare viruses within one particular organism, colonies of the same organism and so on

67

Terminology

Rooted tree Unrooted tree

68Root = common ancestor of all units, Path from root to node defines evolutionary pathUnrooted: specify relationships only, not evolutionary pathAn internal node in a gene tree represents the divergence of an ancestral gene into two genes with different DNA sequences: this occurs by mutation Branch length = evolutionary distance, in this case RNA was examined to find the differences between species. Note that in this case we dont have a scale, the evolutionary distances are the branch lengths.

Advantages of molecular phylogenetic analysisAnalogous features (share common function, but NOT common ancestry) can be misleadingDNA sequences more simple to model, we only have the four states A, C, G, TDNA samples for sequence analysis easy to prepareSoftwares used :PhylipPaup

69Example of analogous features = wings of birds and bats, not the same structureFinite number of states, Easier to compare different speciesDNA sequences as string of letters, easy to use for mathematical and statistical analysis.

Modern drug discovery process

Target identificationTarget validationLead identificationLead optimizationPreclinical phase Drug discovery 2-5 years

Drug discovery is an expensive process involving high R & D cost and extensive clinical testing A typical development time is estimated to be 10-15 years.

6-9 years

70

Bioinformatics tools in Drug Design

Comparison of Sequences: Identify targetsHomology modelling: active site predictionSystems Biology: Identify targetsDatabases: Manage informationIn silico screening (Ligand based, receptor based): Iterative steps of Molecular docking.Pharmacogenomic databases: assist safety related issues

Molecular Docking

R

L

Docking is the computational determination of binding affinity between molecules (protein structure and ligand). Given a protein and a ligand find out the binding free energy of the complex formed by docking them.

LR

72

Molecular Docking: classification

Docking or Computer aided drug designing can be broadly classifiedReceptor based methods- make use of the structure of the target protein.Ligand based methods- based on the known inhibitors

73

Receptor based methodsUses the 3D structure of the target receptor to search for the potential candidate compounds that can modulate the target function. These involve molecular docking of each compound in the chemical database into the binding site of the target and predicting the electrostatic fit between them. The compounds are ranked using an appropriate scoring function such that the scores correlate with the binding affinity. Receptor based method has been successfully applied in many targets

74

Ligand based strategy

In the absence of the structural information of the target, ligand based method make use of the information provided by known inhibitors for the target receptor. Structures similar to the known inhibitors are identified from chemical databases.

75

Ligand based strategySearch for similar compounds

database

known activesstructures found

76

Example: use of Bioinformatics in Drug discovery

Identification of novel drug targets against human malaria

Docking Software : DockAutodockGold

77

Top 100 solutions

Out of top 40 only 10 compounds available for purchase

Drug-like compound library (1,000,00)

Molecular docking

Ligand docked into proteins active site

Insilico identification of novel inhibitors against PfClpQ , a novel drug target of P.falciparum by high throughput docking

PfclpQ

78

Rasmol

RasMol2 is a molecular graphics program intended for the visualization of proteins, nucleic acids and small molecules

The program is aimed at display, teaching and generation of publication quality images

The program reads in molecular co-ordinate files and interactively displays the molecule on the screen in a variety of representations and color schemes

Supported input file formats include Brookhaven Protein Databank (PDB),Sybyl Mol2 formats, Molecular Design Limited's (MDL) Mol file format, Minnesota Supercomputer Centre's (MSC) XYZ (XMol) format CHARMm format files

Thank YouStephen JamesDepartment of BioinformaticsMar Athanasios College for Advanced Studies(MACFAST)Email : [email protected]: 9746935363