![Page 1: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/1.jpg)
Introduction to Bioinformatics
Victor Jin
Department of Biomedical Informatics
Ohio State University
![Page 2: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/2.jpg)
What is Bioinformatics?
bio·in·for·mat·ics
: the collection, classification, storage, and analysis of biochemical and
biological information using computers especially as applied in molecular
genetics and genomics.
Source: Merriam-Webster's Medical Dictionary, © 2002 Merriam-Webster, Inc.
![Page 3: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/3.jpg)
Myth 1: Bioinformatics is about genomics
• Nucleotide – DNA, RNA, …
• Genome – Sequences, chromosomes, expressed data, …
• Protein – Sequences, 3-D structure, interaction, …
• System – Gene network, protein network, TFs, …
• Other – Mass spec, microarray, images, lab records, journals, literature, …
The goal is to understand how the system works.
![Page 4: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/4.jpg)
Myth 2: Data vs. Information
Data
• Nucleotide – DNA, RNA, …
• Genome – Sequences, chromosomes, expressed data, …
• Protein – Sequences, 3-D structure, interaction, …
• System – Gene network, protein network, TFs, …
• Other – Mass spec, microarray, images, lab records, journals, literature, …
Information
• Genotype
• Phenotype
• Genotype-phenotype relationship
• SNPs
• Pathways
• Drug targets
Getting data is “easy”; extracting information is hard!
![Page 5: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/5.jpg)
Myth 3: The computer is intelligent
Pros
• Repeated work
• Accurate storage
• Precise computation
• Fast communication
• …
Cons
• Cannot generalize
• No real intelligence
• …
The results must be reviewed and validated by biologists. In addition, biologists must have some understanding of how computers process data (algorithms). That is why we need to learn bioinformatics.
![Page 6: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/6.jpg)
Biology – Bioinformatics
Bioinformatics
![Page 7: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/7.jpg)
High-throughput techniques
DNA Sequencing
• 1970’s – Nobel prize
• 1980’s – Ph.D. thesis
• Early 1990’s – Major research projects
• Late 1990’s to now – $20
![Page 8: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/8.jpg)
Human Genome Project
The Beginning (1988)
Cold Spring Harbor Laboratory
Long Island, New York
![Page 9: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/9.jpg)
Initial Analysis of the Human Genome
![Page 10: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/10.jpg)
What information do we want to extract?
Science, 9/2/2005
Total genetic difference (# of bases) is 4%
35 million single base substitutions plus 5 million insertions or deletions (indels)
The average protein differs by only two amino acids, and 29% of proteins are identical.
Genotype – Phenotype relationship!!!
![Page 11: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/11.jpg)
Phenotype
• mRNA level
• Protein expression
• Protein structure
• Cell morphology
• Tissue morphology
• System physiological functions
• Behavior
• …
![Page 12: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/12.jpg)
High-throughput techniques
High throughput protein crystallization
Mass spectrometry
Microarray
High throughput cell imaging
High throughput in vivo screening
…
![Page 13: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/13.jpg)
How to extract the information?
Computational tools
• Building the databases
• Perform analysis/extract features
• Data fusion/integration
• Data mining/classification/statistical learning
• Visualization/representation
Biological information!!!
![Page 14: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/14.jpg)
What we are going to do:
• Search the databases
• Perform analysis
• Present output
Be a salient user!
![Page 15: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/15.jpg)
What does the scope of bioinformatics cover?
• Genomics
• Proteomics
• Microarray analysis
• Other aspects
• Ontology
• Machine learning / statistical analysis
• Visualization
• Data sources (databases)
• Available tools
• Major issues in using the databases and tools
• Other resources
![Page 16: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/16.jpg)
Review of Biology
Central dogma
![Page 17: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/17.jpg)
Review of Biology
Operon
![Page 18: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/18.jpg)
Review of Biology
mRNA, cDNA,
exon, intron
![Page 19: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/19.jpg)
Review of Biology
Protein folding and structure
![Page 20: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/20.jpg)
Databases
GenBank www.ncbi.nlm.nih.gov/GenBank/
EMBL www.ebi.ac.uk/embl/
DDBJ www.ddbj.nig.ac.jp
Synchronized daily.
Accession numbers are managed in a consistent way.
AceDB, DDJP, DNAJJPID, MIPS, PHRED, PIR, PROSITE, RDP, TIGR, UNIGENE, …
![Page 21: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/21.jpg)
Resources
Local: OSU library
Web:
• PubMed
• JSTOR (http://www.jstor.com)
• http://www.expasy.org
• http://www.genecards.org
• http://www.pathguide.org/
![Page 22: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/22.jpg)
PubMed – Entrez
PubMed: http://www.pubmed.gov, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
PubMed training: http://www.nlm.nih.gov/bsd/disted/pubmed.html
Entrez: http://www.ncbi.nlm.nih.gov/Database/index.html
Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
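A text-based Entrez search can also be issued programmatically. The sketch below only builds a query URL and does not contact NCBI. It assumes the NCBI E-utilities `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters, none of which appear in these slides; the helper name `build_entrez_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities search endpoint, which is not mentioned in the slides.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_search_url(database, term, max_results=5):
    """Build (but do not send) a text search URL for one Entrez database."""
    # urlencode escapes spaces, brackets, and other reserved characters in the term.
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return ESEARCH_BASE + "?" + query


# Example: search PubMed for the gene name E2F3.
pubmed_url = build_entrez_search_url("pubmed", "E2F3")
print(pubmed_url)
```

Changing `database` to another Entrez database name (for example `"protein"`) reuses the same text query against a different collection, which is the integration the slide describes.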
![Page 23: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/23.jpg)
Entrez Databases
![Page 24: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/24.jpg)
Literature
Examples:
1. E2F3
2. Retinoblastoma
Constraints: automatic vs. manual
Save: Tutorial at http://www.nlm.nih.gov/bsd/viewlet/myncbi/saving_searches.swf
![Page 25: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/25.jpg)
Literature
![Page 26: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/26.jpg)
Literature
![Page 27: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/27.jpg)
Literature
Examples:
1. E2F3
2. Retinoblastoma
Constraints: automatic vs. manual
![Page 28: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/28.jpg)
Literature
![Page 29: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/29.jpg)
Nucleotide
• Gene
• Genome
• Sequence
• mRNA
• cDNA
• SNP
• ESTs (expressed sequence tags) / UniGene

• Name
• Accession number
• GI number
• Version number
• Alias
![Page 30: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/30.jpg)
Accession number, GI number, Version
• Accession number (GenBank) - The accession number is the unique identifier assigned to the entire sequence record when the record is submitted to GenBank. The GenBank accession number is a combination of letters and numbers that are usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456).
• The accession number for a particular record will not change even if the author submits a request to change some of the information in the record. Take note that an accession number is a unique identifier for a complete sequence record, while a Sequence Identifier, such as a Version, GI, or ProteinID, is an identification number assigned just to the sequence data. The NCBI Entrez System is searchable by accession number using the Accession [ACCN] search field.
• GI (GenBank) - A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a nucleotide sequence or protein translation. Each GI is a numeric value of one or more digits. The protein translation and the nucleotide sequence contained in the same record will each be assigned different GI numbers.
• Every time the sequence data for a particular record is changed, its version number increases and it receives a new GI. However, while each new version number is based upon the previous version number, a new GI for an altered sequence may be completely different from the previous GI. For example, in the GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted, the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences and protein translations by GI using the UID search field in the NCBI sequence databases.
• GI number is NOT GeneID.
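The identifier layouts described above can be checked with a short script. This is a minimal sketch, not an NCBI tool: it covers only the two accession layouts named on this slide (one letter plus five digits, two letters plus six digits), and the helper names are hypothetical. It also assumes the common `accession.version` notation, where the version follows a dot.

```python
import re

# Only the two layouts named on the slide; other layouts exist,
# so a non-match does not prove that an identifier is invalid.
ACCESSION_RE = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def looks_like_accession(text):
    """True if text has one of the two accession layouts described above."""
    return ACCESSION_RE.fullmatch(text.strip().upper()) is not None


def looks_like_gi(text):
    """A GI is a numeric value of one or more digits."""
    return text.strip().isdigit()


def split_version(identifier):
    """Split 'accession.version' into (accession, version); version is None if absent."""
    base, dot, version = identifier.partition(".")
    return base, int(version) if dot and version.isdigit() else None


print(looks_like_accession("M12345"), looks_like_accession("AC123456"))
print(looks_like_gi("7654321"), split_version("AC123456.2"))
```

Keeping the accession and GI checks separate mirrors the distinction on the slide: the accession names the whole record and stays fixed, while the version and GI track the sequence data.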
![Page 31: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/31.jpg)
Example : E2F3
![Page 32: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/32.jpg)
Example : E2F3
![Page 33: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/33.jpg)
Data Format
FASTA (.fasta file)
>gi|33469954|ref|NM_000240.2| Homo sapiens monoamine oxidase A (MAOA), nuclear gene encoding mitochondrial protein, mRNA
GGGCGCTCCCGGAGTATCAGCAAAAGGGTTCGCCCCGCCCACAGTGCCCGGCTCCCCCCGGGTATCAAAA
GAAGGATCGGCTCCGCCCCCGGGCTCCCCGGGGGAGTTGATAGAAGGGTCCTTCCCACCCTTTGCCGTCC
CCACTCCTGTGCCTACGACCCAGGAGCGTGTCAGCCAAAGCATGGAGAATCAAGAGAAGGCGAGTATCGC
GGGCCACATGTTCGACGTAGTCGTGATCGGAGGTGGCATTTCAGGACTATCTGCTGCCAAACTCTTGACT
GAATATGGCGTTAGTGTTTTGGTTTTAGAAGCTCGGGACAGGGTTGGAGGAAGAACATATACTATAAGGA
ATGAGCATGTTGATTACGTAGATGTTGGTGGAGCTTATGTGGGACCAACCCAAAACAGAATCTTACGCTT
GTCTAAGGAGCTGGGCATAGAGACTTACAAAGTGAATGTCAGTGAGCGTCTCGTTCAATATGTCAAGGGG
AAAACATATCCATTTCGGGGCGCCTTTCCACCAGTATGGAATCCCATTGCATATTTGGATTACAATAATC
TGTGGAGGACAATAGATAACATGGGGAAGGAGATTCCAACTGATGCACCCTGGGAGGCTCAACATGCTGA
CAAATGGGACAAAATGACCATGAAAGAGCTCATTGACAAAATCTGCTGGACAAAGACTGCTAGGCGGTTT
GCTTATCTTTTTGTGAATATCAATGTGACCTCTGAGCCTCACGAAGTGTCTGCCCTGTGGTTCTTGTGGT
ATGTGAAGCAGTGCGGGGGCACCACTCGGATATTCTCTGTCACCAATGGTGGCCAGGAACGGAAGTTTGT
AGGTGGATCTGGTCAAGTGAGCGAACGGATAATGGACCTCCTCGGAGACCAAGTGAAGCTGAACCATCCT
GTCACTCACGTTGACCAGTCAAGTGACAACATCATCATAGAGACGCTGAACCATGAACATTATGAGTGCA
AATACGTAATTAATGCGATCCCTCCGACCTTGACTGCCAAGATTCACTTCAGACCAGAGCTTCCAGCAGA
GAGAAACCAGTTAATTCAGCGGCTTCCAATGGGAGCTGTCATTAAGTGCATGATGTATTACAAGGAGGCC
TTCTGGAAGAAGAAGGATTACTGTGGCTGCATGATCATTGAAGATGAAGATGCTCCAATTTCAATAACCT
TGGATGACACCAAGCCAGATGGGTCACTGCCTGCCATCATGGGCTTCATTCTTGCCCGGAAAGCTGATCG
ACTTGCTAAGCTACATAAGGAAATAAGGAAGAAGAAAATCTGTGAGCTCTATGCCAAAGTGCTGGGATCC
CAAGAAGCTTTACATCCAGTGCATTATGAAGAGAAGAACTGGTGTGAGGAGCAGTACTCTGGGGGCTGCT
ACACGGCCTACTTCCCTCCTGGGATCATGACTCAATATGGAAGGGTGATTCGTCAACCCGTGGGCAGGAT
TTTCTTTGCGGGCACAGAGACTGCCACAAAGTGGAGCGGCTACATGGAAGGGGCAGTTGAGGCTGGAGAA
CGAGCAGCTAGGGAGGTCTTAAATGGTCTCGGGAAGGTGACCGAGAAAGATATCTGGGTACAAGAACCTG
…
>gi|4557735|ref|NP_000231.1| monoamine oxidase A [Homo sapiens]
MENQEKASIAGHMFDVVVIGGGISGLSAAKLLTEYGVSVLVLEARDRVGGRTYTIRNEHVDYVDVGGAYV
GPTQNRILRLSKELGIETYKVNVSERLVQYVKGKTYPFRGAFPPVWNPIAYLDYNNLWRTIDNMGKEIPT
DAPWEAQHADKWDKMTMKELIDKICWTKTARRFAYLFVNINVTSEPHEVSALWFLWYVKQCGGTTRIFSV
TNGGQERKFVGGSGQVSERIMDLLGDQVKLNHPVTHVDQSSDNIIIETLNHEHYECKYVINAIPPTLTAK
IHFRPELPAERNQLIQRLPMGAVIKCMMYYKEAFWKKKDYCGCMIIEDEDAPISITLDDTKPDGSLPAIM
GFILARKADRLAKLHKEIRKKKICELYAKVLGSQEALHPVHYEEKNWCEEQYSGGCYTAYFPPGIMTQYG
RVIRQPVGRIFFAGTETATKWSGYMEGAVEAGERAAREVLNGLGKVTEKDIWVQEPESKDVPAVEITHTF
WERNLPSVSGLLKIIGFSTSVTALGFVLYKYKLLPRS
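A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below reads such text and pulls the GI, accession, and version out of a header shaped like the ones above. The function names are hypothetical, and the header layout (`gi|<GI>|<db>|<accession.version>| description`) is inferred from the two example records, so other header styles simply return `None`.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_defline(header):
    """Split a 'gi|<GI>|<db>|<accession.version>| description' header into fields."""
    parts = header.split("|", 4)
    if len(parts) < 5 or parts[0] != "gi" or not parts[1].isdigit():
        return None  # not the layout shown in the examples above
    accession, _, version = parts[3].partition(".")
    return {
        "gi": int(parts[1]),
        "db": parts[2],
        "accession": accession,
        "version": int(version) if version.isdigit() else None,
        "description": parts[4].strip(),
    }


# Shortened excerpt of the protein record above (first 20 residues only).
example = """>gi|4557735|ref|NP_000231.1| monoamine oxidase A [Homo sapiens]
MENQEKASIA
GHMFDVVVIG
"""
records = parse_fasta(example)
fields = parse_defline(records[0][0])
print(fields["accession"], fields["version"], len(records[0][1]))
```

Sequence lines are joined without separators, so the line width used in the file does not affect the parsed sequence.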
![Page 34: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/34.jpg)
Data Format
Other formats
NBRF/PIR (.pir file)
Begins with “>P1;” for a protein sequence and “>N1;” for a nucleotide sequence.
GDE (.gde file)
Similar to a FASTA file, but begins with “%” instead of “>”.
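The leading characters listed above are enough to make a rough guess at a file's format. The sketch below inspects only the first non-empty line and uses only the markers named on these slides; the function name and the returned labels are hypothetical, and real files may use markers not covered here.

```python
def guess_sequence_format(text):
    """Guess the sequence format from the first non-empty line of the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Check the PIR markers first, because they also start with ">".
        if line.startswith(">P1;"):
            return "pir-protein"
        if line.startswith(">N1;"):
            return "pir-nucleotide"
        if line.startswith(">"):
            return "fasta"
        if line.startswith("%"):
            return "gde"
        return "unknown"
    return "unknown"


print(guess_sequence_format(">gi|4557735|ref|NP_000231.1| monoamine oxidase A"))
print(guess_sequence_format("%seq1\nACGT"))
```

The order of the checks matters: testing for a plain ">" before the PIR markers would label every PIR file as FASTA.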
![Page 35: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/35.jpg)
Protein Databases
UniProt is the universal protein database, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information.
The Protein Information Resource (PIR), located at Georgetown University Medical Center (GUMC), is an integrated public bioinformatics resource to support genomic and proteomic research, and scientific studies.
Swiss-Prot is a curated biological database of protein sequences from different species created in 1986 by Amos Bairoch during his PhD and developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute.
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
PDB
NCBI
http://proteome.nih.gov/links.html
![Page 36: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/36.jpg)
PubMed – Protein Databases
The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL, and DDBJ as well as protein sequences submitted to Protein Information Resource (PIR), SWISS-PROT, Protein Research Foundation (PRF), and Protein Data Bank (PDB) (sequences from solved structures).
The Structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.
Tutorial: http://www.pdb.org/pdbstatic/tutorials/tutorial.html
![Page 37: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/37.jpg)
Example – UniProt - Expasy
http://www.uniprot.org/
http://www.expasy.org/
![Page 38: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/38.jpg)
Example – UniProt - Expasy
![Page 39: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/39.jpg)
Example – UniProt - Expasy
![Page 40: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/40.jpg)
Example – UniProt - Expasy
![Page 41: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/41.jpg)
Example – UniProt - Expasy
![Page 42: Introduction to Bioinformatics Victor Jin](https://reader035.vdocument.in/reader035/viewer/2022081414/54c674374a7959f67d8b4638/html5/thumbnails/42.jpg)
Annotation - Visualization
UCSC Genome Browser (http://genome.ucsc.edu/)