Introduction to Bioinformatics and Biological databases
Nicky Mulder: [email protected]
What is Bioinformatics?
• The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics. www.informatics.jax.org/mgihome/other/glossary.shtml
• Doing biology on a computer (computational biology)
Why is Bioinformatics needed?
• Small- and large-scale biological analyses
• New laboratory technologies, e.g. sequencing
• Move away from single gene to whole genome
• Collection and storage of biological information
• Manipulation of biological information
• Computers are capable of both, and are cheap
Hypothesis-driven bioinformatics
Gene of interest
Search PubMed for more information
Retrieve the protein sequence
Sequence similarity search
Looking at whole genomes
Retrieve the DNA sequence
Analysing DNA sequence
Retrieve articles
Sequence similarity search
Phylogenetics
Sequence alignments, finding motifs
Finding domains, classifying proteins
Genomics
Sequence analysis
Biological databases
Phylogenetics
Hypothesis-generating bioinformatics
High-throughput experiment (microarray, proteomics, NGS)
Experimental design
Statistics
Gene lists
Data processing
Pathway analysis
Data mining
Gene set enrichment
Systems biology
Data integration
Proteomics, NGS
Systems biology
Statistics
Two major components to Bioinformatics
• Storing and retrieving data:
  – Biological databases
  – Querying these to retrieve data
• Manipulating the data – tools, e.g.:
  – Sequence similarity searches
  – Protein families and function prediction
  – Comparing sequences – phylogenetics
What is a database?
• An organized body of related information www.cogsci.princeton.edu/cgi-bin/webwn
• Data collection that is:
  – Structured (computer readable)
  – Searchable
  – Updatable
  – Cross-linked
  – Publicly available
Biological Databases
• Where do you go to find:
  – A video -> YouTube
  – Info on S. Hawking -> Wikipedia
  – A book -> Amazon
  – A friend -> Facebook
  – DNA sequence -> EMBL
  – Protein sequence -> UniProtKB, RefSeq…
• Biological databases:
  – Order and make data available to public
  – Turn data into computer-readable form
  – Provide ability to retrieve data from various sources
• Can be primary (archival) or secondary (curated) databases
Categories of Databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation
• Protein domain/family
• Proteomics
• 3D structure
• Metabolism
• Bibliography
• Protein interaction
Why do we need sequence DBs?
• >5000 genomes sequenced (single organisms, varying sizes, including viruses)
• Thousands of ongoing genome sequencing projects
• cDNA sequencing projects (ESTs or cDNAs)
• Metagenome sequencing projects (~200) = environmental samples: multiple ‘unknown’ organisms = microbiome
• Personal human genomes
• Cost of sequencing is coming down – alternative to other technologies
Sequence databases
• Used for retrieving a known gene/protein sequence
• Useful for finding information on a gene/protein
• Can find out how many genes are available for a given organism
• Can compare your sequence to the others in the database
• Can submit your sequence to store with the rest
• Main databases: nucleotide and protein sequence DBs
• Should be interconnected with other databases
Protein-centric view of database network
[Diagram: linked data types in the network: DNA sequences, gene annotation, gene expression data, protein sequences, macromolecular structure data]
Nucleotide sequence databases
• EMBL, DDBJ, GenBank
• Data submitted by sequence owner
• Must provide certain information and CDS if applicable
• No additional annotation added
• Entries never merged – some redundancy
[Figure: annotated EMBL entry. Labelled parts: accession number, taxonomy, references, cross-references, features (promoter, exons, CDS) and sequence. Annotation is predicted or experimentally determined; the CDS (CoDing Sequence) is proposed by submitters.]
Feature lines in EMBL entries
• Describe features on a sequence that are important for function, replication, recombination, structure etc.
• Feature key – functional group, e.g. CDS (protein-coding sequence), ribosome binding site
• Location – how to find the feature
• Qualifier – additional info
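The three parts of a feature (key, location, qualifier) can be pulled out of feature table (FT) lines with a few lines of Python. This is only a minimal sketch: the FT lines below are an invented fragment in the general EMBL flat-file layout, `parse_features` is a hypothetical helper rather than part of any library, and values that continue over several lines are not handled.

```python
# Invented FT fragment for illustration; not taken from a real entry.
SAMPLE_FT = """\
FT   source          1..1200
FT                   /organism="Example organism"
FT   CDS             join(101..250,400..620)
FT                   /product="hypothetical protein"
"""


def parse_features(text):
    """Return a list of dicts holding key, location and qualifiers."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue  # only feature table lines are of interest here
        body = line[2:].strip()
        if not body:
            continue
        if body.startswith("/"):
            # Qualifier line: /name="value" belongs to the latest feature.
            name, _, value = body[1:].partition("=")
            if features:
                features[-1]["qualifiers"][name] = value.strip('"')
        else:
            # New feature: first word is the key, the rest is the location.
            key, _, location = body.partition(" ")
            features.append(
                {"key": key, "location": location.strip(), "qualifiers": {}}
            )
    return features


for feature in parse_features(SAMPLE_FT):
    print(feature["key"], feature["location"], feature["qualifiers"])
```

For real entries an established parser is the safer choice, since locations and qualifier values can wrap across lines.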
Summary of information in EMBL entries
• Provides taxonomy of the organism from which the sequence came
• Provides information on submitters and references
• Describes features on a sequence that are important for function, replication, recombination, structure etc.
• Shows if the DNA encodes a protein (CDS) and provides the protein sequence
• Provides the actual nucleotide sequence
• Describes sequence type, e.g. genomic DNA, RNA, EST
CDS: mRNA versus genomic sequence
[Figure: alignment of an mRNA-derived sequence (CONTIG) against the corresponding genomic sequence (Genomic). Stretches where both sequences align are labelled as exons; stretches present only in the genomic sequence, shown as gaps (-) in CONTIG, are labelled as introns. CONTIG also contains some ambiguous bases (N).]
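The relationship between the genomic sequence and the mRNA can be sketched in code: joining the exon intervals of the genomic sequence gives the spliced sequence. The sequence and coordinates below are invented toy data (upper case for exons, lower case for introns), and `splice` is a hypothetical helper.

```python
def splice(genomic, exons):
    """Join exon intervals (1-based, inclusive) into the spliced sequence."""
    return "".join(genomic[start - 1:end] for start, end in exons)


# Toy genomic sequence: three exons separated by two introns (invented data).
genomic = "ATGGCC" + "gtaagtttcag" + "AAAGGT" + "gtacgtacag" + "TGCTAA"
exons = [(1, 6), (18, 23), (34, 39)]

mrna = splice(genomic, exons)
print(mrna)  # only the upper-case exon bases remain
```

This mirrors a CDS location such as join(1..6,18..23,34..39) on a genomic entry.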
Other nucleotide databases
• RefSeq
• dbEST
• Short read archive
• Trace archive
• WGS collections
Protein sequences
DNA -> RNA -> Protein
• Protein cleavage
• Protein modification (figure labels: SS, Ac)
• Transported to organelle or membrane
• Folded into secondary or tertiary structure
• Performs a specific function
All this info needs to be captured in a database
Protein Sequence Databases
• UniProt:
  – Swiss-Prot – manually curated, distinguishes between experimental and computationally derived annotation
  – TrEMBL – automatic translation of EMBL, no manual curation, some automatic annotation
• GenPept – GenBank translations
• RefSeq – non-redundant sequences for certain organisms
A UniProtKB/Swiss-Prot entry
Protein existence levels:
1: Evidence at protein level
2: Evidence at transcript level
3: Inferred from homology
4: Predicted
5: Uncertain
Swiss-Prot annotation mainly found in:
• Comment (CC) lines – Function, pathway, cofactor, regulation, disease, subcellular location
• Feature table (FT) – features on the sequence, e.g. domain, active site
• Keyword (KW) lines – Set of a few hundred controlled vocabulary terms
• Description (DE) lines – Protein name/function
Other parts to UniProt
• UniParc – archive of all sequences
• UniProtKB – Swiss-Prot + TrEMBL
• UniProt NREF100 (sequences with 100% identity merged)
• UniProt NREF90 (sequences with 90% identity merged)
• UniProt NREF50 (sequences with 50% identity merged)
• UniMES – metagenomic sequences
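Merging at 100% can be pictured as grouping entries whose sequences are identical. The sketch below shows only that simplest case, with invented identifiers and sequences; merging at 90% or 50% needs real sequence comparison and clustering, which is not attempted here.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group identifiers that share exactly the same sequence."""
    groups = defaultdict(list)
    for identifier, sequence in entries:
        groups[sequence.upper()].append(identifier)
    return dict(groups)


# Invented example entries: seqA and seqB are identical, seqC differs by one residue.
entries = [
    ("seqA", "MKTAYIAK"),
    ("seqB", "MKTAYIAK"),
    ("seqC", "MKTAYIAR"),
]

for sequence, identifiers in merge_identical(entries).items():
    print(sequence, identifiers)
```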
Submitting sequences to EMBL or UniProt
• WEB-IN – web-based submission tool for submitting DNA sequences to the EMBL database
• Protein sequences are submitted only when the peptides have been directly sequenced; submit through SPIN
Sequence formats
• Plain text, not MS Word!
• Most include an ID/name/annotation of some sort
• FASTA, e.g.:
>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg
• Others are specific to programs, e.g. GCG, abi, clustal, etc.
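Because FASTA is plain text, it is easy to read with a short script. The sketch below is a minimal parser, assuming the usual layout of a ">" header line followed by one or more sequence lines; `parse_fasta` is a hypothetical helper, and the sample reuses the start of the example sequence above.

```python
def _record(header, chunks):
    # The first word of the header is the ID, the rest is a free-text comment.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield _record(header, chunks)


sample = """>xyz some other comment
ttcctctttc
tcgactccat
"""

records = list(parse_fasta(sample))
print(records)
```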
Accession numbers
• GenBank/EMBL/DDBJ: 1 letter & 5 digits, e.g.: U12345, or 2 letters & 6 digits, e.g.: AY123456
• GenPept Sequence Records – 3 letters & 5 digits, e.g.: AAA12345
• UniProt – 6 characters: [A,B,O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g.: P12345 and Q9JJS7
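The accession formats above translate directly into regular expressions. The sketch below encodes only the patterns as listed on this slide (the databases have extended their formats over time, so treat these as illustrative); `classify` is a hypothetical helper.

```python
import re

# Patterns transcribed from the formats listed above; not an official specification.
PATTERNS = {
    "GenBank/EMBL/DDBJ nucleotide": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "GenPept protein": re.compile(r"[A-Z]{3}\d{5}"),
    "UniProt": re.compile(r"[ABOPQ]\d[A-Z0-9]{3}\d"),
}


def classify(accession):
    """Return the names of all formats that the accession fully matches."""
    return [
        name for name, pattern in PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


for accession in ["U12345", "AY123456", "AAA12345", "Q9JJS7", "P12345"]:
    print(accession, classify(accession))
```

Note that P12345 fits both the one-letter nucleotide pattern and the UniProt pattern, so the format alone does not always identify the source database.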
Cross-referencing identifiers
• So many different IDs for same thing, e.g. Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID, etc.
• Need mapping files to move between them to avoid having to parse every entry
• PICR (http://www.ebi.ac.uk/Tools/picr/) enables mapping between IDs
• UniProt website mapper (www.uniprot.org)
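A mapping file is typically just a table with one column per identifier type, which can be loaded into a lookup dictionary. The sketch below uses an invented tab-separated table held in memory; the column names and identifiers are placeholders, not real records, and `load_mapping` is a hypothetical helper.

```python
import csv
import io

# Invented mapping table; identifiers are placeholders for illustration.
MAPPING_TSV = """\
uniprot\tembl\tensembl
P00001\tX00001\tENSG_EXAMPLE_1
P00002\tX00002\tENSG_EXAMPLE_2
"""


def load_mapping(handle, source, target):
    """Build a dict mapping IDs in the source column to IDs in the target column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[source]: row[target] for row in reader}


uniprot_to_embl = load_mapping(io.StringIO(MAPPING_TSV), "uniprot", "embl")
print(uniprot_to_embl.get("P00001"))
print(uniprot_to_embl.get("UNKNOWN"))  # None when no mapping exists
```

With a real mapping file, `io.StringIO(...)` would be replaced by an opened file handle.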
Taxonomy Databases
• Most used is NCBI’s taxonomy database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
• Provides entries for all known organisms
• Provides taxonomic lineage and translation table for organisms
• Sequence entries for organism
• UniProt-specific taxonomy database is Newt: http://www.ebi.ac.uk/newt
Example taxonomy entry
Literature database: PubMed/Medline
• Source of medical and scientific literature
• PubMed has articles published after 1965
• Can search by many different means, e.g. author, title, date, journal etc., or keywords for each
• PubMed has a list of tags to search specific fields, e.g. [AU], [TI], [DP] etc.
• Can save queries and results
• Can usually retrieve abstracts and full papers
Types of search fields
• Title Words [TI]
• MeSH Terms [MH]
• Title/Abstract Words [TIAB]
• Language [LA]
• Text Words [TW]
• Journal Title [TA]
• Substance Name [NM]
• Issue [IP]
• Subset [SB]
• Filter [FILTER]
• Secondary Source ID [SI]
• Entrez Date [EDAT]
• Subheadings [SH]
• EC/RN Number [RN]
• Publication Type [PT]
• Author Name [AU]
• Publication Date [DP]
• All Fields [ALL]
• Personal Name as Subject [PS]
• Affiliation [AD]
• Page Number [PG]
• Unique Identifiers [UID]
• MeSH Major Topic [MAJR]
• MeSH Date [MHDA]
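Field tags are simply appended to the search term, and tagged terms can be combined into one query. The sketch below only builds the query string and does not contact PubMed; `field_query` is a hypothetical helper and the author name and terms are invented examples.

```python
def field_query(terms, operator="AND"):
    """Combine (text, tag) pairs into one query string."""
    parts = [f"{text}[{tag}]" for text, tag in terms]
    return f" {operator} ".join(parts)


# Invented search: an author, a title word and a publication year.
query = field_query([("Smith J", "AU"), ("bioinformatics", "TI"), ("2010", "DP")])
print(query)
```

The resulting string can be pasted into the PubMed search box.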
Querying biological databases
• Databases hold a wealth of information
• Data are held in specific formats with controlled vocabularies, which makes searching easy
• Can retrieve and save data you need
• In many cases you can retrieve data from multiple sources
How to query databases
• Query languages e.g. SQL
• Can query with single word or phrase
• Boolean queries
• Regular expressions
• Basic database querying is usually done through web interface
  – Text or sequence-based searches
  – Can use Boolean queries and regular expressions
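As an illustration of a query language, the sketch below runs an SQL query against a tiny in-memory database using Python's built-in sqlite3 module. The table, its columns and its rows are invented for the example and do not reflect the schema of any real biological database.

```python
import sqlite3

# Build a throwaway in-memory database with an invented protein table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, organism TEXT)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [
        ("X0001", "tyrosine protein kinase", "Homo sapiens"),
        ("X0002", "serine protease", "Homo sapiens"),
        ("X0003", "tyrosine protein kinase", "Mus musculus"),
    ],
)

# Find human entries whose name contains the word "kinase".
rows = conn.execute(
    "SELECT accession FROM protein "
    "WHERE name LIKE ? AND organism = ? ORDER BY accession",
    ("%kinase%", "Homo sapiens"),
).fetchall()
print(rows)
```

Web interfaces usually translate a search form into a query of this kind behind the scenes.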
Words and phrases
• Most searches are case insensitive
• Keywords are single words searched
• Phrases – groups of words
• E.g. tyrosine protein kinase – returns anything with any of the words “tyrosine”, “protein” or “kinase” (keywords)
• “tyrosine protein kinase” – returns anything with the complete phrase only
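The difference between keyword and phrase searching can be sketched over a small list of records. The record names are invented, and `keyword_search` and `phrase_search` are hypothetical helpers that imitate the behaviour described above, not the implementation of any particular search engine.

```python
def keyword_search(query, records):
    """Return records containing ANY of the query words (case insensitive)."""
    words = query.lower().split()
    return [r for r in records if any(w in r.lower().split() for w in words)]


def phrase_search(phrase, records):
    """Return records containing the complete phrase (case insensitive)."""
    return [r for r in records if phrase.lower() in r.lower()]


# Invented record names for illustration.
records = [
    "Tyrosine protein kinase receptor",
    "Serine/threonine protein kinase",
    "Tyrosine aminotransferase",
    "Alcohol dehydrogenase",
]

print(keyword_search("tyrosine protein kinase", records))
print(phrase_search("tyrosine protein kinase", records))
```

The keyword search returns the first three records, while the phrase search returns only the first.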
Boolean operators (George Boole)
• Operators e.g. & (AND), | (OR), ! (NOT), e.g.:
  – protein & kinase ! tyrosine
  – tyrosine & protein & kinase
• More complex: (tyrosine OR kinase) AND (NOT serine)
• Operators don’t work in “”, e.g. “tyrosine and kinase”
• Wildcards * and ? E.g. cell*ase finds all words starting with “cell” and ending in “ase”
• Attributes are used to be more specific about where to find the keyword
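The Boolean query (tyrosine OR kinase) AND (NOT serine) and the wildcard cell*ase can both be mimicked in a few lines of Python. The records and word list are invented, and the helper names are hypothetical; real search engines each have their own operator syntax.

```python
from fnmatch import fnmatchcase


def matches_boolean(text):
    """Evaluate (tyrosine OR kinase) AND (NOT serine) on one record."""
    words = set(text.lower().split())
    return ("tyrosine" in words or "kinase" in words) and "serine" not in words


def wildcard_filter(pattern, words):
    """Keep words matching a pattern where * is any run and ? is one character."""
    return [w for w in words if fnmatchcase(w.lower(), pattern)]


# Invented records and words for illustration.
records = [
    "tyrosine protein kinase",
    "serine protein kinase",
    "tyrosine phosphatase",
    "serine protease",
]
hits = [r for r in records if matches_boolean(r)]
print(hits)

enzymes = ["cellulase", "cellobiase", "cellulose", "amylase"]
print(wildcard_filter("cell*ase", enzymes))
```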
Resources for searching databases
• Sequence Retrieval System (SRS)
• EBI – search of all EBI databases
• NCBI – Entrez
• Each database usually has own web interface allowing simple queries
• E.g. EnsMart allows querying of Ensembl database
Sequence Retrieval System
http://srs.ebi.ac.uk
• Integrates over 150 different databases
• A database can be searched and results linked to other databases in SRS
• Searches can be simple or complex – can view previous queries and combine them
• Results can be viewed in default formats or user-defined views
• Users can save their results and launch software packages (>100 applications) to analyse results