Post on 13-Dec-2015

Introduction to Bioinformatics and Biological databases

Nicky Mulder: nicola.mulder@uct.ac.za

What is Bioinformatics?

• The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics. (www.informatics.jax.org/mgihome/other/glossary.shtml)

• Doing biology on a computer (computational biology)


Why is Bioinformatics needed?

• Small- and large-scale biological analyses

• New laboratory technologies, e.g. sequencing

• Move away from single gene to whole genome

• Collection and storage of biological information

• Manipulation of biological information

• Computers are capable of both, and are cheap

Hypothesis-driven bioinformatics

[Flowchart starting from a gene of interest. Steps shown: search PubMed for more information; retrieve articles; retrieve the DNA sequence; analysing DNA sequence; retrieve the protein sequence; sequence similarity search; sequence alignments, finding motifs; finding domains, classifying proteins; phylogenetics; looking at whole genomes. Topic areas labelled: biological databases, sequence analysis, phylogenetics, genomics.]

Hypothesis-generating bioinformatics

[Flowchart starting from a high-throughput experiment (microarray, proteomics, NGS). Steps shown: experimental design; data processing; gene lists; gene set enrichment; pathway analysis; data mining; data integration. Topic areas labelled: proteomics, NGS; statistics; systems biology.]

Two major components to Bioinformatics

• Storing and retrieving data:
– Biological databases
– Querying these to retrieve data

• Manipulating the data with tools, e.g.:
– Sequence similarity searches
– Protein families and function prediction
– Comparing sequences (phylogenetics)

What is a database?

• An organized body of related information (www.cogsci.princeton.edu/cgi-bin/webwn)

• Data collection that is:
– Structured (computer readable)
– Searchable
– Updatable
– Cross-linked
– Publicly available

Biological Databases

• Where do you go to find:
– A video -> YouTube
– Info on S. Hawking -> Wikipedia
– A book -> Amazon
– A friend -> Facebook
– DNA sequence -> EMBL
– Protein sequence -> UniProtKB, RefSeq…

• Biological databases:
– Order data and make it available to the public
– Turn data into computer-readable form
– Provide the ability to retrieve data from various sources

• Can be primary (archival) or secondary (curated) databases

Categories of Databases for Life Sciences

• Sequences (DNA, protein)
• Genomics
• Mutation
• Protein domain/family
• Proteomics
• 3D structure
• Metabolism
• Bibliography
• Protein interaction


Why do we need sequence DBs?

• >5000 genomes sequenced (single organisms, varying sizes, including viruses)
• Thousands of ongoing genome sequencing projects
• cDNA sequencing projects (ESTs or cDNAs)
• Metagenome sequencing projects (~200) = environmental samples: multiple ‘unknown’ organisms = microbiome
• Personal human genomes
• Cost of sequencing is coming down, making it an alternative to other technologies

Sequence databases

• Used for retrieving a known gene/protein sequence
• Useful for finding information on a gene/protein
• Can find out how many genes are available for a given organism
• Can compare your sequence to the others in the database
• Can submit your sequence to store with the rest
• Main databases: nucleotide and protein sequence DBs
• Should be interconnected with other databases

Protein centric view of database network

[Diagram of linked data types: DNA sequences, gene annotation, gene expression data, protein sequences, macromolecular structure data.]


Nucleotide sequence databases

• EMBL, DDBJ, GenBank

• Data submitted by sequence owner

• Must provide certain information and CDS if applicable

• No additional annotation added

• Entries never merged, so there is some redundancy

[Diagram of a gene showing the promoter, exons and CDS (coding sequence).]

[Figure: example EMBL entry, with labels marking the accession number, taxonomy, references, cross-references and features.]

[Figure: example EMBL entry, with labels marking the annotation (predicted or experimentally determined), the CDS (coding sequence, proposed by submitters) and the sequence.]

Feature lines in EMBL entries

• Describe features on a sequence that are important for function, replication, recombination, structure etc.

• Feature key: functional group, e.g. CDS (protein-coding sequence), ribosome binding site
• Location: how to find the feature
• Qualifier: additional info

Summary of information in EMBL entries

• Provides taxonomy from which sequence came
• Provides information on submitters and references
• Describes features on a sequence that are important for function, replication, recombination, structure etc.
• Shows if the DNA encodes a protein (CDS) and provides protein sequence
• Provides actual nucleotide sequence
• Describes sequence type, e.g. genomic DNA, RNA, EST
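
To make the CDS point concrete, here is a minimal Python sketch of how a protein sequence can be derived from a coding sequence. It assumes the standard genetic code (NCBI translation table 1). The function name `translate` and the toy sequence are illustrative and do not come from a real EMBL entry.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a coding sequence into one-letter amino acids.

    Translation stops at the first stop codon; codons containing
    ambiguous bases (e.g. N) are reported as 'X'.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("atggccaagtaa"))  # toy CDS, prints MAK
```

Organisms and organelles that use a different translation table would need a different `AMINO_ACIDS` string, which is why taxonomy entries record the translation table.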

CDS: mRNA versus genomic sequence

[Figure: alignment of an mRNA-derived sequence (CONTIG) against the corresponding genomic sequence (Genomic). The contig matches the genomic sequence in blocks labelled "exon"; the long gaps in the contig between these blocks are labelled "intron".]
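
The relationship between genomic sequence, exons and CDS can be sketched in code. The example below assumes a feature location written in the `join(start..end,...)` style used in EMBL/GenBank feature tables, with 1-based inclusive coordinates on the forward strand. The genomic sequence and coordinates are made up for illustration.

```python
import re


def parse_join(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs from a location such as 'join(3..8,15..20)'.

    Only simple forward-strand ranges are handled; complement(), '<', '>'
    and remote references are outside the scope of this sketch.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def splice(genomic: str, location: str) -> str:
    """Concatenate the exon segments named by the location (introns are dropped)."""
    return "".join(genomic[start - 1:end] for start, end in parse_join(location))


# Toy example: upper case marks exons, lower case marks an intron and flanks.
genomic = "ccATGGCCgtaagtAAGTAAtt"
print(splice(genomic, "join(3..8,15..20)"))  # prints ATGGCCAAGTAA
```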


Other nucleotide databases

• RefSeq

• dbEST

• Short read archive

• Trace archive

• WGS collections

Protein sequences

[Diagram: DNA -> RNA -> Protein. The protein may then undergo cleavage and modification (labels: SS, Ac), be transported to an organelle or membrane, be folded into secondary or tertiary structure, and perform a specific function.]

All this info needs to be captured in a database

Protein Sequence Databases

• UniProt:
– Swiss-Prot: manually curated, distinguishes between experimental and computationally derived annotation
– TrEMBL: automatic translation of EMBL, no manual curation, some automatic annotation
• GenPept: GenBank translations
• RefSeq: non-redundant sequences for certain organisms

A UniProtKB/Swiss-Prot entry

Protein existence levels:
1: Evidence at protein level
2: Evidence at transcript level
3: Inferred from homology
4: Predicted
5: Uncertain


Swiss-Prot annotation mainly found in:

• Comment (CC) lines – Function, pathway, cofactor, regulation, disease, subcellular location

• Feature table (FT) – features on the sequence, e.g. domain, active site

• Keyword (KW) lines – Set of a few hundred controlled vocabulary terms

• Description (DE) lines – Protein name/function
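
As a rough illustration of how these line codes make an entry machine-readable, the Python sketch below groups the lines of a flat-file entry by their two-letter code. The entry text is an invented fragment written in the style of a Swiss-Prot record, not a real database entry. Real entries need more careful handling, for example of multi-line comments and feature tables.

```python
from collections import defaultdict


def group_by_line_code(entry: str) -> dict[str, list[str]]:
    """Group flat-file lines by their two-letter line code (DE, CC, FT, KW, ...)."""
    groups = defaultdict(list)
    for line in entry.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue  # skip blank lines and the end-of-entry marker
        code, content = line[:2], line[2:].strip()
        groups[code].append(content)
    return dict(groups)


# Invented example fragment (hypothetical protein, illustrative values only).
entry = """\
DE   RecName: Full=Example kinase;
CC   -!- FUNCTION: Phosphorylates example substrates.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm.
KW   Kinase; Transferase.
FT   DOMAIN          10..250
//
"""

fields = group_by_line_code(entry)
print(fields["KW"])       # ['Kinase; Transferase.']
print(len(fields["CC"]))  # 2
```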

Other parts to UniProt

• UniParc: archive of all sequences

• UniProt: Swiss-Prot + TrEMBL

• UniProt NREF100 (100% seqs merged)

• UniProt NREF90 (90% seqs merged)

• UniProt NREF50 (50% seqs merged)

• UniMES: metagenomic sequences

Submitting sequences to EMBL or UniProt

• WEB-IN: web-based submission tool for submitting DNA sequences to the EMBL database.

• Protein sequences are submitted when the peptides have been directly sequenced. Submit through SPIN.

Sequence formats

• Not MSWord, but text!
• Most include an ID/name/annotation of some sort
• FASTA, e.g.:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg

• Others specific to programs, e.g. GCG, abi, clustal, etc.
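
Because FASTA is plain text, it is easy to read with a few lines of code. The following is a minimal Python sketch of a parser. The function name `parse_fasta` is illustrative, and the example record is the first part of the sequence shown on this slide, wrapped over two lines. Established libraries handle many more edge cases.

```python
def parse_fasta(text: str) -> dict[str, tuple[str, str]]:
    """Parse FASTA text into {identifier: (description, sequence)}."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: first word is the ID, the rest is free-text comment.
            header = line[1:].split(maxsplit=1) or [""]
            identifier = header[0]
            description = header[1] if len(header) > 1 else ""
            records[identifier] = (description, "")
        elif identifier is not None:
            # Sequence lines are concatenated until the next header.
            description, sequence = records[identifier]
            records[identifier] = (description, sequence + line)
    return records


example = """\
>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgcc
gttcagtcgccaatatgcgctctttgtccgcgcccaggagct
"""

for identifier, (description, sequence) in parse_fasta(example).items():
    print(identifier, "|", description, "|", sequence[:10] + "...")
```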

Accession numbers

• GenBank/EMBL/DDBJ: 1 letter & 5 digits, e.g. U12345, or 2 letters & 6 digits, e.g. AY123456

• GenPept sequence records: 3 letters & 5 digits, e.g. AAA12345

• UniProt: always 6 characters: [A,B,O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g. P12345 and Q9JJS7
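
These formats translate directly into regular expressions. The Python sketch below encodes only the patterns listed on this slide, assuming the one-letter nucleotide format is one letter followed by five digits, as in the U12345 example. The databases have since added further formats that are not covered here, and the names `ACCESSION_PATTERNS` and `classify_accession` are illustrative.

```python
import re

# Patterns follow the formats listed on the slide; newer accession
# formats introduced by the databases are not covered by this sketch.
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}"),
    "GenPept": re.compile(r"[A-Z]{3}[0-9]{5}"),
    "UniProt": re.compile(r"[ABOPQ][0-9][A-Z0-9]{3}[0-9]"),
}


def classify_accession(accession: str) -> list[str]:
    """Return every database whose accession format the string matches."""
    accession = accession.strip().upper()
    return [name for name, pattern in ACCESSION_PATTERNS.items()
            if pattern.fullmatch(accession)]


for acc in ["U12345", "AY123456", "AAA12345", "Q9JJS7", "P12345"]:
    print(acc, classify_accession(acc))
```

Note that P12345 matches both the one-letter nucleotide format and the UniProt format, so the format alone cannot always tell you which database an identifier belongs to.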


Cross-referencing identifiers

• So many different IDs for same thing, e.g. Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID, etc.

• Need mapping files to move between them to avoid having to parse every entry

• PICR (http://www.ebi.ac.uk/Tools/picr/) enables mapping between IDs

• UniProt website mapper (www.uniprot.org)

Taxonomy Databases

• Most used is NCBI’s taxonomy database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
• Provides entries for all known organisms
• Provides taxonomic lineage and translation table for organisms
• Sequence entries for organism
• UniProt-specific taxonomy database is Newt: http://www.ebi.ac.uk/newt


Example taxonomy entry

Literature database: PubMed/Medline

• Source of medical-related & scientific literature
• PubMed has articles published after 1965
• Can search by many different means, e.g. author, title, date, journal etc., or keywords for each
• PubMed has list of tags to search specific fields, e.g. [AU], [TI], [DP] etc.
• Can save queries and results
• Can usually retrieve abstracts and full papers

Types of search fields

• Title Words [TI]
• MeSH Terms [MH]
• Title/Abstract Words [TIAB]
• Language [LA]
• Text Words [TW]
• Journal Title [TA]
• Substance Name [NM]
• Issue [IP]
• Subset [SB]
• Filter [FILTER]
• Secondary Source ID [SI]
• Entrez Date [EDAT]
• Subheadings [SH]
• EC/RN Number [RN]
• Publication Type [PT]
• Author Name [AU]
• Publication Date [DP]
• All Fields [ALL]
• Personal Name as Subject [PS]
• Affiliation [AD]
• Page Number [PG]
• Unique Identifiers [UID]
• MeSH Major Topic [MAJR]
• MeSH Date [MHDA]
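
Field tags are simply appended to a search value, so a fielded query can be assembled as a string. The Python sketch below only builds the query text and does not contact PubMed. The helper names `pubmed_term` and `combine` and the search values are illustrative.

```python
def pubmed_term(value: str, tag: str) -> str:
    """Attach a PubMed field tag such as AU, TI or DP to a search value."""
    return f"{value.strip()}[{tag.upper()}]"


def combine(terms: list[str], operator: str = "AND") -> str:
    """Join tagged terms with a Boolean operator."""
    return f" {operator} ".join(terms)


query = combine([
    pubmed_term("Mulder N", "au"),        # author
    pubmed_term("bioinformatics", "ti"),  # word in title
    pubmed_term("2010", "dp"),            # publication date
])
print(query)  # Mulder N[AU] AND bioinformatics[TI] AND 2010[DP]
```

The resulting string can be pasted into the PubMed search box.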


Querying biological databases

• Databases hold a wealth of information

• Data is held in specific formats and controlled vocabularies, which makes searching easy

• Can retrieve and save data you need

• In many cases you can retrieve data from multiple sources

How to query databases

• Query languages e.g. SQL
• Can query with single word or phrase
• Boolean queries
• Regular expressions
• Basic database querying is usually done through web interface
– Text or sequence-based searches
– Can use Boolean queries and regular expressions
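
For a feel of what an SQL query looks like, here is a small Python sketch using the standard-library `sqlite3` module and an in-memory database. The table layout, accessions and protein names are invented for illustration and do not reflect the schema of any real biological database.

```python
import sqlite3

# Build a tiny in-memory table of made-up protein records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, organism TEXT)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [
        ("X00001", "tyrosine protein kinase", "Homo sapiens"),
        ("X00002", "serine protease", "Mus musculus"),
        ("X00003", "cellulase", "Trichoderma reesei"),
    ],
)

# Text search combined with a Boolean AND on a second field.
rows = conn.execute(
    "SELECT accession, name FROM protein "
    "WHERE name LIKE ? AND organism = ? ORDER BY accession",
    ("%kinase%", "Homo sapiens"),
).fetchall()
print(rows)  # [('X00001', 'tyrosine protein kinase')]
conn.close()
```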

Words and phrases

• Most searches are case insensitive
• Keywords are single words searched
• Phrases: groups of words
• E.g. tyrosine protein kinase returns anything with any of the words “tyrosine”, “protein” or “kinase” (keywords)
• “tyrosine protein kinase” returns anything with the complete phrase only
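
The difference between keyword and phrase searching can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not the implementation of any particular search engine, and the document titles are made up.

```python
def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Return documents containing at least one of the query words."""
    words = query.lower().split()
    return [d for d in documents if any(w in d.lower().split() for w in words)]


def phrase_search(phrase: str, documents: list[str]) -> list[str]:
    """Return documents containing the complete phrase (case insensitive)."""
    return [d for d in documents if phrase.lower() in d.lower()]


docs = [
    "Tyrosine protein kinase receptor",
    "Serine/threonine protein phosphatase",
    "Histidine kinase",
    "Cellulase",
]
print(len(keyword_search("tyrosine protein kinase", docs)))  # 3
print(phrase_search("tyrosine protein kinase", docs))
# ['Tyrosine protein kinase receptor']
```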

Boolean operators (George Boole)

• Operators e.g. & (AND), | (OR), ! (NOT), e.g.:
– protein & kinase ! tyrosine
– tyrosine & protein & kinase
• More complex: (tyrosine OR kinase) AND (NOT serine)
• Operators don’t work in “”, e.g. “tyrosine and kinase”
• Wildcards * and ?, e.g. cell*ase finds all words starting with “cell” and ending in “ase”
• Attributes are used to be more specific about where to find the keyword
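
The example queries above can be mimicked with Python sets and the standard-library `fnmatch` module. This is only a sketch of the logic, with invented entries. Real search engines parse the operator syntax themselves.

```python
from fnmatch import fnmatchcase


def words(text: str) -> set[str]:
    """Lower-cased set of words in a text."""
    return set(text.lower().split())


def wildcard_match(pattern: str, text: str) -> bool:
    """True if any word matches the pattern (* = any run, ? = one character)."""
    return any(fnmatchcase(w, pattern.lower()) for w in words(text))


docs = [
    "tyrosine protein kinase",
    "serine protein kinase",
    "cellulase precursor",
    "cellobiase",
    "cell wall protein",
]

# protein & kinase ! tyrosine
q1 = [d for d in docs
      if {"protein", "kinase"} <= words(d) and "tyrosine" not in words(d)]
print(q1)  # ['serine protein kinase']

# (tyrosine OR kinase) AND (NOT serine)
q2 = [d for d in docs
      if ({"tyrosine", "kinase"} & words(d)) and "serine" not in words(d)]
print(q2)  # ['tyrosine protein kinase']

# cell*ase
q3 = [d for d in docs if wildcard_match("cell*ase", d)]
print(q3)  # ['cellulase precursor', 'cellobiase']
```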

Resources for searching databases

• Sequence Retrieval System (SRS)

• EBI: search across all EBI databases

• NCBI: Entrez

• Each database usually has its own web interface allowing simple queries

• E.g. EnsMart allows querying of the Ensembl database

Sequence Retrieval System
http://srs.ebi.ac.uk

• Integrates over 150 different databases
• A database can be searched and results linked to other databases in SRS
• Searches can be simple or complex; previous queries can be viewed and combined
• Results can be viewed in default formats or user-defined views
• Users can save their results and launch software packages (>100 applications) to analyse results