introduction to bioinformatics and biological databases nicky mulder: nicola.mulder@uct.ac.za

Post on 13-Dec-2015


Introduction to Bioinformatics and Biological databases

Nicky Mulder: nicola.mulder@uct.ac.za

What is Bioinformatics?

• The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics. (www.informatics.jax.org/mgihome/other/glossary.shtml)

• Doing biology on a computer (computational biology)

Why is Bioinformatics needed?

• Small- and large-scale biological analyses

• New laboratory technologies, e.g. sequencing

• Move away from single gene to whole genome

• Collection and storage of biological information

• Manipulation of biological information

• Computers are capable of both storage and manipulation, and are cheap

Hypothesis-driven bioinformatics

[Flowchart; labels as they appear on the slide]

• Gene of interest
• Search PubMed for more information
• Retrieve articles
• Retrieve the DNA sequence
• Analysing DNA sequence
• Looking at whole genomes
• Retrieve the protein sequence
• Sequence similarity search
• Sequence alignments, finding motifs
• Finding domains, classifying proteins
• Phylogenetics
• Genomics
• Sequence analysis
• Biological databases

Hypothesis-generating bioinformatics

[Flowchart; labels as they appear on the slide]

• High-throughput experiment (microarray, proteomics, NGS)
• Experimental design
• Statistics
• Data processing
• Gene lists
• Pathway analysis
• Data mining
• Gene set enrichment
• Data integration
• Systems biology

Two major components to Bioinformatics

• Storing and retrieving data:
– Biological databases
– Querying these to retrieve data

• Manipulating the data – tools, e.g.:
– Sequence similarity searches
– Protein families and function prediction
– Comparing sequences – phylogenetics

What is a database?

• An organized body of related information (www.cogsci.princeton.edu/cgi-bin/webwn)

• Data collection that is:
– Structured (computer readable)
– Searchable
– Updatable
– Cross-linked
– Publicly available

Biological Databases

• Where do you go to find:
– A video -> YouTube
– Info on S. Hawking -> Wikipedia
– A book -> Amazon
– A friend -> Facebook
– DNA sequence -> EMBL
– Protein sequence -> UniProtKB, RefSeq…

• Biological databases:
– Order and make data available to public
– Turn data into computer-readable form
– Provide ability to retrieve data from various sources

• Can be primary (archival) or secondary (curated) databases

Categories of Databases for Life Sciences

• Sequences (DNA, protein)
• Genomics
• Mutation
• Protein domain/family
• Proteomics
• 3D structure
• Metabolism
• Bibliography
• Protein interaction

Why do we need sequence DBs?

• >5000 genomes sequenced (single organism, varying sizes, including viruses)
• Thousands of ongoing genome sequencing projects
• cDNA sequencing projects (ESTs or cDNAs)
• Metagenome sequencing projects (~200) = environmental samples: multiple ‘unknown’ organisms = microbiome
• Personal human genomes
• Cost of sequencing is coming down – alternative to other technologies

Sequence databases

• Used for retrieving a known gene/protein sequence
• Useful for finding information on a gene/protein
• Can find out how many genes are available for a given organism
• Can compare your sequence to the others in the database
• Can submit your sequence to store with the rest
• Main databases: nucleotide and protein sequence DBs
• Should be interconnected with other databases

Protein-centric view of database network

[Diagram; labels as they appear on the slide]

• DNA sequences
• Gene annotation
• Gene expression data
• Protein sequences
• Macromolecular structure data

Nucleotide sequence databases

• EMBL, DDBJ, GenBank

• Data submitted by sequence owner

• Must provide certain information and CDS if applicable

• No additional annotation added

• Entries never merged – some redundancy

[Annotated EMBL entry; labels as they appear on the slide]

• Accession number
• Taxonomy
• References
• Cross-references
• Features: promoter, exons, CDS (coding sequence)
• Annotation (predicted or experimentally determined)
• CDS = CoDing Sequence (proposed by submitters)
• Sequence

Feature lines in EMBL entries

• Describe features on a sequence that are important for function, replication, recombination, structure etc.

• Feature key: the functional group, e.g. CDS (protein-coding sequence), ribosome binding site
• Location: how to find the feature
• Qualifier: additional info
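The three parts of a feature line (key, location, qualifier) can be pulled apart with a few lines of Python. This is only a minimal sketch: the `FT_LINES` sample is made up for illustration, the function name `parse_features` is hypothetical, and the parser assumes each location and each qualifier value fits on a single line, which real entries do not guarantee.

```python
# Made-up feature-table lines in the style of an EMBL flat file.
FT_LINES = """\
FT   source          1..300
FT                   /organism="Example organism"
FT   CDS             join(10..90,150..240)
FT                   /gene="exampleA"
FT                   /product="hypothetical protein"
"""


def parse_features(text):
    """Split FT lines into feature key, location and qualifiers."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue  # not a feature-table line
        body = line[2:].strip()
        if not body:
            continue
        if body.startswith("/"):
            # Qualifier line: attach it to the most recent feature.
            if features:
                name, _, value = body[1:].partition("=")
                features[-1]["qualifiers"][name] = value.strip('"')
        else:
            # New feature: first token is the key, the rest is the location.
            key, _, location = body.partition(" ")
            features.append(
                {"key": key, "location": location.strip(), "qualifiers": {}}
            )
    return features


for feature in parse_features(FT_LINES):
    print(feature["key"], feature["location"], feature["qualifiers"])
```

For real work an established parser is a safer choice than hand-rolled string handling; the sketch is only meant to show how the three parts relate.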

Summary of information in EMBL entries

• Provides taxonomy from which sequence came
• Provides information on submitters and references
• Describes features on a sequence that are important for function, replication, recombination, structure etc.
• Shows if the DNA encodes a protein (CDS) and provides protein sequence
• Provides actual nucleotide sequence
• Describes sequence type, e.g. genomic DNA, RNA, EST

CDS: mRNA versus genomic sequence

[Alignment of an mRNA-derived contig (CONTIG) against the genomic sequence (Genomic); "-" marks a gap. The match lines of the original figure could not be recovered and are omitted.]

CONTIG  --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG
Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG

CONTIG  CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG-------------------------------------------------------
Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA

CONTIG  ------------------------------------------------------------------------------------------------------------------------
Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT

CONTIG  ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC
Genomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC

CONTIG  TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC
Genomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC

CONTIG  CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA
Genomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA

CONTIG  TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG-----------------------------------------------------------------------------
Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT

CONTIG  -------------------------------------------------------------------------------------------------------------------GNAAA
Genomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA

CONTIG  GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGATAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC
Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGATAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC

CONTIG  C-----------------------------------------------------------------------------------------------------------------------
Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA

[Slide labels: the blocks where CONTIG aligns to Genomic are marked "exon"; the genomic stretches opposite long gaps in CONTIG are marked "intron".]
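The way exons and introns are read off the alignment above can be sketched in Python: runs of aligned bases in the mRNA row are treated as exons, and gap runs lying between two such runs as introns. The function name `exon_intron_runs` and the toy input are illustrative assumptions, coordinates are alignment columns rather than genomic positions, and a short internal gap may just as well be an indel or sequencing error as an intron.

```python
from itertools import groupby


def exon_intron_runs(mrna_row):
    """Label runs in the gapped mRNA row of an mRNA-vs-genomic alignment.

    Returns (label, start, end) tuples with 0-based, half-open column ranges.
    """
    runs = []
    position = 0
    for is_gap, group in groupby(mrna_row, key=lambda char: char == "-"):
        length = len(list(group))
        runs.append((is_gap, position, position + length))
        position += length

    # Gaps before the first or after the last aligned block are flanking
    # genomic sequence, not introns, so drop them.
    while runs and runs[0][0]:
        runs.pop(0)
    while runs and runs[-1][0]:
        runs.pop()

    return [("intron" if is_gap else "exon", start, end)
            for is_gap, start, end in runs]


# Toy mRNA row: two aligned blocks separated by a four-column gap.
toy_row = "--ACGT----GGA-"
for label, start, end in exon_intron_runs(toy_row):
    print(label, start, end)
```

Applied to the concatenated CONTIG rows of a real alignment, the same function would also report single-column gaps as "introns", so a minimum-length filter would be a sensible next step.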

Other nucleotide databases

• RefSeq

• dbEST

• Short read archive

• Trace archive

• WGS collections

Protein sequences

[Diagram; labels as they appear on the slide: DNA, RNA, Protein, SS, Ac]

• Protein cleavage
• Protein modification
• Transported to organelle or membrane
• Folded into secondary or tertiary structure
• Performs a specific function

All this info needs to be captured in a database

Protein Sequence Databases

• UniProt:
– Swiss-Prot – manually curated, distinguishes between experimental and computationally derived annotation
– TrEMBL – automatic translation of EMBL, no manual curation, some automatic annotation

• GenPept – GenBank translations
• RefSeq – non-redundant sequences for certain organisms

A UniProtKB/Swiss-Prot entry

Protein existence levels:
1: Evidence at protein level
2: Evidence at transcript level
3: Inferred from homology
4: Predicted
5: Uncertain

Swiss-Prot annotation mainly found in:

• Comment (CC) lines – function, pathway, cofactor, regulation, disease, subcellular location

• Feature table (FT) – features on the sequence, e.g. domain, active site

• Keyword (KW) lines – set of a few hundred controlled vocabulary terms

• Description (DE) lines – protein name/function
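Because each line of a Swiss-Prot flat-file entry starts with a two-letter code, the annotation types listed above can be separated by grouping lines on that code. The sketch below assumes that layout; the `ENTRY` text is invented for illustration and the function names are hypothetical.

```python
from collections import defaultdict

# Invented entry fragment in the style of a Swiss-Prot flat file.
ENTRY = """\
ID   EXAMPLE_HUMAN           Reviewed;         120 AA.
DE   RecName: Full=Example protein;
CC   -!- FUNCTION: Hypothetical function text.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm.
KW   Kinase; Transferase.
FT   DOMAIN          10..90
"""


def group_by_line_code(text):
    """Map each two-letter line code (DE, CC, KW, FT, ...) to its lines."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue
        code, content = line[:2], line[5:].rstrip()
        grouped[code].append(content)
    return dict(grouped)


def comment_topics(grouped):
    """List the topics of the CC blocks, e.g. FUNCTION."""
    topics = []
    for content in grouped.get("CC", []):
        if content.startswith("-!- "):
            topics.append(content[4:].split(":", 1)[0])
    return topics


grouped = group_by_line_code(ENTRY)
print(comment_topics(grouped))
print(grouped["KW"])
```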

Other parts to UniProt

• UniParc – archive of all sequences

• UniProt – Swiss-Prot + TrEMBL

• UniProt NREF100 (sequences with 100% identity merged)

• UniProt NREF90 (sequences with at least 90% identity merged)

• UniProt NREF50 (sequences with at least 50% identity merged)

• UniMES – metagenomic sequences

Submitting sequences to EMBL or UniProt

WEB-IN – web-based tool for submitting DNA sequences to the EMBL database.

Protein sequences are submitted to UniProt when the peptides have been directly sequenced. Submit through SPIN.

Sequence formats

• Not MS Word, but text!
• Most include an ID/name/annotation of some sort
• FASTA, e.g.

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg

• Others specific to programs, e.g. GCG, abi, clustal, etc.
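Reading FASTA needs very little code, since a record is just a ">" header line followed by sequence lines. The following is a minimal sketch; `parse_fasta` is a hypothetical helper, and the example reuses the start of the sequence shown above, wrapped over two lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>xyz some other comment
ttcctctttctcgactccatcttcgcgg
tagctgggaccgccgttcagtcgcc
"""

for header, sequence in parse_fasta(example):
    identifier = header.split()[0]  # first word of the header is the ID
    print(identifier, len(sequence))
```

Treating the first word of the header as the identifier and the rest as free-text comment is a common convention, not a rule enforced by the format.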

Accession numbers

• GenBank/EMBL/DDBJ: 1 letter & 5 digits, e.g. U12345, or 2 letters & 6 digits, e.g. AY123456

• GenPept sequence records: 3 letters & 5 digits, e.g. AAA12345

• UniProt: always 6 characters, [A,B,O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g. P12345 and Q9JJS7
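The accession formats above translate directly into regular expressions. The sketch below encodes only the patterns as stated on this slide; the names `PATTERNS` and `classify_accession` are illustrative, and the databases have since introduced additional formats that these patterns do not cover. Note that the formats overlap, so the function returns every type that fits.

```python
import re

# Patterns transcribed from the slide; newer accession formats are not covered.
PATTERNS = {
    "nucleotide": re.compile(r"^(?:[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6})$"),
    "genpept": re.compile(r"^[A-Z]{3}[0-9]{5}$"),
    "uniprot": re.compile(r"^[ABOPQ][0-9][A-Z0-9]{3}[0-9]$"),
}


def classify_accession(accession):
    """Return the names of all patterns the accession matches."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.match(accession)]


for accession in ["U12345", "AY123456", "AAA12345", "P12345", "Q9JJS7"]:
    print(accession, classify_accession(accession))
```

An accession such as P12345 fits both the one-letter nucleotide pattern and the UniProt pattern, which is one reason the database name is usually stored alongside the identifier.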

Cross-referencing identifiers

• So many different IDs for same thing, e.g. Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID, etc.

• Need mapping files to move between them to avoid having to parse every entry

• PICR (http://www.ebi.ac.uk/Tools/picr/) enables mapping between IDs

• UniProt website mapper (www.uniprot.org)
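A mapping file is typically a table with one column per identifier type, so moving between IDs becomes a dictionary lookup. This is a minimal sketch under that assumption: the column names and all identifiers in `MAPPING_TSV` are placeholders, not real accessions, and the helper names are hypothetical.

```python
import csv
import io

# Placeholder mapping table (tab-separated); none of these IDs are real.
MAPPING_TSV = (
    "uniprot\tensembl\thgnc\n"
    "ACC_A\tENS_EXAMPLE_1\tGENE1\n"
    "ACC_B\tENS_EXAMPLE_2\tGENE2\n"
)


def build_index(tsv_text, key_column):
    """Index the mapping rows by the chosen identifier column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[key_column]: row for row in reader}


def map_id(index, identifier, target_column):
    """Translate an identifier to another ID type, or None if unmapped."""
    row = index.get(identifier)
    return row[target_column] if row else None


index = build_index(MAPPING_TSV, "uniprot")
print(map_id(index, "ACC_B", "hgnc"))
print(map_id(index, "UNKNOWN", "hgnc"))
```

Building the index once and reusing it avoids parsing every database entry for each lookup, which is the point made in the bullet above. Real mappings are often one-to-many, which a plain dictionary like this would silently collapse.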

Taxonomy Databases

• Most used is NCBI’s taxonomy database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
• Provides entries for all known organisms
• Provides taxonomic lineage and translation table for organisms
• Sequence entries for organism
• UniProt-specific taxonomy database is Newt: http://www.ebi.ac.uk/newt

Example taxonomy entry

Literature database: PubMed/Medline

• Source of medical and scientific literature
• PubMed has articles published after 1965
• Can search by many different means, e.g. author, title, date, journal etc., or keywords for each
• PubMed has a list of tags to search specific fields, e.g. [AU], [TI], [DP] etc.
• Can save queries and results
• Can usually retrieve abstracts and full papers

Types of search fields

• Title Words [TI]
• Title/Abstract Words [TIAB]
• Text Words [TW]
• Substance Name [NM]
• Subset [SB]
• Secondary Source ID [SI]
• Subheadings [SH]
• Publication Type [PT]
• Publication Date [DP]
• Personal Name as Subject [PS]
• Page Number [PG]
• MeSH Terms [MH]
• MeSH Major Topic [MAJR]
• MeSH Date [MHDA]
• Language [LA]
• Journal Title [TA]
• Issue [IP]
• Filter [FILTER]
• Entrez Date [EDAT]
• EC/RN Number [RN]
• Author Name [AU]
• All Fields [ALL]
• Affiliation [AD]
• Unique Identifiers [UID]
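Field tags are appended to a search term in square brackets, and tagged terms can be combined with Boolean operators. The sketch below builds such a query string; the helper names and the example search terms are made up. The final step, which wraps the query in a URL for NCBI's E-utilities `esearch` service, is an addition not covered in these slides, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def tagged(term, tag):
    """Attach a PubMed field tag, quoting multi-word terms."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{tag}]"


def build_query(clauses, operator="AND"):
    """Join (term, tag) pairs with a Boolean operator."""
    return f" {operator} ".join(tagged(term, tag) for term, tag in clauses)


query = build_query([("Smith J", "AU"), ("bioinformatics", "TI"), ("2010", "DP")])
print(query)

# Assumption: the E-utilities esearch endpoint, shown here only as a URL.
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
       + urlencode({"db": "pubmed", "term": query}))
print(url)
```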

Querying biological databases

• Databases hold a wealth of information

• Data are held in specific formats with controlled vocabulary, which makes searching easy

• Can retrieve and save data you need

• In many cases you can retrieve data from multiple sources

How to query databases

• Query languages, e.g. SQL
• Can query with single word or phrase
• Boolean queries
• Regular expressions
• Basic database querying is usually done through a web interface
– Text or sequence-based searches
– Can use Boolean queries and regular expressions
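To give a feel for what a query language looks like behind a web interface, here is a small SQL example using Python's built-in `sqlite3` module. The `protein` table, its columns and its rows are invented for illustration and do not reflect the schema of any real biological database.

```python
import sqlite3

# In-memory database with a made-up protein table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    "accession TEXT PRIMARY KEY, name TEXT, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("ACC_A", "tyrosine protein kinase example", "Homo sapiens", 500),
        ("ACC_B", "serine protease example", "Mus musculus", 250),
        ("ACC_C", "protein kinase example", "Homo sapiens", 320),
    ],
)

# Structured query: human proteins whose name contains "kinase".
rows = conn.execute(
    "SELECT accession, name FROM protein "
    "WHERE name LIKE ? AND organism = ? ORDER BY accession",
    ("%kinase%", "Homo sapiens"),
).fetchall()

for accession, name in rows:
    print(accession, name)
```

Passing the search values as parameters (the `?` placeholders) rather than pasting them into the SQL string keeps the query safe when the values come from user input.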

Words and phrases

• Most searches are case insensitive
• Keywords are single words searched
• Phrases – groups of words
• E.g. tyrosine protein kinase – returns anything with any of the words “tyrosine”, “protein” or “kinase” (keywords)
• “tyrosine protein kinase” – returns anything with the complete phrase only
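The difference between a keyword search and a phrase search can be shown with two short functions. This is a simplified sketch of the behaviour described above, not how any particular database implements it; the records and function names are made up.

```python
RECORDS = [
    "Tyrosine protein kinase receptor",
    "Serine protein kinase",
    "Tyrosine hydroxylase",
    "Cellulase",
]


def keyword_search(records, query):
    """Return records containing ANY of the query words (case insensitive)."""
    wanted = set(query.lower().split())
    return [r for r in records if wanted & set(r.lower().split())]


def phrase_search(records, phrase):
    """Return records containing the complete phrase (case insensitive)."""
    return [r for r in records if phrase.lower() in r.lower()]


print(keyword_search(RECORDS, "tyrosine protein kinase"))
print(phrase_search(RECORDS, "tyrosine protein kinase"))
```

The keyword search returns three records, while the phrase search returns only the one that contains the words together and in order.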

Boolean operators (George Boole)

• Operators, e.g. & (AND), | (OR), ! (NOT), e.g.:
– protein & kinase ! tyrosine
– tyrosine & protein & kinase

• More complex: (tyrosine OR kinase) AND (NOT serine)
• Operators don’t work in “”, e.g. “tyrosine and kinase”
• Wildcards * and ?, e.g. cell*ase finds all words starting with “cell” and ending in “ase”
• Attributes are used to be more specific about where to find the keyword
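The Boolean operators and wildcards above can be imitated in a few lines of Python, using the standard `fnmatch` module for `*` and `?`. This is a toy matcher with made-up records; the function name `matches` and its `all_of`/`any_of`/`none_of` parameters are illustrative stand-ins for AND, OR and NOT, and real search engines parse query strings rather than take arguments like these.

```python
import fnmatch

RECORDS = [
    "tyrosine protein kinase",
    "serine protein kinase",
    "cellulase precursor",
    "cellobiase",
]


def matches(text, all_of=(), any_of=(), none_of=()):
    """Test a record against AND (all_of), OR (any_of) and NOT (none_of) terms.

    Each term may contain the wildcards * and ?; matching is per word and
    case insensitive.
    """
    words = text.lower().split()

    def hit(pattern):
        return any(fnmatch.fnmatchcase(word, pattern.lower()) for word in words)

    return (all(hit(p) for p in all_of)
            and (not any_of or any(hit(p) for p in any_of))
            and not any(hit(p) for p in none_of))


# protein & kinase ! tyrosine
print([r for r in RECORDS
       if matches(r, all_of=("protein", "kinase"), none_of=("tyrosine",))])

# (tyrosine OR kinase) AND (NOT serine)
print([r for r in RECORDS
       if matches(r, any_of=("tyrosine", "kinase"), none_of=("serine",))])

# Wildcard: cell*ase
print([r for r in RECORDS if matches(r, all_of=("cell*ase",))])
```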

Resources for searching databases

• Sequence Retrieval System (SRS)

• EBI – search across all EBI databases

• NCBI – Entrez

• Each database usually has own web interface allowing simple queries

• E.g. EnsMart allows querying of Ensembl database

Sequence Retrieval System
http://srs.ebi.ac.uk

• Integrates over 150 different databases
• A database can be searched and results linked to other databases in SRS
• Searches can be simple or complex – can view previous queries and combine them
• Results can be viewed in default formats or user-defined views
• Users can save their results and launch software packages (>100 applications) to analyse results
