bioinformatics - databases - bioplexity · data sources for bioinformatics main types of biological...

32
Bioinformatics - Lecture 09 Bioinformatics Biological databases Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Upload: duongcong

Post on 04-May-2018

223 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Bioinformatics - Lecture 09

BioinformaticsBiological databases

Martin Saturka

http://www.bioplexity.org/lectures/

EBI version 0.4

Creative Commons Attribution-Share Alike 2.5 License

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 2: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Data sources for bioinformatics

Main types of biological databases with utilization tools.communication with databases. database usage.genomes, structure families, expression maps.

Main topicsdatabase technics- file types, sql, biodas- protocols, bioperlparticular databases- sequences, structures- expression, ontologydigest approaches- constraint programming- filters, data structures

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 3: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

File types

sequencesFasta - multiple sequencesone sequence: first line - header >..., next lines - per 70 ntGenBank header: gi|gi-number|gb|accession| locusGFF - sequence featuresseqname source feature start end score strand frame

GenBank flat files or ASN.1flat files: multiline description, 6x10 nt per lineASN.1: structured description {...{...}}, optionally packed

structuresPDB - line/column-wise dataline start: line decription - comment / atomatom rank role residuum chain res-rank coordinates

expressiontables: lines - genes, columns - tissues / experimentsMAGE: http://www.mged.org/Workgroups/MAGE/mage.html

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 4: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Annotation access

annotation systems

http://www.open-bio.org/wiki/Projects

BioDAS - distributed annotation systemhttp://biodas.org/data accesshttp://example.org/das/organism/features?segment=CHR I:1,500

site prefix das data command arguments

system compositiona reference sequence server plus annotation servers

other projectsMOBY - interoperability for biological data server servicesOBDA - sequence access standardizationmyGrid - grid and middleware for bioinformatics

http://www.mygrid.org.uk/myExperiment - myGrid spin-off

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 5: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

DBMS

SQL

tabular data storagemain open-source database systemssmall: sqlite, large: postgresql, firebirdsqlaccess: sql - structured query languagesuitable for data structured into regular tablesdifferent approaches

linear data (sequences): flat filesdeeply structured data: ASN.1, HDF, NetCDF

> sqlite3

CREATE TABLE genes (gi INTEGER, chr CHAR(3), ori CHAR(1));

INSERT INTO genes VALUES (826, ’19’, ’+’);

INSERT INTO genes VALUES (827, ’X’, ’-’);

SELECT * FROM genes;

.quit

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 6: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Genomes

where to get data on sequenced chromosomes

gene specific: gene id sequence specefic: accession

main genome database sitesNCBI - National center for biotechnology information

http://www.ncbi.nlm.nih.gov/Entrez/EMBL - European bioinformatics institute

http://www.ebi.ac.uk/embl/DDBJ - DNA databank of Japan

http://www.ddbj.nig.ac.jp/

NCBI ftp://ftp.ncbi.nlm.nih.gov/directory /genomes/H sapiens/

assembled reference sequences: Assembled chromosomesfile /gene/DATA/gene2refseq.gz

gene IDs with positions along chromosomes

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 7: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

DNA variations

SNPs, CNVs

many projects set to deal with intra-species variationdbSNPhttp://www.ncbi.nlm.nih.gov/SNP/the SNP consortiumhttp://snp.cshl.org/haplotypeshttp://www.hapmap.org/glovar - human variationshttp://www.glovar.org/human variomehttp://www.humanvariomeproject.org/general variomeshttp://variome.net/

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 8: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Gene prediction

open-source and on-line gene prediction

Glimmer - bacteria, archea, viruseshttp://cbcb.umd.edu/software/glimmer/

GlimmerHMM - eukaryotic geneshttp://cbcb.umd.edu/software/GlimmerHMM/

GeneZilla (TIGRscan) - eukaryotic geneshttp://www.genezilla.org/

GenScan - human geneshttp://genes.mit.edu/GENSCAN.html

software listshttp://www.genefinding.org/

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 9: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Nucleic structures

RNAs and 3D nucleic structural databases

3D structures of nucleic acidsRNABasehttp://www.rnabase.org/NDB nucleic acids databasehttp://ndbserver.rutgers.edu/

SCOR - structural classification of RNAhttp://scor.berkeley.edu/

RNA motifs, structures and interactions

other databasesSmall RNA databasehttp://condor.bcm.tmc.edu/smallRNA/Noncoding RNA databasehttp://biobases.ibch.poznan.pl/ncRNA/

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 10: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Proteins

protein structures

3D structuresRCSB http://www.rcsb.org/

protein domainsExpasy http://www.expasy.ch/UniProt http://www.uniprot.org/

structuresSCOP, CATH, FSSP, CASP, PFAMhierarchical classification

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 11: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Protein families

systematics on protein structures

SCOP http://scop.berkeley.edu/structural classification of proteinsalpha, beta, alpha/beta, alpha+beta, ... superfamiliesfolds: cca 1000, superfamilies: cca 1500, families: cca 3000

CATH http://www.cathdb.info/class (C), architecture (A), topology (T), homologoussuperfamily (H)cca 1400 familiesC: main secondary structure compositionA: orientation of secondary structuresT: folds with sec. structure connectivityH: similarity superfamilies

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 12: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Expasy

ExPASy (expert protein analysis system)

UniProt - the universal protein resourcehttp://www.expasy.uniprot.org/

knowledgebase, reference clusters, archives

swissprothttp://www.expasy.ch/sprot/

database of protein sequences together with annotationsstructure and function of proteins

prositehttp://www.expasy.ch/prosite/

documentation on protein domains, folds, families

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 13: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Expressions

expression microarrays repositories

not a central repositoryevery institution wants to have a main microarray database

some of the repositoriesGEO - gene expression omnibushttp://www.ncbi.nlm.nih.gov/geo/Stanford microarray databasehttp://genome-www.stanford.edu/Broad (MIT/Harvard) institutehttp://www.broad.mit.edu/tools/data.htmlEBI ArrayExpresshttp://www.ebi.ac.uk/arrayexpress/ChipDBhttp://staffa.wi.mit.edu/chipdb/public/

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 14: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Expression atlases

expression mapping projectsBrainAtlas (mouse oriented)http://www.brainatlas.org/http://www.brain-map.org/RAD - RNA abundance databasehttp://www.cbil.upenn.edu/RAD3/BodyMaphttp://bodymap.ims.u-tokyo.ac.jp/GNF gene expression atlashttp://expression.gnf.org/3D developmental gene expressionhttp://www.univie.ac.at/GeneEMAC/TissueInfohttp://pbtest.med.cornell.edu/services/tissueinfo/query

relational schemaGUS - genomics unified schemahttp://www.gusdb.org/

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 15: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Ontology

overall description of bio-systems

Gene Ontologyhttp://www.geneontology.org/description of gene products for various databasesthe main bio-ontology project

Gene Cardshttp://www.genecards.org/human genes information / ontology databaseone of the first ontology projects

KEGGhttp://www.geneobjects.org/Kyoto encyclopedia of genes and genomesmainly known for molecular interaction pathways

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 16: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Organisms

sites dedicated to particular model organisms

the sites:the generic model organism database projecthttp://www.gmod.org/Escherichia coli http://ecocyc.org/Saccharomyces cerevisiaehttp://www.yeastgenome.org/Arabidopsis thalianahttp://www.arabidopsis.orgDrosophila melanogasterhttp://www.flybase.org/ http://www.fruitfly.org/Caenorhabditis eleganshttp://www.wormbase.org/Danio rerio http://zfin.org/Mus musculus http://www.informatics.jax.org/Rattus sp. http://rgd.mcw.edu/

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 17: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

EBI

European bioinformatics institute

EBI http://www.ebi.ac.uk/part of EMBL http://www.embl.org/

EBI databsesEMBL nucleotide databasehttp://www.ebi.ac.uk/embl/UniProt (together with Expasy and PIR)ArrayExpresshttp://www.ebi.ac.uk/arrayexpress/public repository for microarray dataEnsemblhttp://www.ensembl.org/genomes and annotation for metazoa

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 18: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

NCBI

National center for biotechnology information

http://www.ncbi.nlm.nih.gov/

the main bioinformatics institute / web sitedatabases and services

sequence databasesGenBank, ESTs, SNPs, etc.PubMed - literature databaseEntrezhttp://www.ncbi.nlm.nih.gov/entrez/retrieval system connecting together plethora of databasesincluding PubMed, genomes, ontologiesBlast - the search engine, OMIM, etc.Science primerhttp://www.ncbi.nlm.nih.gov/About/primer/introductions into molecular biology and bioinformatics

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 19: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Blast

Blast - basic local alignment search tool

http://www.ncbi.nlm.nih.gov/BLAST/

two examplesblast - nucleotide - blastn (for short queries)ATCAGTGTAGTCATCGATACCGTAGTCA

- short random sequenceresultsnothing significant, use mouse sequence gi 83999722display graph

genomes - human - megablast (for related sequences)GACACCTTCTCTCCTCCCAGATTCCAGTAACTCCCAATCTTCTCTCTGCAG

- part of an immunoglobulin sequenceresultstwo very significant matches, use ref NT 026437.11click on IGHG1 for informationclick on ’blue box’ to zoom in 8x

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 20: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Perl

standard language for biosequences

Perl scriptingpros:

for fast access to various configuration and log filessuitable for short to middle programs on structured datahuge amount of packages for varius database systems,datastructures, including formats of biological data andconnections to biological databasesregularily used for parsing datafiles and program outputsin common daily bioinformatics

cons:usually hard to read and sustain scriptsobject oriented approach rather rudimentary

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 21: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Perl basics

simple perl scripting

example.pl#/usr/bin/env perl

use strict;

use warnings;

my $var01 = "GATTACA";

$var01 =˜ s/T/U/g;

print substr (reverse($var01), 1, 4), "\n"; # CAUU

my @array01 = (3.14, "Pi");

my %hash01 = ("value" => 3.14, "symbol" => "Pi");

print $hash01{"symbol"}, "\n" if 3.14 == $array01[0];

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 22: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

BioPerl

modules for sequence-based work in bioinformatics

modulescore, run, dbi packages

main parser, standard bio filetypeswrapper around variety bioinformatics toolsconnecting to biological databases

microarray packagemanipulation of microarray formats (preliminary)

other packagesfor linkage studies, C extensions for align algorithms, etc.

usageuse Bio::Perl;

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 23: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Filetype conversion

converterread a given file ’file.seq’takes sequence and writes it in fasta format

#/usr/bin/env perl

use strict;

use warnings;

use Bio::Perl;

my $in = Bio::SeqIO->new(-file => "file.seq" ,

’-format’ => ’GenBank’);

my $out = Bio::SeqIO->new(-file => ">file.fa" ,

’-format’ => ’Fasta’);

my $seq = $in->next seq();

$out->write seq($seq);

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 24: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Sequence retrieval

example on sequence filestakes sequence of human protein il9rmakes blast request and write the results

#/usr/bin/env perl

use strict;

use warnings;

use Bio::Perl;

my $seq obj = get sequence(’genbank’,"il9r human");

write sequence(">il9r.fasta",’fasta’,$seq obj);

my $blast result = blast sequence($seq obj);

write blast(">il9r.blast", $blast result);

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 25: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Human gene IDs extraction

#/usr/bin/env perl

use strict; #use with <gene2refseq

use warnings;

use Bio::Perl;

my @cols = (1, 2, 7, 9, 10);

my $col last = 11;

my $done = 0;

while (<STDIN>) {. my @line = split /\s+/, $ ;

. if ("9606" eq $line[0]) {

. $line[2] = lc $line[2];

. next if $done >= $line[1];

. foreach my $col (@cols) {

. print STDOUT $line[$col], " ";}

. print STDOUT $line[$col last], "\n";}}

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 26: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Data structures

data storage for running programs

one / several long sequencessimple string (array of chars)not to put it into composite structures

long time to access a packed string

sets of all oligonucleotidesnucleotides→ numbers 0...3standard array as a lookup table for the oligos

fast access to each cell

table of gene ids - gaps between idshash (i.e. associative array)scripting languages with suitable hash data structures

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 27: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Constraint programming

how to solve problems with constraints

kind of logic programmingdeclarative programming

specify what to do, not how to do ita logic program with constraint specifications

setting relations between dataspecific methods for constraint satisfaction

many general solutions but few of them obey the constraintsmostly combinatorial problems

not the technics for general optimizations

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 28: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Constraint approaches

search while consistent

depth-first searchbetter than starting paths if most of them falseback-tracking rather ineffective, to avoid itefficiency with filtering out the wrong paths

filter aheadset as unaccessible all the recognised wrong pathslesser ways for other (backtracked) search attemptsdata reduction for a final exhaustive search

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 29: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Chess board example

the second attempt leads to the searched resultmuch faster than the search-backtrack approach

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 30: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Constraints in bioinformatics

problems accessible for CP

where (not) to use CPsequence alignment - no

all of the alignments allowedoptimization case, not for the constraint programming

clustering, classification - nomany possible waysoptimization case, not for the constraint programming

sequence assembly - not exactlywhile based on constraints, not a global filtering

higher structure prediction - yessuitable connections of short sequences required

RNA gene prediction - yesspecific sequence characteristics required

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 31: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Filters

filtering as CP examples

non-coding RNA predictioneach RNA gene contain base-paired sequencescomplementary sequences with a limited separation(7nt, 70nt)-stack used in FastR software

structure compositionpredicted secondary structures should be concatenatablefirst, to thread short sequences to gain building blockscombinatorial search to concatenate the sequences

Martin Saturka www.Bioplexity.org Bioinformatics - Databases

Page 32: Bioinformatics - Databases - Bioplexity · Data sources for bioinformatics Main types of biological databases with utilization tools. communication with databases. database usage

Items to remember

Nota bene:

file types, data access

Database sitessequences, structuresexpressions, ontologies

Programmingbioperl, conversionconstraints, filters

Martin Saturka www.Bioplexity.org Bioinformatics - Databases