An introduction to biological databases
Marie-Claude.Blatter@isb-sib.ch
MCB - February 06 EMBnet Introduction to Bioinformatics

What is a database?

• A collection of data that is
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
• Also includes the associated tools (software) necessary for db access/query, db updating, db information insertion, db information deletion…
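To make the "searchable (index)" and "cross-referenced" properties concrete, here is a minimal sketch in Python. The records, accession numbers and field names are invented for illustration and do not come from any real database.

```python
# Hypothetical toy records: accession numbers and cross-references are made up.
records = [
    {"ac": "X0001", "organism": "Homo sapiens", "xrefs": ["P0001"]},
    {"ac": "X0002", "organism": "Mus musculus", "xrefs": []},
    {"ac": "X0003", "organism": "Homo sapiens", "xrefs": ["P0003"]},
]


def build_index(entries, field):
    """Map each value of `field` to the accession numbers that carry it."""
    index = {}
    for entry in entries:
        index.setdefault(entry[field], []).append(entry["ac"])
    return index


# The index plays the role of a table of contents: a lookup without a full scan.
organism_index = build_index(records, "organism")
print(organism_index["Homo sapiens"])  # ['X0001', 'X0003']
```

The `xrefs` field stands in for hyperlinks to entries in other databases.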

Why biological databases?

• Exponential growth in biological data.
• Data (nucleic acid sequences (DNA, RNA), protein sequences, 3D structures, 2D gel analysis, MS analysis, microarrays, protein-protein interactions…) are no longer published in a conventional manner, but directly submitted to databases.
• Essential tools for biological research.

Distribution of databases

• Books, articles: 1968 -> 1985
• Computer tapes: 1982 -> 1992
• Floppy disks: 1984 -> 1990
• CD-ROM: 1989 -> ?
• FTP: 1989 -> ?
• On-line services: 1982 -> 1994
• WWW: 1993 -> ?
• DVD: 2001 -> ?

Some statistics and remarks

• More than 1000 different ‘biological’ databases
• Variable size: <100 Kb to >10 Gb
– DNA: > 10 Gb
– Protein: 1 Gb
– 3D structure: 5 Gb
– Other: smaller
• Update frequency: daily to annually
• How to find them?
– Links to many other molecular biology databases (ExPASy): http://www.expasy.org/links.html#Proteins
– Biohunt: http://www.expasy.org/BioHunt/
– Google: http://www.google.com/
! Database home server !

http://www.expasy.org/

The ten important biological databases*

Database | URL | Content
GenBank/DDBJ/EMBL | www.ncbi.nlm.nih.gov | Nucleotide sequences
Ensembl | www.ensembl.org | Human/mouse genome
PubMed | www.ncbi.nlm.nih.gov | Literature references
NR (Entrez protein) | www.ncbi.nlm.nih.gov | Protein sequences
Swiss-Prot | www.expasy.org | Protein sequences
InterPro | www.ebi.ac.uk | Protein domains
OMIM | www.ncbi.nlm.nih.gov | Genetic diseases
Enzymes | www.expasy.org | Enzymes
PDB | www.rcsb.org/pdb/ | Protein structures
KEGG | www.genome.ad.jp | Metabolic pathways

*according to «Bioinformatics for Dummies»

Categories of databases for Life Sciences

• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family (----> tools)
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolism
• Bibliography
• ‘Others’ (Microarrays, Protein-protein interaction…)

Yes, if you train quickly, you can create a new database, but first eat your dinner!


Ideal minimal content of a sequence database entry

• Sequences !!
• Accession number (AC) (unique identifier)
• Taxonomic data
• References
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation
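The minimal content listed on this slide can be pictured as a record type. The sketch below is only an illustration: the class name, field names and example values are assumptions, not the schema of EMBL, GenBank or any other database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceEntry:
    """Hypothetical record holding the minimal content of a sequence entry."""
    accession: str                      # AC: unique identifier
    sequence: str
    taxonomy: str = ""
    references: list = field(default_factory=list)
    annotation: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    cross_references: list = field(default_factory=list)

    def missing_fields(self):
        """Names of the fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Made-up entry: only the sequence, accession number and taxonomy are filled in.
entry = SequenceEntry(accession="X0001", sequence="ATGGCTTAA", taxonomy="Homo sapiens")
print(entry.missing_fields())
```

An entry with many missing fields is still valid, but it is far less useful than a well annotated one.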

Database 1a: nucleotide sequences

• The 3 main public nucleic acid sequence databases are EMBL (Europe) / GenBank (USA) / DDBJ (Japan): «different views of the same data set» within 2 to 3 days (since 1990)
• EMBL: since 1982
• Specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)
• 3D structure (DNA and RNA) -> PDB
• Others: Aberrant splicing db; Eukaryotic promoter db (EPD); RNA editing sites; Multimedia Telomere Resource…

http://www.expasy.org/links.html#DNA

The hectic life of a sequence

cDNAs, ESTs, genes, genomes, …
EMBL, GenBank, DDBJ
Data not submitted to public databases*, delayed or cancelled…
* REMARK: Journals do not accept a paper dealing with a sequence if the EMBL/GenBank/DDBJ AC number is not available…
http://www.insdc.org/

EMBL/GenBank/DDBJ

• Serve as archives
• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)
• Currently: 46 x 10^6 sequences, ~80 x 10^9 bp
• Sequences from > 80’000 different species
• Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %

The tremendous increase in nucleotide sequences

1980: 80 genes fully sequenced!

http://www3.ebi.ac.uk/Services/DBStats/

[Figure labels: RNA, DNA; human, mouse, rat]

More than 80’000 species, but…
Human/Mouse/Rat: organisms with the highest redundancy!
New projects: environmental sequences (no taxonomic information)

Molecule types
(…DNA vs RNA: quality of the protein sequence… (in Eukaryota))
small nucleolar RNA
signal recognition particle RNA component

[Figure labels on an example entry: taxonomy; cross-references; references; keyword; annotation (prediction or experimentally determined); sequence; CDS = CoDing Sequence (proposed by submitters)]

The hectic life of a sequence

cDNAs, ESTs, genes, genomes, …
EMBL, GenBank, DDBJ
Data not submitted to public databases*, delayed or cancelled…
with or without annotated CDS provided by authors

CDS: CoDing Sequence, the portion of DNA/RNA translated into protein (from Met to STOP)
Experimentally proved or derived from gene prediction
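The "from Met to STOP" definition of a CDS can be sketched as follows. This is a simplified illustration that assumes the standard genetic code and a spliced (intron-free) sequence; the function name `find_cds` and the toy sequence are made up, and real gene prediction is far more involved.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def find_cds(dna):
    """Return (cds, protein) for the first ATG...stop stretch, or None."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return None
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X: ambiguous codon
        if amino_acid == "*":
            return dna[start:pos + 3], "".join(protein)
        protein.append(amino_acid)
    return None  # no in-frame stop codon found


print(find_cds("ccATGGCTTAAgg"))  # ('ATGGCTTAA', 'MA')
```

Taking the first ATG is a deliberate simplification: the true initiation codon is not always the first one, which is one reason why submitted CDSs vary in quality.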

An important annotation (-> CDS)
EMBL/GenBank/DDBJ


CONTIG --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG

Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG

*** ************ ** * **************

CONTIG CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG-------------------------------------------------------

Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA

************************************ ****************************

CONTIG ------------------------------------------------------------------------------------------------------------------------

Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT

CONTIG ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC

Genomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC

**************************************************************************

CONTIG TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC

Genomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC

**************************** ***********************************************

CONTIG CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA

Genomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA

************************************************************************************************************************

CONTIG TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG-----------------------------------------------------------------------------

Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT

********************************** ********

CONTIG -------------------------------------------------------------------------------------------------------------------GNAAA

Genomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA

****

CONTIG GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGATAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC

Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGATAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC

******************************************* * ************** ******** ***** **** * *********** ***************************

CONTIG C-----------------------------------------------------------------------------------------------------------------------

Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA

*

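Alignment between an mRNA and a genomic sequence
[Figure labels: exon, intron. The blocks where the mRNA (CONTIG) aligns to the genomic sequence are exons; the genomic stretches facing gaps are introns.]
CoDing Sequence
EMBL/GenBank/DDBJ
CDS provided by the submitters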
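Reading exons off an mRNA/genomic alignment like the one on this slide can be sketched in a few lines. The function below is a hypothetical helper, not the tool used to produce the slide: it assumes two pre-aligned rows of equal length with "-" for gaps, and treats any run of at least `min_intron` gapped mRNA columns as an intron, so that single-base differences do not split an exon.

```python
def exon_blocks(mrna_row, genomic_row, min_intron=20):
    """Return 1-based (start, end) genomic coordinates of the aligned blocks."""
    blocks = []
    pos = 0        # current genomic coordinate
    start = None   # genomic start of the block being built
    last = None    # last genomic coordinate covered by the mRNA
    gap_run = 0    # length of the current run of mRNA gaps
    for m, g in zip(mrna_row, genomic_row):
        if g == "-":
            continue              # base present only in the mRNA
        pos += 1
        if m == "-":
            gap_run += 1          # genomic base with no mRNA counterpart
            continue
        if start is None:
            start = pos
        elif gap_run >= min_intron:
            blocks.append((start, last))   # close the exon before the intron
            start = pos
        gap_run = 0
        last = pos
    if start is not None:
        blocks.append((start, last))
    return blocks


# Toy rows (made up): two exons separated by a 6-base intron.
mrna = "AAAA" + "-" * 6 + "CCCC"
genomic = "AAAA" + "GTTTAG" + "CCCC"
print(exon_blocks(mrna, genomic, min_intron=5))  # [(1, 4), (11, 14)]
```

The threshold is a crude stand-in for real splice-site detection, which would also check the GT...AG boundaries visible in the alignment.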

[Figure labels: CDS provided by the submitters; translation provided by EMBL]

A major problem: the submission of annotated CDSs

Example: the genome of the chimpanzee
The (genomic) sequence is complete and has been submitted to EMBL.
Comparative genomic analyses have been done and the CDSs identified, but these CDSs have not been submitted!
-> only 1’505 CDSs are available yet (in UniProtKB)!

EMBL/GenBank/DDBJ

A sort of sequence museum, where sequences are preserved for eternity as they were determined, interpreted and originally published by their authors (primary sequence repository).

The authors have full authority over the content of the entries they submit!
(editorial control of the content belongs to the authors)
(exception: TPA, since January 2003)

Submission: FTP, email, Webin, etc.

Protein sequence derived from the translation of a vector contamination

EMBL/GenBank/DDBJ

• Unexpected information you can find in these db:

FT source 1..124
FT /db_xref="taxon:4097"
FT /organelle="plastid:chloroplast"
FT /organism="Nicotiana tabacum"
FT /isolate="Cuban cahibo cigar, gift from
FT President Fidel Castro"

• Or:

FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
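Feature-table lines such as these are easy to read programmatically. The sketch below is a simplified, hypothetical parser for a single feature: it works on whitespace-collapsed lines as they appear in this transcript, whereas the real EMBL flat-file format uses fixed columns, and it handles only quoted values that may continue on the next FT line.

```python
def parse_ft_block(lines):
    """Parse the FT lines of one feature into key, location and qualifiers."""
    feature = {"key": None, "location": None, "qualifiers": {}}
    open_name = None  # qualifier whose quoted value continues on the next line
    for line in lines:
        if not line.startswith("FT"):
            continue
        content = line[2:].strip()
        if not content:
            continue
        if open_name is not None:
            feature["qualifiers"][open_name] += " " + content.rstrip('"')
            if content.endswith('"'):
                open_name = None
        elif content.startswith("/"):
            name, _, value = content[1:].partition("=")
            feature["qualifiers"][name] = value.strip('"')
            if value.startswith('"') and not (len(value) > 1 and value.endswith('"')):
                open_name = name
        else:
            feature["key"], feature["location"] = content.split(None, 1)
    return feature


example = [
    'FT source 1..124',
    'FT /db_xref="taxon:4097"',
    'FT /organism="Nicotiana tabacum"',
    'FT /isolate="Cuban cahibo cigar, gift from',
    'FT President Fidel Castro"',
]
source = parse_ft_block(example)
print(source["qualifiers"]["isolate"])
```

Because the authors control the content of these free-text qualifiers, a parser can extract them but cannot vouch for what they say.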

Another major issue for the protein sequence databases…

FT CDS complement(45959..47332)
FT /db_xref="SPTREMBL:Q9UZ71"
FT /note="PAB2386"
FT /transl_table=11
FT /product="4-AMINOBUTYRATE qui se dilate AMINOTRANSFERASE
FT (EC 2.6.1.19)"
FT /protein_id="CAB50188.1"
FT /translation="MDYPRIVVNPPGPKAKELIEREKRVLSTGIGVKLFPLVPKRGFGP
FT FIEDVDGNVFIDFLAGAAAASTGYSHPKLVKAVKEQVELIQHSMIGYTHSERAIRVAEK
FT LVKISPIKNSKVLFGLSGSDAVDMAIKVSKFSTRRPWILAFIGAYHGQTLGATSVASFQ
FT VSQKRGYSPLMPNVFWVPYPNPYRNPWGINGYEEPQELVNRVVEYLEDYVFSHVVPPDE
FT VAAFFAEPIQGDAGIVVPPENFFKELKKLLDEHGILLVMDEVQTGIGRTGKWFASEWFE
FT VKPDMIIFGKGVASGMGLSGVIGREDIMDITSGSALLTPAANPVISAAADATLEIIEEE
FT NLLKNAIEVGSFIMKRLNELKEQFDIIGDVRGKGLMIGVEIVKENGRPDPEMTGKICWR
FT AFELGLILPSYGMFGNVIRITPPLVLTKEVAEKGLEIIEKAIKDAIAGKVERKVVTWH"
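The location of a CDS feature already tells how long the protein should be. Here is a small sketch that parses a simple location such as `complement(45959..47332)` and derives the expected protein length. The function name is hypothetical, and joined or partial locations are deliberately not handled.

```python
import re


def parse_simple_location(location):
    """Return (start, end, strand) for 'N..M' or 'complement(N..M)'."""
    match = re.fullmatch(r"(complement\()?(\d+)\.\.(\d+)\)?", location)
    if match is None:
        raise ValueError(f"unsupported location: {location}")
    strand = -1 if match.group(1) else 1   # complement means the reverse strand
    return int(match.group(2)), int(match.group(3)), strand


start, end, strand = parse_simple_location("complement(45959..47332)")
cds_length = end - start + 1          # both ends are inclusive
codons = cds_length // 3
print(cds_length, codons - 1)         # nucleotides, residues without the stop
```

A consistency check of this kind (CDS length divisible by 3, protein length equal to the number of codons minus the stop) is a cheap way to spot suspicious entries.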

Environmental sequences (division ‘ENV’)

Aim: to sequence all DNA present in a given sample, without knowing which species the DNA is derived from
- Sargasso Sea (Craig Venter)
- human fluids
- earth

No idea of the species… (microbial population…)
No idea of the gene prediction program to be used…
No idea of the genetic code to be used for translation !!!!!
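Why the choice of genetic code matters can be shown with a single codon. The sketch below compares NCBI translation table 1 (standard code) with table 4 (used, for example, by Mycoplasma), in which TGA encodes tryptophan instead of a stop. The DNA string is a made-up example and the function is a bare-bones illustration, not a full translator.

```python
BASES = "TCAG"
# Amino acids for codons ordered T, C, A, G at each position ('*' is a stop).
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def translate(dna, table_id):
    """Translate every complete codon of `dna` with the chosen genetic code."""
    amino_acids = TABLES[table_id]
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        i, j, k = (BASES.index(base) for base in dna[pos:pos + 3])
        protein.append(amino_acids[16 * i + 4 * j + k])
    return "".join(protein)


dna = "ATGTGATAA"
print(translate(dna, 1))  # M**  : TGA read as a stop
print(translate(dna, 4))  # MW*  : TGA read as tryptophan
```

With an unknown organism, the same DNA can therefore give a truncated protein under one code and a longer one under another.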

Nucleic acids databases: the problems

- Archive -> highly redundant
- Similarity searches are not obvious…
- Authors’ authority -> variable level of annotation quality
- e.g. gene/protein name attribution…
- Variable level of sequence quality
- Sequencing quality
- Gene prediction quality

The second ‘generation’ of nucleotide sequence databases

Gene-centric databases
All the sequence information relevant to a given gene is made accessible at once
e.g. Entrez Gene/RefSeq

Genome-centric databases
Information about gene sequence, relative position, strand orientation, biochemical functions…
Information management systems that are able to connect specialized sequence collections and browsing tools
e.g. Ensembl, TIGR

Entrez Gene / RefSeq (NCBI)

Database with gene-specific information, which focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.

The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI.

The corresponding sequences are available thanks to cross-links to RefSeq or other sequence databases.

[Figure labels: Interactions; Gene ontology]

Entrez Gene is tightly linked to RefSeq («interdependent curated resources»)

Links to all the sequences found in EMBL/GenBank/DDBJ corresponding to this gene

Links to RefSeq accession numbers
- for RNA (NM_)
- for genomic (NT_)
- for protein (NP_)

RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
1’899’454 entries (20-Sep-2005); 3060 species.
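The RefSeq accession prefixes listed on this slide carry the molecule type, which makes them easy to sort automatically. The mapping below covers only the three prefixes named here (RefSeq uses more), and the accession numbers in the example are placeholders, not real entries.

```python
# Prefixes taken from the slide; other RefSeq prefixes exist but are not listed.
REFSEQ_PREFIXES = {
    "NM_": "RNA",
    "NT_": "genomic",
    "NP_": "protein",
}


def refseq_molecule_type(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "unknown")


# Placeholder accession numbers, for illustration only.
for accession in ("NM_000000", "NT_000000", "NP_000000", "X00000"):
    print(accession, refseq_molecule_type(accession))
```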

A view of the redundancy…

A RefSeq entry

Working with whole genome databases:
Genome-centric databases
«Browsing resource»

Remark: Genome-centric databases usually give access to several genomes, but some are «specialized» in particular organisms, e.g. TIGR: bacteria and plants

http://www.ensembl.org/index.html

[Figure label: Telomere 5’]

Ensembl provides a bioinformatics framework to organise biology around the sequences of a selection of large genomes.

Ensembl/martview: examples of queries
- Retrieve all mouse homologues of human disease genes containing transmembrane domains located between 1p22 and 1q22
- Retrieve the sequences 5 kb upstream of all human «known» genes from chromosome 6
….

Major problem of this type of ‘database’: the update frequency…

..and plants
http://www.tigr.org/tdb/

Database 1b
Protein sequences

The hectic life of a protein sequence …

cDNAs, ESTs, genomes, …
EMBL, GenBank, DDBJ (with or without annotated CDS)
Data not submitted to public databases, delayed or cancelled…
TrEMBL, GenPept: CoDing Sequences provided by submitters
RefSeq (XP_NNNNN): CoDing Sequences provided by submitters and «de novo» gene prediction
Swiss-Prot: manually annotated
PRF: sequences derived from scientific publications
3D structures
PRF, PIR (Protein Identification Resource)

UniProt: Swiss-Prot + TrEMBL + (PIR)
NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

2 major ‘resources’ (general protein sequence databases)
UniProtKB
NCBI-nr (Entrez protein)

Protein sequence databases
The UniProt pathway (Universal protein resource)

The UniProt components

UniProt KnowledgeBase
UniProtKB Release 6.0 consists of:

UniProtKB/Swiss-Prot: manually annotated protein sequences
Release 48.1 of 27-Sep-2005: 195’058 entries / 9’479 species

UniProtKB/TrEMBL: computer annotated protein sequences
Release 31.1 of 27-Sep-2005: 2’151’724 entries / ~95’000 species

UniRef100, UniRef90, UniRef50
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species.

UniProt Archives:
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc.
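The idea behind the UniRef100/90/50 sets, grouping sequences by an identity threshold, can be illustrated with a toy greedy clustering. This is not the algorithm UniProt uses: the identity measure below is a naive ungapped, position-by-position comparison over the shorter sequence, and all sequences are invented.

```python
def identity(a, b):
    """Fraction of identical positions over the shorter sequence (ungapped)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / shorter


def cluster(sequences, threshold):
    """Greedy clustering: the first member of each cluster is its representative."""
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):  # longest first
        for members in clusters:
            if identity(seq, members[0]) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Made-up sequences: a duplicate, a one-residue variant, a fragment, an outlier.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTAY", "GGGGGGGGGG"]
for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(cluster(seqs, threshold)))
```

At 100% the duplicate and the fragment merge with the full-length sequence, while the one-residue variant stays separate; at 90% the variant joins as well.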

The UniProt groups (Antibes, September 2004)

In Geneva (SIB):
1 Group Leader
42 Annotators
4 Prosite annotators
18 Programmers and Researchers
5 Administrators, science communicators
3 System Administrators
4 Students
------------------
77 people

At EBI (Swiss-Prot + EMBL + TrEMBL):
75 people (29 Annotators)

At PIR:
1 Group Leader
13 Protein Science Team
12 Informatics Team
------------------
26 people


The hectic life of a protein sequence …

Nucleic acids:
cDNAs, ESTs, genomes, …
EMBL (with or without annotated CDS)
Data not submitted to public databases, delayed or cancelled…

Amino acids:
TrEMBL: CoDing Sequences provided by submitters*
Swiss-Prot: manually annotated
Direct submission (< 1%), PIR data

* ~1 in 10 EMBL entries is associated with an annotated CDS

[Diagram labels: EMBL, CDS, TrEMBL, Swiss-Prot]

From TrEMBL to Swiss-Prot

Literature information (more than 1500 journals cited)
Sequence check and analysis
High performance bioinformatics tools
Databases and external scientific expertise

In order to avoid redundancy, once an entry has been manually annotated and integrated into Swiss-Prot, it is removed from TrEMBL.

In a UniProtKB/Swiss-Prot entry, you can expect to find:

• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, etc.;
• Numerous cross-references;
• Selected keywords;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A (often corrected) protein sequence and the description of various isoforms/variants.

View «by default» on the ExPASy server

Accession number: ID, AC, DT lines
Names and taxonomy: DE, GN, OC, OS, OG lines
References: RN, RP, RC, RX, RA, RL lines
Comments: CC lines
Cross-references: DR lines
Keywords: KW lines
Features: FT lines
Sequence: SQ lines

Sequence quality

Sequencing errors? Polymorphisms? Alternative splicing? Alternative initiation? Usage of an alternative promoter? RNA editing? Selenocysteine? Fragment? Same gene?

-> 1 gene / 1 species = 1 Swiss-Prot entry
For human: ~4.7 different independent sequence reports per gene
-> Identification and annotation of all sequence differences
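The two-letter line codes of a Swiss-Prot entry (ID, AC, DE, FT, …) make the flat file straightforward to split into the sections of the default ExPASy view. The sketch below groups lines by code using the section names from the slide. The entry text is a made-up fragment, and the real format has details (sequence data lines, the `//` terminator) that are ignored here.

```python
# Line code -> section, following the grouping shown on the slide.
SECTIONS = {}
for section, codes in {
    "Accession number": ("ID", "AC", "DT"),
    "Names and taxonomy": ("DE", "GN", "OC", "OS", "OG"),
    "References": ("RN", "RP", "RC", "RX", "RA", "RL"),
    "Comments": ("CC",),
    "Cross-references": ("DR",),
    "Keywords": ("KW",),
    "Features": ("FT",),
    "Sequence": ("SQ",),
}.items():
    for code in codes:
        SECTIONS[code] = section


def group_by_section(lines):
    """Group flat-file lines by the section their two-letter code belongs to."""
    grouped = {}
    for line in lines:
        section = SECTIONS.get(line[:2])
        if section is not None:
            grouped.setdefault(section, []).append(line[2:].strip())
    return grouped


# Hypothetical entry fragment, for illustration only.
entry_lines = [
    "ID   TEST_HUMAN",
    "AC   P00000;",
    "DE   Hypothetical example protein.",
    "OS   Homo sapiens (Human).",
    "KW   Example.",
    "//",
]
print(group_by_section(entry_lines))
```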

Multiple alignment of the end of the available GCR sequences

- 13 sequences (complete or partial)
- derived from mRNA (n=6) or genomic DNA (n=7)

Annotation of the sequence differences

From genome to proteome: increase in complexity

~ 25’000 human genes
-> alternative splicing of mRNA: 2-5 fold increase
~ 100’000 human transcripts
-> post-translational modifications of proteins (PTMs): 5-10 fold increase
~ 1'000'000 human proteins

MC

B -

February 06

EM

Bnet

Introduction toB

ioinformatics

and…

•The proteomics community is faced with a huge task:
»It needs to confirm, by identification, many potential splice isoforms and initiation sites;
»It needs to help to characterize the extent of PTMs on the majority of human proteins and to address the fluctuation of these PTMs over time and space.

Swiss-Prot & TrEMBL introduce a new arithmetical concept!

-> give access to all known* protein sequences
* submitted to the public databases (EMBL, GenBank, DDBJ, Swiss-Prot)

210’000 (~10’000 species) + 2’600’000 (~100’000 species) ≈ 2’200’000

Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot
Redundancy is going to decrease: «new» genome sequencing -> «new» proteins (Amos Bairoch, Sept. 2002)

Swiss-Prot + TrEMBL

In the case of human proteins, the redundancy is still very high:
~13’000 + ~58’000 ≈ about 22’000*
* human gene number estimation: < 25’000

Missing sequences:
•Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
•Not yet predicted or known genes ("no CDS provided by the submitters" or no DNA sequence)
•Confidential data (patent application sequences)
•Immunoglobulins, T-cell receptors (-> UniParc)
•…

Take home message

•Be aware of the differences between UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
–Computer vs. Human
–Redundant vs. Non-redundant
•We need your feedback and your expertise! swiss-prot@expasy.org

Swiss-Prot: Alive and Kicking!
1986-2006

You are welcome to our 20th anniversary meeting in Brazil this year
http://www.swissprot20.org/

The UniProt components

UniProt KnowledgeBase (UniProtKB), Release 6.0, consists of:
•UniProtKB/Swiss-Prot: manually annotated protein sequences. Release 48.1 of 27-Sep-2005: 195’058 entries / 9’479 species
•UniProtKB/TrEMBL: computer annotated protein sequences. Release 31.1 of 27-Sep-2005: 2’151’724 entries / ~95’000 species

UniRef100, UniRef90, UniRef50 (independent of species):
•One UniRef100 entry = all identical sequences (including fragments).
•One UniRef90 entry = sequences that have at least 90% identity.
•One UniRef50 entry = sequences that have at least 50% identity.

UniProt Archive: archived raw protein sequences found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc.
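
The idea behind the UniRef identity thresholds can be illustrated with a toy clustering. This is only a sketch under strong assumptions: sequences are treated as pre-aligned and of equal length, identity is a simple position-by-position comparison, and clustering is greedy around the first sequence seen. The real UniRef procedure is not described on the slide and is not reproduced here; all sequences and function names are invented.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length (pre-aligned) sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Put each sequence into the first cluster whose representative it matches."""
    clusters = []  # list of (representative, members)
    for seq in seqs:
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


# Invented 10-residue sequences.
SEQS = [
    "MKTAYIAKQR",  # reference
    "MKTAYIAKQR",  # identical copy
    "MKTAYIAKQK",  # 9/10 identical to the reference
    "MKTAYLVRQK",  # 6/10 identical to the reference
    "MRSGWLVRNE",  # 1/10 identical to the reference
]

for threshold in (1.0, 0.9, 0.5):
    n_clusters = len(greedy_cluster(SEQS, threshold))
    print(f"threshold {threshold:.0%}: {n_clusters} clusters")
```

Lowering the threshold merges more sequences into each cluster, which is why a search against a lower-threshold set returns fewer, more diverse hits.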

Blast against UniRef100, 90 and 50
http://www.expasy.org/tools/blast/

[Screenshot of the query form; labels: «Sequence of human erythropoietin», «By default»]
[Screenshots of the results: Blast against UniRef100, Blast against UniRef90, Blast against UniRef50]


Query = accession number ‘only’

Righting the wrongs or "Nobody's perfect"

“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10’000”
“Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”
C. Hardley, EMBO reports, 4(9), 2003.

The NCBI-nr pathway (Entrez protein)

2 major pathways (general protein sequence databases)
UniProtKB: Swiss-Prot + TrEMBL + (PIR)
NCBI-nr: Swiss-Prot + GenPept + (PIR) + PDB + PRF + RefSeq

Protein sequences: «NR database», Entrez protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Query at Entrez protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

UniProtKB: Swiss-Prot + TrEMBL + (PIR)
NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
–GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL, except that it is redundant with Swiss-Prot
–PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProtKB)
–PDB: 3D structure database: all the protein sequences which have been crystallized (UniProtKB is crosslinked to PDB) + mutated protein sequences + chimeric proteins (no matches with UniProtKB sequences)
–PRF: sequences derived from scientific publications, «Journal scan» (integrated into TrEMBL)
–RefSeq: derived from GenBank/EMBL/DDBJ sequences + predicted protein sequences

Typical result of a query at «Entrez protein»
[Screenshot labels: RefSeq, Swiss-Prot, GenPept (gb/embl/ddbj), PDB]

http://www.pir.uniprot.org/search/idmapping.shtml

Categories of databases for Life Sciences

•Sequences (DNA, protein)
•Genomics
•Mutation/polymorphism
•Protein domain/family (----> tools)
•Proteomics (2D gel, Mass Spectrometry)
•3D structure
•Metabolism/Pathways
•Bibliography
•Gene ontology (GO)
•‘Others’ (Microarrays, Protein-protein interaction…)

MIM / OMIM

•OMIM™: Online Mendelian Inheritance in Man
•catalog of human genes and genetic disorders
•contains a summary of literature and reference information. It also contains links to publications and sequence information.

…and plants, fungi, bacteria, archaea, viruses and phages…

http://flybase.bio.indiana.edu/


Mutation/polymorphism: definitions

•SNPs: single nucleotide polymorphisms; occur approximately once every 100 to 300 bases (distinction between sequencing error and polymorphism!)
•c-SNPs: coding single nucleotide polymorphisms (single nucleotide polymorphisms within cDNA sequences)
•SAPs: single amino-acid polymorphisms
•Missense mutation: -> SAP
•Nonsense mutation: -> STOP
•Insertion/deletion of nucleotides -> frameshift…
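
The definitions above can be made concrete with a short sketch that classifies a single-codon substitution as silent, missense (-> SAP) or nonsense (-> STOP), and an insertion/deletion as in-frame or frameshift. It assumes the standard genetic code and DNA codons; the function names are illustrative, and special cases such as the loss of a stop codon are not treated separately.

```python
# Standard genetic code, written compactly with the bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if alt_aa == ref_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"   # premature STOP
    return "missense"       # amino-acid change, i.e. a SAP


def classify_indel(n_nucleotides):
    """An insertion/deletion shifts the reading frame unless its length is a multiple of 3."""
    return "in-frame" if n_nucleotides % 3 == 0 else "frameshift"


print(classify_substitution("GAG", "GTG"))  # Glu -> Val
print(classify_substitution("TGG", "TGA"))  # Trp -> STOP
print(classify_substitution("GAA", "GAG"))  # Glu -> Glu
print(classify_indel(1), classify_indel(3))
```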

Databases 3: mutation/polymorphism

•Contain information on sequence variations linked or not to genetic diseases;
•Mainly human, but: OMIA - Online Mendelian Inheritance in Animals
•General db:
–OMIM
–HGMD - Human Gene Mutation db
–SVD - Sequence variation db
–HGBASE - Human Genic Bi-Allelic Sequences db
–dbSNP - Human single nucleotide polymorphism (SNP) db
•Disease-specific db: most of these databases are either linked to a single gene or to a single disease;
–p53 mutation db
–ADB - Albinism db (mutations in human genes causing albinism)
–Asthma and Allergy gene db
–….

[Screenshot label: For human]

Mutation/polymorphism

•No single source for all SNPs (~100 SNP db)!
•Generally modest size; the lack of coordination and format standards in these databases makes it difficult to access the data.
•! Numbering of the mutated amino acid depends on the db (aa no. 1 is not necessarily the initiator Met!)
•There are initiatives to unify these databases (political/funding problems): Mutation Database Initiative (4th July 1996).
-> SVD - Sequence Variation Database project at EBI (HMutDB) http://www.ebi.ac.uk/mutations/central/
-> HUGO Mutation Database Initiative (MDI). Human Genome Variation Society http://www.genomic.unimelb.edu.au/mdi/dblist/dblist.html
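
The numbering problem mentioned above can be sketched as a simple offset conversion. The example assumes that the two numbering schemes differ only by a fixed number of N-terminal residues that one database counts and the other does not (for instance a removed signal peptide); the length of 27 residues and the function names are hypothetical.

```python
def mature_to_precursor(position, n_removed):
    """Convert a 1-based position in the processed chain to precursor numbering."""
    if position < 1:
        raise ValueError("positions are 1-based")
    return position + n_removed


def precursor_to_mature(position, n_removed):
    """Convert precursor numbering to processed-chain numbering.

    Returns None when the residue lies in the removed N-terminal part.
    """
    mature = position - n_removed
    return mature if mature >= 1 else None


# Hypothetical case: 27 N-terminal residues are not counted by one of the databases.
N_REMOVED = 27
print(mature_to_precursor(10, N_REMOVED))
print(precursor_to_mature(37, N_REMOVED))
print(precursor_to_mature(5, N_REMOVED))
```

Before comparing a mutation reported in two databases, it is worth checking which residue each of them calls number 1.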


Protein domain/family: some definitions

•Most proteins have «modular» structures
•Estimation: ~ 3 domains / protein
•Domains (conserved sequences or structures) are identified by multiple sequence alignments
•Domains can be defined by different methods:
–Pattern (regular expression); used for very conserved domains
–Profiles (weighted matrices): two-dimensional tables of position-specific match-, gap-, and insertion-scores, derived from aligned sequence families; used for less conserved domains
–Hidden Markov Model (HMM); probabilistic models; another method to generate profiles.

Pattern-Profile

•Pattern: [LIVM]-[ST]-A-[STAG]-H-C -> yes or no
•Profile:

ID   TRYPSIN_DOM; MATRIX.
AC   PS50240;
DT   DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2001 (INFO UPDATE).
DE   Serine proteases, trypsin domain profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=234;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=229;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=0.0169; R2=0.00836256; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=1134; N_SCORE=9.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=775; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: M0=-9; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;
MA   /I: B1=0; BI=-105; BD=-105;
MA   A B D E F G H I K L M N P Q R S T V W Y
MA   /M: SY='I'; M= -8,-29,-34,-26,  3,-34,-24, 34,-26, 19, 15,-24,-21,-21,-24,-19, -8, 25,-19,  3;
MA   /M: SY='N'; M=  0, 14, 10,  1,-22, -1,  6,-23, -4,-26,-17, 20,-14, -1, -6, 13,  2,-20,-34,-15;
MA   /M: SY='E'; M= -4,  4,  7, 14,-26,-13, -7,-23,  3,-22,-16,  2,  7,  3, -3,  2, -2,-21,-30,-18;
MA   /M: SY='R'; M=-12,  5,  5,  7,-23,-17,  3,-24,  8,-20,-12,  7,-16, 10, 12, -2, -6,-21,-27, -9;
MA   /M: SY='W'; M=-16,-33,-35,-27, 13,-22,-24,-11,-18,-13,-13,-31,-27,-20,-18,-30,-21,-18, 97, 25;
MA   /M: SY='V'; M=  1,-29,-31,-28, -1,-30,-29, 31,-22, 13, 11,-27,-27,-26,-22,-12, -2, 41,-27, -8;
MA   /M: SY='L'; M= -8,-29,-31,-22,  9,-30,-21, 23,-27, 37, 20,-28,-28,-21,-20,-25, -8, 17,-20, -1;
MA   /M: SY='T'; M=  2, -1, -9, -9,-11,-17,-19,-10,-10,-13,-11,  1,-11, -9,-10, 23, 43,  0,-32,-12;
MA   /M: SY='A'; M= 45, -9,-19,-10,-20, -2,-15,-11,-10,-11,-10, -9,-11, -9,-19, 10,  1, -1,-21,-18;
MA   /M: SY='A'; M= 40, -9,-17, -8,-21,  5,-18,-14, -9,-13,-12, -8,-11, -9,-16,  9, -2, -5,-21,-21;
MA   /M: SY='H'; M=-18,  0,  0,  1,-21,-19, 89,-29, -8,-21, -1,  9,-19, 11,  0, -7,-17,-29,-30, 16;
MA   /M: SY='C'; M= -9,-18,-28,-29,-20,-29,-29,-29,-29,-20,-19,-18,-39,-29,-29, -9, -9, -9,-49,-29;
MA   /I: E1=0; IE=-105; DE=-105;
//
-> score/threshold
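
The «yes or no» nature of a pattern can be shown by translating the pattern [LIVM]-[ST]-A-[STAG]-H-C into a regular expression. The sketch below assumes the usual PROSITE pattern conventions (elements separated by `-`, `x` for any residue, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends); the converter is a simplified illustration, and the two test sequences are invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression (simplified)."""
    elements = pattern.rstrip(".").split("-")
    converted = []
    for element in elements:
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        converted.append(element)
    return "".join(converted)


PATTERN = "[LIVM]-[ST]-A-[STAG]-H-C"
REGEX = prosite_to_regex(PATTERN)
print(REGEX)

# Invented sequences: the first contains the motif, the second does not.
for seq in ("MKLTAAHCGG", "MKLTAPHCGG"):
    hit = re.search(REGEX, seq)
    print(seq, "yes" if hit else "no")
```

A pattern either matches or it does not; a profile instead yields a score that is compared with a threshold, which is why profiles can detect less conserved domains.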

HMM (PFAM)

Protein domain/family databases

•Contain biologically significant «patterns / profiles / HMMs» formulated in such a way that, with appropriate computational tools, they can rapidly and reliably determine to which known family of proteins (if any) a new sequence belongs
•Used as a tool to identify the function of uncharacterized proteins translated from genomic or cDNA sequences («functional diagnostic»)
•Either manually curated (e.g. PROSITE, Pfam A, PRINTS, SMART, TIGRFAM, etc.) or automatically generated (e.g. Pfam B, ProDom, DOMO)

Protein domain/family db

PROSITE: Patterns / Profiles
ProDom: Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS: Aligned motifs
Pfam: HMM (Hidden Markov Models)
SMART: HMM
TIGRfam: HMM
Superfamily: HMM
PIRSF (iProClass), Gene3D, Panther
DOMO: Aligned motifs
BLOCKS: Aligned motifs (PSI-BLAST)
CDD: Pfam and SMART -> A Conserved Domain Database and Search Service

InterPro

InterPro
www.ebi.ac.uk/interpro

•Search simultaneously many domain databases.
•Single set of documents linked to the various methods;
•Release 12.0 contains 12’542 entries and covers 77.4% of UniProtKB (~90% of UniProtKB/Swiss-Prot)

Scan InterPro
Example: GAL4_YEAST

http://www.ebi.ac.uk/integr8/EBI-Integr8-HomePage.do


Databases 5: proteomics

•Contain information obtained by 2D-PAGE: images of master gels and description of identified proteins
•Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
•Composed of image and text files
•Mass Spectrometry (MS) database: PRIDE
http://www.ebi.ac.uk/pride/


Databases 6: 3D structure

•Only one: PDB (Protein Data Bank),
•Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
•Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)

PDB: Protein Data Bank
www.rcsb.org/pdb/

•Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
•Associated with specialized programs that allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, RasMol).
•Currently there are ~29’500 structural data for about 8’000 different proteins, but far fewer protein families (highly redundant)!

PDB: example

HEADER    LYASE(OXO-ACID)  01-OCT-91  12CA  12CA 2
COMPND    CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II)  12CA 3
COMPND   2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A)  12CA 4
SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN  12CA 5
AUTHOR    S.K.NAIR,D.W.CHRISTIANSON  12CA 6
REVDAT   1 15-OCT-92 12CA 0  12CA 7
JRNL      AUTH   S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE  12CA 8
JRNL      TITL   ALTERING THE MOUTH OF A HYDROPHOBIC POCKET.  12CA 9
JRNL      TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE  12CA 10
JRNL      TITL 3 /II$ MUTANTS AT RESIDUE VAL-121  12CA 11
JRNL      REF    J.BIOL.CHEM. V. 266 17320 1991  12CA 12
JRNL      REFN   ASTM JBCHA3 US ISSN 0021-9258 071  12CA 13
REMARK   1  12CA 14
REMARK   2  12CA 15
REMARK   2 RESOLUTION. 2.4 ANGSTROMS.  12CA 16
REMARK   3  12CA 17
REMARK   3 REFINEMENT.  12CA 18
REMARK   3 PROGRAM PROLSQ  12CA 19
REMARK   3 AUTHORS HENDRICKSON,KONNERT  12CA 20
REMARK   3 R VALUE 0.170  12CA 21
REMARK   3 RMSD BOND DISTANCES 0.011 ANGSTROMS  12CA 22
REMARK   3 RMSD BOND ANGLES 1.3 DEGREES  12CA 23
REMARK   4  12CA 24
REMARK   4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL  12CA 25
REMARK   4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND,  12CA 26
REMARK   4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES.  12CA 27
………

PDB (cont.)

SHEET   3 S10 PHE  66 PHE  70 -1 O ASN  67 N LEU  60  12CA 68
SHEET   4 S10 TYR  88 TRP  97 -1 O PHE  93 N VAL  68  12CA 69
SHEET   5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS  94  12CA 70
SHEET   6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120  12CA 71
SHEET   7 S10 VAL 207 LEU 212  1 O ILE 210 N GLY 145  12CA 72
SHEET   8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211  12CA 73
SHEET   9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193  12CA 74
SHEET  10 S10 LYS  39 TYR  40  1 O LYS  39 N ALA 258  12CA 75
TURN    1 T1 GLN  28 VAL  31 TYPE VIB (CIS-PRO 30)  12CA 76
TURN    2 T2 GLY  81 LEU  84 TYPE II(PRIME) (GLY 82)  12CA 77
TURN    3 T3 ALA 134 GLN 137 TYPE I (GLN 136)  12CA 78
TURN    4 T4 GLN 137 GLY 140 TYPE I (ASP 139)  12CA 79
TURN    5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202)  12CA 80
TURN    6 T6 GLY 233 GLU 236 TYPE II (GLY 235)  12CA 81
CRYST1   42.700  41.700  73.000  90.00 104.60  90.00 P 21 2  12CA 82
ORIGX1   1.000000 0.000000 0.000000 0.00000  12CA 83
ORIGX2   0.000000 1.000000 0.000000 0.00000  12CA 84
ORIGX3   0.000000 0.000000 1.000000 0.00000  12CA 85
SCALE1   0.023419 0.000000 0.006100 0.00000  12CA 86
SCALE2   0.000000 0.023981 0.000000 0.00000  12CA 87
SCALE3   0.000000 0.000000 0.014156 0.00000  12CA 88
ATOM      1  N   TRP 5   8.519  -0.751  10.738  1.00 13.37  12CA 89
ATOM      2  CA  TRP 5   7.743  -1.668  11.585  1.00 13.42  12CA 90
ATOM      3  C   TRP 5   6.786  -2.502  10.667  1.00 13.47  12CA 91
ATOM      4  O   TRP 5   6.422  -2.085   9.607  1.00 13.57  12CA 92
ATOM      5  CB  TRP 5   6.997  -0.917  12.645  1.00 13.34  12CA 93
ATOM      6  CG  TRP 5   5.784  -0.209  12.221  1.00 13.40  12CA 94
ATOM      7  CD1 TRP 5   5.681   1.084  11.797  1.00 13.29  12CA 95
ATOM      8  CD2 TRP 5   4.417  -0.667  12.221  1.00 13.34  12CA 96
ATOM      9  NE1 TRP 5   4.388   1.418  11.515  1.00 13.30  12CA 97
ATOM     10  CE2 TRP 5   3.588   0.375  11.797  1.00 13.35  12CA 98
ATOM     11  CE3 TRP 5   3.837  -1.877  12.645  1.00 13.39  12CA 99
ATOM     12  CZ2 TRP 5   2.216   0.208  11.656  1.00 13.39  12CA 100
ATOM     13  CZ3 TRP 5   2.465  -2.043  12.504  1.00 13.33  12CA 101
ATOM     14  CH2 TRP 5   1.654  -1.001  12.009  1.00 13.34  12CA 102
…….

Coordinates of each atom
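
The ATOM records above are fixed-width: each field lives in a defined range of columns. The sketch below assumes the standard PDB column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, residue number in 23-26, x/y/z in 31-54, occupancy in 55-60, temperature factor in 61-66). Because the exact spacing of the slide's records was lost in extraction, the two example lines are rebuilt from the first two atoms with a helper; all function names are illustrative.

```python
import math


def format_atom_line(serial, name, res_name, res_seq, x, y, z, occupancy, b_factor):
    """Rebuild an ATOM record with the standard column widths (chain ID left blank)."""
    return "ATOM  {:>5} {:<4} {:>3}  {:>4}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
        serial, name, res_name, res_seq, x, y, z, occupancy, b_factor
    )


def parse_atom_line(line):
    """Slice one ATOM record by column position (0-based Python slices)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


# First two atoms of the example entry (values taken from the records above).
n_line = format_atom_line(1, " N", "TRP", 5, 8.519, -0.751, 10.738, 1.00, 13.37)
ca_line = format_atom_line(2, " CA", "TRP", 5, 7.743, -1.668, 11.585, 1.00, 13.42)

n_atom = parse_atom_line(n_line)
ca_atom = parse_atom_line(ca_line)

# Distance between the backbone N and CA atoms, in the same units as the coordinates.
distance = math.dist(
    (n_atom["x"], n_atom["y"], n_atom["z"]),
    (ca_atom["x"], ca_atom["y"], ca_atom["z"]),
)
print(n_atom["name"], ca_atom["name"], round(distance, 2))
```

Visualization programs read the same columns for every atom and draw the structure from these coordinates.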

The same PDB entry “visualized” with Chime

Industry of databases around PDB

-HSSP: Homology-derived secondary structure of proteins. http://www.sander.ebi.ac.uk/hssp/
-Structure classification
-CATH
-SCOP
-…
-Homology-derived 3D structure db: Swiss-Model Repository (SMR): Feb 2006: 675’000 models.


Databases 7: metabolic

•Contain information that describes enzymes, biochemical reactions and metabolic pathways;
•ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions;
•Metabolic databases: EcoCyc (specialized on Escherichia coli), KEGG, EMP/WIT;
Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes.

BRENDA
http://www.brenda.uni-koeln.de/

•There are about 3750 “EC numbers”
~ 1450 cannot be linked to any sequence!
Useful to prepare lab experiments!

http://www.genome.ad.jp/kegg


Databases 8: bibliographic

•Bibliographic reference databases contain citations and abstract information of published life science articles;
•Example: Medline
•Other more specialized databases also exist (e.g. Agricola http://agricola.nal.usda.gov/, EMBASE (not free)…).

Medline

•Comprehensive database of primary scientific literature in the biomedical area.
•More than 4,000 biomedical journals published in the United States and 70 other countries
•Contains over 15 million indexed citations from 1966 until now
•Citations prior to the mid-1960s are located in OLDMEDLINE.
•Contains links to biological db
–Many papers not dealing with humans are not in Medline!
–Before 1970, keeps only the first 10 authors!
–Not all journals have citations since 1966! (they go back…)
–Indexed by Google in 2004!

PubMed
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

•Maintained by the US National Library of Medicine.
•Allows access to the citations from MEDLINE and additional life science journals.
•Includes links to many sites providing full text articles and other related resources.
•Also gives access to:
–In Process Citations
–Publisher supplied citations: citations directly submitted to PubMed ([Record as supplied by publisher]).
•PMID (PubMed ID), UI (Medline ID)
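To illustrate how a PMID can be used from a script, here is a minimal sketch that builds request URLs for NCBI's E-utilities web service. E-utilities is not mentioned on the slide; it is assumed here as a scriptable route to the same Entrez data. The function names and the example PMID are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (assumption: not part of the original slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pubmed_search_url(term, retmax=20):
    """Build an ESearch URL that would return PMIDs matching a query term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def pubmed_fetch_url(pmids):
    """Build an EFetch URL that would return plain-text abstracts for the PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(pubmed_search_url("restriction enzyme"))
    print(pubmed_fetch_url([12345678]))  # hypothetical PMID, for illustration only
```

Building the URL separately from sending the request keeps the sketch easy to check by eye before any query is sent to the server.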


DOIs (Digital Object Identifiers) are names (characters and/or digits) assigned to objects of intellectual property such as electronic journal articles, images, learning objects, ebooks, any kind of content.
Server: http://dx.doi.org
-> biggest advance to track documents on the web !
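A DOI becomes a web address once it is appended to the resolver named above. The sketch below shows that step. The shape check is a loose assumption (a "10." prefix, a numeric registrant code, a slash, then a suffix) and not the full DOI syntax; the function name is hypothetical, and the example DOI is used only for illustration.

```python
import re
from urllib.parse import quote

# Resolver address as given on the slide.
RESOLVER = "http://dx.doi.org"

# Loose shape check (assumption): "10.", a registrant code, "/", then a suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Return the resolver URL for a DOI, or raise ValueError if it looks malformed."""
    doi = doi.strip()
    if not DOI_SHAPE.match(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode unusual characters in the suffix but keep the slash.
    return f"{RESOLVER}/{quote(doi, safe='/')}"


if __name__ == "__main__":
    print(doi_to_url("10.1000/182"))
```

Because the identifier stays the same when a publisher moves an article, only the resolver needs to know the current location of the document.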


Categories of databases for Life Sciences

•Sequences (DNA, protein)
•Genomics
•Mutation/polymorphism
•Protein domain/family (----> tools)
•Proteomics (2D gel, Mass Spectrometry)
•3D structure
•Metabolism
•Bibliography
•‘Others’


Databases 9: others

•There are many databases that cannot be classified in the categories listed previously;
•Examples: REBASE (restriction enzymes), TRANSFAC (transcription factors), CarbBank, GlycoSuiteDB (linked sugars), Protein-protein interactions db (IntAct, …), Protease db (MEROPS), biotechnology patents db, etc.;
•As well as many other resources concerning any and new aspects of macromolecules and molecular biology (Microarrays).


Interactome - Protein/protein interaction:
description from 1 to more than 20’000 interactions / publication
-Several databases: IntAct, BIND, DIP.
-Proteomics Standards Initiative since 2005
http://www.ebi.ac.uk/intact/index.html


Gene Ontology (GO) database

The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the automated annotation of genes, gene products and sequences.

The three organizing principles of GO are molecular function (MF), biological process (BP) and cellular component (CC).

The GO terms are good but the applications are bad -> mapping of the GO terms to proteins and genes is done automatically, often according only to the presence of a specific domain (only 5 % of the human genes have correct GO terms !)
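One practical way to handle the caveat above is to separate automatically assigned terms from experimentally supported ones. The sketch below assumes each annotation carries a GO evidence code, which the slide does not mention: IEA means "Inferred from Electronic Annotation" and IDA means "Inferred from Direct Assay". The gene names and the pairing of terms with genes are invented for illustration; only the GO identifiers and term names are real.

```python
from collections import namedtuple

Annotation = namedtuple("Annotation", "gene go_id term aspect evidence")

# Illustrative records only: hypothetical genes, not real annotation data.
ANNOTATIONS = [
    Annotation("geneA", "GO:0003677", "DNA binding", "MF", "IEA"),
    Annotation("geneA", "GO:0005634", "nucleus", "CC", "IDA"),
    Annotation("geneB", "GO:0006355",
               "regulation of DNA-templated transcription", "BP", "IEA"),
]

# Evidence codes treated as "automatic" in this sketch (assumption).
ELECTRONIC = {"IEA"}


def curated_only(annotations):
    """Keep annotations whose evidence code is not purely electronic."""
    return [a for a in annotations if a.evidence not in ELECTRONIC]


def by_aspect(annotations):
    """Group GO identifiers under the three GO aspects: MF, BP and CC."""
    grouped = {"MF": [], "BP": [], "CC": []}
    for a in annotations:
        grouped[a.aspect].append(a.go_id)
    return grouped


if __name__ == "__main__":
    for a in curated_only(ANNOTATIONS):
        print(a.gene, a.go_id, a.term, a.evidence)
    print(by_aspect(ANNOTATIONS))
```

Filtering on the evidence code does not make an annotation correct, but it shows which terms rest only on automated mapping.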


Proliferation of databases

•Which contains the highest quality data ?
•Which is the most comprehensive ?
•Which is the most up-to-date ?
•Which is the least redundant ?
•Which is the most indexed (allows complex queries) ?
•Which Web server responds most quickly ?
•…….??????


To benefit from the data stored in a database, we need:
•easy access to the information
-> a method for extracting only that information needed to answer a specific biological question
…now Medline is indexed by Google….but the others ?..
Examples: Entrez (NCBI), SRS (Europe), tools such as BLAST, PeptIdent…


Some important practical remarks

•Databases: many errors (automated annotation) !
•Not all db are available on all servers
•The update frequency is not the same for all servers;
•Some servers automatically add cross-references to an entry (implicit links) in addition to already existing links (explicit links)… different looks…
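The difference between explicit and implicit links can be sketched as follows. This is an interpretation, assuming that explicit links are stored in the entry itself while implicit links are computed by a server from the accession number. All database names, accession numbers and URLs below are hypothetical.

```python
# Hypothetical entry: cross-references stored in the record itself (explicit links).
ENTRY = {
    "accession": "X00001",
    "explicit_links": {"DatabaseA": "A123", "DatabaseB": "B456"},
}

# Hypothetical server-side templates: links computed from the accession alone.
IMPLICIT_TEMPLATES = {
    "DatabaseC": "https://example.org/dbc?acc={accession}",
    "DatabaseD": "https://example.org/dbd/{accession}",
}


def links_shown_by_server(entry, templates):
    """Return (database, target, kind) rows as one particular server might display them."""
    rows = [(db, ref, "explicit")
            for db, ref in sorted(entry["explicit_links"].items())]
    for db, template in sorted(templates.items()):
        # Add a computed link only when the entry has no stored one for that database.
        if db not in entry["explicit_links"]:
            target = template.format(accession=entry["accession"])
            rows.append((db, target, "implicit"))
    return rows


if __name__ == "__main__":
    for row in links_shown_by_server(ENTRY, IMPLICIT_TEMPLATES):
        print(row)
```

Two servers with different templates would display the same entry with different sets of links, which is one reason the same entry can look different from one server to another.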


Before the introduction to databases…

After the introduction to databases…