MCB - February 06
EMBnet
Introduction to Bioinformatics
An introduction to biological databases
Marie-Claude.Blatte
What is a database?
• A collection of data that is
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
• Also includes the associated tools (software) necessary for db access/query, db updating, db information insertion, db information deletion…
Why biological databases?
• Exponential growth in biological data.
• Data (nucleic acid sequences (DNA, RNA), protein sequences, 3D structures, 2D gel analysis, MS analysis, microarrays, protein-protein interactions…) are no longer published in a conventional manner, but directly submitted to databases.
• Essential tools for biological research.
Distribution of databases
• Books, articles: 1968 -> 1985
• Computer tapes: 1982 -> 1992
• Floppy disks: 1984 -> 1990
• CD-ROM: 1989 -> ?
• FTP: 1989 -> ?
• On-line services: 1982 -> 1994
• WWW: 1993 -> ?
• DVD: 2001 -> ?
Some statistics and remarks
• More than 1000 different ‘biological’ databases
• Variable size: <100 Kb to >10 Gb
– DNA: > 10 Gb
– Protein: 1 Gb
– 3D structure: 5 Gb
– Other: smaller
• Update frequency: daily to annually
• How to find them?
– Links to many other molecular biology databases (ExPASy): http://www.expasy.org/links.html#Proteins
– Biohunt: http://www.expasy.org/BioHunt/
– Google: http://www.google.com/
! Database home server !
http://www.expasy.org/
The ten important biological databases*
GenBank/DDBJ/EMBL     www.ncbi.nlm.nih.gov    Nucleotide sequences
Ensembl               www.ensembl.org         Human/mouse genome
PubMed                www.ncbi.nlm.nih.gov    Literature references
NR (Entrez protein)   www.ncbi.nlm.nih.gov    Protein sequences
Swiss-Prot            www.expasy.org          Protein sequences
InterPro              www.ebi.ac.uk           Protein domains
OMIM                  www.ncbi.nlm.nih.gov    Genetic diseases
Enzymes               www.expasy.org          Enzymes
PDB                   www.rcsb.org/pdb/       Protein structures
KEGG                  www.genome.ad.jp        Metabolic pathways
* according to «Bioinformatics for Dummies»
Categories of databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family (----> tools)
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolism
• Bibliography
• ‘Others’ (Microarrays, Protein-protein interaction…)
Yes, if you train quickly, you can create a new database, but first eat your dinner!
Ideal minimal content of a sequence database entry
• Sequences !!
• Accession number (AC) (unique identifier)
• Taxonomic data
• References
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation
Database 1a: nucleotide sequences
• The 3 main public nucleic acid sequence databases are EMBL (Europe) / GenBank (USA) / DDBJ (Japan): «different views of the same data set», synchronized within 2 to 3 days (since 1990)
• EMBL: since 1982
• Specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)
• 3D structure (DNA and RNA) -> PDB
• Others: Aberrant splicing db; Eukaryotic promoter db (EPD); RNA editing sites; Multimedia Telomere Resource…
http://www.expasy.org/links.html#DNA
The hectic life of a sequence…
cDNAs, ESTs, genes, genomes, …
EMBL, GenBank, DDBJ
Data not submitted to public databases*, delayed or cancelled…
* REMARK: Journals do not accept a paper dealing with a sequence if the EMBL/GenBank/DDBJ AC number is not available…
http://www.insdc.org/
EMBL/GenBank/DDBJ
• Serve as archives
• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)
• Currently: 46x10^6 sequences, ~80x10^9 bp
• Sequences from > 80’000 different species
• Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %
The tremendous increase in nucleotide sequences
1980: 80 genes fully sequenced!
More than 80’000 species, but…
Human/Mouse/Rat: organisms with the highest redundancy!
New projects: environmental sequences (no taxonomic information)
[Figure labels: RNA, DNA; human, mouse, rat]
http://www3.ebi.ac.uk/Services/DBStats/
Molecule types (… DNA vs RNA: quality of the protein sequence… (in Eukaryota))
small nucleolar RNA
signal recognition particle RNA component
Taxonomy
Cross-references
References
Keyword
Annotation (prediction or experimentally determined)
Sequence
CDS: CoDing Sequence (proposed by submitters)
The hectic life of a sequence…
cDNAs, ESTs, genes, genomes, …
EMBL, GenBank, DDBJ
Data not submitted to public databases*, delayed or cancelled…
with or without annotated CDS provided by authors
CDS: CoDing Sequence, the portion of DNA/RNA translated into protein (from Met to STOP)
Experimentally proved or derived from gene prediction
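The definition of a CDS given above (from Met to STOP) can be illustrated with a minimal sketch. It assumes the standard genetic code, starts at the first ATG it finds and reads codons in that frame until the first stop codon; the function name `translate_cds` is a hypothetical helper, not part of any database tool, and real CDS annotation also has to deal with introns, strand and alternative start sites.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna: str) -> str:
    """Translate from the first ATG (Met) to the first in-frame stop codon.

    Hypothetical helper for illustration only: no introns, forward strand only.
    Unknown codons (e.g. containing N) are rendered as 'X'.
    """
    dna = dna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""  # no initiator Met found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":  # STOP codon ends the CDS
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("CCATGGCTTAAGG"))
```

In this toy input the reading frame is ATG GCT TAA, so only two residues are translated before the stop codon.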
An important annotation (-> CDS)
EMBL/GenBank/DDBJ
CONTIG --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG
Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG
*** ************ ** * **************
CONTIG CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG-------------------------------------------------------
Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA
************************************ ****************************
CONTIG ------------------------------------------------------------------------------------------------------------------------
Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT
CONTIG ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC
Genomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC
**************************************************************************
CONTIG TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC
Genomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC
**************************** ***********************************************
CONTIG CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA
Genomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA
************************************************************************************************************************
CONTIG TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG-----------------------------------------------------------------------------
Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT
********************************** ********
CONTIG -------------------------------------------------------------------------------------------------------------------GNAAA
Genomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA
****
CONTIG GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGATAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC
Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGATAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC
******************************************* * ************** ******** ***** **** * *********** ***************************
CONTIG C-----------------------------------------------------------------------------------------------------------------------
Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA
*
Alignment between an mRNA (CONTIG) and a genomic sequence
[Figure labels: exon, intron; CoDing Sequence]
EMBL/GenBank/DDBJ
CDS provided by the submitters
Translation provided by EMBL
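The exon/intron reading of such an mRNA-versus-genomic alignment can be sketched in a few lines. This is a minimal illustration, assuming that the mRNA row of the alignment uses '-' for gaps and that any gap run of at least `min_intron` columns is an intron; both the function name `exon_blocks` and the threshold of 20 are assumptions made for the example, not values taken from the slides.

```python
def exon_blocks(mrna_row: str, min_intron: int = 20):
    """Return (start, end) alignment columns of the aligned mRNA blocks.

    Columns are 0-based, end exclusive. Short gap runs (< min_intron) are
    treated as sequencing or alignment gaps inside an exon; long gap runs
    are treated as introns separating two exons. Leading and trailing gaps
    are ignored.
    """
    blocks = []
    start = None      # first column of the current block
    last_base = None  # last non-gap column seen so far
    gap_run = 0       # length of the current run of '-'
    for i, ch in enumerate(mrna_row):
        if ch == "-":
            gap_run += 1
            continue
        if start is None:
            start = i
        elif gap_run >= min_intron:
            blocks.append((start, last_base + 1))
            start = i
        gap_run = 0
        last_base = i
    if start is not None:
        blocks.append((start, last_base + 1))
    return blocks


if __name__ == "__main__":
    row = "AAAA" + "-" * 25 + "CCC-CC" + "-" * 30
    print(exon_blocks(row))
```

With the toy row above, the single-column gap stays inside the second block while the 25-column gap splits the row into two blocks.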
A major problem: the submission of annotated CDSs
Example: the genome of the chimpanzee
The (genomic) sequence is complete and has been submitted to EMBL.
Comparative genomic analyses have been done and the CDSs identified, but these CDSs have not been submitted!
-> only 1’505 CDSs are available so far (in UniProtKB)!
EMBL/GenBank/DDBJ
A sort of sequence museum, where sequences are preserved for eternity as they were determined, interpreted and published originally by their authors (primary sequence repository).
The authors have full authority over the content of the entries they submit! (editorial control of the content belongs to the authors) (exception: TPA, since January 2003)
Submission: FTP, email, Webin, etc…
Protein sequence derived from the translation of a vector contamination
EMBL/GenBank/DDBJ
• Unexpected information you can find in these db:
FT source 1..124
FT /db_xref="taxon:4097"
FT /organelle="plastid:chloroplast"
FT /organism="Nicotiana tabacum"
FT /isolate="Cuban cahibo cigar, gift from President Fidel
FT Castro"
• Or:
FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
FT CDS complement(45959..47332)
FT /db_xref="SPTREMBL:Q9UZ71"
FT /note="PAB2386"
FT /transl_table=11
FT /product="4-AMINOBUTYRATE qui se dilate AMINOTRANSFERASE
FT (EC 2.6.1.19)"
FT /protein_id="CAB50188.1"
FT /translation="MDYPRIVVNPPGPKAKELIEREKRVLSTGIGVKLFPLVPKRGFGP
FT FIEDVDGNVFIDFLAGAAAASTGYSHPKLVKAVKEQVELIQHSMIGYTHSERAIRVAEK
FT LVKISPIKNSKVLFGLSGSDAVDMAIKVSKFSTRRPWILAFIGAYHGQTLGATSVASFQ
FT VSQKRGYSPLMPNVFWVPYPNPYRNPWGINGYEEPQELVNRVVEYLEDYVFSHVVPPDE
FT VAAFFAEPIQGDAGIVVPPENFFKELKKLLDEHGILLVMDEVQTGIGRTGKWFASEWFE
FT VKPDMIIFGKGVASGMGLSGVIGREDIMDITSGSALLTPAANPVISAAADATLEIIEEE
FT NLLKNAIEVGSFIMKRLNELKEQFDIIGDVRGKGLMIGVEIVKENGRPDPEMTGKICWR
FT AFELGLILPSYGMFGNVIRITPPLVLTKEVAEKGLEIIEKAIKDAIAGKVERKVVTWH"
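A CDS location such as complement(45959..47332) means that the coding sequence lies on the reverse strand between those two 1-based, inclusive coordinates. The sketch below shows one way to pull such a region out of a sequence. It is a simplified illustration that only handles plain `start..end` and `complement(start..end)` locations (not `join(...)` or partial locations), and the names `extract_feature` and `reverse_complement` are hypothetical helpers.

```python
import re

# Complement table for unambiguous DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Matches "12..34" or "complement(12..34)" only; an assumption of this sketch.
LOCATION = re.compile(r"^(complement\()?(\d+)\.\.(\d+)\)?$")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_feature(genome: str, location: str) -> str:
    """Extract the subsequence described by a simple feature-table location."""
    match = LOCATION.match(location.replace(" ", ""))
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(2)), int(match.group(3))
    region = genome[start - 1:end]  # feature coordinates are 1-based, inclusive
    return reverse_complement(region) if match.group(1) else region


if __name__ == "__main__":
    toy = "ATGCCGTA"
    print(extract_feature(toy, "2..4"))
    print(extract_feature(toy, "complement(2..4)"))
```

The extracted nucleotides would then be translated with the genetic code named by /transl_table (11 in the entry above).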
Another major issue for the protein sequence databases…
Environmental sequences (division ‘ENV’)
Aim: to sequence all DNA present in a given sample, without knowing which species the DNA is derived from
- Sargasso Sea (Craig Venter)
- human fluids
- earth
No idea of the species… (microbial population…)
No idea of the gene prediction program to be used…
No idea of the genetic code to be used for translation!!!!!
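Why the choice of genetic code matters can be shown with a small sketch: the same DNA gives different proteins under different translation tables. The example assumes three NCBI translation tables (1: standard; 2: vertebrate mitochondrial; 4: mold/protozoan mitochondrial and Mycoplasma) and encodes the variants as overrides of the standard table; the toy sequence and the function name `translate` are invented for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)
}

# Variant codes expressed as differences from the standard table.
TABLES = {
    1: STANDARD,
    2: {**STANDARD, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"},
    4: {**STANDARD, "TGA": "W"},
}


def translate(dna: str, table_id: int) -> str:
    """Translate frame 1 of `dna` until the first stop codon of the chosen code."""
    table = TABLES[table_id]
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    toy = "ATGTGAAGATAA"  # codons: ATG TGA AGA TAA
    for table_id in (1, 2, 4):
        print(table_id, translate(toy, table_id))
```

Here TGA is a stop codon in the standard code but codes for Trp in tables 2 and 4, and AGA is Arg in tables 1 and 4 but a stop in table 2, so the three predicted proteins have different lengths.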
Nucleic acids databases: the problems
- Archive -> highly redundant
- Similarity searches are not obvious…
- Author’s authority -> variable level of annotation quality
  - e.g. gene/protein name attribution…
- Variable level of sequence quality
  - Sequencing quality
  - Gene prediction quality
The second ‘generation’ of nucleotide sequence databases
Gene-centric databases
All the sequence information relevant to a given gene is made accessible at once
e.g. Entrez Gene/RefSeq
Genome-centric databases
Information about gene sequence, relative position, strand orientation, biochemical functions…
Information management systems that are able to connect specialized sequence collections and browsing tools
e.g. Ensembl, TIGR
Entrez Gene / RefSeq (NCBI)
Database with gene-specific information, which focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.
The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI.
The corresponding sequences are available thanks to cross-links to RefSeq or other sequence databases.
Interactions
Gene ontology
Links to all the sequences found in EMBL/GenBank/DDBJ corresponding to this gene
Links to RefSeq
Accession numbers
- for RNA (NM_)
- for genomic (NT_)
- for protein (NP_)
Entrez Gene is tightly linked to RefSeq («interdependent curated resources»)
RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
1’899’454 entries (20-Sep-2005); 3060 species.
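The accession prefixes listed above make it possible to tell the molecule type of a RefSeq record from its identifier alone. A minimal sketch, covering only the three prefixes named on this slide (RefSeq defines more); the function name and the accession numbers used below are made up for illustration.

```python
# Prefix -> molecule type, limited to the three prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NM_": "RNA",
    "NT_": "genomic",
    "NP_": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix.

    Hypothetical helper: anything outside the three known prefixes is
    reported as 'unknown' rather than guessed.
    """
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


if __name__ == "__main__":
    # Illustrative, invented accession numbers.
    for acc in ("NM_000000", "NT_000000", "NP_000000", "P00000"):
        print(acc, "->", refseq_molecule_type(acc))
```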
A view of the redundancy…
A RefSeq entry
Working with whole genome databases:
Genome-centric databases
«Browsing resources»
Remark: Genome-centric databases usually give access to several genomes, but some are «specialized» in particular organisms, e.g. TIGR: bacteria and plants
http://www.ensembl.org/index.html
Telomere 5’
Ensembl provides a bioinformatics framework to organise biology around the sequences of a selection of large genomes.
Ensembl/martview: examples of queries
- Retrieve all mouse homologues of human disease genes containing transmembrane domains located between 1p22 and 1q22
- Retrieve the sequences 5 kb upstream of all human «known» genes from chromosome 6
….
Major problem of this type of ‘database’: the update frequency…
..and plants
http://www.tigr.org/tdb/
Database 1b: Protein sequences
The hectic life of a protein sequence…
cDNAs, ESTs, genomes, …
EMBL, GenBank, DDBJ (with or without annotated CDS)
Data not submitted to public databases, delayed or cancelled…
TrEMBL, GenPept: CoDing Sequences provided by submitters
RefSeq (XP_NNNNN): CoDing Sequences provided by submitters and «de novo» gene prediction
Swiss-Prot: manually annotated
PRF: sequences derived from scientific publications
PDB: 3D structures
PIR: Protein Information Resource
UniProt: Swiss-Prot + TrEMBL + (PIR)
NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
2 major ‘resources’ (general protein sequence databases)
UniProtKB
NCBI-nr (Entrez protein)
Protein sequence databases
The UniProt pathway (Universal Protein Resource)
The UniProt components
UniProtKB Release 6.0 consists of:
UniProt Knowledgebase
– UniProtKB/Swiss-Prot: manually annotated protein sequences. Release 48.1 of 27-Sep-2005: 195’058 entries / 9’479 species
– UniProtKB/TrEMBL: computer-annotated protein sequences. Release 31.1 of 27-Sep-2005: 2’151’724 entries / ~95’000 species
UniRef100, UniRef90, UniRef50 (independent of species)
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
UniProt Archive (UniParc)
Archived raw protein sequences found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…
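The idea behind the UniRef100/90/50 levels, grouping sequences whose identity reaches a threshold, can be illustrated with a toy greedy clustering. This is only a sketch under strong simplifying assumptions: identity is computed position by position without any alignment, the longest sequence of each cluster acts as its representative, and fragments are not merged into the sequences that contain them. It is not the procedure actually used to build UniRef, and all names and sequences are invented.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions, without alignment (toy measure)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold: float):
    """Greedy clustering: each sequence joins the first cluster whose
    representative (its first, longest member) it matches at >= threshold."""
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    toy = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFAAAAA", "WWWWWWWWWW"]
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, len(greedy_cluster(toy, threshold)), "clusters")
```

Lowering the threshold merges more sequences per cluster, which is why a search against a 50% set returns fewer, more diverse hits than a search against a 100% set.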
The UniProt groups (Antibes, September 2004)
In Geneva (SIB):
1 Group Leader
42 Annotators
4 Prosite annotators
18 Programmers and Researchers
5 Administrators, science communicators
3 System Administrators
4 Students
------------------
77 people
At EBI: (Swiss-Prot + EMBL + TrEMBL)
75 people (29 Annotators)
At PIR:
1 Group Leader
13 Protein Science Team
12 Informatics Team
------------------
26 people
The hectic life of a protein sequence…
Nucleic acids
cDNAs, ESTs, genomes, …
EMBL (with or without annotated CDS)
Data not submitted to public databases, delayed or cancelled…
Amino acids
TrEMBL: CoDing Sequences provided by submitters*
Swiss-Prot: manually annotated
Direct submission (< 1%), PIR data
* About 1 EMBL entry in 10 is associated with an annotated CDS
From TrEMBL to Swiss-Prot
EMBL -> CDS -> TrEMBL -> Swiss-Prot
In order to avoid redundancy, once a TrEMBL entry has been manually annotated and integrated into Swiss-Prot, it is removed from TrEMBL.
Literature information (more than 1500 journals cited)
Sequence check and analysis
High performance bioinformatics tools
Databases and external scientific expertise
In a UniProtKB/Swiss-Prot entry, you can expect to find:
• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, etc.;
• Numerous cross-references;
• Selected keywords;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A (often corrected) protein sequence and the description of various isoforms/variants.
View «by default» on the ExPASy server
References: RN, RP, RC, RX, RA, RL lines
Comments: CC lines
Features: FT lines
Sequence: SQ lines
Names and taxonomy: DE, GN, OC, OS, OG lines
Cross-references: DR lines
Keywords: KW lines
Accession number: ID, AC, DT lines
Sequence quality
Sequencing errors? Polymorphisms?
Alternative splicing?
Alternative initiation?
Usage of an alternative promoter? RNA editing?
Selenocysteine?
Fragment?
Same gene?
-> 1 gene / 1 species = 1 Swiss-Prot entry
For human: ~4.7 different independent sequence reports / gene
-> Identification and annotation of all sequence differences
- 13 sequences (complete or partial)
- derived from mRNA (n=6) or genomic DNA (n=7)
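The two-letter line codes listed above (ID, AC, DE, OS, KW, FT, SQ, …) make a flat-file entry easy to split into sections. The sketch below assumes the classic layout in which the code occupies the first two columns and the content starts at column 6, with `//` terminating the entry. The entry text and the function name are invented for illustration and do not correspond to a real Swiss-Prot record.

```python
from collections import defaultdict

# A made-up entry in flat-file style; every value here is illustrative only.
EXAMPLE_ENTRY = "\n".join([
    "ID   EXAMPLE_HUMAN   STANDARD;   PRT;   10 AA.",
    "AC   P00000;",
    "DE   Hypothetical example protein.",
    "OS   Homo sapiens (Human).",
    "KW   Hypothetical.",
    "SQ   SEQUENCE   10 AA;",
    "     MDYPRIVVNP",
    "//",
])


def group_by_line_code(entry_text: str) -> dict:
    """Group the lines of one flat-file entry by their two-letter line code.

    Lines whose code columns are blank (the sequence data after SQ) are
    attached to the most recent line code.
    """
    sections = defaultdict(list)
    current = None
    for line in entry_text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code = line[:2].strip()
        if code:
            current = code
        if current is not None:
            sections[current].append(line[5:].rstrip())
    return dict(sections)


if __name__ == "__main__":
    for code, lines in group_by_line_code(EXAMPLE_ENTRY).items():
        print(code, lines)
```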
Multiple alignment of the end of the available GCR sequences
Annotation of the sequence differences
From genome to proteome: increase in complexity
~25’000 human genes
-> alternative splicing of mRNA (2-5 fold increase) -> ~100’000 human transcripts
-> post-translational modifications of proteins (PTMs) (5-10 fold increase) -> ~1'000'000 human proteins
and…
• The proteomics community is faced with a huge task:
» It needs to confirm, by identification, many potential splice isoforms and initiation sites;
» It needs to help to characterize the extent of PTMs on the majority of human proteins and to address the fluctuation of these PTMs over time and space.
Swiss-Prot + TrEMBL
-> give access to all known* protein sequences
* submitted to the public databases (EMBL, GenBank, DDBJ, Swiss-Prot)
Swiss-Prot & TrEMBL introduce a new arithmetical concept!
210’000 (~10’000 species) + 2’600’000 (~100’000 species) ≈ 2’200’000
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot
Redundancy is going to decrease: «new» genome sequencing -> «new» proteins (Amos Bairoch, Sept 2002)
In the case of human proteins, the redundancy is still very high:
~13’000 + ~58’000 ≈ about 22’000*
* human gene number estimation: < 25’000
Missing sequences:
• Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
• Not yet predicted or known genes ("no CDS provided by the submitters" or no DNA sequence)
• Confidential data (patent application sequences)
• Immunoglobulins, T-cell receptors (-> UniParc)
• …
Take home message
• Be aware of the differences between UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
– Computer vs. Human
– Redundant vs. Non-redundant
• We need your feedback and your expertise! swiss-prot@expasy.org
1986-2006
Swiss-Prot: Alive and Kicking!
You are welcome to our 20th anniversary meeting in Brazil this year
http://www.swissprot20.org/
Blast against UniRef100, 90 and 50
http://www.expasy.org/tools/blast/
Sequence of human erythropoietin
By default
Blast against UniRef100
Blast against UniRef90
Blast against UniRef50
Query = accession number ‘only’
Righting the wrongs or "Nobody's perfect"
“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10’000”
“Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”
C. Hardley, EMBO reports, 4(9), 2003.
The NCBI-nr pathway (Entrez protein)
2 major pathways (general protein sequence databases)
UniProtKB: Swiss-Prot + TrEMBL + (PIR)
NCBI-nr: Swiss-Prot + GenPept + (PIR) + PDB + PRF + RefSeq
Protein sequences: «NR database», Entrez protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
– GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL, except that it is redundant with Swiss-Prot
– PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProtKB)
– PDB: 3D structure database: all the protein sequences which have been crystallized (UniProtKB is cross-linked to PDB) + mutated protein sequences + chimeric proteins (no matches with UniProtKB sequences)
– PRF: sequences derived from scientific publications, «Journal scan» (integrated into TrEMBL)
– RefSeq: derived from GenBank/EMBL/DDBJ sequences + predicted protein sequences
Query at Entrez protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Typical result of a query at «Entrez protein»
RefSeq
Swiss-Prot
GenPept (gb/embl/ddbj)
PDB
http://www.pir.uniprot.org/search/idmapping.shtml
Categories of databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family (----> tools)
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolism/Pathways
• Bibliography
• Gene ontology (GO)
• ‘Others’ (Microarrays, Protein-protein interaction…)
MIM / OMIM
• OMIM™: Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• contains a summary of literature and reference information. It also contains links to publications and sequence information.
…and plants, Fungi, Bacteria, Archaea, Viruses and phages…
http://flybase.bio.indiana.edu/
Mutation/polymorphism: definitions
• SNPs: single nucleotide polymorphisms; occur approximately once every 100 to 300 bases (distinction between sequencing error and polymorphism!)
• c-SNPs: coding single nucleotide polymorphisms (Single Nucleotide Polymorphisms within cDNA sequences)
• SAPs: single amino-acid polymorphisms
• Missense mutation: -> SAP
• Nonsense mutation: -> STOP
• Insertion/deletion of nucleotides -> frameshift…
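The definitions above can be turned into a small classifier. This sketch assumes the standard genetic code and looks at a single codon: a substitution is called missense if the amino acid changes (-> SAP) and nonsense if it creates a stop codon. The extra labels "silent", "stop-lost" and "in-frame" do not appear on the slide; they are added here only so that every case has a name. Function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base substitution at `position` (0, 1 or 2) of a codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "silent"      # same amino acid; not a SAP
    if after == "*":
        return "nonsense"    # amino acid -> STOP
    if before == "*":
        return "stop-lost"   # STOP -> amino acid
    return "missense"        # amino acid change -> SAP


def classify_indel(length: int) -> str:
    """An insertion/deletion shifts the frame unless its length is a multiple of 3."""
    return "frameshift" if length % 3 else "in-frame"


if __name__ == "__main__":
    print(classify_substitution("GAA", 2, "G"))  # GAA -> GAG
    print(classify_substitution("GAA", 1, "T"))  # GAA -> GTA
    print(classify_substitution("GAA", 0, "T"))  # GAA -> TAA
    print(classify_indel(1), classify_indel(3))
```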
Databases 3: mutation/polymorphism
• Contain information on sequence variations linked or not to genetic diseases;
• Mainly human, but: OMIA - Online Mendelian Inheritance in Animals
• General db:
– OMIM
– HGMD - Human Gene Mutation db
– SVD - Sequence variation db
– HGBASE - Human Genic Bi-Allelic Sequences db
– dbSNP - Human single nucleotide polymorphism (SNP) db
• Disease-specific db: most of these databases are either linked to a single gene or to a single disease;
– p53 mutation db
– ADB - Albinism db (Mutations in human genes causing albinism)
– Asthma and Allergy gene db
– ….
For human
Mutation/polymorphism
• No single source for all SNPs (~100 SNP db)!
• Generally modest size; the lack of coordination and format standards in these databases makes it difficult to access the data.
• ! Numbering of the mutated amino acid depends on the db (aa no 1 is not necessarily the initiator Met!)
• There are initiatives to unify these databases (political/funding problems)
Mutation Database Initiative (4th July 1996).
-> SVD - Sequence Variation Database project at EBI (HMutDB)
http://www.ebi.ac.uk/mutations/central/
-> HUGO Mutation Database Initiative (MDI). Human Genome Variation Society
http://www.genomic.unimelb.edu.au/mdi/dblist/dblist.html
Protein domain/family: some definitions
•Most proteins have «modular» structures
•Estimation: ~ 3 domains / protein
Protein domain/family: some definitions
•Domains (conserved sequences or structures) are identified by multiple sequence alignments
•Domains can be defined by different methods:
–Pattern (regular expression); used for very conserved domains
–Profiles (weighted matrices): two-dimensional tables of position-specific match-, gap-, and insertion-scores, derived from aligned sequence families; used for less conserved domains
–Hidden Markov Model (HMM); probabilistic models; another method to generate profiles.
Pattern-Profile
•Pattern: [LIVM]-[ST]-A-[STAG]-H-C
Yes or no
•Profile:
ID   TRYPSIN_DOM; MATRIX.
AC   PS50240;
DT   DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2001 (INFO UPDATE).
DE   Serine proteases, trypsin domain profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=234;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=229;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=0.0169; R2=0.00836256; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=1134; N_SCORE=9.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=775; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: M0=-9; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;
MA   /I: B1=0; BI=-105; BD=-105;
MA                     A   B   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
MA   /M: SY='I'; M= -8,-29,-34,-26,  3,-34,-24, 34,-26, 19, 15,-24,-21,-21,-24,-19, -8, 25,-19,  3;
MA   /M: SY='N'; M=  0, 14, 10,  1,-22, -1,  6,-23, -4,-26,-17, 20,-14, -1, -6, 13,  2,-20,-34,-15;
MA   /M: SY='E'; M= -4,  4,  7, 14,-26,-13, -7,-23,  3,-22,-16,  2,  7,  3, -3,  2, -2,-21,-30,-18;
MA   /M: SY='R'; M=-12,  5,  5,  7,-23,-17,  3,-24,  8,-20,-12,  7,-16, 10, 12, -2, -6,-21,-27, -9;
MA   /M: SY='W'; M=-16,-33,-35,-27, 13,-22,-24,-11,-18,-13,-13,-31,-27,-20,-18,-30,-21,-18, 97, 25;
MA   /M: SY='V'; M=  1,-29,-31,-28, -1,-30,-29, 31,-22, 13, 11,-27,-27,-26,-22,-12, -2, 41,-27, -8;
MA   /M: SY='L'; M= -8,-29,-31,-22,  9,-30,-21, 23,-27, 37, 20,-28,-28,-21,-20,-25, -8, 17,-20, -1;
MA   /M: SY='T'; M=  2, -1, -9, -9,-11,-17,-19,-10,-10,-13,-11,  1,-11, -9,-10, 23, 43,  0,-32,-12;
MA   /M: SY='A'; M= 45, -9,-19,-10,-20, -2,-15,-11,-10,-11,-10, -9,-11, -9,-19, 10,  1, -1,-21,-18;
MA   /M: SY='A'; M= 40, -9,-17, -8,-21,  5,-18,-14, -9,-13,-12, -8,-11, -9,-16,  9, -2, -5,-21,-21;
MA   /M: SY='H'; M=-18,  0,  0,  1,-21,-19, 89,-29, -8,-21, -1,  9,-19, 11,  0, -7,-17,-29,-30, 16;
MA   /M: SY='C'; M= -9,-18,-28,-29,-20,-29,-29,-29,-29,-20,-19,-18,-39,-29,-29, -9, -9, -9,-49,-29;
MA   /I: E1=0; IE=-105; DE=-105;
//
score/threshold
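The difference between the two approaches (a pattern answers yes or no, a profile gives a score that is compared with a threshold) can be sketched in Python. This is only an illustration under stated assumptions: the translator below handles just the simplest PROSITE pattern elements, the function names are invented for this example, and the linear normalisation (normalised score = R1 + R2 × raw score) is inferred from the NORMALIZATION and CUT_OFF lines of the profile shown on the slide, whose numbers it reproduces.

```python
import re

_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {AB} = any residue except A or B
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # repetition, e.g. x(2) or x(2,4)
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def normalized_score(raw_score, r1=0.0169, r2=0.00836256):
    """Linear normalisation using R1 and R2 from the profile on the slide."""
    return r1 + r2 * raw_score


def match_level(raw_score, cutoffs=((1134, "!"), (775, "?"))):
    """Return the label of the highest raw-score cut-off reached, or None."""
    for cutoff, label in cutoffs:
        if raw_score >= cutoff:
            return label
    return None


if __name__ == "__main__":
    regex = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C")
    print(regex)
    print(bool(re.search(regex, "GGLTAAHCAA")))   # pattern: yes or no
    print(round(normalized_score(1134), 1))       # profile: score vs threshold
    print(match_level(800))
```

The cut-offs are compared on raw scores rather than on rounded normalised scores, since the N_SCORE values in the profile are given to one decimal only.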
HMM (PFAM)
Protein domain/family databases
•Contain biologically significant «patterns / profiles / HMMs» formulated in such a way that, with appropriate computational tools, they can rapidly and reliably determine to which known family of proteins (if any) a new sequence belongs
•Used as a tool to identify the function of uncharacterized proteins translated from genomic or cDNA sequences («functional diagnostic»)
•Either manually curated (e.g. PROSITE, Pfam A, PRINTS, SMART, TIGRFAM etc.) or automatically generated (e.g. Pfam B, ProDom, DOMO)
Protein domain/family db
PROSITE       Patterns / Profiles
ProDom        Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS        Aligned motifs
Pfam          HMM (Hidden Markov Models)
SMART         HMM
TIGRfam       HMM
Superfamily   HMM
PIRSF (iProClass), Gene 3D, Panther
DOMO          Aligned motifs
BLOCKS        Aligned motifs (PSI-BLAST)
CDD           Pfam and SMART -> A Conserved Domain Database and Search Service
InterPro
InterPro
www.ebi.ac.uk/interpro
•Searches many domain databases simultaneously.
•Single set of documents linked to the various methods;
•Release 12.0 contains 12’542 entries and covers 77.4% of UniProtKB (~90% of UniProtKB/Swiss-Prot)
Scan InterPro
Example: GAL4_YEAST
http://www.ebi.ac.uk/integr8/EBI-Integr8-HomePage.do
Databases 5: proteomics
•Contain information obtained by 2D-PAGE: images of master gels and description of identified proteins
•Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
•Composed of image and text files
•Mass Spectrometry (MS) database: PRIDE
http://www.ebi.ac.uk/pride/
Databases 6: 3D structure
•Only one: PDB (Protein Data Bank),
•Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
•Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)
PDB: Protein Data Bank
www.rcsb.org/pdb/
•Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
•Associated with specialized programs that allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
•Currently there are ~29’500 structural data sets for about 8’000 different proteins, but far fewer protein families (highly redundant)!
PDB: example
HEADER    LYASE(OXO-ACID)                         01-OCT-91   12CA      12CA   2
COMPND    CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II)      12CA   3
COMPND   2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A)   12CA   4
SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN                      12CA   5
AUTHOR    S.K.NAIR,D.W.CHRISTIANSON                                     12CA   6
REVDAT   1   15-OCT-92 12CA    0                                        12CA   7
JRNL        AUTH   S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE   12CA   8
JRNL        TITL   ALTERING THE MOUTH OF A HYDROPHOBIC POCKET.          12CA   9
JRNL        TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE   12CA  10
JRNL        TITL 3 /II$ MUTANTS AT RESIDUE VAL-121                      12CA  11
JRNL        REF    J.BIOL.CHEM.                  V. 266 17320 1991      12CA  12
JRNL        REFN   ASTM JBCHA3  US ISSN 0021-9258                 071   12CA  13
REMARK   1                                                              12CA  14
REMARK   2                                                              12CA  15
REMARK   2 RESOLUTION. 2.4  ANGSTROMS.                                  12CA  16
REMARK   3                                                              12CA  17
REMARK   3 REFINEMENT.                                                  12CA  18
REMARK   3   PROGRAM                    PROLSQ                          12CA  19
REMARK   3   AUTHORS                    HENDRICKSON,KONNERT             12CA  20
REMARK   3   R VALUE                    0.170                           12CA  21
REMARK   3   RMSD BOND DISTANCES        0.011  ANGSTROMS                12CA  22
REMARK   3   RMSD BOND ANGLES           1.3    DEGREES                  12CA  23
REMARK   4                                                              12CA  24
REMARK   4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL       12CA  25
REMARK   4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND,    12CA  26
REMARK   4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES.   12CA  27
………
PDB (cont.)
SHEET    3   S10 PHE    66  PHE    70 -1  O   ASN    67   N   LEU    60 12CA  68
SHEET    4   S10 TYR    88  TRP    97 -1  O   PHE    93   N   VAL    68 12CA  69
SHEET    5   S10 ALA   116  ASN   124 -1  O   HIS   119   N   HIS    94 12CA  70
SHEET    6   S10 LEU   141  VAL   150 -1  O   LEU   144   N   LEU   120 12CA  71
SHEET    7   S10 VAL   207  LEU   212  1  O   ILE   210   N   GLY   145 12CA  72
SHEET    8   S10 TYR   191  GLY   196 -1  O   TRP   192   N   VAL   211 12CA  73
SHEET    9   S10 LYS   257  ALA   258 -1  O   LYS   257   N   THR   193 12CA  74
SHEET   10   S10 LYS    39  TYR    40  1  O   LYS    39   N   ALA   258 12CA  75
TURN     1  T1 GLN    28  VAL    31     TYPE VIB (CIS-PRO 30)           12CA  76
TURN     2  T2 GLY    81  LEU    84     TYPE II(PRIME) (GLY 82)         12CA  77
TURN     3  T3 ALA   134  GLN   137     TYPE I (GLN 136)                12CA  78
TURN     4  T4 GLN   137  GLY   140     TYPE I (ASP 139)                12CA  79
TURN     5  T5 THR   200  LEU   203     TYPE VIA (CIS-PRO 202)          12CA  80
TURN     6  T6 GLY   233  GLU   236     TYPE II (GLY 235)               12CA  81
CRYST1   42.700   41.700   73.000  90.00 104.60  90.00 P 21          2  12CA  82
ORIGX1      1.000000  0.000000  0.000000        0.00000                 12CA  83
ORIGX2      0.000000  1.000000  0.000000        0.00000                 12CA  84
ORIGX3      0.000000  0.000000  1.000000        0.00000                 12CA  85
SCALE1      0.023419  0.000000  0.006100        0.00000                 12CA  86
SCALE2      0.000000  0.023981  0.000000        0.00000                 12CA  87
SCALE3      0.000000  0.000000  0.014156        0.00000                 12CA  88
ATOM      1  N   TRP     5       8.519  -0.751  10.738  1.00 13.37      12CA  89
ATOM      2  CA  TRP     5       7.743  -1.668  11.585  1.00 13.42      12CA  90
ATOM      3  C   TRP     5       6.786  -2.502  10.667  1.00 13.47      12CA  91
ATOM      4  O   TRP     5       6.422  -2.085   9.607  1.00 13.57      12CA  92
ATOM      5  CB  TRP     5       6.997  -0.917  12.645  1.00 13.34      12CA  93
ATOM      6  CG  TRP     5       5.784  -0.209  12.221  1.00 13.40      12CA  94
ATOM      7  CD1 TRP     5       5.681   1.084  11.797  1.00 13.29      12CA  95
ATOM      8  CD2 TRP     5       4.417  -0.667  12.221  1.00 13.34      12CA  96
ATOM      9  NE1 TRP     5       4.388   1.418  11.515  1.00 13.30      12CA  97
ATOM     10  CE2 TRP     5       3.588   0.375  11.797  1.00 13.35      12CA  98
ATOM     11  CE3 TRP     5       3.837  -1.877  12.645  1.00 13.39      12CA  99
ATOM     12  CZ2 TRP     5       2.216   0.208  11.656  1.00 13.39      12CA 100
ATOM     13  CZ3 TRP     5       2.465  -2.043  12.504  1.00 13.33      12CA 101
ATOM     14  CH2 TRP     5       1.654  -1.001  12.009  1.00 13.34      12CA 102
…….
Coordinates of each atom
The same PDB entry “visualized” with Chime
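To show what "coordinates of each atom" means in practice, here is a minimal Python sketch that reads two of the ATOM records above and computes the distance between the atoms. It assumes the usual fixed-column layout of PDB ATOM records (x, y and z in columns 31-38, 39-46 and 47-54); the `Atom` tuple and `parse_atom` function are names invented for this example, and dedicated libraries should be preferred for real work.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line):
    """Parse one ATOM record using fixed column positions (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# First two ATOM records of entry 12CA, as shown on the slide.
RECORDS = [
    "ATOM      1  N   TRP     5       8.519  -0.751  10.738  1.00 13.37",
    "ATOM      2  CA  TRP     5       7.743  -1.668  11.585  1.00 13.42",
]

if __name__ == "__main__":
    n_atom, ca_atom = (parse_atom(record) for record in RECORDS)
    distance = math.dist(n_atom[4:], ca_atom[4:])   # Euclidean distance in Angstroms
    print(n_atom.name, ca_atom.name, round(distance, 2))
```

The N-CA distance obtained this way is close to the expected length of that covalent bond, which is a quick check that the columns were read correctly.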
Industry of databases around PDB
-HSSP: Homology-derived secondary structure of proteins. http://www.sander.ebi.ac.uk/hssp/
-Structure classification
-CATH
-SCOP
-…
-Homology-derived 3D structure db: Swiss-Model Repository (SMR): Feb 2006: 675’000 models.
Databases 7: metabolic
•Contain information that describes enzymes, biochemical reactions and metabolic pathways;
•ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions;
•Metabolic databases: EcoCyc (specialized on Escherichia coli), KEGG, EMP/WIT;
Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes.
•There are about 3750 “EC numbers”; ~1450 cannot be linked to any sequence!
BRENDA
Useful for preparing lab experiments!
http://www.brenda.uni-koeln.de/
http://www.genome.ad.jp/kegg
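An EC number is a four-level hierarchical code whose first digit gives the enzyme class. The Python sketch below parses such a number, for instance the "E.C.4.2.1.1" found in the PDB entry shown earlier. The function names are assumptions made for this illustration; the class names follow the Enzyme Commission nomenclature (a seventh class, translocases, was added after these slides were written).

```python
import re

# Top-level EC classes.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",   # added to the nomenclature in 2018
}

# Accepts "4.2.1.1", "EC 4.2.1.1", "E.C.4.2.1.1" and partial numbers like "3.4.21.-".
_EC_RE = re.compile(r"(?:E\.?C\.?\s*)?(\d+)\.(\d+|-)\.(\d+|-)\.(\d+|-)", re.IGNORECASE)


def parse_ec(text):
    """Return the four EC levels as a tuple; '-' (unspecified) becomes None."""
    match = _EC_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an EC number: {text!r}")
    return tuple(None if part == "-" else int(part) for part in match.groups())


def ec_class(text):
    """Name of the top-level class of an EC number."""
    return EC_CLASSES.get(parse_ec(text)[0], "unknown class")


if __name__ == "__main__":
    print(parse_ec("E.C.4.2.1.1"), ec_class("E.C.4.2.1.1"))
    print(parse_ec("3.4.21.-"), ec_class("3.4.21.-"))
```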
Databases 8: bibliographic
•Bibliographic reference databases contain citations and abstracts of published life science articles;
•Example: Medline
•Other more specialized databases also exist (e.g. Agricola http://agricola.nal.usda.gov/, EMBASE (not free)…).
Medline
•Comprehensive database of primary scientific literature in the biomedical area.
•More than 4,000 biomedical journals published in the United States and 70 other countries
•Contains over 15 million indexed citations from 1966 until now
•Citations prior to the mid-1960s are located in OLDMEDLINE.
•Contains links to biological db
–Many papers not dealing with humans are not in Medline!
–Before 1970, keeps only the first 10 authors!
–Not all journals have citations since 1966! (they go back…)
–Indexed by Google in 2004!
PubMed
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
•Maintained by the US National Library of Medicine.
•Allows access to the citations from MEDLINE and additional life science journals.
•Includes links to many sites providing full text articles and other related resources.
•Also gives access to:
–In Process Citations
–Publisher supplied citations: citations directly submitted to PubMed ([Record as supplied by publisher]).
•PMID (PubMed ID), UI (Medline ID)
DOIs (Digital Object Identifiers) are names (characters and/or digits) assigned to objects of intellectual property such as electronic journal articles, images, learning objects, ebooks, or any other kind of content.
Server: http://dx.doi.org
-> biggest advance to track documents on the web!
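Resolving a DOI amounts to appending it to the address of the resolver. A minimal Python sketch, assuming the server named above; the function name and the sample DOI are hypothetical and used only for illustration.

```python
from urllib.parse import quote

DOI_RESOLVER = "http://dx.doi.org"   # server named on the slide


def doi_to_url(doi):
    """Build the resolver URL for a DOI such as '10.1000/example.123'."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # A DOI is made of a prefix starting with "10." and a suffix, separated by "/".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    return f"{DOI_RESOLVER}/{quote(doi, safe='/')}"


if __name__ == "__main__":
    # Hypothetical DOI, for illustration only.
    print(doi_to_url("doi:10.1000/example.123"))
```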
Databases 9: others
•There are many databases that cannot be classified in the categories listed previously;
•Examples: REBASE (restriction enzymes), TRANSFAC (transcription factors), CarbBank, GlycoSuiteDB (linked sugars), Protein-protein interaction db (IntAct, …), Protease db (MEROPS), biotechnology patents db, etc.;
•As well as many other resources concerning any and new aspects of macromolecules and molecular biology (Microarrays).
Interactome - Protein/protein interaction: description from 1 to more than 20’000 interactions / publication
-Several databases: IntAct, BIND, DIP.
-Proteomics Standards Initiative since 2005
http://www.ebi.ac.uk/intact/index.html
Gene Ontology (GO) database
The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the automated annotation of genes, gene products and sequences.
The three organizing principles of GO are molecular function (MF), biological process (BP) and cellular component (CC).
The GO terms are good but the applications are bad
-> mapping of the GO terms to proteins and genes is done automatically, often according only to the presence of a specific domain (only 5% of the human genes have correct GO terms!)
Proliferation of databases
•Which contains the highest quality data?
•Which is the most comprehensive?
•Which is the most up-to-date?
•Which is the least redundant?
•Which is the most thoroughly indexed (allows complex queries)?
•Which Web server responds most quickly?
•…….??????
To benefit from the data stored in a database, we need:
•easy access to the information
-> a method for extracting only the information needed to answer a specific biological question
…now Medline is indexed by Google….but the others?..
Examples: Entrez (NCBI), SRS (Europe), tools such as BLAST, PeptIdent…
Some important practical remarks
•Databases: many errors (automated annotation)!
•Not all db are available on all servers
•The update frequency is not the same for all servers;
•Some servers automatically add cross-references to an entry (implicit links) in addition to already existing links (explicit links)… different looks…
Before the introduction to databases…
After the introduction to databases…