Basics on bioinformatics, lecture 2 (unina.it)
Database or databank?
Initially
o Databank (UK)
o Database (USA)
Solution
The abbreviation db
Entity-Relationship (ER) modeling
Notation uses three main constructs:
o Data entities
Represents a set or collection of objects in the real world that share the
same properties. Person, place, object, event or concept about which data is
to be maintained.
o Attributes
Named property or characteristic of an entity
o Relationships
Association between the instances of one or more entity types
Relationships can be classified as either
o one-to-one (1:1)
o one-to-many (1:N)
o many-to-many (N:N)
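The three constructs can be sketched in code. This is only an illustration, assuming two hypothetical entities (Gene and Transcript) that do not appear in the slides: each class is an entity, its typed members are attributes, and the list held by Gene stands for a one-to-many (1:N) relationship.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transcript:
    # Entity (hypothetical) with two attributes.
    transcript_id: str
    length: int


@dataclass
class Gene:
    # Entity (hypothetical) with two attributes plus a 1:N relationship:
    # one gene is associated with many transcripts.
    gene_id: str
    symbol: str
    transcripts: List[Transcript] = field(default_factory=list)


# Two transcript instances associated with a single gene instance.
gene = Gene(gene_id="G1", symbol="exampleA")
gene.transcripts.append(Transcript("T1", 1200))
gene.transcripts.append(Transcript("T2", 950))

print(len(gene.transcripts))
```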
Connectivity and cardinality (figures): 1 : 1, 1 : N, N : M
ER example (figure)
database: basic structure
Databases are composed of tables of data.

Gi       | Accession | Length | Cultivar          | Dev.stag                        | Tissue   | sequence
30320090 | CD003352  | 356    | -                 | Turning stage of fruit ripening | Pericarp | GTACTCCTAAAC…..
15195408 | BI421671  | 492    | TA496             | 25-40 days old                  | callus   | CCACAACCACA…..
50892290 | AJ784669  | 346    | West Virginia 106 | 8 days post anthesis            | fruit    | CAAATTTA…..

Tables hold logically related sets of data. A table is essentially the same thing as a spreadsheet: a set of rows and columns.
Each table has several records or entries: a record stores all the information for a given individual.
Records are the rows of a data table.
Each record has several fields: a field is an individual piece of data, a single attribute of the record.
Fields are the columns of a data table.
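A minimal sketch of the record/field distinction, using the three rows of the example table. The dictionary layout and the `column` helper are illustrative choices, not part of the lecture; the truncated sequences are kept exactly as shown on the slide.

```python
# Field names (columns) as they appear in the table header.
FIELDS = ["Gi", "Accession", "Length", "Cultivar", "Dev.stag", "Tissue", "sequence"]

# Each record (row) maps every field name to one value.
records = [
    dict(zip(FIELDS, ["30320090", "CD003352", 356, "-",
                      "Turning stage of fruit ripening", "Pericarp", "GTACTCCTAAAC….."])),
    dict(zip(FIELDS, ["15195408", "BI421671", 492, "TA496",
                      "25-40 days old", "callus", "CCACAACCACA….."])),
    dict(zip(FIELDS, ["50892290", "AJ784669", 346, "West Virginia 106",
                      "8 days post anthesis", "fruit", "CAAATTTA….."])),
]


def column(rows, name):
    """Return one field (column) across all records (rows)."""
    return [row[name] for row in rows]


accessions = column(records, "Accession")
print(accessions)
```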
Each record (row) has a unique identifier, the primary key. The primary key serves to identify the data stored in this record across all the tables in the database.
Databases are manipulated with a language called SQL (Structured Query Language). It is a “baby English” type of language: it uses real words, but is rigid in terms of their order and placement.
Various database software: Oracle, MS Access, MySQL, etc.
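To make the primary key and SQL concrete, here is a small sketch using SQLite through Python's standard `sqlite3` module. SQLite is not among the products named on the slide; it is used here only because it needs no server. The table name `sequences` and the column names are assumptions made for the example.

```python
import sqlite3

# In-memory database, discarded when the program ends.
conn = sqlite3.connect(":memory:")

# "gi" is declared as the primary key: it must be unique for every record.
conn.execute(
    """
    CREATE TABLE sequences (
        gi        INTEGER PRIMARY KEY,
        accession TEXT NOT NULL,
        length    INTEGER,
        cultivar  TEXT,
        dev_stage TEXT,
        tissue    TEXT
    )
    """
)

rows = [
    (30320090, "CD003352", 356, None, "Turning stage of fruit ripening", "Pericarp"),
    (15195408, "BI421671", 492, "TA496", "25-40 days old", "callus"),
    (50892290, "AJ784669", 346, "West Virginia 106", "8 days post anthesis", "fruit"),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?, ?, ?)", rows)

# A query reads almost like English, but the keyword order is fixed.
long_entries = conn.execute(
    "SELECT accession, length FROM sequences WHERE length > 350 ORDER BY length"
).fetchall()
print(long_entries)

# Inserting a second record with an existing primary key is rejected.
try:
    conn.execute("INSERT INTO sequences VALUES (?, ?, ?, ?, ?, ?)", rows[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```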
Why biological databases?
o Make biological data available to scientists
  Consolidation of data (gather data from different sources)
  Provide access to large datasets that cannot be published explicitly (genome, …)
o Make biological data available in computer-readable format
  Make data accessible for automated analysis
Biological db
o Vary in size, quality, coverage, level of interest
o Many of the major ones covered in the annual Database Issue of
Nucleic Acids Research
2010
What makes a good db?
o comprehensiveness
o accuracy
o is up-to-date
o good interface
o batch search/download
o API (web services, DAS, etc.)
“Must have” items when using a db
o Remember the server, the database, and the program version used
o Write down sequence identification numbers
o Databases are not like good wine (use up-to-date builds)
o Use local installs when it becomes necessary
Primary and derived data
Primary databases:
Databases consisting of data derived experimentally such as
nucleotide sequences and three dimensional structures.
Secondary databases:
Those data that are derived from the analysis or treatment of primary data
Nucleotide sequence databases
GenBank www.ncbi.nlm.nih.gov/GenBank
EMBL www.ebi.ac.uk/embl
DDBJ www.ddbj.nig.ac.jp
The 3 databases are synchronized on a daily basis, and the accession numbers are consistent.
There are no legal restrictions on the usage of these databases. However, there are some patented sequences in the database.
GenBank sample record
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Sections of the record: header, title, taxonomy, citation, features, sequence

LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene,
            complete cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the
            major outer membrane protein, OprF, in Pseudomonas aeruginosa
            and Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
   PUBMED   10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001,
            Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      421 gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa
      481 tttcaggaga tcgcagacat catgcacatg ggtttgagtg cgacaaaaat gcgttacaaa
      541 cgtgctctag ataaattgcg tgagaaattt gcaggcgaga ctgaaactta g
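Because the record is plain text with keyword-prefixed lines, simple checks can be automated. The sketch below reads the LOCUS and BASE COUNT lines of the sample record and verifies that the base counts add up to the declared length. The two helper functions are hypothetical and rely on whitespace splitting, which is an assumption that fits this record; dedicated parsers (for example in Biopython) are the safer choice for real work.

```python
# The two lines are copied from the sample record, with whitespace collapsed.
locus_line = "LOCUS AF115338 591 bp DNA linear BCT 19-AUG-1999"
base_count_line = "BASE COUNT 157 a 133 c 170 g 131 t"


def parse_locus(line):
    """Split a LOCUS line of this layout into named fields."""
    parts = line.split()
    return {
        "name": parts[1],
        "length": int(parts[2]),   # parts[3] is the unit "bp"
        "molecule": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "date": parts[7],
    }


def parse_base_count(line):
    """Turn 'BASE COUNT 157 a 133 c ...' into {'a': 157, 'c': 133, ...}."""
    tokens = line.split()[2:]      # drop the words "BASE" and "COUNT"
    return {base: int(count) for count, base in zip(tokens[0::2], tokens[1::2])}


locus = parse_locus(locus_line)
counts = parse_base_count(base_count_line)
total = sum(counts.values())

print(locus["name"], locus["length"], total)
```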
Protein sequence database
The mission of UniProt is to provide the
scientific community with a comprehensive,
high-quality and freely accessible resource of
protein sequence and functional information.
UniProt Knowledgebase (UniProtKB)
is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. It comprises:
o Swiss-Prot, which is manually annotated and reviewed.
o TrEMBL, which is automatically annotated and is not reviewed.
The UniProt Reference Clusters (UniRef), which is used to speed up sequence similarity searches.
UniProt entry
Protein data bank
The PDB archive contains information about experimentally determined structures of proteins, nucleic acids, and complex assemblies (X-ray, NMR, computationally predicted).
Mission: maintain a single archive of macromolecular structural data that is freely and openly available to the global community
Number of Structures Available
PDB entry
Protein structure levels
The Gene Ontology (GO)
GO goals
The GO website: http://www.geneontology.org
GO is divided into 3 domains (levels of annotation):
o Molecular function - basic activities of a gene product at the molecular level
o Biological process - set of molecular events with a defined beginning and end
o Cellular component - the parts of a cell or its extracellular environment
GO structure
The structure of GO can be described in terms of a directed acyclic graph (DAG), where each GO term is a node, and the relationships between the terms are arcs between the nodes.
Figure: nodes nucleus, chromosome, mitochondrion, nuclear chromosome and mitochondrial chromosome, connected by arcs labelled is_a and part_of.
GO currently has 2 relationship types:
o is_a
  An is_a child of a parent means that the child is a complete type of its parent, but can be discriminated in some way from other children of the parent.
o part_of
  A part_of child of a parent means that the child is always a constituent of the parent that in combination with other constituents of the parent make up the parent.
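A DAG of this kind can be held in a dictionary that maps each child term to its typed arcs. The sketch below uses the five terms named on the slide. The exact arcs are an assumption about the lost figure (each specific chromosome is_a chromosome and part_of its compartment), and the `ancestors` helper is hypothetical.

```python
# child term -> list of (relationship, parent term).
# The arcs are an assumed reading of the slide's figure.
EDGES = {
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
    "mitochondrial chromosome": [("is_a", "chromosome"), ("part_of", "mitochondrion")],
    "chromosome": [],
    "nucleus": [],
    "mitochondrion": [],
}


def ancestors(term, edges):
    """Return every term reachable from `term` by following arcs to parents."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in edges.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nuclear chromosome", EDGES)))
```

Note that a term may have more than one parent, which is what distinguishes a DAG from a simple tree.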
Searching for papers
http://www.ncbi.nlm.nih.gov/pubmed
http://scholar.google.com/
http://www.scopus.com/home.url
http://portal.isiknowledge.com/
Querying GenBank
http://www.ncbi.nlm.nih.gov/sites/gquery
From the Entrez main page, search for the gene whose accession number is BC043443.
o How many results do we get in the Gene db?
o What is the official name of the gene? Other possible names?
o On which DNA strand is it located?
o How many splice variants does it have?
o Which disease is the gene associated with?
o Is it involved in the apoptosis process?
o How long is the coding sequence of the first splice variant?
Querying GenBank
http://www.ncbi.nlm.nih.gov/genbank/
Accession: NG_000007
What kind of molecule is it? Genomic DNA
Where is the promoter of the HBB gene located? Upstream of nucleotide 70545
Indicate the number of exons = 3
Indicate the length of the second exon = 71039 - 70817 + 1 = 223 nts
Indicate the number of introns = 2
Indicate the length of the first intron = 70816 - 70685 + 1 = 132 nts
Indicate the location of the 5' UTR = 70545..70594
Indicate the length of the 5' UTR = 70594 - 70545 + 1 = 50 nts
Indicate the location of the 3' UTR = 72019..72150
Indicate the length of the 3' UTR = 72150 - 72019 + 1 = 132 nts
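The length calculations above all follow the same rule: GenBank feature coordinates are 1-based and include both endpoints, so the length is end minus start plus one. A minimal sketch, with a hypothetical helper name, applied to coordinates taken from the answers above:

```python
def feature_length(start, end):
    """Length of a feature given 1-based coordinates that include both ends."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# Coordinates as given in the exercise answers.
features = {
    "exon 2": (70817, 71039),
    "5' UTR": (70545, 70594),
    "3' UTR": (72019, 72150),
}

lengths = {name: feature_length(start, end) for name, (start, end) in features.items()}
for name, length in lengths.items():
    print(name, length, "nts")
```

Forgetting the "+ 1" is the usual mistake: a feature spanning positions 5..5 is one nucleotide long, not zero.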
Indicate the nucleotide positions of the start codon = 70595, 70596, 70597
Download in FASTA format the sequence of the HBB gene (region from 70545 to 72150)
>gi|28380636:70545-72150 Homo sapiens beta globin region (HBB@); and hemoglobin, beta (HBB); and hemoglobin, delta (HBD); and hemoglobin, epsilon 1 (HBE1); and hemoglobin, gamma A (HBG1); and hemoglobin, gamma G (HBG2), RefSeqGene on chromosome 11
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAG
ACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGGCTGCTGG
TGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGG
CAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGAC
AACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACT
TCAGGGTGAGTCTATGGGACGCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGTTAAGTTCATGTCATAG
GAAGGGGATAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATTGCATCAGTGTGGAAGTCT
CAGGATCGTTTTAGTTTCTTTTATTTGCTGTTCATAACAATTGTTTTCTTTTGTTTAATTCTTGCTTTCT
TTTTTTTTCTTCTCCGCAATTTTTACTATTATACTTAATGCCTTAACATTGTGTATAACAAAAGGAAATA
TCTCTGAGATACATTAAGTAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTACATTACTATTTGGAAT
ATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATTTTCTTTTATTTTTAATTGATACATAAT
CATTATACATATTTATGGGTTAAAGTGTAATGTTTTAATATGTGTACACATATTGACCAAATCAGGGTAA
TTTTGCATTTGTAATTTTAAAAAATGCTTTCTTCTTTTAATATACTTTTTTGTTTATCTTATTTCTAATA
CTTTCCCTAATCTCTTTCTTTCAGGGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAG
AATAACAGTGATAATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAATATTTCTGCATATAAAT
TGTAACTGATGTAAGAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTTTATTTT
ATGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTT
ATCTTCCTCCCACAGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
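A FASTA record is a header line starting with ">" followed by sequence lines, so it is easy to read programmatically. The sketch below is an illustration only: it uses just the first 70 nucleotides of the downloaded region with a shortened header, parses them with a hypothetical `parse_fasta` helper, and checks that the positions given earlier for the start codon (70595 to 70597) really spell ATG once genomic coordinates are converted to offsets within the region.

```python
def parse_fasta(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:])


# Excerpt: shortened header and only the first sequence line of the record.
fasta_excerpt = """>gi|28380636:70545-72150 Homo sapiens beta globin region
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
"""

header, sequence = parse_fasta(fasta_excerpt)

# The header encodes the downloaded region as "identifier:start-end".
region = header.split()[0].split(":")[1]
region_start = int(region.split("-")[0])


def base_at(position):
    """Nucleotide at a 1-based genomic position inside the region."""
    return sequence[position - region_start]


start_codon = "".join(base_at(p) for p in (70595, 70596, 70597))
print(region_start, len(sequence), start_codon)
```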
Querying GenBank: link to geneID
Querying PUBMED
http://www.ncbi.nlm.nih.gov/pubmed
How many articles did Nunzio D’Agostino publish?
D'Agostino, Nunzio [Full Author Name] OR D Agostino, Nunzio [Full Author Name]
How many of these are related to EST?
D'Agostino, Nunzio [Full Author Name] AND EST [Title/Abstract]
How many of these are in the journal BMC Genomics?
(D'Agostino, Nunzio [Full Author Name] OR D Agostino, Nunzio [Full Author Name]) AND BMC Genomics [journal]
How many articles in PubMed include the word “RNA-Seq” in the title?
RNA-Seq [title]
How many reviews containing the word “transcriptome” were published in 2008?
transcriptome [title] AND review [Publication Type] AND 2008 [publication date]
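The same queries can be sent programmatically rather than typed into the web page. As a sketch, assuming NCBI's E-utilities ESearch service is used (it is not mentioned in the slides), the code below only builds the request URLs for three of the queries above; it does not contact the server, so no counts are shown. The helper name `pubmed_count_url` is hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_count_url(term):
    """Build an ESearch URL asking only for the number of PubMed hits."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


# Query strings copied from the exercises, field tags included.
queries = [
    "D'Agostino, Nunzio [Full Author Name] AND EST [Title/Abstract]",
    "RNA-Seq [title]",
    "transcriptome [title] AND review [Publication Type] AND 2008 [publication date]",
]

urls = [pubmed_count_url(q) for q in queries]
for url in urls:
    print(url)

# Decoding the URL again recovers the original query text.
decoded_term = parse_qs(urlparse(urls[1]).query)["term"][0]
print(decoded_term)
```

Opening one of the printed URLs in a browser returns a small XML document whose Count element holds the number of matching articles.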