Databases
Protein Structure and Bioinformatics Group
7 Oct 2016 2
Purpose of the lecture
● provide an overview of available databases
● what are they for?
● the contents of the most important databases
● how to query these databases
● make you aware of drawbacks and pitfalls
Overview
● intro on databases
● database models
● overview of biological databases
● details of often used databases and/or providers
● some remarks on data quality
Why databases?
● Exponential growth of:
– sequences
– structures
– literature
● Need for efficient storage and management tools
● Need for standardization
Solution: databases
● coherent, consistent, designed for special purpose
● data model: clearly defined data structure
● database management system: easy access and management
What is a database
● any organized collection of data
– card filing system
– telephone book
● now: A collection of information organized in such a way that a computer program can quickly select desired pieces of data.
● you need: Database Management System (DBMS)
Database models: logical structure of a database
● flat file
● relational model (most used)
● other:
– object-oriented, XML, hierarchical, network
● Database Management Systems (DBMS) include: MySQL, PostgreSQL, SQLite, Microsoft SQL Server, Oracle, SAP, dBASE, FoxPro, IBM DB2, LibreOffice Base and FileMaker Pro
Flat file
● written in plain text, in a standard, defined format
● often tab-delimited or comma-separated text files
● each line is a record
● fields are separated by delimiters: tabs, commas
● searching is sequential only: every line has to be read in turn
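The flat-file idea can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the file content, the column names and the `find_records` helper are all hypothetical. It shows that a search has to walk through every record in order.

```python
import csv
import io

# Hypothetical tab-delimited flat file: one record per line, fields split by tabs.
FLAT_FILE = (
    "record_id\tgene\torganism\n"
    "R001\tgeneA\tHomo sapiens\n"
    "R002\tgeneB\tHomo sapiens\n"
    "R003\tgeneA\tMus musculus\n"
)

def find_records(handle, field, value):
    """Read the file line by line and keep every record whose field equals value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[field] == value]

hits = find_records(io.StringIO(FLAT_FILE), "gene", "geneA")
print([row["record_id"] for row in hits])
```

Because there is no index, the cost of each search grows with the number of lines in the file.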
DNA and protein sequences in FASTA format
>gi|71902539|ref|NM_000051.3| Homo sapiens ataxia telangiectasia mutated (ATM), mRNA
CCGGAGCCCGAGCCGAAGGGCGAGCCGCAAACGCTAAGTCGCTGGCCATTGGTGGACATGGCGCAGGCGCGTTTGCTCCGACGGGCCGAATGTTTTGGGGCAGTGTTTTGAGCGCGGAGACCGCGTGATACTGGATGCGCATGGGCATACCGTGCTCTGCGGCTGCTTGGCGTTGCTTCTTCCTCCAGAAGTGGGCGCTGGGCAGTCACGCAGGGTTTGAACCGGAAGCGGGAGTAGGTAGCTGCGTGGCTAACGGAGAAAAGAAGCCGTGGCCGCGGGAGGAGGCGAGAGGAGTCGGGATCTGCGCTGCAGCCACCGCCGCGGTTGATACTACTTTGACCTTCCGAGTGCAGTGACAGTGATGTGTGTTCTGAAATTGTGAACCATGAGTCTAGTACTTAATGATCTGCTTATCTGCTGCCGTCAACTAGAACATGATAGAGCTACAGAACGAAAGAAAGAAGTTGAGAAATTTAAGCGCCTGATTCGAGATCCTGAAACAATTAAACATCTAGATCGGCATTCAGATTCCAAACAAGGAAAATATTTGAATTGGGATG
>gi|71902540|ref|NP_000042.3| serine-protein kinase ATM [Homo sapiens]
MSLVLNDLLICCRQLEHDRATERKKEVEKFKRLIRDPETIKHLDRHSDSKQGKYLNWDAVFRFLQKYIQKETECLRIAKPNVSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQELLNYIMDTVKDSSNGAIYGADCSNILLKDILSVRKYWCEISQQQWLELFSVYFRLYLKPSQDVHRVLVARIIHAVTKGCCSQTDGLNSKFLDFFSKAIQCARQEKSSSGLNHILAALTIFLKTLAVNFRIRVCELGDEILPTLLYIWTQHRLNDSLKEVIIELFQLQIYIHHPKGAKTQEKGAYESTKWRSILYNLYDLLVNEISHIGSRGKYSSGFRNIAVKENLIELMADICHQVFNEDTRSLEISQSYTTTQRESSDYSVPCKRKKIELGWEVIKDHLQKSQNDFDLVPWLQIATQLISKYPASLPNCELSPLLMILSQLLPQQRHGERTPYVLRCLTEVALCQDKRSNLESSQKSDLLKLWNKIWCI
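A FASTA file is itself a flat file: a header line starting with ">" followed by one or more sequence lines. The sketch below is a minimal reader under that assumption. The function name `read_fasta` is illustrative, and the example record is the ATM protein header above with only its first 40 residues, split over two lines.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Truncated example record (first 40 residues only).
example = io.StringIO(
    ">gi|71902540|ref|NP_000042.3| serine-protein kinase ATM [Homo sapiens]\n"
    "MSLVLNDLLICCRQLEHDRA\n"
    "TERKKEVEKFKRLIRDPETI\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header.split("|")[3], len(sequence))
```

For real work an established parser such as the one in Biopython is usually the safer choice; the point here is only how little structure the format has.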
Relational database
● database is composed of tables
● each table has records (rows)
● each record has fields (columns)
● relational:
– tables hold logically related sets of data
– each record has a unique identifier: primary key
– relations between tables through keys
Relational database
● PK = primary key, unique identifier
● FK = foreign key, connects to primary key in Customer table
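The PK/FK relation can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: only the Customer table is named on the slide, so the `Orders` table, all column names and the inserted rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE Customer (
        customer_id INTEGER PRIMARY KEY,          -- PK: unique identifier
        name        TEXT NOT NULL
    );
    CREATE TABLE Orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
                    REFERENCES Customer(customer_id),  -- FK to Customer PK
        item        TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO Customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO Orders VALUES (10, 1, 'pipette tips')")

# An order that points at a non-existing customer violates the foreign key.
try:
    conn.execute("INSERT INTO Orders VALUES (11, 99, 'gloves')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# The relation between the tables is followed through the keys with a JOIN.
rows = conn.execute("""
    SELECT Customer.name, Orders.item
    FROM Orders JOIN Customer ON Orders.customer_id = Customer.customer_id
""").fetchall()
print(fk_rejected, rows)
```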
Aspects of relational databases
● tables hold logically related sets of data
● order of rows irrelevant (random access!)
● rows are unique: no duplication of information
● searching is specifying what you want:
– which field(s) from which table(s) under which condition(s)
– SQL (Structured Query Language)
● searching speed can be increased by using indexes
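"Which fields from which table under which conditions" maps directly onto SELECT, FROM and WHERE. The sketch below again uses `sqlite3`; the `protein` table, its columns and its rows are invented for illustration, and whether the query planner actually uses the index depends on the SQLite version and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    " accession TEXT PRIMARY KEY, gene TEXT, organism TEXT, length INTEGER)"
)
# Hypothetical rows; accessions and lengths are made up.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("X0001", "geneA", "Mus musculus", 340),
        ("X0002", "geneA", "Homo sapiens", 350),
        ("X0003", "geneB", "Homo sapiens", 120),
    ],
)

# An index on a frequently searched field can speed up lookups.
conn.execute("CREATE INDEX idx_protein_gene ON protein (gene)")

# which fields      from which table   under which conditions
query = "SELECT accession, length FROM protein WHERE gene = ? AND organism = ?"
rows = conn.execute(query, ("geneA", "Homo sapiens")).fetchall()
print(rows)

# Ask SQLite how it intends to run the query (the last column is the description).
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("geneA", "Homo sapiens")).fetchall()
plan_details = [row[3] for row in plan]
print(plan_details)
```

Note that the query only states what is wanted; how the rows are located is left to the DBMS.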
Querying a database with SQL
How to access databases
● Web-based Graphical User Interfaces (GUI)
– you do not see the underlying database structure
– output defined by host/provider
● File Transfer Protocol (FTP)
– mostly flat files
● Application Programming Interface (API)
– you approach the database programmatically through web services (SOAP/REST)
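As an example of programmatic access, the sketch below builds a REST request for NCBI's E-utilities `efetch` service, which is not mentioned on the slides and is used here only as an assumption about a suitable endpoint. The helper names `efetch_url` and `fetch_text` are hypothetical; the actual download is defined but not executed, since it needs a network connection.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build the request URL for one record in plain-text form."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)

def fetch_text(url, timeout=30):
    """Download the response body as text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

url = efetch_url("NM_000051.3")
print(url)
# fasta = fetch_text(url)  # uncomment to actually retrieve the record
```

Providers usually publish usage limits for such services, so it is worth reading their guidelines before running large numbers of requests.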
Biological database providers/host
● EBI European Bioinformatics Institute
● SIB Swiss Institute of Bioinformatics
● NCBI National Center for Biotechnology Information
● DDBJ DNA Data Bank of Japan
Classification of biological databases
Primary: hold experimentally derived data
● experimental data repositories
● sequence databases
● structure databases
Classification of biological databases
Secondary: derived information from primary databases
● sequence related
● genome related
● structure related
● expression data (RNA, protein)
● pathway information
Experimental data repositories
● Gene Expression Omnibus (GEO)
● ArrayExpress
● European Nucleotide Archive (ENA)
Primary sequence databases
DNA/nucleotide sequences
Ensembl (EBI/Wellcome Trust Sanger Inst.)
GenBank (NCBI)
DNA Data Bank of Japan (DDBJ)
European Nucleotide Archive (EMBL-EBI)
Primary sequence databases
protein sequences
UniProtKB UniProt Knowledge Base
– UniProtKB/Swiss-Prot
– UniProtKB/TrEMBL
NCBI Protein
Primary structure databases
Protein Data Bank (PDB)
Nucleic Acid Database
Cambridge Structural Database
Secondary databases
● sequence related
– ProSite
– Pfam
– Enzyme
– REBase (restriction enzymes)
Secondary databases
● genome related
Online Mendelian Inheritance in Man
TRANSFAC (transcription factors)
Secondary databases
● structure related
– DSSP Database of Secondary Structure Assignments
– HSSP Homology-derived Secondary Structure of Proteins
– Dali: comparing protein structures in 3D
Secondary databases
● expression data
– Expression Atlas
– Human Protein Atlas
● pathway related
– KEGG: Kyoto Encyclopedia of Genes and Genomes
Databases on Human Genes and Diseases
● General human genetics databases
e.g. HGMD
● General polymorphism databases
e.g. NCBI SNP (dbSNP)
● Cancer gene and variant databases
e.g. COSMIC, Cancer Genome Atlas
Databases on Human Genes and Diseases
● Gene-, system- or disease-specific databases
– Locus-Specific DataBases, see e.g. HGVS http://www.hgvs.org
– Disease-specific, e.g. IDbases: locus-specific databases for immunodeficiency-causing variations http://structure.bmc.lu.se/idbase/
– System-specific, e.g. GWAS Catalog: genome-wide association studies
Databases on Human Genes and Diseases
● Online Mendelian Inheritance in Man
Locus-Specific Databases (LSDBs) list at www.hgvs.org/locus-specific-mutation-databases
IDbases at structure.bmc.lu.se/idbase
BTKbase at LOVD.nl
Nucleic Acids Research
● The NAR online Molecular Biology Database Collection is published in the Database issue each year
● 2016: 1685 listings
● URL: http://www.oxfordjournals.org/nar/database/c/
Wikipedia
URL: http://en.wikipedia.org/wiki/List_of_biological_databases
PubMed
● The access point to medicine-related publications
● PubMed comprises more than 26 million citations for biomedical literature
URL: http://www.ncbi.nlm.nih.gov/pubmed
Some examples
● NCBI
● UniProtKB/Swiss-Prot
● PDB
● Ensembl
NCBI https://www.ncbi.nlm.nih.gov/
NCBI Genetics & Medicine
NCBI Handbook
NCBI search
NCBI Gene: download settings
NCBI Gene: display settings
NCBI Gene: Genomic regions etc.
NCBI Gene: Reference sequences
information about the fields in GenBank records can be found at:
● NCBI handbook
● https://www.ncbi.nlm.nih.gov/genbank/samplerecord/
NCBI dbSNP: short genetic variations
UniProt www.uniprot.org
UniProtKB/Swiss-Prot
Protein Data Bank in Europe (PDBe)
Protein Data Bank (in Japan)
Protein Data Bank
Ensembl www.ensembl.org
Ensembl variants
KEGG: integrating genomic and chemical information with systems information
KEGG Pathways
Some remarks about data quality
● how up-to-date is the database?
● is the database hand-curated by experts?
● when using data from a database, try to check these points
● be aware that there can always be errors somewhere
Example of checking data
● checking variant descriptions can be done with the Mutalyzer Name Checker tool: https://mutalyzer.nl
● Name Checker takes a complete sequence variant description (e.g. NM_000061.2:c.214A>G)
● the variant description is checked for conformance to the HGVS rules
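To give a feeling for what such a description contains, the sketch below splits a simple substitution like the example above into its parts with a regular expression. This is not how Mutalyzer works and covers only a tiny part of the HGVS nomenclature: the pattern and the helper `parse_substitution` are illustrative assumptions, and no check against the reference sequence is made.

```python
import re

# Matches only simple substitutions such as NM_000061.2:c.214A>G.
SUBSTITUTION = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+)"   # versioned reference, e.g. NM_000061.2
    r":(?P<coord>[cg])\."                  # coding (c.) or genomic (g.) numbering
    r"(?P<position>\d+)"                   # position of the changed nucleotide
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"    # reference base > new base
)

def parse_substitution(description):
    """Return the parts of a simple substitution, or None if the syntax differs."""
    match = SUBSTITUTION.match(description)
    return match.groupdict() if match else None

print(parse_substitution("NM_000061.2:c.214A>G"))
print(parse_substitution("NM_000061.2:c.214A-G"))
```

A syntactically valid description can still be wrong, for instance when the stated reference base does not occur at that position; that kind of check needs the reference sequence itself.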
Mutalyzer Name Checker
Mutalyzer Name Check result (part)
Thanks
● Protein Structure and Bioinformatics Group
● BMC B13
● [email protected]
● http://structure.bmc.lu.se