Databases for Bioinformatics: A very partial introduction

Paul Greenfield
CSIRO Mathematics, Informatics and Statistics

January 2013

Databases

• A heavily overloaded term
• Used here to mean a collection of (possibly) structured data that can be accessed and queried (and the related technology)
• Structured databases
    • Data is defined using a schema
    • Strongly defined data structures, with some defined semantics
    • Relational databases (accessed through SQL) are structured
• Unstructured databases
    • Collections of data that can be created, read, updated or deleted
    • No (or little) definition of the data itself
• And many other variants and mixtures of the above
    • Key-value stores are popular...
    • As are document-centric databases...

Databases and bioinformatics

• Various types of databases commonly used
    • Repositories of structured reference data
        – Back-end to Web forms for downloading selected data
        – Publications, reference sequences, proteins, ...
    • Metadata and results from experiments
        – Need to keep the results organised somehow
        – (spreadsheets are possibly more common though)
• Non-computing use of ‘database’ as well
    • A collection of data rather than a technology
    • Often using databases to structure and provide access to a larger repository
        – Is PubMed a ‘database’ (a collection of data) or a ‘database’ (a technology)?
        – Is the SRA (Sequence Read Archive, formerly Short Read Archive) a ‘database’?

Metabase example

• A database of biological databases
• Wide variety of different types of data on different organisms
• Holding the core data for a research community
    • Basis for collaboration
    • EBI, NCBI, ...
• Typical usage model is search then download
    • APIs for searching
    • ftp downloads

NCBI ‘databases’

Relational databases and SQL

• Mainstream database technology
    • Oracle, MySQL, SQL Server, DB2, ...
    • Widely used in commerce and science
• Data entities are modelled using tables (relations)
    • Tables are sets of identically-structured records (tuples)
    • Defined relationships between tables
    • Lots of theory about how to design relational databases (‘normal forms’)
• Structured Query Language
    • A standard, non-procedural, set-based, declarative query language
    • Say what results you want, and let the database work out how to get them

A bacterial database

• Goal: a queryable bacterial database
    • Genes, sequences, functions, ... for all available bacteria
    • Answering some questions about bacteria without writing code...
• Finished and draft annotated bacterial genomes from NCBI
    • ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
    • ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria_DRAFT/
    • Draft genomes are sets of contigs (from an assembly)
        – One sequence file containing a set of contigs/scaffolds
    • Finished genomes
        – One sequence file per genome, plasmid, ...
    • One directory for each species/strain
        – DNA sequences, amino acid sequences, annotations
        – In a variety of formats

Bacterial database design

• A few thousand directories of bacterial sequences & annotations...
    • How do you answer simple questions?
    • Loading this raw data into a database could make it more useful
• What do we want to model as tables?
    • Organisms (directories)
    • Genomes, plasmids (for finished organisms), scaffolds/contigs (for drafts)
    • Genes (from annotations) – sequence + annotations
    • Other pre-computed metrics (relatedness??)
• Each table needs a unique key
    • Used to connect related entities (foreign keys)
    • Can be ‘compound’ (speciesNo, sequenceNo)

[Diagram: the Species, Sequences and Genes tables]
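The role of unique, compound and foreign keys can be sketched with Python's built-in sqlite3 module. This is only an illustration: SQLite stands in for MySQL, the tables are cut down to a few columns, the rows are made up, and the explicit REFERENCES clause is an assumption (the schema in these notes links tables by key values without declaring foreign keys).

```python
import sqlite3

# SQLite (in memory) stands in for MySQL here; all rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys on request
conn.executescript("""
CREATE TABLE Species (
    SpeciesNo INTEGER PRIMARY KEY,          -- unique key
    SpeciesID TEXT NOT NULL
);
CREATE TABLE Sequences (
    SpeciesNo  INTEGER NOT NULL REFERENCES Species (SpeciesNo),  -- foreign key (assumed)
    SequenceNo INTEGER NOT NULL,
    SequenceID TEXT NOT NULL,
    PRIMARY KEY (SpeciesNo, SequenceNo)     -- compound key
);
""")
conn.execute("INSERT INTO Species VALUES (1, 'Example species A')")
conn.execute("INSERT INTO Sequences VALUES (1, 1, 'SEQ_A1')")
conn.execute("INSERT INTO Sequences VALUES (1, 2, 'SEQ_A2')")


def try_insert(row):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO Sequences VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_key_ok = try_insert((1, 2, "SEQ_DUP"))   # (1, 2) is already used
unknown_species_ok = try_insert((99, 1, "SEQ_X"))  # there is no Species row 99
new_row_ok = try_insert((1, 3, "SEQ_A3"))          # new key, known species
```

Only the last insert succeeds: the compound key rejects the duplicate (SpeciesNo, SequenceNo) pair, and the foreign key rejects a sequence whose species does not exist.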

Species

CREATE TABLE Species (
    SpeciesNo int NOT NULL AUTO_INCREMENT, -- unique id for species
    SpeciesID varchar(100) NOT NULL, -- species/strain name
    SpeciesUid int NOT NULL, -- uid from NCBI directory
    Finished int NOT NULL, -- finished = 1, draft = 0
    PRIMARY KEY (SpeciesNo) -- MySQL requires an AUTO_INCREMENT column to be a key at creation
) ENGINE=MYISAM AVG_ROW_LENGTH=200 MAX_ROWS=10000;

ALTER TABLE Species ADD INDEX BySpeciesID (SpeciesID);

Sequences

CREATE TABLE Sequences (
    SpeciesNo int NOT NULL, -- (SpeciesNo, SequenceNo) is compound key
    SequenceNo int NOT NULL, -- ...for this sequence/genome/plasmid/...
    SequenceID varchar(20) NOT NULL, -- name of this sequence (e.g. NCnnnnnnn)
    SequenceDesc varchar(100) NOT NULL, -- description of sequence
    SequenceLength int NOT NULL -- length of this sequence string
) ENGINE=MYISAM AVG_ROW_LENGTH=200 MAX_ROWS=1000000;

ALTER TABLE Sequences ADD PRIMARY KEY (SpeciesNo, SequenceNo), ADD INDEX BySequenceID (SequenceID);

Genes

CREATE TABLE Genes (
    SpeciesNo int NOT NULL, -- |(SpeciesNo, SequenceNo, GeneNo) is key
    SequenceNo int NOT NULL, -- |... for this gene record
    GeneNo int NOT NULL, -- |
    GenePID varchar(20) NOT NULL, -- PID name for gene (or 'non coding')
    GeneName varchar(20) NOT NULL, -- short-form gene name
    GeneSynonym varchar(20) NOT NULL, --
    GeneCode varchar(20) NOT NULL, --
    GeneCOG varchar(20) NOT NULL, -- COG code for protein product
    GeneProduct varchar(200) NOT NULL, -- description of protein or RNA
    GeneRNA char(3) NOT NULL, -- 'RNA' for RNA genes
    GeneGC int NOT NULL, -- GC content of this gene
    GeneStrand char(1) NOT NULL, -- '+'=forward; '-'=reverse
    GeneStart int NOT NULL, -- chromosome-relative location of 'first' base
    GeneEnd int NOT NULL, -- chromosome-relative location of 'last' base
    GeneLength int NOT NULL, -- total length of gene
    GeneKey int NOT NULL -- key used to get to gene sequence (1-1)
) ENGINE=MYISAM AVG_ROW_LENGTH=200 MAX_ROWS=100000000;

ALTER TABLE Genes ADD PRIMARY KEY (SpeciesNo, SequenceNo, GeneNo), ADD INDEX ByGenePID (GenePID), ADD INDEX ByGeneLocation (SpeciesNo, SequenceNo, GeneStart, GeneEnd, GeneNo, GeneRNA);

Gene Sequence

• Actual sequence for each gene is useful sometimes
• But often just an overhead resulting in worse query performance
• Current DB design splits out the gene sequence from the gene metadata and annotations
    – Reduces the size of the heavily-used Genes records

CREATE TABLE GeneSeqs (
    GeneKey int NOT NULL, -- unique key for gene sequence (1-1)
    Gene longtext NOT NULL -- gene itself
) ENGINE=MYISAM AVG_ROW_LENGTH=200 MAX_ROWS=100000000;

ALTER TABLE GeneSeqs ADD PRIMARY KEY (GeneKey);

SQL Queries

• A query is an operation over tables that returns a table
    • Tables can be thought of as sets if that helps
    • Based on relational algebra originally

SELECT <result columns>
FROM <tables>
WHERE <restriction predicate>
GROUP BY ...
ORDER BY ...

• Query returns a table with the columns specified by <result columns>
• <tables> specifies the tables that are joined for the query
• <restriction predicate> is a Boolean expression over table columns that defines the conditions that must be met for a query row to be included in the results
• Optional GROUP BY clause is used for aggregation
• Optional ORDER BY clause defines the order of the result table
• And all this is recursive (a query result is a table, so queries can be nested)...
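The idea that a query returns a table, and can therefore feed another query, can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: SQLite stands in for MySQL, the Species table is cut down to three columns, and the rows are made up.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; the rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Species (SpeciesNo INTEGER PRIMARY KEY, SpeciesID TEXT, Finished INTEGER)"
)
conn.executemany(
    "INSERT INTO Species VALUES (?, ?, ?)",
    [
        (1, "Example strain C", 1),
        (2, "Example strain A", 1),
        (3, "Example strain B", 0),
        (4, "Example strain D", 1),
    ],
)

# A query returns a table: named columns plus rows.
cur = conn.execute("""
    SELECT SpeciesNo, SpeciesID
    FROM Species
    WHERE Finished = 1
    ORDER BY SpeciesID
""")
columns = [d[0] for d in cur.description]
finished = cur.fetchall()

# Because the result is itself a table, it can be used where a table is expected,
# here as a derived table in the FROM clause of an outer query.
nested_count = conn.execute("""
    SELECT COUNT(*)
    FROM (SELECT SpeciesNo FROM Species WHERE Finished = 1) AS f
    WHERE f.SpeciesNo > 1
""").fetchone()[0]
```

Here `finished` holds the three finished strains in name order, and `nested_count` counts rows of the inner result rather than of the Species table itself.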

Some simple sample queries

select * from Species order by SpeciesID
• Returns all columns from all rows in the Species table, in SpeciesID order

select * from Species where SpeciesID like 'Bacillus thuringiensis%'
• Returns all columns for all the Bt organisms in the database

select *
from Species, Sequences
where SpeciesID like 'Bacillus thuringiensis%' and Finished=1
  and Species.SpeciesNo=Sequences.SpeciesNo
• Joins Species and Sequences rows (Cartesian product in theory) on SpeciesNo
• Only returns rows for finished Bt organisms
• All columns from Species and Sequences returned
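The "Cartesian product in theory" remark can be checked on a toy example. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, with made-up rows: listing two tables without a condition pairs every row with every row, and the key comparison in the where clause keeps only the matching pairs.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; the rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Species (SpeciesNo INTEGER PRIMARY KEY, SpeciesID TEXT, Finished INTEGER);
CREATE TABLE Sequences (SpeciesNo INTEGER, SequenceNo INTEGER, SequenceID TEXT,
                        PRIMARY KEY (SpeciesNo, SequenceNo));
INSERT INTO Species VALUES (1, 'Example strain A', 1), (2, 'Example strain B', 1);
INSERT INTO Sequences VALUES (1, 1, 'A_chr'), (1, 2, 'A_plasmid'), (2, 1, 'B_chr');
""")

# No join condition: every Species row is paired with every Sequences row.
cartesian_rows = conn.execute("select count(*) from Species, Sequences").fetchone()[0]

# Implicit join: the where clause keeps only the pairs whose keys match.
implicit = conn.execute("""
    select sp.SpeciesID, sq.SequenceID
    from Species sp, Sequences sq
    where sp.SpeciesNo = sq.SpeciesNo
    order by sp.SpeciesNo, sq.SequenceNo
""").fetchall()

# The same join written with explicit 'join ... on' syntax.
explicit = conn.execute("""
    select sp.SpeciesID, sq.SequenceID
    from Species sp join Sequences sq on sp.SpeciesNo = sq.SpeciesNo
    order by sp.SpeciesNo, sq.SequenceNo
""").fetchall()
```

The unrestricted product has 2 x 3 = 6 rows, while both join forms return the same 3 rows. In practice the database does not build the full product first; the query optimiser uses the keys and indexes.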

Selecting columns

select sp.SpeciesNo, sp.SpeciesID, sq.SequenceNo, sq.SequenceID, sq.SequenceLength, sq.SequenceDesc
from Species sp, Sequences sq
where sp.SpeciesID like 'Bacillus thuringiensis%' and sp.Finished=1
  and sp.SpeciesNo=sq.SpeciesNo
• sp and sq are aliases for tables
• Get all the sequences for all the Bt organisms

select sp.SpeciesNo, sp.SpeciesID, sq.SequenceLength as 'Genome length'
from Species sp, Sequences sq
where sp.SpeciesID like 'Bacillus thuringiensis%' and sp.Finished=1
  and sp.SpeciesNo=sq.SpeciesNo and sq.SequenceDesc like '% chromosome'
• And just the chromosome lengths (if they’re annotated properly)

Aggregations (counts, averages, ...)

select 'Anthrax', COUNT(*), AVG(sq.SequenceLength) as 'Average length'
from Species sp, Sequences sq
where sp.SpeciesID like 'Bacillus anthracis%' and sp.Finished=1
  and sp.SpeciesNo=sq.SpeciesNo and sq.SequenceDesc like '% chromosome'
• How many Anthrax strains are in the database, and what is their average genome length?

select sp.SpeciesID, COUNT(*) as 'sequences'
from Species sp, Sequences sq
where sp.SpeciesID like 'Bacillus anthracis%' and sp.Finished=1
  and sp.SpeciesNo=sq.SpeciesNo
group by sp.SpeciesID
• How many sequences do we have for each of the Anthrax strains?
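The difference between aggregating over a whole result and aggregating per group can be seen on a few made-up rows. This is a sketch only, using Python's built-in sqlite3 module in place of MySQL and a Sequences table reduced to three columns.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; lengths are made-up numbers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sequences (SpeciesNo INTEGER, SequenceNo INTEGER, SequenceLength INTEGER,
                        PRIMARY KEY (SpeciesNo, SequenceNo));
INSERT INTO Sequences VALUES
    (1, 1, 5000), (1, 2, 300), (1, 3, 100),
    (2, 1, 4000), (2, 2, 200);
""")

# Without GROUP BY the aggregates collapse the whole result into one row.
overall = conn.execute(
    "SELECT COUNT(*), AVG(SequenceLength) FROM Sequences"
).fetchone()

# With GROUP BY there is one result row per distinct SpeciesNo.
per_species = conn.execute("""
    SELECT SpeciesNo, COUNT(*) AS sequences, SUM(SequenceLength) AS total_length
    FROM Sequences
    GROUP BY SpeciesNo
    ORDER BY SpeciesNo
""").fetchall()
```

`overall` is a single (count, average) row for all five sequences, while `per_species` has one row for each of the two species.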

More on joins

• A complex topic (inner, outer, natural, left, right)
    • We’ll just be linking rows together using their keys
    • Either explicitly or implicitly
    • Differences come from handling no-match cases...

select sq.SequenceNo, g.GeneStart, g.GeneEnd, g.GeneLength, g.GeneProduct, g.GeneCOG, g.GeneGC
from Species sp, Sequences sq, Genes g
where sp.SpeciesNo=sq.SpeciesNo and g.SpeciesNo=sp.SpeciesNo and g.SequenceNo=sq.SequenceNo
  and sp.SpeciesID='Bacillus anthracis Ames'
order by g.SequenceNo, g.GeneStart
• Get the genes in the 'Bacillus anthracis Ames' strain
• Could rewrite using explicit joins (see example)
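The "no-match cases" are where join types differ. A small sketch, again with Python's built-in sqlite3 module standing in for MySQL and made-up rows: one species has no gene records, so an inner join drops it and a left join keeps it with NULL (None in Python) in the gene columns.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; the rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Species (SpeciesNo INTEGER PRIMARY KEY, SpeciesID TEXT);
CREATE TABLE Genes (SpeciesNo INTEGER, GeneNo INTEGER, GeneName TEXT,
                    PRIMARY KEY (SpeciesNo, GeneNo));
INSERT INTO Species VALUES (1, 'Example strain A'), (2, 'Example strain B');
INSERT INTO Genes VALUES (1, 1, 'geneX'), (1, 2, 'geneY');  -- strain B has no genes
""")

# Inner join: only species that have at least one matching gene row.
inner_rows = conn.execute("""
    select sp.SpeciesID, g.GeneName
    from Species sp join Genes g on g.SpeciesNo = sp.SpeciesNo
    order by sp.SpeciesNo, g.GeneNo
""").fetchall()

# Left (outer) join: every species, with NULL where no gene matches.
left_rows = conn.execute("""
    select sp.SpeciesID, g.GeneName
    from Species sp left join Genes g on g.SpeciesNo = sp.SpeciesNo
    order by sp.SpeciesNo, g.GeneNo
""").fetchall()
```

The inner join returns two rows, both for strain A. The left join returns the same two rows plus one row for strain B with no gene name.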

More joins

select distinct s.SpeciesNo, s.SpeciesID, g.GeneProduct, gs.Gene
from Species s
  join Genes g on g.SpeciesNo = s.SpeciesNo
  join GeneSeqs gs on g.GeneKey = gs.GeneKey
where g.GeneRNA='RNA'
  and (g.GeneProduct like '%16S%'
       or g.GeneProduct = 'small subunit ribosomal RNA'
       or g.GeneProduct = 'Ribosomal RNA small subunit')
order by s.SpeciesID

• Find all easily identifiable 16S genes from all organisms
• Explicit ‘join’ syntax being used
• ‘distinct’ says don’t return duplicate rows

Joins across databases

select sp.SpeciesNo, sp.SpeciesID, sq.SequenceLength, sq.SequenceDesc
from Species sp
  join RDPBacterialTaxonomy.dbo.SpeciesNames sn on sp.SpeciesNo=sn.SpeciesNo
  join Sequences sq on sp.SpeciesNo=sq.SpeciesNo
where sn.FamilyID='Enterobacteriaceae' and sq.SequenceLength > 1000000 and sp.Finished=1
order by sp.SpeciesID

• Fetch the lengths and descriptions of all long sequences for finished organisms in the Enterobacteriaceae family
• Not all ‘chromosomes’ are annotated as such...
    • You could query to find out which ones...

Set operations

select distinct sp.speciesNo, sp.SpeciesID
from Species sp
where sp.Finished=1
except
select spg.speciesNo, spg.SpeciesID
from Species spg, Genes g
where g.SpeciesNo=spg.SpeciesNo and g.GeneRNA='RNA' and spg.Finished=1
  and (g.GeneProduct like '%16S%'
       or g.GeneProduct = 'small subunit ribosomal RNA'
       or g.GeneProduct = 'Ribosomal RNA small subunit')
order by sp.SpeciesID
• What organisms did not have an easily-identifiable 16S sequence?
• Union, intersection, ...
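The three set operations can be compared side by side on a toy version of the 16S question. This is a sketch under assumptions: Python's built-in sqlite3 module stands in for the course database, the tables are reduced to a few columns, and the rows are made up.

```python
import sqlite3

# SQLite (in memory) stands in for the course database; rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Species (SpeciesNo INTEGER PRIMARY KEY, SpeciesID TEXT);
CREATE TABLE Genes (SpeciesNo INTEGER, GeneNo INTEGER, GeneRNA TEXT, GeneProduct TEXT,
                    PRIMARY KEY (SpeciesNo, GeneNo));
INSERT INTO Species VALUES (1, 'Example strain A'), (2, 'Example strain B'),
                           (3, 'Example strain C');
INSERT INTO Genes VALUES
    (1, 1, 'RNA', '16S ribosomal RNA'),
    (1, 2, '',    'hypothetical protein'),
    (2, 1, '',    'hypothetical protein'),
    (3, 1, 'RNA', 'small subunit ribosomal RNA');
""")

all_species = "select SpeciesNo, SpeciesID from Species"
with_16s = """
    select sp.SpeciesNo, sp.SpeciesID
    from Species sp, Genes g
    where g.SpeciesNo = sp.SpeciesNo and g.GeneRNA = 'RNA'
      and (g.GeneProduct like '%16S%' or g.GeneProduct = 'small subunit ribosomal RNA')
"""


def combine(operator):
    """Run '<all_species> <operator> <with_16s>' and return rows in SpeciesID order."""
    sql = all_species + " " + operator + " " + with_16s + " order by SpeciesID"
    return conn.execute(sql).fetchall()


missing_16s = combine("except")     # in the first set but not the second
both = combine("intersect")         # in both sets
either = combine("union")           # in at least one set, duplicates removed
```

Only strain B lacks an identifiable 16S gene, so it is the single row returned by `except`. `intersect` returns strains A and C, and `union` returns all three.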

Nested queries

Pre-computed similarity metrics

• The database you have includes three similarity tables
• Based on k-mers in common (shared k-mers)
    – Organism-to-organism (GenomeToGenomeMatches)
    – Sequence-to-sequence (SequenceToSequenceMatches)
    – Gene-to-gene (GeneToGeneMatches)
• Other such pre-computed similarity metrics are possible
• But shared k-mers are fast to compute and useful
• Support answering questions about relatedness and conservation
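To make "shared k-mers" concrete, the sketch below counts the distinct k-mers two sequences have in common. It is an illustration only: these notes do not say how the similarity tables were built, so the value of k, the toy sequences and the choice to ignore reverse complements are all assumptions.

```python
def kmers(seq, k):
    """Return the set of all distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """Count the distinct k-mers that occur in both sequences."""
    return len(kmers(a, k) & kmers(b, k))


# Made-up toy sequences; real genomes would use a much larger k.
seq_a = "ACGTACGTGA"
seq_b = "TTACGTGACC"
common = shared_kmers(seq_a, seq_b, 4)
```

With k = 4 the two toy sequences share four k-mers (TACG, ACGT, CGTG, GTGA). Building and intersecting sets is close to linear in sequence length, which fits the remark that shared k-mers are fast to compute.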

Comparing two closely-related organisms

Performance

• SQL engines look at your query and the database and decide how to execute it most efficiently
    • ‘Query optimiser’
    • Use available indexes to improve search time
    • You say what you want to do – not how to do it
• Slow queries may be doing linear searches of large tables
    • (look at bacterial db schema and the defined indexes)
• Typical queries take just a few seconds
    – Acinetobacter baumannii AB0057 query...
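You can ask the query optimiser what it plans to do. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's built-in sqlite3 module (MySQL has an EXPLAIN statement that plays a similar role). The cut-down Genes table and the GenePID value are made up, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; the Genes table is cut down to a few columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Genes (
        SpeciesNo INTEGER, SequenceNo INTEGER, GeneNo INTEGER,
        GenePID TEXT, GeneProduct TEXT,
        PRIMARY KEY (SpeciesNo, SequenceNo, GeneNo)
    )
""")
conn.execute("CREATE INDEX ByGenePID ON Genes (GenePID)")


def plan(sql):
    """Return the optimiser's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description


# Equality test on an indexed column: the index can be used for a direct lookup.
indexed_plan = plan("SELECT * FROM Genes WHERE GenePID = 'PID123'")

# Pattern match with a leading wildcard on an unindexed column: every row is examined.
scan_plan = plan("SELECT * FROM Genes WHERE GeneProduct LIKE '%toxin%'")
```

The first plan mentions the ByGenePID index, while the second is a scan of the whole table. That second pattern is the "linear search" to look for when a query is slow.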

Who shares these resistance genes?

Wrap-up

• The bioinformatics world is full of ‘databases’
    • Collections of searchable data/references/literature
    • Often based on some form of database technology
        – Structured, unstructured, SQL, NoSQL, key-value stores, ...
        – Often hidden behind web pages and scripting
        – Often accessed through APIs
        – Use whatever seems to work best
• Query interfaces are much rarer
    – Some Web forms constructing queries on your behalf
    – Constructing some form of search predicate
    – Direct SQL queries are powerful but not common...

Prac session

• The bacterial database is available on the lab system
    • MySQL
• Run some of the samples in these notes
• Other tasks
    • Find out how many 16S copies there are, on average, in a given family
    • How would you do this for all families?
    • Find the toxin region in 'Clostridium botulinum A2 Kyoto'
    • Find out what other organisms share these toxin genes
    • Find plausible functions for some of the ‘hypothetical’ genes in Methanococcus maripaludis X1
    • Do something else interesting...
