sequence databases

Sequence Databases

As DNA and protein sequences accumulate, they are deposited in public databases.

One of the most popular of these is GenBank, which stores all types of different sequences in plain text files, where different features are separated by blank spaces and special symbols.

Each file can be downloaded directly. Sample file:LOCUS name of locusSOURCE source organism of DNAFEATURES information about sequence by base positionORIGIN 1 gaattcgata aatctctggg ttattgtgac gtttataatg acgttaggca 51 atatcattct atcattaagc

Relational Databases

A more structured way to store data is by using relational databases. A database is made up of a set of tables which stores data in a non-redundant manner to facilitate data updates.

student uin firstname lastname123456789 Alice Smith333445555 Tom Wang

major uin department123456789 BICH123456789 CPSC333445555 BICH

Query Languages

Instead of allowing users to look at the contents of a database directly, access is through the use of a query language over a connection to the database.

SQL is one of the most popular query languages. The following is a sample SQL query:

select firstname, lastnamefrom student, majorwhere student.uin = major.uin

and major.department = “BICH”

This approach imposes a structure to the data and allows updates to be seen instantaneously.

Graphical User Interface

Fortunately, there is no need to learn a query language in order to be able to access a relational database.

Modern database software packages provide graphical user interfaces developed on top of the query language level, so that users don’t need to know anything about query languages.

Relational databases are very popular within biotech companies and web interfaces are becoming standard.

Comparing Sequences

Mutation in DNA sequences is modeled by three basic operations: insertion of a nucleotide, deletion of a nucleotide and substitution of a nucleotide.

Mutation is tolerated to a certain extent before the function of a DNA sequence is lost or changed. Thus, similarity between sequences reflects similarity in function.

As DNA sequences of different organisms accumulate, comparing a newly discovered sequence to known sequences helps to identify the function of the new sequence.

Pairwise Alignment

Given two (DNA or protein) sequences, an alignment puts the two strings against each other so that similar parts are aligned together.

For example, given two strings ATCTCGAT and TGCATAT, one alignment is: AT-C-TCGCT -TGCAT--ATwhere ‘-’ denotes a gap introduced to pad non-similar parts.

Given a scoring scheme, the pairwise alignment problem is the problem of finding the best alignment with optimal score. A special form of alignment called local alignment allows alignments to not cover entire sequences.

Finding Optimal Alignment

Dynamic programming techniques are used to find the optimal alignment (which use a divide-and-conquer approach to express the original problem as smaller subproblems).

These algorithms for finding optimal alignment takes time proportional to the product of the length of input sequences.

In large scale comparisons, heuristics are commonly employed to find good alignment (which is not necessarily optimal) between two sequences.

Database Search

As the amount of sequenced DNA increases, it is likely that a new DNA sequence has similarity to at least one of the known sequences.

New DNA sequences are annotated as much as possible and collected in sequence databases such as GENBANK.

When a new DNA sequence is obtained, the first logical step is to try to search for similar sequences in a database of known sequences.

If similar sequences are found from database search, sequence similarity would imply similarity in function.

Filtration Idea

Given a database of sequences {s1, …, sn} and a sequence s, the first step is to determine which of the pairs (s, si) are likely to be similar.

There are conflicting objectives:

• Don’t want to lose real similarities.

• Want to cut down the number of possible similarities.

The simplest way to perform such a filtration is to consider a k-mer appearing in s and check if it appears in each of si.

s1s2s3s4

s

Seed-Extension Approach

The next step is to extend the short similarities (seeds) we found in the filtration step to see if the gaps between the short similar areas can be filled.

All hits with strong similarity found are sorted in decreasing order of score.

It is possible to compute the optimal alignment in detail for a limited subset of high scoring hits.

BLAST

BLAST is one of the most popular programs people use to perform database search.

There are various variations of the basic programs to compare DNA against DNA, DNA against protein, protein against protein, etc.

In addition to employing the filtration-extension approach, BLAST employs an additional set of heuristics to optimize its performance.

BLAST Seed Selection

BLAST uses different seed lengths for different types of sequences:

• use words of length 11 for DNA sequences.

• use words of length 3 for protein sequences.

• perform a translation from DNA to protein for comparisons involving DNA and proteins. All three reading frames are investigated.

For protein sequences, in addition to look for exact matches, BLAST also looks for approximate matches with score above a preset threshold.

BLAST Alignment Extension

BLAST tries to extend an alignment from the matching words in both directions, and continue the extension as long as the score continues to increase. If this results in a gap between two matching words being filled, the extension is successful.

s

si

×

×

××

×

×

Multiple Alignment

It is frequently very difficult to reveal weak similarities between two sequences.

These similarities can often be captured by comparing multiple sequences.

Multiple alignment generalizes the alignment idea to handle many sequences.

AT-C-TCGAT -TGCAT--AT ATCCA-CGCT

ClustalW

ClustalW is a very popular software for multiple alignment.

It uses the progressive alignment idea: treat each sequence as an alignment initially and iteratively select the next two most similar alignments to be combined into a bigger alignment by pairwise alignment techniques.

The order to combine the alignments is determined by following an evolutionary tree generated on the fly between the sequences.

+

TCoffee

When computing an initial pairwise alignment, TCoffee incorporates consistency information from other sequences to improve its consistency within the final multiple alignment:

• It accepts different kinds of pairwise alignments as input, and constructs a weighted library to represent the pairwise relationships.

• For each sequence pair A and B, it extends the original library by trying to align A and B through each sequence C in the input set and adding consistency information to the (A,B) pair.

TCoffee is much more accurate than ClustalW.

Gene Prediction

In eukaryotes, genes are made up of exons and introns. Introns are cut out of the RNA and exons are concatenated together to form the mRNA before translation into a protein.

The gene prediction problem is the following problem: Given a genomic sequence, locate the exon-intron boundaries.

exon exon exonintron intron gene (RNA)

mRNA

GENSCAN

GENSCAN is one of the most popular software programs used for gene structure prediction.

It integrates multiple types of information in a hidden markov model (HMM), including promoter signals, splicing signals, length distributions for exons and introns, poly-A transcriptional end signal, etc.

A HMM consists of a set of states and transition probabilities between the states. Each state represents a modeled feature.

Although the structure of the HMM is pre-determined, it is necessary to provide estimates of other parameters (such as transition probabilities) via a set of training examples.

Similarity-based Approaches

As the amount of sequenced DNA increases, it is likely that a new (unspliced) genomic sequence has similarity to a known cDNA or protein.

BLAST can be used to find similar sequences from the database of known sequences. These BLAST hits reveal only the most significant local alignments between the genomic sequence and a similar sequence, which usually represent one or few exons.

The GenomeScan software integrates this information with the HMM model of GENSCAN.

Regulation of Gene Expression

Gene regulatory proteins bind to specific places (regulatory sites) on DNA. These sites are usually close to the gene.

geneoff

site

genesiteon

regulatory protein

Finding Regulatory Sites

Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).

Extract regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.

Find some way to identify conserved patterns shared by these sequences, resulting in a list of potential regulatory sites.

The problem of finding patterns shared by a given sample of sequences is called the motif finding problem.

Motif Finding Problem

genesite

genesite

genesite

genesite

genesite

sample

MEME

MEME is one of the most popular programs for motif finding. It finds a motif, represented by a set of ungapped patterns, which is the least likely to appear by chance.

It uses the expectation-maximization (EM) approach: first obtain an initial motif (which may not be very good), then iteratively obtain a better motif with the following two steps:

• Expectation: compute the statistical composition of the current motif and find the probability of finding the site at each position in each sequence.

• Maximization: These probabilities are used to update the statistical composition.

Finding Regulatory Sites by Alignment

Recall that one way to find regulatory sites is to construct samples of upstream regions of co-regulated genes and use motif finding algorithms to look for shared patterns.

These samples of co-regulated genes are constructed from diverse sources. Regulatory sites can reside on either strand and they are not at a fixed distance from the transcriptional start site.

For these samples, alignment algorithms are not useful. However, alignment approaches can be very useful when we specifically consider co-regulated genes which are very close in evolutionary distance.

What to Look For

There are frequently more than one regulatory site for a gene. When sufficiently close upstream regions of co-regulated genes are aligned, we expect the sites to be short, well conserved, in the same order and on the same strand. These short blocks of highly conserved positions are separated by less conserved regions.

Although evolutionarily close genes should be used, they should not be so close that the entire upstream region is highly conserved, which does not give much information.

Using Closely Related Yeast Species

Align upstream sequences from a few Saccharomyces species to try to identify regulatory elements:

• Many alignments of S.mikatae and S.bayanus promoter regions with S.cerevisiae show short runs of conserved sequences, which predicts regulatory sites.

• The two species S.cariocanus and S.paradoxus are too closely related to S.cerevisiae. Not much information is revealed in their alignments.

Inferring Evolution

Using estimates on the evolutionary distance between multiple organisms, an evolutionary tree can be constructed predicting the evolutionary relationships.

It is possible that conflicting trees are produced from different data sources. We are much more confident that a tree is correct when the same tree is produced from various data sources.

mouse

rat

human

Evolutionary Distance Estimates

Estimates of evolutionary distance:

• Each multiple alignment gives distance estimates between species. To obtain more accurate estimates, a large set of multiple alignments including different regions or genes should be utilized collectively.

Using Evolutionary Information in Alignment

Progressive multiple alignment algorithms often construct an evolutionary tree on the fly to determine the order in which smaller alignments are merged to form bigger ones.

If the sequences to be aligned are from different species, it is also possible to utilize a pre-determined evolutionary tree.

Since an evolutionary tree can be constructed given a multiple alignment, it is possible to iteratively improve both the evolutionary tree and the multiple alignment from crude initial estimates.

Using Evolutionary Information in Motif Finding

The motif finding problem is to find a pattern shared by a given set of sequences. In the case when the sequences are from different species, evolutionary information can be utilized to get more accurate results.

Given an evolutionary tree of the represented species, a better way to score a motif is to compute the total number of mutations along the branches of the tree. It has been shown that this approach produces better motifs.