bioinformatics: guide to bio-computing and the internet copyright© kerstin wagner

Bioinformatics: Guide to bio-computing and the Internet

Copyright© Kerstin Wagner

Introduction: What is bioinformatics?

Bioinformatics can be defined as the body of tools and algorithms needed to handle large and complex biological information.

Bioinformatics is a new scientific discipline created from the interaction of biology and computer science.

The NCBI defines bioinformatics as: “Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline.”

Genomics era: High-throughput DNA sequencing

The first high-throughput genomics technology was automated DNA sequencing, in the early 1990s.

In September 1999, Celera Genomics completed the sequencing of the Drosophila genome.

In 1995, Craig Venter and Hamilton Smith used a whole-genome shotgun sequencing strategy to sequence the genomes of Mycoplasma genitalium and Haemophilus influenzae.

The 3-billion-bp human genome sequence was generated in a competition between the publicly funded Human Genome Project and Celera.

Top image: confocal detection by the MegaBACE sequencer of fluorescently labeled DNA

High-throughput DNA sequencing

That was then. How about now?

Next Generation Sequencing

(2010) vol11:31

Genomics: Completed genomes as of 2010

Genomes sequenced so far: 1598 bacterial / 85 archaeal / 294 eukaryotic.

This generates a large amount of information for individual computers to handle.

The trend of data growth

Figure: nucleotides (billions) plotted against year, 1980 to 2000.

The 21st century is a century of biotechnology:

Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel. (Now out: replaced by RNA-seq.)

Proteomics: Global protein analysis generates large mass spectra libraries.

Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized.

Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year.)

Metagenomics

- “Who is there and what are they doing?”

- Cultivation-independent approaches to study the big impact of microbes

How to handle the large amount of information?

Drew Sheneman, New Jersey--The Newark Star Ledger

Answer: bioinformatics and the Internet

Bioinformatics history

IBM 7090 computer

In the 1960s: the birth of bioinformatics

Margaret Oakley Dayhoff created:

-The first protein database

-The first program for sequence assembly

There is a need for computers and algorithms that allow: accessing, processing, storing, sharing, retrieving, visualizing, annotating…

Why do we need the Internet?

“Omics” projects and their associated information involve a huge amount of data that is stored on computers all over the world.

Because it is impossible to maintain up-to-date copies of all relevant databases within the lab, access to the data is via the Internet.

You are here

Your request

Database storage

Results

The Commercial Market

The current bioinformatics market is worth $300 million / year (half of it software)

Prediction: $2 billion / year in 5-6 years

~50 Bioinformatics companies: Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, Bioinformatics Solutions….

Scope of this lab

The lab will touch on the following computational tasks:

Similarity search

Sequence comparison: Alignment, multiple alignment, retrieval

Sequence analysis: Signal peptide, transmembrane domain,…

Protein folding: secondary structure from sequence

Sequence evolution: phylogenetic trees

The lab will also make you familiar with the bioinformatics resources available on the web for these tasks.

You have just cloned a gene

Is it already in databases?

-Accession #?

-Annotation?

Are there similar sequences?

-% identity?

-Family member?

Are there conserved regions?

-Alignments?

-Domains?

Evolutionary relationship?

-Phylogenetic tree

Protein characteristics?

-Sub-localization

-Soluble?

-3D fold

Other information?

-Expression profile?

-Mutants?

A critical failure of current bioinformatics is the lack of a single software package that can perform all of these functions.

Applying algorithms to analyze genomics data

DNA (nucleotide sequences) databases

These are big databases, and searching any one of them should produce similar results because they exchange information routinely.

-GenBank (NCBI): http://www.ncbi.nlm.nih.gov

-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp

-TIGR: http://tigr.org/tdb/tgi

-Yeast: http://yeastgenome.org

-Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi

Specialized databases: Tissues, species…

-ESTs (Expressed Sequence Tags)

~at NCBI http://www.ncbi.nlm.nih.gov/dbEST

~at TIGR http://tigr.org/tdb/tgi

- ...many more!


Protein (amino acid) databases

They are big databases too:

-Swiss-Prot (very high level of annotation): http://au.expasy.org/

-PIR (protein identification resource), the world's most comprehensive catalog of information on proteins: http://www.pir.uniprot.org/

Translated databases:

-TREMBL (translated EMBL): includes entries that have not yet been annotated into Swiss-Prot. http://www.ebi.ac.uk/trembl/access.html

-GenPept (translation of coding regions in GenBank)

-PDB (sequences derived from the 3D structures in the Brookhaven PDB): http://www.rcsb.org/pdb/

Database homology searching

Uses algorithms that give searches an efficient mathematical basis, so that results can be translated into statistical significance.

Assumes that sequence, structure, and function are inter-related.

All similarity searching methods rely on the concepts of alignment and distance between sequences.

A similarity score is calculated from a distance: the number of DNA bases or amino acids that are different between two sequences.
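The distance idea above can be sketched in a few lines. This is a minimal illustration, assuming the two sequences are already aligned and of equal length; the function names are hypothetical and not taken from any library.

```python
def distance(seq_a, seq_b):
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def percent_identity(seq_a, seq_b):
    """Turn the distance into a simple similarity measure (% identical positions)."""
    return 100.0 * (len(seq_a) - distance(seq_a, seq_b)) / len(seq_a)


# Toy DNA example: two of the seven positions differ.
print(distance("GATTACA", "GACTATA"))
print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

Real search tools score substitutions unequally and allow gaps, so this count is only the simplest possible distance.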

Calculating alignment scores

Scoring system: Uses scoring matrices that allow biologists to quantify the quality of sequence alignments.

The raw score S is calculated by summing the scores for each aligned position and the scores for gaps. Gap creation/extension scores are inherent to the scoring system in use (BLAST, FASTA…)

The score for an identity or a mismatch is given by the specified substitution matrix (e.g., BLOSUM62).
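The summation of the raw score S can be sketched as follows. The substitution scores and gap penalties here are toy values chosen for illustration (they are not BLOSUM62 entries or the defaults of BLAST or FASTA), and the function names are hypothetical.

```python
def substitution_score(x, y):
    """Toy substitution matrix: +4 for an identity, -1 for a mismatch."""
    return 4 if x == y else -1


def raw_score(aligned_a, aligned_b, gap_open=-5, gap_extend=-1):
    """Sum the scores of every aligned column; '-' marks a gap.

    The first column of a gap costs gap_open, each further column
    of the same gap costs gap_extend (one common convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += substitution_score(x, y)
            in_gap = False
    return score


print(raw_score("QKESG", "Q-ESG"))      # four identities and one gap column
print(raw_score("ACGT--A", "ACGTTTA"))  # five identities and a two-column gap
```

With a real matrix, `substitution_score` would be a lookup in that matrix instead of a fixed match/mismatch value.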

Devising a scoring system

How the matrices were created: Very similar sequences were aligned.

From these alignments, the frequency of substitution between each pair of amino acids was calculated and then PAM1 was built.

The full series of PAM matrices can be calculated by multiplying the PAM1 probability matrix by itself (PAMn is PAM1 raised to the nth power); each resulting matrix is then normalized to log-odds format.
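The repeated multiplication of PAM1 can be sketched on a toy two-letter alphabet. The matrix values below are made up for illustration, and the sketch assumes PAM1 is held as a matrix of substitution probabilities in which each row sums to 1; the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Return PAM1 multiplied by itself n times in total (PAM1 to the nth power)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Toy PAM1 for a 2-letter alphabet: 1% chance of substitution per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam2 = pam_n(toy_pam1, 2)
toy_pam250 = pam_n(toy_pam1, 250)
print(toy_pam2)
print(toy_pam250)
```

The further the matrix is extrapolated, the more the off-diagonal (substitution) probabilities grow, which is why high-numbered PAM matrices suit more divergent sequences.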

Some popular scoring matrices are:

PAM (Percent Accepted Mutation): for evolutionary studies. For example, PAM1 corresponds to 1 accepted point mutation per 100 amino acids.

BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding common motifs. For example in BLOSUM62, the alignment is created using sequences sharing no more than 62% identity.

Devising a scoring system

Importance: Scoring matrices appear in all analyses involving sequence comparison.

The choice of matrix can strongly influence the outcome of the analysis.

Understanding the theory underlying a given scoring matrix can aid in making the proper choice:

-Some matrices reflect similarity: good for database searching

-Some reflect distance: good for phylogenies

Log-odds matrices, a normalisation method for matrix values:

$$S_{ij} = \log \frac{q_{ij}}{p_i \, p_j}$$

S_ij is the log of the ratio between the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance. q_ij is the frequency with which i and j are observed to align in sequences known to be related. p_i and p_j are their frequencies of occurrence in the set of sequences.
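A log-odds score can be computed directly from these frequencies. The sketch below assumes the usual form log(q_ij / (p_i p_j)) with a base-2 logarithm; the frequencies in the example are invented for illustration, and published matrices also apply a scaling factor and rounding that are left out here.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2):
    """Log of the observed pair frequency over the frequency expected by chance."""
    return math.log(q_ij / (p_i * p_j), base)


# Pair seen twice as often as chance predicts: positive score.
print(log_odds(0.02, 0.1, 0.1))
# Pair seen half as often as chance predicts: negative score.
print(log_odds(0.005, 0.1, 0.1))
```

A positive value means the pair is aligned more often than expected by chance, a negative value less often, and zero means no preference.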

Database search methods: Sequence Alignment

Two broad classes of sequence alignments exist:

Global alignment: aligns the sequences over their entire length; not sensitive

QKESGPSSSYC

VQQESGLVRTTC

Local alignment: aligns only the most similar regions; faster

ESG

ESG
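The local alignment in the example (ESG against ESG) can be reproduced with a small Smith-Waterman sketch. The scoring values (match +1, mismatch -2, gap -2) are illustrative assumptions rather than the parameters of any real tool, and the function name is hypothetical.

```python
def smith_waterman(a, b, match=1, mismatch=-2, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC"))
```

With a milder mismatch penalty the algorithm may prefer a longer region that includes a mismatch, so the reported local alignment depends on the scoring system.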

The most widely used local similarity algorithms are:

Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)

Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)

FAST-All (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)


Which algorithm to use for database similarity search?

Speed: BLAST > FASTA > Smith-Waterman (it is VERY SLOW and uses a LOT OF COMPUTER POWER)

Sensitivity/statistics:

FASTA is more sensitive, misses fewer homologues

Smith-Waterman is even more sensitive

BLAST calculates probabilities

FASTA is more accurate for DNA-DNA searches than BLAST

k-tuple (word) methods are fast heuristics that approximate optimal alignments

They are faster than full dynamic programming (Smith-Waterman) and excellent at comparing sequences.

BLAST and FASTA programs are based on k-tuple algorithms:

1-Using the query sequence, derive a list of words of length w (e.g., 3)

2-Keep high-scoring words using a scoring matrix (e.g., BLOSUM62)

3-Compare the high-scoring words with database sequences

4-Use sequences with many matches to high-scoring words for the final alignments
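The word-based steps above can be sketched roughly as follows. This is a simplified illustration, not the BLAST or FASTA implementation: it keeps only exact query words (step 2, scoring neighbouring words with a matrix, is skipped), and the database, sequence names and function names are all made up.

```python
from collections import Counter


def words(seq, w=3):
    """Step 1: list every overlapping word of length w in a sequence."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def rank_by_word_hits(query, database, w=3):
    """Steps 3 and 4: count query words found in each database sequence
    and return the sequences with at least one hit, best first."""
    query_words = set(words(query, w))
    hits = Counter()
    for name, seq in database.items():
        hits[name] = sum(1 for word in words(seq, w) if word in query_words)
    return [(name, n) for name, n in hits.most_common() if n > 0]


# Hypothetical toy database.
toy_db = {
    "seqA": "VQQESGLVRTTC",
    "seqB": "MKESGPAAAYC",
    "seqC": "WWWWWWWW",
}

ranking = rank_by_word_hits("QKESGPSSSYC", toy_db)
print(ranking)
```

Only the top-ranked sequences would then go on to a full, slower alignment, which is where the speed gain comes from.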

The dilemma: DNA or protein?

Is the comparison of two nucleotide sequences accurate?

By translating into amino acid sequence, are we losing information? The genetic code is degenerate (two or more codons can represent the same amino acid).

Very different DNA sequences may code for similar protein sequences. We certainly do not want to miss those cases!
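A small sketch makes the degeneracy concrete: two DNA sequences that differ at most positions can still encode the same peptide. The standard genetic code is built from its usual compact string form; the two example sequences are invented for illustration.

```python
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA sequence in frame 1; 'X' marks an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


dna_1 = "CTTTCTCGT"
dna_2 = "TTGAGCAGA"

# Nucleotide level: count identical positions.
matches = sum(1 for x, y in zip(dna_1, dna_2) if x == y)
print(matches, "of", len(dna_1), "nucleotides identical")

# Protein level: both sequences encode Leu-Ser-Arg.
print(translate(dna_1), translate(dna_2))
```

A DNA-level search would see little similarity between these two sequences, while a protein-level comparison sees a perfect match.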

Tools to search databases

Search by similarity:

-Using nucleotide seq.

-Using amino acid seq.

Reasons for translating

Comparing DNA sequences gives more random matches:

Figure: a good alignment with end-gaps next to a very poor alignment. Almost 50% identity!

Conservation of protein in evolution (DNA similarity decays faster!)

Conclusion:

It is almost always better to compare coding sequences in their amino acid form, especially if they are very divergent.

For very highly similar sequences, nucleotide comparison may give better results.

BLAST and FASTA variants

FASTA: Compares a DNA query to a DNA database, or a protein query to a protein database.

FASTX: Compares a translated DNA query to a protein database.

TFASTA: Compares a protein query to a translated DNA database.

BLASTN: Compares a DNA query to a DNA database.

BLASTP: Compares a protein query to a protein database.

BLASTX: Compares the 6-frame translations of a DNA query to a protein database.

TBLASTN: Compares a protein query to the 6-frame translations of a DNA database.

TBLASTX: Compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (each sequence is comparable to BLASTP searches!)

PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching.
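The choice among the BLAST programs listed above depends only on the type of query and the type of database, which can be written down as a small lookup. The function and argument names are hypothetical helpers for illustration; they are not part of any BLAST software.

```python
def choose_blast(query_type, database_type, translate_both=False):
    """Pick a BLAST program from the query and database types.

    query_type, database_type: "dna" or "protein".
    translate_both: for DNA against DNA, compare 6-frame translations
    of both sides (TBLASTX) instead of the nucleotides (BLASTN).
    """
    if (query_type, database_type) == ("dna", "dna"):
        return "TBLASTX" if translate_both else "BLASTN"
    table = {
        ("protein", "protein"): "BLASTP",
        ("dna", "protein"): "BLASTX",
        ("protein", "dna"): "TBLASTN",
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'dna' or 'protein'") from None


print(choose_blast("dna", "protein"))
print(choose_blast("protein", "dna"))
```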

A practical example of sequence alignment: http://www.ncbi.nlm.nih.gov

BLAST results


Detailed BLAST results

E value: the expectation value, i.e. the number of hits with a score at least as good as this one that you would expect to find by chance in a database of this size. The lower the E, the more significant the score.
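How an E value relates to a score and to the size of the search can be sketched with the standard relation E = m × n × 2^(-S'), where S' is the normalized (bit) score, m the query length and n the database length. The numbers in the example are invented, the function names are hypothetical, and real programs also correct the lengths for edge effects, which is ignored here.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E value: chance hits expected with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one such chance hit: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Same score, larger database: more chance hits expected.
print(expected_hits(50, 300, 1e6))
print(expected_hits(50, 300, 1e9))
# For small E, the probability is almost equal to E itself.
print(p_value(0.01))
```

This is why the same alignment can be significant in a small database and unremarkable in a very large one.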

Database searching tips

Use the latest database version.

Use BLAST first, then a finer tool (FASTA,…).

Search both strands when using FASTA.

Translate sequences where relevant.

Search the 6-frame translation of a DNA database.

E < 0.05 is statistically significant, usually biologically interesting.

If the query has repeated segments, delete them and repeat the search.

Most widely used sites for sequence analysis

Sites for alignment of 2 sequences:

T-COFFEE (http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi): more accurate than ClustalW for sequences with less than 30% identity.

ClustalW (http://www.ch.embnet.org/software/ClustalW.html; http://align.genome.jp)

bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)

LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)

MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html)

Sites for DNA to protein translation: These tools can translate DNA sequences in any of the three forward or three reverse reading frames.

Translate (http://au.expasy.org/tools/dna.html)

Translate a DNA sequence (http://www.vivo.colostate.edu/molkit/translate/index.html)

Transeq (http://www.ebi.ac.uk/emboss/transeq)
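What these translation sites compute can be sketched as a six-frame translation: three reading frames on the given strand and three on its reverse complement. This is a minimal illustration using the standard genetic code; the function names and the frame labels (+1 to +3, -1 to -3) are assumptions for the example and do not mirror any particular site.

```python
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons from position 0; '*' is a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Complement each base, then reverse the sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Return the translations of the 3 forward and 3 reverse frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Programs such as BLASTX and TBLASTN perform this kind of translation on the query or the database before comparing at the protein level.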


http://www.mbio.ncsu.edu/bioedit/bioedit.html

BioEdit — a sequence editing software package

Oligo Design and Analysis Tools

http://www.idtdna.com/scitools/scitools.aspx
