Download - Introduction to Bioinformatics Sequence analysis - CGIARhpc.ilri.cgiar.org/beca/training/maseno2012/Sequence_Analysis_files... · Outline 1. Molecular sequences 2. Nucleic acid sequence

Introduction to Bioinformatics

Sequence analysis

Etienne de Villiers

BecA-ILRI Hub Nairobi, Kenya

Outline

1.  Molecular sequences

2.  Nucleic acid sequence analysis

3.  Protein sequence analysis

4.  Homology Searching

Sequence analysis: overview

(Flowchart) Nucleotide sequence file, obtained by manual sequence entry, sequence database browsing or sequencing project management:

•  Search databases for similar sequences
•  Sequence comparison
•  Multiple sequence analysis
•  Design further experiments: restriction mapping, PCR planning
•  Search for protein coding regions
–  coding: translate into protein, then protein sequence analysis
–  non-coding: search for known motifs, RNA structure prediction

(Flowchart) Protein sequence file:

•  Search databases for similar sequences
•  Sequence comparison
•  Search for known motifs
•  Predict secondary structure
•  Predict tertiary structure
•  Create a multiple sequence alignment
•  Edit the alignment
•  Format the alignment for publication
•  Molecular phylogeny
•  Protein family analysis

Nucleotide sequence analysis

Sequence entry

Sequences for analysis can be obtained from two main sources:

• Generated by yourself or

• Obtained from databases

Sequence formats

• Why different formats?

• Organise sequence information

• Database integration

•  It is important to ensure that sequence files do not contain special characters. Plain ASCII files are suitable for most sequence programs.

•  However, independent databases and some widely used programs have developed slightly different formats for sequences.

•  Using the correct format is critical, as is the ability to recognize a format and convert a sequence, file or entry from one format to another.

Main file formats used in Bioinformatics

ASN.1, EMBL, Swiss-Prot, FASTA, GenBank, Phylip, PIR, Nexus, GCG

Sequence formats

•  There are many different (> 20) sequence formats, including GenBank, EMBL, SwissProt, FASTA and several others.

1. FASTA/Pearson format

>seq1

agctagct actgg

>seq2

aactaact attcg

2. GenBank format

LOCUS seq1 16bp

DEFINITION seq1, 16 bases, 2688 checksum.

ORIGIN

1 agctagctag

//

LOCUS seq2 20bp

Sequence format conversions

There are several computer programs able to convert formats:

•  ReadSeq: available as a standalone package or on the web:
–  http://bioportal.bic.nus.edu.sg/readseq/readseq.html
–  http://www-bimas.cit.nih.gov/molbio/readseq/
–  http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html

•  Seqret: a program in the EMBOSS suite
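To illustrate what a format converter has to do first, here is a minimal sketch of a FASTA reader in plain Python. The function name `read_fasta` is hypothetical and the example records are the two small sequences from the FASTA slide; real tools such as ReadSeq or Seqret handle many more formats and edge cases.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after '>'
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may contain spaces used only for layout
            records[name].append(line.replace(" ", ""))
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1
agctagct actgg
>seq2
aactaact attcg
"""

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```

Once the records are held as identifier/sequence pairs, writing them out in another format is a matter of printing the same data with a different layout.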

Molecular Sequence Databases

What kinds of analyses are there?

1.  Finding regions of interest in nucleic acid sequence

2.  Gene finding

3.  Frequency analysis

4.  Database searching – (Day 5)

5.  Multiple alignment – (Day 6)

6.  Measuring homology by pairwise alignment – (Day 6)

A gene codes for a protein

Gene/DNA: CCTGAGCCAACTATTGATGAA

(transcription)

mRNA: CCUGAGCCAACUAUUGAUGAA

(translation)

Protein: PEPTIDE
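The DNA, mRNA and protein strings on this slide can be reproduced with a short sketch. It assumes the standard genetic code, written here as the usual compact 64-letter string in T, C, A, G order; the function names `transcribe` and `translate` are illustrative only.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Return the mRNA with the same sequence as the coding DNA strand."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate a coding DNA sequence codon by codon ('*' marks a stop)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


gene = "CCTGAGCCAACTATTGATGAA"
print(transcribe(gene))  # CCUGAGCCAACUAUUGAUGAA
print(translate(gene))   # PEPTIDE
```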

Structure of Prokaryote and Eukaryote genes

Finding regions of interest in nucleic acid sequence

•  Classification of low-complexity sequence repeats.

•  Identification of gene components
–  exon/intron boundaries
–  promoters
–  transcription factor binding sites, etc.

TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT TCTTTGATTATATTTAATTTCGTTAGGAAAAGACCGGAAAAAGAAGAGGAACTCAAACCT CCTTCTGCATTAGAAGATGAACTTAAAAAACGTGAAGAAGAAAGCCGAAAACGCATGGAA GAAATGCAAAAGGAAATTCTCGAAAAAAAGTTAAGAGAAGGTAAAAAAGCCTTGGAAGAA CTTGAAAAACGTGAAAAAGAAGTGGTAGATGAGTTTGCAAAACACCTCAAAAAACCTGAA GAAAGACTTCCTAAAATTATTCTTACATTGGATTCCGGTTTTCCAACAGTTGATCCTATT

Gene finding (1/3)

•  Parts of a gene are scattered over the DNA.

•  One approach - identify exactly where four specific signal types can be found.
–  The start codon (ATG).
–  Beginning of each intron (GT-).
–  End of each intron (-AG).
–  The stop codon (TAA, TAG or TGA).
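As a rough illustration of the signal-based approach, the sketch below scans the three forward reading frames for a start codon followed by an in-frame stop codon. It is a deliberate simplification: it ignores introns (so the GT-/-AG signals are not used), looks at one strand only, and the function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) positions of ATG...stop stretches, 0-based, end exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, start codon included
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


toy = "CCATGAAATAGGG"
for start, end in find_orfs(toy):
    print(start, end, toy[start:end])
```

Real gene finders combine such signals with content statistics, because start and stop codons alone occur far too often by chance.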

Gene finding (2/3)

•  Second approach - content scoring method
–  analyze larger regions of sequence using codon frequency
–  codon frequency is different in coding and noncoding regions:
•  The coding portion of exons.
•  The noncoding portion of exons.
•  Introns.
•  Intergenic regions.

Gene finding (3/3)

•  The two major types: eukaryotic and prokaryotic

•  Eukaryotic gene finding is much harder
–  Presence of introns.
–  Coding density is low.
–  Underlying technologies: hidden Markov models, decision trees, neural networks...

Gene finding

TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT TCTTTGATTATATTTAATTTCGTTAGGAAAAGACCGGAAAAAGAAGAGGAACTCAAACCT CCTTCTGCATTAGAAGATGAACTTAAAAAACGTGAAGAAGAAAGCCGAAAACGCATGGAA GAAATGCAAAAGGAAATTCTCGAAAAAAAGTTAAGAGAAGGTAAAAAAGCCTTGGAAGAA CTTGAAAAACGTGAAAAAGAAGTGGTAGATGAGTTTGCAAAACACCTCAAAAAACCTGAA GAAAGACTTCCTAAAATTATTCTTACATTGGATTCCGGTAGTCCAACAGTTGATCCTATT

Frequency analysis

•  Determine the frequency of occurrence of sequence elements.

•  Applications
–  using oligomer frequency to distinguish coding and noncoding regions.
–  frequency of amino acids for predicting the 3D structure of a protein, its functionality and its location in the cell.
–  finding the ribosome binding site.
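Oligomer frequency is simple to compute. The sketch below counts overlapping words of length k and converts the counts to relative frequencies; the function name `kmer_frequencies` is illustrative, and comparing such profiles between known coding and noncoding regions is left out.

```python
from collections import Counter


def kmer_frequencies(seq, k=3):
    """Relative frequency of every overlapping k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


fragment = "TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT"
top = sorted(kmer_frequencies(fragment, k=3).items(), key=lambda kv: -kv[1])[:5]
for kmer, freq in top:
    print(kmer, round(freq, 3))
```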

Gene prediction using codon frequency

(Figure: codon-frequency plots for reading frames 1, 2 and 3, marking coding and non-coding regions, the correct start and the coding sequence.)

Profile / PSSM

LTMTRGDIGNYLGLTVETISRLLGRFQKSGML
LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI
LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL
LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL
LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI
LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI
LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL
VRMSREEIGNYLGLTLETVSRLFSRFGREGLI
LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI
LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL
LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL
LPMSRQDIADYLGLTIETVSRTFTKLERHGAI

• DNA / protein segments of the same length L;

• Often represented as a positional frequency matrix;
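A positional frequency matrix can be built directly from the twelve aligned segments on this slide. The sketch below is one straightforward way to do it; the function name is hypothetical, and a real PSSM would go one step further and turn the frequencies into log-odds scores with pseudocounts.

```python
from collections import Counter

SEGMENTS = [
    "LTMTRGDIGNYLGLTVETISRLLGRFQKSGML",
    "LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI",
    "LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL",
    "LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL",
    "LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI",
    "LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI",
    "LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL",
    "VRMSREEIGNYLGLTLETVSRLFSRFGREGLI",
    "LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI",
    "LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL",
    "LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL",
    "LPMSRQDIADYLGLTIETVSRTFTKLERHGAI",
]


def positional_frequency_matrix(segments):
    """Return one {residue: frequency} dict per alignment column."""
    n = len(segments)
    matrix = []
    for column in zip(*segments):  # iterate over columns, not rows
        counts = Counter(column)
        matrix.append({res: c / n for res, c in counts.items()})
    return matrix


pfm = positional_frequency_matrix(SEGMENTS)
for position, freqs in enumerate(pfm[:3], start=1):
    print(position, sorted(freqs.items(), key=lambda kv: -kv[1]))
```

Columns dominated by a single residue are the conserved positions of the profile.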

Protein Sequence Analysis

•  Physico-chemical properties.
•  Cellular localization.
•  Signal peptides.
•  Transmembrane domains.
•  Post-translational modifications.
•  Motifs & domains.
•  Secondary structure.
•  Other resources.

ExPASy (Expert Protein Analysis System)

•  Swiss Institute of Bioinformatics (SIB).

•  Dedicated to the analysis of protein sequences and structures.

•  Many of the programs for protein sequence analysis can be accessed via ExPASy.

http://www.expasy.org/tools/

1) Physico-chemical properties:

•  ProtParam tool
o  molecular weight
o  theoretical pI (the pH at which the protein has no net electrical charge)
o  amino acid composition
o  atomic composition
o  extinction coefficient
o  estimated half-life
o  instability index
o  aliphatic index
o  grand average of hydropathicity (GRAVY)
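Two of these properties are easy to sketch by hand. The code below computes the amino acid composition and the GRAVY value, taken here as the mean Kyte-Doolittle hydropathy over all residues. It is not ProtParam's own code; the function names are illustrative, and the example protein is the query sequence used later in the BLAST slides.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(protein):
    """Fraction of each amino acid present in the sequence."""
    counts = Counter(protein.upper())
    return {aa: counts[aa] / len(protein) for aa in sorted(counts)}


def gravy(protein):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in protein.upper()) / len(protein)


protein = "SLAALLNKCKTPQGQRLVNQW"
print(composition(protein))
print(round(gravy(protein), 3))
```

A positive GRAVY value points to a hydrophobic protein overall, a negative one to a hydrophilic protein.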

2) Cellular localization:

•  Proteins destined for particular subcellular localizations have distinct amino acid properties particularly in their N-terminal regions.

•  Used to predict whether a protein is localized in the cytoplasm, nucleus, mitochondria, or is retained in the ER, or destined for lysosome (vacuolar) or the peroxisome.

•  PSORT: the end of the output gives the percentage likelihood of each subcellular localization.

3) Signal peptides:

•  Proteins destined for secretion, for operation within the endoplasmic reticulum or lysosomes, and many transmembrane proteins are synthesized with leading (N-terminal) signal peptides of 13 – 36 residues.

•  SignalP WWW server can be used to predict the presence and location of signal peptide cleavage sites in your proteins.

•  Useful to know whether your protein has a signal peptide as it indicates that it may be secreted from the cell.

•  Proteins in their active form will have their signal peptides removed.

4) Transmembrane domains:

•  TMpred program makes a prediction of membrane-spanning regions and their orientation.

•  Algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins.

•  Presence of transmembrane domains is an indication that the protein is membrane-bound, for example located on the cell surface.

5) Post-translational modifications:

•  After translation has occurred proteins may undergo a number of posttranslational modifications.

•  Can include the cleavage of the pro- region to release the active protein, the removal of the signal peptide and numerous covalent modifications such as, acetylations, glycosylations, hydroxylations, methylations and phosphorylations.

•  Posttranslational modifications may alter the molecular weight of your protein and thus its position on a gel.

•  Many programs are available for predicting posttranslational modifications; we will take a look at one that predicts O-glycosylation sites in mammalian proteins.

•  These programs work by looking for consensus sites and just because a site is found does not mean that a modification definitely occurs.

6) Motifs and Domains:

•  Motifs and domains give you information on the function of your protein.

•  Search the protein against one of the motif or profile databases.

•  ProfileScan, which allows you to search both the Prosite and Pfam databases simultaneously
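To give a feel for what a motif search does, the sketch below converts a PROSITE-style pattern into a regular expression and scans a protein with it. The pattern used, N-{P}-[ST]-{P}, is the PROSITE N-glycosylation site pattern and is not taken from these slides; the test protein is made up, the converter ignores the `<` and `>` anchors, and profile-based searches such as Pfam work differently.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern (no '<' or '>' anchors) to a regex."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                       # x = any residue
        element = element.replace("{", "[^").replace("}", "]")    # {P} = anything but P
        element = element.replace("(", "{").replace(")", "}")     # x(2,4) = repeat count
        regex.append(element)
    return "".join(regex)


def find_motif(pattern, protein):
    """Return 0-based start positions of non-overlapping matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in compiled.finditer(protein)]


n_glyc = "N-{P}-[ST]-{P}"
print(prosite_to_regex(n_glyc))
print(find_motif(n_glyc, "MNGTANPS"))
```

As the slides note for modification sites, a pattern match only flags a candidate site; it does not show that the site is used.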

7) Secondary Structure Prediction:

•  WHY:
–  If protein structure, even secondary structure, can be accurately predicted from the now abundantly available gene and protein sequences, such sequences become immensely more valuable for the understanding of drug design, the genetic basis of disease, the role of protein structure in its enzymatic, structural, and signal transduction functions, and basic physiology from molecular to cellular, to fully systemic levels.

•  JPRED - works by combining a number of modern, high quality prediction methods to form a consensus.

Secondary Structure Prediction

•  Essentially protein secondary structure consists of 3 major conformations;

  α Helix.

  β pleated sheet.

  coil conformation.

Sequence Comparisons

•  Sequence comparison is the most widely used method in bioinformatics:
–  Annotation of new nucleotide and protein sequences
–  Construction of protein structures
–  Design and analysis of bioinformatic and biological experiments.

•  Nature acts conservatively: it does not develop a new kind of biology for every life form but continuously changes and adapts a proven general concept.

•  One may transfer functional information from one protein to another if both possess a certain degree of similarity.

DB searches for similar sequences

•  Since Charles Darwin, the idea of a common origin of species has become widely accepted; however, the level of similarity at the molecular level between distant species remained unclear until the 1970s and 1980s.

•  At that time it was established that many DNA and particularly protein molecules retain significant (>60-70%) or high (>85%) similarity hundreds of millions of years after separation from the common ancestor.

•  This discovery, as well as the practical need to search growing databases, led to the development of effective methods of similarity search.

•  Two programs that greatly facilitated similarity searching were developed: FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al. 1990).

Basics of similarity searches

•  The basic step in any similarity search is an alignment of two or more sequences. Principles of alignment will be considered during the next lecture.

•  The search provides a list of DB sequences with which a query sequence can be aligned. A scoring procedure is then applied, which measures the degree of similarity, from 100% identity to a loose similarity.

•  A common reason for performing a DB search is to find a related gene. A matched gene (or any other sequence) may provide a clue as to function.

•  An alternative task is to use a sequence with a known function or role as a query for a search in the genome of a species.

•  The search must be fast and sensitive enough.

Inferring function by homology

•  The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species.

•  Logic: if a pair of sequences is similar in their linear alignment, then we can infer that their 3-dimensional structures are similar; if the 3-D structures are similar, then there is a good chance that the functions are similar.

Sequence comparison through pairwise alignments

•  Goal of pairwise comparison is to find conserved regions (if any) between two sequences

•  Extrapolate information about our sequence using the known characteristics of the other sequence

THIO_EMENI  GFVVVDCFATWCGPCKAIAPTVEKFAQTY
            G ++VD +A WCGPCK IAP +++ A  Y
???         GAILVDFWAEWCGPCKMIAPILDEIADEY

(Figure: the SwissProt annotation of THIO_EMENI is extrapolated to the unknown sequence ???.)

Why Align Sequences?

• DNA sequences (4 letters in alphabet)
– GTAAACTGGTACT…

• Amino acid (protein) sequences (20 letters)
– SSHLDKLMNEFF…

• Align them so we can search databases
– To help predict structure/function of new genes
– In particular, look for homologues (evolutionary relatives)

• 3D-pssm (Imperial College - Structure Prediction)
– http://www.sbg.bio.ic.ac.uk/servers/3dpssm
– Give it a gene sequence
– It predicts the protein structure

Do alignments make sense ?

Evolution of sequences

•  Sequences evolve through mutation and selection
–  Selective pressure is different for each residue position in a protein (e.g. conservation of active site, structure, charge, etc.)

•  Modular nature of proteins
–  Nature keeps re-using domains

•  Alignments try to tell the evolutionary story of the proteins

Relationships: Same Sequence, Same 3D Fold, Same Origin, Same Function

•  Two similar regions of the Drosophila melanogaster Slit and Notch proteins

970 980 990 1000 1010 1020
SLIT_DROME FSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFC
           ..:.:  :.  :.: ...:.: ..  : :.. : ::.. . :.: ::..:. :.  :. :
NOTC_DROME YKCECPRGFYDAHCLSDVDECASN-PCVNEGRCEDGINEFICHCPPGYTGKRCELDIDEC
740 750 760 770 780 790

•  Comparing the tissue-type and urokinase type plasminogen activators. Displayed using a diagonal plot or Dotplot.

(Dotplot figure: Tissue-Type plasminogen Activator on one axis, Urokinase-Type plasminogen Activator on the other.)

URL: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

•  As simple as projecting the diagonals onto the axis.

Tissue-Type plasminogen Activator: A A’ B D C

Urokinase-Type plasminogen Activator: A C B D
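The idea behind a dotplot fits in a few lines. The sketch below marks every position pair where a window of residues is identical in both sequences, so shared regions show up as diagonal runs of stars. It is a bare-bones text version: Dotlet scores windows with a substitution matrix rather than requiring exact identity, and the function name and `window` parameter here are illustrative.

```python
def dotplot(seq_a, seq_b, window=3):
    """One text row per position in seq_a; '*' where the windows are identical."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else " "
        rows.append(row)
    return rows


for row in dotplot("GARFIELDTHELASTFATCAT", "GARFIELDTHEVERYFASTCAT"):
    print(row)
```

A break or shift in the diagonal corresponds to a mismatched stretch or an insertion/deletion between the two sequences.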

Some definitions

Identity: Proportion of pairs of identical residues between two aligned sequences. Generally expressed as a percentage. This value strongly depends on how the two sequences are aligned.

Similarity: Proportion of pairs of similar residues between two aligned sequences. Whether two residues are similar is determined by a substitution matrix. This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.

Homology: Two sequences are homologous if and only if they have a common ancestor. There is no such thing as a level of homology! (It's either yes or no)

•  Homologous sequences do not necessarily serve the same function...

•  ... nor are they always highly similar: structure may be conserved while sequence is not.

Concept of a sequence alignment

•  Pairwise Alignment:
–  Explicit mapping between the residues of 2 sequences
–  Tolerant to errors (mismatches, insertions / deletions or indels)
–  Evaluation of the alignment in a biological context (significance)

Seq A GARFIELDTHELASTFA-TCAT
      |||||||||||    || ||||
Seq B GARFIELDTHEVERYFASTCAT

(Figure labels: errors / mismatches, insertion, deletion)

Number of alignments

•  There are many ways to align two sequences

•  Consider the sequence fragments below: a simple alignment shows some conserved portions

CGATGCAGACGTCA
||||||||
CGATGCAAGACGTCA

but also:

CGATGCAGACGTCA
||||||||
CGATGCAAGACGTCA

•  Number of possible alignments for 2 sequences of length 1000 residues:
–  more than 10^600 gapped alignments (Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)
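The size of this number can be checked with a short calculation. The sketch below assumes one particular counting convention, which the slides do not specify: every alignment column is either a residue pair, a gap in the first sequence or a gap in the second, so the count obeys f(m, n) = f(m-1, n) + f(m, n-1) + f(m-1, n-1). The function name is hypothetical.

```python
def count_alignments(m, n):
    """Count gapped alignments of sequences of length m and n.

    Assumed convention: each column is a residue pair, a gap in
    sequence A or a gap in sequence B.
    """
    row = [1] * (n + 1)  # f(0, j) = 1: only gaps are possible
    for _ in range(m):
        new = [1]        # f(i, 0) = 1
        for j in range(1, n + 1):
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[n]


print(count_alignments(2, 2))
# Number of decimal digits in the count for two sequences of 1000 residues
print(len(str(count_alignments(1000, 1000))))
```

Under this convention the count for two sequences of 1000 residues has well over 600 digits, which is why alignment programs rely on dynamic programming or heuristics instead of enumeration.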

What is a good alignment ?

•  We need a way to evaluate the biological meaning of a given alignment

•  Intuitively we "know" that the following alignment:

CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA

is better than:

ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG

•  We can express this notion more rigorously, by using a scoring system

Simple alignment scores

•  A simple way (but not the best) to score an alignment is to count 1 for each match and 0 for each mismatch.

CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA

Score: 12

ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG

Score: 5
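The two scores on this slide can be recomputed with a few lines of code. The sketch below implements the match-counting scheme (1 per match, 0 per mismatch) and, as an extra, the percent identity defined earlier; the function names are illustrative and gap characters are never counted as matches.

```python
def match_score(seq_a, seq_b):
    """Count identical residue pairs in two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")


def percent_identity(seq_a, seq_b):
    """Identity as a percentage of the alignment length."""
    return 100.0 * match_score(seq_a, seq_b) / len(seq_a)


print(match_score("CGAGGCACAACGTCA", "CGATGCAAGACGTCA"))      # 12
print(match_score("ATTGGACAGCAATCAGG", "ACGATGCAAGACGTCAG"))  # 5
print(percent_identity("CGAGGCACAACGTCA", "CGATGCAAGACGTCA")) # 80.0
```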

Scoring schemes - amino acids

•  Scoring system used for nucleic acids doesn’t take into account
–  Likelihood of one amino acid changing to another
–  Some amino acid substitutions are disastrous
•  So they don’t survive evolution
–  Some substitutions barely change anything
•  Because the two amino acids are chemically quite similar

•  Scoring schemes address this problem
–  Give scores to the chances of each substitution

•  2 possibilities:
–  Use empirical evidence
•  Of actual substitutions in known homologues (families)
–  Use theory from chemistry (hydrophobicity, etc.)

BLOSUM62 Scheme

•  Blocks Amino Acid Substitution Matrices

•  Empirical method
–  Based on roughly 2000 amino acid patterns (blocks)
–  Found in more than 500 families of related proteins

•  Calculate the log-odds score for each pair (R1, R2)
–  Let O = observed frequency R1 <=> R2
–  Let E = expected frequency R1 <=> R2 [happening by chance]
–  I.e., Score = round(2 * log2(O/E))

•  To calculate the score for an alignment of two sequences
–  Add up the pairwise scores for residues
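The log-odds formula translates directly into code. The sketch below applies Score = round(2 * log2(O/E)) to a few made-up observed/expected ratios; the numbers are illustrative and are not real BLOSUM62 frequencies.

```python
import math


def log_odds_score(observed, expected):
    """Score = round(2 * log2(O / E)), in half-bit units."""
    return round(2 * math.log2(observed / expected))


print(log_odds_score(4, 1))  # pair seen 4x more often than chance
print(log_odds_score(1, 1))  # pair seen exactly as often as chance
print(log_odds_score(1, 4))  # pair seen 4x less often than chance
```

A pair observed more often than expected gets a positive score, a pair observed at chance level gets zero, and a rarely observed pair gets a negative score.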

BLOSUM62 Substitution Matrix

•  Zero: by chance
–  + more than chance
–  - less than chance

•  Arranged by
–  Sidegroups
–  So, high scoring in the end boxes

•  Example
–  M,I,L,V
–  Interchangeable

Example Calculation

•  Query = S S H L D K L M R
•  Dbase = H S H L K L L M G
•  Score = -1 4 8 4 -1 -2 4 5 -2

•  Total score = -1+4+8+4+-1+-2+4+5+-2 = 19
•  Write Blosum(Query,Dbase) = 19
–  Not standard to do this
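The example calculation can be checked in code. The sketch below holds only the BLOSUM62 entries needed for these nine residue pairs (a full matrix has 210 distinct pairs) and adds them up; the dictionary and function names are illustrative. With the BLOSUM62 value of -2 for the R/G pair, the nine pair scores sum to 19.

```python
# Subset of BLOSUM62: only the pairs that occur in this example
BLOSUM62_SUBSET = {
    ("S", "H"): -1, ("S", "S"): 4, ("H", "H"): 8, ("L", "L"): 4,
    ("D", "K"): -1, ("K", "L"): -2, ("M", "M"): 5, ("R", "G"): -2,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(query, dbase):
    """Sum of pairwise scores for an ungapped alignment."""
    return sum(pair_score(q, d) for q, d in zip(query, dbase))


query, dbase = "SSHLDKLMR", "HSHLKLLMG"
print([pair_score(q, d) for q, d in zip(query, dbase)])
print(alignment_score(query, dbase))  # 19
```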

Basic Local Alignment Search Tool (BLAST) was developed as a new way to perform sequence similarity searches.

“Local” means it searches and aligns sequence segments, rather than aligning the entire sequence.

It’s able to detect relationships among sequences which share only isolated regions of similarity.

Currently, it is the most popular and most accepted sequence analysis tool.

Why BLAST?

•  Identify unknown sequences - The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information.

•  Help gene/protein function and structure prediction – genes with similar sequences tend to share similar functions or structure.

•  Identify protein family – group related (paralog or ortholog) genes and their proteins into a family.

• Prepare sequences for multiple alignments

•  And more …

BLAST Algorithm

•  Idea: statistically significant alignments (hits)
–  Will have regions of at least 3 letters the same
–  Or at least high scoring with respect to the BLOSUM matrix

•  Based on small local alignments

•  Makes use of lookup tables

CCNDHRKMTCSPNDNNRK

YTNHHMMTTYSLDNNNKK more likely than CCNDHRKMTCSPNDNNRK

TTNDHRMTACSPDNNNKH

A Question

Question: Given the protein sequence

SLAALLNKCKTPQGQRLVNQW

and the word length L= 3,

how does the BLAST algorithm find the highest scoring alignment between this sequence and another sequence?

Answer: Explaining the BLAST Algorithm

1.  The query sequence must be split into words of defined length. A list of words of length 3 (L) in the query protein sequence is made starting with positions 1, 2 and 3; then 2, 3 and 4; etc. Our query sequence:

SLAALLNKCKTPQGQRLVNQW

SLA, LAA, AAL, ALL, LLN, LNK, NKC, KCK, CKT, KTP, TPQ, PQG, QGQ, GQR, QRL, RLV, LVN, VNQ, NQW
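Step 1 is a sliding window over the query. A minimal sketch, with a hypothetical function name:

```python
def split_into_words(sequence, word_length=3):
    """All overlapping words of the given length, in order of position."""
    return [
        sequence[i:i + word_length]
        for i in range(len(sequence) - word_length + 1)
    ]


query = "SLAALLNKCKTPQGQRLVNQW"
words = split_into_words(query)
print(len(words))
print(", ".join(words))
```

A query of 21 residues yields 21 - 3 + 1 = 19 overlapping words.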

Con…BLAST Algorithm

2.  Define a threshold alignment score T (neighbourhood score threshold).

3.  Find all word pairs of length L with score ≥ T
–  e.g. find all w such that S(w, PQG) ≥ T
–  In other words, each word of the query sequence is scored against every possible combination of three amino acids.
–  This is done using a scoring matrix (e.g., BLOSUM62).
–  Note: there are a total of 20 x 20 x 20 = 8,000 possible three-letter words to score against each query word.

Con…BLAST Algorithm

Neighbourhood words to PQG

Word | Score
PQG | 18
PEG | 15
PRG | 14
PKG | 14
PDG | 13
PHG | 13
PMG | 13
PSG | 13
(Neighbourhood Score Threshold, T=13: words above this line are neighbourhood words)
PQA | 12
PQN | 12

Note: This procedure is repeated for each three-letter word in the query sequence
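The scores in this list can be reproduced by adding up BLOSUM62 values position by position. The sketch below scores only the ten candidate words from the slide, using just the matrix entries they need; a real implementation would score all 8,000 possible words against the full matrix. The names are illustrative.

```python
# Only the BLOSUM62 entries needed for the candidate words on the slide
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1, ("Q", "D"): 0,
    ("Q", "H"): 0, ("Q", "M"): 0, ("Q", "S"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}


def word_score(query_word, candidate):
    """Sum of substitution scores over the positions of the two words."""
    total = 0
    for q, c in zip(query_word, candidate):
        key = (q, c) if (q, c) in BLOSUM62_SUBSET else (c, q)
        total += BLOSUM62_SUBSET[key]
    return total


CANDIDATES = ["PQG", "PEG", "PRG", "PKG", "PDG", "PHG", "PMG", "PSG", "PQA", "PQN"]
T = 13  # neighbourhood score threshold from the slide

scores = {w: word_score("PQG", w) for w in CANDIDATES}
neighbourhood = [w for w in CANDIDATES if scores[w] >= T]
print(scores)
print(neighbourhood)
```

With T = 13, the last two candidates (score 12) fall below the threshold and are dropped.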

Con….BLAST Algorithm

4.  Now, search the database for all ‘hits’: sequences with exact matches to each w.

5.  Extend the alignment of each hit in both directions while the score increases, producing High-scoring Segment Pairs (HSPs; locally optimal ungapped alignments).

6.  Return sequences with HSPs that score significantly (statistically) higher than a threshold Smax
–  Smax obtained empirically from random sequences


•  So….

SLAALLNKCKTPQGQRLVNQW
+LA++L+   TP G R++ +W
TLASVLDCTVTPMGSRMLKRW

High-scoring Segment Pairs (HSPs)
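The hit-finding step (step 4) relies on a lookup table. The sketch below indexes every three-letter word of the database sequence shown above and looks up the neighbourhood words of PQG (T = 13) in it. The extension into an HSP (step 5) is left out because it needs the full scoring matrix, and the function name is hypothetical.

```python
from collections import defaultdict


def build_word_index(sequence, word_length=3):
    """Lookup table: word -> list of 0-based positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_length + 1):
        index[sequence[i:i + word_length]].append(i)
    return index


subject = "TLASVLDCTVTPMGSRMLKRW"
# Neighbourhood words of the query word PQG at threshold T = 13
neighbourhood = ["PQG", "PEG", "PRG", "PKG", "PDG", "PHG", "PMG", "PSG"]

index = build_word_index(subject)
hits = [(word, pos) for word in neighbourhood for pos in index.get(word, [])]
print(hits)
```

The query word PQG itself does not occur in the database sequence, but its neighbourhood word PMG does, at the same offset as PQG in the query. This is the seed from which the alignment above would be extended.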


7.  Varying the threshold alignment score T
–  Search time decreases as T is increased, because fewer word pairs are found
–  Sensitivity of the search decreases as T is increased, because word pairs are overlooked (homologous or similar sequences may be discarded).
–  Note: The score of the alignment Smax AND the associated statistical significance are required to assess whether homology is suggested.


Finally:

•  For each statistically significant HSP
–  The alignment is reported

•  If a sequence D has two HSPs with Query Q
–  Two different alignments are reported

•  Later versions of BLAST
–  Try and unify the two alignments

Blast@NCBI

Blast advanced parameters

Blast results

Interpret BLAST results - Distribution

This image shows the distribution of BLAST hits on the query sequence. Each line represents a hit. The span of a line represents the region where similarity is detected. Different colors represent different ranges of scores.

Query sequence

BLAST hits. Click to access the pairwise alignment.

Interpret BLAST results - Description

ID (GI #, refseq #, DB-specific ID #) Click to access the record in GenBank

Max score – higher, better. Click to access the pairwise alignment

Expect value – lower, better. It estimates how many hits with a score this good would be expected by chance

Gene/sequence Definition

The description (also called definition) lines are listed below under the heading "Sequences producing significant alignments". The term "significant" simply refers to all those hits whose E value was less than the threshold. It does not imply biological significance.

Links

Interpret BLAST results – pairwise alignments

Query line: the segment from query sequence.

Sbjct line: the segment from hit (subject) sequence.

Middle line: the consensus bases

Summary - If your sequence is NUCLEOTIDE

Length | DB | Purpose | Program
20 bp or longer | Nucleic | Identify the query sequence | MegaBlast, blastn
20 bp or longer | Nucleic | Find sequences similar to query sequence | blastn
20 bp or longer | Nucleic | Find similar proteins to translated query in a translated database | tblastx
20 bp or longer | Protein | Find similar proteins to translated query in a protein database | blastx
7-20 bp | Nucleic | Find primer binding sites or map short contiguous motifs | Search for short, nearly exact matches

Summary - If your sequence is PROTEIN

Length | DB | Purpose | Program
15 residues or longer | Protein | Identify the query sequence or find protein sequences similar to query | blastp
15 residues or longer | Protein | Find members of a protein family or build a custom position-specific score matrix | PSI-blast
15 residues or longer | Protein | Find proteins similar to the query around a given pattern | PHI-blast
15 residues or longer | Nucleic | Find similar proteins in a translated nucleotide database | tblastn
5-15 residues | Protein | Search for peptide motifs | Search for short, nearly exact matches
