Introduction to Bioinformatics
Sequence analysis
Etienne de Villiers
BecA-ILRI Hub Nairobi, Kenya
Outline
1. Molecular sequences
2. Nucleic acid sequence analysis
3. Protein sequence analysis
4. Homology Searching
Sequence analysis: overview
[Figure: flowchart of a sequence analysis project]
Nucleotide sequence file (from manual sequence entry, sequence database browsing, or sequencing project management):
• Search databases for similar sequences
• Sequence comparison
• Multiple sequence analysis
• Design further experiments: restriction mapping, PCR planning
• Search for known motifs
• Non-coding: RNA structure prediction
• Coding: search for protein coding regions, translate into protein, then protein sequence analysis
Protein sequence file:
• Search databases for similar sequences
• Sequence comparison
• Search for known motifs
• Predict secondary structure
• Predict tertiary structure
Multiple sequence analysis:
• Create a multiple sequence alignment
• Edit the alignment
• Format the alignment for publication
• Molecular phylogeny
• Protein family analysis
Nucleotide sequence analysis
Sequence entry
Sequences for analysis can be obtained from two main sources:
• Generated by yourself or
• Obtained from databases
Sequence formats
• Why different formats?
• Organise sequence information
• Database integration
• It is important to ensure that sequence files do not contain special characters. Plain ASCII files are suitable for most sequence programs.
• However, independent databases and some widely used programs have developed slightly different sequence formats.
• Using each format correctly is critical, as is the ability to recognize a format and convert a sequence, file or entry from one format to another.
Main file formats used in Bioinformatics
ASN.1 EMBL Swiss Prot FASTA GenBank Phylip PIR Nexus GCG
Sequence formats
• There are many different (> 20) sequence formats, including GenBank, EMBL, SwissProt, FASTA and several others.
1. FASTA/Pearson format
>seq1
agctagct actgg
>seq2
aactaact attcg
2. GenBank format
LOCUS seq1 16bp
DEFINITION seq1, 16 bases, 2688 checksum.
ORIGIN
1 agctagctag
//
LOCUS seq2 20bp
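To make the FASTA layout concrete, here is a minimal sketch of a reader for records like the two shown in the FASTA example. The function name `parse_fasta` is hypothetical, and the sketch assumes that whitespace inside a sequence line is only formatting and can be dropped.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Assumption: blanks inside sequence lines are formatting only.
            records[name] += "".join(line.split())
    return records


example = """>seq1
agctagct actgg
>seq2
aactaact attcg
"""
print(parse_fasta(example))
```

For real work, a maintained converter such as ReadSeq or Seqret is the safer choice; the sketch only illustrates how little structure the format has.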
There are several computer programs able to convert formats:
ReadSeq: available as a standalone package or on the web:
• http://bioportal.bic.nus.edu.sg/readseq/readseq.html
• http://www-bimas.cit.nih.gov/molbio/readseq/
• http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html
Seqret: a program in the EMBOSS suite
Sequence format conversions
Molecular Sequence Databases
What kinds of analyses are there?
1. Finding regions of interest in nucleic acid sequence
2. Gene finding
3. Frequency analysis
4. Database searching (Day 5)
5. Multiple alignment (Day 6)
6. Measuring homology by pairwise alignment (Day 6)
A gene codes for a protein
Gene/DNA: CCTGAGCCAACTATTGATGAA
(transcription)
mRNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
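The DNA, mRNA and protein strings on this slide can be reproduced with a short sketch. It assumes the standard genetic code and treats the DNA as the coding strand, so transcription is simply T to U. The names `transcribe`, `translate` and `CODON_TABLE` are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA (or DNA) string codon by codon from position 0."""
    dna = mrna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


gene = "CCTGAGCCAACTATTGATGAA"
print(transcribe(gene))             # CCUGAGCCAACUAUUGAUGAA
print(translate(transcribe(gene)))  # PEPTIDE
```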
Structure of Prokaryote and Eukaryote genes
Finding regions of interest in nucleic acid sequence
• Classification of lowly sequence repeats.
• Identification of gene components
– exon/intron boundaries
– promoters
– transcription factor binding sites, etc.
TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT TCTTTGATTATATTTAATTTCGTTAGGAAAAGACCGGAAAAAGAAGAGGAACTCAAACCT CCTTCTGCATTAGAAGATGAACTTAAAAAACGTGAAGAAGAAAGCCGAAAACGCATGGAA GAAATGCAAAAGGAAATTCTCGAAAAAAAGTTAAGAGAAGGTAAAAAAGCCTTGGAAGAA CTTGAAAAACGTGAAAAAGAAGTGGTAGATGAGTTTGCAAAACACCTCAAAAAACCTGAA GAAAGACTTCCTAAAATTATTCTTACATTGGATTCCGGTTTTCCAACAGTTGATCCTATT
Gene finding (1/3)
• Parts of a gene are scattered over the DNA.
• One approach: identify exactly where four specific signal types can be found.
– The start codon (ATG).
– The beginning of each intron (GT-).
– The end of each intron (-AG).
– The stop codon.
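As a rough illustration of the signal-based approach, the sketch below lists every position where one of the four signals occurs on the forward strand. The stop codons TAA, TAG and TGA are those of the standard genetic code; the function name `find_signals` is hypothetical. A real gene finder must still decide which of the many candidate positions are genuine.

```python
import re

# Signal name -> pattern. The stop codons are the three standard ones.
SIGNALS = {
    "start codon": "ATG",
    "intron start": "GT",
    "intron end": "AG",
    "stop codon": "TAA|TAG|TGA",
}


def find_signals(dna):
    """Return {signal name: [0-based positions]} for the forward strand."""
    dna = dna.upper()
    return {
        # The lookahead lets overlapping occurrences be reported too.
        name: [m.start() for m in re.finditer(f"(?=({pattern}))", dna)]
        for name, pattern in SIGNALS.items()
    }


print(find_signals("CCATGGTAAAGTGA"))
```

Even this 14-base toy sequence yields several candidates per signal, which is why signal detection alone is not enough.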
Gene finding (2/3)
• Second approach: content scoring method
– Analyze larger regions of sequence using codon frequency.
• Codon frequency is different in coding and noncoding regions:
– The coding portion of exons.
– The noncoding portion of exons.
– Introns.
– Intergenic regions.
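The content-scoring idea rests on codon counts. A minimal sketch, assuming a plain DNA string and a hypothetical helper `codon_frequencies`, tabulates the relative codon frequencies in one of the three forward reading frames. Comparing such tables against known coding and noncoding regions is the step a real gene finder adds.

```python
from collections import Counter


def codon_frequencies(dna, frame=0):
    """Relative frequency of each codon in the given reading frame (0, 1 or 2)."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


print(codon_frequencies("ATGATGAAA", frame=0))
print(codon_frequencies("ATGATGAAA", frame=1))
```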
Gene finding (3/3)
• The two major types: eukaryotic and prokaryotic
• Eukaryotic gene finding is much harder
– Presence of introns.
– Coding density is low.
– Underlying technologies: hidden Markov models, decision trees, neural networks...
Gene finding
TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT TCTTTGATTATATTTAATTTCGTTAGGAAAAGACCGGAAAAAGAAGAGGAACTCAAACCT CCTTCTGCATTAGAAGATGAACTTAAAAAACGTGAAGAAGAAAGCCGAAAACGCATGGAA GAAATGCAAAAGGAAATTCTCGAAAAAAAGTTAAGAGAAGGTAAAAAAGCCTTGGAAGAA CTTGAAAAACGTGAAAAAGAAGTGGTAGATGAGTTTGCAAAACACCTCAAAAAACCTGAA GAAAGACTTCCTAAAATTATTCTTACATTGGATTCCGGTAGTCCAACAGTTGATCCTATT
Frequency analysis
• Determine the frequency of occurrence of sequence elements.
• Applications
– using oligomer frequency to distinguish coding and noncoding regions.
– frequency of amino acids for predicting 3D structure of protein, functionality, location in the cell.
– finding ribosome binding site.
Gene prediction using codon frequency
[Figure: codon-frequency scores plotted for Frame 1, Frame 2 and Frame 3, marking coding and non-coding regions, the correct start, and the coding sequence]
Profile / PSSM
LTMTRGDIGNYLGLTVETISRLLGRFQKSGML
LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI
LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL
LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL
LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI
LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI
LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL
VRMSREEIGNYLGLTLETVSRLFSRFGREGLI
LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI
LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL
LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL
LPMSRQDIADYLGLTIETVSRTFTKLERHGAI
• DNA / protein segments of the same length L;
• Often represented as a positional frequency matrix;
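A positional frequency matrix can be built by counting residues column by column. The sketch below uses the first four segments from the slide; the function name `positional_frequency_matrix` is hypothetical, and no pseudocounts or log-odds conversion (as a full PSSM would use) are applied.

```python
from collections import Counter


def positional_frequency_matrix(segments):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("all segments must have the same length L")
    matrix = []
    for column in zip(*segments):
        counts = Counter(column)
        matrix.append({res: n / len(segments) for res, n in counts.items()})
    return matrix


segments = [
    "LTMTRGDIGNYLGLTVETISRLLGRFQKSGML",
    "LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI",
    "LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL",
    "LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL",
]
pfm = positional_frequency_matrix(segments)
print(pfm[0])   # fully conserved column
print(pfm[15])  # column where V and I both occur
```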
Protein Sequence Analysis
• Physico-chemical properties.
• Cellular localization.
• Signal peptides.
• Transmembrane domains.
• Post-translational modifications.
• Motifs & domains.
• Secondary structure.
• Other resources.
ExPASy (Expert Protein Analysis System)
• Swiss Institute of Bioinformatics (SIB).
• Dedicated to the analysis of protein sequences and structures.
• Many of the programs for protein sequence analysis can be accessed via ExPASy.
http://www.expasy.org/tools/
1) Physico-chemical properties:
• ProtParam tool
o molecular weight
o theoretical pI (the pH at which the protein has no net electrical charge)
o amino acid composition
o atomic composition
o extinction coefficient
o estimated half-life
o instability index
o aliphatic index
o grand average of hydropathicity (GRAVY)
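Two of these properties are simple enough to sketch by hand. The example below computes amino acid composition and GRAVY, taking GRAVY as the mean Kyte-Doolittle hydropathy over all residues. This is an illustration of the idea, not ProtParam's own code; the function names are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(protein):
    """Fraction of the sequence made up by each residue type."""
    counts = Counter(protein.upper())
    return {aa: n / len(protein) for aa, n in counts.items()}


def gravy(protein):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    protein = protein.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)


print(amino_acid_composition("PEPTIDE"))
print(round(gravy("PEPTIDE"), 3))
```

A negative GRAVY, as for this toy peptide, indicates a hydrophilic sequence overall.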
2) Cellular localization:
• Proteins destined for particular subcellular localizations have distinct amino acid properties particularly in their N-terminal regions.
• Used to predict whether a protein is localized in the cytoplasm, nucleus, mitochondria, or is retained in the ER, or destined for lysosome (vacuolar) or the peroxisome.
• PSORT: the end of the output gives the percentage likelihood of each subcellular localization.
3) Signal peptides:
• Proteins destined for secretion, for operation within the endoplasmic reticulum or lysosomes, and many transmembrane proteins are synthesized with a leading (N-terminal) signal peptide of 13 to 36 residues.
• SignalP WWW server can be used to predict the presence and location of signal peptide cleavage sites in your proteins.
• Useful to know whether your protein has a signal peptide as it indicates that it may be secreted from the cell.
• Proteins in their active form will have their signal peptides removed.
4) Transmembrane domains:
• TMpred program makes a prediction of membrane-spanning regions and their orientation.
• Algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins.
• The presence of transmembrane domains indicates that the protein is membrane-bound, for example located on the cell surface.
5) Post-translational modifications:
• After translation has occurred proteins may undergo a number of posttranslational modifications.
• These can include cleavage of the pro-region to release the active protein, removal of the signal peptide, and numerous covalent modifications such as acetylation, glycosylation, hydroxylation, methylation and phosphorylation.
• Posttranslational modifications may alter the molecular weight of your protein and thus its position on a gel.
• Many programs are available for predicting posttranslational modifications. We will take a look at one that predicts O-glycosylation sites in mammalian proteins.
• These programs work by looking for consensus sites. Finding a site does not mean that a modification definitely occurs there.
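A consensus-site search can be as simple as a pattern match. The sketch below is not the O-glycosylation predictor mentioned above; as a stand-in it uses the well-known N-glycosylation sequon from PROSITE (N-{P}-[ST]-{P}: Asn, any residue but Pro, Ser or Thr, any residue but Pro). The function name is hypothetical. As the slide warns, a match is only a candidate site.

```python
import re

# N-glycosylation sequon N-{P}-[ST]-{P}, wrapped in a lookahead so that
# overlapping candidate sites are all reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_consensus_sites(protein):
    """Return (0-based position, matched residues) for every candidate site."""
    return [(m.start(), m.group(1)) for m in N_GLYCOSYLATION.finditer(protein.upper())]


print(find_consensus_sites("AANGSAA"))  # one candidate site
print(find_consensus_sites("AANPSAA"))  # proline blocks the site
```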
6) Motifs and Domains:
• Motifs and domains give you information on the function of your protein.
• Search the protein against one of the motif or profile databases.
• ProfileScan, which allows you to search both the Prosite and Pfam databases simultaneously
7) Secondary Structure Prediction:
• WHY:
– If protein structure, even secondary structure, can be accurately predicted from the now abundantly available gene and protein sequences, such sequences become immensely more valuable for the understanding of drug design, the genetic basis of disease, the role of protein structure in its enzymatic, structural, and signal transduction functions, and basic physiology from molecular to cellular, to fully systemic levels.
• JPRED - works by combining a number of modern, high quality prediction methods to form a consensus.
Secondary Structure Prediction
• Essentially protein secondary structure consists of 3 major conformations;
α Helix.
β pleated sheet.
coil conformation.
Sequence Comparisons
• The comparison of sequences is one of the most used methods in bioinformatics:
– annotation of new nucleotide and protein sequences
– construction of protein structures
– design and analysis of bioinformatic and biological experiments.
• Nature acts conservatively: it does not develop a new kind of biology for every life form but continuously changes and adapts a proven general concept.
• One may transfer functional information from one protein to another if both possess a certain degree of similarity.
DB searches for similar sequences
• Since Charles Darwin, the idea of a common origin of species has become widely accepted. However, the level of molecular similarity between distant species remained unclear until the 1970s and 1980s.
• At that time it was established that many DNA and particularly protein molecules retain significant (>60-70%) or high (>85%) similarity hundreds of millions of years after separation from the common ancestor.
• This discovery, together with the practical need to search growing databases, led to the development of effective similarity search methods.
• Two programs greatly facilitated similarity searching: FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al. 1990).
Basics of similarity searches
• The basic step in any similarity search is an alignment of two or more sequences. Principles of alignment will be considered during the next lecture.
• The search provides a list of DB sequences with which a query sequence can be aligned. A scoring procedure then measures the degree of similarity, from 100% identity to a loose similarity.
• A common reason for performing a DB search is to find a related gene. A matched gene (or any other sequence) may provide a clue as to function.
• Alternatively, a sequence with a known function or role can be used as a query to search a species' genome.
• The search must be fast and sensitive enough.
Inferring function by homology
• The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species.
• Logic: if the linear alignment of a pair of sequences is similar, then we can infer that the 3-dimensional structure is similar; if the 3-D structure is similar then there is a good chance that the function is similar.
Sequence comparison through pairwise alignments
• Goal of pairwise comparison is to find conserved regions (if any) between two sequences
• Extrapolate information about our sequence using the known characteristics of the other sequence
THIO_EMENI GFVVVDCFATWCGPCKAIAPTVEKFAQTY
           G ++VD +A WCGPCK IAP +++ A  Y
???        GAILVDFWAEWCGPCKMIAPILDEIADEY
[Figure: annotation of the SwissProt entry THIO_EMENI is extrapolated to the unknown sequence ???]
Why Align Sequences?
• DNA sequences (4 letters in alphabet)
– GTAAACTGGTACT…
• Amino acid (protein) sequences (20 letters)
– SSHLDKLMNEFF…
• Align them so we can search databases
– To help predict structure/function of new genes
– In particular, look for homologues (evolutionary relatives)
• 3D-pssm (Imperial College - Structure Prediction)
– http://www.sbg.bio.ic.ac.uk/servers/3dpssm
– Give it a gene sequence
– It predicts the protein structure
Do alignments make sense ?
• Evolution of sequences
– Sequences evolve through mutation and selection
– Selective pressure is different for each residue position in a protein (i.e. conservation of active site, structure, charge, etc.)
• Modular nature of proteins
– Nature keeps re-using domains
• Alignments try to tell the evolutionary story of the proteins
[Figure: relationships between Same Sequence, Same 3D Fold, Same Origin and Same Function]
• Two similar regions of the Drosophila melanogaster Slit and Notch proteins
           970 980 990 1000 1010 1020
SLIT_DROME FSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFC
           ..:.:  :.  :.: ...:.: ..  : :.. : ::.. . :.: ::..:. :.  :. :
NOTC_DROME YKCECPRGFYDAHCLSDVDECASN-PCVNEGRCEDGINEFICHCPPGYTGKRCELDIDEC
           740 750 760 770 780 790
• Comparing the tissue-type and urokinase-type plasminogen activators. Displayed using a diagonal plot or Dotplot.
URL: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
[Figure: dotplot of Tissue-Type plasminogen Activator against Urokinase-Type plasminogen Activator]
As simple as projecting the diagonals onto the axis.
[Figure: domain layout read from the dotplot. Tissue-Type plasminogen Activator: A A’ B D C. Urokinase-Type plasminogen Activator: A C B D]
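The principle behind a dotplot can be sketched in a few lines: put one sequence on each axis and mark every cell where the two windows match. This toy version uses exact window matches and text output; tools such as Dotlet score windows with a substitution matrix instead. The function name and the test sequence are illustrative.

```python
def dotplot(seq_a, seq_b, window=1):
    """Return one text row per position in seq_a; '*' marks a matching window."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# A sequence against itself gives a full main diagonal; repeated
# letters show up as off-diagonal dots.
for row in dotplot("GATTACA", "GATTACA"):
    print(row)
```

Increasing `window` removes isolated chance matches, so that only longer shared stretches remain visible as diagonals.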
Some definitions
Identity Proportion of pairs of identical residues between two aligned sequences. Generally expressed as a percentage. This value strongly depends on how the two sequences are aligned.
Similarity Proportion of pairs of similar residues between two aligned sequences. Whether two residues are similar is determined by a substitution matrix. This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.
Homology Two sequences are homologous if and only if they have a common ancestor. There is no such thing as a level of homology ! (It's either yes or no)
• Homologous sequences do not necessarily serve the same function...
• ... Nor are they always highly similar: structure may be conserved while sequence is not.
Concept of a sequence alignment
• Pairwise Alignment:
Explicit mapping between the residues of 2 sequences
– Tolerant to errors (mismatches, insertion / deletions or indels)
– Evaluation of the alignment in a biological context (significance)
Seq A GARFIELDTHELASTFA-TCAT
      |||||||||||    || ||||
Seq B GARFIELDTHEVERYFASTCAT
[Figure labels: errors / mismatches; insertion; deletion]
Number of alignments
• There are many ways to align two sequences
• Consider the sequence fragments below: a simple alignment shows some conserved portions, but also:
CGATGCAGACGTCA |||||||| CGATGCAAGACGTCA
CGATGCAGACGTCA |||||||| CGATGCAAGACGTCA
• Number of possible alignments for 2 sequences of length 1000 residues: more than 10^600 gapped alignments (Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)
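The order of magnitude can be checked with one line of arithmetic. One common simplification counts an alignment of two sequences of length n as a way of interleaving them, which gives the binomial coefficient C(2n, n). That the slide's figure was derived this way is an assumption; other counting conventions give even larger numbers.

```python
import math


def count_alignments(n):
    """Number of ways to interleave two sequences of length n: C(2n, n)."""
    return math.comb(2 * n, n)


print(count_alignments(2))                 # 6
print(len(str(count_alignments(1000))))    # number of decimal digits
```

Exhaustive enumeration is therefore out of the question, which motivates scoring schemes and efficient search algorithms.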
What is a good alignment ?
• We need a way to evaluate the biological meaning of a given alignment
• Intuitively we "know" that the following alignment:
CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA
is better than:
ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG
• We can express this notion more rigorously, by using a scoring system
Simple alignment scores
• A simple way (but not the best) to score an alignment is to count 1 for each match and 0 for each mismatch.
CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA
Score: 12
ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG
Score: 5
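The match-counting score is easy to reproduce. A minimal sketch, with a hypothetical function name, that scores two already-aligned strings of equal length:

```python
def simple_score(aligned_a, aligned_b):
    """Count 1 for each identical column and 0 for each mismatch or gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")


print(simple_score("CGAGGCACAACGTCA", "CGATGCAAGACGTCA"))      # 12
print(simple_score("ATTGGACAGCAATCAGG", "ACGATGCAAGACGTCAG"))  # 5
```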
Scoring schemes - amino acids
• Scoring system used for nucleic acids doesn’t take into account
– Likelihood of one amino acid changing to another
– Some amino acid substitutions are disastrous
   • So they don’t survive evolution
– Some substitutions barely change anything
   • Because the two amino acids are chemically quite similar
• Scoring schemes address this problem
– Give scores to the chances of each substitution
• 2 possibilities:
– Use empirical evidence
   • Of actual substitutions in known homologues (families)
– Use theory from chemistry (hydrophobicity, etc.)
BLOSUM62 Scheme
• Blocks Amino Acid Substitution Matrices
• Empirical method
– Based on roughly 2000 amino acid patterns (blocks)
– Found in more than 500 families of related proteins
• Calculate the log-odds scores for each pair (R1, R2)
– Let O = observed frequency R1 <=> R2
– Let E = expected frequency R1 <=> R2 [happening by chance]
– I.e., Score = round(2 * log2(O/E))
• To calculate the score for an alignment of two sequences
– Add up the pairwise scores for residues
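The log-odds formula can be tried out directly. In the sketch below the observed and expected frequencies are made-up numbers chosen for illustration, not values from the BLOCKS database, and the function name is hypothetical.

```python
import math


def log_odds_score(observed, expected):
    """Score = round(2 * log2(O / E)), as on the slide."""
    return round(2 * math.log2(observed / expected))


# Hypothetical frequencies: a pair seen four times more often than chance,
# exactly as often as chance, and half as often as chance.
print(log_odds_score(0.04, 0.01))   # 4
print(log_odds_score(0.01, 0.01))   # 0
print(log_odds_score(0.005, 0.01))  # -2
```

Because the score is a logarithm of a ratio, adding the scores along an alignment corresponds to multiplying the underlying odds.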
BLOSUM62 Substitution Matrix
• Zero: by chance
– + more than chance
– - less than chance
• Arranged by
– Sidegroups
– So, high scoring in the end boxes
• Example
– M,I,L,V
– Interchangeable
Example Calculation
• Query = S S H L D K L M R
• Dbase = H S H L K L L M G
• Score = -1 4 8 4 -1 -2 4 5 -2
• Total score = -1+4+8+4+-1+-2+4+5+-2 = 19
• Write Blosum(Query,Dbase) = 19
– Not standard to do this
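The column-by-column sum can be checked with a short sketch. Only the BLOSUM62 entries needed for this example are included; in BLOSUM62 the pair R/G scores -2, so the nine column scores add up to 19. The names `BLOSUM62_SUBSET`, `pair_score` and `alignment_score` are illustrative.

```python
# Only the BLOSUM62 entries needed for this example (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("S", "H"): -1, ("S", "S"): 4, ("H", "H"): 8, ("L", "L"): 4,
    ("D", "K"): -1, ("K", "L"): -2, ("M", "M"): 5, ("R", "G"): -2,
}


def pair_score(a, b):
    """Look up a pair in either order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(query, dbase):
    """Add up the pairwise scores of an ungapped alignment."""
    return sum(pair_score(a, b) for a, b in zip(query, dbase))


query, dbase = "SSHLDKLMR", "HSHLKLLMG"
print([pair_score(a, b) for a, b in zip(query, dbase)])
print(alignment_score(query, dbase))  # 19
```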
Basic Local Alignment Search Tool (BLAST) was developed as a new way to perform sequence similarity searches.
“Local” means it searches and aligns sequence segments, rather than align the entire sequence.
It’s able to detect relationships among sequences which share only isolated regions of similarity.
Currently, it is the most popular and most accepted sequence analysis tool.
Why BLAST?
• Identify unknown sequences - The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information.
• Help gene/protein function and structure prediction – genes with similar sequences tend to share similar functions or structure.
• Identify protein family – group related (paralog or ortholog) genes and their proteins into a family.
• Prepare sequences for multiple alignments
• And more …
Idea: statistically significant alignments (hits)
– Will have regions of at least 3 letters the same
– Or at least high scoring with respect to the BLOSUM matrix
Based on small local alignments
Makes use of lookup tables
CCNDHRKMTCSPNDNNRK
YTNHHMMTTYSLDNNNKK more likely than CCNDHRKMTCSPNDNNRK
TTNDHRMTACSPDNNNKH
BLAST Algorithm
A Question
Question: Given the protein sequence
SLAALLNKCKTPQGQRLVNQW
and the word length L= 3,
how does the BLAST algorithm find the highest scoring alignment between this sequence and another sequence?
Answer: Explaining the BLAST Algorithm
1. Query sequence must be split into words of defined length. A list of words of length 3 (L) in the query protein sequence is made starting with positions 1, 2 and 3; then 2, 3 and 4; etc. Our query sequence:
SLAALLNKCKTPQGQRLVNQW
SLA, LAA, AAL, ALL, LLN, LNK, NKC, KCK, CKT, KTP, TPQ, PQG, QGQ, GQR, QRL, RLV, LVN, VNQ, NQW
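Step 1 is a sliding window over the query. A minimal sketch, with a hypothetical function name:

```python
def split_into_words(sequence, word_len=3):
    """All overlapping words of length word_len, in order of position."""
    return [sequence[i:i + word_len] for i in range(len(sequence) - word_len + 1)]


words = split_into_words("SLAALLNKCKTPQGQRLVNQW")
print(len(words))  # a sequence of length 21 yields 21 - 3 + 1 = 19 words
print(words)
```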
BLAST Algorithm (continued)
2. Define a threshold alignment score T (neighbourhood score threshold).
3. Find all word-pairs of length L with score ≥ T
– e.g. find all w such that S(w, PQG) ≥ T
– In other words, each word of the query sequence is scored against every possible combination of three amino acids.
– This is done using a scoring matrix (e.g., BLOSUM 62).
– Note: There are a total of 20 x 20 x 20 = 8,000 possible three-letter words to score against each query word
BLAST Algorithm (continued)
Neighbourhood words to PQG
Word  Score
PQG   18
PEG   15
PRG   14
PKG   14
PDG   13
PHG   13
PMG   13
PSG   13
Neighbourhood Score Threshold (T=13)
PQA   12
PQN   12
Words scoring at or above T are the neighbourhood words.
Note: This procedure is repeated for each three-letter word in the query sequence
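The neighbourhood of PQG can be enumerated by brute force over all 8,000 three-letter words. The sketch below includes only the three BLOSUM62 rows needed for the query word PQG (an assumption made to keep the example short); the function names are hypothetical. The full enumeration may return words that the slide does not list.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# BLOSUM62 rows for P, Q and G only, in the column order of AMINO_ACIDS.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def word_score(query_word, candidate):
    """Sum of position-by-position substitution scores."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def neighbourhood_words(query_word, threshold):
    """All words of the same length scoring >= threshold against query_word."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


hits = neighbourhood_words("PQG", threshold=13)
for word, score in sorted(hits.items(), key=lambda item: -item[1]):
    print(word, score)
```

Real BLAST implementations avoid the brute-force loop, but the resulting word list is the same idea.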
BLAST Algorithm (continued)
4. Now, search the database for all ‘hits’: sequences with exact matches to each w.
5. Extend the alignment of each ‘hit’ in both directions while the score increases, producing High-scoring Segment Pairs (HSPs), i.e. locally optimal ungapped alignments.
6. Return sequences with HSPs whose scores are statistically significantly higher than a threshold Smax
– Smax obtained empirically from random sequences
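Step 5 can be sketched as an ungapped extension around a word hit. Two simplifications are assumptions of this sketch and not taken from the slides: the scoring function is a plain +1/-1 match/mismatch score rather than BLOSUM62, and extension stops once the running score falls more than `x_drop` below the best score seen. All names are hypothetical.

```python
def extend_hit(query, subject, q_pos, s_pos, word_len, score, x_drop=5):
    """Extend an exact word hit without gaps.

    Returns (q_start, q_end, best_score); the HSP is query[q_start:q_end].
    """
    best = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(word_len))

    # Extend to the right, remembering where the score peaked.
    running, q_end, offset = best, q_pos + word_len, 0
    while (q_pos + word_len + offset < len(query)
           and s_pos + word_len + offset < len(subject)):
        running += score(query[q_pos + word_len + offset],
                         subject[s_pos + word_len + offset])
        offset += 1
        if running > best:
            best, q_end = running, q_pos + word_len + offset
        elif best - running > x_drop:
            break

    # Extend to the left in the same way.
    running, q_start, offset = best, q_pos, 1
    while q_pos - offset >= 0 and s_pos - offset >= 0:
        running += score(query[q_pos - offset], subject[s_pos - offset])
        if running > best:
            best, q_start = running, q_pos - offset
        elif best - running > x_drop:
            break
        offset += 1

    return q_start, q_end, best


def match_score(a, b):
    """Toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


query, subject = "AAAGATTACAGGG", "TTTGATTACACCC"
start, end, best = extend_hit(query, subject, 5, 5, 3, match_score)
print(query[start:end], best)  # GATTACA 7
```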
BLAST Algorithm (continued)
• So….
SLAALLNKCKTPQGQRLVNQW
+LA++L+   TP G R++ +W
TLASVLDCTVTPMGSRMLKRW
High-scoring Segment Pairs (HSPs)
BLAST Algorithm (continued)
7. Varying the threshold alignment score T
– Search time decreases as T is increased: fewer word pairs are found
– Sensitivity of the search decreases as T is increased: word pairs are overlooked, so homologous (or similar) sequences may be discarded.
– Note: The score of the alignment Smax AND the associated statistical significance are required to assess whether homology is suggested.
BLAST Algorithm (continued)
Finally:
• For each statistically significant HSP
– The alignment is reported
• If a sequence D has two HSPs with Query Q
– Two different alignments are reported
• Later versions of BLAST
– Try and unify the two alignments
Blast@NCBI
Blast advanced parameters
Blast results
Interpret BLAST results - Distribution
This image shows the distribution of BLAST hits on the query sequence. Each line represents a hit. The span of a line represents the region where similarity is detected. Different colors represent different ranges of scores.
Query sequence
BLAST hits. Click to access the pairwise alignment.
Interpret BLAST results - Description
ID (GI #, refseq #, DB-specific ID #) Click to access the record in GenBank
Max score – higher, better. Click to access the pairwise alignment
Expect value (E value): lower is better. It estimates how many hits with a score at least this high would be expected by chance
Gene/sequence Definition
The description (also called definition) lines are listed below under the heading "Sequences producing significant alignments". The term "significant" simply refers to all those hits whose E value was less than the threshold. It does not imply biological significance.
Links
Interpret BLAST results – pairwise alignments
Query line: the segment from query sequence.
Sbjct line: the segment from hit (subject) sequence.
Middle line: the consensus bases
Summary - If your sequence is NUCLEOTIDE
Length | DB | Purpose | Program
20 bp or longer | Nucleic | Identify the query sequence | MegaBlast, blastn
20 bp or longer | Nucleic | Find sequences similar to query sequence | blastn
20 bp or longer | Nucleic | Find similar proteins to translated query in a translated database | tblastx
20 bp or longer | Protein | Find similar proteins to translated query in a protein database | blastx
7-20 bp | Nucleic | Find primer binding sites or map short contiguous motifs | Search for short, nearly exact matches
Summary - If your sequence is PROTEIN
Length | DB | Purpose | Program
15 residues or longer | Protein | Identify the query sequence or find protein sequences similar to query | blastp
15 residues or longer | Protein | Find members of a protein family or build a custom position-specific score matrix | PSI-blast
15 residues or longer | Protein | Find proteins similar to the query around a given pattern | PHI-blast
15 residues or longer | Nucleic | Find similar proteins in a translated nucleotide database | tblastn
5-15 residues | Protein | Search for peptide motifs | Search for short, nearly exact matches