Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599

Post on 18-Dec-2015


Page 1

Sequence Analysis

Hemant Kelkar

Center for Bioinformatics

University of North Carolina

Chapel Hill, NC 27599

Page 2

Scope of Series

Talk I

• Overview and BLAST

Talk II

• Protein analysis/Sequence Alignment

Talk III

• Evolution

• Genomics and challenges

Page 3

Bioinformatics

• Mathematical, statistical, and computational methods used to solve biological problems

• Glue that holds the “omics” data together

Page 4

Help …

• Is “my sequence” in the databases?

• Is it similar to any sequence in the DB?

• Does it have any known motifs/domains that can help in identification?

• Is there a structural homolog?

• Are there any polymorphisms?

• Genetic map location?

Bioinformatics TOOLS!

Page 5

Bioinformatics Tools

• Genetic Code: similarity search, e.g. BLAST, FASTA

• Protein Structure: http://restools.sdsc.edu/biotools/biotools9.html

• Protein Evolution: e.g. CLUSTALW, T-COFFEE, Phylip

Page 6

Primary Sequence Databases

• GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)

• PIR (http://pir.georgetown.edu/)

• Swiss-Prot (http://us.expasy.org/sprot/)

Sequence information as generated in the laboratory

Page 7

Derived Sequence Databases

• Pfam (http://www.sanger.ac.uk/Software/Pfam/) : Protein families based on hidden Markov models (HMMs)

• InterPro (http://www.ebi.ac.uk/interpro/) : Protein families and domains based on functional sites

• TRANSFAC (http://www.gene-regulation.com/) : Transcription factor database

• Cytochrome P450 database (http://drnelson.utmem.edu/CytochromeP450.html)

Databases based on functional or phylogenetic analysis

Page 8

Derived Sequence Databases

• Flybase (http://www.flybase.org/) : Fly Genome

• Wormbase (http://www.wormbase.org/) : C. elegans

• Genome Browser (http://genome.ucsc.edu/) : Human and Mouse

• MGI (http://www.informatics.jax.org/) : Mouse

• Microbial Genome Resource (http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl)

Databases based on taxonomy

Page 9

Sequence Alignments

• Provide a measure of the relationship between nucleotide or protein sequences

• This allows us to decipher:

Structural relationships

Functional relationships

Evolutionary relationships

Page 10

Sequence Similarity Searches

• Information conserved evolutionarily

• DNA sequences NOT coding for proteins/rRNAs diverge rapidly

• When possible use protein sequences for similarity searches

• Non-homologous protein identification is much less reliable

• What is measured and what is inferred?

Page 11

Similarity

• Is always based on an observable

• Usually expressed as % identity

• Quantifies the divergence of two sequences

• substitutions/insertions/deletions

• Residues crucial for structure and/or function
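
As a minimal sketch of the "% identity" observable, the function below counts identical residues in a pair of sequences that are already aligned. The name `percent_identity` and the choice to ignore columns containing a gap are assumptions made for illustration; tools differ in which denominator they report.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # insertion/deletion column: not counted here
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# The aligned query/subject pair that appears in the BLAST slides of this talk.
query = "SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE"
sbjct = "TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS"
print(round(percent_identity(query, sbjct), 1))
```

Note that the number describes similarity only; whether the two sequences are homologous is a separate inference.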

Page 12

Homology

• Homology always implies that the molecules share a common ancestor

• Absolute answer

• Molecules ARE or ARE NOT homologous

• No degrees

Page 13

How to Find Similar Sequences

• Global Sequence Alignments

• Sequence comparison along entire length

• Homologs of similar length

• Local Sequence Alignments

• Similar regions in two sequences

• Regions outside the local alignment excluded

• Sequences of different length/similarity
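
The contrast between global and local alignment can be sketched with the classic dynamic-programming recurrences usually credited to Needleman and Wunsch (global) and Smith and Waterman (local). The slides do not name an algorithm, so that choice is an assumption, as are the function name `alignment_scores`, the +5/-4 match/mismatch scores, and the simple per-position gap penalty of -6.

```python
def alignment_scores(a, b, match=5, mismatch=-4, gap=-6):
    """Return (global_score, local_score) for sequences a and b.

    Scoring values are illustrative assumptions; gaps use a simple
    per-position (linear) penalty.
    """
    # Row 0: a global alignment pays for leading gaps, a local one does not.
    g_prev = [j * gap for j in range(len(b) + 1)]
    l_prev = [0] * (len(b) + 1)
    best_local = 0
    for i in range(1, len(a) + 1):
        g_curr = [i * gap] + [0] * len(b)
        l_curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            g_curr[j] = max(g_prev[j - 1] + s, g_prev[j] + gap, g_curr[j - 1] + gap)
            # Local scores are floored at 0 so an alignment can start anywhere.
            l_curr[j] = max(0, l_prev[j - 1] + s, l_prev[j] + gap, l_curr[j - 1] + gap)
            best_local = max(best_local, l_curr[j])
        g_prev, l_prev = g_curr, l_curr
    return g_prev[-1], best_local


# Two sequences that share only the core "ACGT".
print(alignment_scores("GGGACGTCCC", "TTTACGTAAA"))
```

For this pair the global score is dragged down by the unrelated flanks, while the local score reflects only the shared core, which is why local alignment suits sequences of different length or similarity.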

Page 14

Dotplot
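
A dot plot can be sketched in a few lines: one sequence runs along each axis and a mark is placed wherever the residues (or short windows of residues) match, so similar regions show up as diagonals. The function name `dotplot`, the `window` parameter, and the text rendering with `*` and `.` are assumptions for illustration only.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return rows of a text dot plot: '*' where a window of residues matches."""
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = []
        for j in range(len(seq_x) - window + 1):
            same = seq_y[i:i + window] == seq_x[j:j + window]
            row.append("*" if same else ".")
        rows.append("".join(row))
    return rows


# seq_x runs across the columns, seq_y down the rows.
for line in dotplot("GATTACA", "GATTTACA"):
    print(line)
```

Raising `window` above 1 suppresses isolated chance matches and makes the diagonals easier to see.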

Page 15

Scoring Matrices

• Empirical weighting schemes

• Considers important biology

• Side chain chemistry/structure/function

• Functional/Structural Conservation

• Ile/Val – small and hydrophobic

• Ser/Thr – both polar

• Size/Charge/Hydrophobicity

Page 16

Nucleotide Matrix

      A    C    G    T
A     5   -4   -4   -4
C    -4    5   -4   -4
G    -4   -4    5   -4
T    -4   -4   -4    5
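
To show how such a matrix is applied, the sketch below scores an ungapped pairing of two equal-length nucleotide sequences with +5 for a match and -4 for a mismatch, the values in the matrix above. The names `NUC_MATRIX` and `ungapped_score` are made up for this example.

```python
BASES = "ACGT"
# +5 on the diagonal (match), -4 everywhere else (mismatch).
NUC_MATRIX = {(a, b): (5 if a == b else -4) for a in BASES for b in BASES}


def ungapped_score(seq_a, seq_b, matrix=NUC_MATRIX):
    """Sum the matrix score of each aligned pair of residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


# Six matches and one mismatch: 6 * 5 - 4 = 26
print(ungapped_score("GATTACA", "GACTACA"))
```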

Page 17

PAM Scoring Matrices

• Margaret Dayhoff (1978)

• Point accepted mutations (PAM)

• Patterns of substitutions in highly related proteins (>85% identical), based on multiple sequence alignments

• New side chains must function similarly

• 1 PAM = 1 accepted amino acid change per 100 amino acids

• 1 PAM ~ 1% divergence

Page 18

BLOSUM Matrices

• Henikoff and Henikoff (1992)

• Blocks Substitution Matrices

• Differences in conserved ungapped regions

• Directly calculated, no extrapolations

• Sensitive to structural/functional substitutions

• Generally perform better for local similarity searches

Page 19

Scoring Matrix – BLOSUM62

Page 20

BLOSUM n

• Calculated from sequences sharing no more than n% identity

• Sequences with more than n% identity are clustered and weighted to 1

• Reducing the value of “n” yields more divergent/distantly-related sequences

• BLOSUM62 used as default by many of the online search sites

Page 21

Matrices and more

PAM Matrices (Altschul, 1991)

PAM40      Short alignments               >70%
PAM120                                    >50%
PAM250     Longer, weaker local areas     >30%

BLOSUM Matrices (Henikoff, 1993)

BLOSUM90   Short alignments               >60%
BLOSUM80                                  >50%
BLOSUM62   Commonly used                  >35%
BLOSUM30   Longer, weaker local alignments

Page 22

Gaps

• Compensate for insertions and deletions

• Improve alignments

• Must be kept to a reasonably small number

• About 1 per 20 residues is reasonable

• Need a different scoring scheme

Page 23

Gap Penalties

• Penalty for gap introduction

• Penalty for gap extension

Deductions for gap = G + Ln

                                  Nuc   Prot
where G = gap-opening penalty       5     11
      L = gap-extension penalty     2      1
      n = length of gap
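
As a worked example of the deduction G + Ln, the sketch below evaluates it for a gap of three residues. It reads the values on this slide as G = 5, L = 2 for nucleotide searches and G = 11, L = 1 for protein searches; that reading, and the names `gap_deduction`, `NUC`, and `PROT`, are assumptions.

```python
def gap_deduction(n, gap_open, gap_extend):
    """Deduction for one gap of length n: G + L * n."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * n


NUC = {"gap_open": 5, "gap_extend": 2}    # assumed nucleotide values
PROT = {"gap_open": 11, "gap_extend": 1}  # assumed protein values

print(gap_deduction(3, **NUC))   # 5 + 2 * 3 = 11
print(gap_deduction(3, **PROT))  # 11 + 1 * 3 = 14
```

Because opening costs more than extending, one long gap is penalized less than several short gaps of the same total length.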

Page 24

BLAST

• Basic Local Alignment Search Tool

• Seeks high-scoring segment pair (HSP)

• Sequences that can be aligned w/o gaps

• have a maximal aggregate score

• score must be above score threshold S

• Many HSPs may be reported for ungapped BLAST

Page 25

BLAST Algorithms

Program    Query                   Target
BLASTN     Nucleotide              Nucleotide
BLASTP     Protein                 Protein
BLASTX     Nucleotide (6-frame)    Protein
TBLASTN    Protein                 Nucleotide (6-frame)
TBLASTX    Nucleotide (6-frame)    Nucleotide (6-frame)
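
The "6-frame" entries mean that a nucleotide sequence is translated in three reading frames on each strand before protein-level comparison. Below is a minimal sketch using the standard genetic code; the function names and the `+1` to `-3` frame labels are assumptions, and real BLAST programs handle ambiguity codes and stop codons in more detail.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops are '*'."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translations(dna):
    """Return translations keyed by frame label (+1, +2, +3, -1, -2, -3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCC").items():
    print(frame, protein)
```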

Page 26

Neighborhood Words

Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE

Query Word (W = 3): STL

Neighborhood words and scores:

STL  13  (= 4 + 5 + 4)
SAL   8
SNL   8
SVL   8
---- Neighborhood Score Threshold (T = 8) ----
SBL   7
SCL   7
SDL   7
Etc.
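
The scores next to each word can be reproduced by summing substitution-matrix entries position by position. The sketch below is deliberately simplified: it varies only the middle residue of the query word STL and carries only the handful of BLOSUM62 entries it needs. The entries for S/S, T/T, L/L, T/A, T/N, T/V, T/C, and T/D agree with the scores on this slide; T/S = 1 is taken from the standard BLOSUM62 table. The names `PARTIAL_BLOSUM62`, `word_score`, and `middle_neighbors` are hypothetical, and real BLAST considers substitutions at all W positions.

```python
# Partial BLOSUM62: only the pairs needed for this example.
PARTIAL_BLOSUM62 = {
    ("S", "S"): 4, ("T", "T"): 5, ("L", "L"): 4,
    ("T", "A"): 0, ("T", "N"): 0, ("T", "V"): 0, ("T", "S"): 1,
    ("T", "C"): -1, ("T", "D"): -1,
}


def word_score(query_word, candidate, matrix):
    """Sum of pairwise scores between a query word and a candidate word."""
    return sum(matrix[(q, c)] for q, c in zip(query_word, candidate))


def middle_neighbors(query_word, threshold, matrix):
    """Words differing from a 3-letter query word only in the middle residue
    whose score against the query word is at least the threshold T."""
    first, mid, last = query_word
    hits = []
    for (a, b) in matrix:
        if a != mid:
            continue
        candidate = first + b + last
        score = word_score(query_word, candidate, matrix)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda h: (-h[1], h[0]))


for word, score in middle_neighbors("STL", 8, PARTIAL_BLOSUM62):
    print(word, score)
```

With T = 8 the words SCL and SDL (score 7) are excluded, matching the threshold line on the slide.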

Page 27

High-Scoring Segment Pairs

Neighborhood words: STL 13, SAL 8, SNL 8, SVL 8, SBL 7, SCL 7, SDL 7, etc.

Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS

Page 28

Extension

Significance Decay

• Mismatches

• Gap penalties

[Figure: Cumulative Score plotted against Extension, with levels X, S, and T marked]

Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS
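
The extension step can be sketched as follows, assuming that X in the figure is the largest drop below the best cumulative score that is tolerated before extension stops, as in the usual description of BLAST. The sketch extends to the right only, without gaps, and uses +5/-4 nucleotide scores for brevity; the function name `extend_right` and its parameters are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=5, mismatch=-4):
    """Ungapped rightward extension from a seed position.

    Returns (best_score, best_length): the highest cumulative score seen
    and the extension length at which it was reached.
    """
    score = 0
    best = 0
    best_len = 0
    length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score has decayed more than X below the best: stop
    return best, best_len


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=10))
```

A larger `x_drop` lets the extension cross a short run of mismatches to reach further matches; a smaller one stops sooner and saves time.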

Page 29

Karlin Altschul Equation

E = kmN e^{-λs}

m Number of letters in query

N Number of letters in db

mN Size of search space

λs Normalized score

k minor constant
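
A small numerical sketch of the equation, using the symbols defined above. The values of λ and k depend on the scoring system; the numbers used here are placeholders chosen for illustration, not values taken from the slides, and the function name `expected_hits` is hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, lam, k):
    """E = k * m * N * exp(-lambda * s)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters (assumptions for illustration only).
LAM = 0.3
K = 0.1

for s in (40, 60, 80):
    print(s, expected_hits(s, query_len=250, db_len=1_000_000, lam=LAM, k=K))
```

E grows linearly with the search space mN and falls exponentially with the score s, so the same score is less significant against a larger database.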

Page 30

http://www.ncbi.nlm.nih.gov