Post on 18-Jan-2016

Sequence Based Analysis Tutorial

NIH Proteomics Workshop

Cecilia Arighi, Ph.D.
Protein Information Resource at Georgetown University Medical Center

Retrieval, Sequence Search & Classification Methods

• Retrieve protein info by text / UID
• Sequence Similarity Search
  • BLAST, FASTA, Dynamic Programming
• Family Classification
  • Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
• Integrated Search and Classification System

Sequence Similarity Search (I)

• Based on Pair-Wise Comparisons
• Dynamic Programming Algorithms
  • Global Similarity: Needleman-Wunsch
  • Local Similarity: Smith-Waterman
• Heuristic Algorithms
  • FASTA: Based on K-Tuples (2-Amino Acid)
  • BLAST: Triples of Conserved Amino Acids
  • Gapped-BLAST: Allow Gaps in Segment Pairs
  • PHI-BLAST: Pattern-Hit Initiated Search
  • PSI-BLAST: Position-Specific Iterated Search
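To make the dynamic programming idea concrete, here is a minimal sketch of Smith-Waterman local alignment. The scoring scheme (match = 2, mismatch = -1, linear gap = -2) is an assumption chosen for illustration. A real protein search would normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b).

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: that is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXHPWFKYY", "ZZZHPWFKQ"))
```

Needleman-Wunsch (global similarity) uses the same recurrence without the 0 floor and traces back from the bottom-right corner of the table instead of from the best cell.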

Sequence Similarity Search (II)

• Similarity Search Parameters
  • Scoring Matrices: Based on Conserved Amino Acid Substitution
    • Dayhoff Mutation Matrix, e.g., PAM250 (~20% Identity)
    • Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM62
  • Gap Penalty
• Search Time Comparisons
  • Smith-Waterman: 10 Min
  • FASTA: 2 Min
  • BLAST: 20 Sec

Feature Representation

• Features of Amino Acids: Physicochemical Properties, Context (Local & Global) Features, Evolutionary Features
• Alternative Amino Acids: Classification of Amino Acids to Capture Different Features of Amino Acid Residues

Substitution Matrix

• Likelihood of One Amino Acid Mutated into Another Over Evolutionary Time
• Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
• Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
• High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
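As a small illustration of how a substitution matrix is used, the sketch below scores an ungapped alignment by summing pair scores. The table is a deliberately tiny, hypothetical excerpt. It contains only the two pairs quoted on the slide plus the PAM250 identity scores for Trp and Cys, so it is not a usable matrix on its own.

```python
# Hypothetical mini-table for illustration only. Gly/Trp and Lys/Arg are the
# values quoted on the slide; Trp/Trp and Cys/Cys are PAM250 identity scores.
SCORES = {
    ("G", "W"): -7,   # unlikely substitution
    ("K", "R"): 3,    # conservative substitution
    ("W", "W"): 17,   # rare residue, high identity score
    ("C", "C"): 12,   # rare residue, high identity score
}


def pair_score(x, y, table=SCORES):
    """Look up a symmetric substitution score; KeyError if the pair is absent."""
    if (x, y) in table:
        return table[(x, y)]
    return table[(y, x)]


def ungapped_score(s, t, table=SCORES):
    """Sum the substitution scores over two equal-length, ungapped sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y, table) for x, y in zip(s, t))


if __name__ == "__main__":
    print(ungapped_score("KWC", "RWC"))  # 3 + 17 + 12
```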

Secondary Structure Features

• Helix: Patterns of Hydrophobic Residue Conservation Showing I, I+3, I+4, I+7 Pattern Are Highly Indicative of an Alpha Helix (Amphipathic)
• Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions I, I+2, I+4, I+6
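The two periodicity rules above can be checked with a few lines of code. This is only a sketch. The set of residues treated as hydrophobic is an assumption, and the function looks at a single sequence, whereas the slide refers to conservation patterns, which would normally be read from an alignment.

```python
# Assumed hydrophobic set; other definitions are common and equally valid.
HYDROPHOBIC = set("AVLIMFWC")

HELIX_OFFSETS = (0, 3, 4, 7)    # I, I+3, I+4, I+7
STRAND_OFFSETS = (0, 2, 4, 6)   # I, I+2, I+4, I+6


def pattern_hits(seq, offsets, hydrophobic=HYDROPHOBIC):
    """Return 0-based start positions where every offset lands on a hydrophobic residue."""
    span = max(offsets)
    return [
        i
        for i in range(len(seq) - span)
        if all(seq[i + k] in hydrophobic for k in offsets)
    ]


if __name__ == "__main__":
    print(pattern_hits("LKKLLKKL", HELIX_OFFSETS))
    print(pattern_hits("VKVKVKV", STRAND_OFFSETS))
```

Positions are reported 0-based here, while the slide counts from an arbitrary position I.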

BLAST

• BLAST (Basic Local Alignment Search Tool)
  • Extremely fast
  • Robust
  • Most frequently used
• It finds very short segment pairs (“seeds”) between the query and the database sequence
• These seeds are then extended in both directions until the maximum possible score for extensions of this particular seed is reached
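The seed-and-extend idea can be sketched as follows. This is a simplified toy, not the BLAST implementation. It assumes exact 3-letter word matches as seeds, a flat match/mismatch score, and a simple drop-off rule (`x_drop`) to stop extension. BLAST itself scores words and extensions with a substitution matrix. All parameter names here are illustrative.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where identical k-letter words occur."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [
        (i, j)
        for i in range(len(query) - k + 1)
        for j in index.get(query[i:i + k], [])
    ]


def extend_seed(query, subject, qi, sj, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend an exact seed without gaps in both directions.

    Extension in a direction stops at a sequence end, or when the running
    score falls more than x_drop below the best score seen in that direction.
    Returns (score, query_start, subject_start, length) of the best segment.
    """
    def extend(step, q, s):
        best = score = best_len = n = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = extend(+1, qi + k, sj + k)
    left_score, left_len = extend(-1, qi - 1, sj - 1)
    total = k * match + left_score + right_score
    return total, qi - left_len, sj - left_len, left_len + k + right_len


if __name__ == "__main__":
    q, s = "AAHPWFKRCC", "GGGHPWFKRTT"
    seeds = find_seeds(q, s)
    print(seeds)
    print(extend_seed(q, s, *seeds[0]))
```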

BLAST Search

• From BLAST Search Interface
• Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment

Link to NCBI taxonomy

Click to see alignment

Links to iProClass and UniProtKB reports

Link to PIRSF report

Click to see SSEARCH alignment

BLAST Result & Pairwise Alignment

BLAST Alignment

Classification

• What is classification?
• Why do we need protein classification?
• Different levels of classification
• Basis for functional protein classification
• How to classify a protein of unknown function?

Classification Databases

• Protein motif: C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H (the 2 C's and the 2 H's are zinc ligands)
• Protein domain: Group proteins according to the presence of a common domain
• 3-D structure: Group proteins according to common 3D structure
• Whole-protein: Group proteins according to common domain architecture and length

Family Classification Methods

• Based on Other Classification Information
• Multiple Sequence Alignment (ClustalW)
• ProSite Pattern Search
• Profile Search
• Hidden Markov Models (HMMs)
  • Domain (Pfam); Whole protein (PIRSF)
• Neural Networks

How do you build a tree?

• Pick sequences to align
• Align them
• Verify the alignment
• Keep the parts that are aligned correctly
• Build and evaluate a phylogenetic tree

Integrated Analysis

Multiple Sequence Alignment: CLUSTALW

• Pairwise alignment: Calculate distance matrix (mean number of differences per residue)
• Unrooted Neighbor-Joining Tree: Branch length drawn to scale
• Rooted NJ Tree (guide tree): Root placed at a position where the means of the branch lengths on either side of the root are equal
• Progressive Alignment guided by the tree: Alignment starts from the tips of the tree towards the root

Thompson et al., NAR 22, 4675 (1994).
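The first CLUSTALW step, a distance matrix of mean differences per residue, can be sketched as below. The sketch assumes the sequences are already aligned to equal length, with "-" marking gaps, and that columns containing a gap are skipped. Both are simplifying assumptions for illustration.

```python
def p_distance(a, b):
    """Mean number of differences per residue over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = diffs = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        diffs += x != y
    return diffs / compared if compared else 0.0


def distance_matrix(aligned):
    """Return {(name1, name2): distance} for every ordered pair of aligned sequences."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q]) for p in names for q in names}


if __name__ == "__main__":
    # Made-up toy sequences, not real proteins.
    toy = {"seq1": "ACDE", "seq2": "ACDF", "seq3": "AC-F"}
    for pair, d in distance_matrix(toy).items():
        print(pair, round(d, 3))
```

A matrix like this is the input from which the neighbor-joining guide tree is built.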

PIR Multiple Alignment and Tree

From Text/Sequence Search Result or CLUSTAL W Alignment Interface

PIR Pattern Search

From Text/Sequence Search Result or Pattern Search Interface

Alignment of a region involved in catalytic activity

A. Create Pattern and search in database: P-[IV]-[WY]-x(3)-H-[MR]-V-x(3,4)-Q-x(1,2)-D-x(4,5)-G-A-N

B. Test sequence against PROSITE database: O05689

Signature Patterns for Functional Motifs
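A PROSITE-style pattern such as the one above maps almost directly onto a regular expression, which is one way to see what a pattern search does. The following is a minimal sketch covering only the syntax that appears in these slides (residues, `[..]` classes, `x`, and repeat counts) plus `{..}` exclusions. Anchors and other PROSITE features are not handled. The function names are illustrative, not part of any PIR or PROSITE tool.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        # Split off an optional repeat count such as (3) or (3,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        # Single letters and [..] classes are already valid regex.
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (1-based start, end, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("P-[IV]-[WY]-x(3)-H-[MR]-V-x(3,4)-Q-x(1,2)-D-x(4,5)-G-A-N"))
    # Made-up sequence constructed to contain one C2H2-like match.
    print(find_pattern("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H",
                       "MKCAACAAALAAAAAAAAHAAAH"))
```

Because `finditer` reports non-overlapping matches, a protein with several copies of a motif yields one tuple per copy.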

Pattern Search Result (I)

A. One Query Pattern Against UniProtKB or UniRef100 DBs

Display the query pattern

Links to iProClass and UniProtKB reports

Link to NCBI taxonomy

Link to PIRSF report

Indicate pattern sequence region(s)

Pattern Search Result (II)

B. One Query Sequence Against PROSITE Pattern Database

Profile Method

• Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments
  • Num of Rows = Num of Aligned Positions
  • Each row contains a score for the alignment with each possible residue.
• Profile Searching
  • Summation of Scores for Each Amino Acid Residue along Query Sequence
  • Higher Match Values at Conserved Positions
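The profile idea can be sketched in a few lines: one row per aligned position, one score per residue, and a query window scored by summing the matching entries. The scoring used here (log-odds against a uniform background, with a pseudocount of 1) is an assumption made for illustration. PROSITE profiles and other tools derive their scores differently.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Build one row per aligned column; each row maps residue -> log-odds score."""
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    profile = []
    for col in range(n_cols):
        # Gap characters and non-standard letters are ignored when counting.
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in ALPHABET
        })
    return profile


def score_window(profile, window):
    """Sum the profile scores of each residue in a window as long as the profile."""
    return sum(row.get(res, 0.0) for row, res in zip(profile, window))


def best_match(profile, query):
    """Return (score, 0-based start) of the best-scoring window in the query."""
    width = len(profile)
    if len(query) < width:
        raise ValueError("query is shorter than the profile")
    return max(
        (score_window(profile, query[i:i + width]), i)
        for i in range(len(query) - width + 1)
    )


if __name__ == "__main__":
    # Made-up toy alignment: columns 1 and 3 are fully conserved.
    profile = build_profile(["CAH", "CGH", "CSH"])
    print(best_match(profile, "MMCTHMM"))
```

The conserved columns contribute the largest scores, which is the sense in which conserved positions carry higher match values.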

Prosite PS50157 profile for Zinc finger C2H2

PIRSF scan

Search One Query Protein Against all the Full-length and Domain HMM models for the fully curated PIRSFs by HMMER

The matched regions and statistics will be displayed.

Shows PIRSF that the query belongs to

Statistical data for all domains

Statistical data per domain

Alignment with consensus sequence

Creation and Curation of PIRSFs

Integrated Bioinformatics System for Function and Pathway Discovery

Data Integration, Associative Analysis

Integrated Bioinformatics System:
• User: Input (Local Data, Search Criteria, Report Format)
• Input (Gene/Protein Expression Data)
• Graphical User Interface (Browsing, Querying, Navigation)
• Sequence Analysis Pipeline (Family Classification & Feature Identification)
• Data Mining Tools (Retrieval, Visualization, Analysis, Correlation)
• Data Warehouse (Gene, Protein, Family, Function, Structure, Pathway, Interaction)
• Output (Analysis Results, Biological Interpretation)

Analytical Pipeline

• Query Sequence
• UniProt
• Top-Matched Superfamilies/Domains
• BLAST Search, HMM Domain Search
• Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs
• SSEARCH, CLUSTALW
• Superfamily/Domain/Motif Alignments
• Family Relationships & Functional Features
• Family Classification & Functional Analysis
• HMM Motif Search, Pattern Search, SignalP/TMHMM

Integrated Bioinformatics System

Global Bioinformatics Analysis of 1000’s of Genes and Proteins

Pathway Discovery, Target Identification

• Gene Expression Data, Proteomic Data
• Clustering
• Expression Pattern
• Visualization & Statistical Analysis
• Clustered Matrix, Pathway Map, Process Hierarchy, Clustered Graph
• Gene/Peptide-Protein Mapping
• Pathway Discovery (Browsing, Sorting, Visualization & Statistical Analysis)
• Functional Analysis (Sequence Analysis & Information Retrieval)
• Integrated Protein Knowledge System
• Comprehensive Protein Information Matrix
• Protein List

Lab Section

Rat eye lens phosphoproteomics in normal and cataract

Kamei et al., Biol. Pharm. Bull., 2005.

[2-D gel images: Normal and Cataract panels; horizontal axis pI from (-) to (+), vertical axis Mw]

More phosphorylated spots in cataract sample.

Digestion and MS from Spot 16 gave these peptides:

MDVTIQHPWFKRALGPFYPSRCSLSADGMLTFSGYRLPSNVDQSALS

We want to identify the protein(s) that contain these peptides

Use Peptide Search

MDVTIQHPWFKR
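Conceptually, a peptide search is an exact substring lookup across a set of protein sequences. The sketch below shows that idea only. It is not the PIR Peptide Search implementation, and the protein entries are made-up placeholders rather than real database records.

```python
def peptide_search(peptide, proteins):
    """Return {protein_id: [1-based start positions]} for exact peptide matches."""
    hits = {}
    for pid, seq in proteins.items():
        positions = []
        start = seq.find(peptide)
        while start != -1:
            positions.append(start + 1)          # report 1-based positions
            start = seq.find(peptide, start + 1)  # allow overlapping matches
        if positions:
            hits[pid] = positions
    return hits


if __name__ == "__main__":
    # Hypothetical toy entries; identifiers and sequences are invented.
    toy_db = {
        "toy_protein_1": "MDVTIQHPWFKRAAAA",
        "toy_protein_2": "GGGGGGGGGG",
    }
    print(peptide_search("MDVTIQHPWFKR", toy_db))
```

Restricting the search to an organism amounts to filtering the entries of `proteins` before calling the function.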

Peptide Search

Restrict search to an organism

Peptide Search & Results

Species restricted search

Search in UniProtKB, 23 proteins

Links to iProClass and UniProtKB reports

Link to NCBI taxonomy

Link to PIRSF report

Matching peptide highlighted in the sequence

Sorting arrows

Batch Retrieval Results (I)

Retrieve more sequences

• Retrieve multiple proteins from iProClass using a specific identifier or a combination of them
• Provides a means to easily retrieve and analyze proteins when the identifiers come from different databases

Blast Similarity Search

>P24623

• Perform sequence similarity search

What proteins are related to rat CRYAA?

http://pir.georgetown.edu/pirwww/search/blast.shtml

Pairwise Alignment

PIR Text Search (http://pir.georgetown.edu/search/textsearch.shtml)

UniProtKB Database and unique UniParc sequences

PIR protein family classification database

Let’s search for human crystallins

Refine your search or start over

Display PDB ID

Let’s look for crystallins which have 3D structure

Domain Display allows simultaneous comparison of the Pfam domains present in multiple proteins

Let’s perform a multiple alignment on the sequences containing PF00030

Share same domain architecture

Multiple Alignment

Interactive Phylogenetic Tree and Alignment

Beta B1 and gamma crystallins share the same domains and SCOP fold, and show significant sequence similarity, suggesting that they are related

Pattern Search (I)

Search for proteins containing this pattern (PS00225) in rat

Select P07320 and perform a pattern search

Pattern Search Result

Beta and gamma Crystallins have multiple copies of this pattern

PIRSF provides a single platform where all the previous analysis has been done by curators

Represents extent of manual curation

Pfam domains assigned with high confidence

Link to PIRSF report

Validation tag

Alpha-crystallin is exclusively found in metazoans

Taxonomic Distribution

Multiple Alignment

Domain Architecture

PIRSF scan

PIRSF report (I): a single platform to study proteins

Subfamily level

Cross-links to other databases

PIRSF report (II)

http://www.geneontology.org/

alpha-Crystallin and Related Proteins

Alpha crystallin alpha chain

Alpha crystallin beta chain

HSPs
