Sequence Based Analysis Tutorial NIH Proteomics Workshop Cecilia Arighi, Ph.D. Protein Information Resource at Georgetown University Medical Center


Post on 18-Jan-2016


Sequence Based Analysis Tutorial

NIH Proteomics Workshop

Cecilia Arighi, Ph.D.

Protein Information Resource at Georgetown University Medical Center


Retrieval, Sequence Search & Classification Methods

• Retrieve protein info by text / UID
• Sequence Similarity Search: BLAST, FASTA, Dynamic Programming
• Family Classification: Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
• Integrated Search and Classification System


Sequence Similarity Search (I)

• Based on Pair-Wise Comparisons
• Dynamic Programming Algorithms
  • Global Similarity: Needleman-Wunsch
  • Local Similarity: Smith-Waterman
• Heuristic Algorithms
  • FASTA: Based on K-Tuples (2-Amino Acid)
  • BLAST: Triples of Conserved Amino Acids
  • Gapped-BLAST: Allow Gaps in Segment Pairs
  • PHI-BLAST: Pattern-Hit Initiated Search
  • PSI-BLAST: Position-Specific Iterated Search
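The two dynamic programming methods named above can be sketched in a few lines. This is a minimal illustration, assuming a simple match/mismatch score and a linear gap penalty (the parameter values are arbitrary choices, not taken from the slides); real tools use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score: the best-scoring pair of subsequences."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere (local similarity).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For two sequences that share only a central region, the local score stays high while the global score is pulled down by the mismatched flanks.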


Sequence Similarity Search (II)

• Similarity Search Parameters
  • Scoring Matrices – Based on Conserved Amino Acid Substitution
    • Dayhoff Mutation Matrix, e.g., PAM250 (~20% Identity)
    • Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM62
  • Gap Penalty
• Search Time Comparisons
  • Smith-Waterman: 10 Min
  • FASTA: 2 Min
  • BLAST: 20 Sec


Feature Representation

• Features of Amino Acids: Physicochemical Properties, Context (Local & Global) Features, Evolutionary Features
• Alternative Amino Acids: Classification of Amino Acids To Capture Different Features of Amino Acid Residues


Substitution Matrix

• Likelihood of One Amino Acid Mutated into Another Over Evolutionary Time
• Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
• Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
• High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
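As a rough illustration of how such a matrix is used, the sketch below scores an ungapped alignment by summing per-position substitution scores. The Gly/Trp and Lys/Arg values are the ones quoted above; the identity scores and the tiny `TOY_SCORES` table itself are illustrative assumptions, not a complete published matrix.

```python
# Toy symmetric score table. Only G/W (-7) and K/R (+3) come from the slide;
# the identity scores below are assumed values chosen for illustration.
TOY_SCORES = {
    ("G", "W"): -7,
    ("K", "R"): 3,
    ("W", "W"): 17,
    ("C", "C"): 12,
    ("G", "G"): 5,
    ("K", "K"): 5,
    ("R", "R"): 6,
}


def pair_score(x, y):
    """Look up a substitution score, treating the table as symmetric."""
    if (x, y) in TOY_SCORES:
        return TOY_SCORES[(x, y)]
    if (y, x) in TOY_SCORES:
        return TOY_SCORES[(y, x)]
    raise KeyError(f"no score for pair {x}/{y} in the toy table")


def ungapped_score(a, b):
    """Sum of substitution scores over two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))
```

With this table, aligning `KW` to `RW` scores well (a conservative substitution plus a rare identical residue), while aligning `GC` to `WC` is penalized by the Gly/Trp mismatch.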


Secondary Structure Features

• Helix: Patterns of Hydrophobic Residue Conservation Showing an I, I+3, I+4, I+7 Pattern Are Highly Indicative of an Alpha Helix (Amphipathic)
• Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions I, I+2, I+4, I+6
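The two periodicity rules can be checked mechanically. The sketch below scans a single sequence for positions where every offset in the pattern carries a hydrophobic residue; the `HYDROPHOBIC` set is an assumed, simplified definition, and the slide's rule applies to conservation across an alignment rather than to one sequence.

```python
# Assumed, simplified set of hydrophobic residues (one-letter codes).
HYDROPHOBIC = set("AVILMFWC")

HELIX_OFFSETS = (0, 3, 4, 7)   # i, i+3, i+4, i+7
STRAND_OFFSETS = (0, 2, 4, 6)  # i, i+2, i+4, i+6


def find_pattern_starts(seq, offsets):
    """Return 0-based start positions where all offsets are hydrophobic."""
    starts = []
    for i in range(len(seq)):
        positions = [i + o for o in offsets]
        if positions[-1] >= len(seq):
            break  # pattern would run past the end of the sequence
        if all(seq[p] in HYDROPHOBIC for p in positions):
            starts.append(i)
    return starts
```

For example, `find_pattern_starts("LKKLLKKL", HELIX_OFFSETS)` reports position 0, while the alternating sequence `LKLKLKL` matches only the strand offsets.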


BLAST

• BLAST (Basic Local Alignment Search Tool)
  • Extremely fast
  • Robust
  • Most frequently used
• It finds very short segment pairs (“seeds”) between the query and the database sequence
• These seeds are then extended in both directions until the maximum possible score for extensions of this particular seed is reached
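The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration, not the BLAST implementation: seeds here are exact word matches (BLAST also accepts high-scoring neighbor words), scoring is plain match/mismatch, and the `x_drop` cutoff and all parameter values are assumptions.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    Extension in a direction stops once the running score falls more
    than x_drop below the best score seen in that direction.
    """
    def walk(qi, sj, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + k, j + k, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    return k * match + left + right, i - left_len, i + k + right_len
```

With `query = "MKVLAAGIT"` and `subject = "QQVLAAGPP"`, the three overlapping seeds all extend to the same segment, `VLAAG`.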


BLAST Search

From BLAST Search Interface: Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment

• Link to NCBI taxonomy
• Click to see alignment
• Links to iProClass and UniProtKB reports
• Link to PIRSF report
• Click to see SSearch alignment


BLAST Result & Pairwise Alignment

BLAST Alignment


Classification

• What is classification?
• Why do we need protein classification?
• Different levels of classification
• Basis for functional protein classification
• How to classify a protein of unknown function?


Classification Databases

• Protein motif: C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H (the 2 C's and the 2 H's are zinc ligands)
• Protein domain: Group proteins according to the presence of a common domain
• 3-D structure: Group proteins according to common 3D structure
• Whole-protein: Group proteins according to common domain architecture and length
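A PROSITE-style motif such as the zinc finger pattern above maps closely onto a regular expression. The helper below is a minimal sketch of that translation; `prosite_to_regex` is a hypothetical name, and it handles only the basic syntax (single residues, `x`, `[...]`, `{...}`, and repeat counts), not the `<` and `>` anchors.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, with an
# optional repeat count such as (3) or (2,4).
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


ZINC_FINGER = "C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H"
ZINC_FINGER_RE = re.compile(prosite_to_regex(ZINC_FINGER))
```

`ZINC_FINGER_RE.search(sequence)` then returns a match object whose `start()` and `end()` give the motif location in the sequence, or `None` when the motif is absent.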


Family Classification Methods

Based on Other Classification Information

• Multiple Sequence Alignment (ClustalW)
• ProSite Pattern Search
• Profile Search
• Hidden Markov Models (HMMs): Domain (Pfam); Whole protein (PIRSF)
• Neural Networks


How do you build a tree?

• Pick sequences to align
• Align them
• Verify the alignment
• Keep the parts that are aligned correctly
• Build and evaluate a phylogenetic tree

Integrated Analysis


Multiple Sequence Alignment: CLUSTALW

• Pairwise alignment: calculate distance matrix (mean number of differences per residue)
• Unrooted Neighbor-Joining tree: branch length drawn to scale
• Rooted NJ tree (guide tree): root placed at a position where the means of the branch lengths on either side of the root are equal
• Progressive alignment guided by the tree: alignment starts from the tips of the tree towards the root

Thompson et al., NAR 22, 4675 (1994).
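The first step, a distance matrix of mean differences per residue, can be sketched as below. This assumes the sequence pairs are already aligned to equal length and that columns containing a gap are skipped; both are simplifying assumptions, and the function names are illustrative rather than part of CLUSTALW.

```python
def p_distance(a, b):
    """Mean number of differences per compared residue for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns with a gap in either sequence are ignored.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Return {(name1, name2): distance} for every ordered pair of names."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n]) for m in names for n in names}
```

A tree-building method such as neighbor joining would take this matrix as its input; that step is not shown here.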


PIR Multiple Alignment and Tree From Text/Sequence Search Result or CLUSTAL W Alignment Interface


PIR Pattern Search

From Text/Sequence Search Result or Pattern Search Interface

Signature Patterns for Functional Motifs

Alignment of a region involved in catalytic activity

A. Create pattern and search in database: P-[IV]-[WY]-x(3)-H-[MR]-V-x(3,4)-Q-x(1,2)-D-x(4,5)-G-A-N

B. Test sequence against PROSITE database: O05689


Pattern Search Result (I)

A. One Query Pattern Against UniProtKB or UniRef100 DBs

• Display the query pattern
• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report
• Indicate pattern sequence region(s)


Pattern Search Result (II)

B. One Query Sequence Against PROSITE Pattern Database


Profile Method

• Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments
  • Num of Rows = Num of Aligned Positions
  • Each row contains a score for the alignment with each possible residue.
• Profile Searching
  • Summation of Scores for Each Amino Acid Residue along Query Sequence
  • Higher Match Values at Conserved Positions
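The profile idea can be sketched as one row of per-residue scores for each aligned position, with a query window scored by summing the matching entries. The scoring scheme here (log-odds against a uniform background, with a pseudocount) is an assumption chosen for brevity; PROSITE profiles use their own weighting and gap handling.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """One row (dict residue -> score) per aligned position."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        # Log-odds versus a uniform background of 1/20 per residue (assumed).
        row = {
            aa: math.log((counts[aa] + pseudocount) / total * len(ALPHABET))
            for aa in ALPHABET
        }
        profile.append(row)
    return profile


def score_window(profile, window):
    """Sum of per-position scores for a window as long as the profile."""
    return sum(row[aa] for row, aa in zip(profile, window))


def best_hit(profile, query):
    """Return (score, 0-based start) of the best-scoring window in the query."""
    width = len(profile)
    if len(query) < width:
        raise ValueError("query is shorter than the profile")
    return max(
        (score_window(profile, query[i:i + width]), i)
        for i in range(len(query) - width + 1)
    )
```

Built from the toy alignment `["CAH", "CGH", "CSH"]`, the profile gives its highest scores to C and H at the two fully conserved positions, which is the "higher match values at conserved positions" behavior described above. Query residues outside the 20-letter alphabet are not handled.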


Prosite PS50157 profile for Zinc finger C2H2


PIRSF scan

Search One Query Protein Against all the Full-length and Domain HMM models for the fully curated PIRSFs by HMMER. The matched regions and statistics will be displayed.

• Shows PIRSF that the query belongs to
• Statistical data for all domains
• Statistical data per domain
• Alignment with consensus sequence


Creation and Curation of PIRSFs


Integrated Bioinformatics System for Function and Pathway Discovery

[Diagram: Integrated Bioinformatics System (Data Integration, Associative Analysis)]

• User
• Input (Local Data, Search Criteria, Report Format)
• Input (Gene/Protein Expression Data)
• Graphical User Interface (Browsing, Querying, Navigation)
• Data Warehouse (Gene, Protein, Family, Function, Structure, Pathway, Interaction)
• Data Mining Tools (Retrieval, Visualization, Analysis, Correlation)
• Sequence Analysis Pipeline (Family Classification & Feature Identification)
• Output (Analysis Results, Biological Interpretation)


Analytical Pipeline

[Diagram: Family Classification & Functional Analysis]

• Query Sequence
• UniProt
• BLAST Search, HMM Domain Search: Top-Matched Superfamilies/Domains
• HMM Motif Search, Pattern Search, SignalP/TMHMM: Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs
• SSEARCH, CLUSTALW: Superfamily/Domain/Motif Alignments
• Family Relationships & Functional Features


Integrated Bioinformatics System

Global Bioinformatics Analysis of 1000’s of Genes and Proteins

[Diagram: Pathway Discovery, Target Identification]

• Gene Expression Data, Proteomic Data
• Clustering, Expression Pattern
• Gene/Peptide-Protein Mapping
• Protein List
• Integrated Protein Knowledge System
• Comprehensive Protein Information Matrix
• Functional Analysis (Sequence Analysis & Information Retrieval)
• Pathway Discovery (Browsing, Sorting, Visualization & Statistical Analysis)
• Visualization & Statistical Analysis: Clustered Matrix, Pathway Map, Process Hierarchy, Clustered Graph


Lab Section


Rat eye lens phosphoproteomics in normal and cataract

Kamei et al., Biol. Pharm. Bull., 2005.

[Figure: 2D gels of Normal and Cataract samples; axes pI ((-) to (+)) and Mw]

More phosphorylated spots in cataract sample.

Digestion and MS from Spot 16 gave these peptides:

MDVTIQHPWFKRALGPFYPSRCSLSADGMLTFSGYRLPSNVDQSALS

We want to identify the protein(s) that contain these peptides

Use Peptide Search

MDVTIQHPWFKR
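Conceptually, a peptide search is an exact substring match of the peptide against every sequence in a database. The sketch below illustrates that idea only; it is not the PIR Peptide Search implementation, and the accessions and sequences in `TOY_DB` are made up for the example (the first entry simply reuses the start of the peptide string shown above).

```python
def peptide_search(peptide, proteins):
    """Return (accession, start, end) for every exact match, 1-based inclusive."""
    hits = []
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            hits.append((accession, start + 1, start + len(peptide)))
            start = sequence.find(peptide, start + 1)
    return hits


# Hypothetical toy database; accessions and sequences are illustrative only.
TOY_DB = {
    "TOY1": "MDVTIQHPWFKRALGPFYPSR",
    "TOY2": "MKTAYIAKQR",
}

hits = peptide_search("MDVTIQHPWFKR", TOY_DB)
```

Restricting the search to one organism, as on the Peptide Search form, would amount to filtering `proteins` before the loop.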


Peptide Search

Restrict search to an organism


Peptide Search & Results

Species restricted search

• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report
• Matching peptide highlighted in the sequence
• Sorting arrows

Search in UniProtKB, 23 proteins


Batch Retrieval Results (I)

Retrieve more sequences

• Retrieve multiple proteins from iProClass using a specific identifier or a combination of them
• Provides a means to easily retrieve and analyze proteins when the identifiers come from different databases


BLAST Similarity Search

>P24623

• Perform sequence similarity search

What proteins are related to rat CRYAA?

http://pir.georgetown.edu/pirwww/search/blast.shtml


Pairwise Alignment


PIR Text Search (http://pir.georgetown.edu/search/textsearch.shtml)

• UniProtKB database and unique UniParc sequences
• PIR protein family classification database

Let’s search for human crystallins


Refine your search or start over

Display PDB ID

Let’s look for crystallins that have a 3D structure


Domain Display allows simultaneous comparison of the Pfam domains present in multiple proteins

Let’s perform a multiple alignment on the sequences containing PF00030

Share the same domain architecture


Multiple Alignment


Interactive Phylogenetic Tree and Alignment

Beta B1 and gamma crystallins share the same domains and SCOP fold, and show significant sequence similarity, suggesting that they are related


Pattern Search (I)

Search for proteins containing this pattern (PS00225) in rat

Select P07320 and perform a pattern search


Pattern Search Result

Beta and gamma crystallins have multiple copies of this pattern


PIRSF provides a single platform where all the previous analysis has been done by curators

Represents extent of manual curation

Pfam domains assigned with high confidence

Link to PIRSF report

Validation tag


Alpha-crystallin is exclusively found in metazoans

Taxonomic Distribution

Multiple Alignment

Domain Architecture


PIRSF scan


PIRSF report (I): a single platform to study proteins

Subfamily level


PIRSF report (II)

Cross-links to other databases

http://www.geneontology.org/


alpha-Crystallin and Related Proteins

Alpha crystallin alpha chain

Alpha crystallin beta chain

HSPs