
Post on 05-Feb-2016


Page 1: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Nathan Edwards
Department of Biochemistry and Mol. & Cell. Biology
Georgetown University Medical Center

Page 2: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Protein Sequence Databases

• Link between mass spectra and proteins
• A protein’s amino-acid sequence provides a basis for interpreting
  • Enzymatic digestion
  • Separation protocols
  • Fragmentation
  • Peptide ion masses
• We must interpret database information as carefully as mass spectra.

Page 3: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

More than sequence…

Protein sequence databases provide much more than sequence:

• Names
• Descriptions
• Facts
• Predictions
• Links to other information sources

Protein databases provide a link to the current state of our understanding about a protein.

Page 4: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Much more than sequence

Names
• Accession, Name, Description

Biological Source
• Organism, Source, Taxonomy

Literature

Function
• Biological process, molecular function, cellular component
• Known and predicted

Features
• Polymorphism, Isoforms, PTMs, Domains

Derived Data
• Molecular weight, pI

Page 5: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Database types

Curated
• Swiss-Prot
• UniProt
• RefSeq NP

Translated
• TrEMBL
• RefSeq XP, ZP

Omnibus
• NCBI’s nr
• MSDB
• IPI

Other
• PDB
• HPRD
• EST
• Genomic

Page 6: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Swiss-Prot

• From ExPASy
  • Expert Protein Analysis System
  • Swiss Institute of Bioinformatics
• ~ 515,000 protein sequence “entries”
• ~ 12,000 species represented
• ~ 20,000 Human proteins
• Highly curated
• Minimal redundancy
• Part of UniProt Consortium

Page 7: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

TrEMBL

• Translated EMBL nucleotide sequences
  • European Molecular Biology Laboratory
  • European Bioinformatics Institute (EBI)
• Computer annotated
• Only sequences absent from SwissProt
• ~ 10.5 M protein sequence “entries”
• ~ 230,000 species
• ~ 75,000 Human proteins
• Part of UniProt Consortium

Page 8: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

UniProt

• Universal Protein Resource
• Combination of sequences from
  • Swiss-Prot
  • TrEMBL
• Mixture of highly curated/reviewed (SwissProt) and computer annotation (TrEMBL)
• “Similar sequence” clusters are available
  • 50%, 90%, 100% sequence similarity

Page 9: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

RefSeq

• Reference Sequence
• From NCBI (National Center for Biotechnology Information), NLM, NIH
• Integrated genomic, transcript, and protein sequences.
• Varying levels of curation
  • Reviewed, Validated, …, Predicted, …
• ~ 9.7 M protein sequence “entries”
  • ~ 209,000 reviewed, ~ 90,000 validated
• ~ 39,000 Human proteins

Page 10: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

RefSeq

• Particular focus on major research organisms
• Tightly integrated with genome projects.
• Curated entries: NP accessions
• Predicted entries: XP accessions
• Others: YP, ZP, AP

Page 11: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

IPI

• International Protein Index
• From EBI
• For a specific species, combines
  • UniProt, RefSeq, Ensembl
  • Species specific databases: HInv-DB, VEGA, TAIR
• ~ 87,000 (from ~ 307,000) human protein sequence entries
• Human, mouse, rat, zebrafish, Arabidopsis, chicken, cow
• Slated for closure November 2010, but still going…

Page 12: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

MSDB

• From Imperial College (London)
• Combines
  • PIR, TrEMBL, GenBank, SwissProt
• Distributed with Mascot
  • …so well integrated with Mascot
• ~ 3.2M protein sequence entries
• “Similar sequences” suppressed
  • 100% sequence similarity
• Not updated since September 2006 (obsolete)

Page 13: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

NCBI’s nr

• “non-redundant”
• Contains
  • GenBank CDS translations
  • RefSeq Proteins
  • Protein Data Bank (PDB)
  • SwissProt, TrEMBL, PIR
  • Others
• “Similar sequences” suppressed
  • 100% sequence similarity
• ~ 10.5 M protein sequence “entries”

Page 14: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Human Sequences

• Number of human genes is believed to be between 20,000 and 25,000

SwissProt    ~ 20,000
RefSeq       ~ 39,000
TrEMBL       ~ 75,000
IPI-HUMAN    ~ 87,000
MSDB         ~ 130,000
nr           ~ 230,000

Page 15: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


DNA to Protein Sequence

Derived from http://online.itp.ucsb.edu/online/infobio01/burge

Page 16: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


UCSC Genome Browser

• Shows many sources of protein sequence evidence in a unified display

Page 17: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Accessions

• Permanent labels
• Short, machine readable
• Enable precise communication
• Typos render them unusable!
• Each database uses a different format
  • Swiss-Prot: P17947
  • Ensembl: ENSG00000066336
  • PIR: S60367; S60367
  • GO: GO:0003700;

Page 18: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

18

Names / IDs

• Compact mnemonic labels• Not guaranteed permanent• Require careful curation• Conceptual objects

• ALBU_HUMAN• Serum Albumin

• RT30_HUMAN• Mitochondrial 28S ribosomal protein S30

• CP3A7_HUMAN• Cytochrome P450 3A7

Page 19: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Description / Name

• Free text description
• Human readable
• Space limited
• Hard for computers to interpret!
• No standard nomenclature or format
• Often abused…
• COX7R_HUMAN
  • Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]

Page 20: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

FASTA Format

• >
• Accession number
  • No uniform format
  • Multiple accessions separated by |
• One line of description
  • Usually pretty cryptic
• Organism of sequence?
  • No uniform format
  • Official Latin name not necessarily used
• Amino-acid sequence in single-letter code
  • Usually spread over multiple lines.
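The layout described above can be sketched as a small reader. This is a minimal illustration, not a reference parser: the function names `read_fasta` and `_split_header` are made up for this example, and it assumes the first whitespace-delimited token of the header holds the accessions, which (as the slide notes) not every database guarantees.

```python
def _split_header(header):
    # Assumption: the first whitespace-delimited token carries the
    # accessions, separated by '|'; everything after it is description.
    token, _, description = header.partition(" ")
    accessions = [part for part in token.split("|") if part]
    return accessions, description.strip()


def read_fasta(lines):
    """Yield (accessions, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if header is not None:
                yield _split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated; they may span many lines.
            chunks.append(line)
    if header is not None:
        yield _split_header(header) + ("".join(chunks),)


# The header is taken from the "Description Elimination" slide;
# the sequence letters are invented for illustration only.
example = [
    ">gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]",
    "MRFRF",
    "CGDLD",
]
for accessions, description, sequence in read_fasta(example):
    print(accessions, description, sequence)
```

Note that the organism, when present, stays buried in the free-text description; extracting it reliably needs database-specific rules.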

Page 21: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


FASTA Format

Page 22: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Organism / Species / Taxonomy

• The protein’s organism…
  • …or the source of the biological sample
• The most reliable sequence annotation available
  • Useful only to the extent that it is correct
• NCBI’s taxonomy is widely used
  • Provides a standard of sorts; hierarchical
  • Other databases don’t necessarily keep up
• Organism specific sequence databases starting to become available.

Page 23: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Organism / Species / Taxonomy

• Buffalo rat
• Gunn rats
• Norway rat
• Rattus PC12 clone IS
• Rattus norvegicus
• Rattus norvegicus8
• Rattus norwegicus
• Rattus rattiscus
• Rattus sp.
• Rattus sp. strain Wistar
• Sprague-Dawley rat
• Wistar rats
• brown rat
• laboratory rat
• rat
• rats
• zitter rats

Page 24: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Controlled Vocabulary

• Middle ground between computers and people
• Provides precision for concepts
  • Searching, sorting, browsing
  • Concept relationships
• Vocabulary / Ontology must be established
  • Human curation
• Link between concept and object:
  • Manually curated
  • Automatic / Predicted

Page 25: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Gene Ontology

• Hierarchical
  • Molecular function
  • Biological process
  • Cellular component
• Describes the vocabulary only!
  • Protein families provide GO association
• Not necessarily any appropriate GO category.
• Not necessarily in all three hierarchies.
• Sometimes general categories are used because none of the specific categories are correct.

Page 26: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Gene Ontology

Page 27: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Protein Families

• Similar sequence implies similar function
• Similar structure implies similar function
• Common domains imply similar function
• Bootstrap up from small sets of proteins/domains with well understood characteristics
• Usually a hybrid manual / automatic approach

Page 28: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Protein Families

Page 29: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Protein Families

Page 30: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Sequence Variants

• Protein sequence can vary due to
  • Polymorphism
  • Alternative splicing
  • Post-translational modification
• Sequence databases typically do not capture all versions of a protein’s sequence

Page 31: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Swiss-Prot Variant Annotations

Page 32: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Swiss-Prot Variant Annotations

Page 33: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Omnibus Database Redundancy Elimination

• Source databases often contain the same sequences with different descriptions
• Omnibus databases keep one copy of the sequence, and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source preference
• Good definitions can be lost, including taxonomy
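The three description policies above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function name `collapse_redundant`, the `preferred_sources` parameter, and the idea of spotting a source by its `|sp|` or `|ref|` tag in an NCBI-style header. Real omnibus databases apply their own rules.

```python
def collapse_redundant(records, preferred_sources=("sp", "ref")):
    """Collapse identical sequences, keeping one chosen description.

    records: iterable of (header, sequence) pairs.
    Returns a list of (chosen_header, all_headers, sequence).
    """
    by_sequence = {}
    for header, sequence in records:
        # 100% identity: exact string match on the sequence.
        by_sequence.setdefault(sequence, []).append(header)

    collapsed = []
    for sequence, headers in by_sequence.items():
        chosen = headers[0]  # "arbitrary description": first one seen
        for source in preferred_sources:
            # "source preference": take the first header from a preferred source
            matches = [h for h in headers if f"|{source}|" in h]
            if matches:
                chosen = matches[0]
                break
        # Keeping `headers` alongside corresponds to "all descriptions".
        collapsed.append((chosen, headers, sequence))
    return collapsed


# Headers from the "Description Elimination" slide; the shared
# sequence string is invented for illustration.
records = [
    ("gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]", "MRFRF"),
    ("gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]", "MRFRF"),
    ("gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4", "MRFRF"),
]
chosen, all_headers, _ = collapse_redundant(records)[0]
print(chosen)
```

With the first-seen policy alone, this entry would be labeled "hypothetical protein", which is how a good definition gets lost. The preferred Swiss-Prot header, in turn, carries no "[Homo sapiens]" tag, so taxonomy can vanish too.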

Page 34: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Description Elimination

• gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]

• gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]

• gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]

• gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]

• gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4

• gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]

Page 35: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Peptides to Proteins

Nesvizhskii et al., Anal. Chem. 2003

Page 36: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Peptides to Proteins

Page 37: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Peptides to Proteins

• A peptide sequence may occur in many different protein sequences
  • Variants, paralogues, protein families
• Separation, digestion and ionization are not well understood
• Proteins in a sequence database are highly non-random and strongly dependent on one another
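To make the first point concrete, the sketch below digests a toy database in silico and records which proteins contain each peptide. The assumptions are mine, not the slides': trypsin-like cleavage after K or R unless followed by P, no missed cleavages, a minimum peptide length of 6, and two invented sequences that differ by one residue.

```python
import re
from collections import defaultdict


def tryptic_peptides(sequence, min_length=6):
    # Assumed rule: cleave after K or R, except when the next residue is P.
    pieces = re.findall(r".*?[KR](?!P)|.+", sequence)
    return [p for p in pieces if len(p) >= min_length]


def peptide_to_proteins(proteins, min_length=6):
    """Map each peptide to the set of protein accessions containing it."""
    index = defaultdict(set)
    for accession, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, min_length):
            index[peptide].add(accession)
    return index


# Invented toy sequences: B is a single-residue variant of A.
toy_database = {
    "A": "MAGTELLKSDFVNPEERLLAWTK",
    "B": "MAGTELLKSDFVNPEDRLLAWTK",
}
for peptide, accessions in sorted(peptide_to_proteins(toy_database).items()):
    print(peptide, sorted(accessions))
```

In this toy case only the peptide spanning the variant site distinguishes the two proteins; observing either of the other peptides is equally consistent with A, B, or both.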

Page 38: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Indistinguishable Protein Sequences

Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005

Page 39: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Indistinguishable Protein Sequences

Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005

Page 40: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Protein Families

Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005

Page 41: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Protein Grouping Scenarios

• Parsimony
  • Minimum # of proteins
• Weighted
  • Choose proteins with the most confident peptides (ProteinProphet)
• Show all
  • Mark repeated peptides
• Often no (ideal) resolution is possible!

Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
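The parsimony scenario can be sketched as a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. This is a common heuristic and only an illustration. It is not guaranteed to find the true minimum, it is not what ProteinProphet does, and the function name and toy data are invented here.

```python
def parsimonious_proteins(protein_peptides):
    """Greedy set cover over identified peptides.

    protein_peptides: dict mapping protein accession -> set of peptides.
    Returns the chosen accessions in the order they were picked.
    """
    unexplained = set()
    for peptides in protein_peptides.values():
        unexplained |= peptides

    chosen = []
    while unexplained:
        # Ties go to the alphabetically first accession, which is arbitrary.
        best = max(
            sorted(protein_peptides),
            key=lambda acc: len(protein_peptides[acc] & unexplained),
        )
        chosen.append(best)
        unexplained -= protein_peptides[best]
    return chosen


toy = {
    "A": {"pep1", "pep2", "pep3"},
    "B": {"pep2", "pep3"},   # subsumed by A
    "C": {"pep3", "pep4"},
}
print(parsimonious_proteins(toy))
```

Protein B is dropped even though it may well be present in the sample, and the tie-break is arbitrary, which is one reason no ideal resolution exists in general.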

Page 43: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Moderate quality peptide identification: E-value < $10^{-3}$

Page 44: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Peptide Identification

• Peptide fragmentation by CID is poorly understood
• MS/MS spectra represent incomplete information about amino-acid sequence
  • I/L, K/Q, GG/N, …
• Correct identifications don’t come with a certificate!
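The ambiguous pairs listed above follow from residue masses. The sketch below uses standard monoisotopic residue masses rounded to five decimals; the function names and the tolerance values are illustrative assumptions, not instrument specifications.

```python
# Monoisotopic residue masses in Da (rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146,
    "N": 114.04293,
    "I": 113.08406,
    "L": 113.08406,
    "K": 128.09496,
    "Q": 128.05858,
}


def residue_mass(residues):
    """Total mass of a run of residues, e.g. 'GG'."""
    return sum(RESIDUE_MASS[r] for r in residues)


def indistinguishable(a, b, tolerance_da):
    """True if the two residue runs differ by no more than the tolerance."""
    return abs(residue_mass(a) - residue_mass(b)) <= tolerance_da


for a, b in [("I", "L"), ("GG", "N"), ("K", "Q")]:
    delta = abs(residue_mass(a) - residue_mass(b))
    print(f"{a} vs {b}: mass difference {delta:.5f} Da")
```

I/L and GG/N have the same elemental composition, so no mass accuracy separates them by fragment mass alone; K and Q differ by about 0.036 Da, so that pair is ambiguous only at coarser fragment tolerances.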

Page 45: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Peptide Identification

• High-throughput workflows demand we analyze all spectra, all the time.
• Spectra may not contain enough information to be interpreted correctly
  • …bad static on a cell phone
• Peptides may not match our assumptions
  • …it’s all Greek to me
• “Don’t know” is an acceptable answer!

Page 46: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

What scores do “wrong” peptides get?

• Generate random peptide sequences
• Real looking fragment masses
• Empirical distribution
• Require similar precursor mass
• Arbitrary score function can model anything we like!

Page 47: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Random Peptide Scores

Fenyo & Beavis, Anal. Chem., 2003

Page 48: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Random Peptide Scores

Fenyo & Beavis, Anal. Chem., 2003

Page 49: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Random Peptide Scores

• Truly random peptides don’t look much like real peptides
• Just use peptides from the sequence database!
• Assumptions:
  • IID sampling of “score” values per spectrum
• Caveats:
  • Correct peptide (non-random) may be included
  • Peptides are not independent
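One way to read the idea above: for a given spectrum, treat the scores of the other database peptides as samples of the "wrong peptide" distribution and ask how often they reach the observed score. The sketch is a simplification under the stated IID assumption. The function name is made up, and the add-one correction is a common convention I have assumed, not something the slides specify.

```python
def empirical_p_value(observed_score, null_scores):
    """Fraction of null scores at least as large as the observed score.

    null_scores: scores of database peptides assumed to be incorrect
    matches for this spectrum. The +1 terms keep the estimate away
    from exactly zero when no null score reaches the observed one.
    """
    exceed = sum(1 for s in null_scores if s >= observed_score)
    return (exceed + 1) / (len(null_scores) + 1)


# Invented scores for illustration.
null = [1.0, 2.0, 3.0, 4.0]
print(empirical_p_value(3.5, null))
```

The smallest p-value this can report is 1 / (N + 1) for N null scores, which is one motivation for extrapolating with a fitted theoretical model.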

Page 50: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Extrapolating from the Empirical Distribution

• Often, the empirical shape is consistent with a theoretical model

Geer et al., J. Proteome Research, 2004
Fenyo & Beavis, Anal. Chem., 2003

Page 51: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

E-values vs p-values

• Need to adjust for the size of the sequence database
  • Best false/random score goes up with number of trials
• E-value makes this adjustment
  • Expected number of incorrect peptides (with this score) from this sequence database.
  • E-value = # Trials * p-value (to 1st approx.)
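A short numeric illustration of the adjustment, assuming independent trials (which the earlier caveats say is only approximately true). The function names and the numbers are invented for this example.

```python
def e_value(p_value, n_trials):
    """Expected number of random matches scoring at least this well."""
    return n_trials * p_value


def prob_at_least_one_random_match(p_value, n_trials):
    """Chance that some random peptide reaches this score, if trials are independent."""
    return 1.0 - (1.0 - p_value) ** n_trials


p = 1e-6           # per-peptide p-value of the observed score
trials = 50_000    # candidate peptides compared against the spectrum
print(e_value(p, trials))
print(prob_at_least_one_random_match(p, trials))
```

For small E-values the two quantities nearly coincide, which is why the product is a good first approximation. The same p-value against a database ten times larger gives an E-value ten times larger.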


Page 52: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

False Discovery Rate

• Which peptide IDs to accept?
  • E-value only provides a per-spectrum statistic
  • With enough spectra, even these can be misleading!
• Decide which spectra (w/ scores) will be accepted:
  • SEQUEST Xcorr, E-value, Score, etc., plus...
  • Threshold on identification criteria
• Control the proportion of incorrect identifications in the result for the entire dataset

Page 53: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Distribution of scores over all spectra

[Figure: histogram of scores over all spectra; vertical axis 0 to 200, horizontal axis -3.9 to 7.3]

Brian Searle, Proteome Software

Page 54: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Distribution of scores over all spectra

[Figure: histogram of scores over all spectra, split into “False” and “True” distributions; vertical axis 0 to 200, horizontal axis -3.9 to 7.3]

Brian Searle, Proteome Software

Page 55: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

False Discovery Rate

• $\mathrm{FDR}_{\text{score} \ge x} = \dfrac{\#\ \text{false ids with score} \ge x}{\#\ \text{all ids with score} \ge x}$
• Need to estimate numerator!
• Assumes the false (and true) scores, sampled over spectra, are IID
  • Not true for some peptide-spectrum scores
  • (Mostly) true for E-values
• Can compute the # false ids using a decoy search…

Page 56: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance


Peptide Prophet

Distribution of spectral scores in the results

Keller et al., Anal. Chem. 2002

Page 57: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Decoy searches

• Shuffle or reverse sequence database
  • Same size as original
  • Known false identifications
  • Estimate “False” distribution
• Alternatively, merge target+decoy results:
  • Competition between target and decoy scores
  • Assume false target and false decoys each win half the time
  • $\mathrm{FDR}_{\text{score} \ge x} = \dfrac{2 \times \#\ \text{decoy ids with score} \ge x}{\#\ \text{target ids with score} \ge x}$
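The merged target+decoy estimate can be sketched as follows. The assumptions are mine: decoys are made by reversing each sequence, decoy accessions carry a `DECOY_` prefix, each spectrum contributes one best-scoring hit, and the case of no accepted target hits returns 0.0. The ratio itself follows the slide's formula.

```python
def reversed_decoys(proteins, prefix="DECOY_"):
    """Build a decoy database of the same size by reversing each sequence."""
    return {prefix + acc: seq[::-1] for acc, seq in proteins.items()}


def target_decoy_fdr(best_hits, threshold, prefix="DECOY_"):
    """Estimate FDR among hits with score >= threshold.

    best_hits: iterable of (score, accession), one best hit per spectrum
    from a search against the merged target+decoy database.
    """
    accepted = [acc for score, acc in best_hits if score >= threshold]
    decoys = sum(1 for acc in accepted if acc.startswith(prefix))
    targets = len(accepted) - decoys
    if targets == 0:
        return 0.0  # assumed convention when nothing is accepted
    # Slide's formula: 2 * (# decoy ids) / (# target ids) at this threshold.
    return 2 * decoys / targets


# Invented scores and accessions for illustration.
hits = [
    (9.1, "P1"), (8.4, "P2"), (7.7, "DECOY_P3"),
    (7.2, "P4"), (6.9, "P5"), (3.0, "DECOY_P1"),
]
print(target_decoy_fdr(hits, threshold=6.0))
print(target_decoy_fdr(hits, threshold=8.0))
```

Raising the threshold lowers the estimated FDR but accepts fewer identifications; in practice the threshold is chosen to hit a target FDR for the whole dataset.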

Page 58: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Summary

• Protein sequence databases have varying characteristics; choose wisely!
• Inferring proteins from peptides can be (very) tricky!
• Statistical significance can help control the proportion of errors in the (peptide-level) results.