Page 1: Bioinformatics. Analysis of proteomic data. Dr Richard J Edwards 28 August 2009; CALMARO workshop. ©Gary Larson (In not much detail)

Bioinformatics. Analysis of proteomic data.

Dr Richard J Edwards 28 August 2009; CALMARO workshop.

©Gary Larson

(In not much detail)

Page 2:

Bioinformatic analysis of proteomic data

Improving sequence identifications
  Dealing with redundancy
  Annotating protein hits

Adding value to protein lists
  Accession number mapping & data integration
  Gene Ontology analysis
  Protein interaction networks

Example: identifying E. huxleyi proteins with multi-species and EST sequence databases

Open Discussion

Page 3:

Improving identifications: dealing with redundancy.

Page 4:

Identifying redundancy

Copyright ©2005 American Society for Biochemistry and Molecular Biology Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

Choice of database affects redundancy identification
  SwissProt/IPI indicate splice variants
  EnsEMBL peptides map back onto non-redundant gene IDs
  Poor annotation makes it hard to differentiate variant/error/family

Page 5:

Identifying redundancy

Sometimes, identification cannot be conclusive

Example: alpha tubulin protein family

Copyright ©2005 American Society for Biochemistry and Molecular Biology Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

Page 6:

Identifying redundancy

Sometimes, identification cannot be conclusive
  Different scenarios can present different problems
  How important is it to study?
  Might need to identify protein(s) through further experiments

Basic peptide grouping scenarios

Copyright ©2005 American Society for Biochemistry and Molecular Biology Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

Page 7:

Identifying redundancy

Final protein list:
  Conclusive IDs
  Protein groups
  Inconclusive IDs

Are inconclusive/group hits redundant?
  Same protein from different species
  Splice variants

Does it matter?
  Inflated numbers
  Biased analyses
  Comparisons between experiments

A simplified example of a protein summary list

[Figure labels: Unique to protein; Unique to group; No unique]

Copyright ©2005 American Society for Biochemistry and Molecular Biology Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440
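The three categories in the final protein list (conclusive IDs, protein groups, inconclusive IDs) can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the procedure from the cited paper: the function name `classify_proteins`, the labels and the toy peptide data are all hypothetical, and real tools also weigh peptide scores.

```python
from collections import defaultdict

def classify_proteins(protein_peptides):
    """Label each protein by how its identified peptides are shared.

    protein_peptides: dict mapping protein ID -> set of peptide strings.
    Returns a dict mapping protein ID -> 'conclusive', 'group' or 'inconclusive'.
    """
    # Index which proteins each peptide could have come from.
    peptide_owners = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_owners[peptide].add(protein)

    # Proteins with identical peptide sets cannot be told apart: treat as one group.
    groups = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].add(protein)

    labels = {}
    for peptides, members in groups.items():
        # A peptide is unique to the group if no protein outside the group has it.
        has_unique = any(peptide_owners[p] <= members for p in peptides)
        if not has_unique:
            label = "inconclusive"   # no unique peptide evidence
        elif len(members) == 1:
            label = "conclusive"     # at least one peptide unique to this protein
        else:
            label = "group"          # unique to the group, not to one member
        for protein in members:
            labels[protein] = label
    return labels

# Hypothetical toy data (not from the talk).
hits = {
    "A": {"pep1", "pep2"},   # pep1 is found only in A
    "B": {"pep3", "pep4"},   # B and C have identical peptide sets
    "C": {"pep3", "pep4"},
    "D": {"pep2", "pep4"},   # every peptide of D is also explained by another protein
}
labels = classify_proteins(hits)
```

Counting only the conclusive proteins plus one entry per group is one way to avoid the inflated numbers the slide warns about.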

Page 8:

Homology groupings

Can use BLAST to identify groups of related proteins
  Help identify possible redundancies
  Need to look at peptides

Particularly useful for “off-species” identifications
  Tendency for many hits to same protein in different species

Clustering proteins by %identity

http://www.southampton.ac.uk/~re1u06/software/gablam/
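Clustering by % identity can be sketched as single-linkage grouping over pairwise identities (for example, taken from BLAST alignments). This is only an illustration of the idea and is not claimed to be how GABLAM works: the function name, the 90% threshold and the protein names are assumptions made for the example.

```python
def cluster_by_identity(pairwise_identity, threshold=90.0):
    """Single-linkage clustering: link proteins whose % identity >= threshold.

    pairwise_identity: dict mapping (protein_a, protein_b) -> % identity.
    Returns a sorted list of clusters, each a sorted list of protein IDs.
    """
    parent = {}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in pairwise_identity.items():
        root_a, root_b = find(a), find(b)   # also registers both proteins
        if identity >= threshold and root_a != root_b:
            parent[root_b] = root_a          # merge the two clusters

    clusters = {}
    for protein in parent:
        clusters.setdefault(find(protein), []).append(protein)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical identities: the same tubulin hit in three species, plus an unrelated actin.
identities = {
    ("tubA_sp1", "tubA_sp2"): 99.0,
    ("tubA_sp2", "tubA_sp3"): 98.5,
    ("tubA_sp1", "actin_sp1"): 12.0,
}
clusters = cluster_by_identity(identities)
```

Each cluster is a candidate redundancy only; as the slide notes, the peptides still need to be checked before hits are merged.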

Page 9:

Improving identifications: annotating protein hits.

Page 10:

Protein annotation

[Diagram labels: Database; Protein List; NOISE]

Poorly (un)annotated proteins
  Real proteins or database noise?
  Reliable annotation?

Page 11:

Where does the data come from?

Most of our protein data comes from DNA sequences
  PDB: 53,660 structures = 3D
  SwissProt: 392,667 = Curated
  TrEMBL: >6 million & UniParc: >16 million = Most inferred from DNA
  Most annotation inferred through sequence analysis

Protein data from translated DNA
  [Diagram labels: Translation; Annotation]

Lots of errors!
  Sequence errors
  Annotation errors

Page 12:

Protein annotation

Use standard sequence analysis tools
  Manual guidance/care = better than automated databases!

Homology searching
  BLAST vs. UniProtKB
  Protein domain searches, e.g. Pfam

Conservation analysis
  Multiple sequence alignment with homologues
  Are functionally important sites conserved?

Phylogenetic analysis
  Evolutionary relationships can help distinguish function
  Assignment to protein subfamily etc.
  Useful where BLAST hits have competing annotation

http://www.southampton.ac.uk/~re1u06/software/haqesac/

Page 13:

Beyond proteomics: adding value to protein lists.

Page 14:

What Bioinformatics cannot (usually) do

Magic

Replace hypothesis driven research

Directed analysis is always better than “fishing” (e.g. GO)

Provide a definitive answer

Ranking/prioritising better

Page 15:

Follow-up analyses

Many possibilities
  What was the aim of the study?
  What resources are available for your organism?

Imitation is the sincerest form of flattery
  Find a good study and copy the best bits
  Easier to describe
  Easier to justify to reviewers

Hypothesis-driven analysis is best
  Many tools facilitate hypothesis generation (data exploration)
  Be aware of risk of testing a hypothesis on data used to generate it
  Be aware of multiple testing issues
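As a concrete illustration of the multiple testing issue: when many hypotheses are tested on one protein list, raw p-values need adjusting. The talk does not name a correction method; the sketch below uses the Benjamini-Hochberg false discovery rate adjustment as one common choice, with made-up p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four tests on the same protein list.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
```

With these numbers, three raw p-values fall below 0.05 but only one adjusted value does, which is the kind of shrinkage to expect when many terms or categories are tested at once.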

Page 16:

Follow-up analyses

EBI and NCBI both provide many useful tools
  EBI run many good courses at Hinxton

http://www.ebi.ac.uk/Tools/

Page 17:

Seek collaborations

[Figure labels: Time / Energy; Reward; Bioinformatics]

Find a tame bioinformatician to help if needed

Good collaboration = Trade
  Papers / Grants / improving the bioinformatics
  E.g. adding your organism/database to an online resource

©Gary Larson

Page 18:

Accession number mapping

Other databases may contain better/specific annotation
  UniProtKB, OMIM etc.

Results from searches against older databases may need updating

EBI tool: PICR [Protein Identifier Cross-Reference Service]
  http://www.ebi.ac.uk/Tools/picr/

BioMart: Query & Xref tool for many databases
  www.biomart.org
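Once a cross-reference table has been exported from a mapping service, applying it to a protein list is a simple lookup. The sketch below assumes a two-column tab-delimited export (old ID, new ID); that layout, the function names and the accession numbers are all hypothetical, and it does not call PICR or BioMart.

```python
import csv
import io

def load_mapping(tsv_text):
    """Parse 'old_id<TAB>new_id' rows into a dict of old ID -> set of new IDs."""
    mapping = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed or comment rows
        mapping.setdefault(row[0], set()).add(row[1])
    return mapping

def remap_accessions(protein_ids, mapping):
    """Split a protein list into mapped IDs and IDs with no cross-reference."""
    mapped, unmapped = {}, []
    for pid in protein_ids:
        if pid in mapping:
            mapped[pid] = sorted(mapping[pid])   # one old ID may map to several new IDs
        else:
            unmapped.append(pid)                 # keep for manual follow-up
    return mapped, unmapped

# Made-up accession numbers for illustration only.
table = "OLD0001\tNEW_A\nOLD0002\tNEW_B\nOLD0002\tNEW_C\n"
mapped, unmapped = remap_accessions(["OLD0001", "OLD0002", "OLD0003"], load_mapping(table))
```

Keeping the unmapped and one-to-many cases explicit matters here, because silently dropping or duplicating proteins would bias any downstream counts.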

Page 19:

BioMart

Page 20:

Gene Ontology analysis

Gene Ontology [GO] = gene annotation project
  Controlled vocabulary allows standardisation & comparisons

http://www.geneontology.org/

Page 21:

Gene Ontology analysis

Many Gene Ontology exploration tools
  AmiGO, GOA, FatiGO, DAVID etc.
  Depend on source databases
  May need to map IDs using PICR first

GO enrichment
  Assess frequency of GO terms in your list against expectation
  Often a big multiple testing issue
  Be aware of biases: how is expectation derived?
    E.g. abundant, conserved proteins are more likely to be annotated & more likely to be identified in a proteomics experiment
  Best if hypothesis-driven or used for data confirmation
    E.g. enrichment of certain subcellular fraction
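Assessing the frequency of a GO term "against expectation" is often done with a one-sided hypergeometric test. The sketch below is a generic illustration with made-up counts, not the method of any particular tool listed above; the function and parameter names are hypothetical. The result depends entirely on the background chosen, which is where the bias described on the slide enters.

```python
from math import comb

def hypergeom_enrichment_p(list_size, list_with_term, background_size, background_with_term):
    """P(X >= list_with_term) when list_size proteins are drawn without
    replacement from a background in which background_with_term carry the term."""
    total = comb(background_size, list_size)
    upper = min(list_size, background_with_term)
    tail = sum(
        comb(background_with_term, k)
        * comb(background_size - background_with_term, list_size - k)
        for k in range(list_with_term, upper + 1)
    )
    return tail / total

# Hypothetical counts: 3 of 4 identified proteins carry a term that
# 5 of the 20 background proteins carry.
p = hypergeom_enrichment_p(list_size=4, list_with_term=3,
                           background_size=20, background_with_term=5)
```

Using the whole proteome as background would assume every protein is equally likely to be identified; restricting the background to proteins that could plausibly be detected is one way to reduce that bias. One such p-value is produced per GO term, so the multiple testing issue applies.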

Page 22:

Protein interaction networks

Can be useful for identifying protein complexes in data
  E.g. STRING [http://string-db.org/]

Page 23:

Example: identifying E. huxleyi proteins with multi-species and EST sequence databases

Page 24:

Combined search strategy

Genome unavailable (for download & searching)

[Diagram labels: dbEST; Thalassiosira pseudonana; Taxa-limited Database; 90,000 E. hux ESTs; Protein List]

[Taxa shown: Rhodophyta; Stramenopiles; Haptophyceae; Alveolata; Cryptophyta]

Page 25:

[Workflow diagram, BUDAPEST CORE steps 1 to 5. Labels: EST dataset; BLAST database; MS/MS data; MASCOT hits; MASCOT hits translated to 6 RFs (reading frames); RFs and MASCOT peptides filtered; Poor quality RFs removed; FIESTA consensus & annotation, OPTIONAL (MANUAL or AUTOMATED); Final protein identifications]

[Counts as labelled on the diagram: 90,000 E. hux ESTs; 173 ESTs / 728; 189 RFs; 113; 615; Taxa-limited Database; 117 Cons / 321; 34 Cons / 34; 83 Cons / 287]

173 EST hits (728 peptides)

83 Consensus sequences
  40 Clusters by homology (variants/isoforms)

287 Peptides
  239 Unique to one consensus
  48 Shared within one cluster

http://www.southampton.ac.uk/~re1u06/software/budapest/
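The step "MASCOT hits translated to 6 RFs" refers to translating each EST in all six reading frames (three per strand). The transcript does not show how BUDAPEST implements this, so the following is a generic sketch that assumes the standard genetic code; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons only; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return the six reading-frame translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]   # reverse complement strand
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

# Tiny made-up sequence; '*' marks a stop codon.
frames = six_frame_translation("ATGGCCTAA")
```

Only frames that actually contain the matched peptides would be kept in a filtering step like the one on the slide; reading frames broken by many stop codons are natural candidates for removal as poor quality.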

Page 26:

Annotating EST Consensus Sequences

Homology searching & phylogenetics

[Diagram labels: Sequence Database; Consensus; UniProt; Taxa-limited Database; Alignment]

Page 27:

Protein family identification

Page 28:

Redundancy/Variants

Page 29:

Combined search strategy

Genome unavailable (for download & searching)

[Diagram labels: dbEST; Thalassiosira pseudonana; Taxa-limited Database; 90,000 E. hux ESTs]

[Taxa shown: Rhodophyta; Stramenopiles; Haptophyceae; Alveolata; Cryptophyta]

EST search: 173 Hits, 83 Consensus, 40+ Proteins

Taxa-limited database search: 96 Hits, 26+ Proteins

Combined: 64+ Proteins (12 Common)

Page 30:

Conclusions.

Page 31:

Summary

Extra analysis of raw protein lists adds value
  False positives vs. real proteins
  Annotation of uncharacterised hits

Numerous tools for mining protein lists
  Data exploration and/or hypothesis testing
  Community/organism dependent
  Worth contacting bioinformaticians for further development

Development of customised bioinformatics solutions can greatly increase power of study
  Increased availability of high throughput technologies
  Poor annotation & high error rates
  Increased need for bioinformatics post-processing to improve quality

Page 32:

Open Discussion

[email protected]