
11th Sep 2014
Debasis Dash
Workflows and Pipelines for NGS analysis: Lessons from proteomics
Conference on Applying NGS in Basic Research, Health Care and Agriculture

ATGAAGAAGCTGTGTGCTTTCACTATTGCCTTTTTTTCCCTGAAGTTTTGTCTCATCTTGTGCAGTTTGACTGAACCCAATTGCTTTTGGAAGATAAAGAAGAGAGAAGTTAATGATGGAGATTTGCAAAATGAGTGTGGTTTTGTCCTTTTTACACTTGAGAGCCCTATTGAAGAAAATTTTTATAATCACATTATTAATTTTAGGATACCAGCAAGAAAATATGAATTTTTTCTGGTAATGTTTTTTGCTACTGATGAGATCAACAAGAATCCTTATCTTTTATCCAACATGTCTTTGATATTTTCCTTCATTTTTGGTATGTGTGAAGATACAATGGGAGTTCTGGATAAAGCATATTTACATCAAAACAACTATTTCGATCTACTTAATTATAACTGTGGAAGAAAGAAACGTTGTGATGTAAAACTTACAGGACCATCATGGAAAACTTCCTTAAAACTTTCAGTTAATTCAAGGGCACCAAAGATTTTCTTTGGACCATTTAATCCTAACCTGAGTGACCATGACCAGTTTCCCTATATCTATCAGATAGCAACCAAGGACACATATTTGCTCCATGGCATGGTCTCCTTGATGTTTCATTTTGAATGGACTTGGATAGGACTGATCATCACAGATGATGACCAAGGTATTCAGTTTCACTCAGACTTGAGAGAAGAAATGCAAAGGCATGCGATCTGTTTAGCTTTTGTGATTATGATCCCAGAAAGCATTAAGTTATACAACACAAAGTTTAAGATATATGACCAACAACTTATGACATCTTCAGCAAAGGTTACTATCATTTATGGCAAAATGATCTCCACTCTAGAACTCAACTTTGCAAGATGGACATATTTAGTTGCACGGAGAATCTGGATCACAACCTCAAAATTGGATGTCATCACATATGATAAAGATTTCAGCCTTGATTTCTTCCACGGGACTGTCATTTTTGCCCACCACCACAATGACATCGCTACATTTAGAAATTTTATGCAAATAATAAACACATCCAAGTATCCAGTAGATATTTCTCAGTCTATGGGGCAGTGGAATCATTTTAACTGTTCAATCTCAAAGAACAAGAAGAAAATGGATTTTTTTATGTTGAAAAACCCAATGGAATGGTTAACACAGCACACATTTGACATGGTCCTGAGTGAAGAAGGTTACAATTTGTATAATGCTGTGTATGCTGTGGCCCACACCTATCACGAACTCATTTTTCAACAAGTAGAGTCTCAGGAAATGGCCAAACCCAAAGGACTATTCACTGACTGTCAGCAGGTGGCTTCTTTGCTTAAAACTAGGGTATTTACTAACCCTGTTGGAGAGCTGGTGAACATGAATCATAAGGAAAATCAGTGTGCCAAGTATGACATTTTCATCATTTGGAATTTTCCAAATGGCCTTGGATTAAAAGTGAAAATAGGAAGCTATTTTCCTTGTTTGCAACAGAGTCAACATCTTCATATATCTGAAGACTGGGAGTGGGTTACAGGAGAAACATTGGTTCCCTCCTCAGTGTGTAGTGAGACATGTACTGCAGGATTCAGAAAAAGTCATCAGAAACAAACAGCCAACTGCTGCTTTGATTGTGTCCAGTGCCAAGAAAATGAGATTGCCAAT
Where are the protein-coding genes in a genome?
http://www.picgifs.com/
Genome annotation

Genome annotation
Transcriptome
Proteome
Structural biology
Reactome
Metabolome
Interactome
Systems biology
Importance of genome annotation
Armengaud J. Proteogenomics and systems biology: quest for the ultimate missing parts. Expert Rev Proteomics. 2010

Solving a puzzle when pieces are missing or broken
http://www.puzzlewarehouse.com/missing-pieces/

How are proteins detected from samples?
Peptide Spectrum Match Scorer
Sample: Protein Extraction → Protease Digestion → LC-MS (MS1, MS/MS) → Experimental MS/MS Spectrum
Database: Protein Database → Theoretical peptide digestion → Peptide fragmentation simulation → Theoretical MS/MS Spectrum
A high-throughput method of protein identification
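
The database side of this workflow (theoretical digestion, then fragmentation simulation) can be sketched roughly as below. This is a minimal illustration, not the scorer used in the talk: trypsin is assumed as the protease (the slide only says "Protease Digestion"), no missed cleavages or modifications are modelled, and the function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein):
    """Assumed protease: trypsin. Cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def fragment_ions(peptide):
    """Singly charged b-ions (b1 upward) and y-ions (longest first) as m/z."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions, y_ions


if __name__ == "__main__":
    for pep in tryptic_digest("MKAGRPLK"):
        print(pep, fragment_ions(pep))
```

A scorer would then compare these theoretical m/z values with the peaks of each experimental MS/MS spectrum within a mass tolerance.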

A large fraction of experimental spectra remain unidentified, possibly because of:
Unknown modifications on the peptides
Limitations of the search algorithm
Noisy spectra
Spectra of non-peptidic origin
Peptides missing from the search database
Identified
Unidentified
Proteomics: Challenges

Target
Decoy
Threshold score
Concatenated target-decoy search*
• FDR = 2 × decoy / (target + decoy)
Separate target and decoy search**
• FDR = decoy / target
* Nature Methods 4, 207–214 (2007). ** J. Proteome Res., 2008, 7 (01), pp 29–34
Scores
Controlling error rates through decoys
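
The two estimators on this slide can be written out as a short sketch. The counts are the numbers of target and decoy hits scoring at or above the threshold; the function names and the example scores are illustrative assumptions, not taken from the talk.

```python
def count_above(scores, threshold):
    """Number of hits scoring at or above the threshold."""
    return sum(1 for s in scores if s >= threshold)


def fdr_concatenated(n_target, n_decoy):
    """Concatenated target-decoy search: FDR = 2 x decoy / (target + decoy)."""
    total = n_target + n_decoy
    return 2 * n_decoy / total if total else 0.0


def fdr_separate(n_target, n_decoy):
    """Separate target and decoy searches: FDR = decoy / target."""
    return n_decoy / n_target if n_target else 0.0


if __name__ == "__main__":
    # Hypothetical score lists, for illustration only.
    target_scores = [52.1, 48.3, 40.0, 33.7, 21.5, 19.9]
    decoy_scores = [25.0, 18.2, 12.4]
    threshold = 20.0
    t = count_above(target_scores, threshold)
    d = count_above(decoy_scores, threshold)
    print(fdr_concatenated(t, d), fdr_separate(t, d))
```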

MassWiz: An advanced algorithm for peptide discovery
Intensity of matching peaks
Continuity of y-ions & b-ions
Neutral losses & immonium ions
Fragment mass error sensitive scoring
Yadav AK, Kumar D, Dash D. MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. J Proteome Res. 2011

Data: ISB standard protein mix, http://regis-web.systemsbiology.net/PublicDatasets
Algorithm comparison: PSMs

Proteogenomics: An alternate proteomic search strategy

Proteogenomics: An alliance of Genomics and Proteomics
Genome Annotation
Known Peptides
Novel Peptides
Proteomic identifications
Novel Gene
Gene model change
Gene on different frame
Gene on opposite strand
Armengaud J. A perfect genome annotation is within reach with the proteomics and genomics alliance. Curr Opin Microbiol. 2009
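
The four outcomes named on this slide can be illustrated with a simplified classifier that compares a peptide's genomic mapping with annotated gene coordinates. This is only a sketch under stated assumptions: coordinates are 1-based and inclusive, genes are unspliced (prokaryotic), and the `Interval` type, function names and decision order are hypothetical rather than the rules of any published pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def frame_anchor(iv):
    """Codon phase is fixed by the 5' end: start on '+', end on '-'."""
    return (iv.start if iv.strand == "+" else iv.end) % 3


def classify_peptide(peptide, genes):
    overlapping = [g for g in genes
                   if g.start <= peptide.end and peptide.start <= g.end]
    if not overlapping:
        return "novel gene"
    same_strand = [g for g in overlapping if g.strand == peptide.strand]
    if not same_strand:
        return "gene on opposite strand"
    in_frame = [g for g in same_strand
                if frame_anchor(g) == frame_anchor(peptide)]
    if not in_frame:
        return "gene on different frame"
    if any(g.start <= peptide.start and peptide.end <= g.end for g in in_frame):
        return "known peptide"
    # In frame but extending past the annotated boundary.
    return "gene model change"


if __name__ == "__main__":
    genes = [Interval(100, 399, "+")]
    print(classify_peptide(Interval(70, 129, "+"), genes))
```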

Genomics Proteomics
Lack of analysis pipelines/software for integrating proteomics data with genome or genomics data

Bridging the Gap
Developing computational strategies to identify novel protein coding loci from MS data
Methods for identifying splice variants from proteomics data and discovering novel translation products in eukaryotic model organisms

Proteogenomic analysis of Mycobacterium tuberculosis

• Genome size: 4.4 Mb (Cole et al. 1998)
• 3924 ORFs annotated in the first genome draft.
• 3995 genes in re-annotation. (Camus et al 2002)
• 3988 protein coding genes (NCBI Refseq)
• 3987 protein coding genes (Sanger Institute)
• 3918 protein coding genes (TIGR/JCVI)
• 50% of the genes vary in Translation initiation site (TIS) between Sanger and TIGR annotations (deSouza et al 2008)
• 4,012 protein coding genes (Tuberculist R21)
Does Mycobacterium tuberculosis need re-annotation?

Identified hypothetical: 21%
Unidentified: 20%
Identified: 59%
123 LCMS runs of cell lysate and culture filtrate of Mtb H37Rv
3176 out of 3988 NCBI RefSeq proteins (80% of the Mtb proteome) identified
Translational evidence for 829 Hypothetical proteins
233 of 829 hypothetical proteins identified for the first time
Deep proteome profiling is achieved
In collaboration with Dr. Akhilesh Pandey & IOB
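
The percentages on this slide can be checked against the counts it reports. The sketch below assumes the "Identified 59%" slice means identified proteins that are not hypothetical; that reading is an inference from the arithmetic, not something stated on the slide.

```python
def percent(part, whole):
    """Share of `whole` as a rounded percentage."""
    return round(100 * part / whole)


TOTAL = 3988          # NCBI RefSeq proteins
IDENTIFIED = 3176     # proteins with peptide evidence
HYPOTHETICAL = 829    # identified hypothetical proteins

shares = {
    "identified (non-hypothetical)": percent(IDENTIFIED - HYPOTHETICAL, TOTAL),
    "identified hypothetical": percent(HYPOTHETICAL, TOTAL),
    "unidentified": percent(TOTAL - IDENTIFIED, TOTAL),
}

if __name__ == "__main__":
    print(shares)
    print("proteome coverage:", percent(IDENTIFIED, TOTAL), "%")
```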

Conservation of Novel proteins
Kelkar DS, Kumar D et al. Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry. Mol Cell Proteomics. 2011
41 Novel protein coding loci
Changes in 79 existing gene models
TIS corrected for 33 proteins and confirmed for 868 proteins
Mtb H37Rv: Novel Translations


Database creation
Spectra processing
Peptide Assignment
FDR estimation
Peptide mapping to genome
Gene coordinate comparison
Peptide classification
Result reporting
Visualization
Challenges in proteogenomics
MassWiz
OMSSA
X!Tandem
InsPecT
A solution with complete automation and high fidelity of results is required


Integrating results from multiple algorithms: Set theory
Peptides identified by multiple algorithms have few false positives, but this
method does not allow the false discovery rate to be controlled or estimated
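
The set-theory approach amounts to keeping peptides reported by several search algorithms. A minimal sketch follows; the peptide sequences are invented and the function names are hypothetical. As the slide notes, the result carries no FDR estimate of its own.

```python
from collections import Counter


def peptides_by_support(results):
    """Count how many algorithms reported each peptide."""
    counts = Counter()
    for peptides in results.values():
        counts.update(set(peptides))
    return counts


def consensus(results, min_algorithms=2):
    """Peptides reported by at least `min_algorithms` algorithms."""
    support = peptides_by_support(results)
    return {p for p, n in support.items() if n >= min_algorithms}


if __name__ == "__main__":
    # Hypothetical identifications, for illustration only.
    results = {
        "MassWiz": {"PEPTIDEK", "AAAK"},
        "OMSSA": {"PEPTIDEK"},
        "X!Tandem": {"PEPTIDEK", "AAAK", "GGGR"},
    }
    print(sorted(consensus(results, min_algorithms=2)))
```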

Integrating results from multiple algorithms: FDRscore
OMSSA: E-value
X!Tandem: P-value
InsPecT: P-value
MassWiz: Score
Metrics from multiple algorithms are not comparable
FDR values from individual algorithms can be processed to generate a common score
Jones AR et al., Proteomics 2009
Figure: for each algorithm, Score or P-value → FDR → Q-value → FDRscore
FDRscore-based result integration allowed statistical assessment (FDR) of the final results
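
The Score → FDR → Q-value part of this chain can be sketched as below for one algorithm. It assumes higher scores are better and uses the decoy/target estimator; the FDRscore of Jones et al. builds further on such q-values, and that extra step is not reproduced here. The function name and example data are hypothetical.

```python
def q_values(psms):
    """psms: iterable of (score, is_decoy). Returns (score, is_decoy, q), best first."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # FDR at each rank, estimated as decoys / targets above that score.
    fdrs, targets, decoys = [], 0, 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the lowest FDR at which this PSM would still be accepted.
    q, best = [], float("inf")
    for fdr in reversed(fdrs):
        best = min(best, fdr)
        q.append(best)
    q.reverse()

    return [(score, is_decoy, qv) for (score, is_decoy), qv in zip(ranked, q)]


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
    for row in q_values(example):
        print(row)
```

Because q-values are on the same scale whichever algorithm produced the raw score, they give a common basis for comparing E-values, P-values and MassWiz scores.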

59 novel proteins identified
• 51 novel proteins with 2 or more unique peptides
• Single-peptide hits selected only if identified in at least 2 samples and after manual inspection
49 gene model changes identified
• Translation start site suggested upstream of the current annotation
TIS confirmed for 21 genes
• TIS correction for 1 gene
Novel Proteome of B. japonicum
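
The acceptance rule for novel proteins stated above can be expressed as a small triage function. This is a sketch of the rule as worded on the slide; the function name and return labels are hypothetical, and manual inspection itself cannot be automated here, so such hits are only flagged.

```python
def triage_novel_protein(unique_peptides, samples_with_hit):
    """Apply the slide's rule to one candidate novel protein."""
    if unique_peptides >= 2:
        return "accept"
    if unique_peptides == 1 and samples_with_hit >= 2:
        # Single-peptide hit seen in at least 2 samples: needs a human look.
        return "manual inspection"
    return "reject"


if __name__ == "__main__":
    print(triage_novel_protein(unique_peptides=1, samples_with_hit=3))
```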

FgeneSB operon
A novel protein reveals a novel operon
• ORF length
• Codon bias
• Promoter region
• Ribosome binding site

A gene model change

Novel proteins are short
TTG start codon in Gene model changes
Novel peptides are distributed throughout the genome
Is there a common theme of novel identifications?
Most novel proteins are short proteins

A methylotroph: an organism able to grow on reduced carbon compounds such as methanol or methylamine
Ecologically important: supports vegetation by producing phytohormones
Industrial application: production of important chemicals and biomolecules from methanol feedstock
Model organism: used to study methylotrophic metabolism
Member of the Methylobacteriaceae family: a diverse taxon with many genes specific to one genome

31 Novel protein coding genes
70 gene model changes
104 methylotrophy gene products
2,678 Proteins

Limited conservation and low GC content of novel genes suggest lateral gene transfer as the probable mode of origin
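
GC content, the quantity this observation rests on, is simple to compute per gene and compare with the genome-wide value. A minimal sketch, with a hypothetical function name; ambiguous bases such as N are ignored in the denominator.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a DNA sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


if __name__ == "__main__":
    print(gc_content("ATGAAGAAGCTGTGTGCTTTCACTATTGCC"))
```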

Developing computational strategies to identify novel protein coding loci from MS data
Methods for identifying splice variants from proteomics data and discovering novel translation products in eukaryotic model organisms

1 Exon boundary peptide
2 Splice variant
3 New exon
4 A new 3’ splice site
5 A new 5’ splice site
Exon junction peptides for detecting splice variants
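
Whether a peptide is an exon junction peptide can be decided from its position on the spliced transcript. The sketch below assumes exon lengths are given in transcript order and the peptide's coding span is a 0-based, half-open nucleotide interval on that transcript; the function name is hypothetical.

```python
from itertools import accumulate


def junctions_spanned(exon_lengths, pep_start, pep_end):
    """Return the junction numbers (1 = exon 1 / exon 2) crossed by the peptide.

    pep_start, pep_end: 0-based, half-open span on the spliced transcript.
    """
    # Transcript positions at which one exon ends and the next begins.
    boundaries = list(accumulate(exon_lengths))[:-1]
    return [i + 1 for i, b in enumerate(boundaries) if pep_start < b < pep_end]


if __name__ == "__main__":
    print(junctions_spanned([100, 50, 80], 90, 120))
```

A peptide that returns an empty list lies within a single exon and cannot by itself distinguish splice variants.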

Junction peptide
Peptides map on different translation frame
Peptides map on INTERGENIC region
Peptides map on NON-CODING GENE
Peptides map on UTR
Peptides map on INTRON
Peptides map on opposite strand
Eukaryotic Proteogenomics
Gene
Peptides
Novel Peptides

Prokaryotic Proteogenomics
Proteogenomics: Prokaryotic vs. Eukaryotic
Eukaryotic Proteogenomics

>TCONS_00006262 gene=XLOC_004176 loc:1|58177-58500|-
ATTTTGGAGTTGTGTAGCCAAT………………………………………………………………………………………………..
>TCONS_00006264 gene=XLOC_004177 loc:1|169401-172238|-
AAGGTTCAAGGTACAAGGTGGGGTATGCC……………………………………………………………………………………
>TCONS00006267_420_548_3
TQTHIGQGRDEYLYDSHGSLSRPSSMSTSLPFNRASEHGICC…………………………
>TCONS00006268_769_999_1
SSKVWWLKYTWPMASGSVRRYGLFGVDVAFEEVCHCGGGMGF……………………….
1 Raw RNA-seq reads from NCBI SRA repository
2 Read QC and processing using Trimmomatic
3 Filtered read mapping on reference genome using STAR aligner
4 Transcript assembly by Cufflinks
5 Assembly QC and comparison using Cuffcompare and BLAST
6 FASTA of all transcripts generated using gffread
7 Theoretical translated protein database
RNA-seq analysis pipeline to capture transcriptome
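
Step 7, turning assembled transcripts into a theoretical translated protein database, can be sketched as follows. This is an illustration rather than the pipeline's own code: it uses the standard genetic code, translates only the three forward frames, and writes headers as `id_start_end_frame`, which is a guess at the convention suggested by entries such as `TCONS00006267_420_548_3`. The function names and the `min_len` parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate in frame 1; unknown codons become X, stops become *."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def orf_entries(transcript_id, seq, min_len=2):
    """Yield (header, peptide) for stop-free stretches in the 3 forward frames."""
    for frame in range(3):
        protein = translate(seq[frame:])
        pos = 0  # index of the current segment's first residue in `protein`
        for segment in protein.split("*"):
            if len(segment) >= min_len:
                start = frame + 3 * pos + 1         # 1-based nucleotide start
                end = start + 3 * len(segment) - 1  # 1-based nucleotide end
                yield f"{transcript_id}_{start}_{end}_{frame + 1}", segment
            pos += len(segment) + 1  # skip the segment and its stop codon


if __name__ == "__main__":
    for header, peptide in orf_entries("T1", "ATGAAATAAGGGCCC"):
        print(f">{header}\n{peptide}")
```

A realistic database would use a much larger `min_len`; the small default only keeps the toy example visible.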

Figure: EuGenoSuite workflow, continuing from steps 1–7 of the RNA-seq pipeline
GenoSuite (OMSSA, X!Tandem)
8 Tandem mass spectra
9 Peptide identification
10 Protein grouping / protein assembler
11, 12, 13, 14
Novel / Known categorization
EuGenoSuite: Integrates transcriptomics to proteomics

Organism   Genome size (Mb)$   Annotated proteins*
Human      3,284.83            104,763
Mouse      2,796.64            52,165
Rat        2,909.70            25,725
* Ensembl release 74; $ NCBI Genome
Genome size and annotation comparison

Rattus norvegicus
Brain
Liver
Spleen
Testes
Kidney
Colon
Muscle
Lung
Heart
9 tissues, 3 replicates each
Sequencing instruments
• HiSeq 2000
• Illumina GAII
Case study dataset
Sample 1 Sample 2 Sample 3
T1 T2 T1 T2 T1 T2
T1: Technical Replicate 1 T2: Technical Replicate 2

11,725 peptides (1% FDR, identified in both T1 and T2)
EuGenoSuite
Transcriptomic analysis pipeline
400 million paired-end reads
2 million MS/MS spectra
312 novel peptides (275 unique mapping)
11,413 mapped to known proteins
45 spliced peptides
145 intergenic
18 different frame
28 non-coding loci
25 UTR
14 intronic
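
The counts on this slide are internally consistent, which a few lines of arithmetic confirm. The grouping of the six categories under the 275 uniquely mapping novel peptides is an inference from the totals, not something the slide states.

```python
TOTAL_PEPTIDES = 11_725
NOVEL = 312
KNOWN = 11_413
UNIQUE_MAPPING_NOVEL = 275

# Category counts as given on the slide.
categories = {
    "spliced": 45,
    "intergenic": 145,
    "different frame": 18,
    "non-coding loci": 28,
    "UTR": 25,
    "intronic": 14,
}

if __name__ == "__main__":
    print("novel + known =", NOVEL + KNOWN)
    print("sum of categories =", sum(categories.values()))
    print("novel peptides mapping to more than one locus =",
          NOVEL - UNIQUE_MAPPING_NOVEL)
```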

Discovery of a splice variant of threonyl-tRNA synthetase

Translation of Pseudogene
Pseudogene
Paralog (PCBP2)

105,380 unique transcripts assembled
≈2,900 Annotated proteins identified
Transcripts and peptides for eight pseudogenes
Translation of exons annotated as non-coding (15 genes)
45 splice variants detected
Rat Analysis Summary

Translation of a novel gene locus

Summary
Part 1
• Proteomics data, when searched against a genomic background, aids novel protein discovery
• GenoSuite: a fully automated multi-algorithmic proteomics and proteogenomics analysis tool
• Comprehensive proteogenomic analysis of B. japonicum improves protein annotation of rhizobia
• N-terminal acetylation of bacterial proteins
Part 2
• Integrated analysis of RNA-seq and mass spectrometry proteomics data tracks down novel protein isoforms
• EuGenoSuite: an in-house pipeline for eukaryotic proteogenomics
• Translation of pseudogenes in rat microglia

Conclusion
Proteomics
Genomics
Transcriptomics
Data Integration
Novel Discovery
Genome Annotation
GenoSuite
EuGenoSuite

Acknowledgements
IGIBIT Team
IGIB friends and family
&IOB team

Thank you