identifying novel genes and gene refinements in … · 2016. 4. 26. · be eliminated in tte. •...
TRANSCRIPT
-
Identifying Novel Genes and Gene Refinements in Thermoanaerobacter
tengcongensisPresented by Kun Zhang
1
Institute of Computing Technology, CAS, Beijing, Chinahttp://pfind.ict.ac.cn
2014‐06‐20
-
Thermoanaerobacter tengcongensis
• Thermoanaerobacter tengcongensis (TTE)– Discovered in Tengchong, Yunnan, China
– Complete genome sequencing in 20021, 2.7Mb
• Genome annotation2– Computational: ab initial, comparative
– Experimental: RNA‐seq, EST, etc. (cDNA‐seq related)
– TTE: 2588 coding genes (through ab initial, comparative)
1. Bao Q, et al. A complete sequence of the T. tengcongensis genome. Genome Res 12, 689‐700 (2002).
2. 1. Brent MR. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9, 62‐73 (2008). 2
-
Background
• Deficiencies of the – High error rate in the genes predicted
– Discriminate coding genes from non‐coding ones
– Explain the translational event, such as alternative translational
initiate site, translational UTR (untranslational region)
1. Andrews SJ, Rothnagel JA. Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 15, 193‐204 (2014). 3
-
Background
• Another experimental evidence: proteomics
data (LC‐MS/MS based)
• For genome annotation: Proteogenomics– Direct evidence of ORF translation
– Find novel genes and refine gene models that the traditional
approaches could not reach
1. Ahrens CH, Brunner E, Qeli E, Basler K, Aebersold R. Generating and navigating proteome maps using mass spectrometry. Nat Rev Mol Cell Biol 11, 789‐801 (2010).
2. Andrews SJ, Rothnagel JA. Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 15, 193‐204 (2014). 4
Utilize proteomics data to find novel genes and gene refinements in TTE.
-
LC‐MS/MS based proteomics• LC‐MS/MS based proteomics1
1. Steen H, Mann M. The ABC's (and XYZ's) of peptide sequencing. Nat Rev MolCell Biol 5, 699‐711 (2004). 5
PP EE PP TT II DD EE
PP EE PP TT II DD EE
PP EE PP TT II DD EE++
+ +
+ +
B ions Y ions
-
Searching the protein database• Database search1
1. Nesvizhskii AI, Vitek O, Aebersold R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods 4, 787‐797 (2007). 6
Reference proteomes of experientially verified or predicted protein sequences
-
7Discipline
Algorithm
Software
Application
Aim Deep Analysis of Big Data for MS‐based Proteomics
• Nature Methods 2012 : Identification of cross‐linked peptides from complex samples• PNAS 2012: Nematode sperm maturation involves serine protease inhibitor (Serpin)•MCP 2009: A strategy for large‐scale identification of core fucosylated glycoproteins• Proteomics 2012: The ovarian cancer‐derived proteome: A repertoire of tumor markers•Mol BioSys 2012: Improved pipeline for LC‐ETD‐MS/MS using charge enhancing methods• 2009 –2013: Protein identification service at the Institute of Zoology, CAS
Computational Proteomics
-
Proteogenomic analysis
ORF Scoring & TDA Filtering
A
B
C
Peptide Spectrum Matching
Proteogenimc Database MS Data
.fasta MS2
.fna TTE Genome
.gtf
ReferenceDatabase
ORF0201ORF0013ORF0387ORF1029ORF1101ORF2033ORF1587ORF1323ORF1977REV_ORF0021
2.032.102.112.162.222.232.252.272.282.31
Ranked ORF Score
Reliability
D
E
chrRefined ORF Novel ORF Annotated ORF
Genome search specific peptides
Annotated ORF Homologous ORF
Peptide from annotated ORF
Annotated ORFchr
Re‐annotating
6‐ORF Translating
8
-
Novel genes and gene refinements
Reference chromosomeRefinedORF Novel ORF Annotated ORF
Genome search specific peptides
Annotated ORF Homologous ORF
Peptide from annotated ORF
Annotated ORFchromosome
Re‐annotating
Translated genome
Database
peptides
peptides
9
-
Data and database• Database: translated TTE genome• Data
– Mass spectrometry: Q‐Exactive (HCD‐FT)– Tandem mass spectra: ~5 million
• Search parameter– Engine: pFind v3.0– Default params for HCD‐FT mode
• Quality control: ORF level FDR
-
All ORFs identified
Category Number Ratio
Annotated ORF 2002 91%
Refined ORF 112 5%
Novel ORF 49 2%
Homologous ORF 42 2%
11
• Validation of novel ORFs– Computational: the local FDR(posterior error probability) measures
the reliability of each ORF was calculated
– Transcriptional evidence: the strand specific RNA‐seq data was also
used as a supporting evidence
Table. All ORFs identified in this study.
-
Distribution of novel and reference ORFs
0 500 1000 1500 20000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Ranked ORF
Cum
ulat
ive
perc
enta
ge o
f OR
F 2588 genenovel ORFlocal FDR
FDR
-
Novel ORFs in length
0 0.5 1 1.5 2 2.5 3 3.5 40
5
10
15
20
25
30
35Novel ORFAnnotated ORF
Perc
enta
ge (%
)
log (length)10
~150bp (50aa)
13
The length (bp) and RNA expression of the novel ORFs
Novel ORFAnnotated ORF
log (FPKM)10
0
5
10
15
20
25
30
35
Perc
enta
ge (%
)
0 5-5
-
Genomic location
• Novel ORFs were shorter in length (sORF)
• Classification based on the distance (1000kb) to
annotated ORFs
– UTR translation or operon structure (46)
– Intragenic regions (1)
– Overlapping with annotated ORFs (2)
14
-
• Examples of identified novel ORFs– UTR translation
– Overlapping ORF
– Non‐canonical start codon
15
-
UTR translationA
B
gi|20808287|ref|NP_623458.1| gi|20808288|ref|NP_623459.1| gi|20808289|ref|NP_623460.1|
pep_EEYALR
pep_LVPAVALPHPLGNPSLGPK
pep_VALPHPLGNPSLGPK
Novel ORF
expression range [0 - 500]
expression range [0 - 500]
RNA-seq expression level on positive strand
RNA-seq expression level on negative strand
1,820,000 bp 1,821,000 bp 1,822,000 bp 1,823,000 bp 1,824,000 bp 1,825,000 bp
ORF
Identified peptides
pep_AGLPTVHVCTLVPLSKpep_LVPAVALPHPLGNPSLGPKEEYALR
Rela
tive
Int
ensi
ty(%
)
m/z200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400
y1+
147.
11
a2+
185.
16
b2+
213.
16
y2+
244.
16
a3+
282.
18 y3+
301.
19
a4+
353.
22
b4+
381.
25
y4+
414.
27
a5+
452.
29
y10+
+49
0.28
y11+
+55
8.81
y6+
598.
35y1
2++
607.
34
y13+
+66
3.88
y14+
+69
9.40
y7+
712.
40
y15+
+74
8.93
y8+
769.
42y1
6++
784.
45
y17+
+83
2.97
y9+
882.
50
y10+
979.
55
y11+
1116
.61
y12+
1213
.66
y13+
1326
.75
y14+
1397
.79
y1
Ky2
Py3
Gy4
LSy6
Py7
Ny8
Gy9
Ly10
Pb8
y11
Hb7
y12
Py13
Lb5
y14
Ab4
y15
Vb3
y16
Ab2
y17
Py18
VL3+100
90
80
70
60
50
40
30
20
10
0
+20
-20Erro
r (pp
m)
0
chrom
16
-
Overlapping ORF
gi|20808787|ref|NP_623958.1| gi|20808788|ref|NP_623959.1|
pep_LLEESLLKR
pep_LLEESLLK
expression range [0 - 110]
expression range [0 - 52]
2,317,400 bp 2,317,600 bp 2,317,800 bp 2,318,000 bp 2,318,200 bp 2,318,400 bp 2,318,600 bp 2,318,800 bp
Identified peptides
RNA-seq expression level on positive strand
RNA-seq expression level on negative strand
ORF
Rela
tive
Int
ensi
ty(%
)
m/z200 300 400 500 600 700 800 900 1000
y1+
175.
12
a2+
199.
18
b2+
227.
18
b4++
243.
13
y3+
416.
29 y8++
494.
29
[M]+
+55
0.84 y5
+61
6.41
y6+
745.
46
y7-N
H3+
857.
48
y7+
874.
50
y8+
987.
58
y1
RKy3
LLb4
y5
Sy6
Eb2
y7
Ey8
LL2+
100
90
80
70
60
50
40
30
20
10
0
+20
-20Err
or (p
pm)
0
Novel ORF
chrom
17
A
B
-
Non‐canonical start codon
18
V G * L P * E V I A D P A S K I K N Y A L N F N E R I K R E I E E A I E K N E K L F Q K A N E R G P G T K H R * L L V L F R K EY G R V P A I R S Y C * S S I Q D * Q L C S Q F K G Q N E E * D R R R D G * K R K F V A K C K R K R A W N * T K I P T G L V A K R *L G K C P S N * * L M L L Q N S R I T L L I S I K G S K G R L R K Q S R R I K K * F S S Q M K E E Q G L K I D K Y S Y W S G S K E
gi|20807821|ref|NP_622992.1|
pep_LRENFNLAYNK pep_QFLKENK
pep_ENFNLAYNK pep_NKELAEELERK
pep_LRENFNLAY
pep_ELAEELERK
pep_QFLKENKELAEELER
pep_ELAEELER
expression range [0 - 944]
expression range [0 - 774]
1,356,680 bp 1,356,700 bp 1,356,720 bp 1,356,740 bp 1,356,760 bp 1,356,780 bp 1,356,800 bp 1,356,820 bp 1,356,840 bp 1,356,860 bp
RNA-seq expression level on positive strand
RNA-seq expression level on negative strand
chrom
Rela
tive
Int
ensi
ty(%
)
m/z100 200 300 400 500 600 700 800 900 1000 1100 1200
y1-H
2O+
129.
10
y1+
147.
11
y2-H
2O+
243.
15
y2+
261.
15b2
+27
0.19
b6++
387.
70
y3-N
H3+
407.
19 y3+
424.
22
b7++
444.
24
y4-N
H3+
478.
23y4
+49
5.26
b4+
513.
28
b9++
561.
29
y5-N
H3+
591.
31y5
+60
8.34
a5+
632.
35
b5+
660.
34[M
]-NH
3-N
H3+
+67
4.34
[M]-N
H3++
682.
85
[M]+
+69
1.36
b6+
774.
39
a7+
859.
48
b7+
887.
47
a8+
930.
51
b8+
958.
51
a9+
1093
.58
b9+
1121
.57
a10+
1207
.60
b10+
1235
.62
b10
y1
Kb9
y2
Nb8
y3
Yb7
y4
Ab6
y5
Lb5N
b4F
y8
Nb2ERL
2+100
90
80
70
60
50
40
30
20
10
0
+20
-20Err
or (p
pm)
0
A
B
-
19
Category Sensitivity Accuracy
Transcriptomic ★★★ ★★
Computational ★★ ★☆
Proteomic ★★☆ ★★★
-
Conclusion• 2% ORFs are missing, short ORFs are tend to be eliminated in TTE.
• Most of the novel ORFs identified in this study come from UTR region, operon, non‐canonical start codons.
• Proteogenomics is a good source for genome refinement and novel gene discovery, which may beyond the reach of genomic and transcriptomic approaches.
20
-
Acknowledgement
21
Associate ProfessorRui‐Xiang Sun
Assistant ResearcherHao Chi
M.D.Long Wu
Associate ProfessorQuan‐Hui Wang
ProfessorSi‐Min He
ProfessorSi‐Qi Liu
ProfessorXiao‐Hong Qian
Associate ProfessorYi Zhao
ProfessorLu Xie
Ph.D. candidateDe‐Chao Bu
-
And, many thanks to our pFinders
22
-
Q & A
23