identifying novel genes and gene refinements in … · 2016. 4. 26. · be eliminated in tte. •...

Identifying Novel Genes and Gene Refinements in Thermoanaerobacter

tengcongensisPresented by Kun Zhang

1

Institute of Computing Technology, CAS, Beijing, Chinahttp://pfind.ict.ac.cn

2014‐06‐20

Thermoanaerobacter tengcongensis

• Thermoanaerobacter tengcongensis (TTE)– Discovered in Tengchong, Yunnan, China

– Complete genome sequencing in 20021, 2.7Mb

• Genome annotation2– Computational: ab initial, comparative

– Experimental: RNA‐seq, EST, etc. (cDNA‐seq related)

– TTE: 2588 coding genes (through ab initial, comparative)

1. Bao Q, et al. A complete sequence of the T. tengcongensis genome. Genome Res 12, 689‐700 (2002).

2. 1. Brent MR. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9, 62‐73 (2008). 2

Background

• Deficiencies of the – High error rate in the genes predicted

– Discriminate coding genes from non‐coding ones

– Explain the translational event, such as alternative translational

initiate site, translational UTR (untranslational region)

1. Andrews SJ, Rothnagel JA. Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 15, 193‐204 (2014). 3

Background

• Another experimental evidence: proteomics

data (LC‐MS/MS based)

• For genome annotation: Proteogenomics– Direct evidence of ORF translation

– Find novel genes and refine gene models that the traditional

approaches could not reach

1. Ahrens CH, Brunner E, Qeli E, Basler K, Aebersold R. Generating and navigating proteome maps using mass spectrometry. Nat Rev Mol Cell Biol 11, 789‐801 (2010).

2. Andrews SJ, Rothnagel JA. Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 15, 193‐204 (2014). 4

Utilize proteomics data to find novel genes and gene refinements in TTE.

LC‐MS/MS based proteomics• LC‐MS/MS based proteomics1

1. Steen H, Mann M. The ABC's (and XYZ's) of peptide sequencing. Nat Rev MolCell Biol 5, 699‐711 (2004). 5

PP EE PP TT II DD EE

PP EE PP TT II DD EE

PP EE PP TT II DD EE++

+ +

+ +

B ions Y ions

Searching the protein database• Database search1

1. Nesvizhskii AI, Vitek O, Aebersold R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods 4, 787‐797 (2007). 6

Reference proteomes of experientially verified or predicted protein sequences

7Discipline

Algorithm

Software

Application

Aim Deep Analysis of Big Data for MS‐based Proteomics

• Nature Methods 2012 : Identification of cross‐linked peptides from complex samples• PNAS 2012: Nematode sperm maturation involves serine protease inhibitor (Serpin)•MCP 2009: A strategy for large‐scale identification of core fucosylated glycoproteins• Proteomics 2012: The ovarian cancer‐derived proteome: A repertoire of tumor markers•Mol BioSys 2012: Improved pipeline for LC‐ETD‐MS/MS using charge enhancing methods• 2009 –2013: Protein identification service at the Institute of Zoology, CAS

Computational Proteomics

Proteogenomic analysis

ORF Scoring & TDA Filtering

A

B

C

Peptide Spectrum Matching

Proteogenimc Database MS Data

.fasta MS2

.fna TTE Genome

.gtf

ReferenceDatabase

ORF0201ORF0013ORF0387ORF1029ORF1101ORF2033ORF1587ORF1323ORF1977REV_ORF0021

2.032.102.112.162.222.232.252.272.282.31

Ranked ORF Score

Reliability

D

E

chrRefined ORF Novel ORF Annotated ORF

Genome search specific peptides

Annotated ORF Homologous ORF

Peptide from annotated ORF

Annotated ORFchr

Re‐annotating

6‐ORF Translating

8

Novel genes and gene refinements

Reference chromosomeRefinedORF Novel ORF Annotated ORF

Genome search specific peptides

Annotated ORF Homologous ORF

Peptide from annotated ORF

Annotated ORFchromosome

Re‐annotating

Translated genome

Database

peptides

peptides

9

Data and database• Database: translated TTE genome• Data

– Mass spectrometry: Q‐Exactive (HCD‐FT)– Tandem mass spectra: ~5 million

• Search parameter– Engine: pFind v3.0– Default params for HCD‐FT mode

• Quality control: ORF level FDR

All ORFs identified

Category Number Ratio

Annotated ORF 2002 91%

Refined ORF 112 5%

Novel ORF 49 2%

Homologous ORF 42 2%

11

• Validation of novel ORFs– Computational: the local FDR(posterior error probability) measures

the reliability of each ORF was calculated

– Transcriptional evidence: the strand specific RNA‐seq data was also

used as a supporting evidence

Table. All ORFs identified in this study.

Distribution of novel and reference ORFs

0 500 1000 1500 20000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Ranked ORF

Cum

ulat

ive

perc

enta

ge o

f OR

F 2588 genenovel ORFlocal FDR

FDR

Novel ORFs in length

0 0.5 1 1.5 2 2.5 3 3.5 40

5

10

15

20

25

30

35Novel ORFAnnotated ORF

Perc

enta

ge (%

)

log (length)10

~150bp (50aa)

13

The length (bp) and RNA expression of the novel ORFs

Novel ORFAnnotated ORF

log (FPKM)10

0

5

10

15

20

25

30

35

Perc

enta

ge (%

)

0 5－5

Genomic location

• Novel ORFs were shorter in length (sORF)

• Classification based on the distance (1000kb) to

annotated ORFs

– UTR translation or operon structure (46)

– Intragenic regions (1)

– Overlapping with annotated ORFs (2)

14

• Examples of identified novel ORFs– UTR translation

– Overlapping ORF

– Non‐canonical start codon

15

UTR translationA

B

gi|20808287|ref|NP_623458.1| gi|20808288|ref|NP_623459.1| gi|20808289|ref|NP_623460.1|

pep_EEYALR

pep_LVPAVALPHPLGNPSLGPK

pep_VALPHPLGNPSLGPK

Novel ORF

expression range [0 - 500]


RNA-seq expression level on positive strand

RNA-seq expression level on negative strand

1,820,000 bp 1,821,000 bp 1,822,000 bp 1,823,000 bp 1,824,000 bp 1,825,000 bp

ORF

Identified peptides

pep_AGLPTVHVCTLVPLSKpep_LVPAVALPHPLGNPSLGPKEEYALR

Rela

tive

Int

ensi

ty(%

)

m/z200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400

y1+

147.

11

a2+

185.

16

b2+

213.

16

y2+

244.

16

a3+

282.

18 y3+

301.

19

a4+

353.

22

b4+

381.

25

y4+

414.

27

a5+

452.

29

y10+

+49

0.28

y11+

+55

8.81

y6+

598.

35y1

2++

607.

34

y13+

+66

3.88

y14+

+69

9.40

y7+

712.

40

y15+

+74

8.93

y8+

769.

42y1

6++

784.

45

y17+

+83

2.97

y9+

882.

50

y10+

979.

55

y11+

1116

.61

y12+

1213

.66

y13+

1326

.75

y14+

1397

.79

y1

Ky2

Py3

Gy4

LSy6

Py7

Ny8

Gy9

Ly10

Pb8

y11

Hb7

y12

Py13

Lb5

y14

Ab4

y15

Vb3

y16

Ab2

y17

Py18

VL3+100

90

80

70

60

50

40

30

20

10

0

+20

－20Erro

r (pp

m)

0

chrom

16

Overlapping ORF

gi|20808787|ref|NP_623958.1| gi|20808788|ref|NP_623959.1|

pep_LLEESLLKR

pep_LLEESLLK



2,317,400 bp 2,317,600 bp 2,317,800 bp 2,318,000 bp 2,318,200 bp 2,318,400 bp 2,318,600 bp 2,318,800 bp

Identified peptides



ORF

Rela

tive

Int

ensi

ty(%

)

m/z200 300 400 500 600 700 800 900 1000

y1+

175.

12

a2+

199.

18

b2+

227.

18

b4++

243.

13

y3+

416.

29 y8++

494.

29

[M]+

+55

0.84 y5

+61

6.41

y6+

745.

46

y7-N

H3+

857.

48

y7+

874.

50

y8+

987.

58

y1

RKy3

LLb4

y5

Sy6

Eb2

y7

Ey8

LL2+

100

90

80

70

60

50

40

30

20

10

0

+20

－20Err

or (p

pm)

0

Novel ORF

chrom

17

A

B

Non‐canonical start codon

18

V G * L P * E V I A D P A S K I K N Y A L N F N E R I K R E I E E A I E K N E K L F Q K A N E R G P G T K H R * L L V L F R K EY G R V P A I R S Y C * S S I Q D * Q L C S Q F K G Q N E E * D R R R D G * K R K F V A K C K R K R A W N * T K I P T G L V A K R *L G K C P S N * * L M L L Q N S R I T L L I S I K G S K G R L R K Q S R R I K K * F S S Q M K E E Q G L K I D K Y S Y W S G S K E

gi|20807821|ref|NP_622992.1|

pep_LRENFNLAYNK pep_QFLKENK

pep_ENFNLAYNK pep_NKELAEELERK

pep_LRENFNLAY

pep_ELAEELERK

pep_QFLKENKELAEELER

pep_ELAEELER



1,356,680 bp 1,356,700 bp 1,356,720 bp 1,356,740 bp 1,356,760 bp 1,356,780 bp 1,356,800 bp 1,356,820 bp 1,356,840 bp 1,356,860 bp



chrom

Rela

tive

Int

ensi

ty(%

)

m/z100 200 300 400 500 600 700 800 900 1000 1100 1200

y1-H

2O+

129.

10

y1+

147.

11

y2-H

2O+

243.

15

y2+

261.

15b2

+27

0.19

b6++

387.

70

y3-N

H3+

407.

19 y3+

424.

22

b7++

444.

24

y4-N

H3+

478.

23y4

+49

5.26

b4+

513.

28

b9++

561.

29

y5-N

H3+

591.

31y5

+60

8.34

a5+

632.

35

b5+

660.

34[M

]-NH

3-N

H3+

+67

4.34

[M]-N

H3++

682.

85

[M]+

+69

1.36

b6+

774.

39

a7+

859.

48

b7+

887.

47

a8+

930.

51

b8+

958.

51

a9+

1093

.58

b9+

1121

.57

a10+

1207

.60

b10+

1235

.62

b10

y1

Kb9

y2

Nb8

y3

Yb7

y4

Ab6

y5

Lb5N

b4F

y8

Nb2ERL

2+100

90

80

70

60

50

40

30

20

10

0

+20

－20Err

or (p

pm)

0

A

B

19

Category Sensitivity Accuracy

Transcriptomic ★★★ ★★

Computational ★★ ★☆

Proteomic ★★☆ ★★★

Conclusion• 2% ORFs are missing, short ORFs are tend to be eliminated in TTE.

• Most of the novel ORFs identified in this study come from UTR region, operon, non‐canonical start codons.

• Proteogenomics is a good source for genome refinement and novel gene discovery, which may beyond the reach of genomic and transcriptomic approaches.

20

Acknowledgement

21

Associate ProfessorRui‐Xiang Sun

Assistant ResearcherHao Chi

M.D.Long Wu

Associate ProfessorQuan‐Hui Wang

ProfessorSi‐Min He

ProfessorSi‐Qi Liu

ProfessorXiao‐Hong Qian

Associate ProfessorYi Zhao

ProfessorLu Xie

Ph.D. candidateDe‐Chao Bu

And, many thanks to our pFinders

22

Q & A

23

identifying novel genes and gene refinements in … · 2016. 4. 26. · be eliminated in tte. •...

Documents