RIBOSOME PROFILING AND THE IMPROVEMENT OF PROTEIN IDENTIFICATION IN CANCER PROTEOGENOMICS STUDIES

Number of words: 20 226

Sofie Gielis

Student number: 01507424

Promotor: Prof. dr. ir. Wim Van Criekinge

Promotor: dr. ir. Gerben Menschaert

Master's dissertation submitted in partial fulfillment of the requirements for the degree of Master in

Bioscience Engineering: Cell and Gene Biotechnology

Academic year: 2016 - 2017

Acknowledgements

After five years of hard work and dedication, college is coming to an end. It was definitely the

most challenging and interesting experience ever. I am very grateful for all the support of my

friends and family throughout these years. Thank you all!

Also, I would like to thank my promotor Wim Van Criekinge for guiding me through the basics

of bioinformatics in the first master year. Your enthusiastic way of teaching and love for the

subject have sparked my interest in bioinformatics.

Finally, a special thanks goes to my promotor Gerben Menschaert and tutor Steven

Verbruggen for all the help and support over the last year.

Sofie Gielis

Ghent, 4 June 2017

Contents

1 Introduction and outline
2 Overview of relevant literature
2.1 NGS-based technologies
2.2 MS-based proteomics
2.2.1 Outline of MS-based proteomics experiments
2.2.2 Database searching and MS/MS-based protein identification
2.3 Proteogenomics
2.3.1 Goals of proteogenomics
2.4 Ribosome profiling
2.4.1 An emerging technique
2.4.2 Outline of a ribosome profiling experiment
2.4.3 Applications of ribosome profiling
2.5 The PROTEOFORMER pipeline
2.5.1 Overview of the pipeline
2.5.2 Achievements
3 Material and methods
3.1 Hardware
3.2 Software
3.2.1 SSH clients
3.2.2 Python
3.2.3 SQLite
3.2.4 Python DB-API
3.2.5 STAR
3.2.6 Variant caller
3.2.7 SearchGUI and PeptideShaker
3.2.8 Galaxy
3.3 Public databases
3.3.1 Ensembl annotation bundle
3.3.2 dbSNP
3.3.3 Sequence read archives
3.3.4 PRIDE
3.4 Custom database creation
3.4.1 Mapping
3.4.2 Transcript calling
3.4.3 TIS calling
3.4.4 Variant calling
3.4.5 Translation assembly
3.5 Validation of the renewed PROTEOFORMER pipeline with proteomics data
4 Results
4.1 Custom database creation
4.1.1 Mapping statistics
4.1.2 Mapping quality control
4.1.3 Discovered variants
4.1.4 Translation assembly
4.2 Validation of the renewed PROTEOFORMER pipeline with proteomics data
4.2.1 Exploration of the effect of INDELs on the identification process
4.2.2 Exploration of the effect of SNPs on the identification process
5 Discussion
6 Conclusion
7 Further research
Bibliography

List of abbreviations

APC adenomatous polyposis coli

API application programming interface

A-site aminoacyl site

aTIS alternative translation initiation site (general)/annotated start site (PROTEOFORMER)

ATP adenosine triphosphate

BAM Binary Alignment/Map

BCF Binary Variant Call Format

BRAF B-Raf proto-oncogene serine/threonine kinase

cDNA complementary DNA

CDS coding sequence

CHX cycloheximide

COFRADIC combined fractional diagonal chromatography

COSMIC Catalogue Of Somatic Mutations In Cancer

cRAP common Repository of Adventitious Proteins

CWI Center for Mathematics and Computer Science

DB-API database application programming interface

dbSNP Single Nucleotide Polymorphism Database

dbTIS database annotated TIS

DDBJ DNA Databank of Japan

DNA deoxyribonucleic acid

dNTP deoxynucleotide triphosphate

dTIS downstream TIS

EMBL-EBI European Bioinformatics Institute

ENA European Nucleotide Archive

E-site exit site

EST expressed sequence tag

GATK Genome Analysis Toolkit

GRC Genome Reference Consortium

HARR harringtonine

HCT116 human colorectal tumour 116 cell line

INDEL INsertion and DELetion

INSDC International Nucleotide Sequence Database Collaboration

K lysine

KRAS Kirsten rat sarcoma viral oncogene homolog

LC liquid chromatography

LC-MS/MS liquid chromatography followed by tandem mass spectrometry

LTM lactimidomycin

mESC mouse embryonic stem cells

MMS19 methyl-methanesulfonate sensitivity 19

mRNA messenger RNA

MS mass spectrometry

MS/MS tandem mass spectrometry

MTHFD1 methylenetetrahydrofolate dehydrogenase 1

NCBI National Center for Biotechnology Information

NGS next-generation sequencing

NHGRI National Human Genome Research Institute

OMSSA Open Mass Spectrometry Search Algorithm

ORF open reading frame

PacBio Pacific Biosciences

PCR polymerase chain reaction

PRIDE Proteomics Identifications database

PSF Python Software Foundation

P-site peptidyl site

PSM peptide to spectrum match

QC quality control

R arginine

RIBO-seq ribosome profiling

RNA ribonucleic acid

RNA-seq RNA sequencing

RPF ribosome protected fragment

rRNA ribosomal RNA

SA suffix array

SAM Sequence Alignment/Map

SMAD4 mothers against decapentaplegic homolog 4

SAAV single amino acid variant

SMRT single molecule, real-time sequencing technology

sn(o)RNA small nuclear and nucleolar RNA

SNP Single Nucleotide Polymorphism

SOLiD Sequencing by Oligo Ligation Detection

sORF small open reading frame

SQL Structured Query Language

SRA Sequence Read Archive

SSH Secure Shell

STAR Spliced Transcripts Alignment to a Reference

TCGA The Cancer Genome Atlas

TELO2 telomere maintenance 2

TIS translation initiation site

TP53 tumour protein 53

TrEMBL translated EMBL

tRNA transfer RNA

UniProt Universal Protein Resource

uORF upstream open reading frame

3' UTR 3' untranslated region

5' UTR 5' untranslated region

VCF Variant Call Format

Abstract

Many recent breakthroughs in genomics, proteomics and transcriptomics studies are due to

the development of next-generation sequencing (NGS) techniques. An important

novel NGS technique is termed ribosome profiling and is based on the deep sequencing of

ribosome-protected fragments. By studying the mRNA fragments captured by translating

ribosomes, the actual translation status in vivo can be assessed. In combination with MS/MS,

ribosome profiling has become an important aid in the protein identification process by

building a more comprehensive protein sequence database. Currently, this database can be

generated by using a new automated pipeline developed by researchers at the BioBix lab: the

PROTEOFORMER pipeline.

This master's thesis covers the improvement of the protein identification process in human cancer cells in a proteogenomics setting combining ribosome profiling and mass spectrometry data. Previous studies have shown that proteins in these cells often carry SNPs (Single Nucleotide Polymorphisms) and INDELs (INsertions and DELetions). Therefore, adding these genetic variants to the protein database is expected to have a positive effect on the protein identification rate. The inclusion of SNPs in the PROTEOFORMER pipeline has already proven successful. Similarly, an improvement of the overall protein identification process is expected when INDEL information is included. This hypothesis was evaluated by comparing the protein identification rate using custom search databases with and without INDELs in the MS/MS identification process. To this end, INDELs derived from ribosome profiling data were included. This led to the validation of one protein which could not be identified with custom databases lacking this information. Although this shows that the inclusion of INDELs had only a small positive effect on the protein identification rate, a further increase is expected in the near future by providing extra INDEL information from variant databases and by using better INDEL detection methods.

Finally, the effect of the inclusion of INDEL information on the protein identification rate

was compared with the effect of SNPs. Although only one SNP was validated with

MS/MS data, more than 100 SNP-containing proteins were identified, either on their own or within a

protein group. These results show the importance of well-constructed custom databases tailored

to the study in question.

Short summary

Several breakthroughs in genomics, proteomics and transcriptomics studies can be attributed to the development of new sequencing techniques. An important example is ribosome profiling. This emerging technique is based on the sequencing of ribosome-protected mRNA fragments. By studying the mRNA fragments present in active ribosomes, the actual translation status in vivo can be assessed. In combination with tandem mass spectrometry, ribosome profiling can play an important role in the protein identification process within a proteogenomics study, in particular in the development of a more comprehensive protein sequence database. Currently, this database can be generated with a new, automated pipeline developed by the researchers of the BioBix lab: the PROTEOFORMER pipeline.

This master's thesis covers the improvement of the protein identification process in human cancer cells. Previous studies have shown that proteins are often rich in SNPs (Single Nucleotide Polymorphisms) and INDELs (INsertions and DELetions). Adding these genetic variants is presumed to have a positive effect on the number of identified proteins. The success of including SNPs in the PROTEOFORMER pipeline has already been demonstrated. However, we expect an additional improvement of the protein identification procedure when INDELs are included as well. This hypothesis was examined by evaluating the protein identifications obtained with custom-generated databases with and without INDELs. For this purpose, INDELs derived from ribosome profiling data were used. This led to the validation of one protein that could not be identified with the generated databases lacking this information. Although this shows a small, positive effect of INDELs on protein identification, a larger increase is expected in the near future by supplying extra INDEL information from databases and by using better INDEL detection methods.

Finally, the effect of INDELs on protein identification was compared with the effect of SNPs. Although only one SNP was validated with MS/MS data, more than 100 SNP-containing proteins were identified, either on their own or as part of a protein group. These results show the importance of databases tailored to the study.

1 Introduction and outline

Protein identification plays an important role in many life science studies. It is mostly based on mass spectrometry (MS) and relies on the availability of annotated protein sequences in public databases. Therefore, the identification process is limited to the detection of known proteins. Currently, researchers are trying to address this problem by creating custom protein sequence databases based on genomic and transcriptomic information. Hence, novel technologies targeting nucleotide information at the translational level are being developed. This fusion of proteomics with genomics and transcriptomics has led to a new research area, called proteogenomics.

Ribosome profiling is a relatively new technique that enables the sequencing of ribosome-protected mRNA fragments. In this way, a snapshot of the translational situation in vivo can be obtained. This information is already used as the basis for the construction of custom protein sequence databases. In order to facilitate this process, an automated pipeline was designed: the PROTEOFORMER pipeline. It has already been shown that this pipeline increases the overall protein identification rate in MS-based experiments. However, the pipeline is still under continuous improvement.

In this master's thesis, an attempt is made to improve the MS-based protein identification strategy for human cancer samples. Since proteins in these samples can be strongly affected by the presence of SNPs (Single Nucleotide Polymorphisms) and INDELs (INsertions and DELetions), the influence of introducing this information into the protein identification pipeline is studied. Previously, SNP information was added to the pipeline, resulting in an overall increase in the protein identification rate. Here, the focus is mainly on the addition of INDELs to the pipeline. In general, the inclusion of INDEL information is expected to have a positive effect. In addition, the effect of INDELs is compared with the effect of SNPs by analyzing the identification rates of custom databases with and without these variants, based on matching ribosome profiling data from human colorectal cancer cells.

In the following chapter, MS-based proteomics, proteogenomics, ribosome profiling and the

PROTEOFORMER pipeline are explained in more depth. Moreover, an introduction into

NGS-based technologies is given as these technologies are frequently used in proteogenomics

studies. Chapter 3 focuses on the practical aspects of this master's thesis by describing the

materials and methods used. Afterwards, the results are given in chapter 4 and discussed in chapter

5. Chapter 6 gives a general conclusion. Finally, this master thesis ends with a glimpse into

the future of the PROTEOFORMER pipeline in chapter 7.

2 Overview of relevant literature

2.1 NGS-based technologies

Since the discovery of deoxyribonucleic acid (DNA) in 1869 by Johann Friedrich Miescher,

DNA has been the center of attention in a lot of scientific studies [1]. However, it took almost

100 years before the first sequencing operations were carried out. Sequencing covers every

experiment in which the order of nucleotides in a DNA or ribonucleic acid (RNA) molecule

is determined. Initially, sequencing methods, referred to as first-generation sequencing

methods, were slow and confined to short sequences only [2]. Later on, new insights and

improvements gave rise to high-throughput sequencing technologies: next-generation

sequencing-based technologies (NGS-based technologies). Nowadays, multiple NGS-based

technologies are commercially available. Illumina sequencing by synthesis, SOLiD

(Sequencing by Oligo Ligation Detection) and 454 sequencing are probably the most popular

techniques [3]. In general, these methods all start with the fragmentation of the sample nucleic

acids into smaller pieces and the ligation of adaptor sequences needed for the amplification

and sequencing steps. Between different sequencing technologies, the amplification methods

are often shared but the actual sequencing procedures are quite different [3,4].

The oldest commercially available NGS technology is 454 sequencing. It is based on the production of light during the conversion of luciferin to oxyluciferin by the enzyme luciferase. This reaction can only occur when a complementary base is added to the growing strand: incorporation releases pyrophosphate, which is converted into adenosine triphosphate (ATP), the substrate that luciferase needs to produce light. By adding one type of deoxynucleotide triphosphate (dNTP) at a time, the nucleotide sequence can be deduced. This process is also called pyrosequencing [3,5,6].
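The principle of adding one dNTP at a time can be illustrated with a minimal Python sketch. It assumes an idealised, noise-free light signal that is proportional to the number of bases incorporated per flow. The flow order TACG and the function names are assumptions made for illustration and are not taken from the cited sources.

```python
def flowgram(read, flow_order="TACG", max_flows=400):
    """Simulate the light signal per dNTP flow for the strand being synthesised.

    Each flow offers one type of dNTP. The signal equals the number of
    consecutive bases of that type that can be incorporated (0 means no light).
    """
    signals = []
    pos = 0
    for n in range(max_flows):
        if pos >= len(read):
            break
        base = flow_order[n % len(flow_order)]
        run = 0
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def decode(signals):
    """Reconstruct the sequence from (base, signal) pairs."""
    return "".join(base * run for base, run in signals)


signals = flowgram("TTAGC")
print(signals)          # [('T', 2), ('A', 1), ('C', 0), ('G', 1), ('T', 0), ('A', 0), ('C', 1)]
print(decode(signals))  # TTAGC
```

In this idealised model a homopolymer such as TT yields a signal of 2 in a single flow. In practice the signal has to be measured and rounded, so long homopolymers are harder to call than this sketch suggests.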

SOLiD and Illumina sequencing are both based on the detection of fluorescent nucleotides. They differ in the number of nucleotides attached during each detection cycle. The SOLiD system relies on the incorporation of a probe containing eight bases with a fluorescent group attached to the last base. By using four different fluorescent dyes, colour detection can reveal the last base. The use of a ladder primer set makes it possible to determine the sequence after five sequencing cycles [3,5]. In contrast with SOLiD, Illumina is based on the incorporation of fluorescently labelled mononucleotides. By adding a blocker to the end of these nucleotides, the synthesis process can be carefully controlled, allowing the detection of one single nucleotide per cycle [3,5].

Further improvements in sequencing technologies have led to a new generation of techniques: the third generation. Sequencing technologies belonging to this generation are characterized by the ability to sequence longer fragments than the NGS-based technologies described above. Moreover, no amplification step is required because single molecules can be sequenced. Two popular third-generation sequencing techniques are single molecule, real-time (SMRT) sequencing provided by Pacific Biosciences (PacBio) [7,8] and Oxford Nanopore sequencing [9]. Despite their promising features, they are not yet routinely used in the commercial industry.

2.2 MS-based proteomics

2.2.1 Outline of MS-based proteomics experiments

Proteomics refers to the study of the total protein content of a species, also called its proteome. It involves the identification and quantification of the entire protein load present in the species of interest [10,11]. A fundamental part of this research is shotgun proteomics. This methodology aims at the identification of proteins in a complex protein mixture by studying the corresponding peptides after enzymatic cleavage [10,11]. Figure 1 gives an overview of the general aspects. A common shotgun proteomics experiment starts with the enzymatic cleavage of the proteins into peptides (trypsin is used in most cases). Afterwards, these peptides are separated using liquid chromatography (LC). The eluted peptides are then ionized before they enter the mass spectrometer [12,13,14]. In mass spectrometry (MS), peptide ions are separated based on their mass-to-charge ratio [11,15]. The desired proteomic data are obtained by repeating the MS step on the fragments of each selected peptide ion [13], resulting in the final tandem mass spectrometry (MS/MS) spectra. The described combination of liquid chromatography and tandem mass spectrometry is abbreviated as LC-MS/MS [16].

Figure 1: Practical aspects of LC-MS/MS. In general, every shotgun proteomics experiment starts

with the enzymatic digestion of the proteins of interest into peptides. These peptides are then separated

in a liquid chromatography step and afterwards subjected to tandem mass spectrometry. Adapted from

[12].

2.2.2 Database searching and MS/MS-based protein identification

Database searching of the resulting MS/MS spectra gives rise to the identity of the examined

peptides and thus also their corresponding proteins. This process is illustrated in figure 2. It

consists of two consecutive steps, namely the peptide and the protein identification process

[17,18].

Peptide identification

The identification of the peptides, obtained after protein digestion, is based on the correlation between experimental and theoretical MS/MS fragmentation spectra. The latter are obtained by subjecting a chosen protein sequence database to in silico MS/MS pattern prediction algorithms. In this prediction process, proteins are virtually digested into peptides using enzymatic cleavage rules. For example, when using the trypsin cleavage rule, cleavage is assumed to occur C-terminally of every lysine or arginine [19]. Next, the associated mass-to-charge ratios are calculated for all the cleavage products. By comparing these theoretical ratios with the experimental mass-to-charge ratios, a candidate peptide list is generated. This list contains all the theoretical peptides whose mass-to-charge ratios match an experimental one. Subsequently, these candidate peptides are fragmented in silico into smaller ions. Finally, the resulting theoretical fragmentation spectra are matched with the experimental MS/MS spectra, resulting in a statistical similarity score. Only peptides associated with a high score are retained and added to a final list of matching peptides [14,18].
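The steps of this peptide identification process can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm of any particular search engine: the cleavage rule is the simple trypsin rule given above, only singly charged b- and y-type fragment ions are generated, the tolerances are arbitrary, and a plain count of shared peaks stands in for the statistical similarity score. All function names are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added to the summed residue masses of a peptide
PROTON = 1.007276   # mass of one charge carrier


def tryptic_digest(protein):
    """Cleave C-terminally of every lysine (K) or arginine (R)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mz(peptide, charge=1):
    """Theoretical mass-to-charge ratio of a peptide ion."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge


def candidate_peptides(observed_mz, charge, peptides, tol=0.01):
    """Theoretical peptides whose m/z lies within tol of the observed m/z."""
    return [p for p in peptides if abs(peptide_mz(p, charge) - observed_mz) <= tol]


def fragment_ions(peptide):
    """m/z values of the singly charged b- and y-ions of a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b_ions + y_ions


def shared_peak_count(theoretical, experimental, tol=0.02):
    """Number of theoretical peaks that have an experimental peak within tol."""
    return sum(any(abs(t - e) <= tol for e in experimental) for t in theoretical)


peptides = tryptic_digest("MKTAYIAKQRQISFVK")
print(peptides)  # ['MK', 'TAYIAK', 'QR', 'QISFVK']
```

A candidate list for one spectrum would then be obtained with `candidate_peptides(observed_mz, charge, peptides)`, after which each candidate is scored with `shared_peak_count(fragment_ions(candidate), experimental_peaks)`.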

Figure 2: Overview of the protein identification process using a database searching approach. Proteins

can be identified by comparing the experimental MS/MS spectra with theoretical spectra. The latter

are obtained by subjecting a protein sequence database to an in silico MS/MS pattern prediction

algorithm [14].

Protein identification

The obtained information from the peptide identification process is brought together in a

second step resulting in a list of possible proteins present in the sample of interest. This list

will be subjected to a statistical analysis in order to select the true protein identifications [18].

The overall procedure can be hampered by the protein inference problem which refers to the

difficulty of extending information from the peptide to the protein level. This may be a result

of large-scale sequence homology between the proteins present, because digestion of these

proteins can give rise to identical peptides. In the absence of unique peptide sequences,

discrimination between these proteins is not possible. This problem can be taken into account

by collecting the indiscernible proteins in a so-called protein group. In this way, statistical

analysis can reveal the presence of a certain group of proteins without knowing the presence

of specific proteins within that group [20].
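The grouping of indiscernible proteins can be sketched in Python as follows. This minimal illustration assumes that proteins are indiscernible when they are supported by exactly the same set of identified peptides. The function name and the input layout (a mapping from each identified peptide to the proteins containing it) are hypothetical, and real tools additionally handle subset relations and the statistical analysis.

```python
from collections import defaultdict


def protein_groups(peptide_to_proteins):
    """Group proteins that are supported by an identical set of peptides."""
    # Invert the mapping: collect the identified peptides per protein.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins sharing exactly the same peptide set end up in one group.
    groups = defaultdict(list)
    for protein, peptide_set in protein_to_peptides.items():
        groups[frozenset(peptide_set)].append(protein)
    return sorted(sorted(group) for group in groups.values())


identified = {
    "TAYIAK": ["P1", "P2"],   # shared by two homologous proteins
    "QISFVK": ["P1", "P2"],
    "LLDEGR": ["P3"],         # unique to P3
}
print(protein_groups(identified))  # [['P1', 'P2'], ['P3']]
```

Here P1 and P2 cannot be told apart because no identified peptide is unique to either of them, so they are reported together as one protein group.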

2.3 Proteogenomics

Proteogenomics is a relatively new term describing the fusion of proteomics and genomics

[12,21,22]. This method made its debut in the scientific research community in 2004 [12] and

has been evolving since then. Initially, proteogenomics comprised the use of proteomic

information to support genome annotation [12,22]. Genome annotation follows genome

sequencing and aims to delineate the genes and determine their biological

function [16]. Conventionally, this endeavour was pursued by the genomic research

community, whereas proteomic researchers were focusing on the actual protein level.

Nowadays, proteomics and genomics are not stand-alone research fields anymore, but

perform a joint effort in improving the current genome annotation [12].

2.3.1 Goals of proteogenomics

Since the beginning of proteogenomics, many studies have been dedicated to improvements in the

genome annotation and the protein identification process. Therefore, these two applications

will be explained in more detail.

Aid genome annotation

As proteogenomics studies were developed with a focus on genome annotation, this is still

one of the most important applications. Traditionally, protein-coding genes were located

using sequence similarity or ab initio gene finding predictions. These annotation processes,

however, are rather theory-based and do not provide enough evidence to call genes flawlessly

[16]. Validation of the predicted genes is therefore extremely important. Mass

spectrometry-based experiments provide a way to confirm the existence of protein-coding

genes and hypothetical proteins with translational evidence. Mass spectrometry-based

proteomics data can also be used to find new protein-coding genes by using the identified

peptide sequences to search genomic databases. In this way, proteogenomics studies can

reduce the error rate of current and previous annotation studies, which emphasizes its

importance in genome annotation and re-annotation [16,23,24].

Discover new proteoforms

Along with the progress in genome annotation, proteogenomics has also become a key player

in the process of protein identification and finding new proteoforms. The term 'proteoforms'

refers to all protein products derived from the same gene [25]. For example, proteins might

have proteoforms containing a longer N- or C-terminus resulting in the so-called N-terminally

and C-terminally extended proteoforms. Similarly, proteoforms with shorter N- or C-termini

are called N-terminally or C-terminally truncated proteoforms [26,27].

Traditionally, proteins were identified with the aid of reference protein sequence databases like Swiss-Prot [28]. As explained earlier, these databases are subjected to in silico MS/MS so that experimentally obtained MS/MS spectra can be matched with their corresponding peptides. Unfortunately, this limits the identification procedure to peptides already present in the consulted reference database. Researchers are trying to address this problem by generating customized protein sequence databases using genomic and transcriptomic information, which enlarges the search space in the field of interest. This can be of particular importance when studying disease-associated proteins, which may contain one or more genetic variations. Likewise, completely new peptides associated with novel coding regions are more likely to be found with a customized database [12,22,29]. So far, many new proteins have been discovered with the aid of proteogenomics studies. These include alternative splice variants [30] and peptides derived from small open reading frames (sORFs) [31].
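Why genetic variation calls for a customized database can be illustrated with a small Python sketch. It applies a single variant to a coding sequence and translates the result with the standard genetic code. The toy sequence, the 0-based coordinates and the function names are assumptions made for illustration, and the sketch does not represent how any specific pipeline builds its database.

```python
from itertools import product

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence until the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def apply_variant(cds, pos, ref, alt):
    """Replace ref by alt at 0-based position pos (covers SNPs and INDELs)."""
    if cds[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return cds[:pos] + alt + cds[pos + len(ref):]


reference = "ATGGCTAAAGGTTGGCGTTAA"                 # toy coding sequence
snp = apply_variant(reference, 10, "G", "A")         # GGT -> GAT
insertion = apply_variant(reference, 3, "G", "GC")   # one-base insertion

print(translate(reference))  # MAKGWR
print(translate(snp))        # MAKDWR
print(translate(insertion))  # MA
```

The SNP changes a single amino acid, whereas the one-base insertion shifts the reading frame and, in this toy example, introduces a premature stop codon. Neither variant sequence is present in a reference database, so peptides covering these positions can only be matched when the variant proteins are added to the search space.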

2.4 Ribosome profiling

2.4.1 An emerging technique

Next to genomics and proteomics, translatomics has been a very important research area over the last few years. Initially, information on the translational level was derived from measurements at the expression level by evaluating mRNA (messenger RNA) abundance using microarrays. Later on, mRNA sequencing became more and more popular. However, due to extensive regulation at the mRNA level and the occurrence of events like initiation at non-AUG start codons, it is impossible to determine the exact protein levels and their identity from the detected mRNA levels. With the advent of translation inhibitors, new approaches to monitor protein information at the translational level were developed [32,33].

Initially, polysome profiling [34] was used to monitor protein synthesis. A polysome refers to the presence of multiple ribosomes occupying an mRNA transcript. In polysome profiling, these mRNAs are studied after isolation of the polysome fraction. Unfortunately, this technique has limited resolution and cannot be used to reveal the exact ribosome positions [32,33]. These shortcomings have led to the introduction of ribosome profiling (also called RIBO-seq) by Ingolia et al. This emerging technique is based on the deep sequencing of ribosome-protected fragments (RPFs). By only studying the mRNA fragments captured in the ribosomes, the actual translation status in vivo can be assessed. Moreover, the ability to halt active ribosomes enables the determination of the ribosomal positions with single-nucleotide precision. Up to now, these advantages have made the technique very popular in the process of genome annotation by improving the identification of novel protein-coding genes [32,33,35].

2.4.2 Outline of a ribosome profiling experiment

The ribosome profiling strategy consists of a few main steps. These steps are presented in

figure 3 and are explained below.

Figure 3: Overview of the major steps in ribosome profiling. Ribosome profiling begins with the

release of RNA from cells treated with translation inhibitors. After a subsequent RNA digest, which

degrades unbound RNA, the resulting RNA-ribosome-protected fragment (RPF) complexes are

isolated with a centrifugation technique. The RPFs are then separated from the ribosomes and purified

using gel electrophoresis. In a following step, the isolated RPFs are linked with adaptors, amplified

and sequenced. An intermediate rRNA depletion step makes sure that the remaining rRNA molecules

are degraded. Finally, the sequenced RPFs are mapped to a reference sequence [36].

Every RIBO-seq experiment starts with lysing the cells of interest. In order to take a

representative snapshot of the translational situation, halting the active ribosomes before the

actual lysis is necessary. This pausing step is introduced to avoid alterations of the ribosomal

positions during and after lysis as a result of the rapid translational progress. Different

approaches based on translation inhibitors, including cycloheximide (CHX), lactimidomycin

(LTM) and harringtonine (HARR), are available [35,36,37,38]. Their purposes will be

explained in section 2.4.3.

After cell lysis, nuclease digestion ensures the degradation of RNA that is not protected by

ribosomes. The resulting RNA-ribosome complexes contain RPFs of circa 28 nucleotides,

also termed footprints. A subsequent separation step based on sucrose density fractionation

isolates these footprint-bound ribosomes from remaining cell residuals. In a consecutive step,

the footprints are recovered from the ribosomal complexes and purified by gel

electrophoresis. Deep sequencing of these footprints is possible after the construction of a

derived DNA-library. This includes the attachment of linkers that provides the priming site

needed for reverse transcription followed by circularization of the products [37,38]. Next, the

contamination with ribosomal RNA (rRNA) has to be taken into consideration. Therefore,

rRNA is optionally depleted by hybridization with rRNA probes which are later on removed

due to their affinity for streptavidin [38]. Following a polymerase chain reaction (PCR), the

resulting fragments are then ready for deep sequencing. Finally, the sequenced fragments can

be analyzed by aligning the obtained reads to a reference sequence [35,37,38].

2.4.3 Applications of ribosome profiling

Despite the fact that ribosome profiling is a relatively young technique, it has already been

used in a number of different studies demonstrating the large range of applications of this

technology. Generally, two different approaches can be taken. The first one relies on the

use of elongating ribosomes whereas the second one makes use of initiating ribosomes [35,38].

Applications of elongating ribosomes

Elongating ribosomes can be halted with the aid of flash-freezing [38,39] or by using

antibiotics acting on the translocation activity. A well-known and widely used antibiotic is

cycloheximide (CHX). This compound blocks the elongation of translation in

eukaryotic cells through the obstruction of the exit site of the 60S ribosome subunit [36,40,41].

The mRNA sequences restrained in these halted ribosomes provide a massive amount of

information leading to a better (re)-annotation of the genome [38].

Identification of small open reading frames

Ribosome profiling can reveal the existence of small open reading frames (sORFs). These

open reading frames (ORFs) are defined as ORFs smaller than or equal to 300 bases. Due to

their small sizes, sORFs are often excluded by automated gene annotation algorithms.

Likewise, MS-based proteomics and RNA-seq (RNA sequencing) based transcriptomics

experiments are not effective in revealing new sORFs. The discovery of new

sORFs only became feasible with the introduction of ribosome profiling. The small peptides

derived from sORFs are called micropeptides and can have important functions [42].

Transporter proteins, transcription factors and hormones are just a few examples of the

enormously diverse functionality of these micropeptides [43].
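
The size criterion above can be illustrated with a minimal Python sketch. It scans the forward strand of a single sequence for ORFs of at most 300 bases. The function name, the restriction to ATG start codons and the toy sequence are assumptions made for illustration only; the sketch does not use any ribosome profiling evidence and is not the method of the cited studies.

```python
# Sketch: list ORFs of at most 300 bases on the forward strand of one sequence.
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_SORF_LENGTH = 300  # bases, the sORF size limit quoted in the text


def find_sorfs(sequence, start_codon="ATG", max_length=MAX_SORF_LENGTH):
    """Return (start, end, orf_sequence) tuples for ORFs no longer than max_length.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    sequence = sequence.upper()
    sorfs = []
    for start in range(len(sequence) - 2):
        if sequence[start:start + 3] != start_codon:
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start <= max_length:
                    sorfs.append((start, end, sequence[start:end]))
                break
    return sorfs


example = "CCATGGCTTGAAATGA"  # hypothetical toy sequence
for start, end, orf in find_sorfs(example):
    print(start, end, orf)
```

In practice such a sequence-only scan yields many candidates, which is why experimental evidence of translation is needed to decide which sORFs are actually used.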

Studying protein translation on a quantitative level

Ribosome profiling also provides the possibility to study the expression of different genes on

a quantitative scale. Each gene transcribes its DNA information into mRNA sequences. These

sequences will go through the translational process due to ribosomal activity. Every mRNA

transcript bound to a ribosomal complex is considered as being translated. In other words,

every RPF that is picked up during the experiment represents a translating ribosome.

Therefore, the number of RPFs resembles the number of translating ribosomes for the

corresponding transcript and is proportional to the level of protein that is produced. By taking

into account both the amount of RPFs and the time required to finish a protein, the production

rate of the proteins can be determined. Hence, RPF density can be used to quantify gene

expression and measure the speed of protein synthesis [35,38]. Also, by measuring the

translation products and comparing it to the mRNA abundance on transcriptional level in

RNA-seq, new insights into the translational regulation can be gained [33,38].
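
A common way to express this comparison is a translation efficiency value, the ratio of footprint density to mRNA density over the same coding sequence. The sketch below assumes RPKM normalisation and hypothetical read counts; the cited studies may normalise differently.

```python
def rpkm(read_count, feature_length, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length * total_mapped_reads)


def translation_efficiency(rpf_count, mrna_count, cds_length,
                           total_rpf_reads, total_mrna_reads):
    """Ratio of footprint density to mRNA density over the same coding sequence."""
    rpf_density = rpkm(rpf_count, cds_length, total_rpf_reads)
    mrna_density = rpkm(mrna_count, cds_length, total_mrna_reads)
    return rpf_density / mrna_density


# Hypothetical counts for one gene in a RIBO-seq and a matching RNA-seq library.
te = translation_efficiency(rpf_count=200, mrna_count=100, cds_length=1000,
                            total_rpf_reads=1e6, total_mrna_reads=1e6)
print(te)
```

A value above one suggests that a transcript carries more ribosomes than expected from its abundance alone, and a value below one suggests the opposite.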

Discovering genetic variation

Ribosome profiling can be of particular importance in disease-focused studies due to the

inclusion of genetic variation information within ORFs. Single Nucleotide Polymorphisms

(SNPs) are the most common variants in the human population. SNPs represent single base

positions in the genome that differ between individuals. Some people will, for

example, carry a cytosine at a specific genomic location while others have an adenine at this

position [44].

Next to SNPs, insertions (addition of one or more nucleotides) and deletions (removal of one

or more nucleotides) can occur. A lot of diseases are already linked with specific SNPs and

INDELs (INsertions and DELetions). Well-known examples are sickle-cell anaemia (caused by

a SNP [45]), cystic fibrosis (caused by a deletion [46]) and Tay-Sachs disease (caused by an

insertion [47]).

Genetic variation also plays a key role in cancer research. Due to the presence of

extensive protection mechanisms against cancer, multiple genes have to undergo mutagenesis

to initiate the disease [48,49]. Moreover, the alterations that lead to oncogenesis have to

occur in specific classes of genes: oncogenes, tumour-suppressor genes and stability genes.

Oncogenes and tumour-suppressor genes are related to the cell-cycle control. By stimulating

cell division (due to activated oncogenes) and inhibiting cell-cycle control (due to inhibited

suppressor genes), mutations in these genes can cause a rapid increase in the number of cells.

Stability genes, on the other hand, play an important role in controlling the presence of

mutations by trying to repair the genetic alterations. Inactivation of these genes suppresses

this control mechanism resulting in a higher mutation rate [49]. Both SNPs and INDELs are

often present in cancer-related genes [48,50]. These variations can be detected with

NGS-based technologies, including ribosome profiling, making these very useful techniques

in cancer studies [51,52].

Applications of initiating ribosomes

Ribosomes can also be stalled solely during the initiation stage. Halting initiating ribosomes

can be achieved by using harringtonine (HARR) or lactimidomycin (LTM). LTM inhibits

translation initiation by binding to the exit site of the 80S complex whereas HARR halts the

ribosomes by binding to the acceptor site of the free 60S ribosome subunit [38,40,53,54].

Both compounds are useful in the identification of translation start sites. However, LTM can

be preferred over HARR, as treatment with the latter results in a substantial amount of mRNA

fragments mapping downstream of the translation start site, lowering the accuracy of the

experiment [40].

Detection of alternative translation initiation sites and their resulting proteoforms

The translational process in eukaryotes is a well-known phenomenon. Both the initiation and

the elongation process have been thoroughly studied over the years. Generally speaking,

translation initiation is triggered when the ribosomal complex scans an AUG codon. In the

case of efficient recognition, the first AUG will serve as a translation initiation site [40].

However, inefficient recognition, due to factors like the absence of necessary secondary

structures in the mRNA near the initiation site, may lead to leaky scanning. In this way, more

efficient translation initiation sites (TISs) are searched in the region downstream of the initial

TIS [40,55].

More recently, it was discovered that translation initiation also occurs at non-AUG

codons differing in one nucleotide with the cognate start codon. These near-cognate start sites

are denoted as alternative translation initiation sites or aTIS. Because in silico sequence

analysis alone fails to identify these aTIS, it was not possible to predict new ones

accurately. However, with the introduction of ribosome profiling as a way to define ribosomal

positions, this problem was solved. Using specific translation initiation inhibitors, translation

initiation sites can be determined with single nucleotide resolution. In this way both TIS and

aTIS are accurately predicted [40].
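
The set of near-cognate codons follows directly from the definition above: every codon that differs from AUG at exactly one position. The short Python sketch below enumerates them. The function name is chosen for illustration only.

```python
def near_cognate_codons(cognate="AUG", alphabet="ACGU"):
    """Return all codons that differ from the cognate start codon at exactly one position."""
    codons = []
    for position in range(len(cognate)):
        for base in alphabet:
            if base != cognate[position]:
                codons.append(cognate[:position] + base + cognate[position + 1:])
    return codons


print(near_cognate_codons())
```

With three positions and three alternative bases per position, this gives nine near-cognate codons, among them CUG and GUG.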

A significant fraction of these newly annotated TIS represents near-cognate start sites lying in

the regions upstream of the annotated coding sequence. These seem particularly important in

the generation of upstream open reading frames (uORFs): short translated sequences in the 5'

untranslated region (5' UTR) [32,39,56]. Moreover, some of these ORFs might have an

influence on the regulation of downstream ORFs by producing small peptides called

peptoswitches [57].

Alternative TIS are also responsible for the production of N-terminally extended and N-

terminally truncated proteoforms. The latter are produced by using downstream TIS. Both

extensions and truncations result in variants having different functions or carrying other

localisation signals [35,40].

2.5 The PROTEOFORMER pipeline

Recently a new tool creating a customized database has been developed by researchers of

BioBix in Gent. This tool was introduced under the name of PROTEOFORMER in 2014 and

is already publicly available as a stand-alone version and a Galaxy implementation. The

PROTEOFORMER tool consists of an automated pipeline of seven steps starting from

ribosome profiling data and ending in a protein sequence search database necessary for the

identification of proteins by MS/MS matching. Using RIBO-seq data rather than mRNA-seq

data, the PROTEOFORMER pipeline overcomes the problems of searching a big database

and provides evidence on translational level. Nowadays, the analysis of RIBO-seq data

derived from Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis

thaliana has been made available. The analysis of other species will be made available in the

future [52].

2.5.1 Overview of the pipeline

The PROTEOFORMER pipeline includes seven major steps: quality control, mapping,

transcript calling, TIS calling, variation calling, translation assembly and translation database

construction. These steps are depicted in figure 4. Besides these major steps, figure 4 also

contains the prior step of next-generation sequencing, needed to gain the input reads, and the

final peptide and protein identification step. A detailed description of these steps is given

below [52,58].

Figure 4: Overview of the PROTEOFORMER pipeline. The PROTEOFORMER pipeline starts from

RIBO-seq reads generated with NGS-based technologies. These reads are analyzed using seven

different steps including quality control, mapping of the reads to the reference genome, transcript

calling, TIS calling, variation calling, translation assembly and translation database construction. In a

final step the custom created database can be used in an MS-based peptide and protein identification

process [52].

Next generation sequencing

The PROTEOFORMER pipeline uses RIBO-seq data as input. This information is gathered

by sequencing the ribosome protected footprints using the Illumina approach. As previously

explained, Illumina sequencing is based on fluorophores and terminating groups attached to

the four different nucleotides. By matching each nucleotide with a specific fluorescent

emission signal, the incorporated nucleotide can be determined. The resulting reads derived

from the measured fluorescence and their estimated quality are collected in a FASTQ file. The

inclusion of a TIS calling algorithm in the pipeline requires ribosome profiling information on

initiating ribosomes. Hence, in addition to the RIBO-seq data of CHX-treated samples,

information concerning HARR- or LTM-treated samples is necessary. Consequently, two

FASTQ files are taken as input: one containing reads derived from elongating ribosomes and

one representing information about initiating ribosomes [52].
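
To make the FASTQ format concrete, the sketch below reads four-line FASTQ records and decodes the quality string into Phred scores. It assumes the Phred+33 encoding and unwrapped records (exactly four lines each); the read shown is a made-up example, and real files are better handled by an established parser.

```python
import io


def read_fastq(handle, phred_offset=33):
    """Yield (read_id, sequence, qualities) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # separator line starting with '+'
        quality_line = handle.readline().rstrip()
        # Each quality character encodes one Phred score (assumed offset: 33).
        qualities = [ord(char) - phred_offset for char in quality_line]
        yield header[1:], sequence, qualities


example = io.StringIO(u"@read1\nACGT\n+\nII5!\n")  # hypothetical record
for read_id, sequence, qualities in read_fastq(example):
    print(read_id, sequence, qualities)
```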

Quality control

The subsequent quality control of the collected FASTQ files forms a preliminary step to the

PROTEOFORMER pipeline. A suitable tool for the monitoring of FASTQ files and

determining the sequence quality is the FastQC quality control application developed by the

Babraham Institute. This application creates a quality control (QC) report that is very valuable

for the detection of irregularities or unexpected deviations from the predicted outcome of the

raw sequencing data [59]. Moreover, FastQC can also be applied on the aligned reads after

mapping. In this way, the effects of the preprocessing steps on the raw data can be verified.

These include adaptor clipping, sequence trimming and filtering on unwanted RNA.

Aside from the QC report, two other quality determinants are taken into account. The first one

is known as the metagenic functional classification wherein the species-specific annotation

from Ensembl is used. RPFs in accordance with protein-coding transcripts are classified in

5' UTR (5' untranslated region), exonic, intronic and 3' UTR (3' untranslated region)

annotation classes whereas RPFs corresponding with non-protein-coding transcripts are

classified as 'other biotypes'. The rest of the RPFs are considered intergenic [52]. The second

quality determinant is the gene distribution which visualizes the overall dynamic range of the

translation counts [52].

Mapping

As a first step in the pipeline, the raw RPFs should be aligned against a reference genome.

Two different mapping approaches can be used by the PROTEOFORMER pipeline [52]:

TopHat [60] and STAR (Spliced Transcripts Alignment to a Reference) [61]. Both these

splice-aware aligners start with constructing a genome index. The mapping against the

indexed genome of the species of interest is possible after filtering out the reads aligned to the

PhiX bacteriophage genome (resulting from the spiked-in sequences) and the reads aligned to

the species-specific rRNA (resulting from rRNA contamination specific to the ribosome

profiling protocol), tRNA (transfer RNA) and sn(o)RNA (both small nuclear and nucleolar

RNA) [52]. For all these filtering steps, the RPFs are mapped non-uniquely, meaning they are

allowed to map to more than one location. In contrast, the mapping against the reference

genome of the species of interest can be carried out uniquely, wherein the RPFs map to

at most one location, or non-uniquely.

After clipping the 3' end adaptor sequence, the reads are mapped to the different created

genome indices. The number of reads mapped to each of these indices will be calculated and

redirected to an SQLite table. The alignment information is also given in a BedGraph file

which enables the visualization of the results in a genome browser [52].

Transcript calling

In addition to the mapping procedure, translational evidence on the transcripts is collected by

calculating the number of footprints corresponding to every genomic location of the coding

sequence. Thereby the footprint coverage of every exon is determined. Transcripts having a

uniform exon coverage exceeding a certain threshold are marked as truly translated [52]. In

this way, Ensembl transcripts matching the experimental RIBO-seq data are identified.
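
The idea of an exon coverage threshold can be sketched as follows. The coverage definition (fraction of exon positions with at least one footprint), the rule that every exon must pass, and the threshold value are assumptions for illustration; the actual criteria of the pipeline are described in [52].

```python
def exon_coverage(exon_start, exon_end, footprint_positions):
    """Fraction of positions in [exon_start, exon_end] with at least one footprint."""
    covered = sum(1 for pos in range(exon_start, exon_end + 1)
                  if footprint_positions.get(pos, 0) > 0)
    return covered / float(exon_end - exon_start + 1)


def is_translated(exons, footprint_positions, min_coverage=0.1):
    """Call a transcript translated when every exon reaches min_coverage (assumed rule)."""
    return all(exon_coverage(start, end, footprint_positions) >= min_coverage
               for start, end in exons)


# Hypothetical footprint counts per genomic position and two exons of ten bases.
footprints = {10: 3, 11: 1, 25: 2}
exons = [(10, 19), (20, 29)]
print(is_translated(exons, footprints, min_coverage=0.1))
```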

SNP calling

Using the aligned reads of the previous step, SNPs can be uncovered by using the SAMtools

suite. SAMtools is a software package developed for the analysis of alignments in Sequence

Alignment/Map (SAM) and Binary Alignment/Map (BAM) formats [62]. Finding SNPs with

the mpileup module of SAMtools starts from a SAM or BAM file containing the alignment of

the reads on a reference genome. The mpileup command uses this info to generate a Binary

Variant Call Format (BCF) file containing genomic position information together with the

likelihoods of the potential genotypes [63,64]. Since mpileup requires a BAM file as input, a

SAM to BAM conversion is necessary [64]. The calling itself is done with BCFtools by using

the information in the BCF file and applying a Bayesian model. It can also convert BCF files

to Variant Call Format (VCF) files [63,65].
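
The sequence of steps described above can be written down as four command lines. The Python sketch below only builds these commands as argument lists; it does not run them. File names are hypothetical, and the choice of the multiallelic caller (-m) is an assumption, since the exact options used by the pipeline are not stated here.

```python
def snp_calling_commands(sam_path, reference_fasta, prefix):
    """Build the command lines for SAM-to-BAM conversion, mpileup and calling."""
    unsorted_bam = prefix + ".unsorted.bam"
    sorted_bam = prefix + ".sorted.bam"
    bcf = prefix + ".bcf"
    vcf = prefix + ".vcf"
    return [
        # 1. Convert the SAM alignment to BAM.
        ["samtools", "view", "-b", "-o", unsorted_bam, sam_path],
        # 2. Sort by coordinate, as mpileup expects position-sorted input.
        ["samtools", "sort", "-o", sorted_bam, unsorted_bam],
        # 3. Compute genotype likelihoods and write them as BCF.
        ["samtools", "mpileup", "-g", "-f", reference_fasta, "-o", bcf, sorted_bam],
        # 4. Call variant sites only (-v) and write the result as VCF.
        ["bcftools", "call", "-m", "-v", "-O", "v", "-o", vcf, bcf],
    ]


for command in snp_calling_commands("reads.sam", "genome.fa", "sample"):
    print(" ".join(command))
```

Each list could be passed to subprocess.check_call on a system where SAMtools and BCFtools are installed.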

TIS calling

As already mentioned before, the inclusion of antibiotics in the ribosome profiling protocol is

a key element in determining translation initiation sites. Both LTM and HARR can be used to

stall initiating ribosomes whereas CHX acts on elongating ribosomes. The parallel analysis of

the obtained LTM/HARR-treated and CHX-treated reads results in a peak calling approach

[40]. Herewith, LTM- or HARR-associated RPFs and CHX-associated RPFs are mapped to

transcript sequences contained within the Ensembl annotation bundle [52]. During this

procedure, reads of initiating ribosomes will be accumulated at the translation initiation start

sites, resulting in the so-called TIS peaks [40]. Only a subset of these peak positions

resembles actual TIS. These TIS can be detected using user-defined parameters concerning

the proximity of predicted TIS, their minimal associated number of reads and a value

representing the relative abundance of LTM-treated reads over CHX-treated reads at a single

nucleotide position [52].
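
The three user-defined parameters can be read as three successive filters on the LTM peaks. The sketch below is a simplified illustration under stated assumptions: the enrichment value is taken as the difference between the library-normalised LTM and CHX read fractions at a position, and all parameter names, default values and counts are hypothetical. It is not the implementation of the pipeline.

```python
def call_tis(ltm_counts, chx_counts, min_count=5, local_max=1, min_ratio=0.05):
    """Return positions whose LTM peak passes all three filters.

    ltm_counts and chx_counts map transcript positions to read counts;
    both are assumed to be non-empty.
    """
    ltm_total = float(sum(ltm_counts.values()))
    chx_total = float(sum(chx_counts.values()))
    called = []
    for position, count in sorted(ltm_counts.items()):
        # Filter 1: minimal number of reads at the peak.
        if count < min_count:
            continue
        # Filter 2: the peak must be the highest within +/- local_max positions.
        neighbours = [ltm_counts.get(position + offset, 0)
                      for offset in range(-local_max, local_max + 1) if offset != 0]
        if any(other > count for other in neighbours):
            continue
        # Filter 3: LTM enrichment over CHX, using library-normalised fractions.
        ratio = count / ltm_total - chx_counts.get(position, 0) / chx_total
        if ratio >= min_ratio:
            called.append(position)
    return called


ltm = {100: 50, 101: 10, 200: 30, 300: 3}   # hypothetical LTM read counts
chx = {100: 5, 150: 40, 200: 45, 250: 10}   # hypothetical CHX read counts
print(call_tis(ltm, chx))
```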

Translation assembly

The objective of the previous steps is the collection of the information derived from ribosome

profiling data. With the translation of RIBO-seq data into amino acid sequences in mind, both

TIS and SNP information as well as knowledge concerning transcript isoforms is gathered. In

addition to the information on proteoforms described above, the chromosome reference

sequence and the Ensembl annotation bundle are essential. Scanning these chromosome files

using a binary reading approach allows the exon sequences to be acquired quickly and easily. These

exon sequences are used to derive the desired protein sequences [52].
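
As a minimal illustration of this step, the sketch below substitutes SNPs into a coding sequence and translates it with the standard genetic code. It assumes that the exons have already been concatenated into one coding sequence starting at the TIS and that SNP positions are given relative to that sequence. All names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def apply_snps(cds, snps):
    """Return the coding sequence with each (0-based position, alt base) substituted."""
    bases = list(cds)
    for position, alt in snps:
        bases[position] = alt
    return "".join(bases)


def translate(cds):
    """Translate a coding sequence up to, and excluding, the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


reference = "ATGGAGGAGTAA"  # toy coding sequence
print(translate(reference))
print(translate(apply_snps(reference, [(4, "T")])))  # GAG -> GTG, Glu -> Val
```

Insertions and deletions cannot be handled by simple substitution, because they shift all downstream positions and may change the reading frame.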

Translation database

In order to use the assembled protein sequences in MS-based protein identification

experiments, these sequences need to be stored in a database. The PROTEOFORMER

pipeline generates a custom database in FASTA format. An important feature is the

non-redundancy of this database [52].
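
Non-redundancy can be obtained by grouping entries on their sequence before writing the FASTA file. The sketch below merges the identifiers of identical sequences into one header. The header layout and the identifiers are hypothetical and do not reproduce the format used by the pipeline.

```python
import io


def write_nonredundant_fasta(entries, handle, line_width=60):
    """Write (identifier, sequence) pairs, merging entries with identical sequences."""
    identifiers_by_sequence = {}
    order = []
    for identifier, sequence in entries:
        if sequence not in identifiers_by_sequence:
            identifiers_by_sequence[sequence] = []
            order.append(sequence)
        identifiers_by_sequence[sequence].append(identifier)
    for sequence in order:
        # Keep every identifier that shares this sequence in one header.
        handle.write(u">" + u"|".join(identifiers_by_sequence[sequence]) + u"\n")
        for start in range(0, len(sequence), line_width):
            handle.write(sequence[start:start + line_width] + u"\n")
    return len(order)


buffer = io.StringIO()
entries = [(u"t1", u"MEE"), (u"t2", u"MVE"), (u"t3", u"MEE")]
print(write_nonredundant_fasta(entries, buffer))
print(buffer.getvalue())
```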

MS-based proteomics/peptidomics

The final protein sequence database created with the PROTEOFORMER pipeline can be used

in the MS-based peptide and protein identification process. Therefore, database searching

engines are used to match the in silico generated MS/MS spectra, derived from peptides

within the database, with the experimentally acquired MS/MS spectra. This process is

explained in section 3.2.7 in more depth [52].

2.5.2 Achievements

The PROTEOFORMER pipeline was developed with a view to creating an optimal custom

protein database with the help of ribosome profiling information. The transition from publicly

available protein databases like Swiss-Prot to custom databases leads to a rise in the number

of protein identifications. Moreover, the inclusion of RIBO-seq data overcomes the problem

of an increasing database size by including only the translation products of one single reading

frame (as opposed to conversion of mRNA sequencing that generally performs a six-frame

translation) and focuses on the actual translation in vivo [52].

The benefits of using the resulting custom database in an MS/MS proteogenomics experiment

were demonstrated by comparing the generated custom databases with a peptide database

derived from Swiss-Prot [28]. This strategy was applied both to RIBO-seq data derived from

mouse embryonic stem cells (mESC) and the human colorectal tumour 116 cell line

(HCT116). The results of this validation experiment are shown in figure 5. The first two pie

charts in this figure visualize the increase of protein identification due to the implementation

of RIBO-seq data. Both cell lines contain proteins that were not part of the Swiss-Prot

database and could therefore not be detected before. Next to the identification of new

proteins, the protein score of a significant part of the proteins was increased. The last two pie

charts represent the classification of the N-terminal proteoforms (detected with the N-terminal

COmbined FRactional DIagonal Chromatography or COFRADIC method [66]).

Proteoforms containing TIS annotated in the Swiss-Prot database (database annotated TIS or

dbTIS) are the most abundant group, followed by proteoforms starting downstream of the

annotated TIS (downstream TIS or dTIS). Additionally, uORFs and N-terminally extended

proteoforms are detected but in smaller numbers. Moreover, both near-cognate and cognate

start sites were found within the newly detected proteoforms [52].

Figure 5: Pie charts representing the rise in identification rate for a) mouse and b) human and the

identification of new N-terminal proteoforms for c) mouse and d) human by using the

PROTEOFORMER pipeline [52].

3 Material and methods

Currently, the PROTEOFORMER pipeline has already proven its usefulness by increasing the

overall protein identification rate and enabling the detection of alternative proteoforms like N-

terminally extended proteoforms [52]. However, in order to be useful in cancer studies, the

inclusion of structural variant information is very important. SNPs are already taken into

account, but next to SNPs, insertions and deletions play a very important role in the

development of cancer. Therefore, it is necessary to include these variants in the pipeline as

well.

By taking into account both SNPs and INDELs, a lot of the genetic variation in cancer cells is

covered [48]. In all probability, this will improve the identification rate of proteins present in

cancer cells in comparison with databases lacking this information. This hypothesis is tested

by studying the protein identifications with and without the inclusion of INDELs using

matching ribosome profiling and mass spectrometry data from a human colorectal cancer

dataset. Additionally, the effect of INDELs is compared with the effect of SNPs.

3.1 Hardware

Both the existing scripts, needed to run the PROTEOFORMER pipeline, as well as the new

scripts written during this master thesis were executed on the Linux servers of BioBix. Two

different servers were used: one having 32 processors and 157 GB RAM and one having 64

processors and 315 GB RAM. The analysis of the peptide identification results was done on a

laptop with 8 GB RAM, four 2.20 GHz Intel® Core™ i5-5200U processors and the

Windows 8.1 operating system.

3.2 Software

3.2.1 SSH clients

The connection with the servers was made using the Secure Shell (SSH) protocol, ensuring

secure network services [67]. Within the context of the client-server model, two different SSH

clients were used: PuTTY (version 0.67) and WinSCP (version 5.9). WinSCP was

needed for secure file transfer between the local and the remote computer, while PuTTY was used

to operate the servers from a distance.

3.2.2 Python

Python is a high-level, object-oriented language created by Guido van Rossum. Van Rossum

designed this language while working at the Center for Mathematics and Computer Science

(CWI) in Amsterdam and made it publicly available in 1991. Today, Python is owned by a

non-profit organization, called the Python Software Foundation (PSF), and is freely available

at the official Python webpage: www.python.org. The programming language is distributed

with a large library of modules. Moreover, Python is portable and applicable to a large

domain of problems. Thanks to its simple syntax and its interactive interpreter, this

programming language is also easy to learn and probably the most convenient language to

introduce non-programmers into computer programming [68].

Initially, the majority of the scripts of the PROTEOFORMER pipeline were written in Perl

(version 5.22.2). However, due to Python's rising popularity, this will change over time. In

this master thesis, Python version 2.7.11 was used.

3.2.3 SQLite

During the analysis of the ribosome profiling data by the PROTEOFORMER pipeline, the

results were stored in a relational database. In general, a relational database consists of tables

containing data and connections between the records of these tables. This data was stored and

queried using SQLite (version 2.8.17). SQLite is a freely available, very reliable Structured

Query Language (SQL) database engine [69]. It can be used in combination with any

available operating system [70] and enables hand-operated execution of SQL queries through

the sqlite3 command-line service [71].

3.2.4 Python DB-API

The connection with the desired SQLite database through Python was made using the sqlite3

module (version 3.11.0). This module is responsible for the creation of an SQL interface

conformable with the consistent database application programming interface (API) used in

Python: DB-API 2.0 [72,73].

Initially, new Python database modules were provided with a particular interface different

from other existing modules. Due to incompatibilities, users were compelled to rewrite

working code when changing the database product. This problem has been the starting point

for the creation of a consistent Python database interface. Retrieving and inserting data with the

sqlite3 module begins with the creation of an object representing the connection with the

database. Next, a cursor object needs to be created and used to execute an

SQL command. Finally, the results can be collected by using one of the fetch techniques

[72,74].
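
The connection, cursor and fetch steps described above look as follows in a minimal sketch. An in-memory database is used instead of a results file, and the table name, columns and rows are hypothetical; they do not reproduce the schema of the pipeline.

```python
import sqlite3

# Connection object: here an in-memory database instead of a results file.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table; the pipeline's real schema is not reproduced here.
cursor.execute("CREATE TABLE tis_calls (transcript_id TEXT, start INTEGER, start_codon TEXT)")
rows = [("T1", 120, "ATG"), ("T1", 87, "CTG"), ("T2", 45, "ATG")]
cursor.executemany("INSERT INTO tis_calls VALUES (?, ?, ?)", rows)
connection.commit()

# Parameterised query followed by one of the fetch methods.
cursor.execute("SELECT start, start_codon FROM tis_calls "
               "WHERE transcript_id = ? ORDER BY start", ("T1",))
t1_calls = cursor.fetchall()
print(t1_calls)
connection.close()
```

Passing values through the question mark placeholders, rather than formatting them into the SQL string, lets the module handle quoting.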

3.2.5 STAR

Spliced Transcripts Alignment to a Reference (STAR) is a free alignment tool written in C++.

It provides fast mapping of RNA-seq data to the reference genome in three consecutive steps:

genome indexing, seed search and seed clustering. Genome indexing is done by means of

uncompressed suffix arrays (SAs). A binary string search in these SAs allows the detection of

seeds. These seeds are clustered together with the aid of anchor sequences. Simultaneously, a

local alignment scoring scheme is used to find the alignments having the best scores [61].

STAR (version 2.5.2a) was used to map all the RIBO-seq derived reads to the reference PhiX

bacteriophage genome and the human reference rRNA, sn(o)RNA, tRNA and genome

sequence.
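
The principle of a binary search in a suffix array can be shown on a toy sequence. The sketch below builds the array naively and looks up exact occurrences of a short pattern. It is far simpler than STAR itself, which extends seeds to maximal mappable prefixes and handles mismatches and splicing [61]. The function names and the example sequence are illustrative assumptions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction, adequate only for short example sequences.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary search for the block of suffixes that start with pattern."""
    low, high = 0, len(suffix_array)
    while low < high:  # leftmost suffix whose prefix is >= pattern
        mid = (low + high) // 2
        if text[suffix_array[mid]:suffix_array[mid] + len(pattern)] < pattern:
            low = mid + 1
        else:
            high = mid
    first = low
    high = len(suffix_array)
    while low < high:  # leftmost suffix whose prefix is > pattern
        mid = (low + high) // 2
        if text[suffix_array[mid]:suffix_array[mid] + len(pattern)] <= pattern:
            low = mid + 1
        else:
            high = mid
    return sorted(suffix_array[first:low])


genome = "ACGTACGTTACG"  # toy reference sequence
suffix_array = build_suffix_array(genome)
print(find_occurrences(genome, suffix_array, "ACG"))
```

Each lookup needs a number of comparisons that grows with the logarithm of the genome length, at the cost of keeping the whole array in memory.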

3.2.6 Variant caller

The PROTEOFORMER pipeline enables SNP detection using the SAMtools suite. Next to

SNPs, INDELs needed to be identified. Today, multiple INDEL detection tools are available.

They are often based on different principles while the general workflow is the same. Hence,

different categories of INDEL calling tools can be made [65].

A first important class of INDEL calling tools encloses the haplotype-based tools. Their

algorithms begin with selecting those sites in the genome that are more likely to show

variations. These variant sites are used to create a list of possible haplotypes. After realigning

these haplotypes to the original reads and applying Bayesian statistics, the variants can be

called [65,75,76]. Popular haplotype-based tools are GATK (Genome Analysis Toolkit)

HaplotypeCaller [77] and Platypus [76].

A second important class of variant calling tools groups the alignment-based techniques.

These methods start with the alignment of the reads against a reference genome. This

alignment is used to collect all the possible variants that are subsequently subjected to a

filtering step in which the correct SNPs and INDELs are retained [65]. Both SAMtools [62]

and GATK UnifiedGenotyper [78] are well known variant callers belonging to this class.

Because the PROTEOFORMER pipeline uses SAMtools mpileup (version 1.3) to find SNPs

in experimental data and BCFtools (version 1.3) to call the significant variants, these tools

were also used to find INDELs.
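
Once the variants are available in VCF format, SNPs and INDELs can be told apart by comparing the lengths of the reference and alternative alleles. The sketch below classifies a single VCF data line in this way. It assumes the standard tab-separated column order (CHROM, POS, ID, REF, ALT, ...), and the example records are made up.

```python
def classify_vcf_record(line):
    """Return (chromosome, position, variant_type) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    types = set()
    for allele in alt.split(","):
        if len(ref) == 1 and len(allele) == 1:
            types.add("SNP")
        elif len(allele) > len(ref):
            types.add("insertion")
        elif len(allele) < len(ref):
            types.add("deletion")
        else:
            types.add("other")  # e.g. a multi-nucleotide substitution
    variant_type = types.pop() if len(types) == 1 else "mixed"
    return chrom, pos, variant_type


records = [
    "1\t1000\t.\tA\tG\t50\t.\t.",    # hypothetical SNP
    "1\t2000\t.\tAT\tA\t50\t.\t.",   # hypothetical deletion
    "2\t300\t.\tC\tCGG\t50\t.\t.",   # hypothetical insertion
]
for record in records:
    print(classify_vcf_record(record))
```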

3.2.7 SearchGUI and PeptideShaker

A classic proteomics study ends with the identification of the proteins picked up throughout

the experiment. Figure 6 gives an overview of the main steps carried out during this process.

Figure 6: Protein identification using SearchGUI and PeptideShaker. Experimental MS/MS spectra

are identified using a database searching approach. Hereby, SearchGUI provides different search

engines which can analyze the MS/MS spectra simultaneously. The resulting peptide to spectrum

matches (PSMs) are evaluated with PeptideShaker. Finally, the obtained results can be submitted to

PRIDE or subjected to further analysis [79].

The data-analysis starts from the obtained experimental spectra and a protein sequence

database present in the standard FASTA format. Both data available from PRIDE (more

information on PRIDE is given in section 3.3.4) as well as in-house spectral data can be taken

as a starting point. Identification of the experimental spectra can be done using a database

searching approach by applying freely available identification software algorithms, also called

search engines. Nowadays, various search engines are at hand, each having their own

advantages and disadvantages. In order to achieve the best outcome possible, multiple search

engines are often combined. However, due to conflicts between input parameters and visual

interfaces, this can be very troublesome [79,80].

SearchGUI (version 3.2.14) is a user-friendly interface providing easy configuration of eight

different search engines (including X!Tandem [81] and the Open Mass Spectrometry Search

Algorithm or OMSSA [82]) and de novo sequencing algorithms, avoiding these conflicts

[80,83]. The resulting peptide to spectrum matches (PSMs) can be analyzed by another

software application: PeptideShaker (version 1.16.5). PeptideShaker provides simple

evaluation of the collective output gathered by the selected search engines and facilitates the

re-analysis of public data. The final results can be easily submitted to PRIDE or subjected to

further analysis in case of unidentified spectra [79,84].

3.2.8 Galaxy

Galaxy is an open source, web-enabled platform designed for the intensive data analysis

present in everyday bioscience research. It allows researchers to create or reuse complex

computational workflows from a user-friendly interface without the need to program. The

platform ensures reproducibility of the experiments by storing the required metadata.

Furthermore, the possibility to add annotations to every step of a workflow and the simplicity

of communicating experimental results are making this platform accessible to every

researcher interested in genomic data analysis [85].

3.3 Public databases

3.3.1 Ensembl annotation bundle

The Ensembl annotation bundle is part of the Ensembl project founded in 1999 [86]. It

provides high quality genome annotation for a wide variety of species. The Ensembl

annotation process starts from a genome assembly present in the International Nucleotide

Sequence Database Collaboration (INSDC). It involves ab initio gene prediction tools and

sequence homology based on protein, cDNA (complementary DNA), EST (expressed

sequence tag) and RNA-seq data [87,88].

The human genome assembly is currently managed by a worldwide association responsible

for improvements of the human, mouse, chicken and zebrafish reference genome: the Genome

Reference Consortium (GRC) [89,90]. Hence, the human genome assembly is abbreviated as

GRCh followed by a release number.

http://www.thegpm.org/tandem




Here, the research is based on assembly GRCh38 and Ensembl annotation bundle version 82.

Only the tables needed for the construction of the Ensembl transcripts were used. These

include the exon, the exon_transcript, the gene, the seq_region, the coord_system, the

transcript and the translation table. Their contents and underlying relationships are shown in

figure 7.

Figure 7: Overview of the most important Ensembl tables needed in the PROTEOFORMER pipeline.

Transcript information is stored in the transcript table. It provides information about the start and end

positions of the transcripts and the strand (sense or antisense). It is linked with the translation table

which contains information on the translation start and stop sites of the transcript. The transcripts are

linked with their corresponding genes by the gene table and with their exons through the

exon_transcript linking table. More information about the exons themselves can be seen in the exon

table. The tables seq_region and coord_system store information about all the available coordinate

systems. The PROTEOFORMER pipeline uses the chromosome coordinate system. Adapted from

[91].
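The relationships between these tables can be illustrated with a small, self-contained sketch. It builds a toy SQLite database containing only a simplified subset of the columns of the transcript, exon and exon_transcript tables and retrieves the exons of one transcript in transcript order. The rows and the stable ID ENST_TOY are invented for illustration; the real Ensembl tables hold many more fields, and this is not the query used by the PROTEOFORMER pipeline.

```python
import sqlite3

# In-memory toy database with a simplified subset of the Ensembl core columns.
# All rows below are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER,
                         seq_region_start INTEGER, seq_region_end INTEGER,
                         seq_region_strand INTEGER, stable_id TEXT);
CREATE TABLE exon (exon_id INTEGER PRIMARY KEY,
                   seq_region_start INTEGER, seq_region_end INTEGER);
CREATE TABLE exon_transcript (exon_id INTEGER, transcript_id INTEGER, rank INTEGER);
INSERT INTO transcript VALUES (1, 10, 100, 500, 1, 'ENST_TOY');
INSERT INTO exon VALUES (7, 300, 500), (5, 100, 200);
INSERT INTO exon_transcript VALUES (5, 1, 1), (7, 1, 2);
""")


def exons_of(stable_id):
    """Return (rank, start, end) for every exon of a transcript, in transcript order."""
    query = """
        SELECT et.rank, e.seq_region_start, e.seq_region_end
        FROM transcript t
        JOIN exon_transcript et ON et.transcript_id = t.transcript_id
        JOIN exon e ON e.exon_id = et.exon_id
        WHERE t.stable_id = ?
        ORDER BY et.rank
    """
    return con.execute(query, (stable_id,)).fetchall()


print(exons_of("ENST_TOY"))
```

The exon_transcript linking table carries the rank of each exon, which is what allows the exon sequences to be concatenated in the correct order when a transcript sequence is built.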

3.3.2 dbSNP

In parallel with the variant calling performed by SAMtools, the PROTEOFORMER pipeline

allows the inclusion of SNPs present in the Single Nucleotide Polymorphism Database

(dbSNP) [52]. This database was developed in 1998 by the National Center for Biotechnology

Information (NCBI), with the help of the National Human Genome Research Institute

(NHGRI), as an answer to the increasing interest and research in genomic variations and their


association with clinical phenotypes. The name 'dbSNP' refers to the most abundant variation

type present in the database: SNPs. Along with these SNPs, dbSNP contains other genomic

variants, including short INDELs [92,93].

Researchers can submit new genomic variants to the database when providing sufficient

information with regard to the experiment and the flanking sequences. In case of multiple

entries concerning the same genomic variant, dbSNP will combine the information given by

all original records and store it into a new reference record. This record can be linked with

other NCBI databases [92,93].

In addition to the inclusion of SNPs from dbSNP, the PROTEOFORMER pipeline was

adapted in order to enable the inclusion of INDELs as well.

3.3.3 Sequence read archives

In today's genomic studies, NGS-based technologies generate massive amounts of nucleotide

sequence data. In order to preserve this information in the international, public domain, the

DNA Databank of Japan (DDBJ), the European Bioinformatics Institute (EMBL-EBI) and the

NCBI started a collaboration to unify the obtained sequence information in all of their

databases, namely the INSDC. Thanks to this collaboration, data submitted to one of these

organizations can easily be shared with the other partners. One of the data types contained by

these databases is raw sequence data produced during NGS. This data type is stored in a

separate database called the Sequence Read Archive (SRA) [94]. The DDBJ, the NCBI

and the EMBL-EBI each have their own SRA [95]. The latter is part of the European

Nucleotide Archive (ENA) [96].

3.3.4 PRIDE

The PRoteomics IDEntifications database (PRIDE) is a public database holding the

experimental results of proteomic studies associated with published, scientific articles. It was

created to make proteomic data more accessible and easier to query, but it can also be used as

a private platform for communication and collaboration during the process of data-analysis

and peer-review of the final article. Peptide and protein identification lists can easily be

submitted and accessed via a web interface along with their associated MS/MS spectra

[97,98]. Together with the RIBO-seq data available in the SRAs, this data is used during the

validation of the renewed PROTEOFORMER pipeline.

3.4 Custom database creation

An assessment was made of the effect of including INDEL or SNP information in the creation

of custom sequence databases on the peptide/protein identification rate in MS-based

proteomics. To this end, RIBO-seq data derived from the human colorectal cancer cell line

HCT116 was used. This data was generated by the BioBix lab in 2014 [99] and can be easily

downloaded from the SRAs.


The custom databases themselves were created using the mapping, transcript calling, TIS calling

and variant calling steps provided by the PROTEOFORMER pipeline. The latter was only

available for SNP detection. Therefore, the initial variant calling script was adapted.

Moreover, a completely new assembly script was written for the inclusion of both INDELs

and SNPs in the protein sequence database. Both the new variant calling script and the

assembly script are available at GitHub on https://github.com/sgielis/Master-thesis.

3.4.1 Mapping

For every RIBO-seq experiment, two FASTQ files were present: one containing the RIBO-

seq derived reads collected after CHX-treatment and another containing the RIBO-seq derived

reads collected after LTM-treatment. Both files were preprocessed using the FASTX-toolkit

(version 0.0.14) [100,101]. This included the removal of the adaptor sequence

(AGATCGGAAGAGCACAC) using the fastx_clipper tool and trimming the sequences with

the fastq_quality_trimmer tool. During the latter step, a quality threshold of 28 was used,

resulting in the removal of nucleotides lying at the end of the sequence and having a quality

score below 28. Simultaneously, sequences shorter than 20 nucleotides were discarded.
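The trimming rule described above can be sketched in a few lines. The function below only mimics the behaviour as described in the text (removal of 3' bases with a quality score below 28, followed by a minimum length of 20 nucleotides); it is not the FASTX-toolkit implementation, and the function names are chosen here for illustration. Quality strings are assumed to be Phred+33 encoded.

```python
def phred33(quality_string):
    """Decode a Phred+33 encoded FASTQ quality string into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def trim_3prime(seq, quals, min_quality=28, min_length=20):
    """Trim low-quality bases from the 3' end of a read.

    Returns the trimmed (sequence, qualities) pair, or None when the
    remaining read is shorter than min_length and must be discarded.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], quals[:end]


read = "ACGTACGTACGTACGTACGTACGTA"           # 25 nucleotides
qualities = phred33("I" * 22 + "+" * 3)      # 22 bases at Q40, 3 bases at Q10
print(trim_3prime(read, qualities))
```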

In a following step, STAR (version 020201) was used to map the remaining RIBO-seq reads

consecutively to the reference PhiX bacteriophage genome and the human reference rRNA,

sn(o)RNA and tRNA (obtained from NCBI). These filtering steps were needed in order to

retain the translating RPFs only. In all cases, a maximum of two mismatches per read was

allowed. Moreover, a value of 0.5 was chosen for the option seedSearchStartLmaxOverLread

and no introns were allowed during mapping against the PhiX genome.

Next, the filtered (unmapped) reads were aligned to the human reference genome. Again, a

maximum of two mismatches per read was chosen as a threshold. Alignments associated with

reads mapping to at most 15 loci were collected in SAM and BAM files. The same

information was also stored in BedGraph files used to visualize the obtained information in a

genome browser.

In a later step, the RPFs were assigned to single nucleotide positions, namely the ribosomal

P-sites. This was done using the Plastid tool [102]. Plastid determines these P-sites by

calculating P-site offsets. Both the term P-site and P-site offset are explained in figure 8 in

more detail.

In general, Plastid starts with grouping all the RPFs by their lengths. Subsequently, these

reads are mapped to the canonical genes in the region surrounding the canonical start site.

Finally, the distance between the canonical start site and the highest peak at the 5' end of the

start codon is measured and returned as a P-site offset [103,104].

Because the P-site offsets were calculated for all read lengths varying from 22 to 34

nucleotides, the exact P-site could be calculated for each of these lengths. Next, the range of

RPF lengths having a clear P-site offset was determined. All the RPFs lying in this range were


retained for further analysis. This information was also stored in two SQLite tables: one for

the CHX-treated reads and another for the LTM-treated reads. Both tables contain the retained

P-sites and an extra value representing the number of reads that are assigned to this site.

Figure 8: Clarification of the terms P-site and P-site offset. The P-site (peptidyl site) is the second site

in the ribosomal complex. Here, the growing polypeptide is present. It is preceded by the aminoacyl

site (A-site), which is responsible for the presentation of the aminoacyl-tRNA to the mRNA, and

followed by the exit site (E-site), where empty tRNA molecules leave the ribosomal complex [105].

The latter is not shown in this figure. The P-site offset is defined as the distance of the first nucleotide

of the RPF from the P-site. Here, an RPF of length 28 nucleotides is shown with an offset of 12 [104].
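The assignment of an RPF to its P-site, as described in section 3.4.1, can be sketched as follows. The offset table is hypothetical apart from the value of 12 for reads of 28 nucleotides, which is taken from figure 8. Coordinates are assumed to be 1-based and inclusive, and the handling of the antisense strand (counting the offset from the rightmost genomic position, which is the 5' end of such a read) is an assumption of this sketch rather than a description of Plastid.

```python
def p_site_position(read_start, read_end, strand, offsets):
    """Return the genomic position of the P-site of an aligned RPF.

    read_start and read_end are 1-based, inclusive genomic coordinates.
    offsets maps read length to P-site offset; reads whose length has no
    clearly determined offset return None and are discarded.
    """
    length = read_end - read_start + 1
    offset = offsets.get(length)
    if offset is None:
        return None
    if strand == "+":
        return read_start + offset   # 5' end is the leftmost position
    return read_end - offset         # 5' end is the rightmost position


# Hypothetical offset table; only the 28 -> 12 entry comes from figure 8.
OFFSETS = {28: 12}

print(p_site_position(1000, 1027, "+", OFFSETS))
print(p_site_position(1000, 1027, "-", OFFSETS))
```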

3.4.2 Transcript calling

Subsequent to the mapping procedure, the ribosome profiles of the CHX-treated ribosomes

were used to find the translated transcripts. The transcript calling procedure includes the

collection of all possible transcripts using the Ensembl annotation bundle and the subsequent

association of these Ensembl transcripts with the evidence on translational level.

3.4.3 TIS calling

After the mapping and transcript calling, the translation start sites were determined. The TIS-

calling procedure started from the read counts and transcripts gathered by previous methods.

In a first step, these reads were matched with the transcripts. Meanwhile, the LTM-reads

accumulated at TIS were combined into peaks. These TIS-peaks comprised both true and false

TIS. Hence, additional criteria were needed to retain the peak positions corresponding to

true translation start sites. As a first criterion, the identified TIS positions should lie within a ±1

nucleotide window relative to an AUG or near-cognate start codon. Ribosome profiles

registered on the first position or the ±1 position of an AUG or near-cognate start codon

were summed into one peak. Second, these combined profiles should have more reads on a

single position than a user-definable minimum count. Third, the position must

be a local maximum, i.e. carry the highest number of reads within a user-definable window. Here, a window of

one codon upstream and one codon downstream was chosen. As the final criterion, an

RLTM-RCHX value was calculated using the following formula:
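The formula itself is not preserved in this copy of the text. Based on the definitions of X and N given directly after it and on the published description of the PROTEOFORMER pipeline [52], it presumably has the form shown below. The scaling factor of 10 is an assumption taken from that description and should be verified against the original.

```latex
R_{LTM} - R_{CHX}
\qquad \text{with} \qquad
R_k = \frac{X_k}{N_k} \times 10, \quad k \in \{\mathrm{LTM}, \mathrm{CHX}\}
```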


Here, X represents the number of reads on the position considered and N the sum of all the reads on the

transcript for the RPFs obtained after treatment with LTM or CHX [52]. Peaks with an

RLTM-RCHX value equal to or higher than a selected threshold were retained.

Both the minimum read count and the RLTM-RCHX threshold could take a different

value depending on the annotation. Five different annotations were possible: aTIS (annotated

TIS: the TIS coincides with a canonical start site), 5' UTR (TIS lying upstream of the

canonical start site), 3' UTR (TIS lying downstream of the stop codon), coding sequence or

CDS (TIS lying within the protein-coding transcript) and no translation (TIS lying in non-

coding transcripts). All the TIS calling procedures were performed using a minimum count of

5 for aTIS, 10 for a 5' UTR site, a 3' UTR site and a non-coding site and 15 for a CDS site.

The RLTM-RCHX values were set to 0.01 for aTIS, 0.05 for a 5' UTR, a 3' UTR and a non-

coding site and 0.15 for a CDS site.
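The annotation-dependent thresholds listed above can be combined into a single filter, as in the minimal sketch below. The threshold values are those given in the text. The form of the R value (reads on the position divided by reads on the transcript, multiplied by a scaling factor of 10) is an assumption based on the published PROTEOFORMER description [52], and the function and dictionary names are invented for this illustration. The read count is compared strictly ("more reads than the minimum count"), following the wording of the second criterion; the pipeline itself may use an inclusive comparison.

```python
# Thresholds per annotation class, as listed in section 3.4.3.
MIN_COUNT = {"aTIS": 5, "5'UTR": 10, "3'UTR": 10, "no translation": 10, "CDS": 15}
MIN_R_DIFF = {"aTIS": 0.01, "5'UTR": 0.05, "3'UTR": 0.05, "no translation": 0.05, "CDS": 0.15}


def r_value(reads_on_position, reads_on_transcript, scale=10.0):
    """R = scale * X / N. The scale of 10 is an assumption of this sketch."""
    if reads_on_transcript == 0:
        return 0.0
    return scale * reads_on_position / reads_on_transcript


def passes_tis_thresholds(annotation, ltm_peak, ltm_total, chx_peak, chx_total):
    """Apply the minimum-count and RLTM-RCHX criteria to one candidate TIS peak."""
    if ltm_peak <= MIN_COUNT[annotation]:
        return False
    difference = r_value(ltm_peak, ltm_total) - r_value(chx_peak, chx_total)
    return difference >= MIN_R_DIFF[annotation]


# Invented read counts, for illustration only.
print(passes_tis_thresholds("aTIS", 6, 100, 1, 100))
print(passes_tis_thresholds("CDS", 16, 1000, 10, 1000))
```

The first two criteria (position relative to a start codon and local maximum within the window) are not covered by this sketch.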

3.4.4 Variant calling

SNPs and INDELs were detected using SAMtools. This variant caller starts from a SAM file

containing the alignment of the CHX-derived RIBO-seq data with the reference sequence. At

first, this SAM file was converted into a BAM file which was sorted on genomic location.

Next, the genotype likelihoods were determined and collected in a BCF file with the mpileup

command. BCFtools used this BCF file to call all the significant SNPs and INDELs and wrote

this information to a VCF file. Furthermore, the vcfutils.pl varFilter option was used to

discard all variants having a read depth below 3 or exceeding 100. From the final VCF file,

two SQLite tables were created: one for the SNPs and another one for the INDELs. Both

tables contain genomic position information, the reference and alternative variants and their

corresponding allele frequencies. The latter is an important value describing the biological

relevance of the variant in the sample compared to the reference. An allele frequency of 1.0,

for example, means that the reference allele is not present in the studied RIBO-seq data while

an allele frequency of 0.5 indicates both alleles at a certain locus are equally present.

In addition to the experimentally observed variants, SNPs and INDELs from dbSNP were also taken

into consideration. In order to reduce the size of the custom generated database, only dbSNP

variants present in the RIBO-seq derived reads were included. While adding these variants to

the two existing SQLite tables a distinction was made between new variants discovered by

SAMtools, existing variants present in dbSNP and detected by SAMtools and variants that

could not be called by SAMtools but were nevertheless present in dbSNP and in the SAM

alignment file.

3.4.5 Translation assembly

In a final phase of the customized database construction, all the information obtained during

previous steps was collected and assembled into protein sequences. This process started from


the transcript sequences called with sufficient translational evidence. Their sequences were

built up using the human reference genome sequence acquired from the iGenomes repository

and the exon start and end positions given by the Ensembl annotation bundle. In addition, the

experimentally identified TIS were taken into account by constructing ORFs starting from that

TIS position.

During ORF construction, variant information was added. Due to the discovery of a large

number of SNPs by SAMtools, the SNPs from dbSNP were not included. In contrast, the

number of INDELs discovered by SAMtools was low. Hence, both the INDELs discovered

by SAMtools and the INDELs from dbSNP were included. To this end, all the INDELs

and SNPs were collected per transcript and sorted in ascending order of appearance.

Afterwards the variants were introduced one by one. During this procedure, all the alternative

INDELs and SNPs were inserted. Here, only SNP information from non-synonymous SNPs

was included. The latter are SNPs leading to a single amino acid change in a protein resulting

in the so-called single amino acid variants (SAAVs) [106]. These SNPs are also known as

missense mutations. The reference variants were also included alongside their alternative

sequence when an allele frequency around 0.5 was observed.
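The introduction of sorted variants into a transcript sequence can be sketched as below. This is an illustration only and not the assembly script of this work: variants are represented as hypothetical (position, reference allele, alternative allele) tuples with 0-based transcript positions, INDELs are written in the VCF style in which the reference and alternative alleles share their first nucleotide, and the toy sequence is invented. Walking through the variants in ascending order while copying the untouched stretches in between avoids having to correct later positions for earlier insertions or deletions.

```python
def apply_variants(sequence, variants):
    """Return the sequence with all alternative alleles introduced.

    variants is a list of (position, ref, alt) tuples with 0-based positions
    on the original sequence; they are processed in ascending order.
    """
    pieces = []
    cursor = 0
    for position, ref, alt in sorted(variants):
        if position < cursor:
            raise ValueError("overlapping variants are not supported")
        if sequence[position:position + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {position}")
        pieces.append(sequence[cursor:position])  # unchanged stretch
        pieces.append(alt)                        # alternative allele
        cursor = position + len(ref)
    pieces.append(sequence[cursor:])
    return "".join(pieces)


toy = "ATGGCCAAATAG"
print(apply_variants(toy, [(5, "C", "CGT"), (3, "G", "T")]))  # SNP plus insertion
print(apply_variants(toy, [(6, "AA", "A")]))                   # one-nucleotide deletion
```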

Finally, all the assembled nucleotide sequences were translated in silico into protein

sequences. An extra filtering step ensured that only complete proteins were retained and

stored in an SQLite table and further processed into a non-redundant database in FASTA

format. From this database a target-decoy database [107] was created including additional

information from the cRAP (common Repository of Adventitious Proteins) database. The

latter is a FASTA database containing proteins which are often present in MS-based

proteomics experiments due to contamination [108].
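The last two steps, removing redundant sequences and adding decoys, can be sketched as follows. The text does not specify how the decoy sequences were generated; reversing each target sequence is assumed here because it is a common choice, and the "_REVERSED" suffix and the example headers are purely illustrative.

```python
def non_redundant(entries):
    """Keep the first (header, sequence) entry for every distinct sequence."""
    seen = set()
    unique = []
    for header, sequence in entries:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((header, sequence))
    return unique


def add_decoys(entries, suffix="_REVERSED"):
    """Return the target entries followed by one reversed decoy per target."""
    decoys = [(header + suffix, sequence[::-1]) for header, sequence in entries]
    return list(entries) + decoys


def to_fasta(entries):
    """Format (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in entries)


# Invented toy entries; P2 duplicates the sequence of P1 and is dropped.
targets = non_redundant([("P1", "MKTAY"), ("P2", "MKTAY"), ("P3", "MSE")])
print(to_fasta(add_decoys(targets)))
```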

3.5 Validation of the renewed PROTEOFORMER pipeline with

proteomics data

The generated custom databases were used to identify proteins and peptides isolated from the

HCT116 cell line using experimental MS/MS spectra available from PRIDE. This was done

with the help of the search engines X!Tandem Vengeance (2015.12.15.2) and OMSSA (2.1.9)

through the SearchGUI user interface. The analysis was carried out using two fixed

modifications (heavy-labelled arginine (¹³C₆) and lysine (¹³C₆)) and three variable

modifications (pyroglutamate formation of N-terminal glutamine, oxidation of methionine to

methionine-sulfoxide, and acetylation of the N-terminus of both peptides and proteins).

Trypsin was the selected digestion enzyme and a threshold of one missed cleavage was set. In

addition, cleavage before proline was allowed. Furthermore, a mass tolerance of 10 ppm on

precursor ions and 0.5 Da on fragment ions was chosen. For the peptide charge, a range from

2+ to 4+ was selected.
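Two of these settings, the digestion rule and the precursor mass tolerance, can be made concrete with a short sketch. It performs an in silico tryptic digestion in which cleavage occurs after every lysine or arginine, also when followed by proline, with at most one missed cleavage, and it checks whether an observed mass lies within 10 ppm of a theoretical mass. This is an illustration of the parameters, not the implementation used by X!Tandem or OMSSA. The test sequence is a fragment of the alternative protein discussed in section 4.2.1; the masses are invented.

```python
def tryptic_peptides(protein, missed_cleavages=1):
    """Digest a protein after every K or R (cleavage before P allowed)."""
    pieces = []
    start = 0
    for index, residue in enumerate(protein):
        if residue in "KR":
            pieces.append(protein[start:index + 1])
            start = index + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal peptide without K/R
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def within_ppm(observed_mass, theoretical_mass, tolerance_ppm=10.0):
    """True when the mass error is at most tolerance_ppm parts per million."""
    error_ppm = abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6
    return error_ppm <= tolerance_ppm


fragment = "NLEDRGEETVLGGGTREAESR"
print(tryptic_peptides(fragment, missed_cleavages=0))
print(within_ppm(1000.005, 1000.0))
```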

The resulting PSMs produced by X!Tandem and OMSSA were analysed with PeptideShaker

using a false discovery rate (FDR) of 1%.
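The idea behind an FDR threshold in a target-decoy search can be sketched as follows. The function estimates the FDR at each score cut-off as the ratio of decoy to target matches and returns the lowest score at which this estimate still does not exceed 1%. This is the generic estimate only; the statistical procedure implemented in PeptideShaker is more elaborate, and the scores below are invented.

```python
def score_threshold_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score cut-off whose estimated FDR is at most max_fdr.

    psms is a list of (score, is_decoy) pairs; a higher score is better.
    The FDR at a cut-off is estimated as (#decoys / #targets) above it.
    Returns None when no cut-off satisfies the requirement.
    """
    targets = 0
    decoys = 0
    threshold = None
    for score, is_decoy in sorted(psms, key=lambda psm: psm[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            threshold = score
    return threshold


# Invented scores: two target matches separated by one decoy match.
print(score_threshold_for_fdr([(10.0, False), (9.0, True), (8.0, False)]))
```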


4 Results

4.1 Custom database creation

To study the influence of incorporating INDELs in the protein sequences on the identification

rate in an MS-based identification process, two protein sequence databases are created: one

with INDEL information and one without. In addition, a protein sequence database containing

SNP information is generated to compare the impact of INDELs versus SNPs on the protein

identification level.

4.1.1 Mapping statistics

Initially, the CHX-treated and the LTM-treated RIBO-seq derived reads from the human

colorectal cancer cell line were mapped successively to the PhiX genome, the human rRNA,

sn(o)RNA (both small nuclear and nucleolar), tRNA and the human reference genome. The

mapping statistics of the CHX-treated RIBO-seq derived reads are given in table 1, whereas

table 2 lists the mapping results of the LTM-treated RIBO-seq derived reads. Both tables

specify the number of input reads available for mapping against each of the reference

sequence types (PhiX, rRNA, sn(o)RNA, tRNA and the human genome) and the associated

number of mapped and unmapped reads. In addition, the percentage of mapped reads

relative to the total number of reads associated with the sequence type is given.

Table 1: Mapping statistics for the CHX-treated RIBO-seq derived reads.

| Type | Total reads | Mapped | Unmapped | Frequency mapped (%) |
|---|---|---|---|---|
| PhiX | 215 956 640 | 582 289 | 215 374 351 | 0.27 |
| rRNA | 215 374 351 | 159 881 772 | 55 492 579 | 74.23 |
| sn(o)RNA | 55 492 579 | 1 854 896 | 53 637 683 | 3.34 |
| tRNA | 53 637 683 | 6 686 666 | 46 951 017 | 12.47 |
| genomic | 46 951 017 | 39 774 983* | 7 176 034 | 84.72** |

*From a total of 39 774 983 reads mapped to the human reference genome, 26 615 285 reads mapped

to a unique genomic location, 13 159 698 were non-unique.

**The total frequency of 84.72% is the sum of the frequency of the uniquely mapped reads (56.69%)

and the non-uniquely mapped reads (28.03%) to the human reference genome.
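The percentages in table 1 follow directly from the read counts, and the unmapped reads of each step form the input of the next one. The short sketch below recomputes the percentages from the counts in the table and checks this chaining; the variable and function names are chosen for illustration.

```python
# (type, total reads, mapped reads) taken from table 1 (CHX-treated reads).
TABLE_1 = [
    ("PhiX", 215_956_640, 582_289),
    ("rRNA", 215_374_351, 159_881_772),
    ("sn(o)RNA", 55_492_579, 1_854_896),
    ("tRNA", 53_637_683, 6_686_666),
    ("genomic", 46_951_017, 39_774_983),
]


def mapped_percentage(total, mapped):
    """Percentage of the input reads of one mapping step that were mapped."""
    return round(100 * mapped / total, 2)


for name, total, mapped in TABLE_1:
    print(f"{name:10s} {mapped_percentage(total, mapped):6.2f} %  unmapped: {total - mapped}")
```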


Table 2: Mapping statistics for the LTM-treated RIBO-seq derived reads.

| Type | Total reads | Mapped | Unmapped | Frequency mapped (%) |
|---|---|---|---|---|
| PhiX | 228 664 982 | 590 458 | 228 074 524 | 0.26 |
| rRNA | 228 074 524 | 150 632 286 | 77 442 238 | 66.05 |
| sn(o)RNA | 77 442 238 | 3 198 726 | 74 243 512 | 4.13 |
| tRNA | 74 243 512 | 6 112 089 | 68 131 423 | 8.23 |
| genomic | 68 131 423 | 55 879 608* | 12 251 815 | 82.02** |

*From a total of 55 879 608 reads mapped to the human reference genome, 38 689 756 reads mapped

to a unique genomic location, 17 189 852 were non-unique.

**The total frequency of 82.02% is the sum of the frequency of the uniquely mapped reads (56.79%)

and the non-uniquely mapped reads (25.23%) to the human reference genome.

In total 39 774 983 CHX-treated and 55 879 608 LTM-treated RIBO-seq derived reads are

mapped to the human reference genome. In order to determine the exact nucleotide position

needed for the TIS and transcript calling procedures, P-site offsets were calculated. With the

help of the Plastid tool, two plots were created visualizing these offsets as a function of their

corresponding read length. In figure 9, the results for the LTM-derived reads are given.

According to this figure, reads with a length varying from 27 to 34 nucleotides have a

clearly determined P-site offset and can therefore be associated with one single nucleotide

position on the reference genome. However, this range cannot be applied to the CHX-derived

reads as shown by figure 10. Here, reads having less than 27 or more than 31 nucleotides

cannot be associated with a clearly determined P-site offset. Because a separate range cannot be chosen

for the analysis of the CHX-derived and the LTM-derived reads, the

range from 27 to 31 nucleotides, which has a well-defined offset in both datasets, was chosen. All the CHX- and

LTM-derived reads lying within this range are retained and pinpointed to one nucleotide

position.


Figure 9: Overview of the calculated P-site offsets for the LTM-treated RIBO-seq derived reads with

lengths varying from 22 to 34 nucleotides.


Figure 10: Overview of the calculated P-site offsets for CHX-treated RIBO-seq derived reads with

lengths varying from 22 to 34 nucleotides.


4.1.2 Mapping quality control

After the mapping procedure, the quality of the RIBO-seq data was determined using two

important quality parameters, namely the translational phasing and the length distribution of

the reads (i.e. RPFs). The phase distribution represents the number of RIBO-seq reads mapped

to nucleotides lying in phase, lying one nucleotide out of phase or lying two nucleotides out

of phase within the translated open reading frame. The length distribution gives the number of

reads having a certain length. Both distributions are shown in figure 11 for the CHX-derived

and LTM-derived reads. It is clear that the majority of the reads map to nucleotides in phase.

Furthermore, the length distribution shows that most of the reads have a length within the

selected range of 27-31 nucleotides. As a consequence, only a small number of the initial

reads mapped to the human reference genome were excluded, resulting in a limited loss of

information.
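The phase distribution can be computed from the P-site positions with a few lines of code. The sketch below assumes a transcript on the sense strand and expresses all positions in the same coordinate system as the first nucleotide of the start codon; the positions used in the example are invented.

```python
from collections import Counter


def phase_distribution(p_site_positions, cds_start):
    """Fraction of P-sites in phase 0, 1 and 2 relative to the start codon.

    Phase 0 means the P-site falls on the first nucleotide of a codon of the
    annotated reading frame (sense strand assumed).
    """
    counts = Counter((position - cds_start) % 3 for position in p_site_positions)
    total = sum(counts.values())
    if total == 0:
        return {0: 0.0, 1: 0.0, 2: 0.0}
    return {phase: counts.get(phase, 0) / total for phase in (0, 1, 2)}


# Invented P-site positions on a CDS starting at position 100.
print(phase_distribution([100, 103, 106, 104], cds_start=100))
```

A strong enrichment of one phase, as seen in figure 11, indicates that the reads originate from translating ribosomes.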

Figure 11: Phase and RPF length distribution of the CHX-treated RIBO-seq derived reads (a) and

LTM-treated RIBO-seq derived reads (b) mapped to the human reference genome.


4.1.3 Discovered variants

Variant information was gathered based on the SAM file containing the alignments of the

CHX-derived reads. Both new variants, discovered by SAMtools, and existing variants,

present in dbSNP, were detected. Figure 12 shows the number of variants discovered by each

of these sources. In total 63 300 SNPs and 251 INDELs are found. Most of these variants are

already present in dbSNP. However, 43 novel INDELs and 19 914 new SNPs are discovered

with SAMtools.

Figure 12: Pie charts showing (1) the number of new variants discovered by SAMtools, (2) the

number of dbSNP variants missed by SAMtools and (3) the number of existing variants present in

dbSNP and detected by SAMtools for a) SNPs and b) INDELs.

The detected variants were incorporated into the Ensembl transcript sequences during the

assembly. Herein, the incorporation was limited to variants having an allele frequency lying

between 0.4 and 0.6 or 0.9 and 1.0. In this way, the allele frequencies lying close to the

theoretically predicted allele frequencies of 0.5 and 1.0 are included. In order to investigate the

number of variants having an allele frequency located within these ranges, the distribution of the

allele frequencies was calculated for both variant classes. The results are shown in figures 13 and 14

for the SNPs and INDELs, respectively. From figure 13, it is clear that only a small subset of

the SNPs has an allele frequency lying outside the predefined range. These are therefore not

included in the Ensembl transcripts. In contrast with the SNPs, all the INDELs have allele

frequencies lying within the applied range. Hence, all the INDELs are included in the

Ensembl transcripts. For both variant classes, the majority of the variants have an allele

frequency around 1.0. This shows that for most variants, the reference is not present in the

studied HCT116 RIBO-seq dataset.
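The allele frequency rule can be summarized in a small sketch. The two accepted ranges are those given above. Treating "an allele frequency around 0.5" (section 3.4.5) as the range from 0.4 to 0.6, and treating all range boundaries as inclusive, are assumptions of this sketch; the function names are illustrative.

```python
def is_included(allele_frequency):
    """True when the allele frequency lies in one of the two accepted ranges."""
    return 0.4 <= allele_frequency <= 0.6 or 0.9 <= allele_frequency <= 1.0


def alleles_to_assemble(ref, alt, allele_frequency):
    """Return the alleles that are built into the transcript sequences.

    Around 0.5 both alleles are kept, near 1.0 only the alternative allele
    is kept, and outside both ranges the variant is ignored.
    """
    if not is_included(allele_frequency):
        return []
    if allele_frequency <= 0.6:
        return [ref, alt]
    return [alt]


print(alleles_to_assemble("G", "T", 0.5))
print(alleles_to_assemble("G", "T", 1.0))
print(alleles_to_assemble("G", "T", 0.75))
```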


Figure 13: Distribution of the allele frequencies of the SNPs detected by SAMtools.

Figure 14: Distribution of the allele frequencies of the INDELs detected by SAMtools.


4.1.4 Translation assembly

Before assembling the protein sequences into a database in FASTA format, an SQLite table

containing the protein sequences including transcript, TIS and variant information was

created. In total three SQLite tables are made: one with INDEL information, one with SNP

information and one without any variant information. These contain respectively 140 617,

143 710 and 140 293 unique protein sequences. In figure 15, a Venn diagram is given,

illustrating the amount of overlap between these three databases. From this diagram, it is clear

that most of the protein sequences are shared between the three tables. Moreover, the diagram

shows that addition of SNP information is responsible for 7 348 extra protein sequences,

whereas the addition of INDEL information results in 345 extra protein sequences.

Figure 15: Venn diagram showing the amount of similar and distinct protein sequences related with

Ensembl transcripts lacking variant information and Ensembl transcripts carrying one or more

INDELs or SNPs (derived from the RIBO-seq data).

While studying the number of identical transcripts in the SQLite tables with INDELs and

without variants, 7 transcripts were found containing an INDEL that does not alter the protein

sequence. Because INDELs are typically associated with dramatic changes at the protein

level, these transcripts were studied in more detail.

For every one of these transcripts, the stop codon overlaps the INDEL. More

specifically, the stop codon overlaps a deletion of one nucleotide. Normally, this type of

variant causes a frameshift resulting in a deviant protein sequence. However, the

protein sequence is not affected here, because the deletion does not alter the stop codon itself but only changes

the mRNA sequence downstream of it. For example, the stop codon TAG is not altered when the

guanine of the stop codon is part of a reported deletion from GGGGGGA to GGGGGA and

the first two nucleotides of the stop codon (in this case TA) precede the INDEL.
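This example can be verified with a short in silico translation. The sketch below uses the standard genetic code and a toy ORF: the first six nucleotides (ATGGCC) are invented, whereas the TA prefix and the GGGGGGA to GGGGGA deletion correspond to the example in the text. Translation stops at the first in-frame stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(orf):
    """Translate an ORF up to (not including) the first in-frame stop codon."""
    protein = []
    for index in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[index:index + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


reference = "ATGGCC" + "TA" + "GGGGGGA"    # ... TAG GGG GGA
alternative = "ATGGCC" + "TA" + "GGGGGA"   # ... TAG GGG GA (one G deleted)

print(translate(reference), translate(alternative))
```

Because the run of guanines simply becomes one nucleotide shorter, the codon TAG remains intact and both sequences yield the same protein.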


4.2 Validation of the renewed PROTEOFORMER pipeline with

proteomics data

After creating the custom protein sequence databases, these were used in a database search

approach to obtain peptide to spectrum matches (PSMs) from the experimental fragmentation

spectra (MS/MS). By comparing the proteins identified using these different databases,

the influence of the incorporation of INDEL and SNP information on the peptide and protein

identification process is assessed.

4.2.1 Exploration of the effect of INDELs on the identification process

In order to study the gain of adding INDEL information to the peptide and protein

identification process, both the custom protein sequence database without variant information

and with INDEL information were used to identify peptide and protein sequences. During all

analyses a threshold of 90% confidence was set, meaning only the proteins having a score of

at least 90% are considered. In total 19 protein groups are detected that contain at least one

proteoform with an INDEL. More importantly, a single protein containing an INDEL is

identified. Both the protein groups as well as the single protein are discussed in more detail in

the following paragraphs.

Identification of a single protein containing an INDEL

A protein derived from the Ensembl transcript ENST00000368232 (start site at position

156 599 046 on chromosome 1) is identified with a confidence score of 100%. This transcript

contains an insertion of a guanine and thymine at position 1069 in the transcript sequence.

This INDEL is present in dbSNP, but is also found using SAMtools with an allele frequency

of 1.0. In order to make the subsequent analysis clearer, this protein is called the alternative

protein. In addition, the protein derived from the Ensembl transcript lacking the INDEL is

called the reference protein.

The identification of the alternative protein is based on the identification of two peptides:

pyro-QKEEEEATASER-COOH with 100% confidence and NH2-GEETVLGGGTR-COOH

with 99% confidence. The PSMs of these peptides are shown in figure 16 and 17 respectively.

These figures also visualize the found fragment ions corresponding with each of the peptide

bonds. For the first peptide, evidence is found for nine fragmentation sites out of eleven

theoretically possible ones. Eight of these are validated with both b- and y-ions, one is only

validated with a b-ion. For the second peptide, only five out of ten peptide bonds are validated

with b- and/or y-ions. However, no other peptide is found containing the subsequence

ETVLGG. This supports the presence of NH2-GEETVLGGGTR-COOH.


Figure 16: Fragment ions and spectrum of the peptide pyro-QKEEEEATASER-COOH. a) The sequence

fragmentation plot shows the b-ions (blue peaks) and y-ions (red peaks) present due to the cleavage of

the corresponding peptide bonds. b) The PSM visualizes the annotated spectrum. Here, the red peaks

are annotated by a fragment ion, whereas the grey peaks are not.

Figure 17: Fragment ions and spectrum of the peptide NH2-GEETVLGGGTR-COOH. a) The

sequence fragmentation plot shows the b-ions (blue peaks) and y-ions (red peaks) present due to the

cleavage of the corresponding peptide bonds. b) The PSM visualizes the annotated spectrum. Here, the

red peaks are annotated by a fragment ion, whereas the grey peaks are not.


The effect of the insertion on protein level is studied by comparing the alternative protein

sequence with the reference sequence. Below, the alignment of the two protein sequences is

given. The bold, underlined amino acids represent the identified peptides. The amino acids in

green highlight the differences between the two protein sequences.

reference   MNVTPEVKSRGMKFAEEQLLKHGWTQGKGLGRKENGITQALRVTLKQDTHGVGHDPAKEF
alternative MNVTPEVKSRGMKFAEEQLLKHGWTQGKGLGRKENGITQALRVTLKQDTHGVGHDPAKEF
            ************************************************************

reference   TNHWWNELFNKTAANLVVETGQDGVQIRSLSKETTRYNHPKPNLLYQKFVKMATLTSGGE
alternative TNHWWNELFNKTAANLVVETGQDGVQIRSLSKETTRYNHPKPNLLYQKFVKMATLTSGGE
            ************************************************************

reference   KPNKDLESCSDDDNQGSKSPKILTDEMLLQACEGRTAHKAARLGITMKAKLARLEAQEQA
alternative KPNKDLESCSDDDNQGSKSPKILTDEMLLQACEGRTAHKAARLGITMKAKLARLEAQEQA
            ************************************************************

reference   FLARLKGQDPGAPQLQSESKPPKKKKKKRRQKEEEEATASERNDADEKHPEHAEQNIRKS
alternative FLARLKGQDPGAPQLQSESKPPKKKKKKRRQKEEEEATASERNDADEKHPEHAEQNIRKS
            ************************************************************

reference   KKKKRRHQEGKVSDEREGTTKGNEKEDAAGTSGLGELNSREQTNQSLRKGKKKKRWHHEE
alternative KKKKRRHQEGKVSDEREGTTKGNEKEDAAGTSGLGELNSREQTNQSLRKGKKKKRWHHEE
            ************************************************************

reference   EKMGVLEEGGKGKEAAGSVRTEEVESRAYADPCSRRKKRQQQEEEDLNLEDRGEETFRWW
alternative EKMGVLEEGGKGKEAAGSVRTEEVESRAYADPCSRRKKRQQQEEEDLNLEDRGEETVLGG
            ********************************************************.

reference   NQGSREQSMQ--------------------------------------------------
alternative --GTREAESRACSDGRSRKSKKKRQQHQEEEDILDVRDEKDGGAREAESRAHTGSSSRGK
              *:** . :

reference   ----------------------------
alternative RKRQQHPKKERAGVSTVQKAKKKQKKRD

The reference sequence consists of 370 amino acids. The alternative protein sequence

contains 446 amino acids, which is 76 more than the reference sequence. The alternative

protein is therefore a C-terminally extended proteoform of the reference protein. Both proteins

are present in UniProt as G patch domain-containing protein 4 proteoforms. However, the

reference protein is not experimentally validated and is therefore present in the TrEMBL section

(UniProt entry A0A0A0MRK1), whereas the alternative protein is reviewed and present in

the Swiss-Prot section (UniProt entry Q5T3I0).

It is important to note the position of the second identified peptide, namely

NH2-GEETVLGGGTR-COOH. This peptide is overlapping the amino acids that are changed

by the INDEL and therefore provides evidence of the presence of the alternative protein in the

studied sample.
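
The argument above can be checked directly on the C-terminal ends of the two aligned sequences. The following minimal Python sketch copies the last three alignment blocks shown above (gap characters removed) and tests in which sequence the peptide occurs. The variable names are illustrative and are not part of the PROTEOFORMER pipeline.

```python
# C-terminal ends of both proteoforms, copied from the alignment (gaps removed).
reference_tail = (
    "EKMGVLEEGGKGKEAAGSVRTEEVESRAYADPCSRRKKRQQQEEEDLNLEDRGEETFRWW"
    "NQGSREQSMQ"
)
alternative_tail = (
    "EKMGVLEEGGKGKEAAGSVRTEEVESRAYADPCSRRKKRQQQEEEDLNLEDRGEETVLGG"
    "GTREAESRACSDGRSRKSKKKRQQHQEEEDILDVRDEKDGGAREAESRAHTGSSSRGK"
    "RKRQQHPKKERAGVSTVQKAKKKQKKRD"
)
peptide = "GEETVLGGGTR"

# A peptide is only evidence for a proteoform if it occurs in that sequence.
in_reference = peptide in reference_tail
in_alternative = peptide in alternative_tail

# Both tails start at the same aligned residue, so the difference in their
# lengths equals the difference in total protein length.
length_difference = len(alternative_tail) - len(reference_tail)

print(in_reference, in_alternative, length_difference)
```

Because the peptide spans the point where the two sequences diverge, it can only be matched to the alternative sequence.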

The C-terminally extended proteoform is found as a single protein (figure 18). The term 'single

protein' refers to a protein that can be explained solely by a set of identified peptides. In

contrast, a protein identification is indistinguishable from other protein identifications when

the identified peptide sequences can jointly map to more than one protein sequence in the

database. This identification problem was also explained in section 2.2.2 as the protein

inference problem. In this case, all the indistinguishable proteins are collected in a group and

are identified as one ‘protein group’ instead of single proteins.
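
The distinction between a single protein and a protein group can be sketched in a few lines of Python. This is a deliberately simplified illustration: it groups proteins that are supported by exactly the same set of identified peptides, whereas real protein inference tools apply additional rules (for example for subset proteins). The protein and peptide labels below are hypothetical placeholders.

```python
from collections import defaultdict


def group_proteins(peptide_to_proteins):
    """Group proteins that are supported by exactly the same identified peptides."""
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins sharing an identical peptide set cannot be told apart.
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return [sorted(members) for members in groups.values()]


# Database with INDELs: both peptides map only to the extended proteoform.
with_indels = group_proteins({
    "GEETVLGGGTR": ["extended"],
    "shared_peptide": ["extended"],
})

# Database without INDELs: the only peptide maps to three proteins.
without_indels = group_proteins({
    "shared_peptide": ["reference", "isoform_359", "isoform_375"],
})

print(with_indels)
print(without_indels)
```

A group with one member corresponds to a 'single protein'; a group with several members corresponds to a 'protein group'.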

Figure 18: Protein inference graph of the C-terminally extended proteoform. The C-terminally

extended proteoform is represented by the black circle at the bottom of the graph. It is linked with the

two peptides (green circles) described in the main text. Because the presence of these two peptides can

only be explained by the C-terminally extended variant, this proteoform is found as a single protein.

(The reference protein is not present in the database due to the INDEL having an allele frequency of

1.0.) In addition, a third peptide (red circle) is present in the top of the figure. This was not considered

in the discussion of the peptides due to its confidence score of 0%. The remaining gray circles are

representing proteins containing this peptide.

The protein identification process making use of the database without variant information

gives rise to the identification of a protein group containing three proteins: the reference

protein (370 amino acids) associated with transcript ENST00000368232 and two other

proteins associated with transcript ENST00000438976 (one starting at position 156 599 013

of chromosome 1 resulting in a sequence of 359 amino acids, the other starting at position

156 601 439 resulting in a sequence of 375 amino acids). These proteins differ only slightly

from each other (the last 359 amino acids of every protein are identical) as they are all

products from the same gene but translated from different transcript isoforms. Absence of

identified peptides specific to one of these proteins limits the identification to a group instead

of a single protein (figure 19).

More importantly, none of these proteins, identified by the database without

INDEL information, are actually present in the studied sample. This conclusion can be drawn

from the allele frequency of the INDEL, which is 1.0.

In addition, since the three proteins are similar in the majority of their sequence, we would

expect to have found INDELs in the two other proteins as well. However, only the protein

associated with transcript ENST00000368232 is identified as a protein containing an INDEL.

In order to explain this result, the two other transcripts and their resulting protein sequences

were studied in more detail. It appears that the addition of an INDEL in these sequences gives

rise to incomplete proteins. This means that the inclusion of an INDEL in these transcripts

results in a new transcript sequence containing no stop codon. This makes it impossible to

produce a functional protein.

All the above results illustrate the importance of the addition of INDEL information to a

custom created protein sequence database.

Figure 19: Protein inference graph of the reference protein (blue circle) found using the database

without INDELs. Only one peptide (green circle) associated with the reference protein is found. This

peptide is linked with two other proteins (gray circles with yellow border) making it impossible to

distinguish one protein from another.

Protein groups containing proteins derived from Ensembl transcripts with INDELs

Next to the previously described single protein, 19 protein groups are identified containing at

least one proteoform derived from a transcript holding an INDEL. These INDEL-containing

proteoforms are not validated with the matching MS-based proteomics data, since they were

within protein inference groups also holding non-INDEL containing proteoforms. However,

they were still studied in more detail in order to evaluate the possible impact of INDELs on

the translational level.

In general, the INDELs give rise to three different types of proteoforms: C-terminally

truncated proteoforms, N-terminally extended proteoforms and proteoforms missing only one

amino acid with respect to the original protein lacking the INDEL. For every type an example

is given in figure 20 and described in more detail in the following paragraphs.

[Figure 20, panels: (a) ENST00000612661, (b) ENST00000612898, (c) ENST00000527673. Legend entries: reference protein, alternative protein, protein in protein group, protein not in protein group, validated peptide, non-validated peptide.]

Figure 20: Protein inference graphs from (a) a protein group containing a C-terminally truncated

proteoform (black circle) in comparison with the reference sequence (blue circle), (b) a protein group

containing an N-terminally extended proteoform (black circle) and (c) a protein group containing a

proteoform missing one amino acid (black circle) in comparison with the reference sequence (blue

circle).

Example of a C-terminally truncated proteoform

A first protein group consists of ten proteins. Two are derived from the Ensembl transcript

ENST00000612661 (start site at position 113 857 746 on chromosome 6): one containing a

deletion of an adenine at position 454 in the transcript sequence and one lacking this deletion.

In this case, the reading frame is shifted by one nucleotide from the deletion onwards.

As a result, a 'premature' stop codon is encountered, lying upstream of the

normal stop codon. This results in a C-terminally truncated proteoform.

More specifically, the reference proteoform contains 332 amino acids whereas the alternative

proteoform contains 165 amino acids. They only share their first 154 amino acids. The 178

remaining amino acids from the reference protein and the 11 remaining amino acids from the

alternative protein are completely different.

Due to the absence of peptides overlapping the distinct C-terminal part of the alternative

protein, this proteoform is not validated with the MS/MS data.
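
The mechanism described above, a single-nucleotide deletion that shifts the reading frame and exposes a premature stop codon, can be illustrated with a toy example. The sequence below is hypothetical and much shorter than ENST00000612661; only the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate from the first base until the first stop codon (or the end)."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def delete_base(sequence, position):
    """Remove the base at a 1-based position."""
    return sequence[:position - 1] + sequence[position:]


toy_reference = "ATGGCATTGACCAAATAA"              # hypothetical coding sequence
toy_alternative = delete_base(toy_reference, 5)   # delete the C of codon 2

reference_protein = translate(toy_reference)
alternative_protein = translate(toy_alternative)
print(reference_protein, alternative_protein)
```

In this toy case the two products share only their first residue before the frameshift, and the alternative product ends early, analogous to the 332 versus 165 amino acids reported above.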

Example of an N-terminally extended proteoform

A second group contains 22 proteins. An example of this category is derived from the

Ensembl transcript ENST00000612898 (start site lying in the 5' UTR region at position

27 838 622 on chromosome 6) containing a deletion of a thymine at the fourth position in the

transcript sequence.

Due to the position of the start site in the 5' UTR region, the protein sequence derived from

this transcript lacking the INDEL is completely different from the protein sequence derived

from the canonical start site (which is lying at position 27 838 662 on chromosome 6). Hence,

the start site in the 5' UTR region is lying out of frame with the canonical start site.

The deletion in the transcript sequence shifts the reading frame by one nucleotide. As a

result, the canonical start site now lies in frame with the 5' UTR start site, leading to an

alternative sequence of 139 amino acids. Herein, the last 126 amino acids are identical with

the reference sequence starting from the canonical start site. Hence, the alternative protein is

an N-terminally extended proteoform of this reference protein. Due to the absence of

identified peptides overlapping the N-terminally extended region, this proteoform was not

validated with MS/MS data.
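
The frame relationship between the two start sites can be verified with simple arithmetic on the positions given above. The sketch below assumes that both start sites lie in the same exon, so that the genomic distance equals the distance in the transcript; this assumption is not stated in the text and would need to be checked against the transcript annotation.

```python
# Genomic start positions on chromosome 6, as given in the text.
utr_start = 27_838_622
canonical_start = 27_838_662

# Assumption: no intron between the two start sites.
distance = abs(canonical_start - utr_start)      # nucleotides between start sites
in_frame_before = distance % 3 == 0              # without the deletion

distance_after_deletion = distance - 1           # one thymine deleted in between
in_frame_after = distance_after_deletion % 3 == 0
extra_codons = distance_after_deletion // 3      # length of the N-terminal extension

print(distance, in_frame_before, in_frame_after, extra_codons)
```

Under this assumption the extension is 13 codons long, which agrees with the reported lengths: 139 amino acids in total, of which the last 126 are identical to the reference.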

Example of a proteoform missing one amino acid in comparison with its reference protein

A next protein group consists of two proteins derived from the Ensembl transcript

ENST00000527673 (start site at position 119 018 284 on chromosome 11). Here, one of the

proteins contains a deletion of three nucleotides: the nucleotides AAGG starting at position 28

in the reference transcript sequence are replaced with a single guanine. Since a deletion of

three nucleotides occurs, the reading frame is not changed. Therefore, the deletion has only a

small effect on the protein, namely the absence of a lysine encoded by the three deleted

nucleotides. Also, the new proteoform is not validated due to the lack of peptides overlapping

this missing amino acid.
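
Whether an INDEL preserves the reading frame follows from its net length change. The helper below is a minimal sketch with a hypothetical function name; it takes the reference and alternative alleles as they would appear in a variant call and reports whether the net change is a multiple of three.

```python
def classify_indel(ref_allele, alt_allele):
    """Return the net length change and whether the reading frame is preserved."""
    net_change = len(alt_allele) - len(ref_allele)
    effect = "in-frame" if net_change % 3 == 0 else "frameshift"
    return net_change, effect


# The deletion described above: AAGG replaced by a single G.
three_nt_deletion = classify_indel("AAGG", "G")

# Hypothetical alleles for comparison.
one_nt_deletion = classify_indel("CA", "C")
two_nt_insertion = classify_indel("G", "GTA")

print(three_nt_deletion, one_nt_deletion, two_nt_insertion)
```

The sketch only looks at allele lengths; it does not check whether the variant lies inside the coding sequence.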

4.2.2 Exploration of the effect of SNPs on the identification process

In order to compare the effect of INDELs with the effect of SNPs on the peptide and protein

identification level, the protein identification process was repeated using a custom database

including SNP information. Here, the analysis was restricted to proteins having a confidence

score of at least 90%.

SNP-containing proteins lacking extra MS/MS validation

In total 30 SNP-containing proteins are identified as single proteins (i.e. not in an inference

group). In addition, 74 protein groups are identified that consist solely of SNP-containing

proteins. The majority of these proteins are not validated with peptides overlapping the SNP

affected amino acids. However, these proteins can still be distinguished from the reference

proteins (proteins without SNP) by means of their allele frequency: all of the SNPs associated

with single proteins or protein groups lacking the reference proteins have an allele

frequency of 1.0. This indicates that the database does not contain the reference proteins.

Hence, the alternative proteins can be identified without the need of peptides overlapping the

changed amino acids.
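
The role of the allele frequency in building the database can be summarised as a small decision rule. The function below is an illustrative sketch, not the PROTEOFORMER implementation: the function name is hypothetical, and the rule that every allele frequency below 1.0 keeps both proteoforms is an assumption generalised from the two cases discussed in this work (1.0 and 0.5).

```python
def proteoforms_to_add(allele_frequency):
    """Decide which proteoforms of a variant locus enter the custom database."""
    if not 0.0 < allele_frequency <= 1.0:
        raise ValueError("allele frequency must be in (0, 1]")
    if allele_frequency == 1.0:
        # Variant present on all alleles: the reference proteoform is left out.
        return ["alternative"]
    # Assumed rule: otherwise both proteoforms may be present.
    return ["reference", "alternative"]


homozygous = proteoforms_to_add(1.0)
heterozygous = proteoforms_to_add(0.5)
print(homozygous, heterozygous)
```

When only the alternative proteoform is in the database, any peptide matching that protein supports the alternative, even if it does not overlap the changed amino acid.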

An interesting remark can be made on the function of the identified proteins. At least two of

these proteins are related with DNA repair activity (MMS19 (methyl-methanesulfonate

sensitivity 19) nucleotide excision repair protein homolog [109] and TELO2 (telomere

maintenance 2)-interacting protein 1 homolog [110]). In addition, one possible oncogene

related with colorectal cancer (Pinin [111]) was found.

SNP-containing proteins with extra MS/MS validation

Only one protein group contains an MS/MS-validated SNP. This group (figure 21) consists of

two proteins: one derived from the Ensembl transcript ENST00000216605 (start site at

position 64 388 428 on chromosome 14) which has a length of 935 amino acids and is

submitted to Swiss-Prot (UniProt entry P11586) and one derived from the Ensembl transcript

ENST00000545908 (start site at position 64 388 260 on chromosome 14) which has a length

of 1020 amino acids and is submitted to TrEMBL (UniProt entry F5H2F4). Each contains two

SNPs (one at chromosomal position 64 415 662 replacing an adenine by a guanine and one at

chromosomal position 64 442 127 replacing a guanine by an adenine) with an allele frequency

of 1.0. Because both proteins are products from the same gene (MTHFD1 gene), their

sequences are highly similar. The alignment of these sequences is shown below. In addition,

the reference sequence of ENST00000545908 (UniProt entry F5H2F4) is added to illustrate

the amino acid changes. The bold, underlined amino acids represent the identified peptides.

Due to the presence of a peptide starting after the first methionine (i.e. the initiator methionine

has been cleaved by the methionine aminopeptidase), the protein derived from Ensembl

transcript ENST00000216605 is more likely to be present in the sample. The green amino

acids represent the amino acids that are changed by the SNPs.

ENST00000216605 --------------------------------------------------------MAPA

ENST00000545908 MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA

F5H2F4 MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA

****

ENST00000216605 EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG

ENST00000545908 EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG

F5H2F4 EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG

************************************************************

ENST00000216605 IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV

ENST00000545908 IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV

F5H2F4 IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV

************************************************************

ENST00000216605 DGLTSINAGRLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL

ENST00000545908 DGLTSINAGRLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL

F5H2F4 DGLTSINAGKLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL

*********:**************************************************

ENST00000216605 LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD

ENST00000545908 LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD

F5H2F4 LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD

************************************************************

ENST00000216605 KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM

ENST00000545908 KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM

F5H2F4 KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM

************************************************************

ENST00000216605 IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK

ENST00000545908 IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK

F5H2F4 IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK

************************************************************

ENST00000216605 HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA

ENST00000545908 HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA

F5H2F4 HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA

************************************************************

ENST00000216605 AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN

ENST00000545908 AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN

F5H2F4 AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN

************************************************************

ENST00000216605 GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI

ENST00000545908 GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI

F5H2F4 GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI

************************************************************

ENST00000216605 TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED

ENST00000545908 TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED

F5H2F4 TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED

************************************************************

ENST00000216605 LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADQIALKLVGPEGF

ENST00000545908 LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADQIALKLVGPEGF

F5H2F4 LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADRIALKLVGPEGF

************************************************:***********

ENST00000216605 VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI

ENST00000545908 VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI

F5H2F4 VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI

************************************************************

ENST00000216605 QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK

ENST00000545908 QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK

F5H2F4 QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK

************************************************************

ENST00000216605 CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE

ENST00000545908 CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE

F5H2F4 CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE

************************************************************

ENST00000216605 AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV

ENST00000545908 AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV

F5H2F4 AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV

************************************************************

ENST00000216605 GTMSTMPGLPTRPCF-------YDID-------LDPETEQVNGL--------F-------

ENST00000545908 GTITIHLQEATLKVWPVSIQAHWELGSISKPREVSPCPEDLKLIVGVSPEVIFSLNSHHV

F5H2F4 GTITIHLQEATLKVWPVSIQAHWELGSISKPREVSPCPEDLKLIVGVSPEVIFSLNSHHV

**:: * : :::. :.* *::: : *

The first SNP is validated by an overlapping peptide. It replaces a lysine (K) by an arginine

(R) at sequence position 134 in the protein derived from transcript ENST00000216605 and at

sequence position 190 in the protein derived from transcript ENST00000545908. The second

SNP replaces an arginine (R) by a glutamine (Q) at sequence position 653 in the protein

derived from transcript ENST00000216605 and at sequence position 709 in the protein

derived from transcript ENST00000545908. This SNP is not validated by the matching

MS/MS data. Still, it is considered present by reason of its allele frequency of 1.0.
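
The two sets of sequence positions differ by a constant offset, namely the length of the N-terminal stretch that is present only in the protein derived from ENST00000545908. The short sketch below derives this offset from the first block of the alignment and recomputes the positions; the variable names are illustrative.

```python
# First alignment block of the protein derived from ENST00000545908.
extended_block = "MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA"

# The protein derived from ENST00000216605 starts at "MAPA".
offset = extended_block.index("MAPA")

# SNP positions in the shorter protein, as given in the text.
snp_positions_short = [134, 653]
snp_positions_long = [position + offset for position in snp_positions_short]

print(offset, snp_positions_long)
```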

Figure 21: Protein inference graph of the protein group containing a validated SNP. The blue dots

represent the two proteins that are grouped together. The smaller, green dots represent the identified

peptides. The peptide overlapping the SNP affected amino acid is shown in light green. The small, red

dot represents a non-validated peptide.

5 Discussion

The interest in genetic variation is associated with its influence on the phenotypic level,

especially when it comes to disease susceptibility. In this work, the focus lies on the human

cancer proteome. From previous studies, SNPs and INDELs appear to be very abundant in

human cancer cells [48]. Therefore, their effect on the protein identification rate was studied.

The inclusion of INDELs resulted in the identification of one extra protein containing an

insertion of two nucleotides at the end of its transcript sequence. This caused a shift in the

reading frame, generating a new stop codon lying downstream of the original one. Hence, a

C-terminally extended proteoform was present. This proteoform was validated with one peptide

derived from the altered C-terminus and one peptide shared with the reference protein. The

former made it possible to discern the alternative from the reference proteoform. This,

however, did not indicate the absence of the reference proteoform. In fact, the absence of

proteins can never be confirmed by MS [112]. In this case, however, the allele frequency of

1.0 indicated the absence of the reference protein.

Next to the discovered single protein, 19 protein groups were detected containing one or more

alternative proteoforms. Unfortunately, no peptides were found overlapping the altered amino

acids. As a consequence, the presence of these proteoforms was not validated. Still, these

contained valuable information about the putative impact of INDELs on the protein

translation products. According to these examples, the effect of an INDEL is associated with

its length. Here, a deletion of three nucleotides resulted in the loss of only one amino acid

while maintaining the reading frame. Generally, INDELs which are multiples of three

nucleotides (in-frame INDELs) do not change the reading frame resulting in a minor effect on

the protein sequence [113]. In contrast, INDELs with a length not divisible by three do alter

the reading frame. Here, the position of the INDEL in the transcript sequence plays an

important role. For example, the INDEL in the C-terminal part induced a new stop codon

upstream of the original stop codon, resulting in a C-terminally truncated proteoform. Apart

from the C-terminally extended or truncated proteoforms, the human proteome contains

N-terminal proteoforms. These are associated with alternative start sites [114]. Alternative start

sites located in the 5' UTR give rise to N-terminally extended proteoforms, whereas

alternative start sites located downstream of the canonical start site result in N-terminally

truncated proteins. However, this is only the case when the alternative and canonical start

sites are lying in frame. Here, an N-terminally extended proteoform was found including an

INDEL lying upstream of the canonical start codon. This INDEL was responsible for a frame

shift of one nucleotide causing the near-cognate site to lie in frame with the canonical start

site.

The inclusion of SNP information in the PROTEOFORMER pipeline has led to the

identification of 30 single SNP-containing proteins and 74 protein groups consisting solely of

SNP-containing proteins. Again, only one SNP was validated with MS/MS. This SNP was

present in two proteins and was responsible for the replacement of a lysine by an arginine.

The inconsistency between the number of identified proteins containing a SNP and the

number of SNPs validated with MS/MS illustrates the importance of adding extra

experimental data to the custom database. In fact, the identification is based on the allele

frequency of the SNPs. All 30 single proteins and 74 protein groups contain SNPs with an

allele frequency of 1.0. This means that the reference protein was not found in the sample and

therefore was not added to the database. This makes the identification of the alternative

proteins possible. In case of SNPs having an allele frequency of 0.5, both reference and

alternative proteins are added to the custom database. In the absence of peptides overlapping

the altered amino acids, the alternative and reference protein cannot be distinguished. In

summary, the SNP-calling based on the RIBO-seq data points to a full inclusion of the

mutation point in both alleles (i.e. allele frequency of 1.0) and thus gives evidence of the

alternative sequence (although matching MS/MS validation is missing).

The results illustrate the possible impact of INDELs and SNPs on the diversity of the human

proteome. Although the study here was restricted to the effect of these variants on molecular

level, other studies have been focusing on their effect on cancer development. Cancer results

from somatic mutations in three different groups of genes: tumour-suppressor genes,

oncogenes and stability genes. These are all active in the cell-cycle regulation through the

production of proteins initiating cell division (oncogenes), controlling cell division (tumour-

suppressor genes) or providing DNA repair activity (stability genes) [48,115]. The latter are

important in maintaining genomic stability. Due to the inactivation of these DNA repair

genes, mutations in oncogenes and tumour-suppressor genes can accumulate [116]. The latter

two are the key drivers in the development of cancer.

Oncogenes are associated with a gain of function. This is a result of precise mutations at

specific locations resulting in a small modification of the protein [113]. Both mutations

leading to the over-expression of a protein as well as mutations resulting in continuously

activated proteins may lead to cancer development [115]. The majority of these mutations are

missense mutations. However, in-frame INDELs also occur frequently in oncogenes

[113]. In contrast with oncogenes, tumour-suppressor genes contribute to cancer by

loss-of-function mutations. Here, frameshifting INDELs are playing an important role [113].

Normally, the introduction of a frameshifting INDEL would lead to the production of an

aberrant protein. These are generally not present due to the action of nonsense-mediated

mRNA decay [117]. However, in cancer development, frameshifting INDELs in

tumour-suppressor genes are often retained by means of positive selection. Indeed,

non-functional tumour-suppressor genes are not able to stop the cell cycle or induce apoptosis

in case of excessive cell growth. This results in the accumulation of tumour cells [115]. Next

to these frameshifting INDELs, tumour-suppressor genes also contain a large number of

missense and nonsense substitutions (substitutions leading to a 'premature' stop codon). In

addition, a small proportion of the mutations are caused by in-frame INDELs [113].

In general, the development of cancer occurs in different phases due to the accumulation of

multiple mutations. Both the function of the oncogenes as well as the tumour-suppressor

genes need to be affected. Hence, different genes are often playing a role. Various studies

have been focusing on the association between genetic variation and the risk of developing

colorectal cancer. It has been found that a lot of colorectal tumours are initiated by mutations

in three tumour-suppressor genes (APC (Adenomatous polyposis coli) [118], tumour protein

53 (TP53) [119] and SMAD4 (mothers against decapentaplegic homolog 4) [120]) and two

oncogenes (KRAS (Kirsten rat sarcoma viral oncogene homolog) [121] and BRAF (B-Raf

proto-oncogene serine/threonine kinase) [122]). In addition, numerous other genes are

associated with an increased risk of colorectal cancer. In a recent meta-analysis [123], 950

colorectal cancer studies were investigated. From this analysis, variations in 50 genes were

found to be associated with an elevated risk of colorectal cancer. One of these genes was also

identified with the database containing SNP information, namely the MTHFD1

(methylenetetrahydrofolate dehydrogenase 1) gene. This gene generates a protein with

enzymatic activity in the folate-mediated one-carbon metabolism pathway. This pathway

plays an important role in the synthesis and repair of DNA by de novo production of purines

and thymidylate. It is also responsible for the production of methionine which is required for

DNA methylation. As a consequence, polymorphisms in the MTHFD1 gene may have a

possible impact on the genetic stability and gene expression resulting in cancer susceptibility

[124]. In addition, at least three other proteins related with cancer development were detected

in this work. These include two proteins with DNA repair activity (MMS19 nucleotide

excision repair protein homolog and TELO2-interacting protein 1 homolog) and a possible

oncogene (Pinin). The alternative G patch domain-containing protein 4 also occurs in

some cancer studies [125,126]. In one of these studies, a proteogenomic approach was used to

assess the impact of mutations in DNA mismatch repair genes on the proteome. This resulted

in the validation of variant peptides including the sequence GEETVLGGGTR which was also

validated in this work. However, its exact relation with cancer was not elucidated [126]. All

these results illustrate the potential benefits of proteogenomic cancer studies in clinical

approaches, namely the discovery and use of new biomarkers. By combining proteomic and

genomic approaches, variant proteins associated with tumour development can be detected

[127].

An important remark can be made on the impact of the detected variants on the translational

level. In figure 15 the number of alternative proteins having one or more INDEL(s) or SNP(s)

is shown. In total, 7348 SNP-containing proteins are present in the database. Of these, 30 single

proteins and 74 protein groups are detected with a matching MS-based proteomics

experiment. However, only one SNP was validated with mass spectrometry evidence. In

addition, only one alternative protein with an INDEL out of 345 possible proteins was

validated. According to these results, only a limited number of the alternative proteins is picked up

with mass spectrometry. In other words, these proteins are associated with a low identification

rate. This limitation is also present in other proteogenomics studies [127,128]. For example,

in a recent study [129], the identification rate of single amino acid variants in breast cancer

was studied. Only 10 % of the SNPs detected using whole genome sequencing and whole

transcriptome sequencing were validated with MS/MS (figure 22). Multiple explanations are

possible for this phenomenon. For one thing, genetic variation has been linked with a

decreased efficiency in translation [130,131] and an elevated level of protein degradation

[132,133]. In addition to these biological consequences, the validation of genetic variants is

limited by the MS/MS process itself [127,128,129]. Efficient peptide identification, for

example, is restricted to peptides with a length from 6 to 30 amino acids. Therefore, all

variants incorporated in peptides falling outside this range will most likely be missed.

Because this limitation in sequence coverage of MS/MS is associated with trypsin digestion,

alternative approaches using multiple enzymes and fragmenting methods need to be studied

[129].
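
The length restriction mentioned above can be made concrete with an in-silico digestion. The sketch below applies the common simplified trypsin rule (cleavage after K or R, but not before P, without missed cleavages) and keeps peptides of 6 to 30 amino acids. The function names are hypothetical, and the fragment is taken from the alternative C-terminus of G patch domain-containing protein 4 shown in the results. It relies on `re.split` accepting zero-width matches, which requires Python 3.7 or later.

```python
import re


def tryptic_peptides(protein):
    """Cleave after K or R, except when the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def detectable(peptides, min_length=6, max_length=30):
    """Keep peptides within the length window quoted in the text."""
    return [p for p in peptides if min_length <= len(p) <= max_length]


# Part of the alternative C-terminus around the INDEL-affected residues.
fragment = "LEDRGEETVLGGGTREAESR"

peptides = tryptic_peptides(fragment)
observable = detectable(peptides)
print(peptides, observable)
```

In this fragment only the variant-spanning peptide falls inside the window; the two neighbouring peptides are too short and would most likely be missed.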

Figure 22: Venn diagram showing the number of SNPs detected by whole genome DNA-seq (blue

circle), whole transcriptome RNA-seq (pink circle) and MS/MS-based proteomics (orange circle) in

luminal breast tumours [129].

A comparative analysis of the protein identification process using the databases with SNPs

and INDELs shows that the addition of SNPs has a more powerful effect. Indeed, more than

100 SNP-containing proteins and only one INDEL-containing protein were identified. This

difference can be explained by several factors. First of all, SNPs are more abundant in the

human genome [134,135]. Secondly, the success of INDEL detection depends on the

chosen INDEL calling tool. Here, an alignment-based method was used to find variants in

short-read data. This method seems suitable for the detection of SNPs, but is more error-prone

when detecting INDELs. In the case of deletions, the detection is complicated due to the

presence of a gap between the reference sequences. Insertions, on the other hand, complicate

the process by introducing new sequences which cannot be mapped to the reference genome.

In addition, reads with long insertions are flanked with shorter reference sequences which

makes the mapping procedure more difficult [136].
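
The effect of the insertion length on the mappable part of a read can be illustrated with a small calculation. The sketch below assumes a read of about 28 nucleotides, the approximate ribosome footprint length, and counts how many bases remain to match the reference when the read spans a complete insertion. The function name is hypothetical and the model ignores mismatches and partial overlaps.

```python
def reference_bases_in_read(read_length, insertion_length):
    """Bases of a read left to match the reference when it spans a whole insertion."""
    if insertion_length >= read_length:
        return 0
    return read_length - insertion_length


footprint_length = 28  # approximate ribosome footprint length in nucleotides

remaining = {
    insertion_length: reference_bases_in_read(footprint_length, insertion_length)
    for insertion_length in (1, 5, 10, 20)
}
print(remaining)
```

The remaining bases are split over the two flanks of the insertion, so each flank is shorter still, which makes a unique alignment less likely as the insertion grows.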

6 Conclusion

Today, a lot of studies rely on the identification of proteins using MS-based technologies.

This implies the need of well annotated, comprehensive protein sequence databases.

Nowadays, custom databases are generated using novel information, both from genomic level

as well as transcriptional level. This has led to the discovery of new proteoforms that could

not be detected using publicly available databases lacking this information.

The progress in genomics and transcriptomics goes hand in hand with the introduction of new

sequencing techniques. Currently, it is possible to assess the translational status of mRNA

fragments using ribosome profiling. With the aid of the PROTEOFORMER pipeline, this

information can easily be transformed into a custom protein sequence database.

In this master thesis, the impact of INDELs on the identification rate of human cancer

proteins was studied and compared with the impact of SNPs. For this, ribosome profiling data

from the human colorectal cancer cell line HCT116 was used to create a custom protein

sequence database. From this database, one INDEL and one SNP were validated with MS/MS. These low

numbers may be a result of the low sequence coverage of MS/MS and the possible negative

effect of genetic variation on the translation process and the protein stability. In addition, a

large contrast was found between the number of alternative proteins detected using the

database with INDELs and the number of alternative proteins detected using the database

with SNPs. Both the relative frequency of appearance of these variants as well as the

limitations of the current INDEL calling strategy can explain this difference. Hence, further

improvements of the PROTEOFORMER pipeline with regard to genetic variation are

expected to give an additional increase in the protein identification rate.

7 Further research

In this master thesis, the impact of SNPs and INDELs on the protein identification rate of human colorectal cancer cells was studied. Adding these variants to the PROTEOFORMER pipeline improved protein identification by enabling the detection of alternative proteins. However, this study has also shown that the pipeline is still incomplete. The current limitations and possible improvements are explained in more detail here.

A first improvement can be made by using other variant callers. As previously stated, alignment-based methods, like SAMtools, are less effective in detecting INDELs. As a result, new INDEL calling tools have been developed. These are often based on paired-end RNA sequencing. In contrast with single-read sequencing, paired-end methods sequence both ends of each fragment [137]. Pindel, for example, uses a split-read mapping approach based on short-read paired-end sequencing data [138]. In addition, haplotype-based methods are also suitable for calling INDELs. Here, alignment information is used to find the regions of interest, which are then assembled into possible haplotypes. By realigning the reads to these haplotypes, INDELs are detected. Although these methods seem to overcome the limitations of alignment-based methods, they are still not optimal. The best outcome is therefore obtained by combining multiple techniques [65]. Given the general difficulty of detecting longer INDELs in short reads, read length is also an important factor in the success of INDEL detection. In this context, ribosome profiling might be questioned as the method of choice, since it limits the read length to approximately 28 nucleotides. This problem can be addressed by adding support for paired-end RNA sequencing of longer reads [139]. In this way, the sequencing step is split into two parts: ribosome profiling provides information about the translational status, including TIS information, whereas paired-end RNA-seq is used to find genetic variations.
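The idea of combining several callers can be sketched as a simple voting scheme. The example is a minimal illustration and not part of the PROTEOFORMER pipeline. It assumes that the output of each caller has already been parsed into (chromosome, position, reference allele, alternative allele) tuples that are normalised in the same way. The caller names, the example calls and the `min_support` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# A variant is identified by chromosome, 1-based position, REF allele and ALT allele.
Variant = Tuple[str, int, str, str]


def is_indel(variant: Variant) -> bool:
    """An INDEL changes the sequence length, so REF and ALT differ in length."""
    _, _, ref, alt = variant
    return len(ref) != len(alt)


def consensus_indels(calls: Dict[str, Set[Variant]], min_support: int = 2) -> List[Variant]:
    """Return the INDELs reported by at least `min_support` callers, sorted."""
    support: Counter = Counter()
    for variants in calls.values():
        # Each caller votes at most once per variant because its calls form a set.
        support.update(v for v in variants if is_indel(v))
    return sorted(v for v, n in support.items() if n >= min_support)


# Hypothetical, hand-written calls; real input would come from parsed VCF files.
calls = {
    "caller_a": {("1", 1000, "A", "AT"), ("2", 500, "G", "C")},
    "caller_b": {("1", 1000, "A", "AT"), ("3", 42, "CTG", "C")},
    "caller_c": {("3", 42, "CTG", "C"), ("4", 7, "T", "TA")},
}

print(consensus_indels(calls))
```

Setting `min_support` to 1 yields the union of all calls (higher sensitivity), whereas setting it to the number of callers yields their intersection (higher precision).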

Furthermore, the overall detection rate of both SNPs and INDELs from the ribosome profiling data is limited. Figure 12 illustrates the number of variants detected from the ribosome profiling data. Only a small number of INDELs were picked up with SAMtools; the majority were identified using dbSNP. Most of the identified SNPs were also present in dbSNP. This shows that the number of new genetic variants discovered from the experimental data is relatively low. The same conclusion can be drawn from other proteogenomics studies. In recent work [127], human colon and rectal tumours were characterized using a proteogenomics approach based on RNA-seq. For this, tumour samples from The Cancer Genome Atlas (TCGA) [140] were subjected to LC-MS/MS shotgun proteomics. Single nucleotide variants were detected using matching RNA-seq data. Figure 23 shows the number of detected single amino acid variants in the analyzed samples. A few of these variants were previously reported by TCGA, and most were present in dbSNP. In addition, a number of COSMIC-supported variants were detected. COSMIC (Catalogue Of Somatic Mutations In Cancer) [141] is a database collecting somatic mutations in human cancer genes. Only a small number of the variants were new. This study illustrates that most of the variants discovered in proteogenomics studies are already annotated in public databases. A further increase in the protein identification rate is therefore expected when these previously annotated variants are added to the pipeline. To limit the size of the database, only cancer-related variants need to be included. The addition of COSMIC variants related to colorectal cancer therefore seems the best approach.

Figure 23: Overview of the number of single amino acid variants (SAAVs) per TCGA sample. The number of somatic variants reported by TCGA is shown in the upper graph. The second and third graphs represent the remaining number of somatic variants present in COSMIC and dbSNP, respectively. The last graph visualizes the number of new variants. Adapted from [127].
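The proposed selection of previously annotated, cancer-related variants can be sketched as follows. This is a minimal illustration under stated assumptions: the `AnnotatedVariant` record, the example coordinates and the tissue labels (such as `large_intestine`) are hypothetical and do not reflect the actual export formats of COSMIC or dbSNP.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # chromosome, position, REF, ALT


@dataclass(frozen=True)
class AnnotatedVariant:
    """Hypothetical record for a variant taken from a public resource."""
    key: VariantKey
    source: str        # e.g. "COSMIC" or "dbSNP"
    primary_site: str  # tissue annotation; empty when the resource has none


def classify(detected: Iterable[VariantKey],
             known: Iterable[AnnotatedVariant]) -> Dict[str, List[VariantKey]]:
    """Group detected variants by the resources that list them ('novel' if none)."""
    sources: Dict[VariantKey, Set[str]] = {}
    for record in known:
        sources.setdefault(record.key, set()).add(record.source)
    groups: Dict[str, List[VariantKey]] = {}
    for key in detected:
        label = "+".join(sorted(sources[key])) if key in sources else "novel"
        groups.setdefault(label, []).append(key)
    return groups


def cancer_related(known: Iterable[AnnotatedVariant], site: str) -> List[AnnotatedVariant]:
    """Keep only COSMIC records annotated with the tissue of interest."""
    return [r for r in known if r.source == "COSMIC" and r.primary_site == site]


# Invented example records; coordinates and labels are for illustration only.
known = [
    AnnotatedVariant(("12", 100, "C", "T"), "COSMIC", "large_intestine"),
    AnnotatedVariant(("12", 100, "C", "T"), "dbSNP", ""),
    AnnotatedVariant(("7", 200, "A", "G"), "dbSNP", ""),
    AnnotatedVariant(("17", 300, "G", "A"), "COSMIC", "breast"),
]
detected = [("12", 100, "C", "T"), ("7", 200, "A", "G"), ("1", 50, "T", "TA")]

print(classify(detected, known))
print(cancer_related(known, "large_intestine"))
```

The first function mirrors the kind of breakdown shown in Figure 23, while the second restricts the database extension to variants of one tissue so that the search space stays limited.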

In summary, the implementation of new INDEL detection methods and the incorporation of genetic variants derived from public databases into the PROTEOFORMER pipeline are expected to cause an additional increase in the protein identification rate.

Bibliography

[1] R. Dahm, “Friedrich Miescher and the discovery of DNA,” Dev. Biol., vol. 278, no. 2,

pp. 274–288, 2005.

[2] J. M. Heather and B. Chain, “The sequence of sequencers: The history of sequencing

DNA,” Genomics, vol. 107, no. 1, pp. 1–8, 2016.

[3] L. Liu et al., “Comparison of next-generation sequencing systems.,” J. Biomed.

Biotechnol., vol. 2012, p. 251364, 2012.

[4] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nat. Biotechnol., vol. 26,

no. 10, pp. 1135–1145, 2008.

[5] E. R. Mardis, “The impact of next-generation sequencing technology on genetics,”

Trends Genet., vol. 24, no. 3, pp. 133–141, 2008.

[6] J. M. Rothberg and J. H. Leamon, “The development and impact of 454 sequencing.,”

Nat. Biotechnol., vol. 26, no. 10, pp. 1117–1124, 2008.

[7] R. J. Roberts, M. O. Carneiro, and M. C. Schatz, “The advantages of SMRT

sequencing,” Genome Biol., vol. 14, no. 6, p. 405, 2013.

[8] Pacific Biosciences, “Advance genomics with Single Molecule, Real-time (SMRT)

sequencing.” [Online]. Available:

http://www.pacb.com/smrt-science/smrt-sequencing/. [Accessed: 06-May-2017].

[9] S. Goodwin, J. Gurtowski, S. Ethe-Sayers, P. Deshpande, M. Schatz, and W. R.

McCombie, “Oxford Nanopore sequencing, hybrid error correction, and de novo

assembly of a eukaryotic genome,” Genome Res., vol. 25, no. 11, pp. 1750–1756,

2015.

[10] Y. Zhang, B. R. Fonslow, B. Shan, M. Baek, and J. R. Yates, “Protein Analysis by

Shotgun / Bottom-up Proteomics,” Chem. Rev., vol. 113, no. 4, pp. 2343–2394, 2013.

[11] X. Han, A. Aslanian, and J. Yates, “Mass Spectrometry for Proteomics,” Curr Opin

Chem Biol., vol. 12, no. 5, pp. 483–490, 2008.

[12] A. I. Nesvizhskii, “Proteogenomics: concepts, applications and computational

strategies,” Nat. Methods, vol. 11, no. 11, pp. 1114–1125, 2014.

[13] R. Aebersold and M. Mann, “Mass spectrometry-based proteomics.,” Nature, vol. 422,

no. 6928, pp. 198–207, 2003.

[14] M. W. Duncan, R. Aebersold, and R. M. Caprioli, “The pros and cons of peptide-

centric proteomics,” Nat. Biotechnol., vol. 28, no. 7, pp. 659–664, 2010.

[15] C. C. Wu and M. J. MacCoss, “Shotgun proteomics: tools for the analysis of complex

biological systems.,” Curr. Opin. Mol. Ther., vol. 4, no. 3, pp. 242–50, 2002.

[16] C. Ansong, S. O. Purvine, J. N. Adkins, M. S. Lipton, and R. D. Smith,

“Proteogenomics: Needs and roles to be filled by proteomics in genome annotation,”

Briefings Funct. Genomics Proteomics, vol. 7, no. 1, pp. 50–62, 2008.

[17] J. S. Cottrell, “Protein identification using MS/MS data,” J. Proteomics, vol. 74, no.

10, pp. 1842–1851, 2011.

[18] A. I. Nesvizhskii, “A survey of computational methods and error rate estimation

procedures for peptide and protein identification in shotgun proteomics,” J.

Proteomics, vol. 73, no. 11, pp. 2092–2123, 2010.

[19] J. V Olsen, S.-E. Ong, and M. Mann, “Trypsin cleaves exclusively C-terminal to

arginine and lysine residues.,” Mol. Cell. proteomics, vol. 3, no. 6, pp. 608–14, 2004.

[20] A. I. Nesvizhskii and R. Aebersold, “Interpretation of Shotgun Proteomic Data,” Mol.

Cell. Proteomics, vol. 4, no. 10, pp. 1419–1440, 2005.

[21] N. Castellana and V. Bafna, “Proteogenomics to discover the full coding content of

genomes: A computational perspective,” J. Proteomics, vol. 73, no. 11, pp. 2124–2135,

2010.

[22] G. Menschaert and D. Fenyö, “Proteogenomics from a bioinformatics angle: a growing

field,” Mass Spectrom. Rev., vol. 2015, no. 9999, pp. 1–16, 2015.

[23] S. Tanner et al., “Improving gene annotation using peptide mass spectrometry,”

Proteome, vol. 17, no. 2, pp. 231–239, 2007.

[24] M. Mann and A. Pandey, “Use of mass spectrometry-derived data to annotate

nucleotide and protein sequence databases,” Trends Biochem. Sci., vol. 26, no. 1, pp.

54–61, 2001.

[25] L. M. Smith et al., “Proteoform: a single term describing protein complexity,” Nat.

Methods, vol. 10, no. 3, pp. 186–187, 2013.

[26] N. Fortelny, P. Pavlidis, and C. M. Overall, “The path of no return--Truncated protein

N-termini and current ignorance of their genesis,” Proteomics, vol. 15, no. 14, pp.

2547–2552, 2015.

[27] G. Menschaert, V. Olexiouk, and S. Verbruggen, “Proteoformer 2.0 (in preparation),”

2017.

[28] B. Boeckmann et al., “The SWISS-PROT protein knowledgebase and its supplement

TrEMBL in 2003,” Nucleic Acids Res., vol. 31, no. 1, pp. 365–370, 2003.

[29] D. A. Bitton, D. L. Smith, Y. Connolly, P. J. Scutt, and C. J. Miller, “An integrated

mass-spectrometry pipeline identifies novel protein coding-regions in the human

genome,” PLoS One, vol. 5, no. 1, pp. 1–10, 2010.

[30] G. M. Sheynkman, M. R. Shortreed, B. L. Frey, and L. M. Smith, “Discovery and mass

spectrometric analysis of novel splice-junction peptides using RNA-Seq.,” Mol. Cell.

Proteomics, vol. 12, pp. 2341–53, 2013.

[31] D. Yagoub et al., “Proteogenomic Discovery of a Small, Novel Protein in Yeast

Reveals a Strategy for the Detection of Unannotated Short Open Reading Frames,” J.

Proteome Res., vol. 14, no. 12, pp. 5038–5047, 2015.

[32] N. T. Ingolia, S. Ghaemmaghami, J. R. S. Newman, and J. S. Weissman, “Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling,” Science, vol. 324, no. 5924, pp. 218–223, 2009.

[33] N. T. Ingolia, “Ribosome profiling: new views of translation, from single codons to

genome scale.,” Nat. Rev. Genet., vol. 15, no. 3, pp. 205–13, 2014.

[34] H. Chassé, S. Boulben, V. Costache, P. Cormier, and J. Morales, “Analysis of

translation using polysome profiling,” Nucleic Acids Res., vol. 45, no. 3, p. e15, 2017.

[35] N. T. Ingolia, “Ribosome Footprint Profiling of Translation throughout the Genome,”

Cell, vol. 165, no. 1, pp. 22–33, 2016.

[36] A. M. Michel and P. V. Baranov, “Ribosome profiling: A Hi-Def monitor for protein

synthesis at the genome-wide scale,” Wiley Interdiscip. Rev. RNA, vol. 4, no. 5, pp.

473–490, 2013.

[37] N. T. Ingolia, “Genome-wide translational profiling by Ribosome footprinting,”

Methods Enzymol., vol. 470, pp. 119–142, 2010.

[38] N. T. Ingolia, G. A. Brar, S. Rouskin, A. M. McGeachy, and J. S. Weissman, “The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments,” Nat. Protoc., vol. 7, no. 8, pp. 1534–1550, 2013.

[39] N. T. Ingolia, L. Lareau, and J. Weissman, “Ribosome Profiling of Mouse Embryonic Stem Cells Reveals Complexity of Mammalian Proteomes,” Cell, vol. 147, no. 4, pp. 789–802, 2012.

[40] S. Lee, B. Liu, S. Lee, S.X. Huang, B. Shen, and S.B. Qian, “Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution,” Proc. Natl. Acad. Sci. U. S. A., vol. 109, no. 37, p. 32, 2012.

[41] W. McKeehan and B. Hardesty, “The mechanism of cycloheximide inhibition of

protein synthesis in rabbit reticulocytes,” Biochem. Biophys. Res. Commun., vol. 36,

no. 4, pp. 625–630, 1969.

[42] V. Olexiouk, J. Crappé, S. Verbruggen, K. Verhegen, L. Martens, and G. Menschaert,

“SORFs.org: A repository of small ORFs identified by ribosome profiling,” Nucleic

Acids Res., vol. 44, no. D1, pp. D324–D329, 2016.

[43] M. A. Basrai, P. Hieter, and J. D. Boeke, “Small Open Reading Frames: Beautiful Needles in the Haystack,” Genome Res., vol. 7, pp. 768–771, 1997.

[44] R. Gibbs et al., “The International HapMap Project.,” Nature, vol. 426, no. 6968, pp. 789–796, 2003.

[45] D. C. Rees, T. N. Williams, and M. T. Gladwin, “Sickle-cell disease,” Lancet, vol. 376,

no. 9757, pp. 2018–2031, 2010.

[46] J. L. Bobadilla, M. Macek, J. P. Fine, and P. M. Farrell, “Cystic fibrosis: A worldwide

analysis of CFTR mutations - Correlation with incidence data and application to

screening,” Hum. Mutat., vol. 19, no. 6, pp. 575–606, 2002.

[47] E. C. Landels, I. H. Ellis, A. H. Fensom, P. M. Green, and M. Bobrow, “Frequency of

the Tay-Sachs disease splice and insertion mutations in the UK Ashkenazi Jewish

population,” J Med Genet, vol. 28, pp. 177–180, 1991.

[48] H. Yang, Y. Zhong, C. Peng, J.-Q. Chen, and D. Tian, “Important role of indels in

somatic mutations of human cancer genes.,” BMC Med. Genet., vol. 11, no. 128, 2010.

[49] B. Vogelstein and K. Kinzler, “Cancer genes and the pathways they control,” Nat.

Med., vol. 10, no. 8, pp. 789–799, 2004.

[50] C. Greenman et al., “Patterns of somatic mutation in human cancer genomes,” Nature,

vol. 446, no. 7132, pp. 153–158, 2007.

[51] Z. Sun, A. Bhagwate, N. Prodduturi, P. Yang, and J.-P. A. Kocher, “Indel detection

from RNA-seq data: tool evaluation and strategies for accurate detection of actionable

mutations,” Brief. Bioinform., pp. 1–11, 2016.

[52] J. Crappé et al., “PROTEOFORMER: deep proteome coverage through ribosome

profiling and MS integration.,” Nucleic Acids Res., vol. 43, no. 5, p. e29, Mar. 2015.

[53] M. Fresno, A. Jiménez, and D. Vázquez, “Inhibition of translation in eukaryotic systems by harringtonine.,” Eur. J. Biochem., vol. 72, no. 2, pp. 323–330, 1977.

[54] T. Schneider-Poetsch et al., “Inhibition of Eukaryotic Translation Elongation by Cycloheximide and Lactimidomycin,” Nat. Chem. Biol., vol. 6, no. 3, pp. 209–217, 2010.

[55] M. Kozak, “Downstream secondary structure facilitates recognition of initiator codons

by eukaryotic ribosomes.,” Proc. Natl. Acad. Sci. U. S. A., vol. 87, no. 21, pp. 8301–

8305, 1990.

[56] G. Menschaert et al., “Deep proteome coverage based on ribosome profiling aids mass

spectrometry-based protein and peptide discovery and provides evidence of alternative

translation products and near-cognate translation initiation events.,” Mol. Cell.

Proteomics, vol. 12, no. 7, pp. 1780–90, 2013.

[57] S. J. Andrews and J. A. Rothnagel, “Emerging evidence for functional peptides

encoded by short open reading frames,” Nat. Rev. Genet., vol. 15, no. 3, pp. 193–204,

2014.

[58] G. Menschaert, J. Crappé, E. Ndah, and S. Koch , ert, “What is PROTEOFORMER?,”

2014. [Online]. Available: http://www.biobix.be/proteoformer/. [Accessed: 29-Oct-

2016].

[59] “FastQC - A Quality Control application for FastQ files.” [Online]. Available:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/README.txt. [Accessed:

29-Oct-2016].

[60] D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg, “TopHat2:

accurate alignment of transcriptomes in the presence of insertions, deletions and gene

fusions,” Genome Biol., vol. 14, no. 4, p. R36, 2013.

[61] A. Dobin et al., “STAR: Ultrafast universal RNA-seq aligner,” Bioinformatics, vol. 29,

no. 1, pp. 15–21, 2013.

[62] H. Li et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics,

vol. 25, no. 16, pp. 2078–2079, 2009.

[63] SAMtools, “Multisample SNP Calling.” [Online]. Available:

http://samtools.sourceforge.net/mpileup.shtml. [Accessed: 14-Sep-2016].

[64] Genome Research Limited, “Samtools Manual Page,” 2016. [Online]. Available:

http://www.htslib.org/doc/samtools.html. [Accessed: 14-Sep-2016].

[65] M. S. Hasan, X. Wu, and L. Zhang, “Performance evaluation of indel calling tools

using real short-read data,” Hum. Genomics, vol. 9, no. 1, p. 20, 2015.

[66] K. Gevaert et al., “Exploring proteomes and analyzing protein processing by mass

spectrometric identification of sorted N-terminal peptides,” Nat. Biotechnol., vol. 21,

no. 5, pp. 566–569, 2003.

[67] T. Ylonen and C. Lonvick, “The Secure Shell (SSH) Protocol Architecture,” 2006.

[Online]. Available: https://www.rfc-editor.org/rfc/rfc4251.txt. [Accessed: 04-Apr-

2017].

[68] Python Software Foundation, “General Python FAQ,” 2017. [Online]. Available:

https://docs.python.org/3/faq/general.html. [Accessed: 05-Apr-2017].

[69] SQLite Consortium, “About SQLite.” [Online]. Available:

https://www.sqlite.org/about.html. [Accessed: 03-Apr-2017].

[70] SQLite Consortium, “SQLite is a Self Contained System.” [Online]. Available:

https://www.sqlite.org/selfcontained.html. [Accessed: 03-Apr-2017].

[71] SQLite Consortium, “Command Line Shell For SQLite.” [Online]. Available:

http://www.sqlite.org/cli.html. [Accessed: 03-Apr-2017].

[72] Python Software Foundation, “DB-API 2.0 interface for SQLite databases,” 2017.

[Online]. Available: https://docs.python.org/2/library/sqlite3.html. [Accessed: 05-Apr-

2017].

[73] Python Software Foundation, “PEP 0249 -- Python Database API Specification v2.0,”

2017. [Online]. Available: https://www.python.org/dev/peps/pep-0249/. [Accessed: 05-

Apr-2017].

[74] A. M. Kuchling, “The Python DB-API,” Linux Journal, no. 49, 1998.

[75] Broad Institute, “HC overview: How the HaplotypeCaller works,” 2016. [Online].

Available: https://software.broadinstitute.org/gatk/documentation/article.php?id=4148.

[Accessed: 15-Sep-2016].

[76] A. Rimmer, H. Phan, I. Mathieson, Z. Iqbal, and S. R. F. Twigg, “Integrating mapping-

, assembly- and haplotype-based approaches for calling variants in clinical sequencing

applications,” Nat Genet., vol. 46, no. 8, pp. 912–918, 2016.

[77] Broad Institute, “HaplotypeCaller.” [Online]. Available:

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstit

ute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php. [Accessed: 29-Apr-

2017].

[78] M. A. DePristo et al., “A framework for variation discovery and genotyping using

next-generation DNA sequencing data,” Nat. Genet., vol. 43, no. 5, pp. 491–498, 2011.

[79] M. Vaudel et al., “PeptideShaker enables reanalysis of MS-derived proteomics data

sets,” Nat. Biotechnol., vol. 33, no. 1, pp. 22–24, 2015.

[80] M. Vaudel, H. Barsnes, F. S. Berven, A. Sickmann, and L. Martens, “SearchGUI: An

open-source graphical user interface for simultaneous OMSSA and X!Tandem

searches,” Proteomics, vol. 11, no. 5, pp. 996–999, 2011.

[81] R. Craig and R. C. Beavis, “TANDEM: Matching proteins with tandem mass spectra,”

Bioinformatics, vol. 20, no. 9, pp. 1466–1467, 2004.

[82] L. Y. Geer et al., “Open mass spectrometry search algorithm,” J. Proteome Res., vol. 3,

no. 5, pp. 958–964, 2004.

[83] Compomics, “SearchGUI,” 2017. [Online]. Available:

http://compomics.github.io/projects/searchgui.html. [Accessed: 11-Apr-2017].

[84] Compomics, “PeptideShaker,” 2017. [Online]. Available:

http://compomics.github.io/projects/peptide-shaker.html. [Accessed: 11-Apr-2017].

[85] J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team, “Galaxy: a comprehensive

approach for supporting accessible, reproducible, and transparent computational

research in the life sciences.,” Genome Biol., vol. 11, no. 8, p. R86, 2010.

[86] EMBL-EBI, “About the Ensembl Project,” 2014. [Online]. Available:

http://www.ensembl.org/info/about/index.html. [Accessed: 16-Apr-2017].

[87] T. Hubbard et al., “The Ensembl genome database project,” Nucleic Acids Res., vol.

30, no. 1, pp. 38–41, 2002.

[88] B. L. Aken et al., “The Ensembl Gene Annotation System,” Database (Oxford)., vol.

2016, p. baw093, 2016.

[89] D. M. Church et al., “Modernizing reference genome assemblies,” PLoS Biol., vol. 9,

no. 7, p. e1001091, 2011.

[90] “The Genome Reference Consortium.” [Online]. Available:

https://www.ncbi.nlm.nih.gov/grc. [Accessed: 12-Apr-2017].

[91] EMBL-EBI, “Ensembl Core - Schema documentation,” 2017. [Online]. Available:

http://www.ensembl.org/info/docs/api/core/core_schema.html. [Accessed: 12-Apr-

2017].

[92] S. T. Sherry et al., “dbSNP: the NCBI database of genetic variation.,” Nucleic Acids

Res., vol. 29, no. 1, pp. 308–311, 2001.

[93] S. T. Sherry, M. Ward, and K. Sirotkin, “dbSNP—Database for Single Nucleotide

Polymorphisms and Other Classes of Minor Genetic Variation,” Genome Res., vol. 9,

no. 8, pp. 677–679, 1999.

[94] G. Cochrane, I. Karsch-Mizrachi, and Y. Nakamura, “The International Nucleotide

Sequence Database Collaboration,” Nucleic Acids Res., vol. 39, no. (Database issue):

D15–D18, 2011.

[95] EMBL-EBI, “International Nucleotide Sequence Database Collaboration.” [Online].

Available: http://www.insdc.org/. [Accessed: 09-Apr-2017].

[96] R. Leinonen et al., “The European nucleotide archive,” Nucleic Acids Res., vol. 39, no.

(Database issue):D28-31, 2011.

[97] P. Jones et al., “PRIDE: a public repository of protein and peptide identifications for

the proteomics community,” Nucleic Acids Res., vol. 34, no. (Database issue): D659-

63, 2006.

[98] L. Martens et al., “PRIDE: The proteomics identifications database,” Proteomics, vol.

5, no. 13, pp. 3537–3545, 2005.

[99] A. Koch et al., “A proteogenomics approach integrating proteomics and ribosome

profiling increases the efficiency of protein identification and enables the discovery of

alternative translation start sites,” Proteomics, vol. 14, no. 0, pp. 2688–2698, 2014.

[100] H. Laboratory, “FASTX-Toolkit.” [Online]. Available:

http://hannonlab.cshl.edu/fastx_toolkit/. [Accessed: 22-May-2017].

[101] D. Blankenberg et al., “Manipulation of FASTQ data with galaxy,” Bioinformatics,

vol. 26, no. 14, pp. 1783–1785, 2010.

[102] J. G. Dunn and J. S. Weissman, “Plastid: nucleotide-resolution analysis of next-

generation sequencing and genomics data,” BMC Genomics, vol. 17, no. 1, p. 958,

2016.

[103] J. G. Dunn, “Performing metagene analyses,” 2014. [Online]. Available:

http://plastid.readthedocs.io/en/latest/examples/metagene.html. [Accessed: 24-May-

2017].

[104] J. G. Dunn, “Determine P-site offsets for ribosome profiling data,” 2014. [Online].

Available: http://plastid.readthedocs.io/en/latest/examples/p_site.html. [Accessed: 24-

May-2017].

[105] D. L. Lafontaine and D. Tollervey, “The function and synthesis of ribosomes.,” Nat.

Rev. Mol. Cell Biol., vol. 2, no. 7, pp. 514–520, 2001.

[106] C. M. Yates, I. Filippis, L. A. Kelley, and M. J. E. Sternberg, “SuSPect: Enhanced

prediction of single amino acid variant (SAV) phenotype using network features,” J.

Mol. Biol., vol. 426, no. 14, pp. 2692–2701, 2014.

[107] J. E. Elias and S. P. Gygi, “Target-decoy search strategy for increased confidence in

large-scale protein identifications by mass spectrometry,” Nat. Methods, vol. 4, no. 3,

pp. 207–214, 2007.

[108] The Global Proteome Machine Organization, “cRAP protein sequences,” 2011.

[Online]. Available: http://www.thegpm.org/crap/. [Accessed: 23-May-2017].

[109] O. Stehling et al., “MMS19 assembles iron-sulfur proteins required for DNA metabolism and genomic integrity,” Science, vol. 337, no. 6091, pp. 195–199, 2012.

[110] K. E. Hurov, C. Cotta-Ramusino, and S. J. Elledge, “A genetic screen identifies the

Triple T complex required for DNA damage signaling and ATM and ATR stability,”

Genes Dev., vol. 24, no. 17, pp. 1939–1950, 2010.

[111] Z. Wei, W. Ma, X. Qi, X. Zhu, Y. Wang, and Z. Xu, “Pinin facilitated proliferation and

metastasis of colorectal cancer through activating EGFR / ERK signaling pathway,”

Oncotarget, vol. 7, no. 20, pp. 29429–29439, 2016.

[112] M. A. Baldwin, “Protein Identification by Mass Spectrometry,” Mol. Cell. Proteomics,

vol. 3, no. 1, pp. 1–9, 2003.

[113] P. Iengar, “An analysis of substitution, deletion and insertion mutations in cancer

genes,” Nucleic Acids Res., vol. 40, no. 14, pp. 6401–6413, 2012.

[114] D. Gawron, E. Ndah, K. Gevaert, and P. Van Damme, “Positional proteomics reveals

differences in N-terminal proteoform stability,” Mol. Syst. Biol., vol. 12, no. 2, p. 858,

2016.

[115] Harvey Lodish et al., “The genetic basis of cancer,” in Molecular Cell Biology, 7th ed.,

New York: W. H. Freeman and Company, 2000, pp. 1124–1131.

[116] G. Kurzawski, J. Suchy, T. Debniak, J. Kładny, and J. Lubiński, “Importance of microsatellite instability (MSI) in colorectal cancer: MSI as a diagnostic tool,” Ann. Oncol., vol. 15, no. SUPPL. 4, pp. 283–284, 2004.

[117] Y.-F. Chang, J. S. Imam, and M. F. Wilkinson, “The Nonsense-Mediated Decay RNA

Surveillance Pathway,” Annu. Rev. Biochem., vol. 76, pp. 51–74, 2007.

[118] R. Fodde, “The APC gene in colorectal cancer.,” Eur. J. Cancer, vol. 38, no. 7, pp.

867–71, 2002.

[119] X.L. Li, J. Zhou, Z.R. Chen, and W.J. Chng, “P53 Mutations in Colorectal Cancer-

Molecular Pathogenesis and Pharmacological Reactivation,” World J. Gastroenterol.,

vol. 21, no. 1, pp. 84–93, 2015.

[120] M. Miyaki and T. Kuroki, “Role of Smad4 (DPC4) inactivation in human cancer,”

Biochem. Biophys. Res. Commun., vol. 306, no. 4, pp. 799–804, 2003.

[121] M. Hajdúch, S. Jančík, J. Drábek, and D. Radzioch, “Clinical relevance of KRAS in

human cancers,” J. Biomed. Biotechnol., vol. 2010, no. 150960, p. 13, 2010.

[122] D. Barras, “BRAF Mutation in Colorectal Cancer : An Update,” Biomark. Cancer, vol.

7, no. Suppl 1, pp. 9–12, 2015.

[123] X. Ma, B. Zhang, and W. Zheng, “Genetic variants associated with colorectal-cancer

risk: comprehensive research synopsis, meta-analysis, and epidemiological evidence,”

Gut, vol. 63, no. 2, pp. 326–336, 2008.

[124] A. J. MacFarlane, C. A. Perry, M. F. McEntee, D. M. Lin, and P. J. Stover, “Mthfd1 is

a modifier of chemically induced intestinal carcinogenesis,” Carcinogenesis, vol. 32,

no. 3, pp. 427–433, 2011.

[125] K. Yamaguchi et al., “MRG-binding protein contributes to colorectal cancer

development,” Cancer Sci., vol. 102, no. 8, pp. 1486–1492, 2011.

[126] P. J. Halvey et al., “Proteogenomic analysis reveals unanticipated adaptations of

colorectal tumor cells to deficiencies in DNA mismatch repair,” Cancer Res., vol. 74,

no. 1, pp. 387–397, 2014.

[127] B. Zhang, J. Wang, X. Wang, J. Zhu, Q. Liu, and Z. Shi, “Proteogenomic

characterization of human colon and rectal cancer,” Nature, vol. 513, no. 7518, pp.

382–387, 2014.

[128] P. Mertins et al., “Proteogenomics connects somatic mutations to signaling in breast

cancer,” Nature, vol. 534, no. 7605, pp. 55–62, 2016.

[129] K. V. Ruggles et al., “An Analysis of the Sensitivity of Proteogenomic Mapping of

Somatic Mutations and Novel Splicing Events in Cancer,” Mol. Cell. Proteomics, vol.

15, no. 3, pp. 1060–1071, 2016.

[130] Q. Li et al., “Genome-wide search for exonic variants affecting translational efficiency,” Nat. Commun., vol. 4, no. 2260, 2013.

[131] C. Polychronakos, “Gene expression as a quantitative trait: what about translation?,” J.

Med. Genet., vol. 49, no. 9, pp. 554–557, 2012.

[132] J. Hu and P. C. Ng, “Predicting the effects of frameshifting indels,” Genome Biol., vol.

13, no. 2, p. R9, 2012.

[133] E. Nagy and L. E. Maquat, “A rule for termination-codon position within intron-

containing genes: When nonsense affects RNA abundance,” Trends Biochem. Sci., vol.

23, no. 6, pp. 198–199, 1998.

[134] R. A. Cartwright, “Problems and solutions for estimating indel rates and length

distributions,” Mol. Biol. Evol., vol. 26, no. 2, pp. 473–480, 2009.

[135] J. Q. Chen, Y. Wu, H. Yang, J. Bergelson, M. Kreitman, and D. Tian, “Variation in the

ratio of nucleotide substitution and indel rates across genomes in mammals and

bacteria,” Mol. Biol. Evol., vol. 26, no. 7, pp. 1523–1531, 2009.

[136] G. Narzisi and M. C. Schatz, “The Challenge of Small-Scale Repeats for Indel

Discovery,” Front. Bioeng. Biotechnol., vol. 3, p. 8, 2015.

[137] Illumina, “Paired-End Sequencing,” 2017. [Online]. Available:

https://www.illumina.com/technology/next-generation-sequencing/paired-end-

sequencing_assay.html. [Accessed: 29-May-2017].

[138] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: A pattern growth

approach to detect break points of large deletions and medium sized insertions from

paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.

[139] P. J. Campbell et al., “Identification of somatically acquired rearrangements in cancer

using genome-wide massively parallel paired-end sequencing,” Nat. Genet., vol. 40,

no. 6, pp. 722–729, 2008.

[140] K. Tomczak, P. Czerwińska, and M. Wiznerowicz, “The Cancer Genome Atlas

(TCGA): An immeasurable source of knowledge,” Contemp. Oncol., vol. 19, no. 1A,

pp. A68–A77, 2015.

[141] S. A. Forbes et al., “COSMIC: Exploring the world’s knowledge of somatic mutations

in human cancer,” Nucleic Acids Res., vol. 43, pp. D805–D811, 2015.
