RIBOSOME PROFILING AND THE
IMPROVEMENT OF PROTEIN
IDENTIFICATION IN CANCER
PROTEOGENOMICS STUDIES
Number of words: 20 226
Sofie Gielis
Student number: 01507424
Promotor: Prof. dr. ir. Wim Van Criekinge
Promotor: dr. ir. Gerben Menschaert
Master's dissertation submitted in partial fulfillment of the requirements for the degree of Master in
Bioscience Engineering: Cell and Gene Biotechnology
Academic year: 2016 - 2017
Acknowledgements
After five years of hard work and dedication, college is coming to an end. It was definitely the
most challenging and interesting experience ever. I am very grateful for all the support of my
friends and family throughout these years. Thank you all!
Also, I would like to thank my promotor Wim Van Criekinge for guiding me through the basics
of bioinformatics in the first master year. Your enthusiastic way of teaching and love for the
subject sparked my interest in bioinformatics.
Finally, a special thanks goes to my promotor Gerben Menschaert and tutor Steven
Verbruggen for all the help and support over the last year.
Sofie Gielis
Ghent, 4 June 2017
Contents
1 Introduction and outline
2 Overview of relevant literature
2.1 NGS-based technologies
2.2 MS-based proteomics
2.2.1 Outline of MS-based proteomics experiments
2.2.2 Database searching and MS/MS-based protein identification
2.3 Proteogenomics
2.3.1 Goals of proteogenomics
2.4 Ribosome profiling
2.4.1 An emerging technique
2.4.2 Outline of a ribosome profiling experiment
2.4.3 Applications of ribosome profiling
2.5 The PROTEOFORMER pipeline
2.5.1 Overview of the pipeline
2.5.2 Achievements
3 Material and methods
3.1 Hardware
3.2 Software
3.2.1 SSH clients
3.2.2 Python
3.2.3 SQLite
3.2.4 Python DB-API
3.2.5 STAR
3.2.6 Variant caller
3.2.7 SearchGUI and PeptideShaker
3.2.8 Galaxy
3.3 Public databases
3.3.1 Ensembl annotation bundle
3.3.2 dbSNP
3.3.3 Sequence read archives
3.3.4 PRIDE
3.4 Custom database creation
3.4.1 Mapping
3.4.2 Transcript calling
3.4.3 TIS calling
3.4.4 Variant calling
3.4.5 Translation assembly
3.5 Validation of the renewed PROTEOFORMER pipeline with proteomics data
4 Results
4.1 Custom database creation
4.1.1 Mapping statistics
4.1.2 Mapping quality control
4.1.3 Discovered variants
4.1.4 Translation assembly
4.2 Validation of the renewed PROTEOFORMER pipeline with proteomics data
4.2.1 Exploration of the effect of INDELs on the identification process
4.2.2 Exploration of the effect of SNPs on the identification process
5 Discussion
6 Conclusion
7 Further research
Bibliography
List of abbreviations
APC adenomatous polyposis coli
API application programming interface
A-site aminoacyl site
aTIS alternative translation initiation site (general)/annotated start site (PROTEOFORMER)
ATP adenosine triphosphate
BAM Binary Alignment/Map
BCF Binary Variant Call Format
BRAF B-Raf proto-oncogene serine/threonine kinase
cDNA complementary DNA
CDS coding sequence
CHX cycloheximide
COFRADIC combined fractional diagonal chromatography
COSMIC Catalogue Of Somatic Mutations In Cancer
cRAP common Repository of Adventitious Proteins
CWI Center for Mathematics and Computer Science
DB-API database application programming interface
dbSNP Single Nucleotide Polymorphism Database
dbTIS database annotated TIS
DDBJ DNA Databank of Japan
DNA deoxyribonucleic acid
dNTP deoxynucleotide triphosphate
dTIS downstream TIS
EMBL-EBI European Bioinformatics Institute
ENA European Nucleotide Archive
E-site exit site
EST expressed sequence tag
GATK Genome Analysis Toolkit
GRC Genome Reference Consortium
HARR harringtonine
HCT116 human colorectal tumour 116 cell line
INDEL INsertion and DELetion
INSDC International Nucleotide Sequence Database Collaboration
K lysine
KRAS Kirsten rat sarcoma viral oncogene homolog
LC liquid chromatography
LC-MS/MS liquid chromatography followed by tandem mass spectrometry
LTM lactimidomycin
mESC mouse embryonic stem cells
MMS19 methyl-methanesulfonate sensitivity 19
mRNA messenger RNA
MS mass spectrometry
MS/MS tandem mass spectrometry
MTHFD1 methylenetetrahydrofolate dehydrogenase 1
NCBI National Center for Biotechnology Information
NGS next-generation sequencing
NHGRI National Human Genome Research Institute
OMSSA Open Mass Spectrometry Search Algorithm
ORF open reading frame
PacBio Pacific Biosciences
PCR polymerase chain reaction
PRIDE Proteomics Identifications database
PSF Python Software Foundation
P-site peptidyl site
PSM peptide to spectrum match
QC quality control
R arginine
RIBO-seq ribosome profiling
RNA ribonucleic acid
RNA-seq RNA sequencing
RPF ribosome protected fragment
rRNA ribosomal RNA
SA suffix array
SAM Sequence Alignment/Map
SMAD4 mothers against decapentaplegic homolog 4
SAAV single amino acid variant
SMRT single molecule, real-time sequencing technology
sn(o)RNA small nuclear and nucleolar RNA
SNP Single Nucleotide Polymorphism
SOLiD Sequencing by Oligo Ligation Detection
sORF small open reading frame
SQL Structured Query Language
SRA Sequence Read Archive
SSH Secure Shell
STAR Spliced Transcripts Alignment to a Reference
TCGA The Cancer Genome Atlas
TELO2 telomere maintenance 2
TIS translation initiation site
TP53 tumour protein 53
TrEMBL translated EMBL
tRNA transfer RNA
UniProt Universal Protein Resource
uORF upstream open reading frame
3' UTR 3' untranslated region
5' UTR 5' untranslated region
VCF Variant Call Format
Abstract
Many recent breakthroughs in genomics, proteomics and transcriptomics studies are due to
the development of so-called next-generation sequencing (NGS) techniques. An important
novel NGS technique is ribosome profiling, which is based on the deep sequencing of
ribosome-protected fragments. By studying the mRNA fragments captured by translating
ribosomes, the actual translation status in vivo can be assessed. In combination with MS/MS,
ribosome profiling has become an important aid in the protein identification process, because
it allows a more comprehensive protein sequence database to be built. Currently, this database
can be generated with a new automated pipeline developed by researchers at the BioBix lab: the
PROTEOFORMER pipeline.
This master thesis covers the improvement of the protein identification process in human
cancer cells in a proteogenomics setting combining ribosome profiling and mass spectrometry
data. Previous studies have shown that the proteins in these cells are often rich in SNPs
(Single Nucleotide Polymorphisms) and INDELs (INsertions and DELetions). Therefore, adding
these genetic variants to the protein database is expected to have a positive effect on the
protein identification rate. The inclusion of SNPs in the PROTEOFORMER pipeline has already
proven successful. Similarly, an improvement of the overall protein identification process
is expected when INDEL information is included. This hypothesis was evaluated by comparing
the protein identification rate obtained with custom search databases with and without INDELs
in the MS/MS identification process. For this purpose, INDELs derived from ribosome profiling
data were included. This led to the validation of one protein which could not be identified
with custom databases lacking this information. Although this shows that the inclusion of
INDELs had only a small positive effect on the protein identification rate, a further increase
is expected in the near future by providing extra INDEL information from variant databases
and by using better INDEL detection methods.
Finally, the effect of the inclusion of INDEL information on the protein identification rate
was compared with the effect of SNPs. Although only one SNP was validated with MS/MS data,
more than 100 SNP-containing proteins were identified, either individually or within a
protein group. These results show the importance of well-generated custom databases tailored
to the study in question.
Short summary
Several breakthroughs in genomic, proteomic and transcriptomic studies are due to the
development of new sequencing techniques. An important example is ribosome profiling. This
emerging technique is based on the sequencing of ribosome-protected mRNA fragments. By
studying the mRNA fragments present in active ribosomes, the actual translation status
in vivo can be assessed. In combination with tandem mass spectrometry, ribosome profiling
can play an important role in the protein identification process within a proteogenomics
study, in particular in the development of a more comprehensive protein sequence database.
Currently, this database can be generated with a new, automated pipeline developed by the
researchers of the BioBix lab: the PROTEOFORMER pipeline.
This master thesis covers the improvement of the protein identification process in human
cancer cells. Previous studies have shown that proteins are often rich in SNPs (Single
Nucleotide Polymorphisms) and INDELs (INsertions and DELetions). Adding these genetic
variants is presumed to have a positive effect on the number of identified proteins. The
success of including SNPs in the PROTEOFORMER pipeline has already been proven. However, we
expect an additional improvement of the protein identification procedure when INDELs are
included as well. This hypothesis was investigated by evaluating the protein identifications
obtained with custom-generated databases with and without INDELs. For this purpose, INDELs
derived from ribosome profiling data were used. This led to the validation of one protein
that could not be identified with the generated databases lacking this information. Although
this shows only a small positive effect of INDELs on protein identification, a larger
increase is expected in the near future by providing extra INDEL information from databases
and by using better INDEL detection methods.
Finally, the effect of INDELs on protein identification was compared with the effect of
SNPs. Although only one SNP was validated with MS/MS data, more than 100 SNP-containing
proteins were identified, either individually or as part of a protein group. These results
show the importance of databases tailored to the study.
1 Introduction and outline
Protein identification plays an important role in many life science studies. It is mostly
based on mass spectrometry (MS) and relies on the availability of annotated protein
sequences in public databases. Consequently, the identification process is limited to the
detection of known proteins. Researchers are currently trying to address this problem by
creating custom protein sequence databases based on genomic and transcriptomic information.
To this end, novel technologies targeting nucleotide information at the translational level
are being developed. This fusion of proteomics with genomics and transcriptomics has led to a
new research area, called proteogenomics.
Ribosome profiling is a relatively new technique enabling the sequencing of ribosome-
protected mRNA fragments. It provides a snapshot of the translational situation in vivo.
This information is already used as the basis for the construction of custom protein
sequence databases. To facilitate this process, an automated pipeline was designed: the
PROTEOFORMER pipeline. This pipeline has already been shown to increase the overall protein
identification rate in MS-based experiments. However, it is still under continuous
improvement.
In this master thesis, an attempt is made to improve the MS-based protein identification
strategy for human cancer samples. Since proteins in these samples can be strongly affected
by the presence of SNPs (Single Nucleotide Polymorphisms) and INDELs (INsertions and
DELetions), the influence of introducing this information into the protein identification
pipeline is studied. Previously, SNP information was added to the pipeline, resulting in an
overall increase in the protein identification rate. Here, the focus is mainly on the
addition of INDELs to the pipeline. In general, the inclusion of INDEL information is
expected to have a positive effect. In addition, the effect of INDELs is compared with the
effect of SNPs by analyzing the identification rates of custom databases with and without
these variants, based on matching ribosome profiling data from human colorectal cancer cells.
In the following chapter, MS-based proteomics, proteogenomics, ribosome profiling and the
PROTEOFORMER pipeline are explained in more depth. Moreover, an introduction to
NGS-based technologies is given, as these technologies are frequently used in proteogenomics
studies. Chapter 3 focuses on the practical aspects of this master thesis by describing the
materials and methods used. The results are given in chapter 4 and discussed in chapter 5.
Chapter 6 gives a general conclusion. Finally, this master thesis ends with a glimpse into
the future of the PROTEOFORMER pipeline in chapter 7.
2 Overview of relevant literature
2.1 NGS-based technologies
Since the discovery of deoxyribonucleic acid (DNA) in 1869 by Johann Friedrich Miescher,
DNA has been the center of attention in many scientific studies [1]. However, it took almost
100 years before the first sequencing operations were carried out. Sequencing covers every
experiment in which the order of nucleotides in a DNA or ribonucleic acid (RNA) sequence
is determined. Initially, sequencing methods, referred to as first-generation sequencing
methods, were slow and confined to short sequences only [2]. Later on, new insights and
improvements gave rise to high-throughput sequencing technologies: next-generation
sequencing-based technologies (NGS-based technologies). Nowadays, multiple NGS-based
technologies are commercially available. Illumina sequencing by synthesis, SOLiD
(Sequencing by Oligo Ligation Detection) and 454 sequencing are probably the most popular
techniques [3]. In general, these methods all start with the fragmentation of the sample
nucleic acids into smaller pieces and the ligation of adaptor sequences needed for the
amplification and sequencing steps. The amplification methods are often shared between
different sequencing technologies, but the actual sequencing procedures are quite different
[3,4].
The oldest commercially available NGS technology is 454 sequencing. It is based on the
production of light during the conversion of luciferin to oxyluciferin by the enzymatic
action of luciferase. This reaction can only occur when a complementary base is added to the
sequence, which generates pyrophosphate and its derivative product adenosine triphosphate
(ATP), which in turn activates the enzyme. By adding one type of deoxynucleotide triphosphate
(dNTP) at a time, the nucleotide sequence can be deduced. This process is also called
pyrosequencing [3,5,6].
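The flow-by-flow logic of pyrosequencing can be illustrated with a toy simulation. The sketch below rests on strong assumptions that are not taken from the cited literature: the light signal of a flow is modelled as exactly equal to the number of identical bases incorporated, there is no noise, and `target` is the strand being synthesized rather than the template. The function names and the flow order `TACG` are hypothetical choices made for illustration only.

```python
from itertools import cycle


def simulate_flowgram(target, flow_order="TACG", max_flows=400):
    """Return a list of (nucleotide, signal) pairs, one per dNTP flow.

    `target` is the strand that is synthesized. In each flow a single dNTP
    type is offered; the idealized signal equals the number of consecutive
    bases of that type that can be incorporated at the current position.
    """
    flows = []
    pos = 0
    for nucleotide in cycle(flow_order):
        # Stop when the strand is complete (or after a safety limit).
        if pos >= len(target) or len(flows) >= max_flows:
            break
        run = 0
        while pos < len(target) and target[pos] == nucleotide:
            run += 1
            pos += 1
        flows.append((nucleotide, run))
    return flows


def decode_flowgram(flows):
    """Rebuild the sequence from the idealized light signals."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


if __name__ == "__main__":
    flows = simulate_flowgram("TTAGGC")
    print(flows)
    print(decode_flowgram(flows))
```

A flow with signal 0 corresponds to a dNTP that is not complementary at the current position, so no light is emitted. Real instruments have to cope with noisy signals, which makes long homopolymer runs hard to call; that difficulty is deliberately left out of this sketch.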
SOLiD and Illumina sequencing are both based on the detection of fluorescent nucleotides.
They differ in the number of nucleotides attached during each detection cycle. The SOLiD
system relies on the incorporation of a probe containing eight bases with a fluorescent group
attached to the last base. By using four different fluorescent dyes, colour detection can
reveal the last base. The use of a ladder primer set makes it possible to determine the
sequence after five sequencing cycles [3,5]. In contrast with SOLiD, Illumina is based on the
incorporation of fluorescently labelled mononucleotides. By adding a blocker to the end of
these nucleotides, the synthesis process can be carefully controlled, allowing the detection
of one single nucleotide at a time [3,5].
Further improvements in sequencing technologies have led to a new generation of techniques:
the third generation. Sequencing technologies belonging to this generation are characterized
by the ability to sequence longer fragments than the previously explained NGS-based
technologies. Moreover, no amplification step is required due to the ability to sequence
single molecules. Two popular third-generation sequencing techniques are single molecule
real-time sequencing technology (SMRT) provided by Pacific Biosciences (PacBio) [7,8] and
Oxford Nanopore sequencing [9]. Despite their promising features, they are not yet routinely
used in the commercial industry.
2.2 MS-based proteomics
2.2.1 Outline of MS-based proteomics experiments
Proteomics refers to the study of the total protein content of a species, also called its
proteome. It involves the identification and quantification of the entire protein load present
in the species of interest [10,11]. A fundamental part of this research is the use of shotgun
proteomics. This methodology aims at the identification of proteins in a complex protein
mixture by studying the corresponding peptides after enzymatic cleavage [10,11]. Figure 1
gives an overview of the general aspects. A common shotgun proteomics experiment
starts with an enzymatic cleavage of the proteins into peptides (trypsin is used most often).
Afterwards, these peptides are separated using liquid chromatography (LC). The eluted
peptides are then ionized before they enter the mass spectrometer [12,13,14]. In mass
spectrometry (MS), the isolation of peptides is based on their mass-to-charge ratio [11,15].
The desired proteomic data are obtained by repeating the MS step on the fragments of each
selected peptide ion [13], resulting in a final tandem mass spectrometry (MS/MS) spectrum.
The described combination of liquid chromatography and tandem mass spectrometry is
abbreviated as LC-MS/MS [16].
Figure 1: Practical aspects of LC-MS/MS. In general, every shotgun proteomics experiment starts
with the enzymatic digestion of the proteins of interest into peptides. These peptides are then separated
in a liquid chromatography step and afterwards subjected to tandem mass spectrometry. Adapted from
[12].
2.2.2 Database searching and MS/MS-based protein identification
Database searching of the resulting MS/MS spectra reveals the identity of the examined
peptides and thus also of their corresponding proteins. This process is illustrated in
figure 2. It consists of two consecutive steps, namely the peptide and the protein
identification process [17,18].
Peptide identification
The identification of the peptides, obtained after protein digestion, is based on the
correlation between experimental and theoretical MS/MS fragmentation spectra. The latter are
obtained by subjecting a chosen protein sequence database to in silico MS/MS pattern
prediction algorithms. In this prediction process, proteins are virtually digested into
peptides using enzymatic cleavage rules. For example, when using the trypsin cleavage rule,
cleavage is assumed to occur C-terminally of every lysine or arginine [19]. Next, the
associated mass-to-charge ratios are calculated for all the cleaved products. By comparing
these theoretical ratios with the experimental mass-to-charge ratios, a candidate peptide
list is generated. This list contains all the theoretical peptides whose mass-to-charge
ratios match an experimental one. Subsequently, these peptides are broken down in silico into
smaller ions. Finally, the produced theoretical fragmentation spectra are matched with the
experimental MS/MS spectra, resulting in a statistical similarity score. Only peptides
associated with a high score are retained and added to a final matching peptide list [14,18].
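The first half of this procedure (in silico digestion, mass-to-charge calculation and candidate selection) can be sketched in a few lines of Python. This is a minimal illustration and not the algorithm of any particular search engine: the function names, the 10 ppm tolerance and the example protein are assumptions, the cleavage rule is the simple one stated in the text (after every K or R, without exceptions), and monoisotopic masses without modifications are used.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein, missed_cleavages=0):
    """Cleave C-terminally of every K or R; optionally allow missed cleavages."""
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+", protein)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mz(peptide, charge=1):
    """Theoretical m/z of the intact peptide carrying `charge` protons."""
    mass = sum(MONO[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge


def candidate_peptides(proteins, observed_mz, charge, tol_ppm=10.0):
    """List (protein, peptide, m/z) entries within the precursor tolerance."""
    hits = []
    for name, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            mz = peptide_mz(peptide, charge)
            if abs(mz - observed_mz) / observed_mz * 1e6 <= tol_ppm:
                hits.append((name, peptide, mz))
    return hits


if __name__ == "__main__":
    database = {"hypothetical_protein": "MKWVTFISLLRAA"}
    print(tryptic_digest(database["hypothetical_protein"]))
    observed = peptide_mz("WVTFISLLR", charge=2)
    print(candidate_peptides(database, observed, charge=2))
```

Because only peptides present in the digested database can end up in the candidate list, a peptide carrying a variant that is absent from the database cannot be matched at this stage, which is the limitation that custom databases aim to lift.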
Figure 2: Overview of the protein identification process using a database searching approach. Proteins
can be identified by comparing the experimental MS/MS spectra with theoretical spectra. The latter
are obtained by subjecting a protein sequence database to an in silico MS/MS pattern prediction
algorithm [14].
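The fragmentation and spectrum-matching step of the database search can be sketched as follows. The example is heavily simplified and makes assumptions that do not come from the cited references: only singly charged b- and y-ions are predicted, peak intensities are ignored, and a plain shared peak count stands in for the statistical similarity score that real search engines compute. All function names are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def by_ions(peptide):
    """Sorted m/z values of the singly charged b- and y-ions of a peptide."""
    masses = [MONO[aa] for aa in peptide]
    ions = []
    running = 0.0
    for mass in masses[:-1]:            # b-ions: N-terminal fragments
        running += mass
        ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):   # y-ions: C-terminal fragments
        running += mass
        ions.append(running + WATER + PROTON)
    return sorted(ions)


def shared_peak_count(theoretical, observed, tol=0.02):
    """Number of theoretical peaks with an observed peak within `tol` Da."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def best_match(candidates, observed, tol=0.02):
    """Candidate peptide whose predicted spectrum shares most peaks."""
    return max(
        candidates,
        key=lambda pep: shared_peak_count(by_ions(pep), observed, tol),
    )


if __name__ == "__main__":
    observed = by_ions("PEPTIDE")       # stand-in for an experimental spectrum
    print(best_match(["EDITPEP", "PEPTIDE"], observed))
```

The two candidates in the usage example have the same composition and therefore the same precursor mass; only the fragment ions distinguish them, which is why the fragmentation spectra rather than the intact masses decide the final match.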
Protein identification
The obtained information from the peptide identification process is brought together in a
second step resulting in a list of possible proteins present in the sample of interest. This list
will be subjected to a statistical analysis in order to select the true protein identifications [18].
The overall procedure can be hampered by the protein inference problem which refers to the
difficulty of extending information from the peptide to the protein level. This may be a result
of large-scale sequence homology between the proteins present, because digestion of these
proteins can give rise to identical peptides. In the absence of unique peptide sequences,
discrimination between these proteins is not possible. This problem can be taken into account
by collecting the indiscernible proteins in a so-called protein group. In this way, statistical
analysis can reveal the presence of a certain group of proteins without knowing the presence
of specific proteins within that group [20].
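The idea of a protein group can be made concrete with a small sketch. It covers only the simplest case, in which proteins are supported by exactly the same set of identified peptides; how real inference tools treat subset proteins or partially shared peptides is not modelled here. The function name and the example identifiers are hypothetical.

```python
from collections import defaultdict


def protein_groups(protein_peptides):
    """Group proteins that are supported by an identical set of peptides.

    `protein_peptides` maps a protein identifier to the set of identified
    peptides that map to it. Proteins with the same peptide evidence cannot
    be told apart and are reported together as one group.
    """
    groups = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return sorted(
        (sorted(members), sorted(peptides))
        for peptides, members in groups.items()
    )


if __name__ == "__main__":
    evidence = {
        "P1": {"AAK", "CCR"},
        "P2": {"AAK", "CCR"},   # homologous to P1: same peptides identified
        "P3": {"DDK"},
    }
    for members, peptides in protein_groups(evidence):
        print(members, peptides)
```

In this example P1 and P2 end up in one group: the peptide evidence shows that at least one of them is present, but not which one.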
2.3 Proteogenomics
Proteogenomics is a relatively new term describing the fusion of proteomics and genomics
[12,21,22]. The approach made its debut in the scientific research community in 2004 [12] and
has been evolving since then. Initially, proteogenomics comprised the use of proteomic
information to support genome annotation [12,22]. Genome annotation follows genome
sequencing and aims to delineate the genes and determine their biological
function [16]. Conventionally, this endeavour was pursued by the genomic research
community, whereas proteomic researchers focused on the actual protein level.
Nowadays, proteomics and genomics are no longer stand-alone research fields, but
make a joint effort to improve the current genome annotation [12].
2.3.1 Goals of proteogenomics
Since the beginning of proteogenomics, many studies have been dedicated to improvements in
genome annotation and the protein identification process. Therefore, these two applications
are explained in more detail.
Aid genome annotation
As proteogenomics studies were developed with a focus on genome annotation, this is still
one of the most important applications. Traditionally, protein-coding genes were located
using sequence similarity or ab initio gene finding predictions. These annotation processes,
however, are rather theory-based and do not provide enough evidence to call genes flawlessly
[16]. Validation of the predicted genes is therefore extremely important. Mass
spectrometry-based experiments provide a way to confirm the existence of protein-coding
genes and hypothetical proteins with translational evidence. Mass spectrometry-based
proteomics data can also be used to find new protein-coding genes by using the identified
peptide sequences to search genomic databases. In this way, proteogenomics studies can
reduce the error rate of current and previous annotation studies, which emphasizes its
importance in genome annotation and re-annotation [16,23,24].
Discover new proteoforms
Along with the progress in genome annotation, proteogenomics has also become a key player
in the process of protein identification and finding new proteoforms. The term 'proteoforms'
refers to all protein products derived from the same gene [25]. For example, proteins might
have proteoforms containing a longer N- or C-terminus resulting in the so-called N-terminally
and C-terminally extended proteoforms. Similarly, proteoforms with shorter N- or C-termini
are called N-terminally or C-terminally truncated proteoforms [26,27].
Traditionally, proteins were identified with the aid of reference protein sequence databases
like Swiss-Prot [28]. As explained earlier, the entries of these databases undergo in silico
MS/MS so that experimentally obtained MS/MS spectra can be matched with their corresponding
peptides. Unfortunately, this limits the identification procedure to peptides already
present in the consulted reference database. Researchers are trying to address this problem by
generating customized protein sequence databases using genomic and transcriptomic
information, which enlarges the search space in the field of interest. This can be of
particular importance when studying disease-associated proteins, which may contain one or
more genetic variations. Likewise, completely new peptides associated with novel coding
regions are more likely to be found with a customized database [12,22,29]. So far, many new
proteins have been discovered with the aid of proteogenomics studies. These include
alternative splice variants [30] and peptides derived from small open reading frames (sORFs)
[31].
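Why a genetic variant has to be present in the search database can be illustrated by applying a variant to a coding sequence and translating the result. The sketch below is not the PROTEOFORMER implementation; the function names, the 0-based coordinate convention and the toy coding sequence are assumptions made for illustration. The standard genetic code is used.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(cds):
    """Translate a coding sequence until the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_variant(cds, pos, ref, alt):
    """Replace allele `ref` at 0-based position `pos` by `alt`.

    Equal lengths of `ref` and `alt` give a substitution (SNP); unequal
    lengths give an insertion or deletion (INDEL).
    """
    if cds[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return cds[:pos] + alt + cds[pos + len(ref):]


if __name__ == "__main__":
    cds = "ATGGCCAAGTTTGGATAA"                        # toy coding sequence
    print(translate(cds))                             # reference protein
    print(translate(apply_variant(cds, 4, "C", "T")))   # SNP
    print(translate(apply_variant(cds, 5, "C", "CA")))  # 1-nt insertion
```

In this toy case the SNP changes a single amino acid, whereas the one-nucleotide insertion shifts the reading frame and alters every residue downstream of it. In both cases the affected peptides differ from those of the reference entry, so their spectra can only be matched if the variant sequence is part of the database.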
2.4 Ribosome profiling
2.4.1 An emerging technique
In addition to genomics and proteomics, translatomics has been a very important research
area over the last few years. Initially, information on the translational level was derived
from measurements at the expression level by evaluating mRNA (messenger RNA) abundance
using microarrays. Later on, mRNA sequencing became more and more popular. However,
due to extensive regulation at the mRNA level and the occurrence of events like
initiation at non-AUG start codons, it is impossible to determine the exact protein levels and
their identity from the detected mRNA levels. With the advent of translation inhibitors,
new approaches to monitor protein information at the translational level were developed
[32,33].
Initially, polysome profiling [34] was used to monitor protein synthesis. A polysome
is an mRNA transcript occupied by multiple ribosomes. In polysome
profiling, these mRNAs are studied after isolation of the polysome fraction. Unfortunately,
this technique has limited resolution and cannot be used to reveal the exact ribosome positions
[32,33]. These shortcomings have led to the introduction of ribosome profiling (also called
RIBO-seq) by Ingolia et al. This emerging technique is based on the deep sequencing of
ribosome protected fragments (RPFs). By only studying the mRNA fragments captured in the
ribosomes, the actual translation status in vivo can be assessed. Moreover, the ability to halt
active ribosomes, enables the determination of the ribosomal positions with single-nucleotide
precision. Up to now, these advantages have made the technique very popular in the process
of genome annotation by improving the identification of novel protein-coding genes
[32,33,35].
2.4.2 Outline of a ribosome profiling experiment
The ribosome profiling strategy consists of a few main steps. These steps are presented in
figure 3 and are explained below.
Figure 3: Overview of the major steps in ribosome profiling. Ribosome profiling begins with the
release of RNA from cells treated with translation inhibitors. After a subsequent RNA digest, which
degrades unbound RNA, the resulting RNA-ribosome-protected fragment (RPF) complexes are
isolated with a centrifugation technique. The RPFs are then separated from the ribosomes and purified
using gel electrophoresis. In a following step, the isolated RPFs are linked with adaptors, amplified
and sequenced. An intermediate rRNA depletion step makes sure that the remaining rRNA molecules
are degraded. Finally, the sequenced RPFs are mapped to a reference sequence [36].
Every RIBO-seq experiment starts with lysing the cells of interest. In order to take a
representative snapshot of the translational situation, halting the active ribosomes before the
actual lysis is necessary. This pausing step is introduced to avoid alterations of the ribosomal
positions during and after lysis as a result of the rapid translational progress. Different
approaches based on translation inhibitors including cycloheximide (CHX), lactimidomycin
(LTM) and harringtonine (HARR), are available [35,36,37,38]. Their purposes will be
explained in section 2.4.3.
After cell lysis, nuclease digestion ensures the degradation of RNA that is not protected by
ribosomes. The resulting RNA-ribosome complexes contain RPFs of circa 28 nucleotides,
also termed footprints. A subsequent separation step based on sucrose density fractionation
isolates these footprint-bound ribosomes from remaining cell residuals. In a consecutive step,
the footprints are recovered from the ribosomal complexes and purified by gel
electrophoresis. Deep sequencing of these footprints is possible after the construction of a
derived DNA library. This includes the attachment of linkers that provide the priming site
needed for reverse transcription followed by circularization of the products [37,38]. Next, the
contamination with ribosomal RNA (rRNA) has to be taken into consideration. Therefore
rRNA is optionally depleted by hybridization with rRNA probes which are later on removed
due to their affinity for streptavidin [38]. Following a polymerase chain reaction (PCR), the
resulting fragments are then ready for deep sequencing. Finally, the sequenced fragments can
be analyzed by aligning the obtained reads to a reference sequence [35,37,38].
2.4.3 Applications of ribosome profiling
Despite the fact that ribosome profiling is a relatively young technique, it has already been
used in a number of different studies demonstrating the large range of applications of this
technology. Generally, two different approaches can be distinguished. The first one relies on the
use of elongating ribosomes whereas the second one makes use of initiating ribosomes [35,38].
Applications of elongating ribosomes
Elongating ribosomes can be halted with the aid of flash-freezing [38,39] or by using
antibiotics acting on the translocation activity. A well-known and widely used antibiotic is
cycloheximide (CHX). This compound blocks the elongation of translation in
eukaryotic cells through the obstruction of the exit site of the 60S ribosome subunit [36,40,41].
The mRNA sequences restrained in these halted ribosomes provide a massive amount of
information leading to a better (re)-annotation of the genome [38].
Identification of small open reading frames
Ribosome profiling can reveal the existence of small open reading frames (sORFs), which
are defined as open reading frames (ORFs) of at most 300 bases. Due to
their small size, sORFs are often excluded by automated gene annotation algorithms.
Likewise, MS-based proteomics and RNA-seq (RNA sequencing) based transcriptomics
experiments are not effective in revealing new sORFs. The discovery of new
sORFs therefore only became feasible with the introduction of ribosome profiling. The small peptides
derived from sORFs are called micropeptides and can have important functions [42].
Transporter proteins, transcription factors and hormones are just a few examples of the
enormously diverse functionality of these micropeptides [43].
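The length criterion for sORFs can be illustrated with a minimal Python sketch. The function
name, the restriction to forward frames and ATG start codons, and the example sequence are
assumptions made for illustration only; they do not describe how any annotation tool works.

```python
# Sketch: scan a sequence for ORFs of at most 300 bases (forward strand only).
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_SORF_LENGTH = 300  # bases, following the definition in the text


def find_sorfs(sequence, start_codons=("ATG",), max_length=MAX_SORF_LENGTH):
    """Return (start, end, orf) tuples for ORFs no longer than max_length bases.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    sequence = sequence.upper()
    sorfs = []
    for start in range(len(sequence) - 2):
        if sequence[start:start + 3] not in start_codons:
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                orf = sequence[start:pos + 3]
                if len(orf) <= max_length:
                    sorfs.append((start, pos + 3, orf))
                break
    return sorfs


# Hypothetical example sequence containing two short ORFs.
for start, end, orf in find_sorfs("CCATGGCTTGATTATGAAATAGC"):
    print(start, end, orf)
```

Passing near-cognate codons through the start_codons argument would extend the scan to
alternative start sites.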
Studying protein translation on a quantitative level
Ribosome profiling also provides the possibility to study the expression of different genes on
a quantitative scale. Each gene transcribes its DNA information into mRNA sequences. These
sequences will go through the translational process due to ribosomal activity. Every mRNA
transcript bound to a ribosomal complex is considered as being translated. In other words,
every RPF that is picked up during the experiment represents a translating ribosome.
Therefore, the number of RPFs reflects the number of translating ribosomes for the
corresponding transcript and is proportional to the level of protein that is produced. By taking
into account both the amount of RPFs and the time required to finish a protein, the production
rate of the proteins can be determined. Hence, RPF density can be used to quantify gene
expression and measure the speed of protein synthesis [35,38]. Also, by measuring the
translation products and comparing them with the mRNA abundance measured at the transcriptional level by
RNA-seq, new insights into the translational regulation can be gained [33,38].
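The comparison between RPF density and mRNA abundance can be sketched as a simple ratio of
length- and depth-normalized read counts. The term translation efficiency, the RPKM
normalization and all function and parameter names are assumptions for illustration; the
cited studies may normalize differently.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def translation_efficiency(rpf_count, mrna_count, cds_length,
                           total_rpf_reads, total_mrna_reads):
    """Ratio of normalized RPF density to normalized mRNA density for one CDS."""
    mrna_density = rpkm(mrna_count, cds_length, total_mrna_reads)
    if mrna_density == 0:
        return None  # undefined without mRNA coverage
    return rpkm(rpf_count, cds_length, total_rpf_reads) / mrna_density


# Hypothetical counts: twice as many footprints as mRNA reads at equal depth.
print(translation_efficiency(200, 100, 1000, 1000000, 1000000))
```

A ratio above one would suggest that a transcript carries more ribosomes than expected from
its mRNA level alone.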
Discovering genetic variation
Ribosome profiling can be of particular importance in disease-focused studies due to the
inclusion of genetic variation information within ORFs. Single Nucleotide Polymorphisms
(SNPs) are the most common variants in the human population. SNPs represent single base
positions in the genome that differ between individuals. Some people will, for
example, carry a cytosine at a specific genomic location while others have an adenine at this
position [44].
Next to SNPs, insertions (addition of one or more nucleotides) and deletions (removal of one
or more nucleotides) can occur. Many diseases have already been linked to specific SNPs and
INDELs (INsertions and DELetions). Well-known examples are sickle-cell anaemia (caused by
an SNP [45]), cystic fibrosis (caused by a deletion [46]) and Tay-Sachs disease (caused by an
insertion [47]).
Genetic variation also plays a key role in cancer research. Due to the presence of
extensive protection mechanisms against cancer, multiple genes have to undergo mutagenesis
to initiate the disease [48,49]. Moreover, the alterations that lead to oncogenesis have to
occur in specific classes of genes: oncogenes, tumour-suppressor genes and stability genes.
Oncogenes and tumour-suppressor genes are related to the cell-cycle control. By stimulating
cell division (due to activated oncogenes) and inhibiting cell-cycle control (due to inhibited
suppressor genes), mutations in these genes can cause a rapid increase in the number of cells.
Stability genes, on the other hand, play an important role in controlling the presence of
mutations by trying to repair the genetic alterations. Inactivation of these genes suppresses
this control mechanism resulting in a higher mutation rate [49]. Both SNPs and INDELs are
often present in cancer-related genes [48,50]. These variations can be detected with
NGS-based technologies, including ribosome profiling, making these very useful techniques
in cancer studies [51,52].
Applications of initiating ribosomes
Ribosomes can also be stalled solely during the initiation stage. Halting initiating ribosomes
can be achieved by using harringtonine (HARR) or lactimidomycin (LTM). LTM inhibits
translation initiation by binding to the exit site of the 80S complex whereas HARR halts the
ribosomes by binding to the acceptor site of the free 60S ribosome subunit [38,40,53,54].
Both compounds are useful in the identification of translation start sites. However, LTM can
be preferred over HARR, as treatment with the latter results in a substantial number of mRNA
fragments mapping downstream of the translation start site, which reduces the accuracy of the
experiment [40].
Detection of alternative translation initiation sites and their resulting proteoforms
The translational process in eukaryotes is a well-known phenomenon. Both the initiation and
the elongation process have been thoroughly studied over the years. Generally speaking,
translation initiation is triggered when the scanning ribosomal complex encounters an AUG codon. In the
case of efficient recognition, the first AUG will serve as a translation initiation site [40].
However, inefficient recognition, due to factors like the absence of necessary secondary
structures in the mRNA near the initiation site, may lead to leaky scanning. In that case, the
ribosome continues scanning for a more efficient translation initiation site (TIS) downstream of the initial
TIS [40,55].
More recently, it was discovered that translation initiation also occurs at non-AUG
codons that differ by one nucleotide from the cognate start codon. These near-cognate start sites
are denoted as alternative translation initiation sites or aTIS. Because in silico sequence analysis
alone could not reliably identify these aTIS, it was not possible to accurately predict new
aTIS. However, with the introduction of ribosome profiling as a way to define ribosomal
positions, this problem was solved. Using specific translation initiation inhibitors, translation
initiation sites can be determined with single nucleotide resolution. In this way both TIS and
aTIS are accurately predicted [40].
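The definition of near-cognate start codons (one nucleotide difference from AUG) can be made
concrete with a small enumeration. The function name is hypothetical and the sketch only
lists the candidate codons; it says nothing about how often each one is used in vivo.

```python
def near_cognate_codons(cognate="AUG", alphabet="ACGU"):
    """Return all codons that differ from the cognate codon at exactly one position."""
    variants = []
    for position, original in enumerate(cognate):
        for base in alphabet:
            if base != original:
                variants.append(cognate[:position] + base + cognate[position + 1:])
    return variants


codons = near_cognate_codons()
print(len(codons), codons)
```

Three positions with three alternative bases each give nine near-cognate codons.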
A significant fraction of these newly annotated TIS represents near-cognate start sites lying in
the regions upstream of the annotated coding sequence. These seem particularly important in
the generation of upstream open reading frames (uORFs): short translated sequences in the 5'
untranslated region (5' UTR) [32,39,56]. Moreover, some of these ORFs might have an
influence on the regulation of downstream ORFs by producing small peptides called
peptoswitches [57].
Alternative TIS are also responsible for the production of N-terminally extended and N-
terminally truncated proteoforms. The latter are produced by using downstream TIS. Both
extensions and truncations result in variants having different functions or carrying other
localisation signals [35,40].
2.5 The PROTEOFORMER pipeline
Recently a new tool creating a customized database has been developed by researchers of
BioBix in Gent. This tool was introduced under the name of PROTEOFORMER in 2014 and
is already publicly available as a stand-alone version and a Galaxy implementation. The
PROTEOFORMER tool consists of an automated pipeline of seven steps starting from
ribosome profiling data and ending in a protein sequence search database necessary for the
identification of proteins by MS/MS matching. Using RIBO-seq data rather than mRNA-seq
data, the PROTEOFORMER pipeline overcomes the problems of searching a big database
and provides evidence at the translational level. Currently, the analysis of RIBO-seq data
derived from Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis
thaliana is supported. The analysis of other species will be made available in the
future [52].
2.5.1 Overview of the pipeline
The PROTEOFORMER pipeline includes seven major steps: quality control, mapping,
transcript calling, TIS calling, variation calling, translation assembly and translation database
construction. These steps are depicted in figure 4. Besides these major steps, figure 4 also
contains the prior step of next-generation sequencing, needed to gain the input reads, and the
final peptide and protein identification step. A detailed description of these steps is given
below [52,58].
Figure 4: Overview of the PROTEOFORMER pipeline. The PROTEOFORMER pipeline starts from
RIBO-seq reads generated with NGS-based technologies. These reads are analyzed using seven
different steps including quality control, mapping of the reads to the reference genome, transcript
calling, TIS calling, variation calling, translation assembly and translation database construction. In a
final step the custom created database can be used in a MS-based peptide and protein identification
process [52].
Next generation sequencing
The PROTEOFORMER pipeline uses RIBO-seq data as input. This information is gathered
by sequencing the ribosome protected footprints using the Illumina approach. As previously
explained, Illumina sequencing is based on fluorophores and terminating groups attached to
the four different nucleotides. By matching each nucleotide with a specific fluorescent
emission signal, the incorporated nucleotide can be determined. The resulting reads derived
from the measured fluorescence and their estimated quality are collected in a FASTQ file. The
inclusion of a TIS calling algorithm in the pipeline requires ribosome profiling information on
initiating ribosomes. Hence, in addition to the RIBO-seq data of CHX-treated samples,
information concerning HARR- or LTM-treated samples is necessary. Consequently, two
FASTQ files are taken as input: one containing reads derived from elongating ribosomes and
one representing information about initiating ribosomes [52].
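The structure of a FASTQ file (four lines per read: identifier, sequence, separator and
quality string) can be illustrated with a minimal reader. The function name is hypothetical,
and the Phred+33 quality encoding is an assumption that holds for current Illumina output but
should be verified for older data sets.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip()
        # Assumption: Phred+33 encoding of the quality characters.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Hypothetical single-read example.
example = io.StringIO("@read1\nACGTAC\n+\nIIIIHG\n")
for read_id, sequence, scores in read_fastq(example):
    print(read_id, sequence, scores)
```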
Quality control
The subsequent quality control of the collected FASTQ files forms a preliminary step to the
PROTEOFORMER pipeline. A suitable tool for the monitoring of FASTQ files and
determining the sequence quality is the FastQC quality control application developed by the
Babraham Institute. This application creates a quality control (QC) report that is very valuable
for the detection of irregularities or unexpected deviations from the predicted outcome of the
raw sequencing data [59]. Moreover, FastQC can also be applied on the aligned reads after
mapping. In this way, the effects of the preprocessing steps on the raw data can be verified.
These include adaptor clipping, sequence trimming and filtering on unwanted RNA.
Aside from the QC report, two other quality determinants are taken into account. The first one
is known as the metagenic functional classification wherein the species-specific annotation
from Ensembl is used. RPFs that map to protein-coding transcripts are classified into
5' UTR (5' untranslated region), exonic, intronic and 3' UTR (3' untranslated region)
annotation classes whereas RPFs corresponding to non-protein-coding transcripts are
classified as 'other biotypes'. The rest of the RPFs are considered intergenic [52]. The second
quality determinant is the gene distribution which visualizes the overall dynamic range of the
translation counts [52].
Mapping
As a first step in the pipeline, the raw RPFs should be aligned against a reference genome.
Two different mapping approaches can be used by the PROTEOFORMER pipeline [52]:
TopHat [60] and STAR (Spliced Transcripts Alignment to a Reference) [61]. Both these
splice-aware aligners start with constructing a genome index. The mapping against the
indexed genome of the species of interest is possible after filtering out the reads aligned to the
PhiX bacteriophage genome (resulting from the spiked-in sequences) and the reads aligned to
the species-specific rRNA (resulting from rRNA contamination specific to the ribosome
profiling protocol), tRNA (transfer RNA) and sn(o)RNA (both small nuclear and nucleolar
RNA) [52]. For all these filtering steps, the RPFs are mapped non-uniquely, meaning they are
allowed to map to more than one location. In contrast, the mapping against the reference
genome of the species of interest can be carried out uniquely, in which case each RPF maps to
at most one location, or non-uniquely.
After clipping the 3' end adaptor sequence, the reads are mapped to the different
genome indices that were created. The number of reads mapped to each of these indices is calculated and
stored in an SQLite table. The alignment information is also given in a BedGraph file
which enables the visualization of the results in a genome browser [52].
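The idea behind clipping the 3' end adaptor can be sketched as follows. This is a simplified,
exact-match illustration and not the clipping routine used by the pipeline or by the aligners;
the function name, the min_overlap parameter and the adaptor sequence in the example are
assumptions.

```python
def clip_3prime_adaptor(read, adaptor, min_overlap=3):
    """Remove a 3' adaptor (full or partial at the read end) using exact matching."""
    # Case 1: the complete adaptor occurs inside the read.
    index = read.find(adaptor)
    if index != -1:
        return read[:index]
    # Case 2: only the first part of the adaptor is present at the very end of the read.
    for overlap in range(min(len(adaptor), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[:-overlap]
    return read  # no adaptor detected


# Illustrative adaptor sequence; the real one depends on the library preparation kit.
ADAPTOR = "AGATCGGAAGAGC"
print(clip_3prime_adaptor("ACGTAGATCGGAAGAGCTT", ADAPTOR))
print(clip_3prime_adaptor("ACGTACGTAGATCGG", ADAPTOR))
```

Real clipping tools additionally tolerate sequencing errors in the adaptor, which this sketch
does not.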
Transcript calling
In addition to the mapping procedure, translational evidence on the transcripts is collected by
calculating the number of footprints corresponding to every genomic location of the coding
sequence. Thereby the footprint coverage of every exon is determined. Transcripts having a
uniform exon coverage exceeding a certain threshold are marked as truly translated [52]. In
this way, Ensembl transcripts matching the experimental RIBO-seq data are identified.
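The exon coverage criterion can be sketched in a few lines of Python 3. The coverage
definition (footprints per exon base), the threshold value and the requirement that every exon
passes are assumptions chosen for illustration; the actual rule applied by the pipeline is
described in [52].

```python
def exon_coverage(exon, footprint_positions):
    """Footprints per base for one exon given as (start, end), 1-based and inclusive."""
    start, end = exon
    length = end - start + 1
    hits = sum(1 for position in footprint_positions if start <= position <= end)
    return hits / length


def is_translated(exons, footprint_positions, min_coverage=0.1):
    """Hypothetical rule: every exon must reach the minimal coverage."""
    return all(exon_coverage(exon, footprint_positions) >= min_coverage
               for exon in exons)


# Hypothetical transcript with two exons and three footprint positions.
print(is_translated([(1, 10), (21, 30)], [2, 5, 25]))
```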
SNP calling
Using the aligned reads of the previous step, SNPs can be uncovered by using the SAMtools
suite. SAMtools is a software package developed for the analysis of alignments in Sequence
Alignment/Map (SAM) and Binary Alignment/Map (BAM) formats [62]. Finding SNPs with
the mpileup module of SAMtools starts from a file containing the alignment of
the reads on a reference genome. The mpileup command uses this information to generate a Binary
Variant Call Format (BCF) file containing genomic position information together with the
likelihoods of the potential genotypes [63,64]. Since mpileup requires a BAM file as input,
alignments stored in SAM format first have to be converted to BAM [64]. The calling itself is done with BCFtools by using
the information in the BCF file and applying a Bayesian model. BCFtools can also convert BCF files
to Variant Call Format (VCF) files [63,65].
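The order of the steps described above (SAM to BAM conversion, sorting, mpileup and calling)
can be summarized as a list of command lines assembled in Python. This is only a sketch: the
file names are hypothetical, the options shown are the ones documented for the 1.3 releases of
SAMtools and BCFtools and may differ in other versions, and the choice of the multiallelic
caller (-m) is an assumption rather than the setting used in the pipeline.

```python
def snp_calling_commands(sam_path, reference_fasta, prefix="sample"):
    """Return the command lines (as argument lists) for a basic SNP calling run."""
    bam = prefix + ".bam"
    sorted_bam = prefix + ".sorted.bam"
    bcf = prefix + ".raw.bcf"
    vcf = prefix + ".vcf"
    return [
        # Convert SAM to BAM.
        ["samtools", "view", "-b", "-o", bam, sam_path],
        # Sort by coordinate, as required by mpileup.
        ["samtools", "sort", "-o", sorted_bam, bam],
        # Compute genotype likelihoods and write them in BCF format.
        ["samtools", "mpileup", "-g", "-f", reference_fasta, "-o", bcf, sorted_bam],
        # Call variant sites only and write them as uncompressed VCF.
        ["bcftools", "call", "-m", "-v", "-O", "v", "-o", vcf, bcf],
    ]


for command in snp_calling_commands("aligned.sam", "genome.fa"):
    print(" ".join(command))
```

Each argument list could be passed to subprocess.check_call to actually run the step; the
sketch deliberately only prints the commands.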
TIS calling
As mentioned before, the inclusion of antibiotics in the ribosome profiling protocol is
a key element in determining translation initiation sites. Both LTM and HARR can be used to
stall initiating ribosomes whereas CHX acts on elongating ribosomes. The parallel analysis of
the obtained LTM/HARR-treated and CHX-treated reads forms the basis of a peak calling approach
[40]. To this end, LTM- or HARR-associated RPFs and CHX-associated RPFs are mapped to
transcript sequences contained within the Ensembl annotation bundle [52]. During this
procedure, reads of initiating ribosomes accumulate at the translation initiation
sites, resulting in the so-called TIS peaks [40]. Only a subset of these peak positions
represents actual TIS. These TIS can be detected using user-defined parameters concerning
the proximity of predicted TIS, their minimal associated number of reads and a value
representing the relative abundance of LTM-treated reads over CHX-treated reads at a single
nucleotide position [52].
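The three kinds of parameters mentioned above (proximity of peaks, minimal read count and
relative LTM over CHX abundance) can be combined in a simplified peak caller, written here in
Python 3. The scoring formula, the default values and all names are assumptions made for
illustration; the exact definition used by PROTEOFORMER is given in [52] and is not
reproduced here.

```python
def call_tis(ltm_counts, chx_counts, min_count=5, window=1, min_difference=0.05):
    """Return positions called as TIS from per-position read counts (dicts)."""
    total_ltm = sum(ltm_counts.values())
    total_chx = sum(chx_counts.values())
    called = []
    for position, count in sorted(ltm_counts.items()):
        # Minimal number of LTM reads at the candidate position.
        if count < min_count:
            continue
        # Proximity rule: the position must be the local maximum within the window.
        neighbours = [ltm_counts.get(position + offset, 0)
                      for offset in range(-window, window + 1) if offset != 0]
        if any(neighbour > count for neighbour in neighbours):
            continue
        # Relative abundance: LTM fraction minus CHX fraction at this position.
        ltm_fraction = count / total_ltm
        chx_fraction = chx_counts.get(position, 0) / total_chx if total_chx else 0.0
        if ltm_fraction - chx_fraction >= min_difference:
            called.append(position)
    return called


# Hypothetical counts per transcript position.
ltm = {100: 40, 101: 6, 250: 10, 400: 3}
chx = {100: 5, 250: 30, 300: 65}
print(call_tis(ltm, chx))
```

In this toy example only position 100 is retained: position 101 is not a local maximum,
position 250 is dominated by CHX reads and position 400 has too few reads.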
Translation assembly
The objective of the previous steps is the collection of the information derived from ribosome
profiling data. With the translation of RIBO-seq data into amino acid sequences in mind, both
TIS and SNP information as well as knowledge concerning transcript isoforms is gathered. In
addition to the information on proteoforms described above, the chromosome reference
sequence and the Ensembl annotation bundle are essential. Scanning these chromosome files
using a binary reading approach makes it possible to retrieve the exon sequences quickly. These
exon sequences are used to derive the desired protein sequences [52].
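How exon sequences, a TIS position and SNP information come together can be sketched as
follows. The function signature is hypothetical, the standard genetic code is assumed, and
strand handling as well as INDELs are left out for brevity.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def assemble_translation(exon_sequences, tis_offset, snps=None):
    """Translate a spliced transcript from a TIS up to the first stop codon.

    exon_sequences: exon sequences in transcript order (sense strand).
    tis_offset: 0-based position of the start codon in the spliced transcript.
    snps: optional dict {0-based transcript position: alternative base}.
    """
    transcript = list("".join(exon_sequences).upper())
    for position, base in (snps or {}).items():
        transcript[position] = base
    transcript = "".join(transcript)
    protein = []
    for pos in range(tis_offset, len(transcript) - 2, 3):
        amino_acid = CODON_TABLE[transcript[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical two-exon transcript, without and with a SNP at position 6.
print(assemble_translation(["GGATGGC", "TTGGAAATAA"], 2))
print(assemble_translation(["GGATGGC", "TTGGAAATAA"], 2, snps={6: "A"}))
```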
Translation database
In order to use the assembled protein sequences in MS-based protein identification
experiments, these sequences need to be stored in a database. The PROTEOFORMER
pipeline generates a custom database in FASTA format. An important feature is the
non-redundancy of this database [52].
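A minimal way to obtain a non-redundant FASTA file is to collapse identical sequences and merge
their identifiers. This sketch only removes exact duplicates; the header layout and the
function name are assumptions, and the redundancy rules actually applied by the pipeline may
be more elaborate.

```python
import io


def write_nonredundant_fasta(entries, handle, line_width=60):
    """Write (identifier, sequence) pairs as FASTA, one record per unique sequence."""
    identifiers_by_sequence = {}
    for identifier, sequence in entries:
        identifiers_by_sequence.setdefault(sequence, []).append(identifier)
    for sequence, identifiers in identifiers_by_sequence.items():
        # Identifiers of identical sequences are merged into a single header.
        handle.write(">" + "|".join(identifiers) + "\n")
        for start in range(0, len(sequence), line_width):
            handle.write(sequence[start:start + line_width] + "\n")
    return len(identifiers_by_sequence)


# Hypothetical proteoforms; tr1 and tr3 share the same sequence.
output = io.StringIO()
count = write_nonredundant_fasta(
    [("tr1", "MAWK"), ("tr2", "MDWK"), ("tr3", "MAWK")], output)
print(count)
print(output.getvalue())
```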
MS-based proteomics/peptidomics
The final protein sequence database created with the PROTEOFORMER pipeline can be used
in the MS-based peptide and protein identification process. To this end, database search
engines are used to match the in silico generated MS/MS spectra, derived from peptides
within the database, with the experimentally acquired MS/MS spectra. This process is
explained in section 3.2.7 in more depth [52].
2.5.2 Achievements
The PROTEOFORMER pipeline was developed with a view to creating an optimal custom
protein database with the help of ribosome profiling information. The transition from publicly
available protein databases like Swiss-Prot to custom databases leads to a rise in the number
of protein identifications. Moreover, the inclusion of RIBO-seq data overcomes the problem
of an increasing database size by including only the translation products of one single reading
frame (as opposed to conversion of mRNA sequencing that generally performs a six-frame
translation) and focuses on the actual translation in vivo [52].
The benefits of using the resulting custom database in an MS/MS proteogenomics experiment
were demonstrated by comparing the generated custom databases with a peptide database
derived from Swiss-Prot [28]. This strategy was applied both to RIBO-seq data derived from
mouse embryonic stem cells (mESC) and the human colorectal tumour 116 cell line
(HCT116). The results of this validation experiment are shown in figure 5. The first two pie
charts in this figure visualize the increase of protein identification due to the implementation
of RIBO-seq data. Both cell lines contain proteins that were not part of the Swiss-Prot
database and could therefore not be detected before. Next to the identification of new
proteins, the protein score of a significant part of the proteins was increased. The last two pie
charts represent the classification of the N-terminal proteoforms (detected with the
N-terminal COmbined FRactional DIagonal Chromatography or COFRADIC method [66]).
Proteoforms containing TIS annotated in the Swiss-Prot database (database annotated TIS or
dbTIS) are the most abundant group, followed by proteoforms starting downstream of the
annotated TIS (downstream TIS or dTIS). Additionally, uORFs and N-terminally extended
proteoforms are detected but in smaller numbers. Moreover, both near-cognate and cognate
start sites were found within the newly detected proteoforms [52].
Figure 5: Pie charts representing the rise in identification rate for a) mouse and b) human and the
identification of new N-terminal proteoforms for c) mouse and d) human by using the
PROTEOFORMER pipeline [52].
3 Material and methods
Currently, the PROTEOFORMER pipeline has already proven its usefulness by increasing the
overall protein identification rate and enabling the detection of alternative proteoforms like N-
terminally extended proteoforms [52]. However, in order to be useful in cancer studies, the
inclusion of structural variant information is very important. SNPs are already taken into
account, but next to SNPs, insertions and deletions play a very important role in the
development of cancer. Therefore, it is necessary to include these variants in the pipeline as
well.
By taking into account both SNPs and INDELs, a lot of the genetic variation in cancer cells is
covered [48]. In all probability, this will improve the identification rate of proteins present in
cancer cells in comparison with databases lacking this information. This hypothesis is tested
by studying the protein identifications with and without the inclusion of INDELs using
matching ribosome profiling and mass spectrometry data from a human colorectal cancer
dataset. Additionally, the effect of INDELs is compared with the effect of SNPs.
3.1 Hardware
Both the existing scripts, needed to run the PROTEOFORMER pipeline, as well as the new
scripts written during this master thesis were executed on the Linux servers of BioBix. Two
different servers were used: one having 32 processors and 157 GB RAM and one having 64
processors and 315 GB RAM. The analysis of the peptide identification results was done on a
laptop with 8 GB RAM, four 2.20 GHz Intel® Core™ i5-5200U processors and the
Windows 8.1 operating system.
3.2 Software
3.2.1 SSH clients
The connection with the servers was made using the Secure Shell (SSH) protocol, which ensures
secure network services [67]. Within the context of the client-server model, two different SSH
clients were used: PuTTY (version 0.67) and WinSCP (version 5.9). WinSCP was
needed for secure file transfer between the local and the remote computer, while PuTTY was used
to operate the servers remotely.
3.2.2 Python
Python is a high-level, object-oriented language created by Guido van Rossum. Van Rossum
designed this language while working at the Center for Mathematics and Computer Science
(CWI) in Amsterdam and made it publicly available in 1991. Today, Python is owned by a
non-profit organization, called the Python Software Foundation (PSF), and is freely available
at the official Python webpage: www.python.org. The programming language is distributed
with a large library of modules. Moreover, Python is portable and applicable to a large
domain of problems. Thanks to its simple syntax and its interactive interpreter, this
programming language is also easy to learn and probably the most convenient language to
introduce non-programmers into computer programming [68].
Initially, the majority of the scripts of the PROTEOFORMER pipeline were written in Perl
(version 5.22.2). However, due to Python's rising popularity, this will change over time. In
this master thesis, Python version 2.7.11 was used.
3.2.3 SQLite
During the analysis of the ribosome profiling data by the PROTEOFORMER pipeline, the
results were stored in a relational database. In general, a relational database consists of tables
containing data and connections between the records of these tables. This data was stored and
queried using SQLite (version 2.8.17). SQLite is a freely available, very reliable Structured
Query Language (SQL) database engine [69]. It can be used in combination with any
available operating system [70] and enables manual execution of SQL queries through
the sqlite3 command-line shell [71].
3.2.4 Python DB-API
The connection with the desired SQLite database through Python was made using the sqlite3
module (version 3.11.0). This module provides an SQL interface
compliant with the standard database application programming interface (API) used in
Python: DB-API 2.0 [72,73].
Initially, new Python database modules were provided with a particular interface different
from other existing modules. Due to incompatibilities, users were compelled to rewrite
working code when changing the database product. This problem has been the starting point
for the creation of a consistent Python database interface. Retrieving and inserting data with the
sqlite3 module begins with the creation of an object representing the connection with the
database. Next, a cursor object is created from this connection and its execute method is called to perform an
SQL command. Finally, the results can be collected by using one of the fetch methods
[72,74].
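The connection, cursor and fetch steps can be sketched with the sqlite3 module from the Python
standard library. The table name, its columns and the inserted rows are hypothetical and do
not correspond to the tables created by the pipeline; an in-memory database is used so that
the example leaves no file behind.

```python
import sqlite3

# Step 1: create the connection object (here an in-memory database).
connection = sqlite3.connect(":memory:")

# Step 2: create a cursor and execute SQL commands through it.
cursor = connection.cursor()
cursor.execute("CREATE TABLE tis (transcript_id TEXT, start INTEGER, start_codon TEXT)")
cursor.executemany("INSERT INTO tis VALUES (?, ?, ?)",
                   [("tr1", 120, "ATG"), ("tr2", 87, "CTG")])
connection.commit()

# Step 3: run a query with a placeholder and collect the results with a fetch method.
cursor.execute("SELECT transcript_id, start FROM tis WHERE start_codon != ?", ("ATG",))
rows = cursor.fetchall()
print(rows)

connection.close()
```

Using placeholders (the question marks) instead of string formatting lets the module handle
quoting of the values.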
3.2.5 STAR
Spliced Transcripts Alignment to a Reference (STAR) is a free alignment tool written in C++.
It provides fast mapping of RNA-seq data to the reference genome in three consecutive steps:
genome indexing, seed search and seed clustering. Genome indexing is done by means of
uncompressed suffix arrays (SAs). A binary string search in these SAs allows the detection of
seeds. These seeds are clustered together with the aid of anchor sequences. Simultaneously, a
local alignment scoring scheme is used to find the alignments having the best scores [61].
STAR (version 2.5.2a) was used to map all the RIBO-seq derived reads to the reference PhiX
bacteriophage genome and the human reference rRNA, sn(o)RNA, tRNA and genome
sequence.
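The principle of searching seeds in a suffix array with a binary search can be illustrated on a
toy sequence. This sketch performs exact matching of a fixed seed only and builds the suffix
array naively; it is not STAR's implementation, and the function names are assumptions.

```python
def build_suffix_array(genome):
    """Return the start positions of all suffixes in lexicographic order (naive build)."""
    return sorted(range(len(genome)), key=lambda index: genome[index:])


def find_seed(genome, suffix_array, seed):
    """Return all positions where seed occurs, using a binary search on the suffix array."""
    low, high = 0, len(suffix_array)
    # Find the first suffix whose prefix is not smaller than the seed.
    while low < high:
        mid = (low + high) // 2
        start = suffix_array[mid]
        if genome[start:start + len(seed)] < seed:
            low = mid + 1
        else:
            high = mid
    # All matching suffixes are adjacent in the suffix array.
    hits = []
    while low < len(suffix_array) and genome.startswith(seed, suffix_array[low]):
        hits.append(suffix_array[low])
        low += 1
    return sorted(hits)


# Hypothetical toy genome.
genome = "ACGTACGGACG"
suffix_array = build_suffix_array(genome)
print(find_seed(genome, suffix_array, "ACG"))
```

Because every occurrence of a seed is the prefix of some suffix, a single binary search locates
all occurrences at once, which is what makes suffix arrays attractive for read mapping.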
3.2.6 Variant caller
The PROTEOFORMER pipeline enables SNP detection using the SAMtools suite. Next to
SNPs, INDELs needed to be identified. Today, multiple INDEL detection tools are available.
They are often based on different principles while the general workflow is the same. Hence,
different categories of INDEL calling tools can be distinguished [65].
A first important class of INDEL calling tools comprises the haplotype-based tools. Their
algorithms begin with selecting those sites in the genome that are more likely to show
variations. These variant sites are used to create a list of possible haplotypes. After realigning
these haplotypes to the original reads and applying Bayesian statistics, the variants can be
called [65,75,76]. Popular haplotype-based tools are GATK (Genome Analysis Toolkit)
HaplotypeCaller [77] and Platypus [76].
A second important class of variant calling tools groups the alignment-based techniques.
These methods start with the alignment of the reads against a reference genome. This
alignment is used to collect all the possible variants that are subsequently subjected to a
filtering step in which the correct SNPs and INDELs are retained [65]. Both SAMtools [62]
and GATK UnifiedGenotyper [78] are well-known variant callers belonging to this class.
Because the PROTEOFORMER pipeline uses SAMtools mpileup (version 1.3) to find SNPs
in experimental data and BCFtools (version 1.3) to call the significant variants, these tools
were also used to find INDELs.
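Once variants are available in VCF format, SNPs and INDELs can be told apart by comparing the
lengths of the reference and alternative alleles. The sketch below relies only on the fixed
tab-separated VCF columns (CHROM, POS, ID, REF, ALT); the function name and the example
records are hypothetical.

```python
def classify_vcf_record(line):
    """Classify each alternative allele of one VCF data line as SNP, insertion or deletion."""
    fields = line.rstrip("\n").split("\t")
    chromosome, position = fields[0], int(fields[1])
    reference, alternatives = fields[3], fields[4].split(",")
    results = []
    for alternative in alternatives:
        if len(reference) == 1 and len(alternative) == 1:
            kind = "SNP"
        elif len(alternative) > len(reference):
            kind = "insertion"
        elif len(alternative) < len(reference):
            kind = "deletion"
        else:
            kind = "other"  # e.g. a multi-nucleotide substitution
        results.append((chromosome, position, reference, alternative, kind))
    return results


# Hypothetical VCF data lines (header lines starting with '#' should be skipped).
for record in ["1\t100\t.\tA\tG\t50\t.\t.",
               "1\t200\t.\tAT\tA\t50\t.\t.",
               "1\t300\t.\tC\tCTG\t50\t.\t."]:
    print(classify_vcf_record(record))
```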
3.2.7 SearchGUI and PeptideShaker
A classic proteomics study ends with the identification of the proteins picked up throughout
the experiment. Figure 6 gives an overview of the main steps carried out during this process.
Figure 6: Protein identification using SearchGUI and PeptideShaker. Experimental MS/MS spectra
are identified using a database searching approach. Hereby, SearchGUI provides different search
engines which can analyze the MS/MS spectra simultaneously. The resulting peptide to spectrum
matches (PSMs) are evaluated with PeptideShaker. Finally, the obtained results can be submitted to
PRIDE or subjected to further analysis [79].
The data-analysis starts from the obtained experimental spectra and a protein sequence
database present in the standard FASTA format. Both data available from PRIDE (more
information on PRIDE is given in section 3.3.4) as well as in-house spectral data can be taken
as a starting point. Identification of the experimental spectra can be done using a database
searching approach by applying freely available identification software algorithms, also called
search engines. Nowadays, various search engines are at hand, each having their own
advantages and disadvantages. In order to achieve the best outcome possible, multiple search
engines are often combined. However, due to conflicts between input parameters and visual
interfaces, this can be very troublesome [79,80].
SearchGUI (version 3.2.14) is a user-friendly interface providing easy configuration of eight
different search engines (including X!Tandem [81] and the Open Mass Spectrometry Search
Algorithm or OMSSA [82]) and de novo sequencing algorithms, avoiding these conflicts
[80,83]. The resulting peptide to spectrum matches (PSMs) can be analyzed by another
software application: PeptideShaker (version 1.16.5). PeptideShaker provides simple
evaluation of the collective output gathered by the selected search engines and facilitates the
re-analysis of public data. The final results can be easily submitted to PRIDE or subjected to
further analysis in case of unidentified spectra [79,84].
3.2.8 Galaxy
Galaxy is an open source, web-enabled platform designed for the intensive data analysis
present in everyday bioscience research. It allows researchers to create or reuse complex
computational workflows from a user-friendly interface without the need to program. The
platform ensures reproducibility of the experiments by storing the required metadata.
Furthermore, the possibility to add annotations to every step of a workflow and the simplicity
of communicating experimental results are making this platform accessible to every
researcher interested in genomic data analysis [85].
3.3 Public databases
3.3.1 Ensembl annotation bundle
The Ensembl annotation bundle is part of the Ensembl project founded in 1999 [86]. It
provides high quality genome annotation for a wide variety of species. The Ensembl
annotation process starts from a genome assembly present in the International Nucleotide
Sequence Database Collaboration (INSDC). It involves ab initio gene prediction tools and
sequence homology based on protein, cDNA (complementary DNA), EST (expressed
sequence tag) and RNA-seq data [87,88].
The human genome assembly is currently managed by a worldwide association responsible
for improvements of the human, mouse, chicken and zebrafish reference genome: the Genome
Reference Consortium (GRC) [89,90]. Hence, the human genome assembly is abbreviated as
GRCh followed by a release number.
Here, the research is based on assembly GRCh38 and Ensembl annotation bundle version 82.
Only the tables needed for the construction of the Ensembl transcripts were used. These
include the exon, the exon_transcript, the gene, the seq_region, the coord_system, the
transcript and the translation table. Their contents and underlying relationships are shown in
figure 7.
Figure 7: Overview of the most important Ensembl tables needed in the PROTEOFORMER pipeline.
Transcript information is stored in the transcript table. It provides the start and end
positions of the transcripts and the strand (sense or antisense). It is linked with the translation table
which contains information on the translation start and stop sites of the transcript. The transcripts are
linked with their corresponding genes by the gene table and with their exons through the
exon_transcript linking table. More information about the exons themselves can be seen in the exon
table. The tables seq_region and coord_system store information about all the available coordinate
systems. The PROTEOFORMER pipeline uses the chromosome coordinate system. Adapted from
[91].
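The table relationships in figure 7 can be made concrete with a toy query. The following is a minimal sketch, not code from the PROTEOFORMER pipeline: it builds an in-memory SQLite database with a reduced set of columns (names assumed to match the Ensembl core schema), fills it with invented rows and retrieves the exons of a transcript in rank order through the exon_transcript linking table. The helper name exons_of is hypothetical.

```python
import sqlite3

# In-memory toy database; the rows are invented and are not real Ensembl data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER,
    seq_region_start INTEGER, seq_region_end INTEGER, seq_region_strand INTEGER);
CREATE TABLE exon (exon_id INTEGER PRIMARY KEY,
    seq_region_start INTEGER, seq_region_end INTEGER);
CREATE TABLE exon_transcript (exon_id INTEGER, transcript_id INTEGER, rank INTEGER);
INSERT INTO transcript VALUES (1, 10, 100, 500, 1);
INSERT INTO exon VALUES (7, 100, 200), (8, 300, 500);
INSERT INTO exon_transcript VALUES (8, 1, 2), (7, 1, 1);
""")


def exons_of(transcript_id):
    """Return (rank, start, end) for every exon of a transcript, ordered by rank."""
    rows = con.execute(
        """SELECT et.rank, e.seq_region_start, e.seq_region_end
           FROM exon_transcript AS et
           JOIN exon AS e ON e.exon_id = et.exon_id
           WHERE et.transcript_id = ?
           ORDER BY et.rank""",
        (transcript_id,),
    )
    return rows.fetchall()


print(exons_of(1))
```

Ordering by rank rather than by genomic coordinate keeps the exons in transcript order for both strands.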
3.3.2 dbSNP
Parallel to the variant calling, performed by SAMtools, the PROTEOFORMER pipeline
allows the inclusion of SNPs present in the Single Nucleotide Polymorphism Database
(dbSNP) [52]. This database was developed in 1998 by the National Center for Biotechnology
Information (NCBI), with the help of the National Human Genome Research Institute
(NHGRI), as an answer to the increasing interest and research in genomic variations and their
association with clinical phenotypes. The name 'dbSNP' refers to the most abundant variation
type present in the database: SNPs. Along with these SNPs, dbSNP contains other genomic
variants, including short INDELs [92,93].
Researchers can submit new genomic variants to the database when providing sufficient
information with regard to the experiment and the flanking sequences. In case of multiple
entries concerning the same genomic variant, dbSNP will combine the information given by
all original records and store it into a new reference record. This record can be linked with
other NCBI databases [92,93].
In addition to the inclusion of SNPs from dbSNP, the PROTEOFORMER pipeline was
adapted in order to enable the inclusion of INDELs as well.
3.3.3 Sequence read archives
In today's genomic studies, NGS-based technologies generate massive amounts of nucleotide
sequence data. In order to preserve this information in the international, public domain, the
DNA Databank of Japan (DDBJ), the European Bioinformatics Institute (EMBL-EBI) and the
NCBI started a collaboration to unify the obtained sequence information in all of their
databases, namely the INSDC. Thanks to this collaboration, data submitted to one of these
organizations can easily be shared with the other partners. One of the data types contained by
these databases is raw sequence data produced during NGS. This data type is stored in a
separate database called the Sequence Read Archive (SRA) [94]. The DDBJ, the NCBI
and the EMBL-EBI each have their own SRA [95]. The SRA of the EMBL-EBI is part of the European
Nucleotide Archive (ENA) [96].
3.3.4 PRIDE
The PRoteomics IDEntifications database (PRIDE) is a public database holding the
experimental results of proteomic studies associated with published, scientific articles. It was
created to make proteomic data more accessible and easier to query, but it can also be used as
a private platform for communication and collaboration during the process of data-analysis
and peer-review of the final article. Peptide and protein identification lists can be easily
submitted and accessed via a web interface along with their associated MS/MS spectra
[97,98]. Together with the RIBO-seq data available in the SRAs, these data are used during the
validation of the renewed PROTEOFORMER pipeline.
3.4 Custom database creation
The effect of including INDEL or SNP information in custom sequence databases on the
peptide/protein identification rate in MS-based proteomics was assessed. To this end,
RIBO-seq data derived from the human colorectal cancer cell line HCT116 were used.
These data were generated by the BioBix lab in 2014 [99] and can be downloaded
from the SRAs.
The custom databases themselves were created using the mapping, transcript calling, TIS calling
and variant calling steps provided by the PROTEOFORMER pipeline. The variant calling step was
originally only available for SNP detection. Therefore, the initial variant calling script was adapted.
Moreover, a completely new assembly script was written for the inclusion of both INDELs
and SNPs in the protein sequence database. Both the new variant calling script and the
assembly script are available on GitHub at https://github.com/sgielis/Master-thesis.
3.4.1 Mapping
For every RIBO-seq experiment, two FASTQ files were present: one containing the RIBO-
seq derived reads collected after CHX-treatment and another containing the RIBO-seq derived
reads collected after LTM-treatment. Both files were preprocessed using the FASTX-toolkit
(version 0.0.14) [100,101]. This included the removal of the adaptor sequence
(AGATCGGAAGAGCACAC) using the fastx_clipper tool and trimming the sequences with
the fastq_quality_trimmer tool. During the latter step, a quality threshold of 28 was used,
resulting in the removal of nucleotides lying at the end of the sequence and having a quality
score below 28. Simultaneously, sequences shorter than 20 nucleotides were discarded.
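The quality-trimming rule described above can be sketched in a few lines of Python. This is an illustrative re-implementation of the behaviour described in the text, not the FASTX-toolkit code itself; the function name trim_3prime and the representation of qualities as a list of integer Phred scores are assumptions.

```python
def trim_3prime(seq, quals, threshold=28, min_length=20):
    """Trim bases with quality below `threshold` from the 3' end of a read.

    Returns the trimmed (sequence, qualities) pair, or None when the read
    becomes shorter than `min_length` and should be discarded.
    """
    end = len(seq)
    # Walk back from the 3' end until a base reaches the threshold.
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], quals[:end]


read = "ACGT" * 6                      # 24 nucleotides
good = [35] * 21 + [20, 30, 10]        # only the last base is trimmed
bad = [35] * 19 + [10] * 5             # drops to 19 nucleotides: discarded
print(trim_3prime(read, good))
print(trim_3prime(read, bad))
```

Trimming stops at the first base from the 3' end that reaches the threshold, so a low-quality base further inside the read (score 20 in the example) is kept.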
In a following step, STAR (version 020201) was used to map the remaining RIBO-seq reads
consecutively to the reference PhiX bacteriophage genome and the human reference rRNA,
sn(o)RNA and tRNA (obtained from NCBI). These filtering steps were needed in order to
retain the translating RPFs only. In all cases, a maximum of two mismatches per read was
allowed. Moreover, a value of 0.5 was chosen for the option seedSearchStartLmaxOverLread
and no introns were allowed during mapping against the PhiX genome.
Next, the filtered (unmapped) reads were aligned to the human reference genome. Again, a
maximum of two mismatches per read was allowed. Alignments of reads mapping to
at most 15 loci were collected in SAM and BAM files. The same
information was also stored in BedGraph files, which were used to visualize the data in a
genome browser.
In a later step, the RPFs were assigned to single nucleotide positions, namely the ribosomal
P-sites. This was done using the Plastid tool [102]. Plastid determines these P-sites by
calculating P-site offsets. The terms P-site and P-site offset are explained in more detail
in figure 8.
In general, Plastid starts by grouping all the RPFs by length. Subsequently, these
reads are mapped to the canonical genes in the region surrounding the canonical start site.
Finally, for each read length, the distance between the canonical start site and the highest peak
of read 5' ends upstream of the start codon is measured and returned as the P-site offset [103,104].
Because the P-site offsets were calculated for all read lengths varying from 22 to 34
nucleotides, the exact P-site could be calculated for each of these lengths. Next, the range of
RPF lengths having a clear P-site offset was determined. All the RPFs lying in this range were
retained for further analysis. This information was also stored in two SQLite tables: one for
the CHX-treated reads and another for the LTM-treated reads. Both tables contain the retained
P-sites and an extra value representing the number of reads assigned to each site.
Figure 8: Clarification of the terms P-site and P-site offset. The P-site (peptidyl site) is the second site
in the ribosomal complex. Here, the growing polypeptide is present. It is preceded by the aminoacyl
site (A-site), which is responsible for the presentation of the aminoacyl-tRNA to the mRNA, and
followed by the exit site (E-site), where empty tRNA molecules leave the ribosomal complex [105].
The latter is not shown in this figure. The P-site offset is defined as the distance of the first nucleotide
of the RPF from the P-site. Here, an RPF of length 28 nucleotides is shown with an offset of 12 [104].
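The use of a P-site offset can be illustrated with a short Python sketch. It is not part of Plastid or PROTEOFORMER: the function name psite_position is hypothetical, coordinates are assumed to be inclusive, and the read is assumed to align without gaps or splice junctions.

```python
def psite_position(read_start, read_end, strand, offset):
    """Genomic position of the P-site for an ungapped read.

    `read_start` and `read_end` are inclusive genomic coordinates and
    `offset` is counted from the 5' end of the read.
    """
    if strand == "+":
        return read_start + offset
    # On the reverse strand the 5' end of the read is its highest coordinate.
    return read_end - offset


# A 28-nucleotide RPF with an offset of 12, as in figure 8.
print(psite_position(1000, 1027, "+", 12))
print(psite_position(1000, 1027, "-", 12))
```

Reads spanning an exon-exon junction would need the offset to be walked along the aligned blocks instead of being added to a single coordinate.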
3.4.2 Transcript calling
Subsequent to the mapping procedure, the ribosome profiles of the CHX-treated ribosomes
were used to find the translated transcripts. The transcript calling procedure includes the
collection of all possible transcripts using the Ensembl annotation bundle and the subsequent
association of these Ensembl transcripts with the evidence on translational level.
3.4.3 TIS calling
After the mapping and transcript calling, the translation start sites were determined. The
TIS-calling procedure started from the read counts and transcripts gathered in the previous steps.
In a first step, these reads were matched with the transcripts. Meanwhile, the LTM-reads
accumulated at TIS were combined into peaks. These TIS peaks represented both true and false
TIS. Hence, additional criteria were needed to retain only the peak positions corresponding to
true translation start sites. As a first criterion, the identified TIS positions should lie within a ±1
nucleotide window relative to an AUG or near-cognate start codon. Ribosome profiles
registered on the first position or the ±1 position of an AUG or near-cognate start codon
were summed into one peak. Second, these combined profiles should have more reads on a
single position than a user-definable minimal count. Third, the position must
also carry the maximum number of reads within a user-definable window. Here, a window of
one codon upstream and one codon downstream was chosen. As the final criterion, an
RLTM-RCHX value was calculated using the following formula:
$$R_{LTM} - R_{CHX} = \left( \frac{X_{LTM}}{N_{LTM}} - \frac{X_{CHX}}{N_{CHX}} \right) \times 10$$
Here, X represents the number of reads at the position under consideration and N the sum of all the reads on the
transcript, for the RPFs obtained after treatment with LTM or CHX [52]. Peaks with an
RLTM-RCHX value equal to or higher than a selected threshold were retained.
Both the minimum read count and the RLTM-RCHX threshold could take a different
value depending on the annotation. Five different annotations were possible: aTIS (annotated
TIS: the TIS coincides with a canonical start site), 5' UTR (TIS lying upstream of the
canonical start site), 3' UTR (TIS lying downstream of the stop codon), coding sequence or
CDS (TIS lying within the coding sequence of a protein-coding transcript) and no translation
(TIS lying in non-coding transcripts). All the TIS calling procedures were performed using a minimum count of
5 for aTIS, 10 for a 5' UTR site, a 3' UTR site and a non-coding site and 15 for a CDS site.
The RLTM-RCHX thresholds were set to 0.01 for aTIS, 0.05 for a 5' UTR, a 3' UTR and a
non-coding site and 0.15 for a CDS site.
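The count and RLTM-RCHX criteria can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the thresholds are the ones listed above, but the dictionary keys, the function names, the use of "greater than or equal to" for the minimum count and the scaling factor `scale` (set to 10 here) are assumptions. The start-codon and local-maximum criteria are not covered.

```python
# Thresholds per annotation, taken from the text; the keys are hypothetical labels.
MIN_COUNT = {"aTIS": 5, "5UTR": 10, "3UTR": 10, "non_coding": 10, "CDS": 15}
MIN_R = {"aTIS": 0.01, "5UTR": 0.05, "3UTR": 0.05, "non_coding": 0.05, "CDS": 0.15}


def r_difference(x_ltm, n_ltm, x_chx, n_chx, scale=10.0):
    """Difference in normalised read density between LTM and CHX at one position.

    x_* is the read count at the position, n_* the total read count on the
    transcript. The scaling factor is an assumption of this sketch.
    """
    return scale * (x_ltm / n_ltm - x_chx / n_chx)


def keep_peak(annotation, x_ltm, n_ltm, x_chx, n_chx):
    """Apply the minimum count and the RLTM-RCHX threshold for an annotation."""
    if x_ltm < MIN_COUNT[annotation]:
        return False
    return r_difference(x_ltm, n_ltm, x_chx, n_chx) >= MIN_R[annotation]


print(keep_peak("aTIS", 20, 1000, 5, 1000))      # enough reads, clear LTM enrichment
print(keep_peak("CDS", 10, 1000, 0, 1000))       # fewer than 15 reads
print(keep_peak("5UTR", 12, 10000, 10, 10000))   # LTM enrichment too weak
```

The stricter thresholds for CDS sites reflect that LTM peaks inside a coding sequence are more likely to be background from elongating ribosomes.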
3.4.4 Variant calling
SNPs and INDELs were detected using SAMtools. This variant caller starts from a SAM file
containing the alignment of the CHX-derived RIBO-seq data with the reference sequence. At
first, this SAM file was converted into a BAM file which was sorted on genomic location.
Next, the genotype likelihoods were determined and collected in a BCF file with the mpileup
command. BCFtools used this BCF file to call all the significant SNPs and INDELs and wrote
this information to a VCF file. Furthermore, the vcfutils.pl varFilter option was used to
discard all variants having a read depth below 3 or exceeding 100. From the final VCF file,
two SQLite tables were created: one for the SNPs and another one for the INDELs. Both
tables contain genomic position information, the reference and alternative variants and their
corresponding allele frequencies. The allele frequency is an important value describing the biological
relevance of the variant in the sample compared to the reference. An allele frequency of 1.0,
for example, means that the reference allele is not present in the studied RIBO-seq data, while
an allele frequency of 0.5 indicates that both alleles at a certain locus are equally present.
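The depth filter and the split into a SNP and an INDEL table can be sketched in Python. The records below are invented and the field and function names are hypothetical; in the thesis these values are read from the VCF file produced by BCFtools.

```python
# Invented variant records for illustration only.
variants = [
    {"pos": 1200, "ref": "A", "alt": "G", "depth": 2, "af": 1.0},
    {"pos": 3400, "ref": "C", "alt": "T", "depth": 40, "af": 0.5},
    {"pos": 5600, "ref": "GT", "alt": "G", "depth": 250, "af": 1.0},
    {"pos": 7800, "ref": "T", "alt": "TA", "depth": 100, "af": 1.0},
]


def passes_depth_filter(variant, min_depth=3, max_depth=100):
    """Keep variants whose read depth is neither below 3 nor above 100."""
    return min_depth <= variant["depth"] <= max_depth


def is_indel(variant):
    """A variant is an INDEL when reference and alternative alleles differ in length."""
    return len(variant["ref"]) != len(variant["alt"])


kept = [v for v in variants if passes_depth_filter(v)]
snps = [v for v in kept if not is_indel(v)]
indels = [v for v in kept if is_indel(v)]
print(len(snps), len(indels))
```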
Next to the experimentally observed variants, SNPs and INDELs from dbSNP were also taken
into consideration. In order to limit the size of the custom database, only dbSNP
variants present in the RIBO-seq derived reads were included. While adding these variants to
the two existing SQLite tables, a distinction was made between new variants discovered by
SAMtools, existing variants present in dbSNP and detected by SAMtools, and variants that
could not be called by SAMtools but were nevertheless present in dbSNP and in the SAM
alignment file.
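The three variant classes distinguished above map naturally onto set operations. The sketch below assumes that each variant is represented as a (chromosome, position, reference, alternative) tuple; the function name classify_variants and the class labels are hypothetical.

```python
def classify_variants(called, in_dbsnp, in_alignment):
    """Split variants into the three classes described in the text.

    called       : variants called by SAMtools
    in_dbsnp     : variants listed in dbSNP
    in_alignment : dbSNP variants observed in the SAM alignment file
    """
    return {
        "novel": called - in_dbsnp,
        "called_and_dbsnp": called & in_dbsnp,
        "dbsnp_not_called": (in_dbsnp & in_alignment) - called,
    }


called = {("1", 100, "A", "G"), ("1", 200, "C", "T")}
in_dbsnp = {("1", 200, "C", "T"), ("1", 300, "G", "GA"), ("1", 400, "T", "C")}
in_alignment = {("1", 200, "C", "T"), ("1", 300, "G", "GA")}
classes = classify_variants(called, in_dbsnp, in_alignment)
print({name: len(members) for name, members in classes.items()})
```

The dbSNP variant at position 400 falls in none of the classes because it is not supported by the aligned reads.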
3.4.5 Translation assembly
In a final phase of the customized database construction, all the information obtained during
previous steps was collected and assembled into protein sequences. This process started from
the transcript sequences called with sufficient translational evidence. Their sequences were
built up using the human reference genome sequence acquired from the iGenomes repository
and the exon start and end positions given by the Ensembl annotation bundle. In addition, the
experimentally identified TIS were taken into account by constructing ORFs starting from each
TIS position.
During ORF construction, variant information was added. Because SAMtools discovered a large
number of SNPs, the SNPs from dbSNP were not included. In contrast, the
number of INDELs discovered by SAMtools was low. Hence, both the INDELs discovered
by SAMtools and the INDELs from dbSNP were included. To this end, all the INDELs
and SNPs were collected per transcript and sorted in ascending order of position.
Afterwards, the alternative alleles of these variants were introduced one by one.
For the SNPs, only non-synonymous SNPs were included. These are SNPs leading to a
single amino acid change in a protein, resulting
in the so-called single amino acid variants (SAAVs) [106]. These SNPs are also known as
missense mutations. The reference allele was also included alongside the alternative
sequence when an allele frequency around 0.5 was observed.
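Introducing sorted variants one by one requires tracking how earlier INDELs shift the coordinates of later variants. The following Python sketch illustrates that bookkeeping; it is not the assembly script from the thesis. The function name apply_variants, the 0-based transcript coordinates and the requirement that variants do not overlap are assumptions, and the handling of heterozygous variants and synonymous SNPs is left out.

```python
def apply_variants(sequence, variants):
    """Insert alternative alleles into a transcript sequence.

    `variants` is a list of (position, reference, alternative) tuples with
    0-based positions on the original sequence; variants must not overlap.
    """
    result = sequence
    shift = 0  # cumulative length change caused by earlier INDELs
    for pos, ref, alt in sorted(variants):
        start = pos + shift
        if result[start:start + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        result = result[:start] + alt + result[start + len(ref):]
        shift += len(alt) - len(ref)
    return result


print(apply_variants("ATGGCCTAA", [(3, "G", "A")]))                    # SNP
print(apply_variants("ATGGCCTAA", [(5, "C", "CGT"), (3, "G", "A")]))   # SNP + insertion
print(apply_variants("ATGGCCTAA", [(2, "GG", "G"), (6, "T", "C")]))    # deletion + SNP
```

Checking the reference allele before each substitution catches coordinate errors early, which matters once several INDELs have shifted the sequence.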
Finally, all the assembled nucleotide sequences were translated in silico into protein
sequences. An extra filtering step ensured that only complete proteins were retained and
stored in an SQLite table and further processed into a non-redundant database in FASTA
format. From this database a target-decoy database [107] was created including additional
information from the cRAP (common Repository of Adventitious Proteins) database. The
latter is a FASTA database containing proteins which are often present in MS-based
proteomics experiments due to contamination [108].
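The in silico translation, the completeness filter and the target-decoy construction can be sketched as follows. This is a simplified illustration, not the thesis code: the standard genetic code is assumed, a protein is called complete when an in-frame stop codon is reached, near-cognate start codons are translated literally, and decoys are generated by reversing the target sequences with a hypothetical "_REVERSED" suffix.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_orf(nucleotides):
    """Translate an ORF; return None when no in-frame stop codon is found."""
    protein = []
    for i in range(0, len(nucleotides) - 2, 3):
        amino_acid = CODON_TABLE[nucleotides[i:i + 3]]
        if amino_acid == "*":
            return "".join(protein)
        protein.append(amino_acid)
    return None  # incomplete protein


def target_decoy(proteins):
    """Add a reversed decoy entry for every target protein."""
    database = dict(proteins)
    database.update({f"{name}_REVERSED": seq[::-1] for name, seq in proteins.items()})
    return database


orfs = {"orf1": "ATGGCCAAGTAA", "orf2": "ATGGCCAAG"}
complete = {name: translate_orf(seq) for name, seq in orfs.items()
            if translate_orf(seq) is not None}
print(target_decoy(complete))
```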
3.5 Validation of the renewed PROTEOFORMER pipeline with proteomics data
The generated custom databases were used to identify proteins and peptides isolated from the
HCT116 cell line using experimental MS/MS spectra available from PRIDE. This was done
with the help of the search engines X!Tandem Vengeance (2015.12.15.2) and OMSSA (2.1.9)
through the SearchGUI user interface. The analysis was carried out using two fixed
modifications (heavy labelled arginine (¹³C₆) and lysine (¹³C₆)) and three variable
modifications (pyroglutamate formation of N-terminal glutamine, oxidation of methionine to
methionine-sulfoxide, and acetylation of the N-terminus of both peptides and proteins).
Trypsin was the selected digestion enzyme and a maximum of one missed cleavage was allowed. In
addition, cleavage before proline was allowed. Furthermore, a mass tolerance of 10 ppm on
precursor ions and 0.5 Da on fragment ions was chosen. For the peptide charge, a range from
2+ to 4+ was selected.
The resulting PSMs produced by X!Tandem and OMSSA were analysed with PeptideShaker
using a false discovery rate (FDR) of 1%.
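Two of the search settings can be illustrated with a short Python sketch: tryptic digestion with at most one missed cleavage (cleavage before proline allowed) and the 10 ppm precursor tolerance. The functions below are hypothetical helpers for illustration and do not reproduce the internals of X!Tandem, OMSSA or SearchGUI.

```python
def tryptic_peptides(protein, missed_cleavages=1):
    """Cleave after every K or R (also before P) and allow missed cleavages."""
    cut_sites = ([0]
                 + [i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"]
                 + [len(protein)])
    peptides = []
    for i in range(len(cut_sites) - 1):
        last = min(i + 2 + missed_cleavages, len(cut_sites))
        for j in range(i + 1, last):
            peptides.append(protein[cut_sites[i]:cut_sites[j]])
    return peptides


def within_tolerance(observed_mz, theoretical_mz, ppm=10.0):
    """Check whether a precursor m/z lies within a relative tolerance in ppm."""
    return abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6 <= ppm


print(tryptic_peptides("AAKPRGGK"))
print(within_tolerance(1000.005, 1000.0))   # 5 ppm deviation
print(within_tolerance(1000.020, 1000.0))   # 20 ppm deviation
```

Because cleavage before proline is allowed, the K-P bond in the toy sequence is treated like any other cleavage site.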
4 Results
4.1 Custom database creation
To study the influence of incorporating INDELs in the protein sequences on the identification
rate in an MS-based identification process, two protein sequence databases were created: one
with INDEL information and one without. In addition, a protein sequence database containing
SNP information was generated to compare the impact of INDELs versus SNPs on the protein
identification level.
4.1.1 Mapping statistics
Initially, the CHX-treated and the LTM-treated RIBO-seq derived reads from the human
colorectal cancer cell line were mapped successively to the PhiX genome, the human rRNA,
sn(o)RNA (both small nuclear and nucleolar), tRNA and the human reference genome. The
mapping statistics of the CHX-treated RIBO-seq derived reads are given in table 1, whereas
table 2 lists the mapping results of the LTM-treated RIBO-seq derived reads. Both tables
specify the number of input reads available for mapping against each of the reference
sequence types (PhiX, rRNA, sn(o)RNA, tRNA and the human genome) and the associated
number of mapped and unmapped reads. In addition, the frequency of the mapped reads
relative to the total reads associated with the sequence type is given.
Table 1: Mapping statistics for the CHX-treated RIBO-seq derived reads.
Type        Total reads    Mapped          Unmapped       Frequency mapped (%)
PhiX        215 956 640    582 289         215 374 351    0.27
rRNA        215 374 351    159 881 772     55 492 579     74.23
sn(o)RNA    55 492 579     1 854 896       53 637 683     3.34
tRNA        53 637 683     6 686 666       46 951 017     12.47
genomic     46 951 017     39 774 983*     7 176 034      84.72**
*From a total of 39 774 983 reads mapped to the human reference genome, 26 615 285 reads mapped
to a unique genomic location, 13 159 698 were non-unique.
**The total frequency of 84.72% is the sum of the frequency of the uniquely mapped reads (56.69%)
and the non-uniquely mapped reads (28.03%) to the human reference genome.
Table 2: Mapping statistics for the LTM-treated RIBO-seq derived reads.
Type        Total reads    Mapped          Unmapped       Frequency mapped (%)
PhiX        228 664 982    590 458         228 074 524    0.26
rRNA        228 074 524    150 632 286     77 442 238     66.05
sn(o)RNA    77 442 238     3 198 726       74 243 512     4.13
tRNA        74 243 512     6 112 089       68 131 423     8.23
genomic     68 131 423     55 879 608*     12 251 815     82.02**
*From a total of 55 879 608 reads mapped to the human reference genome, 38 689 756 reads mapped
to a unique genomic location, 17 189 852 were non-unique.
**The total frequency of 82.02% is the sum of the frequency of the uniquely mapped reads (56.79%)
and the non-uniquely mapped reads (25.23%) to the human reference genome.
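The successive filtering in these tables follows a simple rule: the unmapped reads of one step are the input of the next, and the frequency is the number of mapped reads divided by the input reads of that step. The Python sketch below recomputes the values of Table 1 from the read counts; the variable and function names are hypothetical.

```python
# (reference, input reads, mapped reads), copied from Table 1 (CHX-treated reads).
CHX_STEPS = [
    ("PhiX", 215_956_640, 582_289),
    ("rRNA", 215_374_351, 159_881_772),
    ("sn(o)RNA", 55_492_579, 1_854_896),
    ("tRNA", 53_637_683, 6_686_666),
    ("genomic", 46_951_017, 39_774_983),
]


def mapping_summary(steps):
    """Return (reference, unmapped reads, percentage mapped) for every step."""
    return [(name, total - mapped, round(100 * mapped / total, 2))
            for name, total, mapped in steps]


summary = mapping_summary(CHX_STEPS)
for name, unmapped, percentage in summary:
    print(f"{name:10s} {unmapped:>12,d} {percentage:6.2f}")
```

Note that each percentage is relative to the reads entering that step, not to the initial number of reads.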
In total, 39 774 983 CHX-treated and 55 879 608 LTM-treated RIBO-seq derived reads were
mapped to the human reference genome. In order to determine the exact nucleotide position
needed for the TIS and transcript calling procedures, P-site offsets were calculated. With the
help of the Plastid tool, two plots were created visualizing these offsets as a function of their
corresponding read length. In figure 9, the results for the LTM-derived reads are given.
According to this figure, reads with a length from 27 to 34 nucleotides have a
clearly determined P-site offset and can therefore be associated with one single nucleotide
position on the reference genome. However, this range cannot be applied to the CHX-derived
reads, as shown by figure 10. Here, reads having fewer than 27 or more than 31 nucleotides
cannot be associated with a clearly determined P-site offset. Because a separate range could
not be chosen for the CHX-derived and the LTM-derived reads, the
range from 27 to 31 nucleotides, for which both datasets show a well-defined offset, was chosen.
All the CHX- and LTM-derived reads lying within this range were retained and pinpointed to one
nucleotide position.
Figure 9: Overview of the calculated P-site offsets for the LTM-treated RIBO-seq derived reads with
lengths varying from 22 to 34 nucleotides.
Figure 10: Overview of the calculated P-site offsets for the CHX-treated RIBO-seq derived reads with
lengths varying from 22 to 34 nucleotides.
4.1.2 Mapping quality control
After the mapping procedure, the quality of the RIBO-seq data was determined using two
important quality parameters, namely the translational phasing and the length distribution of
the reads (i.e. RPFs). The phase distribution represents the number of RIBO-seq reads mapped
to nucleotides lying in phase, one nucleotide out of phase or two nucleotides out
of phase within the translated open reading frame. The length distribution gives the number of
reads having a certain length. Both distributions are shown in figure 11 for the CHX-derived
and LTM-derived reads. It is clear that the majority of the reads map to nucleotides in phase.
Furthermore, the length distribution shows that most of the reads have a length within the
selected range of 27-31 nucleotides. As a consequence, only a small number of the reads
initially mapped to the human reference genome were excluded, resulting in a limited loss of
information.
Figure 11: Phase and RPF length distribution of the CHX-treated RIBO-seq derived reads (a) and
LTM-treated RIBO-seq derived reads (b) mapped to the human reference genome.
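The phase distribution can be computed directly from P-site positions. The Python sketch below is a minimal illustration for a single forward-strand, unspliced ORF; the function name phase_distribution and the toy positions are assumptions.

```python
from collections import Counter


def phase_distribution(psite_positions, cds_start):
    """Fraction of P-sites in phase 0, 1 and 2 relative to the start codon."""
    counts = Counter((position - cds_start) % 3 for position in psite_positions)
    total = sum(counts.values())
    return [counts[phase] / total for phase in range(3)]


# Toy data: three P-sites in phase and one P-site one nucleotide out of phase.
print(phase_distribution([100, 103, 106, 101], cds_start=100))
```

A strong excess of phase 0 indicates that the reads originate from translating ribosomes, which move along the mRNA one codon at a time.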
4.1.3 Discovered variants
Variant information was gathered based on the SAM file containing the alignments of the
CHX-derived reads. Both new variants, discovered by SAMtools, and existing variants,
present in dbSNP, were detected. Figure 12 shows the number of variants discovered by each
of these sources. In total, 63 300 SNPs and 251 INDELs were found. Most of these variants are
already present in dbSNP. However, 43 novel INDELs and 19 914 new SNPs were discovered
with SAMtools.
Figure 12: Pie charts showing (1) the number of new variants discovered by SAMtools, (2) the
number of dbSNP variants missed by SAMtools and (3) the number of existing variants present in
dbSNP and detected by SAMtools for a) SNPs and b) INDELs.
The detected variants were incorporated into the Ensembl transcript sequences during the
assembly. The incorporation was limited to variants having an allele frequency
between 0.4 and 0.6 or between 0.9 and 1.0. In this way, the allele frequencies lying near the
theoretically predicted allele frequencies of 0.5 and 1.0 are included. In order to investigate the
number of variants having an allele frequency within these ranges, the distribution of the
allele frequencies was calculated for both variant types. The results are shown in figures 13 and 14
for the SNPs and INDELs, respectively. From figure 13, it is clear that only a small subset of
the SNPs has an allele frequency lying outside the predefined ranges. These are therefore not
included in the Ensembl transcripts. In contrast with the SNPs, all the INDELs have allele
frequencies lying within the applied ranges. Hence, all the INDELs are included in the
Ensembl transcripts. For both variant classes, the majority of the variants have an allele
frequency around 1.0. This shows that, for most variants, the reference allele is not present in the
studied HCT116 RIBO-seq dataset.
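The allele-frequency rule can be written as a small decision function. This Python sketch assumes inclusive range boundaries, which the text does not specify, and the function name alleles_to_include is hypothetical.

```python
def alleles_to_include(allele_frequency):
    """Decide which alleles enter the database for a given allele frequency."""
    if 0.4 <= allele_frequency <= 0.6:
        # Both alleles are observed: keep reference and alternative sequences.
        return ("reference", "alternative")
    if 0.9 <= allele_frequency <= 1.0:
        # Only the alternative allele is observed.
        return ("alternative",)
    return ()  # outside both ranges: the variant is not incorporated


for af in (0.5, 1.0, 0.75):
    print(af, alleles_to_include(af))
```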
Figure 13: Distribution of the allele frequencies of the SNPs detected by SAMtools.
Figure 14: Distribution of the allele frequencies of the INDELs detected by SAMtools.
4.1.4 Translation assembly
Before assembling the protein sequences into a database in FASTA format, an SQLite table
containing the protein sequences including transcript, TIS and variant information was
created. In total three SQLite tables are made: one with INDEL information, one with SNP
information and one without any variant information. These contain respectively 140 617,
143 710 and 140 293 unique protein sequences. In figure 15, a Venn diagram is given,
illustrating the overlap between these three databases. From this diagram, it is clear
that most of the protein sequences are shared between the three tables. Moreover, the diagram
shows that the addition of SNP information is responsible for 7 348 extra protein sequences,
whereas the addition of INDEL information results in 345 extra protein sequences.
Figure 15: Venn diagram showing the number of shared and distinct protein sequences derived from
Ensembl transcripts lacking variant information and Ensembl transcripts carrying one or more
INDELs or SNPs (derived from the RIBO-seq data).
While studying the number of identical transcripts in the SQLite tables with INDELs and
without variants, 7 transcripts were found containing an INDEL that does not alter the protein
sequence. Because INDELs are typically associated with dramatic changes at the protein
level, these transcripts were studied in more detail.
It was found that, for every one of these transcripts, the stop codon overlaps the INDEL. More
specifically, the stop codon is part of a deletion of one nucleotide. Normally, this type of
variant is responsible for a frameshift resulting in a deviant protein sequence. Here, however,
no frameshift occurs within the coding sequence, because the deletion does not alter the stop
codon itself but only changes the mRNA sequence downstream of it. For example, the stop codon
TAG is not altered when the guanine of the stop codon is part of a reported deletion from
GGGGGGA to GGGGGA and
the first two nucleotides of the stop codon (in this case TA) precede the INDEL.
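The example can be verified with a toy sequence. In the Python sketch below, the transcript fragment is invented: it ends its coding sequence with the stop codon TAG, whose guanine is the first nucleotide of a run of six guanines. Removing one guanine from the run leaves the ORF up to and including the stop codon unchanged.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_up_to_stop(sequence):
    """Return the sequence up to and including the first in-frame stop codon."""
    for i in range(0, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            return sequence[:i + 3]
    return None


# Invented fragment: ATG GCA TAG followed by the rest of the guanine run.
reference = "ATGGCATA" + "GGGGGGA" + "CT"
alternative = "ATGGCATA" + "GGGGGA" + "CT"   # one guanine deleted from the run

print(orf_up_to_stop(reference))
print(orf_up_to_stop(alternative))
```

Both calls return the same ORF, so the deletion only affects the untranslated sequence downstream of the stop codon.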
4.2 Validation of the renewed PROTEOFORMER pipeline with proteomics data
After creating the custom protein sequence databases, these were used in a database search
approach to obtain peptide-to-spectrum matches (PSMs) from the experimental fragmentation
spectra (MS/MS). By comparing the proteins identified using these different databases,
the influence of the incorporation of INDEL and SNP information on the peptide and protein
identification process was assessed.
4.2.1 Exploration of the effect of INDELs on the identification process
In order to study the gain of adding INDEL information to the peptide and protein
identification process, both the custom protein sequence database without variant information
and with INDEL information were used to identify peptide and protein sequences. During all
analyses a threshold of 90% confidence was set, meaning only the proteins having a score of
at least 90% are considered. In total, 19 protein groups are detected that contain at least one
proteoform with an INDEL. More importantly, a single protein containing an INDEL is
identified. Both the protein groups and the single protein are discussed in more detail in
the following paragraphs.
Identification of a single protein containing an INDEL
A protein derived from the Ensembl transcript ENST00000368232 (start site at position
156 599 046 on chromosome 1) is identified with a confidence score of 100%. This transcript
contains an insertion of a guanine and thymine at position 1069 in the transcript sequence.
This INDEL is present in dbSNP, but is also found using SAMtools with an allele frequency
of 1.0. In order to make the subsequent analysis clearer, this protein is called the alternative
protein. In addition, the protein derived from the Ensembl transcript lacking the INDEL is
called the reference protein.
The identification of the alternative protein is based on the identification of two peptides:
pyro-QKEEEEATASER-COOH with 100% confidence and NH2-GEETVLGGGTR-COOH
with 99% confidence. The PSMs of these peptides are shown in figures 16 and 17, respectively.
These figures also visualize the fragment ions found for each of the peptide
bonds. For the first peptide, evidence is found for nine peptide bonds out of eleven
theoretically possible ones. Eight of these are validated with both b- and y-ions, one is only
validated with a b-ion. For the second peptide, only five out of ten peptide bonds are validated
with b- and/or y-ions. However, no other peptide is found containing the subsequence
ETVLGG. This supports the presence of NH2-GEETVLGGGTR-COOH.
Figure 16: Fragment ions and spectrum of the peptide pyro-QKEEEEATASER-COOH. a) The sequence
fragmentation plot shows the b-ions (blue peaks) and y-ions (red peaks) present due to the cleavage of
the corresponding peptide bonds. b) The PSM visualizes the annotated spectrum. Here, the red peaks
are annotated by a fragment ion, whereas the grey peaks are not.
Figure 17: Fragment ions and spectrum of the peptide NH2-GEETVLGGGTR-COOH. a) The
sequence fragmentation plot shows the b-ions (blue peaks) and y-ions (red peaks) present due to the
cleavage of the corresponding peptide bonds. b) The PSM visualizes the annotated spectrum. Here, the
red peaks are annotated by a fragment ion, whereas the grey peaks are not.
The effect of the insertion on protein level is studied by comparing the alternative protein
sequence with the reference sequence. Below, the alignment of the two protein sequences is
given. The bold, underlined amino acids represent the identified peptides. The amino acids in
green highlight the differences between the two protein sequences.
reference   MNVTPEVKSRGMKFAEEQLLKHGWTQGKGLGRKENGITQALRVTLKQDTHGVGHDPAKEF
alternative MNVTPEVKSRGMKFAEEQLLKHGWTQGKGLGRKENGITQALRVTLKQDTHGVGHDPAKEF
            ************************************************************

reference   TNHWWNELFNKTAANLVVETGQDGVQIRSLSKETTRYNHPKPNLLYQKFVKMATLTSGGE
alternative TNHWWNELFNKTAANLVVETGQDGVQIRSLSKETTRYNHPKPNLLYQKFVKMATLTSGGE
            ************************************************************

reference   KPNKDLESCSDDDNQGSKSPKILTDEMLLQACEGRTAHKAARLGITMKAKLARLEAQEQA
alternative KPNKDLESCSDDDNQGSKSPKILTDEMLLQACEGRTAHKAARLGITMKAKLARLEAQEQA
            ************************************************************

reference   FLARLKGQDPGAPQLQSESKPPKKKKKKRRQKEEEEATASERNDADEKHPEHAEQNIRKS
alternative FLARLKGQDPGAPQLQSESKPPKKKKKKRRQKEEEEATASERNDADEKHPEHAEQNIRKS
            ************************************************************

reference   KKKKRRHQEGKVSDEREGTTKGNEKEDAAGTSGLGELNSREQTNQSLRKGKKKKRWHHEE
alternative KKKKRRHQEGKVSDEREGTTKGNEKEDAAGTSGLGELNSREQTNQSLRKGKKKKRWHHEE
            ************************************************************

reference   EKMGVLEEGGKGKEAAGSVRTEEVESRAYADPCSRRKKRQQQEEEDLNLEDRGEETFRWW
alternative EKMGVLEEGGKGKEAAGSVRTEEVESRAYADPCSRRKKRQQQEEEDLNLEDRGEETVLGG
            ********************************************************.

reference   NQGSREQSMQ--------------------------------------------------
alternative --GTREAESRACSDGRSRKSKKKRQQHQEEEDILDVRDEKDGGAREAESRAHTGSSSRGK
              *:** . :

reference   ----------------------------
alternative RKRQQHPKKERAGVSTVQKAKKKQKKRD
The reference sequence consists of 370 amino acids. The alternative protein sequence
contains 446 amino acids, which is 76 more than the reference sequence. The alternative
protein is therefore a C-terminally extended proteoform of the reference protein. Both proteins
are present in UniProt as G patch domain-containing protein 4 proteoforms. However, the
reference protein is not experimentally validated and is therefore present in the TrEMBL section
(UniProt entry A0A0A0MRK1), whereas the alternative protein is reviewed and present in
the Swiss-Prot section (UniProt entry Q5T3I0).
Of particular importance is the position of the second identified peptide,
NH2-GEETVLGGGTR-COOH. This peptide overlaps the amino acids that are changed
by the INDEL and therefore provides evidence for the presence of the alternative protein in the
studied sample.
The C-terminally extended proteoform is found as a single protein (figure 18). The term 'single
protein' refers to a protein that is the only database entry able to explain a set of identified
peptides. In contrast, a protein identification is indistinguishable from other protein
identifications when the identified peptide sequences map to more than one protein sequence in the
database. This identification problem was also explained in section 2.2.2 as the protein
inference problem. In this case, all the indistinguishable proteins are collected in a group and
are identified as one 'protein group' instead of single proteins.
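The distinction between a single protein and a protein group can be sketched by grouping database entries on the set of identified peptides they contain. The Python example below is a strong simplification and not the algorithm used by PeptideShaker: proteins whose peptide set is a subset of another protein's set are not merged, and the protein labels and peptide assignments are invented toy data loosely modelled on the case discussed in this section.

```python
from collections import defaultdict


def infer_protein_groups(protein_to_peptides, identified):
    """Group proteins that are supported by exactly the same identified peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        matched = frozenset(peptides) & identified
        if matched:
            groups[matched].append(protein)
    return [sorted(members) for members in groups.values()]


# Toy database without variant information: three entries share one peptide.
without_variants = {
    "reference_370aa": {"QKEEEEATASER"},
    "isoform_359aa": {"QKEEEEATASER"},
    "isoform_375aa": {"QKEEEEATASER"},
}
# Toy database with INDEL information: one entry explains both peptides.
with_indel = {"alternative_446aa": {"QKEEEEATASER", "GEETVLGGGTR"}}

print(infer_protein_groups(without_variants, {"QKEEEEATASER"}))
print(infer_protein_groups(with_indel, {"QKEEEEATASER", "GEETVLGGGTR"}))
```

A group with one member corresponds to a single protein, while a group with several members corresponds to a protein group.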
Figure 18: Protein inference graph of the C-terminally extended proteoform. The C-terminally
extended proteoform is represented by the black circle at the bottom of the graph. It is linked with the
two peptides (green circles) described in the main text. Because the presence of these two peptides can
only be explained by the C-terminally extended variant, this proteoform is found as a single protein.
(The reference protein is not present in the database due to the INDEL having an allele frequency of
1.0.) In addition, a third peptide (red circle) is present at the top of the figure. This peptide was not considered
in the discussion of the peptides due to its confidence score of 0%. The remaining gray circles
represent proteins containing this peptide.
The protein identification process making use of the database without variant information
gives rise to the identification of a protein group containing three proteins: the reference
protein (370 amino acids) associated with transcript ENST00000368232 and two other
proteins associated with transcript ENST00000438976 (one starting at position 156 599 013
of chromosome 1 resulting in a sequence of 359 amino acids, the other starting at position
156 601 439 resulting in a sequence of 375 amino acids). These proteins differ only slightly
from each other (the last 359 amino acids of every protein are identical) as they are all
products of the same gene, translated from different transcript isoforms or start sites. The
absence of identified peptides specific to one of these proteins limits the identification to a
group instead of a single protein (figure 19).
More importantly, none of these proteins, identified using the database without
INDEL information, is actually present in the studied sample. This conclusion can be drawn
from the allele frequency of the INDEL, which is 1.0.
In addition, since the three proteins are similar in the majority of their sequence, we would
expect to have found INDELs in the two other proteins as well. However, only the protein
associated with transcript ENST00000368232 is identified as a protein containing an INDEL.
In order to explain this result, the two other transcripts and their resulting protein sequences
were studied in more detail. It appears that the addition of the INDEL to these sequences gives
rise to incomplete proteins: the inclusion of the INDEL results in a new coding sequence that
contains no stop codon. This makes it impossible to
produce a functional protein.
All the above results illustrate the importance of the addition of INDEL information to a
custom created protein sequence database.
Figure 19: Protein inference graph of the reference protein (blue circle) found using the database
without INDELs. Only one peptide (green circle) associated with the reference protein is found. This
peptide is linked with two other proteins (gray circles with yellow border) making it impossible to
distinguish one protein from another.
Protein groups containing proteins derived from Ensembl transcripts with INDELs
In addition to the previously described single protein, 19 protein groups are identified containing at
least one proteoform derived from a transcript holding an INDEL. These INDEL-containing
proteoforms are not validated by the matching MS-based proteomics data, since they fall
within protein inference groups that also hold proteoforms without INDELs. However,
they were still studied in more detail in order to evaluate the possible impact of INDELs at
the translational level.
In general, the INDELs give rise to three different types of proteoforms: C-terminally
truncated proteoforms, N-terminally extended proteoforms and proteoforms missing only one
amino acid with respect to the original protein lacking the INDEL. For every type, an example
is given in figure 20 and described in more detail in the following paragraphs.
[Figure 20 panels: (a) ENST00000612661, (b) ENST00000612898, (c) ENST00000527673. Legend entries:
reference protein; alternative protein; protein in protein group; protein not in protein group;
validated peptide; non-validated peptide.]
Figure 20: Protein inference graphs from (a) a protein group containing a C-terminally truncated
proteoform (black circle) in comparison with the reference sequence (blue circle), (b) a protein group
containing an N-terminally extended proteoform (black circle) and (c) a protein group containing a
proteoform missing one amino acid (black circle) in comparison with the reference sequence (blue
circle).
Example of a C-terminally truncated proteoform
A first protein group consists of ten proteins. Two are derived from the Ensembl transcript
ENST00000612661 (start site at position 113 857 746 on chromosome 6): one containing a
deletion of an adenine at position 454 in the transcript sequence and one lacking this deletion.
In this case, the reading frame is shifted by one nucleotide from the position of the deletion
onwards. Because of this, a 'premature' stop codon is encountered, lying upstream (5') of the
normal stop codon. This results in a C-terminally truncated proteoform.
More specifically, the reference proteoform contains 332 amino acids whereas the alternative
proteoform contains 165 amino acids. They only share their first 154 amino acids. The 178
remaining amino acids from the reference protein and the 11 remaining amino acids from the
alternative protein are completely different.
Due to the absence of peptides overlapping the distinct C-terminal part of the alternative
protein, this proteoform is not validated with the MS/MS data.
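The way a single-nucleotide deletion leads to a premature stop codon can be illustrated with a
small sketch. The coding sequence below is a toy example chosen for illustration and is not the
sequence of ENST00000612661; the helper names are assumptions, and the translation uses the standard
genetic code only.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds):
    """Translate a coding sequence from its first base up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def delete_base(cds, index):
    """Remove the nucleotide at a 0-based index."""
    return cds[:index] + cds[index + 1:]

# Toy coding sequence (hypothetical, not taken from the data of this work).
reference_cds = "ATGGCAAAGCTGATCGGTTAA"
variant_cds = delete_base(reference_cds, 6)  # delete one adenine

reference_protein = translate(reference_cds)
variant_protein = translate(variant_cds)
```

In this toy case the reference translates to MAKLIG and the variant to MAS: both share the residues
before the deletion, after which the shifted frame yields different amino acids and an earlier stop
codon. The same bookkeeping applies to the lengths reported above: 332 - 154 = 178 reference-specific
and 165 - 154 = 11 variant-specific amino acids.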
Example of an N-terminally extended proteoform
A second group contains 22 proteins. One of these is derived from the
Ensembl transcript ENST00000612898 (start site lying in the 5' UTR at position
27 838 622 on chromosome 6) and contains a deletion of a thymine at the fourth position in the
transcript sequence.
The start site in the 5' UTR lies out of frame with the canonical start site (position
27 838 662 on chromosome 6). As a result, the protein sequence derived from the 5' UTR start
site in the transcript lacking the INDEL is completely different from the protein sequence
derived from the canonical start site.
The deletion in the transcript sequence shifts the reading frame by one nucleotide. As a
result, the canonical start site comes to lie in frame with the 5' UTR start site, leading to an
alternative sequence of 139 amino acids. The last 126 amino acids of this sequence are identical to
the reference sequence starting from the canonical start site. Hence, the alternative protein is
an N-terminally extended proteoform of this reference protein. Due to the absence of
identified peptides overlapping the N-terminally extended region, this proteoform was not
validated with MS/MS data.
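The frame relation between the two start sites can be checked with simple arithmetic. The sketch
below uses the coordinates given in the text and assumes (this is not stated explicitly) that both
start sites lie in the same exon, so that the genomic distance equals the transcript distance, and
that the deleted thymine lies between them. The function name is hypothetical.

```python
def in_frame(start_a, start_b, deleted_between=0):
    """Return True if two start sites share a reading frame.

    `deleted_between` is the number of nucleotides removed between the two sites.
    """
    distance = abs(start_b - start_a) - deleted_between
    return distance % 3 == 0

utr_start = 27_838_622        # start site in the 5' UTR (chromosome 6)
canonical_start = 27_838_662  # canonical start site (chromosome 6)

without_indel = in_frame(utr_start, canonical_start)                  # 40 nt apart
with_indel = in_frame(utr_start, canonical_start, deleted_between=1)  # 39 nt apart

# Number of extra N-terminal codons once the deletion restores the frame.
extension = (canonical_start - utr_start - 1) // 3
```

Under these assumptions the two sites are 40 nucleotides apart (out of frame) without the deletion
and 39 nucleotides apart (in frame) with it, giving an extension of 13 amino acids, which agrees
with the reported 139 - 126 = 13.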
Example of a proteoform missing one amino acid in comparison with its reference protein
A next protein group consists of two proteins derived from the Ensembl transcript
ENST00000527673 (start site at position 119 018 284 on chromosome 11). Here, one of the
proteins contains a deletion of three nucleotides: the nucleotides AAGG starting at position 28
in the reference transcript sequence are replaced with a single guanine. Since three
nucleotides are deleted, the reading frame is not changed. Therefore, the deletion has only a
small effect on the protein, namely the absence of a lysine encoded by the three deleted
nucleotides. This new proteoform is likewise not validated, due to the lack of peptides spanning
the position of the missing amino acid.
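Whether an INDEL preserves the reading frame follows directly from its net length change. The
following sketch classifies a variant from its reference and alternative allele; the function name
and the returned labels are illustrative assumptions and do not correspond to a particular variant
caller.

```python
def classify_indel(ref_allele, alt_allele):
    """Classify a variant by its net length change in the coding sequence."""
    delta = len(alt_allele) - len(ref_allele)
    if delta == 0:
        return "substitution"
    kind = "insertion" if delta > 0 else "deletion"
    # A net change that is a multiple of three keeps the reading frame intact.
    frame = "in-frame" if delta % 3 == 0 else "frameshifting"
    return f"{frame} {kind}"

# AAGG replaced by G: the three nucleotides AAG (a lysine codon) are lost.
example = classify_indel("AAGG", "G")
```

For the example above, the net change is three nucleotides, so the variant is classified as an
in-frame deletion, whereas a one- or two-nucleotide change would be classified as frameshifting.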
4.2.2 Exploration of the effect of SNPs on the identification process
In order to compare the effect of INDELs with the effect of SNPs on the peptide and protein
identification level, the protein identification process was repeated using a custom database
including SNP information. Here, the analysis was restricted to proteins having a confidence
score of at least 90%.
SNP-containing proteins lacking extra MS/MS validation
In total, 30 SNP-containing proteins are identified as single proteins (i.e. not in an inference
group). In addition, 74 protein groups are identified that consist solely of SNP-containing
proteins. The majority of these proteins are not validated with peptides overlapping the
SNP-affected amino acids. However, these proteins can still be distinguished from the reference
proteins (proteins without SNP) by means of their allele frequency: all of the SNPs associated
with single proteins or with protein groups lacking the reference proteins have an allele
frequency of 1.0. As a consequence, the database does not contain the corresponding reference
proteins. Hence, the alternative proteins can be identified without the need for peptides
overlapping the changed amino acids.
An interesting remark can be made on the function of the identified proteins. At least two of
these proteins are related to DNA repair activity (MMS19 (methyl-methanesulfonate
sensitivity 19) nucleotide excision repair protein homolog [109] and TELO2 (telomere
maintenance 2)-interacting protein 1 homolog [110]). In addition, one possible oncogene
related to colorectal cancer (Pinin [111]) was found.
SNP-containing proteins with extra MS/MS validation
Only one protein group contains an MS/MS-validated SNP. This group (figure 21) consists of
two proteins: one derived from the Ensembl transcript ENST00000216605 (start site at
position 64 388 428 on chromosome 14), which has a length of 935 amino acids and is
submitted to Swiss-Prot (UniProt entry P11586), and one derived from the Ensembl transcript
ENST00000545908 (start site at position 64 388 260 on chromosome 14), which has a length
of 1020 amino acids and is submitted to TrEMBL (UniProt entry F5H2F4). Each contains two
SNPs (one at chromosomal position 64 415 662 replacing an adenine by a guanine and one at
chromosomal position 64 442 127 replacing a guanine by an adenine) with an allele frequency
of 1.0. Because both proteins are products of the same gene (MTHFD1), their
sequences are highly similar. The alignment of these sequences is shown below. In addition,
the reference sequence of ENST00000545908 (UniProt entry F5H2F4) is added to illustrate
the amino acid changes. The bold, underlined amino acids represent the identified peptides.
Due to the presence of a peptide starting after the first methionine (i.e. the initiator methionine
has been cleaved by the methionine aminopeptidase), the protein derived from Ensembl
transcript ENST00000216605 is more likely to be present in the sample. The green amino
acids represent the amino acids that are changed by the SNPs.
ENST00000216605 --------------------------------------------------------MAPA
ENST00000545908 MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA
F5H2F4 MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA
****
ENST00000216605 EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG
ENST00000545908 EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG
F5H2F4 EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG
************************************************************
ENST00000216605 IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV
ENST00000545908 IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV
F5H2F4 IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV
************************************************************
ENST00000216605 DGLTSINAGRLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL
ENST00000545908 DGLTSINAGRLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL
F5H2F4 DGLTSINAGKLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL
*********:**************************************************
ENST00000216605 LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD
ENST00000545908 LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD
F5H2F4 LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD
************************************************************
ENST00000216605 KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM
ENST00000545908 KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM
F5H2F4 KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM
************************************************************
ENST00000216605 IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK
ENST00000545908 IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK
F5H2F4 IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK
************************************************************
ENST00000216605 HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA
ENST00000545908 HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA
F5H2F4 HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA
************************************************************
ENST00000216605 AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN
ENST00000545908 AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN
F5H2F4 AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN
************************************************************
ENST00000216605 GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI
ENST00000545908 GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI
F5H2F4 GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI
************************************************************
ENST00000216605 TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED
ENST00000545908 TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED
F5H2F4 TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED
************************************************************
ENST00000216605 LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADQIALKLVGPEGF
ENST00000545908 LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADQIALKLVGPEGF
F5H2F4 LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADRIALKLVGPEGF
************************************************:***********
ENST00000216605 VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI
ENST00000545908 VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI
F5H2F4 VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI
************************************************************
ENST00000216605 QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK
ENST00000545908 QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK
F5H2F4 QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK
************************************************************
ENST00000216605 CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE
ENST00000545908 CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE
F5H2F4 CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE
************************************************************
ENST00000216605 AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV
ENST00000545908 AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV
F5H2F4 AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV
************************************************************
ENST00000216605 GTMSTMPGLPTRPCF-------YDID-------LDPETEQVNGL--------F-------
ENST00000545908 GTITIHLQEATLKVWPVSIQAHWELGSISKPREVSPCPEDLKLIVGVSPEVIFSLNSHHV
F5H2F4 GTITIHLQEATLKVWPVSIQAHWELGSISKPREVSPCPEDLKLIVGVSPEVIFSLNSHHV
**:: * : :::. :.* *::: : *
The first SNP is validated by an overlapping peptide. It replaces a lysine (K) by an arginine
(R) at sequence position 134 in the protein derived from transcript ENST00000216605 and at
sequence position 190 in the protein derived from transcript ENST00000545908. The second
SNP replaces an arginine (R) by a glutamine (Q) at sequence position 653 in the protein
derived from transcript ENST00000216605 and at sequence position 709 in the protein
derived from transcript ENST00000545908. This SNP is not validated by the matching
MS/MS data. Still, it is considered present because of its allele frequency of 1.0.
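The two sets of sequence positions differ by a constant offset, which is the length of the
N-terminal extension of the longer protein. A short sketch makes this explicit; it uses the first
alignment row of ENST00000545908 as given in the alignment, and the helper name is a hypothetical
choice.

```python
# First alignment row of the longer protein (ENST00000545908).
extended_row = "MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA"
shared_start = "MAPA"  # first residues of the shorter protein (ENST00000216605)

# Number of residues that precede the shared region in the longer protein.
offset = extended_row.index(shared_start)

def to_extended(position_in_short):
    """Convert a position in the shorter protein to the longer protein."""
    return position_in_short + offset

first_snp = to_extended(134)
second_snp = to_extended(653)
```

The offset of 56 residues maps position 134 to 190 and position 653 to 709, in agreement with the
positions reported for both SNPs.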
Figure 21: Protein inference graph of the protein group containing a validated SNP. The blue dots
represent the two proteins that are grouped together. The smaller, green dots represent the identified
peptides. The peptide overlapping the SNP affected amino acid is shown in light green. The small, red
dot represents a non-validated peptide.
5 Discussion
The interest in genetic variation is associated with its influence on the phenotypic level,
especially when it comes to disease susceptibility. In this work, the focus lies on the human
cancer proteome. From previous studies, SNPs and INDELs appear to be very abundant in
human cancer cells [48]. Therefore, their effect on the protein identification rate was studied.
The inclusion of INDELs resulted in the identification of one extra protein containing an
insertion of two nucleotides at the end of its transcript sequence. This caused a shift in the
reading frame, generating a new stop codon downstream of the original one. Hence, a
C-terminally extended proteoform was present. This proteoform was validated with one peptide
derived from the altered C-terminus and one peptide shared with the reference protein. The
former made it possible to discern the alternative from the reference proteoform. By itself,
this did not indicate the absence of the reference proteoform; in fact, the absence of
proteins can never be confirmed by MS [112]. In this case, however, the allele frequency of
1.0 indicated the absence of the reference protein.
In addition to the discovered single protein, 19 protein groups were detected containing one or
more alternative proteoforms. Unfortunately, no peptides were found overlapping the altered amino
acids. As a consequence, the presence of these proteoforms was not validated. Still, these
groups contained valuable information about the putative impact of INDELs on the protein
translation products. According to these examples, the effect of an INDEL is associated with
its length. Here, a deletion of three nucleotides resulted in the loss of only one amino acid
while maintaining the reading frame. Generally, INDELs whose length is a multiple of three
nucleotides (in-frame INDELs) do not change the reading frame, resulting in a minor effect on
the protein sequence [113]. In contrast, INDELs with a length not divisible by three do alter
the reading frame. Here, the position of the INDEL in the transcript sequence plays an
important role. For example, the INDEL in the C-terminal part induced a new stop codon
upstream of the original stop codon, resulting in a C-terminally truncated proteoform. Apart
from the C-terminally extended or truncated proteoforms, the human proteome contains
N-terminal proteoforms. These are associated with alternative start sites [114]. Alternative start
sites located in the 5' UTR give rise to N-terminally extended proteoforms, whereas
alternative start sites located downstream of the canonical start site result in N-terminally
truncated proteins. However, this is only the case when the alternative and canonical start
sites lie in frame. Here, an N-terminally extended proteoform was found that included an
INDEL lying upstream of the canonical start codon. This INDEL was responsible for a frame
shift of one nucleotide, causing the near-cognate site to lie in frame with the canonical start
site.
The inclusion of SNP information in the PROTEOFORMER pipeline has led to the
identification of 30 single SNP-containing proteins and 74 protein groups consisting solely of
SNP-containing proteins. Again, only one SNP was validated with MS/MS. This SNP was
present in two proteins and was responsible for the replacement of a lysine by an arginine.
The discrepancy between the number of identified proteins containing a SNP and the
number of SNPs validated with MS/MS illustrates the importance of adding extra
experimental data to the custom database. In fact, the identification is based on the allele
frequency of the SNPs. All 30 single proteins and 74 protein groups contain SNPs with an
allele frequency of 1.0. This means that no evidence for the reference allele was found in the
sample and that the reference protein therefore was not added to the database. This makes the
identification of the alternative proteins possible. In case of SNPs having an allele frequency
of 0.5, both reference and alternative proteins are added to the custom database. In the absence
of peptides overlapping the altered amino acids, the alternative and reference protein cannot
then be distinguished. In summary, the SNP calling based on the RIBO-seq data points to a full
inclusion of the mutation in both alleles (i.e. allele frequency of 1.0) and thus gives evidence
of the alternative sequence (although matching MS/MS validation is missing).
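The role of the allele frequency in building the custom database can be summarised in a short
sketch. This is a simplified illustration of the rule described above and not the actual
PROTEOFORMER code; the function name, the identifiers and the handling of a frequency of zero are
assumptions.

```python
def database_entries(reference_id, variant_id, allele_frequency):
    """Decide which proteoforms enter the custom database for one variant site."""
    if allele_frequency >= 1.0:
        # Only the variant allele is observed: the reference protein is left out.
        return [variant_id]
    if allele_frequency > 0.0:
        # Both alleles are observed (e.g. 0.5): both proteoforms are added.
        return [reference_id, variant_id]
    # Variant not observed: only the reference protein is added (assumed behaviour).
    return [reference_id]

homozygous = database_entries("protein_ref", "protein_snp", 1.0)
heterozygous = database_entries("protein_ref", "protein_snp", 0.5)
```

With an allele frequency of 1.0 only the variant proteoform is searchable, so any peptide evidence
for the protein supports the variant; with 0.5 both entries compete and a peptide overlapping the
altered amino acid is needed to separate them.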
The results illustrate the possible impact of INDELs and SNPs on the diversity of the human
proteome. Although the study here was restricted to the effect of these variants at the molecular
level, other studies have focused on their effect on cancer development. Cancer results
from somatic mutations in three different groups of genes: tumour-suppressor genes,
oncogenes and stability genes. These are all active in cell-cycle regulation through the
production of proteins initiating cell division (oncogenes), controlling cell division
(tumour-suppressor genes) or providing DNA repair activity (stability genes) [48,115]. Stability
genes are important in maintaining genomic stability. Due to the inactivation of these DNA repair
genes, mutations in oncogenes and tumour-suppressor genes can accumulate [116]. Oncogenes and
tumour-suppressor genes are the key drivers in the development of cancer.
Oncogenes are associated with a gain of function. This is a result of precise mutations at
specific locations resulting in a small modification of the protein [113]. Both mutations
leading to the over-expression of a protein and mutations resulting in continuously
activated proteins may lead to cancer development [115]. The majority of these mutations are
missense mutations. However, in-frame INDELs also occur frequently in oncogenes
[113]. In contrast with oncogenes, tumour-suppressor genes contribute to cancer through
loss-of-function mutations. Here, frameshifting INDELs play an important role [113].
Normally, the introduction of a frameshifting INDEL would lead to the production of an
aberrant protein. Such proteins are generally not present due to the action of nonsense-mediated
mRNA decay [117]. However, in cancer development, frameshifting INDELs in
tumour-suppressor genes are often retained by means of positive selection. Indeed,
non-functional tumour-suppressor genes are not able to stop the cell cycle or induce apoptosis
in case of excessive cell growth. This results in the accumulation of tumour cells [115]. In
addition to these frameshifting INDELs, tumour-suppressor genes also contain a large number of
missense and nonsense substitutions (substitutions leading to a 'premature' stop codon). In
addition, a small proportion of the mutations are caused by in-frame INDELs [113].
In general, the development of cancer occurs in different phases due to the accumulation of
multiple mutations. The function of both the oncogenes and the tumour-suppressor
genes needs to be affected. Hence, several genes often play a role. Various studies
have focused on the association between genetic variation and the risk of developing
colorectal cancer. It has been found that many colorectal tumours are initiated by mutations
in three tumour-suppressor genes (APC (Adenomatous polyposis coli) [118], tumour protein
53 (TP53) [119] and SMAD4 (mothers against decapentaplegic homolog 4) [120]) and two
oncogenes (KRAS (Kirsten rat sarcoma viral oncogene homolog) [121] and BRAF (B-Raf
proto-oncogene serine/threonine kinase) [122]). In addition, numerous other genes are
associated with an increased risk of colorectal cancer. In a recent meta-analysis [123], 950
colorectal cancer studies were investigated. From this analysis, variations in 50 genes were
found to be associated with an elevated risk of colorectal cancer. One of these genes was also
identified with the database containing SNP information, namely the MTHFD1
(methylenetetrahydrofolate dehydrogenase 1) gene. This gene generates a protein with
enzymatic activity in the folate-mediated one-carbon metabolism pathway. This pathway
plays an important role in the synthesis and repair of DNA by de novo production of purines
and thymidylate. It is also responsible for the production of methionine which is required for
DNA methylation. As a consequence, polymorphisms in the MTHFD1 gene may affect
genetic stability and gene expression, resulting in cancer susceptibility
[124]. In addition, at least three other proteins related to cancer development were detected
in this work. These include two proteins with DNA repair activity (MMS19 nucleotide
excision repair protein homolog and TELO2-interacting protein 1 homolog) and a possible
oncogene (Pinin). The alternative G patch domain-containing protein 4 also occurs in
some cancer studies [125,126]. In one of these studies, a proteogenomic approach was used to
assess the impact of mutations in DNA mismatch repair genes on the proteome. This resulted
in the validation of variant peptides including the sequence GEETVLGGGTR, which was also
validated in this work. However, its exact relation with cancer was not elucidated [126]. All
these results illustrate the potential benefits of proteogenomic cancer studies in clinical
approaches, namely the discovery and use of new biomarkers. By combining proteomic and
genomic approaches, variant proteins associated with tumour development can be detected
[127].
An important remark can be made on the impact of the detected variants at the translational
level. Figure 15 shows the number of alternative proteins having one or more INDEL(s) or SNP(s).
In total, 7348 SNP-containing proteins are present in the database. Of these, 30 single
proteins and 74 protein groups are detected with the matching MS-based proteomics
experiment. However, only one SNP was validated with mass spectrometry evidence. In
addition, only one alternative protein with an INDEL out of 345 possible proteins was
validated. According to these results, only a small number of the alternative proteins is picked up
with mass spectrometry. In other words, these proteins are associated with a low identification
rate. This limitation is also present in other proteogenomics studies [127,128]. For example,
in a recent study [129], the identification rate of single amino acid variants in breast cancer
was studied. Only 10 % of the SNPs detected using whole genome sequencing and whole
transcriptome sequencing were validated with MS/MS (figure 22). Multiple explanations are
possible for this phenomenon. For one thing, genetic variation has been linked with a
decreased efficiency in translation [130,131] and an elevated level of protein degradation
[132,133]. In addition to these biological consequences, the validation of genetic variants is
limited by the MS/MS process itself [127,128,129]. Efficient peptide identification, for
example, is restricted to peptides with a length from 6 to 30 amino acids. Therefore, all
variants incorporated in peptides falling outside this range will most likely be missed.
Because this limitation in sequence coverage of MS/MS is associated with trypsin digestion,
alternative approaches using multiple enzymes and fragmentation methods need to be studied
[129].
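The effect of the peptide length window can be illustrated with a simple in silico trypsin digest.
The sketch below is a simplification and not the digestion model of the search engine used in this
work: it assumes cleavage after lysine or arginine unless followed by proline, allows no missed
cleavages, and uses the 6 to 30 amino acid window mentioned above. The function name is
hypothetical; the example fragment is taken from the variant MTHFD1 sequence in the alignment of
section 4.2.2.

```python
def tryptic_peptides(protein, min_len=6, max_len=30):
    """In silico trypsin digest (cleave after K/R, not before P), no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        at_end = i == len(protein) - 1
        if at_end or (residue in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep only peptides inside the length window suited to identification.
    detectable = [p for p in peptides if min_len <= len(p) <= max_len]
    return peptides, detectable

# Fragment of the variant sequence surrounding the first SNP (K replaced by R).
fragment = "DGLTSINAGRLARGDLNDCFIPCTPK"
all_peptides, detectable = tryptic_peptides(fragment)
```

For this fragment the digest yields three peptides, of which the three-residue peptide LAR falls
below the window. A variant located in such a short peptide, or in a peptide longer than 30
residues, would most likely be missed, which is why alternative proteases are of interest.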
Figure 22: Venn diagram showing the number of SNPs detected by whole genome DNA-seq (blue
circle), whole transcriptome RNA-seq (pink circle) and MS/MS-based proteomics (orange circle) in
luminal breast tumours [129].
A comparative analysis of the protein identification process using the databases with SNPs
and INDELs shows that the addition of SNPs has a larger effect. Indeed, more than
100 SNP-containing proteins and only one INDEL-containing protein were identified. This
difference can be explained by several factors. First of all, SNPs are more abundant in the
human genome [134,135]. Secondly, the success of INDEL detection depends on the
chosen INDEL calling tool. Here, an alignment-based method was used to find variants in
short-read data. This method seems suitable for the detection of SNPs, but is more error-prone
when detecting INDELs. In the case of deletions, the detection is complicated because the read
aligns to two reference segments separated by a gap. Insertions, on the other hand, complicate
the process by introducing new sequences which cannot be mapped to the reference genome.
In addition, reads with long insertions are flanked by shorter reference sequences, which
makes the mapping procedure more difficult [136].
6 Conclusion
Today, many studies rely on the identification of proteins using MS-based technologies.
This implies the need for well-annotated, comprehensive protein sequence databases.
Nowadays, custom databases are generated using novel information from both the genomic
and the transcriptional level. This has led to the discovery of new proteoforms that could
not be detected using publicly available databases lacking this information.
The progress in genomics and transcriptomics goes hand in hand with the introduction of new
sequencing techniques. Currently, it is possible to assess the translational status of mRNA
fragments using ribosome profiling. With the aid of the PROTEOFORMER pipeline, this
information can easily be transformed into a custom protein sequence database.
In this master thesis, the impact of INDELs on the identification rate of human cancer
proteins was studied and compared with the impact of SNPs. For this, ribosome profiling data
from the human colorectal cancer cell line HCT116 were used to create a custom protein
sequence database. Using this database, one INDEL and one SNP were validated with MS/MS. These
low numbers may be a result of the low sequence coverage of MS/MS and the possible negative
effect of genetic variation on the translation process and on protein stability. In addition, a
large contrast was found between the number of alternative proteins detected using the
database with INDELs and the number of alternative proteins detected using the database
with SNPs. Both the relative frequency of these variants and the
limitations of the current INDEL calling strategy can explain this difference. Hence, further
improvements of the PROTEOFORMER pipeline with regard to genetic variation are
expected to give an additional increase in the protein identification rate.
7 Further research
In this master thesis, the impact of SNPs and INDELs on the protein identification rate of
human colorectal cancer cells was studied. It has been found that the addition of these variants
to the PROTEOFORMER pipeline has improved the protein identification process through the
detection of alternative proteins. However, this study has also shown that the pipeline is still
incomplete. Here, the current limitations and the possible improvements will be explained in
more detail.
A first improvement can be made using other variant callers. As previously stated,
alignment-based methods, like SAMtools, are less effective in detecting INDELs. As a result,
new INDEL calling tools have been developed. These are often based on paired-end RNA
sequencing. In contrast with single-read sequencing, paired-end methods sequence both ends
of each fragment [137]. Pindel, for example, uses a split-read mapping approach based on short-read
paired-end sequencing data [138]. In addition, haplotype-based methods are also suited to
calling INDELs. Here, alignment information is used to find the regions of interest, which are
later assembled into possible haplotypes. By realigning the reads with these haplotypes,
INDELs are detected. Although these methods seem to overcome the limitation of
alignment-based methods, they are still not optimal. As a result, the best outcome is obtained
when combining multiple techniques [65]. In line with the general problem of
detecting longer INDELs in short reads, the read length can also be considered as an
important factor influencing the success of INDEL detection. In this context, the use of
ribosome profiling as the method of choice might be questioned, since it limits the read length
to approximately 28 nucleotides. This problem can be addressed by adding support for
paired-end RNA-sequencing of longer reads [139]. In this way the sequencing step is split up
in two parts: ribosome profiling will give information about the translational status including
TIS information, whereas paired-end RNA-seq will be used to find genetic variations.
Furthermore, the overall detection rate of both SNPs and INDELs from the ribosome profiling
data is limited. Figure 12 illustrates the number of variants detected from the ribosome
profiling data. Only a small number of INDELs were picked up with SAMtools; the majority
were identified using dbSNP. Likewise, most of the identified SNPs were already present in
dbSNP. This shows that the number of new genetic variants discovered from the experimental
data is relatively low. The same conclusion can be drawn from other proteogenomics studies.
In recent work [127], human colon and rectal tumours were characterized using a
proteogenomics approach based on RNA-seq. For this, tumour samples from The Cancer
Genome Atlas (TCGA) [140] were subjected to LC-MS/MS shotgun proteomics. Single
nucleotide variants were detected using matching RNA-seq data. Figure 23 shows the number
of detected single amino acid variants in the analyzed samples. A few of these variants were
previously reported by TCGA, and most were present in dbSNP. In addition, a number of
COSMIC-supported variants were detected. COSMIC (Catalogue Of Somatic Mutations In
Cancer) [141] is a database collecting somatic mutations in human cancer genes. Only a
small number of the variants were new. This study illustrates that most of the variants
discovered in proteogenomics studies are already annotated in public databases.
Consequently, a further increase in the protein identification rate is expected when these
previously annotated variants are added to the pipeline. To limit the size of the search
database, only cancer-related variants should be included. The addition of COSMIC variants
related to colorectal cancer therefore seems the best approach.
Figure 23: Overview of the number of single amino acid variants (SAAVs) per TCGA sample. The
number of somatic variants reported by TCGA is shown in the upper graph. The second and third
graphs represent the remaining number of somatic variants present in COSMIC and dbSNP,
respectively. The last graph visualizes the number of new variants. Adapted from [127].
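The proposed restriction to colorectal cancer-related COSMIC variants can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout, the tissue labels, the header format and all function names are hypothetical and reflect neither the actual COSMIC export format nor the PROTEOFORMER implementation. The sketch selects the annotated single amino acid variants of interest and derives one additional database entry per variant.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class AnnotatedVariant(NamedTuple):
    protein_id: str  # hypothetical protein identifier
    position: int    # 1-based residue position
    ref_aa: str      # reference amino acid
    alt_aa: str      # variant amino acid
    source: str      # annotation source, e.g. "COSMIC" or "dbSNP"
    tissue: str      # hypothetical tissue label, empty if unknown


# Assumed tissue labels; a real annotation file may use a different vocabulary.
COLORECTAL_LABELS = frozenset({"colorectal", "colon", "rectum"})


def select_variants(variants: Iterable[AnnotatedVariant],
                    source: str = "COSMIC",
                    tissues=COLORECTAL_LABELS) -> List[AnnotatedVariant]:
    """Keep only variants from `source` annotated with one of `tissues`."""
    return [v for v in variants
            if v.source == source and v.tissue.lower() in tissues]


def apply_variant(sequence: str, variant: AnnotatedVariant) -> str:
    """Return the protein sequence carrying the amino acid substitution."""
    index = variant.position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position outside the protein sequence")
    if sequence[index] != variant.ref_aa:
        raise ValueError("reference residue does not match the sequence")
    return sequence[:index] + variant.alt_aa + sequence[index + 1:]


def build_entries(reference: Dict[str, str],
                  variants: Iterable[AnnotatedVariant]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, skipping inconsistent annotations."""
    for v in variants:
        sequence = reference.get(v.protein_id)
        if sequence is None:
            continue  # protein not present in the reference set
        try:
            mutated = apply_variant(sequence, v)
        except ValueError:
            continue  # annotation disagrees with the reference sequence
        yield f"{v.protein_id}_{v.ref_aa}{v.position}{v.alt_aa}", mutated
```

Filtering before sequence generation keeps the number of added entries, and therefore the search space of the MS/MS database search, as small as possible.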
In summary, the implementation of new INDEL detection methods and the incorporation of
genetic variants derived from public databases into the PROTEOFORMER pipeline are expected
to further increase the protein identification rate.
Bibliography
[1] R. Dahm, “Friedrich Miescher and the discovery of DNA,” Dev. Biol., vol. 278, no. 2,
pp. 274–288, 2005.
[2] J. M. Heather and B. Chain, “The sequence of sequencers: The history of sequencing
DNA,” Genomics, vol. 107, no. 1, pp. 1–8, 2016.
[3] L. Liu et al., “Comparison of next-generation sequencing systems.,” J. Biomed.
Biotechnol., vol. 2012, p. 251364, 2012.
[4] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nat. Biotechnol., vol. 26,
no. 10, pp. 1135–1145, 2008.
[5] E. R. Mardis, “The impact of next-generation sequencing technology on genetics,”
Trends Genet., vol. 24, no. 3, pp. 133–141, 2008.
[6] J. M. Rothberg and J. H. Leamon, “The development and impact of 454 sequencing.,”
Nat. Biotechnol., vol. 26, no. 10, pp. 1117–1124, 2008.
[7] R. J. Roberts, M. O. Carneiro, and M. C. Schatz, “The advantages of SMRT
sequencing,” Genome Biol., vol. 14, no. 6, p. 405, 2013.
[8] Pacific Biosciences, “Advance genomics with Single Molecule, Real-time (SMRT)
sequencing.” [Online]. Available:
http://www.pacb.com/smrt-science/smrt-sequencing/. [Accessed: 06-May-2017].
[9] S. Goodwin, J. Gurtowski, S. Ethe-Sayers, P. Deshpande, M. Schatz, and W. R.
McCombie, “Oxford Nanopore sequencing, hybrid error correction, and de novo
assembly of a eukaryotic genome,” Genome Res., vol. 25, no. 11, pp. 1750–1756,
2015.
[10] Y. Zhang, B. R. Fonslow, B. Shan, M. Baek, and J. R. Yates, “Protein Analysis by
Shotgun / Bottom-up Proteomics,” Chem. Rev., vol. 113, no. 4, pp. 2343–2394, 2013.
[11] X. Han, A. Aslanian, and J. Yates, “Mass Spectrometry for Proteomics,” Curr Opin
Chem Biol., vol. 12, no. 5, pp. 483–490, 2008.
[12] A. I. Nesvizhskii, “Proteogenomics: concepts, applications and computational
strategies,” Nat. Methods, vol. 11, no. 11, pp. 1114–1125, 2014.
[13] R. Aebersold and M. Mann, “Mass spectrometry-based proteomics.,” Nature, vol. 422,
no. 6928, pp. 198–207, 2003.
[14] M. W. Duncan, R. Aebersold, and R. M. Caprioli, “The pros and cons of peptide-
centric proteomics,” Nat. Biotechnol., vol. 28, no. 7, pp. 659–664, 2010.
[15] C. C. Wu and M. J. MacCoss, “Shotgun proteomics: tools for the analysis of complex
biological systems.,” Curr. Opin. Mol. Ther., vol. 4, no. 3, pp. 242–50, 2002.
[16] C. Ansong, S. O. Purvine, J. N. Adkins, M. S. Lipton, and R. D. Smith,
“Proteogenomics: Needs and roles to be filled by proteomics in genome annotation,”
Briefings Funct. Genomics Proteomics, vol. 7, no. 1, pp. 50–62, 2008.
[17] J. S. Cottrell, “Protein identification using MS/MS data,” J. Proteomics, vol. 74, no.
10, pp. 1842–1851, 2011.
[18] A. I. Nesvizhskii, “A survey of computational methods and error rate estimation
procedures for peptide and protein identification in shotgun proteomics,” J.
Proteomics, vol. 73, no. 11, pp. 2092–2123, 2010.
[19] J. V Olsen, S.-E. Ong, and M. Mann, “Trypsin cleaves exclusively C-terminal to
arginine and lysine residues.,” Mol. Cell. proteomics, vol. 3, no. 6, pp. 608–14, 2004.
[20] A. I. Nesvizhskii and R. Aebersold, “Interpretation of Shotgun Proteomic Data,” Mol.
Cell. Proteomics, vol. 4, no. 10, pp. 1419–1440, 2005.
[21] N. Castellana and V. Bafna, “Proteogenomics to discover the full coding content of
genomes: A computational perspective,” J. Proteomics, vol. 73, no. 11, pp. 2124–2135,
2010.
[22] G. Menschaert and D. Fenyö, “Proteogenomics from a bioinformatics angle: a growing
field,” Mass Spectrom. Rev., vol. 2015, no. 9999, pp. 1–16, 2015.
[23] S. Tanner et al., “Improving gene annotation using peptide mass spectrometry,”
Genome Res., vol. 17, no. 2, pp. 231–239, 2007.
[24] M. Mann and A. Pandey, “Use of mass spectrometry-derived data to annotate
nucleotide and protein sequence databases,” Trends Biochem. Sci., vol. 26, no. 1, pp.
54–61, 2001.
[25] L. M. Smith et al., “Proteoform: a single term describing protein complexity,” Nat.
Methods, vol. 10, no. 3, pp. 186–187, 2013.
[26] N. Fortelny, P. Pavlidis, and C. M. Overall, “The path of no return--Truncated protein
N-termini and current ignorance of their genesis,” Proteomics, vol. 15, no. 14, pp.
2547–2552, 2015.
[27] G. Menschaert, V. Olexiouk, and S. Verbruggen, “Proteoformer 2.0 (in preparation),”
2017.
[28] B. Boeckmann et al., “The SWISS-PROT protein knowledgebase and its supplement
TrEMBL in 2003,” Nucleic Acids Res., vol. 31, no. 1, pp. 365–370, 2003.
[29] D. A. Bitton, D. L. Smith, Y. Connolly, P. J. Scutt, and C. J. Miller, “An integrated
mass-spectrometry pipeline identifies novel protein coding-regions in the human
genome,” PLoS One, vol. 5, no. 1, pp. 1–10, 2010.
[30] G. M. Sheynkman, M. R. Shortreed, B. L. Frey, and L. M. Smith, “Discovery and mass
spectrometric analysis of novel splice-junction peptides using RNA-Seq.,” Mol. Cell.
Proteomics, vol. 12, pp. 2341–53, 2013.
[31] D. Yagoub et al., “Proteogenomic Discovery of a Small, Novel Protein in Yeast
Reveals a Strategy for the Detection of Unannotated Short Open Reading Frames,” J.
Proteome Res., vol. 14, no. 12, pp. 5038–5047, 2015.
[32] N. T. Ingolia, S. Ghaemmaghami, J. R. S. Newman, and J. S. Weissman, “Genome-
Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome
Profiling,” Science, vol. 324, no. 5924, pp. 218–223, 2009.
[33] N. T. Ingolia, “Ribosome profiling: new views of translation, from single codons to
genome scale.,” Nat. Rev. Genet., vol. 15, no. 3, pp. 205–13, 2014.
[34] H. Chassé, S. Boulben, V. Costache, P. Cormier, and J. Morales, “Analysis of
translation using polysome profiling,” Nucleic Acids Res., vol. 45, no. 3, p. e15, 2017.
[35] N. T. Ingolia, “Ribosome Footprint Profiling of Translation throughout the Genome,”
Cell, vol. 165, no. 1, pp. 22–33, 2016.
[36] A. M. Michel and P. V. Baranov, “Ribosome profiling: A Hi-Def monitor for protein
synthesis at the genome-wide scale,” Wiley Interdiscip. Rev. RNA, vol. 4, no. 5, pp.
473–490, 2013.
[37] N. T. Ingolia, “Genome-wide translational profiling by Ribosome footprinting,”
Methods Enzymol., vol. 470, pp. 119–142, 2010.
[38] N. T. Ingolia, G. A. Brar, S. Rouskin, A. M. McGeachy, and J. S. Weissman, “The
ribosome profiling strategy for monitoring translation in vivo by deep sequencing of
ribosome-protected mRNA fragments,” Nat. Protoc., vol. 7, no. 8, pp. 1534–1550, 2013.
[39] N. T. Ingolia, L. Lareau, and J. Weissman, “Ribosome Profiling of Mouse Embryonic
Stem Cells Reveals Complexity of Mammalian Proteomes,” Cell, vol. 147, no. 4, pp.
789–802, 2012.
[40] S. Lee, B. Liu, S. Lee, S.X. Huang, B. Shen, and S.B. Qian, “Global mapping of
translation initiation sites in mammalian cells at single-nucleotide resolution,”
Proc. Natl. Acad. Sci. U. S. A., vol. 109, no. 37, pp. E2424–E2432, 2012.
[41] W. McKeehan and B. Hardesty, “The mechanism of cycloheximide inhibition of
protein synthesis in rabbit reticulocytes,” Biochem. Biophys. Res. Commun., vol. 36,
no. 4, pp. 625–630, 1969.
[42] V. Olexiouk, J. Crappé, S. Verbruggen, K. Verhegen, L. Martens, and G. Menschaert,
“SORFs.org: A repository of small ORFs identified by ribosome profiling,” Nucleic
Acids Res., vol. 44, no. D1, pp. D324–D329, 2016.
[43] M. A. Basrai, P. Hieter, and J. D. Boeke, “Small Open Reading Frames: Beautiful
Needles in the Haystack,” Genome Res., vol. 7, pp. 768–771, 1997.
[44] R. Gibbs et al.,“The International HapMap Project.,” Nature, vol. 426, no. 6968, pp.
789–796, 2003.
[45] D. C. Rees, T. N. Williams, and M. T. Gladwin, “Sickle-cell disease,” Lancet, vol. 376,
no. 9757, pp. 2018–2031, 2010.
[46] J. L. Bobadilla, M. Macek, J. P. Fine, and P. M. Farrell, “Cystic fibrosis: A worldwide
analysis of CFTR mutations - Correlation with incidence data and application to
screening,” Hum. Mutat., vol. 19, no. 6, pp. 575–606, 2002.
[47] E. C. Landels, I. H. Ellis, A. H. Fensom, P. M. Green, and M. Bobrow, “Frequency of
the Tay-Sachs disease splice and insertion mutations in the UK Ashkenazi Jewish
population,” J Med Genet, vol. 28, pp. 177–180, 1991.
[48] H. Yang, Y. Zhong, C. Peng, J.-Q. Chen, and D. Tian, “Important role of indels in
somatic mutations of human cancer genes.,” BMC Med. Genet., vol. 11, no. 128, 2010.
[49] B. Vogelstein and K. Kinzler, “Cancer genes and the pathways they control,” Nat.
Med., vol. 10, no. 8, pp. 789–799, 2004.
[50] C. Greenman et al., “Patterns of somatic mutation in human cancer genomes,” Nature,
vol. 446, no. 7132, pp. 153–158, 2007.
[51] Z. Sun, A. Bhagwate, N. Prodduturi, P. Yang, and J.-P. A. Kocher, “Indel detection
from RNA-seq data: tool evaluation and strategies for accurate detection of actionable
mutations,” Brief. Bioinform., pp. 1–11, 2016.
[52] J. Crappé et al., “PROTEOFORMER: deep proteome coverage through ribosome
profiling and MS integration.,” Nucleic Acids Res., vol. 43, no. 5, p. e29, Mar. 2015.
[53] M. Fresno, A. Jiménez, and D. Vázquez, “Inhibition of translation in eukaryotic
systems by harringtonine.,” Eur. J. Biochem., vol. 72, no. 2, pp. 323–330, 1977.
[54] T. Schneider-Poetsch et al., “Inhibition of Eukaryotic Translation Elongation by
Cycloheximide and Lactimidomycin,” Nat. Chem. Biol., vol. 6, no. 3, pp. 209–217,
2010.
[55] M. Kozak, “Downstream secondary structure facilitates recognition of initiator codons
by eukaryotic ribosomes.,” Proc. Natl. Acad. Sci. U. S. A., vol. 87, no. 21, pp. 8301–
8305, 1990.
[56] G. Menschaert et al., “Deep proteome coverage based on ribosome profiling aids mass
spectrometry-based protein and peptide discovery and provides evidence of alternative
translation products and near-cognate translation initiation events.,” Mol. Cell.
Proteomics, vol. 12, no. 7, pp. 1780–90, 2013.
[57] S. J. Andrews and J. A. Rothnagel, “Emerging evidence for functional peptides
encoded by short open reading frames,” Nat. Rev. Genet., vol. 15, no. 3, pp. 193–204,
2014.
[58] G. Menschaert, J. Crappé, E. Ndah, and S. Koch , ert, “What is PROTEOFORMER?,”
2014. [Online]. Available: http://www.biobix.be/proteoformer/. [Accessed: 29-Oct-
2016].
[59] “FastQC - A Quality Control application for FastQ files.” [Online]. Available:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/README.txt. [Accessed:
29-Oct-2016].
[60] D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg, “TopHat2:
accurate alignment of transcriptomes in the presence of insertions, deletions and gene
fusions,” Genome Biol., vol. 14, no. 4, p. R36, 2013.
[61] A. Dobin et al., “STAR: Ultrafast universal RNA-seq aligner,” Bioinformatics, vol. 29,
no. 1, pp. 15–21, 2013.
[62] H. Li et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics,
vol. 25, no. 16, pp. 2078–2079, 2009.
[63] SAMtools, “Multisample SNP Calling.” [Online]. Available:
http://samtools.sourceforge.net/mpileup.shtml. [Accessed: 14-Sep-2016].
[64] Genome Research Limited, “Samtools Manual Page,” 2016. [Online]. Available:
http://www.htslib.org/doc/samtools.html. [Accessed: 14-Sep-2016].
[65] M. S. Hasan, X. Wu, and L. Zhang, “Performance evaluation of indel calling tools
using real short-read data,” Hum. Genomics, vol. 9, no. 1, p. 20, 2015.
[66] K. Gevaert et al., “Exploring proteomes and analyzing protein processing by mass
spectrometric identification of sorted N-terminal peptides,” Nat. Biotechnol., vol. 21,
no. 5, pp. 566–569, 2003.
[67] T. Ylonen and C. Lonvick, “The Secure Shell (SSH) Protocol Architecture,” 2006.
[Online]. Available: https://www.rfc-editor.org/rfc/rfc4251.txt. [Accessed: 04-Apr-
2017].
[68] Python Software Foundation, “General Python FAQ,” 2017. [Online]. Available:
https://docs.python.org/3/faq/general.html. [Accessed: 05-Apr-2017].
[69] SQLite Consortium, “About SQLite.” [Online]. Available:
https://www.sqlite.org/about.html. [Accessed: 03-Apr-2017].
[70] SQLite Consortium, “SQLite is a Self Contained System.” [Online]. Available:
https://www.sqlite.org/selfcontained.html. [Accessed: 03-Apr-2017].
[71] SQLite Consortium, “Command Line Shell For SQLite.” [Online]. Available:
http://www.sqlite.org/cli.html. [Accessed: 03-Apr-2017].
[72] Python Software Foundation, “DB-API 2.0 interface for SQLite databases,” 2017.
[Online]. Available: https://docs.python.org/2/library/sqlite3.html. [Accessed: 05-Apr-
2017].
[73] Python Software Foundation, “PEP 0249 -- Python Database API Specification v2.0,”
2017. [Online]. Available: https://www.python.org/dev/peps/pep-0249/. [Accessed: 05-
Apr-2017].
[74] A. M. Kuchling, “The Python DB-API,” Linux Journal, no. 49, 1998.
[75] Broad Institute, “HC overview: How the HaplotypeCaller works,” 2016. [Online].
Available: https://software.broadinstitute.org/gatk/documentation/article.php?id=4148.
[Accessed: 15-Sep-2016].
[76] A. Rimmer, H. Phan, I. Mathieson, Z. Iqbal, and S. R. F. Twigg, “Integrating mapping-
, assembly- and haplotype-based approaches for calling variants in clinical sequencing
applications,” Nat Genet., vol. 46, no. 8, pp. 912–918, 2016.
[77] Broad Institute, “HaplotypeCaller.” [Online]. Available:
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstit
ute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php. [Accessed: 29-Apr-
2017].
[78] M. A. DePristo et al., “A framework for variation discovery and genotyping using
next-generation DNA sequencing data,” Nat. Genet., vol. 43, no. 5, pp. 491–498, 2011.
[79] M. Vaudel et al., “PeptideShaker enables reanalysis of MS-derived proteomics data
sets,” Nat. Biotechnol., vol. 33, no. 1, pp. 22–24, 2015.
[80] M. Vaudel, H. Barsnes, F. S. Berven, A. Sickmann, and L. Martens, “SearchGUI: An
open-source graphical user interface for simultaneous OMSSA and X!Tandem
searches,” Proteomics, vol. 11, no. 5, pp. 996–999, 2011.
[81] R. Craig and R. C. Beavis, “TANDEM: Matching proteins with tandem mass spectra,”
Bioinformatics, vol. 20, no. 9, pp. 1466–1467, 2004.
[82] L. Y. Geer et al., “Open mass spectrometry search algorithm,” J. Proteome Res., vol. 3,
no. 5, pp. 958–964, 2004.
[83] Compomics, “SearchGUI,” 2017. [Online]. Available:
http://compomics.github.io/projects/searchgui.html. [Accessed: 11-Apr-2017].
[84] Compomics, “PeptideShaker,” 2017. [Online]. Available:
http://compomics.github.io/projects/peptide-shaker.html. [Accessed: 11-Apr-2017].
[85] J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team, “Galaxy: a comprehensive
approach for supporting accessible, reproducible, and transparent computational
research in the life sciences.,” Genome Biol., vol. 11, no. 8, p. R86, 2010.
[86] EMBL-EBI, “About the Ensembl Project,” 2014. [Online]. Available:
http://www.ensembl.org/info/about/index.html. [Accessed: 16-Apr-2017].
[87] T. Hubbard et al., “The Ensembl genome database project,” Nucleic Acids Res., vol.
30, no. 1, pp. 38–41, 2002.
[88] B. L. Aken et al., “The Ensembl Gene Annotation System,” Database (Oxford)., vol.
2016, p. baw093, 2016.
[89] D. M. Church et al., “Modernizing reference genome assemblies,” PLoS Biol., vol. 9,
no. 7, p. e1001091, 2011.
[90] “The Genome Reference Consortium.” [Online]. Available:
https://www.ncbi.nlm.nih.gov/grc. [Accessed: 12-Apr-2017].
[91] EMBL-EBI, “Ensembl Core - Schema documentation,” 2017. [Online]. Available:
http://www.ensembl.org/info/docs/api/core/core_schema.html. [Accessed: 12-Apr-
2017].
[92] S. T. Sherry et al., “dbSNP: the NCBI database of genetic variation.,” Nucleic Acids
Res., vol. 29, no. 1, pp. 308–311, 2001.
[93] S. T. Sherry, M. Ward, and K. Sirotkin, “dbSNP—Database for Single Nucleotide
Polymorphisms and Other Classes of Minor Genetic Variation,” Genome Res., vol. 9,
no. 8, pp. 677–679, 1999.
[94] G. Cochrane, I. Karsch-Mizrachi, and Y. Nakamura, “The International Nucleotide
Sequence Database Collaboration,” Nucleic Acids Res., vol. 39, no. (Database issue):
D15–D18, 2011.
[95] EMBL-EBI, “International Nucleotide Sequence Database Collaboration.” [Online].
Available: http://www.insdc.org/. [Accessed: 09-Apr-2017].
[96] R. Leinonen et al., “The European nucleotide archive,” Nucleic Acids Res., vol. 39, no.
(Database issue):D28-31, 2011.
[97] P. Jones et al., “PRIDE: a public repository of protein and peptide identifications for
the proteomics community,” Nucleic Acids Res., vol. 34, no. (Database issue): D659-
63, 2006.
[98] L. Martens et al., “PRIDE: The proteomics identifications database,” Proteomics, vol.
5, no. 13, pp. 3537–3545, 2005.
[99] A. Koch et al., “A proteogenomics approach integrating proteomics and ribosome
profiling increases the efficiency of protein identification and enables the discovery of
alternative translation start sites,” Proteomics, vol. 14, no. 0, pp. 2688–2698, 2014.
[100] H. Laboratory, “FASTX-Toolkit.” [Online]. Available:
http://hannonlab.cshl.edu/fastx_toolkit/. [Accessed: 22-May-2017].
[101] D. Blankenberg et al., “Manipulation of FASTQ data with galaxy,” Bioinformatics,
vol. 26, no. 14, pp. 1783–1785, 2010.
[102] J. G. Dunn and J. S. Weissman, “Plastid: nucleotide-resolution analysis of next-
generation sequencing and genomics data,” BMC Genomics, vol. 17, no. 1, p. 958,
2016.
[103] J. G. Dunn, “Performing metagene analyses,” 2014. [Online]. Available:
http://plastid.readthedocs.io/en/latest/examples/metagene.html. [Accessed: 24-May-
2017].
[104] J. G. Dunn, “Determine P-site offsets for ribosome profiling data,” 2014. [Online].
Available: http://plastid.readthedocs.io/en/latest/examples/p_site.html. [Accessed: 24-
May-2017].
[105] D. L. Lafontaine and D. Tollervey, “The function and synthesis of ribosomes.,” Nat.
Rev. Mol. Cell Biol., vol. 2, no. 7, pp. 514–520, 2001.
[106] C. M. Yates, I. Filippis, L. A. Kelley, and M. J. E. Sternberg, “SuSPect: Enhanced
prediction of single amino acid variant (SAV) phenotype using network features,” J.
Mol. Biol., vol. 426, no. 14, pp. 2692–2701, 2014.
[107] J. E. Elias and S. P. Gygi, “Target-decoy search strategy for increased confidence in
large-scale protein identifications by mass spectrometry,” Nat. Methods, vol. 4, no. 3,
pp. 207–214, 2007.
[108] The Global Proteome Machine Organization, “cRAP protein sequences,” 2011.
[Online]. Available: http://www.thegpm.org/crap/. [Accessed: 23-May-2017].
[109] O. Stehling et al., “MMS19 assembles iron-sulfur proteins required for DNA
metabolism and genomic integrity,” Science (80-. )., vol. 337, no. 6091, pp. 195–199,
2012.
[110] K. E. Hurov, C. Cotta-Ramusino, and S. J. Elledge, “A genetic screen identifies the
Triple T complex required for DNA damage signaling and ATM and ATR stability,”
Genes Dev., vol. 24, no. 17, pp. 1939–1950, 2010.
[111] Z. Wei, W. Ma, X. Qi, X. Zhu, Y. Wang, and Z. Xu, “Pinin facilitated proliferation and
metastasis of colorectal cancer through activating EGFR / ERK signaling pathway,”
Oncotarget, vol. 7, no. 20, pp. 29429–29439, 2016.
[112] M. A. Baldwin, “Protein Identification by Mass Spectrometry,” Mol. Cell. Proteomics,
vol. 3, no. 1, pp. 1–9, 2003.
[113] P. Iengar, “An analysis of substitution, deletion and insertion mutations in cancer
genes,” Nucleic Acids Res., vol. 40, no. 14, pp. 6401–6413, 2012.
[114] D. Gawron, E. Ndah, K. Gevaert, and P. Van Damme, “Positional proteomics reveals
differences in N-terminal proteoform stability,” Mol. Syst. Biol., vol. 12, no. 2, p. 858,
2016.
[115] Harvey Lodish et al., “The genetic basis of cancer,” in Molecular Cell Biology, 7th ed.,
New York: W. H. Freeman and Company, 2000, pp. 1124–1131.
[116] G. Kurzawski, J. Suchy, T. Debniak, J. Kładny, and J. Lubiński, “Importance of
microsatellite instability (MSI) in colorectal cancer: MSI as a diagnostic tool,” Ann.
Oncol., vol. 15, no. SUPPL. 4, pp. 283–284, 2004.
[117] Y.-F. Chang, J. S. Imam, and M. F. Wilkinson, “The Nonsense-Mediated Decay RNA
Surveillance Pathway,” Annu. Rev. Biochem., vol. 76, pp. 51–74, 2007.
[118] R. Fodde, “The APC gene in colorectal cancer.,” Eur. J. Cancer, vol. 38, no. 7, pp.
867–71, 2002.
[119] X.L. Li, J. Zhou, Z.R. Chen, and W.J. Chng, “P53 Mutations in Colorectal Cancer-
Molecular Pathogenesis and Pharmacological Reactivation,” World J. Gastroenterol.,
vol. 21, no. 1, pp. 84–93, 2015.
[120] M. Miyaki and T. Kuroki, “Role of Smad4 (DPC4) inactivation in human cancer,”
Biochem. Biophys. Res. Commun., vol. 306, no. 4, pp. 799–804, 2003.
[121] M. Hajdúch, S. Jančík, J. Drábek, and D. Radzioch, “Clinical relevance of KRAS in
human cancers,” J. Biomed. Biotechnol., vol. 2010, no. 150960, p. 13, 2010.
[122] D. Barras, “BRAF Mutation in Colorectal Cancer : An Update,” Biomark. Cancer, vol.
7, no. Suppl 1, pp. 9–12, 2015.
[123] X. Ma, B. Zhang, and W. Zheng, “Genetic variants associated with colorectal-cancer
risk: comprehensive research synopsis, meta-analysis, and epidemiological evidence,”
Gut, vol. 63, no. 2, pp. 326–336, 2008.
[124] A. J. MacFarlane, C. A. Perry, M. F. McEntee, D. M. Lin, and P. J. Stover, “Mthfd1 is
a modifier of chemically induced intestinal carcinogenesis,” Carcinogenesis, vol. 32,
no. 3, pp. 427–433, 2011.
[125] K. Yamaguchi et al., “MRG-binding protein contributes to colorectal cancer
development,” Cancer Sci., vol. 102, no. 8, pp. 1486–1492, 2011.
[126] P. J. Halvey et al., “Proteogenomic analysis reveals unanticipated adaptations of
colorectal tumor cells to deficiencies in DNA mismatch repair,” Cancer Res., vol. 74,
no. 1, pp. 387–397, 2014.
[127] B. Zhang, J. Wang, X. Wang, J. Zhu, Q. Liu, and Z. Shi, “Proteogenomic
characterization of human colon and rectal cancer,” Nature, vol. 513, no. 7518, pp.
382–387, 2014.
[128] P. Mertins et al., “Proteogenomics connects somatic mutations to signaling in breast
cancer,” Nature, vol. 534, no. 7605, pp. 55–62, 2016.
[129] K. V. Ruggles et al., “An Analysis of the Sensitivity of Proteogenomic Mapping of
Somatic Mutations and Novel Splicing Events in Cancer,” Mol. Cell. Proteomics, vol.
15, no. 3, pp. 1060–1071, 2016.
[130] Q. Li et al., “Genome-wide search for exonic variants affecting translational
efficiency,” Nat. Commun., vol. 4, no. 2260, 2013.
[131] C. Polychronakos, “Gene expression as a quantitative trait: what about translation?,” J.
Med. Genet., vol. 49, no. 9, pp. 554–557, 2012.
[132] J. Hu and P. C. Ng, “Predicting the effects of frameshifting indels,” Genome Biol., vol.
13, no. 2, p. R9, 2012.
[133] E. Nagy and L. E. Maquat, “A rule for termination-codon position within intron-
containing genes: When nonsense affects RNA abundance,” Trends Biochem. Sci., vol.
23, no. 6, pp. 198–199, 1998.
[134] R. A. Cartwright, “Problems and solutions for estimating indel rates and length
distributions,” Mol. Biol. Evol., vol. 26, no. 2, pp. 473–480, 2009.
[135] J. Q. Chen, Y. Wu, H. Yang, J. Bergelson, M. Kreitman, and D. Tian, “Variation in the
ratio of nucleotide substitution and indel rates across genomes in mammals and
bacteria,” Mol. Biol. Evol., vol. 26, no. 7, pp. 1523–1531, 2009.
[136] G. Narzisi and M. C. Schatz, “The Challenge of Small-Scale Repeats for Indel
Discovery,” Front. Bioeng. Biotechnol., vol. 3, p. 8, 2015.
[137] Illumina, “Paired-End Sequencing,” 2017. [Online]. Available:
https://www.illumina.com/technology/next-generation-sequencing/paired-end-
sequencing_assay.html. [Accessed: 29-May-2017].
[138] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: A pattern growth
approach to detect break points of large deletions and medium sized insertions from
paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[139] P. J. Campbell et al., “Identification of somatically acquired rearrangements in cancer
using genome-wide massively parallel paired-end sequencing,” Nat. Genet., vol. 40,
no. 6, pp. 722–729, 2008.
[140] K. Tomczak, P. Czerwińska, and M. Wiznerowicz, “The Cancer Genome Atlas
(TCGA): An immeasurable source of knowledge,” Contemp. Oncol., vol. 19, no. 1A,
pp. A68–A77, 2015.
[141] S. A. Forbes et al., “COSMIC: Exploring the world’s knowledge of somatic mutations
in human cancer,” Nucleic Acids Res., vol. 43, pp. D805–D811, 2015.