MiR-155 Target Prediction and Validation in Nasopharyngeal
Carcinoma
ILQAR ABDULLAYEV
Master of Science Thesis Stockholm, Sweden 2010
Master's Thesis in Biomedical Engineering (30 ECTS credits)
at the Computational and Systems Biology Master Programme
Royal Institute of Technology, year 2010
Supervisor at CSC was Erik Aurell
Examiner was Anders Lansner
TRITA-CSC-E 2010:164
ISRN-KTH/CSC/E--10/164--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc
MiR-155 target prediction and validation in nasopharyngeal carcinoma
Abstract
MicroRNAs (miRNAs) play an important role in controlling gene expression in eukaryotes.
They target many mRNAs and either promote their degradation or inhibit their translation
into protein. Finding the targets of miRNAs has therefore been a hot topic since their
discovery. Many tools have been designed for target prediction. Different tools use
different approaches, and hence they predict different targets, so finding the best
working tool or combination of tools is important. MicroRNA-155 (miR-155) is one of the
well-studied miRNAs and is associated with (mostly upregulated in) numerous diseases,
including nasopharyngeal carcinoma (NPC), one of the most common malignancies in
certain areas of southern China and Africa. This project aims to find the best-scoring
miRNA target prediction tool, apply it to miR-155, compare the predictions with the
results of a microarray experiment, and in this way shed some light on NPC.
Target prediction and validation of miR-155 in nasopharyngeal
carcinoma
Summary (translated from the Swedish "Sammanfattning")
MicroRNAs (miRNAs) play a major role in regulating gene expression in eukaryotes. A
significant share of cellular mRNAs is affected by such miRNAs, either through degradation
or through inhibition of translation into protein. The search for targets of different
miRNAs has therefore been a hot topic ever since miRNAs were first discovered. Many
different tools have been designed and developed for this purpose, and finding the best
tools is therefore important. MicroRNA-155 (miR-155) is a well-studied miRNA associated
with a number of diseases, for example nasopharyngeal carcinoma (NPC), one of the most
common malignant cancers in certain parts of southern China and Africa. This project aims
to find the best tool for miRNA prediction, apply it to miR-155, and then correlate it with
results already obtained from microarray experiments, thereby increasing the
understanding of NPC.
Acknowledgement
First of all, I would like to thank my supervisor Erik Aurell for taking me into his group
and introducing me to his collaborators, with whom I ended up doing my thesis. He also
helped me learn how to do good research and improve my writing skills. I would also like
to thank Aymeric for his valuable contributions, especially to the computational
prediction part of my thesis.
Secondly, I am grateful to my supervisor at the Department of Microbiology, Tumor and
Cell Biology at Karolinska Institute, Professor Ingemar Ernberg, for providing me with this
thesis project and sustaining a suitable scientific environment as well as an experimental
platform. I am deeply thankful to his doctoral student, Ziming Du, for helping me with the
wet-lab experiments. I learned a lot from Ziming.
I would like to thank my friends Rustam, Rasim, Emre, Alejandro, James and
Shaghayegh, who supported and motivated me throughout the entire process. I would also
like to thank Ann Bengston for her coordination of the administrative processes. My
special thanks go to my family, who have believed in and supported me throughout my
entire life, whatever the conditions. I am and always will be deeply grateful to them.
Finally, all my heart goes to my lovely wife, Aysegul.
Table of Contents
1. Introduction
1.1 General Information about miRNAs
1.1.1 Biogenesis
1.1.2 Plant miRNA target prediction works perfect
1.1.3 Animal miRNAs
1.2. Target Prediction of miRNAs
1.2.1 Features/Parameters for miRNA target prediction
1.2.2 Target prediction software packages
1.3. Gene set analysis
1.3.1 Gene Ontology Enrichment Analysis Software Toolkit (GOEAST)
2 Methodology
2.1 Target prediction - Gathering and handling data
2.2 Database of experimentally validated genes
2.3 Comparison
2.4 Microarray experiment set-up
2.4.1 Cell lines and tissue samples
2.4.2 MiRNA transfections
2.5 Polymerase Chain Reaction (PCR) assays
2.5.1 Real-time polymerase chain reaction (qPCR)
2.5.2 PCR
2.6 Microarray Analysis
2.6.1 Defining parameters used:
2.7 How to use of Microarray data and target predictions
3 Results
Result I: The comparison between predicted targets and experimentally validated
Part 1 of Result I: The precision test by using manually constructed database
Part 2 of Result I: The precision test by using Mirwalk
Result II: Amalgamation of predicted targets of top 4 software packages
Result III: Quality control of microarray data
Result IV: Elucidating microarray data
Result V: GOEAST analysis
Result VI: Validation of microarray results by qPCR
4 Discussion
5 References
Appendices
Appendix 1
Appendix 2
Appendix 3
Abbreviations:
miRNA: MicroRNA
miR-155: MicroRNA-155
3' UTR: 3' untranslated region
RISC: RNA-induced silencing complex
mRNA: Messenger RNA
Ago: Argonaute protein
CDS: Coding sequence
qPCR: Quantitative (real-time) polymerase chain reaction
DAVID: Database for Annotation, Visualization and Integrated Discovery
RNA Pol II: RNA Polymerase II
Pri-miRNA: Primary microRNA
Pre-miRNA: Precursor microRNA
FOXO3A: forkhead box O3A
GO: Gene Ontology
KEGG: Kyoto Encyclopedia of Genes and Genomes
GOEAST: Gene Ontology Enrichment Analysis Software Toolkit
MAMI: Meta Mir:Target Inference
ENG: Ensembl gene ID
WC: Watson-Crick
kb: kilobase
1. Introduction
1.1 General Information about miRNAs
MicroRNAs (miRNAs) are short (19-24 nucleotides in length), endogenously expressed
RNA molecules that regulate gene expression by binding directly, and preferentially, to
the 3' untranslated regions (UTRs) of protein-coding genes [1]. MiRNAs are estimated to
regulate up to 60% of all mammalian genes [22]. They are well conserved across
species, which points to their evolutionary importance [22].
The first miRNA was discovered in 1993 during the study of the gene lin-4 in the nematode
Caenorhabditis elegans [2]. It was found that lin-4 does not encode a protein; instead it
encodes a small RNA that regulates translation of the LIN-14 protein. That endogenous
RNA, also called lin-4, acts as a post-transcriptional regulator, and at the time this was
thought to be a unique property of nematodes [2].
Plant miRNAs are usually complementary to the coding regions of mRNAs, which
promotes cleavage of the mRNA. In contrast, animal miRNAs base-pair only partially with
their targets and inhibit translation of the target mRNA into protein. This mode also exists
in plants, but is less common. MiRNAs that are partially complementary to their target can
also accelerate deadenylation (shortening of the poly(A) tail of the mRNA), causing the
mRNA to be degraded sooner.
A single miRNA is thought to be able to have hundreds of targets. At the time of writing,
the miRBase database lists 14197 miRNAs in 133 species [26].
1.1.1 Biogenesis
Mature miRNAs are processed from longer transcripts called primary miRNAs
(pri-miRNAs). Primary miRNAs are usually transcribed by RNA Polymerase II (RNA Pol II).
They are further processed in the nucleus into ~70 nucleotide stem-loop structures
referred to as precursor miRNAs (pre-miRNAs) (see Figure 1).
Pre-miRNAs are then cleaved in the cytoplasm by an endonuclease called Dicer into
short RNA duplexes. One strand of the duplex is incorporated into the
RNA-induced silencing complex (RISC) and guides the whole complex towards a target
messenger RNA (mRNA). In other words, miRNAs provide the specificity that selects the
individual gene targets through (partially) complementary base-pairing between the miRNA
and the mRNA transcript of its target gene (see Figure 1).
Figure 1: Regulation of gene expression by miRNAs. Adapted from [25]. Pri-miRNAs are first processed
by the Drosha/Pasha complex into 60-70 nt pre-miRNAs in the nucleus. These pre-miRNAs are transported
by Exportin 5 into the cytoplasm. Dicer then cleaves pre-miRNAs into duplexes. Only one strand of this
duplex is incorporated into the RISC. The final complex mediates both mRNA cleavage and
translational repression by binding to the target mRNA.
Target selection then brings the mRNA transcripts within the acting range of the RISC
effector proteins, the principal components of which are a miRNA-specific Argonaute protein
(Ago) and GW182 (a scaffold protein) [27]. Purification of the RISC has shown that it
contains at least one member of the Ago protein family. Furthermore, mutagenesis studies
suggest that Ago2 is specifically responsible for the cleavage activity of RISC [25].
Figure 2: A speculative model showing the role of each miRNA region and the way it binds to the Ago
protein. Adapted from [1]. (A) The miRNA (red) is bound to Argonaute (Ago). The first nucleotide is twisted
away from the helix and permanently unavailable for pairing. Nucleotides 2–8 are bound to Ago in a way that
preorganizes them for efficient pairing. Nucleotides 9–11 face away from an incoming mRNA and are
unavailable for binding; the remainder of the miRNA is bound in a configuration that has not been
preorganized for efficient pairing. (B) An 8mer site has been recognized by the complex. (C) The
conformational accommodation of extensively paired sites, allowing the miRNA and mRNA to wrap around
each other. (D) This pairing is suitable for mRNA cleavage: Ago locks the paired duplex down so that the
active site (shown with a black arrow) ends up cleaving the mRNA. (E) 3′-supplementary pairing, in which
the message pairs to nucleotides 13–16. In this model, the miRNA and mRNA are not wrapped around
each other.
MiRNAs are involved in diverse biological functions, such as development, proliferation,
differentiation and apoptosis [9, 10]. Accumulating evidence suggests that miRNAs are
deregulated in the pathogenesis of tumors. Approximately 50% of all miRNAs are physically
located in cancer-associated regions of the genome [11]. Several miRNAs function as
tumor suppressors or as oncogenes [11].
Individual miRNAs are well studied compared with the cooperativity of multiple miRNAs.
It is possible that miRNAs act synergistically, but this remains largely unexplored [12],
which makes target prediction very complicated. Microarray studies do not reveal full
information about miRNA targets because they capture only mRNA degradation, not the
effect of translational inhibition. Proteomics studies, on the other hand, uncover more
information because they yield data at the protein level. However, there are very few
large proteomics studies due to cost, so overall proteomics is also expected to produce
less data [8, 13].
A further approach to miRNA target prediction could be to examine pathways. This might
yield a better understanding and help solve the target prediction problem from a systems
perspective, since particular miRNAs could act on particular pathways.
1.1.2 Plant miRNA target prediction works perfect
Plant miRNAs are involved in various aspects of plant growth and development, including
root formation, leaf morphology and polarity, molecular signaling, diverse transition
phases, flowering time and floral organ identity. Plant miRNAs are also involved in stress
responses through post-transcriptional regulation of target genes. MiRNA genes are
transcribed by RNA polymerase II [34].
Plant miRNA target prediction is highly successful at finding direct targets. Simply
checking for high complementarity between the miRNA and potential mRNA coding
sequences (CDS) reveals the most probable targets [3].
Since plant miRNA target prediction works so well in silico, there is little need for novel
prediction software or for combining different software packages.
1.1.3 Animal miRNAs
Genetics has been important for identifying animal miRNA targets. In contrast to plant
miRNAs, lin-4 and let-7 were found to regulate their gene targets through loose
complementarity to the 3'UTRs of those targets. It has been established that animal
miRNAs do not generally show extensive complementarity to any endogenous transcripts [4, 5].
There are numerous target prediction software packages that try to shed some light on the
animal miRNA targeting problem. Different prediction tools take different approaches by
introducing various parameters, resulting in different sets of predicted targets. The
challenging part is to identify which prediction tool(s), or combination of tools,
work best. The goal of this study was to find the best working prediction tool(s) and, with
the help of that finding, to validate some of the predicted targets of microRNA-155
(miR-155) in nasopharyngeal carcinoma.
1.1.3.1 MiR-155
MiR-155 is contained in Bic, a 64 nucleotide long non-coding gene residing on
chromosome 21 : 25868163 – 25868227. The primary miRNA transcript is transcribed from
Bic and processed into pre-miR-155, which is 62 nucleotides long, whereas mature
miR-155 is 22 nucleotides long. According to miRBase [26], miR-155 is found in 16
species. Some of the well-studied ones are Homo sapiens, Mus musculus, Gallus
gallus, Danio rerio, Ciona savignyi and Ciona intestinalis. The miR-155 gene is present in
only one copy, and miR-155 does not share significant sequence similarity with other
reported miRNAs [26, 35].
MiR-155 is involved in various biological processes including immunity, haematopoiesis
and inflammation. MiR-155 is highly expressed in Hodgkin's lymphoma and in large B cell
lymphomas. Its overexpression in these malignancies suggests that it acts as an oncogene.
MiR-155 null mice had serious defects in both adaptive and innate immunity [35].
Figure 3: Representation of the precursor miR-155 sequence (65 bp) in the Genome Browser. The sequence
resides on chromosome 21 : 25868163 - 25868227 : +. Adapted from [16].
Accumulating evidence indicates that miR-155 is an oncogenic miRNA. Many profiling
studies have shown that miR-155 is upregulated in various types of human
malignancies [23, 24]. These malignancies include B cell lymphoma and breast,
nasopharyngeal, colon, lung, and kidney carcinomas. For instance, in breast cancer
miR-155 induces cell survival and has a role in chemoresistance [24]. Its anti-apoptotic
function is mediated by direct inhibition of FOXO3a (a gene that belongs to the forkhead
family of transcription factors and is associated with acute leukemia). Furthermore,
elevated miR-155 levels have recently been observed in late-stage cases and in cases with
poor overall survival across several different types of malignancies. Knock-down of
miR-155 has been associated with impaired immune activity [24]. MiR-155 has also been
linked to inflammation [24].
Figure 4: The secondary structure of precursor miR-155 predicted by MirnaMap. Adapted from [17].
1.2. Target Prediction of miRNAs
1.2.1 Features/Parameters for miRNA target prediction
Determining which parameters are crucial in target prediction has been quite challenging,
mainly because pairing between miRNAs and target mRNAs is limited. To address this
problem, many computational and experimental approaches have been used in combination.
The widely proposed parameters/features fall into six categories: 'seed site' pairing, site
location, conservation, site accessibility, multiple sites and expression profile.
1.2.1.1 ‘Seed site’ is the most important feature for target recognition
MiRNA targets contain at least one region that has Watson-Crick (WC) pairing (in which
adenine (A) forms a base pair with uracil (U) in RNA, and cytosine (C) with guanine (G))
towards the 5′ end of the miRNA. Specifically, the miRNA region located at
positions 2–7 from its 5′ end is known as the 'seed'. RISC uses the match to this region as
a nucleation signal for recognizing target mRNAs.
A stringent-seed site has perfect Watson–Crick pairing and can be divided into four 'seed'
types: 8mer, 7mer-m8, 7mer-A1 and 6mer, which differ in the nucleotide at position 1 and
the pairing at position 8. An 8mer has both an adenine at position 1 of the target site and
base pairing at position 8. A 7mer-A1 has an adenine at position 1, but no base pairing at
position 8. A 7mer-m8, on the other hand, has base pairing at position 8, but no adenine at
position 1. Finally, a 6mer has neither an adenine at position 1 nor base pairing at
position 8 [14]. The adenine at position 1 is important because it increases the efficiency
of target recognition [8]. The hierarchy among the stringent-seed types can be stated as:
8mer > 7mer-m8 > 7mer-A1 > 6mer [14].
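The four stringent-seed types can be told apart mechanically from the miRNA sequence and a candidate 3'UTR. The following is a minimal Python sketch under stated assumptions, not the procedure of any of the prediction tools discussed in this thesis: the function names and the toy sequences are hypothetical, all sequences are written 5′ to 3′, and only perfect Watson-Crick matches to miRNA positions 2-7 are searched.

```python
# Watson-Crick partners in RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def classify_seed_sites(mirna, utr):
    """Find perfect matches to miRNA positions 2-7 and label each site.

    Returns a list of (start_index_in_utr, site_type) tuples.
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    core = reverse_complement(mirna[1:7])   # pairs with miRNA positions 2-7
    m8_partner = COMPLEMENT[mirna[7]]       # would pair with miRNA position 8

    sites = []
    start = utr.find(core)
    while start != -1:
        end = start + len(core)
        # miRNA position 8 pairs with the base just 5' of the core match.
        has_m8 = start > 0 and utr[start - 1] == m8_partner
        # The base opposite miRNA position 1 lies just 3' of the core match.
        has_a1 = end < len(utr) and utr[end] == "A"
        if has_m8 and has_a1:
            site_type = "8mer"
        elif has_m8:
            site_type = "7mer-m8"
        elif has_a1:
            site_type = "7mer-A1"
        else:
            site_type = "6mer"
        sites.append((start, site_type))
        start = utr.find(core, start + 1)
    return sites


# Toy sequences (hypothetical, for illustration only; this is not miR-155).
toy_mirna = "UGCAUUGACCGUAGGCUAAGCU"
toy_utr = "GG" "UCAAUGCA" "GGG" "UCAAUGCG" "GGG" "CCAAUGCA" "GGG" "CCAAUGCG" "GG"
toy_sites = classify_seed_sites(toy_mirna, toy_utr)
for position, site_type in toy_sites:
    print(position, site_type)
```

The toy 3'UTR was constructed to contain one site of each type, so the sketch reports them in the order of the hierarchy above.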
In addition, moderate-stringent-seed matching is functional as well, because the RISC can
tolerate small mismatches or G:U wobble pairs within the seed region. Moderate-stringent-seed
matching has five 'seed' types: GUM, GUT, BM, BT and LP, defined according to the
mismatch type [14].
The preferred number of matching nucleotides in the 3′ part differs between sites with
stringent-seed pairing and sites with moderate-stringent-seed pairing. Stringent seeds
require 3–4 matches in positions 13–16, whereas moderate-stringent seeds require 4–5
matches in positions 13–19. Sites with this additional 3′ pairing are called
3′-supplementary sites.
The advantage of allowing a wider set of seed types is increased sensitivity. Conversely,
high specificity is obtained when only stringent-seed types are considered, but some
targets could be missed in that way (those with tolerated mismatches, wobbles, and so on).
Figure 5: Types of miRNA target sites and multiple sites. (a) Stringent-seed site, 7mer-A1. Vertical lines
indicate Watson–Crick pairing. (b) Moderate-stringent-seed site, showing BM as an example. (c)
3′-supplementary site, which requires pairing of more than three to four nucleotides. (d) Optimal distance
between two miRNA target sites. Adapted from [15].
1.2.1.2 Site location
Most miRNA target sites are located in the 3'UTRs of target genes; RISC appears to act
preferentially on the 3'UTR. Target sites are not uniformly distributed within 3'UTRs, but
instead tend to cluster near the ends if the sequence is more than 2 kb long. Some genes,
e.g. house-keeping genes, have comparatively short 3'UTRs, which is believed to help them
avoid regulation by miRNAs. If the 3'UTR is short, the binding sites (if there are any) are
usually located 15-20 nucleotides away from the stop codon [15].
Alternative splicing and polyadenylation make it difficult to predict miRNA targets,
because they result in unexpected or difficult-to-calculate target features. Consequently,
software packages predict many false positive targets. More specifically, polyadenylation
can shorten the 3'UTR, while alternative splicing creates different potential targets [15].
Even though most known miRNA target sites are located in the 3'UTR, some have also
been reported in the 5'UTR and the CDS [19]. Functioning on the CDS and 5'UTR is
plausibly more difficult for RISC than functioning on the 3'UTR, since it might
have to compete with ribosomes, transcription factors and many other regulatory proteins.
This is believed to be one of the reasons why RISC prefers the 3'UTR [15].
1.2.1.3 Conservation: Targets and miRNAs are conserved among related species
MiRNAs that have the same seed site belong to the same miRNA family, and are well
conserved among related species. Additionally, miRNA families have targets that are
conserved among related species [9]. Applying conservation filters decreases the false
positive rate and is especially effective amongst conserved miRNAs. On the other hand, it
has been reported that 30% of all experimentally validated miRNA target genes may not be
well-conserved.
1.2.1.4 Accessibility
The secondary structure of mRNA affects the target accessibility significantly. Target sites
have to be accessible, meaning that they have to be opened and must not interact with other
sites within the mRNA. After the first interaction, the secondary structure of mRNA could
be disrupted by RISC on the binding site to elongate hybridization [15].
Figure 6: Accessibility of mRNA. To bind the miRNA, the target site has to be accessible,
meaning it has to be open and must not interact with other sites within the mRNA. Opening costs
a certain amount of energy, ΔG_open. The total free energy change is ΔΔG = ΔG_duplex − ΔG_open. ΔΔG
serves as a score for the accessibility of the target site and for the probability of a miRNA-target
interaction. Adapted from [15].
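As a worked example of the accessibility score in Figure 6, the short Python sketch below evaluates ΔΔG = ΔG_duplex − ΔG_open for two candidate sites and ranks them. The function name, the site labels and all energy values are hypothetical. The sketch also assumes that ΔG_open is entered as a negative number whose magnitude is the opening cost, so that a larger opening cost makes ΔΔG less negative; this sign convention is an assumption and is not stated in the caption.

```python
def accessibility_score(dg_duplex, dg_open):
    """Return ddG = dG_duplex - dG_open, both in kcal/mol.

    Assumption: dg_open is negative, with a magnitude equal to the cost of
    opening the local mRNA structure around the site.
    """
    return dg_duplex - dg_open


# Hypothetical energies (kcal/mol) for two candidate sites: (dG_duplex, dG_open).
candidate_sites = {
    "site_1": (-20.0, -8.5),    # strong duplex, cheap to open
    "site_2": (-15.0, -12.0),   # weaker duplex, expensive to open
}

scores = {name: accessibility_score(*energies)
          for name, energies in candidate_sites.items()}

# A more negative ddG is read here as a more favourable interaction.
ranked = sorted(scores, key=scores.get)
for name in ranked:
    print(name, scores[name])
```

Under these assumed values the first site ends up with the more negative ΔΔG and is therefore ranked first.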
Higher AU content is favorable, because A:U pairs form fewer hydrogen bonds than G:C
pairs, so the mRNA is easier to open and bind. In particular, the A:U content surrounding
the binding site can be used as a significant parameter for estimating accessibility.
Efficient target sites preferentially have an A:U-rich context in the ~30 nucleotides
upstream and downstream of the seed site [14].
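The local A:U context can be estimated by counting A and U nucleotides in the flanks of a site. The Python sketch below is a simplified, unweighted illustration under stated assumptions: the function name, the default site length of 6 and the toy sequence are hypothetical, and no position-dependent weighting is applied, so it is not the scoring scheme of any particular tool.

```python
def local_au_fraction(utr, site_start, site_len=6, flank=30):
    """Fraction of A/U bases in the flanks around a seed-match site.

    utr        : 3'UTR sequence, 5' to 3' (T is treated as U)
    site_start : 0-based index of the first base of the seed match
    site_len   : length of the seed match itself (excluded from the count)
    flank      : number of bases inspected on each side
    """
    utr = utr.upper().replace("T", "U")
    upstream = utr[max(0, site_start - flank):site_start]
    downstream = utr[site_start + site_len:site_start + site_len + flank]
    window = upstream + downstream
    if not window:
        return 0.0
    return sum(base in "AU" for base in window) / len(window)


# Toy 3'UTR (hypothetical): AU-rich upstream flank, GC-rich downstream flank.
toy_utr = "AU" * 15 + "CAAUGC" + "GC" * 15
au_fraction = local_au_fraction(toy_utr, site_start=30)
print(au_fraction)
```

In this constructed example one flank is entirely A/U and the other entirely G/C, so half of the inspected bases are A or U.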
1.2.1.5 Multiple sites in single target
Multiple binding sites might exist on the same 3'UTR. This can result in
cooperativity, which may enhance overall miRNA functionality; miRNAs can act on their
targets synergistically. Two target sites within the optimal distance of each other have
been shown to enhance target site efficacy [14]. The optimal distance is often between 17
and 35 nucleotides [13, 14].
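Given the positions of predicted sites on one 3'UTR, pairs that fall within the 17 to 35 nucleotide range can be listed directly. The Python sketch below is only an illustration: the function name and the example positions are hypothetical, and the distance is measured between site start positions, which is an assumption because the text does not define the reference points.

```python
def cooperative_pairs(site_starts, min_gap=17, max_gap=35):
    """Return pairs of site start positions whose spacing lies in [min_gap, max_gap].

    Assumption: spacing is measured between the start positions of two sites.
    """
    starts = sorted(site_starts)
    pairs = []
    for i, first in enumerate(starts):
        for second in starts[i + 1:]:
            if min_gap <= second - first <= max_gap:
                pairs.append((first, second))
    return pairs


# Hypothetical start positions of predicted sites on one 3'UTR.
example_starts = [10, 30, 100, 120, 200]
example_pairs = cooperative_pairs(example_starts)
print(example_pairs)
```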
1.2.1.6 Expression profile: miRNA:mRNA pairs are negatively correlated in
expression profiles
A single miRNA is capable of regulating many genes; thus the expression profiles of mRNAs
might vary considerably depending on miRNA expression levels. In addition, many
miRNAs are expressed differently in different tissues. As a result, if the expression
values of a miRNA:mRNA pair are negatively correlated across different tissue
profiles, the mRNA of the pair is probably targeted by the miRNA [15]. This approach
effectively reduces false positives. The majority of miRNA targets appear to be regulated
at both the mRNA and protein level, but some targets only show an effect at the protein
level [32].
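The idea of negatively correlated expression profiles can be illustrated with a Pearson correlation across tissues. The Python sketch below uses hypothetical function names, gene labels, expression values and a threshold of -0.8; none of them come from this study, and the choice of Pearson correlation itself is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two profiles of equal length (at least 2 tissues)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def anticorrelated_candidates(mirna_profile, mrna_profiles, threshold=-0.8):
    """Keep genes whose profile correlates with the miRNA at or below threshold."""
    hits = {}
    for gene, profile in mrna_profiles.items():
        r = pearson(mirna_profile, profile)
        if r <= threshold:
            hits[gene] = r
    return hits


# Hypothetical expression levels across five tissues.
mirna_levels = [1.0, 2.0, 3.0, 4.0, 5.0]
mrna_levels = {
    "GENE_A": [10.0, 8.0, 6.0, 4.0, 2.0],   # falls as the miRNA rises
    "GENE_B": [3.0, 4.0, 5.0, 6.0, 7.0],    # rises with the miRNA
}
hits = anticorrelated_candidates(mirna_levels, mrna_levels)
print(hits)
```

Only the gene whose expression falls as the miRNA rises passes the filter, which is the pattern expected for a candidate target.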
1.2.2 Target prediction software packages
1.2.2.1 Mirtarget2
Mirtarget2 is a machine learning tool that was developed by analyzing thousands of
genes downregulated by miRNAs. Its miRNA target prediction database covers five
species: human, mouse, rat, dog and chicken. Mirtarget2 incorporates four parameters:
moderately stringent seeds, site positions, site accessibility and a conservation
filter [6, 7].
1.2.2.2 TargetScan
TargetScan presents several approaches for predicting microRNA target sites in several
species. The first established version of TargetScan was designed to search for seed pairing.
The ranking was based on the thermodynamic stability of the binding site. Furthermore, the
predicted targets for multiple species were combined to get predictions for conserved target
sites [18].
The context score for a specific site is the sum of the contribution of these four features:
i. Site (seed) contribution
ii. 3' pairing contribution
iii. Local AU content
iv. Positional contribution
Imperfect seed matching combined with 3' compensatory pairing was later incorporated
into the TargetScan algorithm. The efficacy of each site is calculated from the
3'UTR context of the site in the target mRNA. The TargetScan web server provides miRNA
target predictions for human, dog, chimpanzee, rat, mouse, chicken, rhesus, cow, frog,
opossum, worm and fly. Conservation is quantified in TargetScan by the probability of
conserved targeting, P_CT. Combining the P_CT values of multiple sites in one target gives
the aggregate P_CT [22]:
$$P_{CT}^{\mathrm{aggregate}} = 1 - \left(1 - P_{CT}^{\mathrm{site}_1}\right)\left(1 - P_{CT}^{\mathrm{site}_2}\right)\left(1 - P_{CT}^{\mathrm{site}_3}\right)\cdots$$
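The aggregate formula is easy to evaluate by hand or in a few lines of code. The Python sketch below implements it as written; the function name and the per-site values are hypothetical and are not TargetScan output.

```python
def aggregate_pct(site_pcts):
    """Aggregate P_CT = 1 - product over sites of (1 - P_CT of the site)."""
    remaining = 1.0
    for pct in site_pcts:
        if not 0.0 <= pct <= 1.0:
            raise ValueError("each P_CT must lie between 0 and 1")
        remaining *= 1.0 - pct
    return 1.0 - remaining


# Hypothetical per-site P_CT values for one 3'UTR with two sites.
two_sites = aggregate_pct([0.5, 0.5])
print(two_sites)
```

Two sites that each have a 50% probability of conserved targeting combine to 1 − 0.5 × 0.5 = 0.75, so additional sites can only raise the aggregate value.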
Figure 7: Snapshot taken from the TargetScan web server while looking for putative miR-155 targets.
TargetScan provides a clear picture of the predicted targets. Both the gene symbol and the gene name are
reported. Moreover, the number of sites of each seed type, the type of conservation (conserved or poorly
conserved), the total context score and the aggregate P_CT are shown on the website.
1.2.2.3 DIANA-MicroT v3.0
The DIANA-MicroT algorithm searches target mRNAs for stringent seed pairing, i.e. at
least 7 consecutive WC pairs. In addition, 6mers and seeds with a G:U wobble are also
accepted if the 3' end of the miRNA has compensatory pairing with the target [21].
The performance of various target prediction programs was assessed using the targets
identified by the molecular biological method pSILAC developed by [13]. DIANA-microT
v3.0 achieved the highest score, with 66% of all its predicted targets being
correct [21].
The DIANA microT web server is very user-friendly; prediction results are organized in
expandable tabs (see Figure 8). For human and mouse, the predictions are available at
http://diana.cslab.ece.ntua.gr/microT/. DIANA makes it possible to search for the
targets of a specific miRNA as well as for the miRNA(s) targeting a specific mRNA (target
gene). Furthermore, DIANA microT v3.0 provides a signal-to-noise ratio (SNR), a miTG
score and a precision score. Results are ranked according to the miTG score, for which the
user defines a threshold. Official gene symbols and Ensembl gene IDs are used as identifiers.
Finally, results can be downloaded as a spreadsheet for independent analysis.
Figure 8: Snapshot taken from the DIANA MicroT web server while predicting miR-155 targets. The
expandable tab shows almost all the necessary information about a predicted target (in this case BACH1).
One very important item is the seed type (shown on the far left). Here, the 3' UTR of the gene BACH1
contains 4 miR-155 target sites. The number of species in which each binding site is conserved is also
shown. On the far right, one can see whether the prediction is confirmed by other well-known software
packages.
1.2.2.4 PicTar
PicTar (probabilistic identification of combinations of target sites) is an algorithm for
predicting miRNA targets. PicTar takes a different approach: it ranks targets by also
considering whether the mRNA is a target for combinations of other miRNAs.
The PicTar algorithm requires a perfect 7mer of WC pairing at either nucleotides 1-7 or 2-8.
Imperfect seed pairing is also allowed in PicTar, but it does not increase the overall score.
PicTar uses RNAhybrid to calculate the free energy of forming a miRNA:mRNA hybrid
and filters the potential targets according to this free energy. Additionally,
PicTar uses a conservation filter to reduce the number of false positives. Finally, all
inputs are combined and passed to the PicTar sequence scoring algorithm,
which uses a Hidden Markov Model (HMM) to compute a maximum-likelihood score (MLS).
The MLS defines the likelihood of a gene being a target of a specific miRNA. It
is calculated for every species separately, and the scores are combined into the final
PicTar score, which is in turn used for ranking the potential targets. Typical MLS values
for top predicted targets range from 5 to 10.
At http://pictar.mdc-berlin.de/ precompiled predictions for vertebrates, flies, mice and
nematodes are available.
1.2.2.5 MAMI
MAMI (Meta Mir:Target Inference) is a software package/database that uses pre-compiled
lists of targets from other software packages to increase the reliability of predictions.
MAMI also allows users to choose the preferred sensitivity and specificity values.
Sensitivity = True positives / (True positives + False negatives)
Specificity = True negatives / (True negatives + False positives)
Sensitivity and specificity are easily tunable: five different levels of sensitivity and
specificity are available, so that the user can choose the one that best suits the
experimental goals.
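The two definitions above, together with precision (the fraction of predicted targets that are correct), can be computed from a list of predicted targets and a list of validated targets. The Python sketch below is only an illustration: the function name and gene labels are hypothetical, and it assumes that every gene in the chosen universe that is not validated counts as a true non-target, which is a strong simplification.

```python
def prediction_metrics(predicted, validated, universe):
    """Sensitivity, specificity and precision of a predicted target set.

    Assumption: genes in `universe` that are not in `validated` are treated
    as true non-targets.
    """
    universe = set(universe)
    predicted = set(predicted) & universe
    validated = set(validated) & universe

    tp = len(predicted & validated)            # predicted and validated
    fp = len(predicted - validated)            # predicted but not validated
    fn = len(validated - predicted)            # validated but missed
    tn = len(universe - predicted - validated) # neither predicted nor validated

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical gene labels; not real prediction results.
all_genes = {f"GENE_{i}" for i in range(10)}
predicted_targets = {"GENE_0", "GENE_1", "GENE_2", "GENE_3"}
validated_targets = {"GENE_0", "GENE_1", "GENE_4"}
metrics = prediction_metrics(predicted_targets, validated_targets, all_genes)
print(metrics)
```

In this toy case two of the three validated genes are recovered, two of the four predictions are correct, and five of the seven assumed non-targets are correctly left out.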
The internal cutoff values that were used to generate each performance level on the
validated set were applied to all human miR-target predictions. The aim was to calculate
the percentile of predictions that satisfy these cutoffs.
1.2.2.6 Other prediction tools
Other prediction tools include PITA, ElMMo, Miranda, RNAhybrid, TargetRank and RNA22.
Table 2: List of miRNA prediction tools and their features. Adapted from [15].
a Seed pairing. ●: stringent seeds, ○: moderately stringent seeds, Blank: seed sites not considered.
b Site location. ●: target positions considered, Blank: target positions not considered.
c Conservation. ●: with/without conservation filter, ○: with conservation filter, Blank: conservation not considered.
d Site accessibility. ●: site accessibility with minimum free energy considered, ○: A:U rich flanking considered, Blank: site accessibility not considered.
e Multiple sites in single mRNA. ●: multiple sites considered, ○: the number of putative sites considered, Blank: multiple co-operability not considered.
f Expression profile. ●: expression profiles used, Blank: expression profiles not used.
1.3. Gene set analysis
Several methods have been developed for gene set analysis of microarray data. These
methods calculate the differential gene expression patterns of groups of functionally related
genes rather than individual ones. The basic goal is to discover gene sets whose expression
patterns are associated with phenotypes of interest. Gene Ontology (GO) and Kyoto
Encyclopedia of Genes and Genomes (KEGG) are good examples of resources that collect genes into
functional groups.
1.3.1 Gene Ontology Enrichment Analysis Software Toolkit (GOEAST)
GOEAST is a web-based software toolkit which provides an easy way to analyze
high-throughput experimental results, e.g. microarray data. It has a user-friendly interface which
makes it easy to visualize extensive data and perform GO analysis. The main function
of GOEAST is to identify significantly enriched GO terms among given lists of genes using
the desired statistical methods [31].
2 Methodology
2.1 Target prediction - Gathering and handling data
First, all miR-155 related predictions were obtained from the website of each tool. The
websites of the target prediction tools are listed below:
Table 1: List of target prediction software packages/databases and their corresponding websites:
PicTar http://pictar.mdc-berlin.de/
TargetScan 5.1 www.targetscan.org
DIANA-MicroT 3.0 http://diana.cslab.ece.ntua.gr/microT/
MAMI http://mami.med.harvard.edu/
EIMMO 3 www.mirz.unibas.ch/ElMMo3/
MirTarget2 http://mirdb.org/miRDB/
PITA http://genie.weizmann.ac.il/pubs/mir07/
TargetRank http://genes.mit.edu/targetrank/
RNA22 http://cbcsrv.watson.ibm.com/rna22.html
Prediction tools do not use a common gene identifier. DIANA-MicroT 3.0
gives gene symbol and Ensembl gene ID (ENG), TargetScan 5.1 and MirTarget2 yield gene
symbol and gene name, PicTar gives gene name and RefSeq ID, MAMI shows only gene
symbols, and so on. Therefore, the results were mapped to a single identifier, chosen to
be ENG, because most genes have a unique ENG identifier.
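The mapping step can be sketched as follows. This is a minimal illustration and not the procedure actually used in this study: the lookup tables, the helper names `to_common_id` and `unify`, and all identifiers on the right-hand side are hypothetical placeholders, and a real run would load the full mapping from an annotation source.

```python
# Hypothetical lookup tables; the values are placeholders, not real accessions.
SYMBOL_TO_COMMON = {"KRAS": "ENSG_PLACEHOLDER_A", "ETS1": "ENSG_PLACEHOLDER_B"}
REFSEQ_TO_COMMON = {"NM_PLACEHOLDER_1": "ENSG_PLACEHOLDER_A"}


def to_common_id(identifier):
    """Return the common gene ID for a symbol or RefSeq ID, or None if unmapped."""
    key = identifier.strip()
    # Gene symbols are matched case-insensitively; RefSeq IDs are matched as given.
    return SYMBOL_TO_COMMON.get(key.upper()) or REFSEQ_TO_COMMON.get(key)


def unify(predictions):
    """Map each tool's raw identifiers to a set of common IDs, dropping unmapped ones."""
    return {
        tool: {gid for gid in map(to_common_id, ids) if gid is not None}
        for tool, ids in predictions.items()
    }


raw = {"MAMI": ["kras", "ETS1"], "PicTar": ["NM_PLACEHOLDER_1", "unknown"]}
unified = unify(raw)
# Genes predicted by both tools, once both lists share one identifier space.
shared = unified["MAMI"] & unified["PicTar"]
print(shared)
```

Once every list is expressed in the same identifier space, set operations between tools and validated genes become straightforward.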
2.2 Database of experimentally validated genes
The set of experimentally validated genes was constructed using Tarbase [28] and
Mirwalk [29]. These databases report both mRNA and protein level downregulation. Therefore,
only mRNA level downregulations (validated by Luciferase reporter assay), collected
by manually checking Tarbase [28] and publications, are considered separately
in this study. In this way, 37 experimentally validated genes downregulated at the mRNA
level were obtained (see Table 3). With those targets, one can only
study mRNA degradation, because translation inhibition is not detectable in the Luciferase
reporter assay. The second database was Mirwalk [29], which comprises all the targets of
Tarbase. It was also used as a validation source, keeping in mind that targets validated
by Mirwalk are derived from online publications (considering any kind of reported miRNA-target
interaction). As a result, 528 “DIRECT” and “INDIRECT” targets of miR-155 (the study includes
or does not include a Luciferase reporter assay, respectively) were
collected by using Mirwalk [29].
Gene_symbol Gene_name
AGTR1 Angiotensin II receptor, type 1
AGTRAP Angiotensin II receptor-associated protein
AID Activation-induced cytidine deaminase
ARID2 AT rich interactive domain 2 (ARID, RFX-like)
ARNTL Aryl hydrocarbon receptor nuclear translocator-like
AT1R angiotensin II receptor 1B
BACH1 BTB and CNC homology 1, basic leucine zipper transcription factor 1
BCL2L13 BCL2-like 13 (apoptosis facilitator)
BIRC4BP XIAP associated factor 1
CEBPB CCAAT/enhancer binding protein (C/EBP), beta
CSF1R Colony stimulating factor 1 receptor
CUTL1 Cut-like homeobox 1
Ets-1 v-ets erythroblastosis virus E26 oncogene homolog 1
FGF7 Fibroblast growth factor 7 (keratinocyte growth factor)
FOS FBJ murine osteosarcoma viral oncogene homolog
HIF1A Hypoxia inducible factor 1, alpha subunit (basic helix-loop-helix transcription factor)
HIVEP2 Human immunodeficiency virus type I enhancer binding protein 2
IKBKE Inhibitor of kappa light polypeptide gene enhancer in B-cells, kinase epsilon
JARID2 Jumonji, AT rich interactive domain 2
MAF V-maf musculoaponeurotic fibrosarcoma oncogene homolog (avian)
MAP3K10 Mitogen-activated protein kinase kinase kinase 10
MEIS1 Meis homeobox 1
PDCD6 Programmed cell death 6
PICALM Phosphatidylinositol binding clathrin assembly protein
PU.1 Spleen focus forming virus (SFFV) proviral integration oncogene spi1
RFK Riboflavin kinase
RHOA Ras homolog gene family, member A
RPS6KA3 Ribosomal protein S6 kinase, 90kDa, polypeptide 3
SAMHD1 SAM domain and HD domain 1
SHIP1 inositol polyphosphate-5-phosphatase
SLA Src-like-adaptor
SMAD5 SMAD family member 5
TAB2 TGF-beta activated kinase 1/MAP3K7 binding protein 2
TP53INP1 Tumor protein p53 inducible nuclear protein 1
ZIC3 Zic family member 3 (odd-paired homolog, Drosophila)
ZNF537 Zinc finger protein 537
ZNF652 Zinc finger protein 652
Table 3: The list of experimentally validated 37 genes.
2.3 Comparison
The 37 validated genes were compared with the predicted targets of each software package/database.
The result was compiled into a list giving, for each software package, the total number of
predicted targets and the number of validated targets among them. Precision, the
percentage of validated targets among all predicted targets, was calculated for each
software package/database. This parameter reflects the combined effect of
sensitivity and specificity. Since pre-compiled results were obtained directly from the websites
of the different tools, it was impossible to calculate specificity (because the number of True
Negatives is unknown) unless the tool itself reports it (e.g. MAMI). On the other hand,
sensitivity (with respect to the validated targets) could be calculated, since the numbers of True
Positives (TP) and False Negatives (FN) are known.
Precision = True positives / Total predicted targets
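As a worked example of these two measures, the sketch below recomputes the TargetScan row of the 37-gene benchmark reported in the Results (22 validated hits among 281 predictions). The function names are illustrative only.

```python
def precision_percent(true_positives, total_predicted):
    """Validated targets as a percentage of all predicted targets."""
    return 100.0 * true_positives / total_predicted


def sensitivity(true_positives, total_validated):
    """Fraction of the validated set that was recovered: TP / (TP + FN)."""
    return true_positives / total_validated


# TargetScan 5.1 against the 37 validated genes: 22 hits among 281 predictions.
print(round(precision_percent(22, 281), 2))  # 7.83
print(round(sensitivity(22, 37), 2))         # 0.59
```

Note that precision is expressed in percent while sensitivity is a fraction, which matches how the two columns are reported in the benchmark tables.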
2.4 Microarray experiment set-up
The microarray experimental design was done at the Microbiology Tumor and Cell Biology (MTC)
department of Karolinska Institute with the help of doctoral student Ziming Du, under the
supervision of Prof. Ingemar Ernberg. The whole experimental procedure, from harvesting
cells to extracting RNA, took place in March 2010. The microarray experiment was done using
the Affymetrix platform at the core facility for Bioinformatics and Expression Analysis (BEA),
located at the Department of Biosciences and Nutrition at Novum, Huddinge.
2.4.1 Cell lines and tissue samples
Human NPC cell line TW03 cells were cultured in IMEM (Gibco USA) containing
10% fetal calf serum (FCS). The immortalized nasopharyngeal epithelial cell line NP69
was cultured in keratinocyte serum-free medium (Invitrogen) supplemented with 5% FCS,
25 μg/ml bovine pituitary extract, and 0.2 ng/ml recombinant epidermal growth factor, as
suggested by the manufacturer. All the cell lines were grown in a humidified incubator at
37 °C with 5% CO2.
2.4.2 MiRNA transfections
Before transfection, 2 × 10^5 cells per well were plated into 6-well plates and grown for one
day in antibiotic-free medium containing 10% FCS. When the cells reached 40% to 60%
confluence, they were transfected with miR-155 Pre-miR™ miRNA Precursor (miR-155 mimic)
Molecules (Cat#: PM12601, Ambion, USA), or Pre-miR™ miRNA Precursor
Molecules-Negative Control #1 (Cat#: AM17110, Ambion, USA), or miR-155 Anti-miR™
miRNA Inhibitor (Cat#: AM12601, Ambion, USA), or Anti-miR™ miRNA Inhibitors-
Negative Control #1 (Cat#: AM17010, Ambion, USA) using Lipofectamine 2000
(Invitrogen, USA) according to the manufacturer's instructions.
Transfected (miR-155 mimic 100nM, miR-155 mimic 50nM, miR-155 control 50nM) cells
were grown at 37 °C for 6 hr, followed by incubation with complete medium. For the miR-155
assay and Western blot analysis, cells were harvested for RNA and protein after 48 hr.
2.5 Polymerase Chain Reaction (PCR) assays
The PCR assays were done at the Microbiology Tumor and Cell Biology (MTC) department of
Karolinska Institute with the help of doctoral student Ziming Du, under the supervision of
Prof. Ingemar Ernberg. The whole experiment took place in June 2010.
Symbol    Mimic 100nM  Mimic 50nM  Control 50nM  NP69     TW03     LOG2_100  LOG2_50  LOG2_TW03  Prediction
C9orf5    884.29       828.1       1656.07       791.85   975.8    -0.91     -1       0.3        TargetScan
PERP      1531.4       594.97      1328.54       2120.2   1098.3   0.21      -1.16    -0.95      DIANA-MicroT
TP53INP1  48.38        50.02       164.9         7.19     111.41   -1.77     -1.72    3.95       TargetScan
TERF1     422.37       350.07      553.02        312.8    449.22   -0.39     -0.66    0.52       DIANA-MicroT + TargetScan
BCLAF1    530.82       455.01      691.37        748.89   453.08   -0.38     -0.6     -0.72      DIANA-MicroT
E2F2      95.39        101.94      129.26        168.52   142.3    -0.44     -0.34    -0.24      DIANA-MicroT
Table 4: The 6 genes that were found interesting enough for validation experiments,
since each has been predicted as a potential target by at least one of the tools. In addition,
the microarray expression values of those genes are downregulated compared to the control_50nM
or NP69 normal tissue.
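The LOG2 columns of Table 4 are log2 ratios of the expression values in the same row. A minimal sketch of that arithmetic, using the C9orf5 row (the function name is illustrative):

```python
import math


def log2_ratio(treated, reference):
    """log2 fold change of a treated sample relative to a reference sample."""
    return math.log2(treated / reference)


# C9orf5: mimic 100nM = 884.29, mimic 50nM = 828.1, control 50nM = 1656.07
print(round(log2_ratio(884.29, 1656.07), 2))  # -0.91  (LOG2_100)
print(round(log2_ratio(828.1, 1656.07), 2))   # -1.0   (LOG2_50)
# LOG2_TW03 compares the tumor line with the normal line: TW03 / NP69
print(round(log2_ratio(975.8, 791.85), 2))    # 0.3
```

A negative value means the gene is lower in the treated sample than in the reference; a value of -1 corresponds to a halving of expression.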
2.5.1 Real-time polymerase chain reaction (qPCR)
For the qPCR assay, total RNA was isolated from cell lines using TRIzol reagent
(Invitrogen) according to the manufacturer's instructions, and was then treated with RNase-free
DNase I (Cat#: 04716728001, Roche). The miR-155 qPCR assay was performed with
TaqMan® MicroRNA Assays (Cat#: 4373124, Applied Biosystems, USA), and RNU6B
(Cat#: 4373381, Applied Biosystems, USA) was used as internal control. The relative
expression level was determined as 2^(-ΔΔCt).
Data are presented as the expression level relative to the calibrator, with the standard error
of the mean of triplicate measures for each test sample.
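The relative quantification can be sketched as below. The Ct values are hypothetical and serve only to show the arithmetic; the function and argument names are illustrative, with "reference" standing for the internal control (RNU6B or GAPDH) and "calibrator" for the sample that all others are compared to.

```python
def relative_expression(ct_target_sample, ct_reference_sample,
                        ct_target_calibrator, ct_reference_calibrator):
    """Relative expression of a target gene by the 2^(-ddCt) method."""
    # Normalize the target to the internal control within each sample.
    d_ct_sample = ct_target_sample - ct_reference_sample
    d_ct_calibrator = ct_target_calibrator - ct_reference_calibrator
    # Compare the normalized sample with the normalized calibrator.
    dd_ct = d_ct_sample - d_ct_calibrator
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: relative to the internal control, the target crosses
# the threshold one cycle later in the treated sample than in the calibrator.
print(relative_expression(25.0, 18.0, 24.0, 18.0))  # 0.5
```

A one-cycle delay therefore corresponds to half the expression level, assuming ideal amplification efficiency.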
After reverse transcription of the total RNA, the first-strand cDNA was then used as
template for detection of PERP, TP53INP1, TERF1, BCLAF1 and E2F2 expression by
quantitative real time PCR (QT-PCR) with the SYBR Green I chemistry (Power SYBR
Green PCR Master Mix, CAT#: 4367659, ABI Inc., USA). GAPDH was used as internal
control.
Here is the list of picked genes (with their corresponding primers) from microarray data for
further validations:
qRT-Primers for ZDHHC2 (NM_016353)
ZDHHC2 Forward: TCTTAGGCGAGCAGCCAAGGAT
ZDHHC2 Reverse: CAGTGATGGCAGCGATCTGGTT
qRT-Primers for KDM5B (NM_006618)
KDM5B Forward: AGCCAGAGACTGGCTTCAGGAT
KDM5B Reverse: AGCCTGAACCTCAGCTACTAGG
qRT-Primers for E2F2 (NM_004091)
E2F2 Forward: CTCTCTGAGCTTCAAGCACCTG
E2F2 Reverse: CTTGACGGCAATCACTGTCTGC
qRT-Primers for BCLAF1 (NM_014739)
BCLAF1 Forward: CCTAAACGAGCGGTTCACTTCG
BCLAF1 Reverse: GCTAAACGGGTATGCTTCCTCAG
qRT-Primers for TERF1 (NM_017489)
TERF1 Forward: CATGGAACCCAGCAACAAGACC
TERF1 Reverse: CTGCTTTCAGTGGCTCTTCTGC
qRT-Primers for TP53INP1 (NM_033285)
TP53INP1 Forward: TGATGAATGGATTCTTGTTGACTTC
TP53INP1 Reverse: TGAAGGGTGCTCAGTAGGTGAC
qRT-Primers for PERP (NM_022121)
PERP Forward: CCAGATGCTTGTCTTCCTGAGAG
PERP Reverse: AGTGACAGCAGGGTTGGCATGA
2.5.2 PCR
For normal PCR assay, total RNA was extracted from cell lines using TRIzol reagent
(Invitrogen). This was done as a quality check before running qPCR.
2.6 Microarray Analysis
Microarray analysis was done at the Department of Computational Biology at KTH with the
help of doctoral student Aymeric Fouquier d'Hérouel, under the supervision of Prof. Erik
Aurell. Annotations were obtained from the Affymetrix probeset annotation file
HuGene-1_0-st-v1.r3.cdf. The whole analysis took place in June 2010. The PLIER algorithm was used
for gene expression analysis. The primary analysis includes the following individual
operations:
1) Image correction
2) Global and local background correction
3) Feature normalization
4) Spatial normalization
5) Global normalization
2.6.1 Defining parameters used:
In order to analyze large microarray data, it is important to introduce some parameters to
filter out noise. The expression values of genes range approximately from 0.01 to
10000. The following parameters were chosen to eliminate noise without losing useful
information:
1. Expression values > 30 (applied on all samples simultaneously) AND
2. Log2 (miR-155 mimic 100nM / miR-155 control 50nM) < -0.5 AND
3. Log2 (miR-155 mimic 50nM / miR-155 control 50nM) < -0.5 AND
4. -2 < Log2 (miR-155 control 50nM / NP69) < 0.5
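The first three criteria can be sketched as a simple predicate. This is an illustration under stated assumptions, not the analysis script used in the study: the sample keys are hypothetical names, and the fourth criterion (the bound on the control / NP69 ratio) is deliberately left out of the sketch. The two example rows are taken from Table 4.

```python
import math


def passes_filter(expr, min_expr=30.0, max_log2=-0.5):
    """Apply criteria 1-3: expression floor on all samples, both mimics down."""
    # Criterion 1: every sample must exceed the expression floor.
    if any(value <= min_expr for value in expr.values()):
        return False
    control = expr["control_50nM"]
    # Criteria 2 and 3: both mimic concentrations must be downregulated.
    return all(
        math.log2(expr[mimic] / control) < max_log2
        for mimic in ("mimic_100nM", "mimic_50nM")
    )


c9orf5 = {"mimic_100nM": 884.29, "mimic_50nM": 828.1, "control_50nM": 1656.07,
          "NP69": 791.85, "TW03": 975.8}
perp = {"mimic_100nM": 1531.4, "mimic_50nM": 594.97, "control_50nM": 1328.54,
        "NP69": 2120.2, "TW03": 1098.3}

print(passes_filter(c9orf5))  # True
print(passes_filter(perp))    # False: not downregulated at 100nM
```

Requiring downregulation at both mimic concentrations is the stricter reading of the AND-combined criteria; a gene that responds at only one concentration, such as PERP, is excluded.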
2.7 How to use Microarray data and target predictions
Microarray data show the change in mRNA expression in vitro, whereas target prediction
predicts the miRNA-mRNA interaction in silico. By combining these two types of data,
the targeting mechanism was investigated.
3 Results
Result I: The comparison between predicted targets and experimentally validated targets
The lists of predicted miR-155 targets were obtained from the online databases in
February 2010. PicTar (http://pictar.mdc-berlin.de/) yielded 199 miR-155 target genes,
TargetScan 5.1 (www.targetscan.org) yielded 281, DIANA-MicroT 3.0
(http://diana.cslab.ece.ntua.gr/microT/) yielded 166, and MAMI
(http://mami.med.harvard.edu/) yielded 205.
The manually constructed database was created by using Tarbase [28] and different
publications. In total, 37 genes were identified as experimentally validated miR-155 targets.
These genes were used to check the precision of the software packages in downstream
analyses.
Mirwalk [29] was used to construct the second database. In total, 528 genes
were identified as direct or indirect miR-155 targets. These genes were also used to check the
precision of the software packages in downstream analyses.
Eleven software packages/databases were tested against the manually constructed database
(37 targets) and the Mirwalk [29] database (528 genes). The reliability of the eleven
packages was assessed by checking their precision scores on these 2 sets of validated targets.
The ones which showed the highest precision and sensitivity at the same time
were chosen to perform further predictions.
Part 1 of Result I: The precision test using manually constructed database
The software benchmark was performed using the 37 direct targets. The precision and sensitivity
scores of eleven software packages/databases were calculated and ranked. The top four were
considered reliable enough for our further analysis.
Software/Database  TRUE_POSITIVE  Total_#_of_targets  Precision (%)  Sensitivity
DIANA-microT 3.0   12             166                 7.23           0.32
Targetscan         22             281                 7.83           0.59
Pictar             17             199                 8.54           0.46
MAMI               16             205                 7.8            0.43
EIMMO              31             2955                1.05           0.84
Miranda            26             1952                1.33           0.7
PITA               29             1266                2.29           0.78
RNA22              1              332                 0.3            0.03
Targetrank         27             682                 3.96           0.73
Mirgator           28             723                 3.87           0.76
Mirbase            15             854                 1.75           0.40
Table 5: The software benchmark using 37 direct targets.
Part 2 of Result I: The precision test by using Mirwalk
The software benchmark was performed using the 528 direct and indirect targets. The precision
and sensitivity scores of eleven software packages/databases were calculated and ranked. The same
top four were obtained as in the previous test case (see Table 5). Therefore, those four software
packages/databases were considered reliable enough for our further analysis.
Software           TRUE_POSITIVES  Total_#_of_targets  Precision (%)  Sensitivity
DIANA-microT       24              166                 14.46          0.05
Targetscan         38              281                 13.52          0.07
Pictar             24              199                 12.06          0.05
MAMI               33              205                 16.1           0.06
EIMMO              123             2955                4.16           0.23
Miranda            118             1952                6.05           0.22
PITA               82              1266                6.48           0.16
RNA22              18              332                 5.42           0.03
Targetrank         53              682                 7.77           0.1
Mirgator           57              723                 7.88           0.11
Mirbase            35              854                 4.1            0.07
Table 6: The software benchmark using 528 direct and indirect targets from Mirwalk, the database of
experimentally validated miRNA targets.
Result II: Amalgamation of predicted targets of top 4 software packages yielded 9 miR-155 target candidates
Each top scoring software package predicted its own gene set. Without combining the targets
of the four software packages, it is not obvious which genes are the most likely targets. Thus,
the predicted targets from all four were amalgamated into a single list showing which software
packages predicted each gene:
Gene_Symbol DIANA TargS MAMI Pictar TOTAL Gene_name
NUFIP2 + + + + 4 nuclear fragile X mental retardation protein interacting protein 2
MAP3K7IP2 + + + + 4 mitogen-activated protein kinase kinase kinase 7 interacting protein 2
SGK3 + + + + 4 serum/glucocorticoid regulated kinase family, member 3
TSHZ3 + + + + 4 teashirt zinc finger homeobox 3
SEMA5A + + + + 4 sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane
RAB11FIP2 + + + + 4 RAB11 family interacting protein 2 (class I)
SEPT11 + + + + 4 septin 11
FAR1 + + + + 4 fatty acyl CoA reductase 1
KRAS + + + + 4 v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog
ETS1 + + + + 4 v-ets erythroblastosis virus E26 oncogene homolog 1 (avian)
BACH1 + + + + 4 BTB and CNC homology 1, basic leucine zipper transcription factor 1
ZNF236 + + + + 4 zinc finger protein 236
DCUN1D3 - + + + 3 DCN1, defective in cullin neddylation 1, domain containing 3
ETNK2 - + + + 3 ethanolamine kinase 2
DNAJB7 - + + + 3 DnaJ (Hsp40) homolog, subfamily B, member 7
IKBKE - + + + 3 inhibitor of kappa light polypeptide gene enhancer in B-cells
HDAC4 + - + + 3 histone deacetylase 4
FBXO11 + + - + 3 F-box protein 11
CACNA1C - + + + 3 hypothetical protein LOC100131098;
C3orf18 - + + + 3 chromosome 3 open reading frame 18
UBQLN1 + - + + 3 ubiquilin 1
CSF1R - + + + 3 colony stimulating factor 1 receptor
CD47 - + + + 3 CD47 molecule
CARHSP1 - + + + 3 calcium regulated heat stable protein 1, 24kDa
YWHAE - + + + 3 similar to 14-3-3 protein epsilon (14-3-3E)
MIDN + + + - 3 midnolin
MAP3K14 + + - + 3 mitogen-activated protein kinase kinase kinase 14
MAP3K10 - + + + 3 mitogen-activated protein kinase kinase kinase 10
NFAT5 - + + + 3 nuclear factor of activated T-cells 5, tonicity-responsive
N4BP1 - + + + 3 NEDD4 binding protein 1
MYO10 - + + + 3 myosin X
KPNA1 + + - + 3 karyopherin alpha 1 (importin alpha 5)
KIAA1274 - + + + 3 KIAA1274
JARID2 + + + - 3 jumonji, AT rich interactive domain 2
LRRC59 - + + + 3 leucine rich repeat containing 59
Table 7: Intersection of predicted targets by four different softwares. Blue ones are validated
DIRECT targets of miR-155. The whole list is shown at Appendix 1.
If we consider the precision scores from test case I, precision is ~8%. After combining the
prediction results of the four software packages, this percentage increases to ~25% when only
genes with 4 hits are considered: 3 out of the 12 genes predicted by all four software packages are
experimentally validated direct miR-155 targets. This suggests that the other 9 targets
(see Table 8) are strong potential miR-155 targets, which could be checked in further
validation experiments.
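The amalgamation described above amounts to counting, for each gene, how many tools predict it. A minimal sketch using four rows of Table 7 (the data structure and variable names are illustrative, and tool names are abbreviated as in the table):

```python
from collections import Counter

# Subset of Table 7: each tool maps to the genes it marks with '+'.
predictions = {
    "DIANA": {"KRAS", "HDAC4", "MIDN"},
    "TargS": {"KRAS", "IKBKE", "MIDN"},
    "MAMI": {"KRAS", "IKBKE", "HDAC4", "MIDN"},
    "Pictar": {"KRAS", "IKBKE", "HDAC4"},
}

# Count how many tools predict each gene (the TOTAL column).
hits = Counter(gene for genes in predictions.values() for gene in genes)
consensus = sorted(gene for gene, n in hits.items() if n == len(predictions))
print(consensus)      # ['KRAS']
print(hits["IKBKE"])  # 3

# Precision among genes predicted by all four tools, as reported in the text:
# 3 validated targets out of 12 consensus genes.
print(100 * 3 / 12)   # 25.0
```

Restricting the list to genes with the maximum hit count trades sensitivity for precision, which is the effect reported for the 12 consensus genes.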
Gene_Symbol DIANA TargS MAMI Pictar TOTAL Gene_name
NUFIP2 + + + + 4 nuclear fragile X mental retardation protein interacting protein 2
MAP3K7IP2 + + + + 4 mitogen-activated protein kinase kinase kinase 7 interacting protein 2
SGK3 + + + + 4 serum/glucocorticoid regulated kinase family, member 3
SEMA5A + + + + 4 sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane
RAB11FIP2 + + + + 4 RAB11 family interacting protein 2 (class I)
SEPT11 + + + + 4 septin 11
FAR1 + + + + 4 fatty acyl CoA reductase 1
KRAS + + + + 4 v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog
Table 8: The list of 9 miR-155 target candidates. All of those genes have been predicted by all 4
top scoring target prediction software packages.
Result III: Quality control of microarray data
The reproducibility of the microarray experiment was checked with a scatter plot.
Even though the two samples are not exact replicates (miR-155 mimic 50nM was
intended to serve “roughly” as a duplicate of miR-155 mimic 100nM), the plot still shows that
the data are reproducible.
Figure 9: Scatter plot showing miR-155 mimic 100nM and miR-155 mimic 50nM. This figure
roughly suggests the correlation between the miR-155 mimic 100nM and miR-155 mimic 50nM data.
Result IV: Elucidating microarray data revealed some potential miR-155 target genes
Part I: Using DAVID revealed two candidate genes: WEE1 and DPY19L1
As a result of the microarray analysis, using the parameters described in the Methods
section, 395 genes (not shown) were obtained. Only 363 of the 395 genes were annotated
in the Database for Annotation, Visualization and Integrated Discovery v6.7 (DAVID),
and these were chosen for further functional analysis [36, 37]. The human genome
was used as background for the functional annotation. Results are sorted according to
p-values (see Figure 10).
The first significantly enriched functional annotation group was “urinary bladder
tumor_disease_3rd”, which belongs to the database “UNIGENE_EST_QUARTILE”.
The list of 105 genes which belong to “urinary bladder tumor_disease_3rd” is provided in
Appendix 2.
Furthermore, the other enriched “UNIGENE_EST_QUARTILE” datasets are: adrenal
tumor_disease_3rd, oral tumor_disease_3rd, thyroid tumor_disease_3rd, ear_normal_3rd,
esophageal tumor_disease_3rd, tongue_normal_3rd, pharynx_normal_3rd, mammary
gland_normal_3rd, larynx_normal_3rd, laryngeal cancer_disease_3rd, pharyngeal
tumor_disease_3rd and esophagus_normal_3rd. These datasets list genes that are
related to the corresponding tissues. Their high enrichment in this study
means that the miR-155 mimic downregulated genes in these datasets. In addition, many of
those tissues are located in either the digestive or the respiratory tract, where they are
anatomically close to the nasopharyngeal tissue. To give a concrete example,
pharynx_normal_3rd has 68 genes which are enriched among the 363 annotated genes
(see Appendix 3). Those genes are related to normal pharynx tissue according to the
“UNIGENE_EST_QUARTILE” database. Since we are dealing with nasopharyngeal
tissue, those 68 genes were considered highly relevant for further validation analysis. A
quick look-up in the DIANA MicroT prediction results found 2 genes from the
pharynx_normal_3rd dataset:
DPY19L1, dpy-19-like 1 (C. elegans); similar to hCG1645499 [36]
WEE1, WEE1 homolog (S. pombe) [36]
WEE1 is predicted by three top scoring software packages: DIANA MicroT, TargetScan
and Pictar. According to DIANA MicroT it has a 9mer site (a 9-nucleotide match at the seed region).
This increases the possibility of WEE1 being a miR-155 target. Additionally,
DPY19L1 has also been predicted by DIANA MicroT, in which it has two 8mers.
Figure 10: Functional annotation of 395 genes that were obtained by microarray data.
DAVID [36, 37] is used to perform the functional annotation.
Part II: Comparing predictions to microarray data slightly increased the accuracy
and revealed potential miR-155 target candidate genes, such as kras, sgk3,
MAP3K7IP2 and far1
Among the genes predicted by at least one of the top scoring software packages, the most
downregulated genes also tend to be predicted by more than one package. In addition, experimentally
validated direct targets were enriched. The precision increased slightly, to ~30% (9 out of
the 30 genes in Table 9 are validated). Moreover, 4 genes shown in red in Table 9, kras, sgk3,
MAP3K7IP2 and far1, are predicted by all of the top scoring software packages and are also
significantly (at least 25%) downregulated in the microarray experiment. These 4 genes are
strong miR-155 candidate target genes that could be considered for further validation.
Gene_Symbol DIANA TargS MAMI Pictar TOTAL LOG2_100 LOG2_50 LOG2_TW03
p53DINP1 - + + + 3 -1.77 -1.72 3.95
Myo1d - + + + 3 0.41 -1.59 -1.85
VAV3 + + - - 2 0.26 -1.28 -0.94
KRAS + + + + 4 -0.69 -1.18 -1.08
ADD3 + - + + 3 -1.39 -1.15 1.6
ETNK2 - + + + 3 -0.77 -1.01 3.15
PICALM + - + - 2 -0.62 -0.83 -1.41
BCAT1 + + - + 3 0.15 -0.73 -2.1
ZNF652 + + + - 3 -0.66 -0.71 -0.55
TSGA14 - + + + 3 -0.97 -0.66 1.37
ETS1 + + + + 4 -0.34 -0.66 0.29
CARHSP1 - + + + 3 -0.39 -0.65 -0.48
JARID2 + + + - 3 0.17 -0.65 -0.88
SDCBP - + + + 3 0.03 -0.6 -0.54
USP48 - + + + 3 -0.47 -0.57 0.27
SMAD1 + + - - 2 0.08 -0.55 -0.22
MEIS1 - + - + 2 0.08 -0.54 1.55
kcip-1 + + - + 3 -0.1 -0.54 -0.42
MYO10 - + + + 3 0.14 -0.53 -1.88
SGK3 + + + + 4 -0.32 -0.5 -0.62
WWC1 - + + + 3 0 -0.48 -1.1
CSNK1G2 - + + + 3 -0.04 -0.47 -0.86
HIF1A - - + + 2 -0.18 -0.46 -0.22
UBQLN1 + - + + 3 -0.19 -0.46 -0.71
YWHAE - + + + 3 -0.1 -0.38 -0.37
ARID2 + + - - 2 -0.29 -0.33 -0.01
MAP3K7IP2 + + + + 4 -0.17 -0.32 -0.53
KPNA1 + + - + 3 0.1 -0.29 -1.49
FAR1 + + + + 4 -0.2 -0.28 -1.32
SLA - + + + 3 -0.06 -0.28 -0.16
Table 9: Combination of microarray data with prediction data. The microarray data is
incorporated to the list of targets in Appendix 1. Blue ones on the left indicate that the gene has
been validated by wet-lab experiments. LOG2_100 indicates: log2(miR-155 mimic 100 nM / miR-
155 control 50 nM). LOG2_50 indicates: log2(miR-155 mimic 50 nM / miR-155 control 50 nM).
LOG2_TW03 indicates: log2 (TW03 / NP69).
Result V: GOEAST analysis revealed the importance of protein and nucleotide binding related genes via Gene Ontologies
The analysis of the 395 genes using GOEAST revealed the importance of protein and
nucleotide binding related genes. This may also mean that a significant portion of the 395 genes
are transcription factors (the enriched term GO:0000166 denotes nucleotide binding).
Another significantly enriched GO term is GO:0005072 (transforming growth factor beta
receptor, cytoplasmic mediator activity). This molecular function covers
the activity of any molecule that transmits the signal from a TGF-beta receptor from the
cytoplasm to the nucleus [40]. As seen from Figure 11, there are 10 genes in total (see Table
10) in GO:0005072, and 4 of them are enriched in the list introduced.
Table 10: List of totally 10 genes in GO:0005072 - transforming growth factor beta
receptor, cytoplasmic mediator activity [40].
Parameters that were chosen on GOEAST:
Statistical test method: Hypergeometric
Multi-test adjustment method: Yekutieli (FDR under dependency)
Significance Level of Enrichment: 0.001
Database   ID      Gene_Symbol  Reference      Evidence  Gene name
UniProtKB  O15105  SMAD7        PMID:9256479   IDA       Mothers against decapentaplegic homolog 7
UniProtKB  O15198  SMAD9        PMID:19018011  TAS       Mothers against decapentaplegic homolog 9
UniProtKB  O43541  SMAD6        PMID:9256479   IDA       Mothers against decapentaplegic homolog 6
UniProtKB  P17813  ENG          PMID:12015308  IDA       Endoglin
UniProtKB  P46527  CDKN1B       PMID:8033212   TAS       Cyclin-dependent kinase inhibitor 1B
UniProtKB  P84022  SMAD3        PMID:9111321   IDA       Mothers against decapentaplegic homolog 3
UniProtKB  Q13485  SMAD4        PMID:9389648   IDA       Mothers against decapentaplegic homolog 4
UniProtKB  Q15796  SMAD2        PMID:9256479   IDA       Mothers against decapentaplegic homolog 2
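The hypergeometric test selected in the GOEAST parameters can be sketched as follows. This is a generic illustration of the test and not GOEAST's own code: the function name is illustrative, and the background of 20000 genes is an assumed figure, since the background size is not stated here. The counts 10, 395 and 4 are the ones quoted for GO:0005072, and no multi-test adjustment is applied in the sketch.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from a
    population in which `annotated` genes carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# 4 of the 10 genes annotated with GO:0005072 appear in the 395-gene list.
# The background size of 20000 is an assumption for illustration only.
p = hypergeom_enrichment_p(population=20000, annotated=10, sample=395, hits=4)
print(f"{p:.2e}")
```

The resulting raw p-value would still have to be adjusted for multiple testing (Yekutieli in the settings above) before it is compared with the 0.001 significance level.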
Figure 11: The 395 genes obtained from the microarray data were analyzed with
GOEAST. The intensity of the yellow color indicates the significance of the corresponding gene
ontology term (the more intense the yellow, the higher the significance, i.e. the lower the p-value).
Result VI: Validation of microarray results by qPCR revealed that Zdhhc2 and tp53inp1 genes are significantly downregulated
The qPCR experiment quantified the expression of the selected genes. This let
us accurately determine which genes are downregulated in the 5 different samples.
Zdhhc2 and tp53inp1 showed significant downregulation in
both miR-155 mimic 100nM and miR-155 mimic 50nM when compared to miR-155
control 50nM (see Figure 11 and 12).
Figure 11: The qPCR results of e2f2, kdm5b and zdhhc2 genes. The zdhhc2 gene showed
significant downregulation when considering NP69 mimic 50nM and NP69 mimic 100nM
compared with NP69 control 50nM, NP69 parental and TW03. Other genes did not show
significant downregulation.
Figure 12: The qPCR results of bclaf1, terf1 and tp53inp1 genes. The tp53inp1 gene
showed significant downregulation when considering NP69 mimic 50nM and NP69 mimic
100nM compared with NP69 control 50nM, NP69 parental and TW03.
Figure 13: The qPCR results of the gene perp. This gene did not show significant
downregulation.
4 Discussion
Finding the best-performing software packages for miRNA target prediction is
complicated for several reasons. Different software packages use different parameters as
well as different 3'UTRs (some consider only the longest 3'UTR, while others
consider all possible 3'UTRs). These differences result in different sets of targets for
a particular miRNA. Another complication is that different software packages produce
different output formats. Their results need to be converted to a common identifier, and
it is sometimes difficult to find a proper one.
The biological difference between animal and plant miRNA targeting mechanisms remains
largely unknown. The obvious difference is the site of miRNA binding, which in plants is
the CDS, while in animals it is the 3'UTR. Theoretically, miRNA can bind to the CDS in animals, too.
This may be where the difference arises: binding to the CDS may be more difficult in animals than
in plants because ribosomes or other translation factors occupy the mRNA. Binding to the
CDS might be difficult to avoid in plants. Another hypothesis concerns the difference in
the effect of RISC: in plants, RISC might bind to the corresponding miRNA in a way
that favors complete complementarity. As a consequence, miRNA
target prediction in plants is easier.
The first finding of this study is that four miRNA target prediction tools, namely
TargetScan 5.1, PicTar 5, MAMI and DIANA-MicroT 3.0, could be used for further
investigation. Incorporating microarray data into the study slightly strengthened the results
gained from those tools. Overall, however, combining computational predictions with
microarray data did not have a dramatic effect.
Another contribution of this study is the identification of 2 potentially strong miR-155
target genes by using computational predictions and microarray data. These two targets,
tp53inp1 and zdhhc2, will be considered for further validation, especially by Luciferase
reporter assay.
The 395 genes used for functional enrichment studies can be considered another source of
potential targets. As described in Methods, these 395 genes show at least 25%
downregulation (log2 fold change below -0.5). One might regard 25% downregulation as
insignificant or as noise, but for miRNAs many published studies indicate that a
downregulation of this size matters. Therefore the candidate target genes found by using
DAVID, DPY19L1 and WEE1, are also significant enough for further validation experiments.
The amount of WEE1 (a nuclear tyrosine kinase) decreases at M phase, when it is
hyperphosphorylated, which is consistent with the idea that it acts as a negative regulator
of entry into mitosis. A possible storyline is that downregulating WEE1 activity keeps
mitosis active, which results in proliferation, a state favored by almost all cancer tissues.
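The relation between a log2 fold change cutoff and a percentage change can be made explicit with a short calculation. A log2 fold change of -0.5 corresponds to an expression ratio of about 0.71, that is, roughly a 29% reduction, so every gene below that cutoff is downregulated by at least 25%. The sketch below is illustrative only; the function names and the example genes are hypothetical.

```python
def percent_change(log2_fc):
    """Convert a log2 fold change into a percent change in expression."""
    return (2.0 ** log2_fc - 1.0) * 100.0


def downregulated(log2_fc_by_gene, threshold=-0.5):
    """Return the genes whose log2 fold change lies below the threshold."""
    return sorted(g for g, v in log2_fc_by_gene.items() if v < threshold)


# Hypothetical values for illustration, not data from the microarray.
example = {"GENE_A": -0.8, "GENE_B": -0.45, "GENE_C": 0.3, "GENE_D": -0.51}

print(round(percent_change(-0.5), 1))  # about -29.3
print(downregulated(example))
```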
The target prediction and validation procedure could be improved by using alternative
technologies. An alternative to microarrays is RNA-Seq, a technique that quantifies the
transcriptome of cells by deep sequencing. A significant number of publications support
the view that RNA-Seq reveals more information than microarrays because it is not a
hybridization-dependent technique; with hybridization, detecting different isoforms is less
likely. Since multi-exon genes can produce different isoforms, and microarrays mostly do not
distinguish them, microarray data can be misleading. A hypothetical example: a specific
miRNA binds a specific isoform of a gene (alternative 3'UTR splicing events give rise to
different 3'UTRs for a single gene) and eventually downregulates it. The microarray
hybridization probe is unique to the gene, but it does not bind specifically to the 3'UTR
(it can bind anywhere on the mRNA). Thus, while one isoform goes down, the other isoforms
still exist and contribute to the total amount of mRNA in the sample. Consequently, the
microarray data will not show strong downregulation, and this leads to misinterpretation.
For these reasons, RNA-Seq could be used for miRNA studies.
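The dilution effect in this hypothetical example can be quantified. Assume that the targeted isoform makes up a fraction f of the gene's total mRNA and that only this isoform changes; the gene-level signal seen by a probe that binds all isoforms is then the sum of the unchanged and the reduced part. The sketch below encodes that assumption; the function name and the numbers are illustrative and not taken from the study.

```python
import math


def observed_log2_fc(target_fraction, isoform_log2_fc):
    """Gene-level log2 fold change when only one isoform is regulated.

    target_fraction: share of total mRNA contributed by the targeted isoform
                     before regulation (between 0 and 1).
    isoform_log2_fc: log2 fold change of the targeted isoform alone.
    """
    unchanged = 1.0 - target_fraction
    regulated = target_fraction * 2.0 ** isoform_log2_fc
    return math.log2(unchanged + regulated)


# An isoform that is 30% of the transcript pool and drops 4-fold (log2 FC = -2)
# yields a gene-level log2 FC of about -0.37, above a -0.5 cutoff.
print(round(observed_log2_fc(0.3, -2.0), 2))
```

Under these assumptions a strongly repressed isoform would not pass a gene-level cutoff of -0.5, which is the kind of misinterpretation described above.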
Another part of the experiment was to validate the microarray results by checking the
relative expression of some “significant” genes by qPCR. A gene is considered significant
here if it has a tumor suppression function. The following 5 genes are related to tumor
suppression, negative regulation of mitosis or positive regulation of apoptosis. This means
that in the absence of these genes there is a certain risk of the cell becoming highly
proliferative and turning into a tumor cell.
TERF1, BCLAF1: negative regulation of mitosis, GO:0045839.
PERP, TP53INP1: positive regulation of apoptosis, GO:0043065.
E2F2: plays a crucial role in the control of the cell cycle and the action of tumor
suppressor proteins, and is also a target of the transforming proteins of DNA tumor viruses.
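Relative expression from qPCR is often computed with the 2^-ΔΔCt method. The text here does not state which calculation was used, so the sketch below assumes that method and about 100% amplification efficiency; the function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Relative expression of a target gene by the 2^-ddCt method.

    Each Ct of the target gene is first normalized to a reference gene
    measured in the same sample; the treated sample is then compared
    with the control sample.
    """
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_treated - delta_control
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target needs one extra cycle in the treated
# sample, which corresponds to half the expression seen in the control.
print(relative_expression(26.0, 18.0, 25.0, 18.0))
```

A value below 1 indicates lower expression in the treated sample than in the control, which is the direction expected for a gene repressed by miR-155.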
A recent study [41] reported that SMAD2 is a direct target of miR-155. SMAD2 was also
enriched in this study in the GO analysis with GOEAST. SMAD2 belongs to GO:0005072, which
contains 10 genes, mostly from the SMAD family, that act as mediators of TGF-β signaling
(TGF-β is a pleiotropic cytokine with important effects on processes such as fibrosis,
angiogenesis and immunosuppression). Upregulation of miR-155 altered the response to TGF-β
by changing the expression of target genes involved in inflammation, fibrosis and
angiogenesis. In short, this suggests that the other SMAD family genes enriched in our
study could be checked in further validations.
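Enrichment of a GO term in a gene list is commonly scored with a hypergeometric (one-sided Fisher) test. The sketch below assumes that generic form; it is not the exact statistic reported by GOEAST or DAVID, and the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes by chance.

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term
    sample:     number of genes in the list under study
    hits:       genes in the list annotated with the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 annotated genes in a background of 20000,
# 3 of which appear in a list of 395 genes.
print(enrichment_p_value(20000, 10, 395, 3))
```

A small p-value indicates that the term is over-represented in the list; in practice the values are also corrected for the number of GO terms tested.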
5 References
[1] Bartel DP: MicroRNAs: Target Recognition and Regulatory Functions. Cell 2009,
136(2):215-233
[2] Lee, R. C., Feinbaum, R. L., and Ambros, V. (1993). The C. elegans heterochronic gene
lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843-854.
[3] Rhoades MW, Reinhart BJ, Lim LP, Burge CB, Bartel B, Bartel DP: Prediction of plant
microRNA targets. Cell 2002, 110:513-520.
[4] Reinhart BJ, Slack F, Basson M, Pasquinelli A, Bettinger J, Rougvie A, Horvitz HR,
Ruvkun G: The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis
elegans. Nature 2000, 403:901-906.
[5] Lee RC, Feinbaum RL, Ambros V: The C. elegans heterochronic gene lin-4 encodes
small RNAs with antisense complementarity to lin-14. Cell 1993, 75:843-854.
[6] Xiaowei Wang and Issam M. El Naqa (2008) Prediction of both conserved and
nonconserved microRNA targets in animals. Bioinformatics 24(3):325-332.
[7] Xiaowei Wang (2008) miRDB: a microRNA target prediction and functional annotation
database with a wiki interface. RNA 14(6):1012-1017
[8] Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines,
indicates that thousands of human genes are microRNA targets. Cell 2005;120:15-20.
[9] Ambros V (2004). The functions of animal microRNAs. Nature 431: 350–355
[10] Bushati N, Cohen SM (2007) microRNA functions. Annu Rev Cell Dev Biol 23: 175–
205
[11] Sevignani C, Calin GA, Nnadi SC, Shimizu M, Davuluri RV, Hyslop T, Demant P,
Croce CM, Siracusa LD (2007) MicroRNA genes are frequently located near mouse cancer
susceptibility loci. Proc Natl Acad Sci USA 104: 8017– 8022
[12] Asangani IA, Rasheed SA, Nikolova DA, Leupold JH, Colburn NH, Post S, Allgayer
H (2008) MicroRNA-21 (miR-21) post-transcriptionally downregulates tumor suppressor
Pdcd4 and stimulates invasion, intravasation and metastasis in colorectal cancer. Oncogene
27: 2128–2136
[13] Selbach M, Schwanhausser B, Thierfelder N, Fang Z, Khanin R, Rajewsky N (2008)
Widespread changes in protein synthesis induced by microRNAs. Nature 455: 58– 63
[14] Grimson A, Farh KK, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP. MicroRNA
targeting specificity in mammals: determinants beyond seed pairing. Mol Cell
2007;27(1):91–105.
[15] Saito T., and Sætrom P., (2010). MicroRNAs – targeting and target prediction
[16] UCSC Genome Browser on Human Mar. 2006 (NCBI36/hg18) Assembly. Retrieved on
04.04.2010 from
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18&position=chr21:25868163-25868227&hgt.customText=http://mirnamap.mbc.nctu.edu.tw/cache/bed/hsa-mirna.bed
[17] The pre-miRNA of MI0000681. Retrieved in 04.04.2010 from
http://mirnamap.mbc.nctu.edu.tw/php/mirna_entry.php?acc=MI0000681
[18] Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of
mammalian microRNA targets. Cell. 115: 787-798 (2003).
[19] Lytle, J.R. et al. (2007) Target mRNAs are repressed as efficiently by
microRNA-binding sites in the 5' UTR as in the 3' UTR. Proc. Natl. Acad. Sci. U. S. A.
104, 9667–9672.
[20] Arvey A, Larsson E, Sander C, Leslie CS, Marks DS. Target mRNA abundance dilutes
microRNA and siRNA activity. Molecular Systems Biology (2010) 6:363.
[21] M. Maragkakis; M. Reczko; V. A. Simossis; P. Alexiou; G. L. Papadopoulos; T.
Dalamagas; G. Giannopoulos; G. Goumas; E. Koukis; K. Kourtis; T. Vergoulis; N. Koziris;
T. Sellis; P. Tsanakas; A. G. Hatzigeorgiou. DIANA-microT web server: elucidating
microRNA functions through target prediction. Nucleic Acids Research 2009 Jul 1; 37(Web
Server issue):W273-6.
[22] Friedman, R.C., Farh K. K., Christopher B Burge, David P Bartel. (2009) Most
mammalian mRNAs are conserved targets of microRNAs. Genome Res. 19, 92–105
[23] Rajewsky N., and Chen K. Natural selection on human microRNA binding sites
inferred from SNP data. Nature Genetics 38, 1452 - 1456 (2006)
[24] Kong W, He L, Coppola M, Guo J, Esposito NN, Coppola D, Cheng JQ. MicroRNA-
155 regulates cell survival, growth and chemosensitivity by targeting FOXO3a in breast
cancer. J Biol Chem. 2010 Apr 6
[25] Brown JR, Sanseau P. A computational view of microRNAs and their targets. Drug
Discov Today. 10: 595-601 (2005)
[26] Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ. miRBase: tools for microRNA
genomics. Nucleic Acids Res. 36: D154-D158 (2008)
[27] Hammell CM. The microRNA-argonaute complex: a platform for mRNA modulation.
RNA Biol 2008;5(3):123–7.
[28] The database of experimentally supported targets: a functional update of TarBase.
(Papadopoulos GL, Reczko M, Simossis VA, Sethupathy P, Hatzigeorgiou AG.), Nucleic
Acids Res. 2009 Jan;37(Database issue):D155-8. Epub 2008 Oct 27.
[29] Mirwalk, http://www.ma.uni-heidelberg.de/apps/zmf/mirwalk/index.html
[30] Uniprot, http://www.uniprot.org/keywords/?query=name:"Phosphoprotein"
[31] Nucleic Acids Res. 2008 May 16. GOEAST: a web-based software toolkit for Gene
Ontology enrichment analysis. Zheng Q, Wang XJ. PMID: 18487275
[32] Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, Bartel DP,
Linsley PS, Johnson JM. Microarray analysis shows that some microRNAs downregulate
large numbers of target mRNAs, Nature. 433: 769-773 (2005)
[33] Eulalio A, Huntzinger E, Nishihara T, Rehwinkel J, Fauser M, Izaurralde E (January
2009). "Deadenylation is a widespread effect of miRNA regulation". RNA 15 (1): 21–32.
doi:10.1261/rna.1399509. PMID 19029310.
[34] Sunkar R, Jagadeeswaran G. In silico identification of conserved microRNAs in large
number of diverse plant species. BMC Plant Biol. 2008;8:37.
[35] Howell F Moffett and Carl D Novina (2007). A small RNA makes a Bic difference.
Genome Biology 2007, 8:221 (doi:10.1186/gb-2007-8-7-221)
[36] Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large
gene lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.
[37] Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA.
DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol.
2003;4(5):P3
[38] Retrieved from Genecards, http://www.genecards.org, on June 2010.
[39] Anthony A. Millar and Peter M. Waterhouse. (2005) Plant and animal microRNAs:
similarities and differences. FUNCTIONAL & INTEGRATIVE GENOMICS. 5:3, 129-
135, DOI: 10.1007/s10142-005-0145-2.
[40] Retrieved from www.geneontology.org
[41] Louafi F, Martinez-Nunez RT, Sanchez-Elsner T. (2010). MicroRNA-155 (miR-155)
targets SMAD2 and modulates the response of macrophages to transforming growth factor-β.
J Biol Chem.
Appendices
Supplementary Figure 1
Supplementary Figure 1: Scatter plot comparing NP69 tissue and NP69 control (50 nM). No
significant correlation was observed between them.
Appendix 1
DIANA-MicroT 3.0 http://diana.cslab.ece.ntua.gr/microT/
TargetScan 5.1 http://www.targetscan.org/vert_50/
Pictar 5.0 http://pictar.mdc-berlin.de/
MAMI http://mami.med.harvard.edu/
Blue ones are validated DIRECT targets of miR-155
Gene_Symbol DIANA TargS MAMI Pictar TOTAL Gene_name
NUFIP2 1 1 1 1 4 nuclear fragile X mental retardation protein interacting protein 2
MAP3K7IP2 1 1 1 1 4 mitogen-activated protein kinase kinase kinase 7 interacting protein 2
SGK3 1 1 1 1 4 serum/glucocorticoid regulated kinase family, member 3
TSHZ3 1 1 1 1 4 teashirt zinc finger homeobox 3
SEMA5A 1 1 1 1 4 sema domain, seven thrombospondin repeats (type 1 and type 1-like)
RAB11FIP2 1 1 1 1 4 RAB11 family interacting protein 2 (class I)
SEPT11 1 1 1 1 4 septin 11
FAR1 1 1 1 1 4 fatty acyl CoA reductase 1
KRAS 1 1 1 1 4 v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog
ETS1 1 1 1 1 4 v-ets erythroblastosis virus E26 oncogene homolog 1 (avian)
BACH1 1 1 1 1 4 BTB and CNC homology 1, basic leucine zipper transcription factor 1
ZNF236 1 1 1 1 4 zinc finger protein 236
DCUN1D3 1 1 1 3 DCN1, defective in cullin neddylation 1, domain containing 3 (S. cerevisiae)
ETNK2 1 1 1 3 ethanolamine kinase 2
DNAJB7 1 1 1 3 DnaJ (Hsp40) homolog, subfamily B, member 7
IKBKE 1 1 1 3 inhibitor of kappa light polypeptide gene enhancer in B-cells, kinase epsilon
HDAC4 1 1 1 3 histone deacetylase 4
FBXO11 1 1 1 3 F-box protein 11
CACNA1C 1 1 1 3 hypothetical protein LOC100131098; calcium channel
C3orf18 1 1 1 3 chromosome 3 open reading frame 18
UBQLN1 1 1 1 3 ubiquilin 1
CSF1R 1 1 1 3 colony stimulating factor 1 receptor
CD47 1 1 1 3 CD47 molecule
CARHSP1 1 1 1 3 calcium regulated heat stable protein 1, 24kDa
YWHAE 1 1 1 3 similar to 14-3-3 protein epsilon (14-3-3E)
MIDN 1 1 1 3 midnolin
MAP3K14 1 1 1 3 mitogen-activated protein kinase kinase kinase 14
MAP3K10 1 1 1 3 mitogen-activated protein kinase kinase kinase 10
NFAT5 1 1 1 3 nuclear factor of activated T-cells 5, tonicity-responsive
N4BP1 1 1 1 3 NEDD4 binding protein 1
MYO10 1 1 1 3 myosin X
KPNA1 1 1 1 3 karyopherin alpha 1 (importin alpha 5)
KIAA1274 1 1 1 3 KIAA1274
JARID2 1 1 1 3 jumonji, AT rich interactive domain 2
LRRC59 1 1 1 3 leucine rich repeat containing 59
ZIC3 1 1 1 3 Zic family member 3 (odd-paired homolog, Drosophila)
KPNA4 1 1 1 3 karyopherin alpha 4 (importin alpha 3)
SLA 1 1 1 3 Src-like-adaptor
SKV 1 1 1 3 v-ski sarcoma viral oncogene homolog (avian)
RNF123 1 1 1 3 ring finger protein 123
ZNF652 1 1 1 3 zinc finger protein 652
Nova1 1 1 1 3 neuro-oncological ventral antigen 1
FGF7 1 1 1 3 hypothetical LOC100132771; fibroblast growth factor 7
CEBPB 1 1 1 3 CCAAT/enhancer binding protein (C/EBP), beta
Myo1d 1 1 1 3 myosin ID
ZNF703 1 1 1 3 zinc finger protein 703
kcip-1 1 1 1 3 tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein
p53DINP1 1 1 1 3 tumor protein p53 inducible nuclear protein 1
LRP1B 1 1 1 3 low density lipoprotein-related protein 1B (deleted in tumors)
C1QL2 1 1 1 3 complement component 1, q subcomponent-like 2
ARL5B 1 1 1 3 ADP-ribosylation factor-like 5B
AICDA 1 1 1 3 activation-induced cytidine deaminase
ADD3 1 1 1 3 adducin 3 (gamma)
BOC 1 1 1 3 Boc homolog (mouse)
BCAT1 1 1 1 3 branched chain aminotransferase 1, cytosolic
ASTN2 1 1 1 3 astrotactin 2
c-myb 1 1 1 3 v-myb myeloblastosis viral oncogene homolog (avian)
ZNF198 1 1 1 3 zinc finger, MYM-type 2
CSNK1G2 1 1 1 3 casein kinase 1, gamma 2
TLE4 1 1 1 3 transducin-like enhancer of split 4 (E(sp1) homolog, Drosophila)
EHD1 1 1 1 3 EH-domain containing 1
Olfml3 1 1 1 3 olfactomedin-like 3
PSKH1 1 1 1 3 protein serine kinase H1
USP48 1 1 1 3 ubiquitin specific peptidase 48
SOX1 1 1 1 3 SRY (sex determining region Y)-box 1
TOMM20 1 1 1 3 similar to translocase of outer mitochondrial membrane
WIT1 1 1 1 3 Wilms tumor upstream neighbor 1
UPP2 1 1 1 3 uridine phosphorylase 2
SOCS1 1 1 1 3 suppressor of cytokine signaling 1
SPI1 1 1 1 3 spleen focus forming virus (SFFV)
TSGA14 1 1 1 3 testis specific, 14
SDCBP 1 1 1 3 syndecan binding protein (syntenin)
WWC1 1 1 1 3 WW and C2 domain containing 1
TRIM2 1 1 1 3 tripartite motif-containing 2
SUFU 1 1 1 3 suppressor of fused homolog (Drosophila)
SMARCA4 1 1 2 SWI/SNF related, matrix associated
AKAP10 1 1 2 A kinase (PRKA) anchor protein 10
Itk 1 1 2 IL2-inducible T-cell kinase
VAV3 1 1 2 vav 3 guanine nucleotide exchange factor
SMNDC1 1 1 2 survival motor neuron domain containing 1
RCN2 1 1 2 reticulocalbin 2, EF-hand calcium binding domain
RREB1 1 1 2 ras responsive element binding protein 1
SPIN3 1 1 2 spindlin family, member 3
JMJD1A 1 1 2 lysine (K)-specific demethylase 3A
CSNK1A1 1 1 2 casein kinase 1, alpha 1
SOX10 1 1 2 SRY (sex determining region Y)-box 10
SOS1 1 1 2 son of sevenless homolog 1 (Drosophila)
SMAD1 1 1 2 SMAD family member 1
ATP2B1 1 1 2 ATPase, Ca++ transporting, plasma membrane 1
ANTXR2 1 1 2 anthrax toxin receptor 2
SMAD2 1 1 2 SMAD family member 2
SLC39A10 1 1 2 solute carrier family 39 (zinc transporter), member 10
BCL11A 1 1 2 B-cell CLL/lymphoma 11A (zinc finger protein)
ZNF642 1 1 2 zinc finger protein 642
BAG5 1 1 2 BCL2-associated athanogene 5
TSPAN14 1 1 2 tetraspanin 14
ZNF644 1 1 2 zinc finger protein 644
BRD1 1 1 2 bromodomain containing 1
SOCS6 1 1 2 suppressor of cytokine signaling 6
TRPS1 1 1 2 trichorhinophalangeal syndrome I
ABHD2 1 1 2 abhydrolase domain containing 2
ACTA1 1 1 2 actin, alpha 1, skeletal muscle
RRP15 1 1 2 ribosomal RNA processing 15 homolog (S. cerevisiae)
MLCK 1 1 2 myosin light chain kinase
KBTBD2 1 1 2 kelch repeat and BTB (POZ) domain containing 2
FLJ90013 1 1 2 transmembrane anterior posterior transformation 1
BSN2 1 1 2 basonuclin 2
SP3 1 1 2 Sp3 transcription factor
PSIP1 1 1 2 PC4 and SFRS1 interacting protein 1
WDFY3 1 1 2 WD repeat and FYVE domain containing 3
INPP5D 1 1 2 inositol polyphosphate-5-phosphatase, 145kDa
TYRP1 1 1 2 tyrosinase-related protein 1
ARID2 1 1 2 AT rich interactive domain 2 (ARID, RFX-like)
ZFYVE14 1 1 2 ankyrin repeat and FYVE domain containing 1
PELI1 1 1 2 pellino homolog 1 (Drosophila)
WDR45 1 1 2 WD repeat domain 45
ZNF238 1 1 2 zinc finger protein 238
LCORL 1 1 2 ligand dependent nuclear receptor corepressor-like
SP1 1 1 2 Sp1 transcription factor
NR2F2 1 1 2 nuclear receptor subfamily 2, group F, member 2
Mon1a 1 1 2 MON1 homolog A (yeast)
RAB1A 1 1 2 RAB1A, member RAS oncogene family
cab39 1 1 2 calcium binding protein 39
TMEM178 1 1 2 transmembrane protein 178
TFDP2 1 1 2 transcription factor Dp-2 (E2F dimerization partner 2)
SSH2 1 1 2 slingshot homolog 2 (Drosophila)
NDFIP1 1 1 2 Nedd4 family interacting protein 1
EHF 1 1 2 ets homologous factor
STRN3 1 1 2 striatin, calmodulin binding protein 3
DNCI1 1 1 2 dynein, cytoplasmic 1, intermediate chain 1
AHCYL2 1 1 2 adenosylhomocysteinase-like 2
TRIM32 1 1 2 tripartite motif-containing 32
H3.3B 1 1 2 H3 histone, family 3B (H3.3B);
MEIS1 1 1 2 Meis homeobox 1
KCNN3 1 1 2 potassium intermediate/small conductance channel
LOC389458 1 1 2 hypothetical LOC389458; RB-associated KRAB zinc finger
RBMS3 1 1 2 RNA binding motif, single stranded interacting protein
HIF1A 1 1 2 hypoxia inducible factor 1, alpha subunit
RNF146 1 1 2 ring finger protein 146
IRF2BP2 1 1 2 interferon regulatory factor 2 binding protein 2
HIVEP2 1 1 2 human immunodeficiency virus type I enhancer binding protein 2
HNRPA3 1 1 2 heterogeneous nuclear ribonucleoprotein A3
KCNA1 1 1 2 potassium voltage-gated channel
KIAA1267 1 1 2 KIAA1267
RICTOR 1 1 2 RPTOR independent companion of MTOR,
JHDM1D 1 1 2 jumonji C domain containing histone demethylase 1
SGCB 1 1 2 sarcoglycan, beta
GPR85 1 1 2 G protein-coupled receptor 85
GTF2A1L 1 1 2 stonin 1
GDF6 1 1 2 growth differentiation factor 6
GOLPH3L 1 1 2 golgi phosphoprotein 3-like
RSPO2 1 1 2 R-spondin 2 homolog (Xenopus laevis)
HERC4 1 1 2 hect domain and RLD 4
H3F3B 1 1 2 H3 histone, family 3B (H3.3B)
HBP1 1 1 2 HMG-box transcription factor 1
MECP2 1 1 2 methyl CpG binding protein 2 (Rett syndrome)
PKN2 1 1 2 protein kinase N2
NAV3 1 1 2 neuron navigator 3; similar to neuron navigator 3
PLEKHK1 1 1 2 rhotekin 2
PLAG1 1 1 2 pleiomorphic adenoma gene 1
PEA15 1 1 2 phosphoprotein enriched in astrocytes 15
PHF17 1 1 2 PHD finger protein 17
PKIA 1 1 2 protein kinase (cAMP-dependent, catalytic) inhibitor alpha
PICALM 1 1 2 phosphatidylinositol binding clathrin assembly protein
REPS2 1 1 2 RALBP1 associated Eps domain containing 2
RC3H2 1 1 2 ring finger and CCCH-type zinc finger domains 2
RBM47 1 1 2 RNA binding motif protein 47
KIAA1715 1 1 2 KIAA1715
RCOR1 1 1 2 REST corepressor 1
RAB34 1 1 2 RAB34, member RAS oncogene family
ZBTB41 1 1 2 zinc finger and BTB domain containing 41
LOC646270 1 1 2 elongation factor, RNA polymerase II, 2
LOC646438 1 1 2 H3 histone, family 3B (H3.3B);
CDC73 1 1 2 cell division cycle 73
CHD7 1 1 2 chromodomain helicase DNA binding protein 7
SF3B1 1 1 2 splicing factor 3b, subunit 1, 155kDa
CBL 1 1 2 Cas-Br-M (murine) ecotropic retroviral seq
COL21A1 1 1 2 collagen, type XXI, alpha 1
COL7A1 1 1 2 collagen, type VII, alpha 1
CKAP5 1 1 2 cytoskeleton associated protein 5
CNTN4 1 1 2 contactin 4
BCORL1 1 1 2 BCL6 co-repressor-like 1
C10orf26 1 1 2 chromosome 10 open reading frame 26
C10orf46 1 1 2 chromosome 10 open reading frame 46
SLC12A6 1 1 2 solute carrier family 12
C10orf12 1 1 2 chromosome 10 open reading frame 12
C5orf41 1 1 2 chromosome 5 open reading frame 41
C9orf5 1 1 2 chromosome 9 open reading frame 5
SIM2 1 1 2 single-minded homolog 2 (Drosophila)
SHOX 1 1 2 short stature homeobox
FAM134C 1 1 2 family with sequence similarity 134, member C
FBXO33 1 1 2 F-box protein 33
FLJ37543 1 1 2 hypothetical protein FLJ37543
FAM135A 1 1 2 family with sequence similarity 135, member A
S100PBP 1 1 2 S100P binding protein
GABRA1 1 1 2 gamma-aminobutyric acid (GABA) A receptor, alpha 1
GCN5L2 1 1 2 K(lysine) acetyltransferase 2A
FOS 1 1 2 v-fos FBJ murine osteosarcoma viral oncogene homolog
FZD5 1 1 2 frizzled homolog 5 (Drosophila)
COPS3 1 1 2 COP9 constitutive photomorphogenic homolog subunit 3
DNAJB1 1 1 2 DnaJ (Hsp40) homolog, subfamily B, member 1
SCG2 1 1 2 secretogranin II (chromogranin C)
SEC14L5 1 1 2 SEC14-like 5 (S. cerevisiae)
DET1 1 1 2 de-etiolated homolog 1 (Arabidopsis)
SATB1 1 1 2 SATB homeobox 1
SALL1 1 1 2 sal-like 1 (Drosophila)
E2F2 1 1 2 E2F transcription factor 2
EDG1 1 1 2 sphingosine-1-phosphate receptor 1
Appendix 2:
ID Gene Name
HMGCS1 3-hydroxy-3-methylglutaryl-Coenzyme A synthase 1 (soluble)
AAK1 AP2 associated kinase 1
ATP6V1C1 ATPase, H+ transporting, lysosomal 42kDa, V1 subunit C1
CAP2 CAP, adenylate cyclase-associated protein, 2 (yeast)
CNOT1 CCR4-NOT transcription complex, subunit 1
CDC42BPA CDC42 binding protein kinase alpha (DMPK-like)
COMMD2 COMM domain containing 2
F11R F11 receptor
H3F3A H3 histone, family 3B (H3.3B); H3 histone, family 3A pseudogene; H3 histone, family 3A; similar to H3 histone, family 3B; similar to histone H3.3B
HBS1L HBS1-like (S. cerevisiae)
KIAA1671 KIAA1671 protein
LASS6 LAG1 homolog, ceramide synthase 6
LRBA LPS-responsive vesicle trafficking, beach and anchor containing
MLF1IP MLF1 interacting protein
PERP PERP, TP53 apoptosis effector
ARHGAP5 Rho GTPase activating protein 5
SMAD2 SMAD family member 2
SMAD6 SMAD family member 6
SMEK2 SMEK homolog 2, suppressor of mek1 (Dictyostelium)
ST6GALNAC2 ST6 (alpha-N-acetyl-neuraminyl-2,3-beta-galactosyl-1,3)-N-acetylgalactosaminide alpha-2,6-sialyltransferase 2
TAF9B TAF9B RNA polymerase II, TATA box binding protein (TBP)-associated factor, 31kDa
WEE1 WEE1 homolog (S. pombe)
XIAP X-linked inhibitor of apoptosis
ACOX1 acyl-Coenzyme A oxidase 1, palmitoyl
ANLN anillin, actin binding protein
ANXA2P2 annexin A2 pseudogene 2
ANXA2P1, ANXA2 annexin A2 pseudogene 3; annexin A2; annexin A2 pseudogene 1
ATL3 atlastin GTPase 3
CREBL2 cAMP responsive element binding protein-like 2
CDH1 cadherin 1, type 1, E-cadherin (epithelial)
CHP calcium binding protein P22
CREG1 cellular repressor of E1A-stimulated genes 1
CENPF centromere protein F, 350/400ka (mitosin)
CBX5 chromobox homolog 5 (HP1 alpha homolog, Drosophila)
C18orf10 chromosome 18 open reading frame 10
CSDE1 cold shock domain containing E1, RNA-binding
COL12A1 collagen, type XII, alpha 1
CBFB core-binding factor, beta subunit
DLGAP5 discs, large (Drosophila) homolog-associated protein 5
ENAH enabled homolog (Drosophila)
ERMP1 endoplasmic reticulum metallopeptidase 1
EGFR epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian)
ANKRD36B similar to KIAA1641; similar to ankyrin repeat domain 26; ankyrin repeat domain 36B
SNRNP200 similar to U5 snRNP-specific protein, 200 kDa; small nuclear ribonucleoprotein 200kDa (U5)
HMGB3 similar to high mobility group box 3; high-mobility group box 3
PRKDC similar to protein kinase, DNA-activated, catalytic polypeptide; protein kinase, DNA-activated, catalytic polypeptide
TOMM20 similar to translocase of outer mitochondrial membrane 20 homolog; similar to mitochondrial outer membrane protein 19; translocase of outer mitochondrial membrane 20 homolog (yeast)
SLC35B4 solute carrier family 35, member B4
SLC9A6 solute carrier family 9 (sodium/hydrogen exchanger), member 6
FAM173B family with sequence similarity 173, member B
SKA2 family with sequence similarity 33, member A; similar to Spindle and kinetochore-associated protein 2
GLTP glycolipid transfer protein; glycolipid transfer protein pseudogene 1
GPC1 glypican 1
GNAQ guanine nucleotide binding protein (G protein), q polypeptide
HSPB1 heat shock 27kDa protein-like 2 pseudogene; heat shock 27kDa protein 1
HELZ helicase with zinc finger
HP1BP3 heterochromatin protein 1, binding protein 3
HNRNPA3 heterogeneous nuclear ribonucleoprotein A3
HIST1H1B histone cluster 1, H1b
HIP1 huntingtin interacting protein 1
ID1 inhibitor of DNA binding 1, dominant negative helix-loop-helix protein
ID3 inhibitor of DNA binding 3, dominant negative helix-loop-helix protein
ITGAV integrin, alpha V (vitronectin receptor, alpha polypeptide, antigen CD51)
IL13RA1 interleukin 13 receptor, alpha 1
KPNA6 karyopherin alpha 6 (importin alpha 7)
KRT17 keratin 17; keratin 17 pseudogene 3
KTN1 kinectin 1 (kinesin receptor)
KREMEN1 kringle containing transmembrane protein 1
LNX2 ligand of numb-protein X 2
MANEA mannosidase, endo-alpha
MID1 midline 1 (Opitz/BBB syndrome)
MSN moesin
MYH9 myosin, heavy chain 9, non-muscle
MARCKS myristoylated alanine-rich protein kinase C substrate
NCAPD2 non-SMC condensin I complex, subunit D2
PAK2 p21 protein (Cdc42/Rac)-activated kinase 2
PPL periplakin
PICALM phosphatidylinositol binding clathrin assembly protein
PKP1 plakophilin 1 (ectodermal dysplasia/skin fragility syndrome); similar to plakophilin 1 isoform 1a
PABPC1 poly(A) binding protein, cytoplasmic pseudogene 5; poly(A) binding protein, cytoplasmic 1
PMEPA1 prostate transmembrane protein, androgen induced 1
PCMTD2 protein-L-isoaspartate (D-aspartate) O-methyltransferase domain containing 2
RIPK4 receptor-interacting serine-threonine kinase 4
RPL5 ribosomal protein L5 pseudogene 34; ribosomal protein L5 pseudogene 1; ribosomal protein L5
RPLP0 ribosomal protein, large, P0 pseudogene 2; ribosomal protein, large, P0 pseudogene 3; ribosomal protein, large, P0 pseudogene 6; ribosomal protein, large, P0
SEMA3C sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3C
SYNE2 spectrin repeat containing, nuclear envelope 2
SGPL1 sphingosine-1-phosphate lyase 1
SKAP2 src kinase associated phosphoprotein 2
STON2 stonin 2
SMC4 structural maintenance of chromosomes 4
SNAP23 synaptosomal-associated protein, 23kDa
TNKS tankyrase, TRF1-interacting ankyrin-related ADP-ribose polymerase
TLL1 tolloid-like 1
TOP2A topoisomerase (DNA) II alpha 170kDa
TOP2B topoisomerase (DNA) II beta 180kDa
TOB2 transducer of ERBB2, 2
TBL1XR1 transducin (beta)-like 1 X-linked receptor 1
TM7SF3 transmembrane 7 superfamily member 3
TMEM14A transmembrane protein 14A
TMEM56 transmembrane protein 56
TWF1 twinfilin, actin-binding protein, homolog 1 (Drosophila)
UBE4A ubiquitination factor E4A (UFD2 homolog, yeast)
ZFAT zinc finger and AT hook domain containing
ZBTB41 zinc finger and BTB domain containing 41
Appendix 3:
ID Gene Name
HMGCS1 3-hydroxy-3-methylglutaryl-Coenzyme A synthase 1 (soluble)
OXCT1 3-oxoacid CoA transferase 1
AHNAK AHNAK nucleoprotein
AHNAK2 AHNAK nucleoprotein 2
ATP6V1D ATPase, H+ transporting, lysosomal 34kDa, V1 subunit D
AGAP1 ArfGAP with GTPase domain, ankyrin repeat and PH domain 1
DIP2A DIP2 disco-interacting protein 2 homolog A (Drosophila)
ELAVL2 ELAV (embryonic lethal, abnormal vision, Drosophila)-like 2 (Hu antigen B)
F11R F11 receptor
GPSM2 G-protein signaling modulator 2 (AGS3-like, C. elegans)
IQGAP1 IQ motif containing GTPase activating protein 1
KIAA1671 KIAA1671 protein
KLF7 Kruppel-like factor 7 (ubiquitous)
LFNG LFNG O-fucosylpeptide 3-beta-N-acetylglucosaminyltransferase
MLF1IP MLF1 interacting protein
NDRG1 N-myc downstream regulated 1
WEE1 WEE1 homolog (S. pombe)
XIAP X-linked inhibitor of apoptosis
ANXA2P1, ANXA2 annexin A2 pseudogene 3; annexin A2; annexin A2 pseudogene 1
CDH1 cadherin 1, type 1, E-cadherin (epithelial)
C18orf10 chromosome 18 open reading frame 10
PSAT1 chromosome 8 open reading frame 62; phosphoserine aminotransferase 1
CIT citron (rho-interacting, serine/threonine kinase 21)
CLOCK clock homolog (mouse)
COL12A1 collagen, type XII, alpha 1
DLG1 discs, large homolog 1 (Drosophila)
DPY19L1 dpy-19-like 1 (C. elegans); similar to hCG1645499
ENAH enabled homolog (Drosophila)
EGFR epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian)
FAM173B family with sequence similarity 173, member B
GGCX gamma-glutamyl carboxylase
GLTP glycolipid transfer protein; glycolipid transfer protein pseudogene 1
GTDC1 glycosyltransferase-like domain containing 1
GNG12 guanine nucleotide binding protein (G protein), gamma 12
HDLBP high density lipoprotein binding protein
HIP1 huntingtin interacting protein 1
KRT17 keratin 17; keratin 17 pseudogene 3
MAN1A2 mannosidase, alpha, class 1A, member 2
MBOAT2 membrane bound O-acyltransferase domain containing 2
MMGT1 membrane magnesium transporter 1
MSN moesin
MYO5A myosin VA (heavy chain 12, myoxin)
MYH9 myosin, heavy chain 9, non-muscle
PAK2 p21 protein (Cdc42/Rac)-activated kinase 2
PALLD palladin, cytoskeletal associated protein
PPL periplakin
PLD1 phospholipase D1, phosphatidylcholine-specific
PKP1 plakophilin 1 (ectodermal dysplasia/skin fragility syndrome); similar to plakophilin 1 isoform 1a
PABPC1 poly(A) binding protein, cytoplasmic pseudogene 5; poly(A) binding protein, cytoplasmic 1
PRKCA protein kinase C, alpha
PTPN11 protein tyrosine phosphatase, non-receptor type 11; similar to protein tyrosine phosphatase, non-receptor type 11
PTPRK protein tyrosine phosphatase, receptor type, K
PHTF2 putative homeodomain transcription factor 2
RIPK4 receptor-interacting serine-threonine kinase 4
KDM5B similar to Jumonji, AT rich interactive domain 1B (RBP2-like); lysine (K)-specific demethylase 5B
PRKDC similar to protein kinase, DNA-activated, catalytic polypeptide; protein kinase, DNA-activated, catalytic polypeptide
SLC39A9 solute carrier family 39 (zinc transporter), member 9
SPTBN1 spectrin, beta, non-erythrocytic 1
SGPL1 sphingosine-1-phosphate lyase 1
SREBF2 sterol regulatory element binding transcription factor 2
SVIL supervillin
TBL1XR1 transducin (beta)-like 1 X-linked receptor 1
TGFBI transforming growth factor, beta-induced, 68kDa
TNRC6B trinucleotide repeat containing 6B
TUFT1 tuftelin 1
TWF1 twinfilin, actin-binding protein, homolog 1 (Drosophila)
VPS13B vacuolar protein sorting 13 homolog B (yeast)
ZNF185 zinc finger protein 185 (LIM domain)