a common class of transcripts with 5’-intron depletion ......2016/06/06 · 2 where are we...
TRANSCRIPT
CanCenik
1
ACommonClassofTranscriptswith5’-IntronDepletion,DistinctEarlyCoding1
SequenceFeatures,andN1-MethyladenosineModification2
CanCenik1,2*,HonNianChua3,4*,GuramritSingh2,5,7,8,AbdallaAkef6,MichaelPSnyder1,3
AlexanderF.Palazzo6,MelissaJMoore2,7,8#,andFrederickPRoth3,9,10#4
5
Affiliations:6
1DepartmentofGenetics,StanfordUniversitySchoolofMedicine,Stanford,CA94305,USA7
2DepartmentofBiochemistryandMolecularPharmacology,UniversityofMassachusettsMedical8
School,Worcester,MA01605,USA9
3DonnellyCentreandDepartmentsofMolecularGeneticsandComputerScience,UniversityofToronto10
andLunenfeld-TanenbaumResearchInstitute,MtSinaiHospital,TorontoM5G1X5,Ontario,Canada11
4DataRobot,Inc.,61ChathamSt.BostonMA02109,USA12
5DepartmentofMolecularGenetics,CenterforRNABiology,TheOhioStateUniversity,Columbus,OH13
43210,USA14
6DepartmentofBiochemistry,UniversityofToronto,1King’sCollegeCircle,MSBRoom5336,Toronto,15
ONM5S1A8,Canada16
7HowardHughesMedicalInstitute,UniversityofMassachusettsMedicalSchool,Worcester,MA01605,17
USA18
8RNATherapeuticsInstitute,UniversityofMassachusettsMedicalSchool,Worcester,MA01605,USA19
9CenterforCancerSystemsBiology(CCSB),Dana-FarberCancerInstitute,Boston02215,MA,USA20
10TheCanadianInstituteforAdvancedResearch,TorontoM5G1Z8,ON,Canada21
22
*Theseauthorscontributedequallytothiswork.23
#Correspondenceto:[email protected],[email protected]
25
ShortTitle:Firstintronpositiondefinesadistincttranscriptclass26
Keywords:5’UTRintrons,randomforest,N1methyladenosine,ExonJunctionComplex27
28
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
2
Abstract:1
Intronsarefoundin5’untranslatedregions(5’UTRs)for35%ofallhumantranscripts.2
These5’UTRintronsarenotrandomlydistributed:genesthatencodesecreted,membrane-3
boundandmitochondrialproteinsarelesslikelytohavethem.Curiously,transcriptslacking4
5’UTRintronstendtoharborspecificRNAsequenceelementsintheirearlycodingregions.5
Hereweexploredtheconnectionbetweencoding-regionsequenceand5’UTRintronstatus.6
Wedevelopedaclassifiermodelingthisrelationshipthatcanpredict5’UTRintronstatuswith7
>80%accuracyusingonlysequencefeaturesintheearlycodingregion.Thus,theclassifier8
identifiestranscriptswith5’proximal-intron-minus-like-codingregions(“5IM”transcripts).9
Unexpectedly,wefoundthattheearlycodingsequencefeaturesdefining5IMtranscriptsare10
widespread,appearingin21%ofallhumanRefSeqtranscripts.Membersofthistranscript11
classtendtoexhibitdifferentialRNAexpressionacrossmanyconditions.The5IMclassof12
transcriptsisenrichedfornon-AUGstartcodons,moreextensivesecondarystructure13
precedingthestartcodon,andnearthe5’cap,greaterdependenceoneIF4Efortranslation,14
andassociationwithER-proximalribosomes.5IMtranscriptsareboundbytheExonJunction15
Complex(EJC)atnon-canonical5’proximalpositions.Finally,N1-methyladenosinesare16
specificallyenrichedintheearlycodingregionsof5IMtranscripts.Takentogether,our17
analysesrevealtheexistenceofadistinct5IMclasscomprising~20%ofhumantranscripts.18
Thisclassisdefinedbydepletionof5’proximalintrons,presenceofspecificRNAsequence19
featuresassociatedwithlowtranslationefficiency,N1-methyladenosinesintheearlycoding20
region,andenrichmentfornon-canonicalbindingbytheExonJunctionComplex.21
22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
3
Introduction:1
2
Approximately35%ofallhumantranscriptsharborintronsintheir5’untranslated3
regions(5’UTRs)(Ceniketal.2010;Hongetal.2006).Amonggeneswith5’UTRintrons(5UIs),4
thoseannotatedas“regulatory”aresignificantlyoverrepresented,whilethereisanunder-5
representationofgenesencodingproteinsthataretargetedtoeithertheendoplasmic6
reticulum(ER)ormitochondria(Ceniketal.2011).FortranscriptsthatencodeER-and7
mitochondria-targetedproteins,5UIdepletionisassociatedwithpresenceofspecificRNA8
sequences(Ceniketal.2011;Palazzoetal.2007,2013).Specifically,nuclearexportofan9
otherwiseinefficientlyexportedmicroinjectedmRNAorcDNAtranscriptcanbepromotedby10
anER-targetingsignalsequence-containingregion(SSCRs)ormitochondrialsignalsequence11
codingregion(MSCRs)fromagenelacking5’UTRintrons(Ceniketal.2011;Leeetal.2015).12
However,morerecentstudiessuggestthatmanySSCRslikelyhavelittleimpactonnuclear13
exportforinvivotranscribedRNAs(Leeetal.2015)butratherenhancetranslationina14
RanBP2-dependentmanner(Mahadevanetal.2013).15
AmongSSCR-andMSCR-containingtranscripts(referredtohereafterasSSCRand16
MSCRtranscripts),~75%lack5’UTRintrons(“5UI–”transcripts)and~25%havethem17
(“5UI+”transcripts).Thesetwogroupshavemarkedlydifferentsequencecompositionsatthe18
5’endsoftheircodingsequences.5UI–transcriptstendtohaveloweradeninecontent19
(Palazzoetal.2007)andusecodonswithfeweruracilsandadeninesthan5UI+transcripts20
(Ceniketal.2011).Theirsignalsequencesalsocontainleucineandargininemoreoftenthan21
thebiochemicallysimilaraminoacidsisoleucineandlysine,respectively.Leucineandarginine22
codonscontainfeweradenineandthyminenucleotides,consistentwithadenineandthymine23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
4
depletion.ThisdepletionisalsoassociatedwiththepresenceofaspecificGC-richRNAmotif1
intheearlycodingregionof5UI–transcripts(Ceniketal.2011).2
Despitesomeknowledgeastotheirearlycodingregionfeatures,keyquestionsabout3
thisclassof5UI–transcriptshaveremainedunanswered:Dotheabovesequencefeatures4
extendbeyondSSCR-andMSCR-containingtranscriptstoother5UI–genes?Do5UI–5
transcriptshavingthesefeaturessharecommonfunctionorregulatorynetwork?What6
bindingfactor(s)recognizetheseRNAelements?Amorecompletemodeloftherelationshipof7
earlycodingfeaturesand5UI–statuswouldbegintoaddressthesequestions.8
Here,weundertookanintegrativemachinelearningapproachtobetterunderstandthe9
relationshipbetweenearlycodingregionfeaturesand5UIstatus.Usingtheresulting10
computationalmodel,wefindthat~21%ofallhumantranscriptscanbeclassifiedas“5IM”11
(definedby5’-proximal-intron-minus-likecodingregions).Whilemanyofthesetranscripts12
encodeER-andmitochondrial-targetedproteins,manyothersencodenuclearandcytoplasmic13
proteins.Sharedcharacteristicsarethat5IMtranscriptstendtolackof5’proximalintrons,14
exhibitnon-canonicalExonJunctionComplexbinding,havefeaturesassociatedwithlower15
intrinsictranslationefficiency,andhaveanincreasedincidenceofN1-methyladenosines.16
17
Results:18
Aclassifierthatpredicts5UIstatususingonlyearlycodingsequenceinformation.19
Tobetterunderstandthepreviously-reportedenigmaticrelationshipbetweencertainearly20
codingregionsequencesandtheabsenceofa5UI,wesoughttomodelthisrelationship.21
Specifically,weusedarandomforestclassifier(Breiman2001)tolearntherelationship22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
5
between5UIabsenceandacollectionof36differentnucleotide-levelfeaturesextractedfrom1
thefirst99nucleotidesofallhumancodingregions(CDS)(Figs1A-C;TableS1;Methods).2
WethenusedalltranscriptsknowntocontainanSSCR,regardlessof5UIstatus,asourtraining3
set.Thistrainingconstraintensuredthatallinputnucleotidesequencesweresubjectto4
similarfunctionalconstraintsattheproteinlevel.Thus,wesoughttoidentifysequence5
featuresthatdifferbetween5UI–and5UI+transcriptsattheRNAlevel.6
Ourclassifierassignstoeachtranscripta“5’UTR-intron-minus-predictor”(5IMP)score7
between0and10,wherehigherscorescorrespondtoahigherlikelihoodofbeing5UI–(Fig8
1C).Interestingly,preliminaryrankingofthe5UI–transcriptsby5IMPscorerevealeda9
relationshipbetweenthepositionofthefirstintroninthecodingregionandthe5IMPscore.10
5UI–transcriptsforwhichthefirstintronwas>85ntsdownstreamofthestartcodonhadthe11
highest5IMPscores.Furthermore,thecloserthefirstintronwastothestartcodon,thelower12
the5IMPscore(Fig1D).Weexploredthisrelationshipfurtherbytrainingclassifiersthat13
increasinglyexcludedfromthetrainingset5UI–transcriptsaccordingtothedistanceofthe14
firstintronfromthe5’endofthecodingregion.Thisrevealedthatclassifierperformance,as15
measuredbytheareaundertheprecisionrecallcurve(AUPRC),increasedasafunctionofthe16
distancefromstartcodontofirstintrondistance(Methods,Fig1E),andalsoasafunctionof17
distancefromtranscriptionstartsitetothefirstintron.Thusthe5UI–-predictiveRNA18
sequencefeaturesweidentifiedaremoreaccuratelydescribedaspredictorsoftranscripts19
without5’-proximalintrons.20
TominimizetheimpactofintronpositionswithintheCDS,weselectedatrainingset21
thatonlyinclude5UI–SSCRtranscriptsforwhichthefirstintronwasatleast90nts22
downstreamofthestartcodon(Methods).Discriminativemotiffeatureswerelearned23
independently(Methods),andperformanceofthisnewclassifierwasgaugedusing10-fold24
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
6
crossvalidation.Incross-validation,performanceaccordingtoareaunderthereceiver1
operatingcurve(AUC)was74%andmeanareaundertheprecisionrecallcurve(AUPRC)was2
88%(Fig2A,yellowcurves).Incomparison,anaivepredictorwouldachieveanAUCof50%3
andanAUPRCof71%(because71%ofSSCRtranscriptsare5UI–).Weusedthisoptimized4
classifierforallsubsequentanalyses.5
SSCRtranscriptsexhibitedmarkedlydifferent5IMPscoredistributionsforthe5UI+and6
5UI–subsets(Fig2B).The5UI+scoredistributionwasunimodalwithapeakat~2.4.In7
contrast,the5UI–scoredistributionwasbimodalwithonepeakat~3.6andanotherat~9,8
suggestingtheexistenceofatleasttwounderlying5UI–transcriptclasses.Thepeakatscore9
3.6resembledthe5UI+peak.Alsocontributingtothepeakat3.6arethe5UI–transcripts10
harboringanintroninthefirst90ntsoftheCDS(55%ofall5UI–transcripts).Theother11
distincthigh-scoring5UI–class(peakatscore9)iscomposedoftranscriptsthathavespecific12
5UI--predictiveRNAsequenceelementswithintheearlycodingregion.13
Next,wewantedtoevaluatewhetherourclassifierwasdetectingdifferencesbetween14
5UI+and5UI–SSCRtranscriptsthatoccurspecificallyintheearlycodingregionasopposedto15
overallsequencedifferencesbetweenthetwotranscriptclasses.Todoso,foreverytranscript16
werandomlychose99ntsfromtheregiondownstreamofthe3rdexon.The5IMPscore17
distributionsofthese‘3’proximalexon’setswereidenticalfor5UI+and5UI–transcripts(Figs18
2A,and2C),confirmingthatthesequencefeaturesthatdistinguish5UI+and5UI–transcripts19
arespecifictotheearlycodingregion.20
21
RNAelementsassociatedwith5UI–transcriptsarepervasiveinthehumangenome22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
7
HavingtrainedtheclassifieronSSCRtranscripts,wenextaskedhowwellitwouldpredictthe1
5UIstatusofothertranscripts.DespitehavingbeentrainedexclusivelyonSSCRtranscripts,2
theclassifierperformedremarkablywellonMSCRtranscripts,achievinganAUCof86%and3
AUPRCof95%(Fig2A,purpleline;Fig2D;ascomparedwith50%and77%,respectively,4
expectedbychance).ThisresultsuggeststhatRNAelementswithinearlycodingregionsof5
5UI–MSCR-transcriptsaresimilartothosein5UI–SSCR-transcriptsdespitedistinctfunctional6
constraintsattheproteinlevel.7
Wethentestedwhethertheclassifiercanpredict5UI–statusintranscriptsthatcontain8
neitherSSCRnorMSCR(“S–/M–”transcripts).BecauseunannotatedSSCRscouldconfoundthis9
analysis,wefirstusedSignalP3.0toidentifyS–/M–transcriptsmostlikelytocontainahidden10
SSCR(Bendtsenetal.2004).These‘SignalP+‘transcriptshada5IMPscoredistribution11
comparabletothoseofknownSSCRandMSCRtranscripts(Fig2E),withanAUCof82%and12
anAUPRCof95%(Fig2A,lightblueline).Whereas5UI+SignalP+transcriptshad13
predominantlylow5IMPscores,5UI–SignalP+5IMPscoreswerestronglyskewedtowards14
high5IMPscores(peakat~9;Fig2E).ThusSignalP+transcriptsdolikelycontainmany15
unannotatedSSCRs.16
Finally,weusedtheclassifiertocalculate5IMPscoresforallremaining“S–/M–/SignalP–17
”transcripts.Althoughtheperformancewasweakeronthisgeneset,itwasstillbetterthana18
naivepredictor(Fig2A,greenline).Asexpected,the5UI+S–/M–/SignalP–transcriptswere19
stronglyskewedtowardlow5IMPscores(Fig2F).Surprisingly,however,asignificant20
fractionof5UI–S–/M–/SignalP–transcriptshadhigh5IMPscores(~18%).Thissuggeststhe21
existenceofabroaderclassoftranscriptsthatsharefeaturesof5’proximal-intron-minus-like-22
codingregions.Hereafterwerefertotranscriptsinthisclassas“5IM”transcripts.23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
8
Todeterminethefractionofallhumantranscriptsthatare5IM,wemadeuseofthe1
above-describednegativecontrolsetofrandomlychosensequencesdownstreamofthe3rd2
exonfromeachtranscript.Byquantifyingtheexcessofhigh-scoringsequencesinthereal3
distributionrelativetothiscontroldistribution,weestimatethat21%ofallhumantranscripts4
are5IMtranscripts(a5IMPscoreof7.41correspondstoa5%FalseDiscoveryRate;Fig2G).5
Thesetof5IMtranscriptsdefinedbyourclassifier(TableS2)includesmanythatdonot6
encodeER-targetedormitochondrialproteins.TheseresultssuggestthatRNA-levelfeatures7
prevalentintheearlycodingregionsof5UI–SSCRandMSCRtranscriptsarealsofoundinother8
transcripttypes(Figs2F-G),andthat5IMtranscriptsrepresentabroadclass.9
10
Functionalcharacterizationof5IMtranscripts11
5IMtranscriptsaredefinedbymRNAsequencefeatures.Hence,wehypothesizedthat5IM12
transcriptsmaybefunctionallyrelatedthroughtheirsharedregulationmediatedbythe13
presenceofthesecommonfeatures.Tothisend,wecollected66large-scaledatasets14
representingdiverseattributescoveringsevenbroadcategories:(1)Curatedfunctional15
annotations--e.g.,GeneOntologyterms,annotationasa‘housekeeping’gene,genessubjectto16
RNAediting;(2)RNAlocalization--e.g.,todendrites,tomitochondria;(3)ProteinandmRNA17
half-life,ribosomeoccupancyandfeaturesthatdecreasestabilityofoneormoremRNA18
isoforms--e.g.,AU-richelements(4)TendencytoexhibitdifferentialmRNAexpression;(5)19
Sequencefeaturesassociatedwithregulatedtranslation--e.g.,codonoptimality,secondary20
structurenearthestartcodon;(6)KnowninteractionswithRNA-bindingproteinsor21
complexessuchasStaufen-1,TDP-43,ortheExonJunctionComplex(EJC)(7)RNA22
modifications–i.e.,N1-methyladenosine(m1A).Whereaswefoundnospecificassociation23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
9
between5IMtranscriptsandfeaturesincategories(1),(2)and(3)otherthanthealready-1
knownenrichmentsforER-andmitochondrial-targetedmRNAs,observationsfromanalysisof2
categories(4),(5),(6),and(7)aredescribedbelow.Weadjustedforthemultiplehypotheses3
testingproblemattwolevels.First,wetookaconservativeapproach(Bonferronicorrection)4
tocorrectforthenumberoftestedfunctionalcharacteristics.Second,someofthefunctional5
categorieswereanalyzedinmoredepthandmultiplesub-hypothesesweretestedwithinthe6
givencategory.Inthisin-depthanalysisafalsediscovery-basedcorrectionwasadopted.For7
example,weinvestigatedwhether5IMtranscriptsexhibitdifferentialRNAexpression.We8
usedBonferroni-correctiontocalculatethep-valueassociatedwiththistest.Then,wefurther9
testedwhether5IMtranscriptsaredifferentiallyexpressedunderthespecificconditions.For10
thisin-depthanalysis,weadoptedafalsediscoveryratebasedcorrection.Below,allreported11
p-valuesremainsignificant(p-adjusted<0.05)afterappropriatemultiplehypothesis12
correction.13
14
5IMtranscriptsexhibitdifferentialRNAexpressionacrossmanyconditions15
Genesetsexhibitingpositivelyornegativelycorrelatedexpressionacrosstissuesand16
conditionsareenrichedforco-regulationandco-functionality(Stuartetal.2003).Sincethe17
powertodetectdifferentialexpressionishigherforabundanttranscripts,wefirsttested18
whether5IMtranscriptsgenerallyhavehigherbaselineexpressionthanothertranscripts.We19
usedbaselineexpressionmeasurementsacross16humantissues(Methods),andfounda20
weakbutsignificantpositivecorrelationbetweenbaselineexpressionand5IMPscores21
(Spearmanrho=0.07;p=7.8e-15).Tocontrolforthepossibleimpactoftheobserved22
differenceinbaselineexpressionondifferentialexpression,wegeneratedarandomsampleof23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
10
transcriptssuchthatthebaselineexpressionleveldistributionofthissetmatchesthe1
expressiondistributionofthe5IMPtranscripts(Methods).2
Combiningdifferentialexpressiondatafrom~6500humanmicroarrayexperiments,3
weobservedthat5IMtranscriptswereoverrepresentedingenesetsexhibitingcondition-4
dependentchangesinmRNAexpressionlevelcomparedtheidentically-sizedbaseline-5
expression-matchedrandomtranscriptset(Fig3A,Methods,p<2.2e-16;Chi-squaredtest).6
Thus,thelevelsof5IMtranscriptshaveahigher-than-randomlikelihoodofchangingacrossa7
widevarietyoftissues,celltypes,andenvironments.8
9
5IMtranscriptshavefeaturessuggestinglowertranslationefficiency10
Translationregulationisamajordeterminantofproteinlevels(VogelandMarcotte2012).To11
investigatepotentialconnectionsbetween5IMtranscriptsandtranslationalregulation,we12
examinedfeaturesassociatedwithtranslation.Featuresfoundtobesignificantwere:13
(I)Secondarystructuresnearthestartcodoncanaffectinitiationratebymodulating14
startcodonrecognition(Parsyanetal.2011).Weobservedapositivecorrelationbetween15
5IMPscoreandthefreeenergyoffolding(-ΔG)ofthe35nucleotidesimmediatelypreceding16
thestartcodon(Fig3B;Spearmanrho=0.39;p<2.2e-16).Thissuggeststhat5IMtranscripts17
haveagreatertendencyforsecondarystructurenearthestartcodon,presumablymakingthe18
startcodonlessaccessible.19
(II)Similarly,secondarystructuresnearthe5’capcanmodulatetranslationby20
hinderingbindingbythe43S-preinitiationcomplextothemRNA(Babendureetal.2006).We21
observedapositivecorrelationbetween5IMPscoreandthefreeenergyoffolding(-ΔG)ofthe22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
11
5’most35nucleotides(Spearmanrho=0.18;p=7.9e-130).Thissuggeststhat5IMtranscripts1
haveagreatertendencyforsecondarystructurenearthe5’cap,presumablyhinderingbinding2
bythe43S-preinitiationcomplex.3
(III)eIF4Eoverexpression.TheheterotrimerictranslationinitiationcomplexeIF4F4
(madeupofeIF4A,eIF4EandeIF4G)isresponsibleforfacilitatingthetranslationof5
transcriptswithstrong5’UTRsecondarystructures(Parsyanetal.2011).TheeIF4Esubunit6
bindstothe7mGpppG‘methyl-G’capandtheATP-dependenthelicaseeIF4A(scaffoldedby7
eIF4G)destabilizes5’UTRsecondarystructure(Marintchevetal.2009).Apreviousstudy8
identifiedtranscriptsthatweremoreactivelytranslatedunderconditionsthatpromotecap-9
dependenttranslation(overexpressionofeIF4E)(Larssonetal.2007).Inagreementwiththe10
observationthat5IMtranscriptshavemoresecondarystructureupstreamofthestartcodon11
andnearthe5’cap,transcriptswithhigh5IMPscoresweremorelikelytobetranslationally,12
butnottranscriptionally,upregulateduponeIF4Eoverexpression(Fig3C;WilcoxonRankSum13
Testp=2.05e-22,andp=0.28,respectively).14
(IV)Non-AUGstartcodons.Transcriptswithnon-AUGstartcodonsalsohave15
intrinsicallylowtranslationinitiationefficiencies(HinnebuschandLorsch2012).These16
mRNAsweregreatlyenrichedamongtranscriptswithhigh5IMPscores(Fisher’sExactTestp17
=0.0003;oddsratio=3.9)andhaveamedian5IMPscorethatis3.57higherthanthosewith18
anAUGstart(Fig3D).19
(V)Codonoptimality.Theefficiencyoftranslationelongationisaffectedbycodon20
optimality(HershbergandPetrov2008).Althoughsomeaspectsofthisremaincontroversial21
(CharneskiandHurst2013;Shahetal.2013;ZinshteynandGilbert2013;Gerashchenkoand22
Gladyshev2015),itisclearthatdecodingofcodonsbytRNAswithdifferentabundancescan23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
12
affectthetranslationrateunderconditionsofcellularstress(reviewedinGingoldandPilpel1
2011).WethereforeexaminedthetRNAadaptationindex(tAI),whichcorrelateswithcopy2
numbersoftRNAgenesmatchingagivencodon(dosReisetal.2004).Specifically,we3
calculatedthemediantAIofthefirst99codingnucleotidesofeachtranscript,andfoundthat4
5IMPscorewasnegativelycorrelatedwithtAI(FigS1A;SpearmanCorrelationrho=-0.23;p<5
2e-16mediantAIand5IMPscore).Thiseffectwasrestrictedtotheearlycodingregionsasthe6
negativecontrolsetofrandomlychosensequencesdownstreamofthe3rdexonfromeach7
transcriptdidnotexhibitarelationshipbetween5IMPscoreandcodonoptimality(FigS1B-C).8
Thus,5IMtranscriptsshowreducedcodonoptimalityinearlycodingregions,suggestingthat9
5IMtranscriptshavedecreasedtranslationelongationefficiency.10
Tomorepreciselydeterminewherethecodonoptimalityphenomenonoccurswithin11
theentireearlycodingregion,wegroupedtranscriptsby5IMPscore.Foreachgroup,we12
calculatedthemeantAIatcodons2-33(i.e.,nts4-99).Acrossthisentireregion,5IM13
transcripts(5IMP>7.41;5%FDR)hadsignificantlylowertAIvaluesateverycodonexcept14
codons24and32(Fig3E;WilcoxonRankSumtestHolm-adjustedp<0.05forall15
comparisons).Toeliminatepotentialconfoundingvariables,includingnucleotidecomposition,16
weperformedseveraladditionalcontrolanalyses(Methods);noneofthesealteredthebasic17
conclusionthat5IMtranscriptshavelowercodonoptimalitythannon-5IMtranscriptsacross18
theentireearlycodingregion.19
(VI)RibosomespermRNA.Finally,weexaminedtherelationshipbetween5IMPscore20
andtranslationefficiency,asmeasuredbythesteady-statenumberofribosomespermRNA21
molecule.Tothisend,weusedalargedatasetofribosomeprofilingandRNA-Seqexperiments22
fromhumanlymphoblastoidcelllines(Ceniketal.2015).Fromthis,wecalculatedtheaverage23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
13
numberofribosomesoneachtranscriptandidentifiedtranscriptswithhighorlowribosome1
occupancy(respectivelydefinedbyoccupancyatleastonestandarddeviationaboveorbelow2
themean;seeMethods).5IMtranscriptswereslightlybutsignificantlydepletedinthehigh3
ribosome-occupancycategory(Fig3F;Fisher’sExactTestp=0.0006,oddsratio=1.3).4
Moreover,5IMPscoresexhibitedaweakbutsignificantnegativecorrelationwiththenumber5
ofribosomespermRNAmolecule(Spearmanrho=-0.11;p=5.98e-23).6
Takentogether,alloftheaboveresultsrevealthat5IMtranscriptshavesequence7
featuresassociatedwithlowertranslationefficiency,atthestagesofbothtranslationinitiation8
andelongation.9
10
Non-ERtrafficked5IMtranscriptsareenrichedinER-proximalribosomeoccupancy11
Wenextinvestigatedtherelationshipbetween5IMPscoreandthelocalizationoftranslation12
withincells.Exploringthesubcellularlocalizationoftranslationatatranscriptome-scale13
remainsasignificantchallenge.Yet,arecentstudydescribedproximity-specificribosome14
profilingtoidentifymRNAsoccupiedbyER-proximalribosomesinbothyeastandhumancells15
(Janetal.2014).Inthismethod,ribosomesarebiotinylatedbasedontheirproximitytoa16
markerproteinsuchasSec61,whichlocalizestotheERmembrane(Janetal.2014).Foreach17
transcript,theenrichmentforbiotinylatedribosomeoccupancyyieldsameasureofER-18
proximityoftranslatedmRNAs.19
Wereanalyzedthisdatasettoexploretherelationbetween5IMPscoresandER-20
proximalribosomeoccupancyinHEK-293cells.Asexpected,transcriptsthatexhibitthe21
highestenrichmentforER-proximalribosomeswereSSCR-transcriptsandtranscriptswith22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
14
otherER-targetingsignals.Yet,wenoticedasurprisingcorrelationbetweenER-proximal1
ribosomeoccupancyand5IMtranscriptswithnoER-targetingevidence(Fig4).This2
relationshipwastrueforbothmitochondrialgenes(Fig4;Spearmanrho=0.43;p<2.2e-16),3
andgeneswithnoevidenceforeitherER-ormitochondrial-targeting(Fig4;Spearmanrho=-4
0.36;p<2.2e-16).Theseresultssuggestthat5IMtranscriptsaremorelikelythannon-5IM5
transcriptstoengagewithER-proximalribosomes.6
7
5IMtranscriptsarestronglyenrichedinnon-canonicalEJCoccupancysites8
Sharedsequencefeaturesandfunctionaltraitsamong5IMtranscriptscausesonetowonder9
whatcommonmechanismsmightlink5IMsequencefeaturesto5IMtraits.Forexample,5IM10
transcriptsmightshareregulationbyoneormoreRNA-bindingproteins(RBPs).To11
investigatethisideafurther,wetestedforenrichmentof5IMtranscriptsamongthe12
experimentallyidentifiedtargetsof23differentRBPs(includingCLIP-Seq,andvariants;see13
Methods).Onlyonedatasetwassubstantiallyenrichedforhigh5IMPscoresamongtargets:a14
transcriptome-widemapofbindingsitesoftheExonJunctionComplex(EJC)inhumancells,15
obtainedviatandem-immunoprecipitationfollowedbydeepsequencing(RIPiT)(Singhetal.16
2014,2012).TheEJCisamulti-proteincomplexthatisstablydepositedupstreamofexon-17
exonjunctionsasaconsequenceofpre-mRNAsplicing(LeHiretal.2000).RIPiTdata18
confirmedthatcanonicalEJCsites(cEJCsites;sitesboundbyEJCcorefactorsandappearing19
~24ntsupstreamofexon-exonjunctions)occupy~80%ofallpossibleexon-exonjunction20
sitesandarenotassociatedwithanysequencemotif.Unexpectedly,manyEJC-associated21
footprintsoutsideofthecanonical-24regionwereobserved(Fig5A)(Singhetal.2012).22
These‘non-canonical’EJCoccupancysites(ncEJCsites)wereassociatedwithmultiple23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
15
sequencemotifs,threeofwhichweresimilartoknownrecognitionmotifsforSRproteinsthat1
co-purifiedwiththeEJCcoresubunits.Interestingly,anothermotif(Fig5B;top)thatwas2
specificallyfoundinfirstexonsisnotknowntobeboundbyanyknownRNA-bindingprotein3
(Singhetal.2012).ThismotifwassimilartotheCG-richmotif(Fig5B;bottom)thatwefound4
tobeassociatedwithin5UI–and5IMtranscripts(Ceniketal.2011).Thissimilaritypresages5
thepossibilityofenrichmentoffirstexonncEJCsitesamong5IMtranscripts.6
PositionanalysisofcalledEJCpeaksrevealedthatwhileonly9%ofcEJCsresideinfirst7
exons,19%ofallncEJCsarefoundthere.Whenweinvestigatedtherelationshipbetween5IMP8
scoresandncEJCsinearlycodingregions,wefoundastrikingcorrespondencebetweenthe9
two–themedian5IMPscorewashighestfortranscriptswiththegreatestnumberofncEJCs10
(Fig5C;WilcoxonRankSumTest;p<0.0001).Whenwerepeatedthisanalysisbyconditioning11
on5UIstatus,wesimilarlyfoundthatncEJCswereenrichedamongtranscriptswithhigh5IMP12
scoresregardlessof5UIstatus(Fisher’sExactTest,p<3.16e-14,oddsratio>2.3;Fig5D).13
TheseresultssuggestthatthestrikingenrichmentofncEJCpeaksinearlycodingregionswas14
generallyapplicabletoalltranscriptswithhigh5IMPscoresregardlessof5UIpresence.15
TranscriptsharboringN1-methyladenosine(m1A)havehigh5IMPscores16
ItisincreasinglyclearthatribonucleotidebasemodificationinmRNAsishighlyprevalentand17
canbeamechanismforpost-transcriptionalregulation(Fryeetal.2016).OneRNA18
modificationpresenttowardsthe5’endsofmRNAtranscriptsisN1-methyladenosine(m1A)19
(Lietal.2016;Dominissinietal.2016),initiallyidentifiedintotalRNAandrRNAs(Dunn1961;20
Hall1963;KlootwijkandPlanta1973).Intriguingly,thepositionofm1Amodificationshas21
beenshowntobemorecorrelatedwiththepositionofthefirstintronthanwith22
transcriptionalortranslationalstartsites(Figure2gfromDominissinietal.2016).Thefirst23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
16
intronwasthenearestsplicesitefor85%ofm1As(Dominissinietal.2016).When5’UTR1
intronswerepresent,m1Awasfoundtobenearthefirstsplicesiteregardlessoftheposition2
ofthestartcodon(Dominissinietal.2016).Giventhat5IMtranscriptsarealsocharacterized3
bythepositionofthefirstintron,weinvestigatedtherelationshipbetween5IMPscoreand4
m1ARNAmodificationmarks.5
Weanalyzedtheunionofpreviouslyidentifiedm1Amodifications(Dominissinietal.6
2016)acrossallcelltypesandconditions.Notethatalthoughthereissomeevidencethatthese7
marksdependoncelltypeandgrowthcondition,itisdifficulttobeconfidentofthecelltype8
andcondition-dependenceofanyparticularmarkgivenexperimentalvariation(seeMethods).9
WefoundthatmRNAswithm1Amodificationearlyinthecodingregion(first99nucleotides)10
hadsubstantiallyhigher5IMPscoresthanmRNAslackingthesemarks(Fig6A;WilcoxonRank11
SumTestp=3.4e-265),andweregreatlyenrichedamong5IMtranscripts(Fig6A;Fisher’s12
ExactTestp=1.6e-177;oddsratio=3.8).Inotherwords,thesequencefeatureswithinthe13
earlycodingregionthatdefine5IMPtranscriptsalsopredictm1Amodificationintheearly14
codingregion.15
Wenextwonderedwhether5IMPscorewaspredictingm1Amodificationgenerally,or16
predictingitspreciselocation.Indeed,manyofthepreviouslyidentifiedm1Apeakswere17
withinthe5’UTRsofmRNAs(Lietal.2016).Interestingly,5IMPscoreswereonlypredictiveof18
m1Amodificationintheearlycodingregion,andnotpredictiveofm1Amodificationinthe19
5’UTR(Fig6B).Thisofferstheintriguingpossibilitythatthesequencefeaturesthatdefine20
5IMPtranscriptsareco-localizedwithm1Amodification.21
22
Discussion:23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
17
Coordinatingtheexpressionoffunctionallyrelatedtranscriptsduringcellularresponse1
tostimuliandenvironmentcanbeachievedbypost-transcriptionalprocessessuchassplicing,2
RNAexport,RNAlocalizationortranslation(MooreandProudfoot2009).SetsofmRNAs3
subjecttoacommonregulatorytranscriptionalprocesscanexhibitcommonsequencefeatures4
thatdefinethemtobeaclass.Forexample,transcriptssubjecttoregulationbyparticular5
miRNAstendtosharecertainsequencesintheir3’UTRsthatarecomplementarytothe6
regulatorymiRNA(AmeresandZamore2013).Similarly,transcriptsthatsharea5’terminal7
oligopyrimidinetractarecoordinatelyregulatedbymTORandribosomalproteinS6kinase8
(Meyuhas2000).Herewequantitativelydefine‘5IM’transcriptsasaclassthatshares9
commonsequenceelementsandfunctionalproperties.The5IMclasscomprises21%ofall10
humantranscripts.11
Whereas35%ofhumantranscriptshaveoneormore5’UTRintrons,themajorityof12
5IMtranscriptshaveneithera5’UTRintronnoranintroninthefirst90ntsoftheORF.Other13
sharedfeaturesof5IMtranscriptsincludesequencefeaturesassociatedwithlowtranslation14
initiationrates.Theseare:(1)atendencyforRNAsecondarystructureintheregion15
immediatelyprecedingthestartcodon(Fig3B),andnearthe5’cap;(2)translational16
upregulationuponoverexpressionofeIF4E(Fig3C);and(3)morefrequentuseofnon-AUG17
startcodons(Fig3D).Alsoconsistentwithlowintrinsictranslationefficiencies,5IM18
transcriptsadditionallytendtodependonlessabundanttRNAstodecodethebeginningofthe19
openreadingframe(Fig3E).20
WehadpreviouslyreportedthattranscriptsencodingproteinswithER-and21
mitochondrial-targetingsignalsequences(SSCRsandMSCRs,respectively)areover-22
representedamongthe65%oftranscriptslacking5’UTRintrons(Ceniketal.2011).23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
18
Transcriptsinthissetareenrichedforthesequencefeaturesdetectedbyour5IMclassifier.By1
examiningtheseenrichedsequencefeatures,weshowedthatthe5IMclassextendsbeyond2
mRNAsencodingmembraneproteins.Janetal.recentlydevelopedatranscriptome-scale3
methodtoidentifymRNAsoccupiedbyER-proximalribosomesinbothyeastandhumancells4
(Janetal.2014).Asexpected,transcriptsknowntoencodeER-traffickedproteinswerehighly5
enrichedforER-proximalribosomeoccupancy.However,theirdataalsoshowedmany6
transcriptsencodingnon-ERtraffickedproteinstoalsobeengagedwithER-proximal7
ribosomes(ReidandNicchitta2015b).Similarly,severalotherstudieshavesuggestedacritical8
roleofER-proximalribosomesintranslatingseveralcytoplasmicproteins(ReidandNicchitta9
2015a).Here,wefoundthat5IMtranscriptsincludingthosethatarenotER-traffickedor10
mitochondrialweresignificantlymorelikelytoexhibitbindingtoER-proximalribosomes11
(Figs4A-B).12
InadditiontoribosomesdirectlyresidentontheER,aninterestingpossibilityisthe13
presenceofapoolofperi-ERribosomes(Janetal.2015;ReidandNicchitta2015a).14
Associationof5IMtranscriptswithsuchaperi-ERribosomepoolcouldpotentiallyexplainthe15
observedcorrelationof5IMstatuswithbindingtoER-proximalribosomes.TheERis16
physicallyproximaltomitochondria(RowlandandVoeltz2012),soperi-ERribosomesmay17
includethosetranslatingmRNAsonmitochondria(i.e.,mRNAswithMSCRs)(Sylvestreetal.18
2003).However,evenwhentranscriptscorrespondingtoER-traffickedandmitochondrial19
proteinswereexcludedfromconsideration,ER-proximalribosomeenrichmentand5IMP20
scoreswerehighlycorrelated(Fig4A).Thusanothersharedfeatureof5IMtranscriptsistheir21
translationonorneartheERregardlessoftheultimatedestinationoftheencodedprotein.22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
19
Inanattempttoidentifyacommonfactorbinding5IMtranscripts,weaskedwhether1
5IMtranscriptswereenrichedamongtheexperimentallyidentifiedtargetsof23different2
RBPs.OnlyoneRBPemerged--theexonjunctioncomplex(EJC).Specifically,weobserveda3
dramaticenrichmentofnon-canonicalEJC(ncEJC)bindingsiteswithintheearlycodingregion4
of5IMtranscripts.Further,theCG-richmotifidentifiedforncEJCsinfirstexonsisstrikingly5
similartotheCG-richmotifenrichedinthefirstexonsof5IMtranscripts(Fig5B).Previous6
workimplicatedRanBP2,aproteinassociatedwiththecytoplasmicfaceofthenuclearpore,as7
abindingfactorforsomeSSCRs(Mahadevanetal.2013).ThisfindingmaysuggestthatEJC8
occupancyonthesetranscriptsmaybeinfluencedbynuclearporeproteinsduringnuclear9
export.10
EJCdepositionduringtheprocessofpre-mRNAsplicingenablesthenuclearhistoryof11
anmRNAtoinfluencepost-transcriptionalprocessesincludingmRNAlocalization,translation12
efficiency,andnonsensemediateddecay(Changetal.2007;KervestinandJacobson2012;13
Choeetal.2014).WhilecanonicalEJCbindingoccursatafixeddistanceupstreamofexon-exon14
junctionsandinvolvesdirectcontactbetweenthesugar-phosphatebackboneandtheEJCcore15
anchoringproteineIF4AIII,ncEJCbindingsiteslikelyreflectstableengagementbetweenthe16
EJCcoreandothermRNPproteins(e.g.,SRproteins)recognizingnearbysequencemotifs.17
AlthoughsomeRBPswereidentifiedforncEJCmotifsfoundininternalexons(Singhetal.18
2014,2012),todatenocandidateRBPhasbeenidentifiedfortheCG-richncEJCmotiffoundin19
thefirstexon.IfthismotifdoesresultfromanRBPinteraction,itislikelytobeoneormoreof20
theca.70proteinsthatstablyandspecificallybindtotheEJCcore(Singhetal.2012).21
Finally,weobservedadramaticenrichmentform1Amodificationsamong5IM22
transcripts.Giventhisstrikingenrichmentperhapsitisperhapsnotsurprisingthatm1A23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
20
containingmRNAswerealsoshowntohavemorestructured5’UTRsthatareGC-rich1
comparedtom1AlackingmRNAs(Dominissinietal.2016).Similarto5IMtranscripts,m1A2
containingmRNAswerefoundtodecoratestartcodonsthatappearinahighlystructured3
context.WhileALKBH3hasbeenidentifiedasaproteinthatcandemethylatem1A,itis4
currentlyunknownwhetherthereareanyproteinsthatcanspecificallyactas“readers”of5
m1A.RecentstudieshavebeguntoidentifysuchreadersforothermRNAmodificationssuchas6
YTHDF1,YTHDF2,WTAPandHNRNPA2B1(Pingetal.2014;Liuetal.2014;Wangetal.2014,7
2015;Alarcónetal.2015).Ourstudyhighlightsapossiblelinkbetweennon-canonicalEJC8
bindingandm1A.Hence,ourresultsyieldtheintriguinghypothesisthatoneormoreoftheca.9
70proteinsthatstablyandspecificallybindtotheEJCcorecanfunctionasanm1Areader.10
Anotherintriguingpossibilityisthat5IMtranscriptfeaturesassociatedwithlower11
intrinsictranslationefficiencymaytogetherenablegreater‘tunability’of5IMtranscriptsatthe12
translationstage.Regulatedenhancementorrepressionoftranslation,for5IMtranscripts,13
couldallowforrapidchangesinproteinlevels.Thereareanalogiestothisscenarioin14
transcriptionalregulation,whereinhighlyregulatedgenesoftenhavepromoterswithlow15
baselinelevelsthatcanberapidlymodulatedthroughtheactionofregulatorytranscription16
factors.Asmoreribosomeprofilingstudiesarepublishedexaminingtranslationalresponses17
transcriptome-wideundermultipleperturbations,conditionsunderwhich5IMtranscriptsare18
translationallyregulatedmayberevealed.19
Takentogether,ouranalysesrevealtheexistenceofadistinct‘5IM’classcomprising20
21%ofhumantranscripts.Thisclassisdefinedbydepletionof5’proximalintrons,presence21
ofspecificRNAsequencefeaturesassociatedwithlowtranslationefficiency,non-canonical22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
21
bindingbytheExonJunctionComplexandanenrichmentforN1-methyladenosine1
modification.2
3
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
22
MaterialsandMethods:1
DatasetsandAnnotations2
HumantranscriptsequencesweredownloadedfromtheNCBIhumanReferenceGene3
Collection(RefSeq)viatheUCSCtablebrowser(hg19)onJun252010(Kentetal.2002;Pruitt4
etal.2005).Transcriptswithfewerthanthreecodingexons,andtranscriptswherethefirst995
codingnucleotidesstraddlemorethantwoexonswereremovedfromfurtherconsideration.6
Thecriteriaforexclusionofgeneswithfewerthanthreecodingexonswastoensurethatthe7
analysisofdownstreamregionswaspossibleforallgenesthatwereusedinouranalysisof8
earlycodingregions.Specifically,thedownstreamregionswereselectedrandomlyfrom9
downstreamofthe3rdexons.Hence,geneswithfewerexonswouldnotbeabletocontributea10
downstreamregionpotentiallycreatingaskewinrepresentation.Transcriptswereclustered11
basedonsequencesimilarityinthefirst99codingnucleotides.Specifically,eachtranscript12
pairwasalignedusingBLASTwiththeDUSTfilterdisabled(Altschuletal.1990).Transcript13
pairswithBLASTE-values<1e-25weregroupedintotranscriptclustersthatwere14
subsequentlyassignedtooneoffourcategories:MSCR-containing,SSCR-containing,S–/MSCR–15
SignalP+,orS–/MSCR–SignalP–asfollows:16
MSCR-containingtranscriptswereannotatedusingMitoCartaandothersourcesas17
describedin(Ceniketal.2011).SSCR-containingtranscriptswerethesetoftranscripts18
annotatedtocontainsignalpeptidesintheEnsemblGenev.58annotations,whichwere19
downloadedthroughBiomartonJun252010.FortranscriptswithoutanannotatedMSCRor20
SSCR,thefirst70aminoacidswereanalyzedusingSignalP3.0(Bendtsenetal.2004).Using21
theeukaryoticpredictionmode,transcriptswereassignedtotheS–/MSCR–SignalP+category22
ifeithertheHiddenMarkovModelortheArtificialNeuralNetworkclassifiedthesequenceas23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
23
signalpeptidecontaining.AllremainingtranscriptclusterswereassignedtotheS–/MSCR–1
SignalP–category.2
Foreachtranscriptcluster,wealsoconstructedmatchedcontrolsequences.Control3
sequenceswerederivedfromasinglerandomlychosenin-frame‘window’downstreamofthe4
3rdexonfromtheevaluatedtranscripts.Ifanevaluatedtranscripthadfewerthan995
nucleotidesdownstreamofthe3rdexon,nocontrolsequencewasextracted.5UIlabelsand6
transcriptclusteringforthecontrolsequenceswereinheritedfromtheevaluatedtranscript.7
Therationaleforthisdecisionisthatouranalysisdependsonthepositionofthefirstintron;8
hencegeneswithfewerthantwoexonsneedtobeexcluded,asthesewillnothaveintrons.We9
furtherrequiredthematchedcontrolsequencestofalldownstreamoftheearlycodingregion.10
Inthevastmajorityofcasesthethirdexonfelloutsidethefirst99nucleotidesofthecoding11
region,makingthisaconvenientcriterionbywhichtochoosecontrolregions.12
SequenceFeaturesandMotifDiscovery13
36sequencefeatureswereextractedfromeachtranscript(TableS1).Thesequence14
featuresincludedtheratioofargininestolysines,theratioofleucinestoisoleucines,adenine15
content,lengthofthelongeststretchwithoutadenines,preferenceagainstcodonsthatcontain16
adeninesorthymines.ThesefeatureswerepreviouslyfoundtobeenrichedinSSCR-17
containingandcertain5UI–transcripts(Ceniketal.2011;Palazzoetal.2007).Inaddition,we18
extractedratiosbetweenseveralotheraminoacidpairsbasedonhaving19
biochemical/evolutionarysimilarity,i.e.havingpositivescores,accordingtotheBLOSUM6220
matrix(HenikoffandHenikoff1992).Toavoidextremeratiosgiventherelativelyshort21
sequencelength,pseudo-countswereaddedtoaminoacidratiosusingtheirrespective22
genome-wideprevalence.23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
24
Inaddition,weusedthreepublishedmotiffindingalgorithms(AlignACE,DEME,and1
MoAN(Rothetal.1998;RedheadandBailey2007;Valenetal.2009))todiscoverRNA2
sequencemotifsenrichedamong5UI–transcripts.AlignACEimplementsaGibbssampling3
approachandisoneofthepioneeringeffortsinmotifdiscovery(Rothetal.1998).We4
modifiedtheAlignACEsourcecodetorestrictmotifsearchestoonlytheforwardstrandofthe5
inputsequencestoenableRNAmotifdiscovery.DEMEandMoANadoptdiscriminative6
approachestomotiffindingbysearchingformotifsthataredifferentiallyenrichedbetween7
twosetsofsequences(RedheadandBailey2007;Valenetal.2009).MoANhastheadditional8
advantageofdiscoveringvariablelengthmotifs,andcanidentifyco-occurringmotifswiththe9
highestdiscriminativepower.10
Intotal,sixmotifswerediscoveredusingthethreemotiffindingalgorithms(TableS1).11
Positionspecificscoringmatricesforallmotifswereusedtoscorethefirst99-lpositionsin12
eachsequence,wherelisthelengthofthemotif.Weassessedthesignificanceofeachmotif13
instancebycalculatingthep-valueofenrichment(Fisher’sExactTest)among5UI–transcripts14
consideringalltranscriptswithamotifinstanceachievingaPSSMscoregreaterthanequalto15
theinstanceunderconsideration.Thesignificancescoreandpositionofthetwobestmotif16
instanceswereusedasfeaturesfortheclassifier(TableS1).17
5IMClassifierTrainingandPerformanceEvaluation18
WemodifiedanimplementationoftheRandomForestclassifier(Breiman2001)to19
modeltherelationshipbetweensequencefeaturesintheearlycodingregionandtheabsence20
of5’UTRintrons(5UIs).Thisclassifierdiscriminatestranscriptswith5’proximal-intron-21
minus-like-codingregionsandhenceisnamedthe‘5IM’classifier.Thetrainingsetforthe22
classifierwascomposedofSSCRtranscriptsexclusively.Thereweretworeasonstorestrict23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
25
modelconstructiontoSSCRtranscripts:1)weexpectedthepresenceofspecificRNAelements1
asafunctionof5UIpresencebasedonourpreviouswork(Ceniketal.2011);and2)we2
wantedtorestrictmodelbuildingtosequencesthathavesimilarfunctionalconstraintsatthe3
proteinlevel.4
Weobservedthat5UI–transcriptswithintronsproximalto5’endofthecodingregion5
havesequencecharacteristicssimilarto5UI+transcripts(Fig1D).Tosystematically6
characterizethisrelationship,webuiltdifferentclassifiersusingtrainingsetsthatexcluded7
5UI–transcriptswithacodingregionintronpositionedatincreasingdistancesfromthestart8
codon.Weevaluatedtheperformanceofeachclassifierusing10-foldcrossvalidation.9
Giventhatalargenumberofmotifdiscoveryiterationswereneeded,wesoughtto10
reducethecomputationalburden.Weisolatedasubsetofthetrainingexamplestobeused11
exclusivelyformotiffinding.Motifdiscoverywasperformedonceusingthissetofsequences,12
andthesamemotifswereusedineachfoldofthecrossvalidationforalltheclassifiers.13
Imbalancesbetweenthesizesofpositiveandnegativetrainingexamplescanleadto14
detrimentalclassificationperformance(WangandYao2012).Hence,webalancedthetraining15
setsizeof5UI–and5UI+transcriptsbyrandomlysamplingfromthelargerclass.We16
constructed10sub-classifierstoreducesamplingbias,andforeachtestexample,the17
predictionscorefromeachsubclassifierwassummedtoproduceacombinedscorebetween018
and10.Fortherestoftheanalyses,weusedtheclassifiertrainedusing5UI–transcriptswhere19
firstcodingintronfallsoutsidethefirst90codingnucleotides(Fig1E).20
Weevaluatedclassifierperformanceusinga10-foldcrossvalidationstrategyforSSCR-21
containingtranscripts(i.e.thetrainingset).Ineachfoldofthecross-validation,themodelwas22
trainedwithoutanyinformationfromtheheld-outexamples,includingmotifdiscovery.Forall23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
26
theothertranscriptsandthecontrolsets(seeabove),the5IMPscoreswerecalculatedusing1
theclassifiertrainedusingSSCRtranscriptsasdescribedabove.5IMPscoredistributionfor2
thecontrolsetwasusedtocalculatetheempiricalcumulativenulldistribution.Usingthis3
function,wedeterminedthep-valuecorrespondingtothe5IMPscoreforalltranscripts.We4
correctedformultiplehypothesestestingusingtheqvalueRpackage(Storey2003).Basedon5
thisanalysis,weestimatethata5IMPscoreof7.41correspondstoa5%FalseDiscoveryRate6
andsuggestthat21%ofallhumantranscriptscanbeconsideredas5IMtranscripts.7
Whilethetheoreticalrangeof5IMPscoresis0-10,thehighestobserved5IMPis9.855.8
Wenotethatforallfiguresthatdepict5IMPscoredistributions,wedisplayedtheentire9
theoreticalrangeof5IMPscores(0-10).10
FunctionalCharacterizationof5IMTranscripts:11
Wecollectedgenome-scaledatasetfrompublicallyavailabledatabasesandfrom12
supplementaryinformationprovidedinselectedarticles.Forallanalyzeddatasets,wefirst13
convertedallgene/transcriptidentifiers(IDs)intoRefSeqtranscriptIDsusingtheSynergizer14
webserver(BerrizandRoth2008).Ifadatasetwasgeneratedusinganon-humanspecies(ex.15
TargetsidentifiedbyTDP-43RNAimmunoprecipitationinratneuronalcells),weused16
Homologenerelease64(downloadedonSep282009)toidentifythecorrespondingortholog17
inhumans.Ifatleastonememberofatranscriptclusterwasassociatedwithafunctional18
phenotype,weassignedtheclustertothepositivesetwithrespecttothefunctionalphenotype.19
Ifmorethanonememberofaclusterhadthefunctionalphenotype,weonlyretainedonecopy20
oftheclusterunlesstheydifferedinaquantitativemeasurement.Forexample,considertwo21
hypotheticaltranscripts:NM_1andNM_2thatwereclusteredtogetherandhavea5IMPscore22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
27
of8.5.IfNM_1hadanmRNAhalf-lifeof2hrwhileNM_2’shalf-lifewas1hrthanwesplitthe1
clusterwhilepreservingthe5IMPscoreforbothNM_1andNM_2.2
Oncethetranscriptswerepartitionedbasedonthefunctionalphenotype,werantwo3
statisticaltests:1)Fisher’sExactTestforenrichmentof5IMtranscriptswithinthefunctional4
category;2)WilcoxonRankSumTesttocompare5IMPscoresbetweentranscriptspartitioned5
bythefunctionalphenotype.Additionally,fordatasetswhereaquantitativemeasurementwas6
available(ex.mRNAhalf-life),wecalculatedtheSpearmanrankcorrelationbetween5IMP7
scoresandthequantitativevariable.Intheseanalyses,weassumedthatthetestspacewasthe8
entiresetofRefSeqtranscripts.Forallphenotypeswhereweobservedapreliminary9
statisticallysignificantresult,wefollowedupwithmoredetailedanalysesdescribedbelow.10
RNACo-expressionAnalysis:11
Transcriptswithincreasedordecreasedexpressionwereextractedfromallmicroarray12
datasetsdepositedinArrayExpressonOct2010(Parkinsonetal.2007).Ofthe~650013
microarrayexperiments,atleast10differentiallyexpressedtranscriptswereidentifiedin14
~5100experiments.Fisher’sexacttestwasusedtocalculatetheenrichmentof5IMoran15
expression-matchedcontrolsetoftranscriptsamongdifferentiallyexpressedtranscripts.16
BaselineexpressionwasdeterminedasthemeanRNAexpressionacross16humantissuesas17
measuredbytheIlluminaBodymapRNA-Seqdatafor16humantissues(E-MTAB-513,18
downloadedonApr2015).FPKMbasedRNAexpressionquantificationwasused19
(http://www.ebi.ac.uk/gxa/experiments/E-MTAB-513/analysis-methods).Toensurethat20
baselineexpressionofthecontrolsetmatchesthatof5IMtranscripts,wefirstdiscretized21
meanlog10FPKMacrosstissuesbyroundingtoonedecimalplace.ForeachdiscretizedFPKM22
value,wesampledwithoutreplacementthesamenumberoftranscriptsas5IMtranscripts23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
28
withthegivenexpressionlevel.Notethattoensurethatthegeneexpressiondistributionis1
matchedexactlyweallowedtherandomsettoincludetranscriptsthatare5IM.Hence,the2
approachleadstoaconservativeestimateofenrichment.3
AnalysisofFeaturesAssociatedwithTranslation:4
Foreachtranscript,wepredictedthepropensityforsecondarystructureprecedingthe5
translationstartsiteandthe5’cap.Specifically,weextracted35nucleotidesprecedingthe6
translationstartsiteorthefirst35nucleotidesofthe5’UTR.Ifa5’UTRisshorterthan357
nucleotides,thetranscriptwasremovedfromtheanalysis.hybrid-ss-minutility(UNAFold8
packageversion3.8)withdefaultparameterswasusedtocalculatetheminimumfolding9
energy(MarkhamandZuker2008).10
CodonoptimalitywasmeasuredusingthetRNAAdaptationIndex(tAI),whichisbased11
onthegenomiccopynumberofeachtRNA(dosReisetal.2004).tAIforallhumancodons12
weredownloadedfromTulleretal.2010TableS1.tAIprofilesforthefirst30aminoacids13
werecalculatedforalltranscripts.Codonoptimalityprofilesweresummarizedforthefirst3014
aminoacidsforeachtranscriptorbyaveragingtAIateachcodon.15
Wecarriedouttwocontrolexperimentstotestwhethertheassociationbetween5IMP16
scoreandtAIcouldbeexplainedbyconfoundingvariables.First,wepermutedthenucleotides17
inthefirst90ntsandobservednorelationshipbetween5IMPscoreandmeantAIwhenthese18
permutedsequenceswereused(FigS1).Second,weselectedrandomin-frame99nucleotides19
from3rdexontotheendofthecodingregionandfoundnosignificantdifferencesintAI(Fig20
S1).TheseresultssuggestthattherelationshipbetweentAIand5IMPscoreisconfinedtothe21
first30aminoacidsandisnotexplainedbysimpledifferencesinnucleotidecomposition.22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
29
RibosomeprofilingandRNAexpressiondataforhumanlymphoblastoidcells(LCLs)1
weredownloadedfromNCBIGEOdatabaseaccessionnumber:GSE65912.Translation2
efficiencywascalculatedaspreviouslydescribed(Ceniketal.2015).Mediantranslation3
efficiencyacrossthedifferentcell-typeswasusedforeachtranscript. 4
AnalysisofProximitySpecificRibosomeProfilingData:5
WedownloadedproximityspecificribosomeprofilingdataforHEK293cellsfromJan6
etal.2014;TableS6.WeconvertedUCSCgeneidentifierstoHGNCsymbolsusingg:Profiler7
(Reimandetal.2011).WeretainedallgeneswithanRPKM>5ineitherinputorpulldownand8
requiredthatatleast30readsweremappedineitherofthetwolibraries.WeusedER-9
targetingevidencecategories“secretome”,“phobius”,“TMHMM”,“SignalP”,“signalSequence”,10
“signalAnchor”from(Janetal.2014)toannotategenesashavingER-targetingevidence.The11
genesthatdidnothaveanyER-targetingevidenceor“mitoCarta”/“mito.GO”annotationswere12
deemedasthesetofgeneswithnoER-targetingormitochondrialevidence.Wecalculatedthe13
log2oftheratiobetweenER-proximalribosomepulldownRPKMandcontrolRPKMasthe14
measureofenrichmentforER-proximalribosomeoccupancy(asinJanetal.2014).Amoving15
averageofthisratiowascalculatedforgenesgroupedbytheir5IMPscore.Forthiscalculation,16
weusedbinsof30mitochondrialgenesor100geneswithnoevidenceofER-ormitochondrial17
targeting.18
AnalysisofGenome-wideBindingSitesofExonJunctionComplex:19
Dr.GeneYeoandGabrielPrattgenerouslyshareduniformlyprocessedpeakcallsfor20
experimentsidentifyinghumanRNAbindingproteintargets.Thesedatasetsincludevarious21
CLIP-SeqdatasetsanditsvariantssuchasiCLIP.Atotal49datasetsfrom22factorswere22
analyzed.Thesefactorswere:hnRNPA1,hnRNPF,hnRNPM,hnRNPU,Ago2,hnRNPU,HuR,23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
30
IGF2BP1,IGF2BP2,IGF2BP3,FMR1,eIF4AIII,PTB,IGF2BP1,Ago3,Ago4,MOV10,Fip1,CF1
Im68,CFIm59,CFIm25,andhnRNPA2B1.Weextractedthe5IMPscoresforalltargetsofeach2
RBP.WecalculatedtheWilcoxonRankSumteststatisticcomparingthe5IMPscore3
distributionofthetargetsofeachRBPtoallothertranscriptswith5IMPscores.Noneofthe4
testedRBPtargetsetshadanadjustedp-value<0.05andamediandifferencein5IMPscore>5
1whencomparedtonon-targettranscripts.6
Inaddition,weusedRNA:proteinimmunoprecipitationintandem(RIPiT)datato7
determineExonJunctionComplex(EJC)bindingsites(Singhetal.2014,2012).Weanalyzed8
thecommonpeaksfromtheY14-Magohimmunoprecipitationconfiguration(Singhetal.2012;9
Kucukuraletal.2013).CanonicalEJCbindingsitesweredefinedaspeakswhoseweighted10
centeris15to32nucleotidesupstreamofanexon-intronboundary.Allremainingpeakswere11
deemedas“non-canonical”EJCbindingsites.Weextractedallnon-canonicalpeaksthat12
overlappedthefirst99nucleotidesofthecodingregionandrestrictedouranalysisto13
transcriptsthathadanRPKMgreaterthanoneinthematchedRNA-Seqdata.14
Analysisofm1Amodifiedtranscripts15
WedownloadedthelistofRNAsobservedtocontainm1AfromLietal.(2016)and16
Dominissinietal.(2016).RefSeqtranscriptidentifierswereconvertedtoHGNCsymbolsusing17
g:Profiler(Reimandetal.2011).Theoverlapbetweenthetwodatasetswasdeterminedusing18
HGNCsymbols.m1Amodificationsthatoverlapthefirst99nucleotidesofthecodingregion19
weredeterminedusingbedtools(QuinlanandHall2010).20
Lietal.(2016)identified600transcriptswithm1AmodificationinnormalHEK29321
cells.Ofthese,368transcriptswerenotfoundtocontainthem1AmodificationinHEK293cells22
byDominissinietal.(2016).Yet,81%ofthesewerefoundtobem1Amodifiedinothercell23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
31
types.Lietal.(2016)alsoanalyzedm1AuponH2O2treatmentandserumstarvationinHEK2931
cellsandidentifiedmanym1Amodificationsthatareonlyfoundthesestressconditions.2
However,20%of371transcriptsharboringstress-inducedm1Amodificationswerefoundin3
normalHEK293cellsbyDominissinietal.(2016).Takentogether,theseanalysessuggestthat4
transcriptome-widem1Amapsremainincomplete.Hence,weanalyzedthe5IMPscoresofall5
mRNAswithm1Aacrosscelltypesandconditions.WereportedresultsusingtheDominissini6
etal.datasetbutthesameconclusionsweresupportedbym1AmodificationsfromLietal.7
8
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
32
Acknowledgements:1
WewouldliketothankGeneYeo,andGabrielPrattforsharingpeakcallsforRNA-2
bindingproteintargetidentificationexperiments(CLIP,PAR-CLIP,iCLIP).WethankElif3
SarinayCenik,BaşarCenik,andAlperKüçükuralforhelpfuldiscussions.F.P.R.andH.N.C.were4
supportedbytheCanadaExcellenceResearchChairsProgram.Thisworkwassupportedin5
partbyNIHgrantsGM50037(M.J.M.),HG001715andHG004233(F.P.R.),theKrembil6
Foundation,anOntarioResearchFundaward,andtheAvonFoundation(F.P.R.).A.F.P.was7
supportedbyagrantfromtheCanadianInstitutesofHealthResearchtoA.F.P.(FRN102725).8
Authorcontributions:CC,MJM,andFPRdesignedthestudy.CCandHNCimplementedthe9
randomforestclassifierandcarriedoutallanalyses.GSandAAperformedinitialexperiments.10
MPSandAFPprovidedcriticalfeedback.MJMandFPRjointlysupervisedtheproject.CC,FPR11
andMJMwrotethemanuscriptwithinputfromallauthors.12
13
14
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
33
References:1
AlarcónCR,GoodarziH,LeeH,LiuX,TavazoieS,TavazoieSF.2015.HNRNPA2B1IsaMediatorof2
m(6)A-DependentNuclearRNAProcessingEvents.Cell162:1299–1308.3
AltschulSF,GishW,MillerW,MyersEW,LipmanDJ.1990.Basiclocalalignmentsearchtool.JMolBiol4
215:403–410.5
AmeresSL,ZamorePD.2013.DiversifyingmicroRNAsequenceandfunction.NatRevMolCellBiol14:6
475–488.7
BabendureJR,BabendureJL,DingJ-H,TsienRY.2006.ControlofmammaliantranslationbymRNA8
structurenearcaps.RNA12:851–861.9
BendtsenJD,NielsenH,vonHeijneG,BrunakS.2004.Improvedpredictionofsignalpeptides:SignalP10
3.0.JMolBiol340:783–795.11
BerrizGF,RothFP.2008.TheSynergizerservicefortranslatinggene,proteinandotherbiological12
identifiers.Bioinformatics24:2272–2273.13
BreimanL.2001.RandomForests.MachLearn45:5–32.14
CenikC,ChuaHN,ZhangH,TarnawskySP,AkefA,DertiA,TasanM,MooreMJ,PalazzoAF,RothFP.15
2011.Genomeanalysisrevealsinterplaybetween5’UTRintronsandnuclearmRNAexportfor16
secretoryandmitochondrialgenes.PLoSGenet7:e1001366.17
CenikC,DertiA,MellorJC,BerrizGF,RothFP.2010.Genome-widefunctionalanalysisofhuman5’18
untranslatedregionintrons.GenomeBiol11:R29.19
CenikC,SarinayCenikE,ByeonGW,GrubertF,CandilleSI,SpacekD,AlsallakhB,TilgnerH,ArayaCL,20
TangH,etal.2015.IntegrativeanalysisofRNA,translationandproteinlevelsrevealsdistinct21
regulatoryvariationacrosshumans.GenomeRes.http://dx.doi.org/10.1101/gr.193342.115.22
ChangY-F,ImamJS,WilkinsonMF.2007.Thenonsense-mediateddecayRNAsurveillancepathway.23
AnnuRevBiochem76:51–74.24
CharneskiCA,HurstLD.2013.Positivelychargedresiduesarethemajordeterminantsofribosomal25
velocity.PLoSBiol11:e1001508.26
ChoeJ,RyuI,ParkOH,ParkJ,ChoH,YooJS,ChiSW,KimMK,SongHK,KimYK.2014.eIF4AIIIenhances27
translationofnuclearcap-bindingcomplex-boundmRNAsbypromotingdisruptionofsecondary28
structuresin5’UTR.ProcNatlAcadSciUSA111:E4577–86.29
DominissiniD,NachtergaeleS,Moshitch-MoshkovitzS,PeerE,KolN,Ben-HaimMS,DaiQ,DiSegniA,30
Salmon-DivonM,ClarkWC,etal.2016.ThedynamicN(1)-methyladenosinemethylomein31
eukaryoticmessengerRNA.Nature530:441–446.32
dosReisM,SavvaR,WernischL.2004.Solvingtheriddleofcodonusagepreferences:atestfor33
translationalselection.NucleicAcidsRes32:5036–5044.34
DunnDB.1961.Theoccurrenceof1-methyladenineinribonucleicacid.BiochimBiophysActa46:198–35
200.36
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
34
FryeM,JaffreySR,PanT,RechaviG,SuzukiT.2016.RNAmodifications:whathavewelearnedand1
whereareweheaded?NatRevGenet17:365–372.2
GerashchenkoMV,GladyshevVN.2015.Translationinhibitorscauseabnormalitiesinribosome3
profilingexperiments.NucleicAcidsRes42:e134.4
GingoldH,PilpelY.2011.Determinantsoftranslationefficiencyandaccuracy.MolSystBiol7:481.5
HallRH.1963.Methodforisolationof2′-O-methylribonucleosidesandN1-methyladenosinefrom6
ribonucleicacid.BiochimicaetBiophysicaActa(BBA)-SpecializedSectiononNucleicAcidsand7
RelatedSubjects68:278–283.8
HenikoffS,HenikoffJG.1992.Aminoacidsubstitutionmatricesfromproteinblocks.ProcNatlAcadSci9
USA89:10915–10919.10
HershbergR,PetrovDA.2008.Selectiononcodonbias.AnnuRevGenet42:287–299.11
HinnebuschAG,LorschJR.2012.Themechanismofeukaryotictranslationinitiation:newinsightsand12
challenges.ColdSpringHarbPerspectBiol4.http://dx.doi.org/10.1101/cshperspect.a011544.13
HongX,ScofieldDG,LynchM.2006.Intronsize,abundance,anddistributionwithinuntranslated14
regionsofgenes.MolBiolEvol23:2392–2404.15
JanCH,WilliamsCC,WeissmanJS.2015.LOCALTRANSLATION.ResponsetoCommenton“Principlesof16
ERcotranslationaltranslocationrevealedbyproximity-specificribosomeprofiling.”Science348:17
1217.18
JanCH,WilliamsCC,WeissmanJS.2014.PrinciplesofERcotranslationaltranslocationrevealedby19
proximity-specificribosomeprofiling.Science346:1257521.20
KentWJ,SugnetCW,FureyTS,RoskinKM,PringleTH,ZahlerAM,HausslerD.2002.Thehuman21
genomebrowseratUCSC.GenomeRes12:996–1006.22
KervestinS,JacobsonA.2012.NMD:amultifacetedresponsetoprematuretranslationaltermination.23
NatRevMolCellBiol13:700–712.24
KlootwijkJ,PlantaRJ.1973.AnalysisofthemethylationsitesinyeastribosomalRNA.EurJBiochem39:25
325–333.26
KucukuralA,ÖzadamH,SinghG,MooreMJ,CenikC.2013.ASPeak:anabundancesensitivepeak27
detectionalgorithmforRIP-Seq.Bioinformatics29:2485–2486.28
LarssonO,LiS,IssaenkoOA,AvdulovS,PetersonM,SmithK,BittermanPB,PolunovskyVA.2007.29
EukaryoticTranslationInitiationFactor4E–InducedProgressionofPrimaryHumanMammary30
EpithelialCellsalongtheCancerPathwayIsAssociatedwithTargetedTranslationalDeregulation31
ofOncogenicDriversandInhibitors.CancerRes67:6814–6824.32
LeeES,AkefA,MahadevanK,PalazzoAF.2015.Theconsensus5’splicesitemotifinhibitsmRNA33
nuclearexport.PLoSOne10:e0122743.34
LeHirH,IzaurraldeE,MaquatLE,MooreMJ.2000.Thespliceosomedepositsmultipleproteins20-2435
nucleotidesupstreamofmRNAexon-exonjunctions.EMBOJ19:6860–6869.36
LiuJ,YueY,HanD,WangX,FuY,ZhangL,JiaG,YuM,LuZ,DengX,etal.2014.AMETTL3-METTL1437
complexmediatesmammaliannuclearRNAN6-adenosinemethylation.NatChemBiol10:93–95.38
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
35
LiX,XiongX,WangK,WangL,ShuX,MaS,YiC.2016.Transcriptome-widemappingrevealsreversible1
anddynamicN1-methyladenosinemethylome.NatChemBiol.2
http://dx.doi.org/10.1038/nchembio.2040(AccessedMarch29,2016).3
MahadevanK,ZhangH,AkefA,CuiXA,GueroussovS,CenikC,RothFP,PalazzoAF.2013.4
RanBP2/Nup358potentiatesthetranslationofasubsetofmRNAsencodingsecretoryproteins.5
PLoSBiol11:e1001545.6
MarintchevA,EdmondsKA,MarintchevaB,HendricksonE,ObererM,SuzukiC,HerdyB,SonenbergN,7
WagnerG.2009.TopologyandregulationofthehumaneIF4A/4G/4Hhelicasecomplexin8
translationinitiation.Cell136:447–460.9
MarkhamNR,ZukerM.2008.UNAFold:softwarefornucleicacidfoldingandhybridization.Methods10
MolBiol453:3–31.11
MeyuhasO.2000.Synthesisofthetranslationalapparatusisregulatedatthetranslationallevel.EurJ12
Biochem267:6321–6330.13
MooreMJ,ProudfootNJ.2009.Pre-mRNAprocessingreachesbacktotranscriptionandaheadto14
translation.Cell136:688–700.15
PalazzoAF,MahadevanK,TarnawskySP.2013.ALREX-elementsandintrons:twoidentityelements16
thatpromotemRNAnuclearexport.WileyInterdiscipRevRNA4:523–533.17
PalazzoAF,SpringerM,ShibataY,LeeC-S,DiasAP,RapoportTA.2007.Thesignalsequencecoding18
regionpromotesnuclearexportofmRNA.PLoSBiol5:e322.19
ParkinsonH,KapusheskyM,ShojatalabM,AbeygunawardenaN,CoulsonR,FarneA,HollowayE,20
KolesnykovN,LiljaP,LukkM,etal.2007.ArrayExpress--apublicdatabaseofmicroarray21
experimentsandgeneexpressionprofiles.NucleicAcidsRes35:D747–50.22
ParsyanA,SvitkinY,ShahbazianD,GkogkasC,LaskoP,MerrickWC,SonenbergN.2011.mRNA23
helicases:thetacticiansoftranslationalcontrol.NatRevMolCellBiol12:235–245.24
PingX-L,SunB-F,WangL,XiaoW,YangX,WangW-J,AdhikariS,ShiY,LvY,ChenY-S,etal.2014.25
MammalianWTAPisaregulatorysubunitoftheRNAN6-methyladenosinemethyltransferase.Cell26
Res24:177–189.27
PruittKD,TatusovaT,MaglottDR.2005.NCBIReferenceSequence(RefSeq):acuratednon-redundant28
sequencedatabaseofgenomes,transcriptsandproteins.NucleicAcidsRes33:D501–4.29
QuinlanAR,HallIM.2010.BEDTools:aflexiblesuiteofutilitiesforcomparinggenomicfeatures.30
Bioinformatics26:841–842.31
RedheadE,BaileyTL.2007.DiscriminativemotifdiscoveryinDNAandproteinsequencesusingthe32
DEMEalgorithm.BMCBioinformatics8:385.33
ReidDW,NicchittaCV.2015a.DiversityandselectivityinmRNAtranslationontheendoplasmic34
reticulum.NatRevMolCellBiol16:221–231.35
ReidDW,NicchittaCV.2015b.LOCALTRANSLATION.Commenton“PrinciplesofERcotranslational36
translocationrevealedbyproximity-specificribosomeprofiling.”Science348:1217.37
ReimandJ,ArakT,ViloJ.2011.g:Profiler—awebserverforfunctionalinterpretationofgenelists38
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
36
(2011update).NucleicAcidsResgkr378.1
RothFP,HughesJD,EstepPW,ChurchGM.1998.FindingDNAregulatorymotifswithinunaligned2
noncodingsequencesclusteredbywhole-genomemRNAquantitation.NatBiotechnol16:939–3
945.4
RowlandAA,VoeltzGK.2012.Endoplasmicreticulum–mitochondriacontacts:functionofthejunction.5
NatRevMolCellBiol13:607–625.6
ShahP,DingY,NiemczykM,KudlaG,PlotkinJB.2013.Rate-limitingstepsinyeastproteintranslation.7
Cell153:1589–1601.8
SinghG,KucukuralA,CenikC,LeszykJD,ShafferSA,WengZ,MooreMJ.2012.ThecellularEJC9
interactomerevealshigher-ordermRNPstructureandanEJC-SRproteinnexus.Cell151:750–764.10
SinghG,RicciEP,MooreMJ.2014.RIPiT-Seq:ahigh-throughputapproachforfootprintingRNA:protein11
complexes.Methods65:320–332.12
StoreyJD.2003.Thepositivefalsediscoveryrate:aBayesianinterpretationandtheq-value.AnnStat13
31:2013–2035.14
StuartJM,SegalE,KollerD,KimSK.2003.Agene-coexpressionnetworkforglobaldiscoveryof15
conservedgeneticmodules.Science302:249–255.16
SylvestreJ,VialetteS,CorralDebrinskiM,JacqC.2003.LongmRNAscodingforyeastmitochondrial17
proteinsofprokaryoticoriginpreferentiallylocalizetothevicinityofmitochondria.GenomeBiol4:18
R44.19
TullerT,CarmiA,VestsigianK,NavonS,DorfanY,ZaborskeJ,PanT,DahanO,FurmanI,PilpelY.2010.20
Anevolutionarilyconservedmechanismforcontrollingtheefficiencyofproteintranslation.Cell21
141:344–354.22
ValenE,SandelinA,WintherO,KroghA.2009.Discoveryofregulatoryelementsisimprovedbya23
discriminatoryapproach.PLoSComputBiol5:e1000562.24
VogelC,MarcotteEM.2012.Insightsintotheregulationofproteinabundancefromproteomicand25
transcriptomicanalyses.NatRevGenet13:227–232.26
WangS,YaoX.2012.MulticlassImbalanceProblems:AnalysisandPotentialSolutions.IEEETransSyst27
ManCybernBCybern.http://dx.doi.org/10.1109/TSMCB.2012.2187280.28
WangX,LuZ,GomezA,HonGC,YueY,HanD,FuY,ParisienM,DaiQ,JiaG,etal.2014.N6-29
methyladenosine-dependentregulationofmessengerRNAstability.Nature505:117–120.30
WangX,ZhaoBS,RoundtreeIA,LuZ,HanD,MaH,WengX,ChenK,ShiH,HeC.2015.N(6)-31
methyladenosineModulatesMessengerRNATranslationEfficiency.Cell161:1388–1399.32
ZinshteynB,GilbertWV.2013.LossofaconservedtRNAanticodonmodificationperturbscellular33
signaling.PLoSGenet9:e1003675.34
35
36
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
37
FigureLegends:1
Fig1|Modelingtherelationshipbetweensequencefeaturesintheearlycoding2
regionandtheabsenceof5’UTRintrons(5UIs).(a)Forallhumantranscripts,3
informationabout36nucleotide-levelfeaturesoftheearlycodingregion(first994
nucleotides)and5UIpresencewasextracted.(b)Transcriptscontainingasignalsequence5
codingregion(SSCR)wereusedtotrainaRandomForestclassifierthatmodeledthe6
relationshipbetween5UIabsenceand36sequencefeatures.(c)Withthisclassifier,all7
humantranscriptswereassignedascorethatquantifiesthelikelihoodof5UIabsence8
basedonspecificRNAsequencefeaturesintheearlycodingregion.Transcriptswithhigh9
scoresarethusconsideredtohave5’-proximalintronminus-likecodingregions(5IMs).10
(d)“5’UTR-intron-minus-predictor”(5IMP)scoredistributionsforSSCR-containing11
transcriptsshifttohigherscoreswithlater-appearingfirstintrons,suggestingthat5IM12
codingregionfeaturesnotonlypredictlackofa5UI,butalsolackofearlycodingregion13
introns.(e)Classifierperformancewasoptimizedbyexcluding5UI–transcriptswith14
intronsappearingearlyinthecodingregion.Cross-validationperformance(areaunderthe15
precisionrecallcurve,AUPRC)wasexaminedforaseriesofalternative5IMclassifiers16
usingdifferentfirst-intron-positioncriterionforexcluding5UI–transcriptsfromthe17
trainingset(Methods).18
19
Fig2|Predicting5UIstatusaccuratelyusingonlyearlycodingsequences.(a)As20
judgedbyareaunderthereceiveroperatingcharacteristiccurve(AUROC)andAUPRC,The21
5IMclassifierperformedwellforseveraldifferenttranscriptclasses.(b)Thedistribution22
of5IMPscoresrevealsclearseparationof5UI+and5UI–transcriptsforSSCR-containing23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
38
transcripts,whereeachSSCR-containingtranscriptwasscoredusingaclassifierthatdid1
notusethattranscriptintraining(Methods).(c)Codingsequencefeaturesthatare2
predictiveof5’proximalintronpresencearerestrictedtotheearlycodingregion.Thiswas3
supportedbyidentical5IMclassifierscoredistributionswithrespectto5UIpresencefor4
negativecontrolsequences,eachderivedfromasinglerandomlychosen‘window’5
downstreamofthe3rdexonfromoneoftheevaluatedtranscripts.(d)MSCRtranscripts6
exhibitedamajordifferencein5IMPscoresbasedontheir5UIstatuseventhoughnoMSCR7
transcriptswereusedintrainingtheclassifier.(e)Transcriptspredictedtocontainsignal8
peptides(SignalP+)hada5IMPscoredistributionsimilartothatofSSCR-containing9
transcripts.(f)AftereliminatingSSCR,MSCR,andSignalP+transcripts,theremainingS–10
/MSCR–SignalP–transcriptswerestillsignificantlyenrichedforhigh5IMclassifierscores11
among5UI–transcripts.(g)Thecontrolsetofrandomlychosensequencesdownstreamof12
the3rdexonfromeachtranscriptwasusedtocalculateanempiricalcumulativenull13
distributionof5IMPscores.Usingthisfunction,wedeterminedthep-valuecorresponding14
tothe5IMPscoreforalltranscripts.Thereddashedlineindicatesthep-value15
correspondingto5%FalseDiscoveryRate.16
17
Fig3|5IMtranscriptsaremorelikelytobedifferentiallyexpressedandhave18
sequencefeaturesassociatedwithlowertranslationefficiency(a)5IMtranscripts19
weresignificantlyenrichedamongdifferentiallyexpressedgenesthatwereidentifiedfrom20
~6500microarraydatasets(Methods).Theobservedp-valuedistribution(blue)using21
5IMPscoresdiffered(p<2.2e-16;Chi-squaredtest)fromthatusing5IMPscoresofan22
identically-sizedbaseline-expression-matchedrandomtranscriptset(red).(b)The5IM23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
39
classifierscorewaspositivelycorrelatedwiththepropensityformRNAstructure(-ΔG)1
(Spearmanrho=0.39;p<2.2e-16).Foreachtranscript,35nucleotidesprecedingthestart2
codonwereusedtocalculate-ΔG(Methods).(c)Transcriptsthataretranslationally3
upregulatedinresponsetoeIF4Eoverexpression(Larssonetal.2007)(blue)were4
enrichedforhigher5IMPscores.Lightgreenshadingindicates5IMPscorescorresponding5
to5%FDR.(d)Transcriptswithnon-AUGstartcodons(blue)exhibitedsignificantlyhigher6
5IMPscoresthantranscriptswithacanonicalATGstartcodon(yellow).(e)Higher5IMP7
scoreswereassociatedwithlessoptimalcodons(asmeasuredbythetRNAadaptation8
index,tAI)forthefirst33codons.Foralltranscriptswithineach5IMPscorecategory9
(blue-high,orange-low),themeantAIwascalculatedateachcodonposition.Startcodon10
wasnotshown.(f)Transcriptswithlowertranslationefficiencywereenrichedforhigher11
5IMPscores.Transcriptswithtranslationefficiencyonestandarddeviationbelowthe12
mean(“LOW”translation,yellow)andonestandarddeviationhigherthanthemean13
(“HIGH”translation,blue)wereidentifiedusingribosomeprofilingandRNA-Seqdatafrom14
humanlymphoblastoidcelllines(Methods).15
16
Fig4|5IMtranscriptswithnoevidenceofER-targetingaremorelikelytoexhibitER-17
proximalribosomeoccupancy.AmovingaverageofER-proximalribosomeoccupancy18
wascalculatedbygroupinggenesby5IMPscore(seeMethods).Weplottedthemoving19
averageof5IMPscoresfortranscriptswithnoevidenceofER-ormitochondrialtargeting20
(green)orfortranscriptspredictedtobemitochondrial(purple).Weplottedarandom21
subsampleoftranscriptsontopofthemovingaverage(circles).22
23
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
40
Fig5|5IMtranscriptsharbornon-canonicalExonJunctionComplex(EJC)binding1
sites(a)ObservedEJCbindingsites(Singhetal.2012)areshownforanexample5IM2
transcript(LAMC1).CanonicalEJCbindingsites(purple)are~24ntupstreamofanexon-3
intronboundary.Theremainingbindingsitesareconsideredtobenon-canonical(green).4
(b)Asequencemotif(top)previouslyidentifiedtobeenrichedamongncEJCbindingsites5
infirstexons(Figure6fromSinghetal.2012)ishighlysimilartoamotif(bottom)6
enrichedintheearlycodingregionsof5UI–transcripts(Figure6fromCeniketal.2011)(c)7
5IMPscorefortranscriptswithzero,one,twoormorenon-canonicalEJCbindingsitesin8
thefirst99codingnucleotidesrevealsthattranscriptswithhigh5IMPscoresfrequently9
harbornon-canonicalEJCbindingsites.(d)Transcriptswithhigh5IMPscoresareenriched10
fornon-canonicalEJCsregardlessof5UIpresenceorabsence.11
12
Fig6|5IMtranscriptsareenrichedformRNAswithearlycodingregionm1A13
modifications(a)Transcriptswithm1Amodifications(blue)inthefirst99codingnucleotides14
exhibitsignificantenrichmentfor5IMtranscriptsandhavehigher5IMPSscoresthan15
transcriptswithoutm1Amodificationsinthefirst99codingnucleotides(yellow).(b)16
Transcriptswithm1Amodifications(blue)inthe5’UTRdonotdisplayasimilarenrichment.17
18
SupportingInformation19
FigureS1|Associationbetween5IMPscoresandcodonoptimalityisrestrictedtothe20
first30aminoacidsandisnotexplainedbynucleotidecontent.(a)5IMtranscriptstend21
tohavelessoptimalcodonsintheirfirst30aminoacidsasmeasuredbytRNAadaptationindex22
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
41
(tAI).ThemediantAIforeachtranscriptwascalculatedandtranscriptsweregroupedbytheir1
5IMPscores.ThedistributionofmediantAIwasplottedasaboxplot.(b)Wepermutedthe2
nucleotidesofthefirst99nucleotidesandfoundthattherelationshipbetween5IMPandtAI3
waslost.Thisresultsuggestedthatnucleotidecompositionalonedoesn’texplainthe4
relationshipbetween5IMPandtAI.(c)Weusedthepreviouslydescribednegativecontrol5
sequences,eachderivedfromasinglerandomlychosenin-frame‘window’downstreamofthe6
3rdexonfromoneoftheevaluatedtranscripts.Wefoundnorelationshipbetween5IMPscore7
andtAIintheseregionssuggestingthattheobservedassociationisrestrictedtotheearly8
codingregion.9
TableS1-|Listoffeaturesusedbythe5IMclassifier.10
TableS2-|5IMPscoresofallhumantranscripts.11
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
5'UTR Intron (5UI) First 99 nucleotidesATGGCGCA
ATGACAGA
ATGTATAAT
Sequence 1
Sequence 2
Sequence N
5’UTRIntron Arg:Lys
+
–
–
1.3
3.5
4.0
Adenine Codon Bias
0.9
2.1
1.5
Extract CDS featuresA B
Model the relationship between early coding sequence and 5UI status
C
Grow Decision Tree Repeat SamplingGenerate Multiple Trees
Sequence 5Sequence 5Sequence 9Sequence 11
BootstrapSample
Assign ‘5IMP’ score reflecting 5UI– propensity5IMP Score
9.2/10Sequence X ATGAGCGCGT...?
Likely 5UI–
2.7/10Sequence Y ATGACTATAA...?
Likely 5UI+
5UTR
Intron 0-4
243
-5960
-7071
-84 85+
2
4
6
8
First Coding Intron Position
5IM
P Sc
ore
D
2000 50 100 1500.5
0.6
0.7
0.8
0.9
1.0
AUPR
C (5
UI– p
redi
ctio
n)
First Coding Intron Position
E90 nt cutoff
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
E
0 2 4 6 8 105IMP Score
0.00
0.10
0.20
0.30
Dens
ity
5UI+5UI–
SSCR–/ MSCR–/ SignalP+
0.00
0.10
0.20
0.30
C
Dens
ity
0 2 4 6 8 105IMP Score
3’ proximal exons
5UI+5UI–
A
SSCRMSCR
3’ proximal exons
S– /M– SignalP+
Prec
isio
n
Recall
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0False Positive Rate
Reca
ll
0.0 0.2 0.4 0.6 0.8 1.00.0
0.2
0.4
0.6
0.8
1.0
D
0 2 4 6 8 105IMP Score
5UI+5UI–
MSCR
Dens
ity
0.00
0.10
0.20
0.30
B
Dens
ity
0 2 4 6 8 10
SSCR
0.00
0.10
0.20
0.30
5IMP Score
5UI+5UI–
0 2 4 6 8 10
SSCR–/ MSCR–/ SignalP–
5IMP Score
5UI+5UI–
0.00
0.10
0.20
0.30
Dens
ity
F
Wò]HS\L
5\TIL
Y�VM�;
YHUZJYPW[Z
0.0 0.2 0.4 0.6 0.8 1.00
1k
2k
G
0.00 0.01 0.02 0.030
200
400
�047�*SHZZPMPLY�7LYMVYTHUJL
S– /M– SignalP–
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
A
0.00
0.05
0.10
0.15
0.20
0.25
Dens
ity
NON-AUG
AUG
Start Codon
5IMP Score0 2 4 6 8 10
D
C
5IMP Score0 2 4 6 8 10
0.00
0.05
0.10
0.15
0.20
Dens
ity
TRANSLATIONDOWN TRANSLATION
UP
NO CHANGEEIF4E Overexpression
FRibosomes per mRNA
LOW
HIGH
0 2 4 6 8 105IMP Score
0.00
0.05
0.10
0.15
0.20
Dens
ity
B
5IMP Score
Seco
dary
Stru
ctur
e-¬
G
2 4 6 8
0
5
10
15
20
25
E
0.36
0.38
0.40
0.42
Codo
n O
ptim
ality
5 10 15 20 25 30Codon Position
5IM
Non-5IM
Mean tAI
Num
ber o
f Exp
erim
ents
p-value0.00 0.25 0.50 0.75 1.00
0
0.5k
1.0k
1.5k
randomobserved
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
A
Mitochondrial Evidence
2 4 6 8
5IMP score
,9òW
Yo_PT
HS�9PIV
ZVTL�6J
J\WH
UJ`
No ER-targeting or Mitochondrial Evidence
#��
�
�
�
%���
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
AScale 1 kb
LAMC1
200 basesScale
EJC Peak Calls
Non-Canonical EJC(ncEJC) Peaks
Canonical EJC(cEJC) Peaks
2 4 6 8 10
0
0.050.1
0.15
0.20
0.25
0.30
0.35
5IMP Score
Dens
ity
All TranscriptsFirst 99 Coding Nucleotides
No Peaks 1 Peak
2+ Peaks
C
ncEJC Peaks
ACG
CG
GCC
GGCGCGCG
C
ATGC
ACTGC
GTAGCGA
GC
B
First 99 Coding NucleotidesD
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35 5UI+ Transcripts
No ncEJC peak
1+ ncEJC peak
5UI– Transcripts
No ncEJC peak
1+ ncEJC peak
0 2 4 6 8 10 2 4 6 8 10
5IMP Score5IMP Score
Dens
ity
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
0.00
0.05
0.10
0.15
0.20
Dens
ity
0.00
0.05
0.10
0.15
0.20
Dens
ity
5IMP Score
0 2 4 6 8 10
A
5IMP Score
0 2 4 6 8 10
5’UTRFirst 99 Coding Nucleotides
B
m1A-modified m1A-modified
No m1A No m1A.CC-BY-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
0.25
0.30
0.35
0.40
0.45
5IMP Score
Med
ian
tAI
0-7.41 7.41-10
A
Codon Position
5 10 15 20 25 30
0.30
0.32
0.34
0.36
0.38
0.40
0.42
Codo
n O
ptim
ality
mea
n tA
I
First 99 nucleotides permuted
5IMP � 7.41 7.41 < 5IMP
B
Codon Position
5 10 15 20 25 30
0.30
0.32
0.34
0.36
0.38
0.40
0.42
Random 99 nucleotidesDownstream of 3rd Exon
Codo
n O
ptim
ality
mea
n tA
I
7.41 < 5IMP5IMP � 7.41
C.CC-BY-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint
Table&S1.&List&of&Features&Used&in&Training&the&Random&Forest&
FEATURE&CLASS& PARTICULAR&FEATURES&USED&Ratio&of&Amino&Acids& Number&of&Leucines&/&Number&of&Isoleucines&
Number&of&Arginines&/&Number&of&Lysines&Number&of&Aspartates&/&Number&of&Glutamates&Number&of&Phenylalanines&/&Number&of&Tyrosines&Number&of&Leucines&/&Number&of&Valines&Number&of&Valines&/&Number&of&Isoleucines&Number&of&Glutamines&/&Number&of&Glutamates&
Nucleotide&Content& Percent&Adenines&Percent&Thymines&Length&of&the&Longest&Track&Lacking&Any&Adenines&
Codon&Preferences& Ratio&of&NonHAdenine&Codons&to&Adenine&Codons&Ratio&of&NonHThymine&Codons&to&Thymine&Codons&
Motif&Score&and&Position& MoAn&–&12&features&including&PSSM&scores&and&positions&of&the&top&two&occurrences&for&each&of&3&motifs&discovered&by&MoAn&[44] DEME&–&4&features&including&PSSM&scores&and&positions&of&the&top&two&occurrences&for&1&motif&discovered&by&DEME[43] AlignACE&–&8&features&including&PSSM&scores&and&positions&of&the&top&two&occurrences&for&each&of&2&motifs&discovered&by&AlignACE&[42]&
&
.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint