a common class of transcripts with 5’-intron depletion ......2016/06/06  · 2 where are we...

49
Can Cenik 1 A Common Class of Transcripts with 5’-Intron Depletion, Distinct Early Coding 1 Sequence Features, and N 1 -Methyladenosine Modification 2 Can Cenik 1,2* , Hon Nian Chua 3,4* , Guramrit Singh 2,5,7,8 , Abdalla Akef 6 , Michael P Snyder 1 , 3 Alexander F. Palazzo 6 , Melissa J Moore 2,7,8# , and Frederick P Roth 3,9,10# 4 5 Affiliations: 6 1 Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA 7 2 Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical 8 School, Worcester, MA 01605, USA 9 3 Donnelly Centre and Departments of Molecular Genetics and Computer Science, University of Toronto 10 and Lunenfeld-Tanenbaum Research Institute, Mt Sinai Hospital, Toronto M5G 1X5, Ontario, Canada 11 4 DataRobot, Inc., 61 Chatham St. Boston MA 02109, USA 12 5 Department of Molecular Genetics, Center for RNA Biology, The Ohio State University , Columbus, OH 13 43210, USA 14 6 Department of Biochemistry, University of Toronto, 1 King’s College Circle, MSB Room 5336, Toronto, 15 ON M5S 1A8, Canada 16 7 Howard Hughes Medical Institute, University of Massachusetts Medical School, Worcester, MA 01605, 17 USA 18 8 RNA Therapeutics Institute, University of Massachusetts Medical School, Worcester, MA 01605, USA 19 9 Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston 02215, MA, USA 20 10 The Canadian Institute for Advanced Research, Toronto M5G 1Z8, ON, Canada 21 22 * These authors contributed equally to this work. 23 #Correspondence to: [email protected] , [email protected] 24 25 Short Title: First intron position defines a distinct transcript class 26 Keywords: 5’ UTR introns, random forest, N 1 methyladenosine, Exon Junction Complex 27 28 . CC-BY-ND 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted June 7, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint

Upload: others

Post on 09-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

1

ACommonClassofTranscriptswith5’-IntronDepletion,DistinctEarlyCoding1

SequenceFeatures,andN1-MethyladenosineModification2

CanCenik1,2*,HonNianChua3,4*,GuramritSingh2,5,7,8,AbdallaAkef6,MichaelPSnyder1,3

AlexanderF.Palazzo6,MelissaJMoore2,7,8#,andFrederickPRoth3,9,10#4

5

Affiliations:6

1DepartmentofGenetics,StanfordUniversitySchoolofMedicine,Stanford,CA94305,USA7

2DepartmentofBiochemistryandMolecularPharmacology,UniversityofMassachusettsMedical8

School,Worcester,MA01605,USA9

3DonnellyCentreandDepartmentsofMolecularGeneticsandComputerScience,UniversityofToronto10

andLunenfeld-TanenbaumResearchInstitute,MtSinaiHospital,TorontoM5G1X5,Ontario,Canada11

4DataRobot,Inc.,61ChathamSt.BostonMA02109,USA12

5DepartmentofMolecularGenetics,CenterforRNABiology,TheOhioStateUniversity,Columbus,OH13

43210,USA14

6DepartmentofBiochemistry,UniversityofToronto,1King’sCollegeCircle,MSBRoom5336,Toronto,15

ONM5S1A8,Canada16

7HowardHughesMedicalInstitute,UniversityofMassachusettsMedicalSchool,Worcester,MA01605,17

USA18

8RNATherapeuticsInstitute,UniversityofMassachusettsMedicalSchool,Worcester,MA01605,USA19

9CenterforCancerSystemsBiology(CCSB),Dana-FarberCancerInstitute,Boston02215,MA,USA20

10TheCanadianInstituteforAdvancedResearch,TorontoM5G1Z8,ON,Canada21

22

*Theseauthorscontributedequallytothiswork.23

#Correspondenceto:[email protected],[email protected]

25

ShortTitle:Firstintronpositiondefinesadistincttranscriptclass26

Keywords:5’UTRintrons,randomforest,N1methyladenosine,ExonJunctionComplex27

28

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 2: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

2

Abstract:1

Intronsarefoundin5’untranslatedregions(5’UTRs)for35%ofallhumantranscripts.2

These5’UTRintronsarenotrandomlydistributed:genesthatencodesecreted,membrane-3

boundandmitochondrialproteinsarelesslikelytohavethem.Curiously,transcriptslacking4

5’UTRintronstendtoharborspecificRNAsequenceelementsintheirearlycodingregions.5

Hereweexploredtheconnectionbetweencoding-regionsequenceand5’UTRintronstatus.6

Wedevelopedaclassifiermodelingthisrelationshipthatcanpredict5’UTRintronstatuswith7

>80%accuracyusingonlysequencefeaturesintheearlycodingregion.Thus,theclassifier8

identifiestranscriptswith5’proximal-intron-minus-like-codingregions(“5IM”transcripts).9

Unexpectedly,wefoundthattheearlycodingsequencefeaturesdefining5IMtranscriptsare10

widespread,appearingin21%ofallhumanRefSeqtranscripts.Membersofthistranscript11

classtendtoexhibitdifferentialRNAexpressionacrossmanyconditions.The5IMclassof12

transcriptsisenrichedfornon-AUGstartcodons,moreextensivesecondarystructure13

precedingthestartcodon,andnearthe5’cap,greaterdependenceoneIF4Efortranslation,14

andassociationwithER-proximalribosomes.5IMtranscriptsareboundbytheExonJunction15

Complex(EJC)atnon-canonical5’proximalpositions.Finally,N1-methyladenosinesare16

specificallyenrichedintheearlycodingregionsof5IMtranscripts.Takentogether,our17

analysesrevealtheexistenceofadistinct5IMclasscomprising~20%ofhumantranscripts.18

Thisclassisdefinedbydepletionof5’proximalintrons,presenceofspecificRNAsequence19

featuresassociatedwithlowtranslationefficiency,N1-methyladenosinesintheearlycoding20

region,andenrichmentfornon-canonicalbindingbytheExonJunctionComplex.21

22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 3: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

3

Introduction:1

2

Approximately35%ofallhumantranscriptsharborintronsintheir5’untranslated3

regions(5’UTRs)(Ceniketal.2010;Hongetal.2006).Amonggeneswith5’UTRintrons(5UIs),4

thoseannotatedas“regulatory”aresignificantlyoverrepresented,whilethereisanunder-5

representationofgenesencodingproteinsthataretargetedtoeithertheendoplasmic6

reticulum(ER)ormitochondria(Ceniketal.2011).FortranscriptsthatencodeER-and7

mitochondria-targetedproteins,5UIdepletionisassociatedwithpresenceofspecificRNA8

sequences(Ceniketal.2011;Palazzoetal.2007,2013).Specifically,nuclearexportofan9

otherwiseinefficientlyexportedmicroinjectedmRNAorcDNAtranscriptcanbepromotedby10

anER-targetingsignalsequence-containingregion(SSCRs)ormitochondrialsignalsequence11

codingregion(MSCRs)fromagenelacking5’UTRintrons(Ceniketal.2011;Leeetal.2015).12

However,morerecentstudiessuggestthatmanySSCRslikelyhavelittleimpactonnuclear13

exportforinvivotranscribedRNAs(Leeetal.2015)butratherenhancetranslationina14

RanBP2-dependentmanner(Mahadevanetal.2013).15

AmongSSCR-andMSCR-containingtranscripts(referredtohereafterasSSCRand16

MSCRtranscripts),~75%lack5’UTRintrons(“5UI–”transcripts)and~25%havethem17

(“5UI+”transcripts).Thesetwogroupshavemarkedlydifferentsequencecompositionsatthe18

5’endsoftheircodingsequences.5UI–transcriptstendtohaveloweradeninecontent19

(Palazzoetal.2007)andusecodonswithfeweruracilsandadeninesthan5UI+transcripts20

(Ceniketal.2011).Theirsignalsequencesalsocontainleucineandargininemoreoftenthan21

thebiochemicallysimilaraminoacidsisoleucineandlysine,respectively.Leucineandarginine22

codonscontainfeweradenineandthyminenucleotides,consistentwithadenineandthymine23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 4: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

4

depletion.ThisdepletionisalsoassociatedwiththepresenceofaspecificGC-richRNAmotif1

intheearlycodingregionof5UI–transcripts(Ceniketal.2011).2

Despitesomeknowledgeastotheirearlycodingregionfeatures,keyquestionsabout3

thisclassof5UI–transcriptshaveremainedunanswered:Dotheabovesequencefeatures4

extendbeyondSSCR-andMSCR-containingtranscriptstoother5UI–genes?Do5UI–5

transcriptshavingthesefeaturessharecommonfunctionorregulatorynetwork?What6

bindingfactor(s)recognizetheseRNAelements?Amorecompletemodeloftherelationshipof7

earlycodingfeaturesand5UI–statuswouldbegintoaddressthesequestions.8

Here,weundertookanintegrativemachinelearningapproachtobetterunderstandthe9

relationshipbetweenearlycodingregionfeaturesand5UIstatus.Usingtheresulting10

computationalmodel,wefindthat~21%ofallhumantranscriptscanbeclassifiedas“5IM”11

(definedby5’-proximal-intron-minus-likecodingregions).Whilemanyofthesetranscripts12

encodeER-andmitochondrial-targetedproteins,manyothersencodenuclearandcytoplasmic13

proteins.Sharedcharacteristicsarethat5IMtranscriptstendtolackof5’proximalintrons,14

exhibitnon-canonicalExonJunctionComplexbinding,havefeaturesassociatedwithlower15

intrinsictranslationefficiency,andhaveanincreasedincidenceofN1-methyladenosines.16

17

Results:18

Aclassifierthatpredicts5UIstatususingonlyearlycodingsequenceinformation.19

Tobetterunderstandthepreviously-reportedenigmaticrelationshipbetweencertainearly20

codingregionsequencesandtheabsenceofa5UI,wesoughttomodelthisrelationship.21

Specifically,weusedarandomforestclassifier(Breiman2001)tolearntherelationship22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 5: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

5

between5UIabsenceandacollectionof36differentnucleotide-levelfeaturesextractedfrom1

thefirst99nucleotidesofallhumancodingregions(CDS)(Figs1A-C;TableS1;Methods).2

WethenusedalltranscriptsknowntocontainanSSCR,regardlessof5UIstatus,asourtraining3

set.Thistrainingconstraintensuredthatallinputnucleotidesequencesweresubjectto4

similarfunctionalconstraintsattheproteinlevel.Thus,wesoughttoidentifysequence5

featuresthatdifferbetween5UI–and5UI+transcriptsattheRNAlevel.6

Ourclassifierassignstoeachtranscripta“5’UTR-intron-minus-predictor”(5IMP)score7

between0and10,wherehigherscorescorrespondtoahigherlikelihoodofbeing5UI–(Fig8

1C).Interestingly,preliminaryrankingofthe5UI–transcriptsby5IMPscorerevealeda9

relationshipbetweenthepositionofthefirstintroninthecodingregionandthe5IMPscore.10

5UI–transcriptsforwhichthefirstintronwas>85ntsdownstreamofthestartcodonhadthe11

highest5IMPscores.Furthermore,thecloserthefirstintronwastothestartcodon,thelower12

the5IMPscore(Fig1D).Weexploredthisrelationshipfurtherbytrainingclassifiersthat13

increasinglyexcludedfromthetrainingset5UI–transcriptsaccordingtothedistanceofthe14

firstintronfromthe5’endofthecodingregion.Thisrevealedthatclassifierperformance,as15

measuredbytheareaundertheprecisionrecallcurve(AUPRC),increasedasafunctionofthe16

distancefromstartcodontofirstintrondistance(Methods,Fig1E),andalsoasafunctionof17

distancefromtranscriptionstartsitetothefirstintron.Thusthe5UI–-predictiveRNA18

sequencefeaturesweidentifiedaremoreaccuratelydescribedaspredictorsoftranscripts19

without5’-proximalintrons.20

TominimizetheimpactofintronpositionswithintheCDS,weselectedatrainingset21

thatonlyinclude5UI–SSCRtranscriptsforwhichthefirstintronwasatleast90nts22

downstreamofthestartcodon(Methods).Discriminativemotiffeatureswerelearned23

independently(Methods),andperformanceofthisnewclassifierwasgaugedusing10-fold24

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 6: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

6

crossvalidation.Incross-validation,performanceaccordingtoareaunderthereceiver1

operatingcurve(AUC)was74%andmeanareaundertheprecisionrecallcurve(AUPRC)was2

88%(Fig2A,yellowcurves).Incomparison,anaivepredictorwouldachieveanAUCof50%3

andanAUPRCof71%(because71%ofSSCRtranscriptsare5UI–).Weusedthisoptimized4

classifierforallsubsequentanalyses.5

SSCRtranscriptsexhibitedmarkedlydifferent5IMPscoredistributionsforthe5UI+and6

5UI–subsets(Fig2B).The5UI+scoredistributionwasunimodalwithapeakat~2.4.In7

contrast,the5UI–scoredistributionwasbimodalwithonepeakat~3.6andanotherat~9,8

suggestingtheexistenceofatleasttwounderlying5UI–transcriptclasses.Thepeakatscore9

3.6resembledthe5UI+peak.Alsocontributingtothepeakat3.6arethe5UI–transcripts10

harboringanintroninthefirst90ntsoftheCDS(55%ofall5UI–transcripts).Theother11

distincthigh-scoring5UI–class(peakatscore9)iscomposedoftranscriptsthathavespecific12

5UI--predictiveRNAsequenceelementswithintheearlycodingregion.13

Next,wewantedtoevaluatewhetherourclassifierwasdetectingdifferencesbetween14

5UI+and5UI–SSCRtranscriptsthatoccurspecificallyintheearlycodingregionasopposedto15

overallsequencedifferencesbetweenthetwotranscriptclasses.Todoso,foreverytranscript16

werandomlychose99ntsfromtheregiondownstreamofthe3rdexon.The5IMPscore17

distributionsofthese‘3’proximalexon’setswereidenticalfor5UI+and5UI–transcripts(Figs18

2A,and2C),confirmingthatthesequencefeaturesthatdistinguish5UI+and5UI–transcripts19

arespecifictotheearlycodingregion.20

21

RNAelementsassociatedwith5UI–transcriptsarepervasiveinthehumangenome22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 7: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

7

HavingtrainedtheclassifieronSSCRtranscripts,wenextaskedhowwellitwouldpredictthe1

5UIstatusofothertranscripts.DespitehavingbeentrainedexclusivelyonSSCRtranscripts,2

theclassifierperformedremarkablywellonMSCRtranscripts,achievinganAUCof86%and3

AUPRCof95%(Fig2A,purpleline;Fig2D;ascomparedwith50%and77%,respectively,4

expectedbychance).ThisresultsuggeststhatRNAelementswithinearlycodingregionsof5

5UI–MSCR-transcriptsaresimilartothosein5UI–SSCR-transcriptsdespitedistinctfunctional6

constraintsattheproteinlevel.7

Wethentestedwhethertheclassifiercanpredict5UI–statusintranscriptsthatcontain8

neitherSSCRnorMSCR(“S–/M–”transcripts).BecauseunannotatedSSCRscouldconfoundthis9

analysis,wefirstusedSignalP3.0toidentifyS–/M–transcriptsmostlikelytocontainahidden10

SSCR(Bendtsenetal.2004).These‘SignalP+‘transcriptshada5IMPscoredistribution11

comparabletothoseofknownSSCRandMSCRtranscripts(Fig2E),withanAUCof82%and12

anAUPRCof95%(Fig2A,lightblueline).Whereas5UI+SignalP+transcriptshad13

predominantlylow5IMPscores,5UI–SignalP+5IMPscoreswerestronglyskewedtowards14

high5IMPscores(peakat~9;Fig2E).ThusSignalP+transcriptsdolikelycontainmany15

unannotatedSSCRs.16

Finally,weusedtheclassifiertocalculate5IMPscoresforallremaining“S–/M–/SignalP–17

”transcripts.Althoughtheperformancewasweakeronthisgeneset,itwasstillbetterthana18

naivepredictor(Fig2A,greenline).Asexpected,the5UI+S–/M–/SignalP–transcriptswere19

stronglyskewedtowardlow5IMPscores(Fig2F).Surprisingly,however,asignificant20

fractionof5UI–S–/M–/SignalP–transcriptshadhigh5IMPscores(~18%).Thissuggeststhe21

existenceofabroaderclassoftranscriptsthatsharefeaturesof5’proximal-intron-minus-like-22

codingregions.Hereafterwerefertotranscriptsinthisclassas“5IM”transcripts.23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 8: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

8

Todeterminethefractionofallhumantranscriptsthatare5IM,wemadeuseofthe1

above-describednegativecontrolsetofrandomlychosensequencesdownstreamofthe3rd2

exonfromeachtranscript.Byquantifyingtheexcessofhigh-scoringsequencesinthereal3

distributionrelativetothiscontroldistribution,weestimatethat21%ofallhumantranscripts4

are5IMtranscripts(a5IMPscoreof7.41correspondstoa5%FalseDiscoveryRate;Fig2G).5

Thesetof5IMtranscriptsdefinedbyourclassifier(TableS2)includesmanythatdonot6

encodeER-targetedormitochondrialproteins.TheseresultssuggestthatRNA-levelfeatures7

prevalentintheearlycodingregionsof5UI–SSCRandMSCRtranscriptsarealsofoundinother8

transcripttypes(Figs2F-G),andthat5IMtranscriptsrepresentabroadclass.9

10

Functionalcharacterizationof5IMtranscripts11

5IMtranscriptsaredefinedbymRNAsequencefeatures.Hence,wehypothesizedthat5IM12

transcriptsmaybefunctionallyrelatedthroughtheirsharedregulationmediatedbythe13

presenceofthesecommonfeatures.Tothisend,wecollected66large-scaledatasets14

representingdiverseattributescoveringsevenbroadcategories:(1)Curatedfunctional15

annotations--e.g.,GeneOntologyterms,annotationasa‘housekeeping’gene,genessubjectto16

RNAediting;(2)RNAlocalization--e.g.,todendrites,tomitochondria;(3)ProteinandmRNA17

half-life,ribosomeoccupancyandfeaturesthatdecreasestabilityofoneormoremRNA18

isoforms--e.g.,AU-richelements(4)TendencytoexhibitdifferentialmRNAexpression;(5)19

Sequencefeaturesassociatedwithregulatedtranslation--e.g.,codonoptimality,secondary20

structurenearthestartcodon;(6)KnowninteractionswithRNA-bindingproteinsor21

complexessuchasStaufen-1,TDP-43,ortheExonJunctionComplex(EJC)(7)RNA22

modifications–i.e.,N1-methyladenosine(m1A).Whereaswefoundnospecificassociation23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 9: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

9

between5IMtranscriptsandfeaturesincategories(1),(2)and(3)otherthanthealready-1

knownenrichmentsforER-andmitochondrial-targetedmRNAs,observationsfromanalysisof2

categories(4),(5),(6),and(7)aredescribedbelow.Weadjustedforthemultiplehypotheses3

testingproblemattwolevels.First,wetookaconservativeapproach(Bonferronicorrection)4

tocorrectforthenumberoftestedfunctionalcharacteristics.Second,someofthefunctional5

categorieswereanalyzedinmoredepthandmultiplesub-hypothesesweretestedwithinthe6

givencategory.Inthisin-depthanalysisafalsediscovery-basedcorrectionwasadopted.For7

example,weinvestigatedwhether5IMtranscriptsexhibitdifferentialRNAexpression.We8

usedBonferroni-correctiontocalculatethep-valueassociatedwiththistest.Then,wefurther9

testedwhether5IMtranscriptsaredifferentiallyexpressedunderthespecificconditions.For10

thisin-depthanalysis,weadoptedafalsediscoveryratebasedcorrection.Below,allreported11

p-valuesremainsignificant(p-adjusted<0.05)afterappropriatemultiplehypothesis12

correction.13

14

5IMtranscriptsexhibitdifferentialRNAexpressionacrossmanyconditions15

Genesetsexhibitingpositivelyornegativelycorrelatedexpressionacrosstissuesand16

conditionsareenrichedforco-regulationandco-functionality(Stuartetal.2003).Sincethe17

powertodetectdifferentialexpressionishigherforabundanttranscripts,wefirsttested18

whether5IMtranscriptsgenerallyhavehigherbaselineexpressionthanothertranscripts.We19

usedbaselineexpressionmeasurementsacross16humantissues(Methods),andfounda20

weakbutsignificantpositivecorrelationbetweenbaselineexpressionand5IMPscores21

(Spearmanrho=0.07;p=7.8e-15).Tocontrolforthepossibleimpactoftheobserved22

differenceinbaselineexpressionondifferentialexpression,wegeneratedarandomsampleof23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 10: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

10

transcriptssuchthatthebaselineexpressionleveldistributionofthissetmatchesthe1

expressiondistributionofthe5IMPtranscripts(Methods).2

Combiningdifferentialexpressiondatafrom~6500humanmicroarrayexperiments,3

weobservedthat5IMtranscriptswereoverrepresentedingenesetsexhibitingcondition-4

dependentchangesinmRNAexpressionlevelcomparedtheidentically-sizedbaseline-5

expression-matchedrandomtranscriptset(Fig3A,Methods,p<2.2e-16;Chi-squaredtest).6

Thus,thelevelsof5IMtranscriptshaveahigher-than-randomlikelihoodofchangingacrossa7

widevarietyoftissues,celltypes,andenvironments.8

9

5IMtranscriptshavefeaturessuggestinglowertranslationefficiency10

Translationregulationisamajordeterminantofproteinlevels(VogelandMarcotte2012).To11

investigatepotentialconnectionsbetween5IMtranscriptsandtranslationalregulation,we12

examinedfeaturesassociatedwithtranslation.Featuresfoundtobesignificantwere:13

(I)Secondarystructuresnearthestartcodoncanaffectinitiationratebymodulating14

startcodonrecognition(Parsyanetal.2011).Weobservedapositivecorrelationbetween15

5IMPscoreandthefreeenergyoffolding(-ΔG)ofthe35nucleotidesimmediatelypreceding16

thestartcodon(Fig3B;Spearmanrho=0.39;p<2.2e-16).Thissuggeststhat5IMtranscripts17

haveagreatertendencyforsecondarystructurenearthestartcodon,presumablymakingthe18

startcodonlessaccessible.19

(II)Similarly,secondarystructuresnearthe5’capcanmodulatetranslationby20

hinderingbindingbythe43S-preinitiationcomplextothemRNA(Babendureetal.2006).We21

observedapositivecorrelationbetween5IMPscoreandthefreeenergyoffolding(-ΔG)ofthe22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 11: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

11

5’most35nucleotides(Spearmanrho=0.18;p=7.9e-130).Thissuggeststhat5IMtranscripts1

haveagreatertendencyforsecondarystructurenearthe5’cap,presumablyhinderingbinding2

bythe43S-preinitiationcomplex.3

(III)eIF4Eoverexpression.TheheterotrimerictranslationinitiationcomplexeIF4F4

(madeupofeIF4A,eIF4EandeIF4G)isresponsibleforfacilitatingthetranslationof5

transcriptswithstrong5’UTRsecondarystructures(Parsyanetal.2011).TheeIF4Esubunit6

bindstothe7mGpppG‘methyl-G’capandtheATP-dependenthelicaseeIF4A(scaffoldedby7

eIF4G)destabilizes5’UTRsecondarystructure(Marintchevetal.2009).Apreviousstudy8

identifiedtranscriptsthatweremoreactivelytranslatedunderconditionsthatpromotecap-9

dependenttranslation(overexpressionofeIF4E)(Larssonetal.2007).Inagreementwiththe10

observationthat5IMtranscriptshavemoresecondarystructureupstreamofthestartcodon11

andnearthe5’cap,transcriptswithhigh5IMPscoresweremorelikelytobetranslationally,12

butnottranscriptionally,upregulateduponeIF4Eoverexpression(Fig3C;WilcoxonRankSum13

Testp=2.05e-22,andp=0.28,respectively).14

(IV)Non-AUGstartcodons.Transcriptswithnon-AUGstartcodonsalsohave15

intrinsicallylowtranslationinitiationefficiencies(HinnebuschandLorsch2012).These16

mRNAsweregreatlyenrichedamongtranscriptswithhigh5IMPscores(Fisher’sExactTestp17

=0.0003;oddsratio=3.9)andhaveamedian5IMPscorethatis3.57higherthanthosewith18

anAUGstart(Fig3D).19

(V)Codonoptimality.Theefficiencyoftranslationelongationisaffectedbycodon20

optimality(HershbergandPetrov2008).Althoughsomeaspectsofthisremaincontroversial21

(CharneskiandHurst2013;Shahetal.2013;ZinshteynandGilbert2013;Gerashchenkoand22

Gladyshev2015),itisclearthatdecodingofcodonsbytRNAswithdifferentabundancescan23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 12: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

12

affectthetranslationrateunderconditionsofcellularstress(reviewedinGingoldandPilpel1

2011).WethereforeexaminedthetRNAadaptationindex(tAI),whichcorrelateswithcopy2

numbersoftRNAgenesmatchingagivencodon(dosReisetal.2004).Specifically,we3

calculatedthemediantAIofthefirst99codingnucleotidesofeachtranscript,andfoundthat4

5IMPscorewasnegativelycorrelatedwithtAI(FigS1A;SpearmanCorrelationrho=-0.23;p<5

2e-16mediantAIand5IMPscore).Thiseffectwasrestrictedtotheearlycodingregionsasthe6

negativecontrolsetofrandomlychosensequencesdownstreamofthe3rdexonfromeach7

transcriptdidnotexhibitarelationshipbetween5IMPscoreandcodonoptimality(FigS1B-C).8

Thus,5IMtranscriptsshowreducedcodonoptimalityinearlycodingregions,suggestingthat9

5IMtranscriptshavedecreasedtranslationelongationefficiency.10

Tomorepreciselydeterminewherethecodonoptimalityphenomenonoccurswithin11

theentireearlycodingregion,wegroupedtranscriptsby5IMPscore.Foreachgroup,we12

calculatedthemeantAIatcodons2-33(i.e.,nts4-99).Acrossthisentireregion,5IM13

transcripts(5IMP>7.41;5%FDR)hadsignificantlylowertAIvaluesateverycodonexcept14

codons24and32(Fig3E;WilcoxonRankSumtestHolm-adjustedp<0.05forall15

comparisons).Toeliminatepotentialconfoundingvariables,includingnucleotidecomposition,16

weperformedseveraladditionalcontrolanalyses(Methods);noneofthesealteredthebasic17

conclusionthat5IMtranscriptshavelowercodonoptimalitythannon-5IMtranscriptsacross18

theentireearlycodingregion.19

(VI)RibosomespermRNA.Finally,weexaminedtherelationshipbetween5IMPscore20

andtranslationefficiency,asmeasuredbythesteady-statenumberofribosomespermRNA21

molecule.Tothisend,weusedalargedatasetofribosomeprofilingandRNA-Seqexperiments22

fromhumanlymphoblastoidcelllines(Ceniketal.2015).Fromthis,wecalculatedtheaverage23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 13: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

13

numberofribosomesoneachtranscriptandidentifiedtranscriptswithhighorlowribosome1

occupancy(respectivelydefinedbyoccupancyatleastonestandarddeviationaboveorbelow2

themean;seeMethods).5IMtranscriptswereslightlybutsignificantlydepletedinthehigh3

ribosome-occupancycategory(Fig3F;Fisher’sExactTestp=0.0006,oddsratio=1.3).4

Moreover,5IMPscoresexhibitedaweakbutsignificantnegativecorrelationwiththenumber5

ofribosomespermRNAmolecule(Spearmanrho=-0.11;p=5.98e-23).6

Takentogether,alloftheaboveresultsrevealthat5IMtranscriptshavesequence7

featuresassociatedwithlowertranslationefficiency,atthestagesofbothtranslationinitiation8

andelongation.9

10

Non-ERtrafficked5IMtranscriptsareenrichedinER-proximalribosomeoccupancy11

Wenextinvestigatedtherelationshipbetween5IMPscoreandthelocalizationoftranslation12

withincells.Exploringthesubcellularlocalizationoftranslationatatranscriptome-scale13

remainsasignificantchallenge.Yet,arecentstudydescribedproximity-specificribosome14

profilingtoidentifymRNAsoccupiedbyER-proximalribosomesinbothyeastandhumancells15

(Janetal.2014).Inthismethod,ribosomesarebiotinylatedbasedontheirproximitytoa16

markerproteinsuchasSec61,whichlocalizestotheERmembrane(Janetal.2014).Foreach17

transcript,theenrichmentforbiotinylatedribosomeoccupancyyieldsameasureofER-18

proximityoftranslatedmRNAs.19

Wereanalyzedthisdatasettoexploretherelationbetween5IMPscoresandER-20

proximalribosomeoccupancyinHEK-293cells.Asexpected,transcriptsthatexhibitthe21

highestenrichmentforER-proximalribosomeswereSSCR-transcriptsandtranscriptswith22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 14: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

14

otherER-targetingsignals.Yet,wenoticedasurprisingcorrelationbetweenER-proximal1

ribosomeoccupancyand5IMtranscriptswithnoER-targetingevidence(Fig4).This2

relationshipwastrueforbothmitochondrialgenes(Fig4;Spearmanrho=0.43;p<2.2e-16),3

andgeneswithnoevidenceforeitherER-ormitochondrial-targeting(Fig4;Spearmanrho=-4

0.36;p<2.2e-16).Theseresultssuggestthat5IMtranscriptsaremorelikelythannon-5IM5

transcriptstoengagewithER-proximalribosomes.6

7

5IMtranscriptsarestronglyenrichedinnon-canonicalEJCoccupancysites8

Sharedsequencefeaturesandfunctionaltraitsamong5IMtranscriptscausesonetowonder9

whatcommonmechanismsmightlink5IMsequencefeaturesto5IMtraits.Forexample,5IM10

transcriptsmightshareregulationbyoneormoreRNA-bindingproteins(RBPs).To11

investigatethisideafurther,wetestedforenrichmentof5IMtranscriptsamongthe12

experimentallyidentifiedtargetsof23differentRBPs(includingCLIP-Seq,andvariants;see13

Methods).Onlyonedatasetwassubstantiallyenrichedforhigh5IMPscoresamongtargets:a14

transcriptome-widemapofbindingsitesoftheExonJunctionComplex(EJC)inhumancells,15

obtainedviatandem-immunoprecipitationfollowedbydeepsequencing(RIPiT)(Singhetal.16

2014,2012).TheEJCisamulti-proteincomplexthatisstablydepositedupstreamofexon-17

exonjunctionsasaconsequenceofpre-mRNAsplicing(LeHiretal.2000).RIPiTdata18

confirmedthatcanonicalEJCsites(cEJCsites;sitesboundbyEJCcorefactorsandappearing19

~24ntsupstreamofexon-exonjunctions)occupy~80%ofallpossibleexon-exonjunction20

sitesandarenotassociatedwithanysequencemotif.Unexpectedly,manyEJC-associated21

footprintsoutsideofthecanonical-24regionwereobserved(Fig5A)(Singhetal.2012).22

These‘non-canonical’EJCoccupancysites(ncEJCsites)wereassociatedwithmultiple23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 15: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

15

sequencemotifs,threeofwhichweresimilartoknownrecognitionmotifsforSRproteinsthat1

co-purifiedwiththeEJCcoresubunits.Interestingly,anothermotif(Fig5B;top)thatwas2

specificallyfoundinfirstexonsisnotknowntobeboundbyanyknownRNA-bindingprotein3

(Singhetal.2012).ThismotifwassimilartotheCG-richmotif(Fig5B;bottom)thatwefound4

tobeassociatedwithin5UI–and5IMtranscripts(Ceniketal.2011).Thissimilaritypresages5

thepossibilityofenrichmentoffirstexonncEJCsitesamong5IMtranscripts.6

PositionanalysisofcalledEJCpeaksrevealedthatwhileonly9%ofcEJCsresideinfirst7

exons,19%ofallncEJCsarefoundthere.Whenweinvestigatedtherelationshipbetween5IMP8

scoresandncEJCsinearlycodingregions,wefoundastrikingcorrespondencebetweenthe9

two–themedian5IMPscorewashighestfortranscriptswiththegreatestnumberofncEJCs10

(Fig5C;WilcoxonRankSumTest;p<0.0001).Whenwerepeatedthisanalysisbyconditioning11

on5UIstatus,wesimilarlyfoundthatncEJCswereenrichedamongtranscriptswithhigh5IMP12

scoresregardlessof5UIstatus(Fisher’sExactTest,p<3.16e-14,oddsratio>2.3;Fig5D).13

TheseresultssuggestthatthestrikingenrichmentofncEJCpeaksinearlycodingregionswas14

generallyapplicabletoalltranscriptswithhigh5IMPscoresregardlessof5UIpresence.15

TranscriptsharboringN1-methyladenosine(m1A)havehigh5IMPscores16

ItisincreasinglyclearthatribonucleotidebasemodificationinmRNAsishighlyprevalentand17

canbeamechanismforpost-transcriptionalregulation(Fryeetal.2016).OneRNA18

modificationpresenttowardsthe5’endsofmRNAtranscriptsisN1-methyladenosine(m1A)19

(Lietal.2016;Dominissinietal.2016),initiallyidentifiedintotalRNAandrRNAs(Dunn1961;20

Hall1963;KlootwijkandPlanta1973).Intriguingly,thepositionofm1Amodificationshas21

beenshowntobemorecorrelatedwiththepositionofthefirstintronthanwith22

transcriptionalortranslationalstartsites(Figure2gfromDominissinietal.2016).Thefirst23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 16: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

16

intronwasthenearestsplicesitefor85%ofm1As(Dominissinietal.2016).When5’UTR1

intronswerepresent,m1Awasfoundtobenearthefirstsplicesiteregardlessoftheposition2

ofthestartcodon(Dominissinietal.2016).Giventhat5IMtranscriptsarealsocharacterized3

bythepositionofthefirstintron,weinvestigatedtherelationshipbetween5IMPscoreand4

m1ARNAmodificationmarks.5

Weanalyzedtheunionofpreviouslyidentifiedm1Amodifications(Dominissinietal.6

2016)acrossallcelltypesandconditions.Notethatalthoughthereissomeevidencethatthese7

marksdependoncelltypeandgrowthcondition,itisdifficulttobeconfidentofthecelltype8

andcondition-dependenceofanyparticularmarkgivenexperimentalvariation(seeMethods).9

WefoundthatmRNAswithm1Amodificationearlyinthecodingregion(first99nucleotides)10

hadsubstantiallyhigher5IMPscoresthanmRNAslackingthesemarks(Fig6A;WilcoxonRank11

SumTestp=3.4e-265),andweregreatlyenrichedamong5IMtranscripts(Fig6A;Fisher’s12

ExactTestp=1.6e-177;oddsratio=3.8).Inotherwords,thesequencefeatureswithinthe13

earlycodingregionthatdefine5IMPtranscriptsalsopredictm1Amodificationintheearly14

codingregion.15

Wenextwonderedwhether5IMPscorewaspredictingm1Amodificationgenerally,or16

predictingitspreciselocation.Indeed,manyofthepreviouslyidentifiedm1Apeakswere17

withinthe5’UTRsofmRNAs(Lietal.2016).Interestingly,5IMPscoreswereonlypredictiveof18

m1Amodificationintheearlycodingregion,andnotpredictiveofm1Amodificationinthe19

5’UTR(Fig6B).Thisofferstheintriguingpossibilitythatthesequencefeaturesthatdefine20

5IMPtranscriptsareco-localizedwithm1Amodification.21

22

Discussion:23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 17: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

17

Coordinatingtheexpressionoffunctionallyrelatedtranscriptsduringcellularresponse1

tostimuliandenvironmentcanbeachievedbypost-transcriptionalprocessessuchassplicing,2

RNAexport,RNAlocalizationortranslation(MooreandProudfoot2009).SetsofmRNAs3

subjecttoacommonregulatorytranscriptionalprocesscanexhibitcommonsequencefeatures4

thatdefinethemtobeaclass.Forexample,transcriptssubjecttoregulationbyparticular5

miRNAstendtosharecertainsequencesintheir3’UTRsthatarecomplementarytothe6

regulatorymiRNA(AmeresandZamore2013).Similarly,transcriptsthatsharea5’terminal7

oligopyrimidinetractarecoordinatelyregulatedbymTORandribosomalproteinS6kinase8

(Meyuhas2000).Herewequantitativelydefine‘5IM’transcriptsasaclassthatshares9

commonsequenceelementsandfunctionalproperties.The5IMclasscomprises21%ofall10

humantranscripts.11

Whereas35%ofhumantranscriptshaveoneormore5’UTRintrons,themajorityof12

5IMtranscriptshaveneithera5’UTRintronnoranintroninthefirst90ntsoftheORF.Other13

sharedfeaturesof5IMtranscriptsincludesequencefeaturesassociatedwithlowtranslation14

initiationrates.Theseare:(1)atendencyforRNAsecondarystructureintheregion15

immediatelyprecedingthestartcodon(Fig3B),andnearthe5’cap;(2)translational16

upregulationuponoverexpressionofeIF4E(Fig3C);and(3)morefrequentuseofnon-AUG17

startcodons(Fig3D).Alsoconsistentwithlowintrinsictranslationefficiencies,5IM18

transcriptsadditionallytendtodependonlessabundanttRNAstodecodethebeginningofthe19

openreadingframe(Fig3E).20

WehadpreviouslyreportedthattranscriptsencodingproteinswithER-and21

mitochondrial-targetingsignalsequences(SSCRsandMSCRs,respectively)areover-22

representedamongthe65%oftranscriptslacking5’UTRintrons(Ceniketal.2011).23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 18: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

18

Transcriptsinthissetareenrichedforthesequencefeaturesdetectedbyour5IMclassifier.By1

examiningtheseenrichedsequencefeatures,weshowedthatthe5IMclassextendsbeyond2

mRNAsencodingmembraneproteins.Janetal.recentlydevelopedatranscriptome-scale3

methodtoidentifymRNAsoccupiedbyER-proximalribosomesinbothyeastandhumancells4

(Janetal.2014).Asexpected,transcriptsknowntoencodeER-traffickedproteinswerehighly5

enrichedforER-proximalribosomeoccupancy.However,theirdataalsoshowedmany6

transcriptsencodingnon-ERtraffickedproteinstoalsobeengagedwithER-proximal7

ribosomes(ReidandNicchitta2015b).Similarly,severalotherstudieshavesuggestedacritical8

roleofER-proximalribosomesintranslatingseveralcytoplasmicproteins(ReidandNicchitta9

2015a).Here,wefoundthat5IMtranscriptsincludingthosethatarenotER-traffickedor10

mitochondrialweresignificantlymorelikelytoexhibitbindingtoER-proximalribosomes11

(Figs4A-B).12

InadditiontoribosomesdirectlyresidentontheER,aninterestingpossibilityisthe13

presenceofapoolofperi-ERribosomes(Janetal.2015;ReidandNicchitta2015a).14

Associationof5IMtranscriptswithsuchaperi-ERribosomepoolcouldpotentiallyexplainthe15

observedcorrelationof5IMstatuswithbindingtoER-proximalribosomes.TheERis16

physicallyproximaltomitochondria(RowlandandVoeltz2012),soperi-ERribosomesmay17

includethosetranslatingmRNAsonmitochondria(i.e.,mRNAswithMSCRs)(Sylvestreetal.18

2003).However,evenwhentranscriptscorrespondingtoER-traffickedandmitochondrial19

proteinswereexcludedfromconsideration,ER-proximalribosomeenrichmentand5IMP20

scoreswerehighlycorrelated(Fig4A).Thusanothersharedfeatureof5IMtranscriptsistheir21

translationonorneartheERregardlessoftheultimatedestinationoftheencodedprotein.22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 19: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

19

Inanattempttoidentifyacommonfactorbinding5IMtranscripts,weaskedwhether1

5IMtranscriptswereenrichedamongtheexperimentallyidentifiedtargetsof23different2

RBPs.OnlyoneRBPemerged--theexonjunctioncomplex(EJC).Specifically,weobserveda3

dramaticenrichmentofnon-canonicalEJC(ncEJC)bindingsiteswithintheearlycodingregion4

of5IMtranscripts.Further,theCG-richmotifidentifiedforncEJCsinfirstexonsisstrikingly5

similartotheCG-richmotifenrichedinthefirstexonsof5IMtranscripts(Fig5B).Previous6

workimplicatedRanBP2,aproteinassociatedwiththecytoplasmicfaceofthenuclearpore,as7

abindingfactorforsomeSSCRs(Mahadevanetal.2013).ThisfindingmaysuggestthatEJC8

occupancyonthesetranscriptsmaybeinfluencedbynuclearporeproteinsduringnuclear9

export.10

EJCdepositionduringtheprocessofpre-mRNAsplicingenablesthenuclearhistoryof11

anmRNAtoinfluencepost-transcriptionalprocessesincludingmRNAlocalization,translation12

efficiency,andnonsensemediateddecay(Changetal.2007;KervestinandJacobson2012;13

Choeetal.2014).WhilecanonicalEJCbindingoccursatafixeddistanceupstreamofexon-exon14

junctionsandinvolvesdirectcontactbetweenthesugar-phosphatebackboneandtheEJCcore15

anchoringproteineIF4AIII,ncEJCbindingsiteslikelyreflectstableengagementbetweenthe16

EJCcoreandothermRNPproteins(e.g.,SRproteins)recognizingnearbysequencemotifs.17

AlthoughsomeRBPswereidentifiedforncEJCmotifsfoundininternalexons(Singhetal.18

2014,2012),todatenocandidateRBPhasbeenidentifiedfortheCG-richncEJCmotiffoundin19

thefirstexon.IfthismotifdoesresultfromanRBPinteraction,itislikelytobeoneormoreof20

theca.70proteinsthatstablyandspecificallybindtotheEJCcore(Singhetal.2012).21

Finally,weobservedadramaticenrichmentform1Amodificationsamong5IM22

transcripts.Giventhisstrikingenrichmentperhapsitisperhapsnotsurprisingthatm1A23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 20: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

20

containingmRNAswerealsoshowntohavemorestructured5’UTRsthatareGC-rich1

comparedtom1AlackingmRNAs(Dominissinietal.2016).Similarto5IMtranscripts,m1A2

containingmRNAswerefoundtodecoratestartcodonsthatappearinahighlystructured3

context.WhileALKBH3hasbeenidentifiedasaproteinthatcandemethylatem1A,itis4

currentlyunknownwhetherthereareanyproteinsthatcanspecificallyactas“readers”of5

m1A.RecentstudieshavebeguntoidentifysuchreadersforothermRNAmodificationssuchas6

YTHDF1,YTHDF2,WTAPandHNRNPA2B1(Pingetal.2014;Liuetal.2014;Wangetal.2014,7

2015;Alarcónetal.2015).Ourstudyhighlightsapossiblelinkbetweennon-canonicalEJC8

bindingandm1A.Hence,ourresultsyieldtheintriguinghypothesisthatoneormoreoftheca.9

70proteinsthatstablyandspecificallybindtotheEJCcorecanfunctionasanm1Areader.10

Anotherintriguingpossibilityisthat5IMtranscriptfeaturesassociatedwithlower11

intrinsictranslationefficiencymaytogetherenablegreater‘tunability’of5IMtranscriptsatthe12

translationstage.Regulatedenhancementorrepressionoftranslation,for5IMtranscripts,13

couldallowforrapidchangesinproteinlevels.Thereareanalogiestothisscenarioin14

transcriptionalregulation,whereinhighlyregulatedgenesoftenhavepromoterswithlow15

baselinelevelsthatcanberapidlymodulatedthroughtheactionofregulatorytranscription16

factors.Asmoreribosomeprofilingstudiesarepublishedexaminingtranslationalresponses17

transcriptome-wideundermultipleperturbations,conditionsunderwhich5IMtranscriptsare18

translationallyregulatedmayberevealed.19

Takentogether,ouranalysesrevealtheexistenceofadistinct‘5IM’classcomprising20

21%ofhumantranscripts.Thisclassisdefinedbydepletionof5’proximalintrons,presence21

ofspecificRNAsequencefeaturesassociatedwithlowtranslationefficiency,non-canonical22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 21: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

21

bindingbytheExonJunctionComplexandanenrichmentforN1-methyladenosine1

modification.2

3

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 22: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

22

MaterialsandMethods:1

DatasetsandAnnotations2

HumantranscriptsequencesweredownloadedfromtheNCBIhumanReferenceGene3

Collection(RefSeq)viatheUCSCtablebrowser(hg19)onJun252010(Kentetal.2002;Pruitt4

etal.2005).Transcriptswithfewerthanthreecodingexons,andtranscriptswherethefirst995

codingnucleotidesstraddlemorethantwoexonswereremovedfromfurtherconsideration.6

Thecriteriaforexclusionofgeneswithfewerthanthreecodingexonswastoensurethatthe7

analysisofdownstreamregionswaspossibleforallgenesthatwereusedinouranalysisof8

earlycodingregions.Specifically,thedownstreamregionswereselectedrandomlyfrom9

downstreamofthe3rdexons.Hence,geneswithfewerexonswouldnotbeabletocontributea10

downstreamregionpotentiallycreatingaskewinrepresentation.Transcriptswereclustered11

basedonsequencesimilarityinthefirst99codingnucleotides.Specifically,eachtranscript12

pairwasalignedusingBLASTwiththeDUSTfilterdisabled(Altschuletal.1990).Transcript13

pairswithBLASTE-values<1e-25weregroupedintotranscriptclustersthatwere14

subsequentlyassignedtooneoffourcategories:MSCR-containing,SSCR-containing,S–/MSCR–15

SignalP+,orS–/MSCR–SignalP–asfollows:16

MSCR-containingtranscriptswereannotatedusingMitoCartaandothersourcesas17

describedin(Ceniketal.2011).SSCR-containingtranscriptswerethesetoftranscripts18

annotatedtocontainsignalpeptidesintheEnsemblGenev.58annotations,whichwere19

downloadedthroughBiomartonJun252010.FortranscriptswithoutanannotatedMSCRor20

SSCR,thefirst70aminoacidswereanalyzedusingSignalP3.0(Bendtsenetal.2004).Using21

theeukaryoticpredictionmode,transcriptswereassignedtotheS–/MSCR–SignalP+category22

ifeithertheHiddenMarkovModelortheArtificialNeuralNetworkclassifiedthesequenceas23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 23: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

23

signalpeptidecontaining.AllremainingtranscriptclusterswereassignedtotheS–/MSCR–1

SignalP–category.2

Foreachtranscriptcluster,wealsoconstructedmatchedcontrolsequences.Control3

sequenceswerederivedfromasinglerandomlychosenin-frame‘window’downstreamofthe4

3rdexonfromtheevaluatedtranscripts.Ifanevaluatedtranscripthadfewerthan995

nucleotidesdownstreamofthe3rdexon,nocontrolsequencewasextracted.5UIlabelsand6

transcriptclusteringforthecontrolsequenceswereinheritedfromtheevaluatedtranscript.7

Therationaleforthisdecisionisthatouranalysisdependsonthepositionofthefirstintron;8

hencegeneswithfewerthantwoexonsneedtobeexcluded,asthesewillnothaveintrons.We9

furtherrequiredthematchedcontrolsequencestofalldownstreamoftheearlycodingregion.10

Inthevastmajorityofcasesthethirdexonfelloutsidethefirst99nucleotidesofthecoding11

region,makingthisaconvenientcriterionbywhichtochoosecontrolregions.12

SequenceFeaturesandMotifDiscovery13

36sequencefeatureswereextractedfromeachtranscript(TableS1).Thesequence14

featuresincludedtheratioofargininestolysines,theratioofleucinestoisoleucines,adenine15

content,lengthofthelongeststretchwithoutadenines,preferenceagainstcodonsthatcontain16

adeninesorthymines.ThesefeatureswerepreviouslyfoundtobeenrichedinSSCR-17

containingandcertain5UI–transcripts(Ceniketal.2011;Palazzoetal.2007).Inaddition,we18

extractedratiosbetweenseveralotheraminoacidpairsbasedonhaving19

biochemical/evolutionarysimilarity,i.e.havingpositivescores,accordingtotheBLOSUM6220

matrix(HenikoffandHenikoff1992).Toavoidextremeratiosgiventherelativelyshort21

sequencelength,pseudo-countswereaddedtoaminoacidratiosusingtheirrespective22

genome-wideprevalence.23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 24: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

24

Inaddition,weusedthreepublishedmotiffindingalgorithms(AlignACE,DEME,and1

MoAN(Rothetal.1998;RedheadandBailey2007;Valenetal.2009))todiscoverRNA2

sequencemotifsenrichedamong5UI–transcripts.AlignACEimplementsaGibbssampling3

approachandisoneofthepioneeringeffortsinmotifdiscovery(Rothetal.1998).We4

modifiedtheAlignACEsourcecodetorestrictmotifsearchestoonlytheforwardstrandofthe5

inputsequencestoenableRNAmotifdiscovery.DEMEandMoANadoptdiscriminative6

approachestomotiffindingbysearchingformotifsthataredifferentiallyenrichedbetween7

twosetsofsequences(RedheadandBailey2007;Valenetal.2009).MoANhastheadditional8

advantageofdiscoveringvariablelengthmotifs,andcanidentifyco-occurringmotifswiththe9

highestdiscriminativepower.10

Intotal,sixmotifswerediscoveredusingthethreemotiffindingalgorithms(TableS1).11

Positionspecificscoringmatricesforallmotifswereusedtoscorethefirst99-lpositionsin12

eachsequence,wherelisthelengthofthemotif.Weassessedthesignificanceofeachmotif13

instancebycalculatingthep-valueofenrichment(Fisher’sExactTest)among5UI–transcripts14

consideringalltranscriptswithamotifinstanceachievingaPSSMscoregreaterthanequalto15

theinstanceunderconsideration.Thesignificancescoreandpositionofthetwobestmotif16

instanceswereusedasfeaturesfortheclassifier(TableS1).17

5IMClassifierTrainingandPerformanceEvaluation18

WemodifiedanimplementationoftheRandomForestclassifier(Breiman2001)to19

modeltherelationshipbetweensequencefeaturesintheearlycodingregionandtheabsence20

of5’UTRintrons(5UIs).Thisclassifierdiscriminatestranscriptswith5’proximal-intron-21

minus-like-codingregionsandhenceisnamedthe‘5IM’classifier.Thetrainingsetforthe22

classifierwascomposedofSSCRtranscriptsexclusively.Thereweretworeasonstorestrict23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 25: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

25

modelconstructiontoSSCRtranscripts:1)weexpectedthepresenceofspecificRNAelements1

asafunctionof5UIpresencebasedonourpreviouswork(Ceniketal.2011);and2)we2

wantedtorestrictmodelbuildingtosequencesthathavesimilarfunctionalconstraintsatthe3

proteinlevel.4

Weobservedthat5UI–transcriptswithintronsproximalto5’endofthecodingregion5

havesequencecharacteristicssimilarto5UI+transcripts(Fig1D).Tosystematically6

characterizethisrelationship,webuiltdifferentclassifiersusingtrainingsetsthatexcluded7

5UI–transcriptswithacodingregionintronpositionedatincreasingdistancesfromthestart8

codon.Weevaluatedtheperformanceofeachclassifierusing10-foldcrossvalidation.9

Giventhatalargenumberofmotifdiscoveryiterationswereneeded,wesoughtto10

reducethecomputationalburden.Weisolatedasubsetofthetrainingexamplestobeused11

exclusivelyformotiffinding.Motifdiscoverywasperformedonceusingthissetofsequences,12

andthesamemotifswereusedineachfoldofthecrossvalidationforalltheclassifiers.13

Imbalancesbetweenthesizesofpositiveandnegativetrainingexamplescanleadto14

detrimentalclassificationperformance(WangandYao2012).Hence,webalancedthetraining15

setsizeof5UI–and5UI+transcriptsbyrandomlysamplingfromthelargerclass.We16

constructed10sub-classifierstoreducesamplingbias,andforeachtestexample,the17

predictionscorefromeachsubclassifierwassummedtoproduceacombinedscorebetween018

and10.Fortherestoftheanalyses,weusedtheclassifiertrainedusing5UI–transcriptswhere19

firstcodingintronfallsoutsidethefirst90codingnucleotides(Fig1E).20

Weevaluatedclassifierperformanceusinga10-foldcrossvalidationstrategyforSSCR-21

containingtranscripts(i.e.thetrainingset).Ineachfoldofthecross-validation,themodelwas22

trainedwithoutanyinformationfromtheheld-outexamples,includingmotifdiscovery.Forall23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 26: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

26

theothertranscriptsandthecontrolsets(seeabove),the5IMPscoreswerecalculatedusing1

theclassifiertrainedusingSSCRtranscriptsasdescribedabove.5IMPscoredistributionfor2

thecontrolsetwasusedtocalculatetheempiricalcumulativenulldistribution.Usingthis3

function,wedeterminedthep-valuecorrespondingtothe5IMPscoreforalltranscripts.We4

correctedformultiplehypothesestestingusingtheqvalueRpackage(Storey2003).Basedon5

thisanalysis,weestimatethata5IMPscoreof7.41correspondstoa5%FalseDiscoveryRate6

andsuggestthat21%ofallhumantranscriptscanbeconsideredas5IMtranscripts.7

Whilethetheoreticalrangeof5IMPscoresis0-10,thehighestobserved5IMPis9.855.8

Wenotethatforallfiguresthatdepict5IMPscoredistributions,wedisplayedtheentire9

theoreticalrangeof5IMPscores(0-10).10

FunctionalCharacterizationof5IMTranscripts:11

Wecollectedgenome-scaledatasetfrompublicallyavailabledatabasesandfrom12

supplementaryinformationprovidedinselectedarticles.Forallanalyzeddatasets,wefirst13

convertedallgene/transcriptidentifiers(IDs)intoRefSeqtranscriptIDsusingtheSynergizer14

webserver(BerrizandRoth2008).Ifadatasetwasgeneratedusinganon-humanspecies(ex.15

TargetsidentifiedbyTDP-43RNAimmunoprecipitationinratneuronalcells),weused16

Homologenerelease64(downloadedonSep282009)toidentifythecorrespondingortholog17

inhumans.Ifatleastonememberofatranscriptclusterwasassociatedwithafunctional18

phenotype,weassignedtheclustertothepositivesetwithrespecttothefunctionalphenotype.19

Ifmorethanonememberofaclusterhadthefunctionalphenotype,weonlyretainedonecopy20

oftheclusterunlesstheydifferedinaquantitativemeasurement.Forexample,considertwo21

hypotheticaltranscripts:NM_1andNM_2thatwereclusteredtogetherandhavea5IMPscore22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 27: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

27

of8.5.IfNM_1hadanmRNAhalf-lifeof2hrwhileNM_2’shalf-lifewas1hrthanwesplitthe1

clusterwhilepreservingthe5IMPscoreforbothNM_1andNM_2.2

Oncethetranscriptswerepartitionedbasedonthefunctionalphenotype,werantwo3

statisticaltests:1)Fisher’sExactTestforenrichmentof5IMtranscriptswithinthefunctional4

category;2)WilcoxonRankSumTesttocompare5IMPscoresbetweentranscriptspartitioned5

bythefunctionalphenotype.Additionally,fordatasetswhereaquantitativemeasurementwas6

available(ex.mRNAhalf-life),wecalculatedtheSpearmanrankcorrelationbetween5IMP7

scoresandthequantitativevariable.Intheseanalyses,weassumedthatthetestspacewasthe8

entiresetofRefSeqtranscripts.Forallphenotypeswhereweobservedapreliminary9

statisticallysignificantresult,wefollowedupwithmoredetailedanalysesdescribedbelow.10

RNACo-expressionAnalysis:11

Transcriptswithincreasedordecreasedexpressionwereextractedfromallmicroarray12

datasetsdepositedinArrayExpressonOct2010(Parkinsonetal.2007).Ofthe~650013

microarrayexperiments,atleast10differentiallyexpressedtranscriptswereidentifiedin14

~5100experiments.Fisher’sexacttestwasusedtocalculatetheenrichmentof5IMoran15

expression-matchedcontrolsetoftranscriptsamongdifferentiallyexpressedtranscripts.16

BaselineexpressionwasdeterminedasthemeanRNAexpressionacross16humantissuesas17

measuredbytheIlluminaBodymapRNA-Seqdatafor16humantissues(E-MTAB-513,18

downloadedonApr2015).FPKMbasedRNAexpressionquantificationwasused19

(http://www.ebi.ac.uk/gxa/experiments/E-MTAB-513/analysis-methods).Toensurethat20

baselineexpressionofthecontrolsetmatchesthatof5IMtranscripts,wefirstdiscretized21

meanlog10FPKMacrosstissuesbyroundingtoonedecimalplace.ForeachdiscretizedFPKM22

value,wesampledwithoutreplacementthesamenumberoftranscriptsas5IMtranscripts23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 28: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

28

withthegivenexpressionlevel.Notethattoensurethatthegeneexpressiondistributionis1

matchedexactlyweallowedtherandomsettoincludetranscriptsthatare5IM.Hence,the2

approachleadstoaconservativeestimateofenrichment.3

AnalysisofFeaturesAssociatedwithTranslation:4

Foreachtranscript,wepredictedthepropensityforsecondarystructureprecedingthe5

translationstartsiteandthe5’cap.Specifically,weextracted35nucleotidesprecedingthe6

translationstartsiteorthefirst35nucleotidesofthe5’UTR.Ifa5’UTRisshorterthan357

nucleotides,thetranscriptwasremovedfromtheanalysis.hybrid-ss-minutility(UNAFold8

packageversion3.8)withdefaultparameterswasusedtocalculatetheminimumfolding9

energy(MarkhamandZuker2008).10

CodonoptimalitywasmeasuredusingthetRNAAdaptationIndex(tAI),whichisbased11

onthegenomiccopynumberofeachtRNA(dosReisetal.2004).tAIforallhumancodons12

weredownloadedfromTulleretal.2010TableS1.tAIprofilesforthefirst30aminoacids13

werecalculatedforalltranscripts.Codonoptimalityprofilesweresummarizedforthefirst3014

aminoacidsforeachtranscriptorbyaveragingtAIateachcodon.15

Wecarriedouttwocontrolexperimentstotestwhethertheassociationbetween5IMP16

scoreandtAIcouldbeexplainedbyconfoundingvariables.First,wepermutedthenucleotides17

inthefirst90ntsandobservednorelationshipbetween5IMPscoreandmeantAIwhenthese18

permutedsequenceswereused(FigS1).Second,weselectedrandomin-frame99nucleotides19

from3rdexontotheendofthecodingregionandfoundnosignificantdifferencesintAI(Fig20

S1).TheseresultssuggestthattherelationshipbetweentAIand5IMPscoreisconfinedtothe21

first30aminoacidsandisnotexplainedbysimpledifferencesinnucleotidecomposition.22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 29: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

29

RibosomeprofilingandRNAexpressiondataforhumanlymphoblastoidcells(LCLs)1

weredownloadedfromNCBIGEOdatabaseaccessionnumber:GSE65912.Translation2

efficiencywascalculatedaspreviouslydescribed(Ceniketal.2015).Mediantranslation3

efficiencyacrossthedifferentcell-typeswasusedforeachtranscript. 4

AnalysisofProximitySpecificRibosomeProfilingData:5

WedownloadedproximityspecificribosomeprofilingdataforHEK293cellsfromJan6

etal.2014;TableS6.WeconvertedUCSCgeneidentifierstoHGNCsymbolsusingg:Profiler7

(Reimandetal.2011).WeretainedallgeneswithanRPKM>5ineitherinputorpulldownand8

requiredthatatleast30readsweremappedineitherofthetwolibraries.WeusedER-9

targetingevidencecategories“secretome”,“phobius”,“TMHMM”,“SignalP”,“signalSequence”,10

“signalAnchor”from(Janetal.2014)toannotategenesashavingER-targetingevidence.The11

genesthatdidnothaveanyER-targetingevidenceor“mitoCarta”/“mito.GO”annotationswere12

deemedasthesetofgeneswithnoER-targetingormitochondrialevidence.Wecalculatedthe13

log2oftheratiobetweenER-proximalribosomepulldownRPKMandcontrolRPKMasthe14

measureofenrichmentforER-proximalribosomeoccupancy(asinJanetal.2014).Amoving15

averageofthisratiowascalculatedforgenesgroupedbytheir5IMPscore.Forthiscalculation,16

weusedbinsof30mitochondrialgenesor100geneswithnoevidenceofER-ormitochondrial17

targeting.18

AnalysisofGenome-wideBindingSitesofExonJunctionComplex:19

Dr.GeneYeoandGabrielPrattgenerouslyshareduniformlyprocessedpeakcallsfor20

experimentsidentifyinghumanRNAbindingproteintargets.Thesedatasetsincludevarious21

CLIP-SeqdatasetsanditsvariantssuchasiCLIP.Atotal49datasetsfrom22factorswere22

analyzed.Thesefactorswere:hnRNPA1,hnRNPF,hnRNPM,hnRNPU,Ago2,hnRNPU,HuR,23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 30: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

30

IGF2BP1,IGF2BP2,IGF2BP3,FMR1,eIF4AIII,PTB,IGF2BP1,Ago3,Ago4,MOV10,Fip1,CF1

Im68,CFIm59,CFIm25,andhnRNPA2B1.Weextractedthe5IMPscoresforalltargetsofeach2

RBP.WecalculatedtheWilcoxonRankSumteststatisticcomparingthe5IMPscore3

distributionofthetargetsofeachRBPtoallothertranscriptswith5IMPscores.Noneofthe4

testedRBPtargetsetshadanadjustedp-value<0.05andamediandifferencein5IMPscore>5

1whencomparedtonon-targettranscripts.6

Inaddition,weusedRNA:proteinimmunoprecipitationintandem(RIPiT)datato7

determineExonJunctionComplex(EJC)bindingsites(Singhetal.2014,2012).Weanalyzed8

thecommonpeaksfromtheY14-Magohimmunoprecipitationconfiguration(Singhetal.2012;9

Kucukuraletal.2013).CanonicalEJCbindingsitesweredefinedaspeakswhoseweighted10

centeris15to32nucleotidesupstreamofanexon-intronboundary.Allremainingpeakswere11

deemedas“non-canonical”EJCbindingsites.Weextractedallnon-canonicalpeaksthat12

overlappedthefirst99nucleotidesofthecodingregionandrestrictedouranalysisto13

transcriptsthathadanRPKMgreaterthanoneinthematchedRNA-Seqdata.14

Analysisofm1Amodifiedtranscripts15

WedownloadedthelistofRNAsobservedtocontainm1AfromLietal.(2016)and16

Dominissinietal.(2016).RefSeqtranscriptidentifierswereconvertedtoHGNCsymbolsusing17

g:Profiler(Reimandetal.2011).Theoverlapbetweenthetwodatasetswasdeterminedusing18

HGNCsymbols.m1Amodificationsthatoverlapthefirst99nucleotidesofthecodingregion19

weredeterminedusingbedtools(QuinlanandHall2010).20

Lietal.(2016)identified600transcriptswithm1AmodificationinnormalHEK29321

cells.Ofthese,368transcriptswerenotfoundtocontainthem1AmodificationinHEK293cells22

byDominissinietal.(2016).Yet,81%ofthesewerefoundtobem1Amodifiedinothercell23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 31: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

31

types.Lietal.(2016)alsoanalyzedm1AuponH2O2treatmentandserumstarvationinHEK2931

cellsandidentifiedmanym1Amodificationsthatareonlyfoundthesestressconditions.2

However,20%of371transcriptsharboringstress-inducedm1Amodificationswerefoundin3

normalHEK293cellsbyDominissinietal.(2016).Takentogether,theseanalysessuggestthat4

transcriptome-widem1Amapsremainincomplete.Hence,weanalyzedthe5IMPscoresofall5

mRNAswithm1Aacrosscelltypesandconditions.WereportedresultsusingtheDominissini6

etal.datasetbutthesameconclusionsweresupportedbym1AmodificationsfromLietal.7

8

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 32: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

32

Acknowledgements:1

WewouldliketothankGeneYeo,andGabrielPrattforsharingpeakcallsforRNA-2

bindingproteintargetidentificationexperiments(CLIP,PAR-CLIP,iCLIP).WethankElif3

SarinayCenik,BaşarCenik,andAlperKüçükuralforhelpfuldiscussions.F.P.R.andH.N.C.were4

supportedbytheCanadaExcellenceResearchChairsProgram.Thisworkwassupportedin5

partbyNIHgrantsGM50037(M.J.M.),HG001715andHG004233(F.P.R.),theKrembil6

Foundation,anOntarioResearchFundaward,andtheAvonFoundation(F.P.R.).A.F.P.was7

supportedbyagrantfromtheCanadianInstitutesofHealthResearchtoA.F.P.(FRN102725).8

Authorcontributions:CC,MJM,andFPRdesignedthestudy.CCandHNCimplementedthe9

randomforestclassifierandcarriedoutallanalyses.GSandAAperformedinitialexperiments.10

MPSandAFPprovidedcriticalfeedback.MJMandFPRjointlysupervisedtheproject.CC,FPR11

andMJMwrotethemanuscriptwithinputfromallauthors.12

13

14

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 33: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

33

References:1

AlarcónCR,GoodarziH,LeeH,LiuX,TavazoieS,TavazoieSF.2015.HNRNPA2B1IsaMediatorof2

m(6)A-DependentNuclearRNAProcessingEvents.Cell162:1299–1308.3

AltschulSF,GishW,MillerW,MyersEW,LipmanDJ.1990.Basiclocalalignmentsearchtool.JMolBiol4

215:403–410.5

AmeresSL,ZamorePD.2013.DiversifyingmicroRNAsequenceandfunction.NatRevMolCellBiol14:6

475–488.7

BabendureJR,BabendureJL,DingJ-H,TsienRY.2006.ControlofmammaliantranslationbymRNA8

structurenearcaps.RNA12:851–861.9

BendtsenJD,NielsenH,vonHeijneG,BrunakS.2004.Improvedpredictionofsignalpeptides:SignalP10

3.0.JMolBiol340:783–795.11

BerrizGF,RothFP.2008.TheSynergizerservicefortranslatinggene,proteinandotherbiological12

identifiers.Bioinformatics24:2272–2273.13

BreimanL.2001.RandomForests.MachLearn45:5–32.14

CenikC,ChuaHN,ZhangH,TarnawskySP,AkefA,DertiA,TasanM,MooreMJ,PalazzoAF,RothFP.15

2011.Genomeanalysisrevealsinterplaybetween5’UTRintronsandnuclearmRNAexportfor16

secretoryandmitochondrialgenes.PLoSGenet7:e1001366.17

CenikC,DertiA,MellorJC,BerrizGF,RothFP.2010.Genome-widefunctionalanalysisofhuman5’18

untranslatedregionintrons.GenomeBiol11:R29.19

CenikC,SarinayCenikE,ByeonGW,GrubertF,CandilleSI,SpacekD,AlsallakhB,TilgnerH,ArayaCL,20

TangH,etal.2015.IntegrativeanalysisofRNA,translationandproteinlevelsrevealsdistinct21

regulatoryvariationacrosshumans.GenomeRes.http://dx.doi.org/10.1101/gr.193342.115.22

ChangY-F,ImamJS,WilkinsonMF.2007.Thenonsense-mediateddecayRNAsurveillancepathway.23

AnnuRevBiochem76:51–74.24

CharneskiCA,HurstLD.2013.Positivelychargedresiduesarethemajordeterminantsofribosomal25

velocity.PLoSBiol11:e1001508.26

ChoeJ,RyuI,ParkOH,ParkJ,ChoH,YooJS,ChiSW,KimMK,SongHK,KimYK.2014.eIF4AIIIenhances27

translationofnuclearcap-bindingcomplex-boundmRNAsbypromotingdisruptionofsecondary28

structuresin5’UTR.ProcNatlAcadSciUSA111:E4577–86.29

DominissiniD,NachtergaeleS,Moshitch-MoshkovitzS,PeerE,KolN,Ben-HaimMS,DaiQ,DiSegniA,30

Salmon-DivonM,ClarkWC,etal.2016.ThedynamicN(1)-methyladenosinemethylomein31

eukaryoticmessengerRNA.Nature530:441–446.32

dosReisM,SavvaR,WernischL.2004.Solvingtheriddleofcodonusagepreferences:atestfor33

translationalselection.NucleicAcidsRes32:5036–5044.34

DunnDB.1961.Theoccurrenceof1-methyladenineinribonucleicacid.BiochimBiophysActa46:198–35

200.36

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 34: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

34

FryeM,JaffreySR,PanT,RechaviG,SuzukiT.2016.RNAmodifications:whathavewelearnedand1

whereareweheaded?NatRevGenet17:365–372.2

GerashchenkoMV,GladyshevVN.2015.Translationinhibitorscauseabnormalitiesinribosome3

profilingexperiments.NucleicAcidsRes42:e134.4

GingoldH,PilpelY.2011.Determinantsoftranslationefficiencyandaccuracy.MolSystBiol7:481.5

HallRH.1963.Methodforisolationof2′-O-methylribonucleosidesandN1-methyladenosinefrom6

ribonucleicacid.BiochimicaetBiophysicaActa(BBA)-SpecializedSectiononNucleicAcidsand7

RelatedSubjects68:278–283.8

HenikoffS,HenikoffJG.1992.Aminoacidsubstitutionmatricesfromproteinblocks.ProcNatlAcadSci9

USA89:10915–10919.10

HershbergR,PetrovDA.2008.Selectiononcodonbias.AnnuRevGenet42:287–299.11

HinnebuschAG,LorschJR.2012.Themechanismofeukaryotictranslationinitiation:newinsightsand12

challenges.ColdSpringHarbPerspectBiol4.http://dx.doi.org/10.1101/cshperspect.a011544.13

HongX,ScofieldDG,LynchM.2006.Intronsize,abundance,anddistributionwithinuntranslated14

regionsofgenes.MolBiolEvol23:2392–2404.15

JanCH,WilliamsCC,WeissmanJS.2015.LOCALTRANSLATION.ResponsetoCommenton“Principlesof16

ERcotranslationaltranslocationrevealedbyproximity-specificribosomeprofiling.”Science348:17

1217.18

JanCH,WilliamsCC,WeissmanJS.2014.PrinciplesofERcotranslationaltranslocationrevealedby19

proximity-specificribosomeprofiling.Science346:1257521.20

KentWJ,SugnetCW,FureyTS,RoskinKM,PringleTH,ZahlerAM,HausslerD.2002.Thehuman21

genomebrowseratUCSC.GenomeRes12:996–1006.22

KervestinS,JacobsonA.2012.NMD:amultifacetedresponsetoprematuretranslationaltermination.23

NatRevMolCellBiol13:700–712.24

KlootwijkJ,PlantaRJ.1973.AnalysisofthemethylationsitesinyeastribosomalRNA.EurJBiochem39:25

325–333.26

KucukuralA,ÖzadamH,SinghG,MooreMJ,CenikC.2013.ASPeak:anabundancesensitivepeak27

detectionalgorithmforRIP-Seq.Bioinformatics29:2485–2486.28

LarssonO,LiS,IssaenkoOA,AvdulovS,PetersonM,SmithK,BittermanPB,PolunovskyVA.2007.29

EukaryoticTranslationInitiationFactor4E–InducedProgressionofPrimaryHumanMammary30

EpithelialCellsalongtheCancerPathwayIsAssociatedwithTargetedTranslationalDeregulation31

ofOncogenicDriversandInhibitors.CancerRes67:6814–6824.32

LeeES,AkefA,MahadevanK,PalazzoAF.2015.Theconsensus5’splicesitemotifinhibitsmRNA33

nuclearexport.PLoSOne10:e0122743.34

LeHirH,IzaurraldeE,MaquatLE,MooreMJ.2000.Thespliceosomedepositsmultipleproteins20-2435

nucleotidesupstreamofmRNAexon-exonjunctions.EMBOJ19:6860–6869.36

LiuJ,YueY,HanD,WangX,FuY,ZhangL,JiaG,YuM,LuZ,DengX,etal.2014.AMETTL3-METTL1437

complexmediatesmammaliannuclearRNAN6-adenosinemethylation.NatChemBiol10:93–95.38

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 35: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

35

LiX,XiongX,WangK,WangL,ShuX,MaS,YiC.2016.Transcriptome-widemappingrevealsreversible1

anddynamicN1-methyladenosinemethylome.NatChemBiol.2

http://dx.doi.org/10.1038/nchembio.2040(AccessedMarch29,2016).3

MahadevanK,ZhangH,AkefA,CuiXA,GueroussovS,CenikC,RothFP,PalazzoAF.2013.4

RanBP2/Nup358potentiatesthetranslationofasubsetofmRNAsencodingsecretoryproteins.5

PLoSBiol11:e1001545.6

MarintchevA,EdmondsKA,MarintchevaB,HendricksonE,ObererM,SuzukiC,HerdyB,SonenbergN,7

WagnerG.2009.TopologyandregulationofthehumaneIF4A/4G/4Hhelicasecomplexin8

translationinitiation.Cell136:447–460.9

MarkhamNR,ZukerM.2008.UNAFold:softwarefornucleicacidfoldingandhybridization.Methods10

MolBiol453:3–31.11

MeyuhasO.2000.Synthesisofthetranslationalapparatusisregulatedatthetranslationallevel.EurJ12

Biochem267:6321–6330.13

MooreMJ,ProudfootNJ.2009.Pre-mRNAprocessingreachesbacktotranscriptionandaheadto14

translation.Cell136:688–700.15

PalazzoAF,MahadevanK,TarnawskySP.2013.ALREX-elementsandintrons:twoidentityelements16

thatpromotemRNAnuclearexport.WileyInterdiscipRevRNA4:523–533.17

PalazzoAF,SpringerM,ShibataY,LeeC-S,DiasAP,RapoportTA.2007.Thesignalsequencecoding18

regionpromotesnuclearexportofmRNA.PLoSBiol5:e322.19

ParkinsonH,KapusheskyM,ShojatalabM,AbeygunawardenaN,CoulsonR,FarneA,HollowayE,20

KolesnykovN,LiljaP,LukkM,etal.2007.ArrayExpress--apublicdatabaseofmicroarray21

experimentsandgeneexpressionprofiles.NucleicAcidsRes35:D747–50.22

ParsyanA,SvitkinY,ShahbazianD,GkogkasC,LaskoP,MerrickWC,SonenbergN.2011.mRNA23

helicases:thetacticiansoftranslationalcontrol.NatRevMolCellBiol12:235–245.24

PingX-L,SunB-F,WangL,XiaoW,YangX,WangW-J,AdhikariS,ShiY,LvY,ChenY-S,etal.2014.25

MammalianWTAPisaregulatorysubunitoftheRNAN6-methyladenosinemethyltransferase.Cell26

Res24:177–189.27

PruittKD,TatusovaT,MaglottDR.2005.NCBIReferenceSequence(RefSeq):acuratednon-redundant28

sequencedatabaseofgenomes,transcriptsandproteins.NucleicAcidsRes33:D501–4.29

QuinlanAR,HallIM.2010.BEDTools:aflexiblesuiteofutilitiesforcomparinggenomicfeatures.30

Bioinformatics26:841–842.31

RedheadE,BaileyTL.2007.DiscriminativemotifdiscoveryinDNAandproteinsequencesusingthe32

DEMEalgorithm.BMCBioinformatics8:385.33

ReidDW,NicchittaCV.2015a.DiversityandselectivityinmRNAtranslationontheendoplasmic34

reticulum.NatRevMolCellBiol16:221–231.35

ReidDW,NicchittaCV.2015b.LOCALTRANSLATION.Commenton“PrinciplesofERcotranslational36

translocationrevealedbyproximity-specificribosomeprofiling.”Science348:1217.37

ReimandJ,ArakT,ViloJ.2011.g:Profiler—awebserverforfunctionalinterpretationofgenelists38

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 36: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

36

(2011update).NucleicAcidsResgkr378.1

RothFP,HughesJD,EstepPW,ChurchGM.1998.FindingDNAregulatorymotifswithinunaligned2

noncodingsequencesclusteredbywhole-genomemRNAquantitation.NatBiotechnol16:939–3

945.4

RowlandAA,VoeltzGK.2012.Endoplasmicreticulum–mitochondriacontacts:functionofthejunction.5

NatRevMolCellBiol13:607–625.6

ShahP,DingY,NiemczykM,KudlaG,PlotkinJB.2013.Rate-limitingstepsinyeastproteintranslation.7

Cell153:1589–1601.8

SinghG,KucukuralA,CenikC,LeszykJD,ShafferSA,WengZ,MooreMJ.2012.ThecellularEJC9

interactomerevealshigher-ordermRNPstructureandanEJC-SRproteinnexus.Cell151:750–764.10

SinghG,RicciEP,MooreMJ.2014.RIPiT-Seq:ahigh-throughputapproachforfootprintingRNA:protein11

complexes.Methods65:320–332.12

StoreyJD.2003.Thepositivefalsediscoveryrate:aBayesianinterpretationandtheq-value.AnnStat13

31:2013–2035.14

StuartJM,SegalE,KollerD,KimSK.2003.Agene-coexpressionnetworkforglobaldiscoveryof15

conservedgeneticmodules.Science302:249–255.16

SylvestreJ,VialetteS,CorralDebrinskiM,JacqC.2003.LongmRNAscodingforyeastmitochondrial17

proteinsofprokaryoticoriginpreferentiallylocalizetothevicinityofmitochondria.GenomeBiol4:18

R44.19

TullerT,CarmiA,VestsigianK,NavonS,DorfanY,ZaborskeJ,PanT,DahanO,FurmanI,PilpelY.2010.20

Anevolutionarilyconservedmechanismforcontrollingtheefficiencyofproteintranslation.Cell21

141:344–354.22

ValenE,SandelinA,WintherO,KroghA.2009.Discoveryofregulatoryelementsisimprovedbya23

discriminatoryapproach.PLoSComputBiol5:e1000562.24

VogelC,MarcotteEM.2012.Insightsintotheregulationofproteinabundancefromproteomicand25

transcriptomicanalyses.NatRevGenet13:227–232.26

WangS,YaoX.2012.MulticlassImbalanceProblems:AnalysisandPotentialSolutions.IEEETransSyst27

ManCybernBCybern.http://dx.doi.org/10.1109/TSMCB.2012.2187280.28

WangX,LuZ,GomezA,HonGC,YueY,HanD,FuY,ParisienM,DaiQ,JiaG,etal.2014.N6-29

methyladenosine-dependentregulationofmessengerRNAstability.Nature505:117–120.30

WangX,ZhaoBS,RoundtreeIA,LuZ,HanD,MaH,WengX,ChenK,ShiH,HeC.2015.N(6)-31

methyladenosineModulatesMessengerRNATranslationEfficiency.Cell161:1388–1399.32

ZinshteynB,GilbertWV.2013.LossofaconservedtRNAanticodonmodificationperturbscellular33

signaling.PLoSGenet9:e1003675.34

35

36

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 37: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

37

FigureLegends:1

Fig1|Modelingtherelationshipbetweensequencefeaturesintheearlycoding2

regionandtheabsenceof5’UTRintrons(5UIs).(a)Forallhumantranscripts,3

informationabout36nucleotide-levelfeaturesoftheearlycodingregion(first994

nucleotides)and5UIpresencewasextracted.(b)Transcriptscontainingasignalsequence5

codingregion(SSCR)wereusedtotrainaRandomForestclassifierthatmodeledthe6

relationshipbetween5UIabsenceand36sequencefeatures.(c)Withthisclassifier,all7

humantranscriptswereassignedascorethatquantifiesthelikelihoodof5UIabsence8

basedonspecificRNAsequencefeaturesintheearlycodingregion.Transcriptswithhigh9

scoresarethusconsideredtohave5’-proximalintronminus-likecodingregions(5IMs).10

(d)“5’UTR-intron-minus-predictor”(5IMP)scoredistributionsforSSCR-containing11

transcriptsshifttohigherscoreswithlater-appearingfirstintrons,suggestingthat5IM12

codingregionfeaturesnotonlypredictlackofa5UI,butalsolackofearlycodingregion13

introns.(e)Classifierperformancewasoptimizedbyexcluding5UI–transcriptswith14

intronsappearingearlyinthecodingregion.Cross-validationperformance(areaunderthe15

precisionrecallcurve,AUPRC)wasexaminedforaseriesofalternative5IMclassifiers16

usingdifferentfirst-intron-positioncriterionforexcluding5UI–transcriptsfromthe17

trainingset(Methods).18

19

Fig2|Predicting5UIstatusaccuratelyusingonlyearlycodingsequences.(a)As20

judgedbyareaunderthereceiveroperatingcharacteristiccurve(AUROC)andAUPRC,The21

5IMclassifierperformedwellforseveraldifferenttranscriptclasses.(b)Thedistribution22

of5IMPscoresrevealsclearseparationof5UI+and5UI–transcriptsforSSCR-containing23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 38: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

38

transcripts,whereeachSSCR-containingtranscriptwasscoredusingaclassifierthatdid1

notusethattranscriptintraining(Methods).(c)Codingsequencefeaturesthatare2

predictiveof5’proximalintronpresencearerestrictedtotheearlycodingregion.Thiswas3

supportedbyidentical5IMclassifierscoredistributionswithrespectto5UIpresencefor4

negativecontrolsequences,eachderivedfromasinglerandomlychosen‘window’5

downstreamofthe3rdexonfromoneoftheevaluatedtranscripts.(d)MSCRtranscripts6

exhibitedamajordifferencein5IMPscoresbasedontheir5UIstatuseventhoughnoMSCR7

transcriptswereusedintrainingtheclassifier.(e)Transcriptspredictedtocontainsignal8

peptides(SignalP+)hada5IMPscoredistributionsimilartothatofSSCR-containing9

transcripts.(f)AftereliminatingSSCR,MSCR,andSignalP+transcripts,theremainingS–10

/MSCR–SignalP–transcriptswerestillsignificantlyenrichedforhigh5IMclassifierscores11

among5UI–transcripts.(g)Thecontrolsetofrandomlychosensequencesdownstreamof12

the3rdexonfromeachtranscriptwasusedtocalculateanempiricalcumulativenull13

distributionof5IMPscores.Usingthisfunction,wedeterminedthep-valuecorresponding14

tothe5IMPscoreforalltranscripts.Thereddashedlineindicatesthep-value15

correspondingto5%FalseDiscoveryRate.16

17

Fig3|5IMtranscriptsaremorelikelytobedifferentiallyexpressedandhave18

sequencefeaturesassociatedwithlowertranslationefficiency(a)5IMtranscripts19

weresignificantlyenrichedamongdifferentiallyexpressedgenesthatwereidentifiedfrom20

~6500microarraydatasets(Methods).Theobservedp-valuedistribution(blue)using21

5IMPscoresdiffered(p<2.2e-16;Chi-squaredtest)fromthatusing5IMPscoresofan22

identically-sizedbaseline-expression-matchedrandomtranscriptset(red).(b)The5IM23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 39: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

39

classifierscorewaspositivelycorrelatedwiththepropensityformRNAstructure(-ΔG)1

(Spearmanrho=0.39;p<2.2e-16).Foreachtranscript,35nucleotidesprecedingthestart2

codonwereusedtocalculate-ΔG(Methods).(c)Transcriptsthataretranslationally3

upregulatedinresponsetoeIF4Eoverexpression(Larssonetal.2007)(blue)were4

enrichedforhigher5IMPscores.Lightgreenshadingindicates5IMPscorescorresponding5

to5%FDR.(d)Transcriptswithnon-AUGstartcodons(blue)exhibitedsignificantlyhigher6

5IMPscoresthantranscriptswithacanonicalATGstartcodon(yellow).(e)Higher5IMP7

scoreswereassociatedwithlessoptimalcodons(asmeasuredbythetRNAadaptation8

index,tAI)forthefirst33codons.Foralltranscriptswithineach5IMPscorecategory9

(blue-high,orange-low),themeantAIwascalculatedateachcodonposition.Startcodon10

wasnotshown.(f)Transcriptswithlowertranslationefficiencywereenrichedforhigher11

5IMPscores.Transcriptswithtranslationefficiencyonestandarddeviationbelowthe12

mean(“LOW”translation,yellow)andonestandarddeviationhigherthanthemean13

(“HIGH”translation,blue)wereidentifiedusingribosomeprofilingandRNA-Seqdatafrom14

humanlymphoblastoidcelllines(Methods).15

16

Fig4|5IMtranscriptswithnoevidenceofER-targetingaremorelikelytoexhibitER-17

proximalribosomeoccupancy.AmovingaverageofER-proximalribosomeoccupancy18

wascalculatedbygroupinggenesby5IMPscore(seeMethods).Weplottedthemoving19

averageof5IMPscoresfortranscriptswithnoevidenceofER-ormitochondrialtargeting20

(green)orfortranscriptspredictedtobemitochondrial(purple).Weplottedarandom21

subsampleoftranscriptsontopofthemovingaverage(circles).22

23

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 40: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

40

Fig5|5IMtranscriptsharbornon-canonicalExonJunctionComplex(EJC)binding1

sites(a)ObservedEJCbindingsites(Singhetal.2012)areshownforanexample5IM2

transcript(LAMC1).CanonicalEJCbindingsites(purple)are~24ntupstreamofanexon-3

intronboundary.Theremainingbindingsitesareconsideredtobenon-canonical(green).4

(b)Asequencemotif(top)previouslyidentifiedtobeenrichedamongncEJCbindingsites5

infirstexons(Figure6fromSinghetal.2012)ishighlysimilartoamotif(bottom)6

enrichedintheearlycodingregionsof5UI–transcripts(Figure6fromCeniketal.2011)(c)7

5IMPscorefortranscriptswithzero,one,twoormorenon-canonicalEJCbindingsitesin8

thefirst99codingnucleotidesrevealsthattranscriptswithhigh5IMPscoresfrequently9

harbornon-canonicalEJCbindingsites.(d)Transcriptswithhigh5IMPscoresareenriched10

fornon-canonicalEJCsregardlessof5UIpresenceorabsence.11

12

Fig6|5IMtranscriptsareenrichedformRNAswithearlycodingregionm1A13

modifications(a)Transcriptswithm1Amodifications(blue)inthefirst99codingnucleotides14

exhibitsignificantenrichmentfor5IMtranscriptsandhavehigher5IMPSscoresthan15

transcriptswithoutm1Amodificationsinthefirst99codingnucleotides(yellow).(b)16

Transcriptswithm1Amodifications(blue)inthe5’UTRdonotdisplayasimilarenrichment.17

18

SupportingInformation19

FigureS1|Associationbetween5IMPscoresandcodonoptimalityisrestrictedtothe20

first30aminoacidsandisnotexplainedbynucleotidecontent.(a)5IMtranscriptstend21

tohavelessoptimalcodonsintheirfirst30aminoacidsasmeasuredbytRNAadaptationindex22

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 41: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

CanCenik

41

(tAI).ThemediantAIforeachtranscriptwascalculatedandtranscriptsweregroupedbytheir1

5IMPscores.ThedistributionofmediantAIwasplottedasaboxplot.(b)Wepermutedthe2

nucleotidesofthefirst99nucleotidesandfoundthattherelationshipbetween5IMPandtAI3

waslost.Thisresultsuggestedthatnucleotidecompositionalonedoesn’texplainthe4

relationshipbetween5IMPandtAI.(c)Weusedthepreviouslydescribednegativecontrol5

sequences,eachderivedfromasinglerandomlychosenin-frame‘window’downstreamofthe6

3rdexonfromoneoftheevaluatedtranscripts.Wefoundnorelationshipbetween5IMPscore7

andtAIintheseregionssuggestingthattheobservedassociationisrestrictedtotheearly8

codingregion.9

TableS1-|Listoffeaturesusedbythe5IMclassifier.10

TableS2-|5IMPscoresofallhumantranscripts.11

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 42: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

5'UTR Intron (5UI) First 99 nucleotidesATGGCGCA

ATGACAGA

ATGTATAAT

Sequence 1

Sequence 2

Sequence N

5’UTRIntron Arg:Lys

+

1.3

3.5

4.0

Adenine Codon Bias

0.9

2.1

1.5

Extract CDS featuresA B

Model the relationship between early coding sequence and 5UI status

C

Grow Decision Tree Repeat SamplingGenerate Multiple Trees

Sequence 5Sequence 5Sequence 9Sequence 11

BootstrapSample

Assign ‘5IMP’ score reflecting 5UI– propensity5IMP Score

9.2/10Sequence X ATGAGCGCGT...?

Likely 5UI–

2.7/10Sequence Y ATGACTATAA...?

Likely 5UI+

5UTR

Intron 0-4

243

-5960

-7071

-84 85+

2

4

6

8

First Coding Intron Position

5IM

P Sc

ore

D

2000 50 100 1500.5

0.6

0.7

0.8

0.9

1.0

AUPR

C (5

UI– p

redi

ctio

n)

First Coding Intron Position

E90 nt cutoff

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 43: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

E

0 2 4 6 8 105IMP Score

0.00

0.10

0.20

0.30

Dens

ity

5UI+5UI–

SSCR–/ MSCR–/ SignalP+

0.00

0.10

0.20

0.30

C

Dens

ity

0 2 4 6 8 105IMP Score

3’ proximal exons

5UI+5UI–

A

SSCRMSCR

3’ proximal exons

S– /M– SignalP+

Prec

isio

n

Recall

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.2 0.4 0.6 0.8 1.0False Positive Rate

Reca

ll

0.0 0.2 0.4 0.6 0.8 1.00.0

0.2

0.4

0.6

0.8

1.0

D

0 2 4 6 8 105IMP Score

5UI+5UI–

MSCR

Dens

ity

0.00

0.10

0.20

0.30

B

Dens

ity

0 2 4 6 8 10

SSCR

0.00

0.10

0.20

0.30

5IMP Score

5UI+5UI–

0 2 4 6 8 10

SSCR–/ MSCR–/ SignalP–

5IMP Score

5UI+5UI–

0.00

0.10

0.20

0.30

Dens

ity

F

Wò]HS\L

5\TIL

Y�VM�;

YHUZJYPW[Z

0.0 0.2 0.4 0.6 0.8 1.00

1k

2k

G

0.00 0.01 0.02 0.030

200

400

�047�*SHZZPMPLY�7LYMVYTHUJL

S– /M– SignalP–

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 44: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

A

0.00

0.05

0.10

0.15

0.20

0.25

Dens

ity

NON-AUG

AUG

Start Codon

5IMP Score0 2 4 6 8 10

D

C

5IMP Score0 2 4 6 8 10

0.00

0.05

0.10

0.15

0.20

Dens

ity

TRANSLATIONDOWN TRANSLATION

UP

NO CHANGEEIF4E Overexpression

FRibosomes per mRNA

LOW

HIGH

0 2 4 6 8 105IMP Score

0.00

0.05

0.10

0.15

0.20

Dens

ity

B

5IMP Score

Seco

dary

Stru

ctur

e-¬

G

2 4 6 8

0

5

10

15

20

25

E

0.36

0.38

0.40

0.42

Codo

n O

ptim

ality

5 10 15 20 25 30Codon Position

5IM

Non-5IM

Mean tAI

Num

ber o

f Exp

erim

ents

p-value0.00 0.25 0.50 0.75 1.00

0

0.5k

1.0k

1.5k

randomobserved

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 45: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

A

Mitochondrial Evidence

2 4 6 8

5IMP score

,9òW

Yo_PT

HS�9PIV

ZVTL�6J

J\WH

UJ`

No ER-targeting or Mitochondrial Evidence

#��

�

�

%���

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 46: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

AScale 1 kb

LAMC1

200 basesScale

EJC Peak Calls

Non-Canonical EJC(ncEJC) Peaks

Canonical EJC(cEJC) Peaks

2 4 6 8 10

0

0.050.1

0.15

0.20

0.25

0.30

0.35

5IMP Score

Dens

ity

All TranscriptsFirst 99 Coding Nucleotides

No Peaks 1 Peak

2+ Peaks

C

ncEJC Peaks

ACG

CG

GCC

GGCGCGCG

C

ATGC

ACTGC

GTAGCGA

GC

B

First 99 Coding NucleotidesD

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35 5UI+ Transcripts

No ncEJC peak

1+ ncEJC peak

5UI– Transcripts

No ncEJC peak

1+ ncEJC peak

0 2 4 6 8 10 2 4 6 8 10

5IMP Score5IMP Score

Dens

ity

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 47: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

0.00

0.05

0.10

0.15

0.20

Dens

ity

0.00

0.05

0.10

0.15

0.20

Dens

ity

5IMP Score

0 2 4 6 8 10

A

5IMP Score

0 2 4 6 8 10

5’UTRFirst 99 Coding Nucleotides

B

m1A-modified m1A-modified

No m1A No m1A.CC-BY-ND 4.0 International licensea

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 48: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

0.25

0.30

0.35

0.40

0.45

5IMP Score

Med

ian

tAI

0-7.41 7.41-10

A

Codon Position

5 10 15 20 25 30

0.30

0.32

0.34

0.36

0.38

0.40

0.42

Codo

n O

ptim

ality

mea

n tA

I

First 99 nucleotides permuted

5IMP � 7.41 7.41 < 5IMP

B

Codon Position

5 10 15 20 25 30

0.30

0.32

0.34

0.36

0.38

0.40

0.42

Random 99 nucleotidesDownstream of 3rd Exon

Codo

n O

ptim

ality

mea

n tA

I

7.41 < 5IMP5IMP � 7.41

C.CC-BY-ND 4.0 International licensea

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 49: A Common Class of Transcripts with 5’-Intron Depletion ......2016/06/06  · 2 where are we headed?Nat Rev Genet 17: 365–372. 17: 365–372. 3

Table&S1.&List&of&Features&Used&in&Training&the&Random&Forest&

FEATURE&CLASS& PARTICULAR&FEATURES&USED&Ratio&of&Amino&Acids& Number&of&Leucines&/&Number&of&Isoleucines&

Number&of&Arginines&/&Number&of&Lysines&Number&of&Aspartates&/&Number&of&Glutamates&Number&of&Phenylalanines&/&Number&of&Tyrosines&Number&of&Leucines&/&Number&of&Valines&Number&of&Valines&/&Number&of&Isoleucines&Number&of&Glutamines&/&Number&of&Glutamates&

Nucleotide&Content& Percent&Adenines&Percent&Thymines&Length&of&the&Longest&Track&Lacking&Any&Adenines&

Codon&Preferences& Ratio&of&NonHAdenine&Codons&to&Adenine&Codons&Ratio&of&NonHThymine&Codons&to&Thymine&Codons&

Motif&Score&and&Position& MoAn&–&12&features&including&PSSM&scores&and&positions&of&the&top&two&occurrences&for&each&of&3&motifs&discovered&by&MoAn&[44] DEME&–&4&features&including&PSSM&scores&and&positions&of&the&top&two&occurrences&for&1&motif&discovered&by&DEME[43] AlignACE&–&8&features&including&PSSM&scores&and&positions&of&the&top&two&occurrences&for&each&of&2&motifs&discovered&by&AlignACE&[42]&

&

.CC-BY-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 7, 2016. ; https://doi.org/10.1101/057455doi: bioRxiv preprint