RNA-seq: quantification and models for assessing differential expression (at least for some approaches)
Ian Dworkin, NGS 2016
@IanDworkin
What we will cover today
• Absolute fundamentals of experimental design.
• Why we use count data as input.
• A bit of probability, to explain why many RNA-seq differential expression tools use a negative binomial.
• Why we care about variance/over-dispersion so much.
• How we estimate over-dispersion with small sample sizes (and why edgeR and DESeq give different results).
• A bit about dealing with multiple comparisons (if we have time).
Goals
I am not planning on trying to provide any sort of overview of statistical methods for genomic data. Instead I am going to provide a few short ideas to think about.
Statistics (like bioinformatics) is a rapidly developing area, in particular with respect to genomics. Rarely is it clear what the “right way” to analyze your data is.
Instead I hope to aid you in using some common sense when thinking about your experiments using high-throughput sequencing.
Caveats
• There are whole courses on proper experimental design and statistics. Great books too. This material in Bio720 is not enough!
• For experimental design I highly recommend:
– Quinn & Keough: Experimental Design and Data Analysis for Biologists.
http://www.amazon.com/Experimental-Design-Data-Analysis-Biologists/dp/0521009766/
The basics of experimental design
• There are a few basic points to always keep in mind:
– Biological replication (as much as you can afford) is extremely important. Robustly identifying differentially expressed (DE) genes requires statistical power.
• (Note: this is not how many reads you have for a gene within a sample, but how many biologically/statistically independent samples you have per treatment.)
– Technical replication does not help with statistical power (i.e. don’t split a single sample and run it as two libraries).
Biological replication gives far more statistical power than increased sequencing depth within a biological sample!!!!
• Sequencing (and library prep) costs are still high enough that most experiments use small numbers of biological replicates.
• Given the additional cost of library preparation (~$225/sample at our facility), many folks go for increased depth instead of more samples.
• For a given level of (total) sequencing depth for a treatment, it is far better to go for more biological replicates, each at lower sequencing depth (rather than fewer replicates at higher sequencing depth).
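A rough way to see why, as a sketch only: assume (this is not on the slide) that counts for a gene vary among biological replicates with variance mu + phi * mu^2, where mu is the expected count and phi is a biological dispersion. The squared coefficient of variation of the group mean is then (1/mu + phi) / n. Doubling depth only shrinks the 1/mu term, while doubling the number of replicates halves the whole thing. The gene, dispersion and replicate numbers below are hypothetical.

```python
def cv2_of_group_mean(mu, phi, n_reps):
    """Squared coefficient of variation of the mean count over n_reps replicates,
    assuming variance = mu + phi * mu**2 for a single replicate (an assumption)."""
    return (1.0 / mu + phi) / n_reps

# Hypothetical gene: 100 expected reads per sample, dispersion 0.1, 3 replicates.
base = cv2_of_group_mean(mu=100, phi=0.1, n_reps=3)
deeper = cv2_of_group_mean(mu=200, phi=0.1, n_reps=3)     # same samples, 2x depth
more_reps = cv2_of_group_mean(mu=100, phi=0.1, n_reps=6)  # 2x samples, same depth

print(f"baseline       CV^2 = {base:.4f}")
print(f"2x depth       CV^2 = {deeper:.4f}")
print(f"2x replicates  CV^2 = {more_reps:.4f}")
```

Both alternatives use the same total number of reads; only the extra replicates make a real dent in the uncertainty.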
Biological replication gives far more statistical power than increased sequencing depth within a biological sample!!!!
Robles et al. 2012
How do the methods compare in simulation?
Kvam et al. 2012
The basics of experimental design
• There are a few basic points to always keep in mind:
– Biological replication.
– Design your experiment to avoid confounding your different treatments (sex, nutrition) with each other or with technical variables (lane within a flow cell, between-flow-cell variation).
• Make diagrams/tables of your experimental design, or use a randomized design.
The basics of experimental design
• There are a few basic points to always keep in mind:
– Biological replication.
– Design your experiment to avoid confounding variables.
– Sample individuals (within treatment) randomly!
Useful references
Auer, P. L., & Doerge, R. W. (2010). Statistical design and analysis of RNA-Seq data. Genetics. doi:10.1534/genetics.110.114983, PMID: 20439781
Bullard, J. H., Purdom, E., Hansen, K. D., & Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. doi:10.1186/1471-2105-11-94
Designing your experiment before you start.
Sampling
Replication
Blocking
Randomization
Overall we are going to be thinking about how to avoid confounding sources of variation in the data.
All of these are larger topics that are part of experimental design.
Sampling
Sampling design is all about making sure that when you “pick” (sample) observations, you do so in a random and unbiased manner.
Proper sampling aims to control for unknown sources of variation that influence the outcome of your experiments.
This seems reasonable, and is often intuitive to most experimental biologists, but the pitfalls can be very insidious. Whiteboard…
Sampling
Biological replicates, not technical ones.
• There is little purpose in using technical replication (i.e. same sample, multiple library preps) from a given biological sample UNLESS part of your question revolves around it.
• Focus on biological variability. While you are confounding some sources of technical and biological variability, we already know a lot about the former, and little about the latter (in particular for your system).
Replication
Imagine you have an experiment with one factor (sex), with two treatment levels (males and females).
You want to look for sex-specific differences in the brains of your critters based on transcriptional profiling, so you decide to use RNA-seq.
Perhaps you have a limited budget, so you decide to run one sample of male brains and one sample of female brains, each in one lane of a flow cell.
What (useful) information can you get out of this?
Not much (but there may be some). Why?
Replication
Why?
No replication. How will you know if the differences you observe are due to differences between males and females, random (biological) differences between individuals, or technical variation due to RNA extraction, processing or running the samples on different lanes?
All of these sources of variation are confounded, and there are no particularly good ways of separating them out.
But there are lots of sources of variation, so how do we account for these?
Replication
To date, several studies have suggested that “technical” replicates for RNA-seq show very little variation/high correlation.
Mortazavi et al. 2008
How might such a statement be misleading about variation?
Replication
This study looked at a single source of technical variation:
running exactly the same sample on two different lanes of a flow cell.
This completely ignores other sources of “technical variation”:
• variation due to RNA purification
• variation due to fragmentation, labeling, etc.
• lane-to-lane variation
• flow-cell-to-flow-cell variation
All of these may be important (although unlikely to be interesting) sources of variation…
However…
Replication
Many studies have ignored the BIOLOGICAL SOURCES of VARIATION between replicates. In most cases biological variation between samples (from the same treatment) is far greater than technical sources of variation.
While it would be nice to be able to partition various sources of technical variation (such as labeling, RNA extraction), it is often too expensive to perform such a design (see whiteboard).
IF you have limited resources, it is generally far better to have biological replication (independent biological samples for a given treatment) than technical replication.
Does this lead to confounded sources of variation?
Blocking
Blocks in experimental design represent some factor (usually something not of major interest) that can strongly influence your outcomes. More importantly, it is a factor which you can use to group other factors that you are interested in.
For instance, in agriculture there is often plot-to-plot variation. You may not be interested in the plots themselves but in the variety of crops you are growing.
But what would happen if you grew all of strain 1 on plot 1 and all of strain 2 on plot 2?
Whiteboard.
These plots would represent blocking levels.
Blocking
In genomic studies the major blocking levels are often the slide/chip for microarrays (i.e. two samples/slide for 2-color arrays, 16 arrays/slide for Illumina arrays).
For GAII/HiSeq RNA-seq data the major blocking effects are the flow cell itself and lanes within the flow cell.
Auer and Doerge 2010
Blocking
Incorporating lanes as a blocking effect
Auer and Doerge 2010
Blocking designs
Balanced Incomplete Blocking Design (BIBD)
Let’s dissect these subscripts.
Balanced for treatments across flow cells. Randomized for location.
Auer and Doerge 2010
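A minimal sketch of the “balanced across flow cells, randomized for location” idea. It builds a complete block design (every treatment on every flow cell), which is simpler than the BIBD of Auer and Doerge and is not their procedure. The function name `blocked_layout` and the treatment labels are hypothetical.

```python
import random

def blocked_layout(treatments, n_flowcells, seed=0):
    """Put one sample of every treatment on every flow cell (the block),
    then randomize which lane each treatment occupies within the flow cell."""
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    layout = {}
    for flowcell in range(1, n_flowcells + 1):
        lanes = list(treatments)
        rng.shuffle(lanes)  # random lane order within this block
        layout[flowcell] = lanes
    return layout

treatments = ["male_fed", "male_starved", "female_fed", "female_starved"]
layout = blocked_layout(treatments, n_flowcells=3)
for flowcell, lanes in layout.items():
    print(f"flow cell {flowcell}: {lanes}")
```

Because every flow cell holds every treatment, a flow-cell effect cannot masquerade as a treatment effect; the shuffle guards against a systematic lane effect.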
What standard technical issues should you consider for blocking:
• Flow cell
• Lane
• Adaptors
• Library prep
• Same instrument
• People!
• RNA extraction/purification
What happens when you fail to block (or replicate)?
Yue F, Cheng Y, Breschi A, et al.: A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014; 515(7527): 355–364
Lin S, Lin Y, Nery JR, et al.: Comparison of the transcriptional landscapes between human and mouse tissues. Proc Natl Acad Sci USA. 2014; 111(48): 17224–17229
In a recent analysis of the mouse ENCODE data, the RNA-seq data suggested that samples clustered (for gene expression) more by species than by tissue. This was an unusual finding.
Gilad Y and Mizrahi-Man O. A reanalysis of mouse ENCODE comparative gene expression data [v1; ref status: indexed, http://f1000r.es/5ez] F1000Research 2015, 4:121 (doi: 10.12688/f1000research.6536.1)
A new re-analysis demonstrated some potentially serious issues with the experimental design.
Figure 1. Study design for Yue et al. 2014 and Lin et al. 2014 (from Gilad and Mizrahi-Man 2015).
Differential expression
• Probably the single most common use of RNA-seq data is to examine differential expression of transcripts (transcriptional profiles).
Differential expression
• But differential expression of what?
– Genes
– Transcripts (alternative transcripts)
– Allele-specific expression
– Exon-level expression
The primary goals of your experiment should guide your design.
• The exact details (# of biological samples, sample depth, read length, strand specificity) of how you perform your experiment need to be guided by your primary goal.
• Unless you have all the $$, no single design can capture all of the variability.
Your goals matter
• For instance: if your primary interest is the discovery of new transcripts, sampling deeply within a sample is probably best.
• For differential expression analyses, you will almost never have the ability to perform differential expression analysis on very rare transcripts, so it is rarely useful to generate more than 15-20 million read pairs per biological sample.
A simple truth: there is no technology nor statistical wizardry that can save a poorly planned experiment. The only truly failed experiment is a poorly planned one.
To consult the statistician after an experiment is finished is often merely to ask him (her) to conduct a post mortem examination. He (she) can perhaps say what the experiment died of.
Ronald Fisher
Counting
• One of the most difficult issues has been how to count.
• We first need to ask what features we want to count.
What features could we count?
• Counting at the level of genes (reads mapped to a gene regardless of transcript).
• Counting at the level of transcripts.
• Counting at the level of exons.
• Counting at the level of k-mers within one of the above.
• Counting at the level of nucleotides within exon/transcript/gene.
Counting
• We are interested in transcript abundance.
• But we need to take into account a number of things:
• How many reads in the sample.
• Length of transcripts.
• GC content and sequencing bias (influencing counts of transcripts within a sample).
Seemingly sensible counting (but ultimately not so useful).
• RPKM (reads aligned per kilobase of exon per million reads mapped) – Mortazavi et al. 2008
• FPKM (fragments per kilobase of exon per million fragments mapped). Same idea for paired-end sequencing.
• TPM, TMM… etc…
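For concreteness, a small sketch of how RPKM and TPM are usually computed from raw counts and feature lengths. The three-gene counts and lengths are made up, and the function names are my own.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of feature per million mapped reads."""
    total_reads = sum(counts)
    return [c / (length / 1e3) / (total_reads / 1e6)
            for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

counts = [100, 200, 700]       # hypothetical raw counts for three genes
lengths = [1000, 2000, 500]    # hypothetical gene lengths in bp

print("RPKM:", [round(v) for v in rpkm(counts, lengths)])
print("TPM: ", [round(v) for v in tpm(counts, lengths)])
```

TPM values always sum to one million within a sample, whereas the sum of RPKM values depends on the length distribution of whatever is expressed.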
Take home message (from me): actual counts should be used as input for differential expression analysis, not (pre)scaled measures.
BUT: not everyone agrees with this approach, nor with my arguments about counting.
Lior Pachter’s blog is a good place to watch the debate. Also check out some comments in the vignette and paper on limma/voom.
RPKM
Problems with RPKM
• RPKM is not a consistent measure of expression abundance (or relative molar concentration).
• See
– http://blog.nextgenetics.net/?e=51
– Wagner et al. 2012. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci.
How about transcripts per million (TPM)?
While TPM is in general more (statistically) consistent, it is still generally not appropriate.
Normalization (for DE) can be much more complicated in practice
• Why might the total number of reads (sequencing depth) be a misleading quantity to scale by?
Normalization (for DE) can be much more complicated in practice
• Scaling by total mapped reads (sequencing depth) can be substantially influenced by the small proportion of highly expressed genes. (What might happen?)
• A number of alternatives have been proposed and used (e.g. quantile normalization, etc.)
Bullard, J. H., Purdom, E., Hansen, K. D., & Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. doi:10.1186/1471-2105-11-94
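A toy illustration of what might happen, using made-up counts. Three genes have identical raw counts in both samples and only one highly expressed gene changes, yet scaling by total reads makes the unchanged genes look strongly down-regulated.

```python
sample_a = {"g1": 100, "g2": 100, "g3": 100, "big": 700}
sample_b = {"g1": 100, "g2": 100, "g3": 100, "big": 7000}  # only "big" differs

def per_million(counts):
    """Scale each count by the sample's total reads (counts per million)."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}

cpm_a = per_million(sample_a)
cpm_b = per_million(sample_b)
fold = {gene: cpm_b[gene] / cpm_a[gene] for gene in sample_a}

for gene, fc in fold.items():
    print(f"{gene}: apparent fold change B/A = {fc:.2f}")
```

The apparent fold change for g1 to g3 is 1000/7300, purely because the single large gene inflates the denominator in sample B.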
Counting (and normalizing) in practice
• In practice, we do not want to “pre-scale” our data as is done in F/R-PKM or TPM.
• Instead we are far better off using a model-based approach, normalizing for read length or library size within the data modeling per se.
• This is far more flexible.
Take home message: actual counts should be used as input for differential expression analysis, not (pre)scaled measures.
The issue is that getting unambiguous counts is hard (Rob).
Differential expression analysis. A primer.
• I am assuming that we have already decided on an appropriate method to count and convert mapped reads to discrete values…
• There is a bit we need to know to help us understand what to do next.
A bit of background on probability.
• Fundamentally our observed measures of expression are the counts of reads.
• Depending upon the data modeling framework we wish to use, we need to account for this, as these are not necessarily approximated well by the normal (Gaussian) distributions that are used for “standard” linear models like t-tests, ANOVA, regression.
• This is not a problem at all, as it is easy to model data coming from other distributions, and tools for this are widely available in stats packages and programming languages alike.
Probability density vs. mass function
Probability mass function for a discrete variable.
Probability density function for a continuous variable.
Probability mass function (for discrete distributions, like read counts)
P(13 | Poisson(λ = 10)) = 0.073
Height represents the probability at that point (integer).
The “area” of the box has no particular meaning.
P(integer) ≥ 0; P(non-integer) = 0.
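The value on the slide can be checked directly from the Poisson mass function, P(x | λ) = λ^x e^(−λ) / x!. A small sketch using only the standard library (the function name is my own):

```python
import math

def poisson_pmf(x, lam):
    """Probability of observing exactly x events when the expected number is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

p13 = poisson_pmf(13, 10)
print(f"P(13 | Poisson(lambda = 10)) = {p13:.3f}")

# The heights of all the bars sum to 1 (up to a negligible tail beyond 100).
print(f"sum over x = 0..99: {sum(poisson_pmf(x, 10) for x in range(100)):.6f}")
```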
Probability density function
Height at x = 13 is 0.0799. This is not the probability at x = 13, but the density, i.e. f(13) = 0.0799, where f(x) is the normal density function.
P(x = 13 | N(mean = 10, sd = 3.3)) = 0. WHY?
Probability density function
We can define the probability in the interval 10 ≤ x ≤ 15:
P(10 ≤ x ≤ 15 | N(10, 3.3)) = 0.435
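A sketch of the same two quantities for the normal: the density at a point (roughly 0.08 here, a height rather than a probability) and the probability of an interval, obtained as a difference of cumulative distribution values. The helper names are my own.

```python
import math

def normal_pdf(x, mean, sd):
    """Density (height of the curve) at x; this is not a probability."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def normal_cdf(x, mean, sd):
    """P(X <= x), written with the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

density_at_13 = normal_pdf(13, 10, 3.3)
prob_10_to_15 = normal_cdf(15, 10, 3.3) - normal_cdf(10, 10, 3.3)

print(f"f(13) = {density_at_13:.4f}")
print(f"P(10 <= x <= 15) = {prob_10_to_15:.3f}")
```

A single point has zero width, so the area above it (its probability) is zero even though the density there is positive.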
Clarifications on continuous distributions.
AREA UNDER THE CURVE OF THE PDF = 1
(The integral of the normal)
Bolker 2007, Ch. 4, page 137
The multitude of probability distributions allows us to choose those that match our data or theoretical expectations in terms of shape, location, scale.
Fitting a distribution is an art and science of utmost importance in probability modeling. The idea is you want a distribution to fit your data model “just right” without a fit that is “overfit” (or underfit). Overfitting models is sometimes a problem in modern data mining methods because the model’s fit can be too specific to a particular data set to be of broader use.
Seefeld 2007
So why do we use them? It’s all about shape and scale!
• Because they provide a usable framework for framing our questions, and allow for parametric methods, i.e. likelihood and Bayesian.
• Even if we do not know its actual distribution, it is clear that frequency data is generally going to be better fit by a binomial than by a normal distribution. Why?
Why will it be a better fit?
• The binomial is bounded: proportions lie between zero and 1 (counts between zero and N).
• Other distributions (gamma, Poisson, etc.) have a lower boundary at zero.
• This provides a convenient framework for the relationship between means and variances as one approaches the boundary condition.
Some discrete distributions (leading up to why we may want to use the negative binomial)
Binomial
Poisson
Negative binomial
Random variables
• This is what we want to know the probability distribution of.
• I.e. P(x | some distribution)
I will use “x” to be the random variable in each case.
Binomial
Let’s say you set up a series of enclosures. Within each enclosure you place 25 flies and a pre-determined set of predators. You want to know what the distribution (across enclosures) of flies getting eaten is, based on a pre-determined probability of success for a given predator species.
You can set this up as a binomial problem.
N (R calls this size) = 25, the total # of individuals or “trials” for predation in the enclosure
p = probability of a successful predation “trial” (the coin toss)
x = # of trials with successful predation. This is what we usually want the probability distribution for.
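Binomial
P(x | N, p) = (N choose x) p^x (1 − p)^(N − x)
You can think of the binomial coefficient (N choose x) in two ways.
A) A normalizing constant so that probabilities sum to 1.
B) The # of different combinations that allow for x “successful” predation events out of N total.
You will often see x = k and hear “N choose k”.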
Example
• If predator species 1 had a per-“trial” probability of successfully eating a prey item of 0.2, what would be the probability of exactly 10 flies (out of the 25) being eaten in a single enclosure?
P(x = 10 | bi(N = 25, p = 0.2)) = 0.0118
Not so high. We can look at the expected probability distribution for different values of x.
This would be the expected distribution if we set up many replicate enclosures with 25 flies and this predator.
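The number above, and the whole expected distribution across enclosures, can be sketched with the standard library alone (the function name is my own):

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

p10 = binomial_pmf(10, 25, 0.2)
print(f"P(x = 10 | N = 25, p = 0.2) = {p10:.4f}")

# Crude text histogram of the expected distribution over x = 0..12.
for x in range(13):
    prob = binomial_pmf(x, 25, 0.2)
    print(f"{x:2d} {prob:.3f} {'#' * round(prob * 100)}")
```

The distribution peaks at x = 5, the expected value N * p, which is why 10 flies eaten is an unusual outcome.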
Predator species 2 is much hungrier…
Let’s say we had 100 flies per enclosure, and predator species 3 was really ineffective, p = 0.01.
While there may be a theoretical limit to the number of flies that can be eaten, practically speaking it is unlimited since the predation probability is so low.
This is a lot like the situation we have with RNA-seq data.
Poisson
• When you have a discrete random variable where the probability of a “successful” trial is very small, but the theoretical (or practical) range is effectively infinite, you can use a Poisson distribution.
• Useful for counting # of “rare” events, like new migrants to a population/year.
• # of new mutations/offspring.
• # counts of sequencing reads (well, sort of)…
Poisson
• It is also seemingly useful for RNA-seq data (although we will see it is not very useful in practice).
Poisson
x is our random variable (# events/unit sampling effort): read counts for a gene in a sample.
λ is the “rate” parameter, i.e. the expected number of reads (for a transcript) per sample.
λ is the mean and the variance!!!!
For its relation to a binomial when N is large and p is small: λ = N*p
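A quick numerical check of that relation, using the ineffective predator from earlier (N = 100, p = 0.01, so λ = 1). The function names are my own.

```python
import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

n, p = 100, 0.01
lam = n * p  # lambda = N * p

for x in range(5):
    print(f"x = {x}: binomial {binomial_pmf(x, n, p):.4f}   "
          f"Poisson {poisson_pmf(x, lam):.4f}")

# Largest disagreement between the two distributions over all possible x.
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam))
              for x in range(n + 1))
print(f"largest difference: {max_gap:.4f}")
```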
Poisson
• Let’s say flies disperse to colonize a new patch at a very low rate (previous estimates suggest we will observe one fly for every two new patches we examine, λ = 0.5).
• What is the probability of observing 2 flies on a new patch of land?
P(x = 2 | Poisson(λ = 0.5)) = 0.076
Probability of observing x flies on a patch given lambda = 0.5
What happens as lambda increases?
[Figure: three Poisson distributions, with λ = 4, λ = 20 and λ = 100 (λ = expected # of reads for transcript x across samples). x-axis: # of reads for transcript x; y-axis: proportion of samples for transcript x.]
Poisson mean and variance
• When lambda is small for your random variable, you will often find that your data is “over-dispersed”.
• That is, there is more variation than expected under Poisson(lambda).
• Similarly, when lambda gets large, you will often find that there is less variation than expected under Poisson(lambda).
Anders and Huber 2010, Genome Biology
Why Poisson might not model sequence reads well
• Most RNA-seq data (and most count data in biology) is not modeled well by a Poisson, because the relationships between means and variances tend to be far more complicated among (and within) biological replicates.
• It has been argued (Mortazavi et al. 2008) that technical variation in RNA-seq is captured by a Poisson. I have my doubts even on this.
Quasi-Poisson
• Since over-dispersion is such a common issue, a number of approaches have been developed to account for it with count data.
• One is to use a quasi-Poisson.
• Instead of variance(x) = λ, it is
• variance(x) = λθ
• where θ is the (multiplicative) over-dispersion parameter.
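For a single gene with no covariates, a simple estimate of θ is the sample variance divided by the sample mean (this equals the Pearson dispersion estimate of an intercept-only model). A sketch with made-up counts from six replicates:

```python
import statistics

def dispersion_ratio(counts):
    """Sample variance / sample mean: about 1 for Poisson-like data,
    greater than 1 when the counts are over-dispersed."""
    return statistics.variance(counts) / statistics.mean(counts)

counts = [12, 30, 7, 25, 41, 5]  # hypothetical read counts, one gene, six samples
theta_hat = dispersion_ratio(counts)
print(f"mean = {statistics.mean(counts)}, "
      f"variance = {statistics.variance(counts)}, theta = {theta_hat:.2f}")
```

With only a handful of replicates this ratio is very noisy, which is the small-sample problem taken up later in the deck.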
How about a normal distribution?
• Despite working with discrete count data, several authors use normal distributions. Several reasons:
1. When the mean number of counts is far enough away from zero, the normal distribution often does a good job of fitting the data (and capturing the mean-variance relationship). For low mean counts a variance stabilization can aid modeling (the approach used in limma/voom).
2. Our response variables (counts of features) are not measured without error, and therefore are not true measures. When estimating effects in our model we account for this uncertainty, and assuming a normal distribution enables additional flexibility.
Negative binomial
• In biology the negative binomial is mostly used like a Poisson, but when you need more dispersion of x (it needs to be spread out more).
• The negative binomial is a Poisson distribution where lambda itself varies according to a gamma distribution.
Negative binomial
Expected number of counts = μ
Over-dispersion parameter = k
For our purposes all we care about is that the variance is no longer forced to equal the mean: it depends on both μ and k.
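A sketch of that mean-variance relationship, assuming the mean/size parameterization used by R's dnbinom(mu = , size = k), where variance = μ + μ²/k. Under this convention a small k means strong over-dispersion and a very large k approaches a Poisson. Whether the slide used exactly this form is an assumption on my part; the values of μ and k are arbitrary.

```python
import math

def nb_pmf(x, mu, k):
    """Negative binomial probability of count x, with mean mu and size k
    (assumed parameterization: variance = mu + mu**2 / k)."""
    log_p = (math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
             + k * math.log(k / (k + mu))
             + x * math.log(mu / (k + mu)))
    return math.exp(log_p)

mu, k = 20.0, 2.0
xs = range(2000)  # far enough into the tail for these parameter values
mean = sum(x * nb_pmf(x, mu, k) for x in xs)
var = sum((x - mean) ** 2 * nb_pmf(x, mu, k) for x in xs)

print(f"mean = {mean:.2f}")
print(f"variance = {var:.2f} (mu + mu^2/k = {mu + mu ** 2 / k:.2f}; "
      f"a Poisson would give {mu:.2f})")
```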
General(ized) linear models
• For response variables that are continuous, you are likely familiar with approaches that come from the general linear model.
A standard linear regression (if x is continuous). If x is discrete this would be a t-test/ANOVA.
Generalized linear model
• MANY of the differential expression tools utilize a linear model framework.
• Thus it is important to get familiar with the framework.
• The class by Jonathan and Ben (B) is probably a great place to start.
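Continuity of Statistical Approaches
[Diagram: the general linear model (normal response) covers the t-test and ANOVA (discrete predictors, differing in number of levels), regression (continuous predictors) and ANCOVA (both); mixed effects models extend fixed predictors to random predictors or both; the generalized linear model extends the response to non-normal distributions; process models.]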
Generalized linear models
• But what do you do when your response variable is not normally distributed?
• The framework of the linear model can be extended to account for different distributions fairly easily (one major class of these is the generalized linear models).
Generalized Linear Models (GLiM)
• In many cases a general linear model is not appropriate because values are bounded
– e.g. counts ≥ 0, proportions between 0 and 1
• A generalization of linear models to include any distribution of errors from the exponential family of distributions
• Normal, Poisson, binomial, multinomial, exponential, gamma; NOT negative binomial (unless its dispersion parameter is held fixed)
• The General Linear Model is just a special case of GLiM in which the errors are normally distributed
• Example: logistic regression
• We will use likelihood for parameter estimation and inference
Generalizations of GLM
• Instead of a simple linear model: Y = b₀ + b₁x₁ + b₂x₂ + e
– Assume that the e’s are independent, normally distributed with mean 0 and constant variance σ²
– Can solve for the b’s by minimizing the squared e’s
• GLiM considers some adjustment to the data to linearize Y: a link function
Y = g(b₀ + b₁x₁ + b₂x₂ + e) or f(Y) = b₀ + b₁x₁ + b₂x₂ + e
– For example, for count data, which are always positive:
f(Y) = log(Y) (log link)
What is a link function?
• The link function is a way of transforming the observed response variable (LHS).
• Goals
• 1) Linearize the observed response.
• 2) Alter the boundary conditions of the data.
• 3) Allow for an additive model in the covariates (RHS).
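Poisson family
• Data are counts of something (i.e. 0, 1, 2, 3, 4…).
• Number of occurrences of an event over a fixed period of time or space.
• Examples…
• If the mean value is high then counts can be log-normal or normally distributed.
• When the mean value is low then there start to be lots of zeros, and the variance depends on the mean.
• If the upper end is also bounded then a binomial would be better.
• Default link is the log link, variance function = µ
– i.e., family = poisson(link = "log"); the variance function µ is fixed by the family (in R, the quasi family makes it explicit: quasi(link = "log", variance = "mu"))
– Another option might be the sqrt link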
Poisson and negative binomial family
Essentially it means the log of the expected sequence count is modeled, with a Poisson, quasi-Poisson or negative binomial describing the counts (most links are more complicated, this is nice and simple).
i.e. counts are modeled as
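One way to write this out, in my own notation rather than the slide's: log(E[count_i]) = b0 + b1 * x_i + log(N_i), where x_i is the treatment indicator and the library size N_i enters as an offset. The sketch below fits that Poisson model for one gene by iteratively reweighted least squares, using only the standard library. The counts, groups and library sizes are invented, and real tools do considerably more, in particular around dispersion.

```python
import math

def fit_poisson_glm(y, x, lib_size, n_iter=25):
    """Fit log(E[y_i]) = b0 + b1 * x_i + log(lib_size_i) for a Poisson response
    by iteratively reweighted least squares (Fisher scoring)."""
    offset = [math.log(n) for n in lib_size]
    b0 = math.log(sum(y) / sum(lib_size))  # start from the pooled rate
    b1 = 0.0
    for _ in range(n_iter):
        sw = swx = swxx = swz = swxz = 0.0
        for yi, xi, oi in zip(y, x, offset):
            eta = b0 + b1 * xi + oi          # linear predictor, incl. offset
            mu = math.exp(eta)               # inverse of the log link
            z = (eta - oi) + (yi - mu) / mu  # working response
            w = mu                           # working weight (variance = mu)
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        # Solve the 2 x 2 weighted least-squares normal equations.
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

counts = [10, 12, 8, 30, 25, 35]   # hypothetical counts for one gene
group = [0, 0, 0, 1, 1, 1]         # 0 = control, 1 = treatment
library_sizes = [1.0e6, 1.2e6, 0.8e6, 1.0e6, 0.9e6, 1.1e6]

b0, b1 = fit_poisson_glm(counts, group, library_sizes)
print(f"log fold change (treatment vs control): {b1:.3f}")
```

For this two-group layout the estimate of b1 is the log of the ratio of pooled rates (total counts over total library size in each group), so the raw counts never need to be pre-scaled.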
Methods using NB GLMs
• edgeR (but it is not the default, so beware!)
• DESeq/DESeq2 (maybe DEXSeq as well?)
• baySeq
• limma (voom – kind of, sort of…).
• However these all model the variance quite differently (how they borrow information across genes to estimate mean-variance relationships).
See Yu, Huber & Vitek 2013 (Bioinformatics) for discussion of this issue.
Methods using Poisson and quasi-Poisson
• tspm (two-stage Poisson model)
– Fits models with a Poisson first. If over-dispersed, it then uses a quasi-Poisson.
– Thus there are essentially two groups of genes.
Why this is useful
• Since we can fit these as generalized linear models, we can fit arbitrarily complex designs (if we have sufficient sample sizes to estimate all the parameters).
• We can incorporate all aspects of read length, library size, lane and flow cell, in addition to all of the important biological predictors (your treatments).
• NO t-tests for you!!!
Estimating over-dispersion (variance) (or why programs seemingly doing the same thing give different results)
Variances require lots of data to estimate well (not just for count data)
• It turns out that to estimate variances, you need a lot more replication than you do for means.
• However most RNA-seq experiments still have small numbers of biological replicates.
• So how to go about estimating variances?
IF sample sizes are large (within and between treatments).
• Most methods do well (based on NB, quasi-Poisson or non-parametric approaches).
• They can model individual-level variances (and potentially can use resampling approaches to avoid having to make parametric assumptions).
But if sample sizes (in terms of biological replication) are small.
• Then we have a problem.
• This is where the software really tends to differ, as they all make (different) assumptions about the uncertainty in counts, mean-variance relationships, and how best to model such effects.
• In particular edgeR and DESeq use some methods to borrow information across genes (and have options to change this process).
• This can dramatically change the results.
Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106. doi:10.1186/gb-2010-11-10-r106
Anders et al. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8(9), 1765–1786
Anders and Huber 2010
Yu et al. (2013). Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics, 29(10), 1275–1282.
Anders and Huber 2010
Let’s think about this.
Love, Huber & Anders 2014, bioRxiv, doi:10.1101/002832
We can also “shrink” estimates based on over-dispersion…
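To make “borrowing information across genes” concrete, here is a deliberately simplified sketch: each gene-wise dispersion estimate is pulled part of the way toward the across-gene average. This is not the edgeR or DESeq algorithm (those use likelihood-based and empirical Bayes machinery, often with a mean-dependent trend). The function name, the weight and the values are hypothetical.

```python
def shrink_dispersions(gene_disp, weight):
    """Move each gene-wise dispersion toward the across-gene mean.
    weight = 0 keeps the raw estimates; weight = 1 uses one common value."""
    common = sum(gene_disp) / len(gene_disp)
    return [(1 - weight) * d + weight * common for d in gene_disp]

raw = [0.01, 0.5, 0.1, 0.05, 1.2]  # noisy gene-wise estimates (made up)
half = shrink_dispersions(raw, weight=0.5)

for before, after in zip(raw, half):
    print(f"{before:.2f} -> {after:.3f}")
```

How hard to shrink (the weight) and what to shrink toward are exactly the choices on which the packages differ, which is one reason their gene lists differ.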
Take home
• With small sample sizes, the methods use different approaches to get gene-wise over-dispersion (based on all data).
• edgeR is generally more powerful (more significant hits) than DESeq, but much more susceptible to false positives due to outliers.
• DESeq2 “should” be somewhere in the middle.
Biological replication gives far more statistical power than increased sequencing depth within a biological sample!!!!
Robles et al. 2012
How do the methods compare in simulation?
Kvam et al. 2012
How do the methods compare for real data?
Kvam et al. 2012
How do the methods compare in a different set of simulations?
Soneson & Delorenzi 2013
Will explain ROC (receiver operating characteristic) curves and the area under the curve on the board.
References
• Robles, J. A., Qureshi, S. E., Stephen, S. J., Wilson, S. R., Burden, C. J., & Taylor, J. M. (2012). Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics, 13, 484. doi:10.1186/1471-2164-13-484
• Bullard, J. H., Purdom, E., Hansen, K. D., & Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. doi:10.1186/1471-2105-11-94
• Kvam, V. M., Liu, P., & Si, Y. (2012). A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. American Journal of Botany, 99(2), 248–256. doi:10.3732/ajb.1100340
• Soneson, C., & Delorenzi, M. (2013). A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14, 91. doi:10.1186/1471-2105-14-91
• Wagner, G. P., Kin, K., & Lynch, V. J. (2012). Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences, 131(4), 281–285. doi:10.1007/s12064-012-0162-3
• Vijay, N., Poelstra, J. W., Künstner, A., & Wolf, J. B. W. (2012). Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments. Molecular Ecology. doi:10.1111/mec.12014
Why do we care about multiple comparisons?
How can we deal with multiple comparisons?
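A back-of-the-envelope answer to the first question: testing 20,000 genes at p < 0.05 would yield about 1,000 “significant” genes even if nothing were truly differentially expressed. For the second, one widely used approach (my choice of example, not necessarily the one covered in the lecture) is the Benjamini-Hochberg adjustment, which controls the false discovery rate. A small sketch with invented p-values:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

n_genes, alpha = 20000, 0.05
print(f"expected false positives with no true effects: {n_genes * alpha:.0f}")

pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values for four genes
print([round(q, 3) for q in benjamini_hochberg(pvals)])
```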