
RNA-seq: quantification and models for assessing differential expression (at least for some approaches)

Ian Dworkin NGS 2016

@IanDworkin

What we will cover today

• Absolute fundamentals of experimental design.
• Why we use count data as input.
• Introducing a bit of probability to explain why many RNA differential analysis tools use a negative binomial.
• Why do we care about variance/over-dispersion so much?
• How do we estimate over-dispersion with small sample sizes (and why edgeR and DESeq give different results)?
• A bit about dealing with multiple comparisons (if we have time).

Goals

I am not planning on trying to provide any sort of overview of statistical methods for genomic data. Instead I am going to provide a few short ideas to think about.

Statistics (like bioinformatics) is a rapidly developing area, in particular with respect to genomics. Rarely is it clear what the “right way” to analyze your data is.

Instead I hope to aid you in using some common sense when thinking about your experiments for using high throughput sequencing.

Caveats

• There are whole courses on proper experimental design and statistics. Great books too. This material in Bio720 is not enough!

• For experimental design I highly recommend:
– Quinn & Keough: Experimental Design and Data Analysis for Biologists.

http://www.amazon.com/Experimental-Design-Data-Analysis-Biologists/dp/0521009766/

The basics of experimental design

• There are a few basic points to always keep in mind:
– Biological replication (as much as you can afford) is extremely important. Robustly identifying differentially expressed (DE) genes requires statistical power.
• (note: this is not how many reads you have for a gene within a sample, but how many biologically/statistically independent samples per treatment).

– Technical replication does not help with statistical power (i.e. don’t split a single sample and run as two libraries).

Biological replication gives far more statistical power than increased sequencing depth within a biological sample!!!!

• Sequencing (and library prep) costs are still sufficiently expensive that most experiments use small numbers of biological replicates.

• Given the additional cost of library prep (~$225/sample at our facility), many folks go for increased depth instead of more samples.

• For a given level of sequencing depth (total) for a treatment, it is far better to go for more biological replicates, each at lower sequencing depth (rather than fewer replicates at higher sequencing depth).


Robles et al. 2012

How do the methods compare in simulation?

Kvam et al. 2012


• There are a few basic points to always keep in mind:
– Biological replication.
– Design your experiment to avoid confounding your different treatments (sex, nutrition) with each other or with technical variables (lane within a flow cell, between flow cell variation).
• Make diagrams/tables of your experimental design, or use a randomized design.

• There are a few basic points to always keep in mind:
– Biological replication.
– Design experiment to avoid confounding variables.
– Sample individuals (within treatment) randomly!

Useful references

Paul L. Auer and R. W. Doerge 2010. Statistical Design and Analysis of RNA-Seq Data. Genetics. 10.1534/genetics.110.114983 PMID: 20439781

Bullard, J. H., Purdom, E., Hansen, K. D., & Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. doi:10.1186/1471-2105-11-94

Designing your experiment before you start.

Sampling

Replication

Blocking

Randomization

Overall we are going to be thinking about how to avoid confounding sources of variation in the data.

All of these are larger topics that are part of Experimental Design.

Sampling


Sampling design is all about making sure that when you “pick” (sample) observations, you do so in a random and unbiased manner.

Proper sampling aims to control for unknown sources of variation that influence the outcome of your experiments.

This seems reasonable, and often intuitive to most experimental biologists, but it can be very insidious. Whiteboard…

Sampling


Biological replicates, not technical ones.

• There is little purpose in using technical replication (i.e. same sample, multiple library preps) from a given biological sample UNLESS part of your question revolves around it.

• Focus on biological variability. While you are confounding some sources of technical and biological variability, we already know a lot about the former, and little about the latter (in particular for your system).

Replication


Imagine you have an experiment with one factor (sex), with two treatment levels (males and females).

You want to look for sex specific differences in the brains of your critters based on transcriptional profiling, so you decide to use RNA-seq.

Perhaps you have a limited budget so you decide to run one sample of male brains, and one sample of female brains, each in one lane of a flow cell.

What (useful) information can you get out of this?

Not much (but there may be some). Why?

Replication


Why?

No replication. How will you know if the differences you observe are due to differences in males and females, random (biological) differences between individuals, or technical variation due to RNA extraction, processing or running the samples on different lanes?

All of these sources of variation are confounded, and there are no particularly good ways of separating them out.

But there are lots of sources of variation, so how do we account for these?

Replication


To date, several studies have suggested that “technical” replicates for RNA-seq show very little variation/high correlation.

Mortazavi et al. 2008

How might such a statement be misleading about variation?

Replication


This study looked at a single source of technical variation.

Running exactly the same sample on two different lanes on a flow cell.

This completely ignores other sources of “technical variation”:
– variation due to RNA purification
– variation due to fragmentation, labeling, etc.
– lane to lane variation
– flow cell to flow cell variation

All of these may be important (although probably not interesting) sources of variation…

However…..

Replication


Many studies have ignored the BIOLOGICAL SOURCES of VARIATION between replicates. In most cases biological variation between samples (from the same treatment) is far greater than technical sources of variation.

While it would be nice to be able to partition various sources of technical variation (such as labeling, RNA extraction), it is often too expensive to perform such a design (see whiteboard).

IF you have limited resources, it is generally far better to have biological replication (independent biological samples for a given treatment) than technical replication.

Does this lead to confounded sources of variation?

Blocking


Blocks in experimental design represent some factor (usually something not of major interest) that can strongly influence your outcomes. More importantly it is a factor which you can use to group other factors that you are interested in.

For instance in agriculture there is often plot to plot variation. You may not be interested in the plots themselves but in the variety of crops you are growing.

But what would happen if you grew all of strain 1 on plot 1 and all of strain 2 on plot 2?

Whiteboard.

These plots would represent blocking levels.

Blocking


In genomic studies the major blocking levels are often the slide/chip for microarrays (i.e. two samples/slide for 2 color arrays, 16 arrays/slide for Illumina arrays).

For GAII/HiSeq RNA-seq data the major blocking effect is the flow cell itself and lanes within the flow cell.

Auer and Doerge 2010

Blocking


Incorporating lanes as a blocking effect

Auer and Doerge 2010

Blocking designs


Balanced Incomplete Blocking Design (BIBD)

Let’s dissect these subscripts.

Balanced for treatments across flow cells. Randomized for location. Auer and Doerge 2010
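One way to put blocking plus randomization into practice is to generate the lane layout by script rather than by hand. The sketch below is illustrative only and is not the Auer and Doerge design itself: it assumes samples are barcoded so several can share a lane, and the sample names and the helper `assign_to_lanes` are hypothetical. Each lane (the block) receives one replicate of every treatment, and order within the lane is shuffled.

```python
import random

def assign_to_lanes(samples_by_treatment, seed=1):
    """Blocked, randomized layout: each lane gets one replicate of every treatment.

    samples_by_treatment: dict of treatment name -> list of sample IDs
    (hypothetical names; every treatment needs the same number of replicates).
    """
    rng = random.Random(seed)
    n_reps = {len(ids) for ids in samples_by_treatment.values()}
    if len(n_reps) != 1:
        raise ValueError("design must be balanced: equal replicates per treatment")
    # Shuffle replicates within each treatment so lane membership is random.
    shuffled = {t: rng.sample(ids, len(ids)) for t, ids in samples_by_treatment.items()}
    lanes = []
    for i in range(n_reps.pop()):
        lane = [(t, shuffled[t][i]) for t in shuffled]
        rng.shuffle(lane)  # randomize position within the lane
        lanes.append(lane)
    return lanes

design = assign_to_lanes({
    "female": ["F1", "F2", "F3", "F4"],
    "male": ["M1", "M2", "M3", "M4"],
})
for lane_number, lane in enumerate(design, start=1):
    print(lane_number, lane)
```

Because every lane contains both treatments, a lane effect cannot masquerade as a treatment effect, which is exactly what goes wrong when all males run in one lane and all females in another.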

What standard technical issues should you consider for blocking:

• Flow cell
• Lane
• Adaptors
• Library prep
• Same instrument
• People!
• RNA extraction/purification

What happens when you fail to block (or replicate)?

Yue F, Cheng Y, Breschi A, et al.: A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014; 515(7527): 355–364

Lin S, Lin Y, Nery JR, et al.: Comparison of the transcriptional landscapes between human and mouse tissues. Proc Natl Acad Sci USA. 2014; 111(48): 17224–17229

In a recent analysis of the mouse ENCODE data, RNA-seq data suggested that samples clustered (for gene expression) more by species than by tissue. This was an unusual finding.

Gilad Y and Mizrahi-Man O. A reanalysis of mouse ENCODE comparative gene expression data [v1; ref status: indexed, http://f1000r.es/5ez] F1000Research 2015, 4:121 (doi: 10.12688/f1000research.6536.1)

A new re-analysis demonstrated some potentially serious issues with the experimental design.

Figure 1. Study design for: Yue F, Cheng Y, Breschi A, et al.: A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014; 515(7527): 355–364; Lin S, Lin Y, Nery JR, et al.: Comparison of the transcriptional landscapes between human and mouse tissues. Proc Natl Acad Sci USA. 2014; 111(48): 17224–17229

Gilad Y and Mizrahi-Man O 2015 [v1; ref status: awaiting peer review, http://f1000r.es/5ez] F1000Research 2015, 4:121 (doi: 10.12688/f1000research.6536.1)

Differential expression

• Probably the single most common use of RNA-Seq data is to examine differential expression of transcripts (transcriptional profiles).

• But differential expression of what?
– Genes
– Transcripts (alternative transcripts)
– Allele specific expression
– Exon level expression

The primary goals of your experiment should guide your design.

• The exact details (# biological samples, sample depth, read_length, strand specificity) of how you perform your experiment need to be guided by your primary goal.

• Unless you have all the $$, no single design can capture all of the variability.

Your goals matter

• For instance: If your primary interest is in the discovery of new transcripts, sampling deeply within a sample is probably best.

• For differential expression analyses, you will almost never have the ability to perform differential expression analysis on very rare transcripts, so it is rarely useful to generate more than 15-20 million read pairs per biological sample.

A simple truth: There is no technology nor statistical wizardry that can save a poorly planned experiment. The only truly failed experiment is a poorly planned one.

To consult the statistician after an experiment is finished is often merely to ask him (her) to conduct a post mortem examination. He (she) can perhaps say what the experiment died of.

Ronald Fisher

Counting

• One of the most difficult issues has been how to count.

• We first need to ask what features we want to count.

What features could we count?

• Counting at the level of genes (reads mapped to gene regardless of transcript).
• Counting at the level of transcript.
• Counting at the level of exons.
• Counting at the level of kmers within one of the above.
• Counting at the level of nucleotides within exon/transcript/gene.


Counting

• We are interested in transcript abundance.
• But we need to take into account a number of things.

• How many reads in the sample.
• Length of transcripts.
• GC content and sequencing bias (influencing counts of transcripts within a sample).

Seemingly sensible counting (but ultimately not so useful).

• RPKM (reads aligned per kilobase of exon per million reads mapped) – Mortazavi et al 2008

• FPKM (fragments per kilobase of exon per million fragments mapped). Same idea for paired end sequencing.

• TPM, TMM… etc…

Take home message (from me): Actual counts should be used as input for differential expression analysis, not (pre)scaled measures.

BUT: Not everyone agrees with this approach. Nor with my arguments about counting.

Lior Pachter’s blog is a good place to watch the debate. Also check out some comments in the vignette and paper on limma/voom.

Problems with RPKM

• RPKM is not a consistent measure of expression abundance (or relative molar concentration).

• See
– http://blog.nextgenetics.net/?e=51
– Wagner et al 2012 Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci
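A small numerical sketch can make the inconsistency concrete. The transcript lengths and counts below are invented for illustration, and the two helper functions simply follow the usual textbook definitions of RPKM and TPM. The point to look at is the total: RPKM values sum to a different number in each sample, whereas TPM always sums to one million.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

lengths = [500, 1000, 4000]      # hypothetical transcript lengths (bp)
sample_a = [100, 200, 800]       # hypothetical read counts
sample_b = [100, 200, 3200]      # only the long transcript differs

rpkm_total_a = sum(rpkm(sample_a, lengths))
rpkm_total_b = sum(rpkm(sample_b, lengths))
tpm_total_a = sum(tpm(sample_a, lengths))
tpm_total_b = sum(tpm(sample_b, lengths))

print("RPKM totals:", rpkm_total_a, rpkm_total_b)
print("TPM totals: ", tpm_total_a, tpm_total_b)
```

Since the RPKM totals differ, an RPKM value in one sample is not on the same scale as the same value in another, so comparing them directly across samples is not a like-for-like comparison.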

How about Transcripts per million (TPM)

While TPM is in general more (statistically) consistent, it is still generally not appropriate as input for differential expression analysis.

Normalization (for DE) can be much more complicated in practice

• Why might scaling by total number of reads (sequencing depth) be a misleading quantity to scale by?

Normalization (for DE) can be much more complicated in practice

• Scaling by total mapped reads (sequencing depth) can be substantially influenced by the small proportion of highly expressed genes.

(What might happen?)

• A number of alternatives have been proposed and used (i.e. using quantile normalization, etc..)

Bullard, J. H., Purdom, E., Hansen, K. D., & Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. doi:10.1186/1471-2105-11-94
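To see what might happen, here is a toy sketch with made-up counts. Both libraries have the same depth, genes 1 to 3 are unchanged, and gene 4 is strongly induced in sample B, so it takes a larger share of the reads. As one example of an alternative (swapped in here for illustration; the slide names quantile normalization), the code also computes median-of-ratios size factors, the idea described by Anders and Huber (2010). The function names are hypothetical and the code assumes all counts are greater than zero.

```python
import math
from statistics import median

def counts_per_million(counts):
    """Scale raw counts by total library size (sequencing depth)."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def median_of_ratios_size_factors(samples):
    """Per-sample size factor: median ratio of counts to the per-gene geometric mean.

    Assumes every count is > 0 (real data need explicit handling of zeros).
    """
    n_genes = len(samples[0])
    geo_means = [
        math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
        for g in range(n_genes)
    ]
    return [median(s[g] / geo_means[g] for g in range(n_genes)) for s in samples]

# Hypothetical counts for four genes; both libraries contain 1000 reads.
sample_a = [100, 200, 300, 400]
sample_b = [50, 100, 150, 700]   # gene 4 induced, so genes 1-3 get fewer reads

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
size_a, size_b = median_of_ratios_size_factors([sample_a, sample_b])
norm_a = [c / size_a for c in sample_a]
norm_b = [c / size_b for c in sample_b]

print("gene 1, scaled by depth:      ", cpm_a[0], cpm_b[0])
print("gene 1, median-of-ratios norm:", norm_a[0], norm_b[0])
```

Scaling by depth makes genes 1 to 3 look halved in sample B even though only gene 4 changed; the median-based size factor is driven by the bulk of the genes and is not dragged around by the single highly expressed one.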

Counting (and normalizing) in practice

• In practice, we do not want to “pre-scale” our data as is done in F/R-PKM or TPM.

• Instead we are far better off using a model based approach for normalizing for read-length or library size in the data modeling per se.

• This is far more flexible.
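As a minimal sketch of what "normalizing inside the model" can look like, consider a Poisson model with a log link in which log library size enters as an offset: log(E[count]) = log(library size) + b0 + b1 * group. For a single two-level factor the maximum likelihood estimates have a closed form, which the hypothetical function below uses. The counts and library sizes are invented, and real tools fit far richer models than this.

```python
import math

def poisson_two_group_fit(counts, library_sizes, group):
    """MLE for log(E[count]) = log(library_size) + b0 + b1 * group, group in {0, 1}.

    With one binary predictor, the fitted rate in each group is simply
    (total counts in the group) / (total library size in the group).
    """
    rate = {}
    for g in (0, 1):
        y = sum(c for c, gi in zip(counts, group) if gi == g)
        n = sum(size for size, gi in zip(library_sizes, group) if gi == g)
        rate[g] = y / n
    b0 = math.log(rate[0])
    b1 = math.log(rate[1] / rate[0])   # log fold change between groups
    return b0, b1

# Hypothetical raw counts for one gene in four samples with unequal depth.
counts = [50, 100, 200, 100]
library_sizes = [1_000_000, 2_000_000, 2_000_000, 1_000_000]
group = [0, 0, 1, 1]

b0, b1 = poisson_two_group_fit(counts, library_sizes, group)
print("log2 fold change:", b1 / math.log(2))
```

The raw counts stay as counts; depth is handled by the offset term, and further terms (lane, flow cell, other treatments) could be added to the same linear predictor.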

Take home message: Actual counts should be used as input for differential expression analysis, not (pre)scaled measures.

The issue is that getting unambiguous counts is hard (Rob).

Differential Expression analysis. A Primer.

• I am assuming that we have already decided on an appropriate method to count and convert mapped reads to discrete values…

• There is a bit we need to know to help us understand what to do next.

A bit of background on probability.

• Fundamentally our observed measures of expression are the counts of reads.

• Depending upon the data modeling framework we wish to use, we need to account for this, as these are not necessarily approximated well by normal (Gaussian) distributions that are used for “standard” linear models like t-tests, ANOVA, regression.

• This is not a problem at all, as it is easy to model data coming from other distributions, and this is widely available in stats packages and programming languages alike.

Probability Density vs. Mass function

Probability Mass function for a discrete variable.

Probability Density function for a continuous variable.

Probability Mass function (for discrete distributions, like read counts)

P(13 | Poisson(λ = 10)) = 0.073

Height represents the probability at that point (integer).

“Area” of the box has no particular meaning.

P(integer) ≥ 0; P(non-integers) = 0.
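The value on this slide can be checked directly from the Poisson probability mass function, P(x | λ) = λ^x e^(-λ) / x!. The short sketch below uses only the standard library; the function name is just a label for this example.

```python
import math

def poisson_pmf(x, lam):
    """Probability mass at x for a Poisson with rate lam (zero off the integers)."""
    if x < 0 or x != int(x):
        return 0.0
    x = int(x)
    return math.exp(-lam) * lam**x / math.factorial(x)

p_13 = poisson_pmf(13, 10)
print(round(p_13, 3))            # 0.073
print(poisson_pmf(12.5, 10))     # 0.0, no mass between integers

# The heights over all integers add up to 1 (summed far enough into the tail).
total_mass = sum(poisson_pmf(x, 10) for x in range(0, 100))
print(total_mass)
```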

Probability Density function

Height at x = 13 is 0.0799. This is not the probability at x = 13, but the density, i.e. f(13) = 0.0799, where f(x) is the normal density.

P(x = 13 | N(mean = 10, sd = 3.3)) = 0. WHY?

Probability Density function

We can define the probability in the interval 10 ≤ x ≤ 15

P(10 ≤ x ≤ 15 | N(10, 3.3)) = 0.435
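For a continuous variable, probability comes from area, so it is computed as a difference of the cumulative distribution function at the two ends of the interval. A minimal sketch using the error function from the standard library (function names are illustrative):

```python
import math

def normal_pdf(x, mean, sd):
    """Density (height of the curve), not a probability."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

density_at_13 = normal_pdf(13, 10, 3.3)                            # about 0.08
p_interval = normal_cdf(15, 10, 3.3) - normal_cdf(10, 10, 3.3)
p_point = normal_cdf(13, 10, 3.3) - normal_cdf(13, 10, 3.3)        # zero-width interval

print(density_at_13)
print(round(p_interval, 3))   # 0.435
print(p_point)                # 0.0
```

The last line answers the "WHY?" above: a single point is an interval of zero width, so the area over it is zero even though the curve has positive height there.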

Clarifications on continuous distributions.

AREA UNDER CURVE OF PDF = 1

(The integral of the normal)

Bolker 2007 CH4 page 137

The multitude of probability distributions allows us to choose those that match our data or theoretical expectations in terms of shape, location, scale.

Fitting a distribution is an art and science of utmost importance in probability modeling. The idea is you want a distribution to fit your data model “just right” without a fit that is “overfit” (or underfit). Overfitting models is sometimes a problem in modern data mining methods because the model’s fit can be too specific to a particular dataset to be of broader use.

Seefeld 2007

So why do we use them? It’s all about shape and scale!

• Because they provide a usable framework for framing our questions, and allowing for parametric methods; i.e. likelihood and Bayesian.

• Even if we do not know its actual distribution, it is clear frequency data is generally going to be better fit by a binomial than a normal distribution. Why?

Why will it be a better fit?

• The binomial is bounded by zero and 1.
• Other distributions (gamma, poisson, etc) have a lower boundary at zero.

• This provides a convenient framework for the relationship between means and variance as one approaches the boundary condition.

Some discrete distributions (leading up to why we may want to use the negative binomial)

Binomial

Poisson

Negative-binomial

Random variables

• This is what we want to know the probability distribution of.

• I.e. P(x | some distribution)

I will use “x” to be the random variable in each case.

Binomial

Let’s say you set up a series of enclosures. Within each enclosure you place 25 flies, and a pre-determined set of predators. You want to know what the distribution (across enclosures) of flies getting eaten is, based on a pre-determined probability of success for a given predator species.

You can set this up as a binomial problem.

N (R calls this size) = 25 (the total # of individuals or “trials” for predation) in the enclosure
p = probability of a successful predation “trial” (the coin toss)
x = # trials of successful predation. This is what we usually want for the probability distribution.

Binomial

You can think of this in two ways.
A) A normalizing constant so that probabilities sum to 1.
B) # of different combinations to allow for x “successful” predation events out of N total.

You will often see x = k and hear “N choose k”.
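The formula these two points refer to appears to have been an image on the slide and did not survive. The standard binomial probability mass function, written with the symbols defined above (N trials, success probability p, x successes), is given here as a reconstruction rather than a copy of the slide:

```latex
P(x \mid N, p) = \binom{N}{x}\, p^{x} (1-p)^{N-x},
\qquad
\binom{N}{x} = \frac{N!}{x!\,(N-x)!}
```

The binomial coefficient is the "N choose x" term that points A and B describe.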

Example

• If predator species 1 had a per “trial” probability of successfully eating a prey item of 0.2, what would be the probability of exactly 10 flies (out of the 25) being eaten in a single enclosure?

P(x = 10 | bi(N = 25, p = 0.2)) = 0.0118

Not so high. We can look at the expected probability distribution for different values of x.

This would be the expected distribution if we set up many replicate enclosures with 25 flies and this predator.
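The number in this example, and the whole expected distribution across enclosures, can be reproduced in a few lines. This is a sketch using only the standard library; `binomial_pmf` is just a label for this example.

```python
import math

def binomial_pmf(x, n, p):
    """P(x successes out of n trials), each with success probability p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

p_ten_eaten = binomial_pmf(10, 25, 0.2)
print(round(p_ten_eaten, 4))     # 0.0118

# Expected distribution of flies eaten per enclosure, for x = 0..25.
distribution = [binomial_pmf(x, 25, 0.2) for x in range(26)]
most_likely = max(range(26), key=lambda x: distribution[x])
print(most_likely, sum(distribution))
```

The most likely outcome sits at N * p = 5 flies, which is why observing 10 is uncommon.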

Predator species 2 is much hungrier….

Let’s say we had 100 flies per enclosure, and predator species 3 was really ineffective, p = 0.01

While there may be a theoretical limit to the number of flies that can be eaten, practically speaking it is unlimited since the predation probability is so low.

This is a lot like the situation we have with RNA-seq data.

Poisson

• When you have a discrete random variable where the probability of a “successful” trial is very small, but the theoretical (or practical) range is effectively infinite, you can use a poisson distribution.

• Useful for counting # of “rare” events, like new migrants to a population/year.

• # of new mutations/offspring..
• # counts of sequencing reads (well sort of)…

Poisson

• It is also seemingly useful for RNA-Seq data (although we will see not very useful in practice).

Poisson

x is our random variable (# events/unit sampling effort) – read counts for a gene in a sample
λ is the “rate” parameter, i.e. expected number of reads (for a transcript) per sample
λ is the mean and the variance!!!!

For its relation to a binomial when N is large and p is small: λ = N*p

Poisson

• Let’s say flies disperse to colonize a new patch at a very low rate (previous estimates suggest we will observe one fly for every two new patches we examine, λ = 0.5).

• What is the probability of observing 2 flies on a new patch of land?

P(x = 2 | poisson(λ = 0.5)) = 0.076
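This value follows from the Poisson probability mass function, P(x | λ) = λ^x e^(-λ) / x!. The sketch below also checks the relation λ = N * p numerically, using a binomial with a large N and small p; those particular N and p values are arbitrary choices for illustration.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability of exactly x events when the expected number is lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def binomial_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

p_two_flies = poisson_pmf(2, 0.5)
print(round(p_two_flies, 3))     # 0.076

# Binomial with large N and small p, chosen so that N * p = 0.5.
binomial_version = binomial_pmf(2, 1000, 0.0005)
difference = abs(binomial_version - p_two_flies)
print(difference)
```

The two probabilities agree closely, which is the sense in which the Poisson is the large-N, small-p limit of the binomial.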

Probability of observing x number of flies on a patch given lambda = 0.5

What happens as lambda increases?

[Figure: Poisson distributions for λ = 4, λ = 20 and λ = 100 (λ = expected # of reads for transcript x across samples). x-axis: # of reads for transcript x; y-axis: proportion of samples for transcript x.]

Poisson mean and variance

• When lambda is small for your random variable, you will often find that your data is “over-dispersed”.

• That is, there is more variation than expected under Poisson(lambda).

• Similarly when lambda gets large, you will often find that there is less variation than expected under Poisson(lambda).

Anders and Huber 2010 Genome Biology

Why poisson might not model sequence reads well

• Most RNA-Seq data (and most count data in biology) is not modeled well by poisson because the relationships between means and variances tend to be far more complicated among (and within) biological replicates.

• It has been argued (Mortazavi et al 2008) that technical variation in RNA-Seq is captured by Poisson. I have my doubts even on this.

Quasi-poisson

• Since over-dispersion is such a common issue, a number of approaches have been developed to account for it with count data.

• One is to use a quasi-poisson.
• Instead of variance(x) = λ, it is
• Variance(x) = λθ
• Where θ is the (multiplicative) over-dispersion parameter.
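For the simplest possible case, one gene measured in one treatment group with no covariates, a moment estimate of θ is just the sample variance divided by the sample mean. The sketch below assumes that simple setting; the counts are invented and the function name is hypothetical. Fitted quasi-Poisson models with covariates estimate θ from residuals instead.

```python
from statistics import mean, variance

def quasi_poisson_dispersion(counts):
    """Moment estimate of theta for an intercept-only model: variance / mean.

    theta near 1 is consistent with Poisson; theta > 1 indicates over-dispersion.
    """
    return variance(counts) / mean(counts)

counts = [2, 9, 4, 15, 0, 6]   # hypothetical read counts, one gene, one treatment
theta = quasi_poisson_dispersion(counts)
print(theta)
```

Here the mean is 6 but the variance is 29.2, so θ is close to 5: the counts are spread out about five times more than a Poisson with the same mean would allow.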

How about a normal distribution?

• Despite working with discrete count data, several authors use normal distributions. Several reasons:

1. When the mean number of counts is far enough away from zero, often the normal distribution does a good job of fitting the data (and capturing the mean & variance relationship). For low mean counts a variance stabilization can aid modeling (the approach used in limma/voom).

2. Our response variables (counts of features) are not measured without error, and therefore are not true measures. When estimating effects in our model we account for this uncertainty, and assuming a normal distribution enables additional flexibility.

Negative binomial

• In biology the Neg. Binomial is mostly used like a poisson, but when you need more dispersion of x (it needs to be spread out more).

• The negative binomial is a Poisson distribution where lambda itself varies according to a Gamma distribution.
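The gamma-Poisson description can be simulated directly: draw a rate from a Gamma distribution, then draw a count from a Poisson with that rate. The sketch below uses only the standard library, so the Poisson draw is implemented with Knuth's simple multiplication method; the mean and shape values are arbitrary illustrative choices.

```python
import math
import random
from statistics import mean, variance

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def gamma_poisson_draw(mu, k, rng):
    """Poisson count whose rate is Gamma distributed (shape k, scale mu / k)."""
    lam = rng.gammavariate(k, mu / k)   # E[lam] = mu
    return poisson_draw(lam, rng)

rng = random.Random(42)
mu, k = 10.0, 2.0                       # hypothetical mean and shape
nb_counts = [gamma_poisson_draw(mu, k, rng) for _ in range(20000)]
pois_counts = [poisson_draw(mu, rng) for _ in range(20000)]

print("gamma-Poisson: mean", mean(nb_counts), "variance", variance(nb_counts))
print("Poisson:       mean", mean(pois_counts), "variance", variance(pois_counts))
```

Both sets of counts have a mean near 10, but the mixture is far more spread out, which is the extra dispersion the slide refers to.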

Negative binomial

Expected number of counts = μ
Over-dispersion parameter = k

For our purposes all we care about is that
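The expression that followed on the slide did not survive extraction. Assuming the parameterization common in ecology (mean μ and over-dispersion parameter k, as in Bolker 2007), the mean-variance relationship would be:

```latex
\mathrm{E}[x] = \mu,
\qquad
\mathrm{Var}(x) = \mu + \frac{\mu^{2}}{k}
```

Under this form the variance approaches the Poisson value μ as k grows large, while small k means strong over-dispersion. Note that some software reports the dispersion as 1/k instead.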

General(ized) linear models

• For response variables that are continuous, you are likely familiar with approaches that come from the general linear model.

A standard linear regression (if x is continuous). If x is discrete this would be a t-test/ANOVA.

Generalized linear model

• MANY of the differential expression tools utilize a linear model framework.

• Thus it is important to get familiar with the framework.

• The class by Jonathan and Ben (B) is probably a great place to start.

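Continuity of Statistical Approaches

[Diagram: Response (normal): General Linear Model. Fixed predictors: discrete predictors give the t-test or ANOVA (depending on the number of levels), continuous predictors give regression, both give ANCOVA. Random predictors (or both fixed and random): Mixed Effects Model. Response (non-normal): Generalized Linear Model. Beyond these: Process Models.]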

Generalized linear models

• But what do you do when your response variable is not normally distributed?

• The framework of the linear model can be extended to account for different distributions fairly easily (one major class of these is the generalized linear models).


Generalized Linear Models (GLiM)

• In many cases a general linear model is not appropriate because values are bounded – e.g. counts ≥ 0, proportions between 0 and 1

• A generalization of linear models to include any distribution of errors from the exponential family of distributions
– Normal, Poisson, binomial, multinomial, exponential, gamma, NOT negative binomial

• General Linear Model is just a special case of GLiM in which the errors are normally distributed

• Example, logistic regression
• We will use likelihood for parameter estimation and inference

Generalizations of GLM

• Instead of a simple linear model: Y = b0 + b1x1 + b2x2 + e
– Assume that e’s are independent, normally distributed with mean 0 and constant variance σ²
– Can solve for b’s by minimizing squared e’s

• GLiM considers some adjustment to the data to linearize Y: a link function
Y = g(b0 + b1x1 + b2x2 + e) or f(Y) = b0 + b1x1 + b2x2 + e
– For example for count data which are always positive
f(Y) = log(Y) (log link)

What is a link function?

• The link function is a way of transforming the observed response variable (LHS).

• Goals
– 1) Linearize observed response.
– 2) Alter the boundary conditions of the data.
– 3) To allow for an additive model in the covariates (RHS).

Poisson Family

• Data are counts of something (i.e. 0, 1, 2, 3, 4…)
• Number of occurrences of an event over a fixed period of time or space
• Examples…

• If the mean value is high then counts can be log-normal or normally distributed
• When mean value is low then there start to be lots of zeros and variance depends on the mean
• If upper end is also bounded then binomial would be better

• Default link is the log link, variance function = µ
– i.e., family = poisson(link = "log"); the variance function µ is implied by the family (in R, quasi(link = "log", variance = "mu") states it explicitly)
– Other option might be the sqrt link

Poisson and negative binomial Family

Essentially it means you can log transform the sequence counts and use a poisson, quasi-poisson or negative binomial to fit it (most links are more complicated, this is nice and simple).

i.e. counts are modeled as
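The model that followed on the slide was an image and is not recoverable. A form consistent with the log link and with the linear predictor notation used earlier (b0, b1, x1, b2, x2) would be the following; treat it as a reconstruction under those assumptions:

```latex
\log\bigl(\mathrm{E}[Y]\bigr) = b_0 + b_1 x_1 + b_2 x_2
\quad\Longleftrightarrow\quad
\mathrm{E}[Y] = \exp\bigl(b_0 + b_1 x_1 + b_2 x_2\bigr)
```

Strictly speaking, the link is applied to the expected count rather than to the observed counts themselves, so zeros in the data cause no problem, and the Poisson, quasi-Poisson or negative binomial describes the variation of the observed counts around that expectation.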

Methods using nb glm

• edgeR (but it is not default, so beware!)
• DESeq/DESeq2 (maybe DEXSeq as well?)
• baySeq
• limma (voom – kind of sort of…).

• However these all model the variance quite differently (how they borrow information across genes to estimate mean-variance relationships).

See Yu, Huber & Vitek 2013 (Bioinformatics) for discussion of this issue.

Methods using poisson and quasi-poisson

• tspm (two stage poisson model)
– Fits models with poisson first. If over-dispersed then uses a quasi-poisson.
– Thus there are essentially two groups of genes.

Why this is useful

• Since we can fit these as a generalized linear model, we can fit arbitrarily complex designs (if we have sufficient sample sizes to estimate all the parameters).

• We can incorporate all aspects of read length, library size, lane, flow cell in addition to all of the important biological predictors (your treatments).

• NO t-tests for you!!!

Estimating over-dispersion (variance) (or why programs seemingly doing the same thing give different results)

Variances require lots of data to estimate well (not just for count data)

• It turns out that to estimate variances, you need a lot more replication than you do for means.

• However most RNA-Seq experiments still have small numbers of biological replicates.

• So how to go about estimating variances?
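A quick simulation illustrates how much noisier a variance estimate is than a mean estimate at typical RNA-seq replication. The sketch assumes normally distributed values purely for simplicity, and the true mean, standard deviation and number of simulated experiments are arbitrary choices.

```python
import random
from statistics import mean, stdev, variance

rng = random.Random(7)
true_mean, true_sd = 100.0, 20.0      # hypothetical expression values
n_reps, n_experiments = 3, 5000       # three biological replicates per experiment

mean_estimates, variance_estimates = [], []
for _ in range(n_experiments):
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n_reps)]
    mean_estimates.append(mean(sample))
    variance_estimates.append(variance(sample))

# Spread of each estimator across experiments, relative to the true value.
rel_error_mean = stdev(mean_estimates) / true_mean
rel_error_variance = stdev(variance_estimates) / true_sd**2
print("relative error of the mean:    ", rel_error_mean)
print("relative error of the variance:", rel_error_variance)
```

With three replicates the mean is typically off by roughly a tenth of its value, while the variance estimate is typically off by about as much as the variance itself. This is why methods look for extra information when estimating gene-wise dispersion.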

IF sample sizes are large (within and between treatments).

• Most methods do well (based on NB, quasi-P or non-parametric approaches).

• They can model individual level variances (and potentially can use resampling approaches to avoid having to make parametric assumptions).

But if sample sizes (in terms of biological replication) are small.

• Then we have a problem.
• This is where the software really tends to differ, as they all make (different) assumptions about the uncertainty in counts, mean-variance relationships, and how best to model such effects.

• In particular edgeR and DESeq use some methods to borrow information across genes (and have options to change this process).

• This can dramatically change the results.

Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106. doi:10.1186/gb-2010-11-10-r106
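To give a feel for what "borrowing information across genes" means, here is a deliberately simplified sketch. It is not the algorithm used by edgeR or DESeq: each gene gets a crude moment estimate of dispersion (assuming variance = mean + dispersion * mean², with dispersion playing the role of 1/k), and the estimates are then pulled toward the across-gene average by a fixed, hypothetical weight. All counts and names are invented.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Crude per-gene dispersion, assuming variance = mean + dispersion * mean**2."""
    m, v = mean(counts), variance(counts)
    return max(0.0, (v - m) / m**2)

def shrink_toward_common(genewise, weight):
    """Weighted average of each gene's estimate and the across-gene mean.

    weight = 0 keeps the gene-wise values; weight = 1 uses one common value.
    """
    common = mean(genewise)
    return [weight * common + (1 - weight) * d for d in genewise]

# Hypothetical counts: three genes, three biological replicates each.
genes = {
    "geneA": [10, 12, 11],
    "geneB": [5, 30, 10],
    "geneC": [100, 140, 90],
}
raw = [moment_dispersion(c) for c in genes.values()]
shrunk = shrink_toward_common(raw, weight=0.5)

for name, r, s in zip(genes, raw, shrunk):
    print(name, round(r, 3), round(s, 3))
```

With only three replicates the raw estimates range from zero to very large; shrinking narrows that range. How strongly to shrink, and toward what (a common value, or a trend in the mean), is exactly where the packages differ, and why their gene lists differ.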

Anders et al (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8(9), 1765–1786

Anders and Huber 2010

Yu et al (2013). Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics, 29(10), 1275–1282.

Anders and Huber 2010

Let’s think about this.

Love, Huber & Anders 2014 bioRxiv doi:10.1101/002832

We can also “shrink” estimates based on over-dispersion….

Take home

• With small sample sizes, the methods use different approaches to get gene-wise over-dispersion (based on all data).

• edgeR is generally more powerful (more significant hits) than DESeq, but much more susceptible to false positives due to outliers.

• DESeq2 “should” be somewhere in the middle.


How do the methods compare in simulation?

Kvam et al. 2012

How do the methods compare for real data?

Kvam et al. 2012

How do the methods compare in a different set of simulations?

Soneson & Delorenzi 2013

Will explain ROC (receiver operating characteristic) curves and the area under the curve on the board.

References

• Robles, J. A., Qureshi, S. E., Stephen, S. J., Wilson, S. R., Burden, C. J., & Taylor, J. M. (2012). Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics, 13, 484. doi:10.1186/1471-2164-13-484

• Bullard, J. H., Purdom, E., Hansen, K. D., & Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. doi:10.1186/1471-2105-11-94

• Kvam, V. M., Liu, P., & Si, Y. (2012). A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. American Journal of Botany, 99(2), 248–256. doi:10.3732/ajb.1100340

• Soneson, C., & Delorenzi, M. (2013). A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14, 91. doi:10.1186/1471-2105-14-91

• Wagner, G. P., Kin, K., & Lynch, V. J. (2012). Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences = Theorie in den Biowissenschaften, 131(4), 281–285. doi:10.1007/s12064-012-0162-3

• Vijay, N., Poelstra, J. W., Künstner, A., & Wolf, J. B. W. (2012). Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments. Molecular Ecology. doi:10.1111/mec.12014

Why do we care about multiple comparisons?

How can we deal with multiple comparisons?
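The slides for this part did not survive, so what follows is only one common answer, not necessarily the one given in the lecture. The concern is simple arithmetic: testing 20,000 genes at p < 0.05 when nothing is differentially expressed would still be expected to flag about 1,000 genes. A widely used remedy is to control the false discovery rate with the Benjamini-Hochberg procedure, sketched below with invented p-values.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.6]    # hypothetical per-gene p-values
adjusted = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]

print(adjusted)
print("genes called at FDR 0.05:", significant)
```

Four of the five raw p-values are below 0.05, but only the first two genes remain once the number of tests is taken into account.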
