Integrating ChIP-seq and RNA-seq data
Rhonda Bacher
April 20, 2017
Integration of -omics data
• Each piece of data represents a snapshot of the biological system.
• Integration moves towards understanding the system as a whole.
Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype-phenotype interactions. Nat Rev Genet. 2015;16:85–97.
Integration of -omics data is challenging
• Need to understand characteristics of each data type.
• Incorporate biological information.
• Need data.
Data repositories
• Data repositories contain the data that authors deposited for their publications, along with data from consortium efforts.
• Surge in research groups processing data to be 'analysis-ready'.
  • Easily accessible and searchable
  • Reproducibility and consistency
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology, 2017. doi:10.1038/nbt.3838.
Integration of -omics data
Qin, Jing, et al. "Applications of integrative OMICs approaches to gene regulation studies." Quantitative Biology (2016): 1-19.
Integrating ChIP-seq and RNA-seq data
• Transcription factors (TFs) control regulation of expression.
  • Activation and repression.
[Figure: diagrams of a gene on DNA producing mRNA, with and without a bound TF; an X marks blocked transcription.]
Integrating ChIP-seq and RNA-seq data
• Why study regulatory mechanisms of transcription factors on gene expression?
• Diseases such as cancer and diabetes:
  • Mutations in TFs.
  • Possible to develop interventions.
• Regenerative medicine:
  • Key factors controlling gene expression during cell development and differentiation.
Recap RNA-seq
[Figure: genes on DNA transcribed to mRNA under Condition 1 and Condition 2.]

          Sample 1   …   Sample n
Gene 1       42      …      6
…             …      …      …
Gene m        3      …      5
(samples are grouped into Condition 1 and Condition 2)

(Colin Dewey)
(Christina Kendziorski)
Recap RNA-seq
• Use statistical methods to identify genes that are differentially expressed (DE):
  • Data are counts.
  • Negative Binomial distribution.
• With few replicates, difficult to estimate both mean and variance.
  • Borrow information across genes to get more accurate estimates of per-gene variances.
• Multiple testing, since genes are tested one by one. (Michael Newton)
  • Adjust p-values.
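The p-value adjustment step can be sketched in a few lines. The slides do not name an adjustment method; this sketch assumes the Benjamini-Hochberg procedure, one common choice for controlling the false discovery rate. The function name and the toy p-values are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Gene indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example: one raw p-value per gene (made-up numbers).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
de_genes = [i for i, q in enumerate(adjusted) if q <= 0.05]
```

Genes whose adjusted value falls below the chosen threshold (0.05 here, an arbitrary choice) would be called DE.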
Recap RNA-seq
Soneson, Charlotte, and Mauro Delorenzi. "A comparison of methods for differential expression analysis of RNA-seq data." BMC Bioinformatics 14.1 (2013): 91.
Recap ChIP-seq
[Figure: a TF bound to DNA near a gene, and the resulting peak over the binding site.]
(Sunduz Keles)
Recap ChIP-seq
[Figure: fragments from many cells; only the 5'-most end of each fragment is sequenced; aligned reads/tags form a peak near the gene.]
Recap ChIP-seq
• Use statistical methods to investigate DNA-protein interactions:
• Identify binding sites.
  • Compare peaks to background signal.
  • Account for sequence biases.
  • Shape of peak.
Recap ChIP-seq
Laajala, Teemu D., et al. "A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments." BMC Genomics 10.1 (2009): 618.
Attie lab
• Study the genetics of obesity and diabetes.
• Diabetes: poor response to insulin, combined with failure to make enough insulin, leads to increased levels of glucose in the body.
• Beta cells make insulin and are located in clusters of cells in the pancreas (islets).
Attie lab
• NFAT (nuclear factor of activated T cells) is a protein/transcription factor that regulates gene expression in beta cells.
(Karl Broman)
Keller, Mark P., et al. "The Transcription Factor Nfatc2 Regulates β-Cell Proliferation and Genes Associated with Type 2 Diabetes in Mouse and Human Islets." PLoS Genetics 12.12 (2016): e1006466.
Attie lab
• Examine effect of NFAT on beta cells:
  • RNA-seq experiment: over-express the TF and identify differentially expressed genes relative to a control set of expression samples.
  • ChIP-seq experiment: identify sites in the genome where the TF binds.

          Sample 1   …   Sample n
Gene 1       42      …      6
…             …      …      …
Gene m        3      …      5
(samples are grouped into NFAT over-express and Control)
Analysis: ChIP-seq data
• Identify peaks and TF binding sites (TFBS).
• Annotate peaks to the genome.
• Calculate distance of a peak to the nearest transcription start site (TSS).
• Software that allows a cutoff parameter for annotation:
  • ChIPpeakAnno (doesn't consider strand information).
  • ChIPseeker
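A minimal sketch of the peak-to-TSS distance step with a cutoff, written in plain Python rather than with ChIPpeakAnno or ChIPseeker. It is not the algorithm of either package. The function name, the tuple layout for genes, and the coordinates are assumptions made for illustration; all genes are assumed to be on the same chromosome as the peak.

```python
def nearest_tss(peak_summit, genes, max_distance=None):
    """Return (gene_name, signed_distance) for the TSS closest to a peak summit.

    genes: list of (name, start, end, strand) tuples on the peak's chromosome.
    signed_distance is negative when the summit lies upstream of the TSS,
    taking strand into account. Returns None if no gene is within max_distance.
    """
    best = None
    for name, start, end, strand in genes:
        # The TSS is the start coordinate on "+" and the end coordinate on "-".
        tss = start if strand == "+" else end
        offset = peak_summit - tss
        signed = offset if strand == "+" else -offset
        if best is None or abs(signed) < abs(best[1]):
            best = (name, signed)
    if best is None:
        return None
    if max_distance is not None and abs(best[1]) > max_distance:
        return None
    return best


# Toy annotation (made-up coordinates).
genes = [("geneA", 1000, 5000, "+"), ("geneB", 8000, 12000, "-")]
hit = nearest_tss(12500, genes, max_distance=3000)
```

Here the summit at 12500 sits 500 bp upstream of geneB's TSS, because geneB is on the minus strand. Ignoring strand would report it as downstream.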
Analysis
• Overlap:
  • List of DE genes from RNA-seq.
  • List of target genes from ChIP-seq.
[Figure: Venn diagram of DE genes and TF target genes; the overlap is the set of genes regulated by the TF.]
Accuracy of defining target genes
• TFBS are often near genes, in the promoter region or slightly upstream.
• Will vary across species and TFs.
[Figure: positional distributions of TFBS in Arabidopsis and in human.]
Yu, Chun-Ping, Jinn-Jy Lin, and Wen-Hsiung Li. "Positional distribution of transcription factor binding sites in Arabidopsis thaliana." Scientific Reports 6 (2016).
Koudritsky, Mark, and Eytan Domany. "Positional distribution of human transcription factor binding sites." Nucleic Acids Research 36.21 (2008): 6795-6805.
Considerations:
• Peaks are often not found in genes or in promoters.
• Distance of nearest gene is also unreliable.
Hua, Sujun, et al. "Genomic analysis of estrogen cascade reveals histone variant H2A.Z associated with breast cancer progression." Molecular Systems Biology 4.1 (2008): 188.
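Improve target gene list
• How to rank the possible targets?
• TIP: probabilistic method to annotate peaks and rank gene targets:
  • For each gene g, define: [equation not recovered from the slide]
  • Interested in: [equation and its "where" definitions not recovered from the slide]
  • Transform all scores into z-scores, assess significance for each gene.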
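The last step, turning per-gene scores into z-scores and assessing significance, can be sketched generically. TIP's own score definition is not reproduced here. The sketch assumes a score has already been computed for each gene and that significance is read from the upper tail of a standard normal. The function names and the toy scores are hypothetical.

```python
import math


def z_scores(scores):
    """Standardize scores using the sample mean and standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sd = math.sqrt(variance)
    return [(s - mean) / sd for s in scores]


def upper_tail_p(z):
    """One-sided p-value P(Z >= z) under a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Toy per-gene scores (made-up numbers); larger means stronger binding evidence.
scores = [0.2, 0.1, 0.3, 2.5, 0.15]
pvalues = [upper_tail_p(z) for z in z_scores(scores)]
```

Because one p-value is produced per gene, the same multiple-testing adjustment used for DE analysis would apply before calling targets.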
Improve target gene list
• TIP improves upon closest-gene approaches:
  • Binding sites may fall in gene-rich locations.
  • Binding sites may affect multiple genes.
• Learn more by including expression data:
  • Effect of binding on target genes.
BETA: TF target prediction
• Genes with more nearby binding sites and more differential expression are more likely to be called as real targets.
• Calculate each gene's regulatory potential (Wang et al., 2013):
  $s_g = \sum_{i=1}^{k} e^{-(0.5 + 4 d_i)}$
  where i runs over all k peaks within 100 kb of the TSS and $d_i$ is the distance between peak i and the TSS (relative to 100 kb).
• $R_{gb}$ = rank of the gene's regulatory potential (1 is the largest potential).
• $R_{ge}$ = rank of the gene's adjusted DE p-value (1 is the 'strongest' DE).
• $RP_g = (R_{gb}/n) \times (R_{ge}/n)$
• Separately for DE UP and DE DOWN genes.
Wang, Su, et al. "Target analysis by integration of transcriptome and ChIP-seq data with BETA." Nature Protocols 8.12 (2013): 2502-2515.
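A small sketch of the two BETA quantities on this slide. It assumes the exponential form of the regulatory potential from the cited BETA paper, exp(-(0.5 + 4d)) summed over peaks, with d the peak-to-TSS distance divided by 100 kb. It also assumes the rank product multiplies the binding rank by the expression rank, each divided by the number of genes n. This is not the BETA software itself; function names, gene names, and numbers are made up.

```python
import math


def regulatory_potential(peak_distances_bp, window_bp=100_000):
    """Sum exp(-(0.5 + 4 * d)) over peaks within the window around the TSS."""
    total = 0.0
    for distance in peak_distances_bp:
        if abs(distance) <= window_bp:
            d = abs(distance) / window_bp  # distance relative to 100 kb
            total += math.exp(-(0.5 + 4.0 * d))
    return total


def rank_product(potential, adj_pvalue):
    """Return {gene: (R_gb / n) * (R_ge / n)}; both dicts must share their keys."""
    genes = sorted(potential)
    n = len(genes)
    by_binding = sorted(genes, key=lambda g: -potential[g])    # rank 1 = largest potential
    by_expression = sorted(genes, key=lambda g: adj_pvalue[g])  # rank 1 = smallest p-value
    r_gb = {g: i + 1 for i, g in enumerate(by_binding)}
    r_ge = {g: i + 1 for i, g in enumerate(by_expression)}
    return {g: (r_gb[g] / n) * (r_ge[g] / n) for g in genes}


# Toy input: peak-to-TSS distances in bp, and adjusted DE p-values, per gene.
peaks = {"g1": [1000, 50000], "g2": [90000], "g3": [200]}
potential = {g: regulatory_potential(d) for g, d in peaks.items()}
adj_p = {"g1": 0.20, "g2": 0.01, "g3": 0.001}
rp = rank_product(potential, adj_p)
best_target = min(rp, key=rp.get)
```

A smaller rank product marks a stronger candidate target. In this toy input g3 comes out on top because it ranks well on both binding and expression.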
Additional analyses
1. Determine if TF effect is activation or repression.
2. Motif analysis.
3. Other statistical analyses.
BETA: Determine if TF effect is activation or repression
• For all genes labelled as DE UP, DE DOWN, or EE, consider:
  • Value of the DE test statistic for each gene
  and
  • Each gene's regulatory potential (Wang et al., 2013):
    $s_g = \sum_{i=1}^{k} e^{-(0.5 + 4 d_i)}$
    where i runs over all k peaks within 100 kb of the TSS and $d_i$ is the distance between peak i and the TSS (relative to 100 kb).
Wang, Su, et al. "Target analysis by integration of transcriptome and ChIP-seq data with BETA." Nature Protocols 8.12 (2013): 2502-2515.
BETA: Determine if TF effect is activation or repression
• Sort genes by their $s_g$ and assign ranks.
• Use a one-tailed K-S test to determine if DE UP or DE DOWN is significantly different.
Wang, Su, et al. "Target analysis by integration of transcriptome and ChIP-seq data with BETA." Nature Protocols 8.12 (2013): 2502-2515.
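A hedged sketch of a one-tailed two-sample K-S comparison on ranks. It assumes the comparison is between the ranks of one DE group and the ranks of a background group, and it uses the large-sample approximation exp(-2 D² nm/(n+m)) for the p-value. That approximation is rough for small groups. The function name and the toy ranks are hypothetical, and this is not BETA's own implementation.

```python
import math
from bisect import bisect_right


def ks_one_sided(sample_a, sample_b):
    """Return (D_plus, approximate p-value) for a one-sided two-sample K-S test.

    D_plus = max over x of [ECDF_a(x) - ECDF_b(x)]. It is large when
    sample_a tends to take smaller values (better ranks) than sample_b.
    """
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    n, m = len(a_sorted), len(b_sorted)
    d_plus = 0.0
    # The ECDF difference can only change at observed values.
    for x in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, x) / n
        ecdf_b = bisect_right(b_sorted, x) / m
        d_plus = max(d_plus, ecdf_a - ecdf_b)
    p_value = math.exp(-2.0 * d_plus ** 2 * n * m / (n + m))
    return d_plus, p_value


# Toy ranks by regulatory potential (1 = largest potential).
up_ranks = [1, 2, 3, 5]
background_ranks = [4, 6, 7, 8, 9, 10]
d_up, p_up = ks_one_sided(up_ranks, background_ranks)
```

Running the same comparison for the DE DOWN group and seeing which group, if either, is shifted toward high regulatory potential is what would suggest activation versus repression.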
BETA: Motif analysis
• Uses the MOODS algorithm to find motifs near target genes.
• Calculate the number of motifs near the binding site, in the summit and adjacent site, and look for enrichment.
• Perform separately for DE UP and DE DOWN genes.
  • Identify differential motifs.
Wang, Su, et al. "Target analysis by integration of transcriptome and ChIP-seq data with BETA." Nature Protocols 8.12 (2013): 2502-2515.
Statistical analysis
• Do DE genes significantly overlap with TF target genes?
  • Might restrict to specific gene set of interest.
  • Perform GSEA on overlapping genes.

           TF target   Not TF target
DE
Not DE

(Michael Newton)
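One way to test the overlap in the 2x2 table is a one-sided hypergeometric (Fisher-type) test. The slides do not specify the test, so this is an assumption. The function name and the counts are made up for illustration.

```python
from math import comb


def overlap_p_value(n_genes, n_de, n_targets, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn at random from
    n_genes, of which n_targets are TF targets (hypergeometric upper tail)."""
    total = comb(n_genes, n_de)
    tail = 0
    for k in range(n_overlap, min(n_de, n_targets) + 1):
        # k targets among the DE genes, the rest drawn from non-targets.
        tail += comb(n_targets, k) * comb(n_genes - n_targets, n_de - k)
    return tail / total


# Toy counts: 20,000 genes, 500 DE, 800 TF targets, 60 in both lists.
p_overlap = overlap_p_value(20_000, 500, 800, 60)
```

With these toy counts the overlap expected by chance is 500 × 800 / 20,000 = 20 genes, so an observed overlap of 60 yields a very small p-value.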
Statistical analysis
• Positional questions:
  • Are peaks in promoter regions? (Introns, exons, intergenic, etc.)
• Permutation test: sample a set of regions randomly* many times and count X for each random set. Calculate the empirical p-value.

                        Peak   Not peak (?)
Promoter region          X
Not promoter region
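A toy version of the permutation test for X, the number of peaks in promoter regions. It makes strong simplifying assumptions that are not from the slides: a single chromosome, peaks reduced to summit positions, and random positions drawn uniformly. Real analyses usually match the random regions to the peaks more carefully. All names and coordinates are hypothetical.

```python
import random


def count_in_promoters(positions, promoters):
    """Count positions that fall inside any promoter interval [start, end)."""
    return sum(any(start <= p < end for start, end in promoters) for p in positions)


def permutation_p_value(peaks, promoters, genome_length, n_perm=1000, seed=0):
    """Return (observed X, empirical p-value) for peaks falling in promoters."""
    rng = random.Random(seed)
    observed = count_in_promoters(peaks, promoters)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Draw as many random positions as there are peaks.
        random_positions = [rng.randrange(genome_length) for _ in peaks]
        if count_in_promoters(random_positions, promoters) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)


promoters = [(1000, 2000), (50000, 51000)]
peaks = [1500, 1600, 50500, 50900]
x_observed, p_empirical = permutation_p_value(peaks, promoters, 1_000_000, n_perm=200)
```

The empirical p-value is the share of random sets whose count is at least as large as the observed X, so its resolution is limited by the number of permutations.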
GREAT: Genomic Regions Enrichment of Annotations Tool
• Enrichment tests for binding sites using typical methods can be biased.
  • Non-coding elements do not necessarily associate with the nearest gene.
• GREAT:
  • Functionally annotates non-coding regions based on nearby genes.
  • Accounts for the total fraction of the genome actually annotated for any given ontology term.
  • Counts how many input genomic regions (peaks) fall into those areas.
  • Binomial test over regions.
McLean, Cory Y., et al. "GREAT improves functional interpretation of cis-regulatory regions." Nature Biotechnology 28.5 (2010): 495-501.
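The binomial test over regions can be sketched as an upper-tail probability: given n input peaks and a fraction p of the genome annotated with a term, how surprising is it that k or more peaks land in that fraction? This is only the final test, not the GREAT tool or its rules for assigning regulatory domains. The function name and the numbers are illustrative.

```python
import math


def binomial_upper_tail(n_regions, n_hits, fraction):
    """P(X >= n_hits) for X ~ Binomial(n_regions, fraction), summed in log space."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    log_p = math.log(fraction)
    log_q = math.log1p(-fraction)
    total = 0.0
    for k in range(n_hits, n_regions + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), using lgamma for the coefficient.
        log_term = (
            math.lgamma(n_regions + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_regions - k + 1)
            + k * log_p
            + (n_regions - k) * log_q
        )
        total += math.exp(log_term)
    return min(1.0, total)


# Toy numbers: 2,000 peaks, 2% of the genome annotated with the term, 70 peaks inside.
p_term = binomial_upper_tail(2000, 70, 0.02)
```

Working in log space keeps the terms finite for thousands of peaks, where the raw binomial coefficients would overflow a float.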
Bigger picture biological questions
• Can TF binding predict gene expression?
  • Huge repositories of ChIP-seq for many TFs.
• For any RNA-seq experiment: let $Y_g$ be the log expression of gene g, and let $X_{g,j}$ be some measure of each TF j relative to gene g.
• Many extensions to this.
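One simple way to relate Y_g to the X_{g,j} is a linear regression of log expression on the TF binding measures. The slides' own model is not preserved in this transcript, so ordinary least squares is an assumption here. The solver below is a plain-Python illustration with made-up data, not a recommendation over a statistics library.

```python
def fit_ols(X, y):
    """Least-squares coefficients [intercept, beta_1, ..., beta_p],
    obtained by solving the normal equations with Gauss-Jordan elimination."""
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Toy data: 4 genes, 2 TFs; X[g][j] is a binding measure of TF j near gene g.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 3.0, -2.0, 0.0]  # log expression, generated as 1 + 2*x1 - 3*x2
coefficients = fit_ols(X, y)
```

A positive coefficient for a TF would suggest activation and a negative one repression, under the strong assumption that effects are linear and additive.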
Bigger picture biological questions
• Regulatory gene networks.
  -> See Sushmita Roy's slides!
Additional data integration:
• eQTL: how sequence variants affect expression.
• ATAC-seq: only regions of DNA that are open can be actively transcribed.
ATAC-seq
http://www.abcam.com/epigenetics/epigenetics-application-spotlight-atac-seq
ATAC-seq
Ackermann, Amanda M., et al. "Integration of ATAC-seq and RNA-seq identifies human alpha cell and beta cell signature genes." Molecular Metabolism 5.3 (2016): 233-244.
Summary
• Integrating data types is useful for very specific questions (one particular TF) and for broader problems (gene networks).
• Understanding characteristics of each data type is crucial.
  • Biological aspects
  • Excellent statistical methods
Richardson, Sylvia, George C. Tseng, and Wei Sun. "Statistical methods in integrative genomics." Annual Review of Statistics and Its Application 3 (2016): 181-209.