Next Generation Sequencing - Purdue University
Post on 26-May-2020
Next Generation Sequencing
Nadia Atallah
A Next Generation Sequencing (NGS) Refresher
• Became commercially available in 2005
• Construction of a sequencing library → clonal amplification to generate sequencing features
• High degree of parallelism
• Uses micro- and nanotechnologies to reduce the size of sample components
  • Reduces reagent costs
  • Enables massively parallel sequencing reactions
• Revolutionary: has brought high speed to genome sequencing
• Changed the way we do research and medicine
RNA-Seq
• High-throughput sequencing of RNA
• Allows quantification of gene expression and differential expression analyses
• Characterization of alternative splicing
• Annotation
  • Goal is to identify genes and gene architecture
• De novo transcriptome assembly
  • No genome sequence necessary!
RNA-seq workflow
Design Experiment
• Set up the experiment to address your specific biological questions
• Meet with your bioinformatician and sequencing center!
RNA preparation
• Isolate RNA
• Purify RNA
Prepare Libraries
• Convert the RNA to cDNA
• Add sequencing adapters
Sequence
• Sequence the cDNA using a sequencing platform
Analysis
• Quality control
• Align reads to the genome / assemble a transcriptome
• Downstream analysis based on your questions
Replication
• Number of replicates depends on various factors:
  • Cost, complexity of experimental design (how many factors are of interest), availability of samples
• Biological replicates
  • Sequencing libraries from multiple independent biological samples
  • Very important in RNA-seq differential expression analysis studies
  • At least 3 biological replicates needed to calculate statistics such as p-values
• Technical replicates
  • Sequencing multiple libraries from the same biological sample
  • Allows estimation of non-biological variation
  • Not generally necessary in RNA-seq experiments
  • Technical variation is more of an issue only for lowly expressed transcripts
[Figure: diagram labelled Mouse 1 to Mouse 3 and Sample 1 to Sample 3]
Pooling Samples in RNA-seq
• Can be beneficial if tissue is scarce or enough RNA is tough to obtain
• Utilizes more samples; could increase power due to reduced biological variability
• The danger is a pooling bias (a difference between the value measured in the pool and the mean of the values measured in the corresponding individual replicates)
  • It is possible to get a positive result due to only one sample in the pool
  • Small alterations might be missed, since they can disappear when only 1 sample has a different transcriptome profile than the others in the pool
• Generally it is better to use one sample per biological replicate
• If you must pool, try to use the same amount of material per sample in the pool
• One evaluation of two pooling strategies (3 or 8 biological replicates per pool; two pools per group) found pooling bias and low positive predictive value of DE analysis in pooled samples.
Single-end versus paired-end
• Reads = the sequenced portion of cDNA fragments
• Single-end = cDNA fragments are sequenced from only one end (1x100)
• Paired-end = cDNA fragments are sequenced from both ends (2x100)
• Paired-end is important for de novo transcriptome assembly and for identifying transcriptional isoforms
• Less important for differential gene expression if there is a good reference genome
• Don’t use paired-end reads for sequencing small RNAs…
• Note on read length: long reads are important for de novo transcript assembly and for identifying transcriptional isoforms; they are not required for differential gene expression if there is a good reference genome
Sequencing Depth: How deep should I sequence?
• Depth = (read length × number of reads) / (haploid genome length)
• Each library prep method suffers from specific biases and results in uneven coverage of individual transcripts → to get reads spanning the entire transcript, more reads (deeper sequencing) are required
• Depends on experimental objectives
  • Differential gene expression? Get enough counts of each transcript such that accurate statistical inferences can be made
  • De novo transcriptome assembly? Maximize coverage of rare transcripts and transcriptional isoforms
  • Annotation?
  • Alternative splicing analysis?
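
The depth formula above can be sketched as a small calculation. The function name and the input numbers below are illustrative assumptions only (a round 3 Gb genome, 100 bp reads, 30 million reads); they are not recommendations from the slides.

```python
def sequencing_depth(read_length: int, num_reads: int, haploid_genome_length: int) -> float:
    """Depth = (read length * number of reads) / haploid genome length."""
    return read_length * num_reads / haploid_genome_length


# Hypothetical inputs: 100 bp reads, 30 million reads, a round 3 Gb genome.
depth = sequencing_depth(read_length=100,
                         num_reads=30_000_000,
                         haploid_genome_length=3_000_000_000)
print(f"Average depth: {depth:.2f}x")
```

Note that this is an average over the whole genome; as the slide points out, coverage of individual transcripts in RNA-seq is uneven, so the average alone does not tell you whether rare transcripts are covered.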
1) Liu Y., et al., RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 30(3):301-304 (2014)
2) Liu Y., et al., Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One 8(6):e66883 (2013)
3) Bentley, D.R., et al., Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008)
4) Rozowsky, J., et al., PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotech. 27, 65-75 (2009)
[Figure: plot with axis label "Million Reads"]
Strand Specificity
• Strand-specific = you know whether the read originated from the + or – strand
• Important for de novo transcript assembly
• Important for identifying true anti-sense transcripts
• Less important for differential gene expression if there is a reference genome
• Knowledge of strandedness may help assign reads to genes adjacent to one another but on opposite strands
RNA-seq experimental design summary
• Very important step: if done incorrectly, no amount of statistical expertise can glean information out of your data!
• Biological replicates
  • For differential expression I generally recommend at least 3; this allows you to estimate variance and p-values
• Technical replicates
  • Generally not necessary in RNA-seq experiments
• Depth of sequencing
  • Depends on your experimental goals and organism!
• Length of reads
  • Longer reads = better alignments
  • Longer reads = more expensive
• Paired-end or single-end?
  • Paired-end = better alignment
  • Paired-end = more expensive
• Pooling: not ideal but sometimes necessary
• Strand-specific?
  • Definitely for antisense transcript identification and de novo transcriptome assembly
  • Not necessary for differential gene expression on an organism with a well-characterized reference genome
Experimental Design
Perfect World
• Reads as long as possible
• Paired-end
• Sequence as deeply as possible to detect novel transcripts (100-200M reads)
• As many replicates as possible
• Preferably run a small pilot experiment first to see how many replicates are needed given the effect size
Real World
• Determine what your goals are and what treatments you are interested in; plan accordingly
• For a simple differential gene expression experiment on human samples, you could get away with single-end, 75-100 bp reads, with n=3 biological replicates, sequenced to ~30 million reads/sample (1 lane of sequencing for a simple control vs treatment 6-sample design)
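
The "real world" arithmetic (6 samples at ~30 million reads each on one lane) can be checked with a short sketch. The per-lane yield used here is a hypothetical placeholder, not a figure from the slides; ask your sequencing center for the actual yield of their instrument.

```python
import math


def lanes_needed(n_samples: int, reads_per_sample: int, reads_per_lane: int) -> int:
    """Number of whole lanes needed to reach the target reads for every sample."""
    total_reads = n_samples * reads_per_sample
    return math.ceil(total_reads / reads_per_lane)


# Design from the slide: control vs treatment, n=3 each, ~30M reads per sample.
# ASSUMPTION: a lane yields about 200M reads (hypothetical value).
ASSUMED_READS_PER_LANE = 200_000_000
print(lanes_needed(6, 30_000_000, ASSUMED_READS_PER_LANE))
```

Under that assumed yield, doubling the design to 12 samples would need a second lane, which is the kind of trade-off worth settling before library prep.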
Microarray versus RNA-Seq
RNA-seq
• Counts (discrete data)
• Negative binomial distribution used in statistical analysis
• No genome sequence needed
• Can be used to characterize novel transcripts/splice forms
• Metric: counts (quantitative)
Microarray
• Continuous data
• Normal distribution used in statistical analysis
• Genome must be sequenced
• Uses DNA hybridization; sequence info needed
• Metric: relative intensities
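
One way to see why a negative binomial model suits count data is to compare its variance with that of a Poisson model at the same mean. The sketch below assumes the common mean-dispersion parameterization, variance = mean + dispersion × mean²; the slides do not state which parameterization is intended, and the dispersion value is purely illustrative.

```python
def poisson_variance(mean: float) -> float:
    """For a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean: float, dispersion: float) -> float:
    """Mean-dispersion form (an assumption here): var = mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Illustrative dispersion only; real values are estimated from replicates.
for mean in (10, 100, 1000):
    print(mean, poisson_variance(mean), negative_binomial_variance(mean, dispersion=0.25))
```

The extra term grows with the square of the mean, which is how the model absorbs the biological variability between replicates that a Poisson model would ignore.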
Do I use Microarray or Sequencing?
• What expertise is available?
  • Is your lab already set up for microarrays? Does your bioinformatician prefer to analyze next-gen data? What are people in your department familiar with? Is there someone who can help you troubleshoot problems?
• Cost → microarrays are cheaper
• At what levels are the transcripts of interest likely to be expressed?
  • Microarrays indicate relative rather than absolute expression
  • This can be problematic for accurate estimation of expression levels of very highly or lowly expressed transcripts
• Does your organism of interest have a well-characterized genome?
• Data analysis: how confident are you in your ability to analyze the data?
  • Microarrays have been around for a lot longer, so microarray analysis has more user-friendly tools
What should I tell the sequencing center I want?
• Depth, number of lanes
• Multiplexing
• Single-end versus paired-end
• Which RNA species am I interested in sequencing?
• Strand-specific?
• Length of reads
• PolyA selection or ribodepletion
RNA extraction, purification, and quality assessment
• RIN = RNA integrity number
• Generally, RIN scores > 8 are good, depending on the organism
• Important to use high-RIN samples, particularly when sequencing small RNAs, to be sure you aren’t simply selecting degraded RNAs
[Figure labels: 18S, 28S]
Target Enrichment
• It is necessary to select which RNAs you sequence
  • Total RNA generally consists of >80% rRNA (Raz et al., 2011)
  • If rRNA is not removed, most reads would be from rRNA
• Size selection: what size RNAs do you want to select? Small RNAs? mRNAs?
• PolyA selection = method of isolating poly(A)+ transcripts, usually using oligo-dT affinity
• Ribodepletion = depletes ribosomal RNAs using sequence-specific biotin-labeled probes
Library Prep
• Before a sample can be sequenced, it must be prepared into a sample library from total RNA
• A library is a collection of fragments that represent the sample input
• Different methods exist, each with different biases
Next Generation Sequencing Platforms
• 454 Sequencing / Roche
  • GS Junior System
  • GS FLX+ System
• Illumina (Solexa)
  • HiSeq System
  • Genome Analyzer IIx
  • MiSeq
• Applied Biosystems – Life Technologies
  • SOLiD 5500 System
  • SOLiD 5500xl System
• Ion Torrent
  • Personal Genome Machine (PGM)
  • Proton
Next Generation Sequencing Platforms
Platform | Chemistry | Read Length (bp) | Run Time | Gb/Run | Advantage | Disadvantage
454 GS Junior | Pyrosequencing | 500 | 8 hrs | 0.04 | Long read length | High error rate
454 GS FLX+ | Pyrosequencing | 700 | 23 hrs | 0.7 | Long read length | High error rate
HiSeq | Reversible terminator | 100 | 2 days (rapid mode) | 120 (rapid mode) | High throughput, low cost | Short reads, longer run time
Ion Proton | Proton detection | 200 | 2 hrs | 100 | Short run times | New, less tested
Standard Differential Expression Analysis
1) Check data quality
2) Trim & filter reads, remove adapters
3) Check data quality
4) Align reads to reference genome
5) Count reads aligning to each gene
6) Unsupervised clustering
7) Differential expression analysis
8) GO enrichment analysis
9) Pathway analysis
File formats: FASTQ files (what we get back from the sequencing center)
• This is usually the format your data is in when sequencing is complete
• Text files
• Contain both sequence and base quality information
  • Phred score: $Q = -10 \log_{10} P$
  • $P$ is the base-calling error probability
  • Integer scores are converted to ASCII characters
• Example:
@ILLUMINA:188:C03MYACXX:4:1101:3001:1999 1:N:0:CGATGT
TACTTGTTACAGGCAATACGAGCAGCTTCCAAAGCTTCACTAGAGACATTTTCTTTCTCCCAACTCACAAGATGAACACAAAATGGAAACT
+
1=DDFFFHHHHHJJDGHHHIJIJIIJJIJIIIGIIGJIIIJCHEIIJGIJJIJIIJIJIFGGGGGIJIFFBEFDC>@@BB?A9@3;@(553>@>C(59:?
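
The Phred relationship can be made concrete by decoding quality characters back to scores and error probabilities. This is a minimal sketch that assumes the Phred+33 ASCII offset; the example quality string contains characters such as "(" that sit below "@", which is consistent with that offset, but confirm the encoding for your own data. The function names are illustrative.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# First five quality characters of the example record above.
for ch, q in zip("1=DDF", phred_scores("1=DDF")):
    print(ch, q, f"{error_probability(q):.5f}")
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000, which is why thresholds in that range are common in trimming and filtering.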
Data Cleaning: a Multistep Process
• Remove adapters: removes adapter sequences
• Remove contamination: removes contamination from FASTQ files (or GTF files)
• Trim reads: trims reads based on quality
• Separate reads: separates reads into paired and unpaired
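
The "trim reads based on quality" step can be illustrated with a deliberately simple 3'-end trim. This is a sketch under stated assumptions (Phred+33 encoding, a threshold of Q20, hypothetical function names) and is not the algorithm of any particular trimming tool; dedicated tools typically use sliding windows or more refined scoring.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q (assumes Phred+33)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq: str, min_length: int = 4) -> bool:
    """Discard reads that became too short after trimming (threshold is illustrative)."""
    return len(seq) >= min_length


# Hypothetical read: four high-quality bases followed by two low-quality ones.
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII#!")
print(trimmed_seq, trimmed_qual, keep_read(trimmed_seq))
```

For paired-end data, a read that fails the length filter leaves its mate unpaired, which is why the final step separates reads into paired and unpaired sets.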
Quality Control: Per Base Sequence Quality
Quality Control: Per Sequence Quality Scores
Aligning Reads to a Reference
[Figure: unique reads aligned to a reference genome, then counted per gene and sample]
Reference genome: AGCACC GTT AGTCGAGG ACTAGTCC GATGCA
Reads: CACC GTT AGTCGA; TCGAGG ACTAGT; TAGTCC GATGCA; ACC GTT AGTCGAG
Gene | Sample 1 | Sample 2 | ……. | Sample N
Gene 1 | 145 | 176 | ……. | 189
Gene 2 | 13 | 27 | ……. | 19
……. | ……. | ……. | ……. | …….
Gene G | 28 | 30 | ……. | 20
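
The step from aligned reads to a count table can be sketched as interval overlap counting. The gene coordinates and alignments below are hypothetical, intervals are half-open (start inclusive, end exclusive), and reads overlapping more than one gene are skipped as ambiguous; that rule is one simple choice made for illustration, not a statement of how any specific counting tool behaves.

```python
def count_reads(genes: dict[str, tuple[int, int]],
                alignments: list[tuple[int, int]]) -> dict[str, int]:
    """Count reads per gene; reads hitting zero or several genes are not counted."""
    counts = {name: 0 for name in genes}
    for read_start, read_end in alignments:
        hits = [name for name, (g_start, g_end) in genes.items()
                if read_start < g_end and read_end > g_start]
        if len(hits) == 1:  # uniquely assignable read
            counts[hits[0]] += 1
    return counts


# Hypothetical coordinates on a single chromosome.
genes = {"Gene1": (0, 100), "Gene2": (150, 300)}
alignments = [(10, 60), (90, 140), (160, 210), (120, 145), (95, 160)]
print(count_reads(genes, alignments))
```

Running this per sample and stacking the results gives one column of the gene-by-sample table shown above.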
File formats: FASTA files
• Text file with sequences (amino acids or nucleotides)
• First line per sequence begins with > and information about the sequence
• Example:
>comp2_c0_seq1
GCGAGATGATTCTCCGGTTGAATCAGATCCAGAGGCATGTATATATCGTCTGCAAAATGCTAGAAACCCTCATGTGTGTAATGCAGTGCATTCATGAAAACCTTGTAAGCTCACGTGTCGCTGACTGTCTGAGAACCGACTCGCTAATGTTCCATGGAGTGGCTGCATACATCACAGATTGTGATTCCAGGTTGCGAGACTATTTGCAGGATGCATGCGAGCTGATTGCCTATTCCTTCTACTTCTTAAATAAAGTAAGAGC
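
Because the FASTA layout is so simple, a few lines are enough to read it. The parser below is a minimal sketch with a hypothetical function name; it keeps only the first word of each header as the sequence identifier and joins wrapped sequence lines. The two records in the usage example are made up for illustration.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return {sequence id: sequence} from FASTA-formatted text."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # id is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may be wrapped over many lines
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Hypothetical two-record input.
example = ">seq1 some description\nACGT\nTTGA\n>seq2\nGGCC\n"
for seq_id, seq in parse_fasta(example).items():
    print(seq_id, len(seq), seq)
```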
File formats: BAM and SAM files
• A SAM file is a tab-delimited text file that contains sequence alignment information
• This is what you get after aligning reads to the genome
• BAM files are simply the binary version (compressed and indexed) of SAM files → they are smaller
• Example:
  • Header lines (begin with "@")
  • Alignment section
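
The two parts of a SAM file, header lines and the alignment section, can be separated with a short sketch. Each alignment line has eleven mandatory tab-separated fields, which the code names explicitly. The function name and the example record are hypothetical; for real work a dedicated library is the safer choice.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam(text: str) -> tuple[list[str], list[dict]]:
    """Split SAM text into header lines and parsed alignment records."""
    headers, alignments = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):  # header lines begin with "@"
            headers.append(line)
            continue
        cols = line.split("\t")
        record = {key: (int(val) if key in INT_FIELDS else val)
                  for key, val in zip(SAM_FIELDS, cols[:11])}
        record["is_unmapped"] = bool(record["FLAG"] & 0x4)  # flag bit 0x4 = unmapped
        alignments.append(record)
    return headers, alignments


# Hypothetical minimal SAM content: two header lines and one alignment.
example = ("@HD\tVN:1.6\tSO:coordinate\n"
           "@SQ\tSN:chr1\tLN:1000\n"
           "read1\t0\tchr1\t101\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n")
headers, alignments = parse_sam(example)
print(len(headers), alignments[0]["RNAME"], alignments[0]["POS"], alignments[0]["CIGAR"])
```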
Terminology
• Counts = ($X_i$) the number of reads that align to a particular feature $i$ (gene, isoform, miRNA…)
• Library size = ($N$) number of reads sequenced
• FPKM = fragments per kilobase of exon per million mapped reads
  • Takes the length of the gene ($l_i$, in bp) into account
  • $\mathrm{FPKM}_i = \frac{X_i}{l_i N} \times 10^9$
• CPM = counts per million mapped reads
  • $\mathrm{CPM}_i = \frac{X_i}{N} \times 10^6$
• FDR = false discovery rate: the expected proportion of false positives (Type I errors) among the features called significant; a 10% FDR means that about 10% of your differentially expressed genes are likely to be false positives
  • We must adjust for multiple testing in RNA-seq statistical analyses to control the FDR
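
The CPM and FPKM definitions translate directly into code. In this sketch the gene length is taken in base pairs, which is what the factor of 10^9 (10^3 for kilobases times 10^6 for millions) implies; the function names and the example numbers are illustrative assumptions.

```python
def cpm(count: int, library_size: int) -> float:
    """CPM_i = X_i / N * 10^6."""
    return count / library_size * 1e6


def fpkm(count: int, gene_length_bp: int, library_size: int) -> float:
    """FPKM_i = X_i / (l_i * N) * 10^9, with l_i in base pairs."""
    return count / (gene_length_bp * library_size) * 1e9


# Hypothetical gene: 300 reads, 2,000 bp long, in a library of 30 million reads.
print(f"CPM  = {cpm(300, 30_000_000):.2f}")
print(f"FPKM = {fpkm(300, 2_000, 30_000_000):.2f}")
```

The two are related by gene length: FPKM is CPM divided by the gene length in kilobases, so a gene twice as long with the same count has half the FPKM.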
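
Adjusting for multiple testing to control the FDR can be sketched with the Benjamini-Hochberg procedure. The slides do not name a specific method, so treating Benjamini-Hochberg as the correction is an assumption made here for illustration; the p-values in the example are made up.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.5]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p = {p:.3f}  adjusted = {q:.4f}")
```

Genes whose adjusted value falls below the chosen FDR threshold (for example 0.10) would be reported as differentially expressed.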
Caveats
• If you have zero counts, it does not necessarily mean that a gene is not expressed at all
  • Especially in single-cell RNA-seq
• RNA and protein expression profiles do not always correlate well
  • Correlations between RNA and protein expression vary widely
  • Depends on category of gene
  • Correlation coefficient distributions between gene expression and protein data were found to be bimodal (one group of gene products had a mean correlation of 0.71; the other had a mean correlation of 0.28)
  • Shankavaram et al., 2007
Thank you!
Any questions?