Next Generation Sequencing - Purdue University · A Next Generation Sequencing (NGS) Refresher

Post on 26-May-2020


Next Generation Sequencing

Nadia Atallah

A Next Generation Sequencing (NGS) Refresher

• Became commercially available in 2005
• Construction of a sequencing library → clonal amplification to generate sequencing features
• High degree of parallelism
• Uses micro- and nanotechnologies to reduce the size of sample components
  • Reduces reagent costs
  • Enables massively parallel sequencing reactions
• Revolutionary: has brought high speed to genome sequencing
• Changed the way we do research and medicine

RNA-Seq

• High-throughput sequencing of RNA
• Allows for quantification of gene expression and differential expression analyses
• Characterization of alternative splicing
• Annotation
  • Goal is to identify genes and gene architecture
• De novo transcriptome assembly
  • No genome sequence necessary!

RNA-seq workflow

Design Experiment
• Set up the experiment to address your specific biological questions
• Meet with your bioinformatician and sequencing center!!!

RNA Preparation
• Isolate RNA
• Purify RNA

Prepare Libraries
• Convert the RNA to cDNA
• Add sequencing adapters

Sequence
• Sequence the cDNA using a sequencing platform

Analysis
• Quality control
• Align reads to the genome / assemble a transcriptome
• Downstream analysis based on your questions

Replication

• Number of replicates depends on various factors:
  • Cost, complexity of experimental design (how many factors are of interest), availability of samples
• Biological replicates
  • Sequencing libraries from multiple independent biological samples
  • Very important in RNA-seq differential expression analysis studies
  • At least 3 biological replicates needed to calculate statistics such as p-values
• Technical replication
  • Sequencing multiple libraries from the same biological sample
  • Allows estimation of non-biological variation
  • Not generally necessary in RNA-seq experiments
  • Technical variation is more of an issue only for lowly expressed transcripts

[Figure: replicate layout, with one row labeled Mouse 1, Mouse 2, Mouse 3 and two rows labeled Sample 1, Sample 2, Sample 3]

Pooling Samples in RNA-seq

• Can be beneficial if tissue is scarce or enough RNA is tough to obtain
• Utilizes more samples; could increase power due to reduced biological variability
• The danger is a pooling bias (a difference between the value measured in the pool and the mean of the values measured in the corresponding individual replicates)
• Possible that you get a positive result due to only one sample in the pool
• Might miss small alterations, which can disappear when only 1 sample has a different transcriptome profile than the others in the pool
• Generally it is better to use one sample per biological replicate
• If you must pool, try to use the same amount of material per sample in the pool
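The pooling-bias definition above can be written as a one-line calculation. The sketch below is only illustrative: the function name and the expression values are hypothetical and are not taken from the slides.

```python
from statistics import mean


def pooling_bias(pooled_value, individual_values):
    """Value measured in the pool minus the mean of the values
    measured in the corresponding individual replicates."""
    return pooled_value - mean(individual_values)


# Hypothetical expression values for a single gene.
individual = [10.0, 12.0, 50.0]  # three samples sequenced separately (one outlier)
pooled = 30.0                    # the same three samples sequenced as one pool

bias = pooling_bias(pooled, individual)
print(f"pooling bias: {bias:.1f}")
```

A bias of zero would mean the pool reproduces the average of its members; a single unusual sample can push the pooled value away from that average.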

Evaluated validity of two pooling strategies (3 or 8 biological replicates per pool; two pools per group). Found pooling bias and low positive predictive value of DE analysis in pooled samples.

Single-end versus paired-end

• Reads = the sequenced portion of cDNA fragments
• Single-end = cDNA fragments are sequenced from only one end (1x100)
• Paired-end = cDNA fragments are sequenced from both ends (2x100)
• Paired-end is important for de novo transcriptome assembly and for identifying transcriptional isoforms
• Less important for differential gene expression if there is a good reference genome
• Don't use paired-end reads for sequencing small RNAs…
• Note on read length: long reads are important for de novo transcript assembly and for identifying transcriptional isoforms, not required for differential gene expression if there is a good reference genome

Sequencing Depth – How deep should I sequence?

• Depth = (read length)(number of reads) / (haploid genome length)
• Each library prep method suffers from specific biases and results in uneven coverage of individual transcripts → in order to get reads spanning the entire transcript, more reads (deeper sequencing) are required
• Depends on experimental objectives
  • Differential gene expression? Get enough counts of each transcript such that accurate statistical inferences can be made
  • De novo transcriptome assembly? Maximize coverage of rare transcripts and transcriptional isoforms
  • Annotation?
  • Alternative splicing analysis?
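As a rough illustration of the depth formula above, the following sketch computes depth from a read count and, inversely, the read count needed for a target depth. The function names are hypothetical, and the 3 Gb genome length is a round-number assumption for a human-sized genome.

```python
import math


def sequencing_depth(read_length, n_reads, haploid_genome_length):
    """Depth = (read length)(number of reads) / (haploid genome length)."""
    return read_length * n_reads / haploid_genome_length


def reads_needed(target_depth, read_length, haploid_genome_length):
    """Invert the depth formula to get the number of reads for a target depth."""
    return math.ceil(target_depth * haploid_genome_length / read_length)


GENOME_LENGTH = 3_000_000_000  # assumed round figure for a human-sized genome

print(sequencing_depth(100, 30_000_000, GENOME_LENGTH))  # 30 M reads of 100 bp
print(reads_needed(30, 100, GENOME_LENGTH))              # reads for 30x depth
```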

[Figure: sequencing depth plot; axis label: Million Reads]

1) Liu Y., et al., RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 30(3):301-304 (2014)
2) Liu Y., et al., Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One 8(6):e66883 (2013)
3) Bentley, D.R. et al., Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008)
4) Rozowsky, J. et al., PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotech. 27, 65-75 (2009)

Strand Specificity

• Strand-specific = you know whether the read originated from the + or – strand
• Important for de novo transcript assembly
• Important for identifying true anti-sense transcripts
• Less important for differential gene expression if there is a reference genome
• Knowledge of strandedness may help assign reads to genes adjacent to one another but on opposite strands

RNA-seq experimental design summary

• Very important step - if done incorrectly, no amount of statistical expertise can glean information out of your data!!!
• Biological replicates
  • For differential expression I generally recommend at least 3 – allows you to estimate variance and p-values
• Technical replicates
  • Generally not necessary in RNA-seq experiments
• Depth of sequencing
  • Depends on your experimental goals and organism!
• Length of reads
  • Longer reads = better alignments
  • Longer reads = more expensive
• Paired-end or single-end?
  • Paired-end = better alignment
  • Paired-end = more expensive
• Pooling – not ideal but sometimes necessary
• Strand-specific?
  • Definitely for antisense transcript identification and de novo transcriptome assembly
  • Not necessary for differential gene expression on an organism with a well-characterized reference genome

Experimental Design

Perfect World
• Reads as long as possible
• Paired-end
• Sequence as deeply as possible to detect novel transcripts (100-200M)
• As many replicates as possible
• Preferably run a small pilot experiment first to see how many replicates are needed given the effect size

Real World
• Determine what your goals are and what treatments you are interested in; plan accordingly
• For a simple differential gene expression experiment on a human you could get away with single-end, 75-100 bp reads, with n=3 biological replicates, sequenced to ~30 million reads/sample (1 lane of sequencing for a simple control vs treatment 6-sample design)
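The "real world" example implies a simple budgeting calculation: samples times reads per sample, divided by what one lane yields. The sketch below assumes a lane yield of about 180 million reads, which is only what the slide's 6 samples at ~30 million reads each in one lane implies; actual yields depend on the instrument, so treat `reads_per_lane` as a value to confirm with the sequencing center. The function name is hypothetical.

```python
import math


def lanes_needed(n_samples, reads_per_sample, reads_per_lane):
    """Number of whole lanes required when samples are multiplexed."""
    return math.ceil(n_samples * reads_per_sample / reads_per_lane)


READS_PER_LANE = 180_000_000  # assumption implied by the slide, not a vendor figure

# Control vs treatment with n=3 each, ~30 million reads per sample.
print(lanes_needed(6, 30_000_000, READS_PER_LANE))
# Doubling the replicates doubles the lanes at the same depth.
print(lanes_needed(12, 30_000_000, READS_PER_LANE))
```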

Microarray versus RNA-Seq

RNA-seq
• Counts (discrete data)
• Negative binomial distribution used in statistical analysis
• No genome sequence needed
• Can be used to characterize novel transcripts/splice forms
• Metric: counts (quantitative)

Microarray
• Continuous data
• Normal distribution used in statistical analysis
• Genome must be sequenced
• Uses DNA hybridizations – sequence info needed
• Metric: relative intensities

Do I use Microarray or Sequencing?

• What expertise is available?
  • Is your lab already set up for microarrays? Does your bioinformatician prefer to analyze next-gen data? What are people in your department familiar with? Is there someone who can help you troubleshoot problems?
• Cost → microarrays are cheaper
• At what levels are the transcripts of interest likely to be expressed?
  • Microarrays indicate relative rather than absolute expression
  • This can be problematic for accurate estimation of expression levels of very highly or lowly expressed transcripts
• Does your organism of interest have a well-characterized genome?
• Data analysis: how confident are you in your ability to analyze the data?
  • Microarrays have been around for a lot longer, so microarray analysis has more user-friendly tools

What should I tell the sequencing center I want?

• Depth, number of lanes
• Multiplexing
• Single-end versus paired-end
• Which RNA species am I interested in sequencing?
• Strand-specific?
• Length of reads
• PolyA selection or ribodepletion


RNA extraction, purification, and quality assessment

• RIN = RNA integrity number
• Generally, RIN scores >8 are good, depending on the organism
• Important to use high-RIN samples, particularly when sequencing small RNAs, to be sure you aren't simply selecting degraded RNAs

[Figure: RNA integrity trace; peak labels: 18S, 28S]


Target Enrichment

• It is necessary to select which RNAs you sequence
  • Total RNA generally consists of >80% rRNA (Raz et al., 2011)
  • If rRNA is not removed, most reads would be from rRNA
• Size selection – what size RNAs do you want to select? Small RNAs? mRNAs?
• PolyA selection = method of isolating poly(A)+ transcripts, usually using oligo-dT affinity
• Ribodepletion = depletes ribosomal RNAs using sequence-specific biotin-labeled probes

Library Prep

• Before a sample can be sequenced, it must be prepared into a sample library from total RNA
• A library is a collection of fragments that represent the sample input
• Different methods exist, each with different biases


Next Generation Sequencing Platforms

• 454 Sequencing / Roche
  • GS Junior System
  • GS FLX+ System
• Illumina (Solexa)
  • HiSeq System
  • Genome Analyzer IIx
  • MiSeq
• Applied Biosystems – Life Technologies
  • SOLiD 5500 System
  • SOLiD 5500xl System
• Ion Torrent
  • Personal Genome Machine (PGM)
  • Proton

| Platform | Chemistry | Read Length (bp) | Run Time | Gb/Run | Advantage | Disadvantage |
|---|---|---|---|---|---|---|
| 454 GS Junior | Pyrosequencing | 500 | 8 hrs | 0.04 | Long read length | High error rate |
| 454 GS FLX+ | Pyrosequencing | 700 | 23 hrs | 0.7 | Long read length | High error rate |
| HiSeq | Reversible terminator | 100 | 2 days (rapid mode) | 120 (rapid mode) | High throughput, low cost | Short reads, longer run time |
| Ion Proton | Proton detection | 200 | 2 hrs | 100 | Short run times | New, less tested |


Standard Differential Expression Analysis

1) Check data quality
2) Trim & filter reads, remove adapters
3) Check data quality
4) Align reads to reference genome
5) Count reads aligning to each gene
6) Unsupervised clustering
7) Differential expression analysis
8) GO enrichment analysis
9) Pathway analysis

File formats - FASTQ files – what we get back from the sequencing center

• This is usually the format your data is in when sequencing is complete
• Text files
• Contains both sequence and base quality information
  • Phred score: $Q = -10 \log_{10} P$
  • $P$ is the base-calling error probability
  • Integer scores converted to ASCII characters
• Example:

@ILLUMINA:188:C03MYACXX:4:1101:3001:1999 1:N:0:CGATGT
TACTTGTTACAGGCAATACGAGCAGCTTCCAAAGCTTCACTAGAGACATTTTCTTTCTCCCAACTCACAAGATGAACACAAAATGGAAACT
+
1=DDFFFHHHHHJJDGHHHIJIJIIJJIJIIIGIIGJIIIJCHEIIJGIJJIJIIJIJIFGGGGGIJIFFBEFDC>@@BB?A9@3;@(553>@>C(59:?
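To make the Phred encoding concrete, here is a minimal sketch that turns quality characters back into integer scores and error probabilities. It assumes the Phred+33 ASCII offset; the slides do not state the offset, and older Illumina pipelines used a different one, so check your own files. The function names and the short quality string are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores.

    offset=33 is an assumption (Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Invert Q = -10 * log10(P) to get the base-calling error probability P."""
    return 10 ** (-q / 10)


# Hypothetical quality string: high-quality bases followed by one poor base.
for ch, q in zip("DFI#", phred_scores("DFI#")):
    print(ch, q, f"{error_probability(q):.4g}")
```

Under this encoding a score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.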

Data Cleaning: a Multistep Process

• Remove adapters: removes adapter sequences
• Remove contamination: remove contamination from FASTQ files (or GTF files)
• Trim reads: trim reads based on quality
• Separate reads: separate reads into paired and unpaired
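The "trim reads based on quality" step can be sketched as follows. This is a deliberately simple rule (drop bases from the 3' end until one meets a minimum score), assuming Phred+33 encoding and a hypothetical threshold of 20; dedicated trimming tools typically apply more robust strategies such as sliding windows, so this is not a stand-in for them.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q.

    min_q=20 and offset=33 are illustrative assumptions.
    Returns the trimmed (sequence, quality) pair.
    """
    end = len(sequence)
    # Walk back from the 3' end while the last kept base is low quality.
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# Hypothetical read: four good bases followed by two poor ones.
seq, qual = trim_3prime("ACGTAC", "IIII##")
print(seq, qual)
```

Reads that become very short after trimming are usually discarded, which is one reason a paired-end run ends up with both paired and unpaired reads.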

Quality Control – Per Base Sequence Quality Analysis

Quality Control – Per Sequence Quality Scores Analysis

Aligning Reads to a Reference

[Figure: short reads aligned against a reference genome sequence; label: unique reads]

| | Sample 1 | Sample 2 | … | Sample N |
|---|---|---|---|---|
| Gene 1 | 145 | 176 | … | 189 |
| Gene 2 | 13 | 27 | … | 19 |
| … | … | … | … | … |
| Gene G | 28 | 30 | … | 20 |
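A count table like the one above comes from tallying aligned reads per gene. The toy sketch below shows the idea for one sample on a single chromosome, using only read start positions; the function name, coordinates, and the choice to skip reads that fall in more than one gene are all assumptions made for illustration, and real counting tools also handle strands, exons, and multi-mapped reads.

```python
def count_reads_per_gene(read_positions, genes):
    """Count reads whose start position falls inside exactly one gene.

    genes maps a gene name to (start, end), 1-based and inclusive.
    Reads matching no gene, or more than one gene, are not counted.
    """
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        hits = [name for name, (start, end) in genes.items() if start <= pos <= end]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# Hypothetical annotation with two overlapping genes.
genes = {"Gene1": (100, 200), "Gene2": (150, 300)}
reads = [120, 160, 250, 400]  # 160 is ambiguous, 400 is intergenic
print(count_reads_per_gene(reads, genes))
```

Repeating this for every sample and placing the results side by side gives the genes-by-samples matrix used in differential expression analysis.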

File formats: FASTA files

• Text file with sequences (amino acids or nucleotides)
• First line per sequence begins with > and information about the sequence
• Example:

>comp2_c0_seq1
GCGAGATGATTCTCCGGTTGAATCAGATCCAGAGGCATGTATATATCGTCTGCAAAATGCTAGAAACCCTCATGTGTGTAATGCAGTGCATTCATGAAAACCTTGTAAGCTCACGTGTCGCTGACTGTCTGAGAACCGACTCGCTAATGTTCCATGGAGTGGCTGCATACATCACAGATTGTGATTCCAGGTTGCGAGACTATTTGCAGGATGCATGCGAGCTGATTGCCTATTCCTTCTACTTCTTAAATAAAGTAAGAGC
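Because the FASTA layout is so simple, it can be read with a few lines of code. The sketch below is a minimal reader written for illustration (the function name and the example records are hypothetical); it takes any iterable of lines, keeps the first word of each header as the sequence name, and joins multi-line sequences.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word is the identifier
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record example; the first sequence spans two lines.
example = [">seq1 some description", "ACGT", "TTGA", ">seq2", "GG"]
for seq_name, seq in parse_fasta(example).items():
    print(seq_name, len(seq), seq)
```

To read a file, pass an open file handle, for example `parse_fasta(open(path))` with `path` pointing at your own FASTA file.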

File formats: BAM and SAM files

• A SAM file is a tab-delimited text file that contains sequence alignment information
• This is what you get after aligning reads to the genome
• BAM files are the binary (compressed, indexable) version of SAM files → they are smaller
• Example layout:
  • Header lines (begin with "@")
  • Alignment section
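The two-part layout of a SAM file (header lines, then alignments) can be illustrated with a small parser. This sketch relies on the eleven mandatory tab-separated columns defined by the SAM format and on flag bit 4 marking an unmapped read; the function names and the example alignment line are hypothetical and do not come from the slides.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Return a dict for an alignment line, or None for a header line."""
    if line.startswith("@"):
        return None  # header lines begin with "@"
    cols = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, cols[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = cols[11:]  # optional fields, left unparsed
    return record


def is_mapped(record):
    """Flag bit 0x4 set means the read is unmapped."""
    return not (record["FLAG"] & 0x4)


# Hypothetical alignment line for a 4-base read.
line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], is_mapped(rec))
```

BAM files need a dedicated library or tool to read, since they are compressed binary rather than text.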

Terminology

• Counts = ($X_i$) the number of reads that align to a particular feature $i$ (gene, isoform, miRNA…)
• Library size = ($N$) number of reads sequenced
• FPKM = fragments per kilobase of exon per million mapped reads
  • Takes length of gene ($l_i$) into account
  • $\mathrm{FPKM}_i = \frac{X_i}{l_i N} \times 10^9$
• CPM = counts per million mapped reads
  • $\mathrm{CPM}_i = \frac{X_i}{N} \times 10^6$
• FDR = False Discovery Rate (the expected proportion of Type I errors, i.e. false positives, among the features called significant); a 10% FDR means that 10% of your differentially expressed genes are likely to be false positives
  • We must adjust for multiple testing in RNA-seq statistical analyses to control the FDR
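The FPKM and CPM definitions above translate directly into code. The sketch below follows those formulas with the count, gene length in base pairs, and library size as inputs; the function names and the example numbers are hypothetical.

```python
def cpm(count, library_size):
    """Counts per million: X_i / N * 10^6."""
    return count * 1e6 / library_size


def fpkm(count, gene_length_bp, library_size):
    """Fragments per kilobase per million: X_i / (l_i * N) * 10^9."""
    return count * 1e9 / (gene_length_bp * library_size)


# Hypothetical gene: 200 fragments, 2,000 bp long, in a 20-million-read library.
print(cpm(200, 20_000_000))
print(fpkm(200, 2_000, 20_000_000))

# Same count on a gene twice as long gives half the FPKM but the same CPM.
print(fpkm(200, 4_000, 20_000_000))
```

The factor of 10^9 is the product of the per-kilobase (10^3) and per-million (10^6) scalings.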


Caveats

• If you have zero counts it does not necessarily mean that a gene is not expressed at all
  • Especially in single-cell RNA-seq
• RNA and protein expression profiles do not always correlate well
  • Correlations vary wildly between RNA and protein expression
  • Depends on category of gene
  • Correlation coefficient distributions were found to be bimodal between gene expression and protein data (one group of gene products had a mean correlation of 0.71; the other had a mean correlation of 0.28)
  • Shankavaram et al., 2007

Thank you!

Any questions?
