
Sequence Preprocessing: A perspective

Dr. Matthew L. Settles

Genome Center, University of California, Davis

[email protected]


Why Preprocess Reads

• We have found that aggressively “cleaning” and processing reads can make a large difference to the speed and quality of assembly and mapping results. Cleaning your reads means removing reads/bases that are:
  • other unwanted sequence (polyA tails in RNA-seq data)
  • artificially added onto the sequence of primary interest (vectors, adapters, primers)
  • low quality bases
  • originating from PCR duplication
  • not of primary interest (contamination)

• Cleaning can also mean joining short overlapping paired-end reads.

• Preprocessing also produces a number of statistics that are technical in nature and should be used to evaluate “experimental consistency”.


Read Preprocessing strategies, many over time

• Identify and remove contaminant and vector reads
  • Reads which appear to come entirely from extraneous sequence should be removed.

• Quality trim/cut
  • “End” trim a read until the average quality > Q (Lucy)
  • Remove any read with average quality < Q

• Eliminate singletons/duplicates
  • If you have excess depth of coverage, and particularly if you have at least x-fold coverage where x is the read length, then eliminating singletons is a nice way of dramatically reducing the number of error-prone reads.
  • Reads which appear the same (particularly paired-end) are often more likely PCR duplicates and therefore redundant reads.

• Eliminate all reads (pairs) containing an “N” character
  • If you can afford the loss of coverage, you might throw away all reads containing Ns.

• Identify and trim off adapters and barcodes if present
  • Believe it or not, the software provided by Illumina either does not look for, or does a mediocre job of, identifying adapters and removing them.
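Two of the filters above (drop reads with a low average quality, drop reads containing an “N”) are simple enough to sketch directly. This is a minimal illustration, assuming Phred+33 encoded quality strings; the function names and the Q20 cutoff are illustrative choices, not taken from any particular tool.

```python
def mean_quality(qual, offset=33):
    """Average Phred score of a quality string (Phred+33 encoding assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def keep_read(seq, qual, min_mean_q=20):
    """Keep a read only if it has no N and its mean quality is >= min_mean_q."""
    if "N" in seq.upper():
        return False
    return mean_quality(qual) >= min_mean_q


# Toy reads as (sequence, quality) tuples; values are made up for illustration.
reads = [
    ("ACGTACGT", "IIIIIIII"),  # Phred 40 throughout
    ("ACGNACGT", "IIIIIIII"),  # contains an N
    ("ACGTACGT", "########"),  # Phred 2 throughout
]
kept = [seq for seq, qual in reads if keep_read(seq, qual)]
```

For paired-end data the same test would normally be applied to both mates, dropping the pair if either read fails.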


Ribosomal RNA

• Ribosomal RNA makes up 90% or more of a typical total RNA sample.

• Library prep methods reduce the rRNA representation in a sample
  • Oligo-dT only binds to polyA tails to enrich a sample for mRNA
  • Ribo-depletion binds rRNA sequences to remove them from the sample

• Neither technique is 100% efficient

• Can screen (map reads to rRNA sequences) to determine rRNA removal efficiency and potentially remove those reads.
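As a rough sketch of the screening idea, the fraction of reads that map to an rRNA reference can be read off the SAM FLAG field (bit 0x4 marks an unmapped read). The function name and the toy records are assumptions for illustration; the reference names are hypothetical.

```python
def rrna_fraction(sam_lines):
    """Fraction of primary read records that mapped to the rRNA reference."""
    mapped = total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Toy SAM records aligned against a hypothetical rRNA reference.
example = [
    "@HD\tVN:1.6",
    "r1\t0\trRNA_18S\t10\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t16\trRNA_28S\t5\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
fraction = rrna_fraction(example)
```

Comparing this fraction across samples gives a quick view of how consistently the rRNA reduction step worked.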


• DNA/RNA, could contain ‘contamination’
• Library prep, fragmentation, adapter addition
• PCR enrichment
• Final library, size distribution
• Possible addition of PhiX
• Sequencing characteristics/quality


Preprocessing

• Map reads to contaminants/PhiX and extract unmapped reads [bowtie2 --local]
  • Remove contaminants (at least PhiX); uses bowtie2, then extracts all reads (pairs) that are marked as unmapped.

• Super-Deduper [PE reads only]
  • Remove PCR duplicates (we use bases 10-35 of each paired read)

• FLASH2 [PE reads only]
  • Join and extend overlapping paired-end reads
  • If reads completely overlap they will contain adapter; remove adapters
  • Identify and remove any adapter dimers present

• Scythe [SE reads only]
  • Identify and remove adapter sequence

• Sickle
  • Trim sequences (5’ and 3’) by quality score (I like Q20)

• Cleanup
  • Run a polyA/T trimmer
  • Remove any reads that are less than the minimum length parameter
  • Produce preprocessing statistics
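The final cleanup step can be sketched in a few lines. This is a simplified illustration, not the trimmer used in the pipeline: it assumes a poly-A run at the 3' end or a poly-T run at the 5' end of at least `min_run` bases, and both `min_run` and `min_length` are hypothetical parameter names and values.

```python
import re


def trim_poly_at(seq, min_run=10):
    """Trim a trailing poly-A run and a leading poly-T run of >= min_run bases."""
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    return seq


def cleanup(seqs, min_length=50, min_run=10):
    """Trim poly-A/T runs, then drop reads shorter than min_length."""
    trimmed = (trim_poly_at(s, min_run) for s in seqs)
    return [s for s in trimmed if len(s) >= min_length]


# Toy input: one read with a 15-base poly-A tail, one read that is too short.
cleaned = cleanup(["ACGTACGTAC" + "A" * 15, "ACG"], min_length=5)
```

A real implementation would trim the quality string alongside the sequence and record how many bases and reads were removed, since those counts feed the preprocessing statistics.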


Why Screen for PhiX

• PhiX is a common control in Illumina runs; facilities rarely tell you if/when PhiX has been spiked in

• Does not have a barcode, so in theory should not be in your data

• However
  • When I know PhiX has been spiked in, I find sequence every time
  • When I know PhiX has not been spiked in, I do not find sequence

• Better safe than sorry: screen for it.
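Once reads have been aligned to the PhiX genome, screening comes down to keeping the pairs that did not map. The sketch below picks read names from SAM records using the FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). Requiring both mates to be unmapped is an assumption about how strict the screen should be, and the function name is illustrative.

```python
def unmapped_pair_names(sam_lines):
    """Names of read pairs where both the read and its mate are unmapped."""
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        name, flag = fields[0], int(fields[1])
        paired = flag & 0x1
        read_unmapped = flag & 0x4
        mate_unmapped = flag & 0x8
        if paired and read_unmapped and mate_unmapped:
            names.add(name)
    return names


# Toy records: p1 is fully unmapped, p2 maps as a proper pair,
# p3 has one mapped mate and one unmapped mate.
example = [
    "p1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "p1\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "p2\t99\tphiX\t100\t42\t4M\t=\t200\t104\tACGT\tIIII",
    "p2\t147\tphiX\t200\t42\t4M\t=\t100\t-104\tACGT\tIIII",
    "p3\t73\tphiX\t300\t42\t4M\t=\t300\t0\tACGT\tIIII",
    "p3\t133\tphiX\t300\t0\t*\t=\t300\t0\tACGT\tIIII",
]
clean_pairs = unmapped_pair_names(example)
```

The resulting names can then be used to pull the matching records out of the original FASTQ files.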


SuperDeduper

https://github.com/dstreett/Super-Deduper
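The core idea of removing duplicates without alignment, keying each pair on a fixed window of bases from both reads, can be sketched as follows. This is a simplified illustration and not Super-Deduper's implementation: it keeps the first pair seen for each key, and the `start` and `length` parameters are hypothetical names for the window position.

```python
def dedup_pairs(pairs, start, length):
    """Keep the first pair seen for each key built from a window of both reads."""
    seen = set()
    unique = []
    for read1, read2 in pairs:
        # The key combines the same window from read 1 and read 2.
        key = (read1[start:start + length], read2[start:start + length])
        if key not in seen:
            seen.add(key)
            unique.append((read1, read2))
    return unique


# Toy pairs: the first two share the same window in both reads even though
# their flanking bases differ, so the second is treated as a duplicate.
pairs = [
    ("AAACGTAGG", "TTGGCCATT"),
    ("CCACGTATT", "GAGGCCAAA"),
    ("AATTTTAGG", "TTGGCCATT"),
]
unique = dedup_pairs(pairs, start=2, length=4)
```

Skipping the first few bases of each read avoids the positions at the very start of the read, and a short window keeps the key cheap to hash.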

Data                 Alignment Algorithm  MarkDuplicates     Rmdup              SuperDeduper       FastUniq    Fulcrum    Total # of Reads
PhiX                 BWA MEM              1,048,278 (0.25%)  1,011,145 (1.05%)  1,156,700 (13.7%)  4,202,526   3,092,155  4,750,299
PhiX                 Bowtie2 Local        1,054,725 (6.62%)  948,784 (10.2%)    1,166,936 (14.0%)  4,236,647   3,103,872  4,790,972
PhiX                 Bowtie2 Global       799,524 (0%)       800,868 (0.12%)    896,487 (9.92%)    3,768,641   2,704,114  4,293,787
Acropora digitifera  BWA MEM              5,132,111 (2.26%)  6,906,634 (44.5%)  5,133,339 (10.2%)  12,968,469  2,103,567  54,108,240
Acropora digitifera  Bowtie2 Local        4,688,809 (4.03%)  5,931,862 (38.9%)  3,971,743 (9.32%)  9,893,903   4,259,619  41,728,154
Acropora digitifera  Bowtie2 Global       1,457,865 (3.62%)  1,512,966 (24.2%)  1,185,838 (11.4%)  3,014,498   1,286,031  11,600,847


SuperDeduper

[Figure: two ROC panels plotting True Positive Rate against False Positive Rate, one curve per Start Position (1, 5, 20, 40, 150); the zoomed panel has one labeled point (label: 10).]

Figure 1: ROC curves. Only a representative subset of the different start positions is shown. The image on the left shows the full ROC curves and the image on the right is a zoomed-in view of the corner of the curves. Each curve represents a start position and each point represents a length. The labeled point in the image on the right is the default start and length for Super Deduper.

We calculated the Youden Index for every combination tested, and the point with the highest index value (as compared to Picard MarkDuplicates) occurred at a start position of 5 bp and a length of 10 bp (20 bp total over both reads).
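For reference, the Youden Index is J = sensitivity + specificity - 1, which equals the true positive rate minus the false positive rate. A minimal sketch of scoring several settings and picking the best one follows; the setting names and confusion-matrix counts are invented purely to exercise the code and are not results from this comparison.

```python
def youden_index(tp, fp, tn, fn):
    """Youden's J = TPR - FPR, computed from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr - fpr


def best_setting(results):
    """results maps a setting to (tp, fp, tn, fn); return the setting with max J."""
    return max(results, key=lambda setting: youden_index(*results[setting]))


# Invented counts for three hypothetical settings.
results = {
    "setting_a": (80, 2, 98, 20),   # J = 0.80 - 0.02
    "setting_b": (90, 5, 95, 10),   # J = 0.90 - 0.05
    "setting_c": (95, 30, 70, 5),   # J = 0.95 - 0.30
}
best = best_setting(results)
```

In the comparison described above, each setting would be a (start position, length) combination and the reference calls would come from Picard MarkDuplicates.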


FLASH2: overlapping of reads and adapter removal in paired-end reads

[Figure: three diagrams, each showing a target region with Read 1, Read 2, and the insert size.]

• Insert size > length of the number of cycles. Product: Read Pair
• Insert size < length of the number of cycles (10 bp min). Product: Extended, Single
• Insert size < read length. Product: Adapter Trimmed, Single

https://github.com/dstreett/FLASH2
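The three insert-size cases can be written as a small classifier. This is a sketch under stated assumptions: both reads have the same length, "number of cycles" is taken to mean the combined length of the two reads, and the 10 bp minimum is treated as the minimum overlap needed to join a pair. The function name and parameters are illustrative and are not FLASH2 options.

```python
def classify_insert(insert_size, read_length, min_overlap=10):
    """Predict the product for a read pair from its insert size."""
    if insert_size < read_length:
        # Each read runs past the end of the insert into adapter sequence.
        return "adapter trimmed, single"
    if insert_size <= 2 * read_length - min_overlap:
        # Reads overlap by at least min_overlap bases and can be joined.
        return "extended, single"
    # Too little (or no) overlap: the reads stay as a pair.
    return "read pair"


# Example with 100 bp reads.
products = [classify_insert(size, 100) for size in (80, 150, 190, 195, 300)]
```

With 100 bp reads, inserts of 100 to 190 bp would be joined into a single extended read under these assumptions.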


Quality Trimming: Sickle

Remove “poor” quality sequence from both the 5’ and 3’ ends
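A minimal sketch of trimming from both ends is shown below. It is a simpler per-base rule, not Sickle's windowed algorithm: bases are removed from each end until one meets the quality cutoff. Phred+33 encoding and the Q20 default are assumptions.

```python
def quality_trim(seq, qual, min_q=20, offset=33):
    """Trim bases below min_q from the 5' and 3' ends of a read."""
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1  # advance past low-quality bases at the 5' end
    while end > start and scores[end - 1] < min_q:
        end -= 1  # back up past low-quality bases at the 3' end
    return seq[start:end], qual[start:end]


# '#' is Phred 2 and 'I' is Phred 40 under Phred+33.
trimmed_seq, trimmed_qual = quality_trim("ACGTACGT", "##IIII##")
```

Low-quality bases in the interior of a read are left alone by this rule; a windowed approach handles those more gracefully.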


QA/QC

• Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality of the sample, the library generation, and the sequencing used to generate the data.

• This can help inform what you might do in the future.

• I’ve found it best to perform QA/QC on both the run as a whole (poor samples can affect other samples) and on the samples themselves as they compare to other samples (REMEMBER, BE CONSISTENT).
  • Reports such as BaseSpace for Illumina are great ways to evaluate the runs as a whole.
  • PCA/MDS plots of the preprocessing summary are a great way to look for technical bias across your experiment.
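As a lighter-weight stand-in for a PCA/MDS plot, per-statistic z-scores across samples can flag a sample whose preprocessing summary looks unlike the rest. This is a different and simpler technique than PCA, sketched here with only the standard library; the statistic names, values, and the 1.5 cutoff are invented for illustration.

```python
from statistics import mean, pstdev


def zscores(table):
    """table: {sample: {stat: value}} -> {sample: {stat: z-score}}."""
    stats = sorted(next(iter(table.values())))
    out = {sample: {} for sample in table}
    for stat in stats:
        values = [table[s][stat] for s in table]
        mu, sd = mean(values), pstdev(values)
        for s in table:
            # A statistic with no spread carries no signal; report 0.
            out[s][stat] = 0.0 if sd == 0 else (table[s][stat] - mu) / sd
    return out


def flag_outliers(table, threshold=1.5):
    """Samples with any statistic more than `threshold` SDs from the mean."""
    z = zscores(table)
    return sorted(s for s in z if any(abs(v) > threshold for v in z[s].values()))


# Invented preprocessing summary for four samples.
summary = {
    "s1": {"pct_duplicates": 10, "pct_adapter": 2},
    "s2": {"pct_duplicates": 11, "pct_adapter": 2},
    "s3": {"pct_duplicates": 9, "pct_adapter": 2},
    "s4": {"pct_duplicates": 30, "pct_adapter": 2},
}
outliers = flag_outliers(summary)
```

With many statistics per sample, a PCA or MDS plot remains the better overview, since it shows how samples group rather than only which ones are extreme.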


Comparing Mapping Raw vs Preprocessed with STAR