dbcampliconspipeline bioinformatics · python application overlap paired reads (wrapper around...

dbcAmplicons pipelineBioinformatics

Matthew L. SettlesGenome Center Bioinformatics Core

University of California, [email protected]; [email protected]

Workshopdataset:Slashpile

• SlashPile– Accumulateddebrisfromcuttingbrushortrimmingtrees• Measured,bacteriaandfungalcommunitiesusing5amplicons

• 16sV1V3• 16sV4V5• ITS1• ITS2• LSU

• 3– slashpiles• 2depths• Distancefromslashpile

InputFiles:BarcodeTable

Requires3columns:BarcodeID [anameforthepair],Index1(Read2inRC),Index2(Read3)inaplaintab-delimitedtextfile.Orientationisimportant,butyoucanchangeinthepreprocessarguments.Firstlineisacommentandjusthelpmeremembers.

InputFiles:PrimerTableRequires4columns:thereadinwhichtheprimershouldbecheckedfor(allowableareP5/P7,R1/R2,READ1/READ2,F/R,FORWARD/REVERSE,PrimerPairIDdescribeswhichshouldbefound‘together’,PrimerIDindividualid,andsequence(IUPACambiguitycharactersareallowed).

InputFiles:SampleSheetRequires4columnsandaheader:SampleID samplesname,PrimerPairID sameasinprimerfile,barcodeID sameasinbarcodefile,andProjectID whichrepresentsthefileprefixfortheoutputandcanincludeapath.SampleID,PrimerPairID,BarcodeID pairsmustbeunique.InadditionforPrimerPairID,canbecommaseparated,*(matchanyprimer),or‘-’shouldmatchnoprimer.

Additionalcolumnsareallowedandwillbeaddedtothebiom fileindbcAmplicons abundances.

SamplesAllowedCharacters:

a-zA-Z0-9_-ProjectsDisallowedCharacters:

:"\'*?<>|<space>

InputFiles:SequencingReadfiles

QualityScores

Illumina ReadnamingconventionsCASAVA1.8orgreaterReadIDs

• @EAS139:136:FC706VJ:2:2104:15343:1973931:Y:18:ATCACG• EAS139theuniqueinstrumentname• 136therunid• FC706VJtheflowcell id• 2flowcell lane• 2104tilenumberwithintheflowcell lane• 15343’x’-coordinateoftheclusterwithinthetile• 197393’y’-coordinateoftheclusterwithinthetile• 1thememberofapair,1or2(paired-endormate-pairreadsonly)• YYifthereadfailsfilter(readisbad),Notherwise• 180whennoneofthecontrolbitsareon,otherwiseitisanevennumber

• ATCACGindexsequence

Readsfromthesequencingprovider

• Fastq filesareactuallynotrawdatafromtheprovider,“raw”dataisactuallybcl files.

• Sequencingproviderwillrunanapplicationbcl2fastqwithasamplesheettoproducedemultiplexed (bybarcode)fastq files.

• FordbcAmplicons youwanttorequestfromyoursequencingprovidernon-demultiplexed fastq (soonesetfortheentirerun)andtheindexreads.

Bioinformatics

Illuminabcl2fastq1.8.4 FourFastq files

“Preprocess”Demultiplex barcode

andprimer sequences

https://github.com/msettles/dbcAmplicons

dbcAmpliconsPythonApplication

Overlappairedreads(wrapper around custom Flash2)

DownstreamAnalysis

BarcodeMatchdeterminedbyedit-distance,allowingfor1mismatchperbarcode(bydefault)

Primers

Match determinedbyLevenshteinDistanceallowingfor4mismatches(bydefault)withthefinal4(bydefault)basesrequiredtobeperfectmatches

BarcodeTablePrimerTableSampleSheet

DownstreamAnalysis

PopulationCommunityProfiling(i.e.microbial,bacterial,fungal,etc.)

TargetedRe-sequencing

ScreenUsingBowtie2,screentargetsagainstareferencefasta file,separatingreadsbythosethatproducematchesandthosethatdonotmatchsequencesinthereferencedatabase.

ClassifyWrapperaroundtheMSU RibosomalDatabaseProject(RDP)ClassifierforBacterialandArchaeal 16SrRNA sequences,Fungal28SrRNA,fungalITSregions

Abundance Reduce RDPclassifierresultstoabundancetables(orbiom fileformat),rowsaretaxaandcolumnsaresamplesreadyforadditionalcommunityanalysis.

Consensus - Reducereadstoconsensus sequenceforeachsampleandamplicon

R-functions tobeaddedintodbcAmplicons

Most Common– Reducereadstothemostcommonlyoccurringreadinthesampleandamplicon(thatispresentinatleast5%and5reads,bydefault)

Haplotypes– Imputethedifferenthaplotypesinthesampleandamplicon

dbcAmplicons PythonApplication

SupplementalScripts

• convert2Readto4Read.py• Forwhensamplesareprocessedbysomeoneelse

• splitReadsBySample.py• TofacilitateuploadtotheSRA

• preprocPair_with_inlineBC.py• CutoutinlineBCandcreate4readsforstandardinputprocessing• Willworkwith”Millslab”protocol

• dbcVersionReport.sh• Printoutversionnumbersofalltools

dbcAmplicons:Preprocessing

1. Readinthemetadatainputtables:Barcodes,Primers(optional),Samples(optional)

2. Readinabatchofreads(default100,000),foreachread1. Compareindexbarcodestothebarcodetable,notebestmatchingbarcode2. Compare5’endofreadstotheprimertable,notebestmatchingprimer3. Comparetobarcode:primer pairtothesampletable,notesampleID and

projectID4. Ifitsalegitimatereads(containsmatchingbarcode,primer,sample)output

thereadstotheoutputfile3. OutputIdentified_Barcodes.txt fileOutput:Preprocessedreads,Identified_Barcodes.txt file

Barcode/PrimerComparison

ACTGACTG

GTCAGCTA

TCAGTCAG

CGTACGTA

ACTTACTG

1

6

6

5

Compareseachbarcodetoallpossiblebarcodesandreturnsthebestmatch<desirededitdistance

BarcodeComparison PrimerComparison

GGCTTGGTCATTTAGAGGAAGTAA

CGGCTTGGTCATTTAGAGGAAGTAA

ACGGCTTGGTCATTTAGAGGAAGTAA

TACGGCTTGGTCATTTAGAGGAAGTAA

TACGGACTTG_TCATTTACAGGAAGTAAAAGTCGTAA

Comparesthebeginning(primerregion)ofeachreadtoallpossibleprimersandreturnsthebestmatch<specifiedmaximimum Levenshtein disteance +final4exactmatch

Primer1

Primer2

Primer3

Primer4

Read

ThenewreadheaderHeaderformatofallidentifiedsequences:

dbcAmplicions:join

• UsesFlash2tomergereadsthatoverlaptoproducealonger(orsometimesshorterread).

• Modificationinclude:• Performscompleteoverlapswithadaptertrimming• Allowsfordifferentsizedreads(aftercuttingprimeroff)• Discardsreadswith>50%Qof10orless,whichareindicativeofadapter/primerdimers

Output:prefix.notCombined_1.fastq.gz,prefix.notCombined_2.fastq.gzprefix.extendedFrags.fastq.gzprefix.histprefix.histogram

Flash2– overlappingofreadsandadapterremovalinpairedendreads

TargetRegion

Read1

InsertsizeRead2

TargetRegion

Read1

InsertsizeRead2

TargetRegion

Read1

InsertsizeRead2

Insertsize>lengthofthenumberofcycles

Insertsize<lengthofthenumberofcycles(10bpmin)

Insertsize<lengthofthereadlength

Product:ReadPair

Product:Extended,Single

Product:AdapterTrimmed,Single

https://github.com/dstreett/FLASH2

Flash2typicallyproducestightsizes

44 76 107 135 163 179 197 215 232 248 264 280 296 328

amplicon length

log1

0 co

unt

01

23

45

67

dbcAmplicons: classify

• UsestheRDP(RibosomalDatabaseProject)classifierforbacterialandarchaeal 16S,fungalLSU,ITSwarcup/unitedatabases.Youcanprovideyourowntrainingdatabase

• Classifiessequencestotheclosesttaxonomic referenceprovidesabootstrapscoreforreliability

• ConcatenatesPaired-endreads• Cantrimofflowqualityends,tosomevalueQ

Output:fixrank fileHWI-M01380:26:000000000-ABHNY:1:2116:24606:7147|Slashpile27:16sV1V3:470Bacteria domain 1.0"Proteobacteria“phylum 1.0……….ThroughGenus/Species

DirectClassification- RDP

• RibosomalDatabaseProject(RDP)- naïveBayesianClassifier• Compareseachreadtoadatabase

• Databaseisupdatesperiodically• Comparesbyk-mers (15mers)• 100bootstrapstoestablishconfidenceinresult

• Orderdoesnotmatter,no3%!

• Drawbacks• Acceptsonlyfasta (thoughwebsiteimpliesfastq)files• Canbeslow• Downtogenusonly(for16s,speciesforITS)• Kmer databasearebasedonwhole16s• CannotgrouptogetherunknownOTUsthatrepresentuniquetaxa

Clustering

• Clustering– “Becauseoftheincreasingsizesoftoday’samplicondatasets,fastandgreedydenovo clusteringheuristicsarethepreferredandonlypracticalapproachtoproduceOTUs”.IDISAGREE

Sharedstepsinthesecurrentalgorithmsare:1. Anampliconisdrawnoutoftheampliconpoolandbecomesthecenterofanew

OTU(centroidselection)2. Thiscentroidisthencomparedtoallotheramplicons remaininginthepool.3. Ampliconsforwhichthedistanceiswithinaglobalclusteringthreshold,t(e.g.

3%),tothecentroidaremovedfromthepooltotheOUT4. TheOTUisthenclosed.Thesestepsarerepeatedaslongasampliconsremainin

thepool.

ReasonswhyI’mnotafan

1. Littletonobiologicalrationaltoanyoftheclusteringparameters,modifytheparameterstogetaresultyoulike.

2. Dependentonordering,reorderourreadsyoucangetdifferentsetofOTUs.Oftennotrepeatablefromruntorun.

3. 3%(oranyothercutoff)isBS.4. Mostclusteringalgorithmsdonotconsidersequencingerrors.5. Ifyougeneratemoredatayouhavetostarttheclusteringprocess

alloveragainaspopulationofsequencesmatters.6. I’msurethereismore

OTUclusteringComparison

ChenW,ZhangCK,ChengY,ZhangS,ZhaoH(2013)AComparisonofMethodsforClustering16SrRNA SequencesintoOTUs.PLoS ONE8(8):e70837.doi:10.1371/journal.pone.0070837http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0070837

Classification

Directclassification

Taxa1

Taxa2

Taxa3

Read1

Read2

Read3

Read4

• Takesfixrank file(s)outputsabundancestablesandtaxa_info table• Abundancetables

• Rowsaretaxa• Columnsaresamples• Countsofthenumberofampliconsforeachtaxa/samples

• Proportionstables• Sameasabundancebuteachcellistheproportionofamplicons(socountsincelldividedbythecolumnssum)

• Biom file(BiologicalObservationMatrix)• JSONfileformatformicrobiomefiles• http://biom-format.org• Abundancetablesare0heavy,abiom fileremovesthe0’saswellasstoresextrametadata

dbcAmplicons:abundance

Abundance tablesandBiom files

The BIOMfileformat (canonicallypronounced biome)isdesignedtobeageneral-useformatforrepresentingbiologicalsamplebyobservationcontingencytables.BIOMisarecognizedstandardforthe EarthMicrobiomeProject andisa GenomicsStandardsConsortium supportedproject.Containstheabundancecounts,thesamplenames,fulltaxonomicstring[domainthroughgenus/species],andanysamplemetadatainthesamplesheet.

Taxon_Name Level Slashpile1 Slashpile10 Slashpile11 Slashpile13 Slashpile14 Slashpile15 Slashpile16 Slashpile17Archaea domain 0 0 0 1 1 1 1 0Pyrolobus genus 1 0 0 0 0 0 0 0Bacteria domain 2981 1479 110 2674 1732 2707 1303 2706Acetothermia_genera_incertae_sedis genus 0 0 0 0 0 0 0 1Acidobacteria phylum 84 34 4 85 60 110 17 60Acidobacteria_Gp1 class 376 252 31 444 378 565 13 218Gp10 genus 17 5 0 1 6 5 2 0Gp11 genus 3 0 0 4 0 12 0 0Gp12 genus 6 0 0 1 2 2 0 1Gp13 genus 19 1 0 5 4 11 1 4Gp15 genus 47 6 0 10 3 29 1 27Gp16 genus 187 135 6 132 105 171 12 69Gp17 genus 21 6 1 12 8 11 15 6Gp18 genus 0 0 0 0 0 1 0 0Gp19 genus 1 0 0 0 1 0 0 1Acidicapsa genus 9 17 6 29 11 11 0 1Acidipila genus 17 9 8 18 20 36 0 7

Acidobacterium genus 13 4 0 4 2 5 1 0Bryocella genus 75 75 8 104 115 166 1 125

Samples.taxa_info.txt

Suppliesextrainformationaboutthetaxidentifiedintheexperimentaswellasthefulltaxonomicpath.

Taxon_Name MeanBootstrapValue MeanLengthMerged PercentageAsPairs Totald__Archaea 0.523 298 0 21d__Archaea;p__Crenarchaeota;c__Thermoprotei;o__Desulfurococcales;f__Pyrodictiaceae;g__Pyrolobus 0.63 555 0 1d__Bacteria 0.984 459 0 63378d__Bacteria;p__Acetothermia;c__Acetothermia_genera_incertae_sedis;o__Acetothermia_genera_incertae_sedis;f__Acetothermia_genera_incertae_sedis;g__Acetothermia_genera_incertae_sedis 0.54 452 0 3d__Bacteria;p__Acidobacteria 0.696 476 0 1869d__Bacteria;p__Acidobacteria;c__Acidobacteria_Gp1 0.85 462 0 9286d__Bacteria;p__Acidobacteria;c__Acidobacteria_Gp10;o__Gp10;f__Gp10;g__Gp10 0.771 497 0 247d__Bacteria;p__Acidobacteria;c__Acidobacteria_Gp11;o__Gp11;f__Gp11;g__Gp11 0.949 466 0 45d__Bacteria;p__Acidobacteria;c__Acidobacteria_Gp12;o__Gp12;f__Gp12;g__Gp12 0.691 455 0 70

FutureDirections

• dbcAmplicons isadatareductionpipeline,producesabundance/biomefiles,postprocessingmosttypicallydoneinR.

• Include“error-correctingbarcodes”indemultiplexing• IdentificationofPCRduplicates(usingUMI)• ReplaceRDPclassificationwithanotherscheme

• Haveideas(foryears)butnotime• Useampliconlengthinclassification• Includescreeningofdiversitysampleinpreprocessingtogetanideaofactualproportioninthepool

• IncorporatetheRgenotypingpipelineintodbcAmplicons• Extendtoinferringcopynumber(orploidylevels)

• Correctforcopynumber(16s)• Outputdataforrarefactioncurves

PostProcessing• Iprettymuchdoallofmypostanalysis(abundancetable,Biome)inR

• CommonPackages• Vegan

v https://github.com/vegandevs/vegan• Vegetarian

v https://github.com/cran/vegetarian• Phyloseq (usesvegan,ade4,ape,picante)

v https://joey711.github.io/phyloseq/

• EcologicalDiversityAnalysis• howdoesthestructureofOTUacrosssamples/groupscompare

• OrdinationAnalysis(multivariateanalysis)• Visualizetherelativesimilarity/dissimilarityacrosssamples,testfortaxa/environmentrelationships

• DifferentialAbundanceAnalysis(univariateanalysis)• UsestoolsfromRNAseq(limma,edgeR)

• Visualization(temporal,heatmaps,’trees’,more)

Standardization/Normalization

• Relative(proportional)abundances• Dividebysumofsample,values0-100%

• LogCPM fromRNAseq• Hellingerstandardization

• http://biol09.biol.umontreal.ca/PLcourses/Section_7.7_Transformations.pdf

• Others• Wisconsin

Multicommunityanalysis

dbcampliconspipeline bioinformatics · python application overlap paired reads (wrapper around...

Documents