dbcampliconspipeline bioinformatics · python application overlap paired reads (wrapper around...
TRANSCRIPT
dbcAmplicons pipelineBioinformatics
Matthew L. SettlesGenome Center Bioinformatics Core
University of California, [email protected]; [email protected]
Workshopdataset:Slashpile
• SlashPile– Accumulateddebrisfromcuttingbrushortrimmingtrees• Measured,bacteriaandfungalcommunitiesusing5amplicons
• 16sV1V3• 16sV4V5• ITS1• ITS2• LSU
• 3– slashpiles• 2depths• Distancefromslashpile
InputFiles:BarcodeTable
Requires3columns:BarcodeID [anameforthepair],Index1(Read2inRC),Index2(Read3)inaplaintab-delimitedtextfile.Orientationisimportant,butyoucanchangeinthepreprocessarguments.Firstlineisacommentandjusthelpmeremembers.
InputFiles:PrimerTableRequires4columns:thereadinwhichtheprimershouldbecheckedfor(allowableareP5/P7,R1/R2,READ1/READ2,F/R,FORWARD/REVERSE,PrimerPairIDdescribeswhichshouldbefound‘together’,PrimerIDindividualid,andsequence(IUPACambiguitycharactersareallowed).
InputFiles:SampleSheetRequires4columnsandaheader:SampleID samplesname,PrimerPairID sameasinprimerfile,barcodeID sameasinbarcodefile,andProjectID whichrepresentsthefileprefixfortheoutputandcanincludeapath.SampleID,PrimerPairID,BarcodeID pairsmustbeunique.InadditionforPrimerPairID,canbecommaseparated,*(matchanyprimer),or‘-’shouldmatchnoprimer.
Additionalcolumnsareallowedandwillbeaddedtothebiom fileindbcAmplicons abundances.
SamplesAllowedCharacters:
a-zA-Z0-9_-ProjectsDisallowedCharacters:
:"\'*?<>|<space>
InputFiles:SequencingReadfiles
QualityScores
Illumina ReadnamingconventionsCASAVA1.8orgreaterReadIDs
• @EAS139:136:FC706VJ:2:2104:15343:1973931:Y:18:ATCACG• EAS139theuniqueinstrumentname• 136therunid• FC706VJtheflowcell id• 2flowcell lane• 2104tilenumberwithintheflowcell lane• 15343’x’-coordinateoftheclusterwithinthetile• 197393’y’-coordinateoftheclusterwithinthetile• 1thememberofapair,1or2(paired-endormate-pairreadsonly)• YYifthereadfailsfilter(readisbad),Notherwise• 180whennoneofthecontrolbitsareon,otherwiseitisanevennumber
• ATCACGindexsequence
Readsfromthesequencingprovider
• Fastq filesareactuallynotrawdatafromtheprovider,“raw”dataisactuallybcl files.
• Sequencingproviderwillrunanapplicationbcl2fastqwithasamplesheettoproducedemultiplexed (bybarcode)fastq files.
• FordbcAmplicons youwanttorequestfromyoursequencingprovidernon-demultiplexed fastq (soonesetfortheentirerun)andtheindexreads.
Bioinformatics
Illuminabcl2fastq1.8.4 FourFastq files
“Preprocess”Demultiplex barcode
andprimer sequences
https://github.com/msettles/dbcAmplicons
dbcAmpliconsPythonApplication
Overlappairedreads(wrapper around custom Flash2)
DownstreamAnalysis
BarcodeMatchdeterminedbyedit-distance,allowingfor1mismatchperbarcode(bydefault)
Primers
Match determinedbyLevenshteinDistanceallowingfor4mismatches(bydefault)withthefinal4(bydefault)basesrequiredtobeperfectmatches
BarcodeTablePrimerTableSampleSheet
DownstreamAnalysis
PopulationCommunityProfiling(i.e.microbial,bacterial,fungal,etc.)
TargetedRe-sequencing
ScreenUsingBowtie2,screentargetsagainstareferencefasta file,separatingreadsbythosethatproducematchesandthosethatdonotmatchsequencesinthereferencedatabase.
ClassifyWrapperaroundtheMSU RibosomalDatabaseProject(RDP)ClassifierforBacterialandArchaeal 16SrRNA sequences,Fungal28SrRNA,fungalITSregions
Abundance Reduce RDPclassifierresultstoabundancetables(orbiom fileformat),rowsaretaxaandcolumnsaresamplesreadyforadditionalcommunityanalysis.
Consensus - Reducereadstoconsensus sequenceforeachsampleandamplicon
R-functions tobeaddedintodbcAmplicons
Most Common– Reducereadstothemostcommonlyoccurringreadinthesampleandamplicon(thatispresentinatleast5%and5reads,bydefault)
Haplotypes– Imputethedifferenthaplotypesinthesampleandamplicon
dbcAmplicons PythonApplication
SupplementalScripts
• convert2Readto4Read.py• Forwhensamplesareprocessedbysomeoneelse
• splitReadsBySample.py• TofacilitateuploadtotheSRA
• preprocPair_with_inlineBC.py• CutoutinlineBCandcreate4readsforstandardinputprocessing• Willworkwith”Millslab”protocol
• dbcVersionReport.sh• Printoutversionnumbersofalltools
dbcAmplicons:Preprocessing
1. Readinthemetadatainputtables:Barcodes,Primers(optional),Samples(optional)
2. Readinabatchofreads(default100,000),foreachread1. Compareindexbarcodestothebarcodetable,notebestmatchingbarcode2. Compare5’endofreadstotheprimertable,notebestmatchingprimer3. Comparetobarcode:primer pairtothesampletable,notesampleID and
projectID4. Ifitsalegitimatereads(containsmatchingbarcode,primer,sample)output
thereadstotheoutputfile3. OutputIdentified_Barcodes.txt fileOutput:Preprocessedreads,Identified_Barcodes.txt file
Barcode/PrimerComparison
ACTGACTG
GTCAGCTA
TCAGTCAG
CGTACGTA
ACTTACTG
1
6
6
5
Compareseachbarcodetoallpossiblebarcodesandreturnsthebestmatch<desirededitdistance
BarcodeComparison PrimerComparison
GGCTTGGTCATTTAGAGGAAGTAA
CGGCTTGGTCATTTAGAGGAAGTAA
ACGGCTTGGTCATTTAGAGGAAGTAA
TACGGCTTGGTCATTTAGAGGAAGTAA
TACGGACTTG_TCATTTACAGGAAGTAAAAGTCGTAA
Comparesthebeginning(primerregion)ofeachreadtoallpossibleprimersandreturnsthebestmatch<specifiedmaximimum Levenshtein disteance +final4exactmatch
Primer1
Primer2
Primer3
Primer4
Read
ThenewreadheaderHeaderformatofallidentifiedsequences:
dbcAmplicions:join
• UsesFlash2tomergereadsthatoverlaptoproducealonger(orsometimesshorterread).
• Modificationinclude:• Performscompleteoverlapswithadaptertrimming• Allowsfordifferentsizedreads(aftercuttingprimeroff)• Discardsreadswith>50%Qof10orless,whichareindicativeofadapter/primerdimers
Output:prefix.notCombined_1.fastq.gz,prefix.notCombined_2.fastq.gzprefix.extendedFrags.fastq.gzprefix.histprefix.histogram
Flash2– overlappingofreadsandadapterremovalinpairedendreads
TargetRegion
Read1
InsertsizeRead2
TargetRegion
Read1
InsertsizeRead2
TargetRegion
Read1
InsertsizeRead2
Insertsize>lengthofthenumberofcycles
Insertsize<lengthofthenumberofcycles(10bpmin)
Insertsize<lengthofthereadlength
Product:ReadPair
Product:Extended,Single
Product:AdapterTrimmed,Single
https://github.com/dstreett/FLASH2
Flash2typicallyproducestightsizes
44 76 107 135 163 179 197 215 232 248 264 280 296 328
amplicon length
log1
0 co
unt
01
23
45
67
dbcAmplicons: classify
• UsestheRDP(RibosomalDatabaseProject)classifierforbacterialandarchaeal 16S,fungalLSU,ITSwarcup/unitedatabases.Youcanprovideyourowntrainingdatabase
• Classifiessequencestotheclosesttaxonomic referenceprovidesabootstrapscoreforreliability
• ConcatenatesPaired-endreads• Cantrimofflowqualityends,tosomevalueQ
Output:fixrank fileHWI-M01380:26:000000000-ABHNY:1:2116:24606:7147|Slashpile27:16sV1V3:470Bacteria domain 1.0"Proteobacteria“phylum 1.0……….ThroughGenus/Species
DirectClassification- RDP
• RibosomalDatabaseProject(RDP)- naïveBayesianClassifier• Compareseachreadtoadatabase
• Databaseisupdatesperiodically• Comparesbyk-mers (15mers)• 100bootstrapstoestablishconfidenceinresult
• Orderdoesnotmatter,no3%!
• Drawbacks• Acceptsonlyfasta (thoughwebsiteimpliesfastq)files• Canbeslow• Downtogenusonly(for16s,speciesforITS)• Kmer databasearebasedonwhole16s• CannotgrouptogetherunknownOTUsthatrepresentuniquetaxa
Clustering
• Clustering– “Becauseoftheincreasingsizesoftoday’samplicondatasets,fastandgreedydenovo clusteringheuristicsarethepreferredandonlypracticalapproachtoproduceOTUs”.IDISAGREE
Sharedstepsinthesecurrentalgorithmsare:1. Anampliconisdrawnoutoftheampliconpoolandbecomesthecenterofanew
OTU(centroidselection)2. Thiscentroidisthencomparedtoallotheramplicons remaininginthepool.3. Ampliconsforwhichthedistanceiswithinaglobalclusteringthreshold,t(e.g.
3%),tothecentroidaremovedfromthepooltotheOUT4. TheOTUisthenclosed.Thesestepsarerepeatedaslongasampliconsremainin
thepool.
ReasonswhyI’mnotafan
1. Littletonobiologicalrationaltoanyoftheclusteringparameters,modifytheparameterstogetaresultyoulike.
2. Dependentonordering,reorderourreadsyoucangetdifferentsetofOTUs.Oftennotrepeatablefromruntorun.
3. 3%(oranyothercutoff)isBS.4. Mostclusteringalgorithmsdonotconsidersequencingerrors.5. Ifyougeneratemoredatayouhavetostarttheclusteringprocess
alloveragainaspopulationofsequencesmatters.6. I’msurethereismore
OTUclusteringComparison
ChenW,ZhangCK,ChengY,ZhangS,ZhaoH(2013)AComparisonofMethodsforClustering16SrRNA SequencesintoOTUs.PLoS ONE8(8):e70837.doi:10.1371/journal.pone.0070837http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0070837
Classification
Directclassification
Taxa1
Taxa2
Taxa3
Read1
Read2
Read3
Read4
• Takesfixrank file(s)outputsabundancestablesandtaxa_info table• Abundancetables
• Rowsaretaxa• Columnsaresamples• Countsofthenumberofampliconsforeachtaxa/samples
• Proportionstables• Sameasabundancebuteachcellistheproportionofamplicons(socountsincelldividedbythecolumnssum)
• Biom file(BiologicalObservationMatrix)• JSONfileformatformicrobiomefiles• http://biom-format.org• Abundancetablesare0heavy,abiom fileremovesthe0’saswellasstoresextrametadata
dbcAmplicons:abundance
Abundance tablesandBiom files
The BIOMfileformat (canonicallypronounced biome)isdesignedtobeageneral-useformatforrepresentingbiologicalsamplebyobservationcontingencytables.BIOMisarecognizedstandardforthe EarthMicrobiomeProject andisa GenomicsStandardsConsortium supportedproject.Containstheabundancecounts,thesamplenames,fulltaxonomicstring[domainthroughgenus/species],andanysamplemetadatainthesamplesheet.
Taxon_Name Level Slashpile1 Slashpile10 Slashpile11 Slashpile13 Slashpile14 Slashpile15 Slashpile16 Slashpile17Archaea domain 0 0 0 1 1 1 1 0Pyrolobus genus 1 0 0 0 0 0 0 0Bacteria domain 2981 1479 110 2674 1732 2707 1303 2706Acetothermia_genera_incertae_sedis genus 0 0 0 0 0 0 0 1Acidobacteria phylum 84 34 4 85 60 110 17 60Acidobacteria_Gp1 class 376 252 31 444 378 565 13 218Gp10 genus 17 5 0 1 6 5 2 0Gp11 genus 3 0 0 4 0 12 0 0Gp12 genus 6 0 0 1 2 2 0 1Gp13 genus 19 1 0 5 4 11 1 4Gp15 genus 47 6 0 10 3 29 1 27Gp16 genus 187 135 6 132 105 171 12 69Gp17 genus 21 6 1 12 8 11 15 6Gp18 genus 0 0 0 0 0 1 0 0Gp19 genus 1 0 0 0 1 0 0 1Acidicapsa genus 9 17 6 29 11 11 0 1Acidipila genus 17 9 8 18 20 36 0 7
Acidobacterium genus 13 4 0 4 2 5 1 0Bryocella genus 75 75 8 104 115 166 1 125
Samples.taxa_info.txt
Suppliesextrainformationaboutthetaxidentifiedintheexperimentaswellasthefulltaxonomicpath.
Taxon_Name MeanBootstrapValue MeanLengthMerged PercentageAsPairs Totald__Archaea 0.523 298 0 21d__Archaea;p__Crenarchaeota;c__Thermoprotei;o__Desulfurococcales;f__Pyrodictiaceae;g__Pyrolobus 0.63 555 0 1d__Bacteria 0.984 459 0 63378d__Bacteria;p__Acetothermia;c__Acetothermia_genera_incertae_sedis;o__Acetothermia_genera_incertae_sedis;f__Acetothermia_genera_incertae_sedis;g__Acetothermia_genera_incertae_sedis 0.54 452 0 3d__Bacteria;p__Acidobacteria 0.696 476 0 1869d__Bacteria;p__Acidobacteria;c__Acidobacteria_Gp1 0.85 462 0 9286d__Bacteria;p__Acidobacteria;c__Acidobacteria_Gp10;o__Gp10;f__Gp10;g__Gp10 0.771 497 0 247d__Bacteria;p__Acidobacteria;c__Acidobacteria_Gp11;o__Gp11;f__Gp11;g__Gp11 0.949 466 0 45d__Bacteria;p__Acidobacteria;c__Acidobacteria_Gp12;o__Gp12;f__Gp12;g__Gp12 0.691 455 0 70
FutureDirections
• dbcAmplicons isadatareductionpipeline,producesabundance/biomefiles,postprocessingmosttypicallydoneinR.
• Include“error-correctingbarcodes”indemultiplexing• IdentificationofPCRduplicates(usingUMI)• ReplaceRDPclassificationwithanotherscheme
• Haveideas(foryears)butnotime• Useampliconlengthinclassification• Includescreeningofdiversitysampleinpreprocessingtogetanideaofactualproportioninthepool
• IncorporatetheRgenotypingpipelineintodbcAmplicons• Extendtoinferringcopynumber(orploidylevels)
• Correctforcopynumber(16s)• Outputdataforrarefactioncurves
PostProcessing• Iprettymuchdoallofmypostanalysis(abundancetable,Biome)inR
• CommonPackages• Vegan
v https://github.com/vegandevs/vegan• Vegetarian
v https://github.com/cran/vegetarian• Phyloseq (usesvegan,ade4,ape,picante)
v https://joey711.github.io/phyloseq/
• EcologicalDiversityAnalysis• howdoesthestructureofOTUacrosssamples/groupscompare
• OrdinationAnalysis(multivariateanalysis)• Visualizetherelativesimilarity/dissimilarityacrosssamples,testfortaxa/environmentrelationships
• DifferentialAbundanceAnalysis(univariateanalysis)• UsestoolsfromRNAseq(limma,edgeR)
• Visualization(temporal,heatmaps,’trees’,more)
Standardization/Normalization
• Relative(proportional)abundances• Dividebysumofsample,values0-100%
• LogCPM fromRNAseq• Hellingerstandardization
• http://biol09.biol.umontreal.ca/PLcourses/Section_7.7_Transformations.pdf
• Others• Wisconsin
Multicommunityanalysis