metagenomics - goseqit · metagenomics 2 metagenomics is the study of genetic material recovered...

Workshop on Whole Genome Sequencing and Analysis, 19-21 Sep. 2016

Metagenomics


Metagenomics is the study of genetic material recovered directly from environmental samples.

Ex:

- Soil (rainforest)

- Fecal samples (human gut microbiome)

- Water (deep ocean)

Only a small fraction of bacteria can be easily cultured in the lab (<5%). Metagenomics gives a picture of all species from an environment.

Hasman et al. 2014. J. Clin. Microbiol. 52(1): 139-146.

Traditional analysis

- Species identification

- MIC testing

Bioinformatics analysis

- KmerFinder: Species identification

- MG-RAST: Used to estimate the level of host contamination and the distribution of bacterial species

- Chainmapper (a predecessor of MGmapper): Uses mapping (BWA and Bowtie) to identify species

- MLST/ResFinder

- SNPTree (a predecessor of CSIPhylogeny): Used for constructing phylogenetic trees


Results, species identification

For samples that were unculturable, species could not be determined using conventional or WGS-based identification



Sequencing directly on clinical samples often resulted in identification of multiple species

Results, antimicrobial resistance

Direct sequencing did not miss any resistance genes, rather it led to an overestimation of the occurrence of resistance

Conclusions

Direct sequencing of a (simple) clinical sample can be used for

• Decreasing the overall time spent on analysis

• Detection of pathogens

• Bacterial typing

• Detection of antimicrobial resistance genes

This will be further improved when

• Better methods for extracting DNA from samples with little DNA exist

• Cut-off criteria for detecting pathogens/genes have been determined

MGmapper: A tool to map MetaGenomics data

▪ A tool to map FASTQ files from metagenomic samples against one or more reference sequence databases

▪ Make output that is “fairly” easy to understand

▪ Give an overview of abundance, depth and coverage in relation to refseq

Reference-based mapping of FASTQ reads

name_R1.fastq

@HWUSI-EAS664L:24:64FGCAAXX:4:1:2853:1232 1:N:0:CTTGTA
CCTCGGACGATTGCCGNATAATTTCTGGGTACCACGATGCTTGTTTTCACCACAAGAATGAATGTTTTCGGCACATTTCTCCCCAGAGTGTTATAATTGCG
+
HHHHHGHHHHHHHHHD#BDDABCCAGHHHHHFEHHHHHHHHHHHHHHHHHHHHHGHBFGGGGGGGGHHEGDHCE<EEEBEDDDC7A-@7@?B=A?BEEBAE
@HWUSI-EAS664L:24:64FGCAAXX:4:1:5315:1234 1:N:0:CTTGTA
CAGTGCCATCGTAATANTGAGTGCTGGCTCGAAGATGGAGAGCGTTAAGGCGATCCGATTTTGTTGGAGTGTCTCCTGGTTATCTGCGGCTCTGACCATTA
+
IIIIIIIIIIIIIIIF#FFEFAFFEIIIIIIEIIFCGG?EEGDGEIHHIIGHEEGGIEGIHGGACCEAEBFB@EEBBDE@B??>AB@AAA>>:@:==8=@@

name_R2.fastq

@HWUSI-EAS664L:24:64FGCAAXX:4:1:2853:1232 2:N:0:CTTGTA
TCACTACCGTAATTTGAACCGGCAAGATAATGCCGAAGTTCTGTAAATAAGTAAAGATTTGCGCGCTAAATCGCAACAAACAGGTTCGGCACATTACTCCG
+
IIIIDIIIIIHHIIIIHIIIIFHFHIHIGIGHIII>DGBGGGGDFBCGDDFEDFFFBFFHDICHFBDDEBEFBHEGGGEEAGG<?@BBBB8BBB/?6?;86
@HWUSI-EAS664L:24:64FGCAAXX:4:1:5315:1234 2:N:0:CTTGTA
CACTTTAAGTATTTTGCAATCCAGCGGCGTCCCTCTGCTGGATGGGATGAATTTGTCCACCGAAAGCCTCAACAACCTCGAACTTCGCCAGCGTCTGGCAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIGIIGHHIIIIFIHHHIHIFIIHG@F@EFDE@EE8C8A>A;?1>@C8C?>?<?9<>:?<)?

In a FASTQ file, each sequence read covers four lines.

• Line 1 begins with a ‘@’ and is followed by a sequence identifier

• Line 2 is the raw sequence (ACTGN)

• Line 3 begins with a ‘+’ and is optionally followed by the same identifier as in the first line. Alternatively it is empty

• Line 4 encodes the quality values for the sequence in Line 2. It must have the same number of symbols as Line 2.
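The four-line rule above can be checked mechanically. The following is a minimal sketch, not the reader used by any of the tools named in these slides; the record `read1` and the function name `read_fastq` are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines and enforce the rules listed above."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\n")
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must begin with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 must begin with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("line 4 must have as many symbols as line 2")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical one-record file, not taken from the slides.
example = "@read1 1:N:0:CTTGTA\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
```

A record whose quality string is shorter or longer than its sequence raises `ValueError`, which is usually the first sign of a truncated file.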

Pre-processing of reads

- FASTQ files

- Adaptor removal/trimming

- Identify paired reads

- Keep reads that don’t map to the phiX genome
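The "identify paired reads" step can be illustrated with the identifiers from the example files: both mates share the text before the first space and differ only in the `1:` or `2:` that follows. This is a sketch under that assumption, not MGmapper's own pairing code; `read_name`, `match_pairs` and the orphan read are invented for the example.

```python
def read_name(header):
    """Return the part of a FASTQ identifier line that both mates share."""
    return header.lstrip("@").split()[0]


def match_pairs(r1_headers, r2_headers):
    """Split read names into pairs and orphans (reads without a mate)."""
    r1_names = [read_name(h) for h in r1_headers]
    r2_names = [read_name(h) for h in r2_headers]
    r1_set, r2_set = set(r1_names), set(r2_names)
    paired = [n for n in r1_names if n in r2_set]
    orphans = [n for n in r1_names if n not in r2_set]
    orphans += [n for n in r2_names if n not in r1_set]
    return paired, orphans


r1 = [
    "@HWUSI-EAS664L:24:64FGCAAXX:4:1:2853:1232 1:N:0:CTTGTA",
    "@HWUSI-EAS664L:24:64FGCAAXX:4:1:5315:1234 1:N:0:CTTGTA",
    "@hypothetical:orphan:read 1:N:0:CTTGTA",  # invented, has no mate in r2
]
r2 = [
    "@HWUSI-EAS664L:24:64FGCAAXX:4:1:2853:1232 2:N:0:CTTGTA",
    "@HWUSI-EAS664L:24:64FGCAAXX:4:1:5315:1234 2:N:0:CTTGTA",
]
paired, orphans = match_pairs(r1, r2)
```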

BWA-MEM mapping of all reads against reference databases, then removal of bad hits

Hits after mapping: AS=28, AS=45, AS=90

Filter hits based on alignment criteria: Alignment Score >= 30

Hits after filtering: AS=45, AS=90

Best mode: Re-arrange database hits and keep only the read pairs with the best sum of alignment scores.

Full mode: Keep all hits, even if present in several databases.
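The alignment-score filter (Alignment Score >= 30) can be sketched on SAM records, where BWA-MEM reports the score in the optional `AS:i:` field. The helper `make_line` and the three reads are hypothetical; they only reproduce the scores 28, 45 and 90 from the slide.

```python
def alignment_score(sam_line):
    """Return the AS:i value from a SAM record's optional fields, or None."""
    # The first 11 tab-separated columns are mandatory; tags follow them.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("AS:i:"):
            return int(field[5:])
    return None


def keep_hit(sam_line, min_score=30):
    """Apply the slide's criterion: keep hits with alignment score >= 30."""
    score = alignment_score(sam_line)
    return score is not None and score >= min_score


def make_line(name, score):
    """Hypothetical helper: build a minimal SAM record with a given score."""
    return "\t".join([name, "0", "ref1", "1", "60", "5M", "*", "0", "0",
                      "ACGTA", "IIIII", f"AS:i:{score}"])


hits = [make_line(f"read{i}", s) for i, s in enumerate([28, 45, 90], start=1)]
kept_scores = [alignment_score(h) for h in hits if keep_hit(h)]
```

With these inputs only the hits scoring 45 and 90 survive, matching the slide.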

Hits before re-arranging:

Database   Pair    Forward AS   Reverse AS
Bacteria   pair1   55           60
Bacteria   pair2   90           100
Human      pair1   60           60
Fungi      pair1   50           55

Hits kept:

Database   Pair    Forward AS   Reverse AS
Bacteria   pair2   90           100
Human      pair1   60           60
Fungi      pair1   50           55

Final output

• Abundance and read count statistics, FASTA contigs

• Taxonomy annotation, post-processing (confidence)

Properly paired reads

YES: Insert size within upper and lower boundaries determined by bwa mem (both reads on RefSeq1)

5´ |-----------------InsertSize------------------| 5´

NO: Paired but mapped to different reference sequence entries (RefSeq1 and RefSeq2)
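The YES/NO rule for properly paired reads can be written as a small predicate. In practice bwa mem makes this decision itself and records it in the SAM FLAG (bit 0x2); the sketch below only restates the rule, and the bounds 100 and 600 are hypothetical stand-ins for the boundaries bwa mem would estimate.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    ref_forward: str   # reference entry hit by the forward read
    ref_reverse: str   # reference entry hit by the reverse read
    insert_size: int   # signed, as in the SAM TLEN column


def is_properly_paired(pair, lower, upper):
    """YES only if both mates hit the same entry and the insert size is in range."""
    same_reference = pair.ref_forward == pair.ref_reverse
    return same_reference and lower <= abs(pair.insert_size) <= upper


LOWER, UPPER = 100, 600  # hypothetical bounds, not values from the slides

yes_case = is_properly_paired(MatePair("RefSeq1", "RefSeq1", 350), LOWER, UPPER)
no_other_ref = is_properly_paired(MatePair("RefSeq1", "RefSeq2", 0), LOWER, UPPER)
no_too_long = is_properly_paired(MatePair("RefSeq1", "RefSeq1", 5000), LOWER, UPPER)
```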

MGmapper: Finding the best hit (best mode) or finding them all (full mode)

Reference sequence databases: Bacteria, Bacteria-draft, Plasmid, Virus, ResFinder, GreenGenes, Silva, …, nt

Best mode: A read pair can map to only 1 reference sequence in each of the selected reference sequence databases.

Full mode: All read pair hits are reported.

The alignment score for a read pair is the sum of the alignment scores for each read.

Database   Forward AS   Reverse AS   Read pair score
Bacteria   55           60           115
Human      60           60           120
Fungi      50           55           105
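The read pair scores above are plain sums, which makes the two modes easy to contrast in code. This sketch assumes that the three hits belong to the same read pair and that best mode keeps the single hit with the highest sum across databases; that is one reading of the slides, not a confirmed description of MGmapper. The function names are invented.

```python
def pair_score(forward_as, reverse_as):
    """Read pair score = sum of the alignment scores of the two reads."""
    return forward_as + reverse_as


def select_hits(hits, mode="best"):
    """hits maps database name -> (forward AS, reverse AS) for one read pair."""
    scores = {db: pair_score(f, r) for db, (f, r) in hits.items()}
    if mode == "full":
        return scores  # report every hit
    if mode == "best":
        if not scores:
            return {}
        best_db = max(scores, key=scores.get)  # assumption: compare across databases
        return {best_db: scores[best_db]}
    raise ValueError(f"unknown mode: {mode!r}")


hits = {"Bacteria": (55, 60), "Human": (60, 60), "Fungi": (50, 55)}
full = select_hits(hits, mode="full")
best = select_hits(hits, mode="best")
```

Under these assumptions full mode reports all three hits, while best mode keeps only the Human hit with a score of 120.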

Fragmentation of DNA from a sample: genome sizes differ

(Figure: Genome 1 and Genome 2, each shown with its inserts after fragmentation)

Many reads (inserts) from bigger DNA pieces, fewer from small genomes or genes

Abundance: How much is there?

If cat is 4 times bigger than ant, what is the most abundant species?

(Figure: FASTQ reads mapped to a reference sequence)

Why normalize read counts with reference sequence size?
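One way to see why: a larger reference yields more fragments per copy, so raw read counts favour it. The numbers below are hypothetical and chosen only so that "cat" is four times the size of "ant".

```python
def normalized_counts(read_counts, sizes):
    """Divide each read count by the size (bp) of its reference sequence."""
    return {name: read_counts[name] / sizes[name] for name in read_counts}


sizes = {"cat": 4_000, "ant": 1_000}     # hypothetical reference sizes in bp
read_counts = {"cat": 400, "ant": 200}   # hypothetical mapped read counts

per_bp = normalized_counts(read_counts, sizes)
most_reads = max(read_counts, key=read_counts.get)
most_abundant = max(per_bp, key=per_bp.get)
```

Here "cat" collects the most reads, but after dividing by size "ant" has twice as many reads per base pair, so it is the more abundant of the two.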

Numbers

• Strain abundance (paired-end)
  • Abundance (%) = 100 * readCount / size * 2

• Strain abundance (single-end)
  • Abundance (%) = 100 * readCount / size

• Abundance species (%) = Σ Abundance strain

• Covered_positions
  • Number of positions in a refseq that are observed at >= 1X

• Coverage = covered_positions / size

• Depth = nucleotides / size

• ReadCountUniq = reads where AS > XS, where
  • AS is the alignment score and
  • XS is the score of the second best hit

Size = number of bp’s in reference sequence
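The coverage, depth and ReadCountUniq definitions can be tried on a toy reference. This is a sketch, not MGmapper's implementation: it assumes that "nucleotides" means the total number of read bases aligned to the reference, and it takes alignments as 0-based half-open intervals. The abundance formulas are left out here.

```python
def coverage_stats(intervals, size):
    """intervals: (start, end) alignment spans on one reference of `size` bp."""
    depth_per_position = [0] * size
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, size)):
            depth_per_position[pos] += 1
    covered_positions = sum(1 for d in depth_per_position if d >= 1)
    nucleotides = sum(depth_per_position)  # assumption: total aligned bases
    return {
        "covered_positions": covered_positions,
        "coverage": covered_positions / size,
        "depth": nucleotides / size,
    }


def read_count_uniq(scores):
    """scores: (AS, XS) per read; count reads whose best hit beats the second best."""
    return sum(1 for alignment, second_best in scores if alignment > second_best)


# Toy example: a 10 bp reference hit by two overlapping 5 bp reads.
stats = coverage_stats([(0, 5), (3, 8)], size=10)
uniq = read_count_uniq([(90, 45), (60, 60), (50, 0)])
```

Two of the three toy reads count as unique; the read with AS equal to XS maps equally well elsewhere and is excluded.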


MGmapper settings

MGmapper output, continued

Content of Excel file

Results from mapping to the ResFinder database
