Workshop on Whole Genome Sequencing and Analysis, 19-21 Sep. 2016
Metagenomics
Metagenomics is the study of genetic material recovered directly from environmental samples.
Ex: soil (rainforest), fecal samples (human gut microbiome), water (deep ocean)
Only a small fraction of bacteria can be easily cultured in the lab (<5%). Metagenomics gives a picture of all species from an environment.
Hasman et al 2014. J. Clinic. Microbiol. 52(1). 139-146.
Traditional analysis
- Species identification
- MIC testing
Bioinformatics analysis
- KmerFinder: Species identification
- MG-RAST: Used to estimate the level of host contamination and the distribution of bacterial species
- Chainmapper (a predecessor of MGMapper). Uses mapping (BWA and Bowtie) for identifying species
- MLST/ResFinder
- SNPTree (a predecessor of CSIPhylogeny): Used for constructing phylogenetic trees
Results, species identification
For samples that were unculturable, species could not be determined using conventional or WGS-based identification
Sequencing directly on clinical samples often resulted in identification of multiple species
Results, antimicrobial resistance
Direct sequencing did not miss any resistance genes, rather it led to an overestimation of the occurrence of resistance
Conclusions
Direct sequencing of a (simple) clinical sample can be used for
• Decreasing the overall time spent on analysis
• Detection of pathogens
• Bacterial typing
• Detection of antimicrobial resistance genes
This will be further improved when
• Better methods for extracting DNA from samples with little DNA exist
• Cut-off criteria for detecting pathogens/genes have been determined
MGmapper: A tool to map MetaGenomics data
▪ A tool to map FASTQ files from metagenomic samples against one or more reference sequence databases
▪ Make output that is “fairly” easy to understand
▪ Give an overview of abundance, depth and coverage in relation to refseq
Reference based mapping of fastq reads
name_R1.fastq
@HWUSI-EAS664L:24:64FGCAAXX:4:1:2853:1232 1:N:0:CTTGTA
CCTCGGACGATTGCCGNATAATTTCTGGGTACCACGATGCTTGTTTTCACCACAAGAATGAATGTTTTCGGCACATTTCTCCCCAGAGTGTTATAATTGCG
+
HHHHHGHHHHHHHHHD#BDDABCCAGHHHHHFEHHHHHHHHHHHHHHHHHHHHHGHBFGGGGGGGGHHEGDHCE<EEEBEDDDC7A-@7@?B=A?BEEBAE
@HWUSI-EAS664L:24:64FGCAAXX:4:1:5315:1234 1:N:0:CTTGTA
CAGTGCCATCGTAATANTGAGTGCTGGCTCGAAGATGGAGAGCGTTAAGGCGATCCGATTTTGTTGGAGTGTCTCCTGGTTATCTGCGGCTCTGACCATTA
+
IIIIIIIIIIIIIIIF#FFEFAFFEIIIIIIEIIFCGG?EEGDGEIHHIIGHEEGGIEGIHGGACCEAEBFB@EEBBDE@B??>AB@AAA>>:@:==8=@@

name_R2.fastq
@HWUSI-EAS664L:24:64FGCAAXX:4:1:2853:1232 2:N:0:CTTGTA
TCACTACCGTAATTTGAACCGGCAAGATAATGCCGAAGTTCTGTAAATAAGTAAAGATTTGCGCGCTAAATCGCAACAAACAGGTTCGGCACATTACTCCG
+
IIIIDIIIIIHHIIIIHIIIIFHFHIHIGIGHIII>DGBGGGGDFBCGDDFEDFFFBFFHDICHFBDDEBEFBHEGGGEEAGG<?@BBBB8BBB/?6?;86
@HWUSI-EAS664L:24:64FGCAAXX:4:1:5315:1234 2:N:0:CTTGTA
CACTTTAAGTATTTTGCAATCCAGCGGCGTCCCTCTGCTGGATGGGATGAATTTGTCCACCGAAAGCCTCAACAACCTCGAACTTCGCCAGCGTCTGGCAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIGIIGHHIIIIFIHHHIHIFIIHG@F@EFDE@EE8C8A>A;?1>@C8C?>?<?9<>:?<)?
In a FASTQ file, each sequence read covers four lines.
• Line 1 begins with a ‘@’ and is followed by a sequence identifier
• Line 2 is the raw sequence (A, C, G, T or N)
• Line 3 begins with a ‘+’ and is optionally followed by the same identifier as in the first line. Alternatively it is empty
• Line 4 encodes the quality values for the sequence in Line 2. It must have the same number of symbols as Line 2.
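The four-line layout can be checked with a few lines of Python. This is a minimal sketch, not part of MGmapper: the names `FastqRecord` and `read_fastq` are made up for illustration, and the reader assumes unwrapped records (exactly one line each for sequence and quality).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@' on line 1
    sequence: str    # line 2
    quality: str     # line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped FASTQ stream, four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("line 4 must have as many symbols as line 2")
        yield FastqRecord(header[1:], sequence, quality)


# Toy record (hypothetical data, not taken from a real run).
example = "@read1 1:N:0:CTTGTA\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].identifier, records[0].sequence, records[0].quality)
```

The length check mirrors the rule for Line 4; a record whose quality string is shorter or longer than its sequence raises `ValueError` instead of being passed on silently.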
Pre-processing of reads
FASTQ files
- Adaptor removal/trimming
- Identify paired reads
- Keep reads that don’t map to the phiX genome
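One way to identify paired reads is to match the identifiers of the R1 and R2 files up to the first space, since the two mates of a pair share that part and differ only in the `1:N:0:…` / `2:N:0:…` suffix. The sketch below assumes this Illumina-style naming; `read_name` and `paired_names` are hypothetical helpers, and how MGmapper itself pairs reads is not stated here.

```python
from typing import Iterable, List


def read_name(identifier: str) -> str:
    """Return the part of a FASTQ identifier before the first whitespace."""
    return identifier.split()[0]


def paired_names(r1_ids: Iterable[str], r2_ids: Iterable[str]) -> List[str]:
    """Names present in both files, in R1 order (assumed pairing rule)."""
    r2_names = {read_name(identifier) for identifier in r2_ids}
    return [read_name(i) for i in r1_ids if read_name(i) in r2_names]


r1 = [
    "HWUSI-EAS664L:24:64FGCAAXX:4:1:2853:1232 1:N:0:CTTGTA",
    "HWUSI-EAS664L:24:64FGCAAXX:4:1:5315:1234 1:N:0:CTTGTA",
    "orphan_read 1:N:0:CTTGTA",  # hypothetical read with no mate in R2
]
r2 = [
    "HWUSI-EAS664L:24:64FGCAAXX:4:1:2853:1232 2:N:0:CTTGTA",
    "HWUSI-EAS664L:24:64FGCAAXX:4:1:5315:1234 2:N:0:CTTGTA",
]
pairs = paired_names(r1, r2)
print(pairs)
```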
Bwa mem mapping of all reads against reference databases and remove bad hits
Filter hits based on alignment criteria
- Alignment Score (AS) >= 30
- Example: of three hits with AS=28, AS=45 and AS=90, only the hits with AS=45 and AS=90 are kept
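The alignment score filter can be sketched on SAM records, where BWA-MEM reports the score in the optional `AS:i:` tag. This is an illustration only: the helper names and the toy SAM lines are assumptions, and MGmapper's other alignment criteria are not covered.

```python
from typing import List, Optional

MIN_ALIGNMENT_SCORE = 30  # threshold quoted on the slide


def alignment_score(sam_line: str) -> Optional[int]:
    """Return the AS:i: value of a SAM record, or None if the tag is absent."""
    # Optional tags start after the 11 mandatory SAM columns.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("AS:i:"):
            return int(field[len("AS:i:"):])
    return None


def filter_hits(sam_lines: List[str], min_score: int = MIN_ALIGNMENT_SCORE) -> List[str]:
    """Keep records whose alignment score is at least min_score."""
    kept = []
    for line in sam_lines:
        score = alignment_score(line)
        if score is not None and score >= min_score:
            kept.append(line)
    return kept


def toy_record(name: str, score: int) -> str:
    """Build a minimal, hypothetical SAM line carrying only an AS tag."""
    columns = [name, "0", "ref1", "1", "60", "5M", "*", "0", "0", "ACGTA", "IIIII"]
    return "\t".join(columns + [f"AS:i:{score}"])


hits = [toy_record("read_a", 28), toy_record("read_b", 45), toy_record("read_c", 90)]
kept_scores = [alignment_score(line) for line in filter_hits(hits)]
print(kept_scores)
```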
Best mode: Re-arrange database hits and keep only the read pairs with the best sum of alignment scores.
Full mode: Keep all hits even if present in several databases.
Example, before filtering:
- Bacteria pair1: forward AS=55, reverse AS=60
- Bacteria pair2: forward AS=90, reverse AS=100
- Human pair1: forward AS=60, reverse AS=60
- Fungi pair1: forward AS=50, reverse AS=55
After filtering:
- Bacteria pair2: forward AS=90, reverse AS=100
- Human pair1: forward AS=60, reverse AS=60
- Fungi pair1: forward AS=50, reverse AS=55
Final output: Abundance and read count statistics, fasta contigs, taxonomy annotation, post-processing (confidence)
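Properly paired reads
- YES: Both reads map to the same reference sequence (RefSeq1) and the insert size is within the upper and lower boundaries determined by bwa mem
- NO: Paired but mapped to different reference sequence entries (RefSeq1 and RefSeq2)
Insert size: the distance spanned by the read pair, from the 5´ end of one read to the 5´ end of the other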
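The YES/NO rule for properly paired reads can be written as a small predicate. The sketch assumes the two conditions named above (same reference sequence entry, insert size inside the boundaries); the function name and the boundary values 100 and 600 are hypothetical, since bwa mem derives its own boundaries from the data.

```python
def is_properly_paired(ref_forward: str, ref_reverse: str,
                       insert_size: int, lower: int, upper: int) -> bool:
    """True if both reads hit the same reference and the insert size is in range."""
    same_reference = ref_forward == ref_reverse
    # abs() because the insert size reported in SAM (TLEN) is signed.
    within_bounds = lower <= abs(insert_size) <= upper
    return same_reference and within_bounds


# Hypothetical boundaries, for illustration only.
LOWER, UPPER = 100, 600

print(is_properly_paired("RefSeq1", "RefSeq1", 350, LOWER, UPPER))   # same entry, in range
print(is_properly_paired("RefSeq1", "RefSeq2", 350, LOWER, UPPER))   # different entries
print(is_properly_paired("RefSeq1", "RefSeq1", 5000, LOWER, UPPER))  # insert too large
```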
MGmapper: Finding the best hit (best mode) or finding them all (full mode)
Reference sequence databases: Bacteria, Bacteria-draft, Plasmid, Virus, ResFinder, GreenGenes, Silva, … nt
Best mode: A read pair can map to only 1 reference sequence in each of the selected reference sequence databases.
Full mode: All read pair hits are reported.
Alignment score for a read pair is the sum of the alignment scores for each read.
- Bacteria: 55 + 60 = 115
- Human: 60 + 60 = 120
- Fungi: 50 + 55 = 105
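The pair scores above can be reproduced in a few lines. The selection step is an assumption made for illustration: it takes "best" to mean the database with the highest pair score for a given read pair. The function names are hypothetical and not MGmapper's own.

```python
from typing import Dict, Tuple


def pair_score(forward_as: int, reverse_as: int) -> int:
    """Alignment score of a read pair: the sum of the two read scores."""
    return forward_as + reverse_as


def best_database(hits: Dict[str, Tuple[int, int]]) -> str:
    """Database with the highest pair score (first one wins on ties)."""
    return max(hits, key=lambda database: pair_score(*hits[database]))


# (forward AS, reverse AS) of one read pair against three databases.
hits = {"Bacteria": (55, 60), "Human": (60, 60), "Fungi": (50, 55)}
scores = {database: pair_score(*read_scores) for database, read_scores in hits.items()}
best = best_database(hits)
print(scores, best)
```

In full mode no selection would be needed, because all hits are reported.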
Fragmentation of DNA from a sample: genome sizes differ
- Genome 1: inserts after fragmentation
- Genome 2: inserts after fragmentation
Many reads (inserts) from bigger DNA pieces, fewer from small genomes or genes
Abundance: How much is there?
If a cat is 4 times bigger than an ant, what is the most abundant species?
Fastq reads mapped to a reference sequence
Why normalize read counts with reference sequence size?
Numbers
Size = number of bp’s in the reference sequence (refseq)
• Strain abundance (paired-end)
  • Abundance (%) = 100 * readCount / size * 2
• Strain abundance (single-end)
  • Abundance (%) = 100 * readCount / size
• Abundance species (%) = Σ Abundance strain
• Covered_positions
  • Number of positions in a refseq that are observed at >= 1X
• Coverage = covered_positions / size
• Depth = nucleotides / size
• ReadCountUniq = reads where AS > XS, where
  • AS is the alignment score and
  • XS is the score of the second best hit
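A short worked example shows why read counts are normalized by reference size. It is a sketch under stated assumptions: the function names, genome sizes and read counts are made up, and only the single-end abundance formula is used, together with coverage, depth and the unique read count as defined above.

```python
from typing import Iterable, List, Tuple


def abundance_single_end(read_count: int, size: int) -> float:
    """Strain abundance (%) for single-end reads: 100 * readCount / size."""
    return 100 * read_count / size


def species_abundance(strain_abundances: Iterable[float]) -> float:
    """Species abundance (%): the sum of its strain abundances."""
    return sum(strain_abundances)


def coverage(covered_positions: int, size: int) -> float:
    """Fraction of reference positions observed at >= 1X."""
    return covered_positions / size


def depth(nucleotides: int, size: int) -> float:
    """Mapped nucleotides per reference position."""
    return nucleotides / size


def read_count_uniq(scores: List[Tuple[int, int]]) -> int:
    """Number of reads whose best score (AS) beats the second best (XS)."""
    return sum(1 for alignment, suboptimal in scores if alignment > suboptimal)


# Hypothetical numbers: the "cat" reference is 4 times bigger than the "ant".
cat = abundance_single_end(read_count=400, size=4000)
ant = abundance_single_end(read_count=200, size=1000)
print(cat, ant)  # the ant has fewer reads but the higher normalized abundance

print(coverage(covered_positions=750, size=1000))
print(depth(nucleotides=5000, size=1000))
print(read_count_uniq([(90, 45), (60, 60), (55, 0)]))
```

With raw read counts the cat looks twice as abundant as the ant; after dividing by reference size the ranking is reversed, which is the point of the cat and ant question.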
MGmapper settings
MGmapper output, continued
Content of Excel file
Results from mapping to the ResFinder database