committee_meeting_1031
The Story of My Research
developing a bottom-up computational approach to investigate microbial diversity
Qingpeng Zhang
Department of Computer Science and Engineering, Michigan State University
Supervisor: Dr. Titus Brown
odyssey?
[Timeline, 2008-2014: start study/research metagenomics; khmer development; diversity analysis on k-mer level; digital normalization; Osedax symbionts; diversity analysis on read level (IGS); GPGC soil sample]
2008: metagenomics
“Big Data!”
Microbial diversity
similarity-based / composition-based
binning/annotation
assembly / reference
2009: microbial diversity
How many things are there in the sample? - alpha diversity
How different are the samples? - beta diversity
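As a rough illustration of the two questions above, alpha diversity can be summarized per sample and beta diversity as a dissimilarity between two samples. The sketch below is not taken from the talk: the choice of indices (richness, Shannon index, Bray-Curtis dissimilarity) is an assumption, and the taxon names and counts are made up.

```python
import math

def richness(counts):
    """Alpha diversity: number of distinct taxa with a nonzero count."""
    return sum(1 for c in counts.values() if c > 0)

def shannon(counts):
    """Alpha diversity: Shannon index (natural log) of one sample."""
    total = sum(counts.values())
    h = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h

def bray_curtis(a, b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in set(a) | set(b))
    return 1.0 - 2.0 * shared / total

# Hypothetical abundance tables (taxon -> count) for two samples.
sample_a = {"taxon1": 10, "taxon2": 10}
sample_b = {"taxon2": 10, "taxon3": 10}

print(richness(sample_a), shannon(sample_a), bray_curtis(sample_a, sample_b))
```

These indices need a table of taxa, which depends on annotation or binning.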
"Nothing works, everything sucks."
NO!
2009: k-mer counting
2010 -now: GPGC
How many things are there in the sample? - alpha diversity
How does agricultural soil differ from native soil? - beta diversity
2010 -now: khmer
My contributions:
• algorithm design/analysis: exploring the underlying mathematics and the choice of optimal parameters
• contributing code, including unique k-mer counting, overlap k-mer counting, optimal parameter choice, and other code related to my specific research project
• benchmarking, testing, actually using it
• exploring applications such as error trimming, filtering low-abundance reads, digital normalization, etc.; suggesting features
• working on the khmer manuscript
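To make "unique k-mer counting" and "overlap k-mer counting" concrete, here is a minimal exact sketch using Python sets. It is an illustration only: it does not use the khmer API, exact sets stand in for whatever memory-efficient data structure khmer uses, reverse complements are ignored, and the function names and toy reads are hypothetical.

```python
def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def unique_kmers(reads, k):
    """Return the set of distinct k-mers seen across all reads."""
    seen = set()
    for read in reads:
        seen.update(kmers(read, k))
    return seen

def overlap_kmers(reads_a, reads_b, k):
    """Return the k-mers present in both read sets."""
    return unique_kmers(reads_a, k) & unique_kmers(reads_b, k)

# Hypothetical toy data sets.
sample_a = ["ACGTAC", "GTACGT"]
sample_b = ["TACGTT"]

print(len(unique_kmers(sample_a, 4)))             # distinct 4-mers in sample_a
print(len(overlap_kmers(sample_a, sample_b, 4)))  # 4-mers shared by both
```

Exact sets are fine for toy input but grow with the number of distinct k-mers, which is why large data sets usually call for a more compact counting structure.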
2010 -now: khmer
• My contributions: • algorithm design/analysis, exploring the mathematics behind, the choice of optimal
parameters• contributing codes, including unique k-mers counting, overlap k-mer counting, optimal
parameter choice, others related to my specific research project.• benchmarking, testing, actually using it.• exploration of applications like error trimming, filter low abundance reads, digital
normalization, etc. suggestion on features• work on the khmer manuscript
khmer development
start study/research metagenomics
digital normalization
diversity analysis on k-mer level
Osedax Symbiontsdiversity
analysis on read level(IGS)
GPGC soil
sample
2008
2009
2010
2011
2012
2013
2014
developing a bottom-up computational approach to investigate microbial diversity
2010 -2012: diversity analysis on k-mer level
2011-2012: diginorm
median k-mer frequency represents the sequencing coverage of a read (useful for diversity analysis)
removing redundant reads (useful for assembly)
Digital normalization
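The idea on this slide can be sketched as a single streaming pass: estimate each read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. This is a simplified illustration, not khmer's implementation: an exact `Counter` replaces a memory-efficient counting structure, and the function name and the `k` and `cutoff` values are assumptions chosen for the toy example.

```python
from collections import Counter
from statistics import median

def normalize_by_median(reads, k=4, cutoff=2):
    """Keep a read only if its estimated coverage is below cutoff."""
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read is shorter than k, so it has no k-mers
        # Median k-mer count serves as the read's coverage estimate.
        coverage = median(counts[km] for km in read_kmers)
        if coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

# Hypothetical input: one sequence sampled five times, another sampled once.
reads = ["ACGTTGCA"] * 5 + ["TTTTGGGG"]
print(normalize_by_median(reads))
```

In this toy run, the redundant copies of the high-coverage read are dropped once the cutoff is reached, while the low-coverage read is retained.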
2012-2013: symbionts
My contributions:
• diginorm/assembly/binning/annotation
• genome completeness estimation
  • 94% complete Rs1
  • 66-89% complete Rs2
• some transcriptome analysis
• other bioinformatics support
2012 -now: diversity analysis on read level
IGS (informative genomic segment) can represent the novel information of a genome
We can use all the data, not only the data we understand!
AAABABCDAABC
ABCEFGHIAFGH
AAAB
AABC
ABCD ABCEFGHI AFGH
Improve the pipeline
khmer diginorm error correction
Sorcerer II Global Ocean Sampling Expedition
2010 -now: GPGC
Future work
• Finish the IGS-based diversity analysis paper
  • Refine the pipeline/adjust the statistical method to fit IGSs
  • More real data sets
    • MetaHIT (Metagenomics of the Human Intestinal Tract) (working..)
    • HMP (Human Microbiome Project) (working..)
    • GPGC (soil) (working..)
    • Ballast water virome (working..)
• Finish a review of the methods and applications of k-mer counting in bioinformatics (will also be part of my dissertation)
• Expand the application of IGS
  • sequencing depth/effort estimation, genome size estimation
  • reads binning/classification based on coverage profile across samples
  • relate IGS to phylogenetic info and function
  • extract IGS (reads) according to different coverage profiles (shared by all
Acknowledgement
● Dr. Titus Brown
● Lab members of GED
  ● Elijah Lowe
  ● Jiarong Guo
  ● Camille Scott
  ● Michael Crusoe
  ● Luiz Irber
  ● Dr. Sherine Awad
● Former members of GED
  ● Dr. Adina Howe
  ● Eric McDonald
  ● Dr. Jason Pell
  ● Dr. Likit Preeyanon
● RDP
  ● Dr. Jim Cole
  ● Jordan Fish