Can we measure everything?
Professor Jacques Corbeil Canada Research Chair in Medical Genomics
Take home messages for Big Data
•Can we measure everything?
•Yes
•Be careful what you wish for!
Massive amount of unstructured data
Big data need analysis pipelines
Plan of the presentation
•Sequencing for genomics and metagenomics.
•Mass spectrometry for metabolomics.
•Biological computing and machine learning will be interspersed throughout.
Genomics and Metagenomics
•Next-generation sequencing.
Frédéric Raymond, Maxime Déraspe, Pier-Luc Plante & Alexandre Drouin
Metagenomic analysis with Ray Meta
[Figure: Ray Meta workflow. Labels: reads; de Bruijn graph; assembly; reference genomes and taxonomy; colored k-mers (A, B, C); colored de Bruijn graph; colored assembly; profiling.]
Taxon              Frequency
Bacteroidaceae     48 %
Rikenellaceae      15 %
Clostridiaceae     6 %
…                  …
(Boisvert et al. 2012)
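The colored k-mer idea on this slide can be illustrated with a toy sketch. This is not Ray Meta's implementation: the function names, the toy sequences, and the choice to count only k-mers carrying a single color are assumptions made for illustration, and reverse complements are ignored.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Yield every k-mer (substring of length k) of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def color_kmers(references, k):
    """Map each reference k-mer to the set of taxa (colors) containing it."""
    colors = defaultdict(set)
    for taxon, genome in references.items():
        for km in kmers(genome, k):
            colors[km].add(taxon)
    return colors

def profile(reads, colors, k):
    """Estimate taxon frequencies from read k-mers that carry exactly one color."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            taxa = colors.get(km, ())
            if len(taxa) == 1:  # assumption: ignore k-mers shared between taxa
                counts[next(iter(taxa))] += 1
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()} if total else {}

# Hypothetical toy data, for illustration only.
references = {"taxon_A": "ACGTACGGA", "taxon_B": "TTGCAATTC"}
reads = ["ACGTAC", "TTGCA", "ACGGA"]
frequencies = profile(reads, color_kmers(references, k=4), k=4)
print(frequencies)
```

On real data the k-mers would be much longer and the index would not fit in a Python dictionary on one machine, which is why the slides mention hundreds of CPUs.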
Microbiome and antibiotics
124 million bp/sample
70 samples
Molecular epidemiology using whole genomes
Ray Surveyor on 1600 bacterial genomes
A new way to do phylogeny!
• Can be adapted to whole genomes, core genomes, and others.
• Superfast
• Precise
• Insanely great!
Resistance of P. aeruginosa using whole genomes
Ray Surveyor on S. pneumoniae genomes (normalized)
Lots of k-mers
10 242 551
Whole genome
3h30 on 408 CPUs (17 computers)
Big data epidemiology with Ray Surveyor
C. difficile
Whole genome
Comparing whole genomes with specific genes
Whole genome vs. CRISPR only
Developed an algorithm to calculate similarity.
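The slide does not describe the similarity algorithm, so the sketch below uses a stand-in: the Jaccard index between k-mer sets. The function names and toy sequences are hypothetical. Restricting the comparison to specific genes (for example CRISPR regions) would amount to passing only those sequences instead of whole genomes.

```python
from itertools import combinations

def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard index of two sets: shared elements over all elements."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_similarity(sequences, k):
    """Compute the Jaccard similarity for every pair of named sequences."""
    sets = {name: kmer_set(seq, k) for name, seq in sequences.items()}
    return {(x, y): jaccard(sets[x], sets[y]) for x, y in combinations(sets, 2)}

# Hypothetical toy genomes, for illustration only.
genomes = {"g1": "ACGTACGT", "g2": "ACGTACGA", "g3": "TTTTGGGG"}
for pair, score in pairwise_similarity(genomes, k=4).items():
    print(pair, round(score, 2))
```

A matrix of such pairwise scores is the kind of input a distance-based tree or clustering method could use.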
Comparing whole genomes with specific genes
Whole genome vs. resistance genes
We will have a number for homology!
Datasets
Species                       Source                               k-mers        m
Clostridium difficile         Dr. Vivian Loo (McGill University)   32 823 803    470
Pseudomonas aeruginosa        PMID25367914                         132 487 288   393
Mycobacterium tuberculosis    PMID25599400                         11 255 033    154
Streptococcus pneumoniae      PMID23644493                         10 542 251    680
Benchmark
The Set Covering Machine (SCM) outperforms other state-of-the-art biomarker discovery methods in terms of sparsity (in parentheses) and compares favorably in terms of accuracy.
Results
On most datasets, the obtained models are highly accurate and rely on very few k-mers.
Example Models
Metabolomics
•Mass spectrometry and metabolomics.
Pier-Luc Plante, Alexandre Drouin, Francis Brochu, Prudencio Toussou, Nancy Boucher
Mass spectrometry
The paradigm
We love the hay.
High-throughput mass spectrometry: Laser Diode Thermal Desorption (LDTD)
One sample every 10 s on average
Big Data approaches
Aims of the research program
•Better quality control procedures
•Diagnostic tools in health and disease states.
•Ultimately, predict paths and assist in decision-making.
The Set Covering Machine
• Supervised learning algorithm
• 3 interesting properties in our context:
  • Accurate: state-of-the-art predictive error
  • Interpretable: sparse models
  • Scalable: optimal algorithmic complexity
• Marchand, M., & Shawe-Taylor, J. (2003). The set covering machine. The Journal of Machine Learning Research, 3, 723-746.
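To show why SCM models are sparse, here is a minimal sketch of the greedy conjunction case described by Marchand & Shawe-Taylor (2003). It assumes boolean presence/absence features (such as k-mers); the names `train_scm_conjunction`, `p`, and `max_rules` are illustrative, and this is not the code used for the results in this talk.

```python
def rule_output(rule, x):
    """A rule (index, required) is satisfied when feature `index` equals `required`."""
    index, required = rule
    return x[index] == required

def train_scm_conjunction(X, y, p=1.0, max_rules=3):
    """Greedily build a conjunction of rules.

    Each step keeps the rule with the best utility:
    (negatives it rejects) - p * (positives it wrongly rejects).
    """
    remaining_neg = {i for i, label in enumerate(y) if not label}
    remaining_pos = {i for i, label in enumerate(y) if label}
    candidates = [(j, v) for j in range(len(X[0])) for v in (True, False)]
    model = []
    while remaining_neg and len(model) < max_rules:
        best = None
        for rule in candidates:
            covered = {i for i in remaining_neg if not rule_output(rule, X[i])}
            lost = {i for i in remaining_pos if not rule_output(rule, X[i])}
            utility = len(covered) - p * len(lost)
            if best is None or utility > best[0]:
                best = (utility, rule, covered, lost)
        _, rule, covered, lost = best
        if not covered:  # no rule rejects any remaining negative: stop
            break
        model.append(rule)
        remaining_neg -= covered
        remaining_pos -= lost
    return model

def predict(model, x):
    """Predict positive only if every rule in the conjunction is satisfied."""
    return all(rule_output(rule, x) for rule in model)

# Hypothetical toy data: rows are genomes, columns are k-mer presence flags.
X = [[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 1, 0],
     [1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 1]]
y = [True, True, True, False, False, False]
model = train_scm_conjunction(X, y)
print(model)
```

The learned model is a short list of rules that can be read directly, for example "feature 2 is present", which is what makes each selected k-mer a candidate biomarker.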
Sensitive detection
Approximately 40,000 peaks per spectrum.
[Figure: MS spectra (intensity vs. m/z) for Patients 1 to 5.]
Each peak is a potential biomarker
New problem = new algorithms
Problem: we have many mass spectra, but the m/z values are not identical from one sample to another.
* Reference-free mass spectrum alignment (RF-MSA): corrects peak shape variation (ion distribution)
* Virtual lock mass (VLM): homologous peak distance correction
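The slides do not give the correction formula, so the following is only an illustrative sketch of the general lock-mass idea: peaks known to be homologous across spectra serve as anchor points, and other m/z values are shifted by interpolating between them. The function name, the piecewise-linear interpolation, and the numbers are assumptions, not the published VLM method.

```python
from bisect import bisect_right

def correct_mz(mz, observed_locks, reference_locks):
    """Map an observed m/z onto the reference scale.

    observed_locks: sorted m/z of the anchor peaks as seen in this spectrum.
    reference_locks: the m/z those same peaks should have after correction.
    Between anchors the correction is interpolated linearly; outside the
    anchor range the ratio of the nearest anchor is applied.
    """
    if mz <= observed_locks[0]:
        return mz * reference_locks[0] / observed_locks[0]
    if mz >= observed_locks[-1]:
        return mz * reference_locks[-1] / observed_locks[-1]
    k = bisect_right(observed_locks, mz)
    x0, x1 = observed_locks[k - 1], observed_locks[k]
    y0, y1 = reference_locks[k - 1], reference_locks[k]
    return y0 + (mz - x0) * (y1 - y0) / (x1 - x0)

# Hypothetical anchors, for illustration only.
observed = [100.0, 200.0]
reference = [100.01, 200.03]
print(correct_mz(150.0, observed, reference))
```

Applying the same mapping to every peak of a spectrum puts all spectra on a common m/z scale before they are compared.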
Aligned spectra
By default, a consensus peak must be present in all 4 spectra, but the threshold can be relaxed (e.g., 3 of 4).
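The consensus rule can be sketched as follows, assuming the spectra are already aligned and that peaks within a small ppm tolerance are the same peak. The function name, `tol_ppm`, and `min_fraction` are hypothetical, and the simple chaining of neighboring peaks is a simplification.

```python
def consensus_peaks(spectra, tol_ppm=5.0, min_fraction=1.0):
    """Return the mean m/z of peak groups found in enough spectra.

    spectra: list of lists of m/z values, one list per spectrum.
    min_fraction: 1.0 requires a peak in every spectrum; 0.75 accepts 3 of 4.
    """
    # Pool all peaks, remembering which spectrum each one came from.
    tagged = sorted((mz, s) for s, spectrum in enumerate(spectra) for mz in spectrum)
    groups, current = [], []
    for mz, s in tagged:
        # Start a new group when the gap to the previous peak exceeds the tolerance.
        if current and (mz - current[-1][0]) / current[-1][0] * 1e6 > tol_ppm:
            groups.append(current)
            current = []
        current.append((mz, s))
    if current:
        groups.append(current)
    needed = min_fraction * len(spectra)
    consensus = []
    for group in groups:
        if len({s for _, s in group}) >= needed:
            consensus.append(sum(mz for mz, _ in group) / len(group))
    return consensus

# Hypothetical peak lists for four aligned spectra.
spectra = [
    [100.0000, 250.0000, 400.0],
    [100.0002, 250.0005, 400.0],
    [100.0001, 300.0000, 400.0],
    [100.0003, 250.0002, 400.0],
]
print(len(consensus_peaks(spectra)))                     # strict: all 4 spectra
print(len(consensus_peaks(spectra, min_fraction=0.75)))  # relaxed: 3 of 4
```

Lowering `min_fraction` trades strictness for coverage: more peaks survive, at the cost of including peaks that are missing from some samples.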
50 of the 192 aligned spectra
Consensus (502 matches)
Can do clustering and multifactorial analyses!
100 of the 1000 aligned spectra
At this stage, one needs more help!
Plasmas (male and female)
Can machine learning do better?
48 samples and 12 832 peaks in the dataset
Perspectives
• In a position to derive signatures for specific states of complex biological matrices.
• Useful for diagnostics and monitoring industrial processes.
• Relatively cheap compared to sequencing and immunodetection systems.
• More in line with clinical or industrial settings, since samples are processed one at a time for the same cost.
Mass spectrometry
A new paradigm!!!
Acknowledgements
• Nancy Boucher
• Francis Brochu
• Alexandre Drouin
• Sébastien Giguère
• Pier-Luc Plante
• Frédéric Raymond
• Lynda Robitaille
• Prudencio Toussou
Héma-Québec: Louis Thibault
AstraZeneca: Veronica Kos, Humphrey Gardner
Phytronix: Serge Auger, Jean Lacoursière, Pierre Picard
Royal Victoria: Vivian Loo
INSPQ: Cécile Tremblay
Waters Corp.: Keith Fadgen, Geoff Gerhardt
Big Data Centre: François Laviolette
U. Laval: Mario Marchand