Can we measure everything? Professor Jacques Corbeil, Canada Research Chair in Medical Genomics

Post on 19-Jan-2016



Take home messages for Big Data

•Can we measure everything?

•Yes

•Be careful what you wish for!


Massive amount of unstructured data


Big data need analysis pipelines


Plan of the presentation

•Sequencing for genomics and metagenomics.

•Mass spectrometry for metabolomics.

•Biological computing and machine learning will be interspersed throughout.


Genomics and Metagenomics

•Nextgen sequencing.

Frédéric Raymond, Maxime Déraspe, Pier-Luc Plante & Alexandre Drouin

Metagenomic analysis with Ray Meta

[Figure: Ray Meta workflow. Labels: reads; de Bruijn graph; assembly; reference genomes and taxonomy; colored k-mers; colored de Bruijn graph; colored assembly; profiling. Example taxon frequencies: Bacteroidaceae 48 %, Rikenellaceae 15 %, Clostridiaceae 6 %, …]

(Boisvert et al. 2012)
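
The colored k-mer idea named in this workflow can be illustrated with a toy sketch. This is not Ray Meta's implementation: the function names, the tiny k, and the sequences are all illustrative assumptions. Each reference k-mer is tagged ("colored") with the taxa in which it occurs, and read k-mers that carry exactly one color are counted toward that taxon.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def color_kmers(references, k):
    """Map each reference k-mer to the set of taxa (colors) that contain it."""
    colors = defaultdict(set)
    for taxon, seq in references.items():
        for km in kmers(seq, k):
            colors[km].add(taxon)
    return colors

def profile(reads, colors, k):
    """Count read k-mers unique to one taxon and return their proportions."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            taxa = colors.get(km, ())
            if len(taxa) == 1:  # ignore k-mers shared between taxa or unknown
                counts[next(iter(taxa))] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

# Hypothetical toy data (not from the presentation)
references = {"taxon_A": "ACGTACGGA", "taxon_B": "TTGCATTGC"}
reads = ["ACGTAC", "TTGCAT", "GTACGG"]
freqs = profile(reads, color_kmers(references, k=4), k=4)
```

A real tool works on a distributed de Bruijn graph with much longer k-mers; the dictionary here only conveys the coloring and counting steps.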

Microbiome and antibiotics

124 million bp/sample

70 samples

Molecular epidemiology using whole genomes

Ray Surveyor on 1600 bacterial genomes

New way to do Phylogeny!!

• Can be adapted to whole genome, core genome, others.

• Superfast

• Precise

• Insanely great!

Resistance of P. aeruginosa using whole genomes

Ray Surveyor on S. pneumoniae genomes (normalized)

Lots of k-mers: 10 242 551

Whole genome

3h30 on 408 CPUs (17 computers)

Big data epidemiology with Ray Surveyor

C. difficile

Whole genome

Comparing whole genomes with specific genes

Panels: whole genome; CRISPR only

Developed an algorithm to calculate similarity.
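
The slides do not describe the similarity algorithm itself. As a hedged illustration only, one common way to compare genomes without alignment is the Jaccard index over their k-mer sets, optionally restricted to the k-mers of a gene subset (for example CRISPR or resistance genes). The function names, the `restrict_to` parameter, and the toy sequences below are assumptions, not Ray Surveyor's API.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similarity_matrix(genomes, k, restrict_to=None):
    """Pairwise Jaccard similarity, optionally limited to a k-mer subset."""
    names = sorted(genomes)
    sets = {}
    for name in names:
        s = kmer_set(genomes[name], k)
        sets[name] = s & restrict_to if restrict_to is not None else s
    matrix = [[jaccard(sets[a], sets[b]) for b in names] for a in names]
    return names, matrix

# Hypothetical toy genomes and a hypothetical gene of interest
genomes = {"g1": "ACGTACGT", "g2": "ACGTACGA", "g3": "TTTTGGGG"}
names, whole = similarity_matrix(genomes, k=4)
gene_kmers = kmer_set("GTACG", 4)
_, gene_only = similarity_matrix(genomes, k=4, restrict_to=gene_kmers)
```

Comparing `whole` with `gene_only` mirrors the whole-genome versus specific-gene panels: g1 and g2 differ slightly over the whole sequence but are identical on the selected gene.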

Comparing whole genomes with specific genes

Whole genome Resistance genes

We will have a number for homology!

Datasets

Clostridium difficile (source: Dr. Vivian Loo, McGill University): 32 823 803, m = 470

Pseudomonas aeruginosa (source: PMID 25367914): 132 487 288, m = 393

Mycobacterium tuberculosis (source: PMID 25599400): 11 255 033, m = 154

Streptococcus pneumoniae (source: PMID 23644493): 10 542 251, m = 680

The Set Covering Machine (SCM) outperforms other state-of-the-art biomarker discovery methods in terms of sparsity (in parentheses) and compares favorably in terms of accuracy.

Benchmark

On most datasets, the obtained models are highly accurate and rely on very few k-mers.

Results

Example Models


Metabolomics

•Mass spectrometry and metabolomic.

Pier-Luc Plante, Alexandre Drouin, Francis Brochu, Prudencio Toussou, Nancy Boucher

Mass spectrometry

The paradigm

We love the hay.

High-throughput mass spectrometry: Laser Diode Thermal Desorption (LDTD)

One sample every 10 s on average

Big Data approaches


Aims of the research program

•Better quality control procedures

•Diagnostic tools in health and disease states.

•Ultimately, predict paths and assist in decision-making.

The Set Covering Machine

• Supervised learning algorithm

• 3 interesting properties in our context:

• Accurate: state-of-the-art predictive error

• Interpretable: sparse models

• Scalable: optimal algorithmic complexity

• Marchand, M., & Shawe-Taylor, J. (2003). The set covering machine. The Journal of Machine Learning Research, 3, 723-746.
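
To make the sparsity and interpretability claims concrete, here is a simplified sketch of a greedy conjunction learner in the spirit of the Set Covering Machine. It is not the authors' code: it assumes binary presence/absence features (for example k-mers), and the function names, the penalty `p`, the `max_rules` cap, and the toy data are illustrative assumptions. At each step it picks the rule that rejects the most remaining negative examples while penalizing wrongly rejected positives.

```python
def train_scm(X, y, p=1.0, max_rules=3):
    """Greedy conjunction of presence/absence rules on binary features.

    X: list of rows of 0/1 values, y: list of 0/1 labels.
    Returns a list of (feature_index, required_value) rules.
    """
    n_features = len(X[0])
    # Candidate rules: "feature j is present (1)" or "feature j is absent (0)"
    candidates = [(j, v) for j in range(n_features) for v in (1, 0)]
    negatives = {i for i, label in enumerate(y) if label == 0}
    positives = {i for i, label in enumerate(y) if label == 1}
    model = []
    while negatives and len(model) < max_rules:
        best, best_utility = None, float("-inf")
        for j, v in candidates:
            covered = {i for i in negatives if X[i][j] != v}  # negatives rejected
            lost = {i for i in positives if X[i][j] != v}     # positives rejected
            utility = len(covered) - p * len(lost)
            if utility > best_utility:
                best, best_utility = (j, v, covered, lost), utility
        j, v, covered, lost = best
        if not covered:  # no rule removes any remaining negative example
            break
        model.append((j, v))
        negatives -= covered
        positives -= lost
    return model

def predict(model, row):
    """Predict 1 only if every rule in the conjunction is satisfied."""
    return int(all(row[j] == v for j, v in model))

# Hypothetical toy data: 3 binary features, the third one separates the classes
X = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = train_scm(X, y)
predictions = [predict(model, row) for row in X]
```

On this toy set the learned model is a single rule, which is the kind of short, readable model the slide describes as interpretable.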

Sensitive detection

Approximately 40,000 peaks per spectrum.

[Figure: mass spectra (intensity vs. m/z) for Patients 1 to 5]

Each peak is a potential biomarker


New problem = new algorithms

Problem: we have many mass spectra, but the m/z values are not identical from one sample to another.

•Reference-free mass spectrum alignment (RF-MSA): corrects peak shape variation (ion distribution)

•Virtual lock mass (VLM): corrects the distance between homologous peaks
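
The slides do not give the details of the virtual lock mass correction. The sketch below only illustrates the general idea of a lock-mass style correction, under the assumption that a few homologous peaks have already been identified in a spectrum together with their reference positions. The function name, the piecewise-linear interpolation of the correction ratio, and the numbers are all hypothetical.

```python
from bisect import bisect_right

def correct_mz(mz_values, observed_locks, reference_locks):
    """Rescale m/z values using the reference/observed ratio at lock points.

    observed_locks must be sorted in increasing order; the ratio is
    interpolated linearly between lock points and held constant outside them.
    """
    ratios = [ref / obs for obs, ref in zip(observed_locks, reference_locks)]
    corrected = []
    for mz in mz_values:
        if mz <= observed_locks[0]:
            ratio = ratios[0]
        elif mz >= observed_locks[-1]:
            ratio = ratios[-1]
        else:
            hi = bisect_right(observed_locks, mz)
            lo = hi - 1
            span = observed_locks[hi] - observed_locks[lo]
            t = (mz - observed_locks[lo]) / span
            ratio = ratios[lo] + t * (ratios[hi] - ratios[lo])
        corrected.append(mz * ratio)
    return corrected

# Hypothetical spectrum shifted by about 10 ppm relative to the reference
observed_locks = [100.001, 500.005]
reference_locks = [100.0, 500.0]
corrected = correct_mz([100.001, 300.003, 500.005], observed_locks, reference_locks)
```

After correction, the three peaks fall back near 100, 300, and 500, so peaks from different samples can be compared at the same m/z.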

Aligned spectra

A consensus peak is reported only if the peak is present in all 4 spectra, but the threshold can be relaxed (e.g., 3 of 4).
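
The consensus rule can be sketched as follows. This is an illustration under stated assumptions, not the presenters' code: peaks are grouped when they lie within a ppm tolerance of the first peak in the group, and a group is kept if it contains peaks from at least `min_fraction` of the spectra. The function name, the 10 ppm default, and the toy spectra are hypothetical.

```python
def consensus_peaks(spectra, tolerance_ppm=10.0, min_fraction=1.0):
    """Return the mean m/z of peak groups found in enough spectra.

    spectra: list of lists of m/z values, assumed already aligned.
    """
    # Tag every peak with the index of the spectrum it came from, then sort by m/z
    tagged = sorted((mz, idx) for idx, peaks in enumerate(spectra) for mz in peaks)
    groups, current = [], []
    for mz, idx in tagged:
        # Start a new group when the peak is too far from the group's first peak
        if current and (mz - current[0][0]) / current[0][0] * 1e6 > tolerance_ppm:
            groups.append(current)
            current = []
        current.append((mz, idx))
    if current:
        groups.append(current)
    needed = min_fraction * len(spectra)
    result = []
    for group in groups:
        if len({idx for _, idx in group}) >= needed:
            result.append(sum(mz for mz, _ in group) / len(group))
    return result

# Hypothetical aligned spectra from 4 samples
spectra = [
    [100.0000, 200.0000, 300.0000],
    [100.0005, 200.0010, 300.0010],
    [100.0002, 200.0005],
    [100.0008, 250.0000, 300.0020],
]
strict = consensus_peaks(spectra, min_fraction=1.0)    # peak required in 4 of 4
relaxed = consensus_peaks(spectra, min_fraction=0.75)  # peak required in 3 of 4
```

Relaxing the threshold from 4 of 4 to 3 of 4 increases the number of consensus peaks in this toy example, which is the trade-off the slide alludes to.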

50 of the 192 aligned spectra

Consensus (502 matches)

Can do clustering and multifactorial analyses!

100 of the 1000 aligned spectra

At this stage, one needs more help!


Plasmas (male and female)


Can machine learning do better?

48 samples and 12 832 peaks in the dataset


Perspectives

• We are in a position to derive signatures for specific states of complex biological matrices.

• Useful for diagnostics and for monitoring industrial processes.

• Relatively cheap compared to sequencing and immunodetection systems.

• More in line with clinical or industrial settings, since you process one sample at a time for the same cost.

Mass spectrometry

A new paradigm!!!


Acknowledgements

• Nancy Boucher
• Francis Brochu
• Alexandre Drouin
• Sébastien Giguère
• Pier-Luc Plante
• Frédéric Raymond
• Lynda Robitaille
• Prudencio Toussou

Héma-Québec: Louis Thibault

AstraZeneca: Veronica Kos, Humphrey Gardner

Phytronix: Serge Auger, Jean Lacoursière, Pierre Picard

Royal Victoria: Vivian Loo

INSPQ: Cécile Tremblay

Waters Corp.: Keith Fadgen, Geoff Gerhardt

Big Data Centre: François Laviolette

U. Laval: Mario Marchand
