metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

http://ohnosequences.com www.era7bioinformatics.c A New Cloud Computing System for Massive Analysis of Reads from Metagenomics Samples A New Era in Diagnostic Microbiology Pathogen Genomics. Whole Genome Sequencing 15 January 2014. The Royal College of Pathologists.

Upload: era7-bioinformatics

Post on 03-Dec-2014

Category:

Health & Medicine


DESCRIPTION

Traditional microbial genome sequencing relies upon clonal cultures, but the new era of genomics faces a new challenge: metagenomic analysis. In the next few years metagenomics will probably be used in clinical diagnostic settings. Metagenomics therefore has the potential to revolutionize pathogen detection in public health laboratories by allowing the simultaneous detection of all microorganisms in a clinical sample. For viruses, an unbiased high-throughput sequencing approach is useful for directly detecting pathogenic viruses without prior genetic information. The use of metagenomics for virus discovery in clinical samples has opened new opportunities for understanding the aetiology of unexplained illness. For bacteria, it should be remembered that only a small fraction of the phylogenetic diversity of Bacteria and Archaea is represented by cultivated organisms. Hence, metagenomics will probably serve to identify new pathogens and new infections caused by consortia. In chronic infections, metagenomics will give us information about the relevance of biofilms and other bacterial organizations that may be important in such infections. As an example, metagenomics applied to Mycobacterium infections has revealed multiple, previously undetected strains in the same patient. Microbiome analysis has been one of the most important applications of metagenomics. Two major strategies have been applied in recent years for bacterial metagenomics: 16S and shotgun metagenomics. 16S metagenomics tells us about microbial diversity and the relative abundance of species and taxa. Shotgun metagenomics is a much more massive approach, able to inform about the functional profile of the different genes present in the sample and even to yield assembled genomes if the sample is not very complex. Metagenomics has brought new challenges to bioinformatics.
Cloud computing can address the problem of massive data analysis by providing scalable, real-time, on-demand computing for metagenomics data analysis. However, cloud computing infrastructure is not easy to manage, and publicly available software solutions are needed to extend the use of the cloud to the analysis of huge metagenomics data sets. MG7 is a new system for the analysis of metagenomics reads. It uses cloud computing to parallelize the BLAST similarity computation on which the inference of function and the assignment of taxonomic origin are based. A distinctive feature of MG7 is its use of a non-relational database: MG7 uses a graph database to store the results of the analysis and to facilitate querying and access to the data, organized in the hierarchical structure of the taxonomy tree. MG7 is an open source project licensed under the AGPLv3 license.

TRANSCRIPT

Page 1: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

http://ohnosequences.com

www.era7bioinformatics.com

A New Cloud Computing System for Massive Analysis of Reads from Metagenomics Samples

A New Era in Diagnostic Microbiology Pathogen Genomics. Whole Genome Sequencing

15 January 2014. The Royal College of Pathologists.

Page 2: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

- A bit of context:

• What is Era7

• What is Oh no sequences! Research group

• Research lines / Research projects

- Clonal cultures versus Metagenomics

- Microbiome

- Microbiome in health and disease

- Metagenomics in a clinical sample

- 16S and shotgun metagenomics

- Metagenomics for detection of viruses

- Metagenomics for detection of bacteria

- The metagenomics bioinformatics challenge:

• High computational cost

• Binning for reducing computation

• Reducing reference database

- MG7

• Cloud computing

• MG7 algorithms and pipeline

• Lowest Common Ancestor assignment

• MG7 uses Graph databases

• MG7 uses NCBI taxonomy tree

A New Cloud Computing System for Massive Analysis of Reads from Metagenomics Samples

MG7 for metagenomics analysis www.era7bioinformatics.com

Page 3: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

A bit of context

Page 4: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

What is Era7 Bioinformatics

Page 5: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

• Research-driven SME

• Open Source

• Cloud Computing

• Next Generation Sequencing

Page 6: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

• Bacterial Genomics projects

• Comparative Genomics

• Metagenomics

• Microbiome

• RNA-seq (and Dual RNA-seq)

• Cancer Genomics

• Big Data management and integration

Page 7: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

What is Era7 Oh no sequences!

Page 8: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Research Lines:

• Algorithms for assembly

• Methods for bacterial genome annotation

• New Cloud Computing Architectures

• Graph Databases for Biological data

• Comparative genomics and bacterial evolution

• Genome Plasticity

• Big Data integration and visualization

• Host Immune System and infection

Software Research Projects:

• BG7

• Bio4j

• Nextmicro

• Statika

• Nispero

• MG7

(All of them are Open Source AGPLv3 projects)

Page 9: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Traditional microbial genome sequencing relies upon clonal cultures, but the new era of genomics faces a new challenge: metagenomic analysis

Page 10: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Microbiome analysis is possible by metagenomics approaches.

• Health and Disease

• Therapeutic Interventions

• Transplant

• Immune system

Page 11: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Microbiome in Health and Disease

• Inflammatory Bowel Disease

• Diabetes

• Obesity

• Cardiovascular Disease

• Colon Cancer

Page 12: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Modifying the Microbiome

• Prebiotics

• Probiotics

• Microbiome Transplant (Clostridium difficile)

Page 13: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

For bacteria, it should be remembered that only a small fraction of the phylogenetic diversity of Bacteria and Archaea is represented by cultivated organisms

Page 14: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Metagenomics has the potential to revolutionize pathogen detection in public health laboratories by allowing the simultaneous detection of all microorganisms in a clinical sample

Page 15: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Metagenomic analysis after PCR amplification of different gene regions

Shotgun Metagenomics

Page 16: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Metagenomic analysis after PCR amplification of different gene regions:

• 16S rRNA

• Gyrase

• Ribosomal proteins

• Elongation Factors

• RNA Polymerase

• …

16S metagenomics tells us about microbial diversity and the relative abundance of species and taxa

Page 17: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Shotgun Metagenomics

Shotgun metagenomics is a much more massive approach, able to inform about the functional profile of the different genes present in the sample and even to yield assembled genomes if the sample is not very complex

Page 18: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Technology

• 454 in the past

• Illumina today (approaches using overlapping paired reads)

• Preprocessing steps are very important

Page 19: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

For viruses:

An unbiased high-throughput sequencing approach is useful for directly detecting pathogenic viruses without prior genetic information.

The use of metagenomics for virus discovery in clinical samples has opened new opportunities for understanding the aetiology of unexplained illness

Page 20: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

For Bacteria:

Metagenomics will probably serve to identify new pathogens and new infections caused by consortia.

In chronic infections, metagenomics will give us information about the relevance of biofilms and other bacterial organizations that may be important in such infections. Microbiome analysis has been one of the most important applications of metagenomics.

Page 21: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

For Bacteria:

As an example, metagenomics applied to Mycobacterium infections has revealed multiple, previously undetected strains in the same patient

Page 22: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

The Bioinformatics challenge

Metagenomics has a high computational cost

1. One approach is to reduce the amount of computation needed

2. The other is to make the computation more efficient

Page 23: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

The Bioinformatics challenge

Metagenomics has a high computational cost

1. Reducing the computation

• Binning (clustering) the reads, in both 16S and shotgun analyses.

Operational Taxonomic Units (OTUs) in 16S
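The binning idea can be illustrated with a minimal sketch of greedy clustering, in which each read joins the first existing bin whose representative it resembles closely enough, so that only one sequence per bin needs to be compared against the reference database. This is not the method of any particular OTU tool: the function names `identity` and `greedy_bin`, the 0.97 default threshold, and the simplification that all reads are non-empty, pre-aligned and of equal length are assumptions made for illustration.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_bin(reads: List[str], threshold: float = 0.97) -> Dict[str, List[str]]:
    """Greedily group reads into bins keyed by a representative sequence.

    The threshold is an assumed, illustrative value.
    """
    representatives: List[str] = []
    bins: Dict[str, List[str]] = {}
    for read in reads:
        for rep in representatives:
            if identity(read, rep) >= threshold:
                bins[rep].append(read)  # similar enough: join this bin
                break
        else:
            # No representative was close enough: the read starts a new bin.
            representatives.append(read)
            bins[read] = [read]
    return bins


if __name__ == "__main__":
    toy_reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"]
    for rep, members in greedy_bin(toy_reads, threshold=0.8).items():
        print(rep, len(members))
```

The saving comes from comparing only the representatives against the database; the cost is that reads inside a bin inherit the assignment of their representative.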

Page 24: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

The Bioinformatics challenge

Metagenomics has a high computational cost

1. Reducing the computation

• Reducing the size of the reference database: in shotgun analyses it is common to use only the complete bacterial genomes

Page 25: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

The Bioinformatics challenge

Metagenomics has a high computational cost

2. The other is to be more efficient: MG7

Page 26: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

The Bioinformatics challenge

Cloud computing can address the problem of massive data analysis by providing scalable, real-time, on-demand computing for metagenomics data analysis.

However, cloud computing infrastructure is not easy to manage, and publicly available software solutions are needed to extend the use of the cloud to the analysis of huge metagenomics data sets.

Page 27: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7

• Based on Cloud Computing (AWS)

• Parallel computation

• Each read is compared with the complete database:

• No binning, all the reads

• All the known sequences (nt database) for shotgun

• NCBI taxonomy

• Graph database for analyzing the assignment results

Page 28: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7: Based on Cloud Computing (AWS)

• EC2

• S3

• SQS

• SNS

• …

Page 29: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7: Based on Cloud Computing (AWS), parallel computation

• A cloud master machine creates the tasks and sets up the queues

• A set of cloud instances (hundreds, possibly thousands; usually micro EC2 instances) is launched

• After the parallel computation, the results are modeled in a graph database, which allows further analysis
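The master/worker pattern described above can be sketched locally. The example below is only an analogy, not MG7's code: an in-process `queue.Queue` stands in for a cloud queue such as SQS, threads stand in for EC2 worker instances, and `gc_fraction` is a placeholder for the real per-read analysis (the BLAST comparison). All function names are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List


def create_tasks(reads: List[str], chunk_size: int) -> "queue.Queue[List[str]]":
    """Master step: split the reads into chunks and put them on a task queue."""
    tasks: "queue.Queue[List[str]]" = queue.Queue()
    for start in range(0, len(reads), chunk_size):
        tasks.put(reads[start:start + chunk_size])
    return tasks


def worker(tasks: queue.Queue, results: queue.Queue,
           analyse: Callable[[str], float]) -> None:
    """Worker step: take chunks until the queue is empty, publish results."""
    while True:
        try:
            chunk = tasks.get_nowait()
        except queue.Empty:
            return
        for read in chunk:
            results.put((read, analyse(read)))


def run_parallel(reads: List[str], analyse: Callable[[str], float],
                 n_workers: int = 4, chunk_size: int = 2) -> Dict[str, float]:
    """Launch the workers, wait for them, and collect the results."""
    tasks = create_tasks(reads, chunk_size)
    results: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results, analyse))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    collected: Dict[str, float] = {}
    while not results.empty():
        read, value = results.get()
        collected[read] = value
    return collected


def gc_fraction(read: str) -> float:
    """Placeholder analysis (assumption): fraction of G or C bases."""
    return (read.count("G") + read.count("C")) / len(read)


if __name__ == "__main__":
    print(run_parallel(["ACGT", "GGCC", "AATT"], gc_fraction))
```

Because each read is analysed independently, the work divides cleanly into tasks, which is what makes it practical to add more workers when the data set grows.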

Page 30: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

https://github.com/pablopareja/MG7/wiki

Page 31: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Data Model for the Graph Database (Neo4j)

https://github.com/pablopareja/MG7/wiki

Page 32: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7: Based on Cloud Computing (AWS)

• Storage is another challenge. The AWS cloud is very useful:

• S3 for immediate access

• Glacier for archiving

Page 33: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7: Each read is compared with the complete database:

• Direct assignment (Best BLAST Hit). It can be done by:

    • E-value

    • Similarity % and length of the hit

• Lowest Common Ancestor
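The two direct-assignment criteria listed above can be sketched as follows. The example assumes hits in BLAST tabular format (the 12 default columns of `-outfmt 6`); the `Hit` class, the function names, the cut-off values, and the use of identity multiplied by length as a ranking score are illustrative assumptions, not MG7's actual rules.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    read_id: str
    subject_id: str
    percent_identity: float
    length: int
    evalue: float


def parse_tabular_line(line: str) -> Hit:
    """Parse one line of BLAST tabular output (12 default columns)."""
    f = line.rstrip("\n").split("\t")
    # Columns: qseqid sseqid pident length mismatch gapopen
    #          qstart qend sstart send evalue bitscore
    return Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]))


def best_by_evalue(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep, for each read, the hit with the lowest E-value."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.read_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.read_id] = hit
    return best


def best_by_identity_and_length(hits: Iterable[Hit],
                                min_identity: float = 90.0,
                                min_length: int = 50) -> Dict[str, Hit]:
    """Keep, for each read, the hit with the highest identity * length,
    ignoring hits below the (assumed) identity and length cut-offs."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.percent_identity < min_identity or hit.length < min_length:
            continue
        current = best.get(hit.read_id)
        score = hit.percent_identity * hit.length
        if current is None or score > current.percent_identity * current.length:
            best[hit.read_id] = hit
    return best


if __name__ == "__main__":
    lines = [
        "read1\tsubjA\t98.0\t100\t2\t0\t1\t100\t5\t104\t1e-50\t180",
        "read1\tsubjB\t99.0\t60\t1\t0\t1\t60\t5\t64\t1e-20\t110",
        "read2\tsubjC\t85.0\t40\t6\t0\t1\t40\t9\t48\t1e-5\t50",
    ]
    hits = [parse_tabular_line(line) for line in lines]
    print({r: h.subject_id for r, h in best_by_evalue(hits).items()})
    print({r: h.subject_id for r, h in best_by_identity_and_length(hits).items()})
```

With the E-value criterion every read that has a hit receives an assignment, whereas the identity and length criterion leaves weakly matching reads, such as `read2` here, unassigned.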

Page 34: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7 Lowest Common Ancestor

First step:

We start from a set of nodes of arbitrary size (4 in this example), which are spread through the taxonomy tree.

Page 35: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7 Lowest Common Ancestor

Second step:

We then fetch the first node from the set and compute its full list of ancestors, up to the root of the taxonomy.

Page 36: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7 Lowest Common Ancestor

Third step:

Now that we have the list, we take the second node of the set and check whether it is contained in the list. If not, we keep going up through its ancestors until we find a marked node, that is, one that is in the list. Once it has been found, we discard the elements that precede it in the list (if any), so that they are not taken into account in the next iterations of the algorithm.

Page 37: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7 Lowest Common Ancestor

Fourth step:

We keep going through our node set, and node C also removes some elements from the list…

Page 38: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7 Lowest Common Ancestor

Fifth step:

Finally we reach the last node of our set, but no element is removed from our list as a result.

Page 39: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7 Lowest Common Ancestor

Here we have our lowest common ancestor!
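The five steps can be written down as a short sketch. It follows the procedure on the slides (build the ancestor list of the first node, then trim it with each remaining node), but the function names, the parent-pointer dictionary used to represent the tree, and the small toy taxonomy (with intermediate ranks omitted) are assumptions made for illustration, not MG7's actual code.

```python
from typing import Dict, List, Optional


def ancestors_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the node followed by all its ancestors, ending at the root."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def lowest_common_ancestor(nodes: List[str],
                           parent: Dict[str, Optional[str]]) -> str:
    """Lowest common ancestor of a non-empty list of nodes in one tree."""
    if not nodes:
        raise ValueError("at least one node is required")
    # Second step: ancestor list of the first node, up to the root.
    candidates = ancestors_to_root(nodes[0], parent)
    for node in nodes[1:]:
        # Third step: climb from this node until we hit the list...
        current: Optional[str] = node
        while current is not None and current not in candidates:
            current = parent[current]
        if current is None:
            raise ValueError("nodes do not belong to the same tree")
        # ...and discard the elements that precede it.
        candidates = candidates[candidates.index(current):]
    # The first remaining element is the lowest common ancestor.
    return candidates[0]


if __name__ == "__main__":
    # Toy parent-pointer taxonomy; intermediate ranks are omitted.
    toy_parent = {
        "root": None,
        "Bacteria": "root",
        "Proteobacteria": "Bacteria",
        "Gammaproteobacteria": "Proteobacteria",
        "Enterobacteriaceae": "Gammaproteobacteria",
        "Escherichia": "Enterobacteriaceae",
        "Salmonella": "Enterobacteriaceae",
        "Pseudomonas": "Gammaproteobacteria",
    }
    hits = ["Escherichia", "Salmonella", "Pseudomonas", "Escherichia"]
    print(lowest_common_ancestor(hits, toy_parent))
```

In this toy run the second and third nodes each trim the list, while the fourth removes nothing, which mirrors the fourth and fifth steps of the walk-through.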

Page 40: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7: All the known sequences (nt database) for shotgun

The nt database is the largest nucleotide database. It contains nucleotide sequences from all organisms.

This is important to detect:

• Unexpected organisms

• Contamination

Page 41: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

MG7

NCBI taxonomy

This taxonomy is probably the best and most comprehensive

A graph database is very appropriate to model a taxonomy tree
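One reason a tree-shaped model suits taxonomic results is that queries such as "how many reads fall under this taxon, including its descendants?" become simple walks along parent links. The sketch below uses plain Python dictionaries as a stand-in for a graph database; the function name, the toy taxonomy (with intermediate ranks omitted) and the read counts are made up for illustration.

```python
from collections import Counter
from typing import Dict, Optional


def cumulative_counts(direct_counts: Dict[str, int],
                      parent: Dict[str, Optional[str]]) -> Counter:
    """Propagate per-taxon read counts up to every ancestor."""
    totals: Counter = Counter()
    for taxon, n_reads in direct_counts.items():
        current: Optional[str] = taxon
        while current is not None:
            totals[current] += n_reads  # the taxon itself, then each ancestor
            current = parent[current]
    return totals


if __name__ == "__main__":
    # Toy parent-pointer taxonomy and invented read counts.
    toy_parent = {
        "root": None,
        "Gammaproteobacteria": "root",
        "Enterobacteriaceae": "Gammaproteobacteria",
        "Escherichia": "Enterobacteriaceae",
        "Salmonella": "Enterobacteriaceae",
    }
    direct = {"Escherichia": 3, "Salmonella": 2, "Gammaproteobacteria": 1}
    for taxon, total in cumulative_counts(direct, toy_parent).items():
        print(taxon, total)
```

In a graph database the same question is expressed as a traversal over the stored parent relationships rather than as a chain of table joins.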

Page 42: Metagenomics and cloud_computing_london_january_2014_era7_bioinformatics

Thanks for your attention!

Marina Manrique

Eduardo Pareja-Tobes

Pablo Pareja-Tobes

Raquel Tobes

Eduardo Pareja

[email protected]