Metagenomics and cloud computing, London, January 2014, Era7 Bioinformatics
DESCRIPTION
Traditional microbial genome sequencing relies on clonal cultures, but the new era of genomics faces a new challenge: metagenomic analysis. Within the next few years, metagenomics will probably be used in clinical diagnostic settings. Metagenomics has the potential to revolutionize pathogen detection in public health laboratories by allowing the simultaneous detection of all microorganisms in a clinical sample. For viruses, an unbiased high-throughput sequencing approach is useful for directly detecting pathogenic viruses without prior genetic information. The use of metagenomics for virus discovery in clinical samples has opened new opportunities for understanding the aetiology of unexplained illness. For bacteria, it should be remembered that only a small fraction of the phylogenetic diversity of Bacteria and Archaea is represented by cultivated organisms. Hence, metagenomics will probably serve to identify new pathogens and new infections caused by microbial consortia. In chronic infections, metagenomics will provide information about the relevance of biofilms and other forms of bacterial organization that may be important in such infections. As an example, metagenomics applied to Mycobacterium infections has revealed multiple, previously undetected strains in the same patient. Microbiome analysis has been one of the most important applications of metagenomics. Two major strategies have been applied in recent years for bacterial metagenomics: 16S and shotgun metagenomics. 16S metagenomics reports on microbial diversity and the relative abundance of species and taxa. Shotgun metagenomics is a far more data-intensive approach that can describe the functional profile of the genes present in the sample and even yield assembled genomes if the sample is not very complex. Metagenomics has brought new challenges to bioinformatics.
Cloud computing can solve the problem of massive data analysis by providing scalable, real-time, on-demand computing for metagenomics data analysis. However, cloud computing infrastructure is not easy to manage, and publicly available software solutions are needed to extend the use of the cloud to the analysis of huge metagenomics data sets. MG7 is a new system for the analysis of metagenomics reads. It uses cloud computing to parallelize the BLAST similarity searches on which the inference of function and the assignment of taxonomic origin are based. A distinctive feature of MG7 is its use of a non-relational database: MG7 uses a graph database to store the results of the analysis and to facilitate querying and access to the data, organized in the hierarchical structure of the taxonomy tree. MG7 is an open source project licensed under the AGPLv3 license.
TRANSCRIPT
http://ohnosequences.com
www.era7bioinformatics.com
A New Cloud Computing System for Massive Analysis of Reads from
Metagenomics Samples
A New Era in Diagnostic Microbiology Pathogen Genomics. Whole Genome Sequencing
15 January 2014. The Royal College of Pathologists.
- A bit of context:
• What is Era7
• What is Oh no sequences! Research group
• Research lines / Research projects
- Clonal cultures versus Metagenomics
- Microbiome
- Microbiome in health and disease
- Metagenomics in a clinical sample
- 16S and shotgun metagenomics
- Metagenomics for detection of viruses
- Metagenomics for detection of bacteria
- The metagenomics bioinformatics challenge:
• High computational cost
• Binning for reducing computation
• Reducing reference database
- MG7
• Cloud computing
• MG7 algorithms and pipeline
• Lowest Common Ancestor assignment
• MG7 uses Graph databases
• MG7 uses NCBI taxonomy tree
MG7 for metagenomics analysis www.era7bioinformatics.com
The Royal College of Pathologists, 15 January 2014
http://ohnosequences.com www.era7bioinformatics.com
A bit of context
What is Era7 Bioinformatics
• Research-driven SME
• Open Source
• Cloud Computing
• Next Generation Sequencing
• Bacterial Genomics projects
• Comparative Genomics
• Metagenomics
• Microbiome
• RNA-seq (and Dual RNA-seq)
• Cancer Genomics
• Big Data management and integration
What is Era7 Oh no sequences!
Research Lines:
• Algorithms for assembly
• Methods for bacterial genome annotation
• New Cloud Computing Architectures
• Graph Databases for Biological data
• Comparative genomics and bacterial evolution
• Genome Plasticity
• Big Data integration and visualization
• Host Immune System and infection
Software Research Projects:
• BG7
• Bio4j
• Nextmicro
• Statika
• Nispero
• MG7
(All of them are open source AGPLv3 projects)
Traditional microbial genome sequencing relies on clonal cultures,
but the new era of genomics faces a new challenge: metagenomic analysis
Microbiome analysis is possible by metagenomics approaches.
• Health and Disease
• Therapeutic Interventions
• Transplant
• Immune system
Microbiome in Health and Disease
• Inflammatory Bowel Disease
• Diabetes
• Obesity
• Cardiovascular Disease
• Colon Cancer
Modifying the Microbiome
• Prebiotics
• Probiotics
• Microbiome Transplant (Clostridium difficile)
For bacteria, it should be remembered that only a small fraction of the phylogenetic diversity of Bacteria and Archaea is represented by cultivated organisms
Metagenomics has the potential to revolutionize pathogen detection in public health laboratories by allowing the
simultaneous detection of all microorganisms in a clinical sample
Metagenomic analysis after PCR amplification of different gene regions
Shotgun Metagenomics
Metagenomic analysis after PCR amplification of different gene regions:
• 16S rRNA
• Gyrase
• Ribosomal proteins
• Elongation factors
• RNA polymerase
• ……….
16S metagenomics tells us about microbial diversity and relative abundance of species and taxa
Shotgun Metagenomics
Shotgun metagenomics is a much more massive approach able to inform about the functional profile of the different genes present in the sample and even to obtain assembled genomes if the sample is not very complex
Technology
• 454 in the past
• Illumina today (approaches based on overlapping paired reads)
• Preprocessing steps are very important
For viruses:
An unbiased high-throughput sequencing approach is useful for directly detecting pathogenic viruses without prior genetic information.
The use of metagenomics for virus discovery in clinical samples has opened new opportunities for understanding the aetiology of unexplained illness
For Bacteria:
Metagenomics will probably serve to identify new pathogens and new infections caused by microbial consortia.
In chronic infections, metagenomics will provide information about the relevance of biofilms and other forms of bacterial organization that may be important in such infections. Microbiome analysis has been one of the most important applications of metagenomics.
For Bacteria:
As an example, metagenomics applied to Mycobacterium infections has revealed multiple, previously undetected strains in the same patient
The Bioinformatics challenge
Metagenomics has a high computational cost
1. One approach is to reduce the amount of computation needed
2. The other is to be more efficient
The Bioinformatics challenge
Metagenomics has a high computational cost
1. Reducing the computation
• Binning (clustering) the reads (16S and shotgun).
Operational Taxonomic Units (OTUs) in 16S
The Bioinformatics challenge
Metagenomics has a high computational cost
1. Reducing the computation
• Reducing the size of the reference database: it is common to use only the complete bacterial genomes (shotgun)
The Bioinformatics challenge
Metagenomics has a high computational cost
2. The other is to be more efficient: MG7
The Bioinformatics challenge
Cloud computing can solve the problem of massive data analysis providing scalable, real time, on demand computing for metagenomics data analysis.
However, Cloud Computing infrastructure is not easy to manage and publicly available software solutions would be needed to extend the use of cloud for the analysis of huge metagenomics data sets.
MG7
• Based on cloud computing (AWS)
• Parallel computation
• Each read is compared with the complete database:
  • No binning, all the reads
  • All the known sequences (nt database) for shotgun
• NCBI taxonomy
• Graph database for analyzing the assignment results
MG7: Based on cloud computing (AWS)
• EC2
• S3
• SQS
• SNS
• ……
MG7: Based on cloud computing (AWS), parallel computation
• A cloud master machine creates the tasks and sets up the queues
• A set (hundreds, possibly thousands) of cloud instances (usually micro EC2 instances) is launched
• After the parallel computation, the results are modeled in a graph database, which allows further analysis
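
The master/worker pattern described above can be sketched locally in Python. This is only an illustration under stated assumptions: the standard-library `queue.Queue` stands in for a message queue such as SQS, threads stand in for cloud instances, and `make_tasks`, `run_workers` and `process_chunk` are hypothetical names that do not come from MG7.

```python
import queue
import threading
from typing import Callable, List, Sequence


def make_tasks(reads: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Master step: split the reads into independent chunks (tasks)."""
    return [list(reads[i:i + chunk_size]) for i in range(0, len(reads), chunk_size)]


def run_workers(tasks: List[List[str]],
                process_chunk: Callable[[List[str]], object],
                n_workers: int = 4) -> List[object]:
    """Distribute tasks over worker threads through a queue and collect the results."""
    task_queue: "queue.Queue" = queue.Queue()      # local stand-in for a task queue
    result_queue: "queue.Queue" = queue.Queue()    # local stand-in for a results queue
    for index, task in enumerate(tasks):
        task_queue.put((index, task))

    def worker() -> None:
        # Each worker keeps pulling tasks until the queue is empty.
        while True:
            try:
                index, chunk = task_queue.get_nowait()
            except queue.Empty:
                return
            result_queue.put((index, process_chunk(chunk)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    results = []
    while True:
        try:
            results.append(result_queue.get_nowait())
        except queue.Empty:
            break
    # Restore the original task order before returning.
    return [value for _, value in sorted(results, key=lambda item: item[0])]


# Hypothetical placeholder for the real per-chunk work (e.g. a similarity search).
def total_bases(chunk: List[str]) -> int:
    return sum(len(read) for read in chunk)


reads = ["ACGT", "GG", "TTTA", "C", "AAAA"]
tasks = make_tasks(reads, chunk_size=2)
results = run_workers(tasks, total_bases, n_workers=3)
```

Because every task is independent, adding workers shortens the wall-clock time without changing the results, which is the property that makes this kind of workload a good fit for many small instances.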
https://github.com/pablopareja/MG7/wiki
Data Model for the Graph Database (Neo4j): https://github.com/pablopareja/MG7/wiki
MG7: Based on cloud computing (AWS)
• Storage is another challenge. The AWS cloud is very useful:
• S3 for immediate access
• Glacier for archiving
MG7: Each read is compared with the complete database:
• Direct assignment (Best BLAST Hit). It can be done by:
  • E value
  • Similarity % and length of the hit
• Lowest Common Ancestor
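
As a rough illustration of direct assignment by best hit, the Python sketch below reads BLAST tabular output (the default 12 columns of `-outfmt 6`) and keeps, for each read, the hit with the lowest E value, using the bit score to break ties. The optional identity and length filters correspond to the second criterion on the slide. The function names, the thresholds and the sample lines are assumptions made for this example and are not taken from MG7.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    read_id: str
    subject_id: str
    identity: float   # percent identity
    length: int       # alignment length
    evalue: float
    bitscore: float


def parse_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse BLAST tabular lines (default 12 columns of -outfmt 6)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))


def best_hits(hits: Iterable[Hit],
              min_identity: float = 0.0,
              min_length: int = 0) -> Dict[str, Hit]:
    """Keep one hit per read: lowest E value, ties broken by highest bit score."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.identity < min_identity or hit.length < min_length:
            continue  # hypothetical quality filter on similarity % and hit length
        current = best.get(hit.read_id)
        if current is None or (hit.evalue, -hit.bitscore) < (current.evalue, -current.bitscore):
            best[hit.read_id] = hit
    return best


# Made-up example lines, for illustration only.
sample = [
    "read1\tsubjA\t98.5\t150\t2\t0\t1\t150\t10\t159\t1e-70\t270",
    "read1\tsubjB\t92.0\t140\t11\t0\t1\t140\t5\t144\t1e-50\t200",
    "read2\tsubjC\t85.0\t60\t9\t0\t1\t60\t1\t60\t1e-10\t70",
]
by_evalue = best_hits(parse_hits(sample))
filtered = best_hits(parse_hits(sample), min_identity=90.0, min_length=100)
```

With the filter enabled, reads whose hits are all short or weakly similar receive no assignment at all, which is often preferable to a confident but wrong one.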
MG7 Lowest Common Ancestor
First step:
We start from a set of nodes of arbitrary size (4 in this example), which are spread through the taxonomy tree
MG7 Lowest Common Ancestor
Second step:
We then take the first node from the set and compute its complete list of ancestors, up to the root of the taxonomy.
MG7 Lowest Common Ancestor
Third step: Now that we have the list, we take the second node of the set and check whether it is contained in the list. If not, we keep going up through its ancestors until we find a marked node (one that is in the list). Once it has been found, we discard the elements that precede it in the list (if any) so that they are not taken into account in the next iterations of the algorithm.
MG7 Lowest Common Ancestor
Fourth step: We keep going through our node set, and node C also removes some elements of the list…
MG7 Lowest Common Ancestor
Fifth step: Finally we reach the last node of our set, but no element is removed from our list as a result.
MG7 Lowest Common Ancestor
Here we have our lowest common ancestor!
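
The steps above can be sketched in Python as follows. This is a minimal illustration, not MG7's own code: the taxonomy is a toy tree stored as a child-to-parent dictionary with most ranks omitted, and the names `ancestor_path` and `lowest_common_ancestor` are assumptions made for the example.

```python
from typing import Dict, List, Optional


def ancestor_path(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the list [node, parent, ..., root]."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def lowest_common_ancestor(nodes: List[str], parent: Dict[str, Optional[str]]) -> str:
    """Lowest common ancestor of a set of nodes, following the steps on the slides."""
    if not nodes:
        raise ValueError("at least one node is required")
    # Second step: full ancestor list of the first node.
    path = ancestor_path(nodes[0], parent)
    for node in nodes[1:]:
        marked = set(path)
        current: Optional[str] = node
        # Third step: climb from the node until a marked node is found.
        while current not in marked:
            current = parent[current]
            if current is None:
                raise ValueError("nodes do not share a root")
        # Discard the list elements below the marked node.
        path = path[path.index(current):]
    # The first remaining element is the lowest common ancestor.
    return path[0]


# Toy taxonomy (simplified, most intermediate ranks omitted).
parent: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Escherichia": "Gammaproteobacteria",
    "Salmonella": "Gammaproteobacteria",
    "Betaproteobacteria": "Proteobacteria",
    "Neisseria": "Betaproteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus": "Firmicutes",
}

lca = lowest_common_ancestor(
    ["Escherichia", "Salmonella", "Neisseria", "Escherichia"], parent
)
```

In this toy run the second node trims the list to start at Gammaproteobacteria, the third trims it to Proteobacteria, and the fourth removes nothing, mirroring the four-node example on the slides.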
MG7: All the known sequences (nt database) for shotgun
The nt database is the largest nucleotide database. It contains nucleotide sequences from all organisms.
This is important for detecting:
• Unexpected organisms
• Contamination
MG7
NCBI taxonomy
This Taxonomy is probably the best and most comprehensive
A Graph Database is very appropriate to model a Taxonomy tree
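
To illustrate why a tree-shaped model suits this kind of query, the sketch below stores a toy taxonomy as child-to-parent links and propagates per-taxon read counts up to the root, giving a cumulative count for every node. It uses a plain Python dictionary as a stand-in for the graph database. The function name, the taxa and the counts are made up for the example and do not reflect MG7's data model.

```python
from collections import Counter
from typing import Dict, Optional


def cumulative_counts(direct: Dict[str, int],
                      parent: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Add each taxon's directly assigned reads to the taxon and all its ancestors."""
    totals: Counter = Counter()
    for taxon, count in direct.items():
        current: Optional[str] = taxon
        while current is not None:
            totals[current] += count
            current = parent[current]
    return dict(totals)


# Toy taxonomy and made-up read counts, for illustration only.
taxonomy: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus": "Firmicutes",
}
direct_reads = {"Escherichia": 5, "Proteobacteria": 2, "Bacillus": 3}
totals = cumulative_counts(direct_reads, taxonomy)
```

In a graph database the same question becomes a traversal along parent relationships, so no table joins are needed to answer it.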
Thanks for your attention!
Marina Manrique
Eduardo Pareja-Tobes
Pablo Pareja-Tobes
Raquel Tobes
Eduardo Pareja