H2O World - Sparkling Water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir
![Page 1: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/1.jpg)
Sparkling Water on the Spark Notebook: Interactive Genomes Clustering
Why you must care, by Data Fellas
Xavier Tordoir
@xtordoir
![Page 2: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/2.jpg)
● Apache Spark
● Interactivity: Spark Notebook
● Genomics on Spark: ADAM
● Data exploitation
● H2O w/ Spark: Sparkling Water
● Show time
● Streamlining dev/deployment
Lineup
Can’t wait!
![Page 3: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/3.jpg)
Data Fellas
Andy Petrella
Maths, Geospatial, Distributed Computing
Spark Notebook, Trainer Spark/Scala, Machine Learning
Xavier Tordoir
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), Trainer Spark, Machine Learning
![Page 4: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/4.jpg)
Distributed computing framework
Large Scale Data Processing engine
I play BIG!
What is Apache Spark?
![Page 5: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/5.jpg)
Distributed computing framework
Large Scale Data Processing engine
● SQL & DataFrames
● Streaming
● Graph Processing
● Machine Learning
With all colors!
What is Apache Spark?
![Page 6: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/6.jpg)
Distributed computing framework
Large Scale Data Processing engine
● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model
Checking in cache If I remember...
What is Apache Spark?
![Page 7: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/7.jpg)
Distributed computing framework
Large Scale Data Processing engine
● Interactive
● @ any scale
http://spark-notebook.io
Laurel? Hardy? Anyone?
What is Apache Spark?
![Page 8: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/8.jpg)
● Scala (types, production quality)
● Reactive & pluggable charts API (scala = no.js)
● Easy install, no deps.
● Multiple SparkContexts out of the box.
What is Apache Spark?
![Page 9: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/9.jpg)
http://bdgenomics.org/
ADAM Project (UC Berkeley):
● Data format (schema, compact, distributed): Avro + Parquet
● API (Reads, Variants, Genotypes, …)
I, ADAM
Genomics with Spark?
![Page 10: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/10.jpg)
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Genomics
The data
Please, don’t mind the colors...
![Page 11: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/11.jpg)
Genomics
The data
So… that’s what separates us huh?
![Page 12: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/12.jpg)
1000 genomes: http://www.1000genomes.org/
~1000 samples
Few samples => Machine Learning
Genomics
The data
Woooow, really, you must be kidding me… ahahahahah
![Page 13: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/13.jpg)
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Few samples => Machine Learning
Lots of Data => Distributed computing
Genomics
The data
Oh… damn… hum huh
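A rough back-of-the-envelope sketch of why the slide says "lots of data". The sample and genotype counts are the approximate figures from the slide; the bytes-per-genotype values are assumptions chosen only for illustration, not measurements of any actual storage format.

```python
# Approximate figures taken from the slide.
SAMPLES = 1_000
GENOTYPES_PER_SAMPLE = 30_000_000


def matrix_cells(samples: int, features: int) -> int:
    """Number of cells in a dense samples x features genotype matrix."""
    return samples * features


def size_gb(cells: int, bytes_per_cell: int) -> float:
    """Dense size in gigabytes (10^9 bytes) for a given cell width."""
    return cells * bytes_per_cell / 1e9


cells = matrix_cells(SAMPLES, GENOTYPES_PER_SAMPLE)

# Assumption: 1 byte per genotype (compact encoding) vs 8 bytes (a double).
compact_gb = size_gb(cells, 1)
double_gb = size_gb(cells, 8)

print(f"{cells:,} genotype values")
print(f"{compact_gb} GB at 1 byte each, {double_gb} GB at 8 bytes each")
```

Under these assumptions the dense matrix is far wider than it is tall: many more features than samples, and too large for a single machine's memory once stored as doubles.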
![Page 14: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/14.jpg)
Population stratification
w/ Deep Learning? H2O
From the Spark Notebook? Sparkling Water
Genomics
The problem
Here I need some water.
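To make "population stratification" concrete, here is a minimal sketch that groups samples by genotype similarity. It is not the talk's method: the talk uses H2O deep learning through Sparkling Water, while this sketch swaps in plain k-means in pure Python. The toy matrix (6 samples, 5 variants, values as alternate-allele counts 0/1/2) and the choice of initial centroids are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points: List[Vector],
           initial_centroids: List[Vector],
           max_iter: int = 20) -> List[int]:
    """Plain k-means; returns the cluster index assigned to each point."""
    centroids = [list(c) for c in initial_centroids]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no sample changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = mean_vector(members)
    return assignment


# Hypothetical toy data: rows are samples, columns are variants.
genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [1, 2, 2, 0, 1],
]

labels = kmeans(genotypes, initial_centroids=[genotypes[0], genotypes[3]])
print(labels)
```

On this toy input the first three samples end up in one cluster and the last three in the other. At the scale on the previous slides the same idea needs a distributed engine, which is where Spark and H2O come in.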
![Page 15: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/15.jpg)
In-memory implementation of “Map-Reduce”
Highly optimised structures for the JVM
Blazing fast convergent models
H2O
Higher-level API
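A minimal sketch of the in-memory map-reduce idea, in pure Python. This is not H2O's API: the chunk layout, the function names `map_chunk` and `combine`, and the toy genotype values are all assumptions for illustration. Each chunk is summarised independently (map), then the partial results are merged pairwise (reduce).

```python
from functools import reduce
from typing import List, Tuple

Partial = Tuple[List[int], int]  # (per-variant sums, number of samples)


def map_chunk(chunk: List[List[int]]) -> Partial:
    """Map step: summarise one in-memory chunk of sample rows."""
    sums = [sum(col) for col in zip(*chunk)]
    return sums, len(chunk)


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial summaries."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


# Hypothetical data: three chunks, as if held by three workers.
chunks = [
    [[0, 1, 2], [1, 1, 0]],
    [[2, 0, 1]],
    [[1, 2, 1], [0, 0, 0], [0, 2, 2]],
]

sums, n = reduce(combine, map(map_chunk, chunks))
means = [s / n for s in sums]  # mean alternate-allele count per variant
print(sums, n, means)
```

Because `combine` is associative, the partial results can be merged in any grouping, which is what lets this pattern be spread across nodes.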
![Page 16: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/16.jpg)
H2O
Sparkling: in-memory data exchange
I remember things better with two copies in memory.
http://h2o.ai/product/sparkling-water/
![Page 17: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/17.jpg)
Showtime!
press play...
There’s a notebook for that
![Page 18: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/18.jpg)
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
Shar3 (Data Fellas)
[Diagram: role tags next to the steps, in extraction order: ops | data | ops data | sci | sci ops | sci | ops data | web ops data | web ops data sci]
![Page 19: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/19.jpg)
Shar3 (Data Fellas)
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project
Generator
Micro Service / Binary format
Schema for output
Metadata
![Page 20: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/20.jpg)
Spark and the Notebook are interactive and leverage distributed computing infrastructure
ADAM is an optimized storage format for massive genomic data
Spark provides tools to manipulate data and works w/ other libraries like H2O
Data scientists and application developers can work together
Summary
Wake up, we’re back!
![Page 21: H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir](https://reader033.vdocument.in/reader033/viewer/2022050614/586f79011a28ab10258b6e2b/html5/thumbnails/21.jpg)
Acknowledgements
Frank Nothaft
Matt Massie
Neil Fergusson
Vinod & Michal
Thank you for your attention!
Questions?
And now let’s talk.