TRANSCRIPT
Parallel Frameworks & Big Data: Hadoop and Spark on BioHPC
Updated for 2015-11-18
[web] portal.biohpc.swmed.edu
[email] biohpc-help@utsouthwestern.edu
Overview
What is 'Big Data'?
Big data & parallel processing go hand-in-hand
Models of parallelization
Apache Hadoop
Apache Spark
"Big Data"
Datasets too big for conventional data-processing applications
In business, often 'Predictive Analytics'
Big Data in Biomedical Science
Complex, long-running analyses on GB datasets ≠ Big Data
A small number of genomic datasets (100s of GBs) is no longer 'big data'
Easy to manage on individual modern servers with standard apps
Cross-'omics' integration often is a big data problem
Thousands of datasets of different types: genomics, proteomics, metabolomics, imaging…
Different formats – import, storage and integration problems
Common questions fit the 'predictive analytics' model of Big Data in business
A Big Data Example – ProteomicsDB
Partnership between academia and industry (SAP)
Data size is 721TB (not huge), but contains 43 million MS/MS spectra
Biggest challenge is presentation and interpretation – search, visualization etc.
Web environment uses SAP HANA: 160 cores and 2TB RAM behind the scenes
Nature. 2014 May 29;509(7502):582-7. doi: 10.1038/nature13319. Mass-spectrometry-based draft of the human proteome. Wilhelm M, Schlegl J, Hahne H, Moghaddas Gholami A, Lieberenz M, Savitski MM, Ziegler E, Butzmann L, Gessulat S, Marx H, Mathieson T, Lemeer S, Schnatbaum K, Reimer U, Wenschuh H, Mollenhauer M, Slotta-Huspenina J, Boese JH, Bantscheff M, Gerstmair A, Faerber F, Kuster B.
Big Data… light
BUT – all achieved using a traditional single relational database, algorithms etc.
No flexible search / query architecture
Looks like 'Big Data'…
TB-scale input data
40,000 core hours of processing on TACC
Mol Cell Proteomics. 2014 Jun;13(6):1573-84. doi: 10.1074/mcp.M113.035170. Epub 2014 Apr 2. Confetti: a multiprotease map of the HeLa proteome for comprehensive proteomics. Guo X, Trudgian DC, Lemoff A, Yadavalli S, Mirzaei H.
Consider a Big Data problem…
I have 5000 sequenced exomes with associated clinical observations, and want to:
• Find all SNPs in each patient
• Aggregate statistics for SNPs across all patients
• Easily query for SNP associations with any of my clinical variables
• Build a predictive model for a specific disease
A lot of data, a lot of processing – need to do each step in parallel
Parallelization – Shared Memory
[Diagram: one system with two CPUs; each CPU has several cores and its own directly attached RAM]
1 system, multiple CPUs
Each CPU has a portion of RAM attached directly
CPUs can access their own or each other's RAM
Parallel processes or threads can easily access any of the data in RAM
Parallelization – Distributed Shared Memory
[Diagram: RAM from several systems combined into a single logical address space]
RAM is split across many systems, which use a special interconnect
The entire RAM across all machines is accessible anywhere in a single address space
Still quite simple to program, but costly, and now uncommon
Parallelization – Message Passing
Many systems, with multiple CPUs each
Can only directly access RAM inside a single system
If data from another system is needed, it must be passed as a message across the network
Must be carefully planned and optimized
Difficult
[Diagram: many separate systems, each with its own cores and RAM, connected over a network]
Map Reduce
Impose a rigid structure on tasks: always map, then reduce (a small sketch follows below).
http://dme.rwth-aachen.de/de/research/projects/mapreduce
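To make the pattern concrete, here is a minimal in-memory word-count sketch in plain Scala (invented sample lines, no Hadoop involved): the map step turns every word into a (word, 1) pair, and the reduce step sums the counts for each word. This is exactly the structure of the Java WordCount example on the following slides.

// Map step: emit one (word, 1) pair per word
val lines  = Seq("to be or not to be", "big data needs parallel processing")
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Reduce step: group the pairs by word and sum the 1s
val counts = mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }

counts.foreach(println)   // e.g. (to,2), (be,2), (data,1), ...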
Distributed Computing - Hadoop
A different way of thinking:
Specify the problem as mapping and reduction steps
Run on subsets of the data on different nodes
Use a special distributed filesystem (HDFS) to communicate data between steps
The framework takes care of the parallelism
https://en.wikipedia.org/wiki/Apache_Hadoop
Map – Count words in a line

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

WordCount.java
Reduce – Sum the occurrences of each word from all lines

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

WordCount.java
Driver – Run the map-reduce task

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}

WordCount.java
Running a Hadoop Job on BioHPC – sbatch script

#!/bin/bash
# Run on the super partition
#SBATCH -p super
# Use 64 tasks total
#SBATCH -n 64
# Across 2 nodes
#SBATCH -N 2
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark

export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i 's/Nucleus[0]*/10.10.10./'

$HADOOP_HOME/bin/start-all.sh
$HADOOP_HOME/bin/hadoop dfs -mkdir data
$HADOOP_HOME/bin/hadoop dfs -put pg2701.txt data
$HADOOP_HOME/bin/hadoop dfs -ls data
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount data wordcount-output

$HADOOP_HOME/bin/hadoop dfs -ls wordcount-output
$HADOOP_HOME/bin/hadoop dfs -get wordcount-output

$HADOOP_HOME/bin/stop-all.sh

myhadoop-cleanup.sh

slurm_hadoop.sbatch
Hadoop Limitations on BioHPC
Inefficiency: (small) wait for everything to start up; workers are solely dedicated to you, and sit idle during portions of the job that are not highly parallelized
HDFS: uses /tmp on each compute node; slow HDD – but lots of RAM for caching; not persistent – deleted after the job ends
Old Hadoop: running Hadoop 1.2.1; update to 2.x soon
Looking for interested users to try out persistent HDFS
General Hadoop Limitations
Model: rigid map→reduce framework, hard to model some problems; iterative algorithms can be difficult (a lot of scientific analysis)
Language: Java is the only 1st-class language; wrappers / frameworks for other languages are available, but generally slower
HDFS: always writes results to disk after map/reduce; architecture not good for small files / random reading
Many of these are alleviated by additional Hadoop projects – Hive, Pig, HBase etc.
SPARK
• In-memory computing model
• Load/save data using HDFS or a standard file system
• Scala, Java, Python 1st-class language support
• Interactive shells for exploratory analysis
• Libraries for database work, machine learning & linear algebra etc.
• Can leverage Hadoop features (HDFS, HBase) or run independently
Far easier and better suited to most scientific tasks than plain Hadoop – compare the word-count sketch below with the Java WordCount above
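For comparison with the three Java classes above, a word count in Spark's Scala shell takes only a few lines. This is a sketch, not code from the slides: the input path is an assumed local copy of pg2701.txt (the sample text from the Hadoop example), and sc is the SparkContext that spark-shell provides.

// Read the text, map each word to (word, 1), then reduce by key to sum the counts
val text   = sc.textFile("file:///tmp/pg2701.txt")    // assumed path to the sample text
val counts = text.flatMap(_.split("\\s+"))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)

counts.saveAsTextFile("file:///tmp/wordcount-output")  // or counts.take(10) to inspect interactively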
Singular Value Decomposition
Challenge: Visualize patterns in a huge assay × gene dataset
Solution: Use SVD to compute 'eigengenes' and visualize the data in a few dimensions that capture the majority of the interesting patterns (a brief sketch of the decomposition follows the reference below)
Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. Singular value decomposition and principal component analysis. In: A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow, eds., pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001
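For background (standard linear-algebra notation, not taken from the slides): with assays as rows and genes as columns of a matrix $A$, the SVD factors it as

A = U \,\Sigma\, V^{T}, \qquad \Sigma = \mathrm{diag}(\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0)

The columns of $V$ are the 'eigengenes'; keeping only the top $k$ of them and projecting, $A V_k = U_k \Sigma_k$, places each assay in a $k$-dimensional space that captures the strongest patterns in the data. That projection is what computeSVD(20, ...) on the next slide makes available.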
SVD on the cluster with Spark

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val input = sc.textFile("file:///home2/dtrudgian/Demos/hadoop/matrix.txt")

val rowVectors = input.map(
    _.split("\t").map(_.toDouble)
).map( v => Vectors.dense(v) ).cache()

val mat = new RowMatrix(rowVectors)

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)
Simple scripting language – this is Scala, but Python can also be used (a sketch of reading back the results follows below)
spark_svd.scala
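A possible follow-up in the same Scala session, reusing the svd and mat variables from the script above, shows how the results might be pulled out; this is a sketch under those assumptions, not part of the original script.

// Singular values (largest first) and the right singular vectors ("eigengenes")
val singularValues = svd.s     // mllib Vector of singular values
val eigengenes     = svd.V     // local Matrix; columns are the eigengenes

// Project each assay (row of the input matrix) onto the top 20 dimensions
val projected = mat.multiply(svd.V)       // RowMatrix of assays in the reduced space
projected.rows.take(5).foreach(println)   // inspect the first few projected rows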
Running Spark jobs on BioHPC

#!/bin/bash
# Run on the super partition
#SBATCH -p super
# Use 128 tasks total
#SBATCH -n 128
# Across 4 nodes
#SBATCH -N 4
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark
export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i 's/Nucleus[0]*/10.10.10./'

$HADOOP_HOME/bin/start-all.sh

source $HADOOP_CONF_DIR/spark/spark-env.sh
myspark start

spark-shell -i large_svd.scala

myspark stop

$HADOOP_HOME/bin/stop-all.sh

myhadoop-cleanup.sh

slurm_spark.sbatch
Demo – Interactive Spark – Python & Scala
Must be on a cluster node to connect to the spark workers (login node or GUI session)
Launch a spark cluster: sbatch slurm_spark_interactive.sh
Wait for spark to start, then load settings: source hadoop-conf.<jobid>/spark/spark-env.sh
Connect using an interactive Scala session: spark-shell (a short example session is sketched below)
…or an interactive Python session: pyspark
And shut down when done: scancel <jobid>
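As an illustration of the interactive workflow, a hypothetical spark-shell session might look like the following; the file path and the "chr1" filter are placeholders, and sc is the SparkContext provided by the shell.

scala> val data = sc.textFile("/home2/<username>/mydata.txt")    // placeholder path
scala> data.count()                                              // how many lines?
scala> data.filter(_.contains("chr1")).take(10)                  // peek at matching records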
What do you want to do?
Discussion
What big-data problems do you have in your research?
Are Hadoop and/or Spark interesting for your projects?
How can we help you use Hadoop/Spark for your work?
What is lsquoBig Datarsquo
Big data amp parallel processing are hand-in-hand
Models of parallelization
Apache Hadoop
Apache Spark
Overview
ldquoBig Datardquo
Datasets too big for conventional data processing applications
In business often Predictive Analytics
Big Data in Biomedical Science
4
Complex long running analyses on GB datasets ne Big Data
Small number of genomic datasets (100s of GBS) no longer lsquobig datarsquo
Easy to manage on individual modern servers with standard apps
Cross lsquoomicsrsquo integration often is a big data problem
Thousands of datasets of different typesgenomics proteomics metabalomics imaginghellip
Different formats ndash import storage integration problems
Common questions fit the lsquopredictive analyticsrsquo model of Big Data in business
A Big Data Example ndash Proteomics DB
5
Partnership between academia and industry (SAP)
Data size is 721TB (not huge) but contains 43 million MSMS spectra
Biggest challenge is presentation and interpretation ndash search visualization etc
Web environment uses SAP HANA160 cores and 2TB RAM behind the scenes
Nature 2014 May 29509(7502)582-7 doi 101038nature13319Mass-spectrometry-based draft of the human proteomeWilhelm M1 Schlegl J2 Hahne H3 Moghaddas Gholami A3 Lieberenz M4 Savitski MM5 Ziegler E4 Butzmann L4 Gessulat S4 Marx H6 Mathieson T5 Lemeer S6 Schnatbaum K7 Reimer U7 Wenschuh H7 Mollenhauer M8 Slotta-Huspenina J8 Boese JH4 Bantscheff M5 Gerstmair A4 Faerber F4 Kuster B9
Big Datahellip light
6
BUT ndash all achieved using a traditional single relational database algorithms etc
No flexible search query architecture
Looks like lsquoBig Datarsquohellip
TB scale input data
40000 core hours of processing on TACC
Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2
Consider a Big Data problemhellip
7
I have 5000 sequenced exomes with associated clinical observations and want to
bull Find all SNPs in each patient
bull Aggregate statistics for SNPs across all patients
bull Easily query for SNP associations with any of my clinical variables
bull Build a predictive model for a specific disease
A lot of data a lot of processing ndash need to do each step in parallel
Parallelization ndash Shared Memory
8
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
1 system multiple CPUs
Each CPU has portion of RAM attached directly
CPUs can access their or otherrsquos RAM
Parallel processes or threads can access any of the data in RAM easily
Parallelization ndash Distributed Shared Memory
9
RAM RAM RAM RAMRAM
Logical Address Space
RAM is split across many systems that use a special interconnect
The entire RAM across all machines is accessible anywhere in a single address space
Still quite simple to program but costly and now uncommon
Parallelization ndash Message Passing
10
Many systems multiple CPUs each
Can only directly access RAM inside single system
If needed data from other system must pass as a message across network
Must be carefully planned and optimized
Difficult
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
Map Reduce
11
Impose a rigid structure on tasks Always map then reduce
httpdmerwth-aachendederesearchprojectsmapreduce
Distributed Computing - Hadoop
12
A different way of thinking
Specify problem as mapping and reduction steps
Run on subsets of data on different nodes
Use special distributed filesystem to communicate data between steps
The framework takes care of the parallelism
httpsenwikipediaorgwikiApache_Hadoop
Map ndash Count words in a line
13
public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()
public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException
String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)
WordCountjava
Reduce ndash Sum the occurrence of words on from all lines
14
public static class Reduce extends ReducerltText IntWritable Text IntWritablegt
public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException
int sum = 0for (IntWritable val values)
sum += valget()contextwrite(key new IntWritable(sum))
WordCountjava
Driver ndash Run the map-reduce task
15
public static void main(String[] args) throws Exception Configuration conf = new Configuration()
Job job = new Job(conf wordcount)
jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)
jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)
jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)
FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))
jobwaitForCompletion(true)
WordCountjava
Running a Hadoop Job on BioHPC ndash sbatch script
16
binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000
module add myhadoop030-spark
export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output
$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_hadoopsbatch
Hadoop Limitations on BioHPC
17
Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized
HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends
Old Hadoop Running Hadoop 121Update to 2x soon
Looking for interested users to try out persistent HDFS
General Hadoop Limitations
18
Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)
Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower
HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading
Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc
SPARK
19
bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently
Far easier and better suited to most scientific tasks than plain Hadoop
Singular Value Decomposition
20
Challenge Visualize patterns in a huge assay x gene dataset
Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns
Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and
principal component analysis in A Practical Approach to Microarray Data Analysis DP
Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-
UR-02-4001
SVD on the cluster with Spark
21
import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix
val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)
val rowVectors = inputmap(
_split(t)map(_toDouble)
)map( v =gt Vectorsdense(v) )cache()
val mat=new RowMatrix(rowVectors)
val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)
Simple scripting language - This is scala can also use python
spark_svdscala
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
ldquoBig Datardquo
Datasets too big for conventional data processing applications
In business often Predictive Analytics
Big Data in Biomedical Science
4
Complex long running analyses on GB datasets ne Big Data
Small number of genomic datasets (100s of GBS) no longer lsquobig datarsquo
Easy to manage on individual modern servers with standard apps
Cross lsquoomicsrsquo integration often is a big data problem
Thousands of datasets of different typesgenomics proteomics metabalomics imaginghellip
Different formats ndash import storage integration problems
Common questions fit the lsquopredictive analyticsrsquo model of Big Data in business
A Big Data Example ndash Proteomics DB
5
Partnership between academia and industry (SAP)
Data size is 721TB (not huge) but contains 43 million MSMS spectra
Biggest challenge is presentation and interpretation ndash search visualization etc
Web environment uses SAP HANA160 cores and 2TB RAM behind the scenes
Nature 2014 May 29509(7502)582-7 doi 101038nature13319Mass-spectrometry-based draft of the human proteomeWilhelm M1 Schlegl J2 Hahne H3 Moghaddas Gholami A3 Lieberenz M4 Savitski MM5 Ziegler E4 Butzmann L4 Gessulat S4 Marx H6 Mathieson T5 Lemeer S6 Schnatbaum K7 Reimer U7 Wenschuh H7 Mollenhauer M8 Slotta-Huspenina J8 Boese JH4 Bantscheff M5 Gerstmair A4 Faerber F4 Kuster B9
Big Datahellip light
6
BUT ndash all achieved using a traditional single relational database algorithms etc
No flexible search query architecture
Looks like lsquoBig Datarsquohellip
TB scale input data
40000 core hours of processing on TACC
Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2
Consider a Big Data problemhellip
7
I have 5000 sequenced exomes with associated clinical observations and want to
bull Find all SNPs in each patient
bull Aggregate statistics for SNPs across all patients
bull Easily query for SNP associations with any of my clinical variables
bull Build a predictive model for a specific disease
A lot of data a lot of processing ndash need to do each step in parallel
Parallelization ndash Shared Memory
8
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
1 system multiple CPUs
Each CPU has portion of RAM attached directly
CPUs can access their or otherrsquos RAM
Parallel processes or threads can access any of the data in RAM easily
Parallelization ndash Distributed Shared Memory
9
RAM RAM RAM RAMRAM
Logical Address Space
RAM is split across many systems that use a special interconnect
The entire RAM across all machines is accessible anywhere in a single address space
Still quite simple to program but costly and now uncommon
Parallelization ndash Message Passing
10
Many systems multiple CPUs each
Can only directly access RAM inside single system
If needed data from other system must pass as a message across network
Must be carefully planned and optimized
Difficult
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
Map Reduce
11
Impose a rigid structure on tasks Always map then reduce
httpdmerwth-aachendederesearchprojectsmapreduce
Distributed Computing - Hadoop
12
A different way of thinking
Specify problem as mapping and reduction steps
Run on subsets of data on different nodes
Use special distributed filesystem to communicate data between steps
The framework takes care of the parallelism
httpsenwikipediaorgwikiApache_Hadoop
Map ndash Count words in a line
13
public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()
public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException
String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)
WordCountjava
Reduce ndash Sum the occurrence of words on from all lines
14
public static class Reduce extends ReducerltText IntWritable Text IntWritablegt
public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException
int sum = 0for (IntWritable val values)
sum += valget()contextwrite(key new IntWritable(sum))
WordCountjava
Driver ndash Run the map-reduce task
15
public static void main(String[] args) throws Exception Configuration conf = new Configuration()
Job job = new Job(conf wordcount)
jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)
jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)
jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)
FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))
jobwaitForCompletion(true)
WordCountjava
Running a Hadoop Job on BioHPC ndash sbatch script
16
binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000
module add myhadoop030-spark
export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output
$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_hadoopsbatch
Hadoop Limitations on BioHPC
17
Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized
HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends
Old Hadoop Running Hadoop 121Update to 2x soon
Looking for interested users to try out persistent HDFS
General Hadoop Limitations
18
Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)
Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower
HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading
Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc
SPARK
19
bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently
Far easier and better suited to most scientific tasks than plain Hadoop
Singular Value Decomposition
20
Challenge Visualize patterns in a huge assay x gene dataset
Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns
Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and
principal component analysis in A Practical Approach to Microarray Data Analysis DP
Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-
UR-02-4001
SVD on the cluster with Spark
21
import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix
val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)
val rowVectors = inputmap(
_split(t)map(_toDouble)
)map( v =gt Vectorsdense(v) )cache()
val mat=new RowMatrix(rowVectors)
val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)
Simple scripting language - This is scala can also use python
spark_svdscala
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
Big Data in Biomedical Science
4
Complex long running analyses on GB datasets ne Big Data
Small number of genomic datasets (100s of GBS) no longer lsquobig datarsquo
Easy to manage on individual modern servers with standard apps
Cross lsquoomicsrsquo integration often is a big data problem
Thousands of datasets of different typesgenomics proteomics metabalomics imaginghellip
Different formats ndash import storage integration problems
Common questions fit the lsquopredictive analyticsrsquo model of Big Data in business
A Big Data Example ndash Proteomics DB
5
Partnership between academia and industry (SAP)
Data size is 721TB (not huge) but contains 43 million MSMS spectra
Biggest challenge is presentation and interpretation ndash search visualization etc
Web environment uses SAP HANA160 cores and 2TB RAM behind the scenes
Nature 2014 May 29509(7502)582-7 doi 101038nature13319Mass-spectrometry-based draft of the human proteomeWilhelm M1 Schlegl J2 Hahne H3 Moghaddas Gholami A3 Lieberenz M4 Savitski MM5 Ziegler E4 Butzmann L4 Gessulat S4 Marx H6 Mathieson T5 Lemeer S6 Schnatbaum K7 Reimer U7 Wenschuh H7 Mollenhauer M8 Slotta-Huspenina J8 Boese JH4 Bantscheff M5 Gerstmair A4 Faerber F4 Kuster B9
Big Datahellip light
6
BUT ndash all achieved using a traditional single relational database algorithms etc
No flexible search query architecture
Looks like lsquoBig Datarsquohellip
TB scale input data
40000 core hours of processing on TACC
Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2
Consider a Big Data problemhellip
7
I have 5000 sequenced exomes with associated clinical observations and want to
bull Find all SNPs in each patient
bull Aggregate statistics for SNPs across all patients
bull Easily query for SNP associations with any of my clinical variables
bull Build a predictive model for a specific disease
A lot of data a lot of processing ndash need to do each step in parallel
Parallelization ndash Shared Memory
8
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
1 system multiple CPUs
Each CPU has portion of RAM attached directly
CPUs can access their or otherrsquos RAM
Parallel processes or threads can access any of the data in RAM easily
Parallelization ndash Distributed Shared Memory
9
RAM RAM RAM RAMRAM
Logical Address Space
RAM is split across many systems that use a special interconnect
The entire RAM across all machines is accessible anywhere in a single address space
Still quite simple to program but costly and now uncommon
Parallelization ndash Message Passing
10
Many systems multiple CPUs each
Can only directly access RAM inside single system
If needed data from other system must pass as a message across network
Must be carefully planned and optimized
Difficult
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
Map Reduce
11
Impose a rigid structure on tasks Always map then reduce
httpdmerwth-aachendederesearchprojectsmapreduce
Distributed Computing - Hadoop
12
A different way of thinking
Specify problem as mapping and reduction steps
Run on subsets of data on different nodes
Use special distributed filesystem to communicate data between steps
The framework takes care of the parallelism
httpsenwikipediaorgwikiApache_Hadoop
Map ndash Count words in a line
13
public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()
public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException
String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)
WordCountjava
Reduce ndash Sum the occurrence of words on from all lines
14
public static class Reduce extends ReducerltText IntWritable Text IntWritablegt
public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException
int sum = 0for (IntWritable val values)
sum += valget()contextwrite(key new IntWritable(sum))
WordCountjava
Driver ndash Run the map-reduce task
15
public static void main(String[] args) throws Exception Configuration conf = new Configuration()
Job job = new Job(conf wordcount)
jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)
jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)
jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)
FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))
jobwaitForCompletion(true)
WordCountjava
Running a Hadoop Job on BioHPC ndash sbatch script
16
binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000
module add myhadoop030-spark
export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output
$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_hadoopsbatch
Hadoop Limitations on BioHPC
17
Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized
HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends
Old Hadoop Running Hadoop 121Update to 2x soon
Looking for interested users to try out persistent HDFS
General Hadoop Limitations
18
Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)
Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower
HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading
Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc
SPARK
19
bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently
Far easier and better suited to most scientific tasks than plain Hadoop
Singular Value Decomposition
20
Challenge Visualize patterns in a huge assay x gene dataset
Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns
Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and
principal component analysis in A Practical Approach to Microarray Data Analysis DP
Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-
UR-02-4001
SVD on the cluster with Spark
21
import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix
val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)
val rowVectors = inputmap(
_split(t)map(_toDouble)
)map( v =gt Vectorsdense(v) )cache()
val mat=new RowMatrix(rowVectors)
val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)
Simple scripting language - This is scala can also use python
spark_svdscala
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
A Big Data Example ndash Proteomics DB
5
Partnership between academia and industry (SAP)
Data size is 721TB (not huge) but contains 43 million MSMS spectra
Biggest challenge is presentation and interpretation ndash search visualization etc
Web environment uses SAP HANA160 cores and 2TB RAM behind the scenes
Nature 2014 May 29509(7502)582-7 doi 101038nature13319Mass-spectrometry-based draft of the human proteomeWilhelm M1 Schlegl J2 Hahne H3 Moghaddas Gholami A3 Lieberenz M4 Savitski MM5 Ziegler E4 Butzmann L4 Gessulat S4 Marx H6 Mathieson T5 Lemeer S6 Schnatbaum K7 Reimer U7 Wenschuh H7 Mollenhauer M8 Slotta-Huspenina J8 Boese JH4 Bantscheff M5 Gerstmair A4 Faerber F4 Kuster B9
Big Datahellip light
6
BUT ndash all achieved using a traditional single relational database algorithms etc
No flexible search query architecture
Looks like lsquoBig Datarsquohellip
TB scale input data
40000 core hours of processing on TACC
Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2
Consider a Big Data problemhellip
7
I have 5000 sequenced exomes with associated clinical observations and want to
bull Find all SNPs in each patient
bull Aggregate statistics for SNPs across all patients
bull Easily query for SNP associations with any of my clinical variables
bull Build a predictive model for a specific disease
A lot of data a lot of processing ndash need to do each step in parallel
Parallelization ndash Shared Memory
8
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
1 system multiple CPUs
Each CPU has portion of RAM attached directly
CPUs can access their or otherrsquos RAM
Parallel processes or threads can access any of the data in RAM easily
Parallelization ndash Distributed Shared Memory
9
RAM RAM RAM RAMRAM
Logical Address Space
RAM is split across many systems that use a special interconnect
The entire RAM across all machines is accessible anywhere in a single address space
Still quite simple to program but costly and now uncommon
Parallelization ndash Message Passing
10
Many systems multiple CPUs each
Can only directly access RAM inside single system
If needed data from other system must pass as a message across network
Must be carefully planned and optimized
Difficult
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
Map Reduce
11
Impose a rigid structure on tasks Always map then reduce
httpdmerwth-aachendederesearchprojectsmapreduce
Distributed Computing - Hadoop
12
A different way of thinking
Specify problem as mapping and reduction steps
Run on subsets of data on different nodes
Use special distributed filesystem to communicate data between steps
The framework takes care of the parallelism
httpsenwikipediaorgwikiApache_Hadoop
Map ndash Count words in a line
13
public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()
public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException
String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)
WordCountjava
Reduce ndash Sum the occurrence of words on from all lines
14
public static class Reduce extends ReducerltText IntWritable Text IntWritablegt
public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException
int sum = 0for (IntWritable val values)
sum += valget()contextwrite(key new IntWritable(sum))
WordCountjava
Driver ndash Run the map-reduce task
15
public static void main(String[] args) throws Exception Configuration conf = new Configuration()
Job job = new Job(conf wordcount)
jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)
jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)
jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)
FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))
jobwaitForCompletion(true)
WordCountjava
Running a Hadoop Job on BioHPC ndash sbatch script
16
binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000
module add myhadoop030-spark
export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output
$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_hadoopsbatch
Hadoop Limitations on BioHPC
17
Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized
HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends
Old Hadoop Running Hadoop 121Update to 2x soon
Looking for interested users to try out persistent HDFS
General Hadoop Limitations
18
Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)
Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower
HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading
Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc
SPARK
19
bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently
Far easier and better suited to most scientific tasks than plain Hadoop
Singular Value Decomposition
20
Challenge Visualize patterns in a huge assay x gene dataset
Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns
Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and
principal component analysis in A Practical Approach to Microarray Data Analysis DP
Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-
UR-02-4001
SVD on the cluster with Spark
21
import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix
val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)
val rowVectors = inputmap(
_split(t)map(_toDouble)
)map( v =gt Vectorsdense(v) )cache()
val mat=new RowMatrix(rowVectors)
val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)
Simple scripting language - This is scala can also use python
spark_svdscala
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
Big Datahellip light
6
BUT ndash all achieved using a traditional single relational database algorithms etc
No flexible search query architecture
Looks like lsquoBig Datarsquohellip
TB scale input data
40000 core hours of processing on TACC
Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2
Consider a Big Data problemhellip
7
I have 5000 sequenced exomes with associated clinical observations and want to
bull Find all SNPs in each patient
bull Aggregate statistics for SNPs across all patients
bull Easily query for SNP associations with any of my clinical variables
bull Build a predictive model for a specific disease
A lot of data a lot of processing ndash need to do each step in parallel
Parallelization ndash Shared Memory
8
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
1 system multiple CPUs
Each CPU has portion of RAM attached directly
CPUs can access their or otherrsquos RAM
Parallel processes or threads can access any of the data in RAM easily
Parallelization ndash Distributed Shared Memory
9
RAM RAM RAM RAMRAM
Logical Address Space
RAM is split across many systems that use a special interconnect
The entire RAM across all machines is accessible anywhere in a single address space
Still quite simple to program but costly and now uncommon
Parallelization ndash Message Passing
10
Many systems multiple CPUs each
Can only directly access RAM inside single system
If needed data from other system must pass as a message across network
Must be carefully planned and optimized
Difficult
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
Map Reduce
11
Impose a rigid structure on tasks Always map then reduce
httpdmerwth-aachendederesearchprojectsmapreduce
Distributed Computing - Hadoop
12
A different way of thinking
Specify problem as mapping and reduction steps
Run on subsets of data on different nodes
Use special distributed filesystem to communicate data between steps
The framework takes care of the parallelism
httpsenwikipediaorgwikiApache_Hadoop
Map ndash Count words in a line
13
public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()
public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException
String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)
WordCountjava
Reduce ndash Sum the occurrence of words on from all lines
14
public static class Reduce extends ReducerltText IntWritable Text IntWritablegt
public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException
int sum = 0for (IntWritable val values)
sum += valget()contextwrite(key new IntWritable(sum))
WordCountjava
Driver ndash Run the map-reduce task
15
public static void main(String[] args) throws Exception Configuration conf = new Configuration()
Job job = new Job(conf wordcount)
jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)
jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)
jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)
FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))
jobwaitForCompletion(true)
WordCountjava
Running a Hadoop Job on BioHPC ndash sbatch script
16
binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000
module add myhadoop030-spark
export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output
$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_hadoopsbatch
Hadoop Limitations on BioHPC
17
Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized
HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends
Old Hadoop Running Hadoop 121Update to 2x soon
Looking for interested users to try out persistent HDFS
General Hadoop Limitations
18
Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)
Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower
HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading
Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc
SPARK
19
bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently
Far easier and better suited to most scientific tasks than plain Hadoop
Singular Value Decomposition
20
Challenge Visualize patterns in a huge assay x gene dataset
Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns
Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and
principal component analysis in A Practical Approach to Microarray Data Analysis DP
Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-
UR-02-4001
SVD on the cluster with Spark
21
import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix
val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)
val rowVectors = inputmap(
_split(t)map(_toDouble)
)map( v =gt Vectorsdense(v) )cache()
val mat=new RowMatrix(rowVectors)
val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)
Simple scripting language - This is scala can also use python
spark_svdscala
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
Consider a Big Data problemhellip
7
I have 5000 sequenced exomes with associated clinical observations and want to
bull Find all SNPs in each patient
bull Aggregate statistics for SNPs across all patients
bull Easily query for SNP associations with any of my clinical variables
bull Build a predictive model for a specific disease
A lot of data a lot of processing ndash need to do each step in parallel
Parallelization ndash Shared Memory
8
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
1 system multiple CPUs
Each CPU has portion of RAM attached directly
CPUs can access their or otherrsquos RAM
Parallel processes or threads can access any of the data in RAM easily
Parallelization ndash Distributed Shared Memory
9
RAM RAM RAM RAMRAM
Logical Address Space
RAM is split across many systems that use a special interconnect
The entire RAM across all machines is accessible anywhere in a single address space
Still quite simple to program but costly and now uncommon
Parallelization ndash Message Passing
10
Many systems multiple CPUs each
Can only directly access RAM inside single system
If needed data from other system must pass as a message across network
Must be carefully planned and optimized
Difficult
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
CORE
CORE
CORE
CORE
RAM
Map Reduce
11
Impose a rigid structure on tasks Always map then reduce
httpdmerwth-aachendederesearchprojectsmapreduce
Distributed Computing - Hadoop
12
A different way of thinking
Specify problem as mapping and reduction steps
Run on subsets of data on different nodes
Use special distributed filesystem to communicate data between steps
The framework takes care of the parallelism
httpsenwikipediaorgwikiApache_Hadoop
Map ndash Count words in a line
13
public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()
public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException
String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)
WordCountjava
Reduce ndash Sum the occurrence of words on from all lines
14
public static class Reduce extends ReducerltText IntWritable Text IntWritablegt
public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException
int sum = 0for (IntWritable val values)
sum += valget()contextwrite(key new IntWritable(sum))
WordCountjava
Driver ndash Run the map-reduce task
15
public static void main(String[] args) throws Exception Configuration conf = new Configuration()
Job job = new Job(conf wordcount)
jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)
jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)
jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)
FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))
jobwaitForCompletion(true)
WordCountjava
Running a Hadoop Job on BioHPC – sbatch script
16
#!/bin/bash
# Run on the super partition
#SBATCH -p super
# Use 64 tasks total
#SBATCH -n 64
# Across 2 nodes
#SBATCH -N 2
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark
export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID
myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOME/bin/start-all.sh
$HADOOP_HOME/bin/hadoop dfs -mkdir data
$HADOOP_HOME/bin/hadoop dfs -put pg2701.txt data
$HADOOP_HOME/bin/hadoop dfs -ls data
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount data wordcount-output
$HADOOP_HOME/bin/hadoop dfs -ls wordcount-output
$HADOOP_HOME/bin/hadoop dfs -get wordcount-output

$HADOOP_HOME/bin/stop-all.sh
myhadoop-cleanup.sh

slurm_hadoop.sbatch
Hadoop Limitations on BioHPC
17
Inefficiency: a (small) wait for everything to start up; workers are dedicated solely to you, and sit idle during portions of the job that are not highly parallelized
HDFS: uses /tmp on each compute node; slow HDD, but lots of RAM for caching; not persistent – deleted after the job ends
Old Hadoop: running Hadoop 1.2.1; update to 2.x soon
Looking for interested users to try out persistent HDFS
General Hadoop Limitations
18
Model: the rigid map -> reduce framework makes some problems hard to express; iterative algorithms can be difficult (a lot of scientific analysis)
Language: Java is the only 1st-class language; wrappers / frameworks for other languages are available, but generally slower
HDFS: results are always written to disk after map/reduce; the architecture is not good for small files / random reading
Many of these are alleviated by additional Hadoop projects – Hive, Pig, HBase etc.
SPARK
19
• In-memory computing model
• Load/save data using HDFS or a standard file system
• Scala, Java, Python 1st-class language support
• Interactive shells for exploratory analysis
• Libraries for database work, machine learning & linear algebra, etc.
• Can leverage Hadoop features (HDFS, HBase) or run independently
Far easier and better suited to most scientific tasks than plain Hadoop
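For comparison with the Hadoop WordCount above, a minimal word count against Spark's in-memory RDD API might look like this in spark-shell (a sketch; the input path is illustrative, not from the slides):

    val text = sc.textFile("file:///tmp/pg2701.txt")   // hypothetical local copy of the input text
    val counts = text.flatMap(_.split("\\s+"))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
                     .cache()                          // keep the result in memory for reuse
    counts.take(10).foreach(println)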
Singular Value Decomposition
20
Challenge: visualize patterns in a huge assay x gene dataset
Solution: use SVD to compute eigengenes, and visualize the data in a few dimensions that capture the majority of the interesting patterns
Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. Singular value decomposition and principal component analysis. In: A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow, eds., pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
SVD on the cluster with Spark
21
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val input = sc.textFile("file:///home2/dtrudgian/Demos/hadoop/matrix.txt")

val rowVectors = input.map(
    _.split("\t").map(_.toDouble)
).map( v => Vectors.dense(v) ).cache()

val mat = new RowMatrix(rowVectors)

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)

Simple scripting language – this is Scala; Python can also be used
spark_svd.scala
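To actually visualize the result, one possible follow-up (an assumption, not shown on the slide) is to project each assay onto the first two eigengenes and save the 2-D coordinates for plotting:

    val s = svd.s   // singular values, largest first
    val U = svd.U   // left singular vectors: one row per assay, one column per eigengene
    // Scale the first two columns of U by the corresponding singular values to get 2-D scores
    val coords = U.rows.map(r => (r(0) * s(0), r(1) * s(1)))
    coords.saveAsTextFile("file:///tmp/svd_coords")   // hypothetical output path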
Running Spark jobs on BioHPC
22
#!/bin/bash
# Run on the super partition
#SBATCH -p super
# Use 128 tasks total
#SBATCH -n 128
# Across 4 nodes
#SBATCH -N 4
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark
export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID
myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOME/bin/start-all.sh
source $HADOOP_CONF_DIR/spark/spark-env.sh
myspark start

spark-shell -i large_svd.scala

myspark stop
$HADOOP_HOME/bin/stop-all.sh
myhadoop-cleanup.sh

slurm_spark.sbatch
Demo – Interactive Spark – Python & Scala
23
Must be on a cluster node to connect to the Spark workers (login node or GUI session)
Launch a Spark cluster:  sbatch slurm_spark_interactive.sh
Wait for Spark to start, then load settings:  source hadoop-conf.<jobid>/spark/spark-env.sh
Connect using an interactive Scala session:  spark-shell
...or an interactive Python session:  pyspark
And shut down when done:  scancel <jobid>
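Once connected, a quick toy computation (hypothetical, not part of the demo scripts) confirms the workers are reachable:

    scala> val nums = sc.parallelize(1 to 1000000)
    scala> nums.filter(_ % 2 == 0).count()
    res0: Long = 500000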
What do you want to do?
24
Discussion
What big-data problems do you have in your research?
Are Hadoop and/or Spark interesting for your projects?
How can we help you use Hadoop/Spark for your work?
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
SPARK
19
bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently
Far easier and better suited to most scientific tasks than plain Hadoop
Singular Value Decomposition
20
Challenge Visualize patterns in a huge assay x gene dataset
Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns
Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and
principal component analysis in A Practical Approach to Microarray Data Analysis DP
Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-
UR-02-4001
SVD on the cluster with Spark
21
import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix
val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)
val rowVectors = inputmap(
_split(t)map(_toDouble)
)map( v =gt Vectorsdense(v) )cache()
val mat=new RowMatrix(rowVectors)
val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)
Simple scripting language - This is scala can also use python
spark_svdscala
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
Singular Value Decomposition
20
Challenge Visualize patterns in a huge assay x gene dataset
Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns
Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and
principal component analysis in A Practical Approach to Microarray Data Analysis DP
Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-
UR-02-4001
SVD on the cluster with Spark
21
import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix
val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)
val rowVectors = inputmap(
_split(t)map(_toDouble)
)map( v =gt Vectorsdense(v) )cache()
val mat=new RowMatrix(rowVectors)
val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)
Simple scripting language - This is scala can also use python
spark_svdscala
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
SVD on the cluster with Spark
21
import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix
val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)
val rowVectors = inputmap(
_split(t)map(_toDouble)
)map( v =gt Vectorsdense(v) )cache()
val mat=new RowMatrix(rowVectors)
val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)
Simple scripting language - This is scala can also use python
spark_svdscala
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
Running Spark jobs on BioHPC
22
binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000
module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID
myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010
$HADOOP_HOMEbinstart-allsh
source $HADOOP_CONF_DIRsparkspark-envsh myspark start
spark-shell -i large_svdscala
myspark stop
$HADOOP_HOMEbinstop-allsh
myhadoop-cleanupsh
slurm_sparksbatch
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
Demo ndash Interactive Spark ndash Python amp Scala
23
Must be on a cluster node to connect to spark workers (login node or GUI session)
Launch a spark clustersbatch slurm_spark_interactivesh
Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh
Connect using interactive scala sessionspark-shell scala
or interactive python sessionpyspark python
And shutdown when donescancel ltjobidgt
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work
What do you want to do
24
Discussion
What big-data problems to you have in your research
Are Hadoop andor Spark interesting for your projects
How can we help you use HadoopSpark for your work