myHadoop 0.30 and Beyond

Glenn K. Lockwood, Ph.D., User Services Group, San Diego Supercomputer Center


DESCRIPTION

Overview of myHadoop 0.30, a framework for deploying Hadoop on existing high-performance computing infrastructure. Discussion of how to install it, spin up a Hadoop cluster, and use the new features. myHadoop 0.30's project page is now on GitHub (https://github.com/glennklockwood/myhadoop) and the latest release tarball can be downloaded from my website (glennklockwood.com/files/myhadoop-0.30.tar.gz)

TRANSCRIPT

Page 1: myHadoop 0.30

myHadoop 0.30 and Beyond

Glenn K. Lockwood, Ph.D.
User Services Group
San Diego Supercomputer Center

Page 2: myHadoop 0.30

Hadoop and HPC

PROBLEM: domain scientists aren't using Hadoop
• a toy for computer scientists and salespeople
• Java is not high-performance
• don't want to learn computer science and Java

SOLUTION: make Hadoop easier for HPC users
• use existing HPC clusters and software
• use Perl/Python/C/C++/Fortran instead of Java
• make starting Hadoop as easy as possible

Page 3: myHadoop 0.30

Compute: Traditional vs. Data-Intensive

Traditional HPC
• CPU-bound problems
• Solution: OpenMP- and MPI-based parallelism

Data-Intensive
• I/O-bound problems
• Solution: Map/reduce-based parallelism

Page 4: myHadoop 0.30

Architecture for Both Workloads

PROs
• High-speed interconnect
• Complementary object storage
• Fast CPUs, RAM
• Less faulty

CONs
• Nodes aren't storage-rich
• Transferring data between HDFS and object storage*

* unless using Lustre, S3, etc. backends

Pages 5–8: myHadoop 0.30

Add Data Analysis to Existing Compute Infrastructure (figure slides)

Page 9: myHadoop 0.30

myHadoop – 3-step Install

1. Download Apache Hadoop 1.x and myHadoop 0.30

   $ wget http://apache.cs.utah.edu/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
   $ wget glennklockwood.com/files/myhadoop-0.30.tar.gz

2. Unpack both Hadoop and myHadoop

   $ tar zxvf hadoop-1.2.1-bin.tar.gz
   $ tar zxvf myhadoop-0.30.tar.gz

3. Apply the myHadoop patch to Hadoop

   $ cd hadoop-1.2.1/conf
   $ patch < ../myhadoop-0.30/myhadoop-1.2.1.patch

Page 10: myHadoop 0.30

myHadoop – 3-step Cluster

1. Set a few environment variables

   # sets HADOOP_HOME, JAVA_HOME, and PATH
   $ module load hadoop
   $ export HADOOP_CONF_DIR=$HOME/mycluster.conf

2. Run myhadoop-configure.sh to set up Hadoop

   $ myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID

3. Start the cluster with Hadoop's start-all.sh

   $ start-all.sh
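For context, these three steps can be wrapped in a single batch script. The sketch below assumes a Torque system like Gordon; the job name, node count, walltime, and the stop-all.sh / myhadoop-cleanup.sh shutdown calls at the end are illustrative assumptions rather than something shown on the slide.

   #!/bin/bash
   #PBS -N myhadoop-demo            # illustrative job name
   #PBS -l nodes=4:ppn=16           # illustrative node count
   #PBS -l walltime=01:00:00        # illustrative walltime

   # 1. environment: module sets HADOOP_HOME, JAVA_HOME, and PATH
   module load hadoop
   export HADOOP_CONF_DIR=$HOME/mycluster.conf

   # 2. generate per-job Hadoop configs on node-local scratch
   myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID

   # 3. start the cluster, run Hadoop jobs, then shut everything down
   start-all.sh
   # ... hadoop dfs / hadoop jar commands go here ...
   stop-all.sh
   myhadoop-cleanup.sh              # assumed cleanup helper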

Page 11: myHadoop 0.30

Easy Wordcount in Python

Mapper (mapper.py):

   #!/usr/bin/env python
   import sys

   # emit "<word>\t1" for every word on every input line
   for line in sys.stdin:
       line = line.strip()
       keys = line.split()
       for key in keys:
           value = 1
           print('%s\t%d' % (key, value))

Reducer (reducer.py):

   #!/usr/bin/env python
   import sys

   # sum the counts for each key; input arrives sorted by key
   last_key = None
   running_tot = 0

   for input_line in sys.stdin:
       input_line = input_line.strip()
       this_key, value = input_line.split("\t", 1)
       value = int(value)
       if last_key == this_key:
           running_tot += value
       else:
           if last_key:
               print("%s\t%d" % (last_key, running_tot))
           running_tot = value
           last_key = this_key

   # emit the final key
   if last_key:
       print("%s\t%d" % (last_key, running_tot))

https://github.com/glennklockwood/hpchadoop/tree/master/wordcount.py
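Before submitting these scripts to Hadoop, one way to sanity-check them is to emulate the framework's sort-and-shuffle phase with a shell pipeline; the input filename here is just an example:

   $ cat input.txt | python mapper.py | sort -k1,1 | python reducer.py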

Page 12: myHadoop 0.30

Easy Wordcount in Python

   $ hadoop dfs -put ./input.txt mobydick.txt
   $ hadoop jar \
       /opt/hadoop/contrib/streaming/hadoop-streaming-1.1.1.jar \
       -mapper "python $PWD/mapper.py" \
       -reducer "python $PWD/reducer.py" \
       -input mobydick.txt \
       -output output
   $ hadoop dfs -cat output/part-* > ./output.txt

Page 13: myHadoop 0.30

Data-Intensive Performance Scaling

• 8-node Hadoop cluster on Gordon
• 8 GB VCF (Variant Call Format) file
• 9x speedup using 2 mappers/node

Page 14: myHadoop 0.30

Data-Intensive Performance Scaling

• 7.7 GB of chemical evolution data, 270 MB/s processing rate

Page 15: myHadoop 0.30

Advanced Features - Usability

• System-wide default configurations (example sketched below)
  • myhadoop-0.30/conf/myhadoop.conf
  • MH_SCRATCH_DIR – specify location of node-local storage for all users
  • MH_IPOIB_TRANSFORM – specify regex to transform node hostnames into IP-over-InfiniBand hostnames
• Users can remain totally ignorant of scratch disks and InfiniBand
  • Literally define HADOOP_CONF_DIR and run myhadoop-configure.sh with no parameters – myHadoop figures out everything else

Page 16: myHadoop 0.30

Advanced Features - Usability

• Parallel filesystem support
  • HDFS on Lustre via myHadoop persistent mode (-p) – invocation sketched below
  • Direct Lustre support (IDH)
  • No performance loss at smaller scales for HDFS on Lustre
• Resource managers supported in a unified framework:
  • Torque 2.x and 4.x – tested on SDSC Gordon
  • SLURM 2.6 – tested on TACC Stampede
  • Grid Engine
  • Can support LSF, PBSpro, Condor easily (need testbeds)
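As a rough illustration of persistent mode, the invocation below assumes that -p takes the shared-filesystem directory where HDFS state should live between jobs; both paths are made-up examples rather than values from the slides.

   # keep HDFS block data on Lustre so it survives the batch job
   $ myhadoop-configure.sh \
       -p /oasis/scratch/$USER/persistent_hdfs \
       -s /scratch/$USER/$PBS_JOBID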

Page 17: myHadoop 0.30

New Features

• Perl reimplementation for expandability
• Scalability improvements
  • Separate namenode, secondary namenode, jobtrackers
  • Automatic switchover based on cluster-size best practices
• Hadoop 2.0 / YARN support (halfway there)
• Spark integration