H2O Big Data Environments

H2O in Big Data Environments Tom Kraljevic H2O World Nov. 18, 2014

Upload: sri-ambati

Post on 01-Jul-2015


DESCRIPTION

H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms while keeping the widely used languages of R and JSON as an API. H2O provides an elegant Lego-like infrastructure that brings fine-grained parallelism to math over simple distributed arrays.

TRANSCRIPT

Page 1: H2O Big Data Environments

H2O in Big Data Environments

Tom Kraljevic
H2O World
Nov. 18, 2014

Page 2: H2O Big Data Environments

Outline

• Motivation
• Observations and Goals
• Hadoop (and YARN)
• Spark (and Sparkling Water)
• EC2
• Other environments
• Q & A

Page 3: H2O Big Data Environments

Motivation

• Information is the New Gold, and information comes from Data
• Last year at NY Strata: frenetic data collection (not necessarily with a plan)
• This year at NY Strata: people have a plan
  – Market mix modeling, Fraud detection
  – Advertising technology, Customer intelligence
• Reading data, building models and making predictions at scale

Page 4: H2O Big Data Environments

Observations and Goals

• Observations
  – Data has gravity
    • Govt. regulation, Privacy, Sheer size
  – HDFS and S3 are terrific data stores
  – Procurement of new hardware resources is usually hard
    • Easier to use the Hadoop environment you already have, or start up new nodes in the cloud
• Goals
  – Move algorithms (i.e. H2O) to the data
    • H2O is pure Java, so it's easy to run anywhere
  – Make H2O best-in-class Machine Learning application / library
  – Make H2O work well in whatever environment you already have

Page 5: H2O Big Data Environments

Supported Hadoop Environments

• All major distributions
  – Cloudera
  – Hortonworks
  – MapR
• Hosted in the cloud
  – e.g. Altiscale
• We don't see this one too often (but it's worked when we have)
  – Open source Apache Hadoop

Page 6: H2O Big Data Environments

Hadoop (and YARN)

• You can launch H2O directly on Hadoop:

$ hadoop jar h2odriver.jar … -nodes 3 -mapperXmx 50g

• H2O uses Hadoop MapReduce to get CPU and Memory on the cluster, not to manage work
  – H2O mappers stay at 0% progress forever
    • Until you shut down the H2O job yourself
  – All mappers (3 in this case) must be running at the same time
  – The mappers communicate with each other
    • Form an H2O cluster on-the-spot within your Hadoop environment
  – No Hadoop reducers(!)
• Special YARN memory settings for large mappers
  – yarn.nodemanager.resource.memory-mb
  – yarn.scheduler.maximum-allocation-mb
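The two YARN settings matter because a large mapper (50g in the launch command) can only start if YARN is allowed to grant a container that big. A minimal sketch of that check follows. The helper names, the 10 percent overhead on top of the Java heap, and the example limits are assumptions for illustration, not values from the talk.

```python
# Hypothetical helpers: not part of H2O or Hadoop.
# Assumption: a container needs the mapper heap plus some percentage of
# overhead; 10 percent is an illustrative default, not a documented value here.

def container_mb(mapper_xmx_gb, overhead_percent=10):
    """Memory (MB) one H2O mapper container would need: heap plus assumed overhead."""
    heap_mb = mapper_xmx_gb * 1024
    # Integer ceiling of heap_mb * (100 + overhead_percent) / 100.
    return -(-heap_mb * (100 + overhead_percent) // 100)


def mapper_fits(mapper_xmx_gb, max_allocation_mb, nodemanager_memory_mb,
                overhead_percent=10):
    """True if the container fits under both YARN limits.

    max_allocation_mb     -> yarn.scheduler.maximum-allocation-mb
    nodemanager_memory_mb -> yarn.nodemanager.resource.memory-mb
    """
    needed = container_mb(mapper_xmx_gb, overhead_percent)
    return needed <= max_allocation_mb and needed <= nodemanager_memory_mb


if __name__ == "__main__":
    # Example limits are made up for illustration.
    print(container_mb(50))
    print(mapper_fits(50, max_allocation_mb=8192, nodemanager_memory_mb=65536))
    print(mapper_fits(50, max_allocation_mb=61440, nodemanager_memory_mb=65536))
```

Under these assumptions, a 50g mapper is rejected when either limit is smaller than the container size, which is why both settings usually need to be raised together for large mappers.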

Page 7: H2O Big Data Environments

H2O on YARN Deployment

[Diagram: a Hadoop Gateway Node runs "hadoop jar h2odriver.jar …". The Hadoop Cluster contains the YARN RM, the HDFS Data Nodes, and the YARN Worker Nodes. YARN Node Managers (NM) on the worker nodes host the H2O Mappers (YARN Containers).]

Page 8: H2O Big Data Environments

Now You Have an H2O Cluster

[Diagram: same layout as the deployment slide. Hadoop Gateway Node running "hadoop jar h2odriver.jar …", YARN RM, HDFS Data Nodes, and YARN Node Managers (NM) on the YARN Worker Nodes hosting the H2O Mappers (YARN Containers).]

Page 9: H2O Big Data Environments

Read Data from HDFS Once

[Diagram: same layout. Hadoop Gateway Node running "hadoop jar h2odriver.jar …", YARN RM, HDFS Data Nodes, and the H2O Mappers (YARN Containers) on the YARN Worker Nodes.]

Page 10: H2O Big Data Environments

Build Models in-Memory

[Diagram: same layout without the HDFS Data Nodes. Hadoop Gateway Node running "hadoop jar h2odriver.jar …", YARN RM, and the H2O Mappers (YARN Containers) on the YARN Worker Nodes.]

Page 11: H2O Big Data Environments

Spark (and Sparkling Water)

• H2O runs as an application on a Spark cluster using spark-submit
  – Standard Spark 1.1
  – Includes H2O on Spark on YARN
• H2O and Spark nodes share a JVM process
• H2ORDD facilitates easy data sharing between Spark (e.g. Spark SQL, MLlib) and H2O (e.g. Deep Learning)
• Scala support

Page 12: H2O Big Data Environments

Sparkling Water Application Life Cycle

[Diagram: a Sparkling App jar file is passed to spark-submit, which connects to the Spark Master JVM and three Spark Worker JVMs. The result is a Sparkling Water Cluster of three Spark Executor JVMs, each containing H2O. The steps are numbered (1) to (4).]

Page 13: H2O Big Data Environments

Sparkling Water Data Distribution

[Diagram: a Sparkling Water Cluster of three Spark Executor JVMs, each containing H2O. A Data Source (e.g. HDFS), a Spark RDD, and an H2O RDD are connected by steps numbered (1) to (3).]

Page 14: H2O Big Data Environments

For More Info on Sparkling Water...

• Visit Michal at the H2O World Hacker's Corner
• Training and Blogs
  – http://learn.h2o.ai/content/hackers_station/start_with_sparkling_water.html
  – http://h2o.ai/blog/2014/11/sparkling/water/on/yarn/example/
  – http://h2o.ai/blog/2014/09/sparkling-water-tutorials/
  – http://h2o.ai/blog/2014/09/how-sparkling-water-brings-h2o-to-spark/
  – http://h2o.ai/blog/2014/09/Sparkling-Water/

Page 15: H2O Big Data Environments

EC2

• Security Group settings for TCP/UDP port management
  – port 54321 (TCP, User to H2O)
  – port 54322 (TCP and UDP, H2O to H2O)
• IAM role for easy and secure access to S3 data buckets (H2O to S3 data)
• Give each H2O node a fully-specified flatfile with the IP and port of each node in the cloud
• Use decent instances (cloud doesn't imply powerful)
  – m3.xlarge (or higher)
  – c3.xlarge (or higher)
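The flatfile bullet can be made concrete with a small sketch. It assumes the flatfile is plain text with one ip:port entry per line and that every node uses port 54321. The function name and the private IP addresses are hypothetical.

```python
# Hypothetical helper: not part of H2O.
# Assumption: the flatfile lists one "ip:port" entry per line, and the same
# file is given to every node so that each node knows all of its peers.

def build_flatfile(node_ips, port=54321):
    """Return flatfile text with one ip:port line for each node in the cloud."""
    if not node_ips:
        raise ValueError("at least one node IP is required")
    return "".join(f"{ip}:{port}\n" for ip in node_ips)


if __name__ == "__main__":
    # Made-up private addresses for three EC2 instances.
    ips = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
    print(build_flatfile(ips), end="")
```

Because the file is identical on every node, it can be generated once and copied to each instance before H2O is started.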

Page 16: H2O Big Data Environments

EC2 Instances and Data Access

[Diagram: three EC2 instances, each running H2O, form one H2O Cluster that reads Data in S3.]

Page 17: H2O Big Data Environments

Other Environments

• Rumored to work, but untested (so far) by us…
  – Google Compute Engine
  – Microsoft Azure

Page 18: H2O Big Data Environments

Q & A

Thanks for attending H2O World!

http://h2o.ai