Sparkling Water 5 28-14
![Page 1: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/1.jpg)
Meetup 5/28/2014
Sparkling Water
Michal Malohlava
@mmalohlava
@hexadata
![Page 2: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/2.jpg)
Who am I?
Background
•PhD in CS from Charles University in Prague, 2012
•1 year PostDoc at Purdue University experimenting with algos for large computation
•1 year at 0xdata helping to develop H2O engine for big data computation
Experience with domain-specific languages, distributed systems, software engineering, and big data.
![Page 3: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/3.jpg)
Overview
1.Towards H2O and Spark integration
2.Details and demo
3.Next steps…
![Page 4: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/4.jpg)
Vision
Towards Spark and H2O integration
![Page 5: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/5.jpg)
User-friendly API
Large and active community
Platform components - SQL
Multitenancy
Memory efficient
Performance of computation
Machine learning algorithms
Parser, R-interface
![Page 6: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/6.jpg)
Combine the benefits of both tools and make
“H2O a killer application for Spark”
![Page 7: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/7.jpg)
Steps towards interoperability
1.Data sharing between Spark and H2O
2.Optimize & improve
3.Low-level integration
![Page 9: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/9.jpg)
Data sharing scenario
[Diagram, built up over slides 9 to 16. Labels in order of appearance: RDD, SQL query, Frame, Algo]
![Page 17: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/17.jpg)
Data sharing strategies
Possible solutions
•Direct
•Distributed
•Socket-based
•File-based
•Tachyon-based
Spark to H2O
![Page 23: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/23.jpg)
Data sharing via Tachyon
[Diagram, built up over slides 23 to 31: an H2O node with a Spark driver, and Tachyon]
1.Invoke Spark driver
2.Load data
3.Query
4.Persist data to Tachyon
5.Load data into H2O frame
6.Invoke GBM on data
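The hand-off through shared storage can be sketched in plain Scala. This is only an illustration under loud assumptions: a local file stands in for Tachyon, an in-memory collection stands in for the Spark RDD and its SQL query, and a simple comma split stands in for the H2O parser. The object `SharedStorageSketch`, the `Row` type, and all method names are hypothetical and are not part of Spark, Tachyon, or H2O.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object SharedStorageSketch {
  // Hypothetical row type: (origin, dest, isDepDelayed).
  type Row = (String, String, Boolean)
  val header = "origin,dest,isDepDelayed"

  // Producer side, "Query": keep only the flights to the given destination.
  def query(rows: Seq[Row], dest: String): Seq[Row] = rows.filter(_._2 == dest)

  // Producer side, "Persist data": write the header line first, then one line per row.
  def persist(rows: Seq[Row], target: File): Unit = {
    val out = new PrintWriter(target, "UTF-8")
    try {
      out.println(header)
      rows.foreach { case (origin, dest, delayed) => out.println(s"$origin,$dest,$delayed") }
    } finally out.close()
  }

  // Consumer side, "Load data into frame": read the same location back
  // and split the header from the data lines.
  def load(source: File): (List[String], List[List[String]]) = {
    val src = Source.fromFile(source, "UTF-8")
    try {
      val lines = src.getLines().toList
      (lines.head.split(",").toList, lines.tail.map(_.split(",").toList))
    } finally src.close()
  }
}
```

The producer and the consumer share nothing except the storage location and the text format, which is why the header has to travel with the data.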
![Page 32: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/32.jpg)
Spark 1.0-rc11
SQL component
Implemented proper parser/serializer to satisfy H2O parser
![Page 33: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/33.jpg)
Latest H2O version - 2.5-SNAPSHOT
With Tachyon support included
Embedded Spark driver
![Page 34: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/34.jpg)
Key requirements
Transparent approach
Work with many columns
Preserve NAs
Preserve headers
![Page 38: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/38.jpg)
Solved challenges
Large number of columns in SQL schema
•>22 columns (case class restriction)
•Solved via Product interface
    class Airlines( year          :Option[Int],    // 0
                    month         :Option[Int],    // 1
                    dayOfMonth    :Option[Int],    // 2
                    dayOfWeek     :Option[Int],    // 3
                    crsDepTime    :Option[Int],    // 5
                    crsArrTime    :Option[Int],    // 7
                    uniqueCarrier :Option[String], // 8
                    flightNum     :Option[Int],    // 9
                    tailNum       :Option[Int],    // 10
                    crsElapsedTime:Option[Int],    // 12
                    origin        :Option[String], // 16
                    dest          :Option[String], // 17
                    distance      :Option[Int],    // 18
                    isArrDelayed  :Option[Boolean],// 29
                    isDepDelayed  :Option[Boolean] // 30
                  ) extends Product { … }
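The slide elides the body of the class. A minimal sketch of what implementing `Product` by hand involves is shown here on a hypothetical three-field class; `Flight` and its fields are made up for illustration and are not the demo's `Airlines` class. The same three members would be written for a class with more than 22 fields. `java.io.Serializable` is mixed in on the assumption that instances get shipped between JVMs.

```scala
// Hypothetical record type; a plain class, not a case class,
// so the case class field limit does not apply.
class Flight(val year: Option[Int],
             val origin: Option[String],
             val isDepDelayed: Option[Boolean])
  extends Product with java.io.Serializable {

  // Required by Equals, which Product extends.
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Flight]

  // Number of fields exposed through the Product interface.
  override def productArity: Int = 3

  // Positional access to the fields, in declaration order.
  override def productElement(n: Int): Any = n match {
    case 0 => year
    case 1 => origin
    case 2 => isDepDelayed
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}
```

Code that only relies on `Product`, for example anything iterating `productIterator`, can then treat the class like a tuple or a case class.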
![Page 39: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/39.jpg)
Solved challenges
Handling NAs during load
•Store them in SQL RDD
•Solved by https://github.com/apache/spark/pull/658
•Use Option[T] or non-primitive Java type
Handling NAs during save
•A simple sql.Row serializer handling NA values
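Both directions can be illustrated with a small sketch in plain Scala. It assumes that missing values appear in the text as `NA` or as an empty field; the object `NaHandling` and its methods are hypothetical and are not the serializer used in the demo.

```scala
object NaHandling {
  // Load side: an empty or "NA" field becomes None, anything else is parsed as Int.
  // A malformed number still throws NumberFormatException, so bad input is not hidden.
  def parseInt(field: String): Option[Int] = {
    val s = field.trim
    if (s.isEmpty || s == "NA") None else Some(s.toInt)
  }

  // Save side: write each present value with toString and each missing value as "NA".
  def toCsvLine(row: Seq[Option[Any]]): String =
    row.map(_.map(_.toString).getOrElse("NA")).mkString(",")
}
```

Using `Option[Int]` instead of `Int` keeps a missing value distinct from a real zero all the way through the query.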
![Page 40: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/40.jpg)
Time for Demo
![Page 41: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/41.jpg)
Step-by-step
Start Spark cloud - 1 worker
Start Tachyon storage
Start H2O slave node
Start H2O master node with Scala master program
    override def run(conf: DemoConf): Unit = {
      // Dataset
      val dataset = "data/allyears2k_headers.csv"
      // Row parser
      val rowParser = AirlinesParser
      // Table name for SQL
      val tableName = "airlines_table"
      // Select all flights with destination == SFO
      val query = """SELECT * FROM airlines_table WHERE dest="SFO" """

      // Connect to the Spark cluster, run the query over the airlines table, and transfer the result into H2O
      val frame:Frame = executeSpark[Airlines](dataset, rowParser,
                            conf.extractor, tableName, query, local=conf.local)

      // Now make a blocking call of GBM directly via Java API
      gbm(frame, frame.vec("isDepDelayed"), 100, true)
    }
Demo code
![Page 43: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/43.jpg)
Next steps…
Optimize data transfers
•Have notion of H2O RDD inside Spark
H2O Backend for MLlib
•Based on H2O RDD
•Use H2O algos
![Page 44: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/44.jpg)
Open challenges
See http://jira.0xdata.com and Sparkling component
•PUB-730 Transfer results from H2O frame into RDD
•PUB-732 Parquet support for H2O
•PUB-733 MLlib backend
•PUB-734 H2O-based RDD
![Page 45: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/45.jpg)
Time for questions
Thank you!
![Page 46: Sparkling Water 5 28-14](https://reader034.vdocument.in/reader034/viewer/2022051311/53ed7fa88d7f7289708b5cf3/html5/thumbnails/46.jpg)
Learn more about H2O at 0xdata.com
or
    neo> for r in h2o h2o-sparkling; do
           git clone "[email protected]:0xdata/$r.git"
         done
Thank you!
Follow us at @hexadata