HBaseCon East 2016: HBase and Spark, State of the Art
TRANSCRIPT
HBase and Spark, state of the art
About Jean-Marc Spaggiari
• Java Message Service => JMS
• Solutions Architect at Cloudera
• A bit of everything…
  • Development
  • Team/Project manager
  • Architect
• O'Reilly author of Architecting HBase Applications
• International
  • Worked from Paris to Los Angeles
  • More than 100 flights per year
• HBase (and others) contributor
Overview
• Where we came from
• Examples of code
• Improvements that are coming up
• Spark Streaming use case
Source of Demand
• Demand started in the field
• Use cases:
  • API access: Gets, Puts, Scans
  • MapReduce: mass scans
  • MapReduce: bulk load
  • MapReduce: smart gets and puts
• Spark has all but killed MapReduce
• Spark Streaming has grown in popularity
• Populating aggregates
• Entity-centric time series data store, e.g. OpenTSDB
• Look-ups for joins or mutations
How it Started
• Started on GitHub
• Andrew Purtell started the effort to bring it into HBase
• Big call-out to Sean B, Jon H, Ted Y, Ted M and Matteo B
• Components:
  • Normal Spark
  • Spark Streaming
  • Bulk Load
  • SparkSQL
Under the covers
[Architecture diagram: the Driver ships the configs to each Worker Node. Inside every Executor, the configs and a single HConnection live in static space and are shared by all Tasks on that executor.]
Key Addition: HBaseContext
Create an HBaseContext:

// A Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
Operations on the HBaseContext
• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget; see the sketch after this list)
• BulkPut
• BulkDelete
• Most of them in both Java and Scala
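BulkGet is the one operation above without a code slide later in this deck. A minimal sketch, assuming the hbase-spark module's HBaseContext.bulkGet (table name, batch size, input RDD, a function building a Get per record, and a function converting each Result); table "t1", the batch size, and the result conversion are illustrative:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.util.Bytes

// A sketch, not the speaker's code: rdd is an RDD[Array[Byte]] of row keys,
// hbaseContext is the one created earlier.
val getRdd = hbaseContext.bulkGet[Array[Byte], String](
  TableName.valueOf("t1"),
  2,                          // batch size: Gets grouped per multi-get call
  rdd,                        // row keys to fetch
  record => new Get(record),  // build one Get per record
  (result: Result) => Bytes.toString(result.getRow)) // convert each Result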
Foreach
Read data in parallel for each partition and compute:

rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // do something
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    ... // HBase API put/incr/append/cas calls
  })
  bufferedMutator.flush()
  bufferedMutator.close()
})
Foreach
Read data in parallel for each partition and compute (Java):

hbaseContext.foreachPartition(keyValuesPuts,
  new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
    @Override
    public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
      BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
      while (t._1().hasNext()) {
        ... // HBase API put/incr/append/cas calls
      }
      mutator.flush();
      mutator.close();
    }
  });
Map
Take a dataset and map it in parallel for each partition to produce a new RDD, or process it:

val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
  val table = conn.getTable(TableName.valueOf("t1"))
  var res = mutable.MutableList[String]()
  it.map(r => {
    ... // HBase API Scan Results
  })
})
BulkLoad
Bulk load a data set into HBase (for all cases, generally wide tables) (Scala only):

rdd.hbaseBulkLoad(hbaseContext, tableName,
  t => {
    val rowKey = t._1
    val fam: Array[Byte] = t._2._1
    val qual = t._2._2
    val value = t._2._3
    val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
    Seq((keyFamilyQualifier, value)).iterator
  }, stagingFolder)

val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))
BulkLoadThinRows
Bulk load a data set into HBase (for skinny tables, < 10k columns):

hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, TableName.valueOf(tableName), t => {
    val rowKey = Bytes.toBytes(t._1)
    val familyQualifiersValues = new FamiliesQualifiersValues
    t._2.foreach(f => {
      val family: Array[Byte] = f._1
      val qualifier = f._2
      val value: Array[Byte] = f._3
      familyQualifiersValues += (family, qualifier, value)
    })
    (new ByteArrayWrapper(rowKey), familyQualifiersValues)
  }, stagingFolder.getPath)
BulkPut
Parallelized HBase multiput:

hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, tableName, (putRecord) => {
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.add(putValue._1, putValue._2, putValue._3))
    put
  })
BulkPut
Parallelized HBase multiput (Java):

hbaseContext.bulkPut(textFile, TABLE_NAME,
  new Function<String, Put>() {
    @Override
    public Put call(String v1) throws Exception {
      String[] tokens = v1.split("\\|");
      Put put = new Put(Bytes.toBytes(tokens[0]));
      put.addColumn(Bytes.toBytes("segment"),
        Bytes.toBytes(tokens[1]),
        Bytes.toBytes(tokens[2]));
      return put;
    }
  });
BulkDelete
Parallelized HBase multi-deletes, in either form:

hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
  putRecord => new Delete(putRecord),
  4) // batch size

rdd.hbaseBulkDelete(hbaseContext, tableName,
  putRecord => new Delete(putRecord),
  4) // batch size
Table RDD
How to materialize a table as a Spark RDD.
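A minimal sketch, assuming the hbase-spark module's HBaseContext.hbaseRDD(tableName, scan), which runs a distributed Scan and returns (ImmutableBytesWritable, Result) pairs; table "t1" and the caching value are illustrative:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes

// A sketch: materialize table t1 as an RDD[(ImmutableBytesWritable, Result)],
// reusing the hbaseContext created earlier.
val scan = new Scan()
scan.setCaching(100) // rows fetched per RPC
val tableRdd = hbaseContext.hbaseRDD(TableName.valueOf("t1"), scan)
tableRdd.foreach(v => println(Bytes.toString(v._1.get()))) // print each row key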
What Improvements Have We Made?
▪ Combine Spark and HBase
  • Spark Catalyst Engine for Query Plan and Optimization
  • HBase for Fast Access KV Store
  • Implement Standard External Data Source with Built-in Filter
▪ High Performance
  • Data Locality: Move Computation to Data
  • Partition Pruning: Tasks only Performed in the RegionServers Holding the Requested Data
  • Column Pruning / Predicate Pushdown: Reduce Network Overhead
▪ Full-Fledged DataFrame Support
  • Spark-SQL
  • Integrated Language Query
▪ Run on Top of Existing HBase Tables
  • Native Support for Java Primitive Types
▪ Still some work and improvements to be done
  • HBASE-16638: Reduce the number of Connections
  • HBASE-14217: Add Java access to Spark bulk load functionality
DataFrame + HBase
WIP... 2.0?
Usage - Define the Catalog
HBase table to DataFrame catalog mapping:

def catalog = s"""{
    |"table":{"namespace":"default", "name":"table1"},
    |"rowkey":"key",
    |"columns":{
      |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
      |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
      |"col2":{"cf":"cf1", "col":"col2", "type":"string"}
    |}
  |}""".stripMargin
Usage – Write to HBase
Export an RDD into a new HBase table with the DataFrame API:

sc.parallelize(data).toDF.write
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
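The slide leaves `data` undefined; a hypothetical definition matching the catalog above (the case class name and the generated values are invented for illustration):

// Hypothetical record type: col0 maps to the row key,
// col1 to cf1:col1 (boolean), col2 to cf1:col2 (string).
case class Table1Record(col0: String, col1: Boolean, col2: String)

// Assumes import sqlContext.implicits._ so .toDF works on the slide above.
val data = (0 to 255).map { i =>
  Table1Record("row%03d".format(i), i % 2 == 0, "value" + i)
}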
Usage – Construct a DataFrame

Read an HBase table back as a DataFrame through the same catalog:

def withCatalog(cat: String): DataFrame = {
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
Usage - Language Integrated Query/SQL

Filter and project the table with the language-integrated query API:

val df = withCatalog(catalog)
val s = df.filter(
    (($"col0" <= "row050" && $"col0" > "row040") ||
      $"col0" === "row005" ||
      $"col0" === "row020" ||
      $"col0" === "r20" ||
      $"col0" <= "row005") &&
    ($"col2" === 1 ||
      $"col2" === 42))
  .select("col0", "col1", "col2")
s.show
Usage - Language Integrated Query/SQL

Query the same table with SQL:

// Load the dataframe
val df = withCatalog(catalog)
// SQL example
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show
Spark Streaming Example
[Pipeline diagram: Kafka Producer → Spark Streaming → HBase → SOLR]
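A minimal sketch of the Spark Streaming to HBase leg of the pipeline above, assuming the hbase-spark module's HBaseContext.streamBulkPut; the socket source stands in for the Kafka producer, and the table, column family, and "rowKey|value" message format are invented for illustration:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A sketch reusing sc and hbaseContext from the earlier slides.
val ssc = new StreamingContext(sc, Seconds(1))

// Stand-in for the Kafka direct stream: a DStream[String] of "rowKey|value" lines.
val messages = ssc.socketTextStream("localhost", 9999)

// streamBulkPut writes each micro-batch to HBase through the shared HConnection.
hbaseContext.streamBulkPut[String](messages,
  TableName.valueOf("t1"),
  (message: String) => {
    val tokens = message.split("\\|")
    val put = new Put(Bytes.toBytes(tokens(0)))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes(tokens(1)))
    put
  })

ssc.start()
ssc.awaitTermination()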