HBaseCon East 2016: HBase and Spark, State of the Art
TRANSCRIPT
HBase and Spark, state of the art
About Jean-Marc Spaggiari
• Java Message Service => JMS
• Solutions Architect at Cloudera
• A bit of everything…
  • Development
  • Team/Project manager
  • Architect
• O'Reilly author of Architecting HBase Applications
• International
  • Worked from Paris to Los Angeles
  • More than 100 flights per year
• HBase (and others) contributor
Overview
• Where we came from
• Examples of code
• Improvements that are coming up
• Spark Streaming use case
Source of Demand
• Demand started in the field
• Use cases:
  • API access: Gets, Puts, Scans
  • MapReduce: mass scans
  • MapReduce: bulk load
  • MapReduce: smart gets and puts
• Spark has all but killed MapReduce
• Spark Streaming has grown in popularity
• Populating aggregates
• Entity-centric time series data store, e.g. OpenTSDB
• Look-ups for joins or mutations
How it Started
• Started on GitHub
• Andrew Purtell started the effort to bring it into HBase
• Big call-out to Sean B, Jon H, Ted Y, Ted M and Matteo B
• Components:
  • Normal Spark
  • Spark Streaming
  • Bulk Load
  • SparkSQL
Under the covers
[Architecture diagram: the Driver ships the configs to each Worker Node. Inside every Executor, the configs and a single HConnection live in static space and are shared by all Tasks on that executor.]
Key Addition: HBaseContext
Create an HBaseContext:

// A Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
Operations on the HBaseContext
• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget; see the sketch after this list)
• BulkPut
• BulkDelete
• Most of them in both Java and Scala
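BulkGet is the one operation above without a code slide later in this deck. A minimal sketch, assuming the hbase-spark module's HBaseContext.bulkGet (table name, batch size, input RDD, a function building a Get per record, and a function converting each Result); table "t1", the batch size, and the result conversion are illustrative:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.util.Bytes

// A sketch, not the speaker's code: rdd is an RDD[Array[Byte]] of row keys,
// hbaseContext is the one created earlier.
val getRdd = hbaseContext.bulkGet[Array[Byte], String](
  TableName.valueOf("t1"),
  2,                          // batch size: Gets grouped per multi-get call
  rdd,                        // row keys to fetch
  record => new Get(record),  // build one Get per record
  (result: Result) => Bytes.toString(result.getRow)) // convert each Result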
Foreach
Read data in parallel for each partition and compute:

rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // do something
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    ... // HBase API put/incr/append/cas calls
  })
  bufferedMutator.flush()
  bufferedMutator.close()
})
Foreach
Read data in parallel for each partition and compute (Java):

hbaseContext.foreachPartition(keyValuesPuts,
  new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
    @Override
    public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
      BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
      while (t._1().hasNext()) {
        ... // HBase API put/incr/append/cas calls
      }
      mutator.flush();
      mutator.close();
    }
  });
Map
Take a dataset and map it in parallel for each partition to produce a new RDD, or process it:

val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
  val table = conn.getTable(TableName.valueOf("t1"))
  var res = mutable.MutableList[String]()
  it.map(r => {
    ... // HBase API Scan Results
  })
})
BulkLoad
Bulk load a data set into HBase (for all cases, generally wide tables) (Scala only):

rdd.hbaseBulkLoad(hbaseContext, tableName,
  t => {
    val rowKey = t._1
    val fam: Array[Byte] = t._2._1
    val qual = t._2._2
    val value = t._2._3
    val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
    Seq((keyFamilyQualifier, value)).iterator
  }, stagingFolder)

val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))
BulkLoadThinRows
Bulk load a data set into HBase (for skinny tables, < 10k columns):

hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, TableName.valueOf(tableName), t => {
    val rowKey = Bytes.toBytes(t._1)
    val familyQualifiersValues = new FamiliesQualifiersValues
    t._2.foreach(f => {
      val family: Array[Byte] = f._1
      val qualifier = f._2
      val value: Array[Byte] = f._3
      familyQualifiersValues += (family, qualifier, value)
    })
    (new ByteArrayWrapper(rowKey), familyQualifiersValues)
  }, stagingFolder.getPath)
BulkPut
Parallelized HBase multiput:

hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, tableName, (putRecord) => {
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.add(putValue._1, putValue._2, putValue._3))
    put
  })
BulkPut
Parallelized HBase multiput (Java):

hbaseContext.bulkPut(textFile, TABLE_NAME,
  new Function<String, Put>() {
    @Override
    public Put call(String v1) throws Exception {
      String[] tokens = v1.split("\\|");
      Put put = new Put(Bytes.toBytes(tokens[0]));
      put.addColumn(Bytes.toBytes("segment"),
        Bytes.toBytes(tokens[1]),
        Bytes.toBytes(tokens[2]));
      return put;
    }
  });
BulkDelete
Parallelized HBase multi-deletes, in either form:

hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
  putRecord => new Delete(putRecord),
  4) // batch size

rdd.hbaseBulkDelete(hbaseContext, tableName,
  putRecord => new Delete(putRecord),
  4) // batch size
Table RDD
How to materialize a table as a Spark RDD.
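A minimal sketch, assuming the hbase-spark module's HBaseContext.hbaseRDD(tableName, scan), which runs a distributed Scan and returns (ImmutableBytesWritable, Result) pairs; table "t1" and the caching value are illustrative:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes

// A sketch: materialize table t1 as an RDD[(ImmutableBytesWritable, Result)],
// reusing the hbaseContext created earlier.
val scan = new Scan()
scan.setCaching(100) // rows fetched per RPC
val tableRdd = hbaseContext.hbaseRDD(TableName.valueOf("t1"), scan)
tableRdd.foreach(v => println(Bytes.toString(v._1.get()))) // print each row key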
What Improvements Have We Made?
▪ Combine Spark and HBase
  • Spark Catalyst Engine for Query Plan and Optimization
  • HBase for Fast Access KV Store
  • Implement Standard External Data Source with Built-in Filter
▪ High Performance
  • Data Locality: Move Computation to Data
  • Partition Pruning: Tasks only Performed in the RegionServers Holding the Requested Data
  • Column Pruning / Predicate Pushdown: Reduce Network Overhead
▪ Full-Fledged DataFrame Support
  • Spark-SQL
  • Integrated Language Query
▪ Run on Top of Existing HBase Tables
  • Native Support for Java Primitive Types
▪ Still some work and improvements to be done
  • HBASE-16638: Reduce the number of Connections
  • HBASE-14217: Add Java access to Spark bulk load functionality
DataFrame + HBase
WIP... 2.0?
Usage - Define the Catalog
HBase table to DataFrame catalog mapping:

def catalog = s"""{
    |"table":{"namespace":"default", "name":"table1"},
    |"rowkey":"key",
    |"columns":{
      |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
      |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
      |"col2":{"cf":"cf1", "col":"col2", "type":"string"}
    |}
  |}""".stripMargin
Usage – Write to HBase
Export an RDD into a new HBase table with the DataFrame API:

sc.parallelize(data).toDF.write
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
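The slide leaves `data` undefined; a hypothetical definition matching the catalog above (the case class name and the generated values are invented for illustration):

// Hypothetical record type: col0 maps to the row key,
// col1 to cf1:col1 (boolean), col2 to cf1:col2 (string).
case class Table1Record(col0: String, col1: Boolean, col2: String)

// Assumes import sqlContext.implicits._ so .toDF works on the slide above.
val data = (0 to 255).map { i =>
  Table1Record("row%03d".format(i), i % 2 == 0, "value" + i)
}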
Usage – Construct a DataFrame

Read an HBase table back as a DataFrame through the same catalog:

def withCatalog(cat: String): DataFrame = {
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
Usage - Language Integrated Query/SQL

Filter and project the table with the language-integrated query API:

val df = withCatalog(catalog)
val s = df.filter(
    (($"col0" <= "row050" && $"col0" > "row040") ||
      $"col0" === "row005" ||
      $"col0" === "row020" ||
      $"col0" === "r20" ||
      $"col0" <= "row005") &&
    ($"col2" === 1 ||
      $"col2" === 42))
  .select("col0", "col1", "col2")
s.show
Usage - Language Integrated Query/SQL

Query the same table with SQL:

// Load the dataframe
val df = withCatalog(catalog)
// SQL example
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show
Spark Streaming Example
[Pipeline diagram: Kafka Producer → Spark Streaming → HBase → SOLR]
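A minimal sketch of the Spark Streaming to HBase leg of the pipeline above, assuming the hbase-spark module's HBaseContext.streamBulkPut; the socket source stands in for the Kafka producer, and the table, column family, and "rowKey|value" message format are invented for illustration:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A sketch reusing sc and hbaseContext from the earlier slides.
val ssc = new StreamingContext(sc, Seconds(1))

// Stand-in for the Kafka direct stream: a DStream[String] of "rowKey|value" lines.
val messages = ssc.socketTextStream("localhost", 9999)

// streamBulkPut writes each micro-batch to HBase through the shared HConnection.
hbaseContext.streamBulkPut[String](messages,
  TableName.valueOf("t1"),
  (message: String) => {
    val tokens = message.split("\\|")
    val put = new Put(Bytes.toBytes(tokens(0)))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes(tokens(1)))
    put
  })

ssc.start()
ssc.awaitTermination()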