WattGo: real-time time-series analytics with Spark and Solr (French)
Smart Energy as a Service
Founded in 2011 by experts in data analytics, the utilities business and big data
A panel of French households equipped with meter sensors
A team of 18 people, with a core R&D team building load-curve disaggregation algorithms
RealTime Analytics Spark SolR Cassandra
[Architecture diagram. Labels: UI, Meteor, MicroServices; Spark ("Put your logic here"); ProtoBuf RPC; Cassandra; SolR - DSE Field Transformer; Triggers; Users (ProtoBufs stored as blobs); Sensors time series; Kafka (shown twice); CQL SolR Query]
Real-time analytics using DSE Search (SolR), Apache Spark and Cassandra triggers
Real-time aggregation on arbitrary groups based on customer metadata
Demo use case: real-time monitoring of energy consumption
CREATE TABLE cassandradays.queries (
  name text PRIMARY KEY,
  query text
) WITH ...;
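CassandraReadOnlyTrigger.scala:

// Imports are not shown on the slide; these are assumed
import java.nio.ByteBuffer
import java.util
import org.apache.cassandra.db.{ColumnFamily, Mutation}
import org.apache.cassandra.triggers.ITrigger
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global // assumed execution context

abstract class CassandraReadOnlyTrigger extends ITrigger {

  override def augment(key: ByteBuffer, mut: ColumnFamily): util.Collection[Mutation] = {
    handleTrigger(key, mut) // Non-blocking call
    null                    // Let C* proceed
  }

  // Runs the handler asynchronously so the write path is not delayed
  def handleTrigger(key: ByteBuffer, mut: ColumnFamily): Future[Unit] = Future {
    def handler: (MutationAccessor => Unit) = if (mut.isMarkedForDelete) delete else read
    handler(new MutationAccessor(key, mut))
  }

  def read(mut: MutationAccessor): Unit
  def delete(mut: MutationAccessor): Unit
}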
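
The hand-off pattern used by the trigger (schedule the handler, return at once) can be tried outside Cassandra. The following is a minimal sketch in plain Java, not the deck's implementation: the class `ReadOnlyTriggerSketch`, its `onMutation` method and the returned strings are hypothetical stand-ins, and the future is returned only so that a caller can observe the result, whereas the real trigger discards it and returns null.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the non-blocking hand-off; no Cassandra classes are involved.
final class ReadOnlyTriggerSketch {

    // Stand-in for augment(): schedules the handler on the common pool
    // and returns immediately, so the caller is never blocked by the handler.
    CompletableFuture<String> onMutation(String key, boolean markedForDelete) {
        return CompletableFuture.supplyAsync(
                () -> markedForDelete ? delete(key) : read(key));
    }

    // Stand-in for read(): called for upserts.
    String read(String key) {
        return "upsert:" + key;
    }

    // Stand-in for delete(): called for deletions.
    String delete(String key) {
        return "delete:" + key;
    }

    public static void main(String[] args) {
        ReadOnlyTriggerSketch trigger = new ReadOnlyTriggerSketch();
        // join() is used here only to make the asynchronous result visible.
        System.out.println(trigger.onMutation("my-query", false).join());
        System.out.println(trigger.onMutation("my-query", true).join());
    }
}
```

Routing on a single flag mirrors the `isMarkedForDelete` check: subclasses only decide what an upsert and a delete mean, not how they are dispatched.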
AggregatorTrigger.scala:

class AggregatorTrigger extends CassandraReadOnlyTrigger {
  // Netty boilerplate [...]
  val aggregatorService = AggregatorServiceRPC.Aggregatron.newStub(channel)

  override def read(mut: MutationAccessor): Unit = { // triggered on upserts
    val request = Aggregate.newBuilder()
    // Name of our aggregation
    request.setName(mut.getValue[String]("name"))
    // SolR query itself
    request.setQuery(mut.getValue[String]("query"))
    aggregatorService.registerNew(controller, request.build(), callback)
  }

  override def delete(mut: MutationAccessor): Unit = { // triggered on deletes
    val request = Aggregate.newBuilder()
    request.setName(mut.getValue[String]("name"))
    aggregatorService.delete(controller, request.build(), callback)
  }
}
MutationAccessor.scala:

class MutationAccessor(partitionKey: ByteBuffer, update: ColumnFamily) {

  trait ValueMapper[T] { def getValue: T }

  object ValueMapper {
    // Implicit views from a column name to a typed reader for that column
    implicit def stringMapper(name: String): ValueMapper[String] = makeMapper(name, UTF8Type.instance.compose)
    implicit def intMapper(name: String): ValueMapper[Int] = makeMapper(name, Int32Type.instance.compose)
    […]
    def makeMapper[T](name: String, f: ByteBuffer => T): ValueMapper[T] = {
      new ValueMapper[T] { def getValue = f(getBuffer(name)) }
    }
  }

  // Called as getValue[String]("name"): the column name is converted to the matching ValueMapper
  def getValue[T](implicit vm: ValueMapper[T]): T = vm.getValue
}
Load your trigger on each node:
$ nodetool reloadtriggers

No need to restart Cassandra:
INFO 13:37:00 Loading new jar /path/to/your/trigger/directory/AggregatorTrigger.jar

Bind it to your Cassandra table:
cqlsh:> CREATE TRIGGER aggregatorTrigger ON cassandradays.queries USING 'AggregatorTrigger';

Enjoy!
AggregatorRPCService.scala:

class AggregatorRPCService(val aggregationsHandler: ActorRef) extends Aggregatron {
  import AggregationsHandlerMessages._

  // Forward each RPC call to the actor that owns the aggregation registry
  override def registerNew(controller: RpcController, request: Aggregate,
                           done: RpcCallback[RegistrationResponse]): Unit = {
    aggregationsHandler ! UpdateEntry(request.getName, request.getQuery)
  }

  override def delete(controller: RpcController, request: Aggregate,
                      done: RpcCallback[DeletionResponse]): Unit = {
    aggregationsHandler ! DeleteEntry(request.getName)
  }
}
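AggregationsHandler.scala:

// Imports are not shown on the slide; these are assumed
import akka.actor.Actor
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf
import scala.collection.JavaConversions._ // lets the Java row iterator be mapped like a Scala one
import scala.collection.mutable

class AggregationsHandler(conf: SparkConf) extends Actor {
  import AggregationsHandlerMessages._

  val cassandra = CassandraConnector(conf)

  // Prepared once, then bound with a different SolR query for each aggregation
  val prepared = cassandra.withSessionDo(session => {
    session.prepare("SELECT * FROM cassandradays.users WHERE solr_query = ?")
  })

  AggregatorServiceEndPoint.start(new AggregatorRPCService(self), 7777)

  // Aggregation name -> WIDs of the users matched by its SolR query
  val aggregations = mutable.HashMap[String, Seq[String]]()

  def receive = {
    case GetAggregations => sender ! aggregations.toMap
    case UpdateEntry(name, query) =>
      val WIDs = cassandra.withSessionDo(session => {
        val bound = prepared.bind()
        bound.setString("solr_query", query)
        val i = session.execute(bound).iterator()
        i.map(_.getString("wid")).toSeq
      })
      aggregations += name -> WIDs
    case DeleteEntry(name) => aggregations.remove(name)
  }
}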
DemoCassandraDays.scala:

// Ask the handler actor for the current aggregation name -> WIDs map
def getAggregations(ip: String, port: Int): Map[String, Seq[String]] = {
  val fw = (aggregatesHandlerActor ? GetAggregations)(Timeout(5.seconds))
  Await.result(fw, 5.seconds).asInstanceOf[Map[String, Seq[String]]]
}

[...]
kafkaStream.flatMap(msg => {
  val dp = msg._2.parseJson.convertTo[RawDataPoint]
  // Emit one (aggregation name -> partial result) pair per aggregation that contains this sensor
  getAggregations(ip, port).flatMap(agg => {
    if (agg._2.contains(dp.key)) {
      Some(agg._1 -> OutputData(agg._1, dp.value, 1))
    } else {
      None
    }
  })
}).reduceByKeyAndWindow((a: OutputData, b: OutputData) => {
  // Merge partial results: add sums and counts
  OutputData(a.name, a.sum + b.sum, a.count + b.count)
}, Seconds(60), Seconds(3)).foreachRDD(rdd => { // 60-second window, sliding every 3 seconds
  rdd.collect().foreach { x =>
    val message = new ProducerRecord[String, String](outputTopic, null, x._2.toJson.toString())
    producer.send(message)
  }
})
[...]

Sample input data point:
{ "key" : "519888bdeabc888934000000", "ts" : 1434458546000, "value" : 147.3 }

Sample output aggregate:
{ "name" : "13", "sum" : 88760.0, "count" : 126 }
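
The fan-out and merge steps of the streaming job can be illustrated without Spark or Kafka. The following is a minimal sketch in plain Java under stated assumptions, not the deck's implementation: `AggregationSketch`, `Output` and `aggregate` are hypothetical names, the sensor keys and values in `main` are made up, and the data points passed in are assumed to belong to a single window (windowing itself is not modelled).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the fan-out + merge logic; no Spark involved.
final class AggregationSketch {

    // Mirrors the (name, sum, count) shape of the output message.
    static final class Output {
        final String name;
        final double sum;
        final long count;

        Output(String name, double sum, long count) {
            this.name = name;
            this.sum = sum;
            this.count = count;
        }

        // Same associative merge as the reduce function: add sums and counts.
        Output merge(Output other) {
            return new Output(name, sum + other.sum, count + other.count);
        }
    }

    // groups: aggregation name -> sensor keys matched by its query.
    // keys[i] / values[i]: the data points that fall inside one window.
    static Map<String, Output> aggregate(Map<String, List<String>> groups,
                                         String[] keys, double[] values) {
        Map<String, Output> result = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            for (Map.Entry<String, List<String>> group : groups.entrySet()) {
                // A data point contributes to every group that contains its sensor.
                if (group.getValue().contains(keys[i])) {
                    Output single = new Output(group.getKey(), values[i], 1);
                    result.merge(group.getKey(), single, Output::merge);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = new HashMap<>();
        groups.put("13", Arrays.asList("s1", "s2"));
        groups.put("all", Arrays.asList("s1", "s2", "s3"));

        Map<String, Output> out = aggregate(groups,
                new String[] {"s1", "s2", "s3"}, new double[] {100.0, 50.0, 25.0});
        for (Output o : out.values()) {
            System.out.println(o.name + " sum=" + o.sum + " count=" + o.count);
        }
    }
}
```

Keeping the count next to the sum lets a consumer derive an average per group without the job having to emit it.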
Deploy with (same syntax as the original spark-submit):
$ dse spark-submit target/scala-2.10/DemoCassandraDays.jar
CREATE TABLE cassandradays.users (
  wid text PRIMARY KEY,
  "protobuf:com.wattgo.users.User" blob
) WITH ...;
User.proto:

package com.wattgo.users;

option java_package = "com.wattgo.users";
option java_outer_classname = "WattGoUser";

import "Details.proto";

message User {
  required string wid = 1;
  optional Details details = 2;
}
[...]

Generate the WattGoUser Java class:
$ protoc --java_out=. User.proto Details.proto [...]

Generate a Protobuf DescriptorSet file for later use with the Protobuf reflection API:
$ protoc --include_imports --descriptor_set_out=User.desc User.proto
WattGoUser.class:

WattGoUserDetails.Details details = WattGoUserDetails.Details.newBuilder()
    .setEmail("[email protected]")
    .setFirstname("Denis")
    .setLastname("Ritchie")
    .setAddress(address)
    .build();

WattGoUser.User user = WattGoUser.User.newBuilder()
    .setWid("49b96edde3a1d5444f5cd145b7117144")
    .setDetails(details)
    .build();

// Serialized form, stored in the blob column
byte[] blob = user.toByteArray();

cqlsh:> SELECT * FROM users WHERE wid = '49b96edde3a1d5444f5cd145b7117144';

 wid                      | protobuf:com.wattgo.users.User
--------------------------+-----------------------------------
 51548164eabc884b2d00014f | 0x0a18353135343831363465616263...
"The FieldInputTransformer and FieldOutputTransformer classes must be extended to define a custom column-to-document field mapping [...]. FieldInputTransformer takes an inserted Cassandra column and modifies it prior to Solr indexing, while FieldOutputTransformer parses a Cassandra row just before returning the result of a Solr query."
Edward Ribeiro, DataStax
http://www.datastax.com/dev/blog/dse-field-transformers

FieldInputTransformer:
import com.datastax.bdp.search.solr.FieldInputTransformer;

FieldOutputTransformer:
import com.datastax.bdp.search.solr.FieldOutputTransformer;
SolR schema.xml:

<schema name="users" version="1.1">
  ...
  <fields>
    <field name="wid" type="string" indexed="true" stored="true"/>
    <field name="protobuf:com.wattgo.users.User" type="binary" indexed="true" stored="true"/>
    ...
    <field name="details.address.zipcode" type="string" indexed="true" stored="false"/>
    <field name="details.address.city" type="string" indexed="true" stored="false"/>
    <field name="details.address.country" type="string" indexed="true" stored="false"/>
    ...
  </fields>
  ...
  <uniqueKey>wid</uniqueKey>
</schema>
SolR solrconfig.xml:

<config>
  ...
  <fieldInputTransformer name="dse" class="com.wattgo.search.transformers.protobuf.InputTransformer">
  </fieldInputTransformer>
  <fieldOutputTransformer name="dse" class="com.wattgo.search.transformers.protobuf.OutputTransformer">
  </fieldOutputTransformer>
  ...
</config>
Start indexing your data:

Using dsetool:
$ dsetool create_core cassandradays.users schema=schema.xml solrconfig=solrconfig.xml

Using curl:
$ curl "http://localhost:8983/solr/resource/cassandradays.users/solrconfig.xml" \
    --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl "http://localhost:8983/solr/resource/cassandradays.users/schema.xml" \
    --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl -XPOST "http://localhost:8983/solr/admin/cores?action=CREATE&name=cassandradays.users" \
    -H 'Content-type:text; charset=utf-8'
Now our 'users' CQL table looks like this:

CREATE TABLE cassandradays.users (
  wid text PRIMARY KEY,
  "protobuf:com.wattgo.users.User" blob,
  solr_query text
) WITH ...;

CREATE CUSTOM INDEX cassandradays_users_protobufcomwattgousersuser_index
  ON cassandradays.users ("protobuf:com.wattgo.users.User")
  USING 'com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex';

CREATE CUSTOM INDEX cassandradays_users_solr_query_index
  ON cassandradays.users (solr_query)
  USING 'com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex';
public class InputTransformer extends FieldInputTransformer {

  public static final String prefix = "protobuf:";

  // Only columns named "protobuf:<message class>" are handled by this transformer
  @Override
  public boolean evaluate(String field) {
    return field.startsWith(prefix);
  }

  @Override
  public void addFieldToDocument(SolrCore core, IndexSchema schema, String key, Document doc,
                                 SchemaField fieldInfo, String fieldValue, float boost,
                                 DocumentHelper helper) throws IOException {
    // The message class name is carried in the column name, after the prefix
    String className = fieldInfo.getName().substring(prefix.length());
    Descriptor descriptor = ProtobufFileDescriptorSetParser.getDescriptor(className);

    // The blob arrives hex-encoded; decode it and parse it with the reflection API
    byte[] data = Hex.decodeHex(fieldValue.toCharArray());
    DynamicMessage message = DynamicMessage.parseFrom(descriptor, data);

    Map<String, ProtobufField> fields = ProtobufMessageParser.flattenFields(message, "");
    for (Map.Entry<String, ProtobufField> field : fields.entrySet()) {
      ProtobufField entry = field.getValue();
      Set<Object> values = entry.getValues();
      String fieldName = field.getKey();
      SchemaField fieldSchema = core.getLatestSchema().getFieldOrNull(fieldName);
      // Index every value of the flattened field under its own name
      for (Object value : values) {
        helper.addFieldToDocument(core, core.getLatestSchema(), key, doc, fieldSchema,
                                  value.toString(), boost);
      }
    }
  }
}
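
The slides do not show `ProtobufMessageParser.flattenFields`, but the schema's field names (`details.address.zipcode` and so on) suggest that nested messages are flattened into dotted paths. The following is a minimal sketch of that idea in plain Java, assuming nested `Map`s as a stand-in for nested Protobuf messages: `FlattenSketch` and `flatten` are hypothetical names, the sample values are made up, and repeated fields (the `Set<Object>` of values in the deck's code) are not modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: nested maps stand in for nested Protobuf messages.
final class FlattenSketch {

    // Walks the nested structure depth-first and records each leaf value
    // under its dotted path, e.g. "details.address.zipcode".
    static Map<String, Object> flatten(Map<String, Object> message, String prefix) {
        Map<String, Object> flat = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : message.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) value;
                flat.putAll(flatten(nested, path));
            } else {
                flat.put(path, value);
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("zipcode", "13001");
        address.put("country", "FR");

        Map<String, Object> details = new LinkedHashMap<>();
        details.put("address", address);

        Map<String, Object> user = new LinkedHashMap<>();
        user.put("wid", "w1");
        user.put("details", details);

        for (Map.Entry<String, Object> e : flatten(user, "").entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Each dotted path then only needs a matching `<field>` entry in schema.xml to become searchable.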
public class OutputTransformer extends FieldOutputTransformer {

  @Override
  public void binaryField(FieldInfo fieldInfo, byte[] value, StoredFieldVisitor visitor,
                          FieldOutputTransformer.DocumentHelper helper) throws IOException {
    String prefix = "protobuf:";

    // Binary fields that are not Protobuf blobs are passed through untouched
    if (!fieldInfo.name.startsWith(prefix)) {
      visitor.binaryField(fieldInfo, value);
      return;
    }

    String className = fieldInfo.name.substring(prefix.length());
    Descriptor descriptor = ProtobufFileDescriptorSetParser.getDescriptor(className);
    DynamicMessage message = DynamicMessage.parseFrom(descriptor, value);

    Map<String, ProtobufField> fields = ProtobufMessageParser.flattenFields(message, "");
    for (Map.Entry<String, ProtobufField> field : fields.entrySet()) {
      FieldInfo info = helper.getFieldInfo(field.getKey());
      ProtobufField current = field.getValue();
      FieldDescriptor fieldDescriptor = current.getFieldDescriptor();
      Set<Object> fieldValues = current.getValues();
      FieldDescriptor.JavaType type = fieldDescriptor.getJavaType();

      // Dispatch each value to the visitor method matching its Protobuf Java type
      for (Object fieldValue : fieldValues) {
        if (type == FieldDescriptor.JavaType.STRING)
          visitor.stringField(info, (String) fieldValue);
        else if (type == FieldDescriptor.JavaType.BYTE_STRING)
          visitor.binaryField(info, (byte[]) fieldValue);
        else if [...]
      }
    }
  }
}
cqlsh:> SELECT * FROM users WHERE solr_query = 'details.address.zipcode:13*';

 wid                      | protobuf:com.wattgo.users.User    | solr_query
--------------------------+-----------------------------------+------------
 51548164eabc884b2d00014f | 0x0a18353135343831363465616263... | null
 51264fb2eabc88610c00001f | 0x0a18353132363466623265616263... | null
 5199e9d0eabc88172c000001 | 0x0a18353139396539643065616263... | null
 5127a249eabc88641500001c | 0x0a18353132376132343965616263... | null
 51548164eabc884b2d000143 | 0x0a18353135343831363465616263... | null
[...]
Real-time monitoring of energy consumption in France