Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Today: Marc C. Hadfield, Founder, Vital AI http://vital.ai [email protected] 917.463.4776 MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Upload: vitalai

Post on 14-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Today:

Marc C. Hadfield, Founder, Vital AI http://vital.ai [email protected] 917.463.4776

MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Page 2: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

<intro>

Marc C. Hadfield, Founder, Vital AI http://vital.ai [email protected]

MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Quick Overview

Page 3: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

agenda

MetaQL Intro

Motivation

Domain Models (Schema)

MetaQL DSL

MetaQL Implementations

Examples

Page 4: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

MetaQL

Leverage Domain Model (Schema)

Compose Queries in Code: Typed

Execute Queries on Databases, Interchangeably

Minimize TCO: Separation of Concerns

Developer Efficiency

Query Framework

Executable JVM Code! (Groovy Closure)

Page 5: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

MetaQL Origin

Across many data-driven application implementations, a desire for:

Reusable Processes, Tools: Stop re-inventing the wheel.

Tools to manage “schema” across an application & organization.

Tools to combine Semantic Web, NoSQL, and Hadoop/Spark.

Team Collaboration: Human labor is usually the limiting factor.

Page 6: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

sample

Recipient

Sender EMail

hasRecipient

hasSender

Page 7: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

sample

Recipient

Sender EMail

hasRecipient

hasSender

ARC

ARC

Page 8: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

sample

Recipient

Sender EMail

hasRecipient

hasSender

notEqual

type:Person Address:[email protected]

type:Person

type:hasSender

type:hasRecipient

type:Email

Page 9: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

sample MetaQL graph query

GRAPH {
    value segments: ["mydata"]
    ARC {
        node_constraint { Email.class }
        constraint { "?person1 != ?person2" }
        ARC_AND {
            ARC {
                edge_constraint { Edge_hasSender.class }
                node_constraint { Person.props().emailAddress.equalTo("[email protected]") }
                node_constraint { Person.class }
                node_provides { "person1 = URI" }
            }
            ARC {
                edge_constraint { Edge_hasRecipient.class }
                node_constraint { Person.class }
                node_provides { "person2 = URI" }
            }
        }
    }
}

Page 10: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Internet of Things

Amazon Echo

Page 11: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Internet of Things

Coffee

Page 12: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Internet of Things:

Batch and Stream

Processing

Amazon Echo

Amazon Echo Service

haley-app webservice Vert.X

Vital Prime

Database

DataScript

Hadoop - HDFS

Apache Spark: Streaming, MLLIB, NLP, GraphX

Aspen Datawarehouse

Analytics Layer

Serving Layer

Haley Device Raspberry Pi

Voice to Text API

Cognitive Application

NLP and Inference to process User request.

Query Knowledge in DB

Streaming Prediction Models:

“Should I really have more Coffee?”

External APIs…

Page 13: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Demo Examples

Vital Prime

Database

Vert.X Vital-Vertx

JavaScript WebApp VitalService-JS

Prediction Models

DataScript

https://github.com/vital-ai/vital-examples

Page 14: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Demo Example

https://demos.vital.ai/enron-js-app/index.html

https://github.com/vital-ai/vital-examples/tree/master/enron-js-app

Page 15: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Demo Example

Page 16: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Demo Example

Page 17: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Demo Example

Recipient EMail hasRecipient

Page 18: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Cytoscape Plugin

https://github.com/vital-ai/vital-cytoscape

http://cytoscape.org/

Page 19: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Cytoscape Plugin

Page 20: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Cytoscape Plugin

Page 21: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Cytoscape Plugin

Page 22: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Cytoscape Plugin

Page 23: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Cytoscape Plugin: Wordnet Data, “wine, vino”

Page 24: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

where are we using MetaQL?

Financial Services Healthcare

Internet-of-Things Start-Ups, Recommendation Apps

Page 25: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

motivation for MetaQL

Page 26: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

application architecture

Batch and Stream

Processing

Web / Mobile Application

Application Server

Transactional Database

Hadoop - HDFS

Apache Spark: Streaming, MLLIB, GraphX

Analytics Layer

Serving Layer

Key/Value Cache

External APIs External API Services

Multiple Databases + Analytics +

External APIs

Page 27: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

enterprise application architecture

Dashboard

Application Server

Enterprise Datawarehouse

Data Silo Data Silo Data Silo Data Silo Data Silo ∞Many Many Many Data Models…

Page 28: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

volume, velocity, variety

polyglot persistence = multiple database technologies

…but we also have very many data models.

many databases, many data models, changing rapidly.

too many moving parts for a developer to reasonably manage! need fewer APIs to learn!

Page 29: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

what happens when changes occur?

Task

Infrastructure DevOps

Data Scientists

Business + Domain Experts

Developers

Roles

Page 30: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

what changes?

Data Model Changes New Data Sources

Infrastructure Change Switch Databases

New Prediction Models / Features New Service APIs…

Many Interdependencies…

Example: Change in the taxonomy of a categorization service breaks all the logic tied to the old categories.

Page 31: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

total cost of ownership

How much code changes when we modify our data model to include new sources?

How much changes when we switch database technologies?

How can we minimize those changes by decoupling dependencies?

Page 32: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Domain Model as “Contract”

Infrastructure DevOps

Data Scientists

Business + Domain Experts

Developers Domain Model

Everyone agrees on (or is at least aware of) the definition of Domain Concepts.

Use semantics to map “views”.

Page 33: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

MetaQL Abstraction

Infrastructure DevOps

Data Scientists

Business + Domain Experts

Developers Domain Model

MetaQL

Abstraction to give breathing room to Infrastructure.

Page 34: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Infrastructure / DevOps

Database Types:
• Key/Value
• Document
• RDF Graph
• NoSQL
• Relational
• Timeseries

ACID vs. BASE

Optimizing Query Generation

Tuning Secondary Indices

Update MetaQL DSL for new DB features

CAP Theorem

Page 35: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Domain Model (Schema)

Page 36: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Domain Model Implementation

Combine: SQL-style Schema with Hadoop Data Serialization Schema (Avro, Thrift, Protocol Buffers, Kryo, Parquet)

Add Semantics: the “Meaning” of objects

Not a table “person”, but define the concept of Person to be used throughout an application. The implementation decides how to store “Person” data in its database.

Page 37: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Domain Model Implementation

Domain Model definition resolves:
• RDF vs Property Graph model
• Object Relational Impedance Mismatch

Use OWL to capture Domain Model:
• SubClasses
• SubProperties
• Multiple Inheritance

Marginal technology performance gains are hugely outweighed by human productivity gains and a wider choice of tools.

Compromise across modeling paradigms.

Page 38: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Domain Model Implementation

Example: Healthcare Application:

URI<Person123> IS_A:
• Patient
• BillableAccount
• InsuredEntity

Same URI across three domain concepts: Diagnostics Records, Billing System, Insurance System.

Implementation Note: We generate code for the JVM using “traits” as a way to implement multiple inheritance (Groovy, Scala, Java8). The trait is used as a semantic marker to link to the Domain Model.
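The trait idea in the note above can be sketched in a few lines of Groovy. This is only an illustration: the trait names come from the healthcare example, while the properties (diagnoses, balanceCents, policyId), the URI value, and the helper closures are made up here and are not the code VitalSigns generates.

```groovy
// Hypothetical traits, one per domain concept from the healthcare example.
trait Patient         { List<String> diagnoses = [] }
trait BillableAccount { int balanceCents = 0 }
trait InsuredEntity   { String policyId }

// One class (one URI) carrying all three concepts via traits.
class Person123 implements Patient, BillableAccount, InsuredEntity {
    String URI
}

def p = new Person123(URI: 'urn:example:Person123', policyId: 'policy-1')
p.diagnoses << 'checkup'
p.balanceCents += 2500

// Code written against one concept accepts the shared object.
def billing = { BillableAccount account -> account.balanceCents }

// The traits act as markers: list which concepts an object belongs to.
def conceptsOf = { obj ->
    [Patient, BillableAccount, InsuredEntity].findAll { it.isInstance(obj) }*.simpleName
}

println "${p.URI} is a ${conceptsOf(p).join(', ')}"
println "balance: ${billing(p)}"
```

Each subsystem (diagnostics, billing, insurance) can type its code against a single trait while all three operate on the same object.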

Page 39: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Domain Model - Core Classes

Node Node Edge

HyperNode

HyperEdge

Properties: • URI • Primary Type • Types

Edges/HyperEdges: • Source URI • Destination URI

Edges: • Peer • Taxonomy

Class Instances contain Properties.

Page 40: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Protege OWL Editor

Page 41: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

VitalSigns: Domain Model Dev Kit

$ vitalsigns generate -o ./domain-ontology/enron-dataset-1.0.0.owl

$ ls domain-groovy-jar
enron-dataset-groovy-1.0.0.jar

$ ls domain-json-schema
enron-dataset-1.0.0.js

OWL can be compiled into JVM code statically (create an artifact for maven), or done dynamically at runtime.

Page 42: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Development with the Domain Model

Code Completion from Domain Model

Page 43: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Development with the Domain Model

VitalSigns vs = VitalSigns.get()

Musician john = new Musician().generateURI("john")
john.name = "John Lennon"
john.birthday = "October 9, 1940"^xsd.xdatetime("MMMM d, yyyy")

MusicGroup thebeatles = new MusicGroup().generateURI("thebeatles")
thebeatles.name = "The Beatles"

// try to assign the wrong property, throws an exception
try {
    thebeatles.birthday = "January 1, 1970"^xsd.xdatetime("MMMM d, yyyy")
} catch(Exception ex) { println ex } // no such property exception

vs.addToCache( thebeatles.addEdge_hasMember(john) )

// use cache to resolve queries
thebeatles.getMembers().each{ println it.name }

// use database to resolve queries
thebeatles.getMembers(ServiceWide).each{ println it.name }

Implicit MetaQL Queries

Page 44: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

VitalService API

• Open/Close Endpoint
• Create/Remove Segment
• Create/Read/Update/Delete Object
• Queries (MetaQL as input closure)
• Service Operations (MetaQL as input closure)
• callFunction (DataScript)
• init Transaction/Commit/Rollback

A “Segment” is a Database (container of objects)

Page 45: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

MetaQL

VitalSigns: Domain Model Manager • MetaQL DSL • Prediction Model DSL • Pipeline Transformation DSL (ETL)

(in development)

A tricky bit is finding the best way to express the DSL within the allowed grammar of the host language (Groovy). It’s an ongoing effort.
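To make the host-language point concrete, here is a minimal sketch of how a closure-based DSL can sit inside ordinary Groovy grammar using closure delegation. The SelectSpec class and this SELECT method are hypothetical stand-ins written for this example; they are not the MetaQL implementation.

```groovy
// Hypothetical spec object that records what the DSL block declares.
class SelectSpec {
    Map values = [:]
    List constraints = []

    // "value limit: 100" is plain Groovy for value([limit: 100])
    void value(Map args) { values.putAll(args) }

    // "constraint { ... }" is plain Groovy for constraint(aClosure)
    void constraint(Closure c) { constraints << c.call() }
}

// Runs the block with a SelectSpec as delegate, so the bare words
// "value" and "constraint" resolve to the methods above.
SelectSpec SELECT(Closure body) {
    def spec = new SelectSpec()
    body.delegate = spec
    body.resolveStrategy = Closure.DELEGATE_FIRST
    body.call()
    return spec
}

def spec = SELECT {
    value limit: 100
    value segments: ['mydata']
    constraint { 'name = John' }
}

println spec.values
println spec.constraints
```

The block is executable JVM code: nothing is parsed from a string, which is why every construct has to fit what Groovy's grammar already allows.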

Page 46: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Query Types

AGGREGATION

PATH

GRAPH

SELECT

Page 47: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Query Elements

• constraints: node_constraint, edge_constraint, …
• comparators (equalTo, greaterThan, …)
• provides, ?reference
• AND, OR
• OPTIONAL
• Sort Criteria

Page 48: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

SELECT query

SELECT {
    value limit: 100
    value offset: 0
    value segments: ["mydata"]

    constraint { Person.class }
    constraint { Person.props().name.equalTo("John") }
}

Page 49: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

GRAPH query

GRAPH {
    value segments: ["mydata"]
    ARC {
        node_constraint { Email.class }
        constraint { "?person1 != ?person2" }
        ARC_AND {
            ARC {
                edge_constraint { Edge_hasSender.class }
                node_constraint { Person.props().emailAddress.equalTo("[email protected]") }
                node_constraint { Person.class }
                node_provides { "person1 = URI" }
            }
            ARC {
                edge_constraint { Edge_hasRecipient.class }
                node_constraint { Person.class }
                node_provides { "person2 = URI" }
            }
        }
    }
}

Page 50: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

GRAPH query (2)

GRAPH {
    value segments: [VitalSegment.withId('wordnet')]
    value inlineObjects: true

    ARC {
        node_bind { "node1" }
        node_constraint { SynsetNode.expandSubclasses(true) }
        node_constraint { SynsetNode.props().name.contains_i("happy") }

        ARC {
            edge_bind { "edge" }
            node_bind { "node2" }
        }
    }
}

Code iterating over Results can use bind names to reference objects in each solution: node1, edge, node2.

<—- inline objects

Page 51: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

PATH query

def forward = true
def reverse = false

PATH {
    value segments: segments
    value maxdepth: 5
    value rootURIs: [URIProperty.withString(inputURI)]

    if( forward ) {
        ARC {
            value direction: 'forward'
            // accept any edge: edge_constraint { }
            // accept any node: node_constraint { }
        }
    }
    if( reverse ) {
        ARC {
            value direction: 'reverse'
            // accept any edge: edge_constraint { }
            // accept any node: node_constraint { }
        }
    }
}
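One way to read the PATH query is as a breadth-first expansion from the root URIs, bounded by maxdepth, following edges forward, in reverse, or both. The sketch below shows that reading over a made-up in-memory edge list; the expandPath closure and the sample data are assumptions for illustration, and a real backend may evaluate paths differently.

```groovy
// Hypothetical edge list: each entry is [sourceURI, destinationURI].
def edges = [['a', 'b'], ['b', 'c'], ['c', 'd'], ['x', 'a']]

// Breadth-first expansion from the roots, at most maxdepth hops.
def expandPath = { List rootURIs, int maxdepth, boolean forward, boolean reverse ->
    def visited = new LinkedHashSet(rootURIs)
    def frontier = rootURIs
    for (int depth = 0; depth < maxdepth && frontier; depth++) {
        def next = []
        frontier.each { uri ->
            // forward: follow edges whose source is the current node
            if (forward) next.addAll(edges.findAll { it[0] == uri }.collect { it[1] })
            // reverse: follow edges whose destination is the current node
            if (reverse) next.addAll(edges.findAll { it[1] == uri }.collect { it[0] })
        }
        // keep only nodes not seen before; add() returns false for repeats
        frontier = next.findAll { visited.add(it) }
    }
    visited as List
}

println expandPath(['a'], 5, true, false)
println expandPath(['a'], 1, true, true)
```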

Page 52: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

AGGREGATION query

SUM Product.props().cost

AVERAGE Person.props().birthday

COUNT_DISTINCT Document.props().active

FIRST { DISTINCT Document.props().title, expandProperty : false, order: Order.ASC }

Part of a SELECT query
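As a rough guide to what these aggregation functions compute, the sketch below applies the same operations to a small in-memory list with plain Groovy collection methods. The sample records and variable names are invented for illustration; a real backend would compute these values inside the database.

```groovy
// Hypothetical records standing in for the objects a SELECT would match.
def products = [
    [title: 'tea',    cost: 3, active: true],
    [title: 'coffee', cost: 5, active: true],
    [title: 'coffee', cost: 4, active: false]
]

def sum           = products.sum { it.cost }                  // SUM of cost
def average       = sum / products.size()                     // AVERAGE of cost
def countDistinct = products*.active.unique().size()          // COUNT_DISTINCT of active
def first         = products*.title.unique().sort().first()   // FIRST of DISTINCT title, ascending

println "sum=${sum} average=${average} countDistinct=${countDistinct} first=${first}"
```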

Page 53: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Service Operations DSL

Insert

Update

Delete

Page 54: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Service Operations

INSERT {
    value segment: 'testing'

    insert(MusicGroup.class, provides: "thebeatles") { MusicGroup thebeatles ->
        thebeatles.name = "The Beatles"
        thebeatles.URI = "thebeatles"
    }

    insert(Musician.class, provides: "john") { Musician john ->
        john.name = "John"
        john.URI = "john"
    }

    insert(Edge_hasMember) { Edge_hasMember member ->
        member.sourceURI = ref("thebeatles").toString()
        member.destinationURI = ref("john").toString()
        member.URI = "edge1"
    }
}

<— Using “provides” values

Page 55: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Transactions

Implemented at the service level:

def xid = service.startTransaction()
service.save(xid, person123)
service.commitTransaction(xid)

Page 56: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

MetaQL Implementations

MetaQL

ExecutableQuery

Query Generator
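The query generator step can be pictured as one backend-neutral query description with a generator per target language. The sketch below is an assumption-laden toy: SelectQuery, QueryGenerator, SparqlGenerator, and SqlGenerator are hypothetical names, the SQL assumes a per-segment (s, p, o) table, and no value escaping is done. It is not the Vital AI generator.

```groovy
// Hypothetical backend-neutral description of a simple SELECT.
class SelectQuery {
    String segment
    String type
    Map<String, String> equalTo = [:]
    int limit = 100
}

interface QueryGenerator { String generate(SelectQuery q) }

// Emits one triple pattern for the type and one per property constraint.
class SparqlGenerator implements QueryGenerator {
    String generate(SelectQuery q) {
        def patterns = ["?s a <${q.type}> ."]
        q.equalTo.each { prop, val -> patterns << "?s <${prop}> \"${val}\" ." }
        "SELECT ?s FROM <${q.segment}> WHERE { ${patterns.join(' ')} } LIMIT ${q.limit}"
    }
}

// Emits one self-join on an assumed (s, p, o) table per property constraint.
class SqlGenerator implements QueryGenerator {
    String generate(SelectQuery q) {
        def from = ["${q.segment} t0"]
        def where = ["t0.p = 'type'", "t0.o = '${q.type}'"]
        q.equalTo.eachWithIndex { prop, val, i ->
            def t = "t${i + 1}"
            from << "${q.segment} ${t}"
            where << "${t}.s = t0.s" << "${t}.p = '${prop}'" << "${t}.o = '${val}'"
        }
        "SELECT t0.s FROM ${from.join(', ')} WHERE ${where.join(' AND ')} LIMIT ${q.limit}"
    }
}

def q = new SelectQuery(segment: 'mydata', type: 'Person', equalTo: [name: 'John'])
def generators = [sparql: new SparqlGenerator(), sql: new SqlGenerator()]
generators.each { name, g -> println "${name}: ${g.generate(q)}" }
```

Application code only ever builds the SelectQuery; switching databases means swapping the generator, which is the separation of concerns the deck argues for.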

Page 57: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Sparql/RDF Implementation

G S P O

Quad Store

Franz Allegrograph

Page 58: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Sparql/RDF Implementation

VitalGraphQuery q = builder.query {
    GRAPH {
        value segments: ["documents"]
        ARC {
            node_constraint { Person.class }
            node_constraint { Person.props().emailID.equalTo("[email protected]") }

            ARC {
                node_constraint { EMailMessage.class }
                edge_constraint { Edge_hasEMailMessage.class }
            }
        }
    }
}.toQuery()

println "Query: " + q.toSparql()

Page 59: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Sparql/RDF Implementation

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX vital-core: <http://vital.ai/ontology/vital-core#>
PREFIX p0: <http://vital.ai/ontology/enron-emails#>

SELECT DISTINCT ?s1 ?d2 ?e2
FROM <segment:customer__app__documents>
WHERE {
  {
    ?s1 p0:hasEmailID ?value1 .
    ?s1 rdf:type ?value2 .
    FILTER ( ?value2 = p0:Person && ?value1 = "[email protected]"^^xsd:string )
    {
      ?d2 rdf:type ?value3 .
      ?e2 rdf:type ?value4 .
      FILTER ( ?value3 = p0:EMailMessage && ?value4 = p0:Edge_hasEMailMessage )
      ?e2 vital-core:hasEdgeSource ?s1 .
      ?e2 vital-core:hasEdgeDestination ?d2 .
    }
  }
}

Page 60: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Spark-SQL / Dataframe

URI P V

Segment RDD Property RDD

K V

Experimenting with: new Dataframe Optimizer: Catalyst, new Dataframe DSL for query generation, and using GraphX for isolated Graph Query cases

Generate “Bad” queries, with optimizer fixing them and Spark partitioning RDDs, as long as Spark is aware of Schema.

Page 61: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Key/Value Implementation

K V

URI —> Serialized Object
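The key/value layout amounts to storing each object, serialized, under its URI. A minimal sketch follows, using an in-memory map as a stand-in for the store and JSON as the serialization; both choices, and the save/load closure names, are assumptions made for this example.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Hypothetical in-memory stand-in for a key/value database.
def store = [:]

// Key = object URI, value = the whole object serialized to a JSON string.
def save = { Map obj -> store[obj.URI] = JsonOutput.toJson(obj) }

// Look up by URI and deserialize; null when the URI is unknown.
def load = { String uri -> store[uri] ? new JsonSlurper().parseText(store[uri]) : null }

save([URI: 'urn:example:john', type: 'Musician', name: 'John Lennon'])
def john = load('urn:example:john')
println john.name
```

Lookups by URI are cheap in this layout, while any query over property values needs extra structure such as secondary indices.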

Page 62: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Lucene/SOLR Implementation

[Table: rows keyed by DocID (1, 2, 3), with columns P1 to P4 holding property values V1 to V4]

Inverted Index of Property Values…
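An inverted index of property values maps each property/value pair to the set of documents containing it, so a conjunction of constraints becomes an intersection of those sets. The sketch below illustrates the idea in memory; the documents, the "property=value" key format, and the lookup closure are invented for the example and do not reflect how Lucene or SOLR store their indices internally.

```groovy
// Hypothetical documents: DocID -> property values.
def docs = [
    1: [name: 'John', city: 'Liverpool'],
    2: [name: 'Paul', city: 'Liverpool'],
    3: [name: 'John', city: 'London']
]

// Inverted index: "property=value" -> sorted set of DocIDs with that value.
def index = [:].withDefault { new TreeSet() }
docs.each { id, props ->
    props.each { p, v -> index["${p}=${v}".toString()] << id }
}

// A conjunction of equality constraints is an intersection of posting sets.
def lookup = { Map constraints ->
    def postings = constraints.collect { p, v -> index["${p}=${v}".toString()] }
    postings.inject { a, b -> a.findAll { it in b } } as List
}

println lookup([name: 'John'])
println lookup([name: 'John', city: 'Liverpool'])
```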

Page 63: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

NoSQL BigTable Implementation

DynamoDB (HBase, Cassandra, Accumulo, …)

[Table: rows keyed by ROWID (1, 2, 3), with columns C1 to C4 holding K=V pairs]

Per Segment object table + Secondary Indices

[Table: URI, P, V]

Per Segment property table + Secondary Indices

Page 64: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

SQL Implementation

SQL, Hive-SQL, Redshift, …

G S P O

Per Segment Table

with Partitioning (Hive)
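Read as a table layout, G S P O suggests one row per property value: G for the segment, S for the object URI, P for the property, O for the value. The sketch below flattens an object into such rows and rebuilds it; the toQuads and fromQuads closures and the sample object are hypothetical, and the actual table schema may differ.

```groovy
// Flatten one object into (g, s, p, o) rows, one row per property.
def toQuads = { String segment, Map obj ->
    def props = obj.findAll { k, v -> k != 'URI' }
    props.collect { p, o -> [g: segment, s: obj.URI, p: p, o: o] }
}

// Rebuild objects by grouping rows on the subject URI.
def fromQuads = { List rows ->
    rows.groupBy { it.s }.collect { uri, quads ->
        [URI: uri] + quads.collectEntries { [(it.p): it.o] }
    }
}

def john = [URI: 'urn:example:john', type: 'Musician', name: 'John']
def rows = toQuads('mydata', john)
rows.each { println it }
println fromQuads(rows)
```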

Page 65: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

implementation

DSL Documentation to be posted: http://www.metaql.org/

VitalSigns, VitalService, MetaQL https://dashboard.vital.ai/

Vital AI github: https://github.com/vital-ai/ Sample Code

Spark Code: Aspen, Aspen-Datawarehouse

Documentation Coming!

Page 66: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

closing thoughts

Context: Data-Driven Applications / Cognitive Applications:

Separation of Concerns yields the Agility needed to keep up with rapidly evolving Data.

“Domain Model as Contract” provides a framework for consistent interpretation of Data across an application.

MetaQL provides a framework for the consistent access and query of Data across an application.

Page 67: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Thank You!

Marc C. Hadfield, Founder, Vital AI http://vital.ai [email protected] 917.463.4776

Page 68: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Pipeline DSL (ETL)

PIPELINE { // Workflow
    PIPE { // a Workflow Component with dependencies
        TRANSFORM { // Joins across Datasets
            IF (RULE { } ) // Boolean, Query, Construct, …
            THEN { RULE { } }
            ELSE { RULE { } }
        }
        PIPE { … } // dependent PIPE
    } // Output Dataset

    PIPE { … }
}

Influenced by Spark Pipeline and Google Dataflow Pipeline

Page 69: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Schema Upgrade/Downgrade

UPGRADE {
    upgrade(oldClass: OLD_Person.class, newClass: NEW_Person.class) { person_old, person_new ->
        person_new.newName = person_old.oldName
    }
}

DOWNGRADE {
    downgrade(newClass: NEW_Person.class, oldClass: OLD_Person.class) { person_new, person_old ->
        person_old.oldName = person_new.newName
    }
}
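The upgrade rule pairs an old class with a new class plus a closure that copies data across. A minimal sketch of how such rules could be registered and applied is below; the Migrator class and its apply method are hypothetical helpers written for this example, and only the OLD_Person / NEW_Person names and the mapping come from the slide.

```groovy
class OLD_Person { String oldName }
class NEW_Person { String newName }

// Hypothetical registry of upgrade steps.
class Migrator {
    List steps = []

    // Named arguments arrive as a Map; the trailing closure is the mapping.
    void upgrade(Map classes, Closure mapping) {
        steps << [from: classes.oldClass, to: classes.newClass, mapping: mapping]
    }

    // Find the step matching the object's class and run its mapping.
    def apply(Object old) {
        def step = steps.find { it.from.isInstance(old) }
        if (!step) return old
        def fresh = step.to.newInstance()
        step.mapping.call(old, fresh)
        fresh
    }
}

def m = new Migrator()
m.upgrade(oldClass: OLD_Person, newClass: NEW_Person) { person_old, person_new ->
    person_new.newName = person_old.oldName
}

def upgraded = m.apply(new OLD_Person(oldName: 'John'))
println "${upgraded.getClass().simpleName}: ${upgraded.newName}"
```

A downgrade works the same way with the class roles reversed.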

Page 70: Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Multiple Endpoints

def service1 = VitalService.getService(profile: "kv-users")
def service2 = VitalService.getService(profile: "posts-db")
def service3 = VitalService.getService(profile: "friendgraph-db")

// given user URI:[email protected]

// get user object from service1

// find friends of user in friendgraph via service3

// find posts of friends in posts-db

// update service1 with cache of user-to-friends-postings

// send postings of friends to user in UI
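The commented steps above can be walked through with in-memory stand-ins for the three endpoints. Everything in this sketch (the maps, the URIs, the friendPosts field) is invented for illustration and makes no calls to VitalService; it only shows the data flow between a key/value store, a graph store, and a document store.

```groovy
// Hypothetical stand-ins for the three endpoints.
def users = ['user:alice': [name: 'Alice', friendPosts: []]]            // kv-users
def posts = ['user:bob': ['Hello from Bob'], 'user:carol': ['Hi!']]     // posts-db
def friendEdges = [['user:alice', 'user:bob'],
                   ['user:alice', 'user:carol']]                        // friendgraph-db

def userURI = 'user:alice'

// get user object from the key/value store
def user = users[userURI]

// find friends of the user in the friend graph
def friends = friendEdges.findAll { it[0] == userURI }.collect { it[1] }

// find posts of those friends in the posts store
def friendPosts = friends.collectMany { posts[it] ?: [] }

// update the key/value store with the cached user-to-friends-postings
user.friendPosts = friendPosts

// send postings of friends to the user (here: just print them)
friendPosts.each { println it }
```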