Vital AI MetaQL: Queries Across NoSQL, SQL, SPARQL, and Spark
TRANSCRIPT
Today:
Marc C. Hadfield, Founder, Vital AI
http://vital.ai
[email protected]
917.463.4776

MetaQL: Queries Across NoSQL, SQL, SPARQL, and Spark
Quick Overview
agenda
MetaQL Intro
Motivation
Domain Models (Schema)
MetaQL DSL
MetaQL Implementations
Examples
MetaQL
Leverage Domain Model (Schema)
Compose Queries in Code: Typed
Execute Queries on Databases, Interchangeably
Minimize TCO: Separation of Concerns
Developer Efficiency
Query Framework
Executable JVM Code! (Groovy Closure)
MetaQL Origin
Across many data-driven application implementations, a desire for:
Reusable Processes, Tools: Stop re-inventing the wheel.
Tools to manage “schema” across an application & organization.
Tools to combine Semantic Web, NOSQL, and Hadoop/Spark.
Team Collaboration: Human Labor is usually limiting factor.
sample
[Graph diagram, repeated across several slide builds: an Email node with hasSender and hasRecipient ARCs connecting to Person nodes (Sender, Recipient); the two Person nodes are constrained notEqual. Instance data: type:Email, type:hasSender, type:hasRecipient, type:Person, Address: [email protected]]
sample MetaQL graph query

GRAPH {
  value segments: ["mydata"]
  ARC {
    node_constraint { Email.class }
    constraint { "?person1 != ?person2" }
    ARC_AND {
      ARC {
        edge_constraint { Edge_hasSender.class }
        node_constraint { Person.props().emailAddress.equalTo("[email protected]") }
        node_constraint { Person.class }
        node_provides { "person1 = URI" }
      }
      ARC {
        edge_constraint { Edge_hasRecipient.class }
        node_constraint { Person.class }
        node_provides { "person2 = URI" }
      }
    }
  }
}
Internet of Things: Batch and Stream Processing

[Architecture diagram components: Amazon Echo, Amazon Echo Service, haley-app webservice (Vert.X), Vital Prime Database, DataScript, Hadoop HDFS, Apache Spark (Streaming, MLlib, NLP, GraphX), Aspen Datawarehouse, Analytics Layer, Serving Layer, Haley Device (Raspberry Pi), Voice to Text API]
Cognitive Application
NLP and Inference to process User request.
Query Knowledge in DB
Streaming Prediction Models:
“Should I really have more Coffee?”
External APIs…
Demo Examples

[Diagram components: JavaScript WebApp (VitalService-JS), Vert.X (Vital-Vertx), Vital Prime Database, Prediction Models, DataScript]

https://github.com/vital-ai/vital-examples
Demo Example
https://demos.vital.ai/enron-js-app/index.html
https://github.com/vital-ai/vital-examples/tree/master/enron-js-app
Demo Example
[Screenshots: Enron email graph showing Recipient, EMail, and hasRecipient elements]
Cytoscape Plugin
https://github.com/vital-ai/vital-cytoscape
http://cytoscape.org/
Cytoscape Plugin: Wordnet Data, “wine, vino”
where are we using MetaQL?
Financial Services
Healthcare
Internet-of-Things
Start-Ups, Recommendation Apps
motivation for MetaQL
application architecture
[Architecture diagram components: Web / Mobile Application, Application Server, Transactional Database, Key/Value Cache, Batch and Stream Processing, Hadoop HDFS, Apache Spark (Streaming, MLlib, GraphX), Analytics Layer, Serving Layer, External API Services]

Multiple Databases + Analytics + External APIs
enterprise application architecture
[Diagram components: Dashboard, Application Server, Enterprise Datawarehouse, many Data Silos]

Data Silos × ∞: many, many, many Data Models…
volume, velocity, variety
polyglot persistence = multiple database technologies
…but we also have very many data models.
many databases, many data models, changing rapidly.
too many moving parts for a developer to reasonably manage! need fewer APIs to learn!
what happens when changes occur?
[Diagram of Roles and Tasks: Infrastructure / DevOps, Data Scientists, Business + Domain Experts, Developers]
what changes?
Data Model Changes / New Data Sources
Infrastructure Change / Switch Databases
New Prediction Models and Features / New Service APIs…
Many Interdependencies…
Example: Change in the taxonomy of a categorization service breaks all the logic tied to the old categories.
total cost of ownership
How much code changes when we modify our data model to include new sources? When we switch database technologies?
How do we minimize change by decoupling dependencies?
Domain Model as “Contract”
[Diagram: the Domain Model at the center, shared by Infrastructure / DevOps, Data Scientists, Business + Domain Experts, and Developers]

Everyone agrees on (or is at least aware of) the definitions of Domain Concepts.
Use semantics to map “views”.
MetaQL Abstraction
[Diagram: MetaQL sits between the Domain Model and the roles: Infrastructure / DevOps, Data Scientists, Business + Domain Experts, Developers]

An abstraction to give breathing room to Infrastructure.
Infrastructure / DevOps
Database Types:
• Key/Value
• Document
• RDF Graph
• NOSQL
• Relational
• Timeseries
ACID vs. BASE
Optimizing Query Generation
Tuning Secondary Indices
Update MetaQL DSL for new DB features
CAP Theorem
Domain Model (Schema)
Domain Model Implementation
Combine SQL-style schema with Hadoop data-serialization schema (Avro, Thrift, Protocol Buffers, Kryo, Parquet), and add semantics: the “meaning” of objects.
Not a table “person”, but define the concept of Person to be used throughout an application. The implementation decides how to store “Person” data in its database.
Domain Model Implementation
Domain Model definition resolves:
• RDF vs. Property Graph model
• Object-Relational Impedance Mismatch

Use OWL to capture the Domain Model:
• SubClasses
• SubProperties
• Multiple Inheritance

Marginal technology performance gains are hugely outweighed by human productivity gains and a wider choice of tools.
Compromise across modeling paradigms.
Domain Model Implementation
Example: Healthcare Application:
URI <Person123> IS_A:
• Patient
• BillableAccount
• InsuredEntity
Same URI across three domain concepts: Diagnostic Records, Billing System, Insurance System.
Implementation Note: We generate code for the JVM using “traits” as a way to implement multiple inheritance (Groovy, Scala, Java8). The trait is used as a semantic marker to link to the Domain Model.
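The trait idea from the note above can be sketched in plain Java 8 (one of the JVM targets the slide names) using interfaces with default methods; all class and method names here are hypothetical illustrations, not the generated VitalSigns code:

```java
// Hypothetical sketch: marker interfaces with default methods approximate
// "trait as semantic marker", letting one URI-identified entity play
// several domain roles (Patient, BillableAccount, InsuredEntity).
interface Patient {
    default String role() { return "Patient"; }
}
interface BillableAccount {
    default String billingId() { return "BILL-123"; }
}
interface InsuredEntity {
    default boolean insured() { return true; }
}

// One entity, one URI, three domain concepts.
class PersonEntity implements Patient, BillableAccount, InsuredEntity {
    final String uri;
    PersonEntity(String uri) { this.uri = uri; }
}

public class TraitSketch {
    public static void main(String[] args) {
        PersonEntity p = new PersonEntity("urn:Person123");
        System.out.println(p.uri + " insured=" + p.insured() + " role=" + p.role());
    }
}
```

The same shape works with Groovy or Scala traits; the interface type doubles as the link back to the corresponding Domain Model concept.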
Domain Model - Core Classes
Node, Edge, HyperNode, HyperEdge

Properties:
• URI
• Primary Type
• Types

Edges/HyperEdges:
• Source URI
• Destination URI

Edges:
• Peer
• Taxonomy
Class Instances contain Properties.
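The core-class shape above can be sketched in a few lines of Java (field names are assumptions for illustration, not the actual VitalSigns types):

```java
// Minimal sketch of the core model: every instance carries a URI plus type
// info and a bag of properties; edges add source/destination URIs.
import java.util.HashMap;
import java.util.Map;

class GraphObject {
    String uri;                       // unique identifier
    String primaryType;               // most specific type
    final Map<String, Object> properties = new HashMap<>();
}

class Node extends GraphObject { }

class Edge extends GraphObject {
    String sourceURI;                 // URI of the source Node
    String destinationURI;            // URI of the destination Node
}

public class CoreModelSketch {
    public static void main(String[] args) {
        Node john = new Node();
        john.uri = "urn:john";
        john.primaryType = "Musician";
        john.properties.put("name", "John Lennon");

        Node band = new Node();
        band.uri = "urn:thebeatles";
        band.primaryType = "MusicGroup";

        Edge hasMember = new Edge();
        hasMember.uri = "urn:edge1";
        hasMember.sourceURI = band.uri;
        hasMember.destinationURI = john.uri;

        System.out.println(hasMember.sourceURI + " -> " + hasMember.destinationURI);
    }
}
```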
Protege OWL Editor
VitalSigns: Domain Model Dev Kit
$ vitalsigns generate -o ./domain-ontology/enron-dataset-1.0.0.owl

$ ls domain-groovy-jar
enron-dataset-groovy-1.0.0.jar

$ ls domain-json-schema
enron-dataset-1.0.0.js
OWL can be compiled into JVM code statically (creating an artifact for Maven) or dynamically at runtime.
Development with the Domain Model
Code Completion from Domain Model
Development with the Domain Model

VitalSigns vs = VitalSigns.get()

Musician john = new Musician().generateURI("john")
john.name = "John Lennon"
john.birthday = "October 9, 1940"^xsd.xdatetime("MMMM d, yyyy")

MusicGroup thebeatles = new MusicGroup().generateURI("thebeatles")
thebeatles.name = "The Beatles"

// try to assign the wrong property, throws an exception
try {
  thebeatles.birthday = "January 1, 1970"^xsd.xdatetime("MMMM d, yyyy")
} catch(Exception ex) { println ex } // no such property exception

vs.addToCache( thebeatles.addEdge_hasMember(john) )

// use cache to resolve queries
thebeatles.getMembers().each{ println it.name }

// use database to resolve queries
thebeatles.getMembers(ServiceWide).each{ println it.name }
Implicit MetaQL Queries
VitalService API
• Open/Close Endpoint
• Create/Remove Segment
• Create/Read/Update/Delete Object
• Queries (MetaQL as input closure)
• Service Operations (MetaQL as input closure)
• callFunction (DataScript)
• init Transaction/Commit/Rollback
A “Segment” is a Database (container of objects)
MetaQL
VitalSigns: Domain Model Manager
• MetaQL DSL
• Prediction Model DSL
• Pipeline Transformation DSL (ETL) (in development)

A tricky bit is finding the best way to express the DSL within the allowed grammar of the host language (Groovy). It’s an ongoing effort.
Query Types
AGGREGATION
PATH
GRAPH
SELECT
Query Elements
• constraints: node_constraint, edge_constraint, …
• comparators (equalTo, greaterThan, …)
• provides, ?reference
• AND, OR
• OPTIONAL
• Sort Criteria
SELECT query
SELECT {
  value limit: 100
  value offset: 0
  value segments: ["mydata"]
  constraint { Person.class }
  constraint { Person.props().name.equalTo("John") }
}
GRAPH query
GRAPH {
  value segments: ["mydata"]
  ARC {
    node_constraint { Email.class }
    constraint { "?person1 != ?person2" }
    ARC_AND {
      ARC {
        edge_constraint { Edge_hasSender.class }
        node_constraint { Person.props().emailAddress.equalTo("[email protected]") }
        node_constraint { Person.class }
        node_provides { "person1 = URI" }
      }
      ARC {
        edge_constraint { Edge_hasRecipient.class }
        node_constraint { Person.class }
        node_provides { "person2 = URI" }
      }
    }
  }
}
GRAPH query (2)
GRAPH {
  value segments: [VitalSegment.withId('wordnet')]
  value inlineObjects: true // <— inline objects
  ARC {
    node_bind { "node1" }
    node_constraint { SynsetNode.expandSubclasses(true) }
    node_constraint { SynsetNode.props().name.contains_i("happy") }
    ARC {
      edge_bind { "edge" }
      node_bind { "node2" }
    }
  }
}

Code iterating over the results can use bind names to reference objects in each solution: node1, edge, node2.
PATH query
def forward = true
def reverse = false

PATH {
  value segments: segments
  value maxdepth: 5
  value rootURIs: [URIProperty.withString(inputURI)]
  if( forward ) {
    ARC {
      value direction: 'forward'
      // accept any edge: edge_constraint { }
      // accept any node: node_constraint { }
    }
  }
  if( reverse ) {
    ARC {
      value direction: 'reverse'
      // accept any edge: edge_constraint { }
      // accept any node: node_constraint { }
    }
  }
}
AGGREGATION query
SUM Product.props().cost
AVERAGE Person.props().birthday
COUNT_DISTINCT Document.props().active
FIRST { DISTINCT Document.props().title, expandProperty : false, order: Order.ASC }
Part of a SELECT query
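What the aggregation functions above compute can be illustrated with plain Java streams over a small result set (the data and class are illustrative, not the MetaQL engine):

```java
// Illustrative semantics of SUM, AVERAGE, COUNT_DISTINCT, and FIRST (ASC)
// over a list of property values, using java.util.stream.
import java.util.Arrays;
import java.util.List;

public class AggregationSketch {
    public static void main(String[] args) {
        List<Double> costs = Arrays.asList(9.99, 5.00, 9.99);

        // SUM Product.props().cost
        double sum = costs.stream().mapToDouble(Double::doubleValue).sum();
        // AVERAGE over the same values
        double avg = costs.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        // COUNT_DISTINCT: number of distinct values
        long countDistinct = costs.stream().distinct().count();
        // FIRST with ascending order: smallest value
        double first = costs.stream().sorted().findFirst().orElse(0.0);

        System.out.println(sum + " " + avg + " " + countDistinct + " " + first);
    }
}
```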
Service Operations DSL
Insert
Update
Delete
Service Operations
INSERT {
  value segment: 'testing'
  insert(MusicGroup.class, provides: "thebeatles") { MusicGroup thebeatles ->
    thebeatles.name = "The Beatles"
    thebeatles.URI = "thebeatles"
  }
  insert(Musician.class, provides: "john") { Musician john ->
    john.name = "John"
    john.URI = "john"
  }
  insert(Edge_hasMember) { Edge_hasMember member ->
    member.sourceURI = ref("thebeatles").toString() // <— using "provides" values
    member.destinationURI = ref("john").toString()
    member.URI = "edge1"
  }
}
Transactions
Implemented at the service level:

def xid = service.startTransaction()
service.save(xid, person123)
service.commitTransaction(xid)
MetaQL Implementations
[Diagram: MetaQL query → Query Generator → Executable Query]
SPARQL/RDF Implementation

Quad Store (G, S, P, O), e.g. Franz AllegroGraph
SPARQL/RDF Implementation

VitalGraphQuery q = builder.query {
  GRAPH {
    value segments: ["documents"]
    ARC {
      node_constraint { Person.class }
      node_constraint { Person.props().emailID.equalTo("[email protected]") }
      ARC {
        node_constraint { EMailMessage.class }
        edge_constraint { Edge_hasEMailMessage.class }
      }
    }
  }
}.toQuery()

println "Query: " + q.toSparql()
SPARQL/RDF Implementation

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX vital-core: <http://vital.ai/ontology/vital-core#>
PREFIX p0: <http://vital.ai/ontology/enron-emails#>

SELECT DISTINCT ?s1 ?d2 ?e2
FROM <segment:customer__app__documents>
WHERE {
  {
    ?s1 p0:hasEmailID ?value1 .
    ?s1 rdf:type ?value2 .
    FILTER ( ?value2 = p0:Person && ?value1 = "[email protected]"^^xsd:string )
    {
      ?d2 rdf:type ?value3 .
      ?e2 rdf:type ?value4 .
      FILTER ( ?value3 = p0:EMailMessage && ?value4 = p0:Edge_hasEMailMessage )
      ?e2 vital-core:hasEdgeSource ?s1 .
      ?e2 vital-core:hasEdgeDestination ?d2 .
    }
  }
}
Spark-SQL / DataFrame Implementation

[Diagram: Segment RDD (URI, P, V); Property RDD (K, V)]

Experimenting with: the new DataFrame optimizer (Catalyst), the new DataFrame DSL for query generation, and GraphX for isolated graph-query cases.

Generate “bad” queries and let the optimizer fix them, with Spark partitioning the RDDs, as long as Spark is aware of the schema.
Key/Value Implementation
URI —> Serialized Object (key —> value)
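A key/value backend of this shape can be sketched in a few lines of Java; a HashMap stands in for the actual store, and Java serialization stands in for whatever codec the real implementation uses (both are assumptions for illustration):

```java
// Sketch of a key/value segment: the object's URI is the key and a
// serialized form of the object is the value.
import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class KVStoreSketch {
    private final Map<String, byte[]> store = new HashMap<>();

    // Serialize the object and store it under its URI.
    public void put(String uri, Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        store.put(uri, bytes.toByteArray());
    }

    // Look up by URI and deserialize, or return null when absent.
    public Object get(String uri) throws IOException, ClassNotFoundException {
        byte[] data = store.get(uri);
        if (data == null) return null;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        KVStoreSketch kv = new KVStoreSketch();
        kv.put("urn:john", "John Lennon");      // any Serializable object
        System.out.println(kv.get("urn:john")); // prints John Lennon
    }
}
```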
Lucene/SOLR Implementation
Inverted index of property values:

DocID | P1 | P2 | P3 | P4
------+----+----+----+----
1     | V1 | V2 | V3 | V4
2     | V1 | V2 | V3 | V4
3     | V1 | V2 | V3 | V4
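The inverted-index idea can be sketched as a map from (property, value) pairs to posting lists of document IDs; this is an illustrative structure, not Lucene's actual (more involved) index format:

```java
// Sketch of an inverted index over property values: each (property, value)
// pair maps to the sorted set of doc IDs that contain it.
import java.util.*;

public class InvertedIndexSketch {
    // key: "property=value", value: posting list of doc IDs
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void add(int docId, String property, String value) {
        index.computeIfAbsent(property + "=" + value, k -> new TreeSet<>()).add(docId);
    }

    // Find all docs whose property equals value.
    public Set<Integer> lookup(String property, String value) {
        return index.getOrDefault(property + "=" + value, Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add(1, "name", "happy");
        idx.add(3, "name", "happy");
        idx.add(2, "name", "sad");
        System.out.println(idx.lookup("name", "happy")); // [1, 3]
    }
}
```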
NoSQL BigTable Implementation
DynamoDB (HBase, Cassandra, Accumulo, …)

Per-segment object table (+ secondary indices):

ROWID | C1    | C2    | C3    | C4
------+-------+-------+-------+--------------
1     | K1=V1 | K1=V1 | K1=V1 | K1=V1, K1=V1
2     | K1=V1 | K1=V1 | K1=V1 | K1=V1, K1=V1
3     | K1=V1 | K1=V1 | K1=V1 | K1=V1, K1=V1

Per-segment property table (URI, P, V) (+ secondary indices)
SQL Implementation
SQL, Hive-SQL, Redshift, …

Per-segment table (G, S, P, O), with partitioning (Hive)
implementation
DSL Documentation to be posted: http://www.metaql.org/
VitalSigns, VitalService, MetaQL https://dashboard.vital.ai/
Vital AI GitHub (sample code): https://github.com/vital-ai/
Spark Code: Aspen, Aspen-Datawarehouse
Documentation Coming!
closing thoughts
Separation of Concerns yields the Agility needed to keep up with rapidly evolving Data.
“Domain Model as Contract” provides a framework for consistent interpretation of Data across an application.
MetaQL provides a framework for the consistent access and query of Data across an application.
Context: Data-Driven Applications / Cognitive Applications
Thank You!
Marc C. Hadfield, Founder, Vital AI
http://vital.ai
[email protected]
917.463.4776
Pipeline DSL (ETL)
PIPELINE { // Workflow
  PIPE { // a Workflow Component with dependencies
    TRANSFORM { // Joins across Datasets
      IF ( RULE { } ) // Boolean, Query, Construct, …
      THEN { RULE { } }
      ELSE { RULE { } }
    }
    PIPE { … } // dependent PIPE
  } // Output Dataset
  PIPE { … }
}
Influenced by Spark Pipeline and Google Dataflow Pipeline
Schema Upgrade/Downgrade
UPGRADE {
  upgrade(oldClass: OLD_Person.class, newClass: NEW_Person.class) { person_old, person_new ->
    person_new.newName = person_old.oldName
  }
}

DOWNGRADE {
  downgrade(newClass: NEW_Person.class, oldClass: OLD_Person.class) { person_new, person_old ->
    person_old.oldName = person_new.newName
  }
}
Multiple Endpoints
def service1 = VitalService.getService(profile: "kv-users")
def service2 = VitalService.getService(profile: "posts-db")
def service3 = VitalService.getService(profile: "friendgraph-db")

// given user URI: [email protected]
// get user object from service1
// find friends of user in friendgraph via service3
// find posts of friends in posts-db
// update service1 with cache of user-to-friends-postings
// send postings of friends to user in UI