Vital AI: Big Data Modeling
DESCRIPTION
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k

Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem, but the "Variety" problem is largely unaddressed: there is a lot of manual "data wrangling" to manage data models. These manual processes do not scale well. Not only is the variety of data increasing, the rate of change in the data definitions is also increasing. We can't keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it. This talk presents tools and a methodology to manage Big Data Models in a rapidly changing world.

This talk covers:
Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change: Big Data Models with rapidly changing Data Resources

TRANSCRIPT
Big Data Modeling
Today:
Marc C. Hadfield, Founder, Vital AI
http://vital.ai
[email protected]
917.463.4776
Big Data Modeling is Data Modeling with the
“Variety” Big Data Dimension in mind…
Big Data “Variety” Dimension
The “Variety” problem can be addressed by a combination of improved tools and a methodology involving both system architecture and data science / analysis.
Compared to Volume and Velocity, Variety is a very labor-intensive, human-centric process.
Variety is the many types of data to be utilized together in a data-driven application.
Potentially too many types for any single person to keep track of (especially in Life Sciences).
Key Takeaways:
Using OWL as a “meta-schema” can drastically reduce operations/development effort and increase the value of the data for analysis.
OWL can augment and not replace familiar development processes and tools.
A huge amount of ongoing development effort is spent transforming data across components and
keeping data consistent during analysis.
Collecting Good Data = Good Analytics
Big Data Modeling:
Challenges
Goals
OWL as Modeling Language
Using OWL-based Models…
Collaboration/Modeling Tools
Examples from NYC Department of Education:
Domain Ontology
Application Architecture
Development Methodology/Tools
NYC Department of Education:
Data Architecture
Data Science
Data Models in:
Challenges
Mobile/Web App Architecture
[Diagram: Mobile App → Server Implementation → Database. Multiple databases feed a Master Database (“Data Lake”), which drives Business Intelligence / Data Analytics and a Dashboard. Each connection carries its own Data Model.]
Enterprise Data Warehouse Architecture
Schema “on read” or “on write”
[Diagram: multiple redundant Data Models connected through an ETL Process.]
Lambda Architecture + Hadoop: Data Driven App
[Diagram: Mobile App → Server Layer; Real Time Data and Calculated Views; Hadoop / Predictive Analytics; Master Database (“Data Lake”); Business Intelligence / Data Analytics; Dashboard. Each component boundary carries its own Data Model.]
Data Wrangling / Data Science
[Diagram: Raw Data from the Master Database (“Data Lake”) is loaded into R and Business Intelligence / Data Analytics, each through its own Data Model.]
Prediction Models must integrate back with production environment:
Same Data, Different Contexts…
Redundant Models.
Data Architecture Issues
Redundant Data Definitions:
Database Schema
JSON Data
Data Object Classes
Avro/Parquet
Considerable Development / Maintenance / Operational Overhead
Data Science / Data Wrangling Issues
Data Harmonization: Merging Datasets from Multiple Sources
Loss of Context: Feature f123 = Column135 X Column45 / Column13
Side note: Let’s stop using CSV files for datasets! No more flat datasets!
Goals
Goals:
Reduce redundancy in Data Definitions
Enforce Clean/Harmonized Data
Use Contextual Datasets
Use Best Software Components (Databases, Analytics, …)
Use Familiar Tools (IDE, git, Languages, R)
OWL as Modeling Language
Web Ontology Language (OWL)
Specifies an Ontology (“Data Model”)
Formal Semantics, W3C Standard
Provides a language to describe the meaning of data properties and how they relate to classes.
Example: Mammal. Necessary Conditions: warm-blooded, vertebrate animal, has hair or fur, secretes milk, (typically) live birth
Greater descriptive power than Schema (SQL Tables) and Serialization Frameworks (Avro)
Why OWL?
If we can more formally specify what the data *means*, then a single data model (ontology) can apply to our entire architecture, and data can be transformed automatically and locally to meet the needs of a specific software module.
Manually coded data transforms may be “lossy” and/or introduce errors, so eliminating them helps keep data clean.
Why OWL? (continued)
Example: if we specify what a “Document” is, then a text-mining analyzer will know how to access the textual data without further prompting.
Example: if we specify Features for use in Machine Learning in the ontology, then features can be generated automatically to train Machine Learning Models, and the same features would be generated when we use the model in production.
Why OWL? (continued)
Note: Because ontologies can extend other ontologies, a collection of linked ontologies can be used rather than a single ontology, allowing segmentation across an organization.
Vital Core Ontology
Protege Editor…
Nodes, Edges, HyperNodes, HyperEdges get URIs
John/WorksFor/IBM —> Node / Edge / Node
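A minimal sketch of this node/edge pattern, in Java; the class names and URI strings are illustrative, not the actual Vital AI classes:

```java
// Sketch of the node/edge pattern: every node AND every edge gets its own URI,
// so "John works for IBM" becomes Node / Edge / Node rather than a bare triple.
class GraphObject {
    final String uri;                       // every object is addressable by URI
    GraphObject(String uri) { this.uri = uri; }
}

class Node extends GraphObject {
    Node(String uri) { super(uri); }
}

class Edge extends GraphObject {
    final String source;                    // URI of the source node
    final String destination;               // URI of the destination node
    Edge(String uri, String source, String destination) {
        super(uri);
        this.source = source;
        this.destination = destination;
    }
}

class WorksForExample {
    static Edge johnWorksForIbm() {
        Node john = new Node("person123");
        Node ibm = new Node("company456");
        // The relationship itself is a first-class object with its own URI,
        // so properties (start date, role, ...) can hang off the edge.
        return new Edge("worksFor123", john.uri, ibm.uri);
    }
}
```

Making the edge a first-class object is what lets edges carry their own properties and types, which a plain subject/predicate/object triple cannot do directly.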
Vital Core Ontology
Vital Domain Ontology
Application Domain Ontology
Extending the Ontology
NYC Dept of Education Domain Ontology
Generating Data Bindings with VitalSigns:
[Diagram: Ontology → VitalSigns → Groovy, Semantic, Hadoop, Prolog, Graph, HBase, and JavaScript Bindings]
Code/Schema Generation
vitalsigns generate -ont name…
Data Representations:

Groovy:
person123.name = "John"
person123.worksFor.company456

RDF:
<person123> <hasName> "John"
<worksFor123> <hasSource> <person123>
<worksFor123> <hasDestination> <company456>
<worksFor123> <hasType> <worksFor>

HBase:
person123, Node:type=Person, Node:hasName="John"
worksFor123, Edge:type=worksFor, Edge:hasSource=person123, Edge:hasDestination=company456
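A sketch of how a generated binding might emit the RDF representation for an edge; the helper class and triple syntax are illustrative, not the actual VitalSigns output:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a generated binding emitting an RDF-style representation of an edge,
// matching the hasSource / hasDestination / hasType pattern shown in the slides.
class RdfWriter {
    static List<String> edgeToTriples(String edgeUri, String type,
                                      String sourceUri, String destUri) {
        List<String> triples = new ArrayList<>();
        triples.add("<" + edgeUri + "> <hasSource> <" + sourceUri + ">");
        triples.add("<" + edgeUri + "> <hasDestination> <" + destUri + ">");
        triples.add("<" + edgeUri + "> <hasType> <" + type + ">");
        return triples;
    }
}
```

Because the ontology defines the meaning once, the same object can be rendered into Groovy property access, RDF triples, or HBase rows without hand-written transforms.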
VitalSigns
Generation —> JAR Library
[Diagram: at Runtime, VitalSigns loads multiple Domain Ontologies and their generated Classes.]
Using OWL-based Models
Developing with the Ontology in UI, Hadoop, NLP, Scripts, ...
Node:Person -[Edge:hasFriend]-> Node:Person
Set<Friend> person123.getFriends()
Eclipse IDE
// Reference to an NYCSchool object
NYCSchool school123 = … // get from database

// Get a list of programs, local context (cache)
List<NYCSchoolProgram> programs = school123.getPrograms()

// Get list of programs, global context (database)
List<NYCSchoolProgram> programs = school123.getPrograms(Context.ServiceWide)
JVM Development
Using JSON-Schema Data in JavaScript
for (var i = 0; i < progressReports.length; i++) {
    var r = progressReports[i];
    var sub = $('<ul>');
    sub.append('<li>Overall Grade: ' + r.progReportOverallGrade + '</li>');
    sub.append('<li>Progress Grade: ' + r.progReportProgressGrade + '</li>');
    sub.append('<li>Environment Grade: ' + r.progReportEnvironmentGrade + '</li>');
    sub.append('<li>College and Career Readiness Grade: ' + r.progRepCollegeAndCareerReadinessGrade + '</li>');
    sub.append('<li>Performance Grade: ' + r.progReportPerformanceGrade + '</li>');
    sub.append('<li>Closing the Achievement Gap Points: ' + r.progReportClosingTheAchievementGapPoints + '</li>');
    sub.append('<li>Percentile Rank: ' + r.progReportPercentileRank + '</li>');
    sub.append('<li>Overall Score: ' + r.progReportOverallScore + '</li>');
}
NoSQL Queries
Query API / CRUD Operations
Queries generated into “native” NoSQL query format:
Sparql / Triplestore (Allegrograph)
HBase / DynamoDB
MongoDB
Hive/HiveQL (on Spark/Hadoop 2.0)
Query Types: “Select” and “Graph”
Abstract type of datastore from application/analytics code
Pass in a “native” query when necessary
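A sketch of the abstraction idea: the application builds one query object, and per-datastore generators translate it into native syntax. The class names and the generated query strings here are illustrative, not the Vital AI Query API:

```java
// One abstract "Select" query, translated into native datastore syntax.
class SelectQuery {
    final String type;       // ontology class to select, e.g. "Person"
    final String property;   // property to filter on
    final String value;      // literal value to match
    SelectQuery(String type, String property, String value) {
        this.type = type; this.property = property; this.value = value;
    }
}

class QueryGenerators {
    // SPARQL for a triplestore such as Allegrograph
    static String toSparql(SelectQuery q) {
        return "SELECT ?s WHERE { ?s a <" + q.type + "> . ?s <" + q.property
                + "> \"" + q.value + "\" }";
    }
    // HiveQL for analysis on Spark/Hadoop
    static String toHiveQl(SelectQuery q) {
        return "SELECT * FROM " + q.type + " WHERE " + q.property
                + " = '" + q.value + "'";
    }
}
```

Application and analytics code depends only on the abstract query, so the backing datastore can be swapped without rewriting the callers.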
Data Serialization, Analytics Jobs
Data Serialized into file format by blocks of objects
Leverage Hadoop Serialization Standards: Sequence File, Avro, Parquet
Get data in and out of HDFS Files
Spark/Hadoop jobs are passed a set of objects as input; the URI of the object is the key
Data Objects are serialized into Compressed Strings for transport over Flume, etc.
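A sketch of the compressed-string transport step, assuming a gzip + Base64 scheme; the payload format is illustrative, not the actual wire format:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Serialize a data object into a compressed string safe to ship over
// Flume or a message queue, and recover it on the other side.
class Transport {
    static String pack(String serializedObject) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(serializedObject.getBytes(StandardCharsets.UTF_8));
        }
        // Base64 keeps the compressed bytes printable for string transports
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    static String unpack(String packed) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(packed);
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```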
Machine Learning
Via Hadoop, Spark, R
Mahout, MLLib
Build Predictive Models
Classification, Clustering...
Use Features defined in Ontology
Learn Target defined in Ontology
Models consume Ontology Data as input
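A sketch of why defining features in the ontology matters: the same generator produces the feature vector at training time and in production, so the two can never drift apart. The feature name and extraction interface below are illustrative (echoing the Feature f123 example earlier), not the actual Vital AI API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Features are defined once (conceptually, in the ontology) and the same
// generator runs in training and in production, so vectors always match.
class FeatureGenerator {
    // feature name -> extraction function over a data object's properties
    final Map<String, Function<Map<String, Double>, Double>> features = new LinkedHashMap<>();

    void define(String name, Function<Map<String, Double>, Double> f) {
        features.put(name, f);
    }

    double[] generate(Map<String, Double> dataObject) {
        return features.values().stream()
                .mapToDouble(f -> f.apply(dataObject))
                .toArray();
    }
}
```

Keeping the feature definition next to the data model also preserves the context that a bare column formula loses.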
Natural Language Processing/Text Mining
Topic Categorization…
Extract Entities… Text Features from Ontology
Classes extending Document…
Graph Analytics
GraphX, Giraph: PageRank, Centrality, Interest Graph, …
Inference / Rules
Use Semantic Web Rule Engines / Reasoners
Load Ontology + RDF Representation of Data Instances (Individuals)
R Analytics
Load Serialized Data into R Dataframes
Reference Classes and Properties by Name in Dataframes (cleaner code than a huge number of columns)
Graph Visualization with Cytoscape
Data already in Node/Edge Graph Form
Visualize Data “Hot Spots”
NYC Schools Architecture
[Diagram: Mobile App, JSON Schema, Vert.x, Vital Flow Queue, Rule Engine, NLP, DynamoDB, Vital Prime, VitalService Client, NYC Schools Data Model; R, Serialized Data, Data Insights.]
Collaboration & Tools
Collaboration/Tools
git - code revision system
OWL Ontologies treated as code artifact
Coordinate across Teams: “Front End”, “Back End”, “Operations”, “Business Intelligence”, “Data Science”…
Coordinate across Enterprise: Departments / Business Units
“Data Model of Record”
Ontology Versioning
NYCSchoolRecommendation-0.1.8.owl
Semantic Versioning (http://semver.org/)
vitalsigns command line
vitalsigns generate: code/schema generation
vitalsigns upversion/downversion: increase version patch number; move previous version to archive; rename OWL file including username
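The patch-bump step can be sketched as a file-name rewrite following Semantic Versioning; the class name and regex are illustrative, not the actual vitalsigns implementation:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Bump the patch number in a versioned ontology file name,
// e.g. NYCSchoolRecommendation-0.1.8.owl -> NYCSchoolRecommendation-0.1.9.owl
class OntologyVersion {
    static String upversion(String fileName) {
        Matcher m = Pattern
                .compile("^(.*-\\d+\\.\\d+\\.)(\\d+)(\\.owl)$")
                .matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a versioned ontology file: " + fileName);
        }
        int patch = Integer.parseInt(m.group(2)) + 1;   // major.minor stay fixed
        return m.group(1) + patch + m.group(3);
    }
}
```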
JAR files pushed into Maven (Continuous Integration)
Git Integration
git: add, commit, push, pull
diff: determine differences
merge: merge two Ontologies
detect types of Ontology changes
merge into new patch version
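The change-detection step can be sketched as a set comparison over the class names of two ontology versions. This is a deliberate simplification: a real OWL diff would also consider properties, restrictions, and annotations, and the class and output format below are illustrative:

```java
import java.util.Set;
import java.util.TreeSet;

// Detect one type of ontology change: classes added or removed
// between two versions, reported in sorted order.
class OntologyDiff {
    static String diff(Set<String> oldClasses, Set<String> newClasses) {
        Set<String> added = new TreeSet<>(newClasses);
        added.removeAll(oldClasses);
        Set<String> removed = new TreeSet<>(oldClasses);
        removed.removeAll(newClasses);
        return "added=" + added + " removed=" + removed;
    }
}
```

A diff like this is what lets a merge decide whether two edits are compatible before producing the new patch version.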
OWL as Data Modeling Language: Data Architecture & Data Science / Analytics
Conclusions
Leverage Existing Tools, Components
Reduce model redundancy, reduce effort.
A Means to Collaborate Across Teams: Data Model of Record
Cleaner Data
Integrate additional analysis
For more information, please contact:
Marc C. Hadfield
http://vital.ai
[email protected]
917.463.4776
Thank You!