Semantic Web standards and the Variety “V” of Big Data
DESCRIPTION
TopQuadrant presentation by Bob DuCharme, given in the dual NoSQL and Semantic Technology & Business track in San Jose on August 20, 2014.
TRANSCRIPT
© Copyright 2014 TopQuadrant Inc. Slide 1
Semantic Web standards and
the Variety “V” of Big Data
Bob DuCharme
August 20, 2014
© Copyright 2014 TopQuadrant Inc. Slide 2
Three Vs of Big Data
Volume
Velocity
Variety
© Copyright 2014 TopQuadrant Inc. Slide 3
Gartner, September 2013
© Copyright 2014 TopQuadrant Inc. Slide 4
Which dimensions did people struggle with the most?
Volume 35%
Velocity 16%
Variety 49%
© Copyright 2014 TopQuadrant Inc. Slide 5
Why is variety hard?
Furniture Inventory
Protein Database?
Customer Database
Conference Attendees?
Surname | GivenName | LastPurchase | ZipCode | Email
last_name | first_name | is_speaker | postal_code | email
© Copyright 2014 TopQuadrant Inc. Slide 6
Schemas
Good thing:
Ensure data quality
Make query writing* easier
Add efficiency
*And essentially, all application development
Annoying thing:
Can’t add property values someone didn’t see coming
Changing schema (and data with it) slow and expensive
Often tied too closely to specific implementation
Inflexibility × 3.
© Copyright 2014 TopQuadrant Inc. Slide 7
Schemaless NoSQL databases
Can’t add property values someone didn’t see coming?
Changing schema (and data with it) slow and expensive?
Often tied too closely to specific implementation?
© Copyright 2014 TopQuadrant Inc. Slide 8
Schemaless: how do applications know what properties are available?
By any means necessary
Documentation
Query for properties that got used
App possibly written by same person or team
Responsibility shifted from database (designer) to application (designer)
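The "query for properties that got used" option can be sketched in a few lines. This is a minimal illustration, not any particular NoSQL product's API: the records are plain Python dicts, and the sample values (including the example.com address) are hypothetical.

```python
from collections import Counter

# Hypothetical schemaless records; the field names echo the two data sets
# on the slides, the values are made up for illustration.
records = [
    {"Surname": "Smith", "GivenName": "James", "ZipCode": "22903"},
    {"Surname": "Jones", "Email": "jones@example.com"},
    {"last_name": "Smith", "first_name": "Jim", "is_speaker": True},
]


def properties_in_use(docs):
    """Count how many records use each property name."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.keys())
    return counts


usage = properties_in_use(records)
for name, count in sorted(usage.items()):
    print(f"{name}: used in {count} record(s)")
```

The application only learns which names exist, not what they mean or how they relate, which is the responsibility shift the slide describes.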
© Copyright 2014 TopQuadrant Inc. Slide 9
Schema: all or nothing?
Customer Database
Conference Attendees?
Surname | GivenName | LastPurchase | ZipCode | Email
last_name | first_name | is_speaker | postal_code | email
ETL (Extract-Transform-Load)?
© Copyright 2014 TopQuadrant Inc. Slide 10
RDF Schema (RDFS)
W3C Standard since 2004
Often overshadowed by superset standard OWL
Describes RDF, written using RDF syntaxes
Semantic Web
Linked Data
© Copyright 2014 TopQuadrant Inc. Slide 11
RDF
www.w3.org/RDF (second sentence!):
“RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed.”
© Copyright 2014 TopQuadrant Inc. Slide 12
Sample schema
@prefix cust: <http://companyX.com/ns/customer#> .
@prefix ca: <http://companyY.com/ns/confAttendees#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
# Customer Database
cust:Surname a rdf:Property . # or: cust:Surname rdf:type rdf:Property .
cust:GivenName a rdf:Property .
cust:ZipCode a rdf:Property .
cust:Email a rdf:Property .
# Conference Attendees
ca:last_name a rdf:Property .
ca:first_name a rdf:Property .
ca:postal_code a rdf:Property .
ca:email a rdf:Property .
# LastPurchase and is_speaker: don't care (for now)!
© Copyright 2014 TopQuadrant Inc. Slide 13
Relating properties
# assuming prefix declarations from previous slide
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
cust:Surname rdfs:subPropertyOf schema:familyName .
ca:last_name rdfs:subPropertyOf schema:familyName .
cust:GivenName rdfs:subPropertyOf schema:givenName .
ca:first_name rdfs:subPropertyOf schema:givenName .
cust:Email rdfs:subPropertyOf schema:email .
ca:email rdfs:subPropertyOf schema:email .
cust:ZipCode rdfs:subPropertyOf schema:postalCode .
ca:postal_code rdfs:subPropertyOf schema:postalCode .
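What an RDFS-aware processor does with these rdfs:subPropertyOf statements can be sketched as follows. This is a rough illustration only: triples are plain Python tuples of strings rather than a real RDF library's objects, and the instance data (names, postal code) is hypothetical.

```python
SUBPROPERTY_OF = "rdfs:subPropertyOf"

# A subset of the schema statements from the slide, as (s, p, o) tuples.
schema_triples = {
    ("cust:Surname", SUBPROPERTY_OF, "schema:familyName"),
    ("ca:last_name", SUBPROPERTY_OF, "schema:familyName"),
    ("cust:ZipCode", SUBPROPERTY_OF, "schema:postalCode"),
    ("ca:postal_code", SUBPROPERTY_OF, "schema:postalCode"),
}

# Hypothetical instance data from the two source databases.
data_triples = {
    ("ex:cust401", "cust:Surname", "Smith"),
    ("ex:cust401", "cust:ZipCode", "22903"),
    ("ex:ca04395", "ca:last_name", "Smith"),
    ("ex:ca04395", "ca:postal_code", "22903"),
}


def infer_superproperties(schema, data):
    """Apply the rule: if p subPropertyOf q and (s p o), then (s q o)."""
    inferred = set(data)
    changed = True
    while changed:  # repeat so chains of subproperties are followed too
        changed = False
        for sub, pred, sup in schema:
            if pred != SUBPROPERTY_OF:
                continue
            for s, p, o in list(inferred):
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


combined = infer_superproperties(schema_triples, data_triples)
```

After inference, both source data sets can be read through the shared schema.org property names while the original triples stay untouched.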
© Copyright 2014 TopQuadrant Inc. Slide 14
Using the combined data
# SPARQL query: where should we open
# a government relations office?
SELECT ?postalCode
WHERE {
  ?person schema:email ?email .
  FILTER(strends(?email, ".gov"))
  ?person schema:postalCode ?postalCode .
}
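The logic of this query can be mirrored in plain Python to show what it matches. This is a sketch under assumptions: the triples are hypothetical tuples that already use the shared schema.org property names, and all addresses and postal codes are made up.

```python
from collections import Counter

# Hypothetical combined data, already expressed with schema.org properties.
triples = [
    ("ex:cust401", "schema:email", "jsmith@example.gov"),
    ("ex:cust401", "schema:postalCode", "20500"),
    ("ex:ca04395", "schema:email", "jane@example.com"),
    ("ex:ca04395", "schema:postalCode", "95113"),
    ("ex:ca07001", "schema:email", "lee@agency.gov"),
    ("ex:ca07001", "schema:postalCode", "20500"),
]


def gov_postal_codes(data):
    """Join each .gov email to the same person's postal code(s)."""
    return [
        postal_code
        for person, p1, email in data
        if p1 == "schema:email" and email.endswith(".gov")
        for person2, p2, postal_code in data
        if p2 == "schema:postalCode" and person2 == person
    ]


codes = gov_postal_codes(triples)
most_common = Counter(codes).most_common(1)
```

Like the SELECT without DISTINCT, the function returns one postal code per match, so counting the results suggests where such contacts cluster.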
© Copyright 2014 TopQuadrant Inc. Slide 15
Middleware to treat RDBMS as RDF
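[Diagram: the application sends a SPARQL query to the mapping middleware (e.g. D2R, Ultrawrap), which sends a SQL query to the Customers database, receives relational results, and returns SPARQL query results to the application.]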
© Copyright 2014 TopQuadrant Inc. Slide 16
Middleware to treat RDBMS as RDF
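[Diagram: as on the previous slide, but the mapping middleware (e.g. D2R, Ultrawrap) now sends SQL queries to both the Customers and Conference Attendees databases and receives relational results from each, and a triplestore holding the schema metadata is added. The application still sends a SPARQL query and receives SPARQL query results.]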
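The idea of viewing relational rows as triples can be sketched with Python's built-in sqlite3 module. This is a deliberate simplification and an assumption about presentation, not how D2R or Ultrawrap are implemented: the diagram shows the middleware translating SPARQL queries into SQL, whereas this sketch simply reads rows and emits triples. The table layout, the column-to-property mapping, and the subject naming are all hypothetical.

```python
import sqlite3

# In-memory stand-in for the Customers database (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, Surname TEXT, GivenName TEXT, ZipCode TEXT)"
)
conn.execute("INSERT INTO customers VALUES (401, 'Smith', 'James', '22903')")

# Hypothetical mapping from column names to RDF property names.
COLUMN_TO_PROPERTY = {
    "Surname": "cust:Surname",
    "GivenName": "cust:GivenName",
    "ZipCode": "cust:ZipCode",
}


def rows_as_triples(connection):
    """Yield one (subject, property, value) triple per non-null cell."""
    cursor = connection.execute(
        "SELECT id, Surname, GivenName, ZipCode FROM customers"
    )
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        subject = f"ex:cust{row[0]}"  # subject name built from the primary key
        for column, value in zip(columns[1:], row[1:]):
            if value is not None:
                yield (subject, COLUMN_TO_PROPERTY[column], value)


customer_triples = list(rows_as_triples(conn))
```

The relational data stays where it is; only the view of it changes.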
© Copyright 2014 TopQuadrant Inc. Slide 17
Further enhancement
ex:Person a rdfs:Class .
schema:familyName rdfs:domain ex:Person .
schema:givenName rdfs:domain ex:Person .
schema:email rdfs:domain ex:Person .
schema:postalCode rdfs:domain ex:Person .
schema:postalCode rdfs:label "postal code" .
schema:postalCode rdfs:comment "Zip code in the USA, postcode in the UK." .
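The effect of the rdfs:domain statements can be sketched the same way: any resource that has a value for one of these properties can be inferred to be an ex:Person. The triples are plain tuples and the instance data is hypothetical.

```python
DOMAIN = "rdfs:domain"
TYPE = "rdf:type"

# Two of the domain statements from the slide, as (s, p, o) tuples.
schema_triples = {
    ("schema:familyName", DOMAIN, "ex:Person"),
    ("schema:postalCode", DOMAIN, "ex:Person"),
}

# Hypothetical instance data.
data_triples = {
    ("ex:cust401", "schema:familyName", "Smith"),
    ("ex:ca04395", "schema:postalCode", "22903"),
    ("ex:X1703", "ex:serialNumber", "0042"),  # no domain declared for this
}


def infer_types(schema, data):
    """Apply the rule: if p has domain C and (s p o), then s is a C."""
    return {
        (s, TYPE, cls)
        for prop, pred, cls in schema
        if pred == DOMAIN
        for s, p, o in data
        if p == prop
    }


types = infer_types(schema_triples, data_triples)
```

Note that this describes the data rather than validating it: nothing is rejected, and resources simply gain a class membership.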
© Copyright 2014 TopQuadrant Inc. Slide 18
Adding more with OWL
Equipment
equipment code | room
X1703 | main kitchen
Z0439 | cold storage
Room addresses
room | building
main kitchen | 98 Main St.
cold storage | 14 Broad St.
eq:room rdfs:subPropertyOf ex:locatedIn .
rmaddr:building rdfs:subPropertyOf ex:locatedIn .
ex:locatedIn a owl:TransitiveProperty .
rmaddr:98MainSt a ex:Building .
eq:X1703 eq:room eq:mainKitchen .
eq:mainKitchen rmaddr:building rmaddr:98MainSt .
© Copyright 2014 TopQuadrant Inc. Slide 19
Query for which building
# SPARQL query: what building is
# equipment piece X1703 in?
SELECT ?building
WHERE {
  ?building a ex:Building .
  eq:X1703 ex:locatedIn ?building .
}
[Diagram: X1703 is located in the main kitchen, which is located in 98 Main St.]
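The two inference steps this query relies on, subproperty then transitivity, can be sketched in plain Python. This is an illustration with tuples standing in for triples, not an OWL reasoner.

```python
# Facts from the slide, as (subject, property, object) tuples.
facts = {
    ("eq:X1703", "eq:room", "eq:mainKitchen"),
    ("eq:mainKitchen", "rmaddr:building", "rmaddr:98MainSt"),
}
SUBPROPERTY_OF = {
    "eq:room": "ex:locatedIn",
    "rmaddr:building": "ex:locatedIn",
}
BUILDINGS = {"rmaddr:98MainSt"}


def located_in(triples):
    """Return every (thing, place) pair implied for ex:locatedIn."""
    # Step 1: subproperty rule, so each room/building fact counts as locatedIn.
    pairs = {
        (s, o) for s, p, o in triples
        if SUBPROPERTY_OF.get(p) == "ex:locatedIn"
    }
    # Step 2: transitivity, so a in b and b in c implies a in c.
    changed = True
    while changed:
        changed = False
        for a, b in list(pairs):
            for c, d in list(pairs):
                if b == c and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return pairs


closure = located_in(facts)
buildings = [o for s, o in closure if s == "eq:X1703" and o in BUILDINGS]
```

Neither source table says which building X1703 is in; the answer only appears once the two tables are connected through ex:locatedIn.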
© Copyright 2014 TopQuadrant Inc. Slide 20
A little more OWL
schema:email a owl:InverseFunctionalProperty .
ex:cust401 cust:GivenName "James" .
ex:cust401 cust:Surname "Smith" .
ex:cust401 cust:Email "[email protected]" .
ex:ca04395 ca:first_name "Jim" .
ex:ca04395 ca:last_name "Smith" .
ex:ca04395 ca:email "[email protected]" .
# Inferred from the shared email value:
ex:cust401 owl:sameAs ex:ca04395 .
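The inverse functional property rule can be sketched as follows: if two resources share a value for such a property, they are inferred to be the same resource. The triples are plain tuples, and the email addresses are hypothetical placeholders, not the ones from the slide.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data, already expressed with schema:email.
email_triples = [
    ("ex:cust401", "schema:email", "jsmith@example.com"),
    ("ex:ca04395", "schema:email", "jsmith@example.com"),
    ("ex:ca07001", "schema:email", "lee@example.org"),
]


def infer_same_as(triples, inverse_functional_property):
    """Link every pair of subjects that share a value for the property."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p == inverse_functional_property:
            subjects_by_value[o].add(s)
    same_as = set()
    for subjects in subjects_by_value.values():
        # owl:sameAs is symmetric; one direction per pair is recorded here.
        for a, b in combinations(sorted(subjects), 2):
            same_as.add((a, "owl:sameAs", b))
    return same_as


links = infer_same_as(email_triples, "schema:email")
```

"James" and "Jim" do not match as strings, so the shared email is what connects the customer record to the conference attendee record.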
© Copyright 2014 TopQuadrant Inc. Slide 21
What OWL adds to RDFS
RDFS gives you properties to describe your properties, classes, and instances (i.e. your resources)
OWL gives you:
• More properties to describe your resources
• Classes that you can use to describe resources
• The ability to define your own classes that you can use to describe resources
© Copyright 2014 TopQuadrant Inc. Slide 22
Middleware to treat RDBMS as RDF
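[Diagram repeated from Slide 16: the application sends a SPARQL query to the mapping middleware (e.g. D2R, Ultrawrap), which sends SQL queries to the Customers and Conference Attendees databases and receives relational results from each; a triplestore holds the schema metadata; the application receives SPARQL query results.]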
© Copyright 2014 TopQuadrant Inc. Slide 23
Descriptive vs. Prescriptive schemas
Not rules to follow
– e.g. “Employee must have a first and last name!”
– Other ways to implement constraints
Machine-readable guides to what you’ve got to work with
– Data types
– Relationships to other resources and classes of resources
Metadata!
© Copyright 2014 TopQuadrant Inc. Slide 24
Whose schemas?
Your own schemas can describe what you need from the data you’re using
Standardized schemas (e.g. schema.org, GoodRelations) can tie together your data with data from other sources
Tie together your custom schemas with the subsets of standardized schemas that you’re interested in
Tie together the subsets that you’re interested in of different data sets from different sources
© Copyright 2014 TopQuadrant Inc. Slide 25
Top-down or bottom-up schema development?
Whichever you like
I like bottom-up
– (Hey Cyc project: good luck with that!)
Lots of data to deal with?
– Model just enough to drive a simple, proof-of-concept application
– Build the model (schema) a little at a time, then add more to your application
– Connect that model to models of (subsets of) other data sets
© Copyright 2014 TopQuadrant Inc. Slide 26
Who is doing this now?
Pharma
Oil and gas
Publishing
© Copyright 2014 TopQuadrant Inc. Slide 27
TopQuadrant Products and Solutions
TopBraid Platform Solution Engine
IDE
Solutions:
• Asset Management
• Search / Content Enrichment
• Compose your own
• Master Data Management
• Information Discovery for Life Sciences
• Information Exchange
TopQuadrant offers configurable, out-of-the-box solutions enabling organizations to evolve their information infrastructure into a semantic ecosystem
© Copyright 2014 TopQuadrant Inc. Slide 28
Dynamic Interactive Exploration - Search, Query, Filter, Browse, Navigate, Visualize, Share
Logical Data Warehouse - Flexible, Adaptive Information Structuring
TopBraid Insight™ (TBI)
Connect the dots for new insights. Ease Big Data Variety
© Copyright 2013 TopQuadrant Inc. Slide 29
© Copyright 2014 TopQuadrant Inc. Slide 30
• Tames Big Data to empower businesses
• Offers on-demand integrated access to diverse data, making it possible to discover information just in time
• Delivers new levels of creativity and infrastructure flexibility
TopBraid Insight: Connects the Dots
© Copyright 2014 TopQuadrant Inc. Slide 31
Photo credits
• Volume: (CC BY-NC 2.0) Fabrizio Monti https://www.flickr.com/photos/delphaber/3514894189
• Velocity: (CC BY 2.0) Gabriel https://www.flickr.com/photos/cod_gabriel/1332225362
• Variety: (CC BY-NC-SA 2.0) IRRI Photos https://www.flickr.com/photos/ricephotos/4753359957
© Copyright 2014 TopQuadrant Inc. Slide 32
“A wonderful harmony is created when we join together the seemingly unconnected.” - Heraclitus
Bob DuCharme [email protected]
Thank you!