Linked Data Experiences at Springer Nature
TRANSCRIPT
Linked Data Experiences at Springer Nature
Michele Pasin, Lead Data Architect, Knowledge Graph Team
Leipzig, 09/2016
Outline
• Who we are
• Why semantic technologies
• Our work so far
• The Scigraph project
• Looking ahead
Who We Are
Formed in May 2015 through the merger of Nature Publishing Group, Palgrave Macmillan, Macmillan Education and Springer Science+Business Media
13k employees in over 50 countries, EUR 1.5 billion turnover
[Pre-Merger] Springer Science+Business Media brands
[Pre-Merger] Macmillan Science & Education brands
Holtzbrinck Publishing Group
We publish a lot of science! (since 1815)
13M documents
7M articles, 4M chapters
4k journals, 700k books
...and generate a lot of traffic
11.5M monthly visitors (nature.com)
260M visits per year, 600M downloads per year (link.springer.com)
> Collaborative effort between Springer Nature and Digital Science
> Supporting internal use cases, but also contributing to an emerging web of linked science data
> Not just publications data but a wealth of other related information
Why Semantic Technologies
Why is Semantics Important To Us?
Challenges: Data Silos
● Data is fragmented
● Data gets duplicated
● Data is hardcoded into applications
Change Drivers
● Digital first workflow
● User-centric design
● Unified Springer Nature domain
For example: our sites are currently organised around articles, journals and issues…
However, scientists are interested in answering questions about real world things…
Search engines do not know we have content about these things…
1st hit from nature.com…
Not linked to/from…
XML
ePub
HTML
TIFF
Today: Content base. We publish science.
Tomorrow: Knowledge Graph. We manage knowledge.
Vision
The Knowledge Graph is about collecting information about objects in the real world
…so that we can do a better job of providing users with what they're looking for
reads / writes
is about
interested in
Three areas of knowledge we care about
Reads / Writes
Works for
Funds
Lead researcher in
Produces
Studies
Located at
In proceedings
Contains
Cites
Has learning resource
Attends
Has topic
Produces
Opportunities: Tools & Services Along the Publishing Life Cycle
Research / Manuscript
Creation
Manuscript Submission
Peer Review / Proposal Stage
Planning
Production
Publication
Distribution / Sales
Discovery
Researcher / Author
Editorial / Publisher
Reviewer
Our Work So Far
2012, 2013, 2014, 2015, 2016
NPG Linked Data Platform
Nature Ontologies Portal
Springer Materials
Springer Conferences
Scigraph
Content Hub
Scigraph prototype
Nero Project
Linnaeus Project
Springer Protocols
CURI Semantic Annotation Project
NPG Linked Data Platform (2012)
Deliverables (2012–2014)
● Prototype for external use
● SPARQL query service
● Two RDF dataset releases in 2012
– April 2012 (22m triples)
– July 2012 (270m triples)
● Live updates to query endpoint
Led to (2014–)
● Focus on internal use-cases
● Publish ontology pages
● Periodic data snapshots
NPG Content Hub (2014): Hybrid Architecture
Features
● Hybrid RDF + XML architecture
– MarkLogic for XML, RDF/XML
– Triplestore (TDB) for RDF validation
● Repos for binary assets
Layout
● Semantic RDF/XML includes in XML
● RDF objects serialized in list order
● Application XML for subject hierarchy
Indexes
● Indexes over all elements
● Range indexes for datatypes (e.g. dates)
Subject Pages (2014)
NPG Ontologies Portal (2015): Data Publishing
Springer Materials (2014)
Springer Conferences Portal (2015)
Scigraph Project (2016): main objectives
Data Integration
> Consolidation of existing LD efforts via a single domain model
> Ingestion and normalisation of third party datasets
Discoverability
> Better end user applications [B2C]
> Metadata delivery & validation [B2B]
> Data publishing [B2developers]
Scigraph
what’s in it > data architecture, taxonomies, ontologies
how it works > ETL, naming, validation, identity
Data Landscape
Citations / References: 160M
Articles: 7M
Chapters: 3.6M
Journals: 4K
Books: 700k
Subjects: 4K
Article Types
Grants: 2M
Organizations: 60K
Conferences: 10K
Funders
Publishers
Universities
Scigraph Core
Persons: 1M
Relations
Publish states
Vocabularies
Ontologies and Taxonomies: overview
Complexity (ontological depth), increasing:
● a glossary: a controlled vocabulary with NL definitions (e.g. lexicon)
– Publishers, Relations, Publish-states
● a taxonomy: a c.v. that captures broaderThan / narrowerThan relationships
– Subjects, Article Types
● a thesaurus: taxonomy plus related terms; captures synonymy, homonymy etc.
● a DB/OO scheme: relational model, unconstrained use of arbitrary relations
– Scigraph Core ontology
● an axiomatized theory: arbitrary relations plus axioms, constraints and rules expressed in a logical language
The Core Ontology
- Language: OWL 2, Profile: ALCHI(D)
- Entities: ~73 classes, ~250 properties
- Principles: Incremental Formalization / Enterprise Integration / Model Coherence
http://www.nature.com/ontologies/core/
The Core Ontology: mappings
:Asset
:Thing
:Publication
:Concept
:Event
:Subject
:Type
:Agent
:ArticleType
:PublishingEvent
:AggregationEvent
:Component
:Document
:Serial
cidoc-crm:Information_Carrier
cidoc-crm:Conceptual_Object
dbpedia:Agent, dc:Agent, dcterms:Agent, cidoc-crm:Agent, vcard:Agent, foaf:Agent
event:Event, bibo:Event, schema:Event, cidoc-crm:TemporalEntity
cidoc-crm:Type, vcard:Type
fabio:SubjectTerm
bibo:Document, cidoc-crm:Document, foaf:Document
bibo:Periodical, fabio:Periodical, schema:Periodical
bibo:DocumentPart
fabio:Expression, cidoc-crm:InformationObject
= owl:equivalentClass
SKOS taxonomies: PoolParty integration
SKOS taxonomies: Subjects
- Structure: SKOS, ~2500 concepts, multi-hierarchical tree, 6 branches, 7 levels of depth
- Mappings: 100% of terms, using skos:broadMatch or skos:closeMatch (DBpedia and MeSH)
- Document tagging: mostly manual, different workflows, often costly and inconsistent
Semi-automatic tagging with Dimensions (from Uber Research)
Scigraph
what’s in it > data architecture, taxonomies, ontologies
how it works > ETL, naming, validation, identity
Naming Architecture: federated model
> Dereference and 303 redirects:
- http://name.scigraph.com/{things}/
- http://data.scigraph.com/{things}/
> Two patterns: schemas and instances
- http://name.scigraph.com/ontologies/{domain}/
- http://name.scigraph.com/{domain}/{things}/
> Prefixes for schemas and instances
- @prefix sg: <http://name.scigraph.com/ontologies/core/> .
> Entity names follow a robust convention
- camel-case for naming terms, with an initial uppercase for classes and an initial lowercase for properties.
> Named graphs used to track provenance
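The naming rules above can be sketched in a few lines of Python. This is only an illustration, not the Scigraph implementation: the helper names (`schema_uri`, `instance_uri`, `redirect`) are hypothetical, and the assumption that a `name.` URI answers with a 303 pointing at the same path on `data.` is an interpretation of the slide rather than something it states in full.

```python
import re

# Hosts taken from the slide; everything else here is an illustrative assumption.
NAME_BASE = "http://name.scigraph.com"
DATA_BASE = "http://data.scigraph.com"

# Camel-case convention: classes start uppercase, properties start lowercase.
CLASS_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")
PROPERTY_NAME = re.compile(r"^[a-z][A-Za-z0-9]*$")


def schema_uri(domain: str) -> str:
    """Build a schema (ontology) URI, e.g. for the 'core' domain."""
    return f"{NAME_BASE}/ontologies/{domain}/"


def instance_uri(domain: str, thing: str) -> str:
    """Build an instance URI following the {domain}/{things} pattern."""
    return f"{NAME_BASE}/{domain}/{thing}/"


def redirect(name_uri: str) -> tuple[int, str]:
    """Return the (status, location) a name URI might dereference to.

    Assumption: the data URI mirrors the path of the name URI.
    """
    if not name_uri.startswith(NAME_BASE + "/"):
        raise ValueError(f"not a name URI: {name_uri}")
    return 303, DATA_BASE + name_uri[len(NAME_BASE):]
```

For example, `redirect(instance_uri("journals", "123"))` yields a 303 status together with the corresponding `data.scigraph.com` address, while `CLASS_NAME` and `PROPERTY_NAME` can be used to lint term names such as `Journal` or `hasKeyProperty`.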
Scigraph - Data Flow
Peer Review DDS Core
Media UNSILO TARGET Uber Research DBPedia etc..
KNOWLEDGE GRAPH
JSON-LD API DDS Adapter TTL Loader RDF Loader ..
data sources
integration layer
real time services
Peer Review Service
Search Service(Content Hub)
applications Peer Review Oscar Search
data is delivered to applications via fast APIs
data is extracted and denormalised so as to support applications
data is normalised and mapped to SN ontologies
ETL Architecture: main features [in evolution]
Tech stack
> Airflow framework (Airbnb)
> Amazon S3 to make backups
> GraphDB triplestore (staging and presentation)
> Elasticsearch and APIs
Components & Principles
> Graph must be ‘ephemeral’
> Data sources versioning algorithm
> Identity Persistence service
> Validation via SHACL (TopBraid API)
ETL Architecture
Persons zip
XML
RDF
JSON
CSV
Articles DB
Publishers Dataset
Books API
Sources Data Store: Amazon S3
Data Staging: Triplestore
Data Presentation: Triplestore
Linked Data Browser
Analytics
Reporting
APIs
✴ Extraction
✴ Validation
✴ Identity Persistence
✴ Updating / Replacing named graphs
✴ Versioning service (md5 checksum, timestamps, origin version, etc.)
✴ Integration (union graph)
✴ Inference
Named Graphs
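The versioning service is only named on the slide (md5 checksum, timestamps, origin version). A minimal sketch of how such a check might work follows, assuming the goal is to skip re-ingesting a source file whose content has not changed. The class and field names (`VersionRegistry`, `SourceVersion`, `register`) are hypothetical and not taken from the Scigraph codebase.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceVersion:
    origin: str   # where the file came from, e.g. a path or bucket key
    md5: str      # checksum of the file content
    seen_at: str  # ISO timestamp of when this version was first seen


class VersionRegistry:
    """Remembers the latest checksum per origin (in memory, for illustration)."""

    def __init__(self) -> None:
        self._latest: dict[str, SourceVersion] = {}

    def register(self, origin: str, payload: bytes) -> bool:
        """Record a payload; return True only if it differs from the last one seen."""
        digest = hashlib.md5(payload).hexdigest()
        previous = self._latest.get(origin)
        if previous is not None and previous.md5 == digest:
            return False  # unchanged: the named graph need not be replaced
        seen_at = datetime.now(timezone.utc).isoformat()
        self._latest[origin] = SourceVersion(origin, digest, seen_at)
        return True
```

In a pipeline, a `True` result would trigger extraction and replacement of the named graph for that source, while `False` would let the run skip it.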
Identity Persistence
Identity Persistence Module
J1 (xml)
J2 (xml)
RDF Extractor
journals:76as67fda76sd67a
id: 1, DOI: 123, issn: ABC
id: 2, issn: ABC
J1 (xml)
id: 1, DOI: 123, issn: ABC
ingest #1
ingest #2
ingest #3
Identity Registry
sgo:core Ontology
sg:Journal a owl:Class ;
    sg:hasKeyProperty sg:doi ;
    sg:hasKeyProperty sg:issn ;
    sg:hasKeyProperty sg:eissn ....
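The diagram suggests that records sharing a key property value (for instance the same ISSN) resolve to the same persistent identifier across ingests. The sketch below illustrates that idea under stated assumptions: the registry is an in-memory dictionary, identifiers are minted with `uuid`, and the `journals:` style prefix is derived from the class name. None of these details are confirmed by the slides, and conflicting keys (two key values already bound to different identifiers) are deliberately not handled.

```python
import uuid

# Key properties per class, mirroring sg:hasKeyProperty in the ontology snippet.
KEY_PROPERTIES = {"Journal": ("doi", "issn", "eissn")}


class IdentityRegistry:
    """Maps (class, key property, value) triples to a persistent identifier."""

    def __init__(self) -> None:
        self._index: dict[tuple[str, str, str], str] = {}

    def resolve(self, cls: str, record: dict) -> str:
        """Return the existing identifier for a record, or mint a new one."""
        keys = [(cls, prop, record[prop])
                for prop in KEY_PROPERTIES[cls] if record.get(prop)]
        if not keys:
            raise ValueError("record has no key property to identify it by")
        # Reuse the first identifier already registered under any key value.
        identity = next((self._index[k] for k in keys if k in self._index), None)
        if identity is None:
            # Hypothetical minting scheme: lowercase plural class name + random hex.
            identity = f"{cls.lower()}s:{uuid.uuid4().hex[:16]}"
        for key in keys:
            self._index.setdefault(key, identity)
        return identity
```

With this sketch, ingesting J1 (`doi` 123, `issn` ABC), then J2 (`issn` ABC only), then J1 again returns the same identifier each time, which is the behaviour the three ingests in the diagram appear to describe.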
Data Validation: from SPIN to SHACL
> SPIN SPARQL syntax (2011, TopQuadrant)
> Example: “if a Journal instance has no short title, raise an Exception”
> Main drawback: hard for non-specialists to read and maintain
> SHACL - Shapes Constraint Language (2016, TopQuadrant)
> Example: “all article instances should have a valid DOI”
> Example: “all grant instances should have at most one start year and one end year”
> Approach: polish data before entering the triplestore, use triplestore inference primarily for integration
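To make the example rules concrete, here is a plain-Python sketch of the checks they describe, run before data reaches the triplestore. It is not SHACL and does not use the TopBraid API: the entity layout (a dict with multi-valued properties as lists), the property names (`shortTitle`, `doi`, `startYear`, `endYear`) and the DOI pattern are all assumptions made for illustration.

```python
import re

# Assumed DOI shape: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def validate(entity: dict) -> list[str]:
    """Return a list of human-readable violations for one entity."""
    errors = []
    kind = entity.get("type")

    # "if a Journal instance has no short title, raise an Exception"
    if kind == "Journal" and not entity.get("shortTitle"):
        errors.append("Journal has no short title")

    # "all article instances should have a valid DOI"
    if kind == "Article":
        dois = entity.get("doi", [])
        if not dois or not all(DOI_PATTERN.match(d) for d in dois):
            errors.append("Article has no valid DOI")

    # "all grant instances should have at most one start year and one end year"
    if kind == "Grant":
        for prop in ("startYear", "endYear"):
            if len(entity.get(prop, [])) > 1:
                errors.append(f"Grant has more than one {prop}")

    return errors
```

Entities that return an empty list would pass on to the staging triplestore; the rest could be reported or rejected, in line with the approach of polishing data before it enters the store.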
Next Steps
Looking Ahead
Summary
● Scigraph is our latest LD platform - public version live in late 2016
● SW tech allows for scalable enterprise-level metadata management
● It is crucial to distinguish between data integration and (real-time) data delivery
● Still a work in progress… suggestions or feedback very welcome!
Ongoing Work
● Ontology: federated model, more advanced inferencing capabilities
● Build internal/external APIs (JSON-LD), also integrating NoSQL
● Tools for analytics, reporting, visualisation, interactive exploration of the graph
● Entity extraction: scientific entities, places, people, events etc.
● We’re looking to collaborate… Crossref, W3C, building a Linked Science Web
Future: a scientific article X-ray?
The Knowledge Graph team
CORE TEAM
* Markus Kaindl: Product Owner
* Ben Kirkley: Project Manager
* Michele Pasin: Lead Data Architect
* Tony Hammond: Data Architect
* Matias Piipari: Lead Engineer
* Hilverd Reker: Software Engineer
* Artur Konczak: Software Engineer
* <blankNode>: Data Scientist
* <blankNode>: Data Engineer
DIGITAL SCIENCE
* Martin Szomszor: Data Scientist
* Richard Koks: Data Scientist
* Mario Diwersy: CTO, Uber Research
PROGRAM SPONSOR
* Henning Schoenenberger: Director Data & Metadata