Brief on Linked Data for U.S. EPA's Chief Data Scientist

Bernadette Hyland, CEO & co-founder, 11911 Freedom Drive, Suite 850, Reston, VA 20190. Tel. +1-571-331-3758. [email protected] @BernHyland [email protected] @3RoundStones. Extend Your Reach. Linked Data for Smarter Decisions. Follow-up information prepared for Robin Thottungal, Chief Data Scientist / Director of Analytics, US Environmental Protection Agency - Feb 26, 2016

Upload: 3-round-stones

Post on 22-Jan-2018


TRANSCRIPT

Page 1: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Bernadette Hyland

CEO & co-founder

11911 Freedom Drive, Suite 850

Reston, VA 20190

Tel. +1-571-331-3758

[email protected] @BernHyland

[email protected] @3RoundStones

Extend Your Reach.

Linked Data for Smarter Decisions.

Follow-up information prepared for Robin Thottungal, Chief Data Scientist / Director of Analytics

US Environmental Protection Agency - Feb 26, 2016

Page 2: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Today’s reality at EPA

»Tens of thousands of sources

»Many formats - JSON, XML, CSV, PDF, PPT, SHP, SHX, text, binary…

»Thousands of data silos

»No single source of truth

»Varied interpretations

»Brittle interfaces - lack of interoperability

Image Credit: Smart Data Collective

Page 3: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Wide Variety of Data at EPA


Image Credit: MarkLogic, see http://www.marklogic.com/resources/marklogic-semantics-datasheet/resource_download/datasheets/

Page 4: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Credit: Frederick Giasson, Data Scientist & Software Developer, http://fgiasson.com/blog/index.php/2014/07/23/big-structures-where-the-semantic-web-meets-artificial-intelligence/

Potential at EPA …

• Findable data

• Accessible data

• Interoperable data

• Re-usable data

• Shared context

• Data Platforms (HDFS, NoSQL)

Page 5: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Linked Data is helping to extend & augment EPA’s significant investment in enterprise relational technologies

How?

By leveraging NoSQL Data Platforms that rigorously adhere to international data interoperability standards. *

* Relevant international data exchange standards are published by the W3C, OGC, IEEE

Image Credit: MarkLogic

Page 6: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Graph databases, as a subset of NoSQL databases, are the most efficient way to look at the relationships between data items, patterns of relationships and interactions.

Image Credit: Cray, see http://www.cray.com/blog/graph-databases-101/

Graph Databases 101
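To make the graph idea concrete, here is a minimal sketch in plain Python, not a graph database. The facility, chemical, and county identifiers are hypothetical and are not real EPA records. It shows why relationship-centric storage helps: a question that spans two kinds of relationship becomes a simple intersection of linked nodes.

```python
from collections import defaultdict

# Hypothetical edges (subject, relationship, object); not real EPA records.
EDGES = [
    ("facility:A", "reports_release_of", "chemical:benzene"),
    ("facility:B", "reports_release_of", "chemical:benzene"),
    ("facility:A", "located_in", "county:Fairfax"),
    ("facility:B", "located_in", "county:Loudoun"),
    ("facility:C", "located_in", "county:Fairfax"),
]


def build_index(edges):
    """Index subjects by (relationship, object) so lookups follow links directly."""
    index = defaultdict(set)
    for subject, relationship, obj in edges:
        index[(relationship, obj)].add(subject)
    return index


def facilities_matching(index, chemical, county):
    """Return facilities that release `chemical` and are located in `county`."""
    releasing = index[("reports_release_of", chemical)]
    located = index[("located_in", county)]
    return sorted(releasing & located)


index = build_index(EDGES)
result = facilities_matching(index, "chemical:benzene", "county:Fairfax")
print(result)
```

A production graph database would add persistence, indexing, and a query language; the sketch only illustrates the relationship-first data model.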

Page 7: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Hadoop Integration

»While over 90% of the world’s data has been created in the last two years, EPA has a tremendous variety of data that requires the “right tool for the job”

»Historic data (“short, wide, complex data”) vs.

»Granular sensor & GIS data (“long skinny data”)

»Core mission-based systems with robust historic data, includes:

»Toxics Release Inventory (TRI)

»Facilities Registry (FRS)

»RCRA Handler

»EPA’s enterprise information architecture should include a data platform that leverages Hadoop: HDFS and MapReduce, and accommodates EPA’s robust data landscape.

»Must support modern, open source tools for application development, visualizations, crowdsourcing, and deployment on the Web

Page 8: Brief on Linked Data for U.S. EPA's Chief Data Scientist


One option - MarkLogic Integrates Hadoop Ecosystem & EPA’s Robust Data Landscape

Image Credit: http://www.marklogic.com/what-is-marklogic/features/hadoop-integration/

Page 9: Brief on Linked Data for U.S. EPA's Chief Data Scientist

EPA Robust Data Ecosystem is adaptable using a Linked Data Approach

» Makes data integration faster and easier

» By using a global addressing scheme, HTTP URIs.

» Uses semantics to “glue” together data faster.

» Common semantic definitions link traditional relational models.

» No more out-of-date documentation when using standard vocabularies.

» Robust search and discovery by leveraging the semantic graph.

» Scales to the Web!
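The claim that HTTP URIs make integration faster can be illustrated with a small sketch. It assumes two silos that describe the same facility under one shared identifier. The example.org URIs and the `describe` helper are hypothetical, chosen only for illustration. Only the rdfs:label IRI is a real W3C term.

```python
# Hypothetical identifier under example.org; real EPA URIs will differ.
FACILITY = "http://example.org/id/facility/110000000001"

# Each silo holds (subject, predicate, object) statements about the facility.
registry_silo = {
    (FACILITY, "http://www.w3.org/2000/01/rdf-schema#label", "Example Plant"),
    (FACILITY, "http://example.org/vocab/county", "Fairfax"),
}
releases_silo = {
    (FACILITY, "http://example.org/vocab/reportedRelease", "benzene"),
}


def describe(graph, subject):
    """Collect every predicate and its values known about one URI."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, set()).add(o)
    return description


# Because both silos use the same URI, integration is a plain set union.
merged = registry_silo | releases_silo
facts = describe(merged, FACILITY)
for predicate, values in sorted(facts.items()):
    print(predicate, sorted(values))
```

No join keys or mapping tables are needed in this sketch because the shared URI already serves as the global key.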


Page 10: Brief on Linked Data for U.S. EPA's Chief Data Scientist

All modern data platforms deployed at EPA should

»Support options for data modeling - Linked Data (JSON-LD, RDF), SQL (JSON, XML)

»Native store and query of documents, blobs and structured data.

»Standards-based query interface across documents and data, e.g., Full support for SPARQL 1.1

»Offer enterprise functionality including high availability & disaster recovery, scalability & elasticity, ACID transactions

»Be deployable on a FedRAMP-certified cloud provider with certified controls for security, high availability, disaster recovery

»Scale to billions of statements, triples, etc.

»Store unstructured data across clusters like Hadoop, making it easy to move data partitions.
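As an illustration of the JSON-LD modeling option listed above, the following sketch shows how an ordinary JSON record becomes Linked Data by adding an `@context` that maps short keys to IRIs. The facility record and its example.org identifier are hypothetical. The `expand_keys` helper is a deliberately simplified stand-in for real JSON-LD expansion, not a conforming processor.

```python
import json

# Hypothetical facility record; the example.org identifier is a placeholder.
facility = {
    "@context": {
        "name": "http://www.w3.org/2000/01/rdf-schema#label",
        "lat": "http://www.w3.org/2003/01/geo/wgs84_pos#lat",
        "long": "http://www.w3.org/2003/01/geo/wgs84_pos#long",
    },
    "@id": "http://example.org/id/facility/110000000001",
    "name": "Example Plant",
    "lat": 38.96,
    "long": -77.36,
}


def expand_keys(doc):
    """Replace short keys with the IRIs given in @context.

    Simplified sketch only: full JSON-LD expansion also handles nested
    contexts, typed values, and more.
    """
    context = doc["@context"]
    return {context.get(key, key): value
            for key, value in doc.items() if key != "@context"}


serialized = json.dumps(facility, indent=2)
expanded = expand_keys(json.loads(serialized))
print(json.dumps(expanded, indent=2))
```

The document stays readable to any JSON tool, while the context lets a Linked Data consumer interpret each key unambiguously.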

Page 11: Brief on Linked Data for U.S. EPA's Chief Data Scientist

»Much but not all of EPA’s data is well suited for a Linked Data approach.

»Linked Data is based on a 20+ year old idea: a system of linked information systems

Book cover: Linked Data: Structured data on the Web, by David Wood, Marsha Zaidman, and Luke Ruth, with Michael Hausenblas. Foreword by Tim Berners-Lee. Manning.

Page 12: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Goals: Governmental transparency and/or improved internal efficiencies

Governments Worldwide are using a Linked Data Approach

Page 13: Brief on Linked Data for U.S. EPA's Chief Data Scientist
Page 14: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Linked Data Apps use data from many EPA programs and other Open Data Sources

Page 15: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Linked Data Management System

For government open data publishing

Funded by

Page 16: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Linked Data Platform is in QA now! https://usepa.3roundstones.net Anticipated to move to production in 2016.

Page 17: Brief on Linked Data for U.S. EPA's Chief Data Scientist

shared innovation™

Search for facilities where we live. Unlike many EPA Web portals, linked data is human AND machine readable data. No screen scraping is required. Encourages re-use (discourages data silos)

Page 18: Brief on Linked Data for U.S. EPA's Chief Data Scientist

The EPA Linked Data service CONNECTS data silos, and provides familiar map and table data views

Page 19: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Click to drill down to pollution reports that combine data from 5 previously unconnected data silos.

Page 20: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Click through to the source of the pollution data via the source reports (TRI).

Page 21: Brief on Linked Data for U.S. EPA's Chief Data Scientist

EPA collects granular pollution data. Linked Data opens up the data to a much wider audience in a human readable format.

Page 22: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Previously, only people who employed complex screen scraping techniques could get at this data. Now, EPA open data is available using an international data standard, with one click!

Page 23: Brief on Linked Data for U.S. EPA's Chief Data Scientist
Page 24: Brief on Linked Data for U.S. EPA's Chief Data Scientist
Page 25: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Good news story! Pollution graphs created in one week using Open Source Software & EPA Linked Data

Page 26: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Shared vocabularies, e.g. Places, Geographies, Dublin Core, Geo, FOAF, ORG, vCard, are the “lingua franca” of data interoperability
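A short sketch of how shared vocabularies work in practice: each vocabulary has a namespace IRI, and a compact name such as `foaf:name` expands to a globally unique term. The prefix table below uses the published namespaces for Dublin Core Terms, FOAF, ORG, and vCard. Treating "Geo" as the W3C Basic Geo vocabulary is an assumption, and the `expand` helper is hypothetical.

```python
# Published namespace IRIs; "geo" is assumed to mean W3C Basic Geo (WGS84).
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "org": "http://www.w3.org/ns/org#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}


def expand(compact_name):
    """Expand a prefix:local name into a full IRI using PREFIXES."""
    prefix, separator, local = compact_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {compact_name!r}")
    return PREFIXES[prefix] + local


for term in ("dcterms:title", "foaf:name", "geo:lat"):
    print(term, "->", expand(term))
```

Two datasets that both use `dcterms:title` agree on what a title is without any bilateral mapping, which is the sense in which the vocabularies act as a common language.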

Page 27: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Case Study

Using EPA Linked Data to assist chronic asthma/COPD patients with timely weather alerts

Funded by

Page 28: Brief on Linked Data for U.S. EPA's Chief Data Scientist
Page 29: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Diagram labels: User; NOAA; US EPA AirNow; DBpedia; National Library of Medicine; US EPA SunWise

Page 30: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Case Study: Orgpedia

An open organizational data project on public & private companies

Funded by

Page 31: Brief on Linked Data for U.S. EPA's Chief Data Scientist
Page 32: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Using the Callimachus Open Source Data Platform, we rapidly built a crowdsourcing platform.

Page 33: Brief on Linked Data for U.S. EPA's Chief Data Scientist

3 Round Stones provides commercial application support on the cloud or behind the enterprise firewall using

@3RoundStones http://3RoundStones.com

Page 34: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Diagram: Callimachus combines a content management system (unstructured text) with a Linked Data management system (structured data).

Page 35: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Callimachus supports in-browser development

Page 36: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Callimachus Enterprise customers are creating data-driven applications with data from leading graph databases:

Page 37: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Callimachus is a scalable Web application server for publishing and consuming open data

Who uses it?

• Government, international publishers, healthcare / life sciences

What pain does Callimachus address?

• Integration of data silos where a graph approach is needed

• Rapid creation of visualizations, dashboards (mashups) & infographics

• Less expensive solution than a data warehouse

Example apps?

• Collaborative knowledge management

• Publishing workflow

• Drug discovery / clinical trials

• Predictive analytics

Page 38: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Callimachus is fanatical about data interoperability & portability

Supports:

• HTML5, XHTML5, CSS3, JavaScript

• XQuery, XProc, XPath, XSLT

• SPARQL 1.1 Query, Update, Federated Query, Service Description, Property Paths, Graph Store HTTP Protocol

• RDF/XML, RDF/Turtle, JSON-LD, SPARQL XML, SPARQL JSON

Page 39: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Contractor (3 Round Stones, Inc.)

Public

Application, Script or automated client

Web Browser

SPARQL endpoint, REST API, Resource URIs

Linked Data management system located at a Tier 1 Cloud Provider (FISMA compliant)

RDF Database

Registered developer
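The architecture above exposes a SPARQL endpoint to applications and scripts. As a rough sketch of what an automated client might do, the code below builds a SPARQL 1.1 Protocol query request using only the Python standard library. The endpoint URL is a placeholder, since the slides do not give the real endpoint path, and the query is illustrative only. The request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; the actual service URL is not given in the slides.
ENDPOINT = "https://example.org/sparql"

# Illustrative query: list up to ten resources that have an rdfs:label.
QUERY = """
SELECT ?facility ?label WHERE {
  ?facility <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol 'query via GET' request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
# To execute against a real endpoint: urllib.request.urlopen(request)
```

Because the protocol is a W3C standard, the same client code should work against any conforming endpoint by changing only the URL.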

Page 40: Brief on Linked Data for U.S. EPA's Chief Data Scientist

<HTML>

Enterprise Data Documents

Read/ Write

Point to, include

Callimachus Enterprise

Page 41: Brief on Linked Data for U.S. EPA's Chief Data Scientist

“Big Data Is Important, but Open Data Is More Valuable”

As change agents, enterprise architects can help their organizations become richer through strategies such as open data.

David Newman, VP Research, Gartner

Page 42: Brief on Linked Data for U.S. EPA's Chief Data Scientist

Open Source Enterprise License

Community supported Commercial support

in-browser development, deployment, backups

Linked Data publication

User profiles, social sharing

Document, app management

OpenAnnotation support

External datasources

Shared deployments

Realms (virtual hosts)

Enterprise management

Cloud deployments

Callimachus

Callimachus™, the Callimachus logo, Callimachus Enterprise™, the Callimachus Enterprise logo and tagline, are trademarks of 3 Round Stones, Inc. and are registered in the United States and abroad. Copyright © 2011-2016 3 Round Stones, Inc. All rights reserved.

Callimachus Enterprise