First they have to find it: Getting Government Data Discovered and Used
Adapted from: John S. Erickson, Ph.D., Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, New York, USA
Twitter: @olyerickson #TWCRPI
<Panel: The Art & Science of Data Visualization>
Open Government Data Around the World
Starting with efforts in the US and UK, governments around the world have recognized the need to publish their critical data
Percent of total collection (from 1M+ datasets)
Diverse Approaches to Open Gov't Data
Government data initiatives have taken many forms
GovData portals are widely varied in how they help users discover and use relevant datasets
Percent of total catalogs (from 192 catalogs)
Federated Discovery of Government Data
Stakeholders have seen the need for federated discovery across catalogs, especially from within major search engines including Bing, Google, Yahoo! and Yandex
Government Data in the linked open data cloud
http://linkeddata.org/
Government data currently makes up more than half of the cloud by size (~17B triples), with tens of thousands of links to other data (internal and external)
Linked Data is Not Enough...
• Publishing open government data as Linked Data is not enough
• For OGD to be useful, datasets must be published using metadata, markup standards and presentation that aid discovery and use
Dataset Metadata for Discovery and Use
Recent work at TWC RPI demonstrates the value of applying emerging standards for uniformly describing government datasets and catalogs
International Open Government Dataset Search
TWC's IOGDS application is an aggregated catalog of more than 1M datasets from over 192 dataset catalogs from governments at every level around the world
See: http://logd.tw.rpi.edu
Anticipates the W3C DCAT RDF vocabulary
Demonstrates what a comprehensive federated catalog based on DCAT and an aggregation API might look like
International Open Government Dataset Search
IOGDS is a multi-year effort based on downloading, scraping or accessing APIs, converting metadata to a proto-DCAT model, and publishing via endpoint and download
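The conversion step described above can be sketched in a few lines. This is only an illustration, not the IOGDS pipeline itself, which relies on csv2rdf4lod and its own proto-DCAT model. The column names (id, title, description, homepage), the base URI, and the choice of DCAT and Dublin Core terms are all assumptions made for the example.

```python
import csv
import io

# Namespaces for the vocabulary terms used in this sketch.
DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(text):
    """Escape backslashes, double quotes, and newlines for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_turtle(row, base_uri):
    """Turn one scraped metadata row (a dict) into a DCAT-style Turtle description.

    The keys 'id', 'title', 'description', and 'homepage' are hypothetical
    column names; 'homepage' is assumed to already be a valid IRI.
    """
    subject = f"<{base_uri}{row['id']}>"
    return "\n".join([
        f"{subject} a <{DCAT}Dataset> ;",
        f'    <{DCTERMS}title> "{escape_literal(row["title"])}" ;',
        f'    <{DCTERMS}description> "{escape_literal(row["description"])}" ;',
        f"    <{DCAT}landingPage> <{row['homepage']}> .",
    ])


def csv_to_turtle(csv_text, base_uri):
    """Convert a whole CSV export (header row required) into Turtle text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n\n".join(row_to_turtle(row, base_uri) for row in reader)
```

Calling csv_to_turtle on the CSV text produced by a per-site scraper yields one Turtle block per dataset, which could then be loaded into a triple store behind a SPARQL endpoint.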
Figure: IOGDS Workflow (diagram labels: Catalogs, Web, API, Download, per-site scraper code, ad hoc code, CSV, csv2rdf4lod automation, IOGDS)
See: http://logd.tw.rpi.edu
Schema.org: Semantic Markup for Discovery
TWC RPI has published dataset listings based on IOGDS using emerging microdata standards, especially the schema.org model endorsed by Bing, Google, Yahoo! and Yandex
Schema.org datasets extension
• TWC RPI's schema.org dataset extension will enable government dataset catalogs to more easily be parsed and indexed by the major search engines...
• ...which will help users find relevant datasets!
• TWC's dataset extension entered public discussion June 2012
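To make the idea of search-engine-readable dataset listings concrete, here is a minimal sketch that renders one dataset as an HTML fragment with schema.org microdata. It is not TWC's generator. The function name and its parameters are hypothetical, and the property names (name, description, url, includedInDataCatalog) follow schema.org as published today, which may differ from the 2012 draft proposal.

```python
from html import escape


def dataset_microdata(name, description, url, catalog_name, catalog_url):
    """Return an HTML fragment describing one dataset with schema.org microdata.

    All text and attribute values are HTML-escaped before being embedded.
    """
    return "\n".join([
        '<div itemscope itemtype="http://schema.org/Dataset">',
        f'  <h2 itemprop="name">{escape(name)}</h2>',
        f'  <p itemprop="description">{escape(description)}</p>',
        f'  <a itemprop="url" href="{escape(url)}">Dataset page</a>',
        # The nested item links the dataset to the catalog that lists it.
        '  <div itemprop="includedInDataCatalog" itemscope',
        '       itemtype="http://schema.org/DataCatalog">',
        f'    <a itemprop="url" href="{escape(catalog_url)}">',
        f'      <span itemprop="name">{escape(catalog_name)}</span>',
        '    </a>',
        '  </div>',
        '</div>',
    ])
```

A catalog page would emit one such fragment per dataset, so that a crawler that understands microdata can index the title, description, and owning catalog without site-specific scraping.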
Schema.org datasets extension
The schema.org datasets extension enables relevant datasets to be more easily discovered by a range of stakeholders including researchers, data journalists, bloggers and developers
Schema.org datasets extension
“...we've reviewed the current datasets schema proposal in draft, and we are comfortable with the current state of things...”
“...At this point, if the group would solidify on the dataset proposal, then Data.gov would support and use it.”
--- Chris Musialek
CKAN Data Catalog Scheme & Protocol
API-based catalog federation is also possible
CKAN announced a DCAT-based query/federation API
that enables OAI-PMH-like harvesting and more
Dataset extension to schema.org
Demo/ links
http://www.w3.org/wiki/WebSchemas/Datasets
http://www.w3.org/wiki/WebSchemas/SchemaDotOrgProposals
Good introduction (longer, with more context):
http://www.slideshare.net/joshsh/semantic-markup-using-schemaorg
Examples of current schema.org results
http://schema-creator.org/event.php
http://schema-creator.org/product.php
To do…
Get Google, Bing, Yahoo, … to crawl these pages
It might look like this: http://www.google.com/publicdata/directory
From Jim Hendler:
Google is now building custom search engines that will pull down schema.org
Dan Brickley is working on one from the Dataset schema, not yet public
There's also an open govt data search – not much in it, but looks nice – it's at http://www.google.com/publicdata/directory
Retrieve all the LOGD datasets:

PREFIX dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#>
PREFIX conv: <http://purl.org/twc/vocab/conversion/>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?dataset ?catalog ?catalog_id ?title ?desc ?country ?homepage ?agency_id ?contributor_id
WHERE {
  ?dataset a conv:CatalogedDataset .
  ?dataset void:inDataset ?catalog .
  ?catalog dcterms:identifier ?catalog_id .
  ?dataset <http://purl.org/dc/terms/title> ?title .
  ?dataset dcterms:description ?desc .
  OPTIONAL { ?dataset dgtwc:catalog_country ?country . }
  OPTIONAL { ?dataset <http://xmlns.com/foaf/0.1/homepage> ?homepage . }
  OPTIONAL { ?dataset dgtwc:agency ?agency .
             ?agency dcterms:identifier ?agency_id . }
  OPTIONAL { ?dataset <http://purl.org/dc/terms/contributor> ?contributor .
             ?contributor dcterms:identifier ?contributor_id . }
  #?dataset dgtwc:catalog_country <http://dbpedia.org/resource/United_States> .
}
Courtesy: Josh Shinavier (RPI/TWC)
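A query like the one above could be sent to a SPARQL endpoint over HTTP roughly as follows. This sketch uses only the Python standard library and assumes the endpoint accepts GET requests with a query parameter and can return the standard SPARQL JSON results format. The endpoint URL is left to the caller because the source does not state it, and the shortened EXAMPLE_QUERY is for illustration only.

```python
import json
import urllib.parse
import urllib.request

# A shortened version of the dataset query, using the same vocabulary terms.
EXAMPLE_QUERY = """
PREFIX conv: <http://purl.org/twc/vocab/conversion/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?dataset ?title WHERE {
  ?dataset a conv:CatalogedDataset .
  ?dataset dcterms:title ?title .
} LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request that asks the endpoint for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts.

    Variables left unbound by OPTIONAL clauses are simply absent from a row.
    """
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query and return the flattened rows (performs network I/O)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as resp:
        return bindings_to_rows(json.load(resp))
```

Keeping request construction and result flattening separate from the network call makes both easy to check without a live endpoint.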
A large number of datasets:
http://logd.tw.rpi.edu/schemaorg_dataset_extension
http://www.google.com/webmasters/tools/richsnippets?url=http://logd.tw.rpi.edu/schemaorg_dataset_extension&view=
http://logd.tw.rpi.edu/page/international_dataset_catalog_search
Latest from Josh:
Datasets-as-Linked-Data demo. The RDFa in the pages is not only correct w.r.t. schema.org but is also presented in such a way that an RDFa-aware Linked Data crawler can hop from datasets to catalogs, back again, into DBpedia, etc. while gathering the RDFa as linked RDF.
Since we now have Datasets-ish RDFa markup in the main IOGDS dataset pages (i.e. the pages which the URIs of the datasets redirect to), we're pretty close to a completely integrated demo.
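The "hop from datasets to catalogs" behaviour described above can be approximated with a small sketch. This is a deliberate simplification and not a conforming RDFa processor: it only collects the targets of elements that carry a property or rel attribute together with a resource or href, and it ignores vocab, prefix, and typeof handling. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RDFaLinkCollector(HTMLParser):
    """Collect (predicate, target URL) pairs from RDFa-style attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        predicate = attributes.get("property") or attributes.get("rel")
        target = attributes.get("resource") or attributes.get("href")
        # Only keep elements that name a predicate and point somewhere;
        # note that plain rel links such as stylesheets are kept as well.
        if predicate and target:
            self.links.append((predicate, urljoin(self.base_url, target)))


def rdfa_links(html_text, base_url):
    """Return the outgoing links a crawler could follow from one page."""
    collector = RDFaLinkCollector(base_url)
    collector.feed(html_text)
    return collector.links
```

A crawler built on this would fetch a dataset page, call rdfa_links, and queue each returned URL (for example, the catalog page) for the next fetch.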
What remains: (1) the current markup has some problems. We need to fix those; (2) we need markup for catalogs as well as datasets…
Needed (1) and (2):
To fix (1), we need to make changes to the LODSPeaKr templates that automatically generate those pages, to make them compliant with the model Josh developed.
To fix (2), we'll work with Alvaro (Graves) to create LODSPeaKr-based automation to generate catalog pages in an efficient way.
(2) presents more of a challenge than (1), since the IOGDS implementation of dataset details pages is already mostly correct.
Still need Dan B. to assist with getting them found…
What we need:
Willingness to adopt the dataset schema extension – we need lots of datasets to start showing up
We (TWC) will be pushing out some tools, more demos and how-tos, very soon
Wanna play? http://wiki.esipfed.org/index.php/DatasetSchema