Introducing JSONpedia
Introduction to JSONpedia, a JSON version of Wikipedia
JSONpedia
Facilitating consumption of MediaWiki content.
WWW.SPAZIODATI.EU
Michele Mostarda <[email protected]>, TW: @micmos
Wednesday, 10 October 2012
What is JSONpedia?
“JSONpedia is a library and a web service meant to read WikiText markup as JSON.”
‣ Initially conceived as a tool to produce data to train Machine Learning models.
‣ The REST service, inspired by Sweeble Crystalball, produces JSON, HTML and (coming soon) RDF data.
‣ Built on a context-dependent, event-based parser, intended to be more performant than a regex-based parser (like the wikiparser) or a DOM-based parser (like Sweeble).
Differences with Sweeble
‣ Lightweight event-based parser.
‣ More tolerant of the frequent syntax errors present within WikiText pages.
‣ Serializes to JSON output, which is easier to consume!
Differences with DBpedia
‣ JSONpedia does not add any semantics to the extracted data.
‣ JSONpedia could integrate the current DBpedia regex-based parser.
‣ JSONpedia is not a competitor of DBpedia but rather a complement.
JSONpedia Internals
Architecture
[Diagram: Input WikiText, Parser, processors (Structure, Validator, Extractor, Splitter, Linker + DBpedia API/Freebase), Output JSON]
WikiText Parser Events

// Document bounding.
void beginDocument(URL document);
void endDocument();

// Error handling.
void parseWarning(String msg, ParserLocation location);
void parseError(Exception e, ParserLocation location);

// Tag handling.
void beginTag(String node, Attribute[] attributes);
void endTag(String node);
void inlineTag(String node, Attribute[] attributes);
void commentTag(String comment);

// Sections
void section(String title, int level);

// References
void beginReference(String label);
void endReference(String label);

// Links
void beginLink(String url);
void endLink(String url);

// Lists
void beginList();
void listItem();
void endList();

// Templates
void beginTemplate(String name);
void endTemplate(String name);

// Tables
void beginTable();
void headCell(int row, int col);
void bodyCell(int row, int col);
void endTable();

// Generic parameter
void parameter(String param);

// parameter / text value
void text(String content);
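As a rough illustration of how a consumer might handle these events, the sketch below implements a small subset of them and records every call it receives. The names `WikiTextEvents`, `EventTrace` and `EventTraceDemo` are assumptions made for the example, and the calls are written by hand: the real parser is not invoked, and the order in which it would fire events for a given page is not confirmed by the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical subset of the parser events listed above.
interface WikiTextEvents {
    void section(String title, int level);
    void beginLink(String url);
    void endLink(String url);
    void text(String content);
}

// Records each received event as a readable line, in arrival order.
class EventTrace implements WikiTextEvents {
    final List<String> lines = new ArrayList<>();

    @Override public void section(String title, int level) {
        lines.add("section(" + title + ", " + level + ")");
    }

    @Override public void beginLink(String url) {
        lines.add("beginLink(" + url + ")");
    }

    @Override public void endLink(String url) {
        lines.add("endLink(" + url + ")");
    }

    @Override public void text(String content) {
        lines.add("text(" + content + ")");
    }
}

class EventTraceDemo {
    // Hand-written calls standing in for a parser run (assumed sequence).
    static List<String> run() {
        EventTrace trace = new EventTrace();
        trace.section("History", 2);
        trace.beginLink("Rome");
        trace.endLink("Rome");
        trace.text(" is old");
        return trace.lines;
    }

    public static void main(String[] args) {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```

A handler of this shape never sees the whole document at once, which is what keeps an event-based consumer lightweight compared with building a DOM first.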
WikiText Processors
‣ Structure
‣ Extractors
‣ Linkers
‣ Splitters
‣ Validator
Processors receive the stream of events generated by the parser and perform data construction and transformation.
Structure
The Structure Processor receives a stream of WikiText parsing events and builds a one-to-one JSON representation of the document DOM.
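One common way to turn begin/end events into a tree is to keep a stack of currently open nodes. The sketch below shows that idea for templates and text only. It is an assumption about how such a processor could work, not JSONpedia's actual implementation: the class name `StructureSketch`, the `type`/`name`/`content` keys, and the use of nested `Map`/`List` objects in place of real JSON are all invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Builds a nested Map/List tree from begin/end events using a stack of open nodes.
class StructureSketch {
    private final Map<String, Object> root = node("document", null);
    private final Deque<Map<String, Object>> open = new ArrayDeque<>();

    StructureSketch() {
        open.push(root);
    }

    // Creates a node with the hypothetical keys "type", "name" and "content".
    private static Map<String, Object> node(String type, String name) {
        Map<String, Object> n = new LinkedHashMap<>();
        n.put("type", type);
        if (name != null) {
            n.put("name", name);
        }
        n.put("content", new ArrayList<Object>());
        return n;
    }

    @SuppressWarnings("unchecked")
    private List<Object> currentContent() {
        return (List<Object>) open.peek().get("content");
    }

    // A begin event opens a child node and makes it the current container.
    void beginTemplate(String name) {
        Map<String, Object> n = node("template", name);
        currentContent().add(n);
        open.push(n);
    }

    // An end event closes the current container.
    void endTemplate(String name) {
        open.pop();
    }

    // Leaf events are appended to whichever node is currently open.
    void text(String content) {
        currentContent().add(content);
    }

    Map<String, Object> result() {
        return root;
    }

    public static void main(String[] args) {
        StructureSketch s = new StructureSketch();
        s.text("Intro ");
        s.beginTemplate("Infobox");
        s.text("population");
        s.endTemplate("Infobox");
        System.out.println(s.result());
    }
}
```

Because each event maps to exactly one node or leaf, the resulting tree mirrors the event stream one-to-one.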
Extractors
Extractors are specific Processors that collect a certain type of data from the event stream: for example the SectionsExtractor collects the list of all sections detected in the document stream.
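A minimal sketch of what a sections extractor could look like, assuming it only reacts to the `section(String title, int level)` event and ignores everything else. The class name `SectionsExtractorSketch` and the `Section` holder are hypothetical; the slides do not show the real SectionsExtractor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extractor: collects section events and ignores the rest.
class SectionsExtractorSketch {
    // One collected section: its title and heading level.
    static final class Section {
        final String title;
        final int level;

        Section(String title, int level) {
            this.title = title;
            this.level = level;
        }

        @Override public String toString() {
            return level + ":" + title;
        }
    }

    private final List<Section> sections = new ArrayList<>();

    // Mirrors the section(String title, int level) parser event.
    void section(String title, int level) {
        sections.add(new Section(title, level));
    }

    // Text events are irrelevant to this extractor.
    void text(String content) {
        // intentionally ignored
    }

    List<Section> getSections() {
        return sections;
    }

    public static void main(String[] args) {
        SectionsExtractorSketch extractor = new SectionsExtractorSketch();
        extractor.section("History", 2);
        extractor.text("Some body text");
        extractor.section("Early years", 3);
        System.out.println(extractor.getSections());
    }
}
```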
Linkers
A Linker is a Processor that links the current document entity to other information acquired from external sources. An example of a Linker is the FreebaseLinker, which connects an entity to its corresponding representation in Freebase, if any.
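The linking step can be pictured as a lookup against an external source whose result, if any, is attached to the document. The sketch below is only an illustration under assumptions: `LinkerSketch`, the `link` key and the identifier `example:rome` are made up, and the external source is replaced by an in-memory function rather than any real Freebase or DBpedia call.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical linker: enriches a document with an identifier from an external source.
class LinkerSketch {
    // Stand-in for the external source: maps an entity title to an identifier, if known.
    private final Function<String, Optional<String>> lookup;

    LinkerSketch(Function<String, Optional<String>> lookup) {
        this.lookup = lookup;
    }

    // Returns a copy of the document with a "link" entry added when the lookup succeeds.
    Map<String, Object> link(String title, Map<String, Object> document) {
        Map<String, Object> out = new LinkedHashMap<>(document);
        lookup.apply(title).ifPresent(id -> out.put("link", id));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> fakeSource = new HashMap<>();
        fakeSource.put("Rome", "example:rome"); // made-up identifier

        LinkerSketch linker = new LinkerSketch(t -> Optional.ofNullable(fakeSource.get(t)));

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", "Rome");
        System.out.println(linker.link("Rome", doc));
        System.out.println(linker.link("Atlantis", doc)); // no match: document unchanged
    }
}
```

Returning an `Optional` keeps the "if any" case explicit: a document with no external match passes through without a link.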
Splitters
A Splitter is a Processor able to cut subtrees out of the JSON document built by the Structure processor. An example of a Splitter is the TableSplitter, which extracts the JSON structures representing the tables declared in the document.
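Cutting subtrees out of a nested document amounts to a recursive walk that collects the nodes of interest. The sketch below assumes the document is a tree of `Map` and `List` objects in which each node carries a `type` key. That layout, the class name `TableSplitterSketch` and the `node` helper are assumptions for the example, and the sketch collects references to table nodes without removing them from the original tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical splitter: collects every subtree whose "type" is "table".
class TableSplitterSketch {
    static List<Map<?, ?>> splitTables(Object node) {
        List<Map<?, ?>> found = new ArrayList<>();
        collect(node, found);
        return found;
    }

    private static void collect(Object node, List<Map<?, ?>> found) {
        if (node instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) node;
            if ("table".equals(map.get("type"))) {
                found.add(map);
                return; // assumption: nested tables stay inside their parent table
            }
            for (Object child : map.values()) {
                collect(child, found);
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                collect(child, found);
            }
        }
    }

    // Helper that builds a node with the assumed "type" and "content" keys.
    static Map<String, Object> node(String type, Object... content) {
        Map<String, Object> n = new LinkedHashMap<>();
        n.put("type", type);
        n.put("content", Arrays.asList(content));
        return n;
    }

    public static void main(String[] args) {
        Object doc = node("document",
                "intro",
                node("table", "cell A", "cell B"),
                node("section", node("table", "cell C")));
        System.out.println(splitTables(doc).size()); // prints 2
    }
}
```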
Validator
A Validator is a Processor that checks the data structures parsed from a document.
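The slides do not say which checks the Validator performs. As one plausible example, the sketch below verifies that begin and end tag events are balanced and reports anything left open at the end of the document. The class name `ValidatorSketch`, the messages, and the simplified `beginTag(String)` signature (without the `Attribute[]` argument) are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical validator: checks that begin/end tag events are balanced.
class ValidatorSketch {
    private final Deque<String> open = new ArrayDeque<>();
    private final List<String> problems = new ArrayList<>();

    void beginTag(String node) {
        open.push(node);
    }

    void endTag(String node) {
        if (open.isEmpty()) {
            problems.add("unexpected end of '" + node + "'");
        } else if (!open.peek().equals(node)) {
            // Mismatch: report it and leave the stack untouched.
            problems.add("expected end of '" + open.peek() + "' but found '" + node + "'");
        } else {
            open.pop();
        }
    }

    // Anything still open when the document ends is reported as unclosed.
    void endDocument() {
        for (String node : open) {
            problems.add("unclosed '" + node + "'");
        }
        open.clear();
    }

    List<String> getProblems() {
        return problems;
    }

    public static void main(String[] args) {
        ValidatorSketch v = new ValidatorSketch();
        v.beginTag("div");
        v.beginTag("span");
        v.endTag("span");
        v.endDocument(); // "div" was never closed
        System.out.println(v.getProblems());
    }
}
```

Collecting problems instead of throwing fits the tolerance to syntax errors mentioned earlier: the document can still be processed while its defects are recorded.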
Forthcoming Features
‣ JSONpedia DB (based on MongoDB + ElasticSearch) that can be queried online. JSONpedia dumps will also be available.
‣ Online data model Exporter Tool (CSV)
‣ RDF output.
Release
JSONpedia will be fully released as open source by the end of the year.
Live Demo
http://bit.ly/jsonpedia
or
http://json.it.dbpedia.org/frontend/form.html