mining the web of linked data with rapidminer

14
Mining the Web of Linked Data with RapidMiner Introducing the RapidMiner Linked Open Data Extension Petar Ristoski, Christian Bizer, Heiko Paulheim

Upload: heiko-paulheim

Post on 03-Jul-2015

1.008 views

Category:

Data & Analytics


1 download

DESCRIPTION

Lots of data from different domains is published as Linked Open Data. While there are quite a few browsers for that data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this challenge entry, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining platform RapidMiner, and offers operators for accessing Linked Open Data in RapidMiner, allowing for using it in sophisticated data analysis workflows without the need to know SPARQL or RDF. As an example, we show how statistical data on scientific publications, published as an RDF data cube, can be linked to further datasets and analyzed using additional background knowledge from various LOD datasets.

TRANSCRIPT

Page 1: Mining the Web of Linked Data with RapidMiner

Mining the Web of Linked Data with RapidMiner

Introducing the RapidMinerLinked Open Data Extension

Petar Ristoski, Christian Bizer, Heiko Paulheim

Page 2: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 2

Motivation

Which factors lead to a high corruption rate?

How to improve the quality of living?

How to find good books to read?

How to prevent inflation?What makes cars to consume less fuel?

How to publish more scientific articles?

How to decrease the electricity consumption?

Page 3: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 3

Motivation

??

Page 4: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 4

Motivation

LODLocalData

link combine cleanse transform analyze

Page 5: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 5

RapidMiner Linked Open Data Extension

Introducing RapidMiner:

● An open source platform for data mining and predictive analytics● Processes are designed by wiring operators in a GUI

(no programming)● Operators for data loading, transformation, modeling, visualization, …● Scalable, distributed, parallel processing in a cloud environment● 200,000 active users

● Developers can write their own extensions

Page 6: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 6

RapidMiner Linked Open Data Extension

• The extension adds operators for

– accessing local and remote semantic web data (RDF, SPARQL, …)

– linking local to remote data (e.g., DBpedia Lookup)

– enriching local data (e.g., with data properties from LOD sources)

– automatically following links to other datasets

– exploiting semantic schemata for optimizing attribute subset selection(DiscoveryScience'14)

– matching and fusing data from different sources

• Data analysts can use it without knowing SPARQL etc.

Page 7: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 7

Example Use Case

• Which factors correlate with the increase of published scientific and technical journal articles?

• RapidMiner workflow:

– Import data from WorldBank RDF data cube

– Link countries to DBpedia

– Explore additional datasets

– Generate attributes

– Analyze the results

• now live!

Page 8: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 8

Example Use Case

• Starting from links to DBpedia, we follow links and collect data from

– DBpedia

– Linked GeoData

– Eurostat

– GeoNames

– WHO’s Global Health Observatory

– Linked Energy Data

– OpenCyc

– World Factbook

– YAGO

• Related data is fused

– e.g., population figures from different sources

Page 9: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 9

Example Use Case

• Factors that correlate with large number of publications

– The fragile state index – FSI (positive)

– Human development index – HDI (positive)

– GDP (positive)

• wealthier countries being able to invest more federal money into science funding?

– For EU countries, the number of EU seats (positive)

• an increasing fraction of EU funding for science being attributed to those countries?

– Many climate indicators (precipitation, hours of sun, temperature)

• unequal distribution of wealth across different climate zones?

Page 10: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 10

Other Use Cases

• Improving performance of predictive models (RMWorld'14)

– UCI car dataset: predicting fuel consumption

• Reducing the prediction error of M5' by half

– on average, we are wrong by 1.6 instead of 2.9 MPG

Page 11: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 12

Other Use Cases

• Building Semantic Recommeder Systems (ESWC'14)

• Combines two extensions:

– Linked Open Data extension

– Recommender system extension

• Use data about books for content-based recommender

– best system (out of 24) on two out of three tasks

Page 12: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 13

Other Use Cases

• Debugging Linked Open Data

– loading a subset of statements

– augment with additional features

– run outlier detection

• again: a specialextension

• Example: identify wrong dataset interlinks (WoDOOM'14)

– AUC up to 85%

Page 13: Mining the Web of Linked Data with RapidMiner

10/27/14 Ristoski, Bizer, Paulheim 14

Summary

• This challenge entry

– brings data analysis to the web of data

– can be used by data analysts without learning SPARQL

• Availability

– on the RapidMiner marketplace

– installable from inside RapidMiner

– >4,000 installations and counting

Page 14: Mining the Web of Linked Data with RapidMiner

Mining the Web of Linked Data with RapidMiner

Introducing the RapidMinerLinked Open Data Extension

Petar Ristoski, Christian Bizer, Heiko Paulheim