data mining with background knowledge from the web - introducing the rapidminer linked open data...

1. Data Mining with Background Knowledge from the Web: Introducing the RapidMiner Linked Open Data Extension
   Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, Christian Bizer
   08/20/14

2. Motivation: An Example Data Mining Task
   - Analyzing book sales

     ISBN           City       Sold
     3-2347-3427-1  Darmstadt  124
     3-43784-324-2  Mannheim   493
     3-145-34587-0  Rodorf     14

     ISBN           City       Population  ...  Genre   Publisher     ...  Sold
     3-2347-3427-1  Darmstadt  144402      ...  Crime   Bloody Books  ...  124
     3-43784-324-2  Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
     3-145-34587-0  Rodorf     12019       ...  Travel  Up&Away       ...  14

   - → Crime novels sell better in larger cities

3. Motivation
   - Many data mining problems are solved better when you have more background knowledge (leaving scalability aside)
   - Problems:
     - Tedious work
     - Selection bias: what to include?

4. Linked Open Data in a Nutshell
   - Started in 2007
   - A collection of ~1,000 open datasets from various domains, e.g., general knowledge, government data, using Semantic Web standards (HTTP, RDF, SPARQL, ...)
   - Machine processable
   - Free of charge
   - Sophisticated tool stacks

5. Linked Open Data in a Nutshell
   - http://lod-cloud.net/

6. Example: DBpedia

7. The RapidMiner LOD Extension

8. The RapidMiner LOD Extension
   - Automatic discovery of links to Linked Open Data for local data objects
     - e.g., the database entry "Boston" is linked to http://dbpedia.org/resource/Boston
   - Automatic generation of attributes
     - e.g., add all numeric values found for Boston (and other cities)
   - Plus:
     - Feature selection algorithms optimized for LOD
     - Automatic following of links to other datasets
     - Schema matching (coming soon)
   - No need to know Semantic Web technologies!

9. Example: the Auto MPG Dataset
   - A well-known UCI dataset
   - Goal: predict fuel consumption of cars
   - Hypothesis: background knowledge → more accurate predictions
   - Used background knowledge: entity types and categories from DBpedia (=Wikipedia)

10. Example: the Auto MPG Dataset
   - A well-known UCI dataset
   - Goal: predict fuel consumption of cars
   - Hypothesis: background knowledge → more accurate predictions
   - Used background knowledge: entity types and categories from DBpedia (=Wikipedia)
   - Result: M5Rules down to almost half the prediction error
     - i.e., on average, we are wrong by 1.6 instead of 2.9 MPG

11. Example: the Auto MPG Dataset
   - The original attributes are cylinders, displacement, horsepower, weight, acceleration, model, origin, plus name (unique string) and mpg (target)
   - Models built are, e.g., high horsepower/weight → high consumption
   - Additional attributes lead to further insights, e.g.:
     - front-wheel drives have a lower consumption than rear-wheel drives
     - hatchbacks have a lower consumption than station wagons
     - rally cars generally have a low consumption

12. Example: Analyzing Statistics
   - As shown, e.g., at ESWC 2012, SemStats 2013
   - Statistics found on the web often contain only a few attributes
     - extreme case: only entity + target
   - Examples:
     - Quality of living in cities (right)
     - Corruption by country
     - Fertility rate by country
     - Suicide rate by country
     - Box office revenue of films
     - ...

13. Example: Analyzing Statistics
   - Process in RapidMiner:
     - load statistic
     - link entities (cities, countries, etc.) to LOD cloud
     - collect additional attributes
     - analyze for correlations with target attribute of statistic

14. Example: Analyzing Statistics
   - Quality of living in cities worldwide: indicators for low quality
     - too hot (highest temperature in June exceeds 27°C)
     - too cold (highest temperature in January below 16°C)
     - too big (total area exceeds 334 km²)
     - poor cultural life (no music recordings made in this city)
     - or simply: wrong place on the map (latitude
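
The four-step process from slide 13 (load statistic, link entities, collect attributes, analyze correlations) can be illustrated with a small Python sketch. This is not the RapidMiner extension's implementation. The linker below simply guesses a DBpedia URI from the entity label, the `background` dictionary stands in for attributes that would be collected from the LOD cloud, and all entity names, attribute names, and numbers are made-up toy values chosen only to show the mechanics.

```python
from math import sqrt
from urllib.parse import quote

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def link_entity(label):
    """Guess a DBpedia URI from a label (naive assumption, not the extension's linker)."""
    return DBPEDIA_PREFIX + quote(label.strip().replace(" ", "_"))


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_attributes(statistic, background):
    """Rank background attributes by absolute correlation with the target.

    statistic:  list of (entity label, target value) pairs
    background: dict mapping entity URI -> {attribute name: numeric value}
    """
    # Link each entity and keep only those with background knowledge.
    rows = []
    for label, target in statistic:
        uri = link_entity(label)
        if uri in background:
            rows.append((background[uri], target))

    # Correlate every collected attribute with the target.
    scores = {}
    for name in sorted({a for attrs, _ in rows for a in attrs}):
        pairs = [(attrs[name], t) for attrs, t in rows if name in attrs]
        if len(pairs) >= 3:  # too few points make the coefficient meaningless
            xs, ys = zip(*pairs)
            scores[name] = pearson(xs, ys)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy statistic: hypothetical cities with a hypothetical quality score.
statistic = [("Alpha", 10.0), ("Beta", 20.0), ("Gamma", 30.0), ("Delta", 40.0)]

# Toy background knowledge standing in for attributes collected from LOD.
background = {
    link_entity("Alpha"): {"max_temp_june": 40.0, "area_km2": 5.0},
    link_entity("Beta"): {"max_temp_june": 30.0, "area_km2": 1.0},
    link_entity("Gamma"): {"max_temp_june": 20.0, "area_km2": 4.0},
    link_entity("Delta"): {"max_temp_june": 10.0, "area_km2": 2.0},
}

ranked = rank_attributes(statistic, background)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

In this toy data the June temperature attribute is ranked first because it is perfectly negatively correlated with the score. A real setup would replace the `background` dictionary with attributes retrieved from a dataset such as DBpedia and would need a more robust linker than label-to-URI guessing.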
