Exploiting Linked Open Data as Background Knowledge in Data Mining

DESCRIPTION

Invited talk at the Data Mining on Linked Data (DMoLD) workshop, co-located with ECML-PKDD 2013.

TRANSCRIPT

Slide 1: Exploiting Linked Open Data as Background Knowledge in Data Mining
Heiko Paulheim, University of Mannheim

Slide 2: Outline
- Motivation
- The original FeGeLOD framework
- Experiments
- Applications
- The RapidMiner Linked Open Data Extension
- Challenges and future work

Slide 3: Motivation: An Example Data Mining Task
Analyzing book sales. The original data:

ISBN          | City      | Sold
3-2347-3427-1 | Darmstadt | 124
3-43784-324-2 | Mannheim  | 493
3-145-34587-0 | Rodorf    | 14

Enriched with background knowledge:

ISBN          | City      | Population | ... | Genre  | Publisher    | ... | Sold
3-2347-3427-1 | Darmstadt | 144402     | ... | Crime  | Bloody Books | ... | 124
3-43784-324-2 | Mannheim  | 291458     | ... | Crime  | Guns Ltd.    | ... | 493
3-145-34587-0 | Rodorf    | 12019      | ... | Travel | Up&Away      | ... | 14

Possible finding: crime novels sell better in larger cities.

Slide 4: Motivation
Many data mining problems are solved better when more background knowledge is available (leaving scalability aside). Problems: collecting it is tedious work, and there is a selection bias (what to include?).

Slide 5: Motivation
The Linked Open Data cloud: http://lod-cloud.net/

Slide 6: Motivation
Idea: reuse background knowledge from Linked Open Data and include it in the data mining process as needed. Two main variants:
- develop mining/learning algorithms that run directly on Linked Data
- create relational features from Linked Data

Slide 7: Motivation
Develop mining/learning algorithms, e.g., DL-Learner or dedicated kernel functions. Advantages:
- can be quite efficient
- no reduction to a flat table structure
- semantics can be respected directly

Slide 8: Motivation
Create relational features, e.g., LiDDM, AutoSPARQL, or FeGeLOD / the RapidMiner Linked Open Data Extension. Advantages:
- easy combination of knowledge from various sources, including relational features in the original data
- arbitrary mining algorithms/tools can be used

Slide 9: FeGeLOD: Feature Generation from LOD
Pipeline, illustrated on the book sales example: Named Entity Recognition maps City = Darmstadt to the URI http://dbpedia.org/resource/Darmstadt; Feature Generation adds attributes such as City_URI_dbpedia-owl:populationTotal = 141471; Feature Selection keeps only the useful generated features.

Slide 10: FeGeLOD: Feature Generation from LOD
Original prototype, based on Weka:
- simple NER (guessing URIs)
- seven generators: direct types, data properties, unqualified relations (boolean, numeric), qualified relations (boolean, numeric), and individuals (dangerous! may be restricted to a specific property)
- simple feature selection: filtering features that have only* different values (except numerical ones), that have only* identical values, or that are mostly missing*
*) 95% or 99%

Slide 11: Experiments
Testing with two* standard machine learning data sets:
- Zoo: classifying animals
- AAUP: predicting the income of university employees (regression task)
Question: how much improvement do the additional features bring?
*) standard ML datasets with speaking labels are scarce!

Slide 12: Experiments: Zoo Dataset

Slide 13: First Results: AAUP
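The feature generation and selection steps on slides 9-10 can be sketched in a few lines of Python. This is a minimal illustration, not the original Weka-based FeGeLOD code: it assumes the public DBpedia SPARQL endpoint and the SPARQLWrapper library, implements only two of the seven generators (direct types and data properties), and adds the near-identical/near-unique/mostly-missing filters from slide 10.

```python
# Minimal sketch of FeGeLOD-style feature generation (not the original
# Weka-based code): given an entity URI, pull direct types and literal-valued
# ("data") properties from DBpedia and turn them into flat features.
# Assumes the public DBpedia SPARQL endpoint and the SPARQLWrapper library.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def generate_features(entity_uri, prefix="City_URI_"):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    features = {}

    # Generator: direct types (rdf:type) as boolean features
    sparql.setQuery(f"SELECT ?t WHERE {{ <{entity_uri}> a ?t }}")
    for row in sparql.query().convert()["results"]["bindings"]:
        features[prefix + "type_" + row["t"]["value"]] = True

    # Generator: data properties (literal values), e.g. dbpedia-owl:populationTotal
    sparql.setQuery(f"""
        SELECT ?p ?o WHERE {{
            <{entity_uri}> ?p ?o .
            FILTER(isLiteral(?o))
        }}""")
    for row in sparql.query().convert()["results"]["bindings"]:
        features[prefix + row["p"]["value"]] = row["o"]["value"]

    return features

def select_features(table, threshold=0.95):
    """Simple filter in the spirit of slide 10: drop features whose values are
    (almost) all identical, (almost) all different (except numerical ones),
    or mostly missing. `table` is a list of feature dicts, one per row."""
    keep = []
    n = len(table)
    all_keys = {key for row in table for key in row}
    for key in all_keys:
        values = [row[key] for row in table if key in row]
        if not values or 1 - len(values) / n >= threshold:
            continue  # mostly missing
        most_common = max(values.count(v) for v in set(values)) / len(values)
        distinct = len(set(values)) / len(values)
        numeric = all(isinstance(v, (int, float)) for v in values)
        if most_common >= threshold:
            continue  # (almost) all identical
        if distinct >= threshold and not numeric:
            continue  # (almost) all different
        keep.append(key)
    return keep

# Example: enrich the book-sales row for Darmstadt (slides 3 and 9)
row = {"ISBN": "3-2347-3427-1", "City": "Darmstadt", "Sold": 124}
row.update(generate_features("http://dbpedia.org/resource/Darmstadt"))
```

The remaining generators (qualified and unqualified relations, individuals) would follow the same pattern with additional SPARQL queries.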
Slide 14: Experiments: Early Insights
Additional features often improve the results:
- Zoo dataset: Ripper 89.11 → 96.04, SMO 93.07 → 97.03; no improvement for Naive Bayes.
- AAUP dataset (compensation): M5 59.88 → 51.28, SMO 74.12 → 61.97; no improvement for linear regression.
...but they may also cause problems:
- extreme example: 6.54 → 189.90 for linear regression
- memory problems and timeouts due to large datasets

Slide 15: Experiments: Quality of Features
Information gain of the generated features on the Zoo dataset.

Slide 16: Experiments: Quality of Features
Information gain of the generated features on the AAUP dataset (compensation).

Slide 17: Application: Classifying Events from Wikipedia
Event extraction from Wikipedia; joint work with Dennis Wegener and Daniel Hienert (GESIS). Task: event classification (e.g., Politics, Sports, ...). http://www.vizgr.org/historical-events/timeline/

Slide 18: Application: Classifying Events from Wikipedia
Source material: http://www.vizgr.org/historical-events/timeline/

Slides 19-20: Application: Classifying Events from Wikipedia
Positive examples for the class Politics:
- 2011, March 15: German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants.
- 2010, June 3: Christian Wulff is nominated for President of Germany by Angela Merkel.
Negative examples for the class Politics:
- 2010, July 7: Spain defeats Germany 1-0 to win its semi-final and, for the first time, reaches the 2010 FIFA World Cup Final along with the Netherlands.
- 2012, February 16: Roman Lob is selected to represent Germany in the Eurovision Song Contest.
Possible learned model: "Angela Merkel" → Politics

Slide 21: Application: Classifying Events from Wikipedia
Possibly learned model: "Angela Merkel" → Politics. How can we do better? With background knowledge from Linked Open Data:
- 2011, March 15: German chancellor Angela Merkel [class: Politician] shuts down the seven oldest German nuclear power plants.
- 2012, May 13: Elections in North Rhine-Westphalia; Hannelore Kraft [class: Politician] is elected to continue as Minister-President, heading an SPD-Green coalition.
Model learned in that case: "[class: Politician]" → Politics

Slide 22: Application: Classifying Events from Wikipedia
The model "[class: Politician]" → Politics is much more general:
- it can also classify events with politicians not contained in the training set
- fewer training examples are required: a few events with politicians, athletes, singers, ... are enough

Slide 23: Application: Classifying Events from Wikipedia
Experiments on Wikipedia data: >10 categories, 1,000 labeled examples as training set, classification accuracy 80%. Plus: we have trained a language-independent model! Learned models often look like "elect*" → Politics; with class features, the same model also covers examples in other languages:
- 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt. (German: Peter Altmaier is appointed Federal Minister for the Environment as successor of Norbert Röttgen.)
- 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för Vänsterpartiet efter Lars Ohly [class: Politician]. (Swedish: Jonas Sjöstedt is elected new party leader of Vänsterpartiet, succeeding Lars Ohly.)
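The class-feature idea from slides 19-23 can be sketched as follows. This is not the pipeline used in the talk, only a minimal illustration: it assumes that entity mentions have already been linked to DBpedia URIs by some NER/linking step, and it simply appends the first DBpedia ontology class of each linked entity to the text before a standard text classifier is trained.

```python
# Sketch of the idea on slides 19-23 (not the exact pipeline from the talk):
# augment entity mentions in an event description with their DBpedia ontology
# class, so a text classifier can learn "[class: Politician] -> Politics"
# instead of memorizing "Angela Merkel -> Politics".
# Assumes mentions are already linked to DBpedia URIs by an earlier NER step.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

def dbpedia_classes(uri):
    """Return the DBpedia ontology classes (dbpedia-owl:*) of an entity."""
    sparql.setQuery(f"SELECT ?t WHERE {{ <{uri}> a ?t }}")
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["t"]["value"].rsplit("/", 1)[-1]
            for r in rows
            if r["t"]["value"].startswith("http://dbpedia.org/ontology/")]

def augment(text, linked_mentions):
    """linked_mentions: {surface form -> DBpedia URI} for entities in text."""
    for surface, uri in linked_mentions.items():
        classes = dbpedia_classes(uri)
        if classes:
            # For simplicity we take the first ontology class; a real system
            # might prefer the most specific one (e.g., Politician over Person).
            text = text.replace(surface, f"{surface} [class: {classes[0]}]")
    return text

# Example from slide 19; the mention-to-URI mapping here is assumed.
event = "Christian Wulff is nominated for President of Germany by Angela Merkel."
print(augment(event, {
    "Angela Merkel": "http://dbpedia.org/resource/Angela_Merkel",
    "Christian Wulff": "http://dbpedia.org/resource/Christian_Wulff",
}))
```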
Slide 24: Application: Classifying Tweets
Joint work with Axel Schulz and Petar Ristoski (SAP Research). Goal: using Twitter for emergency management. Example tweets:
- "fire at #mannheim #university omg"
- "two cars on fire #A5 #accident"
- "fire at train station still burning"
- "my heart is on fire!!! come on baby light my fire"
- "boss should fire that stupid moron"

Slide 25: Application: Classifying Tweets
Social media contains data on many incidents, but keyword search is not enough: detecting small incidents is hard, and manual inspection is too expensive (and slow). Machine learning could help: train a model to classify incident vs. non-incident tweets, then apply it for detecting incident-related tweets. Training data: traffic accidents, ~2,000 tweets containing relevant keywords (car, crash, etc.), hand labeled (50% related to traffic incidents).

Slide 26: Application: Classifying Tweets
Learning to classify tweets from positive and negative examples. Features: stemming, POS tagging, word n-grams. Accuracy ~90%, but it drops to ~85% when applying the model to a different city.

Slide 27: Application: Classifying Tweets
Example set: "Again crash on I90", "Accident on I90". Learned model: "I90" indicates a traffic accident. Applying the model: "Two cars crashed on I51" → not related to a traffic accident.

Slide 28: Using LOD for Preventing Overfitting
Example set: "Again crash on I90", "Accident on I90". DBpedia knows that dbpedia:Interstate_90 and dbpedia:Interstate_51 both have rdf:type dbpedia-owl:Road. Learned model: dbpedia-owl:Road indicates a traffic accident. Applying the model: "Two cars crashed on I51" → indicates a traffic accident. Using DBpedia Spotlight + FeGeLOD, accuracy keeps up at 90%; overfitting is avoided.
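Slides 24-28 rely on DBpedia Spotlight to link mentions such as "I90" to DBpedia resources and then generalize over their types. Below is a minimal sketch against the public Spotlight demo endpoint; the endpoint URL and the JSON field names are assumptions about the current public service, not necessarily the setup used in the talk.

```python
# Sketch of the DBpedia Spotlight + type-feature step from slides 24-28.
# Uses the public Spotlight demo endpoint; its URL and JSON field names
# ("Resources", "@types") are assumptions about the current public service.
import requests

SPOTLIGHT = "https://api.dbpedia-spotlight.org/en/annotate"

def spotlight_types(tweet, confidence=0.5):
    """Link entity mentions in a tweet and collect their DBpedia types."""
    resp = requests.get(
        SPOTLIGHT,
        params={"text": tweet, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    types = set()
    for resource in resp.json().get("Resources", []):
        # "@types" is a comma-separated string such as "DBpedia:Road,Schema:Place"
        types.update(t for t in resource.get("@types", "").split(",") if t)
    return types

# The idea from slide 28: if both I90 and I51 resolve to resources typed as
# roads, a model learned on the Road type feature generalizes across cities.
print(spotlight_types("Again crash on I90"))
print(spotlight_types("Two cars crashed on I51"))
```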
Slide 29: Explaining Statistics
Statistics are very widespread: quality of living in cities, corruption by country, fertility rate by country, suicide rate by country, box office revenue of films, ...

Slide 30: Explaining Statistics
Questions we are often interested in: Why does city X have a high/low quality of living? Why is corruption higher in country A than in country B? Will a new film create a high/low box office revenue? In other words, we are looking for explanations and for forecasts (e.g., extrapolations).

Slide 31: Explaining Statistics
http://xkcd.com/605/

Slide 32: Explaining Statistics
What statistics often look like.

Slide 33: Explaining Statistics
There are powerful tools for finding correlations etc., but many statistics cannot be interpreted directly because background knowledge is missing. Approach: use Linked Open Data for enriching the statistical data (e.g., with FeGeLOD), then run analysis tools for finding explanations.

Slide 34: Prototype Tool: Explain-a-LOD
Loads a statistics file (e.g., CSV), adds background knowledge, runs a basic analysis (correlation, rule learning), and presents explanations.

Slide 35: Statistical Data: Examples
Data set: Mercer Quality of Living. Quality of living in 216 cities worldwide, normalized to NYC = 100 (value range 23-109), as of 1999. http://across.co.nz/qualityofliving.htm
LOD data sets used in the examples: DBpedia, and the CIA World Factbook for statistics by country.

Slide 36: Statistical Data: Examples
Examples for low-quality cities:
- big hot cities (junHighC >= 27 and areaTotalKm >= 334)
- cold cities where no music has ever been recorded (recordedIn_in = false and janHighC
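The "basic analysis" step of Explain-a-LOD (slide 34) can be approximated by correlating each generated numeric feature with the target statistic and reporting the strongest correlates as candidate explanations. The sketch below is not the actual tool's code; the feature names and values in the example rows are made up for illustration only.

```python
# Minimal sketch of the correlation part of Explain-a-LOD's "basic analysis"
# (slide 34): after enriching a statistic with LOD features, rank numeric
# features by their correlation with the target. The example rows below are
# illustrative, not real enriched Mercer data.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def explain(rows, target):
    """rows: list of dicts (one per city); target: name of the statistic column.
    Returns candidate explanations ordered by absolute correlation."""
    ys = [row[target] for row in rows]
    candidates = []
    for feature in rows[0]:
        if feature == target:
            continue
        xs = [row.get(feature) for row in rows]
        if any(not isinstance(x, (int, float)) for x in xs):
            continue  # this simple sketch only handles numeric features
        candidates.append((feature, pearson(xs, ys)))
    return sorted(candidates, key=lambda c: abs(c[1]), reverse=True)

# Hypothetical enriched quality-of-living rows (feature names illustrative only)
rows = [
    {"city": "Vienna",  "quality_of_living": 107, "populationTotal": 1714000, "junHighC": 25},
    {"city": "Baghdad", "quality_of_living": 23,  "populationTotal": 5672000, "junHighC": 41},
    {"city": "Zurich",  "quality_of_living": 106, "populationTotal": 382000,  "junHighC": 23},
]
for feature, r in explain(rows, "quality_of_living"):
    print(f"{feature}: r = {r:+.2f}")
```

Rule learning over nominal features (as in the slide 36 examples) would complement these correlations.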