Modelling the Web: Examples of Modelling Text, Knowledge Networks and Physical-Social Systems
Steffen Staab
Steffen [email protected]
WeST, Web Science & Technologies
University of Koblenz ▪ Landau, Germany

What do people want from the Web?
Web as storage: library, memory
Web as tool: search, transaction
Web as social medium: communication, cooperation
Web as mirror of self: identification, outreach

My Agenda in the Large
Web Content: Discovering patterns, Building tools, Understanding
Web Interaction: Monitoring, Exploiting, Guiding, Understanding
Web Evolution: Monitoring, Predicting, Guiding, Understanding

My Agenda for Today
Web Content, Web Interaction, Web Evolution
1. Modelling Text
2. Modelling Network Evolution
3. Modelling Physical-Social Data

1. Modelling Text

Language Models
What follows “UK is”?
Conditional probability of the next word given the preceding words [formula not preserved in transcript]
Issue: Long word sequences are rarely observed in the training data
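
The sparsity issue can be made concrete with a minimal sketch. It assumes the usual maximum-likelihood estimate, count(context + word) / count(context); the slide's own formula is not preserved, so this form, the function name `mle_probability` and the toy corpus are illustrative assumptions.

```python
from collections import Counter

def mle_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context).

    Returns None when the context was never observed with a following word.
    This is an assumed textbook estimator, not the formula from the slide.
    """
    context = tuple(context)
    n = len(context) + 1
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Count the context only where a next word follows it.
    history = sum(c for gram, c in grams.items() if gram[:-1] == context)
    if history == 0:
        return None
    return grams[context + (word,)] / history

# Toy corpus, invented for illustration.
corpus = ("the uk is a country . the uk is an island nation . "
          "the uk has a queen").split()

p_short = mle_probability(corpus, ["uk"], "is")               # short context
p_long = mle_probability(corpus, ["the", "uk", "is"], "a")    # longer context
p_unseen = mle_probability(corpus, ["uk", "is", "not"], "a")  # never observed
```

With a longer context, fewer occurrences support each estimate, and an unseen context yields no estimate at all, which is the motivation for smoothing.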

Modified Kneser-Ney Smoothing of n-grams
If a sequence is hard to observe, then approximate it recursively by observing the marginal frequencies of [shortened sequences; formulas not preserved in transcript]
First recursion step: [formula not preserved in transcript]
Problem: If the last word in the sequence is rare, the overall sequence will be rare, and the approximation will be of low quality.
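
The problem stated above can be illustrated with a small sketch of the backoff chain. It assumes the standard n-gram backoff scheme in which each recursion step drops the first word of the sequence; `backoff_chain` is a hypothetical helper name, and no discounting is computed here.

```python
def backoff_chain(ngram):
    """Return the n-gram followed by its successively shortened versions.

    Each step drops the first (oldest) word, so the final word, the one
    being predicted, is part of every approximation. Assumed standard
    backoff order; the slide's formulas are not preserved.
    """
    ngram = tuple(ngram)
    return [ngram[i:] for i in range(len(ngram))]

chain = backoff_chain(["the", "wild", "black", "cat"])
for level in chain:
    print(" ".join(level))
```

Because every element of the chain ends in the same last word, a rare last word makes every lower-order count small as well.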

Generalized Language Models [ACL14]
If a sequence is too hard to observe, then approximate it recursively, based on the marginal probabilities of [sequences with skipped words; formulas not preserved in transcript]
Core idea of the formal solution: recursively applicable, commutative skip operators
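
A minimal sketch of what a recursively applicable, commutative skip operator could look like. It assumes a skip replaces one word by a wildcard while keeping the sequence length, and that the last word (the one being predicted) is never skipped; the names `skip`, `generalizations` and the wildcard symbol are illustrative, not taken from [ACL14].

```python
WILDCARD = "*"

def skip(ngram, position):
    """Replace the word at `position` with a wildcard, keeping the length."""
    ngram = tuple(ngram)
    return ngram[:position] + (WILDCARD,) + ngram[position + 1:]

def generalizations(ngram):
    """All patterns obtained by skipping any subset of context positions.

    The last word is never skipped (an assumption of this sketch).
    """
    patterns = {tuple(ngram)}
    for position in range(len(ngram) - 1):
        # Apply the skip at this position to every pattern found so far.
        patterns |= {skip(p, position) for p in patterns}
    return patterns

phrase = ("the", "wild", "black", "cat")
# Skips at different positions can be applied in either order.
commutes = skip(skip(phrase, 0), 2) == skip(skip(phrase, 2), 0)
patterns = generalizations(phrase)
```

Commutativity means the set of generalized patterns does not depend on the order in which skips are applied, so the recursion can enumerate subsets of positions rather than sequences of operations.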

Improvement of GLMs [ACL14]
Evaluation measure: perplexity
Data set: English Wikipedia, different sample sizes
Relative improvement: 2.6% (most training data, smallest model) to 13.9% (least training data, largest model)
[Figure: normalized perplexity]
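
For reference, a short sketch of how perplexity and a relative improvement are typically computed. The definition used is the standard one (the exponential of the negative mean log-probability per word); the numbers are invented and are not results from [ACL14].

```python
import math

def perplexity(probabilities):
    """Perplexity of a test sequence, given the model probability of each word."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))

def relative_improvement(baseline, improved):
    """Fractional reduction in perplexity relative to the baseline."""
    return (baseline - improved) / baseline

# A model that assigns probability 1/4 to every word is as uncertain as a
# uniform choice among four words.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# Hypothetical perplexities, chosen only to show the arithmetic.
gain = relative_improvement(200.0, 150.0)
```

Lower perplexity is better, so a positive relative improvement means the second model is less surprised by the test data.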

Outlook for Generalized Language Models
Correcting mistakes that are made in all tools
Lack of appropriate models
Other operators (“the wild black cat”): Delete: “the black cat”; Part-of-speech: “the adj adj cat”
Application: e.g. next-word prediction
Other data structures: tree-like data, graph data
proposal for Google
current focus
Semantic Web

2. Modelling Network Evolution

Evolution of Networks [ICWSM13]
[Figure labels: Additions, Removals, Training]
Link Prediction Problem
Unlink Prediction Problem
Markov assumption: history irrelevant

Related Work in Brief
Static features: degree, common-neighbours, path3, local-clustering-coefficient/embeddedness, ...
A prediction feature f assigns a score to each node pair (i, j).
f(i, j) > f(i, k) implies that (i, j) is to be ranked above (i, k):
• Link Prediction: edge likelier to be added
• Unlink Prediction: edge likelier to be removed
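
A minimal sketch of ranking node pairs by a static feature. The graph, the candidate pairs and the function names are invented for illustration; only two of the listed features (common neighbours and a degree-based score) are shown, in their common textbook forms.

```python
def common_neighbours(graph, i, j):
    """Number of nodes adjacent to both i and j."""
    return len(graph[i] & graph[j])

def degree_product(graph, i, j):
    """Product of the two node degrees (a simple degree-based score)."""
    return len(graph[i]) * len(graph[j])

def rank_pairs(graph, pairs, feature):
    """Sort node pairs so that higher-scoring pairs come first."""
    return sorted(pairs, key=lambda pair: feature(graph, *pair), reverse=True)

# Undirected toy graph as an adjacency dictionary (hypothetical data).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
    "e": set(),
}
candidates = [("b", "d"), ("b", "e"), ("d", "e")]
ranking = rank_pairs(graph, candidates, common_neighbours)
```

For link prediction, the top of the ranking holds the pairs judged likeliest to gain an edge; for unlink prediction the same machinery ranks existing edges by how likely they are to be removed.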

The Snapshot View
Link and unlink prediction [ICWSM13]
Unlink prediction is much more difficult than link prediction

Related Work in Brief
Markov assumption: history irrelevant
Advantage: general model
Disadvantage: general model
Idea: keep generality, improve prediction

Our Approach - 1
Hypothesis: Temporal information generally improves prediction
Idea: (1) nodes concerned, (2) neighbourhood

Our Approach - 2
Dynamic features: + recency, + longevity
Extrapolation for temporal preferential attachment: [formula not preserved in transcript]
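
A rough sketch of how recency and longevity might be derived from a timestamped edge log. The definitions used here (recency as time since a node's last event, longevity as time since its first event) are assumptions of this sketch, as are the event format and the name `temporal_features`; [ICWSM13] may define the features differently.

```python
def temporal_features(events, node, now):
    """Return (recency, longevity) of `node` at time `now`, or None if unseen.

    events: iterable of (time, u, v, kind), with kind "add" or "remove".
    recency: time since the node last took part in any event (assumed).
    longevity: time since the node first appeared (assumed).
    """
    times = [t for t, u, v, _ in events if node in (u, v) and t <= now]
    if not times:
        return None
    return now - max(times), now - min(times)

# Hypothetical edge log.
events = [
    (1, "a", "b", "add"),
    (4, "a", "c", "add"),
    (7, "a", "b", "remove"),
    (9, "c", "d", "add"),
]

features_a = temporal_features(events, "a", now=10)
features_d = temporal_features(events, "d", now=10)
```

Such per-node values can then be combined into a pair score for (i, j), in the same way as the static features.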

Evaluation & Discussion (excerpt)
Temporal link prediction is significantly better, but only slightly
Temporal unlink prediction is always significantly improved
Temporal preferential attachment performs best
[Figure: AUC for baseline, qualitative, quantitative and extrapolation features]
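
For readers unfamiliar with the measure, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores below are invented.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical feature scores: pairs that did gain an edge versus pairs
# that did not.
score = auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes the measure comparable across link and unlink prediction despite their different class balances.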

Outlook for Evolution of Networks
Temporal dynamics still underexplored: lack of datasets!
Next experiments:
• Twitter followers
• Xing.de
Unlinks lead to link recommendation: new Wikipedia link (reorganization of Wikipedia pages!), new job, new friend

3. Modelling Physical-Social Data

Tagged photos with geo-coordinates from Flickr
[Figure: photo locations annotated with tag sets such as (fish, rice), (seafood, fish, salmon), (lobster, seafood, shrimp), (coffee, wine), (pizza, wine), (pasta, wine)]

Tasks: Discovering topics, finding clusters
[Figure: tagged photo locations grouped into two topics: (seafood, fish, lobster, shrimp, crab, wine, salmon) and (wine, pizza, coffee, italian, pasta)]
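
The two tasks can be illustrated with a deliberately simple sketch: assign each photo to the nearest of a few given centers and count the tags per cluster. The coordinates, the fixed centers and the use of planar Euclidean distance are assumptions made for illustration; this is not the model from [WSDM14].

```python
from collections import Counter

def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: (point[0] - centers[k][0]) ** 2
        + (point[1] - centers[k][1]) ** 2,
    )

def tags_per_cluster(photos, centers):
    """Count tag occurrences among the photos assigned to each center."""
    counts = [Counter() for _ in centers]
    for location, tags in photos:
        counts[nearest(location, centers)].update(tags)
    return counts

# Hypothetical photos: ((x, y), tags). Tag names follow the slide.
photos = [
    ((0.0, 0.0), ["fish", "rice"]),
    ((0.5, 0.2), ["seafood", "fish", "salmon"]),
    ((0.2, 0.6), ["lobster", "seafood", "shrimp"]),
    ((5.0, 5.0), ["pizza", "wine"]),
    ((5.3, 4.8), ["pasta", "wine"]),
    ((4.9, 5.4), ["coffee", "wine"]),
]
centers = [(0.0, 0.0), (5.0, 5.0)]
clusters = tags_per_cluster(photos, centers)
```

The most frequent tags of each cluster then give a crude picture of a local topic, here a seafood area and an Italian-food area.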

Challenge
Cultural areas, country borders, geographical features and other geographical observations exhibit complex spatial distributions
[Figure source: wikipedia.org]

Existing approaches: Gaussian regions
A. Ahmed, L. Hong and A. Smola, 2013 (following (Yin et al 2011; Sizov 2010))
[Figure: tagged photo locations and the two topic word lists]

MGTM 1: Global Topic Clustering
[Figure: tagged photo locations and the two topic word lists]

MGTM 2: Determining Neighbourhoods
[Figure: tagged photo locations and the two topic word lists]
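
One simple way to determine neighbourhoods between clusters is a symmetric k-nearest-neighbour graph over the cluster centers, sketched below. This is a stand-in chosen for brevity; the slide does not say how MGTM derives adjacency, and the centers and the parameter `k` are hypothetical.

```python
def neighbourhoods(centers, k):
    """Symmetric k-nearest-neighbour adjacency between cluster centers."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    adjacency = {i: set() for i in range(len(centers))}
    for i, center in enumerate(centers):
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: dist2(center, centers[j]),
        )
        for j in others[:k]:
            # Record the relation in both directions.
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency

# Hypothetical cluster centers on a line; the last one is far away.
centers = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
adjacency = neighbourhoods(centers, k=1)
```

Making the relation symmetric ensures that even an outlying cluster is connected to the cluster it is closest to, so information can flow in both directions.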

MGTM 3: Derived Topic Model
Cluster adjacency
Dependencies of document-specific topic distributions
Exchange of topic information between clusters

MGTM 4: Exchange of Topic Information
Exchange of topic information between clusters
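
The intuition of exchanging topic information between adjacent clusters can be sketched as a single smoothing step in which each cluster's topic distribution is mixed with the mean of its neighbours' distributions. This is a simplification written for illustration, not the inference procedure of MGTM; the mixing `weight`, the distributions and the adjacency are all hypothetical.

```python
def exchange(topic_dists, adjacency, weight=0.5):
    """Mix each cluster's topic distribution with the mean of its neighbours'."""
    updated = {}
    for cluster, dist in topic_dists.items():
        neighbours = adjacency.get(cluster, ())
        if not neighbours:
            updated[cluster] = list(dist)
            continue
        mean = [
            sum(topic_dists[n][t] for n in neighbours) / len(neighbours)
            for t in range(len(dist))
        ]
        updated[cluster] = [
            (1 - weight) * d + weight * m for d, m in zip(dist, mean)
        ]
    return updated

# Three clusters on a chain, two topics (hypothetical values).
topic_dists = {0: [1.0, 0.0], 1: [0.5, 0.5], 2: [0.0, 1.0]}
adjacency = {0: {1}, 1: {0, 2}, 2: {1}}
smoothed = exchange(topic_dists, adjacency)
```

Because each update is a convex combination of probability distributions, the results remain valid distributions, and clusters with little data of their own borrow evidence from nearby clusters.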

Evaluation: Anecdotal, Perplexity, Gaming
Gaming study: intrusion detection
Model          Precision, 8 topics (avg / median)
LGTA           0.60 / 0.58
Basic model    0.64 / 0.58
MGTM           0.78 / 0.75
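
A sketch of how average and median precision could be obtained in an intrusion-detection study, assuming a protocol in which participants are shown the top words of a topic plus one intruder word and the per-topic precision is the fraction of participants who pick the intruder. The protocol and the judgments below are assumptions for illustration and do not reproduce the study's data.

```python
from statistics import mean, median

def topic_precision(answers, intruder):
    """Fraction of participants who identified the intruder word."""
    return sum(1 for answer in answers if answer == intruder) / len(answers)

def summarize(judgments):
    """Average and median precision over topics."""
    precisions = [
        topic_precision(answers, intruder) for intruder, answers in judgments
    ]
    return mean(precisions), median(precisions)

# Hypothetical judgments: (true intruder, words chosen by four participants).
judgments = [
    ("pizza", ["pizza", "pizza", "wine", "pizza"]),
    ("coffee", ["coffee", "coffee", "coffee", "coffee"]),
    ("rice", ["fish", "rice", "salmon", "rice"]),
]
avg, med = summarize(judgments)
```

Reporting the median alongside the average guards against a few very easy or very hard topics dominating the comparison.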

Outlook for LDA with structure
Texts + social network structures: scientometry, xing.de
Web pages + user visits: chefkoch.de
Future: Knowledge about social aspects needed
Future: CS style models for social sciences

References
[ACL14] R. Pickhardt, T. Gottron, M. Körner, P. G. Wagner, T. Speicher, S. Staab. A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser Ney Smoothing. In: Proc. of ACL-2014 - The 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, June 22-27, 2014.
[WSDM14] C. Kling, J. Kunegis, S. Sizov, S. Staab. Detecting Non-Gaussian Geographical Topics in Tagged Photo Collections. In: Proc. of the 7th ACM Conference on Web Search and Data Mining (WSDM2014), New York, US, February 24-28, 2014.
[ICWSM13] J. Preusse, J. Kunegis, M. Thimm, T. Gottron, S. Staab. Structural Changes in Collaborative Knowledge Networks. In: Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM 2013), Boston, July 8-10, 2013.
Semantic Web
Social Web & Web Retrieval
Interactive Web & Human Computing
Web & Economy
Software & Services
Web Science & Technologies Team & Research
Computational Social Science
Thank You!