![Page 1: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/1.jpg)
1
The evolution of a story in a network – a Web mining perspective
Bettina Berendt
www.cs.kuleuven.be/~berendt
![Page 2: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/2.jpg)
2
About me: My public (and mine-able) profile
: Information Systems: Computer Science / Cognitive Science: Artificial Intelligence: Business Science: Economics
: Computer Science
![Page 3: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/3.jpg)
3
Agenda
Web mining: Text, Link structure, Usage
Story evolution: texts change (T)
Story evolution: authors change (U)
Story evolution: communities of authors change (L)
Story evolution: reading behaviour changes (U)
![Page 4: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/4.jpg)
4
Web mining
![Page 5: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/5.jpg)
5
Information retrieval and data mining
What’s in this list?
How is it ordered?
![Page 6: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/6.jpg)
6
Information retrieval and data mining
![Page 7: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/7.jpg)
7
Data mining & Web mining
Knowledge discovery (aka Data mining):
“the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” ¹
Web mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources. Web mining areas:
Web content mining: texts, pictures, sounds, ...
Web structure mining: simple, bipartite, tripartite, ... graphs
Web usage mining: navigation, queries, content access & creation
¹ Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press
![Page 8: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/8.jpg)
8
Story evolution: texts change
– joint work with Ilija Subašić, 2008 –
* All references are given on slide no. 47
![Page 9: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/9.jpg)
9
Dynamic Web content
![Page 10: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/10.jpg)
10
A story begins
http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/05/22/nmaddy122.xml
![Page 11: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/11.jpg)
11
The story unfolds
![Page 12: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/12.jpg)
12
The story unfolds – new actors enter the stage (and old ones change their roles)
![Page 13: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/13.jpg)
13
Basic idea: A story is about relational statements; story stages are expressed by co-occurrences
Robert Murat – suspect
Kate McCann (the mother) – suspect
Gabriel Ruget’s talk
![Page 14: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/14.jpg)
14
Data collection and preprocessing
Articles from Google News 05/2007 – 11/2007 for the search term “madeleine mccann”
(there was a Google problem in the December archive)
Only English-language articles
For each month, the first 100 hits
Of these, all that were freely available: 477 documents
Preprocessing: HTML cleaning, tokenization, stopword removal
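The three preprocessing steps could be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline used in the study; the stopword list, the `TextExtractor` class and the function names are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; the slides do not name one).
STOPWORDS = {"a", "an", "and", "as", "at", "in", "is", "of", "the", "to", "was"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    """HTML cleaning: keep only the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Tokenization: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(html):
    """HTML cleaning, tokenization, then stopword removal."""
    return [tok for tok in tokenize(clean_html(html)) if tok not in STOPWORDS]
```

A real corpus would need a fuller stopword list and more careful boilerplate removal (navigation bars, ads) than this sketch provides.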
![Page 15: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/15.jpg)
15
Story elements
content-bearing words
the 150 top-TF words without stopwords
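Selecting the top-TF words might look like the sketch below. It assumes that TF means the total count of a word over the whole corpus and that documents are already tokenized with stopwords removed; `top_tf_words` is a hypothetical helper name.

```python
from collections import Counter


def top_tf_words(tokenized_docs, k=150):
    """Return the k most frequent words over all documents.

    Assumption: term frequency is the corpus-wide total count.
    """
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(k)]
```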
![Page 16: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/16.jpg)
16
Story stages: co-occurrence in a window
“mother” and “suspect” co-occur
• in a window of size ≥ 6 (all words)
• in a window of size ≥ 2 (non-stopwords only)
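A window test of this kind can be sketched as below. The sketch assumes that window size counts the tokens from the first word to the second, inclusive; the function names and the example sentence in the usage note are made up for illustration and are not taken from the corpus.

```python
def min_span(tokens, w1, w2):
    """Size of the smallest token window containing both words, or None.

    Assumption: the window size counts both end points, so two
    adjacent words have a span of 2.
    """
    best = None
    last = {}  # most recent position of each of the two target words
    for i, tok in enumerate(tokens):
        if tok in (w1, w2):
            last[tok] = i
            other = w2 if tok == w1 else w1
            if other in last:
                span = i - last[other] + 1
                if best is None or span < best:
                    best = span
    return best


def cooccur(tokens, w1, w2, window):
    """True if w1 and w2 appear together in some window of the given size."""
    span = min_span(tokens, w1, w2)
    return span is not None and span <= window
```

For the hypothetical sentence "mother is now also a suspect", `min_span` gives 6 over all words and 2 once the stopwords "is", "now", "also" and "a" are removed, which mirrors the two window sizes on the slide.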
![Page 17: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/17.jpg)
17
Salient story elements
1. Split the whole corpus T by week (week 17 = 30 Apr onwards, through week 44 = 12 Nov onwards)
2. For each week: compute the weights for this week's corpus t
3. Weight = support of the co-occurrence of 2 content-bearing words w1, w2 in t =
(# articles from t containing both w1, w2 in a window) / (# all articles in t)
4. Thresholds
Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
Time relevance TR of co-occurrence(w1, w2) =
(support(co-occurrence(w1, w2)) in t) / (support(co-occurrence(w1, w2)) in T) ≥ θ2 (e.g., 2) *
5. Rank by TR; for each week, identify the top 2
6. Story elements = peak words = all elements of these top 2 pairs (# = 38)
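Steps 2 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each article is represented as the set of word pairs that co-occur in it within a window, pairs are sorted tuples, and the θ1 threshold is applied to the number of articles containing the pair. The names `pair_counts` and `peak_words` are hypothetical.

```python
from collections import Counter


def pair_counts(articles):
    """Number of articles in which each word pair co-occurs."""
    counts = Counter()
    for pairs in articles:
        counts.update(set(pairs))
    return counts


def peak_words(weekly_articles, theta1=5, theta2=2.0, top_n=2):
    """weekly_articles maps a week number to a list of articles.

    Each article is a set of (w1, w2) tuples (assumed representation).
    Returns the set of words occurring in the top_n pairs of any week.
    """
    all_articles = [a for arts in weekly_articles.values() for a in arts]
    total_counts = pair_counts(all_articles)
    peaks = set()
    for week in sorted(weekly_articles):
        articles = weekly_articles[week]
        if not articles:
            continue
        scored = []
        for pair, count in pair_counts(articles).items():
            if count < theta1:  # threshold on occurrences in t
                continue
            support_t = count / len(articles)
            support_T = total_counts[pair] / len(all_articles)
            tr = support_t / support_T  # time relevance
            if tr >= theta2:
                scored.append((tr, pair))
        # Rank by TR (descending); ties broken alphabetically by pair.
        scored.sort(key=lambda item: (-item[0], item[1]))
        for _, pair in scored[:top_n]:
            peaks.update(pair)
    return peaks
```

A pair that is equally frequent in every week has TR = 1 and is filtered out by θ2, so only pairs that are unusually frequent in a given week contribute peak words.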
![Page 18: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/18.jpg)
18
Salient story stages, and story evolution
7. Story stage = co-occurrences of peak words in t
For each week t: aggregate over t-2, t-1, t (moving average)
8. Story evolution = how story stages evolve over the t in T
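The three-week aggregation could be sketched like this. It assumes that the weekly weights are stored as a dict of week number to a dict of pair to support, and that the first weeks are averaged over however many weeks are available; the slides do not say how those edge cases were handled.

```python
def moving_average(weekly_support, span=3):
    """Smooth pair weights over the weeks t-span+1 .. t.

    weekly_support: dict week -> dict pair -> support (assumed layout).
    A pair missing in a week counts as support 0 for that week.
    """
    smoothed = {}
    for week in sorted(weekly_support):
        window = [w for w in range(week - span + 1, week + 1)
                  if w in weekly_support]
        pairs = set()
        for w in window:
            pairs.update(weekly_support[w])
        smoothed[week] = {
            pair: sum(weekly_support[w].get(pair, 0.0) for w in window) / len(window)
            for pair in pairs
        }
    return smoothed
```

The sequence of smoothed dicts, one per week, is then one possible representation of the story evolution in step 8.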
![Page 19: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/19.jpg)
19
Story stages: Example result
<week 17>
<show sliders>
![Page 20: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/20.jpg)
20
Story evolution: result
<morphAll.py>
![Page 21: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/21.jpg)
21
... the story is lost if we go back to single entities
Robert Murat – suspect
Kate McCann (the mother) – suspect
![Page 22: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/22.jpg)
22
Future work
“beyond words” (e.g., semantics)
Web communities
Michael Barber’s talk
Gabriel Ruget’s qualia?!
![Page 23: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/23.jpg)
23
Story evolution: authors change (and stories with them)
![Page 24: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/24.jpg)
24
Multi-authored texts
http://en.wikipedia.org/wiki/Madeleine_McCann
![Page 25: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/25.jpg)
25
Who authored?
![Page 26: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/26.jpg)
26
Visualizing conflict – example “edit wars”
Viégas, Wattenberg, & Dave, 2004
![Page 27: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/27.jpg)
27
The bone of contention ...
![Page 28: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/28.jpg)
28
Story evolution: communities of authors develop parallel stories
![Page 29: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/29.jpg)
29
Basic data for Web structure mining: hyperlinks and textual references
![Page 30: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/30.jpg)
30
Example: Political blogs in the US
Adamic & Glance, 2005 (visualization modified)
Two views: all links; thresholded (link occurrence ≥ 25)
blue: liberal; red: conservative
Gabriel Ruget’s “publics”
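The thresholded view can be illustrated with a small sketch. It assumes that "link occurrence" means the number of times one blog links to another, counted per directed pair; the exact counting used for the figure is not given on the slide, and `threshold_links` is a hypothetical helper.

```python
from collections import Counter


def threshold_links(links, min_count=25):
    """Keep only blog-to-blog links that occur at least min_count times.

    links: iterable of (source_blog, target_blog) tuples, one per hyperlink.
    Assumption: links are counted per directed pair.
    """
    counts = Counter(links)
    return {edge: n for edge, n in counts.items() if n >= min_count}
```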
![Page 31: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/31.jpg)
31
Example: Blogs sourcing mainstream media
Hyperlinks from blogs to mainstream news media: Germany vs. USA
[Berendt, Schlegel, & Koch, in Kommunikation, Partizipation und Wirkungen im Social Web, 2008]
![Page 32: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/32.jpg)
32
The German and the US blogospheres
Data reported in [Berendt, Schlegel, & Koch, 2008]
![Page 33: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/33.jpg)
33
Example: The politics of sourcing – what do blog posts on global warming refer to?
Walejko & Ksiazek, in press
![Page 34: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/34.jpg)
34
Story evolution: communities of authors change
![Page 35: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/35.jpg)
35
Who authored? (revisited)
![Page 36: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/36.jpg)
36
Tracing anonymous edits
![Page 37: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/37.jpg)
37
Why?
![Page 38: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/38.jpg)
38
Story evolution: reading behaviour changes
![Page 39: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/39.jpg)
39
The story unfolds – query analysis may reveal more than text analysis
![Page 40: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/40.jpg)
40
Reading may “predate” writing
![Page 41: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/41.jpg)
41
Request frequency for a specific diagnosis in the investigated eHealth portal, depending on time and request language
Which diagnosis is that?
[Yihune, 2003; see also Heino & Toivonen, 2003]
![Page 46: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/46.jpg)
46
My story has reached its end
is our discussion’s beginning!
![Page 47: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/47.jpg)
47
References
Adamic, L., & Glance, N. (2005). The political blogosphere and the 2004 U.S. Election: Divided they blog. In Proc. of the 3rd Int. Worksh. on Link Discovery at ACM SIGKDD (pp. 36–44).
Berendt, B., Schlegel, M., & Koch, R. (2008). Die deutschsprachige Blogosphäre: Reifegrad, Politisierung, Themen und Bezug zu Nachrichtenmedien [The German-speaking blogosphere: Maturity, political focus, and relation to news media]. To appear in A. Zerfaß, M. Welker, & J. Schmidt (Eds.), Kommunikation, Partizipation und Wirkungen im Social Web (Band 2: Strategien und Anwendungen: Perspektiven für Wirtschaft, Politik, Publizistik) [Communication, Participation and Effects in the Social Web (Vol. 2: Strategies and Applications: Perspectives for the Economy, Politics, and Journalism)] (pp. 72-96). Köln, Germany: Herbert von Halem Verlag.
Berendt, B. & Subašić, I. (in press). Identifying, measuring and visualizing the evolution of a story: A Web mining approach. To appear in Proc. COLLNET 2008 (Fourth International Conference on Webometrics, Informetrics and Scientometrics & Ninth COLLNET Meeting). Berlin, July/August 2008.
Griffith, V. (2007). WikiScanner: List anonymous wikipedia edits from interesting organizations. http://wikiscanner.virgil.gr
Heino, J. & Toivonen, H. (2003). Automated Detection of Epidemics from the Usage Logs of a Physicians' Reference Database. In Proc. PKDD 2003. http://www.springerlink.com/content/g8h9f8y2fd3xq7ft/
Viégas, F.B., Wattenberg, M., & Dave, K. (2004). Studying Cooperation and Conflict between Authors with history flow Visualizations. In Proc. CHI 2004 (pp. 575-582).
Walejko, G. & Ksiazek, T. (in press). The Politics of Sourcing: A Study of Journalistic Practices in the Blogosphere. To appear in Proc. of the Second International Conference on Weblogs and Social Media (ICWSM 2008). Seattle, March/April 2008. http://www.icwsm.org/2008
Yihune, G. (2003). Evaluation eines medizinischen Informationssystems im World Wide Web. Nutzungsanalyse am Beispiel www.dermis.net. Dissertation. Medizinische Fakultät der Ruprecht-Karls-Universität Heidelberg.
![Page 48: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/48.jpg)
48
Backup Slides
![Page 49: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/49.jpg)
49
(Some) further work in text processing
![Page 50: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/50.jpg)
50
Improving on words and weights
![Page 51: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/51.jpg)
51
Stemming
Want to reduce all morphological variants of a word to a single index term
e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document)
Stemming - reduce words to their root form
e.g. fish – becomes a new index term
Porter stemming algorithm (1980)
relies on a preconstructed suffix list with associated rules
e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE
– BINARIZATION => BINARIZE
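The single rule quoted above can be sketched in a few lines. This is only that one rule, not the Porter stemmer: the full algorithm has several rule steps and treats "y" specially, which this illustration ignores. The function names are hypothetical.

```python
import re


def has_vowel_consonant(stem):
    """True if the stem contains a vowel immediately followed by a consonant.

    Simplification: only a, e, i, o, u count as vowels.
    """
    return re.search(r"[aeiou][^aeiou]", stem) is not None


def apply_ization_rule(word):
    """Replace the suffix IZATION with IZE when the stem condition holds."""
    lower = word.lower()
    suffix = "ization"
    if lower.endswith(suffix):
        stem = lower[: -len(suffix)]
        if has_vowel_consonant(stem):
            return stem + "ize"
    return lower
```

For BINARIZATION the stem is "binar", which contains "in" (vowel then consonant), so the rule fires and yields "binarize".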
![Page 52: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/52.jpg)
52
Inverse document frequency (IDF)
A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents
nj - Number of documents which contain the term j
n - total number of documents in the set
Inverse document frequency
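The formula on this slide did not survive in the transcript. A common textbook definition, which may differ in detail from the one shown in the talk, is idf_j = log(n / n_j). A minimal sketch under that assumption:

```python
import math


def inverse_document_frequency(tokenized_docs):
    """Return idf_j = log(n / n_j) for every term j.

    Assumption: natural logarithm, no smoothing. Other variants exist.
    """
    n = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each document once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n / n_j) for term, n_j in doc_freq.items()}
```

A term that appears in every document gets an IDF of 0, matching the point that such a term does not discriminate between documents.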
![Page 53: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/53.jpg)
53
Full Weighting (TF-IDF)
The TF-IDF weight of a term j in document di is
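The weight formula itself is missing from the transcript. One widely used form, assumed here for illustration, is w_ij = tf_ij · log(n / n_j), with tf_ij the raw count of term j in document d_i, n the number of documents and n_j the number of documents containing term j. The talk may have used a normalized variant.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one dict per document mapping term -> tf * log(n / n_j).

    Assumptions: raw counts as term frequency, natural logarithm.
    """
    n = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / doc_freq[t]) for t in tf})
    return weights
```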
![Page 54: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/54.jpg)
54
Beyond words
![Page 55: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/55.jpg)
55
N-grams and Named-Entity Recognition
Madeleine, Madeleine McCann, Maddie, Maddy, ... → MADELEINE_MCCANN
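Mapping several surface forms to one entity label can be illustrated with a simple alias dictionary. This is a lookup sketch, not named-entity recognition proper; the alias table contains only the forms listed on the slide, and `normalize_entities` is a hypothetical helper.

```python
import re

# Surface forms taken from the slide; the table itself is an illustration.
ALIASES = {
    "Madeleine McCann": "MADELEINE_MCCANN",
    "Madeleine": "MADELEINE_MCCANN",
    "Maddie": "MADELEINE_MCCANN",
    "Maddy": "MADELEINE_MCCANN",
}

# Longest aliases first, so the 2-gram wins over the single word.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b"
)


def normalize_entities(text):
    """Replace every known alias with its canonical entity label."""
    return _PATTERN.sub(lambda m: ALIASES[m.group(1)], text)
```

Trying the longest alias first is what keeps "Madeleine McCann" from being rewritten as the label followed by a stray "McCann".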
![Page 56: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/56.jpg)
56
Semantics (e.g., word-sense disambiguation)
![Page 57: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/57.jpg)
57
The need for word sense disambiguation
“She sat by the bank and looked sentimentally at the last fish.”
“She sat by the bank and looked sentimentally at the last coins.”
![Page 58: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/58.jpg)
58
WordNet semantic relations
![Page 59: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/59.jpg)
59
Web mining for analyzing multiple perspectives:
[Fortuna, Galleguillos, & Cristianini, in press]
What characterizes different news sources?
Nearest neighbour / best reciprocal hit for document matching; Kernel Canonical Correlation Analysis and vector operations for finding topics and characteristic keywords
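The best-reciprocal-hit idea can be sketched as follows: a document from source A and one from source B are matched only if each is the other's nearest neighbour. The bag-of-words vectors and cosine similarity used here are assumptions for illustration and not necessarily the representation used in the cited work; the Kernel Canonical Correlation Analysis step is not shown.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_reciprocal_hits(docs_a, docs_b):
    """Return index pairs (i, j) that are mutual nearest neighbours.

    docs_a, docs_b: lists of token lists from two news sources.
    """
    vecs_a = [Counter(d) for d in docs_a]
    vecs_b = [Counter(d) for d in docs_b]
    if not vecs_a or not vecs_b:
        return []
    # Nearest neighbour in B for each document in A, and vice versa.
    best_for_a = [max(range(len(vecs_b)), key=lambda j: cosine(va, vecs_b[j]))
                  for va in vecs_a]
    best_for_b = [max(range(len(vecs_a)), key=lambda i: cosine(vecs_a[i], vb))
                  for vb in vecs_b]
    return [(i, j) for i, j in enumerate(best_for_a) if best_for_b[j] == i]
```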
![Page 60: 1 The evolution of a story in a network – a Web mining perspective Bettina Berendt berendt](https://reader034.vdocument.in/reader034/viewer/2022042822/56649ef25503460f94c04a9c/html5/thumbnails/60.jpg)
60
Syntactic analysis
From simple part-of-speech tagging to full-scale NLP parsing