(semi-)automatic analysis of online contents
![Page 1: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/1.jpg)
(Semi-)Automatic Analysis of Online Contents
Steffen Staab (@ststaab)
Web and Internet Science Group · ECS · University of Southampton, UK &
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
![Page 2: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/2.jpg)
(Semi-)Automatic analysis of online content 2/68Steffen Staab
Content analysis
Text++
Content
![Page 3: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/3.jpg)
Is it difficult?
„Nach dem Auspacken der LPS-105 präsentiert sich dem Betrachter ein stabiles Laufwerk, das genauso geringe Außenmaße besitzt wie die Maxtor.“
Unpacking the LPS 105 reveals a sturdy disk drive which is of the same small size as the Maxtor.
Text++
Content
![Page 4: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/4.jpg)
„Content“ analysis: What is in online content?
....
Entailment
Summaries
Arguments
Discourse
Opinions / Sentiments
Facts – who, what, when?
Syntax
Semantics
Pragmatics
Knowledge
Text++
Content
CLing
![Page 5: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/5.jpg)
Purpose
Technical objectives
• Search
• Data & knowledge bases:
  • facts
  • arguments
  • ...
Applications
• Google Search
• Watson
• „Watson 2“
Social science and humanities objectives
• Form hypotheses
• Find indications
• Recognize trends
• ...
![Page 6: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/6.jpg)
Objective-oriented content analysis
....
Entailment
Summaries
Arguments
Discourse
Opinions / Sentiments
Facts – who, what, when?
Syntax
Semantics
Pragmatics
Knowledge
Text++
Semantic Web
Trend hypotheses
Selection of facts, function, trust
CLing
![Page 7: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/7.jpg)
SEMANTIC WEB ANNOTATION
![Page 8: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/8.jpg)
CREAM – Creating Metadata (Handschuh et al 2002, 2003)
Document Viewer / Editor
Ontology Guidance & Fact Browser
• Concepts
• Instances of concepts
• Attribute instances = instance of a property to a datatype instance
• Relationship instances = instance of a property to a class instance
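
The distinction between attribute instances and relationship instances can be illustrated with a small data structure. This is only a sketch, not CREAM's actual data model: the class names `Instance` and `Annotation`, the `kind` helper, and all `ex:` identifiers are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Instance:
    """An instance of an ontology concept (hypothetical 'ex:' identifiers)."""
    identifier: str
    concept: str


@dataclass(frozen=True)
class Annotation:
    """One annotated fact: a property linking a subject instance to a value."""
    subject: Instance
    prop: str
    value: Union[Instance, str, int, float]

    @property
    def kind(self) -> str:
        # Relationship instance: the value is itself a class instance.
        # Attribute instance: the value is a datatype value (string, number).
        return "relationship" if isinstance(self.value, Instance) else "attribute"


staab = Instance("ex:steffen_staab", "ex:Researcher")
koblenz = Instance("ex:uni_koblenz_landau", "ex:University")

annotations = [
    Annotation(staab, "ex:name", "Steffen Staab"),   # property -> datatype value
    Annotation(staab, "ex:worksAt", koblenz),        # property -> class instance
]
kinds = [a.kind for a in annotations]
print(kinds)
```

The same split corresponds to datatype properties versus object properties in ontology languages, which is why the value type alone is enough to tell the two kinds apart here.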
![Page 9: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/9.jpg)
CREAM – Creating Metadata (Handschuh et al 2002, 2003)
Open world: target ontologies could now be:
• Schema.org (3 trillion facts collected by Google; 10,000 of concepts)
• Wikidata: 1,148,230 concepts (2 weeks ago)
![Page 10: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/10.jpg)
Annotating facts with CREAM
Pros (+++)
• Open (wrt ontologies)
• Flexible
• Semi-automatic: S-CREAM
Cons (---)
• Effort for annotation (minimize # of clicks)
• Thick client
• Technology Readiness Level ~5
• A lot of effort to prepare the tool for a task
• Limited accuracy
![Page 11: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/11.jpg)
Technology Readiness Levels
TRL 1: Observation and description of the operating principle (8-15 years to market readiness)
TRL 2: Description of the application of a technology
TRL 3: Proof of the functionality of a technology (5-13 years to market readiness)
TRL 4: Experimental setup in the laboratory
TRL 5: Experimental setup in the operational environment
TRL 6: Prototype in the operational environment
TRL 7: Prototype in operation (1-5 years)
TRL 8: Qualified system with proof of functionality in the field of application
TRL 9: Qualified system with proof of successful operation
![Page 12: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/12.jpg)
CLUSTERING OF TEXT DATA
http://topicmodels.west.uni-koblenz.de
With Christoph Kling
![Page 13: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/13.jpg)
Text Mining Documents
Documents are PDFs, emails, tweets, Flickr photo tags, Word companions, …
Documents consist of
• bag of words
• metadata
  - author(s)
  - timestamp
  - geolocation
  - publisher
  - booktitle
  - device
  - ...
Example topics and their words:
• Chinese food: dim sum, duck, eggs, ...
• Vegan food: vegan, tofu, ...
• Breakfast: eggs, ham, ...
Objective: cluster, categorize, & explain
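
A document in this sense can be sketched as a word-count bag plus a metadata record. The `Document` class, the `make_document` helper, the whitespace tokenizer, and the example values below are assumptions made for illustration; the slides do not prescribe a representation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document as a bag of words plus arbitrary metadata fields."""
    words: Counter
    metadata: dict = field(default_factory=dict)


def make_document(text: str, **metadata) -> Document:
    # Naive tokenization (lowercase, split on whitespace); word order is
    # discarded, only counts are kept, which is what "bag of words" means.
    tokens = text.lower().split()
    return Document(words=Counter(tokens), metadata=metadata)


# Hypothetical example document with two metadata fields.
doc = make_document("eggs ham eggs toast", author="alice", hour=8)
print(doc.words.most_common(1), doc.metadata)
```

Keeping metadata separate from the word counts makes it straightforward to feed the same documents either to a plain topic model or to one that also conditions on metadata.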
![Page 14: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/14.jpg)
Latent Dirichlet Allocation (LDA)
![Page 15: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/15.jpg)
Latent Dirichlet Allocation (LDA)
Document-topic distributions
Topic-word distributions
K topics
M documents
Each document m of the M documents has length N_m
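
The quantities on this slide (K topics, M documents, lengths N_m, document-topic and topic-word distributions) fit together in the standard LDA generative process, sketched below with the Python standard library only. The symmetric hyperparameters `alpha` and `beta`, the toy vocabulary, and the function names are assumptions not given on the slide; this generates synthetic data and is not an inference procedure.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, K, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Generate M = len(doc_lengths) documents from the LDA generative story."""
    rng = random.Random(seed)
    # One topic-word distribution per topic.
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(K)]
    corpus, theta = [], []
    for n_m in doc_lengths:
        # One document-topic distribution per document.
        theta_m = sample_dirichlet(alpha, K, rng)
        doc = []
        for _ in range(n_m):
            z = rng.choices(range(K), weights=theta_m)[0]  # topic of this token
            w = rng.choices(vocab, weights=phi[z])[0]      # word drawn from topic z
            doc.append(w)
        theta.append(theta_m)
        corpus.append(doc)
    return corpus, theta, phi


vocab = ["dimsum", "duck", "eggs", "vegan", "tofu", "ham"]
doc_lengths = [5, 8, 3]
corpus, theta, phi = generate_corpus(vocab, K=3, doc_lengths=doc_lengths)
for doc in corpus:
    print(" ".join(doc))
```

Topic modelling inverts this process: given only the words, it estimates `theta` and `phi`.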
![Page 16: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/16.jpg)
Use Metadata to Help Topic Prediction
Improve topic detection
→ Morning times may help to improve the breakfast topic
Describe dependencies: metadata ↔ topics
→ Breakfast topic happens during morning hours
Example topics and their words:
• Chinese food: dim sum, duck, eggs, ...
• Vegan food: vegan, tofu, ...
• Breakfast: eggs, ham, ...
![Page 17: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/17.jpg)
Use Metadata to Help Topic Prediction
Improve topic detection
→ Morning times may help to improve the breakfast topic
Describe dependencies: metadata ↔ topics
→ Breakfast topic happens during morning hours
Usage
Autocompletion
→ From words to words
Prediction of search queries
→ From metadata to words
→ From words to metadata
Example topics and their words:
• Chinese food: dim sum, duck, eggs, ...
• Vegan food: vegan, tofu, ...
• Breakfast: eggs, ham, ...
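
One simple way to go "from metadata to words" is to mix the topic-word distributions with metadata-dependent topic weights, p(w | m) = Σ_k p(w | k) · p(k | m). The sketch below assumes exactly that mixture; the slides do not state how the prediction is computed, and all probabilities are made-up toy numbers.

```python
# Toy topic-word distributions p(w | k); values are invented for illustration.
topic_word = {
    "breakfast": {"eggs": 0.5, "ham": 0.4, "tofu": 0.1},
    "vegan": {"eggs": 0.0, "ham": 0.0, "tofu": 1.0},
}

# Toy metadata-topic distributions p(k | m), with time of day as metadata.
topic_given_meta = {
    "morning": {"breakfast": 0.8, "vegan": 0.2},
    "evening": {"breakfast": 0.1, "vegan": 0.9},
}


def word_distribution(meta):
    """Compute p(w | m) by summing p(w | k) * p(k | m) over all topics k."""
    dist = {}
    for topic, p_topic in topic_given_meta[meta].items():
        for word, p_word in topic_word[topic].items():
            dist[word] = dist.get(word, 0.0) + p_topic * p_word
    return dist


def suggest(meta, n=2):
    """Return the n most probable words for a metadata value."""
    dist = word_distribution(meta)
    return sorted(dist, key=dist.get, reverse=True)[:n]


print(suggest("morning", 1), suggest("evening", 1))
```

The reverse direction, from words to metadata, would apply Bayes' rule to the same quantities.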
![Page 18: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/18.jpg)
Dataset
Linux Kernel Mailing List
3,400,000 emails with timestamps and mailing list ID
![Page 19: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/19.jpg)
Structures of Metadata Spaces
• Nominal
• Ordinal
• Cyclic
• Spherical
• Networked
[Figure labels: Kern, Desk, Mail]
The spatial model is not used in this application (but might be)!
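
What distinguishes these metadata spaces is their notion of closeness between two values. The sketch below shows one plausible distance per structure; it is an illustration of the distinction only, not the model used in the talk, and all function names are hypothetical.

```python
import math
from collections import deque


def nominal_distance(a, b):
    """Nominal values (e.g. a mailing list ID) are either equal or not."""
    return 0.0 if a == b else 1.0


def ordinal_distance(a, b):
    """Ordinal values (e.g. years) are compared along a line."""
    return abs(a - b)


def cyclic_distance(a, b, period):
    """Cyclic values (e.g. hour of day) wrap around after one period."""
    d = abs(a - b) % period
    return min(d, period - d)


def spherical_distance(lat1, lon1, lat2, lon2, radius=6371.0):
    """Great-circle (haversine) distance in km between two geolocations."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))


def network_distance(graph, start, goal):
    """Networked values: number of hops on the shortest path (BFS)."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return math.inf


# 23:00 and 01:00 are far apart on a line but close on a 24-hour cycle.
print(ordinal_distance(23, 1), cyclic_distance(23, 1, 24))
```

Treating a timestamp's hour as ordinal would separate late-night and early-morning emails, whereas the cyclic view keeps them adjacent.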
![Page 20: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/20.jpg)
Topics
![Page 21: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/21.jpg)
Topics
![Page 22: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/22.jpg)
Topics
Professional topics:
Hobbyist topics:
![Page 23: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/23.jpg)
Topics
Metadata weighting:
![Page 24: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/24.jpg)
126,408 Online Fetish Users: First 8 Topics
![Page 25: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/25.jpg)
Sociodemographics of Fetish dataset
![Page 26: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/26.jpg)
Influence of Sociodemographics on Favorite Fetishes
![Page 27: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/27.jpg)
Other applications of (extended) LDA
• Sentiment and topics (Naveed et al., ICWSM 2013)
• Topics and spatial knowledge (Kling et al., WSDM 2014)
• Modelling of power (Kling et al., ICWSM 2015)
![Page 28: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/28.jpg)
BELIEVABILITY AND TRUST IN ONLINE NEWS
With Christoph Kling, Jerome Kunegis
Collaboration with Jutta Milde, Karin Stengel, Ines Vogel
Ongoing work in KOMEPOL
![Page 29: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/29.jpg)
Targets
![Page 30: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/30.jpg)
Example article at Spiegel.de
![Page 31: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/31.jpg)
Requirements
Scalability:
• # Documents
• # Annotators
• # Annotations per annotator
Tool:
• Administration
• Crowdsourcing
• Semi-automatic
![Page 32: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/32.jpg)
Separating article management and coding
![Page 33: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/33.jpg)
Text-Upload
![Page 34: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/34.jpg)
Managing projects
![Page 35: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/35.jpg)
Article
![Page 36: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/36.jpg)
Defining a Coding-Job
![Page 37: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/37.jpg)
Highlighting using Keywords and Clustering
![Page 38: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/38.jpg)
Article coding
![Page 39: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/39.jpg)
Preparing a code book (1)
![Page 40: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/40.jpg)
Preparing a code book (2)
![Page 41: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/41.jpg)
CONCLUSION
![Page 42: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/42.jpg)
Lessons Learned
New targets
• Require new modeling of gaps
Challenges
• Technology Readiness Levels
• Many tools – no „good“ tool („done is better than perfect“?)
• Reproducibility
ToDos
• Eclipse/Protege of annotation
  • modular
  • extensible
  • open
• Optimizing the processes
![Page 43: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/43.jpg)
No tool to rule them all
....
Entailment
Summaries
Arguments
Discourse
Opinions / Sentiments
Facts – who, when, where, what?
Syntax
Semantics
Pragmatics
Knowledge
Text++
Semantic Web
Trend hypotheses
Selection of facts, function, trust
Gap
Gap
CLing
![Page 44: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/44.jpg)
THANK YOU FOR YOUR ATTENTION!
![Page 45: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/45.jpg)
Bibliography
C. C. Kling, J. Kunegis, S. Sizov, and S. Staab. “Detecting non-gaussian geographical topics in tagged photo collections.” In: Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24-28, 2014.
I. C. Vogel, J. Milde, K. Stengel, S. Staab, C. C. Kling, and J. Kunegis. “Glaubwürdigkeit und Vertrauen von Online-News.” In: Datenschutz und Datensicherheit 39.5 (2015), pp. 312–316.
S. Handschuh and S. Staab. “CREAM – CREAting Metadata for the Semantic Web.” In: Computer Networks 42.5 (2003), pp. 579-598. Elsevier.
S. Handschuh, S. Staab, and F. Ciravegna. “S-CREAM – Semi-automatic CREAtion of Metadata.” In: Proc. of the European Conference on Knowledge Acquisition and Management (EKAW-2002), Madrid, Spain, October 1-4, 2002. LNCS/LNAI 2473, Springer, 2002, pp. 358-372.
C. Kling. “Probabilistic Models for Context in Social Media. Novel Approaches and Inference Schemes.” Submitted as PhD thesis, Institute for Web Science and Technologies, University of Koblenz-Landau, to be defended Nov/Dec 2016.
N. Naveed, T. Gottron, and S. Staab. “Feature Sentiment Diversification of User Generated Reviews: The FREuD Approach.” In: ICWSM 2013.
C. C. Kling, J. Kunegis, H. Hartmann, M. Strohmaier, and S. Staab. “Voting Behaviour and Power in Online Democracy: A Study of LiquidFeedback in Germany's Pirate Party.” In: ICWSM 2015, pp. 208-217.
![Page 46: (Semi-)Automatic analysis of online contents](https://reader031.vdocument.in/reader031/viewer/2022022414/587c53f01a28abc62c8b672d/html5/thumbnails/46.jpg)
URLs
http://topicmodels.west.uni-koblenz.de
http://komepol.west.uni-koblenz.de
http://www.slideshare.net/steffenstaab