Finding High-Quality Content in Social Media
chenwq, 2011/11/26
Authors
Eugene Agichtein
Emory University
Research: Intelligent Information Access Lab (IRLab)
News: our team wins the "Best Paper" award at SIGIR 2011.
Abstract
Since the early 2000s, user-generated content has become popular on the web.
The quality of user-generated content varies drastically, from excellent to abuse and spam.
Goal: to separate high-quality content from the rest automatically.
Graph-based framework
– combines the different sources of evidence in a classification formulation
Contents
1. Related work
2. Content quality analysis
3. Modeling content quality
4. Experiment & conclusion
Related work
Link analysis in social media
Propagating reputation
Question/answering portals and forums
Expert finding
Text analysis for content quality
Implicit feedback for ranking
Related work
Link analysis in social media
– G = (V, E)
– V corresponds to the users of a question/answer system
– a directed edge e = (u, v) ∈ E goes from a user u ∈ V to a user v ∈ V if user u has answered at least one question of user v
– G′ = (V, E′)
PageRank, ExpertiseRank, HITS
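As a rough illustration of the graph defined above, the sketch below builds G from (answerer, asker) pairs and runs a plain power-iteration PageRank over it. The function names, the damping factor of 0.85, the iteration count, and the sample users are all assumptions made for this example; the slides do not specify them.

```python
from collections import defaultdict


def build_answer_graph(answers):
    """Build G = (V, E) from (answerer, asker) pairs.

    An edge u -> v is added when user u answered a question of user v.
    """
    graph = defaultdict(set)
    for answerer, asker in answers:
        graph[answerer].add(asker)
        graph[asker]  # touch the key so the asker is also a vertex
    return dict(graph)


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; damping and iterations are assumed values."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Users with no outgoing edges spread their score uniformly.
        dangling = sum(rank[u] for u in nodes if not graph[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in graph[u]:
                new_rank[v] += damping * rank[u] / len(graph[u])
        rank = new_rank
    return rank


# Hypothetical sample: alice and bob answered carol, carol answered dave.
sample_answers = [("alice", "carol"), ("bob", "carol"), ("carol", "dave")]
sample_ranks = pagerank(build_answer_graph(sample_answers))
```

With edges pointing from answerer to asker, score flows toward the users who receive answers; reversing every edge before calling `pagerank` moves score toward the users who give them.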
CONTENT QUALITY ANALYSIS——Intrinsic content quality
As a baseline, we use textual features only: all word n-grams up to length 5 that appear in the collection more than 3 times are used as features
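A minimal sketch of this baseline, assuming whitespace tokenization, lowercasing, and that "more than 3 times" counts total occurrences across the collection (the slide does not say whether occurrences or documents are counted). All names are hypothetical.

```python
from collections import Counter


def word_ngrams(text, max_n=5):
    """Yield every word n-gram of length 1..max_n from a text."""
    tokens = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(collection, max_n=5, min_count=4):
    """Keep n-grams seen more than 3 times (count >= 4) in the collection."""
    counts = Counter()
    for text in collection:
        counts.update(word_ngrams(text, max_n))
    return {gram for gram, count in counts.items() if count >= min_count}


def featurize(text, vocabulary, max_n=5):
    """Map a text to counts of the vocabulary n-grams it contains."""
    return Counter(g for g in word_ngrams(text, max_n) if g in vocabulary)
```

The resulting sparse counts can be fed to any of the classifiers considered in the deck.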
Punctuation and typos
1. Punctuation
2. Capitalization
3. Spacing density
4. Character-level entropy
5. Spelling mistakes
6. Out-of-vocabulary words
Syntactic and semantic
1. Average number of syllables per word
2. Entropy of word lengths
3. Readability measures
Grammaticality
1. Part-of-speech sequences
2. Formality score
3. Distance between its (trigram) language model and several given language models
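A few of the surface features listed above can be computed without external resources. The sketch below is one possible reading of them: the exact definitions (densities as fractions of characters, entropy in bits) are assumptions, and features such as spelling mistakes, out-of-vocabulary words, or part-of-speech sequences are left out because they need a dictionary or a tagger.

```python
import math
import string
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def surface_features(text):
    """Assumed definitions of some punctuation/typo and word-length features."""
    words = text.split()
    n_chars = max(len(text), 1)
    letters = [ch for ch in text if ch.isalpha()]
    return {
        "punctuation_density": sum(ch in string.punctuation for ch in text) / n_chars,
        "capitalization_ratio": sum(ch.isupper() for ch in letters) / max(len(letters), 1),
        "spacing_density": sum(ch.isspace() for ch in text) / n_chars,
        "char_entropy": entropy(text),
        "word_length_entropy": entropy(len(w) for w in words),
    }
```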
CONTENT QUALITY ANALYSIS——User relationships
Graph of items and users: user u is linked to question q by an "answer" edge
User-user graph: edge u → v if u has answered a question from user v
CONTENT QUALITY ANALYSIS——Usage statistics
– The number of clicks on some item
– The dwell time on some item
CONTENT QUALITY ANALYSIS——classification framework
We cast the problem of quality ranking as binary classification
– support vector machines
– log-linear classifiers
– stochastic gradient boosted trees
Our goal is to discover interesting, well-formulated and factually accurate content
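To make the classification formulation concrete, here is a small sketch of one of the listed options, a log-linear (logistic regression) classifier trained with stochastic gradient ascent. It is not the authors' implementation: the learning rate, the epoch count, and the two-feature representation in the usage note are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_log_linear(examples, labels, epochs=200, learning_rate=0.5):
    """Fit weights and bias by stochastic gradient ascent on the log-likelihood.

    examples: list of equal-length feature lists; labels: 1 = high quality, 0 = not.
    """
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = y - p
            bias += learning_rate * error
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights, bias


def quality_score(x, weights, bias):
    """Estimated probability that an item is high quality; usable for ranking."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
```

For example, with hypothetical features such as [spelling error rate, author authority score], items can be ranked by `quality_score` after training on labeled examples.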
MODELING CONTENT QUALITY——user relationships
Our dataset can be viewed as a graph, as illustrated in Figure 1
MODELING CONTENT QUALITY——user relationships
The relationships between questions, users asking and answering questions, and answers can be captured by a tripartite graph outlined in Figure 2
MODELING CONTENT QUALITY——user relationships
The unique characteristics of the community question/answering domain
MODELING CONTENT QUALITY——user relationships
Question subtree
– Q: features from the question being answered
– QU: features from the asker of the question being answered
– QA: features from the other answers to the same question
MODELING CONTENT QUALITY——user relationships
User subtree
– UA: features from the answers of the user
– UQ: features from the questions of the user
– UV: features from the votes of the user
– UQA: features from answers received to the user's questions
– U: other user-based features
MODELING CONTENT QUALITY——user relationships
Question features
MODELING CONTENT QUALITY——user relationships
Implicit user-user relations
G = (V, E)
– E = Ea ∪ Eb ∪ Ev ∪ Es ∪ E+ ∪ E−
Gx = (V, Ex)
– hx: the vector of hub scores on the vertices V
– ax: the vector of authority scores
– px: the vector of PageRank scores
– p′x: the vector of PageRank scores in the transposed graph
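The hub and authority vectors hx and ax for one edge type can be sketched with the standard HITS iteration, run separately on each subgraph Gx = (V, Ex). The code below assumes each Ex is given as a set of (u, v) pairs; the iteration count and the sample edge sets are hypothetical. PageRank vectors px and p′x would come from running any PageRank routine on Ex and on `transpose(Ex)` respectively.

```python
import math


def transpose(edges):
    """Reverse every edge, giving the edge set of the transposed graph."""
    return {(v, u) for (u, v) in edges}


def hits(vertices, edges, iterations=50):
    """Return (hub, authority) score dicts for the graph (vertices, edges)."""
    hub = {v: 1.0 for v in vertices}
    auth = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Authority of v: sum of hub scores of vertices pointing to v.
        auth = {v: 0.0 for v in vertices}
        for u, v in edges:
            auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        # Hub of u: sum of authority scores of vertices u points to.
        new_hub = {v: 0.0 for v in vertices}
        for u, v in edges:
            new_hub[u] += auth[v]
        norm = math.sqrt(sum(h * h for h in new_hub.values())) or 1.0
        hub = {v: h / norm for v, h in new_hub.items()}
    return hub, auth


# Hypothetical edge sets for two of the edge types, keyed by their subscript.
users = {"u1", "u2", "u3"}
edge_sets = {"a": {("u1", "u3"), ("u2", "u3")}, "b": {("u1", "u3")}}
scores = {x: hits(users, ex) for x, ex in edge_sets.items()}
```

Each user then contributes one hub score and one authority score per edge type to the feature vector.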
MODELING CONTENT QUALITY——user relationships
Implicit user-user relations
MODELING CONTENT QUALITY——user relationships
Content features for QA
– to identify the most salient features for the specific tasks of question or answer quality classification
• the KL-divergence between the language models of the two texts
• their non-stopword overlap
• the ratio between their lengths
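The three pairwise features above could be computed roughly as follows. This sketch assumes unigram language models with add-one smoothing over the joint vocabulary, measures overlap as Jaccard similarity of non-stopword sets, and uses a tiny illustrative stopword list; none of these choices is stated on the slide.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the authors' list).
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and", "it"}


def tokens(text):
    return text.lower().split()


def kl_divergence(text_p, text_q, smoothing=1.0):
    """KL(P || Q) between smoothed unigram models of the two texts."""
    p_counts, q_counts = Counter(tokens(text_p)), Counter(tokens(text_q))
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(vocab)
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    kl = 0.0
    for word in vocab:
        p = (p_counts[word] + smoothing) / p_total
        q = (q_counts[word] + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


def non_stopword_overlap(text_a, text_b):
    """Jaccard overlap of the non-stopword vocabularies (assumed definition)."""
    a = set(tokens(text_a)) - STOPWORDS
    b = set(tokens(text_b)) - STOPWORDS
    return len(a & b) / len(a | b) if a | b else 0.0


def length_ratio(text_a, text_b):
    """Ratio of token counts, guarding against an empty second text."""
    return len(tokens(text_a)) / max(len(tokens(text_b)), 1)
```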
MODELING CONTENT QUALITY——user relationships
Usage features for QA
– number of item views (clicks)
– metadata of the question
• how long ago the question was posted
– derived statistics
• the expected number of views for a given category
• the deviation from the expected number of views
– other second-order statistics
• the click frequency
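A minimal sketch of the derived usage statistics, assuming the expected number of views is the per-category mean and that click frequency means views per hour since posting. Both definitions and the record fields (`category`, `views`, `age_hours`) are assumptions for illustration.

```python
from collections import defaultdict


def usage_features(questions):
    """Compute usage features for a list of question records.

    Each record is a dict with hypothetical keys 'category', 'views', 'age_hours'.
    """
    views_by_category = defaultdict(list)
    for q in questions:
        views_by_category[q["category"]].append(q["views"])
    # Expected views for a category: mean over questions in that category.
    expected = {c: sum(v) / len(v) for c, v in views_by_category.items()}

    features = []
    for q in questions:
        exp = expected[q["category"]]
        features.append({
            "views": q["views"],
            "age_hours": q["age_hours"],
            "expected_views": exp,
            "deviation_from_expected": q["views"] - exp,
            "click_frequency": q["views"] / max(q["age_hours"], 1e-9),
        })
    return features
```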
Experiment & Conclusions——EXPERIMENTAL SETTING
Dataset
Edges induced from the whole dataset.
MODELING CONTENT QUALITY——EXPERIMENTAL SETTING
Dataset statistics
Thanks for your attention!