Interpretable and Effective Opinion Spam Detection via Temporal Pattern Mining Across Websites
Yuan Yuan, Sihong Xie, Chun-Ta Lu, Jie Tang and Philip S. Yu
Tsinghua University, Lehigh University and University of Illinois at Chicago
December 7, 2016
Online reviews & spam
Reviews and ratings influence our decisions
Spam reviews are misleading (the review below was filtered by Yelp)
Yuan et al. (BigData 2016) 2
Multiple review sites
One business may have information on multiple sites
What if we combine information on different sites?
Basic idea: Bi-level framework
Main contributions
Proposed a novel spam detection framework using time series patterns defined over multiple data sources.
Performed in-depth studies to reveal a full picture of the defined patterns on two levels.
Showed through quantitative (prediction) and qualitative (case study) results that the framework can precisely identify and explain attacks that were not previously spotted.
Single website time series construction
Useful single-website patterns: Count of Reviews, Average Rating, Five-star Ratio, Low-rating Ratio, Average Sentiment, Highly Positive Sentiment Ratio, Negative Sentiment Ratio
e.g. Five-star Ratio:
$$FR_s(t) = \frac{\sum_{r_s :\, \mathrm{time}(r_s) \in \tau_t} \mathbb{1}[\mathrm{rating}(r_s) = 5] + \alpha\, FR_s}{CR_s(t) + \alpha}$$
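As a rough illustration of the smoothed five-star ratio, the sketch below computes it for one time window. It assumes that the un-indexed FR_s term is a prior value such as the site-wide five-star ratio (the slide does not define it), and the function name, the half-open window convention, and the sample data are all hypothetical.

```python
from datetime import date


def five_star_ratio(reviews, window_start, window_end, prior_ratio, alpha=1.0):
    """Smoothed five-star ratio for one time window of a single site.

    reviews: iterable of (review_date, rating) pairs.
    prior_ratio: assumed reading of the un-indexed FR_s term, e.g. the
        site-wide five-star ratio (the slide does not define it).
    alpha: smoothing weight; must be positive so empty windows are defined.
    """
    in_window = [rating for day, rating in reviews
                 if window_start <= day < window_end]
    five_star = sum(1 for rating in in_window if rating == 5)
    count = len(in_window)  # CR_s(t): count of reviews in the window
    return (five_star + alpha * prior_ratio) / (count + alpha)


# Made-up reviews for illustration only.
reviews = [
    (date(2016, 1, 3), 5),
    (date(2016, 1, 10), 4),
    (date(2016, 1, 20), 5),
    (date(2016, 2, 2), 1),
]
january = five_star_ratio(reviews, date(2016, 1, 1), date(2016, 2, 1),
                          prior_ratio=0.5, alpha=2.0)
empty = five_star_ratio([], date(2016, 1, 1), date(2016, 2, 1),
                        prior_ratio=0.5, alpha=2.0)
```

With this reading, a window with no reviews falls back to the prior value, and the prior's influence shrinks as the window's review count grows.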
Algorithm: Single site time series pattern detection
For each pair of segments, compute
$$d = \frac{\lambda}{(1/|k_1| + 1/|k_2|)\,\Delta t + \lambda}$$
If $d = \frac{\lambda}{(1/|k_1| + 1/|k_2|)\,\Delta t + \lambda} > \theta$, and $k_1 > 0$ and $k_2 < 0$,
a burst window is detected
If $d = \frac{\lambda}{(1/|k_1| + 1/|k_2|)\,\Delta t + \lambda} > \theta$, and $k_1 < 0$ and $k_2 > 0$,
a dive window is detected
take the union of detected burst/dive windows
each time window is classified into burst/dive/plateau
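The detection steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the time series has already been approximated by linear segments with slopes k, that Δt is the gap between the end of the first segment and the start of the second, and that flat segments never trigger a detection. The function names and the default values of `lam` and `theta` are hypothetical.

```python
def pair_score(k1, k2, dt, lam):
    """d = lam / ((1/|k1| + 1/|k2|) * dt + lam); flat segments score 0."""
    if k1 == 0 or k2 == 0:
        return 0.0
    return lam / ((1.0 / abs(k1) + 1.0 / abs(k2)) * dt + lam)


def classify_pair(k1, k2, dt, lam=1.0, theta=0.5):
    """Label a pair of segments as 'burst', 'dive', or None."""
    if pair_score(k1, k2, dt, lam) <= theta:
        return None
    if k1 > 0 and k2 < 0:
        return "burst"   # rise followed by fall
    if k1 < 0 and k2 > 0:
        return "dive"    # fall followed by rise
    return None


def label_windows(n_windows, segments, lam=1.0, theta=0.5):
    """Label every time window as 'burst', 'dive', or 'plateau'.

    segments: list of (start, end, slope) in window indices, end exclusive.
    dt is taken as the gap between the two segments (an assumption).
    If a burst and a dive overlap, the later detection overwrites the
    earlier one; the slides do not say how such overlaps are resolved.
    """
    labels = ["plateau"] * n_windows
    for i, (s1, e1, k1) in enumerate(segments):
        for s2, e2, k2 in segments[i + 1:]:
            kind = classify_pair(k1, k2, dt=s2 - e1, lam=lam, theta=theta)
            if kind is not None:
                for t in range(s1, e2):  # union of detected windows
                    labels[t] = kind
    return labels


# Made-up segments: slow rise, sharp rise, sharp fall, flat.
segments = [(0, 2, 0.1), (2, 4, 3.0), (4, 6, -2.5), (6, 10, 0.0)]
labels = label_windows(10, segments)
```

Under these assumptions d lies in (0, 1], and it is largest when both slopes are steep and the two segments are close together in time.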
Cross-site time series pattern design and construction
detect single-site patterns in different sites
combine the simultaneous patterns
assumption: different cross-site patterns have different spam ratios (validate on dataset)
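One way to picture the combination step: pair the per-window labels from the two sites into two-letter codes (first letter for Yelp, second for Foursquare, with B, P, D standing for burst, plateau, dive), then compare the share of filtered reviews per code. The sketch below assumes the windows on both sites are already aligned. The function names and all the numbers are made up for illustration.

```python
from collections import Counter

INITIAL = {"burst": "B", "plateau": "P", "dive": "D"}


def cross_site_patterns(yelp_labels, foursquare_labels):
    """Pair aligned per-window labels from two sites into codes like 'BP'."""
    return [INITIAL[y] + INITIAL[f]
            for y, f in zip(yelp_labels, foursquare_labels)]


def filtered_ratio_by_pattern(patterns, n_reviews, n_filtered):
    """Share of filtered reviews among all reviews, grouped by pattern code."""
    totals, filtered = Counter(), Counter()
    for pattern, n, f in zip(patterns, n_reviews, n_filtered):
        totals[pattern] += n
        filtered[pattern] += f
    return {p: filtered[p] / totals[p] for p in totals if totals[p] > 0}


# Hypothetical labels and counts for four aligned time windows.
yelp = ["burst", "plateau", "plateau", "dive"]
foursquare = ["plateau", "plateau", "burst", "plateau"]
patterns = cross_site_patterns(yelp, foursquare)
ratios = filtered_ratio_by_pattern(patterns,
                                   n_reviews=[20, 100, 30, 10],
                                   n_filtered=[5, 12, 3, 4])
```

If the stated assumption holds, the ratios should differ noticeably between pattern codes.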
Data setup
Raw data
Foursquare: crawled 301,717 venues
Yelp: Yelp challenge dataset1
Matched by names and locations
95 businesses
Foursquare: 15,004 reviews, 12,147 reviewers
Yelp: 68,517 reviews, 31,092 reviewers
1 http://www.yelp.com/dataset_challenge
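Basic statistics of cross-site patterns
Table: Cross-site pattern statistics
| Pattern (Y-F) | Yelp #business | Yelp #review | Yelp #reviewer | Yelp #related reviews | Yelp filtered ratio | Foursquare #business | Foursquare #review | Foursquare #reviewer |
|---|---|---|---|---|---|---|---|---|
| BB | 7 | 181 | 179 | 19133 | 27.07% | 9 | 89 | 83 |
| BP | 27 | 821 | 772 | 127427 | 26.31% | 27 | 200 | 186 |
| BD | 8 | 295 | 290 | 41713 | 18.98% | 9 | 122 | 114 |
| PB | 51 | 3795 | 3187 | 636679 | 13.68% | 52 | 1154 | 1089 |
| PP | 95 | 59830 | 23509 | 9364943 | 11.99% | 95 | 12152 | 9491 |
| PD | 33 | 3024 | 2589 | 548993 | 15.41% | 34 | 1036 | 943 |
| DB | 4 | 76 | 76 | 10321 | 21.05% | 6 | 79 | 74 |
| DP | 10 | 303 | 300 | 23822 | 48.18% | 9 | 73 | 71 |
| DD | 4 | 192 | 190 | 21059 | 28.13% | 6 | 99 | 96 |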
Human evaluation
Three human annotators independently label the sampled reviews using 3 levels of suspiciousness (1: not suspicious, 2: likely suspicious, and 3: very suspicious).
Table: Human annotation results
| Patterns | # reviews | Avg Scores | Prec(> 1) | Prec(> 2) |
|---|---|---|---|---|
| B* | 93 | 1.9785 | 0.9677 | 0.3871 |
| BB | 18 | 1.9074 | 0.8889 | 0.4444 |
| BP | 75 | 1.9956 | 0.9867 | 0.3733 |
| PB | 68 | 2.0098 | 0.8971 | 0.3824 |
| PP | 55 | 1.8606 | 0.9091 | 0.2909 |
| PD | 14 | 1.7857 | 0.7857 | 0.2857 |
Microscopic classification - Behavioral Features
Table: Microscopic behavioral features of reviewers and reviews, and their correlations with the ground truths
| Feature | Corr. | Description |
|---|---|---|
| DC | +0.252 | Proportion of days when a reviewer posts reviews on businesses in different cities. |
| DS | +0.230 | Proportion of days when a reviewer posts reviews on businesses in different states. |
| MP | +0.183 | Proportion of days when a reviewer posts 3 or more reviews. |
| LRR | -0.148 | Proportion of reviews with 1 or 2 stars posted by a reviewer. |
| FRR | +0.121 | Proportion of reviews with 5 stars posted by a reviewer. |
| RC | +0.086 | Sum of reviews posted by a reviewer. |
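The behavioral features could be computed per reviewer along the following lines. The sketch assumes that "days" means the days on which the reviewer posted at least one review, and that DC and DS count days with reviews in more than one city or state. The record layout and function name are hypothetical.

```python
from collections import defaultdict


def behavioral_features(reviews):
    """Behavioral features for one reviewer.

    reviews: list of dicts with keys 'date', 'city', 'state', 'stars'.
    Proportions over days use the reviewer's active days as denominator
    (an assumption; the slide does not specify it).
    """
    if not reviews:
        return {"DC": 0.0, "DS": 0.0, "MP": 0.0, "LRR": 0.0, "FRR": 0.0, "RC": 0}
    by_day = defaultdict(list)
    for review in reviews:
        by_day[review["date"]].append(review)
    days = list(by_day.values())
    n_days, n_reviews = len(days), len(reviews)
    return {
        "DC": sum(len({r["city"] for r in day}) > 1 for day in days) / n_days,
        "DS": sum(len({r["state"] for r in day}) > 1 for day in days) / n_days,
        "MP": sum(len(day) >= 3 for day in days) / n_days,
        "LRR": sum(r["stars"] <= 2 for r in reviews) / n_reviews,
        "FRR": sum(r["stars"] == 5 for r in reviews) / n_reviews,
        "RC": n_reviews,
    }


# Made-up review history for a single reviewer.
history = [
    {"date": "2016-01-01", "city": "Phoenix", "state": "AZ", "stars": 5},
    {"date": "2016-01-01", "city": "Las Vegas", "state": "NV", "stars": 5},
    {"date": "2016-01-01", "city": "Phoenix", "state": "AZ", "stars": 5},
    {"date": "2016-01-05", "city": "Phoenix", "state": "AZ", "stars": 1},
]
features = behavioral_features(history)
```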
Microscopic classification - Textual Features
Table: Microscopic textual features of reviewers and reviews, and their correlations with the ground truths
| Feature | Corr. | Description |
|---|---|---|
| LC | -0.010 | Sum of letters in a review. |
| CWR | +0.106 | Proportion of ALL-CAPITAL words ("I" excluded). |
| CLR | +0.065 | Proportion of capital letters. |
| 1PP | -0.034 | Proportion of first person pronouns. |
| 2PP | +0.094 | Proportion of second person pronouns. |
| EX | +0.032 | Proportion of exclamation marks. |
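A minimal sketch of the textual features for a single review follows. The denominators are assumptions, since the slide does not state them: word-level proportions are taken over words, CLR over letters, and EX over all characters. The pronoun lists and the tokenizing regular expression are illustrative choices, and the sketch extends the "I" exclusion to all one-letter words.

```python
import re

# Illustrative pronoun lists; the slide does not enumerate them.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}


def textual_features(text):
    """Textual features of one review (denominators are assumptions)."""
    words = re.findall(r"[A-Za-z']+", text)
    letters = [c for c in text if c.isalpha()]
    n_words = max(len(words), 1)      # avoid division by zero
    n_letters = max(len(letters), 1)
    n_chars = max(len(text), 1)
    # One-letter words such as "I" are not counted as ALL-CAPITAL words.
    all_caps = [w for w in words if w.isupper() and len(w) > 1]
    return {
        "LC": len(letters),
        "CWR": len(all_caps) / n_words,
        "CLR": sum(c.isupper() for c in letters) / n_letters,
        "1PP": sum(w.lower() in FIRST_PERSON for w in words) / n_words,
        "2PP": sum(w.lower() in SECOND_PERSON for w in words) / n_words,
        "EX": text.count("!") / n_chars,
    }


sample = "Waste of time!!! I would NOT come back."
sample_features = textual_features(sample)
```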
Classification - Results
Prior methods [Rayana et al 2015]
[Figure: ROC curves (True Positive Rate vs. False Positive Rate) for B+T (AUC = 0.65), B (AUC = 0.67), T (AUC = 0.55), and Random; Precision-Recall curves (Precision vs. Recall) for B+T, B, and T]
Classification - Results
Linear regression
[Figure: ROC curves (True Positive Rate vs. False Positive Rate) for B+T (AUC = 0.70), B (AUC = 0.68), T (AUC = 0.60), and Random; Precision-Recall curves (Precision vs. Recall) for B+T, B, and T]
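For readers unfamiliar with the AUC figures quoted on the classification slides, the sketch below shows one standard way to compute the area under the ROC curve from labels and classifier scores: the fraction of (spam, non-spam) pairs that the scores rank correctly. The labels and scores are made up, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for spam, 0 for non-spam. scores: higher means more suspicious.
    Equals the probability that a random positive outscores a random
    negative, with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, e.g. from a linear regression model.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to the Random baseline, so values such as 0.70 indicate a moderate ability to rank spam above non-spam.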
Case studies
Table: Case study: representative reviews (the codes under the site names indicate detected patterns)
Yelp (CR: P, AR: B, FR: B, LR: D)
Representative reviews:
(5 stars) ... really was awesome to be there. I don’t know why people are complaining, ...
(5 stars) Ignore the negative reviews... that part was fun in itself!
(5 stars) ... I don’t know why people are complaining, they don’t even have to have it opened, but they do. Enjoy it!
(5 stars) ... parking is FREE... they have items on display from $100,000 and more to magnets of the cast for $8.00...
Case studies
Table: Case study: representative reviews (the codes under the site names indicate detected patterns)
Foursquare (CR: B, AS: D, HPSR: P, NSR: B)
Representative reviews:
Waste of a trip!
They are way over priced on everything, including the refrancised items from the show.
Extremely overpriced, they got famous on TV and now screw everyone with high prices!
An exhilirating experience. I find going to dumps and almost getting murdered exhilirating.
Waste of time‼!
Conclusion
Motivation: Combine information across multiple sites
Proposed a bi-level framework: Macroscopic to Microscopic
Macroscopic:
Single-site patterns
Cross-site patterns
Human annotation
Microscopic:
Classifications (Prior models and Linear Regressions)
Case studies