[系列活動] 資料探勘速遊 - session4 case-studies

Case Studies Yi-Shin Chen Institute of Information Systems and Applications Department of Computer Science National Tsing Hua University [email protected] Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation

Post on 15-Apr-2017


Page 1: [系列活動] 資料探勘速遊 - Session4 case-studies

Case Studies

Yi-Shin Chen

Institute of Information Systems and Applications

Department of Computer Science

National Tsing Hua University

[email protected]

Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation

Page 2: [系列活動] 資料探勘速遊 - Session4 case-studies

Case: Mining Reddit Data

Cases without a specific predefined goal

Page 3: [系列活動] 資料探勘速遊 - Session4 case-studies

Reddit Data

https://drive.google.com/open?id=0BwpI8947eCyuRFVDLU4tT25JbFE

Page 4: [系列活動] 資料探勘速遊 - Session4 case-studies

Reddit: The Front Page of the Internet

50k+ on this set

Page 5: [系列活動] 資料探勘速遊 - Session4 case-studies

Subreddit Categories

▷ Reddit’s structure may already provide a baseline similarity

Page 6: [系列活動] 資料探勘速遊 - Session4 case-studies

Provided Data

Page 7: [系列活動] 資料探勘速遊 - Session4 case-studies

Recover Structure

Page 8: [系列活動] 資料探勘速遊 - Session4 case-studies

Data Exploration


Page 9: [系列活動] 資料探勘速遊 - Session4 case-studies


Page 10: [系列活動] 資料探勘速遊 - Session4 case-studies

Data Mining Final Project

Role-Playing Games Sales Prediction

Group 6

Page 11: [系列活動] 資料探勘速遊 - Session4 case-studies

Dataset

• Use the Reddit comment dataset

• Data selection

– Choose 30 role-playing games from the subreddits.

– Choose useful attributes from their comments:
  • Body
  • Score
  • Subreddit

Page 12: [系列活動] 資料探勘速遊 - Session4 case-studies


Page 13: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

• Common Game Features

– Drop trivial words
  • We manually built a dictionary of trivial words.
  • Ex: “is”, “the”, “I”, “you”, “we”, “a”.

– TF-IDF
  • We use TF-IDF to calculate the importance of each word.
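The two steps on this slide (dropping trivial words, then weighting with TF-IDF) can be sketched as follows. This is only a minimal illustration in plain Python: the stop-word set is seeded with the examples from the slide, the tokenizer and the sample comments are hypothetical, and the exact TF-IDF variant the group used is not stated in the slides.

```python
import math
from collections import Counter

# Assumed stop-word list, seeded with the examples shown on the slide.
TRIVIAL_WORDS = {"is", "the", "i", "you", "we", "a"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop trivial words."""
    words = (w.strip(".,!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in TRIVIAL_WORDS]

def tf_idf(documents):
    """Return one {word: tf-idf weight} dict per document.

    tf  = count of the word in the document / number of tokens in the document
    idf = ln(number of documents / number of documents containing the word)
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({w: (c / total) * math.log(n_docs / doc_freq[w])
                        for w, c in counts.items()})
    return weights

# Hypothetical comments, for illustration only.
comments = ["The combat is fun", "The story is great", "I love the combat music"]
weights = tf_idf(comments)
```

In this toy example "fun" outweighs "combat" in the first comment, because "combat" also occurs in another comment and therefore has a lower idf.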

Page 14: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

– TF-IDF result
  • But the TF-IDF values are too small to reveal which words are significantly important.

– Define comment features
  • We manually define the common gaming words in five categories by referring to several game websites.

Game-Play | Criticize | Graphic | Music | Story

Page 15: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

• Filtering useful comments

– Common keywords of the 5 categories:

– Filtering useful comments
  • Use the frequent keywords to find the useful comments in each category.

Page 16: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

– Using FP-Growth to find other features

• To find other frequent keywords, e.g.:

{time, play, people, gt, limited}

{time, up, good, now, damage, gear, way, need, build, better, d3, something, right, being, gt, limited}

• Then, manually choose some of the frequent keywords

Page 17: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

• Comment Emotion

– Filtering outliers

• First, filter out the comments whose “score” falls in the top or bottom 2.5%, since they might be outliers.
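The outlier filter above can be sketched as a simple rank-based trim. This is a minimal sketch, not the group's code: the dictionary layout of a comment and the choice to trim by rank (rather than by interpolated percentile values) are assumptions.

```python
def trim_by_score(comments, fraction=0.025):
    """Drop the comments whose score ranks in the lowest or highest `fraction`."""
    ranked = sorted(comments, key=lambda c: c["score"])
    k = int(len(ranked) * fraction)  # how many comments to drop at each end
    return ranked[k:len(ranked) - k]

# Hypothetical data: 200 comments with scores 0..199.
comments = [{"body": f"comment {i}", "score": i} for i in range(200)]
kept = trim_by_score(comments)
```

With 200 comments, 5 are dropped at each end, leaving the 190 comments in the middle of the score distribution.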

Page 18: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

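– Emotion detection

• We use the LIWC dictionary to calculate each comment’s positive and negative emotion percentages.
  Ex: “I like this game.” -> positive
  Ex: “I hate this character.” -> negative

• Then find each category’s positive and negative emotion scores:

\[ \mathrm{Score}_{\mathrm{each\_category\_P}} = \sum_{j=0}^{n} \mathrm{score\_each\_comment}_j \times (\text{positive words}) \]

\[ \mathrm{Score}_{\mathrm{each\_category\_N}} = \sum_{j=0}^{n} \mathrm{score\_each\_comment}_j \times (\text{negative words}) \]

\[ \mathrm{TotalScore}_{\mathrm{each\_category}} = \sum_{i=0}^{m} \mathrm{score\_each\_comment}_i \]

\[ \mathrm{FinalScore}_{\mathrm{each\_category\_P}} = \mathrm{Score}_{\mathrm{each\_category\_P}} / \mathrm{TotalScore}_{\mathrm{each\_category}} \]

\[ \mathrm{FinalScore}_{\mathrm{each\_category\_N}} = \mathrm{Score}_{\mathrm{each\_category\_N}} / \mathrm{TotalScore}_{\mathrm{each\_category}} \]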
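The category scores on this slide can be illustrated with a short sketch. LIWC is a licensed dictionary, so the two tiny word sets below are hypothetical stand-ins for its positive and negative emotion categories; the comment layout and the guard for a zero total are also assumptions.

```python
# Hypothetical stand-ins for the LIWC positive/negative emotion categories.
POSITIVE = {"like", "love", "great"}
NEGATIVE = {"hate", "boring", "bad"}

def emotion_fractions(body):
    """Fraction of words in a comment that are positive / negative."""
    words = [w.strip(".,!?").lower() for w in body.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0, 0.0
    pos = sum(w in POSITIVE for w in words) / len(words)
    neg = sum(w in NEGATIVE for w in words) / len(words)
    return pos, neg

def final_scores(comments):
    """Return (FinalScore_P, FinalScore_N) for one category's comments."""
    score_p = score_n = total = 0.0
    for c in comments:
        pos, neg = emotion_fractions(c["body"])
        score_p += c["score"] * pos   # Score_P: comment score weighted by positive words
        score_n += c["score"] * neg   # Score_N: comment score weighted by negative words
        total += c["score"]           # TotalScore: sum of the comment scores
    if total == 0:                    # assumption: report 0 when there is nothing to normalize by
        return 0.0, 0.0
    return score_p / total, score_n / total

category_comments = [
    {"body": "I like this game.", "score": 4},
    {"body": "I hate this character.", "score": 2},
]
final_p, final_n = final_scores(category_comments)
```

Here the positive comment carries a Reddit score of 4 and the negative one a score of 2, so the positive final score is twice the negative one.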

Page 19: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

– Emotion detection (cont.)

In other words:

• Find each category’s emotion score for each game.

• Calculate the TotalScore of each category’s comments.

• Obtain FinalScore, the average emotion score of each category.

Page 20: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

• Sales Data Extraction

– Crawl a website’s data

– Find each game’s sales on each platform

Page 21: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

– Annual sales for each game

– Median of each platform

– Mean of all platforms

Median: 0.17 -> 30 games (26 H, 4 L)

Mean: 0.533 -> 30 games (18 H, 12 L)

– Try both the median and the mean as the sales boundary for the prediction.
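Turning a sales figure into an H or L class is a one-line threshold. The sketch below uses the two boundaries from the slide (0.17 and 0.533); the game names and sales values are hypothetical, and treating a value equal to the boundary as H is an assumption.

```python
MEDIAN_BOUNDARY = 0.17   # from the slide
MEAN_BOUNDARY = 0.533    # from the slide

def label_sales(sales, boundary):
    """Map each game to 'H' (at or above the boundary) or 'L' (below it)."""
    return {game: ("H" if value >= boundary else "L")
            for game, value in sales.items()}

# Hypothetical sales figures, for illustration only.
sales = {"game_a": 0.10, "game_b": 0.30, "game_c": 1.20}
by_median = label_sales(sales, MEDIAN_BOUNDARY)
by_mean = label_sales(sales, MEAN_BOUNDARY)
```

The same game can change class when the boundary moves, which is why the class balance differs between the median split and the mean split.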

Page 22: [系列活動] 資料探勘速遊 - Session4 case-studies

Sales Prediction

• Model Construction

– We use the 4 outputs of the preprocessing step as the input to the Naïve Bayes classifier.

Page 23: [系列活動] 資料探勘速遊 - Session4 case-studies

Evaluation

We evaluate the naïve Bayes classifier in two ways:

1. Training 70% & test 30%

2. Leave-one-out

Page 24: [系列活動] 資料探勘速遊 - Session4 case-studies

Evaluation (train 70% & test 30%)

[Bar chart: Accuracy, sales boundary = Median 0.17. Series: Test1 Total_score, Test1 No Total_score, Test2 Total_score, Test2 No Total_score. Bar labels as extracted: 0.67, 0.78, 0.67, 0.56]

[Bar chart: Accuracy, sales boundary = Mean 0.533. Series: Test1 Total_score, Test1 No Total_score, Test2 Total_score, Test2 No Total_score. Bar labels as extracted: 0.78, 0.44, 0.56, 0.22]

Page 25: [系列活動] 資料探勘速遊 - Session4 case-studies

Evaluation (Leave-one-out)

Choose 29 games as training data and the remaining one as test data; run this 30 times in total to get the accuracy.

Boundary | Features | Errors (of 30 runs) | Accuracy
Median | Total_score | 7 | 77%
Median | No Total_score | 5 | 83%
Mean | Total_score | 6 | 80%
Mean | No Total_score | 29 | 3%
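Leave-one-out evaluation can be sketched in plain Python as below. The slides only say "Naïve Bayes", so the Gaussian variant, the small variance smoothing term, and the toy feature rows are all assumptions made for illustration.

```python
import math

def fit_gaussian_nb(rows, labels):
    """Estimate the prior, per-feature means and variances of each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # 1e-6 is an assumed smoothing term that avoids zero variance.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[cls] = (n / len(rows), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        total = math.log(prior)
        for x, m, v in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return total
    return max(model, key=log_posterior)

def leave_one_out_accuracy(rows, labels):
    """Train on all rows but one, test on the held-out row, repeat for every row."""
    correct = 0
    for i in range(len(rows)):
        model = fit_gaussian_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict(model, rows[i]) == labels[i]
    return correct / len(rows)

# Hypothetical, well-separated feature rows (not the project's data).
rows = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
        [0.90, 0.80], [0.80, 0.90], [0.85, 0.95]]
labels = ["L", "L", "L", "H", "H", "H"]
accuracy = leave_one_out_accuracy(rows, labels)
```

With 30 games the accuracy is simply (30 - errors) / 30, which matches the table: 7 errors give 77%, 5 give 83%, 6 give 80%, and 29 give 3%.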

Page 26: [系列活動] 資料探勘速遊 - Session4 case-studies

Attributes: Original Features Scores

[Figure: attribute distribution projected onto the target (sales) domain, with the sales boundaries for the H and L classes marked. Median: 0.17M, Mean: 0.53M]

Evaluation: Leave-one-out, total sample size: 30

Boundary | Error times | Accuracy
Median | 5 | 83%
Mean | 29 | 3%

Page 27: [系列活動] 資料探勘速遊 - Session4 case-studies

Attributes: Transformed Features Scores

[Figure: attribute distribution projected onto the target (sales) domain, with the sales boundaries for the H and L classes marked. Median: 0.17M, Mean: 0.53M]

Evaluation: Leave-one-out, total sample size: 30

Boundary | Error times | Accuracy
Median | 7 | 77%
Mean | 6 | 80%

Page 28: [系列活動] 資料探勘速遊 - Session4 case-studies

Finding related subreddits based on the gamers’ participation

Data Mining Final Project

Group#4

Page 29: [系列活動] 資料探勘速遊 - Session4 case-studies

Goal & Hypothesis

- To suggest new subreddits to users of the Gaming subreddit, based on the participation of other users

Page 30: [系列活動] 資料探勘速遊 - Session4 case-studies

Data Exploration

We extracted the top 20 gaming-related subreddits to focus on:

leagueoflegends

DestinyTheGame

DotA2

GlobalOffensive

Gaming

GlobalOffensiveTrade

witcher

hearthstone

Games

2007scape

smashbros

wow

Smite

heroesofthestorm

EliteDangerous

FIFA

Guildwars2

tf2

summonerswar

runescape

Page 31: [系列活動] 資料探勘速遊 - Session4 case-studies

Data Exploration (Cont.)

We sorted the subreddits by users’ comment activity.

We found that the most active subreddit is leagueoflegends.

Page 32: [系列活動] 資料探勘速遊 - Session4 case-studies


Data Exploration (Cont.)

Page 33: [系列活動] 資料探勘速遊 - Session4 case-studies

Data Pre-processing

- Removed comments from moderators and bots (>2,000 comments)

- Removed comments from default subreddits

- Extracted all the users that commented on at least 3 subreddits

- Transformed data into “Market Basket” transactions: 159,475 users
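The pre-processing steps above can be sketched as one pass that builds a "market basket" (a set of subreddits) per user. This is a minimal sketch under stated assumptions: comments arrive as (author, subreddit) pairs, accounts with more than 2,000 comments are treated as moderators or bots, and the `excluded_subreddits` parameter is a hypothetical way to drop default subreddits.

```python
from collections import defaultdict

def build_baskets(comments, min_subreddits=3, max_comments=2000,
                  excluded_subreddits=frozenset()):
    """Return {author: set of subreddits} for users kept after filtering."""
    comment_counts = defaultdict(int)
    subreddits = defaultdict(set)
    for author, subreddit in comments:
        comment_counts[author] += 1          # counted before any exclusion (assumption)
        if subreddit not in excluded_subreddits:
            subreddits[author].add(subreddit)
    return {author: subs for author, subs in subreddits.items()
            if comment_counts[author] <= max_comments
            and len(subs) >= min_subreddits}

# Hypothetical (author, subreddit) pairs.
comments = [
    ("alice", "DotA2"), ("alice", "atheism"), ("alice", "Neverwinter"), ("alice", "DotA2"),
    ("bob", "wow"), ("bob", "FIFA"),
    ("carol", "tf2"), ("carol", "wow"), ("carol", "Smite"),
]
baskets = build_baskets(comments)
```

Each remaining user becomes one transaction, so the number of baskets equals the number of users that pass the filters.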

Page 34: [系列活動] 資料探勘速遊 - Session4 case-studies

Processing

Sorted the frequent items (sort.js)

E.g. DotA2, Neverwinter, atheism → DotA2, atheism, Neverwinter

Page 35: [系列活動] 資料探勘速遊 - Session4 case-studies

Processing

- Choosing the minimum support from 159,475 transactions

- Max possible support: 25.71% (41,004 transactions)

- Min possible support: 0.00063% (1 transaction)

Min_Support set to 0.05% ≈ 79.74 transactions, i.e. at least 80 transactions
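The relation between a support percentage and a transaction count is plain arithmetic, sketched below with the 159,475 transactions from the slide. The helper names are hypothetical; rounding up to a whole transaction is an assumption that happens to reproduce the counts quoted on the later slides.

```python
import math

N_TRANSACTIONS = 159_475  # from the slide

def min_count(support_percent, n=N_TRANSACTIONS):
    """Smallest whole number of transactions that satisfies the support threshold."""
    return math.ceil(n * support_percent / 100)

def support_percent(count, n=N_TRANSACTIONS):
    """Support of an itemset that occurs in `count` transactions, in percent."""
    return count / n * 100

thresholds = {p: min_count(p) for p in (0.05, 0.8, 1, 5)}
max_support = support_percent(41_004)
min_support = support_percent(1)
```

The thresholds come out as 80, 1,276, 1,595 and 7,974 transactions for 0.05%, 0.8%, 1% and 5%, and a single transaction corresponds to a support of about 0.00063%.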

Page 36: [系列活動] 資料探勘速遊 - Session4 case-studies

Processing

- What an FP-Growth tree looks like

Page 37: [系列活動] 資料探勘速遊 - Session4 case-studies

Processing

- Our FP-Growth tree (minimum support = 1%, at least 1,595 transactions)

Page 38: [系列活動] 資料探勘速遊 - Session4 case-studies

Processing

- Our FP-Growth tree (minimum support = 5%, at least 7,974 transactions)

Page 39: [系列活動] 資料探勘速遊 - Session4 case-studies

Processing

[Chart; vertical axis: Number of Generated Transactions]

Page 40: [系列活動] 資料探勘速遊 - Session4 case-studies

Processing

[Chart; vertical axis: Number of Generated Transactions]

Page 41: [系列活動] 資料探勘速遊 - Session4 case-studies

Processing

Our minimum support = 0.8% (at least 1,276 transactions)

Games -> politics: support 1.30% (2,073 users), confidence 8.30%

Page 42: [系列活動] 資料探勘速遊 - Session4 case-studies

Post-processing

We classified the rules into 4 groups based on their confidence:

Group #1 | Group #2 | Group #3 | Group #4
1% - 20% | 21% - 40% | 41% - 60% | 61% - 100%

Page 43: [系列活動] 資料探勘速遊 - Session4 case-studies

Post-processing

- {Antecedent} → {Consequent}
  e.g. {FIFA, Soccer} → {NFL}

- Lift ratio = confidence / benchmark confidence

- Benchmark confidence = number of transactions containing the consequent / total number of transactions in the data set

1.00
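The lift ratio defined above can be computed directly from the baskets. The sketch below is only an illustration: the function name and the five baskets are hypothetical, and a lift above 1 is read in the usual way, as the antecedent making the consequent more likely than its baseline rate.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_antecedent = sum(antecedent <= b for b in baskets)
    n_consequent = sum(consequent <= b for b in baskets)
    n_both = sum((antecedent | consequent) <= b for b in baskets)
    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    benchmark = n_consequent / n            # benchmark confidence
    lift = confidence / benchmark if benchmark else 0.0
    return support, confidence, lift

# Hypothetical baskets (one set of subreddits per user).
baskets = [
    {"FIFA", "Soccer", "NFL"},
    {"FIFA", "Soccer"},
    {"FIFA", "Soccer", "NFL"},
    {"NFL", "DotA2"},
    {"DotA2", "wow"},
]
support, confidence, lift = rule_metrics(baskets, {"FIFA", "Soccer"}, {"NFL"})
```

In this toy data the rule holds in 2 of 5 baskets (support 0.4), 2 of the 3 baskets with the antecedent contain NFL (confidence about 0.67), and NFL appears in 3 of 5 baskets overall (benchmark 0.6), giving a lift of about 1.11.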

Page 44: [系列活動] 資料探勘速遊 - Session4 case-studies

Post-processing

We created 8 surveys covering 64 rules (2 rules per group in each survey), with 2 questions per rule.

Page 45: [系列活動] 資料探勘速遊 - Session4 case-studies

Post-processing

Q1 (Dependency): Do you think subreddit A and subreddit B are related?

[ yes | no ]

Q2 (Expected Confidence): If you are a subscriber of subreddit A, will you also be interested in subreddit B?

Definitely No [ 1 , 2 , 3 , 4 , 5 ] Definitely Yes

Page 46: [系列活動] 資料探勘速遊 - Session4 case-studies

Post-processing

●We got responses from 52 people

●Data Confidence vs. Expected Confidence → Lift Over Expected

Page 47: [系列活動] 資料探勘速遊 - Session4 case-studies

Post-processing

●% Non-related is the proportion of “No” answers among all responses to the first question.

●Interestingness = (average expected confidence) × (% Non-related).
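The interestingness value can be sketched from the two survey questions. The slides do not say how the 1 to 5 answers to the second question are scaled into an expected confidence, so the sketch below simply averages the raw answers; that choice, the function name, and the sample responses are all assumptions.

```python
from statistics import mean

def interestingness(q1_answers, q2_answers):
    """Average expected confidence (Q2) times the share of 'no' answers to Q1."""
    pct_non_related = sum(a == "no" for a in q1_answers) / len(q1_answers)
    return mean(q2_answers) * pct_non_related

# Hypothetical responses for one rule.
q1 = ["yes", "no", "no", "yes"]   # "Are subreddit A and subreddit B related?"
q2 = [2, 4, 3, 3]                 # 1 = Definitely No ... 5 = Definitely Yes
value = interestingness(q1, q2)
```

A rule scores high when respondents do not expect the two subreddits to be related (many "no" answers) yet say they would be interested, which is the sense of "interesting" used in the results slides.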

Page 48: [系列活動] 資料探勘速遊 - Session4 case-studies

Post-processing

Interestingness value per group

Page 49: [系列活動] 資料探勘速遊 - Session4 case-studies

Results

●How can we suggest new subreddits, then?

Group #1 (1% - 20%): less confidence, high interestingness

Group #4 (61% - 100%): high confidence, less interestingness

Page 50: [系列活動] 資料探勘速遊 - Session4 case-studies

Results

●Suggest them as:

Group #1 (1% - 20%): “Maybe you could be interested in...”

Group #4 (61% - 100%): “Other people also talk about...”

Page 51: [系列活動] 資料探勘速遊 - Session4 case-studies

Results

●Results from Group #4 (9 rules)

Page 52: [系列活動] 資料探勘速遊 - Session4 case-studies

Results

●Results from Group #1 (173 rules)

Page 53: [系列活動] 資料探勘速遊 - Session4 case-studies

Results


Group #4

Group #1

Page 54: [系列活動] 資料探勘速遊 - Session4 case-studies

Challenges

- Reducing the scope of the data to decrease computation time

- Defining and calculating the interestingness value

- Deciding how to suggest the rules we obtained to Reddit users

Page 55: [系列活動] 資料探勘速遊 - Session4 case-studies

Conclusion

- We cannot be sure that our recommendation system will be 100% useful to users, since interestingness can vary depending on the purpose of the experiment.

- To get more accurate results, we would need to ask about all of the generated association rules (more than 300) in the survey.

Page 56: [系列活動] 資料探勘速遊 - Session4 case-studies

References

- Liu, B. (2000). Analyzing the subjective interestingness of association rules. IEEE Intelligent Systems, 15(5), 47-55. doi:10.1109/5254.889106

- Calculating Lift, How We Make Smart Online Product Recommendations (https://www.youtube.com/watch?v=DeZFe1LerAQ)

- Reddit website (https://www.reddit.com/)

Page 57: [系列活動] 資料探勘速遊 - Session4 case-studies

Application of Data Mining Techniques on Active Users in Reddit


Page 58: [系列活動] 資料探勘速遊 - Session4 case-studies

Data Mining Process

Raw data → Preprocessing → Clustering & Evaluation → Knowledge

Page 59: [系列活動] 資料探勘速遊 - Session4 case-studies

Facts - Why Active Users?

** http://www.pewinternet.org/2013/07/03/6-of-online-adults-are-reddit-users/


Page 60: [系列活動] 資料探勘速遊 - Session4 case-studies

Facts - Why Active Users?

** http://www.snoosecret.com/statistics-about-reddit.html

Page 61: [系列活動] 資料探勘速遊 - Session4 case-studies

Active Users Definitions

We define active users as people who:

1. have posted or commented in at least 5 subreddits;

2. have at least 5 posts or comments in each of those subreddits;

3. have a number of comments above Q3 (the third quartile);

4. among the users who satisfy the three criteria above, have an average score above Q3.

Page 62: [系列活動] 資料探勘速遊 - Session4 case-studies

Preprocessing

Total posts in May 2015: 54,504,410

Total distinct authors in May 2015: 2,611,449

After deleting bots, [deleted] authors, non-English posts, URL-only posts, and posts with length < 3, we got 46,936,791 rows and 2,514,642 distinct authors.

Finally, we extracted 25,298 “active users” (0.97%) and 5,007,845 posts (9.18%) using our active-user definition.

Page 63: [系列活動] 資料探勘速遊 - Session4 case-studies

Clustering: K-means, K-prototype

Number of clusters (k): k = 10, k = √(n/2), k = √(√(n/2))

Implemented with the Python sklearn module’s KMeans() and an open-source KPrototypes().
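The two heuristics for the number of clusters are easy to evaluate numerically. The sketch below assumes that n is the number of active users extracted in the preprocessing step (25,298); the slides do not state which n was used.

```python
import math

n = 25_298  # assumption: the number of active users from the preprocessing slide

k_rule_of_thumb = math.sqrt(n / 2)             # k = sqrt(n / 2)
k_double_root = math.sqrt(k_rule_of_thumb)     # k = sqrt(sqrt(n / 2))

print(f"sqrt(n/2)       = {k_rule_of_thumb:.1f}")
print(f"sqrt(sqrt(n/2)) = {k_double_root:.1f}")
```

Under that assumption the first heuristic suggests roughly 112 clusters, while the second lands between 10 and 11, close to the fixed choice of k = 10.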

Page 64: [系列活動] 資料探勘速遊 - Session4 case-studies

Attributes

author = 1

subreddit = C (frequency: 27>others)

activity = 27/(11+6+27+17+3) = 0.42

assurance = 3/1 = 3

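The "subreddit" and "activity" attributes from the worked example can be reproduced with a few lines. The post counts are the ones on the slide; the subreddit names A, B, D and E and the pairing of counts with names are assumptions, and "assurance" is left out because the slides do not define it.

```python
def user_attributes(post_counts):
    """Return (most frequent subreddit, share of the user's posts made there)."""
    top = max(post_counts, key=post_counts.get)
    activity = post_counts[top] / sum(post_counts.values())
    return top, activity

# Counts from the slide; only "C" is named there, the other labels are hypothetical.
counts = {"A": 11, "B": 6, "C": 27, "D": 17, "E": 3}
subreddit, activity = user_attributes(counts)
```

This gives subreddit C and an activity of 27/64, about 0.42, as on the slide.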

Page 65: [系列活動] 資料探勘速遊 - Session4 case-studies

Clustering: K-means, K-prototype

ratio


Page 66: [系列活動] 資料探勘速遊 - Session4 case-studies

Clustering: K-means, K-prototype

nominal

nominal


Page 67: [系列活動] 資料探勘速遊 - Session4 case-studies

Evaluation

Split the data into 3 parts: A, B, C.

Run K-means and K-prototype on A+B and on A+C.

Compare the labels of A obtained from the clustering result of A+B with those obtained from A+C.

Measurements:

Adjusted Rand index: measures the similarity between two label lists (e.g. AB-A vs. AC-A).

Homogeneity: each cluster contains only members of a single class.

Completeness: all members of a given class are assigned to the same cluster.

V-measure: harmonic mean of homogeneity and completeness.
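Three of the measurements above (homogeneity, completeness and V-measure) can be computed from entropies, as in the plain-Python sketch below. The function names and the two label lists are hypothetical; one list plays the role of the labels of part A from the A+B run and the other the labels of A from the A+C run.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(labels), natural log."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given)."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(c / n * math.log(c / given_counts[g]) for (g, _), c in joint.items())

def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, V-measure) of `clusters` against `classes`."""
    h_classes, h_clusters = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_classes == 0 else 1 - conditional_entropy(classes, clusters) / h_classes
    completeness = 1.0 if h_clusters == 0 else 1 - conditional_entropy(clusters, classes) / h_clusters
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# Hypothetical labels of part A from the two runs.
labels_from_ab = [0, 0, 1, 1, 2, 2]
renamed = [1, 1, 0, 0, 2, 2]        # same grouping, cluster ids permuted
merged = [0, 0, 1, 1, 1, 1]         # two groups merged into one cluster

same = homogeneity_completeness_v(labels_from_ab, renamed)
coarser = homogeneity_completeness_v(labels_from_ab, merged)
```

A pure renaming of cluster ids scores 1.0 on all three measures, while merging two groups keeps completeness at 1.0 but lowers homogeneity, which is the asymmetry the two definitions describe.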

Page 68: [系列活動] 資料探勘速遊 - Session4 case-studies

Evaluation: K-means vs. K-prototype

Measure | K-means | K-prototype
Adjusted Rand index | 0.510904 | 0.193399
Homogeneity | 0.803911 | 0.326116
Completeness | 0.681669 | 0.298049
V-measure | 0.737761 | 0.311451

Page 69: [系列活動] 資料探勘速遊 - Session4 case-studies

VISUALIZATION


Page 70: [系列活動] 資料探勘速遊 - Session4 case-studies


Page 71: [系列活動] 資料探勘速遊 - Session4 case-studies

Visualization Part


Page 72: [系列活動] 資料探勘速遊 - Session4 case-studies


Page 73: [系列活動] 資料探勘速遊 - Session4 case-studies


Page 74: [系列活動] 資料探勘速遊 - Session4 case-studies


Page 75: [系列活動] 資料探勘速遊 - Session4 case-studies

RapidMiner


Page 76: [系列活動] 資料探勘速遊 - Session4 case-studies

Weka


Page 77: [系列活動] 資料探勘速遊 - Session4 case-studies

Orange


Page 78: [系列活動] 資料探勘速遊 - Session4 case-studies

Kibana
