Technical Appendix
Catch the Pink Flamingo Analysis
Produced by: Junbin Yuan
2016-08-14
Acquiring, Exploring and Preparing the Data
Data Exploration
Data Set Overview
The list below describes each of the files available for analysis, with a short description of what is found in each one and the fields it contains.

File: ad-clicks.csv
A line is added to this file when a player clicks on an advertisement in the Flamingo app.
Fields:
- timestamp: when the click occurred
- txID: a unique id (within ad-clicks.log) for the click
- userSessionid: the id of the user session for the user who made the click
- teamid: the current team id of the user who made the click
- userid: the user id of the user who made the click
- adID: the id of the ad clicked on
- adCategory: the category/type of ad clicked on

File: buy-clicks.csv
A line is added to this file when a player makes an in-app purchase in the Flamingo app.
Fields:
- timestamp: when the purchase was made
- txID: a unique id (within buy-clicks.log) for the purchase
- userSessionid: the id of the user session for the user who made the purchase
- team: the current team id of the user who made the purchase
- userid: the user id of the user who made the purchase
- buyID: the id of the item purchased
- price: the price of the item purchased

File: game-clicks.csv
A line is added to this file each time a user performs a click in the game.
Fields:
- time: when the click occurred
- clickid: a unique id for the click
- userid: the id of the user performing the click
- usersessionid: the id of the session of the user when the click is performed
- isHit: denotes if the click was on a flamingo (value is 1) or missed the flamingo (value is 0)
- teamId: the id of the team of the user
- teamLevel: the current level of the team of the user

File: level-events.csv
A line is added to this file each time a team starts or finishes a level in the game.
Fields:
- time: when the event occurred
- eventid: a unique id for the event
- teamid: the id of the team
- level: the level started or completed
- eventType: the type of event, either start or end

File: team.csv
This file contains a line for each team terminated in the game.
Fields:
- teamid: the id of the team
- name: the name of the team
- teamCreationTime: the timestamp when the team was created
- teamEndTime: the timestamp when the last member left the team
- strength: a measure of team strength, roughly corresponding to the success of a team
- currentLevel: the current level of the team

File: team-assignments.csv
A line is added to this file each time a user joins a team. A user can be in at most a single team at a time.
Fields:
- time: when the user joined the team
- team: the id of the team
- userid: the id of the user
- assignmentid: a unique id for this assignment

File: users.csv
This file contains a line for each user playing the game.
Fields:
- timestamp: when the user first played the game
- id: the user id assigned to the user
- nick: the nickname chosen by the user
- twitter: the twitter handle of the user
- dob: the date of birth of the user
- country: the two-letter country code where the user lives

File: user-session.csv
Each line in this file describes a user session, which denotes when a user starts and stops playing the game. Additionally, when a team goes to the next level in the game, the session is ended for each user in the team and a new one is started.
Fields:
- timeStamp: a timestamp denoting when the event occurred
- userSessionId: a unique id for the session
- userId: the current user's ID
- teamId: the current user's team
- assignmentId: the team assignment id for the user to the team
- sessionType: whether the event is the start or end of a session
- teamLevel: the level of the team during this session
- platformType: the type of platform of the user during this session
Aggregation
Amount spent buying items: $21407
source="buy-clicks.csv" | stats sum(price)

# of unique items available to be purchased: 6
source="buy-clicks.csv" | stats count by buyId
A histogram showing how many times each item is purchased:
source="buy-clicks.csv" | stats count by buyId
A histogram showing how much money was made from each item:
source="buy-clicks.csv" | stats sum(price) by buyId
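The Splunk searches above can be mirrored in plain Python. The rows below are invented sample purchases for illustration, not the real buy-clicks.csv contents, so the totals differ from the reported figures:

```python
from collections import defaultdict

# Hypothetical rows mimicking buy-clicks.csv (only the fields used here)
purchases = [
    {"userId": 1, "buyId": 0, "price": 3.0},
    {"userId": 1, "buyId": 2, "price": 10.0},
    {"userId": 2, "buyId": 2, "price": 10.0},
    {"userId": 3, "buyId": 5, "price": 20.0},
]

# stats sum(price): total amount spent buying items
total_spent = sum(row["price"] for row in purchases)

# stats count by buyId: how many times each item is purchased
count_by_item = defaultdict(int)
# stats sum(price) by buyId: how much money was made from each item
revenue_by_item = defaultdict(float)
for row in purchases:
    count_by_item[row["buyId"]] += 1
    revenue_by_item[row["buyId"]] += row["price"]

print(total_spent)          # 43.0
print(dict(count_by_item))  # {0: 1, 2: 2, 5: 1}
```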
Filtering
A histogram showing total amount of money spent by the top ten users (ranked by how much
money they spent).
source="buy-clicks.csv" | stats sum(price) by userId | sort 10 -sum(price)
The following table shows the user id, platform, and hit-ratio percentage for the top three buying
users:
source="user-session.csv" userId=2229 | table platformType
source="user-session.csv" userId=12 | table platformType
source="user-session.csv" userId=471 | table platformType
source="game-clicks.csv" userId=2229 | stats avg(isHit)
source="game-clicks.csv" userId=12 | stats avg(isHit)
source="game-clicks.csv" userId=471 | stats avg(isHit)
Rank   User Id   Platform   Hit-Ratio (%)
1      2229      iPhone     11.5970
2      12        iPhone     13.0682
3      471       iPhone     14.5038
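The same ranking and hit-ratio logic can be sketched in plain Python; the purchase and game-click rows below are made up for illustration, so the numbers do not match the table above:

```python
from collections import defaultdict

purchases = [(2229, 5.0), (2229, 20.0), (12, 10.0), (471, 3.0), (471, 4.0)]  # (userId, price)
clicks = [(2229, 1), (2229, 0), (2229, 0), (12, 1), (12, 0)]                 # (userId, isHit)

# stats sum(price) by userId | sort 10 -sum(price): top spenders, highest first
spent = defaultdict(float)
for user, price in purchases:
    spent[user] += price
top_spenders = sorted(spent.items(), key=lambda kv: -kv[1])[:10]

# stats avg(isHit): the hit ratio is the mean of the 0/1 isHit flags, as a percentage
hits = defaultdict(list)
for user, is_hit in clicks:
    hits[user].append(is_hit)
hit_ratio = {u: 100.0 * sum(v) / len(v) for u, v in hits.items()}

print(top_spenders)   # [(2229, 25.0), (12, 10.0), (471, 7.0)]
print(hit_ratio[12])  # 50.0
```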
Data Classification Analysis
Data Preparation
Analysis of combined_data.csv
Sample Selection
Item                          Amount
# of Samples                  4619
# of Samples with Purchases   1411
Attribute Creation
A new categorical attribute was created to enable analysis of players as broken into 2
categories (HighRollers and PennyPinchers). A screenshot of the attribute follows:
The new attribute prediction_category was derived from avg_price to enable analysis of each player's spending behavior. A Numeric Binner node assigned each player to one of two bins: the PennyPincher bin covers an avg_price of $5.00 or less, and the HighRoller bin covers an avg_price above $5.00.
This new categorical attribute was necessary because the goal of this project is to identify what makes a HighRoller versus a PennyPincher. The attribute encodes that distinction directly: HighRollers buy items that cost more than $5.00, and PennyPinchers buy items that cost $5.00 or less.
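The binning rule amounts to a one-line function; this is a sketch of the Numeric Binner configuration described above, assuming the $5.00 boundary itself falls in the PennyPincher bin:

```python
def prediction_category(avg_price):
    """Mimic the Numeric Binner: avg_price <= $5.00 -> PennyPincher, above -> HighRoller."""
    return "PennyPincher" if avg_price <= 5.00 else "HighRoller"

print(prediction_category(3.00))  # PennyPincher
print(prediction_category(5.00))  # PennyPincher (the boundary belongs to the lower bin)
print(prediction_category(7.50))  # HighRoller
```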
Attribute Selection
The following attributes were filtered from the dataset for the following reasons:
Attribute and rationale for filtering:
- userId: Each user has a unique id. This attribute has a large number of distinct values and high intrinsic information, which would cause the decision tree to overfit, so it was excluded.
- userSessionId: Each user session has a unique id. Like userId, it has many distinct values and high intrinsic information, which would cause the decision tree to overfit, so it was excluded.
- avg_price: The average purchase price for a user session. The new attribute prediction_category was derived directly from this attribute, so it was excluded to avoid leaking the target into the model.
Data Partitioning and Modeling
The data was partitioned into train and test datasets. The train dataset was used to create the decision tree model, and the trained model was then applied to the test dataset.
This is important because the decision tree model is constructed from the train (learn) partition of the data, while the test partition is held out for model evaluation and model selection: a tree of any given size built on the train data commits to specific predictions, which can then be compared against the actual outcomes in the test data.
When partitioning the data using sampling, it is important to set the random seed because a fixed seed ensures reproducible results upon re-execution. If no fixed seed is defined, a new random seed is taken for each execution, producing a different split each time.
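A minimal Python sketch of why the fixed seed matters; the 60/40 split fraction and the seed value 42 are illustrative choices, not taken from the actual KNIME workflow:

```python
import random

def partition(rows, train_fraction=0.6, seed=42):
    """Shuffle with a fixed seed, then split; fraction and seed are illustrative."""
    rng = random.Random(seed)      # fixed seed -> the same shuffle on every run
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
train1, test1 = partition(rows)
train2, test2 = partition(rows)
# Re-running with the same seed reproduces the identical split
assert (train1, test1) == (train2, test2)
assert len(train1) == 6 and len(test1) == 4
```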
A screenshot of the resulting decision tree can be seen below:
Evaluation
A screenshot of the confusion matrix can be seen below:
As seen in the screenshot above, the overall accuracy of the model is 87.603%.
True Positives: 459 actual PennyPinchers were correctly classified as PennyPinchers.
False Positives: 62 HighRollers were incorrectly labelled as PennyPinchers.
False Negatives: 43 PennyPinchers were incorrectly labelled as HighRollers.
True Negatives: 283 HighRollers were correctly classified as HighRollers.
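The reported accuracy can be verified directly from the four counts above:

```python
# Confusion-matrix counts from the screenshot above
tp, fp, fn, tn = 459, 62, 43, 283

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(100 * accuracy, 3))  # 87.603
```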
Analysis Conclusions
The final KNIME workflow is shown below:
What makes a HighRoller vs. a PennyPincher?
The attribute platformType turned out to be the decisive predictor. Users on iPhone were more likely to be HighRollers, while players on the other platforms (android, linux, windows, mac) spent less and were more likely to be PennyPinchers.
Specific Recommendations to Increase Revenue
1. Since iPhone users are the big spenders, releasing more tools in the iPhone application to facilitate flamingo catching is likely to increase revenue.
2. When users miss the flamingo, prompt them with recommendations for useful tools. The suggestion message may encourage eager players to make more purchases.
Clustering Analysis
Attribute Selection
Attribute and rationale for selection:
- isHit: This attribute is in the game-clicks.csv file. If a user's click was on a flamingo, the value is 1; if the click missed the flamingo, the value is 0. It measures the accuracy of the users, and users are associated with teams in the file.
- price: This attribute is in the buy-clicks.csv file. It holds the price of the item purchased by a user, and users are grouped into teams in the file.
Aggregated attributes (TotalHits and Revenue):
- TotalHits is the aggregation of the total number of successful hits per team. The analysis tried to find out whether this number is related to team spending behavior.
- Revenue is the aggregation of the total spending per team. The analysis tried to find out whether this number is related to the total number of hits per team.
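The two aggregations can be sketched in plain Python; the per-click and per-purchase rows below are invented examples, not the real game-clicks.csv and buy-clicks.csv contents:

```python
from collections import defaultdict

game_clicks = [(1, 1), (1, 0), (1, 1), (2, 1)]  # (teamId, isHit)
buy_clicks = [(1, 5.0), (2, 20.0), (2, 3.0)]    # (teamId, price)

# TotalHits: number of clicks per team that actually hit a flamingo (isHit = 1)
total_hits = defaultdict(int)
for team, is_hit in game_clicks:
    total_hits[team] += is_hit

# Revenue: total spending per team
revenue = defaultdict(float)
for team, price in buy_clicks:
    revenue[team] += price

# One (TotalHits, Revenue) row per team becomes the K-Means training data
training_rows = [(total_hits[t], revenue[t]) for t in sorted(total_hits)]
print(training_rows)  # [(2, 5.0), (1, 23.0)]
```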
Training Data Set Creation
The training data set used for this analysis is shown below (first 5 lines):
<Fill In Screenshot>
Dimensions of the training data set (rows x columns): 104 x 2
# of clusters created: 2
Cluster Centers
# Train a KMeans model to create two clusters (PySpark MLlib)
from pyspark.mllib.clustering import KMeans

clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10,
                        initializationMode="random")

# Display the centers of the two clusters
print(clusters.centers)
Cluster #   Cluster Center
1           array([ 556.85106383,  60.59574468])
2           array([ 972.38596491, 325.59649123])
These clusters can be differentiated from each other as follows: the first number (field 1) in each array is the total number of successful hits by a team, and the second number (field 2) is the revenue per team.
Cluster 1 is different from the others in that it has a low number of total successful hits and low revenue. Teams with fewer successful hits (isHit = 1), i.e. a lower success rate at catching flamingos, tend to spend less.
Cluster 2 is different from the others in that it has a high number of total successful hits and high revenue. Teams with more successful hits (isHit = 1), i.e. a higher success rate at catching flamingos, tend to spend more.
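Given the two reported centers, a team's (TotalHits, Revenue) point is assigned to whichever center is nearest in Euclidean distance, which is exactly the rule K-Means applies; the two query points below are hypothetical:

```python
import math

# Cluster centers reported above: (TotalHits, Revenue)
centers = [(556.85106383, 60.59574468), (972.38596491, 325.59649123)]

def assign(point):
    """Index of the nearest center by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

print(assign((500.0, 50.0)))    # 0: the low-hits, low-revenue cluster
print(assign((1000.0, 300.0)))  # 1: the high-hits, high-revenue cluster
```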
Recommended Actions
Action recommended: Release more in-app ads promoting tools that make it easier to hit a flamingo.
Rationale: Eglence Inc. can generate more revenue by introducing more tools for catching the flamingos; a tool with more features can be priced higher.

Action recommended: Implement a team flamingo-catching event.
Rationale: Eglence Inc. can generate more revenue by introducing team events for catching the flamingos. When players catch the flamingo as a team, suggest strategies and associated tools that can be used by the team.
Graph Analytics Analysis
Modeling Chat Data using a Graph Data Model
This chat data model was created with the Neo4j graph tool from six CSV files. The data contains information about users, teams, team chat sessions, chat items, and timestamps to support node creation, and it provides the relationships for creating chats, joining teams, leaving teams, mentioning users, and responding to chats. From this graph model, various analyses of player behavior can be performed, which can benefit Eglence Inc. by targeting users for revenue generation in the Flamingo game.
Creation of the Graph Database for Chats
Describe the steps you took for creating the graph database. As part of these steps
i) Write the schema of the 6 CSV files
Chat_create_team_chat.csv: userid, teamid, TeamChatSessionID, timestamp
Chat_item_team_chat.csv: userid, TeamChatSessionID, ChatItemID, timestamp
Chat_join_team_chat.csv: userid, TeamChatSessionID, timestamp
Chat_leave_team_chat.csv: userid, TeamChatSessionID, timestamp
Chat_mention_team_chat.csv: ChatItem, userid, timestamp
Chat_respond_team_chat.csv: chatid1, chatid2, timestamp
ii) Explain the loading process and include a sample LOAD command
The six CSV files were loaded into Neo4j one by one:

After loading Chat_create_team_chat.csv, three node types and two edge types are created. Nodes: User, Team, and TeamChatSession. Edges: CreateSession and OwnedBy.
After loading Chat_join_team_chat.csv, two node types and one edge type are created. Nodes: User and TeamChatSession. Edge: Join.
After loading Chat_leave_team_chat.csv, two node types and one edge type are created. Nodes: User and TeamChatSession. Edge: Leave.
After loading Chat_item_team_chat.csv, three node types and two edge types are created. Nodes: User, TeamChatSession, and ChatItem. Edges: CreateChat and PartOf.
After loading Chat_mention_team_chat.csv, two node types and one edge type are created. Nodes: ChatItem and User. Edge: Mentioned.
After loading Chat_respond_team_chat.csv, one node type (ChatItem) and one edge type (ResponseTo) are created; the edge connects a ChatItem to the ChatItem it responds to.
Example of load command:
LOAD CSV FROM "file:///E:/BigDataCapstoneProjectWeek4/chat_item_team_chat.csv" AS row
MERGE (u:User {id: toInt(row[0])})
MERGE (c:TeamChatSession {id: toInt(row[1])})
MERGE (i:ChatItem {id: toInt(row[2])})
MERGE (u)-[:CreateChat{timeStamp: row[3]}]->(i)
MERGE (i)-[:PartOf{timeStamp: row[3]}]->(c)
iii) Present a screenshot of some part of the graph you have generated. The graphs
must include clearly visible examples of most node and edge types.
Finding the longest conversation chain and its participants
Report the results including the length of the conversation (path length) and how many unique
users were part of the conversation chain. Describe your steps. Write the query that produces
the correct answer.
The longest conversation chain in the chat data, following the "ResponseTo" edge label, contains 10 ChatItem nodes connected by 9 ResponseTo edges (a path length of 9).
// Step 1: Get the longest ResponseTo path
match p=(i:ChatItem)-[:ResponseTo*]->(j:ChatItem)
return p, length(p) AS LenLongestPath,
       extract(n in nodes(p) | n.id) AS longestPath
order by length(p) desc limit 1
// Step 2: Get the users who are in the longest path
// (there are 5 users in this longest conversation)
match p=(i:ChatItem)-[:ResponseTo*]->(j:ChatItem) where length(p)=9
with extract(n in nodes(p) | n.id) AS LongestPath
match (u:User)-[:CreateChat]->(k:ChatItem) where k.id in LongestPath
return count(distinct u) as NumUsers
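The chain search can be sketched without Neo4j by treating ResponseTo as an edge map and finding the longest path in the resulting DAG; the edges below are hypothetical, not the real chat data:

```python
from functools import lru_cache

# Hypothetical ResponseTo edges: chat item -> the chat item it responds to
responds_to = {2: 1, 3: 2, 4: 3, 6: 5}

@lru_cache(maxsize=None)
def chain_length(item):
    """Number of ResponseTo edges on the longest chain starting at this item."""
    nxt = responds_to.get(item)
    return 0 if nxt is None else 1 + chain_length(nxt)

items = set(responds_to) | set(responds_to.values())
longest = max(chain_length(i) for i in items)
print(longest)  # 3 edges, i.e. a 4-item conversation: 4 -> 3 -> 2 -> 1
```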
Analyzing the relationship between top 10 chattiest users and top 10
chattiest teams
Describe your steps from Question 2. In the process, create the following two tables. You only
need to include the top 3 for each table. Identify and report whether any of the chattiest users
were part of any of the chattiest teams.
// Get the chattiest users
match q=(u:User)-[:CreateChat*]-(i:ChatItem)
with u, count(u) as ChatCount
return u, ChatCount
order by ChatCount desc limit 10
Chattiest Users
User    Number of Chats
394     115
2067    111
209     109
// Get the chattiest teams
match q=(i:ChatItem)-[:PartOf*]->(c:TeamChatSession)-[:OwnedBy*]->(t:Team)
with q, t
return count(q), t
order by count(q) desc limit 3
Chattiest Teams
Team    Number of Chats
82      1324
185     1036
112     957
Finally, present your answer, i.e. whether or not any of the chattiest users are part of any of the
chattiest teams.
Using only the top 3 users (394, 2067, and 209), none of them belongs to team 82, 185, or 112. Looking at the top 10 users, user 999 belongs to team 52.

match q=(i:ChatItem)-[:PartOf*]->(c:TeamChatSession)-[:OwnedBy*]->(t:Team) where t.id = 82
match s=(u:User)-[:CreateChat]-(i:ChatItem) where i in nodes(q) return distinct u

match q=(i:ChatItem)-[:PartOf*]->(c:TeamChatSession)-[:OwnedBy*]->(t:Team) where t.id = 185
match s=(u:User)-[:CreateChat]-(i:ChatItem) where i in nodes(q) return distinct u

match q=(i:ChatItem)-[:PartOf*]->(c:TeamChatSession)-[:OwnedBy*]->(t:Team) where t.id = 112
match s=(u:User)-[:CreateChat]-(i:ChatItem) where i in nodes(q) return distinct u
How Active Are Groups of Users?
Describe your steps for performing this analysis. Be as clear, concise, and as brief as possible.
Finally, report the top 3 most active users in the table below.
// Step 1: Build InteractsWith edges
// Case 1: one user mentioned another user in a chat
match (u1:User)-[:CreateChat]->(i:ChatItem)-[:Mentioned]->(u2:User)
with distinct u1, u2
create (u1)-[:InteractsWith]->(u2)

// Case 2: one user created a ChatItem in response to another user's ChatItem
match (u1:User)-[:CreateChat]->(ci1:ChatItem)<-[:ResponseTo]-(ci2:ChatItem)<-[:CreateChat]-(u2:User)
with distinct u1, u2
create (u1)-[:InteractsWith]->(u2)

// Eliminate all self-loops involving the InteractsWith edge
match (u1)-[r:InteractsWith]->(u1)
delete r
// Step 2: Get the list of neighbors for each of the top 3 chattiest users
match (u1:User {id:394})-[:InteractsWith]-(u2:User)
with collect(distinct u2.id) as neighbors
return neighbors
// neighbors: [1012, 1782, 2011, 1997]; k = 4

match (u1:User {id:2067})-[:InteractsWith]-(u2:User)
with collect(distinct u2.id) as neighbors
return neighbors
// neighbors: [209, 1672, 516, 1627, 63, 1265, 2096, 516]; k = 8

match (u1:User {id:209})-[:InteractsWith]-(u2:User)
with collect(distinct u2.id) as neighbors
return neighbors
// neighbors: [2067, 1672, 516, 1627, 63, 1265, 516]; k = 7
Most Active Users (based on Clustering Coefficients)
User ID   Coefficient
394       1
2067      1.321429
209       1.285714

Detailed steps:
user    neighbor edges   neighbors (k)   coefficient
394     12               4               1
2067    74               8               1.321429
209     54               7               1.285714

clustering coefficient = (total number of edges among the node's neighbors) / [k * (k - 1)], where k is the number of neighbors of the node
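A small Python helper reproduces the table above from the edge and neighbor counts collected in the queries below:

```python
def coefficient(neighbor_edges, k):
    """Clustering coefficient as defined above: edges among neighbors / (k * (k - 1))."""
    return neighbor_edges / (k * (k - 1))

# (total neighbor edges, number of neighbors) per user, from the table above
print(coefficient(12, 4))            # user 394  -> 1.0
print(round(coefficient(74, 8), 6))  # user 2067 -> 1.321429
print(round(coefficient(54, 7), 6))  # user 209  -> 1.285714
```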
// For user 394, its neighbors have 12 edges in total
match (u1:User {id:1012})-[r]-(u2:User)
where u2.id in [1782, 2011, 1997]
return count(r)   // 3

match (u1:User {id:1782})-[r]-(u2:User)
where u2.id in [1012, 2011, 1997]
return count(r)   // 3

match (u1:User {id:2011})-[r]-(u2:User)
where u2.id in [1012, 1782, 1997]
return count(r)   // 3

match (u1:User {id:1997})-[r]-(u2:User)
where u2.id in [1012, 1782, 2011]
return count(r)   // 3
// For user 2067, its neighbors have 74 edges in total
match (u1:User {id:209})-[r]-(u2:User)
where u2.id in [1672, 516, 1627, 63, 1265, 2096, 516]
return count(r)   // 9

match (u1:User {id:1672})-[r]-(u2:User)
where u2.id in [209, 516, 1627, 63, 1265, 2096, 516]
return count(r)   // 11

match (u1:User {id:516})-[r]-(u2:User)
where u2.id in [209, 1672, 1627, 63, 1265, 2096, 516]
return count(r)   // 11

match (u1:User {id:1627})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 63, 1265, 2096, 516]
return count(r)   // 11

match (u1:User {id:63})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 1627, 1265, 2096, 516]
return count(r)   // 10

match (u1:User {id:1265})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 1627, 63, 2096, 516]
return count(r)   // 7

match (u1:User {id:2096})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 1627, 63, 1265, 516]
return count(r)   // 4

match (u1:User {id:516})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 1627, 63, 1265, 2096]
return count(r)   // 11
// For user 209, its neighbors have 54 edges in total
match (u1:User {id:2067})-[r]-(u2:User)
where u2.id in [1672, 516, 1627, 63, 1265]
return count(r)   // 9

match (u1:User {id:1672})-[r]-(u2:User)
where u2.id in [2067, 516, 1627, 63, 1265]
return count(r)   // 9

match (u1:User {id:516})-[r]-(u2:User)
where u2.id in [2067, 1672, 1627, 63, 1265]
return count(r)   // 10

match (u1:User {id:1627})-[r]-(u2:User)
where u2.id in [2067, 1672, 516, 63, 1265]
return count(r)   // 10

match (u1:User {id:63})-[r]-(u2:User)
where u2.id in [2067, 1672, 516, 1627, 1265]
return count(r)   // 9

match (u1:User {id:1265})-[r]-(u2:User)
where u2.id in [2067, 1672, 516, 1627, 63]
return count(r)   // 7