Technical Appendix
Catch the Pink Flamingo Analysis
Produced by: Junbin Yuan
2016-08-14
Acquiring, Exploring and Preparing the Data
Data Exploration
Data Set Overview
The list below describes each of the files available for analysis, with a short description of what is found in each one and the fields it contains.

File: ad-clicks.csv
A line is added to this file when a player clicks on an advertisement in the Flamingo app.
Fields:
- timestamp: when the click occurred
- txID: a unique id (within ad-clicks.log) for the click
- userSessionid: the id of the user session for the user who made the click
- teamid: the current team id of the user who made the click
- userid: the user id of the user who made the click
- adID: the id of the ad clicked on
- adCategory: the category/type of ad clicked on

File: buy-clicks.csv
A line is added to this file when a player makes an in-app purchase in the Flamingo app.
Fields:
- timestamp: when the purchase was made
- txID: a unique id (within buy-clicks.log) for the purchase
- userSessionid: the id of the user session for the user who made the purchase
- team: the current team id of the user who made the purchase
- userid: the user id of the user who made the purchase
- buyID: the id of the item purchased
- price: the price of the item purchased

File: game-clicks.csv
A line is added to this file each time a user performs a click in the game.
Fields:
- time: when the click occurred
- clickid: a unique id for the click
- userid: the id of the user performing the click
- usersessionid: the id of the session of the user when the click is performed
- isHit: denotes if the click was on a flamingo (value is 1) or missed the flamingo (value is 0)
- teamId: the id of the team of the user
- teamLevel: the current level of the team of the user

File: level-events.csv
A line is added to this file each time a team starts or finishes a level in the game.
Fields:
- time: when the event occurred
- eventid: a unique id for the event
- teamid: the id of the team
- level: the level started or completed
- eventType: the type of event, either start or end

File: team.csv
This file contains a line for each team terminated in the game.
Fields:
- teamid: the id of the team
- name: the name of the team
- teamCreationTime: the timestamp when the team was created
- teamEndTime: the timestamp when the last member left the team
- strength: a measure of team strength, roughly corresponding to the success of a team
- currentLevel: the current level of the team

File: team-assignments.csv
A line is added to this file each time a user joins a team. A user can be in at most a single team at a time.
Fields:
- time: when the user joined the team
- team: the id of the team
- userid: the id of the user
- assignmentid: a unique id for this assignment

File: users.csv
This file contains a line for each user playing the game.
Fields:
- timestamp: when the user first played the game
- id: the user id assigned to the user
- nick: the nickname chosen by the user
- twitter: the twitter handle of the user
- dob: the date of birth of the user
- country: the two-letter country code where the user lives

File: user-session.csv
Each line in this file describes a user session, which denotes when a user starts and stops playing the game. Additionally, when a team goes to the next level in the game, the session is ended for each user in the team and a new one is started.
Fields:
- timeStamp: a timestamp denoting when the event occurred
- userSessionId: a unique id for the session
- userId: the current user's ID
- teamId: the current user's team
- assignmentId: the team assignment id for the user to the team
- sessionType: whether the event is the start or end of a session
- teamLevel: the level of the team during this session
- platformType: the type of platform of the user during this session
Aggregation
Amount spent buying items: $21407
source="buy-clicks.csv" | stats sum(price)

# of unique items available to be purchased: 6
source="buy-clicks.csv" | stats count by buyId
A histogram showing how many times each item is purchased:
source="buy-clicks.csv" | stats count by buyId
A histogram showing how much money was made from each item:
source="buy-clicks.csv" | stats sum(price) by buyId
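The Splunk searches above can be mirrored in plain Python. The rows below are invented sample purchases for illustration, not the real buy-clicks.csv contents, so the totals differ from the reported figures:

```python
from collections import defaultdict

# Hypothetical rows mimicking buy-clicks.csv (only the fields used here)
purchases = [
    {"userId": 1, "buyId": 0, "price": 3.0},
    {"userId": 1, "buyId": 2, "price": 10.0},
    {"userId": 2, "buyId": 2, "price": 10.0},
    {"userId": 3, "buyId": 5, "price": 20.0},
]

# stats sum(price): total amount spent buying items
total_spent = sum(row["price"] for row in purchases)

# stats count by buyId: how many times each item is purchased
count_by_item = defaultdict(int)
# stats sum(price) by buyId: how much money was made from each item
revenue_by_item = defaultdict(float)
for row in purchases:
    count_by_item[row["buyId"]] += 1
    revenue_by_item[row["buyId"]] += row["price"]

print(total_spent)          # 43.0
print(dict(count_by_item))  # {0: 1, 2: 2, 5: 1}
```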
Filtering
A histogram showing total amount of money spent by the top ten users (ranked by how much
money they spent).
source="buy-clicks.csv" | stats sum(price) by userId | sort 10 -sum(price)
The following table shows the user id, platform, and hit-ratio percentage for the top three buying
users:
source="user-session.csv" userId=2229 | table platformType
source="user-session.csv" userId=12 | table platformType
source="user-session.csv" userId=471 | table platformType
source="game-clicks.csv" userId=2229 | stats avg(isHit)
source="game-clicks.csv" userId=12 | stats avg(isHit)
source="game-clicks.csv" userId=471 | stats avg(isHit)
Rank   User Id   Platform   Hit-Ratio (%)
1      2229      iPhone     11.5970
2      12        iPhone     13.0682
3      471       iPhone     14.5038
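The same ranking and hit-ratio logic can be sketched in plain Python; the purchase and game-click rows below are made up for illustration, so the numbers do not match the table above:

```python
from collections import defaultdict

purchases = [(2229, 5.0), (2229, 20.0), (12, 10.0), (471, 3.0), (471, 4.0)]  # (userId, price)
clicks = [(2229, 1), (2229, 0), (2229, 0), (12, 1), (12, 0)]                 # (userId, isHit)

# stats sum(price) by userId | sort 10 -sum(price): top spenders, highest first
spent = defaultdict(float)
for user, price in purchases:
    spent[user] += price
top_spenders = sorted(spent.items(), key=lambda kv: -kv[1])[:10]

# stats avg(isHit): the hit ratio is the mean of the 0/1 isHit flags, as a percentage
hits = defaultdict(list)
for user, is_hit in clicks:
    hits[user].append(is_hit)
hit_ratio = {u: 100.0 * sum(v) / len(v) for u, v in hits.items()}

print(top_spenders)   # [(2229, 25.0), (12, 10.0), (471, 7.0)]
print(hit_ratio[12])  # 50.0
```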
Data Classification Analysis
Data Preparation
Analysis of combined_data.csv
Sample Selection
Item                          Amount
# of Samples                  4619
# of Samples with Purchases   1411
Attribute Creation
A new categorical attribute was created to enable analysis of players as broken into 2
categories (HighRollers and PennyPinchers). A screenshot of the attribute follows:
The new attribute prediction_category was derived from avg_price to enable analysis of each player's spending behavior. A Numeric Binner node assigned each player to one of two bins: the PennyPincher bin covers an avg_price of $5.00 or less, and the HighRoller bin covers an avg_price above $5.00.
This new categorical attribute was necessary because the goal of this project is to identify what makes a HighRoller versus a PennyPincher. The attribute encodes that distinction directly: HighRollers buy items that cost more than $5.00, and PennyPinchers buy items that cost $5.00 or less.
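The binning rule amounts to a one-line function; this is a sketch of the Numeric Binner configuration described above, assuming the $5.00 boundary itself falls in the PennyPincher bin:

```python
def prediction_category(avg_price):
    """Mimic the Numeric Binner: avg_price <= $5.00 -> PennyPincher, above -> HighRoller."""
    return "PennyPincher" if avg_price <= 5.00 else "HighRoller"

print(prediction_category(3.00))  # PennyPincher
print(prediction_category(5.00))  # PennyPincher (the boundary belongs to the lower bin)
print(prediction_category(7.50))  # HighRoller
```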
Attribute Selection
The following attributes were filtered from the dataset for the following reasons:
Attribute and rationale for filtering:
- userId: Each user has a unique id. This attribute has a large number of distinct values and high intrinsic information, which would cause the decision tree to overfit, so it was excluded.
- userSessionId: Each user session has a unique id. Like userId, it has many distinct values and high intrinsic information, which would cause the decision tree to overfit, so it was excluded.
- avg_price: The average purchase price for a user session. The new attribute prediction_category was derived directly from this attribute, so it was excluded to avoid leaking the target into the model.
Data Partitioning and Modeling
The data was partitioned into train and test datasets. The train dataset was used to create the decision tree model, and the trained model was then applied to the test dataset.
This is important because the decision tree model is constructed from the train (learn) partition of the data, while the test partition is held out for model evaluation and model selection: a tree of any given size built on the train data commits to specific predictions, which can then be compared against the actual outcomes in the test data.
When partitioning the data using sampling, it is important to set the random seed because a fixed seed ensures reproducible results upon re-execution. If no fixed seed is defined, a new random seed is taken for each execution, producing a different split each time.
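A minimal Python sketch of why the fixed seed matters; the 60/40 split fraction and the seed value 42 are illustrative choices, not taken from the actual KNIME workflow:

```python
import random

def partition(rows, train_fraction=0.6, seed=42):
    """Shuffle with a fixed seed, then split; fraction and seed are illustrative."""
    rng = random.Random(seed)      # fixed seed -> the same shuffle on every run
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
train1, test1 = partition(rows)
train2, test2 = partition(rows)
# Re-running with the same seed reproduces the identical split
assert (train1, test1) == (train2, test2)
assert len(train1) == 6 and len(test1) == 4
```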
A screenshot of the resulting decision tree can be seen below:
Evaluation
A screenshot of the confusion matrix can be seen below:
As seen in the screenshot above, the overall accuracy of the model is 87.603%.
True Positives: 459 actual PennyPinchers were correctly classified as PennyPinchers.
False Positives: 62 HighRollers were incorrectly labelled as PennyPinchers.
False Negatives: 43 PennyPinchers were incorrectly labelled as HighRollers.
True Negatives: 283 HighRollers were correctly classified as HighRollers.
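The reported accuracy can be verified directly from the four counts above:

```python
# Confusion-matrix counts from the screenshot above
tp, fp, fn, tn = 459, 62, 43, 283

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(100 * accuracy, 3))  # 87.603
```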
Analysis Conclusions
The final KNIME workflow is shown below:
What makes a HighRoller vs. a PennyPincher?
The attribute platformType turned out to be the decisive predictor. Users on iPhone were more likely to be HighRollers, while players on the other platforms (android, linux, windows, mac) spent less and were more likely to be PennyPinchers.
Specific Recommendations to Increase Revenue
1. Since iPhone users are the big spenders, releasing more tools in the iPhone application to facilitate flamingo catching is likely to increase revenue.
2. When users miss the flamingo, prompt them with recommendations for useful tools. The suggestion message may encourage eager players to make more purchases.
Clustering Analysis
Attribute Selection
Attribute and rationale for selection:
- isHit: This attribute is in the game-clicks.csv file. If a user's click was on a flamingo, the value is 1; if the click missed the flamingo, the value is 0. It measures the accuracy of the users, and users are associated with teams in the file.
- price: This attribute is in the buy-clicks.csv file. It holds the price of the item purchased by a user, and users are grouped into teams in the file.
Aggregated attributes (TotalHits and Revenue):
- TotalHits is the aggregation of the total number of successful hits per team. The analysis tried to find out whether this number is related to team spending behavior.
- Revenue is the aggregation of the total spending per team. The analysis tried to find out whether this number is related to the total number of hits per team.
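The two aggregations can be sketched in plain Python; the per-click and per-purchase rows below are invented examples, not the real game-clicks.csv and buy-clicks.csv contents:

```python
from collections import defaultdict

game_clicks = [(1, 1), (1, 0), (1, 1), (2, 1)]  # (teamId, isHit)
buy_clicks = [(1, 5.0), (2, 20.0), (2, 3.0)]    # (teamId, price)

# TotalHits: number of clicks per team that actually hit a flamingo (isHit = 1)
total_hits = defaultdict(int)
for team, is_hit in game_clicks:
    total_hits[team] += is_hit

# Revenue: total spending per team
revenue = defaultdict(float)
for team, price in buy_clicks:
    revenue[team] += price

# One (TotalHits, Revenue) row per team becomes the K-Means training data
training_rows = [(total_hits[t], revenue[t]) for t in sorted(total_hits)]
print(training_rows)  # [(2, 5.0), (1, 23.0)]
```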
Training Data Set Creation
The training data set used for this analysis is shown below (first 5 lines):
<Fill In Screenshot>
Dimensions of the training data set (rows x columns): 104 x 2
# of clusters created: 2
Cluster Centers
# Train a KMeans model to create two clusters (PySpark MLlib)
from pyspark.mllib.clustering import KMeans

clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10,
                        initializationMode="random")

# Display the centers of the two clusters
print(clusters.centers)
Cluster #   Cluster Center
1           array([ 556.85106383,  60.59574468])
2           array([ 972.38596491, 325.59649123])
These clusters can be differentiated from each other as follows: the first number (field 1) in each array is the total number of successful hits by a team, and the second number (field 2) is the revenue per team.
Cluster 1 is different from the others in that it has a low number of total successful hits and low revenue. Teams with fewer successful hits (isHit = 1), i.e. a lower success rate at catching flamingos, tend to spend less.
Cluster 2 is different from the others in that it has a high number of total successful hits and high revenue. Teams with more successful hits (isHit = 1), i.e. a higher success rate at catching flamingos, tend to spend more.
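Given the two reported centers, a team's (TotalHits, Revenue) point is assigned to whichever center is nearest in Euclidean distance, which is exactly the rule K-Means applies; the two query points below are hypothetical:

```python
import math

# Cluster centers reported above: (TotalHits, Revenue)
centers = [(556.85106383, 60.59574468), (972.38596491, 325.59649123)]

def assign(point):
    """Index of the nearest center by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

print(assign((500.0, 50.0)))    # 0: the low-hits, low-revenue cluster
print(assign((1000.0, 300.0)))  # 1: the high-hits, high-revenue cluster
```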
Recommended Actions
Action recommended: Release more in-app ads promoting tools that make it easier to hit a flamingo.
Rationale: Eglence Inc. can generate more revenue by introducing more tools for catching the flamingos; a tool with more features can be priced higher.

Action recommended: Implement a team flamingo-catching event.
Rationale: Eglence Inc. can generate more revenue by introducing team events for catching the flamingos. When players catch the flamingo as a team, suggest strategies and associated tools that can be used by the team.
Graph Analytics Analysis
Modeling Chat Data using a Graph Data Model
This chat data model was created with the Neo4j graph tool from six CSV files. The data contains information about users, teams, team chat sessions, chat items, and timestamps to support node creation, and it provides the relationships for creating chats, joining teams, leaving teams, mentioning users, and responding to chats. From this graph model, various analyses of player behavior can be performed, which can benefit Eglence Inc. by targeting users for revenue generation in the Flamingo game.
Creation of the Graph Database for Chats
Describe the steps you took for creating the graph database. As part of these steps
i) Write the schema of the 6 CSV files
Chat_create_team_chat.csv: userid, teamid, TeamChatSessionID, timestamp
Chat_item_team_chat.csv: userid, TeamChatSessionID, ChatItemID, timestamp
Chat_join_team_chat.csv: userid, TeamChatSessionID, timestamp
Chat_leave_team_chat.csv: userid, TeamChatSessionID, timestamp
Chat_mention_team_chat.csv: ChatItem, userid, timestamp
Chat_respond_team_chat.csv: chatid1, chatid2, timestamp
ii) Explain the loading process and include a sample LOAD command
The six CSV files were loaded into Neo4j one by one:

After loading Chat_create_team_chat.csv, three node types and two edge types are created. Nodes: User, Team, and TeamChatSession. Edges: CreateSession and OwnedBy.
After loading Chat_join_team_chat.csv, two node types and one edge type are created. Nodes: User and TeamChatSession. Edge: Join.
After loading Chat_leave_team_chat.csv, two node types and one edge type are created. Nodes: User and TeamChatSession. Edge: Leave.
After loading Chat_item_team_chat.csv, three node types and two edge types are created. Nodes: User, TeamChatSession, and ChatItem. Edges: CreateChat and PartOf.
After loading Chat_mention_team_chat.csv, two node types and one edge type are created. Nodes: ChatItem and User. Edge: Mentioned.
After loading Chat_respond_team_chat.csv, one node type (ChatItem) and one edge type (ResponseTo) are created; the edge connects a ChatItem to the ChatItem it responds to.
Example of load command:
LOAD CSV FROM "file:///E:/BigDataCapstoneProjectWeek4/chat_item_team_chat.csv" AS row
MERGE (u:User {id: toInt(row[0])})
MERGE (c:TeamChatSession {id: toInt(row[1])})
MERGE (i:ChatItem {id: toInt(row[2])})
MERGE (u)-[:CreateChat{timeStamp: row[3]}]->(i)
MERGE (i)-[:PartOf{timeStamp: row[3]}]->(c)
iii) Present a screenshot of some part of the graph you have generated. The graphs
must include clearly visible examples of most node and edge types.
Finding the longest conversation chain and its participants
Report the results including the length of the conversation (path length) and how many unique
users were part of the conversation chain. Describe your steps. Write the query that produces
the correct answer.
The longest conversation chain in the chat data, following the "ResponseTo" edge label, contains 10 ChatItem nodes connected by 9 ResponseTo edges (a path length of 9).
// Step 1: Get the longest ResponseTo path
match p=(i:ChatItem)-[:ResponseTo*]->(j:ChatItem)
return p, length(p) AS LenLongestPath,
       extract(n in nodes(p) | n.id) AS longestPath
order by length(p) desc limit 1
// Step 2: Get the users who are in the longest path
// (there are 5 users in this longest conversation)
match p=(i:ChatItem)-[:ResponseTo*]->(j:ChatItem) where length(p)=9
with extract(n in nodes(p) | n.id) AS LongestPath
match (u:User)-[:CreateChat]->(k:ChatItem) where k.id in LongestPath
return count(distinct u) as NumUsers
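The chain search can be sketched without Neo4j by treating ResponseTo as an edge map and finding the longest path in the resulting DAG; the edges below are hypothetical, not the real chat data:

```python
from functools import lru_cache

# Hypothetical ResponseTo edges: chat item -> the chat item it responds to
responds_to = {2: 1, 3: 2, 4: 3, 6: 5}

@lru_cache(maxsize=None)
def chain_length(item):
    """Number of ResponseTo edges on the longest chain starting at this item."""
    nxt = responds_to.get(item)
    return 0 if nxt is None else 1 + chain_length(nxt)

items = set(responds_to) | set(responds_to.values())
longest = max(chain_length(i) for i in items)
print(longest)  # 3 edges, i.e. a 4-item conversation: 4 -> 3 -> 2 -> 1
```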
Analyzing the relationship between top 10 chattiest users and top 10
chattiest teams
Describe your steps from Question 2. In the process, create the following two tables. You only
need to include the top 3 for each table. Identify and report whether any of the chattiest users
were part of any of the chattiest teams.
// Get the chattiest users
match q=(u:User)-[:CreateChat*]-(i:ChatItem)
with u, count(u) as ChatCount
return u, ChatCount
order by ChatCount desc limit 10
Chattiest Users
User    Number of Chats
394     115
2067    111
209     109
// Get the chattiest teams
match q=(i:ChatItem)-[:PartOf*]->(c:TeamChatSession)-[:OwnedBy*]->(t:Team)
with q, t
return count(q), t
order by count(q) desc limit 3
Chattiest Teams
Team    Number of Chats
82      1324
185     1036
112     957
Finally, present your answer, i.e. whether or not any of the chattiest users are part of any of the
chattiest teams.
Using only the top 3 users (394, 2067, and 209), none of them belongs to team 82, 185, or 112. Looking at the top 10 users, user 999 belongs to team 52.

match q=(i:ChatItem)-[:PartOf*]->(c:TeamChatSession)-[:OwnedBy*]->(t:Team) where t.id = 82
match s=(u:User)-[:CreateChat]-(i:ChatItem) where i in nodes(q) return distinct u

match q=(i:ChatItem)-[:PartOf*]->(c:TeamChatSession)-[:OwnedBy*]->(t:Team) where t.id = 185
match s=(u:User)-[:CreateChat]-(i:ChatItem) where i in nodes(q) return distinct u

match q=(i:ChatItem)-[:PartOf*]->(c:TeamChatSession)-[:OwnedBy*]->(t:Team) where t.id = 112
match s=(u:User)-[:CreateChat]-(i:ChatItem) where i in nodes(q) return distinct u
How Active Are Groups of Users?
Describe your steps for performing this analysis. Be as clear, concise, and as brief as possible.
Finally, report the top 3 most active users in the table below.
// Step 1: Build InteractsWith edges
// Case 1: one user mentioned another user in a chat
match (u1:User)-[:CreateChat]->(i:ChatItem)-[:Mentioned]->(u2:User)
with distinct u1, u2
create (u1)-[:InteractsWith]->(u2)

// Case 2: one user created a ChatItem in response to another user's ChatItem
match (u1:User)-[:CreateChat]->(ci1:ChatItem)<-[:ResponseTo]-(ci2:ChatItem)<-[:CreateChat]-(u2:User)
with distinct u1, u2
create (u1)-[:InteractsWith]->(u2)

// Eliminate all self-loops involving the InteractsWith edge
match (u1)-[r:InteractsWith]->(u1)
delete r
// Step 2: Get the list of neighbors for each of the top 3 chattiest users
match (u1:User {id:394})-[:InteractsWith]-(u2:User)
with collect(distinct u2.id) as neighbors
return neighbors
// neighbors: [1012, 1782, 2011, 1997]; k = 4

match (u1:User {id:2067})-[:InteractsWith]-(u2:User)
with collect(distinct u2.id) as neighbors
return neighbors
// neighbors: [209, 1672, 516, 1627, 63, 1265, 2096, 516]; k = 8

match (u1:User {id:209})-[:InteractsWith]-(u2:User)
with collect(distinct u2.id) as neighbors
return neighbors
// neighbors: [2067, 1672, 516, 1627, 63, 1265, 516]; k = 7
Most Active Users (based on Clustering Coefficients)
User ID   Coefficient
394       1
2067      1.321429
209       1.285714

Detailed steps:
user    neighbor edges   neighbors (k)   coefficient
394     12               4               1
2067    74               8               1.321429
209     54               7               1.285714

clustering coefficient = (total number of edges among the node's neighbors) / [k * (k - 1)], where k is the number of neighbors of the node
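A small Python helper reproduces the table above from the edge and neighbor counts collected in the queries below:

```python
def coefficient(neighbor_edges, k):
    """Clustering coefficient as defined above: edges among neighbors / (k * (k - 1))."""
    return neighbor_edges / (k * (k - 1))

# (total neighbor edges, number of neighbors) per user, from the table above
print(coefficient(12, 4))            # user 394  -> 1.0
print(round(coefficient(74, 8), 6))  # user 2067 -> 1.321429
print(round(coefficient(54, 7), 6))  # user 209  -> 1.285714
```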
// For user 394, its neighbors have 12 edges in total
match (u1:User {id:1012})-[r]-(u2:User)
where u2.id in [1782, 2011, 1997]
return count(r)   // 3

match (u1:User {id:1782})-[r]-(u2:User)
where u2.id in [1012, 2011, 1997]
return count(r)   // 3

match (u1:User {id:2011})-[r]-(u2:User)
where u2.id in [1012, 1782, 1997]
return count(r)   // 3

match (u1:User {id:1997})-[r]-(u2:User)
where u2.id in [1012, 1782, 2011]
return count(r)   // 3
// For user 2067, its neighbors have 74 edges in total
match (u1:User {id:209})-[r]-(u2:User)
where u2.id in [1672, 516, 1627, 63, 1265, 2096, 516]
return count(r)   // 9

match (u1:User {id:1672})-[r]-(u2:User)
where u2.id in [209, 516, 1627, 63, 1265, 2096, 516]
return count(r)   // 11

match (u1:User {id:516})-[r]-(u2:User)
where u2.id in [209, 1672, 1627, 63, 1265, 2096, 516]
return count(r)   // 11

match (u1:User {id:1627})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 63, 1265, 2096, 516]
return count(r)   // 11

match (u1:User {id:63})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 1627, 1265, 2096, 516]
return count(r)   // 10

match (u1:User {id:1265})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 1627, 63, 2096, 516]
return count(r)   // 7

match (u1:User {id:2096})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 1627, 63, 1265, 516]
return count(r)   // 4

match (u1:User {id:516})-[r]-(u2:User)
where u2.id in [209, 1672, 516, 1627, 63, 1265, 2096]
return count(r)   // 11
// For user 209, its neighbors have 54 edges in total
match (u1:User {id:2067})-[r]-(u2:User)
where u2.id in [1672, 516, 1627, 63, 1265]
return count(r)   // 9

match (u1:User {id:1672})-[r]-(u2:User)
where u2.id in [2067, 516, 1627, 63, 1265]
return count(r)   // 9

match (u1:User {id:516})-[r]-(u2:User)
where u2.id in [2067, 1672, 1627, 63, 1265]
return count(r)   // 10

match (u1:User {id:1627})-[r]-(u2:User)
where u2.id in [2067, 1672, 516, 63, 1265]
return count(r)   // 10

match (u1:User {id:63})-[r]-(u2:User)
where u2.id in [2067, 1672, 516, 1627, 1265]
return count(r)   // 9

match (u1:User {id:1265})-[r]-(u2:User)
where u2.id in [2067, 1672, 516, 1627, 63]
return count(r)   // 7