Analysis of video game reviews

Detecting spam opinions in game reviews via semi-supervised technique

Pengze Bian (u6341832)

A report submitted for the course COMP4560

Individual Research Project

Supervised by Dr. Penny Kyburz

The Australian National University

Oct 2019

© Pengze Bian 2019


Except where otherwise indicated, this report is my own original work

Pengze Bian

24 Oct 2019


Acknowledgments

This research builds on previous work on spam opinion detection and is the result of collaboration with my supervisor, Dr Penny Kyburz. I would like to thank Dr Kyburz for her support throughout my project. Before this research, I had only a basic knowledge of natural language processing; she provided many resources and much guidance in the initial stage of my research and kept me on the right path.


Abstract

With the development of network services and ongoing technological innovation, more and more software developers are beginning to focus on user feedback. Users freely express their ideas and opinions through different platforms on the internet. These user-generated online comments increasingly influence consumers' purchase decisions, which has contributed to the emergence and growth of spam comments. The inconvenience and misdirection caused by spam comments have attracted researchers' attention, and spam detection is already well developed for shopping and hotel reviews. Jindal and Liu [2] were the first to investigate spam comments and divided them into different categories. In this paper, we explore and implement generalized approaches for mining and identifying online deceptive spam game comments, based on a dataset of game comments crawled from the well-known game platform Steam. In terms of the distinctive features of game comments, we group spam game comments into two main types: fake reviews (deceptive comments) and irrelevant reviews. Because game reviews resemble most other text comments, we can use many mature NLP technologies and machine learning methods to analyse and classify them. At the same time, given the lack of dedicated research on game reviews, we need to analyse some of their characteristics specifically. This paper analyses spam game reviews and presents some techniques to detect them.

Keywords: spam comments; data mining; NLP; machine learning; spam game reviews detection


Contents

Acknowledgments

Abstract

1 Introduction
2 Related Works
2.1 Spam opinions
2.2 Positive Unlabelled Learning
3 Datasets
3.1 Unlabelled steam reviews dataset
3.2 Gold-standard dataset
3.3 Labelled steam reviews dataset
4 Feature Generation
4.1 Word-Based Features
4.2 Sentence-Based Features
5 Experimental Design
6 Research Results
7 Conclusions and Future Works
8 Appendices
Appendix 1: Project Description
Appendix 2: Independent Study Contract
Appendix 3: Description of Artefacts Produced and README
Appendix 4: Function Words
9 References


Introduction

With the arrival of 5G technology, the interaction and transmission of network information will become more convenient and rapid. New techniques have dramatically changed the way people express themselves and interact with others, and network users are freer than ever to leave their own opinions and comments. According to the Yelp website [1], the US review site Yelp hosted more than 192 million reviews by the end of Q2 2019. Reviews express users' own opinions, and this information guides other users' opinions and consumption behaviour. Luyang, Bing and Ting [3] claimed that about 81% of US internet users consult product reviews before purchasing, and that more than 80% of users believe reviews have an impact on their purchase behaviour. The potential value of online reviews has led to more and more spam reviews appearing on the web. Spam reviews can be classified as fake reviews and useless reviews. The Kelsey group [4] explained that fake reviews involve non-conformity or ambiguity about a product or service; they can also be described as deceptive reviews or untruthful opinions. Useless reviews include potential advertisements, irrelevant reviews and meaningless reviews. These spam reviews are widely distributed, harmful and difficult to identify manually. According to Luyang [3], spam reviews are widespread on review sites for accommodation, travel and shopping, and fake reviews on Yelp account for about 14%-20% of the total [5][6]. Many researchers are beginning to pay attention to spam reviews. One study [7] asked three volunteers to manually identify 160 fake comments; the volunteers tended to misjudge fake reviews as genuine, with a recognition accuracy of only 53.1%-61.9%. This shows that identifying fake reviews manually is not feasible, given the low accuracy and the amount of time required for judgment. How to effectively identify fake reviews has therefore become one of the urgent network security issues to be solved.

The research question of fake reviews was first proposed by Jindal and Liu in 2008 [2]. In that paper, the authors identified three types of spam reviews: untruthful opinions, reviews on brands only, and non-reviews (advertisements and other irrelevant reviews containing no opinions). These three kinds of reviews basically cover all varieties of spam. The study of spam review detection began with the detection of deceptive review text. Owing to the lack of standard datasets for spam reviews, especially datasets of untruthful reviews in the early period of the research, progress on spam review detection was slow. It was not until Ott et al. [7] created a gold-standard dataset of fake shopping and hotel reviews that the research began to develop rapidly. With the growth of spam commentary and the deepening of research, researchers realized that identifying fake reviewers and fake reviewer groups can detect spam reviews more effectively [3].

With the increasing pressure of modern society, people have begun to pay more attention to relaxation and entertainment, and video games have become a mainstream form of both. Like other online reviews, game reviews affect customers' decisions, and the spam reviews among them interfere with consumers' interests. However, most existing research is developed and implemented on gold-standard or labelled datasets of shopping and hotel reviews; exploring spam in game reviews is a field in which few people are involved. This paper attempts to validate how well existing techniques perform at detecting spam reviews in video game reviews. Meanwhile, the potentially unique features of game reviews will be analysed, and a labelled game review dataset will be created based on different features.

The paper is organized as follows: in Section II, we summarize related work; in Section III, we introduce the corpora we use; in Section IV, we explain the features; in Section V, we describe the experiments; in Section VI, we evaluate the results; in Section VII, we conclude the article and discuss future work.


Related Works

Spam opinions:

Early research on network information focused on the contexts of Web text [8] and email information [9]. With the explosive growth of network information, companies and researchers noticed the value of online opinions, and their analysis became a popular research topic. This research mainly focused on mining opinions from reviews or performing sentiment analysis to classify the emotion of reviews as positive or negative [10,11,12]. The mainstream method of early sentiment analysis was to build an emotional dictionary corresponding to the corpus under study. This method assigns different scores to emotional words and common words, and the emotional tendency of each text is obtained by combining the scores of all its words with some mathematical method. The maturing of machine learning technology accelerated the development of sentiment analysis [11]. As spam reviews caused trouble for users, researchers became increasingly concerned with deceptive opinion spam [2,7,13,14]. As mentioned in the previous section, Jindal and Liu (2008) [2] were the first to study deceptive opinion spam and defined three types of deceptive spam reviews.

• Type 1 (untruthful opinions): malicious propaganda or maliciously disparaging comments.

• Type 2 (reviews on brands only): reviews that discuss only the influence or services of the brand, not the product itself.

• Type 3 (non-reviews): advertisements and irrelevant reviews without meaningful opinions (e.g., questions, answers or symbols).

They collected types 2 and 3 manually, thanks to their obvious features, and applied Naïve Bayes, SVM and logistic regression to identify these two types of deceptive opinion spam. Type 1 is difficult to detect; they treated duplicate reviews, identified from review text, reviewer and product, as untruthful opinions. Performing 10-fold cross-validation on the data, they obtained an average AUC of 78% using all the features. Yoo and Gretzel [15] manually gathered 40 truthful and 42 deceptive hotel reviews and attempted to compare the linguistic differences between them, which contributed the basic linguistic features of deceptive reviews. Meanwhile, some researchers began to focus on factors other than the text itself. Wang, Liang et al. [16] analysed the behavioural differences between deceptive users and normal users, where deceptive users are those who have written malicious or untruthful reviews. That paper identified some characteristics of spammers:

1. Spammers probably have a higher-frequency connection time series than normal users.

2. Spammers usually post irrelevant or merely similar comments about the objects they comment on.

3. Spammers send spam reviews without considering the domains of the objects.

The authors designed two testing strategies to evaluate the three characteristics above:

A) A user is a spammer if each of his three behavioural characteristics indicates that he is a spammer.

B) A user is a spammer if any one of his three behavioural characteristics indicates that he is a spammer.

The final detection accuracies for A and B are 100% and 92.6% respectively.


Other research focuses on singleton review spam detection [17]. The authors claimed that truthful reviewers' arrival pattern is steady and uncorrelated with their rating pattern, whereas spammers show the opposite behaviour. They achieved a precision of 61.11% and a recall of 75.86%.

Previous research was mainly limited by the absence of standard deceptive spam review datasets. Everything changed and developed faster after Ott et al. created a gold-standard dataset [7]: they employed Turkers to write deceptive reviews and conducted subsequent research on this dataset. That gold-standard dataset is also used in this paper, and a detailed description of it is given in a later section. The labelled dataset changed the main methods for detecting deceptive reviews: more and more studies attempted to use machine learning techniques to separate truthful and deceptive reviews automatically. The performance of machine learning techniques is influenced by the number of samples and features, so researchers focus on finding effective features to represent deceptive reviews. Some studies [2,7,18] used bag-of-words or n-gram features to represent a review. Mukherjee [14] combined n-gram features and part-of-speech features to train an SVM classifier, which achieved 65.5% and 67.8% accuracy on hotel-domain datasets. SVM classifiers stand out at handling small samples, nonlinearity and high-dimensional features. However, analysing words separately ignores the meaning and latent features of the whole sentence or review, so research began to analyse deceptive reviews at the sentence level. Li et al. [19] used word vectors as input, applied a convolutional neural network based on semantic representation, and treated text vectors as features for classification. They argue that different sentences in a comment have different importance: in text representation learning, a semantic representation that weights sentences differently performs better than a semantic representation of the whole text. The text analysis tool LIWC (Linguistic Inquiry and Word Count) can extract multiple text features, including stylistic features. Stylistic features mainly describe the user's writing style, covering lexical and syntactic features [20]. In [20], the authors list many stylistic features, such as the number of short words (e.g., if, the), the length of tokens or reviews, and the number of uppercase letters. One study [21] claimed that some stylistic features of deceptive reviews are not completely consistent with psychological findings on liars: deceptive reviews contain more first-person words than truthful reviews. The authors analysed this phenomenon and concluded that spammers try to fabricate personal consumption experiences to make their reviews more credible. Meanwhile, Li [22], Hammad [23] and Mukherjee [14] combined raw-data features (post date, review ranking, review ID, product ID, feedback, etc.) with other features, which improved performance.

Positive-Unlabelled Learning:

Positive-Unlabelled Learning is a semi-supervised technique. It differs from traditional supervised learning, which needs both positive and negative samples. Researchers [24] showed that unlabelled data is helpful in classifier building: using a small set of labelled documents together with a large set of unlabelled documents to build classifiers works better than using the small labelled set alone. This provides the basic theory of PU-learning. Some studies combine multiple features and detect deceptive reviews with semi-supervised methods [22,25,26,27]. One improved technique [28] introduces the spy technique, which is used to find reliable negative samples; positive samples (P) and reliable negative reviews (RN) are then used to train a Naïve Bayes classifier to detect deceptive reviews. Subsequently, Hernandez [25], Ren [26] and Li [27] proposed different new models based on PU-Learning.


Datasets

This paper uses three review datasets, each serving a different function in the research.

1. Steam game reviews dataset

• This is the main dataset for the whole research. It consists of unlabelled raw data collected from 8 different games on Steam (the games are introduced in the experiment section). The dataset has 33,000 reviews, and each review includes the nickname, game duration, rating, the number of players who found the review helpful, the number who found it funny, and the review content. We use this dataset as the main corpus to analyse and mine the potential features of spam opinions in game reviews.

2. Gold-standard dataset

• The gold-standard deceptive opinion dataset is introduced in [29] and [7]. It is publicly available on the internet [30] and is labelled as truthful or deceptive. It includes 1600 reviews: 400 truthful positive reviews from TripAdvisor; 400 truthful negative reviews from TripAdvisor, Orbitz, Expedia, Hotels.com, Priceline and Yelp; and 400 deceptive positive and 400 deceptive negative reviews from Mechanical Turk.

3. Steam game reviews with sentiment labels

• This dataset is publicly available on the internet [31] and is labelled as positive or negative. We use a sub-dataset of 100,000 reviews; each review includes the game id, the review content, the sentiment label and the number of players who found the review helpful. We use this dataset to predict and adjust the sentiment value of the reviews in the first dataset.


Feature Generation

Much previous work has extracted features of deceptive reviews from various angles. This section analyses all the features we use for detecting deceptive game reviews. In addition, some features unique to Steam game reviews, reflecting their style, will be introduced. The features fall into three groups.

1. Word-Based features

• Word-based features focus on individual tokens in each review and use tokens to capture the difference between truthful and deceptive reviews. Tokens are not only words, but also punctuation, digits and even symbols. Word-based features include bag-of-words features, stylistic features and part-of-speech features.

(a) Bag-of-words features

• BOW is one of the most commonly used methods for representing word features, also known as n-gram features. Depending on the number of words in each combination, unigram, bigram and trigram features are widely used. These features are very effective for opinion mining and sentiment analysis, and for deceptive review detection their effectiveness exceeds that of other text features [3]. However, there are significant differences in accuracy across datasets. In previous research [7], BOW features achieved 89.6% accuracy on the initial gold-standard dataset of 400 truthful and 400 deceptive reviews, but only 67.8% accuracy on the Dianping dataset [14]. The gap arises because the deceptive reviews in the Dianping dataset deliberately imitate the vocabulary of truthful reviews. Therefore, BOW features alone are not accurate enough to identify deceptive reviews and need to be combined with other features.

(b) Stylistic features

• Stylistic features mainly describe the user's writing style. According to early analyses of truthful writing versus imaginative writing [32,33,34], deceptive and truthful reviews differ in writing style. For example, deceptive reviews contain more first-person words [21] (as mentioned above), and spammers, compared with real reviewers, use simpler, shorter words with fewer average syllables per word [35]. Finding features that distinguish writing styles is therefore the main task for stylistic features. Following [20], stylistic features can be categorized into two types: lexical features and syntactic features. Table 1 lists the lexical features used in this research: the first column gives the name of each feature, the second a detailed explanation, and the third a description of the implementation.


Lexical Features | Description | Implementation
Total number of numbers | All numbers, like 1, 2, 10, 100 | Extract numbers by regular expression
Length of review in tokens (T) | All tokens; tokens include words, symbols and punctuation | Total number of tokens
Ratio of first-person words | All first-person words: I, my, mine, our, ours, we, us, me | Total number of first-person words / T
Total number of characters (N) | All letters Aa-Zz, all single digits, all punctuation marks, all symbols | Total number of characters
Ratio of uppercase letters | All letters A-Z | Total number of uppercase characters / N
Ratio of digits and symbols | All digits, punctuation and special symbols ($@#%^&*) | Total number of digits and symbols / N
Total number of short words | All short words (1-3 characters): if, the, how | Total number of short words

Table 1

Table 2 lists the syntactic features used in this research.

Syntactic Features | Description | Implementation
Numbers of punctuation marks | . ? ! : ; ' " | Frequency of punctuation marks
Numbers of function words | Using the list of words given in the Appendix | Frequency of function words

Table 2

(c) Part-of-speech Features

• Part-of-speech features are generated by tagging words with their part of speech and counting the frequencies. Li [18] concluded that truthful and deceptive reviews differ in the numbers of different parts of speech: truthful reviews contain more words tagged as noun, adjective, preposition, qualifier and conjunction, while deceptive reviews contain more words tagged as verb, adverb, pronoun and pre-qualifier. These characteristics are consistent with the early analysis of truthful versus imaginative writing [32,33,34]. However, part-of-speech features have certain limitations: deceptive reviews fabricated by experts do not satisfy this rule. When experts write reviews, the intent to imitate truthful reviews is stronger; they imitate the characteristics of truthful reviews down to details of product information, consumption experience, and so on, which makes them more deceptive. Moreover, the distribution of parts of speech differs across review domains, which is consistent with results from computational linguistics [34]. Even so, for detecting deceptive reviews, part-of-speech features transfer across domains better than BOW features.

Based on word-based features, researchers have obtained good detection results using classification models such as SVMs and neural networks. Mukherjee et al. [14] applied n-gram and part-of-speech features to train an SVM to detect deceptive reviews on the Yelp dataset, obtaining 65.6% and 67.8% accuracy on hotel reviews. Shojaee et al. [20] applied stylistic features to train an SMO (Sequential Minimal Optimization) classifier and Naïve Bayes on the gold-standard dataset, obtaining F-measures of 84% and 74%.


2. Sentence-Based Features

• BOW features do not consider the order of keywords when representing a document; they merely treat the document as a collection of keyword occurrence probabilities, with each keyword independent of the others. In this case, the overall meaning and characteristics of the document are lost. Therefore, we need to consider latent factors of the whole sentence. Sentence-based features include Doc2Vec and sentiment features.

(a) Doc2Vec

• Doc2Vec (also called paragraph2vec or sentence embeddings) is an unsupervised algorithm that produces vector representations of sentences, paragraphs or documents. It was proposed by Quoc Le and Tomas Mikolov in 2014 [36] as an extension of word2vec, which is introduced in [37] and [38]. Word2vec can be trained effectively on vocabularies of millions of words and datasets of hundreds of millions of examples, and the resulting word embeddings can measure the similarity between words. Doc2Vec is very similar to word2vec in principle and training method, but its result represents an entire sentence; likewise, the trained vectors can measure the similarity between sentences.

Based on two different implementations, doc2vec has two models: the distributed memory model and the distributed bag-of-words model.

• Distributed memory model:

The core idea of word2vec is that a word i can be predicted from the words near it; that is, the surrounding words influence word i. The DM model uses the same principle to train a sentence representation. Graph 1 shows the framework of the DM model.

Graph 1

Each paragraph or sentence is mapped into the vector space and can be represented by a column of matrix D. Each word is also mapped into the vector space and can be represented by a column of matrix W. The paragraph vector and word vectors are then concatenated or averaged to obtain features that predict the next


word in the sentence. This paragraph/sentence vector can also be considered as a word, acting as a memory unit for the context or the topic of the paragraph.

• Distributed bag-of-words model:

This training method ignores the context of the input and lets the model predict a random word in the paragraph. That is, at each iteration a window is sampled from the text, a word is randomly sampled from the window as the prediction target, and the model predicts it. The only input is the paragraph/sentence vector. Graph 2 shows the framework of the DBOW model.

Graph 2

The paragraph/sentence vector is an abstract representation of each paragraph/sentence, which can be regarded as the feature of each review.

(b) Sentiment Features

• According to Li [21], deceptive reviews contain more emotional words than truthful reviews, which shows that deceptive reviews are more strongly positive or negative. From a psychological perspective, the purpose of spammers is to promote or discredit the object, and emotional words display and enhance the emotional polarity of a review. Sentiment analysis is therefore necessary for detecting deceptive reviews: reviews with strong emotion are more likely to be deceptive. The mainstream sentiment analysis methods are sentiment dictionaries and machine learning techniques.

• Sentiment dictionary:

This method creates a dictionary containing as many emotional words as possible, each with a polarity value. For each input sentence, the method scores all the emotional words in the sentence and returns a polarity value for the whole sentence via some mathematical combination. The method is limited by the initial sentiment dictionary; moreover, the exact emotional polarity of each emotional word within the whole sentence is largely ignored.


• Machine Learning techniques:

There are many public labelled datasets of emotional polarity, which allows supervised learning methods to be widely applied to sentiment analysis. Li et al. [21] created a model based on a factor set Y for deceptive reviews, where Y includes sentiment, domain and source. The sentiment of a review has two polarities, positive and negative; the domain covers reviews of hotels, restaurants and doctors; the source is one of employee, Turker and customer. They compute three corresponding probabilities and feed them to the model to predict the probability that each review belongs to each source.

3. Raw-data Features

• The raw data of a review refers to its characteristics other than the text content. As mentioned in the Datasets section, each review in the unlabelled game reviews dataset includes the nickname, game duration, rating, the number of players who found the review helpful, the number who found it funny, and the review content. After analysing the text content with the methods above, we can derive further values from this raw data. In this research, raw-data features can detect deceptive reviews directly. They include play duration combined with sentiment polarity, rating combined with sentiment polarity, and confidence.

• Play duration and sentiment polarity:

Under Steam's rules, players can request a refund if their play duration for a game is less than 2 hours. From a psychological perspective, few spammers will actually buy a product before writing an untruthful review about it. Therefore, play duration can help detect deceptive reviews.

• Rating and sentiment polarity:

Generally, the rating represents the attitude or emotion of the player. When these two values contradict each other, the review is more likely to be deceptive.

• Confidence:

This value is calculated from the play duration, the number of players who found the review helpful, the number who found it funny, and the length of the review. Observing the game reviews, we found that long reviews generally have more supporters ("supporters" means the number of people who found the review helpful). Confidence is therefore a composite value used to indicate how convincing each review is.

In this section, we have summarized all the features used in this research. Different features can be applied to detect different types of spam reviews, as introduced and explained in the next section.


Experimental Design

The goal of this section is to implement the features above and apply them to test whether they can precisely separate deceptive reviews from other game reviews. The whole experimental process is also introduced in detail here. Based on certain unique features of game reviews, we divide spam reviews into three types: untruthful reviews, duplicate reviews and non-reviews.

1. Untruthful reviews

Untruthful reviews, or deceptive reviews, are reviews that maliciously promote or vilify. Generally, this kind of review is difficult to detect manually.

2. Duplicate reviews

In this research, duplicate reviews are reviews with the same semantics, the same polarity and similar length, but from different users.

3. Non-reviews

Non-reviews are reviews that have no meaning or effect.

The following unique features of game reviews can be observed in the raw data.

1. Game reviews contain more symbols and emojis. Steam allows players to add different emojis to reviews; these emojis are treated as special symbols when we extract data from Steam.

2. Steam does not limit the length of reviews, so the length distribution is scattered. Using the average length as a feature indicator is not very helpful. Meanwhile, the longer the review, the stronger its credibility.

3. The emotional polarity of game reviews is easy to judge: the rating attached to each review can be used to predict its polarity.

4. Game reviews contain more gaming terms, irregular words and internet slang.

The first step in the experiment is to obtain the dataset.

The basic information about the datasets was introduced in the Datasets section. The unlabelled Steam reviews dataset is the main dataset for analysing features, training the classifier and predicting labels. It was collected from 8 different games: GTA, PUBG, Oxygen Not Included, Total War: Three Kingdoms, NBA 2k19, Just Cause 4 and Scum. We chose these games because:

1. They are popular, which means there are enough reviews to extract.

2. They are of different genres, so we cover as many game types as possible and the extracted review features are more universal and representative.

3. Before specifically analysing the game reviews, we do not know the exact sentiment polarity of each review. To balance the effects of positive and negative reviews as much as possible, the chosen games have ratings ranging from very good to very poor.

Each review contains the nickname, rating, play duration, the numbers of people who found the review helpful or funny, and the review content.


The nickname is the account name of the player. The rating has two possible values, Recommended and Not Recommended. The play duration indicates how long the player had played the game when posting the review. Helpful and funny are two voting counts provided for players. The review content includes all letters, digits and symbols of the original version on the Steam review page. This research focuses on reviews written in English. There are 33,450 reviews in this dataset.

The gold-standard dataset (labelled truthful/deceptive) and the labelled Steam reviews dataset (labelled positive/negative) are publicly accessible and downloadable. Details of these two datasets are given in the previous research [7] and on the website [30].

The second step in the experiment is to pre-process the dataset.

After obtaining the raw dataset, we perform preliminary processing so that it can be used with the various methods in this study. Good pre-processing makes the research tasks easier and more efficient. There are many pre-processing techniques in data mining and natural language processing; the methods we use are introduced below.

1. Numericalisation

For this research, we numericalise the information other than the review text. There are five attributes in the unlabelled Steam reviews; we numericalise the rating attribute and the voting attributes as follows:

(a) The rating has two values, Recommended and Not Recommended, indicating whether the player who left the review likes the game; it represents the player's overall attitude towards the game. We replace every Recommended sign with 1 and every Not Recommended sign with 0.

(b) The raw voting information looks like: "No one has rated this review as helpful yet", "2 people found this review helpful 1 person found this review funny" or "1 person found this review helpful". The voting attribute can therefore be split into a helpful attribute and a funny attribute. We extract the numbers from the voting attribute; the result is shown in Table 3.

numbers in voting attribute and the result will be shown as table 3.

The raw information helpful funny

No one has rated this review as helpful yet 0 0

2 people found this review helpful 1 person

found this review funny

2 1

1 person found this review helpful 1 0

Table 3

Numericalisation exposes the features of the raw data and greatly facilitates the subsequent pre-processing operations.
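As a minimal illustration of this numericalisation step (assuming the raw voting strings follow the three patterns in Table 3; the function name and regular expressions below are ours):

import re

def parse_voting(raw):
    """Extract (helpful, funny) counts from a raw Steam voting string."""
    helpful = funny = 0
    match = re.search(r"(\d[\d,]*) (?:person|people) found this review helpful", raw)
    if match:
        helpful = int(match.group(1).replace(",", ""))
    match = re.search(r"(\d[\d,]*) (?:person|people) found this review funny", raw)
    if match:
        funny = int(match.group(1).replace(",", ""))
    return helpful, funny

# The three raw strings from Table 3:
assert parse_voting("No one has rated this review as helpful yet") == (0, 0)
assert parse_voting("2 people found this review helpful 1 person found this review funny") == (2, 1)
assert parse_voting("1 person found this review helpful") == (1, 0)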


2. Lowercase

Next we pre-process the review content. Each review is written by a player, and players do not follow a unified style of expression; this messy expression can seriously affect the extraction of certain features, so pre-processing of the natural language text is very important. Lowercasing is one basic unification method: it converts all uppercase letters to lowercase to achieve a uniform text. However, this method is not applied when we extract certain features (such as the uppercase ratio).

3. Replace punctuation and numbers with space or nothing

Because punctuation and numbers do not help much in analysing sentence semantics and emotion, we do not consider these characters when extracting sentence-level features. The process is shown in Table 4.

Original sentence | Pre-processed result
. | []
10/10 | []
I spent 10 hrs playing this game and got 100% fun | I spent hrs playing this game and got fun
It is a well-known game | It is a well known game

Table 4

4. Tokenize

Tokenization is a commonly used pre-processing tool with two main variants. The first splits a paragraph into sentences (Table 5).

Original sentence | Split result
This is a text for test. And I want to learn more techniques. | ['This is a text for test.', 'And I want to learn more techniques.']

Table 5

The second splits a sentence or paragraph into words, turning it into a sequence of tokens while keeping all duplicate words (Table 6).

Original sentence | Split result
This is a text for test. And I want to learn more techniques. | ['This', 'is', 'a', 'text', 'for', 'test', '.', 'And', 'I', 'want', 'to', 'learn', 'more', 'techniques', '.']

Table 6

In this research, we apply the second variant to split reviews into sequences of words, which is convenient for the subsequent pre-processing.
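Both splitting variants are available off the shelf; a minimal sketch reproducing Tables 5 and 6 with the nltk tokenizers (our choice of library, consistent with the nltk stop word list used below):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used by both functions

text = "This is a text for test. And I want to learn more techniques."
print(sent_tokenize(text))  # sentence-level split, as in Table 5
print(word_tokenize(text))  # word-level split, as in Table 6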


5. Stop words

In information retrieval and natural language processing, certain words are automatically filtered out to save storage space and improve search or processing efficiency. These words are called stop words. Stop words are defined manually and stored in a stop word list; in this research we use the stop word list from the nltk package, shown in the appendix.

6. Stemming and Lemmatization

For grammatical reasons, documents use different forms of a word; for example, organize has the forms organizes, organizing and organized. There are also families of derivationally related words with similar meanings, such as democracy, democratic and democratization. This diversity causes trouble for this research: for instance, when finding the part of speech of each word, the tags differ for car and cars, which increases the burden on the program when extracting certain features. Therefore, we need stemming and lemmatization.

The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, to a base form. For example, am, are and is become be; car, cars and car's become car. However, the two differ: stemming usually refers to a crude heuristic process that cuts off the ends of words and removes derivational affixes, whereas lemmatization processes words properly using a vocabulary and morphological analysis, aiming to remove inflectional endings and return the base form of a word, known as the lemma. The results are shown in Table 7.

Stemming | Result
wolves, dogs, running, ate | wolv, dog, run, ate

Lemmatization | Result
wolves, dogs, running, ate | wolf, dog, run, eat

Table 7
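A minimal sketch of both operations with nltk (our choice of stemmer and lemmatizer; the outputs match Table 7):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexicon required by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in ["wolves", "dogs", "running", "ate"]])
# ['wolv', 'dog', 'run', 'ate'] -- crude suffix stripping

print([lemmatizer.lemmatize(w) for w in ["wolves", "dogs"]])           # ['wolf', 'dog']
print([lemmatizer.lemmatize(w, pos="v") for w in ["running", "ate"]])  # ['run', 'eat']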

The third step in the experiment is to extract stylistic features from the game reviews.

Stylistic features are based on the words in the review content. Some stylistic features require analysing symbols, numbers and punctuation. Therefore, for stylistic features we do not apply lowercasing, punctuation/number replacement, stop word removal, or stemming and lemmatization.


For lexical features:

Lexical Features | Implementation | How it helps detect spam reviews
Total number of numbers | Extract numbers by regular expression | The more numbers, the less semantic information the review contains, and the more likely it is to be deceptive.
Length of review in tokens (T) | Total number of tokens | The shorter the review, the weaker its credibility.
Ratio of first-person words | Total number of first-person words / T | As mentioned above, the more first-person words, the more likely the review is spam.
Total number of characters (N) | Total number of characters |
Ratio of uppercase letters | Total number of uppercase characters / N | Capitalized letters generally indicate emphasis or a strong tone, expressing strong emotion. The larger the ratio, the stronger the emotion.
Ratio of digits and symbols | Total number of digits and symbols / N | The larger the ratio, the less semantic information the review contains. When the ratio is 1, the review contains only symbols or digits and can be regarded as spam.
Total number of short words | Total number of short words |

Table 8

For syntactic features:

Syntactic Features | Implementation | How it helps detect spam reviews
Numbers of punctuation marks | Frequency of punctuation marks | The more punctuation, the less semantic information the review contains.
Numbers of function words | Frequency of function words |

Table 9

A single stylistic feature can help to detect spam reviews, but no single feature can by itself determine that a review is spam. A sketch of the lexical feature extraction is given below.
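This sketch computes the lexical features in Table 8 from the raw (un-lowercased, un-tokenized) review text; the function name and the whitespace tokenization are our assumptions:

import re
import string

FIRST_PERSON = {"i", "my", "mine", "our", "ours", "we", "us", "me"}

def lexical_features(review):
    tokens = review.split()  # stylistic features use the raw text, so no pre-processing
    n_tokens = max(len(tokens), 1)
    chars = [c for c in review if not c.isspace()]
    n_chars = max(len(chars), 1)
    return {
        "num_numbers": len(re.findall(r"\d+", review)),
        "num_tokens": len(tokens),
        "first_person_ratio": sum(t.lower() in FIRST_PERSON for t in tokens) / n_tokens,
        "num_chars": len(chars),
        "uppercase_ratio": sum(c.isupper() for c in chars) / n_chars,
        "digit_symbol_ratio": sum(c.isdigit() or c in string.punctuation for c in chars) / n_chars,
        "num_short_words": sum(1 <= len(t) <= 3 for t in tokens),
    }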

The fourth step in the experiment is to extract part-of-speech features from the game reviews.

When generating part-of-speech features, we applied all the pre-processing techniques except stop word removal. According to Li [21], the ratios of different parts of speech differ between truthful and deceptive reviews, and the stop word list contains some words we need, so we do not remove stop words. Meanwhile, stemming and lemmatization save time and cost when tagging each word. We define all words tagged as noun, adjective, preposition, qualifier and conjunction as true words, and all words tagged as verb, adverb, pronoun


and pre-qualifier as false words. We count the frequency of true words and false words for each review. The part-of-speech features then comprise the frequency of true words, the frequency of false words, and the ratio of false words (frequency of false words / frequency of true words).
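A sketch of this step using the nltk part-of-speech tagger; the mapping from Penn Treebank tag prefixes to "true" and "false" words is our reading of the definition above:

import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

TRUE_TAGS = ("NN", "JJ", "IN", "DT", "CC")  # noun, adjective, preposition, qualifier, conjunction
FALSE_TAGS = ("VB", "RB", "PRP", "PDT")     # verb, adverb, pronoun, pre-qualifier

def pos_features(tokens):
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    true_freq = sum(tag.startswith(TRUE_TAGS) for tag in tags)
    false_freq = sum(tag.startswith(FALSE_TAGS) for tag in tags)
    # Guard the ratio against reviews containing no "true" words.
    return true_freq, false_freq, false_freq / max(true_freq, 1)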

The fifth step in the experiment is to extract Doc2Vec features from the game reviews.

In this research, we use all the reviews in the unlabelled Steam reviews dataset as training data to train four doc2vec models. We apply all the pre-processing techniques to train the first two models, stored as Doc2vec_dm and Doc2vec_dbow; these are based on the two doc2vec models introduced above. We then apply all the pre-processing techniques except stop word removal to train the last two models, stored as Doc2vec_dm_stop and Doc2vec_dbow_stop, likewise one per doc2vec model.

The main function of doc2vec here is to find similar texts in the vector space by calculating the distance between vectors. Comparing the first two models with the second two, we found that stop words influence the performance of doc2vec models: the models that keep stop words are better at finding similar reviews. We analysed the reason and concluded:

1. The doc2vec model is a development of the word2vec model. Since word2vec depends on the context near the predicted word, and that context may consist of stop words, deleting the stop words makes some predicted words lose their dependency on the context.

2. The stop word list contains some words that are useful for training a doc2vec model; that is, the list does not restrict itself to a reasonable set of words to delete.

Preserving stop words makes the semantic expression of sentences richer and increases the distance between sentences in the vector space. Therefore, we use Doc2vec_dm_stop and Doc2vec_dbow_stop to extract the doc2vec features in this research. We combine the two models by stacking their vectors horizontally (in column order) to get a vector for each review that contains the characteristics of both models. We regard this vector as the doc2vec feature of each review.
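A minimal sketch of the training and stacking, assuming gensim's Doc2Vec implementation and hyperparameters of our own choosing (tokenized_reviews stands for the pre-processed corpus with stop words kept):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words, [i]) for i, words in enumerate(tokenized_reviews)]

dm_model = Doc2Vec(docs, dm=1, vector_size=100, min_count=2, epochs=20)    # distributed memory
dbow_model = Doc2Vec(docs, dm=0, vector_size=100, min_count=2, epochs=20)  # distributed bag of words

# Stack the two document vectors horizontally (in column order) to form
# the combined doc2vec feature for each review.
features = np.hstack([dm_model.dv.vectors, dbow_model.dv.vectors])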


Meanwhile, we make each model return the single most similar review for each review. We then apply the algorithm shown in Graph 3 to find duplicate reviews.

Graph 3:

Duplicate(reviews):
    For each review i,
        If the length of i > 5,
            Get the most similar review a of i based on the DBOW model
            Get the most similar review b of i based on the DM model
            If the similarity of a and i > t1
                If (length a < length i + 5 and length a > length i - 5) and the polarity of a is the same as i
                    If the nickname of a is different from i
                        Return i.index, a.index
            If the similarity of b and i > t2
                If (length b < length i + 5 and length b > length i - 5) and the polarity of b is the same as i
                    If the nickname of b is different from i
                        Return i.index, b.index

In this algorithm, we tried setting t1 and t2 to 0.85, 0.9 and 0.95. After observation and practice, t1 should be set to 0.95 and t2 to 0.9. The thresholds t1 and t2 depend on the quality of the models and the size of the training data.

There are many very short reviews, such as "nice game" and "good game", which carry very limited information because of their length; such reviews cannot simply be treated as duplicates, and they are instead checked by other features introduced later. We therefore require the length of a review to be greater than 5 tokens, and we define a similar length as being within 5 tokens of the input review. Reviews identified as duplicates are labelled as spam reviews. A sketch of this check for one model is given below.
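This sketch applies the Graph 3 check for a single doc2vec model; the review objects with .tokens, .polarity and .nickname attributes are hypothetical stand-ins for the dataset fields described above:

def find_duplicates(model, reviews, threshold):
    pairs = []
    for i, review in enumerate(reviews):
        if len(review.tokens) <= 5:
            continue  # very short reviews are handled by other features instead
        j, sim = model.dv.most_similar(positive=[i], topn=1)[0]
        other = reviews[j]
        if (sim > threshold
                and abs(len(other.tokens) - len(review.tokens)) < 5
                and other.polarity == review.polarity
                and other.nickname != review.nickname):
            pairs.append((i, j))  # label both as duplicate spam reviews
    return pairs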

The sixth step in the experiment is to extract sentiment features from the game reviews.

Some techniques for sentiment analysis were introduced above. In this research, we apply a sentiment dictionary and machine learning together to obtain the final polarity value for each review. All the pre-processing techniques are applied when extracting sentiment features.

Sentiment Dictionary

In this research, we use the Python package TextBlob to calculate the polarity value of each review. The function returns two values: the polarity and the subjectivity. TextBlob calculates the polarity based on a sentiment word dictionary.
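For example (a minimal TextBlob usage sketch; the review text is made up):

from textblob import TextBlob

blob = TextBlob("This game is absolutely fantastic, best purchase ever!")
print(blob.sentiment.polarity)      # polarity in [-1.0, 1.0]
print(blob.sentiment.subjectivity)  # subjectivity in [0.0, 1.0]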



Machine Learning

We apply the labelled Steam reviews in this part. As introduced above, the reviews in this dataset are labelled positive or negative. The original dataset contains 6.4 million reviews; this research uses 100,000 of them, with 90,000 for training and 10,000 for testing. We use CountVectorizer to process the content of each review; it converts the text to a word frequency matrix, which is used as the feature matrix to train a classifier. The classifier we use is XGBoost, one of the boosting algorithms. The idea of boosting is to integrate many weak classifiers into a strong classifier; XGBoost is a boosted tree model that integrates many tree models, each of which is a CART regression tree.

We use log-loss to evaluate the performance of the classifier. The log-loss for XGBoost is 0.2, which is lower than the SVM model's 0.41.

After obtaining the classifier, we apply it to predict labels for the unlabelled Steam reviews: predicted positive reviews are labelled 1 and predicted negative reviews are labelled 0. A sketch of this pipeline is given below.

The sentiment dictionary and the machine learning classifier each return one value for every review: a polarity value and a label. However, neither method is 100% correct, so we adjust the polarity value of each review. The reviews needing adjustment are:

1. reviews whose polarity value is greater than 0 but whose label is 0;

2. reviews whose polarity value is less than 0 but whose label is 1.

For the first type we subtract 0.3 from the polarity value; for the second type we add 0.3 to it. TextBlob's calculation is influenced by the number of emotional words, and we take a polarity value above 0.3 or below -0.3 to indicate that the review contains many positive or negative words. In that case we keep trusting the dictionary: we still believe the polarity is negative even if the label is 1, and we still believe the polarity is positive even if the label is 0.

After the process above, we have the final polarity and the subjectivity of each review. We then use a technique called the Local Outlier Factor to detect outliers based on these two values. The Local Outlier Factor is a classic density-based algorithm that computes one value for each review: the larger the value, the higher the probability that the review is an outlier. Overall, the sentiment features give three values for each review: the polarity, the subjectivity and the local outlier factor.

The seventh step in the experiment is to extract raw-data features from the game reviews.

The raw-data features are extracted from the raw data, whose definition was introduced above. In this research, we use three raw-data features.


1. Play duration and sentiment value (polarity)

We find all the reviews whose play duration is less than 2 hours, in line with the Steam refund rule introduced above. The algorithm is shown in Graph 4.

Graph 4:

Play duration and Sentiment value (reviews):
    For each review i,
        If the play duration of i <= 2,
            If the polarity value of i >= T1 or the polarity value of i <= T2,
                Set label 1 for i
            Else
                Set label 0 for i

In this algorithm, we set two thresholds, T1 and T2; for this research, T1 is 0.3 and T2 is -0.75. The reasons are:

(a) According to normal psychological reactions, the play duration determines the player's attitude towards the game. When the play duration is less than 2 hours, the game has not attracted the player, and the polarity of the review should be negative. A review is therefore more likely to be untruthful if its polarity value is greater than 0.3 (some players simply cannot wait to write a review, so their polarity can be positive, but a value above 0.3 should be abnormal).

(b) Meanwhile, we still need the threshold T2 even though these reviews are generally negative: if the polarity value is lower than -0.75, we believe the reviewer has a malicious intention to vilify the game.

2. Rating and Sentiment value

Rating has two types, recommended and not recommended; its purpose is to show the player's attitude towards the game. The algorithm is shown in Graph 5.



Graph 5: Rating and Sentiment value

    For each review i:
        If the rating of i == 1:
            If the polarity value of i < T1:
                Set label 1 for i
            Else:
                Set label 0 for i
        If the rating of i == 0:
            If the polarity value of i > T2:
                Set label 1 for i
            Else:
                Set label 0 for i

In this algorithm, we select all the reviews whose rating contradicts their polarity value. We again set two thresholds, T1 and T2; in this research, T1 = -0.5 and T2 = 0.5.

The reasons are:

(a) In actual commenting behaviour, the emotion of a review may be just a momentary feeling about one matter, so the rating alone cannot accurately represent the player's attitude. A single unsatisfying game update may prompt a player to leave a negative review even though he still loves and recommends the game; in that case we cannot regard the review as untruthful.

(b) The case in (a) happens often in game reviews, but the polarity should still stay within a reasonable range. A polarity greater than 0.5 or smaller than -0.5 is outside that range. Therefore, we set T1 = -0.5 and T2 = 0.5.

3. Confidence

Confidence is calculated by considering several factors. As described under the stylistic features, the length of a review influences its credibility; the voting system also needs to be considered when detecting untruthful reviews. Therefore, we define the confidence of each review as:

confidence = PlayDuration × N(length of review) × (helpful + 0.1) × ((funny + 0.1) / 2)

where N(x) normalizes x into the range (0, 1).

Helpful is the number of people who found the review helpful, and funny is the number who found it funny. We add 0.1 to each count so that a count of 0 does not zero out the confidence. Meanwhile, the weights of helpful and funny should differ: a helpful vote is a more rigorous signal than a funny vote, which is why the funny term is halved.



Confidence thus measures the credibility of each review.
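To make the three raw-data features concrete, the following is a minimal Python sketch. It assumes a pandas dataframe with columns play_duration (hours), polarity, rating (1 = recommended, 0 = not recommended), review_length, helpful and funny; the names are illustrative rather than the exact ones used in the project code.

    def normalize(x):
        # N(x): min-max normalisation into the range (0, 1)
        return (x - x.min()) / (x.max() - x.min())

    def raw_data_features(df):
        # Graph 4: short play duration vs. sentiment (T1 = 0.3, T2 = -0.75)
        short = df["play_duration"] <= 2
        df["duration_flag"] = (short & ((df["polarity"] >= 0.3) |
                                        (df["polarity"] <= -0.75))).astype(int)

        # Graph 5: rating contradicts sentiment (T1 = -0.5, T2 = 0.5)
        df["rating_flag"] = (((df["rating"] == 1) & (df["polarity"] < -0.5)) |
                             ((df["rating"] == 0) & (df["polarity"] > 0.5))).astype(int)

        # Confidence: duration x normalised length x weighted vote counts
        df["confidence"] = (df["play_duration"]
                            * normalize(df["review_length"])
                            * (df["helpful"] + 0.1)
                            * (df["funny"] + 0.1) / 2)
        return df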

The eighth step in the experiment is to apply a semi-supervised technique to detect spam reviews.

This final step applies a classifier to the unlabelled Steam reviews dataset. The main dataset is unlabelled, and its negative samples are difficult to identify manually, so traditional supervised techniques cannot work on it. Semi-supervised techniques, however, can classify from a small dataset of positive samples together with a large dataset of unlabelled samples.

Semi-supervised techniques include two-view models and PU learning models. In this research, we apply Positive Unlabelled (PU) learning to train a classifier that predicts the unlabelled dataset.

Using the features above, we generate the positive-samples dataset as follows (a sketch appears after this list):

1. We sort all reviews in descending order of confidence, so that the top reviews are the most credible.

2. We then use the raw-data features to delete all potential untruthful reviews, and the stylistic features to delete all potential non-reviews.

3. Finally, we select 2000 reviews from the first quarter of the remaining reviews, 1000 from the second quarter and 500 from the third quarter. Selecting from different parts prevents the confidence value from dominating the final result.

We then regard all remaining samples as the unlabelled dataset.
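A sketch of this construction, assuming the dataframe and flag columns from the raw-data sketch above plus a non_review flag (the column names are illustrative, and the quarters are assumed large enough for the requested sample sizes):

    import pandas as pd

    df = df.sort_values("confidence", ascending=False).reset_index(drop=True)

    # Remove potential untruthful reviews and non-reviews first.
    clean = df[(df["duration_flag"] == 0) & (df["rating_flag"] == 0)
               & (df["non_review"] == 0)]

    # Sample from three quarters so confidence alone does not decide
    # the final result: 2000 / 1000 / 500 reviews.
    q = len(clean) // 4
    positive = pd.concat([clean.iloc[:q].sample(2000, random_state=1),
                          clean.iloc[q:2 * q].sample(1000, random_state=1),
                          clean.iloc[2 * q:3 * q].sample(500, random_state=1)])

    # Everything else becomes the unlabelled dataset.
    unlabelled = df.drop(positive.index)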

In this research, we use the PU-bagging technique to train a decision tree classifier.

PU-bagging technique (a code sketch follows this list):

1. Create a training dataset by combining all positive samples with randomly drawn unlabelled samples.

2. Use this "bootstrap" sample to build a classifier; the bootstrap contains all positive samples plus a random unlabelled sample of the same size as the positive dataset.

3. Apply the classifier to the samples not in the bootstrap and record the predicted probabilities.

4. Repeat the three steps above; the final probability of each review is the average over all iterations.
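A minimal sketch of PU-bagging with a decision-tree base classifier, under the assumption that X_pos and X_unlab are NumPy feature matrices for the positive and unlabelled reviews:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def pu_bagging(X_pos, X_unlab, n_iter=100, seed=0):
        rng = np.random.RandomState(seed)
        n_pos, n_unlab = len(X_pos), len(X_unlab)
        scores = np.zeros(n_unlab)
        counts = np.zeros(n_unlab)

        for _ in range(n_iter):
            # Bootstrap: all positives plus an equally sized random
            # draw from the unlabelled samples, treated as negatives.
            boot = rng.choice(n_unlab, size=n_pos, replace=True)
            X_train = np.vstack([X_pos, X_unlab[boot]])
            y_train = np.concatenate([np.ones(n_pos), np.zeros(n_pos)])

            clf = DecisionTreeClassifier(random_state=seed)
            clf.fit(X_train, y_train)

            # Score only the out-of-bag unlabelled samples.
            oob = np.setdiff1d(np.arange(n_unlab), boot)
            scores[oob] += clf.predict_proba(X_unlab[oob])[:, 1]
            counts[oob] += 1

        # Average probability over the iterations in which each
        # sample was out-of-bag.
        return scores / np.maximum(counts, 1)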

Meanwhile, we represent the reviews in the Gold-Standard dataset with all the features above (except the raw-data features). We then run PU-bagging on the gold-standard dataset with 100 positive samples and 700 unlabelled samples to verify the validity of PU-bagging.


Research Results

This part shows the results obtained with all the features above and compares and explains them against previous research.

As mentioned above, we divide spam game reviews into three types: untruthful reviews, duplicate reviews and non-reviews. We show below how the features work and what they find.

1. Non-reviews

In this research, non-reviews are much easier to detect than the other two kinds of reviews. To detect non-reviews, we applied the following techniques:

(a) Pre-processing

Pre-processing is the first step of mining the data. In our pre-processing, every review that contains only punctuation and digits is reduced to length 0; such reviews contain no meaningful words, so we can label them as spam reviews.

(b) Stylistic features

For non-reviews, the stylistic features show high values for the ratio of numbers and symbols and for the number of punctuation marks.

Table 10: Non-reviews (all examples are original reviews in the dataset)

1. "."
2. "!!"
3. "10/10" (there are many reviews like this)
4. "123818237980801312/1"
5. Reviews not in English, e.g. "มนเยยมมากเปนเกมทรวมแทบทกสงไวในเกมเดยว😀" (Thai, roughly: "It is great, a game that combines almost everything into one"). We cannot extract features from these reviews, so all their stylistic features are 0.

There are 337 reviews classified as non-reviews. Such reviews account for a small percentage and are easy to find manually.
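A minimal sketch of this rule, where the regular expression approximates the project's pre-processing by keeping alphabetic words only:

    import re

    def is_non_review(text):
        # Strip everything except letters; if nothing remains, the
        # review has no meaningful words and is labelled a non-review.
        cleaned = re.sub(r"[^A-Za-z]+", " ", text).strip()
        return len(cleaned) == 0

    # e.g. is_non_review("10/10") -> True, is_non_review("!!") -> True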

2. Duplicate reviews

In this research, we use the doc2vec features, sentiment features and stylistic features to detect duplicate reviews. The algorithm was shown in Graph 3.


Table 11: Duplicate reviews (all examples are original reviews in the dataset; Review 1 and Review 2 are reproduced as images in the original report).

Review 1 and review 2 come from different users but the same game. We applied doc2vec to turn all review text into vectors, and the distance between vectors indicates the similarity of two reviews; these two reviews are essentially identical in content. The performance of doc2vec is influenced by the initial training dataset: if we lower the minimum review length from >5 to >0, many more duplicate reviews appear in the result, because many short reviews contain the same words. The number of duplicates detected therefore also depends on the overall distribution of review lengths, and because the main dataset for this study was randomly crawled from Steam, it is difficult to control its quality.
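As an illustration of this distance-based grouping (a simplification of the Graph 3 algorithm, with an illustrative cosine-similarity threshold; vectors is assumed to hold one doc2vec vector per review):

    from itertools import combinations
    import numpy as np

    def find_duplicate_pairs(vectors, threshold=0.95):
        # Normalize rows so the dot product equals cosine similarity.
        norms = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        pairs = []
        for i, j in combinations(range(len(norms)), 2):
            if norms[i] @ norms[j] >= threshold:
                pairs.append((i, j))  # reviews i and j form one group
        return pairs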

There are 1221 groups of reviews detected as duplicates. Each group contains two reviews (for example, review 1 and review 345 form one group) that are likely to be duplicates of each other.

3. Untruthful reviews

All the features above contribute to detecting untruthful reviews. Untruthful reviews come in two types: explicit and implicit. Explicit untruthful reviews can be detected with the raw-data features. Implicit untruthful reviews are the most important and the most difficult to detect; all the features are combined with semi-supervised techniques to find them.


Table 11: Explicit untruthful reviews (all examples are original reviews in the dataset; the three reviews are reproduced as images in the original report).

Three explicit untruthful reviews are shown. The play duration of the first two is less than 2 hours, their content is short, and no words remain after pre-processing; negative reviews without any explanation also have low credibility.

The third review is detected as untruthful because its recommendation sign disagrees with its polarity value: the player does not recommend the game, yet wrote a positive review whose polarity value is greater than 0.3. We therefore regard it as an untruthful review.

4. Positive Unlabelled Learning

Since there is no publicly available game reviews dataset labelled as truthful or untruthful, we performed Positive Unlabelled Learning on the Gold-Standard dataset to verify its validity.

We represent the reviews in the Gold-Standard dataset with all the features above (except the raw-data features, because this dataset lacks the attributes needed to generate them). We then train PU-bagging with different sizes of positive samples and evaluate the results with the F-measure. The dataset contains 400 truthful and 400 deceptive reviews.


Table 12: PU-bagging based on a decision tree classifier

Initial size of positive samples | Initial size of unlabelled samples | F-measure
80                               | 720                                | 53.5%
120                              | 680                                | 65.8%
200                              | 600                                | 72.2%

Hernández Fusilier et al. [25] evaluated different methods for detecting deceptive reviews on the gold-standard dataset.

Table 13: PU-LEA based on Naïve Bayes

Initial size of deceptive samples | Initial size of unlabelled samples | F-measure | Iterations | Final training set
80                                | 520                                | 55.9%     | 6          | 80D, 267U
100                               | 520                                | 83.7%     | 7          | 100D, 140U
120                               | 520                                | 78%       | 5          | 120D, 203U

One of the methods they applied is PU-LEA based on Naïve Bayes, with the deceptive reviews chosen as the initial training dataset.

Comparing the results of this research with the previous research, we find:

(a) The F-measure increases as the initial size of the training dataset increases.

(b) With enough initial samples, the iterative algorithm performs better.

(c) This research performs worse in terms of F-measure. After analysing the differences, we think the reasons are:

1. The ratios of the training datasets differ. In this research the ratios are 80/720, 120/680 and 200/600, much smaller than 80/267, 100/140 and 120/203; the larger the dataset that must be predicted, the more likely the performance suffers.

2. This research does not iterate the algorithm. We expect the performance would improve if the newly found positive reviews were added to the training dataset in each iteration.

3. The features in this research are not the best combination for the gold-standard dataset. All of them were designed with game reviews in mind, not hotel reviews. Different review domains have different features, and even features shared across domains are not enough to capture a domain's uniqueness. For example, the gold-standard dataset contains no short reviews and no irregular statements (many symbols or punctuation marks), so some features aimed at detecting non-reviews are unhelpful there, and even counterproductive.

4. Raw-data features can improve the accuracy of a classifier [14], but we did not find raw-data features suitable for the gold-standard dataset.


(d) Both results are far better than the best human result on this dataset, which, according to [7], is around 60% accuracy.

Finally, we apply the PU-bagging method to predict the unlabelled Steam reviews dataset. Meanwhile, we build a reliable spam-review set by labelling all duplicate reviews, all non-reviews and all reviews flagged by the raw-data features as 0.

The result is a list of probabilities, and reviews with probability greater than 0.5 are likely truthful. However, we cannot be sure that every review with probability below 0.5 is untruthful. Based on the performance on the gold-standard dataset, we set the threshold t = 0.2, the value at which the F-measure on the gold-standard dataset is best; this is also consistent with estimates that the true proportion of spam comments ranges from 10% to 20% [3][4]. Thus, we label a review as truthful if its probability is greater than 0.2 and as spam if its probability is smaller than 0.2.

There are 18895 reviews with probability greater than 0.5, and 5021 reviews labelled as spam.


Conclusions and Future Work

In this paper we applied natural language processing techniques to extract features from game reviews and detect spam reviews. Since the main corpora of current research come from shopping or hotel reviews, we considered the differences these features may show across corpora when generating the features above.

For the three kinds of spam reviews we defined: non-reviews are easy to detect manually; duplicate reviews are found with the doc2vec model, which represents the similarity of reviews as a distance between vectors; untruthful reviews are difficult to detect manually, so we applied the raw-data features to find explicit untruthful reviews and all the features above to find implicit ones.

After that, we applied semi-supervised techniques, which can work on a dataset that has only positive and unlabelled samples. Many researchers have verified the validity of Positive Unlabelled learning for text classification problems [39,40,41,42]. In this paper, we also performed PU-bagging on the gold-standard dataset to evaluate the features and the classifier, and by comparing with previous research we verified the validity of the final method. The accuracy of the final dataset is much better than manual identification, and the final dataset can be used as a truthful/untruthful labelled dataset for further research.

For further research, there are several improvements to consider:

1. Consider the potential meaning of numbers and digits. In this research, we did not consider the function of numbers in a review: pre-processing removes all numbers and symbols, and reviews containing only symbols or numbers are treated as non-reviews. This method does catch many abnormal reviews, such as ".", "***************" and "ksadfklasjdfkljsdfasdfasdfasdfasdf". However, some of these reviews carry their own meaning, such as "10/10": a human reader understands it as "the game is perfect", but this research treats it as a non-review. Future work should therefore process the raw data more accurately and reasonably, and treat special emojis and numbers as a special kind of review.

2. Consider the size of the initial corpus. Natural language processing benefits from as many reviews as possible: more reviews yield more features, and a large initial corpus prevents any single feature value from playing a decisive role simply because the amount of data is too small. In this research, we selected 33450 reviews from 8 different games. When collecting the data we balanced the emotional polarity of the reviews, but we did not control other factors such as review length, which makes short reviews more likely to become spam reviews. Future work should therefore enlarge the initial dataset.

3. Consider more semi-supervised techniques or evaluations. This research uses all the features to detect spam reviews in the unlabelled dataset, and we applied Positive Unlabelled Learning because no labelled game reviews dataset exists to serve as a standard; the performance of PU learning therefore determines the quality of the final predicted dataset. We only used PU-bagging for training and testing. Although the final F-measure is acceptable, more classifiers and techniques should be compared to select the best performer. Future work should apply the two-step method of positive unlabelled learning: first, find reliable positive and negative samples with spy techniques; then train the classifier on these two initial datasets together with the unlabelled dataset; at the end of each iteration, add the newly labelled positive or negative samples to the initial dataset and retrain until the labels of all samples stop changing. This method makes maximal use of the unlabelled dataset, and combining PU-bagging with the two-step method to obtain an "average" probability for each review should improve the final prediction accuracy.

4. Consider irrelevant reviews, a kind of review we did not handle in this research. Irrelevant reviews do not cover the main topic of the game; advertisements, for example, can be regarded as irrelevant reviews. The percentage of such reviews is very low and they are easy to identify manually. We attempted to apply the doc2vec model to find them: each review becomes a vector in space, distance measures the similarity between reviews, and the same mechanism that finds the most similar reviews can also find outliers, which should be the irrelevant reviews. However, the final result shows that most of the outliers are simply long reviews, which cannot be regarded as spam according to the other features, so the result is not useful. Future work should consider a dedicated method for detecting irrelevant reviews, for example collecting irrelevant reviews such as ads and questions manually and training classifiers to detect them in a large dataset.

5. Consider spammers or spammer groups. The initial Steam reviews dataset is not large enough to analyse spam reviews from the perspective of spammers or spammer groups. Future work should extract more reviews, each carrying more information: how many games the player owns, how many reviews the player has written, and how many of those reviews are positive or negative. Detecting spammers can be more effective than detecting spam reviews [lily], since spammers differ markedly from real users in user attributes (such as the number of games owned and the average play duration) and in behaviour.

Overall, we conclude that the current algorithms are workable, and we believe many improvements can be gained by applying more methods to optimize the semi-supervised techniques in the future.


Appendices

Appendix 1: Project Description

This project will involve developing and applying text analysis and/or topic mining

algorithms to video game reviews. The student will research existing approaches to text

analysis of video game reviews, and other applications, and develop a program to analyse

web-based game reviews. The expected direction of this project will focus on

analysis/mining video game reviews, NLP techniques and a prototype.

Appendix 2: Independent Study Contract

The contract for this project is presented at the end of this report.

Appendix 3: Description of Artefacts Produced and README file.

The project is composed of ten files; each file contains the code indicated by its name. All files are Jupyter notebooks, and all packages should be run under Anaconda. All resource data and code are placed in one folder. Note that the features should be executed in the following order.

Load_reviews:

The function for extracting game reviews from 8 different games on Steam. It returns two files, Alldata.txt and my_data.csv, which store the 33450 reviews in different forms.

Pre-processing:

The functions for the pre-processing techniques, built on the Python packages re and nltk. They fall into two types: one processes the review text (pre-processing_with_stopwords, Stem_lem and stop words) and one processes the raw data of the reviews (Preprocessing_rating and Preprocesiing_helpful). The first type returns a list of words; the second returns a dataframe.

Stylistic feature:

The function for extracting stylistic features. It returns a dataframe containing the stylistic features.

Non-reviews:

The function for extracting part-of-speech features. Tagging the parts of speech takes a long time. It returns a dataframe containing these features.

Doc2vec:

The functions for training the doc2vec models and finding duplicate reviews. trian_doc trains new models for new reviews, which takes a long time. vectors_gold loads the trained _gold models and extracts the doc2vec features (vectors) into a dataframe; vectors loads the trained _stop models and does the same. Isolation_Forest_stop tries to find outliers (irrelevant reviews), but its performance is poor. duplicate_reviews_withstop returns the indices of duplicate reviews. To run the duplicate function in this file, the sentiment feature must be run first.

Sentiment feature:

The functions for training a classifier to predict labels, calculating polarity with TextBlob, adjusting the final polarity, and finding outliers with local outlier factors. train_classifier trains the classifier and runs all the other functions in this file. It returns two dataframes: one containing the sentiment features (polarity, subjectivity and local outlier factor) and one containing the set of outliers. The outlier threshold is decided by inspecting the box plot; Method is the threshold.

Raw_data_feature:

After obtaining all the features above, we can run raw_data_features. It returns a dataframe containing the raw-data features; this is the final dataframe used to predict the labels.


PU_learning:

The function applying PU learning to predict the unlabelled dataset. pu_leaning works on the dataset extracted by the load_reviews function; pu_bagging is the main implementation of PU learning. It returns a list of probabilities.

Mian:

The main file, which uses all the files above to execute the whole program. Be aware that load_reviews, non-reviews and doc2vec take a long time to run.

If you want to use my research data:

All the reviews are stored in my_data. One dataframe containing the part_of_speech feature is stored as new_data_3.csv. Four doc2vec models have been trained for extracting the vectors: Doc2vec_dm_stop, Doc2vec_dbow_stop, Doc2vec_dm_stop_gold and Doc2vec_dbow_stop_gold. By default the code uses the trained models Doc2vec_dm_stop and Doc2vec_dbow_stop; calling doc2vec.vectors_gold() uses Doc2vec_dm_stop_gold and Doc2vec_dbow_stop_gold instead.

If you want to extract a new dataset:

Run the first function, load_reviews, and make the following changes.

Part_of_speech features: change all occurrences of data_part_1 to data_part. Also change doc2vec.vectors() to doc2vec.train_doc() to train new models; be careful to edit the doc2vec file to change the model names and storage paths. In sentiment_feature, go to the sentiment feature file and change the threshold if the box plot looks messy; the following one should be good.

Test_on_Gold_standard:

Represents the reviews with all the features above (except the raw_data feature) and returns an F-measure. Changing the value 100 across the range 50 to 300 gives different F-measures for different initial positive-sample sizes. The MergeTxt function combines multiple text files into one in the same folder; it does not need to be run just to test the gold-standard dataset, since the data is already stored in truthful_reviews.txt and deceptive_reviews.txt.


Appendix 4: function words

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


References

[1] https://www.yelp.com/about

[2] Jindal N, Liu B. Opinion spam and analysis // Proceedings of International Conference on Web

Search and Data Mining. Stanford, USA, 2008:219-230.

[3] Luyang Li, Bing Qin and Ting Liu. 2017. Survey on Fake Review Detection Research. Research Center for Social Computing and Information Retrieval, School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001.

[4] Kelsey group. Online consumer-generated reviews have significant impact on offline purchase

behaviour. http://www.comscore.com/press/release/asp?press=19282007.

[5] Ott M, Cardie C, Hancock J. Estimating the Prevalence of Deception in Online Reviews

Communities. Eprint Arxiv, 2012:201-210.

[6] Fei G, Mukherjee A, Liu B, Hsu M, Castellanos M, Ghosh R. Exploiting Burstiness in reviews for review spammer detection // Proceedings of the International AAAI Conference on Weblogs and Social Media. Boston, USA, 2013: 175-184.

[7] Ott M, Choi Y, Cardie C, et al. Finding deceptive opinion spam by any stretch of the

imagination // Proceedings of Meeting of the Association for Computational Linguistics: Human

Language Technologies. Association for Computational Linguistics. Portland, USA, 2011:309-319.

[8] Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, pages 576–587. VLDB Endowment.

[9] Harris Drucker, Donghui Wu, and Vladimir Vapnik. 1999. Support vector machines for spam

categorization. Neural Networks, IEEE Transactions on, 10(5):1048–1054.

[10] K. Dave, S. Lawrence and D. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. WWW 2003.

[11] B. Pang, L. Lee and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP 2002.

[12] A-M. Popescu and O. Etzioni. Extracting Product Features and Opinions from Reviews. EMNLP 2005.

[13] Guangyu Wu, Derek Greene, Barry Smyth, and Pádraig Cunningham. 2010. Distortion as a validation criterion in the identification of suspicious reviews. In Proceedings of the First Workshop on Social Media Analytics, pages 10–13. ACM.


[14] Arjun Mukherjee, Vivek Venkataraman, Bing Liu, and Natalie Glance. 2013b. What yelp fake review

filter might be doing. In Seventh International AAAI Conference on Weblogs and Social Media.

[15] Kyung-Hyan Yoo and Ulrike Gretzel. 2009. Comparison of deceptive and truthful travel reviews. In

Information and communication technologies in tourism 2009, pages 37–47. Springer.

[16] Q. Wang, B. Liang, W. Shi, Z. Liang, and W. Sun. Detecting spam comments with malicious users' behavioral characteristics. In Information Theory and Information Security (ICITIS), 2010 IEEE International Conference on, pages 563-567, Dec. 2010.

[17] S. Xie, G. Wang, S. Lin, and P. S. Yu. Review spam detection via temporal pattern discovery. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 823-831, New York, NY, USA, 2012. ACM.

[18] Li J, Ott M, Cardie C, et al. Towards a General Rule for Identifying Deceptive Opinion Spam //

Proceedings of Meeting of the Association for Computational Linguistics. Baltimore, USA, 2014: 1566-

1576.

[19] Abbasi A, Zhang Z, Zimbra D, et al. Detecting fake websites: the contribution of statistical learning

theory. Mis Quarterly, 2010, 34(3): 435-461.

[20] Shojaee S, Murad M A A, Bin Azman A, et al. Detecting Deceptive Reviews Using Lexical and Syntactic Features // Proceedings of the International Conference on Intelligent Systems Design and Applications. Selangor, Malaysia, 2013: 53-58.

[21] Li J, Ott M, Cardie C, et al. Towards a General Rule for Identifying Deceptive Opinion Spam //

Proceedings of Meeting of the Association for Computational Linguistics. Baltimore, USA, 2014: 1566-

1576.

[22] Li F, Huang M, Yang Y, et al. Learning to identify review spam // Proceedings of the International Joint

Conference on Artificial Intelligence. Barcelona, Spain, 2011:2488-2493.

[23] El-Halees A M, Hammad A A. An Approach for Detecting Spam in Arabic Opinion Reviews. International Arab Journal of Information Technology, 2015, 12(1): 9-16.

[24] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled

documents using EM. Machine Learning, 39(2/3):103–134, 2000.

[25] Hernández Fusilier D, Guzmán-Cabrera R, Montes-y-Gómez M, et al. Using PU-Learning to Detect Deceptive Opinion Spam // Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Atlanta, USA, 2013: 38-45.

[26] Y Ren, D Ji, H Zhang. Positive unlabelled learning for deceptive reviews detection // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, 2014: 488-498.

[27] Li H, Chen Z, Liu B, et al. Spotting Fake Reviews via Collective Positive-Unlabelled Learning //

Proceedings of the IEEE International Conference on Data Mining series. Dallas, USA, 2014:467-475.

[28] B. Liu, W. Lee, P. Yu and X. Li. Partially supervised classification of text documents. In Proceedings of the International Conference on Machine Learning (ICML 2002), 2002.

[29] M. Ott, C. Cardie, and J. T. Hancock. Negative deceptive opinion spam. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Short Papers, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics.

[30] https://myleott.com › op-spam

[31] https://zenodo.org/record/1000885#.XbFasOgzaUk

[32] Structure G & Longman Grammar of Spoken and Written English. Modern English Teacher, 2001,

10(4): 75-77.

[33] Depaulo B M, Ansfield M E, Bell K L. Interpersonal Deception Theory. Communication Theory, 1996,

6(3): 297-310.

[34] Rayson P, Wilson A, Leech G. Grammatical word class variation within the British National Corpus Sampler. Language & Computers, Leiden, Netherlands: Editions Rodopi B.V., 2001: 295-306.

[35] J. Burgoon, J. P. Blair, T. Qin, and J. F. Nunamaker Jr. Detecting deception through linguistic analysis.

In Hsinchun Chen, Richard Miranda, DanielD. Zeng, Chris Demchak, Jenny Schroeder, and Therani

Madhusudan, editors, Intelligence and Security Informatics, volume 2665 of Lecture Notes in Computer

Science, pages 91-101. Springer Berlin Heidelberg, 2003.

[36] https://cs.stanford.edu/~quocle/paragraph_vector.pdf

[37] Q. Le and T. Mikolov. Distributed Representations of Sentences and Documents // Proceedings of the 31st International Conference on Machine Learning. Beijing, China, 2014: 1188-1196.

[38] T. Mikolov, K. Chen, G. Corrado and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.

[39] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabelled data // Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA, 2008: 213-220.

[40] Xiaoli. Li, Philip S. Yu, Bing Liu and See-Kiong Ng. Positive unlabelled learning for data stream

classification // Proceedings of the 9th SIAM International Conference on Data Mining. Sparks, USA, 2009:

257-268.

[41] Yanshan Xiao, Bing Liu, Jie Yin, Longbing Cao, Chengqi Zhang and Zhifeng Hao. Similarity-based approach for positive and unlabelled learning // Proceedings of the 22nd International Joint Conference on Artificial Intelligence. Barcelona, Spain, 2011: 1577-1582.

[42] Dell Zhang. A simple probabilistic approach to learning from positive and unlabelled examples //

Proceedings of the 5th Annual UK Workshop on Computational Intelligence. London, UK, 2005: 83-87.
