Automatic Detection of Click Fraud in Online Advertisements

by

Abhishek Agarwal, M.S.

A Thesis

In

COMPUTER SCIENCE

Submitted to the Graduate Faculty

of Texas Tech University in

Partial Fulfillment of

the Requirements for

the Degree of

MASTER OF SCIENCE

Approved

Dr. Rattikorn Hewett

Chair of Committee

Dr. Sunho Lim

Dr. Eunseog Youn

Peggy Gordon Miller

Dean of the Graduate School

August, 2012

Texas Tech University, Abhishek Agarwal, August 2012

ACKNOWLEDGMENTS

I would like to thank Dr. Rattikorn Hewett for her guidance throughout my Master's

research. Her in-depth knowledge of the subject and her focus on clarity and quality of work have

helped me learn skills that will serve me for the rest of my career. Her guidance on the

research is invaluable and has helped me cope with the challenges I faced throughout the

course of this work.


TABLE OF CONTENTS

Acknowledgments
Abstract
List of Tables
List of Figures
Motivation
    Contributions
Background Work
Preliminaries
    Terms
    Problem Statement
    Assumptions
    Mathematical Theory of Evidence
    Mass Functions
    Combination Rule
Proposed Dempster Shafer Theory for Click Fraud Detection
    The Core Element of Dempster Shafer Theory
    Mass functions for Click Fraud Detection
    Evidence 1: Number of clicks on the ad
    Evidence 2: Time spent in browsing
    Evidence 3: Ad-Visit after non-ad visit
    Evidence 4: Time of Click
    Evidence 5: Place of origin of click
    Evidence 6: Creating of membership
    Evidence 7: Adding a product in shopping cart
Data Set & Illustration
    Data Description
    Example of belief computation using mass function and combination
Evaluation
    Case Study 1
    Case Study 2
Discussion & Conclusions
Bibliography


ABSTRACT

Increasing advancement, access and availability of the Internet Technology have intensified

the growth of Internet users over the last decade. This has made online advertising a popular

venue for many companies to market their products and services. Today, online advertisement

is one of the most important sources of revenues that impact the economy of many large

enterprises. In online advertisement, an advertiser pays a broker (e.g., Google, Yahoo), who

normally has a search engine, to post its online advertisement, which can be on any

appropriate publisher site. The publisher earns revenues from the broker for each click on the

advertisement posted on its site, while the advertiser will be charged. Thus, when an

excessive number of clicks occur, this can quickly dry up the fund of a rival company and

drive it out of the competing advertisement. At the same time, each click adds revenue to the

publisher. This motivates click frauds, which refer to malicious acts to create fraudulent clicks

with the intent to increase revenue or drive away competitors without real interest in the

products or services being advertised. Identifying click frauds is a difficult problem because

of the dynamic nature of click behaviors, some of which are generated by humans and

some by automated software called bots. Previous work has attempted to

identify click frauds using various techniques, but these techniques tend to be limited by the types of

data they use, the way the data are processed, or assumptions that do not always hold.

This thesis presents an approach to automatically detecting click frauds in online advertising.

The approach uses a mathematical theory of evidence to estimate the likelihood that a click

is fraudulent or genuine, using web log data of a user's activities on the advertiser's

website. One advantage of the proposed approach is the fact that the likelihood can be

computed for each incoming click and thus it gives an online computation of the belief that

fits well with the dynamic behaviors of users. The thesis describes the approach and evaluates

its validity using two real-world case studies. We believe the approach is general in that it

can be applied to any scenario.


LIST OF TABLES

4.1 Fraud certification rules
5.1 Sample log data
5.2 Input from server log
5.3 Coefficient values
5.4 Mass function beliefs for illustrated example
6.1 Computed belief values for Case Study 1
6.2 Computed belief values for first IP
6.3 Computed belief values for second IP
6.4 Computed belief values for third IP


LIST OF FIGURES

1.1 % change of revenue for advertising media (GeekWire, 2012)
1.2 Google's revenue source distribution in 2011 (Google Earnings Report, 2011)
1.3 Scenario before click fraud occurred
1.4 Scenario after click fraud occurred
4.1 Click fraud detection framework using D-S theory
5.1 Legends for timeline diagram
5.2 Timeline diagram sample data in Table 5.1
5.3 Timeline diagram for Table 5.2
5.4 Combined belief of fraud for input in Figure 5.3
6.1 Timeline input for Case Study 1
6.2 Belief of fraud from mass function 1
6.3 Belief of ~fraud from mass function 2






6.9 Combined belief of fraud for Case Study 1
6.10 Timeline diagram for Case Study 2
6.11 Combined belief values for Case Study 2


CHAPTER I

MOTIVATION

The Internet has seen tremendous growth in the last decade and according to current

statistics from the World Bank, nearly 32% of the world population currently uses the

Internet. This has made online advertising not only lucrative but also an important medium for

businesses to reach out to a large consumer base (Jansen, 2007). Figure 1.1 below shows that

while most other media of advertisement are losing market share, online advertisements are

growing tremendously.

Figure 1.1 % change of revenue for advertising media (GeekWire, 2012)

Not only do online ads benefit advertisers, they are also a rich source of revenue for

publishers who display ads on their websites and brokers like Google, Yahoo, MSN, Ask.com

etc. who provide the technical platform for online advertisements. Thus, online ads drive the

Internet economy and are the lifeblood necessary for its survival and growth. Figure 1.2

below shows that in 2011, 97% of Google's revenue came from online ads alone.


Figure 1.2 Google's revenue source distribution in 2011 (Google Earnings Report, 2011)

Online advertising is however not free of issues and click fraud is a major problem

which can impact its growth. Click fraud is a type of crime in online advertisement in which a

user clicks on an ad not out of genuine interest in what the advertiser has to offer but with

the intent of either generating illegal revenue from clicks (for the publisher that hosts the

advertisement) or intentionally causing monetary loss to the advertiser. It hurts advertisers and

may deter them from investing in online ads.

Many advertising mechanisms exist, including the pay-per-click (PPC) scheme, which

accounts for about 57 percent of all Internet ads, with roughly US$16 billion in

revenue in 2010 (Tuzhilin, 2006; IAB and PwC, 2010). A popular example of a PPC scheme is

Google Adsense. In PPC, brokers like Google place targeted ads in dedicated ad spaces on

publisher websites. Brokers get paid by advertisers for every click on the ad and they share

the income generated this way with the publishers. While PPC is a great model for online

advertisement, it suffers the most from the problem of click fraud (Tuzhilin, 2006). Most of

the publishers in PPC programs are small-time blog owners, who are the source of the majority of

the click fraud. Competitors of an advertiser can also commit click fraud in order to reduce

competition and it may indirectly benefit their business. To commit click fraud, publishers or


competitors can click on the ad themselves, ask friends to do it, use an Internet bot script

which repeatedly clicks on the ads or hire people to do it for them (Kshetri, 2010). Such clicks

are of no value to the advertisers as the clicker has no intent to buy their product or service,

use information or carry out any transaction useful to the advertiser's business (Jansen, 2007).

Brokers, too, have an incentive not to filter out all click fraud, as doing so would

reduce their revenues. They can contribute to click fraud by passively letting the fraud happen

and not taking adequate measures to stop it. Lesser-known brokers have a greater

incentive to do so (Kshetri, 2010). Multiple lawsuits filed by various advertisers against

Google and Yahoo for not taking adequate steps to curb click fraud indicate

brokers' inability or unwillingness in this regard. Figure 1.3 below shows a scenario before

click fraud, when the advertiser's money reserve (advertising budget) is full. The publisher,

broker or competitors have not generated any illegal revenue from click fraud.

Figure 1.3 Scenario before click fraud occurred

Figure 1.4 below shows the scenario after click fraud, which has completely depleted the

advertiser's budget and increased the illegal profit of the broker, publisher and competitors.


Figure 1.4 Scenario after click fraud occurred

Reputable brokers like Google actively try to contain click fraud by filtering out

fraudulent clicks and permanently blocking publishers who are found to be involved (Tuzhilin,

2006; Kshetri, 2010). They have access to a user's search activities and the data they collect

from the publisher, which they use to find patterns in the user's behavior. The idea is to estimate the user's

intention behind a click in order to rate the click as genuine or fraudulent. However, they may

not have access to data about the user's actions on the advertiser's website, where the user is

taken following the click. This is because the advertiser may choose to share limited or no

data at all with the broker due to their own privacy concerns (Tuzhilin, 2006).

Brokers provide aggregate statistics to advertisers and do not share details on which

clicks they found fraudulent in order to avoid making their detection mechanisms open to

fraudsters. Thus advertisers are not adequately informed and there is a strong case for the

advertisers to have their own click fraud detection system in place. This way the advertisers

can protect themselves not only from fraudulent publishers and competitors but also from

brokers who either fail to detect fraud or let it occur willingly. Such a system can help them

estimate the extent of the fraud in their ad campaign and pay the brokers for genuine clicks

only. It is important to note here that brokers have access to much larger sources of

information than advertisers. The advertisers must be able to do the click fraud detection with

the limited data they have about users' actions on their website.


Click fraud identification is a difficult problem to solve. Fraud mechanisms evolve and

continually change over time. The fraud can be carried out both by humans and software bots

with distinctive characteristic behaviors. It is difficult to track users by their IP addresses, as

IPs are generally dynamic: the IP address of a given user may change at any time. A

software bot can also use several IP addresses at once to carry out click attacks. Finally, the

advertiser has access only to data from their own server, which gives very limited information about a

user's behavior.

Contributions

This thesis presents an approach to automatically detecting click fraud at the ad-site.

Advertisers can use the proposed approach to detect click fraud on their own ads. Our approach

employs the mathematical theory of evidence called Dempster-Shafer (DS) Theory (Shafer,

1976; Denoeux, 1995; Dong et al., 2010; Sentz et al., 2002) for evidence-based reasoning to

estimate the likelihood of a click being fraudulent based on the evidence gathered from the

weblog data available to the advertiser. The proposed approach can also be useful for brokers

for computing correct charges to their clients if the data are available to them. Our approach is

based on a widely used theory that allows the likelihood estimate to be updated as

each incoming click arrives. That is, it offers an online computation. Thus, after each

click from a given IP we can estimate our belief as to whether the click is fraudulent.

In summary, the contributions of this thesis include: (1) an approach for automatically

detecting or identifying click frauds, (2) a framework for reasoning about click frauds that

integrates relevant information extracted from weblog data with evidence-based reasoning

to update click fraud analysis in real time, and (3) core elements of the proposed approach,

which consist of a set of evidences required for detecting click frauds. These evidences are

formulated in terms of functions called mass functions used in the DS theory.

The rest of this thesis is organized as follows: Chapter II presents background work

on click fraud identification. Chapter III gives preliminaries including terms and relevant

concepts, the problem formulation and its assumptions, and the Dempster-Shafer Theory along

with its fundamental elements. Chapter IV presents our approach to the problem and the

details of the core contribution on formulating mass functions for click fraud identification


problem. Chapter V explains the data set used for the approach and gives an illustrative

example. Chapter VI evaluates the proposed approach with experiments on synthetic data

generated on two case studies. Chapter VII gives concluding remarks and possible extension

for future work.


CHAPTER II

BACKGROUND WORK

Many different types of solutions have been proposed to counter click fraud.

Tuzhilin (2006) suggested a model in which advertisers pay for a click only if it leads to a

conversion event such as a purchase. Such a model is economically unviable for

publishers and so is not available to advertisers. Another method proposed by Tuzhilin (2006) is

the use of data mining models based on past data to classify clicks as fraud or ~fraud (not

fraud). Such a solution may suffer from high inaccuracy as fraud mechanisms evolve and

change over time. There is an assumption that past clicking behavior is indicative of future

behavior. A large number of past clicks which can be truly classified as valid or invalid are

also required. This is a batch process and not online. Moreover such datasets are at the

disposal of brokers only and other involved parties like advertisers cannot use them. The

author clearly states these limitations.

Haddadi (2010) discusses the use of bluff ads for detecting sources of click fraud like

trained bots or poorly trained human workforce employed to carry out fraud. The display text

of these ads is unrelated to the context of the user to whom they are displayed. For example a

user in Australia should not ideally be shown an ad of a special offer on pizza in New York

City. A click by the user is unnatural in this case and will indicate that the user is a bot or

human involved in fraud. However, careful humans and sophisticated bots can still beat it.

Also, this is a 'broker-centric' model: it can be implemented only by brokers, and advertisers

need to trust them completely.

Recently, Antoniou et al. (2011) proposed a burst detection algorithm to detect high

frequency of user activity in short time periods to detect various types of click frauds

including voting click fraud, frauds related to blog post popularity, search engine retaliation

and advertising click fraud. While this is a good general solution for all types of click frauds

mentioned, it does not cater to the nuances of advertisement click fraud, as a simple detection

of bursts may not be enough to differentiate between valid and invalid clicks. More


factors/evidences need to be taken into consideration before we can conclusively label a

click as fraudulent. Walgampaya et al. (2011) proposed a method to detect bot scripts

involved in click fraud using Bayesian classifiers.

The methods above are either not sufficient on their own to combat the problem of click fraud

or require broker involvement of some kind. The involvement may take the form

of policy changes by brokers or sharing of the data at their disposal, and brokers have been unwilling to do

either. As a result, these methods cannot be used by advertisers to actively detect fraud at their site.

Kantardzic et al. (2010) proposed a real-time click fraud detection and prevention

system. It uses D-S Theory for multilevel data fusion of evidences from different sources like

IP address, referrer, country etc. However, they rely on data from both the client (advertiser)

and server (broker). An advertiser does not have access to the broker's data, and hence this system

can be used by brokers only. Our approach equips advertisers with a fraud detection

system using only the data at their disposal. The evidences that they extract from server data

to formulate mass functions are very basic whereas some of our rules are sophisticated and

novel to the best of our knowledge. We do not maintain any historical databases and exploit

the fact from (Antoniou et al., 2011) that fraud will happen in bursts. Our approach is simple

yet our set of rules is powerful and comprehensive making it difficult for fraudsters to carry

out any viable attacks on the advertiser. For example, rules 1, 2, 4 and 5 make it difficult for a

bot to generate clicks without detection.


CHAPTER III

PRELIMINARIES

This section outlines the foundation for the proposed method of click fraud detection

and the assumptions we have made.

Terms

We now define terms used in this thesis.

Advertiser is a seller with an e-commerce website who pays for his ads to be displayed on

other sites. These ads may create more traffic and revenue for the advertisers since a user

who clicks on these ads is directed to their site.

Ad-site is the advertiser's website. A user on the Internet can visit the ad-site by several

means, such as using an Internet search, typing the URL of the advertiser in their browser,

bookmarking the advertiser's site and opening the bookmark later, or clicking on the ad on a publisher site.

Ad-visit is a visit of a user to ad-site by clicking an ad. Non-ad visit is a user visit by any

means other than clicking an ad.

Session is a continuous period of time during which a visitor navigates within the advertiser's site.

In other words, it is the duration for which a user maintains an active HTTP connection

with the server. In a session the user can be browsing, reading, watching videos, filling out

forms, registering for membership, adding products to a shopping cart, purchasing products

etc.

Publishers are the websites that host ads for the advertisers and get paid for the clicks

on those ads. Common examples are blogs and news sites.

Broker is an intermediary between advertiser and publisher. They provide the technical

platform for online advertisements. They are mostly Internet search engine companies like

Google, Yahoo, AOL, Ask.com etc. and use their search technology to serve targeted ads

on publisher sites based on website content, geographical location etc.


Pay Per Click (PPC) is an online advertising model in which publishers display ads on

their websites and get paid for each click on those ads. Google runs a PPC program called

Adsense.

Gclid is a unique ID that is attached to the server log for every click that was made

on Google ads. This helps identify unique visitors to the best approximation as Google

uses various parameters to make this unique identification.

Problem Statement

Given weblog data at the site of the advertiser over a period of time, find all

occurrences of click fraud. For every such occurrence, identify its owner by its corresponding

IP address. The advertiser's web server log data has information such as IP address, date &

time, Gclid number (to be described later), a requested page and referrer for every click.

Assumptions

Due to the dynamic nature of the IP addresses associated with each user, to solve the above

problem in real practice, it is necessary to make the following assumptions.

1) IP addressing changes over time and a user may be assigned to different IP addresses

while he/she is surfing the Internet. A user (either human or bot) may try to carry out

fraudulent clicks using as many different IPs as possible in order to avoid detection.

Therefore it is not feasible to use a long duration data of an IP. Instead we use a short

duration of a window W. In this work, W is specified to be 30 minutes during which we

assume that the IP address for a user will not change. This duration is typical and

reasonable, though quite different from that used in other existing work. The probability that a user

with a particular IP clicked on an ad and that the same IP is assigned to another user who

also clicks on the same ad within the proposed window is negligibly low. Our approach is

however not limited by this window size and one can pick a size that suits them well.

2) A fraudster has an incentive in clicking on an ad multiple times but no intention in making

an actual purchase of a product or service. Fraudsters will make money on clicking on the

ads but will have to spend money to make purchases and this is strictly against their end

goal. Thus, if a user makes a purchase at the ad-site, we assume that the user is not


involved in fraud. However in some circumstances (like in order to confuse detection

systems), the fraudster may make a purchase. Such an action will not help the fraudster as

soon as he moves out of the time window W.

3) Fraudulent clicks with large time gaps in between every two clicks do not deliver any

substantial monetary gain to the fraudster. The number of clicks has to be large enough

with shorter gaps between them and therefore, a burst of clicks may indicate Click-Fraud

(Antoniou et al., 2011).

4) Since HTTP is a stateless protocol it is difficult to accurately estimate the session

duration. We sum the time differences between consecutive HTTP requests by the user to

get the total session time, but there is no way to compute the exact time spent by

the user viewing the last page since there is no request after that. We thus had to make an

assumption that 30 seconds was spent on the last page. Our approach is however not

limited by this assumption and any other duration can be assumed for the last page view.

5) We modeled our approach around Google's Adsense as it is the most widely accepted Pay

Per Click program. We use gclid, a unique ID attached by Google to the web server logs of

advertisers for every click that was made on their ads. It follows Google's definition of

unique visits. Google claims that it uses various parameters to assign unique gclids and that

third-party click fraud detection engines which use the gclid are more accurate than others. So we

take data filtered by the broker (Google) and apply our own approach for further filtration.

However our approach can be modeled around any other PPC program and the way to

identify the clicks that were made on advertisements could be by creating unique landing

pages. This way by looking at server logs we can separate out visits made from ads.
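Assumptions 1 and 4 can be illustrated with a minimal Python sketch. The record layout (`LogEntry`), the function names and the in-memory list of log entries are hypothetical and chosen only for illustration; they are not the implementation used in this thesis. Only the 30-minute window W and the 30-second allowance for the last page view come from the assumptions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

WINDOW = timedelta(minutes=30)     # window W from assumption 1
LAST_PAGE = timedelta(seconds=30)  # assumed time spent on the last page (assumption 4)


@dataclass
class LogEntry:
    """Hypothetical, simplified view of one web server log record."""
    ip: str
    timestamp: datetime
    page: str


def entries_in_window(log: List[LogEntry], ip: str, now: datetime) -> List[LogEntry]:
    """Return the requests made by `ip` during the window W ending at `now`, oldest first."""
    start = now - WINDOW
    hits = [e for e in log if e.ip == ip and start <= e.timestamp <= now]
    return sorted(hits, key=lambda e: e.timestamp)


def session_duration(requests: List[LogEntry]) -> timedelta:
    """Sum the gaps between consecutive requests and add LAST_PAGE for the final page."""
    if not requests:
        return timedelta(0)
    ordered = sorted(requests, key=lambda e: e.timestamp)
    gaps = sum(
        (later.timestamp - earlier.timestamp for earlier, later in zip(ordered, ordered[1:])),
        timedelta(0),
    )
    return gaps + LAST_PAGE
```

Both constants are parameters rather than fixed choices, in line with the remark that the approach is not limited to a particular window size or last-page duration.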

Mathematical Theory of Evidence

Efforts in identifying click fraud have mostly concentrated on identifying a certain

characteristic of user behavior and this is quite different from our approach. To provide a

theoretical background of our approach we describe the mathematical theory of evidence also

known as the Dempster-Shafer (D-S) Theory (Shafer, 1976; Denoeux, 1995; Dong et al.,

2010; Sentz et al., 2002). It is related to traditional probability and set theory but is not the


same. The D-S theory allows probability assignment to a set of atomic elements rather than an

atomic element and it can be used to represent not only the likelihood of occurrence of an

event but also the uncertainty associated with it.

Using the D-S Theory evidence, which is coming from multiple sources with varying

level of certainty, can be effectively combined online. Its ease of use combined with a wide

and successful application in many areas makes it an ideal candidate for application in click

fraud detection which requires a complex model with several evidences.

In our problem domain a user can either be a fraud or not a fraud (~fraud). So we

have a finite set of hypotheses (atomic elements) in the problem domain U = {fraud, ~fraud}.

The power set of U is the set {{fraud}, {~fraud}, U, ∅}. Each of the four elements in the

power set is assigned a belief between 0 and 1. {fraud} represents the belief that the user is a

fraud; {~fraud} represents the belief that the user is not a fraud; U represents the belief that the

user may be either fraud or ~fraud, and thus it represents the uncertainty; ∅ is the empty (null)

set and it represents a contradiction, so its belief is always 0. DS-Theory assigns belief to all the

elements of this power set of U rather than only to the mutually exclusive events of U. The sum of all

belief values in the power set of U is 1.

Mass Functions

A degree of belief is represented by a function called the mass function m, which

provides a probability assignment to any A ⊆ U, where m(∅) = 0 and m(fraud) + m(~fraud) +

m(U) = 1.

m(∅) = 0

m(fraud) ∈ [0, 1]

m(~fraud) ∈ [0, 1]

m(U) ∈ [0, 1]

\[ \sum_{X \subseteq U} m(X) = 1 \]


The mass m(A) represents the belief committed exactly to A. For example, U = {fraud, ~fraud}

represents the hypothesis that a user may be either fraud or ~fraud. A situation in which

m({fraud, ~fraud}) = 1 occurs when the evidence provides no certainty at all, and this

cannot be adequately represented with traditional probability theory. A belief mass is

therefore different from a probability. As we see above, the probabilities are assigned to

sets rather than to mutually exclusive singletons (Shafer, 1976; Sentz et al., 2002). When the

probabilities are assigned only to mutually exclusive events, i.e. either fraud or ~fraud, such that

m(U) is always 0, DS-Theory becomes the same as probability theory. For every mass

function, there are associated functions of belief and plausibility. The degree of belief in A,

bel(A), and the plausibility of A, pl(A), are defined respectively as:

\[ \mathrm{bel}(A) = \sum_{X \subseteq A} m(X) \]

\[ \mathrm{pl}(A) = 1 - \mathrm{bel}(\sim A) = \sum_{X \cap A \neq \emptyset} m(X). \]

For example, bel({fraud}) = m({fraud}) + m(∅) = m({fraud}). In general, bel(A) =

m(A) for any singleton set A ⊆ U, and in such a case the computation of bel is greatly simplified.

However, bel(A) is not necessarily the same as m(A) when A is not a singleton set. The functions m,

bel and pl can be derived from one another. Belief and probability are thus different

measures. In this thesis, we use the terms likelihood and belief synonymously.
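The definitions of belief and plausibility can be made concrete with a small Python sketch over the two-element frame U = {fraud, ~fraud}. Representing sets as `frozenset` objects and a mass function as a dictionary is an assumption made here for illustration, and the mass values in the usage note are arbitrary rather than taken from the thesis data.

```python
from typing import Dict, FrozenSet

FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
U = FRAUD | NOT_FRAUD  # the frame of discernment {fraud, ~fraud}

# A mass function maps subsets of U to their mass; the empty set carries no mass.
Mass = Dict[FrozenSet[str], float]


def bel(m: Mass, a: FrozenSet[str]) -> float:
    """bel(A): total mass of the non-empty sets X that are subsets of A."""
    return sum(v for x, v in m.items() if x and x <= a)


def pl(m: Mass, a: FrozenSet[str]) -> float:
    """pl(A): total mass of the sets X that intersect A."""
    return sum(v for x, v in m.items() if x & a)
```

With the illustrative masses m({fraud}) = 0.5, m({~fraud}) = 0.25 and m(U) = 0.25, `bel(m, FRAUD)` equals m({fraud}), while `pl(m, FRAUD)` also counts the uncommitted mass on U and equals 1 - bel({~fraud}).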

For our approach we use multiple evidences each of which contributes to either a

belief (or disbelief) that a user is a fraud depending on the nature of the evidence and its

quantified value (Dong et al., 2010). For example, if a user clicks many times on an ad, it

becomes evidence that the user is a fraud. Each evidence can support a user for either fraud or

~fraud but not both. If an evidence for a user supports fraud, the rest of the belief from the

evidence is committed only to the universal set U, which quantifies the uncertainty. If

evidence i supports that the user is fraud then the mass functions for the evidence are defined

as follows:

\[ m_i(\text{fraud}) = \alpha f \]

\[ m_i(\sim\text{fraud}) = 0 \]

\[ m_i(U) = 1 - \alpha f \]

where α, with 0 < α < 1, is an empirically derived value that signifies the strength of the evidence

in supporting that the user is fraud, and f, with 0 < f < 1, is a function that is used to quantify the evidence.

If evidence i supports that the user is ~fraud then the mass functions for the evidence

are defined as follows:

mi(fraud) = 0

mi (~fraud) = β*g

mi (U) = 1 - β*g

Where 0 < β < 1, is an empirically derived value that signifies the strength of the evidence in

supporting the user is ~fraud. 0 < g < 1, is a function that is used to quantify the evidence.
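The two cases can be captured by one small helper. The sketch below is an assumption-laden illustration: the function name and the dictionary keys are hypothetical, and the quantified value is allowed to lie in the closed interval [0, 1] because the individual evidences defined later use values such as 0 and 1.0.

```python
def simple_support(supports_fraud: bool, strength: float, value: float) -> dict:
    """Build the mass function for one piece of evidence.

    strength plays the role of alpha (or beta) and value the role of f (or g).
    The committed mass goes to exactly one hypothesis; the rest goes to U.
    """
    if not 0.0 < strength < 1.0:
        raise ValueError("strength must lie strictly between 0 and 1")
    if not 0.0 <= value <= 1.0:
        raise ValueError("quantified value must lie between 0 and 1")
    committed = strength * value
    if supports_fraud:
        return {"fraud": committed, "~fraud": 0.0, "U": 1.0 - committed}
    return {"fraud": 0.0, "~fraud": committed, "U": 1.0 - committed}


# Hypothetical numbers, used only to show the shape of the result.
print(simple_support(True, 0.8, 0.5))
print(simple_support(False, 0.25, 1.0))
```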

Combination Rule

Since we have multiple mass functions, we need a way to combine them. Mass functions can be combined using various rules, including the popular Dempster's Rule of Combination, which is a generalization of Bayes' rule. For X, A, B ⊆ U with X ≠ ∅, the combination of mass functions m1 and m2, denoted by m1 ⊕ m2 (or m1,2), is defined as follows:

\[ m_{1,2}(X) = (m_1 \oplus m_2)(X) = \frac{\sum_{A \cap B = X} m_1(A)\, m_2(B)}{1 - K} \]

where

\[ K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B) \]

and \( (m_1 \oplus m_2)(\emptyset) = 0 \).

The combination rule can be applied in pairs repeatedly to obtain a combination of multiple mass functions. The rule strongly emphasizes the agreement between multiple sources of evidence and ignores the disagreement through the normalization factor 1 - K.



CHAPTER IV

PROPOSED DEMPSTER SHAFER THEORY FOR CLICK FRAUD DETECTION

We propose an approach that advertisers can use to detect fraud in real time using data available to them, without any data from the broker, which can be impossible to acquire or, at best, very limited. This chapter describes our approach in detail and the mass functions that have been developed to compute the belief of fraud.

The Core Element of Dempster Shafer Theory

Figure 4.1 below shows the framework elements of click fraud detection using our approach. A user's clicking activity is captured by the advertiser's web server logs. The server logs are updated in real time as users request pages from the server, and the click fraud detection system reads this data as soon as it is logged. For the latest click that the system is processing, it finds the IP address and reads all the log data from that IP in the window W. This data is pre-processed to extract meaningful evidences, which are then formulated into various mass functions. Each mass function computes a belief of fraud which is unique and can conflict with the beliefs from other mass functions. These beliefs are combined using Dempster's combination rule. The combined belief is categorized into fraud, ~fraud or suspicious by using a set of threshold values. This process is repeated for every new user click.

Figure 4.1 Click fraud detection framework using D-S theory
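The step of reading all log data for an IP within the window W can be sketched as a per-IP sliding window. The class and method names below are hypothetical, and the sketch assumes that log records arrive in chronological order; only the 30-minute window is taken from the text.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # window W


class ClickWindow:
    """Keeps, per IP address, the log records seen in the last WINDOW."""

    def __init__(self, window: timedelta = WINDOW):
        self.window = window
        self._by_ip = defaultdict(deque)

    def add(self, ip: str, when: datetime, record) -> list:
        """Store a new record and return all records for ip still in the window."""
        queue = self._by_ip[ip]
        queue.append((when, record))
        cutoff = when - self.window
        # Drop records that are older than the window relative to this click.
        while queue and queue[0][0] < cutoff:
            queue.popleft()
        return [rec for _, rec in queue]
```

Each call to add returns the records from which the evidences for the newest click would be extracted.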

Mass functions for Click Fraud Detection

Using the user behavior from the weblogs at the advertiser's site as evidence to reason about click fraud, we formulate a mass function for each core piece of evidence. The evidence comes from various factors such as the number of clicks on the ad, the time spent browsing the advertiser's site, etc. The mass functions are used to compute a belief value on the click being fraud or not fraud (~fraud). The belief values from the different evidences are combined as each of them occurs in the data. A mass function contributes to either a belief or a disbelief that a user is a fraud, depending on its nature and its quantified value. The following gives detailed formulae of the mass functions based on each evidence. The values αi and βi for evidence i represent the strength of the evidence in the mass function formulation (mi). In practice these values will be empirically derived.

Evidence 1: Number of clicks on the ad

If the number of clicks on the ad from an IP in the time window W (30 minutes) is

high, then likelihood of the user being a fraud is high. Fraudsters have a natural incentive of

making more money by clicking the ads many times in a short period of time (short bursts).

The more they click, the more illegal revenue they generate for themselves. The Basic Mass

Assignment (BMA) for this evidence will always support a belief of fraud whose value

depends on the number of clicks.

Let n be the number of clicks in the window W.

Likelihood of the fraud = 1 – 1/n

m1( fraud) = α1 (1-1/n) (1)



m1 (~fraud) = 0 (2)

m1 (U) = 1 - m1 ( fraud ) = 1 – α1 (1-1/n) (3)
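Equations (1) to (3) translate directly into code. In this sketch the function name is hypothetical, and the default α1 = 0.8 is the empirically derived value reported later in Table 5.3.

```python
def m1_click_count(n_clicks: int, alpha1: float = 0.8) -> dict:
    """Mass function for Evidence 1: number of ad clicks in window W."""
    if n_clicks < 1:
        raise ValueError("at least one click is needed to evaluate the evidence")
    fraud = alpha1 * (1.0 - 1.0 / n_clicks)   # equation (1)
    return {"fraud": fraud, "~fraud": 0.0, "U": 1.0 - fraud}  # equations (2), (3)


for n in (1, 2, 6, 20):
    print(n, round(m1_click_count(n)["fraud"], 3))
```

The belief is 0 at the first click and approaches α1 as n grows, so most of the increase happens over the first few clicks.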

Evidence 2: Time spent in browsing

If the time spent by the user at the ad-site is high then he/she is less likely to be a fraud. A genuine user clicks the ad out of a real interest in the advertiser's content (advertised product, service or website content) and is likely to spend more time exploring the ad-site than a fraudster. Fraudsters are less likely to do so, since they are not interested in the product and since leaving quickly lets them make more clicks in a given time. The BMA for this rule will always support a belief of ~fraud whose value depends on the time spent at the ad-site. As a user continues to spend more time at the ad-site, the belief that he/she is ~fraud will increase.

Let t be the time spent by the user in all visits in the time window W (30 minutes), where 0 < t <= 30 minutes. The likelihood of ~fraud increases as t increases.

m2 (fraud) = 0 (4)

m2 (~ fraud) = β2 *(t/W) (5)

m2 (U) = 1 - m2 (~ fraud ) = 1 – β2* (t/W) (6)
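A minimal sketch of equations (4) to (6), working in seconds. The function name is hypothetical, β2 = 0.99 is the value reported later in Table 5.3, and clamping t to W is an added safeguard rather than part of the formulation.

```python
def m2_time_spent(seconds_on_site: float,
                  window_seconds: float = 1800.0,
                  beta2: float = 0.99) -> dict:
    """Mass function for Evidence 2: total time spent on the ad-site in W."""
    if seconds_on_site < 0:
        raise ValueError("time spent cannot be negative")
    ratio = min(seconds_on_site, window_seconds) / window_seconds  # t / W
    not_fraud = beta2 * ratio                                      # equation (5)
    return {"fraud": 0.0, "~fraud": not_fraud, "U": 1.0 - not_fraud}


# 180 seconds inside a 1800-second window commits 0.99 * 0.1 to ~fraud.
print(m2_time_spent(180))
```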

Evidence 3: Ad-Visit after non-ad visit

If a user clicks on an ad after a non-ad visit, then he is likely to be a fraud. Once a user

makes a non-ad visit to the ad-site, it implies that the user is aware how to reach the site apart

from clicking on the ad. Clicking on an ad after that seems unnecessary and indicates a

likelihood of fraud. The BMA for this rule can support a belief of either fraud or ~fraud

behavior.

Let x be the likelihood of fraud. If the user has visited only via ads then x=0.1 (little

likelihood of fraud). If the user has visited via ads after visiting normally then x=1.0 (high

likelihood of fraud). Thus the mass functions when the evidence supports fraud are as

follows:



m3 (fraud) = α3 *(x) (7)

m3 (~ fraud) = 0 (8)

m3 (U) = 1 - m3 ( fraud ) = 1 - α3*(x) (9)

Let y=1.0 be the likelihood of ~fraud if the user does not have an ad-visit after a non-ad visit.

The mass functions if the evidence supports ~fraud are as follows:

m3 (fraud) = 0 (10)

m3 (~ fraud) = β3 *(y) (11)

m3 (U) = 1 - m3 ( ~fraud ) = 1 – β3 *(y) (12)
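One possible reading of Evidence 3 is sketched below. The three-way case split is an interpretation of the text, not a confirmed implementation: only ad-visits so far gives x = 0.1, an ad-visit following an earlier non-ad visit gives x = 1.0, and otherwise the evidence supports ~fraud with y = 1.0. The function name and the list-of-booleans input are hypothetical; α3 = 0.6 and β3 = 0.2 come from Table 5.3.

```python
from typing import Sequence


def m3_ad_after_non_ad(visits_are_ad: Sequence[bool],
                       alpha3: float = 0.6,
                       beta3: float = 0.2) -> dict:
    """Mass function for Evidence 3, given the user's visits in time order.

    Each entry is True for an ad-visit and False for a non-ad visit.
    """
    if not visits_are_ad:
        raise ValueError("at least one visit is required")
    seen_non_ad = False
    ad_after_non_ad = False
    for is_ad in visits_are_ad:
        if is_ad and seen_non_ad:
            ad_after_non_ad = True
        if not is_ad:
            seen_non_ad = True
    if ad_after_non_ad:
        fraud = alpha3 * 1.0            # equations (7)-(9) with x = 1.0
        return {"fraud": fraud, "~fraud": 0.0, "U": 1.0 - fraud}
    if not seen_non_ad:
        fraud = alpha3 * 0.1            # equations (7)-(9) with x = 0.1
        return {"fraud": fraud, "~fraud": 0.0, "U": 1.0 - fraud}
    not_fraud = beta3 * 1.0             # equations (10)-(12) with y = 1.0
    return {"fraud": 0.0, "~fraud": not_fraud, "U": 1.0 - not_fraud}
```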

Evidence 4: Time of Click

If the click occurred in the most suspicious time (or most active period of fraud

activity) then the user is likely to be a fraud. Fraudsters are generally known to be active

during certain hours of the day and a click at such hours can be indicative of fraudulent

activity. We follow Universal Time to determine this and not any particular time zone. If a

click happens at that certain time slot of suspicion then the click is likely to be a fraud

otherwise ~fraud. The BMA for this rule will support a belief of fraud if the time of click lies

in the suspicious time range. Otherwise it will support a belief of ~fraud.

Let Tstart and Tend be the start and end of the suspicious time range, t be the time of click.

Let x=1.0 be the likelihood of fraud if t lies between Tstart and Tend. The mass functions when

the evidence supports fraud are as follows:

m4 (fraud) = α4*(x) (13)

m4 (~ fraud) = 0 (14)

m4 (U) = 1 - m4 ( fraud ) = 1 – α4*(x) (15)

Let y=1.0 be the likelihood of ~fraud if t does not lie between Tstart and Tend. The mass

functions when the evidence supports ~fraud are as follows:



m4 (fraud) = 0 (16)

m4 (~fraud) = β4*(y) (17)

m4 (U) = 1 - m4 (~ fraud ) = 1 – β4*(y) (18)
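Equations (13) to (18) can be sketched with a time-range check. The thesis does not state Tstart and Tend in this section, so the 01:00 to 05:00 range below is purely a placeholder assumption, as are the function names; α4 = 0.2 and β4 = 0.01 come from Table 5.3. The helper also handles a range that wraps past midnight.

```python
from datetime import time


def in_suspicious_range(t: time, t_start: time, t_end: time) -> bool:
    """True if t lies in [t_start, t_end], allowing a wrap past midnight."""
    if t_start <= t_end:
        return t_start <= t <= t_end
    return t >= t_start or t <= t_end


def m4_time_of_click(t: time,
                     t_start: time = time(1, 0),   # placeholder Tstart (UTC)
                     t_end: time = time(5, 0),     # placeholder Tend (UTC)
                     alpha4: float = 0.2,
                     beta4: float = 0.01) -> dict:
    """Mass function for Evidence 4: time of click."""
    if in_suspicious_range(t, t_start, t_end):
        fraud = alpha4 * 1.0       # x = 1.0
        return {"fraud": fraud, "~fraud": 0.0, "U": 1.0 - fraud}
    not_fraud = beta4 * 1.0        # y = 1.0
    return {"fraud": 0.0, "~fraud": not_fraud, "U": 1.0 - not_fraud}
```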

Evidence 5: Place of origin of click

If the click originated from a location (country, state or city) where the advertiser has

no business then the user is likely to be a fraud. Ads are often targeted for audience of a

particular region where the advertisers have a reach or rights to sell their products. This is

especially true for small and medium sized businesses that are restricted to a country or city.

Even large advertisers mostly advertise to a local clientele such as a car company which sells

in many countries but has different ads based on the different models it sells in each country.

If a click originates from a location outside of the advertiser's region of business then it is likely to be fraud, as the user will get no value from such a click. It is also notable that in some countries the laws against cyber fraud are very weak, and fraudsters use this fact to their advantage. To avoid prosecution, fraudsters use IP addresses originating from these countries, through bots or by hiring people at low cost (many of whom do not realize that their act is causing huge losses to advertisers) to carry out the fraud (Kshetri, 2010). As a result such clicks have high suspicion associated with them. This rule can limit a range of fraudulent attacks that depend on using IP addresses from varied geographical locations (these include the use of both humans and bots). The BMA for this rule supports a belief of fraud if the click originated from a region outside of the advertiser's business and a belief of ~fraud otherwise.

Let x=1.0 be the likelihood of fraud if the click originated from a region outside of the advertiser's business. The mass functions when the evidence supports fraud are as follows:

m5 (fraud) = α5 *(x) (19)

m5 (~ fraud) = 0 (20)

m5 (U) = 1 - m5 ( fraud ) = 1 - α5*(x) (21)


Let y=1.0 be the likelihood of ~fraud if the click originated from within the advertiser's region of business. The mass functions when the evidence supports ~fraud are as follows:

m5 (fraud) = 0 (22)

m5 (~fraud) = β5*(y) (23)

m5 (U) = 1 - m5 (~ fraud ) = 1 - β5*(y) (24)

Evidence 6: Creating of membership

If the user creates a membership account (register as member) with the advertiser, then

he/she is less likely to be a fraud. However he/she may or may not create such an account.

Fraudsters however are less likely to register themselves at the ad-site or create membership

account as they have no incentive in doing so and because it also requires them to spend some

time and give out some information like email, address etc. The BMA for this rule supports a

belief of ~fraud if a membership account was created, otherwise supports negligible belief of

fraud.

Let x=1 be the likelihood of fraud if a membership account is not created. The mass functions when the evidence supports fraud are as follows:

m6 (fraud) = α6* (x) (25)

m6 (~fraud) = 0 (26)

m6 (U) = 1 - m6 ( fraud ) = 1 - α6 *(x) (27)

Let y=1 be the likelihood of ~fraud if a membership account is created. The mass functions when the evidence supports ~fraud are as follows:

m6 (fraud) = 0 (28)

m6 (~ fraud) = β6 *(y) (29)

m6 (U) = 1 - m6 ( ~fraud ) = 1 - β6 *(y) (30)



Evidence 7: Adding a product in shopping cart

If the user adds a product to his/her shopping cart, then he/she is less likely to be a fraud. Due to a lack of genuine interest in the advertiser's products or services, a fraudster is less likely to use a shopping cart. Using a shopping cart requires the user to spend time, for which a fraudster has no incentive. The BMA for this rule supports a belief of ~fraud if a product was added to a cart, and otherwise supports a negligible belief of fraud.

Let x=1.0 be the likelihood of fraud if the user does not add any product to his shopping cart. The

mass functions when the evidence supports fraud are as follows:

m7 (fraud) = α7* (x) (31)

m7 (~fraud) = 0 (32)

m7 (U) = 1 - m7 ( fraud ) = 1 – α7 *(x) (33)

Let y=1.0 be the likelihood of ~fraud if the user adds a product to his shopping cart. The mass

functions when the evidence supports ~fraud are as follows:

m7 (fraud) = 0 (34)

m7 (~ fraud) = β7*(y) (35)

m7 (U) = 1 - m7 ( ~fraud ) = 1 - β7*(y) (36)
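Evidences 5, 6 and 7 share the same yes/no structure with x = y = 1.0, so they can be driven from a small coefficient table. This is a sketch under assumptions: the function names and boolean inputs are hypothetical, and the (α, β) pairs are the values reported later in Table 5.3.

```python
# (alpha, beta) per evidence number, as listed in Table 5.3.
COEFFS = {5: (0.4, 0.1), 6: (0.02, 0.25), 7: (0.01, 0.2)}


def binary_mass(evidence_no: int, supports_fraud: bool) -> dict:
    """Mass function for a yes/no evidence whose quantified value is 1.0."""
    alpha, beta = COEFFS[evidence_no]
    if supports_fraud:
        return {"fraud": alpha, "~fraud": 0.0, "U": 1.0 - alpha}
    return {"fraud": 0.0, "~fraud": beta, "U": 1.0 - beta}


def masses_5_to_7(outside_region: bool,
                  created_membership: bool,
                  added_to_cart: bool) -> dict:
    """Evidence 5 supports fraud for clicks from outside the business region;
    Evidences 6 and 7 support fraud when the genuine-user action is absent."""
    return {
        5: binary_mass(5, outside_region),
        6: binary_mass(6, not created_membership),
        7: binary_mass(7, not added_to_cart),
    }
```

The small α values for Evidences 6 and 7 reflect that the absence of a membership or a cart item is only weak support for fraud.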

Individually, the evidences are not sufficient to determine the likelihood of a user being fraud or ~fraud. Each evidence may give a different or contradicting belief of fraud depending on its nature, but upon combination they provide a highly accurate estimate. Thus, the likelihood of a click being fraudulent is estimated by combining the beliefs obtained from the corresponding mass functions for each of the supporting evidences. To define the rule for combining mass functions, let m1 and m2 be two distinct mass functions for a particular click. Dempster's rule of combination can be applied as shown below. For readability, we omit i, and replace {fi}, {~fi} and Ui by f, ~f and U, respectively.

m1,2(f) = (m1(f)m2(f) + m1(f)m2(U) + m1(U)m2(f)) / (1 - K)

m1,2(~f) = (m1(~f)m2(~f) + m1(~f)m2(U) + m1(U)m2(~f)) / (1 - K)

m1,2(U) = (m1(U)m2(U)) / (1 - K),

where K = m1(f)m2(~f) + m1(~f)m2(f).
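The pairwise rule on the frame {f, ~f, U} can be written as a short function. The function name, dictionary keys and sample masses are assumptions made for illustration; the arithmetic follows the three equations and the definition of K.

```python
def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule for two mass functions with keys "f", "~f" and "U"."""
    # K is the mass that falls on the empty set (f intersected with ~f).
    k = m1["f"] * m2["~f"] + m1["~f"] * m2["f"]
    if k >= 1.0:
        raise ValueError("evidences are in total conflict; rule is undefined")
    norm = 1.0 - k
    return {
        "f": (m1["f"] * m2["f"] + m1["f"] * m2["U"] + m1["U"] * m2["f"]) / norm,
        "~f": (m1["~f"] * m2["~f"] + m1["~f"] * m2["U"] + m1["U"] * m2["~f"]) / norm,
        "U": (m1["U"] * m2["U"]) / norm,
    }


# Hypothetical conflicting evidences: one supports f, the other ~f.
a = {"f": 0.5, "~f": 0.0, "U": 0.5}
b = {"f": 0.0, "~f": 0.4, "U": 0.6}
print(combine(a, b))
```

Here K = 0.2, and after normalization the three combined masses again sum to 1.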

This combination rule can be applied repeatedly pair-wise until all pieces of evidence have been incorporated into the computation of the combined belief. Our proposed approach certifies each click based on its likelihood of being fraudulent, using the beliefs combined from all of the evidences. Table 4.1 below describes the thresholds that we have empirically derived from our experiments and tests.

Table 4.1 Fraud certification rules

Category      Lower   Upper
Not Fraud     0       0.499
Suspicious    0.5     0.649
Fraud         0.65    1

A combined belief of fraud < 0.5 indicates ~fraud. A combined belief of fraud >= 0.65

indicates fraud and all values in between indicate a suspicion.
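The certification rule amounts to two threshold comparisons. The function name and label strings below are hypothetical; the cut-offs 0.5 and 0.65 are the ones given in Table 4.1.

```python
def certify(belief_fraud: float,
            suspicious_from: float = 0.5,
            fraud_from: float = 0.65) -> str:
    """Map a combined belief of fraud to a certification label."""
    if not 0.0 <= belief_fraud <= 1.0:
        raise ValueError("belief must lie between 0 and 1")
    if belief_fraud >= fraud_from:
        return "fraud"
    if belief_fraud >= suspicious_from:
        return "suspicious"
    return "not fraud"


for b in (0.43, 0.5, 0.649, 0.65, 0.84):
    print(b, certify(b))
```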



CHAPTER V

DATA SET & ILLUSTRATION

In this chapter we give a detailed explanation of the data that we use in our approach. We also work through an illustrated example using our data set.

Data Description

Click data is not publicly available. Real weblog data from a web server is the property of the owner of the server and is not made public due to the owner's privacy concerns. Moreover, such data needs to be cleaned to extract the relevant fields. This is a time-consuming process and is not a focus of our research. For these reasons we use synthetic data. Furthermore, we can manipulate synthetic data and add patterns of fraud to evaluate different click fraud scenarios.

The data represent weblogs from the advertiser's web server. For our experiments and evaluations we synthesize log data in combined log format (CLF). We pre-process the raw logs and extract the following information for each user in real time: the IP address of the remote computer requesting the web page; the time and date of the request; the page that was requested; and the Gclid number. The region from which the click originated can be extracted from the IP address by using one of the many geolocation services that map an IP address to a place using a geolocation database. Table 5.1 below shows sample data extracted from the server logs.


Table 5.1 Sample log data

IP Address     Click No   Gclid No   Time of click    Requested Page   Referrer
172.16.276.3   1          1001       3/5/2012 1:50    index.htm        adsite.htm
172.16.276.3   3          1002       3/5/2012 1:59    page1.htm        index.htm
172.16.276.3   4          1002       3/5/2012 2:01    page2.htm        page1.htm
172.16.276.3   5          null       3/5/2012 2:05    index.htm        google.com
172.16.276.3   6          null       3/5/2012 2:08    page1.htm        index.htm
172.16.276.3   7          null       3/5/2012 2:10    page2.htm        page1.htm
172.16.276.3   8          null       3/5/2012 2:14    index.htm        null
172.16.276.3   9          null       3/5/2012 2:16    page1.htm        index.htm
172.16.276.3   10         null       3/5/2012 2:17    page2.htm        page1.htm

Each row of Table 5.1 above represents an HTTP request made by the user to the advertiser's web server. Whenever a user requests content from the advertiser, an HTTP request is generated. Below are some observations that describe the data in Table 5.1.

Every row represents a click by the user requesting content from the ad-site.

All the clicks in the table are by the same user, since the IP address is the same for all rows of the log.

Index.htm is the landing page. Every time index.htm is the requested page, it implies a new visit. Table 5.1 has 4 unique visits.

A non-null Gclid number implies an ad-visit. Click numbers 1 through 4 belong to ad-visits since they have a valid Gclid number attached.

Two different Gclid numbers imply two different ad-visits. The first click, with Gclid number 1001, implies an ad-visit. Since there is only 1 row with Gclid number 1001, the user did not make any other page requests after landing on the ad-site during the first ad-visit. The second click, with Gclid number 1002, is also an ad-visit. However, in this visit the user also requested page1.htm and page2.htm (click numbers 3 and 4).

Each row with a null Gclid number implies a non-ad visit. Click numbers 5 through 10 correspond to two non-ad visits.

Click number 5 corresponds to the first non-ad visit and the third visit overall. The visitor was referred to the ad-site by Google search, since google.com is the referrer. After landing, the user requested two more pages in the same visit, page1.htm and page2.htm.

Click number 8 corresponds to the second non-ad visit and the fourth visit overall. A null referrer implies that the user may have typed the ad-site's URL in his/her browser or had previously bookmarked the site and clicked on the bookmark. After landing, the user requested two more pages in the same visit, page1.htm and page2.htm.
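The observations above suggest a simple way to split a user's log rows into visits: a request for the landing page starts a new visit, and a visit is an ad-visit when its landing row carries a Gclid number. The sketch below uses hypothetical names (Row, split_visits, is_ad_visit) and, to stay within the rows whose visits appear in full in Table 5.1, includes only clicks 1 and 5 through 10.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    click_no: int
    gclid: Optional[str]      # None stands for a null Gclid number
    page: str
    referrer: Optional[str]   # None stands for a null referrer


def split_visits(rows: List[Row], landing_page: str = "index.htm") -> List[List[Row]]:
    """Group chronologically ordered rows of one IP into visits."""
    visits: List[List[Row]] = []
    for row in rows:
        if row.page == landing_page or not visits:
            visits.append([])      # landing-page request starts a new visit
        visits[-1].append(row)
    return visits


def is_ad_visit(visit: List[Row]) -> bool:
    """A visit is an ad-visit if its landing row has a Gclid number."""
    return visit[0].gclid is not None


rows = [
    Row(1, "1001", "index.htm", "adsite.htm"),
    Row(5, None, "index.htm", "google.com"),
    Row(6, None, "page1.htm", "index.htm"),
    Row(7, None, "page2.htm", "page1.htm"),
    Row(8, None, "index.htm", None),
    Row(9, None, "page1.htm", "index.htm"),
    Row(10, None, "page2.htm", "page1.htm"),
]
visits = split_visits(rows)
print([(len(v), is_ad_visit(v)) for v in visits])
```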



We will use a timeline diagram to help illustrate our inputs (like Table 5.1) for the rest

of the thesis. Figure 5.1 shows the legends for the diagram and Figure 5.2 shows a timeline

diagram corresponding to the input from Table 5.1.

Figure 5.1 Legends for timeline diagram

Figure 5.2 Timeline diagram sample data in Table 5.1

A timeline diagram is a visual representation of a user's clicking data from the server weblogs. From Figure 5.2 we can make several observations. The user has made 4 unique visits. The first two visits were ad-visits and the last two were non-ad visits. The width of the session blocks indicates session durations. The first visit was a very short session in which the user did not request any pages after landing. In all the other visits the user requested two other pages and the session durations are longer. The start and end times of every session are also given. Lastly, we can see that the user neither logged in as a member in any of the sessions nor used a shopping cart.



Example of belief computation using mass function and combination

In this example we analyze and compute the belief of a user being fraud or ~fraud using our approach. The purpose is to explain the approach and the computations involved through a simple example. Table 5.2 below shows a sample input.

Table 5.2 Input from server log

IP Address Click No Gclid No Time of click Requested Page Referrer







From Table 5.2 above we can easily conclude that the user made six ad-visits. The

user did not request any page of ad-site other than index.htm. Figure 5.3 below shows the

timeline diagram for the data corresponding to Table 5.2.

Figure 5.3 Timeline diagram for Table 5.2

As soon as a row corresponding to a user activity is logged, the system reads it and computes the mass beliefs for each piece of evidence, which are then combined into an overall belief score using Dempster's combination rule. For Table 5.2 above, six belief values will be computed, one for every click. Thus the belief about the user is updated with every user click.

The evidence combination process combines the beliefs from the possibly conflicting evidences and gives a belief score for each of the user's clicks. To demonstrate our approach we work out the calculation of belief values at the 6th click. Note that we use the α and β values from Table 5.3. These values have been derived empirically from our experiments and are used in all our computations.

Table 5.3 Coefficient values

Evidence No   α      β
1             0.8    -
2             -      0.99
3             0.6    0.2
4             0.2    0.01
5             0.4    0.1
6             0.02   0.25
7             0.01   0.2

Evidence 1 always supports a belief of fraud and therefore at the 6th click on the ad the mass function values are:

m1 (fraud) = 0.8 * (1 - 1/6) = 0.667

m1 (~fraud) = 0

m1 (U) = 1 - m1 (fraud) = 1 - 0.8 * (1 - 1/6) = 0.333

Evidence 2 always supports a belief of ~fraud. The user spends 30 seconds in each visit since he/she does not open any other page, and therefore the total time spent is 180 seconds. The window size W is 1800 seconds. Therefore the mass function values are:

m2 (fraud) = 0

m2 (~fraud) = 0.99 * (180/1800) = 0.099

m2 (U) = 1 - m2 (~fraud) = 1 - 0.99 * (180/1800) = 0.901

Evidence 3 supports a small belief of fraud since there was no non-ad visit by the user. Therefore the mass function values are:

m3 (fraud) = 0.6 * (0.1) = 0.06

m3 (~fraud) = 0

m3 (U) = 1 - m3 (fraud) = 1 - 0.6 * (0.1) = 0.94

Evidence 4 supports a belief of fraud since the 6th click occurs at a suspicious time (2:23 AM). Therefore the mass function values are:

m4 (fraud) = 0.2 * (1) = 0.2

m4 (~fraud) = 0

m4 (U) = 1 - m4 (fraud) = 1 - 0.2 * (1) = 0.8

Evidence 5 supports a belief of fraud since we assume that the IP originates from a region outside the area of business of the advertiser. Therefore the mass function values are:

m5 (fraud) = 0.4 * (1) = 0.4

m5 (~fraud) = 0

m5 (U) = 1 - m5 (fraud) = 1 - 0.4 * (1) = 0.6

Evidences 6 and 7 each support a small belief of fraud since no membership account was created and no product was added to a shopping cart. Therefore the mass function values are:

m6 (fraud) = 0.02 * (1) = 0.02

m6 (~fraud) = 0

m6 (U) = 1 - m6 (fraud) = 1 - 0.02 * (1) = 0.98

m7 (fraud) = 0.01 * (1) = 0.01

m7 (~fraud) = 0

m7 (U) = 1 - m7 (fraud) = 1 - 0.01 * (1) = 0.99

From Table 5.4 below we can observe that each mass function gives a varying degree

of belief values and these can be conflicting.

Table 5.4 Mass function beliefs for illustrated example

Mass function   belief(fraud)   belief(~fraud)
m1              0.667           0
m2              0               0.099
m3              0.06            0
m4              0.2             0
m5              0.4             0
m6              0.02            0
m7              0.01            0

Now we can apply the Dempster’s rule of combination to get the combined belief

about the user from the mass beliefs in Table 5.4.

K = m1(f)m2(~f) + m1(~f)m2(f) = 0.066

1 - K = 0.934

m1,2(f) = (m1(f)m2(f) + m1(f)m2(U) + m1(U)m2(f)) / (1 - K) = 0.643

m1,2(~f) = (m1(~f)m2(~f) + m1(~f)m2(U) + m1(U)m2(~f)) / (1 - K) = 0.035

m1,2(U) = m1(U)m2(U) / (1 - K) = 0.321

m1,2 is the combined mass belief from mass functions 1 and 2. Next we combine this with mass function 3 to get the combined mass belief m1,2,3:

K = m1,2(f)m3(~f) + m1,2(~f)m3(f) = 0.0021

1 - K = 0.998

m1,2,3(f) = (m1,2(f)m3(f) + m1,2(f)m3(U) + m1,2(U)m3(f)) / (1 - K) = 0.664

m1,2,3(~f) = (m1,2(~f)m3(~f) + m1,2(~f)m3(U) + m1,2(U)m3(~f)) / (1 - K) = 0.0333

m1,2,3(U) = m1,2(U)m3(U) / (1 - K) = 0.303

The above belief combination repeats until no more evidence needs to be considered.

Thus, the belief of the hypothesis that click 6 is fraudulent is calculated in accumulative

fashion. Following the procedure we go on to get the combined belief of all mass beliefs

m1,2,3….7

m1,2,3….7(f) = 0.840

m1,2,3….7(~f) = 0.016

m1,2,3….7(U ) = 0.144
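The whole worked example can be checked by folding the pairwise rule over the seven mass functions. The helper names below are hypothetical; the input masses are the ones computed above for the 6th click. Run as written, the combined masses come out at roughly 0.84 for fraud, 0.016 for ~fraud and 0.144 for U, matching the reported values up to rounding.

```python
from functools import reduce


def mass(f: float = 0.0, not_f: float = 0.0) -> dict:
    """Mass function with the uncommitted remainder assigned to U."""
    return {"f": f, "~f": not_f, "U": 1.0 - f - not_f}


def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule of combination on the frame {f, ~f, U}."""
    k = m1["f"] * m2["~f"] + m1["~f"] * m2["f"]   # conflicting mass
    norm = 1.0 - k
    return {
        "f": (m1["f"] * m2["f"] + m1["f"] * m2["U"] + m1["U"] * m2["f"]) / norm,
        "~f": (m1["~f"] * m2["~f"] + m1["~f"] * m2["U"] + m1["U"] * m2["~f"]) / norm,
        "U": (m1["U"] * m2["U"]) / norm,
    }


evidence = [
    mass(f=0.8 * (1 - 1 / 6)),        # Evidence 1: 6 clicks in the window
    mass(not_f=0.99 * 180 / 1800),    # Evidence 2: 180 s spent out of 1800 s
    mass(f=0.6 * 0.1),                # Evidence 3: only ad-visits so far
    mass(f=0.2),                      # Evidence 4: click at a suspicious time
    mass(f=0.4),                      # Evidence 5: IP outside business region
    mass(f=0.02),                     # Evidence 6: no membership created
    mass(f=0.01),                     # Evidence 7: nothing added to cart
]

combined = reduce(combine, evidence)
print({key: round(value, 3) for key, value in combined.items()})
```

Because Dempster's rule is commutative and associative, the order in which the seven mass functions are folded does not change the result.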

As we can see, the belief (fraud) of 0.84 is above the threshold for fraud (0.65) given in Table 4.1, and so the user is certified as fraud. Figure 5.4 gives a graphical representation of the combined belief of fraud over all 6 clicks made by the user (in this example we have worked out the mass value computation for the 6th click only, but the figure plots the mass values computed for all clicks from the 1st through the 6th). We can observe how the combined belief changes as more clicks are made.



Figure 5.4 Combined belief of fraud for input in Figure 5.3



CHAPTER VI

EVALUATION

In this section we present two case studies (scenarios), each of which corresponds to a

different type of click fraud attack. In case study 1 we present a scenario where a human user

is trying to perform click fraud and uses different click patterns in order to avoid detection. In

case study 2 we present a scenario where a software bot is used to perform click fraud and it

tries to make detection difficult by using multiple IP addresses. In both the cases we present

our output and show that our approach is able to successfully detect click fraud. We will

discuss the generality of our solution in Chapter VII.

Case Study 1

We present a scenario where a human user is trying to commit click fraud and avoid detection by giving the impression of a regular user. Figure 6.1 below shows the user activity for the test case.

Figure 6.1 Timeline input for Case Study 1

A fraudster needs to repeatedly click on the ad in order to make a substantial profit. In this case the fraudster clicks the ad seven times (leading to seven ad-visits). The fraudster also enters the ad-site via a regular search (non-ad visit) to give a stronger impression of a regular user. He/she spends time on the site after landing (with random session durations) and carries out activities like opening 32 links in the ad-site after landing, creating a membership account and adding a product to his/her shopping cart.

Below we describe the belief computed from every mass function and the combined

belief in figures 6.2 through 6.9. We have plotted the belief value with time (in the range of

window W). Please note that some of the functions support both fraud and ~fraud at different

times depending on the input and thus they can have both types of beliefs at different times. In

these cases we just show belief of fraud for the purpose of clarity. Also note that whenever a

function supports belief in ~fraud then the belief in fraud becomes 0 and vice versa.


Figure 6.2 below shows the belief computed from Mass Function 1 (Number of clicks on the ad), according to which if the number of clicks on the ad from an IP in the time window W (30 minutes) is high, then the likelihood of the user being a fraud is high. Mass Function 1 supports only a belief of fraud, and the belief at the first click on the ad is 0. The belief increases as more clicks are made on the ad. The increase is faster in the first five clicks due to the nature of the function. It is notable that the belief of fraud does not increase in the fourth visit, as it is a non-ad visit. This function does not consider any user activity apart from the number of clicks on the ad. Therefore user activities like a non-ad visit (fourth visit), adding products to the shopping cart, etc. do not affect the belief of this mass function.

Figure 6.2 Belief of fraud from mass function 1



Figure 6.3 below shows the belief computed from Mass Function 2 (Time spent in

browsing) according to which if the time spent by the user at the ad-site is high then he/she is

less likely to be a fraud. This function supports only the belief of ~fraud. In this case study the

user spent time in every session and this is reflected in an increasing belief of ~fraud. This

belief clearly contradicts the belief from Mass Function 1 which supports a belief of fraud.

The fraudster has spent a considerable time browsing the ad-site during every visit to give an

impression of a genuine user. As we can see below the user has a high belief of ~fraud at the

end.

Figure 6.3 Belief of ~fraud from mass function 2



Figure 6.4 below shows the belief computed from Mass Function 3 (Ad-visit after non-ad visit), according to which if a user clicks on an ad after a non-ad visit, he/she is likely to be a fraud. Once a user makes a non-ad visit to the ad-site, it implies that the user knows how to reach the site without clicking on the ad. The first three visits are all ad-visits, and therefore the function supports a small belief of fraud. The fourth visit is a non-ad visit, and therefore the function does not support fraud (the belief becomes 0). But the fifth visit is an ad-visit after a non-ad visit. The function computes a high belief of fraud because of this, and we see that the belief of fraud spikes up to 0.6.




Figure 6.5 below shows the belief computed from Mass Function 4 (Time of click), according to which if the click occurred in the most suspicious time (or most active period of fraud activity) then the user is likely to be a fraud. The first three visits are not during the most suspicious time for fraud, and therefore the function does not support a belief of fraud. During the fourth visit the session enters the suspicious time, and therefore the function supports fraud. The curve below shows this increased belief.




Figure 6.6 below shows the belief computed from Mass Function 5 (Place of origin of click), according to which if the click originated from a location (country, state or city) where the advertiser has no business then the user is likely to be a fraud. For this case study we assume that the IP address of the user is from a region outside of the advertiser's region of business. A click from such an IP is not natural and the advertiser will not benefit from it. The function therefore supports a belief of fraud throughout, and this value does not change at any time.




Figure 6.7 below shows the belief computed from Mass Function 6 (Creation of membership), according to which if the user creates a membership account (registers as a member) with the advertiser, he/she is less likely to be a fraud. The user does not create any membership or registration with the advertiser during the first three visits. However, during the fourth visit the user does create one, and therefore the belief of this mass function in ~fraud changes from 0 to 0.25.




Figure 6.8 below shows the belief computed from Mass Function 7 (Adding a product to shopping cart), according to which if the user adds a product to his/her shopping cart, he/she is less likely to be a fraud. The user does not use the shopping cart during the first three visits. However, during the fourth visit the user does add a product to it, and therefore the belief of this mass function in ~fraud increases from 0 to 0.2.




The system combines the mass beliefs and a combined belief corresponding to each

click is computed. Table 6.1 below shows the computed values of belief, plausibility and

deduction for every click.

Table 6.1 Computed belief values for Case Study 1

click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction

1 0.45 0.99 0.015 0.55 not fraud

2 0.44 0.98 0.022 0.56 not fraud

3 0.44 0.97 0.027 0.56 not fraud

4 0.44 0.96 0.036 0.56 not fraud

5 0.43 0.96 0.043 0.57 not fraud

6 0.43 0.95 0.049 0.57 not fraud

7 0.65 0.96 0.036 0.35 suspect

8 0.64 0.96 0.041 0.36 suspect

9 0.64 0.95 0.052 0.36 suspect

10 0.63 0.94 0.06 0.37 suspect

11 0.63 0.93 0.068 0.37 suspect

12 0.7 0.94 0.059 0.3 fraud

13 0.69 0.93 0.067 0.31 fraud

14 0.69 0.93 0.072 0.31 fraud

15 0.69 0.92 0.078 0.31 fraud

16 0.68 0.91 0.092 0.32 fraud

17 0.67 0.9 0.1 0.33 fraud

18 0.59 0.81 0.19 0.41 suspect

19 0.51 0.7 0.3 0.49 suspect

20 0.5 0.69 0.31 0.5 suspect

21 0.43 0.6 0.4 0.57 not fraud

22 0.42 0.59 0.41 0.58 not fraud

23 0.42 0.58 0.42 0.58 not fraud

24 0.8 0.87 0.13 0.2 fraud

25 0.8 0.86 0.14 0.2 fraud

26 0.79 0.86 0.14 0.21 fraud

27 0.79 0.85 0.15 0.21 fraud

28 0.8 0.86 0.14 0.2 fraud

29 0.79 0.85 0.15 0.21 fraud

30 0.78 0.84 0.16 0.22 fraud

31 0.78 0.84 0.16 0.22 fraud

32 0.79 0.84 0.16 0.21 fraud

33 0.78 0.84 0.16 0.22 fraud

34 0.78 0.83 0.17 0.22 fraud

35 0.77 0.82 0.18 0.23 fraud

36 0.76 0.81 0.19 0.24 fraud

37 0.77 0.81 0.19 0.23 fraud

38 0.76 0.81 0.19 0.24 fraud

39 0.75 0.8 0.2 0.25 fraud

40 0.74 0.79 0.21 0.26 fraud



Figure 6.9 below shows the combined belief of fraud obtained by combining the beliefs from all the mass functions using Dempster's combination rule. It is interesting to note that individually the beliefs from the mass functions contradict one another and vary. However, upon combination they give a belief that changes to reflect the changes in the user's activity.

Figure 6.9 Combined belief of fraud for Case Study 1

Initially the combined belief of fraud is low and, according to the threshold values in Table 4.1, it indicates ~fraud. As the user clicks on the ad again (second visit), the belief of fraud increases and the user moves from ~fraud to suspicious. In the third ad-visit the belief of fraud increases further and indicates fraud. But when the user makes a non-ad visit (fourth visit), creates a membership and uses the shopping cart, the belief drops back to ~fraud. Had the user stopped clicking on the ad at this point, he/she would have been considered ~fraud. However, when the user clicks on the ad again and makes an ad-visit (fifth visit), the belief increases to support fraud. The belief spikes to a high value during the fifth visit because this is an ad-visit after a non-ad visit. At the end, the belief of fraud for this user remains high and the case is certified as fraud. The time of the clicks and the location of the IP also contribute to the suspicion.
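The mapping from a combined belief to a deduction can be sketched as a simple threshold test. The function name `deduce` and the two threshold values below are assumptions: the actual thresholds are those of Table 4.1, and the values 0.5 and 0.66 are chosen here only because they are consistent with the rounded rows reported for Case Study 1.

```python
def deduce(belief_fraud, suspect_threshold, fraud_threshold):
    """Map a combined belief of fraud to one of three labels."""
    if not 0.0 <= suspect_threshold <= fraud_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= suspect <= fraud <= 1")
    if belief_fraud >= fraud_threshold:
        return "fraud"
    if belief_fraud >= suspect_threshold:
        return "suspect"
    return "not fraud"


# Illustrative thresholds only (assumed, NOT taken from Table 4.1).
SUSPECT_T, FRAUD_T = 0.5, 0.66

# Belief values of clicks 1, 8, 12, 21 and 24 from the Case Study 1 table.
labels = [deduce(b, SUSPECT_T, FRAUD_T) for b in (0.45, 0.64, 0.70, 0.43, 0.80)]
```

Whether each boundary is inclusive is also an assumption of this sketch. Keeping the thresholds as parameters lets an advertiser widen or narrow the suspicious band without touching the combination step.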

Case Study 2

This case study presents a scenario in which a software bot commits click fraud by using different IP addresses at different times. The use of multiple IP addresses can make detection difficult. In most approaches to click fraud detection, including ours, n different IPs are treated as n unique users. Walgampaya et al. (2011) suggest a specialized approach to identify bot attacks. For clarity, let us assume that each IP belongs to a different user. Figure 6.10 below shows the activity from three different IP addresses (users) in a timeline diagram. This timeline diagram uses a different color scheme to distinguish the visits by the three IPs, and it omits the time range of each session to avoid clutter.

Figure 6.10 Timeline diagram for Case Study 2



Two ad-visits are made from each IP; the first visit has a short session and the second has a longer session. The first two IPs are outside the advertiser's region of business, but the third IP originates from within it. The last four visits lie in a suspicious time range.

The system computes mass beliefs and a combined belief for each click from every IP. Tables 6.2, 6.3 and 6.4 below show the computed values of belief, plausibility and deduction for the first, second and third IPs respectively.

Table 6.2 Computed belief values for first IP

click no  belief(fraud)  plausibility(fraud)  belief(~fraud)  plausibility(~fraud)  Deduction
1         0.45           0.99                 0.015           0.55                  not fraud
2         0.66           0.99                 0.014           0.34                  fraud
3         0.72           0.98                 0.025           0.28                  fraud

Table 6.3 Computed belief values for second IP

click no  belief(fraud)  plausibility(fraud)  belief(~fraud)  plausibility(~fraud)  Deduction
1         0.45           0.99                 0.015           0.55                  not fraud
2         0.66           0.98                 0.02            0.34                  fraud
3         0.53           0.94                 0.061           0.47                  suspect

Table 6.4 Computed belief values for third IP

click no  belief(fraud)  plausibility(fraud)  belief(~fraud)  plausibility(~fraud)  Deduction
1         0.078          0.89                 0.11            0.92                  not fraud
2         0.73           0.99                 0.0089          0.27                  fraud
3         0.51           0.9                  0.095           0.49                  suspect



Figure 6.11 below shows the computed values of belief of fraud for all visits by the

bot using the three IPs.

Figure 6.11 Combined belief values for Case Study 2

From Figure 6.11 and Tables 6.2 to 6.4 above we can observe that our system detects the users with the first two IPs as fraud and the user with the third IP as suspicious, even though only two clicks occurred from each IP. The third IP was not outside the advertiser's region of business, and hence the system rated it as suspicious rather than fraud. The clicks from the three different IPs could all come from a single bot. We evaluate them as three different users and yet detect the fraud.
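Treating each IP as a separate user, with a combined belief recomputed at every click, can be sketched as follows. This is only an illustration under stated assumptions: the class `ClickTracker`, the two evidence functions `repeat_click_mass` and `region_mass`, the `outside_region` field and all numeric masses are hypothetical and are not the mass functions defined in this thesis. Masses are written as triples (fraud, ~fraud, unknown), and the combination is Dempster's rule specialised to that binary frame.

```python
from collections import defaultdict
from functools import reduce


def combine(m1, m2):
    """Dempster's rule for masses given as (fraud, not_fraud, unknown)."""
    f1, n1, u1 = m1
    f2, n2, u2 = m2
    conflict = f1 * n2 + n1 * f2
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    k = 1.0 - conflict
    return (
        (f1 * f2 + f1 * u2 + u1 * f2) / k,
        (n1 * n2 + n1 * u2 + u1 * n2) / k,
        (u1 * u2) / k,
    )


# Hypothetical evidence functions: each maps one IP's click history to a mass.
def repeat_click_mass(history):
    # More ad clicks from the same IP put more mass on fraud (made-up numbers).
    fraud = min(0.2 * len(history), 0.8)
    return (fraud, 0.0, 1.0 - fraud)


def region_mass(history):
    # The latest click's origin supports fraud or ~fraud (made-up numbers).
    return (0.3, 0.0, 0.7) if history[-1]["outside_region"] else (0.0, 0.3, 0.7)


class ClickTracker:
    """Keeps one click history per IP and combines the evidence on each click."""

    def __init__(self, evidence_functions):
        self.evidence_functions = evidence_functions
        self.history = defaultdict(list)  # one click list per IP

    def observe(self, ip, click):
        """Record a click and return the combined belief of fraud for that IP."""
        self.history[ip].append(click)
        masses = [f(self.history[ip]) for f in self.evidence_functions]
        fraud, _, _ = reduce(combine, masses)
        return fraud
```

With `ClickTracker([repeat_click_mass, region_mass])`, two successive clicks from an out-of-region IP raise that IP's belief of fraud, while a first click from an in-region IP starts lower. A further source of evidence is added by appending one more function to the list, which leaves the combination step unchanged.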



CHAPTER VII

DISCUSSION & CONCLUSIONS

This thesis proposes an approach to click fraud identification that the advertising community can use to address its click fraud problems. Our approach is fundamentally different from existing methods. First, we focus on the type of clicking activity that can create real value for the fraudster and attempt to detect that. For this we take raw weblog data and derive meaningful evidence for our mass function formulation. Second, the approach can perform on-line computation to detect fraudulent clicks. Such computation adapts well to real-time systems, which is a key advantage. Third, the approach is relatively simple and fast because it requires only the incoming data at the advertiser's disposal. It neither requires the advertiser to maintain and update large historical databases of various evidence nor necessitates learning any patterns. This makes the approach practical for advertisers. Fourth, the resulting beliefs also expose the gray area of suspicious activity, which can alert the advertiser to irregular or abnormal traffic. This is useful against click fraud attacks that may be hard to catch but still fall into the suspicious category. Finally, the approach extracts evidence from limited server data and can be extended easily by adding new mass functions to represent additional evidence.

Our experiments on the two case studies show that the proposed approach works correctly. Although we have not experimented on all possible scenarios of click fraud behavior, we believe that our approach will work effectively in general for the following reasons. First, the technique allows the combination of a set of evidence that can contribute to click fraud detection. Second, the set of evidence considered in this thesis is, in the worst case, near complete. Finally, if the set is not complete, the technique can easily be extended by adding new evidence to the proposed click fraud detection system.

Future work includes more experiments to understand the characteristics of the proposed approach, for example, which novel click attacks the approach fails to identify and, if such attacks are found, which other sources of data and evidence can be used to detect them. Future work also requires experiments to see whether our approach works for specialized bot attacks, which can be highly sophisticated and evolve continuously. These are among our ongoing and future research directions.



BIBLIOGRAPHY

D. Antoniou, M. Paschou, E. Sakkopoulos, E. Sourla, G. Tzimas, A. Tsakalidis, E. Viennas, “Exposing click-fraud using a burst detection algorithm”, in Proceedings of ISCC on Computers and Communications, IEEE Symposium, Jun 2011, pp. 1111-1116.

A. Tuzhilin, “The Lane's Gifts vs. Google Report”, 2006.

M. Kantardzic, C. Walgampaya, B. Wenerstorm, O. Lozitskiy, S. Higgins and D. Kings, “Improving Click Fraud Detection by Real Time Data Fusion”, in Proceedings of the ISSPIT on Signal Processing and Information Technology, IEEE International Symposium, Dec. 2008, pp. 69-74.

G. Shafer, “A Mathematical Theory of Evidence”, Princeton University Press, 1976.

T. Denoeux, “A K-nearest Neighbour Classification Rule based on Dempster-Shafer Theory”, IEEE Transactions on Systems, Man and Cybernetics, 25 (1995) 804-813.

F. Dong, S. M. Shatz, H. Xu, “Reasoning Under Uncertainty For Shill Detection In Online Auctions using Dempster-Shafer Theory”, International Journal of Software Engineering and Knowledge Engineering, 2010, pp. 943-973.

K. Sentz, S. Ferson, “Combination of Evidence in Dempster-Shafer Theory”, SAND 2002-0835, April 2002.

N. Kshetri, “The Economics of Click Fraud”, Security and Privacy, IEEE, May-June 2010, pp. 45-53.

H. Haddadi, “Fighting Online Click-Fraud Using Bluff Ads”, ACM SIGCOMM Computer Communication Review, v.40 n.2, April 2010, doi:10.1145/1764873.1764877.

V. Anupam, A. Mayer, K. Nissim, B. Pinkas, and M. K. Reiter, “On the Security of pay-per-click and other web advertising schemes”, Computer Networks, 31(11-16): 1999, 1091-1100.

M. Kantardzic, C. Walgampaya, and H. Jamali, “Click fraud prevention in pay-per-click model: Learning through multimodel evidence fusion”, in Proceedings of ICMWI of Machine and Web Intelligence, 2010, pp. 20-27.

C. Walgampaya and M. Kantardzic, “Cracking the Smart ClickBot”, in Proceedings of Web Systems Evolution on 13th IEEE Symposium, 2011, pp. 125-134.

B. J. Jansen, “Click Fraud”, IEEE Computer, vol. 40, no. 7, Jul 2007, pp. 85-86.

X. Li, Y. Liu, and D. Zeng, “Publisher click fraud in the pay-per-click advertising market: Incentives and consequences”, in Proceedings of Intelligence and Security Informatics of IEEE International Conference, 2011, pp. 207-209.

S. Majumdar, D. Kulkarni, and C. V. Ravishankar, “Addressing Click Fraud in Content Delivery Systems”, in Proceedings of INFOCOM 2007 of 26th IEEE International Conference, May 2007, pp. 240-248.

A. Metwally, D. Agarwal, A. Abbadi, and Q. Zheng, “On Hit Inflation and Detection in Streams of Web Advertising Networks”, in Proceedings of Distributed Computing Systems on ICDCS, Jun 2007, pp. 52-52.

IAB and PwC, “IAB Internet Advertising Revenue Report, 2010”, First Half-Year Results, New York, U.S., 2011.

GeekWire Magazine, “Newspapers take it on the chin as online ad revenue falls into the hands of a few tech giants”, Mar 2012, http://www.geekwire.com/2012/newspapers-chin-online-ad-revenue-falls-hands-tech-giants/

Google Earnings Report, “Google Announces Second Quarter 2011 Financial Results”, Jul 2011, http://investor.google.com/earnings/2011/Q2_google_earnings.html
