Automatic Detection of Click Fraud in Online Advertisements
by
Abhishek Agarwal, M.S.
A Thesis
In
COMPUTER SCIENCE
Submitted to the Graduate Faculty
of Texas Tech University in
Partial Fulfillment of
the Requirements for
the Degree of
MASTER OF SCIENCE
Approved
Dr. Rattikorn Hewett
Chair of Committee
Dr. Sunho Lim
Dr. Eunseog Youn
Peggy Gordon Miller
Dean of the Graduate School
August, 2012
Texas Tech University, Abhishek Agarwal, August 2012
ACKNOWLEDGMENTS
I would like to thank Dr. Rattikorn Hewett for her guidance throughout my Master's
research. Her in-depth knowledge of the subject and her focus on clarity and quality of work
have helped me learn skills which will help me for the rest of my career. Her guidance on the
research has been invaluable and has helped me cope with the challenges I faced throughout
the course of this work.
TABLE OF CONTENTS
Acknowledgments
Abstract
List of Tables
List of Figures
Motivation
Contributions
Background Work
Preliminaries
Terms
Problem Statement
Assumptions
Mathematical Theory of Evidence
Mass Functions
Combination Rule
Proposed Dempster Shafer Theory for Click Fraud Detection
The Core Element of Dempster Shafer Theory
Mass functions for Click Fraud Detection
Evidence 1: Number of clicks on the ad
Evidence 2: Time spent in browsing
Evidence 3: Ad-Visit after non-ad visit
Evidence 4: Time of Click
Evidence 5: Place of origin of click
Evidence 6: Creating of membership
Evidence 7: Adding a product in shopping cart
Data Set & Illustration
Data Description
Example of belief computation using mass function and combination
Evaluation
Case Study 1
Case Study 2
Discussion & Conclusions
Bibliography
ABSTRACT
Advances in Internet technology, together with its growing accessibility and availability,
have greatly increased the number of Internet users over the last decade. This has made online
advertising a popular venue for many companies to market their products and services. Today,
online advertising is one of the most important sources of revenue for many large
enterprises. In online advertising, an advertiser pays a broker (e.g., Google, Yahoo), who
normally operates a search engine, to post its advertisement on any appropriate publisher
site. The publisher earns revenue from the broker for each click on an advertisement posted on
its site, while the advertiser is charged for that click. Thus, an excessive number of clicks
can quickly exhaust a rival company's advertising budget and drive its advertisement out of
the competition. At the same time, each click adds revenue to the publisher. This motivates
click fraud, which refers to malicious acts that create fraudulent clicks with the intent to
increase revenue or drive away competitors, without real interest in the products or services
being advertised. Identifying click fraud is a difficult problem because of the dynamic
nature of click behavior: some clicks are generated by humans and some by automated software
called bots. Previous work has attempted to identify click fraud using various techniques,
but these tend to be limited by the types of data they use, the way the data are processed,
or assumptions that do not always hold.
This thesis presents an approach to automatically detecting click fraud in online advertising.
The approach uses a mathematical theory of evidence to estimate the likelihood that a click is
fraudulent or genuine, using web log data of a user's activities on the advertiser's website.
One advantage of the proposed approach is that the likelihood can be computed for each
incoming click, so it gives an online computation of the belief that fits well with the
dynamic behavior of users. The thesis describes the approach and evaluates its validity using
two real-world case studies. We believe the approach is general in that it can be applied to
any scenario.
LIST OF TABLES
4.1 Fraud certification rules
5.1 Sample log data
5.2 Input from server log
5.3 Coefficient values
5.4 Mass function beliefs for illustrated example
6.1 Computed belief values for Case Study 1
6.2 Computed belief values for first IP
6.3 Computed belief values for second IP
6.4 Computed belief values for third IP
LIST OF FIGURES
1.1 % change of revenue for advertising media (GeekWire, 2012)
1.2 Google's revenue source distribution in 2011 (Google Earnings Report, 2011)
1.3 Scenario before click fraud occurred
1.4 Scenario after click fraud occurred
4.1 Click fraud detection framework using D-S theory
5.1 Legends for timeline diagram
5.2 Timeline diagram sample data in Table 5.1
5.3 Timeline diagram for Table 5.2
5.4 Combined belief of fraud for input in Figure 5.3
6.1 Timeline input for Case Study 1
6.2 Belief of fraud from mass function 1
6.3 Belief of ~fraud from mass function 2
6.4 Belief of ~fraud from mass function 3
6.5 Belief of fraud from mass function 4
6.6 Belief of fraud from mass function 5
6.7 Belief of ~fraud from mass function 6
6.8 Belief of ~fraud from mass function 7
6.9 Combined belief of fraud for Case Study 1
6.10 Timeline diagram for Case Study 2
6.11 Combined belief values for Case Study 2
CHAPTER I
MOTIVATION
The Internet has seen tremendous growth in the last decade and according to current
statistics from the World Bank, nearly 32% of the world population currently uses the
Internet. This has made online advertising not only lucrative but also an important medium for
businesses to reach out to a large consumer base (Jansen, 2007). Figure 1.1 below shows that
while most other media of advertisement are losing market share, online advertisements are
growing tremendously.
Figure 1.1 % change of revenue for advertising media (GeekWire, 2012)
Not only do online ads benefit advertisers, they are also a rich source of revenue for
publishers who display ads on their websites and brokers like Google, Yahoo, MSN, Ask.com
etc. who provide the technical platform for online advertisements. Thus, online ads drive the
Internet economy and are the necessary life blood for its survival and growth. Figure 1.2
below shows that in 2011, 97% of Google's revenue was from online ads alone.
Figure 1.2 Google's revenue source distribution in 2011 (Google Earnings Report, 2011)
Online advertising is, however, not free of issues, and click fraud is a major problem
that can impact its growth. Click fraud is a type of crime in online advertising in which a
user clicks on an ad not out of genuine interest in what the advertiser has to offer but with
the intent either to generate illegal revenue from clicks (for the publisher that hosts the
advertisement) or to cause monetary loss to the advertiser. It hurts advertisers and may
deter them from investing in online ads.
Many advertising mechanisms exist, including the pay-per-click (PPC) scheme, which
accounts for about 57 percent of all Internet ads and roughly US$16 billion in revenue in
2010 (Tuzhilin, 2006; IAB and PwC, 2010). A popular example of a PPC scheme is Google
AdSense. In PPC, brokers like Google place targeted ads in dedicated ad spaces on
publisher websites. Brokers get paid by advertisers for every click on an ad, and they share
the income generated this way with the publishers. While PPC is a great model for online
advertising, it suffers the most from the problem of click fraud (Tuzhilin, 2006). Most of
the publishers in PPC programs are small-time blog owners, and they are the source of the
majority of click fraud. Competitors of an advertiser can also commit click fraud in order to
reduce competition, which may indirectly benefit their business. To commit click fraud,
publishers or
competitors can click on the ad themselves, ask friends to do it, use an Internet bot script
which repeatedly clicks on the ads or hire people to do it for them (Kshetri, 2010). Such clicks
are of no value to the advertisers as the clicker has no intent to buy their product or service,
use information or carry out any transaction useful to the advertiser's business (Jansen, 2007).
Brokers, too, have an incentive not to filter out all click fraud, as doing so would
reduce their revenue. They can contribute to click fraud by passively letting it happen
and not taking adequate measures to stop it. Lesser-known brokers have a greater
incentive to do so (Kshetri, 2010). Multiple lawsuits filed by advertisers against
Google and Yahoo for not taking adequate steps to curb click fraud are an indication of
brokers' inability or unwillingness in this regard. Figure 1.3 below shows a scenario before
click fraud, when the advertiser's money reserve (advertising budget) is full and the
publisher, broker and competitors have not generated any illegal revenue from click fraud.
Figure 1.3 Scenario before click fraud occurred
Figure 1.4 below shows the scenario after click fraud, which has completely depleted the
advertiser's budget and increased the illegal profit of the broker, publisher and competitor.
Figure 1.4 Scenario after click fraud occurred
Reputable brokers like Google actively try to contain click fraud by filtering out
fraudulent clicks and permanently blocking publishers who are found to be involved (Tuzhilin,
2006; Kshetri, 2010). They use their access to a user's search activities and the data they
collect from the publisher to find patterns in the user's behavior. The idea is to estimate
the user's intention behind a click in order to rate the click as genuine or fraudulent.
However, they may not have access to data about the user's actions on the advertiser's
website, where the user is taken following the click. This is because the advertiser may
choose to share limited or no data with the broker due to its own privacy concerns
(Tuzhilin, 2006).
Brokers provide aggregate statistics to advertisers and do not share details on which
clicks they found fraudulent in order to avoid making their detection mechanisms open to
fraudsters. Thus advertisers are not adequately informed and there is a strong case for the
advertisers to have their own click fraud detection system in place. This way the advertisers
can protect themselves not only from fraudulent publishers and competitors but also from
brokers who either fail to detect fraud or let it occur willingly. Such a system can help them
estimate the extent of the fraud in their ad campaign and pay the brokers for genuine clicks
only. It is important to note here that brokers have access to much larger sources of
information than advertisers. The advertisers must be able to detect click fraud with
the limited data they have about users' actions at their website.
Click fraud identification is a difficult problem to solve. Fraud mechanisms evolve and
continually change over time. The fraud can be carried out both by humans and by software
bots, each with distinctive characteristic behaviors. It is difficult to track users by their
IP addresses, as IPs are generally dynamic: the IP address of the same user may change at any
time. A software bot can also use several different IP addresses at a time to carry out click
attacks. Finally, the advertiser has access only to data from its own server, which gives
very limited information about a user's behavior.
Contributions
This thesis presents an approach to automatically detecting click fraud at the ad-site.
Advertisers can use the proposed approach to detect click fraud against their own ads. Our
approach employs the mathematical theory of evidence called Dempster-Shafer (D-S) Theory
(Shafer, 1976; Denoeux, 1995; Dong et al., 2010; Sentz et al., 2002) for evidence-based
reasoning to estimate the likelihood of a click being fraudulent based on the evidence
gathered from the weblog data available to the advertiser. The proposed approach can also be
useful to brokers for computing correct charges to their clients, if the data are available
to them. Our approach is based on a widely used theory that allows the estimate of the
likelihood to be updated as each incoming click arrives; that is, it offers an online
computation. Thus, after each click from a given IP, we can estimate our belief that the
click is fraudulent or not. In summary, the contributions of this thesis include: (1) an
approach for automatically detecting click fraud, (2) a framework for reasoning about click
fraud that integrates relevant information extracted from weblog data with evidence-based
reasoning to update the click fraud analysis in real time, and (3) the core elements of the
proposed approach, which consist of a set of evidences required for detecting click fraud.
These evidences will be formulated in terms of functions, called mass functions, used in the
D-S theory.
The rest of this thesis is organized as follows: Chapter II presents background work
on click fraud identification. Chapter III gives preliminaries, including terms and relevant
concepts, the problem formulation and its assumptions, and the Dempster-Shafer Theory along
with its fundamental elements. Chapter IV presents our approach to the problem and the
details of the core contribution on formulating mass functions for the click fraud identification
problem. Chapter V explains the data set used for the approach and gives an illustrative
example. Chapter VI evaluates the proposed approach with experiments on synthetic data
generated on two case studies. Chapter VII gives concluding remarks and possible extension
for future work.
CHAPTER II
BACKGROUND WORK
Many different types of solutions have been proposed to counter click fraud.
Tuzhilin (2006) suggested a model in which advertisers pay for a click only if it leads to a
conversion event such as a purchase. Such a model is economically unviable for
publishers and so is not available to advertisers. Another method proposed by Tuzhilin (2006)
is the use of data mining models based on past data to classify clicks as fraud or ~fraud
(not fraud). Such a solution may suffer from high inaccuracy as fraud mechanisms evolve and
change over time, because it assumes that past clicking behavior is indicative of future
behavior. It also requires a large number of past clicks that can be reliably labeled as
valid or invalid, and it is a batch process rather than an online one. Moreover, such
datasets are at the disposal of brokers only, and other involved parties like advertisers
cannot use them. The author clearly states these limitations.
Haddadi (2010) discusses the use of bluff ads for detecting sources of click fraud such as
trained bots or a poorly trained human workforce employed to carry out fraud. The display
text of these ads is unrelated to the context of the user to whom they are displayed. For
example, a user in Australia should not ideally be shown an ad for a special offer on pizza
in New York City. A click by the user is unnatural in this case and indicates that the user
is a bot or a human involved in fraud. However, careful humans and sophisticated bots can
still beat it. Also, this is a 'broker-centric' model: it can be implemented by brokers, and
advertisers need to trust brokers completely in this.
Recently, Antoniou et al. (2011) proposed a burst detection algorithm that detects a high
frequency of user activity in short time periods in order to identify various types of click
fraud, including voting click fraud, fraud related to blog post popularity, search engine
retaliation and advertising click fraud. While this is a good general solution for all the
types of click fraud mentioned, it does not cater to the nuances of advertisement click
fraud, as simple detection of bursts may not be enough to differentiate between valid and
invalid clicks. More
factors/evidences need to be taken into consideration before we can conclusively label a
click as fraudulent. Walgampaya et al. (2011) proposed a method to detect bot scripts
involved in click fraud using Bayesian classifiers.
The methods above are either not sufficient on their own to combat click fraud or
require broker involvement of some kind. The involvement may take the form of policy changes
by brokers or the sharing of data at their disposal, and brokers have been unwilling to do
either. As a result, these methods cannot be used by advertisers to actively detect fraud at
their site.
Kantardzic et al. (2010) proposed a real-time click fraud detection and prevention
system. It uses D-S Theory for multilevel data fusion of evidences from different sources such
as IP address, referrer, country etc. However, the system relies on data from both the client
(advertiser) and the server (broker). An advertiser does not have access to the broker's
data, and hence this system can be used by brokers only. Our approach equips advertisers with
a fraud detection system that uses only the data at their disposal. The evidences that
Kantardzic et al. extract from server data to formulate mass functions are very basic,
whereas some of our rules are sophisticated and, to the best of our knowledge, novel. We do
not maintain any historical databases, and we exploit the observation from Antoniou et al.
(2011) that fraud happens in bursts. Our approach is simple, yet our set of rules is powerful
and comprehensive, making it difficult for fraudsters to carry out any viable attack on the
advertiser. For example, rules 1, 2, 4 and 5 make it difficult for a bot to generate clicks
without detection.
CHAPTER III
PRELIMINARIES
This chapter outlines the foundation for the proposed method of click fraud detection
and the assumptions we have made.
Terms
We now define terms used in this thesis.
Advertiser is a seller with an e-commerce website that pays for its ads to be displayed on
other sites. These ads may create more traffic and revenue for the advertiser, since a user
who clicks on one of them is directed to the advertiser's site.
Ad-site is the advertiser's website. A user on the Internet can visit the ad-site by several
means, such as using an Internet search, typing the URL of the advertiser into the browser,
bookmarking the advertiser's site and clicking the bookmark later, or clicking on the ad on a
publisher site.
Ad-visit is a visit by a user to the ad-site made by clicking an ad. Non-ad visit is a user
visit made by any means other than clicking an ad.
Session is a continuous period of time during which a visitor navigates within the
advertiser's site. In other words, it is the duration for which a user maintains an active
HTTP connection with the server. In a session the user can be browsing, reading, watching
videos, filling out forms, registering for membership, adding products to a shopping cart,
purchasing products etc.
Publishers are the websites which host ads for the advertisers and get paid for the clicks
on those ads. Common examples are blogs and news sites.
Broker is an intermediary between advertiser and publisher. Brokers provide the technical
platform for online advertisements. They are mostly Internet search engine companies like
Google, Yahoo, AOL, Ask.com etc., and they use their search technology to serve targeted ads
on publisher sites based on website content, geographical location etc.
Pay Per Click (PPC) is an online advertising model in which publishers display ads on
their websites and get paid for each click on those ads. Google runs a PPC program called
AdSense.
Gclid is a unique ID that is attached to the server log entry for every click made on
Google ads. It helps identify unique visitors to the best approximation, as Google
uses various parameters to make this unique identification.
Problem Statement
Given weblog data collected at the advertiser's site over a period of time, find all
occurrences of click fraud. For every such occurrence, identify its owner by the
corresponding IP address. The advertiser's web server log data contains, for every click,
information such as the IP address, date and time, Gclid number (defined above), the
requested page and the referrer.
Assumptions
Due to the dynamic nature of the IP addresses associated with each user, solving the above
problem in practice requires the following assumptions.
1) IP address assignment changes over time, and a user may be assigned different IP addresses
while he/she is surfing the Internet. A user (either human or bot) may try to carry out
fraudulent clicks using as many different IPs as possible in order to avoid detection.
Therefore, it is not feasible to rely on data collected for an IP over a long duration.
Instead, we use a short time window W. In this work, W is set to 30 minutes, during which we
assume that the IP address of a user will not change. This duration is typical and
reasonable, though it is quite different from that used in other existing work. The
probability that a user with a particular IP clicks on an ad and that the same IP is then
assigned to another user who also clicks on the same ad within the window is negligibly low.
Our approach is, however, not limited to this window size, and one can pick a size that suits
one's needs.
2) A fraudster has an incentive to click on an ad multiple times but no intention of making
an actual purchase of a product or service. Fraudsters make money by clicking on ads but
would have to spend money to make purchases, which is strictly against their end
goal. Thus, if a user makes a purchase at the ad-site, we assume that the user is not
involved in fraud. However, in some circumstances (for example, in order to confuse detection
systems), a fraudster may make a purchase. Such an action will no longer help the fraudster
once he moves out of the time window W.
3) Fraudulent clicks with large time gaps between consecutive clicks do not deliver any
substantial monetary gain to the fraudster. The number of clicks has to be large, with
short gaps between them, and therefore a burst of clicks may indicate click fraud
(Antoniou et al., 2011).
4) Since HTTP is a stateless protocol, it is difficult to estimate the session duration
accurately. We sum the time differences between consecutive HTTP requests by the user to
get the total session time; however, there is no way to compute the exact time spent by
the user viewing the last page, since there is no request after it. We therefore assume
that 30 seconds were spent on the last page. Our approach is, however, not
limited by this assumption, and any other duration can be assumed for the last page view.
5) We modeled our approach around Google's AdSense, as it is the most widely accepted Pay
Per Click program. We use gclid, a unique ID attached by Google to the web server logs of
advertisers for every click made on their ads. It follows Google's definition of
unique visits. Google claims that it uses various parameters to assign unique gclids and
that third-party click fraud detection engines which use the gclid are more accurate than
others. We therefore take data already filtered by the broker (Google) and apply our own
approach for further filtration. However, our approach can be modeled around any other PPC
program; clicks made on advertisements could then be identified by, for example, creating
unique landing pages. In this way, visits made from ads can be separated out by looking at
the server logs.
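Assumptions 1 and 4 can be illustrated with a minimal Python sketch. The function names, the representation of requests as plain timestamps, and the choice to anchor the window W so that it ends at the current click are all assumptions made for illustration; the text above fixes only the 30-minute window and the 30-second last-page dwell time.

```python
from datetime import datetime, timedelta

# Values taken from Assumptions 1 and 4; both are tunable, as the text notes.
WINDOW = timedelta(minutes=30)   # window W during which an IP is assumed stable
LAST_PAGE_SECONDS = 30           # assumed time spent on the final page of a session


def session_seconds(request_times, last_page_seconds=LAST_PAGE_SECONDS):
    """Estimate session duration from the timestamps of one user's HTTP requests.

    Sums the gaps between consecutive requests and adds an assumed dwell
    time for the last page, since no later request reveals its real duration.
    """
    if not request_times:
        return 0.0
    times = sorted(request_times)
    gaps = sum((b - a).total_seconds() for a, b in zip(times, times[1:]))
    return gaps + last_page_seconds


def clicks_in_window(click_times, now, window=WINDOW):
    """Return the ad-click timestamps of one IP that fall in (now - window, now].

    Anchoring the window at the current click is an illustrative choice,
    not something the thesis text specifies.
    """
    return [t for t in click_times if now - window < t <= now]
```

For instance, three requests at 0, 40 and 100 seconds into a visit give 100 seconds of measured gaps plus the assumed 30 seconds on the last page.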
Mathematical Theory of Evidence
Efforts in identifying click fraud have mostly concentrated on identifying a certain
characteristic of user behavior, which is quite different from our approach. To provide the
theoretical background of our approach, we describe the mathematical theory of evidence, also
known as the Dempster-Shafer (D-S) Theory (Shafer, 1976; Denoeux, 1995; Dong et al.,
2010; Sentz et al., 2002). It is related to traditional probability and set theory but is not the
same. The D-S theory allows probability to be assigned to a set of atomic elements rather
than only to individual atomic elements, so it can represent not only the likelihood of
occurrence of an event but also the uncertainty associated with it.
Using the D-S Theory, evidence coming from multiple sources with varying levels of
certainty can be effectively combined online. Its ease of use, combined with its wide
and successful application in many areas, makes it an ideal candidate for click
fraud detection, which requires a complex model with several evidences.
In our problem domain a user can either be a fraud or not a fraud (~fraud). So we
have a finite set of hypotheses (atomic elements) in the problem domain U = {fraud, ~fraud}.
The power set of U is the set {{fraud}, {~fraud}, U, ∅}. Each of the four elements in the
power set is assigned a belief between 0 and 1. {fraud} represents the belief of the user
being a fraud; {~fraud} represents the belief of the user being not fraud; U represents the
belief that the user is either fraud or ~fraud without committing to one of them, and thus it
represents the uncertainty; ∅ is the empty (null) set, which represents a contradiction, and
thus its belief is always 0. D-S Theory assigns belief to all the elements of this power set
of U rather than only to the mutually exclusive events of U. The sum of all belief values in
the power set of U is 1.
Mass Functions
A degree of belief is represented by a function called the mass function m, which
provides a probability assignment to any A ⊆ U, where m(∅) = 0 and m(fraud) + m(~fraud) +
m(U) = 1.

$$
\begin{aligned}
m(\emptyset) &= 0 \\
m(\text{fraud}) &\in [0, 1] \\
m({\sim}\text{fraud}) &\in [0, 1] \\
m(U) &\in [0, 1] \\
\sum_{X \subseteq U} m(X) &= 1
\end{aligned}
$$
The mass m(A) represents a belief exactly on A. For example, U = {fraud, ~fraud}
represents the hypothesis that a user may be either fraud or ~fraud. A situation in which
m({fraud, ~fraud}) = 1 occurs when the evidence provides no certainty at all, and this
cannot be adequately represented with traditional probability theory. A belief mass is
therefore different from a probability. As seen above, the probabilities are assigned to
sets rather than to mutually exclusive singletons (Shafer, 1976; Sentz et al., 2002). When
the probabilities are assigned only to mutually exclusive events, i.e. either fraud or
~fraud, so that m(U) is always 0, D-S Theory becomes the same as probability theory. For
every mass function, there are associated functions of belief and plausibility. The degree of
belief in A, bel(A), and the plausibility of A, pl(A), are defined respectively as:

$$
\mathrm{bel}(A) = \sum_{X \subseteq A} m(X)
$$

$$
\mathrm{pl}(A) = 1 - \mathrm{bel}({\sim}A) = \sum_{X \cap A \neq \emptyset} m(X)
$$

For example, bel({fraud}) = m({fraud}) + m(∅) = m({fraud}). In general, bel(A) =
m(A) for any singleton set A ⊆ U, and in such a case the computation of bel is greatly
simplified. However, bel(A) is not necessarily the same as m(A) when A is not a singleton
set. Note that m, bel and pl can be derived from one another, and that belief and probability
are different measures. In this thesis, we use the terms likelihood and belief synonymously.
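The definitions of bel and pl can be sketched in Python as follows. Representing a mass function as a dictionary keyed by frozensets is an implementation choice made here for illustration, and the mass values in the example are hypothetical, not taken from the thesis data.

```python
# Frame of discernment U = {fraud, ~fraud}; focal sets are frozensets.
FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
U = FRAUD | NOT_FRAUD


def bel(m, a):
    """Belief in `a`: total mass of all non-empty focal sets contained in `a`."""
    return sum(mass for x, mass in m.items() if x and x <= a)


def pl(m, a):
    """Plausibility of `a`: total mass of all focal sets that intersect `a`."""
    return sum(mass for x, mass in m.items() if x & a)


# Hypothetical mass function: the three masses sum to 1 and m(empty set) = 0.
m = {FRAUD: 0.6, NOT_FRAUD: 0.1, U: 0.3}

belief_fraud = bel(m, FRAUD)        # mass committed exactly to {fraud}
plausibility_fraud = pl(m, FRAUD)   # also counts the uncommitted mass on U
```

In this representation, pl(m, FRAUD) and 1 - bel(m, NOT_FRAUD) agree, which matches the identity pl(A) = 1 - bel(~A) given above.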
For our approach we use multiple pieces of evidence, each of which contributes to either a
belief or a disbelief that a user is a fraud, depending on the nature of the evidence and its
quantified value (Dong et al., 2010). For example, if a user clicks many times on an ad, it
becomes evidence that the user is a fraud. Each piece of evidence can support either fraud or
~fraud for a user, but not both. If a piece of evidence supports fraud, the rest of its belief
can commit only to the universal set U, which quantifies the uncertainty. If
evidence i supports that the user is fraud then the mass functions for the evidence are defined
as follows:
mi(fraud) = α*f
mi (~fraud) = 0
mi (U) = 1 - α*f
where 0 < α < 1 is an empirically derived value that signifies the strength of the evidence
in supporting that the user is fraud, and 0 < f < 1 is a function that quantifies the evidence.
If evidence i supports that the user is ~fraud then the mass functions for the evidence
are defined as follows:
mi(fraud) = 0
mi (~fraud) = β*g
mi (U) = 1 - β*g
where 0 < β < 1 is an empirically derived value that signifies the strength of the evidence in
supporting that the user is ~fraud, and 0 < g < 1 is a function that quantifies the evidence.
Combination Rule
Since we have multiple mass functions, we need a way to combine them. Mass
functions can be combined using various rules, including the popular Dempster's Rule of
Combination, which is a generalization of the Bayes rule. For X, A, B ⊆ U, the combination
of mass functions m1 and m2, denoted by m1 ⊕ m2 (or m1,2), is defined as follows:
$$m_{1,2}(X) = (m_1 \oplus m_2)(X) = \frac{\sum_{A \cap B = X} m_1(A)\, m_2(B)}{1 - K}, \qquad X \neq \emptyset,$$
$$\text{where } K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B) \quad \text{and} \quad (m_1 \oplus m_2)(\emptyset) = 0.$$
The combination rule can be applied in pairs repeatedly to obtain a combination of
multiple mass functions. The above rule strongly emphasizes the agreement between multiple
sources of evidence and ignores the disagreement by the use of a normalization factor.
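As a rough illustration of the combination rule, the following Python sketch combines two mass functions defined over subsets of U. The dictionary-of-frozensets representation, the function name `combine`, and the example masses are assumptions for illustration and are not part of the thesis.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]

FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
U = FRAUD | NOT_FRAUD


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: multiply masses of intersecting sets, renormalize by 1 - K."""
    joint: Mass = {}
    conflict = 0.0  # K: mass that falls on the empty intersection
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        x = a & b
        if x:
            joint[x] = joint.get(x, 0.0) + va * vb
        else:
            conflict += va * vb
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict (K = 1)")
    return {x: v / (1.0 - conflict) for x, v in joint.items()}


# Hypothetical masses: one source supports fraud, the other supports ~fraud.
m1 = {FRAUD: 0.6, U: 0.4}
m2 = {NOT_FRAUD: 0.5, U: 0.5}
m12 = combine(m1, m2)
for subset, value in m12.items():
    print(sorted(subset), round(value, 4))
```

Here the conflicting product m1(fraud)·m2(~fraud) is removed and the remaining masses are rescaled, which is the normalization step described above.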
CHAPTER IV
PROPOSED DEMPSTER SHAFER THEORY FOR CLICK FRAUD DETECTION
We propose an approach that can be used by the advertisers to detect fraud in real time
using data available to them, without any data from the broker, which can be impossible
to acquire or very limited when it is available. This section describes our approach in detail and
the mass functions that have been developed to compute the belief of fraud.
The Core Element of Dempster Shafer Theory
Figure 4.1 below shows the framework elements of click fraud detection using our
approach. A user's clicking activity is captured by the advertiser's web server logs. The server
logs are updated in real time as users request pages from the server, and the click fraud
detection system reads this data as soon as it is logged. For the latest click that the system is
processing, it finds the IP address and reads all the log data from that IP in the window W.
This data is pre-processed to extract meaningful
evidences, which are then formulated into various mass functions. Each mass function computes a
belief of fraud which is unique and can conflict with the beliefs from other mass functions.
These beliefs are combined using Dempster's combination rule. The combined belief is
categorized into fraud, ~fraud or suspicious by using a set of threshold values. This process is
repeated for every new user click.
Figure 4.1 Click fraud detection framework using D-S theory
Mass functions for Click Fraud Detection
Using the user behavior from the weblogs at the advertiser's site as evidence to
reason about click fraud, we formulate a mass function for each core piece of evidence.
The evidence comes from various factors such as the number of clicks on the ad, the time
spent browsing the advertiser site, etc. The mass functions are used to compute a belief value on
the click being fraud or not fraud (~fraud). The belief values from different evidences are
combined as each of them occurs in the data. A mass function contributes to either a belief or
a disbelief that a user is a fraud depending on its nature and its quantified value. The following
gives detailed formulae of the mass functions based on each evidence. The values αi and βi for
evidence i represent the strength of the evidence in the mass function formulation (mi). In
practice these values will be empirically derived.
Evidence 1: Number of clicks on the ad
If the number of clicks on the ad from an IP in the time window W (30 minutes) is
high, then likelihood of the user being a fraud is high. Fraudsters have a natural incentive of
making more money by clicking the ads many times in a short period of time (short bursts).
The more they click, the more illegal revenue they generate for themselves. The Basic Mass
Assignment (BMA) for this evidence will always support a belief of fraud whose value
depends on the number of clicks.
Let n be the number of clicks in the window W.
Likelihood of the fraud = 1 – 1/n
m1( fraud) = α1 (1-1/n) (1)
m1 (~fraud) = 0 (2)
m1 (U) = 1 - m1 ( fraud ) = 1 – α1 (1-1/n) (3)
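Equations (1) through (3) can be sketched in Python as follows. The function name `clicks_mass` and the tuple layout (m(fraud), m(~fraud), m(U)) are illustrative assumptions; the default α1 = 0.8 is the coefficient listed for Evidence 1 in Table 5.3.

```python
from typing import Tuple


def clicks_mass(n: int, alpha: float = 0.8) -> Tuple[float, float, float]:
    """Return (m1(fraud), m1(~fraud), m1(U)) for n ad clicks in the window W."""
    if n < 1:
        raise ValueError("n must be at least 1 (the click being processed)")
    likelihood = 1.0 - 1.0 / n      # 0 at the first click, approaches 1 as n grows
    m_fraud = alpha * likelihood
    return (m_fraud, 0.0, 1.0 - m_fraud)


for n in (1, 2, 3, 6, 20):
    print(n, clicks_mass(n))
```

Because 1 - 1/n rises quickly for small n and then flattens, most of the growth in belief happens over the first few clicks.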
Evidence 2: Time spent in browsing
If the time spent by the user at the ad-site is high then he/she is less likely to be a
fraud. A genuine user will click the ad due to a real interest in the advertiser's content (advertised
product, service or website content) and is likely to spend more time exploring the ad-site
than a fraudster. Fraudsters are less likely to do so since they are not interested in the product
and want to make more clicks in a given time. The BMA for this rule will always
support a belief of ~fraud whose value depends on the time spent at the ad-site. As a user
continues to spend more time at the ad-site the belief that he/she is ~fraud will increase.
Let t be the time spent by the user in all visits in the time window W (30 minutes) where 0 < t
<= 30 minutes. The likelihood of ~fraud increases as t increases.
m2 (fraud) = 0 (4)
m2 (~ fraud) = β2 *(t/W) (5)
m2 (U) = 1 - m2 (~ fraud ) = 1 – β2* (t/W) (6)
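A minimal Python sketch of equations (4) through (6) is given below. The function name `browsing_mass`, the use of seconds as the unit, and the clamping of t to the window are assumptions for illustration; the default β2 = 0.99 is the coefficient listed for Evidence 2 in Table 5.3.

```python
from typing import Tuple


def browsing_mass(t_seconds: float,
                  window_seconds: float = 1800.0,
                  beta: float = 0.99) -> Tuple[float, float, float]:
    """Return (m2(fraud), m2(~fraud), m2(U)) for t seconds spent within window W."""
    if t_seconds < 0 or window_seconds <= 0:
        raise ValueError("times must be non-negative and the window positive")
    t = min(t_seconds, window_seconds)   # assumption: time is capped at the window size
    m_not_fraud = beta * (t / window_seconds)
    return (0.0, m_not_fraud, 1.0 - m_not_fraud)


print(browsing_mass(180))    # six short visits of 30 seconds each
print(browsing_mass(1500))   # a user who browses for most of the window
```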
Evidence 3: Ad-Visit after non-ad visit
If a user clicks on an ad after a non-ad visit, then he/she is likely to be a fraud. Once a user
makes a non-ad visit to the ad-site, it implies that the user is aware of how to reach the site apart
from clicking on the ad. Clicking on an ad after that seems unnecessary and indicates a
likelihood of fraud. The BMA for this rule can support a belief of either fraud or ~fraud
behavior.
Let x be the likelihood of fraud. If the user has visited only via ads then x=0.1 (little
likelihood of fraud). If the user has visited via ads after visiting normally then x=1.0 (high
likelihood of fraud). Thus the mass functions when the evidence supports fraud are as
follows:
m3 (fraud) = α3 *(x) (7)
m3 (~ fraud) = 0 (8)
m3 (U) = 1 - m3 ( fraud ) = 1 - α3*(x) (9)
Let y=1.0 be the likelihood of ~fraud if the user does not have an ad-visit after a non-ad visit.
The mass functions if the evidence supports ~fraud are as follows:
m3 (fraud) = 0 (10)
m3 (~ fraud) = β3 *(y) (11)
m3 (U) = 1 - m3 ( ~fraud ) = 1 – β3 *(y) (12)
Evidence 4: Time of Click
If the click occurred in the most suspicious time (or most active period of fraud
activity) then the user is likely to be a fraud. Fraudsters are generally known to be active
during certain hours of the day and a click at such hours can be indicative of fraudulent
activity. We follow Universal Time to determine this and not any particular time zone. If a
click happens within the suspicious time slot then the click is likely to be a fraud,
otherwise ~fraud. The BMA for this rule will support a belief of fraud if the time of click lies
in the suspicious time range. Otherwise it will support a belief of ~fraud.
Let Tstart and Tend be the start and end of the suspicious time range, t be the time of click.
Let x=1.0 be the likelihood of fraud if t lies between Tstart and Tend. The mass functions when
the evidence supports fraud are as follows:
m4 (fraud) = α4*(x) (13)
m4 (~ fraud) = 0 (14)
m4 (U) = 1 - m4 ( fraud ) = 1 – α4*(x) (15)
Let y=1.0 be the likelihood of ~fraud if t does not lie between Tstart and Tend. The mass
functions when the evidence supports ~fraud are as follows:
m4 (fraud) = 0 (16)
m4 (~fraud) = β4*(y) (17)
m4 (U) = 1 - m4 (~ fraud ) = 1 – β4*(y) (18)
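Equations (13) through (18) describe a two-sided evidence, which could be sketched in Python as below. The function name, the handling of a range that wraps past midnight, and the 02:00 to 04:00 UTC range in the usage lines are hypothetical choices, since the thesis does not state Tstart and Tend here; the defaults α4 = 0.2 and β4 = 0.01 are the Evidence 4 coefficients from Table 5.3.

```python
from datetime import datetime, time, timezone
from typing import Tuple


def time_of_click_mass(click: datetime, t_start: time, t_end: time,
                       alpha: float = 0.2,
                       beta: float = 0.01) -> Tuple[float, float, float]:
    """Return (m4(fraud), m4(~fraud), m4(U)) for a timezone-aware click time."""
    if click.tzinfo is None:
        raise ValueError("click must be timezone-aware so it can be converted to UTC")
    t = click.astimezone(timezone.utc).time()
    if t_start <= t_end:
        suspicious = t_start <= t <= t_end
    else:
        # Assumption: a range such as 23:00 to 03:00 wraps past midnight.
        suspicious = t >= t_start or t <= t_end
    if suspicious:
        return (alpha * 1.0, 0.0, 1.0 - alpha * 1.0)   # x = 1.0
    return (0.0, beta * 1.0, 1.0 - beta * 1.0)         # y = 1.0


# Hypothetical suspicious range, in Universal Time.
T_START, T_END = time(2, 0), time(4, 0)
night = datetime(2012, 3, 5, 2, 23, tzinfo=timezone.utc)
noon = datetime(2012, 3, 5, 13, 0, tzinfo=timezone.utc)
print(time_of_click_mass(night, T_START, T_END))
print(time_of_click_mass(noon, T_START, T_END))
```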
Evidence 5: Place of origin of click
If the click originated from a location (country, state or city) where the advertiser has
no business then the user is likely to be a fraud. Ads are often targeted at the audience of a
particular region where the advertisers have a reach or rights to sell their products. This is
especially true for small and medium sized businesses that are restricted to a country or city.
Even large advertisers mostly advertise to a local clientele; for example, a car company that sells
in many countries has different ads based on the different models it sells in each country.
If a click originates from a location outside of the advertiser's region of business then it is likely
to be fraud, as the user will get no value from such a click. It is also notable that in some
countries the laws against cyber fraud are very weak, and fraudsters use this fact to
their advantage. To avoid prosecution, fraudsters use IP addresses originating from these
countries, through bots or by hiring people at low cost (many of whom do not realize that their
act is causing huge losses to advertisers), to carry out the fraud (Kshetri, 2010). As
a result such clicks have high suspicion associated with them. This rule has the ability to limit
a range of fraudulent attacks which depend on using IP addresses from varied geographical
locations (these include the use of both humans and bots). The BMA for this rule supports a
belief of fraud if the click originated from a region outside of the advertiser's business and a
belief of ~fraud otherwise.
Let x=1.0 be the likelihood of fraud if the click originated from a region outside of the
advertiser's business. The mass functions when the evidence supports fraud are as follows:
m5 (fraud) = α5 *(x) (19)
m5 (~ fraud) = 0 (20)
m5 (U) = 1 - m5 ( fraud ) = 1 - α5*(x) (21)
Let y=1.0 be the likelihood of ~fraud if the click originated from a region within the
advertiser's business. The mass functions when the evidence supports ~fraud are as follows:
m5 (fraud) = 0 (22)
m5 (~fraud) = β5*(y) (23)
m5 (U) = 1 - m5 (~ fraud ) = 1 - β5*(y) (24)
Evidence 6: Creating of membership
If the user creates a membership account (registers as a member) with the advertiser, then
he/she is less likely to be a fraud. A genuine user may or may not create such an account.
Fraudsters, however, are less likely to register themselves at the ad-site or create a membership
account, as they have no incentive to do so and because it requires them to spend some
time and give out some information such as email, address, etc. The BMA for this rule supports a
belief of ~fraud if a membership account was created, and otherwise supports a negligible belief of
fraud.
Let x=1 be the likelihood of fraud if a membership account is not created. The mass functions when the
evidence supports fraud are as follows:
m6 (fraud) = α6* (x) (25)
m6 (~fraud) = 0 (26)
m6 (U) = 1 - m6 ( fraud ) = 1 - α6 *(x) (27)
Let y=1 be the likelihood of ~fraud if a membership account is created. The mass functions
when the evidence supports ~fraud are as follows:
m6 (fraud) = 0 (28)
m6 (~ fraud) = β6 *(y) (29)
m6 (U) = 1 - m6 ( ~fraud ) = 1 - β6 *(y) (30)
Evidence 7: Adding a product in shopping cart
If the user adds a product to his/her shopping cart, then he/she is less likely to be a fraud.
Due to a lack of genuine interest in the advertiser's products or services, a fraudster is less
likely to use a shopping cart. Using a shopping cart requires the user to spend time, for which a
fraudster has no incentive. The BMA for this rule supports a belief of ~fraud if a product was
added to a cart, and otherwise supports a negligible belief of fraud.
Let x=1.0 be the likelihood of fraud if the user does not add any product to his shopping cart. The
mass functions when the evidence supports fraud are as follows:
m7 (fraud) = α7* (x) (31)
m7 (~fraud) = 0 (32)
m7 (U) = 1 - m7 ( fraud ) = 1 – α7 *(x) (33)
Let y=1.0 be the likelihood of ~fraud if the user adds a product to his shopping cart. The mass
functions when the evidence supports ~fraud are as follows:
m7 (fraud) = 0 (34)
m7 (~ fraud) = β7*(y) (35)
m7 (U) = 1 - m7 ( ~fraud ) = 1 - β7*(y) (36)
Individually, the evidences are not sufficient to determine the likelihood of a user
being fraud or ~fraud. Each evidence may give a different or contradicting belief of fraud
depending on its nature. But upon combination they provide a highly accurate estimate.
Thus, the likelihood of a click being fraudulent is estimated by combining the beliefs obtained
from the corresponding mass functions for each of the supporting evidences. To define the rule
for combining mass functions, let m1 and m2 be two distinct mass functions of a
particular click. Dempster's rule of combination can be applied as shown below. For
readability, we omit i, and replace {fi}, {~fi} and Ui by f, ~f and U, respectively.
m1,2(f) = (m1(f)m2(f) + m1(f)m2(U) + m1(U)m2(f)) / (1-K)
m1,2(~f) = (m1(~f)m2(~f) + m1(~f)m2(U) + m1(U)m2(~f)) / (1-K)
m1,2(U) = (m1(U)m2(U)) / (1-K),
where K = m1(f)m2(~f) + m1(~f)m2(f).
This combination rule can be applied repeatedly pair-wise until all the evidences
have been incorporated into the computation of the likelihood of fraud. Our
proposed approach certifies the clicks based on the corresponding likelihood of them being
fraudulent using the beliefs combined from all of the evidences. Table 4.1 below describes the
thresholds that we have empirically derived from our experiments and tests.
Table 4.1 Fraud certification rules
Category      Lower   Upper
Not Fraud     0       0.499
Suspicious    0.5     0.649
Fraud         0.65    1
A combined belief of fraud < 0.5 indicates ~fraud. A combined belief of fraud >= 0.65
indicates fraud and all values in between indicate a suspicion.
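The certification rule of Table 4.1 can be written as a small Python function. The function name `certify` and the returned label strings are illustrative assumptions; the thresholds 0.5 and 0.65 are the ones stated above.

```python
def certify(belief_fraud: float,
            suspicious_from: float = 0.5,
            fraud_from: float = 0.65) -> str:
    """Map a combined belief of fraud to a deduction using the Table 4.1 thresholds."""
    if not 0.0 <= belief_fraud <= 1.0:
        raise ValueError("a belief value must lie in [0, 1]")
    if belief_fraud >= fraud_from:
        return "fraud"
    if belief_fraud >= suspicious_from:
        return "suspicious"
    return "not fraud"


for b in (0.45, 0.5, 0.64, 0.65, 0.84):
    print(b, certify(b))
```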
CHAPTER V
DATA SET & ILLUSTRATION
In this section we give a detailed explanation of the data that we use in our approach.
We also show an illustrated example using our data set with our approach.
Data Description
Click data is not publicly available. Any real weblog data from a web server is the
property of the owner of the server and is not made public due to privacy concerns.
Moreover, such data needs to be cleaned to extract the data in a relevant format. This is a
time-consuming process and is not a focus of our research. For these reasons we use synthetic
data for our research. Furthermore, we can manipulate synthetic data and add patterns of fraud
for evaluating different click fraud scenarios.
The data represent weblogs from the advertiser's web server. For our experiments and
evaluations we synthesize log data in combined log format (CLF). We pre-process the raw
logs and extract the following information from them for each user in real time: IP address of
the remote computer requesting the web page; time and date of request; the page that was
requested; and the Gclid number. The region from which the click originated can be easily
extracted from the IP address by using one of the many geo location services which map the
IP to a place using a geo location database. Table 5.1 below shows sample data extracted
from the server logs.
Table 5.1 Sample log data
IP Address     Click No   Gclid No   Time of click    Requested Page   Referrer
172.16.276.3   1          1001       3/5/2012 1:50    index.htm        adsite.htm
172.16.276.3   2          1002       3/5/2012 1:56    index.htm        adsite.htm
172.16.276.3   3          1002       3/5/2012 1:59    page1.htm        index.htm
172.16.276.3   4          1002       3/5/2012 2:01    page2.htm        page1.htm
172.16.276.3   5          null       3/5/2012 2:05    index.htm        google.com
172.16.276.3   6          null       3/5/2012 2:08    page1.htm        index.htm
172.16.276.3   7          null       3/5/2012 2:10    page2.htm        page1.htm
172.16.276.3   8          null       3/5/2012 2:14    index.htm        null
172.16.276.3   9          null       3/5/2012 2:16    page1.htm        index.htm
172.16.276.3   10         null       3/5/2012 2:17    page2.htm        page1.htm
Each row of Table 5.1 above represents an HTTP request made by the user to the
advertiser's web server. Whenever a user requests content from the advertiser an HTTP
request is generated. Below are some observations which describe the data represented by
Table 5.1.
- Every row represents a click by the user requesting content from the ad-site.
- All the clicks in the table above are by the same user since the IP address is the same for
  all rows of the log.
- Index.htm is the landing page. Every time index.htm is the requested page, it implies a
  new visit. Table 5.1 has 4 unique visits.
- A non-null Gclid number implies an ad-visit. Click numbers 1 through 4 belong to
  ad-visits since they have a valid Gclid number attached.
- Two different Gclid numbers above imply two different ad-visits. The first click with
  Gclid number 1001 implies an ad-visit. Since there is only 1 row with Gclid number 1001,
  it implies that the user did not make any other page requests after landing on the ad-site
  during the first ad-visit. The second click with Gclid number 1002 is also an ad-visit.
  However, in this visit the user also requested page1.htm and page2.htm (click numbers 3
  and 4).
- Each row with a null Gclid number implies a non-ad visit. Click numbers 5 through 10
  correspond to two non-ad visits.
- Click number 5 corresponds to the first non-ad visit and the third visit overall. The visitor was
  referred to the ad-site by Google search since google.com is the referrer. After landing the
  user requested two more pages in the same visit, page1.htm and page2.htm.
- Click number 8 corresponds to the second non-ad visit and the fourth visit overall. A null referrer
  implies that the user may have typed the ad-site's URL into his/her browser or had previously
  bookmarked the site and clicked on the bookmark. After landing the user requested two
  more pages in the same visit, page1.htm and page2.htm.
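The observations above amount to a simple rule for grouping log rows into visits, which the following Python sketch illustrates. The row layout (click number, Gclid or None, requested page), the helper names `split_visits` and `is_ad_visit`, and the decision to classify a visit by the Gclid of its landing request are assumptions for illustration; the rows are transcribed from Table 5.1.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[int], str]  # (click number, Gclid or None, requested page)

ROWS: List[Row] = [
    (1, 1001, "index.htm"), (2, 1002, "index.htm"),
    (3, 1002, "page1.htm"), (4, 1002, "page2.htm"),
    (5, None, "index.htm"), (6, None, "page1.htm"), (7, None, "page2.htm"),
    (8, None, "index.htm"), (9, None, "page1.htm"), (10, None, "page2.htm"),
]


def split_visits(rows: List[Row], landing: str = "index.htm") -> List[List[Row]]:
    """Start a new visit each time the landing page is requested."""
    visits: List[List[Row]] = []
    for row in rows:
        if row[2] == landing or not visits:
            visits.append([])
        visits[-1].append(row)
    return visits


def is_ad_visit(visit: List[Row]) -> bool:
    """A visit is an ad-visit when its landing request carries a Gclid number."""
    return visit[0][1] is not None


visits = split_visits(ROWS)
for number, visit in enumerate(visits, start=1):
    kind = "ad-visit" if is_ad_visit(visit) else "non-ad visit"
    print(number, kind, [row[0] for row in visit])
```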
We will use a timeline diagram to help illustrate our inputs (like Table 5.1) for the rest
of the thesis. Figure 5.1 shows the legends for the diagram and Figure 5.2 shows a timeline
diagram corresponding to the input from Table 5.1.
Figure 5.1 Legends for timeline diagram
Figure 5.2 Timeline diagram sample data in Table 5.1
A timeline diagram is a visual representation of a user's clicking data from the server
weblogs. Just by looking at Figure 5.2 we can easily make certain observations. The user has
made 4 unique visits. The first two visits were ad-visits and the last two were non-ad visits.
The width of the session blocks indicates session durations. The first visit was a very short
session in which the user did not request any pages after landing. In all the other visits the
user requested two other pages and the session durations are longer. The start and end times of
every session are also given. Lastly, we can see that the user neither logged in as a member in
any of the sessions nor used a shopping cart.
Example of belief computation using mass function and combination
In this example we analyze and compute the belief of a user being fraud or ~fraud
using our approach. The purpose is to explain the approach and the computations involved
by means of a simple example. Table 5.2 below gives a sample input.
Table 5.2 Input from server log
IP Address Click No Gclid No Time of click Requested Page Referrer
172.16.276.3 1 1001 3/5/2012 1:56 index.htm adsite.htm
172.16.276.3 2 1002 3/5/2012 2:01 index.htm adsite.htm
172.16.276.3 3 1003 3/5/2012 2:07 index.htm adsite.htm
172.16.276.3 4 1004 3/5/2012 2:13 index.htm adsite.htm
172.16.276.3 5 1005 3/5/2012 2:18 index.htm adsite.htm
172.16.276.3 6 1006 3/5/2012 2:23 index.htm adsite.htm
From Table 5.2 above we can easily conclude that the user made six ad-visits. The
user did not request any page of the ad-site other than index.htm. Figure 5.3 below shows the
timeline diagram for the data corresponding to Table 5.2.
Figure 5.3 Timeline diagram for Table 5.2
As soon as a row is logged corresponding to a user activity, the system reads it
immediately and computes the mass beliefs for each piece of evidence, which are then
combined to get an overall belief score using Dempster's combination rule. For Table 5.2
above, six belief values will be computed, one for every click. Thus the belief about
the user is updated with every user click.
The evidence combination process combines the beliefs from each, possibly conflicting, evidence
and gives a belief score for each click of a user. To demonstrate our approach we will work out
the calculation of belief values at the 6th click. Please note that we use the α and β values from
Table 5.3. These values have been derived empirically from our experiments and will be used
in all our computations.
Table 5.3 Coefficient values
Evidence No α β
1 0.8 -
2 - 0.99
3 0.6 0.2
4 0.2 0.01
5 0.4 0.1
6 0.02 0.25
7 0.01 0.2
Evidence 1 always supports a belief of fraud and therefore at the 6th click on the ad the mass
function values are:
m1 (fraud) = 0.8* (1-1/6) = 0.667
m1 (~fraud) = 0
m1 (U) = 1 - m1 (fraud) = 1 - 0.8 *(1-1/6) = 0.333
Evidence 2 always supports a belief of ~fraud. The user spends 30 seconds in each visit since
he/she does not open any other page and therefore the total time spent is 180 seconds. The
window size W is 1800 seconds. Therefore the mass function values are:
m2 (~ fraud) = 0.99 *(180/1800) = 0.099
m2 (fraud) = 0
m2 (U) = 1 - m2 (~ fraud) = 1 - 0.99* (180/1800) = 0.901
Evidence 3 supports a little belief of fraud since there was no non-ad visit by the user.
Therefore the mass function values are:
m3 (fraud) = 0.6* (0.1) = 0.06
m3 (~ fraud) = 0
m3 (U) = 1 - m3 (fraud) = 1 - 0.6 *(0.1) = 0.94
Evidence 4 supports a belief of fraud since the 6th click occurs at a suspicious time (2:23 AM).
Therefore the mass function values are:
m4 (fraud) = 0.2*(1) = 0.2
m4 (~ fraud) = 0
m4 (U) = 1 - m4 (fraud) = 1 - 0.2*(1) = 0.8
Evidence 5 supports a belief of fraud since we assume that the IP originates from a region
outside the area of business of the advertiser. Therefore the mass function values are:
m5 (fraud) = 0.4 *(1) = 0.4
m5 (~ fraud) = 0
m5 (U) = 1 - m5 (fraud) = 1 - 0.4 *(1) = 0.6
Evidences 6 and 7 support a little belief of fraud since no product was added to a shopping cart
and no membership account was created. Therefore the mass function values are:
m6 (fraud) = 0.02 *(1) = 0.02
m6 (~fraud) = 0
m6 (U) = 1 - m6 (fraud) = 1 - 0.02*(1) = 0.98
m7 (fraud) = 0.01* (1) = 0.01
m7 (~fraud) = 0
m7 (U) = 1 - m7 (fraud) = 1 - 0.01* (1) = 0.99
From Table 5.4 below we can observe that each mass function gives a varying degree
of belief values and these can be conflicting.
Table 5.4 Mass function beliefs for illustrated example
belief(fraud) belief(~fraud)
m1 0.667 0
m2 0 0.099
m3 0.06 0
m4 0.2 0
m5 0.4 0
m6 0.02 0
m7 0.01 0
Now we can apply Dempster's rule of combination to get the combined belief
about the user from the mass beliefs in Table 5.4.
K = m1(f)m2(~f) + m1(~f)m2(f) = 0.066
1-K = 0.934
m1,2(f) = (m1(f)m2(f) + m1(f)m2(U) + m1(U)m2(f)) / (1-K) = 0.643
m1,2(~f) = (m1(~f)m2(~f) + m1(~f)m2(U) + m1(U)m2(~f)) / (1-K) = 0.035
m1,2(U) = m1(U)m2(U) / (1-K) = 0.321
m1,2 is the combined mass belief from mass functions 1 and 2. Next we combine this with
the mass function for evidence 3 to get the combined mass belief m1,2,3
K = m1,2(f)m3(~f) + m1,2(~f)m3(f) = 0.0021
1-K = 0.998
m1,2,3(f) = (m1,2(f)m3(f) + m1,2(f)m3(U) + m1,2(U)m3(f)) / (1-K) = 0.664
m1,2,3(~f) = (m1,2(~f)m3(~f) + m1,2(~f)m3(U) + m1,2(U)m3(~f)) / (1-K) = 0.0333
m1,2,3(U) = m1,2(U)m3(U) / (1-K) = 0.303
The above belief combination repeats until no more evidence needs to be considered.
Thus, the belief of the hypothesis that click 6 is fraudulent is calculated in accumulative
fashion. Following the procedure we go on to get the combined belief of all mass beliefs
m1,2,3….7
m1,2,3….7(f) = 0.840
m1,2,3….7(~f) = 0.016
m1,2,3….7(U ) = 0.144
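The pairwise accumulation in this example can be checked with a short Python sketch. The tuple layout (m(f), m(~f), m(U)) and the names `combine` and `MASSES` are illustrative assumptions; the seven mass functions are the values worked out above for the 6th click, and the fold simply applies the three-term form of Dempster's rule from left to right.

```python
from functools import reduce
from typing import List, Tuple

Mass = Tuple[float, float, float]  # (m(f), m(~f), m(U))


def combine(a: Mass, b: Mass) -> Mass:
    """Dempster's rule for the frame {f, ~f} with masses on f, ~f and U."""
    f1, n1, u1 = a
    f2, n2, u2 = b
    k = f1 * n2 + n1 * f2          # conflict K
    norm = 1.0 - k
    return ((f1 * f2 + f1 * u2 + u1 * f2) / norm,
            (n1 * n2 + n1 * u2 + u1 * n2) / norm,
            (u1 * u2) / norm)


m1_fraud = 0.8 * (1 - 1 / 6)
MASSES: List[Mass] = [
    (m1_fraud, 0.0, 1 - m1_fraud),  # evidence 1: number of clicks
    (0.0, 0.099, 0.901),            # evidence 2: time spent in browsing
    (0.06, 0.0, 0.94),              # evidence 3: ad-visit after non-ad visit
    (0.2, 0.0, 0.8),                # evidence 4: time of click
    (0.4, 0.0, 0.6),                # evidence 5: place of origin
    (0.02, 0.0, 0.98),              # evidence 6: membership
    (0.01, 0.0, 0.99),              # evidence 7: shopping cart
]

combined = reduce(combine, MASSES)
print("m(f)  = %.3f" % combined[0])
print("m(~f) = %.3f" % combined[1])
print("m(U)  = %.3f" % combined[2])
```

Small differences in the third decimal place relative to the hand calculation can arise because the intermediate values above are rounded at each step.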
As we can see, the belief (fraud) of 0.84 is clearly above the threshold for
fraud (0.65) given in Table 4.1 and so the user is certified as fraud. Figure 5.4 gives a
graphical representation of the combined belief of fraud over all the 6 clicks made by the user
(in this example we have worked out the mass value computation of the 6th click only, but the
figure plots the mass values computed for all clicks from the 1st through the 6th). We can easily
observe how the combined belief changes as more clicks are made.
Figure 5.4 Combined belief of fraud for input in Figure 5.3
CHAPTER VI
EVALUATION
In this section we present two case studies (scenarios), each of which corresponds to a
different type of click fraud attack. In case study 1 we present a scenario where a human user
is trying to perform click fraud and uses different click patterns in order to avoid detection. In
case study 2 we present a scenario where a software bot is used to perform click fraud and it
tries to make detection difficult by using multiple IP addresses. In both cases we present
our output and show that our approach is able to successfully detect click fraud. We will
discuss the generality of our solution in Chapter VII.
Case Study 1
We present a scenario where a human user is trying to commit click fraud and avoid
detection by giving the impression of a regular user. Figure 6.1 below shows the user activity
for the test case.
Figure 6.1 Timeline input for Case Study 1
A fraudster needs to repeatedly click on the ad in order to make a substantial profit. In
this case the fraudster clicks the ad seven times (leading to seven ad-visits). The fraudster also
enters the ad-site via a regular search (non-ad visit) to give a stronger impression of a regular
user. He/she spends time on the site after landing (with random session durations) and carries
out activities like opening 32 links in the ad-site after landing, creating a membership account
and adding a product to his/her shopping cart.
Below we describe the belief computed from every mass function and the combined
belief in figures 6.2 through 6.9. We have plotted the belief value with time (in the range of
window W). Please note that some of the functions support both fraud and ~fraud at different
times depending on the input and thus they can have both types of beliefs at different times. In
these cases we just show belief of fraud for the purpose of clarity. Also note that whenever a
function supports belief in ~fraud then the belief in fraud becomes 0 and vice versa.
Figure 6.2 below shows the belief computed from Mass Function 1 (Number of clicks
on the ad) according to which if the number of clicks on the ad from an IP in the time window
W (30 minutes) is high, then the likelihood of the user being a fraud is high. Mass Function 1
supports only a belief of fraud and the belief at the first click on the ad is 0. The belief
increases as more clicks are made on the ad. The increase is faster in the first five clicks due
to the nature of the function. It is notable that the belief of fraud does not increase in the fourth
visit as it is a non-ad visit. This function does not consider any other user activity apart from
the number of clicks on the ad. Therefore user activities like a non-ad visit (fourth visit), adding
products to shopping cart etc. do not affect the belief of this mass function.
Figure 6.2 Belief of fraud from mass function 1
Figure 6.3 below shows the belief computed from Mass Function 2 (Time spent in
browsing) according to which if the time spent by the user at the ad-site is high then he/she is
less likely to be a fraud. This function supports only the belief of ~fraud. In this case study the
user spent time in every session and this is reflected in an increasing belief of ~fraud. This
belief clearly contradicts the belief from Mass Function 1 which supports a belief of fraud.
The fraudster has spent a considerable time browsing the ad-site during every visit to give an
impression of a genuine user. As we can see below the user has a high belief of ~fraud at the
end.
Figure 6.3 Belief of ~fraud from mass function 2
Figure 6.4 below shows the belief computed from Mass Function 3 (Ad-visit after
non-ad visit) according to which if a user clicks on an ad after a non-ad visit, he/she is likely
to be a fraud. Once a user makes a non-ad visit to the ad-site, it implies that the user is aware
of how to reach the site apart from clicking on the ad. The first three visits are all ad-visits and
therefore the function supports a little belief of fraud. The fourth visit is a non-ad visit and
therefore the function does not support fraud (the belief becomes 0). But the fifth visit is an
ad-visit (after a non-ad visit). The function computes a high belief of fraud because of this and we
see that the belief of fraud spikes up to 0.6.
Figure 6.4 Belief of ~fraud from mass function 3
Figure 6.5 below shows the belief computed from Mass Function 4 (Time of click)
according to which if the click occurred in the most suspicious time (or most active period of
fraud activity) then the user is likely to be a fraud. The first three visits are not during the
most suspicious time for fraud and therefore the function does not support a belief of fraud.
During the fourth visit the session enters the suspicious time and therefore the function
supports fraud. The curve below shows this increased belief.
Figure 6.5 Belief of fraud from mass function 4
Figure 6.6 below shows the belief computed from Mass Function 5 (Place of origin of
click) according to which if the click originated from a location (country, state or city) where
the advertiser has no business then the user is likely to be a fraud. For this case study we
assume that the IP address of the user is from a region outside of the advertiser's region of
business. A click from such an IP is not natural and the advertiser will not benefit from it. The
function therefore supports a belief of fraud throughout and this value does not change at any
time.
Figure 6.6 Belief of fraud from mass function 5
Figure 6.7 below shows the belief computed from Mass Function 6 (Creation of
membership) according to which if the user creates a membership account (registers as a
member) with the advertiser, he/she is less likely to be a fraud. The user does not create any
membership or registration with the advertiser during the first three visits. However, during
the fourth visit the user does create one and therefore the belief of this mass function in
~fraud increases from 0 to 0.25.
Figure 6.7 Belief of ~fraud from mass function 6
Figure 6.8 below shows the belief computed from Mass Function 7 (Adding a product
to shopping cart) according to which if the user adds a product to his/her shopping cart, he/she is
less likely to be a fraud. The user does not use the shopping cart during the first three visits.
However, during the fourth visit the user does add a product to it and therefore the belief of
this mass function in ~fraud increases from 0 to 0.2.
Figure 6.8 Belief of ~fraud from mass function 7
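The step behaviour of Mass Functions 6 and 7 can be sketched in a few lines. This is only an illustration, assuming each function is a simple support function that places a fixed mass on ~fraud once the event has occurred and leaves the remainder on the whole frame; the values 0.25 and 0.2 come from the text above, while the names `simple_support`, `membership_mass` and `cart_mass` are hypothetical.

```python
FRAUD, NOT_FRAUD, THETA = "fraud", "~fraud", "theta"  # THETA = whole frame (undecided)


def simple_support(hypothesis, mass):
    """Put `mass` on one hypothesis and the remainder on the whole frame."""
    if not 0.0 <= mass <= 1.0:
        raise ValueError("mass must lie in [0, 1]")
    return {hypothesis: mass, THETA: 1.0 - mass}


def membership_mass(has_account):
    # Mass Function 6: 0.25 on ~fraud once a membership exists (value from the text).
    return simple_support(NOT_FRAUD, 0.25 if has_account else 0.0)


def cart_mass(used_cart):
    # Mass Function 7: 0.2 on ~fraud once a product is in the cart (value from the text).
    return simple_support(NOT_FRAUD, 0.2 if used_cart else 0.0)


# Visits 1-3: no membership, no cart. Visit 4: both events occur.
visits = [(False, False)] * 3 + [(True, True)]
for number, (has_account, used_cart) in enumerate(visits, start=1):
    print(number,
          membership_mass(has_account)[NOT_FRAUD],
          cart_mass(used_cart)[NOT_FRAUD])
```

Before the fourth visit both functions are vacuous (all mass on the whole frame), which is why the curves in Figures 6.7 and 6.8 stay at zero for the first three visits.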
The system combines the beliefs from the individual mass functions and computes a combined
belief for each click. Table 6.1 below shows the computed values of belief, plausibility and
deduction for every click.
Table 6.1 Computed belief values for Case Study 1
click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction
1 0.45 0.99 0.015 0.55 not fraud
2 0.44 0.98 0.022 0.56 not fraud
3 0.44 0.97 0.027 0.56 not fraud
4 0.44 0.96 0.036 0.56 not fraud
5 0.43 0.96 0.043 0.57 not fraud
6 0.43 0.95 0.049 0.57 not fraud
7 0.65 0.96 0.036 0.35 suspect
8 0.64 0.96 0.041 0.36 suspect
9 0.64 0.95 0.052 0.36 suspect
10 0.63 0.94 0.06 0.37 suspect
11 0.63 0.93 0.068 0.37 suspect
12 0.7 0.94 0.059 0.3 fraud
13 0.69 0.93 0.067 0.31 fraud
14 0.69 0.93 0.072 0.31 fraud
15 0.69 0.92 0.078 0.31 fraud
16 0.68 0.91 0.092 0.32 fraud
17 0.67 0.9 0.1 0.33 fraud
18 0.59 0.81 0.19 0.41 suspect
19 0.51 0.7 0.3 0.49 suspect
20 0.5 0.69 0.31 0.5 suspect
21 0.43 0.6 0.4 0.57 not fraud
22 0.42 0.59 0.41 0.58 not fraud
23 0.42 0.58 0.42 0.58 not fraud
24 0.8 0.87 0.13 0.2 fraud
25 0.8 0.86 0.14 0.2 fraud
26 0.79 0.86 0.14 0.21 fraud
27 0.79 0.85 0.15 0.21 fraud
28 0.8 0.86 0.14 0.2 fraud
29 0.79 0.85 0.15 0.21 fraud
30 0.78 0.84 0.16 0.22 fraud
31 0.78 0.84 0.16 0.22 fraud
32 0.79 0.84 0.16 0.21 fraud
33 0.78 0.84 0.16 0.22 fraud
34 0.78 0.83 0.17 0.22 fraud
35 0.77 0.82 0.18 0.23 fraud
36 0.76 0.81 0.19 0.24 fraud
37 0.77 0.81 0.19 0.23 fraud
38 0.76 0.81 0.19 0.24 fraud
39 0.75 0.8 0.2 0.25 fraud
40 0.74 0.79 0.21 0.26 fraud
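The four numeric columns of Table 6.1 are not independent. On a two-element frame {fraud, ~fraud}, Dempster-Shafer theory gives plausibility(fraud) = 1 - belief(~fraud) and plausibility(~fraud) = 1 - belief(fraud), and the gap plausibility(fraud) - belief(fraud) is the mass still uncommitted to either hypothesis. The sketch below checks this on a few rows copied from the table; the tolerance of 0.01 is an assumption that allows for the two-digit rounding of the printed values, and the helper names are hypothetical.

```python
# (click, Bel(fraud), Pl(fraud), Bel(~fraud), Pl(~fraud)) copied from Table 6.1
ROWS = [
    (1, 0.45, 0.99, 0.015, 0.55),
    (12, 0.70, 0.94, 0.059, 0.30),
    (21, 0.43, 0.60, 0.40, 0.57),
    (24, 0.80, 0.87, 0.13, 0.20),
    (40, 0.74, 0.79, 0.21, 0.26),
]

TOLERANCE = 0.01  # assumed slack for values printed to two digits


def is_consistent(row, tol=TOLERANCE):
    """Check Pl(A) = 1 - Bel(not A) for both hypotheses, up to rounding."""
    _, bel_f, pl_f, bel_nf, pl_nf = row
    return abs(pl_f - (1 - bel_nf)) <= tol and abs(pl_nf - (1 - bel_f)) <= tol


def uncommitted(row):
    """Width of the [belief, plausibility] interval for fraud."""
    _, bel_f, pl_f, _, _ = row
    return pl_f - bel_f


for row in ROWS:
    print(row[0], is_consistent(row), round(uncommitted(row), 3))
```

For the sampled rows the interval narrows from about 0.54 at click 1 to about 0.05 at click 40, so the combined evidence leaves less and less mass undecided as the session progresses.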
Figure 6.9 below shows the combined belief of fraud obtained by combining the
beliefs from all the mass functions using Dempster's combination rule. It is interesting to note
that individually the beliefs from the mass functions vary and contradict one another. Upon
combination, however, they yield a belief that changes to reflect the changes in the user's activity.
Figure 6.9 Combined belief of fraud for Case Study 1
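For reference, Dempster's combination rule on the frame {fraud, ~fraud} can be sketched as follows. This is a minimal illustration and not the system's implementation: the masses 0.25 and 0.2 are the ~fraud values reported for Mass Functions 6 and 7, each assumed here to be a simple support function, while `m_hyp` is a purely hypothetical fraud-supporting source added to show how conflicting mass is normalized away.

```python
from itertools import product

FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
THETA = FRAUD | NOT_FRAUD  # the whole frame: "could be either"


def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset focal elements."""
    joint = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            joint[common] = joint.get(common, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass that would fall on the empty set
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {focal: mass / (1.0 - conflict) for focal, mass in joint.items()}


def belief(m, hypothesis):
    """Sum of the masses of all focal elements contained in the hypothesis."""
    return sum(mass for focal, mass in m.items() if focal <= hypothesis)


def plausibility(m, hypothesis):
    """Sum of the masses of all focal elements that intersect the hypothesis."""
    return sum(mass for focal, mass in m.items() if focal & hypothesis)


m6 = {NOT_FRAUD: 0.25, THETA: 0.75}   # membership created (value from the text)
m7 = {NOT_FRAUD: 0.20, THETA: 0.80}   # product added to cart (value from the text)
m_hyp = {FRAUD: 0.50, THETA: 0.50}    # hypothetical fraud-supporting evidence

m67 = combine(m6, m7)
m_all = combine(m67, m_hyp)
print(belief(m67, NOT_FRAUD))
print(belief(m_all, FRAUD), plausibility(m_all, FRAUD))
```

Under these assumptions the two agreeing sources reinforce each other (belief in ~fraud rises to about 0.4), and adding the hypothetical contradicting source gives a belief of fraud of about 0.375 with a plausibility of about 0.75 after the conflicting mass of 0.2 is normalized out.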
Initially the combined belief of fraud is low, and according to the threshold values in
Table 4.1 it indicates ~fraud. As the user clicks on the ad again (second visit), the belief of
fraud increases and the user moves from ~fraud to suspicious. In the third ad-visit the belief of
fraud increases further and indicates fraud. But as the user makes a non-ad visit (fourth visit),
creates a membership and uses the shopping cart, the belief drops back to ~fraud. Had the user
stopped clicking on the ad at this point, he/she would have been considered ~fraud. However,
when the user clicks on the ad again and makes an ad-visit (fifth visit), the belief increases to
support fraud. The belief spikes to a high value during the fifth visit because
this is an ad-visit after a non-ad visit. At the end the belief of fraud for this user remains high,
and the case is classified as fraud. The time of click and the location of the IP also
contribute to the suspicion.
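The deduction step described above amounts to comparing the combined belief of fraud against two cut-offs. The sketch below illustrates this under a loud assumption: the actual thresholds are those of Table 4.1, which is not reproduced here, so `SUSPECT_AT` and `FRAUD_AT` are placeholder values chosen only so that the sketch agrees with the deductions printed in Table 6.1.

```python
SUSPECT_AT = 0.50  # hypothetical cut-off; the real value is defined in Table 4.1
FRAUD_AT = 0.66    # hypothetical cut-off; the real value is defined in Table 4.1


def deduce(belief_fraud, suspect_at=SUSPECT_AT, fraud_at=FRAUD_AT):
    """Map a combined belief of fraud to one of the three deductions."""
    if belief_fraud >= fraud_at:
        return "fraud"
    if belief_fraud >= suspect_at:
        return "suspect"
    return "not fraud"


# Belief(fraud) at the first click of each label change in Table 6.1
samples = [(1, 0.45), (7, 0.65), (12, 0.70), (18, 0.59), (21, 0.43), (24, 0.80)]
labels = [(click, deduce(belief)) for click, belief in samples]
for click, label in labels:
    print(click, label)
```

With these placeholder cut-offs the sampled clicks reproduce the sequence in Table 6.1: not fraud, suspect, fraud, suspect, not fraud, fraud.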
Case Study 2
This case study presents a scenario where a software bot is used to commit click fraud
by using different IP addresses at different times. Use of multiple IP addresses can make
detection difficult. In most approaches to click fraud detection, including ours, n different IPs
are treated as n unique users. Walgampaya and Kantardzic (2011) suggest a specialized approach
to identify bot attacks. For clarity, let us assume that each IP belongs to a
different user. Figure 6.10 below shows the activity from three different IP addresses (users)
in a timeline diagram. We use a different color scheme in this timeline diagram to
distinguish visits by the three IPs, and we omit the time range of each session to avoid
clutter.
Figure 6.10 Timeline diagram for Case Study 2
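Treating n different IPs as n unique users means that clicks are partitioned by IP address before any evidence is evaluated. A minimal sketch of that partitioning is shown below; the record layout `(ip, timestamp_seconds, is_ad_click)`, the sample values and the function name are hypothetical, and the addresses are taken from the ranges reserved for documentation.

```python
from collections import defaultdict

# Hypothetical click log: (ip, timestamp_seconds, is_ad_click)
clicks = [
    ("192.0.2.10", 0, True),
    ("198.51.100.7", 40, True),
    ("192.0.2.10", 95, True),
    ("203.0.113.5", 130, True),
    ("198.51.100.7", 180, True),
    ("203.0.113.5", 260, True),
]


def group_by_ip(records):
    """Partition click records by IP so each IP is evaluated as its own user."""
    users = defaultdict(list)
    for ip, timestamp, is_ad_click in records:
        users[ip].append((timestamp, is_ad_click))
    for events in users.values():
        events.sort()  # chronological order within each user
    return dict(users)


users = group_by_ip(clicks)
for ip, events in users.items():
    print(ip, events)
```

Each per-IP list would then be scored on its own, which is why a bot that rotates addresses appears to the system as several users with short histories.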
Using each IP, two ad-visits are made: the first visit has a short session and the
second has a longer session. The first two IPs are outside the advertiser's region of
business, but the third IP originates from the advertiser's area of business. The last four visits lie
in a suspicious time range.
The system computes mass beliefs and a combined belief corresponding to each click from
every IP. Tables 6.2, 6.3 and 6.4 below show the computed values of belief, plausibility and
deduction for the first, second and third IPs respectively.
Table 6.2 Computed belief values for first IP
click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction
1 0.45 0.99 0.015 0.55 not fraud
2 0.66 0.99 0.014 0.34 fraud
3 0.72 0.98 0.025 0.28 fraud
Table 6.3 Computed belief values for second IP
click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction
1 0.45 0.99 0.015 0.55 not fraud
2 0.66 0.98 0.02 0.34 fraud
3 0.53 0.94 0.061 0.47 suspect
Table 6.4 Computed belief values for third IP
click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction
1 0.078 0.89 0.11 0.92 not fraud
2 0.73 0.99 0.0089 0.27 fraud
3 0.51 0.9 0.095 0.49 suspect
Figure 6.11 below shows the computed values of belief of fraud for all visits by the
bot using the three IPs.
Figure 6.11 Combined belief values for Case Study 2
From Figure 6.11 and Tables 6.2 to 6.4 above we can observe that our system
detects the users with the first two IPs as fraud and the user with the third IP as suspicious, even
though only two clicks occurred from each IP. The third IP was not outside the
advertiser's region of business, and hence the system classifies it as suspicious rather than fraud. The
above clicks from three different IPs could be from one single bot. We evaluate them as three
different users and yet detect the fraud.
CHAPTER VII
DISCUSSION & CONCLUSIONS
The thesis proposes an approach for click fraud identification that the
advertising community can use to address click fraud. Our approach is fundamentally
different from existing methods. First, we focus on the type of clicking activity that can
create real value for the fraudster and attempt to detect it. To do so, we take raw weblog data
and derive meaningful evidence for our mass function formulation. Second, the approach
supports on-line computation to detect fraudulent clicks. Such computation adapts well to
real-time systems, which is a key advantage. Third, the approach is relatively simple and fast
because it requires only the incoming data at the advertiser's disposal. It neither requires the
advertiser to maintain and update large historical databases of evidence nor
necessitates learning of any patterns. This makes the approach practical for
advertisers. Fourth, the resulting beliefs also expose the gray area of suspicious activity,
which can alert the advertiser to irregular or abnormal traffic. This is useful against click
fraud attacks that are hard to catch but still fall in the suspicious category. Finally, the
approach extracts evidence from limited server data and can be extended easily
by adding new mass functions to represent additional evidence.
Our experiments on the two case studies show that the proposed approach behaves as
intended in those scenarios. Although we have not experimented on all possible scenarios of click fraud
behavior, we believe that our approach will work effectively in general for the
following reasons. First, the technique allows combination of a set of evidence that can
contribute to click fraud detection. Second, the set of evidence considered in this thesis is,
in the worst case, nearly complete. Finally, if the set is not complete, the technique can be easily
extended by adding new evidence to the proposed click fraud detection system.
Future work includes more experiments to understand the characteristics of
the proposed approach, for example, which novel click attacks the approach fails
to identify and, once such attacks are found, which other sources of data and evidence can be used
to detect them. Future work also requires experiments to see whether our approach works for
specialized bot attacks, which can be highly sophisticated and evolve continuously. These are
among our ongoing and future research topics.
BIBLIOGRAPHY
D. Antoniou, M. Paschou, E. Sakkopoulos, E. Sourla, G. Tzimas, A. Tsakalidis, E. Viennas,
“Exposing click-fraud using a burst detection algorithm”, in Proceedings of ISCC on
Computers and Communications, IEEE Symposium, Jun 2011, pp. 1111-1116.
A. Tuzhilin, “The Lane's Gifts vs. Google Report”, 2006.
M. Kantardzic, C. Walgampaya, B. Wenerstrom, O. Lozitskiy, S. Higgins and D. King,
“Improving Click Fraud Detection by Real Time Data Fusion”, in Proceedings of the
ISSPIT on Signal Processing and Information Technology, IEEE International
Symposium, Dec. 2008, pp. 69-74.
G. Shafer, “A Mathematical Theory of Evidence”, Princeton University Press, 1976.
T. Denoeux, “A K-nearest Neighbour Classification Rule based on Dempster-Shafer
Theory”, IEEE Transactions on Systems, Man and Cybernetics, 25 (1995) 804-813.
F. Dong, S. M. Shatz, H. Xu, “Reasoning Under Uncertainty For Shill Detection In Online
Auctions using Dempster Shafer Theory”, International Journal of Software Engineering
and Knowledge Engineering, 2010, pp. 943-973.
K. Sentz, S. Ferson, “Combination of Evidence in Dempster-Shafer Theory”, SAND2002-0835,
April 2002.
N. Kshetri, “The Economics of Click Fraud”, Security and Privacy, IEEE, May-June 2010,
pp. 45-53.
H. Haddadi, “Fighting Online Click-Fraud Using Bluff Ads”, ACM SIGCOMM Computer
Communication Review, v.40 n.2, April 2010, doi:10.1145/1764873.1764877.
V. Anupam, A. Mayer, K. Nissim, B. Pinkas, and M. K. Reiter, “On the Security of pay-per-click
and other web advertising schemes”, Computer Networks, 31(11-16), 1999, pp. 1091-1100.
M. Kantardzic, C. Walgampaya, and H. Jamali, “Click fraud prevention in pay-per-click
model: Learning through multimodel evidence fusion”, in Proceedings of ICMWI of
Machine and Web Intelligence, 2010, pp. 20-27.
C. Walgampaya and M. Kantardzic, “Cracking the Smart ClickBot”, in Proceedings of Web
Systems Evolution on 13th IEEE Symposium, 2011, pp. 125-134.
B. J. Jansen, “Click Fraud”, IEEE Computer, vol. 40, no. 7, Jul 2007, pp. 85-86.
X. Li, Y. Liu, and D. Zeng, “Publisher click fraud in the pay-per-click advertising market:
Incentives and consequences”, in Proceedings of Intelligence and Security Informatics of
IEEE International Conference, 2011, pp. 207-209.
S. Majumdar, D. Kulkarni, and C. V. Ravishankar, “Addressing Click Fraud in Content
Delivery Systems”, in Proceedings of INFOCOM 2007 of 26th IEEE International
Conference, May 2007, pp. 240-248.
A. Metwally, D. Agarwal, A. Abbadi, and Q. Zheng, “On Hit Inflation and Detection in
Streams of Web Advertising Networks”, in Proceedings of Distributed Computing
Systems on ICDCS, Jun 2007, pp. 52-52.
IAB and PwC, “IAB Internet Advertising Revenue Report, 2010”, First Half-Year Results,
New York, U.S., 2011.
GeekWire Magazine, “Newspapers take it on the chin as online ad revenue falls into the
hands of a few tech giants”, Mar 2012,
http://www.geekwire.com/2012/newspapers-chin-online-ad-revenue-falls-hands-tech-giants/
Google Earnings Report, “Google Announces Second Quarter 2011 Financial Results”, Jul
2011, http://investor.google.com/earnings/2011/Q2_google_earnings.html