-
Technische Universität Berlin
Master Thesis
An Exploratory Analysis of the Tracked Web
Author: Karim Wadie
Supervisor: Prof. Volker Markl
Advisor: Johannes Kirschnick
A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computer Science
as part of the Erasmus Mundus programme IT4BI
in the
Database Systems and Information Management Group (DIMA)
Department of Computer Science
July 2015
http://www.tu-berlin.de
https://www.dima.tu-berlin.de/
http://cs.tu-berlin.de/welcome.html
-
Declaration of Authorship
I declare that I have authored this thesis independently, that I have not used other than
the declared sources/resources, and that I have explicitly marked all material which has
been quoted either literally or by content from the used sources.
Eidesstattliche Erklärung
Ich erkläre an Eides statt, dass ich die vorliegende Arbeit selbstständig verfasst, andere
als die angegebenen Quellen/Hilfsmittel nicht benutzt, und die den benutzten Quellen
wörtlich und inhaltlich entnommenen Stellen als solche kenntlich gemacht habe.
Berlin,
July 31, 2015
Karim WADIE
-
"The man who comes back through the Door in the Wall will never be quite the same
as the man who went out. He will be wiser but less sure, happier but less self-satisfied,
humbler in acknowledging his ignorance yet better equipped to understand the relationship
of words to things, of systematic reasoning to the unfathomable mystery which it tries,
forever vainly, to comprehend."
Aldous Huxley
-
Technische Universität Berlin
Abstract
Faculty of Electrical Engineering and Computer Science
Department of Computer Science
Master of Science in Computer Science
An Exploratory Analysis of
the Tracked Web
by Karim Wadie
There is no doubt that web tracking has progressively prevailed on the internet over the past years, whether for traffic analytics or for building user browsing profiles that aid personalized advertising. There are several techniques by which a tracking service can record visitors' behavior on a remote website, some of which can be detected in an offline setting by analyzing the HTML content for common tracking practices such as tracking pixels and scripts that communicate with a third-party host. This thesis builds on top of the TrackTheTrackers project, initiated at TU Berlin, which extracts tracking services from the Common Crawl, the largest publicly available web corpus, by providing a deeper, quantitative analysis of the web tracking phenomenon in terms of its prevalence and its relationship with the web structure. To the best of our knowledge, this research is the first to combine web-graph studies with third-party tracking analysis. Throughout our exploratory analysis, we report a number of statistical findings about the tracking graph along with descriptive, structural properties of the web graph spanned by the trackers and tracked websites (i.e. the tracked web), and finally, we examine how structural features of the web graph, such as community structures and centrality measures, can affect the spread of tracking over the web. For instance, we found that 60% of the web is potentially tracked, with Google being the number one tracker on the internet. We also used a quantitative approach to discover that the tracked web is highly interconnected, exhibits the small-world phenomenon with only five degrees of separation, and resembles the structure of a social network more than that of a web graph.
http://www.tu-berlin.de
http://www.eecs.tu-berlin.de/menue/fakultaet_iv/
http://cs.tu-berlin.de/welcome.html
-
Acknowledgements
I take this opportunity to express gratitude to Johannes, my advisor, for his guidance throughout the thesis, as well as his comments that greatly improved the manuscript. I also thank Sebastian Schelter for his excellent work on the TrackTheTrackers project and for providing the datasets upon which this study builds.
-
Contents
Declaration of Authorship i
Abstract iii
Acknowledgements iv
List of Figures viii
List of Tables ix
Abbreviations x
1 Introduction and Literature Review 1
 1.1 Introduction 1
  1.1.1 What is web tracking? 1
  1.1.2 The business empire of web tracking 1
  1.1.3 Why should we study tracking? 3
 1.2 Literature Overview 5
  1.2.1 Web tracking studies 5
  1.2.2 Web graph studies 9

2 Objectives 13

3 Methodology 15
 3.1 Datasets 15
  3.1.1 Common Crawl web corpus 15
  3.1.2 Web Data Commons hyper-link graph 17
  3.1.3 The Common Crawl WWW ranking 18
  3.1.4 Alexa top sites 18
 3.2 Data Processing Platforms 18
  3.2.1 Apache Hadoop 18
  3.2.2 Apache Spark 20
  3.2.3 Apache Flink 20
  3.2.4 R 21
  3.2.5 MS SQL Server BI Stack 21
-
  3.2.6 WebGraph Framework 21
  3.2.7 FlashGraph Framework 22
 3.3 Data Preparation 22
  3.3.1 Trackers extraction 22
 3.4 Environment 24
  3.4.1 Amazon EC2 24
  3.4.2 DIMA IBM Power Cluster 24

4 Analysis I: Statistical Properties 26
 4.1 Trackers Coverage 27
 4.2 Top Sites Tracking 30
 4.3 Tracking Classification 31
 4.4 Domain Analysis 33
  4.4.1 Country code analysis 33
  4.4.2 Generic domain analysis 34
 4.5 Trackers Association 36
 4.6 Chapter Summary 41

5 Analysis II: Structural Properties 42
 5.1 Tracked-Web Degree Distribution 42
  5.1.1 Density and node degrees 42
  5.1.2 Power-law fitting 44
  5.1.3 Findings 45
 5.2 Tracked-Web Degree of Separation 45
  5.2.1 Introduction 45
  5.2.2 Approach: HyperANF 46
  5.2.3 Distance-related features 48
  5.2.4 Conclusion 50
 5.3 Is The Tracked-Web a Small World? 50
 5.4 Tracked-Web Components 52
  5.4.1 WCC 52
  5.4.2 SCC 53
 5.5 Centrality and Tracking 54
  5.5.1 Introduction 54
  5.5.2 Approach 57
  5.5.3 Individual centrality correlation 57
  5.5.4 Centrality-based classification 58
 5.6 Community Structure and Tracking 59
  5.6.1 Vertex-centric neighborhoods 59
  5.6.2 Web graph communities 60
  5.6.3 Conclusion 62
 5.7 Chapter Summary 64
6 Future Work 65
7 Thesis Summary 67
-
A Top Trackers By Source 71
B Tracking Penetration By Country 74
C Social Widgets Detection 80
Bibliography 82
-
List of Figures
1.1 Example of online advertising players 3
1.2 USA online advertisement market growth in USD billions 4
1.3 Case Study: Third-Party Analytics 6
1.4 Case Study: Third-Party Advertising 6
1.5 Case Study: Advertising Networks 7
1.6 Case Study: Social Widgets 8
1.7 Bow-tie structure of the web 10
3.1 Pseudocode of the main routines in extracting trackers . . . . . . . . . . . . . 25
4.1 Tracking detection summary 28
4.2 Alexa top sites tracking penetration 30
4.3 Tracking sources summary 31
4.4 Tracking Classification Summary 33
4.5 ccTLD tracking penetration histogram 34
4.6 Tracking penetration worldwide 35
4.7 Log-Log plot for the number of trackers per PLD 38
4.8 Pseudocode of the Apriori algorithm 39
5.1 Log-Log plot of the tracked web indegree distribution 43
5.2 Log-Log plot of the tracked web outdegree distribution 43
5.3 Probability mass function of the tracked-web distance 49
5.4 Cumulative probability function of the tracked-web distance 49
5.5 Log-Log plot of the tracked-web WCC size distribution 53
5.6 Pseudocode of the Tarjan algorithm for finding strongly connected components in a graph 55
5.7 Log-Log plot of the tracked-web SCC size distribution 56
5.8 Pseudocode for computing tracking coefficient of vertices 60
5.9 Log-Log plot of the web graph community-size distribution 62
5.10 A visual representation of the web graph mega-communities 62
C.1 Facebook social widget code snippet 80
C.2 Twitter social widget code snippet 81
C.3 YouTube social widget code snippet 81
C.4 Reddit social widget code snippet 81
-
List of Tables
3.1 Content statistics of the 2012 web corpus . . . . . . . . . . . . . . . . . . 17
4.1 Top 20 potential trackers 29
4.2 Top trackers penetration ratio across Alexa top sites 31
4.3 Tracking-Source distribution 32
4.4 Social-Widget tracking summary 33
4.5 Tracking penetration by gTLD 35
4.6 Top Trackers Coverage over gTLDs 37
4.7 Frequent item sets of top 20 trackers 40
4.8 Top 20 trackers association rules 41
5.1 Power-law fitting of tracked-web indegree and outdegree 45
5.2 HyperANF Results on the tracked-web 48
5.3 Distance-related features for the web, Facebook and Tracked Web 50
5.4 Calculating the small-world measure S for the tracked-web 52
5.5 Point bi-serial correlation between centrality measures and tracking 58
5.6 Area under the curve (AUC) for different binary classifiers (centrality measures vs tracking) 59
5.7 Tracking Coefficients of the web graph neighborhoods 60
A.1 Top 20 potential trackers employing scripts 71
A.2 Top 20 potential trackers employing IFrames 72
A.3 Top 20 potential trackers employing Images 72
A.4 Top 20 potential trackers employing Links 73
B.1 Tracking analysis by country code top level domain . . . . . . . . . . . . . 74
-
Abbreviations
ccTLD Country code top level domain
DWH Data Warehouse
GA Google Analytics
gTLD Generic top level domain
HDFS Hadoop Distributed File System
PLD Pay-level-domain
SCC Strongly connected component (of a graph)
TLD Top level domain
WCC Weakly connected component (of a graph)
WDC Web Data Commons
-
Dedicated to my parents, for their love, endless support and encouragement.
-
Chapter 1
Introduction and Literature Review
1.1 Introduction
1.1.1 What is web tracking?
Web tracking commonly refers to the act of collecting subsets of a user's browsing data or browsing behavior over the internet. This practice has attracted a lot of attention over the past few years, especially after the social media boom and the average internet user's increasing awareness of privacy issues.
There is no doubt that tracking is prevalent on the web today. Most of us who use search engines or e-commerce sites (e.g. Amazon) have seen the implications of web tracking (or simply "tracking", as we will refer to it in this document), at least in the form of targeted advertisements, especially when observed across sites; for example, coming across advertisements on one's social media profile for products previously viewed on a completely different e-commerce site.
In our work, we use the term "tracked-web" to refer to the graph structure of web links formed by the tracking and tracked web entities. We aim to provide a better understanding of this subset of the web in terms of statistics about these entities, as well as its local and global structural properties.
1.1.2 The business empire of web tracking
Before going into details, one first needs to understand the motivation behind such a practice, what kind of web entities are behind it, and how they can actually do it.
-
First-Party and Third-Party Tracking To begin with, we need to differentiate between what is called first-party and third-party tracking. The first kind refers to a website keeping track of its visitors' activities on its own site, either anonymously or by user profiles, in order to analyze customer behavior, enhance its service or even communicate the data to other entities for a profit. First-party tracking is very common on most major websites; however, it often raises serious concerns when it crosses the virtual world of the internet and includes real-world information like GPS track history, fingerprints and such. Unfortunately, this type of tracking is beyond our scope of analysis, since it is integrated in the website logic and can hardly be detected or analyzed offline.
The other type of tracking, third-party tracking, refers to the practice by which an outside entity (the tracker), other than the directly visited website, tracks the user's visit to the site. For example, if a web user visits reuters.com, a third-party tracker like doubleclick.net, embedded by reuters.com to provide targeted advertising, can log the user's visit to reuters.com. For most types of third-party tracking, the tracker will be able to link the user's visit to reuters.com with the user's visits to other sites on which the tracker is also embedded, thus building what is called a browsing profile of that user. In this study we will only consider third-party tracking for our analysis, because of its potential concern to users, who may be surprised that a party with which they may or may not have chosen to interact is recording their online behavior in unexpected ways.
Tracking Services The web entities acting as third-party trackers are generally categorized into two broad groups: web traffic analytics services and advertising-based services (we discuss a detailed categorization framework in the literature review, section 1.2). The first group of trackers usually provides its services to websites in return for a paid premium or subscription plan; however, the most popular web-traffic analysis service [1], Google Analytics [2], can be used for free. In this case, Google is believed to generate indirect profit from the free analytics service by integrating the data it collects with its paid advertising service, Google AdWords [3].
The other group of tracking services is the one directly concerned with online advertising. The advertising business has evolved since the birth of the internet, from email marketing campaigns to online display ads in the 1990s, to the more complex landscape of search ads (see figure 1.1) that involves targeted advertising with automated bidding and connects a number of stakeholders: publishers who host the ads, advertisers who advertise their products/services, advertising agencies that help generate and place the ad copy, ad servers that technically deliver the ads, and
-
advertising affiliates who conduct promotional work for the advertisers, among potentially more players.
Figure 1.1: Example of online advertising players 1.
It is not hard to understand how the online advertising business had to become more sophisticated over the years, given that it is a multi-billion dollar industry. According to a study by PricewaterhouseCoopers (PwC) [4], online advertising generated a revenue of 49.5 billion USD in 2014 in the United States alone (see figure 1.2). Another recent study estimated the European ad market in 2012 at 24.3 billion EUR [5].
1.1.3 Why should we study tracking?
Despite the prevalence of web tracking and the resulting public and media outcry, primarily in the western world, there is a lack of clarity about how tracking works, how widespread the practice is, and the scope of the browsing profiles that trackers can collect about users. Thus, efforts in exploring and understanding the structure of the web from a tracking perspective, as we aim to do in this thesis, are important in shedding light on this part of the internet in order to:
1. Design crawling and tracker detection algorithms.
1 Figure taken from LUMA Partners: http://www.lumapartners.com/lumascapes/
-
Figure 1.2: USA online advertisement market growth in USD billions 2
2. Design protection techniques against trackers.
3. Understand the coverage of some key trackers and their domination over the internet, and thus estimate their business value and market weight.
4. Predict the evolution and spread of the tracking phenomenon.
5. Predict the emergence of new phenomena in the tracking graph.
2 Figure taken from the PwC Internet advertising report 2014 [4]
-
1.2 Literature Overview
1.2.1 Web tracking studies
A number of studies have been conducted by researchers to understand, analyze and classify the web tracking phenomenon, and even to develop techniques to protect against it. The most prominent is the work by Roesner, Kohno, and Wetherall [6] in 2012. In their study, the authors presented an in-depth empirical investigation of third-party tracking in which they introduced a comprehensive classification framework for web tracking based on client-side observable behaviors. They also developed and evaluated a web browser plugin designed to thwart tracking originating from social media widgets (like the Facebook "Like" button) while still allowing the widgets to be used.
The suggested framework is built from client-side methods for detecting and classifying five kinds of third-party trackers based on how they manipulate browser state. The five behaviors observed are:
1. Third-Party Analytics:
In order to analyze their traffic, websites usually embed a library (in the form of a script) provided by the analytics engine (e.g. Google Analytics). In the case of GA, the script sets a site-owned cookie (not tracker-owned) on the visitor's browser that contains a unique identifier. The script then transfers this identifier to google-analytics.com by making explicit requests containing information such as the operating system version, browser, geographic location, etc.
Since the cookie set by the tracker was created in the context of the visited site (site-owned), the identifiers set by the tracker in this case are different across sites. Thus, a single user will be associated with different identifiers on different sites, limiting the tracker's ability to create a cross-site browsing profile for that user. Figure 1.3 shows a case study as offered in the original work [6].
2. Third-Party Advertising:
This is tracking for the purpose of targeted advertising; an example of this type is Google's advertising network, DoubleClick [7].
When a user visits a page, the tracker (advertiser) chooses an ad to display on that page as an image or an iframe. In this case, the cookie containing the visitor's unique identifier is set as tracker-owned. As a result, the same unique identifier is associated with the user whenever he visits any site with the tracker's ads embedded in it. The tracker is thus able to build a cross-site browsing profile
-
Figure 1.3: Case Study: Third-Party Analytics.
Websites commonly use third-party analytics engines like Google Analytics (GA) to track visitors. This process involves (1) the website embedding the GA script, which, after (2) loading in the user's browser, (3) sets a site-owned cookie. This cookie is (4) communicated back to GA along with other tracking information.
Figure 1.4: Case Study: Third-Party Advertising.
When a website (1) includes a third-party ad from an entity like Doubleclick, Doubleclick (2-3) sets a tracker-owned cookie on the user's browser. Subsequent requests to Doubleclick from any website will include that cookie, allowing it to track the user across those sites.
for each unique user. Figure 1.4 shows a case study as offered in the original work
[6].
3. Third-Party Advertising with Popups:
Using popups to display ads gives the tracker the advantage of setting its own first-party cookie, allowing it to bypass the third-party cookie blocking mechanisms embedded in some browsers or plugins. This kind of tracking is malicious, since it puts the tracker in a first-party position without the user's consent. An example of such a tracker is insightexpressai.com.
4. Third-Party Advertising Networks:
Trackers often cooperate, and it is insufficient to simply consider trackers in isolation. A website may embed one third-party tracker, which in turn serves as an
-
aggregator for a number of other third-party trackers. Figure 1.5 shows a case
study as offered in the original work [6].
Figure 1.5: Case Study: Advertising Networks.
As in the ordinary third-party advertising case, a website (1-2) embeds an ad from Admeld, which (3) sets a tracker-owned cookie. Admeld then (4) makes a request to another third-party advertiser, Turn, and passes its own tracker-owned cookie value and other tracking information to it. This allows Turn to track the user across sites on which Admeld makes this request, without needing to set its own tracker-owned state.
5. Third-Party Social Widgets:
Most social networking sites offer social widgets like the Facebook "Like" button, the Twitter "Tweet" button, the Google "+1" button and others. These widgets can be included by other websites to allow users logged in to these social networking sites to like, tweet, or +1 the embedding web page. In the case of Facebook, it can set its tracker-owned cookie from a first-party position when the user voluntarily visits facebook.com; then, when the user visits another website that embeds the Facebook "Like" button, the requests made to facebook.com to render this button allow Facebook to track the user across sites just as Doubleclick can. Figure 1.6 shows a case study as offered in the original work [6].
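Behaviors like these leave traces in the page's HTML that can be inspected offline. As a minimal illustration (not the thesis's actual extraction pipeline), the Python sketch below flags script sources served from a host other than the embedding page's host; a real pipeline would compare pay-level domains and also inspect iframes, images and links:

```python
# Minimal sketch: flag <script src="..."> tags whose host differs from the
# embedding page's host, a hint of a potential third-party tracker.
from html.parser import HTMLParser
from urllib.parse import urlparse

class ThirdPartyScriptFinder(HTMLParser):
    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.third_party_hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        host = urlparse(src).netloc
        # A non-empty host different from the page's own host indicates a
        # potential third-party tracker (e.g. google-analytics.com).
        if host and host != self.page_host:
            self.third_party_hosts.add(host)

page = '<html><body><script src="http://www.google-analytics.com/ga.js"></script></body></html>'
finder = ThirdPartyScriptFinder("www.example.com")
finder.feed(page)
print(finder.third_party_hosts)  # {'www.google-analytics.com'}
```

Relative sources such as src="/assets/local.js" have an empty host and are correctly ignored as first-party content.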
From the observed tracking behaviors, the authors then formulated a framework for classifying trackers into five classes, where a single tracker may exhibit more than one of these behaviors:
1. Behavior A (Analytics): The tracker serves as a third-party analytics engine
for sites. It can only track users within sites.
2. Behavior B (Vanilla): The tracker uses third-party storage that it can get and set only from a third-party position.
-
3. Behavior C (Forced): The cross-site tracker forces users to visit its domain
directly (e.g., popup, redirect), placing it in a first-party position.
4. Behavior D (Referred): The tracker relies on a B, C, or E tracker to leak unique
identifiers to it, rather than on its own client-side state, to track users across sites.
5. Behavior E (Personal): The cross-site tracker is visited by the user directly in
other contexts.
In our study, since we are working in an offline setting, we will be able to differentiate between third-party analytics, third-party advertising and third-party social widgets.
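This offline differentiation amounts to assigning each detected third-party host a coarse label. A toy Python sketch follows; the seed lists here are illustrative placeholders, not the actual labeling data used in this work:

```python
# Hypothetical sketch: map a detected third-party host to one of the three
# categories distinguishable offline. Seed lists are illustrative only.
ANALYTICS = {"google-analytics.com", "quantserve.com"}
SOCIAL_WIDGETS = {"facebook.com", "twitter.com", "plus.google.com"}

def classify_tracker(host):
    # Strip a leading "www." so that e.g. www.facebook.com matches.
    host = host[4:] if host.startswith("www.") else host
    if host in ANALYTICS:
        return "third-party analytics"
    if host in SOCIAL_WIDGETS:
        return "third-party social widget"
    # Everything else defaults to the advertising-oriented bucket.
    return "third-party advertising/other"

print(classify_tracker("www.google-analytics.com"))  # third-party analytics
```

In practice such seed lists would be curated from known tracker catalogs, and hosts would first be normalized to pay-level domains.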
Apart from Roesner et al. [6], a number of studies have empirically examined tracking on the web, most notably Krishnamurthy et al. [8]. In their paper, the authors presented a study in which they measured the coverage of third-party tracking on the web. However, unlike [6], they did not distinguish between different tracking behaviors.
From a different perspective, the authors of [9] studied privacy-violating information flows on the web, where they found instances of cookie leaking as well as other privacy violations. However, they did not differentiate between third-party trackers and the visited sites themselves. Also, in his five-year study of modern web traffic, Ihm [10] found that 12% of web requests in 2010 were for advertisements. He also found that Google Analytics tracked up to 40% of the pages in their dataset.
Figure 1.6: Case Study: Social Widgets.
Social sites like Facebook, which users visit directly in other circumstances (allowing them to (1) set a cookie identifying the user), expose social widgets such as the "Like" button. When another website embeds such a button, the request to Facebook to render the button (2-3) includes Facebook's tracker-owned cookie. This allows Facebook to track the user across any site that embeds such a button.
-
As for the phenomenon of tracker collaboration, [8] and [11] analyzed the private data leakage from first-party websites to data aggregators that can, potentially, link user accounts across different sites. In another study, Jackson and Boneh [12] classify trackers based on the type of cooperation between the embedding site and the trackers, although they did not provide measurements on the prevalence of the tracker classes.
Finally, in the past few years, there have been notable online discussions about tracking, such as [5], along with workshops on tracking like the W3C Workshop on Web Tracking and User Privacy.
1.2.2 Web graph studies
Apart from the web tracking phenomenon itself, there are numerous studies that model the web as a graph in order to analyze its structure and report interesting measurements and statistics about it. We find these kinds of efforts inspirational to our analysis of the tracked-web, in terms of what questions to ask and what techniques to use to answer them.
The most notable study covered by our literature search is the paper by Broder et al. [13]. In order to discover a set of local and global properties of the web graph, the authors conducted a set of experiments on web crawls made available by AltaVista, each with over 200 million pages and 1.5 billion links. They showed that the overall structure of the web is considerably more complicated than suggested by earlier experiments on a limited scale. Famously, they published a visual interpretation of their findings about the web structure, which has become well known in later literature as the bow-tie structure of the web.
The authors first report the in- and out-degree distributions of web pages, confirming previous reports on power laws [14]. They then studied the directed and undirected connected components of the web, showing that power laws also arise in the distribution of the sizes of these connected components. They found that most (over 90%) of the approximately 203 million nodes in their crawl data form a single connected component if links are treated as undirected edges.
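The measurement reported here, the share of nodes falling in the largest weakly connected component, can be sketched with a simple union-find over an edge list, ignoring edge direction. This is a toy illustration; crawls at this scale require external-memory or distributed tooling:

```python
# Minimal sketch: fraction of nodes in the largest weakly connected
# component of a directed edge list, using union-find (direction ignored).
def wcc_fraction(edges):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the endpoints of every edge.
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    # Count component sizes by root.
    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(parent)

# Toy graph: nodes a-d form one weak component; e and f form a second.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("e", "f")]
print(wcc_fraction(edges))  # 4 of 6 nodes -> ~0.67
```

The same bookkeeping, applied to the AltaVista edge list, yields the "over 90% in one component" figure quoted above.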
This giant weakly connected web can be broken into four pieces, as shown in figure 1.7. The first is a central core, in which every page can reach every other page in the same core by following directed links; this giant strongly connected component (SCC) is at the heart of the web. The second and third pieces are called IN and OUT. IN contains pages that cannot be reached from the SCC but can reach it; the authors claim that
-
these might be new sites that people have not yet discovered and linked to. On the other hand, OUT contains pages that are pointed to from the SCC but cannot link back to it; the authors suggest that such a cluster represents corporate websites that contain only internal links.
Finally, the TENDRILS consist of pages that are in total isolation from the SCC: they cannot reach the SCC and cannot be reached from it. Perhaps the most interesting finding is that all four sets are roughly the same size, with the SCC being relatively small; it comprises about 56 million pages, while each of the other three sets contains about 44 million pages. Finally, they measured the diameter of the central core (SCC) to be at least 28, and the diameter of the graph as a whole to be over 500.
Figure 1.7: Bow-tie structure of the web
One can pass from any node of IN through the SCC to any node of OUT. Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passing through the SCC. It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE: i.e., a passage from a portion of IN to a portion of OUT without touching the SCC. Diagram and description are taken from 3
3 Figure taken from Broder et al., Graph structure in the web [13].
-
A more detailed analysis of the sizes of the components of the bow-tie model was done by Serrano et al. [15]. By analyzing four crawls gathered between 2001 and 2004 by different crawlers with different parameters, they concluded that the properties of a web crawl depend on the crawling process.
-
We can also find a number of studies about the web structure that use the same data set
as our thesis, the Common Crawl Web Corpus (see 3.1.1). In a preliminary study, Kolias
et al. [16] presented an initial exploratory analysis of the Common Crawl. Although
they examined only a fraction of the dataset, some interesting initial measurements and
characteristics of the web corpus were shown. They reported statistics on two levels of
granularity, page and site level, such as the MIME type distribution of resources, the
top-10 languages for page content, the distribution of page age, HTML versions, page degree
distribution, pages per website, site language, and site degree distribution.
An in-depth comparison of the latest findings on the web structure with previous work
was done by Meusel et al. [17]. They confirm the existence of a giant strongly connected
component, but emphasize that it is strongly dependent on the crawling
process. Their most important finding, however, is that the distributions of indegree,
outdegree and sizes of strongly connected components are not power laws, something
that contradicts the findings throughout the literature up to that point.
At a different level of aggregation, Lehmberg et al. [18] published a number of similar
findings on web characteristics and degree distributions, but at the pay-level-domain
granularity, as opposed to the page-level analysis in prior work. Finally, a technical report
presenting the main characteristics of the 2012 Common Crawl dataset can be found
in [19].
Apart from the Common Crawl web corpus, various other studies focused on the structure
of national web domains, which consist of all websites that end with a specific
country code or that are hosted at an IP address belonging to a segment assigned to a specific
country. The works [20, 21] present findings on crawls made by different crawlers of the
African and Chinese parts of the web. Along with its structure, other characteristics of
the web are presented by Baeza-Yates et al. [22]. This work is essentially a side-by-side
comparison of the results of 12 studies focusing on web characteristics. Their results
cover various levels of detail (contents, links and technologies) dissected by national
domains.
As for the power-law distribution phenomenon, a number of observations have been
made on various aspects of the Web. The most relevant to our study is the distribution
of degrees in the web graph. In this context, recent work [13, 23, 24] suggests that
both the in- and out-degrees of vertices in the web graph follow power laws. This
collection of findings reveals the power-law distribution as a macroscopic phenomenon
on the entire web, as well as a microscopic phenomenon at the level of single websites,
and at intermediate levels between these two.
-
Chapter 2
Objectives
The aim of this study is to provide a deeper, quantitative understanding of the web-
tracking phenomenon, in terms of both its prevalence and its relationship with the web
structure. By understanding the structure of the tracked-web graph, we also come one
step closer to designing better tracker detection and tracking protection techniques. In
addition, we measure the coverage of key trackers, thus helping to estimate their business
value and market weight.
To achieve that, we structure our exploratory analysis into a set of questions and
hypotheses to be answered or validated. We summarize the high-level goals of the thesis
as follows:
- Extracting potential trackers from the Common Crawl web corpus based on specific HTML contexts and assumptions, followed by constructing an aggregated tracking graph on the pay-level-domain (PLD) level, that is, a graph structure showing which PLD is tracked by which service.
- Computing statistical indicators on the tracking graph to measure the prevalence of tracking in the web.
- Computing descriptive, structural properties of the tracked-web, that is, the subset of the aggregated PLD web graph that includes only the trackers and tracked hosts.
- Examining how some structural properties of the web affect the spread of tracking over the internet.
We can then expand these high-level goals into a number of discrete questions and
hypotheses as follows:
1. To what degree is the web being tracked? And how many potential trackers can
we extract from the web corpus?
2. Who are the top 20 trackers? What are their coverage, their business, and the HTML
contexts in which they are usually embedded?
3. What is the percentage of tracked websites (i.e. tracking penetration) within the
subset of most popular domains based on Alexa Ranking?
4. How often do trackers appear in each HTML context (i.e. scripts, images, iframes
and links)?
5. What is the decomposition of trackers across traffic analytics, ad networks and
social widgets?
6. What is the tracking penetration per country?
7. What is the tracking penetration by generic top-level-domain (i.e. .com, .net, .org, etc.)?
8. Are there sets of trackers that usually appear together in one PLD?
9. What is the degree distribution of the tracked-web? Does it follow a power law?
10. What is the effective diameter, average distance and spid¹ of the tracked-web?
11. Does the tracked-web exhibit the small-world phenomenon?
12. How big are the largest weakly-connected component (WCC) and strongly-connected component (SCC)
of the tracked-web? Do the WCC and SCC size distributions follow a power law?
13. Can we support the hypothesis that domains with higher centrality measures are
more likely to be tracked?
14. Can we support the hypothesis that the web is clustered into communities/neighborhoods
that are either "safe" (i.e. with no tracked PLDs) or "completely tracked"
(i.e. all PLDs are tracked)?
¹spid: shortest-path index of dispersion
-
Chapter 3
Methodology
In order to answer the questions in the scope of our study (see chapter 2), we conduct
a series of experiments using the publicly-available datasets and tools presented in this
chapter.
3.1 Datasets
3.1.1 Common Crawl web corpus
The Common Crawl project [25] is a non-profit organization dedicated to providing
a copy of the internet to researchers, companies and individuals at no cost for
the purpose of research and analysis. Their goal is to democratize the data so that everyone,
not just big companies, can do high-quality research and analysis.
Common Crawl Uses The possibilities are endless, but people have used the data
to improve language translation software, predict trends, track disease propagation and
much more [26].
A number of interesting papers and projects based on the Common Crawl data have
been made available in the past couple of years, one of which is the Web
Data Commons project that we also utilize in this thesis (see 3.1.2). The
popular SwiftKey keyboard app for mobile devices is also reported to use the web corpus
to enrich its functionality [27]. In addition, a number of published studies about the web
use data sets from Common Crawl, as mentioned in the literature
overview chapter [16-19].
Having said that, our main use of the corpus is to analyze the HTML code of individual
pages, extract the potential tracking services from it, and construct a tracking graph
for further analysis. The tracking graph is an edge file in the form (tracking service →
tracked site) that will be used along with the hyperlink graph (provided by Web Data
Commons) to build a property graph that covers both web links and tracking relationships.
Data set choice The Common Crawl corpus contains petabytes of data collected
over the last 7 years. It contains raw web page data, extracted metadata and text
extractions. The dataset lives on Amazon cloud storage S3 [28] as part of the Amazon
Public Datasets program [29]. The data sets represent multiple crawls from different years
that also employ different crawling algorithms. In our study, however, we use
the web corpus that was released in August 2012. The reason behind this selection is
twofold:
1. Since we are also matching the web corpus with its hyper-link graph representation
offered by the Web Data Commons project, we are limited only to the 2012 and
2014 corpora that are offered by the project. However, the two corpora are crawled
using different techniques. The 2012 corpus was gathered using a web crawler
employing a breadth-first-search selection strategy with link discovery
while crawling. The crawl was also seeded with a large number of URLs from
former crawls performed by the Common Crawl Project. This is opposed to the
2014 crawl, which employed a modified Apache Nutch crawler [30] to download pages
from a large but fixed seed list. The 2014 crawler was restricted to URLs contained
in this list and did not extract additional URLs from links in the crawled pages.
The seed list contained around 6 billion URLs and was provided by the search
engine company blekko [31].
2. The Web Data Commons foundation recommends using the 2012 over the 2014
graph for the analysis of the connectivity of Web pages or the overall analysis of
the Web graph, as a BFS-based selection strategy including URL discovery while
crawling will more likely result in a realistic sample of the web graph [32].
2012 Web Corpus Status The corpus consists of approximately 3.8 billion documents
occupying over 100 terabytes of data. Table 3.1 contains a summary of the corpus
contents [33].
Table 3.1: Content statistics of the 2012 web corpus

Content Type    Number (in millions)
Domains         61
PDF             92
Word            6.5
Excel           1.3
As this thesis ran in parallel with the Track The Trackers project (see 3.3.1),
which is responsible for extracting trackers, and due to timeline and budget constraints
(see 3.3.1), we were able to run the extraction job on 25% of the 2012 corpus, which is
roughly 23 terabytes of raw data. This means we work with a 25% random
sample of the web crawl, which we consider representative for our analysis.
3.1.2 Web Data Commons hyper-link graph
The Web Data Commons project [34] was started by researchers from Freie Universität
Berlin and the Karlsruhe Institute of Technology (KIT) in 2012. The goal of
the project is to facilitate research and support companies in exploiting the wealth of
information on the Web by extracting structured data from web crawls, mainly from
the Common Crawl project, and providing this data for public download. Today the WDC
project is mainly maintained by the Data and Web Science Research Group at the
University of Mannheim.
Web Data Commons uses The project offers three types of data:
1. RDFa, Microdata, and Microformat: structured data describing products,
people, organizations, places, and events embedded into HTML pages using markup
standards such as RDFa, Microdata and Microformats.
2. Web Tables: a fraction of the HTML tables found on the web is quasi-relational,
meaning that they contain structured data describing a set of entities, and are thus
useful in application contexts such as data search, table augmentation, knowledge
base construction, and for various NLP tasks.
3. Hyperlink Graphs: large hyperlink graphs that WDC extracts from the Common
Crawl corpora. These graphs can help researchers to improve search algorithms,
develop spam detection methods and evaluate graph analysis algorithms.
Data Set choice In our analysis, we work with the 2012 hyperlink graph. The
reason for choosing the 2012 over the 2014 version is the crawling techniques used
by Common Crawl, as explained in the previous section 3.1.1. WDC provides the graph
on three levels of granularity/aggregation: page level, host level and pay-level-domain
(PLD) level, the last of which we use in this thesis. A PLD can be considered the root
domain for which users/organizations usually pay when registering a URL. PLDs
allow us to identify a realm where a single user or organization is likely to be in control.
For example, the 2 research groups dima.tu-berlin.de and ida.tu-berlin.de have the same
parent PLD, tu-berlin.de. The pay-level-domain web graph consists of approximately 43
million nodes and 623 million arcs.
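The PLD notion can be sketched in a few lines. A real implementation would rely on the full Public Suffix List to decide where the registrable part of a host name begins; the tiny hard-coded suffix set below is purely an illustrative assumption.

```python
# Simplified pay-level-domain (PLD) extraction. A production system would
# consult the full Public Suffix List; this tiny suffix set is only an
# illustrative assumption for the sketch.
PUBLIC_SUFFIXES = {"de", "com", "org", "co.uk"}

def pld(host: str) -> str:
    """Return the registrable domain: the public suffix plus one more label."""
    labels = host.lower().rstrip(".").split(".")
    # scanning left to right finds the longest matching public suffix first
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES and i > 0:
            return ".".join(labels[i - 1:])
    return host

print(pld("dima.tu-berlin.de"))   # → tu-berlin.de
print(pld("ida.tu-berlin.de"))    # → tu-berlin.de
print(pld("news.bbc.co.uk"))      # → bbc.co.uk
```

Note that multi-label suffixes such as co.uk are exactly why a plain "last two labels" heuristic is not enough, and why the suffix list matters.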
3.1.3 The Common Crawl WWW ranking
The project [35] is brought by the Laboratory for Web Algorithmics of the Università
degli Studi di Milano and by the Data and Web Science Group of the University of
Mannheim. They parse the Common Crawl corpus to generate a web graph, from which
they compute a set of rankings (centrality measures) for each node in the graph. We
mainly use their PageRank and Harmonic centrality data sets in one of our experiments.
3.1.4 Alexa top sites
As part of our tracker-penetration analysis, we use a dataset [36] containing a
list of the top 1 million websites by traffic, made available by Alexa Analytics [37].
3.2 Data Processing Platforms
3.2.1 Apache Hadoop
The Apache Hadoop [38] project develops open-source software for reliable, scalable,
distributed computing. Its software library is a framework that allows for the distributed
processing of large structured and unstructured data sets across clusters of computers
using simple programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage. Rather than relying on
hardware to deliver high availability, the library itself is designed to detect and handle
failures at the application layer, thus delivering a highly available service on top of a cluster
of computers, each of which may be prone to failures. For more details about Hadoop
internals one can refer to [38].
In our study, we use the Hadoop Distributed File System (HDFS) to store the large datasets
in order to make them available for processing in a distributed environment, as well
as the Hadoop MapReduce framework for the actual parallel data processing, especially
for extracting trackers from the web corpus.
HDFS is a file system that provides reliable data storage and access across all the
nodes in a Hadoop cluster. It links together the file systems on many local nodes to
create a single file system.
MapReduce is the heart of Hadoop. It is a programming paradigm that allows for
massive scalability across hundreds or thousands of servers in a Hadoop cluster. The
term MapReduce actually refers to two distinct tasks that Hadoop programs perform.
The first is the map job, which takes a set of raw input data and transforms it into
an intermediate set of data represented as key/value pairs. The reduce job operates
on these intermediate key/value tuples and combines (aggregates) them into a smaller set
of tuples. As the sequence of the name MapReduce implies, the reduce job is always
performed after the map job.
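The map/shuffle/reduce flow just described can be imitated in a few lines of single-process Python. The tracker and site names below are invented; the reducer counts distinct tracked PLDs per tracker, mirroring the kind of job used later in this thesis, but none of this is the actual Hadoop code.

```python
from collections import defaultdict

# In-process imitation of the MapReduce flow: map emits key/value pairs,
# the framework groups them by key (shuffle), reduce aggregates each group.
def map_phase(records, mapper):
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# Toy input: tracking-graph edges (tracker, tracked PLD); names are invented.
edges = [("tracker.example", "site-a.de"), ("tracker.example", "site-b.de"),
         ("stats.example", "site-a.de"), ("tracker.example", "site-b.de")]

mapper = lambda edge: [(edge[0], edge[1])]
reducer = lambda tracker, plds: len(set(plds))   # distinct tracked PLDs

counts = reduce_phase(shuffle(map_phase(edges, mapper)), reducer)
print(counts)   # → {'tracker.example': 2, 'stats.example': 1}
```

In real Hadoop the shuffle step is performed by the framework across machines; only the mapper and reducer are user code.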
3.2.2 Apache Spark
Spark [39] is an open source, parallel data processing framework that complements
Apache Hadoop to make it easy to develop fast, unified Big Data applications combining
batch, streaming, and interactive analytics on a variety of data input types. It was
originally developed in 2009 at UC Berkeley's AMPLab, and open sourced in 2010 as an
Apache project.
Spark's main data primitive is the Resilient Distributed Dataset (RDD) [40], which enables
fast in-memory data processing over a distributed environment. Apache Spark
comes prepackaged with libraries for different big data tasks, such as structured data
manipulation (Spark SQL), machine learning (MLlib), data streaming (Spark Streaming)
and graph processing (GraphX). For more details about Apache Spark internals one can
refer to [39].
In this thesis, we mainly use Spark version 1.3.1 and its GraphX library [41] for analyzing
the tracking graph. At a high level, GraphX extends the Spark RDD abstraction
by introducing the Resilient Distributed Property Graph, a directed multigraph¹ with
properties attached to each vertex and edge. To support graph computation, GraphX
exposes a set of fundamental operators, such as subgraph and joins, as well as an opti-
mized variant of the Pregel API [42]. In addition, GraphX includes a growing collection
of graph algorithms and builders to simplify graph analytics tasks.
3.2.3 Apache Flink
Flink [43] is an open source platform for scalable batch and stream data processing
that started at TU-Berlin under the name of Stratosphere and now is a top level Apache
project. Similar to Spark, it provides out of the box libraries for batch and streams
processing, machine learning, SQL-like interface and graph processing. However, Flink
provides an internal optimizer similar to those found in relational databases, besides, it
is optimized for cyclic or iterative processes by using iterative transformations on data
collections. This is achieved by an optimization of join algorithms, operator chaining
and reusing of partitioning and sorting. For more details about Apache Flink internals
one can refer to [43]
We use Flink version 0.9 to conduct a number of experiments in our study that use
its Pregel-like graph processing framework Spargel through its higher-level API Gelly.
¹A multigraph is a graph which is permitted to have multiple edges (also called parallel edges), that is, edges that have the same end nodes. Thus two vertices may be connected by more than one edge.
3.2.4 R
R is a language and environment for statistical computing and graphics. It is a GNU
project similar to the S language and environment, which was developed at
Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and
colleagues. R is available as Free Software under the terms of the Free Software
Foundation's GNU General Public License in source code form.
R provides a wide variety of statistical and graphical techniques, and is highly extensible.
One of R's strengths is the ease with which well-designed publication-quality plots
can be produced, including mathematical symbols and formulae where needed.
After running our experiments on the large-scale datasets, we often produce intermediate
aggregations and metrics (e.g. vertex-wise metrics of a graph) and then process
these results using R to obtain the final statistics and/or plots.
3.2.5 MS SQL Server BI Stack
SQL Server [44] is the Microsoft product-line for relational databases. On top of the
core database engine, SQL Server provides solutions for data integration (ETL), OLAP
cubes and reporting through SQL Server Integration Services (SSIS), Analysis Services
(SSAS) and Reporting Services (SSRS) respectively.
We use a free student version of SQL Server 2012, obtained through the Microsoft
DreamSpark program², to develop a data warehouse that stores a multidimensional
model of the tracking graph obtained from the Common Crawl web corpus, and to build
an OLAP cube on top of it that facilitates parts of our analysis in chapter 4.
3.2.6 WebGraph Framework
WebGraph [45] is an open source framework, under the GNU General Public License,
for graph compression aimed at studying web graphs, developed in Java. It provides
simple ways to manage very large graphs, exploiting modern compression techniques [46].
More precisely, it consists of the following:
²DreamSpark is a Microsoft program that supports technical education by providing access to Microsoft software for learning, teaching and research purposes. https://www.dreamspark.com/
- A set of flat codes, called ζ codes, which are particularly suitable for storing web graphs.
- Algorithms for compressing web graphs that exploit gap compression and referentiation, intervalisation and ζ codes to provide a high compression ratio.
- Algorithms for lazily accessing a compressed graph without actually decompressing it until it is necessary.
- Algorithms for analysing very large graphs, such as estimating neighborhood functions, detecting strongly connected components, etc.
- Samples of publicly available very large datasets that reach over 1 billion links.
We mainly use the WebGraph framework in chapter 5 to estimate the neighborhood
function of the tracked-web using the HyperANF algorithm, and to extract a number
of distance-related measures from it. For more details about the WebGraph framework
and its algorithms one can refer to [45, 46].
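HyperANF estimates the neighborhood function with probabilistic counters; on a toy graph we can compute it exactly with BFS and read off the distance measures used in this thesis: the 90% effective diameter, the average distance, and spid (the variance-to-mean ratio of the distance distribution). The graph below is invented for the example.

```python
from collections import deque

def distances_from(adj, src):
    """BFS shortest-path distances from src to every reachable node."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def neighborhood_function(nodes, adj):
    """N(t) = number of ordered pairs (u, v) with 0 < distance(u, v) <= t."""
    all_d = [d for u in nodes for d in distances_from(adj, u).values() if d > 0]
    N = {t: sum(1 for d in all_d if d <= t) for t in range(1, max(all_d) + 1)}
    return N, all_d

# Toy graph: a directed 4-cycle
nodes = ["a", "b", "c", "d"]
adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
N, all_d = neighborhood_function(nodes, adj)

total = N[max(N)]
eff_diam = min(t for t in N if N[t] >= 0.9 * total)   # 90% effective diameter
avg_dist = sum(all_d) / len(all_d)
spid = (sum(d * d for d in all_d) / len(all_d) - avg_dist ** 2) / avg_dist
print(N, eff_diam, avg_dist, round(spid, 3))
# → {1: 4, 2: 8, 3: 12} 3 2.0 0.333
```

The exact BFS version above is O(V·(V+E)) and only feasible on small graphs, which is precisely why HyperANF's approximate counters are needed for the tracked-web.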
3.2.7 FlashGraph Framework
FlashGraph [47, 48] is a semi-external-memory graph processing engine, optimized
for a high-speed SSD array but also able to run on hard disk drives (HDD). FlashGraph
provides a flexible programming interface to help users implement graph algorithms, along
with a number of ready-to-use common graph algorithms that can scale to very large
graphs on commodity machines within an acceptable run-time.
We mainly use FlashGraph in chapter 5 for triangle counting, as the algorithms in
Spark and Flink did not scale well with our graphs.
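Triangle counting itself is conceptually simple; the node-iterator sketch below works on small undirected graphs (FlashGraph's contribution is making this scale to billions of edges, which the sketch does not attempt). The edge list is invented for the example.

```python
from itertools import combinations

def count_triangles(edges):
    """Count triangles in an undirected graph via the node-iterator method:
    for every vertex, test each pair of its neighbors for a closing edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = 0
    for u, neighbors in adj.items():
        for v, w in combinations(sorted(neighbors), 2):
            if w in adj[v]:
                triangles += 1
    return triangles // 3   # each triangle is counted once per corner

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(count_triangles(edges))   # → 1
```

Scalable implementations additionally orient edges by degree so each triangle is tested only once, which is one of the tricks engines like FlashGraph build on.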
3.3 Data Preparation
3.3.1 Trackers extraction
In order to extract potential tracking services from the Common Crawl web corpus,
we utilize an ongoing project initiated and developed at TU-Berlin by Sebastian
Schelter, with contributions from other developers including the author of this thesis.
The project is named Track the Trackers [49] and it is open sourced on GitHub.
Track the Trackers uses Hadoop MapReduce to process the input web corpus (unstructured
data) stored in the Arc file format [50] and parse each HTML page, along with
its resources, into an intermediate serializable structured format using Google Protocol
Buffers³. These intermediate structures are then read by another MapReduce
job to extract potential trackers and build the tracking graph. Figure 3.1 provides
a high-level overview of the code for extracting trackers and constructing the tracking
graph.
The tracking graph job marks an HTML page resource (i.e. scripts, images, links, etc.)
as suspicious if its source (i.e. its HTML source attribute) is a different domain than that
of the page itself (i.e. a third-party domain). The rationale behind this relies on the four
types of HTML resources we are interested in:
1. Scripts: Most third-party analytics trackers⁴ use a code snippet with the
source attribute linked to their analysis engine.
2. IFrames: Most third-party advertisers use HTML IFrames to host their ads. In
most cases the source attribute of the IFrame is linked to the advertiser.
3. Images: A number of trackers, such as Google's DoubleClick, use a technique called
tracking pixels: an img tag whose source is generally a third-party domain.
The browser sees the img tag and makes a request from the user's browser to the
server (as directed by the URL in the HTML source attribute). With the image
request, the browser passes the user's domain-specific cookie ID just as it would
with any HTTP request; this ID can identify and track the user. The server then
responds with a transparent 1x1 GIF image, which should not be visible to the end
user.
4. Links: The same logic as with images can also be applied to any external
resource requested by a page from a third-party domain. These kinds of cross-
domain requests can be achieved by an HTML link tag. This is different from the
HTML a tag with an href attribute, which represents a clickable hyperlink.
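The detection of third-party resources in these four contexts can be sketched with Python's standard HTML parser. This is a simplified, hypothetical analogue of the actual pipeline: it compares host names rather than pay-level-domains, and the page and tracker names are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed simplification: a resource is "suspicious" when its host differs
# from the page's host; the real pipeline compares pay-level-domains.
class ResourceExtractor(HTMLParser):
    WATCHED = {"script": "src", "iframe": "src", "img": "src", "link": "href"}

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        wanted = self.WATCHED.get(tag)
        if wanted is None:
            return
        src = dict(attrs).get(wanted)
        host = urlparse(src or "").netloc
        if host and host != self.page_host:
            self.suspicious.append((tag, host))

html = """<html><body>
<script src="http://tracker.example/t.js"></script>
<img src="/local/logo.png">
<iframe src="http://ads.example/frame"></iframe>
</body></html>"""

parser = ResourceExtractor("site-a.example")
parser.feed(html)
print(parser.suspicious)
# → [('script', 'tracker.example'), ('iframe', 'ads.example')]
```

Note how the relative image URL yields no host and is therefore treated as first-party, matching the logic described above.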
In case a resource is marked as suspicious, a new tuple is added to the tracking graph
representing an edge between the source URL of this resource (the potential tracker)
³Protocol Buffers (also known as protobuf) are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data, like XML but smaller, faster, and simpler. One defines the data structure once, then uses generated source code to easily write and read the structured data to and from a variety of data streams, using a variety of languages such as Java, C++, or Python.
⁴Refer to 1.2.1 for the classification model of tracking services.
and the tracked pay-level-domain of that page. Here we assume generality for the
sake of a high-level analysis: if one or more pages of a website are tracked, we consider
the website as tracked by the union of all trackers found within its individual pages.
3.4 Environment
3.4.1 Amazon EC2
As the 2012 Common Crawl corpus resides on Amazon S3, we need to process
it using Amazon Elastic MapReduce to extract the intermediate files that contain the
parsed resources of each page, from which we can construct the tracking graph (see 3.3.1).
This extraction job has been supported by an AWS in Education Research Grant award⁵ obtained by Sebastian Schelter.
3.4.2 DIMA IBM Power Cluster
To run distributed Spark and Flink jobs for analyzing large graphs (i.e. the tracking
graph, the tracked-web and the PLD web graph), we use the IBM Power Cluster offered by
IBM to the DIMA research group at TU-Berlin. The cluster consists of 10 nodes, each with
48 cores and 60 GB RAM, and a total disk space of 1.8 TB that is mainly used for HDFS.
⁵http://aws.amazon.com/grants/
Figure 3.1: Pseudocode of the main routines in extracting trackers
For simplicity, the pseudocode omits details about keeping the HTML tag in the tracking graph. In reality, an entry of the tracking graph consists of (trackerID, trackedID, isScript, isIFrame, isImage, isLink).

Phase 1. Parsing HTML resources
input: set of Arc files containing web corpus pages
output: set of parquet files with parsed pages

function processArcFile(ArcFile)
    for each page in ArcFile do
        if (page.type is HTML) then
            parsedPage := empty
            parsedPage.javascripts := parse(page, resources.javascript)
            parsedPage.iframes := parse(page, resources.iframe)
            parsedPage.images := parse(page, resources.image)
            parsedPage.links := parse(page, resources.links)
            parsedPage.saveAsParquetFormat()
        end if
    end for
end function

Phase 2. Construct the tracking graph
input: set of parquet files with parsed pages
output: tracking graph

function map(ParsedPage)
    thirdPartyResources := List.empty
    for each script in ParsedPage.javascripts do
        if (script.src != ParsedPage.src) then
            thirdPartyResources.add(script.src)
        end if
    end for
    for each iframe in ParsedPage.iframes do
        if (iframe.src != ParsedPage.src) then
            thirdPartyResources.add(iframe.src)
        end if
    end for
    // ... fill thirdPartyResources by doing the same for images and links
    for each tracker in thirdPartyResources do
        if (tracker.PLD != ParsedPage.PLD) then
            emit(tracker.PLD, ParsedPage.PLD)
        end if
    end for
end function

function reduce(tracker, List: trackedPLDs)
    trackedHosts := trackedPLDs.distinct
    for each trackedHost in trackedHosts do
        saveToTrackingGraph(tracker, trackedHost)
    end for
end function
-
Chapter 4
Analysis I: Statistical Properties
In this chapter we focus on presenting and analyzing a number of statistical measurements
about the tracking services and tracked websites. First we investigate the top
trackers, their general coverage, and the tracking penetration in the most popular websites.
Then we drill into the contexts in which trackers are observed and their classes, as
well as analyzing the tracked hosts' domain extensions. Finally, we investigate the
relationships between the top trackers and whether there are significant associations between
their occurrences in a given PLD¹.
To do so, we mainly use Hadoop to extract tracking services and construct the tracking
graph (see 3.3.1) from the raw web corpus on the Amazon cloud, along with Spark and Flink
for analytical jobs to compute different metrics and aggregations of the graph on the
university cluster (see 3.4.2); finally, we analyze these intermediate metrics locally using
R to obtain the final statistics and indicators. We also designed a data warehouse and
developed an OLAP cube on top of it using the Microsoft SQL Server 2012 BI stack. The
straightforward data warehouse contains 2 main dimensions, Tracked PLD and Tracking
Service, along with one narrow but lengthy fact table that contains the tracking graph
as an edge list with a number of Boolean columns needed for analysis. The cube helps in
mapping the relational model of the DWH into a multidimensional one that can benefit
from MDX queries for more convenient data analysis when it comes to drilling and slicing
data.
¹As a reminder, a pay-level-domain (PLD) is the main part of a URL that identifies a parent organization/domain. For example, the 2 research groups dima.tu-berlin.de and ida.tu-berlin.de have the same parent PLD tu-berlin.de.
4.1 Trackers Coverage
Our first investigation is to determine the top tracking services and analyze their coverage
over the web.
First we ran the Hadoop job to extract the tracking graph from the Common Crawl
web corpus (see 3.3.1). We were able to process a sample of 25% of the raw data from the
full corpus. However, after analyzing the processed output we found that this sample
accounts for 35% of the individual pages and 75% of the pay-level-domains in the full
corpus. That is based on our generalization assumption, where we tag a pay-level-domain
as potentially tracked if at least one of its pages is tracked. We believe that this high
level of PLD coverage in the sample is due to the long-tail distribution of the number of
web pages within websites that is observed by [13, 23, 24].
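The intuition that a page sample covers far more sites than its nominal size can be illustrated with a toy simulation: give a hypothetical population of sites heavy-tailed page counts, sample 25% of all pages uniformly, and measure what share of sites appears at least once. The distribution parameters are assumptions chosen only to illustrate the effect; the numbers do not reproduce the corpus figures.

```python
import random

random.seed(7)

# Hypothetical illustration: 10,000 sites with heavy-tailed page counts.
site_pages = [max(1, int(random.paretovariate(1.5))) for _ in range(10_000)]
# One list entry per page, labelled with its site id.
pages = [site for site, n in enumerate(site_pages) for _ in range(n)]

sample = random.sample(pages, k=len(pages) // 4)       # 25% of all pages
covered_sites = len(set(sample)) / len(site_pages)
print(f"page sample: 25%, sites covered: {covered_sites:.0%}")
```

A site with n pages appears in the sample with probability roughly 1 - 0.75^n, so multi-page sites are almost always covered, which pulls site coverage well above the 25% page-sampling rate.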
We were able to extract roughly 100 million tracking entries (i.e. tracker X → pay-
level-domain Y). After that, we ran an analytical Flink job to count the number of
tracked PLDs per unique potential tracker. Based on the tracker extraction assumptions
we explained in 3.3.1, we extracted approximately 27 million potential trackers. This
figure raised some doubts about the assumptions we made while detecting trackers. However,
after further analysis of the tracking-count distribution (i.e. the number of tracked sites per
tracker), we observed two interesting facts:
- 82% of these potential trackers have a tracking count of only 1.
- 99.9% of them have a tracking count of less than 1,000 hosts.
Based on the first finding, we considered any tracker that occurs only once (i.e. tracking
only one PLD) as noise in the extraction process, since no actual tracking service
would be visible in only 1 host. Based on that, we define the new term effective tracker,
that is, a tracking service that is detected to track more than one PLD. There are
approximately 4.8 million effective trackers within our dataset. For the second finding, we
hypothesize that the number of tracked sites per tracker follows a power-law distribution;
however, this needs further empirical examination.
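The effective-tracker filter just described can be sketched on a toy edge list; the tracker and PLD names below are invented for the example.

```python
from collections import Counter

# Toy tracking graph (tracker, tracked PLD); names are invented.
edges = [("t1", "a"), ("t1", "b"), ("t1", "c"),
         ("t2", "a"), ("t2", "d"),
         ("t3", "b"),            # appears in a single PLD -> treated as noise
         ("t4", "d")]            # appears in a single PLD -> treated as noise

counts = Counter(tracker for tracker, _ in edges)
noise = [t for t, c in counts.items() if c == 1]
effective = {t: c for t, c in counts.items() if c > 1}

print(f"{len(noise) / len(counts):.0%} of trackers occur once")
print("effective trackers:", effective)
# → 50% of trackers occur once
# → effective trackers: {'t1': 3, 't2': 2}
```

At corpus scale the same counting is done by the Flink job mentioned above; the per-tracker counts are also the raw material for the power-law examination of the tracking-count distribution.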
As illustrated in figure 4.1, we found that at least 60% of the PLDs in the sample are
potentially tracked under our previously mentioned assumptions. For those 19 million
Figure 4.1: Tracking detection summary
The figure shows statistics about the sample taken from the full web corpus residing on Amazon S3. The processing of raw data to extract pages and resources is done on Amazon Elastic MapReduce, and finally the construction of the tracking graph and its analysis is performed on the TU-Berlin DIMA cluster.
PLDs (constituting the 60%), we detected the top 20 trackers (see table 4.1) based on the
number of unique PLDs spanned by each of them. One can notice that Google-related
services have the highest share of tracking. However, the figures can't simply be summed,
since one PLD can be tracked by multiple services.
To better understand the nature of these trackers, we investigated further to find out
the following:
- googlesyndication.com: a domain owned by Google that is used for storing and loading ad content and other resources relating to ads for Google AdSense and DoubleClick from the Google content delivery network.
- ajax.googleapis.com: the AJAX Libraries API is Google's content distribution network and loading architecture for the most popular open source JavaScript libraries, such as jQuery, AngularJS, Dojo, etc.
- The difference between the well-known facebook.com and facebook.net is that the latter is Facebook's API endpoint that supports social widgets and other applications, while the former is usually found in iframe and image contexts (table 4.1), which we postulate to be its usage for hosting Facebook media content (videos and pictures).
Table 4.1: Top 20 potential trackers
The table also shows the HTML context in which the tracker was detected. An important remark while interpreting the figures below is that the context percentages don't have to add up to 100% for each tracker, since the same tracker can be detected in different contexts within the same PLD.
Tracker                  Frequency    % of Tracked PLDs   % of All PLDs   Script%   IFrame%   Image%   Link%
google-analytics.com     8,183,519    42%                 25%             100%      0%        0%       0%
googlesyndication.com    2,953,807    15%                 9%              99%       0%        1%       0%
google.com               2,206,582    11%                 7%              78%       16%       15%      7%
ajax.googleapis.com      1,470,524    8%                  5%              99%       0%        0%       6%
facebook.com             1,315,966    7%                  4%              17%       77%       12%      0%
macromedia.com           1,290,750    7%                  4%              100%      0%        0%       0%
adobe.com                983,536      5%                  3%              56%       0%        47%      0%
facebook.net             858,533      4%                  3%              100%      0%        0%       0%
casalemedia.com          832,215      4%                  3%              100%      0%        0%       0%
youtube.com              780,471      4%                  2%              15%       83%       9%       1%
twitter.com              753,311      4%                  2%              92%       10%       1%       1%
addthis.com              741,610      4%                  2%              97%       0%        34%      0%
imgaft.com               607,701      3%                  2%              99%       0%        100%     0%
godaddy.com              566,565      3%                  2%              99%       1%        3%       0%
gravatar.com             545,740      3%                  2%              30%       0%        82%      7%
gmpg.org                 516,165      3%                  2%              0%        0%        0%       100%
statcounter.com          507,867      3%                  2%              96%       0%        95%      0%
dsnextgen.com            399,400      2%                  1%              98%       2%        0%       0%
wordpress.com            384,114      2%                  1%              81%       0%        37%      16%
yahoo.com                367,155      2%                  1%              27%       2%        78%      0%
- casalemedia.com: a Canadian online media and technology company that builds online advertising technology for web publishers and advertisers.
- imgaft.com: we could not find extensive information about this domain and its
siblings ak2.imgaft and ak3.imgaft. The only thread we found is that it is registered
to GoDaddy. We suspect it is being used in the parked-domain advertising scheme
that GoDaddy provides for its users: when a user reserves a domain until his website
is created, or even in order to sell it in the future, the domain can be parked, and a
temporary landing page with targeted advertising is served by GoDaddy to the
domain's visitors in return for a percentage of the ad revenues paid to the parked-host
owner. However, we couldn't technically validate this hypothesis.
- gravatar.com: an online service that provides users with images (avatars) that
follow them from site to site, appearing beside their name when they do things like
commenting or posting on a blog. Avatars help in identifying a user's posts across
blogs and web forums. We believe it made it into the top 20 list since it is included
by default in every WordPress.com account, and WordPress has more than 6 million
pages in the Common Crawl corpus sample we are using.
- dsnextgen.com: we could not find much information about this domain, but we did
find a number of threads describing it as malware, with people reporting their
websites hacked by it.
- statcounter.com: a free web tracker embedded by websites as a hit counter and to provide real-time, detailed web traffic information.
4.2 Top Sites Tracking
In this question, we analyze the magnitude of the tracking phenomenon from a
different perspective. Rather than the general statistics about the entire web corpus we
have observed so far, we focus on quantifying the trackers' penetration over a key
subset of the internet: the most popular sites on the web. To achieve that, we
use the publicly available dataset mentioned in 3.1.4 from Alexa Analytics, containing a
list of the top 1 million websites based on traffic.
Interestingly, we find that the tracking penetration increases as we go up the list of
top sites. It starts at 48% within the top 1 million PLDs and
increases gradually to reach a high of 82% within the top 1,000 PLDs, as shown in figure 4.2.
Figure 4.2: Alexa top sites tracking penetration
Furthermore, we noticed that this pattern (increasing tracking penetration with a
shrinking subset of top sites) is visible on the tracker level as well. Table 4.2
shows that the top 10 trackers are the same at each subset, in the same order, with each
tracker's penetration following an increasing trend across subsets. The only exception is
doubleclick.net, which appears within the top 1,000 sites in place of addthis.com.
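The per-subset penetration numbers can be computed with a simple prefix scan over the ranked list; a minimal sketch (the function name and toy data are hypothetical, and the real computation runs over the full Alexa list):

```python
def penetration_by_rank(ranked_plds, tracked, cutoffs=(1_000, 10_000)):
    """Tracking penetration within each top-N prefix of a ranked PLD list.

    `ranked_plds` is ordered most- to least-popular; `tracked` is the
    set of PLDs marked as potentially tracked.  Returns {cutoff: ratio}.
    """
    result = {}
    for n in cutoffs:
        top = ranked_plds[:n]
        result[n] = sum(1 for pld in top if pld in tracked) / len(top)
    return result

ranked = ["a.com", "b.com", "c.com", "d.com"]
tracked = {"a.com", "b.com", "c.com"}
```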
Table 4.2: Top trackers penetration ratio across Alexa top sites
Tracker                  Top 1000K   Top 500K   Top 100K   Top 10K   Top 1K
google-analytics.com     0.34        0.38       0.47       0.62      0.71
google.com               0.16        0.20       0.29       0.45      0.60
facebook.com             0.11        0.14       0.21       0.37      0.48
ajax.googleapis.com      0.09        0.11       0.16       0.28      0.40
googlesyndication.com    0.09        0.11       0.15       0.22      0.30
facebook.net             0.08        0.11       0.17       0.30      0.37
twitter.com              0.08        0.10       0.17       0.32      0.44
youtube.com              0.07        0.09       0.14       0.25      0.40
addthis.com              0.07        0.08       0.12       0.20      -
macromedia.com           0.05        0.06       0.11       0.21      0.33
doubleclick.net          -           -          -          -         0.34
4.3 Tracking Classification
Our third question focuses on the tracking types. In the literature overview we
discussed a proposed classification framework for tracking behavior, from which we can
distinguish between 3rd party web analytics, advertisers and social widgets (see 1.2.1).
To classify trackers, we first need to analyze the contexts in which the potential tracker
is detected. As explained before in 3.3.1, a 3rd party tracker can be detected as the source
HTML attribute of scripts, iframes, images and links. Figure 4.3 shows the ratio of
trackers detected at each HTML source, compared to the number of unique trackers, as
well as the ratio of tracked PLDs, compared to the number of unique tracked PLDs. We
notice that most potential trackers (92%) are detected as sources of image tags in HTML
and that most tracked PLDs are potentially tracked by means of 3rd party scripts.
Figure 4.3: Tracking sources summary
A key point one needs to understand while interpreting the tracking-source analysis
graph in figure 4.3 is that the ratios don't have to add up to 1. This is due to the fact that
a single tracker can be detected in different sources at different PLDs (e.g. in a script in
PLD 1 and in an image in PLD 2) and even potentially within the same PLD. The same
goes for tracked PLDs, where one PLD can be potentially tracked by different trackers
detected at different sources (e.g. using Google Analytics for traffic analysis and hosting
3rd party ads in iframes). Table 4.3 shows the frequency distribution of the available
combinations of tracking contexts. The frequency represents the number of occurrences
where a tracked PLD is detected to have the corresponding tracking sources. The ratio
is calculated based on the total number of entries in the tracking graph (approximately
80 million). For detailed information about top trackers by source, one can refer to
appendix A.
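The context-combination frequencies of table 4.3 amount to counting, per tracking-graph entry, the set of HTML contexts it was seen in. A small sketch (the entry representation below is an assumption made for illustration):

```python
from collections import Counter

def source_combinations(entries):
    """Frequency of each combination of HTML contexts per entry.

    `entries` maps a (tracker, pld) pair to the set of contexts in which
    the tracker was detected, e.g. {"script", "image"}.
    """
    combos = Counter()
    for contexts in entries.values():
        combos[" & ".join(sorted(contexts))] += 1
    return combos

entries = {("ga", "a.com"): {"script"},
           ("ad", "a.com"): {"script", "image"},
           ("ad", "b.com"): {"script"}}
```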
Table 4.3: Tracking-Source distribution
HTML Source               Frequency    Ratio
Script                    37,745,830   48%
Image                     23,304,038   30%
Script & Image            5,578,269    8%
IFrame                    3,956,215    5%
Script & Image & Link     3,406,727    5%
Link                      2,146,367    3%
Image & Link              1,050,777    2%
Script & Link             827,657      2%
Script & IFrame           419,904      1%
All                       398,320      1%
IFrame & Image            225,109      1%
Script & IFrame & Image   107,748      1%
Script & IFrame & Link    102,543      1%
IFrame & Image & Link     57,420       1%
IFrame & Link             29,045       1%
For 3rd party social-widget tracking, we analyzed a predefined set of code snippets
offered by popular social network websites (see appendix C) and marked whether each
entry in the tracking graph is tracked by a social widget, based on the source attribute
that the code is using. Table 4.4 shows the share of each social network compared to the
subset of PLDs being tracked by social widgets; in terms of coverage, it shows the
percentage of PLDs spanned by each social network compared to all tracked PLDs and
compared to the sample web corpus.
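The widget-marking step boils down to matching each entry's source attribute against the domains used by the predefined snippets. A minimal sketch; the domain list here is a hypothetical subset standing in for the full list in appendix C:

```python
SOCIAL_DOMAINS = {  # hypothetical subset of the predefined snippet list
    "facebook.com": "Facebook",
    "platform.twitter.com": "Twitter",
    "youtube.com": "Youtube",
}

def widget_of(src_url):
    """Return the social network whose widget domain appears in the
    src attribute, or None if the entry is not widget-tracked."""
    for domain, network in SOCIAL_DOMAINS.items():
        if domain in src_url:
            return network
    return None
```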
Finally, based on the trackers extraction assumption and the proposed classification
framework, we can assign the script tracking to 3rd party web analytics services, iframes
and images to advertising-related trackers, while extracting the social-widget trackers
Table 4.4: Social-Widget tracking summary

Social-Widget   Absolute Frequency   Relative Frequency   % of Tracked PLDs   % of All PLDs
Facebook        2,180,111            0.576                11.17%              6.72%
Youtube         798,027              0.211                4.09%               2.46%
Twitter         783,727              0.207                4.02%               2.42%
Reddit          17,552               0.005                0.09%               0.05%
Instagram       4,346                0.001                0.02%               0.01%
Tumblr          140                  0.000                0.00%               0.00%
manually as explained in the previous section. This led us to the final statistics about
tracking classification as illustrated in figure 4.4, which shows the percentage of tracked
PLDs under each class. The ratios don't add up to 1 because of the overlapping tracking
behavior explained earlier.
Figure 4.4: Tracking Classification Summary
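The assignment rule just described (scripts to analytics, iframes/images to advertising, known widget snippets to social) can be sketched as a small classifier; the function and class labels are illustrative, not the thesis implementation:

```python
def classify(entry_contexts, is_social_widget):
    """Assign tracking classes: scripts -> 3rd party analytics,
    iframes/images -> advertising, known widget snippets -> social.
    An entry may fall into several classes at once, which is why the
    class ratios in figure 4.4 do not add up to 1."""
    classes = set()
    if "script" in entry_contexts:
        classes.add("analytics")
    if entry_contexts & {"iframe", "image"}:
        classes.add("advertising")
    if is_social_widget:
        classes.add("social-widget")
    return classes
```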
4.4 Domain Analysis
Our next area of exploration is tracking penetration analysis based on internet
domains. There are many domain levels to consider (e.g. second-level, top-level, etc.).
However, we focus on the generic top-level domain (gTLD) and the country-code top-level
domain (ccTLD).
4.4.1 Country code analysis
We were able to detect approximately 11 million pay-level-domains that contain a
country code (e.g. .de, .uk, .fr, etc.) in the sample web corpus of 32 million PLDs (out
of which around 60% were marked as potentially tracked).
By means of informal visual analysis, we found that the tracking penetration ratios of
country codes follow a normal distribution, as shown in figure 4.5, with minimum
= 0.23, median = 0.59, maximum = 0.93 and standard deviation = 0.1.
Figure 4.5: ccTLD tracking penetration histogram
For each tracking penetration value (x-axis), we plot a bar presenting the number of countries with such penetration.
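The per-country penetration ratios behind figure 4.5 follow from grouping PLDs by their ccTLD. A minimal sketch, assuming (as in our analysis) that the ccTLD is simply the trailing two-letter label of the PLD:

```python
def cc_penetration(plds, tracked):
    """Tracking penetration per country-code TLD.

    A PLD's ccTLD is taken to be its last dot-separated label when
    that label has exactly two letters; other PLDs are ignored.
    """
    totals, hits = {}, {}
    for pld in plds:
        tld = pld.rsplit(".", 1)[-1]
        if len(tld) != 2:
            continue
        totals[tld] = totals.get(tld, 0) + 1
        hits[tld] = hits.get(tld, 0) + (pld in tracked)
    return {tld: hits[tld] / totals[tld] for tld in totals}

plds = ["a.de", "b.de", "c.fr", "d.com"]
ratios = cc_penetration(plds, tracked={"a.de", "c.fr"})
```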
An interesting way to visualize the global spread of the web tracking phenomenon, as
well as its degree, is a heat map, as shown in figure 4.6. Interestingly, Germany
scored a relatively low penetration rate of 49%, placing it in the lower quartile of the
data. We can also notice that some of the highest penetration rates are concentrated in
Russia and the post-Soviet states of eastern Europe and Asia.
Finally, an important remark is that we are only considering ccTLD extensions in
our analysis and not the country-assigned IP address ranges. This experiment can
be further enriched by incorporating IP analysis as well.
4.4.2 Generic domain analysis
Besides the country codes, we were also able to detect 22,986,076 PLDs in the web
corpus sample (of approximately 32 million PLDs) that contain an element of a predefined
set of the most popular generic top-level domains (gTLDs) assigned by the Internet
Figure 4.6: Tracking penetration worldwide
Shades of green, yellow and red indicate low, medium and high penetration rates respectively, given that the scale starts at 23% and ends at 93%. Black indicates no data available.
Assigned Numbers Authority (IANA). The gTLDs are .com, .net, .org, .gov, .edu, .mil,
.info and .biz. Out of these PLDs, we marked 13,800,223 as potentially tracked.
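Matching PLDs against this predefined gTLD set and tallying tracked ones can be sketched as follows (function name and toy inputs are hypothetical):

```python
GTLDS = {"com", "net", "org", "gov", "edu", "mil", "info", "biz"}

def gtld_counts(plds, tracked):
    """Per-gTLD (total, tracked) counts for PLDs ending in a known gTLD."""
    stats = {g: [0, 0] for g in GTLDS}
    for pld in plds:
        tld = pld.rsplit(".", 1)[-1]
        if tld in GTLDS:
            stats[tld][0] += 1
            stats[tld][1] += pld in tracked
    # Keep only gTLDs that actually occurred in the sample.
    return {g: tuple(c) for g, c in stats.items() if c[0]}

counts = gtld_counts(["a.com", "b.com", "c.gov", "d.de"], {"a.com", "c.gov"})
```

The penetration ratio per gTLD (as in table 4.5) is then simply tracked divided by total.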
In table 4.5 we summarize the tracking penetration ratio for each of the extracted
gTLDs. Surprisingly, the results went against our expectation that the more popular
and commercial domains such as .com and .net would have higher penetration than
the more private, and in some cases sensitive, domains such as .edu and .gov. We also
did not expect the .mil gTLD, used by military organizations, to have a penetration
rate as high as 53%, even though it tails the list.
Table 4.5: Tracking penetration by gTLD
gTLD    PLDs (sample)   PLDs (tracked)   Tracking Penetration
.edu 33629 22512 67%
.gov 51081 33178 65%
.info 472131 304941 65%
.net 1746476 1116848 64%
.biz 156559 99525 64%
.org 1923282 1214639 63%
.com 18602312 11008258 59%
.mil 606 322 53%
To further investigate these unexpected results, we compiled the matrix in table 4.6 with
the top 10 trackers of each gTLD along with the number of PLDs they cover within it.
Based on this matrix, we observed the following:

- While the Google-related trackers are the only core trackers across all gTLDs, the
top 10 trackers are almost identical across .com, .org, .net, .info and .biz (with
few exceptions). They are also a subset of the overall top trackers noted in 4.1.
However, trackers tend to be different and sparse in the .edu, .gov and .mil group.

- What we consider sensitive gTLDs, like .gov, .mil and .edu, are tracked mostly by
web analytics tools like google-analytics and addthis.com and by social network
widgets. However, there are no indications of them employing advertising-related
trackers or content delivery networks, even popular ones such as googlesyndication.com.
This is somewhat understandable, since websites like these, intended for public
service, will need to employ some sort of social interaction via social widgets, not
to mention analyzing their own traffic.

- Some popular trackers only appear within commercial gTLDs, such as the popular
web host godaddy.com, but never with .gov, .mil or .edu.

- A few trackers appear under only one gTLD, like cnzz.com and ejercito.mil.co under
the .gov and .mil gTLDs respectively.
To understand the last point further, we drilled deeper into the data and, with some
internet searching, found that cnzz.com is a Chinese tracking service that employs scripts
in tracked pages. It turns out that the 2,141 PLDs cnzz.com is tracking under the .gov
gTLD all have the .cn country code, which means they are Chinese government PLDs. Also,
we found that ejercito.mil.co belongs to the Colombian national army and that all
15 tracked PLDs are being tracked by means of 3rd party HTML links.
4.5 Trackers Association
In this section, we aim to investigate the frequent co-occurrence of tracking services
and whether there are rules that can predict the presence of trackers in a PLD based on
the existence of other trackers; for example, whether the existence of tracker z in a PLD
is usually associated with the existence of trackers x and y.
Table 4.6: Top Trackers Coverage over gTLDs
The matrix has values only for the top 10 trackers of each gTLD, or zeros to indicate that the tracker is completely absent regardless of its ranking. For example, the second cell (horizontally) indicates that addthis is tracking 51,627 PLDs with the .org gTLD, while godaddy.com is completely absent from all .mil domains. A tracker is marked with a dash if it is not within the top 10 trackers under a specific gTLD.
Tracker .com .org .net .info .biz .edu .gov .mil
addthis.com             -           51,627     -          9,680     -        1,991    989      17
adobe.com               591,441     50,835     -          -         3,184    4,392    3,350    50
ajax.googleapis.com     867,033     99,873     69,199     13,143    4,956    3,757    1,700    32
baidu.com               -           -          -          -         -        -        939      0
casalemedia.com         609,597     -          58,074     46,841    6,455    -        0        0
cnzz.com                -           -          -          -         -        -        2,141    0
ejercito.mil.co         -           0          0          0         0        0        0        15
facebook.com            726,854     89,693     67,375     13,816    4,551    3,054    -        23
facebook.net            485,699     53,172     43,369     -         -        1,939    -        -
gmpg.org                -           -          -          10,600    -        -        -        -
godaddy.com             -           -          -          25,316    5,144    -        -        0
google-analytics.com    4,545,650   467,913    402,411    89,385    34,300   11,881   7,776    171
google.com              1,244,448   166,275    127,759    32,421    9,380    5,315    3,241    38
googlesyndication.com   1,874,776   168,622    235,195    112,147   22,153   -        -        -
imgaft.com              471,002     -          46,055     27,232    5,635    0        0        0
macromedia.com          802,529     60,038     55,717     -         3,950    4,539    10,468   46
twimg.com               -           -          -          -         -        -        -        24
twitter.com             -           -          46,114     -         -        1,348    -        -
weather.com.cn          -           -          -          -         -        -        2,673    0
youtube.com             -           68,471     -          -         -        2,483    898      19

Total PLDs tracked
in gTLD                 11,008,258  1,214,639  1,116,848  304,941   99,525   22,512   33,178   322
To begin, we want to understand the nature of the trackers' existence in terms of
quantity (i.e. how many trackers there are per pay-level-domain). We start by computing
the total number of tracking services per PLD (approximately 19 million PLDs) and
observing the distribution. As shown in figure 4.7, the distribution is far from normal;
in fact, it appears to decay exponentially, with more than 99.99% of the data set in the
range of 1-100 trackers per PLD. This means there exists a tiny fraction of PLDs
with a huge number of trackers (above 1,000). We then wanted to understand whether that
might be attributed to the number of pages in each processed PLD; however, we calculated
the Pearson correlation coefficient² between the number of pages and the number of
trackers (per PLD) to be 0.28, indicating only a slight positive correlation (even though
we intuitively expected a stronger one). After examining a subset of the top PLDs, in
terms of pages and trackers, we found that most of them are huge networks such as
Google, YouTube, Tumblr, etc., which permit users to load resources from 3rd party
domains (e.g. scripts, content, themes, etc.) as well as to use 3rd party web traffic
monitoring, hence the high number of trackers.

²Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. It measures the linear correlation (dependence) between two variables x and y, giving a value in [-1, 1] where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation.
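The coefficient defined in the footnote can be computed directly from its definition (covariance over the product of the standard deviations, population form); a minimal sketch:

```python
import math

def pearson(xs, ys):
    """Pearson's r: covariance of x and y divided by the product of
    their standard deviations (population form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)
```

Fed with the per-PLD page counts and tracker counts, this is the computation that yielded r = 0.28 above.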
Figure 4.7: Log-Log plot for the number of trackers per PLD
The second part of the analysis is to identify the groups of tracking services that
usually appear together in PLDs. In order to achieve that, we model the problem as
a market-basket analysis (with trackers as products and tracking graph entries as
transactions) while employing frequent itemset mining techniques. On top of that, we use
association rule learning to find out whether there are dependencies between trackers.
Apriori [51] is a seminal frequent itemset mining algorithm that we use (out
of the box from SQL Server Analysis Services³) to help answer our question. In a
nutshell, Apriori works by identifying the frequent individual items in the dataset and
extending them to larger and larger item sets as long as those item sets appear sufficiently
often in the data (by means of a support function). The frequent item sets determined
by Apriori can later be used to derive association rules which highlight general trends
in the dataset. Figure 4.8 shows an outline of the algorithm.
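The level-wise idea can be illustrated with a toy implementation (this is a naive sketch without the candidate subset-pruning step, not the SQL Server Analysis Services implementation we actually used; the basket data is hypothetical):

```python
def apriori(transactions, min_support):
    """Level-wise Apriori: keep k-itemsets appearing in at least
    `min_support` transactions, extend them to (k+1)-itemsets, repeat."""
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {c for c in items
             if sum(c <= t for t in transactions) >= min_support}
    k = 1
    while level:
        for c in level:
            frequent[c] = sum(c <= t for t in transactions)
        # Candidate generation: unions of pairs from the current level.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support}
        k += 1
    return frequent

# Toy baskets: each transaction is the set of trackers found on one PLD.
baskets = [{"ga", "fb"}, {"ga", "fb", "tw"}, {"ga"}]
freq = apriori(baskets, min_support=2)
```

With support threshold 2, only {ga}, {fb} and {ga, fb} survive; the frequent pairs are then the input for association rule mining.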
We applied the Apriori implementation on a subset of the tracking graph that contains
the top 20 trackers (extracted in 4.1) and their corresponding tracking entries of
approximately 26 million records (32% of the complete graph). Table 4.7 shows 20 frequent

³Microsoft provides its implementation of Apriori under the name of Microsoft Association Algorithm. See msdn.microsoft.com/en-us/library/cc280428.aspx
⁴Figure taken from en.wikipedia.org/wiki/Apriori_algorithm
Figure 4.8: Pseudo code of the Apriori algorithm⁴
The pseudo code for the algorithm is given for a transaction database T and a support threshold of ε. Ck is the candidate set for level k. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. count[c] access