-
Technische Universität Berlin
Master Thesis
An Exploratory Analysis of the Tracked Web
Author: Karim Wadie
Supervisor: Prof. Volker Markl
Advisor: Johannes Kirschnick
A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computer Science
as part of the Erasmus Mundus programme IT4BI
in the
Database Systems and Information Management Group (DIMA)
Department of Computer Science
July 2015
http://www.tu-berlin.de
https://www.dima.tu-berlin.de/
http://cs.tu-berlin.de/welcome.html
-
Declaration of Authorship
I declare that I have authored this thesis independently, that I have not used other than
the declared sources/resources, and that I have explicitly marked all material which has
been quoted either literally or by content from the used sources.
Eidesstattliche Erklärung
Ich erkläre an Eides statt, dass ich die vorliegende Arbeit selbstständig verfasst, andere
als die angegebenen Quellen/Hilfsmittel nicht benutzt, und die den benutzten Quellen
wörtlich und inhaltlich entnommenen Stellen als solche kenntlich gemacht habe.
Berlin,
July 31, 2015
Karim WADIE
-
"The man who comes back through the Door in the Wall will never be quite the same
as the man who went out. He will be wiser but less sure, happier but less self-satisfied,
humbler in acknowledging his ignorance yet better equipped to understand the relationship
of words to things, of systematic reasoning to the unfathomable mystery which it tries,
forever vainly, to comprehend."
Aldous Huxley
-
Technische Universität Berlin
Abstract
Faculty of Electrical Engineering and Computer Science
Department of Computer Science
Master of Science in Computer Science
An Exploratory Analysis of
the Tracked Web
by Karim Wadie
There is no doubt that web tracking has progressively prevailed on the internet over the past years, whether for traffic analytics or for building user browsing profiles that aid personalized advertising. There are several techniques by which a tracking service can record visitors' behavior on a remote website, some of which can be detected in an offline setting by analyzing the HTML content for common tracking practices such as tracking pixels and scripts that communicate with a third-party host. This thesis builds on top of the TrackTheTrackers project, initiated at TU Berlin, which extracts tracking services from the Common Crawl, the largest publicly available web corpus, by providing a deeper, quantitative analysis of the web tracking phenomenon in terms of its prevalence and its relationship with the web structure. To the best of our knowledge, this research is the first to combine web-graph studies with third-party tracking analysis. Throughout our exploratory analysis, we report a number of statistical findings about the tracking graph along with descriptive, structural properties of the web graph spanned by the trackers and tracked websites (i.e. the tracked web), and finally, we examine how structural features of the web graph, such as community structures and centrality measures, can affect the spread of tracking over the web. For instance, we found that 60% of the web is potentially tracked, with Google being the number one tracker on the internet. We also used a quantitative approach to discover that the tracked web is highly interconnected, exhibits the small-world phenomenon with only five degrees of separation, and resembles the structure of a social network more than that of a web graph.
http://www.tu-berlin.de
http://www.eecs.tu-berlin.de/menue/fakultaet_iv/
http://cs.tu-berlin.de/welcome.html
-
Acknowledgements
I take this opportunity to express gratitude to Johannes, my advisor, for his guidance throughout the thesis, as well as his comments that greatly improved the manuscript. I also thank Sebastian Schelter for his excellent work on the TrackTheTrackers project and for providing the datasets upon which this study builds.
-
Contents
Declaration of Authorship i
Abstract iii
Acknowledgements iv
List of Figures viii
List of Tables ix
Abbreviations x
1 Introduction and Literature Review 1
 1.1 Introduction 1
  1.1.1 What is web tracking? 1
  1.1.2 The business empire of web tracking 1
  1.1.3 Why should we study tracking? 3
 1.2 Literature Overview 5
  1.2.1 Web tracking studies 5
  1.2.2 Web graph studies 9

2 Objectives 13

3 Methodology 15
 3.1 Datasets 15
  3.1.1 Common Crawl web corpus 15
  3.1.2 Web Data Commons hyper-link graph 17
  3.1.3 The Common Crawl WWW ranking 18
  3.1.4 Alexa top sites 18
 3.2 Data Processing Platforms 18
  3.2.1 Apache Hadoop 18
  3.2.2 Apache Spark 20
  3.2.3 Apache Flink 20
  3.2.4 R 21
  3.2.5 MS SQL Server BI Stack 21
-
  3.2.6 WebGraph Framework 21
  3.2.7 FlashGraph Framework 22
 3.3 Data Preparation 22
  3.3.1 Trackers extraction 22
 3.4 Environment 24
  3.4.1 Amazon EC2 24
  3.4.2 DIMA IBM Power Cluster 24

4 Analysis I: Statistical Properties 26
 4.1 Trackers Coverage 27
 4.2 Top Sites Tracking 30
 4.3 Tracking Classification 31
 4.4 Domain Analysis 33
  4.4.1 Country code analysis 33
  4.4.2 Generic domain analysis 34
 4.5 Trackers Association 36
 4.6 Chapter Summary 41

5 Analysis II: Structural Properties 42
 5.1 Tracked-Web Degree Distribution 42
  5.1.1 Density and node degrees 42
  5.1.2 Power-law fitting 44
  5.1.3 Findings 45
 5.2 Tracked-Web Degree of Separation 45
  5.2.1 Introduction 45
  5.2.2 Approach: HyperANF 46
  5.2.3 Distance-related features 48
  5.2.4 Conclusion 50
 5.3 Is The Tracked-Web a Small World? 50
 5.4 Tracked-Web Components 52
  5.4.1 WCC 52
  5.4.2 SCC 53
 5.5 Centrality and Tracking 54
  5.5.1 Introduction 54
  5.5.2 Approach 57
  5.5.3 Individual centrality correlation 57
  5.5.4 Centrality-based classification 58
 5.6 Community Structure and Tracking 59
  5.6.1 Vertex-centric neighborhoods 59
  5.6.2 Web graph communities 60
  5.6.3 Conclusion 62
 5.7 Chapter Summary 64
6 Future Work 65
7 Thesis Summary 67
-
A Top Trackers By Source 71
B Tracking Penetration By Country 74
C Social Widgets Detection 80
Bibliography 82
-
List of Figures
1.1 Example of online advertising players 3
1.2 USA online advertisement market growth in USD billions 4
1.3 Case Study: Third-Party Analytics 6
1.4 Case Study: Third-Party Advertising 6
1.5 Case Study: Advertising Networks 7
1.6 Case Study: Social Widgets 8
1.7 Bow-tie structure of the web 10
3.1 Pseudocode of the main routines in extracting trackers . . . . . . . . . . . . . 25
4.1 Tracking detection summary 28
4.2 Alexa top sites tracking penetration 30
4.3 Tracking sources summary 31
4.4 Tracking Classification Summary 33
4.5 ccTLD tracking penetration histogram 34
4.6 Tracking penetration worldwide 35
4.7 Log-Log plot for the number of trackers per PLD 38
4.8 Pseudocode of the Apriori algorithm 39
5.1 Log-Log plot of the tracked web indegree distribution 43
5.2 Log-Log plot of the tracked web outdegree distribution 43
5.3 Probability mass function of the tracked-web distance 49
5.4 Cumulative probability function of the tracked-web distance 49
5.5 Log-Log plot of the tracked-web WCC size distribution 53
5.6 Pseudocode of the Tarjan algorithm for finding strongly connected components in a graph 55
5.7 Log-Log plot of the tracked-web SCC size distribution 56
5.8 Pseudocode for computing tracking coefficient of vertices 60
5.9 Log-Log plot of the web graph community-size distribution 62
5.10 A visual representation of the web graph mega-communities 62
C.1 Facebook social widget code snippet 80
C.2 Twitter social widget code snippet 81
C.3 YouTube social widget code snippet 81
C.4 Reddit social widget code snippet 81
-
List of Tables
3.1 Content statistics of the 2012 web corpus . . . . . . . . . . . . . . . . . . 17
4.1 Top 20 potential trackers 29
4.2 Top trackers penetration ratio across Alexa top sites 31
4.3 Tracking-Source distribution 32
4.4 Social-Widget tracking summary 33
4.5 Tracking penetration by gTLD 35
4.6 Top Trackers Coverage over gTLDs 37
4.7 Frequent item sets of top 20 trackers 40
4.8 Top 20 trackers association rules 41
5.1 Power-law fitting of tracked-web indegree and outdegree 45
5.2 HyperANF Results on the tracked-web 48
5.3 Distance-related features for the web, Facebook and Tracked Web 50
5.4 Calculating the small-world measure S for the tracked-web 52
5.5 Point bi-serial correlation between centrality measures and tracking 58
5.6 Area under the curve (AUC) for different binary classifiers (centrality measures vs tracking) 59
5.7 Tracking Coefficients of the web graph neighborhoods 60
A.1 Top 20 potential trackers employing scripts 71
A.2 Top 20 potential trackers employing IFrames 72
A.3 Top 20 potential trackers employing Images 72
A.4 Top 20 potential trackers employing Links 73
B.1 Tracking analysis by country code top level domain . . . . . . . . . . . . . 74
-
Abbreviations
ccTLD Country code top level domain
DWH Data Warehouse
GA Google Analytics
gTLD Generic top level domain
HDFS Hadoop Distributed File System
PLD Pay-level-domain
SCC Strongly connected component (of a graph)
TLD Top level domain
WCC Weakly connected component (of a graph)
WDC Web Data Commons
-
Dedicated to my parents, for their love, endless support and encouragement.
-
Chapter 1
Introduction and Literature Review
1.1 Introduction
1.1.1 What is web tracking?
Web tracking commonly refers to the act of collecting subsets of a user's browsing data or browsing behavior over the internet. This practice has attracted a lot of attention over the past few years, especially after the social media boom and the average internet user's increasing awareness of privacy issues.
There is no doubt that tracking is prevalent on the web today. Most of us who use search engines or e-commerce sites (e.g. Amazon) have seen the implications of web tracking (or simply "tracking", as we will refer to it in this document), at least in the form of targeted advertisements, especially when observed across sites; for example, coming across advertisements on one's social media profile for products previously viewed on a completely different e-commerce site.
In our work, we use the term "tracked-web" to refer to the graph structure of web links formed by the tracking and tracked web entities. We aim to provide a better understanding of this subset of the web in terms of statistics about these entities, as well as its local and global structural properties.
1.1.2 The business empire of web tracking
Before going into details, one first needs to understand the motivation behind such a practice, what kind of web entities are behind it, and how they can actually do it.
-
First-Party and Third-Party Tracking To begin with, we need to differentiate between what is called first-party and third-party tracking. The first kind refers to a website keeping track of its visitors' activities on its own site, either anonymously or by user profiles, in order to analyze customer behavior, enhance its service or even communicate the data to other entities for a profit. First-party tracking is very common on most major websites; however, it often raises serious concerns when it crosses the virtual world of the internet and includes real-world information like GPS track history, fingerprints and such. Unfortunately, this type of tracking is beyond our scope of analysis, since it is integrated in the website logic and can hardly be detected or analyzed offline.
The other type of tracking, third-party tracking, refers to the practice by which an outside entity (the tracker), other than the directly visited website, tracks the user's visit to the site. For example, if a web user visits reuters.com, a third-party tracker like doubleclick.net, embedded by reuters.com to provide targeted advertising, can log the user's visit to reuters.com. For most types of third-party tracking, the tracker will be able to link the user's visit to reuters.com with the user's visits to other sites on which the tracker is also embedded, thus building what is called a browsing profile of that user. In this study we will only consider third-party tracking for our analysis, because of its potential concern to users, who may be surprised that a party with which they may or may not have chosen to interact is recording their online behavior in unexpected ways.
Tracking Services The web entities acting as third-party trackers are generally categorized into two broad groups: web traffic analytics services and advertising-based services (we discuss a detailed categorization framework in the literature review, section 1.2). The first group of trackers usually provides its services to websites in return for a paid premium or subscription plan; however, the most popular web-traffic analysis service [1], Google Analytics [2], can be used for free. In this case, Google is believed to generate indirect profit from the free analytics service by integrating the data it collects with its paid advertising service, Google AdWords [3].
The other group of tracking services is the one directly concerned with online advertising. The advertising business has evolved since the birth of the internet, from email marketing campaigns to online display ads in the 1990s, to the more complex landscape of search ads (see figure 1.1) that involves targeted advertising with automated bidding and connects a number of stakeholders: publishers who host the ads, advertisers who advertise their products/services, advertising agencies that help generate and place the ad copy, ad servers that technically deliver the ads, and
-
advertising affiliates who conduct promotional work for the advertisers, among potentially more players.
Figure 1.1: Example of online advertising players 1.
It is not hard to understand how the online advertising business had to become more sophisticated over the years, given that it is a multi-billion dollar industry. According to a study by PricewaterhouseCoopers (PwC) [4], online advertising generated a revenue of 49.5 billion USD in 2014 in the United States alone (see figure 1.2). Another recent study estimated the European ad market in 2012 at 24.3 billion EUR [5].
1.1.3 Why should we study tracking?
Despite the prevalence of web tracking and the resulting public and media outcry, primarily in the western world, there is a lack of clarity about how tracking works, how widespread the practice is, and the scope of the browsing profiles that trackers can collect about users. Thus, efforts in exploring and understanding the structure of the web from a tracking perspective, as we aim to do in this thesis, are important in shedding light on this part of the internet in order to:
1. Design crawling and tracker detection algorithms.
1 Figure taken from LUMA Partners: http://www.lumapartners.com/lumascapes/
-
Figure 1.2: USA online advertisement market growth in USD billions 2
2. Design protection techniques against trackers.
3. Understand the coverage of some key trackers and their domination over the internet, and thus estimate their business value and market weight.
4. Predict the evolution and spread of the tracking phenomenon.
5. Predict the emergence of new phenomena in the tracking graph.
2 Figure taken from the PwC Internet advertising report 2014 [4]
-
1.2 Literature Overview
1.2.1 Web tracking studies
A number of studies have been conducted by researchers to understand, analyze and classify the web tracking phenomenon, and even to develop techniques to protect against it. The most prominent is the work by Roesner, Kohno, and Wetherall [6] in 2012. In their study, the authors presented an in-depth empirical investigation of third-party tracking in which they introduced a comprehensive classification framework for web tracking based on client-side observable behaviors. They also developed and evaluated a web browser plugin designed to thwart tracking originating from social media widgets (like the Facebook "Like" button) while still allowing the widgets to be used.
The suggested framework is built from client-side methods for detecting and classifying five kinds of third-party trackers based on how they manipulate browser state. The five behaviors observed are:
1. Third-Party Analytics:
In order to analyze their traffic, websites usually embed a library (in the form of a script) provided by the analytics engine (e.g. Google Analytics). In the case of GA, the script sets a site-owned cookie (not tracker-owned) on the visitor's browser that contains a unique identifier. The script then transfers this identifier to google-analytics.com by making explicit requests containing information such as the operating system version, browser, geographic location, etc.
Since the cookie set by the tracker was created in the context of the visited site (site-owned), the identifiers set by the tracker in this case are different across sites. Thus, a single user will be associated with different identifiers on different sites, limiting the tracker's ability to create a cross-site browsing profile for that user. Figure 1.3 shows a case study as offered in the original work [6].
2. Third-Party Advertising:
This is tracking for the purpose of targeted advertising; an example of this type is Google's advertising network, DoubleClick [7].
When a user visits a page, the tracker (advertiser) chooses an ad to display on that page as an image or an iframe. In this case, the cookie containing the visitor's unique identifier is set as tracker-owned. As a result, the same unique identifier is associated with the user whenever he visits any site with the tracker's ads embedded in it. The tracker is thus able to build a cross-site browsing profile
-
Figure 1.3: Case Study: Third-Party Analytics.
Websites commonly use third-party analytics engines like Google Analytics (GA) to track visitors. This process involves (1) the website embedding the GA script, which, after (2) loading in the user's browser, (3) sets a site-owned cookie. This cookie is (4) communicated back to GA along with other tracking information.
Figure 1.4: Case Study: Third-Party Advertising.
When a website (1) includes a third-party ad from an entity like Doubleclick, Doubleclick (2-3) sets a tracker-owned cookie on the user's browser. Subsequent requests to Doubleclick from any website will include that cookie, allowing it to track the user across those sites.
for each unique user. Figure 1.4 shows a case study as offered in the original work
[6].
3. Third-Party Advertising with Popups:
Using popups to display ads gives the tracker the advantage of setting its own first-party cookie, allowing it to bypass the third-party cookie blocking mechanisms embedded in some browsers or plugins. This kind of tracking is malicious, since it puts the tracker in a first-party position without the user's consent. An example of such a tracker is insightexpressai.com.
4. Third-Party Advertising Networks:
Trackers often cooperate, and it is insufficient to simply consider trackers in isolation. A website may embed one third-party tracker, which in turn serves as an
-
aggregator for a number of other third-party trackers. Figure 1.5 shows a case
study as offered in the original work [6].
Figure 1.5: Case Study: Advertising Networks.
As in the ordinary third-party advertising case, a website (1-2) embeds an ad from Admeld, which (3) sets a tracker-owned cookie. Admeld then (4) makes a request to another third-party advertiser, Turn, and passes its own tracker-owned cookie value and other tracking information to it. This allows Turn to track the user across sites on which Admeld makes this request, without needing to set its own tracker-owned state.
5. Third-Party Social Widgets:
Most social networking sites offer social widgets like the Facebook "Like" button, the Twitter "Tweet" button, the Google "+1" button and others. These widgets can be included by other websites to allow users logged in to these social networking sites to like, tweet, or +1 the embedding web page. In the case of Facebook, it can set its tracker-owned cookie from a first-party position when the user voluntarily visits facebook.com; then, when the user visits another website that embeds the Facebook "Like" button, the requests made to facebook.com to render this button allow Facebook to track the user across sites just as Doubleclick can. Figure 1.6 shows a case study as offered in the original work [6].
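Behaviors like these leave traces in the page's HTML that can be inspected offline. As a minimal illustration (not the thesis's actual extraction pipeline), the Python sketch below flags script sources served from a host other than the embedding page's host; a real pipeline would compare pay-level domains and also inspect iframes, images and links:

```python
# Minimal sketch: flag <script src="..."> tags whose host differs from the
# embedding page's host, a hint of a potential third-party tracker.
from html.parser import HTMLParser
from urllib.parse import urlparse

class ThirdPartyScriptFinder(HTMLParser):
    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.third_party_hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        host = urlparse(src).netloc
        # A non-empty host different from the page's own host indicates a
        # potential third-party tracker (e.g. google-analytics.com).
        if host and host != self.page_host:
            self.third_party_hosts.add(host)

page = '<html><body><script src="http://www.google-analytics.com/ga.js"></script></body></html>'
finder = ThirdPartyScriptFinder("www.example.com")
finder.feed(page)
print(finder.third_party_hosts)  # {'www.google-analytics.com'}
```

Relative sources such as src="/assets/local.js" have an empty host and are correctly ignored as first-party content.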
From the observed tracking behaviors, the authors then formulated a framework for classifying trackers into five classes, where a single tracker may exhibit more than one of these behaviors:
1. Behavior A (Analytics): The tracker serves as a third-party analytics engine
for sites. It can only track users within sites.
2. Behavior B (Vanilla): The tracker uses third-party storage that it can get and set only from a third-party position.
-
3. Behavior C (Forced): The cross-site tracker forces users to visit its domain
directly (e.g., popup, redirect), placing it in a first-party position.
4. Behavior D (Referred): The tracker relies on a B, C, or E tracker to leak unique
identifiers to it, rather than on its own client-side state, to track users across sites.
5. Behavior E (Personal): The cross-site tracker is visited by the user directly in
other contexts.
In our study, since we are working in an offline setting, we will be able to differentiate between third-party analytics, third-party advertising and third-party social widgets.
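This offline differentiation amounts to assigning each detected third-party host a coarse label. A toy Python sketch follows; the seed lists here are illustrative placeholders, not the actual labeling data used in this work:

```python
# Hypothetical sketch: map a detected third-party host to one of the three
# categories distinguishable offline. Seed lists are illustrative only.
ANALYTICS = {"google-analytics.com", "quantserve.com"}
SOCIAL_WIDGETS = {"facebook.com", "twitter.com", "plus.google.com"}

def classify_tracker(host):
    # Strip a leading "www." so that e.g. www.facebook.com matches.
    host = host[4:] if host.startswith("www.") else host
    if host in ANALYTICS:
        return "third-party analytics"
    if host in SOCIAL_WIDGETS:
        return "third-party social widget"
    # Everything else defaults to the advertising-oriented bucket.
    return "third-party advertising/other"

print(classify_tracker("www.google-analytics.com"))  # third-party analytics
```

In practice such seed lists would be curated from known tracker catalogs, and hosts would first be normalized to pay-level domains.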
Apart from Roesner et al. [6], a number of studies have empirically examined tracking on the web, most notably Krishnamurthy et al. [8]. In their paper, the authors presented a study in which they measured the coverage of third-party tracking on the web. However, unlike [6], they did not distinguish between different tracking behaviors.
From a different perspective, the authors of [9] studied privacy-violating information flows on the web, where they found instances of cookie leaking as well as other privacy violations. However, they did not differentiate between third-party trackers and the visited sites themselves. Also, in his five-year study of modern web traffic, Ihm [10] found that 12% of web requests in 2010 were for advertisements. He also found that Google Analytics tracked up to 40% of the pages in their dataset.
Figure 1.6: Case Study: Social Widgets.
Social sites like Facebook, which users visit directly in other circumstances (allowing them to (1) set a cookie identifying the user), expose social widgets such as the "Like" button. When another website embeds such a button, the request to Facebook to render the button (2-3) includes Facebook's tracker-owned cookie. This allows Facebook to track the user across any site that embeds such a button.
-
As for the phenomenon of tracker collaboration, [8] and [11] analyzed the private data leakage from first-party websites to data aggregators that can, potentially, link user accounts across different sites. In another study, Jackson and Boneh [12] classify trackers based on the type of cooperation between the embedding site and the trackers, although they did not provide measurements on the prevalence of the tracker classes.
Finally, in the past few years, there have been notable online discussions about tracking, such as [5], along with workshops on tracking like the W3C Workshop on Web Tracking and User Privacy.
1.2.2 Web graph studies
Apart from the web tracking phenomenon itself, there are numerous studies that model the web as a graph in order to analyze its structure and report interesting measurements and statistics about it. We find these kinds of efforts inspirational to our analysis of the tracked-web, in terms of what questions to ask and what techniques to use to answer them.
The most notable study covered by our literature search is the paper by Broder et al. [13]. In order to discover a set of local and global properties of the web graph, the authors conducted a set of experiments on web crawls made available by AltaVista, each with over 200 million pages and 1.5 billion links. They showed that the overall structure of the web is considerably more complicated than suggested by earlier experiments on a limited scale. Famously, they published a visual interpretation of their findings about the web structure, which has become well known in later literature as the bow-tie structure of the web.
The authors first report the in- and out-degree distributions of web pages, confirming previous reports on power laws [14]. They then studied the directed and undirected connected components of the web, showing that power laws also arise in the distribution of the sizes of these connected components. They found that most (over 90%) of the approximately 203 million nodes in their crawl data form a single connected component if links are treated as undirected edges.
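The measurement reported here, the share of nodes falling in the largest weakly connected component, can be sketched with a simple union-find over an edge list, ignoring edge direction. This is a toy illustration; crawls at this scale require external-memory or distributed tooling:

```python
# Minimal sketch: fraction of nodes in the largest weakly connected
# component of a directed edge list, using union-find (direction ignored).
def wcc_fraction(edges):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the endpoints of every edge.
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    # Count component sizes by root.
    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(parent)

# Toy graph: nodes a-d form one weak component; e and f form a second.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("e", "f")]
print(wcc_fraction(edges))  # 4 of 6 nodes -> ~0.67
```

The same bookkeeping, applied to the AltaVista edge list, yields the "over 90% in one component" figure quoted above.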
This giant weakly connected web can be broken into four pieces, as shown in figure 1.7. The first is a central core, in which every page can reach every other page in the same core by following directed links; this giant strongly connected component (SCC) is at the heart of the web. The second and third pieces are called IN and OUT. IN contains pages that cannot be reached from the SCC but can reach it; the authors claim that
-
these might be new sites that people have not yet discovered and linked to. On the other hand, OUT contains pages that are pointed to from the SCC but cannot link back to it; the authors suggest that such a cluster represents corporate websites that contain only internal links.
Finally, the TENDRILS consist of pages that are in total isolation from the SCC: they cannot reach the SCC and cannot be reached from it. Perhaps the most interesting finding is that all four sets are roughly the same size, with the SCC being relatively small; it comprises about 56 million pages, while each of the other three sets contains about 44 million pages. Finally, they measured the diameter of the central core (SCC) to be at least 28, and the diameter of the graph as a whole to be over 500.
Figure 1.7: Bow-tie structure of the web
One can pass from any node of IN through the SCC to any node of OUT. Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passing through the SCC. It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE: i.e., a passage from a portion of IN to a portion of OUT without touching the SCC. Diagram and description are taken from 3
3 Figure taken from Broder et al., Graph structure in the web [13].
-
A more detailed analysis of the sizes of the components of the bow-tie model was done by Serrano et al. [15]. By analyzing four crawls gathered between 2001 and 2004 by different crawlers with different parameters, they concluded that the properties of a web crawl depend on the crawling process.
-
We can also find a number of studies about the web structure that use the same data set
as our thesis, the Common Crawl Web Corpus (see 3.1.1). In a preliminary study, Kolias
et al. [16] presented an initial exploratory analysis of the Common Crawl. Although
they examined only a fraction of the dataset, some interesting initial measurements and
characteristics of the web corpus were shown. They reported statistics on two levels of
granularity, page and site level, such as the MIME type distribution of resources, the
top-10 languages for page content, the distribution of page age, HTML versions, page degree
distribution, pages per website, site language, and site degree distribution.
An in-depth comparison of the latest findings on the web structure with previous work
was done by Meusel et al. [17]. They confirm the existence of a giant strongly connected
component, but emphasize that it is strongly dependent on the crawling
process. Their most important finding, however, is that the distributions of indegree,
outdegree and sizes of strongly connected components are not power laws, something
that contradicts the findings throughout the literature up to that point.
At a different level of aggregation, Lehmberg et al. [18] published a number of similar
findings on web characteristics and degree distributions, but at the pay-level-domain
granularity, as opposed to the page-level analysis in prior work. Finally, a technical report
presenting the main characteristics of the 2012 Common Crawl dataset can be found
in [19].
Apart from the Common Crawl web corpus, various other studies focused on the structure
of national web domains, which consist of all websites that end with a specific
country code or that are hosted at an IP address belonging to a segment assigned to a specific
country. The works [20, 21] present findings on crawls made by different crawlers of the
African and Chinese parts of the web. Along with its structure, other characteristics of
the web are presented by Baeza-Yates et al. [22]. This work is essentially a side-by-side
comparison of the results of 12 studies focusing on web characteristics. Their results
cover various levels of detail (contents, links and technologies) dissected by national
domains.
As for the power-law distribution phenomenon, a number of observations have been
made on various aspects of the Web. The most relevant to our study is the distribution
of degrees in the web graph. In this context, recent work [13, 23, 24] suggests that
both the in- and out-degrees of vertices in the web graph follow power laws. This
collection of findings reveals the power-law distribution as a macroscopic phenomenon
on the entire web, as well as a microscopic phenomenon at the level of single websites,
and at intermediate levels between these two.
-
Chapter 2
Objectives
The aim of this study is to provide a deeper, quantitative understanding of the web-
tracking phenomenon, in terms of both its prevalence and its relationship with the web
structure. By understanding the structure of the tracked-web graph, we also come one
step closer to designing better tracker detection and tracking protection techniques. In
addition, we measure the coverage of key trackers, thus helping to estimate their business
value and market weight.
To achieve that, we structure our exploratory analysis into a set of questions and
hypotheses to be answered or validated. We summarize the high-level goals of the thesis
as follows:
- Extracting potential trackers from the Common Crawl web corpus based on specific HTML contexts and assumptions, followed by constructing an aggregated tracking graph on the pay-level-domain (PLD) level, that is, a graph structure showing which PLD is tracked by which service.
- Computing statistical indicators on the tracking graph to measure the prevalence of tracking in the web.
- Computing descriptive, structural properties of the tracked-web, that is, the subset of the aggregated PLD web graph that includes only the trackers and tracked hosts.
- Examining how some structural properties of the web affect the spread of tracking over the internet.
We can then expand these high-level goals into a number of discrete questions and
hypotheses as follows:
1. To what degree is the web being tracked? And how many potential trackers can
we extract from the web corpus?
2. Who are the top 20 trackers? What are their coverage, their business, and the HTML
contexts in which they are usually embedded?
3. What is the percentage of tracked websites (i.e. tracking penetration) within the
subset of most popular domains based on Alexa Ranking?
4. How often do trackers appear in each HTML context (i.e. scripts, images, iframes
and links)?
5. What is the decomposition of trackers across traffic analytics, ad networks and
social widgets?
6. What is the tracking penetration per country?
7. What is the tracking penetration by generic top-level-domain (i.e. .com, .net, .org, etc.)?
8. Are there sets of trackers that usually appear together in one PLD?
9. What is the degree distribution of the tracked-web? Does it follow a power law?
10. What is the effective diameter, average distance and spid¹ of the tracked-web?
11. Does the tracked-web exhibit the small-world phenomenon?
12. How big are the largest weakly-connected component (WCC) and strongly-connected component (SCC)
of the tracked-web? Do the WCC and SCC size distributions follow a power law?
13. Can we support the hypothesis that domains with higher centrality measures are
more likely to be tracked?
14. Can we support the hypothesis that the web is clustered into communities/neighborhoods
that are either "safe" (i.e. with no tracked PLDs) or "completely tracked"
(i.e. all PLDs are tracked)?
¹spid: shortest-path index of dispersion
-
Chapter 3
Methodology
In order to answer the questions in the scope of our study (see chapter 2), we conduct
a series of experiments using the publicly-available datasets and tools presented in this
chapter.
3.1 Datasets
3.1.1 Common Crawl web corpus
The Common Crawl project [25] is a non-profit organization dedicated to providing
a copy of the internet to researchers, companies and individuals at no cost for
the purpose of research and analysis. Their goal is to democratize the data so that everyone,
not just big companies, can do high-quality research and analysis.
Common Crawl Uses The possibilities are endless, but people have used the data
to improve language translation software, predict trends, track disease propagation and
much more [26].
A number of interesting papers and projects based on the Common Crawl data have
been made available in the past couple of years, one of which is the Web
Data Commons project that we also utilize in this thesis (see 3.1.2). The
popular SwiftKey keyboard app for mobile devices is also reported to use the web corpus
to enrich its functionality [27]. In addition, a number of published studies about the web
use data sets from Common Crawl, as mentioned in the literature
overview chapter [16-19].
Having said that, our main use of the corpus is to analyze the HTML code of individual
pages, extract the potential tracking services from it, and construct a tracking graph
for further analysis. The tracking graph is an edge file in the form (tracking service →
tracked site) that will be used along with the hyperlink graph (provided by Web Data
Commons) to build a property graph that covers both web links and tracking relationships.
Data set choice The Common Crawl corpus contains petabytes of data collected
over the last 7 years. It contains raw web page data, extracted metadata and text
extractions. The dataset lives on Amazon cloud storage S3 [28] as part of the Amazon
Public Datasets program [29]. The data sets represent multiple crawls from different years
that also employ different crawling algorithms. In our study, however, we use
the web corpus that was released in August 2012. The reason behind this selection is
twofold:
1. Since we are also matching the web corpus with its hyper-link graph representation
offered by the Web Data Commons project, we are limited only to the 2012 and
2014 corpora that are offered by the project. However, the two corpora are crawled
using different techniques. The 2012 corpus was gathered using a web crawler
employing a breadth-first-search selection strategy with link discovery
while crawling. The crawl was also seeded with a large number of URLs from
former crawls performed by the Common Crawl Project. This is opposed to the
2014 crawl, which employed a modified Apache Nutch crawler [30] to download pages
from a large but fixed seed list. The 2014 crawler was restricted to URLs contained
in this list and did not extract additional URLs from links in the crawled pages.
The seed list contained around 6 billion URLs and was provided by the search
engine company blekko [31].
2. The Web Data Commons foundation recommends using the 2012 over the 2014
graph for the analysis of the connectivity of Web pages or the overall analysis of
the Web graph, as a BFS-based selection strategy including URL discovery while
crawling will more likely result in a realistic sample of the web graph [32].
2012 Web Corpus Status The corpus consists of approximately 3.8 billion documents
occupying over 100 terabytes of data. Table 3.1 contains a summary of the corpus
contents [33].
Table 3.1: Content statistics of the 2012 web corpus

Content Type    Number (in millions)
Domains         61
PDF             92
Word            6.5
Excel           1.3
As this thesis ran in parallel with the Track The Trackers project (see 3.3.1),
which is responsible for extracting trackers, and due to timeline and budget constraints
(see 3.3.1), we were able to run the extraction job on 25% of the 2012 corpus, which is
roughly 23 terabytes of raw data. This means we work with a 25% random
sample of the web crawl, which we consider representative for our analysis.
3.1.2 Web Data Commons hyper-link graph
The Web Data Commons project [34] was started by researchers from Freie Universität
Berlin and the Karlsruhe Institute of Technology (KIT) in 2012. The goal of
the project is to facilitate research and support companies in exploiting the wealth of
information on the Web by extracting structured data from web crawls, mainly from
the Common Crawl project, and providing this data for public download. Today the WDC
project is mainly maintained by the Data and Web Science Research Group at the
University of Mannheim.
Web Data Commons uses The project offers three types of data:
1. RDFa, Microdata, and Microformat: structured data describing products,
people, organizations, places, and events embedded into HTML pages using markup
standards such as RDFa, Microdata and Microformats.
2. Web Tables: a fraction of the HTML tables found on the web is quasi-relational,
meaning that they contain structured data describing a set of entities, and are thus
useful in application contexts such as data search, table augmentation, knowledge
base construction, and for various NLP tasks.
3. Hyperlink Graphs: large hyperlink graphs that WDC extracts from the Common
Crawl corpora. These graphs can help researchers to improve search algorithms,
develop spam detection methods and evaluate graph analysis algorithms.
Data Set choice In our analysis, we work with the 2012 hyperlink graph. The
reason for choosing the 2012 over the 2014 version is the crawling techniques used
by Common Crawl, as explained in the previous section 3.1.1. WDC provides the graph
on three levels of granularity/aggregation: page level, host level and pay-level-domain
(PLD) level, the last of which we use in this thesis. A PLD can be considered the root
domain for which users/organizations usually pay when registering a URL. PLDs
allow us to identify a realm where a single user or organization is likely to be in control.
For example, the 2 research groups dima.tu-berlin.de and ida.tu-berlin.de have the same
parent PLD, tu-berlin.de. The pay-level-domain web graph consists of approximately 43
million nodes and 623 million arcs.
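The PLD notion can be sketched in a few lines. A real implementation would rely on the full Public Suffix List to decide where the registrable part of a host name begins; the tiny hard-coded suffix set below is purely an illustrative assumption.

```python
# Simplified pay-level-domain (PLD) extraction. A production system would
# consult the full Public Suffix List; this tiny suffix set is only an
# illustrative assumption for the sketch.
PUBLIC_SUFFIXES = {"de", "com", "org", "co.uk"}

def pld(host: str) -> str:
    """Return the registrable domain: the public suffix plus one more label."""
    labels = host.lower().rstrip(".").split(".")
    # scanning left to right finds the longest matching public suffix first
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES and i > 0:
            return ".".join(labels[i - 1:])
    return host

print(pld("dima.tu-berlin.de"))   # → tu-berlin.de
print(pld("ida.tu-berlin.de"))    # → tu-berlin.de
print(pld("news.bbc.co.uk"))      # → bbc.co.uk
```

Note that multi-label suffixes such as co.uk are exactly why a plain "last two labels" heuristic is not enough, and why the suffix list matters.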
3.1.3 The Common Crawl WWW ranking
The project [35] is brought by the Laboratory for Web Algorithmics of the Università
degli Studi di Milano and by the Data and Web Science Group of the University of
Mannheim. They parse the Common Crawl corpus to generate a web graph, from which
they compute a set of rankings (centrality measures) for each node in the graph. We
mainly use their PageRank and Harmonic centrality data sets in one of our experiments.
3.1.4 Alexa top sites
As part of our tracker-penetration analysis, we use a dataset [36] containing a
list of the top 1 million websites by traffic, made available by Alexa Analytics [37].
3.2 Data Processing Platforms
3.2.1 Apache Hadoop
The Apache Hadoop [38] project develops open-source software for reliable, scalable,
distributed computing. Its software library is a framework that allows for the distributed
processing of large structured and unstructured data sets across clusters of computers
using simple programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage. Rather than relying on
hardware to deliver high availability, the library itself is designed to detect and handle
failures at the application layer, thus delivering a highly available service on top of a cluster
of computers, each of which may be prone to failures. For more details about Hadoop
internals one can refer to [38].
In our study, we use the Hadoop Distributed File System (HDFS) to store the large datasets
in order to make them available for processing in a distributed environment, as well
as the Hadoop MapReduce framework for the actual parallel data processing, especially
for extracting trackers from the web corpus.
HDFS is a file system that provides reliable data storage and access across all the
nodes in a Hadoop cluster. It links together the file systems on many local nodes to
create a single file system.
MapReduce is the heart of Hadoop. It is a programming paradigm that allows for
massive scalability across hundreds or thousands of servers in a Hadoop cluster. The
term MapReduce actually refers to two distinct tasks that Hadoop programs perform.
The first is the map job, which takes a set of raw input data and transforms it into
an intermediate set of data represented as key/value pairs. The reduce job operates
on these intermediate key/value tuples and combines (aggregates) them into a smaller set
of tuples. As the sequence of the name MapReduce implies, the reduce job is always
performed after the map job.
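The map/shuffle/reduce flow just described can be imitated in a few lines of single-process Python. The tracker and site names below are invented; the reducer counts distinct tracked PLDs per tracker, mirroring the kind of job used later in this thesis, but none of this is the actual Hadoop code.

```python
from collections import defaultdict

# In-process imitation of the MapReduce flow: map emits key/value pairs,
# the framework groups them by key (shuffle), reduce aggregates each group.
def map_phase(records, mapper):
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# Toy input: tracking-graph edges (tracker, tracked PLD); names are invented.
edges = [("tracker.example", "site-a.de"), ("tracker.example", "site-b.de"),
         ("stats.example", "site-a.de"), ("tracker.example", "site-b.de")]

mapper = lambda edge: [(edge[0], edge[1])]
reducer = lambda tracker, plds: len(set(plds))   # distinct tracked PLDs

counts = reduce_phase(shuffle(map_phase(edges, mapper)), reducer)
print(counts)   # → {'tracker.example': 2, 'stats.example': 1}
```

In real Hadoop the shuffle step is performed by the framework across machines; only the mapper and reducer are user code.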
3.2.2 Apache Spark
Spark [39] is an open source, parallel data processing framework that complements
Apache Hadoop to make it easy to develop fast, unified Big Data applications combining
batch, streaming, and interactive analytics on a variety of data input types. It was
originally developed in 2009 at UC Berkeley's AMPLab, and open sourced in 2010 as an
Apache project.
Spark's main data primitive is the Resilient Distributed Dataset (RDD) [40], which enables
fast in-memory data processing over a distributed environment. Apache Spark
comes prepackaged with libraries for different big data tasks, such as structured data
manipulation (Spark SQL), machine learning (MLlib), data streaming (Spark Streaming)
and graph processing (GraphX). For more details about Apache Spark internals one can
refer to [39].
In this thesis, we mainly use Spark version 1.3.1 and its GraphX library [41] for analyzing
the tracking graph. At a high level, GraphX extends the Spark RDD abstraction
by introducing the Resilient Distributed Property Graph, a directed multigraph¹ with
properties attached to each vertex and edge. To support graph computation, GraphX
exposes a set of fundamental operators, such as subgraph and joins, as well as an opti-
mized variant of the Pregel API [42]. In addition, GraphX includes a growing collection
of graph algorithms and builders to simplify graph analytics tasks.
3.2.3 Apache Flink
Flink [43] is an open source platform for scalable batch and stream data processing
that started at TU-Berlin under the name of Stratosphere and now is a top level Apache
project. Similar to Spark, it provides out of the box libraries for batch and streams
processing, machine learning, SQL-like interface and graph processing. However, Flink
provides an internal optimizer similar to those found in relational databases, besides, it
is optimized for cyclic or iterative processes by using iterative transformations on data
collections. This is achieved by an optimization of join algorithms, operator chaining
and reusing of partitioning and sorting. For more details about Apache Flink internals
one can refer to [43]
We use Flink version 0.9 to conduct a number of experiments in our study that use
its Pregel-like graph processing framework Spargel through its higher-level API Gelly.
¹A multigraph is a graph which is permitted to have multiple edges (also called parallel edges), that is, edges that have the same end nodes. Thus two vertices may be connected by more than one edge.
3.2.4 R
R is a language and environment for statistical computing and graphics. It is a GNU
project similar to the S language and environment, which was developed at
Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and
colleagues. R is available as Free Software under the terms of the Free Software
Foundation's GNU General Public License in source code form.
R provides a wide variety of statistical and graphical techniques, and is highly extensible.
One of R's strengths is the ease with which well-designed publication-quality plots
can be produced, including mathematical symbols and formulae where needed.
After running our experiments on the large-scale datasets, we often produce intermediate
aggregations and metrics (e.g. vertex-wise metrics of a graph) and then process
these results using R to obtain the final statistics and/or plots.
3.2.5 MS SQL Server BI Stack
SQL Server [44] is the Microsoft product-line for relational databases. On top of the
core database engine, SQL Server provides solutions for data integration (ETL), OLAP
cubes and reporting through SQL Server Integration Services (SSIS), Analysis Services
(SSAS) and Reporting Services (SSRS) respectively.
We use a free student version of SQL Server 2012, obtained through the Microsoft
DreamSpark program², to develop a data warehouse that stores a multidimensional
model of the tracking graph obtained from the Common Crawl web corpus, and to build
an OLAP cube on top of it that facilitates parts of our analysis in chapter 4.
3.2.6 WebGraph Framework
WebGraph [45] is an open source framework, under the GNU General Public License,
for graph compression aimed at studying web graphs, developed in Java. It provides
simple ways to manage very large graphs, exploiting modern compression techniques [46].
More precisely, it consists of the following:
²DreamSpark is a Microsoft program that supports technical education by providing access to Microsoft software for learning, teaching and research purposes. https://www.dreamspark.com/
- A set of flat codes, called ζ codes, which are particularly suitable for storing web graphs.
- Algorithms for compressing web graphs that exploit gap compression and referentiation, intervalisation and ζ codes to provide a high compression ratio.
- Algorithms for lazily accessing a compressed graph without actually decompressing it until it is necessary.
- Algorithms for analysing very large graphs, such as estimating neighborhood functions, detecting strongly connected components, etc.
- Samples of publicly available very large datasets that reach over 1 billion links.
We mainly use the WebGraph framework in chapter 5 to estimate the neighborhood
function of the tracked-web using the HyperANF algorithm, and to extract a number
of distance-related measures from it. For more details about the WebGraph framework
and its algorithms one can refer to [45, 46].
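HyperANF estimates the neighborhood function with probabilistic counters; on a toy graph we can compute it exactly with BFS and read off the distance measures used in this thesis: the 90% effective diameter, the average distance, and spid (the variance-to-mean ratio of the distance distribution). The graph below is invented for the example.

```python
from collections import deque

def distances_from(adj, src):
    """BFS shortest-path distances from src to every reachable node."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def neighborhood_function(nodes, adj):
    """N(t) = number of ordered pairs (u, v) with 0 < distance(u, v) <= t."""
    all_d = [d for u in nodes for d in distances_from(adj, u).values() if d > 0]
    N = {t: sum(1 for d in all_d if d <= t) for t in range(1, max(all_d) + 1)}
    return N, all_d

# Toy graph: a directed 4-cycle
nodes = ["a", "b", "c", "d"]
adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
N, all_d = neighborhood_function(nodes, adj)

total = N[max(N)]
eff_diam = min(t for t in N if N[t] >= 0.9 * total)   # 90% effective diameter
avg_dist = sum(all_d) / len(all_d)
spid = (sum(d * d for d in all_d) / len(all_d) - avg_dist ** 2) / avg_dist
print(N, eff_diam, avg_dist, round(spid, 3))
# → {1: 4, 2: 8, 3: 12} 3 2.0 0.333
```

The exact BFS version above is O(V·(V+E)) and only feasible on small graphs, which is precisely why HyperANF's approximate counters are needed for the tracked-web.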
3.2.7 FlashGraph Framework
FlashGraph [47, 48] is a semi-external-memory graph processing engine, optimized
for a high-speed SSD array but also able to run on hard disk drives (HDD). FlashGraph
provides a flexible programming interface to help users implement graph algorithms, along
with a number of ready-to-use common graph algorithms that can scale to very large
graphs on commodity machines within an acceptable run-time.
We mainly use FlashGraph in chapter 5 for triangle counting, as the algorithms in
Spark and Flink did not scale well with our graphs.
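Triangle counting itself is conceptually simple; the node-iterator sketch below works on small undirected graphs (FlashGraph's contribution is making this scale to billions of edges, which the sketch does not attempt). The edge list is invented for the example.

```python
from itertools import combinations

def count_triangles(edges):
    """Count triangles in an undirected graph via the node-iterator method:
    for every vertex, test each pair of its neighbors for a closing edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = 0
    for u, neighbors in adj.items():
        for v, w in combinations(sorted(neighbors), 2):
            if w in adj[v]:
                triangles += 1
    return triangles // 3   # each triangle is counted once per corner

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(count_triangles(edges))   # → 1
```

Scalable implementations additionally orient edges by degree so each triangle is tested only once, which is one of the tricks engines like FlashGraph build on.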
3.3 Data Preparation
3.3.1 Trackers extraction
In order to extract potential tracking services from the Common Crawl web corpus,
we utilize an ongoing project initiated and developed at TU-Berlin by Sebastian
Schelter, with contributions from other developers including the author of this thesis.
The project is named Track the Trackers [49] and it is open sourced on GitHub.
Track the Trackers uses Hadoop MapReduce to process the input web corpus (unstructured
data) stored in the Arc file format [50] and parse each HTML page, along with
its resources, into an intermediate serializable structured format using Google Protocol
Buffers³. These intermediate structures are then read by another MapReduce
job to extract potential trackers and build the tracking graph. Figure 3.1 provides
a high-level overview of the code for extracting trackers and constructing the tracking
graph.
The tracking graph job marks an HTML page resource (i.e. scripts, images, links, etc.)
as suspicious if its source (i.e. its HTML source attribute) is a different domain than that
of the page itself (i.e. a third-party domain). The rationale behind this relies on the four
types of HTML resources we are interested in:
1. Scripts: Most third-party analytics trackers⁴ use a code snippet with the
source attribute linked to their analysis engine.
2. IFrames: Most third-party advertisers use HTML IFrames to host their ads. In
most cases the source attribute of the IFrame is linked to the advertiser.
3. Images: A number of trackers, such as Google's DoubleClick, use a technique called
tracking pixels: an img tag whose source is generally a third-party domain.
The browser sees the img tag and makes a request from the user's browser to the
server (as directed by the URL in the HTML source attribute). With the image
request, the browser passes the user's domain-specific cookie ID just as it would
with any HTTP request; this ID can identify and track the user. The server then
responds with a transparent 1x1 GIF image, which should not be visible to the end
user.
4. Links: The same logic as with images can also be applied to any external
resource requested by a page from a third-party domain. These kinds of cross-
domain requests can be achieved by an HTML link tag. This is different from the
HTML a tag with an href attribute, which represents a clickable hyperlink.
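The detection of third-party resources in these four contexts can be sketched with Python's standard HTML parser. This is a simplified, hypothetical analogue of the actual pipeline: it compares host names rather than pay-level-domains, and the page and tracker names are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed simplification: a resource is "suspicious" when its host differs
# from the page's host; the real pipeline compares pay-level-domains.
class ResourceExtractor(HTMLParser):
    WATCHED = {"script": "src", "iframe": "src", "img": "src", "link": "href"}

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        wanted = self.WATCHED.get(tag)
        if wanted is None:
            return
        src = dict(attrs).get(wanted)
        host = urlparse(src or "").netloc
        if host and host != self.page_host:
            self.suspicious.append((tag, host))

html = """<html><body>
<script src="http://tracker.example/t.js"></script>
<img src="/local/logo.png">
<iframe src="http://ads.example/frame"></iframe>
</body></html>"""

parser = ResourceExtractor("site-a.example")
parser.feed(html)
print(parser.suspicious)
# → [('script', 'tracker.example'), ('iframe', 'ads.example')]
```

Note how the relative image URL yields no host and is therefore treated as first-party, matching the logic described above.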
In case a resource is marked as suspicious, a new tuple is added to the tracking graph
representing an edge between the source URL of this resource (the potential tracker)
³Protocol Buffers (also known as protobuf) are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data, like XML but smaller, faster, and simpler. One defines the data structure once, then uses generated source code to easily write and read the structured data to and from a variety of data streams, using a variety of languages such as Java, C++, or Python.
⁴Refer to 1.2.1 for the classification model of tracking services.
and the tracked pay-level-domain of that page. Here we assume generality for the
sake of a high-level analysis: if one or more pages of a website are tracked, we consider
the website as tracked by the union of all trackers found within its individual pages.
3.4 Environment
3.4.1 Amazon EC2
As the 2012 Common Crawl corpus resides on Amazon S3, we need to process
it using Amazon Elastic MapReduce to extract the intermediate files that contain the
parsed resources of each page, from which we can construct the tracking graph (see 3.3.1).
This extraction job has been supported by an AWS in Education Research Grant award⁵ obtained by Sebastian Schelter.
3.4.2 DIMA IBM Power Cluster
To run distributed Spark and Flink jobs for analyzing large graphs (i.e. the tracking
graph, the tracked-web and the PLD web graph), we use the IBM Power Cluster offered by
IBM to the DIMA research group at TU-Berlin. The cluster consists of 10 nodes, each with
48 cores and 60 GB RAM, and a total disk space of 1.8 TB that is mainly used for HDFS.
⁵http://aws.amazon.com/grants/
Figure 3.1: Pseudocode of the main routines in extracting trackers
For simplicity, the pseudocode omits details about keeping the HTML tag in the tracking graph. In reality, an entry of the tracking graph consists of (trackerID, trackedID, isScript, isIFrame, isImage, isLink).

Phase 1. Parsing HTML resources
input: set of Arc files containing web corpus pages
output: set of parquet files with parsed pages

function processArcFile(ArcFile)
    for each page in ArcFile do
        if (page.type is HTML) then
            parsedPage := empty
            parsedPage.javascripts := parse(page, resources.javascript)
            parsedPage.iframes := parse(page, resources.iframe)
            parsedPage.images := parse(page, resources.image)
            parsedPage.links := parse(page, resources.links)
            parsedPage.saveAsParquetFormat()
        end if
    end for
end function

Phase 2. Construct the tracking graph
input: set of parquet files with parsed pages
output: tracking graph

function map(ParsedPage)
    thirdPartyResources := List.empty
    for each script in ParsedPage.javascripts do
        if (script.src != ParsedPage.src) then
            thirdPartyResources.add(script.src)
        end if
    end for
    for each iframe in ParsedPage.iframes do
        if (iframe.src != ParsedPage.src) then
            thirdPartyResources.add(iframe.src)
        end if
    end for
    // ... fill thirdPartyResources by doing the same for images and links
    for each tracker in thirdPartyResources do
        if (tracker.PLD != ParsedPage.PLD) then
            emit(tracker.PLD, ParsedPage.PLD)
        end if
    end for
end function

function reduce(tracker, List: trackedPLDs)
    trackedHosts := trackedPLDs.distinct
    for each trackedHost in trackedHosts do
        saveToTrackingGraph(tracker, trackedHost)
    end for
end function
-
Chapter 4
Analysis I: Statistical Properties
In this chapter we focus on presenting and analyzing a number of statistical measurements
about the tracking services and tracked websites. First we investigate the top
trackers, their general coverage, and the tracking penetration in the most popular websites.
Then we drill into the contexts in which trackers are observed and their classes, as
well as analyzing the tracked hosts' domain extensions. Finally, we investigate the
relationships between the top trackers and whether there are significant associations between
their occurrences in a given PLD¹.
To do so, we mainly use Hadoop to extract tracking services and construct the tracking
graph (see 3.3.1) from the raw web corpus on the Amazon cloud, along with Spark and Flink
for analytical jobs to compute different metrics and aggregations of the graph on the
university cluster (see 3.4.2); finally, we analyze these intermediate metrics locally using
R to obtain the final statistics and indicators. We also designed a data warehouse and
developed an OLAP cube on top of it using the Microsoft SQL Server 2012 BI stack. The
straightforward data warehouse contains 2 main dimensions, Tracked PLD and Tracking
Service, along with one narrow but lengthy fact table that contains the tracking graph
as an edge list with a number of Boolean columns needed for analysis. The cube helps in
mapping the relational model of the DWH into a multidimensional one that can benefit
from MDX queries for more convenient data analysis when it comes to drilling and slicing
data.
¹As a reminder, a pay-level-domain (PLD) is the main part of a URL that identifies a parent organization/domain. For example, the 2 research groups dima.tu-berlin.de and ida.tu-berlin.de have the same parent PLD tu-berlin.de.
4.1 Trackers Coverage
Our first investigation is to determine the top tracking services and analyze their coverage
over the web.
First we ran the Hadoop job to extract the tracking graph from the Common Crawl
web corpus (see 3.3.1). We were able to process a sample of 25% of the raw data from the
full corpus. However, after analyzing the processed output we found that this sample
accounts for 35% of the individual pages and 75% of the pay-level-domains in the full
corpus. That is based on our generalization assumption, where we tag a pay-level-domain
as potentially tracked if at least one of its pages is tracked. We believe that this high
level of PLD coverage in the sample is due to the long-tail distribution of the number of
web pages within websites that is observed by [13, 23, 24].
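The intuition that a page sample covers far more sites than its nominal size can be illustrated with a toy simulation: give a hypothetical population of sites heavy-tailed page counts, sample 25% of all pages uniformly, and measure what share of sites appears at least once. The distribution parameters are assumptions chosen only to illustrate the effect; the numbers do not reproduce the corpus figures.

```python
import random

random.seed(7)

# Hypothetical illustration: 10,000 sites with heavy-tailed page counts.
site_pages = [max(1, int(random.paretovariate(1.5))) for _ in range(10_000)]
# One list entry per page, labelled with its site id.
pages = [site for site, n in enumerate(site_pages) for _ in range(n)]

sample = random.sample(pages, k=len(pages) // 4)       # 25% of all pages
covered_sites = len(set(sample)) / len(site_pages)
print(f"page sample: 25%, sites covered: {covered_sites:.0%}")
```

A site with n pages appears in the sample with probability roughly 1 - 0.75^n, so multi-page sites are almost always covered, which pulls site coverage well above the 25% page-sampling rate.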
We were able to extract roughly 100 million tracking entries (i.e. tracker X → pay-
level-domain Y). After that, we ran an analytical Flink job to count the number of
tracked PLDs per unique potential tracker. Based on the tracker extraction assumptions
we explained in 3.3.1, we extracted approximately 27 million potential trackers. This
figure raised some doubts about the assumptions we made while detecting trackers. However,
after further analysis of the tracking-count distribution (i.e. the number of tracked sites per
tracker), we observed two interesting facts:
- 82% of these potential trackers have a tracking count of only 1.
- 99.9% of them have a tracking count of less than 1,000 hosts.
Based on the first finding, we considered any tracker that occurs only once (i.e. tracking
only one PLD) as noise in the extraction process, since no actual tracking service
would be visible in only 1 host. Based on that, we define the new term effective tracker,
that is, a tracking service that is detected to track more than one PLD. There are
approximately 4.8 million effective trackers within our dataset. For the second finding, we
hypothesize that the number of tracked sites per tracker follows a power-law distribution;
however, this needs further empirical examination.
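The effective-tracker filter just described can be sketched on a toy edge list; the tracker and PLD names below are invented for the example.

```python
from collections import Counter

# Toy tracking graph (tracker, tracked PLD); names are invented.
edges = [("t1", "a"), ("t1", "b"), ("t1", "c"),
         ("t2", "a"), ("t2", "d"),
         ("t3", "b"),            # appears in a single PLD -> treated as noise
         ("t4", "d")]            # appears in a single PLD -> treated as noise

counts = Counter(tracker for tracker, _ in edges)
noise = [t for t, c in counts.items() if c == 1]
effective = {t: c for t, c in counts.items() if c > 1}

print(f"{len(noise) / len(counts):.0%} of trackers occur once")
print("effective trackers:", effective)
# → 50% of trackers occur once
# → effective trackers: {'t1': 3, 't2': 2}
```

At corpus scale the same counting is done by the Flink job mentioned above; the per-tracker counts are also the raw material for the power-law examination of the tracking-count distribution.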
As illustrated in figure 4.1, we found that at least 60% of the PLDs in the sample are
potentially tracked under our previously mentioned assumptions. For those 19 million
Figure 4.1: Tracking detection summary
The figure shows statistics about the sample taken from the full web corpus residing on Amazon S3. The processing of raw data to extract pages and resources is done on Amazon Elastic MapReduce, and finally the construction of the tracking graph and its analysis is performed on the TU-Berlin DIMA cluster.
PLDs (constituting the 60%), we detected the top 20 trackers (see table 4.1) based on the
number of unique PLDs spanned by each of them. One can notice that Google-related
services have the highest share of tracking. However, the figures can't simply be summed,
since one PLD can be tracked by multiple services.
To better understand the nature of these trackers, we investigated further to find out
the following:
- googlesyndication.com: a domain owned by Google that is used for storing and loading ad content and other resources relating to ads for Google AdSense and DoubleClick from the Google content delivery network.
- ajax.googleapis.com: the AJAX Libraries API is Google's content distribution network and loading architecture for the most popular open source JavaScript libraries, such as jQuery, AngularJS, Dojo, etc.
- The difference between the well-known facebook.com and facebook.net is that the latter is Facebook's API endpoint that supports social widgets and other applications, while the former is usually found in iframe and image contexts (table 4.1), which we postulate to be its usage for hosting Facebook media content (videos and pictures).
Table 4.1: Top 20 potential trackers
The table also shows the HTML context in which the tracker was detected. An important remark while interpreting the figures below is that the context percentages don't have to add up to 100% for each tracker, since the same tracker can be detected in different contexts within the same PLD.
Tracker                  Frequency    % of Tracked PLDs   % of All PLDs   Script%   IFrame%   Image%   Link%
google-analytics.com     8,183,519    42%                 25%             100%      0%        0%       0%
googlesyndication.com    2,953,807    15%                 9%              99%       0%        1%       0%
google.com               2,206,582    11%                 7%              78%       16%       15%      7%
ajax.googleapis.com      1,470,524    8%                  5%              99%       0%        0%       6%
facebook.com             1,315,966    7%                  4%              17%       77%       12%      0%
macromedia.com           1,290,750    7%                  4%              100%      0%        0%       0%
adobe.com                983,536      5%                  3%              56%       0%        47%      0%
facebook.net             858,533      4%                  3%              100%      0%        0%       0%
casalemedia.com          832,215      4%                  3%              100%      0%        0%       0%
youtube.com              780,471      4%                  2%              15%       83%       9%       1%
twitter.com              753,311      4%                  2%              92%       10%       1%       1%
addthis.com              741,610      4%                  2%              97%       0%        34%      0%
imgaft.com               607,701      3%                  2%              99%       0%        100%     0%
godaddy.com              566,565      3%                  2%              99%       1%        3%       0%
gravatar.com             545,740      3%                  2%              30%       0%        82%      7%
gmpg.org                 516,165      3%                  2%              0%        0%        0%       100%
statcounter.com          507,867      3%                  2%              96%       0%        95%      0%
dsnextgen.com            399,400      2%                  1%              98%       2%        0%       0%
wordpress.com            384,114      2%                  1%              81%       0%        37%      16%
yahoo.com                367,155      2%                  1%              27%       2%        78%      0%
- casalemedia.com: a Canadian online media and technology company that builds online advertising technology for web publishers and advertisers.
- imgaft.com: we could not find extensive information about this domain and its
siblings ak2.imgaft and ak3.imgaft. The only thread we found is that it is registered
to GoDaddy. We suspect it is being used in the parked-domain advertising scheme
that GoDaddy provides for its users: when a user reserves a domain until his website
is created, or even in order to sell it in the future, the domain can be parked, and a
temporary landing page with targeted advertising is served by GoDaddy to the
domain's visitors in return for a percentage of the ad revenues paid to the parked-host
owner. However, we couldn't technically validate this hypothesis.
- gravatar.com: an online service that provides users with images (avatars) that
follow them from site to site, appearing beside their name when they do things like
commenting or posting on a blog. Avatars help in identifying a user's posts across
blogs and web forums. We believe it made it into the top 20 list since it is included
by default in every WordPress.com account, and WordPress has more than 6 million
pages in the Common Crawl corpus sample we are using.
- dsnextgen.com: we could not find much information about this domain, but we did
find a number of threads describing it as malware, with people reporting their
websites hacked by it.
- statcounter.com: a free web tracker embedded by websites as a hit counter and to provide real-time, detailed web traffic information.
4.2 Top Sites Tracking
In this question, we analyze the magnitude of the tracking phenomenon from a
different perspective. Rather than the general statistics about the entire web corpus we
have observed so far, we focus on quantifying the trackers' penetration over a key
subset of the internet: the most popular sites on the web. To achieve that, we
use the publicly available dataset mentioned in 3.1.4 from Alexa Analytics, containing a
list of the top 1 million websites based on traffic.
Interestingly, we find that the tracking penetration increases as we go up the list of
top sites. It starts at 48% within the top 1 million PLDs and
increases gradually to reach a high of 82% within the top 1,000 PLDs, as shown in figure 4.2.
Figure 4.2: Alexa top sites tracking penetration
Furthermore, we noticed that this pattern (increasing tracking penetration with a
shrinking subset of top sites) is visible on the tracker level as well. Table 4.2
shows that the top 10 trackers are the same at each subset, in the same order, with each
tracker's penetration following an increasing trend across subsets. The only exception is
doubleclick.net, which appears within the top 1,000 sites in place of addthis.com.
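The per-subset penetration numbers can be computed with a simple prefix scan over the ranked list; a minimal sketch (the function name and toy data are hypothetical, and the real computation runs over the full Alexa list):

```python
def penetration_by_rank(ranked_plds, tracked, cutoffs=(1_000, 10_000)):
    """Tracking penetration within each top-N prefix of a ranked PLD list.

    `ranked_plds` is ordered most- to least-popular; `tracked` is the
    set of PLDs marked as potentially tracked.  Returns {cutoff: ratio}.
    """
    result = {}
    for n in cutoffs:
        top = ranked_plds[:n]
        result[n] = sum(1 for pld in top if pld in tracked) / len(top)
    return result

ranked = ["a.com", "b.com", "c.com", "d.com"]
tracked = {"a.com", "b.com", "c.com"}
```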
Table 4.2: Top trackers penetration ratio across Alexa top sites
Tracker                  Top 1000K   Top 500K   Top 100K   Top 10K   Top 1K
google-analytics.com     0.34        0.38       0.47       0.62      0.71
google.com               0.16        0.20       0.29       0.45      0.60
facebook.com             0.11        0.14       0.21       0.37      0.48
ajax.googleapis.com      0.09        0.11       0.16       0.28      0.40
googlesyndication.com    0.09        0.11       0.15       0.22      0.30
facebook.net             0.08        0.11       0.17       0.30      0.37
twitter.com              0.08        0.10       0.17       0.32      0.44
youtube.com              0.07        0.09       0.14       0.25      0.40
addthis.com              0.07        0.08       0.12       0.20      -
macromedia.com           0.05        0.06       0.11       0.21      0.33
doubleclick.net          -           -          -          -         0.34
4.3 Tracking Classification
Our third question focuses on the tracking types. In the literature overview we
discussed a proposed classification framework for tracking behavior, from which we can
distinguish between 3rd party web analytics, advertisers and social widgets (see 1.2.1).
To classify trackers, we first need to analyze the contexts in which the potential tracker
is detected. As explained before in 3.3.1, a 3rd party tracker can be detected as the source
HTML attribute of scripts, iframes, images and links. Figure 4.3 shows the ratio of
trackers detected at each HTML source, compared to the number of unique trackers, as
well as the ratio of tracked PLDs, compared to the number of unique tracked PLDs. We
notice that most potential trackers (92%) are detected as sources of image tags in HTML
and that most tracked PLDs are potentially tracked by means of 3rd party scripts.
Figure 4.3: Tracking sources summary
A key point one needs to understand while interpreting the tracking-source analysis
graph in figure 4.3 is that the ratios don't have to add up to 1. This is due to the fact that
a single tracker can be detected in different sources at different PLDs (e.g. in a script in
PLD 1 and in an image in PLD 2) and even potentially within the same PLD. The same
goes for tracked PLDs, where one PLD can be potentially tracked by different trackers
detected at different sources (e.g. using Google Analytics for traffic analysis and hosting
3rd party ads in iframes). Table 4.3 shows the frequency distribution of the available
combinations of tracking contexts. The frequency represents the number of occurrences
where a tracked PLD is detected to have the corresponding tracking sources. The ratio
is calculated based on the total number of entries in the tracking graph (approximately
80 million). For detailed information about top trackers by source, one can refer to
appendix A.
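The context-combination frequencies of table 4.3 amount to counting, per tracking-graph entry, the set of HTML contexts it was seen in. A small sketch (the entry representation below is an assumption made for illustration):

```python
from collections import Counter

def source_combinations(entries):
    """Frequency of each combination of HTML contexts per entry.

    `entries` maps a (tracker, pld) pair to the set of contexts in which
    the tracker was detected, e.g. {"script", "image"}.
    """
    combos = Counter()
    for contexts in entries.values():
        combos[" & ".join(sorted(contexts))] += 1
    return combos

entries = {("ga", "a.com"): {"script"},
           ("ad", "a.com"): {"script", "image"},
           ("ad", "b.com"): {"script"}}
```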
Table 4.3: Tracking-Source distribution
HTML Source               Frequency    Ratio
Script                    37,745,830   48%
Image                     23,304,038   30%
Script & Image            5,578,269    8%
IFrame                    3,956,215    5%
Script & Image & Link     3,406,727    5%
Link                      2,146,367    3%
Image & Link              1,050,777    2%
Script & Link             827,657      2%
Script & IFrame           419,904      1%
All                       398,320      1%
IFrame & Image            225,109      1%
Script & IFrame & Image   107,748      1%
Script & IFrame & Link    102,543      1%
IFrame & Image & Link     57,420       1%
IFrame & Link             29,045       1%
For 3rd party social-widget tracking, we analyzed a predefined set of code snippets
offered by popular social network websites (see appendix C) and marked whether each
entry in the tracking graph is tracked by a social widget, based on the source attribute
that the code is using. Table 4.4 shows the share of each social network compared to the
subset of PLDs being tracked by social widgets; in terms of coverage, it shows the
percentage of PLDs spanned by each social network compared to all tracked PLDs and
compared to the sample web corpus.
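The widget-marking step boils down to matching each entry's source attribute against the domains used by the predefined snippets. A minimal sketch; the domain list here is a hypothetical subset standing in for the full list in appendix C:

```python
SOCIAL_DOMAINS = {  # hypothetical subset of the predefined snippet list
    "facebook.com": "Facebook",
    "platform.twitter.com": "Twitter",
    "youtube.com": "Youtube",
}

def widget_of(src_url):
    """Return the social network whose widget domain appears in the
    src attribute, or None if the entry is not widget-tracked."""
    for domain, network in SOCIAL_DOMAINS.items():
        if domain in src_url:
            return network
    return None
```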
Finally, based on the trackers extraction assumption and the proposed classification
framework, we can assign the script tracking to 3rd party web analytics services, iframes
and images to advertising-related trackers, while extracting the social-widget trackers
Table 4.4: Social-Widget tracking summary

Social-Widget   Absolute Frequency   Relative Frequency   % of Tracked PLDs   % of All PLDs
Facebook        2,180,111            0.576                11.17%              6.72%
Youtube         798,027              0.211                4.09%               2.46%
Twitter         783,727              0.207                4.02%               2.42%
Reddit          17,552               0.005                0.09%               0.05%
Instagram       4,346                0.001                0.02%               0.01%
Tumblr          140                  0.000                0.00%               0.00%
manually as explained in the previous section. This led us to the final statistics about
tracking classification as illustrated in figure 4.4, which shows the percentage of tracked
PLDs under each class. The ratios don't add up to 1 because of the overlapping tracking
behavior explained earlier.
Figure 4.4: Tracking Classification Summary
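The assignment rule just described (scripts to analytics, iframes/images to advertising, known widget snippets to social) can be sketched as a small classifier; the function and class labels are illustrative, not the thesis implementation:

```python
def classify(entry_contexts, is_social_widget):
    """Assign tracking classes: scripts -> 3rd party analytics,
    iframes/images -> advertising, known widget snippets -> social.
    An entry may fall into several classes at once, which is why the
    class ratios in figure 4.4 do not add up to 1."""
    classes = set()
    if "script" in entry_contexts:
        classes.add("analytics")
    if entry_contexts & {"iframe", "image"}:
        classes.add("advertising")
    if is_social_widget:
        classes.add("social-widget")
    return classes
```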
4.4 Domain Analysis
Our next area of exploration is tracking penetration analysis based on internet
domains. There are many domain levels to consider (e.g. second-level, top-level, etc.).
However, we focus on the generic top-level domain (gTLD) and the country-code top-level
domain (ccTLD).
4.4.1 Country code analysis
We were able to detect approximately 11 million pay-level-domains that contain a
country code (e.g. .de, .uk, .fr, etc.) in the sample web corpus of 32 million PLDs (out
of which around 60% were marked as potentially tracked).
By means of informal visual analysis, we found that the tracking penetration ratios of
country codes follow a normal distribution, as shown in figure 4.5, with minimum
= 0.23, median = 0.59, maximum = 0.93 and standard deviation = 0.1.
Figure 4.5: ccTLD tracking penetration histogram
For each tracking penetration value (x-axis), we plot a bar presenting the number of countries with such penetration.
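The per-country penetration ratios behind figure 4.5 follow from grouping PLDs by their ccTLD. A minimal sketch, assuming (as in our analysis) that the ccTLD is simply the trailing two-letter label of the PLD:

```python
def cc_penetration(plds, tracked):
    """Tracking penetration per country-code TLD.

    A PLD's ccTLD is taken to be its last dot-separated label when
    that label has exactly two letters; other PLDs are ignored.
    """
    totals, hits = {}, {}
    for pld in plds:
        tld = pld.rsplit(".", 1)[-1]
        if len(tld) != 2:
            continue
        totals[tld] = totals.get(tld, 0) + 1
        hits[tld] = hits.get(tld, 0) + (pld in tracked)
    return {tld: hits[tld] / totals[tld] for tld in totals}

plds = ["a.de", "b.de", "c.fr", "d.com"]
ratios = cc_penetration(plds, tracked={"a.de", "c.fr"})
```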
An interesting way to visualize the global spread of the web tracking phenomenon, as
well as its degree, is a heat map, as shown in figure 4.6. Interestingly, Germany
scored a relatively low penetration rate of 49%, placing it in the lower quartile of the
data. We can also notice that some of the highest penetration rates are concentrated in
Russia and the post-Soviet states of eastern Europe and Asia.
Finally, an important remark is that we are only considering ccTLD extensions in
our analysis and not the country-assigned IP address ranges. This experiment can
be further enriched by incorporating IP analysis as well.
4.4.2 Generic domain analysis
Besides the country codes, we were also able to detect 22,986,076 PLDs in the web
corpus sample (of approximately 32 million PLDs) that contain an element of a predefined
set of the most popular generic top-level domains (gTLDs) assigned by the Internet
Figure 4.6: Tracking penetration worldwide
Shades of green, yellow and red indicate low, medium and high penetration rates respectively, given that the scale starts at 23% and ends at 93%. Black indicates no data available.
Assigned Numbers Authority (IANA). The gTLDs are .com, .net, .org, .gov, .edu, .mil,
.info and .biz. Out of these PLDs, we marked 13,800,223 as potentially tracked.
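Matching PLDs against this predefined gTLD set and tallying tracked ones can be sketched as follows (function name and toy inputs are hypothetical):

```python
GTLDS = {"com", "net", "org", "gov", "edu", "mil", "info", "biz"}

def gtld_counts(plds, tracked):
    """Per-gTLD (total, tracked) counts for PLDs ending in a known gTLD."""
    stats = {g: [0, 0] for g in GTLDS}
    for pld in plds:
        tld = pld.rsplit(".", 1)[-1]
        if tld in GTLDS:
            stats[tld][0] += 1
            stats[tld][1] += pld in tracked
    # Keep only gTLDs that actually occurred in the sample.
    return {g: tuple(c) for g, c in stats.items() if c[0]}

counts = gtld_counts(["a.com", "b.com", "c.gov", "d.de"], {"a.com", "c.gov"})
```

The penetration ratio per gTLD (as in table 4.5) is then simply tracked divided by total.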
In table 4.5 we summarize the tracking penetration ratio for each of the extracted
gTLDs. Surprisingly, the results went against our expectation that the more popular
and commercial domains such as .com and .net would have higher penetration than
the more private, and in some cases sensitive, domains such as .edu and .gov. We also
did not expect the .mil gTLD, used by military organizations, to have a penetration
rate as high as 53%, even though it tails the list.
Table 4.5: Tracking penetration by gTLD
gTLD    PLDs (sample)   PLDs (tracked)   Tracking Penetration
.edu 33629 22512 67%
.gov 51081 33178 65%
.info 472131 304941 65%
.net 1746476 1116848 64%
.biz 156559 99525 64%
.org 1923282 1214639 63%
.com 18602312 11008258 59%
.mil 606 322 53%
To further investigate these unexpected results, we compiled the matrix in table 4.6 with
the top 10 trackers of each gTLD along with the number of PLDs they cover within it.
Based on this matrix, we observed the following:

- While the Google-related trackers are the only core trackers across all gTLDs, the
top 10 trackers are almost identical across .com, .org, .net, .info and .biz (with
few exceptions). They are also a subset of the overall top trackers noted in 4.1.
However, trackers tend to be different and sparse in the .edu, .gov and .mil group.

- What we consider sensitive gTLDs, like .gov, .mil and .edu, are tracked mostly by
web analytics tools like google-analytics and addthis.com and by social network
widgets. However, there are no indications of them employing advertising-related
trackers or content delivery networks, even popular ones such as googlesyndication.com.
This is somewhat understandable, since websites like these, intended for public
service, will need to employ some sort of social interaction via social widgets, not
to mention analyzing their own traffic.

- Some popular trackers only appear within commercial gTLDs, such as the popular
web host godaddy.com, but never with .gov, .mil or .edu.

- A few trackers appear under only one gTLD, like cnzz.com and ejercito.mil.co under
the .gov and .mil gTLDs respectively.
To understand the last point further, we drilled deeper into the data and, with some
internet searching, found that cnzz.com is a Chinese tracking service that employs scripts
in tracked pages. It turns out that the 2,141 PLDs cnzz.com is tracking under the .gov
gTLD all have the .cn country code, which means they are Chinese government PLDs. Also,
we found that ejercito.mil.co belongs to the Colombian national army and that all
15 tracked PLDs are being tracked by means of 3rd party HTML links.
4.5 Trackers Association
In this section, we aim to investigate the frequent co-occurrence of tracking services
and whether there are rules that can predict the presence of trackers in a PLD based on
the existence of other trackers; for example, whether the existence of tracker z in a PLD
is usually associated with the existence of trackers x and y.
Table 4.6: Top Trackers Coverage over gTLDs
The matrix has values only for the top 10 trackers of each gTLD, or zeros to indicate that the tracker is completely absent regardless of its ranking. For example, the second cell (horizontally) indicates that addthis is tracking 51,627 PLDs with the .org gTLD, while godaddy.com is completely absent from all .mil domains. A tracker is marked with a dash if it is not within the top 10 trackers under a specific gTLD.
Tracker .com .org .net .info .biz .edu .gov .mil
addthis.com             -           51,627     -          9,680     -        1,991    989      17
adobe.com               591,441     50,835     -          -         3,184    4,392    3,350    50
ajax.googleapis.com     867,033     99,873     69,199     13,143    4,956    3,757    1,700    32
baidu.com               -           -          -          -         -        -        939      0
casalemedia.com         609,597     -          58,074     46,841    6,455    -        0        0
cnzz.com                -           -          -          -         -        -        2,141    0
ejercito.mil.co         -           0          0          0         0        0        0        15
facebook.com            726,854     89,693     67,375     13,816    4,551    3,054    -        23
facebook.net            485,699     53,172     43,369     -         -        1,939    -        -
gmpg.org                -           -          -          10,600    -        -        -        -
godaddy.com             -           -          -          25,316    5,144    -        -        0
google-analytics.com    4,545,650   467,913    402,411    89,385    34,300   11,881   7,776    171
google.com              1,244,448   166,275    127,759    32,421    9,380    5,315    3,241    38
googlesyndication.com   1,874,776   168,622    235,195    112,147   22,153   -        -        -
imgaft.com              471,002     -          46,055     27,232    5,635    0        0        0
macromedia.com          802,529     60,038     55,717     -         3,950    4,539    10,468   46
twimg.com               -           -          -          -         -        -        -        24
twitter.com             -           -          46,114     -         -        1,348    -        -
weather.com.cn          -           -          -          -         -        -        2,673    0
youtube.com             -           68,471     -          -         -        2,483    898      19

Total PLDs tracked
in gTLD                 11,008,258  1,214,639  1,116,848  304,941   99,525   22,512   33,178   322
To begin, we want to understand the nature of the trackers' existence in terms of
quantity (i.e. how many trackers there are per pay-level-domain). We start by computing
the total number of tracking services per PLD (approximately 19 million PLDs) and
observing the distribution. As shown in figure 4.7, the distribution is far from normal;
in fact, it appears to decay exponentially, with more than 99.99% of the data set in the
range of 1-100 trackers per PLD. This means there exists a tiny fraction of PLDs
with a huge number of trackers (above 1,000). We then wanted to understand whether that
might be attributed to the number of pages in each processed PLD; however, we calculated
the Pearson correlation coefficient² between the number of pages and the number of
trackers (per PLD) to be 0.28, indicating only a slight positive correlation (even though
we intuitively expected a stronger one). After examining a subset of the top PLDs, in
terms of pages and trackers, we found that most of them are huge networks such as
Google, YouTube, Tumblr, etc., which permit users to load resources from 3rd party
domains (e.g. scripts, content, themes, etc.) as well as to use 3rd party web traffic
monitoring, hence the high number of trackers.

²Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. It measures the linear correlation (dependence) between two variables x and y, giving a value in [-1, 1] where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation.
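The coefficient defined in the footnote can be computed directly from its definition (covariance over the product of the standard deviations, population form); a minimal sketch:

```python
import math

def pearson(xs, ys):
    """Pearson's r: covariance of x and y divided by the product of
    their standard deviations (population form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)
```

Fed with the per-PLD page counts and tracker counts, this is the computation that yielded r = 0.28 above.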
Figure 4.7: Log-Log plot for the number of trackers per PLD
The second part of the analysis is to identify the groups of tracking services that
usually appear together in PLDs. In order to achieve that, we model the problem as
a market-basket analysis (with trackers as products and tracking graph entries as
transactions) while employing frequent itemset mining techniques. On top of that, we use
association rule learning to find out whether there are dependencies between trackers.
Apriori [51] is a seminal frequent itemset mining algorithm that we use (out
of the box from SQL Server Analysis Services³) to help answer our question. In a
nutshell, Apriori works by identifying the frequent individual items in the dataset and
extending them to larger and larger item sets as long as those item sets appear sufficiently
often in the data (by means of a support function). The frequent item sets determined
by Apriori can later be used to derive association rules which highlight general trends
in the dataset. Figure 4.8 shows an outline of the algorithm.
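The level-wise idea can be illustrated with a toy implementation (this is a naive sketch without the candidate subset-pruning step, not the SQL Server Analysis Services implementation we actually used; the basket data is hypothetical):

```python
def apriori(transactions, min_support):
    """Level-wise Apriori: keep k-itemsets appearing in at least
    `min_support` transactions, extend them to (k+1)-itemsets, repeat."""
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {c for c in items
             if sum(c <= t for t in transactions) >= min_support}
    k = 1
    while level:
        for c in level:
            frequent[c] = sum(c <= t for t in transactions)
        # Candidate generation: unions of pairs from the current level.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support}
        k += 1
    return frequent

# Toy baskets: each transaction is the set of trackers found on one PLD.
baskets = [{"ga", "fb"}, {"ga", "fb", "tw"}, {"ga"}]
freq = apriori(baskets, min_support=2)
```

With support threshold 2, only {ga}, {fb} and {ga, fb} survive; the frequent pairs are then the input for association rule mining.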
We applied the Apriori implementation on a subset of the tracking graph that contains
the top 20 trackers (extracted in 4.1) and their corresponding tracking entries of
approximately 26 million records (32% of the complete graph). Table 4.7 shows 20 frequent

³Microsoft provides its implementation of Apriori under the name of Microsoft Association Algorithm. See msdn.microsoft.com/en-us/library/cc280428.aspx
⁴Figure taken from en.wikipedia.org/wiki/Apriori_algorithm
Figure 4.8: Pseudo code of the Apriori algorithm⁴
The pseudo code for the algorithm is given for a transaction database T and a support threshold of ε. Ck is the candidate set for level k. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. count[c] access