
  • Technische Universität Berlin

    Master Thesis

    An Exploratory Analysis of the Tracked Web

    Author: Karim Wadie

    Supervisor: Prof. Volker Markl

    Advisor: Johannes Kirschnick

    A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computer Science

    as part of the Erasmus Mundus programme IT4BI

    in the

    Database Systems and Information Management Group (DIMA), Department of Computer Science

    July 2015

  • Declaration of Authorship

    I declare that I have authored this thesis independently, that I have not used other than

    the declared sources/resources, and that I have explicitly marked all material which has

    been quoted either literally or by content from the used sources.

    Eidesstattliche Erklärung

    Ich erkläre an Eides statt, dass ich die vorliegende Arbeit selbstständig verfasst, andere als die angegebenen Quellen/Hilfsmittel nicht benutzt, und die den benutzten Quellen wörtlich und inhaltlich entnommenen Stellen als solche kenntlich gemacht habe.

    Berlin,

    July 31, 2015

    Karim WADIE


  • "The man who comes back through the Door in the Wall will never be quite the same

    as the man who went out. He will be wiser but less sure, happier but less self-satisfied,

    humbler in acknowledging his ignorance yet better equipped to understand the relationship

    of words to things, of systematic reasoning to the unfathomable mystery which it tries,

    forever vainly, to comprehend."

    Aldous Huxley

  • Technische Universität Berlin

    Abstract

    Faculty of Electrical Engineering and Computer Science

    Department of Computer Science

    Master of Science in Computer Science

    An Exploratory Analysis of the Tracked Web

    by Karim Wadie

    There is no doubt that web tracking has become progressively more prevalent on the internet over the past years, serving traffic analytics and/or the building of user browsing profiles that aid personalized advertising. There are several techniques by which a tracking service can record visitors' behavior on a remote website, some of which can be detected in an offline setting by analyzing HTML contexts and common tracking practices, such as tracking pixels and scripts that communicate with a third-party host. This thesis builds on top of the TrackTheTrackers project, initiated at TU Berlin to extract tracking services from the Common Crawl (the largest publicly available web corpus), by providing a deeper, quantitative analysis of the web-tracking phenomenon in terms of its spread and its relationship with the web structure. To the best of our knowledge, this research is the first to combine web-graph studies with third-party tracking analysis. Throughout our exploratory analysis, we report a number of statistical findings about the tracking graph along with descriptive, structural properties of the web graph spanned by the trackers and tracked websites (i.e. the tracked-web), and finally, we examine how structural features of the web graph, such as community structures and centrality measures, can affect the spread of tracking over the web. For instance, we found that 60% of the web is potentially tracked, with Google being the number one tracker on the internet. We also used a quantitative approach to discover that the tracked-web is highly interconnected and exhibits the small-world phenomenon with only 5 degrees of separation, and that it resembles the structure of a social network more than that of a web graph.


  • Acknowledgements

    I take this opportunity to express gratitude to Johannes, my advisor, for his guidance throughout the thesis as well as his comments that greatly improved the manuscript. I also thank Sebastian Schelter for his excellent work on the TrackTheTrackers project and for providing the datasets upon which this study builds.

  • Contents

    Declaration of Authorship

    Abstract

    Acknowledgements

    List of Figures

    List of Tables

    Abbreviations

    1 Introduction and Literature Review
      1.1 Introduction
        1.1.1 What is web tracking?
        1.1.2 The business empire of web tracking
        1.1.3 Why should we study tracking?
      1.2 Literature Overview
        1.2.1 Web tracking studies
        1.2.2 Web graph studies

    2 Objectives

    3 Methodology
      3.1 Datasets
        3.1.1 Common Crawl web corpus
        3.1.2 Web Data Commons hyper-link graph
        3.1.3 The Common Crawl WWW ranking
        3.1.4 Alexa top sites
      3.2 Data Processing Platforms
        3.2.1 Apache Hadoop
        3.2.2 Apache Spark
        3.2.3 Apache Flink
        3.2.4 R
        3.2.5 MS SQL Server BI Stack
        3.2.6 WebGraph Framework
        3.2.7 FlashGraph Framework
      3.3 Data Preparation
        3.3.1 Trackers extraction
      3.4 Environment
        3.4.1 Amazon EC2
        3.4.2 DIMA IBM Power Cluster

    4 Analysis I: Statistical Properties
      4.1 Trackers Coverage
      4.2 Top Sites Tracking
      4.3 Tracking Classification
      4.4 Domain Analysis
        4.4.1 Country code analysis
        4.4.2 Generic domain analysis
      4.5 Trackers Association
      4.6 Chapter Summary

    5 Analysis II: Structural Properties
      5.1 Tracked-Web Degree Distribution
        5.1.1 Density and node degrees
        5.1.2 Power-law fitting
        5.1.3 Findings
      5.2 Tracked-Web Degree of Separation
        5.2.1 Introduction
        5.2.2 Approach: HyperANF
        5.2.3 Distance-related features
        5.2.4 Conclusion
      5.3 Is The Tracked-Web a Small World?
      5.4 Tracked-Web Components
        5.4.1 WCC
        5.4.2 SCC
      5.5 Centrality and Tracking
        5.5.1 Introduction
        5.5.2 Approach
        5.5.3 Individual centrality correlation
        5.5.4 Centrality-based classification
      5.6 Community Structure and Tracking
        5.6.1 Vertex-centric neighborhoods
        5.6.2 Web graph communities
        5.6.3 Conclusion
      5.7 Chapter Summary

    6 Future Work

    7 Thesis Summary

    A Top Trackers By Source

    B Tracking Penetration By Country

    C Social Widgets Detection

    Bibliography

  • List of Figures

    1.1 Example of online advertising players
    1.2 USA online advertisement market growth in USD billions
    1.3 Case Study: Third-Party Analytics
    1.4 Case Study: Third-Party Advertising
    1.5 Case Study: Advertising Networks
    1.6 Case Study: Social Widgets
    1.7 Bow-tie structure of the web
    3.1 Pseudocode of the main routines in extracting trackers
    4.1 Tracking detection summary
    4.2 Alexa top sites tracking penetration
    4.3 Tracking sources summary
    4.4 Tracking classification summary
    4.5 ccTLD tracking penetration histogram
    4.6 Tracking penetration worldwide
    4.7 Log-log plot of the number of trackers per PLD
    4.8 Pseudocode of the Apriori algorithm
    5.1 Log-log plot of the tracked-web indegree distribution
    5.2 Log-log plot of the tracked-web outdegree distribution
    5.3 Probability mass function of the tracked-web distance
    5.4 Cumulative probability function of the tracked-web distance
    5.5 Log-log plot of the tracked-web WCC size distribution
    5.6 Pseudocode of the Tarjan algorithm for finding strongly connected components in a graph
    5.7 Log-log plot of the tracked-web SCC size distribution
    5.8 Pseudocode for computing the tracking coefficient of vertices
    5.9 Log-log plot of the web graph community-size distribution
    5.10 A visual representation of the web graph mega-communities
    C.1 Facebook social widget code snippet
    C.2 Twitter social widget code snippet
    C.3 YouTube social widget code snippet
    C.4 Reddit social widget code snippet

  • List of Tables

    3.1 Content statistics of the 2012 web corpus
    4.1 Top 20 potential trackers
    4.2 Top trackers penetration ratio across Alexa top sites
    4.3 Tracking-source distribution
    4.4 Social-widget tracking summary
    4.5 Tracking penetration by gTLD
    4.6 Top trackers coverage over gTLDs
    4.7 Frequent item sets of top 20 trackers
    4.8 Top 20 trackers association rules
    5.1 Power-law fitting of tracked-web indegree and outdegree
    5.2 HyperANF results on the tracked-web
    5.3 Distance-related features for the web, Facebook and the tracked-web
    5.4 Calculating the small-world measure S for the tracked-web
    5.5 Point-biserial correlation between centrality measures and tracking
    5.6 Area under the curve (AUC) for different binary classifiers (centrality measures vs tracking)
    5.7 Tracking coefficients of the web graph neighborhoods
    A.1 Top 20 potential trackers employing scripts
    A.2 Top 20 potential trackers employing IFrames
    A.3 Top 20 potential trackers employing images
    A.4 Top 20 potential trackers employing links
    B.1 Tracking analysis by country code top level domain

  • Abbreviations

    ccTLD Country code top level domain

    DWH Data Warehouse

    GA Google Analytics

    gTLD Generic top level domain

    HDFS Hadoop Distributed File System

    PLD Pay-level-domain

    SCC Strongly connected component (of a graph)

    TLD Top level domain

    WCC Weakly connected component (of a graph)

    WDC Web Data Commons


  • Dedicated to my parents, for their love, endless support and encouragement.

  • Chapter 1

    Introduction and Literature Review

    1.1 Introduction

    1.1.1 What is web tracking?

    Web tracking commonly refers to the act of collecting subsets of a user's browsing data or browsing behavior over the internet. This practice has attracted a lot of attention over the past few years, especially after the social media boom and the increasing awareness of privacy issues among average internet users.

    There is no doubt that tracking is prevalent on the web today. Most of us who use search engines or e-commerce sites (e.g. Amazon) have seen the implications of web tracking (or simply "tracking", as we will refer to it in this document), at least in the form of targeted advertisements, especially when they are observed across sites; for example, coming across advertisements on one's social media profile for products previously viewed on a completely different e-commerce site.

    In our work, we use the term tracked-web to refer to the graph structure of web links formed by the tracking and tracked web entities. We aim to provide a better understanding of this subset of the web in terms of statistics about these entities, as well as by discovering local and global structural properties of the graph.

    1.1.2 The business empire of web tracking

    Before going into details, one first needs to understand the motivation behind this practice, what kind of web entities are behind it, and how they can actually do it.


    First-Party and Third-Party Tracking To begin with, we need to differentiate between what is called first-party and third-party tracking. The first kind refers to a website keeping track of its visitors' activities on its own site, either anonymously or by user profiles, in order to analyze customer behavior, enhance its service or even communicate the data to other entities for a profit. First-party tracking is very common on most major websites; however, it often raises serious concerns when it crosses the virtual world of the internet and includes real-world information like GPS track history, fingerprints and the like. Unfortunately, this type of tracking is beyond our scope of analysis since it is integrated into the website logic and can hardly be detected or analyzed offline.

    The other type of tracking, third-party tracking, refers to the practice by which an outside entity (the tracker), other than the directly visited website, tracks the user's visit to the site. For example, if a web user visits reuters.com, a third-party tracker like doubleclick.net - embedded by reuters.com to provide targeted advertising - can log the user's visit to reuters.com. For most types of third-party tracking, the tracker will be able to link the user's visit to reuters.com with the user's visits to other sites on which the tracker is also embedded, thus building what is called a browsing profile of that user. In this study we will only consider third-party tracking over the internet for our analysis because of its potential concern to users, who may be surprised that a party with which they may or may not have chosen to interact is recording their online behavior in unexpected ways.

    Tracking Services The web entities acting as third-party trackers are generally categorized into two broad groups: web traffic analytics and advertising-based services (we discuss a detailed categorization framework in the literature review, section 1.2). The first group of trackers usually provides its services to websites in return for paid premium or subscription plans; however, the most popular web-traffic analysis service [1], Google Analytics [2], can be used for free. In this case, Google is believed to generate indirect profit from the free analytics service by integrating the data it collects with its paid advertising service, Google AdWords [3].

    The other group of tracking services is the one directly concerned with online advertising. The advertising business has evolved since the birth of the internet, from email marketing campaigns to online display ads in the 1990s to the more complex landscape of search ads (see figure 1.1) that involves targeted advertising with automated bidding and connects a number of stakeholders: publishers who host the ads, advertisers who advertise their products/services, advertising agencies that help generate and place the ad copy, ad servers that technically deliver the ads, advertising affiliates who conduct promotional work for the advertisers, and potentially more players.

    Figure 1.1: Example of online advertising players 1.

    1 Figure taken from LUMA Partners: http://www.lumapartners.com/lumascapes/

    It is not hard to understand how the online advertising business had to become more sophisticated over the years when we consider that it is a multi-billion dollar industry. According to a study by PricewaterhouseCoopers (PwC) [4], online advertising generated a revenue of 49.5 billion USD in 2014 in the United States alone (see figure 1.2). Another recent study estimated the European ad market in 2012 at 24.3 billion EUR [5].

    1.1.3 Why should we study tracking?

    Despite the prevalence of web tracking and the resulting public and media outcry, primarily in the western world, there is a lack of clarity about how tracking works, how widespread the practice is, and the scope of the browsing profiles that trackers can collect about users. Thus, the effort of exploring and understanding the structure of the web from a tracking perspective, as we aim to do in this thesis, is important in shedding light on this part of the internet in order to:

    1. Design crawling and tracker detection algorithms.


    Figure 1.2: USA online advertisement market growth in USD billions 2

    2. Design protection techniques against trackers.

    3. Understand the coverage of some key trackers and their dominance over the internet, thus estimating their business value and market weight.

    4. Predict the evolution and spread of the tracking phenomenon.

    5. Predict the emergence of new phenomena in the tracking graph.

    2 Figure taken from the PwC Internet advertising report 2014 [4].


    1.2 Literature Overview

    1.2.1 Web tracking studies

    A number of studies have been conducted by researchers to understand, analyze and classify the web tracking phenomenon, and even to develop techniques to protect against it. The most prominent of these is the work by Roesner, Kohno, and Wetherall [6] in 2012. In their study, the authors presented an in-depth empirical investigation of third-party tracking in which they introduced a comprehensive classification framework for web tracking based on client-side observable behaviors. They also developed and evaluated a web browser plugin designed to thwart tracking originating from social media widgets (like the Facebook "Like" button) while still allowing the widgets to be used.

    The suggested framework is built from client-side methods for detecting and classifying five kinds of third-party trackers based on how they manipulate browser state. The five observed behaviors are:

    1. Third-Party Analytics:

    In order to analyze their traffic, websites usually embed a library (in the form of a script) provided by the analytics engine (e.g. Google Analytics). In the case of GA, the script sets a site-owned cookie (not tracker-owned) on the visitor's browser that contains a unique identifier. The script then transfers this identifier to google-analytics.com by making explicit requests containing information such as the operating system version, browser, geographic location, etc.

    Since the cookie set by the tracker was created in the context of the visited site (site-owned), the identifiers set by the tracker are different across sites. Thus, a single user will be associated with different identifiers on different sites, limiting the tracker's ability to create a cross-site browsing profile for that user. Figure 1.3 shows a case study as offered in the original work [6].

    2. Third-Party Advertising:

    This is tracking for the purpose of targeted advertising; an example of this type is Google's advertising network, DoubleClick [7].

    When a user visits a page, the tracker (advertiser) will choose an ad to display on that page as an image or an iframe. Thus, the cookie which contains the visitor's unique identifier is set as tracker-owned. As a result, the same unique identifier is associated with the user whenever he visits any site with the tracker's ads embedded in it. In this case, the tracker is able to build a cross-site browsing profile for each unique user. Figure 1.4 shows a case study as offered in the original work [6].

    Figure 1.3: Case Study: Third-Party Analytics.

    Websites commonly use third-party analytics engines like Google Analytics (GA) to track visitors. This process involves (1) the website embedding the GA script, which, after (2) loading in the user's browser, (3) sets a site-owned cookie. This cookie is (4) communicated back to GA along with other tracking information.

    Figure 1.4: Case Study: Third-Party Advertising.

    When a website (1) includes a third-party ad from an entity like Doubleclick, Doubleclick (2-3) sets a tracker-owned cookie on the user's browser. Subsequent requests to Doubleclick from any website will include that cookie, allowing it to track the user across those sites.

    3. Third-Party Advertising with Popups:

    Using popups to display ads gives the tracker the advantage of setting its own first-party cookie, allowing it to bypass some common third-party cookie blocking mechanisms embedded in some browsers or plugins. This kind of tracking is malicious since it puts the tracker in a first-party position without the user's consent. An example of these trackers is insightexpressai.com.

    4. Third-Party Advertising Networks:

    Trackers often cooperate, and it is insufficient to simply consider trackers in isolation. A website may embed one third-party tracker, which in turn serves as an aggregator for a number of other third-party trackers. Figure 1.5 shows a case study as offered in the original work [6].

    Figure 1.5: Case Study: Advertising Networks.

    As in the ordinary third-party advertising case, a website (1-2) embeds an ad from Admeld, which (3) sets a tracker-owned cookie. Admeld then (4) makes a request to another third-party advertiser, Turn, and passes its own tracker-owned cookie value and other tracking information to it. This allows Turn to track the user across sites on which Admeld makes this request, without needing to set its own tracker-owned state.

    5. Third-Party Social Widgets:

    Most social networking sites offer social widgets like the Facebook "Like" button, the Twitter "Tweet" button, the Google +1 button and others. These widgets can be included by other websites to allow users logged in to these social networking sites to like, tweet, or +1 the embedding web page. In the case of Facebook, it can set its tracker-owned cookie from a first-party position when the user voluntarily visits facebook.com; then, when the user visits another website that embeds the Facebook "Like" button, the requests made to facebook.com to render this button allow Facebook to track the user across sites just as Doubleclick can. Figure 1.6 shows a case study as offered in the original work [6].

    From the observed tracking behavior, the authors then formulated a framework for classifying trackers into five classes, where a single tracker may exhibit more than one of these behaviors:

    1. Behavior A (Analytics): The tracker serves as a third-party analytics engine

    for sites. It can only track users within sites.

    2. Behavior B (Vanilla): The tracker uses third-party storage that it can get and set only from a third-party position.


    3. Behavior C (Forced): The cross-site tracker forces users to visit its domain

    directly (e.g., popup, redirect), placing it in a first-party position.

    4. Behavior D (Referred): The tracker relies on a B, C, or E tracker to leak unique

    identifiers to it, rather than on its own client-side state, to track users across sites.

    5. Behavior E (Personal): The cross-site tracker is visited by the user directly in

    other contexts.

    In our study, and since we are working in an offline setting, we will be able to differentiate between third-party analytics, third-party advertising and third-party social widgets.

    Apart from Roesner et al. [6], a number of studies have empirically examined tracking on the web, most notably Krishnamurthy et al. [8]. In their paper, the authors presented a study in which they measured the coverage of third-party tracking on the web. However, unlike [6], they did not distinguish between different tracking behaviors.

    From a different perspective, the authors of [9] studied privacy-violating information flows on the web, where they found instances of cookie leaking as well as other privacy violations. However, they did not differentiate between third-party trackers and the visited sites themselves. Also, in his five-year study of modern web traffic, Ihm [10] found that 12% of the web requests in 2010 accounted for advertisements. He also found that Google Analytics tracked up to 40% of the pages in their dataset.

    Figure 1.6: Case Study: Social Widgets.

    Social sites like Facebook, which users visit directly in other circumstances (allowing them to (1) set a cookie identifying the user), expose social widgets such as the "Like" button. When another website embeds such a button, the request to Facebook to render the button (2-3) includes Facebook's tracker-owned cookie. This allows Facebook to track the user across any site that embeds such a button.


    As for the phenomenon of tracker collaboration, [8] and [11] analyzed the private data leakage from first-party websites to data aggregators that can, potentially, link user accounts across different sites. In another study, Jackson and Boneh [12] classified trackers based on the type of cooperation between the embedding site and the trackers, although they did not provide measurements on the prevalence of the tracker classes.

    Finally, in the past few years, there have been notable online discussions about tracking, such as [5], along with workshops on the topic such as the W3C Workshop on Web Tracking and User Privacy.

    1.2.2 Web graph studies

    Apart from the web tracking phenomenon itself, there are numerous studies that model the web as a graph in order to analyze its structure and observe interesting measurements and statistics about it. We find these kinds of efforts inspirational to our analysis of the tracked-web in terms of what questions to ask and which techniques to use to answer them.

    The most notable study covered by our literature search is the paper by Broder et al. [13]. In order to discover a set of local and global properties of the web graph, the authors conducted a set of experiments on web crawls made available by AltaVista, each with over 200 million pages and 1.5 billion links. They showed that the overall structure of the web is considerably more complicated than suggested by earlier experiments on a limited scale. Famously, they published a visual interpretation of their findings about the web structure which has become well known in later literature as the bow-tie structure of the web.

    The authors first report the in- and out-degree distributions of the web pages, confirming previous reports on power laws [14]. They then studied the directed and undirected connected components of the web, showing that power laws also arise in the distribution of the sizes of these connected components. They found that most (over 90%) of the approximately 203 million nodes in their crawl data form a single connected component if links are treated as undirected edges.

    This giant weakly connected web can be broken into four pieces, as shown in figure 1.7. The first of these is a central core, where every page can reach any other page in the same core by following directed links; this giant strongly connected component (SCC) is at the heart of the web. The second and third pieces are called IN and OUT. IN contains pages that cannot be reached from the SCC but can reach it; the authors claim that these might be new sites that people have not yet discovered and linked to. On the other hand, OUT contains pages that are pointed to from the SCC but cannot link back to it; the authors suggest that such a cluster represents corporate websites that contain only internal links.

    Finally, the TENDRILS consist of pages that are in total isolation from the SCC: they cannot reach the SCC and cannot be reached from it. Perhaps the most interesting fact they found is that all four sets are roughly the same size, with the SCC being relatively small; it comprises about 56 million pages, while each of the other three sets contains about 44 million pages. Finally, they measured the diameter of the central core (SCC) to be at least 28, and the diameter of the graph as a whole to be over 500.

    Figure 1.7: Bow-tie structure of the web

    One can pass from any node of IN through SCC to any node of OUT. Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC. It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE: i.e., a passage from a portion of IN to a portion of OUT without touching SCC. Diagram and description are taken from 3.

    3 Figure taken from Broder et al., "Graph structure in the Web" [13].


    A more detailed work about the sizes of the components of the bow-tie model was done by Serrano et al. [15]. By analyzing four crawls gathered between 2001 and 2004 by different crawlers with different parameters, they concluded that the properties of a web crawl are dependent on the crawling process.


    We can also find a number of studies about the web structure that use the same dataset as our thesis, the Common Crawl web corpus (see 3.1.1). In an early study, Kolias et al. [16] presented an initial exploratory analysis of the Common Crawl. Although they examined only a fraction of the dataset, they showed some initial interesting measurements and characteristics of the web corpus. They reported statistics on two levels of granularity, page and site level, such as the MIME type distribution of resources, the top 10 languages for page content, the distribution of page age, HTML versions, page degree distribution, pages per website, site language and site degree distribution.

    An in-depth comparison of the latest findings on the web structure with previous work was done by Meusel et al. [17]. They confirm the existence of a giant strongly connected component, but emphasize that it is strongly dependent on the crawling process. Their most important finding, however, is that the distributions of indegree, outdegree and sizes of strongly connected components are not power laws, which contradicts the findings throughout the literature up to now.

    From a different level of aggregation, Lehmberg et al. [18] published a number of similar findings on the web's characteristics and degree distribution, but at the pay-level-domain granularity, as opposed to the page-level analysis in prior work. Finally, a technical report that presents the main characteristics of the Common Crawl 2012 dataset can be found in [19].

    Apart from the Common Crawl web corpus, various other studies focused on the structure of national web domains, which consist of all websites that end with a specific country code or that are hosted at an IP address belonging to a segment assigned to a specific country. The works [20, 21] present findings on crawls made by different crawlers on the African and Chinese parts of the web. Along with its structure, other characteristics of the web are presented by Baeza-Yates et al. [22]. This work is basically a side-by-side comparison of the results of 12 studies focusing on web characteristics. Their results include various levels of detail (contents, links and technologies) dissected by national domains.

    As for the power-law distribution phenomenon, a number of observations have been made in various aspects of the web. The most relevant to our study is the distribution of degrees in the web graph. In this context, prior work [13, 23, 24] suggests that both the in- and the out-degrees of vertices in the web graph follow power laws. This collection of findings reveals the power-law distribution as a macroscopic phenomenon on the entire web, as well as a microscopic phenomenon at the level of single websites, and at intermediate levels between these two.
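
    To fix notation for the later chapters, a degree distribution is said to follow a power law when the probability of observing a vertex of degree $k$ decays polynomially in $k$; a common formulation (our notation, not quoted from the cited works) is

    $P(k) \propto k^{-\gamma}, \qquad \gamma > 1,$

    where $\gamma$ is the power-law exponent; the studies above report exponents roughly between 2 and 3 for web graph degree distributions.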

  • Chapter 2

    Objectives

    The aim of this study is to provide a deeper, quantitative understanding of the web-tracking phenomenon in terms of its spread and its relationship with the web structure. By doing so, we are also one step closer to designing better tracker detection and tracking protection techniques through understanding the structure of the tracked-web graph. That is in addition to measuring the coverage of key trackers, and thus helping to estimate their business value and market weight.

    To achieve that, we structure our exploratory analysis into a set of questions and hypotheses to be answered or validated. We summarize the high-level goals of the thesis as follows:

    - Extracting potential trackers from the Common Crawl web corpus based on specific HTML contexts and assumptions, followed by constructing an aggregated tracking graph at the pay-level-domain (PLD) level, that is, a graph structure showing which PLD is tracked by which service.

    - Computing statistical indicators on the tracking graph to measure the prevalence of tracking on the web.

    - Computing descriptive, structural properties of the tracked-web, which is the subset of the aggregated PLD web graph that includes only the trackers and tracked hosts.

    - Examining how some structural properties of the web affect the spread of tracking over the internet.


    We can then expand these high-level goals into a number of discrete questions and hypotheses as follows:

    1. To what degree is the web being tracked? And how many potential trackers can we extract from the web corpus?

    2. Who are the top 20 trackers? What are their coverage, their business, and the HTML contexts in which they are usually embedded?

    3. What is the percentage of tracked websites (i.e. the tracking penetration) within the subset of most popular domains based on the Alexa ranking?

    4. How often do trackers appear in each HTML context (i.e. scripts, images, iframes and links)?

    5. What is the decomposition of trackers across traffic analytics, ad networks and social widgets?

    6. What is the tracking penetration per country?

    7. What is the tracking penetration by generic top-level domain (i.e. .com, .net, .org, etc.)?

    8. Are there sets of trackers that usually appear together in one PLD?

    9. What is the degree distribution of the tracked-web? Does it follow a power law?

    10. What are the effective diameter, average distance and spid 1 of the tracked-web?

    11. Does the tracked-web exhibit the small-world phenomenon?

    12. How big are the largest weakly connected component and strongly connected component of the tracked-web? Do the WCC and SCC size distributions follow a power law?

    13. Can we support the hypothesis that domains with higher centrality measures are more likely to be tracked?

    14. Can we support the hypothesis that the web is clustered into communities/neighborhoods that are either "safe" (i.e. with no tracked PLDs) or "completely tracked" (i.e. all PLDs are tracked)?

    1 spid: shortest-path index of dispersion

  • Chapter 3

    Methodology

    In order to answer the questions in the scope of our study (see chapter 2), we conduct a series of experiments using the publicly available datasets and tools presented in this chapter.

    3.1 Datasets

    3.1.1 Common Crawl web corpus

    The Common Crawl project [25] is a non-profit organization dedicated to providing a copy of the internet to researchers, companies and individuals at no cost for the purpose of research and analysis. Their goal is to democratize the data so that everyone, not just big companies, can do high-quality research and analysis.

    Common Crawl Uses The possibilities are endless, but people have used the data

    to improve language translation software, predict trends, track disease propagation and

    much more [26].

    A number of interesting papers and projects based on the Common Crawl data have been made available in the past couple of years, one of which is the Web Data Commons project that we are also utilizing in this thesis (see 3.1.2). Also, the popular SwiftKey keyboard app for mobile devices is reported to use the web corpus to enrich its functionality [27]. A number of published studies about the web are also reported to be using datasets from Common Crawl, as mentioned in the literature overview chapter [16-19].


    Having said that, our main use of the corpus is to analyze the HTML code of individual pages in order to extract the potential tracking services from it and construct a tracking graph for further analysis. The tracking graph is an edge file of the form (tracking service, tracked site) that will be used along with the hyperlink graph (provided by Web Data Commons) to build a property graph that covers both web links and tracking relationships.

    Data set choice The Common Crawl corpus contains petabytes of data collected over the last 7 years, comprising raw web page data, extracted metadata and text extractions. The dataset lives on Amazon cloud storage S3 [28] as part of the Amazon Public Datasets program [29]. The datasets represent multiple crawls from different years that also employ different crawling algorithms. In our study, we are using the web corpus that was released in August 2012. The reason behind this selection is twofold:

    1. Since we are also matching the web corpus with its hyperlink graph representation offered by the Web Data Commons project, we are limited to the 2012 and 2014 corpora offered by that project. The two corpora were crawled using different techniques. The 2012 corpus was gathered using a web crawler employing a breadth-first-search selection strategy and embedding link discovery while crawling. Also, the crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl project. That is opposed to the 2014 crawl, which employed a modified Apache Nutch crawler [30] to download pages from a large but fixed seed list. The 2014 crawler was restricted to URLs contained in this list and did not extract additional URLs from links in the crawled pages. The seed list contained around 6 billion URLs and was provided by the search engine company blekko [31].

    2. The Web Data Commons foundation recommends using the 2012 graph over the 2014 graph for the analysis of the connectivity of web pages or the overall analysis of the web graph, as a BFS-based selection strategy including URL discovery while crawling will more likely result in a realistic sample of the web graph [32].

    2012 Web Corpus Status The corpus consists of approximately 3.8 billion documents occupying over 100 terabytes of data. Table 3.1 contains a summary of the corpus contents [33].


    Table 3.1: Content statistics of the 2012 web corpus

    Content Type    Number (in millions)
    Domains         61
    PDF             92
    Word            6.5
    Excel           1.3

    As the thesis runs in parallel with the Track the Trackers project (see 3.3.1), which is responsible for extracting trackers, and due to timeline and budget constraints, we were able to run the extraction job on 25% of the 2012 corpus, which is roughly 23 terabytes of raw data. This means that we will be working with a 25% random sample of the web crawl, which we consider representative for our analysis.

    3.1.2 Web Data Commons hyper-link graph

    The Web Data Commons project [34] was started by researchers from Freie Universität Berlin and the Karlsruhe Institute of Technology (KIT) in 2012. The goal of the project is to facilitate research and support companies in exploiting the wealth of information on the web by extracting structured data from web crawls, mainly from the Common Crawl project, and providing this data for public download. Today the WDC project is mainly maintained by the Data and Web Science Research Group at the University of Mannheim.

    Web Data Commons uses The project offers three types of data:

    1. RDFa, Microdata, and Microformats: structured data describing products, people, organizations, places, and events, embedded into HTML pages using markup standards such as RDFa, Microdata and Microformats.

    2. Web Tables: a fraction of the HTML tables found on the web are quasi-relational, meaning that they contain structured data describing a set of entities, and are thus useful in application contexts such as data search, table augmentation, knowledge base construction, and various NLP tasks.

    3. Hyperlink Graphs: large hyperlink graphs that WDC extracts from the Common Crawl corpora. These graphs can help researchers to improve search algorithms, develop spam detection methods and evaluate graph analysis algorithms.


    Data Set choice In our analysis, we work with the 2012 hyperlink graph. The reason for choosing the 2012 over the 2014 version lies in the crawling techniques used by Common Crawl, as explained in the previous section 3.1.1. WDC provides the graph at three levels of granularity/aggregation: page level, host level and pay-level-domain (PLD) level, the latter of which we use in this thesis. A PLD can be considered as the root sub-domain for which users/organizations usually pay when registering a URL. PLDs allow us to identify a realm where a single user or organization is likely to be in control. For example, the two research groups dima.tu-berlin.de and ida.tu-berlin.de have the same parent PLD tu-berlin.de. The pay-level-domain web graph consists of approximately 43 million nodes and 623 million arcs.
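
    As an aside, the following minimal Scala sketch (our own illustration, not code from the thesis pipeline) shows how the pay-level-domain of a host name can be derived using Guava's public-suffix support; the host names are hypothetical examples.

```scala
// Minimal PLD sketch (illustration only): map a host name to its pay-level-domain
// using Guava's public suffix list support.
import com.google.common.net.InternetDomainName

object PldSketch {
  // e.g. "dima.tu-berlin.de" -> Some("tu-berlin.de")
  def payLevelDomain(host: String): Option[String] = {
    val name = InternetDomainName.from(host)
    if (name.isUnderPublicSuffix) Some(name.topPrivateDomain.toString) else None
  }

  def main(args: Array[String]): Unit =
    Seq("dima.tu-berlin.de", "ida.tu-berlin.de", "www.google-analytics.com")
      .foreach(h => println(s"$h -> ${payLevelDomain(h).getOrElse("n/a")}"))
}
```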

    3.1.3 The Common Crawl WWW ranking

    The project [35] is provided by the Laboratory for Web Algorithmics of the Università degli Studi di Milano and by the Data and Web Science Group of the University of Mannheim. They parse the Common Crawl corpus to generate a web graph, from which they compute a set of rankings (centrality measures) for each node in the graph. We mainly use their PageRank and harmonic centrality datasets in one of our experiments.

    3.1.4 Alexa top sites

    As part of our tracking-penetration analysis, we use a dataset [36] containing a list of the top 1 million websites by traffic, made available by Alexa Analytics [37].

    3.2 Data Processing Platforms

    3.2.1 Apache Hadoop

    The Apache Hadoop [38] project develops open-source software for reliable, scalable, distributed computing. Its software library is a framework that allows for the distributed processing of large structured and unstructured datasets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. For more details about Hadoop internals, one can refer to [38].


    In our study, we use the Hadoop Distributed File System (HDFS) to store the large datasets in order to make them available for processing in a distributed environment, as well as the Hadoop MapReduce framework for the actual parallel data processing, especially for extracting trackers from the web corpus.

    HDFS is a file system that provides reliable data storage and access across all the

    nodes in a Hadoop cluster. It links together the file systems on many local nodes to

    create a single file system.

    MapReduce is the heart of Hadoop. It is a programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The term MapReduce actually refers to two distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of raw input data and transforms it into an intermediate set of data represented as key/value pairs. The reduce job operates on these intermediate key/value tuples and combines (aggregates) them into a smaller set of tuples. As the order of the name MapReduce implies, the reduce job is always performed after the map job.
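
    To make the paradigm concrete, the following Scala sketch illustrates map/reduce semantics on plain local collections; it is a conceptual illustration only, not the Hadoop API and not the actual extraction job, and the sample records and per-tracker counting are assumptions for the example.

```scala
// Conceptual map/reduce illustration on local collections (not Hadoop itself).
object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical input records: (page URL, third-party host found on the page).
    val records = Seq(
      ("http://example.com/index.html", "google-analytics.com"),
      ("http://example.com/news.html",  "doubleclick.net"),
      ("http://another.org/home.html",  "google-analytics.com")
    )

    // Map phase: transform each record into an intermediate (key, value) pair.
    val mapped: Seq[(String, Int)] =
      records.map { case (_, trackerHost) => (trackerHost, 1) }

    // Reduce phase: group the intermediate pairs by key and aggregate the values.
    val reduced: Map[String, Int] =
      mapped.groupBy(_._1).map { case (tracker, pairs) => (tracker, pairs.map(_._2).sum) }

    reduced.foreach { case (tracker, count) => println(s"$tracker\t$count") }
  }
}
```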


    3.2.2 Apache Spark

    Spark [39] is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified big data applications combining batch, streaming, and interactive analytics on a variety of data input types. It was originally developed in 2009 in UC Berkeley's AMPLab, and open sourced in 2010 as an Apache project.

    Spark's main data primitive is the Resilient Distributed Dataset (RDD) [40], which enables fast in-memory data processing over a distributed environment. Apache Spark comes prepackaged with libraries for different big data tasks such as structured data manipulation (Spark SQL), machine learning (MLlib), data streaming (Spark Streaming) and graph processing (GraphX). For more details about Apache Spark internals, one can refer to [39].

    In this thesis, we mainly use Spark version 1.3.1 and its GraphX library [41] for analyzing the tracking graph. At a high level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph, a directed multigraph 1 with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators, such as subgraph and joins, as well as an optimized variant of the Pregel API [42]. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
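
    As an illustration of how such a property graph can be expressed in GraphX, the sketch below is our own toy example (not the thesis code); the vertex ids, domain names and edge labels are assumptions. It builds a small tracker-to-PLD graph and counts how many PLDs each tracker vertex reaches.

```scala
// Toy GraphX sketch: a tiny "tracker -> tracked PLD" property graph.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object TrackingGraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tracking-graph").setMaster("local[*]"))

    // Vertices: (id, domain name); ids are arbitrary but unique.
    val vertices = sc.parallelize(Seq(
      (1L, "google-analytics.com"), (2L, "doubleclick.net"),
      (3L, "example.com"), (4L, "another.org")
    ))

    // Directed edges: tracker -> tracked PLD, labelled with the HTML context.
    val edges = sc.parallelize(Seq(
      Edge(1L, 3L, "script"), Edge(1L, 4L, "script"), Edge(2L, 3L, "iframe")
    ))

    val graph: Graph[String, String] = Graph(vertices, edges)

    // A tracker's out-degree is the number of PLDs it potentially tracks.
    graph.outDegrees.join(vertices).collect().foreach {
      case (_, (degree, domain)) => println(s"$domain tracks $degree PLD(s)")
    }

    sc.stop()
  }
}
```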

    3.2.3 Apache Flink

    Flink [43] is an open source platform for scalable batch and stream data processing that started at TU Berlin under the name Stratosphere and is now a top-level Apache project. Similar to Spark, it provides out-of-the-box libraries for batch and stream processing, machine learning, a SQL-like interface and graph processing. However, Flink provides an internal optimizer similar to those found in relational databases; besides, it is optimized for cyclic or iterative processes by using iterative transformations on data collections. This is achieved by an optimization of join algorithms, operator chaining and the reuse of partitioning and sorting. For more details about Apache Flink internals, one can refer to [43].

    We use Flink version 0.9 to conduct a number of experiments in our study, using its Pregel-like graph processing framework Spargel through its higher-level API Gelly.

    1 A multigraph is a graph which is permitted to have multiple edges (also called parallel edges), that is, edges that have the same end nodes. Thus two vertices may be connected by more than one edge.


    3.2.4 R

    R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R is available as free software under the terms of the Free Software Foundation's GNU General Public License in source code form.

    R provides a wide variety of statistical and graphical techniques, and is highly extensible. One of R's strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where needed.

    After running our experiments on the large-scale datasets, we often produce intermediate aggregations and metrics (e.g. vertex-wise metrics of a graph) and then process these results using R to obtain the final statistics and/or plots.

    3.2.5 MS SQL Server BI Stack

    SQL Server [44] is the Microsoft product line for relational databases. On top of the core database engine, SQL Server provides solutions for data integration (ETL), OLAP cubes and reporting through SQL Server Integration Services (SSIS), Analysis Services (SSAS) and Reporting Services (SSRS), respectively.

    We are using a free student version of SQL Server 2012, obtained through the Microsoft DreamSpark program 2, to develop a data warehouse that stores a multidimensional model of the tracking graph obtained from the Common Crawl web corpus, and to build an OLAP cube on top of it that facilitates parts of our analysis in chapter 4.

    2 DreamSpark is a Microsoft program that supports technical education by providing access to Microsoft software for learning, teaching and research purposes. https://www.dreamspark.com/

    3.2.6 WebGraph Framework

    WebGraph [45] is an open source framework, under the GNU General Public License, for graph compression aimed at studying web graphs, developed in Java. It provides simple ways to manage very large graphs, exploiting modern compression techniques [46]. More precisely, it is made up of the following:


    - A set of flat codes, called ζ codes, which are particularly suitable for storing web graphs.

    - Algorithms for compressing web graphs that exploit gap compression, referentiation, intervalisation and ζ codes to provide a high compression ratio.

    - Algorithms for lazily accessing a compressed graph without actually decompressing it until necessary.

    - Algorithms for analysing very large graphs, such as estimating neighborhood functions, detecting strongly connected components, etc.

    - A sample of publicly available very large datasets reaching more than 1 billion links.

    We mainly use the WebGraph framework in chapter 5 to estimate the neighborhood function of the tracked-web using the HyperANF algorithm, and to extract a number of distance-related measures from it. For more details about the WebGraph framework and its algorithms, one can refer to [45, 46].

    3.2.7 FlashGraph Framework

    FlashGraph [47, 48] is a semi-external-memory graph processing engine, optimized for a high-speed SSD array but also able to run on hard disk drives (HDD). FlashGraph provides a flexible programming interface to help users implement graph algorithms, along with a number of ready-to-use, common graph algorithms that can scale to very large graphs on commodity machines within an acceptable run-time.

    We mainly use FlashGraph in chapter 5 for triangle counting, as the algorithms in Spark and Flink did not scale well with our graphs.

    3.3 Data Preparation

    3.3.1 Trackers extraction

    In order to extract potential tracking services from the Common Crawl web corpus, we utilize an ongoing project initiated and developed at TU Berlin by Sebastian Schelter, with contributions from other developers including the author of this thesis. The project is named Track the Trackers [49] and is open sourced on GitHub.


Track the Trackers uses Hadoop MapReduce to process the input web corpus (unstructured data) stored in the ARC file format [50] and parses each HTML page along with its resources into an intermediate serializable structured format using Google Protocol Buffers (Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data: like XML, but smaller, faster and simpler; one defines the data structure once and can then use generated source code to easily write and read the structured data to and from a variety of data streams, using a variety of languages such as Java, C++ or Python). These intermediate structures are then read again by another MapReduce job to extract potential trackers and to build the tracking graph. Figure 3.1 provides a high-level overview of the code for extracting trackers and constructing the tracking graph.

The tracking-graph job marks an HTML page resource (i.e. scripts, images, links, etc.) as suspicious if its source (i.e. its HTML source attribute) points to a different domain than that of the page itself (i.e. a third-party domain). The rationale behind this relies on the four types of HTML resources we are interested in:

1. Scripts: Most of the third-party analytics trackers (see 1.2.1 for the classification model of tracking services) use a code snippet with the source attribute linked to their analysis engine.

    2. IFrames: Most third-party advertisers use HTML IFrames to host their ads. In

    most cases the source attribute of the IFrame is linked to the advertiser.

3. Images: A number of trackers, such as Google's DoubleClick, use a technique called tracking pixels: an img tag whose source is generally a third-party domain. The browser sees the img tag and issues a request from the user's browser to the server (as directed by the URL in the HTML source attribute). On the image request, the browser passes the user's domain-specific cookie ID just as it would with any HTTP request; this ID can identify and track the user. The server then responds with a transparent 1x1 GIF image, which should not be visible to the end user.

4. Links: The same logic as with images can also be applied to any external resource requested by a page from a third-party domain. These kinds of cross-domain requests can be achieved by an HTML link tag. This is different from the HTML anchor (href) element that represents a clickable hyperlink.

In case a resource is marked as suspicious, a new tuple is added to the tracking graph, representing an edge between the source URL of this resource (the potential tracker) and the tracked pay-level-domain of that page. In this case we assume generality for the sake of a high-level analysis: if one or more pages in a website are tracked, we consider the website as tracked by the sum of all trackers found within its individual pages.
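The third-party test itself boils down to comparing pay-level-domains. A simplified Python sketch of that check is shown below; it approximates the PLD by the last two host labels, whereas the actual job resolves PLDs properly, and the URLs used here are purely illustrative:

    from urllib.parse import urlparse

    def pld(url):
        """Crude pay-level-domain approximation: last two labels of the host name.
        (The real pipeline uses a proper PLD resolution, not this shortcut.)"""
        host = urlparse(url).netloc.lower()
        return ".".join(host.split(".")[-2:])

    def tracking_edges(page_url, resource_urls):
        """Emit (tracker PLD, tracked PLD) pairs for resources hosted on a third party."""
        page_pld = pld(page_url)
        for src in resource_urls:
            src_pld = pld(src)
            if src_pld != page_pld:            # suspicious: third-party resource
                yield (src_pld, page_pld)

    # Hypothetical page with one first-party and one third-party script
    page = "http://blog.example.com/post.html"
    resources = ["http://static.example.com/app.js",
                 "http://www.google-analytics.com/ga.js"]
    print(list(tracking_edges(page, resources)))
    # -> [('google-analytics.com', 'example.com')]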

    3.4 Environment

    3.4.1 Amazon EC2

As the 2012 Common Crawl corpus resides on Amazon S3, we need to process it using Amazon Elastic MapReduce to extract the intermediate files that contain the parsed resources of each page, from which we can construct the tracking graph (see 3.3.1). This extraction job has been supported by an AWS in Education Research Grant award (http://aws.amazon.com/grants/) obtained by Sebastian Schelter.

    3.4.2 DIMA IBM Power Cluster

For running Spark and Flink distributed jobs for analyzing large graphs (i.e. the tracking graph, the tracked-web and the PLD web graph) we use the IBM Power Cluster offered by IBM to the DIMA research group at TU-Berlin. The cluster consists of 10 nodes, each with 48 cores and 60 GB of RAM, and a total disk space of 1.8 TB that is mainly used for HDFS.


    Figure 3.1: Pseudocode of the main routines in extracting trackers

For simplicity, the pseudocode omits details about keeping the HTML tag in the tracking graph. In reality, an entry of the tracking graph consists of (trackerID, trackedID, isScript, isIFrame, isImage, isLink).

Phase 1. Parsing HTML resources
input: set of ARC files containing web corpus pages
output: set of Parquet files with parsed pages

    function processArcFile(ArcFile)
        for each page in ArcFile do
            if page.type is HTML then
                parsedPage := empty
                parsedPage.javascripts := parse(page, resources.javascript)
                parsedPage.iframes := parse(page, resources.iframe)
                parsedPage.images := parse(page, resources.image)
                parsedPage.links := parse(page, resources.links)
                parsedPage.saveAsParquetFormat()
            end if
        end for
    end function

Phase 2. Construct the tracking graph
input: set of Parquet files with parsed pages
output: tracking graph

    function map(ParsedPage)
        thirdPartyResources := List.empty
        for each script in ParsedPage.javascripts do
            if script.src != ParsedPage.src then
                thirdPartyResources.add(script.src)
            end if
        end for
        for each iframe in ParsedPage.iframes do
            if iframe.src != ParsedPage.src then
                thirdPartyResources.add(iframe.src)
            end if
        end for
        // ... fill thirdPartyResources by doing the same for images and links
        for each tracker in thirdPartyResources do
            if tracker.PLD != ParsedPage.PLD then
                emit(tracker.PLD, ParsedPage.PLD)
            end if
        end for
    end function

    function reduce(tracker, List: trackedPLDs)
        trackedHosts := trackedPLDs.distinct
        for each trackedHost in trackedHosts do
            saveToTrackingGraph(tracker, trackedHost)
        end for
    end function

Chapter 4

Analysis I: Statistical Properties

In this chapter we focus on presenting and analyzing a number of statistical measurements about the tracking services and tracked websites. First we will investigate the top trackers, their general coverage, and the tracking penetration within the most popular websites. Then we will drill into the contexts where trackers are observed and their classes, as well as analyze the tracked hosts' domain extensions. Finally, we will investigate the relationships between the top trackers and whether there are significant associations between their occurrences in a given PLD (as a reminder, a pay-level-domain (PLD) is the main part of a URL that identifies a parent organization/domain; for example, the two research groups dima.tu-berlin.de and ida.tu-berlin.de have the same parent PLD tu-berlin.de).

To do so, we mainly use Hadoop to extract tracking services and construct the tracking graph (see 3.3.1) from the raw web corpus on the Amazon cloud, along with Spark and Flink for analytical jobs that compute different metrics and aggregations of the graph on the university cluster (see 3.4.2), and finally analyze these intermediate metrics locally using R to obtain the final statistics and indicators. Also, we designed a data warehouse and developed an OLAP cube on top of it using the Microsoft SQL Server 2012 BI stack. The straightforward data warehouse contains two main dimensions, Tracked PLD and Tracking Service, along with one narrow but lengthy fact table that contains the tracking graph as an edge list with a number of Boolean columns needed for the analysis. The cube helps in mapping the relational model of the DWH into a multidimensional one that can benefit from MDX queries for more convenient data analysis when it comes to drilling and slicing the data.
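As a rough sketch of how such a fact table can be sliced (the warehouse itself is built with SSIS/SSAS and queried with MDX; the pandas snippet below only mimics the idea on toy rows whose columns follow the entry layout noted in figure 3.1):

    import pandas as pd

    # Toy version of the fact table: one row per (tracker, tracked PLD) edge with
    # Boolean context flags, mirroring (trackerID, trackedID, isScript, isIFrame, isImage, isLink).
    fact = pd.DataFrame([
        ("google-analytics.com", "example.org", True,  False, False, False),
        ("facebook.com",         "example.org", False, True,  False, False),
        ("google-analytics.com", "example.net", True,  False, True,  False),
    ], columns=["tracker", "tracked_pld", "is_script", "is_iframe", "is_image", "is_link"])

    # The kind of slice the OLAP cube answers: unique tracked PLDs per tracker ...
    print(fact.groupby("tracker")["tracked_pld"].nunique())

    # ... and the share of tracked PLDs reached through each HTML context.
    total_plds = fact["tracked_pld"].nunique()
    for flag in ["is_script", "is_iframe", "is_image", "is_link"]:
        share = fact.loc[fact[flag], "tracked_pld"].nunique() / total_plds
        print(flag, round(share, 2))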


    4.1 Trackers Coverage

    Our first investigation is to determine the top tracking services and analyze their coverage

    over the web.

    First we ran the Hadoop job to extract the tracking graph from the Common Crawl

    web corpus (see 3.3.1). We were able to process a sample of 25% of raw data from the

    full corpus. However, after analyzing the processed output we found that this sample

    accounts for 35% of the individual pages and 75% of the pay-level-domains in the full

    corpus. That is based on our generalization assumption where we tag a pay-level-domain

    as potentially tracked if at least one of its pages is tracked. We believe that this high

    level of PLD coverage in the sample is due to the long-tail distribution of the number of

web pages within websites, as observed in [13, 23, 24].

We were able to extract roughly 100 million tracking entries (i.e. tracker X tracks pay-level-domain Y). After that, we ran an analytical Flink job to count the number of tracked PLDs per unique potential tracker. Based on the tracker-extraction assumptions we explained in 3.3.1, we extracted approximately 27 million potential trackers. This figure raised some doubts about the assumptions we made while detecting trackers. However, after further analysis of the tracking-count distribution (i.e. the number of tracked sites per tracker), we observed two interesting facts:

• 82% of these potential trackers have a tracking count of only 1.

• 99.9% of them have a tracking count of less than 1,000 hosts.

Based on the first finding, we consider any tracker that occurs only once (i.e. tracking only one PLD) as noise in the extraction process, since no actual tracking service would be visible in only one host. Based on that, we define the new term effective-tracker, that is, a tracking service that is detected to track more than one PLD. There are approximately 4.8 million effective trackers within our dataset. For the second finding, we hypothesize that the number of tracked sites per tracker follows a power-law distribution; however, this needs further empirical examination.
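One simple way such an empirical examination could begin (a sketch only; the tracking counts below are hypothetical stand-ins for the per-tracker counts produced by the Flink job) is to inspect the complementary cumulative distribution of the tracking counts on log-log axes, where a power law appears approximately as a straight line:

    import math
    from collections import Counter

    # tracking_counts: number of tracked PLDs per potential tracker (hypothetical values).
    tracking_counts = [1, 1, 1, 2, 2, 3, 5, 8, 40, 1000]

    # Complementary CDF: P(X >= x) for each distinct count x
    n = len(tracking_counts)
    freq = Counter(tracking_counts)
    ccdf = []
    seen = 0
    for x in sorted(freq):
        ccdf.append((x, (n - seen) / n))
        seen += freq[x]

    # On log-log axes a power law is roughly linear; here we just print the points.
    for x, p in ccdf:
        print(f"log10(x)={math.log10(x):.2f}  log10(P(X>=x))={math.log10(p):.2f}")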

As illustrated in figure 4.1, we found that at least 60% of the PLDs in the sample are potentially tracked under our previously mentioned assumptions. For those 19 million PLDs (constituting the 60%) we detected the top 20 trackers (see table 4.1), based on the number of unique PLDs spanned by each of them. One can notice that Google-related services have the highest share of tracking. However, the figures can't be aggregated, since one tracked PLD can be tracked by multiple services.

Figure 4.1: Tracking detection summary

The figure shows statistics about the sample taken from the full web corpus residing on Amazon S3. The processing of the raw data to extract pages and resources is done on Amazon Elastic MapReduce, and finally the construction of the tracking graph and its analysis is performed on the TU-Berlin DIMA cluster.
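To make that non-additivity concrete: summing the per-tracker PLD counts would double-count PLDs tracked by several services, so overall coverage has to be computed over the union of tracked PLDs, as in this minimal sketch (toy edges only):

    # Toy tracking graph: (tracker, tracked PLD) edges; values are illustrative only.
    edges = [
        ("google-analytics.com", "a.com"),
        ("google-analytics.com", "b.com"),
        ("facebook.com",         "a.com"),   # a.com is tracked by two services
    ]

    per_tracker = {}
    for tracker, pld in edges:
        per_tracker.setdefault(tracker, set()).add(pld)

    print(sum(len(plds) for plds in per_tracker.values()))   # 3 -- double-counts a.com
    print(len({pld for _, pld in edges}))                    # 2 -- distinct tracked PLDs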

    To better understand the nature of these trackers, we investigated further to find out

    the following:

• googlesyndication.com: a domain owned by Google that is used for storing and loading ad content and other resources relating to ads for Google AdSense and DoubleClick from the Google content delivery network.

• ajax.googleapis.com: the AJAX Libraries API is Google's content distribution network and loading architecture for the most popular open-source JavaScript libraries such as jQuery, AngularJS, Dojo, etc.

• The difference between the well-known facebook.com and facebook.net is that the latter is Facebook's API endpoint that supports social widgets and other applications, while the former is usually found in iframe and image contexts (table 4.1), which we postulate is due to its usage in hosting Facebook media content (videos and pictures).


    Table 4.1: Top 20 potential trackers

The table also shows the HTML context in which the tracker was detected. An important remark while interpreting the figures below is that the context percentages don't have to add up to 100% for each tracker, since the same tracker can be detected in different contexts within the same PLD.

Tracker                  Frequency   % of Tracked PLDs   % of All PLDs   Script%   IFrame%   Image%   Link%
google-analytics.com     8,183,519   42%                 25%             100%      0%        0%       0%
googlesyndication.com    2,953,807   15%                 9%              99%       0%        1%       0%
google.com               2,206,582   11%                 7%              78%       16%       15%      7%
ajax.googleapis.com      1,470,524   8%                  5%              99%       0%        0%       6%
facebook.com             1,315,966   7%                  4%              17%       77%       12%      0%
macromedia.com           1,290,750   7%                  4%              100%      0%        0%       0%
adobe.com                983,536     5%                  3%              56%       0%        47%      0%
facebook.net             858,533     4%                  3%              100%      0%        0%       0%
casalemedia.com          832,215     4%                  3%              100%      0%        0%       0%
youtube.com              780,471     4%                  2%              15%       83%       9%       1%
twitter.com              753,311     4%                  2%              92%       10%       1%       1%
addthis.com              741,610     4%                  2%              97%       0%        34%      0%
imgaft.com               607,701     3%                  2%              99%       0%        100%     0%
godaddy.com              566,565     3%                  2%              99%       1%        3%       0%
gravatar.com             545,740     3%                  2%              30%       0%        82%      7%
gmpg.org                 516,165     3%                  2%              0%        0%        0%       100%
statcounter.com          507,867     3%                  2%              96%       0%        95%      0%
dsnextgen.com            399,400     2%                  1%              98%       2%        0%       0%
wordpress.com            384,114     2%                  1%              81%       0%        37%      16%
yahoo.com                367,155     2%                  1%              27%       2%        78%      0%

• casalemedia.com: a Canadian online media and technology company. They build online advertising technology for web publishers and advertisers.

• imgaft.com: we could not find extensive information about this domain and its siblings ak2.imgaft and ak3.imgaft. The only lead we found is that it is registered to GoDaddy. We suspect it is being used in the parked-domain advertising scheme that GoDaddy provides for its users: in some cases, when a user is reserving a domain until his website is created, or even in order to sell it in the future, the domain can be parked and a temporary landing page with targeted advertising is shown by GoDaddy to the domain's visitors, in return for a percentage of the ad revenues paid to the parked-host owner. However, we couldn't technically validate this hypothesis.

• gravatar.com: an online service that provides users with images (avatars) that follow them from site to site, appearing beside their name when they do things like comment or post on a blog. Avatars help in identifying their users' posts across blogs and web forums. We believe that it made it to the top 20 list since it is included by default in every WordPress.com account, and WordPress.com has more than 6 million pages in the Common Crawl corpus sample we are using.

• dsnextgen.com: we could not find much information about this domain, but we did find a number of threads describing it as malware and people reporting their websites as hacked by it.

• statcounter.com: a free web tracker, embedded by websites as a hit counter and to provide detailed real-time web traffic information.

    4.2 Top Sites Tracking

For this question, we analyze the magnitude of the tracking phenomenon from a different perspective. Apart from the general statistics about the entire web corpus observed so far, we focus on quantifying the trackers' penetration over a key subset of the internet, namely the most popular sites of the web. To achieve that, we use the publicly available dataset mentioned in 3.1.4 from Alexa Analytics, containing a list of the top 1 million websites based on traffic.

Interestingly, we find that the tracking penetration increases as we go up the list of top sites. The tracking penetration starts at 48% within the top 1 million PLDs and increases gradually to reach a high of 82% within the top 1,000 PLDs, as shown in figure 4.2.

    Figure 4.2: Alexa top sites tracking penetration

Furthermore, we noticed that this pattern (increasing tracking penetration with a decreasing subset of top sites) is also visible at the tracker level. Table 4.2 shows that the top 10 trackers are the same for each subset, with their tracking penetration following the same order and an increasing trend across subsets, except for one tracker, doubleclick.net, which only appears within the top 1,000 sites, replacing addthis.com.
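The per-subset penetration figures behind figure 4.2 and table 4.2 amount to the following computation (a sketch with hypothetical PLDs; the real inputs are the Alexa ranking from 3.1.4 and the tracked-PLD set from 3.3.1):

    # alexa_ranked: PLDs ordered by Alexa rank (best first); tracked: set of tracked PLDs.
    # Both are hypothetical stand-ins for the real datasets.
    alexa_ranked = ["a.com", "b.com", "c.com", "d.com", "e.com", "f.com"]
    tracked = {"a.com", "b.com", "d.com"}

    for top_n in (2, 4, 6):
        subset = alexa_ranked[:top_n]
        penetration = sum(1 for pld in subset if pld in tracked) / top_n
        print(f"top {top_n}: {penetration:.0%}")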


    Table 4.2: Top trackers penetration ratio across Alexa top sites

Tracker                  Top 1000K   Top 500K   Top 100K   Top 10K   Top 1K
google-analytics.com     0.34        0.38       0.47       0.62      0.71
google.com               0.16        0.20       0.29       0.45      0.60
facebook.com             0.11        0.14       0.21       0.37      0.48
ajax.googleapis.com      0.09        0.11       0.16       0.28      0.40
googlesyndication.com    0.09        0.11       0.15       0.22      0.30
facebook.net             0.08        0.11       0.17       0.30      0.37
twitter.com              0.08        0.10       0.17       0.32      0.44
youtube.com              0.07        0.09       0.14       0.25      0.40
addthis.com              0.07        0.08       0.12       0.20      -
macromedia.com           0.05        0.06       0.11       0.21      0.33
doubleclick.net          -           -          -          -         0.34

    4.3 Tracking Classification

    Our third question focuses on the tracking types. We discussed in the literature

    overview a proposed classification framework for tracking behavior, out of which we can

    distinguish between 3rd party web analytics, advertisers and social widgets (see 1.2.1).

To classify trackers we first need to analyze the contexts where the potential tracker is detected. As explained in 3.3.1, a 3rd-party tracker can be detected in the HTML source attribute of scripts, iframes, images and links. Figure 4.3 shows the ratio of

    trackers detected at each HTML source, compared to the number of unique trackers, as

    well as the ratio of tracked PLDs, compared to the number of unique tracked PLDs. We

    notice that most potential trackers (92%) are detected as sources of image tags in HTML

    and that most tracked PLDs are potentially tracked by means of 3rd party scripts.

    Figure 4.3: Tracking sources summary


A key point one needs to understand while interpreting the tracking-source analysis graph in figure 4.3 is that the ratios don't have to add up to 1. This is due to the fact that a single tracker can be detected in different sources at different PLDs (e.g. in a script in PLD 1 and in an image in PLD 2) and even potentially at the same PLD. The same goes for tracked PLDs, where one PLD can be potentially tracked by different trackers detected at different sources (e.g. using Google Analytics for traffic analysis and hosting 3rd-party ads in iframes). Table 4.3 shows the frequency distribution of the available combinations of tracking contexts. The frequency represents the number of occurrences where a tracked PLD is detected to have the corresponding tracking sources. The ratio is calculated based on the total number of entries in the tracking graph (approximately 80 million). For detailed information about the top trackers by source one can refer to appendix A.

    Table 4.3: Tracking-Source distribution

HTML Source                Frequency    Ratio
Script                     37,745,830   48%
Image                      23,304,038   30%
Script & Image             5,578,269    8%
IFrame                     3,956,215    5%
Script & Image & Link      3,406,727    5%
Link                       2,146,367    3%
Image & Link               1,050,777    2%
Script & Link              827,657      2%
Script & IFrame            419,904      1%
All                        398,320      1%
IFrame & Image             225,109      1%
Script & IFrame & Image    107,748      1%
Script & IFrame & Link     102,543      1%
IFrame & Image & Link      57,420       1%
IFrame & Link              29,045       1%
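A tally like the one in table 4.3 can be obtained by reducing each tracking-graph entry to the set of contexts it was seen in and counting the resulting combinations, roughly as in this sketch (toy entries only):

    from collections import Counter

    # Each tracking-graph entry carries the contexts in which the tracker was seen
    # for that PLD (toy rows; the real graph has roughly 80 million entries).
    entries = [
        {"script"},                 # e.g. a pure analytics snippet
        {"image"},                  # e.g. a tracking pixel
        {"script", "image"},
        {"iframe"},
        {"script"},
    ]

    combo_counts = Counter(" & ".join(sorted(e)) for e in entries)
    total = sum(combo_counts.values())
    for combo, count in combo_counts.most_common():
        print(f"{combo:<16} {count}  ({count / total:.0%})")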

For 3rd-party social-widget tracking, we analyzed a predefined set of code snippets offered by popular social network websites (see appendix C) and marked each entry in the tracking graph as tracked by a social widget or not, based on the source attribute that the code is using. Table 4.4 shows the share of each social network within the subset of PLDs being tracked by social widgets; in terms of coverage, it shows the percentage of PLDs spanned by each social network compared to all tracked PLDs and compared to the sample web corpus.
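Conceptually, the marking step is a lookup of the tracker's source domain against a predefined list per social network. A minimal sketch of that idea follows (the domain sets below are illustrative stand-ins for the snippet set in appendix C):

    # Hypothetical mapping from social networks to the source domains their widget
    # snippets load from; the actual list is derived from the snippets in appendix C.
    SOCIAL_WIDGET_DOMAINS = {
        "Facebook": {"facebook.com", "facebook.net"},
        "Twitter": {"twitter.com"},
        "Youtube": {"youtube.com"},
    }

    def social_widget(tracker_pld):
        """Return the social network a tracker PLD belongs to, or None."""
        for network, domains in SOCIAL_WIDGET_DOMAINS.items():
            if tracker_pld in domains:
                return network
        return None

    print(social_widget("facebook.net"))          # Facebook
    print(social_widget("google-analytics.com"))  # None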

Finally, based on the tracker-extraction assumptions and the proposed classification framework, we can assign script tracking to 3rd-party web analytics services and iframes and images to advertising-related trackers, while extracting the social-widget trackers manually as explained in the previous section. This leads us to the final statistics about the tracking classification, as illustrated in figure 4.4; it shows the percentages of tracked PLDs under each class. The ratios don't add up to 1 because of the overlapping tracking behavior explained earlier.

Table 4.4: Social-Widget tracking summary

Social-Widget   Absolute Frequency   Relative Frequency   % of Tracked PLDs   % of All PLDs
Facebook        2,180,111            0.576                11.17%              6.72%
Youtube         798,027              0.211                4.09%               2.46%
Twitter         783,727              0.207                4.02%               2.42%
Reddit          17,552               0.005                0.09%               0.05%
Instagram       4,346                0.001                0.02%               0.01%
Tumblr          140                  0.000                0.00%               0.00%

    Figure 4.4: Tracking Classification Summary

    4.4 Domain Analysis

Our next area of exploration is the tracking penetration analysis based on internet domains. There are many domain levels to consider (e.g. second-level, top-level, etc.). However, we are focusing on the generic top-level domains (gTLD) and country-code top-level domains (ccTLD).

    4.4.1 Country code analysis

We were able to detect approximately 11 million pay-level-domains that contain a country code (e.g. .de, .uk, .fr, etc.) from the sample web corpus of 32 million PLDs (out of which around 60% were marked as potentially tracked).

By means of informal visual analysis, we found that the tracking penetration ratios of country codes follow a normal distribution, as shown in figure 4.5, with minimum = 0.23, median = 0.59, maximum = 0.93 and standard deviation = 0.1.

    Figure 4.5: ccTLD tracking penetration histogram

For each tracking penetration value (x-axis), we plot a bar presenting the number of countries with such penetration.
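The per-country penetration values behind this histogram (and the heat map in figure 4.6) amount to grouping PLDs by their ccTLD and dividing tracked by total, for example (toy PLDs and a naive last-label extraction, not the actual pipeline):

    from collections import defaultdict

    # (pld, is_tracked) pairs; toy data standing in for the ~11 million ccTLD PLDs.
    plds = [("site-a.de", True), ("site-b.de", False),
            ("site-c.ru", True), ("site-d.fr", True)]

    stats = defaultdict(lambda: [0, 0])           # ccTLD -> [tracked, total]
    for pld, is_tracked in plds:
        cc = pld.rsplit(".", 1)[-1]               # naive: last label as the country code
        stats[cc][0] += int(is_tracked)
        stats[cc][1] += 1

    for cc, (tracked, total) in sorted(stats.items()):
        print(f".{cc}: {tracked}/{total} PLDs tracked ({tracked / total:.0%})")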

An interesting way to visualize the global spread of the web tracking phenomenon, as well as its degree, is using a heat map as shown in figure 4.6. Interestingly, Germany scored a relatively low penetration rate of 49%, placing it in the lower quartile of the data. We can also notice that some of the highest penetration rates are concentrated in Russia and post-Soviet states in Eastern Europe and Asia.

Finally, an important remark is that we are only considering the ccTLD extensions in our analysis and not the country-assigned IP address ranges. This experiment can be further enriched by incorporating the IP analysis as well.

    4.4.2 Generic domain analysis

Besides the country codes, we were also able to detect 22,986,076 PLDs in the web corpus sample (of approximately 32 million PLDs) that contain an element of a predefined set of the most popular generic top-level domains (gTLDs) assigned by the Internet Assigned Numbers Authority (IANA). The gTLDs are .com, .net, .org, .gov, .edu, .mil, .info and .biz. Out of these PLDs, we marked 13,800,223 as potentially tracked.

Figure 4.6: Tracking penetration worldwide

Shades of green, yellow and red indicate low, medium and high penetration rates respectively, given that the scale starts at 23% and ends at 93%. Black indicates that no data is available.

In table 4.5 we summarize the tracking penetration ratio for each of the extracted gTLDs. Surprisingly, the results came out against our expectation that the more popular and commercial domains such as .com and .net would have higher penetration than the more private, and in some cases sensitive, domains such as .edu and .gov. Also, we did not expect that the .mil gTLD, used by military organizations, would have a penetration rate as high as 53%, even though it tails the list.

    Table 4.5: Tracking penetration by gTLD

gTLD    PLDs (sample)   PLDs (tracked)   Tracking Penetration
.edu    33,629          22,512           67%
.gov    51,081          33,178           65%
.info   472,131         304,941          65%
.net    1,746,476       1,116,848        64%
.biz    156,559         99,525           64%
.org    1,923,282       1,214,639        63%
.com    18,602,312      11,008,258       59%
.mil    606             322              53%


    To further investigate these unexpected results, we compiled the matrix in table 4.6 with

    the top 10 trackers of each gTLD along with the number of PLDs they cover within it.

    Based on this matrix we observed the following:

• While the Google-related trackers are the only core trackers across gTLDs, the top 10 trackers are almost identical across .com, .org, .net, .info and .biz (with few exceptions). They are also a subset of the overall top trackers noted in 4.1. However, trackers tend to be different and sparse in the .edu, .gov and .mil group.

• What we consider sensitive gTLDs, like .gov, .mil and .edu, are tracked mostly by web analytics tools like google-analytics.com and addthis.com and by social-network widgets. However, there is no indication of them employing advertising-related trackers or content delivery networks, even popular ones such as googlesyndication.com. This is somewhat understandable, since websites like these, intended for public service, need to employ some sort of social interaction via social widgets, not to mention analyzing their own traffic.

• Some popular trackers only appear within commercial gTLDs, such as the popular web host godaddy.com, but never within .gov, .mil or .edu.

• A few trackers only appear under one gTLD, like cnzz.com and ejercito.mil.co under the .gov and .mil gTLDs respectively.

To understand the last point further, we drilled deeper into the data and, with some internet search, we found that cnzz.com is a Chinese tracking service that employs scripts in tracked pages. It turns out that the 2,141 PLDs cnzz.com is tracking under the .gov gTLD all have the .cn country code, which means they are Chinese government PLDs. Also, we found that ejercito.mil.co belongs to the Colombian national army and that all of its 15 tracked PLDs are being tracked by means of 3rd-party HTML links.

    4.5 Trackers Association

In this section, we aim to investigate the frequent co-occurrence of tracking services and whether there are rules that can predict the presence of trackers in a PLD based on the existence of other trackers, for example whether the existence of tracker z in a PLD is usually associated with the existence of trackers x and y.


    Table 4.6: Top Trackers Coverage over gTLDs

The matrix has values only for the top 10 trackers of each gTLD, or zeros to indicate that the tracker is completely absent regardless of its ranking. For example, the second cell (horizontally) indicates that addthis.com is tracking 51,627 PLDs with the .org gTLD, while godaddy.com is completely absent from all .mil domains. A tracker is marked with a dash if it is not within the top 10 trackers under a specific gTLD.

Tracker                  .com        .org        .net        .info     .biz     .edu     .gov     .mil
addthis.com              -           51,627      -           9,680     -        1,991    989      17
adobe.com                591,441     50,835      -           -         3,184    4,392    3,350    50
ajax.googleapis.com      867,033     99,873      69,199      13,143    4,956    3,757    1,700    32
baidu.com                -           -           -           -         -        -        939      0
casalemedia.com          609,597     -           58,074      46,841    6,455    -        0        0
cnzz.com                 -           -           -           -         -        -        2,141    0
ejercito.mil.co          -           0           0           0         0        0        0        15
facebook.com             726,854     89,693      67,375      13,816    4,551    3,054    -        23
facebook.net             485,699     53,172      43,369      -         -        1,939    -        -
gmpg.org                 -           -           -           10,600    -        -        -        -
godaddy.com              -           -           -           25,316    5,144    -        -        0
google-analytics.com     4,545,650   467,913     402,411     89,385    34,300   11,881   7,776    171
google.com               1,244,448   166,275     127,759     32,421    9,380    5,315    3,241    38
googlesyndication.com    1,874,776   168,622     235,195     112,147   22,153   -        -        -
imgaft.com               471,002     -           46,055      27,232    5,635    0        0        0
macromedia.com           802,529     60,038      55,717      -         3,950    4,539    10,468   46
twimg.com                -           -           -           -         -        -        -        24
twitter.com              -           -           46,114      -         -        1,348    -        -
weather.com.cn           -           -           -           -         -        -        2,673    0
youtube.com              -           68,471      -           -         -        2,483    898      19

Total PLDs tracked in gTLD   11,008,258   1,214,639   1,116,848   304,941   99,525   22,512   33,178   322

To begin, we want to understand the nature of the trackers' existence in terms of quantity (i.e. how many trackers there are per pay-level-domain). We start by computing the total number of tracking services per PLD (approximately 19 million PLDs) and observing the distribution. As shown in figure 4.7, the distribution is far from normal; in fact, it is heavily skewed (roughly exponential), with more than 99.99% of the data set in the range of 1-100 trackers per PLD, which means there exists a tiny fraction of PLDs with a huge number of trackers (above 1,000).

We then wanted to understand if this might be attributed to the number of pages in each processed PLD; however, we calculated the Pearson correlation coefficient between the number of pages and the number of trackers (per PLD) to be 0.28, which indicates only a slight positive correlation (even though we intuitively expected a higher positive one). Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations; it measures the linear correlation (dependence) between two variables x and y, giving a value in [-1, 1], where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation. After examining a subset of the top PLDs, in terms of pages and trackers, we found that most of them are huge networks such as Google, YouTube, Tumblr, etc. that permit users to load resources from 3rd-party domains (e.g. scripts, content, themes, etc.) as well as to use 3rd-party web traffic monitoring, hence the high number of trackers.
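For completeness, the correlation itself is a one-liner once both per-PLD counts are available; a sketch with hypothetical values (the thesis computes it over roughly 19 million PLDs):

    import numpy as np

    # Hypothetical per-PLD measurements, purely for illustration.
    pages_per_pld    = np.array([10, 200, 35, 4000, 15, 120])
    trackers_per_pld = np.array([2, 8, 3, 40, 1, 5])

    r = np.corrcoef(pages_per_pld, trackers_per_pld)[0, 1]   # Pearson's r
    print(round(r, 2))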

    Figure 4.7: Log-Log plot for the number of trackers per PLD

The second part of the analysis is to identify the groups of tracking services that usually appear together in PLDs. In order to achieve that, we model the problem as a market-basket analysis (with trackers as products and tracking-graph entries as transactions), employing frequent-itemset mining techniques. On top of that, we use association rule learning to find out whether there are dependencies between trackers.

Apriori [51] is a seminal frequent-itemset mining algorithm that we use, out of the box from SQL Server Analysis Services (Microsoft provides its Apriori implementation under the name Microsoft Association Algorithm; see msdn.microsoft.com/en-us/library/cc280428.aspx), to help answer our question. In a nutshell, Apriori works by identifying the frequent individual items in the dataset and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the data (by means of a support function). The frequent item sets determined by Apriori can later be used to derive association rules which highlight general trends in the dataset. Figure 4.8 shows an outline of the algorithm.
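To illustrate the principle on a toy scale (this is a simplified two-level pass, not the Microsoft Association Algorithm actually used in our experiments): treat each PLD's set of detected trackers as a transaction, keep the trackers whose support exceeds a threshold, and build candidate pairs only from those frequent items before filtering them by support again:

    from collections import Counter
    from itertools import combinations

    # One "transaction" per PLD: the set of trackers detected on it (toy data).
    transactions = [
        {"google-analytics.com", "facebook.com", "twitter.com"},
        {"google-analytics.com", "facebook.com"},
        {"google-analytics.com", "doubleclick.net"},
        {"facebook.com", "twitter.com"},
    ]
    min_support = 0.5                       # itemset must appear in >= 50% of PLDs

    # Level 1: frequent single trackers
    item_counts = Counter(t for basket in transactions for t in basket)
    frequent_items = {i for i, c in item_counts.items()
                      if c / len(transactions) >= min_support}

    # Level 2: candidate pairs built only from frequent items (downward closure),
    # then filtered by support; Apriori iterates the same principle for larger sets.
    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket & frequent_items), 2):
            pair_counts[pair] += 1
    frequent_pairs = {p: c / len(transactions)
                      for p, c in pair_counts.items()
                      if c / len(transactions) >= min_support}
    print(frequent_pairs)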

We applied the Apriori implementation on a subset of the tracking graph that contains the top 20 trackers (extracted in 4.1) and their corresponding tracking entries of approximately 26 million records (32% of the complete graph). Table 4.7 shows 20 frequent


Figure 4.8: Pseudocode of the Apriori algorithm (taken from en.wikipedia.org/wiki/Apriori_algorithm)

The pseudocode for the algorithm is given below for a transaction database T and a support threshold of ε. Ck is the candidate set for level k. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. count[c] access