A Distributed Approach to Uncovering Tailored Information and Exploitation on the Web

Kenton P. Born
Kansas State University
October 26th, 2011


Page 1: A Distributed Approach to Uncovering tailored Information and exploitation on the web

Kenton P. Born

Kansas State University

October 26th, 2011

A DISTRIBUTED APPROACH TO UNCOVERING TAILORED INFORMATION AND EXPLOITATION ON THE WEB

Page 2: A Distributed Approach to Uncovering tailored Information and exploitation on the web

“The power of individual targeting – the technology will be so good it will be very hard for people to watch or consume something that has not in some sense been tailored for them”

Eric Schmidt, Google

Page 3: A Distributed Approach to Uncovering tailored Information and exploitation on the web

ROADMAP

• Introduction

• Background

• Problem Statement

• Hypothesis

• Methodology

• Results

• Limitations

• Future Work

Page 4: A Distributed Approach to Uncovering tailored Information and exploitation on the web

BACKGROUND

• What attributes can be used to distinguish a system?

• Anything with consistency on one machine, but entropy across varying machines

• MAC and IP Address

• TCP/IP Stack

• Application Layer Content

• Client Fingerprint – The combination of all identifiable attributes of a client system that distinguish it from others.

Page 5: A Distributed Approach to Uncovering tailored Information and exploitation on the web

HTTP FINGERPRINTING

• Browser Identification

• User-Agent string

• Object detection

• Plugins

• Operating System Identification

• User-Agent string

• TCP/IP stack

• User Identification

• IP Address

• Cookies

• Aggregation of fingerprintable attributes
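
The aggregation idea above can be sketched as a deterministic hash over whatever attributes are visible to the server. This is a minimal illustrative sketch, not the system described in the deck (which was implemented in Java); the function and attribute names are hypothetical:

```python
import hashlib

def client_fingerprint(attributes: dict) -> str:
    """Aggregate fingerprintable attributes into a single identifier.

    Keys are sorted so the same attribute set always yields the same hash,
    regardless of the order the attributes were collected in.
    """
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.md5(canonical.encode()).hexdigest()

# Two clients differing in only one attribute get distinct fingerprints.
a = client_fingerprint({
    "user_agent": "Mozilla/5.0 (Windows NT 6.1) Firefox/7.0",
    "plugins": "Flash 10.3;Java 6u27",
    "ip": "192.0.2.10",
})
b = client_fingerprint({
    "user_agent": "Mozilla/5.0 (Windows NT 6.1) Firefox/7.0",
    "plugins": "Flash 10.3;Java 6u27",
    "ip": "192.0.2.11",
})
assert a != b
```

The key property is that any attribute with consistency on one machine but entropy across machines contributes identifying bits to the aggregate.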

Page 6: A Distributed Approach to Uncovering tailored Information and exploitation on the web

PANOPTICLICK

Page 7: A Distributed Approach to Uncovering tailored Information and exploitation on the web

THE EPHEMERAL WEB

• Websites are growing in complexity

• Static content → Dynamic content → Instantly dynamic content → Tailored content

• Has an effect on:

• Web crawlers / Search engines

• Change detection services

• Semantic analysis

• Trust

• These are not solved problems!

Third Party Ads

Client Tracking

Analytics

Page 8: A Distributed Approach to Uncovering tailored Information and exploitation on the web

TAILORED WEB CONTENT

• How do you know when a web response has been modified because of the fingerprintable attributes of your system?

• User and location-based tailoring

• Services

• Misinformation

• Browser and operating system tailoring

• Software downloads

• Exploits for specific client fingerprints

How can we assign a level of trust to web content?

Page 9: A Distributed Approach to Uncovering tailored Information and exploitation on the web

HYPOTHESIS

• A multiplexing proxy, through the real-time detection and categorization of web content that has been modified for specific locations, browsers, and operating systems, provides enhanced misinformation, exploit, and web design analytics.

• This study analyzed the utility of the multiplexing proxy’s detection, classification, and visualization methods for three different roles: open source analysts, cyber analysts, and reverse engineers.

• Both qualitative and quantitative approaches were taken in an attempt to understand whether the techniques used by many sites are similar, or whether the breadth of dynamic changes is too vast to ever be handled well in an automated system.

Page 10: A Distributed Approach to Uncovering tailored Information and exploitation on the web

METHODOLOGY

• Multiplex requests at a proxy

• Change at most one fingerprintable attribute per request

• Modify User-Agent string (Browser)

• Modify TCP/IP stack (Operating System)

• Modify IP address (Location)

• Send duplicate requests

• Aggregate the responses at the proxy

• Analyze them against the original response

• Present the user with detailed analysis along with the original response
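
The multiplexing step above — change at most one fingerprintable attribute per duplicated request — can be sketched as follows. This is a hypothetical Python illustration (the actual proxy was written in Java); `multiplex` and its parameters are assumed names:

```python
def multiplex(original: dict, variants: dict) -> list:
    """Build the set of probe requests for one original request.

    `original` is the client's request fingerprint; `variants` maps an
    attribute name (e.g. "user_agent") to a list of alternative values.
    Each probe changes exactly one attribute, so any response anomaly
    can be attributed to that single modification.
    """
    probes = []
    for attr, values in variants.items():
        for value in values:
            probe = dict(original)   # copy, then perturb one attribute
            probe[attr] = value
            probes.append((attr, probe))
    return probes
```

For example, `multiplex(fp, {"user_agent": [ie_ua, chrome_ua], "ip": [ru_proxy_ip]})` would yield three probe requests, each differing from the original in exactly one attribute.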

Page 11: A Distributed Approach to Uncovering tailored Information and exploitation on the web
Page 12: A Distributed Approach to Uncovering tailored Information and exploitation on the web

FALSE POSITIVE MITIGATION

• How do we handle the false positives due to instantly dynamic content?

• Send several requests that duplicate the fingerprint of the original request!

• Provides a baseline of the instantly dynamic data

• Anomalies from this baseline are the tailored content!

• Accuracy improves with additional duplicate requests
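
The baseline idea above can be sketched as a simple classifier: responses to duplicate-fingerprint requests establish what "normal" variation looks like, and only a modified-fingerprint response falling outside that baseline is flagged. A minimal sketch with assumed names, not the dissertation's implementation:

```python
def classify_response(original: str, duplicates: list, modified: str) -> str:
    """Separate instantly dynamic noise from candidate tailored content.

    `duplicates` are responses to requests with the SAME fingerprint as the
    original; `modified` is the response to a request with one attribute
    changed. If the modified response matches anything in the baseline,
    the variation is just instantly dynamic content, not tailoring.
    """
    baseline = {original, *duplicates}
    if modified in baseline:
        return "instantly dynamic" if len(baseline) > 1 else "static"
    return "anomaly"  # candidate tailored content
```

With more duplicate requests the baseline covers more of the page's natural churn, which is why accuracy improves with additional duplicates.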

Page 13: A Distributed Approach to Uncovering tailored Information and exploitation on the web

CLASSIFICATION

• Classify the resources and provide tools for analyzing them

• Look at:

• MD5 hash (byte-to-byte comparison)

• Stripped hash (Only compare content)

• Structure hash (Do they share the same structure?)

• Response length anomalies

• Visualization

Classification    | Description
Static            | No changes were detected
Instantly Dynamic | Changes were detected between requests with a duplicate fingerprint
Browser Anomaly   | A change was attributed to a modified User-Agent string
OS Anomaly        | A change was attributed to a TCP/IP stack modification
Location Anomaly  | A change was attributed to the IP address of the requesting system
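
The three comparison hashes used for classification (byte-to-byte, content-only, structure-only) could be sketched like this. The regex-based stripping is a simplification of whatever the real proxy did; function names are hypothetical:

```python
import hashlib
import re

def md5_hash(body: str) -> str:
    """Byte-for-byte comparison of the full response."""
    return hashlib.md5(body.encode()).hexdigest()

def stripped_hash(body: str) -> str:
    """Compare only the text content: drop tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", body)
    return hashlib.md5(" ".join(text.split()).encode()).hexdigest()

def structure_hash(body: str) -> str:
    """Compare only the markup structure: keep tag names, drop content."""
    tags = re.findall(r"</?\s*([a-zA-Z0-9]+)", body)
    return hashlib.md5(",".join(tags).lower().encode()).hexdigest()
```

Two responses that differ only in text share a structure hash but not a stripped hash; two that differ only in formatting share a stripped hash but not an MD5 hash — which is what lets the proxy rank how "interesting" a change is.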

Page 14: A Distributed Approach to Uncovering tailored Information and exploitation on the web

OPEN SOURCE ANALYST

Page 15: A Distributed Approach to Uncovering tailored Information and exploitation on the web

CYBER ANALYST

Page 16: A Distributed Approach to Uncovering tailored Information and exploitation on the web

REVERSE ENGINEER

• Analyze and investigate websites from the empirical classification study

• Determine insights gained about server behavior

• Determine value of information extracted by the tool’s analytical and visual capabilities

Page 17: A Distributed Approach to Uncovering tailored Information and exploitation on the web

SYSTEM COMPONENTS

• Client-side Firefox Browser Plugin (XUL/JavaScript)

• Easily distributed

• Piggy-back off of browser functionality

• APIs for request/header manipulation

• Other Plugins

• e.g. Ad-Block Plus

• Customized Proxy (Java)

• Distributed Agents (Java)

• Fingerprint modifications

• Load-balancing

• Web-Service (Java/GWT)

Page 18: A Distributed Approach to Uncovering tailored Information and exploitation on the web

CONFIGURING THE MULTIPLEXING PROXY

Navigate to the configuration screen and select the specific browser, operating system, and location modifications you would like made to the requests going through the multiplexing proxy.

Page 19: A Distributed Approach to Uncovering tailored Information and exploitation on the web

USING THE MULTIPLEXING PROXY

The Firefox plugin adds an additional toolbar to the browser

Anomalies will first be signaled in the toolbar. Click on them to view a summary of the analysis of the resources!

Additional Search Bar

Page 20: A Distributed Approach to Uncovering tailored Information and exploitation on the web

LISTING THE DYNAMIC RESOURCES

The dynamic resources are listed in order of “most interesting” to “least interesting”

Overview of the analysis for each resource

Click on a resource to analyze further

Page 21: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RESULTS

Page 22: A Distributed Approach to Uncovering tailored Information and exploitation on the web

THEORETICAL CLASSIFICATION ACCURACY

• Dissertation includes many formulas for calculating the probability of misclassification

• Generally revolves around “dice rolling” calculations

• Websites that return multiple versions of a page with unequal probabilities require more in-depth formulas than those whose versions are equally likely

• “Weighted dice” vs. “Fair dice”

• Increasing the number of duplicate requests increases the accuracy!

• Three duplicate requests provides sufficient accuracy for most cases

• Chicken-or-the-egg problem: you can only calculate the accuracy of a particular website once you know everything about the website

• Looking at the empirical classification accuracy helps with this
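
The "fair dice" intuition above can be made concrete for the simplest case. Suppose a page serves one of m equally likely versions; the chance that k duplicate-fingerprint requests all happen to return the same version as the original — hiding the instantly dynamic behavior, so a differing modified-fingerprint response would be misattributed to the fingerprint change — falls off geometrically in k. This is only the simplest case, not the dissertation's full weighted-dice formulas, and the function name is hypothetical:

```python
def p_hidden_dynamic(m: int, k: int) -> float:
    """Fair-dice misclassification sketch.

    A page serves one of m equally likely versions. Returns the
    probability that all k duplicate requests match the original
    version by chance, so the instantly dynamic content goes
    undetected by the baseline.
    """
    return (1.0 / m) ** k
```

With m = 2 versions, one duplicate request leaves a 50% chance of missing the dynamism, while three duplicates reduce it to 12.5% — consistent with the observation that three duplicates suffice for most cases.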

Page 23: A Distributed Approach to Uncovering tailored Information and exploitation on the web

EMPIRICAL CLASSIFICATION ACCURACY

Category              | # Websites | # Resources | Instantly Dynamic Resources
News / News Aggregate |         22 |        1821 |  35
Business              |         44 |        2428 |  37
Personal              |         12 |         226 |   2
Blog                  |         16 |         950 |   9
Forum                 |         11 |         327 |  11
Search Engine         |         12 |         187 |  10
Reference/Image/Video |         18 |         987 |  21
Other                 |         15 |         704 |  19
TOTAL                 |        150 |        7630 | 144

Page 24: A Distributed Approach to Uncovering tailored Information and exploitation on the web

EMPIRICAL CLASSIFICATION ACCURACY

# Duplicate Requests | Static | Instantly Dynamic | Misclassified | % Error
                   1 |   7509 |               121 |            23 |  0.301
                   2 |   7499 |               131 |            13 |  0.170
                   3 |   7488 |               142 |             2 |  0.0262
                   4 |   7488 |               142 |             2 |  0.0262
                   5 |   7487 |               143 |             1 |  0.0131
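
The % Error column is the number of misclassified resources as a share of all 7630 resources in the study; this sketch (function name assumed) reproduces those values:

```python
def percent_error(misclassified: int, total_resources: int) -> float:
    """Misclassified resources as a percentage of all resources examined."""
    return 100.0 * misclassified / total_resources

# The study's 7630 resources, with the misclassification counts per
# number of duplicate requests, reproduce the table's % Error values.
for wrong, expected in [(23, 0.301), (13, 0.170), (2, 0.0262), (1, 0.0131)]:
    assert abs(percent_error(wrong, 7630) - expected) < 1e-3
```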

Page 25: A Distributed Approach to Uncovering tailored Information and exploitation on the web

UNDERSTANDING THE ANALYSIS RESULTS

The response to the original request

A response that matches, byte-for-byte, the original response

A response that matches the original after stripping out benign attributes, formatting, etc.

A response that does not seem to match the original response

A request that failed to receive a response from the server

Each type of response has various tools for analyzing it more deeply against the original response: e.g. diff / inline diff / stripped diff / view

A response that has a significantly anomalous length compared to the original/duplicates
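
One way the "significantly anomalous length" check could work is a spread-based threshold over the original and duplicate response lengths. This is a hypothetical heuristic for illustration, not the dissertation's actual test; the threshold value is assumed:

```python
import statistics

def length_anomaly(original_len: int, duplicate_lens: list,
                   modified_len: int, threshold: float = 3.0) -> bool:
    """Flag a modified response whose length deviates from the
    original/duplicate lengths by more than `threshold` times their spread."""
    baseline = [original_len, *duplicate_lens]
    mean = statistics.mean(baseline)
    spread = statistics.pstdev(baseline) or 1.0  # avoid zero when lengths are identical
    return abs(modified_len - mean) > threshold * spread
```

A 5000-byte response against a ~1000-byte baseline is flagged; a 1002-byte response within the baseline's natural jitter is not.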

Page 26: A Distributed Approach to Uncovering tailored Information and exploitation on the web

CLIENT FINGERPRINT / ANALYTICS COLLECTION

Location/IP based modifications flag an anomaly!

IP Address geolocation!

Woah!

Page 27: A Distributed Approach to Uncovering tailored Information and exploitation on the web

LOCATION-BASED BLOCKING/TAILORING

Location-based anomalies!

The request sent through a proxy in China failed…

“The great firewall of China”

Page 28: A Distributed Approach to Uncovering tailored Information and exploitation on the web

AD TAILORING

Modifying the User-Agent string flags a response anomaly!

You do not have Windows 7, get it!

You have Windows 7 Starter, upgrade to Home Premium!

Get the latest service pack for Windows!

You do not have Windows 7, buy a new PC!

Page 29: A Distributed Approach to Uncovering tailored Information and exploitation on the web

MISCONFIGURED / MALICIOUS PROXY

The response from the proxy in China did not match the rest!

Default Apache response, a misconfiguration!

While this was a misconfigured proxy, a malicious proxy would similarly throw an anomaly!

Page 30: A Distributed Approach to Uncovering tailored Information and exploitation on the web

JAVASCRIPT INJECTION (LAB ENVIRONMENT)

The Internet Explorer fingerprint triggered an anomaly!

JavaScript Injection! A malicious, obfuscated redirect…

Page 31: A Distributed Approach to Uncovering tailored Information and exploitation on the web

REAL JAVASCRIPT INJECTION!

Seemingly random anomalies!

Performance analysis script!

Page 32: A Distributed Approach to Uncovering tailored Information and exploitation on the web

PRICE TAILORING

Prices are shown in a new currency… same price?

Location-based anomalies!

Page 33: A Distributed Approach to Uncovering tailored Information and exploitation on the web

SEARCH ENGINE REDIRECT

Location-based anomalies!

Redirect to www.google.ru!

Significant length anomaly!

Page 34: A Distributed Approach to Uncovering tailored Information and exploitation on the web

IMAGE TAILORING

Handling image transparency issues for IE6!

Internet Explorer 6 anomaly!

Page 35: A Distributed Approach to Uncovering tailored Information and exploitation on the web

FORMATTING INCONSISTENCIES

Tailored images for IE6

Tailored formatting for different browsers

Page 36: A Distributed Approach to Uncovering tailored Information and exploitation on the web

CURRENT LIMITATIONS

• No support for HTTPS

• Requires the proxy to hijack the handshake and play MITM

• No ability to manipulate and analyze the effects of HTTP cookies

• HTTP POST requests are ignored

• If they aren’t idempotent, duplicating them could cause issues

• e.g. Adding items to shopping carts

• TCP/IP stack manipulation is not robust

• Requires a machine or VM for each operating system fingerprint

• Need a tool to quickly modify the stack as necessary on any machine

Page 37: A Distributed Approach to Uncovering tailored Information and exploitation on the web

FUTURE WORK

• More robust real-time TCP/IP stack manipulation tool

• Cookies

• Tailored Content due to the presence of certain cookies

• Find websites that share cookies across various domains!

• Look at other protocols

• DNS

• Routing

• Etc.

Page 38: A Distributed Approach to Uncovering tailored Information and exploitation on the web

CONTACT INFORMATION

Kenton Born

Kansas State University

Lawrence Livermore National Laboratory

[email protected]

Page 39: A Distributed Approach to Uncovering tailored Information and exploitation on the web

BACKUP SLIDES

Page 40: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK

• Many studies on dynamic aspects of websites

• Cho and Garcia-Molina (2000)

• 25% of pages in .com domains changed within a day

• 50% of pages changed after 11 days

• Other TLDs such as .gov were less dynamic

• Olston and Pandey (2008)

• Developed web crawl policies that accounted for longevity of information

• Periodic crawling of dynamic material instead of batched crawling

Must switch from batched crawling to complex, incremental crawlers!

Page 41: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (2)

• Measuring website differences

• Cho and Garcia-Molina (2000)

• MD5 checksum

• Fetterly et al. (2003)

• Vector of syntactic properties using the shingling metric

• Most changes are trivial (non-content)

• Greater frequency of change in top level domains

• Larger documents have a greater frequency and degree of change

• Past changes can predict the size and frequency of future changes

• Adar et al. (WSDM 2009)

• Xpath extraction of “cleaned” website

• Calculated survivability of each element

Patterns can be found in website changes by analyzing them more deeply

Page 42: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (3)

• Change frequency

• Adar et al. (WSDM 2009)

• Over 50% of the websites examined had frequently changing data

• Over 10% of the websites contained instantly dynamic data

• An instantly dynamic website typically modifies similar amounts of information each time, in contrast with sites such as blogs

• Ntoulas et al. (2004)

• 8% of downloaded sites each week were new web pages

• Calculated TF-IDF Cosine distance and word distance between versions.

• Most website changes were minor, not causing significant differences

• Kim and Lee (2005)

• After 100 days, 40% of URLs were not found on initial crawls.

• Calculated download rate, modification rate, coefficient of age

• Did nothing to handle instantly dynamic data!

Page 43: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (4)

• Website comparison

• Kwan et al. (2006)

• Analyzed comparison methods against specific types of change (markup removed)

• Byte-to-byte (checksum)

• TF-IDF cosine distance not sensitive enough for most changes

• Word distance only effective for “replace” changes

• Could not report on “moved” text

• Edit distance differed from word distance by treating “move” and “replace” similarly

• Shingling metric performed best against “add” and “drop” changes

• Over-sensitive to the rest

Different types of changes are best detected using different methods!

Page 44: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (5)

• Structural changes

• Dontcheva et al. (2007)

• Removed structurally irrelevant elements and analyzed the DOM tree

• Small changes or layout modifications happen toward the leaves of the DOM tree

• Major website changes happen deeper in the tree.

• Larger websites with large amounts of traffic and highly dynamic content tended to have a larger number of structural changes.

• Automated extraction is difficult for changes away from leaf nodes

• Did not take AJAX/Flash applications into account

Element depth can help classify the type of change!

Page 45: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (6)

• Revisitation patterns

• Adar et al. (CHI 2008; CHI 2009)

• Enhance user experience by highlighting relevant content that changed from previous visits.

• Polled users to find relationships between users’ intentions and site revisitation.

• Dynamic website revisitation - users searching for new information

• Static website revisitation - people revisiting something previously viewed

Is it possible to identify relevant information for a user?

Page 46: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (7)

• Change detection in XML/HTML

• Longest common subsequence

• Diff

• HtmlDiff

• Hirschberg algorithm

• Mikhaiel and Stroulia (2005)

• Labeled-ordered tree comparison

• Chawathe and Garcia-Molina (1997)

• Minimum cost edge cover of a bipartite graph

• Wang et al. (2003)

• X-Diff

• Tree-to-tree correction techniques.

• Xing et al. (2008)

• X-Diff+

• Visual representation of how an XML document conformed to its DTD

Page 47: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (8)

• Tools for visualizing web differences

• Chen et al. (2000)

• AT&T Difference Engine (AIDE)

• Used TopBlend - Heaviest common subsequence solver

• Jacobson-Vo algorithm

• Web crawler that collects temporal versions of websites and highlights differences.

• Adar et al. (UIST 2008)

• Builds a collection of documents and snapshots of websites over time.

• Explore websites through different lenses

• Greenberg and Boyle (2006)

• Stored bitmaps of user-selected regions, notifying the user when significant changes were detected.

• Limiting and ineffective in many cases.

Page 48: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (9)

• Many services that monitor websites and alert users

• Liu et al. (2000)

• WebCQ

• http://www.rba.co.uk/sources/monitor.htm

Web-based        | Desktop-based
Change Detection | Copernic Tracker
ChangeDetect     | Internet Owl
Femtoo           | Update Patrol
FollowThatPage   | Update Scanner for Firefox
Infominder       | Website Watcher
Page2RSS         |
Watch That Page  |
Websnitcher      |

Page 49: A Distributed Approach to Uncovering tailored Information and exploitation on the web

RELATED WORK (10)

• Real-time comparative web browsing

• Nadamoto and Tanaka (2003)

• Comparative Web Browser (CWB)

• Displays and synchronizes multiple web pages based on relevance.

• Nadamoto et al. (2005)

• Bilingual Comparative Web Browser (B-CWB)

• Same as CWB, but attempts to do it across varying languages

• Selenium

• Framework providing an API to invoke web requests in varying browsers and run tests against their responses.