
Privacy on the Web

Li Xiong

Department of Mathematics and Computer Science, Emory University

Definitions of Privacy

Right to be let alone (1890s, Brandeis, future US Supreme Court Justice)

a: The quality or state of being apart from company or observation; b: freedom from unauthorized intrusion (Merriam-Webster)

The right of the individual to be protected against intrusion into his personal life or affairs, or those of his family, by direct physical means or by publication of information (Calcutt committee, UK)

Aspects of Privacy

Information privacy
Bodily privacy
Privacy of communications
Territorial privacy

Information Privacy

Establishment of rules governing the collection and handling of personal data
Data about individuals should not be automatically available to other individuals and organizations

The individual must be able to exercise a substantial degree of control over that data and its use.

Information privacy on the web

Large amount of (personal) data collected on the web
Search engine logs
Personal data and blogs on social network sites
…

The data are of great value for both individuals and our society.

The data also pose a significant threat to individuals’ privacy.

Data privacy on the web – some case studies

A comparison of privacy practices of internet service companies
Query log retention and its privacy implications
Information revelation patterns and their privacy implications on social network sites

A race to the bottom: privacy ranking of Internet service companies

Privacy International, 2007
Studied and ranked the privacy practices of key Internet-based companies
Amazon, AOL, Apple, BBC, eBay, Facebook, Friendster, Google, LinkedIn, LiveJournal, Microsoft, MySpace, Skype, Wikipedia, LiveSpace, Yahoo!, YouTube

A Race to the Bottom: Methodologies

Corporate administrative details
Data collection and processing
Data retention
Openness and transparency
Customer and user control
Privacy enhancing innovations and privacy invasive innovations

A race to the bottom: interim results revealed

Why Google

Retains a large quantity of information about users, often for an unstated or indefinite length of time, without clear limitation on subsequent use or disclosure

Maintains records of all search strings with associated IP and time stamps for at least 18-24 months

Collects additional personal information from user profiles in Orkut

Uses an advanced profiling system for ads

Query Log

AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com

(Source: AOL Query Log)
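
A minimal sketch of reading records in the format shown above. It assumes the fields are tab-separated and that ItemRank and ClickURL are simply absent when the user did not click a result. `QueryRecord` and `parse_line` are illustrative names, not part of any AOL tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class QueryRecord:
    anon_id: str
    query: str
    query_time: datetime
    item_rank: Optional[int]   # None when no result was clicked
    click_url: Optional[str]   # None when no result was clicked


def parse_line(line: str) -> QueryRecord:
    """Parse one log line (assumed tab-separated layout)."""
    fields = line.rstrip("\n").split("\t")
    # Pad to five fields so rows without a click still parse.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return QueryRecord(
        anon_id=anon_id,
        query=query,
        query_time=datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S"),
        item_rank=int(item_rank) if item_rank else None,
        click_url=click_url or None,
    )
```

Even in this small record, the AnonID field is what ties separate queries to one searcher, which is the property the retention techniques later in the deck try to weaken.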

A Face Is Exposed for AOL Searcher No. 4417749

20 million Web search queries released by AOL
Queries by user 4417749:
“numb fingers”, “60 single men”
“dog that urinates on everything”
“landscapers in Lilburn, Ga”
Several people’s names with the last name Arnold
“homes sold in shadow lake subdivision gwinnett county georgia”

Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her dogs

Privacy Risks of Query Log

Accidental or malicious disclosure: disclosure of information that users intended to keep private, or that may harm them when released
Compelled disclosure to third parties: query logs may be subject to subpoena as part of civil litigation between individuals or organizations
Disclosure to the government: query logs may be subject to government demands in the context of law enforcement or intelligence investigations
Misuse of user profiles: the retention of query logs may allow the creation of detailed profiles of individuals’ interests, preferences, and behaviors.

Query Log Retention Rationale

Improving ranking algorithms and quality of search results

Language-based applications such as spelling correction

Query refinement
Personalization
Combating fraud and abuse
Sharing data for academic research
Sharing data for marketing and other commercial purposes

Query log retention

Analyze potential techniques/practices for query log retention:
how well the technique protects privacy
how well the technique preserves the utility of the query logs
how well the technique might be implemented as a user control (a mechanism that allows users to choose to apply the technique)

1. Log Deletion

Erase users’ complete query logs
Erasure may occur as early as when the search engine returns search results to the user.
Privacy: the most privacy-enhancing technique available
Utility: drops to zero once the logs are erased
If the query logs are kept longer before erasure, the search engine could gain some of the benefits of log analysis and storage
User control: straightforward; users either have their logs retained or deleted
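
The trade-off between immediate erasure and a longer retention period can be sketched as a simple filter. The entry layout (a dict with a `time` key) and the name `delete_expired` are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def delete_expired(log, now, retention):
    """Return only the entries younger than `retention`.

    A zero retention period models erasing the log as soon as results
    are returned; a longer period keeps some utility for a while.
    """
    return [entry for entry in log if now - entry["time"] < retention]
```

For example, calling `delete_expired(log, now, timedelta(0))` keeps nothing, while `timedelta(days=30)` keeps only the last month of queries.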

2. Hashing queries

Two approaches:
Entire queries could be hashed, so that the resulting log contains a hash value for each query
Tokenize the query and hash each token, resulting in a set of hash values

2. Hashing queries (privacy protection)

Profile misuse: √ (a user’s interests or behaviors are no longer available)
Government, third party: √ X (cannot ask for all queries associated with a particular user, IP address, or cookie ID, but can inquire about particular query terms)
Malicious disclosure: √ (hashed sensitive information is hard to reverse)

2. Hashing queries (utility)

Viable rationales: ranking, anti-fraud, language-based applications, query refinement (if token-based hashing is used, or if the search engine retains aggregate statistics about queries, or some subset of queries, in an unhashed form)
Impeded rationales: personalization, some academic, commercial (these rely on tying query logs to particular users)

2. Hashing queries (user control)

Straightforward
The technique’s effectiveness in protecting privacy may actually increase for those who choose to adopt it if not all individuals make use of it (because reverse-engineering attacks rely on statistics).

3. Identifier Deletion

Identifier: IP address, cookie IDs
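
A minimal sketch of identifier deletion on a single log entry. The field names `ip` and `cookie_id` and the dict layout are hypothetical; a real log schema would differ.

```python
# Hypothetical field names for the identifiers named on the slide.
IDENTIFIER_FIELDS = ("ip", "cookie_id")


def delete_identifiers(entry: dict) -> dict:
    """Return a copy of the entry without IP address or cookie ID."""
    return {k: v for k, v in entry.items() if k not in IDENTIFIER_FIELDS}
```

The query text and timestamp survive untouched, so anything a user typed about themselves is still in the log.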

3. Identifier Deletion (privacy protection)

Profile misuse: √ (any user profile based on remaining partial identifiers is unlikely to be correlated to specific individuals, computers, or browsers)
Government, third party: √ (difficult to request all query logs corresponding to a specific user, IP address, or cookie ID)
Malicious disclosure: X (if users query their own personal information)

3. Identifier Deletion (utility)

Viable rationales: language-based applications, query refinement (those analyzing the data can use other query information, e.g., timestamps and query content)
Impeded rationales: ranking, anti-fraud, personalization, academic, commercial (these rely on tying query logs to particular users)

4. Hashing Identifiers

Identifier: IP address, cookie IDs
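
A minimal sketch of hashing identifiers, assuming an unkeyed SHA-256 over the IP address and cookie ID. The field names and the `user_hash` output key are hypothetical.

```python
import hashlib


def hash_identifier(ip: str, cookie_id: str) -> str:
    """Replace the raw identifiers with a deterministic hash."""
    raw = f"{ip}|{cookie_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def pseudonymize(entry: dict) -> dict:
    """Return a copy of the entry with identifiers replaced by their hash."""
    out = {k: v for k, v in entry.items() if k not in ("ip", "cookie_id")}
    out["user_hash"] = hash_identifier(entry["ip"], entry["cookie_id"])
    return out
```

The same user always receives the same `user_hash`, so that user's queries stay linked together, and anyone who already knows the IP address and cookie ID can recompute the hash and look the user up.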

4. Hashing Identifiers (privacy protection)

Profile misuse: X (possible to link a single user’s queries together)
Government, third party: X (if civil litigants or government authorities possess the input into the hash for a particular user, e.g., the user’s IP address and/or cookie ID)
Malicious disclosure: X (if users query their own personal information)

4. Hashing Identifiers (utility)

Viable rationales: ranking, language-based applications, query refinement, personalization, academic, commercial (hashing preserves the correlation between individual users and their queries)
Impeded rationales: anti-fraud (it is difficult to recover a fraudulent user’s external identifiers just by looking at the query logs)

5. Scrubbing Query Content

Remove identifying information: phone numbers, Social Security numbers, credit card numbers, addresses, and names
Distinguish identifying information from the remainder of the query content [Xiong and Agichtein 2007]
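
A minimal sketch of scrubbing with regular expressions. The patterns below are simplified assumptions covering only US-style phone numbers, Social Security numbers, and 16-digit card numbers; names and addresses cannot be caught this way and need the kind of content analysis cited above.

```python
import re

# Simplified, illustrative patterns; order matters (longest first).
PATTERNS = [
    ("[CARD]", re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
]


def scrub(query: str) -> str:
    """Replace recognizable identifying numbers with placeholders."""
    for placeholder, pattern in PATTERNS:
        query = pattern.sub(placeholder, query)
    return query
```

A query such as "landscapers in Lilburn, Ga" passes through unchanged, which illustrates why scrubbing alone leaves location and interest information in the log.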

5. Scrubbing Query Content (privacy protection)

Profile misuse: X (all the other information of value, i.e. the searcher’s interests, tastes, and behaviors, remains available)
Government, third party: X (queries can still be tied to an individual via an identifier)
Malicious disclosure: √ (but it is still possible to identify an individual using outside sources)

5. Scrubbing Query Content (utility)

Viable rationales: ranking, language-based applications, query refinement, anti-fraud, personalization, academic, commercial (scrubbing preserves the correlation between individual users and their queries)
Impeded rationales: only those with the specific purpose of analyzing identifying information

5. Scrubbing Query Content (user control)

Two ways:
Scrub everything the search engine deems to be identifying
User-defined

6. Deleting Infrequent Query

Remove queries that appear infrequently
The vast majority of queries occur a small number of times [Beitzel et al. 2004; Spink et al. 2001]
Queries that are infrequent right now might not be infrequent forever (new professional athletes or celebrities, new slang, new product names, etc.)
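
A minimal sketch of deleting infrequent queries with a count threshold. The threshold `min_count` is an assumption; the slides do not say what counts as infrequent.

```python
from collections import Counter


def delete_infrequent(queries, min_count):
    """Keep only queries that occur at least `min_count` times in the log."""
    counts = Counter(queries)
    return [q for q in queries if counts[q] >= min_count]
```

Because the counts come from the log as it stands today, a query that is rare now (a new product name, say) is dropped even if it later becomes popular.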

6. Deleting Infrequent Query (privacy protection)

Profile misuse: X (all the other information of value, i.e. the searcher’s interests, tastes, and behaviors, remains available)
Government, third party: X (queries can still be tied to an individual via an identifier)
Malicious disclosure: √ (less identifying information is available, but it is still possible to identify an individual using outside sources)

6. Deleting Infrequent Query (utility)

Viable rationales: ranking, anti-fraud, commercial (based in part on the value of analyzing popular or high-volume queries)
Impeded rationales: language-based applications, query refinement, personalization, academic (these depend on recognizing rare queries and learning how to offer the searcher suggestions or make adjustments behind the scenes)

6. Deleting Infrequent Query (user control)

Difficult
Hard to define an infrequent query

7. Shortening session

Shorten the length of time that any identifier is associated with an individual [Xiong, 2007].
A user may be assigned a new identifier every month, day, or hour
Or when users close their browsers or when they navigate away from the search engine’s site.
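
A minimal sketch of time-based identifier rotation. The class name `RotatingIdentifier` and the use of a random token are assumptions for illustration; the slides only say that a new identifier is assigned after some period.

```python
import secrets
from datetime import datetime, timedelta


class RotatingIdentifier:
    """Issue a fresh random identifier once the current one is too old."""

    def __init__(self, lifetime: timedelta):
        self.lifetime = lifetime
        self._issued_at = None
        self._value = None

    def current(self, now: datetime) -> str:
        expired = self._value is None or now - self._issued_at >= self.lifetime
        if expired:
            # The new value is random, so it cannot be derived from the old one.
            self._value = secrets.token_hex(8)
            self._issued_at = now
        return self._value
```

Queries logged within one lifetime share an identifier and remain linkable; queries from different lifetimes do not. Rotation on browser close or on leaving the site would use the same idea with a different trigger.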

7. Shortening session (privacy protection)

Profile misuse: √ (profiles could only be based on data from a narrow window)
Government, third party: √ (shorter sessions can remove the link between a user and his or her entire query history)
Malicious disclosure: X (the query content may still contain identifying information, and queries within the same session may still be linked together)

7. Shortening session (utility)

Viable rationales: ranking, language-based applications, query refinement, academic (based on queries that occur in close proximity to each other)
Impeded rationales: personalization, anti-fraud, commercial (based on sharing historical profiles of users)

7. Shortening session (user control)

Easy: provide the option of clearing identifiers stored on the search engine’s side
Managing which users have requested shorter sessions and when their sessions expire may be expensive

Conclusion

It is possible to collect/retain query logs while protecting user privacy

Technical challenges to strike the balance between retaining query log utility and protecting user privacy

Policy and user education challenges

Motivation

Mass adoption
Number of online social networking sites has increased
Dramatic increase of online network participants each
Information revelation behavior of participants
More open than offline social networks

Online Vs. Offline Networks

Offline social networks contain diverse relations.
Examples – Family, Friend, Co-Worker, Roommate, Acquaintance, Classmate, Teammate, Enemy, etc.
Online social networks simplify relations to simplistic binary relations such as “Friend or not”.
How does someone qualify as a “Friend or not”? What is the measurement?
Most users tend to list anyone (as a Friend) who they know and do not actively dislike.

Online Vs. Offline Networks

An offline social network may include up to a dozen intimate or significant ties and 1000 to 1700 “acquaintances” or “interactions”.

Online social networks can list hundreds of direct “friends” and include hundreds of thousands of additional “friends” within just three degrees of separation from a subject.

Online Social Networks - Privacy Implications

1. The level of identifiability of the information

2. The possible recipients of the information

3. The possible uses of the information

Online Social Networks - Privacy Implications

1. Level of identifiability
Sites that don’t expose user identity may provide enough information to identify the profile’s owner
Examples:
Face re-identification through photos used across different sites
Demographic data
Category-based representations of interests that reveal unique or rare overlaps of hobbies or tastes
Information revelation (two possibilities):
Identifying an “anonymous” profile through previous knowledge of the profile owner’s characteristics or traits (identity disclosure)
Allowing a party to infer previously unknown characteristics or traits about an identified profile (attribute disclosure)

Online Social Networks - Privacy Implications

2. Possible recipients – who has access to the profile information?
Hosting site / company
The site’s social network (in some cases site visitors)
Hackers
Government agencies

Online Social Networks - Privacy Implications

3. Possible uses – how can social network profile information be used?
Dependent upon information provided (may be extensive and intimate in some cases)
Possible uses (risks):
Identity theft
Online/physical stalking
Embarrassment
Blackmail

Analysis - The Facebook.com

Gross and Acquisti, 2005
In June 2005, the authors searched for all “female” and all “male” profiles for CMU Facebook members using Facebook’s advanced search feature and extracted their profile IDs.
Using the extracted IDs, they downloaded a total of 4540 profiles – virtually the entire CMU Facebook population at the time of the study.

The Facebook.com Demographics

The Facebook.com Types and Amounts of Information Disclosed

In general, CMU Facebook members provided large amounts of information.
90.8% of profiles contained an image.
87.8% revealed their birth date.
39.9% listed a phone number.
50.8% listed their current residence.
62.9% listed their relationship status.

Across most categories, the amount of information revealed by female and male users was similar. A notable exception was the phone number, disclosed by substantially more male than female users (47.1% vs. 28.9%).

The Facebook.com Types and Amounts of Information Disclosed

The Facebook.com Data Validity

Names were manually categorized as being one of the following.
Real Name – Name appears to be real (example – can be matched to the visible CMU e-mail address provided at login).
Partial Name – Only a first name is given.
Fake Name – Obviously fake name.

The Facebook.com Data Identifiability

The same evaluation was repeated for Friendster, where the profile name is only the first name of the member (which makes Friendster profiles not as identifiable as Facebook profiles).

The Facebook.com Data Identifiability

Friends networks can also contribute to data validity and identifiability, since adding a friend requires explicit confirmation.

Facebook users typically maintain a very large network of friends.
On average, CMU Facebook members list 78.2 friends at CMU and 54.9 friends at other schools.

The Facebook.com Privacy Implications

The population of Facebook users studied is oblivious, unconcerned, or pragmatic about their personal privacy
Benefit of selectively revealing data to strangers may appear larger than the perceived costs of privacy invasions.
Peer pressure or herding behavior.
Incomplete information about possible privacy implications.
Service’s user interface may drive unchallenged acceptance of default privacy settings.

Users may put themselves at risk for a variety of attacks on their physical or online persona.

Personal data is generously provided.
Limiting privacy preferences are sparingly used.

Stalking

A potential adversary (with an account at the same institution) can determine the likely physical location of the user for large portions of the day based on profile information about:
residence location
class schedule
location of last login

Re-identification

Demographics
45.8% list birthday, gender, and current residence. These can be linked to outside, de-identified data sources such as hospital discharge data.
Face Re-Identification
Social Security Numbers
Hometown and birth date can be used to estimate the first three and middle two digits of a social security number.
Possible to obtain the last four digits (often used in unprotected logins and passwords) through social engineering.
Identity Theft
The majority of profiles contain a current phone number and residence, which are often used for verification by financial institutions.
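
The demographic linkage described above (birthday, gender, and residence matched against a de-identified source) can be sketched as a join on those three fields. All data, field names, and the function `link_records` are fictional and for illustration only.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("birth_date", "gender", "residence")


def link_records(profiles, deidentified):
    """Return (profile, record) pairs whose quasi-identifiers match uniquely."""
    index = defaultdict(list)
    for record in deidentified:
        index[tuple(record[f] for f in QUASI_IDENTIFIERS)].append(record)

    links = []
    for profile in profiles:
        candidates = index.get(tuple(profile[f] for f in QUASI_IDENTIFIERS), [])
        if len(candidates) == 1:  # an ambiguous match re-identifies no one
            links.append((profile, candidates[0]))
    return links


# Entirely fictional toy data.
profiles = [
    {"name": "A. Example", "birth_date": "1985-02-14", "gender": "F", "residence": "Dorm 1"},
    {"name": "B. Example", "birth_date": "1986-07-01", "gender": "M", "residence": "Dorm 2"},
]
deidentified = [
    {"birth_date": "1985-02-14", "gender": "F", "residence": "Dorm 1", "record": "r1"},
    {"birth_date": "1986-07-01", "gender": "M", "residence": "Dorm 2", "record": "r2"},
    {"birth_date": "1986-07-01", "gender": "M", "residence": "Dorm 2", "record": "r3"},
]
```

In the toy data only the first profile links to exactly one record. The second matches two records, so it stays ambiguous, which is why generalizing or withholding any one of the three fields reduces the risk.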

Digital Dossier

Privacy implications of revealing personal information may extend beyond their immediate impact, which can be limited.

With low and decreasing costs of storing digital information, it is possible to monitor and record the evolution of the network and its users’ profiles, thereby building a digital dossier for its participants.

Users may not be concerned about the visibility of personal information now, but may be later when the data could still be available.

Conclusions

Online social networks are both vaster and looser than their offline counterparts.
It is possible for a profile to be connected to thousands of other profiles through the network’s ties.
In the study of CMU users of Facebook:
Individuals’ willingness to provide large amounts of personal information has been quantified.
It has been shown how unconcerned its users appear about privacy risks: personal data is generously provided, and limiting privacy preferences are hardly used.
Based on the information they provide online, users expose themselves to various physical and cyber risks.

Remember, they are always watching … what can we do?

Some advice from privacy campaigners
Use cash when you can.
Do not give your phone number, social-security number or address, unless you absolutely have to.
Do not fill in questionnaires or respond to telemarketers.
Demand that credit and data-marketing firms produce all information they have on you, correct errors and remove you from marketing lists.
Check your medical records often.
Block caller ID on your phone, and keep your number unlisted.
Never leave your mobile phone on; your movements can be traced.
Do not use store credit or discount cards.
If you must use the Internet, encrypt your e-mail, reject all “cookies” and never give your real name when registering at websites.
Better still, use somebody else’s computer.

Privacy Protection Technologies

Access control
De-identification/anonymization
Statistical databases and inference control
Data perturbation
Cryptographic protocols

Thank you
