Post on 19-Jun-2020
Privacy on the Web
Li Xiong
Department of Mathematics and Computer Science, Emory University
Definitions of Privacy
Right to be let alone (1890s, Brandeis, future US Supreme Court Justice)
a: The quality or state of being apart from company or observation; b: freedom from unauthorized intrusion (Merriam-Webster)
The right of the individual to be protected against intrusion into his personal life or affairs, or those of his family, by direct physical means or by publication of information (Calcutt committee, UK)
Aspects of Privacy
Information privacy
Bodily privacy
Privacy of communications
Territorial privacy
Information Privacy
Establishment of rules governing the collection and handling of personal data
Data about individuals should not be automatically available to other individuals and organizations
The individual must be able to exercise a substantial degree of control over that data and its use.
Information privacy on the web
Large amount of (personal) data collected on the web
Search engine logs
Personal data and blogs on social network sites
…
The data are of great value for both individuals and our society.
The data also pose a significant threat to individuals’ privacy.
Data privacy on the web – some case studies
A comparison of privacy practices of internet service companies
Query log retention and its privacy implications
Information revelation patterns and their privacy implications on social network sites
A race to the bottom: privacy ranking of Internet service companies
Privacy International, 2007
Studied and ranked the privacy practices of key Internet-based companies
Amazon, AOL, Apple, BBC, eBay, Facebook, Friendster, Google, LinkedIn, LiveJournal, Microsoft, MySpace, Skype, Wikipedia, LiveSpace, Yahoo!, YouTube
A Race to the Bottom: Methodologies
Corporate administrative details
Data collection and processing
Data retention
Openness and transparency
Customer and user control
Privacy enhancing innovations and privacy invasive innovations
A race to the bottom: interim results revealed
Why Google
Retains a large quantity of information about users, often for an unstated or indefinite length of time, without clear limitation on subsequent use or disclosure
Maintains records of all search strings with associated IP addresses and time stamps for at least 18-24 months
Additional personal information from user profiles in Orkut
Uses an advanced profiling system for ads
Data privacy on the web – some case studies
A comparison of privacy practices of internet service companies
Query log retention and its privacy implications
Information revelation patterns and their privacy implications on social network sites
Query Log
AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com
(Source: AOL Query Log)
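To make the log layout concrete, here is a minimal Python sketch that reads rows with the five columns shown above and groups queries by AnonID. It assumes a tab-separated file with a header row; the `SAMPLE` string and the `parse_log` helper are illustrative names, not part of the original material.

```python
import csv
import io
from datetime import datetime

# Sample rows in the assumed tab-separated layout (values taken from the table above).
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "217\tlottery\t2006-03-01 11:58:51\t1\thttp://www.calottery.com\n"
    "1268\tgall stones\t2006-05-11 02:12:51\t\t\n"
    "1268\tgallstones\t2006-05-11 02:13:02\t1\thttp://www.niddk.nih.gov\n"
)

def parse_log(text):
    """Yield one dict per row; rank and click_url are None when no result was clicked."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        yield {
            "anon_id": row["AnonID"],
            "query": row["Query"],
            "time": datetime.strptime(row["QueryTime"], "%Y-%m-%d %H:%M:%S"),
            "rank": int(row["ItemRank"]) if row["ItemRank"] else None,
            "click_url": row["ClickURL"] or None,
        }

rows = list(parse_log(SAMPLE))

# Group queries by the pseudonymous user ID: this is all it takes to build a per-user history.
queries_by_user = {}
for r in rows:
    queries_by_user.setdefault(r["anon_id"], []).append(r["query"])
```

Even with the user name replaced by a number, every query from the same person stays linked through AnonID.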
A Face Is Exposed for AOL Searcher No. 4417749
20 million Web search queries released by AOL
User 4417749
“numb fingers”
“60 single men”
“dog that urinates on everything”
“landscapers in Lilburn, Ga”
Several people's names with the last name Arnold
“homes sold in shadow lake subdivision gwinnett county georgia”
Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her dogs
Privacy Risks of Query Log
Accidental or malicious disclosure: disclosure of information that users intended to keep private, or that may harm them when released
Compelled disclosure to third parties: query logs may be subject to subpoena as part of civil litigation between individuals or organizations
Disclosure to the government: query logs may be subject to government demands in the context of law enforcement or intelligence investigations
Misuse of user profiles: the retention of query logs may allow the creation of detailed profiles of individuals’ interests, preferences, and behaviors.
Query Log Retention Rationale
Improving ranking algorithms and quality of search results
Language-based applications such as spelling correction
Query refinement
Personalization
Combating fraud and abuse
Sharing data for academic research
Sharing data for marketing and other commercial purposes
Query log retention
Analyze potential techniques/practices for query log retention:
How well the technique protects privacy
How well the technique preserves the utility of the query logs
How well the technique might be implemented as a user control (a mechanism that allows users to choose to apply the technique)
1. Log Deletion
Erase users’ complete query logs
Erasure may occur as early as when the search engine returns search results to the user.
Privacy: the most privacy-enhancing technique available
Utility: drops to zero once the logs are erased
If the query log is kept longer before erasure, the search engine could gain some of the benefits of log analysis and storage
User control: straightforward; users either have their logs retained or deleted
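A minimal sketch of log deletion with a retention window, assuming each entry carries a timestamp. The `purge_expired` function and the sample entries are hypothetical; a retention of zero corresponds to erasing as soon as results are returned.

```python
from datetime import datetime, timedelta

def purge_expired(log, now, retention):
    """Keep only entries newer than the retention cutoff; older entries are dropped entirely."""
    cutoff = now - retention
    return [entry for entry in log if entry["time"] > cutoff]

# Hypothetical in-memory log.
log = [
    {"anon_id": "217", "query": "lottery", "time": datetime(2006, 3, 1, 11, 58, 51)},
    {"anon_id": "217", "query": "lottery", "time": datetime(2006, 3, 27, 14, 10, 38)},
]
now = datetime(2006, 3, 28)

kept_week = purge_expired(log, now, timedelta(days=7))  # longer window: some utility remains
kept_none = purge_expired(log, now, timedelta(0))       # immediate deletion: nothing remains
```

The retention parameter is the trade-off knob: the longer the window, the more analysis is possible and the more data is exposed.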
2. Hashing queries
Two approaches:
Entire queries could be hashed, so that the resulting log contains a hash value
Tokenize the query and hash each token, resulting in a set of hash values
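The two approaches can be sketched as follows. This assumes unsalted SHA-256, lowercasing, and whitespace tokenization, none of which the slides specify; the function names are illustrative.

```python
import hashlib

def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def hash_whole_query(query):
    """Approach 1: one hash value for the entire (normalized) query string."""
    return sha256_hex(query.strip().lower())

def hash_tokens(query):
    """Approach 2: split on whitespace and hash each token separately."""
    return [sha256_hex(tok) for tok in query.strip().lower().split()]

a = hash_tokens("gall stones")
b = hash_tokens("kidney stones")
shared = set(a) & set(b)  # token hashes still reveal that the two queries share a term
```

Token-based hashing keeps more utility (shared terms remain comparable) at the cost of leaking more structure. With an unsalted hash, anyone can also hash candidate terms and look for matches, which is one way to inquire about particular query terms.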
2. Hashing queries (privacy protection)
√ Profile misuse: a user’s interests or behaviors are no longer available
√ X Government, √ X Third party: cannot ask for all queries associated with a particular user, IP address, or cookie ID, but can still inquire about particular query terms
√ Malicious: hashed sensitive information is hard to reverse
2. Hashing queries (utility)
Viable rationales: ranking, anti-fraud, language-based applications, query refinement (if token-based hashing is used, or if the search engine retains aggregate statistics about queries, or some subset of queries, in an unhashed form)
Impeded rationales: personalization, some academic, commercial (these rely on tying query logs to particular users)
2. Hashing queries (user control)
Straightforward
The technique’s effectiveness in protecting privacy may actually increase for those who choose to adopt it if not all individuals make use of it (because reverse-engineering attacks rely on statistics).
3. Identifier Deletion
Identifier: IP address, cookie IDs
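A minimal sketch of identifier deletion, in two variants: dropping the identifiers entirely, or keeping only a partial identifier. The field names (`ip`, `cookie_id`) and the choice of keeping the first 24 bits of an IPv4 address are assumptions for illustration.

```python
import ipaddress

def delete_identifiers(entry):
    """Full deletion: drop the IP address and cookie ID fields."""
    return {k: v for k, v in entry.items() if k not in ("ip", "cookie_id")}

def truncate_ip(entry, prefix_len=24):
    """Partial deletion: keep only the network part of the IPv4 address, drop the cookie ID."""
    net = ipaddress.ip_network(f"{entry['ip']}/{prefix_len}", strict=False)
    out = delete_identifiers(entry)
    out["ip_prefix"] = str(net.network_address)
    return out

# Hypothetical log entry (192.0.2.0/24 is a documentation-only address range).
entry = {"ip": "192.0.2.77", "cookie_id": "abc123", "query": "gallstones"}
full = delete_identifiers(entry)
partial = truncate_ip(entry)
```

Note that the query text itself is untouched in both variants.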
3. Identifier Deletion (privacy protection)
√ Profile misuse: any user profile based on a remaining partial identifier is unlikely to be correlated to specific individuals, computers, or browsers
√ Government, √ Third party: difficult to request all query logs corresponding to a specific user, IP address, or cookie ID
X Malicious: not protected if users query their own personal information
3. Identifier Deletion (utility)
Viable rationales: language-based applications, query refinement (those analyzing the data can use other query information, e.g., timestamps and query content)
Impeded rationales: ranking, anti-fraud, personalization, academic, commercial (these rely on tying query logs to particular users)
4. Hashing Identifiers
Identifier: IP address, cookie IDs
4. Hashing Identifiers (privacy protection)
X Profile misuse: still possible to link a single user’s queries together
X Government, X Third party: exposed if civil litigants or government authorities possess the input into the hash for a particular user (e.g., the user’s IP address and/or cookie ID)
X Malicious: not protected if users query their own personal information
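The weakness of hashed identifiers can be shown in a few lines. This sketch assumes an unsalted SHA-256 over the IP address and cookie ID; the log contents and the `hash_identifier` helper are hypothetical.

```python
import hashlib

def hash_identifier(ip, cookie_id):
    """Replace the raw identifiers with a single unsalted SHA-256 digest."""
    return hashlib.sha256(f"{ip}|{cookie_id}".encode("utf-8")).hexdigest()

# Hypothetical raw log: (ip, cookie_id, query).
raw_log = [
    ("192.0.2.77", "abc123", "numb fingers"),
    ("198.51.100.4", "zzz999", "lottery"),
    ("192.0.2.77", "abc123", "landscapers in lilburn ga"),
]
hashed_log = [(hash_identifier(ip, c), q) for ip, c, q in raw_log]

# A party that already knows a user's IP address and cookie ID can recompute the hash...
known = hash_identifier("192.0.2.77", "abc123")
# ...and pull out every query linked to it.
linked = [q for h, q in hashed_log if h == known]
```

The raw identifiers are gone from the log, but the hash is a stable pseudonym, so a single user's queries remain linked to each other.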
4. Hashing Identifiers (utility)
Viable rationales: ranking, language-based applications, query refinement, personalization, academic, commercial (hashing preserves the correlation between individual users and their queries)
Impeded rationales: anti-fraud (difficult to recover a fraudulent user’s external identifiers just by looking at the query logs)
5. Scrubbing Query Content
Remove identifying information: phone numbers, Social Security numbers, credit card numbers, addresses, and names
Distinguish identifying information from the remainder of the query content [Xiong and Agichtein 2007]
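A minimal scrubbing sketch using regular expressions. The patterns below are simplified, US-centric assumptions (they are not from the cited work) and cover only phone numbers, Social Security numbers, and card-like digit runs. Names and addresses have no fixed shape and are not handled here.

```python
import re

# Illustrative patterns only; applied in order, longest digit runs first.
PATTERNS = [
    ("[CARD]", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),       # 13 to 16 digits, optional separators
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),           # 3-2-4 digit groups
    ("[PHONE]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")), # 3-3-4 digit groups
]

def scrub(query):
    """Replace each match of a known identifying pattern with a placeholder."""
    for placeholder, pattern in PATTERNS:
        query = pattern.sub(placeholder, query)
    return query
```

For example, `scrub("call 404-555-0123")` yields `"call [PHONE]"`, while a query such as `"ozark horse blankets"` passes through unchanged.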
5. Scrubbing Query Content (privacy protection)
X Profile misuse: all the other information of value (the searcher’s interests, tastes, and behaviors) remains available
X Government, X Third party: queries can still be tied to an individual via an identifier
√ Malicious: but it is still possible to identify an individual with outside sources
5. Scrubbing Query Content (utility)
Viable rationales: ranking, language-based applications, query refinement, anti-fraud, personalization, academic, commercial (scrubbing preserves the correlation between individual users and their queries)
Impeded rationales: only those with the specific purpose of analyzing identifying information
5. Scrubbing Query Content (user control)
Two ways:
Scrub everything the search engine deems to be identifying
User defined
6. Deleting Infrequent Query
Remove queries appearing infrequently
The vast majority of queries occur a small number of times [Beitzel et al. 2004; Spink et al. 2001]
Queries that are infrequent now will not necessarily be infrequent forever (new professional athletes or celebrities, new slang, new product names, etc.)
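A minimal sketch of frequency-based filtering. The threshold `min_count` and the normalization (trim and lowercase) are assumptions; choosing them is exactly the hard part of the technique.

```python
from collections import Counter

def drop_infrequent(queries, min_count=2):
    """Keep only queries whose normalized form occurs at least `min_count` times in the log."""
    counts = Counter(q.strip().lower() for q in queries)
    return [q for q in queries if counts[q.strip().lower()] >= min_count]

# Hypothetical log of query strings.
log = ["lottery", "gallstones", "Lottery", "ozark horse blankets", "lottery"]
kept = drop_infrequent(log, min_count=2)
```

Counts are computed over one snapshot of the log, so a query that is rare today is dropped even if it becomes popular later.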
6. Deleting Infrequent Query (privacy protection)
X Profile misuse: all the other information of value (the searcher’s interests, tastes, and behaviors) remains available
X Government, X Third party: queries can still be tied to an individual via an identifier
√ Malicious: less identifying information is available, but it is still possible to identify an individual with outside sources
6. Deleting Infrequent Query (utility)
Viable rationales: ranking, anti-fraud, commercial (based in part on the value of analyzing popular or high-volume queries)
Impeded rationales: language-based applications, query refinement, personalization, academic (these depend on recognizing rare queries and learning how to offer the searcher suggestions or make adjustments behind the scenes)
6. Deleting Infrequent Query (user control)
Difficult
Hard to define an infrequent query
7. Shortening session
Shorten the length of time that any identifier is associated with an individual [Xiong, 2007].
A user may be assigned a new identifier every month, day, or hour
Or when users close their browsers or when they navigate away from the search engine’s site.
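One way to sketch time-based identifier rotation is to derive the logged identifier from a stable ID plus the current time window. The keyed HMAC construction, the `SECRET_KEY` value, and the one-hour default are assumptions for illustration, not the method from the cited work.

```python
import hashlib
import hmac
from datetime import datetime, timezone

SECRET_KEY = b"hypothetical-server-side-secret"  # assumption: known only to the operator

def session_identifier(stable_id, when, window_seconds=3600):
    """Derive an identifier that changes every `window_seconds` (default: hourly)."""
    window = int(when.timestamp()) // window_seconds
    msg = f"{stable_id}|{window}".encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]

t1 = datetime(2006, 3, 1, 11, 5, tzinfo=timezone.utc)
t2 = datetime(2006, 3, 1, 11, 55, tzinfo=timezone.utc)
t3 = datetime(2006, 3, 1, 12, 5, tzinfo=timezone.utc)

same_hour = session_identifier("cookie-abc", t1) == session_identifier("cookie-abc", t2)
next_hour = session_identifier("cookie-abc", t1) == session_identifier("cookie-abc", t3)
```

Queries within one window stay linkable; queries in different windows do not, unless someone holds the key. A stricter variant would issue fresh random identifiers per window and keep no mapping at all.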
7. Shortening session (privacy protection)
√ Profile misuse: profiles could only be based on data from a narrow window
√ Government, √ Third party: shorter sessions can remove the link between a user and his or her entire query history
X Malicious: the query content may still contain identifying information, and queries within the same session may still be linked together
7. Shortening session (utility)
Viable rationales: ranking, language-based applications, query refinement, academic (based on queries that occur in close proximity to each other)
Impeded rationales: personalization, anti-fraud, commercial (based on sharing historical profiles of users)
7. Shortening session (user control)
Easy: provide the option of clearing identifiers stored on the search engine’s side
Managing which users have requested shorter sessions and when their sessions expire may be expensive
Conclusion
It is possible to collect/retain query logs while protecting user privacy
Technical challenges to strike the balance between retaining query log utility and protecting user privacy
Policy and user education challenges
Data privacy on the web – some case studies
A comparison of privacy practices of internet service companies
Query log retention and its privacy implications
Information revelation patterns and their privacy implications on social network sites
Motivation
Mass adoption
Number of online social networking sites has increased
Dramatic increase of online network participants each year
Information revelation behavior of participants
More open than offline social networks
Online Vs. Offline Networks
Offline social networks contain diverse relations.
Examples – Family, Friend, Co-Worker, Roommate, Acquaintance, Classmate, Teammate, Enemy, etc.
Online social networks simplify relations to simplistic binary relations such as “Friend or not”.
How does someone qualify as a “Friend or not”? What is the measurement?
Most users tend to list anyone (as a Friend) who they know and do not actively dislike.
Online Vs. Offline Networks
An offline social network may include up to a dozen intimate or significant ties and 1000 to 1700 “acquaintances” or “interactions”.
Online social networks can list hundreds of direct “friends” and include hundreds of thousands of additional “friends” within just three degrees of separation from a subject.
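To see how quickly reach grows with degrees of separation, a breadth-first search can count the profiles within a given number of hops. The toy `graph` below is hypothetical and far smaller than a real network.

```python
from collections import deque

def within_degrees(graph, start, max_degree=3):
    """Return the set of profiles reachable from `start` in at most `max_degree` hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_degree:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(node, ()):
            if friend not in seen:
                seen[friend] = seen[node] + 1
                queue.append(friend)
    return {n for n, d in seen.items() if d > 0}

# Hypothetical friendship graph (adjacency lists).
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}
direct = within_degrees(graph, "a", max_degree=1)
three_hops = within_degrees(graph, "a", max_degree=3)
```

With hundreds of direct friends per profile instead of one or two, the same three-hop search covers a very large share of the network.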
Online Social Networks - Privacy Implications
1. The level of identifiability of the information
2. The possible recipients of the information
3. The possible uses of the information
Online Social Networks - Privacy Implications
1. Level of identifiability
Sites that don’t expose user identity may provide enough information to identify the profile’s owner
Examples:
Face re-identification through photos used across different sites
Demographic data
Category-based representations of interests that reveal unique or rare overlaps of hobbies or tastes
Information Revelation (two possibilities)
Identify “anonymous” profile through previous knowledge of profile owner’s characteristics or traits (identity disclosure)
Allowing a party to infer previously unknown characteristics or traits about an identified profile (attribute disclosure)
Online Social Networks - Privacy Implications
2. Possible Recipients – Who has access to the profile information?
Hosting site / Company
The site’s social network (in some cases site visitors)
Hackers
Government Agencies
Online Social Networks - Privacy Implications
3. Possible uses – how can social network profile information be used?
Dependent upon information provided (may be extensive and intimate in some cases)
Possible uses (risks):
Identity theft
Online/physical stalking
Embarrassment
Blackmail
Analysis - The Facebook.com
Gross and Acquisti, 2005
In June 2005, the authors searched for all “female” and all “male” profiles for CMU Facebook members using Facebook’s advanced search feature and extracted their profile IDs.
Using the extracted IDs, they downloaded a total of 4540 profiles – virtually the entire CMU Facebook population at the time of the study.
The Facebook.com Demographics
The Facebook.com Demographics
The Facebook.com Types and Amounts of Information Disclosed
In general, CMU Facebook members provided large amounts of information.
90.8% of profiles contained an image.
87.8% revealed their birth date.
39.9% listed a phone number.
50.8% listed their current residence.
62.9% listed their relationship status.
Across most categories, the amount of information revealed by female and male users was similar. A notable exception was the phone number, disclosed by substantially more male than female users (47.1% vs. 28.9%).
The Facebook.com Types and Amounts of Information Disclosed
The Facebook.com Data Validity
Names were manually categorized as being one of the following.
Real Name – Name appears to be real (example – can be matched to the visible CMU e-mail address provided at login).
Partial Name – Only a first name is given.
Fake Name – Obviously fake name.
The Facebook.com Data Identifiability
The same evaluation was repeated for Friendster, where the profile name is only the first name of the member (which makes Friendster profiles not as identifiable as Facebook profiles).
The Facebook.com Data Identifiability
Friends networks can also contribute to data validity and identifiability since adding a friend requires explicit confirmation.
Facebook users typically maintain a very large network of friends.
On average, CMU Facebook members list 78.2 friends at CMU and 54.9 friends at other schools.
The Facebook.com Privacy Implications
The population of Facebook users studied is oblivious, unconcerned, or pragmatic about their personal privacy
Benefit of selectively revealing data to strangers may appear larger than the perceived costs of privacy invasions.
Peer pressure or herding behavior.
Incomplete information about possible privacy implications.
Service’s user interface may drive unchallenged acceptance of default privacy settings.
Users may put themselves at risk for a variety of attacks on their physical or online persona.
Personal data is generously provided.
Limiting privacy preferences are sparingly used.
The Facebook.com Privacy Implications
Stalking
Potential adversary (with an account at the same institution) can determine the likely physical location of the user for large portions of the day based on profile information about:
Residence location
Class schedule
Location of last login
The Facebook.com Privacy Implications
Re-identification
Demographics: 45.8% list birthday, gender, and current residence. Can be linked to outside, de-identified data sources such as hospital discharge data.
Face Re-Identification
Social Security Numbers: Hometown and birth date can be used to estimate the first three and middle two digits of a social security number. Possible to obtain last four digits (often used in unprotected logins and passwords) through social engineering.
Identity Theft: Majority of profiles contain current phone number and residence, which are often used for verification by financial institutions.
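The demographic linking attack amounts to a join on quasi-identifiers. In this sketch all records are made up, and a ZIP code stands in for current residence; the `link_records` helper and its field names are hypothetical.

```python
def link_records(profiles, deidentified, keys=("birth_date", "gender", "zip")):
    """Match each de-identified record to profiles sharing the same quasi-identifier values."""
    index = {}
    for p in profiles:
        index.setdefault(tuple(p[k] for k in keys), []).append(p["name"])
    matches = []
    for rec in deidentified:
        names = index.get(tuple(rec[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], rec["diagnosis"]))
    return matches

# Made-up public profiles and made-up "de-identified" discharge records.
profiles = [
    {"name": "Alice Example", "birth_date": "1985-02-14", "gender": "F", "zip": "15213"},
    {"name": "Bob Example", "birth_date": "1984-07-01", "gender": "M", "zip": "15213"},
]
discharge = [
    {"birth_date": "1985-02-14", "gender": "F", "zip": "15213", "diagnosis": "asthma"},
    {"birth_date": "1990-01-01", "gender": "M", "zip": "15217", "diagnosis": "fracture"},
]
matches = link_records(profiles, discharge)
```

No name appears in the discharge data, yet a unique combination of the three attributes is enough to attach one.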
The Facebook.com Privacy Implications
Digital Dossier
Privacy implications of revealing personal information may extend beyond their immediate impact, which can be limited.
With low and decreasing costs of storing digital information, it is possible to monitor and record the evolution of the network and its users’ profiles, thereby building a digital dossier for its participants.
Users may not be concerned about the visibility of personal information now, but may be later when the data could still be available.
The Facebook.com Privacy Implications
Conclusions
Online social networks are both vaster and looser than their offline counterparts.
Possible for a profile to be connected to thousands of other profiles through the network’s ties.
In the study of CMU users of Facebook:
Quantified individuals’ willingness to provide large amounts of personal information.
Showed how unconcerned its users appear to be about privacy risks, based on how generously personal data is provided and how rarely limiting privacy preferences are used.
Based on the information they provide online, users expose themselves to various physical and cyber risks.
Remember, they are always watching … what can we do?
Some advice from privacy campaigners
Use cash when you can.
Do not give your phone number, social-security number or address, unless you absolutely have to.
Do not fill in questionnaires or respond to telemarketers.
Demand that credit and data-marketing firms produce all information they have on you, correct errors and remove you from marketing lists.
Check your medical records often.
Block caller ID on your phone, and keep your number unlisted.
Never leave your mobile phone on; your movements can be traced.
Do not use store credit or discount cards.
If you must use the Internet, encrypt your e-mail, reject all “cookies” and never give your real name when registering at websites.
Better still, use somebody else’s computer
Privacy Protection Technologies
Access control
De-identification/anonymization
Statistical databases and inference control
Data perturbation
Cryptographic protocols
Thank you