class-based graph anonymization for social network data shai peretz db seminar (winter 2009)

42
Class-based Class-based graph anonymization graph anonymization for social network for social network data data Shai Peretz Shai Peretz DB Seminar (winter 2009) DB Seminar (winter 2009)

Post on 21-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Class-basedClass-basedgraph anonymization for graph anonymization for

social network datasocial network data

Shai PeretzShai Peretz

DB Seminar (winter 2009) DB Seminar (winter 2009)

Page 2: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

ExampleExample

Such data needs to be anonymized. Such data needs to be widely available for scientific

research.

Page 3: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Anonymization and UtilityAnonymization and Utility

Let G be an interaction graph over sets V, Let G be an interaction graph over sets V, I and edges E. I and edges E.

Our goalOur goal is to produce an is to produce an anonymized anonymized version of G: G’version of G: G’

Give Give privacyprivacy Give Give accurate answers on queries that are accurate answers on queries that are

not privacy revealingnot privacy revealing From G’From G’ Tolerate less accurate answers in other cases Tolerate less accurate answers in other cases

From G’ .From G’ .

Page 4: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Anonymization and UtilityAnonymization and Utility Definitions:Definitions: We now distinguish between a node in the graph, We now distinguish between a node in the graph, vv V, V,

and the corresponding entity and the corresponding entity xx X X Each entity Each entity xx can have a number of attributes, such as can have a number of attributes, such as

age, location, and a unique identifier.age, location, and a unique identifier. We write We write x(v)x(v) to denote the identifier, or “true label” of node v. to denote the identifier, or “true label” of node v.

Page 5: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Anonymization and UtilityAnonymization and Utility

UtilityUtility is judged based on the is judged based on the qualityquality with with which various which various queries can be answered on queries can be answered on the anonymized graphthe anonymized graph

Page 6: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Anonymization MethodsAnonymization Methods

We’ll present two methods for We’ll present two methods for anonymizing the rich interaction graphs:anonymizing the rich interaction graphs: Label listLabel list PartitioningPartitioning

Page 7: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

LABEL LISTS LABEL LISTS

We propose to anonymize interaction We propose to anonymize interaction graphs using “graphs using “label listslabel lists”:”: the the identifieridentifier of each node in the graph is of each node in the graph is

replaced by a list of possible identifiersreplaced by a list of possible identifiers (or (or labels).labels).

Page 8: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

LABEL LISTS (cont.)LABEL LISTS (cont.)

We’ll introduce two approaches:We’ll introduce two approaches: Arbitrary ListArbitrary List Uniform ListUniform List

Page 9: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Arbitrary ListArbitrary List

This is a naïve approachThis is a naïve approach Pros: Pros:

SimpleSimple approach approach Ease of implementationEase of implementation

Page 10: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Arbitrary List (cont.)Arbitrary List (cont.)

Cons:Cons: U7U7 appears only once, appears only once,

therefore immediately therefore immediately identifiableidentifiable

Only Only u1,u2,u3,u4u1,u2,u3,u4 appear in the appear in the first 4 nodes, and therefore first 4 nodes, and therefore those nodes must be some those nodes must be some permutation of those 4 labels.permutation of those 4 labels.

Because we know that u1-u4 Because we know that u1-u4 can only be in the first 4 can only be in the first 4 nodes, we can infer that the 5nodes, we can infer that the 5thth node is node is u5u5 and the 6 and the 6thth is is u6u6

Page 11: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform ListUniform List

We propose an approach to We propose an approach to generate sets generate sets of listsof lists which are which are more structuredmore structured. .

As a consequence, they As a consequence, they avoid “static” avoid “static” attacksattacks, and retain privacy in the face of , and retain privacy in the face of attacks based on background knowledge.attacks based on background knowledge.

Page 12: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform List - Uniform List - Dividing the nodes Dividing the nodes into classesinto classes

First, the nodes V are divided into “classes” of size m, for a parameter m

Every node will have a list of label of size k To guarantee privacy, we must ensure that

there is sufficient diversity in the interactions of nodes in the same class.

Page 13: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform List - Dividing the nodes Uniform List - Dividing the nodes into classes (cont.)into classes (cont.)

Suppose entities u1 and u2 are placed together in a class of size 2.

Then it is clear that u1 and u2 were friends and emailed each other

Page 14: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform List - Dividing the nodes Uniform List - Dividing the nodes into classes (cont.)into classes (cont.)

Safety propertySafety property:: A division of nodes V into classes satisfies the

Class Safety property if for any node, v participates in interactions with at most one node in any class .

v V

S V

, , , , , , , :v i w i v j z S SE w zj

v

z

wi

j

z w

v

zi

j

Safety propertySafety property

Page 15: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

LABEL LISTS - Uniform List, LABEL LISTS - Uniform List, Dividing the nodes into classes Dividing the nodes into classes

(cont.)(cont.) Problem?:

If there is a single entity which has interactions with every other entity

• it is not possible to achieve class safety for any m > 1

In real social networks a user typically interacts with only a small fraction of other entities out of the millions of possibilities

Page 16: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform List - Dividing the nodes Uniform List - Dividing the nodes into classes (cont.)into classes (cont.)

Algorithm :

Here, we take each node v in turn, and insert it in the first class that has fewer than m members, provided that performing this insertion would not violate the safety condition (lines 2–7).

Page 17: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform List - Dividing the nodes Uniform List - Dividing the nodes into classes (cont.)into classes (cont.)

Problem: Queries which involve selections on entity attributes (say, selecting users located in Japan)

Unsure exactly which nodes these correspond to Solution: Placing entities that have the same value on

this attribute in the same class. Given a “workload” describing which attributes are seen

as most important for querying (say, location is first, followed by age), the input can be sorted under this ordering of attributes.

This sorting step is indicated inline 1 of the Algorithm.

Page 18: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform List, Generating and Uniform List, Generating and assigning assigning (k,m)-Uniform lists(k,m)-Uniform lists

Prefix patternsPrefix patterns: A prefix pattern occurs when : A prefix pattern occurs when the pattern p ={0,1,2, . . . k−1}. the pattern p ={0,1,2, . . . k−1}.

Full patternFull pattern. In the full pattern case k =m and so . In the full pattern case k =m and so the only possible pattern is p = {0,1,2, . . .m−1}the only possible pattern is p = {0,1,2, . . .m−1}

Page 19: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform List - Generating and Uniform List - Generating and assigning (k,m)-Uniform listsassigning (k,m)-Uniform lists

The The two parameters k and mtwo parameters k and m clearly affect clearly affect the the tradeoff between privacy and utilitytradeoff between privacy and utility: : a (1,1) uniform list associates each node a (1,1) uniform list associates each node

directly with the corresponding entity, directly with the corresponding entity, allowing allowing full utility but no privacyfull utility but no privacy; ;

a (|V|, |V|) uniform list associates each node a (|V|, |V|) uniform list associates each node with the list of all possible labels, and with the list of all possible labels, and represents an extreme of represents an extreme of minimal utility and minimal utility and maximal privacymaximal privacy. .

Page 20: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Uniform List - Uniform List - SecuritySecurity

Theorem 1: An Theorem 1: An attackerattacker who has who has no no knowledgeknowledge about the original data, can about the original data, can correctlycorrectly guess which entities participate in guess which entities participate in an interaction with an interaction with probability at most 1/kprobability at most 1/k

Theorem 2: An Theorem 2: An attackerattacker who is able to use who is able to use background knowledgebackground knowledge to find the to find the true true identity of at most r nodesidentity of at most r nodes can guess can guess which other entities participate in an which other entities participate in an interaction withinteraction with probability at most 1/(k−r). probability at most 1/(k−r).

Page 21: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

LABEL LISTS - Uniform List, LABEL LISTS - Uniform List, SecuritySecurity

Problem:Problem: If the attacker If the attacker vv has has only one friend in only one friend in AlaskaAlaska All the classes containing nodes which share All the classes containing nodes which share

an interaction with v, an interaction with v, only one has nodes only one has nodes located in Alaskalocated in Alaska

Page 22: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

PARTITIONING APPROACHPARTITIONING APPROACH

To avoid such attacksTo avoid such attacks, we increase the amount of , we increase the amount of masking of data, at the expense of utility. masking of data, at the expense of utility.

Partitioning approach: Instead of releasing the full edge Partitioning approach: Instead of releasing the full edge information, information, only the number of edges between (and only the number of edges between (and within) each subsetwithin) each subset is releasedis released

Thick lines indicate double edgesThick lines indicate double edges

Page 23: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

PARTITIONING APPROACH PARTITIONING APPROACH (cont.)(cont.)

Even if an attacker is somehow able to identify which Even if an attacker is somehow able to identify which node represents an node represents an entityentity, he , he can’t discover his can’t discover his interactionsinteractions..

We make use of the same We make use of the same safety condition as beforesafety condition as before In In FigureFigure, which does , which does not satisfynot satisfy the condition, an attacker can the condition, an attacker can

infer that since u6 and u7 are in the same class they alone infer that since u6 and u7 are in the same class they alone must must share the blog2 interactionshare the blog2 interaction

?

??

?

Page 24: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

PARTITIONING APPROACH PARTITIONING APPROACH (cont.)(cont.)

COROLLARY : An COROLLARY : An attackerattacker who has who has no no background knowledgebackground knowledge about the original data about the original data can correctly can correctly guess which entities participate in guess which entities participate in an interaction withan interaction with probability at most 1/mprobability at most 1/m..

Theorem 3: An Theorem 3: An attacker with background attacker with background knowledge knowledge about interactions of an entity, about interactions of an entity, modeled as knowing the true identity of some modeled as knowing the true identity of some nodes, and the fact that these nodes participate nodes, and the fact that these nodes participate in certain interactions, can correctly in certain interactions, can correctly guess which guess which entities participate in interactionsentities participate in interactions about which about which nothing is known with nothing is known with probability at most 1/mprobability at most 1/m

Page 25: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

PARTITIONING APPROACH PARTITIONING APPROACH (cont.)(cont.)

This This additional resilienceadditional resilience comes at the comes at the cost of cost of reducing the utilityreducing the utility for answering for answering more complex graph more complex graph queriesqueries

Page 26: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

EXPERIMENTAL STUDY - EXPERIMENTAL STUDY - Querying Anonymized DataQuerying Anonymized Data

We’ll present two approaches for We’ll present two approaches for answering queries on anonymized dataanswering queries on anonymized data:: Sampling Consistent GraphsSampling Consistent Graphs Deterministic Query BoundsDeterministic Query Bounds

Page 27: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Querying Anonymized Data - Querying Anonymized Data - Sampling Consistent GraphsSampling Consistent Graphs

A A probabilistic approachprobabilistic approach is for the analyst to is for the analyst to randomly sample a randomly sample a graph that is consistent with the anonymized datagraph that is consistent with the anonymized data, and perform the , and perform the analysis on this graph. analysis on this graph.

The query can be evaluated over the resulting graph; The query can be evaluated over the resulting graph; repeating this repeating this several timesseveral times generates an “expected” answer to the query given generates an “expected” answer to the query given the anonymized data. the anonymized data.

In the full-list and partition cases, sampling a consistent graph takes In the full-list and partition cases, sampling a consistent graph takes time linear in the size of the anonymized data.time linear in the size of the anonymized data.

Page 28: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Querying Anonymized Data - Querying Anonymized Data - Deterministic Query BoundsDeterministic Query Bounds

A more costly approach is to A more costly approach is to search over search over all possible graphsall possible graphs that are that are consistent with consistent with the anonymized datathe anonymized data, and report the range , and report the range of possible answers to the query.of possible answers to the query.

In general, this could be In general, this could be prohibitively prohibitively expensiveexpensive, but for many natural queries, , but for many natural queries, the problem can be broken down.the problem can be broken down. Example: query “how many users in the USA Example: query “how many users in the USA

have used application X?”have used application X?”

Page 29: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

EXPERIMENTAL STUDY – EXPERIMENTAL STUDY – Experimental FrameworkExperimental Framework

Two datasets that vary appreciably in size and Two datasets that vary appreciably in size and graph structuregraph structure:: Xanga networkXanga network consists of about consists of about 780K nodes and 3 780K nodes and 3

million edgesmillion edges. . • Each node in the graph represents a user of Xanga and his Each node in the graph represents a user of Xanga and his

corresponding blog.corresponding blog.• There are There are two types of edgestwo types of edges between nodes representing between nodes representing

the interactions: a the interactions: a subscriptionsubscription (or readership) to another (or readership) to another Xanga blog and an explicit Xanga blog and an explicit friendshipfriendship relation. relation.

• Each Each user has associated profile informationuser has associated profile information, such as the , such as the user’s location, age, and gender, which comprise attributes user’s location, age, and gender, which comprise attributes of the corresponding node in the graph. of the corresponding node in the graph.

Page 30: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

EXPERIMENTAL STUDY – EXPERIMENTAL STUDY – Experimental FrameworkExperimental Framework

Speed datingSpeed dating dataset has dataset has 530 participants530 participants, , and consists of data about and consists of data about 4150 “dates4150 “dates” ” arranged between pairs of individuals. arranged between pairs of individuals. • For each participant, demographic information was For each participant, demographic information was

collected (age, race, field of study), and their level collected (age, race, field of study), and their level of interest in a number of hobbies on a scale of 1 of interest in a number of hobbies on a scale of 1 to 10. to 10.

• Each Each interaction node represents a “date”interaction node represents a “date”, and , and records information about whether each participant records information about whether each participant was positive or negative about their counterpart; was positive or negative about their counterpart; we call this a “match” if both were positive. we call this a “match” if both were positive.

• Each individual participated in from 6 to 22 dates.Each individual participated in from 6 to 22 dates.

Page 31: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

EXPERIMENTAL STUDY – EXPERIMENTAL STUDY – Experimental FrameworkExperimental Framework

Kind of queriesKind of queries:: Pair QueriesPair Queries. These are single-hop queries of the kind: how many . These are single-hop queries of the kind: how many

nodes from one subpopulation interact with nodes of another nodes from one subpopulation interact with nodes of another subpopulation. subpopulation. For instance, how many users from US are friends For instance, how many users from US are friends with users from Australia?with users from Australia?

Trio QueriesTrio Queries. These queries involve two hops in the graph and query for . These queries involve two hops in the graph and query for triples such as, triples such as, how many Americans are friends with Chinese how many Americans are friends with Chinese users who are also friends with Australian users?users who are also friends with Australian users?

Triangle QueriesTriangle Queries. This type counts triangles of individuals (a.k.a. the . This type counts triangles of individuals (a.k.a. the transitivity property among nodes, or the clustering coefficient), that is, transitivity property among nodes, or the clustering coefficient), that is, nodes that have neighbor pairs which are connected. nodes that have neighbor pairs which are connected. For instance, For instance, how many Americans are friends with Chinese and Australians how many Americans are friends with Chinese and Australians who are friends with each other?who are friends with each other?

We We answer queries by sampling 10 consistent graphsanswer queries by sampling 10 consistent graphs Both anonymization and query answering algorithms scale well: Both anonymization and query answering algorithms scale well:

anonymization took less than a minute for the Xangaanonymization took less than a minute for the Xanga data on a data on a standard desktop with 1GB RAM, while standard desktop with 1GB RAM, while queries took under a second queries took under a second eacheach

Page 32: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results – Experimental Results – Uniform List AnonymizationUniform List Anonymization

Q1Q1 (pair query, Xanga dataset):“How many Americans in (pair query, Xanga dataset):“How many Americans in different age groups are friends with residents of Hong different age groups are friends with residents of Hong Kong with age less than 20?”Kong with age less than 20?”

Page 33: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results – Experimental Results – Uniform List AnonymizationUniform List Anonymization

Q (Speed Dating):“How many pairs with Q (Speed Dating):“How many pairs with the same attribute had a match?”the same attribute had a match?”

Page 34: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results – Experimental Results – Uniform List AnonymizationUniform List Anonymization

Work-load of 100 queriesWork-load of 100 queries

Page 35: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results – Experimental Results – Partition anonymizationPartition anonymization

To To generate a consistent graph, we randomly create N edgesgenerate a consistent graph, we randomly create N edges between nodes from two connected groups that satisfy the safety between nodes from two connected groups that satisfy the safety conditioncondition

Q2Q2: “How many Americans are connected to users with degree : “How many Americans are connected to users with degree greater than 10?greater than 10?

Page 36: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results –Experimental Results –selectivityselectivity

Page 37: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility

Sorting:Sorting: Age, Gender, Location, Degree (AGLD)Age, Gender, Location, Degree (AGLD) Gender, Age, Location, Degree (GALD)Gender, Age, Location, Degree (GALD) Location, Gender, Age, Degree (LGAD)Location, Gender, Age, Degree (LGAD) Degree, Gender, Location, Age (DGLA)Degree, Gender, Location, Age (DGLA)

QQ11

Page 38: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility

Q3Q3 ( (trio Querytrio Query): “How many users in the age ): “How many users in the age group 15-20yrs (and varying number of friends) group 15-20yrs (and varying number of friends) are friends with users in the age group are friends with users in the age group 10-15yrs and also with those in the age group 10-15yrs and also with those in the age group 20-25yrs?”20-25yrs?”

Page 39: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility

Sorting on sample pair, trio and triangle queries involving two attributes, primarily age and degree

Page 40: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility

Queries of the three types (pair, trio, triangle), where each predicate is only over a single attribute

The three queries in Figure (e) are: the pair query “How many users of age 20 are friends with users of age 21?”; the trio query “How many users of age 20 are friends with users of age 21 and with users of age 22?”; and the triangle “How many users of age 20 are friends with users of age 21 and age 22 who are also friends?”

Page 41: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility

Full workload of 100 queries

Page 42: Class-based graph anonymization for social network data Shai Peretz DB Seminar (winter 2009)

SummarySummary

We showed that Class-basedWe showed that Class-basedgraph anonymization works, its gives both graph anonymization works, its gives both privacy, and ability to do researchprivacy, and ability to do research

Prioritizing Attributes (Sorting) before Prioritizing Attributes (Sorting) before doing queries is advised, but can be done doing queries is advised, but can be done only once to ensure privacyonly once to ensure privacy

This method for anonymization scales This method for anonymization scales well.well.