class-based graph anonymization for social network data shai peretz db seminar (winter 2009)
Post on 21-Dec-2015
215 views
TRANSCRIPT
Class-basedClass-basedgraph anonymization for graph anonymization for
social network datasocial network data
Shai PeretzShai Peretz
DB Seminar (winter 2009) DB Seminar (winter 2009)
ExampleExample
Such data needs to be anonymized. Such data needs to be widely available for scientific
research.
Anonymization and UtilityAnonymization and Utility
Let G be an interaction graph over sets V, Let G be an interaction graph over sets V, I and edges E. I and edges E.
Our goalOur goal is to produce an is to produce an anonymized anonymized version of G: G’version of G: G’
Give Give privacyprivacy Give Give accurate answers on queries that are accurate answers on queries that are
not privacy revealingnot privacy revealing From G’From G’ Tolerate less accurate answers in other cases Tolerate less accurate answers in other cases
From G’ .From G’ .
Anonymization and UtilityAnonymization and Utility Definitions:Definitions: We now distinguish between a node in the graph, We now distinguish between a node in the graph, vv V, V,
and the corresponding entity and the corresponding entity xx X X Each entity Each entity xx can have a number of attributes, such as can have a number of attributes, such as
age, location, and a unique identifier.age, location, and a unique identifier. We write We write x(v)x(v) to denote the identifier, or “true label” of node v. to denote the identifier, or “true label” of node v.
Anonymization and UtilityAnonymization and Utility
UtilityUtility is judged based on the is judged based on the qualityquality with with which various which various queries can be answered on queries can be answered on the anonymized graphthe anonymized graph
Anonymization MethodsAnonymization Methods
We’ll present two methods for We’ll present two methods for anonymizing the rich interaction graphs:anonymizing the rich interaction graphs: Label listLabel list PartitioningPartitioning
LABEL LISTS LABEL LISTS
We propose to anonymize interaction We propose to anonymize interaction graphs using “graphs using “label listslabel lists”:”: the the identifieridentifier of each node in the graph is of each node in the graph is
replaced by a list of possible identifiersreplaced by a list of possible identifiers (or (or labels).labels).
LABEL LISTS (cont.)LABEL LISTS (cont.)
We’ll introduce two approaches:We’ll introduce two approaches: Arbitrary ListArbitrary List Uniform ListUniform List
Arbitrary ListArbitrary List
This is a naïve approachThis is a naïve approach Pros: Pros:
SimpleSimple approach approach Ease of implementationEase of implementation
Arbitrary List (cont.)Arbitrary List (cont.)
Cons:Cons: U7U7 appears only once, appears only once,
therefore immediately therefore immediately identifiableidentifiable
Only Only u1,u2,u3,u4u1,u2,u3,u4 appear in the appear in the first 4 nodes, and therefore first 4 nodes, and therefore those nodes must be some those nodes must be some permutation of those 4 labels.permutation of those 4 labels.
Because we know that u1-u4 Because we know that u1-u4 can only be in the first 4 can only be in the first 4 nodes, we can infer that the 5nodes, we can infer that the 5thth node is node is u5u5 and the 6 and the 6thth is is u6u6
Uniform ListUniform List
We propose an approach to We propose an approach to generate sets generate sets of listsof lists which are which are more structuredmore structured. .
As a consequence, they As a consequence, they avoid “static” avoid “static” attacksattacks, and retain privacy in the face of , and retain privacy in the face of attacks based on background knowledge.attacks based on background knowledge.
Uniform List - Uniform List - Dividing the nodes Dividing the nodes into classesinto classes
First, the nodes V are divided into “classes” of size m, for a parameter m
Every node will have a list of label of size k To guarantee privacy, we must ensure that
there is sufficient diversity in the interactions of nodes in the same class.
Uniform List - Dividing the nodes Uniform List - Dividing the nodes into classes (cont.)into classes (cont.)
Suppose entities u1 and u2 are placed together in a class of size 2.
Then it is clear that u1 and u2 were friends and emailed each other
Uniform List - Dividing the nodes Uniform List - Dividing the nodes into classes (cont.)into classes (cont.)
Safety propertySafety property:: A division of nodes V into classes satisfies the
Class Safety property if for any node, v participates in interactions with at most one node in any class .
v V
S V
, , , , , , , :v i w i v j z S SE w zj
v
z
wi
j
z w
v
zi
j
Safety propertySafety property
LABEL LISTS - Uniform List, LABEL LISTS - Uniform List, Dividing the nodes into classes Dividing the nodes into classes
(cont.)(cont.) Problem?:
If there is a single entity which has interactions with every other entity
• it is not possible to achieve class safety for any m > 1
In real social networks a user typically interacts with only a small fraction of other entities out of the millions of possibilities
Uniform List - Dividing the nodes Uniform List - Dividing the nodes into classes (cont.)into classes (cont.)
Algorithm :
Here, we take each node v in turn, and insert it in the first class that has fewer than m members, provided that performing this insertion would not violate the safety condition (lines 2–7).
Uniform List - Dividing the nodes Uniform List - Dividing the nodes into classes (cont.)into classes (cont.)
Problem: Queries which involve selections on entity attributes (say, selecting users located in Japan)
Unsure exactly which nodes these correspond to Solution: Placing entities that have the same value on
this attribute in the same class. Given a “workload” describing which attributes are seen
as most important for querying (say, location is first, followed by age), the input can be sorted under this ordering of attributes.
This sorting step is indicated inline 1 of the Algorithm.
Uniform List, Generating and Uniform List, Generating and assigning assigning (k,m)-Uniform lists(k,m)-Uniform lists
Prefix patternsPrefix patterns: A prefix pattern occurs when : A prefix pattern occurs when the pattern p ={0,1,2, . . . k−1}. the pattern p ={0,1,2, . . . k−1}.
Full patternFull pattern. In the full pattern case k =m and so . In the full pattern case k =m and so the only possible pattern is p = {0,1,2, . . .m−1}the only possible pattern is p = {0,1,2, . . .m−1}
Uniform List - Generating and Uniform List - Generating and assigning (k,m)-Uniform listsassigning (k,m)-Uniform lists
The The two parameters k and mtwo parameters k and m clearly affect clearly affect the the tradeoff between privacy and utilitytradeoff between privacy and utility: : a (1,1) uniform list associates each node a (1,1) uniform list associates each node
directly with the corresponding entity, directly with the corresponding entity, allowing allowing full utility but no privacyfull utility but no privacy; ;
a (|V|, |V|) uniform list associates each node a (|V|, |V|) uniform list associates each node with the list of all possible labels, and with the list of all possible labels, and represents an extreme of represents an extreme of minimal utility and minimal utility and maximal privacymaximal privacy. .
Uniform List - Uniform List - SecuritySecurity
Theorem 1: An Theorem 1: An attackerattacker who has who has no no knowledgeknowledge about the original data, can about the original data, can correctlycorrectly guess which entities participate in guess which entities participate in an interaction with an interaction with probability at most 1/kprobability at most 1/k
Theorem 2: An Theorem 2: An attackerattacker who is able to use who is able to use background knowledgebackground knowledge to find the to find the true true identity of at most r nodesidentity of at most r nodes can guess can guess which other entities participate in an which other entities participate in an interaction withinteraction with probability at most 1/(k−r). probability at most 1/(k−r).
LABEL LISTS - Uniform List, LABEL LISTS - Uniform List, SecuritySecurity
Problem:Problem: If the attacker If the attacker vv has has only one friend in only one friend in AlaskaAlaska All the classes containing nodes which share All the classes containing nodes which share
an interaction with v, an interaction with v, only one has nodes only one has nodes located in Alaskalocated in Alaska
PARTITIONING APPROACHPARTITIONING APPROACH
To avoid such attacksTo avoid such attacks, we increase the amount of , we increase the amount of masking of data, at the expense of utility. masking of data, at the expense of utility.
Partitioning approach: Instead of releasing the full edge Partitioning approach: Instead of releasing the full edge information, information, only the number of edges between (and only the number of edges between (and within) each subsetwithin) each subset is releasedis released
Thick lines indicate double edgesThick lines indicate double edges
PARTITIONING APPROACH PARTITIONING APPROACH (cont.)(cont.)
Even if an attacker is somehow able to identify which Even if an attacker is somehow able to identify which node represents an node represents an entityentity, he , he can’t discover his can’t discover his interactionsinteractions..
We make use of the same We make use of the same safety condition as beforesafety condition as before In In FigureFigure, which does , which does not satisfynot satisfy the condition, an attacker can the condition, an attacker can
infer that since u6 and u7 are in the same class they alone infer that since u6 and u7 are in the same class they alone must must share the blog2 interactionshare the blog2 interaction
?
??
?
PARTITIONING APPROACH PARTITIONING APPROACH (cont.)(cont.)
COROLLARY : An COROLLARY : An attackerattacker who has who has no no background knowledgebackground knowledge about the original data about the original data can correctly can correctly guess which entities participate in guess which entities participate in an interaction withan interaction with probability at most 1/mprobability at most 1/m..
Theorem 3: An Theorem 3: An attacker with background attacker with background knowledge knowledge about interactions of an entity, about interactions of an entity, modeled as knowing the true identity of some modeled as knowing the true identity of some nodes, and the fact that these nodes participate nodes, and the fact that these nodes participate in certain interactions, can correctly in certain interactions, can correctly guess which guess which entities participate in interactionsentities participate in interactions about which about which nothing is known with nothing is known with probability at most 1/mprobability at most 1/m
PARTITIONING APPROACH PARTITIONING APPROACH (cont.)(cont.)
This This additional resilienceadditional resilience comes at the comes at the cost of cost of reducing the utilityreducing the utility for answering for answering more complex graph more complex graph queriesqueries
EXPERIMENTAL STUDY - EXPERIMENTAL STUDY - Querying Anonymized DataQuerying Anonymized Data
We’ll present two approaches for We’ll present two approaches for answering queries on anonymized dataanswering queries on anonymized data:: Sampling Consistent GraphsSampling Consistent Graphs Deterministic Query BoundsDeterministic Query Bounds
Querying Anonymized Data - Querying Anonymized Data - Sampling Consistent GraphsSampling Consistent Graphs
A A probabilistic approachprobabilistic approach is for the analyst to is for the analyst to randomly sample a randomly sample a graph that is consistent with the anonymized datagraph that is consistent with the anonymized data, and perform the , and perform the analysis on this graph. analysis on this graph.
The query can be evaluated over the resulting graph; The query can be evaluated over the resulting graph; repeating this repeating this several timesseveral times generates an “expected” answer to the query given generates an “expected” answer to the query given the anonymized data. the anonymized data.
In the full-list and partition cases, sampling a consistent graph takes In the full-list and partition cases, sampling a consistent graph takes time linear in the size of the anonymized data.time linear in the size of the anonymized data.
Querying Anonymized Data - Querying Anonymized Data - Deterministic Query BoundsDeterministic Query Bounds
A more costly approach is to A more costly approach is to search over search over all possible graphsall possible graphs that are that are consistent with consistent with the anonymized datathe anonymized data, and report the range , and report the range of possible answers to the query.of possible answers to the query.
In general, this could be In general, this could be prohibitively prohibitively expensiveexpensive, but for many natural queries, , but for many natural queries, the problem can be broken down.the problem can be broken down. Example: query “how many users in the USA Example: query “how many users in the USA
have used application X?”have used application X?”
EXPERIMENTAL STUDY – EXPERIMENTAL STUDY – Experimental FrameworkExperimental Framework
Two datasets that vary appreciably in size and Two datasets that vary appreciably in size and graph structuregraph structure:: Xanga networkXanga network consists of about consists of about 780K nodes and 3 780K nodes and 3
million edgesmillion edges. . • Each node in the graph represents a user of Xanga and his Each node in the graph represents a user of Xanga and his
corresponding blog.corresponding blog.• There are There are two types of edgestwo types of edges between nodes representing between nodes representing
the interactions: a the interactions: a subscriptionsubscription (or readership) to another (or readership) to another Xanga blog and an explicit Xanga blog and an explicit friendshipfriendship relation. relation.
• Each Each user has associated profile informationuser has associated profile information, such as the , such as the user’s location, age, and gender, which comprise attributes user’s location, age, and gender, which comprise attributes of the corresponding node in the graph. of the corresponding node in the graph.
EXPERIMENTAL STUDY – EXPERIMENTAL STUDY – Experimental FrameworkExperimental Framework
Speed datingSpeed dating dataset has dataset has 530 participants530 participants, , and consists of data about and consists of data about 4150 “dates4150 “dates” ” arranged between pairs of individuals. arranged between pairs of individuals. • For each participant, demographic information was For each participant, demographic information was
collected (age, race, field of study), and their level collected (age, race, field of study), and their level of interest in a number of hobbies on a scale of 1 of interest in a number of hobbies on a scale of 1 to 10. to 10.
• Each Each interaction node represents a “date”interaction node represents a “date”, and , and records information about whether each participant records information about whether each participant was positive or negative about their counterpart; was positive or negative about their counterpart; we call this a “match” if both were positive. we call this a “match” if both were positive.
• Each individual participated in from 6 to 22 dates.Each individual participated in from 6 to 22 dates.
EXPERIMENTAL STUDY – EXPERIMENTAL STUDY – Experimental FrameworkExperimental Framework
Kind of queriesKind of queries:: Pair QueriesPair Queries. These are single-hop queries of the kind: how many . These are single-hop queries of the kind: how many
nodes from one subpopulation interact with nodes of another nodes from one subpopulation interact with nodes of another subpopulation. subpopulation. For instance, how many users from US are friends For instance, how many users from US are friends with users from Australia?with users from Australia?
Trio QueriesTrio Queries. These queries involve two hops in the graph and query for . These queries involve two hops in the graph and query for triples such as, triples such as, how many Americans are friends with Chinese how many Americans are friends with Chinese users who are also friends with Australian users?users who are also friends with Australian users?
Triangle QueriesTriangle Queries. This type counts triangles of individuals (a.k.a. the . This type counts triangles of individuals (a.k.a. the transitivity property among nodes, or the clustering coefficient), that is, transitivity property among nodes, or the clustering coefficient), that is, nodes that have neighbor pairs which are connected. nodes that have neighbor pairs which are connected. For instance, For instance, how many Americans are friends with Chinese and Australians how many Americans are friends with Chinese and Australians who are friends with each other?who are friends with each other?
We We answer queries by sampling 10 consistent graphsanswer queries by sampling 10 consistent graphs Both anonymization and query answering algorithms scale well: Both anonymization and query answering algorithms scale well:
anonymization took less than a minute for the Xangaanonymization took less than a minute for the Xanga data on a data on a standard desktop with 1GB RAM, while standard desktop with 1GB RAM, while queries took under a second queries took under a second eacheach
Experimental Results – Experimental Results – Uniform List AnonymizationUniform List Anonymization
Q1Q1 (pair query, Xanga dataset):“How many Americans in (pair query, Xanga dataset):“How many Americans in different age groups are friends with residents of Hong different age groups are friends with residents of Hong Kong with age less than 20?”Kong with age less than 20?”
Experimental Results – Experimental Results – Uniform List AnonymizationUniform List Anonymization
Q (Speed Dating):“How many pairs with Q (Speed Dating):“How many pairs with the same attribute had a match?”the same attribute had a match?”
Experimental Results – Experimental Results – Uniform List AnonymizationUniform List Anonymization
Work-load of 100 queriesWork-load of 100 queries
Experimental Results – Experimental Results – Partition anonymizationPartition anonymization
To To generate a consistent graph, we randomly create N edgesgenerate a consistent graph, we randomly create N edges between nodes from two connected groups that satisfy the safety between nodes from two connected groups that satisfy the safety conditioncondition
Q2Q2: “How many Americans are connected to users with degree : “How many Americans are connected to users with degree greater than 10?greater than 10?
Experimental Results –Experimental Results –selectivityselectivity
Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility
Sorting:Sorting: Age, Gender, Location, Degree (AGLD)Age, Gender, Location, Degree (AGLD) Gender, Age, Location, Degree (GALD)Gender, Age, Location, Degree (GALD) Location, Gender, Age, Degree (LGAD)Location, Gender, Age, Degree (LGAD) Degree, Gender, Location, Age (DGLA)Degree, Gender, Location, Age (DGLA)
QQ11
Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility
Q3Q3 ( (trio Querytrio Query): “How many users in the age ): “How many users in the age group 15-20yrs (and varying number of friends) group 15-20yrs (and varying number of friends) are friends with users in the age group are friends with users in the age group 10-15yrs and also with those in the age group 10-15yrs and also with those in the age group 20-25yrs?”20-25yrs?”
Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility
Sorting on sample pair, trio and triangle queries involving two attributes, primarily age and degree
Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility
Queries of the three types (pair, trio, triangle), where each predicate is only over a single attribute
The three queries in Figure (e) are: the pair query “How many users of age 20 are friends with users of age 21?”; the trio query “How many users of age 20 are friends with users of age 21 and with users of age 22?”; and the triangle “How many users of age 20 are friends with users of age 21 and age 22 who are also friends?”
Experimental Results –Experimental Results –Prioritizing Attributes for UtilityPrioritizing Attributes for Utility
Full workload of 100 queries
SummarySummary
We showed that Class-basedWe showed that Class-basedgraph anonymization works, its gives both graph anonymization works, its gives both privacy, and ability to do researchprivacy, and ability to do research
Prioritizing Attributes (Sorting) before Prioritizing Attributes (Sorting) before doing queries is advised, but can be done doing queries is advised, but can be done only once to ensure privacyonly once to ensure privacy
This method for anonymization scales This method for anonymization scales well.well.