De-anonymizing Social Networks
Community-Enhanced De-anonymization of Online Social Networks
By Shirin Nilizadeh, Apu Kapadia & Yong-Yeol Ahn
Presented By Elaine Aryeetey
Content
• Introduction
• Terms & Definitions
• Research Purpose
• NS Algorithm
• De-Anonymization Model
• Experiment
• Conclusion
• Q&A
Introduction
• Online social networks (OSNs) are very popular
• Data is released to marketers, for academic research, and for developing new applications
• How can providers profit from data while honoring privacy? The identity of each user is removed
Introduction [figure: OSN providers]
Terms & Definitions
• De-anonymization: Data mining strategy in which anonymous data is cross-referenced with other data sources to re-identify the anonymous data source
• Reference Graph: Network graph used for cross-referencing
• Anonymized Graph: OSN datasets sold to third parties with user identities removed
• Seed: Known common identities in the two graphs
Terms & Definitions
• Community: Group of nodes (people) that are densely connected to each other while having fewer connections to nodes outside the community
• Anonymity: The state of not being identifiable within a set of subjects
• Noise: The extent to which the reference graph differs from the anonymized graph
Social Network Graph
Attack Model
• An adversary with data may try to de-anonymize it
• The adversary has access to two networks, G(V, E) and G′(V′, E′)
• We focus on the cases where V ≈ V′ and E ≈ E′, i.e., where the vertices and edges of the two networks are approximately the same
Some proposed De-Anonymization Approaches
• Adversarial models where the attacker has attribute information from the reference network
[figure: Public Flickr Network and Anonymized Flickr Network]
Some proposed De-Anonymization Approaches
• An attacker may have access to the social network structure with real identities
[figure: Twitter Network and Instagram Network]
Research Purpose
• Study the problem of de-anonymization using network alignment techniques
• Why? Datasets can be de-anonymized by using network alignment techniques to map nodes from the reference graph onto the anonymized graph
Graph Definition
• DEFINITION 1. A graph G(V, E) consists of a set of nodes V that represents the users in the network and a set of undirected edges E ⊆ {e = (u, v) : u, v ∈ V}. We denote the degree of a node v by k_v. Let N = |V| be the total number of nodes in G.
• DEFINITION 2. A graph G's community structure C is a disjoint partition of the vertices in G, namely C = {c_1, c_2, . . . , c_|C|}.
Re-identification algorithm by Narayanan and Shmatikov (NS)
• Won a link prediction contest on an anonymized data set on Kaggle
1. Seed detection: Maps a small number of users (seeds) between two networks by searching for unique subgraphs.
2. Propagation: Expands the set of matched users by incrementally comparing and mapping the neighbors of the previously mapped seeds.
Re-identification algorithm by Narayanan and Shmatikov (NS) - Propagation
• Randomly picks an already-mapped node pair (u, u′) ∈ M, where u ∈ V and u′ ∈ V′
• Randomly picks a node v from the set of unmapped neighbors of u, then compares it with each unmapped node v′ in the set of unmapped neighbors of u′
• The uniqueness of the best match is measured by eccentricity
$$S(v, v') = \frac{\left|\{(w, w') : w \in N(v),\ w' \in N(v'),\ (w, w') \in M\}\right|}{\sqrt{k_v k_{v'}}}$$
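The propagation step can be sketched in Python as follows. This is only an illustration under assumptions: the names `ref_adj`, `anon_adj`, and `mapping` are hypothetical, and the eccentricity formula (gap between the best and second-best score, divided by the standard deviation of all scores) is the usual definition for the NS algorithm rather than something stated on these slides.

```python
import math
import statistics


def similarity(v, v_prime, ref_adj, anon_adj, mapping):
    """Score S(v, v'): count neighbor pairs already in the mapping,
    normalized by sqrt(k_v * k_v')."""
    shared = sum(
        1
        for w in ref_adj[v]
        if w in mapping and mapping[w] in anon_adj[v_prime]
    )
    k_v, k_vp = len(ref_adj[v]), len(anon_adj[v_prime])
    if k_v == 0 or k_vp == 0:
        return 0.0
    return shared / math.sqrt(k_v * k_vp)


def eccentricity(scores):
    """Assumed definition: (best - second best) / standard deviation."""
    if len(scores) < 2:
        return 0.0
    ordered = sorted(scores, reverse=True)
    spread = statistics.pstdev(scores)
    return 0.0 if spread == 0 else (ordered[0] - ordered[1]) / spread


# Toy graphs as adjacency sets (hypothetical data).
ref_adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
anon_adj = {"x": {"y", "z"}, "y": {"x"}, "z": {"x"}}
mapping = {"b": "y", "c": "z"}  # pairs mapped so far

scores = [similarity("a", cand, ref_adj, anon_adj, mapping) for cand in ("x", "y", "z")]
```

A candidate would be accepted only when its score is the maximum and the eccentricity of the score list exceeds a chosen threshold, which keeps ambiguous matches out of the mapping.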
Community-Enhanced De-Anonymization Model
Overview
Community-aware mapping algorithm built upon community-blind mapping algorithms
1. Community detection (disjoint and non-overlapping)
2. Community mapping using already-known seeds and using the network of communities
3. Seed enrichment
4. Global propagation
De-Anonymization Model - Community Detection
[figure: Reference Network and Anonymised Network]
De-Anonymization Model - Community mapping using Seed Identification
• Communities associated with seed nodes can be mapped to each other
[figure: Reference Network and Anonymised Network]
De-Anonymization Model - Community mapping using network of communities
• Each community is a node, and a weighted edge between two communities represents the number of connections between nodes in the two communities
• Computes the similarity score S for each neighbor of the mapped nodes in the right graph
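Building the network of communities can be sketched as below. The function name `community_graph` and the `membership` dictionary are hypothetical, and the sketch assumes every node belongs to exactly one community, as in the disjoint partition described earlier.

```python
from collections import Counter


def community_graph(edges, membership):
    """Collapse node-level edges into weighted community-level edges.

    The weight of (c1, c2) is the number of node-level edges that
    connect a node in c1 to a node in c2.
    """
    weights = Counter()
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu != cv:  # edges inside one community are not counted
            weights[tuple(sorted((cu, cv)))] += 1
    return dict(weights)


# Toy example (hypothetical data): two communities, A and B.
edges = [(1, 2), (2, 3), (3, 4), (1, 4)]
membership = {1: "A", 2: "A", 3: "B", 4: "B"}
coarse = community_graph(edges, membership)
```

The resulting weighted graph is much smaller than the original, so a node-level mapping algorithm can be run on it to map communities that contain no seeds.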
De-Anonymization Model - Seed Enrichment & Local Propagation
• Finding more seeds at the community level using distance metrics
[figure: Reference Network and Anonymised Network]
De-Anonymization Model - Seed Enrichment & Local Propagation: Distance metric
Nodes are matched and identified as seeds if either their degrees or their clustering coefficients are similar enough and the match is above a certain eccentricity threshold.
$$D_d(v_i, v_j) = \frac{|d(v_i) - d(v_j)|}{\max(d(v_i), d(v_j))}$$
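The degree distance above is a relative difference, which a short Python sketch makes concrete. The function name and the handling of two zero-degree nodes are assumptions added for illustration.

```python
def degree_distance(d_i, d_j):
    """Relative degree difference: |d_i - d_j| / max(d_i, d_j).

    Returns 0.0 when both degrees are zero (assumed convention, since
    the formula is undefined in that case).
    """
    top = max(d_i, d_j)
    return 0.0 if top == 0 else abs(d_i - d_j) / top


# Degrees 8 and 10 differ by 2 out of a maximum of 10.
example = degree_distance(8, 10)
```

Because the value is normalized to the range 0 to 1, the same threshold can be applied to low-degree and high-degree nodes.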
De-Anonymization Model - Global Propagation
Applies the community-blind mapping algorithm to the whole network using all the currently mapped nodes as seeds. Necessary because
1. Communities may not be mapped correctly
2. Communities may not be mapped at all
De-Anonymization Model - Degree of Anonymity
Estimating the degree of anonymity in the anonymized network, given G(V, E) (reference) and G′(V′, E′) (anonymized):
• Anonymity for a user u ∈ V
• Normalized degree of anonymity for user u
• Degree of anonymity for the whole system
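The formulas for these three quantities did not survive in this transcript. The sketch below assumes the common entropy-based definition, which is consistent with the results being reported in bits: anonymity is the Shannon entropy of the attacker's probability distribution over candidate nodes, the normalized degree divides by the maximum possible entropy, and the system-wide value is the average over users. All function names are hypothetical.

```python
import math


def anonymity_bits(probs):
    """Shannon entropy (in bits) of the distribution over candidates."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def degree_of_anonymity(probs, n_candidates):
    """Entropy normalized by its maximum, log2(n_candidates)."""
    if n_candidates < 2:
        return 0.0
    return anonymity_bits(probs) / math.log2(n_candidates)


def system_degree_of_anonymity(per_user_degrees):
    """Assumed aggregate: the mean of the per-user degrees."""
    return sum(per_user_degrees) / len(per_user_degrees)


# A user hidden uniformly among 8 candidates keeps 3 bits of anonymity.
uniform = [1 / 8] * 8
# A user the attacker is certain about keeps none.
certain = [1.0] + [0.0] * 7
```

Under this reading, a degree of 1 means the attacker has learned nothing about the user and a degree of 0 means the user is fully re-identified.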
De-Anonymization Model - Degree of Anonymity for Community Blind
P(u ∼ u′ | M) can be assigned for every u ∈ V:
1. If u is mapped by the algorithm to z′, the vertices u′ ∈ V′ can be partitioned as:
a. The mapped node z′, (u, z′) ∈ M
b. Nodes y′ not mapped to u, (u, y′) ∉ M
2. If u is not mapped, (u, u′) ∉ M, we consider the entire set of candidates: P(u ∼ u′ | M) = 1/|V′|
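De-Anonymization Model - Degree of Anonymity for Community Aware
• The model also tracks the set of communities in G′ that are not mapped
• c ↔ c′ denotes a pair of communities mapped by the algorithm
• P(u ∼ u′), conditioned on the node mapping and the community mapping, can be assigned values for all u′ ∈ V′ based on the following cases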
De-Anonymization Model - Degree of Anonymity for Community Aware
1. If u is mapped to z′, and u ∈ c where c is mapped to c′ and z′ ∈ c′, V′ is partitioned as
a. The mapped node z′
b. Nodes y′ within c′ not mapped to u
c. The remaining nodes r′ not in c′
p_map,1a + p_map,1b + p_map,1c = 1
De-Anonymization Model - Degree of Anonymity for Community Aware
2. If u is mapped to z′, where u ∈ c and c is mapped to c′ and z′ ∉ c′, V′ is partitioned as
a. The mapped node z′
b. Nodes y′ within c′
c. The remaining nodes r′ not in c′ and not mapped to u
p_map,2a + p_map,2b + p_map,2c = 1
De-Anonymization Model - Degree of Anonymity for Community Aware
3. If u is mapped to z′, where u ∈ c and c is not mapped to any community, V′ is partitioned as
a. The mapped node z′
b. The remaining nodes r′ not mapped to u
4. If u is not mapped to any node in G′ and the community of u is not mapped to any community
a. The correct mapping is within the entire set
De-Anonymization Model - Degree of Anonymity for Community Aware
5. If u is not mapped to any node in G′ but the community of u, c, is mapped to community c′, V′ is partitioned as
a. Nodes within c′
b. The remaining nodes not in c′
Experiments
• An ensemble of networks with the same number of nodes and edges, the same noise level, and the same type of noise
• A simpler model is used
• Prepare a copy of the original network, partially alter its structure, and compare the network alignment performance of two approaches: community-aware and community-blind
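Preparing a noisy copy of a network could look like the sketch below. The slides do not say which type of noise was applied, so this assumes one simple choice: delete a fraction of the edges and add the same number of random new edges. The function name, parameters, and fixed random seed are all hypothetical.

```python
import random


def perturb_edges(edges, nodes, noise, seed=0):
    """Return a noisy copy of an undirected edge set.

    Assumed noise model: remove round(noise * |E|) edges and add the
    same number of new random edges, keeping the edge count constant
    whenever enough unused node pairs exist.
    """
    rng = random.Random(seed)
    original = {tuple(sorted(e)) for e in edges}
    n_change = round(noise * len(original))

    removed = set(rng.sample(sorted(original), n_change))
    kept = original - removed

    node_list = sorted(nodes)
    # Never try to add more edges than there are unused node pairs.
    free_pairs = len(node_list) * (len(node_list) - 1) // 2 - len(original)
    n_add = min(n_change, free_pairs)

    added = set()
    while len(added) < n_add:
        u, v = rng.sample(node_list, 2)
        edge = tuple(sorted((u, v)))
        if edge not in original:
            added.add(edge)
    return kept | added


# Toy example (hypothetical data): a 10-node ring with 20% noise.
ring_nodes = list(range(10))
ring_edges = [(i, (i + 1) % 10) for i in ring_nodes]
noisy = perturb_edges(ring_edges, ring_nodes, noise=0.2)
```

Repeating this with different seeds would give an ensemble of networks at a single noise level, in the spirit of the setup described on the next slides.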
Experiments - Datasets
Data Set                                  Nodes    Edges     Date Range
Collaboration network                     36,458   171,735   Jan 1, 1995 to Mar 31, 2005
Twitter mention network (4 partitions)    90,332   377,588   Mar 24, 2012 to Apr 25, 2012
Twitter mention network (9 partitions)    9,745    50,164    Mar 24, 2012 to Apr 25, 2012
Experiments - Setup
• Generate an ensemble of 10 networks for each noise level
• Run the InfoMap community detection algorithm
• The attacker has little prior knowledge (only a small number of seeds)
• The same small set of initial seeds is used for the community-blind and community-aware algorithms
• Performance calculation
Experiments - Results: Impact of noise
Experiments - Results
• For 10% noise and 16 seeds, the degree of anonymity A(G) is 0.45 with the community-aware (CA) algorithm and 0.83 with the community-blind (CB) algorithm (anonymity of 6.81 and 12.57 bits, respectively)
• In the collaboration network, the community-aware algorithm correctly maps about 15% of users, while the community-blind algorithm can barely re-identify any user. The degree of anonymity is 0.84 and 1 (anonymity of 12.72 and 15.15 bits)
• The difference between the performance of the two algorithms increases greatly when the noise is above 15% and 20%
Experiments - Results: Impact of number of seeds
Experiments - Results
• In the Twitter network with four seeds, the CA algorithm successfully re-identifies 77% of users, while the CB algorithm re-identifies only about 7% of users. The degree of anonymity is about 0.13 and 0.97 (anonymity of 2.14 and 15.97 bits)
• The community-aware algorithm decreases the anonymity by 13.83 additional bits compared to the community-blind algorithm
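The bit values above can be checked with a short calculation. It assumes that the degree of anonymity is normalized by log2 of the number of nodes and that these figures refer to the 90,332-node Twitter network; neither assumption is stated on this slide, but both reproduce the reported numbers.

```latex
\begin{align*}
\log_2 90{,}332 &\approx 16.46 \\
0.13 \times 16.46 &\approx 2.14 \text{ bits (community-aware)} \\
0.97 \times 16.46 &\approx 15.97 \text{ bits (community-blind)} \\
15.97 - 2.14 &= 13.83 \text{ bits}
\end{align*}
```

The same reading fits the collaboration network: log2 of 36,458 is about 15.15, which matches the 15.15 bits reported for a degree of anonymity of 1.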
Experiments - Results: Network Size
• The performance difference between the community-aware and community-blind algorithms is more pronounced when the network is bigger
• On a smaller network, both algorithms perform better at re-identifying users and tolerating noise
• The community-aware approach exhibits a slightly higher error rate in some cases, but most of these occur when the community-blind approach fails completely and the community-aware approach still correctly identifies many more users
Experiments - Results: Overlapped Data Set
Experiments - Results
• The community-aware algorithm reduces the degree of anonymity, while the community-blind algorithm fails regardless of the number of seeds (left column)
• The community-blind algorithm fails completely when the noise level is more than 10%, whereas the community-aware algorithm fails when the noise level is more than 30% (right column)
Conclusion
• The approach does not increase time complexity
• The approach is more robust against noise added to the anonymized data set
• It performs well with fewer known seeds as well as on larger networks
• The approach is not tied to any specific algorithm; other community detection methods and community-blind network alignment algorithms could be "plugged in" to the framework
Conclusion
When mapping two networks that are not identical to each other, the community-based mapping algorithm is almost always guaranteed to reduce the anonymity more and find more successful mappings than the community-blind, global mapping algorithm.
Questions & Comments