De-anonymizing Social Networks

Community-Enhanced De-anonymization of Online Social Networks

By Shirin Nilizadeh, Apu Kapadia & Yong-Yeol Ahn

Presented By Elaine Aryeetey

Content

• Introduction
• Terms & Definitions
• Research Purpose
• NS Algorithm
• De-Anonymization Model
• Experiment
• Conclusion
• Q&A

Introduction

• Online social networks (OSNs) are very popular
• Data is released to marketers, academic researchers, and developers of new applications
• How can providers profit from the data while honoring privacy? The identity of each user is removed

Introduction

OSN Providers

Terms & Definitions

• De-anonymization: A data mining strategy in which anonymous data is cross-referenced with other data sources to re-identify the anonymous data source
• Reference Graph: The network graph used for cross-referencing
• Anonymized Graph: An OSN dataset sold to third parties with user identities removed
• Seed: A known common identity in the two graphs

Terms & Definitions

• Community: A group of nodes (people) that are densely connected to each other while having fewer connections to nodes outside the community
• Anonymity: The state of not being identifiable within a set of subjects
• Noise: How the reference graph differs from the anonymized graph

Social Network Graph

Attack Model

• An adversary with access to the data may try to de-anonymize it
• The adversary has access to two networks, G{V, E} and G′{V′, E′}
• We focus on the cases where V ≈ V′ and E ≈ E′, i.e., where the vertices and edges are approximately the same

Some Proposed De-Anonymization Approaches

• Adversarial models where the attacker has attribute information from the reference network

Public Flickr Network Anonymized Flickr Network

Some Proposed De-Anonymization Approaches

• An attacker may have access to the social network structure with real identities

Twitter Network Instagram Network

Research Purpose

Study the problem of de-anonymization using network alignment techniques.

Why? Datasets can be de-anonymized by using network alignment techniques to map nodes from the reference graph into the anonymized graph.

Graph Definition

• DEFINITION 1. A graph G{V, E} consists of a set of vertices V that represents the users in the network and a set of undirected edges E ⊆ {e = (u, v) : u, v ∈ V}. We denote the degree of a node v by k_v. Let N = |V| be the total number of nodes in G.
• DEFINITION 2. A graph G's community structure C is a disjoint partition of the vertices in G, namely C = {c_1, c_2, ..., c_|C|}.
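The two definitions can be made concrete with a small sketch. The toy graph, the node labels, and the helper names `degrees` and `is_disjoint_partition` are illustrative assumptions and do not come from the slides.

```python
def degrees(nodes, edges):
    """Return the degree k_v of every node in an undirected graph."""
    k = {v: 0 for v in nodes}
    for u, v in edges:
        k[u] += 1
        k[v] += 1
    return k


def is_disjoint_partition(nodes, communities):
    """True if the communities are pairwise disjoint and cover every node."""
    seen = set()
    for c in communities:
        if seen & c:          # a node appears in two communities
            return False
        seen |= c
    return seen == set(nodes)


# Toy example (hypothetical data)
V = {"a", "b", "c", "d", "e"}
E = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")}
C = [{"a", "b", "c"}, {"d", "e"}]

k = degrees(V, E)
N = len(V)
```

Here `k["c"]` is 3, `N` is 5, and `C` passes the partition check because no node belongs to two communities.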

Re-identification algorithm by Narayanan and Shmatikov (NS)

Won a Kaggle link prediction contest on an anonymized data set.

1. Seed detection: Maps a small number of users (seeds) between the two networks by searching for unique subgraphs
2. Propagation: Expands the set of matched users by incrementally comparing and mapping the neighbors of the previously mapped seeds

Re-identification algorithm by Narayanan and Shmatikov (NS)

Propagation

• Randomly picks an already-mapped node pair (u, u′) ∈ M, where u ∈ V, u′ ∈ V′
• Randomly picks a node v from the set of unmapped neighbors of u, then compares it with each unmapped node v′ in the set of unmapped neighbors of u′
• Uniqueness is measured by eccentricity

\[
S(v, v') = \frac{\left|\{(w, w') : w \in N(v),\ w' \in N(v'),\ (w, w') \in M\}\right|}{\sqrt{k_v \, k_{v'}}}
\]
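A minimal sketch of the similarity score can make the propagation step easier to follow. The adjacency dictionaries, the toy graphs, and the function names are assumptions for illustration. The eccentricity shown here uses the usual form (best score minus second-best, divided by the standard deviation of all scores), which the slides name but do not spell out.

```python
import math
import statistics


def similarity(v, v_prime, adj, adj_prime, mapping):
    """S(v, v'): mapped neighbor pairs divided by sqrt(k_v * k_v')."""
    k_v, k_vp = len(adj[v]), len(adj_prime[v_prime])
    if k_v == 0 or k_vp == 0:
        return 0.0
    # Count neighbors w of v whose mapped partner w' is a neighbor of v'
    shared = sum(1 for w in adj[v] if mapping.get(w) in adj_prime[v_prime])
    return shared / math.sqrt(k_v * k_vp)


def eccentricity(scores):
    """How far the best score stands out from the runner-up (assumed form)."""
    if len(scores) < 2:
        return 0.0
    top, second = sorted(scores, reverse=True)[:2]
    sigma = statistics.pstdev(scores)
    return (top - second) / sigma if sigma > 0 else 0.0


# Toy reference graph and anonymized graph (hypothetical data)
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
adj_prime = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
mapping = {"a": 1, "b": 2}  # already-mapped pairs M

# Score the unmapped node "c" against the unmapped candidates 3 and 4
scores = [similarity("c", cand, adj, adj_prime, mapping) for cand in (3, 4)]
ecc = eccentricity(scores)
```

In this toy case candidate 3 shares two mapped neighbors with "c" and candidate 4 shares none, so candidate 3 stands out clearly. A mapping would be accepted only if the eccentricity exceeds a chosen threshold.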

Community-Enhanced De-Anonymization Model

Overview

Community-aware mapping algorithm built upon community-blind mapping algorithms:

1. Community detection (disjoint and non-overlapping)
2. Community mapping using already-known seeds and using the network of communities
3. Seed enrichment
4. Global propagation

De-Anonymization Model

Community Detection

Reference Network Anonymised Network

De-Anonymization Model

Community mapping using Seed Identification

Communities associated with seed nodes can be mapped to each other.

Reference Network

Anonymised Network

De-Anonymization Model

Community mapping using network of communities

• Each community is a node, and a weighted edge between two communities represents the number of connections between the nodes in the two communities
• Computes the similarity score for each neighbor of the mapped nodes in the right graph
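The network of communities described above can be sketched as follows. The `membership` dictionary (node to community id), the toy edges, and the function name `community_graph` are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def community_graph(edges, membership):
    """Collapse each community to one node. The weight of an edge between
    two communities counts the node-level connections between them."""
    weights = Counter()
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu != cv:  # edges inside a community are ignored
            weights[frozenset((cu, cv))] += 1
    return weights


# Toy example (hypothetical data): two communities, 0 and 1
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d"), ("d", "e")]
membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}

weights = community_graph(edges, membership)
```

The result is itself a small weighted graph, so a community-blind score such as the NS similarity could in principle be applied to it with communities in place of users.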

De-Anonymization Model

Seed Enrichment & Local Propagation: Finding more seeds at the community level using distance metrics

Reference Network

Anonymised Network

De-Anonymization Model

Seed Enrichment & Local Propagation

Distance metric

Nodes are matched and identified as seeds if either their degrees or their clustering coefficients are similar enough and above a certain eccentricity threshold.

\[
D_d(v_i, v_j) = \frac{\left| d(v_i) - d(v_j) \right|}{\max\left(d(v_i),\, d(v_j)\right)}
\]
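The degree distance is a relative difference, so it lies between 0 (equal degrees) and 1. A minimal sketch, with the function name and the sample degrees chosen only for illustration:

```python
def degree_distance(d_i, d_j):
    """D_d(v_i, v_j) = |d(v_i) - d(v_j)| / max(d(v_i), d(v_j))."""
    top = max(d_i, d_j)
    # Two isolated nodes have identical degree, so treat the distance as 0
    return abs(d_i - d_j) / top if top > 0 else 0.0


near = degree_distance(8, 10)   # degrees differ by 20% of the larger one
same = degree_distance(3, 3)    # identical degrees
far = degree_distance(0, 5)     # maximally different degrees
```

Dividing by the larger degree keeps the metric comparable between hubs and low-degree nodes, which a raw difference would not.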

De-Anonymization Model

Global Propagation

Applies the community-blind mapping algorithm to the whole network, using all the currently mapped nodes as seeds. Necessary because:

1. Communities may not be mapped correctly
2. Communities may not be mapped at all

De-Anonymization Model

Degree of Anonymity: Estimating the degree of anonymity in the anonymized network, given G(V, E) (reference) and G′(V′, E′) (anonymized)

De-Anonymization Model

Anonymity for a user u ∈ V

Normalized degree of anonymity for user u

Degree of anonymity for the whole system
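The formulas for these three quantities are not reproduced in this text. The sketch below assumes the common entropy-based reading: anonymity of a user is the Shannon entropy (in bits) of the probability distribution over candidate nodes, the normalized value divides by the maximum entropy, and the system value averages the normalized values. These definitions and the function names are assumptions, not taken from the slides.

```python
import math


def anonymity_bits(probs):
    """Shannon entropy in bits of one user's candidate distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def normalized_anonymity(probs):
    """Entropy divided by its maximum, log2 of the number of candidates."""
    n = len(probs)
    return anonymity_bits(probs) / math.log2(n) if n > 1 else 0.0


def system_anonymity(per_user_probs):
    """Average of the normalized per-user values (assumed aggregation)."""
    values = [normalized_anonymity(p) for p in per_user_probs]
    return sum(values) / len(values)


uniform = [1 / 8] * 8            # attacker learned nothing about this user
certain = [1.0, 0.0, 0.0, 0.0]   # attacker is sure of the mapping

bits_uniform = anonymity_bits(uniform)
overall = system_anonymity([uniform, certain])
```

Under this reading, a user who could be any of 8 nodes with equal probability has 3 bits of anonymity and a normalized value of 1, while a user mapped with certainty has 0.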

De-Anonymization Model

Degree of Anonymity for Community-Blind

P(u∼u′ | M) can be assigned for each u ∈ V:

1. If u is mapped by the algorithm to z′, the vertices u′ ∈ V′ can be partitioned as:
   a. Mapped node z′, (u, z′) ∈ M
   b. Nodes y′ not mapped to u, (u, y′) ∉ M
2. If u is not mapped, (u, u′) ∉ M, we consider the entire set of nodes: P(u∼u′ | M) = 1/|V′|

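De-Anonymization Model

Degree of Anonymity for Community-Aware

• Some communities in G′ are not mapped by the algorithm
• c ↔ c′ denotes a community pair mapped by the algorithm
• The probability P(u∼u′), conditioned on the node mapping and the community mapping, can be assigned values for all u′ ∈ V′ based on the following cases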

De-Anonymization Model

Degree of Anonymity for Community-Aware

1. If u is mapped to z′, and u ∈ c where c is mapped to c′ and z′ ∈ c′, V′ is partitioned as:
   a. Mapped node z′
   b. Nodes y′ within c′ not mapped to u
   c. The remaining nodes r′ not in c′
   p_map,1a + p_map,1b + p_map,1c = 1

De-Anonymization Model

Degree of Anonymity for Community-Aware

2. If u is mapped to z′, where u ∈ c and c is mapped to c′ and z′ ∉ c′, V′ is partitioned as:
   a. Mapped node z′
   b. Nodes y′ within c′
   c. Remaining nodes r′ not in c′ and not mapped to u
   p_map,2a + p_map,2b + p_map,2c = 1

De-Anonymization Model

Degree of Anonymity for Community-Aware

3. If u is mapped to z′, where u ∈ c and c is not mapped to any community, V′ is partitioned as:
   a. Mapped node z′
   b. Remaining nodes r′ not mapped to u
4. If u is not mapped to any node in G′ and the community of u is not mapped to any community:
   a. The correct mapping is within the entire set V′

De-Anonymization Model

Degree of Anonymity for Community-Aware

5. If u is not mapped to any node in G′ but the community of u, c, is mapped to community c′, V′ is partitioned as:
   a. Nodes within c′
   b. Remaining nodes not in c′

Experiments

• An ensemble of networks with the same number of nodes, edges, noise level, and type of noise
• A simpler model is used
• Prepare a copy of the original network, partially alter its structure, and compare the network alignment performance of the two approaches: community-aware and community-blind
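One way to "partially alter" a copy of a network is sketched below. The slides do not say which type of noise was used, so the choice here (remove a fraction of the edges and add the same number of random new ones, keeping the edge count fixed) is an assumption, as are the function name `add_noise` and the toy graph.

```python
import random


def add_noise(nodes, edges, noise, seed=0):
    """Return a noisy copy of an undirected graph: drop a `noise` fraction
    of the edges and add the same number of random new edges.
    This noise type is an assumption, not taken from the slides."""
    rng = random.Random(seed)
    node_list = sorted(nodes)
    original = {frozenset(e) for e in edges}
    n_change = round(noise * len(original))
    n_pairs = len(node_list) * (len(node_list) - 1) // 2
    if n_change > n_pairs - len(original):
        raise ValueError("not enough non-edges to add")
    # Sort before sampling so the result is reproducible for a given seed
    removed = set(rng.sample(sorted(original, key=sorted), n_change))
    noisy = original - removed
    while len(noisy) < len(original):
        candidate = frozenset(rng.sample(node_list, 2))
        if candidate not in original:  # only add edges that are truly new
            noisy.add(candidate)
    return noisy


# Toy network (hypothetical data): 6 nodes, 10 edges, 20% noise
nodes = range(6)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5),
         (5, 0), (0, 2), (1, 3), (2, 4), (3, 5)]
noisy = add_noise(nodes, edges, noise=0.2)
```

Calling the function with different `seed` values gives an ensemble of noisy copies at the same noise level, on which the two alignment approaches can then be compared.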

Experiments

Datasets

Data Set                                 Nodes    Edges     Date Range
Collaboration network                    36,458   171,735   Jan 1, 1995 to Mar 31, 2005
Twitter mention network (4 partitions)   90,332   377,588   Mar 24, 2012 to Apr 25, 2012
Twitter mention network (9 partitions)   9,745    50,164    Mar 24, 2012 to Apr 25, 2012

Experiments - Setup

• Generate an ensemble of 10 networks for each noise level
• Run the InfoMap community detection algorithm
• The attacker has less prior knowledge (only a small number of seeds)
• A small set of initial seeds for both the community-blind and community-aware algorithms
• Performance calculation

Experiments

Results: Impact of noise


Experiments

Results

• For 10% noise and 16 seeds, A(G) is 0.45 and 0.83 (or anonymity is 6.81 and 12.57 bits) using the community-aware (CA) and community-blind (CB) algorithms, respectively
• In the collaboration network, the community-aware algorithm is able to correctly map about 15% of users, while the community-blind algorithm can barely re-identify any user. The degree of anonymity is 0.84 and 1 (or anonymity is 12.72 and 15.15 bits)
• The difference between the performance of the two algorithms greatly increases when the noise is above 15% and 20%

Experiments

Results: Impact of number of seeds

Experiments

Results

• In the Twitter network with four seeds, the CA algorithm successfully re-identifies 77% of users, while the CB algorithm re-identifies only about 7% of the users. The degree of anonymity is about 0.13 and 0.97 (anonymity is 2.14 and 15.97 bits)
• The community-aware algorithm decreases the anonymity by 13.83 additional bits compared to the community-blind algorithm
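As a quick check of these figures, the 13.83-bit gap is the difference between the two reported anonymity values. The second line is an observation rather than something the slides state: the ratio of bits to normalized degree is about the same for both algorithms and close to the base-2 logarithm of 90,332, the node count of the larger Twitter network, which would be consistent with a normalization by log2 N.

\[
15.97 - 2.14 = 13.83 \ \text{bits}
\]

\[
\frac{2.14}{0.13} \approx \frac{15.97}{0.97} \approx 16.46 \approx \log_2 90{,}332
\]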

Experiments

Results: Network Size

• The performance difference between the community-aware and community-blind algorithms is more pronounced when the network is bigger
• With a smaller network, both algorithms perform better at re-identifying users and tolerating noise
• The community-aware approach exhibits a slightly higher error rate in some cases, but most of these occur when the community-blind approach fails completely while the community-aware approach correctly identifies many more users

Experiments

Results: Overlapped Data Set

Experiments

Results

• The community-aware algorithm reduces the degree of anonymity, while the community-blind algorithm fails regardless of the number of seeds (left column)
• The community-blind algorithm fails completely when the noise level is more than 10%, whereas the community-aware algorithm fails when the noise level is more than 30% (right column)

Conclusion

• The approach doesn’t increase time complexity
• It is more robust against noise added to the anonymized data set
• It can perform well with fewer known seeds as well as with larger networks
• It is not tied to any specific algorithm; other community detection methods and community-blind network alignment algorithms could be ‘plugged in’ to the framework

Conclusion

When mapping two networks that are not identical to each other, the community-based mapping algorithm is almost always guaranteed to reduce the anonymity more and find more successful mappings than the community-blind, global mapping algorithm.

Questions & Comments
