![Page 1: REZA ZAFARANI AND HUAN LIU DATA MINING AND MACHINE LEARNING LABORATORY (DMML) ARIZONA STATE UNIVERSITY KDD 2013 – CHICAGO, ILLINOIS](https://reader035.vdocument.in/reader035/viewer/2022062221/56649e415503460f94b32983/html5/thumbnails/1.jpg)
Connecting Users across Social Media Sites: A Behavioral-Modeling Approach
REZA ZAFARANI AND HUAN LIU
DATA MINING AND MACHINE LEARNING LABORATORY (DMML)
ARIZONA STATE UNIVERSITY
KDD 2013 – CHICAGO, ILLINOIS
How hard can it be to identify
an individual across sites?
Privacy Experts Claim Advertisers
Know a lot about People
Can they stop showing you the
same repetitive ads across sites?
More information about individuals
Many social media sites
Partial Information
Complementary Information
Better User Profiles
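Huan Liu's profile fields on two sites (only Google+ is named in the transcript):

| Age | Location | Education |
|-----|----------|-----------|
| N/A | Tempe, AZ | USC |
| N/A | USA | USC (1985-89) |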
Can we connect individuals across sites?
Connectivity is not available
Consistency in Information Availability
Can we verify that the information provided across sites belongs to the same individual?
MOdeling Behavior for Identifying Users across Sites
Human behavior generates Information redundancy
Information shared across sites
provides a behavioral fingerprint
MOBIUS
- Behavioral Modeling
- Minimum Information
Identification Function
Minimum information available on ALL sites:
Usernames
Candidate Username (john.smith)
Prior Usernames ({jsmith, john.s})
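
The identification function can be read as a yes/no decision over a candidate username and a set of prior usernames. A minimal sketch, assuming a pluggable feature extractor and classifier; the names `identification_function`, `toy_features`, and `toy_classifier` are hypothetical and do not come from the slides:

```python
from typing import Callable, Iterable, List


def identification_function(
    candidate: str,
    prior_usernames: Iterable[str],
    extract_features: Callable[[str, List[str]], List[float]],
    classify: Callable[[List[float]], bool],
) -> bool:
    """Return True if the candidate is judged to belong to the owner of the priors."""
    priors = list(prior_usernames)
    features = extract_features(candidate, priors)
    return classify(features)


# Toy stand-ins (assumptions for illustration only, not the learned model).
def toy_features(candidate: str, priors: List[str]) -> List[float]:
    # Single feature: 1.0 if the candidate exactly matches a prior username.
    return [1.0 if candidate in priors else 0.0]


def toy_classifier(features: List[float]) -> bool:
    return features[0] >= 0.5


same_person = identification_function(
    "john.smith", ["jsmith", "john.s"], toy_features, toy_classifier
)
```

With these toy stand-ins the slide's example is rejected, because `john.smith` is not an exact copy of a prior username. Handling that case is what the behavioral features are for.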
[Diagram: Behaviors 1 to n each generate Information Redundancy, which is captured via Feature Sets 1 to n. The feature sets and the Data feed a Learning Framework, which produces the Identification Function.]
Behaviors
- Human Limitation: Time & Memory Limitation, Knowledge Limitation
- Exogenous Factors: Typing Patterns, Language Patterns
- Endogenous Factors: Personal Attributes & Traits, Habits
Using Same Usernames
Username Length
Likelihood
Time and Memory Limitation
59% of individuals use the same
username
[Figure: username length distribution. x-axis: username length (1 to 12); bar values as extracted: 0, 0, 0, 0, 0, 0, 0, 2, 4, 5, 1, 0]
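
The two ideas on this slide, reusing the same username and staying close to one's usual username length, can be sketched as simple features. The slide does not say which statistics were used, so the ones below, and the function name `length_features`, are assumptions for illustration:

```python
from statistics import mean, median, pstdev
from typing import Dict, List


def length_features(candidate: str, priors: List[str]) -> Dict[str, float]:
    """Compare the candidate's length with the lengths of prior usernames.

    Assumes `priors` is non-empty.
    """
    lengths = [len(u) for u in priors]
    return {
        "candidate_length": float(len(candidate)),
        "min_prior_length": float(min(lengths)),
        "max_prior_length": float(max(lengths)),
        "mean_prior_length": float(mean(lengths)),
        "median_prior_length": float(median(lengths)),
        "std_prior_length": float(pstdev(lengths)),
        # Same-username reuse, the behavior the 59% figure refers to.
        "exact_match": 1.0 if candidate in priors else 0.0,
    }


features = length_features("john.smith", ["jsmith", "john.s"])
```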
Limited Vocabulary
Limited Alphabet
Knowledge Limitation
Identifying individuals by their
vocabulary size
Alphabet Size is correlated to
language: शमंत कुमार -> Shamanth Kumar
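
A limited alphabet can be measured directly from the characters a person has already used. A minimal sketch; the helper names are hypothetical, and lowercasing before counting is an assumption rather than something stated on the slide:

```python
from typing import Iterable, Set


def alphabet(usernames: Iterable[str]) -> Set[str]:
    """Set of distinct characters used across the given usernames."""
    return {ch for name in usernames for ch in name.lower()}


def alphabet_size(usernames: Iterable[str]) -> int:
    return len(alphabet(usernames))


def new_characters(candidate: str, priors: Iterable[str]) -> int:
    """Number of characters in the candidate never seen in the prior usernames."""
    return len(alphabet([candidate]) - alphabet(priors))


priors = ["jsmith", "john.s"]
size = alphabet_size(priors)
unseen = new_characters("john.smith", priors)
```

Here `john.smith` introduces no character that is missing from the prior usernames, which is the kind of consistency this behavior relies on.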
Typing Patterns
QWERTY Keyboard Variants: AZERTY, QWERTZ
DVORAK Keyboard
Keyboard type impacts your usernames
QWER1234 AOEUISNTH
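
One way to see how keyboard type shows up in a username is to measure how its characters are spread over keyboard rows. The sketch below is an illustration only: the row tables cover letters and digits, and the feature `row_fractions` is a hypothetical choice, not necessarily the one used in MOBIUS.

```python
from typing import Dict, Set

# Letter and digit rows only; punctuation keys are left out for simplicity.
QWERTY_ROWS: Dict[str, Set[str]] = {
    "number": set("1234567890"),
    "top": set("qwertyuiop"),
    "home": set("asdfghjkl"),
    "bottom": set("zxcvbnm"),
}
DVORAK_ROWS: Dict[str, Set[str]] = {
    "number": set("1234567890"),
    "top": set("pyfgcrl"),
    "home": set("aoeuidhtns"),
    "bottom": set("qjkxbmwvz"),
}


def row_fractions(username: str, layout: Dict[str, Set[str]] = QWERTY_ROWS) -> Dict[str, float]:
    """Fraction of the username's characters that fall on each keyboard row."""
    chars = [ch for ch in username.lower() if any(ch in keys for keys in layout.values())]
    if not chars:
        return {row: 0.0 for row in layout}
    return {row: sum(ch in keys for ch in chars) / len(chars) for row, keys in layout.items()}


qwerty_example = row_fractions("QWER1234", QWERTY_ROWS)
dvorak_example = row_fractions("AOEUISNTH", DVORAK_ROWS)
```

Under these tables, `AOEUISNTH` sits entirely on the Dvorak home row, while on QWERTY the same letters are scattered over three rows.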
Habits - old habits die hard
- Modifying Previous Usernames: Adding Prefixes/Suffixes, Abbreviating, Swapping or Adding/Removing Characters
- Creating Similar Usernames: Nametag and Gateman
- Username Observation Likelihood: Usernames come from a language model
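
Two of these habits lend themselves to simple string measures. As a sketch under assumptions (the slides do not name the measures used): edit distance for modified usernames, and Jensen-Shannon divergence between letter distributions for similar usernames such as "nametag" and "gateman". All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def letter_distribution(s: str) -> Dict[str, float]:
    counts = Counter(s)
    return {ch: n / len(s) for ch, n in counts.items()}


def js_divergence(a: str, b: str) -> float:
    """Jensen-Shannon divergence (base 2) between letter distributions; 0 means identical."""
    p, q = letter_distribution(a), letter_distribution(b)
    total = 0.0
    for ch in set(p) | set(q):
        pi, qi = p.get(ch, 0.0), q.get(ch, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total


suffix_added = edit_distance("jsmith", "jsmith2")   # one character appended
same_letters = js_divergence("nametag", "gateman")  # same letters, different order
```

"nametag" and "gateman" use exactly the same letters, so their letter distributions coincide even though the strings differ in every position.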
Experiment Setup
Data:
200,000 instances (50% class balance)
414 Features
Previous Methods:
1) Zafarani and Liu, 2009
2) Perito et al., 2011
Baselines:
3) Exact Username Match
4) Substring Match
5) Patterns in Letters
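
The first two baselines are easy to state in code. The slides do not define them precisely, so this sketch assumes case-sensitive comparison and treats "substring match" as either username containing the other; the function names are hypothetical.

```python
from typing import Iterable


def exact_username_match(candidate: str, priors: Iterable[str]) -> bool:
    """Baseline: the candidate is identical to one of the prior usernames."""
    return any(candidate == p for p in priors)


def substring_match(candidate: str, priors: Iterable[str]) -> bool:
    """Baseline: the candidate contains a prior username, or the reverse."""
    return any(p and (p in candidate or candidate in p) for p in priors)


priors = ["jsmith", "john.s"]
exact = exact_username_match("john.smith", priors)
partial = substring_match("john.smith", priors)
```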
MOBIUS Performance
[Bar chart, accuracy (%), y-axis 0 to 100]
Exact Username Match: 77
Substring Matching: 63.12
Patterns in Letters: 49.25
Zafarani and Liu: 66
Perito et al.: 77.59
Naïve Bayes: 91.38
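
To make the comparison concrete, the gap between the Naïve Bayes bar and each other bar can be computed in percentage points. The values are those read from the chart on this slide; pairing values with labels in chart order is an assumption about the extracted text.

```python
# Values as read from the "MOBIUS Performance" chart (percent).
scores = {
    "Exact Username Match": 77.0,
    "Substring Matching": 63.12,
    "Patterns in Letters": 49.25,
    "Zafarani and Liu": 66.0,
    "Perito et al.": 77.59,
    "Naive Bayes (MOBIUS)": 91.38,
}

mobius = scores["Naive Bayes (MOBIUS)"]

# Margin of MOBIUS over every other method, in percentage points.
margins = {
    method: round(mobius - value, 2)
    for method, value in scores.items()
    if method != "Naive Bayes (MOBIUS)"
}

closest = min(margins, key=margins.get)
```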
Choice of Learning Algorithm
[Bar chart, accuracy (%), y-axis 89 to 95]
Naïve Bayes: 91.38
J48: 90.87
Random Forest: 93.59
L2-reg L2-Loss SVM: 93.7
L1-reg L2-Loss SVM: 93.71
L2-reg Logistic Regression: 93.77
L1-reg Logistic Regression: 93.8
Diminishing Returns for Adding More Usernames
Conclusions + Future Work
- Discover applications of connecting users across sites
- Information shared across sites acts as a behavioral fingerprint
- Human Behavior Results in Information Redundancy
- Incorporating features indigenous to specific sites
- A methodology for connecting individuals across sites
- A behavioral modeling approach
- Uses minimum information across sites
- Allows for integration of additional behaviors when required