1
Linking Organizational Social Networking Profiles
PROJECT ID: H0791030
JEROME CHENG ZHI KAI (A0080860H)
2
Example: Holiday Inn
TWITTER FACEBOOK
3
Motivation: Individuals
• Want to find profiles, but no one place has them
• Sometimes on company websites, but:
  • No standardized location
  • Not all companies bother
4
5
6
Motivation: Organizations
• Track competitors’ use of social media
• Find imposter profiles
7
Problem Definition
• Input: Organization Name
• System returns Social Profiles
• Each profile is labelled Official, Affiliate, or Unrelated
8
Related Work
• Prior work focuses on deduplicating profiles of individuals
• Still relevant: the profile characteristics that work examines
9
Related Work: Usernames
• Connecting Corresponding Identities across Communities (Zafarani & Liu, 2009)
• Connecting users across social media sites: a behavioral-modeling approach (Zafarani & Liu, 2013)
• Studying User Footprints in Different Online Social Networks (Malhotra et al., 2012)
10
Related Work: Created Content
• Identifying Users Across Social Tagging Systems (Iofciu, Fankhauser, Abel & Bischoff, 2011)
11
Methodology: System Design
1. Input: organization’s name (query)
2. Search Facebook/Twitter APIs, retrieve profiles
3. Convert profiles into feature vectors
4. Classify each feature vector as official, affiliate, or unrelated
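The four steps above can be read as a small pipeline. The sketch below is only an illustration: `link_profiles`, the `Profile` record, and the three toy stand-in functions are hypothetical names, not part of the project, and the real search step would call the Facebook/Twitter APIs rather than return fixed data.

```python
from typing import Callable, Dict, List

Profile = Dict[str, str]       # hypothetical record: username, display_name, description
FeatureVector = List[float]


def link_profiles(
    query: str,
    search: Callable[[str], List[Profile]],
    featurize: Callable[[str, Profile], FeatureVector],
    classify: Callable[[FeatureVector], str],
) -> Dict[str, List[Profile]]:
    """Run the steps in order: query -> candidate profiles -> vectors -> labels."""
    results: Dict[str, List[Profile]] = {"official": [], "affiliate": [], "unrelated": []}
    for profile in search(query):              # step 2: retrieve candidates
        vector = featurize(query, profile)     # step 3: profile -> feature vector
        label = classify(vector)               # step 4: vector -> class label
        results[label].append(profile)
    return results


# Toy stand-ins so the sketch runs without any network access (assumptions).
def toy_search(query: str) -> List[Profile]:
    return [
        {"username": "holidayinn", "display_name": "Holiday Inn", "description": ""},
        {"username": "catlover42", "display_name": "Cat Lover", "description": ""},
    ]


def toy_featurize(query: str, profile: Profile) -> FeatureVector:
    # Single feature: 1.0 when the display name equals the query, ignoring case.
    return [1.0 if profile["display_name"].lower() == query.lower() else 0.0]


def toy_classify(vector: FeatureVector) -> str:
    return "official" if vector[0] == 1.0 else "unrelated"


linked = link_profiles("Holiday Inn", toy_search, toy_featurize, toy_classify)
```

Passing the three stages in as callables keeps the sketch independent of any particular network API or classifier.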
12
Classifier Choice
• Evaluated scikit-learn’s:
  • Decision Tree
  • Naïve Bayes
  • Support Vector Machine
  • Logistic Regression
  • Random Forest
• Features aren’t independent – trees are well-suited
13
Feature Breakdown: Name-based
• Normalized Edit Distance
  • Query to Username
  • Query to Display Name
• Edit Distance
  • Query to Username
  • Query to Display Name
• Length of Query
• Length of Username
• Length of Display Name
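A minimal sketch of the name-based features, assuming Levenshtein distance as the edit distance, normalization by the longer string's length, and case-insensitive comparison. None of these choices is stated on the slide, and the function and key names are illustrative.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row of the DP table at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def name_features(query: str, username: str, display_name: str) -> Dict[str, float]:
    # Lowercasing before comparison is an assumption.
    q, u, d = query.lower(), username.lower(), display_name.lower()
    return {
        "ned_username": normalized_edit_distance(q, u),
        "ned_display_name": normalized_edit_distance(q, d),
        "ed_username": edit_distance(q, u),
        "ed_display_name": edit_distance(q, d),
        "len_query": len(q),
        "len_username": len(u),
        "len_display_name": len(d),
    }


features = name_features("Holiday Inn", "holidayinn", "Holiday Inn")
```

Here the username differs from the query only by the missing space, so its raw edit distance is 1, while the display name matches exactly.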
14
Feature Breakdown: Name-based Quirks
• Need to handle abbreviations and stopwords
  • Citigroup versus Citi, General Motors versus GM
• Take two edit distances: one on the original string, one on the processed string
• Use the better-scoring of the two
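One way to picture the "two distances, keep the better one" idea is sketched below. The preprocessing here (lowercase, drop a small hypothetical stopword list, join the remaining tokens) is an assumption; the slide does not say how the processed string is built, and this sketch does not attempt abbreviation expansion such as GM to General Motors.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "of", "and", "inc", "corp", "co", "ltd"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def normalized(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def preprocess(text: str) -> str:
    """Lowercase, split on non-alphanumerics, drop stopwords, join tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return "".join(t for t in tokens if t not in STOPWORDS)


def best_name_distance(query: str, name: str) -> float:
    """Smaller (better) of the original and processed normalized distances."""
    original = normalized(query.lower(), name.lower())
    processed_query, processed_name = preprocess(query), preprocess(name)
    if not processed_query or not processed_name:
        return original  # nothing left after processing; fall back
    return min(original, normalized(processed_query, processed_name))


original_score = normalized("bank of america", "bankamerica")
best_score = best_name_distance("Bank of America", "bankamerica")
```

In this example the original strings differ by the four characters of " of ", while the processed strings are identical, so the processed distance wins.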
15
Feature Breakdown: Description
• Occurrences of Query
• Cosine Similarity
  • Query and Description
  • DuckDuckGo Description and Profile Description
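The description features can be sketched as follows, assuming raw term counts as the vector weights and a simple alphanumeric tokenizer. The slides do not specify the weighting (TF-IDF would be another option), and the function names are illustrative.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-count vectors."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description shares nothing with anything
    return dot / (norm_a * norm_b)


def query_occurrences(query: str, description: str) -> int:
    """Case-insensitive count of non-overlapping occurrences of the query."""
    return description.lower().count(query.lower())


similarity = cosine_similarity(
    "Holiday Inn hotels",
    "Official account of Holiday Inn hotels worldwide",
)
occurrences = query_occurrences(
    "Holiday Inn", "Welcome to Holiday Inn. Book holiday inn hotels here."
)
```

The same `cosine_similarity` function would serve both comparisons on the slide, since each one compares two short texts.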
16
Feature Breakdown: Language Models
• Construct Bigram Language Model for:
  • Official profile descriptions
  • Affiliate profile descriptions
  • Unrelated profile descriptions
• Probability that candidate description belongs to each
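A minimal sketch of a per-class bigram model, assuming add-one (Laplace) smoothing and log-probabilities as the feature values. The smoothing method, the tokenizer, and the tiny training descriptions are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class BigramModel:
    """Bigram language model with add-one smoothing (an assumed choice)."""

    def __init__(self, descriptions: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab = {"</s>"}
        for text in descriptions:
            tokens = ["<s>"] + tokenize(text) + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, text: str) -> float:
        """Smoothed log-probability of the description under this model."""
        tokens = ["<s>"] + tokenize(text) + ["</s>"]
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        return sum(
            math.log((self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + vocab_size))
            for prev, cur in zip(tokens, tokens[1:])
        )


# Toy training data, invented for this sketch.
official_model = BigramModel([
    "the official twitter account of acme",
    "official page of acme hotels",
])
unrelated_model = BigramModel([
    "i love cats and coffee",
    "just a guy who likes football",
])

candidate = "official account of acme"
official_score = official_model.log_prob(candidate)
unrelated_score = unrelated_model.log_prob(candidate)
```

One score per class model (official, affiliate, unrelated) would then be appended to the feature vector; only two models are built here to keep the example short.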
17
Evaluation: Ground Truth Creation
1. Retrieved organizations from Freebase
2. Searched for profiles on Twitter/Facebook
3. Manually labelled as official/affiliate/unrelated
18
Evaluation: Ground Truth Breakdown
Class        Twitter        Facebook
Official     232 (7%)       146 (4%)
Affiliate    675 (20%)      491 (14%)
Unrelated    2474 (73%)     2776 (81%)
Total        3381 labels    3413 labels
19
Evaluation: Process
• Mainly concerned with official and affiliate classes
  • Not interested in unrelated class
• Modified 10-fold Cross Validation
20
Evaluation: Modified Cross Validation
1. Generate folds as per normal
2. Train classifier on training set as per normal
3. For each affiliate/official profile in test set:
   1. Input organization’s name to system
   2. Count number of correct results
4. Calculate precision/recall/F1 from counts
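The final step can be sketched as below, assuming the counts are pooled over all test queries before the metrics are computed (micro-averaging). The slides do not state how counts are combined, and the function names and profile IDs are illustrative.

```python
from typing import Iterable, Set, Tuple


def prf1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(queries: Iterable[Tuple[Set[str], Set[str]]]) -> Tuple[float, float, float]:
    """Each item is (expected profile IDs, returned profile IDs) for one
    organization name and one class; counts are pooled across queries."""
    tp = fp = fn = 0
    for expected, returned in queries:
        tp += len(expected & returned)   # correct results
        fp += len(returned - expected)   # returned but wrong
        fn += len(expected - returned)   # missed
    return prf1(tp, fp, fn)


precision, recall, f1 = evaluate([
    ({"a", "b"}, {"a", "c"}),   # one hit, one wrong result, one miss
    ({"d"}, {"d"}),             # perfect query
])
```

With these toy counts (2 correct, 1 wrong, 1 missed), precision, recall, and F1 all come out to 2/3.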
21
Evaluation: Baseline
• Normalized Edit Distance: Username/Display Name and Query
• Emulates searching the networks manually without examining each profile in detail
22
Results & Discussion: Twitter
[Bar charts: Baseline vs. Final on F1, Precision, Recall; axis 0.000 to 1.000]
Official
            F1       Precision   Recall
Baseline    0.559    0.716       0.458
Final       0.862    0.947       0.791
Affiliate
Data labels as extracted (assignment to bars not recoverable): 0.713, 0.750, 0.559, 0.905, 0.884, 0.862
23
Results & Discussion: Facebook
[Bar charts: Baseline vs. Final on F1, Precision, Recall; axis 0.000 to 1.000]
Official
            F1       Precision   Recall
Baseline    0.750    0.792       0.711
Final       0.884    0.945       0.830
Affiliate
Data labels as extracted (assignment to bars not recoverable): 0.559, 0.744, 0.480, 0.862, 0.816, 0.639
24
Discussion
• Baseline performs well for official class on Facebook
  • Username and display name alone are good indicators for this class
  • Other features still help, but not as much
25
Discussion: Facebook Characteristics
• Many profile types: people, pages, places, etc.
• Finding official pages is simplified
• But: finding affiliates requires more effort
26
Discussion: Facebook Characteristics
• Facebook doesn’t require pages to specify a “username”
  • Pages without one are identified by an ID instead
• Auto-generated pages also have only IDs, and take their names from Wikipedia or other sources
27
Limitations
• Ground truth proportions: expand and/or balance
• Limited number of profiles retrieved for classification
29
Future Work
• Support additional networks
• Examine post content
• “Preferential” classification